String encoding
These functions allow you to convert arbitrary strings to ANSI strings displayable on your local system.
AnsiStr = StringFromUnicode(UnicodeStr)
AnsiStr = StringFromCodePage(Str, CodePage)
AnsiStr = StringFromUTF8(UTF8Str)
Valid = IsValidAsciizString(NumBytes)
Valid = IsValidUnicodeString(NumBytes)
AnsiStr = StringFromUnicode(UnicodeStr)
This function converts the given Unicode string to the corresponding ANSI string that TempLuator can display properly. Note that the UnicodeStr parameter must be an ordinary Lua string, so to convert a Unicode string returned by the byte reader you have to pass the ReaderData.Data table member, which is itself a Lua string.
If the string passed to this function has an odd number of bytes (recall that each Unicode character consists of two bytes), the last byte is excluded from the conversion. You will get an error if the string is empty or only 1 byte long. Every Unicode character that cannot be converted to an ANSI equivalent is represented as '?'.
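The rules above can be illustrated with a small pure-Lua sketch. It is not TempLuator's actual implementation: the helper name unicode_to_ansi_sketch is hypothetical, the input is assumed to be UTF-16 little-endian, and "ANSI" is approximated by Latin-1, so only characters with codes 0..255 are treated as convertible.

```lua
-- Sketch only: emulates the documented rules, not TempLuator's real code.
-- Assumptions: input is UTF-16 little-endian; "ANSI" is approximated by
-- Latin-1, so only character codes 0..255 are representable.
local function unicode_to_ansi_sketch(s)
  -- An empty or 1-byte input contains no complete character.
  if #s < 2 then
    error("input must contain at least one 2-byte character")
  end
  -- Ignore a trailing odd byte.
  local usable = #s - (#s % 2)
  local out = {}
  for i = 1, usable, 2 do
    local lo, hi = string.byte(s, i, i + 1)
    local code = lo + hi * 256
    if code < 256 then
      out[#out + 1] = string.char(code)
    else
      out[#out + 1] = "?"  -- no ANSI equivalent in this sketch
    end
  end
  return table.concat(out)
end

print(unicode_to_ansi_sketch("H\0i\0"))      -- Hi
print(unicode_to_ansi_sketch("H\0i\0\65"))   -- trailing byte ignored: Hi
print(unicode_to_ansi_sketch("A\0\172\32"))  -- U+20AC has no equivalent: A?
```

The real function depends on the current system code page, so the set of characters replaced by '?' may differ from this approximation.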
Here is an example showing how to read a Unicode string of a given length from the input file, convert it to printable form and display the result:
-- read and display Unicode string of given length (in Unicode characters)
function DisplayUnicode(UniLen)
  -- each Unicode character consists of two bytes
  local NumBytes = UniLen * 2
  -- read Unicode string from the input file
  local ReaderData = ReadBytes(NumBytes)
  -- convert Unicode to ANSI
  local AnsiStr = StringFromUnicode(ReaderData.Data)
  -- print the string
  PrintDoubleQuotedString(AnsiStr)
end

-- the same written in a quite cryptic short form by a care(brain?)less programmer
function DisplayUnicodeBadGuy(UniLen)
  PrintDoubleQuotedString(StringFromUnicode(ReadBytes(UniLen*2).Data))
end
AnsiStr = StringFromCodePage(Str, CodePage)
This function converts the given string from the given code page to the corresponding ANSI string that TempLuator can display properly. Note that the Str parameter must be an ordinary Lua string.
The conversion is performed in two steps. The first step converts the source string from the given code page to a Unicode string. The second step converts this Unicode string into an ANSI string using the current system code page.
Every character of the source string that cannot be converted to an ANSI equivalent is represented as '?'.
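The two-step idea can be sketched in pure Lua for one particular source encoding. This is only an illustration, not the actual implementation: it assumes the source is well-formed UTF-8, approximates "ANSI" by Latin-1, and both helper names (utf8_to_codepoints, codepoints_to_ansi) are hypothetical.

```lua
-- Sketch only: illustrates the two conversion steps, not the real code.
-- Assumptions: well-formed UTF-8 input; "ANSI" approximated by Latin-1.

-- Step 1: decode the source bytes into a list of Unicode code points.
local function utf8_to_codepoints(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = string.byte(s, i)
    local cp, extra
    if b < 0x80 then cp, extra = b, 0
    elseif b < 0xE0 then cp, extra = b - 0xC0, 1
    elseif b < 0xF0 then cp, extra = b - 0xE0, 2
    else cp, extra = b - 0xF0, 3
    end
    for k = 1, extra do
      -- each continuation byte contributes its low 6 bits
      cp = cp * 64 + (string.byte(s, i + k) - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + extra + 1
  end
  return cps
end

-- Step 2: encode the code points in the target character set,
-- substituting '?' for anything that has no equivalent.
local function codepoints_to_ansi(cps)
  local out = {}
  for _, cp in ipairs(cps) do
    out[#out + 1] = cp < 256 and string.char(cp) or "?"
  end
  return table.concat(out)
end

-- "café €" encoded as UTF-8; the euro sign is replaced by '?'
local src = "caf\195\169 \226\130\172"
print(codepoints_to_ansi(utf8_to_codepoints(src)))
```

Going through Unicode as an intermediate form means any source code page can be paired with any target code page without a dedicated conversion table for each pair.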
The mandatory CodePage parameter must be a number identifying the code page used to encode the source string. This number is passed directly to the underlying MultiByteToWideChar() API function as its first parameter; consult the Win32 SDK documentation for details. A few CodePage constants are available to Lua code:
CP_ACP        = 0
CP_OEMCP      = 1
CP_MACCP      = 2
CP_THREAD_ACP = 3
CP_SYMBOL     = 42
CP_KOI8       = 21866
CP_UTF7       = 65000
CP_UTF8       = 65001

However, you can use any numeric CodePage value defined in the Win32 SDK:
437    MS-DOS United States
708    Arabic (ASMO 708)
709    Arabic (ASMO 449+, BCON V4)
710    Arabic (Transparent Arabic)
720    Arabic (Transparent ASMO)
737    Greek (formerly 437G)
775    Baltic
850    MS-DOS Multilingual (Latin I)
852    MS-DOS Slavic (Latin II)
855    IBM Cyrillic (primarily Russian)
857    IBM Turkish
860    MS-DOS Portuguese
861    MS-DOS Icelandic
862    Hebrew
863    MS-DOS Canadian-French
864    Arabic
865    MS-DOS Nordic
866    MS-DOS Russian (former USSR)
869    IBM Modern Greek
874    Thai
932    Japan
936    Chinese (PRC, Singapore)
949    Korean
950    Chinese (Taiwan; Hong Kong SAR, PRC)
1200   Unicode (BMP of ISO 10646)
1250   Windows 3.1 Eastern European
1251   Windows 3.1 Cyrillic
1252   Windows 3.1 Latin 1 (US, Western Europe)
1253   Windows 3.1 Greek
1254   Windows 3.1 Turkish
1255   Hebrew
1256   Arabic
1257   Baltic
1258   Latin 1 (ANSI)
20000  CNS - Taiwan
20001  TCA - Taiwan
20002  Eten - Taiwan
20003  IBM5550 - Taiwan
20004  TeleText - Taiwan
20005  Wang - Taiwan
20127  US ASCII
20261  T.61
20269  ISO-6937
20866  Ukrainian - KOI8-U
21027  Ext Alpha Lowercase
21866  Russian - KOI8
28591  ISO 8859-1 Latin I
28592  ISO 8859-2 Eastern Europe
28593  ISO 8859-3 Turkish
28594  ISO 8859-4 Baltic
28595  ISO 8859-5 Cyrillic
28596  ISO 8859-6 Arabic
28597  ISO 8859-7 Greek
28598  ISO 8859-8 Hebrew
28599  ISO 8859-9 Latin Alphabet No.5
29001  Europa 3
1361   Korean (Johab)
AnsiStr = StringFromUTF8(UTF8Str)
This function is simply a convenient shorthand for the frequently used UTF-8 encoding and is defined as follows:
function StringFromUTF8(Str)
  return StringFromCodePage(Str, CP_UTF8)
end
Valid = IsValidAsciizString(NumBytes)
Valid = IsValidUnicodeString(NumBytes)
These two auxiliary functions let you check whether a string (zero-terminated ASCII or Unicode, depending on the function) at the current file position is valid. Both functions expect a single parameter giving the length of the string in bytes (even for the Unicode variant). Both return true if the string is valid and false otherwise. These functions do not change the current file position.
The string is considered valid if it is zero-terminated and there are no non-zero characters after the first zero character. Valid strings can be defined as standard C strings without any complications (that is, without escape sequences in the middle of the string definition). Also, for the Unicode version the string is definitely invalid if it consists of an odd number of bytes.
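The validity rule can be restated as a short pure-Lua sketch that works on an in-memory Lua string. The real functions inspect the input file at the current position instead; the helper names below are hypothetical, and the Unicode variant assumes two-byte characters whose terminator is a pair of zero bytes.

```lua
-- Sketch only: restates the validity rule over a Lua string.

-- Valid if a zero byte exists and only zero bytes follow the first one.
local function is_valid_asciiz_sketch(s)
  local seen_zero = false
  for i = 1, #s do
    if string.byte(s, i) == 0 then
      seen_zero = true
    elseif seen_zero then
      return false  -- non-zero data after the terminator
    end
  end
  return seen_zero  -- false when there is no terminator at all
end

-- Same rule applied to 2-byte characters; an odd length is always invalid.
local function is_valid_unicode_sketch(s)
  if #s % 2 ~= 0 then return false end
  local seen_zero = false
  for i = 1, #s, 2 do
    local lo, hi = string.byte(s, i, i + 1)
    if lo == 0 and hi == 0 then
      seen_zero = true
    elseif seen_zero then
      return false
    end
  end
  return seen_zero
end

print(is_valid_asciiz_sketch("abc\0\0"))   -- true
print(is_valid_asciiz_sketch("abc"))       -- false (no terminator)
print(is_valid_asciiz_sketch("12\0003"))   -- false ('3' follows the zero)
print(is_valid_unicode_sketch("a\0\0\0"))  -- true
print(is_valid_unicode_sketch("a\0\0"))    -- false (odd number of bytes)
```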
Here is an example of how these functions may be used:
if IsValidAsciizString(Len) then
  -- OK, the string may be represented as a normal C string ("...")
  char__("Data", nil, Len)
else
  -- oops, there is something wrong with the string. We had better type it
  -- as an array of chars ("{ '1', '2', '\x00', '3' }")
  char("Data", nil, Len)
end