String encoding

These functions allow you to convert arbitrary strings to ANSI strings displayable on your local system.

AnsiStr = StringFromUnicode(UnicodeStr)
AnsiStr = StringFromCodePage(Str, CodePage)
AnsiStr = StringFromUTF8(UTF8Str)

Valid = IsValidAsciizString(NumBytes)
Valid = IsValidUnicodeString(NumBytes)

AnsiStr = StringFromUnicode(UnicodeStr)

This function converts the given Unicode string to the corresponding ANSI string that TempLuator can display properly. Note that the UnicodeStr parameter must be an ordinary Lua string, so to convert a Unicode string returned by the byte reader you have to pass the ReaderData.Data table member, which is a Lua string.

If the string passed to this function has an odd number of bytes (recall that each Unicode character consists of two bytes), the last byte is excluded from the conversion. You will get an error if the string is empty or only 1 byte long. Any Unicode character that cannot be converted to an ANSI equivalent is shown as '?'.
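
The length rules above can be checked before StringFromUnicode is called. The following is a minimal sketch in plain Lua; ConvertibleUnicodeBytes is a hypothetical helper name, not a TempLuator function, and it only restates the rules described in this section:

```lua
-- Hypothetical helper (not part of TempLuator): returns the number of bytes
-- that StringFromUnicode would convert, or nil if the call would fail.
function ConvertibleUnicodeBytes(Data)
    local Len = #Data

    -- an empty or 1-byte string is reported as an error by StringFromUnicode
    if Len < 2 then
        return nil
    end

    -- a trailing odd byte is excluded from the conversion
    return Len - (Len % 2)
end
```

A script could, for instance, call StringFromUnicode(ReaderData.Data) only when ConvertibleUnicodeBytes(ReaderData.Data) does not return nil.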

Here is an example showing how to read a Unicode string of a given length from the input file, convert it to a printable form and display the result:

-- read and display Unicode string of given length (in Unicode characters)
function DisplayUnicode(UniLen)
    -- each Unicode character consists of two bytes
    local NumBytes = UniLen * 2
 
    -- read Unicode string from the input file
    local ReaderData = ReadBytes(NumBytes)
 
    -- convert Unicode to ANSI
    local AnsiStr = StringFromUnicode(ReaderData.Data)
 
    -- print the string
    PrintDoubleQuotedString(AnsiStr)
end


-- the same written in a quite cryptic short form by a care(brain?)less programmer
function DisplayUnicodeBadGuy(UniLen)
    PrintDoubleQuotedString(StringFromUnicode(ReadBytes(UniLen*2).Data))
end

AnsiStr = StringFromCodePage(Str, CodePage)

This function converts the given string from the given code page to the corresponding ANSI string that TempLuator can display properly. Note that the Str parameter must be an ordinary Lua string.

The conversion is performed in two steps. The first step converts the source string from the given code page to a Unicode string. The second step converts this Unicode string into an ANSI string using the current system code page.

Any character of the source string that cannot be converted to an ANSI equivalent is shown as '?'.

The mandatory CodePage parameter must be a number identifying the code page in which the source string is encoded. This number is passed directly to the underlying MultiByteToWideChar() API function as its first parameter; please consult your Win32 SDK documentation for details. A few CodePage constants are available to Lua code:

CP_ACP          =   0
CP_OEMCP        =   1
CP_MACCP        =   2
CP_THREAD_ACP   =   3
CP_SYMBOL       =   42
CP_KOI8         =   21866
CP_UTF7         =   65000
CP_UTF8         =   65001

However, you can use any numeric CodePage value defined in the Win32 SDK:

437 MS-DOS United States
708 Arabic (ASMO 708)
709 Arabic (ASMO 449+, BCON V4)
710 Arabic (Transparent Arabic)
720 Arabic (Transparent ASMO)
737 Greek (formerly 437G)
775 Baltic
850 MS-DOS Multilingual (Latin I)
852 MS-DOS Slavic (Latin II)
855 IBM Cyrillic (primarily Russian)
857 IBM Turkish
860 MS-DOS Portuguese
861 MS-DOS Icelandic
862 Hebrew
863 MS-DOS Canadian-French
864 Arabic
865 MS-DOS Nordic
866 MS-DOS Russian (former USSR)
869 IBM Modern Greek
874 Thai
932 Japan
936 Chinese (PRC, Singapore)
949 Korean
950 Chinese (Taiwan; Hong Kong SAR, PRC)
1200 Unicode (BMP of ISO 10646)
1250 Windows 3.1 Eastern European
1251 Windows 3.1 Cyrillic
1252 Windows 3.1 Latin 1 (US, Western Europe)
1253 Windows 3.1 Greek
1254 Windows 3.1 Turkish
1255 Hebrew
1256 Arabic
1257 Baltic
1258 Latin 1 (ANSI)
20000 CNS - Taiwan
20001 TCA - Taiwan
20002 Eten - Taiwan
20003 IBM5550 - Taiwan
20004 TeleText - Taiwan
20005 Wang - Taiwan
20127 US ASCII
20261 T.61
20269 ISO-6937
20866 Russian - KOI8
21027 Ext Alpha Lowercase
21866 Ukrainian - KOI8-U
28591 ISO 8859-1 Latin I
28592 ISO 8859-2 Eastern Europe
28593 ISO 8859-3 Turkish
28594 ISO 8859-4 Baltic
28595 ISO 8859-5 Cyrillic
28596 ISO 8859-6 Arabic
28597 ISO 8859-7 Greek
28598 ISO 8859-8 Hebrew
28599 ISO 8859-9 Latin Alphabet No.5
29001 Europa 3
1361 Korean (Johab)
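
As an illustration, a string stored in one of these code pages could be read and displayed along the following lines. This is only a sketch: DisplayCodePageString is a hypothetical name, and the function simply combines ReadBytes, StringFromCodePage and PrintDoubleQuotedString in the same way as the Unicode example earlier in this section:

```lua
-- read NumBytes bytes encoded in the given code page and display them
-- (DisplayCodePageString is a hypothetical name used for illustration)
function DisplayCodePageString(NumBytes, CodePage)
    -- read the raw bytes from the input file
    local ReaderData = ReadBytes(NumBytes)

    -- convert from the source code page to ANSI
    local AnsiStr = StringFromCodePage(ReaderData.Data, CodePage)

    -- print the string
    PrintDoubleQuotedString(AnsiStr)
end
```

With this in place, DisplayCodePageString(16, 866) would treat the next 16 bytes as MS-DOS Russian text, and DisplayCodePageString(16, CP_OEMCP) would use the OEM code page constant listed above.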

AnsiStr = StringFromUTF8(UTF8Str)

This function is just a convenient shorthand for the frequently used UTF-8 encoding and is defined as follows:

function StringFromUTF8(Str)
    return StringFromCodePage(Str, CP_UTF8)
end

Valid = IsValidAsciizString(NumBytes)
Valid = IsValidUnicodeString(NumBytes)

These two auxiliary functions let you check whether a string (zero-terminated ASCII or Unicode, depending on the function) at the current file position is valid. Both functions expect a single parameter that gives the length of the string in bytes (even for the Unicode variant). Both return true if the string is valid and false otherwise. These functions do not change the current file position.

The string is considered valid if it is zero-terminated and there are no non-zero characters after the first zero character. Valid strings can be written as standard C strings without any complications (that is, without escape sequences in the middle of the string definition). In addition, for the Unicode version the string is always invalid if it consists of an odd number of bytes.
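
The validity rule can be modelled in plain Lua on a string that already holds the raw bytes. This is only a sketch of the rule as stated above, not the actual implementation: ModelIsValidAsciiz and ModelIsValidUnicode are hypothetical names, and the Unicode variant assumes that a zero character is a pair of zero bytes:

```lua
-- Hypothetical model of IsValidAsciizString, operating on a Lua string
function ModelIsValidAsciiz(Data)
    local SeenZero = false
    for i = 1, #Data do
        local IsZero = (string.byte(Data, i) == 0)
        if SeenZero and not IsZero then
            return false            -- non-zero byte after the first zero
        end
        SeenZero = SeenZero or IsZero
    end
    return SeenZero                 -- there must be a terminating zero
end

-- Hypothetical model of IsValidUnicodeString (two bytes per character)
function ModelIsValidUnicode(Data)
    if #Data % 2 ~= 0 then
        return false                -- odd number of bytes is always invalid
    end
    local SeenZero = false
    for i = 1, #Data, 2 do
        local Lo, Hi = string.byte(Data, i, i + 1)
        local IsZero = (Lo == 0 and Hi == 0)
        if SeenZero and not IsZero then
            return false            -- non-zero character after the first zero
        end
        SeenZero = SeenZero or IsZero
    end
    return SeenZero
end
```

Under this model, "abc" followed by two zero bytes is valid, while a buffer with a non-zero byte after the first zero is not.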

Here is an example of how these functions may be used:

if IsValidAsciizString(Len) then
    -- OK, the string may be represented as normal C string ("...")
    char__("Data", nil, Len)
else
    -- oops, there is something wrong with the string. We better type it
    -- as an array of chars ("{ '1', '2', '\x00', '3' }")
    char("Data", nil, Len)
end