This is an open-ended article about Unicode in KaM Remake. Simple as it sounds, Unicode is better to have than not. However, there are a few complications that need to be sorted out, and YOUR opinion might help.
Unicode is a way of encoding international characters that allows all the world’s letters to be stored in a common format. When we say that we are adding Unicode to KaM Remake, it means:
– we replace the older ANSI text encoding, which has limitations, with a much more versatile format;
– we can add languages that have more than 256 characters. For example, an ANSI codepage cannot fit Chinese or Japanese, and there are many Chinese players in KaM Remake;
– players could see characters from different languages next to each other in multiplayer chat. With ANSI this is not possible, because for each player the whole game runs under the one codepage he picked in the options. If that codepage does not match the codepage of the player he is playing with, they can’t communicate even if they know each other’s languages, because the letter glyphs are taken from the wrong codepage.
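The codepage mismatch from the last point is easy to reproduce. The sketch below is in Python rather than the game's own Delphi code, and the two codepages (Windows-1251 for the sender, Windows-1250 for the receiver) are assumptions picked only for illustration.

```python
# Illustration only: the codepages below are assumptions, not KaM Remake settings.
SENDER_CODEPAGE = "cp1251"    # e.g. a player who picked a Cyrillic codepage
RECEIVER_CODEPAGE = "cp1250"  # e.g. a player who picked a Central European one


def send_ansi(text: str, codepage: str) -> bytes:
    """Encode chat text the way an ANSI client would: one byte per letter."""
    return text.encode(codepage)


def receive_ansi(data: bytes, codepage: str) -> str:
    """Decode the bytes with the receiver's own codepage, right or wrong."""
    return data.decode(codepage, errors="replace")


message = "Привет"
wire_bytes = send_ansi(message, SENDER_CODEPAGE)

same_codepage = receive_ansi(wire_bytes, SENDER_CODEPAGE)
other_codepage = receive_ansi(wire_bytes, RECEIVER_CODEPAGE)

print(same_codepage)   # readable: both sides agree on the codepage
print(other_codepage)  # unreadable: same bytes, glyphs from the wrong codepage
```

The bytes on the wire are identical in both cases; only the receiver's interpretation differs, which is why the problem cannot be fixed on the receiving side without knowing the sender's codepage.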
However, we cannot simply convert everything from ANSI to Unicode, because some third-party libraries work only with ANSI and because ANSI should still be used for backwards compatibility. Also, being simpler, ANSI is faster to process. A small example: in ANSI each letter always takes exactly 1 byte, but in Unicode (UTF-16) a letter may take 2 or 4 bytes, so the number of letters in a text cannot be known from its size in bytes without checking all of its letters first.
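To make the length example concrete, here is a minimal Python sketch (not the game's code). The helper `count_characters_utf16` is a hypothetical name; it counts letters in UTF-16 data the only way possible, by walking over every code unit and checking for surrogate pairs.

```python
def count_characters_utf16(data: bytes) -> int:
    """Count code points in UTF-16-LE data by scanning every code unit."""
    count = 0
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "little")
        # A high surrogate (0xD800..0xDBFF) starts a 4-byte pair.
        i += 4 if 0xD800 <= unit <= 0xDBFF else 2
        count += 1
    return count


# ANSI: the byte count is the letter count.
ansi_text = "Knight".encode("cp1252")
print(len(ansi_text))

# UTF-16: "A" and U+4E2D take 2 bytes each, U+20000 takes 4 bytes.
utf16_text = "A\u4e2d\U00020000".encode("utf-16-le")
print(len(utf16_text))                     # size in bytes
print(count_characters_utf16(utf16_text))  # letters, known only after the scan
```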
So, to make it work right, we need a plan that clearly separates the ANSI and Unicode areas. After all, we would not want to get into trouble when text is saved as ANSI and read as Unicode; that would produce garbage.
This picture illustrates the various text input locations and how we plan to deal with them:
First and topmost comes the generic string type, which means that the programmer does not care how exactly the string is stored. Since we are dealing with serious stuff, we need to be in control. Using the generic string type is unsafe because it differs between Delphi versions and Lazarus: early Delphi used AnsiString and later switched to UTF-16, whereas Lazarus uses UTF-8. Simply put, these types are stored differently in memory and need different handling.
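A rough way to see how different these storage formats are is to encode one word three ways. This is a Python sketch, not Delphi or Lazarus code, and codepage 1252 is assumed for the ANSI case.

```python
text = "Müller"

# Three layouts the same text can end up with, depending on what the
# compiler's generic string type happens to be (cp1252 is an assumption).
as_ansi = text.encode("cp1252")      # 1 byte per letter
as_utf16 = text.encode("utf-16-le")  # 2 bytes per letter for this word
as_utf8 = text.encode("utf-8")       # 1 byte for ASCII letters, 2 for "ü"

for name, data in (("ANSI", as_ansi), ("UTF-16", as_utf16), ("UTF-8", as_utf8)):
    print(f"{name:7} {len(data):2} bytes  {data.hex(' ')}")
```

Code that indexes or measures the raw bytes would behave differently for each of the three, which is the reason for naming the string type explicitly instead of relying on the generic one.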
Most of the text that gets into KaM Remake can be split fairly easily into Unicode and ANSI. Unicode means we get all the benefits of internationalization, and ANSI means we keep backwards compatibility, smaller size and simplicity. Simplicity is not the perfect word in this context; it means avoiding unnecessary trouble. For example, we want all multiplayer players to be able to input a password to enter the lobby, so we allow only Latin passwords (anyone can input Latin, right?).
For now, for backwards compatibility (this may change soon), locale texts are stored in ANSI, with the codepage derived from the file extension. Later we will change them to Unicode. We have already had a number of occasions when translators saved locale texts in the wrong codepage, which led to some letters being garbled.
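Deriving the codepage from the file name could look roughly like the Python sketch below. The file name pattern, the locale tags and the `LOCALE_CODEPAGES` table are all hypothetical and only illustrate the idea; they are not taken from the game's actual code.

```python
# Hypothetical table: locale tag in the file name -> ANSI codepage.
LOCALE_CODEPAGES = {
    "eng": "cp1252",
    "pol": "cp1250",
    "rus": "cp1251",
}


def codepage_for(filename: str) -> str:
    """Pick the codepage from a name shaped like 'text.<locale>.<ext>'."""
    parts = filename.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"no locale tag in file name: {filename!r}")
    locale = parts[1]
    if locale not in LOCALE_CODEPAGES:
        raise ValueError(f"unknown locale tag: {locale!r}")
    return LOCALE_CODEPAGES[locale]


def load_locale_text(filename: str, raw: bytes) -> str:
    """Decode ANSI locale bytes into Unicode using the file name's codepage."""
    return raw.decode(codepage_for(filename), errors="replace")


raw = "Рыцарь".encode("cp1251")
print(load_locale_text("text.rus.libx", raw))  # decoded with the matching codepage
```

The weak point is visible in the sketch: nothing in the bytes themselves says which codepage they were saved in, so a file saved in the wrong codepage still decodes without any error and simply shows the wrong letters.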
Scripts are in ANSI because they don’t need any text to be stored in them (except maybe comments, but it is fine for those to be unreadable 😉).
File paths should be Unicode since the OS handles them that way. The game could be installed into a folder named in Cyrillic with a sub-folder named in Czech, or just in Chinese.
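A quick Python check shows why a mixed path like that cannot live in any single ANSI codepage. The install path and the helper `fits_codepage` are made up for the illustration.

```python
def fits_codepage(path: str, codepage: str) -> bool:
    """True if every character of the path exists in the given encoding."""
    try:
        path.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical install path: a Cyrillic folder with a Czech sub-folder.
install_path = "C:/Игры/Rytíři a kupci/KaM Remake"

for codepage in ("cp1251", "cp1250", "utf-8", "utf-16-le"):
    print(codepage, fits_codepage(install_path, codepage))
```

The Cyrillic codepage has no Czech letters and the Central European one has no Cyrillic letters, so only the Unicode encodings can hold the whole path.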
UI text and chat texts are obviously Unicode.
Now come the tricky ones that actually made me write all this. There are a couple of edge cases that have both pros and cons in each area:
Player names:
Unicode: players like their names to be in their native locale
ANSI: other players may have trouble addressing players whose names they can’t type (e.g. Chinese)
Passwords:
Unicode: no good reasons, actually
ANSI: no matter the spoken language, all players need to be able to type the password. The single common alphabet is Latin, so we can just restrict passwords to 0..9, A..Z, a..z
Mission names – currently they are not localized; they are named after the folder they are put into.
Unicode: do we need mission names in native locales? Would that not cause confusion if we have Cyrillic, Chinese and Polish missions in one list?
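The password restriction mentioned among the edge cases above (digits and Latin letters only) could be sketched in Python like this. The function name and the rule that an empty password is rejected are assumptions made for the example.

```python
import string

# Characters any keyboard layout can be expected to produce: 0-9, A-Z, a-z.
ALLOWED_PASSWORD_CHARS = frozenset(string.digits + string.ascii_letters)


def is_valid_password(password: str) -> bool:
    """Accept only non-empty passwords made of Latin letters and digits."""
    return bool(password) and all(ch in ALLOWED_PASSWORD_CHARS for ch in password)


print(is_valid_password("Knight42"))  # Latin letters and digits only
print(is_valid_password("Рыцарь"))    # Cyrillic letters are rejected
```

An explicit whitelist is used on purpose: a generic "is this a letter or digit" check would typically also accept Cyrillic or Chinese characters, which is exactly what the restriction is meant to rule out.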
If you have read this far, you have probably got hold of the general idea and the uneasy choice between Unicode and ANSI presented above. Please share your opinion in the comments or in this forum post: https://www.knightsandmerchants.net/forum/viewtopic.php?f=22&t=1746