Adding Unicode support

This is an open-ended article about Unicode in KaM Remake. Simple as it sounds, Unicode is better to have than not. However, there are a few complications that need to be sorted out, and YOUR opinion might help.

Unicode is a way of encoding international characters that allows all the world’s letters to be stored in a common format. When we say that we are adding Unicode to KaM Remake, it means:
– we replace the older ANSI text encoding, which has limitations, with a much more versatile format;
– we can add languages that have more than 256 characters. For example, ANSI encoding cannot fit Chinese or Japanese. There are many Chinese players in KaM Remake;
– players could see characters from different languages next to each other in multiplayer chat. With ANSI this is not possible, because for each player the whole game runs under the one codepage he picked in the options. If that codepage does not match that of another player he is playing with, they can’t communicate even if they know each other’s languages, because the letter glyphs are taken from the wrong codepages.
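
The codepage problem from the last point can be reproduced in a few lines. This is only a sketch in Python, not KaM Remake code: the helper name and the two codepages (cp1251 for the sender, cp1250 for the receiver) are assumptions picked for illustration.

```python
# Sketch: the same ANSI bytes shown under two different codepages.
# The helper name and both codepage choices are assumptions for this example.

def show_under_codepage(raw: bytes, codepage: str) -> str:
    """Decode single-byte ANSI text the way a client locked to one codepage would."""
    # errors="replace" substitutes U+FFFD for bytes the codepage does not define.
    return raw.decode(codepage, errors="replace")


message = "Привет"                      # typed by a player using a Cyrillic codepage
ansi_bytes = message.encode("cp1251")   # one byte per letter is sent

seen_by_sender = show_under_codepage(ansi_bytes, "cp1251")    # the original text
seen_by_receiver = show_under_codepage(ansi_bytes, "cp1250")  # wrong glyphs

# With a Unicode encoding (UTF-8 here) no codepage has to be agreed on.
seen_via_unicode = message.encode("utf-8").decode("utf-8")

print(seen_by_sender, seen_by_receiver, seen_via_unicode)
```

The bytes are identical in both ANSI cases; only the table used to turn them into glyphs differs, which is why the receiver sees unrelated letters.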

[Image: unicode_arial]

However, we cannot simply convert everything from ANSI to Unicode, because some third-party libraries work only with ANSI and because ANSI should still be used for backwards compatibility. Also, being simpler, ANSI is faster to process. A small example: in ANSI each letter always takes exactly 1 byte, but in Unicode (UTF-16) a letter may take 2 or 4 bytes, so the number of letters in a text cannot be known without scanning the whole text first.
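
To make the size difference concrete, here is a small Python sketch (not Delphi, and not taken from the game). The function names are made up for this example, and the counting routine assumes well-formed UTF-16 in little-endian byte order.

```python
# Sketch: byte sizes in ANSI vs UTF-16, and why counting letters needs a scan.
# All function names here are hypothetical.

def ansi_byte_length(text: str, codepage: str = "cp1252") -> int:
    """Single-byte codepage: the byte count equals the letter count."""
    return len(text.encode(codepage))


def utf16_byte_length(text: str) -> int:
    """UTF-16: 2 bytes per code unit, and some letters need two code units."""
    return len(text.encode("utf-16-le"))


def count_letters_utf16(data: bytes) -> int:
    """Count code points in well-formed UTF-16-LE data by visiting every code unit."""
    count = 0
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "little")
        # 0xD800..0xDBFF is a high surrogate: this letter occupies 4 bytes.
        i += 4 if 0xD800 <= unit <= 0xDBFF else 2
        count += 1
    return count


rare_han = "\U00020000"                 # a CJK character outside the 2-byte range
mixed = ("a" + rare_han + "b").encode("utf-16-le")
print(ansi_byte_length("Knight"), utf16_byte_length("Knight"))
print(len(mixed), count_letters_utf16(mixed))
```

The byte length of the UTF-16 buffer is known up front; it is the letter count that requires walking the data.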

So, to make it work right, we need a plan that clearly separates the ANSI and Unicode areas. After all, we would not want to get into trouble when text is saved as ANSI and read as Unicode, which would produce garbage.

This picture illustrates the various text input locations and how we plan to deal with them:

[Image: String types lookup]

First and topmost comes the generic string type, which means that the programmer does not care how exactly the string is stored. But since we are dealing with serious stuff, we need to be in control. Using the generic string type is unsafe because it differs between Delphi versions and Lazarus. Early Delphi used AnsiString and later switched to UTF-16, whereas Lazarus uses UTF-8. Simply put, these types are stored differently in memory and need different handling.
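
As a rough illustration of "stored differently in memory", the Python sketch below encodes one word in three ways. The labels only loosely mirror the three string flavours named above, and cp1252 stands in for "ANSI" as an assumption; this is not how Delphi or Lazarus are implemented internally.

```python
# Sketch: one word, three in-memory layouts. cp1252 is an assumed ANSI codepage.
text = "Füße"

layouts = {
    "ANSI (cp1252)": text.encode("cp1252"),    # 1 byte per letter
    "UTF-16": text.encode("utf-16-le"),        # 2 bytes per letter here
    "UTF-8": text.encode("utf-8"),             # 1 byte for F and e, 2 for ü and ß
}

for name, raw in layouts.items():
    print(f"{name:14} {len(raw)} bytes  {raw.hex(' ')}")

# Handling one layout as if it were another silently corrupts the text:
misread = layouts["UTF-8"].decode("cp1252")
print(misread)
```

Code that assumes one of these layouts, for example by indexing bytes to reach the n-th letter, only works for the layout it was written for.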

Most of the text that gets into KaM Remake can be split fairly easily into Unicode and ANSI. Unicode means we get all the benefits of internationalization, and ANSI means we keep backwards compatibility, smaller size and simplicity. Simplicity is not a perfect word in this context; it means avoiding unnecessary trouble. For example, we want all multiplayer players to be able to input a password to enter the lobby, so we allow only Latin passwords (anyone can input Latin, right?).

For now, for backwards compatibility (this may change soon), locale texts are stored in ANSI, with the codepage derived from the file extension. Later we will change them to Unicode. We have already had a number of occasions when translators saved locale texts in the wrong codepage, which left some letters garbled.
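
A minimal Python sketch of "codepage derived from the file extension" could look like this. The file names, the extension-to-codepage table and both function names are hypothetical; the real game's naming scheme and table may differ.

```python
# Sketch: pick the codepage for a locale file from its extension.
# File names, the mapping and the function names are all hypothetical.
from pathlib import PurePath

LOCALE_CODEPAGES = {
    ".eng": "cp1252",   # Western European
    ".pol": "cp1250",   # Central European
    ".rus": "cp1251",   # Cyrillic
}


def codepage_for(filename: str) -> str:
    """Return the codepage implied by the file extension, or raise ValueError."""
    extension = PurePath(filename).suffix.lower()
    if extension not in LOCALE_CODEPAGES:
        raise ValueError(f"no codepage known for {filename!r}")
    return LOCALE_CODEPAGES[extension]


def decode_locale_text(filename: str, raw: bytes) -> str:
    """Turn the ANSI bytes of a locale file into a Unicode string."""
    return raw.decode(codepage_for(filename))


saved_by_translator = "Замок".encode("cp1251")
print(decode_locale_text("text.rus", saved_by_translator))   # decoded as intended
print(decode_locale_text("text.pol", saved_by_translator))   # wrong table, garbled
```

The second call mirrors the translator mistake described above: nothing in the bytes themselves says which codepage they were saved in, so a mismatch goes undetected until someone reads the text.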

Scripts are in ANSI because they don’t need any text to be stored in them (except maybe comments, but it is fine for those to be unreadable 😉).

File paths should be Unicode since the OS handles them that way: the game could be installed into a folder named in Cyrillic with a sub-folder named in Czech, or just in Chinese.

UI text and chat texts are obviously Unicode.

Now come the tricky ones that actually made me write all this. There are a couple of edge cases that have both pros and cons in each area:

Player names
Unicode: players like their names to be in their native locale
ANSI: other players may have trouble addressing players whose names they can’t type (e.g. Chinese)
Lobby passwords
Unicode: no good reasons, actually
ANSI: no matter the spoken language, all players need to be able to type the password. The single common locale is Latin, so we can just restrict passwords to 0..9, A..z
Mission names – currently they are not localized; they are named after the folder they are put into.
Unicode: do we need mission names in native locales? Would that not cause confusion if we have Cyrillic, Chinese and Polish missions in one list?
ANSI: simplicity
ANSI: simplicity
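
If the Latin-only restriction wins, the check itself is tiny. The Python sketch below reads "0..9, A..z" as digits plus the letters A to Z and a to z; that reading is an assumption (taken literally, the ASCII range from A to z also contains a few punctuation characters), and the function name is made up.

```python
# Sketch: accept only digits and basic Latin letters.
# Reading "0..9, A..z" as digits + A-Z + a-z is an assumption of this example.
import string

ALLOWED = frozenset(string.digits + string.ascii_letters)


def is_numeric_latin(text: str) -> bool:
    """True if text is non-empty and made only of 0-9, A-Z and a-z."""
    return len(text) > 0 and all(ch in ALLOWED for ch in text)


print(is_numeric_latin("Knight42"))     # digits and Latin letters only
print(is_numeric_latin("пароль"))       # Cyrillic is rejected
print(is_numeric_latin("S\u0410M"))     # Latin S and M around a Cyrillic А
```

The same check would apply to player names or mission names if they end up restricted too, and it also rejects look-alike letters from other alphabets.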

If you have read this far, you have probably got a hold of the general idea and of the uneasy choice between Unicode and ANSI presented above. Please share your opinion in the comments or in this forum post: https://www.knightsandmerchants.net/forum/viewtopic.php?f=22&t=1746

This entry was posted in Development.

4 Responses to Adding Unicode support

  1. Siegfried says:

    There is one topic so closely related to this that you have to consider it in the decision. And that’s the font.

    Sure, it would be nice to have Unicode support, but if you’re using the current pixel font, you don’t have the characters. This pretty much limits the use of Unicode. I know you’re looking for a better font; did you already find a good one?

    Using Unicode for player names has one additional disadvantage. The name may be displayed correctly, the way the player intended, but it’s still impossible for other players to address him (as long as there is no copy & paste), because they can’t know how to type the letters. Imagine the German ‘ß’. It’s just not part of any keyboard layout except the German one. So how would you Russians type this? Or the Spanish ‘ñ’. Western countries don’t have Cyrillic input, Russians probably don’t have Latin ones. No one has Chinese.

    If you continue working with ANSI and the correct codepage, it might be displayed ‘false’, but still you’re able to type the name. It may look weird if you have all those strange letters, but it will be displayed correctly at the other player’s side.

    • Krom says:

      I have made a tool that collects all existing KaM ANSI fonts into one UnicodeFont, except for Chinese so far. There’s also a built-in font generator for system fonts (see Arial on the screen above). We might replace the chat font with this one, for it is much easier to read, especially when it comes to hieroglyphs.

      The more comments we get the more I think that these 3 areas (names, passwords, titles) should stay Numeric-Latin (0..9, A..z).

      • Siegfried says:

        If you have the characters, then maybe it would be a good idea to use Unicode in the chat area. But you’d need to find a way to insert player names into the chat.

        Off the top of my head, I can think of two quick methods.
        The first one would introduce shortcuts, such as \1 for the name of player 1 …
        The other one would go as follows: you click on a player’s name in the lobby and his nickname is inserted into the chat area.

        • Krom says:

          If we limit player names to Latin only, it would solve all these issues: anyone will be able to type them, and we will avoid cases where SAM and SAM are different nicknames because the latter is written using the Cyrillic letter A in the middle.
