Here's a good, simplified technical summary of what Unicode is:
"Unicode is a character standard to represent the alphabets of all the languages of the world. Normally ASCII character codes are used in other languages because these characters are represented by just one byte -- that's why the range of these characters is only 0-255. We can represent 256 characters/symbols etc. in ASCII. But Unicode is a character coding based on two bytes -- that's why the range is 0-65535, meaning 65536 values. So we can represent all the language alphabets that exist in this world, because the 65536 range is sufficient to accommodate the characters of most common languages around the world."
In fact, one of the things with text files is that they have to be ANSI; if they are UTF-8, as some were found to be when some GOF modders saved files, they don't work correctly. For cases where someone wants/needs UTF-8 on a certain text file, I had to build in a special flag placed at the top, so that if it is present, the file gets converted.
The special flag you are referring to here, Chez, is the Unicode byte order mark (BOM) -- it makes for a properly formatted Unicode-encoded document, and many editors used to write it into every Unicode file. In recent times, some text editors omit the BOM and infer a Unicode encoding simply from the structure of the document -- since Unicode has become the mainstream standard, and the BOM was intended to be optional by design.
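To make the BOM concrete, here is a small Python sketch (Python is only my choice for illustration, it has nothing to do with the engine) showing the same text saved as UTF-8 with and without the mark. The sample word is made up; the codec names and the `codecs.BOM_UTF8` constant are standard Python.

```python
import codecs

text = "Señor"  # hypothetical sample string with one non-ASCII character

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with the BOM prepended

print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
print(plain)            # b'Se\xc3\xb1or'
print(with_bom)         # b'\xef\xbb\xbfSe\xc3\xb1or'
```

The only difference between the two files is those three leading bytes.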
So this doesn't mean that the engine doesn't support Unicode -- on the contrary. The trick is to use a professional code editor that properly formats document encodings. The reason the engine had trouble with the Unicode documents saved without the BOM is that it assumed they were single-byte ANSI, and therefore misread their contents. The Storm engine needs the BOM in a Unicode file in order to know that it is in fact Unicode.
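The kind of misreading described here can be reproduced in a few lines of Python. This is only an illustration, assuming the single-byte code page is Windows-1252 (`cp1252`); I don't know which code page the engine actually falls back to.

```python
text = "Señor"              # hypothetical sample string
raw = text.encode("utf-8")  # bytes of a UTF-8 file saved without a BOM

misread = raw.decode("cp1252")  # reader wrongly assumes single-byte ANSI
correct = raw.decode("utf-8")   # reader knows the file is UTF-8

print(misread)  # SeÃ±or
print(correct)  # Señor
```

The two bytes that encode "ñ" in UTF-8 get shown as two separate single-byte characters, which is the typical garbage you see in a misread file.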
What this all means, effectively, is that the engine supports both (properly formatted) Unicode (multi-byte) and ANSI (single-byte) document encodings, but that it relies on the BOM to differentiate between them. (Which is proper behaviour, and makes a whole lot more sense.) Having a mixed encoding in this case (which, again, is a really bad practice) will likely result in the engine reading the document as a single-byte, ANSI-encoded file.
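A rough sketch of that BOM-based decision, again in Python. This is not the Storm engine's actual code; the function names `detect_encoding` and `read_text` and the `cp1252` fallback are my own assumptions, and only UTF-8 and UTF-16 marks are handled.

```python
import codecs

def detect_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Pick a codec from a leading BOM, else assume a single-byte code page."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # this codec reads the BOM to pick the byte order
    return fallback         # no BOM: treated as ANSI (assumed code page)

def read_text(data: bytes) -> str:
    """Decode file contents using the BOM-based guess above."""
    return data.decode(detect_encoding(data))

print(read_text("Señor".encode("utf-8-sig")))  # BOM present, decoded as UTF-8
print(read_text("Señor".encode("cp1252")))     # no BOM, decoded as ANSI
print(read_text("Señor".encode("utf-8")))      # no BOM, UTF-8 bytes misread as ANSI
```

The last call shows the failure case: a UTF-8 file without the mark falls through to the single-byte path and comes out garbled.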
All this doesn't mean that we can do whatever we want -- on the contrary. Documents and code should still be properly formatted. The interpreter's flexibility/leeway is there to keep accidental mistakes from breaking the game -- not to be abused.
And if you still doubt my words, consider that Caribbean Tales shipped with half the game broken because of a simple, minor semantic error in one of the key files. My fixing that one small, hard-to-track-down error restored the intended functionality of half the game!
Messy code creates hard-to-track, phantom bugs. A good, experienced programmer codes clean.
(So please do not ignore what I said.)