2012-06-02

A Conflict of Hacks: Unix Shebang, Unicode BOM

BOM (byte order mark) is part of the unicode standard. If a tech declares full support for unicode, support for BOM is necessary. (the story i heard is that a Haskell compiler choked on BOM, and people are blaming Microsoft Notepad for adding the BOM. But the fact is, the Haskell spec says its source code is Unicode, thus choking on BOM is Haskell's fault.)
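To make "support for BOM" concrete, here is a minimal Python sketch of the difference between a reader that chokes on the mark and one that tolerates it. The one-line source text is made up for illustration, and this is not how any Haskell compiler actually reads files; it only shows the two decoding behaviors.

```python
import codecs

# A made-up one-line source file, saved with a utf-8 BOM in front
# (the way Notepad is said to save it).
source = codecs.BOM_UTF8 + b'main = putStrLn "hi"\n'

# Plain "utf-8" keeps the BOM as a U+FEFF character at the start,
# which a strict parser would see as an unexpected token.
naive = source.decode("utf-8")

# "utf-8-sig" strips a leading BOM if there is one, and is harmless if not.
tolerant = source.decode("utf-8-sig")

print(naive.startswith("\ufeff"))
print(tolerant.startswith("main"))
```

A tool that reads its input the second way accepts files with or without the mark.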

BOM is a hack, but so is the unix shebang. BOM being a given, it wouldn't have caused any problem if utf-8 hadn't been invented. utf-8 was invented by unix guy Ken Thompson and unix fanatic Rob Pike, largely to help the unix world move forward to unicode. As it is, BOM conflicts with the spirit of utf-8 (because utf-8 is meant to be ASCII compatible as is, yet the BOM byte sequence, EF BB BF in utf-8, isn't in ASCII.)
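The conflict can be sketched in a few lines of Python. The function has_shebang below is a hypothetical stand-in for the check a unix system does on a script (the file must begin with the two bytes "#!"); it is an illustration, not the real kernel code.

```python
import codecs

def has_shebang(data: bytes) -> bool:
    """Hypothetical stand-in: a script must start with the bytes '#!'."""
    return data[:2] == b"#!"

def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)

script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script  # BOM is EF BB BF in utf-8

print(has_shebang(script))           # shebang found
print(has_shebang(script_with_bom))  # BOM pushed '#!' off the first two bytes
print(is_ascii(script))              # plain script is pure ASCII
print(is_ascii(codecs.BOM_UTF8))     # the BOM bytes are not
```

Each hack wants to own the first bytes of the file, so only one of them can win.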

i read the link Thien-Thin Nguyen posted: http://www.utf8everywhere.org/. At first i found it very informative, but in the end i wasn't convinced by its opinion that we should all adopt utf-8 instead of utf-16. I think if one switches attitude, and takes utf-8 to be the hack that introduced all these problems, then many of their arguments for utf-8 don't stand.

side note… about that site: it's Windows oriented. As such, they didn't explain many of the terms and Windows tech they use. For example, i have little idea what they mean by narrowchar or widechar, nor do i know the many Windows libraries they mention.

also, the site is decidedly Western-minded. They forgot that in China, the encoding used is GB 18030, which has the same character set as unicode but a different encoding, and is also compatible with ASCII. No utf-8 nor utf-anything whatsoever. Chinese web traffic is a big share of the world's (i don't have exact numbers).
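Here is a small Python sketch of that point, using Python's built-in gb18030 codec. The sample string is arbitrary. It shows ASCII passing through GB 18030 and utf-8 unchanged (but not utf-16), and non-ASCII characters getting different bytes in each encoding while still round-tripping.

```python
# Arbitrary sample: one ASCII letter, one accented letter, one Chinese character.
text = "Aé中"

for name in ("utf-8", "utf-16-le", "gb18030"):
    encoded = text.encode(name)
    # Same characters come back from every encoding.
    assert encoded.decode(name) == text
    print(name, encoded.hex())

# ASCII letters keep their single ASCII byte in utf-8 and gb18030, not in utf-16.
ascii_in_utf8 = "A".encode("utf-8")
ascii_in_gb = "A".encode("gb18030")
ascii_in_utf16 = "A".encode("utf-16-le")
```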

the site wishes utf-16 would go away. But Windows, Mac, and the NTFS and HFS+ file systems are all utf-16, plus Java, C#, etc. The web (html, xml, css), though, is all utf-8. Neither is likely to go away.

For some detail about BOM, see: Unicode BOM Byte Order Mark Hack.