Unicode BOM Byte Order Mark

Perm url with updates: http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html

Xah Lee, 2008-12-18, 2010-05-25

Some notes on Unicode's Byte Order Mark. Quote from Wikipedia:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving Unicode text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
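To make the byte-order point concrete, here is a small Python sketch (not from the quoted article) that encodes U+FEFF in several Unicode encodings and prints the resulting bytes. The helper name bom_bytes is made up for illustration, and the separator argument to bytes.hex assumes Python 3.8 or later.

```python
# U+FEFF, the byte order mark, as a one-character string.
BOM = "\ufeff"

def bom_bytes(encoding: str) -> str:
    """Return the bytes of U+FEFF in the given encoding, as spaced hex.

    bom_bytes is a hypothetical helper name used only for this example.
    """
    return BOM.encode(encoding).hex(" ")

# Explicit-endian codecs are used so Python does not add a BOM of its own.
for name in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "utf-8"):
    print(f"{name:10} {bom_bytes(name)}")
```

The same code point comes out as FE FF in big-endian UTF-16 and FF FE in little-endian UTF-16. A consumer can therefore tell the byte order from the first two bytes.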

Some points of personal interest:

  • The BOM char is “U+FEFF” (ZERO WIDTH NO-BREAK SPACE).
  • The BOM char's use as a zero-width no-break space is deprecated since Unicode 3.2 (published in 2002). That char is now meant for use as a BOM only. “U+2060” (WORD JOINER) is now the char to use for the zero-width non-breaking function.
  • The primary purpose of the BOM is to indicate byte order (big-endian vs little-endian) in systems or situations that need this info in the file.
  • It is not needed in UTF-8, since UTF-8's code unit is a single byte, so there is no byte-order issue.
  • When used in UTF-8, it just gives an indication that the text is Unicode, encoded as UTF-8.
  • In Unix-like OSes, a BOM in a UTF-8 file may cause problems with the Shebang (Unix) hack, because the “#!” must be the very first bytes of the script. However, many Windows programs add a BOM to UTF-8 files, e.g. Notepad.
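The points above can be sketched in Python: look at the first bytes of a file, decide which BOM (if any) is present, and strip it. This is only a minimal sketch. The names detect_bom and strip_bom are made up for illustration, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Check the longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper for this example, not a standard library function.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

def strip_bom(data: bytes) -> bytes:
    """Return data with any leading BOM removed."""
    _, length = detect_bom(data)
    return data[length:]

# A shell script saved with a UTF-8 BOM: "#!" is no longer at byte 0.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"
print(detect_bom(script))
print(strip_bom(script).startswith(b"#!"))
```

One caveat: UTF-16 LE text that begins with a BOM followed by U+0000 has the same first four bytes as a UTF-32 LE BOM, so this kind of sniffing is a heuristic. For reading UTF-8 files that may or may not carry a BOM, Python's built-in “utf-8-sig” codec skips a leading BOM when decoding.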

See also: http://www.unicode.org/faq/utf_bom.html.

Also note that the page “http://www.unicode.org/faq/utf_bom.html”, by the Unicode Consortium (the organization that creates the Unicode standard), was invalid HTML as of 2008-12-18. Pretty ridiculous. See the validation result here: validator.w3.org. (It is valid as of 2010-05-25.)

See also: HTML Correctness and Validators.
