HTML Entities, Ampersand, Unicode, Semantics

Perm url with updates: http://xahlee.org/comp/ampersand_html_entities_unicode_semantics.html

Xah Lee, 2010-12-23

This article is some thoughts on semantics of symbols.

Semantic Differences in Symbols of Identical Appearance

I write a lot of essays and tutorials related to computing. Often, they include instructions on navigating menus in software. For example, i would write: use the menu “File▸New▸Folder”. (e.g. Second Life Keyboard Shortcuts Cheatsheet.)

I needed a consistent syntax to indicate the menu hierarchy. Notice that i've used a small right-pointing triangle there. The unicode char i was using is named TRIANGULAR BULLET “‣”. I recently found a better symbol for it, the BLACK RIGHT-POINTING SMALL TRIANGLE “▸”. So, i took 5 min in emacs to make the change across the 5k files on my site. (There were 353 occurrences in 62 files.)
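The same kind of site-wide replacement can also be scripted outside emacs. The following is only a minimal sketch in Python, not the method used above: the function name `replace_in_tree`, the `*.html` pattern, and the assumption that all files are UTF-8 are illustrative choices.

```python
from pathlib import Path

OLD = "\u2023"  # TRIANGULAR BULLET
NEW = "\u25b8"  # BLACK RIGHT-POINTING SMALL TRIANGLE


def replace_in_tree(root: str, pattern: str = "*.html") -> tuple[int, int]:
    """Replace OLD with NEW in every matching file under root.

    Returns (occurrences replaced, files changed).
    Assumes every matching file is UTF-8 encoded (hypothetical setup).
    """
    occurrences = 0
    files_changed = 0
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        count = text.count(OLD)
        if count:
            # Only rewrite files that actually contain the old char.
            path.write_text(text.replace(OLD, NEW), encoding="utf-8")
            occurrences += count
            files_changed += 1
    return occurrences, files_changed
```

Returning both counts makes it easy to sanity-check a run against figures like "353 occurrences in 62 files" before committing the change.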

Even though both chars look the same, they have semantic differences. The char i was using is meant to be a Bullet (typography), whose purpose is to indicate an item in a list. What i needed isn't an item indicator, but an indicator for a node in a tree. A common symbol for this purpose is a right-pointing triangle. So, BLACK RIGHT-POINTING SMALL TRIANGLE is a better choice.
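The semantic difference is recorded in the Unicode character database, which can be queried from code. A small sketch using Python's standard `unicodedata` module (the helper name `describe` is made up for illustration):

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, official Unicode name, and general category."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


# The two look-alike chars, written as escapes so the source stays unambiguous.
for ch in ("\u2023", "\u25b8"):
    print(describe(ch))
```

The names printed are the official ones quoted above. The general category differs too: the bullet is classified as punctuation, the triangle as a symbol.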

Ampersand, HTML Entities

On a related topic... you know how in html, the ampersand char “&” needs to be encoded as “&amp;”. In early HTML specs, that char always needs to be encoded, but i think in html4 the rule changed so that if the char is surrounded by spaces then it doesn't need to be encoded. (The rule is actually quite complex, especially when the char is in a url. See: URL Percent Encoding and Ampersand Char.) In XML, unlike HTML4, the ampersand always needs to be encoded.
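As a small illustration of the encoding step, here is a sketch using Python's standard `html` module, which always encodes the ampersand (the conservative choice that is valid in both HTML and XML). The sample title is borrowed from this site; the variable names are arbitrary.

```python
import html

title = "Scheme & Failure"

encoded = html.escape(title)      # plain text -> HTML source
decoded = html.unescape(encoded)  # HTML source -> plain text

print(encoded)           # the "&" is now written as the entity "&amp;"
print(decoded == title)  # the round trip restores the original text
```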

Meaning of the Ampersand

The ampersand char as an English punctuation mark means “and”; however, there are subtleties. (Read the Wikipedia article on ampersand; it has quite an interesting story on the etymology.) In my own writings, i sometimes use the symbol, as in the article title Scheme & Failure. In longer sentences, sometimes you use it instead of “and” because using the word “and” introduces too many conjunctions in the sentence, while a glyph makes the grammatical structure clearer. For example, look at this sentence: “grep & glob mutates into egrep & fgrep confoundedness”. (Unix and the mbox Email Format) Here, you can see the “&” acts as a tight connecting operator. Its meaning is slightly different from the more general connective “and”. For the same reason, company names stick with the symbol too, e.g. “AT & T”, “Bang & Olufsen”, “Johnson & Johnson” or law firms such as “Allen & Overy” (See: List of 100 largest law firms).

With my recent unicode work (see: Punctuation Symbols in Unicode), i discovered several variant unicode chars for the ampersand.

  • ＆ FULLWIDTH AMPERSAND
  • ﹠ SMALL AMPERSAND
  • ⅋ TURNED AMPERSAND

Being a fanatic about symbols, notation, syntax, elegance, i really hate the entities in HTML. The need to encode 「&」 as 「&amp;」 introduces several complexities. It's more difficult to parse, makes find & replace or grep more complex, and is more difficult to type. For example, if you want to find all occurrences of 「&」, you now need to search for both 「&」 and 「&amp;」. So, i've been toying with the idea of replacing every 「&」 with a unicode variant, the FULLWIDTH AMPERSAND 「＆」. This way, you don't have to deal with the html entity. The SGML/HTML/XML character entity was created because some decades ago, unicode wasn't there. You had only about 100 ASCII chars to work with. So “encoding” and “character entity” were invented.
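To make the search problem concrete, here is a rough sketch of a pattern that finds an ampersand in HTML source whether it was written bare or as the entity, while skipping other entities such as the one for “<”. The pattern and the name `find_ampersands` are illustrative assumptions, not a full HTML parser; numeric forms of the ampersand are deliberately not handled.

```python
import re

# Either the entity "&amp;", or a bare "&" that does not start
# something shaped like another entity reference (name or #number, then ";").
AMPERSAND = re.compile(r"&amp;|&(?![A-Za-z#][A-Za-z0-9]*;)")


def find_ampersands(source: str) -> list[str]:
    """Return every ampersand found in the HTML source, in either spelling."""
    return AMPERSAND.findall(source)


sample = "grep & glob, egrep &amp; fgrep, 1 &lt; 2"
print(find_ampersands(sample))
```

A plain search for one char has turned into a regex with a lookahead, which is the kind of complexity complained about above.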

In the end, i decided not to do this replacement, for practical reasons, because many tools and software still don't understand unicode well. For example, search engines other than Google may not understand the semantics of the FULLWIDTH AMPERSAND char. It is also interesting to note that when you type a normal “&” in the find box, Google Chrome, Safari, and IE8 will find the FULLWIDTH AMPERSAND too, but Firefox and Opera will not. Try it.
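One mechanism that would produce this kind of matching is Unicode compatibility normalization (NFKC), which folds some variant chars back to their plain forms. Whether the browsers named above actually work this way is an assumption; the sketch below only shows how the folding behaves, using Python's standard `unicodedata` module and hypothetical helper names.

```python
import unicodedata

VARIANTS = {
    "\uff06": "FULLWIDTH AMPERSAND",
    "\ufe60": "SMALL AMPERSAND",
    "\u214b": "TURNED AMPERSAND",
}


def fold(text: str) -> str:
    """Apply NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


def find_folded(needle: str, haystack: str) -> bool:
    """Substring search that ignores compatibility differences."""
    return fold(needle) in fold(haystack)


for ch, name in VARIANTS.items():
    print(name, "folds to plain ampersand:", fold(ch) == "&")
```

The fullwidth and small variants fold to the plain ampersand, while the turned ampersand does not, since it is a distinct symbol and not a compatibility variant. A tool that folds text this way before searching would match the fullwidth char when given a plain “&”; a tool that compares code points directly would not.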

Here's an interesting video from Google about how search engines deal with less-used unicode chars.

〈How does Google handle ligatures, soft-hyphens, interpuncts and hyphenation points?〉
