2010-09-22

HTML6, Your HTML/XML Simplified

Perm url with updates: http://xahlee.org/comp/html6.html

HTML6: Your JSON and SXML Simplified

Xah Lee, 2010-09-21, 2010-09-27, 2010-12-17

Tired of the standards bodies telling us what to do and then changing their attitude? Tired of the SGML/HTML/XML/XHTML/HTML5 changes? Tire no more, here's a new proposal that will make life easier.

Introducing HTML6

HTML6 is based on HTML5, XML, and a rectified LISP syntax. It is inspired by JSON and SXML. HTML6 is 100% regular at the syntax level, and is not a valid javascript expression or lisp expression. The syntax can be specified by about 3 short lines of parsing expression grammar.

The aim is a very simple, lean, 100% regular syntax that is trivial to parse in any language.

As with XML in theory, no error should be accepted. If the source code has incorrect syntax, the application should report an error.

Syntax Example

Here's a standard Atom webfeed XML file.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://xahlee.org/emacs/">

 <title>Xah's Emacs Blog</title>
 <subtitle>Emacs, Emacs, Emacs</subtitle>
 <link rel="self" href="http://xahlee.org/emacs/blog.xml"/>
 <link rel="alternate" href="http://xahlee.org/emacs/blog.html"/>
 <updated>2010-09-19T14:53:08-07:00</updated>
 <author>
   <name>Xah Lee</name>
   <uri>http://xahlee.org/</uri>
 </author>
 <id>http://xahlee.org/emacs/blog.html</id>
 <icon>http://xahlee.org/ics/sum.png</icon>
 <rights>© 2009, 2010 Xah Lee</rights>

 <entry>
   <title>Using Emacs's Abbrev Mode for Abbreviation</title>
   <id>tag:xahlee.org,2010-09-19:215308</id>
   <updated>2010-09-19T14:53:08-07:00</updated>
   <summary>tutorial</summary>
  <link rel="alternate" href="http://xahlee.org/emacs/emacs_abbrev_mode.html"/>
 </entry>

</feed>

Here's what it looks like in html6:

〔?xml「version “1.0” encoding “utf-8”」〕
〔feed「xmlns “http://www.w3.org/2005/Atom” xml:base “http://xahlee.org/emacs/”」

  〔title Xah's Emacs Blog〕
  〔subtitle Emacs, Emacs, Emacs〕
  〔link「rel “self” href “http://xahlee.org/emacs/blog.xml”」〕
  〔link「rel “alternate” href “http://xahlee.org/emacs/blog.html”」〕
  〔updated 2010-09-19T14:53:08-07:00〕
  〔author
   〔name Xah Lee〕
   〔uri http://xahlee.org/〕
  〕

  〔id http://xahlee.org/emacs/blog.html〕
  〔icon http://xahlee.org/ics/sum.png〕
  〔rights © 2009, 2010 Xah Lee〕

  〔entry
   〔title Using Emacs's Abbrev Mode for Abbreviation〕
   〔id tag:xahlee.org,2010-09-19:215308〕
   〔updated 2010-09-19T14:53:08-07:00〕
   〔summary tutorial〕
   〔link「rel “alternate” href “http://xahlee.org/emacs/emacs_abbrev_mode.html”」〕
  〕
〕

Simple Matching Pairs For Tag Delimiters

The standard xml markup bracket is simplified using simple lisp style matching pairs. For example, this code:

<h1>HTML6</h1>

Is written as:

〔h1 HTML6〕

The delimiters used are:

Character  Unicode Code Point  Unicode Name
〔         U+3014              LEFT TORTOISE SHELL BRACKET
〕         U+3015              RIGHT TORTOISE SHELL BRACKET

Syntax for XML Attributes

In xml:

<h1 id="xyz" class="abc">HTML6</h1>

In html6:

〔h1「id “xyz” class “abc”」HTML6〕

Attributes are enclosed in matching corner brackets. The items inside are a sequence of name and value pairs. Each value must be quoted with curly double quotes.

Character  Unicode Code Point  Unicode Name
「         U+300c              LEFT CORNER BRACKET
」         U+300d              RIGHT CORNER BRACKET
“          U+201c              LEFT DOUBLE QUOTATION MARK
”          U+201d              RIGHT DOUBLE QUOTATION MARK
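
The two rules so far (tortoise shell brackets for elements, corner brackets for attributes) are enough to sketch a converter from XML to html6. The following is only an illustration, not part of the proposal: the function names are made up, it assumes namespace-free input (Python's ElementTree rewrites namespaced tag names), and it assumes a single space separates a tag name from its content when there are no attributes.

```python
import xml.etree.ElementTree as ET


def escape(text, chars):
    """Replace each character listed in `chars` with a hex character reference."""
    return "".join(f"&#x{ord(c):x};" if c in chars else c for c in text)


def to_html6(elem):
    """Serialize an ElementTree element as html6 text (hypothetical helper)."""
    parts = ["〔", elem.tag]
    if elem.attrib:
        # Attributes: a sequence of name/value pairs inside corner brackets.
        # Escaping curly quotes inside values is an assumption of this sketch.
        pairs = " ".join(
            f"{name} “{escape(value, '“”')}”" for name, value in elem.attrib.items()
        )
        parts.append("「" + pairs + "」")
    elif elem.text or len(elem):
        # Assumed separator between the tag name and its content.
        parts.append(" ")
    if elem.text:
        parts.append(escape(elem.text, "〔〕"))
    for child in elem:
        parts.append(to_html6(child))
        if child.tail:
            parts.append(escape(child.tail, "〔〕"))
    parts.append("〕")
    return "".join(parts)


print(to_html6(ET.fromstring('<h1 id="xyz" class="abc">HTML6</h1>')))
# 〔h1「id “xyz” class “abc”」HTML6〕
```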

Escape Mechanisms

To include a literal tortoise shell character in data, use &#x3014; and &#x3015;. Other unicode chars can be written the same way.
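
As a small sketch of that rule in Python (the function name is hypothetical), a writer only has to replace the two tortoise shell characters before emitting text content. How to write a literal "&#x3014;" character sequence is not covered by the proposal, and this sketch does not try to settle it.

```python
def escape_tortoise(text: str) -> str:
    """Escape literal tortoise shell brackets so they cannot be read as tag delimiters."""
    return text.replace("〔", "&#x3014;").replace("〕", "&#x3015;")


print(escape_tortoise("a 〔b〕 c"))
# a &#x3014;b&#x3015; c
```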

Unicode; No More CDATA and Entities &amp;

There are no entities. The only exception is a unicode char written in hexadecimal format, e.g. &#x3b1; for 「α」.

For example, &amp; is literal; it is not displayed as &.
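
A reader of html6 text could apply this rule with a single regular expression. The following Python sketch (the names are made up for illustration) decodes only hexadecimal references and leaves everything else, including &amp; and decimal references, exactly as written.

```python
import re

# Only hexadecimal character references are recognized; named entities are not.
HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")


def decode_refs(text: str) -> str:
    """Replace each &#x...; reference with the character it names."""
    # chr() raises ValueError for values beyond U+10FFFF, which fits the
    # "no error should be accepted" rule.
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_refs("&#x3b1; &amp; &#x3014;"))
# α &amp; 〔
```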

Treatment of Whitespace

Identical to XML.

Char Encoding; UTF8 and UTF16 Only

Source code must be UTF8 or UTF16, only. Nothing else.
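
One possible way to enforce this when reading a file, sketched in Python. The proposal does not say how the two encodings are told apart; this sketch assumes UTF16 input always starts with a byte order mark, and treats everything else as strict UTF8. The function name is hypothetical.

```python
import codecs


def decode_source(data: bytes) -> str:
    """Decode html6 source bytes, accepting only UTF-8 or BOM-marked UTF-16."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the byte order mark and strips it.
        return data.decode("utf-16")
    # Strict decoding: any invalid byte sequence raises UnicodeDecodeError.
    # "utf-8-sig" also drops an optional UTF-8 byte order mark.
    return data.decode("utf-8-sig")
```

Anything that is neither valid UTF8 nor BOM-marked UTF16 is rejected with an exception instead of being guessed at.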

File Name Extension

File name extension is “.html6”.

Semantics

The semantics should follow xhtml5.

Questions and Answers

What's wrong with xhtml/html5 exactly?

The politics of standards bodies change, and their attitude about what is correct changes unpredictably too. Around 2000, we were told that XML and XHTML would change society, or at least make the web correct and valid, and far easier and more flexible to develop for. Now it's a decade later. Sure, the web has improved, but as far as html/xhtml and browser rendering go, it's still syntax soup with extreme complexities. 99.99% of web pages are still not valid, and nobody cares. Google doesn't care. Apple doesn't care. In Google's hundreds of tips to webmasters, almost none mention html validation. Google Earth itself generates invalid KML. Some 99.9% of the html files produced by Google or Apple are not valid html. Major browsers still don't agree on their rendering behavior. Web dev is actually far more complex, involving tens or hundreds of technologies that hardly any one person even knows about (ajax, JSON, lots of xml variations). It's hard to say if it is better at all than the HTML3 days with “font” and “table” tags and a gazillion tricks. The best practical approach is still trial and error with browsers.

And now HTML5 comes along, from a newfangled hip group primarily from the current big corporations Google and Apple, with an attitude that validation is overrated: an insult to the face of the XML mantra from w3c, just when more and more sites are starting to have correct XHTML and Microsoft's Internet Explorer is getting on track about correctness.

XML is a break from SGML, with many justifications for why it needs to be, and with some backward-compatibility trade-offs, and now HTML5 is a break from both SGML and XML.

For a personal story about how the changing attitude of standards bodies affects practical programming, see:

Why not just adopt SXML from the lisp world?

Lisp's SXML is not a stand-alone syntax designed for the needs of the web. SXML's syntax is designed to be compatible with lisp's existing syntax, traditions, and parsers. Lisp syntax (aka sexp) has several syntactical irregularities. It is not 100% nested parens of the form (a b c ...). SXML is easy for lispers to adopt, but harder for other languages and communities. (For details of lisp's syntax irregularities, see: Fundamental Problems of Lisp.)

The following explains how several of lisp's syntaxes for xml break the tree-and-syntax structural correspondence that is inherent in XML.

XML as a textual representation of a tree has a quirk, in that each node has this special thing called “attributes” (aka “properties”). The “attribute” is not a node of the tree, but rather, is info attached to a node. Here's an example html:

<h1 id="xyz" class="abc">A B C</h1>

The standard lisp syntax to represent attributes, adopted from lisp's similar concept of “properties” of lisp's “symbols”, is this:

(h1 :id "xyz" :class "abc" A B C)

The way this works is by creating an extra rule on the first char of a name. If the name starts with :, then that name is considered the name of a property, and the next element is considered its value. This special rule breaks a fundamental principle of XML syntax. That is, the lexical structure of the source code no longer corresponds to the semantic structure. The semantics of the source code changes depending on the first char of an atom.

Another way to represent xml's attributes, adopted in some lisp code and based on lisp's “alist” (association list) syntax, is this:

(h1 ((id . "xyz") (class . "abc")) A B C)

This too, has syntactical ambiguity.

The whole ((id . "xyz") (class . "abc")) can be interpreted as a node by itself, where the first element is again a node. But also here, it uses lisp's special “cons” syntax (id . "xyz") which is itself ambiguous at the syntax level. It can be considered as a node named “id” with 2 branches . and "xyz", like this:

id
 .
 "xyz"

or it can be considered as a node named “cons” with 2 branches id and "xyz", like this:

cons
 id
 "xyz"

Another common lisp syntax for attributes, from SXML, is this:

(h1 (@ (id . "xyz") (class . "abc")) A B C)

Here, a special rule is created if a name is just 「@」. When a first element's name is just 「@」, then that parenthesized expression is considered to be a property list, not a node.

So, in conceiving html6, i thought a solution for getting rid of syntax ambiguity for node vs attributes is to use a special bracket for properties/attributes of a node. e.g. 〔h1「id “xyz” class “abc”」A B C〕.

Why use weird Unicode characters for matching pair?

Unicode is widely adopted today and is very practical. (See: Unicode Popularity On Web.) It is the default char set for many langs (e.g. Java, XML). Unicode also has a lot of proper matching pairs. (See: Matching Brackets in Unicode.) It seems today is the right time to adopt the wide range of proper symbols provided in unicode, instead of relying on the very limited number of ASCII characters of the 1960s.

The straight double quote character 「"」 (ascii 34) is not a matching pair, and as computer source code it has several problems. For example, it needs context to know which quote chars are paired. Also, it is difficult to recover from a missing quote. (This problem is especially pronounced in text editors for syntax highlighting.) A proper matching pair allows programs and editors to determine the quoted string correctly and more easily, which makes it easier to know its position in a tree and to implement features such as navigating the tree in an editor.
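
To make the point about matching pairs concrete, here is a short Python sketch of the kind of check an editor could run. Because every opening character differs from its closing character, a single pass with a stack finds the first place where the pairing breaks. The function name is made up, and the sketch assumes all three pairs must balance everywhere, including inside text content.

```python
PAIRS = {"〔": "〕", "「": "」", "“": "”"}
CLOSERS = set(PAIRS.values())


def first_mismatch(src: str):
    """Return the offset of the first pairing error, or None if all pairs match."""
    stack = []  # (opening character, offset) for every bracket still open
    for offset, ch in enumerate(src):
        if ch in PAIRS:
            stack.append((ch, offset))
        elif ch in CLOSERS:
            if not stack or PAIRS[stack[-1][0]] != ch:
                return offset  # a closer with no matching opener
            stack.pop()
    # Anything left on the stack was opened but never closed.
    return stack[-1][1] if stack else None
```

With the straight quote 「"」 the same scan cannot work, since an opening quote and a closing quote are the same character.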

The problem of inputting special chars of unicode can be trivially solved by text editors. For example, Emacs, Mathematica, and Microsoft Word all have simple and efficient ways to enter commonly used special chars such as ™ © é ¶. (See: Emacs xmsi-mode for Math Symbols Input; How Mathematica does Unicode?; Designing a Math Symbols Input System.) Also, many special chars are part of the keyboard layout for fast input. (See: International Keyboard Layouts; Dvorak, Maltron, Colemak, NEO, Bépo, Turkish-F, Keyboard Layouts Fight!)

Possibly, the special bracketing chars can be replaced by () and [] for html6. Though, that also means a lot of ugly escapes will be needed in the content text. If they are not escaped, the syntax of the whole file becomes incorrect.

The core idea of html6 is that the syntax is designed specifically as a 2-dimensional textual representation of a tree, with an attribute quote that attaches a limited form of info (a sequence of pairs for attributes) to any node to fit the existing structure of XML.

The advantage of this is that it should be extremely easy to parse. The syntax can be specified in perhaps just 3 lines of parsing expression grammar (PEG), and PEG libraries exist for Perl, Python, Ruby, Lua, C, C#, Java, OCaml/F#, Clojure, ... A parser for html6 can also be trivially written without relying on PEG.
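
As a rough illustration of that last claim, here is a hand-written recursive descent parser in Python. It is a sketch under assumptions, not a reference implementation: the exact grammar is not spelled out in this proposal, so the rules for names, the single whitespace char dropped after a tag name, and all identifiers below are guesses made for the example. Character references are left undecoded.

```python
import re


class Html6SyntaxError(ValueError):
    """Raised on any syntax error; nothing is silently repaired."""


# Assumed lexical rules: names contain no whitespace and no delimiter chars.
NAME = re.compile(r"[^\s〔〕「」“”]+")
ATTR = re.compile(r"\s*([^\s〔〕「」“”]+)\s*“([^“”]*)”")
ATTR_END = re.compile(r"\s*」")
TEXT = re.compile(r"[^〔〕]+")
SPACE = re.compile(r"\s*")


def parse(src):
    """Parse html6 text into a list of (name, attributes, children) tuples."""
    nodes, pos = [], SPACE.match(src).end()
    while pos < len(src):
        node, pos = _node(src, pos)
        nodes.append(node)
        pos = SPACE.match(src, pos).end()
    return nodes


def _node(src, pos):
    if not src.startswith("〔", pos):
        raise Html6SyntaxError(f"expected 〔 at offset {pos}")
    m = NAME.match(src, pos + 1)
    if not m:
        raise Html6SyntaxError(f"expected a tag name at offset {pos + 1}")
    name, pos = m.group(), m.end()
    attrs = {}
    if src.startswith("「", pos):
        pos += 1
        while m := ATTR.match(src, pos):  # sequence of name “value” pairs
            attrs[m.group(1)] = m.group(2)
            pos = m.end()
        m = ATTR_END.match(src, pos)
        if not m:
            raise Html6SyntaxError(f"expected 」 at offset {pos}")
        pos = m.end()
    elif pos < len(src) and src[pos].isspace():
        pos += 1  # assumption: one whitespace char separates name and content
    children = []
    while pos < len(src) and src[pos] != "〕":
        if src[pos] == "〔":
            child, pos = _node(src, pos)
        else:
            m = TEXT.match(src, pos)
            child, pos = m.group(), m.end()
        children.append(child)
    if pos >= len(src):
        raise Html6SyntaxError(f"missing 〕 for 〔{name}")
    return (name, attrs, children), pos + 1


print(parse("〔h1「id “xyz” class “abc”」A B C〕"))
# [('h1', {'id': 'xyz', 'class': 'abc'}, ['A B C'])]
```

Attributes end up in a dictionary attached to the node, while child nodes and text stay in the children list, so the parsed structure mirrors the bracket structure of the source.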

Any thoughts about flaws?

It is probably hopeless for browsers to adopt this. But if you are involved in the standards bodies of xml or html5, please consider this, and consider more about correctness and validation. XML is a step in the right direction, with huge consequences in various xml languages and formats (JSON, XSLT, XSL, XQUERY, o:XML..., Microsoft Office Open XML, etc.) HTML5's features are, in theory, simply XML with a proper DTD. HTML5 was created in part to address w3c's slowness in responding to industrial changes, and in part to address the verbosity of XML syntax. HTML5 by itself does not introduce any new technical concepts. The force behind HTML5 is almost purely corporate adoption, and mostly existing practices from corporations. But the attitude it brought about seems to be a step backward, towards corporate sponsored tags (much from Google) and technologies (e.g. much of canvas is from Apple, a low-level pixel-drawing garbage in comparison to SVG), odd-end special tags, more special syntaxes, less focus on correctness, and another new syntax/format in the html/xml/xhtml/dtd-sniffing soup.
