Computer Language Design: String Syntax

Perm url with updates: http://xahlee.org/comp/strings_syntax_in_lang.html


Xah Lee, 2010-10-12, 2011-02-28, 2011-03-01

This article discusses string syntax in computer language design.

Problem with Escapes

Typically, the syntax for a string in a lang looks like this:

mystr = "abc";

However, if your string contains double quotes, you need to escape them (usually with a backslash), which is ugly, hard to read, and inconvenient for programmers. For example, suppose your code processes a lot of HTML:

mystr = "<link rel=\"stylesheet\" type=\"text/css\" href=\"../xyz.css\">";

Here's a practical example from real emacs lisp code. The backslashes get annoying.

(while (search-forward "<div class=\"donate253586027\">If you enjoyed this site, please consider donating $3. Any amount is appreciated. <a href=\"http://xahlee.org/thanks.html\">Thanks!</a><form action=\"https://www.paypal.com/cgi-bin/webscr\" method=\"post\"><div><input type=\"hidden\" name=\"cmd\" value=\"_s-xclick\"><input type=\"hidden\" name=\"hosted_button_id\" value=\"8127788\"><input type=\"image\" src=\"https://www.paypal.com/en_US/i/btn/btn_donateCC_LG.gif\" name=\"submit\"><img alt=\"\" src=\"https://www.paypal.com/en_US/i/scr/pixel.gif\" width=\"1\" height=\"1\"></div></form></div>" nil t)

It's much worse with regex, especially emacs regex:

(while (search-forward-regexp
 "<a href=\"\\([^\"]+\\)\"><div class=\"img\"><img src=\"\\([^\"]+\\)\" alt=\"\\([^\"]+\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\"></div></a>" nil t)
(replace-match "<div class=\"img\"><a href=\"\\1\"><img src=\"\\2\" alt=\"\\3\" width=\"\\4\" height=\"\\5\"></a></div>" t))

Other Forms of Escape: HTML Entities, Hex Code Literals

Another form of escape is HTML entities or hex codes. For example, in HTML, the ampersand char can be written as &amp; or &#38; or &#x26;. The char “b” can be written as &#98; or &#x62;.

Similarly, in Java and many other langs, a hexadecimal code can be used. For example, “b” can be written as “\u0062”.

These are also considered as escapes here.
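
As a quick check of the equivalences above, here is a minimal Python sketch. It assumes the standard library's html.unescape for decoding the HTML forms; the variable names are only for illustration.

```python
import html

# Three HTML spellings of the ampersand, and two of the letter b.
ampersand_forms = ["&amp;", "&#38;", "&#x26;"]
b_forms = ["&#98;", "&#x62;"]

# html.unescape turns named and numeric character references into chars.
decoded_ampersands = [html.unescape(s) for s in ampersand_forms]
decoded_bs = [html.unescape(s) for s in b_forms]

# A \u escape inside a Python string literal names the same char.
unicode_escape_b = "\u0062"

print(decoded_ampersands)       # ['&', '&', '&']
print(decoded_bs)               # ['b', 'b']
print(unicode_escape_b == "b")  # True
```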

Variable Delimiters (Perl, PHP, Python)

One solution is to allow different delimiters for the string. Perl and Python take this approach.

For example, in Perl, the following all evaluate to the same string:

"abc"
'abc'
q[abc]
q(abc)
q{abc}

Basically, the lang allows several chars to be used as the string delimiter. This way, if your string contains ", you can switch to a different quoting delimiter. Then you don't need the escapes, and your string is more readable. Here's how it looks, much clearer:

$mystr = q[<link rel="stylesheet" type="text/css" href="http://xahlee.org/xyz.css">];

Python is similar. For example, the following lines all evaluate to the same string:

"abc"
'abc'
"""abc"""
'''abc'''
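
The Python forms above can be checked directly. The following small sketch also applies them to the HTML line from earlier; the variable names are made up for the illustration.

```python
# The four literal forms denote the same three-character string.
forms = ["abc", 'abc', """abc""", '''abc''']
all_same = all(s == "abc" for s in forms)

# The HTML example: escaped double quotes versus alternative delimiters.
escaped = "<link rel=\"stylesheet\" type=\"text/css\" href=\"../xyz.css\">"
single = '<link rel="stylesheet" type="text/css" href="../xyz.css">'
triple = '''<link rel="stylesheet" type="text/css" href="../xyz.css">'''

print(all_same)                    # True
print(escaped == single == triple) # True
```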

Heredoc

Another mechanism, used by Perl and PHP, is called “heredoc”. Basically, it uses an arbitrary string chosen by the programmer as the delimiter, and anything in between is literal. Here's an example:

# perl
$mystr = <<'randomstringhere823497';
<link rel="stylesheet" type="text/css" href="http://xahlee.org/xyz.css">
randomstringhere823497
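
To make the idea concrete, here is a rough Python sketch of how a tool might pick a heredoc tag that is safe for a given text, and read the text back. It is a toy format loosely modeled on the Perl example above, not Perl's actual parser. The function names and the "EOT" base tag are assumptions made for this illustration.

```python
import itertools

def choose_heredoc_tag(text, base="EOT"):
    """Return a tag that does not occur as a whole line of text."""
    lines = set(text.split("\n"))
    for n in itertools.count():
        tag = base if n == 0 else f"{base}{n}"
        if tag not in lines:
            return tag

def wrap_heredoc(text):
    """Wrap text between a header line and a line holding only the tag."""
    tag = choose_heredoc_tag(text)
    return f"<<'{tag}';\n{text}\n{tag}\n"

def read_heredoc(source):
    """Recover the text: read lines until the first line equal to the tag."""
    header, _, rest = source.partition("\n")
    tag = header[len("<<'"):-len("';")]
    body_lines = []
    for line in rest.split("\n"):
        if line == tag:
            break
        body_lines.append(line)
    return "\n".join(body_lines)

sample = 'EOT\n<link rel="stylesheet" href="xyz.css">\nEOT1'
print(choose_heredoc_tag(sample))                   # EOT2
print(read_heredoc(wrap_heredoc(sample)) == sample) # True
```

Because the tag can always be made longer, the supply of delimiters never runs out, and the quoted text is never modified.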


Can Escape be Completely Avoided?

On 2010-09-26, Ron Garret wrote:

And just for good measure, some «European style quotes» and “balanced smart quotes” which I intend some day to try to convince people to start using to eliminate the scourge of backslash escapes. But that's a topic for another day.

On 2010-09-26, Spiros Bousbouras 〔spi...@gmail.com〕 wrote:

I don't see how they would help to eliminate backslash escapes. Let's imagine that strings were delimited by « and ». If you wanted a string which contained a » you would still need to escape it.

Using the rich variety of matching pair chars in Unicode can eliminate many escapes and improve code readability. (See: Matching Brackets in Unicode.)

Ultimately, escapes cannot be completely eliminated, no matter how many delimiter variations your lang has (unless the supply is infinite, as with “heredoc”, or you use non-syntactic methods). This is because, if your lang is a general lang, inevitably it'll be used to parse its own source code. And there will be occasions when the text you want to quote is a complete enumeration of all possible string delimiters of your lang (e.g. in a tutorial of your language). So here, no matter which delimiter you choose, it occurs in the string you want to quote.

I think this problem can be mathematically modeled as equivalent to the self-reference problem in symbolic logic, and is not an avoidable problem. (See: Russell's paradox.)
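
The limit of a finite delimiter set can be shown with a small Python sketch. The delimiter list below is a hypothetical one, loosely based on the Perl forms shown earlier, and pick_delimiters is a made-up helper, not any language's real rule.

```python
# A hypothetical, finite set of delimiter pairs for some language.
PAIRS = [('"', '"'), ("'", "'"), ("[", "]"), ("(", ")"), ("{", "}")]

def pick_delimiters(text, pairs=PAIRS):
    """Return the first pair whose chars do not occur in text, else None."""
    for open_ch, close_ch in pairs:
        if open_ch not in text and close_ch not in text:
            return open_ch, close_ch
    return None  # every delimiter occurs in text: an escape would be needed

html_line = '<link rel="stylesheet" type="text/css" href="xyz.css">'
tutorial_line = "\"'[](){}"  # enumerates every delimiter in PAIRS

print(pick_delimiters(html_line))      # ("'", "'")
print(pick_delimiters(tutorial_line))  # None
```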

“heredoc” is an ugly solution to this. Another possible solution is variable repetition. For example, treat any repetition of a matching pair delimiter as valid syntax too: (((abc))), {{abc}}, 「「「「abc」」」」, etc., or any combination of repeated variable delimiters, e.g. ([((【〈『‹abc›』〉】))]).

(Note: here, the desired property is the ability to quote a text without modifying the text in any way. So this excludes adding any form of escape, or inventive ways such as adding a Tab char in front of each line. Also, I am thinking in the context of computer language syntax. This excludes semantic solutions that specify how many chars or lines to read in and use no delimiters. Thanks to the reddit discussion. Source: www.reddit.com.)
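
Here is one way the repetition idea might be sketched in Python, for a single pair of delimiters. The function names are invented for this illustration. The sketch makes one loud assumption: it rejects text that begins with the opening char or ends with the closing char, because in those cases the delimiter run would merge with the text. A real design would need some rule for that case.

```python
def quote_repeated(text, open_ch="(", close_ch=")"):
    """Wrap text in n copies of the pair, n = longest run of close_ch + 1."""
    if text.startswith(open_ch) or text.endswith(close_ch):
        # Assumption of this sketch: boundary runs are not handled.
        raise ValueError("text would merge with the delimiter run")
    longest = run = 0
    for ch in text:
        run = run + 1 if ch == close_ch else 0
        longest = max(longest, run)
    n = longest + 1
    return open_ch * n + text + close_ch * n

def unquote_repeated(quoted, open_ch="(", close_ch=")"):
    """Count the opening run, then cut at the first equally long closing run."""
    n = len(quoted) - len(quoted.lstrip(open_ch))
    end = quoted.index(close_ch * n, n)
    return quoted[n:end]

sample = "f(g(x)) + 1"
quoted = quote_repeated(sample)
print(quoted)                              # (((f(g(x)) + 1)))
print(unquote_repeated(quoted) == sample)  # True
```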

Disadvantage of Variable Delimiters

Variable quoting chars also introduce some complexity. Namely, each delimiter symbol in your lang now has multiple, context-dependent meanings, and/or you have multiple symbols for the same semantics. (See: Problems of Symbol Congestion in Computer Languages (ASCII Jam; Unicode; Fortress).)

For example, one language that does not have multiple string delimiters is emacs lisp (or lisps in general). In emacs lisp, a string is always delimited by the straight double quote ("). (The straight single quote is not a string delimiter.) Emacs lisp has the worst readability problems of leaning toothpick syndrome. However, one advantage is that its string syntax has a very simple logic. For example, you can ALWAYS locate ALL strings in the source code by searching for the straight double quote char. In langs with variable quotes, such as Perl, this is no longer true. You have to search for several chars, and for each occurrence you have to judge based on adjacent chars.
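
The single-delimiter search described above is simple enough to sketch in a few lines of Python. This is a rough illustration only: find_double_quoted_strings is a made-up name, and the sketch assumes the source has no comments or character syntax containing a double quote, which a real emacs lisp reader would have to handle.

```python
def find_double_quoted_strings(source):
    """Return every "..." literal in source, honoring backslash escapes."""
    strings = []
    i, n = 0, len(source)
    while i < n:
        if source[i] == '"':
            j = i + 1
            # Scan to the closing quote, skipping any backslash-escaped char.
            while j < n and source[j] != '"':
                if source[j] == "\\":
                    j += 1
                j += 1
            strings.append(source[i:j + 1])
            i = j + 1
        else:
            i += 1
    return strings

lisp_source = r'(message "say \"hi\"") (concat "a" "b")'
for literal in find_double_quoted_strings(lisp_source):
    print(literal)
```

A comparable scanner for a lang with variable delimiters would need one branch per delimiter form, plus a look at the adjacent chars to decide whether a bracket or quote starts a string at all.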

Similarly, in Mathematica, the paren is always used for one single purpose: it is the delimiter for specifying the evaluation order of expressions, e.g. (3+4)*2. The square bracket [] has one single purpose only: it is the delimiter for function arguments, e.g. f[x_]:= x + 1, f[3]. The curly bracket {} again has one single purpose only: it is the delimiter for lists, e.g. {1,2}. In traditional math notation and most comp langs, it's all context-dependent soup.

Whatever your philosophy in lang design with regard to quoting mechanisms, Unicode introduces many proper matching pairs that are helpful and that avoid multiple semantic meanings for a given char.

(This post originated from an online forum discussion. Source: groups.google.com)
