2013-07-05

Emacs Regex Advantages, Whitespace Syntax & Semantics Complexity in Computer Languages

Emacs Regex Advantages

Emacs regex is usually a pain, but it has advantages.

Today, I need to change text like this:

<p>盖闻天地之数,有十二万九千六百岁为一元。将一元分为十二会,乃子、丑、
寅、卯、辰、巳、午、未、申、酉、戌、亥之十二支也。每会该一万八百岁。且
就一日而论:子时得阳气,而丑则鸡鸣;寅不通光,而卯则日出;辰时食后,而
巳则挨排;日午天中,而未则西蹉;申时晡而日落酉;戌黄昏而人定亥。</p>

to this:

<p>盖闻天地之数,有十二万九千六百岁为一元。将一元分为十二会,乃子、丑、寅、卯、辰、巳、午、未、申、酉、戌、亥之十二支也。每会该一万八百岁。且就一日而论:子时得阳气,而丑则鸡鸣;寅不通光,而卯则日出;辰时食后,而巳则挨排;日午天中,而未则西蹉;申时晡而日落酉;戌黄昏而人定亥。</p>

Basically, I need to remove newline chars, but only inside the “p” paragraph tag, and only when the text is Chinese, not English.

There are a few hundred such pages that need this change.

Emacs regex has a nonascii character class. So, I can use this regex (it spans two lines, with a literal newline between the two groups):

\([[:nonascii:]]\{3\}\)
\([[:nonascii:]]\{3\}\)

and replacement is this:

\1\2

〔☛ Emacs: Text Pattern Matching (regex) tutorial〕

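I looked at Python. Its built-in “re” module doesn't have a named char class for non-ASCII characters, though one can use a char range such as [^\x00-\x7F]. Perl does support bracketed char classes with the same POSIX-style syntax as emacs. The nonascii class itself is an emacs extension, not part of POSIX; the closest Perl form is the negated class [[:^ascii:]]. Am not sure if Perl's support was added in recent years or existed around 2000.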
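Here is a minimal Python sketch of the char range approach mentioned above. It assumes that “non-ASCII” is a good enough stand-in for “Chinese” in these pages, as the emacs regex does. The function name join_nonascii_lines and the sample text are made up for illustration.

```python
import re

# Three non-ASCII chars, a newline, then three more non-ASCII chars.
# [^\x00-\x7F] is a char range standing in for emacs's [[:nonascii:]].
PATTERN = re.compile(r"([^\x00-\x7F]{3})\n([^\x00-\x7F]{3})")

def join_nonascii_lines(text: str) -> str:
    """Remove a newline when it has 3 non-ASCII chars on each side."""
    return PATTERN.sub(r"\1\2", text)

# Hypothetical sample: one Chinese paragraph, one English paragraph.
sample = "<p>子丑寅卯辰巳\n午未申酉戌亥</p>\n<p>some English\ntext here</p>"
print(join_nonascii_lines(sample))
```

The newline inside the Chinese paragraph is removed, while the newlines next to ASCII text are left alone. Like the emacs regex, this consumes the 3 chars on each side of a match, so a line shorter than 6 chars could cause the next newline to be skipped.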

Syntax, Whitespace, Programing Languages

Note that removing these newlines is important, because otherwise the browser will render a gap wherever there is a newline, and for Chinese text that's very annoying. There is no way to use CSS to remove such a gap.

And here we have an interesting issue. Namely, the semantics of whitespace among languages.

In HTML, it's quite complex. A newline char can be equivalent to a space, multiple spaces can be equivalent to a single space, and a newline char at the beginning or end of a tag has special meaning. See: Programing Language Design: Syntax Sugar Problem: Irregularity vs Convenience. Only some of this semantic complexity can be adjusted by CSS, but there's no systematic rule to know which parts can and which cannot. See: CSS Text Wrapping Tutorial.
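To make the first two rules concrete, here is a rough Python sketch of the default whitespace collapsing. It is a simplification, not the browser's actual algorithm: it ignores the special handling at the beginning and end of tags, the “pre” tag, and anything CSS changes. The function name collapse_whitespace is made up.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Approximate default HTML rendering of inline text:
    each run of spaces, tabs, and newlines becomes one space."""
    return re.sub(r"[ \t\r\n]+", " ", text)

print(collapse_whitespace("one   two\nthree"))
print(collapse_whitespace("天地\n之数"))
```

For English the result is harmless, since words are separated by spaces anyway. For Chinese, the newline turns into a space in the middle of the sentence, which is the gap described above.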

In other languages such as Perl, Python, Ruby, it's also very complex. See: What Does it Means When a Programing Language Claims “Whitespace is Insignificant”?.

Should Programers Keep Lines to 70 Chars?

Another interesting aspect this touches on is the habit of cutting lines at every 70 chars or so.

As you can see, for the Chinese language, doing that is unacceptable. Gaps in Chinese can affect semantics: a gap at the wrong place changes the meaning. This is because ancient Chinese started out without any punctuation, and later on gaps came to be used as punctuation. Today, gaps can still be used for punctuation, especially in verse or titles.

Note that emacs usually has problems when lines are very long. This is because the unix habit of cutting lines is deeply ingrained in its implementation.

Also, it's interesting that emacs's fill-paragraph command behaves correctly here. That is, when the chars on both sides of a newline are Chinese characters, emacs simply removes the newline, instead of replacing it with a space.

So, another approach to solving this problem is to programmatically call “unfill-paragraph”. 〔☛ Emacs unfill-paragraph, unfill-region, compact-uncompact-block〕
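The joining rule described above can be imitated outside emacs. The following Python sketch is not how emacs implements it. It assumes “non-ASCII on both sides” as a crude test for “Chinese on both sides”, and the function name unfill is made up.

```python
import re

# Crude stand-in for "Chinese character": any non-ASCII char.
NONASCII = r"[^\x00-\x7F]"

def unfill(paragraph: str) -> str:
    """Join the lines of one paragraph into a single line."""
    # Drop newlines that sit between two non-ASCII chars.
    joined = re.sub(rf"(?<={NONASCII})\n(?={NONASCII})", "", paragraph)
    # Every remaining newline becomes a single space.
    return joined.replace("\n", " ")

print(unfill("子时得阳气\n而丑则鸡鸣"))
print(unfill("hello\nworld"))
```

Because the lookbehind and lookahead do not consume the neighboring chars, this version also handles very short lines, at the cost of checking only one char on each side.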