Emacs Regex Advantages, Whitespace Syntax & Semantics Complexity in Computer Languages

Emacs Regex Advantages

emacs regex is usually a pain, but it has advantages.

today, i need to change text like this:


to this:


Basically, i need to remove newline chars, but only inside the “p” paragraph tag, and only when the text is Chinese, not English.

there are a few hundred such pages that need this change.

Emacs regex has a nonascii character class. So, i can use this regex:


and replacement is this:


〔☛ Emacs: Text Pattern Matching (regex) tutorial〕

I looked at Python, and its “re” module doesn't have a named char class for Unicode. Though, am sure one can use a char range. Perl does support named char classes, with the same bracket syntax as emacs, which comes from POSIX regex (though “nonascii” itself is an emacs extension, not one of the POSIX classes). Am not sure if it was added in recent years or existed around 2000.
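For the Python route, here is a minimal sketch of the char range idea, not the emacs regex used above. It assumes that “non-ASCII” is a good enough stand-in for “Chinese” (it also matches accented Latin letters), that paragraphs are written as plain `<p>…</p>` with no attributes, and the names `join_lines_in_paragraphs`, `P_ELEMENT` and `NEWLINE_BETWEEN_NONASCII` are made up for illustration.

```python
import re

# Assumption: any char outside U+0000..U+007F counts as "Chinese".
# This also matches accented Latin letters, so it is only an approximation.
NEWLINE_BETWEEN_NONASCII = re.compile(r"(?<=[^\x00-\x7F])\n(?=[^\x00-\x7F])")

# Assumption: paragraphs are plain <p>...</p> with no attributes and no nesting.
P_ELEMENT = re.compile(r"<p>.*?</p>", re.DOTALL)


def join_lines_in_paragraphs(html: str) -> str:
    """Remove newlines that sit between two non-ASCII chars, inside <p> only."""
    return P_ELEMENT.sub(
        lambda m: NEWLINE_BETWEEN_NONASCII.sub("", m.group(0)),
        html,
    )
```

Newlines outside a “p” tag, and newlines next to ASCII chars, are left alone, so English paragraphs keep their line breaks.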

Syntax, Whitespace, Programming Languages

Note that removing these newlines is important, because otherwise the browser will render a gap wherever there is a newline, and for Chinese text, it's very annoying. There is no way to use CSS to remove such a gap.

And here we have an interesting issue. Namely, the semantics of whitespace among languages.

in HTML, it's quite complex. Newline char can be equivalent to space, and multiple spaces can be equivalent to a single space, and newline char at beginning/ending of a tag has special meaning. See: Programing Language Design: Syntax Sugar Problem: Irregularity vs Convenience. Only some of this semantic complexity can be adjusted by CSS, but there's no systematic rule to know which can and which cannot, see: CSS Text Wrapping Tutorial.
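To make the gap concrete, here is a rough sketch of the default collapsing behavior described above. It is a simplification, not the actual algorithm browsers run: it only turns each run of spaces, tabs and newlines into one space, and ignores the special cases at the beginning/ending of a tag. The function name `collapse_whitespace` is made up for illustration.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Approximate default HTML rendering: each whitespace run becomes one space."""
    # Simplification: real browsers also treat whitespace at the edges of
    # elements and lines specially; that part is not modeled here.
    return re.sub(r"[ \t\n\r\f]+", " ", text)
```

With this model, a newline in the middle of Chinese text comes out as a space between two characters, which is the visible gap.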

In other languages such as Perl, Python, Ruby, it's also very complex. See: What Does it Means When a Programing Language Claims “Whitespace is Insignificant”?.

Should Programmers Keep Lines to 70 Chars?

another interesting aspect this touches on is the habit of cutting lines at every 70 chars or so.

as you can see, for the Chinese language, doing that is unacceptable. Gaps in Chinese can affect semantics. A gap at the wrong place changes meaning. (Because, ancient Chinese started out not having any punctuation, then later on gaps were used for punctuation. Today, gaps can still be used for punctuation, especially for verse or title.)

note that emacs usually has problems when the lines are very long. This is due to the unix habit of cutting lines being deeply ingrained in its implementation.

Also, it's interesting that emacs's fill-paragraph command behaves correctly on this. That is, when the neighboring chars around a newline are Chinese characters, emacs simply removes the newline, instead of replacing it with a space.

So, this means, another approach to solve this problem is to programmatically call “unfill-paragraph”. 〔☛ Emacs unfill-paragraph, unfill-region, compact-uncompact-block〕
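The joining rule described above can be sketched outside emacs too. The following is an approximation in Python, not emacs's actual implementation: it assumes “Chinese character” can be tested as “non-ASCII”, and the function name `unfill` is made up for illustration.

```python
def unfill(paragraph: str) -> str:
    """Join the lines of one paragraph into a single line.

    Lines are joined with a space, except when the chars on both sides of
    the newline are non-ASCII (assumed here to mean Chinese), in which case
    they are joined with nothing in between.
    """
    lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
    if not lines:
        return ""
    result = lines[0]
    for line in lines[1:]:
        both_nonascii = ord(result[-1]) > 127 and ord(line[0]) > 127
        result += line if both_nonascii else " " + line
    return result
```

Mixed text still gets a space where English meets Chinese, so only the Chinese-to-Chinese joins lose the gap.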