2013-07-05

Emacs Regex Advantages, Whitespace Syntax & Semantics Complexity in Computer Languages

Emacs Regex Advantages

emacs regex is usually a pain, but it has advantages.

today, i need to change text like this:

<p>盖闻天地之数,有十二万九千六百岁为一元。将一元分为十二会,乃子、丑、
寅、卯、辰、巳、午、未、申、酉、戌、亥之十二支也。每会该一万八百岁。且
就一日而论:子时得阳气,而丑则鸡鸣;寅不通光,而卯则日出;辰时食后,而
巳则挨排;日午天中,而未则西蹉;申时晡而日落酉;戌黄昏而人定亥。</p>

to this:

<p>盖闻天地之数,有十二万九千六百岁为一元。将一元分为十二会,乃子、丑、寅、卯、辰、巳、午、未、申、酉、戌、亥之十二支也。每会该一万八百岁。且就一日而论:子时得阳气,而丑则鸡鸣;寅不通光,而卯则日出;辰时食后,而巳则挨排;日午天中,而未则西蹉;申时晡而日落酉;戌黄昏而人定亥。</p>

Basically, i need to remove newline chars, but only if they are inside the “p” paragraph tag, and only when the text is Chinese, not English.

there are a few hundred such pages i need to do this on.

Emacs regex has a nonascii character class. So, i can use this regex (note the literal newline between the two groups):

\([[:nonascii:]]\{3\}\)
\([[:nonascii:]]\{3\}\)

and replacement is this:

\1\2

〔☛ Emacs: Text Pattern Matching (regex) tutorial〕

I looked at Python. Its “re” module doesn't have a named char class for Unicode. Though, am sure one can use a char range. Perl does support this kind of char class, with the same POSIX-style bracket syntax as emacs (though “[[:nonascii:]]” itself is an emacs extension, not part of POSIX). Am not sure if it's added in recent years or existed around 2000.
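Here is a rough Python sketch of the char range approach. It is an illustration under assumptions, not what i actually ran: the name join_nonascii_lines is made up, and “any char above code point 0x7F” is only a stand-in for emacs's [[:nonascii:]]. It also uses lookbehind and lookahead instead of the \1\2 groups, so that short neighboring lines can all be joined in one pass.

```python
import re

# Any character outside the ASCII range (code point above 0x7F).
# A rough stand-in for emacs's [[:nonascii:]], not an exact equivalent.
NON_ASCII = r"[^\x00-\x7F]"

# A newline with 3 non-ASCII chars on each side. Lookarounds do not
# consume the neighboring chars, so back-to-back short lines still match.
NON_ASCII_LINE_BREAK = re.compile(
    rf"(?<={NON_ASCII}{{3}})\n(?={NON_ASCII}{{3}})"
)


def join_nonascii_lines(text):
    """Remove newlines that sit between two runs of non-ASCII characters."""
    return NON_ASCII_LINE_BREAK.sub("", text)
```

Like the emacs regex, this does not check that the text is inside a “p” tag. It only looks at the chars around each newline.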

Syntax, Whitespace, Programing Languages

Note that this is important, because otherwise the browser will render a gap wherever there is a newline, and for Chinese text, that's very annoying. There is no way to use CSS to remove such a gap.

And here we have an interesting issue. Namely, the semantics of whitespace among languages.

in HTML, it's quite complex. A newline char can be equivalent to a space, multiple spaces can be equivalent to a single space, and a newline char at the beginning or end of a tag has special meaning. See: Programing Language Design: Syntax Sugar Problem: Irregularity vs Convenience. Only some of this semantic complexity can be adjusted by CSS, but there's no systematic rule to know which can and which cannot, see: CSS Text Wrapping Tutorial.

In other languages such as Perl, Python, Ruby, it's also very complex. See: What Does it Means When a Programing Language Claims “Whitespace is Insignificant”?.

Should Programer Keep Lines to 70 Chars?

another interesting aspect this touches on is the habit of cutting lines at every 70 chars or so.

as you can see, for Chinese language, doing that is unacceptable. Gaps in Chinese can affect semantics. A gap at the wrong place changes the meaning. That's because ancient Chinese started out without any punctuation, then later on gaps were used as punctuation. (Today, gaps can still be used for punctuation, especially in verse or titles.)

note that emacs usually has problems when lines are very long. This is due to the unix habit of cutting lines being deeply ingrained in its implementation.

Also, it's interesting that emacs's fill-paragraph command behaves correctly on this. That is, when the neighboring chars around a newline are Chinese characters, emacs simply removes the newline, instead of replacing it with a space.

So, this means another approach to solve this problem is to programmatically call “unfill-paragraph”. 〔☛ Emacs unfill-paragraph, unfill-region, compact-uncompact-block〕
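To make the joining rule concrete, here is a small Python sketch of it. This is not emacs's actual implementation. The function name unfill is made up for illustration, and treating “non-ASCII” as “Chinese” is a simplifying assumption. It handles one paragraph at a time.

```python
import re


def unfill(paragraph):
    """Join the lines of one paragraph into a single line.

    A line break between two non-ASCII characters is dropped; any other
    line break becomes one space. "Non-ASCII" standing in for "Chinese"
    is an assumption of this sketch.
    """
    def replace_break(match):
        start, end = match.span()
        if start == 0 or end == len(paragraph):
            return ""  # leading or trailing break: nothing to join
        before, after = paragraph[start - 1], paragraph[end]
        both_non_ascii = ord(before) > 127 and ord(after) > 127
        return "" if both_non_ascii else " "

    # Match the newline(s) together with any whitespace around them.
    return re.sub(r"\s*\n\s*", replace_break, paragraph)
```

So a break between two Chinese chars vanishes, while a break next to an English word turns into a single space.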

HTML Meta Language Tag Obsolete

the meta language tag:

<meta http-equiv="content-language" content="zh">

is now obsolete. See: http://www.w3.org/TR/2011/WD-html-markup-20110113/meta.http-equiv.content-language.html

instead, use the “lang” attribute, like this:

<body lang="zh">…</body>

2013-07-03

Douglas C Engelbart Died

Douglas C Engelbart died yesterday, on July 2nd. 1925 〜 2013. 88 years old.

he invented the mouse 〔☛ Computer Mouse, Trackball, Input Devices〕, and is a pioneer of computer networks.

I figured since i write so much about keyboard and mouse, i should mention it.

See this interesting post to learn more about him. 〔Engelbart's Violin By Stanislav. @ www.loper-os.org…〕

Paris Hilton, on Sexual Competition Among Primates

ran into a pop love song today, sung by a female singer. One of its salient lines goes like this:

I can do what she can do so much better

and, what a wonder, it's a song by Paris Hilton.

Unicode Navigational Pointers

👆 👇 👈 👉

🔙 🔚 🔛 🔜 🔝

These are new in Unicode 6 i think. If it doesn't display, get the font, see: Unicode 6 Emoticons and Supporting Fonts. For many other arrows, see: Unicode: Arrow Symbols ← → ↑ ↓

2013-07-01

Emacs Lisp, Coding Style, Language Idioms, Controversy

now and then, i get criticisms on my emacs lisp tutorial. They point out that i'm not following convention. Most of the time, they are just pointing out that my trailing parentheses are not tucked into a single line (aka hanging parentheses). Another common complaint is that my code is not idiomatic.

I appreciate criticisms, and welcome negative ones. Though, coding style and idiomatic programing are two aspects of programing i've put a lot of thought into, more so than most programers.

In summary:

  • Coding style conventions are mostly harmful. The language's syntax and semantics should naturally enforce the style.
  • Idiomatic programing is harmful. It should be banned, except the type of idiom that has a non-trivial effect on algorithm or computational complexity.

here are some excerpts from a discussion on a Google Plus thread:

+Nick Alcock +Elias Mårtenson i think the concept of code formatting style is harmful, for a lang with regular syntax, such as lisp, xml, Mathematica. In the latter two, one doesn't hear much about code formatting style, yet we see it heavily in lisp. That impedes awareness and real progress of automatic formatters.

Programing: Lisp: Automatic Code Formatting, Lint, Auto Indentation

idioms are not that good to have in a lang. Similar in natural languages. They are basically quirks, irregularities, ad hoc, and hard to learn.

i think there's one type of idiom that's very good, that is, the type that affects algorithmic complexity (due to the lang's quirks and implementation detail). For example, python strings are not mutable, so when one needs to append to a string in a loop, one uses a list instead, then converts it to a string.
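A minimal Python sketch of that idiom. The function name build_string and the upper() step are made up for illustration. Because strings are immutable, repeated “s = s + part” in a loop can end up copying the whole string each time, while collecting the pieces in a list and joining once avoids that.

```python
def build_string(parts):
    """Collect pieces in a list, then join them once at the end."""
    chunks = []
    for part in parts:
        chunks.append(part.upper())  # any per-item work goes here
    return "".join(chunks)
```

For example, build_string(["a", "b", "c"]) gives "ABC".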

Programing: Why Idiomatic Programing Are Bad

see also the article by Zed linked at bottom. He goes over Ruby.

for my tutorial, sometimes am sloppy, and the code is a learning process as well, and sometimes there are pedagogical constraints. I do appreciate comments, as well as criticisms. Thanks.

+Elias Mårtenson yes, what you said is true. I agree. I've been there.

though, we can look deeper in this.

Idioms, just as in natural languages, are unavoidable. They develop naturally too. However, too often, we see the elitist type of programers, who insist on idioms. It's more about cult formation. This point Zed covered well. Often, these elitist programers do not see the nature of idioms, and keep on spreading something that's more coolness and fashion than real substance. Perl community is the best example of this. As such, it's like fashion, it comes and goes.

what i wanted to point out in that essay, is an illustration that idioms are actually just quirks. Lang X speakers may not notice it, and think everyone must use them, but all other speakers, who are outsiders of X, simply see the lang as full of quirks and special cases, something that is not appetizing.

Again, idioms are not avoidable, but the zest for idioms, as typically seen in programing communities, is over-zealous.

as i mentioned, there is one type of idiom i can think of whose use we should encourage. Namely, the type that makes a difference in algorithmic complexity. And the python immutable string case i posted before is a good example.

meanwhile, what i wanted to impart, is that as programers or language designers, idioms should be actively avoided.

on elisp, i disagree somewhat with your stance, or in general with the established norm. Note that most conventions in programing are simply habit, historical baggage. This includes formatting lines no longer than 80 chars.

the lisp convention of parenthesis placement is also a problem. Because such convention lessens awareness of automated formatting. Thus, we have thousands of style guides and arguments about it, while an automated system doesn't come to mind. This is especially ironic for a lang with regular syntax such as lisp. Meanwhile, in the XML and Mathematica communities, they almost never talk about code formatting, because it's automatic and transparent to the user. This should happen to lisp. I usually do tuck in my trailing parentheses, but i just don't go the extra mile to specifically care for this, because i consider it harmful to spread this lisp lore of a certain formatting style.

why should we tell each other about formatting, and spend thousands of hours doing it manually, perhaps with the help of paredit, when we could have it done automatically and transparently?

some of these points i've been arguing with lispers in comp.lang.lisp for a decade. I have also written tens of articles on my site. It's an issue that, i'm afraid, cannot come to agreement, unless someday science-based study on this becomes common. So, i'm fully aware of the criticisms you and many lispers have of my elisp code. Some are due to my code quality, but many things are that way intentionally. We should throw out old precepts when they lack scientific validity. Newer generations usually adopt new styles, without even thinking about it. (Clojure comes to mind. Many Common Lispers just don't like Clojure or anything that's not Scheme or Common Lisp. It's more about habit than anything.)

in summary, idioms are necessary, but by themselves are not a good thing. Good idioms are those that have a computational effect on the code's performance.

code formatting should be entirely eliminated. And this affects the direction of language design too. A language's syntax, its chars, should have semantic markup built in. This is so with Lisp and Mathematica and XML (in Lisp and Mathematica, the bracket characters carry semantic info about code units. In XML, it's a specific syntactical pattern). For these languages, editors should be able to do the formatting in real time, automatically, and transparently to the user.

2013-06-30

lisp cult

i damn hate the lispers, in particular, old Common Lispers. Sputtering scientifically baseless shit.

programing lang communities need more science. Not lore, convention, cult dogma.

it is incredible, that lisp's monotonic syntax is a problem seen by all non-lispers and expert lispers, but the lisp cult insists there's no problem.

then, lispers sing the praises of macros, but none has ever seen Mathematica's pattern matching.

lispers insist on the regular nested paren, data is code shit, yet when you point out it's not regular, they say it's ok.

meanwhile, lispers don't seem to see that XML and Mathematica both have more regular nested syntax, and code is data there too.

it's hopeless, because year after year, the cult cultivates untruth. Thoughts are ingrained, like religion.

that concludes my Sunday spam