Emacs: Zap Gremlins (UNICODE chars ⇒ ASCII)

Perm url with updates: http://xahlee.org/emacs/emacs_zap_gremlins.html

Xah Lee, 2011-03-07

This page shows a little function that changes a unicode string into ASCII. For example, “passé” becomes “passe”, and “voilà” becomes “voila”.

When refactoring my elisp code last week, i split out this little function. It turns unicode chars into roughly equivalent ASCII ones. I needed this because the open source dictionary chokes on words with unicode chars. (See: Emacs Dictionary Lookup and Problems of Open Source Dictionaries.)

I remember that BBEdit, the popular Mac editor i used 10 years ago before emacs, has such a command in its menu, called “zap gremlins”. I'm not aware of one in emacs, though there might be. Anyway, here's the code:

(defun asciify-string (inputstr)
  "Make unicode string into equivalent ASCII ones.
Todo: this command is not exhaustive."
  ;; Each step replaces one family of accented vowels with the plain letter.
  (setq inputstr (replace-regexp-in-string "á\\|à\\|â\\|ä" "a" inputstr))
  (setq inputstr (replace-regexp-in-string "é\\|è\\|ê\\|ë" "e" inputstr))
  (setq inputstr (replace-regexp-in-string "í\\|ì\\|î\\|ï" "i" inputstr))
  (setq inputstr (replace-regexp-in-string "ó\\|ò\\|ô\\|ö" "o" inputstr))
  (setq inputstr (replace-regexp-in-string "ú\\|ù\\|û\\|ü" "u" inputstr))
  ;; Return the converted string.
  inputstr)

You might improve this code, as right now it's puny. It doesn't consider many other unicode chars. Here are the common non-English letters: ÀÁÂÃÄÅÆ Ç ÈÉÊË ÌÍÎÏ ÐÑ ÒÓÔÕÖ ØÙÚÛÜÝÞß àáâãäåæç èéêë ìíîï ðñòóôõö øùúûüýþÿ. You might also consider changing the unicode bullet “•” to “*”, and others such as “→” to “->”, “≥” to “>=”, etc.
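One way to grow the function without piling up more setq lines is to drive it from a table. What follows is only a sketch: the names my-asciify-table and my-asciify-string-by-table are made up for illustration, and mappings such as “æ” to “ae” or “ß” to “ss” are my own guesses at reasonable ASCII stand-ins, not any standard.

```elisp
;; Hypothetical table: each entry is (REGEXP . REPLACEMENT).
;; The choice of replacements is an assumption; adjust to taste.
(defvar my-asciify-table
  '(("[áàâäãå]" . "a") ("[ÁÀÂÄÃÅ]" . "A")
    ("[éèêë]"   . "e") ("[ÉÈÊË]"   . "E")
    ("[íìîï]"   . "i") ("[ÍÌÎÏ]"   . "I")
    ("[óòôöõø]" . "o") ("[ÓÒÔÖÕØ]" . "O")
    ("[úùûü]"   . "u") ("[ÚÙÛÜ]"   . "U")
    ("[ýÿ]" . "y") ("Ý" . "Y")
    ("ñ" . "n") ("Ñ" . "N")
    ("ç" . "c") ("Ç" . "C")
    ("æ" . "ae") ("Æ" . "AE") ("ß" . "ss")
    ("•" . "*") ("→" . "->") ("≥" . ">="))
  "Alist mapping a regexp of unicode chars to an ASCII replacement.")

(defun my-asciify-string-by-table (inputstr)
  "Return INPUTSTR with chars in `my-asciify-table' replaced by ASCII."
  ;; Match case-sensitively so that upper and lower case entries stay apart.
  (let ((case-fold-search nil))
    (dolist (pair my-asciify-table inputstr)
      ;; FIXEDCASE and LITERAL are both t: insert the replacement verbatim.
      (setq inputstr
            (replace-regexp-in-string (car pair) (cdr pair) inputstr t t)))))
```

With this table, (my-asciify-string-by-table "passé • voilà → Ñandú") returns "passe * voila -> Nandu". Adding a char is then a one-line change to the table.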

Depending on how you want the function to go, you can think of its primary purpose as removing all non-ascii chars. So, at the end, you simply delete any non-ascii char that hasn't been transcoded. (Use the emacs regex character class [[:nonascii:]].) Or, you can consider its primary purpose as transcoding. This means leaving untranslated unicode chars as is.
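The “remove everything left over” policy is a single extra step. A minimal sketch, with the function name my-delete-nonascii invented for this example; note that the class has to sit inside a bracket expression, so the regex is written "[[:nonascii:]]":

```elisp
;; Hypothetical helper: run it after transcoding to drop whatever is left.
(defun my-delete-nonascii (inputstr)
  "Return INPUTSTR with every non-ASCII char deleted."
  (replace-regexp-in-string "[[:nonascii:]]" "" inputstr))
```

For example, (my-delete-nonascii "naïve ☺ cafe") returns "nave  cafe", which also shows why you'd want to transcode first: the “ï” is simply gone. For the pure transcoding policy, just skip this step.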

Also, right now it's a function that takes in a string. You might also create a version that works on a region, or better yet, one that works on the text selection if there is one, else on the current word (or line, or paragraph, or buffer, your design call). (For how, see: Emacs Lisp: Using thing-at-point.)
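Here's a rough sketch of such a command, under my own assumptions: the name my-asciify-region-or-word is made up, it falls back to the word at point (one of several possible design calls), and it carries its own small replacement list of just the five vowel families so that it stands alone.

```elisp
(require 'thingatpt)  ; provides `bounds-of-thing-at-point'

;; Hypothetical command, not part of the original code.
(defun my-asciify-region-or-word ()
  "Asciify the active region, or the word at point if there is no region.
Only a few accented lowercase vowels are handled."
  (interactive)
  (let* ((bounds (if (use-region-p)
                     (cons (region-beginning) (region-end))
                   (bounds-of-thing-at-point 'word)))
         (case-fold-search nil))        ; match case-sensitively
    (when bounds
      (save-excursion
        (save-restriction
          ;; Limit all searching and replacing to the chosen text.
          (narrow-to-region (car bounds) (cdr bounds))
          (dolist (pair '(("[áàâä]" . "a") ("[éèêë]" . "e")
                          ("[íìîï]" . "i") ("[óòôö]" . "o")
                          ("[úùûü]" . "u")))
            (goto-char (point-min))
            (while (re-search-forward (car pair) nil t)
              ;; FIXEDCASE and LITERAL both t: insert replacement verbatim.
              (replace-match (cdr pair) t t))))))))
```

Working in the buffer with narrow-to-region avoids copying the text out to a string and back, and save-excursion puts the cursor back where it was.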

Or, perhaps you know that someone has already written this somewhere?

Accumulator vs Parallel Programing

When looking at my code, another thing piqued my interest: notice how the algorithm is sequential in nature? Each step takes the result of the previous step as its input. The paradigm is similar to what's called “accumulator” or “iteration”. Recently, i watched Guy Steele's talk on parallel programing (See: Guy Steele on Parallel Programing.) and learned that the iteration style makes it very difficult for a compiler to automatically generate parallel code.

A better way to write it for parallel programing is to “map” a char-transform function over the string, since each char can then be transformed independently of the others. (In elisp, a string datatype is a type of array datatype. Array and List are both “sequence”. The second argument of “mapcar” can be any “sequence” datatype.) It will probably become slower, but it'll be good when someday emacs lisp becomes Scheme Lisp or something.
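A sketch of the “map” style, with made-up names (my-asciify-char-alist, my-asciify-char, my-asciify-string-by-map) and only a handful of chars in the table. I use mapconcat rather than mapcar here, because it joins the results back into a string in one go and lets a replacement be more than one char; like mapcar, it accepts any sequence.

```elisp
;; Hypothetical lookup table from a single char to its ASCII replacement.
(defvar my-asciify-char-alist
  '((?á . "a") (?à . "a") (?â . "a") (?ä . "a")
    (?é . "e") (?è . "e") (?ê . "e") (?ë . "e")
    (?í . "i") (?ì . "i") (?î . "i") (?ï . "i")
    (?ó . "o") (?ò . "o") (?ô . "o") (?ö . "o")
    (?ú . "u") (?ù . "u") (?û . "u") (?ü . "u"))
  "Alist mapping a unicode char to an ASCII replacement string.")

(defun my-asciify-char (char)
  "Return the ASCII replacement for CHAR, or CHAR itself, as a string."
  (let ((hit (assq char my-asciify-char-alist)))
    (if hit (cdr hit) (char-to-string char))))

(defun my-asciify-string-by-map (inputstr)
  "Return INPUTSTR with each char transformed by `my-asciify-char'."
  ;; Each char is handled on its own; no step depends on an earlier result.
  (mapconcat #'my-asciify-char inputstr ""))
```

Here (my-asciify-string-by-map "voilà passé") returns "voila passe". Emacs still runs this one char after another, of course; the point is only that the code no longer requires that order.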