Unicode Support in Ruby, Perl, Python, Emacs Lisp

Perm url with updates: http://xahlee.org/comp/unicode_support_ruby_python_elisp.html

Unicode Support in Ruby, Perl, Python, javascript, Java, Emacs Lisp, Mathematica

Xah Lee, 2010-10-07, 2011-02-06

This page examines several languages' support for Unicode, in particular whether variable or function names can contain Unicode characters.

Ruby

I looked at Ruby 2 years ago. (See: Why Not Ruby?) One problem i found is that it does not support Unicode well. I checked again today, and it still doesn't. Just do a web search of blogs and forums for “ruby unicode”.

Perl

Perl's exceedingly lousy unicode support hack is well known. In fact it is the primary reason i switched to python for my scripting needs in 2005.

However, perl might have improved over the years. It can, in fact, use unicode in var or function names. For code examples, see: Unicode in Perl and Python.

Python

Python 2.x's Unicode support is also not ideal. You have to declare your source code's encoding with a header like #-*- coding: utf-8 -*-, and you have to mark each string as Unicode with “u”, e.g. u"α β λ". In regex, you have to ask for Unicode with a flag, e.g. re.search(r'\.html$',child,re.U). When processing files, you have to read them in with unicode(inF.read(),'utf-8'), and to write Unicode out you have to do outF.write(outtext.encode('utf-8')).

If you are processing lots of files and one of them contains a bad char or doesn't use the encoding you expected, your Python script chokes dead in the middle, and you don't even know which file or which line it was unless your code prints file names. If you are processing a few thousand files in a dir with all its sub-dirs, good luck finding out which files have already been processed.
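
One way to soften the "chokes dead in the middle" problem is to catch the decode error per file and keep a list of the failures. The following is only a sketch, written for Python 3 rather than the Python 2.x discussed above. The function name read_utf8_files is made up for illustration, and it assumes every file is supposed to be UTF-8.

```python
import os


def read_utf8_files(root):
    """Walk root, decode every file as UTF-8, and record failures instead of stopping.

    Hypothetical helper for illustration; assumes all files should be UTF-8.
    """
    decoded = {}  # path -> decoded text
    failed = []   # (path, error message) for files that did not decode
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    decoded[path] = f.read()
            except UnicodeDecodeError as err:
                # Remember which file was bad and carry on with the rest.
                failed.append((path, str(err)))
    return decoded, failed
```

After a run, the failed list says exactly which files need a second look, and the keys of decoded say which ones have already been read.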

Python 2.6.x does not support Unicode characters in variable or function names.

Python 3 supposedly fixed the Unicode problem, but i haven't used it. I do not know if Python 3 supports Unicode in var names.
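
For what it's worth, Python 3 does accept non-ASCII letters in identifiers (this was specified in PEP 3131). Here is a small sketch of what that looks like. The names α and insert_β are made up for illustration, and the function returns a string rather than inserting into a buffer.

```python
# Python 3 only: non-ASCII letters are allowed in identifiers (PEP 3131).
α = "β"


def insert_β(text):
    """Return text with the value of α appended (illustrative name only)."""
    return text + α


result = insert_β("x")
```

A hyphen is not legal in a Python name, which is why the sketch uses an underscore in insert_β.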

Last time i looked into whether i should adopt Python 3, it apparently wasn't used much. (See: Python 3 Adoption) (And i'm quite pissed that Python is going more and more into OOP mumbo jumbo with lots of ad hoc syntax, e.g. “views”, “iterators”, “list comprehension”. (See: What's List Comprehension and Why is it Harmful?))

Not part of the Python lang itself, but a related problem: if the output shell doesn't support Unicode, or its encoding doesn't match the one your Python print uses, you get gibberish. It is often a headache to figure out the locale settings, what encoding the terminal supports or is configured to handle, the encoding of your file, and which encoding “print” is using. It gets more complex if you are going through a network, such as ssh. (Most shells and terminals, as of 2010-10, in practice still have problems dealing with Unicode, e.g. Windows Console, PuTTY. The exception is Mac's Apple Terminal.)
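
When chasing this kind of gibberish, it can help to ask Python what encodings it thinks it is dealing with. This is a rough sketch in Python 3. The function names describe_output_encoding and printable are made up for illustration, and what Python reports may still differ from what the terminal actually renders.

```python
import locale
import sys


def describe_output_encoding():
    """Return the encodings Python reports for stdout and for the locale."""
    return {
        # May be None when stdout is not a real terminal or file.
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


def printable(text, encoding):
    """Report whether text can be encoded in the given encoding without error."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False
```

For example, printable("α β λ", "ascii") is False while printable("α β λ", "utf-8") is True, so checking a string against the "stdout" value before printing at least shows whether Python itself can encode it.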

JavaScript

JavaScript supports Unicode in variable and function names, though in practice it depends on the browser. As of today (2011-02-06), all the browsers i tested support it.

My test was done with the following browser versions: IE8, Firefox 3.6.13, Chrome 8.0.552.237, Safari 5.0.3, Opera 11.00. All on Windows Vista.

Here's a page where you can test it yourself: JavaScript Unicode support test.

Emacs Lisp

I have to say, as far as text processing goes, the most beautiful lang with respect to Unicode is Emacs Lisp. In elisp code (e.g. Generate a Web Links Report with Emacs Lisp), i don't have to declare any of the Unicode or encoding stuff. I simply write code to process strings or files, without even having to know what encoding they are in. Emacs the environment takes care of all that.

Emacs Lisp also supports Unicode in var/function names. For example:

(defun insert-β ()
  "Insert the character β at point."
  (let ((α "β"))
    (insert α)))

Mathematica

Mathematica supports Unicode extensively, in variable names, function names, or as operators. (But, technically, it does not even use Unicode.)

Java

Java supports unicode fully, including use in var/class names. See: Java Tutorial: Unicode in Java
