Unicode Support in Ruby, Perl, Python, Emacs Lisp
Perm url with updates: http://xahlee.org/comp/unicode_support_ruby_python_elisp.html
Xah Lee, 2010-10-07, 2011-02-06
This page examines several languages' support for unicode. In particular, whether variable or function names can have unicode characters in them.
I looked at Ruby 2 years ago. (See: Why Not Ruby? ) One problem i found is that it does not support Unicode well. I just checked today, and it still doesn't. Just do a web search of blogs and forums for “ruby unicode”. e.g.:
- Feature #2034 [ruby-core:25306] @ Source redmine.ruby-lang.org
- Ruby, Unicode, and HTML Entities Problem (2010-09-26) By Mr Peeperson. @ Source www.pubbs.net
- Ruby 1.9 doesn't support Unicode normalization yet @ Source stackoverflow.com
Perl's exceedingly lousy unicode support hack is well known. In fact it is the primary reason i switched to python for my scripting needs in 2005.
However, perl might have improved over the years. It can, in fact, use unicode in var or function names. For code examples, see: Unicode in Perl and Python.
Python 2.x's unicode support is also not ideal. You have to declare your source code encoding with a header like #-*- coding: utf-8 -*-, and you have to declare your string as unicode with “u”, e.g. u"α β λ". In regex, you have to declare unicode with a flag, e.g. re.U. And when processing files, you have to read in with unicode(inF.read(),'utf-8'), and when printing out unicode you have to encode it explicitly. If you are processing lots of files, and one of the files contains a bad char or doesn't use the encoding you expected, your python script chokes dead in the middle, and you don't even know which file it is or which line unless your code prints file names. If you are processing a few thousand files in a dir with all sub-dirs, good luck finding out which files have already been processed.
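The failure mode described above (one bad file killing a batch run, with no hint of which file it was) can be guarded against by catching the decode error per file. Here is a minimal sketch, written for Python 3 rather than 2.x. The function name read_all_utf8 and the shape of its return value are made up for illustration, and it assumes every file is supposed to be UTF-8:

```python
import os


def read_all_utf8(root_dir):
    """Walk root_dir and try to decode every file as UTF-8.

    Returns (texts, failures):
      texts    -- dict mapping file path to decoded text
      failures -- list of (path, error message) for files that did not decode
    """
    texts = {}
    failures = []
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for name in file_names:
            path = os.path.join(dir_path, name)
            try:
                with open(path, encoding="utf-8") as f:
                    texts[path] = f.read()
            except UnicodeDecodeError as err:
                # Record the offending file and keep going,
                # instead of dying in the middle of the run.
                failures.append((path, str(err)))
    return texts, failures
```

With this shape, the script finishes the whole directory tree, and the failures list tells you afterwards exactly which files need a look.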
Python 2.6.x does not support unicode char for var or function names.
Python 3 supposedly fixed the unicode problem, but i haven't used it. I do not know if Python 3 support unicode in var names.
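That question has a definite answer: Python 3 accepts non-ASCII letters in identifiers (this was specified in PEP 3131). A small check, assuming a Python 3 interpreter and a source file saved as UTF-8. The names insert_β and λ_list are arbitrary, picked only for the demonstration:

```python
# Python 3 only: non-ASCII letters are allowed in identifiers (PEP 3131).
# Under Python 2.x this file would fail with a SyntaxError.

def insert_β():
    """Return the string "β", held in a variable named α."""
    α = "β"
    return α


λ_list = [insert_β() for _ in range(3)]
print("".join(λ_list))
```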
Last time i looked into whether i should adopt python 3, it apparently wasn't used much. (See: Python 3 Adoption) (and i'm quite pissed that Python is going more and more into OOP mumbo jumbo with lots of ad hoc syntax (e.g. “views”, “iterators”, “list comprehension”. (See: What's List Comprehension and Why is it Harmful?)))
Not related to the Python lang itself, but a related problem: if the output shell doesn't support unicode, or its encoding doesn't match the one specified in your python print, you get gibberish. It is often a headache to figure out the locale settings, what encoding the terminal supports or is configured to handle, the encoding of your file, and which encoding “print” is using. It gets more complex if you are going thru a network, such as ssh. (most shells, terminals, as of 2010-10, in practice, still have problems dealing with unicode. (e.g. Windows Console ◇ PuTTY. Exception being Mac's Apple Terminal.))
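When chasing this kind of gibberish, it helps to first ask Python which encodings it thinks are in play. A rough sketch for Python 3, using only the standard sys and locale modules. The function name encoding_report and the dictionary keys are my own labels, not any standard API:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python will use for I/O in this process."""
    return {
        # Encoding used when print() writes to the terminal or a pipe.
        # May be None if stdout has been replaced by an unusual object.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # What the locale settings (LANG, LC_ALL, ...) ask for.
        "locale_preferred": locale.getpreferredencoding(False),
        # Default used by str.encode() and bytes.decode() with no argument.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

Running it once locally and once over ssh, then comparing the two reports, is a quick way to see where the mismatch comes from. It does not tell you what the terminal program itself can actually display, only what Python has been told.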
My test is done with the following browser versions: IE8, Firefox 3.6.13, Chrome 8.0.552.237, Safari 5.0.3 , Opera 11.00. All on Windows Vista.
I'll have to say, as far as text processing goes, the most beautiful lang with respect to unicode is emacs lisp. In elisp code (e.g. Generate a Web Links Report with Emacs Lisp ), i don't have to declare any of the unicode or encoding stuff. I simply write code to process strings or files, without even having to know what encoding they use. Emacs the environment takes care of all that.
Emacs Lisp also supports unicode in var/function names. For example:
(defun insert-β ()
  "alpha!"
  (interactive)
  (let ((α "β"))
    (insert α)))
Mathematica supports unicode extensively, in variable names, function names, or as operators. (but, technically, it does not even use unicode.)
- For details of how Mathematica deals with unicode, see: How Mathematica does Unicode?.
- For example notebook with unicode, see: Math Typesetting, Mathematica, MathML
Java supports unicode fully, including use in var/class names. See: Java Tutorial: Unicode in Java