Programing Style: Variable Naming: English Words Considered Harmful

Perm url with updates: http://xahlee.org/comp/programing_variable_naming.html

Xah Lee, 2011-03-11

This page is some thoughts on variable naming in writing computer programs.

In emacs lisp, i usually use camelCase for my local variables. Here's an example:

(defun read-lines (filePath) 
  "Return a list of lines of a file at FILEPATH." 
  (with-temp-buffer 
    (insert-file-contents filePath) 
    (split-string (buffer-string) "\n" t)))

Some lisp coders question the use of camelCase, because it is not standard elisp style. Here's the reason why i'm using camelCase, and some thoughts about naming of variables.

Distinction from Language Keywords

CamelCase provides an easy way to distinguish variables from built-in symbols, particularly because emacs-lisp-mode's coloring scheme is incomplete. (See: Emacs Lisp Mode Syntax Coloring Problem.)

In particular, all my local variables are in camelCase. I could have used under_score, but that's more typing and less visually distinguishable from lisp's hyphen-word-style.

Uniqueness & Referential Transparency

For variables, in recent years i developed a habit of avoiding variable names that are also standard English words. So, i'd name “file” as “myFile” or “aFile”. “files” might become “fileList” or “filess”. “string” would be “str”, “myString”, “inputStr”, etc. An ultimate solution for uniqueness is to append or prepend a random string to var names. So, “string” would be “str-5w77o” or something like that. But the problem with this is that it's too long and disruptive in reading and typing. Recently i've been toying with the idea of attaching a unicode character to all vars. e.g. all my vars would start with “ξ”. So “string” would be “ξstring”. (works fine in elisp btw. See: Unicode Support in Ruby, Perl, Python, javascript, Java, Emacs Lisp, Mathematica.) This solves the readability issue of random strings.
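As a minimal sketch of the “ξ” prefix idea, here's the read-lines function from the start of this page with its parameter renamed. The names “my-read-lines” and “ξfilePath” are made up for illustration, and the coding cookie on the first line assumes the file is saved as UTF-8.

```elisp
;; -*- coding: utf-8 -*-

(defun my-read-lines (ξfilePath)
  "Return a list of the non-empty lines of the file at ξfilePath."
  (with-temp-buffer
    (insert-file-contents ξfilePath)
    ;; the third argument t drops empty strings, so blank lines are skipped
    (split-string (buffer-string) "\n" t)))
```

A search for “ξfilePath” in this code finds only the variable, never a built-in such as insert-file-contents.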

My reason to avoid English words is for easy source code transformation, out of practical reasons. (i do a lot of search or find & replace in my source code.)

Imagine if every local variable (or every symbol) is unique in the source code. This way, you could locate any variable in the whole project with a plain text search. It makes some aspects of debugging, code tracing, and code management easier. It also makes refactoring easier.
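To make the search argument concrete, here's a rough sketch that counts literal matches of a name in a piece of source text. Everything in it (“my-count-matches”, “my-plain-source”, “my-unique-source”, and the “show-file” function inside the sample strings) is a hypothetical name invented for this illustration.

```elisp
;; -*- coding: utf-8 -*-

(defun my-count-matches (aName sourceText)
  "Return how many times the literal string aName occurs in sourceText.
aName must be non-empty."
  (with-temp-buffer
    (insert sourceText)
    (goto-char (point-min))
    (let ((case-fold-search nil) ; match case exactly
          (nn 0))
      (while (search-forward aName nil t)
        (setq nn (1+ nn)))
      nn)))

;; the same function twice: once with an English-word parameter,
;; once with a ξ-prefixed parameter
(defvar my-plain-source
  "(defun show-file (file)\n  \"Insert contents of a file.\"\n  (insert-file-contents file))")

(defvar my-unique-source
  "(defun show-file (ξfile)\n  \"Insert contents of a file.\"\n  (insert-file-contents ξfile))")

(my-count-matches "file" my-plain-source)    ; ⇒ 5, only 2 of them are the variable
(my-count-matches "ξfile" my-unique-source)  ; ⇒ 2, both are the variable
```

With the plain name, a find & replace would also hit the function name, the docstring, and the built-in insert-file-contents. With the unique name, every match is the variable.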

The idea is similar to the idea of Referential transparency (computer science). (Referential transparency can be thought of as a notion of find & replace of function & values.)

The desire to have unique identifiers in source code comes in many guises. At the extreme is a desire to eliminate variables completely. For example: if every variable in source code can be unique somehow, then much of the desire for lexical scope over dynamic scope is gone. Some namespace problems are also solved. (in particular, elisp does not support namespaces.) Combinatory logic comes from a desire to get rid of variables from lambda calculus. “Point-free programing” is an invention of syntax for defining functions without the need to write out their parameters. (See: What's Point-free Programing? (point-free function syntax)) Unique variable names are also the impetus for hygienic macros.
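Elisp has no dedicated point-free syntax, but the built-in apply-partially gives a rough approximation of the idea. In this sketch the names “my-add-ten” and “my-add-ten-pf” are hypothetical, chosen only for illustration.

```elisp
;; with the parameter written out
(defun my-add-ten (nn)
  "Return NN plus 10."
  (+ 10 nn))

;; point-free: no parameter is named anywhere in the definition
(defalias 'my-add-ten-pf (apply-partially #'+ 10)
  "Return the argument plus 10.")

(my-add-ten 5)     ; ⇒ 15
(my-add-ten-pf 5)  ; ⇒ 15
```

The second definition has no variable to name, so the question of what to call it never comes up.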

Variable Name: English Prose vs Mathematical Abstraction

Another reason that somewhat pushed me into this naming experiment is that, instead of naming your vars with meaningful english words, you can do the opposite and name them completely abstractly, as in math's x, y, z, α, β, γ.

So, i'd name “counter” or “num” as just “i” or “n”. (Since these are 1-letter strings and too common, with the unique naming idea above i usually name them “ii” or “nn”, or maybe “ξi”.)

The idea with abstract naming is that it forces you to understand the code as a math expression that specifies an algorithm, instead of as english prose. Readability of source code is helped by coding in a pure functional programing style (e.g. functions, input, output), and by good documentation of each function. So, to understand a function, you should just read the doc about its input and output. The code inside the function is then understood through simple functional-style programing constructs.

To illustrate from the opposite view, the problem with english naming is that it often interferes with what the code is actually doing. For example, in normal convention you'll often see names like “thisObject”, “thatTree”, “fileList”, or “files”. Your focus is on the meaning of these words, not on what the data types actually are or the function's actual mathematical behavior. The words can be deceptive. e.g. “file” can be a file handle, a file path, or file content. This is especially a problem when you are reading source code of a lang you do not know. e.g. when you encounter the word “object”, you don't know if that's a keyword in the language, a keyword for a pattern, a keyword for a datatype, or just a user defined variable name that can be arbitrary. When you read normal source code, half of the words are like that, unless the editor does syntax coloring that distinguishes the lang's keywords.

For example, here's some elisp code with naming that follows elisp convention:

(defun dosomething-region (begin end)
  "Prints region beginning and ending positions." 
  (interactive "r")
  (let () 
    (message "Region begins: %d, end at: %d" begin end)
  )
)

Unless you are familiar with elisp, you wouldn't know what those “begin” and “end” are. Maybe they are built-in keywords and have significance to the construct, and if you change them, the code wouldn't work.

But if the code is like this:

(defun dosomething-region (φ1 φ2)
  "Prints region beginning and ending positions." 
  (interactive "r")
  (let () 
    (message "Region begins: %d, end at: %d" φ1 φ2)
  )
)

Then you know that φ1 and φ2 are probably just arbitrary names.

To view this idea in another way: when you read math, you rarely see mathematicians name their variables with multi-letter descriptive words. They usually use a single symbol (a, b, c, x, y, z, α, β, γ ...), yet there's no problem understanding the expression. Your focus and understanding is on the abstract process and structure.

2 Types of Letter Sequence in Source Code

When reading source code, you see symbols (operators) and identifiers (function names, var names, built-in lang names). The identifiers can be divided into 2 types: ① Those that cannot be changed without affecting the meaning of the program. ② Those that are arbitrary and can be changed.

The ones in the first category are language keywords and built-in names. e.g. “for”, “while”, “class”, “function”, “extends”, “this”, “self”, “public”, “static”, “System”, “from”, “begin”, “end”, “map”, “require”, “import”, “let”, “list”, “defun”, “lambda”, “Take”, “Pattern”, “Table”, “Blank”, etc. These are the words in the source code that are critical, and they are almost always English words. Being able to know at a glance which words in the source code are lang keywords greatly helps in understanding, especially when you do not know the language well. This particularly applies to non-mainstream languages, e.g. OCaml, PowerShell, Haskell, Erlang, J, Mathematica, Oz, etc.

The above ideas are just an experiment. Without actually doing it, you never know what's really good or bad.
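In elisp you can ask the running Emacs which of the 2 types a word belongs to. Here's a minimal sketch, where “my-symbol-kind” and “aSymbol” are hypothetical names. It uses the built-in predicates special-form-p, macrop, fboundp and boundp, and it only reports global definitions, so a local variable that happens to be named “list” would still be reported as a function.

```elisp
;; -*- coding: utf-8 -*-

(defun my-symbol-kind (aSymbol)
  "Return a symbol describing how aSymbol is currently defined in this Emacs.
The result is one of: special-form, macro, function, variable, unbound."
  (cond
   ((special-form-p aSymbol) 'special-form)
   ((macrop aSymbol) 'macro)
   ((fboundp aSymbol) 'function)
   ((boundp aSymbol) 'variable)
   (t 'unbound)))

(my-symbol-kind 'let)    ; ⇒ special-form
(my-symbol-kind 'defun)  ; ⇒ macro
(my-symbol-kind 'list)   ; ⇒ function
(my-symbol-kind 'φ1)     ; ⇒ unbound (assuming nothing has defined φ1)
```

A reader who doesn't know the language has no such tool at hand, which is why names that are visibly arbitrary help.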

This essay was originally a post in comp.lang.lisp (source: groups.google.com).
