Emacs Lisp: Command to Replace HTML Entities with Unicode Characters

Perm url with updates: http://xahlee.org/emacs/elisp_replace_html_entities_command.html

Xah Lee, 2011-09-27

This page shows how to write an elisp command that replaces HTML entities such as &eacute; with the corresponding Unicode character é.

The Problem

I have many HTML files from existing sources that contain many HTML entities. I want a command that automatically changes them to Unicode characters. Example:

  • &lsquo; → ‘
  • &rsquo; → ’
  • &ldquo; → “
  • &rdquo; → ”
  • &eacute; → é

(For more about HTML entities, see: Character Sets and Encoding in HTML and HTML/XML Entities List.)

The command should work on the current paragraph, or text selection.

Solution

This is easy to write. One of the basic elisp idioms is find and replace on a region, like this:

(defun replace-html-chars-region (start end)
  "Replace some html entities in region …."
  (interactive "r")
  ;; Entity names are case-sensitive (&eacute; is é, &Eacute; is É),
  ;; so turn off case folding for the searches.
  (let ((case-fold-search nil))
    (save-restriction
      (narrow-to-region start end)

      (goto-char (point-min))
      (while (search-forward "&lsquo;" nil t) (replace-match "‘" nil t))

      (goto-char (point-min))
      (while (search-forward "&rsquo;" nil t) (replace-match "’" nil t))

      (goto-char (point-min))
      (while (search-forward "&ldquo;" nil t) (replace-match "“" nil t))

      (goto-char (point-min))
      (while (search-forward "&rdquo;" nil t) (replace-match "”" nil t))

      (goto-char (point-min))
      (while (search-forward "&eacute;" nil t) (replace-match "é" nil t))
      ;; more here
      )
    )
  )

The (interactive "r") tells emacs that this is a command that can be called by “execute-extended-command” 【M-x】, and the "r" means emacs will pass the beginning and ending positions of the text selection to your function's parameters.

There are several problems with the above simple code.

① The code requires you to make a text selection first. It'd be better if it automatically worked on the text selection if there is one, and on the current paragraph otherwise.
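One way to get that behavior is sketched below. The names my-region-or-paragraph-bounds and my-replace-html-chars are hypothetical and are not taken from the final code; the sketch assumes that replace-html-chars-region from above is already loaded, and that the paragraph boundaries reported by the thingatpt library are acceptable.

```elisp
(require 'thingatpt)

;; Hypothetical helper, not part of the original code.
(defun my-region-or-paragraph-bounds ()
  "Return (START . END) of the active region, else of the current paragraph."
  (if (use-region-p)
      (cons (region-beginning) (region-end))
    ;; Fall back to an empty range if no paragraph is found at point.
    (or (bounds-of-thing-at-point 'paragraph)
        (cons (point) (point)))))

;; Hypothetical wrapper; assumes `replace-html-chars-region' is defined.
(defun my-replace-html-chars ()
  "Replace HTML entities in the text selection or the current paragraph."
  (interactive)
  (let ((bounds (my-region-or-paragraph-bounds)))
    (replace-html-chars-region (car bounds) (cdr bounds))))
```

The wrapper uses a plain (interactive) instead of (interactive "r"), so calling it without a selection does not signal an error about the mark.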

② The elisp code above is too verbose. It'd be much better if we could write it like this:

(defun replace-html-named-entities ()
  "…"
  …
  (replace-pairs-in-string inputstr
    [
     ["&lsquo;" "‘"]
     ["&rsquo;" "’"]
     ["&ldquo;" "“"]
     ["&rdquo;" "”"]
     ["&eacute;" "é"]
     more here …
     ]
  ))

③ Replacing multiple pairs of strings one by one can produce incorrect results.

Tricky Issue with Sequential Replacement of Multi-Pairs

Suppose you are working on an HTML tutorial, and the source of that document contains the text &amp;copy;. The intended display is &copy;. However, if you replace each entity sequentially, the &amp; part becomes &, which leaves &copy;, and a later replacement then turns that into just ©.
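The effect is easy to reproduce on a string. The following is a minimal sketch; my-replace-pairs-sequentially is a hypothetical helper written for this illustration and is not the replace-pairs-in-string function mentioned above.

```elisp
;; Hypothetical helper for illustration; not from the original code.
(defun my-replace-pairs-sequentially (str pairs)
  "Replace each (FROM . TO) in PAIRS within STR, one pair after another."
  (let ((case-fold-search nil))         ; entity names are case-sensitive
    (dolist (pair pairs str)
      (setq str (replace-regexp-in-string
                 (regexp-quote (car pair)) (cdr pair) str t t)))))

(my-replace-pairs-sequentially
 "&amp;copy;"
 '(("&amp;" . "&") ("&copy;" . "©")))
;; returns "©", although the intended result is "&copy;"
```

The first pass produces "&copy;", and the second pass cannot tell that this text came from a replacement and was not in the original.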

When you have many replacement pairs, doing them one by one, each time starting from the top of the document, may introduce unexpected changes, because the output of one replacement can be matched by a later one. A solution is to first replace each find string with a unique intermediate value, then replace those intermediate values with the final values.
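The intermediate-value idea can be sketched on a string as follows. This is not the code from xah_elisp_util.el; my-replace-pairs-two-phase is a hypothetical name, and the sketch assumes the input never contains characters from the Unicode private use area starting at U+E000, which are used here as placeholders.

```elisp
;; Hypothetical helper for illustration; not the author's implementation.
(defun my-replace-pairs-two-phase (str pairs)
  "Replace each (FROM . TO) in PAIRS within STR via unique placeholders.
Assumes STR has no private use characters from U+E000 upward."
  (let ((case-fold-search nil)          ; entity names are case-sensitive
        (i 0)
        placeholders)
    ;; Phase 1: every FROM string becomes its own placeholder character.
    (dolist (pair pairs)
      (let ((mark (string (+ #xE000 i))))
        (setq str (replace-regexp-in-string
                   (regexp-quote (car pair)) mark str t t))
        (push (cons mark (cdr pair)) placeholders)
        (setq i (1+ i))))
    ;; Phase 2: every placeholder becomes its final TO string.
    (dolist (entry placeholders str)
      (setq str (replace-regexp-in-string
                 (regexp-quote (car entry)) (cdr entry) str t t)))))

(my-replace-pairs-two-phase
 "&amp;copy; and &copy;"
 '(("&amp;" . "&") ("&copy;" . "©")))
;; returns "&copy; and ©"
```

Because no final value is written until every find string has been replaced by a placeholder, text produced by one pair is never matched by another.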

For the final code of “replace-html-named-entities” that fixes these problems, get it at xah_elisp_util.el.

You'll need to install 2 elisp libraries:
