Transform HTML Tags with Emacs Lisp

Perm url with updates: http://xahlee.org/emacs/elisp_transform_html_tags.html

Xah Lee, 2010-08-29

This page is a tutorial showing a real-world example of using emacs lisp to do many tag transformations.

The Problem

Summary

I need to transform many html tags. Typically, they are of the form 「BeginDelimiter...EndDelimiter」, where the delimiters may be “curly quotes”, or a html tag pair such as 「<span class="xyz">...</span>」.

I need to apply the transformation on over 4 thousand html pages, and need it to be accurate, so it is done mostly on a case-by-case basis with human review.

Also, the delimiters may be nested, so a simple regex cannot work. It either gets too much text (using the default greedy match) or not enough text (using a non-greedy match). With an elisp script, one can use “if” and other emacs functions to correctly find the matching ending tag, as well as automatically skip cases where the transform should not apply, which drastically reduces the cases that need human review.
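Here is a small sketch of why neither regex flavor works on nested tags. The sample string and the helper name my-regex-match are made up for illustration; only string-match and match-string are standard emacs functions.

```elisp
;; Hypothetical sample: a "code" span with another span nested inside,
;; followed by an unrelated span.
(defvar my-sample
  "<span class=\"code\">a <span class=\"x\">b</span> c</span> and <span class=\"y\">d</span>")

(defun my-regex-match (regex str)
  "Return the part of STR matched by REGEX, or nil if there is no match."
  (when (string-match regex str)
    (match-string 0 str)))

;; Greedy: runs to the LAST </span>, swallowing the unrelated span too.
(my-regex-match "<span class=\"code\">.*</span>" my-sample)

;; Non-greedy: stops at the FIRST </span>, which closes the inner span.
(my-regex-match "<span class=\"code\">.*?</span>" my-sample)
```

The text we actually want ends at the second 「</span>」, which neither pattern can pick out without counting nesting levels.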

Detail

In the past week, i spent about 2 days doing a lot of text processing with elisp on the 4 thousand files of my site. Here are the changes i've made:

  • “book title” ⇒ 《book title》
  • “article title” ⇒ 〈article title〉
  • “computer code” ⇒ 「computer code」
  • “file path” ⇒ 〔file path〕
  • “emacs key notation” ⇒ 【emacs key notation】

The purpose of the change is to make the syntactical markup more semantically precise. Before, they were all marked by double curly quotes. Now, if i want to find all books i cited on my site, i can do so easily by simple syntactical parsing (e.g. grep). These changes also make the text easier to read. In the future, if i want all book titles to be colored red for example, i can easily do that by changing the 《》 to html markup (e.g. 「<span class="title">...</span>」), or use javascript to do that on the fly. Same for emacs keybindings. For example, with this clear syntax, it's easier to write a javascript so that when the mouse hovers over the keybinding notation, it shows a balloon with the command name for that key.
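As a rough illustration of the “simple syntactical parsing” this makes possible, the sketch below collects every 《...》 in a string. The function name my-collect-book-titles and the sample titles are made up; it assumes titles are not nested.

```elisp
;; -*- coding: utf-8 -*-
(defun my-collect-book-titles (text)
  "Return a list of the contents of every 《...》 in TEXT, in order.
Assumes the brackets are not nested."
  (let ((pos 0)
        (titles '()))
    ;; Each match starts searching where the previous one ended.
    (while (string-match "《\\([^》]+\\)》" text pos)
      (push (match-string 1 text) titles)
      (setq pos (match-end 0)))
    (nreverse titles)))

;; Article titles in 〈〉 are ignored; only book titles are returned.
(my-collect-book-titles "See 《Book One》, 〈Some Article〉 and 《Book Two》.")
;; → ("Book One" "Book Two")
```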

All this is part of the HTML Microformat idea, which is part of the semantic web concept. The basic idea is that the syntax encodes semantics. This advantage is a major reason XML became so useful. (The other reason is its regular syntax.)

The delimiters for 《book title》 and 〈article title〉 are a Chinese convention. The delimiters 【lenticular bracket】 and 「corner bracket」 and 〔tortoise shell bracket〕 are Chinese brackets. See: Intro to Chinese Punctuation with Computer Language Syntax Perspectives and Matching Brackets in Unicode.

Also, much of the html markup on my site has been cleaned up. For example:

  • 「<span class="code">...</span>」 ⇒ 「<code>...</code>」
  • 「“<span class="code">...</span>”」 ⇒ 「<code>...</code>」 (Remove the redundant curly quote. It can be auto added with Cascading Style Sheet (CSS) if needed.)
  • 「<span class="key">...</span>」 ⇒ 「<kbd>...</kbd>」 (Change to standard tag; reduce char count.)
  • 「<span class="kbd">...</span>」 ⇒ 「...」 (Remove the tag. Was designed to mark emacs key notation, but doesn't make much sense. Now, 【】 does it.)

There are several advantages in these changes. For example, 「<code>」 is much shorter than 「<span class="code">」, and it has a somewhat standard meaning. It is also more distinct than the “span” tag, which reduces ambiguity when i need to process “span” tags. Same with the change to the “kbd” tag.

Keystroke Notation Problems

Also, i used to use a 「<span class="kbd">...</span>」 tag to mark up emacs key notation, but my use wasn't consistent. For example, “Ctrl+x find-file” might be marked as 「【Ctrl+x】 find-file」 or 「【Ctrl+x find-file】」. The problem is actually quite thorny. It is about designing a consistent notation for keyboard shortcuts. Keep in mind that there are many types of key shortcuts: a single key such as 【F1】 or 【Win】, a normal combination such as 【Ctrl+x】, or a sequence of these such as emacs's 【Ctrl+x Ctrl+f】 and 【Ctrl+x f】, or Windows's 【Alt+Space c】 and 【Alt t i】 (accessing a menu by key, called “menu accelerator”). (A sequence of single keys is also common when you have sticky keys on, available in Windows, Mac, Linux.)

In general, it is not trivial to design a notation that is unambiguous and covers all these different types of common key shortcut practices. You want a notation that can contain a sequence of key-press elements, where each key-press element can be a single key or a key combination (such as 【Ctrl+x】). Also, note that the key 【Ctrl+x】 does not simply mean pressing them together, but actually pressing and holding Ctrl first, and releasing it last.

Also, in designing such a notation, there's a consideration of space char in the notation. For example, 【Ctrl+c Ctrl+c】 does not mean you have to press a space in this sequence, rather, space is used as a separator.

Another issue to consider is the plus sign. For example, 【Ctrl+x】 does not actually involve pressing the “+” key. Rather, “+” is used to indicate combination. This can be a readability problem when you have 【Ctrl++】 (usually for zoom in). (In Microsoft's software, such a case is simply written as 【Ctrl+】, a break of regularity. Apple's notation does not use any conjunction sign; it just places the 2 keys together to mean simultaneously pressed keys.)

Another issue is whether to consider a key as a key or as a character. For example, by convention, 【Ctrl+X】 means pressing the lower case x key, not capital X. This does introduce ambiguity. Most apps' menus use a notation that explicitly includes the Shift key. For example, in Firefox, the “Show All History” shortcut is written as 【Ctrl+Shift+H】, but for “Zoom In” it is written as 【Ctrl++】, not 【Ctrl+Shift++】 nor 【Ctrl+Shift+-】. When you consider different keyboard layouts (for example, in the QWERTZ layout used in Germany, the # key is not the shifted 3), this inconsistency about the Shift key creates more ambiguity. (See: Idiocy Of Keyboard Layouts.)

Also, what does it mean when you have a sequence of chars? For example, 【Ctrl+x】 does not mean pressing C, then t, then r, then l. However, in 【Meta+x dired】, it does mean pressing each character in the word “dired”.
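To make the separator issues concrete, here is a minimal sketch of a parser for one possible reading of this notation: space separates key-press elements, “+” joins the keys inside one element, and a trailing “+” is taken to be the plus key itself. The function names and these rules are assumptions made for illustration, not a notation this page settles on.

```elisp
(defun my-parse-key-press (element)
  "Split one key-press ELEMENT such as \"Ctrl+x\" into a list of key names.
A trailing \"+\" is read as the plus key itself, as in \"Ctrl++\"."
  (cond
   ;; The plus key pressed alone.
   ((string= element "+") (list "+"))
   ;; e.g. "Ctrl++": split what precedes the final "+", then add "+" as a key.
   ((string-suffix-p "+" element)
    (append (split-string (substring element 0 -1) "\\+" t) (list "+")))
   ;; Ordinary case: "+" is only a joiner.
   (t (split-string element "\\+" t))))

(defun my-parse-key-sequence (notation)
  "Parse NOTATION into a list of key-press elements, each a list of key names."
  (mapcar #'my-parse-key-press (split-string notation " " t)))

(my-parse-key-sequence "Ctrl+x Ctrl+f") ; → (("Ctrl" "x") ("Ctrl" "f"))
(my-parse-key-sequence "Ctrl++")        ; → (("Ctrl" "+"))
(my-parse-key-sequence "Meta+x dired")  ; → (("Meta" "x") ("dired"))
```

Note the last case: the sketch has no way to know that “dired” means five separate key presses rather than one key named “dired”, which is exactly the ambiguity described above.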

See also: A Short Survey Of Keyboard Shortcut Notations.

A human-readable key notation with the degree of precision i want is close to creating a language for key macro applications. But if you look at those apps, their syntax is not human readable, hugely inconsistent, and basically most of them are just syntax soup with lots of special cases.

Solution

To do these tag transformations, simple cases such as

“file path” ⇒ 〔file path〕

where the delimiters are single characters and there's no nesting, can be done with emacs's dired-do-query-replace-regexp. (See: Interactively Find and Replace String Patterns on Multiple Files.)
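For comparison, here is what the same simple replacement looks like as a non-interactive elisp sketch. The function name my-curly-quote-to-tortoise is made up, and unlike dired-do-query-replace-regexp it changes every curly-quoted string without asking, so it only fits text where every such quote really is a file path.

```elisp
;; -*- coding: utf-8 -*-
(defun my-curly-quote-to-tortoise (text)
  "Return TEXT with every “...” pair rewritten as 〔...〕.
Assumes the quotes are not nested."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; \\1 in the replacement stands for the text between the quotes.
    (while (re-search-forward "“\\([^”]+\\)”" nil t)
      (replace-match "〔\\1〕" t))
    (buffer-string)))

(my-curly-quote-to-tortoise "Edit “~/.emacs” and “/etc/hosts”.")
;; → "Edit 〔~/.emacs〕 and 〔/etc/hosts〕."
```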

More complicated cases, with nested html tags, can be done with an elisp script. Here's the general plan.

  • Open the file
  • Search for the tag
  • If found, move to the beginning of tag, mark positions of beginning and ending of the opening tag
  • Use sgml-skip-tag-forward to move to the end matching tag
  • Mark positions of beginning and ending of the ending tag
  • Replace the beginning and ending tags with new tags
  • Repeat

To open the file, we can use “find-file”.

To search for the tag, we do:

(while
 (search-forward "<span class=\"code\">"  nil t)
...
)

We give “t” for the third argument. It means don't complain if not found: search-forward then returns nil instead of signaling an error, which ends the “while” loop.
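A tiny sketch of that loop shape, run on a string in a temporary buffer instead of a file. The function name my-count-tag is made up for illustration.

```elisp
(defun my-count-tag (tag text)
  "Return how many times the literal string TAG occurs in TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((count 0))
      ;; With a non-nil third argument, `search-forward' returns nil
      ;; when nothing more is found, so the loop simply stops.
      (while (search-forward tag nil t)
        (setq count (1+ count)))
      count)))

(my-count-tag "<span class=\"code\">"
              "<span class=\"code\">a</span> <span class=\"code\">b</span>")
;; → 2
(my-count-tag "<span class=\"code\">" "no tags here")
;; → 0
```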

The next step is to get the beginning and ending positions of the opening tag. Because search-forward automatically places the cursor right after the match, the 「>」 char is at the current cursor position minus 1, and the end of the tag is the current cursor position. To get the beginning position, we just use search-backward on “<”.
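A minimal sketch of this step on a string. The function name my-opening-tag-bounds is made up, and the positions are emacs buffer positions, which start at 1.

```elisp
(defun my-opening-tag-bounds (tag text)
  "Return (BEGIN . END) positions of the first TAG in TEXT, or nil if absent.
END is the position just after the closing \">\" of TAG."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (when (search-forward tag nil t)
      (let ((end (point)))      ; cursor is now just after ">"
        (search-backward "<")   ; move back to the "<" that starts the tag
        (cons (point) end)))))

;; "ab" takes positions 1 and 2, so the tag starts at 3.
(my-opening-tag-bounds "<span class=\"code\">" "ab<span class=\"code\">x</span>")
;; → (3 . 22)
```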

Now, we need to get the beginning and ending positions of the matching end tag. This may be a problem because the tags are nested, so there may be many 「</span>」 before the one we want.

The good thing is that emacs's html-mode has the sgml-skip-tag-forward function. It moves the cursor from a beginning tag to just after its matching end tag.
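Here is a small sketch of sgml-skip-tag-forward handling a nested case. The helper name my-through-matching-end is made up; the sketch assumes the text begins with an opening tag and that sgml-mode is available (it ships with emacs).

```elisp
(require 'sgml-mode) ; provides `sgml-skip-tag-forward'

(defun my-through-matching-end (text)
  "Return TEXT from its start through the end tag matching its first tag.
Assumes TEXT begins with an opening tag."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Moves over nested same-name pairs and stops after the matching end tag.
    (sgml-skip-tag-forward 1)
    (buffer-substring (point-min) (point))))

(my-through-matching-end
 "<span class=\"code\">a <span class=\"x\">b</span> c</span> tail")
```

On this sample the result should end at the second 「</span>」, leaving out “ tail”, which is the match that plain regex search could not find.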

Once we have the beginning and ending positions for the beginning and ending tags, we can easily do the replacement. Just use “delete-region”, then use “insert” to insert the new tag we want. One important thing is that we should replace the ending tag first, because if we replace the beginning tag first, the positions of the ending tag will have changed.
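The ordering can be seen in a short sketch. The function name my-retag is made up; the position names p3, p4, p8, p9 follow the complete code on this page (p3 and p4 bound the opening tag, p8 and p9 bound the ending tag).

```elisp
(defun my-retag (text p3 p4 p8 p9)
  "Return TEXT with region P8..P9 replaced by </code> and P3..P4 by <code>.
The ending tag is replaced first so that P3 and P4 are still valid."
  (with-temp-buffer
    (insert text)
    ;; Ending tag first: this edit happens after p3/p4, so they do not shift.
    (delete-region p8 p9)
    (goto-char p8)
    (insert "</code>")
    ;; Now the opening tag, whose positions are unchanged.
    (delete-region p3 p4)
    (goto-char p3)
    (insert "<code>")
    (buffer-string)))

;; Opening tag spans positions 1..20, ending tag spans 22..29.
(my-retag "<span class=\"code\">ls</span>" 1 20 22 29)
;; → "<code>ls</code>"
```

Doing it the other way around would shrink the text before the ending tag by 13 chars, so p8 and p9 would then point at the wrong place.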

Complete Code

;; -*- coding: utf-8 -*-
;; 2010-08-25

;; change
;; <span class="code">...</span>
;; to
;; <code>...</code>

(setq inputDir "~/web/xahlee_org/" ) ; dir should end with a slash

(defun my-process-file (fpath)
  "Process the file at full path FPATH.
Change <span class=\"code\">...</span> to <code>...</code>, asking before each change."
  (let ( mybuff changedQ p3 p4 p8 p9)

    ;; open the file
    ;; search for the tag
    ;; if found, move to the beginning of tag, mark positions of beginning and ending of < and >
    ;; use sgml-skip-tag-forward to move to the end matching tag </span>
    ;; mark positions of beginning and ending of < and >
    ;; replace them with <code> and </code> 
    ;; repeat
    (setq mybuff (find-file fpath ) )
    (setq changedQ nil )

    (goto-char (point-min))
    (while
        (search-forward "<span class=\"code\">"  nil t)
      (backward-char 1)
      (if (looking-at ">") 
          (setq p4 (1+ (point)) )
        (error "expecting >" )
        )

      ;; go to beginning of "<span class="code">"
      (sgml-skip-tag-backward 1)
      (if (looking-at "<") 
          (setq p3 (point) )
        (error "expecting <" )
        )
      (forward-char 2)

      ;; go to end of </span>
      (sgml-skip-tag-forward 1)
      (backward-char 1)
      (if (looking-at ">") 
          (setq p9 (1+ (point)) )
        (error "expecting >" )
        )

      ;; go to beginning of </span>
      (backward-char 6) 
      (if (looking-at "<") 
          (setq p8 (point) )
        (error "expecting <" )
        )
      
      (when (yes-or-no-p "change? ")
        (delete-region p8 p9  )
        (insert "</code>")
        (delete-region p3 p4 )
        (goto-char p3)
        (insert "<code>")
        (setq changedQ t )
        ))

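    ;; if not changed, close it. Else, leave buffer open
    ;; note: make-backup is not defined in this script and is not a built-in
    ;; emacs function; it is assumed to be a user-defined command that
    ;; backs up the current buffer's file
    (if changedQ
        (make-backup)                        ; leave it open
      (kill-buffer mybuff)
      )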
    ))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*span tag to code tag*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
  )

In the code above, i also put extra checks to make sure that the position of beginning tag is really the 「<」 char. Same for ending tag. (probably redundant, but i tend to be extra careful.)

Also, i used the “yes-or-no-p” function, so emacs prompts me for each change and i can visually check it.

For those files that are changed, i leave them open. So, if i decide on a whim that i don't want all these changes on the potentially hundreds of files i've changed, i can simply close all the buffers with 3 keystrokes in ibuffer. Same if i want to save them all.

For files where no change takes place, the buffer is simply closed.

Emacs is fantastic!
