Emacs Lisp: Processing HTML: Transform Tags from ‹span class=w› to ‹b›

Perm url with updates: http://xahlee.org/emacs/elisp_batch_html_tag_transform_bold.html

Xah Lee, 2011-07-18

This page shows a simple practical elisp script for HTML tag transformation.

The Problem

Summary

I want to batch transform the tag <span class="w">xyz</span> to <b>xyz</b> in over a hundred files, and print a report of the changes so that i can scan it to make sure there are no errors (for example, in case an HTML file has a mismatched span tag).

Detail

In my English vocabulary and literature study projects, many interesting words are marked up by this tag: <span class="w">xyz</span>. With CSS, it is rendered in bold. I think that markup is too elaborate, and i want to replace it simply with <b>xyz</b>, for over a few hundred files.

Sidenote

The following is a little sidenote on why i had “span.w” in the first place. (you can skip this section.)

I have the following “span” markups: { “span.w”, “span.b”, “span.r” }. The “span.w” marks an interesting word that's new to me, rendered in bold. These are typically difficult words.

Sometimes college-level words are still interesting, and i want to highlight them too, for highschool or ESL students and myself. Sometimes these are familiar words used in a sense that's not common (e.g. “seedy” hotel). For these words, i mark up with “span.b”. They are rendered in blue.

The “span.r” is for highlighting an interesting {word, phrase, sentence} of the work, not necessarily for vocabulary study purposes, e.g. an interesting thought, a quotable passage, or an interesting deviation from standard grammar. They are rendered in red.

As an example of how i used these markups, here's an excerpt from Gulliver's Travels. PART I — A VOYAGE TO LILLIPUT. Quote:

The declivity was so small, that I walked near a mile before I got to the shore, which I conjectured was about eight o'clock in the evening. I then advanced forward near half a mile, but could not discover any sign of houses or inhabitants; at least I was in so weak a condition, that I did not observe them. I was extremely tired, and with that, and the heat of the weather, and about half a pint of brandy that I drank as I left the ship, I found myself much inclined to sleep. I lay down on the grass, which was very short and soft, where I slept sounder than ever I remembered to have done in my life, and, as I reckoned, about nine hours; for when I awaked, it was just day-light. I attempted to rise, but was not able to stir: for, as I happened to lie on my back, I found my arms and legs were strongly fastened on each side to the ground; and my hair, which was long and thick, tied down in the same manner. I likewise felt several slender ligatures across my body, from my arm-pits to my thighs. I could only look upwards; the sun began to grow hot, and the light offended my eyes.

  • Note the word “declivity”, rendered in bold. (primary interesting words)
  • Note the word “ligatures”, rendered in blue. (secondary interesting words)
  • Note the phrase “light offended my eyes”, rendered in red. (interesting phrase and usage.)

Solution

Here's an outline of the steps.

  • Open the file. Use regex to search the span markup.
  • Make the replacement.
  • Add the replacement to a list, for later report.
  • Repeat the above.
  • Use a dir traverse function to apply the above to every file.
  • When done, print the list of changes.
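Before running anything over a whole directory, the search, replace, and record steps above can be tried on a single string. The following is a minimal sketch, not part of the original script: the function name my-span-w-to-b-in-string is hypothetical, and it works in a temporary buffer so no file is touched. The regex and the under-15-characters check are the same ones used in the script that follows.

```elisp
;; Hypothetical helper for experimenting; not part of the original script.
(defun my-span-w-to-b-in-string (html)
  "Return a list (NEW-HTML CHANGED-WORDS) for the string HTML.
Uses the same regex and the same length check as the batch script."
  (let ((case-fold-search nil)          ; match the tag case-sensitively
        (changed '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (re-search-forward "<span class=\"w\">\\([^<]+?\\)</span>" nil t)
        (let ((word (match-string 1)))
          (when (< (length word) 15)    ; skip suspiciously long matches
            ;; FIXEDCASE and LITERAL are both t: insert the text exactly as given.
            (replace-match (concat "<b>" word "</b>") t t)
            (push word changed))))
      ;; push builds the list in reverse, so restore document order.
      (list (buffer-string) (nreverse changed)))))

(my-span-w-to-b-in-string
 "I felt several slender <span class=\"w\">ligatures</span> across my body.")
;; => ("I felt several slender <b>ligatures</b> across my body." ("ligatures"))
```

The second element of the result plays the role of the change list: it is what gets printed at the end for a quick visual check.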

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-18
;; replace <span class="w">…</span> to <b>…</b>
;;
;; do this for all files in a dir.

(setq inputDir "~/web/xahlee_org/PageTwo_dir/Vocabulary_dir/" ) ; dir should end with a slash

(setq changedItems '())

(defun my-process-file (fpath)
  "Replace <span class=\"w\">…</span> with <b>…</b> in the file at full path FPATH.
Each replaced word is added to `changedItems' for the final report."
  (let (mybuff myword)
    (setq mybuff (find-file fpath))

    (widen)
    (goto-char (point-min)) ;; in case buffer already open

    (while (search-forward-regexp "<span class=\"w\">\\([^<]+?\\)</span>" nil t)
      (setq myword (match-string 1))
      (when (< (length myword) 15) ; a little double check in case of possible mismatched tag
        ;; the second t is LITERAL: a backslash in myword is inserted as is
        (replace-match (concat "<b>" myword "</b>" ) t t)
        (setq changedItems (cons (substring-no-properties myword) changedItems ) )
        ) )

    ;; close buffer if there's no change. Else leave it open.
    (when (not (buffer-modified-p mybuff)) (kill-buffer mybuff) )
    ) )

(require 'find-lisp)

(setq make-backup-files t)
(setq case-fold-search nil)
(setq case-replace nil)

(let (outputBuffer)
  (setq outputBuffer "*xah span.w to b replace output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (print changedItems)
    (princ "Done deal!")
    )
  )

The above is fairly easy to understand. For a refresher on the elisp basics, see “Text Processing with Emacs Lisp Batch Style” and “Emacs Lisp Idioms (for writing interactive commands)”.

Here's the output: elisp_batch_html_tag_transform_bold_output.txt.

There are over 1k changes. The output is extremely useful because i can take a few seconds to glance at it and know there are no errors. Errors are possible because when you use regex to process HTML, a missing tag, or even an unexpected nested tag, can mean disaster.
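To make those failure modes concrete, here is a small sketch of what the script's regex does with a few malformed inputs. The helper name my-span-w-capture is hypothetical and the sample strings are made up for illustration; only the regex is taken from the script.

```elisp
;; Hypothetical helper for experimenting; not part of the original script.
(defun my-span-w-capture (html)
  "Return the text captured by the script's regex in HTML, or nil if no match."
  (let ((case-fold-search nil))
    (when (string-match "<span class=\"w\">\\([^<]+?\\)</span>" html)
      (match-string 1 html))))

;; Well-formed markup: the word is captured.
(my-span-w-capture "<span class=\"w\">seedy</span>")
;; => "seedy"

;; Nested tag: [^<] cannot cross the inner <i>, so nothing matches.
(my-span-w-capture "<span class=\"w\"><i>seedy</i></span>")
;; => nil

;; First span never closed: the match starts at the second, intact span.
(my-span-w-capture "<span class=\"w\">seedy <span class=\"w\">ligature</span>")
;; => "ligature"

;; Uppercase tag: not matched, because case-fold-search is nil.
(my-span-w-capture "<SPAN CLASS=\"w\">seedy</SPAN>")
;; => nil
```

With this regex, the nested and unclosed cases are skipped rather than mangled, so they never show up in the report. Searching the files afterwards for any remaining <span class="w"> would be one way to find them.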

PS I run a Word-English blog. If you are interested in vocabulary, please subscribe at: Wordy English — the Making of Belles-Lettres.
