Emacs Lisp: HTML Processing: Split Annotation

Perm url with updates: http://xahlee.org/emacs/elisp_text_processing_split_annotation.html

Xah Lee, 2011-08-16

This page shows an example of using emacs lisp to process HTML. The HTML files are classical novels. The annotation markup needs to change from one format into another. There are hundreds of such pages that need to be processed.



For all HTML files in a directory, find any annotation markup containing the bullet “•” symbol:

<div class="x-note">A ⇒ … • B ⇒ … • C ⇒ …</div>

Split the annotation into multiple markups, like this:

<div class="x-note">A ⇒ … </div>
<div class="x-note">B ⇒ … </div>
<div class="x-note">C ⇒ … </div>
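
The split itself can be sketched on a plain string before touching any files. This is only a minimal illustration: the function name my-split-note-markup is hypothetical and not part of the script further down, and it assumes the input is the text inside one “div.x-note”, without any <br> to clean up.

```elisp
;; Hypothetical helper for illustration; not part of the original script.
(defun my-split-note-markup (text)
  "Return TEXT, the inside of one x-note div, as one x-note div per entry.
Entries are assumed to be separated by the bullet character."
  (mapconcat
   (lambda (entry) (concat "<div class=\"x-note\">" entry "</div>"))
   ;; split on the bullet, trim whitespace around each piece,
   ;; and drop pieces that end up empty (e.g. before a leading bullet)
   (split-string text "•" t "[ \t\n]+")
   "\n"))

;; returns three x-note divs, one per line
(my-split-note-markup "A ⇒ … • B ⇒ … • C ⇒ …")
```

Unlike the sample above, this sketch trims the space before each end tag. The TRIM argument of split-string needs Emacs 24.4 or later.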


If you are a contract web dev programer, then you know that 99.99% of websites are a messy text soup. They are created by hundreds of tools or languages: word processors, HTML generators, tens of lightweight markup langs, different frameworks from different languages (PHP, Perl, Python), from different web eras, from different programers in the past. Even emacs has several modes that generate HTML. The resulting pages are not in any consistent form. Often, they have missing tags too.

It is in these situations that emacs shines thru, because its powerful embedded language, lisp, and its interactive nature let you maximize automation. You work interactively while you are still feeling out the pattern, then use a Keyboard Macro or emacs lisp for the parts that can be automated.

For my website, i take the time to make sure that all my HTML is consistent. But still, it was written over a span of 15 years. Periodically i take the time to improve the markup. For example, when a new version of CSS or HTML is widely adopted by web browsers. (CSS1 to 2 to 3, HTML 3 to 4 to HTML5.)

I have hundreds of pages of classic novels as HTML documents. These documents contain annotations in special HTML markup. For example, here's a sample annotation from Titus Andronicus: Act 1:

• short ⇒ rudely brief. (AHD)
• sharp ⇒ Fierce, impetuous, hash, severe… (AHD)
SATURNINUS. 'Tis good, sir. You are very short with us;
  But if we live we'll be as sharp with you.

Here's the raw HTML:

<div class="x-note">• short ⇒ rudely brief. (AHD)<br>
• sharp ⇒ Fierce, impetuous, hash, severe… (AHD)</div>

<pre class="tx">SATURNINUS. 'Tis good, sir. You are very <span class="xnt">short</span> with us;
  But if we live we'll be as <span class="xnt">sharp</span> with you.

Here's how the tag works. Each <span class="xnt"> marks up a word in the main text. When a word is marked by “span.xnt”, that means it has a sidebar annotation. The sidebar section is marked by <div class="x-note">. Inside the “div.x-note”, there may be more than one entry. Each entry starts with the bullet symbol “•”. For example, in the above, the words “short” and “sharp” are both entries inside a “div.x-note” sidebar.
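
To make the relation between the two tags concrete, here is a small sketch that collects the words marked by “span.xnt” from a piece of HTML held in a string. The function name my-xnt-words is made up for this illustration, and the regex assumes the span contains plain text with no nested tags.

```elisp
;; Hypothetical helper for illustration; not part of the original script.
(defun my-xnt-words (html)
  "Return the list of words in HTML that are marked by <span class=\"xnt\">.
Assumes each such span contains only plain text."
  (let ((start 0) (words '()))
    (while (string-match "<span class=\"xnt\">\\([^<]*\\)</span>" html start)
      (push (match-string 1 html) words)
      (setq start (match-end 0)))
    ;; push builds the list backwards, so restore document order
    (nreverse words)))

(my-xnt-words
 "You are very <span class=\"xnt\">short</span> with us;
  But if we live we'll be as <span class=\"xnt\">sharp</span> with you.")
;; → ("short" "sharp")
```

Each word in the returned list should have a matching entry in a nearby “div.x-note”.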

But recently, i think it is better to have one entry per sidebar. This way, it makes the logic simpler, and is much easier if i want to add Javascript functionality. For example, when mouse hovers on a word in main text, the corresponding annotation would be highlighted.

So, i want to write an elisp script to process all my files. If you simply read the spec for this job, of splitting a markup by a particular character, you may think it's trivial and can be done in any lang in 10 minutes. Why then the elaborate discussion about the text soup situation?

The important thing is that i DO NOT know what needs to be done to begin with. Only after using emacs together with lisp scripts i wrote before, to look at and check my existing markup in hundreds of files, did i know what state they are in and decide what i want to do. Also, this change must be done with the ability to visually check that all changes are done correctly, because the input may not be in the format i expect. (It might be missing the bullet “•”.)
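
As one example of this kind of checking, the sketch below lists every “div.x-note” whose text has no bullet at all. It is a rough illustration, not the lisp script mentioned above: the name my-notes-without-bullet is hypothetical, and it assumes a note never contains a nested div, so the first </div> after the opening tag closes the note.

```elisp
;; Hypothetical helper for illustration; not part of the original script.
(defun my-notes-without-bullet (html)
  "Return the inner text of each x-note div in HTML that has no bullet.
Assumes no div is nested inside a note."
  (let ((result '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward "<div class=\"x-note\">" nil t)
        (let ((p1 (point)))
          ;; the note's text runs up to the next end tag
          (when (search-forward "</div>" nil t)
            (let ((meat (buffer-substring-no-properties p1 (match-beginning 0))))
              (unless (string-match-p "•" meat)
                (push meat result)))))))
    (nreverse result)))

(my-notes-without-bullet
 "<div class=\"x-note\">• short ⇒ rudely brief.</div>
<div class=\"x-note\">sharp ⇒ fierce</div>")
;; → ("sharp ⇒ fierce")
```

To check a real file, the file's text could be passed in as the string.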

For those Scheme Lisp academic computer science folks, you might wonder why, when i started with these annotations, i didn't “design” them well to begin with. The reason is that, when i write a blog article, or work on my literature annotation project, i really want to focus on the writing first, the content, and get it done, rather than get distracted by the CSS/HTML markup design. (One thing i do make sure of is that whatever CSS/HTML i devise can be easily changed systematically later by a simple parsing.) I devote a significantly larger percentage of time to design than most people, but many factors necessitate change. For example, you may not have known CSS as well before, and the thinking behind HTML semantics is quite complex. (e.g. see: “Are You Intelligent Enough to Understand HTML5?”.) Browsers change, standards change (e.g. HTML → XHTML → HTML5. See: “HTML5 Doctype, Validation, X-UA-Compatible, and Why Do I Hate Hackers”.), thoughts on best practices change, and my needs for the annotation also changed throughout the years.


Here's the outline of steps:

  • Open the file. Search for the tag we want.
  • Check if the tag contains a bullet “•”.
  • If so, replace the bullet char with new end tag and beginning tag. e.g. </div> <div>
  • Do this for all files in a dir. (or a given list of files)

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-08-13
;; process all files in a dir.
;; split any markup like this:
;; <div class="x-note">… • … • …</div>
;; by the bullet •
;; into several x-note tags

(setq inputDir "~/web/xahlee_org/p/" )

;; add a ending slash if not there
(when (not (string= "/" (substring inputDir -1) )) (setq inputDir (concat inputDir "/") ) )

;; files to process
;; (the explicit list of files is not shown here, so it is left empty;
;; the run at the bottom walks inputDir instead)
(setq fileList '())

(defun my-process-file-xnote (fpath)
  "process the file at fullpath FPATH …"
  (let (myBuffer (ξcounter 0) p1 p2 ξmeat ξmeatNew
                 (changedItems '())
                 (tagBegin "<div class=\"x-note\">" )
                 (tagEnd "</div>" ))

    (require 'sgml-mode) ; for sgml-skip-tag-forward
    (when t

      (setq myBuffer (find-file fpath))
      (goto-char 1)
      (while (search-forward "<div class=\"x-note\">" nil t)

        ;; capture the x-note tag text
        (setq p1 (point))
        (backward-char 1)
        (sgml-skip-tag-forward 1) ; move to the end of the matching </div>
        (backward-char 6)         ; step back over "</div>"
        (setq p2 (point))
        (setq ξmeat (buffer-substring-no-properties p1 p2))

        ;; if it contains a bullet
        (when (string-match "•" ξmeat)
          (setq ξcounter (1+ ξcounter))

          ;; clean the text. Remove some newline and <br> that's no longer needed
          (setq ξmeat (replace-regexp-in-string "\n*• *" "•" ξmeat t t ) )
          (setq ξmeat (replace-regexp-in-string "\n$" "" ξmeat t t ) ) ; delete ending eol
          (setq ξmeat (replace-regexp-in-string "<br>•" "•" ξmeat t t ) )
          ;; drop a bullet at the very start, else the first tag comes out empty
          (setq ξmeat (replace-regexp-in-string "\\`•" "" ξmeat t t ) )

          ;; add the new entries to a list, for later reporting
          (setq changedItems (append changedItems (split-string ξmeat  "•" t) ) )

          ;; break the bullet into new end/begin tags
          (setq ξmeatNew (replace-regexp-in-string "•" (concat tagEnd "\n" tagBegin) ξmeat t t ) )

          (goto-char p1)
          (delete-region p1 p2)
          (insert ξmeatNew)

          ;; remove the newline before end tag
          (when (eq (char-before) ?\n) (delete-char -1))
          ))

      ;; report if the occurrence count is not 0
      (when (not (= ξcounter 0))
        (princ "-------------------------------------------\n")
        (princ (format "%d %s\n\n" ξcounter fpath))

        (mapc (lambda (ξx) (princ (format "%s\n\n" ξx)) ) changedItems)
        )

      ;; close buffer if there's no change. Else leave it open.
      (when (not (buffer-modified-p myBuffer)) (kill-buffer myBuffer) )
      )))

(require 'find-lisp) ; for find-lisp-find-files

(let (outputBuffer)
  (setq outputBuffer "*xah x-note output*" )
  (with-output-to-temp-buffer outputBuffer 
    ;; (mapc 'my-process-file-xnote fileList)
    (mapc 'my-process-file-xnote (find-lisp-find-files inputDir "\\.html$"))
    )
  (princ "Done deal!")
  )

Here's a sample output: elisp_text_processing_split_annotation.txt

I've put lots of comments in the code. It should be easy to understand. If there is any part you don't understand, ask me. If you are new to elisp, check out the first few sections of Emacs Lisp Tutorial.

The weird ξ you see in my elisp code is the Greek letter xi. I use unicode chars in variable names for experimental purposes. You can just ignore it. (See: Programing Style: Variable Naming: English Words Considered Harmful.)

I ♥ emacs.