2011-02-27

Emacs Lisp: Find String Inside HTML Tag

Perm url with updates: http://xahlee.org/emacs/elisp_grep_string_inside_tag.html

Xah Lee, 2011-02-27

This page shows an Emacs Lisp script that searches files, similar to unix grep, but with the condition that the string must occur inside an HTML tag. If you don't know elisp, first take a look at Emacs Lisp Basics.

The Problem

A few days ago we covered How to Write grep in Emacs Lisp. With a few lines changed, here's a version that matches a string only if it is inside an HTML tag. This is something grep and the unix tool bag cannot do, and it is difficult to do even with Perl or Python unless you use an HTML parser.

Solution

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-02-25
;; print files that meet this condition:
;; contains <div class="x-note">...</div>
;; where the text content contains more than 2 bullet chars •

(setq inputDir "~/web/xahlee_org/" ) ; dir should end with a slash

(require 'sgml-mode) ; need sgml-skip-tag-forward

(defun my-process-file (fpath)
  "process the file at fullpath fpath ..."
  (let (mybuffer p3 p4 (bulletCnt 0) (totalCnt 0))

    (when
        (and (not (string-match "/xx" fpath)) ) ; skip some dir

      (setq mybuffer (get-buffer-create " myTemp"))
      (set-buffer mybuffer)
      (insert-file-contents fpath nil nil nil t)

      (setq bulletCnt 0 ) 
      (goto-char 1)
      (while
          (search-forward "<div class=\"x-note\">"  nil t)

        (setq p3 (point) ) ; beginning of text content, after <div class="x-note">
        (backward-char)
        (sgml-skip-tag-forward 1)
        (backward-char 6)
        (setq p4 (point) ) ; end of tag content, before the </div>

        (setq bulletCnt (count-matches "•" p3 p4) )

        (when (> bulletCnt 2)
          (setq totalCnt (1+ totalCnt) )
          (princ (format "this many: %d %s\n" bulletCnt fpath))
          )
        )

      (kill-buffer mybuffer)
      )
    ))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*" )
  (with-output-to-temp-buffer outputBuffer 
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
  )

The code is almost exactly the same as our previous grep in Emacs Lisp code, with only about 5 lines changed. (If you are not familiar with the code, see: grep in Emacs Lisp.)

Here's the main part that's different:

      (while
          (search-forward "<div class=\"x-note\">"  nil t)

        (setq p3 (point) ) ; beginning of text content, after <div class="x-note">
        (backward-char)
        (sgml-skip-tag-forward 1)
        (backward-char 6)
        (setq p4 (point) ) ; end of tag content, before the </div>

        (setq bulletCnt (count-matches "•" p3 p4) )

        (when (> bulletCnt 2)
          (setq totalCnt (1+ totalCnt) )
          (princ (format "this many: %d %s\n" bulletCnt fpath))
          )
        )

The interesting part here is the use of the function “sgml-skip-tag-forward”. Called with the cursor on an opening tag, it moves the cursor past the matching closing tag, skipping any nested tags of the same name along the way. Without this function, it may take days to code this in any language, unless you have experience writing parsers. (Note that a regex alone cannot do this, because a regex cannot match arbitrarily nested tags.)
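To see the function work on its own, here is a minimal sketch. The helper name “my-demo-tag-content” and the sample HTML are made up for illustration. The (backward-char) call is there because, after the search, the cursor sits just past the opening tag; stepping back onto its “>” lets sgml-skip-tag-forward recognize which tag to skip. Unlike the script above, which steps back a fixed 6 characters for “</div>”, this sketch searches backward for “</”.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

(defun my-demo-tag-content (html open-tag)
  "Return the text between literal OPEN-TAG and its matching close tag in HTML.
Return nil if OPEN-TAG does not occur.  Hypothetical demo helper."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (search-forward open-tag nil t)
      (let ((start (point)))       ; just after the opening tag
        (backward-char)            ; sit on the closing > of the opening tag
        (sgml-skip-tag-forward 1)  ; jump past the matching close tag
        (search-backward "</")     ; back up to the start of that close tag
        (buffer-substring-no-properties start (point))))))

(my-demo-tag-content
 "<p>a</p><div class=\"x-note\">• one <div>inner</div> • two</div><p>z</p>"
 "<div class=\"x-note\">")
;; ⇒ "• one <div>inner</div> • two"
```

The nested <div>inner</div> does not end the match early, which is the part a plain text search gets wrong.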

The other thing i really love about this is that we are not using an XML parser. If you do this in Perl or Python, you almost certainly have to use an HTML/XML parser. With that, your code becomes more complex: you deal with parents, children, and tree structure instead of simple text processing.

O Emacs! ♥

Applications

Find & Replace, or just Find, has extensive use in text processing. Here are some examples of variations, all of which i need on a weekly basis and have several elisp scripts for:

  • Find by printing file names.
  • Find by printing adjacent text.
  • Find by condition of number of occurrences.
  • Find only if the file contains a set of strings.
  • Replace text based on file's name.
  • Find (and optionally Replace) only if the <title> and <h1> text don't match.
  • Find/Report/Replace only if the string is at a particular position in the file. (e.g. near top, near bottom.)
  • Find only if the string is inside a tag.
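As one illustration of these variations, here is a rough sketch of “find only if the file contains a set of strings”. The function names are hypothetical, not taken from my actual scripts, and the match is literal and case-sensitive by assumption.

```elisp
(defun my-buffer-has-all-strings-p (strings)
  "Return non-nil if the current buffer contains every string in STRINGS.
Matching is literal and case-sensitive.  Hypothetical helper."
  (let ((case-fold-search nil)
        (all-found t))
    (save-excursion
      (dolist (s strings)
        (goto-char (point-min))        ; search each string from the top
        (unless (search-forward s nil t)
          (setq all-found nil))))
    all-found))

(defun my-file-has-all-strings-p (fpath strings)
  "Return non-nil if the file at FPATH contains every string in STRINGS."
  (with-temp-buffer
    (insert-file-contents fpath)
    (my-buffer-has-all-strings-p strings)))
```

A predicate like this could stand in for the condition inside “my-process-file”, with the file list still coming from find-lisp-find-files.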

Why I Wrote This Code

Here's a little side story on why i wrote this one. It's a bit long.

I have about 30 works of classic literature with annotations. For example: The Tale Of The Bull And The Ass. The annotation system is custom HTML tags with proper CSS (a so-called HTML Microformat). Extensive custom emacs commands are used to insert such tags. (You can see these commands at Xah Lee's Emacs Customization Files.)

Each annotation is in a <div class="x-note">…</div> tag. For example:

<div class="x-note">• provaunt ⇒ provide. Provant is a verb meaning: To supply with provender or provisions.</div>

However, some “x-note” blocks hold multiple annotations in one. For example, in the story The Fisherman And The Jinni, there's this one:

<div class="x-note">• stint ⇒ a fixed amount or share of work.
• might and main ⇒ with all effort and strength.
• skein ⇒ A quantity of yarn, thread, or the like, put up together, after it is taken from the reel.
• buffet ⇒ hit, beat, especially repeatedly.
• fain ⇒ with joy; satisfied; contented.
</div>

Each annotation is marked by a bullet “•” symbol, followed by a word. Each word corresponds to the same word in the main text marked by <span class="xnt">…</span>. (I pack multiple annotations into one “x-note” block to save space in html rendering.)
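Because the bullet is the only delimiter, the text content of a block can be taken apart with plain string functions. The following is a sketch only: the function name is hypothetical, and it assumes the bullet never appears inside an annotation's own text.

```elisp
(require 'subr-x) ; string-trim (available since Emacs 24.4)

(defun my-split-annotations (content)
  "Split x-note CONTENT at each bullet and return the annotations as a list.
Each item is trimmed and has no leading bullet.  Hypothetical helper."
  (let (result)
    (dolist (piece (split-string content "•" t))
      (let ((trimmed (string-trim piece)))
        (unless (string= trimmed "")   ; drop whitespace-only pieces
          (push trimmed result))))
    (nreverse result)))

(my-split-annotations
 "• might and main ⇒ with all effort and strength.\n• fain ⇒ with joy; satisfied; contented.\n")
;; ⇒ ("might and main ⇒ with all effort and strength."
;;    "fain ⇒ with joy; satisfied; contented.")
```

Note that any stray text before the first bullet would come back as its own item, so the result is only as clean as the markup.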

This annotation system is not perfect. It is static, based on HTML and CSS only. Recently i've been thinking of making it more dynamic with Javascript. With Javascript, it's possible to have features such as hiding/showing annotations, showing an annotation on mouse-over, or showing a word's definition from an online dictionary when the mouse is over it.

To make that possible from my existing system, i need to make sure of a few things:

  • ① My custom markup must have precise semantics.
  • ② The syntax should be as simple as possible. (otherwise it increases the work in js code)
  • ③ My html file's annotation markup must have correct syntax. (else js will fail silently)

With my current system, an annotation block is contained in an “x-note” tag, and within that block, each annotation is marked by a bullet. These semantics are precise, but the syntax isn't simple enough. If i want javascript to automatically highlight the annotation text when the user mouses over an annotated word, the js will have to do some parsing of the text in the “x-note” block.

It would be simpler if each “x-note” block contained just ONE annotation. This means i will first change all my files that contain multi-annotation blocks so that they have one annotation per block. This is a text processing job. (Hello emacs lisp!) But it will probably take a full day's work over about a thousand files. The number of files doesn't matter. What's time consuming is doing it correctly. First, i need to make sure the text has correct syntax. For example: make sure that each “x-note” does indeed contain at least one bullet symbol. Make sure that each <span class="xnt">…</span> has a corresponding <div class="x-note">…</div>. (i will be writing emacs lisp scripts to do these validation tasks.)
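The first of those checks, that every “x-note” block has at least one bullet, might be sketched like this. It is not one of the validation scripts themselves: the function name is hypothetical, and it reuses the same tag-skipping trick as the main script.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

(defun my-x-note-bullet-counts ()
  "Return a list of bullet counts, one per x-note block in the current buffer.
Hypothetical helper: a 0 marks a block with no bullet, and anything
above 1 marks a block that packs several annotations together."
  (save-excursion
    (goto-char (point-min))
    (let (counts)
      (while (search-forward "<div class=\"x-note\">" nil t)
        (let ((start (point)))
          (backward-char)            ; sit on the > of the opening tag
          (sgml-skip-tag-forward 1)  ; move past the matching </div>
          ;; how-many restores point, so the loop resumes after this block
          (push (how-many "•" start (point)) counts)))
      (nreverse counts))))
```

Run in a buffer holding one file's contents, a result such as (1 2 0) would mean the second block needs splitting and the third fails the check.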

So, that's why i wrote this script. I wanted to get some idea of how many “x-note” blocks, and in which files, actually contain multiple annotations.