How to Write grep in Emacs Lisp

Perm url with updates: http://xahlee.org/emacs/elisp_grep_script.html

Xah Lee, 2011-02-07

This page shows a real-world example of an emacs lisp script that searches files, similar to unix grep. If you don't know elisp, first take a look at Emacs Lisp Basics.

The Problem

Summary

I want to write an elisp script that reports the files in a dir that contain a given string n times. The script is expected to search thru 5 thousand files.

Detail

Why can't i just use grep? Because:

• Often, my search string is long, containing 300 chars or more. (e.g. a snippet of HTML that contains javascript and spans multiple lines.) You could put your search string in a file for grep to read, but it is not convenient. Here's an example of a string i need to search:

<div class="chtk"><script type="text/javascript">ch_client="polyglut";ch_width=550;ch_height=90;ch_type="mpu";ch_sid="Chitika Default";ch_backfill=1;ch_color_site_link="#00C";ch_color_title="#00C";ch_color_border="#FFF";ch_color_text="#000";ch_color_bg="#FFF";</script><script src="http://scripts.chitika.net/eminimalls/amm.js" type="text/javascript"></script></div>

• Unix grep is not very robust with unicode, especially if you are calling it inside emacs on Windows, because the text has to go thru 2 layers of interface: ① the ported unix grep program, ② the Windows OS. In the process, the char encoding in the stream can get messed up. My search string usually has unicode chars. (e.g. Sample Unicode Characters.) For example, grep fails when searching for “│” (U+2502) when calling cygwin grep from emacs on Windows. It's too complex to figure out exactly why it fails.

• grep isn't robust with various encodings. You have to deal with “locale” and it's a headache. With emacs, i don't have to think about file encoding at all. The emacs environment automatically detects each file's encoding.

• grep doesn't deal with directories recursively very well. (there's -r, but on its own it doesn't let you restrict the search to a file pattern such as *.html. It is possible with GNU grep's --include option, shell file globs, or “find ... -exec” and xargs, but i find the trial-and-error loop of reading man pages for unix tools quite frustrating.)

• Sometimes you need to work on a list of files, sometimes on files matching a pattern (e.g. *.html), sometimes you want to exclude some files by list or by pattern, and sometimes you need a combination of the above in a specific order. Some unix tools provide these features, sometimes by a combination of tools (e.g. find/xargs), but their order and syntax are complex and tool-specific. With a script in perl, python, or elisp, it's much easier to control.

• There are too many versions and varieties of grep. The primary 2 are BSD and GNU. Mac OS X by default mostly has BSD versions of the unix tools, but some are GNU versions. This makes it very painful. Linuxes typically have all GNU versions. The different versions accept different options. Also, GNU grep, for example, supports several varieties of regex (“--basic-regexp”, “--extended-regexp”, “--perl-regexp”). It's too painful to figure them out and remember them.

• unix grep and associated tools (sort, wc, uniq, pipe, sed, awk, …) are not flexible. When your need is slightly more complex, unix shell tools can't handle it. For example, suppose you need to find a string in an HTML file only if the string occurs inside a particular tag. (pushing past the limits of unix tools is how Perl was born in 1987.)
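As an illustration of the last point, here is a minimal sketch of a search that counts a string only when it sits inside a given tag. The function name “my-count-inside-tag” is made up for this example, it works on a string rather than a file to stay short, and it assumes lowercase tags that are not nested inside a tag of the same name.

```elisp
;; Sketch only: my-count-inside-tag is a hypothetical helper, not part of
;; the script shown later on this page.
(defun my-count-inside-tag (text search-str tag)
  "Count literal occurrences of SEARCH-STR in TEXT inside <TAG ...>...</TAG>.
Assumes TAG elements are not nested inside each other."
  (let ((count 0)
        ;; Matches <tag> or <tag attr="...">, but not <tagfoo>.
        (open-re (concat "<" (regexp-quote tag) "\\(?:[ \t\n][^>]*\\)?>"))
        (close-str (concat "</" tag ">")))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      (let ((case-fold-search nil))     ; case-sensitive matching
        (while (re-search-forward open-re nil t)
          (let ((start (point))
                ;; End of the element body, or end of buffer if unclosed.
                (end (if (search-forward close-str nil t)
                         (match-beginning 0)
                       (point-max))))
            (goto-char start)
            ;; Count matches that lie entirely between START and END.
            (while (search-forward search-str end t)
              (setq count (1+ count)))
            (goto-char end)))))
    count))
```

For example, (my-count-inside-tag html "foo" "pre") would count only the “foo” that appear between <pre> and </pre>, where “html” is a string holding the page text.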

When writing a script in perl or python, you can always write it so the script works as a command line script that takes options like unix command line tools. Or, you can leave the script without a command line interface. When you need to run the script, you open it in an editor, modify the parameters, save, then run it.

I always prefer the latter, because that way i can edit the options much more comfortably, in an editor with full view instead of on the command line. I can also view whatever doc the script has in its header, instead of doing some confusing “-help” or “-h”, “--help” or “man ...” on the command line. And with emacs, i can run the script by a press of a key, among many other conveniences. Basically, a command line is nice if you are using other people's code, because it's a blackbox with a (somewhat) standardized command line interface. But for my custom text processing needs, i find that if i'm writing my own, i prefer not to add a command line interface, but to use the script together with emacs.

So, with my own script for grep (be it elisp or perl or python), i can make the script do exactly what i need, and it works everywhere emacs does.

(See: Python: Find & Replace and Perl: Find & Replace.)

Solution

The solution is quite simple actually. Here's a script i've been using for close to a year. I use it almost every day, on 5 thousand files.

Typically, i press one button to open the script, then edit the parameters for the search (the input dir, file extension filter, search string, plain text or regex, number of occurrences, etc.). Then i save the script and press another button to run it.

;; -*- coding: utf-8 -*-
;; 2010-03-27
;; print file names of files that have n occurrences of a string, of a given dir

;; input dir
(setq inputDir "~/web/xahlee_org/" )

;; add a ending slash if not there
;; in elisp, dir path should end with a slash
(when (not (string= "/" (substring inputDir -1) ))
  (setq inputDir (concat inputDir "/") )
  )

(defun my-process-file (fpath)
  "process the file at fullpath fpath ..."
  (let (mybuffer p1 p2 (ii 0) searchStr)

    (when t
      ;; (and (not (string-match "/xx" fpath)) ) ; exclude some dir

      ;; create a temp buffer. Work in temp buffer. Faster.
      (setq mybuffer (get-buffer-create " myTemp"))
      (set-buffer mybuffer)
      (insert-file-contents fpath nil nil nil t)

      (setq searchStr "(2) " )          ; search string here

      (goto-char 1)
      (while (search-forward searchStr nil t)
        (setq ii (1+ ii))
        )

      ;; report the file if the string occurs at least once
      ;; (change the test, e.g. to (= ii 3), to report an exact count instead)
      (if (not (= ii 0))
          (princ (format "this many: %d %s\n" ii fpath))
        )

      (kill-buffer mybuffer)
      )
    ))

;; traverse the dir

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*" )
  (with-output-to-temp-buffer outputBuffer 
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
  )

The code is pretty simple. At the bottom, the code visits every file under the dir whose name ends in “.html”. For each file, it calls (my-process-file fpath). The “my-process-file” function creates a temp buffer, inserts the file content into it, then searches inside the temp buffer. We do this because it's faster. (with a temp buffer, emacs doesn't do font-locking (which is rather resource intensive), keeps no “undo” record, and skips the other things emacs normally does when opening a file for interactive editing.)
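The same steps can also be written with the parameters passed in as arguments instead of edited in place. The following is a minimal sketch under a few assumptions, not a replacement for the script above: all the “my-...” names are hypothetical, it assumes Emacs 25 or later (for “directory-files-recursively” and the “seq” library), and it matches case-sensitively, whereas a plain “search-forward” follows the current value of “case-fold-search”.

```elisp
;; Sketch only. The my-* names below are hypothetical helpers.
(require 'seq)

(defun my-select-files (dir include-re exclude-re)
  "Return files under DIR whose file name matches INCLUDE-RE.
Files whose full path matches EXCLUDE-RE are dropped."
  (seq-remove (lambda (fpath) (string-match-p exclude-re fpath))
              (directory-files-recursively dir include-re)))

(defun my-count-string-in-file (fpath search-str)
  "Return how many times SEARCH-STR occurs literally in the file at FPATH."
  (let ((count 0))
    (with-temp-buffer
      (insert-file-contents fpath)
      ;; Bound inside the temp buffer so the search there is case-sensitive.
      (let ((case-fold-search nil))
        (goto-char (point-min))
        (while (search-forward search-str nil t)
          (setq count (1+ count)))))
    count))

(defun my-files-with-n-occurrences (files search-str n)
  "Return the members of FILES that contain SEARCH-STR exactly N times."
  (seq-filter (lambda (fpath)
                (= n (my-count-string-in-file fpath search-str)))
              files))
```

A call might look like (my-files-with-n-occurrences (my-select-files (expand-file-name "~/web/xahlee_org/") "\\.html\\'" "/xx/") "(2) " 1), where “/xx/” stands for whatever dir you want to exclude. Here “with-temp-buffer” takes care of creating and killing the temp buffer, so the explicit “get-buffer-create” and “kill-buffer” calls are not needed.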

To run the file, you can call “eval-buffer” or “load-file”. (i have “eval-buffer” aliased to just “eb”: (defalias 'eb 'eval-buffer). Actually, i just press a button to run the current file. See: Emacs Lisp: a Command to Execute/Compile Current File.)
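A “press a button to run the current file” command can be as small as the sketch below. This is only an illustration, not the command from the linked article: the name “my-run-current-file” and the F8 key are arbitrary choices.

```elisp
;; Sketch only: my-run-current-file and the F8 binding are hypothetical.
(defun my-run-current-file ()
  "Save the current buffer, then load its file as Emacs Lisp."
  (interactive)
  (unless buffer-file-name
    (user-error "This buffer is not visiting a file"))
  (save-buffer)
  (load-file buffer-file-name))

(global-set-key (kbd "<f8>") #'my-run-current-file)
```

With this in your init file, pressing F8 in the script's buffer saves it and runs it, which matches the edit, save, run cycle described above.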

The elisp idioms used in this script have been explained a few times in different places on this site. If you are not familiar with them, please see: Text Processing with Emacs Lisp Batch Style.

On 5k files, the script takes 30 seconds on my machine.

Emacs is fantastic!
