Emacs Lisp: Writing a Command to Extract URL

Perm url with updates: http://xahlee.org/emacs/elisp_extract_url_command.html

This page shows you how to write an Emacs Lisp command to extract all URLs in an HTML file. If you don't know elisp, first take a look at Emacs Lisp Basics.

Problem Description

Write a command “extract-url”. When called, all URLs in a text selection will be listed in a separate buffer, one per line.

For example, suppose you have this text:

<div>1, <a href="iraq_pixra2.html">2</a>, <a href="http://en.wikipedia.org/wiki/Idiom">Idiom</a>, <a href="iraq_pixra3.html">3</a></div>

After calling the command, a separate buffer shows this text:

iraq_pixra2.html
http://en.wikipedia.org/wiki/Idiom
iraq_pixra3.html

Solution

There are many ways to code this. Here's one:

(defun extract-url (p1 p2)
  "Print all URLs in region P1 P2.

When called interactively, use text selection as input, or
current paragraph.

Output in a buffer named 「*extract URL output*」.

This function assumes the input is HTML text.  WARNING: this
function extracts only text of the form 「href=\"…\"」.  It
does not extract {「href='…'」, 「src='…'」}, nor does it check
whether the match is inside a proper tag or in plain text."
  (interactive
   (if (region-active-p)
       (list (region-beginning) (region-end))
     (let ((bds (bounds-of-thing-at-point 'paragraph)))
       (list (car bds) (cdr bds)) ) ) )
  (let ( p3 p4 urlStr)
    (with-output-to-temp-buffer "*extract URL output*"
      (save-excursion 
        (save-restriction 
          (narrow-to-region p1 p2)
          (goto-char (point-min))
          (while
              (re-search-forward "href" nil t)
            (search-forward "=" nil t)
            (search-forward "\"" nil t)
            (setq p3 (point))
            (search-forward "\"" nil t)
            (setq p4 (- (point) 1))
            (setq urlStr (buffer-substring-no-properties p3 p4))
            (princ urlStr)
            (terpri) ) ) ) ) ))

Here's how it works.

Using “interactive” to get Arguments

First, note that the function takes 2 arguments, p1 and p2, which are the beginning and end positions of a region in the buffer.

When called interactively, we want {p1, p2} to be the boundaries of the text selection if there's one, or of the current paragraph otherwise.

Emacs's “interactive” form is the way a command gets its arguments when called interactively. When it is given a lisp expression (instead of a code string), that expression just needs to return a list. Emacs will use the list elements as the arguments. In our case, it's done by:

(interactive
   (if (region-active-p)
       (list (region-beginning) (region-end))
     (let ((bds (bounds-of-thing-at-point 'paragraph)))
       (list (car bds) (cdr bds)) ) ) )

(See: Emacs Lisp: Using thing-at-point.)
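One detail worth guarding against: “bounds-of-thing-at-point” returns nil when it finds no such thing at point, and then (car bds) would silently pass nil as a position. Here is a minimal sketch of the same pattern with that case handled. The names “my-region-or-paragraph-bounds” and “my-count-chars” are hypothetical, made up for illustration only.

```emacs-lisp
(require 'thingatpt)

(defun my-region-or-paragraph-bounds ()
  "Return a list (START END) for the active region, else the paragraph at point.
Hypothetical helper, not part of the original command."
  (if (region-active-p)
      (list (region-beginning) (region-end))
    (let ((bds (bounds-of-thing-at-point 'paragraph)))
      ;; bounds-of-thing-at-point returns nil when nothing is found.
      (unless bds
        (user-error "No paragraph at point"))
      (list (car bds) (cdr bds)))))

(defun my-count-chars (p1 p2)
  "Report the number of characters between P1 and P2."
  ;; The expression given to `interactive' must evaluate to a list;
  ;; its elements become the arguments P1 and P2.
  (interactive (my-region-or-paragraph-bounds))
  (message "%d characters" (- p2 p1)))
```

Called from lisp code, (my-count-chars 1 5) uses the positions you pass. Called with M-x, the positions come from the helper.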

Output to Separate Buffer

To output to a separate buffer, we use “with-output-to-temp-buffer” and “princ”, like this:

(with-output-to-temp-buffer "*extract URL output*"
…
(princ urlStr)
…
 )

Inside (with-output-to-temp-buffer «buffer» …), any printing function will print to that buffer. The printing function we used is “princ”, which prints lisp objects in a human readable form. (See: Emacs Lisp: print, princ, prin1, format, message.)
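As a small standalone illustration of the same idea, here is a sketch that prints a list of strings, one per line, to a temp buffer. The function name “my-print-lines” and the buffer name 「*my demo output*」 are hypothetical, chosen just for this example.

```emacs-lisp
(defun my-print-lines (strings)
  "Print each element of STRINGS on its own line in 「*my demo output*」.
Hypothetical helper for illustration."
  (with-output-to-temp-buffer "*my demo output*"
    (dolist (s strings)
      (princ s)     ; human readable: a string prints without quotes
      (terpri))))   ; end the line

(my-print-lines '("iraq_pixra2.html" "http://en.wikipedia.org/wiki/Idiom"))
```

If you used “prin1” instead of “princ” here, each string would be printed with its surrounding double quotes, which is not what we want for a list of URLs.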

Extracting URL

To extract the URLs, there are many approaches. Here, we just search for “href”, then search for text enclosed in straight double quotes, mark the boundaries, then print the text between them. Like this:

(while
    (re-search-forward "href" nil t)
  (search-forward "=" nil t)
  (search-forward "\"" nil t)
  (setq p3 (point))
  (search-forward "\"" nil t)
  (setq p4 (- (point) 1))
  (setq urlStr (buffer-substring-no-properties p3 p4))
  (princ urlStr)
  (terpri) )
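Note that the loop above ignores the return values of the inner “search-forward” calls. If the text ends in the middle of a tag, for example href="abc with no closing quote, the last search fails, point does not move, and a stray quote character is printed. Here is a sketch of the same marking technique with every search checked. It assumes you want a list returned rather than printed, and the name “my-collect-href-values” is hypothetical.

```emacs-lisp
(defun my-collect-href-values (p1 p2)
  "Return a list of double-quoted href values between P1 and P2.
Hypothetical helper.  Stops at the first href whose quotes are not found."
  (let (p3 urls)
    (save-excursion
      (save-restriction
        (narrow-to-region p1 p2)
        (goto-char (point-min))
        ;; `and' stops at the first search that fails, ending the loop.
        (while (and (search-forward "href" nil t)
                    (search-forward "=" nil t)
                    (search-forward "\"" nil t)
                    (setq p3 (point))            ; just after opening quote
                    (search-forward "\"" nil t)) ; just after closing quote
          (push (buffer-substring-no-properties p3 (1- (point))) urls))))
    ;; `push' builds the list in reverse, so restore the original order.
    (nreverse urls)))
```

To print the result the same way as before, loop over the returned list with “princ” and “terpri”.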

Simple Solution

Note that this solution is very simple and practically useful, but it isn't a fully correct solution. For example, it does not get URLs enclosed in 'single straight quotes'. Nor does it get URLs in the “src” attribute of “img” or “script” tags, such as src="http://xahlee.org/emacs/…".

You can easily modify it by adding more “while” blocks, changing the double quote to a single quote, and “href” to “src”. However, with that approach the URLs won't be listed in the order they appear in the text.
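One possible way to keep the URLs in order is to do a single pass with one regex that matches either attribute name and either quote style. The following is a rough sketch under that assumption, not a robust solution: it still does not check that the match is inside a real tag, and it would also match a name such as “data-href”. The function name “my-extract-url-list” is hypothetical.

```emacs-lisp
(defun my-extract-url-list (p1 p2)
  "Return a list of href/src attribute values between P1 and P2, in order.
Hypothetical sketch.  Handles both \"…\" and '…' quoting."
  (let ((case-fold-search t)   ; also match HREF, Src, etc.
        (urls '()))
    (save-excursion
      (save-restriction
        (narrow-to-region p1 p2)
        (goto-char (point-min))
        (while (re-search-forward
                ;; group 1: double-quoted value, group 2: single-quoted value
                "\\b\\(?:href\\|src\\)[ \t\n]*=[ \t\n]*\\(?:\"\\([^\"]*\\)\"\\|'\\([^']*\\)'\\)"
                nil t)
          ;; Only one of the two groups matched; the other is nil.
          (push (or (match-string-no-properties 1)
                    (match-string-no-properties 2))
                urls))))
    (nreverse urls)))
```

Because there is only one loop moving forward through the text, the values come out in the order they appear, whichever attribute or quote style each one uses.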

Also, if you have plain text such as href = "34";, outside of any tag, it will also grab the “34” as a URL.

A robust solution will require a few hundred lines.

If you have get-selection-or-unit installed, you can replace the (interactive …) part by:

(interactive
   (let ((bds (get-selection-or-unit 'block)))
     (list (elt bds 1) (elt bds 2) ) ) )

This improves the code by selecting a block of text delimited by empty lines. It saves you a few keys to select a region. Emacs's “paragraph” concept as used by “thing-at-point” depends on the variables “paragraph-start” and “paragraph-separate”, which each major mode sets in its own way, so the result may not be useful. For example, paragraph in “html-mode” is very weird.
