2011-02-18

Emacs Lisp: Using thing-at-point

Perm url with updates: http://xahlee.org/emacs/elisp_thing-at-point.html

Xah Lee, 2011-02-16, 2011-03-01

This page shows you how to use emacs lisp's thing-at-point function, discusses some of its problems, and suggests a solution.

Purposes of Elisp Code

Emacs lisp code is written for 2 major purposes.

  • ① Writing major/minor modes. (language modes, dired, bookmark, org-mode, irc, ftp, shell, etc.)
  • ② Text Processing.

For text processing, i also see 2 major categories:

  • ① Interactive commands.
  • ② Batch style scripts. (similar to typical Perl & Python scripts and sys admin jobs.)

Many emacs commands that are used every few minutes are for interactive text processing. Examples: “comment-dwim”, “fill-paragraph”, “query-replace”, “kill-rectangle”, “sort-lines”, “reverse-region”, “list-matching-lines”, “delete-trailing-whitespace”, “indent-region”, “just-one-space”, “delete-blank-lines”, “downcase-region”, “find-file-at-point” ….

Writing interactive commands is probably the most useful thing for beginning elisp coders. I have 100+ personal commands for interactive text processing. Typically, i press a key, then the text under cursor changes to another form. (See: Emacs Lisp Power! Transform Text Under Cursor.)

thing-at-point & bounds-of-thing-at-point

When writing interactive commands, one of the most useful functions is “thing-at-point”. Here's an excerpt from its online doc:

Return the THING at point. THING is a symbol which specifies the kind of syntactic entity you want. Possibilities include `symbol', `list', `sexp', `defun', `filename', `url', `email', `word', `sentence', `whitespace', `line', `page' and others.

“thing-at-point” basically lets you get the string of the current word, line, sentence, paragraph, file name, url, etc, that's under cursor. Without it, you typically have to write about 5 to 10 lines of code, using functions like “looking-at”, “search-forward”, “skip-chars-forward” to find the boundary. Then you need to save the positions in variables, then call “buffer-substring-no-properties” to get the string. You also need to wrap the whole thing with “save-excursion” so that the cursor does not jump to unexpected places when your command is called.
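As a rough illustration of the difference, the sketch below gets the word under cursor twice: once with “thing-at-point”, once by hand. The function names my-word-via-thing-at-point and my-word-by-hand are made up for this example, and the hand-written version assumes a “word” is just a run of ASCII letters.

```elisp
(require 'thingatpt)

(defun my-word-via-thing-at-point ()
  "Return the word under cursor using `thing-at-point', or nil if none."
  (thing-at-point 'word))

(defun my-word-by-hand ()
  "Return the run of ASCII letters around cursor.
Hypothetical helper, shown only for comparison."
  (save-excursion                       ; keep the cursor where it was
    (let (p1 p2)
      (skip-chars-backward "A-Za-z")    ; move to the left boundary
      (setq p1 (point))
      (skip-chars-forward "A-Za-z")     ; move to the right boundary
      (setq p2 (point))
      (buffer-substring-no-properties p1 p2))))
```

With the cursor anywhere inside “world” in the text “hello world”, both should return "world" in a buffer that uses the standard syntax table.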

An associated function is “bounds-of-thing-at-point”. It is useful when you also need to know a thing's boundary, for example to delete the thing (using (delete-region ‹point1› ‹point2›)) and replace it with some transformed string.
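Here's a minimal sketch of that delete-and-replace pattern. The command name my-upcase-word-at-point is made up for this example; it assumes you want to do nothing when there is no word under cursor (in which case “bounds-of-thing-at-point” returns nil).

```elisp
(require 'thingatpt)

(defun my-upcase-word-at-point ()
  "Replace the word under cursor with its upper case form.
Hypothetical example command. Does nothing if there is no word under cursor."
  (interactive)
  (let ((bds (bounds-of-thing-at-point 'word)))
    (when bds                           ; nil means no word here
      (let* ((p1 (car bds))
             (p2 (cdr bds))
             (newStr (upcase (buffer-substring-no-properties p1 p2))))
        (save-excursion
          (delete-region p1 p2)         ; remove the old text
          (goto-char p1)
          (insert newStr))))))          ; put the transformed text in its place
```

The boundary is a cons pair, so “car” gives the beginning position and “cdr” gives the end position.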

Problems of thing-at-point

After using “thing-at-point” for several years, i started to get slightly annoyed by some of its problems. Here are the problems i see:

Behavior Dependent on Syntax Table

When you call (thing-at-point 'word), what string you get exactly depends on the syntax table of the current mode.

For example, if you always want your “word” to mean any alphanumeric plus hyphen, you can't rely on “thing-at-point” to give you the right thing, because it may include underscore, or may not include hyphen, or may include apostrophe, depending on the current major mode's syntax table.
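To see the effect, the sketch below asks for the word under cursor in the text “foo_bar” under two syntax tables that differ only in how they classify underscore. The function name my-word-with-underscore-syntax is made up for this illustration.

```elisp
(require 'thingatpt)

(defun my-word-with-underscore-syntax (syntaxClass)
  "Return (thing-at-point 'word) inside the text foo_bar.
SYNTAXCLASS is the syntax descriptor given to underscore, such as
\"w\" (word constituent) or \"_\" (symbol constituent).
Hypothetical helper, for illustration only."
  (with-temp-buffer
    ;; Use a fresh table that inherits from the standard one, so the
    ;; standard syntax table itself is not modified.
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?_ syntaxClass table)
      (set-syntax-table table))
    (insert "foo_bar")
    (goto-char 2)                       ; cursor inside "foo"
    (thing-at-point 'word)))
```

Calling (my-word-with-underscore-syntax "w") should give "foo_bar", while (my-word-with-underscore-syntax "_") should give "foo". Same text, same cursor position, different “word”.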

A similar problem applies to “'sentence” and “'paragraph”, whose boundaries depend on mode-specific settings such as “sentence-end” and “paragraph-start”.

Inconsistent Behavior for 'line

When you call (thing-at-point 'line), it returns the line including the end of line (eol) character. However, if the line is the last line of the buffer and has no eol, the returned string has none either.

This means you have to write extra code to add or truncate the last char of the line.

It's actually simpler if you just write your own code to grab the line, like this: (buffer-substring-no-properties (line-beginning-position) (line-end-position)).
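The following sketch shows the inconsistency side by side, using a 2-line buffer with no eol after the last line. The names my-line-without-eol and my-line-demo are made up for this example.

```elisp
(require 'thingatpt)

(defun my-line-without-eol ()
  "Return the current line as a string, never including the eol character."
  (buffer-substring-no-properties (line-beginning-position) (line-end-position)))

(defun my-line-demo ()
  "Return what both approaches give on the first and on the last line.
The result is a list of 2 lists, each of the form
\(thing-at-point-result my-line-without-eol-result)."
  (with-temp-buffer
    (insert "first\nlast")              ; last line has no eol
    (let (onFirst onLast)
      (goto-char 3)                     ; inside "first"
      (setq onFirst (list (thing-at-point 'line) (my-line-without-eol)))
      (goto-char 9)                     ; inside "last"
      (setq onLast (list (thing-at-point 'line) (my-line-without-eol)))
      (list onFirst onLast))))

;; (my-line-demo) → (("first\n" "first") ("last" "last"))
```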

Test Code

Here's a simple test command to see what “thing-at-point” returns.

(defun xx ()
  "Temp function for testing what `thing-at-point' returns."
  (interactive)
  (let ((myresult (thing-at-point 'url)))
    (message "〔%s〕" myresult)))

Get Text Selection or Unit at Current Cursor Position

Starting with emacs 23.x, text selection is highlighted by default. (This means transient-mark-mode is on by default. See: Emacs: What's Region, Active Region, transient-mark-mode?.) This enables a new user interface idiom: when there is a text selection, the command acts on the text selection; otherwise, it acts on the current word, line, paragraph, buffer, …, whichever is appropriate for the command. This is great because the user doesn't have to think about whether to call the “-region” version of a command. (See: New Features in Emacs 23.)

When you write a command to do this, the code typically looks like this:

;; get current selection or word
(let (bds p1 p2 inputStr resultStr)

  ;; get boundary (bds is nil if there is no selection and no word under cursor)
  (if (region-active-p)
      (setq bds (cons (region-beginning) (region-end)))
    (setq bds (bounds-of-thing-at-point 'word)))
  (setq p1 (car bds))
  (setq p2 (cdr bds))

  ;; grab the string
  (setq inputStr (buffer-substring-no-properties p1 p2))

  ;; do something with inputStr here, and put the result in resultStr
  (setq resultStr inputStr) ; placeholder: leaves the text unchanged

  (delete-region p1 p2) ; delete the region
  (insert resultStr)) ; insert new string

It takes about 6 lines to get the boundary and the string. If you are grabbing a line, then you need a few more lines to check the eol.
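Put together as a complete command, the idiom may look like the sketch below. The command name my-squeeze-spaces is made up for this example. It acts on the text selection if there is one, else on the current line, and it takes the line boundary from “line-beginning-position” and “line-end-position” so that no eol check is needed.

```elisp
(defun my-squeeze-spaces ()
  "Replace each run of spaces with 1 space.
Acts on the text selection if there is one, else on the current line.
Hypothetical example command."
  (interactive)
  (let* ((useRegion (region-active-p))
         (p1 (if useRegion (region-beginning) (line-beginning-position)))
         (p2 (if useRegion (region-end) (line-end-position)))
         (inputStr (buffer-substring-no-properties p1 p2))
         (resultStr (replace-regexp-in-string " +" " " inputStr)))
    (save-excursion
      (delete-region p1 p2)             ; delete the old text
      (goto-char p1)
      (insert resultStr))))             ; insert the transformed text
```

Only the 3 lines that compute useRegion, p1, and p2 change from command to command, which is why this part is a good candidate for a shared helper function.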

Alternative Solution

Because i need to grab the text so often, i got tired of repeatedly writing these 10 or so lines. I wrote a function that does this. Here it is.

(defun get-selection-or-unit (unit)
  "Return the string and boundary of text selection or UNIT.

Returns a vector [text a b], where text is the string and a and b are its boundary.

If `region-active-p' is true, then the region is the unit.
Else, it depends on the UNIT.
UNIT can be
'word: sequence of 0 to 9, A to Z, a to z, and hyphen.
'glyphs: sequence of visible glyphs. Useful for file name, url, …, that doesn't have spaces in it.
'line: delimited by “\\n”.
'block: delimited by “\\n\\n” or beginning/end of buffer.
'buffer: whole buffer. (respects `narrow-to-region')

Example usage:
    (setq bds (get-selection-or-unit 'line))
    (setq myText (elt bds 0) pBegin (elt bds 1) pEnd (elt bds 2)  )

This function is similar to `bounds-of-thing-at-point'.
They are different in the following ways:
• this function takes a text selection if there's one.
• 'line always returns the line without end of line character, avoiding inconsistency when the line is at end of buffer.
• 'word does not depend on syntax table.
• 'block does not depend on syntax table.
"
  ;; No (interactive) form here: UNIT is a required argument, so this
  ;; function is meant to be called from other commands, not by M-x.
  (let (p1 p2)
    (if (region-active-p)
        (setq p1 (region-beginning)
              p2 (region-end))
      (save-excursion
        (cond
         ((eq unit 'word)
          ;; digits are included, to match the docstring
          (skip-chars-backward "-A-Za-z0-9")
          (setq p1 (point))
          (skip-chars-forward "-A-Za-z0-9")
          (setq p2 (point)))

         ((eq unit 'glyphs)
          (skip-chars-backward "[:graph:]")
          (setq p1 (point))
          (skip-chars-forward "[:graph:]")
          (setq p2 (point)))

         ((eq unit 'buffer)
          (setq p1 (point-min)
                p2 (point-max)))

         ((eq unit 'line)
          (setq p1 (line-beginning-position)
                p2 (line-end-position)))

         ((eq unit 'block)
          ;; block begins just after the previous “\n\n”, else at buffer beginning
          (if (re-search-backward "\n\n" nil t)
              (progn (forward-char 2)
                     (setq p1 (point)))
            (setq p1 (point-min)))
          ;; block ends just before the next “\n\n”, else at buffer end
          (if (re-search-forward "\n\n" nil t)
              (progn (backward-char 2)
                     (setq p2 (point)))
            (setq p2 (point-max)))))))

    (vector (buffer-substring-no-properties p1 p2) p1 p2)))

Here are some conveniences of this function.

  • This function takes a text selection if there's one.
  • It returns a vector, containing the string and also its boundary. So, it saves 3 lines of manually extracting the string when you also need the boundary.
  • 'line always returns the line without end of line character, avoiding inconsistency when the line is at end of buffer.
  • 'word does not depend on syntax table.
  • 'block does not depend on syntax table. It's always delimited by a blank line (2 consecutive newline characters) or beginning/end of buffer. Similar to the concept of 'paragraph.
  • 'glyphs is a sequence of visible glyphs. Useful for getting file path, url, ….

This is an initial version. There are probably a lot of features and fixes to be done. The code for 'word needs to be improved so it will also take words with accented characters (e.g. passé). Possibly, the word semantics can be the same as glyphs.
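One possible direction for the accented-character problem, offered as a sketch and not a finished fix: “skip-chars-forward” and “skip-chars-backward” accept character classes, so “[:alnum:]” can stand in for the explicit ranges. In recent emacs versions this class is defined by Unicode character properties for non-ASCII characters, so it should not depend on the syntax table; older versions may behave differently. The function name my-word-bounds is made up for this example.

```elisp
(defun my-word-bounds ()
  "Return (BEGIN . END) of the run of letters, digits, and hyphens around cursor.
Hypothetical helper. Uses the [:alnum:] character class instead of
explicit ASCII ranges, so that accented letters count as part of a word."
  (save-excursion
    (let (p1 p2)
      (skip-chars-backward "-[:alnum:]") ; leading hyphen is a literal hyphen
      (setq p1 (point))
      (skip-chars-forward "-[:alnum:]")
      (setq p2 (point))
      (cons p1 p2))))
```

With the cursor inside “passé” in the text “un passé simple”, the returned boundary should cover the whole word including the “é”.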

I'm gradually replacing all my ~100 personal commands to use this function.