2011-02-07

How to Write grep in Emacs Lisp

Perm url with updates: http://xahlee.org/emacs/elisp_grep_script.html

How to Write grep in Emacs Lisp

Xah Lee, 2011-02-07

This page shows a real-world example of a emacs lisp script that search files, similar to unix grep. If you don't know elisp, first take a look at Emacs Lisp Basics.

The Problem

Summary

I want to write a elisp script that reports files in a dir that contain a string n times. The script is expected to search thru 5 thousand files.

Detail

Why can't i just use grep? Because:

• Often, my search string is long, containing 300 hundred chars or more. (e.g. a snippet of HTML that contains javascript and span multi-lines.) You could put your search string in a file with grep, but it is not convenient. Here's a example of a string i need to search:

<div class="chtk"><script type="text/javascript">ch_client="polyglut";ch_width=550;ch_height=90;ch_type="mpu";ch_sid="Chitika Default";ch_backfill=1;ch_color_site_link="#00C";ch_color_title="#00C";ch_color_border="#FFF";ch_color_text="#000";ch_color_bg="#FFF";</script><script src="http://scripts.chitika.net/eminimalls/amm.js" type="text/javascript"></script></div>

• Unix grep is not very robust with unicode. Especially so if you are calling it inside emacs on Windows, because it has to go thru 2 layers of interface: ① the ported unix grep program. ② the Windows OS. In the process, the char encoding in the stream can be messed up. My search string usually has unicode chars. (e.g. Sample Unicode Characters.) For example, grep fails when searching for “│” (UTF+2502). This is calling cygwin grep from emacs on Windows. It's too complex to figure out exactly why it fails.

• grep isn't robust with various encoding. You have to deal with “locale” and it's a headache. With emacs, i don't have to think about file encoding at all. The emacs environment automatically detect file's encoding.

• grep can't really deal with directories recursively. (there's -r, but then you can't specify file pattern (e.g. *\.html) (it is possible, with shell file globs or “find ... -exec, xargs”, but i find it quite frustrating to trial error man page loop with unix tools.))

• Sometimes you need to work on a list of files, sometimes by a pattern (e.g. *\.html), sometimes you want to exclude some files by list or by pattern, sometimes a combination of the above in a specific order. Some unix tools provide these features, sometimes by combination of tools (e.g. find/xargs), but their order and syntax is complex and tool specific. With a script in perl, python, elisp, it's much easier to control.

• There are too many versions and varieties grep. The primary 2 are BSD vs GNU. Mac OS X by default mostly has bsd versions, but some are GNU versions. This makes it very painful. Linuxes typically has all GNU versions. The different versions accept different options. Also, GNU grep for example, support a varieties of regex (“--basic-regexp”, “--extended-regexp”, “--perl-regexp”.) It's too painful to figure them out and remember them.

• unix grep and associated tool (sort, wc, uniq, pipe, sed, awk, …) is not flexible. When your need is slightly more complex, unix shell tools can't handle it. For example, suppose you need to find a string in HTML file only if the string happens inside another tag. (extending the limit of unix tools is how Perl was born in 1987.)

When writing a script in perl or python, you can always write it so the script works as a command line script that takes options like unix command line tools. Or, you can leave the script without a command line interface. When you need to run the script, you open it with a editor, modify the parameters, save, then run it.

Ι always prefer the latter. Because, that way i can edit the options much more comfortably, in a editor with full view instead of the command line. I can also view whatever doc the script has in the header, instead of doing some confusing “-help” or “-h”, “--help” or “man ...” in the command line. And with emacs, i can run the script by a press of a key, and much other conveniences. Basically, a command line is nice if you are using other's code because it's a blackbox with a (somewhat) standardize command line interface. But for my custom text processing needs, i find that if i'm writing my own, i prefer not to add command line interface, but use it together with emacs.

So, with my own script for grep (may it be elisp or perl or python), i can make the script do exactly what i need and works everywhere with emacs.

(See: Python: Find & ReplacePerl: Find & Replace.)

Solution

The solution is quite simple actually. Here's a script i've been using close to a year. I use it almost everyday, on 5 thousand files.

Typically, i press one button to open the script. Edit the parameters i want to search. (the input dir, file extension filter, search string, plain text or regex, number of occurance, etc.) Then, save the script. Press another button to run it.

;; -*- coding: utf-8 -*-
;; 2010-03-27
;; print file names of files that have n occurrences of a string, of a given dir

;; input dir
(setq inputDir "~/web/xahlee_org/" )

;; add a ending slash if not there
;; in elisp, dir path should end with a slash
(when (not (string= "/" (substring inputDir -1) ))
  (setq inputDir (concat inputDir "/") )
  )

(defun my-process-file (fpath)
  "process the file at fullpath fpath ..."
  (let (mybuffer p1 p2 (ii 0) searchStr)

    (when t
      ;; (and (not (string-match "/xx" fpath)) ) ; exclude some dir

      ;; create a temp buffer. Work in temp buffer. Faster.
      (setq mybuffer (get-buffer-create " myTemp"))
      (set-buffer mybuffer)
      (insert-file-contents fpath nil nil nil t)

      (setq searchStr "(2) " )          ; search string here

      (goto-char 1)
      (while (search-forward searchStr nil t)
        (setq ii (1+ ii))
        )

      ;; report if the occurance is not n times
      (if (not (= ii 0))
          (princ (format "this many: %d %s\n" ii fpath))
        )

      (kill-buffer mybuffer)
      )
    ))

;; traverse the dir

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*" )
  (with-output-to-temp-buffer outputBuffer 
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
  )

The code is pretty simple. At the bottom, the code visits every file in a dir. For each file, it calls (my-process-file fpath). The “my-process-file” creates a temp buffer, paste the file content in it, then do search inside the temp buffer. We do this because it's faster. (with temp buffer, emacs doesn't do font-locking (which is rather resource intensive), and no “undo”, or any other thing emacs normally do when opening a file for interactive edit.)

To run the file, you can call “eval-buffer” or “load-file”. (i have “eval-buffer” aliased to just “eb”. ((defalias 'eb 'eval-buffer)) Actually, i just press a button to run the current file. See: Emacs Lisp: a Command to Execute/Compile Current File.)

The elisp idioms used in this script have been explained a few times in different places in this site. If you are not familiar, please review at: Text Processing with Emacs Lisp Batch Style.

On 5k files, the script takes 30 seconds on my machine.

Emacs is fantastic!

scientist confirm the possibility of seeing the future

Recently, in the academic psychology community, there's hot discussion about a psychologist (Daryl Bem) who made experiments that seem to confirm the possibility of some parapsychology phenomenon. e.g. seeing the future, knowing the past, reading other's minds, etc. I'd say, don't pay attention to it. But if you do have a interest in this matter, i recommend reading this article:

Back from the Future: Parapsychology and the Bem Affair (2011-01-06) By James Alcock. @ Source www.csicop.org

2011-02-06

Google YouTube fixes invalid embed code

2 weeks ago i reported that the embed video code handed out by YouTube contains a invalid attribute type="text/html". Ι wrote about it here: HTML Validation, Google, Amazon, and also asked about at stackoverflow.com, also posted the question to YouTube forum at Source www.google.com.

Amazingly, Google fixed it! Now the embed code no longer contains type="text/html". Yay!

Lady Gaga - Bad Romance & 王彩樺 - 鋩鋩角角

“Lady Gaga - Bad Romance” amazon
王彩樺 鋩鋩角角 熱舞版

for more info, lyrics, see http://xahlee.org/music/bad_romance.html

unicode support in langs

major update: Unicode Support in Ruby, Perl, Python, javascript, Java, Emacs Lisp, Mathematica

2011-02-05

T-ara - Bo Peep Bo Peep & 王彩樺 - 保庇

“T-ara - Bo Peep Bo Peep” amazon
“保庇 (BOBEE)” - 王彩樺

for lyrics, info, videos of other versions, see: http://xahlee.org/music/bo_peep.html

Guy Steele on Parallel Programing

Perm url with updates: http://xahlee.org/comp/Guy_Steele_parallel_computing.html

Guy Steele on Parallel Programing

Xah Lee, 2011-02-05, 2011-02-06

A fascinating talk by the computer scientist Guy Steele. (popularly known as one of the author of Scheme Lisp)

How to Think about Parallel Programming: Not! (2011-01-14) By Guy Steele. @ Source www.infoq.com

The talk is a bit long, at 70 minutes. The first 26 minutes he goes thru 2 computer programs written for 1970s machines. It's quite interesting to see how software on punch card works. For most of us, we've never seen a punch card. He goes thru it “line by line”, actually “hole by hole”. Watching it, it gives you a sense of how computers are like in the 1970s.

At 00:27, he starts talking about “automating resource management”, and quickly to the main point of his talk, as in the title, about what sort of programing paradigms are good for parallel programing. Here, parallel programing means solving a problem by utilizing multiple CPU or nodes (as in clusters or grid). This is important today, because CPU don't get much faster anymore; instead, each computer is getting more CPU (multi-core).

In the rest 40 min of the talk, he steps thru 2 programs that solve a simple problem of splitting a sentence into words. First program is typical procedural style, using do-loop (accumulator). The second program is written in his language Fortress, using functional style. He then summarizes a few key problems with traditional programing patterns, and itemize a few functional programing patterns that he thinks is critical for programing languages to automate parallel computing.

In summary, as a premise, he believes that programers should not worry about parallelism at all, but the programing language should automatically do it. Then, he illustrates that there are few programing patterns that we must stop using, because if you do write your code in such paradigm, then it would be very hard to parallelize the code, either manually or by machine AI.

If you are a functional programer and read FP news in the last couple of years, his talk doesn't seem much new. However, i find it very interesting, because:

  • ① This is the first time i see Guy Steele talk. He talks slightly fast. Very bright guy.
  • ② The detailed discussion of punch card code on 1970s machine is quite a eye-opener for those of us who haven't been there.
  • ③ You get to see Fortress code, and its use of fancy unicode chars.
  • ④ Thru the latter half of talk, you get a concrete sense of some critical “do's & don'ts” in coding paradigms about what makes automated parallel programing possible or impossible.

In much of 2000s, i did not understand why compilers couldn't just automatically do parallelism. My thought have always been that a piece of code is just a concrete specification of algorithm and machine hardware is irrelevant. But i realized, that any concrete specification of algorithm is specific to a particular hardware, by nature. (See: Why Must Software Be Rewritten For Multi-Core Processors?.)

Parallel Computing vs Concurrency Problems

Note that parallel programing and concurrency problem are not the same thing.

Parallel programing is about writing code that can use multiple CPU. Concurrency problems is about problems of concurrent computations using the same resource or time (e.g. race condition, file locking). See: Parallel computingConcurrency (computer science)

Elegant Functional Style ≠ Automatic Parallelism

I've been doing functional programing for about 17 years (since 1993), pretty much ever since i started programing with my first language Mathematica. One paradigm i use frequently, is sequencing functions, aka chaining functions, piping, fold, nest. In some sense, most of my programs is one single sequence of function chain. The input is feed to a function, then its output to another, then another, until the last function spits out the desired output. It is extremely elegant, no state whatsoever, and often i even take the extra mile to avoid using any local temp variables or constants in my functions. But after watching this video, i learned, that function chaining is NOT good for parallelism, despite it is considered a very elegant functional paradigm.

As a functional programer, typically we are familiar with all the FP style constructs, and we strive for such elegance. The important thing i learned from this video is that, elegance in functional style does not mean it will be good for parallelism. Similarly, software written with purely functional language such as Haskell, does not imply it'll be good for parallelism. (but will be a lot better than procedural langs, of course.) FP and Automatic Parallelism certainly share some aspects, such as declarative code and not maintaing states, but is not the same issue.

Here's a few properties of functions he think it's important for automatic parallelism . (if you haven't watched the whole video carefully, you probably won't understand it because it's out of context. I'm listing them here for myself too)

  • Associativity . (grouping doesn't matter) e.g. ((a ⊕ b) ⊕ c) == (a ⊕ (b ⊕ c))
  • Commutativity . (order doesn't matter.) e.g. (a ⊕ b) == (b ⊕ a)
  • Idempotent
  • Identity
  • Zero

Here's things you should avoid, from his slide:

  • DO loops.
  • linear linked list
  • Java style iterator.
  • even arrays are suspect.
  • as soon as you say “fist, SUM = 0” you are hosed.
  • accumulators are BAD. They encourage sequential dependence and tempt you to use non-associative updates.
  • If you say, “process sub-problems in order,” you lose.
  • The great tricks of the sequential past WON'T WORK.
  • The programing idioms that have become second nature to us as everyday tools for the last 50 years WON'T WORK.

Conclusion:

  • A program organized according to linear problem decomposition principles can be really hard to parallelize. (e.g. accumulator, do loop)
  • A program organized according to independence and divide-and-conquer principles is easy to run in parallel or sequentially, according to available resources.
  • The new strategy has costs and overheads. They will be reduced over time but will not disappear.
  • in a world of parallel computers of wildly varying sizes, this is our only hope for program portability in the future.
  • Better language design can encourage “independent thinking” that allows parallel computers to run programs effectively.

Here's a paper written by Guy Steele in 2009, which kinda summarize the points in this talk. (requires lisp knowledge to understand)

Organizing Functional Code for Parallel Execution “or, foldl and foldr Considered Slightly Harmful”. (2009-08) Guy Steele. @ Source labs.oracle.com

Fortress & Unicode

It's interesting that Fortress language freely uses unicode chars. For example, list delimiters are not the typical curly bracket {1,2,3} or square bracket [1,2,3], but the unicode angle bracket ⟨1,2,3⟩. (See: Matching Brackets in Unicode.) It also uses the circle plus ⊕ as operator. (See: Math Symbols in Unicode.)

I really appreciate such use of unicode. The tradition of sticking to the 95 chars in ASCII of 1960s is extremely limiting. It creates complex problems manifested in:

All these problems occur because we are jamming so many meanings into about 20 symbols in ASCII.

See also:

Also, most of today's languages do not support unicode in function or variable names, so you can forget about using unicode in variable names (e.g. α=3) or function names (e.g. “lambda” as “λ” or “function” as “ƒ”), or defining your own operators (e.g. “⊕”). (Exceptions i know that works well in practice are: Mathematica, Javascript, Java, Emacs Lisp. See: Unicode Support in Ruby, Perl, Python, Emacs Lisp.)

2011-02-03

mandelbrot set simply explained with video

Perm url with updates: http://xahlee.org/cmaci/fractal/mandelbrot.html

Mandelbrot Set Basic Tutorial (no complex number needed)

Xah Lee, 2006-05, 2011-02-03

This page gives a easy-to-understand explanation of the Mandelbrot Set fractal without using complex numbers.

Formula

The Mandelbrot Set is defined as follows. Define a function f to be this formula:

f[{a_, b_}] := {a*a - b*b + c1, 2*a*b + c2}

The a, b, c1, c2 are (real) numbers, and c1 and c2 are fixed numbers (constants).

For example, if we set c1=0 and c2=0, then the formula becomes:

f[{a_,b_}] := {a * a - b * b , 2 * a * b}

and then f[{3,4}] evaluates to {-7,24}.

The input is a pair of numbers, and output is also a pair of numbers. You can plot them as points on the plane. You can think of these numbers as vectors.

Iteration of Formula

For each point {c1,c2} in the plane, let's compute the recursion (nesting) of f, starting inital input {0,0}. That is, we compute f[f[f[…f[{0,0}]…]]].

If this infinite recursion creates a resulting point further and further from the point {0,0} in a way that it eventually can pass any circle centered on {0,0} however large, then we say that the number {c1,c2} escapes. We define the Mandelbrot set as the set of points that do NOT escape.

For example, suppose {c1,c2} is {3,4}. So, the first iteration is to compute f[{0,0}]. We get {3,4}. Then we compute f[{3,4}], we get {-4,28}. Then f[{-4,28}] is {-765,-220}. The sequence we get for first 5 steps this:

{ {3, 4}, {-4, 28}, {-765, -220}, {536828, 336604}, {174882048771, 361396904228}}

By looking at the formula, you can see that it just gets larger and larger. So, the point {3,4} escapes, thus is NOT a member of the Mandelbrot set.

Now, if we set {c1,c2} to be {0.1,0.1}, the first 5 steps is:

{ {0.1, 0.1}, {0.1, 0.12}, {0.0956, 0.124}, {0.0937634, 0.123709}, {0.0934877, 0.123199}}

The first number clearly decreases slowly, but the second number increases slowly. One can prove that if we continue the computation forever, the point does NOT escape, thus {0.1,0.1} is a point in the Mandelbrot set.

Plot of Mandelbrot Set

To plot the set, we have to compute all possible point in the plane. For each point {c1,c2}, we have to compute f infinitely many times. For practical purposes, let's just say we do one thousand steps for each point. If it gets large, we color the point white, otherwise black.

Here's a plot. The black colored region are points in the Mandelbrot set.

the Mandelbrot set

A plot of the mandelbrot set.

You get this weird-looking ass-shaped thing. And, see those weird-looking jagged edges? That's the fascinating aspect about the Mandelbrot set. If you zoom in, you'll see that this weird-looking jagged edge is never smooth no matter what magnification, and the pattern of this ass-shape seem to repeat at random places forever, yet not in some particular predictable ordered fashion.

In the above animation, you can see this weird pattern. The Mandelbrot set is usually plotted with color. A typical coloring scheme (used in the above animation) is to color a point by how fast the point {c1,c2} escapes.

The following is another zoom-in animation at a different point. Supposedly it's the deepest zoom-in animation on record. It's computed on a machine with 12 CPU, for 6 months, in 2010.

“Deepest Mandelbrot Set Zoom Animation ever - a New Record! 10^275 (2.1E275 or 2^915)”

Mandelbrot Set Formula with Complex Numbers

The above formula can be expressed in complex numbers. Using complex numbers, the function f is:

f[z_] := z^2+C

For many free software that plots the Mandelbrot set, see: Fractal Software.

Google's Fractal Map

Google made a fractal application, based on Google Maps, at juliamap.googlelabs.com.

Though, am rather disappointed. When you zoom in just a few steps, the iteration does not automatically increase enough to get crispy edges.

Much better are some dedicated fractal apps. See: Great Fractal Software.

For a basic explanation of the mandelbrot set, see: Mandelbrot Set Basic Tutorial (no complex number needed).

2011-02-02

Google URL Shortening Service Format

Perm url with updates: http://xahlee.org/js/Google_url_shortening.html

Google URL Shortening Service URL Format

Xah Lee, 2011-02-02

Google has a url shortening service, at http://goo.gl/.

Here's some example of url it generates:

• goo.gl/3PrCo ⇒ xahlee.org/comp/FSF_free_beer_free_speech.html
• goo.gl/EH5fn ⇒ xahlee.org/kbd/dvorak_and_all_keyboard_layouts.html
• goo.gl/jzgjD ⇒ xahlee.org/comp/netscape_dot_com_nostalgia.html
• goo.gl/vfsem ⇒ xahlee.org/Periodic_dosage_dir/t2/cherry.html
• goo.gl/uc19a ⇒ xahlee.org/UnixResource_dir/writ/larry_wall_n_cults.html
• goo.gl/WBdW7 ⇒ xahlee.org/js/html5_video_audio.html
• goo.gl/bW0LZ ⇒ xahlee.org/emacs/elisp_all_about_lines.html
• goo.gl/ZPmVY ⇒ xahlee.org/Periodic_dosage_dir/lacru/odalisque.html
• goo.gl/1Zb9s ⇒ xahlee.org/Periodic_dosage_dir/sanga_pemci/hollaback_girl.html
• goo.gl/RvN1E ⇒ xahlee.org/perl-python/system_calls.html
• goo.gl/uXyas ⇒ xahlee.org/comp/netscape_dot_com_nostalgia.html
• goo.gl/Aq3BR ⇒ xahlee.blogspot.com/2011/02/sous-vide-heavy-metal-boredom-habits.html
• goo.gl/VS4X1 ⇒ xahlee.blogspot.com/2011/02/pirated-oreilly-books.html
• goo.gl/FoWLz ⇒ www.reddit.com/r/emacs/comments/fbg4l/ask_remacs_help_me_get_started_with_some_real/
• goo.gl/CfWhQ ⇒ www.reddit.com/r/emacs/comments/fbg4l/ask_remacs_help_me_get_started_with_some_real/

Notice that the format is goo.gl/xxxxx, and the xxxxx is 5 chars. Also, notice that each time you submit, it gives a new short url (if you are logged in). This means multiple short urls maps to the same original url.

There's something i don't understand about it.

The possible chars seem to be digits and english letters, both lower case and upper case. So, that's a total of 10+26+26=62 chars per slot. With 5 slots, the total space is 26^5 = 916.132832 * 10^6. Consider the number of urls that exist, and consider that one url can generate multiple short urls, this does not seem enough.

Also, i think i read somewhere from Google, that the new url is supposedly to be permanent. That is, years later they would return the same original url (assuming Google still lives). So, this means, i can't see how they can maintain this 5 char scheme. Perhaps in the future they might change the 5-slots scheme with more slots or chars.

For each url that exists, perhaps 1 in 1M are actually using url shortening service. So maybe the 5-slots 62 chars scheme would work fine for the foreseeable future.

See also: URL Shortening is BAD.

sous vide, heavy metal, boredom habits

Perm url with updates: http://xahlee.org/Periodic_dosage_dir/sous_vide_cooking.html

Sous Vide Cooking & Fat-Induced Habits

Xah Lee, 2011-02-02

Learned a word Sous Vide. Basically, it's a fancy cooking method. Instead of cooking with high temperature for minutes, it cooks at warm temperature (e.g. 60°) for hours.

When food is cooked at high temperature, the heat changes the chemistry of the food. With low temp, the food's texture remains the same, thus supposedly more tasty.

Note that there's a danger in doing this. One primary reason for cooking is to kill bacteria. With low temp, you need to have a lot knowledge about pasteurization to achieve the same thing.

This is not my thing. Modern society induced people to do that. They are bored and have nothing to do all day. e.g. heavy metal music, Drifting (car driving). Actually, for many of these hobbies, i despise them. (For a commentary on the social development of Heavy Metal music, see: 花样的年华 (Age of Blossom).)

Personally, i'm a ascetic. Not out of religious reasons (i despise religion), but rather perhaps as a eccentric habit, similar to what you might associate with mad scientist types. This talk reminds me of Mahatma Gandhi. Remember, he adhere a life style of loinclothes. Also, remember that millions of people around the world don't have enough food to sustain life. Yet, here in US America, we have people going ways into the art of cooking for that little extra mile of taste.

(Now, just to be clear: am not pitching some morality (i despite moralists, human right fuckheads, animal rights activists, feminism pussies and wimps, helping-the-poor bleeding hearts, vegetarianism activists. (See: Why Do I Not Support the “Human Right” Concept?.) ), nor am i trying to say that people should not enjoy the various expensive hobbies or lifestyles that are developed out of modern wealthy nations (such as fancy cooking; breast enlargement; fancy cars; or even consider the whole gazillion dollar movie industry that in some sense is a extreme waste.). I am mentioning hunger around the world only to give a contrast about expensive habits of this Sous Vide cooking, but do not wish to be misunderstood as if i advocate certain moral values that are common with such topic.)

pirated OReilly Books

Discovered that the entire Perl Cookbook is now available online. Originally i thought it's official from O'Reilly, but until trying to announce the url, i realized it's pirated. The site actually contains ~100 O'Reilly books. The url is 〔http://docstore.mik.ua/orelly/〕. Screenshot: . The site is Google ranked at 5. (i discovered it while Google search Perl modules.) I guess the site is making money by the ads.

Ι wonder if O'Reilly knew. Ι hope they can take action, but i guess it's fruitless. Am also guessing that there are probably tens or hundreds of such sites. (they are all over bittorrent anyway.)

See also: Software Freedom is Free Speech or Free Beer?Pathetically Elational Regex Language (PERL).