Unix Shell Text Processing Tutorial (grep, cat, awk, sort, uniq)

Perm url with updates: http://xahlee.org/UnixResource_dir/unix_shell_text_processing.html

Xah Lee, 2007-03, 2010-09-13

This page is a basic tutorial on using the unix shell's text processing tools, such as grep, cat, awk, sort, uniq. They are especially useful for processing text line by line.

How to show only certain lines that contain a text pattern?

Use grep. Example: grep 'html HTTP' myFile will print only lines containing the text “html HTTP”. grep 'html HTTP' *html will apply grep to all files in the current dir whose name ends in “html”. grep -r 'html HTTP' myDir will search all files in myDir and its subdirs.

Use “-H” to include file name in the result, use “-h” to not print file name. (“-f” is a different option: it reads patterns from a file.) Use “-v” to print lines NOT containing the text. Use “-E” for extended regular expression (closer to Perl's) or, with GNU grep, use “-P” for perl's regex syntax. Use “-i” to ignore case.
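Here's a small sketch of what these options do, on three made-up lines. The sample text is invented for illustration, and I use “-c” (print the count of matching lines instead of the lines) just to keep the results short.

```shell
f=$(mktemp)
printf 'GET /a.html HTTP\nGET /b.png HTTP\nget /c.HTML http\n' > "$f"

plain=$(grep -c 'html HTTP' "$f")               # exact-case match
nocase=$(grep -c -i 'html HTTP' "$f")           # also matches "HTML http"
inverted=$(grep -c -v 'html HTTP' "$f")         # lines NOT matching
either=$(grep -c -E 'png HTTP|html HTTP' "$f")  # alternation needs -E

echo "plain=$plain nocase=$nocase inverted=$inverted either=$either"
rm -f "$f"
```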

Examples:

  • grep -v 'html HTTP' myFile will print lines not matching the regex.
  • grep -E 'png HTTP|jpg HTTP' myFile will print lines containing either “png HTTP” or “jpg HTTP”.
  • cat myFile | grep 'html HTTP' | awk '{print $12 , $7}' | grep -i -E "livejournal|blogspot" | sort | uniq -c | sort -n

The last example takes my apache web log “myFile” and keeps only lines containing “html HTTP”, then shows only the 12th and 7th columns (which are the referral url and the requested file), then keeps only lines that contain “livejournal” or “blogspot” (“-i” for ignore case and “-E” for extended regex pattern), then sorts them, then collapses repeated lines into one with the number of occurrences prepended, then sorts that by the numbers.
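To watch the pipeline work, here is a sketch on a made-up log. The log lines, host names, and temp file are invented. The layout is also an assumption: I added a placeholder “-” field so that the requested file is column 7 and the referrer is column 12. Check the column numbers against your own log format.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [13/Sep/2010:10:00:00 -0700] "GET /a.html HTTP/1.1" 200 512 - "http://x.blogspot.com/"
1.2.3.5 - - [13/Sep/2010:10:00:01 -0700] "GET /a.html HTTP/1.1" 200 512 - "http://x.blogspot.com/"
1.2.3.6 - - [13/Sep/2010:10:00:02 -0700] "GET /b.html HTTP/1.1" 200 300 - "http://y.LiveJournal.com/"
1.2.3.7 - - [13/Sep/2010:10:00:03 -0700] "GET /pic.png HTTP/1.1" 200 900 - "http://x.blogspot.com/"
1.2.3.8 - - [13/Sep/2010:10:00:04 -0700] "GET /c.html HTTP/1.1" 200 100 - "http://example.com/"
EOF

# Same steps as the example, minus the leading cat.
result=$(grep 'html HTTP' "$log" |
  awk '{print $12 , $7}' |
  grep -i -E "livejournal|blogspot" |
  sort | uniq -c | sort -n)

printf '%s\n' "$result"
rm -f "$log"
```

The result has two lines: the LiveJournal referrer with a count of 1, then the blogspot referrer with a count of 2. The png request and the example.com referrer are filtered out.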

How to show only certain columns in a text file?

awk '{print $12 , $7}' myFile will print the 12th and 7th columns. (Columns are separated by runs of spaces or tabs by default.) For a delimiter other than whitespace, for example the straight double quote, use awk -F\" '{print $12 , $7}' myFile.

An alternative is the “cut” utility, but it does not accept a regex as delimiter. It treats every single delimiter char as a column boundary, so if your columns are separated by varying numbers of spaces, “cut” won't do.
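A quick sketch of the difference, on a made-up line with three spaces between the first two words. The sample strings are invented for illustration.

```shell
line='alpha   beta gamma'

# awk treats the run of spaces as one separator, so column 2 is "beta".
awk_col=$(printf '%s\n' "$line" | awk '{print $2}')

# cut treats each space as a separator, so field 2 is the empty string
# between the first and second space.
cut_col=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# With the double quote as delimiter, column 2 is the text inside the quotes.
quoted=$(printf '%s\n' 'GET "/a.html" 200' | awk -F\" '{print $2}')

echo "awk: [$awk_col] cut: [$cut_col] quoted: [$quoted]"
```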

How to sort lines in a file?

To sort lines use sort myfile. To sort by considering text as numbers, use sort -n myfile. To reverse order, use “-r”. To sort by comparing a particular column, for example the 2nd column, use sort -k 2,2 myfile. (A plain “-k 2” makes the sort key run from the 2nd column to the end of the line.)

Here's a more complex example: sort -k 2,2r -k3,3nr myFile. This will sort by the 2nd column first, in reverse order; if tie, sort by the 3rd column as numbers, in reverse order.
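Here is that command run on four made-up lines (the data is invented for illustration). I set LC_ALL=C so the result does not depend on your locale.

```shell
sorted=$(printf 'x a 5\ny b 10\nz b 9\nw a 7\n' |
  LC_ALL=C sort -k 2,2r -k3,3nr)

# Column 2 in reverse puts the "b" lines before the "a" lines.
# Within each group, column 3 is compared as a number, largest first,
# so 10 comes before 9 (as text, "10" would sort before "9").
printf '%s\n' "$sorted"
```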

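Note, sort is not a stable sort by default: lines whose keys are equal do not keep their original order. For example, if your text file is:

b y
b x

and you use “sort -k 1,1 myFile”, it will still re-order your lines, because ties on the key are broken by comparing the whole line. To make it leave tied lines in their original order, use “-s”.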
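A small sketch of this tie-breaking, using the same two lines and a key restricted to the first column with “-k 1,1”. The “-s” option is in GNU and BSD sort; LC_ALL=C just pins the locale.

```shell
# Both lines have the same key "b". Without -s, sort falls back to
# comparing the whole line, so "b x" moves ahead of "b y".
plain=$(printf 'b y\nb x\n' | LC_ALL=C sort -k 1,1)

# With -s, lines with equal keys keep their input order.
stable=$(printf 'b y\nb x\n' | LC_ALL=C sort -s -k 1,1)

printf 'without -s:\n%s\nwith -s:\n%s\n' "$plain" "$stable"
```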

How to show only unique lines in a file?

sort ~/myfile | uniq. To prepend each line with a count of repetitions, use sort ~/myfile | uniq -c
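The sort is there because “uniq” only collapses repeated lines that are next to each other. A sketch on made-up data (the fruit names are invented; the final awk just trims the padding that “uniq -c” puts before the count):

```shell
data='pear
apple
pear
fig
pear
apple'

# No two equal lines are adjacent, so uniq alone removes nothing.
unsorted=$(printf '%s\n' "$data" | uniq | wc -l | tr -d ' ')

unique=$(printf '%s\n' "$data" | sort | uniq)
counts=$(printf '%s\n' "$data" | sort | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$counts"
```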

How to sum up the 2nd column in a file?

awk '{sum += $2} END {print sum}' myfile.
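For example, on three made-up lines (invented data). The second command is a small variation of mine: “sum+0” makes awk print 0 instead of an empty line when the input is empty.

```shell
total=$(printf 'a 10\nb 2.5\nc 30\n' | awk '{sum += $2} END {print sum}')

# With no input lines, sum is never set; sum+0 forces a numeric 0.
empty_total=$(printf '' | awk '{sum += $2} END {print sum+0}')

echo "total=$total empty_total=$empty_total"
```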

How to show only first few lines of a huge file?

head ~/myfile shows the first 10 lines. If you want to see the first n lines, for example 100, use head -n 100 ~/myfile. If you want to see the bottom of a file, use “tail”.
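A tiny sketch with a made-up 12-line file, one number per line:

```shell
f=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$f"

top=$(head -n 3 "$f")       # first 3 lines
bottom=$(tail -n 2 "$f")    # last 2 lines
default_count=$(head "$f" | wc -l | tr -d ' ')   # head with no -n

echo "head prints $default_count lines by default"
rm -f "$f"
```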

For complex text processing, you need a full language. See: Perl and Python Tutorial, Emacs Lisp Tutorial.

Processing Multiple Files

How to list only files whose name matches a text pattern?

find ~/myDir -name "*.html" will show just files with “.html” suffix.

How to list only files larger than n bytes?

find ~/myDir -size +900000c will list files in 〔~/myDir〕 larger than 900,000 bytes (about 0.9 megabytes). The “c” suffix means the size is given in bytes.

To list files smaller than a given size, use a minus sign “-” instead of the plus. To list files of exactly a given size, don't use the plus or minus.
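Here's a sketch of the three forms, using a temp dir with two made-up files of known size. The file names and sizes are invented for illustration.

```shell
dir=$(mktemp -d)
printf '%100s' ''  > "$dir/small.txt"   # 100 bytes of spaces
printf '%2000s' '' > "$dir/big.txt"     # 2000 bytes of spaces

larger=$(find "$dir" -type f -size +1000c)    # more than 1000 bytes
smaller=$(find "$dir" -type f -size -1000c)   # fewer than 1000 bytes
exact=$(find "$dir" -type f -size 100c)       # exactly 100 bytes

printf 'larger: %s\nsmaller: %s\nexact: %s\n' "$larger" "$smaller" "$exact"
rm -rf "$dir"
```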

How to delete all files whose name matches a text pattern?

find ~/myDir -name "*~" -exec rm {} \; will delete all files whose name ends with “~”, in 〔~/myDir〕 and its subdirs.
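Since this deletes files, it's worth running the same “find” without the “-exec” part first to see what would go. A sketch in a throwaway temp dir (the file names are made up):

```shell
dir=$(mktemp -d)
mkdir "$dir/sub"
: > "$dir/a.txt"
: > "$dir/a.txt~"
: > "$dir/sub/b.txt~"

# Preview: count what matches before deleting anything.
doomed=$(find "$dir" -name "*~" | wc -l | tr -d ' ')

# Delete the matches, including the one in the subdir.
find "$dir" -name "*~" -exec rm {} \;

remaining=$(find "$dir" -type f)
echo "deleted $doomed files, remaining: $remaining"
rm -rf "$dir"
```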

Using “find” with “xargs”

How to use “find” on file names that may contain spaces or dashes?

find . -print0 | xargs -0 -l -i echo "{}";.

The “-print0” tells “find” to print the file names separated by a null char (as opposed to a newline char by “-print”). The “-0” tells xargs to parse input using the null char as separator and take any special char in a file name as literal.

The “-l” tells “xargs” to pass just one file name at a time. The “-i” allows you to use “{}” as the file name. The “"{}"” creates quoting around the entire file name, so that “echo” (or another program) will see it as one argument instead of several. (Note: the “-i” must come after “-l”. Also, the GNU xargs man page marks both as deprecated, in favor of “-L 1” and “-I {}”.)
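To see why the null char matters, here's a sketch that counts how many items “xargs” sees with and without it. The file names are made up, I use “-I {}” in place of “-i”, and I assume the temp dir path itself contains no spaces.

```shell
dir=$(mktemp -d)
: > "$dir/my file.txt"
: > "$dir/-dash.txt"

# Null-separated: each file name arrives as one item.
safe=$(find "$dir" -type f -print0 | xargs -0 -I {} echo {} | wc -l | tr -d ' ')

# Newline-separated, default parsing: "my file.txt" is split at the space.
naive=$(find "$dir" -type f | xargs -n 1 echo | wc -l | tr -d ' ')

echo "with -print0/-0: $safe items, without: $naive items"
rm -rf "$dir"
```

There are 2 files, but the plain version hands 3 items to “echo”.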

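Here's an example that uses “find”, “xargs”, “basename”, and ImageMagick's “convert” to convert “bmp” image files to “png”: find . -name "*.bmp" -print0 | xargs -0 -l -i basename "{}" ".bmp" | xargs -l -i convert "{}.bmp" "{}.png". The second “xargs” has no “-0”, because “basename” prints one name per line, not null-separated. Since “basename” also strips the dir part, this works only for files directly in the current dir.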
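An alternative sketch that needs neither “basename” nor a second “xargs”: let “find” hand the names to a small “sh” loop, and strip the suffix with the shell's “${f%.bmp}” expansion. This is a different technique from the pipeline above, not a rewrite of it. The file name is made up, and the loop only prints the “convert” commands, so it runs without ImageMagick installed.

```shell
dir=$(mktemp -d)
: > "$dir/my pic.bmp"      # empty stand-in file, not a real image

# "{} +" passes the matched paths as arguments to the sh script.
# Remove "echo" to actually run convert.
planned=$(find "$dir" -name '*.bmp' -exec sh -c '
  for f; do
    echo convert "$f" "${f%.bmp}.png"
  done' sh {} +)

printf '%s\n' "$planned"
rm -rf "$dir"
```

Because the full path is kept, this also works for files in subdirs.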

Use GNU Parallel for xargs

Note: a modern replacement for xargs is GNU Parallel. Its syntax is almost identical to xargs, except it runs the jobs in parallel. It also doesn't have problems with file names containing quotes or apostrophes.

Man Page

How to get a text output of a man page?

man ls | col -b. The “col -b” turns the man page into plain text (it gets rid of the control chars used for bold and underline).

How to read a non-compressed man page without the “man” command?

nroff -man n43921.man | col -b

This is convenient when you need to read a man-page file once without adding the dir to your $MANPATH.

How to read a compressed man page without the “man” command?

cat n43921.1 | compress -cd - | nroff -man | col -b

How to read an unformatted man page?

A possible solution: nroff -man ftpshut.8

The “man” command is essentially nroff -e -man file_name | more -s.


Thanks to Ole Tange for telling me about GNU Parallel. (he is the author)
