Saturday, November 10, 2007

Rediscovering awk

My first job after college was with HCL Limited (now HCL Infosystems), the #1 Unix-based minicomputer manufacturer in India, at least at that time. Unix was not as ubiquitous then as it is today, at least in India, but it was making aggressive inroads into shops dominated by proprietary mainframe and midrange OSes. At that point I knew nothing about Unix - our college computers ran DEC VAX/VMS and MS-DOS, and I had done some work with ICL's mainframes and a proprietary minicomputer OS called BEST as a summer intern.

The first thing HCL did was to ship the new recruits off to a two-week boot camp at their training school at Dehra Dun, a little mountain city in North India. There, after about a two-day introduction to Unix and common commands, we were broken up into groups of two, and each group was handed an approximately 80-100 page paper manual and given 24 hours to prepare a presentation for our classmates on a common Unix command. Ours was awk. It wasn't the hardest, considering that two other groups got lex and yacc, but it wasn't the easiest thing to do either, considering our experience with Unix so far, and the fact that all we had was a green screen monitor with csh for command history. Needless to say, the presentation did not go too well.

Over the years, I had used various Unix commands like sed, grep, cut, paste and tr, hooked them up into shell scripts, and worked with scripting languages like Perl and Python, but I had always steered clear of awk. Probably because I could get my work done without using awk, or maybe it was some sort of subconscious fear thing. In any case, I never used awk after that presentation, that is, until a few weeks ago.

I had this Java program which was crawling a set of pages and pulling down the title, summary and URL for each page as pipe-delimited records in a flat file. So the Java program would dump a file that looked like this. The top two lines are not part of the output; they are there to describe the file format for this blog.

# output.txt
# TITLE|URL|SUMMARY
r1c1|r1c2|r1c3
r2c1|r2c2|r2c3
r3c1|r3c2|r3c3
r4c1|r4c2|r4c3
r5c1|r5c2|r5c3
...

I would then use sed scripts to replace the pipe characters appropriately, so my output file would look something like this:

# output.txt
# TITLE|URL|SUMMARY
insert into sometable(title,url,summary)values('r1c1','r1c2','r1c3');
...
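
For completeness, here is a sketch of what such a sed script might look like, assuming the table is called sometable (a placeholder name; a real script would also need to worry about single quotes embedded in the data, which I come back to below):

sujit@sirocco:/tmp$ sed "s/^\([^|]*\)|\([^|]*\)|\(.*\)/insert into sometable(title,url,summary)values('\1','\2','\3');/" \
  output.txt > insert.sql

The backslash-escaped parentheses are BRE capture groups, so \1 through \3 refer back to the three pipe-delimited fields.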

I would then use this file as an input SQL script to load all the information into a database table. Some time later, I found a bug in the summary generation algorithm, so I needed to update the summaries in the database. The SQL to be generated would look like this:

# output.txt
# TITLE|URL|SUMMARY
update sometable set summary='r1c3' where title='r1c1' and url='r1c2';
...

I could simply have changed the Java code to rewrite the columns appropriately, but this seemed like an almost textbook application of awk, so I bit. My awk script to reorder the columns looked like this:

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '{printf("%s|%s|%s\n",$3,$1,$2)}' \
  output.txt > output1.txt

I then applied another sed script to replace the pipes with the appropriate text to create the update SQL.
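
That second sed script would look something like the sketch below; note that after the awk pass, the fields in output1.txt are in SUMMARY|TITLE|URL order, and sometable is again just a placeholder:

sujit@sirocco:/tmp$ sed "s/^\([^|]*\)|\([^|]*\)|\(.*\)/update sometable set summary='\1' where title='\2' and url='\3';/" \
  output1.txt > update.sql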

A few days later, I had another situation: I was given a file which had originally been generated by one of my programs and uploaded to a database table, but later annotated by a human being. Specifically, certain items had been marked 'delete', and I knew that I could use the URL as a unique key for the delete SQL. The annotated file looked like this:

# output1.txt
# TITLE|URL|SUMMARY|ANNOTATION
r1c1|r1c2|r1c3|keep
r2c1|r2c2|r2c3|delete
r3c1|r3c2|r3c3|keep
r4c1|r4c2|r4c3|keep
r5c1|r5c2|r5c3|delete
...

Flushed with my recent success with awk, I decided to give it a go again. This time, all I needed were the rows annotated with "delete", and the URL to use as the key. So my new awk script looked like this:

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '{if ($4=="delete") printf("%s\n",$2)}' \
  output1.txt > output2.txt
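
From there it is one more small step to the actual delete statements. In fact, awk could emit the SQL directly and skip the intermediate file entirely; here is a sketch, with sometable once more a placeholder (\047 is the octal escape for a single quote, which keeps the apostrophes out of the shell's way):

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '$4=="delete" {printf("delete from sometable where url=\047%s\047;\n",$2)}' \
  output1.txt > delete.sql

The pattern-action form also replaces the explicit if from the script above, which is the more idiomatic way to filter rows in awk.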

It turns out that awk (or gawk, the GNU cousin that I am using) is actually pretty much a full-fledged programming language, with quite a comprehensive set of built-in string processing functions. One of my colleagues remarked that he had seen entire web sites written in awk. I don't know that I would want to do that, but from what I have seen, it is a very powerful tool for writing compact string transformations on the command line. I am definitely going to use it more in the future, replacing the use cases where I used a combination of cut, paste and sed to do command-line text file processing.
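
As an example of what those built-in functions buy you, gsub() can double up any single quotes embedded in the data (something the sed sketches above would trip over), and the whole reorder-plus-rewrite pipeline collapses into a single gawk call. Again, this is a sketch rather than my production script, and sometable remains a placeholder:

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '{ for (i = 1; i <= 3; i++) gsub("\047", "\047\047", $i)  # double up embedded quotes
    printf("update sometable set summary=\047%s\047 where title=\047%s\047 and url=\047%s\047;\n", $3, $1, $2) }' \
  output.txt > update.sql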

Even after so many years, the awk man pages still appear a bit dense to me. However, there is a lot of free information about awk available on the Internet. I used Greg Goebel's awk tutorial, An Awk Primer, which I found to be very useful. There is also the Awk reference page on the Unix-Manuals site, which can come in handy if you already know how to write awk scripts and just need a function reference.
