Chapter 3: Data Processing

Awk was originally intended for everyday data-processing tasks, such as information retrieval, data validation, and data transformation and reduction. We have already seen simple examples of these in Chapters 1 and 2. In this chapter, we will consider more complex tasks of a similar nature.

  • Most of the examples deal with the usual line-at-a-time processing,

  • but the final section describes how to handle data where an input record may occupy several lines (Multiline Records).

Awk programs are often developed incrementally: a few lines are written and tested, then a few more added, and so on. Many of the longer programs in this book were developed in this way.

It's also possible to write awk programs in the traditional way, sketching the outline of the program, consulting the language manual, and so forth. But modifying an existing program to get the desired effect is frequently easier. The programs in this book thus serve another purpose, providing useful models for programming by example.

1. Data Transformation and Reduction

  • One of the most common uses of awk is to transform data from one form to another, usually from the form produced by one program to a different form required by some other program.

  • Another use is selection of relevant data from a larger data set, often with reformatting and the preparation of summary information.

This section contains a variety of examples of these topics.

1.1. Summing Columns

We have already seen several variants of the two-line awk program that adds up all the numbers in a single field. The following program performs a somewhat more complicated but still representative data-reduction task. Every input line has several fields, each containing numbers, and the task is to compute the sum of each column of numbers, regardless of how many columns there are.
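A sketch of such a program, accumulating totals in an array sum and tracking the widest row in maxfld, might look like this:

```awk
# sum1 - print column sums
#   input:  rows of numbers
#   output: sum of each column
#   missing entries are treated as zeros

    { for (i = 1; i <= NF; i++)
          sum[i] += $i
      if (NF > maxfld)
          maxfld = NF
    }

END { for (i = 1; i <= maxfld; i++) {
          printf("%g", sum[i])
          if (i < maxfld)
              printf("\t")
          else
              printf("\n")
      }
    }
```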

Automatic initialization is convenient here since maxfld, the largest number of fields seen so far in any row, starts off at 0 automatically, as do all of the entries in the sum array, even though it's not known until the end how many there are. It's also worth noting that the program prints nothing if the input file is empty.

It's convenient that the program doesn't need to be told how many fields a row has, but it doesn't check that the entries are all numbers, nor that each row has the same number of entries. The following program does the same job, but also checks that each row has the same number of entries as the first:
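A sketch (the program name sum2 is assumed):

```awk
# sum2 - print column sums
#   check that each line has the same number of fields as line 1

NR == 1 { nfld = NF }
        { for (i = 1; i <= NF; i++)
              sum[i] += $i
          if (NF != nfld)
              print "line " NR " has " NF " entries, not " nfld
        }

END     { for (i = 1; i <= nfld; i++)
              printf("%g%s", sum[i], i < nfld ? "\t" : "\n")
        }
```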

We also revised the output code in the END action, to show how a conditional expression can be used to put tabs between the column sums and a newline after the last sum.

Now suppose that some of the fields are nonnumeric, so they shouldn't be included in the sums. The strategy is to add an array numcol to keep track of which fields are numeric, and a function isnum to check if an entry is a number. This is made a function so the test is only in one place, in anticipation of future changes. If the program can trust its input, it need only look at the first line to tell if a field will be numeric. The variable nfld is needed because NF is 0 inside the END action.
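A sketch of sum3 along those lines:

```awk
# sum3 - print sums of numeric columns
#   input:  rows of integers and strings
#   output: sums of numeric columns
#   assumes every line has the same layout as line 1

NR == 1 { nfld = NF
          for (i = 1; i <= NF; i++)
              numcol[i] = isnum($i)
        }

        { for (i = 1; i <= NF; i++)
              if (numcol[i])
                  sum[i] += $i
        }

END     { for (i = 1; i <= nfld; i++) {
              if (numcol[i])
                  printf("%g", sum[i])
              else
                  printf("--")
              printf(i < nfld ? "\t" : "\n")
          }
        }

function isnum(n) { return n ~ /^[+-]?[0-9]+$/ }
```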

NOTE: isnum returns 1 if the entry looks like a number, 0 otherwise.

The function isnum defines a number as one or more digits, perhaps preceded by a sign. A more general definition for numbers can be found in the discussion of regular expressions in Section 2.1.

Exercise 3-1. Modify the program sum3 to ignore blank lines.

Exercise 3-2. Add the more general regular expression for a number. How does it affect the running time?

Exercise 3-3. What is the effect of removing the test of numcol in the second for statement?

Exercise 3-4. Write a program that reads a list of item and quantity pairs and for each item on the list accumulates the total quantity; at the end, it prints the items and total quantities, sorted alphabetically by item.

1.2. Computing Percentages and Quantiles

Suppose that we want not the sum of a column of numbers but what percentage each is of the total. This requires two passes over the data. If there's only one column of numbers and not too much data, the easiest way is to store the numbers in an array on the first pass, then compute the percentages on the second pass as the values are being printed:
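A sketch:

```awk
# percent
#   input:  a column of nonnegative numbers
#   output: each number and its percentage of the total

    { x[NR] = $1; sum += $1 }

END { if (sum != 0)
          for (i = 1; i <= NR; i++)
              printf("%10.2f %5.1f\n", x[i], 100 * x[i] / sum)
    }
```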

This same approach, though with a more complicated transformation, could be used, for example, in adjusting student grades to fit some curve. Once the grades have been computed (as numbers between 0 and 100), it might be interesting to see a histogram:
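A sketch of histogram.awk, which counts grades by decile and prints a row of stars for each bucket:

```awk
# histogram.awk
#   input:  numbers between 0 and 100
#   output: histogram of deciles

    { x[int($1/10)]++ }

END { for (i = 0; i < 10; i++)
          printf(" %2d - %2d: %3d %s\n",
              10*i, 10*i+9, x[i], rep(x[i], "*"))
      printf("100:      %3d %s\n", x[10], rep(x[10], "*"))
    }

function rep(n, s,   t) {   # return string of n copies of s
    while (n-- > 0)
        t = t s
    return t
}
```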

Note how the postfix decrement operator -- is used to control the while loop.

We can test histogram.awk with some randomly generated grades. The first program in the pipeline below generates 200 random numbers between 0 and 100, and pipes them into the histogram maker.
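The pipeline might look like this:

```sh
awk '
# generate 200 random integers between 0 and 100
BEGIN { for (i = 1; i <= 200; i++)
            print int(101 * rand())
      }
' | awk -f histogram.awk
```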

It produces this output:

Exercise 3-5. Scale the rows of stars so they don't overflow the line length when there's a lot of data.

Exercise 3-6. Make a version of the histogram code that divides the input into a specified number of buckets, adjusting the ranges according to the data seen.

1.3. Numbers with Commas

Suppose we have a list of numbers that contain commas and decimal points, like 12,345.67. Since awk thinks that the first comma terminates a number (so 12,345.67 becomes just 12 when converted from string to number), these numbers cannot be summed directly. The commas must first be erased:
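A sketch of sumcomma:

```awk
# sumcomma - add up numbers containing commas

    { gsub(/,/, ""); sum += $0 }
END { print sum }
```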

The effect of gsub(/,/, "") is to replace every comma with the null string, that is, to delete the commas.

This program doesn't check that the commas are in the right places, nor does it print commas in its answer. Putting commas into numbers requires only a little effort, as the next program shows. It formats numbers with commas and two digits after the decimal point. The structure of this program is a useful one to emulate: it contains a function that only does the new thing, with the rest of the program just reading and printing. After it's been tested and is working, the new function can be included in the final program.

The basic idea is to insert commas from the decimal point to the left in a loop; each iteration puts a comma in front of the leftmost three digits that are followed by a comma or decimal point, provided there will be at least one additional digit in front of the comma. The algorithm uses recursion to handle negative numbers: if the input is negative, the function addcomma calls itself with the positive value, tacks on a leading minus sign, and returns the result.
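A sketch of addcomma along those lines:

```awk
# addcomma - put commas in numbers
#   input:  a number per line
#   output: the input number followed by
#           the number with commas and two decimal places

    { printf("%-12s %20s\n", $0, addcomma($0)) }

function addcomma(x,   num) {
    if (x < 0)
        return "-" addcomma(-x)
    num = sprintf("%.2f", x)                 # num is dddddd.dd
    while (num ~ /[0-9][0-9][0-9][0-9]/)     # at least four digits remain unseparated
        sub(/[0-9][0-9][0-9][,.]/, ",&", num)
    return num
}
```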

Note the use of the & in the replacement text for sub to add a comma before each triplet of digits.

Here are the results for some test data:

Exercise 3-7. Modify sumcomma, the program that adds numbers with commas, to check that the commas in the numbers are properly positioned.

1.4. Fixed-Field Input

Information appearing in fixed-width fields often requires some kind of preprocessing before it can be used directly. Some programs, such as spreadsheets, put out numbers in fixed columns, rather than with field separators; if the numbers are too wide, the columns abut. Fixed-field data is best handled with substr, which can be used to pick apart any combination of columns. For example, suppose the first 6 characters of each line contain a date in the form mmddyy. The easiest way to sort this by date is to convert the dates into the form yymmdd:
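A one-line sketch that rewrites the first field and prints the result:

```awk
# convert mmddyy in field 1 to yymmdd

{ $1 = substr($1, 5, 2) substr($1, 1, 2) substr($1, 3, 2); print }
```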

On input sorted by month, like this:

it produces the output

which is ready to be sorted by year, month and day.

Exercise 3-8. How would you convert dates into a form in which you can do arithmetic like computing the number of days between two dates?

1.5. Program Cross-Reference Checking

Awk is often used to extract information from the output of other programs. Sometimes that output is merely a set of homogeneous lines, in which case field-splitting or substr operations are quite adequate. Sometimes, however, the upstream program thinks its output is intended for people. In that case, the task of the awk program is to undo careful formatting, so as to extract the information from the irrelevant. The next example is a simple instance.

Large programs are built from many files. It is convenient (and sometimes vital) to know which file defines which function, and where the function is used. To that end, the Unix program nm prints a neatly formatted list of the names, definitions, and uses of the names in a set of object files (*.o files). A typical fragment of its output looks like this:

  • Lines with one field (e.g., file.o) are filenames,

  • lines with two fields (e.g., U and _close) are uses of names,

  • lines with three fields are definitions of names.

T indicates that a definition is a text symbol (function) and U indicates that the name is undefined.

Using this raw output to determine what file defines or uses a particular symbol can be a nuisance, since the filename is not attached to each symbol. For a C program the list can be long - it's 850 lines for the nine files of source that make up awk itself. A three-line awk program, however, can add the name to each item, so subsequent programs can retrieve the useful information from one line:
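A sketch of nm.format:

```awk
# nm.format - add filename to each nm output line

NF == 1 { file = $1 }
NF == 2 { print file, $1, $2 }
NF == 3 { print file, $2, $3 }
```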

The output from nm.format on the data shown above is

Now it is easy for other programs to search this output or process it further.

This technique does not provide line number information nor tell how many times a name is used in a file, but these things can be found by a text editor or another awk program. Nor does it depend on which language the programs are written in, so it is much more flexible than the usual run of cross-referencing tools, and shorter and simpler too.

1.6. Formatted Output

As another example we'll use awk to make money, or at least to print checks. The input consists of lines, each containing

  • a check number,

  • an amount,

  • a payee

separated by tabs.

The output goes on check forms, 8 lines high.

  • The second and third lines have the check number and date indented 45 spaces,

  • the fourth line contains the payee in a field 45 characters long, followed by 3 blanks, followed by the amount.

  • The fifth line contains the amount in words, and the other lines are blank.

A check looks like this:

Note: The line numbers above were added by me manually, just to make it easier for readers to identify each line.

prchecks.awk:
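The original listing is not reproduced here; a sketch along the lines described below (the helper names sp45, dashes, nums, and tens are assumptions) might look like this:

```awk
# prchecks - print formatted checks
#   input:  number \t amount \t payee
#   output: eight lines of text for preprinted check forms

BEGIN {
    FS = "\t"
    dashes = sp45 = sprintf("%45s", " ")
    gsub(/ /, "-", dashes)              # to protect the payee field
    "date" | getline date               # get today's date
    split(date, d, " ")
    date = d[2] " " d[3] ", " d[6]
    initnum()                           # set up tables for number conversion
}
NF != 3 || $2 >= 1000000 {              # reject illegal data
    printf("\nline %d illegal:\n%s\n\nVOID\nVOID\n\n\n", NR, $0)
    next                                # no check printed
}
{   printf("\n")                                        # nothing on line 1
    printf("%s%s\n", sp45, $1)                          # number, indented 45 spaces
    printf("%s%s\n", sp45, date)                        # date, indented 45 spaces
    amt = sprintf("%.2f", $2)                           # formatted amount
    printf("Pay to %45.45s   $%s\n", $3 dashes, amt)    # line 4
    printf("the sum of %s\n", numtowords(amt))          # line 5
    printf("\n\n\n")                                    # lines 6, 7 and 8
}

function numtowords(n,   cents, dols) {   # n has 2 decimal places
    cents = substr(n, length(n)-1, 2)
    dols = substr(n, 1, length(n)-3)
    if (dols == 0)
        return "zero dollars and " cents " cents exactly"
    return intowords(dols) " dollars and " cents " cents exactly"
}

function intowords(n) {
    n = int(n)
    if (n >= 1000)
        return intowords(n/1000) " thousand " intowords(n%1000)
    if (n >= 100)
        return intowords(n/100) " hundred " intowords(n%100)
    if (n >= 20)
        return tens[int(n/10)] " " intowords(n%10)
    return nums[n]
}

function initnum() {   # build the word tables
    split("one two three four five six seven eight nine " \
          "ten eleven twelve thirteen fourteen fifteen " \
          "sixteen seventeen eighteen nineteen", nums, " ")
    split("ten twenty thirty forty fifty sixty " \
          "seventy eighty ninety", tens, " ")
}
```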

The program contains several interesting constructs.

  • First, notice how we generate a long string of blanks in the BEGIN action with sprintf, and then convert them to dashes by substitution.

  • Note also how we combine line continuation and string concatenation to create the string argument to split in the function initnum; this is a useful idiom.

The date comes from the system by the line
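```awk
"date" | getline date    # sketch: read the output of the date command into the variable date (name assumed)
```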

which runs the date command and pipes its output into getline. A little processing converts the date from

into

(This may need revision on non-Unix systems that do not support pipes.)

The functions numtowords and intowords convert numbers to words. They are straightforward, although about half the program is devoted to them. The function intowords is recursive: it calls itself to deal with a simpler part of the problem. This is the second example of recursion in this chapter, and we will see others later on. In each case, recursion is an effective way to break a big job into smaller, more manageable pieces.

Exercise 3-9. Use the function addcomma from a previous example to include commas in the printed amount.

Exercise 3-10. The program prchecks does not deal with negative quantities or very long amounts in a graceful way. Modify the program to reject requests for checks for negative amounts and to split very long amounts onto two lines.

Exercise 3-11. The function numtowords sometimes puts out two blanks in a row. It also produces blunders like "one dollars." How would you fix these defects?

Exercise 3-12. Modify the program to put hyphens into the proper places in spelled-out amounts, as in "twenty-one dollars."

2. Data Validation

Another common use for awk programs is data validation: making sure that data is legal or at least plausible. This section contains several small programs that check input for validity.

For example, consider the column-summing programs in the previous section. Are there any numeric fields where there should be nonnumeric ones, or vice versa? Such a program is very close to one we saw before, with the summing removed:
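A sketch (isnum as in sum3):

```awk
# colcheck - check consistency of columns
#   input:  rows of numbers and strings
#   output: lines whose format differs from the first line

NR == 1 { nfld = NF
          for (i = 1; i <= NF; i++)
              type[i] = isnum($i)
        }
{   if (NF != nfld)
        printf("line %d has %d fields instead of %d\n", NR, NF, nfld)
    for (i = 1; i <= NF; i++)
        if (isnum($i) != type[i])
            printf("field %d in line %d differs from line 1\n", i, NR)
}

function isnum(n) { return n ~ /^[+-]?[0-9]+$/ }
```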

The test for numbers is again just a sequence of digits with an optional sign; see the discussion of regular expressions in Section 2.1 for a more complete version.

2.1. Balanced Delimiters

In the machine-readable text of this book, each program is introduced by a line beginning with .P1 and is terminated by a line beginning with .P2. These lines are text-formatting commands that make the programs come out in their distinctive font when the text is typeset. Since programs cannot be nested, these text-formatting commands must form an alternating sequence
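```
.P1 .P2 .P1 .P2 ... .P1 .P2
```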

If one or the other of these delimiters is omitted, the output will be badly mangled by our text formatter. To make sure that the programs would be typeset properly, we wrote this tiny delimiter checker, which is typical of a large class of such programs:
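A sketch of the checker:

```awk
# p12check - check input for alternating .P1/.P2 delimiters

/^\.P1/ { if (p != 0)
              print ".P1 after .P1, line", NR
          p = 1
        }
/^\.P2/ { if (p != 1)
              print ".P2 with no preceding .P1, line", NR
          p = 0
        }
END     { if (p != 0) print "missing .P2 at end" }
```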

If the delimiters are in the right order, the variable p silently goes through the sequence of values 0 1 0 1 0 ... 1 0. Otherwise, the appropriate error messages are printed.

Exercise 3-13. What is the best way to extend this program to handle multiple sets of delimiter pairs?

2.2. Password-File Checking

The password file (/etc/passwd) on a Unix system contains the name of and other information about authorized users. Each line of the password file has 7 fields, separated by colons:

  • The first field is the user's login name, which should be alphanumeric.

  • The second is an encrypted version of the password; if this field is empty, anyone can log in pretending to be that user, while if there is a password, only people who know the password can log in.

  • The third and fourth fields are supposed to be numeric.

  • The sixth field should begin with /.

The following program prints all lines that fail to satisfy these criteria, along with the number of the erroneous line and an appropriate diagnostic message. Running this program every night is a small part of keeping a system healthy and safe from intruders.

Note: The /etc/passwd file has changed a lot over the last 30 years, so the program below may not be correct for a modern Unix system as of 2023. But it's still a good example of data validation.
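A sketch of such a program:

```awk
# passwd - check password-file format

BEGIN { FS = ":" }
NF != 7 {
    printf("line %d, does not have 7 fields: %s\n", NR, $0) }
$1 ~ /[^A-Za-z0-9]/ {
    printf("line %d, nonalphanumeric user id: %s\n", NR, $0) }
$2 == "" {
    printf("line %d, no password: %s\n", NR, $0) }
$3 ~ /[^0-9]/ {
    printf("line %d, nonnumeric user id: %s\n", NR, $0) }
$4 ~ /[^0-9]/ {
    printf("line %d, nonnumeric group id: %s\n", NR, $0) }
$6 !~ /^\// {
    printf("line %d, invalid login directory: %s\n", NR, $0) }
```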

This is a good example of a program that can be developed incrementally: each time someone thinks of a new condition that should be checked, it can be added, so the program steadily becomes more thorough.

2.3. Generating Data-Validation Programs

We constructed the password-file checking program by hand, but a more interesting approach is to convert a set of conditions and messages into a checking program automatically. Here is a small set of error conditions and messages, where each condition is a pattern from the program above. The error message is to be printed for each input line where the condition is true.

checkgen.data:
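A hypothetical version, with the condition and the message on each line separated by tabs (the conditions are taken from the password checker above):

```
NF != 7			does not have 7 fields
$1 ~ /[^A-Za-z0-9]/	nonalphanumeric user id
$2 == ""		no password
```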

The following program converts these condition-message pairs into a checking program:

checkgen.awk:
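A sketch:

```awk
# checkgen - generate a data-checking program
#   input:  expressions of the form: pattern <tab> message
#   output: program to print message when pattern matches

BEGIN { FS = "\t+" }
{ printf("%s {\n\tprintf(\"line %%d, %s: %%s\\n\", NR, $0) }\n", $1, $2) }
```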

The output is a sequence of conditions and the actions to print the corresponding messages:
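With the hypothetical data above, the generated checker would look roughly like this:

```awk
NF != 7 {
	printf("line %d, does not have 7 fields: %s\n", NR, $0) }
$1 ~ /[^A-Za-z0-9]/ {
	printf("line %d, nonalphanumeric user id: %s\n", NR, $0) }
$2 == "" {
	printf("line %d, no password: %s\n", NR, $0) }
```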

When the resulting checking program is executed, each condition will be tested on each line, and if it is satisfied, the line number, error message, and input line will be printed. Note that in checkgen, some of the special characters in the printf format string must be quoted to produce a valid generated program. For example, % is preserved by writing %%, and \n is created by writing \\n.

This technique in which one awk program creates another is broadly applicable (and of course it's not restricted to awk programs). We will see several more examples of its use throughout this book.

Exercise 3-14. Add a facility to checkgen so that pieces of code can be passed through verbatim, for example, to create a BEGIN action to set the field separator.

2.4. Which Version of AWK?

Awk is often useful for inspecting programs, or for organizing the activities of other testing programs. This section contains a somewhat incestuous example: a program that examines awk programs.

The new version of the language has more built-in variables and functions, so there is a chance that an old program may inadvertently include one of these names, for example, by using as a variable name a word like sub that is now a built-in function. The following program does a reasonable job of detecting such problems in old programs:
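A sketch of compat (the specific name lists and the helper warn are assumptions; asplit is described below):

```awk
# compat - check if awk program uses new built-in names

BEGIN { asplit("close system atan2 sin cos rand srand " \
               "match sub gsub", fcns)
        asplit("ARGC ARGV FNR RSTART RLENGTH SUBSEP", vars)
        asplit("do delete function return", keys)
      }

      { line = $0 }

/"/   { gsub(/"([^"]|\\")*"/, "", line) }       # remove quoted strings,
/\//  { gsub(/\/([^\/]|\\\/)+\//, "", line) }   # regular expressions,
/#/   { sub(/#.*/, "", line) }                  # and comments

      { n = split(line, x, "[^A-Za-z0-9_]+")    # split into words
        for (i = 1; i <= n; i++) {
            if (x[i] in fcns)
                warn(x[i] " is now a built-in function")
            if (x[i] in vars)
                warn(x[i] " is now a built-in variable")
            if (x[i] in keys)
                warn(x[i] " is now a keyword")
        }
      }

function asplit(str, arr,   temp, i, n) {   # make an array from the words of str
    n = split(str, temp)
    for (i = 1; i <= n; i++)
        arr[temp[i]]++
    return n
}

function warn(s) {   # print a warning with file name and line number
    printf("file %s, line %d: %s\n\t%s\n", FILENAME, FNR, s, $0)
}
```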

Note the difference between \/ and \\\/ in the regex above:

  • \/ represents a single forward slash /

  • \\\/ represents a literal backslash \ followed by a forward slash /. It's easier to understand if you split it into two parts:

    1. \\ represents a literal backslash \

    2. \/ represents a forward slash /

The only real complexity in this program is in the substitution commands that attempt to remove quoted strings, regular expressions, and comments before an input line is checked. This job isn't done perfectly, so some lines may not be properly processed.

The third argument of the first split function (split(line, x, "[^A-Za-z0-9_]+")) is a string that is interpreted as a regular expression. The leftmost longest substrings matched by this regular expression in the input line become the field separators. The split command divides the resulting input line into alphanumeric strings by using non-alphanumeric strings as the field separator; this removes all the operators and punctuation at once.

The function asplit is just like split, except that it creates an array whose subscripts are the words within the string. Incoming words can then be tested for membership in this array.

This is the output of compat on itself:

Exercise 3-15. Rewrite compat to identify keywords, etc., with regular expressions instead of the function asplit. Compare the two versions on complexity and speed.

Exercise 3-16. Because awk variables are not declared, a misspelled name will not be detected. Write a program to identify names that are used only once. To make it truly useful, you will have to handle function declarations and variables used in functions.

3. Bundle and Unbundle

Before discussing multiline records, let's consider a special case. The problem is to combine ("bundle") a set of ASCII files into one file in such a way that they can be easily separated ("unbundled") into the original files. This section contains two tiny awk programs that do this pair of operations. They can be used for bundling small files together to save disk space, or to package a collection of files for convenient electronic mailing.

The bundle program is trivial, so short that you can just type it on a command line. All it does is prefix each line of the output with the name of the file, which comes from the built-in variable FILENAME.
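For example (the file names and output name here are placeholders):

```sh
awk '{ print FILENAME, $0 }' file1 file2 file3 >bundle.out
```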

The matching unbundle is only a little more elaborate:
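A sketch:

```awk
# unbundle - unpack a bundle into separate files

$1 != prev { close(prev); prev = $1 }
           { print substr($0, index($0, " ") + 1) > $1 }
```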

The first line of unbundle closes the previous file when a new one is encountered; if bundles don't contain many files (less than the limit on the number of open files), this line isn't necessary.

There are other ways to write bundle and unbundle, but the versions here are the easiest, and for short files, reasonably space efficient. Another organization is to add a distinctive line with the filename before each file, so the filename appears only once.

Exercise 3-17. Compare the speed and space requirements of these versions of bundle and unbundle with variations that use headers and perhaps trailers. Evaluate the tradeoff between performance and program complexity.

4. Multiline Records

The examples so far have featured data where each record fits neatly on one line. Many other kinds of data, however, come in multiline chunks. Examples include address lists:

or bibliographic citations:

or personal databases:

It's easy to create and maintain such information if it's of modest size and regular structure; in effect, each record is the equivalent of an index card. Dealing with such data in awk requires only a bit more work than single-line data does; we'll show several approaches.

4.1. Records Separated by Blank Lines

Imagine an address list, where each record contains on the first four lines a name, street address, city and state, and phone number; after these, there may be additional lines of other information. Records are separated by a single blank line:

When records are separated by blank lines, they can be manipulated directly: if the record separator variable RS is set to the null string (RS=""), each multiline group becomes a record. Thus
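```awk
BEGIN { RS = "" }
/New York/
```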

will print each record that contains New York, regardless of how many lines it has:

When several records are printed in this way, there is no blank line between them, so the input format is not preserved. The easiest way to fix this is to set the output record separator ORS to a double newline "\n\n":
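```awk
BEGIN { RS = ""; ORS = "\n\n" }
/New York/
```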

Output:

Suppose we want to print the names and phone numbers of all Smiths, that is, the first and fourth lines of all records in which the first line ends with Smith. That would be easy if each line were a field. This can be arranged by setting FS to "\n":
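A sketch:

```awk
BEGIN { RS = ""; FS = "\n" }
$1 ~ /Smith$/ { print $1, $4 }   # print name (line 1) and phone (line 4)
```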

This produces

Recall that newline is always a field separator for multiline records, regardless of the value of FS.

  • When RS is set to "", the field separator FS by default is any sequence of blanks and tabs, or newline.

  • When FS is set to "\n", only a newline acts as a field separator.

4.2. Processing Multiline Records

If an existing program can process its input only by lines, we may still be able to use it for multiline records by writing two awk programs.

  • The first combines the multiline records into single-line records that can be processed by the existing program.

  • Then, the second transforms the processed output back into the original multiline format. (We'll assume that limits on line lengths are not a problem.)

To illustrate, let's sort our address list with the Unix sort command. The following pipeline sorts the address list by last name:
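A sketch of the pipeline, reading the address-list-1.txt file used in the example below:

```sh
awk '
BEGIN { RS = ""; FS = "\n" }
      { printf("%s!!#", x[split($1, x, " ")])   # last name first, as the sort key
        for (i = 1; i <= NF; i++)
            printf("%s!!#", $i)                 # then every line of the record
        printf("\n")
      }
' address-list-1.txt |
sort |
awk '
BEGIN { FS = "!!#" }
      { for (i = 2; i < NF; i++)                # drop the sort key, restore the lines
            printf("%s\n", $i)
        printf("\n")
      }
'
```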

In the first program, the function split($1, x, " ") splits the first line of each record into the array x and returns the number of elements created; thus, x[split($1, x, " ")] is the entry for the last name. (This assumes that the last word on the first line really is the last name.) For each multiline record the first program creates a single line consisting of the last name, followed by the string !!#, followed by all the fields in the record separated by this string. Any other separator that does not occur in the data and that sorts earlier than the data could be used in place of the string !!#. The program after the sort reconstructs the multiline records using this separator to identify the original fields.

For example, here is a multiline file named address-list-1.txt:

Running only the first awk program on the file:

Output of the first awk program:

After sorting, the output is:

Running both the first and second awk programs:

Final output:

Exercise 3-18. Modify the first awk program to detect occurrences of the magic string !!# in the data.

4.3. Records with Headers and Trailers

Sometimes records are identified by a header and trailer, rather than by a record separator. Consider a simple example, again an address list, but this time each record begins with a header that indicates some characteristic, such as occupation, of the person whose name follows, and each record (except possibly the last) is terminated by a trailer consisting of a blank line.

Here is a multiline file named address-list-2.txt:

A range pattern is the simplest way to print the records of all doctors:
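```awk
/^doctor/, /^$/
```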

The range pattern matches records that begin with doctor and end with a blank line (/^$/ matches a blank line).

To print the doctor records without headers, we can use
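```awk
/^doctor/ { p = 1; next }   # header found: start printing with the next line
p == 1                      # print while inside a doctor record
/^$/      { p = 0 }         # the trailer (blank line) stops printing
```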

Tip: p == 1 is short for p == 1 { print $0 }. (Rule: in an awk pattern-action statement, if the action is omitted, awk prints the records matched by the pattern.)

This program uses a variable p to control the printing of lines. When a line containing the desired header is found, p is set to 1; a subsequent line containing a trailer resets p to 0, its default initial value. Since lines are printed only when p is set to 1, only the body and trailer of each record are printed; other combinations are easily selected instead.

Example 1:

Output:

Example 2:

Output:

4.4. Name-Value Data

In some applications data may have more structure than can be captured by a sequence of unformatted lines. For instance, addresses might include a country name, or might not have a street address.

One way to deal with structured data is to add an identifying name or keyword to each field of each record. For example, here is how we might organize a checkbook in this format:
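A hypothetical fragment in this format (item name, a tab, then the value):

```
check	1021
to	Village Market
amount	123.10
date	1/1/23

deposit
amount	500.00
date	1/1/23

check	1022
to	Acme Hardware
amount	12.35
date	1/2/23
```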

We are still using multiline records separated by a single blank line, but within each record, every piece of data is self-identifying: each field consists of an item name, a tab, and the information. That means that different records can contain different fields, or similar fields in arbitrary order.

One way to process this kind of data is to treat it as single lines, with occasional blank lines as separators. Each line identifies the value it corresponds to, but they are not otherwise connected. So to accumulate the sums of deposits and checks, for example, we could simply scan the input for deposits and checks, like this:

check1.awk
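A sketch:

```awk
# check1 - print total deposits and checks

/^check/   { ck = 1; next }
/^deposit/ { dep = 1; next }
/^amount/  { amt = $2; next }
/^$/       { addup() }

END        { addup()
             printf("deposits $%.2f, checks $%.2f\n", deposits, checks)
           }

function addup() {   # add the saved amount to the right total, then reset
    if (ck)
        checks += amt
    else if (dep)
        deposits += amt
    ck = dep = amt = 0
}
```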

which produces

This is easy, and it works (on correct input) no matter what order the items of a record appear in. But it is delicate, requiring careful initialization, re-initialization, and end-of-file processing.

Thus an appealing alternative is to read each record as a unit, then pick it apart as needed. The following program computes the same sums of deposits and checks, using a function to extract the value associated with an item of a given name:

check2.awk
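A sketch, reading each blank-line-separated group as one record:

```awk
# check2 - print total deposits and checks

BEGIN { RS = ""; FS = "\n" }
/(^|\n)deposit/ { deposits += field("amount"); next }
/(^|\n)check/   { checks += field("amount"); next }
END   { printf("deposits $%.2f, checks $%.2f\n", deposits, checks) }

function field(name,   i, f) {   # return the value following name in this record
    for (i = 1; i <= NF; i++) {
        split($i, f, "\t")
        if (f[1] == name)
            return f[2]
    }
    printf("error: no field %s in record\n%s\n", name, $0)
}
```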

The function field(s) finds an item in the current record whose name is s; it returns the value associated with that name.

A third possibility is to split each field into an associative array and access that for the values. To illustrate, here is a program that prints the check information in a more compact form:

Expected output:

The program is:

check3.awk
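A sketch (the val array name is an assumption):

```awk
# check3 - print check information, one line per check

BEGIN { RS = ""; FS = "\n" }
/(^|\n)check/ {
    for (i = 1; i <= NF; i++) {      # build an array indexed by item name
        split($i, f, "\t")
        val[f[1]] = f[2]
    }
    printf("%8s %5d %8s %s\n",
        val["date"], val["check"],
        sprintf("$%.2f", val["amount"]), val["to"])
    split("", val)                   # clear the array for the next record
}
```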

Note the use of sprintf to put a dollar sign ($) in front of the amount; the resulting string is then right-justified by printf.

Exercise 3-19. Write a command lookup x y that will print from a known file all multiline records having the item name x with value y.
