ch4: REPORTS AND DATABASES

1. Generating Reports

Awk can be used to select data from files and then to format the selected data into a report. We will use a three-step process to generate reports: prepare, sort, format.

  1. The preparation step involves selecting data and perhaps performing computations on it to obtain the desired information.

  2. The sort step is necessary if we want to order the data in some particular fashion. To perform this step we pass the output of the preparation program into the system sort command.

  3. The formatting step is done by a second awk program that generates the desired report from the sorted data. In this section we will generate a few reports from the countries file of Chapter 2 to illustrate the approach.

1.1. A Simple Report

Suppose we want a report giving the population, area, and population density of each country. We would like the countries to be grouped by continent, and the continents to be sorted alphabetically; within each continent the countries are to be listed in decreasing order of population density, like this:

CONTINENT       COUNTRY    POPULATION    AREA    POP. DEN.
Asia            Japan          120        144      833.3
Asia            India          746       1267      588.8
Asia            China         1032       3705      278.5
Asia            USSR           275       8649       31.8
Europe          Germany         61         96      635.4
Europe          England         56         94      595.7
Europe          France          55        211      260.7
North America   Mexico          78        762      102.4
North America   USA            237       3615       65.6
North America   Canada          25       3852        6.5
South America   Brazil         134       3286       40.8

The first two steps in preparing this report are done by the program prep1, which, when applied to the file countries.txt, determines the relevant information and sorts it:

prep1.awk

# prep1 - prepare countries by continent and pop. den.

BEGIN { FS = "\t" }
      { printf("%s:%s:%d:%d:%.1f\n",
            $4, $1, $3, $2, 1000*$3/$2) | "sort -t: +0 -1 +4rn"
      }

The output is a sequence of lines containing 5 fields, separated by colons, that give the continent, country, population, area, and population density:

cat countries.txt | awk -f prep1.awk
Asia:Japan:120:144:833.3
Asia:India:746:1267:588.8
Asia:China:1032:3705:278.5
Asia:USSR:275:8649:31.8
Europe:Germany:61:96:635.4
Europe:England:56:94:595.7
Europe:France:55:211:260.7
North America:Mexico:78:762:102.4
North America:USA:237:3615:65.6
North America:Canada:25:3852:6.5
South America:Brazil:134:3286:40.8

We wrote prep1 to print directly into the Unix sort command.

  • The -t: argument tells sort to use a colon(:) as its field separator.

  • The +0 -1 arguments make the first field the primary sort key, it specifies that sorting should be done based on the first field (+0) in reverse numeric order (-1)

  • The +4rn argument makes the fifth field, in reverse numeric order, the secondary sort key.

(In Section 6.3, we will show a sort-generator program that creates these lists of options from a description in words.)


Tips:

prep1-1.awk

# prep1-1 - prepare countries by continent and pop. den.

BEGIN { FS = "\t" }
      { printf("%s:%s:%d:%d:%.1f\n",
            $4, $1, $3, $2, 1000*$3/$2)
      }

sort just once:

cat countries.txt | awk -f prep1-1.awk | sort -t: +0 -1 +4rn

We have completed the preparation and sort steps; all we need now is to format this information into the desired report. The program form1 does the job:

form1.awk

# form1 - format countries data by continent, pop. den.

BEGIN { FS = ":"
        printf("%-15s %-10s %10s %7s %12s\n",
            "CONTINENT", "COUNTRY", "POPULATION",
            "AREA", "POP. DEN.")
      }
      { printf("%-15s %-10s %7d %10d %10.1f\n",
            $1, $2, $3, $4, $5)
      }

The desired report can be generated by typing the command line

awk -f prep1.awk countries.txt | awk -f form1.awk
# Or
cat countries.txt | awk -f prep1.awk | awk -f form1.awk

The peculiar arguments to sort in prep1 can be avoided by having the program format its output so that sort doesn't need any arguments, and then having the formatting program reformat the lines.

By default, the sort command sorts its input lexicographically. In the final report, the output needs to be sorted alphabetically by continent and in reverse numerical order by population density. To avoid arguments to sort, the preparation program can put at the beginning of each line a quantity depending on continent and population density that, when sorted lexicographically, will automatically order the output correctly. One possibility is a fixed-width representation of the continent followed by the reciprocal(倒数,e.g., n's reciprocal value is1/n) of the population density, as in prep2:

prep2.awk

# prep2 - prepare countries by continent, inverse pop. den.

BEGIN { FS = "\t"}
      { den = 1000*$3/$2
        printf("%-15s:%12.8f:%s:%d:%d:%.1f\n",
            $4, 1/den, $1, $3, $2, den) | "sort"
      }

With the countries file as input, here is the output from prep2:

Asia           :  0.00120000:Japan:120:144:833.3
Asia           :  0.00169839:India:746:1267:588.8
Asia           :  0.00359012:China:1032:3705:278.5
Asia           :  0.03145091:USSR:275:8649:31.8
Europe         :  0.00157377:Germany:61:96:635.4
Europe         :  0.00167857:England:56:94:595.7
Europe         :  0.00383636:France:55:211:260.7
North America  :  0.00976923:Mexico:78:762:102.4
North America  :  0.01525316:USA:237:3615:65.6
North America  :  0.15408000:Canada:25:3852:6.5
South America  :  0.02452239:Brazil:134:3286:40.8

The format %-15s is wide enough for all the continent names, and %12. 8f covers a wide range of reciprocal densities.

The final formatting program is like form1 but skips the new second field. The trick of manufacturing a sort key that simplifies the sorting options is quite general. We'll use it again in an indexing program in Chapter 5.

If we would like a slightly fancier report in which only the first occurrence of each continent name is printed, we can use the formatting program form2 in place of form1:

form2.awk

# form2 - format countries by continent, pop. den.

BEGIN { FS = ":"
        printf("%-15s %-10s %10s %7s %12s\n",
            "CONTINENT", "COUNTRY", "POPULATION",
            "AREA", "POP. DEN.")
      }
      { if ($1 != prev) {
            print ""
            prev = $1
        } else
            $1 = ""
        printf("%-15s %-10s %7d %10d %10.1f\n",
            $1, $2, $3, $4, $5)
      }

The command line

cat countries.txt | awk -f prep1.awk | awk -f form2.awk

generates this report:

CONTINENT       COUNTRY    POPULATION    AREA    POP. DEN.

Asia            Japan          120        144      833.3
                India          746       1267      588.8
                China         1032       3705      278.5
                USSR           275       8649       31.8

Europe          Germany         61         96      635.4
                England         56         94      595.7
                France          55        211      260.7

North America   Mexico          78        762      102.4
                USA            237       3615       65.6
                Canada          25       3852        6.5

South America   Brazil         134       3286       40.8

The formatting program form2 is a "control-break" program. The variable prev keeps track of the value of the continent field; only when it changes is the continent name printed. In the next section, we will see a more complicated example of control-break programming.

1.2. A More Complex Report

Last updated

Was this helpful?