ch4: REPORTS AND DATABASES
1. Generating Reports
Awk can be used to select data from files and then to format the selected data into a report. We will use a three-step process to generate reports: prepare, sort, format.
The preparation step involves selecting data and perhaps performing computations on it to obtain the desired information.
The sort step is necessary if we want to order the data in some particular fashion. To perform this step we pass the output of the preparation program into the system
sort
command.The formatting step is done by a second awk program that generates the desired report from the sorted data. In this section we will generate a few reports from the countries file of Chapter 2 to illustrate the approach.
1.1. A Simple Report
Suppose we want a report giving the population, area, and population density of each country. We would like the countries to be grouped by continent, and the continents to be sorted alphabetically; within each continent the countries are to be listed in decreasing order of population density, like this:
CONTINENT COUNTRY POPULATION AREA POP. DEN.
Asia Japan 120 144 833.3
Asia India 746 1267 588.8
Asia China 1032 3705 278.5
Asia USSR 275 8649 31.8
Europe Germany 61 96 635.4
Europe England 56 94 595.7
Europe France 55 211 260.7
North America Mexico 78 762 102.4
North America USA 237 3615 65.6
North America Canada 25 3852 6.5
South America Brazil 134 3286 40.8
The first two steps in preparing this report are done by the program prep1
, which, when applied to the file countries.txt
, determines the relevant information and sorts it:
prep1.awk
# prep1 - prepare countries by continent and pop. den.
BEGIN { FS = "\t" }
{ printf("%s:%s:%d:%d:%.1f\n",
$4, $1, $3, $2, 1000*$3/$2) | "sort -t: +0 -1 +4rn"
}
The output is a sequence of lines containing 5 fields, separated by colons, that give the continent, country, population, area, and population density:
cat countries.txt | awk -f prep1.awk
Asia:Japan:120:144:833.3
Asia:India:746:1267:588.8
Asia:China:1032:3705:278.5
Asia:USSR:275:8649:31.8
Europe:Germany:61:96:635.4
Europe:England:56:94:595.7
Europe:France:55:211:260.7
North America:Mexico:78:762:102.4
North America:USA:237:3615:65.6
North America:Canada:25:3852:6.5
South America:Brazil:134:3286:40.8
We wrote prep1
to print directly into the Unix sort
command.
The
-t:
argument tells sort to use a colon(:
) as its field separator.The
+0
-1
arguments make the first field the primary sort key, it specifies that sorting should be done based on the first field (+0
) in reverse numeric order (-1
)The
+4rn
argument makes the fifth field, in reverse numeric order, the secondary sort key.
(In Section 6.3, we will show a sort-generator program that creates these lists of options from a description in words.)
Tips:
prep1-1.awk
# prep1-1 - prepare countries by continent and pop. den.
BEGIN { FS = "\t" }
{ printf("%s:%s:%d:%d:%.1f\n",
$4, $1, $3, $2, 1000*$3/$2)
}
sort just once:
cat countries.txt | awk -f prep1-1.awk | sort -t: +0 -1 +4rn
We have completed the preparation and sort steps; all we need now is to format this information into the desired report. The program form1
does the job:
form1.awk
# form1 - format countries data by continent, pop. den.
BEGIN { FS = ":"
printf("%-15s %-10s %10s %7s %12s\n",
"CONTINENT", "COUNTRY", "POPULATION",
"AREA", "POP. DEN.")
}
{ printf("%-15s %-10s %7d %10d %10.1f\n",
$1, $2, $3, $4, $5)
}
The desired report can be generated by typing the command line
awk -f prep1.awk countries.txt | awk -f form1.awk
# Or
cat countries.txt | awk -f prep1.awk | awk -f form1.awk
The peculiar arguments to sort
in prep1
can be avoided by having the program format its output so that sort
doesn't need any arguments, and then having the formatting program reformat the lines.
By default, the sort
command sorts its input lexicographically. In the final report, the output needs to be sorted alphabetically by continent and in reverse numerical order by population density. To avoid arguments to sort
, the preparation program can put at the beginning of each line a quantity depending on continent and population density that, when sorted lexicographically, will automatically order the output correctly. One possibility is a fixed-width representation of the continent followed by the reciprocal(倒数,e.g., n
's reciprocal value is1/n
) of the population density, as in prep2
:
prep2.awk
# prep2 - prepare countries by continent, inverse pop. den.
BEGIN { FS = "\t"}
{ den = 1000*$3/$2
printf("%-15s:%12.8f:%s:%d:%d:%.1f\n",
$4, 1/den, $1, $3, $2, den) | "sort"
}
With the countries file as input, here is the output from prep2:
Asia : 0.00120000:Japan:120:144:833.3
Asia : 0.00169839:India:746:1267:588.8
Asia : 0.00359012:China:1032:3705:278.5
Asia : 0.03145091:USSR:275:8649:31.8
Europe : 0.00157377:Germany:61:96:635.4
Europe : 0.00167857:England:56:94:595.7
Europe : 0.00383636:France:55:211:260.7
North America : 0.00976923:Mexico:78:762:102.4
North America : 0.01525316:USA:237:3615:65.6
North America : 0.15408000:Canada:25:3852:6.5
South America : 0.02452239:Brazil:134:3286:40.8
The format %-15s
is wide enough for all the continent names, and %12. 8f
covers a wide range of reciprocal densities.
The final formatting program is like form1
but skips the new second field. The trick of manufacturing a sort
key that simplifies the sorting options is quite general. We'll use it again in an indexing program in Chapter 5.
If we would like a slightly fancier report in which only the first occurrence of each continent name is printed, we can use the formatting program form2 in place of form1
:
form2.awk
# form2 - format countries by continent, pop. den.
BEGIN { FS = ":"
printf("%-15s %-10s %10s %7s %12s\n",
"CONTINENT", "COUNTRY", "POPULATION",
"AREA", "POP. DEN.")
}
{ if ($1 != prev) {
print ""
prev = $1
} else
$1 = ""
printf("%-15s %-10s %7d %10d %10.1f\n",
$1, $2, $3, $4, $5)
}
The command line
cat countries.txt | awk -f prep1.awk | awk -f form2.awk
generates this report:
CONTINENT COUNTRY POPULATION AREA POP. DEN.
Asia Japan 120 144 833.3
India 746 1267 588.8
China 1032 3705 278.5
USSR 275 8649 31.8
Europe Germany 61 96 635.4
England 56 94 595.7
France 55 211 260.7
North America Mexico 78 762 102.4
USA 237 3615 65.6
Canada 25 3852 6.5
South America Brazil 134 3286 40.8
The formatting program form2
is a "control-break" program. The variable prev
keeps track of the value of the continent field; only when it changes is the continent name printed. In the next section, we will see a more complicated example of control-break programming.
1.2. A More Complex Report
Last updated
Was this helpful?