# ch4: REPORTS AND DATABASES

* [1. Generating Reports](#1-generating-reports)
  * [1.1. A Simple Report](#11-a-simple-report)
  * [1.2. A More Complex Report](#12-a-more-complex-report)

## 1. Generating Reports

Awk can be used to select data from files and then to format the selected data into a report. We will use a three-step process to generate reports: *prepare*, *sort*, *format*.

1. The *preparation* step involves selecting data and perhaps performing computations on it to obtain the desired information.
2. The *sort* step is necessary if we want to order the data in some particular fashion. To perform this step we pass the output of the preparation program into the system `sort` command.
3. The *formatting* step is done by a second awk program that generates the desired report from the sorted data. In this section we will generate a few reports from the countries file of **Chapter 2** to illustrate the approach.

### 1.1. A Simple Report

Suppose we want a report giving the population, area, and population density of each country. We would like the countries to be grouped by continent, and the continents to be sorted alphabetically; within each continent the countries are to be listed in decreasing order of population density, like this:

```plaintext
CONTINENT       COUNTRY    POPULATION    AREA    POP. DEN.
Asia            Japan          120        144      833.3
Asia            India          746       1267      588.8
Asia            China         1032       3705      278.5
Asia            USSR           275       8649       31.8
Europe          Germany         61         96      635.4
Europe          England         56         94      595.7
Europe          France          55        211      260.7
North America   Mexico          78        762      102.4
North America   USA            237       3615       65.6
North America   Canada          25       3852        6.5
South America   Brazil         134       3286       40.8
```

The first two steps in preparing this report are done by the program `prep1`, which, when applied to the file `countries.txt`, determines the relevant information and sorts it:

`prep1.awk`

```awk
# prep1 - prepare countries by continent and pop. den.

BEGIN { FS = "\t" }
      { printf("%s:%s:%d:%d:%.1f\n",
            $4, $1, $3, $2, 1000*$3/$2) | "sort -t: +0 -1 +4rn"
      }
```

The output is a sequence of lines containing **5** fields, separated by colons, that give the continent, country, population, area, and population density:

```bash
cat countries.txt | awk -f prep1.awk
```

```plaintext
Asia:Japan:120:144:833.3
Asia:India:746:1267:588.8
Asia:China:1032:3705:278.5
Asia:USSR:275:8649:31.8
Europe:Germany:61:96:635.4
Europe:England:56:94:595.7
Europe:France:55:211:260.7
North America:Mexico:78:762:102.4
North America:USA:237:3615:65.6
North America:Canada:25:3852:6.5
South America:Brazil:134:3286:40.8
```

We wrote `prep1` to print directly into the Unix `sort` command.

* The `-t:` argument tells sort to use a colon(`:`) as its field separator.
* The `+0` `-1` arguments make the first field the *primary sort key*, it specifies that sorting should be done based on the first field (`+0`) in reverse numeric order (`-1`)
* The `+4rn` argument makes the **fifth** field, in reverse numeric order, the *secondary sort key*.

(In **Section 6.3**, we will show a sort-generator program that creates these lists of options from a description in words.)

***

**Tips**:

`prep1-1.awk`

```awk
# prep1-1 - prepare countries by continent and pop. den.

BEGIN { FS = "\t" }
      { printf("%s:%s:%d:%d:%.1f\n",
            $4, $1, $3, $2, 1000*$3/$2)
      }
```

sort just once:

```bash
cat countries.txt | awk -f prep1-1.awk | sort -t: +0 -1 +4rn
```

***

We have completed the preparation and sort steps; all we need now is to format this information into the desired report. The program `form1` does the job:

`form1.awk`

```awk
# form1 - format countries data by continent, pop. den.

BEGIN { FS = ":"
        printf("%-15s %-10s %10s %7s %12s\n",
            "CONTINENT", "COUNTRY", "POPULATION",
            "AREA", "POP. DEN.")
      }
      { printf("%-15s %-10s %7d %10d %10.1f\n",
            $1, $2, $3, $4, $5)
      }
```

The desired report can be generated by typing the command line

```bash
awk -f prep1.awk countries.txt | awk -f form1.awk
# Or
cat countries.txt | awk -f prep1.awk | awk -f form1.awk
```

The **peculiar** arguments to `sort` in `prep1` can be avoided by having the program format its output so that `sort` doesn't need any arguments, and then having the formatting program reformat the lines.

By default, the `sort` command sorts its input lexicographically. In the final report, the output needs to be sorted alphabetically by continent and in reverse numerical order by population density. To avoid arguments to `sort`, the preparation program can put at the beginning of each line a quantity depending on continent and population density that, when sorted lexicographically, will automatically order the output correctly. One possibility is a fixed-width representation of the continent followed by the **reciprocal**(倒数，e.g., `n`'s reciprocal value is`1/n`) of the population density, as in `prep2`:

`prep2.awk`

```awk
# prep2 - prepare countries by continent, inverse pop. den.

BEGIN { FS = "\t"}
      { den = 1000*$3/$2
        printf("%-15s:%12.8f:%s:%d:%d:%.1f\n",
            $4, 1/den, $1, $3, $2, den) | "sort"
      }
```

With the countries file as input, here is the output from prep2:

```plaintext
Asia           :  0.00120000:Japan:120:144:833.3
Asia           :  0.00169839:India:746:1267:588.8
Asia           :  0.00359012:China:1032:3705:278.5
Asia           :  0.03145091:USSR:275:8649:31.8
Europe         :  0.00157377:Germany:61:96:635.4
Europe         :  0.00167857:England:56:94:595.7
Europe         :  0.00383636:France:55:211:260.7
North America  :  0.00976923:Mexico:78:762:102.4
North America  :  0.01525316:USA:237:3615:65.6
North America  :  0.15408000:Canada:25:3852:6.5
South America  :  0.02452239:Brazil:134:3286:40.8
```

The format `%-15s` is wide enough for all the continent names, and `%12. 8f` covers a wide range of reciprocal densities.

The final formatting program is like `form1` but *skips* the new *second* field. The trick of manufacturing a `sort` key that simplifies the sorting options is quite general. We'll use it again in an indexing program in **Chapter 5**.

If we would like a slightly fancier report in which only the first occurrence of each continent name is printed, we can use the formatting program form2 in place of `form1`:

`form2.awk`

```awk
# form2 - format countries by continent, pop. den.

BEGIN { FS = ":"
        printf("%-15s %-10s %10s %7s %12s\n",
            "CONTINENT", "COUNTRY", "POPULATION",
            "AREA", "POP. DEN.")
      }
      { if ($1 != prev) {
            print ""
            prev = $1
        } else
            $1 = ""
        printf("%-15s %-10s %7d %10d %10.1f\n",
            $1, $2, $3, $4, $5)
      }
```

The command line

```bash
cat countries.txt | awk -f prep1.awk | awk -f form2.awk
```

generates this report:

```console
CONTINENT       COUNTRY    POPULATION    AREA    POP. DEN.

Asia            Japan          120        144      833.3
                India          746       1267      588.8
                China         1032       3705      278.5
                USSR           275       8649       31.8

Europe          Germany         61         96      635.4
                England         56         94      595.7
                France          55        211      260.7

North America   Mexico          78        762      102.4
                USA            237       3615       65.6
                Canada          25       3852        6.5

South America   Brazil         134       3286       40.8
```

The formatting program `form2` is a "control-break" program. The variable `prev` keeps track of the value of the continent field; only when it changes is the continent name printed. In the next section, we will see a more complicated example of control-break programming.

### 1.2. A More Complex Report
