Use awk to easily play with text data statistics

Let's start with the basic structure of an awk program: pattern { action }. awk scans the text line by line, finds the lines that match the pattern, and executes the corresponding action on each of them. It finishes once all lines have been read.
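
As a minimal sketch of the pattern { action } structure, the commands below run awk on three made-up lines (the file demo.txt is only an illustration, not part of the article's data). Either part may be left out: with no pattern the action runs on every line, and with no action awk prints each matching line in full.

```shell
# Three made-up input lines, used only for this illustration.
printf 'a 1\nb 2\nc 3\n' > demo.txt

# Pattern and action: print the first field of lines whose second field exceeds 1.
awk '$2 > 1 {print $1}' demo.txt

# Action only: there is no pattern, so the action runs on every line.
awk '{print $2}' demo.txt

# Pattern only: there is no action, so awk prints the matching line as-is.
awk '$1 == "b"' demo.txt
```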

Let’s look at the simplest example first. The trace.log file contains name, age, and length of service.
star 25 1
moon 32 5
sun 29 3
river 27 4
lake 30 6
sea 35 8

Suppose you want to find the employees who have worked for 5 years or more and print their whole rows. One line is enough: write the condition directly in the pattern part. $0 stands for the whole row.

awk '$3 >= 5 {print $0}' trace.log

results:
moon 32 5
lake 30 6
sea 35 8

Adding line numbers to the output is easy too: awk provides the built-in variable NR, which holds the number of the current line.

awk '$3 >= 5 {print NR,$0}' trace.log

results:
2 moon 32 5
5 lake 30 6
6 sea 35 8
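
NR can also appear in the pattern itself. A small sketch, assuming the same trace.log as above (recreated here so the snippet runs on its own), that prints only the first two lines with their numbers:

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

# NR in the pattern: act only on the first two input lines.
awk 'NR <= 2 {print NR, $0}' trace.log
# 1 star 25 1
# 2 moon 32 5
```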

You can also format the output with printf if you prefer.

awk '$3 >= 5 {printf("name: %s,age: %d \n",$1,$2)}' trace.log

results:
name: moon,age: 32
name: lake,age: 30
name: sea,age: 35
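
printf also accepts field widths, which helps when the columns should line up. The following is only a sketch of one possible layout (the widths 8 and 3 are arbitrary choices, not from the original example):

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

# %-8s left-aligns the name in 8 columns; %3d right-aligns each number in 3 columns.
awk '$3 >= 5 {printf("%-8s %3d %3d\n", $1, $2, $3)}' trace.log
```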

Often awk is used only for its action, to print the fields you care about, but the real power lies in the pattern. Say you only want to see the row whose name is moon:

awk '$1 == "moon"' trace.log        # moon 32 5

Patterns can be grouped with parentheses and combined with the logical operators && (and), || (or), and ! (not).

awk '$2 >30 && $3 >= 8' trace.log    # sea 35 8
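
The other two operators work the same way. A short sketch on the same data (the thresholds are chosen only for illustration):

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

# ||: older than 30 OR at least 6 years of service.
awk '$2 > 30 || $3 >= 6' trace.log    # moon, lake and sea rows

# !: negate a parenthesized condition, i.e. fewer than 5 years of service.
awk '!($3 >= 5)' trace.log            # star, sun and river rows
```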

That may not seem very exciting so far, so let's move on to statistics. Two things to know first: the END keyword matches the position after the last line, so its action runs once all input has been read; and you can create your own variables for calculations without declaring them. How many people have worked for 5 years or more?

awk '$3 >= 5 {total += 1} END {print "5 years or more:", total, "people"}' trace.log

Scanning line by line works like a loop: the calculation happens on each pass, and the END action runs once after the last line has been processed.
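
One detail worth knowing: a variable that was never assigned prints as an empty string, so when no line matches, the count comes out blank rather than 0. A minimal sketch of a common workaround, adding 0 to force a numeric value (the threshold of 10 years is only there to produce zero matches):

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

# Nobody has 10 or more years of service, so total is never assigned.
awk '$3 >= 10 {total += 1} END {print "count:", total}' trace.log       # count:

# total + 0 forces a number, so an unset total prints as 0.
awk '$3 >= 10 {total += 1} END {print "count:", total + 0}' trace.log   # count: 0

# With a threshold of 5 the count is 3, as before.
awk '$3 >= 5 {total += 1} END {print "count:", total + 0}' trace.log    # count: 3
```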

Now that you know how to sum, you can compute the average age of the employees. Remember that NR holds the number of rows read so far, so in the END action it is the total row count.

awk '{total += $2} END {print total/NR}' trace.log
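
Dividing by NR only works when every row takes part in the sum. When a pattern filters the rows, a separate counter is needed. The sketch below (the counter name n is my own choice) averages the age of employees with at least 5 years of service and skips the division if nothing matched:

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

# NR would count all 6 rows, so n counts only the matching ones.
# The n > 0 check avoids a division by zero when no row matches.
awk '$3 >= 5 {total += $2; n++} END {if (n > 0) printf("%.2f\n", total / n)}' trace.log
# 32.33
```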

To get a sense of the typical age, the median may be more representative than the mean. Sort the rows by age first:

awk '{print $2, $3, $1}' trace.log | sort -n

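Then take the value in the middle position. With an odd number of rows that is a single value; with an even number, as here, the median is the average of the two middle values.

awk '{print $2, $3, $1}' trace.log | sort -n | awk '{line[NR] = $1} END {if (NR % 2) print line[(NR + 1) / 2]; else print (line[NR / 2] + line[NR / 2 + 1]) / 2}'    # 29.5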

You may also want the maximum and minimum ages. With what you have seen so far, that is easy:

awk '{print $2, $3, $1}' trace.log | sort -n | awk '{line[NR] = $0} END {print line[1]; print line[NR]}'
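
Sorting is not strictly necessary for the minimum and maximum. As an alternative sketch (not the approach used above), a single awk pass can track both values as it reads the file:

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

# Start both values from the first row, then update them on every row.
awk 'NR == 1 {min = max = $2}
     $2 < min {min = $2}
     $2 > max {max = $2}
     END {print "min:", min, "max:", max}' trace.log
# min: 25 max: 35
```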

There is also a BEGIN keyword that has not been covered here. It works like END, except that its action runs once before the first line is read. I won't go into details.
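
To summarize how the three parts fit together, here is a minimal sketch (the header text is made up for illustration): BEGIN runs once before any input, the main action runs on every line, and END runs once after the last line.

```shell
# Recreate the sample file from the article.
printf '%s\n' 'star 25 1' 'moon 32 5' 'sun 29 3' 'river 27 4' 'lake 30 6' 'sea 35 8' > trace.log

awk 'BEGIN {print "name age years"}    # runs before the first line is read
     {print $1, $2, $3}                # runs on every line
     END {print NR, "rows"}            # runs after the last line
    ' trace.log
```

The first line of output is the header, followed by the six data rows and a final line reading 6 rows.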
