Awk for sorting data under Linux

The Linux "Three Musketeers" grep, awk, and sed correspond roughly to searching, field extraction, and editing, respectively.

Here we focus on classification, sorting, and statistics.

Command meaning

awk is named for the initials of its three authors: Aho, Weinberger, and Kernighan. It is a language-based text-processing engine, used mainly to filter, classify, and count log data.

Commonly used awk commands and meanings at work

awk 'BEGIN{}END{}'    actions at program start and end
awk '/Running/'       regular-expression match
awk '/aa/,/bb/'       range selection
awk '$2~/xxx/'        field match: lines whose second field contains xxx
awk 'NR==2'           take line 2
awk 'NR>1'            drop line 1

awk operates line by line; each line is split into fields, by default on whitespace. Both the line separator and the field separator can be changed.

The basic syntax is `awk 'pattern{action}'`. Either part may be omitted: with only a pattern, awk prints the lines that match it; with only an action, awk executes that action for every line.
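To make the two degenerate forms concrete, here is a small sketch; the file `demo.txt` and its contents are made up for illustration:

```shell
# Create a tiny sample file (hypothetical data).
printf 'alice 30\nbob 25\ncarol 30\n' > demo.txt

# Pattern only: print every line matching the regex (implicit {print $0}).
awk '/30/' demo.txt

# Action only: run the action on every line (no filtering).
awk '{print $1}' demo.txt
```

The first command prints the alice and carol lines; the second prints the first column of all three lines.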

pattern can have the following types

BEGIN                    action taken before any input is read
END                      action taken after all input is read
BEGINFILE                action taken before each input file is read (gawk)
ENDFILE                  action taken after each input file is read (gawk)
/regular expression/     action taken on every line matching the regex
pattern && pattern       logical AND
pattern || pattern       logical OR
pattern ? pattern : pattern   ternary expression
(pattern)                grouping
! pattern                negation
pattern1, pattern2       range expression: act on the lines from the first match of pattern1 through the first match of pattern2
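A short sketch of the range and combined patterns; the file `range.txt` and its contents are invented for the example:

```shell
printf 'start\none\ntwo\nstop\nthree\n' > range.txt

# Range pattern: from the first /start/ match through the first /stop/ match.
awk '/start/,/stop/' range.txt     # start, one, two, stop

# Combined pattern: lines containing "t" but not matching /stop/.
awk '/t/ && !/stop/' range.txt     # start, two, three
```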

Row operation patterns

Pattern              Meaning
NR==1                take line 1
NR>=1 && NR<=5       take lines 1 through 5
/oldboy/             take the lines containing oldboy
/101/,/105/          take the lines from the first /101/ match through the first /105/ match
> < >= <= == !=      comparison operators
$2~/xxx/             field match: lines whose second field contains xxx
NR==2                take line 2
NR>1                 drop line 1
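The NR-based row patterns above can be sketched as follows; the file `nr.txt` is made up for the example:

```shell
printf 'one\ntwo\nthree\nfour\n' > nr.txt

awk 'NR==2' nr.txt             # take line 2: two
awk 'NR>1' nr.txt              # drop line 1: two, three, four
awk 'NR>=2 && NR<=3' nr.txt    # lines 2 through 3: two, three
```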

Column operation patterns

-F           specify the field separator, i.e. the mark that ends each column (default: runs of spaces or tabs)
$n           take column n; in awk, $ followed by a number selects a field
$0           the content of the entire line
{print xxx}  the action used when printing columns
$NF          the last column
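A quick sketch of `-F` and `$NF` on an /etc/passwd-style record:

```shell
# -F':' sets the field separator; $1 is the first field, $NF the last.
echo 'root:x:0:0:root:/root:/bin/bash' | awk -F':' '{print $1, $NF}'
# -> root /bin/bash
```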

Example 1

Send signal 12 to all nginx processes.

sample

$ ps -ef|grep /opt/nginx/ |grep -v grep
root     109775      1  0 Oct21 ?        00:16:57 ./nginx
root     109776      1  0 Oct21 ?        00:15:04 ./nginx
root     196548      1 15 Oct19 ?        1-02:08:48 ./nginx
root     196567      1 12 Oct19 ?        21:12:28 ./nginx

We need the process IDs. Each line is split on whitespace, and the PID is in the second column, so we print column 2 first, then pipe the PIDs to kill to send signal 12.

$ ps -ef|grep /opt/nginx/ |grep -v grep|awk '{print $2}'|xargs kill -12 

Example 2

For the nginx processes above, find those whose PID starts with 19 and send them signal 50.

We want to match PIDs that start with 19: `$2` selects the second column, and a regex anchored with `^` restricts the match to the start of the field.

$ ps -ef|grep /opt/nginx/ |grep -v grep|awk '$2~/^19/{print $2}'|xargs kill -50

Follow-up: take the PIDs starting with 19, sort them, deduplicate them with counts, and display them in descending numeric order.

$ ps -ef|grep /opt/nginx/ |grep -v grep|awk '$2~/^19/{print $2}'| sort | uniq -c | sort -nr

Command meaning:

sort: sort in ascending order
uniq -c: deduplicate adjacent lines and prefix each with its repeat count
sort -nr: sort numerically in descending order
sort -n: sort numerically in ascending order
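The counting pipeline can be seen on a minimal made-up PID list:

```shell
# Three 192s and two 191s; uniq -c needs sorted input to group duplicates.
printf '192\n191\n192\n192\n191\n' | sort | uniq -c | sort -nr
# The most frequent value ("3 192") comes first; the count column is
# space-padded, so exact widths vary between implementations.
```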

Example 3

Use awk to add line numbers to each line in the text

Sample text self-generated

Idea: before awk starts, initialize an index idx=0; for each line, store the line in an array under idx and increment idx; after all lines are read, loop over the line count and print each line number together with the saved line.

use command

cat config.h_json |awk 'BEGIN{idx=0}{m[idx]=$0;idx++}END{for(i=0;i<NR;i++) print i+1, m[i]}'

Notably, awk variables and arrays can be used directly without being declared.

Meaning explained:

idx=0 is set before any line is processed

each line is stored in the array, then idx is incremented

after all input is read, each saved line is printed with its line number
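The auto-initialization of awk variables can be demonstrated in one line: `count` is never declared, yet it starts at 0 and can be incremented immediately.

```shell
# Uninitialized awk variables start as 0 (numeric) / "" (string),
# so count++ works with no declaration.
printf 'a\nb\nc\n' | awk '{count++} END{print count}'
# -> 3
```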

Example 4

Count the number of occurrences of each first-level domain label in a list of URLs.

[root@shell ~]# cat url.txt
http://www.etiantian.org/index.html
http://www.etiantian.org/1.html
http://post.etiantian.org/index.html
http://mp3.etiantian.org/index.html
http://www.etiantian.org/3.html

Observe data characteristics:

The first-level labels here are `www`, `post`, and so on. They appear after `//` and before the first `.`. They could be captured with a grep regular expression, but extracting around those two special symbols makes the pattern fiddly. With awk it is easier to split into columns: using both `/` and `.` as delimiters divides each line into three or more parts (the part before, the area of interest, and the rest), so we can print only the area of interest, which is clean, and then finish with `sort | uniq -c`.

use command

awk -F'[/.]+' '{print $2}' url.txt | sort | uniq -c

operation result

$ awk -F'[/.]+' '{print $2}' url.txt | sort | uniq -c
      1 mp3
      1 post
      3 www

Use command 2 (pure awk)

awk -F'[/.]+' '{arr[$2]++}END{for(i in arr) print i,arr[i]}' url.txt

operation result

$ awk -F'[/.]+' '{arr[$2]++}END{for(i in arr) print i,arr[i]}' url.txt
mp3 1
post 1
www 3

Additional: sort with `sort -rnk2`

`-n` numeric sort

`-k2` use the second field as the sort key

`-r` reverse (descending) order
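Applied to output in the `label count` shape from the previous example, `sort -rnk2` puts the largest count first (the sample lines are hard-coded here to match the earlier result):

```shell
printf 'mp3 1\npost 1\nwww 3\n' | sort -rnk2
# "www 3" is printed first; the relative order of the tied 1-count
# lines is implementation-dependent.
```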


Origin blog.csdn.net/qq_33882435/article/details/127530515