Classic practical awk use cases
- Insert a few new fields
- Format whitespace
- Filter IPv4 addresses
- Deduplication based on a field
- Statistics
- Count non-200 status codes per IP in a log
- Count unique IPs per URL
- Handle missing data for fields
- Handle data with field separators in fields
- Get the specified number of characters in the field
- Filter logs within a given time range
Insert a few new fields
Insert three fields "e f g" after the "b" of "a b c d".
dzy@dzy-virtual-machine:~$ echo "a b c d" | awk '{$2=$2" e f g";print}'
a b e f g c d
Format whitespace
Remove leading and trailing whitespace from each line, and normalize the spacing between fields.
aaaa bbb ccc
bbb aaa ccc
ddd fff eee gg hh ii jj
dzy@dzy-virtual-machine:~$ awk '{$1=$1;print}' 2.txt
aaaa bbb ccc
bbb aaa ccc
ddd fff eee gg hh ii jj
dzy@dzy-virtual-machine:~$ awk 'BEGIN{OFS="\t"}{$1=$1;print}' 2.txt
aaaa bbb ccc
bbb aaa ccc
ddd fff eee gg hh ii jj
Filter IPv4 addresses
From the output of the ifconfig command, filter out the IPv4 addresses of all interfaces except lo.
Method 1:
dzy@dzy-virtual-machine:~$ ifconfig | awk 'BEGIN{RS="";FS="\n"}!/lo/{$0=$2;FS=" ";$0=$0;print $2}'
192.168.126.136
This approach uses RS="" to read each paragraph (one per NIC) as a record and FS="\n" so that each line becomes a field; the second line of the paragraph is therefore $2. Assigning $2 to $0, then setting FS back to a space and reassigning $0=$0, re-splits that line into space-separated fields, so $2 is now the IP address.
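The re-split step can be seen in isolation. A minimal sketch (the sample text is made up): changing FS alone does not affect the current record; only reassigning $0 forces a re-split.

```shell
# The record was already split when it was read, so FS=":" alone changes nothing;
# assigning $0 to itself re-splits the record with the current FS.
echo 'x:1:2' | awk '{FS=":"; $0=$0; print $2}'
# prints 1
```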
Method 2:
dzy@dzy-virtual-machine:~$ ifconfig | awk '/inet / && !($2 ~ /^127/){print $2}'
192.168.126.136
Method 3:
# Select by paragraph; the default field separator is whitespace. Field order is not guaranteed, and multiple NICs cannot be handled this way.
dzy@dzy-virtual-machine:~$ ifconfig | awk 'BEGIN{RS=""}!/lo/{print $6}'
192.168.126.136
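On newer systems ifconfig may be absent or print a different layout; the same extraction also works against iproute2's `ip -4 -o addr show`, which emits one line per address. A sketch against canned sample output (the `ens33` line below is made up; on a live system pipe the real command instead):

```shell
# `ip -4 -o addr show` prints one line per address: "IDX IFNAME inet ADDR/PREFIX ..."
# Canned sample output for illustration; replace printf with the real command on a live box.
printf '1: lo    inet 127.0.0.1/8 scope host lo\n2: ens33    inet 192.168.126.136/24 brd 192.168.126.255 scope global ens33\n' |
awk '$2!="lo"{sub(/\/.*/,"",$4); print $4}'
# prints 192.168.126.136
```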
Deduplication based on a field
Remove lines with duplicate uid=xxx values.
2019-01-13_12:00_index?uid=123
2019-01-13_13:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
2019-01-14_12:00_index?uid=123
2019-01-14_13:00_index?uid=123
2019-01-15_14:00_index?uid=333
2019-01-16_15:00_index?uid=9710
To deduplicate by uid, split each line on "?" and use the uid=xxx part as an array key; that key is the basis for detecting duplicates. Then count occurrences of each uid: print the line on its first occurrence and skip subsequent ones.
dzy@dzy-virtual-machine:~$ awk -F"?" '!arr[$2]++{print}' 3.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
or
dzy@dzy-virtual-machine:~$ awk -F"?" '{arr[$2]=arr[$2]+1;if(arr[$2]==1){print}}' 3.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
or
dzy@dzy-virtual-machine:~$ awk -F"?" '{arr[$2]++;if(arr[$2]==1){print}}' 3.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
Statistics
Count how many times each line appears in the file below:
portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
status
status
mountd
mountd
mountd
mountd
mountd
mountd
nfs
nfs
nfs_acl
nfs
nfs
nfs_acl
nlockmgr
nlockmgr
nlockmgr
nlockmgr
nlockmgr
dzy@dzy-virtual-machine:~$ awk '{arr[$0]++}END{for(idx in arr){printf "%s\t%s\n",arr[idx],idx}}' 4.txt
6	portmapper
2	nfs_acl
5	nlockmgr
2	status
4	nfs
6	mountd
or
dzy@dzy-virtual-machine:~$ awk '{arr[$0]++}END{for(i in arr){print arr[i], i}}' 4.txt
6 portmapper
2 nfs_acl
5 nlockmgr
2 status
4 nfs
6 mountd
Count the number of non-200 status codes accessed by each IP in the statistics log
Log sample data:
111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169
Find the IPs whose requests returned a non-200 status code, and output the 10 IPs with the most such requests.
# Method 1
awk '$9!=200{arr[$1]++}END{for(i in arr){print arr[i],i}}' access.log | sort -k1nr | head -n 10
# Method 2
Instead of piping to sort, gawk can sort arrays itself (asort()), and the traversal order of for-in loops can be set via PROCINFO:
PROCINFO["sorted_in"]="@val_num_desc"
awk '
$9!=200{
arr[$1]++}
END{
PROCINFO["sorted_in"]="@val_num_desc";
for(i in arr){
# counter: stop after the top 10
if(cnt++==10){
exit}
print arr[i],i
}
}' access.log
Count unique IPs per URL
Format: url | access IP | access time | visitor
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest
Requirement: Count the number of unique access IPs for each URL (deduplicated), save one file per URL, and produce results similar to:
a.com.cn 2
b.com.cn 2
c.com.cn 1
And there are three corresponding files:
a.com.cn.txt
b.com.cn.txt
c.com.cn.txt
code:
dzy@dzy-virtual-machine:~$ cat 6.awk
BEGIN{
FS="|"
}
!arr[$1,$2]++{
arr1[$1]++
}
END{
for(i in arr1){
print i,arr1[i] >(i".txt")
}
}
result: one file per URL, each containing that URL and its unique-IP count.
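A self-contained run of the same logic (the heredoc recreates the sample data; file names follow the example above):

```shell
# Recreate the sample data, run the per-URL unique-IP counter, and inspect one output file.
cat > 6.txt <<'EOF'
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest
EOF
# arr[$1,$2] dedups (url, ip) pairs; arr1[$1] counts unique IPs per url.
awk 'BEGIN{FS="|"} !arr[$1,$2]++{arr1[$1]++} END{for(i in arr1) print i, arr1[i] > (i".txt")}' 6.txt
cat a.com.cn.txt   # a.com.cn 2
```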
Handle missing data for fields
ID name gender age email phone
1 Bob male 28 abc@qq.com 18023394012
2 Alice female 24 def@gmail.com 18084925203
3 Tony male 21 17048792503
4 Kevin male 21 bbb@189.com 17023929033
5 Alex male 18 ccc@xyz.com 18185904230
6 Andy female ddd@139.com 18923902352
7 Jerry female 25 exdsa@189.com 18785234906
8 Peter male 20 bax@qq.com 17729348758
9 Steven 23 bc@sohu.com 15947893212
10 Bruce female 27 bcbd@139.com 13942943905
When fields are missing, splitting directly with FS becomes very tricky. To handle this special need, gawk provides the FIELDWIDTHS variable, which divides fields by character width; the width specifications are separated by spaces.
dzy@dzy-virtual-machine:~$ awk '{print $0}' FIELDWIDTHS="2 2:6 2:6 2:3 2:13 2:11" 6.txt
ID name gender age email phone
1 Bob male 28 abc@qq.com 18023394012
2 Alice female 24 def@gmail.com 18084925203
3 Tony male 21 17048792503
4 Kevin male 21 bbb@189.com 17023929033
5 Alex male 18 ccc@xyz.com 18185904230
6 Andy female ddd@139.com 18923902352
7 Jerry female 25 exdsa@189.com 18785234906
8 Peter male 20 bax@qq.com 17729348758
9 Steven 23 bc@sohu.com 15947893212
10 Bruce female 27 bcbd@139.com 13942943905
In FIELDWIDTHS, the first specification is the character width of the ID column: 2 characters. The second field is at most 6 characters wide, but there are two spaces between it and the ID column, so you can either give it a width of 8 or skip two characters with 2:6.
Handle data with field separators in fields
Below is a line from a CSV file that separates fields with commas.
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
Requirement: Get the third field "1234 A Pretty Street, NE".
When a field itself contains the field separator, splitting directly with FS becomes very tricky. To handle this special need, gawk provides the FPAT variable. Rather than describing the separator, FPAT is a regex describing the fields themselves; each successful match is stored in $1, $2, $3, ... (much as grep highlights the matching part of a line).
echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA' |\
awk 'BEGIN{FPAT="[^,]+|\".*\""}{print $1,$3}'
Get the specified number of characters in the field
16 001agdcdafasd
16 002agdcxxxxxx
23 001adfadfahoh
23 002fsdadggggg
desired output:
16 001
16 002
23 001
23 002
awk string indexes start at 1, so substr($2,1,3) takes the first three characters:
awk '{print $1,substr($2,1,3)}'
Or split with FIELDWIDTHS:
awk 'BEGIN{FIELDWIDTHS="2 2:3"}{print $1,$2}' a.txt
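A quick run of the substr() one-liner on two of the sample lines:

```shell
# substr($2,1,3) keeps the first three characters of the second field.
printf '16 001agdcdafasd\n23 001adfadfahoh\n' | awk '{print $1, substr($2,1,3)}'
# prints:
# 16 001
# 23 001
```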
Filter logs within a given time range
When filtering logs with regular expressions in grep/sed/awk, it is very hard to be precise down to the hour, minute, and second. But awk provides the mktime() function, which converts a time into an epoch value.
# Convert 2019-11-10 03:42:40 into seconds since the epoch (1970-01-01 00:00:00)
$ awk 'BEGIN{print mktime("2019 11 10 03 42 40")}'
1573328560
This way the timestamp string can be extracted from each log line, its year, month, day, hour, minute, and second pulled out, and the pieces fed to mktime() to build the corresponding epoch value. Because epoch values are numbers, comparing them compares the times.
The strptime1() implementation below converts strings in the 2019-11-10T03:42:40+08:00 format into epoch values, which are then compared with which_time to filter the logs with one-second precision. patsplit() is used to pull the numbers out of the timestamp.
dzy@dzy-virtual-machine:~$ cat date.awk
BEGIN{
# build the epoch value of the time we want to filter on
which_time = mktime("2019 11 10 03 42 40")
}
{
# extract the date-time string from the log line
match($0,"^.*\\[(.*)\\].*",arr)
# convert the date-time string to an epoch value
tmp_time = strptime1(arr[1])
# compare times by comparing epoch values
if(tmp_time > which_time){
print
}
}
# time string format handled here: "2019-11-10T03:42:40+08:00"
function strptime1(str,  arr,Y,M,D,H,m,S) {
patsplit(str,arr,"[0-9]{1,4}")
Y=arr[1]
M=arr[2]
D=arr[3]
H=arr[4]
m=arr[5]
S=arr[6]
return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}
The strptime2() implementation below converts strings in the 10/Nov/2019:23:53:44+08:00 format into epoch values, which are again compared with which_time to filter the logs with one-second precision.
BEGIN{
# build the epoch value of the time we want to filter on
which_time = mktime("2019 11 10 03 42 40")
}
{
# extract the date-time string from the log line
match($0,"^.*\\[(.*)\\].*",arr)
# convert the date-time string to an epoch value
tmp_time = strptime2(arr[1])
# compare times by comparing epoch values
if(tmp_time > which_time){
print
}
}
# time string format handled here: "10/Nov/2019:23:53:44+08:00"
function strptime2(str,dt_str,arr,Y,M,D,H,m,S) {
dt_str = gensub("[/:+]"," ","g",str)
# dt_str = "10 Nov 2019 23 53 44 08 00"
split(dt_str,arr," ")
Y=arr[3]
M=mon_map(arr[2])
D=arr[1]
H=arr[4]
m=arr[5]
S=arr[6]
return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}
function mon_map(str,mons){
mons["Jan"]=1
mons["Feb"]=2
mons["Mar"]=3
mons["Apr"]=4
mons["May"]=5
mons["Jun"]=6
mons["Jul"]=7
mons["Aug"]=8
mons["Sep"]=9
mons["Oct"]=10
mons["Nov"]=11
mons["Dec"]=12
return mons[str]
}