Classic practical awk use cases
- Insert a few new fields
- Format whitespace
- Filter IPv4 addresses
- Deduplication based on a field
- Statistics
- Count non-200 status codes per IP in a log
- Count unique IPs per URL
- Handle missing data for fields
- Handle data with field separators in fields
- Get the specified number of characters in the field
- Filter logs within a given time range
Insert a few new fields
Insert three fields "e f g" after the "b" of "a b c d".
dzy@dzy-virtual-machine:~$ echo "a b c d" | awk '{$2=$2" e f g";print}'
a b e f g c d
Format whitespace
Remove leading and trailing whitespace from each line, and normalize the spacing between fields.
aaaa bbb ccc
bbb aaa ccc
ddd fff eee gg hh ii jj
dzy@dzy-virtual-machine:~$ awk '{$1=$1;print}' 2.txt
aaaa bbb ccc
bbb aaa ccc
ddd fff eee gg hh ii jj
dzy@dzy-virtual-machine:~$ awk 'BEGIN{OFS="\t"}{$1=$1;print}' 2.txt
aaaa bbb ccc
bbb aaa ccc
ddd fff eee gg hh ii jj
Filter IPv4 addresses
From the output of the ifconfig command, filter out the IPv4 addresses of all interfaces except lo.
Method 1:
dzy@dzy-virtual-machine:~$ ifconfig | awk 'BEGIN{RS="";FS="\n"}!/lo/{$0=$2;FS=" ";$0=$0;print $2}'
192.168.126.136
This approach uses RS="" to read each paragraph (one per NIC) as a record and FS="\n" so that each line becomes a field; the second line of the paragraph is therefore $2. Assigning $2 to $0, then setting FS back to a space and reassigning $0=$0, re-splits that line into space-separated fields, so $2 is now the IP address.
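The re-split step can be seen in isolation. A minimal sketch (the sample text is made up): changing FS alone does not affect the current record; only reassigning $0 forces a re-split.

```shell
# The record was already split when it was read, so FS=":" alone changes nothing;
# assigning $0 to itself re-splits the record with the current FS.
echo 'x:1:2' | awk '{FS=":"; $0=$0; print $2}'
# prints 1
```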
Method 2:
dzy@dzy-virtual-machine:~$ ifconfig | awk '/inet / && !($2 ~ /^127/){print $2}'
192.168.126.136
Method 3:
# Select by paragraph; the default field separator is whitespace. Field order is not guaranteed, and multiple NICs cannot be handled this way.
dzy@dzy-virtual-machine:~$ ifconfig | awk 'BEGIN{RS=""}!/lo/{print $6}'
192.168.126.136
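On newer systems ifconfig may be absent or print a different layout; the same extraction also works against iproute2's `ip -4 -o addr show`, which emits one line per address. A sketch against canned sample output (the `ens33` line below is made up; on a live system pipe the real command instead):

```shell
# `ip -4 -o addr show` prints one line per address: "IDX IFNAME inet ADDR/PREFIX ..."
# Canned sample output for illustration; replace printf with the real command on a live box.
printf '1: lo    inet 127.0.0.1/8 scope host lo\n2: ens33    inet 192.168.126.136/24 brd 192.168.126.255 scope global ens33\n' |
awk '$2!="lo"{sub(/\/.*/,"",$4); print $4}'
# prints 192.168.126.136
```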
Deduplication based on a field
Remove lines with duplicate uid=xxx values.
2019-01-13_12:00_index?uid=123
2019-01-13_13:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
2019-01-14_12:00_index?uid=123
2019-01-14_13:00_index?uid=123
2019-01-15_14:00_index?uid=333
2019-01-16_15:00_index?uid=9710
To deduplicate by uid, split each line on "?" and use the uid=xxx part as an array key; that key is the basis for detecting duplicates. Then count occurrences of each uid: print the line on its first occurrence and skip subsequent ones.
dzy@dzy-virtual-machine:~$ awk -F"?" '!arr[$2]++{print}' 3.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
or
dzy@dzy-virtual-machine:~$ awk -F"?" '{arr[$2]=arr[$2]+1;if(arr[$2]==1){print}}' 3.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
or
dzy@dzy-virtual-machine:~$ awk -F"?" '{arr[$2]++;if(arr[$2]==1){print}}' 3.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
Statistics
Count how many times each line appears in the file below:
portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
status
status
mountd
mountd
mountd
mountd
mountd
mountd
nfs
nfs
nfs_acl
nfs
nfs
nfs_acl
nlockmgr
nlockmgr
nlockmgr
nlockmgr
nlockmgr
dzy@dzy-virtual-machine:~$ awk '{arr[$0]++}END{for(idx in arr){printf "%s\t%s\n",arr[idx],idx}}' 4.txt
6	portmapper
2	nfs_acl
5	nlockmgr
2	status
4	nfs
6	mountd
or
dzy@dzy-virtual-machine:~$ awk '{arr[$0]++}END{for(i in arr){print arr[i], i}}' 4.txt
6 portmapper
2 nfs_acl
5 nlockmgr
2 status
4 nfs
6 mountd
Count the number of non-200 status codes accessed by each IP in the statistics log
Log sample data:
111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169
Find the IPs whose requests returned a non-200 status code, and output the 10 IPs with the most such requests.
# Method 1
awk '$9!=200{arr[$1]++}END{for(i in arr){print arr[i],i}}' access.log | sort -k1nr | head -n 10
# Method 2
Instead of piping to sort, gawk can sort arrays itself (asort()), and the traversal order of for-in loops can be set via PROCINFO:
PROCINFO["sorted_in"]="@val_num_desc"
awk '
$9!=200{
arr[$1]++}
END{
PROCINFO["sorted_in"]="@val_num_desc";
for(i in arr){
# counter: stop after the top 10
if(cnt++==10){
exit}
print arr[i],i
}
}' access.log
Count unique IPs per URL
Format: url | access IP | access time | visitor
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest
Requirement: Count the number of unique access IPs for each URL (deduplicated), save one file per URL, and produce results similar to:
a.com.cn 2
b.com.cn 2
c.com.cn 1
And there are three corresponding files:
a.com.cn.txt
b.com.cn.txt
c.com.cn.txt
code:
dzy@dzy-virtual-machine:~$ cat 6.awk
BEGIN{
FS="|"
}
!arr[$1,$2]++{
arr1[$1]++
}
END{
for(i in arr1){
print i,arr1[i] >(i".txt")
}
}
result: one file per URL, each containing that URL and its unique-IP count.
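A self-contained run of the same logic (the heredoc recreates the sample data; file names follow the example above):

```shell
# Recreate the sample data, run the per-URL unique-IP counter, and inspect one output file.
cat > 6.txt <<'EOF'
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest
EOF
# arr[$1,$2] dedups (url, ip) pairs; arr1[$1] counts unique IPs per url.
awk 'BEGIN{FS="|"} !arr[$1,$2]++{arr1[$1]++} END{for(i in arr1) print i, arr1[i] > (i".txt")}' 6.txt
cat a.com.cn.txt   # a.com.cn 2
```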
Handle missing data for fields
ID name gender age email phone
1 Bob male 28 abc@qq.com 18023394012
2 Alice female 24 def@gmail.com 18084925203
3 Tony male 21 17048792503
4 Kevin male 21 bbb@189.com 17023929033
5 Alex male 18 ccc@xyz.com 18185904230
6 Andy female ddd@139.com 18923902352
7 Jerry female 25 exdsa@189.com 18785234906
8 Peter male 20 bax@qq.com 17729348758
9 Steven 23 bc@sohu.com 15947893212
10 Bruce female 27 bcbd@139.com 13942943905
When fields are missing, splitting directly with FS becomes very tricky. To handle this special need, gawk provides the FIELDWIDTHS variable, which divides fields by character width; the width specifications are separated by spaces.
dzy@dzy-virtual-machine:~$ awk '{print $0}' FIELDWIDTHS="2 2:6 2:6 2:3 2:13 2:11" 6.txt
ID name gender age email phone
1 Bob male 28 abc@qq.com 18023394012
2 Alice female 24 def@gmail.com 18084925203
3 Tony male 21 17048792503
4 Kevin male 21 bbb@189.com 17023929033
5 Alex male 18 ccc@xyz.com 18185904230
6 Andy female ddd@139.com 18923902352
7 Jerry female 25 exdsa@189.com 18785234906
8 Peter male 20 bax@qq.com 17729348758
9 Steven 23 bc@sohu.com 15947893212
10 Bruce female 27 bcbd@139.com 13942943905
In FIELDWIDTHS, the first specification is the character width of the ID column: 2 characters. The second field is at most 6 characters wide, but there are two spaces between it and the ID column, so you can either give it a width of 8 or skip two characters with 2:6.
Handle data with field separators in fields
Below is a line from a CSV file that separates fields with commas.
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
Requirement: Get the third field "1234 A Pretty Street, NE".
When a field itself contains the field separator, splitting directly with FS becomes very tricky. To handle this special need, gawk provides the FPAT variable. Rather than describing the separator, FPAT is a regex describing the fields themselves; each successful match is stored in $1, $2, $3, ... (much as grep highlights the matching part of a line).
echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA' |\
awk 'BEGIN{FPAT="[^,]+|\".*\""}{print $1,$3}'
Get the specified number of characters in the field
16 001agdcdafasd
16 002agdcxxxxxx
23 001adfadfahoh
23 002fsdadggggg
desired output:
16 001
16 002
23 001
23 002
awk string indexes start at 1, so substr($2,1,3) takes the first three characters:
awk '{print $1,substr($2,1,3)}'
Or split with FIELDWIDTHS:
awk 'BEGIN{FIELDWIDTHS="2 2:3"}{print $1,$2}' a.txt
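A quick run of the substr() one-liner on two of the sample lines:

```shell
# substr($2,1,3) keeps the first three characters of the second field.
printf '16 001agdcdafasd\n23 001adfadfahoh\n' | awk '{print $1, substr($2,1,3)}'
# prints:
# 16 001
# 23 001
```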
Filter logs within a given time range
When filtering logs with regular expressions in grep/sed/awk, it is very hard to be precise down to the hour, minute, and second. But awk provides the mktime() function, which converts a time into an epoch value.
# Convert 2019-11-10 03:42:40 into seconds since the epoch (1970-01-01 00:00:00)
$ awk 'BEGIN{print mktime("2019 11 10 03 42 40")}'
1573328560
This way the timestamp string can be extracted from each log line, its year, month, day, hour, minute, and second pulled out, and the pieces fed to mktime() to build the corresponding epoch value. Because epoch values are numbers, comparing them compares the times.
The strptime1() implementation below converts strings in the 2019-11-10T03:42:40+08:00 format into epoch values, which are then compared with which_time to filter the logs with one-second precision. patsplit() is used to pull the numbers out of the timestamp.
dzy@dzy-virtual-machine:~$ cat date.awk
BEGIN{
# build the epoch value of the time we want to filter on
which_time = mktime("2019 11 10 03 42 40")
}
{
# extract the date-time string from the log line
match($0,"^.*\\[(.*)\\].*",arr)
# convert the date-time string to an epoch value
tmp_time = strptime1(arr[1])
# compare times by comparing epoch values
if(tmp_time > which_time){
print
}
}
# time string format handled here: "2019-11-10T03:42:40+08:00"
function strptime1(str,  arr,Y,M,D,H,m,S) {
patsplit(str,arr,"[0-9]{1,4}")
Y=arr[1]
M=arr[2]
D=arr[3]
H=arr[4]
m=arr[5]
S=arr[6]
return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}
The strptime2() implementation below converts strings in the 10/Nov/2019:23:53:44+08:00 format into epoch values, which are again compared with which_time to filter the logs with one-second precision.
BEGIN{
# build the epoch value of the time we want to filter on
which_time = mktime("2019 11 10 03 42 40")
}
{
# extract the date-time string from the log line
match($0,"^.*\\[(.*)\\].*",arr)
# convert the date-time string to an epoch value
tmp_time = strptime2(arr[1])
# compare times by comparing epoch values
if(tmp_time > which_time){
print
}
}
# time string format handled here: "10/Nov/2019:23:53:44+08:00"
function strptime2(str,dt_str,arr,Y,M,D,H,m,S) {
dt_str = gensub("[/:+]"," ","g",str)
# dt_str = "10 Nov 2019 23 53 44 08 00"
split(dt_str,arr," ")
Y=arr[3]
M=mon_map(arr[2])
D=arr[1]
H=arr[4]
m=arr[5]
S=arr[6]
return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}
function mon_map(str,mons){
mons["Jan"]=1
mons["Feb"]=2
mons["Mar"]=3
mons["Apr"]=4
mons["May"]=5
mons["Jun"]=6
mons["Jul"]=7
mons["Aug"]=8
mons["Sep"]=9
mons["Oct"]=10
mons["Nov"]=11
mons["Dec"]=12
return mons[str]
}