An Introduction to sed and awk in the Shell

This article is reposted from: https://www.sharpcode.cn/linux/bash/sed-awk-fundmental/

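sed and awk are two powerful text-processing tools on Linux. sed stands for Stream Editor: it edits text line by line, for example inserting, deleting, changing, and searching lines. awk is mainly used to extract and format text for output. Their capabilities overlap in places, but the emphasis differs: sed focuses on editing, awk on formatting. The two are often used together, and combined with regular expressions they cover a very wide range of text-processing tasks.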

sed

Syntax:

sed script [input-filename]

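Here script is the script applied to input-filename. Each command in the script has two parts:

address  specifies the range of text to process
command  the operation to perform on that text

If no input-filename is given, sed reads from stdin. Note also that sed does not modify the original file: it works on a copy of each line and writes the result to stdout. The file itself changes only when the -i option is used.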
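The point about the original file can be checked with a small experiment. This is a minimal sketch, not part of the original article: the scratch directory and the file name pets.txt are made up for illustration, and the -i.bak form (in-place edit that keeps a backup with the given suffix) is assumed to be available, which is the case for GNU and BSD sed but is not required by POSIX.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'I have a cat\n' > "$workdir/pets.txt"

# Without -i, the edited text goes to stdout and the file is unchanged.
preview=$(sed 's/cat/hat/' "$workdir/pets.txt")
unchanged=$(cat "$workdir/pets.txt")

# With -i.bak, the file is rewritten and the old content is kept in pets.txt.bak.
sed -i.bak 's/cat/hat/' "$workdir/pets.txt"
rewritten=$(cat "$workdir/pets.txt")
backup=$(cat "$workdir/pets.txt.bak")

printf '%s\n' "$preview" "$unchanged" "$rewritten" "$backup"
rm -rf "$workdir"
```

Keeping a backup suffix is a cautious default, since an in-place edit cannot otherwise be undone.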

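Using commands

Commands are the core of sed. The available commands include s (substitute, that is, search and replace), a (append), i (insert), p (print), d (delete), and c (change). They are usually combined with an address.

The print command (p)

p is the simplest command. It prints the text, for example: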

[normal@centos7-server tmp]$ echo "Hello Sed" | sed 'p'
Hello Sed
Hello Sed

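Two lines are printed because sed prints every input line by default once it has finished processing it, so we see both that default output and the copy printed by the p command. To suppress the default output, use the -n option.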

[normal@centos7-server tmp]$ echo "Hello Sed" | sed -n 'p'
Hello Sed

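Usually we print selectively, for example a given number of lines or one specific line. This is done with an address. The simplest addresses are a single line number and a range of line numbers (a range address).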

[normal@centos7-server tmp]$ sed -n '1,3p' /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin

1,3p prints lines one through three. Sometimes we want the last line but do not know how many lines the file has. In that case, use the special address $:

[normal@centos7-server tmp]$ sed -n '$p' /etc/passwd
dhcpd:x:177:177:DHCP server:/:/sbin/nologin

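Besides p there are two other printing commands: = and l (lowercase L). = prints the current line number on a line of its own (the text that follows each number in the output below comes from sed's default output), while l prints the line in a form that makes non-printing characters visible.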

[normal@centos7-server tmp]$ sed '=' test.txt
1
This is line one
2
This is line two
3
This is line three
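Two small uses of = and l may make the difference clearer. This is an illustrative sketch with made-up input: combining -n with the $ address turns = into a line counter, and l shows a tab as \t and marks the end of the line with $.

```shell
# Count the lines of the input: -n suppresses the default output,
# and '$=' prints the line number of the last line only.
line_count=$(printf 'one\ntwo\nthree\n' | sed -n '$=')
printf '%s\n' "$line_count"

# Make a tab visible: l prints it as \t and ends the line with $.
visible=$(printf 'a\tb\n' | sed -n 'l')
printf '%s\n' "$visible"
```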

The insert (i) and append (a) commands

The insert command adds text before the addressed line, while the append command adds text after it.

[normal@centos7-server tmp]$ cat test.txt
this is line one
this is line two

Insert with the i command.

[normal@centos7-server tmp]$ sed '2i\this is insert line' test.txt
this is line one
this is insert line
this is line two

The append command is the counterpart of insert: it adds the text after the addressed line.

[normal@centos7-server tmp]$ sed '2a\this is insert line' test.txt
this is line one
this is line two
this is insert line
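The one-line form used above (2i\text) is accepted by GNU sed. The form defined by POSIX puts the text on the line after the backslash, and it also works in GNU sed. The following sketch uses made-up text and applies both commands to line 2 of a two-line input:

```shell
# i\ and a\ followed by a newline, then the text on its own line.
result=$(printf 'this is line one\nthis is line two\n' | sed '2i\
inserted before line two
2a\
appended after line two')

printf '%s\n' "$result"
```

The inserted text appears before line two and the appended text after it, whatever the order of the two commands in the script.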

The change command (c)

The change command replaces the matched lines with the text given in the command.

[normal@centos7-server tmp]$ sed '1c\Change line one' test.txt
Change line one
this is line two

As shown, the whole first line was replaced. Take care when c is applied to a range of lines:

[normal@centos7-server tmp]$ cat test.txt 
this is line one
this is line two
this is line three
this is line four
[normal@centos7-server tmp]$ sed '2,3c\two lines become one line' test.txt
this is line one
two lines become one line
this is line four

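The result is that a single line replaced both lines in the range.

The search-and-replace command (s)

This command is similar to find and replace in other programs.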

[normal@centos7-server tmp]$ cat test.txt 
I have a cat
[normal@centos7-server tmp]$ sed 's/cat/hat/' test.txt 
I have a hat

By default, the s command replaces only the first match on each line. To replace every match, add the g (global) flag.

[normal@centos7-server tmp]$ cat test.txt 
There are a black cat and a white cat
[normal@centos7-server tmp]$ sed 's/cat/hat/' test.txt
There are a black hat and a white cat

Without the global flag, only the first occurrence of cat is replaced.

[normal@centos7-server tmp]$ sed 's/cat/hat/g' test.txt
There are a black hat and a white hat

With the global flag, every cat is replaced with hat.
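Besides g, the s command also accepts a number as a flag, which replaces only that occurrence on each line. The sketch below reuses the sentence from the example and adds a made-up two-line input to show that "first match" is counted per line:

```shell
text='There are a black cat and a white cat'

# No flag: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/cat/hat/')

# Numeric flag 2: only the second match on the line is replaced.
second_only=$(printf '%s\n' "$text" | sed 's/cat/hat/2')

# g flag: every match on the line is replaced.
every_match=$(printf '%s\n' "$text" | sed 's/cat/hat/g')

# The first match is replaced on every line, not just on the first line.
per_line=$(printf 'cat cat\ncat cat\n' | sed 's/cat/hat/')

printf '%s\n' "$first_only" "$second_only" "$every_match" "$per_line"
```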

The usual delimiter of the s command is /. Because / also appears in file paths, this can be awkward in some cases.

sed 's/\/bin\/bash/\/bin\/zsh/' /etc/passwd

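The example above replaces /bin/bash with /bin/zsh in the output for /etc/passwd. Because / is the delimiter of s, every / in the paths has to be escaped, which makes the expression hard to read. Fortunately, sed accepts other delimiters: whatever character follows s is used (anything except a backslash or a newline), for example !.

sed 's!/bin/bash!/bin/zsh!' /etc/passwd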
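To see that the choice of delimiter does not change the result, the following sketch runs the same substitution three ways on a single sample line (the line is copied from the /etc/passwd output shown earlier, so the real file is not needed):

```shell
line='root:x:0:0:root:/root:/bin/bash'

# Same substitution with three different delimiters.
with_slash=$(printf '%s\n' "$line" | sed 's/\/bin\/bash/\/bin\/zsh/')
with_pipe=$(printf '%s\n' "$line" | sed 's|/bin/bash|/bin/zsh|')
with_hash=$(printf '%s\n' "$line" | sed 's#/bin/bash#/bin/zsh#')

printf '%s\n' "$with_slash" "$with_pipe" "$with_hash"
```

A delimiter that does not occur in the pattern or the replacement avoids escaping altogether.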

The delete command (d)

If no address is given, the d command deletes every line.

[normal@centos7-server tmp]$ cat test.txt
This is line one
This is line two
This is line three
[normal@centos7-server tmp]$ sed 'd' test.txt
[normal@centos7-server tmp]$

No address was given, so every line was deleted from the output.

[normal@centos7-server tmp]$ sed '2d' test.txt
This is line one
This is line three

The address restricts the deletion to the second line.
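The d command works with any kind of address, not only a single line number. The sketch below, using a made-up four-line input with one blank line, deletes blank lines with a regular expression, the last line with $, and a range of lines:

```shell
sample=$(printf 'This is line one\n\nThis is line two\nThis is line three\n')

# Delete empty lines (the pattern ^$ matches a line with no characters).
no_blank=$(printf '%s\n' "$sample" | sed '/^$/d')

# Delete the last line.
no_last=$(printf '%s\n' "$sample" | sed '$d')

# Delete lines 2 through 3.
no_range=$(printf '%s\n' "$sample" | sed '2,3d')

printf '%s\n' "$no_blank"
```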

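Multiple commands

sed accepts several commands at once, separated by semicolons (;). Note that the text of an a, i, or c command runs to the end of the line, so such a command must be the last one on its line.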

[normal@centos7-server tmp]$ cat test.txt
This is line one
This is line two
This is line three
[normal@centos7-server tmp]$ sed '1d;2a\This is the append line' test.txt
This is line two
This is the append line
This is line three

Besides semicolons, the commands can also be written on separate lines.

[normal@centos7-server tmp]$ sed '1d
> 2a\This is the append line' test.txt
This is line two
This is the append line
This is line three
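Two further ways of passing several commands are the -e option, which can be repeated, and the -f option, which reads the commands from a file. The sketch below is illustrative: the script file is a temporary file created only for this example.

```shell
input=$(printf 'This is line one\nThis is line two\nThis is line three\n')

# Each -e adds one command to the script.
with_e=$(printf '%s\n' "$input" | sed -e '1d' -e 's/two/2/')

# The same commands stored one per line in a script file and loaded with -f.
script_file=$(mktemp)
printf '1d\ns/two/2/\n' > "$script_file"
with_f=$(printf '%s\n' "$input" | sed -f "$script_file")
rm -f "$script_file"

printf '%s\n' "$with_e"
```

A script file tends to be easier to maintain once the command list grows beyond a few entries.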

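Using addresses

The most basic address (the range address) was introduced above. There are other kinds of address as well:

/REGEXP/        lines matching the regular expression
ADDR1,+N        ADDR1 and the N lines after it
ADDR1,~N        ADDR1 and the lines after it, up to the next line whose number is a multiple of N

Print the lines that contain a match: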

[normal@centos7-server tmp]$ cat test.txt 
This is line one
This is line two
This is line three
[normal@centos7-server tmp]$ sed -n '/one/p' test.txt
This is line one

A line matches if one appears anywhere in it.

Print a given line and the N lines after it:

[normal@centos7-server tmp]$ sed -n '1,+5p' /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync

Some of these addresses, such as ADDR1,+N and ADDR1,~N, are GNU sed extensions.
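The address forms can be compared side by side on the numbers 1 to 10. This sketch assumes GNU sed for the last two commands; the regular-expression range in the first command is portable.

```shell
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# Portable: from the first line matching 3 to the next line matching 5.
regex_range=$(printf '%s\n' "$numbers" | sed -n '/3/,/5/p')

# GNU extension: line 2 and the 2 lines after it.
plus_n=$(printf '%s\n' "$numbers" | sed -n '2,+2p')

# GNU extension: from line 5 up to the next line number that is a multiple of 4.
to_multiple=$(printf '%s\n' "$numbers" | sed -n '5,~4p')

printf '%s\n' "$regex_range" "$plus_n" "$to_multiple"
```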

Negation (!)

Placing an exclamation mark after an address negates it, for example:

[normal@centos7-server tmp]$ sed -n '2!p' test.txt
This is line one
This is line three

This prints every line except the second.

[normal@centos7-server tmp]$ sed -n '/one/!p' test.txt
This is line two
This is line three

This prints the lines that do not contain one.
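Negation is not limited to p; it can guard any command. The sketch below uses a made-up two-line configuration and changes the port only on lines that are not comments:

```shell
config=$(printf '# port = 80\nport = 80\n')

# Apply the substitution only to lines that do NOT start with #.
updated=$(printf '%s\n' "$config" | sed '/^#/!s/80/8080/')

printf '%s\n' "$updated"
```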

Reading data from a file (r)

[normal@centos7-server tmp]$ cat data1.txt 
The line from data1.txt
The line from data2.txt
[normal@centos7-server tmp]$ sed '2r data1.txt' test.txt
This is line one
This is line two
The line from data1.txt
The line from data2.txt
This is line three

This reads data1.txt and inserts its contents after line 2 of test.txt.

Writing data to a file (w)

[normal@centos7-server tmp]$ sed '/one/w one.txt' test.txt 
This is line one
This is line two
This is line three
[normal@centos7-server tmp]$ cat one.txt 
This is line one

This saves the lines containing one to one.txt.

awk

In most cases awk does not modify the data; it only produces formatted output. Its syntax is similar to that of sed:

awk program file

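program is the core of awk. The commands are written inside braces, and the whole program is usually wrapped in single quotes to protect it from the shell: '{ script }'.

The commands to run are written in program. As with sed, if no file is given, awk reads from stdin.

awk works in terms of rows and columns: each line is one record, and each record is split into columns (fields) by a separator (FS). By default, columns are separated by whitespace (spaces and tabs).

Basic usage of awk

In awk each line consists of one or more columns, which are available through the following variables:

$0        the whole line
$1        the first column
$2        the second column
$N        the Nth column

For example: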

[normal@centos7-server tmp]$ echo "hello world" | awk '{print $0}'
hello world

"hello world" contains a space, so awk recognizes two columns:

[normal@centos7-server tmp]$ echo "hello world" | awk '{print $1}'
hello
[normal@centos7-server tmp]$ echo "hello world" | awk '{print $2}'
world
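The column variables combine naturally with NF, the number of columns in the current line. The following sketch uses made-up input with irregular spacing to show that a run of spaces counts as a single separator, that $NF is the last column, and that columns can be printed in any order:

```shell
# Several spaces in a row still separate only two neighbouring columns.
field_count=$(printf 'hello   world  again\n' | awk '{ print NF }')

# $NF is the column whose number is NF, that is, the last one.
last_field=$(printf 'hello   world  again\n' | awk '{ print $NF }')

# Columns can be reordered; the comma inserts the output separator (a space).
swapped=$(printf 'hello world\n' | awk '{ print $2, $1 }')

printf '%s\n' "$field_count" "$last_field" "$swapped"
```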

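Built-in variables

When reading data, awk uses the following built-in variables to decide where rows and columns begin and end:

FS      input column (field) separator
RS      input row (record) separator
OFS     output column separator
ORS     output row separator

awk also reads data line by line, that is, the default value of RS is '\n'. We can change this value so that one record spans several lines.

Consider the file student.txt with the following content: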

Name:Jones
Sex:male
 
Name:Edwin
Sex:male
 
Name:Diana
Sex:female

Changing only OFS and leaving all other separators at their defaults, print the file with awk:

[normal@centos7-server tmp]$ awk 'BEGIN{OFS="--"} {print $1, $2}' student.txt 
Name:Jones--
Sex:male--
--
Name:Edwin--
Sex:male--
--
Name:Diana--
Sex:female--

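The default RS is '\n' and the default FS is a space, so there are 8 records in total (6 data lines plus the two blank lines), and each data line has only one column. What happens if RS is changed to two newlines? (For this to work, the separator lines in the file must be completely empty.)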

[normal@centos7-server tmp]$ cat student.txt 
Name:Jones
Sex:male
 
Name:Edwin
Sex:male
 
Name:Diana
Sex:female
[normal@centos7-server tmp]$ awk 'BEGIN{RS="\n\n"; OFS="--"} {print $1, $2}' student.txt 
Name:Jones--Sex:male
Name:Edwin--Sex:male
Name:Diana--Sex:female

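In other words, awk now treats two consecutive newlines as the end of a record, so there are only 3 records. A single '\n' left inside a record is treated as a column separator, because the default FS splits on newlines as well as on spaces and tabs; this no longer holds once FS is set to another value. Note that a multi-character RS such as "\n\n" is an extension supported by implementations like gawk and mawk, not by every awk.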
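A portable alternative is to set RS to the empty string, which POSIX defines as "paragraph mode": records are separated by one or more blank lines, and a newline always separates columns. The sketch below recreates the student data inline instead of reading student.txt, so the file itself is not needed.

```shell
students=$(printf 'Name:Jones\nSex:male\n\nName:Edwin\nSex:male\n\nName:Diana\nSex:female\n')

# RS="" selects paragraph mode: blank lines separate the records.
records=$(printf '%s\n' "$students" |
  awk 'BEGIN { RS = ""; OFS = "--" } { print $1, $2 }')

# NR in the END block holds the number of records that were read.
record_count=$(printf '%s\n' "$students" | awk 'BEGIN { RS = "" } END { print NR }')

printf '%s\n' "$records" "$record_count"
```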

The default ORS is also '\n'. It can be set to another value:

[normal@centos7-server tmp]$ awk 'BEGIN{RS="\n\n"; ORS="<-->"; OFS="--"}{print $1, $2, $3}' student.txt 
Name:Jones--Sex:male--<-->Name:Edwin--Sex:male--<-->Name:Diana--Sex:female--<-->
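One detail that often surprises: OFS is used only when print joins several arguments or when awk rebuilds the record. Printing an unmodified $0 keeps the original separators, while assigning to any column (even $1 = $1) rebuilds $0 with OFS. A small sketch with made-up input:

```shell
# $0 is printed as it was read, so OFS has no visible effect.
untouched=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { print $0 }')

# Assigning to a column rebuilds $0 using OFS.
rebuilt=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { $1 = $1; print $0 }')

# ORS replaces the newline after every record, including the last one.
joined=$(printf 'a\nb\nc\n' | awk 'BEGIN { ORS = "," } { print $0 }')

printf '%s\n' "$untouched" "$rebuilt" "$joined"
```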

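BEGIN and END

The BEGIN keyword marks commands that run before any input is read, and its counterpart END marks commands that run after all input has been processed. BEGIN already appeared in the examples above. Separators must be set before the other commands run, otherwise the result is wrong: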

[normal@centos7-server tmp]$ awk '/root/ {FS=":"; print NF}' /etc/passwd
1
7

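NF is a built-in variable holding the number of columns in the current line. The command above selects the lines containing root. Both outputs should be 7, but the first is 1, which is clearly wrong. The reason is that the first matching line had already been split into columns with the default FS by the time FS was assigned, so only one column was recognized. By the second matching line the new value ":" was in effect, so 7 was printed correctly.

Therefore, use BEGIN to change FS before any other command runs: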

[normal@centos7-server tmp]$ awk 'BEGIN{FS=":"} /root/ {print NF}' /etc/passwd
7
7
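The separator can also be set from the command line, which takes effect before the first line is read just like BEGIN. The sketch below uses the -F option and the equivalent -v FS=... assignment on three sample lines taken from the /etc/passwd output shown earlier in the article:

```shell
passwd_sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'operator:x:11:0:operator:/root:/sbin/nologin')

# -F sets FS before any input is read.
with_F=$(printf '%s\n' "$passwd_sample" | awk -F: '/root/ { print NF }')

# -v assigns a variable before the program starts; here it sets FS directly.
with_v=$(printf '%s\n' "$passwd_sample" | awk -v FS=: '/root/ { print NF }')

printf '%s\n' "$with_F" "$with_v"
```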

Using END

[normal@centos7-server tmp]$ awk 'BEGIN{FS=":"; print "start"} /root/ {print NF} END{print "finish"}' /etc/passwd
start
7
7
finish
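END is typically where totals are reported. In the following sketch the variable name count and the sample lines are chosen only for illustration: matching lines are counted as they are read, and the summary is printed once at the end using NR, the number of lines read so far.

```shell
passwd_sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'operator:x:11:0:operator:/root:/sbin/nologin')

summary=$(printf '%s\n' "$passwd_sample" | awk '
  BEGIN     { print "start" }
  /nologin/ { count++ }                 # runs once per matching line
  END       { print count " of " NR " lines mention nologin" }')

printf '%s\n' "$summary"
```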

Using regular expressions

Like sed, awk can use regular expressions.

[normal@centos7-server tmp]$ sed -n '/root/p' /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
[normal@centos7-server tmp]$ awk '/root/ {print $0}' /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin

As in sed, regular expressions in awk are delimited by /.
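Unlike sed, awk can also test a single column against a regular expression with the ~ operator (and !~ for "does not match"). The sketch below uses three sample lines in /etc/passwd format rather than the real file:

```shell
passwd_sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'operator:x:11:0:operator:/root:/sbin/nologin')

# /root/ alone is tested against the whole line, so operator matches too.
any_column=$(printf '%s\n' "$passwd_sample" |
  awk 'BEGIN { FS = ":" } /root/ { print $1 }')

# $1 ~ /^root$/ tests only the first column.
first_column=$(printf '%s\n' "$passwd_sample" |
  awk 'BEGIN { FS = ":" } $1 ~ /^root$/ { print $7 }')

# !~ selects the lines whose seventh column does not match.
login_users=$(printf '%s\n' "$passwd_sample" |
  awk 'BEGIN { FS = ":" } $7 !~ /nologin/ { print $1 }')

printf '%s\n' "$any_column" "$first_column" "$login_users"
```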