Learning Linux: data extraction

Disclaimer: This is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/qq_42415326/article/details/91125821

One, grep pattern-matching command

1 Basic Operation

The grep command prints the lines of text that match a pattern, where the pattern is given as a regular expression. grep supports three regular expression engines, selected by three parameters:

Parameter  Explanation
-E         POSIX extended regular expressions, ERE
-G         POSIX basic regular expressions, BRE
-P         Perl regular expressions, PCRE

Common grep parameters:

Parameter     Explanation
-a            Treat a binary file as text when matching
-c            Count the number of lines that match the pattern
-i            Ignore case
-n            Display the line number of each matching line
-v            Invert the match: output the lines that do not match
-r            Search directories recursively
-A n          n is a positive integer; A stands for "after": besides the matching line, also list the n lines after it
-B n          n is a positive integer; B stands for "before": besides the matching line, also list the n lines before it
--color=auto  Highlight the matched text in color when the output is a terminal

Note: Most distributions enable grep's color output by default; you can pass the --color parameter yourself or change the colors through the GREP_COLOR environment variable.
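
To see several of these parameters side by side, here is a minimal sketch. The sample text and the variable names are made up for illustration, and printf is used so the snippet behaves the same in bash and zsh.

```shell
# Hypothetical sample text, written to a temporary file
tmp=$(mktemp)
printf 'Linux\nlinux is fun\nwindows\nmacOS\nend\n' > "$tmp"

count=$(grep -ci 'linux' "$tmp")       # -c counts matching lines, -i ignores case
numbered=$(grep -n 'windows' "$tmp")   # -n prefixes the line number
inverted=$(grep -v 'i' "$tmp")         # -v keeps the lines that do NOT contain "i"
context=$(grep -A 1 'windows' "$tmp")  # -A 1 also prints one line after the match

printf '%s\n' "$count" "$numbered" "$inverted" "$context"
rm -f "$tmp"
```

With this sample, the count is 2, the numbered match is "3:windows", the inverted match keeps "macOS" and "end", and the context output is "windows" followed by "macOS".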

2 Using regular expressions

Basic regular expressions, BRE

  • Position

Find the lines in the /etc/group file that begin with "shiyanlou":

$ grep 'shiyanlou' /etc/group
$ grep '^shiyanlou' /etc/group

 

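  • Quantity
# Matches any string that starts with 'z' and ends with 'o'
$ echo 'zero\nzo\nzoo' | grep 'z.*o'
# Matches a string that starts with 'z', ends with 'o', and has exactly one arbitrary character in between
$ echo 'zero\nzo\nzoo' | grep 'z.o'
# Matches 'z' followed by any number of 'o' (including none)
$ echo 'zero\nzo\nzoo' | grep 'zo*'

Note: \n is the newline character. These examples rely on an echo that interprets backslash escapes, such as the zsh built-in; in bash, use echo -e or printf instead.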
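
The quantity patterns can also be tried in a form that does not depend on how echo treats \n. The sketch below uses printf, and adds the -o parameter (not covered in this article), which prints only the matched part of each line so the effect of each pattern is visible without color.

```shell
words=$(printf 'zero\nzo\nzoo\n')

# 'z.*o': a 'z', then anything, then an 'o'; all three lines match
any=$(printf '%s\n' "$words" | grep -c 'z.*o')

# 'z.o': exactly one character between 'z' and 'o'; only "zoo" matches
one=$(printf '%s\n' "$words" | grep 'z.o')

# 'zo*': 'z' followed by zero or more 'o'; -o shows what was actually matched
star=$(printf '%s\n' "$words" | grep -o 'zo*')

printf '%s\n' "$any" "$one" "$star"
```

Note that 'zo*' matches just "z" in "zero", because * also allows zero repetitions.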

  • Select
# grep is case-sensitive by default; this matches all lowercase letters
$ echo '1234\nabcd' | grep '[a-z]'
# Matches all digits
$ echo '1234\nabcd' | grep '[0-9]'
# Matches all digits
$ echo '1234\nabcd' | grep '[[:digit:]]'
# Matches all lowercase letters
$ echo '1234\nabcd' | grep '[[:lower:]]'
# Matches all uppercase letters
$ echo '1234\nabcd' | grep '[[:upper:]]'
# Matches all letters and digits, i.e. 0-9, a-z, A-Z
$ echo '1234\nabcd' | grep '[[:alnum:]]'
# Matches all letters
$ echo '1234\nabcd' | grep '[[:alpha:]]'

 

The complete list of special symbols and their meanings:

Special symbol  Explanation
[:alnum:]       English uppercase and lowercase letters and digits, i.e. 0-9, A-Z, a-z
[:alpha:]       Any English uppercase or lowercase letter, i.e. A-Z, a-z
[:blank:]       The space and [Tab] characters
[:cntrl:]       The control characters on the keyboard, including CR, LF, Tab, Del, etc.
[:digit:]       Digits only, i.e. 0-9
[:graph:]       All characters except whitespace (the space and [Tab] characters)
[:lower:]       Lowercase letters, i.e. a-z
[:print:]       Any character that can be printed
[:punct:]       Punctuation symbols, such as " ' ; : ? ! # $ ...
[:upper:]       Uppercase letters, i.e. A-Z
[:space:]       Any character that produces whitespace, including space, [Tab], CR, etc.
[:xdigit:]      Hexadecimal digits, i.e. 0-9, A-F, a-f

Note: The special symbols exist because [a-z] does not work in every case. Its meaning depends on the host's current locale, which is set by the LANG environment variable. With zh_CN.UTF-8, [a-z] covers exactly the lowercase letters, but other locales may order letters with the cases interleaved, as in "a A b B ... z Z", so that [a-z] can include uppercase letters. When you use [a-z], check how the current locale affects it; [[:lower:]] does not have this problem.
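
One way to make the behavior predictable is sketched below. It assumes the standard LC_ALL environment variable (not mentioned in this article), which overrides LANG for a single command; the sample text is made up.

```shell
text=$(printf '1234\nabcd\nABCD\n')

# The character class means "lowercase letters" whatever the locale is
lower=$(printf '%s\n' "$text" | grep '[[:lower:]]')

# Forcing the C locale makes [a-z] mean exactly the ASCII lowercase letters
range=$(printf '%s\n' "$text" | LC_ALL=C grep '[a-z]')

printf '%s\n' "$lower" "$range"
```

Both commands print only "abcd" here; without LC_ALL=C the second one depends on the locale in effect.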

# Exclude a character: matches any character other than 'o'
$ echo 'geek\ngood' | grep '[^o]'

Note: When ^ is placed inside brackets it excludes the listed characters; otherwise it marks the beginning of the line.
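
The two meanings of ^ can be compared directly. This is a small sketch with made-up sample lines; note that '[^o]' selects every line that contains at least one character other than 'o', not only lines without any 'o'.

```shell
lines=$(printf 'geek\ngood\nooo\n')

# ^ outside brackets: anchor to the beginning of the line
starts_with_g=$(printf '%s\n' "$lines" | grep '^g')

# ^ inside brackets: any single character except 'o'
has_non_o=$(printf '%s\n' "$lines" | grep '[^o]')

# Both at once: the first character of the line is not 'g'
not_starting_g=$(printf '%s\n' "$lines" | grep '^[^g]')

printf '%s\n' "$starts_with_g" "$has_non_o" "$not_starting_g"
```

Only "ooo" is rejected by '[^o]', because it consists of nothing but 'o'.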

Using extended regular expressions, ERE

To use extended regular expressions with grep, add the -E parameter or use egrep.

  • Quantity
# Matches only "zo"
$ echo 'zero\nzo\nzoo' | grep -E 'zo{1}'
# Matches "z" followed by one or more "o", i.e. all words that begin with "zo"
$ echo 'zero\nzo\nzoo' | grep -E 'zo{1,}'

Note: Mastering {n,m} is recommended; +, ? and * are less intuitive and easy to confuse.
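
For readers who do meet +, ? and *, each can be read as a shorthand for a brace form. The sketch below checks that correspondence; it adds the -x parameter (not covered in this article), which requires the whole line to match, and the word list is made up.

```shell
words=$(printf 'z\nzo\nzoo\nzooo\n')

plus=$(printf '%s\n' "$words" | grep -xE 'zo+')            # one or more 'o'
plus_braces=$(printf '%s\n' "$words" | grep -xE 'zo{1,}')  # same thing with braces

opt=$(printf '%s\n' "$words" | grep -xE 'zo?')             # zero or one 'o'
opt_braces=$(printf '%s\n' "$words" | grep -xE 'zo{0,1}')  # same thing with braces

bounded=$(printf '%s\n' "$words" | grep -xE 'zo{2,3}')     # two or three 'o'

printf '%s\n' "$plus" "$opt" "$bounded"
```

In the same way, * corresponds to {0,}.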

  • Select
# Matches "www.shiyanlou.com" and "www.google.com"
$ echo 'www.shiyanlou.com\nwww.baidu.com\nwww.google.com' | grep -E 'www\.(shiyanlou|google)\.com'
# Or match the lines that do not contain "www.baidu.com"
$ echo 'www.shiyanlou.com\nwww.baidu.com\nwww.google.com' | grep -Ev 'www\.baidu\.com'

Note: Because . has a special meaning (it matches any character), it needs to be escaped.
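
What goes wrong without the escape can be shown with a deliberately malformed host name. The second input line below is invented for the demonstration.

```shell
hosts=$(printf 'www.baidu.com\nwwwxbaiduxcom\n')

# Unescaped: each "." matches any character, so both lines match
unescaped=$(printf '%s\n' "$hosts" | grep -cE 'www.baidu.com')

# Escaped: "\." matches only a literal dot, so one line matches
escaped=$(printf '%s\n' "$hosts" | grep -cE 'www\.baidu\.com')

printf '%s\n' "$unescaped" "$escaped"
```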

 

Two, sed stream editor

In the Linux/UNIX world, tools that dare to call themselves editors are mostly no ordinary tools, such as the earlier "vi/vim (the god of editors)", "emacs (the editor of the gods)" and "gedit". The biggest difference between sed and those editors is that sed is non-interactive. Let us begin with an introduction to sed.

1 Introduction to common sed parameters

Basic format of the sed command:

sed [options]... [command] [input-file]...
# For example:
$ sed -i 's/sad/happy/' test # replaces "sad" with "happy" in the file test

Parameter    Explanation
-n           Quiet mode: print only the affected lines; by default the entire input is printed
-e           Add several commands to the script to be executed in one run; usually not needed when a single command is given on the command line
-f filename  Execute the commands in the file filename
-r           Use extended regular expressions; the default is basic regular expressions
-i           Modify the input file directly instead of printing to standard output
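
The effect of these parameters can be checked on a couple of lines of throwaway text. This is a sketch assuming GNU sed (on BSD/macOS sed, -i needs a suffix argument); the sample sentences are made up.

```shell
text=$(printf 'I am sad\nso sad, very sad\n')

# Default: every input line is printed, only the first match per line is replaced
first_only=$(printf '%s\n' "$text" | sed 's/sad/happy/')

# -n with the p flag: print only the lines where a substitution happened
quiet=$(printf '%s\n' "$text" | sed -n 's/very sad/glad/p')

# -e: chain several commands in one run
multi=$(printf '%s\n' "$text" | sed -e 's/sad/happy/' -e 's/^I am/We are/')

# -i: edit a (temporary) file in place instead of printing
tmp=$(mktemp)
printf '%s\n' "$text" > "$tmp"
sed -i 's/sad/happy/g' "$tmp"
in_place=$(cat "$tmp")
rm -f "$tmp"

printf '%s\n' "$first_only" "$quiet" "$multi" "$in_place"
```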

2 sed editing commands

Format of a sed command:

[n1][,n2]command
[n1][~step]command
# Some commands accept a scope suffix, for example:
$ sed -i 's/sad/happy/g' test # g means global: every match on the line
$ sed -i 's/sad/happy/4' test # 4 means the fourth match on each line

Here n1 and n2 are line numbers of the input. The comma means lines n1 through n2; the tilde means every step-th line starting from n1. command is the action to perform. Some common commands:

Command  Explanation
s        Substitute within a line
c        Replace the whole line
a        Append after the specified line
i        Insert before the specified line
p        Print the specified line, generally used together with the -n parameter
d        Delete the specified line
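
The addressing forms and the s suffixes can be tried on generated input. The sketch below assumes GNU sed (the n1~step address is a GNU extension) and uses seq from coreutils to produce the numbers 1 to 10.

```shell
# n1,n2: lines 2 through 4
range=$(seq 10 | sed -n '2,4p')

# n1~step: line 1, then every 3rd line after it
stepped=$(seq 10 | sed -n '1~3p')

# s with the g suffix replaces every match on the line
global=$(echo 'a-a-a-a' | sed 's/a/X/g')

# s with a number replaces only that occurrence on the line
fourth=$(echo 'a-a-a-a' | sed 's/a/X/4')

printf '%s\n' "$range" "$stepped" "$global" "$fourth"
```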

 

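3 sed operation examples

First, copy a text file to practice on:

$ cp /etc/passwd ~

Print the specified lines

# Print lines 2-5
$ nl passwd | sed -n '2,5p'
# Print the odd-numbered lines
$ nl passwd | sed -n '1~2p'

Inline replacement

# Globally replace "shiyanlou" with "hehe" in the input text and print only the lines that were changed; note that the trailing "p" command cannot be omitted here
$ sed -n 's/shiyanlou/hehe/gp' passwd

Note: Inline replacement can be combined with regular expressions.

Replacing whole lines

# Find the line number of the "shiyanlou" entry
$ nl passwd | grep "shiyanlou"
# Replace line 21 with "www.shiyanlou.com"
$ sed -n '21c\www.shiyanlou.com' passwd
(Here only the replacement line is printed and the file itself is not changed; to actually modify the file, use the -i parameter.)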
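
The remaining commands from the table (d, a, i) work the same way as c. The following sketch assumes GNU sed, which accepts the one-line form with the text right after the backslash, and uses three made-up lines instead of the passwd copy.

```shell
lines=$(printf 'one\ntwo\nthree\n')

deleted=$(printf '%s\n' "$lines" | sed '2d')               # drop line 2
changed=$(printf '%s\n' "$lines" | sed '2c\TWO')           # replace line 2
appended=$(printf '%s\n' "$lines" | sed '2a\after two')    # add a line after line 2
inserted=$(printf '%s\n' "$lines" | sed '2i\before two')   # add a line before line 2

printf '%s\n' "$deleted" "$changed" "$appended" "$inserted"
```

None of these touch a file unless -i is given, so they are safe to experiment with.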

Three, awk text processing language

AWK is a programming language designed for text processing.

Note: Above all, when learning and using awk, try to understand it as a programming language.

1 Some basic concepts of awk

All awk operations are built from pattern-action pairs, in the following form:

$ pattern {action}

The action is enclosed in a pair of {} braces. The pattern is usually a "relational expression" or a "regular expression" that is matched against the input text, and the action is the operation performed after a match. In a complete awk operation either of the two may be omitted: without a pattern, all of the input text matches by default; without an action, the default is to print the matched content to the screen.

awk processes text by splitting it into "fields" and then working on those fields. By default awk separates fields on spaces, but this is not fixed; you can specify any separator.
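
The three combinations (pattern only, action only, both) can be sketched as follows. The name-and-age data is invented for the example.

```shell
data=$(printf 'alice 30\nbob 25\ncarol 41\n')

# Pattern only: the default action prints the matching records
pattern_only=$(printf '%s\n' "$data" | awk '$2 > 28')

# Action only: with no pattern, every record matches
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

# Pattern and action: a regular expression selects, the action prints field 2
both=$(printf '%s\n' "$data" | awk '/^b/ { print $2 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```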

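3 Basic format of the awk command

awk [-F fs] [-v var=value] [-f prog-file | 'program text'] [file...]

The -F parameter pre-specifies the field separator mentioned above (there are other ways to specify it as well), -v pre-assigns a variable for the awk program, and -f names the program file that awk should execute; without -f, the program text is written directly at that position. Last comes the text input for awk to process, and several text files can be given at once.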
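
A minimal sketch of the three parameters follows. The passwd-style records, the variable name min and the temporary program file are all made up for illustration.

```shell
passwd_like=$(printf 'root:x:0:0\nshiyanlou:x:5000:5000\n')

# -F: split fields on ":" instead of spaces
names=$(printf '%s\n' "$passwd_like" | awk -F: '{ print $1 }')

# -v: hand a value to the program as the awk variable "min"
big=$(printf '%s\n' "$passwd_like" | awk -F: -v min=1000 '$3 >= min { print $1 }')

# -f: read the program text from a file
prog=$(mktemp)
printf '{ print $1 "=" $3 }\n' > "$prog"
from_file=$(printf '%s\n' "$passwd_like" | awk -F: -f "$prog")
rm -f "$prog"

printf '%s\n' "$names" "$big" "$from_file"
```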

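4 Trying out awk

First create a new text file with vim:

$ vim test

It contains the following:

I like linux
www.shiyanlou.com

  • Use awk to print the text content to the terminal
# The leading ">" is the shell's continuation prompt (zsh shows "quote>"); do not type it
$ awk '{
> print
> }' test
# Or written on one line
$ awk '{print}' test

Note: In this operation the pattern is omitted, so awk matches the entire input text by default and then performs the action inside the {} braces, i.e. print outputs every match, which here is the whole text.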

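  • Display each field of the first line of test on its own line
$ awk '{
> if(NR==1){
> print $1 "\n" $2 "\n" $3
> } else {
> print}
> }' test

# Or
$ awk '{
> if(NR==1){
> OFS="\n"
> print $1, $2, $3
> } else {
> print}
> }' test

Note: First, notice the if branch statement of the awk language used here; it works much as it does in high-level languages such as C/C++. Also notice NR and OFS, two awk built-in variables: NR is the number of records read so far, which you can simply think of as the current line number, and OFS is the output field separator, " " (a space) by default. As shown above, we set the output field separator to the \n newline, so the fields of the first line, originally separated by spaces, are each output on a separate line. Then there is $N, where N is a field number; it is also an awk built-in variable and refers to the corresponding field. Because the first line here has only three fields, we only reference up to $3. One more that does not appear here is $0, which refers to the entire content of the current record (the current line).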

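  • Convert the second line of test from dot-separated fields to tab-separated fields
$ awk -F'.' '{
> if(NR==2){
> print $1 "\t" $2 "\t" $3
> }}' test

# Or
$ awk '
> BEGIN{
> FS="."
> OFS="\t"  # If written on one line, the two statements must be separated by ";"
> }{
> if(NR==2){
> print $1, $2, $3
> }}' test

Note: The -F parameter here pre-specifies the field separator of the records to be processed. Notice that besides setting OFS, we can also print special characters such as \t directly in the print statement; any non-variable content that print outputs must be enclosed in a pair of "" quotes. The other version above shows another way to pre-set the separator variables, namely BEGIN: this pattern indicates that its action is executed before all other actions, before any input is read. Here FS is assigned the "." dot in place of the default " " space.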
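
The BEGIN version can also be written on one line, with the two assignments separated by ";". The sketch below feeds the two sample lines through printf rather than from the test file, and uses the pattern NR == 2 in place of the if statement; both choices are mine, not part of the original example.

```shell
out=$(printf 'I like linux\nwww.shiyanlou.com\n' |
  awk 'BEGIN { FS = "."; OFS = "\t" } NR == 2 { print $1, $2, $3 }')

printf '%s\n' "$out"
```

The result is the three parts www, shiyanlou and com separated by tabs.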

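5 Common awk built-in variables

Variable  Explanation
FILENAME  Name of the current input file; with several files it changes as each file is read. It is an empty string if the input comes from standard input
$0        Content of the current record
$N        N is a field number; its maximum is the value of the NF variable
FS        Field separator, expressed as a regular expression; the default is " " (a space)
RS        Input record separator; the default is "\n", i.e. one line is one record
NF        Number of fields in the current record
NR        Number of records read so far
FNR       Record number within the current input file; note how it differs from NR
OFS       Output field separator; the default is " " (a space)
ORS       Output record separator; the default is "\n"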
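
The difference between NR and FNR, and the behavior of FILENAME and NF, shows up as soon as awk reads more than one file. This sketch creates two small hypothetical files, one.txt and two.txt, in a temporary directory.

```shell
dir=$(mktemp -d)
printf 'a b c\nd e\n' > "$dir/one.txt"
printf 'f\n' > "$dir/two.txt"

# For every record: file name, overall record number, per-file record number, field count
report=$(cd "$dir" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
rm -rf "$dir"

printf '%s\n' "$report"
```

NR keeps counting across files while FNR restarts at 1 for two.txt, and FILENAME follows the file currently being read.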

 

 
