On Linux, many text tools use regular expressions, and regular expressions can greatly simplify system administration. There are plenty of good tutorials online, so I will not cover the basics here; I worked through a beginner regex tutorial in an afternoon, and apart from positive and negative lookahead/lookbehind, which are relatively complex, everything else is simple. In practice you do not need to memorize the syntax — you can always look it up. This post focuses mainly on how awk and the other text-processing tools use regular expressions.
1. awk
awk programs must be enclosed in single quotes.

Basic syntax:

awk -F 'input delimiter' 'BEGIN {initialization} /filter condition/ {action for each line} END {final cleanup}'

There can be multiple pattern-action blocks in the middle; every input line is run through each pattern-action block once, while the BEGIN and END blocks each execute only once.
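A minimal sketch of how often each block runs (the sample file here is invented):

```shell
# BEGIN and END each run exactly once; the middle block runs once per line.
printf 'a\nb\nc\n' > /tmp/awk_demo.txt
awk 'BEGIN{print "start"} {print "line", NR} END{print "end"}' /tmp/awk_demo.txt
# start
# line 1
# line 2
# line 3
# end
```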
Filtering records
- awk '$3==0 && $6=="LISTEN" ' netstat.txt
- awk '$3==0 && $6=="LISTEN" || NR==1 ' netstat.txt
Specifying the input delimiter

awk -F: '{print $1,$3,$6}' /etc/passwd

is equivalent to awk 'BEGIN{FS=":"} {print $1,$3,$6}' /etc/passwd

To specify multiple delimiters: awk -F '[;:]'

Specifying the output delimiter

awk -F: '{print $1,$3,$6}' OFS="\t" /etc/passwd

Note that in print $1,$3,$6 each , is replaced by the output separator; with print $1$3$6 there is no delimiter between the fields.
Special variables:
- NR line number of the current record, counted across all input files
- NF number of fields in the current record
- FNR line number within the current file (when processing multiple files, NR keeps accumulating across files, while FNR restarts from 1 at each new file)
- FILENAME name of the current input file
- $0 the entire current line
- FS input field separator, whitespace (space or Tab) by default
- RS input record separator, newline by default
- OFS output field separator, a single space by default
- ORS output record separator, newline by default
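A quick sketch of NR versus FNR and friends, using two throwaway files:

```shell
# NR counts lines across all files; FNR restarts at 1 for each file.
printf 'a b\nc d\n' > /tmp/f1.txt
printf 'e f\n'      > /tmp/f2.txt
awk '{print FILENAME, NR, FNR, NF}' /tmp/f1.txt /tmp/f2.txt
# /tmp/f1.txt 1 1 2
# /tmp/f1.txt 2 2 2
# /tmp/f2.txt 3 1 2
```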
Regular expression matching
- Ordinary match:
awk '/hello/ {print}' test.sh
- Inverted match:
awk '!/hello/ {print}' test.sh
- Match both patterns:
awk '/hello/ && /world/ {print}' test.sh
- Match either pattern:
awk '/hello/ || /world/ {print}' test.sh
which can also be written as awk '/hello|world/ {print}' test.sh
- Match against a specific column:
awk '$5 ~ /hello/ {print}' test.sh
- Inverted match against a specific column:
awk '$5 !~ /hello/ {print}' test.sh
Output to different files

$ awk 'NR!=1{if($6 ~ /TIME|ESTABLISHED/) print > "1.txt"; else if($6 ~ /LISTEN/) print > "2.txt"; else print > "3.txt" }' netstat.txt
awk 'NR!=1{print > $6}' netstat.txt

This simply uses > redirection inside awk; the first example combines it with an if statement.

- Aggregating data:
awk 'NR!=1{a[$6]++;} END {for (i in a) print i ", " a[i];}' netstat.txt
- Selecting a range of lines, delimited by a start pattern and an end pattern: awk '/test1/,/test2/ {print}' test.txt
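A minimal sketch of the range pattern (the sample input is invented): lines are printed from the first line matching the start pattern through the next line matching the end pattern.

```shell
printf 'x\nstart\nmid\nend\ny\n' | awk '/start/,/end/ {print}'
# start
# mid
# end
```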
Interacting with shell environment variables
$ x=5
$ y=10
$ export y
$ echo $x $y
5 10
$ awk -v val=$x '{print $1, $2, $3, $4+val, $5+ENVIRON["y"]}' OFS="\t" score.txt
Marry 2143 78 89 87
Jack 2321 66 83 55
Tom 2122 48 82 81
Mike 2537 87 102 105
Bob 2415 40 62 72
2. grep
Parameter list:
- -w match whole words only
- -s suppress error messages about nonexistent or unreadable files
- -l list only the names of files that contain a match
- -L list only the names of files that contain no match
- -A show a number of lines of context after each match, e.g. -A 1 for one line after
- -B show a number of lines of context before each match, e.g. -B 1 for one line before
- -number show that many lines of context both before and after each match, e.g. -1
- -n print line numbers
- -c print only a count of matching lines
- -v invert the match
- -o print only the part of the line that matched
- -E use EREs (extended regular expressions)
- -P use PCREs (Perl-compatible regular expressions)
grep is mainly about regular expressions, and it is worth noting that there are three flavors: BREs, EREs and PCREs. The first two do not support non-greedy matching. grep defaults to BREs, so characters such as ?, +, |, {, }, (, ) must be escaped with \ to act as operators, and BREs do not support shorthand classes such as \s, \S, \d, \D.
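A small sketch of the escaping difference, using the POSIX interval operator (the sample file is invented):

```shell
printf 'book\nbok\n' > /tmp/grep_demo.txt
# BRE: the braces must be escaped to act as an interval operator.
grep 'o\{2\}' /tmp/grep_demo.txt      # prints: book
# ERE: the same pattern, no backslashes needed.
grep -E 'o{2}' /tmp/grep_demo.txt     # prints: book
```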
3. sed
The sed command comes up constantly when writing scripts and automating tasks.

Basic syntax: sed [-nefi] '[address]action'

Address: a line range n1,n2 or a /regex/ pattern.

Actions:
- d: delete
- s: substitute; replaces a string within matching lines, e.g. replacing hello with hi turns "hello world hello" into "hi world hi"
- a and i: append and insert; a adds a line after the matching line, i adds a line before it
- c: change; replaces the entire matching line

Examples:

sed -e 's/hello/hi/g': replace text; the -e may be omitted

sed -e '1,2s/hello/hi/g' -e '3,4s/world/man/g' is equivalent to sed '1,2s/hello/hi/g;3,4s/world/man/g'

sed 's/hello \(world\)/\1 hi/g': grouping; \1 through \9 refer back to the groups captured earlier in the pattern
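Two runnable one-liners for the actions above (input invented):

```shell
# Backreference: \1 expands to whatever the first \( \) group matched.
printf 'hello world\n' | sed 's/hello \(world\)/\1 hi/'
# world hi

# Line addressing: delete only line 2.
printf 'one\ntwo\nthree\n' | sed '2d'
# one
# three
```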
4. sort and uniq
sort parameters
- -r: reverse the order; the default is ascending
- -u: remove duplicates
- -o: write the result to a file; note that sort test.txt > test.txt does not work, because > truncates the file before sort reads it, wiping out its contents — use sort -o test.txt test.txt instead
- -n: sort numerically; the default is character order, where e.g. 10 sorts before 2
- -t: specify the delimiter
- -k: specify which column to sort on
- -b: ignore leading whitespace on each line
example:
sort -t $'\t' -k 1 -u res.txt > res2.txt
Use Tab as the delimiter, sort on the first column, and remove duplicates.
uniq parameters

Note that uniq requires its input to be sorted, so it usually appears after sort in a pipeline.
- -c: prefix each line with its number of occurrences
- -d: only print duplicated lines
- -u: only print lines that occur exactly once
A word on sort | uniq versus sort -u: I always found it odd that both exist, since they do the same thing. sort -u was added later, so many people still write sort | uniq, but sort -u is now the recommended form because it avoids the extra process and its inter-process communication.
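A quick check that the two forms agree, and what uniq -c adds on top (sample data invented):

```shell
printf 'b\na\nb\na\n' > /tmp/sort_demo.txt
sort /tmp/sort_demo.txt | uniq    # a b
sort -u /tmp/sort_demo.txt        # same result, one process fewer
sort /tmp/sort_demo.txt | uniq -c # adds occurrence counts, which -u cannot do
```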
5. Practical examples
Given the following file, extract the domain from each line, count the occurrences, and sort by the counts:

http://www.baidu.com/index.html
http://www.baidu.com/1.html
http://post.baidu.com/index.html
http://mp3.baidu.com/index.html
http://www.baidu.com/3.html
http://post.baidu.com/2.html

to obtain this result:

3 www.baidu.com
2 post.baidu.com
1 mp3.baidu.com
Solution 1: grep -Po '(?<=//)(.*?)(?=/)' test.txt | sort | uniq -c | sort -nr

This 1. uses PCREs, which support non-greedy matching, 2. uses lookbehind and lookahead ((?<=...) behind, (?=...) ahead), and 3. uses the -o parameter to output only the matched part.
Solution 2: awk -F/ '{print $3}' test.txt | sort | uniq -c | sort -nr

Split on / and take the third field directly.
Solution 3: sed 's/http:\/\/\([^/]*\).*/\1/' test.txt | sort | uniq -c | sort -nr

In basic regular expressions the grouping parentheses must be escaped; with the -r parameter (extended regular expressions) they would not be.
Solution 4: sed -e 's/http:\/\///' -e 's/\/.*//' test.txt | sort | uniq -c | sort -rn

Two substitutions: the first strips the leading http://, the second strips everything from the next / onward.
awk example

Note that awk does not support true multidimensional arrays; it emulates them with compound subscripts, which is flexible enough for ordinary use, but awkward when you really need a nested map. In the following task, each input line carries a deal id, a day, an od id and the metrics sum, up and lj, and we want the cumulative sum, up and lj (plus the day) for each (deal, od) pair — conceptually a nested structure that awk cannot express directly:
{
updealids:{
od: {day,sum,up,lj}
}
}
awk 'BEGIN{OFS="\t"}{result[$1,$3,"sum"]+=$4;result[$1,$3,"up"]+=$5;result[$1,$3,"lj"]+=$6;result[$1,$3,"day"]=$2}\
END{for ( i in result) {split(i, a, SUBSEP); print result[i] ,a[1], a[2], a[3] }}' *
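How the compound keys work can be seen in a tiny sketch (data invented): awk joins the comma-separated subscripts in a[i,j] with SUBSEP (by default "\034"), and split(k, p, SUBSEP) recovers the parts in the END block.

```shell
printf '1 2\n1 3\n' |
awk '{a[$1,"x"] += $2}
     END {for (k in a) {split(k, p, SUBSEP); print p[1], p[2], a[k]}}'
# 1 x 5
```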