Shell text processing - Regular expressions

Types of regular expressions

Regular expressions are implemented by a regular expression engine. The engine is low-level software that interprets regular expression patterns and uses them to match text.
On Linux, there are two popular regular expression engines:

  • The POSIX Basic Regular Expression (BRE) engine
  • The POSIX Extended Regular Expression (ERE) engine

The sed editor implements only a subset of the BRE specification, so that it can process text data streams as quickly as possible.
The gawk tool uses the ERE specification to handle its regular expressions.
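
One visible difference between the two engines is how they treat the plus sign. The following is a minimal sketch, not taken from the original article: it assumes `grep -E` is available as a stand-in for an ERE tool, and the sample strings are made up for illustration.

```shell
# BRE (sed default): '+' is an ordinary character, so this matches the literal text "a+b".
bre_literal=$(echo "a+b" | sed -n '/a+b/p')
# BRE: "aab" contains no literal '+', so nothing is printed.
bre_repeat=$(echo "aab" | sed -n '/a+b/p')
# ERE (grep -E): '+' means "one or more of the preceding character".
ere_repeat=$(echo "aab" | grep -E 'a+b')

printf 'BRE on a+b: [%s]\n' "$bre_literal"
printf 'BRE on aab: [%s]\n' "$bre_repeat"
printf 'ERE on aab: [%s]\n' "$ere_repeat"
```

The same pattern text can therefore mean different things depending on which engine reads it.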

Plain Text

Plain text patterns are straightforward. Just keep two points in mind:
1. Regular expression patterns are case sensitive.
2. A pattern does not have to be a complete word; it matches wherever the text appears, including inside a longer word.
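
Both points can be checked with a short sketch like the one below. The sample sentences are assumptions chosen for illustration; they do not come from the article's data files.

```shell
# Case matters: the lowercase pattern "this" does not match "This".
lower=$(echo "This is a test" | sed -n '/this/p')
# Partial words match: "ook" is found inside "books".
partial=$(echo "The books are expensive" | sed -n '/ook/p')
# Spaces are ordinary characters: the pattern below needs two consecutive spaces.
spaces=$(echo "This line has  two spaces" | sed -n '/  /p')

printf 'lower:   [%s]\n' "$lower"
printf 'partial: [%s]\n' "$partial"
printf 'spaces:  [%s]\n' "$spaces"
```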

Special characters

Regular expressions treat the following characters as special:
. ? * [] {} ^ $ + | () \
To use a special character as a literal text character, escape it with a backslash (\).

➜  Charpter20 git:(master)sed -n '/\$/p' data2
The cost is $4.00
➜  Charpter20 git:(master)cat data2
The cost is $4.00
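
The same escaping rule applies to the other special characters. Here is a small sketch with made-up input strings. It also escapes the forward slash, which is not a regular expression special character but is the delimiter sed uses around the pattern.

```shell
# An unescaped dot matches any character, so "3x14" is accepted.
unescaped=$(echo "3x14" | sed -n '/3.14/p')
# An escaped dot matches only a literal period, so "3x14" is rejected.
escaped=$(echo "3x14" | sed -n '/3\.14/p')
# The forward slash delimits the sed pattern, so a literal slash must be escaped.
slash=$(echo "3 / 2" | sed -n '/\//p')
# A literal backslash is matched by a doubled backslash.
backslash=$(printf '%s\n' 'a \ b' | sed -n '/\\/p')

printf 'unescaped: [%s]\n' "$unescaped"
printf 'escaped:   [%s]\n' "$escaped"
printf 'slash:     [%s]\n' "$slash"
printf 'backslash: [%s]\n' "$backslash"
```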

Anchor characters

Two special characters anchor a pattern to the beginning or the end of a line in the data stream.
1. Anchoring to the start of the line
The caret (^) anchors the pattern to the start of each line of text in the data stream. If the pattern appears anywhere other than the start of the line, the regular expression does not match.

➜  Charpter20 git:(master)echo "The book store"|sed -n '/^book/p'
➜  Charpter20 git:(master)echo "Books are great"|sed -n '/^Book/p'
Books are great

If the caret is placed anywhere other than the beginning of the pattern, it acts as an ordinary character rather than a special character:

➜  Charpter20 git:(master)echo "This is^ a test"|sed -n '/s^/p'
This is^ a test

2. Anchoring to the end of the line
The dollar sign ($) defines the end-of-line anchor. Placed after a text pattern, it means the line must end with that text.

➜  Charpter20 git:(master)echo "This is a good book"|sed -n '/book$/p'
This is a good book
➜  Charpter20 git:(master)echo "This is a good "|sed -n '/book$/p'

3. Combining anchors
In some cases you can combine the start-of-line and end-of-line anchors in the same pattern.
For example, to delete blank lines from a data stream:

➜  Charpter20 git:(master)more data3
This is one test line.

This is another test line.
➜  Charpter20 git:(master)sed '/^$/d' data3
This is one test line.
This is another test line.

The same approach can write the remaining lines to a new file that contains no blank lines:

➜  Charpter20 git:(master)sed '/^$/d;w data4' data3
This is one test line.
This is another test line.
➜  Charpter20 git:(master)more data4
This is one test line.
This is another test line.
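
Note that '^$' only matches lines that are completely empty. As a hedged sketch (the sample text is an assumption, not one of the article's data files), a line that holds only spaces survives '^$' but is removed by '^ *$':

```shell
# Sample with one empty line and one line that contains only three spaces.
sample=$(printf 'first line\n\n   \nsecond line\n')

# '^$' deletes only the truly empty line; the spaces-only line remains.
only_empty=$(printf '%s\n' "$sample" | sed '/^$/d')
# '^ *$' deletes empty lines and lines made up solely of spaces.
blank_too=$(printf '%s\n' "$sample" | sed '/^ *$/d')

printf '%s\n' "$blank_too"
```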

Dot character

The dot matches any single character except a newline. There must be a character at the dot's position for the pattern to match.

➜  Charpter20 git:(master)echo "This is a very nice hat"|sed -n '/.at/p'
This is a very nice hat
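
A short sketch of the edge cases, reusing two sentences that appear in the data6 file shown later in this article: a space counts as a character, but the start of a line does not.

```shell
# "at" sits at the very start of the line, so there is no character for the dot to match.
start=$(echo "at ten o'clock we'll go home." | sed -n '/.at/p')
# A space is a character: " at" satisfies the pattern '.at'.
space=$(echo "This test is at line four." | sed -n '/.at/p')

printf 'start: [%s]\n' "$start"
printf 'space: [%s]\n' "$space"
```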

Character Group

Use square brackets to define a character group (also called a character set). The brackets contain every character that you want to accept at that position in the pattern.
A character group is handy when you are not sure about the case of a character:

➜  Charpter20 git:(master)echo "Yes"|sed -n '/[Yy]es/p'
Yes
➜  Charpter20 git:(master)echo "yes"|sed -n '/[Yy]es/p'
yes
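
Several groups can be used in one pattern, and a group is not limited to letters. A minimal sketch with made-up input:

```shell
# One group per position accepts any mix of uppercase and lowercase.
mixed=$(echo "yEs" | sed -n '/[Yy][Ee][Ss]/p')
# A group matches exactly one character; it cannot be skipped.
missing=$(echo "es" | sed -n '/[Yy]es/p')
# Groups can hold any characters, for example a few digits.
digit=$(echo "This line has 1 number" | sed -n '/[0123]/p')

printf 'mixed:   [%s]\n' "$mixed"
printf 'missing: [%s]\n' "$missing"
printf 'digit:   [%s]\n' "$digit"
```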

Negated character set

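In a regular expression you can also negate a character set: instead of matching a character that is in the set, the pattern matches any character that is not in it. To do so, put a caret (^) at the start of the set.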

➜  Charpter20 git:(master)more data6
This is a test of a line.
The cat is sleeping.
That is a very nice hat.
This test is at line four.
at ten o'clock we'll go home.
➜  Charpter20 git:(master)sed -n '/[^ch]at/p' data6
This test is at line four.
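
The caret only negates the set when it is the first character inside the brackets. The sketch below uses illustrative input that is not part of the article's data files:

```shell
# Print lines that contain at least one character that is not a digit.
nondigit=$(printf '12345\n12a45\n' | sed -n '/[^0123456789]/p')
# A caret that is not first in the set is just a literal caret.
literal_caret=$(echo "x^at" | sed -n '/[c^]at/p')

printf 'nondigit:      [%s]\n' "$nondigit"
printf 'literal_caret: [%s]\n' "$literal_caret"
```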

Ranges

A range specifies a run of characters inside a character set with a dash, such as [0-9]. For example, the following pattern uses ranges to filter out lines that are not valid five-digit ZIP codes:

➜  Charpter20 git:(master)sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8
60633
46201
22203
➜  Charpter20 git:(master)more data8
60633
46201
223001
4353
22203

You can also specify multiple non-contiguous ranges in a single character set:

➜  Charpter20 git:(master)sed -n '/[a-ch-m]at/p' data6
The cat is sleeping.
That is a very nice hat.
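
As another sketch of combined ranges, the pattern below accepts only lines made entirely of hexadecimal digits. The input strings are assumptions, and `LC_ALL=C` is set because the characters covered by a range can depend on the locale.

```shell
# Three ranges in one set: digits, lowercase a-f, uppercase A-F.
# The set is written twice so that at least one character is required (BRE has no '+').
hex=$(printf 'cafe42\nCAFE\nxyz123\n' | LC_ALL=C sed -n '/^[0-9a-fA-F][0-9a-fA-F]*$/p')

printf '%s\n' "$hex"
```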

Special character classes

In addition to defining your own character sets, BRE provides special character classes that match particular types of characters. The following table describes the special character classes available in BRE.

Class        Description
[[:alpha:]]  Matches any alphabetic character, uppercase or lowercase
[[:alnum:]]  Matches any alphanumeric character: 0-9, A-Z, or a-z
[[:blank:]]  Matches a space or tab character
[[:digit:]]  Matches a digit from 0 through 9
[[:lower:]]  Matches any lowercase alphabetic character a-z
[[:print:]]  Matches any printable character
[[:punct:]]  Matches a punctuation character
[[:space:]]  Matches any whitespace character: space, tab, NL, FF, VT, or CR
[[:upper:]]  Matches any uppercase alphabetic character A-Z

[root@ommleft zd]# echo "abc"|sed -n '/[[:digit:]]/p'
[root@ommleft zd]# echo "abc"|sed -n '/[[:alnum:]]/p'
abc
[root@ommleft zd]# echo "abc123"|sed -n '/[[:alnum:]]/p'
abc123
[root@ommleft zd]# echo "This is , a test"|sed -n '/[[:punct:]]/p'
This is , a test
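
These classes can be anchored and combined with other characters inside the same brackets. A minimal sketch, with sample words chosen purely for illustration:

```shell
# Lines that start with an uppercase letter.
upper=$(printf 'Hello\nhello\n' | sed -n '/^[[:upper:]]/p')
# A class can share the brackets with ordinary characters: alphanumerics or underscore only.
ident=$(printf 'user_1\nuser-1\n' | sed -n '/^[[:alnum:]_]*$/p')

printf 'upper: [%s]\n' "$upper"
printf 'ident: [%s]\n' "$ident"
```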

Extended regular expressions

gawk recognizes most of the extended regular expression pattern symbols, which provide additional filtering features that the sed editor does not have. The trade-off is that gawk is typically slower at processing a data stream.

Question mark

The question mark indicates that the preceding character may appear zero times or one time, but no more.

[root@ommleft zd]# echo "bt"|gawk '/be?t/{print $0}'
bt
[root@ommleft zd]# echo "bet"|gawk '/be?t/{print $0}'
bet
[root@ommleft zd]# echo "beet"|gawk '/be?t/{print $0}'
[root@ommleft zd]# 
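
The same behaviour can be sketched by filtering a list of words in one pass. This example substitutes `grep -E` for gawk, on the assumption that it is available; both use the POSIX ERE syntax. The word list is made up.

```shell
# '?' makes the preceding "u" optional: zero or one occurrence is accepted.
matched=$(printf 'color\ncolour\ncolouur\n' | grep -E 'colou?r')

printf '%s\n' "$matched"
```

Only "color" and "colour" are printed; "colouur" has two occurrences of "u" and is rejected.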

Plus sign

The plus sign is similar to the asterisk but differs from the question mark: it indicates that the preceding character may appear one or more times, and it must appear at least once.

[root@ommleft zd]# echo "beeet"|gawk '/be+t/{print $0}'
beeet
[root@ommleft zd]# echo "bt"|gawk '/be+t/{print $0}'
[root@ommleft zd]# 
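
To make the difference from the asterisk concrete, here is a hedged sketch that again uses `grep -E` as an assumed stand-in for gawk, with an invented word list:

```shell
# '*' allows zero occurrences of "e", so "bt" is accepted.
star=$(printf 'bt\nbet\nbeet\n' | grep -E 'be*t')
# '+' requires at least one "e", so "bt" is rejected.
plus=$(printf 'bt\nbet\nbeet\n' | grep -E 'be+t')
# The plus sign also applies to a character set: one or more of "a" or "e".
set_plus=$(printf 'bt\nbat\nbeat\nbit\n' | grep -E 'b[ae]+t')

printf 'star:\n%s\n' "$star"
printf 'plus:\n%s\n' "$plus"
printf 'set_plus:\n%s\n' "$set_plus"
```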

Curly braces

Curly braces let you specify how many times the preceding regular expression may repeat. This is often called an interval. An interval takes one of two forms:

  • m: the regular expression appears exactly m times
  • m,n: the regular expression appears at least m times and at most n times

[root@ommleft zd]# echo "bet"|gawk '/be{1}t/{print $0}'
bet
[root@ommleft zd]# echo "beet"|gawk '/be{1}t/{print $0}'
[root@ommleft zd]#
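
The m,n form is not demonstrated above, so here is a minimal sketch of both forms. It assumes `grep -E` as the ERE tool instead of gawk, and the word list and ZIP codes are sample values.

```shell
words=$(printf 'bt\nbet\nbeet\nbeeet\n')

# {1,2}: between one and two "e" characters.
range=$(printf '%s\n' "$words" | grep -E 'be{1,2}t')
# {2}: exactly two "e" characters.
exact=$(printf '%s\n' "$words" | grep -E 'be{2}t')
# An interval shortens the five-digit ZIP code pattern from the ranges section.
zip=$(printf '60633\n223001\n4353\n' | grep -E '^[0-9]{5}$')

printf 'range:\n%s\n' "$range"
printf 'exact: %s\n' "$exact"
printf 'zip:   %s\n' "$zip"
```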

Pipe symbol

The pipe symbol lets you specify two or more patterns that the engine combines with a logical OR when it examines the data stream. If any one of the patterns matches the data stream text, the text passes.
The format for using the pipe symbol is:
expr1|expr2|...
Do not put spaces around the pipe symbol unless they are part of the pattern, because a space inside a regular expression is matched literally.

[root@ommleft zd]# echo "The cat is asleep"|gawk '/cat|dog/{print $0}'
The cat is asleep
[root@ommleft zd]# echo "The dog is asleep"|gawk '/cat|dog/{print $0}'
The dog is asleep
[root@ommleft zd]# echo "The sleep is asleep"|gawk '/cat|dog/{print $0}'
[root@ommleft zd]# 
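
A short sketch of alternation with more than two patterns, and of what happens when spaces are left around the pipe symbol. It uses `grep -E` as an assumed ERE tool in place of gawk, and the input lines are invented.

```shell
# Three alternatives: a line passes if any one of them matches.
pets=$(printf 'The cat is asleep\nThe bird is asleep\nThe fish is asleep\n' | grep -E 'cat|dog|bird')
# Spaces are literal: this pattern looks for "cat " (trailing space) or " dog" (leading space).
spaced=$(printf 'The cat is asleep\nconcat\nhotdog\n' | grep -E 'cat | dog')

printf 'pets:\n%s\n' "$pets"
printf 'spaced:\n%s\n' "$spaced"
```

In the second command, "concat" and "hotdog" are rejected because neither contains the space that the pattern requires.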

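Grouping expressions

Parentheses group part of a regular expression. The group is treated as a single unit, so a special character that follows it applies to the whole group, just as it would to a single character.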

[root@ommleft zd]# echo "Sat"|gawk '/Sat(urday)?/{print $0}'
Sat
[root@ommleft zd]# echo "Saturday"|gawk '/Sat(urday)?/{print $0}'
Saturday
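
Grouping is often combined with the pipe symbol and with anchors. The following is a sketch only: it assumes `grep -E` in place of gawk, and the word lists are made up.

```shell
# (c|b)a(b|t) accepts cab, cat, bab, or bat.
group=$(printf 'cat\ncab\nbat\nbab\ntab\ntac\n' | grep -E '(c|b)a(b|t)')
# With anchors, the optional group must be present in full or absent: "Satur" is rejected.
days=$(printf 'Sat\nSatur\nSaturday\n' | grep -E '^Sat(urday)?$')

printf 'group:\n%s\n' "$group"
printf 'days:\n%s\n' "$days"
```

Without the anchors, 'Sat(urday)?' would match any line that contains "Sat", including "Satur".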

Origin blog.csdn.net/zhengdong12345/article/details/101307284