Regular expressions and file format handling (1) - basic regular expressions practice (primary)

Basic regular expressions

Regular Expressions

Simply put, a regular expression is a way of processing strings. It works on the data line by line and,
with the help of some special symbols, lets users easily search for, delete, or replace specific strings!

Widespread uses of regular expressions

  • System administrators managing hosts
  • Mail servers filtering spam

Extended regular expressions

Depending on how strict the syntax is, regular expressions are divided into basic regular expressions and extended regular expressions.

Impact of the locale on regular expressions

Why does the locale affect the output of a regular expression? As mentioned in the text encoding part of Chapter 0 (Introduction to Computers),
a file really records only 0s and 1s; the letters and digits we see are produced by conversion through an encoding table.
Since data is encoded differently in different locales, the matching results can differ as well.
For example, the encoding order of English letters under the zh_TW.big5 and C locales is as follows:

  • With LANG=C: 0 1 2 3 4 ... A B C D ... Z a b c d ... z
  • With LANG=zh_TW: 0 1 2 3 4 ... a A b B c C d D ... z Z

The sequences above show the encoding order, and the two locales are clearly not the same!
If you want to capture uppercase characters with [A-Z], you will find that LANG=C really captures only uppercase characters (because they are consecutive),
but with LANG=zh_TW.big5 the lowercase letters b-z are captured as well!
Because matching follows the encoding order, the big5 locale captures the whole run "A b B c C ... z Z"!
Therefore, when using regular expressions, pay special attention to the locale of the environment; otherwise you may get different results from other people!

Since regular expressions are generally practiced against the POSIX standard,
we simply use the "C" locale! The exercises below therefore all use LANG=C.
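As a small illustration of the point above, here is a minimal sketch that pins the locale to C before matching an uppercase range. The sample lines are invented for this sketch and are not taken from the practice file.

```shell
# Force the C locale so that [A-Z] follows plain ASCII order.
export LANG=C LC_ALL=C

# Hypothetical sample data: one lowercase line, one uppercase line.
sample() { printf '%s\n' 'abc' 'ABC'; }

# Under the C locale only the uppercase line is selected.
upper_lines=$(sample | grep -n '[A-Z]')
echo "$upper_lines"
```

Under other locales the same range may also select lowercase letters, which is why the exercises fix the locale first.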

Meaning of the special symbols

In addition, to avoid capture problems caused by the encoding of English letters and digits, we need to learn some special symbols.
Their main meanings are as follows:

Special symbol   Meaning
[:alnum:]        English uppercase and lowercase letters and digits, i.e. 0-9, A-Z, a-z
[:alpha:]        Any English uppercase or lowercase letter, i.e. A-Z, a-z
[:blank:]        The space bar and the [Tab] key
[:cntrl:]        Control keys on the keyboard, including CR, LF, Tab, Del, etc.
[:digit:]        Digits only, i.e. 0-9
[:graph:]        All characters except whitespace (the space bar and the [Tab] key)
[:lower:]        Lowercase letters, i.e. a-z
[:print:]        Any character that can be printed
[:punct:]        Punctuation symbols, i.e. " ' ? ! ; : # $ ...
[:upper:]        Uppercase letters, i.e. A-Z
[:space:]        Any character that produces whitespace, including the space bar, [Tab], CR, etc.
[:xdigit:]       Hexadecimal digits, i.e. 0-9, A-F, a-f
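A few of these classes can be tried out as follows. This is only a sketch: the four sample lines are made up, and note that a class name such as [:digit:] must itself sit inside a bracket expression, hence the double brackets.

```shell
export LC_ALL=C

# Hypothetical sample data: lowercase, uppercase, digits, punctuation.
sample() { printf '%s\n' 'abc' 'ABC' '123' '!?#'; }

lower_lines=$(sample | grep -n '[[:lower:]]')   # lines with a lowercase letter
digit_lines=$(sample | grep -n '[[:digit:]]')   # lines with a digit
punct_lines=$(sample | grep -n '[[:punct:]]')   # lines with punctuation
alnum_lines=$(sample | grep -n '[[:alnum:]]')   # lines with a letter or digit

echo "$lower_lines"; echo "$digit_lines"; echo "$punct_lines"; echo "$alnum_lines"
```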

Basic regular expression exercises

The prerequisites are:

  • The locale has been set with "export LANG=C; export LC_ALL=C";
  • grep has been aliased to "grep --color=auto"

Fetch the practice file regular_express.txt with this command:

wget http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt

When grep searches for a string in the data, it picks out data in units of whole lines!

Search for specific strings

Pick out the lines of regular_express.txt that contain the specific string "the":

grep -n 'the' regular_express.txt

What about inverting the selection? That is, a line is displayed on the screen only when it does not contain the string "the":

grep -vn 'the' regular_express.txt

Pick out the lines of regular_express.txt that contain the string "the", regardless of case:

grep -in 'the' regular_express.txt
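The three options can be compared side by side on a tiny input. The lines below are invented for this sketch, so the line numbers differ from those of regular_express.txt.

```shell
export LC_ALL=C

# Hypothetical sample data.
sample() { printf '%s\n' 'the cat' 'The dog' 'a bird'; }

plain=$(sample | grep -n 'the')      # case-sensitive match
inverted=$(sample | grep -vn 'the')  # lines that do NOT contain "the"
nocase=$(sample | grep -in 'the')    # ignore case

echo "$plain"; echo "$inverted"; echo "$nocase"
```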

Using square brackets [] to search for a set of characters

No matter how many characters are inside [], it stands for only one character.
If I want to search for the two words test and taste, I notice that they share the pattern 't?st', so I can search like this:

grep -n 't[ae]st' regular_express.txt
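To see that the brackets stand for exactly one character, here is a minimal sketch on a made-up word list (the words are assumptions for illustration only):

```shell
export LC_ALL=C

# Hypothetical word list: only test and taste contain t, then a or e, then st.
words() { printf '%s\n' 'test' 'taste' 'toast' 'twist'; }

matches=$(words | grep -n 't[ae]st')
echo "$matches"
```

toast is not selected because two characters (o and a) sit between the t and the st, while [ae] consumes only one.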

And if you want to find lines containing the characters oo, use:

grep -n 'oo' regular_express.txt

But what if I do not want a g in front of oo? In that case, use the negated character set [^]:

grep -n '[^g]oo' regular_express.txt

Next, suppose I do not want a lowercase character in front of oo. I could write [^abcd....z]oo, but that is clumsy. Because the lowercase letters are encoded consecutively in ASCII, we can simplify it as follows:

grep -n '[^a-z]oo' regular_express.txt
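A negated set still has to match one real character, which is easy to overlook. The sketch below uses invented sample words to show this:

```shell
export LC_ALL=C

# Hypothetical sample data.
words() { printf '%s\n' 'good' 'food' 'ooze' 'Foo'; }

not_g=$(words | grep -n '[^g]oo')        # oo preceded by any character except g
not_lower=$(words | grep -n '[^a-z]oo')  # oo preceded by a non-lowercase character

echo "$not_g"; echo "$not_lower"
```

ooze is not selected by either pattern: its oo starts the line, so there is no character in front of it for [^g] or [^a-z] to match.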

That is, within a character set, if the characters are consecutive, such as uppercase letters, lowercase letters, or digits,
you can write [a-z], [A-Z], [0-9], and so on. What if we require a string of digits and English letters?
Just write them all together: [a-zA-Z0-9]. For example, to pick out the lines that contain digits:

grep -n '[0-9]' regular_express.txt

However, given the effect of the locale on the encoding order, besides using the minus sign "-" for consecutive ranges, you can also get the results of the previous two tests as follows.
Here [:lower:] means a-z; please refer to the table of special symbols above.

grep -n '[^[:lower:]]oo' regular_express.txt

To pick out just the lines that contain digits, you can also write:

grep -n '[[:digit:]]' regular_express.txt

Line-start and line-end anchors ^ $

Earlier we searched for lines that contain the string "the". What if I want to list only the lines where "the" is at the beginning of the line? This is when you need the anchor characters! We
can do this:

grep -n '^the' regular_express.txt

What if I want to list the lines that begin with a lowercase character? It can be:

grep -n '^[a-z]' regular_express.txt

The command above can also be replaced with:

grep -n '^[[:lower:]]' regular_express.txt

Good! What if I want the lines that do not begin with an English letter? It can look like this:

grep -n '^[^a-zA-Z]' regular_express.txt

Or the command may be:

grep -n '^[^[:alpha:]]' regular_express.txt

Noticed? The ^ symbol means different things inside and outside a character set (the brackets [])! Inside [] it means negation; outside [] it anchors the match to the beginning of the line!
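The two roles of ^ can be checked on a small input. The sample lines are assumptions made up for this sketch.

```shell
export LC_ALL=C

# Hypothetical sample data.
sample() { printf '%s\n' 'the cat' 'The dog' '1 fish' 'see the end'; }

starts_with_the=$(sample | grep -n '^the')       # ^ outside []: anchor to line start
non_letter_start=$(sample | grep -n '^[^a-zA-Z]') # ^ inside []: negate the set

echo "$starts_with_the"; echo "$non_letter_start"
```

Line 4 contains "the" but is not selected by '^the', because the string is not at the beginning of the line.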

Thinking the other way round, if I want to find the lines that end with a period ., how do I do it:

grep -n '\.$' regular_express.txt

Note in particular that because the period has another meaning (discussed next), you must use the escape character \ to remove its special meaning!

If I want to find out which lines are blank, that is, lines with no data entered at all,
how do I search?

grep -n '^$' regular_express.txt
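The following sketch, on invented sample lines, compares the escaped period, the unescaped period, and the blank-line pattern:

```shell
export LC_ALL=C

# Hypothetical sample data; the third line is empty.
sample() { printf '%s\n' 'done.' 'not done' '' 'v1.2'; }

ends_with_dot=$(sample | grep -n '\.$')  # a literal period at the end of the line
ends_with_any=$(sample | grep -n '.$')   # unescaped: any character at the end
blank=$(sample | grep -n '^$')           # start immediately followed by end

echo "$ends_with_dot"; echo "$ends_with_any"; echo "$blank"
```

Without the backslash, '.$' selects every non-empty line, which shows why the escape is needed.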

As an example, take the file /etc/rsyslog.conf under CentOS 7; you can check the output on your own system:

cat -n /etc/rsyslog.conf

In CentOS 7, the output has 91 lines, many of which are blank lines or comment lines that begin with #

grep -v '^$' /etc/rsyslog.conf | grep -v '^#'
  • The result has far fewer lines. The first -v '^$' means "no blank lines";
  • The second -v '^#' means "no lines that begin with #"!
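The same two-stage filter can be tried without a CentOS machine. This is a minimal sketch on a made-up configuration fragment; the keys and values are hypothetical and have nothing to do with the real rsyslog.conf.

```shell
export LC_ALL=C

# Hypothetical configuration text with comments and blank lines.
conf() {
  printf '%s\n' '# sample comment' '' 'key1 = value1' '' '#another comment' 'key2 = value2'
}

# Drop blank lines first, then drop lines that start with #.
active=$(conf | grep -v '^$' | grep -v '^#')
echo "$active"
```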

Any single character . and the repetition character *

We know that the wildcard * can stand for any (zero or more) characters, but regular expressions are not wildcards; the two are not the same
thing! In a regular expression, . means that there must be exactly one arbitrary character. The meanings of these two symbols in a regular expression are as follows:

  • . (period): means there must be one arbitrary character;
  • * (asterisk): means repeat the preceding character zero to infinitely many times; it is used in combination with the character before it

This may be hard to follow, so let's practice directly! Suppose I need to find strings of the form g??d, that is, four characters in total, beginning with g and ending with d. I can do this:

grep -n 'g..d' regular_express.txt

Because the pattern requires exactly two characters between g and d, god on line 13 and gd on line 14 are not listed here!
Next, if I want to list data containing oo, ooo, oooo, and so on, that is, at least two o's, what should I do?
Is it o*, oo*, or ooo*? You could try each one and look at the results, but the output takes up too much space @_@, so I will explain directly.

Because * means "repeat the preceding RE character zero or more times", o* means an empty string or one or more o's. Pay special attention: because the empty string is allowed (that is, no character at all also counts), grep -n 'o*' regular_express.txt prints every line to the screen! What about oo*? The first o must exist, while the second o is optional and may repeat, so any line containing o, oo, ooo, oooo, and so on is listed.

Likewise, when we need strings with at least two o's, we need ooo*, that is:

grep -n 'ooo*' regular_express.txt
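Counting matching lines makes the difference between the three patterns visible. This is a sketch on three invented lines, using grep -c to count the lines selected:

```shell
export LC_ALL=C

# Hypothetical sample data with zero, one, and two o's.
sample() { printf '%s\n' 'gd' 'god' 'good'; }

zero_or_more=$(sample | grep -c 'o*')   # empty match allowed: every line counts
one_or_more=$(sample | grep -c 'oo*')   # at least one o
two_or_more=$(sample | grep -c 'ooo*')  # at least two consecutive o's

echo "$zero_or_more $one_or_more $two_or_more"
```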

If I want strings that begin and end with g, with at least one o between the two g's, that is gog,
goog, gooog, and so on, then how?

grep -n 'goo*g' regular_express.txt

Got it so far? One more question: if I want strings that begin and end with g, where the characters in between are optional, what should I do? Is it "g*g"?

grep -n 'g*g' regular_express.txt

1:"Open Source" is a good mechanism to develop programs.
3:Football game is not use feet only.
9:Oh! The soup taste good.
13:Oh! My god!
14:The gd software is a library for drafting programs.
16:The world is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

But why did the test select so many lines? Strange, isn't it? Actually it is not strange: in g*g, the g* means an empty string or one or more g's, and it is followed by another g. The whole RE therefore matches g, gg, ggg, gggg, and so on, so any line that contains at least one g qualifies!
How then do we meet our g....g requirement? Use the any-character symbol ., that is, g.*g. Since * repeats the preceding character zero or more times and . is any character, .* means zero or more arbitrary characters!
So the regular expression is as follows:

grep -n 'g.*g' regular_express.txt
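The contrast between the two patterns shows up clearly on a small made-up input (the words below are assumptions for illustration):

```shell
export LC_ALL=C

# Hypothetical sample data.
sample() { printf '%s\n' 'dog' 'going' 'gg' 'bad'; }

star_only=$(sample | grep -n 'g*g')  # effectively: the line contains a g
dot_star=$(sample | grep -n 'g.*g')  # a g, anything (or nothing), then another g

echo "$star_only"; echo "$dot_star"
```

dog has only one g, so it satisfies g*g but not g.*g.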

What if I want to find the lines that contain "any number"? Since only digits are involved, it becomes:

grep -n '[0-9][0-9]*' regular_express.txt

Of course, the following regular expression selects the same lines:

grep -n '[0-9]' regular_express.txt
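Both patterns select the same lines, but the text each one actually matches is different. The sketch below shows this with the -o option, which prints only the matched part; -o is not used elsewhere in this article and is assumed to be available (it is in GNU and BSD grep). The sample lines are invented.

```shell
export LC_ALL=C

# Hypothetical sample data.
sample() { printf '%s\n' 'room 2019' 'no digits'; }

lines_run=$(sample | grep -n '[0-9][0-9]*')     # same line selection...
lines_single=$(sample | grep -n '[0-9]')        # ...as the shorter pattern

matched_run=$(sample | grep -o '[0-9][0-9]*')   # the whole run of digits
matched_single=$(sample | grep -o '[0-9]')      # one digit per match

echo "$lines_run"; echo "$matched_run"; echo "$matched_single"
```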

Limiting the range of consecutive RE characters with {}

In the previous examples we used . together with the RE character * to match zero to infinitely many repeated characters. What if I want to limit the number of repetitions to a range? For example, I want to find strings with two to five consecutive o's; how? This is where the range-limiting characters {} come in. In a basic regular expression, however, a bare { or } is just a literal character, so we must put the escape character \ in front of each brace to give it the interval meaning. As for the syntax of {}, suppose I want to find strings with two o's; it can be:

grep -n 'o\{2\}' regular_express.txt

1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

At first glance this looks no different from ooo*, because line 19, which has many o's, still appears! So let's change the search string: suppose we want to find a g followed
by two to five o's and then another g. It would look like this:

grep -n 'go\{2,5\}g' regular_express.txt

18:google is the best tools for search keyword.

Good! Line 19 is finally not selected (because line 19 has six o's!). So what if I want two or more o's, as in goooo....g? Besides
gooo*g, it can be:

grep -n 'go\{2,\}g' regular_express.txt

18:google is the best tools for search keyword.
19:goooooogle yes!
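The interval forms can be compared on invented strings with one, two, five, and six o's between the two g's. This is a sketch only; the line numbers refer to the made-up input, not to regular_express.txt.

```shell
export LC_ALL=C

# Hypothetical sample data: 1, 2, 5, and 6 o's between the g's.
sample() { printf '%s\n' 'gog' 'goog' 'gooooog' 'goooooog'; }

exactly_two=$(sample | grep -n 'o\{2\}')       # any line with 2 consecutive o's
two_to_five=$(sample | grep -n 'go\{2,5\}g')   # g, 2 to 5 o's, g
two_or_more=$(sample | grep -n 'go\{2,\}g')    # g, at least 2 o's, g

echo "$exactly_two"; echo "$two_to_five"; echo "$two_or_more"
```

Without the surrounding g's, o\{2\} behaves like ooo* for line selection, since a longer run of o's always contains two consecutive o's.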

Reference: "Bird Brother's Linux Private Kitchen: Basic Learning" (4th edition)

Origin www.cnblogs.com/freedom-try/p/12113807.html