Regular Expressions with Flair (Part 2)

1. Zero-width assertions

Whether or not you have met the term before, "zero-width assertion" sounds strange,
so let's first explain its two words.

  1. Assertion: in everyday speech, an assertion is a statement like "I am sure that ...". In a regex, an assertion states that the content to match must appear in front of or behind some specified content.
    In other words, a regex can make judgments the way a person does. For example, in "ss1aa2bb3" a regex can use an assertion to find the ss1 in front of aa2, or the bb3 behind aa2.
  2. Zero width: the assertion has no width. It matches only a position, not characters, so the text matched by the assertion itself is not included in the returned result.
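
To make the two words concrete, here is a minimal Java sketch that uses the "ss1aa2bb3" string from above. The class and method names (ZeroWidthDemo, findFirst) are made up for this illustration; only the standard java.util.regex package is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String findFirst(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "ss1aa2bb3";
        // Assert that "aa2" follows: the result is "ss1", without "aa2".
        System.out.println(findFirst("ss1(?=aa2)", text));  // ss1
        // Assert that "aa2" precedes: the result is "bb3", without "aa2".
        System.out.println(findFirst("(?<=aa2)bb3", text)); // bb3
        // An assertion on its own matches a position, so the match is empty.
        Matcher m = Pattern.compile("(?=aa2)").matcher(text);
        if (m.find()) {
            System.out.println(m.start() + " to " + m.end()); // 3 to 3
        }
    }
}
```

The last match starts and ends at the same index, which is what "zero width" means in practice.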

Now that the meaning is clear, what is it good for?
Let's take an example:
Suppose we use a crawler to scrape the read count of an article on CSDN. Viewing the page source shows that the read count sits in a structure like this:

    "<span class="read-count">阅读数:641</span>"

Here '641' is the variable part: its value differs from article to article (阅读数 means "read count"). Once we have this string we need to pull out the '641'. There are many ways to do that, but how would you match it with a regex?

First, let's go through the types of assertions:

  1. Positive lookahead:
  • Syntax: (?=pattern)
  • Role: matches the content that comes before pattern; pattern itself is not returned.

Put like that it may still be baffling, so back to the example. To get the read count, the regex has to match the digits in front of '</span>'.
Since a positive lookahead matches the content in front of its expression, (?=</span>) matches whatever comes before the closing tag.
What does it match? If we want everything in front of it:

    // Uses java.util.regex.Pattern and java.util.regex.Matcher, as do the later snippets.
    String reg = ".+(?=</span>)";

    String test = "<span class=\"read-count\">阅读数:641</span>";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println("Match result:");
        System.out.println(mc.group());
    }

    // Match result:
    // <span class="read-count">阅读数:641

But we only want the digits in front of the tag. That's easy: \d matches a digit, so the pattern can be changed to:

    String reg = "\\d+(?=</span>)";
    String test = "<span class=\"read-count\">阅读数:641</span>";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println(mc.group());
    }
    // Match result:
    // 641

We're done!

  2. Positive lookbehind:
  • Syntax: (?<=pattern)
  • Role: matches the content that comes after pattern; pattern itself is not returned.

Where there is a lookahead there is also a lookbehind: a lookahead matches the content in front of the pattern, a lookbehind matches the content behind it.
The example above can also be handled with a lookbehind.

    // (?<=<span class="read-count">阅读数:)\d+
    String reg = "(?<=<span class=\"read-count\">阅读数:)\\d+";

    String test = "<span class=\"read-count\">阅读数:641</span>";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println(mc.group());
    }
    // Match result:
    // 641

It's that simple.
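
The two assertions can also be combined. The sketch below is one possible way to do it, not the only one: the lookbehind pins the "阅读数:" prefix and the lookahead pins the closing tag, so only the digits between them are returned. The class name ReadCountDemo and the method readCount are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReadCountDemo {
    // Digits that sit between the "阅读数:" prefix and the closing </span>.
    private static final Pattern READ_COUNT =
            Pattern.compile("(?<=阅读数:)\\d+(?=</span>)");

    // Returns the read count as text, or null when the structure is absent.
    static String readCount(String html) {
        Matcher m = READ_COUNT.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<span class=\"read-count\">阅读数:641</span>";
        System.out.println(readCount(html)); // 641
    }
}
```

Because both assertions are zero-width, neither the prefix nor the tag ends up in the result.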

  3. Negative lookahead:
  • Syntax: (?!pattern)
  • Role: matches content that is not followed by pattern; pattern itself is not returned.

Where there is positive there is also negative; negative here simply means "not".
For example, take the sentence "I love the motherland, I am the motherland's flower".
Suppose we want to find the "motherland" that is not in front of "'s flower".
The regex can be written as:

    motherland(?!'s flower)

  4. Negative lookbehind:
  • Syntax: (?<!pattern)
  • Role: matches content that is not preceded by pattern; pattern itself is not returned.
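
The two negative forms have no code of their own above, so here is a small sketch. The sample string "cat1 cat2 dog1" and the names NegativeAssertionDemo and matchStarts are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeAssertionDemo {
    // Collects the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "cat1 cat2 dog1";
        // Negative lookahead: "cat" that is NOT followed by "1".
        System.out.println(matchStarts("cat(?!1)", text));    // [5]
        // Negative lookbehind: a digit that is NOT preceded by "cat".
        System.out.println(matchStarts("(?<!cat)\\d", text)); // [13]
    }
}
```

Only the "cat" of "cat2" and the digit of "dog1" survive the negative conditions.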

2. Capturing and non-capturing

Put simply, capturing means matching an expression, but capturing usually goes hand in hand with grouping, hence the term "capture group".

Capture group: matches the content of a subexpression and saves the match result in memory under a number or an explicitly given name. Groups are numbered depth-first, in the order of their left parentheses, and the saved results can later be used by number or by name.

Depending on how they are named, capture groups come in two kinds, and a third form does not capture at all:

  1. Numbered capture group:
    Syntax: (exp)
    Explanation: counting from the left of the expression, each left parenthesis and its matching right parenthesis form one group. Group 0 is the whole expression, and the first explicit group is group 1.
    Take a landline number: 020-85653333
    Its regex: (0\d{2})-(\d{8})
    Ordered by left parenthesis, the expression has the following groups:

    No.  Number  Group              Content
    0    0       (0\d{2})-(\d{8})   020-85653333
    1    1       (0\d{2})           020
    2    2       (\d{8})            85653333

We use Java to verify:

    String test = "020-85653333";
    String reg = "(0\\d{2})-(\\d{8})";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    if (mc.find()) {
        System.out.println("Number of groups: " + mc.groupCount());
        for (int i = 0; i <= mc.groupCount(); i++) {
            System.out.println("Group " + i + ": " + mc.group(i));
        }
    }

Output:

    Number of groups: 2
    Group 0: 020-85653333
    Group 1: 020
    Group 2: 85653333

As you can see, the group count is 2; group 0, the whole expression, is not counted but is printed along with them.
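
Ordering by left parenthesis matters most when groups are nested. As a rough sketch, the pattern below wraps the whole phone expression in one extra pair of parentheses; that variant, and the names GroupNumberingDemo and groups, are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns groups 0..groupCount of the first match, or an empty array.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // The outer parenthesis opens first, so it becomes group 1.
        String[] g = groups("((0\\d{2})-(\\d{8}))", "020-85653333");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Group " + i + ": " + g[i]);
        }
        // Group 0: 020-85653333
        // Group 1: 020-85653333
        // Group 2: 020
        // Group 3: 85653333
    }
}
```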

  2. Named capture group:
    Syntax: (?<name>exp)
    Explanation: the group is named by the name given in the expression.
    For example, the phone pattern can be written as: (?<quhao>0\d{2})-(?<haoma>\d{8}) (quhao is pinyin for "area code", haoma for "number")
    Ordered by left parenthesis, the expression has the following groups:

    No.  Name   Group              Content
    0    0      (0\d{2})-(\d{8})   020-85653333
    1    quhao  (0\d{2})           020
    2    haoma  (\d{8})            85653333

Verify with code:

    String test = "020-85653333";
    String reg = "(?<quhao>0\\d{2})-(?<haoma>\\d{8})";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    if (mc.find()) {
        System.out.println("Number of groups: " + mc.groupCount());
        System.out.println("Group name: quhao, matched content: " + mc.group("quhao"));
        System.out.println("Group name: haoma, matched content: " + mc.group("haoma"));
    }

Output:

    Number of groups: 2
    Group name: quhao, matched content: 020
    Group name: haoma, matched content: 85653333
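
Named groups can also be used by name in a replacement string, where Java writes the reference as ${name}. A small sketch, with the class NamedGroupReplaceDemo, the method reformat, and the output layout all chosen only for illustration:

```java
class NamedGroupReplaceDemo {
    // Rewrites "020-85653333" style numbers as "(020) 85653333".
    static String reformat(String phone) {
        return phone.replaceAll(
                "(?<quhao>0\\d{2})-(?<haoma>\\d{8})",
                "(${quhao}) ${haoma}");
    }

    public static void main(String[] args) {
        System.out.println(reformat("020-85653333")); // (020) 85653333
    }
}
```

The names keep the replacement readable even if more groups are later added to the pattern.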

  3. Non-capturing group:
    Syntax: (?:exp)
    Explanation: the exact opposite of a capture group. It marks groups that do not need to be captured; put plainly, it lets you skip storing the groups you don't need.

For example, if the regex above has no use for its first group, it can be written like this:

    (?:0\d{2})-(\d{8})

    No.  Number  Group                Content
    0    0       (?:0\d{2})-(\d{8})   020-85653333
    1    1       (\d{8})              85653333

Verify:

    String test = "020-85653333";
    String reg = "(?:0\\d{2})-(\\d{8})";
    Pattern pattern = Pattern.compile(reg);
    Matcher mc = pattern.matcher(test);
    if (mc.find()) {
        System.out.println("Number of groups: " + mc.groupCount());
        for (int i = 0; i <= mc.groupCount(); i++) {
            System.out.println("Group " + i + ": " + mc.group(i));
        }
    }

Output:

    Number of groups: 1
    Group 0: 020-85653333
    Group 1: 85653333
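
A common reason to reach for a non-capturing group is to apply a quantifier to several tokens without spending a group number on them. The sketch below makes the area code optional; that requirement, and the names NonCapturingDemo and localNumber, are assumptions for the example rather than part of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // (?:0\d{2}-)? groups the area code and dash so "?" applies to both,
    // while the 8-digit number stays group 1.
    private static final Pattern PHONE =
            Pattern.compile("(?:0\\d{2}-)?(\\d{8})");

    // Returns the 8-digit local number, or null if the input is not a phone.
    static String localNumber(String input) {
        Matcher m = PHONE.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(localNumber("020-85653333")); // 85653333
        System.out.println(localNumber("85653333"));     // 85653333
        System.out.println(localNumber("hello"));        // null
    }
}
```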

3. Backreferences

From the capturing section above we know that capturing produces a capture group, and that the group is stored in memory. It can be referenced not only by the program outside the regex but also from inside the regex itself. A reference of the latter kind is a backreference.

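Following the two ways of naming capture groups, backreferences are divided into:

  1. Backreference to a numbered group: \k<number> or \number
  2. Backreference to a named group: \k<name> or \'name'

Not every regex flavor accepts all four forms. Java's java.util.regex, used in this article, accepts \number and \k<name>.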

That's it. Got it? No!!!
Maybe you still don't see what the capturing described earlier is for?
In fact, it is perfectly normal not to know how to use capturing from reading about it alone,
because capture groups are usually used together with backreferences.

As said above, a capture group saves the content matched by a subexpression under a number or a name so that it can be used later.
Note two words here: "content" and "use".
"Content" means the match result, not the subexpression itself. Why stress this? Just keep it in mind for now.
And how is it "used"?

Its main uses are finding repeated content and replacing specified characters.

Another example:
suppose you want to find the pairs of identical adjacent letters in the string "aabbbbgbddesddfiid".
With what we have learned so far (ranges, quantifiers, assertions) that is probably not possible,
so let's first think it through the way a program would:

  • 1) match a letter
  • 2) match the next letter and check whether it is the same as the previous one
  • 3) if it is the same, the match succeeds; otherwise it fails

At step 2, matching the next letter requires the previous letter. How do we remember it?
This is where capturing comes in: the content of the previous successful capture can serve as the condition for the current match.
With the idea in hand, let's put it into practice.
First match a letter: \w (strictly, \w also matches digits and underscores, which does no harm for this all-letter input)
We need it as a capture group, so we write: (\w)

This expression has one capture group: (\w)
Next we use that captured group as the condition: (\w)\1
And that's it.
Some readers may wonder what \1 means.
Remember that capture groups are named in two ways: by default numbering, or by a custom name.
By default groups are named by number, and numbering starts at 1.
So a numbered backreference to the first capture group is written \k<1> or \1.
The latter is the usual form, and it is the one Java accepts.
Let's test:
Let's test:

    String test = "aabbbbgbddesddfiid";
    Pattern pattern = Pattern.compile("(\\w)\\1");
    Matcher mc = pattern.matcher(test);
    while (mc.find()) {
        System.out.println(mc.group());
    }

Output:

    aa
    bb
    bb
    dd
    dd
    ii
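
The same search can also be sketched with a named group and a named backreference, which Java writes as \k<name>. The group name ch and the identifiers NamedBackrefDemo and doubledLetters are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedBackrefDemo {
    // (?<ch>\w) captures one word character; \k<ch> demands the same one again.
    private static final Pattern DOUBLED = Pattern.compile("(?<ch>\\w)\\k<ch>");

    // Returns every pair of identical adjacent word characters, in order.
    static List<String> doubledLetters(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = DOUBLED.matcher(input);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(doubledLetters("aabbbbgbddesddfiid"));
        // [aa, bb, bb, dd, dd, ii]
    }
}
```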

This is exactly what we wanted.
Now a replacement example: suppose you want to replace abc in a string with a:

    String test = "abcbbabcbcgbddesddfiid";
    String reg = "(a)(b)c";
    // $1 in the replacement refers to group 1, here the letter a
    System.out.println(test.replaceAll(reg, "$1"));

Output:

    abbabcgbddesddfiid
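
Searching for repeats and replacing can be combined: \1 inside the pattern finds the repeat, and $1 in the replacement keeps a single copy. The following is only a sketch; the sample sentence and the names DuplicateWordDemo and collapse are hypothetical.

```java
class DuplicateWordDemo {
    // \b(\w+)\s+\1\b matches a word, whitespace, and the same word again.
    // Replacing the whole match with $1 keeps one copy of the word.
    static String collapse(String text) {
        return text.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("this is is a test test"));
        // this is a test
    }
}
```

The \b word boundaries stop the pattern from pairing the "is" inside "this" with the "is" that follows it.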

4. Greedy and non-greedy

1. Greedy

We all know that greed means never being satisfied and wanting as much as possible.
In regex, greedy means much the same:

Greedy matching: when a regex contains a quantifier that accepts a variable number of repetitions, the default behavior is to match as many characters as possible, provided the whole expression can still match. This is called greedy matching.
Characteristics: it takes as much of the string as it can in one go. Whenever the rest of the expression fails to match, it gives up the rightmost character and tries again, repeating this match-and-give-up cycle (known as backtracking) until the match succeeds or every character has been given up. The result is as long as possible.

The quantifiers covered earlier are in fact all greedy quantifiers. Take the expression:

    \d{3,6}

It matches 3 to 6 digits. In greedy mode, if the string contains 6 digits in a row, all 6 are matched.
For example:

    String reg = "\\d{3,6}";
    String test = "61762828 176 2991 871";
    System.out.println("Text: " + test);
    System.out.println("Greedy pattern: " + reg);
    Pattern p1 = Pattern.compile(reg);
    Matcher m1 = p1.matcher(test);
    while (m1.find()) {
        System.out.println("Match result: " + m1.group(0));
    }

Output:

    Text: 61762828 176 2991 871
    Greedy pattern: \d{3,6}
    Match result: 617628
    Match result: 176
    Match result: 2991
    Match result: 871

The results show that for the "61762828" segment, the first three digits (617) would already have been a successful match, but the quantifier was not satisfied and matched the maximum it could, 6 digits.
That covers a single greedy quantifier.
Someone will ask: when several greedy quantifiers come together, how do they divide up the match?

When several greedy quantifiers sit together and the string can satisfy each of them to its maximum, they do not interfere with one another. When it cannot, priority goes from left to right: the leftmost quantifier takes as many characters as it can while still letting the whole expression match, and what remains goes to the quantifiers after it.

    String reg = "(\\d{1,2})(\\d{3,4})";
    String test = "61762828 176 2991 87321";
    System.out.println("Text: " + test);
    System.out.println("Greedy pattern: " + reg);
    Pattern p1 = Pattern.compile(reg);
    Matcher m1 = p1.matcher(test);
    while (m1.find()) {
        System.out.println("Match result: " + m1.group(0));
    }

Output:

    Text: 61762828 176 2991 87321
    Greedy pattern: (\d{1,2})(\d{3,4})
    Match result: 617628
    Match result: 2991
    Match result: 87321

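  1. "617628": the leading \d{1,2} matches 61 and the trailing \d{3,4} matches 7628
  2. "2991": the leading \d{1,2} first takes 29, but the remaining 91 is too short for \d{3,4}, so it backtracks to 2 and the trailing \d{3,4} matches 991
  3. "87321": the leading \d{1,2} matches 87 and the trailing \d{3,4} matches 321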
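
To check how each match was divided between the two quantifiers, the groups can be printed separately. A minimal sketch, with GreedySplitDemo and splits as made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedySplitDemo {
    // For every match, records "group1|group2" to show who took which digits.
    static List<String> splits(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1) + "|" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String test = "61762828 176 2991 87321";
        System.out.println(splits("(\\d{1,2})(\\d{3,4})", test));
        // [61|7628, 2|991, 87|321]
    }
}
```

In "2991" the first quantifier ends up with a single digit, because \d{3,4} needs at least three.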

2. Lazy (non-greedy)

Lazy matching: when a regex contains a quantifier that accepts a variable number of repetitions, lazy mode matches as few characters as possible, provided the whole expression can still match. This is called lazy matching.
Characteristics: it works from left to right, starting at the left end of the string. It first tries to match without taking any further character; if that succeeds, the match is done. Otherwise it takes one more character and tries again, repeating this cycle until the match succeeds or the string runs out.

A lazy quantifier is a greedy quantifier with a "?" added after it:

    Code     Explanation
    *?       Repeat any number of times, but as few as possible
    +?       Repeat one or more times, but as few as possible
    ??       Repeat 0 or 1 times, but as few as possible
    {n,m}?   Repeat n to m times, but as few as possible
    {n,}?    Repeat n or more times, but as few as possible
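
A quick way to see the difference is to run the greedy and the lazy form of one quantifier over the same text. The HTML-like sample string and the names LazyVsGreedyDemo and findAll are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedyDemo {
    // Returns every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: .+ runs to the end, then backtracks to the last '>'.
        System.out.println(findAll("<.+>", html));  // [<b>bold</b>]
        // Lazy: .+? stops at the first '>' it can reach.
        System.out.println(findAll("<.+?>", html)); // [<b>, </b>]
    }
}
```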

    String reg = "(\\d{1,2}?)(\\d{3,4})";
    String test = "61762828 176 2991 87321";
    System.out.println("Text: " + test);
    System.out.println("Lazy pattern: " + reg);
    Pattern p1 = Pattern.compile(reg);
    Matcher m1 = p1.matcher(test);
    while (m1.find()) {
        System.out.println("Match result: " + m1.group(0));
    }

Output:

    Text: 61762828 176 2991 87321
    Lazy pattern: (\d{1,2}?)(\d{3,4})
    Match result: 61762
    Match result: 2991
    Match result: 87321

Explanation:

"61762": the lazy left side matches 6, the greedy right side matches 1762
"2991": the lazy left side matches 2, the greedy right side matches 991
"87321": the lazy left side matches 8, the greedy right side matches 7321

5. Negation

The metacharacters covered earlier all say what to match. If you want the opposite, that is, to exclude certain characters, regex also provides some common negated metacharacters:

    Metacharacter  Explanation
    \W             Matches any character that is not a letter, digit, underscore, or Chinese character (in Java, \W defaults to [^a-zA-Z_0-9], so it does match Chinese characters)
    \S             Matches any character that is not whitespace
    \D             Matches any non-digit character
    \B             Matches a position that is not the start or end of a word
    [^x]           Matches any character except x
    [^aeiou]       Matches any character except the letters a, e, i, o, u
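
A few of these in action, as a rough sketch. The helper names (NegationDemo, digitsOnly, words, vowelsOnly) and the sample inputs are assumptions made for the example.

```java
import java.util.Arrays;

class NegationDemo {
    // \D matches every non-digit, so removing those leaves only the digits.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // \W+ matches runs of non-word characters, which makes it a word separator.
    static String[] words(String s) {
        return s.split("\\W+");
    }

    // [^aeiou] matches everything except the five vowels.
    static String vowelsOnly(String s) {
        return s.replaceAll("[^aeiou]", "");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel: 020-85653333"));             // 02085653333
        System.out.println(Arrays.toString(words("regex, made easy!"))); // [regex, made, easy]
        System.out.println(vowelsOnly("regular"));                       // eua
    }
}
```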

Origin www.cnblogs.com/z1201-x/p/11422070.html