On a whim ---- a walkthrough of regular expressions

I came across some old notes of mine. Rather than throw them away, I tidied them up.

I may not come back to read them often, and quite possibly nobody else will either.

What regular expressions do:

They match and search for patterns in text. The patterns can be simple or complex.

 

Regular Expressions practice website: https://www.regexpal.com/

The 15 metacharacters: characters that have a special meaning in a regex

.^$*+?|(){}[]\-

The hyphen - is a metacharacter only inside a character set, where it indicates a range; anywhere else it is an ordinary character.

 

1. Literal string matching:

To match a specific piece of text, write that text as it is. For example, to match OK, the regex is simply OK.
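As a small illustration of literal matching, here is a sketch using Python's re module. The sample sentence is made up for the example; the lowercase "ok" is included only to show that a literal match is case-sensitive by default.

```python
import re

# Hypothetical sample sentence, chosen only for illustration.
text = "Is it OK? ok."

# The regex OK matches exactly the two characters O and K.
first = re.search(r"OK", text)
all_hits = re.findall(r"OK", text)

print(first.span())   # position of the first match
print(all_hits)       # the lowercase "ok" is not included
```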

 

2. Alternation

(pattern1|pattern2|pattern3|...) expresses a choice: the match succeeds if any one of the alternatives matches.
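A minimal sketch of alternation in Python. The sentence and the three animal names are assumptions made up for the example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "I have a cat, a dog and a bird."

# The group succeeds wherever any one of the alternatives matches.
animals = re.findall(r"(cat|dog|fish)", text)

print(animals)  # "fish" and "bird" do not appear: one is absent, the other is not an alternative
```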

 

3. Negated matching:

[^...] matches any single character that is not one of the characters listed inside the brackets

 

4. Character sets for matching digits:

A character set is a group of characters written inside [], for example:

[0-9] matches any digit 0, 1, 2 ... 9 in the text

[0135] matches any of the four digits 0, 1, 3, 5 in the text
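The character sets above, together with the negated form [^...], can be tried out with a short Python sketch. The sample string is hypothetical.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Room 305, floor 19"

digits = re.findall(r"[0-9]", text)          # every single digit
subset = re.findall(r"[0135]", text)         # only the digits 0, 1, 3 or 5
non_digits = re.findall(r"[^0-9 ]+", text)   # runs of characters that are neither digits nor spaces

print(digits)
print(subset)
print(non_digits)
```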

 

5. Operations on character sets (Java syntax)

[0-3[6-9]] is the union of [0-3] and [6-9]

[a-z&&[^m-r]] matches the letters a to z except m through r

 

6. Character shorthands:

\d is equivalent to [0-9]; it matches any single digit

\D is equivalent to [^0-9]; it matches any single non-digit character, including spaces and punctuation (quotation marks, hyphens, slashes, brackets) and similar characters

\w is equivalent to [_a-zA-Z0-9]; it matches any single word character, that is, letters, digits and the underscore

\W is equivalent to [^_a-zA-Z0-9], that is, the negation of \w

[\b] matches a backspace character

\s matches a whitespace character, such as a space, line break, tab, carriage return, form feed ...

\S matches a non-whitespace character

\c matches a control character

\t matches a tab

\r matches a carriage return

\n matches a newline

[^\t\n\r] matches any character other than a tab, newline or carriage return
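A brief Python sketch of the most common shorthands. One assumption to note: in Python 3, \d and \w also match non-ASCII digits and letters on str patterns, so the re.ASCII flag is passed here to get the [0-9] and [_a-zA-Z0-9] behaviour described above. The sample string is made up.

```python
import re

# Hypothetical sample text containing a tab and a newline.
text = "id_42:\tok\n"

# re.ASCII restricts \d, \w and \W to the ASCII ranges listed in the notes.
digits = re.findall(r"\d", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
non_word = re.findall(r"\W", text, flags=re.ASCII)
spaces = re.findall(r"\s", text)
non_space = re.findall(r"\S+", text)

print(digits, words, non_word, spaces, non_space)
```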

 

7. Matching any single character:

A single dot . matches any single character (by default, any character except a newline)
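The dot can be sketched in Python as follows. The sample text is hypothetical; it contains one newline to show the default exception, and re.DOTALL is the Python flag that lifts it.

```python
import re

# Hypothetical sample text; "c\nt" has a newline between c and t.
text = "cat cot c\nt cut"

default_dot = re.findall(r"c.t", text)                # the dot skips the newline
dotall = re.findall(r"c.t", text, flags=re.DOTALL)    # the dot now matches the newline too

print(default_dot)
print(dotall)
```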

 

8. Position anchors:

^ indicates the beginning

$ indicates the end

\b marks a word boundary, either before or after a word. It is a zero-width assertion: on the surface it seems to match a space or the start of a line, but in fact it matches a zero-width position and consumes no characters

\B matches a non-word boundary, that is, any position that is not a word boundary. For example, with the text tttt and the regex \Bt\B, the two middle t's are matched
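The anchors can be checked with a short Python sketch. The sample strings other than tttt are assumptions made up for the example.

```python
import re

# ^ and $ tie the match to the beginning and the end of the text.
starts_with_cat = re.search(r"^cat", "cat and dog") is not None
cat_not_at_start = re.search(r"^cat", "a cat") is not None
ends_with_dog = re.search(r"dog$", "cat and dog") is not None

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \b is zero-width: the match is an empty span at position 0.
boundary_span = re.search(r"\b", "hi").span()

# \Bt\B on tttt matches only the two middle t's (indexes 1 and 2).
middle_positions = [m.start() for m in re.finditer(r"\Bt\B", "tttt")]

print(starts_with_cat, cat_not_at_start, ends_with_dog)
print(whole_words, boundary_span, middle_positions)
```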

 

6. Capturing groups and back-references:

() The part of the regular expression inside parentheses is captured; the order of the () groups is crucial

\1 or $1 refers to the text captured by the first () group, and likewise for the second, third, fourth and so on

Note: groups can be nested. Nested or not, groups are numbered linearly: the outer group comes first, and the groups nested inside it follow in order
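A sketch of capturing and back-referencing in Python. Note that Python writes the back-reference as \1 both inside the pattern and in the replacement string; the $1 form belongs to other tools. The sample strings are hypothetical.

```python
import re

# \1 inside the pattern: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# Back-references in a replacement: reorder year-month-day into day/month/year.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2020-02-18")

# Nested groups are numbered by their opening parenthesis: outer first, then inner.
nested = re.match(r"((\d+)-(\d+))", "10-20").groups()

print(doubled.group(0), doubled.group(1))
print(reordered)
print(nested)
```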

 

9. Using quantifiers

{number1,number2} The numbers inside the braces give the range of how many times the preceding item may occur. Braces containing numbers are a quantifier. For example, \d{2,5} matches 2 to 5 digit characters. {3,} means at least three times, and {,3} means 0 to 3 times (where the flavor accepts it; {0,3} is the portable form)

? means the single character or expression before it is matched "0 or 1" times

+ means the single character or expression before it is matched "one or more" times

* means the single character or expression before it is matched "0 or more" times
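The quantifiers above can be tried with a short Python sketch. All sample strings are made up; Python's re is one of the flavors that accepts the {,3} form.

```python
import re

# {2,5}: between two and five digits; a lone digit is skipped, a long run is cut at five.
ranged = re.findall(r"\d{2,5}", "7 42 123456 2020")

# ?: the u is optional (0 or 1 times).
optional = re.findall(r"colou?r", "color colour colouur")

# +: one or more b; *: zero or more b.
one_or_more = re.findall(r"ab+", "a ab abbb")
zero_or_more = re.findall(r"ab*", "a ab abbb")

# {3,}: at least three; {,3}: at most three.
at_least_three = re.fullmatch(r"\d{3,}", "12345") is not None
too_short = re.fullmatch(r"\d{3,}", "12") is not None
at_most_three = re.fullmatch(r"a{,3}", "aa") is not None

print(ranged, optional, one_or_more, zero_or_more)
print(at_least_three, too_short, at_most_three)
```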

 

10. Greedy, lazy and possessive quantifiers

Greedy:

Quantifiers are greedy by default. A greedy quantifier first tries to match the entire string: when it attempts a match, it takes as much content as it can, that is, the whole input.

The quantifier first attempts to match the whole string; if that fails, it backs up one character and tries again. This process is called backtracking.

It backs up one character at a time until it finds content that matches or there are no characters left to try.

Because it also keeps track of every step it has taken, it consumes the most resources of the three approaches.

Lazy:

A lazy (sometimes called reluctant) quantifier uses a different strategy. It starts at the beginning of the target and tries to find a match, examining the string one character at a time as it looks for matching content.

Only as a last resort does it attempt to match the entire string. To make a quantifier lazy, append a question mark (?) to the ordinary quantifier. It "eats" a little at a time.

Possessive:

A possessive quantifier grabs the entire target and then tries to find a match, but it tries only once and never backtracks.

A possessive quantifier is formed by appending a plus sign (+) to an ordinary quantifier. It does not "chew"; it "swallows" everything first and only afterwards wonders what it "ate".
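The difference between the three modes shows up clearly on a tag-like string. This is a sketch in Python with a made-up sample; possessive quantifiers are only available in Python's re from version 3.11, so that part is guarded by a version check and is skipped on older interpreters.

```python
import re
import sys

# Hypothetical sample text, chosen only for illustration.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the end of the text, then backtracks to the last ">".
greedy = re.findall(r"<.+>", text)

# Lazy: .+? grows one character at a time and stops at the first ">".
lazy = re.findall(r"<.+?>", text)

# Possessive: .++ swallows the rest of the text and never gives the final ">" back,
# so the ">" in the pattern has nothing left to match and there is no match at all.
if sys.version_info >= (3, 11):
    possessive = re.findall(r"<.++>", text)
else:
    possessive = None  # not supported by this interpreter

print(greedy)
print(lazy)
print(possessive)
```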

 

11. Turning metacharacters into literals

Escaping: use \ to turn a single metacharacter into an ordinary character

\Q and \E: all the characters between \Q and \E are treated as ordinary characters
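A sketch of escaping in Python. Python's re module does not support \Q...\E, so re.escape is used here as the closest equivalent for quoting a whole string; that substitution is a choice made for this example, not something from the notes. The sample strings are hypothetical.

```python
import re

# Hypothetical sample text, chosen only for illustration.
price = "cost: $5.00 (approx.)"

# Unescaped, the dot matches any character, so "5.00" also matches "5x00".
loose = re.search(r"5.00", "5x00") is not None

# Escaped, $ and . are taken literally.
exact = re.search(r"\$5\.00", price).group()

# re.escape quotes every metacharacter in a string, similar in effect to \Q...\E.
literal = "(approx.)"
escaped_pattern = re.escape(literal)
quoted_hit = re.search(escaped_pattern, price).group()

print(loose, exact, escaped_pattern, quoted_hit)
```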

 

12. Regular expression options (an option only affects the part of the pattern that follows it)

Option    Description                                  Supported platforms    Example

(?d)      Unix lines                                   Java

(?i)      Case-insensitive                             PCRE, Perl, Java       (?i)the matches the, ignoring case

(?J)      Allow duplicate names                        PCRE*

(?m)      Multi-line                                   PCRE, Perl, Java

(?s)      Single-line (dotall)                         PCRE, Perl, Java

(?u)      Unicode                                      Java, PCRE

(?U)      Shortest (lazy) match by default             PCRE

(?x)      Ignore whitespace and comments               PCRE, Perl, Java

(?-x)     Turn off an x option that was turned on      PCRE
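Four of the options in the table, (?i), (?m), (?s) and (?x), also work as inline flags in Python's re module, which makes them easy to try. The sketch below covers only those four; the sample strings are made up, and each inline flag is placed at the very start of the pattern, which recent Python versions require.

```python
import re

# (?i): ignore case.
ignore_case = re.findall(r"(?i)the", "The other THE")

# (?m): ^ matches at the start of every line, not only at the start of the text.
line_starts = re.findall(r"(?m)^\w+", "one two\nthree four")

# (?s): the dot also matches a newline.
dot_without_s = re.search(r"a.b", "a\nb") is not None
dot_with_s = re.search(r"(?s)a.b", "a\nb") is not None

# (?x): whitespace and # comments inside the pattern are ignored.
verbose_pattern = r"""(?x)
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
"""
verbose_ok = re.fullmatch(verbose_pattern, "555-1234") is not None

print(ignore_case, line_starts, dot_without_s, dot_with_s, verbose_ok)
```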

 

13. Lookahead and lookbehind (lookaround)

Positive lookahead: match two only when it is followed by " three", case-insensitive

(?i)two(?= three)

Negative lookahead: match two only when it is not followed by " three", case-insensitive

(?i)two(?! three)

Positive lookbehind: match three only when it is preceded by "two ", case-insensitive

(?i)(?<=two )three

Negative lookbehind: match three only when it is not preceded by "two ", case-insensitive

(?i)(?<!two )three
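The four lookaround patterns above can be run unchanged in Python. The two sample strings are assumptions made up for the example; mixed casing is used so that the effect of (?i) is visible in the results.

```python
import re

# Hypothetical sample texts, chosen only for illustration.
text_a = "Two three, two four, TWO three"
text_b = "two three, one three"

# Lookahead checks what follows "two" without including it in the match.
followed = re.findall(r"(?i)two(?= three)", text_a)
not_followed = re.findall(r"(?i)two(?! three)", text_a)

# Lookbehind checks what precedes "three"; start indexes show which "three" matched.
preceded = [m.start() for m in re.finditer(r"(?i)(?<=two )three", text_b)]
not_preceded = [m.start() for m in re.finditer(r"(?i)(?<!two )three", text_b)]

print(followed, not_followed)
print(preceded, not_preceded)
```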

 


Origin blog.csdn.net/kxindouhao5491/article/details/104384502