An Introductory Guide to Java Regular Expressions

Introduction
A regular expression (regex) describes a pattern for matching strings. It can be used to: (1) check whether a string contains a substring that matches a given rule, and retrieve that substring; (2) perform flexible string replacements based on matching rules.
Regular expressions are actually quite simple to learn, and the few more abstract concepts are easy to understand. Many people find regular expressions complicated for two reasons. First, most documentation does not progress from simple to advanced and pays no attention to the order in which concepts are introduced, which makes it hard for readers to follow. Second, the documentation shipped with each engine usually concentrates on that engine's unique features, yet those features are not what we need to understand first.
In the original article, every example linked to a test page where it could be tried out. Without further ado, let's begin.
1. Regular expression rules
1.1 Ordinary characters
Letters, digits, Chinese characters, underscores, and any punctuation marks not given a special definition in later sections are all "ordinary characters". When matching a string, an ordinary character in an expression matches one identical character.
Example 1: the expression "c", when matching the string "abcde", gives: success; matched content: "c"; matched position: starts at 2, ends at 3. (Note: whether indexes start from 0 or from 1 may differ depending on the programming language.)
Example 2: the expression "bcd", when matching the string "abcde", gives: success; matched content: "bcd"; matched position: starts at 1, ends at 4.
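As a rough illustration in Java, the two examples above can be reproduced with java.util.regex. This is only a minimal sketch: the class name OrdinaryCharDemo and the helper firstMatch are made up for this illustration and are not part of the original article. Java counts positions from 0 and reports the end position as the index just past the match, which agrees with the positions quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper class; the names are assumptions, not from the article.
class OrdinaryCharDemo {
    // Returns "content [start,end)" for the first match, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("c", "abcde"));   // c [2,3)
        System.out.println(firstMatch("bcd", "abcde")); // bcd [1,4)
        System.out.println(firstMatch("xyz", "abcde")); // no match
    }
}
```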
1.2 Simple escape characters
Some characters are inconvenient to write directly, so they are written with a preceding "\". These are characters we are in fact already familiar with.
Expression    Matches
\r, \n        carriage return and line feed
\t            tab
\\            "\" itself
Some other punctuation marks have special uses described in later sections; when preceded by "\", they stand for the symbol itself. For example, "^" and "$" have special meanings, so to match the "^" and "$" characters in a string, the expression must be written as "\^" and "\$".
Expression    Matches
\^            the ^ symbol itself
\$            the $ symbol itself
\.            the decimal point (.) itself
These escaped characters are matched much like "ordinary characters": each matches one identical character.
Example 1: the expression "\$d", when matching the string "abc$de", gives: success; matched content: "$d"; matched position: starts at 3, ends at 5.
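One Java-specific point is worth a small sketch: inside a Java string literal the backslash itself must be doubled, so the regular expression \$d is written "\\$d" in source code. The class EscapeDemo and its helper below are hypothetical names used only for this illustration; Pattern.quote is a standard java.util.regex method that escapes a whole literal at once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class EscapeDemo {
    // Returns "content [start,end)" for the first match, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        // Regex \$d, written with a doubled backslash in Java source.
        System.out.println(firstMatch("\\$d", "abc$de"));  // $d [3,5)
        // Regex \^ and \. match the literal symbols.
        System.out.println(firstMatch("\\^", "2^10"));     // ^ [1,2)
        System.out.println(firstMatch("\\.", "12.5"));     // . [2,3)
        // Regex \\ matches one backslash; the input here is a\b.
        System.out.println(firstMatch("\\\\", "a\\b"));    // \ [1,2)
        // Pattern.quote escapes every special character in a literal.
        System.out.println(firstMatch(Pattern.quote("$d"), "abc$de")); // $d [3,5)
    }
}
```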
1.3 Expressions that can match 'multiple kinds of characters'
Some notations in regular expressions can match any one character out of 'multiple kinds of characters'. For example, the expression "\d" matches any single digit. Although it can match any of these characters, it matches only one, not several. It is like the jokers in a card game: a joker can stand in for any card, but only for one card.
Expression    Matches
\d            any single digit, that is, any one of 0 to 9
\w            any single letter, digit, or underscore, that is, any one of A~Z, a~z, 0~9, _
\s            any single whitespace character, including space, tab, form feed, and so on
.             the decimal point matches any single character except a newline (\n)
Example 1: the expression "\d\d", when matching "abc123", gives: success; matched content: "12"; matched position: starts at 3, ends at 5.
Example 2: the expression "a.\d", when matching "aaa100", gives: success; matched content: "aa1"; matched position: starts at 1, ends at 4.
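The following Java sketch tries these classes out. The class PredefinedClassDemo and the helper findAll are hypothetical names introduced for this illustration. Note that in Java the dot, by default, excludes all line terminators (such as \r as well), not only \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class PredefinedClassDemo {
    // Collects the text of every successive match in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d\\d", "abc123")); // [12]
        System.out.println(findAll("a.\\d", "aaa100"));  // [aa1]
        System.out.println(findAll("\\w", "a_1 ?"));     // [a, _, 1]
        // One space and one tab are whitespace.
        System.out.println(findAll("\\s", "a b\tc").size()); // 2
        // The dot skips the newline between a and b.
        System.out.println(findAll(".", "a\nb"));        // [a, b]
    }
}
```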
1.4 Custom expressions that can match 'multiple kinds of characters'
Square brackets [ ] enclosing a series of characters match any one of those characters. [^ ] enclosing a series of characters matches any one character other than those characters. By the same token, although any one of them can be matched, only one is matched, not several.
Expression    Matches
[ab5@]        "a" or "b" or "5" or "@"
[^abc]        any single character other than "a", "b", "c"
[f-k]         any single letter between "f" and "k"
[^A-F0-3]     any single character other than "A"~"F" and "0"~"3"
Example 1: the expression "[bcd][bcd]", when matching "abc123", gives: success; matched content: "bc"; matched position: starts at 1, ends at 3.
Example 2: the expression "[^abc]", when matching "abc123", gives: success; matched content: "1"; matched position: starts at 3, ends at 4.
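A short Java sketch of the bracket expressions in the table; the class CharClassDemo, the helper findAll, and the sample inputs other than "abc123" are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class CharClassDemo {
    // Collects the text of every successive match in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[ab5@]", "x@5ba"));     // [@, 5, b, a]
        System.out.println(findAll("[bcd][bcd]", "abc123")); // [bc]
        System.out.println(findAll("[^abc]", "abc123"));     // [1, 2, 3]
        System.out.println(findAll("[f-k]", "hijklm"));      // [h, i, j, k]
        System.out.println(findAll("[^A-F0-3]", "AG04"));    // [G, 4]
    }
}
```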
1.5 Special symbols that modify the number of matches
The expressions covered in the previous sections, whether they match only one kind of character or any one of several kinds, can match only once. If an expression is followed by a special symbol that modifies the number of matches, it can match repeatedly without being written out repeatedly.
Usage: the "count modifier" is placed after the "expression being modified". For example, "[bcd][bcd]" can be written as "[bcd]{2}".
Expression    Effect
{n}           the expression repeats n times; for example, "\w{2}" is equivalent to "\w\w", and "a{5}" is equivalent to "aaaaa"
{m,n}         the expression repeats at least m times and at most n times; for example, "ba{1,3}" matches "ba", "baa", or "baaa"
{m,}          the expression repeats at least m times; for example, "\w\d{2,}" matches "a12", "_456", "M12344" ...
?             the expression matches 0 or 1 time, equivalent to {0,1}; for example, "a[cd]?" matches "a", "ac", "ad"
+             the expression appears at least once, equivalent to {1,}; for example, "a+b" matches "ab", "aab", "aaab" ...
*             the expression appears zero or any number of times, equivalent to {0,}; for example, "\^*b" matches "b", "^^^b" ...
Example 1: the expression "\d+\.?\d*", when matching "It costs $12.5", gives: success; matched content: "12.5"; matched position: starts at 10, ends at 14.
Example 2: the expression "go{2,8}gle", when matching "Ads by goooooogle", gives: success; matched content: "goooooogle"; matched position: starts at 7, ends at 17.
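The quantifiers above can be checked with a small Java sketch. Pattern.matches (a standard static method) tests whether the whole input matches, which suits the table entries, while find locates a match inside a longer string as in the two examples. The class QuantifierDemo and its helper are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class QuantifierDemo {
    // Returns "content [start,end)" for the first match, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        // Whole-string checks for the table entries.
        System.out.println(Pattern.matches("[bcd]{2}", "bc"));      // true
        System.out.println(Pattern.matches("ba{1,3}", "baaa"));     // true
        System.out.println(Pattern.matches("ba{1,3}", "baaaa"));    // false
        System.out.println(Pattern.matches("\\w\\d{2,}", "M12344")); // true
        System.out.println(Pattern.matches("a[cd]?", "a"));         // true
        System.out.println(Pattern.matches("a+b", "b"));            // false
        System.out.println(Pattern.matches("\\^*b", "^^^b"));       // true

        // Searching inside a longer string.
        System.out.println(firstMatch("\\d+\\.?\\d*", "It costs $12.5"));  // 12.5 [10,14)
        System.out.println(firstMatch("go{2,8}gle", "Ads by goooooogle")); // goooooogle [7,17)
    }
}
```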
1.6 Other special symbols with abstract meanings
Some symbols carry abstract special meanings in an expression:
Expression    Role
^             matches the position where the string begins; matches no characters
$             matches the position where the string ends; matches no characters
\b            matches a word boundary, that is, the position between a word and a space; matches no characters
A further description in words would still be rather abstract, so here are some examples to help.
Example 1: the expression "^aaa", when matching "xxx aaa xxx", gives: failure. Because "^" must match at the position where the string begins, "^aaa" can match only when "aaa" is at the beginning of the string, as in "aaa xxx xxx".
Example 2: the expression "aaa$", when matching "xxx aaa xxx", gives: failure. Because "$" must match at the position where the string ends, "aaa$" can match only when "aaa" is at the end of the string, as in "xxx xxx aaa".
Example 3: the expression ".\b.", when matching "@@@abc", gives: success; matched content: "@a"; matched position: starts at 2, ends at 4.
Further explanation: "\b", like "^" and "$", matches no characters itself, but it requires that, at its position in the match, one side be in the "\w" range and the other side be outside the "\w" range.
Example 4: the expression "\bend\b", when matching "weekend,endfor,end", gives: success; matched content: "end"; matched position: starts at 15, ends at 18.
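The four examples above, sketched in Java. The class AnchorDemo and the helper firstMatch are hypothetical names used only here. The positions printed are zero-based, with the end index just past the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class AnchorDemo {
    // Returns "content [start,end)" for the first match, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("^aaa", "xxx aaa xxx")); // no match
        System.out.println(firstMatch("^aaa", "aaa xxx xxx")); // aaa [0,3)
        System.out.println(firstMatch("aaa$", "xxx aaa xxx")); // no match
        System.out.println(firstMatch("aaa$", "xxx xxx aaa")); // aaa [8,11)
        // The boundary sits between the last "@" and the "a".
        System.out.println(firstMatch(".\\b.", "@@@abc"));     // @a [2,4)
        // Only the final, stand-alone "end" has a boundary on both sides.
        System.out.println(firstMatch("\\bend\\b", "weekend,endfor,end")); // end [15,18)
    }
}
```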
Some symbols affect the relationship between the sub-expressions within an expression:
Expression    Role
|             an "or" relationship between the expressions on its left and right: matches the left side or the right side
( )           (1) when a count modifier is applied, the expression in parentheses is modified as a whole
              (2) when retrieving the match result, the content matched by the expression in parentheses can be obtained separately
Example 5: Expression "
Example 6: the expression "(go\s*)+", when matching "Let's go go go!", gives: success; matched content: "go go go"; matched position: starts at 6, ends at 14.
Example 7: the expression "¥(\d+\.?\d*)", when matching "$10.9,¥20.5", gives: success; matched content: "¥20.5"; matched position: starts at 6, ends at 11. The content matched by the parenthesized part, retrieved separately, is "20.5".
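A minimal Java sketch of alternation and grouping follows. The class GroupDemo, the helper first, and the "cat|dog" pattern with its input are assumptions invented for this illustration; the other two patterns are the ones from Examples 6 and 7. The yen sign is written as the Unicode escape \u00A5 so the source does not depend on file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class GroupDemo {
    // Returns a matcher positioned on the first match, or null if none.
    static Matcher first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m : null;
    }

    public static void main(String[] args) {
        // Alternation (hypothetical pattern): left side or right side.
        Matcher a = first("cat|dog", "hot dog");
        System.out.println(a.group()); // dog

        // Parentheses let "+" repeat the whole group.
        Matcher b = first("(go\\s*)+", "Let's go go go!");
        System.out.println(b.group() + " " + b.start() + " " + b.end()); // go go go 6 14

        // group(1) retrieves only what the parentheses matched.
        Matcher c = first("\u00A5(\\d+\\.?\\d*)", "$10.9,\u00A520.5");
        System.out.println(c.group(1) + " " + c.start() + " " + c.end()); // 20.5 6 11
    }
}
```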
2. Advanced rules in regular expressions
2.1 Greedy and non-greedy matching counts
When using the special symbols that modify the number of matches, several notations allow the same expression to match a varying number of times, such as "{m,n}", "{m,}", "?", "*", "+"; the actual number of matches depends on the string being matched. During matching, such an expression with a variable repetition count always matches as much as possible. For example, for the text "dxxxdxxxd":
Expression      Match result
(d)(\w+)        "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d)     "\w+" matches all the characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it gives up that final "d" so that the whole expression can match
Thus "\w+" always matches as many characters conforming to its rule as possible. Although in the second example it does not match the last "d", that is only so that the whole expression can match successfully. Likewise, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?" that may or may not match will match if it can. This matching principle is called "greedy" mode.
Non-greedy mode:
Adding a "?" after a special symbol that modifies the number of matches makes an expression with a variable count match as little as possible, and makes an expression that may or may not match prefer not to match. This matching principle is called "non-greedy" mode, also known as "reluctant" mode. If matching less would cause the whole expression to fail, then, much as in greedy mode, non-greedy mode matches the minimum extra needed for the whole expression to succeed. For example, for the text "dxxxdxxxd":
Expression      Match result
(d)(\w+?)       "\w+?" matches as few characters after the first "d" as possible; the result is that "\w+?" matches only one "x"
(d)(\w+?)(d)    for the whole expression to match, "\w+?" has to match "xxx" so that the following "d" can match; the result is that "\w+?" matches "xxx"

Example 1: the expression "<td>(.*)</td>", when matching the string "<td><p>aa</p></td> <td><p>bb</p></td>", gives: success; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", because the "</td>" in the expression matches the last "</td>" in the string.
Example 2: by contrast, the expression "<td>(.*?)</td>", when matching the same string as in Example 1, yields only "<td><p>aa</p></td>"; matching again yields the second one, "<td><p>bb</p></td>".
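The difference between the two modes is easy to see in a small Java sketch. The class GreedyDemo and its helpers group and count are hypothetical names for this illustration; the patterns and inputs are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class GreedyDemo {
    // Returns capture group n of the first match, or null if no match.
    static String group(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(n) : null;
    }

    // Counts how many successive matches the input contains.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "dxxxdxxxd";
        System.out.println(group("(d)(\\w+)", text, 2));      // xxxdxxxd
        System.out.println(group("(d)(\\w+)(d)", text, 2));   // xxxdxxx
        System.out.println(group("(d)(\\w+?)", text, 2));     // x
        System.out.println(group("(d)(\\w+?)(d)", text, 2));  // xxx

        String html = "<td><p>aa</p></td> <td><p>bb</p></td>";
        System.out.println(count("<td>(.*)</td>", html));     // 1 (whole string)
        System.out.println(count("<td>(.*?)</td>", html));    // 2 (one per cell)
        System.out.println(group("<td>(.*?)</td>", html, 1)); // <p>aa</p>
    }
}
```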
2.2 Backreferences \1, \2 ...
When an expression is matched, the engine records the string matched by each sub-expression enclosed in parentheses "( )". When retrieving the match result, the string matched by a parenthesized sub-expression can be obtained separately. This has already been shown several times in the earlier examples. In practice, when some kind of boundary is used for searching but the content to be retrieved must not include that boundary, parentheses must be used to specify the desired range, as in the earlier "<td>(.*?)</td>".
In fact, "the string matched by a parenthesized sub-expression" can be used not only after matching has finished but also during matching. A later part of the expression can refer to "the string already matched by an earlier sub-match in parentheses". The reference is written as "\" plus a number: "\1" refers to the string matched by the first pair of parentheses, "\2" to the string matched by the second pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, pairs are numbered in the order in which their left parenthesis "(" appears.
Examples:
Example 1: the expression "('|")(.*?)(\1)", when matching the text 'Hello', "World", gives: success; matched content: 'Hello'. Matching again yields "World".
Example 2: the expression "(\w)\1{4,}", when matching "aa bbbb abcdefg ccccc 111121111 999999999", gives: success; matched content: "ccccc". Matching again yields 999999999. This expression requires one character in the "\w" range to be repeated at least 5 times; note the difference from "\w{5,}".
Example 3: the expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>", when matching "<td id='td1' style="bgcolor:white"></td>", gives: success. If "<td>" and "</td>" are not a matching pair, the match fails; if both are changed to another matching pair, the match also succeeds.
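Backreferences can be tried in Java as sketched below. The class BackrefDemo and the helper findAll are hypothetical names; the tag-matching pattern is the reading of Example 3 used in this text, so treat it as an illustration rather than a robust HTML parser. The comparison with "\w{5,}" shows that the latter accepts any five word characters, not five copies of the same one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class BackrefDemo {
    // Collects the text of every successive match in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \1 must repeat whichever quote character group 1 matched.
        System.out.println(findAll("('|\")(.*?)(\\1)", "'Hello', \"World\""));
        // ['Hello', "World"]

        String s = "aa bbbb abcdefg ccccc 111121111 999999999";
        System.out.println(findAll("(\\w)\\1{4,}", s)); // [ccccc, 999999999]
        System.out.println(findAll("\\w{5,}", s));
        // [abcdefg, ccccc, 111121111, 999999999]

        // The closing tag must repeat the name captured by group 1.
        String tag = "<(\\w+)\\s*(\\w+(=('|\").*?\\4)?\\s*)*>.*?</\\1>";
        System.out.println(Pattern.matches(tag,
                "<td id='td1' style=\"bgcolor:white\"></td>")); // true
        System.out.println(Pattern.matches(tag,
                "<td id='td1' style=\"bgcolor:white\"></tr>")); // false
    }
}
```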
2.3 Lookahead, which matches no characters; lookbehind, which matches no characters
The previous chapter covered several special symbols with abstract meanings: "^", "$", "\b". They have one thing in common: they match no characters themselves, but attach a condition to "the two ends of the string" or to "the gap between characters". With that concept understood, this section introduces another way of attaching conditions to the "two ends" or to a "gap".

Lookahead: "(?=xxxxx)", "(?!xxxxx)"
Format "(?=xxxxx)": in the string being matched, the condition it attaches to the "gap" or "end" where it sits is that the right side of that gap must be able to match the expression xxxxx. Because it acts only as an additional condition on the gap, it does not prevent the rest of the expression from actually matching the characters after the gap. This is similar to "\b", which matches no characters itself: "\b" merely examines the characters before and after the gap and does not affect what the rest of the expression actually matches.
Example 1: the expression "Windows (?=NT|XP)", when matching "Windows 98, Windows NT, Windows 2000", matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.
Example 2: the expression "(\w)((?=\1\1\1)(\1))+", when matching the string "aaa ffffff 999999999", matches the first 4 of the 6 "f" characters and the first 7 of the 9 "9" characters. The expression can be read as: for a letter or digit repeated 4 or more times, match the part before its last 2 characters. Of course, the expression need not be written this way; it is written like this here for demonstration purposes.
Format "(?!xxxxx)": the right side of the gap must not be able to match the expression xxxxx.
Example 3: the expression "((?!\bstop\b).)+", when matching "fdjka ljfdl stop fjdsla fdj", matches from the beginning up to the position just before "stop".

Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"
The concepts behind these two formats are similar to lookahead. The condition lookbehind imposes concerns the "left side" of the gap: the two formats respectively require that the left side must match, or must not match, the specified expression, rather than examining the right side. As with lookahead, both are additional conditions on the gap where they sit and match no characters themselves.
Example 5: the expression "(?<=\d{4})\d+(?=\d{4})", when matching "1234567890123456", matches the middle 8 digits, excluding the first 4 and the last 4 digits. Because JScript.RegExp does not support lookbehind, this example could not be demonstrated on the original test page. Many other engines do support lookbehind, such as the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, and the DEELX regular expression engine, which the original site recommends as the easiest to use.
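Since java.util.regex supports both lookahead and lookbehind, the examples in this section can be sketched in Java as follows. The class LookaroundDemo and the helper findAll are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class LookaroundDemo {
    // Collects the text of every successive match in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Positive lookahead: only the "Windows " followed by NT qualifies.
        System.out.println(findAll("Windows (?=NT|XP)",
                "Windows 98, Windows NT, Windows 2000").size()); // 1

        // Stops two characters before the end of each long run.
        System.out.println(findAll("(\\w)((?=\\1\\1\\1)(\\1))+",
                "aaa ffffff 999999999")); // [ffff, 9999999]

        // Negative lookahead: consume characters until the word "stop".
        System.out.println(findAll("((?!\\bstop\\b).)+",
                "fdjka ljfdl stop fjdsla fdj").get(0)); // "fdjka ljfdl " (with trailing space)

        // Lookbehind plus lookahead: the middle 8 of 16 digits.
        System.out.println(findAll("(?<=\\d{4})\\d+(?=\\d{4})",
                "1234567890123456")); // [56789012]
    }
}
```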
3. Other general rules
Some rules are fairly common across regular expression engines but were not mentioned in the preceding explanation.
3.1 In an expression, "\xXX" and "\uXXXX" can be used to represent a character ("X" stands for a hexadecimal digit)
Form          Character range
\xXX          characters with codes in the range 0 to 255; for example, a space can be written as "\x20"
\uXXXX        any character can be written as "\u" followed by the four hexadecimal digits of its code

3.2 While "\s", "\d", "\w", and "\b" have special meanings in an expression, the corresponding uppercase letters have the opposite meanings
Expression    Matches
\S            any non-whitespace character ("\s" matches each whitespace character)
\D            any non-digit character
\W            any character other than letters, digits, and underscores
\B            a non-word boundary, that is, a gap between characters where both sides are in the "\w" range or both sides are outside the "\w" range
3.3 Summary of characters that have special meaning in an expression and need a preceding "\" to match the character itself
Character     Description
^             matches the start of the input string. To match the "^" character itself, use "\^"
$             matches the end of the input string. To match the "$" character itself, use "\$"
( )           mark the start and end of a sub-expression. To match parentheses, use "\(" and "\)"
[ ]           define a custom expression that can match 'multiple kinds of characters'. To match square brackets, use "\[" and "\]"
{ }           symbols that modify the number of matches. To match braces, use "\{" and "\}"
.             matches any character except a newline (\n). To match the decimal point itself, use "\."
?             modifies the number of matches to 0 or 1. To match the "?" character itself, use "\?"
+             modifies the number of matches to at least once. To match the "+" character itself, use "\+"
*             modifies the number of matches to 0 or any number of times. To match the "*" character itself, use "\*"
|             an "or" relationship between the expressions on its left and right. To match "|" itself, use "\|"
3.4 For a sub-expression in parentheses "( )", if you do not want its match result recorded for later use, use the "(?:xxxxx)" format
Example 1: the expression "(?:(\w)\1)+", when matching "a bbccdd efg", gives "bbccdd". The match result of the "(?:)" parentheses is not recorded, so "(\w)" is referenced with "\1".
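A small Java sketch makes the numbering visible. The class NonCapturingDemo and its helpers are hypothetical names; the comparison pattern "((\w)\2)+", where the outer parentheses do capture, is added here only for contrast and does not appear in the article. Matcher.groupCount is the standard method that reports how many capturing groups a pattern has.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class NonCapturingDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Number of capturing groups declared by the pattern.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String text = "a bbccdd efg";
        // (?:...) groups without recording, so (\w) is group 1.
        System.out.println(firstMatch("(?:(\\w)\\1)+", text)); // bbccdd
        System.out.println(groups("(?:(\\w)\\1)+"));           // 1
        // With capturing outer parentheses, (\w) becomes group 2.
        System.out.println(firstMatch("((\\w)\\2)+", text));   // bbccdd
        System.out.println(groups("((\\w)\\2)+"));             // 2
    }
}
```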
3.5 A brief introduction to common expression properties: Ignorecase, Singleline, Multiline, Global
Property      Description
Ignorecase    By default, letters in an expression are case-sensitive. Setting Ignorecase makes matching case-insensitive. Some expression engines extend the notion of "case" to the Unicode range.
Singleline    By default, the decimal point "." matches any character except a newline (\n). Setting Singleline makes the decimal point match all characters, including newlines.
Multiline     By default, the expressions "^" and "$" match only the start ① and the end ④ of the string, as in:
              ①xxxxxxxxx②\n
              ③xxxxxxxxx④
              With Multiline set, "^" matches not only ① but also position ③, after a newline and before the start of the next line; "$" matches not only ④ but also position ②, before a newline at the end of a line.
Global        Mainly takes effect when the expression is used for replacement: setting Global means all matches are replaced.
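In Java these properties are expressed somewhat differently, so here is a hedged sketch of one plausible mapping: Ignorecase corresponds to Pattern.CASE_INSENSITIVE, Singleline to Pattern.DOTALL, and Multiline to Pattern.MULTILINE. Java has no Global flag; instead, replaceAll replaces every match and replaceFirst only the first. The mapping is this text's reading rather than something stated in the article, and the class FlagDemo with its helpers is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class FlagDemo {
    // True if the pattern, compiled with the given flags, is found in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    // Number of successive matches with the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Ignorecase
        System.out.println(found("java", 0, "JAVA"));                        // false
        System.out.println(found("java", Pattern.CASE_INSENSITIVE, "JAVA")); // true
        // Singleline: the dot also matches a newline
        System.out.println(found("a.b", 0, "a\nb"));              // false
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb")); // true
        // Multiline: ^ and $ also match at line breaks
        System.out.println(count("^x+$", 0, "xxx\nxx"));                 // 0
        System.out.println(count("^x+$", Pattern.MULTILINE, "xxx\nxx")); // 2
        // Global: replace first versus replace all
        System.out.println("a-b-c".replaceFirst("-", "+")); // a+b-c
        System.out.println("a-b-c".replaceAll("-", "+"));   // a+b+c
    }
}
```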
4. Other tips
4.1 To learn which more complex syntax the advanced regular expression engines also support, see the documentation of the DEELX engine on the original site.
4.2 If the expression must match the entire string rather than find a part of it, put "^" and "$" at the beginning and end of the expression. For example, "^\d+$" requires the whole string to consist only of digits.
4.3 If the matched content must be a complete word rather than part of a word, put "\b" at the beginning and end of the expression. For example, use "\b(if|while|else|void|int ...)\b" to match keywords in a program.
4.4 Do not let an expression match the empty string. Otherwise the match always succeeds while nothing is actually matched. For example, when writing an expression to match the forms "123", "123.", "123.5", and ".5", the integer part, the decimal point, and the fractional digits can each be omitted, but the expression should not be written as "\d*\.?\d*", because this expression also matches successfully when there is nothing at all. A better way to write it is "\d+\.?\d*|\.\d+".
4.5 A sub-match that can match the empty string should not be repeated an unlimited number of times. If every part of a sub-expression in parentheses can match 0 times, and the parentheses as a whole can in turn match an unlimited number of times, the situation may be more serious than in the previous tip: the matching process may loop forever. Although some regular expression engines, such as .NET regular expressions, now avoid the infinite loop in this situation, we should still try to avoid it. If an expression you write runs into an infinite loop, this is a good place to start looking for the cause.
4.6 Choose between greedy and non-greedy mode sensibly; see the related discussion topic.
4.7 For the "or" operator "|", it is best if, for a given character, only one of its two sides can match; that way the result does not change when the expressions on the two sides of "|" swap places.
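Tips 4.2 to 4.4 can be checked with a short Java sketch. The class TipsDemo, the helper findAll, and the sample program text "int winter = 1; if (x) print();" are assumptions made for this illustration; the keyword list is cut down to the five keywords the tip spells out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
class TipsDemo {
    // Collects the text of every successive match in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // 4.2: ^ and $ force the whole string to be digits.
        System.out.println(Pattern.compile("^\\d+$").matcher("12345").find()); // true
        System.out.println(Pattern.compile("^\\d+$").matcher("12a45").find()); // false

        // 4.3: \b keeps "int" inside "winter" and "print" from matching.
        System.out.println(findAll("\\b(if|while|else|void|int)\\b",
                "int winter = 1; if (x) print();")); // [int, if]

        // 4.4: the loose pattern accepts the empty string; the better one does not.
        String loose = "\\d*\\.?\\d*";
        String better = "\\d+\\.?\\d*|\\.\\d+";
        System.out.println(Pattern.matches(loose, ""));       // true
        System.out.println(Pattern.matches(better, ""));      // false
        System.out.println(Pattern.matches(better, "123."));  // true
        System.out.println(Pattern.matches(better, ".5"));    // true
        System.out.println(Pattern.matches(better, "."));     // false
    }
}
```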

Reproduced in: https: //www.cnblogs.com/521taobao/archive/2012/03/17/2402435.html

Origin blog.csdn.net/weixin_34348111/article/details/93355956