Foreword
Regular expressions are a great tool, but I had only ever used them blindly, without really understanding them. So I decided to study them systematically, with the aim of understanding them well enough to solve everyday problems at work and improve my efficiency.
Notes on "Regular Expressions Must-Know Essentials"
Getting Started
Matching any character
c.t
-> cat, cut, etc.
Matching a set of characters
[A-Za-z0-9]
-> any letter or digit
Negating a set
[^0-9]
-> any non-digit character
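The three patterns above can be tried out with a minimal Python sketch. The sample strings are made up for illustration, and Python's built-in re module is just one possible engine.

```python
import re

# "." matches any single character except a line break
any_char = re.findall(r"c.t", "cat cut c-t ct")
print(any_char)  # ['cat', 'cut', 'c-t']

# A set matches exactly one character out of the listed ranges
in_set = re.findall(r"[A-Za-z0-9]", "a_B-9!")
print(in_set)  # ['a', 'B', '9']

# "^" as the first character inside [] negates the set
not_digit = re.findall(r"[^0-9]", "a1b2")
print(not_digit)  # ['a', 'b']
```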
Matching special characters
\d
-> a digit
\D
-> a non-digit
\w
-> a letter, digit, or _
\s
-> a whitespace character (the backspace, written [\b], is not included)
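A short Python sketch of these shorthand classes follows; the sample strings are invented for illustration. The last two checks show that a backspace is not whitespace and has to be matched with [\b] instead.

```python
import re

digits = re.findall(r"\d", "id_42 ok")
print(digits)  # ['4', '2']

non_digits = re.findall(r"\D", "a1b2")
print(non_digits)  # ['a', 'b']

word_chars = re.findall(r"\w", "a_1-!")
print(word_chars)  # ['a', '_', '1']

spaces = re.findall(r"\s", "a b\tc\n")
print(spaces)  # [' ', '\t', '\n']

# "\b" in a normal Python string is the backspace character (0x08)
backspace = "\b"
backspace_is_space = re.search(r"\s", backspace) is not None
backspace_in_class = re.search(r"[\b]", backspace) is not None
print(backspace_is_space, backspace_in_class)  # False True
```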
Changing case
\UJava\E
-> JAVA (\U converts everything up to \E to uppercase)
\ujava\E
-> Java (\u converts the next character to uppercase)
\LJAVA\E
-> java (\L converts everything up to \E to lowercase)
\lJAVA\E
-> jAVA (\l converts the next character to lowercase)
Case conversion is very practical. In an editor's find-and-replace, we can capture a word with
(word to find)
and then use the replacement
\U$1\E
to convert it to all uppercase, or
\u$1\E
to capitalize its first letter, and so on.
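Not every engine supports these case escapes. Python's re module, for instance, has no \U or \u in replacement strings, so the sketch below swaps in a replacement function to get the same effect. The helper names upper_all and upper_first and the sample text are hypothetical.

```python
import re

text = "i like java and javascript"

def upper_all(match):
    # Same effect as \U$1\E: uppercase the whole captured word
    return match.group(1).upper()

def upper_first(match):
    # Same effect as \u$1: uppercase only the first character
    word = match.group(1)
    return word[:1].upper() + word[1:]

all_upper = re.sub(r"\b(java)\b", upper_all, text)
first_upper = re.sub(r"\b(java)\b", upper_first, text)

print(all_upper)    # i like JAVA and javascript
print(first_upper)  # i like Java and javascript
```

The \b anchors keep the replacement from touching the "java" inside "javascript".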
Repeated matching
\d+
-> one or more digits
\d*
-> zero or more digits
\d?
-> zero or one digit
\d{3,5}
-> three to five digits
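The quantifiers above can be checked with a small Python sketch. The sample text is made up, and the \b anchors around \d{3,5} are an addition so that longer numbers are not matched in part.

```python
import re

text = "1 22 333 4444 55555 666666"

# + : one or more digits
one_or_more = re.findall(r"\d+", text)
print(one_or_more)  # ['1', '22', '333', '4444', '55555', '666666']

# {3,5} : between three and five digits, here as whole numbers only
three_to_five = re.findall(r"\b\d{3,5}\b", text)
print(three_to_five)  # ['333', '4444', '55555']

# * : zero or more digits, so the empty string is accepted
empty_ok = re.fullmatch(r"\d*", "") is not None

# ? : zero or one digit, so two digits are rejected
one_ok = re.fullmatch(r"\d?", "7") is not None
two_ok = re.fullmatch(r"\d?", "77") is not None
print(empty_ok, one_ok, two_ok)  # True True False
```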
Greedy and lazy matching
* and + are greedy metacharacters. For example, take this HTML:
<h1>Hello</h1>
<h1>Hello everyone</h1>
With the regular expression
<h1>.*</h1>
and both tags on one line (by default . does not match a line break), the match runs from the first <h1> to the last </h1>. That is not what we expect; it happens because a greedy quantifier matches as much of the string as possible.
Lazy matching solves this problem. Appending ? to a greedy quantifier makes it lazy, so the expression in this example becomes
<h1>.*?</h1>
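The difference is easy to see in a short Python sketch. The two headings are placed on one line here, because . does not cross line breaks by default; the heading text is only sample data.

```python
import re

html = "<h1>Hello</h1><h1>Hello everyone</h1>"

# Greedy: .* runs on to the last </h1>, giving a single long match
greedy = re.findall(r"<h1>.*</h1>", html)
print(len(greedy))  # 1

# Lazy: .*? stops at the first </h1> it can, giving one match per heading
lazy = re.findall(r"<h1>.*?</h1>", html)
print(lazy)  # ['<h1>Hello</h1>', '<h1>Hello everyone</h1>']
```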
Position matching
\bhelloworld\b
-> helloworld as a whole word; it does not match inside helloworldjava
\B-\B
-> a - with no word boundary on either side (for example, a hyphen with a space on each side)
^Helloworld
-> a string that starts with Helloworld. Do not confuse this with negation: the negating ^ is written inside []
Helloworld$
-> a string that ends with Helloworld
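Here is a minimal Python sketch of the four position anchors, using invented sample strings.

```python
import re

# \b : word boundary, so the word must stand on its own
whole_word = re.search(r"\bhelloworld\b", "say helloworld now") is not None
inside_word = re.search(r"\bhelloworld\b", "helloworldjava") is not None
print(whole_word, inside_word)  # True False

# \B : not a word boundary; only the hyphen surrounded by spaces qualifies
spaced_hyphens = re.findall(r"\B-\B", "a - b and a-b")
print(len(spaced_hyphens))  # 1

# ^ and $ : start and end of the string
starts = re.search(r"^Helloworld", "Helloworld says hi") is not None
not_at_start = re.search(r"^Helloworld", "say Helloworld") is not None
ends = re.search(r"Helloworld$", "say Helloworld") is not None
print(starts, not_at_start, ends)  # True False True
```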
Multiline mode
If we need to match the beginning and end of every line, rather than of the whole string, we can turn on multiline mode with (?m) to change the behavior of ^ and $. In this mode ^ also matches right after a line break, and $ also matches right before one. For example, to match every line that consists only of a // comment, we can use the regular expression
(?m)^\s*//.*$
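The effect of (?m) can be seen in a small Python sketch. The source snippet in the string is invented sample data.

```python
import re

code = (
    "int a = 1;\n"
    "// first comment\n"
    "    // indented comment\n"
    "int b = 2; // trailing\n"
)

# Without (?m), ^ only matches at the very start of the string
without_flag = re.findall(r"^\s*//.*$", code)
print(without_flag)  # []

# With (?m), ^ and $ match at the start and end of every line
with_flag = re.findall(r"(?m)^\s*//.*$", code)
print(with_flag)  # ['// first comment', '    // indented comment']
```

The trailing comment on the last line is not returned, because the pattern requires the line to start with optional whitespace followed by //.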
Subexpressions
Why do we need subexpressions? Suppose we want to match the word Helloworld repeated several times. In the regular expression
Helloworld*
the * applies only to the final d, so it matches strings such as Helloworlddd. The correct approach is to wrap the whole word in parentheses:
(Helloworld)*
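A quick Python check of both patterns makes the difference concrete. fullmatch is used so that the whole sample string has to match.

```python
import re

# Without parentheses, * repeats only the last character "d"
d_repeated = re.fullmatch(r"Helloworld*", "Helloworlddd") is not None
word_twice_plain = re.fullmatch(r"Helloworld*", "HelloworldHelloworld") is not None
print(d_repeated, word_twice_plain)  # True False

# With parentheses, * repeats the whole word
word_twice_group = re.fullmatch(r"(Helloworld)*", "HelloworldHelloworld") is not None
d_repeated_group = re.fullmatch(r"(Helloworld)*", "Helloworlddd") is not None
print(word_twice_group, d_repeated_group)  # True False
```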
Advanced
Backreferences: matching the same text again
For example, HTML headings use the tags h1 to h6. To match all of them we could use
<[hH][1-6]>.*?</[hH][1-6]>
but this also accepts malformed HTML such as
<h1>Title</h2>
which we do not want. This is where backreferences come in.
With a backreference, the expression becomes
<[hH]([1-6])>.*?</[hH]\1>
where \1 stands for whatever the first subexpression matched: if the first subexpression matched 2, then \1 matches only 2.
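The two heading patterns can be compared in a short Python sketch. The variable names loose and strict and the sample headings are only illustrative.

```python
import re

loose = r"<[hH][1-6]>.*?</[hH][1-6]>"
strict = r"<[hH]([1-6])>.*?</[hH]\1>"

malformed = "<h1>Title</h2>"
well_formed = "<H2>Title</h2>"

# The loose pattern accepts mismatched heading levels
loose_accepts_malformed = re.search(loose, malformed) is not None

# The backreference \1 forces the closing level to equal the opening level
strict_accepts_malformed = re.search(strict, malformed) is not None
strict_accepts_well_formed = re.search(strict, well_formed) is not None

print(loose_accepts_malformed)     # True
print(strict_accepts_malformed)    # False
print(strict_accepts_well_formed)  # True
```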
Here is a common example. In the WebStorm editor, a page contains the string
http://www.baidu.com
and I want to replace it with
<a href="http://www.baidu.com">http://www.baidu.com</a>
I can match the URL with
(http:.*)
and then use the replacement
<a href="$1">$1</a>
(WebStorm uses $1 to refer to the text captured by the first group, much like a placeholder in a format string).
Put to good use, regular expressions really are a benefit to mankind!
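The same replacement can be sketched in Python. Note one difference: in its replacement strings, Python's re module refers to the first group as \1 rather than $1.

```python
import re

text = "http://www.baidu.com"

# (http:.*) captures the URL; \1 inserts the captured text twice
linked = re.sub(r"(http:.*)", r'<a href="\1">\1</a>', text)
print(linked)  # <a href="http://www.baidu.com">http://www.baidu.com</a>
```

Because .* runs to the end of the line, this only works cleanly when the URL is the last thing on its line.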
Lookaround
Lookahead
What is lookahead? Sometimes a character must be present for the match to succeed, but we do not want it to appear in the matched result. For example, to match the protocol of a URL without including the : we can use
\w+(?=:)
instead of
\w+:
because the latter includes the : in the match.
Lookbehind
This is the opposite of lookahead. For example, to match the number in a price without including the $ symbol, we can use
(?<=\$).*$
Combining both
Suppose we need to get the content inside
<div>Helloworld</div>
We can use
(?<=<div>).*?(?=</div>)
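The lookahead and lookbehind patterns above can be tried in a minimal Python sketch. The price string is invented sample data.

```python
import re

url = "http://www.baidu.com"

# Lookahead: ":" must follow, but it is not part of the match
protocol = re.search(r"\w+(?=:)", url).group()
protocol_with_colon = re.search(r"\w+:", url).group()
print(protocol, protocol_with_colon)  # http http:

# Lookbehind: "$" must precede, but it is not part of the match
amount = re.search(r"(?<=\$).*$", "price: $100").group()
print(amount)  # 100

# Both together: only the text between the tags is returned
inner = re.search(r"(?<=<div>).*?(?=</div>)", "<div>Helloworld</div>").group()
print(inner)  # Helloworld
```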
Negative lookaround
Negation means that the given text must not appear ahead of or behind the match. For example, to match only the number 10 in the string
$100 can buy 10 apples
we can use
\b(?<!\$)\d+\b
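A short Python check of the negative lookbehind on the sentence from the example:

```python
import re

text = "$100 can buy 10 apples"

# (?<!\$) rejects a number that directly follows "$";
# the \b anchors stop the match from starting in the middle of 100
quantities = re.findall(r"\b(?<!\$)\d+\b", text)
print(quantities)  # ['10']
```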
Embedded conditions (for reference)
Embedded conditions are relatively complicated; a general understanding is enough.
Backreference conditions
Syntax: (?(backreference)true_regex|false_regex), where |false_regex is optional
We may run into this situation: if an opening parenthesis was matched, we also want to match the closing parenthesis; if there was no opening parenthesis, the closing one must not be matched either. Suppose we want to match both of these phone numbers:
123-456-789
(123)456-789
We can use
(\()?\d{3}(?(1)\)|-)\d{3}-\d{3}
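Python's re module supports conditions on a group number, so this pattern can be tried directly. The sketch below writes the last part as -\d{3} and adds two malformed numbers as extra, made-up test cases.

```python
import re

# (?(1)\)|-) : if group 1 (the opening parenthesis) matched, require ")",
# otherwise require "-"
phone = r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{3}"

candidates = ["123-456-789", "(123)456-789", "(123-456-789", "123)456-789"]
accepted = [c for c in candidates if re.fullmatch(phone, c)]
print(accepted)  # ['123-456-789', '(123)456-789']
```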
Lookaround conditions
Syntax: (?(lookaround)true_regex|false_regex), where |false_regex is optional
Suppose we want to match the first and third rows:
11111
22222-
33333-44444
We can use
\d{5}(?(?=-)-\d{5})
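Support for lookaround conditions varies by engine. Python's built-in re module only accepts a group number or name as the condition, so the sketch below swaps in an equivalent alternation: "a - follows and -\d{5} matches" or "no - follows". This is an emulation under that assumption, not the conditional syntax itself.

```python
import re

# Emulates \d{5}(?(?=-)-\d{5}) without a lookaround condition:
#   (?=-)-\d{5}  -> if "-" comes next, five more digits must follow it
#   (?!-)        -> otherwise nothing more is required
pattern = r"\d{5}(?:(?=-)-\d{5}|(?!-))"

rows = ["11111", "22222-", "33333-44444"]
matched = [row for row in rows if re.fullmatch(pattern, row)]
print(matched)  # ['11111', '33333-44444']
```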
Summary
This post contains a lot of symbols, so mistakes are likely. I hope readers will point them out!
Postscript
Ever since reading the book "Regular Expressions Must-Know Essentials", I have found that an editor's find-and-replace combined with regular expressions is remarkably powerful. It is no exaggeration to say that regular expressions are an essential skill for every developer.
References
Regular Expressions Must-Know Essentials (Revised Edition), by Ben Forta