Once you understand regular expressions, you can search for anything

What is a regular expression?

A regular expression is a special sequence of characters that lets you easily check whether a string matches a given pattern, that is, whether the string contains a substring that meets your requirements. For example, if I want to know how many times "Guo Jing" appears in the book "The Legend of the Condor Heroes", a regular expression can answer that.
For the development history of regular expressions, see reference 1.

The development history of regular expressions shows that technology is function-oriented: a function is valuable because there is demand for it, and regular expressions were created and keep evolving because of the need to search strings. Starting from those needs may be the best way to understand what regular expressions are and how to use them.
A need is a problem to solve, so the syntax is explained below in the course of solving 10 problems.

Question 1: How do you match the digit in "abc1de"?

1 Metacharacters

Use metacharacters, the most basic elements for building regular expressions. Commonly used metacharacters are as follows:

Metacharacter  Description
.              Match any character except a newline
\w             Match any letter, digit, or underscore (and Chinese characters in Unicode-aware engines)
\s             Match any whitespace character
\d             Match a digit
\b             Match a word boundary (the beginning or end of a word)
^              Match the beginning of the string
$              Match the end of the string

Match the digit in "abc1de":

\d
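The pattern above can be tried out directly. The following is a minimal sketch that assumes Python and its built-in re module; the variable name is chosen only for illustration.

```python
import re

# \d matches one digit; findall returns every non-overlapping match as a list.
digits = re.findall(r"\d", "abc1de")
print(digits)  # ['1']
```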

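Question 2: How do you match the words starting with "w" in "Years may wrinkle the skin, but to give up enthusiasm wrinkles the soul."?

\bw

This is how metacharacters are used. Note that \bw matches only the "w" at the start of each such word; \bw\w* (using a quantifier from section 2) matches the whole word. With metacharacters, you can already write some simple regular expressions.
For example, match a string starting with "A":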

^A
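The word-boundary and start-of-string patterns can be checked with a short sketch, again assuming Python's re module. The \w* suffix, which extends the match from the leading "w" to the whole word, and the test strings "Apple" and "banana" are additions made for this illustration.

```python
import re

sentence = "Years may wrinkle the skin, but to give up enthusiasm wrinkles the soul."

# \bw alone matches just the "w" at a word start; \w* extends it to the full word.
w_words = re.findall(r"\bw\w*", sentence)
print(w_words)  # ['wrinkle', 'wrinkles']

# ^A succeeds only when the string itself begins with "A".
starts_with_a = bool(re.search(r"^A", "Apple"))
no_leading_a = bool(re.search(r"^A", "banana"))
print(starts_with_a, no_leading_a)  # True False
```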

Match an eleven-digit mobile phone number:

\d\d\d\d\d\d\d\d\d\d\d

Obviously this syntax is not concise. To match a sequence of 100 digits, would you have to write \d 100 times?
Of course not; length qualifiers solve this problem.

2 Length qualifiers

To control how many times a pattern repeats, regular expressions provide qualifiers (also called quantifiers), as follows:

Character  Description
*          Match zero or more times
+          Match one or more times
?          Match zero or one time
{n}        Match exactly n times
{n,}       Match n or more times
{n,m}      Match n to m times
{,m}       Match at most m times (not supported by every engine; {0,m} is the portable form)

Then the pattern for an eleven-digit mobile phone number can be rewritten as

\d{11}
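As a rough illustration of the qualifiers, the sketch below assumes Python's re module. The sample phone numbers and the string "a1b22c333" are made up for the example.

```python
import re

phone = re.compile(r"\d{11}")

# fullmatch requires the entire string to be exactly eleven digits.
eleven_ok = bool(phone.fullmatch("13012345678"))
ten_fails = bool(phone.fullmatch("1301234567"))
print(eleven_ok, ten_fails)  # True False

# + means one or more; {2,} means two or more.
runs = re.findall(r"\d+", "a1b22c333")
long_runs = re.findall(r"\d{2,}", "a1b22c333")
print(runs)       # ['1', '22', '333']
print(long_runs)  # ['22', '333']
```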

Question 3: {n,m} matches n to m times, so how many times does it actually match?
That depends on whether the match is greedy or lazy.

3 Greed and laziness

Greedy matching: match as many characters as possible, provided the entire expression can still match. Greedy is the default, so {n,m} takes the longest repetition that still lets the whole expression succeed.
Lazy (non-greedy) matching: match as few characters as possible, provided the entire expression can still match. A lazy quantifier is written by adding "?" after the greedy quantifier.

Character  Description
*?         Repeat any number of times, but as few as possible
+?         Repeat 1 or more times, but as few as possible
??         Repeat 0 or 1 times, but as few as possible
{n,m}?     Repeat n to m times, but as few as possible

[Figure: greedy vs. non-greedy comparison]
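The difference between greedy and lazy matching can be seen in a small sketch, assuming Python's re module. The input strings are invented for the illustration.

```python
import re

# Greedy: \d{2,4} takes as many digits as it can, up to four.
greedy = re.search(r"\d{2,4}", "123456").group()
# Lazy: \d{2,4}? stops once the minimum of two digits has matched.
lazy = re.search(r"\d{2,4}?", "123456").group()
print(greedy, lazy)  # 1234 12

# Greedy .* runs to the last "b"; lazy .*? stops at the first one.
greedy_span = re.search(r"a.*b", "aXbXb").group()
lazy_span = re.search(r"a.*?b", "aXbXb").group()
print(greedy_span, lazy_span)  # aXbXb aXb
```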
Question 4: The qualifiers above repeat a single character such as 1 or 2. What if I need "12" as a whole to repeat?

4 Grouping

In regular expressions, parentheses () are used for grouping: the content inside the parentheses is treated as a whole.
Match "12" repeated twice:

(12){2}
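A short sketch, assuming Python's re module, shows why the parentheses matter. The test string "121222" is made up for the example.

```python
import re

text = "121222"

# Without parentheses, {2} applies only to the "2", so 12{2} means "122".
ungrouped = re.search(r"12{2}", text).group()
# With parentheses, {2} repeats the whole group "12".
grouped = re.search(r"(12){2}", text).group()
print(ungrouped, grouped)  # 122 1212
```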

Question 5: \d matches any digit. What if I only want to match digits between 1 and 5?

5 Ranges

Regular expressions provide square brackets [] to express a range condition, as follows:

Symbol  Description
[0-9]   A digit between 0 and 9
[a-z]   A lowercase letter between a and z
[A-Z]   An uppercase letter between A and Z

For example, to match a digit from 1 to 5 (inclusive):

[1-5]

Question 6: What if I now need to match the discrete digits 1, 5, 7, and 8?
A set can be used to match any one of its members.

6 Sets

In regular expressions, the notation [xy] also represents a set, which matches any one character listed in it.

[1578]

Or match one of a, g, k

[agk]
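Ranges and sets can be compared side by side in a small sketch, assuming Python's re module. The input strings are chosen only for the illustration.

```python
import re

all_digits = "0123456789"

# A range matches any one character between its two endpoints, inclusive.
in_range = re.findall(r"[1-5]", all_digits)
print(in_range)  # ['1', '2', '3', '4', '5']

# A set matches any one of the characters listed.
in_set = re.findall(r"[1578]", all_digits)
print(in_set)  # ['1', '5', '7', '8']

letters = re.findall(r"[agk]", "a big kite")
print(letters)  # ['a', 'g', 'k']
```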

You can also use an "or" condition.

7 Alternation (or)

Regular expressions use the symbol | to mean "or", also called a branch condition. The match succeeds when any one of the branches matches; this is the logic of union.
Matching one of a, g, and k can also be written as

a|g|k

Match a mobile phone number starting with 130, 166, or 186:

(130|166|186)\d{8}
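The branch condition can be exercised with a brief sketch, assuming Python's re module. The three sample numbers are invented for the example.

```python
import re

prefix_pattern = re.compile(r"(130|166|186)\d{8}")

# Only numbers that start with one of the three prefixes match in full.
samples = ["13012345678", "16612345678", "15012345678"]
results = [bool(prefix_pattern.fullmatch(s)) for s in samples]
print(results)  # [True, True, False]

# For single characters, a|g|k finds the same matches as the set [agk].
branch = re.findall(r"a|g|k", "a big kite")
print(branch)  # ['a', 'g', 'k']
```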

Since there is a logical "or", a logical "not" is just as necessary for the cases where we do not want to match a character.

8 Negation (not)

Regular expressions provide some negated metacharacters, as follows:

Character  Description
\W         Match any character that is not a letter, digit, underscore, or Chinese character
\S         Match any character that is not whitespace
\D         Match any character that is not a digit
\B         Match a position that is not the beginning or end of a word
[^xyz]     Match any character that is not x, y, or z

Note: ^ inside [] means negation, while ^ on its own means the beginning of the string.
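The negated forms can be tried with a small sketch, assuming Python's re module. The strings "tel: 130-1234" and "xaybz" are made up for the illustration.

```python
import re

text = "tel: 130-1234"

# \D matches every character that is not a digit.
non_digits = re.findall(r"\D", text)
print(non_digits)  # ['t', 'e', 'l', ':', ' ', '-']

# Removing all non-digits leaves only the number.
digits_only = re.sub(r"\D", "", text)
print(digits_only)  # 1301234

# Inside [], ^ negates the set; outside, ^ anchors to the start of the string.
not_xyz = re.findall(r"[^xyz]", "xaybz")
starts_with_x = bool(re.search(r"^x", "xaybz"))
print(not_xyz, starts_with_x)  # ['a', 'b'] True
```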

Question 7: The symbols (), [], and {} above all have special meanings. What if I want to match these symbols themselves?

9 Escaping

For this situation, regular expressions provide escaping, which turns metacharacters, qualifiers, or keywords into ordinary characters. The usage is simple: add a backslash \ in front of the character to be escaped.
To match "()":

\(\)

So an ordinary character preceded by the escape character becomes a special symbol, such as \d, while a special character preceded by the escape character becomes an ordinary character; for example, \^ means ^ itself.
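Escaping can be sketched as follows, assuming Python's re module. The helper re.escape is a Python-specific convenience that is not part of the regular expression syntax itself, and the sample strings are invented for the example.

```python
import re

# Escaped parentheses match the literal characters "(" and ")".
parens = re.findall(r"\(\)", "f() and g()")
print(parens)  # ['()', '()']

# An escaped caret matches "^" itself rather than the start of the string.
carets = re.findall(r"\^", "2^10")
print(carets)  # ['^']

# re.escape builds a pattern that matches arbitrary text literally.
literal = re.search(re.escape("(a+b)"), "x = (a+b) * 2").group()
print(literal)  # (a+b)
```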

Question 8: Crawlers often have this kind of requirement. For example, when crawling the address of a picture in a web page, you need to extract the URL of the picture from <img id="bigImg" src="https://b.zol-img.com.cn/desk/ bizhi/image/10/960x600/98319721647.jpg">.
Observation shows that what we need is the string after src=, and a zero-width assertion meets this need well.

10 Zero-width assertions

Zero-width means the assertion has no width: it consumes no characters and matches only a position. An assertion checks that certain characters are present or absent at that position.
Regular expressions provide four types of zero-width assertions, as follows:

Syntax   Name                 Description
(?=x)    Positive lookahead   Match a position that is followed by x
(?<=x)   Positive lookbehind  Match a position that is preceded by x
(?!x)    Negative lookahead   Match a position that is not followed by x
(?<!x)   Negative lookbehind  Match a position that is not preceded by x

The syntax is easy to remember: the special characters (? ) mark a zero-width assertion, = means a positive assertion, ! means negation, < means the position after x, and without < it is the position before x.

https://regex101.com/ is a very practical website for practicing regular expressions.

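Match the picture address:

(?<=src=)\".*\"

Note that this match includes the surrounding quotes, and because .* is greedy it extends to the last quote on the line. To match only the URL, (?<=src=")[^"]+ can be used instead.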
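Lookbehind and lookahead can be sketched in Python's re module as below. The HTML snippet, including the img tag name and the example.com URL, is a hypothetical stand-in rather than the page from the question, and the pattern uses [^"]+ so that only the URL between the quotes is returned.

```python
import re

# Hypothetical tag used only for this illustration.
html = '<img id="bigImg" src="https://example.com/images/wallpaper.jpg">'

# (?<=src=") asserts that src=" comes just before the match without consuming it.
match = re.search(r'(?<=src=")[^"]+', html)
url = match.group()
print(url)  # https://example.com/images/wallpaper.jpg

# (?=px) asserts that "px" follows the digits without including it in the match.
pixel_sizes = re.findall(r"\d+(?=px)", "12px 5em 30px")
print(pixel_sizes)  # ['12', '30']
```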

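Question 9: Now we need to match the date in the string " location: China, date : 2020-06-01,weather : cloudy " and access its year, month, and day separately.
This requires not only matching the desired string but also dividing the match into groups.

11 Capture groups

A capture group saves the content matched by a subexpression of the regular expression into a numbered or named group so that it can be referenced later.
A capture group needs only parentheses, (x). A group can also be named with the syntax (?<name>x) (in Python, (?P<name>x)); unnamed groups are numbered automatically.

[Figure: capturing the date with numbered groups]

[Figure: capturing the date with named groups]

In Python, the groups method returns the captured groups. For detailed usage of Python's regular expression functions, see reference 2.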
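Since the original screenshots are not available, the following is a sketch of how the date could be captured, assuming Python's re module. The pattern and the group names year, month, and day are assumptions made for this example, not necessarily the ones shown in the figures.

```python
import re

text = " location: China, date : 2020-06-01,weather : cloudy "

# Numbered groups: each pair of parentheses gets the next number, starting at 1.
numbered = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
parts = numbered.groups()
print(parts)              # ('2020', '06', '01')
print(numbered.group(1))  # 2020

# Named groups use Python's (?P<name>...) syntax.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
date_fields = named.groupdict()
print(date_fields)  # {'year': '2020', 'month': '06', 'day': '01'}
```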

Capture groups can be referenced not only from outside the regular expression but also inside it, for example to solve the following problem.
Question 10: Find the consecutive identical digits in the digit string "5443114994002568856643".

12 Backreferences

The content captured by a capture group can be referenced not only by the program outside the regular expression but also inside the regular expression itself. This kind of reference is called a backreference.
The syntax for referencing numbered and named capture groups is as follows:
numbered capture group backreference: \k<number>, usually abbreviated as \number
named capture group backreference: \k<name> or \k'name' (in Python, (?P=name))
[Figure: matching consecutive identical digits with a backreference]
In other words, the capture group (\d) captures one digit, which becomes group 1. The backreference \1 must match the same text as group 1, so (\d)\1 matches two identical consecutive digits.
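The backreference can be tried on the digit string from Question 10 with a short sketch, assuming Python's re module. The group name d in the second pattern is an arbitrary choice for the example.

```python
import re

digits = "5443114994002568856643"

# (\d) captures one digit and \1 requires the same digit again immediately after.
# finditer is used so that the whole match, not just group 1, is collected.
pairs = [m.group() for m in re.finditer(r"(\d)\1", digits)]
print(pairs)  # ['44', '11', '99', '00', '88', '66']

# The same idea with a named group and Python's (?P=name) backreference.
named_pairs = [m.group() for m in re.finditer(r"(?P<d>\d)(?P=d)", digits)]
print(named_pairs == pairs)  # True
```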
The above is the basic syntax of regular expressions. In practice these elements are usually combined, and proficiency comes with practice.

This article is a walkthrough of regular expression syntax and does not cover how regular expressions work internally. Understanding those principles is a prerequisite for optimizing regular expressions and is also something we need to learn; see reference 1.


  1. Regular expression development history and principles (https://mp.weixin.qq.com/s/YWgLfb7xvrqNpcdFdFaksQ)

  2. Detailed introduction to Python regular expressions (https://mp.weixin.qq.com/s/iZk1CX9VjCcHiXVOEwyWGg)

Origin blog.csdn.net/weixin_43705953/article/details/108250114