Python 3 crawler, step three: learn regular expressions with the re module

Introduction

A regular expression is a way of describing characters, and strings are matched against that description.
Regular expressions are flexible. A single symbol often stands for a whole class of characters, and by combining several symbols a regular expression can describe a whole class of strings.
In development, regular expressions are often used to describe a class of strings.
Note: regular expressions are available in virtually every programming language.

In Python, use the match function of the re module to match a string against a regular expression. The syntax is as follows:

re.match(pattern, string, flags=0)
  • pattern: regular expression
  • string: the string to be matched
  • flags: optional flags that modify how the pattern is matched (for example, re.IGNORECASE)

re.match only tries to match at the beginning of the string. If the match succeeds, it returns a match object; otherwise it returns None.
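As a minimal sketch of the return value described above: the strings below are made up for illustration and do not come from the article. The example shows a successful match, a failed match where the pattern is present but not at the start, and the effect of one flag.

```python
import re

# Sample strings are illustrative assumptions, not taken from the article.
hit = re.match("abc", "abcdef")      # pattern found at the start
miss = re.match("abc", "xxabcdef")   # pattern present, but not at the start

print(hit)    # a match object
print(miss)   # None

# flags changes how matching is done, here by ignoring letter case
relaxed = re.match("abc", "ABCdef", flags=re.IGNORECASE)
print(relaxed.group())
```

Because re.match is tied to the start of the string, a pattern that appears later in the text is not found; re.search is the function in the same module that scans the whole string.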

Using regular expressions

The simplest literal match
Before using the re module, you must import re; then use the match function to perform the match:

import re

res = re.match("这是正则区域","这是正则区域")
print(res.group())

The code above is about as direct as matching gets: the literal text 这是正则区域 is used as the pattern, and the string being matched is also 这是正则区域. The match result is assigned to the variable res and then printed (the group method extracts the matched text). The matched text, 这是正则区域, is printed successfully.
Next, I changed the string being matched from 这是正则区域 to 这是字符串区域:

res = re.match("这是正则区域","这是字符串区域")

This time the program fails: the string does not match, so re.match returns None, and calling group() on None raises an AttributeError. We can change the code as follows:

import re

res = re.match("这是正则区域","这是字符串区域")
if res:
    print(res.group())

Now no error is raised, and nothing is printed when there is no match.
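To make the failure mode concrete, here is a small sketch that first triggers the exception from the unguarded call and then wraps the guard in a reusable function. The helper name first_match is hypothetical and is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the start of text does not match."""
    res = re.match(pattern, text)
    return res.group() if res else None

# Unguarded: calling group() on None raises AttributeError
try:
    re.match("这是正则区域", "这是字符串区域").group()
except AttributeError as exc:
    print("unguarded call failed:", exc)

# Guarded: the helper returns None instead of raising
print(first_match("这是正则区域", "这是正则区域"))
print(first_match("这是正则区域", "这是字符串区域"))
```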

Now that literal matching is covered, let's look at some other ways to match.

\d

First, meet the symbol \d. \d matches a single digit from 0 to 9, and can be used in code as follows:

import re

res = re.match(r"\d","2")  # raw string keeps the backslash intact for the regex engine
if res:
    print(res.group())

The output is 2: the string is 2, so \d matches it. If the 2 is replaced with a letter, nothing is printed:

res = re.match(r"\d","a")

Nothing is printed, because a is not a digit.
The pattern can be made a little more complicated (only a little, not a lot):

res = re.match(r"今天星期\d","今天星期3")

Now, whatever the day of the week is, the match succeeds and is printed as long as it is written as a digit; here the output is 今天星期3.
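A compact way to check the behavior of \d described above is to run the same pattern over several strings. The list of sample strings below is an assumption chosen for illustration.

```python
import re

pattern = r"今天星期\d"   # raw string, so the backslash reaches the regex engine

# Illustrative samples: two digits, one letter, and one string with a different start
samples = ["今天星期3", "今天星期7", "今天星期x", "明天星期3"]
results = {}
for text in samples:
    res = re.match(pattern, text)
    results[text] = res.group() if res else None
    print(text, "->", results[text])
```

The last sample fails even though it ends in a digit, because the literal part of the pattern must match first.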

[]

Next, get to know []. [] matches any one of the characters listed inside the square brackets. For example, to list 1, 2, 3 and 4, write [1234]. The code below lists all ten digits:

import re

res = re.match("今天星期[0123456789]","今天星期3")
if res:
    print(res.group())

Will the code above still match and print a result? Of course. The digits listed inside the square brackets are not treated as the single unit 0123456789, as some readers might think; each digit is an individual candidate character, so the match succeeds.
Listing every digit from 0 to 9 is long and tedious. It can be written in the following form, which is shorter and clearer:

res = re.match("今天星期[0-9]","今天星期3")

If you want to list the letters a-z, there is no need to write them all out either. For example:

res = re.match("今天星期[a-z]","今天星期t")

The match succeeds and prints 今天星期t.
What if you also want to match uppercase letters? That is simple too; look at the following example:

res = re.match("今天星期[a-zA-Z]","今天星期T")

The items inside square brackets are independent alternatives: a-z describes the lowercase letters a through z and A-Z describes the uppercase letters A through Z, each range counting as one unit, so they can be written side by side as above.
The match succeeds and prints 今天星期T.
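The following sketch puts the bracket forms from this section side by side. The pattern and string pairs are chosen for illustration, including two that are expected to fail.

```python
import re

checks = [
    (r"今天星期[1234]", "今天星期3"),    # 3 is listed, so this matches
    (r"今天星期[1234]", "今天星期5"),    # 5 is not listed
    (r"今天星期[0-9]", "今天星期5"),     # the range covers every digit
    (r"今天星期[a-zA-Z]", "今天星期T"),  # two ranges side by side
    (r"今天星期[a-z]", "今天星期T"),     # the lowercase range rejects T
]
outcomes = [bool(re.match(p, s)) for p, s in checks]
for (p, s), ok in zip(checks, outcomes):
    print(p, s, "match" if ok else "no match")
```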

\w and \W

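\w matches A-Z, a-z, 0-9 and the underscore _. In Python 3, with str patterns, it also matches other Unicode word characters such as Chinese characters, unless the re.ASCII flag is used.
\W is the opposite of \w: it matches any character that is not a letter, digit, underscore or Chinese character.
First look at \w: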

import re

res = re.match(r"\w","a")
if res:
    print(res.group())

Since \w matches A-Z, a-z, 0-9 and the underscore _, the match succeeds and a is printed. The other cases are not listed one by one; they all work the same way.
Now try \W:

res = re.match(r"\W","+")

The match succeeds and prints +.
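To see which characters fall on which side, the sketch below tests a handful of single characters against \w and \W. It also compiles \w with the re.ASCII flag, which is not used in the article but shows how to limit \w to A-Z, a-z, 0-9 and _. The character list is an assumption chosen for illustration.

```python
import re

word = re.compile(r"\w")
non_word = re.compile(r"\W")
ascii_word = re.compile(r"\w", re.ASCII)   # restrict \w to [a-zA-Z0-9_]

# Illustrative characters: letters, a digit, underscore, a Chinese character, symbols
for ch in ["a", "Z", "7", "_", "中", "+", " "]:
    print(repr(ch),
          "\\w" if word.match(ch) else "  ",
          "\\W" if non_word.match(ch) else "  ",
          "ascii \\w" if ascii_word.match(ch) else "")
```

Every character matches exactly one of \w and \W; only the Chinese character changes sides when re.ASCII is applied.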

*, +, {} and ?

It would be boring not to raise the difficulty a little. Now we will add some symbols that modify the single-character matches learned so far, so that a regular expression can match multiple characters.
How would you match the string 房价租金1999 with a regular expression? Look at the code:

import re

res = re.match(r"房价租金[0-9]*","房价租金1999")
if res:
    print(res.group())

Look carefully at the regular expression 房价租金[0-9]*. The leading 房价租金 literally matches 房价租金 in the string. Then comes a pair of square brackets whose content matches any one digit from 0 to 9; on its own, [0-9] matches exactly one character. I added a * after the square brackets.
The * means that the expression immediately before it, here [0-9], may match zero or more times. The four digits are therefore all matched, and 房价租金1999 is printed.
Zero times is also allowed, of course. Change the code as follows:

res = re.match("房价租金1*","房价租金")

The code above applies * to the character 1. Even though no 1 is present, a match object is still returned, because the preceding part has matched and * accepts zero occurrences. So 房价租金 is still printed.
What happens if the * in the code above is changed to +? Let's try:

res = re.match("房价租金1+","房价租金")

There is no output this time. + requires the preceding character to appear at least once; zero occurrences are not accepted. Now let's match the string 房价租金111111111 and look at the result:

res = re.match("房价租金1+","房价租金111111111")

The match succeeds and prints 房价租金111111111.
What if I want to match a fixed number of times?
In that case, use {} to specify the number of repetitions:

res = re.match("房价租金1{0}","房价租金111111111")

With {0} the 1 is matched zero times, so the output is 房价租金.
A range of counts can also be given:

res = re.match("房价租金1{0,4}","房价租金111111111")

In {0,4}, 0 is the minimum number of repetitions and 4 is the maximum, so the example above matches at most four 1s and prints 房价租金1111. If the maximum is omitted, as in {0,}, the preceding character is matched from the minimum (which can be any number, such as 1, 2, 3...) up to an unlimited number of times.
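The quantifiers in this section can be compared by running them all against the same string. This is a sketch: the patterns with {2,} and {3} are additions for illustration and do not appear in the article.

```python
import re

text = "房价租金111111111"   # nine 1s, as in the article
patterns = [r"房价租金1*", r"房价租金1+", r"房价租金1{0}",
            r"房价租金1{0,4}", r"房价租金1{2,}", r"房价租金1{3}"]

matched = {}
for p in patterns:
    res = re.match(p, text)
    matched[p] = res.group() if res else None
    print(p, "->", matched[p])

# Zero occurrences: * accepts them, + does not
star_on_empty = re.match(r"房价租金1*", "房价租金") is not None
plus_on_empty = re.match(r"房价租金1+", "房价租金") is not None
print(star_on_empty, plus_on_empty)
```

By default these quantifiers are greedy: each one takes as many repetitions as its upper bound and the text allow.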

^ and $

^ matches at the beginning of the string, and $ matches at the end of the string.
Now let's take on a more comprehensive challenge: matching an email address, which is a very common requirement in practice:

import re

res = re.match(r"^\d+@\w+\.\w+",r"[email protected]")
if res:
    print(res.group())

The regular expression we wrote is ^\d+@\w+\.\w+. Let's break down its parts:

  • ^\d+: the ^ at the very beginning anchors the match to the start of the string. \d is a digit, and + means at least one. Together they match a run of digits at the beginning of the string.
  • @\w+: after the run of digits comes a literal @, as in any email address. Then \w matches A-Z, a-z, 0-9 and the underscore _ (I am not sure underscores are actually used here, but I seem to have seen them). Since there are different providers such as QQ Mail, 163 Mail, Google and so on, I use \w. And since more than one character has to be matched, I added a + after it.
  • \.\w+: finally, a literal dot . is matched. A \ is placed in front of the . because an unescaped . is a special character and must be escaped. After that, \w+ matches the rest, and the pattern is complete.

If the match succeeds, the matched address is printed. Readers can modify the address to check the effect.

Note: the pattern above is not a rigorous way to match email addresses. It is for demonstration only; please do not use it in real projects.
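The email example uses ^ but never $, so here is a sketch of what the end anchor adds and of why the dot is escaped. The addresses are hypothetical placeholders invented for this example, and the patterns remain demonstration-grade only.

```python
import re

# Hypothetical addresses, used only for illustration
good = "123456@qq.com"
trailing = "123456@qq.com!!!"

loose = r"^\d+@\w+\.\w+"      # anchored at the start only
strict = r"^\d+@\w+\.\w+$"    # $ also requires the string to end here

print(re.match(loose, good).group())       # the whole address
print(re.match(loose, trailing).group())   # the junk after the address is ignored
print(re.match(strict, trailing))          # None: $ rejects the trailing text

# Without the backslash, . matches any character, including #
print(bool(re.match(r"^\d+@\w+.\w+", "123456@qq#com")))
print(bool(re.match(r"^\d+@\w+\.\w+", "123456@qq#com")))
```

Since re.match already starts at the beginning of the string, the ^ is redundant there; it is the $ that changes the result.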

The basics above are generally enough for application development. Next, I will expand on a few more regular expression symbols.

| and ()

| is the OR operator: either the rule on its left or the rule on its right may match. As long as one of them succeeds, the whole match succeeds:

import re

re_1=r"^\d+@qq\.\w+"
re_2=r"^\d+@163\.\w+"

res = re.match(re_1+'|'+re_2,r"[email protected]")
if res:
    print(res.group())

The code above is adapted from the previous example, so the patterns are not explained again; they differ very little. Two variables, re_1 and re_2, are defined: re_1 matches QQ addresses and re_2 matches 163 addresses. When matching, the pattern is written as re_1+'|'+re_2, using | to join the two sides; this symbol is the OR operator. An address that satisfies either pattern is printed.
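As a sketch of the OR operator in action, the code below runs the joined pattern over three hypothetical addresses (invented for this example). It also shows an alternative spelling, not used in the article, where parentheses limit the | to the part that differs.

```python
import re

# Hypothetical addresses, used only for illustration
addresses = ["123456@qq.com", "123456@163.com", "123456@example.com"]

joined = r"^\d+@qq\.\w+|^\d+@163\.\w+"   # same text as re_1 + '|' + re_2
grouped = r"^\d+@(qq|163)\.\w+"          # alternation limited to the domain part

verdicts = []
for addr in addresses:
    a = re.match(joined, addr)
    b = re.match(grouped, addr)
    verdicts.append((bool(a), bool(b)))
    print(addr, bool(a), bool(b))
```

Without parentheses, | splits the entire pattern into two alternatives, which is why the article's version repeats ^\d+@ on both sides.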
The role of parentheses () is to group:

import re

re_1=r"(^\d+)(@qq)(\.)(\w+)"

res = re.match(re_1,r"[email protected]")
if res:
    print("0",res.group(0))
    print("1",res.group(1))
    print("2",res.group(2))
    print("3",res.group(3))
    print("4",res.group(4))

The code above is adapted from the previous example. re_1 has changed only slightly: each part of the pattern is wrapped in parentheses to form a group. When calling group, the arguments 0, 1, 2, 3 and 4 are passed in: 0 returns the entire match, 1 returns what (^\d+) matched, 2 returns what (@qq) matched, and so on.
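Beyond calling group with one number at a time, a match object offers a few other ways to read groups. The sketch below uses a hypothetical address invented for this example; the named-group pattern is an addition for illustration and is not from the article.

```python
import re

# Hypothetical address, used only for illustration
res = re.match(r"(^\d+)(@qq)(\.)(\w+)", "123456@qq.com")
if res:
    print(res.group(0))      # the whole match
    print(res.groups())      # all groups as a tuple
    print(res.group(1, 4))   # several groups in one call

# Named groups (?P<name>...) can be easier to read than numbers
named = re.match(r"(?P<user>\d+)@(?P<domain>\w+\.\w+)", "123456@qq.com")
if named:
    print(named.group("user"), named.group("domain"))
    print(named.groupdict())
```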
The crawler series is continuously updated; you are welcome to follow, like and favorite.
The next article will use regular expressions to scrape prices.

Origin blog.csdn.net/A757291228/article/details/107215416