python crawler learning 21

This post covers the second half of match, which was left unfinished. I had meant to wrap it up yesterday... ah, never mind. The first half is in the previous post.

3. Regular expressions

2. match

As usual, here is the syntax comparison table first:

(image: regex syntax comparison table)

2-3 Greedy and non-greedy

We have already gone through basic matching with the match method, but sometimes what we capture is not what we wanted:

# Greedy vs. non-greedy
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# Raw string, so that \d reaches the regex engine unchanged
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

Output (Python 3.7 or later):

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7

Here we still want the number in the middle of the string, so we use (\d+) to capture it, which is fine. There is a lot of text on both sides of the number, and to save trouble we match it directly with .*.

But the group we captured contains only a single digit. Why?

This is where greedy and non-greedy matching come in. In greedy mode, .* matches as many characters as possible. In our expression, (\d+) means at least one digit without saying how many, so .* takes everything it can, including 123456, and leaves only the final 7 for (\d+). (You said at least one? Then one is what you get.)
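To see exactly where the characters went, one option is to wrap the greedy .* in a group of its own. This is only an illustrative sketch: the extra group and the names swallowed and digits are added here for inspection and are not part of the original pattern.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Same pattern as above, but the greedy .* is captured too
# (the extra group exists purely so we can look at it).
greedy = re.match(r'^He(.*)(\d+).*Demo$', content)

swallowed, digits = greedy.groups()
print(repr(swallowed))  # what the greedy .* consumed
print(repr(digits))     # what was left over for (\d+)
```

The first group ends with 123456, which shows that the greedy wildcard only gave back the single digit that (\d+) needs.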

So how do we get the final result we want?

# Slowly type one question mark
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# .*? instead of .* makes the first wildcard non-greedy
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

Output (Python 3.7 or later):

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

Adding a question mark after .* switches it to non-greedy matching, so .*? matches as few characters as possible. It stops as soon as (\d+) can start, and (\d+) then captures the whole number.

So when matching something in the middle of a string, prefer the non-greedy .*? over .* to avoid capturing less than you intended.
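As a small sketch of that advice, the non-greedy prefix can be wrapped in a helper. The function name middle_number and the extra sample strings are hypothetical, made up for illustration:

```python
import re

def middle_number(text):
    """Return the first run of digits in text, or None (hypothetical helper)."""
    # .*? skips as little as possible, so (\d+) gets the complete number
    m = re.match(r'^.*?(\d+)', text)
    return m.group(1) if m else None

print(middle_number('Hello 1234567 World_This is a Regex Demo'))
print(middle_number('order 42 shipped on day 7'))
print(middle_number('no digits here'))
```

The last call prints None because match returns None when nothing matches, which is why the helper checks m before calling group.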

But if the part we want is at the end of the string, .*? may capture nothing at all, because it matches as few characters as possible:

import re

content = 'http://weibo.com/comment/KEracn'
res1 = re.match('^http.*?comment/(.*?)', content)
res2 = re.match('^http.*?comment/(.*)', content)

print('.*?  :', res1.group(1))
print('.*   :', res2.group(1))

Output:

.*?  :
.*   : KEracn

The non-greedy group captured an empty string, while the greedy one captured everything up to the end.
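If you would still rather keep the non-greedy form at the end of a string, one possible approach (not from the original example) is to anchor the pattern with $, which forces the group to grow until the end. A minimal sketch with the same URL:

```python
import re

content = 'http://weibo.com/comment/KEracn'

# Non-greedy group with nothing after it: it stops at once and captures ''
lazy_open = re.match(r'^http.*?comment/(.*?)', content)
# Non-greedy group followed by $: it has to expand up to the end of the string
lazy_anchored = re.match(r'^http.*?comment/(.*?)$', content)

print(repr(lazy_open.group(1)))
print(repr(lazy_anchored.group(1)))
```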

2-4 Modifiers
import re

content = '''Hello 1234567 
            World_This is a Regex Demo'''
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result)  # None: the match fails
print(result.group(1))  # raises AttributeError, see below

It's the same expression as before, but content is now multi-line text. Output:

None
AttributeError: 'NoneType' object has no attribute 'group'

What happened? Why is there an error?

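Looking back at the comparison table, we recall that . matches any character except a newline, so .*? cannot get past the line break. The match fails, match returns None, and calling group on None raises the error. Our all-purpose approach has been challenged. For the sake of our laziness, it's time to introduce modifiers!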
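Before moving on, the newline behavior and the error can be looked at in isolation. This is a minimal sketch, assuming a shortened version of the multi-line text; the None check is a common defensive habit and not something the original code does:

```python
import re

content = 'Hello 1234567 \nWorld_This is a Regex Demo'

# Without flags, . stops at the newline, so .* only covers the first line
first_line = re.match(r'.*', content).group()
print(repr(first_line))

result = re.match(r'^He.*?(\d+).*?Demo$', content)
if result is None:
    # match returned None, so calling result.group(1) would raise AttributeError
    number = None
    print('no match')
else:
    number = result.group(1)
    print(number)
```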

# Just pass one extra argument, re.S, to the original call

import re

content = '''Hello 1234567 
            World_This is a Regex Demo'''
# re.S lets . match newlines as well
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result)
print(result.group(1))

Result: the match now succeeds, and result.group(1) is 1234567.

This flag is used a lot when matching web pages: HTML nodes often contain line breaks, and re.S avoids the match failures they would otherwise cause.
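A minimal sketch of the web-page case, assuming a made-up HTML fragment (the tag contents and attribute values are hypothetical, chosen only for illustration):

```python
import re

# Hypothetical HTML fragment with line breaks inside the node
html = '''<li>
    <a href="/2.mp3" singer="Example Singer">
        Example Song
    </a>
</li>'''

pattern = r'<li>.*?<a href="(.*?)".*?>(.*?)</a>'

without_flag = re.match(pattern, html)      # . cannot cross the line breaks
with_flag = re.match(pattern, html, re.S)   # . now matches newlines too

print(without_flag)
href, title = with_flag.groups()
print(href, title.strip())
```

re.S is the short name of re.DOTALL, so either spelling can be passed to match.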

Modifier table:

(image: table of modifiers)
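A few flags from the re module can be tried out in a handful of lines. This sketch only covers re.I, re.M and re.S, which I am confident exist in the standard library; it does not claim to reproduce the full table, and the sample strings are made up:

```python
import re

text = 'Hello\nWorld'

# re.I: ignore case
ignore_case = re.match(r'hello', text, re.I)
print(ignore_case.group())

# re.M: $ also matches at the end of each line, not only at the end of the string
no_flag = re.match(r'^Hello$', text)
multiline = re.match(r'^Hello$', text, re.M)
print(no_flag)
print(multiline.group())

# Flags can be combined with |
combined = re.match(r'hello.world', text, re.I | re.S)
print(repr(combined.group()))
```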

2-5 Escape matching

While writing expressions we have seen that . matches any character other than a newline. So what do we do when the expression needs . as an ordinary character? Put a backslash in front of it, and do the same for other special characters such as parentheses:

# Escape matching

import re

content = '(百度)www.baidu.com'
# Backslashes turn ( ) and . into literal characters; raw string keeps them intact
res = re.match(r'\(百度\)www\.baidu\.com', content)
print(res)

Output (Python 3.7 or later):

<re.Match object; span=(0, 17), match='(百度)www.baidu.com'>
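When the literal text is long, typing every backslash by hand gets tedious. The standard library offers re.escape for this; it is not used in the original example, so take this as an optional sketch. The string 'wwwXbaiduYcom' is made up to show what goes wrong without escaping:

```python
import re

content = '(百度)www.baidu.com'

# Unescaped dots match any character, so this wrongly succeeds
loose = re.match(r'www.baidu.com', 'wwwXbaiduYcom')

# re.escape adds the backslashes for regex metacharacters
strict_pattern = re.escape('www.baidu.com')
strict = re.match(strict_pattern, 'wwwXbaiduYcom')

# The escaped form of the full text matches the text itself
exact = re.match(re.escape(content), content)

print(loose is not None)
print(strict is None)
print(exact is not None)
```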

Above, we have finished learning the common content of the match method!

That's it for today. Tomorrow... we'll see how it goes.

Origin blog.csdn.net/szshiquan/article/details/123719810