Python crawler learning 21
This post covers the unfinished second half of the match method. I meant to write it yesterday... hey, let's not talk about that.
Link to the first half
Contents
3. Regular expressions
2. match
As usual, the comparison table is attached first:
2-3 Greedy and non-greedy
We have already gone through basic matching with the match method together, but sometimes what we match is not the result we want:
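# Greedy vs. non-greedy
import re
content = 'Hello 1234567 World_This is a Regex Demo'
# Raw string (r'...') so that \d reaches the regex engine unchanged
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)           # a Match object, so the pattern did match
print(result.group(1))  # the captured group is just 7
Output: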
Here we want the number in the middle of the string, so we use (\d+) to capture it, which is fine. There is a lot of text on either side of the number, and to save effort we simply cover it with .*.
But the group we capture ends up containing only a single digit, 7. Why?
This is where greedy versus non-greedy matching comes in. In greedy mode, .* matches as many characters as possible. In our expression, (\d+) only requires at least one digit and does not say how many, so .* grabs everything it can (you said at least one, so one is all you get): it swallows 123456 and leaves only the 7 for (\d+).
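To see how the greedy .* splits the string, here is a small sketch. It wraps the first .* in a capturing group of its own, which is an addition made only for illustration and is not part of the expression above, so that we can print what it swallowed.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# The first .* is wrapped in parentheses (added for illustration only)
# so that we can inspect what the greedy quantifier consumed.
result = re.match(r'^He(.*)(\d+).*Demo$', content)

greedy_part = result.group(1)  # text taken by the greedy .*
digits = result.group(2)       # what was left for (\d+)

print(repr(greedy_part))  # 'llo 123456'
print(repr(digits))       # '7'
```

The greedy group backs off only as far as it must for the rest of the pattern to succeed, which here means giving up a single digit.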
So how do we get the final result we want?
# I slowly type out a question mark
import re
content = 'Hello 1234567 World_This is a Regex Demo'
# .*? is the non-greedy form of .*
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))  # the captured group is now 1234567
Output:
Adding a question mark after .* switches it to non-greedy matching, so .*? matches as few characters as possible and (\d+) gets the whole number.
So when matching in the middle of a string, prefer non-greedy matching where you can: use .*? instead of .* so that the groups that follow do not lose part of their content.
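The difference can be checked side by side. The sketch below again puts the leading quantifier in a group of its own (an illustrative addition, and the variable names are made up for this example) and runs the greedy and non-greedy forms against the same string.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Illustrative patterns: the leading quantifier is captured so we can see it.
patterns = {
    'greedy': r'^He(.*)(\d+).*Demo$',
    'non-greedy': r'^He(.*?)(\d+).*Demo$',
}

captured = {}
for name, pattern in patterns.items():
    m = re.match(pattern, content)
    # group(1) is what the quantifier took, group(2) is the number
    captured[name] = (m.group(1), m.group(2))
    print(name, captured[name])
```

With the non-greedy form, the leading group stops right before the first digit, so the number group receives all seven digits.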
But if what we want to match is at the end of the string, .*? may capture nothing at all, because it matches as few characters as possible:
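import re
content = 'http://weibo.com/comment/KEracn'
# Non-greedy group at the end of the pattern: nothing forces it to grow
res1 = re.match('^http.*?comment/(.*?)', content)
# Greedy group: runs to the end of the string
res2 = re.match('^http.*?comment/(.*)', content)
print('.*? :', res1.group(1))  # the group is empty
print('.* :', res2.group(1))   # the group is KEracn
Output: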
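As a quick check of what these two calls capture, and as one possible way to keep a non-greedy group useful at the end of a string, the sketch below adds a third pattern anchored with $. The anchored variant is an addition for illustration, not something the original example uses.

```python
import re

content = 'http://weibo.com/comment/KEracn'

lazy_end = re.match(r'^http.*?comment/(.*?)', content)
greedy_end = re.match(r'^http.*?comment/(.*)', content)
# Illustrative variant: $ forces the non-greedy group to reach the end
anchored = re.match(r'^http.*?comment/(.*?)$', content)

print(repr(lazy_end.group(1)))    # ''
print(repr(greedy_end.group(1)))  # 'KEracn'
print(repr(anchored.group(1)))    # 'KEracn'
```

A non-greedy group only grows when something after it in the pattern requires it to, which is why the $ anchor makes the difference here.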
2-4 Modifiers
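import re
content = '''Hello 1234567
World_This is a Regex Demo'''
# Raw string so that \d reaches the regex engine unchanged
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result)           # prints None: there is no match
print(result.group(1))  # raises AttributeError, because result is None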
The expression is essentially the same as before, but the content is now multiline text:
What happened? Why does it raise an error? re.match found no match and returned None, so calling group(1) on it fails.
Checking the comparison table, we can recall that . matches any character except a newline, so .* stops at the line break and our all-purpose approach fails. For the sake of staying lazy, it's time to introduce modifiers!
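Since re.match returns None when nothing matches, it can help to check the result before calling group. A minimal sketch, assuming we just want to fall back to None when there is no match (the variable name number is made up for this example):

```python
import re

content = '''Hello 1234567
World_This is a Regex Demo'''

# Without a modifier, . cannot cross the line break, so there is no match
result = re.match(r'^He.*?(\d+).*?Demo$', content)

if result is None:
    number = None
    print('no match')
else:
    number = result.group(1)
    print(number)
```

This avoids the AttributeError, but the pattern itself still cannot match text that spans more than one line.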
# Starting from the original call, we only need to add the re.S flag
import re
content = '''Hello 1234567
World_This is a Regex Demo'''
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result)
print(result.group(1))  # the captured group is 1234567
Output:
This flag is used a lot when matching web pages: HTML nodes often contain line breaks, and with re.S those line breaks no longer cause the match to fail.
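Here is a small sketch of the web page case. The HTML fragment and the variable names are made up for illustration. re.DOTALL is the long name of the same flag as re.S.

```python
import re

# Hypothetical HTML fragment with line breaks inside and between nodes
html = '''<ul>
<li>
first item
</li>
</ul>'''

pattern = r'<ul>.*?<li>(.*?)</li>'

without_flag = re.match(pattern, html)     # . stops at every newline
with_flag = re.match(pattern, html, re.S)  # . now matches newlines too

print(without_flag)  # None
item = with_flag.group(1).strip()  # strip the line breaks around the text
print(item)  # first item
```

The captured text still contains the line breaks from the page, so stripping it afterwards is usually needed.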
Modifier table:
2-5 Escape matching
While writing expressions, we have seen that . matches any character other than a newline. So what do we do when the expression needs . as an ordinary, literal character? Put a backslash in front of it to escape it:
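# Escape matching
import re
content = '(百度)www.baidu.com'
# Raw string: \( \) and \. are passed to the regex engine as escapes
res = re.match(r'\(百度\)www\.baidu\.com', content)
print(res)  # a Match object covering the whole string
Output: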
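To see why escaping matters, the sketch below compares an unescaped and an escaped pattern against a deliberately wrong string (the test string is made up for illustration). It also uses re.escape from the standard library, which adds the backslashes for you.

```python
import re

content = '(百度)www.baidu.com'
wrong = 'wwwXbaiduYcom'  # made-up string with no dots at all

# Unescaped: each . matches any character, so the wrong string matches
loose = re.match('www.baidu.com', wrong)
# Escaped: \. matches only a literal dot, so the wrong string is rejected
strict = re.match(r'www\.baidu\.com', wrong)

# re.escape escapes the special characters in a literal string
auto = re.match(re.escape(content), content)

print(loose is not None)  # True
print(strict is None)     # True
print(auto.group())       # (百度)www.baidu.com
```

re.escape is handy when the text to match comes from a variable and may contain characters such as parentheses or dots.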
With that, we have finished learning the common usage of the match method!
That's it for today. Tomorrow... it depends on the situation