Python crawler learning 20

3. Regular expressions

2. match

match is a commonly used matching method. We pass in a regular expression and the string we want to check, and it tells us whether the regular expression matches the string.

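# match

# This method tries to match the regular expression starting from the beginning of the string. If it matches, it returns the match result; otherwise it returns None.
import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)  # raw string keeps \s, \d and \w intact
print(result)           # the match object
print(result.group())   # the matched text
print(result.span())    # the (start, end) range of the match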

Output (Python 3.7+):

41
<re.Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)
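Because match only tries the pattern at the start of the string, a pattern that would fit somewhere in the middle still gives None, and calling .group() on None raises an AttributeError. Here is a minimal sketch of checking the result first, using the same content string. The helper name describe_match is made up for this illustration and is not part of the re module.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'


def describe_match(pattern, text):
    """Return the matched text, or None if match() fails (hypothetical helper)."""
    result = re.match(pattern, text)
    if result is None:
        return None
    return result.group()


# The pattern fits at position 0, so match() succeeds.
print(describe_match(r'Hello\s\d{3}', content))   # Hello 123

# '123' does occur in the string, but not at the start, so match() gives None.
print(describe_match(r'\d{3}', content))          # None
```

Checking for None before calling .group() or .span() is the usual way to keep a failed match from crashing the script.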

The following is a detailed analysis of the regular expression used in the example:

^Hello\s\d\d\d\s\d{4}\s\w{10}

After thinking about it, I'm attaching yesterday's picture again. I really can't remember the rules unless I look at it...

[Image: yesterday's reference picture]

The ^ at the start anchors the match to the beginning of the string, so the string must start with Hello. \s matches a whitespace character (a space, tab, newline, and so on); here it matches the space after Hello. \d matches any single digit, so we write three \d to match 123. \d{4} is a more advanced (read: lazy) way of writing the same thing and matches four digits in a row. Similarly, \w{10} matches ten characters in a row, each of which is a letter, a digit, or an underscore.
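To see what each piece contributes, here is a small sketch that grows the pattern one piece at a time and prints what has been matched so far. The list name pieces and the loop are only for illustration; the pattern and the string are the ones from the example above.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# The example pattern, split into its individual pieces.
pieces = [r'^Hello', r'\s', r'\d\d\d', r'\s', r'\d{4}', r'\s', r'\w{10}']

pattern = ''
matched_so_far = []
for piece in pieces:
    pattern += piece                      # extend the pattern by one piece
    result = re.match(pattern, content)   # match again from the start
    matched_so_far.append(result.group())
    print(f'{pattern:<30} -> {result.group()!r}')

# \d\d\d and \d{3} are two spellings of the same thing.
same = re.match(r'^Hello\s\d{3}', content).group() == re.match(r'^Hello\s\d\d\d', content).group()
print(same)   # True
```

Each added piece extends the matched text by exactly the characters it describes, ending with World_This for \w{10}.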

2-1 Match Target

Now that we roughly understand how match works, we can use it to extract the part we want from a string:

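# Match target
# Wrap the substring you want to extract in (). The parentheses mark where a subexpression starts and ends, and each marked subexpression corresponds to a group, in order (remember the group() call we just used?)
import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\s', content)
print(result)
print(result.group(1))  # 1 returns what the first pair of parentheses captured; with more parentheses, pass 2, 3, ... and so on
print(result.group())   # with no argument, returns the whole match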

Output (Python 3.7+):

<re.Match object; span=(0, 14), match='Hello 1234567 '>
1234567
Hello 1234567 
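The "pass 2, 3, ..." case can be sketched with a second pair of parentheses. The pattern below is an extension of the one above written for this illustration: the second group (\w+) captures the word after the number. groups() is the standard match-object method that returns all captured groups as a tuple.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Two pairs of parentheses, so two groups, numbered from left to right.
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

print(result.group(1))   # 1234567
print(result.group(2))   # World_This
print(result.groups())   # ('1234567', 'World_This')
print(result.group())    # Hello 1234567 World_This
```

Note that group() without an argument still returns the whole match, including the parts outside the parentheses.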

2-2 Universal Match

The matching we just did in 2-1 is still rather cumbersome: we have to type a \s for every space. Is there a simpler way?

This is where . and * step up. Together they form a universal matcher: . matches any single character except a newline, and * repeats the preceding element any number of times (including zero). Combined as .*, they match any run of characters that does not contain a newline.

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match('^Hello.*Demo$', content)
print(res.group())
print(res.span())

Output:

Hello 1234567 World_This is a Regex Demo
(0, 40)
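Two details of .* are easy to check: * also allows zero repetitions, and . stops at a newline. The strings no_gap and multi_line below are made up for this sketch; the pattern is the one from the example above.

```python
import re

pattern = r'^Hello.*Demo$'

single_line = 'Hello 1234567 World_This is a Regex Demo'
no_gap = 'HelloDemo'                                       # nothing between Hello and Demo
multi_line = 'Hello 1234567 World_This\nis a Regex Demo'   # newline in the middle

print(re.match(pattern, single_line) is not None)   # True
print(re.match(pattern, no_gap) is not None)        # True: .* may match zero characters
print(re.match(pattern, multi_line))                # None: . cannot cross the newline
```

So .* is only "universal" within a single line; a string containing a newline needs different handling.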
...
That's it for today. We are far from finished, but I have to leave tomorrow's me something to do. No wait, the day after tomorrow's me. Update: that later me has come back and attached the portal to the second half.

Origin blog.csdn.net/szshiquan/article/details/123671003