Web Crawler Learning - (2) Regular Expressions

Table of contents

1. Introduction to common matching rules

2. Commonly used matching methods

   1. match
   2. search
   3. findall
   4. sub
   5. compile

3. Basic crawler case in practice


1. Introduction to common matching rules

For the commonly used matching rules of regular expressions, refer to the learning website: Regular Expressions – Syntax | Runoob (菜鸟教程).
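
As a quick reference (a brief summary only, not a substitute for the tutorial linked above), the rules used most often in this article are shown below:

import re
# \d  a digit        \w  a letter, digit or underscore        \s  a whitespace character
# .   any character except a newline (with re.S, any character at all)
# *   repeat the previous item zero or more times    +  one or more times    ?  zero or one time
# {n} repeat exactly n times        ^  start of the string        $  end of the string
print(re.match(r'^\d{4}-\w+\s\d+$','2022-abc 123'))  # <re.Match object; span=(0, 12), match='2022-abc 123'>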

2. Commonly used matching methods

1. match

The match method takes the regular expression as its first argument and the string to be matched as its second argument.

match tries to match the pattern from the very beginning of the string: if the pattern matches, it returns the match object; otherwise it returns None. Example:

import re
content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))  # check the length of the string
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}',content)
#result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$',content)  # match all the way to the end of the string
print(result)
print(result.group())  # print the matched text
print(result.span())   # print the span (start, end) of the match

Output:

41
<re.Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)
  • Match target

To extract part of the text, you can use parentheses () to enclose the subexpression you want to extract. The parentheses mark the start and end of a subexpression, and each marked subexpression corresponds to a group in turn. Call the group method with the index of a group to get its extracted content. Example:

import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld',content)
print(result) 
print(result.group())   # print the full match
print(result.group(1))  # print the first result captured by ()
print(result.span())

Output:

<re.Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567
(0, 19)
  • Universal match

The regular expression above is rather tedious: every whitespace character has to be matched with \s and every digit with \d, which becomes a lot of work when there is much content to match. In fact, you can use the universal match ".*", where "." matches any character and "*" means the preceding character may repeat any number of times, so the combination can match an arbitrary run of characters. Continuing the example above, rewrite the regular expression with ".*":

import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('^Hello.*Demo$',content)
print(result) 
print(result.group())  # print the full match
print(result.span())

Output:

<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)
  • Greedy vs. non-greedy matching

import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$',content)
print(result) 
print(result.group(1))

Output:

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7

It can be seen that under greedy matching, only the digit 7 is captured. Under greedy matching, ".*" matches as many characters as possible; it is followed by \d+, which requires at least one digit but no particular number of them, so ".*" greedily consumes "123456" and leaves only the single digit "7" for \d+ to match. The captured group therefore contains nothing but the digit 7.

Obviously, greedy matching can sometimes make part of the expected content go missing inexplicably. Next, compare it with non-greedy matching and see how the two differ:

import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*?Demo$',content)
print(result) 
print(result.group(1))

Output:

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

It can be seen that under non-greedy matching the whole digit string 1234567 is obtained. Contrary to greedy matching, non-greedy matching matches as few characters as possible. In practice, prefer non-greedy matching so that parts of the result are not lost.

However, if the content to be matched sits at the end of the string, it is better to use the greedy ".*", because it matches as much as possible and therefore reaches the end, whereas the non-greedy ".*?" matches as little as possible and may capture nothing at all. A concrete example:

import re
content = 'Hello 1234567 World_This is a Regex Demo'
result1 = re.match(r'^He.*?a\s(.*?)',content)
result2 = re.match(r'^He.*?a\s(.*)',content)
print('result1 (non-greedy match)',result1.group(1))
print('result2 (greedy match)',result2.group(1))

Output:

result1 (non-greedy match) 
result2 (greedy match) Regex Demo
  • Modifiers

re.I: makes the matching case-insensitive

re.M: multi-line matching, affects ^ and $

re.L: locale-aware matching

re.S: makes . match any character, including newlines

re.U: matches characters according to the Unicode character set; this flag affects \w, \W, \b and \B

re.X: verbose mode, which allows a more flexible layout so the regular expression is easier to read
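
As a quick illustration of two of these flags (a minimal sketch, not part of the original tutorial), re.I ignores case and re.M makes ^ match at the start of every line:

import re
text = 'first line\nSecond Line'
print(re.findall('line',text,re.I))   # ['line', 'Line'] — case is ignored
print(re.findall(r'^\w+',text,re.M))  # ['first', 'Second'] — ^ matches at each line start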

When the string contains a newline character, matching it with the same pattern as above raises an error:

import re
content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$',content)
print(result.group(1))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_448/501554157.py in <module>
      4 '''
      5 result = re.match(r'^He.*?(\d+).*?Demo$',content)
----> 6 print(result.group(1))

AttributeError: 'NoneType' object has no attribute 'group'

The run above reports an error: the regular expression does not match the string, so the return value is None, and calling the group method on None raises the AttributeError.

The reason is that . matches any character except a newline, so the match fails as soon as a newline is encountered. Adding the re.S modifier fixes the error:

import re
content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$',content,re.S)
print(result.group(1))

Output:

1234567
  • Escape matching

If the target string contains characters that are also regular-expression metacharacters, such as "." or "(", add a backslash "\" before them in the pattern to match them literally. Example:

import re
content = '(百度)www.baidu.com'
result = re.match(r'\(百度\)www\.baidu\.com',content)
print(result.group())

Output:

(百度)www.baidu.com

2. search

The match method above starts matching from the beginning of the string, so as soon as the beginning fails to match, the whole match fails. This means you always have to account for the content at the start of the target string, which makes matching inconvenient.

The search method, by contrast, scans the entire string and returns the first successful match. In other words, the regular expression only needs to match a part of the string, and there is no need to worry about matching the beginning as with match. When matching, search tries each position of the string in turn until it finds the first matching substring and returns it; if nothing matches after scanning the whole string, it returns None. Change the match method in the code above to search:

import re
content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.search(r'He.*?(\d+).*?Demo',content,re.S)
print(result)
print(result.group(1))

Output:

<re.Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex Demo'>
1234567

For convenience, it is recommended to use the search method wherever possible. Below, search is used in a few more regular-expression examples to extract the corresponding information:

html = '''<div id="songs-list">
<h2 class="title">经典老歌</h2>
<p class="introduction">
经典老歌列表
</p>
<ul id="list" class="list-group">
<li data-view="2">一路上有你</li>
<li data-view="7">
<a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
</li>
<li data-view="4" class="active">
<a href="/3.mp3" singer="齐秦">往事随风</a>
</li>
<li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
<li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
<li data-view="5">
<a href="/6.mp3" singer="邓丽君">但愿人长久</a>
</li>
</ul>
</div>'''
import re
result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>',html,re.S)
if result:
    print(result.group(1),result.group(2))

Output:

齐秦 往事随风

As can be seen from the result, the singer and the song title are extracted from the hyperlink inside the li node whose class is active. What happens if 'active' is removed from the pattern? Rewrite the code as follows:

import re
result = re.search('<li.*?singer="(.*?)">(.*?)</a>',html,re.S)
if result:
    print(result.group(1),result.group(2))

Output:

任贤齐 沧海一声笑

After 'active' is removed, the search again starts from the beginning of the string, and the first node that satisfies the pattern is now the second li node; the later ones are no longer returned, so the result changes. Both of these matches used re.S so that .*? could match newlines. Now remove re.S and rewrite the code as follows:

import re
result = re.search('<li.*?singer="(.*?)">(.*?)</a>',html)
if result:
    print(result.group(1),result.group(2))

Output:

beyond 光辉岁月

After re.S is removed, the matching result becomes the content of the fourth li node, because the second and third li nodes both contain newline characters while the fourth does not, so it is the first one to match successfully.

3. findall

The search method above returns only the first substring that matches the regular expression. If you want all the substrings that match, use the findall method:

html = '''<div id="songs-list">
<h2 class="title">经典老歌</h2>
<p class="introduction">
经典老歌列表
</p>
<ul id="list" class="list-group">
<li data-view="2">一路上有你</li>
<li data-view="7">
<a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
</li>
<li data-view="4" class="active">
<a href="/3.mp3" singer="齐秦">往事随风</a>
</li>
<li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
<li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
<li data-view="5">
<a href="/6.mp3" singer="邓丽君">但愿人长久</a>
</li>
</ul>
</div>'''
results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',html,re.S)
print(type(results),results)
for result in results:
    print(result)  # each result is a tuple of the captured groups
    print(result[0],result[1],result[2])  # print the individual strings

Output:

<class 'list'> [('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
('/2.mp3', '任贤齐', '沧海一声笑')
/2.mp3 任贤齐 沧海一声笑
('/3.mp3', '齐秦', '往事随风')
/3.mp3 齐秦 往事随风
('/4.mp3', 'beyond', '光辉岁月')
/4.mp3 beyond 光辉岁月
('/5.mp3', '陈慧琳', '记事本')
/5.mp3 陈慧琳 记事本
('/6.mp3', '邓丽君', '但愿人长久')
/6.mp3 邓丽君 但愿人长久

As the result shows, findall returns a list rather than a single match object, and each group of content is obtained by iterating over that list. In general, use the search method if you only need the first match, and the findall method if you need to extract several matches.
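
A compact side-by-side sketch of the difference (illustrative only, using a made-up string):

import re
text = 'a1 b22 c333'
print(re.search(r'\w(\d+)',text).group(1))  # '1' — only the first match
print(re.findall(r'\w(\d+)',text))          # ['1', '22', '333'] — every match, as a list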

4. sub

Besides extracting information with regular expressions, you can also use them to modify text. For example, to remove every digit from a string of text, the replace method would be cumbersome; the sub method handles it directly. Its first parameter is the regular expression matching the content to be replaced, the second parameter is the replacement content, and the third parameter is the original string. Example:

import re
content = "121fef342dkjdsd87f7hiud3334"
content = re.sub(r'\d+','',content)
print(content)

Output:

fefdkjdsdfhiud
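
A related sketch (an extra illustration beyond the original example): the replacement string can refer back to captured groups, which is handy when you want to rewrite matched text rather than delete it:

import re
content = 'price: 120, discount: 15'
print(re.sub(r'(\d+)',r'[\1]',content))  # price: [120], discount: [15] — \1 refers to group 1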

5. compile

compile compiles a regular-expression string into a pattern object, so the same pattern can be reused in subsequent matches. Example:

import re
content1 = '2022/7/23晴'
content2 = '2022/7/24小雨'
content3 = '2022/7/25雷阵雨'
pattern = re.compile(r'\d{4}/\d{1}/\d{2}')
result1 = re.sub(pattern,'',content1)
result2 = re.sub(pattern,'',content2)
result3 = re.sub(pattern,'',content3)
print(result1,result2,result3)
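
Output:

晴 小雨 雷阵雨

A compiled pattern object also has its own match, search, findall and sub methods, so the calls above could equally go through the pattern itself (a minimal sketch of the same idea):

pattern = re.compile(r'\d{4}/\d{1}/\d{2}')
print(pattern.sub('','2022/7/23晴'))                # 晴
print(pattern.findall('2022/7/23晴 2022/7/24小雨'))  # ['2022/7/23', '2022/7/24']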

3. Basic crawler case in practice

The URL to crawl is Douban Reading (https://book.douban.com/), and the content to extract is each book's hyperlink, title, author and publication date.

import re
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
content = requests.get('https://book.douban.com/',headers = headers).text  
pattern = re.compile('<li.*?cover">.*?href="(.*?)".*?title="(.*?)".*?more-meta">.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',re.S)
results = re.findall(pattern,content)
for result in results:
    url,name,author,date = result
    author = re.sub(r'\s','',author)  # strip whitespace and newlines around the author
    date = re.sub(r'\s','',date)      # strip whitespace and newlines around the date
    print(url,name,author,date)

Crawling web content with regular expressions is relatively inefficient; libraries such as pyquery and BeautifulSoup, to be covered later, are more convenient.
