Python crawler learning 23

3. Regular expressions

4. findall method

Previously we looked at what the search and match methods do and where they fall short. As we saw yesterday, search avoids the restriction that match must match from the beginning of the string, but search has a significant limitation of its own: it only returns the first match.

If we want every match rather than just the first, we need findall (as the name suggests, it finds all matches).
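
To make the difference concrete, here is a minimal sketch comparing the three functions on a short made-up string (the sample text is an assumption for illustration, not taken from the page we are crawling):

```python
import re

text = "a1 b22 c333"

# match only succeeds if the pattern matches at the very start of the string
match_result = re.match(r"\d+", text)
print(match_result)            # None, because the string starts with "a"

# search scans forward but stops at the first hit
search_result = re.search(r"\d+", text).group()
print(search_result)           # 1

# findall collects every non-overlapping match, left to right
findall_result = re.findall(r"\d+", text)
print(findall_result)          # ['1', '22', '333']
```

Note that findall returns plain strings here rather than match objects, so there is no .group() call.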

Taking yesterday's HTML text again, let's match all the novel categories in it:

<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>

Apply the findall method:

# findall method
import re

html = """
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""

result = re.findall('href.*?xiaoshuo/">(.*?)<', html, re.S)
print(type(result))  # the output shows that findall returns a list
for i, item in enumerate(result):
    print(i, ':', item)

Output: the type is printed as <class 'list'>, followed by the indexed category names whose links end in xiaoshuo/, from 玄幻小说 to 科幻小说.

If the regular expression contains more than one capture group:

# findall method
import re

html = """
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""

result = re.findall('href.*?/(.*?)/">(.*?)<', html, re.S)
print(type(result))
for i, item in enumerate(result):
    print(i, ':', item)

Result: each match becomes a tuple with one element per capture group, and the tuples are collected into a list.
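
The shape of what findall returns depends on how many capture groups the pattern has. The following sketch illustrates this on a shortened, made-up fragment of the navigation HTML; the patterns are simplified for the example and are not the ones used above:

```python
import re

snippet = '<a href="/dushixiaoshuo/">都市小说</a><a href="/kehuanxiaoshuo/">科幻小说</a>'

# No group: each list item is the whole matched text
no_group = re.findall(r'href="/\w+/">[^<]*<', snippet)

# One group: each list item is the text captured by that group
one_group = re.findall(r'href="/\w+/">([^<]*)<', snippet)

# Two groups: each list item is a tuple (group 1, group 2)
two_groups = re.findall(r'href="/(\w+)/">([^<]*)<', snippet)

print(no_group)
print(one_group)    # ['都市小说', '科幻小说']
print(two_groups)   # [('dushixiaoshuo', '都市小说'), ('kehuanxiaoshuo', '科幻小说')]
```

Because each item is a tuple in the two-group case, it can be unpacked directly in a loop, for example `for path, name in two_groups:`.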

5. sub

Beyond basic matching, if we want to modify a string, we can use the sub method:

import re

content = 'as5af6fa5fa6fsfa'
# Suppose we want to strip all the digits; besides str.replace, we can also do this with sub
result = re.sub(r'\d+', '', content)
print(result)

Output:

asaffafafsfa
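
As a small sketch of how sub relates to str.replace, the example below reuses the same string. The variable names are made up for illustration; count is a standard keyword parameter of re.sub that limits the number of replacements:

```python
import re

content = 'as5af6fa5fa6fsfa'

# Remove every run of digits
no_digits = re.sub(r'\d+', '', content)

# str.replace needs one call per literal digit, so it only works
# when we know in advance which digits appear
via_replace = content.replace('5', '').replace('6', '')

# Replace each digit with a placeholder instead of deleting it
masked = re.sub(r'\d', '#', content)

# count=1 stops after the first replacement
first_only = re.sub(r'\d+', '', content, count=1)

print(no_digits)     # asaffafafsfa
print(masked)        # as#af#fa#fa#fsfa
print(first_only)    # asaf6fa5fa6fsfa
```

The advantage of sub is that the pattern describes a whole class of text (here, any digits), whereas replace only handles fixed substrings.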

Going back to the earlier HTML text: if we want to extract all the Chinese text, writing a single regular expression for it can be troublesome (although here it happens to be simple). In such cases we can combine sub and findall:

import re

html = """
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""
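# Strip the opening <a ...> tags first. Flags must be passed by keyword here:
# the fourth positional argument of re.sub is count, not flags.
result = re.sub('<a.*?>', '', html, flags=re.S)
print(result)
# What remains looks like <li>text</a>, which is easy to match
result1 = re.findall('<li>(.*?)</a>', result, re.S)
for item in result1:
    print(item)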

After sub, the opening <a ...> tags are gone and each entry looks like <li>首页</a></li>.

Running findall on that text then prints the link text of each entry, one per line, from 首页 to 全部小说.
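
One detail worth keeping in mind with re.sub: its signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag such as re.S placed in the fourth position is interpreted as count rather than as a flag. The sketch below, on a made-up string, shows what that amounts to; count=int(re.S) is written out explicitly only to make the effect visible:

```python
import re

text = "<b>x\ny</b> and <b>z</b>"

# What a positional re.S effectively means: count=16, and no flags at all.
# Without re.S the dot cannot cross the newline, so the first tag pair survives.
as_count = re.sub(r"<b>.*?</b>", "", text, count=int(re.S))

# Passing the flag by keyword gives the intended behaviour
as_flags = re.sub(r"<b>.*?</b>", "", text, flags=re.S)

print(int(re.S))        # 16
print(repr(as_count))   # '<b>x\ny</b> and '
print(repr(as_flags))   # ' and '
```

re.findall, re.search and re.match take flags as their third parameter, which is why passing re.S positionally works for them but not for re.sub.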

6. compile

The compile method compiles a regular expression string into a reusable regular expression object:

import re

text1 = '2020-03-28 20:00'
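# Suppose we have several different pieces of HTML and want to match the same kind of element in each (for example, Chinese song titles). Every call would then need the same expression, plus modifiers such as re.S.
# It is simpler to bundle them into one object so the code can be reused later.
pattern = re.compile(r'\d{2}:\d{2}', re.S)
# Once compiled, the object can be reused instead of retyping '\d{2}:\d{2}' each time
result = re.sub(pattern, '', text1)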
print(result)

Output: 2020-03-28 (the date followed by a trailing space, since only the time part is removed).
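
A compiled pattern also has its own findall, search and sub methods, so it can be applied to many strings without passing the expression each time. Here is a minimal sketch, assuming a few made-up timestamps in the same format as text1:

```python
import re

# Compile once, reuse everywhere
time_pattern = re.compile(r"\d{2}:\d{2}")

texts = ["2020-03-28 20:00", "2020-03-29 08:30", "2020-03-30 23:15"]

# Extract the time from each string
times = [time_pattern.search(t).group() for t in texts]

# Remove the time and trim the leftover trailing space
dates = [time_pattern.sub("", t).strip() for t in texts]

print(times)   # ['20:00', '08:30', '23:15']
print(dates)   # ['2020-03-28', '2020-03-29', '2020-03-30']
```

Calling time_pattern.sub("", t) is equivalent to re.sub(time_pattern, "", t); the method form just keeps the pattern and its flags in one place.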

That concludes our study of regular expressions. Doesn't it feel like our knowledge has grown quite a bit?

That's all for today. To be continued...

Origin blog.csdn.net/szshiquan/article/details/123804997