python crawler learning 22

3. Regular expressions

3. search method

Earlier we learned the match method of regular expressions. Recall its premise: match starts matching at the beginning of the string, and if the beginning does not match, the whole match fails:

# Limitation of match
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo Extra    stings'
# Raw string so that \d reaches the regex engine unchanged
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

Result: the expression itself is fine, but nothing is matched, so None is printed.

We can conclude that match is only usable when we know how the target string begins. In practice it is therefore better suited to checking whether a string conforms to a given regular expression rule.
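As a small illustration of that use case, here is a minimal sketch that uses match to check whether a string follows a rule. The function name looks_like_date and the YYYY-MM-DD rule are assumptions made up for this example, not part of the original tutorial.

```python
import re


def looks_like_date(text):
    # Hypothetical rule: four digits, dash, two digits, dash, two digits.
    # match anchors at the start of the string; $ anchors at the end.
    return re.match(r'\d{4}-\d{2}-\d{2}$', text) is not None


print(looks_like_date('2022-03-27'))        # conforms to the rule
print(looks_like_date('date: 2022-03-27'))  # does not start with the date
```

The second call fails only because the string does not begin with the date, which is exactly the limitation described above.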

To solve this problem we rely on the search method. It scans the string from left to right and returns the first successful match; if no position in the string matches, it returns None. This means the regular expression only needs to describe a part of the string:

# Simple use of search
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo Extra    stings'
# Raw string so that \d reaches the regex engine unchanged
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result.group(1))

Result: 1234567
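Because search returns None when nothing matches, calling group on the result directly would raise an AttributeError in that case. A minimal sketch of a guard follows; the helper name first_number is an assumption for this example.

```python
import re


def first_number(text):
    # Hypothetical helper: return the digits between 'Hello' and 'Demo',
    # or None when the pattern is not found anywhere in the text.
    result = re.search(r'Hello.*?(\d+).*?Demo', text)
    if result is None:
        return None
    return result.group(1)


print(first_number('Extra strings Hello 1234567 World_this is a Regex Demo'))
print(first_number('no greeting here'))
```

Checking for None first keeps a crawler from crashing on a page that does not contain the expected text.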

Now let's try the search method on a real example. Here is a section of HTML:

<div class="nav">
	<ul>
		<li><a href="https://www.qbiqu.com/">首页</a></li>
		<li><a href="/modules/article/bookcase.php">我的书架</a></li>
		<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
		<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
		<li><a href="/dushixiaoshuo/">都市小说</a></li>
		<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
		<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
		<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
		<li><a href="/paihangbang/">排行榜单</a></li>
		<li><a href="/wanben/1_1">完本小说</a></li>
		<li><a href="/xiaoshuodaquan/">全部小说</a></li>
		<li><script type="text/javascript">yuedu();</script></li>
	</ul>
</div>
<div id="banner" style="display:none"></div>
<div class="dahengfu"><script type="text/javascript">list1();</script></div>

Inside the ul node there are many li nodes; some of them contain an a node and some do not. Each a node has its own attributes. Now we want to use the search method to extract a novel category name from this HTML:

Looking closely, the string 'xiaoshuo/"' appears right before the novel category names we are after, so we must remember to include it in the regular expression:

import re

html = """
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
		"""
# The html contains many line breaks, so don't forget the re.S modifier
result = re.search('href.*?xiaoshuo/">(.*?)<', html, re.S)
print(result.group(1))

Result: 玄幻小说

The match succeeds.
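To see what the re.S modifier contributes, here is a minimal sketch on a shortened, made-up snippet. The pattern below is an assumption chosen so that it has to cross a line break; it is not the pattern used in the tutorial.

```python
import re

# Shortened stand-in for the page above (an assumption for this example)
snippet = '<ul>\n<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>\n</ul>'

# The pattern starts at <ul> and must cross a newline to reach href
pattern = r'<ul>.*?href="(.*?)"'

without_flag = re.search(pattern, snippet)     # '.' stops at '\n'
with_flag = re.search(pattern, snippet, re.S)  # '.' also matches '\n'

print(without_flag)
print(with_flag.group(1))
```

Without re.S the dot cannot match the newline after <ul>, so the search finds nothing; with re.S the same pattern reaches the href on the next line.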

What if we drop that string and write the pattern like this instead:

'href.*?>(.*?)<'

What will the result be? Don't scroll down yet; think about it carefully.

...

...

import re

html = """
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
		"""

result = re.search('href.*?>(.*?)<', html, re.S)
print(result.group(1))

Result: 首页

Without 'xiaoshuo/"' in the pattern, search returns the first place in the string that satisfies the expression, and that place is >首页< in the first li node.
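The two patterns can be compared side by side. This is a minimal sketch on a shortened single-line snippet (an assumption for this example), printing the whole matched text with group(0) to show where each match begins and ends.

```python
import re

# Shortened stand-in for the page above (an assumption for this example)
snippet = ('<li><a href="https://www.qbiqu.com/">首页</a></li>'
           '<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>')

loose = re.search(r'href.*?>(.*?)<', snippet)
strict = re.search(r'href.*?xiaoshuo/">(.*?)<', snippet)

print(loose.group(0))   # whole text matched by the loose pattern
print(loose.group(1))   # captured text of the loose pattern
print(strict.group(1))  # captured text of the stricter pattern
```

Both searches start at the first href; the lazy .*? simply expands until the rest of the pattern can match, so the extra literal 'xiaoshuo/"' is what pushes the capture to the second link.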

The answer is easy to work out with what we have learned.

Today we learned the search method of regular expressions, but it only returns the first result that matches. If we want all the results, search alone cannot give them to us, so how do we solve that?

That's all for today; we will continue in the next chapter.

Origin blog.csdn.net/szshiquan/article/details/123781543