python crawler learning 26

Originally, I wanted to jump straight into a hands-on project. But when I tried last night, I found that many websites have anti-crawling measures in place, and my crawler didn't work on them. So let's keep studying in more depth first and leave those tricky things for later.

Without further ado, let's get to the point.

So far we have learned how to write the most basic crawlers, and we have also learned regular expressions. But don't you feel that writing regular expressions to match the HTML text of a web page is still rather troublesome and inconvenient?

Well, that feeling is exactly what pushes us to improve our skills so that we can be lazier.

You may have noticed the format of the HTML text while writing those regular expressions. The nodes of an HTML document are arranged in a hierarchy, and we can use that hierarchy to locate the content we want.

5. Using XPath

XPath, short for XML Path Language, is a language for finding information in XML documents. Although it was designed for XML, it works just as well on HTML documents.

5.1 Installing the XPath tools

Here we use the lxml library. First, equip the weapon:

pip3 install lxml
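
A quick way to confirm that the installation worked, as a minimal sketch: import the etree module and print the version tuple that lxml exposes as LXML_VERSION. If the import fails, the package was not installed into the interpreter you are running.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml version
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```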

5.2 Common XPath rules

[Image: table of common XPath rules]

For example:

//title[@lang='eng']

This selects every node named title whose lang attribute has the value eng.
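
To see this rule in action, here is a minimal sketch. The bookstore document and its titles are made up purely for illustration; only the XPath expression comes from the text above.

```python
from lxml import etree

# A tiny made-up XML document with two title nodes
doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title></book>"
    "<book><title lang='fr'>Le Petit Prince</title></book>"
    "</bookstore>"
)

# Select title nodes at any depth whose lang attribute equals 'eng'
matches = doc.xpath("//title[@lang='eng']")
titles = [node.text for node in matches]
print(titles)

# Select the lang attribute of every title node
langs = doc.xpath("//title/@lang")
print(langs)
```

The part in square brackets is a filter: without it, //title would return both title nodes.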

5.3 An introductory example

We reuse the piece of HTML from the earlier regex lesson:

from lxml import etree
text = """
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""
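# The snippet is only part of an HTML page, so it has to be repaired before it can be processed
# Initialize: etree.HTML() parses the text and fills in the missing nodes
html = etree.HTML(text)
# tostring() serializes the repaired HTML and returns it as bytes
result = etree.tostring(html)
# decode() converts the bytes to str
print(result.decode('utf-8'))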

Running the code shows that the snippet has been repaired: the output HTML now includes the body and html nodes that were missing.
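
The completion can also be checked in code rather than by reading the printed text. This is a small sketch with a shortened, illustrative fragment; it only inspects the tag names of the tree that etree.HTML() returns.

```python
from lxml import etree

# A shortened fragment: no html or body nodes
fragment = '<div class="nav"><ul><li><a href="/wanben/1_1">link</a></li></ul></div>'

root = etree.HTML(fragment)

# The root of the parsed tree is the html node that lxml added
root_tag = root.tag
# Its direct children: the body node that wraps our div
child_tags = [child.tag for child in root]

print(root_tag)
print(child_tags)
```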

If the HTML is stored in a file, it can also be parsed directly from the file. Here the file text.html contains the same HTML text as before:

from lxml import etree

html = etree.parse('./text.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

Output: for some reason the result looks strange here, and it differs from what the book shows.

Still, the approach works. Compared with parsing the string directly in the program, the output has one extra DOCTYPE declaration, but that has no effect on the parsed result.
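
If the strange part is that the Chinese text shows up as garbled characters or as numeric codes like &#...;, one possible cause is encoding. The sketch below assumes the file is saved as UTF-8. It passes the encoding to HTMLParser and asks tostring() for a str via encoding='unicode'. The temporary file is only a stand-in for text.html so that the example runs on its own.

```python
import os
import tempfile
from lxml import etree

snippet = '<ul><li><a href="/wanben/1_1">完本小说</a></li></ul>'

# Write the snippet to a temporary UTF-8 file (a stand-in for text.html)
with tempfile.NamedTemporaryFile('w', suffix='.html', encoding='utf-8',
                                 delete=False) as f:
    f.write(snippet)
    path = f.name

try:
    # Tell the parser which encoding the file uses (assumed to be UTF-8)
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse(path, parser)
    # encoding='unicode' makes tostring() return str with readable characters
    readable = etree.tostring(tree, encoding='unicode')
finally:
    os.remove(path)

print(readable)
```

With encoding='unicode' the result is already a str, so no decode() call is needed.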

5.4 All nodes

An XPath rule that starts with // selects all matching nodes anywhere in the document, no matter how deeply they are nested.

from lxml import etree

html = etree.parse('./text.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

Here * matches nodes of any name, so we get every node in the HTML text. The result is printed as a list of Element objects.
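
Printing Element objects directly is not very readable, so it can help to print their tag names instead. A minimal sketch, using a shortened illustrative fragment in place of text.html:

```python
from lxml import etree

fragment = '<ul><li><a href="/a/">A</a></li><li><a href="/b/">B</a></li></ul>'
root = etree.HTML(fragment)

# //* matches every element node; results come back in document order
nodes = root.xpath('//*')
tags = [node.tag for node in nodes]
print(tags)
```

The list includes the html and body nodes that lxml added while repairing the fragment.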

To match a specific node:

# Match a specific node
from lxml import etree

html = etree.parse('./text.html', etree.HTMLParser())
# Here we match all a nodes
result = html.xpath('//a')
print(result)

The output shows that all a nodes are matched.
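
Each matched node is an Element, so its attributes and text can be read directly. The sketch below uses two of the links from our snippet and reads them with the Element methods get() and text; it is one simple way to inspect the matches, not the only one.

```python
from lxml import etree

fragment = '''
<ul>
  <li><a href="/paihangbang/">排行榜单</a></li>
  <li><a href="/wanben/1_1">完本小说</a></li>
</ul>
'''
root = etree.HTML(fragment)

# For every a node, collect its href attribute and its text
links = [(a.get('href'), a.text) for a in root.xpath('//a')]
for href, label in links:
    print(href, label)
```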

That's all for today. To be continued tomorrow.

Origin blog.csdn.net/szshiquan/article/details/123978598