Python Web Crawler from Beginner to Advanced (Part 11)

In the previous article we gave a brief introduction to the XPath module. In this one we will use XPath to crawl the jokes from Qiushibaike (糗事百科).

We have already crawled Qiushibaike once with the re module, so we only need to make a few changes on top of that code. For the sake of a complete project, though, we will go through it again from the start.

The URL we want to crawl is https://www.qiushibaike.com/text/page/1/.

Using the Chrome extension XPath Helper, we can work out the XPath expression for the content we want: //div[@class="content"]/span[1]

Then we can get the text inside it with text(): //div[@class="content"]/span[1]/text()
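Before wiring this into the crawler, we can sanity-check the expression with lxml on a small hand-written HTML fragment. The snippet below is a made-up sample that only mimics the structure of the page, not the real page source:

from lxml import etree

# Made-up fragment mimicking the page structure, for testing the XPath only
sample_html = """
<div class="content">
    <span>First joke text
</span>
    <span class="contentForAll">查看全文</span>
</div>
"""

selector = etree.HTML(sample_html)
# text() returns only the text nodes of the first <span>, as a list of strings
print(selector.xpath('//div[@class="content"]/span[1]/text()'))

With that confirmed, the full single-page crawler looks like this: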

import urllib.request
from lxml import etree
import ssl

# Disable SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

url = "https://www.qiushibaike.com/text/page/1/"
# User-Agent header
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
headers = {'User-Agent': user_agent}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
# Get the HTML source of the page as a string
html = response.read().decode('utf-8')
# Parse the HTML string into an HTML document
selector = etree.HTML(html)
content_list = selector.xpath('//div[@class="content"]/span[1]/text()')
print(content_list)

The output is:

From the output we can see that we have obtained the data we want, and that it is a list. All that is left is to loop over the list and write each joke to a local file.

for item in item_list:
    item = item.replace("\n", "")
    self.writePage(item)

In the code above, item_list is the content_list we obtained earlier. When we previously extracted the data with the re module, analysing the list showed extra content such as <span>, <span class="contentForAll">查看全文</span> ("view full text"), <br/> and \n. With XPath only the \n is left over, so we replace it with an empty string using the replace method; what remains is exactly the content we want, and all that is left is to save it locally.
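If you want to be slightly more defensive about whitespace than a single replace("\n", "") call, you can also trim leading and trailing spaces. This is just an optional variation; the cleanItem helper below is made up for illustration and is not part of the original code:

def cleanItem(item):
    # Drop internal newlines and trim leading/trailing whitespace
    return item.replace("\n", "").strip()

print(cleanItem("\n  A short joke  \n"))  # prints "A short joke"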

The code above already gives us a simple crawler for Qiushibaike jokes, but it can only crawl a single page. Looking at the URL, the trailing 1 in https://www.qiushibaike.com/text/page/1/ is the page number, so we can use it to crawl further pages one by one. The final code is as follows:

import urllib.request
from lxml import etree
import ssl

# Disable SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context


class Spider:
    def __init__(self):
        # Initialize the starting page number
        self.page = 1
        # Crawl switch: keep crawling while it is True
        self.switch = True

    def loadPage(self):
        """
            Fetch and parse one page
        """
        url = "https://www.qiushibaike.com/text/page/" + str(self.page) + "/"
        # User-Agent header
        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
        headers = {'User-Agent': user_agent}
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        # Get the HTML source of the page as a string
        html = response.read().decode('utf-8')
        # Parse the HTML string into an HTML document
        selector = etree.HTML(html)
        content_list = selector.xpath('//div[@class="content"]/span[1]/text()')
        # Call dealPage() to clean up the extracted jokes
        self.dealPage(content_list)

    def dealPage(self, item_list):
        """
            @brief Clean the list of jokes we obtained
            @param item_list the list of jokes
        """
        for item in item_list:
            item = item.replace("\n", "")
            self.writePage(item)

    def writePage(self, text):
        """
            @brief Append the data to a file
            @param text the content to write
        """
        myFile = open("./qiushi.txt", 'a')  # open the file in append mode
        myFile.write(text + "\n\n")
        myFile.close()

    def startWork(self):
        """
            Control the crawler
        """
        # Loop until self.switch == False
        while self.switch:
            # The user decides whether to keep crawling
            self.loadPage()
            command = input("Press Enter to crawl the next page (type quit to exit): ")
            if command == "quit":
                # Typing quit stops the crawl
                self.switch = False
            # Increment the page number on each loop
            self.page += 1
        print("Crawling finished!")


if __name__ == '__main__':
    # Create a Spider object
    qiushiSpider = Spider()
    qiushiSpider.startWork()
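One small improvement worth mentioning for writePage: opening the file with a with statement guarantees it is closed even if the write raises an error, and passing an explicit encoding avoids platform-dependent defaults when writing Chinese text. A minimal sketch of that variation, behavior otherwise unchanged:

    def writePage(self, text):
        """
            @brief Append the data to a file
            @param text the content to write
        """
        # The with statement closes the file automatically, even on errors
        with open("./qiushi.txt", 'a', encoding='utf-8') as myFile:
            myFile.write(text + "\n\n")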

Running the crawler will create a qiushi.txt file locally containing the scraped jokes, one per block.

 


Originally published at www.cnblogs.com/weijiutao/p/10880805.html