This article is a set of study notes on the tutorial linked below: Getting started with python3 web crawler in one hour.
The original notes were taken under Windows; all operations in this article are performed under Ubuntu.
The general idea of a crawler boils down to two steps:
- Get HTML information of a webpage
- Parse HTML information and extract what we really need
I. Introduction
II. Introduction to web crawlers
1. Inspect elements
In Chrome, press F12 to open the developer tools and inspect a page's HTML.
2. Simple example
A web crawler fetches the HTML of a page from a given URL. In Python 3, either requests or urllib.request can be used to fetch a page:
- The urllib library is built into Python; no extra installation is required
- requests is a third-party library and must be installed separately
(1) Installing requests on Ubuntu (the Python 3 package is python3-requests; pip3 install requests also works):
sudo apt-get install python3-requests
(2) Simple example
# The base method that constructs a request; all methods below build on it
requests.request()
# Fetch an HTML page; corresponds to HTTP GET
requests.get()
# Fetch only the page's header information; corresponds to HTTP HEAD
requests.head()
# Submit a POST request to a page; corresponds to HTTP POST
requests.post()
# Submit a PUT request to a page; corresponds to HTTP PUT
requests.put()
# Submit a partial-modification request; corresponds to HTTP PATCH
requests.patch()
# Submit a deletion request; corresponds to HTTP DELETE
requests.delete()
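All of these helpers funnel into requests.request() with the matching HTTP verb. Without touching the network, you can see which verb each helper would send by building a prepared request; the URL below is just a placeholder for illustration:

```python
import requests

# Build requests without sending them; .prepare() resolves the
# final method and URL that would go over the wire.
for verb in ('GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE'):
    prepared = requests.Request(verb, 'http://example.com/').prepare()
    print(prepared.method, prepared.url)
```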
A GET request, as the name suggests, obtains data from the server. Here is an example:
# -*- coding:UTF-8 -*-
import requests

if __name__ == '__main__':
    target = 'http://gitbook.cn/'
    req = requests.get(url=target)  # req holds the response we fetched
    print(req.text)
The following is the HTML information captured after executing the above program:
III. Crawler in practice
1. Novel download
(1) Background
Target website: http://www.biqukan.com/
This is a novel website. The goal this time is to crawl and save a novel called 一念永恒 (A Will Eternal).
(2) A first attempt
First, crawl the content of the novel's first chapter. Just modify the code written earlier and run it, as follows:
# -*- coding:UTF-8 -*-
import requests

if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    print(req.text)
Running the code, you will find that what you get is the novel's content mixed with all kinds of HTML tags. The next goal is to extract the novel text and filter out these useless tags.
(3) Beautiful Soup
There are many ways to extract what we really need, such as regular expressions, Xpath, Beautiful Soup, etc. Beautiful Soup is used here.
Beautiful Soup is a third-party library. To install Beautiful Soup 4 on Ubuntu (the Python 3 package is python3-bs4):
sudo apt-get install python3-bs4
To check whether the installation succeeded, run the following import; if it raises no error, Beautiful Soup is ready:
from bs4 import BeautifulSoup
Inspecting the page shows that a div tag stores the body text of the chapter, so the goal now is to extract the content of that div.
Here the div carries two attributes, class and id. The id is the unique identifier of the div, while class assigns the element one or more class names.
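As a minimal offline sketch of the difference between the two attributes (the HTML snippet below is invented for illustration, and the built-in html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

# A tiny invented page: one div identified by id, located via its class.
html = '<div class="showtxt" id="content">正文内容</div>'
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.find_all('div', class_='showtxt')  # class_ avoids the keyword clash
by_id = soup.find('div', id='content')             # id is unique, so find() is enough
print(by_class[0].text)  # -> 正文内容
print(by_id.text)        # -> 正文内容
```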
The code to extract the body of the novel is as follows:
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    # find_all returns every div whose class attribute is showtxt.
    # The first argument is the tag name; the keyword class_ matches the
    # class attribute (class is a Python keyword, hence the underscore).
    texts = bf.find_all('div', class_='showtxt')
    # decode() renders the tag as readable (Chinese) text; without it the
    # output is a pile of escape codes
    print(texts[0].decode())
As the output shows, some other HTML tags remain in the content, such as <br>.
The next step is to remove these unwanted tags and also delete some unneeded spaces. The code is as follows:
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    # Grab every div whose class attribute is showtxt (class_ avoids the
    # clash with the Python keyword class)
    texts = bf.find_all('div', class_='showtxt')
    # Each paragraph is indented with eight non-breaking spaces ('\xa0');
    # replace each run with two newlines to restore paragraph breaks
    print(texts[0].text.replace('\xa0'*8, '\n\n'))
After running the code, the chapter text prints cleanly, broken into paragraphs.
HTML represents a non-breaking space with the entity &nbsp; (remember the trailing semicolon), which reaches Python as the character '\xa0'. The last line of the code above replaces each run of eight of these spaces with two newlines, turning the indentation into paragraph breaks.
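The replacement can be verified on a made-up string without fetching anything:

```python
# Biquge indents each paragraph with eight non-breaking spaces
# (&nbsp; in HTML arrives as '\xa0' in Python).
raw = '\xa0' * 8 + '第一段' + '\xa0' * 8 + '第二段'
clean = raw.replace('\xa0' * 8, '\n\n')
print(clean)  # each indent run becomes a blank line before the paragraph
```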
So far, we have been able to grab the content of one chapter of the novel and display it in segments. The next goal is to download the entire novel.
Through the inspect-element tool, we can see that all the chapter titles of the target novel live under the <div class="listmain"> tag. The individual chapters sit inside <dd><a></a></dd> tags within that div. In HTML, the <a></a> tag stores a hyperlink, and the link address lives in its href attribute.
Next, grab the novel's table of contents. The code is as follows:
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/'
    req = requests.get(url=target)
    html = req.text
    div_bf = BeautifulSoup(html, 'lxml')
    div = div_bf.find_all('div', class_="listmain")
    print(div[0])
The crawling results are as follows:
Next, go through each captured <a></a> tag and extract the chapter name and chapter link. Taking the first chapter as an example, the content of its tag is as follows:
<a href="/1_1094/5403177.html">第一章 他叫白小纯</a>
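This single tag can be parsed in isolation to preview the extraction (using the built-in html.parser, no network needed):

```python
from bs4 import BeautifulSoup

# The exact tag shown above, parsed on its own.
tag_html = '<a href="/1_1094/5403177.html">第一章 他叫白小纯</a>'
a = BeautifulSoup(tag_html, 'html.parser').find('a')
print(a.get('href'))  # -> /1_1094/5403177.html
print(a.string)       # -> 第一章 他叫白小纯
```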
For each match a returned by Beautiful Soup, a.get('href') retrieves the value of the href attribute and a.string retrieves the chapter name. The code is as follows:
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    server = 'http://www.biqukan.com'
    target = 'http://www.biqukan.com/1_1094/'
    req = requests.get(url=target)
    html = req.text
    div_bf = BeautifulSoup(html, 'lxml')
    div = div_bf.find_all('div', class_="listmain")
    a_bf = BeautifulSoup(str(div[0]), 'lxml')
    a = a_bf.find_all('a')
    for each in a:
        print(each.string, server + each.get('href'))
Code execution result:
Now the chapter name and chapter link of each chapter are available. The next step is to integrate the code and write the obtained content into a text file for storage. The code is as follows:
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests, sys

class downloader(object):
    def __init__(self):
        self.server = 'http://www.biqukan.com'
        self.target = 'http://www.biqukan.com/1_1094/'
        self.names = []   # chapter names
        self.urls = []    # chapter links
        self.nums = 0     # number of chapters

    # Collect the download URL of every chapter
    def get_download_url(self):
        req = requests.get(url=self.target)
        html = req.text
        div_bf = BeautifulSoup(html, 'lxml')
        div = div_bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(div[0]), 'lxml')
        a = a_bf.find_all('a')
        # Skip the first 15 links (the "latest chapters" duplicates)
        self.nums = len(a[15:])
        for each in a[15:]:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))

    # Fetch the body text of one chapter
    def get_contents(self, target):
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, 'lxml')
        texts = bf.find_all('div', class_='showtxt')
        # Replace each run of eight non-breaking spaces with a paragraph break
        texts = texts[0].text.replace('\xa0'*8, '\n\n')
        return texts

    # Append one chapter to the output file
    def writer(self, name, path, text):
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.write('\n\n')

if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('Downloading 《一念永恒》:')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i]))
        sys.stdout.write("  downloaded: %.3f%%" % (i / dl.nums * 100) + '\r')
        sys.stdout.flush()
    print('《一念永恒》 download complete')
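The writer method's append-mode logic can be exercised on its own with a throwaway file (the path below is created just for the demonstration):

```python
import os
import tempfile

def write_chapter(name, path, text):
    # Append one chapter: the title, the body, then a blank line
    # separating it from the next chapter.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.writelines(text)
        f.write('\n\n')

# Write two chapters into a temporary file and read the result back.
path = os.path.join(tempfile.mkdtemp(), 'novel.txt')
write_chapter('第一章', path, '正文……')
write_chapter('第二章', path, '正文……')
with open(path, encoding='utf-8') as f:
    print(f.read())
```

Because the file is opened in append mode ('a'), re-running the downloader against the same path keeps adding chapters to the end rather than overwriting earlier ones.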