Prerequisite: a working knowledge of basic Python. If you have not yet covered basic Python syntax, browse the Python basic syntax notes summary first, then continue with this web crawler material.
Contents
1. Rules for web crawlers
1. Getting started with the Requests library
Method | Description |
---|---|
requests.request() | Constructs a request; the base method that underlies all the methods below |
requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
requests.head() | Fetches the header information of an HTML page; corresponds to HTTP HEAD |
requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |
1.1 The get() method
r = requests.get(url)  # constructs a Request object asking the server for a resource and returns a Response object r
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters appended to the URL, as a dictionary or byte stream; optional
**kwargs: 12 optional parameters that control access
Attribute | Description |
---|---|
r.status_code | The HTTP status code of the response; 200 means success, while codes such as 404 indicate failure |
r.text | The response body as a string, i.e. the page content at the URL |
r.encoding | The response encoding guessed from the HTTP headers |
r.apparent_encoding | The encoding deduced from the content itself (a fallback) |
r.content | The response body in binary form |
Note: if the response headers contain no charset, r.encoding defaults to ISO-8859-1, which cannot decode Chinese text. When r.encoding decodes the page incorrectly, assign r.apparent_encoding to r.encoding and the text will parse correctly.
# Crawling example
import requests
r = requests.get("http://www.baidu.com")
print(r.status_code)        # 200
print(r.encoding)           # ISO-8859-1
# print(r.text)             # cannot be decoded correctly yet
print(r.apparent_encoding)  # utf-8
r.encoding = 'utf-8'        # switch to the correct encoding
print(r.text)               # now decodes successfully
1.2 Requests library exceptions
Exception | Description |
---|---|
requests.ConnectionError | Network connection error, such as a DNS lookup failure or a refused connection |
requests.HTTPError | HTTP error |
requests.URLRequired | A required URL is missing |
requests.TooManyRedirects | The maximum number of redirects was exceeded |
requests.ConnectTimeout | Connecting to the remote server timed out |
requests.Timeout | The request to the URL timed out |
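All of the exceptions in this table derive from requests.RequestException, so a single except clause can catch every failure mode at once. A minimal sketch (the `fetch` helper name is my own, not part of the library):

```python
import requests

# All Requests exceptions derive from requests.RequestException,
# so one except clause covers every failure mode in the table above.
def fetch(url):
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()
        return r.text
    except requests.RequestException as e:
        return f"request failed: {e}"

# The hierarchy can be verified without any network access:
print(issubclass(requests.HTTPError, requests.RequestException))       # True
print(issubclass(requests.ConnectionError, requests.RequestException)) # True
print(issubclass(requests.Timeout, requests.RequestException))         # True
```

Catching requests.RequestException is more robust than a bare except, since it does not swallow unrelated errors such as KeyboardInterrupt.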
1.3 A general code framework for crawling web pages
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"  # the page to crawl
    print(getHTMLText(url))
1.4 The HTTP protocol
HTTP, the Hypertext Transfer Protocol, is a stateless application-layer protocol based on the request-response model.
Operations HTTP performs on resources:
Method | Description |
---|---|
GET | Requests the resource at the URL |
HEAD | Requests only the response headers for the resource at the URL, i.e. its header information |
POST | Appends new data to the resource at the URL |
PUT | Stores a resource at the URL, overwriting the existing resource there |
PATCH | Partially updates the resource at the URL, i.e. changes part of its content |
DELETE | Deletes the resource stored at the URL |
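Each Requests convenience method simply sets the corresponding HTTP verb. This mapping can be inspected offline, without sending anything, by building a requests.Request and preparing it (http://example.com/resource below is a placeholder URL of my own):

```python
import requests

# Build (but do not send) a request for each verb and inspect
# the method the library would put on the wire.
for verb in ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]:
    prepared = requests.Request(verb, "http://example.com/resource").prepare()
    print(prepared.method, prepared.url)
```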
1.5 Analysis of the Requests library's main method
requests.request(method, url, **kwargs)
method: the request method; see the six types in 1.4
**kwargs: optional parameters that control access
Parameter | Description |
---|---|
params | Dictionary or byte sequence, appended to the URL as query parameters |
data | Dictionary, byte sequence, or file object, sent as the request body |
json | JSON data sent as the request body |
headers | Dictionary of custom HTTP headers |
cookies | Dictionary or CookieJar, the request's cookies |
auth | Tuple, enables HTTP authentication |
files | Dictionary, for uploading files |
timeout | Timeout in seconds |
proxies | Dictionary of proxy servers to use; may include login credentials |
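A hedged sketch of how two of these parameters are encoded: preparing (not sending) a request shows params becoming the query string and headers landing on the request. The URL http://example.com/s is a placeholder, not a real endpoint:

```python
import requests

# Build the request offline and inspect how kwargs are encoded.
req = requests.Request(
    "GET",
    "http://example.com/s",
    params={"wd": "Python"},                # appended to the URL as a query string
    headers={"user-agent": "Mozilla/5.0"},  # custom HTTP header
)
prepared = req.prepare()
print(prepared.url)                    # http://example.com/s?wd=Python
print(prepared.headers["user-agent"])  # Mozilla/5.0
```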
2. Robots protocol
Purpose: the website tells crawlers which pages may and may not be crawled.
Format: a robots.txt file in the website's root directory.
Example: http://www.baidu.com/robots.txt
Using the Robots protocol:
Web crawlers should identify robots.txt, automatically or manually, and crawl accordingly.
Binding force: the Robots protocol is advisory rather than binding. A crawler may ignore it, but doing so carries legal risk.
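The standard library's urllib.robotparser can do this check for you. A minimal sketch: instead of downloading a real robots.txt, it parses a hypothetical file inline ("MyCrawler" and the rules are made up for illustration); in practice you would call rp.set_url(".../robots.txt") followed by rp.read():

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that forbids /private/ but allows everything else.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/a"))   # False
```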
3. Requests library crawling in practice
Simulate a browser when visiting a website:
kv = {"user-agent": "Mozilla/5.0"}
url = ""
r = requests.get(url, headers=kv)
3.1 Submitting Baidu search keywords
import requests

# Submitting a Baidu search keyword
# Baidu search URL format: http://www.baidu.com/s?wd=keyword
kv = {"wd": "Python"}  # the keyword to search for
r = requests.get("http://www.baidu.com/s", params=kv)
print(r.status_code)
print(r.request.url)  # check that the submitted URL is the intended one
print(len(r.text))    # length of the response
3.2 Downloading and storing web images
import requests
import os

url = "http://img0.dili360.com/pic/2022/03/28/624109135e19b9603398103.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]  # file name taken from the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, "wb") as f:  # the with block closes the file automatically
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")
2. Web crawler extraction
1. Using the Beautiful Soup library
from bs4 import BeautifulSoup
soup = BeautifulSoup(mk, "html.parser")  # mk is the HTML text to parse
1.1. Basic elements
Beautiful Soup parsers
Parser | Usage | Requirement |
---|---|---|
bs4's HTML parser | BeautifulSoup(mk, "html.parser") | install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk, "lxml") | pip install lxml |
lxml's XML parser | BeautifulSoup(mk, "xml") | pip install lxml |
html5lib's parser | BeautifulSoup(mk, "html5lib") | pip install html5lib |
Basic elements of the BeautifulSoup class
Element | Description |
---|---|
Tag | A tag, the most basic unit of information, delimited by <> and </> |
Name | The tag's name; the name of <p>…</p> is 'p'; format: <tag>.name |
Attributes | The tag's attributes, organized as a dictionary; format: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>; format: <tag>.string |
Comment | A comment string inside a tag; a special Comment type |
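A small offline demo of these elements, using an inline HTML snippet of my own rather than a downloaded page:

```python
from bs4 import BeautifulSoup

# A tiny document to exercise Tag, Name, Attributes, and NavigableString.
mk = '<html><body><p class="title"><b>Demo page</b></p></body></html>'
soup = BeautifulSoup(mk, "html.parser")

tag = soup.p        # Tag: the first <p> element
print(tag.name)     # Name: 'p'
print(tag.attrs)    # Attributes: {'class': ['title']}
print(tag.string)   # NavigableString: 'Demo page'
```

Note that .string descends through the single <b> child, so it returns the inner text directly.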
1.2 Traversing HTML content with the bs4 library
Downward traversal of the tag tree
Attribute | Description |
---|---|
.contents | A list containing all of the node's children |
.children | An iterator over the children, similar to .contents, for looping over child nodes |
.descendants | An iterator over all descendant nodes, for looping over the entire subtree |
Upward traversal of the tag tree
Attribute | Description |
---|---|
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags, for looping over ancestors |
Sibling traversal of the tag tree
Attribute | Description |
---|---|
.next_sibling | The next sibling tag in HTML document order |
.previous_sibling | The previous sibling tag in HTML document order |
.next_siblings | An iterator over all following sibling tags in document order |
.previous_siblings | An iterator over all preceding sibling tags in document order |
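An offline sketch of all three traversal directions on a tiny tree of my own making:

```python
from bs4 import BeautifulSoup

# Three sibling <p> tags inside <body>, with no whitespace text nodes.
mk = "<html><body><p>one</p><p>two</p><p>three</p></body></html>"
soup = BeautifulSoup(mk, "html.parser")
body = soup.body

print([child.name for child in body.children])  # downward: ['p', 'p', 'p']
print(soup.p.parent.name)                       # upward: 'body'
print(soup.p.next_sibling)                      # sibling: <p>two</p>
print([t.name for t in soup.p.parents])         # ancestors: ['body', 'html', '[document]']
```

With real pages, the whitespace between tags also appears as NavigableString children, which is why the isinstance check in the ranking example later on is needed.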
1.3 HTML formatting and encoding with the bs4 library
prettify(): inserts a newline after each tag, making the tag tree easier to read
Default encoding: UTF-8
2. Information organization and extraction
2.1 Information marking
Why mark up information:
Marked-up information forms an organized structure, adding a dimension to the data
Marked-up information can be used for communication, storage, and display
The markup structure is as valuable as the information itself
Marked-up information is easier for programs to understand and use
XML: the earliest general-purpose markup language; highly extensible but verbose
JSON: typed information; well suited to processing by programs (JavaScript); more concise than XML
YAML: untyped information; the highest proportion of the text is content; very readable
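To make the comparison concrete, here is one record (a made-up example) in all three styles. JSON can be produced with the standard library; the XML and YAML forms are shown as literal strings for comparison only:

```python
import json

# A made-up record to show the three markup styles side by side.
record = {"name": "Beautiful Soup", "type": "library", "language": "Python"}

as_json = json.dumps(record)  # typed, concise, machine-friendly
print(as_json)

as_xml = ("<package><name>Beautiful Soup</name>"
          "<type>library</type><language>Python</language></package>")
print(as_xml)                 # extensible but verbose

as_yaml = "name: Beautiful Soup\ntype: library\nlanguage: Python"
print(as_yaml)                # untyped, mostly content, very readable
```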
2.2 General methods of information extraction
Method 1: fully parse the markup, then extract the key information
Requires a markup parser, e.g. bs4's tag-tree traversal
Advantage: accurate parsing
Disadvantage: tedious extraction process; slow
Method 2: ignore the markup and search the text directly for the key information
Only a text-search function over the information is needed
Advantage: simple and fast extraction
Disadvantage: accuracy depends on the information's content
Combined method: combine structural parsing with searching to extract the key information
Requires both a markup parser and a text-search function
Example: extract all URL links from an HTML page
Approach: find all <a> tags, parse each tag's format, and extract the link stored in its href attribute
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
2.3 Searching HTML content with the bs4 library
<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list of the matching results
name: a string matched against tag names
attrs: a string matched against tag attribute values; attributes can also be searched by keyword
recursive: whether to search all descendants; defaults to True
string: a string matched against the text between <> and </>
<tag>(…) is equivalent to <tag>.find_all(…)
soup(…) is equivalent to soup.find_all(…)
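An offline sketch of these parameters on a small snippet of my own (the class names and text are invented for illustration):

```python
from bs4 import BeautifulSoup

mk = ('<div><p class="course">Python</p><p class="course">Java</p>'
      '<p id="intro">Basic Python</p></div>')
soup = BeautifulSoup(mk, "html.parser")

print(len(soup.find_all("p")))                             # 3: match by tag name
print(len(soup.find_all("p", attrs={"class": "course"})))  # 2: match by attribute
print(soup.find_all(string="Python"))                      # ['Python']: exact text match
print(soup("p") == soup.find_all("p"))                     # True: soup(...) shorthand
```

Note that string matches the text exactly, so "Basic Python" is not returned by string="Python".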
3. Example: university ranking information
import bs4
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    '''Fetch the page text'''
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    '''Parse the needed text and append it to the list'''
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip NavigableString children
            tds = tr('td')
            ulist.append([tds[0].string.strip('\n '),
                          tds[1].a.string,
                          tds[4].string.strip('\n ')])

def printUnivList(ulist, num):
    '''Print in a formatted layout'''
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # chr(12288) is the full-width space, used to align Chinese characters
    print(tplt.format("Rank", "University", "Score", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = "https://www.shanghairanking.cn/rankings/bcur/2022"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 30)

main()