26. Python Web Crawler

1. Introduction to web crawlers

A web crawler is a program or script that automatically collects information from the World Wide Web according to certain rules. Generally, it starts from one page of a website, reads the content of that page, extracts the useful link addresses it contains, uses those links to find the next pages, and repeats the same work in a loop until all reachable pages have been crawled or some stopping strategy is satisfied.

Classification of web crawlers

Web crawlers fall roughly into four types: general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.

  • General web crawler: crawls a huge amount of target data over a very wide scope, so its performance requirements are very high. It is mainly used by large search engines and large data providers, where it has great application value.
  • Focused web crawler: selectively crawls pages related to predefined topics. Because the crawl targets are restricted to topic-related pages, it greatly saves the bandwidth and server resources that crawling requires. It is mainly used to crawl specific information and serve a specific group of users.
  • Incremental web crawler: only fetches newly generated pages or pages whose content has changed, and skips pages whose content has not changed. To a certain extent this ensures the crawled pages are as fresh as possible.
  • Deep web crawler: web pages can be divided into surface pages and deep pages by how they are reached. Surface pages are static pages reachable through static links without submitting a form; deep pages are hidden behind forms and cannot be reached directly through static links, only after certain keywords are submitted.

The role of web crawlers

1) Search engines: provide users with relevant, effective content and create snapshots of all visited pages for later processing. Implementing a search engine or a search function on a portal with a focused web crawler helps find the pages most relevant to the search topic.

2) Build data sets: for research, business and other purposes.

  • Understand and analyze how Internet users behave toward a company or organization.
  • Gather marketing information to make better marketing decisions in the short term.
  • Collect information from the Internet, analyze it, and conduct academic research.
  • Collect data to analyze long-term trends in an industry.
  • Monitor competitors' real-time changes.

Web crawler workflow

One or several seed URLs are set in advance to obtain the URL list of the initial pages. During crawling, URLs are continuously taken from the URL queue, and the corresponding pages are accessed and downloaded.

After a page is downloaded, the page parser removes the HTML tags to obtain the page content, and saves the summary, URL, and other information to the web database. At the same time, it extracts the new URLs on the current page and pushes them into the URL queue, until the system's stopping conditions are met.
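
A minimal sketch of this workflow using only the standard library (the seed URL, the page limit, and the stopping condition are placeholders chosen purely for illustration):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    queue, seen, downloaded = deque([seed]), {seed}, 0
    while queue and downloaded < max_pages:  # stopping condition
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode('utf-8', errors='ignore')
        except Exception:
            continue  # skip pages that fail to download
        downloaded += 1
        print('downloaded', url, len(html), 'bytes')
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:  # push newly found absolute URLs into the queue
            absolute = urljoin(url, link)
            if absolute.startswith('http') and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl('http://www.baidu.com')  # seed URL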

2. Use urllib

Python 2: urllib + urllib2

Python 3: urllib (urllib and urllib2 were merged into one standard-library package; urllib3 is a separate third-party library, used internally by requests)

urllib is Python's official standard library for opening and working with URLs. It has four modules:

  • urllib.request: Mainly responsible for constructing and initiating network requests, and defines functions and classes suitable for opening URLs in various complex situations.
  • urllib.error: exception classes raised by urllib.request.
  • urllib.parse: parse URLs and encode/decode query data.
  • urllib.robotparser: parse robots.txt files.
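
The two helper modules can be tried on their own before making any requests; a small sketch (the Baidu URLs are only examples):

from urllib import parse, robotparser

# urllib.parse: split a URL into components and build query strings
url = 'http://www.baidu.com/s?wd=python&pn=10'
parts = parse.urlparse(url)
print(parts.netloc, parts.path, parse.parse_qs(parts.query))
print(parse.urlencode({'wd': 'python', 'pn': 20}))  # wd=python&pn=20

# urllib.robotparser: check whether robots.txt allows a URL to be crawled
rp = robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))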

Make a request

'''
http.client.HTTPResponse = urllib.request.urlopen(url, data, timeout, cafile, capath, context)
cafile and capath are used for HTTPS requests
http.client.HTTPResponse = urllib.request.urlopen(urllib.request.Request)
Returns a response object
'''
import urllib.request
baidu_url = 'http://www.baidu.com'
sina_url = 'http://www.sina.com'
r = urllib.request.urlopen(sina_url)  # send the request and get a response object
h = r.read().decode('utf-8')  # read the body and decode it
print(h)

Submit data: use the data parameter to submit data with the request.

import urllib.request
import urllib.parse

baidu_url = 'http://www.baidu.com'
sina_url = 'http://www.sina.com'
p = {
    'name': 'Python',
    'author': 'admin'
}
d = bytes(urllib.parse.urlencode(p), encoding='utf8')  # encode: dict -> query string -> bytes
r = urllib.request.urlopen(sina_url, data=d, timeout=1)  # send the request with data, get a response object
h = r.read().decode('utf-8')  # read the body and decode it
print(h)

Set request headers: some requests need an explicit request header, for example a browser User-Agent.

'''
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
'''
# header values can be copied from the browser developer tools (Headers panel)
import urllib.request
baidu_url = 'http://www.baidu.com'
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
req = urllib.request.Request(url=baidu_url, headers=headers)
r = urllib.request.urlopen(req)
h = r.read().decode('utf-8')
print(h)

Using a proxy: urlopen() cannot handle proxies, cookies, and similar features directly; for these you build an opener from handlers.

'''
Handler
OpenerDirector
'''
import urllib.request
baidu_url = 'http://www.baidu.com'
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
# proxies are obtained from third-party servers, expire quickly, and one can be set per scheme
proxy = urllib.request.ProxyHandler(
    {
        'http': '125.71.212.17:9000',
        'https': '113.71.212.17:9000'
    }
)
opener = urllib.request.build_opener(proxy)  # create an opener that uses the proxy
urllib.request.install_opener(opener)  # install it as the global opener
req = urllib.request.Request(url=baidu_url, headers=headers)
r = urllib.request.urlopen(req)
h = r.read().decode('utf-8')
print(h)

Authentication login: some pages require logging in (HTTP basic authentication) before they can be accessed.

  1. Create an account and password management object.
  2. Add account and password.
  3. Get a handler object.
  4. Get the opener object.
  5. Use the open() function to initiate a request.
import urllib.request
url = 'http://cnblogs.com/xtznb'
user = 'user'
password = 'password'
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()  # create a password manager object
pwdmgr.add_password(None, url, user, password)  # add the account and password
auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)  # get a handler object
opener = urllib.request.build_opener(auth_handler)  # get an opener object
response = opener.open(url)  # use open() to send the request
print(response.read().decode('utf-8'))

Set cookies: when a page requires verification on every visit, cookies can be used to keep the login state automatically.

  1. Instantiate the Cookies object.
  2. Construct a handler object.
  3. Use open() of the opener object to initiate a request.
import urllib.request
import http.cookiejar
url = 'http://tieba.baidu.com'
file = 'cookie.txt'
cookie = http.cookiejar.CookieJar()  # instantiate a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)  # build a handler object
opener = urllib.request.build_opener(handler)
response = opener.open(url)  # use the opener's open() to send the request
f = open(file, 'a')  # open the file in append mode
for i in cookie:  # write each cookie
    f.write(i.name + '=' + i.value + '\n')
f.close()  # close the file
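
If the cookies need to be reused later, they can also be saved to a file and loaded back; a sketch using http.cookiejar.MozillaCookieJar (the file name cookie_mozilla.txt is arbitrary):

import urllib.request
import http.cookiejar

url = 'http://tieba.baidu.com'
file = 'cookie_mozilla.txt'

# save cookies to a file in Mozilla format
cookie = http.cookiejar.MozillaCookieJar(file)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
opener.open(url)
cookie.save(ignore_discard=True, ignore_expires=True)

# load the saved cookies and attach them to a new opener
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load(file, ignore_discard=True, ignore_expires=True)
opener2 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie2))
response = opener2.open(url)
print(response.getcode())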

3. Use requests

The requests module is a high-level wrapper built on top of urllib3, which makes network requests simpler and more user-friendly. When crawling data, urllib closes the connection after each request, whereas requests can keep the connection alive and reuse the socket for subsequent requests.
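
Connection reuse in requests is exposed through the Session object, which keeps the connection pool and cookies alive across requests; a small sketch against httpbin.org:

import requests

s = requests.Session()  # the Session keeps the TCP connection and cookies alive
s.headers.update({'User-Agent': 'Mozilla/5.0'})

r1 = s.get('http://httpbin.org/cookies/set/token/123')  # the server sets a cookie
r2 = s.get('http://httpbin.org/cookies')  # the cookie is sent back automatically
print(r2.text)
s.close()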

Make a GET request

GET requests can be sent using the get() method of the requests module.

'''
response = get(url, params=None, **kwargs)
url: the URL to request
params: dictionary or byte sequence, appended to the URL as query parameters
**kwargs: additional parameters that control the request
'''
import requests
r = requests.get('http://www.baidu.com')
print(r.url)  # http://www.baidu.com/
print(r.cookies)  # <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
r.encoding = 'utf-8'
print(r.encoding)  # utf-8
print(r.text)  # page source
print(r.content)  # raw bytes
print(r.headers)  # response headers
print(r.status_code)  # status code
# append query parameters by hand
r = requests.get('http://www.baidu.com/?key=val')
# or pass them through the params keyword argument
payload1 = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://www.baidu.com', params=payload1)
payload2 = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('http://www.baidu.com', params=payload2)
# set request headers
headers = {
    'Content-Type': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
r = requests.get('http://www.baidu.com', headers=headers)
# set a proxy
p = {
    'http': '120.25.253.234:0000'  # replace with a working proxy ip:port
}
r = requests.get('http://www.baidu.com', headers=headers, proxies=p)

You can also use the timeout parameter to set a timeout, the verify parameter to control SSL certificate verification, the cookies parameter to pass cookie information, and so on.
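
A short sketch of these three parameters (the values are purely illustrative):

import requests

# timeout: give up if the server does not respond within 3 seconds
r = requests.get('http://www.baidu.com', timeout=3)

# verify: disable SSL certificate verification (prints a warning; not recommended in production)
r = requests.get('https://www.baidu.com', verify=False)

# cookies: send cookie values along with the request
r = requests.get('http://httpbin.org/cookies', cookies={'token': '123'})
print(r.text)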

Send POST request

The HTTP protocol stipulates that data submitted by POST must be placed in the message body, but it does not specify which encoding the data must use. There are three common encoding methods:

  • Form data: application/x-www-form-urlencoded
  • JSON string: application/json
  • File upload: multipart/form-data

To send a POST request, you can use the post() method, which also returns a Response object.

Example 1: Send a POST request with form data.

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)  # data= sends the payload as a form body
print(r.text)
'''Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.27.1", 
    "X-Amzn-Trace-Id": "Root=1-656f516d-18cccab474d121d705eb3ad9"
  }, 
  "json": null, 
  "origin": "218.104.29.129", 
  "url": "http://httpbin.org/post"
}
'''

Example 2: Send a POST request with a JSON body.

import requests
import json
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=json.dumps(payload))
print(r.text)
'''
{
  "args": {}, 
  "data": "{\"key1\": \"value1\", \"key2\": \"value2\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "36", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.27.1", 
    "X-Amzn-Trace-Id": "Root=1-656f5282-3f08151e1fbbeec54501ed80"
  }, 
  "json": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "origin": "218.104.29.129", 
  "url": "http://httpbin.org/post"
}
'''
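
requests can also serialize the dictionary itself through the json keyword argument, which sets the Content-Type: application/json header automatically; the following sketch is equivalent to Example 2:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', json=payload)  # body is JSON-encoded automatically
print(r.json()['json'])  # {'key1': 'value1', 'key2': 'value2'}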

Example 3: Send a POST request that uploads a file.

# create a text file report.txt containing one line: Hello world
import requests
files = {'file': open('report.txt', 'rb')}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)
'''
{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "Hello world"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "157", 
    "Content-Type": "multipart/form-data; boundary=44a0c52d3705bdc2a8a6ffa85ccc00bc", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.27.1", 
    "X-Amzn-Trace-Id": "Root=1-656f538f-5c062ec1599c4fbe082aa840"
  }, 
  "json": null, 
  "origin": "218.104.29.129", 
  "url": "http://httpbin.org/post"
}
'''

requests provides not only the GET and POST request methods but also other methods: put, delete, head, and options.

GET is mainly used to request data from a specified resource, while POST is mainly used to submit data to be processed to a specified resource.
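
A brief sketch of the other methods against httpbin.org (endpoints and values are only illustrative):

import requests

r = requests.head('http://httpbin.org/get')  # headers only, no body
print(r.headers['Content-Type'])

r = requests.put('http://httpbin.org/put', data={'key': 'value'})
print(r.status_code)  # 200

r = requests.delete('http://httpbin.org/delete')
print(r.status_code)  # 200

r = requests.options('http://httpbin.org/get')  # ask which methods are allowed
print(r.headers.get('Allow'))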

4. Use BeautifulSoup

The requests module can only fetch the raw source code of a web page; to sift through that source and accurately locate the data you need, you can use BeautifulSoup, a Python library that extracts data from HTML or XML files.

BeautifulSoup supports the HTML parser in the Python standard library (html.parser, the default when no parser is specified) as well as several third-party parsers. The lxml parser is more powerful and faster, so it is recommended.

Parser strings

  • html.parser: BeautifulSoup(html, 'html.parser'). The default; average speed, strong fault tolerance.
  • lxml: BeautifulSoup(html, 'lxml'). Fast, with strong fault tolerance for documents.
  • xml: BeautifulSoup(html, 'xml'). Fast; mainly for XML documents.
  • html5lib: BeautifulSoup(html, 'html5lib'). Best fault tolerance; mainly for HTML5 documents.

BeautifulSoup automatically converts input documents to Unicode encoding and output documents to UTF-8 encoding.

Environment configuration

'''
pip install beautifulsoup4
# an HTML parser is also needed; install one of the following
pip install html5lib
pip install lxml
'''

Example:

# create a file test.html with the following content
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Hello,world</title>
</head>
<body>
<div class="book">
    <span><!--这里是注释的部分--></span>
    <a href="https://www.baidu.com">百度一下,你就知道</a>
    <p class="a">这是一个示例</p>
</div>
</body>
</html>

# create a new .py file
from bs4 import BeautifulSoup
f = open('paichong/test.html', 'r', encoding='utf-8')  # open the file
html = f.read()
f.close()
soup = BeautifulSoup(html, 'html5lib')  # specify the html5lib parser
print(type(soup))

Node objects

BeautifulSoup converts complex HTML documents into a complex tree structure. Each node is a Python object. The objects are summarized as: Tag, NavigableString, BeautifulSoup, and Comment.

  • Tag: an HTML or XML tag.
  • NavigableString: the text wrapped by a tag.
  • BeautifulSoup: the object representing the parsed document as a whole.
  • Comment: a comment or other special string.
from bs4 import BeautifulSoup
f = open('paichong/test.html', 'r', encoding='utf-8')  # open the file
html = f.read()
f.close()
soup = BeautifulSoup(html, 'html5lib')  # specify the html5lib parser
print(type(soup))
tag = soup.p  # read the <p> tag
print(tag.name)  # tag name
print(tag["class"])  # attribute value
print(tag.get_text())  # text
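
Continuing with test.html, the remaining object types can be seen on the <a> and <span> tags (a short sketch reusing the soup object above):

a = soup.a
print(type(a))  # <class 'bs4.element.Tag'>
print(type(a.string))  # <class 'bs4.element.NavigableString'>
print(a.string)  # 百度一下,你就知道

comment = soup.span.string  # the only content of <span> is an HTML comment
print(type(comment))  # <class 'bs4.element.Comment'>
print(comment)  # 这里是注释的部分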

Document traversal

Traverse node attributes as follows:

  • contents: Get all child nodes, including NavigableString objects, and return a list.
  • children: Get all child nodes and return an iterator.
  • descendants: Get all descendant nodes and return an iterator.
  • string: Get the directly contained text.
  • strings: Get all contained text and return an iterable object.
  • parent: Get the parent node of the previous layer.
  • parents: Get all ancestor nodes and return an iterable object.
  • next_sibling: Get the next sibling node of the current node.
  • next_siblings: Get all sibling nodes below the current node.
  • previous_sibling: Get the previous sibling node of the current node.
  • previous_siblings: Get all sibling nodes above the current node.
from bs4 import BeautifulSoup
f = open('paichong/test.html', 'r', encoding='utf-8')  # open the file
html = f.read()
f.close()
soup = BeautifulSoup(html, 'html5lib')  # specify the html5lib parser
tags = soup.head.children  # get all child nodes of <head>
print(tags)
for tag in tags:
    print(tag)
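
A few of the other traversal attributes, again on test.html (a short sketch reusing the same soup object):

div = soup.div
print(div['class'])  # ['book']
print([t.name for t in div.children if t.name])  # ['span', 'a', 'p']
print(soup.a.parent.name)  # div
print(repr(soup.a.next_sibling))  # whitespace text node between <a> and <p>
print(list(soup.p.strings))  # ['这是一个示例']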

Document search

  • find_all(name[, name1, …]): name is a tag name; filters directly on the tag string.
  • find_all(attrs={'attribute name': 'attribute value'}): search by attribute.
  • find_all(name, text="text content"): search by text.
  • find_all(name, recursive=False): limit the search to direct children.
  • find_all(re.compile("b")): search with a regular expression.
import re
# continuing with the soup object built from test.html above
a = soup.find_all('a', text='百度一下,你就知道')  # search by tag name and text
print(a)
a = soup.find_all(re.compile('a'))  # tag names matching the regular expression
print(a)
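
Attribute and scope-limited searches on the same document might look like this (a sketch reusing the soup object from test.html):

# search by attribute value
print(soup.find_all(attrs={'class': 'a'}))  # [<p class="a">这是一个示例</p>]

# search by tag name and attribute together
print(soup.find_all('a', href='https://www.baidu.com'))

# recursive=False only looks at the direct children of <body>
print(soup.body.find_all('p', recursive=False))  # [] because <p> is nested inside <div>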

CSS selector

The select() method passes in string parameters. For details, see: http://www.w3.org/TR/CSS2/selector.html.

from bs4 import BeautifulSoup
f = open('paichong/test.html', 'r', encoding='utf-8')  # open the file
html = f.read()
f.close()
soup = BeautifulSoup(html, 'html5lib')  # specify the html5lib parser
tags = soup.select(".a")
print(tags)  # [<p class="a">这是一个示例</p>]
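
select() also accepts other common CSS selectors; a few more sketches on test.html (select_one() returns only the first match):

print(soup.select('title'))  # select by tag name
print(soup.select('div.book > a'))  # direct child <a> of <div class="book">
print(soup.select('a[href]'))  # <a> tags that have an href attribute
print(soup.select_one('p.a').get_text())  # 这是一个示例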

Origin blog.csdn.net/weixin_50357986/article/details/134820830