Web request module: requests
Introduction
The requests module can mimic a browser sending a request and obtaining a response.
The requests module works the same way in Python 2 and Python 3.
The requests module can automatically decode the page content for us.
Installing the requests module
pip install requests
If you have both a Python 2 and a Python 3 environment locally and want to install it for Python 3, it is recommended to install it as follows
pip3 install requests
Using the requests module
Basic use
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Send a GET request and get the response
response = requests.get(url)
# Get the HTML content of the response
html = response.text
- Code explanation
- Common response attributes
- response.text returns the response content as type str
- response.content returns the response content as type bytes
- response.status_code returns the response status code
- response.request.headers returns the request headers
- response.headers returns the response headers
- response.cookies returns a RequestsCookieJar object
- Converting response.content to str
# Get the byte data
content = response.content
# Convert it to a string
html = content.decode('utf-8')
- response.cookies operations
# Returns a RequestsCookieJar object
cookies = response.cookies
# Convert a RequestsCookieJar to a dict
requests.utils.dict_from_cookiejar(cookies)
# Convert a dict to a RequestsCookieJar (placeholder cookie shown)
requests.utils.cookiejar_from_dict({"name": "value"})
# Operate on cookies: add a dict of cookies to a cookiejar (placeholder cookie shown)
requests.utils.add_dict_to_cookiejar(cookies, {"name": "value"})
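The conversions above can be tried without a live response by first building a jar from a plain dict; the cookie names and values below are placeholders, not from any real site:

```python
import requests.utils

# Build a RequestsCookieJar from a plain dict (placeholder cookie)
jar = requests.utils.cookiejar_from_dict({"session": "abc123"})

# RequestsCookieJar -> dict
print(requests.utils.dict_from_cookiejar(jar))  # {'session': 'abc123'}

# Merge another dict of cookies into the existing jar
requests.utils.add_dict_to_cookiejar(jar, {"lang": "en"})
print(sorted(requests.utils.dict_from_cookiejar(jar)))  # ['lang', 'session']
```

The round trip dict -> jar -> dict preserves the cookie names and values, which is handy when you need to persist cookies from a response and reattach them to a later request.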
Custom request headers
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Send the request with the custom request headers
response = requests.get(url, headers=headers)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the custom request headers via the headers parameter when sending the request
Sending a GET request
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com/s'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the GET request parameters
params = {
    "kw":"hello"
}
# Send the request with the GET request parameters
response = requests.get(url, headers=headers, params=params)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the GET request parameters via the params parameter when sending the request
Sending a POST request
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the POST request parameters
data = {
    "kw":"hello"
}
# Send the request with the POST request parameters
response = requests.post(url, headers=headers, data=data)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the POST request parameters via the data parameter when sending the request
Saving an image
- Usage
# Import the module
import requests
# Image download URL
url = "http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png"
# Send the request and get the response
response = requests.get(url)
# Save the image
with open('image.png', 'wb') as f:
    f.write(response.content)
- Code explanation
When saving an image, keep the file extension consistent with the extension in the request URL
Files must be saved using response.content (the bytes content)
Using a proxy server
- Purpose
- Makes the server think it is not the same client sending the requests
- Prevents our real address from being leaked and traced back to us
- Proxy classification
- Transparent Proxy: a transparent proxy can directly "hide" your IP address, but the server can still find out who you are.
- Anonymous Proxy: a small step up from a transparent proxy: the server can only tell that you are using a proxy; it cannot tell who you are.
- Distorting Proxy: like an anonymous proxy, except that the server knows you are using a proxy but receives a fake IP address, so the disguise is more realistic.
- Elite Proxy (High Anonymity Proxy): the server simply cannot detect that you are using a proxy at all, so it is the best choice.
In terms of usage, a high anonymity proxy is without doubt the best option
By protocol, proxy IPs can be divided into HTTP proxies, HTTPS proxies, SOCKS proxies, and so on; choose according to the protocol of the site you need to crawl
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the proxy servers
proxies = {
    "http":"http://ip:port",
    "https":"https://ip:port"
}
# Send the request with the proxies
response = requests.get(url, headers=headers, proxies=proxies)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the proxies via the proxies parameter when sending the request
Sending a request with cookies
- Usage
Carry the cookie directly in a custom request header
Carry the cookies via the cookies request parameter
- Code
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    # Method 1: carry the cookie content directly in the request headers
    "Cookie": "cookie value"
}
# Method 2: define a cookies dict
cookies = {
    "xx":"yy"
}
# Send the request with the cookies
response = requests.get(url, headers=headers, cookies=cookies)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the cookies via the cookies parameter when sending the request
Handling certificate errors
- Problem description
When a site's CA certificate cannot be verified, requests raises an SSLError when sending the request
- Usage
# Import the module
import requests
url = "https://www.12306.cn/mormhweb/"
# Ignore certificate verification
response = requests.get(url, verify=False)
- Code explanation
Setting the verify parameter to False when sending the request disables CA certificate verification
Timeout handling
- Usage
# Import the module
import requests
url = "https://www.baidu.com"
# Set the timeout
response = requests.get(url, timeout=5)
- Code explanation
Set the timeout in seconds via the timeout parameter when sending the request
Retry handling
- Usage
#!/usr/bin/python3
# -*- coding: utf-8 -*-
'''
Uses the third-party retrying module:
1. pip install retrying
'''
import requests
# 1. Import the module
from retrying import retry

# 2. Configure retrying with the decorator
# stop_max_attempt_number is the number of attempts
@retry(stop_max_attempt_number=3)
def parse_url(url):
    print("Requesting url:", url)
    headers = {
        "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
    }
    proxies = {
        "http":"http://124.235.135.210:80"
    }
    # Set the timeout parameter
    response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
    return response.text

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        html = parse_url(url)
        print(html)
    except Exception as e:
        # Log the url to a file so it can be analyzed manually and re-requested later
        print(e)
- Code explanation
Install the retrying module:
The retrying module monitors a function via a decorator; if the function raises an exception, a retry is triggered
pip install retrying
- Apply the decorator to the function that needs retrying
Set the number of retries via @retry(stop_max_attempt_number=N)
# 1. Import the module
from retrying import retry
# 2. Decorate the function to be retried
@retry(stop_max_attempt_number=3)
def exec_func():
    pass
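If the third-party retrying module is not available, the same pattern can be sketched by hand with a plain decorator. The names retry and flaky below are illustrative, not part of any library:

```python
import functools

def retry(stop_max_attempt_number=3):
    """Re-invoke the wrapped function until it succeeds or the
    attempt limit is reached, then re-raise the last exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(stop_max_attempt_number):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
            raise last_exc
        return wrapper
    return decorator

attempts = []

@retry(stop_max_attempt_number=3)
def flaky():
    # Fails on the first two calls, succeeds on the third
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("temporary failure")
    return "ok"

print(flaky())        # ok
print(len(attempts))  # 3
```

In a crawler, the body of flaky would be the requests.get call with a timeout, so that transient network errors are retried a bounded number of times before the exception reaches the caller.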
urllib
Using the urllib network library in Python 3
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 1. Import the module
import urllib.request
# 2. Make the network request
# 2.1 Define the request URL
url = "https://github.com"
# 2.2 Define custom request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "Referer": "https://github.com/",
    "Host": "github.com"
}
# Define the request object
req = urllib.request.Request(
    url=url,
    headers=headers
)
# Send the request
resp = urllib.request.urlopen(req)
# Handle the response
with open('github.txt', 'wb') as f:
    f.write(resp.read())
urllib precautions
- If the URL contains characters that need to be escaped
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 1. Import the modules
import urllib.request
import urllib.parse
# 2. Make the request and get the response
wd = input("Enter the search term: ")
# 2.1 Define the request URL
url = "https://www.baidu.com/s?wd="
# 2.2 Define custom request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "Referer": "https://www.baidu.com/",
    "Host": "www.baidu.com"
}
# 2.3 Define the request object; the query term must be escaped with urllib.parse.quote
request = urllib.request.Request(
    url=url + urllib.parse.quote(wd),
    headers=headers
)
# 2.4 Send the request
response = urllib.request.urlopen(request)
# 3. Handle the response
with open('02.html', 'wb') as f:
    f.write(response.read())
response.read()
- The return value is of type bytes; decode it to get a string
html = response.read().decode('utf-8')
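The escaping done by urllib.parse.quote can be checked on its own, without sending a request; the sample string below is illustrative:

```python
from urllib.parse import quote, unquote

# Non-ASCII characters (and spaces) must be percent-encoded
# before they can be placed in a URL
word = "hello 世界"
encoded = quote(word)
print(encoded)                    # hello%20%E4%B8%96%E7%95%8C
print(unquote(encoded) == word)   # True
```

Letters, digits, and a few safe characters pass through unchanged, while everything else becomes %XX escapes of its UTF-8 bytes; unquote reverses the transformation.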
Reprinted from https://github.com/Kr1s77/Python-crawler-tutorial-starts-from-zero