Web request module: requests
Introduction
The requests module can mimic a browser sending a request and obtaining a response.
The requests module works the same way in Python 2 and Python 3.
The requests module can automatically decode the page content for us.
Installing the requests module
pip install requests
If you have both a Python 2 and a Python 3 environment locally and want to install it for Python 3, it is recommended to install it as follows
pip3 install requests
Using the requests module
Basic use
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Send a GET request and get the response
response = requests.get(url)
# Get the HTML content of the response
html = response.text
- Code explanation
- Common response attributes
- response.text returns the response content as type str
- response.content returns the response content as type bytes
- response.status_code returns the response status code
- response.request.headers returns the request headers
- response.headers returns the response headers
- response.cookies returns a RequestsCookieJar object
- Converting response.content to str
# Get the byte data
content = response.content
# Convert it to a string
html = content.decode('utf-8')
- response.cookies operations
# Returns a RequestsCookieJar object
cookies = response.cookies
# Convert a RequestsCookieJar to a dict
requests.utils.dict_from_cookiejar(cookies)
# Convert a dict to a RequestsCookieJar (placeholder cookie shown)
requests.utils.cookiejar_from_dict({"name": "value"})
# Operate on cookies: add a dict of cookies to a cookiejar (placeholder cookie shown)
requests.utils.add_dict_to_cookiejar(cookies, {"name": "value"})
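The conversions above can be tried without a live response by first building a jar from a plain dict; the cookie names and values below are placeholders, not from any real site:

```python
import requests.utils

# Build a RequestsCookieJar from a plain dict (placeholder cookie)
jar = requests.utils.cookiejar_from_dict({"session": "abc123"})

# RequestsCookieJar -> dict
print(requests.utils.dict_from_cookiejar(jar))  # {'session': 'abc123'}

# Merge another dict of cookies into the existing jar
requests.utils.add_dict_to_cookiejar(jar, {"lang": "en"})
print(sorted(requests.utils.dict_from_cookiejar(jar)))  # ['lang', 'session']
```

The round trip dict -> jar -> dict preserves the cookie names and values, which is handy when you need to persist cookies from a response and reattach them to a later request.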
Custom request headers
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Send the request with the custom request headers
response = requests.get(url, headers=headers)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the custom request headers via the headers parameter when sending the request
Sending a GET request
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com/s'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the GET request parameters
params = {
    "kw":"hello"
}
# Send the request with the GET request parameters
response = requests.get(url, headers=headers, params=params)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the GET request parameters via the params parameter when sending the request
Sending a POST request
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the POST request parameters
data = {
    "kw":"hello"
}
# Send the request with the POST request parameters
response = requests.post(url, headers=headers, data=data)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the POST request parameters via the data parameter when sending the request
Saving an image
- Usage
# Import the module
import requests
# Image download URL
url = "http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png"
# Send the request and get the response
response = requests.get(url)
# Save the image
with open('image.png', 'wb') as f:
    f.write(response.content)
- Code explanation
When saving an image, keep the file extension consistent with the extension in the request URL
Files must be saved using response.content (the bytes content)
Using a proxy server
- Purpose
- Makes the server think it is not the same client sending the requests
- Prevents our real address from being leaked and traced back to us
- Proxy classification
- Transparent Proxy: a transparent proxy can directly "hide" your IP address, but the server can still find out who you are.
- Anonymous Proxy: a small step up from a transparent proxy: the server can only tell that you are using a proxy; it cannot tell who you are.
- Distorting Proxy: like an anonymous proxy, except that the server knows you are using a proxy but receives a fake IP address, so the disguise is more realistic.
- Elite Proxy (High Anonymity Proxy): the server simply cannot detect that you are using a proxy at all, so it is the best choice.
In terms of usage, a high anonymity proxy is without doubt the best option
By protocol, proxy IPs can be divided into HTTP proxies, HTTPS proxies, SOCKS proxies, and so on; choose according to the protocol of the site you need to crawl
- Usage
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the proxy servers
proxies = {
    "http":"http://ip:port",
    "https":"https://ip:port"
}
# Send the request with the proxies
response = requests.get(url, headers=headers, proxies=proxies)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the proxies via the proxies parameter when sending the request
Sending a request with cookies
- Usage
Carry the cookie directly in a custom request header
Carry the cookies via the cookies request parameter
- Code
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    # Method 1: carry the cookie content directly in the request headers
    "Cookie": "cookie value"
}
# Method 2: define a cookies dict
cookies = {
    "xx":"yy"
}
# Send the request with the cookies
response = requests.get(url, headers=headers, cookies=cookies)
# Get the HTML content of the response
html = response.text
- Code explanation
Pass the cookies via the cookies parameter when sending the request
Handling certificate errors
- Problem description
When a site's CA certificate cannot be verified, requests raises an SSLError when sending the request
- Usage
# Import the module
import requests
url = "https://www.12306.cn/mormhweb/"
# Ignore certificate verification
response = requests.get(url, verify=False)
- Code explanation
Setting the verify parameter to False when sending the request disables CA certificate verification
Timeout handling
- Usage
# Import the module
import requests
url = "https://www.baidu.com"
# Set the timeout
response = requests.get(url, timeout=5)
- Code explanation
Set the timeout in seconds via the timeout parameter when sending the request
Retry handling
- Usage
#!/usr/bin/python3
# -*- coding: utf-8 -*-
'''
Uses the third-party retrying module:
1. pip install retrying
'''
import requests
# 1. Import the module
from retrying import retry

# 2. Configure retrying with the decorator
# stop_max_attempt_number is the number of attempts
@retry(stop_max_attempt_number=3)
def parse_url(url):
    print("Requesting url:", url)
    headers = {
        "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
    }
    proxies = {
        "http":"http://124.235.135.210:80"
    }
    # Set the timeout parameter
    response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
    return response.text

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        html = parse_url(url)
        print(html)
    except Exception as e:
        # Log the url to a file so it can be analyzed manually and re-requested later
        print(e)
- Code explanation
Install the retrying module:
The retrying module monitors a function via a decorator; if the function raises an exception, a retry is triggered
pip install retrying
- Apply the decorator to the function that needs retrying
Set the number of retries via @retry(stop_max_attempt_number=N)
# 1. Import the module
from retrying import retry
# 2. Decorate the function to be retried
@retry(stop_max_attempt_number=3)
def exec_func():
    pass
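If the third-party retrying module is not available, the same pattern can be sketched by hand with a plain decorator. The names retry and flaky below are illustrative, not part of any library:

```python
import functools

def retry(stop_max_attempt_number=3):
    """Re-invoke the wrapped function until it succeeds or the
    attempt limit is reached, then re-raise the last exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(stop_max_attempt_number):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
            raise last_exc
        return wrapper
    return decorator

attempts = []

@retry(stop_max_attempt_number=3)
def flaky():
    # Fails on the first two calls, succeeds on the third
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("temporary failure")
    return "ok"

print(flaky())        # ok
print(len(attempts))  # 3
```

In a crawler, the body of flaky would be the requests.get call with a timeout, so that transient network errors are retried a bounded number of times before the exception reaches the caller.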
urllib
Using the urllib network library in Python 3
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 1. Import the module
import urllib.request
# 2. Make the network request
# 2.1 Define the request URL
url = "https://github.com"
# 2.2 Define custom request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "Referer": "https://github.com/",
    "Host": "github.com"
}
# Define the request object
req = urllib.request.Request(
    url=url,
    headers=headers
)
# Send the request
resp = urllib.request.urlopen(req)
# Handle the response
with open('github.txt', 'wb') as f:
    f.write(resp.read())
urllib precautions
- If the URL contains characters that need to be escaped
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 1. Import the modules
import urllib.request
import urllib.parse
# 2. Make the request and get the response
wd = input("Enter the search term: ")
# 2.1 Define the request URL
url = "https://www.baidu.com/s?wd="
# 2.2 Define custom request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "Referer": "https://www.baidu.com/",
    "Host": "www.baidu.com"
}
# 2.3 Define the request object; the query term must be escaped with urllib.parse.quote
request = urllib.request.Request(
    url=url + urllib.parse.quote(wd),
    headers=headers
)
# 2.4 Send the request
response = urllib.request.urlopen(request)
# 3. Handle the response
with open('02.html', 'wb') as f:
    f.write(response.read())
response.read()
- The return value is of type bytes; decode it to get a string
html = response.read().decode('utf-8')
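The escaping done by urllib.parse.quote can be checked on its own, without sending a request; the sample string below is illustrative:

```python
from urllib.parse import quote, unquote

# Non-ASCII characters (and spaces) must be percent-encoded
# before they can be placed in a URL
word = "hello 世界"
encoded = quote(word)
print(encoded)                    # hello%20%E4%B8%96%E7%95%8C
print(unquote(encoded) == word)   # True
```

Letters, digits, and a few safe characters pass through unchanged, while everything else becomes %XX escapes of its UTF-8 bytes; unquote reverses the transformation.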
Reprinted from https://github.com/Kr1s77/Python-crawler-tutorial-starts-from-zero