The difference between the requests and urllib libraries

Original link: https://blog.csdn.net/sinat_37967865/article/details/85392207

Collected for personal study; will remove upon request if this infringes.

-------------------------------------------------------------------------------------------------------

When writing a Python crawler we need to simulate network requests, and the two main options are the third-party requests library and Python's built-in urllib library. requests is generally recommended; it is a higher-level wrapper around urllib. The main difference between them: requests builds and sends a common GET or POST request in a single call, while urllib generally builds the GET or POST request first and then sends it as a second step.

import requests
 
response_get = requests.get(url, params=None, **kwargs)
response_post = requests.post(url, data=None, json=None, **kwargs)


What we get back from the calls above is a requests.models.Response object, which must be processed to extract the information we need:
response_get.text returns the body as str
response_get.content returns the body as bytes and needs decoding; response_get.content.decode() is equivalent to response_get.text
response_get.json() parses the body as JSON
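A quick illustration of these accessors (a minimal sketch; httpbin.org is an assumed public echo endpoint, not part of the original post):

import requests

r = requests.get('https://httpbin.org/get')   # assumed echo endpoint
print(type(r))             # <class 'requests.models.Response'>
print(r.encoding)          # encoding guessed from the response headers (may be None); used by r.text
print(r.text[:80])         # body as str
print(r.content[:80])      # body as raw bytes
print(r.json()['url'])     # body parsed as JSON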

The simplest GET request is just requests.get(url), but the request can be customized: requests.get(url, headers=headers) lets us set custom headers such as User-Agent and Cookie. Likewise, the simplest POST request is requests.post(url), and requests.post(url, data=data) accepts data as a dictionary, a list of tuples, JSON, and so on.
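The data= and json= parameters of requests.post() are sent differently; a minimal sketch (again assuming httpbin.org as an echo endpoint):

import requests

payload = {'i': 'word', 'doctype': 'json'}

# data=: sent as application/x-www-form-urlencoded form fields
r_form = requests.post('https://httpbin.org/post', data=payload)
print(r_form.json()['form'])    # the form fields, echoed back

# json=: the dict is serialized and sent as an application/json body
r_json = requests.post('https://httpbin.org/post', json=payload)
print(r_json.json()['json'])    # the JSON body, echoed back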

import urllib.request
 
req = urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
response_request = urllib.request.urlopen(req)
response = urllib.request.urlopen(url, data=None, timeout=1, cafile=None, capath=None, cadefault=False, context=None)


The urllib.request module provides the most basic facilities for constructing HTTP requests and can be used to simulate the request process of a browser. It also handles authentication (authorization verification), redirections, cookies, and more.
The context parameter must be of type ssl.SSLContext and specifies SSL settings; the cafile and capath parameters specify a CA certificate and its directory, which is useful when requesting HTTPS links; the cadefault parameter is deprecated and defaults to False.
urlopen() returns an object of type http.client.HTTPResponse. Its main methods include read(), readinto(), getheader(name), getheaders(), and fileno(), and its attributes include msg, version, status, reason, debuglevel, and closed. After assigning this object to response, we can call these methods and attributes to get information about the returned result. For example, response.read() returns the content of the page, and response.status returns the status code: 200 means the request succeeded, 404 means the page was not found, and so on.
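The cookie (and auth/redirect) handling mentioned above goes through handler and opener objects rather than plain urlopen(). A minimal sketch of cookie collection with the standard library (the target URL is only a placeholder):

import http.cookiejar
import urllib.request

# a CookieJar accumulates the cookies that servers set across requests
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

# opener.open() behaves like urlopen() but applies the installed handlers
response = opener.open('http://www.baidu.com')
for cookie in cookie_jar:
    print(cookie.name, '=', cookie.value)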

A simple GET request is urllib.request.urlopen(url), where url can be either a URL string or a Request object. So to customize the request headers we build a request with urllib.request.Request(url, headers=headers) and then pass it to urllib.request.urlopen().
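The timeout parameter shown in the urlopen() signature above is worth using in a crawler; a minimal sketch of it together with urllib's exception hierarchy (the URL and timeout value are assumptions for illustration):

import urllib.error
import urllib.request

try:
    # URLError (wrapping socket.timeout) is raised if no response arrives within 2 seconds
    response = urllib.request.urlopen('http://fanyi.youdao.com', timeout=2)
    print(response.status)
except urllib.error.HTTPError as e:     # the server answered with an error status code
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:      # network-level failure such as a timeout or DNS error
    print('URL error:', e.reason)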

The following is a simple way to use the two libraries:

import requests
import urllib.parse
import urllib.request
 
 
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
api ='http://open.iciba.com/dsapi/'
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'
    }
form_data = {
    "i": "word",
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false"
}
 
 
req1 = requests.get(url, headers=headers)
req_api = requests.get(api, headers=headers)
print(type(req1), type(req1.text), req1.text)             # requests.get().text is of type str
print("bytes:", type(req1.content), req1.content)         # requests.get().content is of type bytes
print("same as req1.text:", req1.content.decode())
print("JSON returned by the API:", req_api.json())        # use requests.get().json() for a JSON response
 
 
# data sent by POST must be bytes or an iterable of bytes, not str
form_data = urllib.parse.urlencode(form_data).encode()
# urllib.request.Request() only builds a request; it does not send it:
req2 = urllib.request.Request(url, data=form_data, headers=headers)
print(type(req2))
 
 
req3 = urllib.request.urlopen(url)      # cannot fake the User-Agent this way; use urllib.request.Request() for that
print(type(req3), req3.status)          # http.client.HTTPResponse
print(req3.getheaders())                # the collection of response headers
print(req3.read().decode("utf-8"))      # urllib.request.urlopen().read().decode("utf-8") is equivalent to requests.get().text
 
req4 = urllib.request.urlopen(req2)     # the argument can also be a Request object directly
print("passing a request directly:", req4.read().decode("utf-8"))

 
