Python web crawler and information extraction (2): the **kwargs parameters explained in detail

Foreword

  In the previous section we got a first taste of the requests library through its get method, and mentioned that besides the URL the get method accepts thirteen optional parameters collected in **kwargs. In this section we discuss the meaning and usage of these thirteen parameters in depth.

Main text

  We know that the requests method is the foundation of all methods in the requests library, so the thirteen parameters in **kwargs are not unique to the get method; they apply to the requests method and to its six derived methods as well.

    The detailed parameters of **kwargs are as follows:

parameter        effect
params           dictionary or byte sequence, appended to the URL as query parameters
data             dictionary, byte sequence, or file object, sent as the content of the Request
json             data in JSON format, sent as the content of the Request
headers          dictionary, custom HTTP headers
cookies          dictionary or CookieJar, the cookies sent with the Request
auth             tuple, supports HTTP authentication
files            dictionary, used to transfer files
timeout          timeout for the request, in seconds
proxies          dictionary, sets proxy servers for access; may include login credentials
allow_redirects  True/False, default True; switch for following redirects
stream           True/False, default False; switch for deferring download of the fetched content
verify           True/False, default True; switch for verifying the SSL certificate
cert             path to a local SSL certificate

As before, we use Baidu as the example.


  params adds a dictionary or byte sequence to the URL as query parameters (it appends '?' to the original URL and then inserts the user-supplied key-value pairs). With it we can submit a search request to Baidu directly from the requests library, and len(r.text) tells us the length of the result page Baidu returns, as the sketch below shows.
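  A minimal sketch (Baidu's search path '/s' and its 'wd' query parameter are assumptions about Baidu's interface, not part of the requests library):

import requests

kv = {'wd': 'Python'}
r = requests.get('http://www.baidu.com/s', params = kv)   # the url becomes .../s?wd=Python
print(r.request.url)    # the full url that was actually requested
print(len(r.text))      # length of the result page returned by Baidu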

  The data parameter is used to submit data to the specified URL as the body of the request, for example:

import requests
kv = {'key1' : 'value1', 'key2' : 'value2'}
r = requests.post('http://www.baidu.com', data = kv)   # kv is sent as form data in the request body

  That is, we submit our data to the Baidu home page (of course, submitting it there serves no real purpose);

  The json parameter submits data in JSON format to the server:

kv = {'key1' : 'value1'}
r = requests.post('http://www.baidu.com', json = kv)

  This assigns the data in kv to the JSON field on the server side;

  headers customizes the HTTP headers through a user-defined dictionary.
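  For instance, a minimal sketch of inspecting the header that a plain request actually carried (the version string in the comment is only indicative):

import requests

r = requests.get('http://www.baidu.com')
print(r.request.headers['User-Agent'])   # something like 'python-requests/2.x.x'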


  At this point the content corresponding to 'User-Agent' is 'python-requests', i.e., we have declared to the server that this request was generated by the requests library. What happens if we use headers to change the HTTP header ourselves?
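  A minimal sketch of such a change, altering only the user-agent field:

import requests

hd = {'user-agent' : 'Mozilla/5.0'}
r = requests.get('http://www.baidu.com', headers = hd)
print(r.request.headers['User-Agent'])   # Mozilla/5.0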

  Here we change the content of 'user-agent' to 'Mozilla/5.0', i.e., we disguise our request as one sent to the server by a Mozilla/5.0 browser. When a server blocks crawlers or only accepts requests coming from browsers, this lets us dress our request up as a browser request and bypass the restriction.

  Through the headers parameter we can modify any field of the HTTP header.

  The files parameter takes a dictionary and is used to transfer files to the server:

fv = {'file' : open('file.txt', 'rb')}   # open the file in binary mode for upload
r = requests.post('http://www.baidu.com', files = fv)

  timeout sets the timeout for the request, in seconds (s):

r = requests.get('http://www.baidu.com', timeout = 10)

  Here the timeout is set to 10 s: if the server does not return a Response object within 10 seconds, the program raises a timeout exception, which we can catch as sketched below.
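  A minimal sketch of handling that exception:

import requests

try:
    r = requests.get('http://www.baidu.com', timeout = 10)
    print(r.status_code)
except requests.exceptions.Timeout:
    # no complete response arrived within 10 seconds
    print('the request timed out')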

  The proxies field sets the proxy servers used when we crawl pages, and can also carry the username and password for logging in to the proxy server.

  

px = {'http' : 'http://user:pass@x.x.x.x:x',
      'https': 'http://x.x.x.x:x'}
r = requests.get('http://www.baidu.com', proxies = px)

  As shown above, we set the proxy address together with a username and password for http pages; we can also set just the proxy address without credentials (as done for https above). Routing requests through a proxy server in this way helps keep our crawler from being traced back to us.

 

  cookies parses the cookies of the HTTP protocol from a user-supplied dictionary or CookieJar; auth takes a tuple and supports HTTP authentication; allow_redirects sets whether redirects of the URL are followed; stream sets whether download of the fetched content is deferred or immediate; verify sets whether the SSL certificate is verified; and cert sets the path of a local SSL certificate. Following redirects and certificate verification are enabled by default, while the response body is downloaded immediately by default (stream defaults to False). A brief sketch of these fields follows.
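  The values below (the cookie name, the credentials, and the certificate path) are placeholders for illustration, not values any real server expects:

import requests

# cookies: a dict (or CookieJar) sent along with the request
r = requests.get('http://www.baidu.com', cookies = {'session_id': '123'})

# auth: a (user, password) tuple for HTTP Basic authentication
r = requests.get('http://www.baidu.com', auth = ('user', 'pass'))

# allow_redirects / verify are on by default; stream is off by default
r = requests.get('https://www.baidu.com',
                 allow_redirects = False,   # do not follow redirects
                 verify = True,             # verify the SSL certificate
                 stream = True)             # defer downloading the body

# cert: path to a local SSL client certificate (placeholder path)
# r = requests.get('https://www.baidu.com', cert = '/path/to/client.pem')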

  These six fields all correspond to advanced features of the Requests library; we will come back to them in depth when concrete problems call for them.

  Among the six methods derived from requests, a parameter from **kwargs sometimes appears as an explicit parameter of the method. In get(url, params, **kwargs), for example, params is made explicit because it is used so often with get, and the **kwargs there refers to the remaining twelve optional parameters, as the sketch below illustrates.
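  A minimal sketch mixing the explicit params parameter with two of the remaining options passed through **kwargs (the '/s' path and the 'wd' query parameter are, again, assumptions about Baidu's search interface):

import requests

kv = {'wd': 'Python'}
# params is explicit in get(); timeout and headers travel through **kwargs
r = requests.get('http://www.baidu.com/s', params = kv,
                 timeout = 10, headers = {'user-agent': 'Mozilla/5.0'})
print(r.request.url)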



