Python: Web Crawlers (the HTTP Protocol and Requests Library Methods)

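HTTP: the Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on a request-response model.
HTTP uses URLs to identify and locate network resources.
URL format: http://host[:port][path]
host:  a valid Internet host domain name or IP address
port:  the port number; defaults to 80 if omitted
path:  the path of the requested resource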

A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.
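
The host, port, and path parts can be picked out of a URL with Python's standard library. This is a minimal sketch, assuming plain http:// URLs only; the helper name describe_url and the second sample URL are made up for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split an http:// URL into its host, port, and path parts."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # Assumption: an http:// URL without an explicit port uses port 80.
        "port": parts.port if parts.port is not None else 80,
        # An empty path refers to the root resource.
        "path": parts.path or "/",
    }


print(describe_url("http://httpbin.org/get"))
print(describe_url("http://127.0.0.1:8080/api/users"))
```

The first call reports port 80 because the URL names no port; the second reports the explicit 8080.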

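HTTP operations on resources:

Method  Description
GET     Requests the resource at the URL
HEAD    Requests the response headers for the resource at the URL, without the body
POST    Submits new data to be appended to the resource at the URL
PUT     Stores a resource at the URL, replacing whatever was there before
PATCH   Partially updates the resource at the URL, changing only part of its content
DELETE  Deletes the resource stored at the URL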

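Understanding the difference between PATCH and PUT:

Suppose the URL holds a record, UserInfo, with 20 fields including UserID and UserName.

Requirement: the user changes UserName and nothing else.

With PATCH, only the partial update for UserName is submitted to the URL.

With PUT, all 20 fields must be submitted to the URL together; any field left out is deleted.

The main benefit of PATCH: it saves network bandwidth.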
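
The difference can be sketched with a toy in-memory model. This is only an illustration of the semantics described above, not how any particular server implements them; the functions apply_put and apply_patch, the Email field, and the sample values are hypothetical, and three fields stand in for the 20.

```python
def apply_put(stored, submitted):
    """PUT: the submitted representation replaces the stored one entirely."""
    return dict(submitted)


def apply_patch(stored, changes):
    """PATCH: only the submitted fields change; all others are kept."""
    updated = dict(stored)
    updated.update(changes)
    return updated


# Hypothetical record; three fields stand in for the 20 in the text.
user_info = {"UserID": 1, "UserName": "alice", "Email": "alice@example.com"}
change = {"UserName": "alice2"}

print(apply_put(user_info, change))    # the unsubmitted fields are gone
print(apply_patch(user_info, change))  # only UserName differs
```

To keep every field with PUT, the client would have to send the whole record back, which is where the extra bandwidth goes.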

                      HTTP Protocol Methods vs. Requests Library Methods

HTTP method  Requests method     Functionality
GET          requests.get()      Same
HEAD         requests.head()     Same
POST         requests.post()     Same
PUT          requests.put()      Same
PATCH        requests.patch()    Same
DELETE       requests.delete()   Same

The head() method of the Requests library:

>>> import requests
>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Wed, 01 Aug 2018 11:52:17 GMT', 'Content-Type': 'application/json', 'Content-Length': '267', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
>>> r.text    # the body is empty
''
>>>
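
The empty body of a HEAD response can also be reproduced without network access. The sketch below is an assumption-laden stand-in for httpbin.org: it uses only the standard library (http.server and http.client) rather than Requests, and the handler DemoHandler, the helper fetch, and the sample body are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b'{"message": "hello"}'  # hypothetical resource content


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        # GET returns the headers followed by the body.
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        # HEAD returns the same headers but no body.
        self._send_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(method, port):
    """Send one request and return (status, Content-Length header, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request(method, "/")
    resp = conn.getresponse()
    result = (resp.status, resp.getheader("Content-Length"), resp.read())
    conn.close()
    return result


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

head_result = fetch("HEAD", port)
get_result = fetch("GET", port)
server.shutdown()
server.server_close()

print(head_result)
print(get_result)
```

Both responses carry the same Content-Length header, but only the GET response has a non-empty body, which is why HEAD is a cheap way to inspect a resource.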

The post() method of the Requests library:

POSTing a dictionary to a URL; it is automatically encoded as form data

>>> payload = {'key1':'value1', 'key2':'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "23",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
},
"json": null,
"origin": "117.136.25.106",
"url": "http://httpbin.org/post"
}

>>>
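
The form encoding reported in the response above can be reproduced locally. This sketch uses the standard library's urlencode as a stand-in for the encoding step, not the Requests code path itself, so treat it as an approximation of what is sent on the wire.

```python
from urllib.parse import urlencode

payload = {"key1": "value1", "key2": "value2"}

# application/x-www-form-urlencoded joins key=value pairs with "&".
body = urlencode(payload)

print(body)       # key1=value1&key2=value2
print(len(body))  # 23
```

The length of 23 matches the Content-Length header echoed back by httpbin.org in the response shown above.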


POSTing a string to a URL; it is automatically sent as raw data
>>> r = requests.post('http://httpbin.org/post',data='ABC')
>>> print(r.text)
{
"args": {},
"data": "ABC",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "3",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
},
"json": null,
"origin": "117.136.25.106",
"url": "http://httpbin.org/post"
}

>>>

The put() method of the Requests library: it works like post(), except that the existing content is overwritten

>>> payload = {'key':'value1', 'key2':'value2'}
>>> r = requests.put('http://httpbin.org/put',data = payload)
>>> print(r.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"key": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "22",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.19.1"
},
"json": null,
"origin": "117.136.25.106",
"url": "http://httpbin.org/put"
}

    
     
   
   


Reposted from www.cnblogs.com/cindy-zl24/p/9403687.html