A brief summary of Python's Requests library

I recently watched teacher Song's Python web crawler course from Beijing Institute of Technology on MOOC. My notes on the Requests library are summarized below.

First, the main methods

requests.request() Constructs a request; the base method that underlies all the methods below
requests.get() The main method for fetching an HTML page, corresponding to HTTP GET
requests.head() Fetches the header information of an HTML page, corresponding to HTTP HEAD
requests.post() Submits a POST request to an HTML page, corresponding to HTTP POST
requests.put() Submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch() Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete() Submits a delete request to an HTML page, corresponding to HTTP DELETE

1, request method

requests.request(method,url,**kwargs)

method: the request method, one of 7: 'GET', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE', 'OPTIONS'

url: the URL of the page to fetch

**kwargs: access control parameters, 13 in total

params Dictionary or byte sequence, added to the url as parameters
data Dictionary, byte sequence, or file object, sent as the content of the Request
json JSON-formatted data, sent as the content of the Request
headers Dictionary, custom HTTP headers
cookies Dictionary or CookieJar, the cookies of the Request
auth Tuple, enables HTTP authentication
files Dictionary, for file transfer
timeout Timeout in seconds
proxies Dictionary, sets access proxy servers; login credentials can be included
allow_redirects True / False, default True, redirect switch
stream True / False, default False; when False the response content is downloaded immediately, when True the download is deferred
verify True / False, default True, SSL certificate verification switch
cert Path to a local SSL certificate
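
As a minimal sketch of how a few of these parameters shape the outgoing request, the snippet below builds a request without sending it, using `requests.Request` and its `prepare()` method. The URL `http://www.example.com/search` and the parameter values are placeholders chosen for illustration; they are not from the course.

```python
import requests

# Build (but do not send) a GET request; the URL and values are placeholders.
req = requests.Request(
    method="GET",
    url="http://www.example.com/search",
    params={"q": "python", "page": 2},      # added to the URL as a query string
    headers={"user-agent": "Mozilla/5.0"},  # custom HTTP header
)
prepared = req.prepare()

print(prepared.method)                 # the HTTP method that would be used
print(prepared.url)                    # the URL with params encoded into it
print(prepared.headers["user-agent"])  # the custom header value
```

Passing the same keyword arguments to requests.request() would actually send the request.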

2, get method

requests.get(url,params = None,**kwargs)

url: the URL of the page to fetch

params: extra parameters for the url, as a dictionary or byte sequence (optional)

**kwargs: 12 access control parameters (the same as for the request method, minus params, which is named explicitly)

r = requests.get(url)

The get method constructs a Request object that asks the server for a resource, and returns a Response object containing the server's resource.

Common attributes of the Response object

r.status_code The HTTP status returned for the request; 200 means success, 404 means failure
r.text The HTTP response content as a string, i.e., the page content at the url
r.encoding The content encoding guessed from the HTTP response headers (if the header has no charset, the encoding is assumed to be ISO-8859-1)
r.apparent_encoding The content encoding inferred from the response content itself (a fallback encoding)
r.content The HTTP response content in binary form
1 r = requests.get("http://www.baidu.com")
2 print(r.status_code)
3 200
4 type(r)
5 <class 'requests.models.Response'>
6 r.headers
7 {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 01 Mar 2020 09:29:13 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

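1: Use the get method to obtain a Response object (assigned to r)

2: Print the status returned for the HTTP request

3: The returned HTTP status

4: Check the type of r

5: The type of r

6: View the header information of r

7: The header information of r

Understanding the Response encoding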

r = requests.get("http://www.baidu.com")
r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9a产å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;京ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
r.encoding
'ISO-8859-1'
r.apparent_encoding
'utf-8'
r.encoding = r.apparent_encoding
r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

The header of r contains no charset, so encoding defaults to "ISO-8859-1". Assigning apparent_encoding (the encoding inferred from the content of r) to encoding yields the correct encoding.
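
The garbled text can be reproduced without any network access. The sketch below uses plain bytes and str conversions to stand in for what r.text does with r.encoding. It illustrates the decoding step only and is not the library's internal code; the variable names are made up for the example.

```python
# The page is really UTF-8, but assume the header carries no charset.
page_title = "百度一下"
raw_bytes = page_title.encode("utf-8")    # plays the role of r.content

garbled = raw_bytes.decode("iso-8859-1")  # like r.text while r.encoding is 'ISO-8859-1'
correct = raw_bytes.decode("utf-8")       # like r.text after r.encoding = 'utf-8'

print(repr(garbled))  # one odd character per byte, as in the first r.text above
print(correct)        # the readable title
```

Because ISO-8859-1 maps every byte to exactly one character, the wrong decoding never fails; it silently produces unreadable text, which is why the encoding has to be checked explicitly.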

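3, head method

requests.head(url,**kwargs)

url: the URL of the page to fetch

**kwargs: access control parameters, 13 in total (exactly the same as for the request method)

4, post method

requests.post(url,data = None,json = None,**kwargs)

url: the URL of the page to fetch

data: dictionary, byte sequence, or file; the content of the Request

json: JSON-formatted data; the content of the Request

**kwargs: access control parameters, 11 in total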
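
To make the difference between data and json concrete, here is a small sketch that prepares two POST requests without sending them and inspects the result. The URL `http://www.example.com/post` and the payload are placeholders, not values from the course.

```python
import json
import requests

url = "http://www.example.com/post"  # placeholder URL, nothing is sent

# data given as a dictionary: encoded as a URL-encoded form
form_req = requests.Request("POST", url, data={"key1": "value1"}).prepare()
print(form_req.headers["Content-Type"])
print(form_req.body)

# json: the object is serialized to JSON text
json_req = requests.Request("POST", url, json={"key1": "value1"}).prepare()
print(json_req.headers["Content-Type"])
print(json.loads(json_req.body))
```

The same payload therefore reaches the server in two different shapes, and the Content-Type header tells the server which one to expect.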

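5, put method

requests.put(url,data = None,**kwargs)

url: the URL of the page to fetch

data: dictionary, byte sequence, or file; the content of the Request

**kwargs: access control parameters, 12 in total

6, patch method

requests.patch(url,data = None,**kwargs)

url: the URL of the page to fetch

data: dictionary, byte sequence, or file; the content of the Request

**kwargs: access control parameters, 12 in total

7, delete method

requests.delete(url,**kwargs)

url: the URL of the page to fetch

**kwargs: access control parameters, 13 in total (exactly the same as for the request method)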

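Second, web page crawling

1, HTTP

HTTP (Hypertext Transfer Protocol) is a stateless application-layer protocol based on a "request and response" model. It uses URLs to identify and locate network resources. A URL is the Internet path for accessing a resource over HTTP, and each URL corresponds to one data resource.

URL format: http://host[:port][path]

host: a valid Internet host domain name or IP address

port: the port number; the default port is 80

path: the path of the requested resource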
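
The three parts can be pulled out of a URL with the standard library. The following is a small sketch using `urllib.parse.urlsplit`; the helper name `describe_url` and the example URLs are made up for illustration, and the fallback to port 80 assumes a plain http URL.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into its host, port, and path parts."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # No explicit port in the URL: fall back to the HTTP default of 80.
        "port": parts.port if parts.port is not None else 80,
        "path": parts.path,
    }

print(describe_url("http://www.example.com/picture.jpg"))
print(describe_url("http://www.example.com:8080/docs/index.html"))
```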

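HTTP operations on resources

GET Requests the resource at the URL
HEAD Requests the response message headers for the resource at the URL, i.e., the header information of that resource
POST Requests that new data be appended to the resource at the URL
PUT Requests that a resource be stored at the URL, overwriting the resource originally at that location
PATCH Requests a partial update of the resource at the URL, i.e., changes part of the content of that resource
DELETE Requests deletion of the resource stored at the URL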

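2, Exceptions

requests.ConnectionError Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError HTTP error
requests.URLRequired Missing URL
requests.TooManyRedirects The maximum number of redirects was exceeded
requests.ConnectTimeout Timed out while connecting to the remote server
requests.Timeout The request to the URL timed out
r.raise_for_status() Raises requests.HTTPError if the status code is a 4xx or 5xx error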
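
The relationships between these exceptions, and the behaviour of raise_for_status(), can be checked offline. The sketch below builds Response objects by hand purely for illustration (real ones come back from requests.get and friends); the helper name `status_raises` is made up for the example.

```python
import requests

# ConnectTimeout is both a connection error and a timeout.
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))
print(issubclass(requests.ConnectTimeout, requests.Timeout))

# Every exception in the list derives from requests.RequestException.
listed = [requests.ConnectionError, requests.HTTPError, requests.URLRequired,
          requests.TooManyRedirects, requests.ConnectTimeout, requests.Timeout]
print(all(issubclass(exc, requests.RequestException) for exc in listed))

def status_raises(code):
    """Return True if raise_for_status() raises HTTPError for this status code."""
    r = requests.Response()  # hand-built Response, for illustration only
    r.status_code = code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 404, 503):
    print(code, status_raises(code))
```

Since all of them share one base class, a single `except requests.RequestException` clause is enough when the caller does not need to tell the failures apart.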

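3, Generic code framework for crawling a web page

import requests

def getHTMLText(url):  # hypothetical wrapper name; a function is needed for "return" to be valid
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError if the status is a 4xx or 5xx error
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"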
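
One possible variation on this framework, sketched below, reports which kind of failure occurred instead of returning a single fixed string. The function name `fetch_text` and the (text, error) return convention are assumptions made for this example, not part of the course code.

```python
import requests

def fetch_text(url, timeout=30):
    """Return (text, error); exactly one of the two is None."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text, None
    except requests.Timeout:
        return None, "timeout"
    except requests.HTTPError as exc:
        return None, "http error: {}".format(exc.response.status_code)
    except requests.RequestException as exc:
        # Covers connection errors, invalid URLs, too many redirects, etc.
        return None, type(exc).__name__

# A URL without a scheme is rejected before any network traffic happens.
print(fetch_text("www.example.com"))
```

The more specific exception classes are listed first because Python uses the first matching except clause.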

Third, examples

1, Crawling a JD.com product page

r = requests.get("https://item.jd.com/100008348542.html")
r.status_code
200
r.text[:1000] # first 1000 characters
'<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n    <!-- shouji -->\n    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />\n    <title>【AppleiPhone 11】Apple iPhone 11 (A2223) 128GB 黑色 移动联通电信4G手机 双卡双待【行情 报价 价格 评测】-京东</title>\n    <meta name="keywords" content="AppleiPhone 11,AppleiPhone 11,AppleiPhone 11报价,AppleiPhone 11报价"/>\n    <meta name="description" content="【AppleiPhone 11】京东JD.COM提供AppleiPhone 11正品行货,并包括AppleiPhone 11网购指南,以及AppleiPhone 11图片、iPhone 11参数、iPhone 11评论、iPhone 11心得、iPhone 11技巧等信息,网购AppleiPhone 11上京东,放心又轻松" />\n    <meta name="format-detection" content="telephone=no">\n    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/100008348542.html">\n    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/100008348542.html">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n    <link rel="canonical" href="//item.jd.com/100008348542.html"/>\n        <link rel="dns-prefetch" href="//misc.360buyimg.com"/'

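This example crawls the site directly, confirms that the returned status is normal, and then prints the output.

Full code for crawling the JD.com product page

import requests
url = "https://item.jd.com/100008348542.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # first 1000 characters
except requests.RequestException:
    print("Crawl failed")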

2, Crawling an Amazon product page

import requests
r = requests.get("https://www.amazon.cn/dp/B01N5RGXYR/ref=Oct_DLandingS_rdp_db4d4b3c")
r.status_code
503
r.encoding
'ISO-8859-1'
r.apparent_encoding
'utf-8'
r.encoding = r.apparent_encoding
r.text[2000:3000] # characters 2001 through 3000
'     <i class="a-icon a-icon-alert"></i>\n                <h4>请输入您在下方看到的字符</h4>\n                <p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>\n                </div>\n            </div>\n\n            <div class="a-section">\n\n                <div class="a-box a-color-offset-background">\n                    <div class="a-box-inner a-padding-extra-large">\n\n                        <form method="get" action="/errors/validateCaptcha" name="">\n                            <input type=hidden name="amzn" value="Gkp1fR4D0RV74ajxxWv5/A==" /><input type=hidden name="amzn-r" value="&#047;dp&#047;B01N5RGXYR&#047;ref&#061;Oct_DLandingS_rdp_db4d4b3c" />\n                            <div class="a-row a-spacing-large">\n                                <div class="a-box">\n                                    <div class="a-box-inner">\n                                        <h4>请输入您在这个图片中看到的字符:</h4>\n                                        <div class="a-row a-text-center">\n          '
r.request.headers
{'User-Agent': 'python-requests/2.23.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
kv = {'user-agent':'Mozilla/5.0'}
url = "https://www.amazon.cn/dp/B01N5RGXYR/ref=Oct_DLandingS_rdp_db4d4b3c"
r = requests.get(url,headers = kv)
r.status_code
200
r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
r.text[2000:3000] # characters 2001 through 3000
'     <i class="a-icon a-icon-alert"></i>\n                <h4>请è¾\x93å\x85¥æ\x82¨å\x9c¨ä¸\x8bæ\x96¹ç\x9c\x8bå\x88°ç\x9a\x84å\xad\x97符</h4>\n                <p class="a-last">æ\x8a±æ\xad\x89ï¼\x8cæ\x88\x91们å\x8fªæ\x98¯æ\x83³ç¡®è®¤ä¸\x80ä¸\x8bå½\x93å\x89\x8d访é\x97®è\x80\x85并é\x9d\x9eè\x87ªå\x8a¨ç¨\x8båº\x8fã\x80\x82为äº\x86è¾¾å\x88°æ\x9c\x80ä½³æ\x95\x88æ\x9e\x9cï¼\x8c请确ä¿\x9dæ\x82¨æµ\x8fè§\x88å\x99¨ä¸\x8aç\x9a\x84 Cookie å·²å\x90¯ç\x94¨ã\x80\x82</p>\n                </div>\n            </div>\n\n            <div class="a-section">\n\n                <div class="a-box a-color-offset-background">\n                    <div class="a-box-inner a-padding-extra-large">\n\n                        <form method="get" action="/errors/validateCaptcha" name="">\n                            <input type=hidden name="amzn" value="9HvqB4kIQY6TxIwPw1QJ4w==" /><input type=hidden name="amzn-r" value="&#047;dp&#047;B01N5RGXYR&#047;ref&#061;Oct_DLandingS_rdp_db4d4b3c" />\n                            <div class="a-row a-spacing-large">\n                                <div class="a-box">\n                                    <div class="a-box-inner">\n                                  '

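In this example the first request returns an error status, because the site rejects crawler access. Changing the 'user-agent' value in the request from 'python-requests/2.23.0' to 'Mozilla/5.0' makes the crawler present itself as a browser (simulated browser access), after which the request returns status 200.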
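
The header change can be inspected without contacting any site. As a sketch, the snippet below prepares two requests through a `requests.Session` and compares the User-Agent each would send; the URL `http://www.example.com/` is a placeholder and nothing is transmitted.

```python
import requests

url = "http://www.example.com/"  # placeholder URL, nothing is sent
session = requests.Session()

# Without custom headers, the library identifies itself as python-requests.
default_req = session.prepare_request(requests.Request("GET", url))
print(default_req.headers["User-Agent"])

# A user-agent passed in headers replaces the default value.
kv = {"user-agent": "Mozilla/5.0"}
browser_req = session.prepare_request(requests.Request("GET", url, headers=kv))
print(browser_req.headers["User-Agent"])
```

Header names are matched case-insensitively, so 'user-agent' in kv overrides the default 'User-Agent' rather than being sent alongside it.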

Full code for crawling the Amazon product page

import requests
url = "https://www.amazon.cn/dp/B01N5RGXYR/ref=Oct_DLandingS_rdp_db4d4b3c"
try:
    kv = {'user-agent':'Mozilla/5.0'}
    r = requests.get(url,headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])  # characters 1001 through 2000
except requests.RequestException:
    print("Crawl failed")

3, Submitting search keywords to Baidu and 360

Baidu's keyword interface

http://www.baidu.com/s?wd=keyword

import requests
kv = {'wd':'Python'}
r = requests.get("http://www.baidu.com/s",params = kv)
r.status_code
200
r.request.url
'https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3DPython&logid=11418784497796902027&signature=de72d4c97890b2a959f599c9dec04cb7&timestamp=1583078733'
len(r.text)
1519

Full code for the Baidu keyword query

import requests
keyword = "Python"
try:
    kv = {'wd':keyword}
    r = requests.get("http://www.baidu.com/s",params = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")

360's keyword interface

http://www.so.com/s?q=keyword

import requests
kv = {'q':'Python'}
r = requests.get("http://www.so.com/s",params = kv)
r.status_code
200
r.request.url
'https://www.so.com/s?q=Python'
len(r.text)
384244

Full code for the 360 keyword query

import requests
keyword = "Python"
try:
    kv = {'q':keyword}
    r = requests.get("http://www.so.com/s",params = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")

In both examples above, it is enough to assign the keyword according to the interface.
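
Since the two interfaces differ only in the base URL and the parameter name, they can be captured in a small table. The sketch below builds the query URL without sending anything; the names `SEARCH_INTERFACES` and `build_search_url` are made up for this example.

```python
import requests

# Keyword interfaces from the text: base URL and name of the keyword parameter.
SEARCH_INTERFACES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}

def build_search_url(engine, keyword):
    """Return the URL that requests would ask for, without sending anything."""
    base_url, param_name = SEARCH_INTERFACES[engine]
    prepared = requests.Request("GET", base_url, params={param_name: keyword}).prepare()
    return prepared.url

print(build_search_url("baidu", "Python"))
print(build_search_url("360", "Python"))
print(build_search_url("baidu", "爬虫"))  # non-ASCII keywords are percent-encoded
```

Letting params do the encoding avoids building the query string by hand, which matters as soon as the keyword contains spaces or non-ASCII characters.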

4, Crawling web images

Format of a web image link

http://www.example.com/picture.jpg

import requests
path = "D:/abc.jpg"
url ="https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1583091303697&di=888f5e536474f93b599cc53ba4fa91c4&imgtype=0&src=http%3A%2F%2Fn.sinaimg.cn%2Fsinacn14%2F569%2Fw640h729%2F20180505%2Fee80-hacuuvt5406878.jpg"
r = requests.get(url)
r.status_code
200
with open(path,'wb') as f:
    f.write(r.content)  # the with block closes the file automatically

Full code for crawling a web image

import requests
import os
url = "https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1583091303697&di=888f5e536474f93b599cc53ba4fa91c4&imgtype=0&src=http%3A%2F%2Fn.sinaimg.cn%2Fsinacn14%2F569%2Fw640h729%2F20180505%2Fee80-hacuuvt5406878.jpg"
root = "D://picture//"
path = root + url.split('/')[-1]  # file name taken from the last URL segment
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path,'wb') as f:
            f.write(r.content)  # the with block closes the file automatically
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")
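
Note that url.split('/')[-1] keeps the query string, so for a link like the timg one above the resulting name contains characters such as '?', which Windows does not allow in file names. A hedged alternative, sketched below, takes the name from the URL path only; the helper `filename_from_url` and its `default` argument are assumptions made for this example.

```python
import os
from urllib.parse import urlsplit

def filename_from_url(url, default="download.bin"):
    """Take the last segment of the URL path, ignoring the query string."""
    name = os.path.basename(urlsplit(url).path)
    return name if name else default

print(filename_from_url("http://www.example.com/picture.jpg"))
print(filename_from_url("http://www.example.com/picture.jpg?size=large"))
print(filename_from_url("http://www.example.com/"))
```

This works for links in the plain http://www.example.com/picture.jpg format; when the real image name is hidden inside a query parameter, it has to be extracted from that parameter instead.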

5, Automatic lookup of an IP address's location

126 query interface: http://ip.ws.126.net/ipquery?ip=ipaddress

import requests
url = 'http://ip.ws.126.net/ipquery?ip='
r = requests.get(url + '123.123.123.123')
r.status_code
200
r.text
'var lo="北京市", lc="北京市";\r\nvar localAddress={city:"北京市", province:"北京市"}\r\n'

Full code for the IP address query

import requests
url = 'http://ip.ws.126.net/ipquery?ip='
try:
    r = requests.get(url + '123.123.123.123')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except requests.RequestException:
    print("Crawl failed")
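
The interface returns a JavaScript snippet rather than JSON, so the location still has to be extracted from the text. The sketch below does this with a regular expression on the sample response shown in the session above. The pattern assumes the response always has this exact shape, which is inferred from that single sample; the helper name `parse_location` is made up.

```python
import re

# Sample response text copied from the interactive session above.
sample = 'var lo="北京市", lc="北京市";\r\nvar localAddress={city:"北京市", province:"北京市"}\r\n'

def parse_location(text):
    """Extract city and province from the localAddress object, or return None."""
    match = re.search(r'city:"([^"]*)",\s*province:"([^"]*)"', text)
    if match is None:
        return None
    return {"city": match.group(1), "province": match.group(2)}

print(parse_location(sample))
```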

Origin www.cnblogs.com/huskysir/p/12390677.html