Learning and Application Examples of the requests Library
Tutorial outline
- Requests: automated crawling of HTML pages and automated submission of network requests
- Robots protocol: the web crawler exclusion standard
- Projects: hands-on practice projects
Unit 1: Getting Started with the Requests Library
Requests library installation: pip install requests
get() and head() are the most commonly used methods
get() method
import requests
r = requests.get("url")
#get -> Request: constructs a Request object that asks the server for the resource
#response -> r: returns a Response object containing the server's resource (the content fetched by the crawler)
#requests.get(url, params=None, **kwargs)
#url: the URL of the page to fetch
#params: extra parameters added to the URL; dict or byte stream, optional
#**kwargs: 12 optional keyword arguments that control access
In fact, the Requests library has only one underlying method, request(); every other method (get(), head(), post(), and so on) simply calls request().
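As a sketch of this idea (simplified for illustration, not the library's actual source), get() can be thought of as a thin wrapper around request():

```
import requests

# simplified sketch: get() as a thin wrapper over the single base method request()
def get(url, params=None, **kwargs):
    return requests.request('GET', url, params=params, **kwargs)

# head(), post(), put(), etc. follow the same pattern with their own method names
```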
#The two important objects of the Requests library: Request and Response
#Attributes of Response (it contains the content returned by the crawler)
#First check r.status_code:
#404 or another code: error or exception
#200: the other attributes can be inspected
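A minimal sketch of checking the status code and then inspecting the main Response attributes (the URL here is just a placeholder):

```
import requests

r = requests.get("https://www.baidu.com/")   # placeholder URL for illustration
if r.status_code == 200:
    print(r.encoding)            # encoding guessed from the HTTP headers
    print(r.apparent_encoding)   # encoding guessed from the page content
    print(r.text[:200])          # the response body as text
    print(len(r.content))        # the response body as bytes
else:
    print("Error or exception")  # 404 or another code
```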
A general code framework for crawling web pages
#A general code framework for crawling web pages
#Network connections carry risk, so exception handling is essential
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "An exception occurred"

if __name__ == "__main__":
    url = "https://www.baidu.com/"
    print(getHTMLText(url))
Running this returns the page content correctly.
The role of the `if __name__ == "__main__":` statement:
A Python file can be used in two ways: executed directly as a script, or imported into another Python script and called there (module reuse). The role of `if __name__ == "__main__":` is therefore to control which code runs in each case: the code under it is executed only in the first case (when the file is run directly as a script), and is not executed when the file is imported by another script.
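A minimal sketch of the two modes, using a hypothetical file name mymodule.py:

```
# mymodule.py (hypothetical file name)
def greet():
    print("hello")

if __name__ == "__main__":
    # runs only when executed directly: python mymodule.py
    greet()

# in another script:
#   import mymodule   # the guarded block above does NOT run on import
#   mymodule.greet()  # but the function is still available for reuse
```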
HTTP protocol and Requests library methods
HTTP protocol: Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on the "request and response" model.
HTTP uses URLs as identifiers for locating network resources.
URL format: http://host[:port][path]
host: a legal Internet host domain name or IP address
port: the port number; the default port is 80
path: the path of the requested resource
Example: http://www.bit.edu.cn
Understanding HTTP URLs
A URL is an Internet path for accessing resources via the HTTP protocol; each URL corresponds to one data resource.
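As an illustration, the standard-library urlparse can split a URL into the host, port, and path components described above (the path here is hypothetical):

```
from urllib.parse import urlparse

u = urlparse("http://www.bit.edu.cn:80/index.html")
print(u.hostname)  # www.bit.edu.cn
print(u.port)      # 80
print(u.path)      # /index.html
```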
#The head() method of the Requests library
import requests
r = requests.head("https://www.baidu.com/")
r.headers  # head() fetches only the response headers; the body is not downloaded, so r.text is empty
#The post() method of the Requests library
payload = {'key1': 'value1'}
r = requests.post('http://httpbin.org/post', data=payload)  # a dict POSTed as data is encoded into the form field
print(r.text)
Part of the result:
```
{
"args": {
},
"data": "",
"files": {
},
"form": {
"key1": "value1"
}
```
#The put() method of the Requests library
payload = {'k1': 'v1', 'k2': 'v2'}
r = requests.put('http://httpbin.org/put', data=payload)  # put() submits the fields the same way, overwriting the resource
print(r.text)
Part of the result:
```
{
"args": {
},
"data": "",
"files": {
},
"form": {
"k1": "v1",
"k2": "v2"
},
```
Analysis of the main methods of the Requests library
request() method
#requests.request(method,url,**kwargs)
method: the request method, corresponding to the seven types (GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS)
r = requests.request('GET', url, **kwargs)  # e.g.
url: the URL of the page to fetch
**kwargs: 13 optional keyword arguments that control access
params: dict or byte sequence, added to the URL as query parameters
kv = {'k1': 'v1', 'k2': 'v2'}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)
#result: https://python123.io/ws?k1=v1&k2=v2
data: dict, byte sequence, or file object, sent as the body of the Request
kv = {'k1': 'v1', 'k2': 'v2'}
r = requests.request('POST', 'http://python123.io/ws', data=kv)
json: data in JSON format, sent as the body of the Request
kv = {'k1': 'v1', 'k2': 'v2'}
r = requests.request('POST', 'http://python123.io/ws', json=kv)
headers: dict of custom HTTP headers
hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers=hd)
cookies: dict or CookieJar; the cookies to send with the Request
auth: tuple; supports HTTP authentication
files: dict; for transferring files
fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', 'http://python123.io/ws', files=fs)
timeout: the timeout in seconds
r = requests.request('GET', 'http://python123.io/ws', timeout=10)
proxies: dict; sets proxy servers for the request and can carry login credentials (see the combined sketch after this list)
allow_redirects: True/False, default True; switch for following redirects
stream: True/False, default False; when False the content is downloaded immediately, when True it is streamed on demand
verify: True/False, default True; switch for verifying the SSL certificate
cert: path to a local SSL certificate
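A sketch combining several of these keyword arguments in one call; the proxy addresses and the credentials below are placeholders for illustration, not working values:

```
import requests

# placeholder proxy addresses and credentials, for illustration only
pxs = {
    'http': 'http://user:pass@10.10.10.1:1234',
    'https': 'https://10.10.10.1:4321',
}
r = requests.request(
    'GET', 'http://python123.io/ws',
    proxies=pxs,            # route the request through proxy servers
    auth=('user', 'pass'),  # HTTP basic authentication as a tuple
    timeout=10,             # give up if no response arrives within 10 seconds
    allow_redirects=True,   # follow redirects (the default)
    verify=True,            # verify the SSL certificate (the default)
)
```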
Unit 2: Web Crawlers and "Honor Among Thieves"
Problems caused by web crawlers
- Legal risk
- Privacy leaks
- Harassment (performance load on servers)
Restrictions on web crawlers
- Source review: restrict access by checking the User-Agent field
- Announcement: the Robots protocol
Robots protocol (the web crawler exclusion standard)
Function: the website tells web crawlers which pages may be crawled and which may not
Convention: a robots.txt file in the root directory of the website
Examples: JD's Robots protocol (https://www.jd.com/robots.txt); Baidu's Robots protocol (https://www.baidu.com/robots.txt)
Syntax (inside robots.txt):
User-agent: *    # which crawlers the rules apply to; * means all crawlers
Disallow: /      # path prefixes that may not be crawled; / means everything
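A quick way to inspect a site's Robots protocol is simply to fetch robots.txt with requests; a minimal sketch using the JD URL referenced above:

```
import requests

# fetch and print a site's Robots protocol
r = requests.get("https://www.jd.com/robots.txt", timeout=30)
r.raise_for_status()
print(r.text)
```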
Compliance with the Robots protocol
Web crawlers: should identify robots.txt automatically or manually, and then crawl accordingly.
Binding force: the Robots protocol is advisory rather than binding; a web crawler may choose not to obey it, but doing so carries legal risk.
A crawler whose access behavior is similar to a human's does not need to consult the Robots protocol.
Unit 3: Requests library web crawler combat (5 examples)
Example 1 Crawling of Jingdong commodity pages
import requests

try:
    url = "https://item.jd.com/100021007462.html"
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # print only the first 1000 characters
except:
    print("Crawl failed")
Result: the first 1000 characters of the product page are printed.
Example 2 Crawling of Amazon product pages
Unlike the JD product page, accessing Amazon requires our code to simulate a browser when sending HTTP requests, using the headers field.
#working source code
import requests

try:
    url = "https://www.amazon.cn/dp/B0814XNDPM/ref=s9_acsd_hps_bw_c2_x_2_i?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=TR3JA9FYNTNPF2PZ66V3&pf_rd_t=101&pf_rd_p=7235aeb5-a996-42a4-a46a-257db647554a&pf_rd_i=2032713071"
    kv = {"user-agent": "Mozilla/5.0"}  # redefine the user-agent to look like a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[6000:10000])
except:
    print("Crawl failed")
Without this header, the crawler faithfully tells Amazon that it is python-requests, and Amazon's source review rejects such a crawler.
We can change the headers to simulate a browser sending the request to Amazon: first construct a key-value pair that redefines the user-agent content, then pass it to the get() function (the sketch below makes the difference visible).
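The r.request attribute records the headers that were actually sent; a sketch (with the product URL shortened for readability):

```
import requests

url = "https://www.amazon.cn/dp/B0814XNDPM"   # shortened product URL
r = requests.get(url)
print(r.request.headers['User-Agent'])        # python-requests/x.y.z -> rejected

kv = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=kv)
print(r.request.headers['User-Agent'])        # Mozilla/5.0 -> looks like a browser
```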
Example 3 Baidu 360 search keyword submission
Search engine keyword submission interface
Baidu: http://www.baidu.com/s?wd=keyword
360: http://www.so.com/s?q=keyword
import requests

keyword = 'Python'
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # the URL actually requested, with the keyword appended
    r.raise_for_status()
    print(len(r.text))
except:
    print("Crawl failed")
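The same submission works against 360's interface by switching to its 'q' parameter; a minimal sketch:

```
import requests

keyword = 'Python'
try:
    kv = {'q': keyword}  # 360 uses 'q' where Baidu uses 'wd'
    r = requests.get("http://www.so.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Crawl failed")
```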
Example 4 Crawling and Storage of Network Pictures (Video)
Format of web image link:
http://www.example.com/picture.jpg
import requests
import os

url = "http://cj.jj20.com/2020/down.html?picurl=/up/allimg/tp05/19100120461512E-0.jpg"
root = "C://程序员专用软件//"
path = root + url.split('/')[-1]  # use the last URL segment as the file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)  # r.content is the binary content of the response
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")
Example 5 Automatic query of IP address attribution
https://www.ip138.com/  # site for querying IP address attribution
https://www.ip138.com/iplookup.asp?ip=112.224.74.158&action=2  # the query link takes this form
import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # the attribution information appears near the end of the page
except:
    print("Crawl failed")
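For the newer link form noted at the top of this example, a hedged sketch (the site may also apply source review, so a browser-like user-agent is supplied as a precaution):

```
import requests

url = "https://www.ip138.com/iplookup.asp?ip=112.224.74.158&action=2"
kv = {'user-agent': 'Mozilla/5.0'}  # precaution against source review
try:
    r = requests.get(url, headers=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print("Crawl failed")
```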