1. The urllib library (built-in)
1.1 urlopen function
from urllib import request
resp = request.urlopen('https://www.baidu.com')
print(resp.read(10))     # first 10 bytes of the response body
print(resp.readlines())  # remaining body as a list of byte lines
print(resp.getcode())    # HTTP status code, e.g. 200
1.2 urlretrieve function
Downloads a file from the web and saves it locally.
from urllib import request
# Download the file at url and save it under the given filename
request.urlretrieve(url, filename)
Example:
from urllib import request
# Download files via their URLs
request.urlretrieve('https://www.sogou.com/', 'sougou.html')
request.urlretrieve('https://pic.baike.soso.com/ugc/baikepic2/6108/20200529140813-1463090803_jpeg_239_300_8762.jpg/0', 'huge.jpg')
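urlretrieve also accepts a reporthook callback that is invoked as the download progresses, and it handles file:// URLs as well as http(s)://, which makes it easy to try offline. A minimal sketch (the file names here are just examples for illustration):

```python
import tempfile
from pathlib import Path
from urllib import request

# Create a small local file to act as the "remote" resource.
src = Path(tempfile.gettempdir()) / 'retrieve_src.txt'
src.write_bytes(b'hello urlretrieve')

dest = Path(tempfile.gettempdir()) / 'retrieve_dest.txt'

def progress(block_num, block_size, total_size):
    # Called periodically during the transfer: block index, block size, total bytes.
    print(f'{min(block_num * block_size, total_size)}/{total_size} bytes')

# urlretrieve returns the local path and the response headers.
path, headers = request.urlretrieve(src.as_uri(), str(dest), reporthook=progress)
print(dest.read_bytes())
```

The same reporthook signature works for HTTP downloads, which is how progress bars are usually wired up.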
1.3 Encoding and decoding functions
1.3.1 urlencode function: encoding
Converts a dict into URL-encoded query-string data.
1.3.2 parse_qs function: decoding
Decodes URL-encoded query parameters back into a dict (each value is returned as a list).
from urllib import parse
data = {'name': '哈哈', 'age': 18}
# urlencode(dict): the argument is a dict
ps = parse.urlencode(data)
print(ps)                  # name=%E5%93%88%E5%93%88&age=18
print(parse.parse_qs(ps))  # {'name': ['哈哈'], 'age': ['18']}
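For encoding a single string value rather than a whole dict, urllib.parse also provides quote and unquote; and parse_qsl is a variant of parse_qs that returns (key, value) pairs instead of lists. A short sketch:

```python
from urllib import parse

# quote percent-encodes a single string (UTF-8 by default);
# unquote reverses it.
encoded = parse.quote('哈哈')
print(encoded)                 # %E5%93%88%E5%93%88
print(parse.unquote(encoded))  # 哈哈

# parse_qsl returns the decoded parameters as (key, value) tuples.
qs = parse.urlencode({'name': '哈哈', 'age': 18})
print(parse.parse_qsl(qs))     # [('name', '哈哈'), ('age', '18')]
```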
1.4 URL parsing
urlparse function and urlsplit function
Both functions split a URL into its component parts.
The difference: the result of urlparse has a params attribute (the path segment after a semicolon), while the result of urlsplit does not.
from urllib import parse
url = 'http://www.baidu.com/index.html;user?id=S#comment'
result = parse.urlparse(url)
print(result)
print(result.scheme)
print(result.netloc)
print(result.params)
print('-'*20)
result1 = parse.urlsplit(url)
print(result1)
print(result1.scheme)
print(result1.netloc)
# print(result1.params) raises an AttributeError: urlsplit's result has no params attribute
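Going the other way, the parsed result can be reassembled with urlunparse (or urlunsplit for a SplitResult), and a round trip reproduces the original URL:

```python
from urllib import parse

url = 'http://www.baidu.com/index.html;user?id=S#comment'

# urlparse yields a 6-tuple: (scheme, netloc, path, params, query, fragment)
parts = parse.urlparse(url)
print(parts.path, parts.params)  # /index.html user

# urlunparse rebuilds the URL from the 6 components.
rebuilt = parse.urlunparse(parts)
print(rebuilt == url)  # True

# urlsplit yields a 5-tuple (no params); urlunsplit rebuilds it.
print(parse.urlunsplit(parse.urlsplit(url)) == url)  # True
```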
1.5 request.Request
To add request headers (such as User-Agent) to a request, use the request.Request class.
from urllib import request
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
req = request.Request('https://www.baidu.com', headers=header)
resp = request.urlopen(req)  # urlopen also accepts a Request object
print(resp.read())
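A Request object can be inspected before it is sent, which is handy for checking the URL, method, and headers offline (the User-Agent value below is shortened for the example):

```python
from urllib import request

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
req = request.Request('https://www.baidu.com', headers=header)

# The Request carries its metadata without opening a connection.
print(req.full_url)      # https://www.baidu.com
print(req.get_method())  # GET (becomes POST if a data= payload is supplied)
print(req.host)          # www.baidu.com
```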
2. The ProxyHandler handler (proxy settings): working around IP bans
Many websites detect how many requests a given IP makes within a certain period (via traffic statistics, server logs, etc.). If an IP makes too many requests, the site bans it.
When that happens, we need to switch to a different identity to keep fetching data, and that alternate identity is a proxy.
How a proxy works:
Instead of requesting the target website directly, our code sends the request to a proxy server; the proxy server requests the target website and forwards the response data back to our code.
http://httpbin.org/ is a service that echoes the parameters of an HTTP request back to the client (for example, http://httpbin.org/ip returns the requester's IP), which makes it useful for testing proxies.
Commonly used proxy providers include:
Xici free proxy IP: https://mtop.chinaz.com/site_www.xici.net.co.html
Kuaidaili: https://www.kuaidaili.com/
Daili Cloud: http://www.dailiyun.com/
Taking Daili Cloud as an example:
Pick a proxy IP from the provider, then use it as follows.
from urllib import request
# Without a proxy
url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read())
# With a proxy
url = 'http://httpbin.org/ip'
# 1. Create a proxy handler with ProxyHandler
handler = request.ProxyHandler({'http': '140.143.6.16:1080'})
# 2. Build an opener from the handler
opener = request.build_opener(handler)
# 3. Send the request through the opener
resp = opener.open(url)
print(resp.read())
The first request prints our real IP; the second prints the proxy server's IP.
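If every request in a script should go through the proxy, the opener can be installed globally with request.install_opener, after which plain request.urlopen routes through it. A sketch reusing the same example proxy address from above (nothing is actually sent here, so the address need not be live):

```python
from urllib import request

# Build an opener that routes http traffic through a proxy
# (the address below is the example from above, not a guaranteed working proxy).
handler = request.ProxyHandler({'http': '140.143.6.16:1080'})
opener = request.build_opener(handler)

# install_opener makes this opener the default used by request.urlopen.
request.install_opener(opener)

print(isinstance(opener, request.OpenerDirector))                       # True
print(any(isinstance(h, request.ProxyHandler) for h in opener.handlers))  # True
```

After install_opener, request.urlopen(url) behaves like opener.open(url), so existing code picks up the proxy without changes.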