[Python Web Crawler] 150-Lecture Python Web Crawler Course Notes, Part II: Basic Crawler Libraries 1 (urllib)

1. The urllib library (built into Python)

1.1 urlopen function

from urllib import request

resp = request.urlopen('https://www.baidu.com')

print(resp.read(10))     # read the first 10 bytes of the response body
print(resp.readlines())  # read the remaining body as a list of byte lines
print(resp.getcode())    # HTTP status code, e.g. 200
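
Note that read() returns bytes rather than text, so the body usually needs to be decoded before further processing. A minimal sketch (the UTF-8 encoding here is an assumption about the page):

from urllib import request

resp = request.urlopen('https://www.baidu.com')
print(resp.status)                     # same value as resp.getcode()
print(resp.getheader('Content-Type'))  # inspect a single response header
html = resp.read().decode('utf-8')     # decode the raw bytes into a string
print(html[:100])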

1.2 urlretrieve function

Downloads a file from the web and saves it to a local path.

from urllib import request

# download a file from a URL and save it locally
request.urlretrieve(url, local_filename)

For example:

from urllib import request

# download the Sogou home page and save it as sougou.html
request.urlretrieve('https://www.sogou.com/', 'sougou.html')

# download an image and save it as huge.jpg
request.urlretrieve('https://pic.baike.soso.com/ugc/baikepic2/6108/20200529140813-1463090803_jpeg_239_300_8762.jpg/0', 'huge.jpg')
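
urlretrieve also accepts an optional reporthook callback that is called as blocks arrive, which can be used to print download progress. A small sketch (show_progress is just an illustrative name):

from urllib import request

def show_progress(block_num, block_size, total_size):
    # called repeatedly while the file is being downloaded
    downloaded = block_num * block_size
    if total_size > 0:
        percent = min(downloaded / total_size * 100, 100)
        print(f'downloaded {percent:.1f}%')

request.urlretrieve('https://www.sogou.com/', 'sougou.html', reporthook=show_progress)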

1.3 Encoding and decoding functions

1.3.1 urlencode function: encoding

Converts a dictionary into URL-encoded query-string data.

1.3.2 parse_qs function: decoding

Decodes URL-encoded query parameters back into a dictionary.

from urllib import parse

data = {'name': '哈哈', 'age': 18}
# urlencode(dict) takes a dictionary as its argument
ps = parse.urlencode(data)
print(ps)
print(parse.parse_qs(ps))
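
urlencode works on whole dictionaries; to escape a single string that goes into a URL (for example a Chinese keyword in a search query), parse.quote and parse.unquote can be used instead. A quick sketch:

from urllib import parse

kw = parse.quote('哈哈')                 # escape one value
url = 'https://www.baidu.com/s?wd=' + kw
print(url)
print(parse.unquote(url))               # restore the readable form

# urlencode is the convenient choice when appending a whole dict as a query string
params = parse.urlencode({'wd': '哈哈', 'pn': 10})
print('https://www.baidu.com/s?' + params)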

1.4 URL parsing

The urlparse and urlsplit functions

Both split a URL into its components, so the individual parts (scheme, netloc, path, query, fragment, etc.) can be accessed separately.

The difference: the result of urlparse has a params attribute (the part after ';' in the path), while the result of urlsplit does not.

from urllib import parse

url = 'http://www.baidu.com/index.html;user?id=S#comment'

result = parse.urlparse(url)
print(result)
print(result.scheme)
print(result.netloc)
print(result.params)

print('-'*20)

result1 = parse.urlsplit(url)
print(result1)
print(result1.scheme)
print(result1.netloc)
# print(result1.params) would raise an error: the urlsplit result has no params attribute
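
Both results are named tuples, so the pieces can be put back together again with parse.urlunparse and parse.urlunsplit. A minimal sketch:

from urllib import parse

url = 'http://www.baidu.com/index.html;user?id=S#comment'

parts = parse.urlparse(url)
print(parse.urlunparse(parts))    # rebuilds the original URL from 6 components

parts1 = parse.urlsplit(url)
print(parse.urlunsplit(parts1))   # 5 components; ';user' stays inside the path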

1.5 request.Request 

To add request headers (such as User-Agent) to a request, use the request.Request class.

from urllib import request

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}

res = request.Request('https://www.baidu.com', headers=header)
# print(res)
resp = request.urlopen(res)
print(resp.read())
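
request.Request can also carry POST data. The data argument must be bytes, so a form dictionary is first URL-encoded and then encoded to bytes. A hedged sketch that posts to httpbin.org/post (an echo service, not part of the original example):

from urllib import request, parse

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
# data must be bytes, so urlencode the form and then encode it
form = parse.urlencode({'name': '哈哈', 'age': 18}).encode('utf-8')

req = request.Request('http://httpbin.org/post', data=form, headers=header, method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))   # httpbin echoes the submitted form back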

 

2. The ProxyHandler handler (proxy settings): working around IP blocking

Many websites monitor how many requests a given IP makes within a certain period of time (through traffic statistics, system logs, etc.). If there are too many requests, the site will block that IP.

So at that point we often need to switch to another identity to keep fetching the data we need, and that "other identity" is a proxy.

How a proxy works:

Instead of requesting the target website directly, we first send the request to a proxy server; the proxy server then requests the target website and forwards the response data back to our code.

http://httpbin.org/  This website echoes back the parameters of the HTTP request it receives, which makes it handy for checking which IP and headers the target site actually sees.
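
For instance, /ip returns the IP the server saw, and /get echoes back the query arguments and headers. A quick sketch:

from urllib import request
import json

resp = request.urlopen('http://httpbin.org/get?name=test')
info = json.loads(resp.read().decode('utf-8'))
print(info['args'])     # the query parameters the server received
print(info['origin'])   # the IP address the server saw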

Commonly used proxy providers:

Xici free proxy IP: https://mtop.chinaz.com/site_www.xici.net.co.html

Fast proxy: https://www.kuaidaili.com/

Proxy Cloud: http://www.dailiyun.com/

 

Using Proxy Cloud as an example:

Pick a proxy IP from Proxy Cloud, then:

from urllib import request

# without a proxy
url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read())

# with a proxy
url = 'http://httpbin.org/ip'
# 1. create a proxy handler with ProxyHandler
handler = request.ProxyHandler({'http': '140.143.6.16:1080'})
# 2. build an opener from the handler
opener = request.build_opener(handler)
# 3. send the request through the opener
resp = opener.open(url)
print(resp.read())

The first output is the IP seen without a proxy; the second is the IP seen after the proxy is applied.
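
If every subsequent request should go through the proxy, the opener can also be installed globally so that plain request.urlopen calls use it. A minimal sketch (the proxy address is only a placeholder and has likely expired):

from urllib import request

handler = request.ProxyHandler({'http': '140.143.6.16:1080'})  # placeholder proxy IP
opener = request.build_opener(handler)
request.install_opener(opener)   # from now on urlopen() goes through this opener

resp = request.urlopen('http://httpbin.org/ip')
print(resp.read())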


Source: https://blog.csdn.net/weixin_44566432/article/details/108542523