Article directory
1. Ajax GET request
Douban movies (first page)
Requirement: crawl the first page of the Douban movie ranking and save the data locally.
Open the page and inspect the network requests
to find the interface that returns the first-page movie data,
then start writing the code.
1. Define the url and the request headers for the GET request
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
2. Customize the request object
request = urllib.request.Request(url=url, headers=headers)
3. Get the response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
4. Save the data locally (method 1)
fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()
(method 2)
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
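The practical difference between the two methods is cleanup: with method 1 you must remember to call close() yourself, while the with-statement closes the file automatically, even if an error occurs while writing. A minimal sketch (the filenames here are just placeholders for the demo):

```python
content = '{"demo": true}'

# method 1: the file must be closed manually
fp = open('douban_demo.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()

# method 2: the with-statement closes the file automatically on exit
with open('douban_demo2.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
print(fp.closed)  # True
```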
Complete code:
import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# fp = open('douban.json', 'w', encoding='utf-8')
# fp.write(content)
# fp.close()
# another way to save the json data
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
Run result:
Douban movies (first ten pages)
Observe the interface for each page.
Comparing the first, second, and third pages, a pattern emerges:
only start=XX differs.

page   1   2   3   4
start  0  20  40  60

So start = (page - 1) * 20.
Now write the code in three steps:
- customize the request object
- get the response data
- save the data locally
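The page-to-start mapping above can be checked directly:

```python
# verify the pattern observed in the interface urls:
# page 1 -> start 0, page 2 -> start 20, and so on
for page in [1, 2, 3, 4]:
    print('page', page, '-> start', (page - 1) * 20)
```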
Entry point of the program (loop over pages 1-10):
if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        print(page)
Each page needs its own customized request object, so create a method for it:
        # a page parameter is passed in so it can be used inside the function
        creat_request(page)
The function that creates the customized request object:
def creat_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'
    # this is a GET request, so the parameters can be concatenated onto the url
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    # concatenate; for a GET request there is no need to call .encode() afterwards
    data = urllib.parse.urlencode(data)
    url = base_url + data
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
This prints the urls of pages 1-10.
Customize the request object inside the function (one request per page):
    request = urllib.request.Request(url=url, headers=headers)
Step 2: get the response data (define a function):
def get_content():
    response = urllib.request.urlopen()
The function that gets the response data needs the request object, so creat_request has to return it,
the main function has to receive it:
        request = creat_request(page)
and pass it on to the function that gets the response data:
        get_content(request)
Now that function can use the request parameter:
def get_content(request):
    response = urllib.request.urlopen(request)
Step 3: save the data
        # define the method
        down_load(page, content)
As in the previous step, the page and content parameters are needed inside down_load(), so remember to pass them in:
def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)
Complete code:
import urllib.parse
import urllib.request

# the tricky part: every page has a different url
def creat_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = base_url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        request = creat_request(page)
        content = get_content(request)
        down_load(page, content)
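Once the pages are saved, the json module can read them back. A small sketch of parsing such a file, using an inline stand-in for the contents of DB_1.json (the field names "title" and "score" are illustrative assumptions about the response shape, not verified against the live interface):

```python
import json

# stand-in for the json array saved in DB_1.json; the real response
# is a list of movie objects (field names here are assumptions)
sample = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'
movies = json.loads(sample)
for movie in movies:
    print(movie['title'], movie['score'])
```

For a real file, replace the inline string with `json.load(open('DB_1.json', encoding='utf-8'))` and inspect the keys first.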
Run result: the first ten pages of Douban movie data are saved locally.
2. Ajax POST request
KFC official website
Requirement: crawl which locations in a city have a KFC; fetch the first ten pages of data and save them locally.
Open the KFC official website, click restaurant search, and choose the city to crawl (Chengdu here).
Copy the interface address,
then compare the form data of the first and second pages against the interface.
The pattern: only pageIndex differs.
This is roughly the same as the two Douban cases above; the only difference is that this is a POST request, so the form data must also be encoded to bytes.
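The GET/POST difference comes down to one call. A minimal sketch (the parameter values are just examples):

```python
import urllib.parse

params = {'cname': '成都', 'pageIndex': 1}

# GET: the encoded string is appended to the url; no .encode() needed
query = urllib.parse.urlencode(params)
print(query)  # cname=%E6%88%90%E9%83%BD&pageIndex=1

# POST: the same string must additionally be encoded to bytes, because
# it is sent as the request body via the data= argument of Request
body = urllib.parse.urlencode(params).encode('utf-8')
print(type(body) is bytes)  # True
```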
Attached source code:
import urllib.request
import urllib.parse

def create_request(page):
    base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {
        'cname': '成都',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10'
    }
    # POST form data must be encoded to bytes
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    # customize the request object
    request = urllib.request.Request(url=base_url, headers=headers, data=data)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        # customize the request object
        request = create_request(page)
        # get the page source
        content = get_content(request)
        # save the data
        down_load(page, content)
Run result:
3. URLError and HTTPError
Components of a URL:
- protocol
- host
- port
- request path
- parameters (e.g. wd, kw)
- anchor
Both exception classes live in urllib.error, and HTTPError is a subclass of URLError.
Requirement: Get the source code of a web page
import urllib.request
url = 'https://blog.csdn.net/qq_64451048/article/details/127775623'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
Now suppose the url is mistyped (an extra 1 at the end):
an HTTPError is raised.
Catch the exception:
try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:
    print('The system is being upgraded...')
If the url itself is invalid (for example a bad host name), a URLError is raised instead; add a clause for it as well:
except urllib.error.URLError:
    print('The system is being upgraded...')
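Because HTTPError is a subclass of URLError, the HTTPError clause must come first; a URLError clause placed first would catch both. A quick check (the HTTPError here is constructed by hand just for the demo):

```python
import urllib.error

# confirm the class hierarchy
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

try:
    # raise a synthetic HTTPError: (url, code, msg, hdrs, fp)
    raise urllib.error.HTTPError('http://example.com', 404, 'Not Found', None, None)
except urllib.error.HTTPError:
    print('HTTPError branch')  # this branch runs
except urllib.error.URLError:
    print('URLError branch')
```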
4. Handler processor
Handlers customize requests at a more advanced level: once the business logic gets complex, plain request-object customization is no longer enough. In particular, dynamic cookies and proxies cannot be handled by request-object customization alone.
1. Basic use
Requirement: use a handler to access Baidu and get the page source.
Three key words: handler, build_opener, open.
import urllib.request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
# get a handler object
handler = urllib.request.HTTPHandler()
# build an opener from the handler
opener = urllib.request.build_opener(handler)
# call the open method
response = opener.open(request)
content = response.read().decode('utf-8')
print(content)
2. Proxy servers
Common uses of proxies:
- break through your own IP's access restrictions and reach otherwise blocked sites
- access internal resources of a company or group
- improve access speed
- hide your real IP
Change your outgoing IP address by going through a proxy ip:
import urllib.request
url = 'http://www.baidu.com/s?wd=IP'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Cookie': 'BAIDUID=12196FD75453346657491E87390AC35B:FG=1; BIDUPSID=12196FD7545334661F0AE8D4B062BE2E; PSTM=1666008285; ispeed_lsm=2; BDUSS=FhRbk81OEFrZ1RFRFJrWUxCQ1dmRTZUQXp0VXA4ZGZtT0QyOUZ0T0hDRGYtSVpqRVFBQUFBJCQAAAAAAAAAAAEAAACS5FTMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN9rX2Pfa19je; BD_UPN=13314752; baikeVisitId=9616ee70-5c47-41fe-865c-14dc1b170603; COOKIE_SESSION=195933_1_0_1_1_1_1_0_0_1_0_0_0_0_6_0_1666259938_1666064006_1666259932%7C2%230_1_1666063999%7C1; ZFY=M8B:A6gXyHZyKVBf:AqksGBg5jNPPKTmxNoclm:BgHpXzI:C; B64_BOT=1; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=umLOJeC62ZlGj87jjNM7q-J69LozULrTH6_n1tn9KcRk7KlESLLqEG0PWf8g0KubzcDrogKKXeOTHiFF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJIe_C-atC-3fP36q4rVhP4Sqxby26ntamJ9aJ5nJDoADh3Fe5J8MxCIjpLLBjK8BIOE-lR-QpP-_nul5-IByPtwMNJi2UQgBgJDKl0MLU3tbb0xynoD24tvKxnMBMnv5mOnanTI3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjOBDNrP; H_PS_PSSID=36554_37555_37518_37687_37492_34813_37778_37721_37794_36807_37662_37533_37720_37740_26350_22157; delPer=0; BD_CK_SAM=1; PSINO=1; BDRCVFR[Fc9oatPmwxn]=aeXf-1x8UdYcs; BD_HOME=1; sugstore=1; BA_HECTOR=0h040g8k252hak242h8g8rou1hna11u1e; H_PS_645EC=e13b%2FA3XVtQZqyt9d0m3A8twSI3IrHVjaGptlJbr4wMhPOUE0G9YUipXLjIqNjZ2UHOS; BDSVRTM=231'
}
request = urllib.request.Request(url=url, headers=headers)
# simulate a browser visiting the server
# response = urllib.request.urlopen(request)
# proxy ip
proxies = {
    'http': '222.74.73.202:42055'
}
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
3. Proxy pool
A simple proxy pool is just a list of proxy dicts from which one is chosen at random:
import random

proxies_pool = [
    {'http': '222.74.73.202:42055111'},
    {'http': '222.74.73.202:42055222'}
]
proxies = random.choice(proxies_pool)
print(proxies)
Proxy-pool source code:
import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.60:41564'},
    {'http': '121.13.252.60:41564'}
]
proxies = random.choice(proxies_pool)

url = 'http://www.baidu.com/s?wd=IP'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
# customize the request object
request = urllib.request.Request(url=url, headers=headers)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili1.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
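The code above picks one proxy for the whole run. To rotate proxies per request, build a fresh opener inside the loop. A sketch (the proxy addresses are hypothetical placeholders; no request is actually sent here):

```python
import random
import urllib.request

# hypothetical proxy addresses; replace with working ones for a real run
proxies_pool = [
    {'http': '121.13.252.60:41564'},
    {'http': '222.74.73.202:42055'},
]

def make_opener():
    # choose a fresh proxy for each opener, so successive
    # requests can go out through different IPs
    proxies = random.choice(proxies_pool)
    handler = urllib.request.ProxyHandler(proxies=proxies)
    return urllib.request.build_opener(handler)

opener = make_opener()
print(type(opener).__name__)  # OpenerDirector
```

Each call to `make_opener().open(request)` then uses a (possibly) different proxy from the pool.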