This walkthrough is for study and reference only; please abide by the relevant laws and regulations.
First, we analyze the website: https://www.mzitu.com/all/
It is not hard to see that this page contains a large number of image links, which makes it especially convenient for our crawler. So we continue the analysis.
This is the address of the first page.
This is the second page, so when crawling we only need to append "/num" to the link.
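As a quick sketch of that pattern (the base url is from the analysis above; the exact "/num" scheme and the page count of 3 are illustrative assumptions):

```python
# Build the per-page urls by appending "/num" to the first-page link.
# Page count of 3 is illustrative only.
base = 'https://www.mzitu.com/all'
pages = [base] + [base + '/' + str(n) for n in range(2, 4)]
print(pages)
# → ['https://www.mzitu.com/all', 'https://www.mzitu.com/all/2', 'https://www.mzitu.com/all/3']
```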
So let's crawl the home-page content first.
import requests
import re

# crawl the index page (爬取首页)
url = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
response = requests.get(url, headers=header).text
print(response)
Running it, we get the following:
OK, next we need to extract the links we want:
# Use regular expressions to extract the links we need
req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    print(url, pic_name)
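To check what this pattern captures without hitting the site, we can run it on a made-up fragment shaped like the index-page HTML (the date, url, and gallery name below are invented):

```python
import re

# Invented snippet in the shape of the index-page HTML
sample = '<br>03月09日: <a href="https://www.mzitu.com/12345" target="_blank">example-gallery</a>'
req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
# With two capture groups, findall returns a list of (url, name) tuples
print(re.findall(req, sample))
# → [('https://www.mzitu.com/12345', 'example-gallery')]
```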
Now we have the links we need.
Next we need to find out how many pages each link actually contains.
Analyzing the page source further, we can find:
We can extract the number of the last page from the page content:
req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    print(url, pic_name)
    # get the total number of pages under each url (获取每个url内的总页面数目)
    html = requests.get(url, headers=header).text
    req_last = "<a href='.*?'><span>«上一组</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>…</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>下一页»</span></a> </div>"
    last_num = re.findall(req_last, html)
    print(last_num)
    exit()
The last page number we extract is exactly the result we need.
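The pagination regex can likewise be checked against a fabricated pagination bar (the hrefs and the page count 45 are invented):

```python
import re

# Fabricated pagination HTML in the shape the regex expects
sample = ("<a href='/p/1'><span>«上一组</span></a><span>1</span>"
          "<a href='/p/2'><span>2</span></a><a href='/p/3'><span>3</span></a>"
          "<a href='/p/4'><span>4</span></a><span class='dots'>…</span>"
          "<a href='/p/45'><span>45</span></a>"
          "<a href='/p/2'><span>下一页»</span></a> </div>")
req_last = "<a href='.*?'><span>«上一组</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>…</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>下一页»</span></a> </div>"
# Only the last page number is in a capture group, so findall returns just it
print(re.findall(req_last, sample))
# → ['45']
```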
The next step is to splice the urls together.
We tried appending "/1" to the original first-page url and found that it also opens the page we need, which greatly reduces the amount of code.
# convert the captured page counts from str to int
for k in last_num:
    k = int(k)
    # splice the url (range(1, k + 1) so the last page is included)
    for i in range(1, k + 1):
        url_pic = url + '/' + str(i)
        print(url_pic)
        exit()
PS: the exit() here is only for convenience while testing; once the first url is confirmed to work, the following ones will follow the same pattern, and the exit() can be removed.
We get links like the following:
These links all open correctly, but they are not the direct urls of the images.
Therefore, we filter the information further and optimize the code:
# convert the captured page counts from str to int
for k in last_num:
    k = int(k)
    # splice the url (range(1, k + 1) so the last page is included)
    for i in range(1, k + 1):
        url_pic = url + '/' + str(i)
        headerss = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Referer': url_pic,
            'Host': 'www.mzitu.com'
        }
        html_pic_last = requests.get(url_pic, headers=headerss).text
        req_last_pic = '<div class=.*?><p><a href=.*? ><img src="(.*?)" alt=.*? width=.*? height=.*? /></a></p>'
        req_pic_url = re.findall(req_last_pic, html_pic_last)
        for link in req_pic_url:
            links = str(link)
            print(links)
            # headers must be passed as a keyword argument; passing the dict
            # positionally would treat it as query params and send no headers
            image_content = requests.get(links, headers=headerss).content
            print(image_content)
            # with open("image/" + pic_name + str(i) + ".jpg", "wb") as f:
            #     f.write(image_content)
        exit()
However, after testing I found that the saved images could not be opened. Re-checking showed that the final image-download step returns a 403 error: the server denies access.
I tried changing the header, but it still does not work, so GG!
Let's set this aside for now and analyze the cause of the 403 later.
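One common cause of a 403 on the image request itself is hotlink protection: the image server checks the Referer header and only serves requests that appear to come from the gallery page that embeds the picture. This is only a hypothesis here, and the urls below are invented; a minimal sketch of a download header built that way:

```python
# Hypothetical urls -- in the real code these come from the crawl above
links = 'https://i.example.com/2020/03/09/01.jpg'  # the extracted image url
url_pic = 'https://www.mzitu.com/12345/1'          # the gallery page that embeds it

# Send the gallery page as Referer so the request looks like an in-page load
download_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Referer': url_pic,
}
# image_content = requests.get(links, headers=download_header).content  # untested hypothesis
print(download_header['Referer'])
# → https://www.mzitu.com/12345/1
```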
All the source code so far is attached:
import requests
import re

# crawl the index page
url_head = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Referer': 'https://www.mzitu.com'
}
response = requests.get(url_head, headers=header).text

# use regular expressions to extract the links we need
req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    # get the total number of pages under each url (获取每个url内的总页面数目)
    html = requests.get(url, headers=header).text
    req_last = "<a href='.*?'><span>«上一组</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>…</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>下一页»</span></a> </div>"
    last_num = re.findall(req_last, html)
    # convert the captured page counts from str to int
    for k in last_num:
        k = int(k)
        # splice the url (拼接url)
        for i in range(1, k + 1):
            url_pic = url + '/' + str(i)
            headerss = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
                'Referer': url_pic,
            }
            html_pic_last = requests.get(url_pic, headers=headerss).text
            req_last_pic = '<div class=.*?><p><a href=.*? ><img src="(.*?)" alt=.*? width=.*? height=.*? /></a></p>'
            req_pic_url = re.findall(req_last_pic, html_pic_last)
            for link in req_pic_url:
                links = str(link)
                print(links)
                image_content = requests.get(links, headers=headerss).content
                print(image_content)
                # with open("image/" + pic_name + str(i) + ".jpg", "wb") as f:
                #     f.write(image_content)
            exit()
    exit()