Web scraping in practice - crawling mzitu.com (notes on a painful failed attempt)

This walkthrough is for study and reference only; please abide by the relevant laws and regulations.

 

First, let's analyze the target page: https://www.mzitu.com/all/


It is not hard to see that this page contains a large number of gallery links, which makes it particularly convenient for our crawler. So let's keep analyzing.


Comparing the address of the first page with that of the second page, the second page simply appends "/2" to the link; so when crawling we only need to append "/num" to the link to reach page num.
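For example (a tiny sketch; the gallery id 12345 below is just a hypothetical placeholder), the page URLs for one gallery can be built like this:

gallery = 'https://www.mzitu.com/12345'  # hypothetical gallery URL
page_urls = [gallery + '/' + str(n) for n in range(2, 5)]  # pages 2, 3 and 4
print(page_urls)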

So let's start by crawling the index page.

import requests
import re
# crawl the index page
url = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
response = requests.get(url, headers=header).text
print(response)

Running this prints the raw HTML of the index page.
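Before parsing that HTML, it can also help to confirm the request actually succeeded. A small hedged addition (not in the original code) using requests' built-in status check:

resp = requests.get(url, headers=header)
resp.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response, e.g. 403
response = resp.text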


OK, now we need to extract the links we want.

 

# use a regular expression to extract the links we need
req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    print(url, pic_name)

This gives us the links we need.
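By the way, regex over raw HTML is quite brittle. As an alternative sketch (not what this post uses, and assuming the bs4 package is installed and that the gallery links are plain <a> tags pointing at www.mzitu.com), the same links could be collected with BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')
for a in soup.find_all('a', href=True):
    href = a['href']
    name = a.get_text(strip=True)
    # the filtering condition below is an assumption about the index page's structure
    if href.startswith('https://www.mzitu.com/') and name:
        print(href, name)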


Next, we need to find out how many pages each of these links contains.

Looking further at the page source, we can find the pagination markup.


From it we can extract the number of the last page.

req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req,response)
for url,pic_name in urls:
    print(url,pic_name)
    # get the total number of pages inside each gallery URL
    html = requests.get(url, headers=header).text
    req_last = "<a href='.*?'><span>&laquo;上一组</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>…</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>下一页&raquo;</span></a>      </div>"
    last_num = re.findall(req_last,html)
    print(last_num)
    exit()

 

From the match we take the last page number, which gives us the result we need.
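A shorter, hedged alternative to matching the whole pagination bar (it assumes every page number in the bar is wrapped in a bare <span>N</span>, as the regex above suggests) is to collect all numeric spans and take the largest:

nums = re.findall(r'<span>(\d+)</span>', html)
last_page = max(int(n) for n in nums) if nums else 1  # replaces the separate list-to-int step used below
print(last_page)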


The next step is to splice the page URLs together.

We also find that appending "/1" to the original gallery URL still opens the first page we need, which greatly reduces the amount of code, since every page can then be treated the same way.

    # take the list element and convert it to an int
    for k in last_num:
        k = k
    k = int(k)
    # splice the URL
    # (note: range(1, k) stops at k - 1; use range(1, k + 1) if the last page should also be fetched)
    for i in range(1, k):
        url_pic = url + '/' + str(i)
        print(url_pic)
    exit()

PS: the exit() here is just to make testing convenient, so only the page URLs of the first gallery are printed; remove it to go through every gallery.
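As a lighter-weight alternative to exit() while testing (just a sketch, not part of the original code), the outer loop can simply iterate over a slice of the results:

# only process the first gallery while testing; drop the slice to process them all
for url, pic_name in urls[:1]:
    print(url, pic_name)
    # ... the per-gallery page-splicing logic goes here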

Running the loop above prints the spliced page links.

 

These links all open correctly in a browser, but they are still page URLs, not direct image URLs.

Therefore, we filter the information further and refine the code.

    # take the list element and convert it to an int
    for k in last_num:
        k = k
    k = int(k)
    # splice the URL
    for i in range(1, k):
        url_pic = url + '/' + str(i)

        headerss = {
            'User-Agent': '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"',
            'Referer': url_pic,
            'Host': 'www.mzitu.com'
        }

        html_pic_last = requests.get(url_pic, headers=headerss).text
        req_last_pic = '<div class=.*?><p><a href=.*? ><img src="(.*?)" alt=.*? width=.*? height=.*? /></a></p>'
        req_pic_url = re.findall(req_last_pic, html_pic_last)
        for link in req_pic_url:
            link = link
        links = str(link)
        print(links)
        image_content = requests.get(links, headerss).content
        print(image_content)
        # with open("image/" + pic_name + str(i) + ".jpg", "wb") as f:
        #     f.write(image_content)
        exit()

 

But after testing, I found that the saved images could not be opened. Re-checking showed that the final step, downloading the image, fails with a 403 error: the server denies access.

 

I tried changing the headers, but it still did not work, so GG for now!

I will set this aside for now and dig into the cause of the 403 later.
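One thing worth checking before digging deeper (a hedged observation, not something verified in this post): in image_content = requests.get(links, headerss).content the dict is passed positionally, and the second positional parameter of requests.get is params, not headers, so the User-Agent and Referer never actually reach the image server. Image hosts that use Referer-based hotlink protection answer exactly with 403 in that case. A minimal sketch of a corrected download step, reusing the variables from the loop above:

import os

os.makedirs('image', exist_ok=True)  # make sure the output folder exists
image_resp = requests.get(links, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Referer': url_pic,  # the gallery page this image belongs to
})
image_resp.raise_for_status()  # would still raise if the server keeps returning 403
with open('image/' + pic_name + str(i) + '.jpg', 'wb') as f:
    f.write(image_resp.content)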

The complete source code so far is attached below.

import requests
import re
# crawl the index page
url_head = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Referer': 'https://www.mzitu.com'
}
response = requests.get(url_head, headers=header).text
# use a regular expression to extract the links we need
req = '<br>.*?日: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    # get the total number of pages inside each gallery URL
    html = requests.get(url, headers=header).text
    req_last = "<a href='.*?'><span>&laquo;上一组</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>…</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>下一页&raquo;</span></a>      </div>"
    last_num = re.findall(req_last, html)
    # take the list element and convert it to an int
    for k in last_num:
        k = k
    k = int(k)
    # splice the URL
    for i in range(1,k):
        url_pic = url + '/' + str(i)

        headerss = {
            'User-Agent': '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"',
            'Referer': url_pic,        
        }
        html_pic_last = requests.get(url_pic, headers=headerss).text
        req_last_pic = '<div class=.*?><p><a href=.*? ><img src="(.*?)" alt=.*? width=.*? height=.*? /></a></p>'
        req_pic_url =  re.findall(req_last_pic,html_pic_last)
        for link in req_pic_url:
            link = link
        links = str(link)
        print(links)
        image_content = requests.get(links, headerss).content  # note: headerss is passed positionally, so requests treats it as params, not headers
        print(image_content)
        # with open("image/" + pic_name + str(i) + ".jpg", "wb") as f:
        #     f.write(image_content)
        exit()
    exit()

 

Original post: www.cnblogs.com/mrkr/p/12521641.html