Python crawler: downloading images from mn52 past anti-hotlink protection

How anti-hotlinking works

The HTTP standard defines a special header field, Referer. It serves two purposes:
first, it records the address the request came from;
second, for a requested resource, it tells the server which page is embedding it.
Anti-hotlinking schemes are therefore all built on the Referer field.
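On the server side, the check boils down to comparing the Referer's host against the site's own host. A minimal sketch of such a check (the helper name and the policy of rejecting empty Referers are illustrative assumptions, not mn52's actual rules):

```python
from urllib.parse import urlparse

def is_hotlink(referer, allowed_host):
    """Return True if the request looks like a hotlink: the Referer is
    missing or points at a host other than the site's own.
    (Illustrative helper; real servers may apply a looser policy.)"""
    if not referer:
        return True
    return urlparse(referer).netloc != allowed_host

# a request arriving from another site is flagged
print(is_hotlink('https://evil.example.com/page', 'www.mn52.com'))        # True
print(is_hotlink('https://www.mn52.com/meihuoxiezhen/1.html', 'www.mn52.com'))  # False
```

A server applying this rule answers flagged requests with 403, which is exactly the error described below.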
 
So many sites use a Referer check as an anti-crawler mechanism: once it is in place, requesting an image URL directly returns a 403 error.

The fix is actually simple: add request headers and fill in the Referer.

headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'Referer': url
    }
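For illustration, the header construction can be wrapped in a small helper (`build_headers` is a name of my own, not part of the original script); any download then just passes its result to `requests.get`:

```python
def build_headers(referer):
    """Build request headers that satisfy a Referer-based hotlink check.
    The User-Agent string mirrors the one used in the script below."""
    return {
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/65.0.3325.181 Safari/537.36'),
        'Referer': referer,
    }

# usage with requests (not executed here):
#   response = requests.get(img_url, headers=build_headers(page_url))
```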

This article crawls the pictures from https://www.mn52.com/; the full script follows:

# Required libraries
import requests
import re
import os
from multiprocessing import Pool

# main function
def get_img(url):
    # picture storage path
    path = './mn52/'
    if not os.path.exists(path):
        os.mkdir(path)
    # request headers; the image server checks the Referer for hotlinking, so add 'Referer': url
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'Referer': url
    }
    try:
        # request the list page
        response = requests.get(url=url, headers=headers)
        # print(response.text)
        # extract the detail-page links with a regex and iterate over them
        res_paging = re.findall('<div class="picbox">.*?<a href="(.*?)"', response.text, re.S)
        for i in res_paging:
            # build the full detail-page URL
            url_infos = 'https://www.mn52.com' + i
            # request the detail page
            res_details = requests.get(url=url_infos, headers=headers)
            # extract the image URLs
            res_detail = re.findall('<div class="img-wrap">.*?<img.*?rel="(.*?)"/>', res_details.text, re.S)
            for i in res_detail:
                # full image URL
                img_urls = 'https:' + i
                # name the image after the last URL segment
                filename = i.split('/')[-1]
                # skip images that have already been downloaded
                if os.path.exists(path + str(filename)):
                    print('image already exists')
                else:
                    # request the image itself
                    res = requests.get(url=img_urls, headers=headers)
                    # save the image
                    with open(path + str(filename), 'wb') as f:
                        f.write(res.content)
                        # print download progress
                        print('downloading: ' + img_urls)
    except Exception as e:
        print(e)

# program entry
if __name__ == '__main__':
    # build the full list of paging URLs
    urls = ['https://www.mn52.com/meihuoxiezhen/list_2_{}.html'.format(i) for i in range(1, 94)]
    # create a process pool
    pool = Pool()
    # run the crawler across the pages
    pool.map(get_img, urls)
    print('grab is complete')
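The filename logic above (`i.split('/')[-1]`) can also be written with the standard library's URL tools; a small equivalent sketch (the sample URL is made up for illustration):

```python
import os.path
from urllib.parse import urlparse

def filename_from_url(img_url):
    """Take the last path segment of an image URL as the local filename,
    equivalent to img_url.split('/')[-1] in the script above."""
    return os.path.basename(urlparse(img_url).path)

print(filename_from_url('https://www.mn52.com/uploads/2019/08/123.jpg'))  # 123.jpg
```

Unlike a bare split, `urlparse` also strips any query string (`?size=big`) from the name.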

There are a lot of pictures, so the download takes some time; the console shows progress as it goes.

Open the folder to check whether the pictures downloaded successfully.

 

done


Origin www.cnblogs.com/nmsghgnv/p/11311680.html