A Super Simple Introduction to Web Crawlers

Two days ago I read that a programmer wrote a crawler and it ended badly for more than 200 companies. As a Python beginner, that caught my interest, so I studied the topic a little and wrote a small crawler as a getting-started exercise: scraping some pictures of pretty women.

Let's get started.

Looking at pretty pictures makes writing code more fun.

A crawler downloads pictures by first finding their URLs. So how do we find the URLs?

Anyone who has written web pages will know, but I never have. Open any web page and press F12 to bring up the developer tools.

Select the Network tab, pick a request on the left, and look at its Request Headers.

There you can see Referer, which literally means "referrer": the page the request came from. Sites check it to block cross-site requests. (My understanding is that the file selected on the left can only be fetched from this page.) We will send it in our request headers.

User-Agent: I noticed the word Chrome in it and guessed that it identifies the browser. Checking in Firefox, the string contains Firefox instead. So each browser sends its own User-Agent, and we use it to make our requests look like they come from a browser.

headers = {
    'Referer': 'https://www.85814.com/meinv/gaotiaomeinv/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36'
}

We put them in a dictionary because the requests call used next expects a dictionary.

Then connect to the site with the requests library:

import requests

# Headers copied from the browser's developer tools
headers = {
    'Referer': 'https://www.85814.com/meinv/gaotiaomeinv/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36'
}

url = 'https://www.85814.com/meinv/gaotiaomeinv/'
resp = requests.get(url, headers=headers)
print(resp.status_code)  # 200 means the request succeeded

A response status of 200 means the request succeeded.
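Other status codes are worth recognizing too; a 403, for example, usually means the server refused the request. As a minimal sketch, the hypothetical helper `describe_status` below (not part of the original script) uses only the standard library's `http.HTTPStatus` to turn a code into a readable line.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (ok, text) for an HTTP status code; ok is True for any 2xx code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not in the standard library's table
    return 200 <= code < 300, f"{code} {phrase}"


print(describe_status(200))
print(describe_status(403))
```

In the script above it would be called as `describe_status(resp.status_code)`.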

 

Good.

Next we need the URL of each picture. Click the element inspector in the upper-left corner of the developer tools, then click a picture on the page.

You can see that the img tag's alt attribute holds the picture's title and src holds its source address. You can copy the src value and open it in the browser to check.

After trying a few pictures, it is clear that the large pictures in the middle of the page are all organized the same way, so a single path expression can match every src.

Press Ctrl+F in the Elements panel to try a path expression.

The pattern is .//p[@id="l"]//img/@src. The first part, .//p[@id="l"], matches every p tag on the current page whose id attribute is "l", which restricts the search to the main frame. The double slash after p[@id="l"] matches every img underneath it, and the trailing /@src extracts the src URL of each one.
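To see how the parts of the pattern behave, here is a small sketch run against made-up markup. The sample HTML and the example.com URLs are assumptions for illustration, not the real page. It uses the standard library's `xml.etree.ElementTree` rather than lxml, so it only supports a subset of XPath: the trailing `/@src` is not available there, and the attribute is read with `.get("src")` instead.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup: one main block (id="l") and one sidebar.
SAMPLE = """
<html><body>
  <p id="l">
    <a href="/a.html"><img alt="first" src="https://example.com/img/a.jpg" /></a>
    <a href="/b.html"><img alt="second" src="https://example.com/img/b.jpg" /></a>
  </p>
  <p id="sidebar"><img alt="ad" src="https://example.com/img/ad.jpg" /></p>
</body></html>
"""


def extract_srcs(markup):
    """Return the src of every img nested anywhere under <p id="l">."""
    root = ET.fromstring(markup.strip())
    # .//p[@id="l"] limits the search to the main block;
    # //img then matches img elements at any depth below it.
    images = root.findall('.//p[@id="l"]//img')
    return [img.get("src") for img in images]


print(extract_srcs(SAMPLE))
```

The sidebar image is left out because its parent p does not have id="l".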

The code:

from lxml import etree

# Parse the downloaded page, then collect the src of every img under p[@id="l"]
html = etree.HTML(resp.text)
srcs = html.xpath('.//p[@id="l"]//img/@src')

The resulting srcs is a list. Just iterate over it and download the image at each URL:

import time

for src in srcs:
    time.sleep(0.2)  # short pause between requests
    filename = src.split('/')[-1]  # the last path segment is the file name
    # verify=False skips TLS certificate checks
    img = requests.get(src, headers=headers, timeout=10, verify=False)
    with open('imgs/' + filename, 'wb') as file:
        file.write(img.content)

The time module is used to add a delay, so that a burst of requests is not mistaken for an attack on the server; I once got my IP blocked by a website this way. There are many other methods, such as sending a different User-Agent each time to pose as different browsers, or using proxy IPs, which I will cover later.
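The User-Agent idea can be sketched as follows. This is only an illustration: the helper names `build_headers` and `random_delay` are hypothetical, the Firefox string is an example value, and the delay bounds are arbitrary assumptions.

```python
import random

# Example User-Agent strings (assumptions for illustration; in practice,
# copy real ones from the Request Headers of different browsers).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0',
]


def build_headers(referer):
    """Build request headers with a randomly chosen User-Agent."""
    return {'Referer': referer, 'User-Agent': random.choice(USER_AGENTS)}


def random_delay(low=0.2, high=1.0):
    """Pick a pause length in seconds so requests are not evenly spaced."""
    return random.uniform(low, high)


print(build_headers('https://www.85814.com/meinv/gaotiaomeinv/'))
```

Inside the download loop, this would replace the fixed values: `time.sleep(random_delay())` and `requests.get(src, headers=build_headers(url), timeout=10)`.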

Before running, create an imgs directory in the current directory. img.content is the response body as bytes, so the file is opened in 'wb' mode to store them.
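The saving step can also be wrapped in a small helper. This is a sketch under assumptions, not the original script: `filename_from_url` and `save_image` are hypothetical names, the directory is created automatically instead of by hand, and the file name is taken from the URL path so that a query string such as `?size=big` does not end up in it.

```python
import os
import posixpath
from urllib.parse import urlsplit


def filename_from_url(src, default='image.jpg'):
    """Return the last path segment of a URL, ignoring any query or fragment."""
    name = posixpath.basename(urlsplit(src).path)
    return name or default  # fall back when the path ends with a slash


def save_image(content, src, directory='imgs'):
    """Write raw bytes to directory/<file name> and return the file path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, filename_from_url(src))
    with open(path, 'wb') as file:  # 'wb' = binary write, needed for bytes
        file.write(content)
    return path
```

In the loop, the last two lines would then become `save_image(img.content, src)`.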

Run the program, and the imgs directory fills up with pictures.

A simple crawler is complete.

I am still learning, so if you spot any errors or bad practices, please correct me.

 

 

 

 


Origin www.cnblogs.com/hao11/p/11706502.html