Crawling site data with the requests and lxml modules

Here is a simple example: scraping the titles of the pictures on the site's home page!

The first step is to disguise the crawler's User-Agent (UA) so that the request looks like it comes from a browser. requests.get() accepts a headers parameter, so we can put the User-Agent string into a headers dict and pass it along with the request:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

You can find this value in your browser's developer tools (or any packet-capture tool) under the request headers.
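To confirm the header really gets attached, you can build the request without sending it, using requests' Request/prepare API (no network access needed; the URL is the one used later in this post):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Build and prepare the request without sending it, so we can inspect
# exactly what would go over the wire.
prepared = requests.Request('GET', 'http://699pic.com/photo/', headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

If the User-Agent printed here is the browser string rather than requests' default (`python-requests/x.y.z`), the disguise is in place.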


Now we can send a request to the page and get its HTML:

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

url = 'http://699pic.com/photo/'

# .text gives us the page source as a string
response = requests.get(url=url, headers=headers).text

 

Next, we parse the page source with etree to get an element tree we can query:

tree = etree.HTML(response)
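If you want to experiment with etree.HTML without hitting the site, you can feed it an inline HTML fragment. The snippet below is made up for illustration; its class names and nesting just mirror what the XPath in the next step expects:

```python
from lxml import etree

# A small fragment mimicking the structure described in the article
# (two picture divs under a common "img-show" parent, with the title
# in a <p> inside the second <a>).
html = '''
<div class="img-show">
  <div><div><div>
    <a href="#1"></a><a href="#1"><p>sunset</p></a>
  </div></div></div>
  <div><div><div>
    <a href="#2"></a><a href="#2"><p>forest</p></a>
  </div></div></div>
</div>
'''

tree = etree.HTML(html)
div_list = tree.xpath('//div[@class="img-show"]/div/div/div')
names = [d.xpath('./a[2]/p/text()')[0] for d in div_list]
print(names)  # ['sunset', 'forest']
```

etree.HTML is lenient: it wraps the fragment in html/body tags automatically, so `//` XPath queries still work.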

Inspecting the page structure, we can see that each picture sits in its own div, and all of those divs share the same parent div; each title is in a p tag inside its div. So can we collect all of these divs in one place and loop over them to get each title?

The answer, of course, is yes:

 

div_list = tree.xpath('//div[@class="img-show"]/div/div/div')
print(div_list)
for div in div_list:
    name = div.xpath('./a[2]/p/text()')[0]
    print(name)

 

Here div_list is the collection of picture divs; printing it shows a list of Element objects.

We then loop over this list and, for each element, take out the text of its p tag.
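A side note on the `[0]` index: `xpath('.../text()')` always returns a list, even when there is only one match. A minimal illustration:

```python
from lxml import etree

tree = etree.HTML('<div><p>first title</p></div>')

# text() yields a list of matching text nodes, not a bare string
texts = tree.xpath('//p/text()')
print(texts)      # ['first title']
print(texts[0])   # 'first title'
```

On a real page, if the XPath matches nothing the list is empty and `[0]` raises IndexError, so a guard can be worthwhile.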

Final result: each picture's title is printed out.

All code is shown below:

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

url = 'http://699pic.com/photo/'

response = requests.get(url=url, headers=headers).text
tree = etree.HTML(response)
div_list = tree.xpath('//div[@class="img-show"]/div/div/div')
print(div_list)
with open('name.txt', 'w', encoding='utf-8') as f:
    for div in div_list:
        name = div.xpath('./a[2]/p/text()')[0]
        f.write(name + '\n')

 

Origin www.cnblogs.com/huizaia/p/12581418.html