As everyone knows, when you're learning Python, writing an image crawler is practically a rite of passage on the road to mastery. Pretty-girl galleries, wallpaper packs: they're all just practice material, and even if we scrape them down we'll never actually look at them. Really. Never look.
Anyway, back to the point. My first crawl only grabbed the images on the home page and never followed the detail pages, which was unsatisfying: just as things get interesting, the set cuts off. (Not that I was looking, so I wouldn't know what I was crawling.)
The first crawl is written up here:
the first crawl
Crawling only the home page and never seeing the detail pages was too big a regret, so I improved the code: copy a detail-page URL, extract every image on that detail page, and that became the second crawl. It's written up here:
the second crawl
But every time I had to open a detail page, copy its URL, and paste it into PyCharm. Lazy as I am, I really didn't want to keep typing them in by hand, although, well. . .
Fine, I'll just type them in one by one!!!
Except there are far too many of them; who knows when I'd finish. I had to think of a better way.
So I improved the code again: this time I took the URL from the search box and looked for a pattern. It turns out only two parameters ever change (the search keyword and the page number), which is easy to handle: just construct the URL,
changing only those two parameters; page = 1 gives the first page. Nothing to it.
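As a sketch, that construction could look like this. `build_search_urls` is a helper name I've made up; the query-string template is the one that appears in the full code at the end of the post:

```python
# Sketch only: build the list of search-page URLs for a keyword.
# Only the two changing parameters (q and page) are filled in.
def build_search_urls(keyword, pages):
    base = ('https://www.yeitu.com/index.php?m=search&c=index&a=init'
            '&typeid=&siteid=1&q={}&page={}')
    return [base.format(keyword, page) for page in range(1, pages + 1)]
```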
Construction done,
the next step is to request and parse. The request and parsing logic is the same as in the first and second crawls, and everything went smoothly, until I looked at the detail-page addresses and discovered that the first page of each gallery has a URL different from the pages that follow it.
Address of the first page:
Address of the second page:
The parameters differ as well, so the first page has to be handled on its own; the URLs of the later pages just need to be taken apart and reassembled. The code is given below.
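The take-apart-and-reassemble step can be sketched as a small helper. `build_page_url` is my own name; the underscore-and-dot pattern is the one used in the full code at the end:

```python
# Sketch only: turn a gallery's first-page URL into the URL of page N.
# Assumes the first page looks like '.../name_123.html' and page N
# looks like '.../name_123_N.html', as in the full code below.
def build_page_url(first_url, page):
    parts = first_url.split('_')
    # keep the part before '_', drop '.html', re-join with the page number
    return parts[0] + '_' + parts[1].split('.')[0] + '_' + str(page) + '.html'
```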
Search for a name, and you successfully get the addresses of all of that person's images on the site.
One problem remained: I still had to visit the site, find the name of the person I wanted to download, and then put that name into the constructed search URL, which is still a hassle. So I thought of using input() to type the name directly in PyCharm, with no need to open the site at all.
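Reading the keyword from the console is a one-liner; here is a testable sketch. `ask_keyword` is a name I've made up, and the `reader` parameter only exists so the function can be exercised without a real console:

```python
# Sketch only: read the search keyword from the console.
# 'reader' defaults to the built-in input(), but can be swapped out.
def ask_keyword(reader=input):
    return reader('Enter the name of the person to download: ').strip()
```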
And you're done!!!
First try. . .
New folder created and images saved successfully.
Finally:
this is a single-threaded crawler, so it isn't suited to crawling a large number of pages, or it will run for a very long time.
Here's a look at how much it crawled:
there's a lot more that I haven't posted. Also, remember to look after your health. . .
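If you do want to crawl many pages, the downloads can be parallelized with a thread pool from the standard library. This is only a sketch: `download_all` and `download_one` are stand-in names, with `download_one` representing the per-image request-and-save step from save_images below:

```python
# Sketch only: download images concurrently instead of one by one.
from concurrent.futures import ThreadPoolExecutor

def download_all(image_urls, download_one, workers=8):
    # map each image URL to download_one across a pool of worker threads
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(download_one, image_urls))
```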
Here is the complete code:
'''
Use the requests library to fetch the target URLs,
XPath to extract the image addresses from each page,
and the os module to create a folder to store the images.
Written in a function-oriented style.
'''
# Import third-party libraries
import requests
from lxml import etree
import time
import os
# user-agent library
from fake_useragent import UserAgent

# Pick a random User-Agent
ua = UserAgent()
headers = {'User-Agent': ua.random}
# Fetch the HTML of a search page
def get_html(url):
    time.sleep(1)
    # .text can come back garbled, so decode the raw bytes as UTF-8 instead
    html = requests.get(url, headers=headers).content.decode('utf-8')
    return html
# Parse an intermediate (search-result) page
def mid_paser_html(html):
    data01 = []
    e = etree.HTML(html)
    # Extract the detail-page URLs
    details_list = e.xpath('//div[@class="list_box_info"]/h5/a/@href')
    for details_page in details_list:
        data01.append(details_page)
    return data01
# Parse out the final image addresses
def f_paser_html(data01):
    details = {}
    detail = []
    for images in data01:
        html01 = requests.get(url=images, headers=headers).content.decode('utf-8')
        e = etree.HTML(html01)
        # Extract the total number of pages in this gallery
        nums = e.xpath('//div[@class="imageset"]/span[@class="imageset-sum"]/text()')
        for page in range(1, int(nums[0].split(' ')[1])):
            # The first page of each gallery has a different URL from the rest,
            # so it needs special handling
            if page == 1:
                # The first page's URL is just the detail-page URL itself
                html = requests.get(url=images, headers=headers).content.decode('utf-8')
                e = etree.HTML(html)
                # Extract the image addresses with XPath
                image = e.xpath('//div[@class="img_box"]/a/img/@src')
            else:
                # The remaining pages' URLs must be constructed from the first
                # page's URL: split it on '_', keep the first part, take the
                # second part up to the '.', then re-join them with '_',
                # appending the page number and '.html'
                urls = str(images).split('_')[0] + '_' + str(images).split('_')[1].split('.')[0] + '_' + str(page) + '.html'
                # Request the constructed URL
                html = requests.get(url=urls, headers=headers).content.decode('utf-8')
                e = etree.HTML(html)
                # Extract the image addresses
                image = e.xpath('//div[@class="img_box"]/a/img/@src')
            # Store in the dict
            details['image'] = image
            # Loop over the dict entry, appending each address to the list
            for det in details['image']:
                detail.append(det)
    return detail
def save_images(detail):
    # Create the folder if it doesn't exist yet
    if not os.path.exists(temp):
        os.mkdir(temp)
    for image in detail:
        # Request each image's URL
        r = requests.get(url=image, headers=headers)
        # Name each image after the last part of its URL
        file_name = image.split('/')[-1]
        print('Downloading: ' + image)
        # Write the image file
        with open(temp + '/' + file_name, 'wb') as f:
            f.write(r.content)
def main():
    # Page through the search results
    for page in range(1, 2):
        url = 'https://www.yeitu.com/index.php?m=search&c=index&a=init&typeid=&siteid=1&q={}&page=%d'.format(temp) % page
        html = get_html(url)
        data01 = mid_paser_html(html)
        detail = f_paser_html(data01)
        save_images(detail)
if __name__ == '__main__':
    print('Enter the name of the person whose images you want to download:')
    temp = input()
    main()