Beginner Python crawler, notes from the learning process: requests + XPath + os to extract pictures and save them locally (03, MM pictures)

As everyone knows, when learning Python, skipping the part where you crawl pictures means missing out on the technique; it is a step on the road to mastery. MM pictures, meme pictures, whatever they are, they are only for practicing the skill, and even after crawling them down we won't look at them. Right, definitely won't look.

Well, back to the point. The first time I crawled pictures I only grabbed the images on the home page and never went into the image detail pages, which was unsatisfying: just when the browsing gets good, the set cuts off. Right, I didn't actually look at them, so I don't even know what I crawled.
For the first crawl, see:
the first crawl

Crawling only the home page and never seeing the detail pages was too big a regret, so I improved the code: copy the address of a detail page and extract all the pictures on that detail page. That became the second crawl. For the second crawl, see:
the second crawl

But every time I had to open a detail page, copy its address, and paste it into PyCharm. Lazy as I am, I didn't want to type that much, although...
Fine, I'll just type them in slowly!!!
But there are just too many of them; who knows how long that would take. I had to think of something.
So I improved the code again: this time I took the URL produced by the search box, looked for the pattern, and found that only two parameters change. That makes things easy; just construct the URL and it's done.
Only these two parameters change (the search keyword and the page number), and on the first page page = 1, so this is no trouble at all.
URL construction done.
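For reference, this mirrors the URL construction in the complete code at the end of the post; only q (the search keyword) and page change between requests. The keyword here is made up for illustration:

# Only q (the keyword) and page change between search pages
keyword = 'example'
for page in range(1, 3):
    url = ('https://www.yeitu.com/index.php?m=search&c=index&a=init'
           '&typeid=&siteid=1&q={}&page={}'.format(keyword, page))
    print(url)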
The next step is to request and parse. The requesting and parsing can follow the first and second crawls. Everything went smoothly until I looked at the detail-page addresses and found that the first page of a picture set does not use the same URL pattern as the pages after it.
Address of the first page: just the detail-page URL itself, ending in something like _123.html.
Address of the second page: the same URL with the page number inserted before .html, something like _123_2.html.
And the page number sits inside the URL, so the first page has to be handled on its own, while the URLs of the later pages just need to be taken apart and reassembled. The code is as follows.
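The reassembly is a couple of string splits; this is a minimal runnable sketch of the same logic used in f_paser_html in the complete code below (the detail-page URL here is made up for illustration):

# Rebuild the page-2-and-later URLs from the first-page (detail-page) URL:
# .../example_123.html  ->  .../example_123_2.html, .../example_123_3.html, ...
images = 'https://www.yeitu.com/example_123.html'
for page in range(2, 4):
    urls = (images.split('_')[0] + '_'
            + images.split('_')[1].split('.')[0]
            + '_' + str(page) + '.html')
    print(urls)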
Search for a person, and you get the addresses of all of that person's pictures on this site.
But there is still a problem: you have to visit the site, find the name of the person you want to download, and then put that name into the constructed search URL, which is a hassle. So I thought of using input(): type the name directly in PyCharm and you never have to open the site at all.
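In the complete code this is just an input() call in the __main__ block; whatever name is typed becomes the q parameter of the search URL:

# Read the person's name from the console and use it as the search keyword
print('Please enter the name of the person whose pictures you want to download:')
temp = input()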
You're done!!!
First try...
The folder is created and the images are saved successfully.
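Folder creation and file writing are handled by save_images in the complete code below; a self-contained sketch of that part, with a made-up folder name and image URL, looks like this:

import os
import requests

temp = 'example'                           # folder named after the search keyword (made up here)
image = 'https://www.example.com/1.jpg'    # a made-up image URL

# Create the folder if it does not exist yet
if not os.path.exists(temp):
    os.mkdir(temp)

# Download the image and name the file after the last part of its URL
r = requests.get(image)
file_name = image.split('/')[-1]
with open(temp + '/' + file_name, 'wb') as f:
    f.write(r.content)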
Finally:
This post only does single-threaded crawling, so it is not suitable for crawling a relatively large number of pages, or it will run for a very long time.
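A speed-up is not part of the original code, but one common way to address this would be to download the images with a thread pool, for example via concurrent.futures; a rough, hypothetical sketch with made-up image URLs:

from concurrent.futures import ThreadPoolExecutor
import requests

headers = {'User-Agent': 'Mozilla/5.0'}    # simplified header for the sketch

def download(image):
    # Download one image and save it under its own file name
    r = requests.get(image, headers=headers)
    with open(image.split('/')[-1], 'wb') as f:
        f.write(r.content)

# 'detail' would be the list of image URLs produced by f_paser_html
detail = ['https://www.example.com/1.jpg', 'https://www.example.com/2.jpg']
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(download, detail)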
Here is a look at how many images were crawled:
There are many more that I won't post one by one. Also, take care of yourself...
The complete code:

'''
Use the requests library to request the target URL,
XPath to extract the image addresses from the pages,
and the os module to create a folder to store the images.
Function-oriented programming.
'''
# Import third-party libraries
import requests
from lxml import etree
import time
import os
# user-agent library
from fake_useragent import UserAgent
# Build a random User-Agent
ua = UserAgent()
headers = {'User-Agent': ua.random}

# Fetch the HTML of the search page
def get_html(url):
    time.sleep(1)
    # Using .text can come back garbled, so decode the raw bytes as utf-8 instead
    html = requests.get(url, headers=headers).content.decode('utf-8')
    return html
# Parse the intermediate (search result) page
def mid_paser_html(html):
    data01 = []
    e = etree.HTML(html)
    # Extract the URLs of the detail pages
    details_list = e.xpath('//div[@class="list_box_info"]/h5/a/@href')
    for details_page in details_list:
        data01.append(details_page)
    return data01
# Parse out the final image addresses
def f_paser_html(data01):
    details = {}
    detail = []
    for images in data01:
        html01 = requests.get(url=images, headers=headers).content.decode('utf-8')
        e = etree.HTML(html01)
        # Extract the total number of pages in this picture set
        nums = e.xpath('//div[@class="imageset"]/span[@class="imageset-sum"]/text()')
        for page in range(1, int(nums[0].split(' ')[1])):
            # The first page of each set uses a different URL from the later pages, so handle it separately.
            if page == 1:
                # The first page of a set is simply the detail-page URL
                html = requests.get(url=images, headers=headers).content.decode('utf-8')
                e = etree.HTML(html)
                # Extract the image addresses with XPath
                image = e.xpath('//div[@class="img_box"]/a/img/@src')
            else:
                # To request every page of the set, rebuild the URL from the first-page URL:
                # split it at '_', keep the first part, take the second part up to the '.',
                # then join them with '_' and append '_' + page + '.html'
                urls = str(images).split('_')[0] + '_' + str(images).split('_')[1].split('.')[0] + '_' + str(page) + '.html'
                # Request the constructed URL
                html = requests.get(url=urls, headers=headers).content.decode('utf-8')
                e = etree.HTML(html)
                # Extract the image addresses
                image = e.xpath('//div[@class="img_box"]/a/img/@src')
            # Put the addresses into the dictionary
            details['image'] = image
            # Loop over the dictionary entry and append every address to the list
            for det in details['image']:
                detail.append(det)
    return detail

def save_images(detail):
    # Create the folder
    if not os.path.exists(temp):
        os.mkdir(temp)
    for image in detail:
        # Request the URL of each image
        r = requests.get(url=image, headers=headers)
        # Name each image after the last part of its URL
        file_name = image.split('/')[-1]
        print('Downloading: ' + image)
        # Write the image file
        with open(temp + '/' + file_name, 'wb') as f:
            f.write(r.content)

def main():
    # Page turning
    for page in range(1, 2):
        url = 'https://www.yeitu.com/index.php?m=search&c=index&a=init&typeid=&siteid=1&q={}&page={}'.format(temp, page)
        html = get_html(url)
        data01 = mid_paser_html(html)
        detail = f_paser_html(data01)
        save_images(detail)


if __name__ == '__main__':
    print('Please enter the name of the person whose pictures you want to download:')
    temp = input()
    main()
