Write Python to crawl Gamersky wallpapers (4)

Recap of the previous article / contents of this article:

In the previous article, we successfully crawled the link for each wallpaper issue and saved the links locally in a txt file. Now it's time for the finishing touch: downloading all the wallpapers.


Page analysis:

First, let's open one issue and analyze its page.



Click on a picture and you find that it opens another page, still with a navigation bar at the top. On that page the picture can be clicked yet again, and only then do you get this result:


Why click twice? That shouldn't be necessary. Let's analyze the HTML source.

The href value of the a tag you reach first looks like this:



After clicking through, the a tag of the picture on the new page has an href value that looks like this:



The latter is the real URL of the picture.

Do we really have to click twice to get the image URL we want? Analyzing further reveals this pattern:

The href value of the initial a tag is made up of two parts:

The first part is a fixed gamersky URL.

The second part is the original URL of the image.

The two are joined with a [?] in between.



That is, as long as I extract the image URL from each page, I can download the picture.
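A minimal sketch of that extraction, using a made-up href that follows the pattern above (both URLs here are illustrative, not real):

```python
# A made-up href following the pattern described above:
# a fixed gamersky URL, then '?', then the original image URL.
href = 'http://www.gamersky.com/showimage/id_gamersky.shtml?http://img1.gamersky.com/image2018/02/a_1.jpg'

# Everything after the '?' is the real image URL
img_url = href.split('?')[-1]
print(img_url)
```

Splitting on `'?'` and taking the last piece is enough here because the image URL itself contains no question mark.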


So! Let's get started.


#!usr/bin/env python
# -*- coding:utf-8 -*-


"""

=======================================
=======================================
============Author:Task138=============
===========Power By Python2============
=======================================
=======================================

"""

# Create On 2018/2/15

import re
import os
import chardet
import requests
from scrapy.selector import Selector


class Get_IMG(object):
    def __init__(self):
        # Open the txt file saved by the previous crawl and read the url data
        with open('url_list.txt', 'r') as fp:
            self.url_list = fp.readlines()

        # Name of the folder that will hold all the wallpapers
        self.dirname = 'gamersky_wallpicture'

        # Create the folder if it does not exist yet
        if not os.path.exists(self.dirname):
            os.mkdir(self.dirname)
            print 'dir [ %s ] create success!!' % self.dirname

        # Iterate over the url data and run the main image-saving function
        for url in self.url_list:
            try:
                self.save_img(url.strip())
            except:
                with open('error.txt', 'a') as fp:
                    fp.write(url + '\r\n')
                    print 'url error [ %s ]' % url

    def save_img(self, url):
        # Request the url with the requests library
        response = requests.get(url)

        # Use chardet to detect the page's encoding and set it explicitly.
        # Gamersky has 15 years of history; both GBK and UTF-8 pages exist,
        # so this parameter cannot be hard-coded.
        response.encoding = chardet.detect(response.content)['encoding']
        content = response.text

        # Grab the title; it becomes the name of this issue's own subfolder
        title = Selector(text=content).css('h1::text').extract()
        if title:
            # Windows forbids these characters in folder names,
            # so they must be stripped or the folder cannot be created
            title = re.sub(r'[\\/:*?"<>|]', '', title[0])

        # Grab the href values of the wallpaper image a tags
        links = Selector(text=content).css('p a::attr(href)').extract()

        # Iterate over the href values
        for link in links:
            # Regex match: only keep urls ending in [.jpg]
            if re.search(r'\.jpg$', link):
                # Split on [?] and take the second half
                img_url = link.split('?')[-1]

                # Use the os module to extract the image filename from the url
                filename = os.path.basename(img_url)

                # Use os.path.join to build the folder path, because Linux
                # joins paths with a slash [/] while Windows uses a backslash [\]
                dir_path = os.path.join(self.dirname, title)

                # Create the folder named after the title if it does not exist
                if not os.path.exists(dir_path):
                    os.mkdir(dir_path)
                    print 'dir [ %s ] create success!!' % dir_path
                file_path = os.path.join(dir_path, filename)

                # Only download the image if it is not already on disk
                if not os.path.exists(file_path):
                    # Request the image url and take the content value (bytes),
                    # not the text value
                    img_content = requests.get(img_url).content

                    # Create an empty jpg file and write img_content into it
                    # in 'wb' mode
                    with open(file_path, 'wb') as fp:
                        fp.write(img_content)
                        print 'img [ %s ] save success!!' % file_path
                else:
                    print 'img [ %s ] already existed!!' % file_path

        # Check whether there is a next page; if so, keep downloading.
        # If not, print a custom marker so we know this issue is finished.
        page_links = Selector(text=content).css('div.page_css a::text').extract()
        if page_links[-1] == u'下一页':  # u'下一页' means "next page"
            next_page = Selector(text=content).css('div.page_css a::attr(href)').extract()[-1]
            self.save_img(next_page)
        else:
            print ''
            print '=' * 80
            print ''


if __name__ == '__main__':
    GI = Get_IMG()
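The title cleanup step in save_img can be checked on its own. A minimal sketch of stripping the characters Windows forbids in folder names (the function name and sample title are my own, made up for illustration):

```python
import re

def sanitize_title(title):
    # Windows forbids \ / : * ? " < > | in folder names,
    # so strip them before using the title as a directory name
    return re.sub(r'[\\/:*?"<>|]', '', title)

print(sanitize_title('Gamersky: HD wallpaper <issue 4>?'))
```

Putting all nine forbidden characters in one character class avoids the pitfalls of a long alternation.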


end:

That's it: you can now download the wallpapers. But believe me, it will still throw errors, because this is only a basic version of the code.
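One common source of those errors is flaky network requests. A minimal sketch of a retry wrapper with a timeout, which the script above does not include (the function name and its parameters are my own, not from the original code):

```python
import time
import requests

def get_with_retry(url, retries=3, timeout=10):
    # Retry a flaky request a few times before giving up; the timeout
    # stops one slow page from hanging the whole crawl.
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```

Replacing the bare `requests.get(url)` calls with this wrapper would make far fewer urls land in error.txt.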


Origin blog.csdn.net/cbcrzcbc/article/details/79329866