Scrapy Crawler in Practice: Eye-Candy Images

Environment

Component  Version
OS         Win10
Python     3.7
PyCharm    2018.3

Installation

Upgrading pip

python -m pip install --upgrade pip

On Windows, the upgrade may fail with an error like:

AttributeError: 'NoneType' object has no attribute 'bytes'

Workaround:

easy_install -U pip

Installing Scrapy

pip install scrapy

This may fail with:

 error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Workaround:

From http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted download the Twisted wheel matching your Python version (for Python 3.7 on 64-bit Windows that is Twisted-18.9.0-cp37-cp37m-win_amd64.whl; "cp37" is the Python version and "amd64" means 64-bit), then run:

pip install Twisted-18.9.0-cp37-cp37m-win_amd64.whl
pip install scrapy

On the first run I hit a "no module named win32api" error. Python does not ship with a library for accessing the Windows system API, so a third-party package is required.

ModuleNotFoundError: No module named 'win32api'
# Workaround:
pip install pypiwin32
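With all of the above installed, a quick sanity check confirms the pieces are importable before moving on. This is a minimal sketch; the module names listed are the ones this walkthrough depends on (pypiwin32 provides the `win32api` module):

```python
import importlib.util

# Modules this walkthrough depends on; pypiwin32 installs win32api
required = ["scrapy", "twisted", "win32api", "requests"]

missing = [name for name in required if importlib.util.find_spec(name) is None]
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```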

The First Crawler

Creating and Importing the Project

Run the following commands at a CMD prompt.

Create the Scrapy project:

scrapy startproject ImageSpider

Enter the ImageSpider directory and create a basic spider class:

scrapy genspider mm131 mm131.com

D:\pycharm_workspace>scrapy startproject ImageSpider
New Scrapy project 'ImageSpider', using template directory 'd:\\python\\python37\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\pycharm_workspace\ImageSpider
    
D:\pycharm_workspace>cd ImageSpider

D:\pycharm_workspace\ImageSpider>scrapy genspider mm131 mm131.com
Created spider 'mm131' using template 'basic' in module:
  ImageSpider.spiders.mm131

Import the project into PyCharm; the project structure looks like this:

D:\PYCHARM_WORKSPACE\IMAGESPIDER
│  scrapy.cfg
│  start.py
│
└─ImageSpider
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  mm131.py
    │  __init__.py

Writing the Code

1. Write the items file, defining the fields to scrape:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ImagespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # URL of the image
    src = scrapy.Field()
    # alt attribute, used as the image name
    alt = scrapy.Field()
    referer = scrapy.Field()

2. Write the Spider file

The hard part of writing mm131.py is crafting the XPath expressions that parse the HTML. Use Chrome's "Inspect" tool to examine the page elements and build the expressions with XPath syntax; expect some trial and error.

# -*- coding: utf-8 -*-
import scrapy
from ImageSpider.items import ImagespiderItem

# Categories on the site; only "xinggan" is crawled below
key_names = ["xinggan", "qingchun", "xiaohua", "chemo", "qipao", "mingxing"]


class Mm131Spider(scrapy.Spider):
    """
    Only crawls the xinggan category.
    """
    name = 'mm131'
    allowed_domains = ['www.mm131.com']
    start_urls = ['http://www.mm131.com/xinggan/']

    def parse(self, response):
        selector = scrapy.Selector(response)

        next_folder_pages = selector.xpath("//dd[@class='page']/a[text()='下一页']/@href").extract()
        next_folder_pages_text = selector.xpath("//dd[@class='page']/a/text()").extract()

        # Follow the "next page" link of the album listing
        if '下一页' in next_folder_pages_text:
            next_url = "http://www.mm131.com/xinggan/%s" % next_folder_pages[0]
            yield scrapy.http.Request(next_url, callback=self.parse)

        # Follow the link to each album
        all_info = selector.xpath("//div[@class='main']/dl/dd/a[@target]")
        for info in all_info:
            link = info.xpath("@href").extract()[0]
            yield scrapy.http.Request(link, callback=self.parse_folder_page)
            # time.sleep(1)

    def parse_folder_page(self, response):
        selector = scrapy.Selector(response)

        yield self.build_item(response)

        next_pages_item = selector.xpath("//div[@class='content-page']/a[text()='下一页']/@href").extract()
        if next_pages_item:
            next_pages = next_pages_item[0]
            next_pages_text = selector.xpath("//div[@class='content-page']/a/text()").extract()
            if '下一页' in next_pages_text:
                next_url = "http://www.mm131.com/xinggan/%s" % next_pages
                request = scrapy.http.Request(next_url, callback=self.parse_folder_page)
                yield request

    def parse_item(self, response):
        yield self.build_item(response)

    def build_item(self, response):
        selector = scrapy.Selector(response)
        item = ImagespiderItem()
        image_title = selector.xpath('//h5/text()').extract()
        image_url = selector.xpath("//div[@class='content-pic']/a/@href").extract()
        # extract() returns [] when nothing matches, so direct assignment is safe
        # even on pages without an image
        image_src = selector.xpath("//div[@class='content-pic']/a/img/@src").extract()
        pic_name = selector.xpath("//div[@class='content-pic']/a/img/@alt").extract()

        item['title'] = image_title
        item['url'] = image_url
        item['src'] = image_src
        item['alt'] = pic_name
        return item
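A note on the hardcoded URL prefix: the spider rebuilds next-page links as "http://www.mm131.com/xinggan/%s", which only works inside that one category. Scrapy's `response.urljoin(href)` resolves a relative href against the page's own URL instead; it is a thin wrapper over the stdlib:

```python
from urllib.parse import urljoin

base = "http://www.mm131.com/xinggan/list_6_2.html"
print(urljoin(base, "list_6_3.html"))   # relative href resolves against the page
# → http://www.mm131.com/xinggan/list_6_3.html
print(urljoin(base, "/qingchun/"))      # root-relative hrefs are handled too
# → http://www.mm131.com/qingchun/
```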

3. Write the pipelines file

In pipelines.py, note that the requests must carry headers that mimic a browser; otherwise the image downloads fail and the site blacklists you.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import requests
import random
from ImageSpider.settings import IMAGES_STORE, USER_AGENTS


class ImagespiderPipeline(object):

    def process_item(self, item, spider):
        fold_name = "".join(item['title'])
        header = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                   'Accept-Encoding': 'gzip, deflate',
                   'Accept-Language': 'zh-CN,zh;q=0.8',
                   'Cache-Control': 'no-cache',
                   'Connection': 'keep-alive',
                   'Cookie': 'UM_distinctid=15fa02251e679e-05c01fdf7965e7-5848211c-144000-15fa02251e7800; bdshare_firstime=1510220189357; CNZZDATA1263415983=1653134122-1510216223-null%7C1510216223; CNZZDATA3866066=cnzz_eid%3D376479854-1494676185-%26ntime%3D1494676185; Hm_lvt_9a737a8572f89206db6e9c301695b55a=1510220189; Hm_lpvt_9a737a8572f89206db6e9c301695b55a=1510220990',
                   'Host': 'img1.mm131.me',
                   'Pragma': 'no-cache',
                   'Referer': 'http://www.mm131.com/xinggan/',
                   'User-Agent': random.choice(USER_AGENTS)}

        images = []
        # All images go into a single folder
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)

        for jpg_url, name in zip(item['src'], item['alt']):
            file_path = '{}/{}.jpg'.format(dir_path, name)
            images.append(file_path)
            if os.path.exists(file_path):
                continue

            print(jpg_url)
            req = requests.get(jpg_url, headers=header, timeout=10)
            with open(file_path, 'wb') as f:
                f.write(req.content)

        return item
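One pitfall with using the `alt` text as the filename on Windows: characters such as `:`, `?`, or `*` make `open()` fail. A small sanitizing helper (hypothetical, not part of the original pipeline) could clean `name` before the path is built:

```python
import re


def safe_filename(name, replacement="_"):
    """Replace characters Windows forbids in filenames; trim trailing dots/spaces."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name)
    return cleaned.rstrip(". ") or "unnamed"


print(safe_filename("photo: page 1?"))  # → photo_ page 1_
```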

4. Settings (the key entries)

Add the following:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
]
COOKIES_ENABLED = False
# Image storage path
IMAGES_STORE = 'G://ImageSpider'
# Download delay, so we don't get blacklisted
DOWNLOAD_DELAY = 0.1

ITEM_PIPELINES = {
    'ImageSpider.pipelines.ImagespiderPipeline': 300,
}

LOG_LEVEL = 'INFO'

5. Add a launcher script

Edit start.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl mm131".split())

Then create a Run Configuration in PyCharm for this script to launch the whole project.

Running the Program

Run start.py directly in PyCharm and the images you wanted come rolling in; take a look, then delete them promptly.

Disclaimer

This article is only crawling practice. Don't bring someone else's site down; that would not be decent. I take no responsibility for runs of this code that aren't mine.

Full code: GitHub

Problems Encountered While Crawling

Individual image downloads may time out, likely a network issue; ignore and skip them.

requests.exceptions.ReadTimeout: HTTPConnectionPool(host='img1.mm131.me', port=80): Read timed out. (read timeout=10)
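The "ignore and skip" behavior can be made explicit by wrapping the download. The sketch below uses an injectable `fetch` callable (standing in for `requests.get`) so the retry logic runs without a network:

```python
def fetch_with_retry(fetch, url, retries=2):
    """Try fetch(url) up to retries+1 times; return None if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # e.g. requests.exceptions.ReadTimeout
            print("attempt %d failed for %s: %s" % (attempt + 1, url, exc))
    return None


# Demo with a fake fetcher that fails once, then succeeds
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("read timed out")
    return b"image-bytes"

print(fetch_with_retry(flaky, "http://example/img.jpg"))
# → b'image-bytes'
```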

References

https://www.cnblogs.com/xinyangsdut/p/7628770.html

Anti-crawling

http://www.mamicode.com/info-detail-2155901.html

Reposted from blog.csdn.net/wangchunfa122/article/details/86482620