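Python from Scratch: First Steps with Scrapy

I recently needed to scrape some things from the web, so I decided to use a crawler as an excuse to learn the powerful Python along the way. I am writing down the problems I ran into while learning so that I can look them up later.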

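1. Setting up the development environment (I am on Windows 10 x64)

Python has quite a few crawler frameworks; I went with Scrapy simply because it is the best known. Installation comes first. I started from zero, without even Python installed, so this write-up starts with installing Python too.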

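1.1 Installing Python

At first I installed Python 3.7, but installing Scrapy kept failing with the dependency error "Microsoft Visual C++ 14.0 is required". It was a real pain and I could not get it fixed no matter what I tried, until I noticed a note in the official Scrapy tutorial saying that only Python 2.7 was supported. WTF! That cost me a lot of time. Fine, 2.7 it is. I downloaded python-2.7.15.amd64.msi from the official Python website. I do not remember whether the installer added the environment variables automatically; if it did not, add them by hand. It is easy: append the following paths to PATH.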

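$(Python install path)

$(Python install path)\Scripts

My paths are

D:\softwares\Python27

D:\softwares\Python27\Scripts

Once the installation is done, press Win+R, run cmd, and type python to see whether it responds. If it does, Python is installed correctly.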
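If typing python does nothing, the usual culprit is a missing PATH entry. Here is a minimal sketch of how the two entries from section 1.1 could be checked from a script. The helper name missing_path_entries is my own invention, and it assumes the Windows convention of separating PATH entries with semicolons.

```python
import os


def missing_path_entries(install_dir, path_value, sep=';'):
    """Return which of the install dir and its Scripts dir are absent from PATH."""
    def normalize(entry):
        # Windows paths are case-insensitive and may carry a trailing slash
        return entry.strip().rstrip('\\/').lower()

    present = {normalize(e) for e in path_value.split(sep) if e.strip()}
    wanted = [install_dir, install_dir.rstrip('\\/') + '\\Scripts']
    return [w for w in wanted if normalize(w) not in present]


# Check the current environment against my install path (adjust to yours)
print(missing_path_entries(r'D:\softwares\Python27', os.environ.get('PATH', '')))
```

An empty list means both directories are already on PATH; anything it prints still has to be added by hand.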

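1.2 Installing a Python IDE: PyCharm

PyCharm seems to be widely used, so that is what I installed. It looks a lot like Visual Studio, as if it were built along the same lines. PyCharm comes in a Professional and a Community edition, and being broke I naturally downloaded the Community edition. Users in China apparently cannot open the website directly, but the download link itself seems to work, so, as with Python above, here is the download: pycharm2018.1.4.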

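1.3 Installing Scrapy

One nice thing about Python is its package manager, pip, which manages Python packages. The Scrapy package we want can be downloaded with it easily, with no need to hunt around the web. The Python 2.7.15 we installed earlier already ships with pip, so let's use pip to install Scrapy. In cmd, enter the following command: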

pip install scrapy

Then, more often than not, you will see the following error about a missing package:

building 'twisted.test.raiser' extension error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

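Do not panic. Just download the matching prebuilt package from this site, so that nothing has to be compiled on your own machine. Here the package that failed is Twisted, so based on my system and Python version I picked Twisted-18.7.0-cp27-cp27m-win_amd64.whl. After downloading it, install it from cmd with the following command: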

pip install d:\Twisted-18.7.0-cp27-cp27m-win_amd64.whl

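When that install finishes, install Scrapy again: run pip install scrapy once more and check for further dependency errors. If there are any, handle them the same way. At this point everything Scrapy needs is installed, and the next step is to actually scrape something with it.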
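Picking the right wheel in section 1.3 comes down to reading the tags in its file name: name, version, Python tag, ABI tag and platform tag, separated by hyphens. The sketch below splits the name and compares it with an interpreter. The helpers wheel_tags and matches_interpreter are hypothetical names of my own, and the check is deliberately simplified: it only covers CPython wheels for Windows without a build tag.

```python
import sys


def wheel_tags(filename):
    """Split a wheel file name into (name, version, python, abi, platform)."""
    stem = filename[:-len('.whl')] if filename.endswith('.whl') else filename
    parts = stem.split('-')
    if len(parts) != 5:
        # Wheels with an optional build tag have six parts; not handled here
        raise ValueError('unexpected wheel file name: %s' % filename)
    return tuple(parts)


def matches_interpreter(filename, version_info=sys.version_info,
                        is_64bit=sys.maxsize > 2 ** 32):
    """Simplified check: CPython version and Windows bitness must both match."""
    tags = wheel_tags(filename)
    expected_python = 'cp%d%d' % (version_info[0], version_info[1])
    expected_platform = 'win_amd64' if is_64bit else 'win32'
    return tags[2] == expected_python and tags[4] == expected_platform


print(wheel_tags('Twisted-18.7.0-cp27-cp27m-win_amd64.whl'))
```

So cp27 means CPython 2.7 and win_amd64 means 64-bit Windows, which is why this file fits my setup.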

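2. Scraping static images

Taobao product pages are ideal for this, because a product page contains both static and dynamic data, which makes it a good learning target. Let's start by scraping the static images on the page.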

2.1 Creating the Scrapy project

First, create an empty Scrapy project with the following command.

scrapy startproject taobao

Once the project has been generated, open it in PyCharm. First we edit items.py; the class in it temporarily holds the scraped information:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class taobaoItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    image_urls = scrapy.Field()

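Here we store the product's URL, its name, and the image URLs. Next we create a spider; let's call it taobaoSpider. A spider requests pages and extracts the addresses of the things to scrape. Put simply, it does the link-handling work.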

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest  # provided by the scrapy-splash package

from taobao.items import taobaoItem


class taobaoSpider(scrapy.Spider):
    # main.py starts the spider by this name, and the pipeline uses it
    # as the name of the download folder
    name = "taobaoSpider"
    allowed_domains = ["taobao.com"]
    start_urls = []

    def start_requests(self):
        input_url = 'https://item.taobao.com/item.htm?spm=a1z10.1-c.w4023-18381915794.4.44d14551es5Ex7&id=556114290901'
        self.start_urls.append(input_url)
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        # Load the page source into a Scrapy selector
        sel = Selector(response)
        for link in sel.xpath('//*[@id="J_isku"]/div/dl[1]/dd/ul/li/a'):
            # The thumbnail address is embedded in the link's inline style
            url = link.xpath('@style').extract()[0]
            # Cut the address out of the style text (the offsets are specific
            # to this page's markup) and ask for the 400x400 version instead
            image_url = "http://" + url[17:-28] + "400x400.jpg"
            image_urls = [image_url]
            name = link.xpath('span/text()').extract()
            item = taobaoItem()
            item['url'] = url
            item['name'] = name
            item['image_urls'] = image_urls
            yield item  # hand the item over to the item pipeline
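The slice url[17:-28] in parse only works if the style attribute has exactly the expected shape. Judging from the offsets, the text looks like background:url(//HOST/PATH.jpg_30x30.jpg) center no-repeat; but that is my inference, not something I have checked against other pages. Under that assumption, a regular expression is a less fragile way to pull the address out. The function name image_url_from_style and the example host below are made up for illustration.

```python
import re

# Matches url(//HOST/PATH_WxH.jpg) and captures HOST/PATH without the size suffix
_STYLE_URL = re.compile(r"url\((?:https?:)?//(\S+?)_\d+x\d+\.jpg\)")


def image_url_from_style(style, size="400x400"):
    """Return the image address at the requested size, or None if not found."""
    match = _STYLE_URL.search(style)
    if match is None:
        return None
    return "http://{}_{}.jpg".format(match.group(1), size)


# Hypothetical style text in the assumed format
example = "background:url(//img.example.com/i1/0/pic.jpg_30x30.jpg) center no-repeat;"
print(image_url_from_style(example))
```

For a style in the assumed format this gives the same address as the slice, and it returns None instead of a garbage URL when the markup differs.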

Next, edit settings.py. This is the configuration file, where various parameters are set:

# -*- coding: utf-8 -*-

# Scrapy settings for taobao project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'taobao'

SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'

ITEM_PIPELINES = {
    'taobao.pipelines.taobaoPipeline': 1,
}
# Directory where downloaded images are saved
IMAGES_STORE = 'd:/download'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

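Note: the generated settings.py sets ROBOTSTXT_OBEY to True, which makes Scrapy obey the robots.txt crawling rules. With it set to True, Taobao's data cannot be scraped, because Scrapy refuses to fetch the page. So it needs to be changed to False: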

ROBOTSTXT_OBEY = False
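To see what obeying robots.txt means in practice, here is a small sketch using the standard library's parser. Scrapy has its own robots.txt handling, so this only illustrates the rule check, not Scrapy's actual code path. The robots.txt text, the user agent string and the URLs are all hypothetical and not copied from any real site.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2.7

# Hypothetical rules: every crawler is barred from /item.htm
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /item.htm
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(EXAMPLE_ROBOTS_TXT, "my-crawler",
                 "https://item.example.com/item.htm?id=1"))
```

A crawler that obeys these rules would skip the product page entirely, which is the behaviour that setting ROBOTSTXT_OBEY to False switches off.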

Finally, set up pipelines.py, which persists the scraped data, that is, saves it to disk or to a database:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests
from taobao import settings
import os

class taobaoPipeline(object):
    def process_item(self, item, spider):
        if 'image_urls' in item:  # only handle items that carry image URLs
            images = []  # local paths of the saved images

            dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)

            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            for image_url in item['image_urls']:
                us = image_url.split('/')[-1:]
                image_file_name = '_'.join(us)
                file_path = '%s/%s' % (dir_path, image_file_name)
                images.append(file_path)
                if os.path.exists(file_path):
                    continue

                with open(file_path, 'wb') as handle:
                    headers = {
                        'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
                        'cookie': "user_trace_token=20170502200739-07d687303c1e44fa9c7f0259097266d6;"
                    }
                    response = requests.get(image_url, stream=True, headers=headers)
                    for block in response.iter_content(1024):
                        if not block:
                            break
                        handle.write(block)
        return item
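One weakness of the pipeline above is that it opens the target file before the download starts, so a failed request leaves an empty file behind that later runs will skip as already downloaded. The sketch below shows one possible way around that: build the local path the same way as the pipeline, write to a temporary .part file, and rename it only after all blocks are written. The helper names local_image_path and save_blocks are my own, and the byte blocks stand in for response.iter_content(1024).

```python
import os
import tempfile


def local_image_path(store_dir, spider_name, image_url):
    """Build the target path the same way the pipeline does."""
    file_name = image_url.split('/')[-1]
    return '%s/%s/%s' % (store_dir, spider_name, file_name)


def save_blocks(blocks, file_path):
    """Write byte blocks to a .part file, then rename it to file_path."""
    tmp_path = file_path + '.part'
    with open(tmp_path, 'wb') as handle:
        for block in blocks:
            if block:  # skip empty keep-alive chunks
                handle.write(block)
    # The pipeline skips files that already exist, so the target is free here
    os.rename(tmp_path, file_path)


# Small demo in a throwaway directory, with fake data instead of a download
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'pic.jpg')
save_blocks([b'ab', b'', b'cd'], demo_path)
print(local_image_path('d:/download', 'taobaoSpider',
                       'http://img.example.com/i1/pic.jpg_400x400.jpg'))
```

In the pipeline, the request would be made first and save_blocks(response.iter_content(1024), file_path) called only once the response has arrived.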

Finally, create a main.py file in the taobao directory to launch the crawl:

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'taobaoSpider' must match the name attribute of the spider class.
process.crawl("taobaoSpider")
process.start()  # the script will block here until the crawling is finished

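The project now consists of the generated files plus the edited items.py, settings.py and pipelines.py, the new spider, and main.py.

In PyCharm, click Edit Configurations in the top-right corner, select main.py as the script, and run the program. The scraped images are then saved to D:\download\taobaoSpider on disk.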