Writing a Crawler Project with the Scrapy Framework

Key points:

Installing the Scrapy module

Two ways to install modules.

The following two methods cover the vast majority of modules:

Network install: run pip install XX directly in the console.

Download install: network installation is convenient, but it fails from time to time. When that happens, go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml and download the .whl file that matches your Python version.

After downloading, use the console to move into the download folder and run pip install <wheel filename> to install it.
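For example, if Twisted (a Scrapy dependency that frequently fails to build on Windows) was downloaded as a wheel, the install could look like this; the path and filename are hypothetical, so substitute whatever you actually downloaded:

cd C:\Users\you\Downloads
pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl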

Item 6, the pywin32 configuration step:

1. Copy the two .dll files under F:\编程\python\Lib\site-packages\pywin32_system32

2. Paste them into C:\Windows\System32
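The same copy can be done from the console (run it as administrator, since System32 is protected); the source path is from this machine, so adjust it to your own Python installation:

copy "F:\编程\python\Lib\site-packages\pywin32_system32\*.dll" C:\Windows\System32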

Common Scrapy commands

Syntax

1. Create a project: scrapy startproject XX

2. List the available spider templates: scrapy genspider -l
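On a standard Scrapy installation this lists the four built-in templates, roughly:

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

The generated code for each template is shown below.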

basic:

# -*- coding: utf-8 -*-
import scrapy


class FstSpider(scrapy.Spider):
    name = 'fst'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/']

    def parse(self, response):
        pass
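The generated parse() is only a stub. A minimal sketch of filling it in, to drop into the FstSpider class above, might look like this; the XPath expression is an assumption for illustration, not taken from the actual aliwx.com.cn pages:

    def parse(self, response):
        # Yield one dict per link title found on the page (illustrative only).
        for title in response.xpath('//a/@title').extract():
            yield {'title': title}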
           

crawl:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
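The Rule in the generated template only follows links matching r'Items/', which will rarely match anything on a real site. A sketch of adapting it, with an assumed URL pattern and selector:

    rules = (
        # Follow links whose URL contains /book/ (hypothetical pattern).
        Rule(LinkExtractor(allow=r'/book/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Return one item per followed page; the selector is illustrative only.
        return {'title': response.xpath('//title/text()').extract_first()}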

csvfeed:

# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider


class ThirdSpider(CSVFeedSpider):
    name = 'third'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = {}
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i

xmlfeed:

# -*- coding: utf-8 -*-
from scrapy.spiders import XMLFeedSpider


class FourthSpider(XMLFeedSpider):
    name = 'fourth'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/feed.xml']
    iterator = 'iternodes' # you can change this; see the docs
    itertag = 'item' # change it accordingly

    def parse_node(self, response, selector):
        i = {}
        #i['url'] = selector.select('url').extract()
        #i['name'] = selector.select('name').extract()
        #i['description'] = selector.select('description').extract()
        return i

3. Create a spider inside the spiders folder: scrapy genspider -t basic/crawl/csvfeed/xmlfeed <spider name> <target domain> (see the example below)

4. Run a spider: scrapy crawl <spider name> (the name attribute defined in the spider, not the file name)
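Putting steps 3 and 4 together, the basic spider shown earlier could be generated and run like this (the -o flag is optional and dumps the scraped items to a file):

scrapy genspider -t basic fst aliwx.com.cn
scrapy crawl fst -o result.json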

Project structure:

items: declares the target fields you want to scrape

spiders: holds the individual spider files

middlewares: downloader/spider middleware, hooks for customizing how requests and responses are processed

pipelines: post-processing of scraped items, e.g. cleaning, deduplication, or storage (see the sketch below)
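As a minimal sketch of how items and pipelines fit together (the field names and class names are assumptions, not taken from the original project):

# items.py - declare the fields you intend to scrape
import scrapy

class NovelItem(scrapy.Item):
    name = scrapy.Field()
    author = scrapy.Field()

# pipelines.py - every item yielded by a spider passes through here
class CleanPipeline(object):
    def process_item(self, item, spider):
        # Strip stray whitespace from the name field before it goes further.
        if item.get('name'):
            item['name'] = item['name'].strip()
        return item

Note that a pipeline only runs if it is enabled in settings.py via the ITEM_PIPELINES setting.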

Scrapy crawler project basics

Storing the data in a database

Using pymysql

To prevent garbled characters, make sure the connection uses the utf8 charset. Rather than editing connections.py inside the pymysql package, the charset can be passed directly to pymysql.connect(), as in the sketch below.
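A minimal sketch of a pipeline that writes items to MySQL with pymysql; the connection parameters, table and column names are assumptions, so adjust them to your own database:

import pymysql

class MysqlPipeline(object):
    def open_spider(self, spider):
        # Setting charset on the connection avoids mojibake without touching
        # the pymysql source code.
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='123456', db='spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Hypothetical table and columns; adjust to your own schema.
        self.cursor.execute(
            'INSERT INTO novels (name, author) VALUES (%s, %s)',
            (item.get('name'), item.get('author'))
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()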

Project repository:

https://github.com/ljx4471817/Scrapy-respository

Scrapy development manual


Reposted from blog.csdn.net/LJXZDN/article/details/81272974