Tutorial Overview

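This tutorial gives a brief overview of the most basic Scrapy operations; later installments will teach other aspects of Scrapy through hands-on examples.
The complete project has been uploaded to GitHub.
Link: https://github.com/yinhaox/01_scrapy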

Installing Scrapy

$ pip install scrapy

Creating a project

$ scrapy startproject <project_name>

For example: $ scrapy startproject Stock

When the command finishes, Scrapy prints a hint:

You can start your first spider with:
    cd Stock
    scrapy genspider example example.com

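Then cd into the project folder: $ cd Stock

Creating a spider

$ scrapy genspider <spider_name> <domain>

I want to scrape the names and codes of all listed stocks from the following page:

>> Gucheng (股城网), stock code list: https://hq.gucheng.com/gpdmylb.html

For example: $ scrapy genspider gucheng "hq.gucheng.com"

P.S. The domain you pass here is only used to fill in allowed_domains and start_urls in the generated spider file, so it does not need to be exact; you can edit both later.

The directory structure now looks like this: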

Stock (project root)
│  scrapy.cfg
│
└─Stock
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  gucheng.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

Writing the spider

Open gucheng.py in the spiders folder with an editor.

First, paste in the URL:

# gucheng.py
import scrapy

class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
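    # allowed_domains = ['hq.gucheng.com']  # optional here; it only limits which domains followed links may reach
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']  # start URL for the crawl

    def parse(self, response):  # default callback for responses to start_urls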
        pass

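Next we extract data from the page. This is usually done by parsing the page with CSS selectors or XPath selectors.

(XPath is more expressive than CSS. Scrapy translates CSS selectors into XPath internally, so any speed difference between the two is usually small.)

The code is as follows: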

# gucheng.py
import scrapy

class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']  # start URL for the crawl

    def parse(self, response):
        # XPath selector
        stocks = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/text()')
        # CSS selector; keep only one of the two (as written, this assignment overrides the XPath result)
        stocks = response.css('#stock_index_right > div.stock_sub > section > a::text')
        for val in stocks:
            print(val.get())

P.S. Selectors themselves are not covered in this article; other write-ups are a good place to learn them.

Running the spider

From the project root directory, run:

$ scrapy crawl <spider_name>

For example: $ scrapy crawl gucheng

The results are then printed to the console.
Part of the output:

中国中期(000996)
新 大 陆(000997)
隆平高科(000998)
华润三九(000999)
宗申动力(001696)
豫能控股(001896)
招商公路(001965)
招商蛇口(001979)
新 和 成(002001)
鸿达兴业(002002)
伟星股份(002003)
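Each output line combines the stock name and the stock code in the form name(code). A minimal sketch of how the two parts could be separated is shown below. The helper name split_stock is hypothetical, and the pattern assumes a six-digit code in ASCII parentheses at the end of the text, as in the sample output above.

```python
import re

# Assumed format: "<name>(<six-digit code>)", as seen in the sample output.
STOCK_PATTERN = re.compile(r"^(?P<name>.+)\((?P<code>\d{6})\)$")


def split_stock(text):
    """Return {'name': ..., 'code': ...}, or None if the text does not match."""
    match = STOCK_PATTERN.match(text.strip())
    if match is None:
        return None
    return {"name": match.group("name").strip(), "code": match.group("code")}


print(split_stock("招商公路(001965)"))
print(split_stock("not a stock entry"))
```

Inside parse, the dictionary could be yielded instead of printed whenever it is not None. Scrapy then treats it as a scraped item, and a command such as scrapy crawl gucheng -o stocks.json would write the items to a file.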

Scrapy Beginner Tutorial (1): Your First Project