Getting Started with Scrapy (2): Saving Data

Crawling stock names, earnings per share (EPS), and net profit.
The complete project has been uploaded to GitHub:
https://github.com/yinhaox/02_scrapy

Crawling the Data

You can either continue from the previous project or create a new one.
(Figures omitted: Figure 1 was a screenshot of the target page; Figure 2 showed how to copy an element's XPath from the browser's developer tools.)

Open gucheng.py and write the spider.
(Note: you can debug your selectors interactively with $ scrapy shell <url>.)

# ./Stock/spiders/gucheng.py

import scrapy

class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']  # where the crawl starts

    def parse(self, response):
        # collect the URLs of every stock's detail page
        urls = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/@href').getall()

        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)  # crawl each detail page

    def parse_item(self, response):
        select = response.xpath('//*[@id="hq_wrap"]/div[1]/section[8]')
        tbody  = select.xpath('./div/table/tbody')
        name   = select.xpath('./h3/text()').get()             # stock name
        EPS    = tbody.xpath('./tr[2]/td[3]/div/text()').get() # earnings per share
        NOPAT  = tbody.xpath('./tr[2]/td[6]/div/text()').get() # net profit
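The heart of parse is "grab every a element under the section and keep its href". As a rough stdlib illustration of that idea on a toy snippet (the real page is full HTML and ElementTree only supports a small XPath subset, so this only shows the shape of the extraction, not the real selectors):

```python
import xml.etree.ElementTree as ET

# a toy stand-in for the stock-list page (made-up URLs, not the real markup)
snippet = """
<div id="stock_index_right">
  <section>
    <a href="https://hq.gucheng.com/SZ000001/">PingAn Bank</a>
    <a href="https://hq.gucheng.com/SZ000002/">Vanke A</a>
  </section>
</div>
"""

root = ET.fromstring(snippet)
# equivalent in spirit to response.xpath('.../section/a/@href').getall()
urls = [a.get('href') for a in root.findall('.//section/a')]
print(urls)
```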

Saving the Data

1. Define the item model

To save the data we scrape, we need an item pipeline (Pipeline),

so we first go to items.py and define the data model.

# ./Stock/items.py

import scrapy

class StockItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    EPS = scrapy.Field()
    NOPAT = scrapy.Field()
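A scrapy.Item behaves like a dict whose keys are restricted to the declared Fields: you read and write item['name'] like a normal dict entry, but assigning a key you didn't declare raises KeyError. A rough pure-Python analogue of that behaviour (not Scrapy's actual implementation):

```python
# a minimal stand-in for scrapy.Item: dict-like, but only declared fields allowed
class ToyItem(dict):
    fields = ('name', 'EPS', 'NOPAT')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = ToyItem()
item['name'] = 'PingAn Bank'
item['EPS'] = '1.14'
print(dict(item))

try:
    item['price'] = '10.0'   # not declared, so it is rejected
except KeyError as e:
    print('rejected:', e)
```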

2. Update the settings

Next, enable the item pipeline in settings.py.

(Note: Field objects are used to specify metadata for each field; we don't need any metadata here.)

# ./Stock/settings.py

ITEM_PIPELINES = {
   'Stock.pipelines.StockPipeline': 300,
}

As shown above, this code already exists in settings.py but is commented out; just uncomment it.

(Note: the number 300 is this pipeline's execution priority; you can ignore it for now.)
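If you enable several pipelines, Scrapy runs each item through them in ascending order of that number (lower runs first; the conventional range is 0 to 1000). A simplified pure-Python sketch of that ordering, with two made-up pipelines (not Scrapy's real dispatch code):

```python
# toy pipelines: each receives the item in turn and may transform it
class StripPipeline:
    def process_item(self, item, spider):
        item['name'] = item['name'].strip()
        return item

class TagPipeline:
    def process_item(self, item, spider):
        item['tagged'] = True
        return item

# mimic ITEM_PIPELINES: {pipeline: priority}; lower priority runs first
item_pipelines = {StripPipeline(): 300, TagPipeline(): 500}
ordered = [p for p, prio in sorted(item_pipelines.items(), key=lambda kv: kv[1])]

item = {'name': '  PingAn Bank  '}
for pipeline in ordered:
    item = pipeline.process_item(item, spider=None)

print(item)
```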

3. Use the item

Before the pipeline can do anything, the spider has to hand it the scraped data via the item.

# ./Stock/spiders/gucheng.py

import scrapy
# import the item model
from Stock.items import StockItem

class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        urls = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/@href').getall()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        item = StockItem()  # instantiate the item
        select = response.xpath('//*[@id="hq_wrap"]/div[1]/section[8]')
        tbody = select.xpath('./div/table/tbody')
        item['name'] = select.xpath('./h3/text()').get()
        item['EPS'] = tbody.xpath('./tr[2]/td[3]/div/text()').get()
        item['NOPAT'] = tbody.xpath('./tr[2]/td[6]/div/text()').get()
        return item  # returning the item sends it into the pipeline
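Returning a single item per page is enough here, but a callback may also yield several items, and Scrapy simply iterates over whatever the callback produces. In plain Python terms, with toy row data:

```python
def parse_rows(rows):
    # a callback-style generator: one item per table row (made-up data, not the real page)
    for name, eps in rows:
        yield {'name': name, 'EPS': eps}

items = list(parse_rows([('PingAn Bank', '1.14'), ('Vanke A', '2.01')]))
print(items)
```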

4. Save the data with the pipeline

Open pipelines.py.

# ./Stock/pipelines.py

class StockPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: open the output file and write the header
        self.f = open('./data.csv', 'w', encoding='utf-8')
        self.f.write('name,EPS,net profit\n')

    def process_item(self, item, spider):
        # called once for every item the spider returns
        self.f.write(
            '{},{},{}\n'.format(
                item['name'][:-4],  # drop the last 4 characters of the scraped name
                item['EPS'],
                item['NOPAT']
            )
        )
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()
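One caveat with hand-formatted CSV: if a scraped name ever contains a comma, the row breaks. The stdlib csv module quotes such fields automatically; here is a sketch of the same process_item logic with csv.writer and made-up item values (the open_spider/close_spider hooks would stay the same):

```python
import csv
import io

# stand-ins for a few scraped items (hypothetical values)
items = [
    {'name': 'PingAn Bank(000001)', 'EPS': '1.14', 'NOPAT': '204.56'},
    {'name': 'Fangda, Group(000055)', 'EPS': '1.91', 'NOPAT': '22.46'},  # comma in name
]

buf = io.StringIO()  # in a real pipeline this would be the file opened in open_spider
writer = csv.writer(buf)
writer.writerow(['name', 'EPS', 'net profit'])
for item in items:
    writer.writerow([item['name'], item['EPS'], item['NOPAT']])

print(buf.getvalue())
```

The comma-containing name comes out wrapped in quotes, so the column count stays correct.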

Now run the spider and you'll see the results:
$ scrapy crawl gucheng
The output:

name EPS net profit
平安银行(000001) 1.14 204.56
*ST康达(000048) 0.23 0.89
方大集团(000055) 1.91 22.46
深天马A(000050) 0.62 12.75
皇庭国际(000056) 0.13 1.52
深纺织A(000045) 0.02 0.12
德赛电池(000049) 1.34 2.75
泛海控股(000046) 0.35 18.17
中国天楹(000035) 0.12 1.67
中航善达(000043) 1.25 8.31

Reposted from blog.csdn.net/weixin_40522523/article/details/87871146