Getting Started with Scrapy (2): Saving Data

Crawling stock names, earnings per share (EPS), and net profit.
The complete project has been uploaded to GitHub:
https://github.com/yinhaox/02_scrapy

Crawling the Data

You can either continue from the previous project or create a new one.
(Figures omitted: Figure 1 was a screenshot of the target page; Figure 2 showed how to copy an element's XPath from the browser's developer tools.)

Open gucheng.py and write the spider.
(Note: you can debug your selectors interactively with $ scrapy shell <url>.)

# ./Stock/spiders/gucheng.py

import scrapy

class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']  # where the crawl starts

    def parse(self, response):
        # collect the URLs of every stock's detail page
        urls = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/@href').getall()

        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)  # crawl each detail page

    def parse_item(self, response):
        select = response.xpath('//*[@id="hq_wrap"]/div[1]/section[8]')
        tbody  = select.xpath('./div/table/tbody')
        name   = select.xpath('./h3/text()').get()             # stock name
        EPS    = tbody.xpath('./tr[2]/td[3]/div/text()').get() # earnings per share
        NOPAT  = tbody.xpath('./tr[2]/td[6]/div/text()').get() # net profit
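The heart of parse is "grab every a element under the section and keep its href". As a rough stdlib illustration of that idea on a toy snippet (the real page is full HTML and ElementTree only supports a small XPath subset, so this only shows the shape of the extraction, not the real selectors):

```python
import xml.etree.ElementTree as ET

# a toy stand-in for the stock-list page (made-up URLs, not the real markup)
snippet = """
<div id="stock_index_right">
  <section>
    <a href="https://hq.gucheng.com/SZ000001/">PingAn Bank</a>
    <a href="https://hq.gucheng.com/SZ000002/">Vanke A</a>
  </section>
</div>
"""

root = ET.fromstring(snippet)
# equivalent in spirit to response.xpath('.../section/a/@href').getall()
urls = [a.get('href') for a in root.findall('.//section/a')]
print(urls)
```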

Saving the Data

1. Define the item model

To save the data we scrape, we need an item pipeline (Pipeline),

so we first go to items.py and define the data model.

# ./Stock/items.py

import scrapy

class StockItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    EPS = scrapy.Field()
    NOPAT = scrapy.Field()
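A scrapy.Item behaves like a dict whose keys are restricted to the declared Fields: you read and write item['name'] like a normal dict entry, but assigning a key you didn't declare raises KeyError. A rough pure-Python analogue of that behaviour (not Scrapy's actual implementation):

```python
# a minimal stand-in for scrapy.Item: dict-like, but only declared fields allowed
class ToyItem(dict):
    fields = ('name', 'EPS', 'NOPAT')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = ToyItem()
item['name'] = 'PingAn Bank'
item['EPS'] = '1.14'
print(dict(item))

try:
    item['price'] = '10.0'   # not declared, so it is rejected
except KeyError as e:
    print('rejected:', e)
```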

2. Update the settings

Next, enable the item pipeline in settings.py.

(Note: Field objects are used to specify metadata for each field; we don't need any metadata here.)

# ./Stock/settings.py

ITEM_PIPELINES = {
   'Stock.pipelines.StockPipeline': 300,
}

As shown above, this code already exists in settings.py but is commented out; just uncomment it.

(Note: the number 300 is this pipeline's execution priority; you can ignore it for now.)
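If you enable several pipelines, Scrapy runs each item through them in ascending order of that number (lower runs first; the conventional range is 0 to 1000). A simplified pure-Python sketch of that ordering, with two made-up pipelines (not Scrapy's real dispatch code):

```python
# toy pipelines: each receives the item in turn and may transform it
class StripPipeline:
    def process_item(self, item, spider):
        item['name'] = item['name'].strip()
        return item

class TagPipeline:
    def process_item(self, item, spider):
        item['tagged'] = True
        return item

# mimic ITEM_PIPELINES: {pipeline: priority}; lower priority runs first
item_pipelines = {StripPipeline(): 300, TagPipeline(): 500}
ordered = [p for p, prio in sorted(item_pipelines.items(), key=lambda kv: kv[1])]

item = {'name': '  PingAn Bank  '}
for pipeline in ordered:
    item = pipeline.process_item(item, spider=None)

print(item)
```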

3. Use the item

Before the pipeline can do anything, the spider has to hand it the scraped data via the item.

# ./Stock/spiders/gucheng.py

import scrapy
# import the item model
from Stock.items import StockItem

class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        urls = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/@href').getall()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        item = StockItem()  # instantiate the item
        select = response.xpath('//*[@id="hq_wrap"]/div[1]/section[8]')
        tbody = select.xpath('./div/table/tbody')
        item['name'] = select.xpath('./h3/text()').get()
        item['EPS'] = tbody.xpath('./tr[2]/td[3]/div/text()').get()
        item['NOPAT'] = tbody.xpath('./tr[2]/td[6]/div/text()').get()
        return item  # returning the item sends it into the pipeline
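Returning a single item per page is enough here, but a callback may also yield several items, and Scrapy simply iterates over whatever the callback produces. In plain Python terms, with toy row data:

```python
def parse_rows(rows):
    # a callback-style generator: one item per table row (made-up data, not the real page)
    for name, eps in rows:
        yield {'name': name, 'EPS': eps}

items = list(parse_rows([('PingAn Bank', '1.14'), ('Vanke A', '2.01')]))
print(items)
```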

4. Save the data with the pipeline

Open pipelines.py.

# ./Stock/pipelines.py

class StockPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: open the output file and write the header
        self.f = open('./data.csv', 'w', encoding='utf-8')
        self.f.write('name,EPS,net profit\n')

    def process_item(self, item, spider):
        # called once for every item the spider returns
        self.f.write(
            '{},{},{}\n'.format(
                item['name'][:-4],  # drop the last 4 characters of the scraped name
                item['EPS'],
                item['NOPAT']
            )
        )
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()
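One caveat with hand-formatted CSV: if a scraped name ever contains a comma, the row breaks. The stdlib csv module quotes such fields automatically; here is a sketch of the same process_item logic with csv.writer and made-up item values (the open_spider/close_spider hooks would stay the same):

```python
import csv
import io

# stand-ins for a few scraped items (hypothetical values)
items = [
    {'name': 'PingAn Bank(000001)', 'EPS': '1.14', 'NOPAT': '204.56'},
    {'name': 'Fangda, Group(000055)', 'EPS': '1.91', 'NOPAT': '22.46'},  # comma in name
]

buf = io.StringIO()  # in a real pipeline this would be the file opened in open_spider
writer = csv.writer(buf)
writer.writerow(['name', 'EPS', 'net profit'])
for item in items:
    writer.writerow([item['name'], item['EPS'], item['NOPAT']])

print(buf.getvalue())
```

The comma-containing name comes out wrapped in quotes, so the column count stays correct.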

Now run the spider and you'll see the results:
$ scrapy crawl gucheng
The output:

name EPS net profit
平安银行(000001) 1.14 204.56
*ST康达(000048) 0.23 0.89
方大集团(000055) 1.91 22.46
深天马A(000050) 0.62 12.75
皇庭国际(000056) 0.13 1.52
深纺织A(000045) 0.02 0.12
德赛电池(000049) 1.34 2.75
泛海控股(000046) 0.35 18.17
中国天楹(000035) 0.12 1.67
中航善达(000043) 1.25 8.31

Reposted from blog.csdn.net/weixin_40522523/article/details/87871146