Case (2): Crawler Warm-up
Project 2: Scraping stock data in two different ways
Method 2: the Scrapy crawler framework
This case uses the Scrapy framework to scrape the relevant pages.
Install the Scrapy framework
Open cmd and run the following command to install it:
pip install scrapy
Verify that the installation succeeded:
scrapy -h
Create a new Scrapy project
Once Scrapy is installed, continue in cmd to create the project.
Change into the directory where you want the project to live, then run:
scrapy startproject baidustocks
When the command finishes, it generates a set of folders and .py files in that directory.
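For reference, the standard Scrapy project template lays out roughly the following structure (exact contents can vary slightly by Scrapy version):

```
baidustocks/
    scrapy.cfg            # deployment configuration
    baidustocks/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (edited later in this case)
        settings.py       # project settings (edited later in this case)
        spiders/          # spiders live here
            __init__.py
```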
Generate a spider inside the project
This takes a single cmd command; we only need to give the spider a name and the site to scrape.
cd baidustocks
scrapy genspider stocks hq.gucheng.com/gpdmylb.html
stocks is the spider's name
hq.gucheng.com/gpdmylb.html is the site to scrape
When it finishes, a file named stocks.py is generated under the spiders/ folder.
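The generated stocks.py is only a skeleton to fill in; it looks roughly like the following (a template fragment; the exact fields genspider emits depend on your Scrapy version):

```python
import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['hq.gucheng.com']
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        pass
```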
Configure the generated spider
Edit the spider file to match your own needs.
Here I take scraping stock data as the example:
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.selector import Selector


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        # Follow every link whose path contains a stock code like SH600000/
        for href in response.css('a::attr(href)').extract():
            stock = re.search(r'S[HZ]\d{6}/', href)
            if stock is None:
                continue
            url = 'https://hq.gucheng.com/' + stock.group()
            yield scrapy.Request(url, callback=self.parse_stock)

    def parse_stock(self, response):
        infoDict = {}
        # Re-wrap each HTML fragment in a Selector so it can be queried again
        stockInfo = Selector(text=response.css('.stock_top').extract()[0])
        stockprice = Selector(text=response.css('.s_price').extract()[0])
        stockname = Selector(text=response.css('.stock_title').extract()[0])
        infoDict['名字'] = re.search(r'>(.*?)</h1>', stockname.css('h1').extract()[0]).group(1)
        infoDict['编号'] = re.search(r'>(.*?)</h2>', stockname.css('h2').extract()[0]).group(1)
        infoDict['状态'] = re.search(r'>(.*?)</em>', stockname.css('em').extract()[0]).group(1)
        infoDict['时间'] = re.search(r'>(.*?)</time>', stockname.css('time').extract()[0]).group(1)
        price = stockprice.css('em').extract()
        infoDict['股价'] = re.search(r'>(.*?)</em>', price[0]).group(1)
        infoDict['涨跌额'] = re.search(r'>(.*?)</em>', price[1]).group(1)
        infoDict['涨跌幅'] = re.search(r'>(.*?)</em>', price[2]).group(1)
        # The remaining fields live in parallel <dt>/<dd> lists
        keylist = stockInfo.css('dt').extract()
        valuelist = stockInfo.css('dd').extract()
        for i in range(len(keylist)):
            key = re.search(r'>(.*?)<', keylist[i], flags=re.S).group(1).replace('\n', '')
            try:
                val = re.search(r'>(.*?)<', valuelist[i], flags=re.S).group(1).replace('\n', '')
            except (IndexError, AttributeError):
                val = '--'
            infoDict[key] = val
        yield infoDict
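The two regular expressions above do the heavy lifting, so here is a small standalone sketch of what they match (the sample hrefs and the HTML fragment are made up for illustration, not taken from the site):

```python
import re

# Link filtering: keep only hrefs that contain a stock code like SH600000/
hrefs = ['https://hq.gucheng.com/SH600000/', '/help/about', 'SZ000001/quote']
codes = [m.group() for m in (re.search(r'S[HZ]\d{6}/', h) for h in hrefs) if m]
print(codes)  # ['SH600000/', 'SZ000001/']

# Field extraction: grab the text between '>' and the closing tag
fragment = '<h1>浦发银行</h1>'
name = re.search(r'>(.*?)</h1>', fragment).group(1)
print(name)  # 浦发银行
```

The non-greedy `(.*?)` is what keeps each match from running past the first closing tag.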
Run the spider and collect the data
Run the following in cmd:
scrapy crawl stocks
Once the crawl finishes, Scrapy prints summary statistics for the run.
Write a pipeline to process the scraped items
Edit the pipelines.py file:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from itemadapter import ItemAdapter  # unused here; left over from the project template


class BaidustocksPipeline:
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline:
    def open_spider(self, spider):
        # Runs once when the spider opens: create the output file
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Runs once when the spider closes
        self.f.close()

    def process_item(self, item, spider):
        # Write each scraped item as one line of text
        try:
            self.f.write(str(dict(item)) + '\n')
        except OSError:
            pass
        return item
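Because the pipeline only uses plain file I/O, its logic can be smoke-tested outside Scrapy. A minimal sketch, assuming the item is an ordinary dict and passing spider=None since the argument is unused (the class here is a stand-in I wrote for the test, not the project file itself):

```python
# Stand-in mirroring BaidustocksInfoPipeline, exercised without Scrapy
class InfoPipeline:
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(str(dict(item)) + '\n')
        return item


pipe = InfoPipeline()
pipe.open_spider(None)
pipe.process_item({'名字': '浦发银行', '股价': '7.80'}, None)  # sample item
pipe.close_spider(None)

with open('BaiduStockInfo.txt', encoding='utf-8') as f:
    print(f.read())  # {'名字': '浦发银行', '股价': '7.80'}
```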
Configure the ITEM_PIPELINES setting
Edit the settings.py file.
Find the ITEM_PIPELINES setting in it (commented out by default) and register the pipeline class there.
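The registration typically looks like the following, assuming the project module is named baidustocks as created above (the number is the pipeline's priority; lower values run first):

```python
# settings.py — enable the pipeline that writes BaiduStockInfo.txt
ITEM_PIPELINES = {
    'baidustocks.pipelines.BaidustocksInfoPipeline': 300,
}
```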
Run the whole pipeline
In cmd:
scrapy crawl stocks
Then just wait for it to finish. Done!