Case (2): Crawler Warm-up
Project 2: Scraping stock data in two different ways
Method 2: the Scrapy crawler framework
This case uses the Scrapy framework to scrape the relevant pages.
Install the Scrapy framework
Open cmd and run the following command to install it:
pip install scrapy
Verify that the installation succeeded:
scrapy -h
Create a new Scrapy project
Once Scrapy is installed, continue in cmd to create the project.
Change into the directory where you want the project to live, then run:
scrapy startproject baidustocks
When the command finishes, it generates a set of folders and .py files in that directory.
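For reference, the standard Scrapy project template lays out roughly the following structure (exact contents can vary slightly by Scrapy version):

```
baidustocks/
    scrapy.cfg            # deployment configuration
    baidustocks/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (edited later in this case)
        settings.py       # project settings (edited later in this case)
        spiders/          # spiders live here
            __init__.py
```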
Generate a spider inside the project
This takes a single cmd command; we only need to give the spider a name and the site to scrape.
cd baidustocks
scrapy genspider stocks hq.gucheng.com/gpdmylb.html
stocks is the spider's name
hq.gucheng.com/gpdmylb.html is the site to scrape
When it finishes, a file named stocks.py is generated under the spiders/ folder.
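The generated stocks.py is only a skeleton to fill in; it looks roughly like the following (a template fragment; the exact fields genspider emits depend on your Scrapy version):

```python
import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['hq.gucheng.com']
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        pass
```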
Configure the generated spider
Edit the spider file to match your own needs.
Here I take scraping stock data as the example:
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.selector import Selector


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        # Follow every link whose path contains a stock code like SH600000/
        for href in response.css('a::attr(href)').extract():
            stock = re.search(r'S[HZ]\d{6}/', href)
            if stock is None:
                continue
            url = 'https://hq.gucheng.com/' + stock.group()
            yield scrapy.Request(url, callback=self.parse_stock)

    def parse_stock(self, response):
        infoDict = {}
        # Re-wrap each HTML fragment in a Selector so it can be queried again
        stockInfo = Selector(text=response.css('.stock_top').extract()[0])
        stockprice = Selector(text=response.css('.s_price').extract()[0])
        stockname = Selector(text=response.css('.stock_title').extract()[0])
        infoDict['名字'] = re.search(r'>(.*?)</h1>', stockname.css('h1').extract()[0]).group(1)
        infoDict['编号'] = re.search(r'>(.*?)</h2>', stockname.css('h2').extract()[0]).group(1)
        infoDict['状态'] = re.search(r'>(.*?)</em>', stockname.css('em').extract()[0]).group(1)
        infoDict['时间'] = re.search(r'>(.*?)</time>', stockname.css('time').extract()[0]).group(1)
        price = stockprice.css('em').extract()
        infoDict['股价'] = re.search(r'>(.*?)</em>', price[0]).group(1)
        infoDict['涨跌额'] = re.search(r'>(.*?)</em>', price[1]).group(1)
        infoDict['涨跌幅'] = re.search(r'>(.*?)</em>', price[2]).group(1)
        # The remaining fields live in parallel <dt>/<dd> lists
        keylist = stockInfo.css('dt').extract()
        valuelist = stockInfo.css('dd').extract()
        for i in range(len(keylist)):
            key = re.search(r'>(.*?)<', keylist[i], flags=re.S).group(1).replace('\n', '')
            try:
                val = re.search(r'>(.*?)<', valuelist[i], flags=re.S).group(1).replace('\n', '')
            except (IndexError, AttributeError):
                val = '--'
            infoDict[key] = val
        yield infoDict
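The two regular expressions above do the heavy lifting, so here is a small standalone sketch of what they match (the sample hrefs and the HTML fragment are made up for illustration, not taken from the site):

```python
import re

# Link filtering: keep only hrefs that contain a stock code like SH600000/
hrefs = ['https://hq.gucheng.com/SH600000/', '/help/about', 'SZ000001/quote']
codes = [m.group() for m in (re.search(r'S[HZ]\d{6}/', h) for h in hrefs) if m]
print(codes)  # ['SH600000/', 'SZ000001/']

# Field extraction: grab the text between '>' and the closing tag
fragment = '<h1>浦发银行</h1>'
name = re.search(r'>(.*?)</h1>', fragment).group(1)
print(name)  # 浦发银行
```

The non-greedy `(.*?)` is what keeps each match from running past the first closing tag.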
Run the spider and collect the data
Run the following in cmd:
scrapy crawl stocks
Once the crawl finishes, Scrapy prints summary statistics for the run.
Write a pipeline to process the scraped items
Edit the pipelines.py file:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from itemadapter import ItemAdapter  # unused here; left over from the project template


class BaidustocksPipeline:
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline:
    def open_spider(self, spider):
        # Runs once when the spider opens: create the output file
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Runs once when the spider closes
        self.f.close()

    def process_item(self, item, spider):
        # Write each scraped item as one line of text
        try:
            self.f.write(str(dict(item)) + '\n')
        except OSError:
            pass
        return item
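Because the pipeline only uses plain file I/O, its logic can be smoke-tested outside Scrapy. A minimal sketch, assuming the item is an ordinary dict and passing spider=None since the argument is unused (the class here is a stand-in I wrote for the test, not the project file itself):

```python
# Stand-in mirroring BaidustocksInfoPipeline, exercised without Scrapy
class InfoPipeline:
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(str(dict(item)) + '\n')
        return item


pipe = InfoPipeline()
pipe.open_spider(None)
pipe.process_item({'名字': '浦发银行', '股价': '7.80'}, None)  # sample item
pipe.close_spider(None)

with open('BaiduStockInfo.txt', encoding='utf-8') as f:
    print(f.read())  # {'名字': '浦发银行', '股价': '7.80'}
```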
Configure the ITEM_PIPELINES setting
Edit the settings.py file.
Find the ITEM_PIPELINES setting in it (commented out by default) and register the pipeline class there.
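The registration typically looks like the following, assuming the project module is named baidustocks as created above (the number is the pipeline's priority; lower values run first):

```python
# settings.py — enable the pipeline that writes BaiduStockInfo.txt
ITEM_PIPELINES = {
    'baidustocks.pipelines.BaidustocksInfoPipeline': 300,
}
```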
Run the whole pipeline
In cmd:
scrapy crawl stocks
Then just wait for it to finish. Done!