Scraping stock names, earnings per share, and net profit
The complete project has been uploaded to GitHub:
https://github.com/yinhaox/02_scrapy
Data scraping
You can build on the previous project or start a new one.
[Figure 1: screenshot of the target page; Figure 2: obtaining the XPath from the browser's developer tools]
Open gucheng.py and write the spider.
(Tip: you can debug selectors interactively in the console with $ scrapy shell <url>.)
```python
# ./Stock/spiders/gucheng.py
import scrapy


class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']  # spider start page

    def parse(self, response):
        # Collect the URLs of every stock's detail page
        urls = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/@href').getall()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)  # crawl each detail page

    def parse_item(self, response):
        select = response.xpath('//*[@id="hq_wrap"]/div[1]/section[8]')
        tbody = select.xpath('./div/table/tbody')
        name = select.xpath('./h3/text()').get()               # stock name
        EPS = tbody.xpath('./tr[2]/td[3]/div/text()').get()    # earnings per share
        NOPAT = tbody.xpath('./tr[2]/td[6]/div/text()').get()  # net profit
```
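To see what these XPath expressions pull out without running the full crawl, here is a small offline sketch using the standard library's ElementTree against a mocked-up fragment of the detail page. The tag layout and the values in `MOCK_SECTION` are assumptions made for illustration; the real page is richer, and is best inspected with scrapy shell.

```python
# Offline sketch of the parse_item extraction logic.
# NOTE: MOCK_SECTION is a hypothetical, simplified stand-in for the real
# <section> on the detail page; only the paths used below are modelled.
import xml.etree.ElementTree as ET

MOCK_SECTION = """
<section>
  <h3>平安银行(000001)</h3>
  <div>
    <table><tbody>
      <tr><td/><td/><td/><td/><td/><td/></tr>
      <tr>
        <td/><td/>
        <td><div>1.14</div></td>
        <td/><td/>
        <td><div>204.56</div></td>
      </tr>
    </tbody></table>
  </div>
</section>
"""

root = ET.fromstring(MOCK_SECTION)
name = root.findtext('./h3')                                # stock name
eps = root.findtext('./div/table/tbody/tr[2]/td[3]/div')    # EPS cell
nopat = root.findtext('./div/table/tbody/tr[2]/td[6]/div')  # net-profit cell
print(name, eps, nopat)
```

ElementTree only supports a subset of XPath (child paths and positional predicates), but that subset happens to cover the relative paths used in parse_item.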
Saving the data
1. Define the item
To store the scraped data we will use an item pipeline (Pipeline), so first we define the data model in items.py.
```python
# ./Stock/items.py
import scrapy


class StockItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    EPS = scrapy.Field()
    NOPAT = scrapy.Field()
```
(Note: Field objects are used to specify metadata for each field.)
2. Update the settings
Next, enable the item pipeline in settings.py.
```python
# ./Stock/settings.py
ITEM_PIPELINES = {
    'Stock.pipelines.StockPipeline': 300,
}
```
This block already exists in settings.py but is commented out; simply uncomment it.
(Note: the number 300 is the pipeline's priority, an integer from 0 to 1000 where lower values run first; with a single pipeline it doesn't matter.)
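If more pipelines are added later, these numbers decide the order in which each item passes through them. A hypothetical example (`CleanPipeline` does not exist in this project and is shown only to illustrate the ordering):

```python
# ./Stock/settings.py -- hypothetical second pipeline for illustration
ITEM_PIPELINES = {
    'Stock.pipelines.CleanPipeline': 200,  # would run first (lower number)
    'Stock.pipelines.StockPipeline': 300,  # runs after CleanPipeline
}
```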
3. Use the item
Before the pipeline can save anything, the spider has to hand the scraped data to it.
```python
# ./Stock/spiders/gucheng.py
import scrapy
from Stock.items import StockItem  # import the item


class GuchengSpider(scrapy.Spider):
    name = 'gucheng'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        urls = response.xpath('//*[@id="stock_index_right"]/div[3]/section/a/@href').getall()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        item = StockItem()  # instantiate the item
        select = response.xpath('//*[@id="hq_wrap"]/div[1]/section[8]')
        tbody = select.xpath('./div/table/tbody')
        item['name'] = select.xpath('./h3/text()').get()
        item['EPS'] = tbody.xpath('./tr[2]/td[3]/div/text()').get()
        item['NOPAT'] = tbody.xpath('./tr[2]/td[6]/div/text()').get()
        return item  # returning the item sends it through the pipeline
```
4. Save the data with the pipeline
Open pipelines.py:
```python
# ./Stock/pipelines.py
class StockPipeline(object):

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        # (utf-8 keeps the Chinese header readable on any platform)
        self.f = open('./data.csv', 'w', encoding='utf-8')
        self.f.write('名称,每股利润,净利润\n')  # header: name, EPS, net profit

    def process_item(self, item, spider):
        self.f.write(
            '{},{},{}\n'.format(
                item['name'][:-4],  # trim the trailing characters after the stock code
                item['EPS'],
                item['NOPAT']
            )
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.f.close()
```
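The write logic can be checked offline by substituting a plain dict for the item and an in-memory buffer for the file. Note the [:-4] slice: it assumes the scraped h3 text carries a four-character suffix after the stock code; the suffix used below is a guess for illustration.

```python
# Offline check of the pipeline's CSV-writing logic, with io.StringIO in
# place of data.csv and a plain dict in place of StockItem.
import io

def write_rows(items, f):
    f.write('名称,每股利润,净利润\n')  # same header the pipeline writes
    for item in items:
        f.write('{},{},{}\n'.format(
            item['name'][:-4],  # trim an assumed 4-character suffix
            item['EPS'],
            item['NOPAT'],
        ))

buf = io.StringIO()
# '股票行情' is a hypothetical 4-character suffix standing in for whatever
# trailing text the real <h3> contains
write_rows([{'name': '平安银行(000001)股票行情', 'EPS': '1.14', 'NOPAT': '204.56'}], buf)
print(buf.getvalue())
```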
Now run the spider and check the result:
```shell
$ scrapy crawl gucheng
```
Sample output:
| Name | EPS | Net profit |
|---|---|---|
| 平安银行(000001) | 1.14 | 204.56 |
| *ST康达(000048) | 0.23 | 0.89 |
| 方大集团(000055) | 1.91 | 22.46 |
| 深天马A(000050) | 0.62 | 12.75 |
| 皇庭国际(000056) | 0.13 | 1.52 |
| 深纺织A(000045) | 0.02 | 0.12 |
| 德赛电池(000049) | 1.34 | 2.75 |
| 泛海控股(000046) | 0.35 | 18.17 |
| 中国天楹(000035) | 0.12 | 1.67 |
| 中航善达(000043) | 1.25 | 8.31 |
| … | | |