Workflow
- Store the parsed data in an item object
- Use yield to hand the item over to the pipeline file
- Write the database-storage code in pipelines.py
- Enable the pipeline in the settings file (all of these files sit in the standard project layout sketched below)
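For reference, a sketch of where each file lives, assuming the project was created with scrapy startproject qiubaiPro (the project name is taken from the import path used in the spider below):

qiubaiPro/
    scrapy.cfg
    qiubaiPro/
        items.py
        pipelines.py
        settings.py
        spiders/
            qiubai.py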
Example
In items.py
import scrapy

class QiubaiproItem(scrapy.Item):
    # declare one Field per attribute parsed from the page
    author = scrapy.Field()
    content = scrapy.Field()
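scrapy.Field() is only a declaration; the item itself behaves like a dict, which is why the spider below can assign with item['author'] = author.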
In settings.py
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
}
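The value 300 is the pipeline's priority: Scrapy runs enabled pipelines in ascending order of this number (conventionally 0-1000), so with several pipelines the lower values run first.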
In the spider file
- The item class must be imported from items.py
- Fill the parsed data into the item
- Use yield item to submit it to the pipeline
import scrapy
from qiubaiPro.items import QiubaiproItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath("//div[@id='content-left']/div")
        for div in div_list:
            author = div.xpath("./div/a[2]/h2/text()").extract_first()
            content = div.xpath(".//div[@class='content']/span/text()").extract_first()
            # pack the parsed fields into an item and submit it to the pipeline
            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content
            yield item
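With the item defined and the pipeline enabled, the spider is started from the project root with scrapy crawl qiubai (the name attribute above).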
In pipelines.py
- First create a table with the matching schema in the database (see the sketch after this list)
- Import the pymysql package
- Connect to the database in open_spider
- Insert the records with pymysql
- Wrap the insert in try so errors are caught and rolled back
- Close the connection in close_spider
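A minimal sketch of the one-off table-creation step; the column types and sizes are assumptions, adjust them to your data:

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123', db='qiubai', charset='utf8mb4')
try:
    with conn.cursor() as cursor:
        # two columns matching the item fields; sizes are assumptions
        cursor.execute(
            'create table if not exists qiubai ('
            'author varchar(100), content text)'
        )
    conn.commit()
finally:
    conn.close()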
import pymysql

class QiubaiproPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Spider started, connecting to the database')
        self.conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='123',
            db='qiubai',
            charset='utf8mb4',  # needed so Chinese text is stored correctly
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # parameterized query: pymysql escapes the values itself, so quotes
        # in the scraped text cannot break the SQL
        sql = 'insert into qiubai values (%s, %s)'
        try:
            self.cursor.execute(sql, (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            print('Error, rolling back')
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
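Note that process_item returns the item even when the insert fails: returning it passes the item on to any lower-priority pipelines, whereas raising scrapy.exceptions.DropItem would discard it.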