There are two ways to crawl site-wide data:
1. Spider-based full-site crawling: you handle pagination yourself and send each request manually.
2. CrawlSpider-based crawling, which is the focus of this article.
CrawlSpider is a subclass of Spider.
Usage workflow:
Creating the project and changing into the project directory work the same as with a plain Spider.
The key difference is the command used to create the spider file:
scrapy genspider -t crawl spidername www.xxx.com
The new content has two parts:
1. Link extractor (LinkExtractor):
Purpose: extracts links that match a specified rule (allow). The extraction is automatic and regex-based; no manual requests are needed.
2. Rule parser (Rule):
Purpose: parses the pages behind the extracted links with the specified callback.
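To see what an allow pattern actually selects, the filtering step can be sketched with Python's re module. The URLs below are made-up examples of the kind a listing page might contain; LinkExtractor does more work internally, but the allow check boils down to a regex search against each candidate URL:

```python
import re

# Hypothetical URLs of the kind a listing page might link to
urls = [
    'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2',
    'http://wz.sun0769.com/political/politics/index?id=490410',
    'http://wz.sun0769.com/about/contact',
]

# allow=r'id=1&page=\d+' keeps pagination links;
# allow=r'id=\d+' keeps any link carrying an id parameter (detail pages)
page_links = [u for u in urls if re.search(r'id=1&page=\d+', u)]
detail_links = [u for u in urls if re.search(r'id=\d+', u)]

print(page_links)    # only the pagination URL
print(detail_links)  # both URLs that contain id=...
```

Note that the pagination URL also matches the looser `id=\d+` pattern, which is why the spider below attaches a separate callback to each extractor.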
Next, a practical code example: the Sunshine Hotline Wenzheng platform.
We will crawl the complaint number and news title from the listing pages, and the number and news content from the detail pages.
# Main spider file
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunpro.items import SunproItem, DetailItem

class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest']
    # Link extractor: extracts the links matching the rule allow="regex"
    link = LinkExtractor(allow=r'id=1&page=\d+')
    link_detail = LinkExtractor(allow=r'id=\d+')
    rules = (
        # Rule parser: parses the extracted links with the specified callback
        Rule(link, callback='parse_item', follow=True),
        # follow=True: keep applying the link extractor to the pages
        # behind the links it has already extracted
        Rule(link_detail, callback='parse_detail', follow=True),
    )

    # Parse the number and the news title
    # Request meta cannot be passed between these two parse methods,
    # so the data is stored in two separate items rather than one
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            # Use relative XPaths inside the loop; an absolute path
            # would return the first li's data on every iteration
            new_num = li.xpath('./span[1]/text()').extract_first()
            new_title = li.xpath('./span[3]/a/text()').extract_first()
            # print(new_num, new_title)  # uncomment to check the parsing
            item = SunproItem()
            item['new_num'] = new_num
            item['new_title'] = new_title
            yield item

    # Parse the news number and the news content
    def parse_detail(self, response):
        new_id = response.xpath('/html/body/div[3]/div[2]/div[2]/div[1]/span[4]/text()').extract_first()
        new_content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre//text()').extract()
        new_content = ''.join(new_content)
        item = DetailItem()
        item['new_id'] = new_id
        item['new_content'] = new_content
        yield item
# items.py
import scrapy

class SunproItem(scrapy.Item):
    # Field names must match the keys the spider uses ('new_num', not 'new_sum')
    new_num = scrapy.Field()
    new_title = scrapy.Field()

class DetailItem(scrapy.Item):
    new_id = scrapy.Field()
    new_content = scrapy.Field()
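Because a single pipeline receives both item types, it must tell them apart before writing to the database. The dispatch idea can be shown with plain classes standing in for the two scrapy items (the table names 'id' and 'num' come from the pipeline below):

```python
# Plain stand-ins for the two scrapy item classes, just to show the dispatch
class SunproItem(dict):
    pass

class DetailItem(dict):
    pass

def table_for(item):
    # Same check the pipeline uses: branch on the item's class name
    if item.__class__.__name__ == 'DetailItem':
        return 'id'   # detail-page rows go to the id table
    return 'num'      # listing-page rows go to the num table

print(table_for(DetailItem()))  # id
print(table_for(SunproItem()))  # num
```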
The pipeline stores the scraped data in a MySQL database:
import pymysql

class mysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='sunpro', charset='utf8')

    def process_item(self, item, spider):
        # Determine the item's type by class name, then write it to the
        # matching table; parameterized queries (not string formatting)
        # keep the data consistent and avoid SQL injection
        self.cursor = self.conn.cursor()
        try:
            if item.__class__.__name__ == 'DetailItem':
                print(item['new_id'], item['new_content'])
                self.cursor.execute('insert into id values (%s, %s)',
                                    (item['new_id'], item['new_content']))
                self.conn.commit()
            else:
                print(item['new_num'], item['new_title'])
                self.cursor.execute('insert into num values (%s, %s)',
                                    (item['new_num'], item['new_title']))
                self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
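The pipeline inserts into tables named num and id but never creates them. A possible schema is sketched below; only the table names come from the code, while the column names and types are assumptions chosen to fit the scraped values (short strings for numbers and titles, long text for the news content):

```sql
-- Assumed schema for the sunpro database; adjust lengths/types as needed
CREATE TABLE num (
    new_num   VARCHAR(64),
    new_title VARCHAR(255)
);

CREATE TABLE id (
    new_id      VARCHAR(64),
    new_content TEXT
);
```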