Crawling site-wide data with the Scrapy framework

There are two ways to crawl site-wide data:

1. Spider-based site-wide crawling: you handle the pagination yourself and send the follow-up requests manually (see the sketch after this list).
2. CrawlSpider-based crawling; this post focuses on the CrawlSpider approach.
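
For contrast, here is a minimal sketch of the Spider-based approach; the URL, page count and XPath below are placeholders, not taken from the example later in this post, and the point is only that the next-page URL is built and requested by hand:

# spider-based site-wide crawl (sketch with placeholder URL/XPath)
import scrapy

class PageSpider(scrapy.Spider):
    name = 'page'
    start_urls = ['http://www.xxx.com/list?page=1']
    url = 'http://www.xxx.com/list?page=%d'   # URL template for the following pages
    page_num = 2

    def parse(self, response):
        for title in response.xpath('//li/a/text()').extract():
            print(title)
        # manual pagination: build the next-page URL and send the request yourself
        if self.page_num <= 5:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield scrapy.Request(url=new_url, callback=self.parse)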

CrawlSpider is a subclass of Spider:

Usage flow:
		Creating the project and cd-ing into it works exactly the same as with a regular Spider.
		The key difference is the command used to create the spider file:
scrapy genspider -t crawl spidername www.xxx.com
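
Running that command produces roughly the following skeleton (the exact template varies slightly between Scrapy versions):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item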

Compared with a plain Spider, CrawlSpider introduces two new components:

1. Link extractor (LinkExtractor):
	Role: extracts links that match the specified rule (allow). The extraction is automatic and regex-based; no manual requests are needed (see the shell sketch after this list).
2. Rule parser (Rule):
	Role: sends a request for each extracted link and parses the response according to the specified rule (callback).
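
An allow pattern can be tested interactively before the spider is written. A quick sketch in scrapy shell, using the pagination regex from the example below:

# scrapy shell http://wz.sun0769.com/political/index/politicsNewest
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'id=1&page=\d+')   # keep only the pagination links
for link in le.extract_links(response):      # extract_links returns Link objects
    print(link.url)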

Next, let's put this into practice with a code example: the Sunshine Hotline Wenzheng platform.

From this site we will crawl each post's number and news title from the list pages, plus the number and news content from the detail pages.
# main spider file
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunpro.items import SunproItem,DetailItem

class SunSpider(CrawlSpider):
    name = 'sun'
    #allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest']
    # link extractor: extracts links that match the rule given in allow (a regex)
    link = LinkExtractor(allow=r'id=1&page=\d+')
    link_detail = LinkExtractor(allow=r'id=\d+')
    rules = (
        # rule parser: parses the links extracted by the link extractor according to the given rule (callback)
        Rule(link, callback='parse_item', follow=True),
        # follow=True: keep applying the link extractor to the pages that the extracted links point to
        Rule(link_detail, callback='parse_detail', follow=True),
    )
    # parse the post number and news title from the list pages
    # the two parse methods below cannot pass data to each other via request meta,
    # so instead of storing everything in one item, the data is stored in two separate items
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            # XPaths must be relative to each li; an absolute path here would always hit the first li
            new_num = li.xpath('./span[1]/text()').extract_first()
            new_title = li.xpath('./span[3]/a/text()').extract_first()
            # print(new_num, new_title)  # quick check that parsing works
            item = SunproItem()
            item['new_num'] = new_num
            item['new_title']=new_title
            yield item
    # parse the post number and news content from the detail pages
    def parse_detail(self,response):
        new_id = response.xpath('/html/body/div[3]/div[2]/div[2]/div[1]/span[4]/text()').extract_first()
        new_content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre//text()').extract()
        new_content = ''.join(new_content)
        item = DetailItem()
        item['new_id']=new_id
        item['new_content']=new_content
        yield item
# items.py
import scrapy


class SunproItem(scrapy.Item):
    # define the fields for your item here like:
    new_num = scrapy.Field()
    new_title = scrapy.Field()

class DetailItem(scrapy.Item):
    new_id = scrapy.Field()
    new_content = scrapy.Field()

The pipeline stores the scraped data in a MySQL database (pipelines.py):
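
The pipeline assumes the database sunpro and two tables, num and id, already exist; here is a minimal one-off sketch for creating them with pymysql (the column names and types are my own assumption, chosen to match the item fields):

# create_tables.py -- run once; the schema below is an assumption
import pymysql

conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='sunpro', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists num (new_num varchar(32), new_title varchar(256))')
cursor.execute('create table if not exists id (new_id varchar(32), new_content text)')
conn.commit()
cursor.close()
conn.close()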

import pymysql

class mysqlPipeline(object):
    conn = None
    cursor = None
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123456', db='sunpro',charset='utf8')
    def process_item(self, item, spider):
        # check which item class arrived, so each item is written to its own table
        # commit after each successful insert and roll back on errors to keep the data consistent
        self.cursor = self.conn.cursor()
        try:
            if item.__class__.__name__ == 'DetailItem':
                print(item['new_id'], item['new_content'])
                # parameterized queries: let pymysql handle quoting and escaping
                self.cursor.execute('insert into id values (%s, %s)', (item['new_id'], item['new_content']))
                self.conn.commit()
            else:
                print(item['new_num'], item['new_title'])
                self.cursor.execute('insert into num values (%s, %s)', (item['new_num'], item['new_title']))
                self.conn.commit()
                self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
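
For the pipeline to run, it has to be registered in settings.py. A sketch assuming the default project layout, i.e. the class above lives in sunpro/pipelines.py:

# settings.py (relevant lines only)
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
ITEM_PIPELINES = {
    'sunpro.pipelines.mysqlPipeline': 300,
}

With everything in place, the spider is started the usual way with scrapy crawl sun.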
