Incremental crawler based on the Scrapy framework

 

Outline

Concept: monitor a site and crawl only the data that has been newly added since the last crawl

Core technique: deduplication

  • Redis-based deduplication

Sites suited to incremental crawling:

  • Depth-based crawling
    • Record the URLs of pages already crawled in a record table
  • Non-depth-based crawling
    • Record table: store a fingerprint of each piece of data already crawled
      • A data fingerprint is a unique identifier derived from the original data
      • Data -> data fingerprint -> record-table query
      • Fingerprints can be generated with hashlib
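For the non-depth case, the fingerprint idea can be sketched as follows. This is a minimal stand-alone example: the in-memory `seen` set stands in for the Redis set the project actually uses, and the helper names are illustrative, not from the original code.

```python
import hashlib

# stand-in for the Redis record table used in the real project
seen = set()

def fingerprint(data: str) -> str:
    """Derive a unique, fixed-length identifier from the original data."""
    return hashlib.sha256(data.encode('utf-8')).hexdigest()

def is_new(data: str) -> bool:
    """Data -> data fingerprint -> record-table query."""
    fp = fingerprint(data)
    if fp in seen:
        return False   # already crawled, skip
    seen.add(fp)       # record it so future checks skip it
    return True

print(is_new('Movie A: an introduction'))  # True  (first time seen)
print(is_new('Movie A: an introduction'))  # False (duplicate)
```

Hashing the data rather than storing it whole keeps the record table small and gives a fixed-length key regardless of how large the original record is.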

In what form does the so-called record table exist?

  • A Redis set acts as the record table
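What the spider relies on is SADD's return value: Redis returns 1 when the member is newly added to the set and 0 when it is already present. A minimal in-memory stand-in (no Redis server required; the class and URL here are illustrative) shows the check:

```python
# A stand-in mimicking Redis SADD semantics: return 1 if the member was
# newly added, 0 if it was already present in the set.
class RecordTable:
    def __init__(self):
        self._members = set()

    def sadd(self, member):
        if member in self._members:
            return 0
        self._members.add(member)
        return 1

table = RecordTable()
print(table.sadd('https://www.4567tv.tv/movie/1.html'))  # 1: new url, crawl it
print(table.sadd('https://www.4567tv.tv/movie/1.html'))  # 0: already crawled, skip
```

Because the set lives in Redis rather than in the spider process, the record table survives between crawler runs, which is what makes the crawl incremental.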

example

Crawl the movie titles and introductions from the 4567 movie site; when the site is updated, crawl only the newly added data.

  • Address: https://www.4567tv.tv/frim/index1.html
  • This example uses depth-based crawling.

scrapy startproject zlsPro

scrapy genspider zls www.xxx.com

  • Pass parameters manually between callbacks for the depth crawl (via the request's meta).
  • Use the return value of self.conn.sadd('movie_url', detail_url) to determine whether the film has already been crawled.
# zls.py
# -*- coding: utf-8 -*-
import scrapy
from redis import Redis
from zlsPro.items import ZlsproItem


class ZlsSpider(scrapy.Spider):
    name = 'zls'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']
    conn = Redis('127.0.0.1', 6379)

    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/div/h4/a/text()').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/div/h4/a/@href').extract_first()
            # sadd returns 1 only if the url was not already in the set,
            # i.e. this film has not been crawled before
            ret = self.conn.sadd('movie_url', detail_url)
            if ret:
                performer = li.xpath('./div/div/p/text()').extract_first()
                item = ZlsproItem()
                item['title'] = title
                item['performer'] = performer
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
            else:
                print('No newly updated data')

    def parse_detail(self, response):
        item = response.meta['item']
        content = response.xpath(
            '//div[@class="stui-content__detail"]/p/span[@class="detail-content"]/text()').extract_first()
        item['content'] = content
        yield item

  • Item definition
# items.py
import scrapy


class ZlsproItem(scrapy.Item):
    # define the fields for your item here:
    title = scrapy.Field()
    performer = scrapy.Field()
    content = scrapy.Field()

  • Pipeline definition
  • Stores the crawled items into Redis
# pipelines.py
import json


class ZlsproPipeline(object):

    def process_item(self, item, spider):
        # reuse the Redis connection created on the spider
        conn = spider.conn
        # Redis stores strings/bytes, not Item objects, so serialize first
        conn.lpush('movie', json.dumps(dict(item)))
        return item
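Note that Redis lists store strings and bytes, not Scrapy Item objects, so the item must be serialized before LPUSH. A minimal stand-alone sketch of that round trip (json is one reasonable choice, assumed here; a plain Python list stands in for the Redis list 'movie'):

```python
import json

movie_list = []  # stand-in for the Redis list 'movie'

item = {'title': 'Some Movie', 'performer': 'Someone', 'content': 'An intro'}

# what the pipeline does before lpush: serialize the item to a string
movie_list.insert(0, json.dumps(item))

# a consumer reading the stored data back out later
restored = json.loads(movie_list[0])
print(restored['title'])  # Some Movie
```

Serializing to json keeps the stored records human-readable and lets any downstream consumer, not just Python, read them back.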

Origin www.cnblogs.com/taosiyu/p/11735124.html