Outline
Concept: monitor a site for newly added data and crawl only what is new (incremental crawling)
Core technology: deduplication
- Deduplication based on a Redis set
Sites suited to incremental crawling:
- Depth-based crawling
    - Record the URLs of pages already crawled (a record table)
    - Do not re-crawl URLs already in the record table
- Non-depth-based crawling
    - Record table: the data fingerprint of each piece of data already crawled
        - Data fingerprint: a unique identifier derived from the original data
        - Data -> data fingerprint -> query the record table
        - hashlib
In what form does the so-called record table exist?
- A Redis set acts as the record table
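The fingerprint flow above (data -> hashlib fingerprint -> query the record table) can be sketched as follows. This is a minimal illustration: a plain Python set stands in for the Redis set, and the sample strings are made up.

```python
import hashlib

record_table = set()  # stands in for the Redis set record table


def is_new(data: str) -> bool:
    """Return True if `data` has not been seen before, and record it."""
    # data -> data fingerprint (a hex digest uniquely identifying the data)
    fingerprint = hashlib.sha256(data.encode('utf-8')).hexdigest()
    # fingerprint -> query the record table
    if fingerprint in record_table:
        return False  # duplicate: already crawled
    record_table.add(fingerprint)
    return True


print(is_new('Movie A: a synopsis'))  # first time: True
print(is_new('Movie A: a synopsis'))  # duplicate: False
```

With a real Redis set the membership test and insertion collapse into one call, `conn.sadd('fingerprints', fingerprint)`, whose return value (1 = new, 0 = already present) answers the same question.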
Example
Crawl the movie names and synopses from the 4567 movie site; when the site is updated, crawl only the newly added data.
- Address:
https://www.4567tv.tv/frim/index1.html
- This example uses depth-based crawling.
scrapy startproject zlsPro
scrapy genspider zls www.xxx.com
①
- Depth crawling with manual request passing
- Use the return value of
self.conn.sadd('movie_url', detail_url)
to decide whether a film's detail page has already been crawled: sadd returns 1 if detail_url was newly added to the set, and 0 if it was already recorded.
# zls.py
②
- Item definitions
# items.py
③
- Define the pipeline
- Write the data into Redis
# pipelines.py
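A sketch of the pipeline that writes scraped data into Redis, assuming the spider holds the Redis connection (spider.conn) as set up in the spider step. The Redis key 'movie_data' is an assumption.

```python
# pipelines.py -- sketch: persist each item into Redis via the
# connection owned by the spider (spider.conn)
class ZlsproPipeline:
    def process_item(self, item, spider):
        # push the scraped movie data onto a Redis list;
        # the key name 'movie_data' is an assumption
        spider.conn.lpush('movie_data', str(dict(item)))
        return item
```

Returning the item keeps it flowing to any later pipelines, per the usual Scrapy pipeline contract.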