Crawling Python-related data by URL keyword

Example: a lightweight crawler for data that does not require login to view the full page — crawling 1000 Baidu Encyclopedia pages related to the "Python" entry.

Description: a crawler starts from an entry URL, follows the URLs associated with it, and automatically crawls the related information from the Internet.
Value: 1. the crawled data supports your own data analysis; 2. classified data can be offered externally as professional data services.
1. Simple crawler architecture

  1. Crawler architecture components

Crawler scheduler ----> URL Manager <--------------> Web Downloader <-------------------> Web Page Parser --------------> valuable data
  2. Crawler architecture dynamic flow
2. URL Manager
Manages the set of URLs to be crawled and the set of already-crawled URLs (to prevent duplicate crawling and crawl loops).
Implementations:
- Python memory: two `set()`s (one for URLs to crawl, one for crawled URLs)
- relational database (MySQL): a single table, e.g. `urls(url, is_crawled)`, where a flag column marks crawled rows
- cache database (Redis): two sets
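The in-memory variant above can be sketched as a small class holding the two sets. The class and method names (`UrlManager`, `add_new_url`, and so on) are illustrative, not from the original article:

```python
# A minimal sketch of a URL manager backed by two in-memory sets.

class UrlManager:
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Skip URLs already seen in either set, to prevent
        # duplicate crawls and crawl loops.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move a URL from the to-crawl set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Swapping the two `set()`s for a MySQL table or Redis sets changes only the storage, not this interface.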
3. Web Downloader (urllib2)
(1): a tool that downloads the web page corresponding to a URL to local storage.
(2): Internet --URL--> web downloader (urllib2, or the requests tool) --HTML--> a local file or a string in memory
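A minimal downloader along these lines, using the third-party `requests` library that the article names (the function name and URL handling are illustrative):

```python
# Minimal downloader sketch using the third-party `requests` library.
import requests

def download(url):
    """Fetch a page and return its HTML as a string, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            return None
        return resp.text   # page HTML held as a string in memory
    except requests.RequestException:
        return None
```

Returning `None` instead of raising lets the crawler loop skip broken pages and keep going.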
(3): urllib2 implementations:
Method 1 (direct request): 1. import the package; 2. request the URL directly; 3. get the request status code; 4. read the content.
Method 2 (adding data and an HTTP header): url + data + header ----> urllib2.Request -----------> urllib2.urlopen(request)
Steps: 1. import the package; 2. create the Request object; 3. add data; 4. add the HTTP header; 5. send the request and get the result.
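The two methods above can be sketched as follows. Note the article uses Python 2's urllib2; in Python 3 the same API lives in `urllib.request`, which this sketch uses. The URL and header value are illustrative:

```python
# Sketch of download methods 1 and 2, using Python 3's urllib.request
# (the successor to Python 2's urllib2). Network calls are commented out.
import urllib.request

url = "http://www.baidu.com"   # example URL

# Method 1: direct request.
# response = urllib.request.urlopen(url)
# print(response.getcode())    # status code, 200 on success
# html = response.read()       # page content as bytes

# Method 2: build a Request object so data and headers can be attached.
request = urllib.request.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")  # present as a browser
# response = urllib.request.urlopen(request)
# html = response.read()
```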
Method 3: add handlers for special scenarios:
- pages that require login: HTTPCookieProcessor
- pages that require a proxy: ProxyHandler
- pages that require HTTPS encryption: HTTPSHandler
- pages with automatic redirects: HTTPRedirectHandler
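These handlers are installed by building a custom opener. A sketch with cookie support for login-protected pages (Python 3 names; the urllib2 handler names are the same):

```python
# Build an opener with scenario-specific handlers installed.
import http.cookiejar
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj),   # carry cookies across requests
    urllib.request.HTTPRedirectHandler(),     # follow automatic redirects
)
urllib.request.install_opener(opener)
# From here on, urllib.request.urlopen(url) uses cookies and redirects.
# A ProxyHandler or HTTPSHandler would be added to build_opener the same way.
```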
4. Page Parser (BeautifulSoup)
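A parser sketch using BeautifulSoup (third-party, `pip install beautifulsoup4`). The HTML snippet and the `lemma-summary` class name are made up for illustration; a real Baidu Encyclopedia page would need its actual selectors:

```python
# Extract entry links and summary text from an HTML page with BeautifulSoup.
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/item/Python">Python</a>
  <a href="/item/Crawler">Crawler</a>
  <div class="lemma-summary">Python is a programming language.</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the hrefs of all anchor tags (candidate URLs for the URL manager).
links = [a["href"] for a in soup.find_all("a", href=True)]

# Pull the summary text (the "valuable data").
summary = soup.find("div", class_="lemma-summary").get_text(strip=True)
```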

5. Core code
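The scheduler loop that ties the components together — URL manager, downloader, parser, stopping at the page budget (1000 entries in the article) — can be sketched as below. The `download` and `parse` stubs stand in for the real downloader and parser:

```python
# Scheduler loop sketch: URL manager -> downloader -> parser, repeated
# until the to-crawl set is empty or the page budget is reached.

def download(url):
    """Stub downloader; a real one would fetch the page over HTTP."""
    return "<html>page for %s</html>" % url

def parse(url, html):
    """Stub parser; a real one would extract new URLs and data with BeautifulSoup."""
    return [], {"url": url, "html": html}

def crawl(root_url, max_pages=1000):
    new_urls = {root_url}   # URL manager: to-crawl set
    old_urls = set()        # URL manager: crawled set
    results = []
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = download(url)            # web downloader
        if html is None:
            continue                    # skip pages that failed to download
        urls, data = parse(url, html)   # page parser
        new_urls.update(u for u in urls if u not in old_urls)
        results.append(data)
    return results
```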


Origin blog.csdn.net/YHM_MM/article/details/102654808