Python3 crawler (16): the pyspider framework


Infi-chu:

http://www.cnblogs.com/Infi-chu/

1. Introduction to pyspider
1. Basic functions
Provides a WebUI for visually writing and debugging crawlers
Provides crawl progress monitoring, crawl result viewing, and crawler project management
Supports multiple databases, such as MySQL, MongoDB, Redis, SQLite, PostgreSQL, etc.
Supports multiple message queues, such as RabbitMQ, Beanstalk, Redis, etc.
Provides priority control, retry on failure, scheduled crawling, etc.
Integrates with PhantomJS to crawl JavaScript-rendered pages
Supports stand-alone, distributed, and Docker deployment

2. Compared with scrapy
pyspider provides a WebUI, which scrapy does not have natively, so writing and debugging crawlers is more convenient in pyspider; on the other hand, pyspider's degree of extensibility is lower than scrapy's.

3. Three major modules of the framework design
Scheduler, Fetcher, Processor

4. Specific process
1. Each pyspider project corresponds to one Python script, which defines a Handler class; the on_start() method starts the project, and the requests it generates are handed to the Scheduler for scheduling.
2. The Scheduler dispatches fetch tasks to the Fetcher; after the Fetcher has fetched the response, it passes the response to the Processor.
3. The Processor parses the response and extracts new URLs, which are sent back to the Scheduler through the message queue; if new extraction results are produced, they are sent to the result queue to wait for the Result Worker to process them.
4. The above process loops until crawling ends; on_finished() is called at the end. A minimal Handler sketch of this flow is shown below.
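The following sketch is modeled on the standard pyspider quickstart script; example.com and the method names index_page/detail_page are placeholders, not part of the original article.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    def on_start(self):
        # entry point: schedule the first request, which the Scheduler takes over
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        # Processor stage: parse the fetched page and feed new URLs back to the Scheduler
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # returning a dict sends the result to the result queue for the Result Worker
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }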

5. Example
https://github.com/Infi-chu/quna

2. Detailed explanation of pyspider
1. Startup:
pyspider all
2. crawl() method (a usage sketch follows the parameter list)
url: the URL to crawl; it can be a single URL string or a list of URLs
callback: the callback function, i.e. the method that will parse the response for this URL
age: the validity period of the task; within this period the same task is not re-crawled
priority: task priority, default 0; the larger the value, the higher the priority
exetime: schedule the task for a given time; the value is a timestamp, default 0 (execute immediately)
retries: number of retries, default 3
itag: a marker (e.g. a node value of the page) used to judge whether the page has changed; when it changes, the task is re-crawled
user_agent: User-Agent
headers: Request Headers
cookies: Cookies, in dictionary format
connect_timeout: the longest wait time when establishing the connection, default 20 seconds
allow_redirects: whether to follow redirects, default True
validate_cert: whether to verify the certificate, default True
proxy: proxy
fetch_type: enable PhantomJS rendering
js_script: JavaScript script executed after the page is loaded
js_run_at: where the script runs, document-start or document-end; default is document-end
js_viewport_width/js_viewport_height: Window size of JavaScript rendering page
load_images: Determine whether to load images, default is False
save: used to pass parameters between different methods; the value is available in the callback as response.save
cancel: cancel task
force_update: force update status
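A sketch combining several of these parameters in one Handler; the URLs, header value and JavaScript snippet are placeholders, and PhantomJS must be installed for fetch_type='js' to take effect.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/list',
                   callback=self.index_page,
                   age=60 * 60,                         # result valid for one hour
                   priority=1,                          # larger value = higher priority
                   retries=3,
                   headers={'User-Agent': 'pyspider-demo'},
                   save={'page': 1})                    # carried over to the callback

    def index_page(self, response):
        print(response.save['page'])                    # the value passed via save
        # render a JavaScript page with PhantomJS and scroll to the bottom after loading
        self.crawl('http://example.com/js_page',
                   callback=self.detail_page,
                   fetch_type='js',
                   js_script='''
                   function() {
                       window.scrollTo(0, document.body.scrollHeight);
                   }
                   ''')

    def detail_page(self, response):
        return {'url': response.url, 'title': response.doc('title').text()}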
3. Task distinction:
pyspider judges whether two tasks are the same by comparing the MD5 values of their URLs; the MD5 of the URL serves as the task's unique identifier (taskid). A sketch of customizing this behaviour follows.
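For example, when the same URL is crawled several times with different POST data, the default URL-based taskid would treat them as one task. The pyspider documentation suggests overriding get_taskid() to include the data in the hash; the sketch below follows that approach, with the class name chosen arbitrarily.

import json
from pyspider.libs.base_handler import *
from pyspider.libs.utils import md5string

class Handler(BaseHandler):
    def get_taskid(self, task):
        # hash the POST data together with the URL so identical URLs with
        # different data are treated as different tasks
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))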
4. Global configuration:
Specify the global configuration in crawl_config
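A sketch of a project-level crawl_config; the header, proxy address and itag value are placeholders. Everything placed here is merged into every self.crawl() call of the project.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # applied to every self.crawl() in this project
    crawl_config = {
        'headers': {'User-Agent': 'Mozilla/5.0'},
        'proxy': '127.0.0.1:8080',   # placeholder proxy, remove if not needed
        'itag': 'v1',                # changing this value marks pages as changed and forces a re-crawl
    }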
5. Scheduled crawling
Use the every attribute (the @every decorator) to set the crawling interval; a sketch follows below.
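A minimal sketch of scheduled crawling; the interval and URL are placeholders. @every controls how often on_start fires, while age prevents re-fetching pages that are still within their validity period.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)          # run on_start once every 24 hours
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # a fetched page stays valid for 10 days and is not re-crawled sooner
    def index_page(self, response):
        return {'title': response.doc('title').text()}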
6. Project status:
TODO: the project has just been created and has not been run yet
STOP: the project has been stopped
CHECKING: a running project has been modified and is waiting to be checked
DEBUG/RUNNING: the project is running
PAUSE: after multiple consecutive errors the project is automatically suspended and resumes after a while
7. Delete the project
Set the project status to STOP and change its group name to delete; the project will be deleted automatically after 24 hours.

 
