Understanding the Python Crawler Framework pyspider

Disclaimer: this is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/wufaliang003/article/details/91359793

1. pyspider Overview

 

pyspider is an open-source crawler framework written by Binux. Its main features are:

  • Crawling and scheduled updating of specific pages across multiple sites

  • Extraction of structured information from pages

  • Flexible, scalable, stable, and monitorable

As a framework, pyspider takes care of scheduling, fetch queuing, exception handling, monitoring, and similar concerns, and only asks you to supply the crawl script, which keeps things flexible. On top of that it adds a web-based editing and debugging environment and web-based task monitoring, and that makes up the whole framework. The core design idea of pyspider is: a crawl loop driven by Python scripts.

  • The components are connected by message queues. The scheduler is a single point, while the fetcher and processor can run as multiple instances in a distributed deployment. The scheduler is responsible for overall scheduling control.

  • Tasks are initiated by the scheduler; the fetcher fetches the pages, and the processor runs the pre-written Python script, producing output or generating new chained tasks (sent back to the scheduler), so a closed loop is formed.

  • Each script can use any Python library to parse pages, use the framework API to control the next crawl action, and control how results are handled by setting callbacks.

 

2. The pyspider Interface

Run pyspider all in a terminal to start the pyspider service, then open localhost:5000 in a browser to see the pyspider interface. rate controls the number of pages crawled per second, and burst can be regarded as concurrency control. To delete a project, set its group to delete and its status to STOP; the project will be deleted after 24 hours.

Click Create to create a new project.

Click the project you just created to open the script editing interface.

Here we can write and debug scripts: web displays the rendered page, the button to the left of web is a CSS selector helper, html shows the page source, and follows lists the URLs that can be crawled next. The details become clear once you debug a script yourself.

3. pyspider Scripts

When you create a new project you will see the default script template below; what follows is a brief introduction to writing pyspider scripts.

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('__START_URL__', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

  • crawl_config: global crawler settings; cookies, request headers, and other keyword parameters accepted by crawl can be set here and are applied to every request

  • on_start(self): the entry point of the crawler; crawling starts here

  • crawl: similar in functionality to requests; supports GET (the default) and POST. Commonly used parameters:

    • data: the data you want to submit

    • callback: the callback function to invoke after the crawl finishes

    • method: specifies the HTTP method

    • files: files to upload, in the form {'key': ('file.name', 'content')}

    • headers: request headers, dict type

    • cookies: request cookies, dict type

    • timeout: maximum number of seconds to wait for the response content. Default: 120

    • connect_timeout: timeout for establishing the connection, in seconds. Default: 20

    • proxy: sets a proxy server; currently only HTTP proxies are supported

More parameters can be found in the official documentation; a sketch of a typical call is shown below.
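As a minimal sketch of these parameters in use (the URL, form fields, cookie value, and proxy address below are placeholders), a POST request issued from inside a handler method might look like this:

    def on_start(self):
        # placeholder values throughout; adjust them to the target site
        self.crawl('http://example.com/login',
                   method='POST',
                   data={'user': 'name', 'pass': 'secret'},
                   headers={'User-Agent': 'Mozilla/5.0'},
                   cookies={'sessionid': 'xxx'},
                   timeout=120,
                   connect_timeout=20,
                   proxy='127.0.0.1:8888',
                   callback=self.index_page)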

  • response (a usage sketch follows at the end of this list)

    • crawl: the object returned by a crawl is a response object

    • response.url: returns the final URL

    • response.text: the content of the response in text form (if Response.encoding is None and chardet is available, the encoding of the content will be detected automatically)

    • response.doc: calls the PyQuery library to build a PyQuery object from the returned content for convenient parsing; by default all links inside have already been converted to absolute links, so the object can be queried directly (for the object's methods, see the official PyQuery documentation)

    • response.json: calls the JSON library to parse the returned content

    • response.status_code: returns the status code of the response

    • response.headers: the headers of the response, in dict format

    • response.cookies: the cookies of the response

    • response.time: the time spent on the crawl

  • index_page and detail_page are just the callback functions of the initial script; apart from on_start, the other function names can be customised

  • @every(minutes=24 * 60): sets how often the method runs (24 * 60 means once a day, so the data can be re-crawled daily)

  • @config

    • age: the validity period of a task, in seconds; within this period the target page is considered unchanged and will not be re-crawled

    • priority: the priority of the task; the larger the value, the earlier it is executed. Default: 0

    • auto_recrawl: whether to re-crawl automatically every time the age expires. Default: False

    • retries: the number of retries after a task fails. Default: 3

    • itag: a marker value for the task, compared on each crawl; if it changes, the page is re-crawled regardless of the validity period. Mostly used to decide dynamically whether the content should be re-crawled, or to force a re-crawl. Default: None
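To tie the decorators and response attributes together, here is a minimal illustrative sketch of two methods inside the Handler class (the API URL is a placeholder, and the JSON handling assumes the endpoint returns a list):

    @every(minutes=24 * 60)                 # run once per day
    def on_start(self):
        # placeholder URL, for illustration only
        self.crawl('http://example.com/api/items', callback=self.api_page)

    @config(age=60 * 60, retries=3, auto_recrawl=True, priority=1)
    def api_page(self, response):
        if response.status_code != 200:     # response attributes in action
            return None
        items = response.json               # parsed JSON body (assumed to be a list)
        return {
            'url': response.url,
            'count': len(items),
            'fetch_time': response.time,
        }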

 

A pitfall when installing pyspider on macOS (readers not on macOS can skip this)

When installing pyspider on macOS you may run into a pycurl import error:

 

ImportError: pycurl: libcurl link-time ssl backend (openssl) is different from compile-time ssl backend (none/other)

 

The problem can be solved by reinstalling pycurl, but on recent macOS versions the openssl header files are not in the system environment variables. Entering the following in a terminal fixes it:

pip uninstall pycurl                              # uninstall the library
export PYCURL_SSL_LIBRARY=openssl
export LDFLAGS=-L/usr/local/opt/openssl/lib
export CPPFLAGS=-I/usr/local/opt/openssl/include  # path to the openssl header files
pip install pycurl --compile --no-cache-dir       # recompile and reinstall

System environment: macOS High Sierra 10.13.2

 

4. Special Techniques

 

Login simulation

Many websites only show more content after you log in, so the crawler needs to be able to simulate logging in; we can use selenium to do this.

selenium is a tool for testing web applications, and we can also use selenium to implement the login. Take Weibo as an example:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://weibo.com/")
username = driver.find_element_by_css_selector("input#loginname")
username.clear()
username.send_keys('your_username')
password = driver.find_element_by_css_selector('span.enter_psw')
password.clear()
password.send_keys('your_password')

After entering the username and password, the biggest remaining problem is usually an image CAPTCHA. We generally have to rely on image recognition to read it, but because CAPTCHAs come in many forms (letters, digits, Chinese characters, or a mixture) and may be rotated, distorted, or even stuck together, sometimes beyond what the human eye can read, most recognition models are neither very general nor very accurate. The most efficient approach is therefore to let selenium open the browser and then log in by hand (calling time.sleep() to pause the program while you do). Since solving the login is not the crawler's main job, this saves a lot of time and code; it is crude but very practical.
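A minimal sketch of that manual-login pause (the 60-second wait is an arbitrary choice):

import time

driver.get('https://weibo.com/')
time.sleep(60)   # log in by hand in the browser window selenium opened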

Once logged in, use the following code to obtain the cookies, and pass cookies_dict to pyspider's global cookies parameter (see the sketch after the snippet):

cookies_dict = {}
cookies = driver.get_cookies()
for cookie in cookies:
    cookies_dict[cookie['name']] = cookie['value']
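A minimal sketch of handing those cookies to pyspider, assuming the cookies_dict built above is available to the script: either put it in crawl_config so every request carries it, or pass it to an individual crawl call.

class Handler(BaseHandler):
    crawl_config = {
        'cookies': cookies_dict,   # assumes cookies_dict from the selenium session is in scope
    }

    def on_start(self):
        # alternatively, pass the cookies per request
        self.crawl('https://weibo.com/', cookies=cookies_dict,
                   callback=self.index_page)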

JS-rendered pages

Ordinary requests-style crawls can only fetch static HTML pages, but most sites mix in data loaded by JS, and that data arrives with a delay; to crawl such content we can use selenium + PhantomJS to fully render the page and then parse it. PhantomJS is a free, scriptable WebKit browser engine without a graphical interface; it is used in much the same way as selenium + Chrome for simulated login, but because it is headless it consumes far less memory.
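A minimal sketch of this approach (the URL and selector are placeholders; PhantomJS has to be installed separately):

from selenium import webdriver
from pyquery import PyQuery

driver = webdriver.PhantomJS()             # headless WebKit browser, no GUI
driver.get('https://example.com/page')     # placeholder URL
html = driver.page_source                  # fully rendered HTML, JS already executed
driver.quit()

doc = PyQuery(html)
for link in doc('a[href^="http"]').items():   # placeholder selector
    print(link.attr('href'))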

Crawling pages loaded asynchronously with AJAX

AJAX is a technique for creating fast, dynamic web pages. By exchanging small amounts of data with the server behind the scenes, AJAX makes asynchronous page updates possible, which means parts of a page can be refreshed without reloading the whole page. To crawl a page loaded asynchronously with AJAX, we have to analyse the requests it sends and the data that comes back. Concretely, we use the XHR view under the Network tab of Chrome's developer tools (Firefox and other browsers have an equivalent) to see, while part of the page updates, which requests the browser sends and what it gets back.

After logging into the Weibo home page, scrolling to the bottom loads new posts. Open the developer tools and keep scrolling: every time new posts appear, a new request is issued, and the data it returns is JSON containing the HTML used to refresh the feed.

Looking closely at the requests, the pagebar parameter changes regularly, and the id decreases by 15, matching the 15 new posts loaded each time, so pagebar is presumably the key to loading new posts. Before crawling we should also add request headers, otherwise the server is likely to recognise the crawler as a robot and refuse access; the request headers can likewise be copied from the developer tools (they appear in the sketch at the end of this section). In on_start we can then generate a request for each pagebar value:

def on_start(self):
    for i in range(10):
        url = 'https://weibo.com/aj/mblog/fsearch?pagebar=%s' % i
        self.crawl(url, callback=self.index_page)

We find that the crawl succeeds and the Weibo content is returned; all that is left is to process the information we need.
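As a hedged sketch of that processing step, assuming (as the developer tools suggest) that the JSON response carries the rendered feed HTML under a data field, and with placeholder header and selector values, the handler might continue like this:

from pyquery import PyQuery

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            # placeholder; copy the real User-Agent (and cookies) shown in the developer tools
            'User-Agent': 'Mozilla/5.0',
        },
    }

    def index_page(self, response):
        body = response.json                    # JSON returned by the fsearch endpoint
        html = body.get('data', '')             # assumption: the feed HTML sits under 'data'
        doc = PyQuery(html)
        # 'div.WB_text' is a guess at the post-text selector; verify it against the page source
        posts = [item.text() for item in doc('div.WB_text').items()]
        return {'url': response.url, 'posts': posts}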

 

Follow the WeChat public account and Toutiao account; more good articles are on the way.

 
