A senior Python engineer's summary of web crawlers

Having recently managed to pull myself away from routine work for a while, I finally have time to give my crawler knowledge a simple tidy-up. Looking back, organizing what you have learned at each stage really is worthwhile.


Commonly used third-party libraries

For crawler beginners, I recommend first implementing a simple crawler yourself with these popular third-party libraries, without using any crawler framework; it will deepen your understanding of how crawlers work and help later on.

urllib and requests are both Python HTTP libraries. urllib (urllib2 in Python 2) is part of the standard library and offers comprehensive features, but its API is comparatively clumsy to use. The requests module is much simpler and fully covers the common use cases. For the pros, cons and differences between urllib and requests, there are plenty of comparisons online.
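
A quick sketch of the difference in Python 3, fetching the same test URL with each library (httpbin.org is just a convenient public test endpoint):

```python
# Fetch a page with urllib (standard library) and with requests (third party).
from urllib.request import Request, urlopen

import requests

url = "https://httpbin.org/get"

# urllib: build a Request object, open it, and decode the raw bytes yourself
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urlopen(req, timeout=10) as resp:
    html_urllib = resp.read().decode("utf-8")

# requests: one call; text decoding and status checking are built in
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
resp.raise_for_status()
html_requests = resp.text
```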

BeautifulSoup and lxml are both Python page-parsing libraries. BeautifulSoup is DOM-based: it loads the whole document and builds the entire DOM tree, so its time and memory overhead can be large. lxml can do partial traversal and uses XPath to locate tags quickly. bs4 is written in Python while lxml is written in C, which is another reason lxml is faster than bs4.
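
A small sketch showing the two parsing styles side by side on the same snippet of HTML (the markup here is made up for illustration):

```python
# Parse the same HTML with BeautifulSoup (full tree) and lxml (XPath).
from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><div class='item'><a href='/a'>first</a></div></body></html>"

# BeautifulSoup: builds the whole DOM tree, then you search it
soup = BeautifulSoup(html, "lxml")
links = [a["href"] for a in soup.find_all("a")]

# lxml: locate the nodes directly with an XPath expression
tree = etree.HTML(html)
hrefs = tree.xpath("//div[@class='item']/a/@href")

print(links, hrefs)  # ['/a'] ['/a']
```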

Crawler frameworks

The most commonly used Python crawler frameworks are Scrapy and pyspider.

For how to use these frameworks and their details, refer to the official documentation.
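
As a minimal example, a Scrapy spider following the pattern of the official tutorial looks roughly like this (quotes.toscrape.com is a public practice site):

```python
# Run with: scrapy runspider quotes_spider.py -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # follow the pagination link, if any
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```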

Dynamic page rendering

1. URL request analysis

(1) Carefully analyze the structure of the page and see which actions its JavaScript responds to;

(2) Use the browser's developer tools to find the request URL that the JavaScript sends when those actions are triggered;

(3) Use that asynchronous request URL as a start_url, or yield it as a scrapy Request, and crawl it directly (see the sketch after this list).
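
A sketch of step (3), assuming the browser's network panel revealed a JSON API behind the page; the URL and the field names (results, next, title) are made-up placeholders:

```python
# Crawl the asynchronous API url directly instead of the rendered page.
import json

import scrapy


class ApiSpider(scrapy.Spider):
    name = "api_demo"
    # the request url discovered in the browser's developer tools
    start_urls = ["https://example.com/api/list?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get("results", []):
            yield {"title": item.get("title")}
        # paginate by yielding the next API request
        next_url = data.get("next")
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```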

2. Selenium

Selenium is a web automation testing tool, originally developed for automated website testing. It is a bit like the button-wizard macro tools used for games: it carries out operations automatically according to the commands you give it. The difference is that Selenium runs directly in the browser, and it supports all mainstream browsers (including headless ones such as PhantomJS).

Following our instructions, Selenium can make the browser automatically load pages, fetch the page content we need, take screenshots, or check whether certain actions have occurred on the site.

Selenium does not come with a browser of its own and provides no browser functionality itself; it has to be used together with a third-party browser.
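
A minimal Selenium sketch, assuming Chrome is installed (Selenium 4.6+ can fetch a matching driver by itself): load a JavaScript-rendered demo page, wait for the content to appear, then read the rendered HTML and take a screenshot.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("https://quotes.toscrape.com/js/")  # content is inserted by JS
    # wait until the JavaScript has added the quote elements
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
    )
    html = driver.page_source           # fully rendered HTML
    driver.save_screenshot("page.png")  # page screenshot, as mentioned above
finally:
    driver.quit()
```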

3. PhantomJS

When Selenium drives a real browser to fetch pages, it has to open the browser and render each page, so for large-scale crawling its efficiency is low and cannot meet the demand. In that case we can choose PhantomJS instead.

PhantomJS is a WebKit-based "headless" browser: it loads a website into memory and executes the JavaScript on the page. Because it never displays a graphical interface, it runs more efficiently than a full browser.

If we combine Selenium with PhantomJS, we get a very powerful crawler that can handle JavaScript, cookies, headers, and everything else a real user's browser would do.
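
A hedged sketch of the headless idea. Note that recent Selenium releases have removed the PhantomJS driver (older versions exposed webdriver.PhantomJS()), so the same approach is shown here with Chrome's headless mode instead:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # render pages without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://quotes.toscrape.com/js/")
    # JavaScript has run; cookies and headers behave as in a real browser
    html = driver.page_source
finally:
    driver.quit()
```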

4. Splash

Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API, implemented in Python on top of Twisted and Qt. Twisted and Qt give the service asynchronous processing capability so it can exploit WebKit's concurrency.

The Python library that connects Scrapy to Splash is scrapy-splash. Because scrapy-splash uses the Splash HTTP API, it needs a running Splash instance, which is usually run in Docker, so Docker has to be installed as well.
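
A minimal scrapy-splash sketch, assuming a Splash instance is already running in Docker (e.g. docker run -p 8050:8050 scrapinghub/splash) and that settings.py contains the scrapy-splash middleware entries plus SPLASH_URL = "http://localhost:8050" as described in its README:

```python
import scrapy
from scrapy_splash import SplashRequest


class JsSpider(scrapy.Spider):
    name = "js_demo"

    def start_requests(self):
        # render the page in Splash before it reaches parse()
        yield SplashRequest(
            "https://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 1.0},  # give the page's JavaScript time to run
        )

    def parse(self, response):
        for text in response.css("span.text::text").getall():
            yield {"text": text}
```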

5. spynner

spynner is a QtWebKit client that can simulate a browser: waiting for a page to finish loading, triggering events, filling in forms, and so on.

Strategies to keep a crawler from being blocked

1. Modify User-Agent

Changing the User-Agent is one of the most common ways to disguise a crawler as a browser.

The User-Agent is a string containing information about the browser, the operating system and so on; it is also regarded as part of the network protocol. The server uses it to judge whether the current visitor is a browser, a mail client, or a web crawler. You can view the User-Agent in request.headers; how to analyze packets and inspect information such as the User-Agent was mentioned earlier in this article.

Concretely, you can set the User-Agent to a real browser's value, or go further and maintain a User-Agent pool (a list, array, or dictionary will do) that stores many "browsers" and randomly pick one to set as the User-Agent on every request. That way the User-Agent keeps changing, which helps prevent the crawler from being blocked; see the sketch below.
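
A minimal sketch of the User-Agent pool idea with requests; the browser strings are just examples, and httpbin.org simply echoes back what the server saw:

```python
import random

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # pick a different "browser" for every request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

print(fetch("https://httpbin.org/user-agent").json())
```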

2. Disable cookies

Cookies are pieces of (sometimes encrypted) user data stored on the client side, and some websites use them to identify users. If one visitor keeps sending requests at a very high frequency, the site is likely to notice, suspect a crawler, and then use the cookie to identify that visitor and deny access.

Disabling cookies means the client actively refuses to let the server write them. Doing so prevents sites that might use cookies to recognize the crawler from banning us.

In Scrapy this is done by setting COOKIES_ENABLED = False in the crawler settings, i.e. disabling the cookies middleware so that no cookies are sent to the web server.
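
The corresponding lines in a Scrapy project's settings.py would look like this (COOKIES_DEBUG is optional and only shown for completeness):

```python
# settings.py: disable the cookies middleware so no cookies are stored or sent
COOKIES_ENABLED = False

# optional: log cookie traffic while debugging
COOKIES_DEBUG = False
```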

3. Set request interval

Large-scale, concentrated access puts considerable pressure on a server; a crawler can sharply raise server load in a short time. Note that the download wait time has to be kept within a reasonable range: if it is too long you cannot meet the need for large-scale crawling in a short period, and if it is too short you are very likely to be denied access.

Set a reasonable request interval that both keeps the crawler efficient and avoids putting too much pressure on the other party's server; a small sketch follows.
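
A minimal sketch of spacing out requests with a base delay plus random jitter; the URLs are placeholders. In Scrapy the equivalent knobs are the DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY settings.

```python
import random
import time

import requests

urls = [f"https://httpbin.org/get?page={i}" for i in range(1, 4)]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    # wait 1-3 seconds before the next request
    time.sleep(1 + random.uniform(0, 2))
```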

4. Proxy IP pool

Weibo, for example, actually identifies the IP, not the account. That means that when you need to fetch a lot of data continuously, simulated login alone is pointless: as long as the IP stays the same, switching accounts achieves nothing. The key is to change the IP.

One of a web server's countermeasures against crawlers is to ban an IP, or a whole IP segment, from accessing the site. Once an IP is banned, switching to another IP lets crawling continue. Methods: proxy IPs, or a local IP database (an IP pool), as sketched below.
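
A minimal sketch of picking a random proxy from a pool for each request; the proxy addresses below are placeholders, not working proxies.

```python
import random

import requests

PROXY_POOL = [
    "http://111.111.111.111:8080",
    "http://122.122.122.122:3128",
]

def fetch_via_proxy(url):
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}
    try:
        return requests.get(url, proxies=proxies, timeout=10)
    except requests.RequestException:
        # this proxy failed; a real pool would retry with another one
        return None

resp = fetch_via_proxy("https://httpbin.org/ip")
if resp is not None:
    print(resp.json())  # shows the IP the server saw
```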

5. Selenium

Using Selenium to simulate human clicks on the site is a very effective way to avoid being blocked. However, Selenium is inefficient and not suitable for large-scale data capture.

6. Crack the CAPTCHA

CAPTCHAs are now the most common means of stopping crawlers. Capable readers can write their own algorithms to crack them, but in general it is easier to spend a little money on a third-party CAPTCHA-solving platform's API.

Epilogue

The content above is a light tidy-up of Python crawler knowledge; for any specific point you will need to dig into the details of that technique yourself. I hope it is of some help to those who want to learn about crawlers.


Reproduced from: https://juejin.im/post/5cf62d2d6fb9a07eab686d35
