Technical aspects of web crawlers: learning Python crawling

Principles

A traditional crawler starts from the URLs of one or more initial pages, extracts new URLs from those pages while crawling, and keeps pushing the newly discovered URLs onto a queue until a stop condition set by the system is met. The workflow of a focused crawler is more complex: it filters out off-topic links with page-analysis algorithms, keeps the useful links, and places them in the queue of URLs waiting to be crawled.

It then selects the next URL to crawl from the queue according to a given search strategy and repeats the process until a stop condition is reached. In addition, all pages fetched by the crawler are stored by the system, then analyzed, filtered, and indexed for later search and retrieval.

A complete crawler therefore typically includes the following three modules:

  1. Network request module
  2. Crawl flow control module
  3. Content analysis and extraction module

Network requests

A crawler is, at its core, a series of HTTP(S) requests: it finds a link to crawl, sends a request, and receives a response. (HTTP long connections (keep-alive) and WebSocket-based streams used by HTML5 pages are not considered here.)

So the core elements are:

  1. The URL
  2. The request headers and body
  3. The response headers and content

URL

A crawler needs an initial URL to start from; it then parses the links out of the HTML it fetches and continues crawling from them. The process is like a tree: starting from the root, each step adds new nodes. To make sure the crawler eventually terminates, a maximum crawl depth (Depth) is usually specified.
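A minimal sketch of this idea, using requests and a simple regular expression to pull out links (the start URL and depth limit are placeholders; a real crawler would use a proper HTML parser):

    from collections import deque
    import re
    import requests

    START_URL = "https://example.com"   # placeholder start page
    MAX_DEPTH = 2                       # stop condition: maximum crawl depth

    seen = {START_URL}
    queue = deque([(START_URL, 0)])     # (url, depth) pairs

    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # ... analyze / store the page here ...
        if depth >= MAX_DEPTH:
            continue
        # naive link extraction just for illustration
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))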

HTTP requests

An HTTP request consists of three parts: the request method (method), the request headers (headers), and the request body (body). Since the method normally appears on the first line of the header block, the request headers can be said to include the request method. Below is part of a typical set of request headers sent by Chrome:
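For illustration only; the host, versions, and cookie values below are placeholders:

    GET / HTTP/1.1
    Host: www.example.com
    Connection: keep-alive
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.9
    Referer: https://www.example.com/
    Cookie: session_id=...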

For crawlers, note that when the request method is POST, the request parameters need to be URL-encoded before they are sent; after receiving the request, the backend may also run some checks on the request information, which can affect the crawl. The relevant header fields are as follows:

  • Basic Auth

This is an old and insecure form of user authentication, generally used where access is limited to authorized users. The user name and password (effectively plain text, only base64-encoded) are placed in the Authorization header field; if validation fails, the request fails. This authentication scheme is being phased out.

  • Referer

The source link. Normally, when you visit a link, the browser carries a Referer field; the server may verify the origin of the request against it, and backends typically use this field as the basis for hotlink protection.

  • User-Agent

The backend usually uses this field to determine the device type and model, operating system, and browser version of the user. The network libraries of some programming languages send their own default User-Agent, which is easy to recognize; a crawler can set this field to a real browser UA instead.

  • Cookie

Generally, after a user logs in or performs certain operations, the server includes Cookie information in the response packet and asks the browser to set it. A request without the required Cookie is easily recognized as forged.

A Cookie may also be generated by local JavaScript, which processes some information returned by the server into an encrypted value and sets it in the Cookie.

  • JavaScript encryption

Sensitive data is usually encrypted with JavaScript before transmission. For example, when logging in to Qzone (QQ space), the user's password is RSA-encrypted before it is sent to the server. A crawler simulating the login therefore has to request the public key itself and perform the same encryption.
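A rough sketch of that idea with the third-party rsa package, assuming the site exposes a PEM public key at some endpoint (the URL and form fields here are made up):

    import base64
    import requests
    import rsa

    # hypothetical endpoint that returns the site's RSA public key in PEM format
    pem = requests.get("https://example.com/login/pubkey").content
    pub_key = rsa.PublicKey.load_pkcs1_openssl_pem(pem)

    # encrypt the password the same way the page's JavaScript would
    cipher = rsa.encrypt("my-password".encode("utf-8"), pub_key)
    payload = {"user": "alice", "password": base64.b64encode(cipher).decode()}
    resp = requests.post("https://example.com/login", data=payload)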

  • Custom Fields

Since HTTP headers allow custom fields, third parties may add custom field names or field values; this is something to watch out for.
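Putting the header fields above together, a sketch with requests might look like this (the URL, cookie, and form fields are placeholders; requests URL-encodes the data dict for you):

    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/90.0 Safari/537.36",
        "Referer": "https://example.com/list",
        "Cookie": "session_id=...",           # or use a requests.Session to manage cookies
        "X-Custom-Token": "value",            # hypothetical custom field
    }
    data = {"page": 1, "keyword": "python"}   # sent as application/x-www-form-urlencoded
    resp = requests.post("https://example.com/search", headers=headers, data=data, timeout=10)
    print(resp.status_code)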

Process Control

The so-called crawl flow is simply the rules and order by which pages are crawled. When the crawl task is not large, flow control is not much trouble; many crawling frameworks, such as Scrapy, have already handled it for you, and you only need to implement the parsing code yourself.

But when crawling large sites, for example grabbing all of JD.com's product comments or the follow relationships of every Weibo user, you are looking at on the order of a billion to ten billion requests, and efficiency must be considered. Otherwise, with only 86,400 seconds in a day, fetching 100 pages per second yields only 8.64 million requests a day, and it would still take more than 100 days to reach a billion requests.
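The arithmetic behind that estimate:

    requests_per_second = 100
    seconds_per_day = 24 * 60 * 60                              # 86,400
    requests_per_day = requests_per_second * seconds_per_day    # 8,640,000
    days_for_one_billion = 1_000_000_000 / requests_per_day     # ~115.7 days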

When a large-scale crawl is involved, the crawler has to be well designed. Many open-source crawler frameworks are limited here, because a lot of other problems come into play, such as the data structures used, de-duplication of already-crawled URLs, and, most importantly, making full use of the available bandwidth.

So distributed crawling is important, and flow control then becomes critical. The key point is scheduling the threads of multiple distributed machines: usually they share one URL queue and the threads communicate through messages, so if you want to crawl faster, the throughput requirements on the intermediate messaging system also go up.

There are now open-source options for distributed crawling, such as scrapy-redis, a framework that overrides Scrapy's scheduling module (queue and pipeline) and uses a Redis database as the request queue shared across distributed nodes; scrapyd is used to deploy Scrapy projects, and scrapyd-api is used to start them and fetch data.
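The shared-queue idea itself is small enough to sketch directly with redis-py (the Redis host, key names, and URL are placeholders; scrapy-redis wires an equivalent mechanism into Scrapy for you):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # producer side: push newly discovered URLs, de-duplicated with a set
    url = "https://example.com/item/1"
    if r.sadd("crawler:seen", url):         # returns 1 only the first time the URL is seen
        r.lpush("crawler:queue", url)

    # worker side (any machine or thread): block until a URL is available
    item = r.brpop("crawler:queue", timeout=5)
    if item:
        _key, next_url = item
        print("crawling", next_url.decode())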

Content Extraction

The Accept-Encoding field in the request headers tells the server which compression algorithms the browser supports (gzip is currently the most common). If the server has compression enabled, the response body is returned compressed, and the crawler has to decompress it itself.
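A small sketch with the standard library (requests, by contrast, decompresses gzip responses transparently); the URL is a placeholder:

    import gzip
    import urllib.request

    req = urllib.request.Request("https://example.com",
                                 headers={"Accept-Encoding": "gzip"})
    resp = urllib.request.urlopen(req)
    raw = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(raw)     # crawler decompresses the body itself
    else:
        body = raw
    print(body[:200])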

In the past, the content we needed to fetch came mainly from the HTML document itself; in other words, by the time we decided to crawl a page, its content was already in the HTML. In recent years, however, with the rapid development of web technology there are more and more dynamic pages, especially on mobile, where SPA applications are everywhere and these sites make heavy use of Ajax.

So the pages we see in the browser are not all contained in the HTML document; much of the content is generated dynamically by JavaScript. In general, what we finally see on a page comes from the following three sources:

  • The HTML document itself contains the content

This case is the easiest to handle. Generally these are static pages whose content has been written in advance, or dynamic pages rendered from templates, so the HTML the browser receives already contains all the key information, and whatever you see on the page can be extracted directly from the corresponding HTML tags.

Handling this case is very simple; there are a few common methods (a short sketch follows the list):

  1. CSS selectors
  2. XPath (this is worth learning about)
  3. Plain string search or regular expressions
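For example, with lxml (XPath) or BeautifulSoup (CSS selectors); the markup and tag names below are placeholders:

    from lxml import html
    from bs4 import BeautifulSoup

    page = "<html><body><h1 class='title'>Hello</h1></body></html>"

    # XPath with lxml
    tree = html.fromstring(page)
    print(tree.xpath("//h1[@class='title']/text()"))   # ['Hello']

    # CSS selectors with BeautifulSoup
    soup = BeautifulSoup(page, "html.parser")
    print(soup.select_one("h1.title").get_text())      # Hello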
  • JavaScript code loads the content

Generally speaking, there are two cases. In the first, the HTML document is requested and the page's data is present in the JavaScript code but not in the HTML tags; the page looks normal in a browser because the content is in fact added into the tags dynamically when the JS code executes.

In this case the content lives inside the JS code, and the JS only runs in the browser. When a program requests the page URL, the response it gets contains the page markup and the JS code, so the content is visible in a browser; but since the JS is never executed by the crawler, the specified HTML tags will be found empty when parsing. Baidu's home page is like this. The usual approach here is to locate the JS string that contains the content and extract it with a regular expression, rather than by parsing HTML tags.
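A sketch of that approach (the variable name initialData and the page structure are made up, and the regex assumes the whole assignment sits on one line):

    import json
    import re
    import requests

    page = requests.get("https://example.com").text

    # look for something like:  var initialData = {...};  inside a <script> block
    match = re.search(r"var\s+initialData\s*=\s*(\{.*\})\s*;", page)
    if match:
        data = json.loads(match.group(1))
        print(data.keys())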

The other case arises during user interaction, where JavaScript may dynamically generate some DOM, for example a dialog box that pops up when a button is clicked. These elements are usually prompts shown to the user and contain nothing of value; if you really need them, you can analyze the JS execution logic, but this is rare.

  • Ajax / Fetch asynchronous requests

This situation is very common now, especially where content is displayed in tabs on a page, where content is fetched without refreshing the page, or where a page loads content after some interaction. For such pages, we need to trace and analyze all the requests and observe at which step the content is actually loaded. Once we find the core asynchronous request, we just fetch that request directly; if the original page contains no useful information, there is no need to fetch the original page at all.
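A sketch of fetching such an endpoint directly (the URL, parameters, and response shape are hypothetical, the kind of thing you would find by watching the browser's Network panel):

    import requests

    api = "https://example.com/api/comments"          # hypothetical Ajax endpoint
    params = {"product_id": 12345, "page": 1}
    headers = {"X-Requested-With": "XMLHttpRequest"}  # some backends check this
    resp = requests.get(api, params=params, headers=headers, timeout=10)
    for item in resp.json().get("data", []):
        print(item)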

The current state of crawler technology

Language

In theory, any language that supports network communication can be used to write a crawler; the crawler itself has little to do with the language, although some languages are smoother and simpler for it. At present, most crawlers are written in back-end scripting languages, among which Python is without doubt the most widely used, and it has spawned many excellent libraries and frameworks such as Scrapy, BeautifulSoup, pyquery, and Mechanize.

In general, though, search-engine spiders have higher requirements for crawl efficiency and will use C++, Java, or Go (for high concurrency). I implemented a multithreaded crawler framework in C++ back in college, but found its efficiency gain over a Python crawler was not obvious. The reason is that for a simple crawler the bottleneck lies in analysis and data extraction, and network efficiency does not have much to do with the language.

It is worth mentioning that Node.js has developed very fast in recent years, putting JavaScript everywhere, and some people have started writing crawlers with Node. However, this is really no different from, and no simpler than, other back-end scripting languages such as Python, because in Node you still cannot issue the page's Ajax requests or execute the original page's DOM operations.

The JavaScript execution environment in Node is simply not the same as the execution environment in a browser. So couldn't I just write the crawler with JS in the browser, the way I'd like, and extract the content with jQuery?

Operating Environment

The crawler itself does not care whether it runs on Linux, Windows, or macOS, but from a business perspective we generally run crawlers on the server side (in the background), and call them background crawlers. Nowadays, almost all crawlers are background crawlers.

Three problems of background crawlers

As popular as background crawlers are, they also face a few thorny problems that still have no good solution. In the final analysis, the root cause of these problems lies in the inherent nature of background crawlers. Before the formal discussion, let us think about one question: "What are the similarities and differences between a crawler and a browser?"

Similarities

Both essentially request data from the Internet over the HTTP/HTTPS protocol.

Differences

  1. A crawler is generally an automated process that does not need to interact with a user, unlike a browser.
  2. The running environments differ: a browser runs on the client, while a crawler generally runs on the server side.
  3. The capabilities differ: a browser contains a rendering engine and a JavaScript virtual machine, while a crawler generally has neither.

Knowing this, let us look at the problems background crawlers face.

Problem one: interaction

Some pages require user interaction before proceeding to the next step, such as entering a verification code, dragging a slider, or selecting a few characters. Websites do this mostly to verify whether the visitor is a human or a machine.

A crawler has a hard time with this. Traditional simple CAPTCHAs can be read by image-processing algorithms, but CAPTCHAs have become more varied, trickier, and more infuriating (especially when buying train tickets), so this problem keeps getting worse.

Problem two: JavaScript parsing

As described earlier, JavaScript can dynamically generate the DOM. Most web pages today are dynamic (their content is filled in by JavaScript), especially on mobile, where SPA/PWA apps are more and more popular. On most pages the useful data is fetched dynamically via Ajax/Fetch and then filled into the page's DOM tree by JS; simple static HTML pages contain little useful data.

At present, the usual way to handle such Ajax/Fetch requests is to request the Ajax/Fetch URL directly, but some Ajax request parameters depend on values generated dynamically by JavaScript, such as a request signature, or the password encryption performed when a user logs in, and so on.

If you go and reimplement that JavaScript in a back-end script, you have to understand the original page's code logic thoroughly. That is not only very troublesome and makes your crawling code unusually bloated, but, more fatally, some things JavaScript does are difficult or impossible for a crawler to imitate at all; for example, some sites use CAPTCHAs that require dragging a slider to a certain position, which is very hard for a crawler to mimic.

In fact, summing up, these shortcomings ultimately come down to the fact that a crawler is not a browser and has no JavaScript parsing engine. The main coping strategy is to introduce a JavaScript engine into the crawler, such as PhantomJS, but this also has obvious drawbacks, for example when one server runs several crawl tasks at the same time, it consumes too many resources.

In addition, these windowless JavaScript engines often do not behave the way a real browser environment does; when a jump happens inside a page, the process becomes hard to control.
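As one concrete illustration of this approach, a headless Chrome driven by Selenium (an alternative to PhantomJS) can render the JavaScript before extraction; the URL is a placeholder and a matching chromedriver is assumed to be installed:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")           # no visible window
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get("https://example.com")     # JS runs here, as in a real browser
        rendered_html = driver.page_source    # DOM after JavaScript execution
        print(len(rendered_html))
    finally:
        driver.quit()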

Problem three: IP restrictions

This is the most deadly issue for background crawlers. A website's firewall limits the number of requests from one IP within a certain period: below the limit, data is returned normally; above it, the request is rejected, as with the QQ mailbox.

It is worth noting that the IP limit is sometimes not aimed specifically at crawlers; most of the time it is a defensive measure against DoS attacks, put in place for site security. When a background machine crawls with a limited set of IPs, it easily hits the limit and gets its requests rejected. The main countermeasure at present is to use proxies, which increases the number of usable IPs, but proxy IPs are still limited, so this problem cannot be fully solved.
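A sketch of routing requests through a proxy with requests (the proxy address and credentials are placeholders; a real crawler would rotate through a pool of proxies):

    import requests

    proxies = {
        "http": "http://user:pass@10.0.0.1:8080",    # placeholder proxy
        "https": "http://user:pass@10.0.0.1:8080",
    }
    resp = requests.get("https://example.com", proxies=proxies, timeout=10)
    print(resp.status_code)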


Source: blog.csdn.net/kkk123789/article/details/92187401