What are the advantages and disadvantages of open source crawler frameworks?

Author: Lao Xia


Should I choose Nutch, Crawler4j, WebMagic, scrapy, WebCollector, or something else for developing a web crawler? Here are some opinions based on my own experience:

The crawlers mentioned above can be roughly divided into three categories:

1. Distributed crawlers: Nutch
2. JAVA stand-alone crawlers: Crawler4j, WebMagic, WebCollector
3. Non-JAVA stand-alone crawlers: scrapy

Category 1: Distributed crawlers

Crawlers go distributed mainly to solve two problems:

1) Mass URL management

2) Network speed (bandwidth)

The most popular distributed crawler today is Apache's Nutch. But for most users, Nutch is the worst choice among these crawlers, for the following reasons:

1) Nutch is a crawler designed for search engines, while most users need a crawler for precise data crawling (fine extraction). Roughly two-thirds of the processing Nutch does exists only to serve a search engine and is meaningless for fine extraction; in other words, using Nutch for data extraction wastes a lot of time on unnecessary computation. And if you try to bend Nutch to an extraction business through secondary development, you will basically destroy Nutch's framework and change it beyond recognition. If you have the ability to modify Nutch that deeply, you are really better off writing a new distributed crawler framework yourself.

2) Nutch relies on hadoop to run, and hadoop itself adds considerable overhead. If the cluster has only a few machines, the crawling speed is often slower than that of a single-machine crawler.

3) Nutch does have a plugin mechanism, and it is promoted as a highlight. You can find some open source Nutch plugins that provide precise extraction. But anyone who has actually developed a Nutch plugin knows how crappy Nutch's plugin system is. Loading and calling plugins via reflection makes programs extremely difficult to write and debug, let alone building a complex extraction system on top of it. Moreover, Nutch does not provide suitable plugin mount points for fine extraction. Nutch has only five or six mount points, all of which serve the search engine and none of which are meant for fine extraction. Most of Nutch's fine-extraction plugins are mounted on the page-parsing (parser) mount point, which is really intended to parse out links (to supply URLs for subsequent crawling) and to provide some easily extracted page information for the search engine (meta information, the page's body text).

4) When building a crawler through secondary development of Nutch, the time needed to write and debug it is often more than ten times what a single-machine crawler would take. The cost of learning the Nutch source code is very high, not to mention that everyone on the team must understand it, and during debugging you will hit all sorts of problems outside the program itself (hadoop problems, hbase problems).

5) Many people say that Nutch2 has gora, which can persist data to avro files, hbase, mysql, etc. Many people actually misunderstand this: the persisted data here refers to URL information (the data needed for URL management) stored in avro, hbase, or mysql, not the structured data you want to extract. In fact, for most people, where the URL information is stored does not matter.

6) The Nutch2 line is currently not suitable for development. The official stable release of Nutch2 is nutch2.2.1, but that version is bound to gora-0.3. If you want to use hbase with Nutch (most people use Nutch2 precisely in order to use hbase), you can only use hbase around version 0.90, and correspondingly the hadoop version has to drop to around hadoop 0.20. The official Nutch2 tutorials are also quite misleading: there are two tutorials, for Nutch1.x and Nutch2.x, and the Nutch2.x tutorial on the official site says it supports hbase 0.94. But that "Nutch2.x" actually means the code after Nutch2.2.1 and before Nutch2.3, which is continuously updated in the official SVN and is very unstable (it is being modified all the time).

So, if you are not building a search engine, try not to choose Nutch as your crawler. Some teams like to follow the trend and insist on choosing Nutch to develop a fine-extraction crawler, mainly because of Nutch's reputation (Nutch's author is Doug Cutting). Of course, the usual result is that the project gets delayed.

If you are building a search engine, Nutch1.x is a very good choice. Nutch1.x together with solr or es makes a very powerful search engine. If you must use Nutch2, it is recommended to wait until Nutch2.3 is released; the current Nutch2 is a very unstable version.

Category 2: JAVA stand-alone crawlers

JAVA crawlers are put in a separate category here because JAVA's web crawler ecosystem is very mature and the related documentation is the most complete. There may be controversy here; I'm just rambling off the cuff.

In fact, developing an open source web crawler (framework) is very simple; the difficult and complex problems (such as DOM tree parsing and element positioning, character-set detection, massive URL deduplication) have already been solved by others, so you could say there is no real technical content. That includes Nutch: the real technical difficulty in Nutch is hadoop itself, and the crawler code is very simple. In a sense, a web crawler is like traversing the files on your own machine to find information in them; there is no difficulty at all. The reason for choosing an open source crawler framework is simply to save trouble. For example, modules such as URL management and thread pools can be written by anyone, but it takes a period of debugging and tweaking to make them stable; a minimal sketch of such modules follows.
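As a rough illustration of how little is involved, here is a minimal sketch (not from the original article) of a deduplicating URL queue plus a fixed-size worker pool. The class and method names (UrlFrontier, fetch) are made up for illustration.

```java
import java.util.Set;
import java.util.concurrent.*;

// Minimal sketch of the "anyone can write them" modules mentioned above:
// a deduplicating URL queue plus a fixed-size worker pool.
public class UrlFrontier {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();       // URL dedup
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(10); // crawl threads

    public void offer(String url) {
        if (seen.add(url)) {          // only enqueue URLs we have not seen before
            queue.offer(url);
        }
    }

    public void run() throws InterruptedException {
        String url;
        while ((url = queue.poll(5, TimeUnit.SECONDS)) != null) {
            final String u = url;
            pool.submit(() -> fetch(u)); // download in the worker pool
        }
        pool.shutdown();
    }

    private void fetch(String url) {
        // download the page, parse out new links, and call offer() on each of them
    }
}
```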

As for crawler features, the questions users care about most are usually:

1) Does the crawler support multi-threading? Can it use proxies? Will it crawl duplicate data? Can it crawl information generated by JS?

A tool that does not support multi-threading, proxies, or duplicate-URL filtering is not an open source crawler; it is just a loop that executes http requests.

Whether JS-generated information can be crawled has little to do with the crawler itself. A crawler is mainly responsible for traversing websites and downloading pages; extracting JS-generated information is the job of the page-extraction module, which usually has to simulate a browser (htmlunit, selenium). These simulated browsers often spend a lot of time processing a single page, so one strategy is to use the crawler to traverse the site and, when a page needs JS parsing, hand the relevant page information to the simulated browser to finish extracting the JS-generated content. A rough sketch of this follows.
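For example, a hedged sketch of handing a page to HtmlUnit. The class below is hypothetical, and the HtmlUnit calls shown (WebClient, getOptions, waitForBackgroundJavaScript, asXml) are from the classic com.gargoylesoftware 2.x builds; newer releases moved to the org.htmlunit package, so adjust imports accordingly.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Sketch of the "hand JS-heavy pages to a simulated browser" strategy.
public class JsPageFetcher {
    public static String render(String url) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage(url);
            client.waitForBackgroundJavaScript(5_000); // let asynchronous JS finish
            return page.asXml();                       // DOM after JS has run
        }
    }
}
```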

2) Can crawlers crawl ajax information?

Some data on web pages is loaded asynchronously. There are two ways to crawl it: use a simulated browser (described in question 1), or analyze the ajax http requests, generate the ajax request URLs yourself, and fetch the returned data. If you generate the ajax requests yourself, what is the point of using an open source crawler? The point is to reuse the open source crawler's thread pool and URL management features (such as resumable crawling). A sketch of the direct-request approach follows.
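As a minimal sketch of the "generate the ajax request yourself" approach, here is an example using the Java 11+ built-in HttpClient. The endpoint, the page parameter, and the header are made-up placeholders; in practice you would copy the real request from the browser's network panel.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: build the AJAX request URLs yourself and fetch the JSON directly.
public class AjaxFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (int pageNo = 1; pageNo <= 3; pageNo++) {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/list?page=" + pageNo))
                    .header("X-Requested-With", "XMLHttpRequest") // mimic the browser's AJAX call
                    .build();
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.body()); // raw JSON to be parsed and persisted
        }
    }
}
```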

If I can already generate the ajax requests (list) I need, how can I use these crawlers to crawl these requests?

Crawlers are usually designed for breadth-first or depth-first traversal of static or dynamic pages. Crawling ajax information belongs to the deep web category, and although most crawlers do not support it directly, it can still be done in some ways. For example, WebCollector traverses websites breadth-first, and the crawler's first round fetches every URL in the seed set (seeds). Simply put, you feed the generated ajax requests to the crawler as seeds and run a breadth traversal of depth 1 over them (the default traversal is breadth-first).

3) How does the crawler crawl websites that require login?

These open source crawlers all support specifying cookies when crawling, and simulated login mainly relies on cookies. How to obtain the cookies is not the crawler's concern: you can obtain them manually, simulate the login with http requests, or log in automatically with a simulated browser. A sketch of attaching a cookie to requests follows.
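A minimal sketch of crawling a login-protected page by attaching a cookie obtained elsewhere (manually, via a scripted login, or via a simulated browser). The URL and cookie value are placeholders, and the Java 11+ HttpClient is used only for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: reuse a logged-in session by sending its cookie with every request.
public class LoggedInFetch {
    public static void main(String[] args) throws Exception {
        String sessionCookie = "JSESSIONID=PUT-YOUR-SESSION-ID-HERE"; // obtained outside the crawler
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/member/only/page"))
                .header("Cookie", sessionCookie) // attach the session cookie
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode());
    }
}
```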

4) How do crawlers extract information from web pages?

Open source crawlers generally integrate web page extraction tools, mainly supporting two kinds of selectors: CSS SELECTOR and XPATH. As for which is better, I will not comment here. A small CSS-selector example follows.
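As a small illustration of the CSS SELECTOR style, here is a jsoup sketch; the HTML snippet and the selectors are made-up placeholders. XPath-based extractors work the same way, just with a different selector syntax.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch: extract fields from a downloaded page with CSS selectors.
public class ExtractDemo {
    public static void main(String[] args) {
        String html = "<div class=\"post\"><h1 class=\"title\">Hello</h1>"
                    + "<a class=\"link\" href=\"/p/1\">more</a></div>";
        Document doc = Jsoup.parse(html);
        String title = doc.select("div.post h1.title").text(); // CSS selector
        String link  = doc.select("a.link").attr("href");
        System.out.println(title + " -> " + link);
    }
}
```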

5) How does the crawler save the information of the web page?

Some crawlers come with a module responsible for persistence. For example, webmagic has a module called pipeline: with simple configuration, the information extracted by the crawler can be persisted to files, databases, and so on. Other crawlers, such as crawler4j and webcollector, do not provide a data persistence module and leave it to the user to add the database write inside the page-handling module (a sketch follows). Whether the pipeline approach is better is much like asking whether using an ORM to operate a database is better: it depends on your business.
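For the crawler4j / webcollector style, the persistence code lives directly in your page-handling module. A minimal JDBC sketch follows; the table, columns, and connection settings are made-up placeholders, and a suitable JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch: persist extracted fields from inside the page-handling code.
public class PageSaver {
    public void save(String url, String title, String body) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/crawl", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)")) {
            ps.setString(1, url);
            ps.setString(2, title);
            ps.setString(3, body);
            ps.executeUpdate();
        }
    }
}
```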

6) What should I do if the crawler is blocked by the website?

When a crawler is blocked by a website, this can usually be solved by using multiple proxies (random proxies). However, these open source crawlers generally do not support random proxy switching directly, so users often need to put the acquired proxies into a global array and write a bit of code that randomly picks a proxy from that array, as sketched below.
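A minimal sketch of that workaround: a global proxy list plus a random pick. The proxy addresses are placeholders; the chosen proxy would then be passed to URLConnection.openConnection(Proxy) or configured on whatever HTTP client the crawler uses.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: global proxy pool with random selection per request.
public class ProxyPool {
    private final List<Proxy> proxies = List.of(
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.0.0.1", 8080)),
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.0.0.2", 8080)));

    public Proxy random() {
        return proxies.get(ThreadLocalRandom.current().nextInt(proxies.size()));
    }
}
```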

7) Can a web page call a crawler?

The crawler is invoked on the web application's server side; just use it the way you normally would. All of these crawlers can be used that way.

8) What is the speed of the crawler?

A single-machine open source crawler can basically saturate the local network connection. If the crawler is slow, it is usually because the user has reduced the thread count, the network is slow, or interaction with the database during persistence is slow, and those things are determined by the user's machine and the secondary development code. The speed of these open source crawlers themselves is perfectly fine.

9) Obviously the code is written correctly, but the data cannot be crawled. Is there a problem with the crawler? Can another crawler solve it?

If the code is written correctly and the data still cannot be crawled, other crawlers will not be able to crawl it either. In that case, either the website has blocked you, or the data you want is generated by javascript. Switching crawlers will not solve it.

10) Which crawler can judge whether a website has been crawled completely? Which crawler can crawl by topic?

A crawler cannot judge whether a website has been crawled completely; it can only cover it as thoroughly as possible.

As for crawling by topic, the crawler only knows what the topic is after it has downloaded the content, so it usually crawls everything first and then filters the content. If the crawl is too broad, you can narrow the scope by restricting URLs with regular expressions, as in the sketch below.
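A minimal sketch of such a URL restriction; the regular expression and the site are made-up placeholders.

```java
import java.util.regex.Pattern;

// Sketch: narrow a crawl by only following URLs that match a regular expression.
public class UrlFilter {
    private static final Pattern ALLOWED =
            Pattern.compile("https?://example\\.com/news/\\d+\\.html");

    public static boolean shouldCrawl(String url) {
        return ALLOWED.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl("https://example.com/news/123.html")); // true
        System.out.println(shouldCrawl("https://example.com/about"));         // false
    }
}
```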

11) Which crawler's design pattern and architecture are better?

Design patterns are bullshit. Saying a piece of software has good design patterns just means that after the software was developed, a few design patterns were summarized out of it. Design patterns provide no guidance for software development, and designing a crawler around design patterns will only make it more bloated.

As for architecture, open source crawlers are currently mostly about the design of specific data structures, such as the crawling thread pool and the task queue, which anyone can handle. A crawler's business logic is too simple to deserve being called an architecture.

So for JAVA open source crawlers, I think you should just pick one that works well enough. If your business is complex, whichever crawler you pick will only meet your needs after complex secondary development.

Category 3: Non-JAVA stand-alone crawlers

There are many excellent crawlers written in non-JAVA languages. They are listed as a separate category here not to discuss the quality of the crawlers themselves, but to discuss the impact that crawlers such as larbin and scrapy have on development cost.

Let's talk about python crawlers first. Python can finish in 30 lines of code what takes JAVA 50 lines. Python really is fast to write, but in the debugging stage, debugging python code often eats far more time than was saved while coding. When developing in python, you have to write extra test modules to ensure the correctness and stability of the program. Of course, if the crawling scale is not large and the crawling business is not complicated, scrapy is quite good and can finish the crawling task easily.

For C++ crawlers, the learning cost is relatively high. And you cannot count the learning cost of just one person: if the software needs to be developed or handed over within a team, that is the learning cost of many people. Debugging such software is not easy either.

There are also some ruby and php crawlers, which I will not evaluate much here. There really are some very small data collection tasks for which ruby or php is convenient. But when choosing open source crawlers in these languages, you need to investigate their ecosystem on the one hand, and on the other hand these crawlers may have bugs you cannot track down (few people use them, so there is also little information).

End.

This article is reproduced from http://chuansong.me/n/1899650
For relevant copyrights, please contact the original author
