Comparison of several open source crawler frameworks



The first category: distributed crawlers

Crawlers go distributed mainly to solve two problems:

1) massive URL management
2) network speed

The best-known distributed crawler is Apache Nutch, but for most users Nutch is a poor choice, for the following reasons:

1) Nutch is a crawler designed for search engines, while most users need a crawler for precise (fine-grained) data extraction. Two-thirds of the processing Nutch performs exists to serve search engines and contributes little to extraction. In other words, using Nutch for data extraction wastes a great deal of time on unnecessary computation. And if you try to bend Nutch toward an extraction business through secondary development, you will essentially destroy Nutch's framework and change it beyond recognition; if you have the ability to modify Nutch that deeply, you would be better off writing a new distributed crawler framework yourself.

2) Nutch does have a plug-in mechanism, and it is promoted as a highlight. You can find open source Nutch plug-ins that provide precise extraction functions. But anyone who has developed a Nutch plug-in knows how clumsy the plug-in system is. Loading and invoking plug-ins through reflection makes programs extremely difficult to write and debug, let alone build a complex extraction system on top of it. Moreover, Nutch does not provide suitable mount points for fine-grained extraction: it exposes only five or six mount points, all of which serve the search-engine workflow. Most of Nutch's extraction plug-ins are mounted on the page-parsing (parser) mount point, which is really meant to parse links (to provide URLs for subsequent crawling) and to extract information that search engines can use easily (meta information and the page's body text).

3) Many people point out that Nutch 2 has Gora, which can persist data to Avro files, HBase, MySQL, and so on, but this is widely misunderstood. The persisted data referred to here is URL information (the data required for URL management), not the structured data you want to extract. For most people, it does not matter where the URL information is stored.

So, if you are not building a search engine, try not to choose Nutch as your crawler. Some teams like to follow the trend and insist on choosing Nutch to develop a precise-extraction crawler, mostly because of Nutch's reputation (its author is Doug Cutting). The usual result is that the project falls behind schedule.

The second category: JAVA stand-alone crawlers
JAVA crawlers are put into a separate category here because the JAVA web-crawler ecosystem is very complete and the related documentation is the most abundant. This may be controversial; I am just rambling casually here.

Regarding crawler functionality, the questions users care about most are often the following:

1) Does the crawler support multi-threading? Can it use proxies? Can it filter duplicate URLs?

Anything that does not support multi-threading, does not support proxies, and cannot filter duplicate URLs is not an open source crawler; it is just a loop that executes HTTP requests.

2) Can the crawler crawl AJAX information?

If I can already generate the list of AJAX requests I need, how do I use these crawlers to crawl those requests?

3) How does the crawler crawl websites that require login?

4) How do crawlers extract information from web pages?

5) How does the crawler save the information from web pages?

6) What should I do if the crawler is blocked by the website?

7) Can a web page call a crawler?

8) What is the speed of the crawler?

9) The code is clearly written correctly, yet no data can be crawled. Is the crawler at fault? Would switching to another crawler solve it?

10) Which crawler can tell whether a website has been crawled completely, and which crawler can crawl by topic?

As for crawling by topic: the crawler only knows what the topic of a page is after it has crawled the content. So the usual approach is to crawl everything first and then filter the content. If the crawl is too broad, you can narrow the scope by restricting URLs with regular expressions, as sketched below.
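A minimal sketch of this approach with scrapy (the site example.com, the `/news/` URL pattern, and the topic keyword "crawler" are all hypothetical): a CrawlSpider restricts the crawl scope with a URL regular expression, downloads everything inside that scope, and only filters pages by content after they have been downloaded.

```python
import re

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TopicSpider(CrawlSpider):
    """Restrict crawl scope with a URL regex, then filter downloaded pages by topic."""
    name = "topic_spider"
    allowed_domains = ["example.com"]        # hypothetical site
    start_urls = ["https://example.com/"]

    # Narrow the scope: only follow URLs matching this regular expression.
    rules = (
        Rule(LinkExtractor(allow=r"/news/\d{4}/"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Topic filtering can only happen after the page has been downloaded:
        # keep the page only if its text mentions the (hypothetical) topic keyword.
        text = " ".join(response.css("body ::text").getall())
        if re.search(r"crawler", text, re.IGNORECASE):
            yield {"url": response.url, "title": response.css("title::text").get()}


if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(TopicSpider)
    process.start()
```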

Design patterns are nonsense here. Those who praise software design patterns developed the software first and summarized a few design patterns afterwards; design patterns give no real guidance for software development. Using design patterns to design a crawler only makes the crawler's design more bloated.

So for JAVA open source crawlers, I think you should just pick one that works well enough. If your business is complex, then no matter which crawler you choose, it will only meet your needs after complex secondary development.
The third category: non-JAVA stand-alone crawlers
Let's talk about Python crawlers first.

Python can complete in 30 lines of code a task that takes 50 lines in JAVA. Python code is indeed fast to write, but in the debugging stage, debugging Python code often consumes far more time than the coding stage saved. To ensure correctness and stability when developing in Python, you need to write more test modules. Of course, if the crawl is not large-scale and the crawling business is not complicated, scrapy is quite good and can complete the crawling task easily, as in the sketch below.
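As a rough illustration of how short a complete scrapy spider can be, here is a minimal sketch against quotes.toscrape.com (scrapy's public demo site; the CSS selectors are assumptions about its markup). Concurrency, duplicate-URL filtering, and result export are handled by framework settings rather than hand-written loop code.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    """Minimal spider: extraction plus pagination in roughly 20 lines."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    # Multi-threading-style concurrency is a setting, not hand-written code;
    # duplicate URLs are filtered by scrapy's built-in dupefilter.
    custom_settings = {"CONCURRENT_REQUESTS": 16}

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; already-seen URLs are skipped automatically.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess(settings={"FEEDS": {"quotes.json": {"format": "json"}}})
    process.crawl(QuotesSpider)
    process.start()
```

Running the script writes the scraped items to quotes.json; if a proxy is needed, it can be set per request through request.meta["proxy"], which scrapy's built-in HttpProxyMiddleware picks up.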

There are also some ruby and php crawlers, which I will not evaluate much here. There are indeed some very small data-collection tasks for which ruby or php is convenient. However, when choosing an open source crawler in these languages, you should on the one hand investigate the surrounding ecosystem, and on the other hand be aware that such crawlers may have bugs you cannot find out about (few people use them, so there is little information).