A quick roundup of crawler tools

1、crawlzilla

Crawlzilla is a free-software package that helps you easily build your own search engine. With it, you no longer have to rely on a commercial company's search engine, and you no longer have to worry about how to index your company's internal website data.

Built around the Nutch project as its core, it integrates a number of related packages and provides an installation and management UI, making it easier for users to get started.

In addition to crawling basic HTML, crawlzilla can also parse the files found on web pages, such as doc, pdf, ppt, ooo, and rss formats, so that your search engine is not just a web search engine but a complete index of a site's data.

It also has Chinese word-segmentation capability, which makes your searches more accurate.
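
The article does not say which segmenter crawlzilla actually uses, but to see why word segmentation matters for Chinese search (Chinese text has no spaces between words, so the indexer must find word boundaries itself), here is a minimal stand-in sketch using Lucene's SmartChineseAnalyzer:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SegmentationDemo {
        public static void main(String[] args) throws Exception {
            // SmartChineseAnalyzer is only a stand-in; the article does not
            // name the segmenter crawlzilla itself relies on.
            try (Analyzer analyzer = new SmartChineseAnalyzer()) {
                TokenStream ts = analyzer.tokenStream("body", "我喜欢用搜索引擎查资料");
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Each token is one segmented word, e.g. 喜欢, 搜索, 资料 ...
                    System.out.println(term);
                }
                ts.end();
                ts.close();
            }
        }
    }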

The main goal of crawlzilla is to give users a search platform that is easy to install and easy to use.

  • License Agreement: Apache License 2.0
  • Development languages: Java, JavaScript, Shell
  • Operating System: Linux
  • Project homepage: https://github.com/shunfa/crawlzilla
  • Download address: http://sourceforge.net/projects/crawlzilla/
  • Features: Easy to install, with Chinese word segmentation function

2、Heritrix

Heritrix is an open-source web crawler developed in Java. Users can use it to crawl the resources they want from the web. Its greatest strength is its excellent extensibility, which makes it easy for users to implement their own crawling logic.

Heritrix has a modular design: each module is coordinated by a controller class (the CrawlController class), which is the core of the whole framework.

  • Code hosting: https://github.com/internetarchive/heritrix3
  • License Agreement: Apache
  • Development language: Java
  • Operating System: Cross-Platform
  • Features: Strictly obeys the exclusion directives in robots.txt files and META robots tags (illustrated in the sketch below)
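
As a plain-Java illustration of that robots.txt convention (a minimal sketch, not Heritrix's internal code; the host and paths are made up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch: fetch robots.txt and collect the Disallow rules that
    // apply to all crawlers ("User-agent: *"). This is NOT Heritrix code,
    // only an illustration of the convention Heritrix enforces.
    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            URL robots = new URL("https://example.com/robots.txt"); // made-up host
            List<String> disallowed = new ArrayList<>();
            boolean appliesToAll = false;
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(robots.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        appliesToAll = line.substring(11).trim().equals("*");
                    } else if (appliesToAll && line.toLowerCase().startsWith("disallow:")) {
                        String path = line.substring(9).trim();
                        if (!path.isEmpty()) disallowed.add(path);
                    }
                }
            }
            // A polite crawler skips any URL whose path starts with a disallowed prefix.
            String candidate = "/private/report.html"; // made-up path
            boolean allowed = disallowed.stream().noneMatch(candidate::startsWith);
            System.out.println(candidate + " allowed? " + allowed);
        }
    }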

3、webmagic

webmagic is a crawler framework that requires no configuration and lends itself to secondary development. It offers a simple and flexible API, so a crawler can be implemented with only a small amount of code.

webmagic has a fully modular design, with functionality covering the entire crawler life cycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, automatic retries, and custom UA and cookie settings.

webmagic includes a powerful page-extraction facility: developers can easily use CSS selectors, XPath, and regular expressions to extract links and content, and multiple selectors can be chained together.
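
A minimal sketch of what this looks like in practice, adapted from the usage documentation linked below (the target site and selector expressions are only illustrative):

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class GithubRepoPageProcessor implements PageProcessor {

        // Site carries the crawler-wide settings mentioned above:
        // automatic retries, a politeness delay, and a custom UA.
        private final Site site = Site.me()
                .setRetryTimes(3)
                .setSleepTime(1000)
                .setUserAgent("Mozilla/5.0 (demo)");

        @Override
        public void process(Page page) {
            // Regular-expression extraction against the URL (captures the author).
            page.putField("author",
                    page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
            // XPath extraction against the page body; css(...) works the same
            // way, and selectors can be chained, e.g. .css("h1").xpath("//a/text()").
            page.putField("name",
                    page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
            // Link extraction: matching links are fed back into the crawl queue.
            page.addTargetRequests(
                    page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider.create(new GithubRepoPageProcessor())
                    .addUrl("https://github.com/code4craft") // seed URL
                    .thread(5)                               // multi-threaded crawling
                    .run();
        }
    }

Persistence and distributed crawling plug in the same way: you hand the Spider a different Pipeline or Scheduler instead of rewriting the crawl logic.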

Webmagic usage documentation: http://webmagic.io/docs/

View the source code: http://git.oschina.net/flashsword20/webmagic

  • License Agreement: Apache
  • Development language: Java
  • Operating System: Cross-Platform
  • Features: Covers the entire crawler life cycle; uses XPath and regular expressions for link and content extraction.
  • Note: This is a Chinese open-source project, contributed by Huang Yihua

4、ThinkUp

ThinkUp is a social-media insights engine that can collect data from social networks such as Twitter and Facebook. It is an interactive analysis tool that gathers data from your personal social-network accounts, archives and processes it, and graphs the data so it can be viewed more intuitively.

  • License Agreement: GPL
  • Development language: PHP
  • Operating System: Cross-Platform
  • github source code: https://github.com/ThinkUpLLC/ThinkUp

Locomotive: Full-featured and long-established. The configuration is rather complicated and many of its functions take real effort to master, but it is indeed very comprehensive. It is a general-purpose collection tool; on simple pages it can collect anything.

Network Miner: It has not been on the market for long and is not yet stable enough, but the data-collection and data-processing functions it provides are very good.

Youxun Software: Strictly speaking, it offers a collection service rather than software. You only need to tell them where to collect the data and exactly what to collect; you don't have to understand or do anything else. They deliver the collected data to you and can also meet whatever data-processing requirements you have.

Network Spirit: Also a long-established tool, with very powerful collection features, though it does not go very deep into anything beyond collection.

Madman and Threesome: I have never used them, but they are said to be very strong at collecting forums and blogs, and less capable with other types of data or slightly more complex data.

gooseeker: It appears to offer online collection. I haven't used it, and I couldn't make sense of their website, but it is said to be good.

My personal take: if you are collecting purely static pages whose data structure is not very complex, and you know a bit of the technology, then use Locomotive.

Original text: http://blog.sina.com.cn/s/blog_15b9403ba0102wosv.html
