Ten excellent open source Java crawlers

 

1: Java crawler WebCollector  Star: 1345

Download address: http://www.17ky.net/soft/9278.html

Introduction: WebCollector is a Java crawler framework (kernel) that requires no configuration and is easy to extend. It provides a concise API, so a capable crawler can be written with only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling. Crawler kernel: WebCollector to...

2: YayCrawler, an open source general crawler framework  Star: 91

Download address: http://www.17ky.net/soft/578.html

YayCrawler is a distributed general-purpose crawler framework built on WebMagic and written in Java. There are many crawler frameworks: some are simple, some complex, some lightweight, and some heavyweight.

3: Vertical crawler WebMagic  Star: 1213

Download address: http://www.17ky.net/soft/9284.html

WebMagic is a crawler framework that requires no configuration and is easy to extend. It provides a simple, flexible API, so a crawler can be implemented with only a small amount of code. The following code crawls oschina blogs: Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http...

4: Yahoo's open source Nutch crawler plugin Anthelion  Star: 2888

Download address: http://www.17ky.net/soft/189.html

Anthelion is a Nutch plugin focused on crawling semantic data. Note: this project includes the full Nutch 1.6 release; the plugin is located at /src/plugin/parse-anth. Anthelion uses an online learning approach to predict which web pages are rich in data based on page context, using feedback from metadata extracted from previously viewed pages. There are three main extensions: AnthelionScoringFilter, WdcParser, and TripleExtractor. Example: ...
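To give a feel for the online learning idea in general, here is a minimal sketch. It is not Anthelion's code or the project's own example: the class name OnlinePagePredictor, the token features, and the logistic-regression update are all assumptions chosen for illustration. The predictor scores a page from context tokens, then adjusts its weights once the page has been fetched and it is known whether structured data was found.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical online classifier: logistic regression over context tokens,
// updated one page at a time (assumed technique, not Anthelion's implementation).
class OnlinePagePredictor {
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;

    OnlinePagePredictor(double learningRate) {
        this.learningRate = learningRate;
    }

    // Estimated probability that a page with these context tokens holds structured data.
    double predict(String[] tokens) {
        double z = 0.0;
        for (String token : tokens) {
            z += weights.getOrDefault(token, 0.0);
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Feedback step: one gradient update after the page has actually been parsed.
    void update(String[] tokens, boolean hadStructuredData) {
        double error = (hadStructuredData ? 1.0 : 0.0) - predict(tokens);
        for (String token : tokens) {
            weights.merge(token, learningRate * error, Double::sum);
        }
    }

    public static void main(String[] args) {
        OnlinePagePredictor predictor = new OnlinePagePredictor(0.5);
        String[] productPage = {"product", "review"};
        String[] loginPage = {"login"};
        predictor.update(productPage, true);   // metadata was found on this page
        predictor.update(loginPage, false);    // nothing useful was found here
        System.out.println(predictor.predict(productPage));
        System.out.println(predictor.predict(loginPage));
    }
}
```

A crawler could use such scores to decide which links to fetch first, while the weights keep adapting as more pages are seen.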

5: Java open source web crawler project Nutch 

Download address: http://www.17ky.net/soft/302.html

Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler. Nutch was founded by Doug Cutting, who also founded the Lucene, Hadoop and Avro open source projects. Nutch was born in August 2002 and is an open source project under Apache. Since Nutch 1.2 version, Nutch has evolved from a search engine...

6: Java web spider/web crawler Spiderman  Star: 1801

Download address: http://www.17ky.net/soft/9279.html

Spiderman - Another Java web spider/crawler. Spiderman is a web spider built on a microkernel + plug-in architecture. Its goal is to crawl complex target web pages and parse them into the business data you need in a simple way. Latest tip: you are welcome to try the latest version, Spiderman2, http://git.oschina.net/…
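As a rough illustration of what a microkernel + plug-in design can look like, consider the sketch below. It does not use Spiderman's actual API: CrawlPlugin, MicroKernel, and the two example plug-ins are hypothetical names invented here. The kernel knows nothing about parsing; it only registers plug-ins and runs them in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in contract: each plug-in transforms the text it receives.
interface CrawlPlugin {
    String apply(String input);
}

// Hypothetical microkernel: it holds the plug-in list and runs it in order.
class MicroKernel {
    private final List<CrawlPlugin> plugins = new ArrayList<>();

    MicroKernel register(CrawlPlugin plugin) {
        plugins.add(plugin);
        return this;
    }

    String run(String input) {
        String current = input;
        for (CrawlPlugin plugin : plugins) {
            current = plugin.apply(current);
        }
        return current;
    }
}

class MicroKernelDemo {
    public static void main(String[] args) {
        MicroKernel kernel = new MicroKernel()
                // Naive tag stripper; a stand-in for a real HTML parser plug-in.
                .register(s -> s.replaceAll("<[^>]*>", " "))
                // Whitespace normalizer.
                .register(s -> s.trim().replaceAll("\\s+", " "));
        // prints "Title Hello world"
        System.out.println(kernel.run("<h1>Title</h1> <p>Hello   world</p>"));
    }
}
```

New behaviour is added by registering another plug-in, so the kernel itself never needs to change.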

7: Lightweight Java web crawler GECCO  Star: 658

Download address: http://www.17ky.net/soft/465.html

What is Gecco? Gecco is a lightweight, easy-to-use web crawler developed in Java. Gecco integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit, and redisson, so you can quickly write a crawler just by configuring some jQuery-style selectors. The Gecco framework is highly extensible: it is designed around the open-closed principle, closed for modification and open for extension. At the same time, Gecco is based on very open...

8: Open source crawler framework WebPasser  Star: 15

Download address: http://www.17ky.net/soft/34660.html

WebPasser is a configurable open source crawler framework that provides a crawler console management interface. It can parse various kinds of web page content through configuration and extract the required data without writing a single line of Java code. 1. It contains a powerful page parsing engine that provides processing chains such as jsoup, XPath, and regular expressions, so the required content can be extracted through simple configuration. 2. It provides a crawler control and management interface that can monitor crawling status in real time...
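The idea of extracting data through configuration rather than per-site code can be sketched as follows. This is not WebPasser's configuration format or API: the class ConfigDrivenExtractor and the rule map (field name to regular expression) are assumptions made for illustration, and only the JDK's regex support is used instead of jsoup or XPath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical config-driven extractor: the rules are data, not code.
class ConfigDrivenExtractor {
    // Applies each rule (field name -> regex with exactly one capture group)
    // and keeps the first match per field; fields without a match are omitted.
    static Map<String, String> extract(String html, Map<String, String> rules) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher matcher = Pattern.compile(rule.getValue(), Pattern.DOTALL).matcher(html);
            if (matcher.find()) {
                result.put(rule.getKey(), matcher.group(1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In a real system these rules would be loaded from a config file.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("title", "<title>(.*?)</title>");
        rules.put("price", "class=\"price\">(.*?)<");
        String html = "<html><title> Demo shop </title><span class=\"price\">9.99</span></html>";
        // prints {title=Demo shop, price=9.99}
        System.out.println(extract(html, rules));
    }
}
```

Supporting a new site then means editing the rule set, not recompiling the crawler.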

9: An agile and powerful Java crawler framework SeimiCrawler  Star: 635

Download address: http://www.17ky.net/soft/351.html

SeimiCrawler is an agile, independently deployable, distributed Java crawler framework. It aims to lower the barrier for newcomers to build a crawler system with high availability and good performance, and to improve the efficiency of crawler development.

10: Crawler system NEOCrawler  Star: 258

Download address: http://www.17ky.net/soft/34612.html

NEOCrawler (Chinese name: Niu Ka) is a crawler system implemented with nodejs, redis, and phantomjs. The code is completely open source and is suitable for data collection in vertical fields and for extending the crawler. [Main features] Implemented with nodejs; JavaScript is simple, efficient, and easy to learn, which saves a lot of time both for crawler development and for users who extend the crawler; nodejs makes...
