1: Java crawler WebCollector Star: 1345
Download address: http://www.17ky.net/soft/9278.html
WebCollector is a Java crawler framework (kernel) that requires no configuration and is easy to extend. It provides a simple API, so a capable crawler can be implemented with only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling. Crawler kernel: WebCollector to...
2: YayCrawler, an open source general crawler framework Star: 91
Download address: http://www.17ky.net/soft/578.html
YayCrawler is a distributed, general-purpose crawler framework built on WebMagic and written in Java. There are many crawler frameworks: some are simple, some are complex, some are lightweight, and some are heavyweight.
3: Vertical crawler WebMagic Star: 1213
Download address: http://www.17ky.net/soft/9284.html
WebMagic is a crawler framework that requires no configuration and is easy to extend. It provides a simple, flexible API, so a crawler can be implemented with only a small amount of code. The following code crawls oschina blogs: Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http...
4: Yahoo's open source Nutch crawler plugin Anthelion Star: 2888
Download address: http://www.17ky.net/soft/189.html
Anthelion is a Nutch plugin focused on crawling semantic data. Note: this project includes the full Nutch 1.6 release; the plugin is located at /src/plugin/parse-anth. Anthelion uses an online learning approach to predict which web pages are rich in data based on page context, taking feedback from the metadata extracted from previously visited pages. There are three main extensions: AnthelionScoringFilter, WdcParser, and TripleExtractor. Example: ...
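The feedback loop described above can be illustrated with a deliberately small sketch. It is not Anthelion's actual model or API: the class name FeedbackScorer, the use of the host name as the only "context" feature, and the smoothed success-rate estimate are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: NOT Anthelion's algorithm, only the general idea of
// scoring unseen URLs from feedback on pages that were already fetched.
class FeedbackScorer {
    // Per-host counters: [pages fetched, pages that contained semantic data].
    private final Map<String, int[]> stats = new HashMap<>();

    // The host name is the only "context" feature in this sketch (assumption).
    private static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host;
    }

    // Online update: called once for each page after it has been parsed.
    void feedback(String url, boolean hadSemanticData) {
        int[] s = stats.computeIfAbsent(hostOf(url), h -> new int[2]);
        s[0]++;
        if (hadSemanticData) {
            s[1]++;
        }
    }

    // Laplace-smoothed estimate of the chance that a URL yields semantic data.
    // Hosts that have never been seen score 0.5.
    double score(String url) {
        int[] s = stats.getOrDefault(hostOf(url), new int[2]);
        return (s[1] + 1.0) / (s[0] + 2.0);
    }
}
```

A crawler built this way would fetch the highest-scoring URLs first and call feedback after each page is parsed, so the ranking improves as the crawl proceeds.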
5: Java open source web crawler project Nutch
Download address: http://www.17ky.net/soft/302.html
Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler. Nutch was founded by Doug Cutting, who also founded the Lucene, Hadoop, and Avro open source projects. Nutch was started in August 2002 and is an Apache project. Since version 1.2, Nutch has evolved from a search engine...
6: Java web spider/web crawler Spiderman Star: 1801
Download address: http://www.17ky.net/soft/9279.html
Spiderman (another Java web spider/crawler) is a web spider built on a microkernel plus plug-in architecture. Its goal is to offer a simple way to crawl complex target web pages and parse them into the business data you need. Latest news: try the latest version, Spiderman2, at http://git.oschina.net/…
7: Lightweight Java web crawler GECCO Star: 658
Download address: http://www.17ky.net/soft/465.html
Gecco is a lightweight, easy-to-use web crawler written in Java. It integrates established frameworks such as jsoup, HttpClient, fastjson, Spring, HtmlUnit, and Redisson, so you can write a crawler quickly just by configuring a few jQuery-style selectors. The Gecco framework is highly extensible: it is designed around the open-closed principle, that is, closed for modification and open for extension. At the same time, Gecco is based on very open...
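As a generic illustration of the open-closed principle mentioned above, here is a minimal sketch. It does not use Gecco's real API; Pipeline, MiniCrawler, and CollectTitles are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not Gecco's API.
// Extension point: new behaviour is added by implementing this interface.
interface Pipeline {
    void process(Map<String, String> item);
}

// Core class: closed for modification, it only knows the Pipeline interface.
class MiniCrawler {
    private final List<Pipeline> pipelines = new ArrayList<>();

    // Registers an extension and returns this crawler so calls can be chained.
    MiniCrawler addPipeline(Pipeline pipeline) {
        pipelines.add(pipeline);
        return this;
    }

    // Hands one extracted item to every registered pipeline, in order.
    void emit(Map<String, String> item) {
        for (Pipeline p : pipelines) {
            p.process(item);
        }
    }
}

// An extension written later, without touching MiniCrawler.
class CollectTitles implements Pipeline {
    final List<String> titles = new ArrayList<>();

    @Override
    public void process(Map<String, String> item) {
        titles.add(item.get("title"));
    }
}
```

New output targets (a database writer, a logger) would be further Pipeline implementations; the core class stays unchanged.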
8: Open source crawler framework WebPasser Star: 15
Download address: http://www.17ky.net/soft/34660.html
WebPasser is a configurable open source crawler framework with a crawler console management interface. It can parse many kinds of web page content through configuration alone and extract the required data without writing a single line of Java code. 1. It contains a powerful page-parsing engine with processing chains for jsoup, XPath, regular expressions, and so on, so the required content can be extracted through simple configuration. 2. It provides a crawler control and management interface that can monitor the crawl status in real time...
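To make "extraction through configuration" concrete, here is a minimal sketch under stated assumptions. It is not WebPasser's configuration format or engine: the rules are plain regular expressions held in a map (standing in for a configuration file), and ConfigExtractor is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; WebPasser's real configuration format is not shown here.
class ConfigExtractor {
    // Field name -> compiled regex; the first capture group is the field value,
    // so every configured regex must contain at least one capture group.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    // "Configuration": in a real system this map would be loaded from a file.
    ConfigExtractor(Map<String, String> config) {
        for (Map.Entry<String, String> e : config.entrySet()) {
            rules.put(e.getKey(), Pattern.compile(e.getValue(), Pattern.DOTALL));
        }
    }

    // Applies every rule to the page; fields with no match are left out.
    Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : rules.entrySet()) {
            Matcher m = e.getValue().matcher(html);
            if (m.find()) {
                result.put(e.getKey(), m.group(1).trim());
            }
        }
        return result;
    }
}
```

Changing what gets extracted then means editing the rule map rather than the Java code, which is the property the description above attributes to configuration-driven crawlers.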
9: An agile and powerful Java crawler framework SeimiCrawler Star: 635
Download address: http://www.17ky.net/soft/351.html
SeimiCrawler is an agile, independently deployable, distributed Java crawler framework. It aims to lower the barrier for newcomers to build a highly available, high-performance crawler system and to improve the efficiency of crawler development.
10: Crawler system NEOCrawler Star: 258
Download address: http://www.17ky.net/soft/34612.html
NEOCrawler (Chinese name: Niu Ka) is a crawler system implemented with Node.js, Redis, and PhantomJS. The code is completely open source and is suitable for data collection in vertical domains and for building customized crawlers on top of it. [Main features] Implemented with Node.js; JavaScript is simple, efficient, and easy to learn, which saves considerable time both for developing the crawler and for users who extend it; Node.js makes...