[Java Framework] Recommended common crawler frameworks for Java

Selenium

The number of stars on GitHub as of September 2023 is 27.7K.
Selenium is a browser automation tool that drives a real browser, simulating user actions and reading the resulting page content. It supports multiple browsers and handles JavaScript-generated content well. However, because it drives a full browser, Selenium runs slower than the other frameworks listed here.

WebMagic

The number of stars on GitHub as of September 2023 is 10.9K.
WebMagic is a distributed crawler framework written in Java. It uses multi-threading and asynchronous IO to crawl website data efficiently. WebMagic provides a rich plug-in mechanism and supports custom parsers and processors. Note, however, that WebMagic does not support JavaScript-rendered pages.
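
To make the multi-threading and custom-processor idea concrete, here is a minimal sketch in plain Java. It does not use WebMagic's API: `MiniSpider`, `PageHandler`, and the in-memory page map that stands in for real downloads are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these types belong to WebMagic.
class MiniSpider {

    // User-supplied extension point: turns a page body into an extracted value.
    interface PageHandler {
        String process(String url, String body);
    }

    // Processes every page on a fixed-size thread pool and collects results by URL.
    static Map<String, String> run(Map<String, String> pages, PageHandler handler, int threads) {
        Map<String, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Map.Entry<String, String> page : pages.entrySet()) {
            pool.execute(() ->
                results.put(page.getKey(), handler.process(page.getKey(), page.getValue())));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return results;
    }

    public static void main(String[] args) {
        // In-memory "pages" replace real HTTP downloads in this sketch.
        Map<String, String> pages = Map.of(
            "https://example.com/1", "<title>First</title>",
            "https://example.com/2", "<title>Second</title>");
        // Custom handler: pull the text between the <title> tags.
        PageHandler titleHandler = (url, body) ->
            body.substring(body.indexOf("<title>") + 7, body.indexOf("</title>"));
        System.out.println(run(pages, titleHandler, 4));
    }
}
```

Swapping in a different `PageHandler` changes what is extracted without touching the threading code, which is roughly the separation a plug-in style crawler aims for.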

Jsoup

The number of stars on GitHub as of September 2023 is 10.3K.
Jsoup is a Java HTML parser with an easy-to-use API for extracting and processing data from a URL, file, or string. Compared with the other frameworks, Jsoup is simpler and more convenient, and code written with it is easy to read. However, Jsoup does not execute JavaScript, so JavaScript-generated content has to be handled separately.

Crawler4j

The number of stars on GitHub as of September 2023 is 4.4K.
Crawler4j is an open-source Java crawler framework. It uses multi-threading and in-memory caching, and lets you customize components such as URL filters and parsers. Crawler4j supports limiting the crawl depth and setting a delay between requests, and it can be used together with search engines such as Lucene. Note, however, that Crawler4j does not support JavaScript-rendered pages.
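
The depth limit, crawl delay, and URL filter mentioned above can be illustrated without the library itself. The following is a minimal sketch in plain Java, not Crawler4j's API: the class `DepthLimitedCrawl`, its `crawl` method, and the in-memory link graph that stands in for real HTTP fetches are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration class; not part of Crawler4j.
class DepthLimitedCrawl {

    // A URL waiting to be visited, together with its distance from the seed.
    private static final class Task {
        final String url;
        final int depth;
        Task(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Breadth-first crawl over a link graph that stands in for real page fetches.
    // Returns the visited URLs in visit order.
    static List<String> crawl(Map<String, List<String>> links, String seed,
                              int maxDepth, long delayMillis, Predicate<String> shouldVisit) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Task> frontier = new ArrayDeque<>();
        frontier.add(new Task(seed, 0));
        seen.add(seed);
        while (!frontier.isEmpty()) {
            Task task = frontier.poll();
            // Crawl delay: pause before every fetch except the first.
            if (!visited.isEmpty() && delayMillis > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            visited.add(task.url);
            if (task.depth >= maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            for (String next : links.getOrDefault(task.url, List.of())) {
                // URL filter first, then de-duplication.
                if (shouldVisit.test(next) && seen.add(next)) {
                    frontier.add(new Task(next, task.depth + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "https://example.com/", List.of("https://example.com/a", "https://example.com/logo.png"),
            "https://example.com/a", List.of("https://example.com/b"),
            "https://example.com/b", List.of("https://example.com/c"));
        // Depth limit 2, 10 ms delay, skip PNG images.
        List<String> visited = crawl(links, "https://example.com/", 2, 10L,
                                     url -> !url.endsWith(".png"));
        // prints [https://example.com/, https://example.com/a, https://example.com/b]
        System.out.println(visited);
    }
}
```

In this run the image link is dropped by the filter, and `https://example.com/c` is never reached because it sits at depth 3.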

Apache Nutch

The number of stars on GitHub as of September 2023 is 2.7K.
Apache Nutch is an open-source web crawler framework written in Java. It supports multi-threaded and distributed crawling, as well as custom URL filters, parsers, and other components. Apache Nutch can handle JavaScript-generated content, although by default it fetches pages without executing scripts and relies on plugins such as protocol-selenium or protocol-htmlunit for rendering. It also integrates with search engines such as Solr. Note, however, that Apache Nutch has a steep learning curve.
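
As an illustration of what a custom URL filter does, here is a minimal sketch in plain Java. It assumes the rule style used by Nutch's regex URL filter configuration (an ordered list of regex rules prefixed with `+` to accept or `-` to reject, where the first matching rule decides and a URL that matches nothing is rejected). The class `RuleUrlFilter` is hypothetical and is not one of Nutch's own classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of first-match-wins URL rules; not Nutch's own classes.
class RuleUrlFilter {
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Each rule is "+regex" (accept) or "-regex" (reject).
    RuleUrlFilter(List<String> rules) {
        for (String rule : rules) {
            char sign = rule.isEmpty() ? ' ' : rule.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + rule);
            }
            accept.add(sign == '+');
            patterns.add(Pattern.compile(rule.substring(1)));
        }
    }

    // The first rule whose regex is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return accept.get(i);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RuleUrlFilter filter = new RuleUrlFilter(List.of(
            "-\\.(gif|jpg|png|css|js)$",                   // skip static assets
            "+^https?://([a-z0-9-]+\\.)*example\\.com/",   // stay on one site
            "-."));                                        // reject everything else
        System.out.println(filter.accepts("https://example.com/docs/index.html")); // true
        System.out.println(filter.accepts("https://example.com/logo.png"));        // false
        System.out.println(filter.accepts("https://other.org/"));                  // false
    }
}
```

Because the first match wins, rule order matters: placing the asset rule before the site rule is what keeps `logo.png` out even though it is on the accepted host.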

HtmlUnit

The number of stars on GitHub as of September 2023 is 731.
HtmlUnit is a headless (GUI-less) browser written in Java that can simulate browser behavior and retrieve the content of web pages. HtmlUnit supports JavaScript-rendered pages and lets you customize request headers, cookies, and other request details. Note, however, that HtmlUnit runs slower than the frameworks that only fetch and parse HTML.

Origin blog.csdn.net/YangCheney/article/details/133444626