Crawler Frameworks 1, 2, 3

0.5 Overview

Heritrix, Nutch, and Scrapy are three crawler frameworks. Each has a different focus, with its own advantages and disadvantages.

1.Heritrix

Heritrix is a web crawler developed specifically for archiving web pages on the Internet. It is written entirely in Java and is open source. Its main user interface is web-based: crawl behavior is monitored and controlled through a browser. It also provides a command-line tool that users can optionally use to launch crawls.

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries, based on specifications written in early 2003. The first release came out in January 2004, and the Internet Archive and other interested third parties have continued to improve it since then. It is now a mature, widely used open-source crawler.

Official website: https://sourceforge.net/projects/archive-crawler/

Reference: https://www.ibm.com/developerworks/cn/opensource/os-cn-heritrix/

2.Nutch

Nutch is an open-source web crawler project that can be used directly to crawl web content.

Nutch is currently split into two branches, 1.x and 2.x. The latest 1.x release is 1.7, and the latest 2.x release is 2.2.1. The main difference between the two branches is the underlying storage.

The 1.x branch is built on the Hadoop architecture and uses HDFS for its underlying storage. The 2.x branch uses Apache Gora as a storage abstraction, so that Nutch can work with HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, AvroStore, and other (mostly NoSQL) data stores.

Official website: http://nutch.apache.org/

3.Scrapy

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Scrapy uses the Twisted asynchronous networking library to handle network communication. GitHub project page: https://github.com/scrapy/scrapy
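To make "extracting structured data from a page" a little more concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy itself. It is not Scrapy's API; a framework like Scrapy handles this kind of extraction alongside downloading and scheduling. The names `PageExtractor` and `extract`, and the example URLs, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect the page title and absolute link URLs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def extract(base_url, html):
    """Return a structured record (a plain dict) for one page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"url": base_url, "title": parser.title.strip(), "links": parser.links}


if __name__ == "__main__":
    sample = (
        "<html><head><title> Demo </title></head>"
        "<body><a href='/a'>A</a> <a href='http://example.org/b'>B</a></body></html>"
    )
    print(extract("http://example.com/index.html", sample))
```

The links collected for one page are what a crawler would queue up to visit next, while the returned dict plays the role of the structured item that gets stored or processed further.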

Official website: http://www.scrapy.org/

Original: Big Box  Crawler Frameworks 1, 2, 3


Origin www.cnblogs.com/wangziqiang123/p/11618272.html