Web crawler Apache Nutch 1.16 released

Apache Nutch 1.16 has been released. Nutch is a mature, production-ready web crawler. Nutch 1.x supports fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

This release contains more than 100 bug fixes and improvements. The major updates are as follows:

New features

  • [NUTCH-2676] - Update to the latest Selenium and add code to use Chrome and Firefox headless mode with the remote web driver

Bug fixes

  • [NUTCH-1063] - OutlinkExtractor test generates an exception but does not fail
  • [NUTCH-1842] - crawl.gen.delay has a wrong default value in nutch-default.xml or is parsed incorrectly
  • [NUTCH-2279] - LinkRank fails when Hadoop MR output compression is used
  • [NUTCH-2381] - In some cases, TextProfileSignature gives different signatures to pages with the same text "profile"
  • [NUTCH-2387] - Nutch should not index documents marked with the "noindex" meta tag
  • [NUTCH-2457] - Embedded documents are likely not parsed correctly by Tika
  • [NUTCH-2475] - If and else-if branches have the same condition
  • [NUTCH-2482] - index-geoip should not add document fields with empty values
  • [NUTCH-2585] - NPE in TrieStringMatcher
  • [NUTCH-2598] - URLNormalizerChecker fails on invalid URLs in the input
  • ……

For details, see the release notes.

Origin www.oschina.net/news/110655/nutch-1-16-released