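Nutch 2.3.1 + MongoDB + ElasticSearch 1.4.4 Environment Setup

Preface: this post is a set of hands-on notes on configuring Nutch to run locally. It does not cover distributed deployment.

1. Prerequisites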

Ubuntu 16.04

jdk 1.8

Ant 1.9.13

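2. Installing MongoDB

1) Installing MongoDB and learning the basic concepts

Reference: http://www.runoob.com/mongodb/mongodb-linux-install.html

2) MongoDB GUI tool: Robomongo (now Robo 3T)

(1) Download Robomongo

Official download page: https://robomongo.org/

(2) Extract the archive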

tar -xzf robo3t-1.1.1-linux-x86_64-c93c6b0.tar.gz
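cd robo3t-1.1.1-linux-x86_64-c93c6b0
(If you moved the directory elsewhere, adjust the path accordingly.)

After extracting, move the Robomongo folder into a directory where you keep frequently used software, because Robomongo is launched directly from this folder.

(3) Start robo3t

3. Installing ElasticSearch

1) Download and install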

gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ mv elasticsearch-1.4.4 ../elasticsearch
gannyee@ubuntu:~/download$ cd ../elasticsearch

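2) Starting and stopping ES

Start ElasticSearch in the background

gannyee@ubuntu:~/elasticsearch$ ./bin/elasticsearch -d

Stop the ElasticSearch process

Shut down the node (on a single-node setup; this endpoint addresses every node in the cluster)
gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/_shutdown

Shut down the node with ID BlrmMvBdSKiCeYGsiHijdg
gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/BlrmMvBdSKiCeYGsiHijdg/_shutdown

Check that ElasticSearch is running

Open http://localhost:9200 in a browser. If the following information is displayed, ES started successfully: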

{
  "status" : 200,
  "name" : "gannyee",
  "cluster_name" : "gannyee",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
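
The same check can be scripted instead of done in a browser. Below is a minimal Python sketch, assuming ES listens on http://localhost:9200 as above and returns a root document shaped like the sample. The function names parse_es_info and fetch_es_info are made up for illustration and are not part of any ES client.

```python
import json
from urllib.request import urlopen


def parse_es_info(body):
    """Return (status, version number) from the JSON that ES 1.x serves at its root URL."""
    info = json.loads(body)
    return info.get("status"), info.get("version", {}).get("number")


def fetch_es_info(base_url="http://localhost:9200"):
    """Fetch and parse the root document; this needs a running ES node."""
    with urlopen(base_url, timeout=5) as resp:
        return parse_es_info(resp.read().decode("utf-8"))
```

Calling fetch_es_info() against a running node should give a status of 200 and the version string, which can then be compared with the expected 1.4.4.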

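3) Because elasticsearch 1.4.4 is an old release, some plugins can no longer be installed or used properly, so querying ES with curl is recommended.

curl commands are run in a terminal; simple GET requests can also be entered directly in the browser address bar. For a command reference see https://blog.csdn.net/qq834024958/article/details/81902963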

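4. Installing and configuring Nutch

Nutch is an open-source web crawler that grew out of Lucene. This setup requires the Nutch 2.x series, because the 1.x series does not support MongoDB or other data stores such as MySQL and HBase.
Version: apache-nutch-2.3.1

Downloading, building, and configuring Nutch 2.3

Download the source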

gannyee@ubuntu:~/download$ wget http://www.apache.org/dyn/closer.lua/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ mv apache-nutch-2.3.1 ../nutch
gannyee@ubuntu:~/download$ cd ../nutch
gannyee@ubuntu:~/nutch$ export NUTCH_HOME=$(pwd)

Edit conf/nutch-site.xml so that MongoDB is used as the GORA storage backend

gannyee@ubuntu:~/nutch/conf$ vim nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Uncomment the following dependency in ivy/ivy.xml

gannyee@ubuntu:~/nutch/conf$ vim $NUTCH_HOME/ivy/ivy.xml
 <!-- Uncomment this to use MongoDB as Gora backend. -->
    
    <dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
        
    <!-- Uncomment this to use Solr as Gora backend. -->

Change the ivy repository

The default ivy repository is slow, so switch to a mirror hosted in China.
Find the following setting in ivy/ivysettings.xml:

<property name="repo.maven.org"  
      value="http://repo1.maven.org/maven2/"  
      override="false"/>

Replace the value with the Aliyun address:

http://maven.aliyun.com/nexus/content/groups/public/

Make sure MongoStore is set as the default data store

gannyee@ubuntu:~/nutch$ vim conf/gora.properties
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
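
To confirm the setting took effect before building, the file can be read back with a few lines of Python. This is a simplified sketch: it only handles plain key=value lines and # comments, not the full Java properties syntax, and the helper names are illustrative.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def uses_mongo_store(props):
    """True when the default Gora datastore is the MongoStore class."""
    expected = "org.apache.gora.mongodb.store.MongoStore"
    return props.get("gora.datastore.default") == expected
```

For example, uses_mongo_store(parse_properties(open("conf/gora.properties").read())) should be True once the lines above are in place and not commented out.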

Build Nutch

gannyee@ubuntu:~/nutch$ ant runtime

If the build prints the following errors

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

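this is because a library is missing. It can be fixed as follows (the error can in fact be ignored):
Download sonar-ant-task-2.1.jar and copy it into the $NUTCH_HOME/lib directory.

Edit $NUTCH_HOME/build.xml so that it references the jar added above: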

<!-- Define the Sonar task if this hasn't been done in a common script -->
 <taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
  <classpath path="${ant.library.dir}" />
  <classpath path="${mysql.library.dir}" />
  <classpath><fileset dir="lib/" includes="sonar*.jar" /></classpath>
 </taskdef>

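After Nutch builds successfully, the output is placed in a newly created runtime folder under the Nutch home directory (nutch/runtime). It contains two subfolders, deploy and local: deploy is for distributed crawling, local is for crawling on a single machine. Under local/bin there are two scripts, nutch and crawl. nutch provides all the individual commands, while crawl is mainly used for one-stop crawling.

Finally, confirm that Nutch was built correctly and runs. The output should look like this: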

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch
 Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Customize the crawl by editing nutch/runtime/local/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Hist Crawler</value>
  </property>
  
  <!--<property>
   <name>plugin.folders</name>
   <value>plugins</value>
 </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>-->

  <property>
    <name>plugin.includes</name>
    <!--<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>-->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
    </description>
  </property>

  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
  
</configuration>

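This example includes the configuration for storing data in MongoDB and for indexing it into ES. Note that elastic.cluster should match the cluster_name your ES node reports (gannyee in the sample response earlier, while this file uses elasticsearch); otherwise the indexer may fail to connect.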
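
A quick way to double-check nutch-site.xml before crawling is to parse it and test the plugin.includes expression. The sketch below uses only the Python standard library. It assumes the expression is applied as a full regex match against each plugin directory name, which is how the property description reads; the helper names are illustrative.

```python
import re
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Map each <property> name to its value; commented-out properties are ignored."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def plugin_enabled(props, plugin_dir):
    """Check whether plugin.includes, treated as a regex, matches a plugin directory name."""
    pattern = props.get("plugin.includes", "")
    return re.fullmatch(pattern, plugin_dir) is not None
```

With the configuration above, plugin_enabled(props, "indexer-elastic") should hold, which is the plugin the ES indexing step depends on.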

Create the seed file

Create the directory nutch/runtime/local/urls and, inside it, a file named seed.txt containing the seed URLs to crawl, one per line. Example:

http://www.runoob.com/redis/redis-tutorial.html
http://www.lianzais.com/4_4003/
http://www.lianzais.com/4_4003/1679835.html
http://www.lianzais.com/5_5279/2378223.html
http://www.lianzais.com/5_5279/
https://www.xbiquge6.com/0_761/
https://www.xbiquge6.com/0_761/1272131.html
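
If the seed list is generated rather than typed by hand, a small script can write it. The following Python sketch assumes the urls/seed.txt layout described above; the function names are illustrative, and the filtering rule (keep only well-formed http/https URLs, drop duplicates) is a choice made for this example, not something Nutch requires.

```python
from pathlib import Path
from urllib.parse import urlparse


def clean_seed_urls(urls):
    """Keep well-formed http(s) URLs, stripped and de-duplicated, in input order."""
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc and url not in cleaned:
            cleaned.append(url)
    return cleaned


def write_seed_file(urls, seed_dir="urls", filename="seed.txt"):
    """Write one URL per line to seed_dir/filename and return the path."""
    path = Path(seed_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(clean_seed_urls(urls)) + "\n", encoding="utf-8")
    return path
```

Run it from nutch/runtime/local so that the default seed_dir resolves to the urls folder used by the inject step.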

5. Running the crawl step by step

Initialize the crawldb (inject the seed URLs)

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch inject urls/

Generate a fetch list from the crawldb

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch generate -topN 80

Fetch all generated URLs

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch fetch -all

Parse the fetched pages

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch parse -all

Update the database

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch updatedb -all

Index the parsed pages

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch index -all
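
The six commands can also be chained so that a failure stops the round early. This is a minimal Python sketch rather than a replacement for the bundled crawl script. It reuses the exact arguments shown above; the function names and the local_dir parameter are illustrative.

```python
import subprocess

# The six steps in the order used above; arguments are copied from the commands shown.
CRAWL_STEPS = [
    ["inject", "urls/"],
    ["generate", "-topN", "80"],
    ["fetch", "-all"],
    ["parse", "-all"],
    ["updatedb", "-all"],
    ["index", "-all"],
]


def build_commands(nutch_bin="./bin/nutch", steps=CRAWL_STEPS):
    """Return the full argument list for each step."""
    return [[nutch_bin, *step] for step in steps]


def run_crawl_round(local_dir, nutch_bin="./bin/nutch"):
    """Run one round from runtime/local, stopping at the first failing step."""
    for argv in build_commands(nutch_bin):
        print("running:", " ".join(argv))
        subprocess.run(argv, cwd=local_dir, check=True)
```

For the layout in this post the call would be run_crawl_round("/home/gannyee/nutch/runtime/local"), where the absolute path is inferred from the shell prompts and may differ on your machine.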

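After the given pages have been crawled, MongoDB contains a new database named nutch, and elasticsearch contains a new index named nutch.

In Robo 3T you can view the details of an individual record.

The controls in the top-right corner switch between data display modes.

View the indexed data in elasticsearch

http://localhost:9200/nutch/_search?pretty=true

Use size to set the number of records returned

http://localhost:9200/nutch/_search?size=100&pretty=true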
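
When these queries are issued from a script, building the URL programmatically avoids typos in the query string. A small Python sketch, assuming the nutch index and the default localhost:9200 address used in this post; search_url and total_hits are illustrative helper names.

```python
import json
from urllib.parse import urlencode


def search_url(index="nutch", size=None, base_url="http://localhost:9200"):
    """Build a _search URL; size is optional, pretty=true asks ES for indented JSON."""
    params = {}
    if size is not None:
        params["size"] = size
    params["pretty"] = "true"
    return f"{base_url}/{index}/_search?{urlencode(params)}"


def total_hits(body):
    """Read hits.total from an ES 1.x search response body."""
    return json.loads(body)["hits"]["total"]
```

The resulting URL can be passed to curl or urllib.request.urlopen, and total_hits gives a quick count of how many documents were indexed.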

6. Troubleshooting

Check the hadoop.log file under runtime/local/logs for detailed exception information, then search online for a solution.
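
hadoop.log can get long, so it may help to pull out only the suspicious lines first. The Python sketch below assumes the log level appears as a whitespace-separated token on each line (for example ERROR or WARN) and that stack traces contain the word Exception. Both are assumptions about the log layout, and the function name is illustrative.

```python
def find_problem_lines(log_text, levels=("ERROR", "WARN")):
    """Return (line number, line) pairs that carry a given level or mention an exception."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        tokens = line.split()
        if any(level in tokens for level in levels) or "Exception" in line:
            hits.append((number, line))
    return hits
```

Typical use is find_problem_lines(open("logs/hadoop.log", encoding="utf-8", errors="replace").read()) from runtime/local, then reading around the reported line numbers in the full log.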

References:

https://blog.csdn.net/github_27609763/article/details/50597427

https://blog.csdn.net/bluestarjava/article/details/53843857

https://blog.csdn.net/qq834024958/article/details/81902963

http://www.it610.com/article/2180055.htm