Nutch runs into a "failed with: Http code=403" problem

      For my graduation project I plan to build a search engine for the campus network.

      I downloaded Nutch 1.2, did some configuration, and gave it a try.

       Step 1: In the extracted nutch-1.2 directory, create a urls directory, then create a url.txt file inside it and write in the URL of the site I want to crawl: http://www.ujs.edu.cn/

       Step 2: In the nutch-1.2 directory, create a logs directory to hold log files, then create an empty test.log file inside it (a shell sketch of these two steps follows).
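
A minimal shell sketch of steps 1 and 2, assuming the extracted directory is named nutch-1.2 (adjust the paths to your own layout):

          cd nutch-1.2                                   # the extracted Nutch 1.2 directory
          mkdir urls logs                                 # seed-URL directory and log directory
          echo "http://www.ujs.edu.cn/" > urls/url.txt    # the site to crawl
          touch logs/test.log                             # empty file that will hold the crawl log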

       Step 3: Go into the conf directory and edit nutch-site.xml; this file mainly holds information about your spider.

     My nutch-site.xml is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mynutch</value>
    <description>test</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>spider</value>
    <description>spider</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.xxx.com</value>
    <description>http://www.xxx.com</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>MyEmail</value>
    <description>[email protected]</description>
  </property>
</configuration>
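
One property that might also belong here: the crawl log below warns that the http.agent.name value should be listed first in the http.robots.agents property. A hedged sketch of such an entry, reusing the same agent name (only a guess at silencing that warning, and probably unrelated to the 403 itself):

  <property>
    <name>http.robots.agents</name>
    <value>mynutch,*</value>
    <description>Agents checked against robots.txt, highest precedence first
    (mirrors the http.agent.name value above).
    </description>
  </property>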

   Step 4: Edit the crawl-urlfilter.txt file under conf, find the line "# accept hosts in MY.DOMAIN.NAME", and change the line immediately below it to "+http://www.ujs.edu.cn" (a sketch of the edited section follows).
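
For reference, roughly what that part of conf/crawl-urlfilter.txt looks like after the edit; the commented-out regex is the form the stock file typically uses for this rule, so my copy may differ slightly:

          # accept hosts in MY.DOMAIN.NAME
          # (stock rule that was replaced: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/)
          +http://www.ujs.edu.cn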

   Step 5: I'm on Ubuntu, so I open a shell, cd into the nutch-1.2 directory, and run the crawl command:

          bin/nutch crawl urls/url.txt -dir crawled > logs/test.log

          After about a minute the crawl finished, but nothing was actually fetched. The log is as follows:

    test.log

crawl started in: crawled
rootUrlDir = urls/url.txt
threads = 10
depth = 5
indexer=lucene
Injector: starting at 2011-04-18 20:19:19
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls/url.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-04-18 20:19:23, elapsed: 00:00:03
Generator: starting at 2011-04-18 20:19:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawled/segments/20110418201927
Generator: finished at 2011-04-18 20:19:28, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-04-18 20:19:28
Fetcher: segment: crawled/segments/20110418201927
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.ujs.edu.cn/
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.ujs.edu.cn/ failed with: Http code=403, url=http://www.ujs.edu.cn/
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-04-18 20:19:33, elapsed: 00:00:04
CrawlDb update: starting at 2011-04-18 20:19:33
CrawlDb update: db: crawled/crawldb
CrawlDb update: segments: [crawled/segments/20110418201927]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-04-18 20:19:36, elapsed: 00:00:02
Generator: starting at 2011-04-18 20:19:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-04-18 20:19:37
LinkDb: linkdb: crawled/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/hello/nutch-1.2/crawled/segments/20110418201927
LinkDb: finished at 2011-04-18 20:19:39, elapsed: 00:00:01
Indexer: starting at 2011-04-18 20:19:39
Indexer: finished at 2011-04-18 20:19:43, elapsed: 00:00:03
Dedup: starting at 2011-04-18 20:19:43
Dedup: adding indexes in: crawled/indexes
Dedup: finished at 2011-04-18 20:19:48, elapsed: 00:00:05
IndexMerger: starting at 2011-04-18 20:19:48
IndexMerger: merging indexes to: crawled/index
Adding file:/home/hello/nutch-1.2/crawled/indexes/part-00000
IndexMerger: finished at 2011-04-18 20:19:48, elapsed: 00:00:00
crawl finished: crawled
 

  In it I see the error: fetch of http://www.ujs.edu.cn/ failed with: Http code=403, url=http://www.ujs.edu.cn/

  I've tried several times and it's always the same. Yet in a browser, http://www.ujs.edu.cn opens just fine. A 403 means the server refused permission to read the content, and I don't understand why that would happen here. Searching online turned up nothing. Can anyone tell me where I went wrong?
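
One way to narrow it down might be to compare what the server returns for a browser-like request versus one that looks like the crawler (a diagnostic sketch, assuming curl is installed; Nutch actually sends a longer agent string built from the http.agent.* properties, so using just "mynutch" is an approximation, and the idea that the server filters on the User-Agent is only a guess):

          # status code for a browser-like User-Agent
          curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" http://www.ujs.edu.cn/
          # status code for the crawler's agent name from nutch-site.xml
          curl -s -o /dev/null -w "%{http_code}\n" -A "mynutch" http://www.ujs.edu.cn/
          # check whether robots.txt singles out unknown agents
          curl -s http://www.ujs.edu.cn/robots.txt

If the first request returns 200 and the second 403, the server is most likely rejecting the crawler's User-Agent; if both return 200, the cause lies elsewhere.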


Reposted from qjwujian.iteye.com/blog/1007329