http://www.xxx.com/ skipped. Content of size 67099 was truncated to 59363

If Nutch logs a message such as "http://www.xxx.com/ skipped. Content of size 67099 was truncated to 59363",
add the following to nutch-site.xml:
<property> 
  <name>parser.skip.truncated</name> 
  <value>false</value> 
</property>

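This happens because the content Nutch fetched for the page is shorter than the size the server declared (here 59363 bytes instead of 67099), so the page is treated as truncated. By default parser.skip.truncated is true and the parser skips such pages; setting it to false tells Nutch to parse the truncated content anyway.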
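A related setting that may be worth checking is http.content.limit, which caps how many bytes Nutch downloads per page over HTTP. The fragment below is only a sketch and rests on the assumption that this limit is what cut the page short, which the original post does not confirm. A negative value disables the limit, so pages are fetched in full rather than parsed in truncated form.

```xml
<!-- Assumption: truncation is caused by the HTTP download size limit. -->
<!-- A negative value means no limit; choose a positive byte count to keep a cap. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```

Like the property above, this goes in nutch-site.xml. Removing the limit can increase memory and storage use on crawls that hit very large pages.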

Reposted from qq346359669.iteye.com/blog/2173487