Building a stand-alone full-text search service with Nutch + Solr

Background

While studying digital currencies recently, I needed a large amount of online public-opinion data. I looked at search engines such as Google, Bing, Yahoo, Baidu, Sogou and 360 and concluded that, for various reasons (to be covered in a separate article if there is interest), building my own was the better fit. I used Nutch + Solr many years ago; this time I rebuilt the setup with the latest recommended combination of versions. The process is as follows:

Environment:

CentOS 7 (4 cores, 8 GB RAM, 300 GB disk), Nutch 1.14 + Solr 6.6.0 (the combination recommended by Nutch)

Steps

1. Start with the official documentation

It is the most detailed and authoritative source: https://wiki.apache.org/nutch/NutchTutorial

2. Download Nutch and Solr, and unpack them
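
A minimal sketch of this step follows. The archive URLs are my assumption about where the Apache archive keeps these two releases, so check them before relying on them. The DO_DOWNLOAD switch and the fetch_and_unpack helper are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Assumed locations of the two releases on the Apache archive; verify before use.
NUTCH_ARCHIVE_URL="https://archive.apache.org/dist/nutch/1.14/apache-nutch-1.14-bin.tar.gz"
SOLR_ARCHIVE_URL="https://archive.apache.org/dist/lucene/solr/6.6.0/solr-6.6.0.tgz"

# Download one archive into the current directory and unpack it there.
fetch_and_unpack() {
  archive=$(basename "$1")
  curl -fL -o "$archive" "$1" && tar -xzf "$archive"
}

# Hypothetical safety switch: nothing is downloaded unless DO_DOWNLOAD=1 is set.
if [ "${DO_DOWNLOAD:-0}" = 1 ]; then
  fetch_and_unpack "$NUTCH_ARCHIVE_URL"
  fetch_and_unpack "$SOLR_ARCHIVE_URL"
fi
```

Run with DO_DOWNLOAD=1, this should leave the directories apache-nutch-1.14 and solr-6.6.0 next to the downloaded archives.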

3. Set the crawler name

A crawler should identify itself to the sites it visits. In the apache-nutch-1.14/conf/nutch-site.xml file, add:

<property>
 <name>http.agent.name</name>
 <value>eCoin Research Spider</value>
</property>

4. Create the URL seed list

In the apache-nutch-1.14/urls/seed.txt file (you may need to create the urls directory first), add the addresses of frequently visited sites with substantial content, for example:

http://www.8btc.com/
http://www.bishijie.com/
http://www.jinse.com/
https://support.okex.com/hc
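
The same step as a small shell sketch. It assumes Nutch was unpacked into apache-nutch-1.14 below the current directory. NUTCH_HOME is a variable name introduced here for illustration, not something Nutch itself reads.

```shell
#!/bin/sh
# Assumed location of the unpacked Nutch distribution (override as needed).
NUTCH_HOME="${NUTCH_HOME:-apache-nutch-1.14}"

# Create the seed directory if it does not exist yet.
mkdir -p "$NUTCH_HOME/urls"

# Write one seed URL per line; the quoted EOF keeps the text literal.
cat > "$NUTCH_HOME/urls/seed.txt" <<'EOF'
http://www.8btc.com/
http://www.bishijie.com/
http://www.jinse.com/
https://support.okex.com/hc
EOF

# Print the number of seed lines as a quick check.
wc -l < "$NUTCH_HOME/urls/seed.txt"
```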

5. Restrict the URLs to crawl

In practice you only need to crawl the sites you consider worthwhile. Good sites are few, and there is less useful information out there than you might imagine.

In the apache-nutch-1.14/conf/regex-urlfilter.txt file, add the following rules in place of the final catch-all line +. (if that line is left in, every other URL is still accepted):

+^https?://([a-z0-9-]+\.)*8btc\.com/
+^https?://([a-z0-9-]+\.)*bishijie\.com/
+^https?://([a-z0-9-]+\.)*jinse\.com/
+^https?://([a-z0-9-]+\.)*okex\.com/
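
To see which URLs these rules let through before starting a crawl, the patterns can be tried with grep. This is only a rough sketch: it checks the four accept patterns (without the leading +) and ignores the other rules in regex-urlfilter.txt, and grep -E is not the Java regex engine Nutch uses, although these particular patterns behave the same in both. The function name url_allowed is made up for this example.

```shell
#!/bin/sh
# Return success if the URL matches one of the four accept patterns.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq \
    -e '^https?://([a-z0-9-]+\.)*8btc\.com/' \
    -e '^https?://([a-z0-9-]+\.)*bishijie\.com/' \
    -e '^https?://([a-z0-9-]+\.)*jinse\.com/' \
    -e '^https?://([a-z0-9-]+\.)*okex\.com/'
}

# Try a few sample URLs and report the decision for each one.
for url in \
  "http://www.8btc.com/news" \
  "https://support.okex.com/hc" \
  "http://example.com/"
do
  if url_allowed "$url"; then
    echo "keep $url"
  else
    echo "skip $url"
  fi
done
```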

6. Index the time field

This step matters, because time information is very important. Here the time at which Nutch indexed the page is used as an attribute of the URL. Parsing a publication time out of the HTML is unreliable; by comparison, this timestamp is somewhat more dependable. Setting indexed="true" lets Solr filter and sort on the field.

In the file apache-nutch-1.14/conf/schema.xml, change

<field name="tstamp" type="date" stored="true" indexed="false"/>

to

<field name="tstamp" type="date" stored="true" indexed="true"/>
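
Once the field is indexed and a crawl has been loaded into the ecoins core (created in the steps below), tstamp can be used in range filters and sorting. The following is a sketch under those assumptions. The helper names tstamp_filter and recent_docs are invented here, and the field list url,title,tstamp assumes the stock Nutch schema.

```shell
#!/bin/sh
# Assumed Solr endpoint of the core used in this article.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/ecoins}"

# Build a Solr range filter covering the last N days, using Solr date math.
tstamp_filter() {
  printf 'tstamp:[NOW-%sDAYS TO NOW]' "$1"
}

# List up to 10 documents from the last N days, newest first.
recent_docs() {
  curl -s -G "$SOLR_URL/select" \
    --data-urlencode "q=*:*" \
    --data-urlencode "fq=$(tstamp_filter "$1")" \
    --data-urlencode "sort=tstamp desc" \
    --data-urlencode "fl=url,title,tstamp" \
    --data-urlencode "rows=10"
}
```

With Solr running, calling recent_docs 7 would request the pages recorded during the past week.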

7. Configure Solr and integrate it with Nutch

Follow the official documentation above exactly. The relevant steps, excerpted (the core is named ecoins here):

  • create resources for a new Nutch Solr core:
    cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs ${APACHE_SOLR_HOME}/server/solr/configsets/nutch
  • copy the Nutch schema.xml into the conf directory:
    cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf
  • make sure that there is no managed-schema "in the way":
    rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
  • start the Solr server:
    ${APACHE_SOLR_HOME}/bin/solr start
  • create the core:
    ${APACHE_SOLR_HOME}/bin/solr create -c ecoins -d server/solr/configsets/nutch/conf/
  • add the core name to the Solr server URL when running Nutch:
    -Dsolr.server.url=http://localhost:8983/solr/ecoins
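
Before crawling, it can help to confirm that the ecoins core really exists. The sketch below asks Solr's CoreAdmin STATUS action and looks for an instanceDir entry in the JSON answer. That entry is, as far as I know, only present for a core that is loaded, so treat the check as a heuristic. SOLR_BASE, core_status_url and core_exists are names made up for this example.

```shell
#!/bin/sh
# Assumed base URL of the local Solr server.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Build the CoreAdmin STATUS URL for the given core name.
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json' "$SOLR_BASE" "$1"
}

# Succeed if the STATUS response describes a loaded core (heuristic check).
core_exists() {
  curl -s "$(core_status_url "$1")" | grep -q '"instanceDir"'
}
```

With Solr started, core_exists ecoins should succeed after the create step and fail before it.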

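8. Try it out

From the apache-nutch-1.14 directory, run:

./bin/crawl -i -D "solr.server.url=http://localhost:8983/solr/ecoins" -s urls/ ecoinscrawl/ 2

Here -i indexes the results into Solr, -s gives the seed directory, ecoinscrawl/ is the crawl data directory, and 2 is the number of crawl rounds.

Then visit http://localhost:8983/solr in your browser to open the Solr admin page.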

Issue 1: Extracting the main text with Boilerpipe

By default the text of the whole HTML page is extracted, so menus, advertisements and the like need to be removed. The long-awaited Boilerpipe does this, and it is already integrated into Nutch. The following configuration is required:

In the apache-nutch-1.14/conf/nutch-site.xml file, add:

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>

Issue 2: Word segmentation

I have long been unsure about this question. My own view is that the segmentation Solr provides by default is the best choice here. The index now records the position of each term, which makes it convenient to query for a whole sentence, and that matters a great deal when searching for statements made by leaders. The default configuration is fine and index size is not a concern, so I added no extra configuration for word segmentation.
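
A whole-sentence search can be sketched as a Solr phrase query: the sentence is wrapped in double quotes so that the recorded term positions are used. The example assumes the ecoins core from above and the content field of the stock Nutch schema. The helper names phrase_query and search_phrase are invented for this illustration.

```shell
#!/bin/sh
# Assumed Solr endpoint of the core used in this article.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/ecoins}"

# Turn a sentence into a phrase query on the content field,
# escaping any embedded backslashes and double quotes.
phrase_query() {
  escaped=$(printf '%s' "$1" | sed 's/[\\"]/\\&/g')
  printf 'content:"%s"' "$escaped"
}

# Return up to 10 documents whose content contains the exact sentence.
search_phrase() {
  curl -s -G "$SOLR_URL/select" \
    --data-urlencode "q=$(phrase_query "$1")" \
    --data-urlencode "fl=url,title,tstamp" \
    --data-urlencode "rows=10"
}
```

With Solr running, search_phrase "digital currency" would return only pages in which the two words appear next to each other in that order.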

Comments and discussion are welcome.
