On Windows, Nutch needs Cygwin as its runtime environment to simulate Linux

1. On Windows, Nutch needs Cygwin as a simulated Linux environment to run in. There is not much to say about installing Cygwin: download setup.exe from http://www.cygwin.com/, run it, and click Next through the installer.

2. After installing Cygwin, make sure the environment variables are configured. You can run cygcheck -c cygwin to check the installed version; once that works, you can proceed to the next step.

3. Download Nutch 1.6 from the official Nutch website (the latest version is 2.1). The 1.6 release ships with a precompiled bin, so unlike Nutch 2.1, which has to be compiled with Ant, it needs no build step.

4. Create a urls folder in the Cygwin root directory and, inside it, a file listing the URLs you want to crawl. The file may have a .txt extension or no extension at all. Then create a folder, xxx, to hold the data generated by the crawl.
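
As a minimal sketch of this step, the commands below create both folders and a one-line seed file. The file name seed.txt and the example URL are assumptions chosen for illustration; xxx is the placeholder folder name used in this article. Run it from the directory in which you created the urls folder.

```shell
#!/bin/sh
# Sketch only: seed.txt and the example URL are illustrative assumptions.
set -eu

SEED_DIR="urls"                 # folder holding the seed list
SEED_FILE="$SEED_DIR/seed.txt"  # hypothetical file name; the extension is optional
OUT_DIR="xxx"                   # placeholder name for the crawl output folder

mkdir -p "$SEED_DIR" "$OUT_DIR"

# One URL per line; replace with the sites you actually want to crawl.
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_FILE"

cat "$SEED_FILE"
```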

5. In Cygwin, change to the Nutch 1.6 root directory and run bin/nutch. If everything is set up correctly, Cygwin prints the list of Nutch commands.

6. The next step is to crawl the pages. Run bin/nutch crawl urls -dir xxx -depth 2 -threads 2 -topN 2. This generates three folders inside xxx (crawldb, linkdb, and segments) that hold the crawl data.
Also note that versions after 1.2 no longer produce the index and indexes folders, nor the packaged war file. The author believes this is meant to let Nutch and Solr each focus on their own job: Nutch mainly crawls data, and Solr is mainly used to search it.
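
A quick way to confirm the crawl finished is to check that the three folders exist. The helper below is a sketch: check_crawl_output is a hypothetical function written for this article, not part of Nutch, and it only tests for the directories named above.

```shell
#!/bin/sh
# Sketch: verify that a crawl produced the three expected folders.
# check_crawl_output is a hypothetical helper, not a Nutch command.
check_crawl_output() {
    dir="$1"
    rc=0
    for sub in crawldb linkdb segments; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Typical use, run from the Nutch root directory:
#   bin/nutch crawl urls -dir xxx -depth 2 -threads 2 -topN 2
#   check_crawl_output xxx && echo "crawl output looks complete"
```

In the crawl command, -depth is the link depth to follow from the seed pages, -threads is the number of fetcher threads, and -topN caps the number of pages fetched at each level.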

7. Once the crawl succeeds and the three folders from the previous step exist, they can be mapped into a Solr index. The author uses Solr 3.6 and has not tested Solr 4.x. In Cygwin, run bin/nutch solrindex http://localhost:8080/solr/ myfile/crawldb -linkdb myfile/linkdb myfile/segments/*, where myfile is the crawl output folder (xxx in the previous step). Before that, make sure your Solr service is started and reachable. If the command fails, the most likely cause is a mismatch between the mapped fields: check the solrindex-mapping.xml file in Nutch's conf directory and configure the corresponding fields in Solr's schema.xml.
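
To show how the arguments of the solrindex command fit together, here is a small sketch that assembles the command line from a Solr URL and a crawl folder. solrindex_cmd is a hypothetical helper written for this article; it only prints the command and does not run Nutch.

```shell
#!/bin/sh
# Sketch: print the solrindex command for a given Solr URL and crawl folder.
# solrindex_cmd is a hypothetical helper; it prints the command, nothing more.
solrindex_cmd() {
    solr_url="$1"
    crawl_dir="$2"
    # The trailing /* is left for the shell to expand when the command is run.
    printf 'bin/nutch solrindex %s %s/crawldb -linkdb %s/linkdb %s/segments/*\n' \
        "$solr_url" "$crawl_dir" "$crawl_dir" "$crawl_dir"
}

# prints: bin/nutch solrindex http://localhost:8080/solr/ myfile/crawldb -linkdb myfile/linkdb myfile/segments/*
solrindex_cmd "http://localhost:8080/solr/" "myfile"
```

Copy the printed line into Cygwin from the Nutch root directory, replacing the URL and folder with your own.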

8. After the mapping succeeds, visit the Solr home page and click query, and you will see the results you just crawled!

