Solr Chinese word segmentation

Solr is a keyword search engine developed outside China, so it does not handle Chinese word segmentation well out of the box. Chinese developers therefore wrote IKAnalyzer. IK, however, does not always split text into the words you want, because the terms in its dictionary are not necessarily the ones you need. Fortunately, it leaves a hook that lets you manage the dictionary yourself. The steps are as follows:

1. Download the IK package, IK Analyzer 2012FF. As the name suggests, this package has not had a new release in a long time.

2. Add the IK word segmentation type to the schema.xml file. As mentioned before, schema.xml is derived from the managed-schema file, so the two files should be kept in sync at all times.

<!-- IK word segmentation that I added -->
    <fieldType name="text_ik" class="solr.TextField">   
        <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>   
        <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>   
    </fieldType>
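
For the new type to have any effect, a field has to reference it. A minimal sketch, assuming a hypothetical field named title_ik (the field name and the indexed/stored settings are not from the original setup and should be adapted to your own schema):

```xml
<!-- Hypothetical field; only type="text_ik" comes from the fieldType defined above -->
<field name="title_ik" type="text_ik" indexed="true" stored="true"/>
```

After a schema change like this, existing documents generally need to be re-indexed before the new analysis applies to them.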

3. The environment setup covered earlier is not repeated here. In Solr's Tomcat deployment, place IK's configuration file, IKAnalyzer.cfg.xml, under webapp/solr/WEB-INF/classes/. Its contents are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer extension configuration </comment>
    <!--Users can configure their own extended dictionary here-->
    <entry key="ext_dict">ext.dic;</entry>
    
    <!--Users can configure their own extended stop word dictionary here-->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>

Then create ext.dic in the same directory as IKAnalyzer.cfg.xml, that is, classes, and add your own terms to it, one term per line. The file should be saved as UTF-8 without a BOM.
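
If the custom vocabulary grows, it may be easier to split it across several files. As far as I know, IK treats the value of ext_dict as a semicolon-separated list, so a sketch like the following should work. Note that brand.dic is a hypothetical second dictionary and is not part of the original setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- ext.dic is from the steps above; brand.dic is a hypothetical extra dictionary -->
    <entry key="ext_dict">ext.dic;brand.dic;</entry>
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```

Each listed file would sit next to IKAnalyzer.cfg.xml in classes. Dictionary changes are typically picked up only after Solr is restarted.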

Previously, the word "Evergrande" was not produced as a single term by the segmenter. This time I added "Evergrande" to ext.dic and then ran the analysis in the Solr admin interface on a field configured with the text_ik type. The word is now segmented out as one term.

