Download and use IKAnalyzer

IKAnalyzer Download

Link: https://pan.baidu.com/s/1bNqXh8B7suT1rAUm_ZZ7gw (extraction code: c08j)

The folder structure is as follows (screenshot omitted; the package contains the IKAnalyzer jar together with its configuration file and the extension and stop-word dictionaries).
By default, Lucene's StandardAnalyzer splits Chinese text character by character, so each individual character is counted as a separate term.

        // Configure the tokenizer
        IndexWriterConfig config = new IndexWriterConfig();

The no-argument IndexWriterConfig constructor uses StandardAnalyzer:

    public IndexWriterConfig() {
        this(new StandardAnalyzer());
    }

Since we want a Chinese tokenizer, we need to replace it, as shown in the sketch below.
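A minimal sketch of that swap (assuming the IKAnalyzer dependency introduced later in this post is on the classpath) is to pass the analyzer to the IndexWriterConfig constructor:

        // Use IKAnalyzer instead of the default StandardAnalyzer
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());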

First, let's test the segmentation produced by the default StandardAnalyzer:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;

public class MyTokenStream {
    public static void main(String[] args) throws IOException {
        // Create a StandardAnalyzer object
        Analyzer analyzer = new StandardAnalyzer();
        // Get a TokenStream object
        // Parameter 1: the field name; parameter 2: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("", "test a lucene procedures, cups stop");
        // Add an attribute reference used to retrieve each keyword
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Add an offset attribute that records each keyword's start and end positions
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // Reset the stream to the head of the token list
        tokenStream.reset();
        // Iterate over the token list; incrementToken() returns false at the end
        while (tokenStream.incrementToken()) {
            System.out.println("Start--->" + offsetAttribute.startOffset());
            System.out.println(charTermAttribute);
            System.out.println("End--->" + offsetAttribute.endOffset());
        }
        }
        tokenStream.close();
    }
}

The test output is as follows (screenshot omitted).
The term "program" is divided into two separate words

Next, switch to the third-party IKAnalyzer. Either add the associated jar package to the project, or pull in the dependency via Maven:

    <!-- https://mvnrepository.com/artifact/com.jianggujin/IKAnalyzer-lucene -->
                    <dependency>
                        <groupId>com.jianggujin</groupId>
                        <artifactId>IKAnalyzer-lucene</artifactId>
                        <version>8.0.0</version>
                    </dependency>

Then modify the code; the only change is the line that constructs the analyzer:

    public static void main(String[] args) throws IOException {
        // Create an IKAnalyzer object instead of the StandardAnalyzer
        Analyzer analyzer = new IKAnalyzer();
        // Get a TokenStream object
        // Parameter 1: the field name; parameter 2: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("", "test a lucene procedures, cups stop");
        // Add an attribute reference used to retrieve each keyword
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Add an offset attribute that records each keyword's start and end positions
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // Reset the stream to the head of the token list
        tokenStream.reset();
        // Iterate over the token list; incrementToken() returns false at the end
        while (tokenStream.incrementToken()) {
            System.out.println("Start--->" + offsetAttribute.startOffset());
            System.out.println(charTermAttribute);
            System.out.println("End--->" + offsetAttribute.endOffset());
        }
        tokenStream.close();
    }

In the output (screenshot omitted), the analyzer now treats "program" as a single term. However, "Paradis" is still split into single characters, and the word "cup" does not appear at all, because "cup" is in the stop-word list.

Next, add the word "Paradis" to the extension dictionary.
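As a sketch of how the dictionaries are wired up (assuming the conventional file names ext.dic and stopword.dic from the download), IKAnalyzer loads them through an IKAnalyzer.cfg.xml file on the classpath; each dictionary file lists one word per line:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Extension dictionary: add a line containing "Paradis" to this file -->
        <entry key="ext_dict">ext.dic;</entry>
        <!-- Extension stop-word dictionary -->
        <entry key="ext_stopwords">stopword.dic;</entry>
    </properties>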
Test results (screenshot omitted): "Paradis" is now kept as a single term.
TIP: when editing the extension dictionary and the stop-word dictionary files, do not use Windows Notepad, because Notepad saves UTF-8 with a BOM; the dictionary files must be saved as plain UTF-8 (without BOM).
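One way to sidestep the editor problem entirely is to write the dictionary file from code. This is just a sketch (the file name ext.dic is an assumption), relying on the fact that Java's UTF-8 encoder does not emit a BOM:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WriteExtDic {
        public static void main(String[] args) throws Exception {
            // One word per line; Java writes plain UTF-8 with no BOM
            Files.write(Paths.get("ext.dic"),
                    "Paradis\n".getBytes(StandardCharsets.UTF_8));
        }
    }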

Origin www.cnblogs.com/yjc1605961523/p/12361327.html