IKAnalyzer Download
Download link: https://pan.baidu.com/s/1bNqXh8B7suT1rAUm_ZZ7gw (extraction code: c08j)
The folder structure is as follows:
By default, Lucene uses the StandardAnalyzer, which splits Chinese text one character at a time, so each single character is counted as a separate term.
```java
// configure the tokenizer
IndexWriterConfig config = new IndexWriterConfig();
```

The no-argument IndexWriterConfig constructor uses StandardAnalyzer:

```java
public IndexWriterConfig() {
    this(new StandardAnalyzer());
}
```
Since we want a Chinese analyzer, we need to replace it.
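To swap in a different analyzer, pass it to the IndexWriterConfig constructor. A minimal sketch, assuming the IKAnalyzer-lucene dependency shown later in this article is on the classpath:

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.wltea.analyzer.lucene.IKAnalyzer;

// pass the Chinese analyzer to the config instead of relying on the default StandardAnalyzer
IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
```

Any Analyzer subclass can be passed the same way, so the rest of the indexing code does not change.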
First, test the segmentation produced by the default StandardAnalyzer:
```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer; // used in the modified version below

import java.io.IOException;

public class MyTokenStream {
    public static void main(String[] args) throws IOException {
        // create a standard analyzer object
        Analyzer analyzer = new StandardAnalyzer();
        // get a TokenStream; parameter 1 is the field name, parameter 2 is the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("", "test a lucene procedures, cups stop");
        // add a reference for reading each term
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // add an offset reference recording each term's start and end positions
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // reset the stream pointer to the beginning
        tokenStream.reset();
        // iterate over the term list; incrementToken() returns false at the end
        while (tokenStream.incrementToken()) {
            System.out.println("start--->" + offsetAttribute.startOffset());
            System.out.println(charTermAttribute);
            System.out.println("end--->" + offsetAttribute.endOffset());
        }
        tokenStream.close();
    }
}
```
The test results are as follows:
The word "program" is split into two separate single-character terms.
Next, switch to the third-party IKAnalyzer. Either add its jar to the project, or pull it in as a Maven dependency:
```xml
<!-- https://mvnrepository.com/artifact/com.jianggujin/IKAnalyzer-lucene -->
<dependency>
    <groupId>com.jianggujin</groupId>
    <artifactId>IKAnalyzer-lucene</artifactId>
    <version>8.0.0</version>
</dependency>
```
Then modify the code; the only change is the line that constructs the analyzer (marked with a red line in the screenshot):
```java
public static void main(String[] args) throws IOException {
    // create an IKAnalyzer object in place of the standard analyzer
    Analyzer analyzer = new IKAnalyzer();
    // get a TokenStream; parameter 1 is the field name, parameter 2 is the text to analyze
    TokenStream tokenStream = analyzer.tokenStream("", "test a lucene procedures, cups stop");
    // add a reference for reading each term
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // add an offset reference recording each term's start and end positions
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    // reset the stream pointer to the beginning
    tokenStream.reset();
    // iterate over the term list; incrementToken() returns false at the end
    while (tokenStream.incrementToken()) {
        System.out.println("start--->" + offsetAttribute.startOffset());
        System.out.println(charTermAttribute);
        System.out.println("end--->" + offsetAttribute.endOffset());
    }
    tokenStream.close();
}
```
As the output shows, IKAnalyzer now treats "program" as a single term. However, "Paradis" is still split into single characters, and "cup" does not appear at all, because "cup" is in the stop-word list.
Next, add the word "Paradis" to the extension dictionary.
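IKAnalyzer picks up the extension dictionary and the extension stop-word dictionary from files listed in IKAnalyzer.cfg.xml on the classpath. A typical configuration looks like the sketch below; the file names ext.dic and stopword.dic are examples, so use whatever names your project already has:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: one word per line -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- extension stop-word dictionary: one word per line -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```

With this in place, adding "Paradis" on its own line in ext.dic is enough for the analyzer to treat it as a single term after a restart.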
Test Results
TIP: when editing the extension dictionary and the stop-word dictionary, do not use Windows Notepad, because Notepad saves files as UTF-8 with a BOM; the dictionary files must be saved as plain UTF-8 (without a BOM).
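If you prefer to generate the dictionary files from code instead of an editor, writing them with Java's UTF-8 charset avoids the BOM problem entirely. A small sketch (the file name and helper names are illustrative, not part of IKAnalyzer):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteDict {
    // write dictionary entries one per line, as plain UTF-8 with no BOM
    static void writeDict(Path file, String... words) throws IOException {
        Files.write(file, String.join("\n", words).getBytes(StandardCharsets.UTF_8));
    }

    // check whether a file starts with the UTF-8 BOM bytes EF BB BF
    static boolean hasBom(Path file) throws IOException {
        byte[] b = Files.readAllBytes(file);
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        Path dict = Files.createTempFile("ext", ".dic");
        writeDict(dict, "Paradis");
        // getBytes(UTF_8) never prepends a BOM, so this prints "no BOM"
        System.out.println(hasBom(dict) ? "BOM present" : "no BOM");
        Files.delete(dict);
    }
}
```

This is also a quick way to check an existing dictionary file: if hasBom returns true, re-save it as plain UTF-8 before IKAnalyzer loads it.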