Lucene Learning - Tokenizers in Depth: Getting Token Details from TokenStream
This post answers Niu Niu's question about the tokenizer used in the program. In fact, the problem can be handled simply by configuring the stop-word dictionary. All of the information a Lucene tokenizer produces can be read from the TokenStream it returns.
The core classes of Lucene's analysis module are Analyzer, TokenStream, Tokenizer, and TokenFilter.
Analyzer
Lucene's built-in analyzers include StandardAnalyzer, StopAnalyzer, SimpleAnalyzer, and WhitespaceAnalyzer.
TokenStream
The stream produced when an analyzer processes its input. It carries all of the tokenization information, and the individual tokens can be read from it one by one.
Tokenizer
Responsible for receiving a character stream (a Reader) and splitting it into tokens. Some implementation classes:
KeywordTokenizer,
StandardTokenizer,
CharTokenizer
|----WhitespaceTokenizer
|----LetterTokenizer
|----LowerCaseTokenizer
TokenFilter
Applies various filtering operations to the tokens produced by a Tokenizer.
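To make the relationship between these classes concrete, here is a minimal plain-Java sketch (no Lucene dependency; all class and method names here are hypothetical stand-ins) of how an analyzer composes a tokenizer with a filter chain: the tokenizer splits the raw text, and each filter transforms or drops tokens as they pass through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for Tokenizer, TokenFilter, and Analyzer,
// to illustrate the pipeline idea only.
public class PipelineSketch {

    // "Tokenizer": split the raw character stream on whitespace.
    static List<String> whitespaceTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "TokenFilter": lowercase every token (like LowerCaseFilter).
    static List<String> lowerCaseFilter(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // "Analyzer": wire the tokenizer and filters into one chain.
    static List<String> analyze(String text) {
        return lowerCaseFilter(whitespaceTokenizer(text));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello Kim I AM dennisit"));
    }
}
```

In real Lucene the chain is built the same way, but lazily: each filter wraps the TokenStream beneath it and tokens flow through as `incrementToken()` is called, rather than being materialized into lists.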
Viewing an analyzer's token output
```java
package com.icreate.analyzer.luence;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/**
 * AnalyzerUtil.java
 *
 * @version 1.1
 * @author Su Ruonian <a href="mailto:[email protected]">send mail</a>
 * @since 1.0 Created: 2013-4-14 11:05:45 AM
 */
public class AnalyzerUtil {

    /**
     * Description: print the tokens produced for a string
     * @param str      the string to tokenize
     * @param analyzer the analyzer to use
     */
    public static void displayToken(String str, Analyzer analyzer) {
        try {
            // Turn the string into a token stream
            TokenStream stream = analyzer.tokenStream("", new StringReader(str));
            // Attribute holding the current token's term text
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            while (stream.incrementToken()) {
                System.out.print("[" + cta + "]");
            }
            System.out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        Analyzer aly1 = new StandardAnalyzer(Version.LUCENE_36);
        Analyzer aly2 = new StopAnalyzer(Version.LUCENE_36);
        Analyzer aly3 = new SimpleAnalyzer(Version.LUCENE_36);
        Analyzer aly4 = new WhitespaceAnalyzer(Version.LUCENE_36);

        String str = "hello kim,I am dennisit,我是 中国人,my email is [email protected], and my QQ is 1325103287";

        AnalyzerUtil.displayToken(str, aly1);
        AnalyzerUtil.displayToken(str, aly2);
        AnalyzerUtil.displayToken(str, aly3);
        AnalyzerUtil.displayToken(str, aly4);
    }
}
```
Program output (one line per analyzer, in the order above):
```
StandardAnalyzer  : [hello][kim][i][am][dennisit][我][是][中][国][人][my][email][dennisit][163][com][my][qq][1325103287]
StopAnalyzer      : [hello][kim][i][am][dennisit][我是][中国人][my][email][dennisit][com][my][qq]
SimpleAnalyzer    : [hello][kim][i][am][dennisit][我是][中国人][my][email][is][dennisit][com][and][my][qq][is]
WhitespaceAnalyzer: [hello][kim,I][am][dennisit,我是][中国人,my][email][is][[email protected],][and][my][QQ][is][1325103287]
```
- StandardAnalyzer: splits on punctuation and whitespace, keeps each number as a single token, splits Chinese into individual characters, and removes English stop words ("is", "and").
- StopAnalyzer: splits on non-letter characters and lowercases, removes English stop words, and discards numbers; it does not segment Chinese, so contiguous Chinese stays as one token.
- SimpleAnalyzer: splits on non-letter characters and lowercases, so numbers are discarded; it does not remove stop words or segment Chinese.
- WhitespaceAnalyzer: splits only on whitespace; no lowercasing, no stop-word removal, and no special handling of Chinese.
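The reason StopAnalyzer and SimpleAnalyzer drop the numbers is that both are built on LetterTokenizer: only letters count as token characters, so digits act as separators and purely numeric tokens never appear. A plain-Java sketch of that splitting rule (no Lucene dependency; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of LetterTokenizer-style splitting: token characters
// are letters only, so digits and punctuation act as separators and
// purely numeric tokens disappear entirely.
public class LetterSplitSketch {

    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters accumulate into the current token (lowercased,
                // mimicking the LowerCaseFilter stage).
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                // Any non-letter ends the current token.
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // "1325103287" vanishes, just as in the SimpleAnalyzer output above.
        System.out.println(letterTokenize("my QQ is 1325103287"));
    }
}
```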
Displaying full token details
```java
// Additional imports required:
// import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
// import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
// import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * Description: display all information about each token
 * @param str      the string to tokenize
 * @param analyzer the analyzer to use
 */
public static void displayAllTokenInfo(String str, Analyzer analyzer) {
    try {
        // The first argument is only a field name for identification; it has no effect here
        TokenStream stream = analyzer.tokenStream("", new StringReader(str));
        // Position increment between terms
        PositionIncrementAttribute positiona = stream.addAttribute(PositionIncrementAttribute.class);
        // Start/end character offsets of each term
        OffsetAttribute offseta = stream.addAttribute(OffsetAttribute.class);
        // Term text of the current token
        CharTermAttribute chara = stream.addAttribute(CharTermAttribute.class);
        // Type of the current token
        TypeAttribute typea = stream.addAttribute(TypeAttribute.class);
        while (stream.incrementToken()) {
            System.out.print("Position increment " + positiona.getPositionIncrement() + ":\t");
            System.out.println(chara + "\t[" + offseta.startOffset() + " - "
                    + offseta.endOffset() + "]\t<" + typea + ">");
        }
        System.out.println();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
```
Test code:
```java
Analyzer aly1 = new StandardAnalyzer(Version.LUCENE_36);
Analyzer aly2 = new StopAnalyzer(Version.LUCENE_36);
Analyzer aly3 = new SimpleAnalyzer(Version.LUCENE_36);
Analyzer aly4 = new WhitespaceAnalyzer(Version.LUCENE_36);

String str = "hello kim,I am dennisit,我是 中国人,my email is [email protected], and my QQ is 1325103287";

AnalyzerUtil.displayAllTokenInfo(str, aly1);
AnalyzerUtil.displayAllTokenInfo(str, aly2);
AnalyzerUtil.displayAllTokenInfo(str, aly3);
AnalyzerUtil.displayAllTokenInfo(str, aly4);
```
Program output:
```
StandardAnalyzer:
Position increment 1: hello      [0 - 5]    <type=<ALPHANUM>>
Position increment 1: kim        [6 - 9]    <type=<ALPHANUM>>
Position increment 1: i          [10 - 11]  <type=<ALPHANUM>>
Position increment 1: am         [12 - 14]  <type=<ALPHANUM>>
Position increment 1: dennisit   [15 - 23]  <type=<ALPHANUM>>
Position increment 1: 我         [24 - 25]  <type=<IDEOGRAPHIC>>
Position increment 1: 是         [25 - 26]  <type=<IDEOGRAPHIC>>
Position increment 1: 中         [27 - 28]  <type=<IDEOGRAPHIC>>
Position increment 1: 国         [28 - 29]  <type=<IDEOGRAPHIC>>
Position increment 1: 人         [29 - 30]  <type=<IDEOGRAPHIC>>
Position increment 1: my         [31 - 33]  <type=<ALPHANUM>>
Position increment 1: email      [34 - 39]  <type=<ALPHANUM>>
Position increment 2: dennisit   [43 - 51]  <type=<ALPHANUM>>
Position increment 1: 163        [52 - 55]  <type=<NUM>>
Position increment 1: com        [56 - 59]  <type=<ALPHANUM>>
Position increment 2: my         [65 - 67]  <type=<ALPHANUM>>
Position increment 1: qq         [68 - 70]  <type=<ALPHANUM>>
Position increment 2: 1325103287 [74 - 84]  <type=<NUM>>

StopAnalyzer:
Position increment 1: hello      [0 - 5]    <type=word>
Position increment 1: kim        [6 - 9]    <type=word>
Position increment 1: i          [10 - 11]  <type=word>
Position increment 1: am         [12 - 14]  <type=word>
Position increment 1: dennisit   [15 - 23]  <type=word>
Position increment 1: 我是       [24 - 26]  <type=word>
Position increment 1: 中国人     [27 - 30]  <type=word>
Position increment 1: my         [31 - 33]  <type=word>
Position increment 1: email      [34 - 39]  <type=word>
Position increment 2: dennisit   [43 - 51]  <type=word>
Position increment 1: com        [56 - 59]  <type=word>
Position increment 2: my         [65 - 67]  <type=word>
Position increment 1: qq         [68 - 70]  <type=word>

SimpleAnalyzer:
Position increment 1: hello      [0 - 5]    <type=word>
Position increment 1: kim        [6 - 9]    <type=word>
Position increment 1: i          [10 - 11]  <type=word>
Position increment 1: am         [12 - 14]  <type=word>
Position increment 1: dennisit   [15 - 23]  <type=word>
Position increment 1: 我是       [24 - 26]  <type=word>
Position increment 1: 中国人     [27 - 30]  <type=word>
Position increment 1: my         [31 - 33]  <type=word>
Position increment 1: email      [34 - 39]  <type=word>
Position increment 1: is         [40 - 42]  <type=word>
Position increment 1: dennisit   [43 - 51]  <type=word>
Position increment 1: com        [56 - 59]  <type=word>
Position increment 1: and        [61 - 64]  <type=word>
Position increment 1: my         [65 - 67]  <type=word>
Position increment 1: qq         [68 - 70]  <type=word>
Position increment 1: is         [71 - 73]  <type=word>

WhitespaceAnalyzer:
Position increment 1: hello             [0 - 5]    <type=word>
Position increment 1: kim,I             [6 - 11]   <type=word>
Position increment 1: am                [12 - 14]  <type=word>
Position increment 1: dennisit,我是     [15 - 26]  <type=word>
Position increment 1: 中国人,my         [27 - 33]  <type=word>
Position increment 1: email             [34 - 39]  <type=word>
Position increment 1: is                [40 - 42]  <type=word>
Position increment 1: [email protected], [43 - 60]  <type=word>
Position increment 1: and               [61 - 64]  <type=word>
Position increment 1: my                [65 - 67]  <type=word>
Position increment 1: QQ                [68 - 70]  <type=word>
Position increment 1: is                [71 - 73]  <type=word>
Position increment 1: 1325103287        [74 - 84]  <type=word>
```
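Notice the increments of 2 in the StandardAnalyzer and StopAnalyzer output: whenever a filter removes a token (a stop word like "is"), the next surviving token's position increment grows by one per removed token, so phrase and proximity queries still see the original word positions. A plain-Java sketch of that bookkeeping (no Lucene dependency; the method name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of how position increments arise: each removed stop
// word widens the gap to the next surviving token.
public class PositionIncrementSketch {

    // Given the full token list and a stop-word list, return the position
    // increment each surviving token would carry.
    static List<Integer> increments(List<String> tokens, List<String> stops) {
        List<Integer> result = new ArrayList<>();
        int skipped = 0;
        for (String t : tokens) {
            if (stops.contains(t)) {
                skipped++;              // removed token: remember the gap
            } else {
                result.add(1 + skipped); // surviving token absorbs the gap
                skipped = 0;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("email", "is", "dennisit");
        List<String> stops = List.of("is");
        // "dennisit" gets increment 2, matching the StopAnalyzer output above.
        System.out.println(increments(tokens, stops));
    }
}
```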
A custom stop-word analyzer
Extend Analyzer and override the public TokenStream tokenStream(String fieldName, Reader reader) method:
```java
package org.dennisit.lucene.util;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/**
 * org.dennisit.lucene.util.MyStopAnalyzer.java
 *
 * @version 1.1
 * @author Su Ruonian <a href="mailto:[email protected]">send mail</a>
 * @since 1.0 Created: 2013-4-14 12:06:08 PM
 */
public class MyStopAnalyzer extends Analyzer {

    private Set stops;

    /**
     * Add custom stop words on top of the default ones
     * @param stopwords custom stop words, passed in as an array
     */
    public MyStopAnalyzer(String[] stopwords) {
        // Automatically converts the string array into a Set
        stops = StopFilter.makeStopSet(Version.LUCENE_36, stopwords, true);
        // Add the default stop words to the custom ones
        stops.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    /**
     * No-argument constructor: use only the default stop words
     */
    public MyStopAnalyzer() {
        // Take the default stop-word set as-is
        stops = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    @Override
    public TokenStream tokenStream(String filename, Reader reader) {
        // Wire up the Tokenizer and filter chain for this analyzer
        return new StopFilter(Version.LUCENE_36,
                new LowerCaseFilter(Version.LUCENE_36,
                        new LetterTokenizer(Version.LUCENE_36, reader)),
                stops);
    }

    /**
     * Description: print the tokens produced for a string
     * @param str      the string to tokenize
     * @param analyzer the analyzer to use
     */
    public static void displayToken(String str, Analyzer analyzer) {
        try {
            // Turn the string into a token stream
            TokenStream stream = analyzer.tokenStream("", new StringReader(str));
            // Attribute holding the current token's term text
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            while (stream.incrementToken()) {
                System.out.print("[" + cta + "]");
            }
            System.out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // Default stop words only
        Analyzer myAnalyzer1 = new MyStopAnalyzer();
        // Default stop words plus our own
        Analyzer myAnalyzer2 = new MyStopAnalyzer(new String[]{"hate", "fuck"});
        // Sentence to tokenize
        String text = "fuck! I hate you very much";
        displayToken(text, myAnalyzer1);
        displayToken(text, myAnalyzer2);
    }
}
```
Program output:
```
myAnalyzer1 (default stop words):      [fuck][i][hate][you][very][much]
myAnalyzer2 (custom stop words added): [i][you][very][much]
```