Lucene Learning - In-depth Lucene tokenizer, TokenStream gets token details

This is a reply to Niu Niu's question about the tokenizer used in the program: stop words can in fact be configured directly and simply through the word dictionary. All of the information a tokenizer produces in Lucene can be read from its TokenStream.

The core classes of the tokenizer are Analyzer, TokenStream, Tokenizer, TokenFilter.

Analyzer

The analyzers built into Lucene include StandardAnalyzer, StopAnalyzer, SimpleAnalyzer, and WhitespaceAnalyzer.

TokenStream

The stream produced when an analyzer processes its input. It stores all of the information about the resulting tokens, and each token unit can be read from the TokenStream in turn.

Tokenizer

Mainly responsible for receiving a character stream (a Reader) and segmenting it into tokens. Its implementation classes include the following:

KeywordTokenizer

StandardTokenizer

CharTokenizer

|----WhitespaceTokenizer

|----LetterTokenizer

|----LowerCaseTokenizer

TokenFilter

Applies various filtering operations, such as lowercasing or stop-word removal, to the tokens produced by the Tokenizer.
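
To make these relationships concrete, here is a minimal sketch (my own illustration, not from the original post) of chaining a Tokenizer and two TokenFilters by hand in Lucene 3.x; the custom analyzer at the end of this post builds exactly this kind of chain:

import java.io.StringReader;

import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChainSketch {
    public static void main(String[] args) throws Exception {
        // 1. LetterTokenizer splits the Reader on non-letter characters,
        // 2. LowerCaseFilter lowercases every token,
        // 3. StopFilter drops English stop words.
        TokenStream stream = new StopFilter(Version.LUCENE_36,
                new LowerCaseFilter(Version.LUCENE_36,
                        new LetterTokenizer(Version.LUCENE_36,
                                new StringReader("This IS a Chain-Demo"))),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
            System.out.print("[" + term + "]");    // prints [chain][demo]
        }
    }
}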

Viewing the tokens produced by an analyzer


package com.icreate.analyzer.luence;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/**
 *
 *  AnalyzerUtil.java
 *
 *  @version : 1.1
 *
 *  @author : Su Ruonian <a href="mailto:[email protected]">send mail</a>
 *
 *  @since : 1.0 Created: 2013-4-14 11:05:45 AM
 *
 */
public class AnalyzerUtil {

    /**
     *
     * Description: print the tokens produced for a string
     * @param str the string to analyze
     * @param analyzer the analyzer to use
     *
     */
    public static void displayToken(String str,Analyzer analyzer){
        try {
            // turn the string into a TokenStream (the field name is only a label here)
            TokenStream stream  = analyzer.tokenStream("", new StringReader(str));
            // attribute exposing the text of the current token
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            while(stream.incrementToken()){
                System.out.print("[" + cta + "]");
            }
            System.out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    
    public static void main(String[] args) {
        Analyzer aly1 = new StandardAnalyzer(Version.LUCENE_36);
        Analyzer aly2 = new StopAnalyzer(Version.LUCENE_36);
        Analyzer aly3 = new SimpleAnalyzer(Version.LUCENE_36);
        Analyzer aly4 = new WhitespaceAnalyzer(Version.LUCENE_36);
        
        String str = "hello kim,I am dennisit,我是 中国人,my email is [email protected], and my QQ is 1325103287";
        
        AnalyzerUtil.displayToken(str, aly1);
        AnalyzerUtil.displayToken(str, aly2);
        AnalyzerUtil.displayToken(str, aly3);
        AnalyzerUtil.displayToken(str, aly4);
    }
}

Program output:

[hello][kim][i][am][dennisit][我][是][中][国][人][my][email][dennisit][163][com][my][qq][1325103287]
[hello][kim][i][am][dennisit][我是][中国人][my][email][dennisit][com][my][qq]
[hello][kim][i][am][dennisit][我是][中国人][my][email][is][dennisit][com][and][my][qq][is]
[hello][kim,I][am][dennisit,我是][中国人,my][email][is][[email protected],][and][my][QQ][is][1325103287]

StandardAnalyzer splits English on word boundaries, keeps numbers whole, removes English stop words such as "is" and "and", and breaks Chinese into individual characters.

StopAnalyzer removes stop words and discards numbers; it cannot segment Chinese and splits on non-letter characters rather than just spaces.

SimpleAnalyzer likewise discards numbers and cannot segment Chinese; it splits on non-letter characters and lowercases the result.

WhitespaceAnalyzer splits on whitespace only, so punctuation stays attached to the tokens; it cannot segment Chinese either.
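
None of these built-in analyzers segments Chinese properly. For real Chinese word segmentation one would typically use a dedicated analyzer such as SmartChineseAnalyzer (from Lucene's smartcn module) or a third-party tokenizer such as IK Analyzer or mmseg4j.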

Displaying full details of each token

    /**
     *
     * Description: display all of the available information for each token
     * @param str the string to analyze
     * @param analyzer the analyzer to use
     *
     */
    // Also requires these imports:
    //   org.apache.lucene.analysis.tokenattributes.OffsetAttribute
    //   org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute
    //   org.apache.lucene.analysis.tokenattributes.TypeAttribute
    public static void displayAllTokenInfo(String str, Analyzer analyzer){
        try {
            // the first argument is only a field-name label and has no effect here
            TokenStream stream = analyzer.tokenStream("", new StringReader(str));
            // position increment relative to the previous token
            PositionIncrementAttribute postiona = stream.addAttribute(PositionIncrementAttribute.class);
            // start and end character offsets of each token
            OffsetAttribute offseta = stream.addAttribute(OffsetAttribute.class);
            // text of the current token
            CharTermAttribute chara = stream.addAttribute(CharTermAttribute.class);
            // type of the current token
            TypeAttribute typea = stream.addAttribute(TypeAttribute.class);
            while(stream.incrementToken()){
                System.out.print("Position increment " + postiona.getPositionIncrement() + ":\t");
                System.out.println(chara + "\t[" + offseta.startOffset() + " - " + offseta.endOffset() + "]\t<" + typea + ">");
            }
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Test code:

        Analyzer aly1 = new StandardAnalyzer(Version.LUCENE_36);
        Analyzer aly2 = new StopAnalyzer(Version.LUCENE_36);
        Analyzer aly3 = new SimpleAnalyzer(Version.LUCENE_36);
        Analyzer aly4 = new WhitespaceAnalyzer(Version.LUCENE_36);
        
        String str = "hello kim,I am dennisit,我是 中国人,my email is [email protected], and my QQ is 1325103287";
        
        AnalyzerUtil.displayAllTokenInfo(str, aly1);
        AnalyzerUtil.displayAllTokenInfo(str, aly2);
        AnalyzerUtil.displayAllTokenInfo(str, aly3);
        AnalyzerUtil.displayAllTokenInfo(str, aly4);

Program output:

Position increment 1: hello [0 - 5] <type=<ALPHANUM>>
Position increment 1: kim [6 - 9] <type=<ALPHANUM>>
Position increment 1: i [10 - 11] <type=<ALPHANUM>>
Position increment 1: am [12 - 14] <type=<ALPHANUM>>
Position increment 1: dennisit [15 - 23] <type=<ALPHANUM>>
Position increment 1: 我 [24 - 25] <type=<IDEOGRAPHIC>>
Position increment 1: 是 [25 - 26] <type=<IDEOGRAPHIC>>
Position increment 1: 中 [27 - 28] <type=<IDEOGRAPHIC>>
Position increment 1: 国 [28 - 29] <type=<IDEOGRAPHIC>>
Position increment 1: 人 [29 - 30] <type=<IDEOGRAPHIC>>
Position increment 1: my [31 - 33] <type=<ALPHANUM>>
Position increment 1: email [34 - 39] <type=<ALPHANUM>>
Position increment 2: dennisit [43 - 51] <type=<ALPHANUM>>
Position increment 1: 163 [52 - 55] <type=<NUM>>
Position increment 1: com [56 - 59] <type=<ALPHANUM>>
Position increment 2: my [65 - 67] <type=<ALPHANUM>>
Position increment 1: qq [68 - 70] <type=<ALPHANUM>>
Position increment 2: 1325103287 [74 - 84] <type=<NUM>>

Position increment 1: hello [0 - 5] <type=word>
Position increment 1: kim [6 - 9] <type=word>
Position increment 1: i [10 - 11] <type=word>
Position increment 1: am [12 - 14] <type=word>
Position increment 1: dennisit [15 - 23] <type=word>
Position increment 1: 我是 [24 - 26] <type=word>
Position increment 1: 中国人 [27 - 30] <type=word>
Position increment 1: my [31 - 33] <type=word>
Position increment 1: email [34 - 39] <type=word>
Position increment 2: dennisit [43 - 51] <type=word>
Position increment 1: com [56 - 59] <type=word>
Position increment 2: my [65 - 67] <type=word>
Position increment 1: qq [68 - 70] <type=word>

Position increment 1: hello [0 - 5] <type=word>
Position increment 1: kim [6 - 9] <type=word>
Position increment 1: i [10 - 11] <type=word>
Position increment 1: am [12 - 14] <type=word>
Position increment 1: dennisit [15 - 23] <type=word>
Position increment 1: 我是 [24 - 26] <type=word>
Position increment 1: 中国人 [27 - 30] <type=word>
Position increment 1: my [31 - 33] <type=word>
Position increment 1: email [34 - 39] <type=word>
Position increment 1: is [40 - 42] <type=word>
Position increment 1: dennisit [43 - 51] <type=word>
Position increment 1: com [56 - 59] <type=word>
Position increment 1: and [61 - 64] <type=word>
Position increment 1: my [65 - 67] <type=word>
Position increment 1: qq [68 - 70] <type=word>
Position increment 1: is [71 - 73] <type=word>

Position increment 1: hello [0 - 5] <type=word>
Position increment 1: kim,I [6 - 11] <type=word>
Position increment 1: am [12 - 14] <type=word>
Position increment 1: dennisit,我是 [15 - 26] <type=word>
Position increment 1: 中国人,my [27 - 33] <type=word>
Position increment 1: email [34 - 39] <type=word>
Position increment 1: is [40 - 42] <type=word>
Position increment 1: [email protected], [43 - 60] <type=word>
Position increment 1: and [61 - 64] <type=word>
Position increment 1: my [65 - 67] <type=word>
Position increment 1: QQ [68 - 70] <type=word>
Position increment 1: is [71 - 73] <type=word>
Position increment 1: 1325103287 [74 - 84] <type=word>
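
Note the position increments of 2 in the StandardAnalyzer and StopAnalyzer output: wherever a stop word such as "is" or "and" was filtered out, the gap is recorded in the next token's position increment, so phrase queries can still account for the removed word.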

Custom stop-word analyzer

Extend Analyzer and override its public TokenStream tokenStream(String fieldName, Reader reader) method.

package org.dennisit.lucene.util;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/**
 *
 *  org.dennisit.lucene.util.MyStopAnalyzer.java
 *
 *  @version : 1.1
 *
 *  @author : Su Ruonian <a href="mailto:[email protected]">send mail</a>
 *
 *  @since : 1.0 Created: 2013-4-14 12:06:08 PM
 *
 */
public class MyStopAnalyzer extends Analyzer{
    
    private Set stops;
    
    /**
     * Add custom stop words on top of the default ones
     * @param stopwords the custom stop words, passed as an array
     */
    public MyStopAnalyzer(String[] stopwords){
        // makeStopSet converts the string array into a Set automatically
        stops = StopFilter.makeStopSet(Version.LUCENE_36, stopwords, true);
        // add the default English stop words to the custom set
        stops.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }
    
    /**
     * With no arguments, only the default stop words are used
     */
    public MyStopAnalyzer(){
        // the default English stop word set
        stops = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }
    
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader){
        // assemble the Tokenizer and TokenFilter chain for this analyzer
        return new StopFilter(Version.LUCENE_36,
                new LowerCaseFilter(Version.LUCENE_36,
                new LetterTokenizer(Version.LUCENE_36, reader)),
                stops);
    }
    
    
    /**
     *
     * Description: print the tokens produced for a string
     * @param str the string to analyze
     * @param analyzer the analyzer to use
     *
     */
    public static void displayToken(String str,Analyzer analyzer){
        try {
            // turn the string into a TokenStream (the field name is only a label here)
            TokenStream stream  = analyzer.tokenStream("", new StringReader(str));
            // attribute exposing the text of the current token
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            while(stream.incrementToken()){
                System.out.print("[" + cta + "]");
            }
            System.out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    
    public static void main(String[] args) {
        // analyzer using only the default stop words
        Analyzer myAnalyzer1 = new MyStopAnalyzer();
        // analyzer with custom stop words appended to the defaults
        Analyzer myAnalyzer2 = new MyStopAnalyzer(new String[]{"hate","fuck"});
        // sentence to analyze
        String text = "fuck! I hate you very much";
        
        displayToken(text, myAnalyzer1);
        displayToken(text, myAnalyzer2);
    }
}

Program output:

[fuck][i][hate][you][very][much]
[i][you][very][much]
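
Once defined, the custom analyzer plugs into indexing like any other analyzer. Below is a minimal sketch of my own (not from the original post), assuming an in-memory RAMDirectory purely for illustration, showing MyStopAnalyzer wired into an IndexWriter under Lucene 3.6:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MyStopAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // in-memory index, just for the demo
        Directory dir = new RAMDirectory();
        // every analyzed field indexed through this writer goes through MyStopAnalyzer
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                new MyStopAnalyzer(new String[]{"hate","fuck"}));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("content", "fuck! I hate you very much",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);   // the stop words never reach the index
        writer.close();
    }
}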
