IK tokenizer principle and source code analysis

You cannot work on search technology without touching tokenizers. Personally, I think there are two main reasons why search engines cannot be replaced by databases. First, when the amount of data is large, a search engine answers queries quickly: anyone who has dealt with a huge single table in a database knows what a headache it is, and asking a database to do fuzzy queries over a large data set is even harder (prefix matching fares much better), so it should be avoided by design. Second, a search engine can understand the user better than a database. That understanding certainly does not come from simple matching; all kinds of processing can be layered on top, up to and including advanced natural language processing techniques, but the most common and basic approach is to rely on a tokenizer, which is a relatively simple and efficient way to do it.

<!--more--> Word segmentation is a cornerstone of search technology. Many people have used it, and if you just want to stand up a search engine quickly, you really don't need to know much about it. But once result quality matters, there is a lot of work to be done around the tokenizer. For example, in real e-commerce search work, category prediction has to rely heavily on word segmentation, and at the very least you need to be able to add rules to the tokenizer dynamically. Another simple example: if your optimization strategy is to weight different terms and boost a few key terms, you also need to depend on and understand the tokenizer. This article analyzes the implementation based on the original IK tokenizer source code, focusing on three points: 1. building the dictionary tree, i.e. loading the dictionaries into an in-memory structure; 2. matching and looking up words, i.e. generating the possible ways of segmenting a sentence into words; 3. ambiguity resolution, i.e. deciding which of the different segmentations is more reasonable. The original code is at [https://code.google.com/p/ik-analyzer/](https://code.google.com/p/ik-analyzer/), and a copy has been uploaded to GitHub at [https://github.com/quentinxxz/Search/tree/master/IKAnalyzer2012FF_hf1_source/](https://github.com/quentinxxz/Search/tree/master/IKAnalyzer2012FF_hf1_source/)

Dictionary

For background data work, the source of everything is the data itself. The IK tokenizer ships with three dictionaries: 1. the main dictionary main2012.dic; 2. the quantifier dictionary quantifier.dic; 3. the stop word dictionary stopword.dic.
Dictionary is the dictionary management class, which loads each dictionary into an in-memory structure. The concrete dictionary code is located in org.wltea.analyzer.dic.DictSegment. This class implements the tokenizer's core data structure, the Trie tree.
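Before getting into the trie structure itself, here is a minimal usage sketch of querying the loaded dictionaries. It assumes the Dictionary.initial(...) and DefaultConfig.getInstance() entry points of this IK version; matchInMainDict and Hit are the same classes that appear in the analyze code quoted later.

import org.wltea.analyzer.cfg.DefaultConfig;
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;

public class DictDemo {
    public static void main(String[] args) {
        //load main2012.dic, quantifier.dic and stopword.dic into memory (assumed entry point)
        Dictionary.initial(DefaultConfig.getInstance());

        char[] text = "的确实在".toCharArray();
        //match the two characters starting at position 1 against the main dictionary trie
        Hit hit = Dictionary.getSingleton().matchInMainDict(text, 1, 2);
        System.out.println("match=" + hit.isMatch() + ", prefix=" + hit.isPrefix());
    }
}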

A Trie tree (dictionary tree) is a tree with a fairly simple structure. It is used to build a dictionary and to find words quickly by comparing prefix characters one by one, which is why it is sometimes called a prefix tree. A concrete example follows.
Figure 1: an example trie tree (tireTree.jpg)
Reading from the left, abc, abcd, abd, b, bcd and so on are the words stored in the tree. Chinese characters can of course be handled the same way, but there are far more Chinese characters than the 26 English letters, so children can no longer be indexed by position (in English, each node can simply hold an array of length 26). In that case the trie becomes quite sparse and memory-hungry, so there is a variant, the Ternary Tree, that guarantees a smaller memory footprint. The IK tokenizer does not use a Ternary Tree, so it is not covered here; see http://www.cnblogs.com/rush/archive/2012/12/30/2839996.html for details.
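To make the structure concrete, here is an illustrative trie sketch in plain Java (not IK's DictSegment): words are inserted character by character, and looking a string up tells you whether it is a complete word, a prefix of longer words, or both.

import java.util.HashMap;
import java.util.Map;

public class TrieDemo {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;   //the path from the root to this node is a complete word
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Follow s from the root; null means the path breaks off (no match). */
    Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }

    public static void main(String[] args) {
        TrieDemo trie = new TrieDemo();
        for (String w : new String[]{"abc", "abcd", "abd", "b", "bcd"}) trie.insert(w);
        Node hit = trie.find("abc");
        //"abc" is both a complete word and the prefix of "abcd", as in Figure 1
        System.out.println("word=" + hit.isWord + ", prefix=" + !hit.children.isEmpty());
    }
}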
IK uses a fairly simple implementation. First look at the members of the DictSegment class:

class DictSegment implements Comparable<DictSegment>{  
  
    //shared character table, storing the Chinese characters  
    private static final Map<Character , Character> charMap = new HashMap<Character , Character>(16 , 0.95f);  
    //upper limit on the child array size  
    private static final int ARRAY_LENGTH_LIMIT = 3;  
  
      
    //Map storage for children  
    private Map<Character , DictSegment> childrenMap;  
    //array storage for children  
    private DictSegment[] childrenArray;  
  
  
    //character stored at the current node  
    private Character nodeChar;  
    //number of child segments stored at the current node  
    //storeSize <= ARRAY_LENGTH_LIMIT: store in the array; storeSize > ARRAY_LENGTH_LIMIT: store in the Map  
    private int storeSize = 0;  
    //state of the current DictSegment; default 0, 1 means the path from the root to this node forms a word  
    private int nodeState = 0;    
    ……  

There are two storage schemes here, chosen according to the ARRAY_LENGTH_LIMIT threshold. If the number of children does not exceed the threshold, they are stored in childrenArray; once the number of children exceeds the threshold, they are stored in childrenMap, which is implemented with a HashMap. The benefit is memory savings: a HashMap has to allocate memory in advance and some of it may be wasted, but if everything were stored in arrays (with binary search on lookup) you could not get O(1) lookup complexity. So both schemes are combined: while a node has few children they sit in the array, and once it has many they are all moved into the HashMap. During construction, each word is added to the dictionary tree step by step, which is a recursive process:

/** 
 * Load and fill a dictionary segment 
 * @param charArray 
 * @param begin 
 * @param length 
 * @param enabled 
 */  
private synchronized void fillSegment(char[] charArray , int begin , int length , int enabled){  

     ……       
    //search this node's children for the segment matching keyChar; create it if absent  
    DictSegment ds = lookforSegment(keyChar , enabled);  
    if(ds != null){  
        //process the segment corresponding to keyChar  
        if(length > 1){  
            //the word has not been fully added to the dictionary tree yet  
            ds.fillSegment(charArray, begin + 1, length - 1 , enabled);  
        }else if (length == 1){  
            //this is already the last char of the word; set the node state to enabled  
            //enabled=1 marks a complete word, enabled=0 masks the word out of the dictionary  
            ds.nodeState = enabled;  
        }  
    }  

}  

Here lookforSegment searches among the children of the current node. If the child count does not exceed the ARRAY_LENGTH_LIMIT threshold, the children are kept in an array and binary search is used; otherwise they are kept in a HashMap and looked up directly.
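As a rough sketch of that lookup logic (simplified, not the actual IK source: the branch that creates a missing child segment is omitted, and a single-argument DictSegment constructor is assumed), the child search might look like this:

//simplified sketch of the hybrid child lookup; assumes a DictSegment(Character) constructor
private DictSegment lookforSegmentSketch(Character keyChar){
    if(this.storeSize <= ARRAY_LENGTH_LIMIT){
        //children are kept in a sorted array: binary search by nodeChar
        DictSegment keySegment = new DictSegment(keyChar);
        int position = java.util.Arrays.binarySearch(this.childrenArray, 0, this.storeSize, keySegment);
        return position >= 0 ? this.childrenArray[position] : null;
    }else{
        //children have been moved to the HashMap: direct O(1) lookup
        return this.childrenMap.get(keyChar);
    }
}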

Word segmentation

The IK tokenizer basically has two modes, smart mode and non-smart mode. For example, given the original text:
张三说的确实在理 (roughly, "what Zhang San said is indeed reasonable")
the segmentation result in smart mode is:
张三 | 说的 | 确实 | 在理
while the segmentation result in non-smart mode is:
张三 | 三 | 说的 | 的确 | 确实 | 实在 | 在理
As you can see, non-smart mode simply outputs every word it can split out; in smart mode the IK tokenizer outputs the single segmentation it considers most reasonable, which involves ambiguity resolution.
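A minimal usage sketch of the two modes, assuming the IKSegmenter(Reader, boolean useSmart) constructor and the Lexeme accessors from this source tree:

import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class ModeDemo {
    public static void main(String[] args) throws Exception {
        String text = "张三说的确实在理";
        for (boolean useSmart : new boolean[]{true, false}) {
            //useSmart=true -> smart mode, useSmart=false -> output every possible word
            IKSegmenter seg = new IKSegmenter(new StringReader(text), useSmart);
            StringBuilder out = new StringBuilder();
            for (Lexeme lex = seg.next(); lex != null; lex = seg.next()) {
                out.append(lex.getLexemeText()).append(" | ");
            }
            System.out.println((useSmart ? "smart    : " : "non-smart: ") + out);
        }
    }
}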

Let's first look at some of the most basic element structure classes:

public class Lexeme implements Comparable<Lexeme>{  
    ……  
  
    //starting offset of the lexeme  
    private int offset;  
    //relative starting position of the lexeme  
    private int begin;  
    //length of the lexeme  
    private int length;  
    //text of the lexeme  
    private String lexemeText;  
    //type of the lexeme  
    private int lexemeType;  
     ……  

A Lexeme here can be understood as a word or term, and its begin field is its position in the input text. Note that Lexeme implements Comparable: the lexeme with the earlier start position comes first, and for the same start position the longer lexeme comes first. This rule determines where a word sits within the word chain of a segmentation result, i.e. the order of the words in the example output above.

/* 
* Comparison algorithm for lexemes in a sorted set 
* @see java.lang.Comparable#compareTo(java.lang.Object) 
*/  
public int compareTo(Lexeme other) {  
//earlier start position first  
    if(this.begin < other.getBegin()){  
        return -1;  
    }else if(this.begin == other.getBegin()){  
     //longer lexeme first  
     if(this.length > other.getLength()){  
         return -1;  
     }else if(this.length == other.getLength()){  
         return 0;  
     }else {//this.length < other.getLength()  
         return 1;  
     }  
       
    }else{//this.begin > other.getBegin()  
     return 1;  
    }  
}  

Another important structure is the lexeme chain (path), which is declared as follows:

/** 
 * Lexeme chain (path) 
 */  
class LexemePath extends QuickSortSet implements Comparable<LexemePath> 

A LexemePath can be thought of as one segmentation result like those above, forming a chain in sequence order. As you can see it extends QuickSortSet, so adding a lexeme sorts it internally, producing an ordered chain; the ordering rule is the compareTo method of Lexeme shown above. You will also notice that LexemePath itself implements the Comparable interface, which is used for the ambiguity analysis introduced in the next section.
Another important structure is AnalyzeContext, which mainly holds the input text, the segmented LexemePaths, the segmentation results, and other related context information.
By default IK uses three sub-segmenters: LetterSegmenter (letters), CN_QuantifierSegmenter (quantifiers), and CJKSegmenter (Chinese, Japanese and Korean). Segmentation passes through each of them in turn; here we focus on CJKSegmenter, whose core is the analyze method.

public void analyze(AnalyzeContext context) {  
    …….  
          
        //first handle the hits already in tmpHits  
        if(!this.tmpHits.isEmpty()){  
            //process the queue of pending word segments  
            Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);  
            for(Hit hit : tmpArray){  
                hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor() , hit);  
                if(hit.isMatch()){  
                    //output the current word  
                    Lexeme newLexeme = new Lexeme(context.getBufferOffset() , hit.getBegin() , context.getCursor() - hit.getBegin() + 1 , Lexeme.TYPE_CNWORD);  
                    context.addLexeme(newLexeme);  
                      
                    if(!hit.isPrefix()){//not a word prefix, the hit cannot be extended further, remove it  
                        this.tmpHits.remove(hit);  
                    }  
                      
                }else if(hit.isUnmatch()){  
                    //the hit is not a word, remove it  
                    this.tmpHits.remove(hit);  
                }                     
            }  
        }             
          
        //*********************************  
        //then do a single-character match at the current cursor position  
        Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);  
        if(singleCharHit.isMatch()){//the single character is itself a word  
            //output the current word  
            Lexeme newLexeme = new Lexeme(context.getBufferOffset() , context.getCursor() , 1 , Lexeme.TYPE_CNWORD);  
            context.addLexeme(newLexeme);  

            //it is also a word prefix  
            if(singleCharHit.isPrefix()){  
                //prefix match, add it to the hit list  
                this.tmpHits.add(singleCharHit);  
            }  
        }else if(singleCharHit.isPrefix()){//the single character is a word prefix  
            //prefix match, add it to the hit list  
            this.tmpHits.add(singleCharHit);  
        }  
   ……  
}  

Looking at the second half of the code first, matchInMainDict is the method that matches words against the main dictionary. The main dictionary has already been loaded into a dictionary tree, so matching walks down from the root of the tree level by level; here only a single character is processed, so there is no recursion. There are three kinds of match results: UNMATCH (no match), MATCH (match), and PREFIX (prefix match). MATCH means the characters matched all the way to a node marked as the end of a word; PREFIX means the path followed so far exists in the tree but has not yet reached the end of a word. A string can also be both MATCH and PREFIX, such as abc in Figure 1. Prefix matches are kept in tmpHits, while complete matches are stored in the context.
Now look at the first half of the code: a prefix match should not simply be discarded, because it may still grow into a longer word as more characters arrive, so the first half of the code keeps trying to extend those hits. matchWithHit continues matching from the position of the current hit; whenever it yields MATCH, a new lexeme can be added to the context.
In this way, by scanning character by character and continually extending the pending hits, we obtain all the words, which is at least enough to satisfy non-smart mode.
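Putting the two halves together, the scanning idea can be sketched roughly as follows. This is a simplified illustration, not the actual CJKSegmenter code: emitWord() is a hypothetical callback standing in for context.addLexeme(), buff stands in for the context's segment buffer, and java.util.{Iterator, LinkedList, List} plus the Dictionary/Hit classes above are assumed to be imported.

private void scanSketch(char[] buff) {
    List<Hit> tmpHits = new LinkedList<Hit>();
    for (int cursor = 0; cursor < buff.length; cursor++) {
        //extend every pending prefix hit with the character at the cursor
        for (Iterator<Hit> it = tmpHits.iterator(); it.hasNext();) {
            Hit hit = Dictionary.getSingleton().matchWithHit(buff, cursor, it.next());
            if (hit.isMatch()) {
                emitWord(hit.getBegin(), cursor);   //a complete word ends here
            }
            if (!hit.isPrefix()) {
                it.remove();                        //cannot grow into a longer word
            }
        }
        //then try the single character at the cursor against the main dictionary
        Hit single = Dictionary.getSingleton().matchInMainDict(buff, cursor, 1);
        if (single.isMatch()) {
            emitWord(cursor, cursor);               //single-character word
        }
        if (single.isPrefix()) {
            tmpHits.add(single);                    //may be the start of a longer word
        }
    }
}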

Ambiguity resolution

IKArbitrator (ambiguity analysis arbiter) is the main class for handling ambiguity.
If my explanation here is not clear enough, you can also refer to this blog post: http://fay19880111-yeah-net.iteye.com/blog/1523740

In the previous section we mentioned that LexemePath implements the Comparable interface.

public int compareTo(LexemePath o) {  
    //compare the effective text length first  
    if(this.payloadLength > o.payloadLength){  
        return -1;  
    }else if(this.payloadLength < o.payloadLength){  
        return 1;  
    }else{  
        //compare the number of lexemes; fewer is better  
        if(this.size() < o.size()){  
            return -1;  
        }else if (this.size() > o.size()){  
            return 1;  
        }else{  
            //the larger the path span, the better  
            if(this.getPathLength() >  o.getPathLength()){  
                return -1;  
            }else if(this.getPathLength() <  o.getPathLength()){  
                return 1;  
            }else {  
                //statistically, reverse segmentation is more accurate than forward segmentation, so the later the end position, the better  
                if(this.pathEnd > o.pathEnd){  
                    return -1;  
                }else if(pathEnd < o.pathEnd){  
                    return 1;  
                }else{  
                    //the more even the word lengths, the better  
                    if(this.getXWeight() > o.getXWeight()){  
                        return -1;  
                    }else if(this.getXWeight() < o.getXWeight()){  
                        return 1;  
                    }else {  
                        //compare the lexeme position weights  
                        if(this.getPWeight() > o.getPWeight()){  
                            return -1;  
                        }else if(this.getPWeight() < o.getPWeight()){  
                            return 1;  
                        }  
                          
                    }  
                }  
            }  
        }  
    }  
    return 0;  
}  
Clearly the author has hard-coded a set of ordering rules here, comparing in turn the effective text length, the number of lexemes, the path span, and so on.

IKArbitrator has a judge method that compares different paths.

private LexemePath judge(QuickSortSet.Cell lexemeCell , int fullTextLength){  
    //set of candidate paths  
    TreeSet<LexemePath> pathOptions = new TreeSet<LexemePath>();  
    //candidate result path  
    LexemePath option = new LexemePath();  
      
    //traverse crossPath once, returning a stack of the Lexemes that conflict during this pass  
    Stack<QuickSortSet.Cell> lexemeStack = this.forwardPath(lexemeCell , option);  
      
    //the current lexeme chain may not be the ideal one; add it to the candidate set  
    pathOptions.add(option.copy());  
      
    //there are ambiguous lexemes, handle them  
    QuickSortSet.Cell c = null;  
    while(!lexemeStack.isEmpty()){  
        c = lexemeStack.pop();  
        //roll back the lexeme chain  
        this.backPath(c.getLexeme() , option);  
        //starting from the ambiguous lexeme, go forward again to generate an alternative option  
        this.forwardPath(c , option);  
        pathOptions.add(option.copy());  
    }  
      
    //return the best option in the set  
    return pathOptions.first();  
}  

The core idea is to start from the first lexeme, traverse the various possible paths, add each one to a TreeSet (which keeps them sorted), and simply take the first one.
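A tiny self-contained illustration (not IK code) of this "sort the candidates, take first()" design: a TreeSet keeps its elements ordered by their compareTo, so the path preferred by the rule cascade is always the one returned by first(). The PathSketch class below only mimics the first two rules (effective text length, then lexeme count).

import java.util.TreeSet;

class PathSketch implements Comparable<PathSketch> {
    final int payloadLength;   //effective text length covered by the path
    final int lexemeCount;     //number of lexemes on the path

    PathSketch(int payloadLength, int lexemeCount) {
        this.payloadLength = payloadLength;
        this.lexemeCount = lexemeCount;
    }

    public int compareTo(PathSketch o) {
        if (this.payloadLength != o.payloadLength) {
            return this.payloadLength > o.payloadLength ? -1 : 1;  //longer coverage wins
        }
        return Integer.compare(this.lexemeCount, o.lexemeCount);   //fewer lexemes wins
    }

    public static void main(String[] args) {
        TreeSet<PathSketch> options = new TreeSet<PathSketch>();
        options.add(new PathSketch(8, 5));
        options.add(new PathSketch(8, 4));   //same coverage, fewer lexemes -> preferred
        options.add(new PathSketch(7, 3));
        PathSketch best = options.first();
        System.out.println(best.payloadLength + " chars in " + best.lexemeCount + " lexemes");
    }
}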

Other notes

1. Stop words (stopword.dic) are removed only in the final output stage (AnalyzeContext.getNextLexeme) and are not removed during analysis, since removing them earlier would be risky.
2. As the compareTo method of LexemePath shows, IK's ranking rules are quite crude: if path1 has fewer lexemes than path2, path1 is immediately judged better. Such a rule does not really take into account the actual properties of each segmented word. We might want to bring in statistics such as each word's corpus frequency to compute a more comprehensive score, and IK's original comparison method does not allow for that (see the sketch after this list for one possible direction).
For ideas on how to modify this, you can refer to another blog post, which introduces an approach based on the shortest-path idea: http://www.hankcs.com/nlp/segment/n-shortest-path-to-the-java-implementation-and-application-segmentation.html

3. Characters that match nothing are output at the end, in both smart and non-smart mode; this happens in the final output stage, in the AnalyzeContext.outputToResult method.
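Returning to point 2 above, one possible direction is to score each candidate path by word statistics rather than a fixed rule cascade. The sketch below is only an illustration of that idea, not IK code; wordFrequency is an assumed corpus-frequency table that IK itself does not ship with.

import java.util.List;
import java.util.Map;

class FrequencyScorer {
    /** Sum of log-frequencies of the words on a path; a higher score means a more plausible segmentation. */
    static double score(List<String> path, Map<String, Long> wordFrequency) {
        double score = 0.0;
        for (String word : path) {
            long freq = wordFrequency.getOrDefault(word, 1L);  //smooth unseen words
            score += Math.log(freq);
        }
        return score;
    }
}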
