Principles of jieba word segmentation

I. Introduction to jieba
jieba is a simple and practical Chinese word segmentation library for natural language processing.

jieba belongs to the family of probabilistic language-model segmenters. The task of probabilistic language-model segmentation is: among all the results obtained by full segmentation, find the segmentation scheme S that maximizes P(S).
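Written out, if a candidate segmentation S splits the sentence into words w1, w2, ..., wm and the words are (as a simplifying assumption) treated as independent, the objective is roughly

S* = argmax_S P(S) ≈ argmax_S P(w1) · P(w2) · ... · P(wm),

which is why the word frequencies stored in the dictionary are enough to score a candidate segmentation.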

jieba supports three segmentation modes:

Full mode: scans out all the words in the sentence that can form dictionary words; very fast, but it does not resolve ambiguity.
Precise mode: attempts to cut the sentence as accurately as possible; suitable for text analysis.
Search-engine mode: on the basis of precise mode, long words are segmented again to improve recall; suitable for search-engine segmentation.
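As a quick illustration, the three modes map onto the library's cut functions roughly as follows (a minimal usage sketch; the exact output depends on the dictionary version):

```python
import jieba

sentence = "我来到北京清华大学"

# Full mode: every dictionary word that can be formed is emitted
print("/".join(jieba.cut(sentence, cut_all=True)))

# Precise mode (the default): the single most likely segmentation
print("/".join(jieba.cut(sentence)))

# Search-engine mode: precise mode, then long words are cut again to boost recall
print("/".join(jieba.cut_for_search(sentence)))
```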
Next, we analyze the theory behind this segmentation algorithm.

II. The principles of jieba segmentation
1. Efficient word-graph scanning based on a prefix dictionary (trie): a directed acyclic graph (DAG) is built of all the possible words that the characters of the sentence can form.

1. A trie is generated from dict.txt; while the trie is being built, the occurrence count of each word is converted into a frequency. (jieba ships with a dictionary, dict.txt, of more than 20,000 entries; each entry records the word, how many times it appeared, and its part of speech, trained from corpora such as the People's Daily. These words are stored in a trie, also known as a prefix tree: words that begin with the same characters share a prefix, so a trie stores them compactly and offers fast lookup. A loading sketch is shown below.)
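As an illustration of the idea (not jieba's actual code; newer versions of jieba replace the explicit trie with a flat prefix-frequency dict, but the principle is the same), loading a dict.txt-style file could look roughly like this, assuming each line has the form "word count part-of-speech":

```python
def load_prefix_dict(path):
    """Build a word -> count table from a dict.txt-style file.

    Every prefix of every word is also inserted (with count 0 when the
    prefix is not itself a word), so that the later DAG scan can stop as
    soon as a fragment is not the prefix of any dictionary word.
    """
    freq, total = {}, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.split()[:2]
            count = int(count)
            freq[word] = count
            total += count
            for i in range(1, len(word)):
                freq.setdefault(word[:i], 0)
    return freq, total
```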

2. For the sentence to be segmented, a DAG is generated according to the trie built from dict.txt. Put simply, the sentence is looked up against the dictionary and all possible segmentations are generated. jieba records the DAG keyed by the start positions of words in the sentence: for every start position from 0 to n-1 (n being the length of the sentence), the key is the start position and the value is a list holding the possible end positions of words that begin there (the words are obtained from the dictionary; end position = start position + word length). A sketch of this step is shown below.
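A minimal sketch of this step, assuming the `freq` table from the previous snippet (again illustrative, not jieba's exact code):

```python
def get_dag(sentence, freq):
    """dag[k] = list of end indices i such that sentence[k:i+1] is a dictionary word.

    If no dictionary word starts at position k, the single character k..k is used.
    """
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        while i < n and frag in freq:
            if freq[frag] > 0:          # frag is a real word, not just a stored prefix
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not ends:
            ends.append(k)
        dag[k] = ends
    return dag
```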

2. Dynamic programming is used to find the maximum-probability path, i.e. the maximum-probability segmentation based on word frequencies.

1. For each word already cut out of the sentence to be segmented (the word list from full mode), look up the word's frequency of occurrence (count / total count). If the word is not present (since the lookup is dictionary-based, it normally will be), the smallest frequency appearing in the dictionary is used as that word's frequency.

2. The maximum-probability path is found with dynamic programming, computing the maximum probability of the sentence from right to left (in reverse). The reason is that the head of a Chinese sentence usually lies toward the right: modifiers tend to come first and the core comes after them, so computing from right to left gives a higher accuracy than computing from left to right, somewhat like reverse maximum matching (RMM). Concretely, P(NodeN) = 1.0, P(NodeN-1) = P(NodeN) * max(P(the last word)), and so on; in the end the maximum-probability path, and with it the maximum-probability segmentation, is obtained. A sketch of this step follows.
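A minimal sketch of this dynamic program (log-probabilities are used to avoid floating-point underflow; `freq`, `total` and `get_dag` are the illustrative pieces from the snippets above, not jieba's internals):

```python
import math

def calc_route(sentence, dag, freq, total):
    """route[idx] = (best log-probability of sentence[idx:], end index of the first word)."""
    n = len(sentence)
    logtotal = math.log(total)
    route = {n: (0.0, 0)}
    for idx in range(n - 1, -1, -1):                      # computed from right to left
        route[idx] = max(
            (math.log(freq.get(sentence[idx:x + 1]) or 1) - logtotal + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route

def cut_by_route(sentence, route):
    """Walk the best path from the left to read off the actual words."""
    words, idx = [], 0
    while idx < len(sentence):
        end = route[idx][1] + 1
        words.append(sentence[idx:end])
        idx = end
    return words
```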

3. For unregistered (out-of-vocabulary) words, an HMM model based on the ability of Chinese characters to form words is used, together with the Viterbi algorithm.

1. The HMM used for Chinese word segmentation labels each character with one of four states, BEMS: B (begin, the start of a word), E (end, the last character of a word), M (middle, an interior character) and S (single, a character that is a word by itself). jieba marks Chinese characters with these four states (B, E, M, S). For example, 北京 (Beijing) is labelled BE, i.e. 北/B 京/E, meaning 北 is the start position and 京 is the end position; 中华民族 (the Chinese nation) is labelled BMME: begin, middle, middle, end. A small illustration follows.
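A tiny illustrative helper (not part of jieba) makes the labelling concrete:

```python
def words_to_bems(words):
    """Turn a list of segmented words into their per-character BEMS labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

print(words_to_bems(["北京"]))      # ['B', 'E']
print(words_to_bems(["中华民族"]))  # ['B', 'M', 'M', 'E']
```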

2. The author trained on a large corpus and obtained three probability tables:
1) The position (state) transition probabilities, i.e. the transition probabilities among the four states B (begin), M (middle), E (end) and S (single). For example, P(E|B) = 0.851 and P(M|B) = 0.149, which says that when we are at the beginning of a word, the next character is much more likely to end the word than to sit in its middle. This matches intuition, since two-character words are more common than longer words.
2) The emission probabilities from a position (state) to a character; for example, P("和" | M) is the probability that the character 和 ("and") appears in the middle of a word.
3) The probability that a word begins in a given state; there are really only two possibilities, B or S. This is the initial state vector, i.e. the initial state distribution of the HMM.
In fact, the transitions among the BEMS states are somewhat like a bigram model, i.e. transitions between two adjacent tokens; a bigram model considers the probability of one token following another and is one kind of N-gram model.
The sentence to be segmented is the observation sequence, and the four BEMS states are the hidden states of the HMM; the task is to find the optimal BEMS sequence, and the Viterbi algorithm is what gives us the best hidden state sequence. With the trained probability tables and the Viterbi algorithm, the maximum-probability BEMS sequence is obtained; words begin with B and end with E, and regrouping the characters of the sentence accordingly yields the segmentation result. For example, segmenting the sentence "the whole world is learning Chinese" yields a BEMS sequence such as [S, B, E, S, S, S, B, E, S]; joining each consecutive B...E span into a word and leaving each S as a single-character word gives the segmentation result. A minimal Viterbi sketch follows.
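A minimal Viterbi sketch over the four BEMS states. Here `start_p`, `trans_p` and `emit_p` stand for the three probability tables described above and are assumed to already hold log-probabilities keyed by state and by character; they are placeholders, not jieba's actual trained tables (jieba's implementation additionally restricts which state may precede which):

```python
MIN_LOG = -3.14e100   # stands in for log(0): an impossible transition or emission

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best log-probability, best BEMS sequence) for the character sequence obs."""
    V = [{}]            # V[t][s] = best log-prob of any path ending in state s at time t
    path = {}
    for s in states:
        V[0][s] = start_p.get(s, MIN_LOG) + emit_p[s].get(obs[0], MIN_LOG)
        path[s] = [s]
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] + trans_p[p].get(s, MIN_LOG) + emit_p[s].get(obs[t], MIN_LOG), p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    # a word can only end in E or S, so only those final states are candidates
    prob, last = max((V[-1][s], s) for s in ("E", "S"))
    return prob, path[last]
```

Reading off the returned sequence, every B...E span is rejoined into a word and every S becomes a single-character word, exactly as in the example above.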

III. The jieba segmentation process
1. Load the dictionary and generate the trie (prefix dictionary).

2. For the sentence to be segmented, a regular expression is used to extract runs of consecutive Chinese characters and English characters, cutting the sentence into a list of blocks. For each block, the DAG (built from the dictionary) and dynamic programming are used to obtain the maximum-probability path. Characters in the DAG that were not found in the dictionary are combined into new fragments, and the HMM word model is applied to them; this is what the author calls identifying unregistered (new) words. A sketch of the block-splitting step is shown below.
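A rough sketch of the block-splitting step (the exact regular expression used by jieba may differ; this only shows the idea):

```python
import re

# Runs of Chinese characters, letters and digits form segmentable blocks;
# everything else (punctuation, whitespace) is passed through unchanged.
re_han = re.compile(r"([\u4e00-\u9fa5a-zA-Z0-9+#&._]+)")

def split_blocks(sentence):
    for block in re_han.split(sentence):
        if not block:
            continue
        if re_han.match(block):
            yield block, True     # segment this block with the DAG + dynamic programming (+ HMM)
        else:
            yield block, False    # emit punctuation/whitespace as-is
```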

3. Python's yield syntax is used, so the segmenter is a generator and the words are returned one at a time.
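For example (jieba.cut returns a generator; jieba.lcut is the convenience wrapper that returns a list directly):

```python
import jieba

words = jieba.cut("这是一个能够说明问题的例子")
print(type(words))      # <class 'generator'>
print("/".join(words))  # the words are produced lazily, one at a time

print(jieba.lcut("这是一个能够说明问题的例子"))  # the same result as a list
```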

IV. Shortcomings of jieba segmentation
1. The dict.txt dictionary occupies more than 140 MB of memory, which is too much. jieba uses this large dictionary to compensate for the HMM's poor ability to recognize longer words, which is why many three- and four-character words are kept in the dictionary. Building specialized dictionaries is inconvenient, and no tool is provided for training one's own probability tables.

2. The HMM's recognition of new words is not timely enough, and it can only recognize new words of two or three characters, so its capability is rather limited.

3. Part-of-speech tagging is not good enough; syntactic analysis and semantic analysis are absent.

4. Named-entity recognition (NER) is also not good enough.

References
Principles of jieba word segmentation: https://www.cnblogs.com/echo-cheng/p/7967221.html

Understanding and analysis of the segmentation algorithm of jieba, the Python Chinese word segmentation module: https://blog.csdn.net/rav009/article/details/12196623

Official jieba documentation
