Implementing a Chinese Pinyin Input Method in Python from Scratch

Chinese input is a long-standing problem, and building an input method is tedious work, which may be why so few people share Pinyin input method implementations online. I took an NLP project as an opportunity to implement one and find out how deep the water is. It turned out to be quite deep, but a basic working version is achievable. Other groups also did very well; here I share what our group has built so far. (Note: this article assumes that you already have some knowledge of hidden Markov models.)

Task description

Implement a Chinese pinyin input method.

After analysis, we divided the input method into the following modules:

Pinyin segmentation (SplitPinyin.py)
HMM training (TrainMatrix.py)
Trie construction and search interface (PinyinTrie.py)
Viterbi algorithm and the service interface exposed to the UI (GodTian_Pinyin.py)
UI implementation (gui.py)

Technical approach

A Pinyin input method has to convert a sequence of Pinyin into a sequence of Chinese characters. For example, when the user types "nihao", the input method should offer 你好 ("hello"). This raises several questions:

How should the Pinyin string be segmented?
For example, if the user types "xiana", should the input method read it as "xian a", "xia na", or "xi an a" (as in Xi'an)?
How do we give the user real-time feedback?
Given a segmented Pinyin string, how do we find the Chinese string the user most likely wants?
If the user types incorrect Pinyin, how do we tolerate the mistake, and what do we display?

We could ask many more questions; a Chinese Pinyin input method has endless details to work through.
So how do we solve the problems above? Our approach is as follows:

How should the Pinyin string be segmented?

We use longest matching. As long as the string typed so far is a valid Pinyin or a prefix of one, we keep scanning forward and wait for more input. When the n-th character makes the string neither a valid Pinyin nor a valid prefix, we cut off the first n-1 characters as one complete Pinyin and start again from the n-th character. For example, if the user types "xiant" (intending "xiantian"), the scan accepts everything up to "xian"; "xiant" is neither a valid Pinyin nor a valid prefix, so the "t" is split off, giving "xian't". Later Pinyin are found in the same way.
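The longest-match scan described above can be sketched as follows. This is only an illustration, not the code in SplitPinyin.py: the syllable set VALID_PINYIN is a tiny hypothetical sample, and the function name split_pinyin is made up for this example.

```python
# Hypothetical, tiny syllable inventory for illustration only; a real
# input method would load the full list of valid Pinyin syllables.
VALID_PINYIN = {"a", "an", "na", "xi", "xia", "xian", "ti", "tian", "ni", "hao"}

# Every non-empty prefix of every valid syllable (the syllables themselves
# included), so partial input such as "xi" or "t" is recognized.
VALID_PREFIXES = {p[:i] for p in VALID_PINYIN for i in range(1, len(p) + 1)}


def split_pinyin(text):
    """Greedy longest-match scan over the typed letters.

    The current chunk keeps growing while it is still a valid Pinyin or a
    prefix of one; the first letter that breaks this starts a new chunk.
    """
    chunks = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in VALID_PREFIXES:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = ch  # the offending letter starts the next chunk
    if current:
        chunks.append(current)
    return chunks


print(split_pinyin("xiant"))     # ['xian', 't']
print(split_pinyin("xiantian"))  # ['xian', 'tian']
```

Note that the greedy scan never backtracks, so "xiana" always comes out as "xian a" here; the other readings need the user-supplied separator discussed later.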
For real-time feedback we must display something even before the user finishes typing. After segmentation, at most the last chunk is an incomplete Pinyin prefix, so we handle the complete and incomplete parts separately. Take "xian't": we run the Viterbi algorithm on "xian" to get the most probable output string under the HMM. For the trailing "t", we search the trained Trie for all words whose Pinyin starts with "t", together with their frequencies, and keep the most frequent ones as the set of possible next states for Viterbi. We then take their Pinyin, combine each with the preceding n-1 Pinyin, and run Viterbi to get the most likely Chinese string. These high-frequency words may have different Pinyin (that is, different observations), and only words with the same Pinyin can serve as the next state in a single Viterbi run, so the number of Viterbi runs equals the number of distinct Pinyin among them. Because the total number of candidate words is fixed, more distinct Pinyin means fewer words per Pinyin, so the total running time stays about the same.
The Trie is explained in more detail below.
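To make the handling of the trailing prefix more concrete, here is a minimal sketch of the grouping step: keep the most frequent candidate words for the prefix, group them by Pinyin, and build one observation sequence (one Viterbi run) per distinct Pinyin. The candidate words, their frequencies, and the function name are invented for illustration; in the real program the candidates would come from the Trie. The example is written for Python 3.

```python
from collections import OrderedDict


def group_candidates_by_pinyin(candidates, top_k=4):
    """candidates: list of (word, pinyin, frequency) for a typed prefix.

    Keeps the top_k most frequent words and groups them by their Pinyin,
    because only words sharing a Pinyin can be next states in one run.
    """
    best = sorted(candidates, key=lambda c: (-c[2], c[0]))[:top_k]
    groups = OrderedDict()
    for word, pinyin, _freq in best:
        groups.setdefault(pinyin, []).append(word)
    return groups


# Invented candidates for the prefix "t".
candidates = [
    ("他", "ta", 80),
    ("天", "tian", 50),
    ("她", "ta", 40),
    ("提", "ti", 20),
    ("田", "tian", 10),
]

groups = group_candidates_by_pinyin(candidates, top_k=4)

# One observation sequence per distinct Pinyin, each prefixed with the
# complete Pinyin already typed ("xian" in the running example).
runs = [["xian", pinyin] for pinyin in groups]
print(len(runs))  # 3 Viterbi runs for 4 candidate words
```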

How do we give the user real-time feedback?

The previous section already touched on this. Real-time feedback means that after every letter the user types, we show the characters the user may want. Many Pinyin start with a given letter, and each Pinyin may correspond to many words, so there are many candidate results. We cannot afford to miss any of them, so we consider all candidates and pick those with the highest probability. A Trie solves this. A Trie is a prefix tree: each Pinyin is inserted letter by letter starting from the root, and each Pinyin ends at a leaf node, so the letters along the path from the root down to that node spell the Pinyin. To find all Pinyin with a given prefix, we walk down to the node for that prefix and run a DFS over its subtree, collecting every leaf. Each leaf actually stores a dictionary whose keys are the words with that Pinyin and whose values are their frequencies, which we use for comparison.
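A minimal sketch of such a Trie is shown below. It is not the code in PinyinTrie.py; the class names, method names, sample words, and frequencies are all assumptions made for illustration. One deliberate difference from the description: because some Pinyin are prefixes of others (for example "xi" and "xian"), the sketch stores the word dictionary at whichever node ends a Pinyin, not only at leaves. The example is written for Python 3.

```python
class TrieNode(object):
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.words = {}     # word -> frequency, filled where a Pinyin ends


class PinyinTrie(object):
    def __init__(self):
        self.root = TrieNode()

    def insert(self, pinyin, word, freq):
        """Insert the Pinyin letter by letter and record the word there."""
        node = self.root
        for ch in pinyin:
            node = node.children.setdefault(ch, TrieNode())
        node.words[word] = node.words.get(word, 0) + freq

    def search_prefix(self, prefix, top_k=5):
        """Return up to top_k (word, pinyin, freq) whose Pinyin starts with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]  # iterative DFS over the prefix's subtree
        while stack:
            current, spelled = stack.pop()
            for word, freq in current.words.items():
                found.append((word, spelled, freq))
            for ch, child in current.children.items():
                stack.append((child, spelled + ch))
        # Most frequent first; ties broken by word for a stable order.
        found.sort(key=lambda item: (-item[2], item[0]))
        return found[:top_k]


# Invented sample entries.
trie = PinyinTrie()
trie.insert("ta", "他", 80)
trie.insert("tian", "天", 50)
trie.insert("ti", "提", 20)
trie.insert("ni", "你", 90)

print(trie.search_prefix("t", top_k=2))  # [('他', 'ta', 80), ('天', 'tian', 50)]
```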

Given a segmented Pinyin string, how do we find the Chinese string the user most likely wants?

We use a hidden Markov model: the text the user wants to enter is the hidden state sequence, and the Pinyin the user types is the observed sequence. The three HMM matrices (initial, transition, and emission probabilities) are estimated from corpus frequencies by maximum likelihood, and the Viterbi algorithm then finds the most probable Chinese strings to display.
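The sketch below shows one way the two steps could look: maximum-likelihood estimation of the three tables by counting, followed by Viterbi decoding in log space. It is not the code in TrainMatrix.py or GodTian_Pinyin.py. The function names, the FLOOR constant used for unseen events, and the three-sentence training set are assumptions for illustration, and the code is written for Python 3 so that strings iterate character by character.

```python
import math
from collections import defaultdict

FLOOR = 1e-8  # assumed probability for events never seen in training


def train_hmm(samples):
    """samples: list of (chars, pinyins) with one Pinyin per character.

    Returns (start, trans, emit) as relative frequencies, i.e. the
    maximum-likelihood estimates of the three HMM tables.
    """
    start_c = defaultdict(int)
    trans_c = defaultdict(lambda: defaultdict(int))
    emit_c = defaultdict(lambda: defaultdict(int))
    for chars, pinyins in samples:
        start_c[chars[0]] += 1
        for i, (c, p) in enumerate(zip(chars, pinyins)):
            emit_c[c][p] += 1
            if i > 0:
                trans_c[chars[i - 1]][c] += 1

    def normalize(counts):
        total = float(sum(counts.values()))
        return {k: v / total for k, v in counts.items()}

    start = normalize(start_c)
    trans = {c: normalize(n) for c, n in trans_c.items()}
    emit = {c: normalize(n) for c, n in emit_c.items()}
    return start, trans, emit


def viterbi(pinyins, start, trans, emit):
    """Return the most probable character string for the Pinyin sequence."""
    layer = None  # char -> (log probability, best path ending in char)
    for p in pinyins:
        nxt = {}
        for c, emits in emit.items():
            if p not in emits:
                continue  # c is never pronounced p in the training data
            e = math.log(emits[p])
            if layer is None:
                nxt[c] = (math.log(start.get(c, FLOOR)) + e, c)
            else:
                score, path = max(
                    ((s + math.log(trans.get(prev, {}).get(c, FLOOR)), prev_path)
                     for prev, (s, prev_path) in layer.items()),
                    key=lambda t: t[0])
                nxt[c] = (score + e, path + c)
        if not nxt:
            return ""  # no known character for this Pinyin
        layer = nxt
    if not layer:
        return ""
    return max(layer.values(), key=lambda t: t[0])[1]


# Invented toy corpus.
samples = [
    ("你好", ["ni", "hao"]),
    ("你好", ["ni", "hao"]),
    ("你们", ["ni", "men"]),
    ("拟好", ["ni", "hao"]),
]
start, trans, emit = train_hmm(samples)
print(viterbi(["ni", "hao"], start, trans, emit))  # 你好
```

Working in log space avoids numerical underflow on long inputs, and the floor value is a crude stand-in for proper smoothing.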

If the user types incorrect Pinyin, how do we tolerate the mistake, and what do we display?

A high degree of fault tolerance is complex to implement, so we assume the user types correct Pinyin and adds the separator "'" wherever a split is needed. Most users type correct Pinyin most of the time, so this assumption simplifies the implementation without costing much in user experience.
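Honoring the user's separator could look like the sketch below: the input is first split on "'", and each part is then handed to the segmenter. The function name is hypothetical, and the default segment argument is only a stand-in that treats each part as a single Pinyin; in the real program the longest-match segmenter would be passed in instead.

```python
def split_on_user_separators(text, segment=None):
    """Split typed input on the user's "'" marks, then segment each part."""
    if segment is None:
        # Stand-in segmenter: treat each user-delimited part as one Pinyin.
        segment = lambda part: [part]
    chunks = []
    for part in text.split("'"):
        if part:  # skip empty parts from leading, trailing, or doubled marks
            chunks.extend(segment(part))
    return chunks


print(split_on_user_separators("xi'an"))  # ['xi', 'an']
print(split_on_user_separators("xian"))   # ['xian']
```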

Data used

Training the HMM requires a corpus. We used the SogouQ user query dataset from Sogou Labs (about 360 MB), preprocessed into valid sentences. Because queries tend to be short, we also added nearly 30 MB of Sohu news data, which contains many long sentences.
Trained on these two corpora, the HMM performs reasonably well on both long sentences and short phrases. The corpus can be expanded further to improve accuracy, but we will not go into that here.

Problems encountered and solutions

UI problems. Because of the complexity of the system and the different UI designs we considered, we hit many puzzling bugs, which cost us a lot of time.
Efficiency of the Viterbi algorithm. Many Pinyin start with a given letter. If we keep the best K, each of those K Pinyin has to be combined with the Pinyin already entered and Viterbi run again. Since each Viterbi step from one state to the next is expensive, we speed things up with memoization (a cache): we record the last-state information of a complete Viterbi run for each Pinyin string. If we later see a Pinyin string A followed by another Pinyin B, we do not rerun Viterbi from the start; we read the last state for A from the cache and continue from there with the transition to B.
The cache lives as long as the program does, and we also persist it to a file that is read when the input method starts. So if a Pinyin string has been entered before, entering it again skips the core algorithm and shows the result directly, which is a significant speedup. The more you use the input method, the faster it gets. This costs some storage space, but storage is not scarce nowadays.
Repeated computation. When the user notices a mistake and presses backspace, the input goes back to a prefix that we have already computed and displayed. Without optimization, the core Viterbi algorithm would run again, which is slow. So we cache the displayed results keyed by the Pinyin input string, which makes backspace fast.
Python's inherent performance limits. The only real fix is to switch languages; a C++ implementation should be much faster, and a later rewrite in C++ is entirely feasible.
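The caching ideas above (resuming Viterbi from the stored last state of a known Pinyin string, persisting the cache to a file, and answering backspace from the cache) can be sketched together. This is an illustration under assumptions, not the project's code: the class name, the steps_run counter, the FLOOR constant, and the toy probability tables are invented, and pickle is used here where the original lists cPickle.

```python
import math
import os
import pickle

FLOOR = 1e-8  # assumed probability for unseen events


class CachedViterbi(object):
    def __init__(self, start, trans, emit, cache_path=None):
        self.start, self.trans, self.emit = start, trans, emit
        self.cache_path = cache_path
        self.cache = {}     # tuple of Pinyin -> last Viterbi layer
        self.steps_run = 0  # number of Viterbi steps actually computed
        if cache_path and os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                self.cache = pickle.load(f)

    def _step(self, prev_layer, pinyin):
        """Advance one observation from prev_layer (None at the start)."""
        self.steps_run += 1
        layer = {}
        for c, emits in self.emit.items():
            if pinyin not in emits:
                continue
            e = math.log(emits[pinyin])
            if prev_layer is None:
                layer[c] = (math.log(self.start.get(c, FLOOR)) + e, c)
            elif prev_layer:
                score, path = max(
                    ((s + math.log(self.trans.get(pc, {}).get(c, FLOOR)), pp)
                     for pc, (s, pp) in prev_layer.items()),
                    key=lambda t: t[0])
                layer[c] = (score + e, path + c)
        return layer

    def decode(self, pinyins):
        key = tuple(pinyins)
        n = len(key)
        while n > 0 and key[:n] not in self.cache:
            n -= 1  # find the longest prefix already in the cache
        layer = self.cache[key[:n]] if n > 0 else None
        for i in range(n, len(key)):
            layer = self._step(layer, key[i])
            self.cache[key[:i + 1]] = layer  # remember every prefix
        if not layer:
            return ""
        return max(layer.values(), key=lambda t: t[0])[1]

    def save(self):
        with open(self.cache_path, "wb") as f:
            pickle.dump(self.cache, f)


# Invented toy tables.
start = {"你": 0.6, "尼": 0.4}
trans = {"你": {"好": 0.9, "们": 0.1}}
emit = {"你": {"ni": 1.0}, "尼": {"ni": 1.0},
        "好": {"hao": 1.0}, "们": {"men": 1.0}}

decoder = CachedViterbi(start, trans, emit)
first = decoder.decode(["ni"])          # computes 1 step
second = decoder.decode(["ni", "hao"])  # resumes after "ni": 1 more step
again = decoder.decode(["ni"])          # backspace: served from the cache
print(second, decoder.steps_run)        # 你好 2
```

Because every prefix is cached as it is computed, a backspace to an earlier prefix costs only a dictionary lookup. Calling save() writes the cache to cache_path so that a later session can load it at startup.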

Performance Evaluation

Input is fairly fast: the vast majority of inputs are displayed within one second. Re-entering a previously entered sentence and backspace operations take only milliseconds.

Runtime environment

Python 2.7
Required Python modules: Tkinter, cPickle, pypinyin

How to run

In the project directory, run

$ python gui.py

That is all.

Future work

As the above shows, there is still plenty of work left, such as:

Rewriting in a compiled language such as C++ to cut computational overhead significantly
Continuously updating the HMM with the user's input
Embedding the input method into the system
We observed that users rarely want to type long sentences and far more often want to type phrases, so for many Pinyin strings of a given length, the sentence result could be replaced with a phrase.
...