[Paper reading] Neural Pinyin-to-Chinese Character Converter

 
 
The model uses a seq2seq approach to convert a pinyin sequence into a sequence of Chinese characters. The pipeline is as follows:
 
1. Prepare the training data
    • zho_news_2007-2009_1M-sentences.txt: 1 million sentences (100w), word-segmented; the segmentation information is not actually used
    • Example line: 1 博客园 ("Blog Park")
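2. Build the pinyin-to-Chinese parallel corpus, zh.tsv: each character is padded as [char] + ["_"] * (len(p) - 1), where p is the pinyin of that character, so the Chinese side has the same length as the pinyin side
    • Example line: 1 bokeyuan     博_客_园___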
3. Generate the dictionaries and save them as a pkl file
    • pnyn2idx, idx2pnyn, hanzi2idx, idx2hanzi
4. Training
    • Read the dictionaries and the training set (x = [pinyin ids], y = [hanzi ids])
    • x -> model -> y, then compute the cross-entropy loss
5. Prediction
    • Overall the predictions look fairly reliable, but some words come out as homophones. For example, xinjiangwujing (Xinjiang armed police) is predicted with the wrong character for "wu" (the surname Wu), even though "armed police" should be the more frequent reading in the training corpus; I cannot explain this error
    • Misspelled input: if the typed spelling does not exist, the model generates a candidate for a fairly similar spelling (wuo -> duo); a correct spelling can also produce a candidate for a similar spelling (jiao -> tiao)
    • Proper nouns and some phrases are not handled very well
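
The padding rule from step 2 can be sketched as follows. This is a minimal illustration, not the paper's own code: the function name align_hanzi is hypothetical, and the per-character pinyin syllables are assumed to come from a pinyin annotation tool that is not shown here.

```python
def align_hanzi(syllables, hanzi):
    """Pad each character with "_" so the hanzi side matches the pinyin length."""
    if len(syllables) != len(hanzi):
        raise ValueError("need exactly one pinyin syllable per character")
    padded = []
    for p, char in zip(syllables, hanzi):
        # The rule from step 2: [char] + ["_"] * (len(p) - 1)
        padded.extend([char] + ["_"] * (len(p) - 1))
    return "".join(syllables), "".join(padded)


# Syllables are supplied by hand here (assumption: one syllable per character).
pinyin, target = align_hanzi(["bo", "ke", "yuan"], "博客园")
print(pinyin)  # bokeyuan
print(target)  # 博_客_园___
```

Because both sides have the same length, the model can predict one output symbol per input letter, with "_" standing for "no new character at this position".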

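The dictionaries from step 3 might be built and stored roughly like this. It is a sketch under assumptions: the helper build_vocab, the special tokens <PAD> and <UNK>, and the file name vocab.pkl are all hypothetical; only the four dictionary names and the pkl format come from the notes above.

```python
import os
import pickle
import tempfile


def build_vocab(sequences, specials=("<PAD>", "<UNK>")):
    """Map every symbol to an integer id; special tokens get the lowest ids."""
    symbols = sorted({ch for seq in sequences for ch in seq})
    idx2sym = list(specials) + symbols
    sym2idx = {sym: idx for idx, sym in enumerate(idx2sym)}
    return sym2idx, dict(enumerate(idx2sym))


# One aligned pair in the zh.tsv layout (pinyin side, padded hanzi side).
pairs = [("bokeyuan", "博_客_园___")]
pnyn2idx, idx2pnyn = build_vocab(p for p, _ in pairs)
hanzi2idx, idx2hanzi = build_vocab(h for _, h in pairs)

# Save all four dictionaries in one pkl file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "vocab.pkl")
with open(path, "wb") as f:
    pickle.dump((pnyn2idx, idx2pnyn, hanzi2idx, idx2hanzi), f)
with open(path, "rb") as f:
    pnyn2idx, idx2pnyn, hanzi2idx, idx2hanzi = pickle.load(f)

# Encode one training example: x = [pinyin ids], y = [hanzi ids].
x = [pnyn2idx.get(ch, pnyn2idx["<UNK>"]) for ch in "bokeyuan"]
y = [hanzi2idx.get(ch, hanzi2idx["<UNK>"]) for ch in "博_客_园___"]
```

Unknown symbols fall back to the <UNK> id, which is one common way to handle input that never appeared in the training corpus.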
 
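The cross-entropy loss from step 4 can be illustrated in plain Python. This is only a sketch: it assumes the model emits one score vector over the hanzi vocabulary per pinyin position and that the loss is averaged over positions; the notes do not say whether padded positions are masked, so no masking is shown.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target ids under softmax(logits)."""
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)


# Two positions, a toy vocabulary of three symbols (made-up scores).
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
loss = cross_entropy(logits, [0, 2])
print(loss)
```

With uniform scores the loss equals the log of the vocabulary size, which is a handy sanity check for an untrained model.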

Advantages: the model is end-to-end and needs no large vocabulary or hand-crafted features; a Chinese corpus is all that is required to build the parallel corpus. Disadvantages: the deep learning model is not interpretable, and it does not learn some contexts well.

Question: if the training corpus were expanded or adjusted, I do not know whether the model would perform well enough for industrial use.

 


Origin www.cnblogs.com/AliceYing/p/12559949.html