The open-source code for the paper "Mimicking Word Embeddings using Subword RNNs" is built on the DyNet deep learning framework. Unlike statically declared frameworks such as TensorFlow, Theano, and CNTK, where the user first defines a computation graph and then feeds samples to an execution engine that runs the graph and computes its derivatives, graph construction in DyNet is essentially transparent: for every input, the user is free to build a different network architecture. Personally, I find DyNet very easy to use.
Mimick model introduction
Purpose: The paper "Mimicking Word Embeddings using Subword RNNs" was published at EMNLP 2017. Its main goal is to solve the out-of-vocabulary (OOV) word-vector problem that arises in word-based NLP applications such as sequence labeling, which depend on syntactic properties of words. The authors report experiments on 23 languages and show improvements on sequence labeling and similar NLP tasks.
Method: Starting from an existing set of trained word vectors, character vectors are learned, and a BiLSTM reconstructs each word vector from them. Chinese has a very large number of words but comparatively few characters, so if word vectors can be reconstructed from character vectors, storage is reduced. Moreover, character combinations are endless: the words in any training corpus cannot cover all words, context varies, different word-segmentation tools segment differently, and new words keep appearing. The input sequence to the BiLSTM is the characters of a word, and the parameters are optimized by minimizing the distance between the vector the model predicts and the word's original vector.
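A minimal sketch of that objective in pure Python, with made-up 3-dimensional vectors (the BiLSTM that produces the prediction from the character sequence is omitted here):

```python
def squared_distance(pred, gold):
    """Mimick-style training loss: squared Euclidean distance between
    the vector predicted from characters and the original word vector."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold))

# made-up 3-dimensional vectors for illustration
predicted = [0.1, 0.4, -0.2]  # what the character BiLSTM might output
original = [0.0, 0.5, -0.1]   # the word's pre-trained word2vec vector
loss = squared_distance(predicted, original)  # gradients of this loss drive training
```

Minimizing this loss over all training words pulls the character-based predictions toward the pre-trained vectors.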
Experimental data:
About 750,000 words, merged from the Chinese wiki vocabulary and several other corpora, together with their word vectors trained by word2vec, are used as the training data of the Mimick model.
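Assuming the trained vectors are available as a word-to-vector dict (the two words and 3-dimensional vectors below are made up), each Mimick training example pairs a word's character sequence with its pre-trained vector:

```python
# stand-in for the ~750,000 word2vec vectors (illustrative values only)
embeddings = {
    "腹泻": [0.2, -0.1, 0.5],  # "diarrhea"
    "感冒": [0.3, 0.0, 0.4],   # "cold"
}

def make_training_pairs(embeddings):
    """Pair each word's character sequence (the BiLSTM input)
    with its pre-trained vector (the regression target)."""
    return [(list(word), vector) for word, vector in embeddings.items()]

pairs = make_training_pairs(embeddings)  # e.g. (['腹', '泻'], [0.2, -0.1, 0.5])
```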
Experimental results
First: evaluate the word vectors reconstructed by the model (for words seen in training) on a sentence-similarity task.
Queries: "Children's diarrhea remedies" and "Everyone loves to watch comedies"

| Query | Candidate sentence | word2vec vectors (with learned weights) | Mimick-reconstructed vectors (with learned weights) |
|---|---|---|---|
| Children's diarrhea remedies | Home remedy for baby diarrhea | 0.744 | 0.836 |
| Children's diarrhea remedies | Children's cold remedies | 0.969 | 0.932 |
| Everyone loves to watch comedies | Many people like to watch humorous movies | 0.842 | 0.800 |
| Everyone loves to watch comedies | He is a humorous person | 0.322 | 0.115 |
Queries: "The signal alternates between strong and weak" and "Redmi update error"

| Query | Candidate sentence | word2vec vectors (with learned weights) | Mimick-reconstructed vectors (with learned weights) |
|---|---|---|---|
| The signal alternates between strong and weak | The signal fluctuates up and down | 0.864 | 0.900 |
| The signal alternates between strong and weak | The signal is suddenly interrupted | 0.321 | 0.704 |
| Redmi update error | Redmi upgrade system error | 0.921 | 0.962 |
| Redmi update error | How to buy a Xiaomi mobile phone | 0.723 | 0.791 |
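The post does not spell out how these scores are computed; a plausible reading, assumed in this sketch, is cosine similarity between weighted sums of word vectors (all vectors and weights below are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sentence_vector(word_vectors, weights):
    """Weighted sum of word vectors as a simple sentence representation;
    the 'learned weight' scheme in the tables is assumed to work roughly like this."""
    dim = len(word_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, word_vectors))
            for i in range(dim)]

# two toy sentences of two words each, 3-dimensional vectors
s1 = sentence_vector([[0.2, 0.1, 0.5], [0.4, 0.3, 0.1]], [0.6, 0.4])
s2 = sentence_vector([[0.3, 0.1, 0.4], [0.2, 0.4, 0.2]], [0.5, 0.5])
score = cosine(s1, s2)  # in [-1, 1]; higher means more similar
```

Swapping Mimick-reconstructed vectors in for the original word2vec vectors changes only the lookup, not this scoring step.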
Second: evaluate the similarity between the vectors the Mimick model predicts for words it was not trained on (new words) and their original word2vec vectors.
More than 20,000 word2vec words were held out of Mimick training. The results for a few randomly chosen words are as follows:
| Word | Similarity between the original word vector and the vector the Mimick model predicts from its characters |
|---|---|
| Doctor Li | 0.748 |
| Photographer | 0.768 |
| Poor area | 0.784 |
| Horse farmer | 0.397 |
| First offender | 0.771 |
| Bauhinia House | 0.850 |
| Learn vocal music | 0.763 |
| 1.97 billion | 0.852 |
| Dadan River | 0.849 |
| 52.86% | 0.757 |
| Tasteless | 0.337 |
(3) Experimental analysis
Judging from the experiments above, the model does help. But there are also some bad cases, possibly because the BiLSTM is asked to reconstruct the vectors of all words: more than 750,000 of them, each 200-dimensional, with many words sharing the same characters yet meaning completely different things. Fully fitting such a diverse distribution of word vectors with one model seems genuinely hard.
(4) Summary
(a) When training the Mimick model, the details of the model's hyperparameters have a large effect on the results.
(b) When a model's memorization and fitting capacity is strong enough, word segmentation can be dropped entirely in favor of a character-based model. In the setting here, however, adjacent characters are strongly correlated and the features are not independent; to weaken those correlations and reduce the dependence on character order, a sentence is split into several weakly correlated parts, which helps further processing. Perhaps that is one purpose of word segmentation. Using a CNN for text classification likewise treats combinations of several characters as features.