Evaluating the effect of the Mimick model on word vector reconstruction

The open-source code for the paper "Mimicking Word Embeddings using Subword RNNs" is based on the DyNet deep learning framework. Unlike statically declared frameworks such as TensorFlow, Theano, and CNTK, where the user first defines a computation graph and then passes samples to an engine that executes it and computes derivatives, DyNet makes graph construction essentially transparent: for every input, the user is free to build a different network structure. Personally, I find DyNet very easy to use.
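For a feel of this define-by-run style, here is a tiny, purely illustrative DyNet sketch (the names and dimensions are mine, not from the paper's code): a fresh graph is rebuilt for every input, so sequences of any length can be fed without padding.

```python
import dynet as dy

pc = dy.ParameterCollection()
lstm = dy.LSTMBuilder(1, 64, 128, pc)    # 1 layer, 64-dim input, 128-dim hidden

def encode(vectors):                     # vectors: list of 64-dim lists
    dy.renew_cg()                        # a new computation graph per input
    state = lstm.initial_state()
    for v in vectors:                    # the graph grows with the sequence
        state = state.add_input(dy.inputVector(v))
    return state.output().npvalue()      # final hidden state as a numpy array
```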

Mimick model introduction

Purpose: The paper "Mimicking Word Embeddings using Subword RNNs" was published at EMNLP 2017. Its main goal is to solve the out-of-vocabulary (OOV) word vector problem for NLP applications that rely on word-level lexical and syntactic properties, such as sequence labeling. Judging from the authors' experiments on 23 languages, the approach improves results on sequence-labeling tasks.

The idea is to learn from word vectors that have already been trained: a BiLSTM reads a word's characters and reconstructs the word's vector. Chinese has a huge number of words but comparatively few characters, so if word vectors can be reconstructed from character vectors, the storage problem shrinks. Moreover, word combinations are essentially unbounded: the words in a training corpus can never cover them all, segmentation tools split text differently depending on context, and new words keep appearing. The input sequence of the BiLSTM is the characters of a word, and every parameter is optimized by minimizing the distance between the vector the model predicts and the word's original vector.
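As a concrete picture of that objective, here is a minimal DyNet sketch (the released code uses DyNet). The dimensions, character-vocabulary size, optimizer, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
import dynet as dy

NUM_CHARS, CHAR_DIM, HIDDEN, WORD_DIM = 8000, 64, 128, 200  # 200 matches the vectors used here

pc = dy.ParameterCollection()
char_emb = pc.add_lookup_parameters((NUM_CHARS, CHAR_DIM))
fwd = dy.LSTMBuilder(1, CHAR_DIM, HIDDEN, pc)   # forward half of the BiLSTM
bwd = dy.LSTMBuilder(1, CHAR_DIM, HIDDEN, pc)   # backward half
W = pc.add_parameters((WORD_DIM, 2 * HIDDEN))
b = pc.add_parameters(WORD_DIM)
trainer = dy.SimpleSGDTrainer(pc)

def predict(char_ids):
    """Reconstruct a word vector from the word's character ids."""
    dy.renew_cg()
    xs = [char_emb[c] for c in char_ids]
    f = fwd.initial_state().transduce(xs)[-1]                   # last forward state
    bk = bwd.initial_state().transduce(list(reversed(xs)))[-1]  # last backward state
    return dy.parameter(W) * dy.concatenate([f, bk]) + dy.parameter(b)

def train_step(char_ids, target_vec):
    """One update: minimize squared distance to the original word vector."""
    loss = dy.squared_distance(predict(char_ids), dy.inputVector(target_vec))
    value = loss.value()   # runs the forward pass
    loss.backward()
    trainer.update()
    return value
```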

Experimental data:

About 750,000 words, obtained by merging a Chinese thesaurus from the wiki with several other corpora, together with the corresponding vectors trained by word2vec, serve as the training data for the Mimick model.
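As an aside, vectors stored in the standard word2vec format can be loaded with gensim; the file name below is hypothetical.

```python
from gensim.models import KeyedVectors

# Hypothetical file name; any word2vec-format vector file loads the same way.
kv = KeyedVectors.load_word2vec_format("zh_vectors.bin", binary=True)
print(len(kv.index_to_key), kv.vector_size)   # gensim >= 4: vocab size, dimensions
```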

Experimental results

First: Evaluate the word vectors reconstructed by the model (for words seen in training) on a sentence-similarity task
Query sentences: Children's diarrhea remedies / Everyone loves to watch comedies

Original word2vec word vectors (with learned weights):

  Query sentence                     Candidate sentence                           Similarity
  Children's diarrhea remedies       Home remedy for baby diarrhea                0.744
  Everyone loves to watch comedies   Many people like to watch humorous movies    0.842
  Children's diarrhea remedies       Children's cold remedies                     0.969
  Everyone loves to watch comedies   He is a humorous person                      0.322

Mimick-reconstructed word vectors (with learned weights):

  Query sentence                     Candidate sentence                           Similarity
  Children's diarrhea remedies       Baby diarrhea folk remedy                    0.836
  Everyone loves to watch comedies   Many people like to watch humorous TV        0.800
  Children's diarrhea remedies       Children's cold remedies                     0.932
  Everyone loves to watch comedies   He is a humorous person                      0.115

Query sentences: The signal is strong and weak / Redmi update error

Original word2vec word vectors (with learned weights):

  Query sentence                  Candidate sentence                    Similarity
  The signal is strong and weak   Signal fluctuates up and down         0.864
  Redmi update error              Redmi upgrade system error            0.921
  The signal is strong and weak   The signal is interrupted suddenly    0.321
  Redmi update error              How to buy a Xiaomi mobile phone      0.723

Mimick-reconstructed word vectors (with learned weights):

  Query sentence                  Candidate sentence                    Similarity
  The signal is strong and weak   Signal fluctuating high and low       0.900
  Redmi update error              Redmi upgrade system error            0.962
  The signal is strong and weak   The signal is suddenly interrupted    0.704
  Redmi update error              How to buy a Xiaomi mobile phone      0.791
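The post does not spell out the "(with learned weights)" scoring. One plausible reading, sketched below under that assumption, is to scale each word vector by a learned weight, average the scaled vectors into a sentence vector, and compare sentences by cosine similarity; `sentence_vector`, `weights`, and `cosine` are illustrative names.

```python
import numpy as np

def sentence_vector(words, vectors, weights):
    # vectors: word -> np.ndarray; weights: word -> float (assumed learned elsewhere)
    return np.mean([vectors[w] * weights.get(w, 1.0) for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```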

Second: Evaluate the similarity between the vector the Mimick model predicts for words not seen in training (new words) and the original word2vec vector.

More than 20,000 of the word2vec words were held out from Mimick training. Results for a few randomly chosen words:
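In code, this evaluation amounts to feeding a held-out word's characters through the trained model and comparing the output with the word's word2vec vector, e.g. reusing the hypothetical `predict` and `cosine` helpers sketched earlier:

```python
def eval_oov(word, char2id, kv):
    # char2id maps characters to the ids used during Mimick training (assumed)
    pred = predict([char2id[c] for c in word]).npvalue()  # reconstructed vector
    return cosine(pred, kv[word])   # similarity to the original word2vec vector
```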

        

  Word                Similarity (original vs. Mimick-reconstructed vector)
  Doctor Li           0.748
  Photographer        0.768
  Poor area           0.784
  Horse farmer        0.397
  First offender      0.771
  Bauhinia House      0.850
  Learn vocal music   0.763
  1.97 billion        0.852
  Dadan River         0.849
  52.86%              0.757
  Tasteless           0.337

(3) Experimental analysis

Judging from the experiments above, the model does help. But there are also some bad cases, possibly because the BiLSTM has to reconstruct vectors for more than 750,000 words, each 200-dimensional, and many words share characters while meaning completely different things; fully fitting such a diverse distribution of word vectors with a single model seems genuinely difficult.

(4) Summary

(a) When training the Mimick model, the details of the hyperparameter settings have a strong influence on the results.

(b) When a model's memorization and fitting capacity is strong enough, word segmentation can be dropped entirely in favor of a purely character-based model. In Chinese text, however, adjacent characters are strongly correlated and the features are not independent; splitting a sentence into several weakly correlated parts reduces those correlations and the dependence on word order, which makes further processing easier. Perhaps that is one purpose of word segmentation. CNN models for text classification likewise treat combinations of several characters as features.

 
