Smart Q&A

Application scenarios

Intelligent question-answering bots are extremely popular these days. I have been studying applications of deep learning in NLP for some time, and recently used deep learning models to do the question-answer matching of a QA system directly; the mainstream approaches are CNN and LSTM. I could not find any suitable, usable code online, so I first wrote a CNN version myself (in Theano). The results are good and consistent with the conclusions in the paper, and it has been applied to our product.

Principle

The implementation follows the paper "Applying Deep Learning To Answer Selection: A Study And An Open Task". The paper compares several network structures; I picked one of the better-performing ones to implement. The network is described as follows:

The question and the answer share the same network. The network consists of HL, CNN, P+T, and Cosine_Similarity. HL is a non-linear transformation g(W*X+b). The CNN needs no introduction. P is max-pooling and T is the Tanh activation function. The final Cosine_Similarity computes the similarity between the semantic representation vectors produced for the question and the answer.

The matrix transformations from input to output in detail:

  1. Qp: [batch_size, sequence_len]. Qp is the representation before Q (not shown in the figure above). Every sentence has to be truncated or padded to a fixed length, because the CNN that follows generally works on fixed-size matrices. For example, if a sentence contains the 3 words A B C and the fixed length sequence_len is 100, the sentence is padded to A B C <a> <a> ... <a> (100 tokens), where <a> is a meaningless symbol added purely for padding. Training uses mini-batches, so Qp is a matrix with batch_size rows, one sentence per row.
  2. Q: [batch_size, sequence_len, embedding_size]. Each word in a sentence is replaced by its word vector of dimension embedding_size, so the 2-dimensional matrix Qp becomes the 3-dimensional tensor Q.
  3. HL layer output: [batch_size, sequence_len, hl_size]. The HL layer is [embedding_size, hl_size]; each sentence in Q is transformed by a dot product with the HL layer, which is equivalent to mapping the word vector of every word from embedding_size to hl_size.
  4. CNN+P+T output: [batch_size, num_filters_total]. Each CNN filter has shape [filter_size, hl_size]; its width is hl_size, the same as the (transformed) word-vector size, so for every sentence the result of one filter is a column vector rather than a matrix. Max-pooling turns that column vector into a single number, so each filter outputs one number, and num_filters_total filters together give a vector of size [num_filters_total]. That vector is the semantic representation of the sentence. T adds a Tanh activation to the output.
  5. Cosine_Similarity: [batch_size]. The last layer is not the usual classification or regression layer; instead it computes the cosine of the angle between the two vectors (question and answer). The training loss is the max-margin hinge loss written out below, where M is the margin hyper-parameter and VQ, VA+, VA- are the semantic representation vectors of the question, a positive answer, and a negative answer respectively. The loss requires the cosine between the positive answer and the question to exceed the cosine between the negative answer and the question, and by how much is defined by the margin M. The larger the cosine, the closer the two vectors are. In plain terms, the loss pushes the positive answer and the question to become more and more similar, and the negative answer and the question more and more dissimilar.
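Written out, the loss from the paper is the following max-margin hinge loss over (question, positive answer, negative answer) triples:

$$L = \max\bigl(0,\; M - \cos(V_Q, V_{A^+}) + \cos(V_Q, V_{A^-})\bigr)$$

When the positive answer already beats the negative one by at least M, the loss is zero; otherwise the shortfall is penalised.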

Implementation

Click here for the code. The dataset used is the English insuranceQA corpus. The key parts of the code are described below:

Character vectors. This implementation uses character vectors rather than word vectors. The main reason for using character vectors is to handle out-of-vocabulary words, so that unknown-token vectors are rarely encountered at test time. Moreover, character vectors do not necessarily perform worse than word vectors, and they save the trouble of word segmentation. word2vec is first used to generate the character vectors, which amounts to pre-training (randomly initialised vectors were also tested and performed about the same).
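A minimal sketch of the pre-training step with gensim's word2vec (gensim 4.x parameter names; the original code may have used a different tool or version, and the sentences shown are only toy examples):

```python
# Minimal sketch: pre-train character vectors with gensim word2vec (gensim 4.x API).
# `corpus` is assumed to be an iterable of sentences; each sentence is split into characters.
import numpy as np
from gensim.models import Word2Vec

corpus = ["怎么买保险", "保险如何理赔"]              # toy sentences
char_sentences = [list(sent) for sent in corpus]    # character-level tokens

model = Word2Vec(
    sentences=char_sentences,
    vector_size=100,   # embedding_size
    window=5,
    min_count=1,
    sg=1,              # skip-gram
)

# Embedding matrix used to initialise the network's lookup table.
vocab = model.wv.index_to_key
embedding_matrix = np.stack([model.wv[c] for c in vocab]).astype("float32")
```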

Step 2 in the principle section (the embedding lookup). The HL-layer transformation is not done here; in testing, adding the HL layer gave only a very, very small improvement, so that step is omitted.
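A minimal Theano sketch of this step, assuming `embedding_matrix` comes from the word2vec pre-training above and `q_ids` is the padded id matrix Qp (the names are illustrative, not taken from the original code):

```python
import theano
import theano.tensor as T

# Shared embedding table, initialised from the pre-trained character vectors.
# It is an ordinary parameter and is updated during training (see trick 2 below).
embeddings = theano.shared(embedding_matrix, name="embeddings")

q_ids = T.imatrix("q_ids")   # Qp: [batch_size, sequence_len], integer ids
Q = embeddings[q_ids]        # Q:  [batch_size, sequence_len, embedding_size]
```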

The CNN can use filters of several different sizes, and the results of the different filters are spliced together at the end.
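A sketch of how the filter parameters could be declared; `filter_sizes` and `num_filters` are illustrative values, and since the HL layer is skipped the filter width is embedding_size rather than hl_size:

```python
import numpy as np
import theano

filter_sizes = [1, 2, 3, 5]      # illustrative sizes
num_filters = 500                # filters per size (illustrative)
embedding_size = 100
num_filters_total = num_filters * len(filter_sizes)

# One weight tensor and bias per filter size, in the conv2d layout
# [num_filters, in_channels=1, filter_height=filter_size, filter_width=embedding_size].
filters = []
for fs in filter_sizes:
    shape = (num_filters, 1, fs, embedding_size)
    W = theano.shared(np.random.uniform(-0.1, 0.1, shape).astype("float32"), name="W_%d" % fs)
    b = theano.shared(np.zeros((num_filters,), dtype="float32"), name="b_%d" % fs)
    filters.append((W, b))
```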

Step 4 in the principle section. Convolution, max-pooling and the Tanh activation are performed here.
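A minimal Theano sketch of this step, continuing the variables declared above (Q is reshaped to the 4-D conv2d layout, then each filter size gets its own convolution, 1-max pooling over time, and Tanh):

```python
# Reshape Q to the conv2d image layout [batch, channels=1, height=sequence_len, width=embedding_size].
Q_img = Q.reshape((Q.shape[0], 1, Q.shape[1], Q.shape[2]))

outputs_1 = []
for (W, b), fs in zip(filters, filter_sizes):
    conv = T.nnet.conv2d(Q_img, W)        # [batch, num_filters, sequence_len - fs + 1, 1]
    pooled = T.max(conv, axis=2)          # 1-max pooling over time -> [batch, num_filters, 1]
    pooled = pooled.reshape((pooled.shape[0], num_filters))
    outputs_1.append(T.tanh(pooled + b))  # T: the Tanh activation
```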

The generated outputs_1 is a Python list; concatenate is used to splice the tensors in the list together (each tensor in the list is the result of convolving with the filters of one size).
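Continuing the sketch, the per-filter-size results are spliced into one semantic representation vector per sentence:

```python
# Splice the results of all filter sizes: [batch_size, num_filters_total].
output = T.concatenate(outputs_1, axis=1)
```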

Step 5 in the principle section. Compute the cosine of the angle between the question, the positive answer, and the negative answer.
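A sketch of the row-wise cosine similarity, assuming `q_vec`, `a_pos_vec` and `a_neg_vec` are the [batch_size, num_filters_total] outputs of the shared network for the question, the positive answer and the negative answer (i.e. `output` above computed once per input; the names are illustrative):

```python
def cosine_similarity(a, b):
    # Row-wise cosine between two [batch_size, num_filters_total] matrices.
    eps = 1e-6
    norm_a = T.sqrt(T.sum(a ** 2, axis=1)) + eps
    norm_b = T.sqrt(T.sum(b ** 2, axis=1)) + eps
    return T.sum(a * b, axis=1) / (norm_a * norm_b)

cos_q_pos = cosine_similarity(q_vec, a_pos_vec)   # [batch_size]
cos_q_neg = cosine_similarity(q_vec, a_neg_vec)   # [batch_size]
```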

Generate the loss function and the accuracy.
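A sketch of the hinge loss and the batch accuracy monitor, with `margin` playing the role of M from the principle section:

```python
margin = 0.05
# Hinge loss: push cos(q, positive) above cos(q, negative) by at least `margin`.
losses = T.maximum(0.0, margin - cos_q_pos + cos_q_neg)
loss = T.mean(losses)

# Accuracy: fraction of triples where the positive answer beats the negative one.
accuracy = T.mean(T.cast(cos_q_pos > cos_q_neg, "float32"))
```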

That is the core network-construction code; the rest of the code reads the training and validation data and contains the usual boilerplate for building the training loop in Theano.

If you want to add the HL layer, you can refer to the code below: Whl is the weight matrix of the HL layer, and the input is dot-multiplied with Whl.
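A minimal sketch of that layer, under the same assumptions as the snippets above (hl_size is illustrative, and Tanh is assumed as the non-linearity g); Q_hl would then replace Q as the input to the CNN:

```python
hl_size = 200
# HL layer: the non-linear transformation g(W*X + b) applied to every character vector,
# mapping embedding_size -> hl_size (tanh assumed here for g).
Whl = theano.shared(
    np.random.uniform(-0.1, 0.1, (embedding_size, hl_size)).astype("float32"), name="Whl")
bhl = theano.shared(np.zeros((hl_size,), dtype="float32"), name="bhl")

# Q: [batch, sequence_len, embedding_size] -> HL output: [batch, sequence_len, hl_size]
Q_hl = T.tanh(T.dot(Q, Whl) + bhl)
```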

Implementation of dropout.
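A minimal sketch of inverted dropout using Theano's RandomStreams, with keep_prob = 0.5 as mentioned in the tricks below and applied only to the pooled output; at test time the mask is simply not applied:

```python
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=1234)
keep_prob = 0.5

def dropout(x, keep_prob):
    # Zero units with probability (1 - keep_prob); rescale to keep the expected value unchanged.
    mask = srng.binomial(n=1, p=keep_prob, size=x.shape, dtype=theano.config.floatX)
    return x * mask / keep_prob

# Applied only to the max-pooling result, as noted in the tricks below.
output_drop = dropout(output, keep_prob)
```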

Results

With the above code, the Top-1 accuracy on Test 1 reaches 61%-62%, which basically matches the conclusion in the paper. The GESD, AESD and other methods mentioned in the paper were not tried again (they also run rather slowly), and the other data sets were not tested either.

Below are the results a developer abroad obtained with similar code written in Keras (a wrapper around Theano and TensorFlow); his Top-1 accuracy on Test 1 is around 50%, so the result above is higher than his :)

http://benjaminbolte.com/blog/2016/keras-language-modeling.html

Test set    Top-1 Accuracy    Mean Reciprocal Rank
Test 1      0.4933            0.6189
Test 2      0.4606            0.5968
Dev         0.4700            0.6088

In addition, the original insuranceQA data needs some preprocessing before it can be used with this code; see the instructions on GitHub.

Some tricks

  1. Character vectors and word vectors perform about the same. So character vectors are preferred: they save the trouble of word segmentation and avoid out-of-vocabulary words better. Why not use them.
  2. The character vectors are not fixed; they are updated during training.
  3. Dropout has little effect on the best accuracy, but results with dropout are more stable and the accuracy fluctuates less, so dropout is recommended. However, dropout should not be overdone: with keep_prob set to 0.25, for example, the model converges more slowly, training takes much longer, and the result may even be worse. My code uses keep_prob = 0.5, which balances accuracy and training time. In addition, dropout is applied only to the max-pooling result and nowhere else; overusing it does not help.
  4. The effect of the HL layer is not obvious; it gives only a slight improvement. If the HL size is 200 and the character-vector size is 100, the HL layer merely doubles the dimensionality of the character vectors, which does not feel like it adds much information. It is simpler to set the character-vector size to 200 directly and drop the HL transformation.
  5. The margin is generally set fairly small; 0.05 is used here.
  6. If the Cosine_Similarity layer is replaced with classification or regression, as I recall the result is not as good as Cosine_Similarity (I have forgotten the exact numbers).
  7. A larger num_filters does not mean a better result. Beyond a certain point it is hard to improve further, and it slows training down.
  8. I also wrote a TensorFlow version of the code; compared with the Theano version, the results are similar.
  9. Comparing the Adam and SGD optimizers, Adam seems to train faster and the results are basically the same; I did not compare them in detail. With the same network and SGD, though, Theano seems to train faster.
  10. Loss and accuracy are important quantities to monitor. If you write a new network, indicators like these are essential, letting you judge at every iteration whether the network is converging. Since debugging is troublesome, these metrics also tell you whether the network is written correctly and whether the parameters are set sensibly.
  11. The network hyper-parameters matter a lot. If some parameters are set unreasonably, the results can be very different. I remember that in the first TensorFlow implementation the dropout keep probability was set too small, which gave poor results, and it took a long time to find the cause. So tuning and fine-tuning a network still takes some skill and experience. While writing this version of the code I went through a fairly painful tuning process: at first I suspected the network design or the code, but the final conclusion was simply that the parameters were not set properly.

There is also a TensorFlow version of the QA CNN, as well as LSTM code :)

References:

https://www.jianshu.com/p/3b17c296a93d

https://github.com/shuzi/insuranceQA

