Convolutional Neural Networks (CNN) applied to Natural Language Processing (NLP)



When it comes to CNN, everyone naturally thinks of image processing.
When it comes to NLP, everyone naturally thinks of LSTM and RNN.
However, a Stanford paper from last year showed that CNNs can also be applied to NLP, and in some cases even work better.

I ran an experiment, crawling all kinds of news articles and classifying them. For this kind of classification problem, both RNNs and CNNs reach about 99% accuracy, but the CNN is almost 5 times faster than the RNN. This post therefore focuses on the details of using CNNs for NLP.

1. Why can a CNN handle NLP?
With the development of word embeddings, existing word-vector libraries such as word2vec and GloVe can already express the meaning of words reasonably well. We can therefore represent each word as an N-dimensional vector and measure the distance between words: imagine an N-dimensional space in which every word is a point.
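As a minimal illustration of "each word is a point in N-dimensional space", the sketch below compares made-up 4-dimensional word vectors with cosine similarity; real vectors would be loaded from a pretrained word2vec or GloVe file.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two word vectors: closer to 1 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors for illustration only; real ones come from word2vec / GloVe.
vec = {
    "cat": np.array([0.9, 0.1, 0.3, 0.0]),
    "dog": np.array([0.8, 0.2, 0.4, 0.1]),
    "car": np.array([0.1, 0.9, 0.0, 0.7]),
}

print(cosine_similarity(vec["cat"], vec["dog"]))  # relatively high
print(cosine_similarity(vec["cat"], vec["car"]))  # relatively low
```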

2. What is the input when a CNN does NLP?
As with images, the input to the convolution is a two-dimensional matrix (an image could be called three-dimensional, since it has an extra channel dimension). Each row is the vector of one word, so each input is an M*N matrix made up of M words, each represented by an N-dimensional word vector.
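A minimal sketch of building that M*N input matrix, assuming a toy lookup table of 4-dimensional word vectors and a fixed sentence length M (shorter sentences are zero-padded; the names here are illustrative, not from the original post):

```python
import numpy as np

# Toy lookup table; in practice the vectors come from a pretrained embedding file.
vec = {
    "cat": np.array([0.9, 0.1, 0.3, 0.0], dtype=np.float32),
    "sat": np.array([0.2, 0.5, 0.1, 0.3], dtype=np.float32),
}

def sentence_to_matrix(words, vec, max_len, dim):
    # One N-dimensional word vector per row; zero-pad or truncate to max_len rows.
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, w in enumerate(words[:max_len]):
        mat[i] = vec.get(w, np.zeros(dim, dtype=np.float32))
    return mat

m = sentence_to_matrix("the cat sat".split(), vec, max_len=10, dim=4)
print(m.shape)  # (10, 4): M = 10 rows (padded), N = 4 dimensions
```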

3. How does the convolution work when a CNN does NLP?
After choosing the input, the next step is the convolution. Unlike convolution for image processing, a CNN for NLP places a constraint on the kernel size: the width of the kernel should match the dimension of the word vector. For example, if the word vectors are N-dimensional, the kernel should be X*N, and I usually take X from {1, 2, 3} (following the Stanford paper). The convolution therefore extracts features spanning 1 word, 2 words, or 3 words at a time.
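A minimal Keras sketch of this constraint (my own illustration, not the original code), assuming M = 50 words, N = 300-dimensional vectors, and a single input channel; each branch uses a kernel of height 1, 2, or 3 and width exactly N, so every kernel spans whole word vectors:

```python
import tensorflow as tf

M, N = 50, 300  # assumed sentence length and word-vector dimension
inputs = tf.keras.Input(shape=(M, N, 1))

branches = []
for k in (1, 2, 3):
    # Kernel covers k consecutive words and the full width of the word vector.
    x = tf.keras.layers.Conv2D(filters=64, kernel_size=(k, N), activation="relu")(inputs)
    x = tf.keras.layers.GlobalMaxPooling2D()(x)  # max-over-time pooling per filter
    branches.append(x)

features = tf.keras.layers.Concatenate()(branches)
model = tf.keras.Model(inputs, features)
model.summary()
```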

4. What else should we pay attention to when a CNN does NLP?
For example, dual-channel input. The embedding layer is normally trainable; the model I use has two embedding layers, one set to trainable = False (frozen pretrained vectors) and one set to trainable = True.
This helps avoid overfitting during training.
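A minimal Keras sketch of the dual-channel embedding (again an illustration under assumed sizes): both layers start from the same pretrained matrix, one stays frozen and the other is fine-tuned, and the two outputs are stacked as channels.

```python
import numpy as np
import tensorflow as tf

vocab_size, N, M = 20000, 300, 50              # assumed sizes
pretrained = np.random.rand(vocab_size, N)     # placeholder for word2vec / GloVe weights
init = tf.keras.initializers.Constant(pretrained)

tokens = tf.keras.Input(shape=(M,), dtype="int32")

# Channel 1: frozen pretrained vectors (trainable = False).
frozen = tf.keras.layers.Embedding(vocab_size, N, embeddings_initializer=init,
                                   trainable=False)(tokens)
# Channel 2: same initialization, fine-tuned during training (trainable = True).
tuned = tf.keras.layers.Embedding(vocab_size, N, embeddings_initializer=init,
                                  trainable=True)(tokens)

# Stack the two (M, N) maps into an (M, N, 2) two-channel "image" for the convolutions.
channels = tf.keras.layers.Concatenate(axis=-1)([
    tf.keras.layers.Reshape((M, N, 1))(frozen),
    tf.keras.layers.Reshape((M, N, 1))(tuned),
])

model = tf.keras.Model(tokens, channels)
print(model.output_shape)  # (None, 50, 300, 2)
```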

5. What optimizations can I think of?
The optimizations I can think of are:
Try changing the size of the convolution kernel and observe the effect (experimented with below).
Try adding residual-style skip connections so that earlier feature information is carried forward, as in DenseNet (see the sketch after this list).
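A minimal sketch of the DenseNet-style idea under the same assumptions (my own illustration): each convolution is applied to the concatenation of all earlier feature maps, so information from earlier layers is carried forward instead of being lost.

```python
import tensorflow as tf

M, N = 50, 300                          # assumed sentence length and vector dimension
inputs = tf.keras.Input(shape=(M, N))   # word vectors as a 1-D sequence with N channels

features = [inputs]
for _ in range(3):
    # Each block sees the concatenation of the input and every previous block's output.
    fed = features[0] if len(features) == 1 else tf.keras.layers.Concatenate()(features)
    x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(fed)
    features.append(x)

out = tf.keras.layers.GlobalMaxPooling1D()(tf.keras.layers.Concatenate()(features))
model = tf.keras.Model(inputs, out)
model.summary()
```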

6. Comparison of various models

  1. The simplest one first: dual channel + multi-kernel convolution (1*vector_length, 2*vector_length, 3*vector_length) + max pooling + fully connected layer (86%, fast)
  2. Then add a second layer of convolution: dual channel + multi-kernel convolution (1*30, 2*20, 3*10) + multi-kernel convolution (1*10, 2*20, 3*30) + max pooling + fully connected layer (85%, slow)
  3. Swap the order of convolution and pooling: dual channel + multi-kernel convolution + max pooling + multi-kernel convolution + fully connected layer (85%, slow)
  4. Then use DenseNet: (85%, slow)

7. Summary
For NLP convolutions, small kernels do not perform better than large kernels in these experiments, but the difference is not very big either.


