Understanding Convolutional Neural Networks (CNN) in NLP

This article is a translation of Denny Britz's explanation of how CNNs are applied to NLP; he has worked on a number of NLP projects on the Google Brain team.
Please forgive any shortcomings in the translation.

This article takes about 7 minutes to read. If you find it helpful, please like and follow :)

1. Understanding Convolutional Neural Networks (CNN) in NLP

When we hear about convolutional neural networks (CNNs), we usually think of their applications in computer vision. CNNs have driven major breakthroughs in image classification, and from Facebook's automatic photo tagging to self-driving car systems, they have become a core component.

Recently, applying CNNs to NLP has also produced some interesting results. So in this article I mainly answer two questions: 1. What is a CNN? 2. How are CNNs applied to NLP?

1. What is convolution?

Let's first understand what convolution is, using some simple examples from image processing.

One of the easiest ways to understand convolution is to think of it as a sliding-window function applied to a matrix. The figure below is a clear visualization of convolution.

[Figure: animation of a 3×3 convolution sliding over a matrix]

The figure shows convolution with a 3×3 kernel whose values are 1 on the diagonal and 0 elsewhere. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

Imagine that the matrix on the left represents a black-and-white image: each small square is one pixel, with 0 for black and 1 for white (grayscale images usually have pixel values from 0 to 255). The sliding window is called a kernel, filter, or feature detector. Here a 3×3 kernel is used: each of its elements is multiplied by the value beneath it (1 on the diagonal, 0 elsewhere), and all the products are summed to produce one element of the new matrix. To apply the convolution to every element, the kernel is slid across the whole matrix.
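To make this concrete, here is a minimal NumPy sketch of that sliding-window operation. The function name conv2d_valid and the example matrix are my own illustration, not from the original article.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return
    the matrix of multiply-and-sum results."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the current window by the kernel and add everything up.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.eye(3)  # 1 on the diagonal, 0 elsewhere, as in the figure
print(conv2d_valid(image, kernel))  # 3x3 output matrix
```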

You might be wondering what exactly this does, so let's take a look at an intuitive example.

Blurring: averaging nearby pixels

[Figure: image blurred by an averaging kernel]

The kernel averages each pixel with its neighbours, which smooths away edges and corners (places where pixel values change sharply) and makes the image look blurred.

Edge detection: using the difference between a pixel and its neighbours

[Figure: edges extracted by a difference kernel]

This example is also easy to understand: if the pixels surrounding a pixel are equal to it, the convolved value is 0, which displays as black. Where the values change sharply (at an edge), the result is greater than 0 and displays as white, so the edges stand out.
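Both effects can be reproduced with the conv2d_valid function above. The kernels below are the standard box-blur and Laplacian-style edge kernels; the exact kernels used in the original figures may differ slightly.

```python
import numpy as np

blur = np.ones((3, 3)) / 9.0      # replace each pixel by the average of its 3x3 neighbourhood
edge = np.array([[ 0, -1,  0],
                 [-1,  4, -1],
                 [ 0, -1,  0]])   # ~0 in flat regions, large where values change

# Applied to a grayscale image matrix `gray`:
# blurred = conv2d_valid(gray, blur)
# edges   = conv2d_valid(gray, edge)
```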

There are more examples in the GIMP manual that can help you build intuition for convolutions.

2. What is a Convolutional Neural Network?

So far you know what convolution is, but what is a convolutional neural network? In simple terms, a CNN is several layers of convolutions with a non-linear activation function such as ReLU (rectified linear unit) or tanh (hyperbolic tangent) applied to the results. In a traditional feedforward neural network we connect each input neuron to every neuron in the next layer; this is called a fully connected layer, or affine layer.

In convolutional neural networks we don't do that. Instead, we use convolutions over the input layer to compute the output. This yields local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands of them as described above, and combines their results.

During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform. For example, in image classification, a CNN might learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use those shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is a classifier that uses these high-level features.


There are two aspects of this computation worth noting: location invariance and compositionality. Suppose you want to classify whether there is an elephant in an image. Because you slide your filters over the whole image, you don't really care where the elephant is. In practice, pooling layers also contribute invariance to translation, rotation, and scaling, but more on that later. The second key aspect is (local) compositionality. Each filter composes a local patch of lower-level features into a higher-level representation. This is why CNNs are so powerful in computer vision: it matches the intuition of building up from pixels to edges, from edges to shapes, and from shapes to more complex objects.

But what does any of this do for NLP?

Unlike image pixels, the input to most NLP tasks is a document or sentence represented as a matrix. Each row of the matrix corresponds to one token, typically a word; that is, each row is a vector representing a word. Usually these vectors are word embeddings such as word2vec or GloVe, but they could also be one-hot vectors that mark the word's position in a vocabulary. For a 10-word sentence with 100-dimensional embeddings, this gives a 10×100 matrix. That is our "image" in NLP.
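As a toy illustration of such a sentence matrix (the 4-dimensional vectors below are made up for the example; in practice they would come from word2vec or GloVe, or be one-hot vectors):

```python
import numpy as np

embeddings = {  # hypothetical 4-dimensional word vectors
    "i":     np.array([ 0.2, -0.1,  0.4,  0.0]),
    "like":  np.array([ 0.5,  0.3, -0.2,  0.1]),
    "this":  np.array([-0.1,  0.0,  0.3,  0.2]),
    "movie": np.array([ 0.4,  0.2,  0.1, -0.3]),
}
sentence = ["i", "like", "this", "movie"]
sentence_matrix = np.stack([embeddings[w] for w in sentence])
print(sentence_matrix.shape)  # (4, 4): one row per word
```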

In computer vision our filters slide over local patches of an image, but in NLP we typically use filters that span entire rows of the matrix (whole words). The "width" of a filter is therefore usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows of 2-5 words at a time are typical.

A convolution applied to NLP looks like the following; please take a moment to understand the figure below.
[Figure: CNN architecture for sentence classification]

The figure shows a CNN used for text classification. It uses three filter region sizes (heights), with 2 filters per region size, for a total of 6 filters (second column from the left), producing 6 convolution outputs (third column from the left). A max-pooling layer (pooling is discussed below) reduces each convolution output to a single value (fourth column from the left); those values are concatenated into one feature vector (fifth column from the left), which a classifier finally turns into a binary result (last column). Original paper: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
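The forward pass in the figure can be sketched in a few lines of NumPy. This is my own minimal illustration with random weights and made-up sizes; a real model would learn the filters and classifier weights during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, emb_dim = 9, 5
sentence_matrix = rng.normal(size=(n_words, emb_dim))  # stand-in for word embeddings

region_sizes = [2, 3, 4]  # filter heights; 2 filters per size, as in the figure
filters = {h: rng.normal(size=(2, h, emb_dim)) for h in region_sizes}

pooled = []
for h, fs in filters.items():
    for f in fs:
        # Narrow convolution down the sentence: one value per word window.
        conv = np.array([np.sum(sentence_matrix[i:i + h] * f)
                         for i in range(n_words - h + 1)])
        pooled.append(conv.max())  # 1-max pooling: keep the strongest response

feature_vector = np.array(pooled)              # 6 values, one per filter
W = rng.normal(size=(2, feature_vector.size))  # binary softmax classifier
logits = W @ feature_vector
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)  # probabilities for the two classes
```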

What was our intuition for computer vision? Location invariance and compositionality made sense for images, but they fit NLP less well. In a sentence, you may care a great deal about where a word appears. Pixels that are close together are likely to be semantically related (part of the same object), but this is not always true for words: parts of a phrase can be separated by several other words, and a different segmentation can produce an entirely different meaning. Compositionality isn't obvious in NLP either: how word combinations (such as an adjective modifying a noun) are expressed at a higher level, and what those representations mean, is much less clear than in computer vision tasks.

If you've read this far, you may feel that CNNs don't seem well suited to NLP tasks. RNNs look more intuitive, because the way they work resembles how humans read (or at least how we imagine we read): from left to right, processing a sentence sequentially. The interesting thing is that CNNs nevertheless work well on NLP. All models are wrong, but some are useful. The simple bag-of-words model is an obvious oversimplification built on false assumptions, yet it was the standard approach for years and delivered pretty good results.

An important feature of CNNs is that they are fast. Very fast. Convolutions are a core part of computer graphics and are implemented in hardware on GPUs. Compared to n-gram probabilistic language models, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything beyond 3-grams quickly becomes expensive; even Google doesn't provide anything beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary, so it's completely reasonable to use filters larger than 5. I suspect that many of the learned filters in the first layer capture features quite similar to (but not limited to) n-grams, while representing them in a more compact way.

3. CNN hyperparameters

Before looking at how CNNs are applied to NLP tasks, let's go over the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in this field.

Wide convolution vs narrow convolution

When we introduced convolution above, we skipped over a detail of how the kernel is applied. Placing a 3×3 kernel over the center of the matrix works fine, but what about the elements at the edges of the matrix? How do you convolve around them, where the kernel sticks out past the border?

This is where zero-padding helps: all elements that fall outside the matrix are taken to be 0. This way you can apply the kernel to every element of the input matrix and obtain an output that is as large as, or larger than, the input. Convolution with zero-padding is called wide convolution, and convolution without zero-padding is narrow convolution. The 1D example below shows the difference.
[Figure: 1D narrow vs. wide convolution]

Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)

s1 through s7 can be regarded as input elements (pixels), and each c node represents a filter of size 5 connected to 5 inputs. The left diagram is a narrow convolution and the right a wide convolution. The output size on the left is (7 − 5) + 1 = 3, where 7 is the input size and 5 is the number of inputs each filter connects to. The output size on the right is (7 + 2×4 − 5) + 1 = 11, where 2×4 accounts for zero-padding of 4 elements added on each side.
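Both output sizes follow from the general formula output = (n + 2p − f)/s + 1, where n is the input length, f the filter size, p the zero-padding per side, and s the stride. A small helper (my own naming) confirms the numbers:

```python
def conv_output_size(n, f, p=0, s=1):
    """Length of a 1D convolution output: input n, filter f,
    zero-padding p per side, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 5, p=0))  # narrow convolution: 3
print(conv_output_size(7, 5, p=4))  # wide convolution: 11
```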

The "stride size" of the convolution movement

The "stride size" is another parameter of the convolution, which describes how far the convolution kernel moves in each step. In the above examples, we all set the stride to 1. The following is a comparison between a convolution kernel with stride 1 and a convolution kernel with stride 2 shown in the Stanford cs231 course. It can be seen that the larger the stride is. , the smaller the dimension of the convolution kernel output. 
[Figure: convolution with stride 1 vs. stride 2]

Convolution Stride Size. Left: Stride size 1. Right: Stride size 2. Source: http://cs231n.github.io/convolutional-networks/

A stride of 1 is what we use most often; a larger stride may let you build a model that behaves somewhat like a recursive neural network, i.e. looks like a tree.
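The conv_output_size helper from above reproduces the sizes in the cs231n figure:

```python
# Input size 7, filter size 3, no padding.
print(conv_output_size(7, 3, s=1))  # stride 1 -> output length 5
print(conv_output_size(7, 3, s=2))  # stride 2 -> output length 3
```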

Pooling Layers

Another prominent feature of convolutional neural networks is the pooling layer, which is typically placed after the convolutional layers. A pooling layer subsamples its input, most commonly by taking the maximum value of each region, which is called max pooling.
You don't have to pool over the entire matrix; you can also pool over a sliding window. (The classic approach in NLP is to apply pooling to the entire output of each convolution, directly producing a single number, i.e. the feature of the sentence at that region size, as in the example above.)
[Figure: max pooling over 2×2 windows]

Max pooling in CNN. Source: http://cs231n.github.io/convolutional-networks/#pool

Why do we use pooling layers?

First: a pooling layer guarantees a fixed-size output matrix, which classification tasks usually require. For example, if you apply max pooling over the outputs of 1000 filters (convolution kernels), you will get a 1000-dimensional output regardless of the sizes of your filters and of your input matrix. This lets you use variable-length inputs and variable-size filters while always getting the same output dimensionality.
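A quick sketch of this property: whatever the sentence length, 1-max pooling over each filter's output gives one number per filter, so 1000 filters always produce a 1000-dimensional vector. The arrays below are random stand-ins for convolution outputs.

```python
import numpy as np

def one_max_pool(conv_outputs):
    """conv_outputs: one 1-D array per filter, possibly of different
    lengths. Returns the max of each, one value per filter."""
    return np.array([c.max() for c in conv_outputs])

short_sent = [np.random.randn(4) for _ in range(1000)]  # short sentence
long_sent = [np.random.randn(58) for _ in range(1000)]  # long sentence
print(one_max_pool(short_sent).shape)  # (1000,)
print(one_max_pool(long_sent).shape)   # (1000,) - same size either way
```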

Second: a pooling layer reduces the output dimensionality while (hopefully) keeping the most important information.

In image processing, pooling also provides basic invariance to translation and rotation. When you pool over a region, the output stays roughly the same even if you shift the image by a few pixels, because the max operation picks out the same value regardless.

For more details, see https://www.zhihu.com/question/36686900

Channels

Channels are different "views" of your input. In images, an RGB image has 3 channels (red, green, and blue). NLP inputs can have channels too: for example, separate channels for different word-embedding methods (word2vec, GloVe, etc.), for the same sentence in different languages, or for the same meaning phrased in different ways.
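Here is a hedged sketch of what a two-channel NLP input might look like, stacking two embedding "views" of the same sentence along a leading channel axis, analogous to an image's R/G/B channels (the matrices are random stand-ins):

```python
import numpy as np

n_words, emb_dim = 6, 50
word2vec_view = np.random.randn(n_words, emb_dim)  # channel 1
glove_view = np.random.randn(n_words, emb_dim)     # channel 2
multi_channel = np.stack([word2vec_view, glove_view])
print(multi_channel.shape)  # (2, 6, 50): channels x words x dimensions
```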

2. Applications of CNN in the field of NLP

Now let's look at some specific applications of CNNs to NLP. CNNs are a natural fit for text classification tasks such as sentiment analysis, spam detection, or topic categorization.
1. The classic approach
[1] evaluates a CNN architecture on various classification datasets, mostly sentiment analysis and topic categorization tasks. The CNN architecture achieves very good performance across the datasets and sets a new state of the art on some of them. Surprisingly, the network used in the paper is very simple, and that's what makes it powerful. The input layer is a sentence made up of concatenated word2vec word embeddings, followed by a convolutional layer with multiple filters, then a max-pooling layer, and finally a softmax classifier. The paper also experiments with static and dynamic word embeddings in two separate channels, where one channel is adjusted during training and the other isn't. [2] proposes a similar but somewhat more complex architecture, and [6] adds an extra layer that performs "semantic clustering" on top of this network architecture.
[Figure: the CNN architecture from [1]]
2. Different representations of the input vectors
[4] trains a CNN from scratch, without pre-trained word vectors like word2vec or GloVe, applying the convolutions directly to one-hot vectors. The authors also propose a space-efficient representation of the input data, reducing the number of parameters the network needs to learn. In [5], the authors extend the model with an additional unsupervised "region embedding", learned using a CNN to predict the context of text regions. The approaches in these papers seem to work well for long texts (such as movie reviews), but their performance on short texts is unclear. Intuitively, pre-trained word embeddings should yield bigger gains for short texts than for long texts.

3. Comprehensive experiments with CNNs in NLP
Building a CNN architecture means there are many hyperparameters to choose, some of which I covered above: input representation (word2vec, GloVe, one-hot), the number and sizes of convolutional filters, the pooling strategy (max, average), and the activation function (ReLU, tanh). [7] performs an empirical evaluation of the effect of varying these hyperparameters in a CNN architecture, examining their impact on performance and on the variance of results across multiple runs. If you want to implement your own CNN for text classification, using the results of this paper as a starting point would be an excellent idea. Some notable results: max-pooling always beat average-pooling, the ideal filter sizes matter but are task-dependent, and regularization doesn't seem to make a big difference for the NLP tasks considered.
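Written out as a search grid, that hyperparameter space looks something like the following. The particular values are illustrative, loosely following the ranges explored in [7], not a prescription from the paper.

```python
# A hypothetical hyperparameter grid for a text-classification CNN.
search_grid = {
    "input_representation": ["word2vec", "GloVe", "one-hot"],
    "filter_region_sizes": [(3,), (3, 4, 5), (7,)],  # filter heights
    "num_filters_per_size": [100, 400, 600],
    "pooling": ["max", "average"],
    "activation": ["relu", "tanh"],
    "dropout_rate": [0.0, 0.5],
}
```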

4. Exploring the influence of word positions
[8] explores CNNs for relation extraction and relation classification tasks. In addition to the word vectors, the authors use the relative positions of words with respect to the entities of interest as an input to the convolutional layer. The model assumes that the entity positions are given and that each example input contains one relation. [9] and [10] explore similar models.

5. Learning meaningful representations
Another interesting use case of CNNs in NLP can be found in [11] and [12], research from Microsoft. These papers describe how to learn semantically meaningful sentence representations that can be used for information retrieval. The example given in the papers includes recommending potentially interesting documents to users based on what they are currently reading. The sentence representations are trained on search engine log data.

6. Learning appropriate word embeddings
Most CNN architectures learn embeddings (low-dimensional representations) of words and sentences, in one way or another, as part of their training procedure. Not all papers focus on this aspect of training or investigate how meaningful the learned embeddings are. [13] presents a CNN architecture that predicts hashtags for Facebook posts while generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task: recommending potentially interesting documents to users, trained on user click data.

CNN at the character level

All of the models so far have been word-based, but there is also research applying CNNs directly to characters. [14] learns character-level embeddings, joins them with pre-trained word embeddings, and uses a CNN for part-of-speech tagging. [15] and [16] explore using CNNs to learn directly from characters, without any pre-trained embeddings. Notably, the authors use a relatively deep network, 9 layers in total, and apply it to sentiment analysis and text classification tasks. The results show that learning directly from character-level input works very well on large datasets (millions of examples) but underperforms simpler models on smaller datasets (thousands of examples). [17] explores the use of character-level convolutions in language modeling, feeding the output of a character-level CNN into an LSTM at each time step. The same model is applied across multiple languages.
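To make character-level input concrete, here is a hedged sketch of the kind of character quantization used in [15] and [16]: each character becomes a one-hot column over a fixed alphabet, so a text becomes a matrix a CNN can convolve over directly. The alphabet and max_len below are illustrative choices, not the exact ones from the papers.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "
char_to_idx = {c: i for i, c in enumerate(alphabet)}

def quantize(text, max_len=16):
    """One-hot encode `text` character by character, truncating or
    zero-padding to max_len columns; unknown characters stay all-zero."""
    out = np.zeros((len(alphabet), max_len))
    for j, ch in enumerate(text.lower()[:max_len]):
        if ch in char_to_idx:
            out[char_to_idx[ch], j] = 1.0
    return out

print(quantize("CNNs for NLP").shape)  # (37, 16): alphabet x positions
```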

Surprisingly, all of the papers above were published within the past one to two years. Clearly, CNNs have already achieved excellent results on NLP, and the pace of development is accelerating.

Paper Index

    • [1] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1746–1751.
    • [2] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network for Modeling Sentences. Acl, 655–665.
    • [3] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In COLING-2014 (pp. 69–78).
    • [4] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. To Appear: NAACL-2015, (2011).
    • [5] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding.
    • [6] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings ACL 2015, 352–357.
    • [7] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
    • [8] Nguyen, T. H., & Grishman, R. (2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on Vector Modeling for NLP, 39–48.
    • [9] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation, (Ijcai), 1333–1339.
    • [10] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via Convolutional Deep Neural Network. Coling, (2011), 2335–2344.
    • [11] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with Deep Neural Networks.
    • [12] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management – CIKM ’14, 101–110.
    • [13] Weston, J., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags, 1822–1827.
    • [14] Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech Tagging. Proceedings of the 31st International Conference on Machine Learning, ICML-14(2011), 1818–1826.
    • [15] Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9.
    • [16] Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. arXiv E-Prints, 3, 011102.
    • [17] Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language Models.
