A summary of commonly used text classification models in NLP: DPCNN

Deep learning has become the standard technology in NLP today; convolutional neural networks (CNNs), which have been so successful in computer vision, have begun to spread widely into text classification, machine translation, machine reading comprehension, and other NLP tasks. However, before ACL 2017, no remarkably effective word-level CNN (i.e., one taking words as the semantic units) had appeared for text classification since Kim's TextCNN in 2014, and certainly no deep one.

 

Figure 1: The TextCNN (ShallowCNN) model

Then, at ACL 2017, the paper Deep Pyramid Convolutional Neural Networks for Text Categorization [1] appeared. The deep pyramid convolutional network (DPCNN) it proposes is, strictly speaking, the first word-level deep convolutional text classification network that is effective across a wide range of tasks. Its performance is shown in the table below:

 

[Table: performance comparison from the paper]

The paper also uses two-view embeddings to further boost performance, but even judging the model on its own, DPCNN clearly outperforms the classic TextCNN (the ShallowCNN row of the table), improving by nearly 2 percentage points on the five-class Yelp sentiment classification task. It is also the first work to show that, for word-level text classification, deep CNNs still have real room for imagination.

Now let's put up the comparison figure of DPCNN, ShallowCNN, and ResNet, and get into the main text ~

 

Figure 2: DPCNN, TextCNN, and ResNet

Region embedding

As the comparison of subfigures (a) and (c) shows, DPCNN and ResNet differ quite a lot. At the bottom, DPCNN seemingly keeps a TextCNN-like structure: the layer that in TextCNN consists of convolutions with several filter sizes is here called the region embedding, meaning the embedding of a text region/fragment (such as a 3-gram) produced by a set of convolution operations.

There are two options when convolving a 3-gram. One preserves word order: use a set of two-dimensional convolution kernels of size 3×D to convolve the 3-gram (where D is the word embedding dimension). The other does not preserve word order (i.e., the bag-of-words view): first average the embeddings of the three words in the 3-gram to obtain a single vector of size D, then convolve it with a set of one-dimensional kernels of size D. TextCNN obviously uses the order-preserving approach, while DPCNN uses the bag-of-words approach. The DPCNN authors argue that the former is more prone to overfitting, while the latter performs almost as well (this is essentially the same conclusion argued by DAN, the Deep Averaging Network; interested readers can follow the Zhihu portal in the next section to learn more).
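To make the two options concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code; the class name, the filter count of 250, and the use of average pooling to implement the bag-of-words view are all assumptions):

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Sketch of the two 3-gram region embedding options (illustrative, not the paper's code)."""
    def __init__(self, embed_dim=300, num_filters=250, mode='bow'):
        super().__init__()
        self.mode = mode
        if mode == 'order':
            # order-preserving: a 2D kernel of size 3 x D is equivalent to a Conv1d
            # with kernel_size=3 over the D embedding channels
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)
        else:
            # bag-of-words: average the 3 word embeddings, then apply a size-D (i.e. pointwise) kernel
            self.pool = nn.AvgPool1d(kernel_size=3, stride=1, padding=1)
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=1)

    def forward(self, emb):                 # emb: (batch, embed_dim, seq_len)
        if self.mode == 'bow':
            emb = self.pool(emb)            # each position becomes the mean of its 3-gram
        return self.conv(emb)               # (batch, num_filters, seq_len)

x = torch.randn(2, 300, 37)                 # toy batch: 2 texts, 300-dim embeddings, 37 words
print(RegionEmbedding(mode='bow')(x).shape) # torch.Size([2, 250, 37])
```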

Convolutional layers or fully connected layers?

After generating the region embedding, the classic TextCNN recipe would next select the most representative feature from each feature map, i.e., directly apply global max pooling over time (a max-over-time pooling layer) to produce the feature vector of the text. (For example, with three filter sizes 3, 4, and 5, each with 100 kernels, there are 3×100 feature maps; applying max-over-time pooling to each feature map yields a text feature vector of dimension 3×100 = 300.)
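As a reference point, here is a rough TextCNN sketch with the filter sizes 3/4/5 and 100 kernels each mentioned above (hyperparameter values and names are illustrative, not Kim's original code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Rough TextCNN sketch: narrow convolutions + max-over-time pooling."""
    def __init__(self, vocab_size=10000, embed_dim=300, num_classes=5,
                 filter_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in filter_sizes
        )
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)       # (batch, embed_dim, seq_len)
        # one max-over-time value per feature map -> 3 * 100 = 300-dim text vector
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))     # (batch, num_classes)
```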

But doing this, TextCNN clearly has a serious problem: it is essentially no different from the classic text classification pipeline of bag-of-words (including n-gram) features + weighting + NB/MaxEnt/SVM, except that by moving from one-hot representations to word embeddings it avoids the data sparsity that bag-of-words models suffer from. One can say that what word vectors bring to TextCNN is essentially the bonus of "synonyms have similar vector representations"; TextCNN merely makes better use of the knowledge (synonymy) carried by word vectors. This means that long-distance patterns (such as a 12-gram) that are hard to learn in the classic models remain hard to learn in TextCNN. So how do we let the model learn these complex long-distance patterns?

Clearly, we either deepen the fully connected layers or deepen the convolutional layers. Which one is better to deepen? I planted a hint about this earlier; the answer is in my Zhihu answer:

Convolutional layers vs. classification layers: which is more important?

Equal-length convolution

After obtaining the region embedding, to keep what follows from being too abstract to picture, we can simply treat the region embeddings as word embeddings, pretending the network is handed a word embedding sequence again.

First, a basic convolution concept: equal-length convolution. The most commonly used in text classification is probably narrow convolution: with an input sequence of length seq_len and a kernel of size n, the output of a narrow convolution has length seq_len - n + 1. Equal-length convolution, as the name suggests, produces an output sequence whose length equals the input length seq_len. If you can't picture it, Google it yourself; I won't go into detail here.
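A tiny sketch of the difference (my own illustration): for an odd kernel size n, padding of (n - 1)/2 on each side gives the equal-length output.

```python
import torch
import torch.nn as nn

seq_len, n, dim = 10, 3, 8
x = torch.randn(1, dim, seq_len)                     # (batch, channels, seq_len)

narrow = nn.Conv1d(dim, dim, kernel_size=n)          # no padding
same   = nn.Conv1d(dim, dim, kernel_size=n, padding=(n - 1) // 2)

print(narrow(x).shape[-1])   # seq_len - n + 1 = 8   (narrow convolution)
print(same(x).shape[-1])     # seq_len = 10          (equal-length convolution)
```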

So what is the significance of applying an equal-length convolution to a text, i.e., to a word embedding sequence?

Since the input and output have the same number of positions, call the input/output at position n the n-th word slot. The significance of an equal-length convolution with kernel size n is then quite clear: it compresses each word in the input, together with the (n-1)/2 words of context on its left and right, into the embedding at that word slot. In other words, each word slot is refined by its context into a higher-level, more precise semantic representation.

Now back to DPCNN. To overcome TextCNN's weakness at capturing long-distance patterns, we obviously need a deep CNN. So can we just stack equal-length convolution on top of equal-length convolution?

Obviously this would let each word slot absorb more and more, longer and longer context, but it is far too inefficient: the network would have to become very, very deep, which would be a clumsy way to operate. Still, since stacking equal-length convolutions makes each word slot's embedding a richer and more accurate semantic description, we can reasonably stack two of them to enrich the representation of each word slot.

So, on top of the region embedding layer (treated here as a word embedding layer; the corresponding example sentence is roughly "the matcha-flavored treats sister Xiaojuan brought are delicious"), we can design the structure shown in Figure 3:

 

Figure 3
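Under my reading of Figure 3, this amounts to placing two equal-length convolutions on top of the region embedding, leaving the sequence length unchanged (the ReLU pre-activation ordering and the 250-channel width are assumptions on my part):

```python
import torch
import torch.nn as nn

def two_equal_length_convs(channels=250, kernel_size=3):
    """Two stacked equal-length convolutions over the region embedding (a sketch)."""
    pad = (kernel_size - 1) // 2
    return nn.Sequential(
        nn.ReLU(),
        nn.Conv1d(channels, channels, kernel_size, padding=pad),
        nn.ReLU(),
        nn.Conv1d(channels, channels, kernel_size, padding=pad),
    )

x = torch.randn(2, 250, 37)                  # region embeddings: (batch, channels, seq_len)
print(two_equal_length_convs()(x).shape)     # torch.Size([2, 250, 37]) -- length preserved
```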

Fixing the number of feature maps

Once each word slot has a good semantic representation, many adjacent words or adjacent n-grams can actually be merged semantically. For example, in "what sister Xiaojuan made is not bad at all", the words "not" and "bad" are individually quite far from "good" in meaning, but the adjacent phrase "not bad at all" is semantically almost equivalent to "good", so "not" and "bad" can be merged. Moreover, this merging can be carried out entirely within the original embedding space; after all, directly rewriting "not bad at all" as "good" in the original text is perfectly plausible, so there is no need to move to a whole new semantic space.

In fact, compared with images, where low-level features such as points, lines, and arcs must be lifted to clearly higher-level features such as eyes, noses, and mouths, the feature hierarchy in text is noticeably flatter: going from words (1-grams) up to phrases, then 3-grams and 4-grams, to a large extent satisfies this "semantic substitution" property. Images rarely show such "semantic substitution" (can the semantics of a "nose" be substituted by the semantics of an "arc"?).

Therefore (key point!), one big difference between DPCNN and ResNet is that DPCNN fixes the number of feature maps, i.e., fixes the dimensionality of the embedding space (call it the semantic space below for ease of understanding). This makes it possible for the network to perform the merging of adjacent words (adjacent n-grams) within the original space or a space close to it (of course the network will not necessarily do so in practice, but the condition is provided). That is, although the whole network looks deep in shape, it can be flat when viewed from the semantic space. ResNet, by contrast, keeps changing the semantic space: as the network deepens, the image semantics keep jumping into higher-level semantic spaces.

1/2 pooling layer

Now that such good conditions for merging are in place, we can perform the merging with a pooling layer. After each pooling layer with size = 3 and stride = 2 (hereafter, the 1/2 pooling layer), the sequence length is compressed to half of what it was (picture it in your head). Correspondingly, with convolution kernels of size = 3, after each pass through a 1/2 pooling layer the convolution can perceive text segments twice as long as before.
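A quick numeric check of the size = 3, stride = 2 pooling (the exact boundary handling here is my assumption; the paper's padding convention may differ slightly):

```python
import torch
import torch.nn as nn

halve = nn.MaxPool1d(kernel_size=3, stride=2)      # the "1/2 pooling layer"
x = torch.randn(1, 250, 64)                        # (batch, channels, seq_len)
for _ in range(3):
    x = halve(x)
    print(x.shape[-1])                             # 31, 15, 7 -- roughly halved each time
```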

(Thanks to @Chen Cheng in the comments for pointing out an error in the pooling size.)

For example, where a word slot previously perceived information from a span of 3 words, after a 1/2 pooling layer it can perceive a span of 6 words. The combination of a 1/2 pooling layer and a size = 3 convolution layer is shown in Figure 4.

 

Figure 4

Now the problem seems solved and the goal achieved. All that remains is to repeat [equal-length convolution + equal-length convolution + 1/2 pooling], i.e., to repeat the block shown in Figure 5:

 

Figure 5

Residual connection

But! If the problem were really that simple, deep learning would suddenly have far fewer difficulties.

First, when we initialize a deep CNN, the weights of each layer are usually initialized to small values. As a result, at the beginning of training the input to almost every subsequent layer is close to zero, so the network's output is naturally meaningless; these small weights also impede gradient propagation, so in the early phase of training the network often needs a long time of iteration before it really gets going.

Moreover, even once the network gets going, since a deep network is a chained product of affine matrices (the connections between every pair of adjacent layers), the training process is very prone to exploding or vanishing gradients (although, thanks to weights not being shared across depth, deep CNNs fare a bit better here than RNNs).

Both of the problems above are essentially the vanishing gradient problem. So how do we solve vanishing gradients in a deep CNN? Of course, by borrowing from Kaiming He's ResNet ~

The shortcut connection / skip connection / residual connection proposed by ResNet is a very simple, sensible, and effective solution. Look at Figure 5 again and think about it: since the input of each block tends not to be usefully activated in the initial stage, why not connect the region embedding directly, with identity lines, to the input of every block and to the final pooling/output layer?

Imagine it: because the shortcut connections feed the input of each block (pooled in advance the corresponding number of times by 1/2 pooling to match dimensions, of course), when the blocks contribute almost nothing this is equivalent to a short circuit, i.e., the region embedding is connected almost directly to the final pooling layer or output layer. In other words, DPCNN degenerates into (roughly) TextCNN. A deep network is hard to train, but TextCNN is exceptionally easy to train. So the model spends its infancy as a TextCNN and naturally avoids the cold-start problem of deep CNNs described earlier.

By the same token, with the shortcut, gradients can ignore the weight-weakening effect of the convolutional layers and flow losslessly along the shortcut into each block, all the way to the front of the network, which greatly alleviates vanishing gradients.
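Under my reading of the architecture figure, one block with its shortcut might look like this (a sketch, not the authors' implementation; taking the shortcut right after the pooling step is my way of keeping the dimensions matched, and the pre-activation ordering is also an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCNNBlock(nn.Module):
    """One DPCNN block sketch: 1/2 pooling, two equal-length convs, residual shortcut."""
    def __init__(self, channels=250, kernel_size=3):
        super().__init__()
        pad = (kernel_size - 1) // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                                # x: (batch, channels, seq_len)
        x = F.max_pool1d(x, kernel_size=3, stride=2)     # 1/2 pooling
        shortcut = x                                     # shortcut taken after pooling,
                                                         # so dimensions already match
        x = self.conv1(F.relu(x))                        # equal-length conv, pre-activation
        x = self.conv2(F.relu(x))
        return x + shortcut                              # residual connection
```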

So once shortcut connections are added to DPCNN's blocks, everything falls into place. The final form of the network is designed as follows:

 

[Figure: the final DPCNN architecture]

Finally, coming back to the name: because of the 1/2 pooling layers in front, the sequence length is halved at every block, so the number of blocks grows only logarithmically with the text length, i.e.,

num\_blocks = \log_2 seq\_len

(for example, a sequence of 256 word slots supports about \log_2 256 = 8 blocks). As the network deepens, the shrinking sequence length therefore takes on a pyramid (Pyramid) shape:

 

[Figure: the sequence length shrinking into a pyramid shape as the network deepens]

That is why the authors call this customized, simplified version of a deep ResNet the Deep "Pyramid" CNN.


Source: blog.csdn.net/xixiaoyaoww/article/details/104553478