Language Modeling with Gated Convolutional Networks (sentence modeling with the Gated CNN): Model Introduction

The lab project has kept me busy lately and I have had no time to run simulations, so instead I am summarizing a paper I read a while ago: the Gated CNN paper. It is the first article to introduce a gating mechanism into CNNs for language modeling, which I find quite innovative, and the results are strong. The main contributions of the paper are:
  1. A new gating mechanism
  2. Eases gradient propagation, reducing the vanishing gradient problem
  3. Compared with an LSTM, the model is simpler and converges faster
The model architecture is shown below:
[Figure: Gated CNN model architecture]
First, by stacking CNN layers we can model long text and extract higher-level, more abstract features. Compared with an LSTM, fewer sequential operations are needed: a CNN requires O(N/k) of them, whereas an LSTM treating the text as a sequence requires O(N), where N is the text length and k is the convolution kernel width. This also means fewer nonlinearities are applied along the path, which effectively reduces vanishing gradients and makes the model easier to train and faster to converge. Furthermore, in an LSTM the output at each time step depends on the hidden state of the previous step, so computation over time cannot be parallelized; a CNN has no such dependency and parallelizes easily, improving computation speed. Finally, the linear gating unit proposed in the paper not only effectively reduces vanishing gradients but also retains nonlinear modeling capacity. Next, let's look at the model in detail.
As the figure shows, the overall structure does not differ much from an ordinary CNN; the change is that a gating mechanism is added to the convolutional layers. The output of each convolutional layer becomes the formula below: the convolution output with no nonlinearity applied, multiplied elementwise by a second convolution output passed through a sigmoid activation:
h_l(X) = (X * W + b) ⊗ σ(X * V + c)
Here W and V are different convolution kernels, both of width k and with n output channels, and b and c are bias parameters. The paper uses wide convolutions, though I did not fully understand the reason it gives for that choice. The second half of the formula, the sigmoid-activated convolution, is the so-called gate: it controls which of the information in X * W + b is passed on to the next layer. This unit is named the Gated Linear Unit (GLU) in the paper. These layers can then be stacked to capture long-term dependencies.
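To make the layer concrete, here is a minimal PyTorch-style sketch of one such gated convolutional layer (the class name, the left-padding trick to keep the convolution causal, and the shapes are my own illustration rather than code from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv1d(nn.Module):
    """One gated convolutional layer: (X*W + b) ⊗ sigmoid(X*V + c)."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Two separate convolutions play the roles of (W, b) and (V, c).
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)
        self.gate = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        # x: (batch, in_channels, seq_len). Pad k-1 zeros on the left so each
        # output position only sees current and past tokens (no future leakage).
        k = self.conv.kernel_size[0]
        x = F.pad(x, (k - 1, 0))
        # Elementwise product of the linear path and the sigmoid gate (GLU).
        return self.conv(x) * torch.sigmoid(self.gate(x))
```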
The paper also examines the effect of different gating units. It first argues that a CNN does not need a gating mechanism as complex as the LSTM's: there is no need for a forget gate, a single input gate is enough. It also considers another gating unit, the GTU, defined as follows:
h_l(X) = tanh(X * W + b) ⊗ σ(X * V + c)
Analyzing the two gating units from the perspective of their gradients shows that the GTU's gradient decays faster, because its gradient formula contains two damping terms, while the GLU's contains only one, so the GLU mitigates vanishing gradients much better:
∇[tanh(X) ⊗ σ(X)] = tanh'(X) ∇X ⊗ σ(X) + σ'(X) ∇X ⊗ tanh(X)
∇[X ⊗ σ(X)] = ∇X ⊗ σ(X) + σ'(X) ∇X ⊗ X
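As a quick, informal check of this argument (this is not an experiment from the paper, just a sanity check on random inputs), one can compare the average gradient magnitude each gate lets through:

```python
import torch

# Random inputs; we look at d(output)/d(input) for each gating unit.
x = torch.randn(10000, requires_grad=True)

gtu = torch.tanh(x) * torch.sigmoid(x)   # GTU: tanh(X) ⊗ σ(X)
glu = x * torch.sigmoid(x)               # GLU: X ⊗ σ(X)

grad_gtu, = torch.autograd.grad(gtu.sum(), x)
grad_glu, = torch.autograd.grad(glu.sum(), x)

# GTU's gradient is damped by both tanh'(x) and σ'(x); the GLU keeps the
# undamped σ(x)·∇x path, so its average gradient magnitude comes out larger.
print("mean |grad| GTU:", grad_gtu.abs().mean().item())
print("mean |grad| GLU:", grad_glu.abs().mean().item())
```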

Experimental results

The experiments use two datasets, GBW (Google Billion Words) and WikiText-103; only a few of the result charts are shown here:
[Figures: experimental results on GBW and WikiText-103]
One detail: for the dataset with longer texts, the paper uses a deeper network structure in order to obtain longer-term memory.
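As a rough sketch of why depth buys longer memory (reusing the GatedConv1d layer sketched above; the depth and widths here are arbitrary, not the paper's configuration), stacking num_layers layers of kernel width k gives a receptive field of roughly num_layers*(k-1)+1 tokens:

```python
import torch
import torch.nn as nn

def build_gcnn_stack(emb_dim=128, hidden=128, kernel_size=4, num_layers=10):
    """Toy stack of gated convolutional layers (sizes are illustrative only)."""
    layers = [GatedConv1d(emb_dim if i == 0 else hidden, hidden, kernel_size)
              for i in range(num_layers)]
    # Each layer widens the receptive field by kernel_size - 1 positions,
    # so deeper stacks see longer contexts: here about 10*(4-1)+1 = 31 tokens.
    return nn.Sequential(*layers)

# Example: token embeddings of shape (batch, emb_dim, seq_len) go in,
# contextual features of shape (batch, hidden, seq_len) come out.
stack = build_gcnn_stack()
features = stack(torch.randn(2, 128, 50))
```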

 
