Building an adder with an RNN: Recurrent Neural Networks from scratch

In this post, we train a simple RNN (Recurrent Neural Network) from scratch to perform binary addition.

1. Preparing the data set

We prepared a data set for binary addition. The data range is 0-255, i.e. at most 2^8 - 1 = 255. The adder works on non-negative integers: for example, c = a + b = 1 + 2 = 3 corresponds to the binary addition a = [0 0 0 0 0 0 0 1]; b = [0 0 0 0 0 0 1 0]; c = [0 0 0 0 0 0 1 1]. In this experiment, the value of the sum at position t depends not only on the values of a and b at position t, but also, through the carry, on earlier positions such as t-1, t-2, and so on. From a theoretical point of view, a good recurrent neural network should therefore be able to handle this scenario. Our goal is to feed in a and b and obtain c (all in binary).
Code: run cal_generate_data to generate the data. The generated data structure is serialized with pickle into a binary file, so it can be loaded back later when needed (similar to a .mat file in MATLAB).
One-hot vector representation. Note that the input to our neural network at a single time step, e.g. [0, 0], is a 2 × 1 vector, because the dictionary contains only two distinct characters (0 and 1), so the dictionary size is 2.
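
To make the set-up concrete, here is a minimal sketch of the data generation step, assuming the layout described above (two 8-bit operands stacked into a 2 × 8 array and an 8-bit sum); the actual cal_generate_data script may differ in details such as file name and sample count:

    import pickle
    import numpy as np

    def to_bits(n, width=8):
        # e.g. 3 -> [0 0 0 0 0 0 1 1], most-significant bit first
        return np.array([int(ch) for ch in np.binary_repr(n, width=width)])

    def generate_data(num_samples=20000, path="binary_addition.pkl"):
        samples = []
        for _ in range(num_samples):
            a = np.random.randint(0, 128)            # keep a + b <= 2^8 - 1 = 255
            b = np.random.randint(0, 128)
            x = np.vstack([to_bits(a), to_bits(b)])  # input, shape 2 x 8
            y = to_bits(a + b)                       # label, shape 8
            samples.append((x, y))
        with open(path, "wb") as f:                  # serialize with pickle, like a .mat file
            pickle.dump(samples, f)
        return samples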

2. Initializing the model parameters

Number of hidden units. The number of hidden units is a hyperparameter; in machine learning, hyperparameters are set in advance rather than learned from the data. Here we set this hyperparameter to 16.
Input. The input dimension is determined by the dictionary size; here the dimension is 2, so the input is 2 × 1.
Output. Like the input, the output dimension is determined by the dictionary size, so the output is 2 × 1.
Given the input, the output, and the number of hidden units, the network structure is as follows (a small initialization sketch follows the list):

  1. Hidden state s_t: 16 × 1
  2. Input weight matrix U: 16 × 2
  3. History (recurrent) weight matrix W: 16 × 16
  4. Output weight matrix V: 2 × 16
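A minimal sketch of the parameter shapes listed above (the uniform initialization range is an assumption; the referenced implementation may initialize differently):

    import numpy as np

    word_dim = 2     # dictionary size: the characters 0 and 1
    hidden_dim = 16  # number of hidden units (hyperparameter)

    U = np.random.uniform(-0.1, 0.1, (hidden_dim, word_dim))    # 16 x 2, input weights
    W = np.random.uniform(-0.1, 0.1, (hidden_dim, hidden_dim))  # 16 x 16, history (recurrent) weights
    V = np.random.uniform(-0.1, 0.1, (word_dim, hidden_dim))    # 2 x 16, output weights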

3. Defining the model

Hidden memory state. To make it easy to store information across multiple time steps, the model saves U, V, and W in a Model class. The state history computed during the forward pass, i.e. the information of each time step, is stored in Layer objects held by that class; in this example there are 8 time steps, so 8 layers of information are saved.
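
The forward pass that fills in those 8 layers might look like the following sketch; the attribute name mulv comes from the text below, while the class layout, tanh activation, and the other names are assumptions:

    import numpy as np

    class Layer:
        # stores the intermediate results of one time step
        def __init__(self, mulu, mulw, s, mulv):
            self.mulu = mulu   # U . x_t
            self.mulw = mulw   # W . s_{t-1}
            self.s = s         # hidden state, 16 x 1
            self.mulv = mulv   # V . s_t, 2 x 1 (pre-softmax output)

    def forward(x, U, V, W):
        T = x.shape[1]                      # 8 time steps
        layers = []
        s_prev = np.zeros((W.shape[0], 1))  # initial hidden state
        for t in range(T):
            x_t = x[:, t:t + 1]             # 2 x 1 input at time t
            mulu = U @ x_t
            mulw = W @ s_prev
            s = np.tanh(mulu + mulw)        # new 16 x 1 hidden state
            mulv = V @ s
            layers.append(Layer(mulu, mulw, s, mulv))
            s_prev = s
        return layers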

4. Defining the prediction function

The predict function carries out the prediction: the input x is 2 × 8 and yields an 8 × 1 label sequence. Forward propagation is performed at each time step, and the result is then classified as 0 or 1 via argmax(output.predict(layer.mulv)). The dimensions change as 2 × 1 -> 2 × 1 -> 1. Here, output.predict uses the exponential function exp to map values into positive space and then normalizes them (i.e. a softmax).
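
A minimal sketch of this prediction step, reusing the forward() sketch from the model-definition section (the softmax here stands in for output.predict):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))  # exp maps values into positive space
        return e / e.sum()         # then normalize to probabilities

    def predict(x, U, V, W):
        layers = forward(x, U, V, W)
        # for each of the 8 time steps: 2 x 1 scores -> 2 x 1 probabilities -> one label (0 or 1)
        return np.array([np.argmax(softmax(layer.mulv)) for layer in layers])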

5. Perplexity

6. Defining the model training function

1. Loss function

First use the predict function to obtain the probabilities of 0 and 1 respectively, e.g. probs = [0.4, 0.6].
Assuming the label is y = 1, the loss is -np.log(0.6); the larger this probability (0.6) is, the better.
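
As a small sketch of this cross-entropy loss for a single time step (probs and y are the illustrative values above):

    import numpy as np

    probs = np.array([0.4, 0.6])   # predicted probabilities for 0 and 1
    y = 1                          # true label
    loss = -np.log(probs[y])       # = -np.log(0.6); the closer probs[y] is to 1, the smaller the loss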

2. Gradient

  • The diff function. As in the loss function, we first obtain probs, e.g. [0.4, 0.6]. For y = 1 we compute probs[1] - 1: the value at the second position should ideally equal 1, so the error vector becomes [0.4, -0.4]. Training minimizes the magnitude of this difference, with the optimum at 0.
  • Backpropagation through time (BPTT). The model has only three parameter matrices, U, V, and W, so we only need the partial derivatives dU, dV, and dW. Given the input x and label y, forward propagation computes the intermediate variables of each layer. For time step 1, we compute the errors for U, V, and W; for time step 2, we compute the errors for U, V, and W and add them to those of time step 1; and so on (a sketch follows this list).
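
A minimal sketch of this backpropagation through time, assuming the forward() and softmax() sketches above and s_t = tanh(U x_t + W s_{t-1}); the referenced implementation may organize the loop differently:

    import numpy as np

    def bptt(x, y, U, V, W):
        layers = forward(x, U, V, W)
        T = len(layers)
        dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
        for t in range(T):
            # output error: probs - one_hot(y_t), e.g. [0.4, -0.4] in the example above
            delta_o = softmax(layers[t].mulv)
            delta_o[y[t]] -= 1.0
            dV += delta_o @ layers[t].s.T
            # error at the hidden pre-activation of time step t (tanh' = 1 - s^2)
            delta_s = (V.T @ delta_o) * (1.0 - layers[t].s ** 2)
            # walk back through earlier time steps, accumulating into dU and dW
            for k in range(t, -1, -1):
                x_k = x[:, k:k + 1]
                s_prev = layers[k - 1].s if k > 0 else np.zeros_like(layers[0].s)
                dU += delta_s @ x_k.T
                dW += delta_s @ s_prev.T
                delta_s = (W.T @ delta_s) * (1.0 - s_prev ** 2)
        return dU, dV, dW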

3. Training

Generate a test set by running cal_generate_data, then train, then test.
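
A minimal sketch of such a training loop with plain SGD, assuming the sketches above (learning rate, epoch count, and the per-epoch loss report are assumptions):

    import numpy as np

    def train(data, U, V, W, learning_rate=0.05, epochs=10):
        num_examples_seen = 0
        for epoch in range(epochs):
            for x, y in data:
                dU, dV, dW = bptt(x, y, U, V, W)   # gradients for this example
                U -= learning_rate * dU
                V -= learning_rate * dV
                W -= learning_rate * dW
                num_examples_seen += 1
            # average cross-entropy over the data set, reported once per epoch
            loss = np.mean([-np.log(softmax(l.mulv)[y_t]).item()
                            for x, y in data
                            for l, y_t in zip(forward(x, U, V, W), y)])
            print(f"Loss after num_examples_seen={num_examples_seen} epoch={epoch}: {loss:.6f}")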

7. Interesting experiments

Experiment 1: the influence of addition order on the model

  • Using reversed-order addition (feeding the low-order bits first). This is the natural order in which people do addition: the low-order bits affect the high-order bits through the carry. The training results are as follows:
2019-07-16 17:21:53: Loss after num_examples_seen=0 epoch=0: 0.711610
2019-07-16 17:22:22: Loss after num_examples_seen=20000 epoch=1: 0.056623
2019-07-16 17:22:51: Loss after num_examples_seen=40000 epoch=2: 0.034568
2019-07-16 17:23:20: Loss after num_examples_seen=60000 epoch=3: 0.031605
2019-07-16 17:23:49: Loss after num_examples_seen=80000 epoch=4: 0.030482
2019-07-16 17:24:17: Loss after num_examples_seen=100000 epoch=5: 0.029896
2019-07-16 17:25:04: Loss after num_examples_seen=120000 epoch=6: 0.029538
2019-07-16 17:25:51: Loss after num_examples_seen=140000 epoch=7: 0.029297
2019-07-16 17:26:37: Loss after num_examples_seen=160000 epoch=8: 0.029123
2019-07-16 17:27:25: Loss after num_examples_seen=180000 epoch=9: 0.028993

100% accuracy

  • Using left-to-right order, the natural order of the array (both orderings are sketched after the conclusions below).
    Intuitively, this should not work well, because addition proceeds from right to left, so the historical information (the carry) lives to the right; reading left to right, the network cannot actually access that history.
    The training results are as follows:
2019-07-16 19:59:14: Loss after num_examples_seen=0 epoch=0: 0.709881
2019-07-16 20:00:30: Loss after num_examples_seen=20000 epoch=1: 0.658799
2019-07-16 20:01:50: Loss after num_examples_seen=40000 epoch=2: 0.622423
2019-07-16 20:03:09: Loss after num_examples_seen=60000 epoch=3: 0.617654
2019-07-16 20:03:39: Loss after num_examples_seen=80000 epoch=4: 0.614902
2019-07-16 20:04:08: Loss after num_examples_seen=100000 epoch=5: 0.612439
2019-07-16 20:04:38: Loss after num_examples_seen=120000 epoch=6: 0.610115
2019-07-16 20:05:08: Loss after num_examples_seen=140000 epoch=7: 0.608214
2019-07-16 20:05:37: Loss after num_examples_seen=160000 epoch=8: 0.606709
2019-07-16 20:06:07: Loss after num_examples_seen=180000 epoch=9: 0.605867

84% accuracy
This experiment shows that the RNN does indeed make use of historical information, which makes the model more accurate: through training, it learns the rules of binary addition. This also lends support to the following conclusions:

  1. For tasks with structured rules and large amounts of data, deep learning methods outperform conventional methods.
  2. With a large enough hidden layer, a neural network can approximate any non-linear function.
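
For reference, a tiny sketch of the two input orderings compared in this experiment, assuming the 2 × 8 layout from the data section with the most-significant bit first:

    import numpy as np

    x = np.array([[0, 0, 0, 0, 0, 0, 0, 1],   # a = 1
                  [0, 0, 0, 0, 0, 0, 1, 0]])  # b = 2
    y = np.array([0, 0, 0, 0, 0, 0, 1, 1])    # c = 3
    x_reversed, y_reversed = x[:, ::-1], y[::-1]  # reversed order: low-order bits fed first
    x_natural, y_natural = x, y                   # natural array order: high-order bits fed first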

Code:
https://github.com/junwangcas/blogs/tree/master/01RNN/01_additional_example

References

1. Using an attention model in Keras and visualizing the trained model: https://medium.com/datalogue/attention-in-keras-1892773a4f22
2. The official Keras addition example: https://keras.io/examples/addition_rnn/
3. https://github.com/pangolulu/rnn-from-scratch



Original article: https://blog.csdn.net/wang_jun_whu/article/details/94705354