Deep Learning Practice - Recurrent Neural Network Practice

Series of experiments
Deep learning practice - Convolutional neural network practice: Crack identification
Deep learning practice - Recurrent neural network practice
Deep learning practice - Model deployment optimization practice
Deep learning practice - Model inference optimization practice

The code can be found at: https://download.csdn.net/download/weixin_51735061/88131380?spm=1001.2014.3001.5503

0 Overview

**Method:** The experiments are carried out in Python with the PyTorch and d2l environments, and the code is written in Jupyter Notebook. The RNN, GRU, and LSTM architectures are implemented based on the tutorial code provided by d2l, and the "time machine" dataset shipped with d2l is used as the training data. For each basic architecture, I adjusted the number of epochs, the learning rate, and the number of hidden units to look for better results. Besides implementing the basic recurrent neural network architectures, I also studied seq2seq, reproduced the seq2seq pipeline from training to inference following the d2l tutorial, and adjusted its parameters to observe the changes.
**Steps:**

  1. Build the RNN architecture and adjust its parameters for better results
  2. Build the GRU architecture and adjust its parameters for better results
  3. Build the LSTM architecture and adjust its parameters for better results
  4. Implement seq2seq training and inference

1 Architecture Implementation

1.1 RNN architecture

1.1.1 RNN architecture construction

The RNN architecture is trained on the "time machine" data from d2l, following the textbook. This dataset is a short story about a time machine and is suitable for mini-batch training. The RNN is implemented with PyTorch and the d2l library. First, the d2l data-loading module is given a batch size and a step size to load the "time machine" data. With the data in hand, nn.RNN() from PyTorch provides the recurrent layer. An RNNModel class inheriting from nn.Module is then built around that layer to define the forward pass and state handling, and finally an RNNModel object is constructed with the RNN layer and trained on the data using d2l's training function. Because the code is long it is not listed in the report; the detailed code can be found in the corresponding .ipynb file. The figure below shows the code flow chart for building the RNN architecture.
[Figure: code flow chart for building the RNN architecture]
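As a reference, the sketch below shows a minimal version of that flow using the helpers from the d2l.torch package described in the book (load_data_time_machine, RNNModel, train_ch8, predict_ch8); the notebook builds its own RNNModel class following the tutorial, so treat this compressed version as an illustration rather than the exact code used here.

```python
import torch
from torch import nn
from d2l import torch as d2l

# Load the "time machine" corpus as mini-batches of character indices.
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

# A single recurrent layer from PyTorch; d2l.RNNModel wraps it with an output
# layer and hidden-state handling, as the RNNModel class in the notebook does.
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)
net = d2l.RNNModel(rnn_layer, vocab_size=len(vocab))

# Train with d2l's chapter-8 training loop using the parameters from this section.
device = d2l.try_gpu()
net = net.to(device)
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)

# Continue a prefix to inspect the trained character-level language model.
print(d2l.predict_ch8('time traveller', 50, net, vocab, device))
```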
The initial training uses 500 epochs, a learning rate of 1, and 256 hidden units. The training result is shown below.
[Figure: RNN training curve and sample output (500 epochs, learning rate 1, 256 hidden units)]
From the results above, with 500 epochs and a learning rate of 1 the perplexity curve converges at around epoch 300 and the final perplexity is about 1.3. The generated text has essentially no coherent semantics, but nearly half of the output words are spelled correctly. This shows that training has some effect, though not a very good one. The hyperparameters are adjusted below to try to achieve better results.

1.1.2 RNN hyperparameter adjustment

Take the training parameters in 1.1.1 as the baseline, that is, 500 epochs, a learning rate of 1, and 256 hidden units, and adjust each parameter up and down for comparison.
1 Number of epochs
The number of epochs is chosen for comparison because it strongly affects convergence and the final perplexity. In general, more epochs mean more gradient updates and more thorough training, which tends to improve the result, but training for too many epochs also increases the risk of overfitting, while too few epochs leave the model undertrained and the result may be very poor. Epoch counts of 250, 750, and 1000 are tested below for comparison. (The detailed code can be found in the attached file; only the results are shown here.)
[Figures: perplexity curves for the different epoch counts]

Predictions for "time traveller":
(1) Epoch250: time traveler held in his hant wald at ifgristtand why had wan
(2) Epoch500: time traveler proceeded any real body must have extension infot
(3) Epoch750: time traveler came back and filby seane why the lyon at ingte
(4) Epoch1000: time traveler held in whack and hareare redohat de sam e sugod
From the results, when the number of epochs is small its impact on training is significant and the result is noticeably worse, but once the epoch count reaches a certain level the results stay within a certain range and further increases have little effect.
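For illustration, such a comparison could be scripted as in the hypothetical sketch below, assuming the train_iter, vocab, num_hiddens, lr, and device variables from the sketch in 1.1.1 are still in scope; the same loop, with the epoch count fixed and the learning rate or hidden size varied instead, covers the comparisons in the next two subsections.

```python
# Hypothetical sweep over the number of epochs; each run starts from a
# freshly initialized model so the comparisons are independent.
for num_epochs in (250, 500, 750, 1000):
    rnn_layer = nn.RNN(len(vocab), num_hiddens)
    net = d2l.RNNModel(rnn_layer, vocab_size=len(vocab)).to(device)
    d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
    print(num_epochs, d2l.predict_ch8('time traveller', 50, net, vocab, device))
```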

2 Learning rate
The learning rate also has a clear influence on the training results. If the learning rate is too high, the perplexity swings wildly from epoch to epoch and never settles on a good result; if it is too low, convergence is much slower for the same number of epochs. Learning rates of 0.01, 0.1, and 10 are tested below, with the running results shown. (The detailed code can be found in the attached file; only the results are shown here.)
[Figures: perplexity curves for the different learning rates]
Predictions for "time traveller":
(1) lr0.01: time traveller the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the t
(2) lr0.1 : time traveler thicke dimensions al merice time al sicherenre thi
(3) lr1: time traveler proceeded anyreal body must have extension ingfot
(4) lr10: time travellerohc ohc ohc ohc ohc ohc ohc ohc ohc o hc ohc ohc oh

The results show that a small learning rate slows convergence, and the endless repetition of "the" in the prediction indicates a very poor result. At a learning rate of 0.1 there are no repeated words, but almost every word is misspelled. When the learning rate is very large, the perplexity is also very large and the output degenerates into meaningless repeated tokens. Among the values tested, the best learning rate is 1, in the middle of the range: a learning rate that is too small hinders learning, while one that is too large makes the updates overshoot and the training gets stuck without finding anything better.

3 Number of neurons in the hidden layer
In general, more hidden units give a better fit and fewer hidden units give a worse one, so the number of hidden units is critical. Hidden sizes of 128, 512, and 1024 are compared below, with the running results shown. (The detailed code can be found in the attached file; only the results are shown here.)
[Figures: perplexity curves for the different hidden sizes]
Predictions for "time traveller":
(1) 128: time travellerit s againsttirad and the time travellerit s all ha
(2) 256: time traveller proceeded any real body must have extension ingfot
(3) 512: time traveler you can show black is white by argument said filby
(4) 1024: time traveler for so it will be convenient to speak of him was e
It can be seen from the results that as the number of hidden units increases, the final perplexity decreases. The convergence curve also changes shape, with a stretch of rapid decline. With 512 and 1024 hidden units the predictions show some semantics and the spelling is largely correct. My own interpretation is that the number of hidden units determines the fitting capacity: in general, more units mean more parameters and a better fit.

Based on the above adjustments, the best configurations are those with 512 and 1024 hidden units; the other settings perform worse than the baseline parameters.

1.2 GRU architecture

1.2.1 GRU architecture construction

The GRU architecture is also trained on the "time machine" data from d2l. A GRU adds gating units to the RNN, somewhat like a circuit that blocks part of the input while preserving the important content. The implementation mainly follows d2l and is almost identical to the RNN code; the only difference is the recurrent layer, which is again created through the PyTorch API, so nn.RNN() simply becomes nn.GRU(). The construction flow chart is shown below. (The detailed code can be found in the .ipynb file.)
[Figure: code flow chart for building the GRU architecture]
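A minimal sketch of that one-line change, under the same assumptions about the d2l.torch helpers as in the RNN section:

```python
import torch
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

num_hiddens, device = 256, d2l.try_gpu()
# The only difference from the RNN version: nn.GRU instead of nn.RNN.
gru_layer = nn.GRU(len(vocab), num_hiddens)
net = d2l.RNNModel(gru_layer, vocab_size=len(vocab)).to(device)
d2l.train_ch8(net, train_iter, vocab, lr=1, num_epochs=500, device=device)
```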
The initial training uses 500 epochs, a learning rate of 1, and 256 hidden units. The training result is shown below.
[Figure: GRU training curve and sample output]
From the results above, with 500 epochs and a learning rate of 1 the perplexity curve converges at around epoch 250 and the final perplexity is about 1. The output is mostly spelled correctly and has some semantics, which shows that the training works and that the effect is better than that of the RNN.

1.2.2 GRU hyperparameter adjustment

The GRU hyperparameters are adjusted in the same way as for the RNN, and the reasons for the parameter choices are the same. The adjustment results are given below.
1 Number of epochs
[Figures: perplexity curves for the different epoch counts]
Predictions for "time traveller":
(1) Epoch250: time traveleris cofr mensthe fourth dimension do net goout the l
(2) Epoch500: time traveler for so it will be convenient to speak of him was e
(3) Epoch750: time traveler with a slight accession of cheerfulness really thi
(4) Epoch1000: time traveler with a slight accession of cheerfulness really thi

The pattern is basically consistent with that of the RNN, but the starting point and the overall effect are better than the RNN's.

2 Learning rate
[Figures: perplexity curves for the different learning rates]
Predictions for "time traveller":
(1) lr0.01: time traveler teeteeteeteeteet
(2) lr0.1: time travelere the the the the the the the the the the the the
(3) lr1: travelleryou can show black is white by argument said filby
(4) lr10: time travelerohc ohc ohc ohc ohc ohc ohc ohc ohc ohc ohc ohc oh
These results are basically consistent with the RNN; the difference lies only in the magnitude of the perplexity.
3 Number of neurons in the hidden layer
[Figures: perplexity curves for the different hidden sizes]
Predictions for "time traveller":
(1) 128: time travellerit s against reason said filbywhat is there is the
(2) 256: travelleryou can show black is white by argument said filby
(3) 512: time traveller for so it will be convenient to speak of him was e
(4) 1024: time traveller with a slight accession of cheerfulness really thi

From the results, the predictions are basically the same, but the perplexity for 128 and 512 hidden units is lower than for the other two settings; the 128 case may be related to underfitting, and the 512 case to overfitting.
Based on the above tuning, the best configuration is the one with 512 hidden units.

1.3 LSTM architecture

1.3.1 LSTM architecture construction

The LSTM architecture is also trained on the "time machine" data from d2l. LSTM stands for long short-term memory network, which has a certain memory capability. Compared with the GRU, the LSTM is more complex and has more gates, so with the same parameters and data its training time may be longer, but the corresponding training effect may also be better. Next, the LSTM architecture is built quickly with the LSTM module used in d2l. (The detailed code can be found in the .ipynb file.)
[Figure: code flow chart for building the LSTM architecture]
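A minimal sketch under the same assumptions as the RNN and GRU versions, with nn.LSTM as the recurrent layer:

```python
import torch
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

num_hiddens, device = 256, d2l.try_gpu()
# nn.LSTM keeps both a hidden state and a cell state; d2l.RNNModel
# initializes and carries the extra state tensor internally.
lstm_layer = nn.LSTM(len(vocab), num_hiddens)
net = d2l.RNNModel(lstm_layer, vocab_size=len(vocab)).to(device)
d2l.train_ch8(net, train_iter, vocab, lr=1, num_epochs=500, device=device)
```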
The initial training uses 500 epochs, a learning rate of 1, and 256 hidden units. The training result is shown below.
[Figure: LSTM training curve and sample output]
From the results above, with 500 epochs and a learning rate of 1 the perplexity curve converges at around epoch 250 and the final perplexity is about 1. The output is mostly spelled correctly and has some semantics, which shows that the training works; the effect is better than that of the RNN and basically the same as that of the GRU.

1.3.2 LSTM hyperparameter adjustment

The LSTM hyperparameters are adjusted in the same way as for the RNN, and the reasons for the parameter choices are the same. The adjustment results are given below.
1 Number of epochs
[Figures: perplexity curves for the different epoch counts]
Predictions for "time traveller":
(1) Epoch250: time traveler soud in the bertal it it as ingous doo doust hick
(2) Epoch500: time traveler you can show black is white by argument said filby
(3) Epoch750: time traveler fich wi har hive tree yyinn waid the peos co vepr
(4) Epoch1000: time traveler you can show black is white by argument said filby
It can be seen that the perplexity basically decreases as the number of epochs grows, but at 750 epochs the perplexity is higher and the result is not very good either; this may be a matter of chance, and more experiments would be needed to confirm it.
2 Learning rate
[Figures: perplexity curves for the different learning rates]
Predictions for "time traveller":
(1) lr0.01: time traveler teeteeteeteeteet
(2) lr0.1: time travelere the the the the the the the the the the the the
(3) lr1: travelleryou can show black is white by argument said filby
(4) lr10: time traveller for so it will be convenient to speak of himwas e
It can be seen that the training effect improves as the learning rate increases. Unlike the GRU and RNN, for the LSTM architecture the larger the learning rate within the tested range, the better the effect, whereas the other two have a best value in the middle of the range. This may be related to the internal network structure of the architecture.

3 Number of neurons in the hidden layer
[Figures: perplexity curves for the different hidden sizes]
Predictions for "time traveller":
(1) 128: time travellerice withereal inhis fefclndiface traces along i ou
(2) 256: travelleryou can show black is white by argument said filby
(3) 512: time travelleryou can show black is white by argument said filby
(4) 1024: time traveller for so it will be convenient to speak of himwas e

1.4 Comparison of three architectures

From the parameter-tuning experiments above, the RNN is clearly worse than the GRU and LSTM: the GRU and LSTM converge better, achieve better results, and produce more valid predictions. The difference between the GRU and the LSTM is not obvious in these experiments; a more complex dataset might be needed to separate the two. However, they do differ clearly in how they respond to the learning rate: for the GRU the best learning rate is 1 and the worst is 10, while for the LSTM the results at 1 and 10 are comparable. This may be due to differences in their internal network layers.

2 Sequence-to-sequence learning

The sequence-to-sequence model is built from an encoder and a decoder and can handle output sequences whose length differs from that of the input sequence; it is commonly used for translation. The model is built and trained following the d2l tutorial. In the tutorial code, the encoder and decoder are defined first to process the input and output; next, a cross-entropy loss with masking is defined and passed to the training function; finally, training is run, and a BLEU scoring function is defined to quantify the prediction quality at inference time. The code construction flow is shown in the figure below. (See the specific code in the .ipynb file.)
[Figure: code flow chart for the seq2seq construction]
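A minimal sketch of that pipeline, assuming the chapter-9 helpers from d2l.torch (load_data_nmt, Seq2SeqEncoder, EncoderDecoder, train_seq2seq); the decoder below follows the tutorial's Seq2SeqDecoder and is written out here because it may not be importable from every d2l release. The layer count, hidden size, learning rate, and epoch count mirror the values quoted below, while embed_size and dropout are assumed tutorial defaults.

```python
import torch
from torch import nn
from d2l import torch as d2l

class Seq2SeqDecoder(d2l.Decoder):
    """GRU decoder conditioned on the encoder's final hidden state (tutorial-style)."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The encoder context is concatenated to every decoder input step.
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                          dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]                      # encoder's final hidden state

    def forward(self, X, state):
        X = self.embedding(X).permute(1, 0, 2)     # (num_steps, batch, embed_size)
        context = state[-1].repeat(X.shape[0], 1, 1)
        output, state = self.rnn(torch.cat((X, context), 2), state)
        return self.dense(output).permute(1, 0, 2), state

# Hyperparameters quoted in this section: 2 layers, 256 hidden units,
# learning rate 0.005, 300 epochs.
embed_size, num_hiddens, num_layers, dropout = 32, 256, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()

# English-French sentence pairs, tokenized and padded to num_steps.
train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)

encoder = d2l.Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens,
                             num_layers, dropout)
decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens,
                         num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)

# train_seq2seq applies the masked cross-entropy loss internally so that
# padding tokens do not contribute to the gradient.
d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)
```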
The following is the result with 2 hidden layers, 256 hidden units, a learning rate of 0.005, and 300 epochs.
[Figure: seq2seq training loss curve]
The following is the prediction result.
[Figure: seq2seq prediction results with BLEU scores]
The larger the BLEU score, the better the prediction. The score of the first two sentences is 1, indicating a very good prediction, but the scores of the last two are much smaller. Checking against the correct translations shows that the French output for these is poor. On closer inspection, the last two sentences are slightly longer than the first two, which is probably the reason for the worse results; this shows that the seq2seq model still has some room for optimization.
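For reference, here is a sketch of the BLEU scoring function in the style of the d2l tutorial, together with a hypothetical evaluation loop over the four tutorial sentence pairs; it assumes the trained net, src_vocab, tgt_vocab, num_steps, and device from the training sketch above, with d2l.predict_seq2seq as the tutorial's prediction helper.

```python
import math
import collections
from d2l import torch as d2l

def bleu(pred_seq, label_seq, k):
    """BLEU with a short-prediction penalty and n-gram precisions up to k."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))   # penalize short predictions
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        # n-gram precision term, weighted with exponent 0.5**n as in the tutorial.
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score

# The four English-French pairs used in the tutorial's prediction demo.
engs = ['go .', 'i lost .', "he's calm .", "i'm home ."]
fras = ['va !', "j'ai perdu .", 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, _ = d2l.predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device)
    print(f'{eng} => {translation}, bleu {bleu(translation, fra, k=2):.3f}')
```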

3 Experimental conclusions

In this experiment, the RNN, GRU, and LSTM architectures were built and trained on the "time machine" dataset. During training, the three parameters of epochs, learning rate, and number of hidden units were repeatedly adjusted to obtain better results. Besides comparing different parameters, the different architectures were also compared with each other. In addition to the three classic recurrent neural network models, I also studied seq2seq, a model consisting of an encoder and a decoder, and carried out training and inference to obtain the corresponding results.
The final results show that too few epochs harm the experiment, but once the epoch count reaches a certain size the additional benefit gradually shrinks and eventually becomes negligible. For the RNN and GRU the best learning rate is around 1, whereas the LSTM still performs well at a learning rate of 10, where the other two perform poorly. As for the number of hidden units, its effect on the results is the most significant of the three: in general, more units give better results, probably because more units fit the data better. For seq2seq, the final predictions show that the model does well on short sentences but much worse on longer ones, which also shows that the model has room for improvement.


Origin blog.csdn.net/weixin_51735061/article/details/132010502