Understanding the principle of dropout

      Dropout is a very simple and practical idea in deep learning, proposed to reduce overfitting. A traditional neural network is fully connected: every neuron in one layer is connected to every neuron in the next. To weaken this coupling, dropout randomly keeps only part of the network active. A probability p is used to decide, for each neuron, whether it participates: a mask of 0s and 1s is sampled, where 1 means the neuron is kept and 0 means it is temporarily dropped. During training, each input therefore activates only a random subset of the hidden neurons. In this way many different "thinned" network structures are formed over the course of training, and each iteration saves some computation, because it does not update the weights of the entire fully connected network, only a random part of them. After training is finished, however, the weights between all neurons have been learned.
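As a concrete illustration, here is a minimal NumPy sketch of this masking step. It assumes p is the drop probability and that dropout is applied to a vector of hidden activations; the function name and the example values are made up for illustration.

```python
import numpy as np

def dropout_train(activations, p=0.5, rng=None):
    """Standard dropout on a layer's activations during training.

    p is the drop probability, so each unit is kept with probability 1 - p.
    """
    rng = rng or np.random.default_rng()
    # Bernoulli mask: 1 keeps the unit, 0 silences it for this forward pass.
    mask = rng.binomial(n=1, p=1 - p, size=activations.shape)
    return activations * mask

# Each call silences a different random subset of units, so every forward
# pass effectively trains a different "thinned" network.
h = np.array([0.3, -1.2, 0.8, 2.0, -0.5])
print(dropout_train(h, p=0.5))
```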

        It is worth noting that dropout is disabled in the validation and testing phases; it exists only during training. In validation and testing we therefore use the full fully connected network, and its output needs to be multiplied by the probability p. (This p is important: in some descriptions p is the drop probability and in others it is the keep probability, so the factor is sometimes written as 1 - p, but the two conventions mean the same thing.) Alternatively, the result after dropout can be divided by the keep probability already during training, so that no scaling is needed at test time. Either way, the process looks a lot like a normalization step.
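To make the two scaling conventions concrete, here is a small sketch, again assuming p is the drop probability. The first pair of functions only masks during training and multiplies by the keep probability at test time; the second pair divides by the keep probability during training (often called inverted dropout), so the test-time forward pass is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def vanilla_dropout_train(x, p):
    # Mask only; the scaling is deferred to test time.
    mask = rng.binomial(1, 1 - p, size=x.shape)
    return x * mask

def vanilla_dropout_test(x, p):
    # Multiply by the keep probability so the expected activation
    # matches what the network saw during training.
    return x * (1 - p)

def inverted_dropout_train(x, p):
    # Divide by the keep probability during training instead ...
    mask = rng.binomial(1, 1 - p, size=x.shape)
    return x * mask / (1 - p)

def inverted_dropout_test(x, p):
    # ... so the test-time pass needs no extra scaling at all.
    return x
```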

         The reason I dare to say that the weights between all neurons have been learned after training is this: if some neurons ended up with no weights at all, how could we compute with the fully connected network at test time, where dropout is no longer applied? So I would boldly conjecture that training with dropout is like training many different neural networks on the same data and then voting on or averaging their results, similar to a random forest. I believe this is the essence of why dropout works so well and so stably, although I have not verified this conjecture against the relevant literature.

        Moreover, another great thing about dropout is that the random selection reduces strong correlations between features: because there are no fixed connections between neurons during training, a connection that is present in one pass may be absent in the next, which greatly enriches the possible sub-networks and gives the method a flavor of ensemble learning. In addition, the mask follows a Bernoulli distribution with parameter p, so when we take expectations, the normalization idea mentioned above (multiplying by p, or dividing by the keep probability during training) has a theoretical basis.
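A one-line expectation calculation shows why this works. With a Bernoulli keep/drop mask m_i and p taken as the drop probability (an assumption consistent with the sketches above), the rescaled unit has the same expected value as the undropped activation:

```latex
% m_i ~ Bernoulli(1-p) is the keep mask, x_i the activation, p the drop probability.
\[
\mathbb{E}\!\left[\frac{m_i \, x_i}{1-p}\right]
  = \frac{x_i}{1-p}\,\mathbb{E}[m_i]
  = \frac{x_i}{1-p}\,(1-p)
  = x_i .
\]
```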

Origin blog.csdn.net/zhou_438/article/details/108548219