[ZJU Machine Learning] Parameter Settings of Artificial Neural Networks

training suggestions

(1) Generally, the average value of the objective function on the training set (the cost) keeps decreasing as training progresses. If this indicator no longer decreases, stop training. There are two possible situations: either the model is not complex enough to fully fit the training set, or it has already been trained well.

(2) Set aside part of the data as a validation set. The essential goal of training is to achieve the highest recognition rate on the validation set, so after each period of training, test the recognition rate on the validation set and save the model parameters that give the highest validation recognition rate as the final result.

(3) Pay attention to adjusting the learning rate. If the cost increases after only a few training steps, the learning rate is generally too high; if the cost changes very little from step to step, the learning rate is probably too low.

(4) Batch Normalization makes training easier. With it, the network is not sensitive to the learning rate, the parameter update strategy, and so on. If you use Batch Normalization, the simplest SGD is recommended as the update strategy; in my experience, adding other methods does not help.

(5) If Batch Normalization is not used, my experience is that a reasonable combination of the other parameter settings can also achieve the goal.

(6) Due to the gradient accumulation effect, the three update strategies AdaGrad, RMSProp, and Adam become very slow in the later stages of training; a strategy of increasing the learning rate can be used to compensate for this effect.

All code in this article is MATLAB.

stochastic gradient descent

Purpose: Reduce noise and avoid drastic changes in parameters.

(1) Instead of updating the parameters after every single sample, feed in a batch of samples (called a BATCH or MINI-BATCH), compute the average gradient over these samples, and update the parameters based on that average.

(2) In neural network training, the batch size is usually set to roughly 50-200.

Routine:

batch_size = option.batch_size;
m = size(train_x,1);                 % number of training samples
num_batches = m / batch_size;        % assumes m is divisible by batch_size
for k = 1 : iteration
    kk = randperm(m);                % reshuffle the samples every epoch
    for l = 1 : num_batches
        batch_x = train_x(kk((l - 1) * batch_size + 1 : l * batch_size), :);
        batch_y = train_y(kk((l - 1) * batch_size + 1 : l * batch_size), :);
        nn = nn_forward(nn,batch_x,batch_y);     % forward pass on the mini-batch
        nn = nn_backpropagation(nn,batch_y);     % backpropagate the averaged gradient
        nn = nn_applygradient(nn);               % update the parameters
    end
end

activation function

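The forward-calculation code later in this article switches between sigmoid and tanh activations. A minimal illustrative sketch of both (the anonymous sigmoid handle below is an assumed stand-in for the sigmoid helper that code calls):

% Illustrative sketch of the two activations used later ('sigmoid' and 'tanh').
% The anonymous handle is an assumed stand-in for the sigmoid helper function.
sigmoid = @(y) 1 ./ (1 + exp(-y));   % squashes to (0,1); saturates for large |y|
y = linspace(-5, 5, 200);
plot(y, sigmoid(y), y, tanh(y));     % tanh squashes to (-1,1) and is built into MATLAB
legend('sigmoid', 'tanh');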

Initialization of training data

Perform mean and variance normalization (zero mean, unit variance for each feature):

[U,V] = size(xTraining);
avgX = mean(xTraining);
sigma = std(xTraining);
xTraining = (xTraining - repmat(avgX,U,1))./repmat(sigma,U,1);

Initialization of w and b

For Sigmoid and tanh, the gradient vanishes when the pre-activations fall in the saturated (flat) regions of the curve, so we need to strike a balance: we do not want the gradient to vanish, i.e., we want the pre-activations to be distributed mainly where the slope is steep (the derivative is large). Of course, this proportion cannot be too large either, otherwise the nonlinearity of the activation function is lost, so a small part of the data should still fall on the gentler slopes.
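A rough sketch of this balance (the 0.1 scale and the layer sizes below are illustrative assumptions, not the course's exact recipe): with normalized inputs, small zero-mean random weights keep most pre-activations on the steep central part of sigmoid/tanh, while a few samples still reach the flatter parts.

% Illustrative initialization sketch (the 0.1 scale and the sizes are assumptions).
n_in  = 256;                    % fan-in of the layer
n_out = 100;                    % fan-out of the layer
W = 0.1 * randn(n_out, n_in);   % small zero-mean random weights
b = zeros(n_out, 1);            % biases start at zero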

Batch Normalization

Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)

Basic idea:

Since we want the values produced by each layer to stay near 0 (to avoid the vanishing gradient phenomenon), why not directly normalize each layer's values by their mean and variance?

Each FC (Fully Connected) layer is followed by a BN (Batch Normalization) layer.


Algorithm process

The reason for introducing gamma and beta:
Because of the normalization, the pre-activations are squeezed into the nearly linear region of the activation function. Two extra learnable parameters, gamma (scale) and beta (shift), are therefore appended and trained together with the network.
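A minimal standalone sketch of the per-batch transform (the sizes, the 0.0001 smoothing term, and the scalar gamma/beta are assumptions; the article's own running-statistics implementation follows in the forward-calculation code below):

% Minimal per-batch BN sketch (assumed sizes; scalar Gamma/Beta as in the code below).
m = 64;  d = 100;                % batch size and layer width (assumptions)
y = randn(d, m);                 % pre-activations, one column per sample
gamma = 1;  beta = 0;            % learnable scale and shift
mu    = mean(y, 2);              % per-feature mean over the batch
sigma = std(y, 0, 2);            % per-feature standard deviation
y_hat = (y - repmat(mu, 1, m)) ./ repmat(sigma + 0.0001, 1, m);   % normalize
y_bn  = gamma * y_hat + beta;    % scale and shift so the nonlinearity is not lost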

forward calculation

% Linear step of layer k, then (optionally) Batch Normalization.
y = nn.W{k-1} * nn.a{k-1} + repmat(nn.b{k-1},1,m);
if nn.batch_normalization
    % Update running estimates of the mean E and std S over all samples seen so far.
    nn.E{k-1} = nn.E{k-1}*nn.vecNum + sum(y,2);
    nn.S{k-1} = nn.S{k-1}.^2*(nn.vecNum-1) + (m-1)*std(y,0,2).^2;
    nn.vecNum = nn.vecNum + m;
    nn.E{k-1} = nn.E{k-1}/nn.vecNum;
    nn.S{k-1} = sqrt(nn.S{k-1}/(nn.vecNum-1));
    % Normalize, then apply the learnable scale (Gamma) and shift (Beta).
    y = (y - repmat(nn.E{k-1},1,m))./repmat(nn.S{k-1}+0.0001*ones(size(nn.S{k-1})),1,m);
    y = nn.Gamma{k-1}*y+nn.Beta{k-1};
end;
switch nn.activaton_function
    case 'sigmoid'
        nn.a{k} = sigmoid(y);
    case 'tanh'
        nn.a{k} = tanh(y);

backpropagation

% Error term of layer k (sigmoid derivative a.*(1-a)).
nn.theta{k} = ((nn.W{k}'*nn.theta{k+1})) .* nn.a{k} .* (1 - nn.a{k});
if nn.batch_normalization
    % Recompute the normalized pre-activation of layer k.
    x = nn.W{k-1} * nn.a{k-1} + repmat(nn.b{k-1},1,m);
    x = (x - repmat(nn.E{k-1},1,m))./repmat(nn.S{k-1}+0.0001*ones(size(nn.S{k-1})),1,m);
    % Gradients of the BN parameters Gamma and Beta.
    temp = nn.theta{k}.*x;
    nn.Gamma_grad{k-1} = sum(mean(temp,2));
    nn.Beta_grad{k-1} = sum(mean(nn.theta{k},2));
    % Propagate the error term back through the BN transform.
    nn.theta{k} = nn.Gamma{k-1}*nn.theta{k}./repmat((nn.S{k-1}+0.0001),1,m);
end;
% Gradients of the weights (with weight decay) and biases.
nn.W_grad{k-1} = nn.theta{k}*nn.a{k-1}'/m + nn.weight_decay*nn.W{k-1};
nn.b_grad{k-1} = sum(nn.theta{k},2)/m;

objective function

Regularization terms can be added.

This avoids solutions where the error E is smallest while the weights w are very large at the same time.

forward calculation

cost2 = cost2 + sum(sum(nn.W{k-1}.^2));
nn.cost(s) = 0.5 / m * sum(sum((nn.a{k} - batch_y).^2)) + 0.5 * nn.weight_decay * cost2;

backpropagation

nn.W_grad{k-1} = nn.theta{k}*nn.a{k-1}'/m + nn.weight_decay*nn.W{k-1};

softmax function and cross entropy

If it is a classification problem, F(W) can use a combination of SOFTMAX function and cross entropy.

(a) SOFTMAX function:

What the last layer outputs is not a one-hot code but a probability for each category.

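A minimal numerical sketch of the softmax computation (the score values are made up; subtracting the maximum is only for numerical stability):

% Softmax sketch: turn the last layer's scores into class probabilities.
z = [2.0; 1.0; 0.1];             % example scores (illustrative values)
z = z - max(z);                  % subtract the max for numerical stability
p = exp(z) / sum(exp(z));        % probabilities; sum(p) equals 1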
(b) Cross Entropy: (see information theory for details)

The closer q is to p, the better the fit and the smaller the gap.
Cross entropy is the amount of information needed to describe p using q. If q predicts p perfectly, the cross entropy reaches its minimum value, which is exactly the information content (entropy) of p.
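A small numerical check of this point (illustrative values): the cross entropy -sum(p .* log(q)) shrinks as q approaches p.

% Cross-entropy sketch: H(p,q) = -sum(p.*log(q)) is smallest when q matches p.
p  = [1; 0; 0];                  % one-hot label distribution
q1 = [0.7; 0.2; 0.1];            % prediction close to p
q2 = [0.2; 0.5; 0.3];            % prediction far from p
H1 = -sum(p .* log(q1));         % about 0.36
H2 = -sum(p .* log(q2));         % about 1.61, so the worse fit costs more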

(c)
If F(W) is a combination of the SOFTMAX function and cross entropy (1. convert the network's output values into probabilities with softmax; 2. take the cross entropy between these probabilities and the label distribution as the cost), then the derivative has a very simple form:
forward calculation

if strcmp(nn.objective_function,'Cross Entropy')
    nn.cost(s) = -0.5*sum(sum(batch_y.*log(nn.a{k})))/m + 0.5 * nn.weight_decay * cost2;

backpropagation

 case 'softmax'
     y = nn.W{nn.depth-1} * nn.a{nn.depth-1} + repmat(nn.b{nn.depth-1},1,m);
     nn.theta{nn.depth} = nn.a{nn.depth} - batch_y;

Parameter update strategy

(1) Regular update SGD (Vanilla Stochastic Gradient Descent)

nn.W{k} = nn.W{k} - nn.learning_rate*nn.W_grad{k};
nn.b{k} = nn.b{k} - nn.learning_rate*nn.b_grad{k};

Problems with SGD

(1) The absolute values of the gradient components of (W, b) may differ widely; in some cases this forces the optimization path into a zigzag (Z-shape).
(2) The gradients SGD obtains are too random: since completely different batches of data are used from one step to the next, the optimization direction fluctuates randomly.
(2) AdaGrad
AdaGrad normalizes each gradient component by dividing it by the square root of the accumulated squared gradients, which reduces the variation in update magnitude across components. In addition, the squared gradients of all batches are accumulated into r; since r appears in the denominator of the update of θ, the effective learning rate keeps shrinking, training slows down as it proceeds, and convergence is encouraged.

if strcmp(nn.optimization_method, 'AdaGrad')
    nn.rW{k} = nn.rW{k} + nn.W_grad{k}.^2;
    nn.rb{k} = nn.rb{k} + nn.b_grad{k}.^2;

    nn.W{k} = nn.W{k} - nn.learning_rate*nn.W_grad{k}./(sqrt(nn.rW{k})+0.001);
    nn.b{k} = nn.b{k} - nn.learning_rate*nn.b_grad{k}./(sqrt(nn.rb{k})+0.001);


(3) RMSProp
The only difference from AdaGrad is the introduction of the decay factor ρ (rho), which can be tuned to control whether the model pays more attention to the previously accumulated gradients or to the current gradient direction.
AdaGrad can be viewed as a special case of RMSProp with ρ set to 0.5.
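The original article gives no RMSProp snippet; written in the same fragment style as the AdaGrad block above, it would look roughly like this (ρ = 0.9 and the 0.001 smoothing term are my assumptions, not taken from the course code):

% Hypothetical RMSProp update, mirroring the AdaGrad block above
% (rho = 0.9 and the 0.001 smoothing term are assumptions).
rho = 0.9;
nn.rW{k} = rho*nn.rW{k} + (1-rho)*nn.W_grad{k}.^2;   % decayed accumulation instead of a plain sum
nn.rb{k} = rho*nn.rb{k} + (1-rho)*nn.b_grad{k}.^2;
nn.W{k} = nn.W{k} - nn.learning_rate*nn.W_grad{k}./(sqrt(nn.rW{k})+0.001);
nn.b{k} = nn.b{k} - nn.learning_rate*nn.b_grad{k}./(sqrt(nn.rb{k})+0.001);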

(4) Momentum
This algorithm addresses the randomness of the gradients. The main idea is to combine all past gradients with the currently computed one and assign them weights so that the optimization path becomes smoother.
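The original article gives no momentum snippet either; a rough fragment in the same style (the 0.9/0.1 weighting and the vW/vb field names are my assumptions) would be:

% Hypothetical momentum update in the style of the other blocks
% (the 0.9/0.1 weights and the vW/vb field names are assumptions).
nn.vW{k} = 0.9*nn.vW{k} + 0.1*nn.W_grad{k};   % weighted mix of past and current gradients
nn.vb{k} = 0.9*nn.vb{k} + 0.1*nn.b_grad{k};
nn.W{k} = nn.W{k} - nn.learning_rate*nn.vW{k};
nn.b{k} = nn.b{k} - nn.learning_rate*nn.vb{k};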
(5) Adam

Adam combines the previous ideas: it both normalizes the gradient components of the parameters (as in AdaGrad/RMSProp) and addresses the randomness of the gradient (as in momentum).

if strcmp(nn.optimization_method, 'Adam')
    nn.sW{k} = 0.9*nn.sW{k} + 0.1*nn.W_grad{k};
    nn.sb{k} = 0.9*nn.sb{k} + 0.1*nn.b_grad{k};
    nn.rW{k} = 0.999*nn.rW{k} + 0.001*nn.W_grad{k}.^2;
    nn.rb{k} = 0.999*nn.rb{k} + 0.001*nn.b_grad{k}.^2;

    nn.W{k} = nn.W{k} - 10*nn.learning_rate*nn.sW{k}./sqrt(1000*nn.rW{k}+0.00001);
    nn.b{k} = nn.b{k} - 10*nn.learning_rate*nn.sb{k}./sqrt(1000*nn.rb{k}+0.00001);  % rho1 = 0.9, rho2 = 0.999, delta = 0.00001

Origin: blog.csdn.net/qq_45654306/article/details/113214053