Artificial intelligence series experiments (5) - regularization methods: implementing L2 regularization and dropout in Python

When a neural network overfits, regularization is usually the method to reach for first, since collecting more data is difficult and expensive. In this experiment we implement two regularization methods in Python: L2 regularization and dropout.

L2 regularization

L2 regularization is one of the most common ways to combat overfitting.
Implementing it requires two changes: 1. adding a regularization term to the cost function, and 2. adding the corresponding term to the partial derivatives computed during backpropagation.
1. Add the L2 regularization term to the cost function
$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(a^{[L](i)}\right)+\left(1-y^{(i)}\right)\log\left(1-a^{[L](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m}\frac{\lambda}{2}\sum_{l}\sum_{k}\sum_{j}\left(W_{k,j}^{[l]}\right)^{2}}_{\text{L2 regularization term}}$$
The implementation code is as follows:

# 3-layer network
# compute the usual cross-entropy cost
cross_entropy_cost = compute_cost(A3, Y)

# compute the L2 regularization term
L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)

cost = cross_entropy_cost + L2_regularization_cost
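
For reference, the snippet above can be packaged as a stand-alone helper. The sketch below is only an illustration of how that might look: the function name and the layout of the parameters dictionary are assumptions, not part of the original code, and the cross-entropy part is written out explicitly so the snippet runs on its own.

import numpy as np

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    # Illustrative sketch: cross-entropy cost plus the L2 regularization term.
    m = Y.shape[1]
    W1, W2, W3 = parameters["W1"], parameters["W2"], parameters["W3"]

    # plain cross-entropy cost, equivalent to compute_cost(A3, Y) above
    cross_entropy_cost = -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / m

    # L2 regularization term: (lambda / 2m) * sum of all squared weights
    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)

    return cross_entropy_cost + L2_regularization_cost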

2. Add the L2 term's gradient, $\frac{\lambda}{m}W$, to each dW when computing the partial derivatives during backpropagation.
The implementation code is as follows:

dZ3 = A3 - Y
    
dW3 = 1. / m * np.dot(dZ3, A2.T) + (lambd * W3) / m
db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

dA2 = np.dot(W3.T, dZ3)
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1. / m * np.dot(dZ2, A1.T) + (lambd * W2) / m
db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

dA1 = np.dot(W2.T, dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1. / m * np.dot(dZ1, X.T) + (lambd * W1) / m
db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
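
The extra $\frac{\lambda}{m}W$ added to each dW comes directly from differentiating the L2 term of the regularized cost with respect to the weights of that layer:

$$\frac{\partial}{\partial W^{[l]}}\left(\frac{1}{m}\frac{\lambda}{2}\sum_{k}\sum_{j}\left(W_{k,j}^{[l]}\right)^{2}\right) = \frac{\lambda}{m}W^{[l]}$$

As a result, the gradient descent update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ shrinks every weight a little on each step, which is why L2 regularization is also known as weight decay.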

Dropout

Dropout is another method frequently used in deep learning to combat overfitting.
In each training iteration, dropout randomly removes some neurons, so the network trained in each iteration is slightly different. This prevents the model from depending too heavily on any single feature of any single sample, improves generalization, and thereby reduces overfitting.

Forward propagation with dropout

  1. Use np.random.rand() to create a matrix $D^{[1]}$ with the same dimensions as $A^{[1]}$.
  2. Threshold $D^{[1]}$ so that a fraction (1 - keep_prob) of its elements are set to 0 and a fraction keep_prob are set to 1.
  3. Set $A^{[1]} = A^{[1]} * D^{[1]}$ (element-wise). The elements of $A^{[1]}$ that are multiplied by the 0 entries of $D^{[1]}$ become 0; once an activation is zeroed out, its neuron is effectively deleted.
  4. Finally, divide $A^{[1]}$ by keep_prob. This keeps the expected value of the activations the same as in the network without dropout, i.e. it implements inverted dropout.

The implementation code is as follows:

Z1 = np.dot(W1, X) + b1
A1 = relu(Z1)

D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: random matrix with the same shape as A1
D1 = D1 < keep_prob                               # Step 2: threshold to obtain a 0/1 mask
A1 = A1 * D1                                      # Step 3: shut down the masked neurons
A1 = A1 / keep_prob                               # Step 4: scale to keep the expected value (inverted dropout)

Z2 = np.dot(W2, A1) + b2
A2 = relu(Z2)

D2 = np.random.rand(A2.shape[0], A2.shape[1])     
D2 = D2 < keep_prob                                             
A2 = A2 * D2                                  
A2 = A2 / keep_prob               

Z3 = np.dot(W3, A2) + b3
A3 = sigmoid(Z3)
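
Why dividing by keep_prob preserves the expected value: each entry of the mask is 1 with probability keep_prob and 0 otherwise, so for any activation $a$ and mask entry $d \sim \text{Bernoulli}(\text{keep\_prob})$,

$$\mathbb{E}\left[\frac{a \cdot d}{\text{keep\_prob}}\right] = \frac{a \cdot \text{keep\_prob}}{\text{keep\_prob}} = a.$$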

Backpropagation with dropout

  1. During the forward pass, we applied the mask $D^{[1]}$ to $A^{[1]}$ to delete some neurons. During backpropagation, we must delete the same neurons, which is done by applying the same mask $D^{[1]}$ to $dA^{[1]}$.
  2. During the forward pass, we divided $A^{[1]}$ by keep_prob. During backpropagation, we must therefore also divide $dA^{[1]}$ by keep_prob. The calculus-level explanation: if $A^{[1]}$ is scaled by keep_prob, then its derivative $dA^{[1]}$ must be scaled by the same factor.

The implementation code is as follows:

dZ3 = A3 - Y
dW3 = 1. / m * np.dot(dZ3, A2.T)
db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
dA2 = np.dot(W3.T, dZ3)

dA2 = dA2 * D2              # Step 1: apply the same mask D2 used in the forward pass
dA2 = dA2 / keep_prob       # Step 2: scale by the same factor as in the forward pass

dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1. / m * np.dot(dZ2, A1.T)
db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

dA1 = np.dot(W2.T, dZ2)

dA1 = dA1 * D1              
dA1 = dA1 / keep_prob             

dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1. / m * np.dot(dZ1, X.T)
db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

Hints

  1. Dropout is a regularization technique.
  2. Dropout should only be applied while training the model; turn it off when using the model to make predictions (see the sketch after this list).
  3. Dropout must be implemented in both forward propagation and backpropagation.
  4. Divide by keep_prob in both the forward and the backward pass of each layer that uses dropout, so that the expected value of the activations stays unchanged.
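
To illustrate point 2, a prediction-time forward pass simply omits the masks and the scaling. The sketch below is only an illustration (the predict_forward name and the explicit relu/sigmoid helpers are assumptions, not taken from the original code), assuming the same 3-layer architecture used above:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def predict_forward(X, W1, b1, W2, b2, W3, b3):
    # At prediction time there are no dropout masks and no division by keep_prob.
    A1 = relu(np.dot(W1, X) + b1)
    A2 = relu(np.dot(W2, A1) + b2)
    A3 = sigmoid(np.dot(W3, A2) + b3)
    return A3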

Demonstration of the actual model performance

Using a dataset of colored 2D data points, we train a model with L2 regularization and a model with dropout, and compare them with a model trained without any regularization.
For the complete code of this experiment, see:
https://github.com/PPPerry/AI_projects/tree/main/5.regularization
The model without regularization overfits the training data:
[figure: predictions of the model without regularization]
The models trained with L2 regularization or dropout no longer overfit the training set:
[figures: predictions of the models with L2 regularization and with dropout]

Previous artificial intelligence series experiments:

Artificial intelligence series experiments (1) - binary-classification single-layer neural network for cat recognition
Artificial intelligence series experiments (2) - shallow neural network for distinguishing different color regions
Artificial intelligence series experiments (3) - binary-classification deep neural network for cat recognition
Artificial intelligence series experiments (4) - comparison of neural network parameter initialization methods (Xavier initialization and He initialization)

Original article: blog.csdn.net/qq_43734019/article/details/120608764