Andrew Ng (Wu Enda) Machine Learning Homework, Python Implementation (4): Neural Networks (Backpropagation)

Table of contents

Neural Networks

visualize data

model representation

forward propagation

Expand parameters

Data encoding conversion

cost function

Initialization parameters

Regularized cost function

backpropagation

Gradient function of sigmoid

random initialization

backpropagation

gradient detection

regularized neural network

Parameter optimization

visualize hidden layers

reference article 


Neural Networks

        In the last exercise, a feed-forward neural network was implemented and used to predict handwritten digits. In this exercise, we will implement the backpropagation algorithm to learn the parameters of the neural network.

visualize data

        This part randomly selects 100 samples and visualizes them. The training set has 5000 training samples in total, and each sample is a grayscale image of a 20×20 pixel digit. Each 20×20 pixel grid is unrolled into a 400-dimensional vector, and each sample becomes one row of the matrix X, which gives a 5000×400 matrix X where each row is one handwritten-digit image.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat  # read MATLAB-format .mat files
from scipy.optimize import minimize  # optimizer
import matplotlib
from sklearn.preprocessing import OneHotEncoder  # for one-hot encoding of the labels
from sklearn.metrics import classification_report  # evaluation report

# load the dataset
path = r'E:\Code\ML\ml_learning\ex4-NN back propagation\ex4data1.mat'
def load_mat(path):
    data = loadmat(path)
    X = data['X']
    y = data['y'].flatten()
    
    return X,y
X, y = load_mat(path)

# load the pre-trained weights
path1 = r'E:\Code\ML\ml_learning\ex4-NN back propagation\ex4weights.mat'
weight = loadmat(path1)
theta1, theta2 = weight['Theta1'], weight['Theta2']

def plot_100_image(X):
    """
    Randomly plot 100 of the digit images.
    """
    sample_idx = np.random.choice(np.arange(X.shape[0]), 100)
    sample_images = X[sample_idx, :]
    fig, ax_array = plt.subplots(nrows=10, ncols=10, sharey=True, sharex=True, figsize=(8, 8))
    for row in range(10):
        for column in range(10):
            ax_array[row, column].matshow(sample_images[10 * row + column].reshape((20, 20)).T,
                                          cmap='gray_r')
    plt.xticks([])
    plt.yticks([])
    plt.show()

model representation

        The model built here is the same as in the previous exercise: it has three layers, an input layer (400 units), a hidden layer (25 units), and an output layer (10 units).
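
        As a quick sanity check (a minimal sketch, assuming the data and pre-trained weights loaded above), the shapes of the loaded arrays should match this architecture:

# quick shape check of the loaded data and the pre-trained weights
print(X.shape, y.shape)            # expected (5000, 400) (5000,)
print(theta1.shape, theta2.shape)  # expected (25, 401) (10, 26)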

forward propagation

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# forward propagation
def forward_propagate(X, theta):
    
    theta1, theta2 = deserialize(theta)
    
    a1 = np.insert(X, 0, values=np.ones(X.shape[0]), axis=1) # (5000, 401)
    z2 = a1 @ theta1.T # (5000, 401)(401, 25)=(5000,25)
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(X.shape[0]), axis=1) # (5000,26)
    z3 = a2 @ theta2.T # (5000,26)(26, 10)=(5000, 10)
    h = sigmoid(z3) # (5000, 10)
    return a1, z2, a2, z3, h

Expand parameters

        For the optimizer, we need to unroll the parameter matrices into a single vector to pass into the optimization function, and later restore their original shapes.

def serialize(a, b):
    '''unroll the parameter matrices into one long vector'''
    return np.r_[a.flatten(), b.flatten()]

def deserialize(seq):
    '''restore the two parameter matrices from the long vector'''
    theta1 = seq[:hidden_size * (input_size + 1)].reshape(hidden_size, input_size + 1)
    theta2 = seq[hidden_size * (input_size + 1):].reshape(num_labels, hidden_size + 1)
    return theta1, theta2
theta = serialize(theta1, theta2)
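
        A quick round-trip check (a sketch; it assumes input_size, hidden_size and num_labels are already set to 400, 25 and 10, as in the initialization section below):

# deserialize should recover the original matrices from the unrolled vector
t1, t2 = deserialize(theta)
print(t1.shape, t2.shape)                                # expected (25, 401) (10, 26)
print(np.allclose(t1, theta1), np.allclose(t2, theta2))  # expected True True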

Data encoding conversion

        The labels y that we read in take values (1, 2, 3, ..., 10); we need to convert each label into a one-hot vector. For example, a label y = 6 is converted into [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].

# one-hot encode the labels
def transform_y(y):
    encoder = OneHotEncoder(sparse=False)  # newer scikit-learn versions use sparse_output=False instead
    y_onehot = encoder.fit_transform(y.reshape(-1, 1))
    return y_onehot

y_onehot = transform_y(y)
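
        As a small check (a sketch; which digit comes first depends on the dataset ordering), we can inspect how a single label is encoded:

# the 1 in each one-hot row sits at position (label - 1), since the categories are 1..10
print(y[0], y_onehot[0])
print(np.argmax(y_onehot[0]) + 1 == y[0])  # expected True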

cost function

        The cost function is as follows

J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[-y_{k}^{(i)}log((h_{\theta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})log(1-(h_{\theta}(x^{(i)}))_{k})]

# feed-forward cost function
def cost(theta, X, y):
    m = X.shape[0]
    # run the network forward
    a1, z2, a2, z3, h = forward_propagate(X, theta)
    J = 0
# non-vectorized version:
#     for i in range(m):
#         part1 = -y[i] * np.log(h[i])
#         part2 = (1 - y[i]) * np.log(1 - h[i])
#         J += np.sum(part1 - part2)
#     J = J / len(X)

# vectorized version:
    J = -y * np.log(h) - (1 - y) * np.log(1 - h)        
    J = J.sum() / m
    
    return J

Initialization parameters

# initialization parameter settings
input_size = 400
hidden_size = 25
num_labels = 10
learning_rate = 1  # this is actually the regularization parameter lambda, despite the name

        After initializing the parameters, calling the cost function with the loaded weights gives a cost of about 0.287629.

result = cost(theta, X, y_onehot)
# result = 0.2876291651613188

Regularized cost function

        The regularized cost function is defined as

J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[-y_{k}^{(i)}log((h_{\theta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})log(1-(h_{\theta}(x^{(i)}))_{k})]+\frac{\lambda}{2m}[\sum_{j=1}^{25}\sum_{k=1}^{400}(\theta_{j,k}^{(1)})^{2}+\sum_{j=1}^{10}\sum_{k=1}^{25}(\theta_{j,k}^{(2)})^{2}]

        For the first (unregularized) term we can directly call the cost function written above.

# regularized cost function
def costReg(theta, X, y, learning_rate=1):
    m = X.shape[0]
    theta1, theta2 = deserialize(theta)
    # the bias column (first column) of theta1 and theta2 is not regularized
    reg = (learning_rate / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    J = cost(theta, X, y) + reg
    return J

        Note that we do not regularize the bias terms, i.e., the first column of theta1 and theta2.

        When λ = 1, calling the costReg function with the loaded weights gives a cost of about 0.383770.

costReg(theta, X, y_onehot, learning_rate)
# 0.38376985909092354

backpropagation

        In this part, the backpropagation algorithm is used to compute the gradient, and an advanced optimizer is then called to minimize the cost function and train the neural network.

Gradient function of sigmoid

g^{'}(z) = \frac{d}{dz}g(z) = g(z)(1-g(z))

where

Sigmoid(z) = g(z) = \frac{1}{1+e^{-z}}

# gradient of the sigmoid function
def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))
sigmoid_gradient(0) # 0.25
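
        As a quick check of the formula (a small sketch, not part of the original exercise), the analytic gradient can be compared with a finite-difference approximation at an arbitrary point:

# compare the analytic sigmoid gradient with a two-sided finite difference
z0 = 1.5
eps = 1e-6
numeric = (sigmoid(z0 + eps) - sigmoid(z0 - eps)) / (2 * eps)
print(numeric, sigmoid_gradient(z0))  # the two values should agree closely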

random initialization

        Random initialization of the parameters is very important when training neural networks. One effective strategy is to generate each initial value uniformly at random in an interval (-ε, ε), and an effective way to choose ε is to base it on the number of units in the network:

\epsilon _{init} = \frac{\sqrt{6}}{\sqrt{L_{in}+ L_{out}}}

        where  L_{in} and  L_{out} are the numbers of units in the layers adjacent to  \theta^{(l)} . For l = 1, the adjacent layers have 400 and 25 units, which gives a value of about 0.12. We therefore draw values in the range ε = 0.12 to keep the initial parameters small and make training more efficient.

# random initialization
size = hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)
params = np.random.uniform(-0.12, 0.12, size)
# params.shape == (10285,)
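
        As a small illustration of the formula above (a sketch using the sizes defined earlier; the code above simply uses ε = 0.12 for all parameters), the per-layer values come out as:

# epsilon_init computed from the formula for each parameter matrix
def epsilon_init(l_in, l_out):
    return np.sqrt(6) / np.sqrt(l_in + l_out)

print(epsilon_init(input_size, hidden_size))  # ~0.119 for theta1 (400 -> 25)
print(epsilon_init(hidden_size, num_labels))  # ~0.414 for theta2 (25 -> 10)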

backpropagation

        This part implements the backpropagation algorithm and the related derivations. The general procedure is: first run a training sample forward through the network to compute all activations, including the output value hθ(x); then, for node j in layer l, compute an error term that measures how much that node is responsible for the error in the output; the parameters are then adjusted according to these error terms and optimized iteratively.

        Here our neural network has three layers, and the third layer is the output layer, so the error is defined as

\delta_{k}^{(3)} = (a_{k}^{(3)}-y_{k})

        Here y_{k} ∈ {0, 1} indicates whether the current training sample belongs to class k (1 if it does, 0 otherwise).

        Next is layer 2 (the hidden layer), whose error is defined as

        \delta^{(2)} = (\theta^{(2)})^{T}\delta^{(3)}\odot g^{'}(z^{(2)})

        The first layer is the input layer, so it has no error term. Next, accumulate the gradient of each layer's parameter matrix:

\Delta ^{(2)} = a^{(2)}\delta^{(3)}

\Delta ^{(1)} = a^{(1)}\delta^{(2)}

        Finally, the (unregularized) gradient with respect to each layer's parameter matrix is

D^{(l)} = \frac{1}{m}\Delta^{(l)}

        Next, let us derive where the above δ and Δ come from. The key point is to be clear about which parameters we want to optimize, namely \theta^{(1)} and \theta^{(2)}. Following the idea of gradient descent, we need the gradients of the cost function with respect to these two parameter matrices, \frac{\partial J}{\partial \theta^{(1)}} and \frac{\partial J}{\partial \theta^{(2)}}.

        Assuming there is only one input sample, the cost function is

J(\theta) = -ylog(h(x)) - (1 - y)log(1-h(x))

        The forward pass computes, in turn, z^{(2)} = a^{(1)}(\theta^{(1)})^{T}, a^{(2)} = g(z^{(2)}), z^{(3)} = a^{(2)}(\theta^{(2)})^{T} and h = a^{(3)} = g(z^{(3)}), exactly as in the forward_propagate function above.

         Next, let us compute the gradient of the cost function with respect to the parameters. The core idea is the chain rule.

         According to the chain rule we can get the following formula

        \frac{\partial J}{\partial \theta^{(2)}} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial \theta^{(2)}} = (h-y)a^{(2)}

        Denote the left-hand side of the formula above by  \Delta^{(2)} , and the factor (h - y) by the error  \delta^{(3)} ; this is exactly the first formula in backpropagation:

\Delta ^{(2)} = a^{(2)}\delta^{(3)}

        In essence, the error term is the derivative of the cost function with respect to z, namely

\frac{\partial J}{\partial z^{(l)}}

        Similarly, applying the chain rule to \theta^{(1)}:

\frac{\partial J}{\partial \theta^{(1)}} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial \theta^{(1)}}

        The middle factors give the second backpropagation formula, \delta^{(2)} = (\theta^{(2)})^{T}\delta^{(3)}\odot g^{'}(z^{(2)}), and the remaining factor gives the third one, \Delta ^{(1)} = a^{(1)}\delta^{(2)}.

        This completes the key part of the derivation. The underlying principle is simply the chain rule, and it is not hard to follow once the forward propagation process is clear.

        The following is the code of the gradient function. Keeping track of the dimensions of each quantity helps avoid a lot of trouble.

def gradient(theta, X, y):
    m = X.shape[0]
    
    theta1, theta2 = deserialize(theta)
    a1, z2, a2, z3, h = forward_propagate(X, theta)
  
    delta3 = h - y # (5000, 10)
    delta2 = (delta3 @ theta2[:, 1:]) * sigmoid_gradient(z2) # (5000, 25)
    
    Delta2 = delta3.T @ a2 / m  # (10, 5000)(5000, 26) = (10, 26)
    Delta1 = delta2.T @ a1 / m  # (25, 5000)(5000, 401) = (25, 401)

    return Delta1, Delta2 
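
        A quick shape check on the returned gradients (a sketch, reusing X, y_onehot and theta from above):

# the gradients should have the same shapes as theta1 and theta2
D1, D2 = gradient(theta, X, y_onehot)
print(D1.shape, D2.shape)       # expected (25, 401) (10, 26)
print(serialize(D1, D2).shape)  # expected (10285,)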

gradient detection

        Gradient checking is mainly used to verify that the backpropagation implementation is correct. The neural network minimizes the cost function J(θ), and gradient checking compares the analytic gradient against a numerical approximation. First, unroll θ1 and θ2 into a long vector θ, then apply the procedure below, which uses the idea that the derivative at a point can be approximated by the slope between two nearby points: if the two points are close enough, this slope is a good substitute for the derivative.

        For each parameter θ_i, construct two nearby points by adding and subtracting a very small ε to that component only, and approximate the corresponding gradient component by

\frac{\partial J}{\partial \theta_{i}} \approx \frac{J(\theta_{1},...,\theta_{i}+\epsilon,...,\theta_{n}) - J(\theta_{1},...,\theta_{i}-\epsilon,...,\theta_{n})}{2\epsilon}

        This code runs very slowly, run with caution!

def gradient_checking(theta, X, y, e):
    def a_numeric_grad(plus, minus):
        """
        Numerically estimate the gradient for a single parameter theta_i.
        """
        return (costReg(plus, X, y) - costReg(minus, X, y)) / (e * 2)

    numeric_grad = []
    for i in range(len(theta)):
        plus = theta.copy()
        minus = theta.copy()
        plus[i] = plus[i] + e
        minus[i] = minus[i] - e
        grad_i = a_numeric_grad(plus, minus)
        numeric_grad.append(grad_i)

    numeric_grad = np.array(numeric_grad)                    # numerical approximation
    analytic_grad = gradientReg(theta, X, y, learning_rate)  # gradient from backpropagation
    diff = np.linalg.norm(numeric_grad - analytic_grad) / np.linalg.norm(numeric_grad + analytic_grad)
    print(diff)
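
        Since checking all 10285 parameters is expensive, a cheaper variant (a sketch, not part of the original exercise; the helper name and the subset size are my own choices) is to compare only a random subset of components:

# check a random subset of parameter components against the backprop gradient
def gradient_checking_subset(theta, X, y, e=1e-4, n_checks=20):
    analytic_grad = gradientReg(theta, X, y, learning_rate)
    for i in np.random.choice(len(theta), n_checks, replace=False):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += e
        minus[i] -= e
        numeric_i = (costReg(plus, X, y) - costReg(minus, X, y)) / (2 * e)
        print(i, numeric_i, analytic_grad[i])  # the two columns should be very close

# gradient_checking_subset(theta, X, y_onehot)  # still needs 2 * n_checks full cost evaluations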
    

regularized neural network

        Gradient regularization formula

\frac{\partial }{\partial \theta_{ij}^{(l)}}J(\theta) = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} \ \ \ \ \ \ \ \ \ \ for \ \ j = 0

\frac{\partial }{\partial \theta_{ij}^{(l)}}J(\theta) = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\theta_{ij}^{(l)}\ \ \ \ \ \ \ for \ j\geq 1

def gradientReg(theta, X, y, learning_rate=1):
    m = X.shape[0]
    D1, D2 = gradient(theta, X, y)
    theta1, theta2 = deserialize(theta)

    # do not penalize the bias units: zero out the first column before adding the penalty
    theta1 = theta1.copy()
    theta2 = theta2.copy()
    theta1[:, 0] = 0
    theta2[:, 0] = 0

    regD1 = D1 + (learning_rate / m) * theta1
    regD2 = D2 + (learning_rate / m) * theta2

    return serialize(regD1, regD2)

Parameter optimization

        Here we use an advanced optimization method, scipy.optimize.minimize, to optimize the parameters.

fmin = minimize(fun=costReg, x0=params, args=(X, y_onehot, learning_rate), 
                method='TNC', jac=gradientReg, options={'maxiter': 400})

        The result is as follows

     fun: 0.5064413657213123
     jac: array([-1.29134381e-04, -2.11248326e-12,  4.38829369e-13, ...,
       -2.98454162e-05, -1.96204232e-03, -1.77461205e-04])
 message: 'Converged (|f_n-f_(n-1)| ~= 0)'
    nfev: 139
     nit: 13
  status: 1
 success: True
       x: array([-0.0623484 , -0.06471579, -0.05614958, ..., -2.86694064,
        0.87384526,  0.43249548])

        Next use the optimized parameters to predict

# predictions using the optimized theta
a1, z2, a2, z3, h = forward_propagate(X, fmin.x)
y_pred = np.array(np.argmax(h, axis=1) + 1)
print(classification_report(y, y_pred))
                precision    recall  f1-score   support

           1       0.96      0.98      0.97       500
           2       0.97      0.97      0.97       500
           3       0.96      0.94      0.95       500
           4       0.96      0.98      0.97       500
           5       0.96      0.96      0.96       500
           6       0.98      0.98      0.98       500
           7       0.95      0.97      0.96       500
           8       0.97      0.96      0.97       500
           9       0.97      0.94      0.96       500
          10       0.99      0.99      0.99       500

    accuracy                           0.97      5000
   macro avg       0.97      0.97      0.97      5000
weighted avg       0.97      0.97      0.97      5000

        It can be seen that the accuracy after parameter optimization reaches 97%.
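
        An alternative single-number check (a sketch) without the full report:

# overall accuracy as a single number
accuracy = np.mean(y_pred == y)
print(accuracy)  # roughly 0.97 with the settings above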

visualize hidden layers

        A good way to understand what a neural network learns is to visualize what the hidden layer units capture. For our trained network, each row of θ1 is a 401-dimensional vector of parameters for one hidden unit. If we ignore the bias term, we get a 400-dimensional vector of weights from each input pixel to that hidden unit. So one way to visualize the hidden units is to reshape each 400-dimensional vector into a (20, 20) image and display it. (I don't quite understand this part for now.)

thetafinal1, thetafinal2 = deserialize(fmin.x)
hidden_layer = thetafinal1[:, 1:] 
plot_100_image(hidden_layer)

reference article 

Andrew Ng's machine learning and deep learning homework catalog [image restored]
