Neural Networks
In the previous exercise, we implemented a feed-forward neural network and used it to predict handwritten digits. In this exercise, we implement the backpropagation algorithm to learn the parameters of the neural network.
Visualize data
This part randomly selects 100 samples and displays them. The training set contains 5000 samples, and each sample is a 20×20-pixel grayscale image of a handwritten digit. Each 20×20 pixel grid is unrolled into a 400-dimensional vector and stored as one row of X, which gives a 5000×400 matrix X in which every row is one training image.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat  # read MATLAB-format files
from scipy.optimize import minimize  # optimizer
import matplotlib
from sklearn.preprocessing import OneHotEncoder  # one-hot encoding of the labels
from sklearn.metrics import classification_report  # evaluation report

# Load the dataset
path = r'E:\Code\ML\ml_learning\ex4-NN back propagation\ex4data1.mat'

def load_mat(path):
    data = loadmat(path)
    X = data['X']
    y = data['y'].flatten()
    return X, y

X, y = load_mat(path)

# Load the pre-trained weights
path1 = r'E:\Code\ML\ml_learning\ex4-NN back propagation\ex4weights.mat'
weight = loadmat(path1)
theta1, theta2 = weight['Theta1'], weight['Theta2']
def plot_100_image(X):
    """
    Plot 100 randomly chosen digits.
    """
    sample_idx = np.random.choice(np.arange(X.shape[0]), 100)
    sample_images = X[sample_idx, :]
    fig, ax_array = plt.subplots(nrows=10, ncols=10, sharey=True, sharex=True, figsize=(8, 8))
    for row in range(10):
        for column in range(10):
            ax_array[row, column].matshow(sample_images[10 * row + column].reshape((20, 20)).T,
                                          cmap='gray_r')
    plt.xticks([])
    plt.yticks([])
    plt.show()
Model representation
The model is the same as in the previous exercise. It has three layers: an input layer, a hidden layer, and an output layer. With the sizes used below, these have 400, 25, and 10 units, not counting the bias units.
Forward propagation
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation
def forward_propagate(X, theta):
    theta1, theta2 = deserialize(theta)
    a1 = np.insert(X, 0, values=np.ones(X.shape[0]), axis=1)  # (5000, 401), bias column added
    z2 = a1 @ theta1.T  # (5000, 401) @ (401, 25) = (5000, 25)
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(X.shape[0]), axis=1)  # (5000, 26), bias column added
    z3 = a2 @ theta2.T  # (5000, 26) @ (26, 10) = (5000, 10)
    h = sigmoid(z3)  # (5000, 10)
    return a1, z2, a2, z3, h
Expand parameters
The optimizer works on a single flat parameter vector, so we unroll the parameter matrices into one vector before passing them to the optimization function, and restore their shapes inside the cost and gradient functions.
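def serialize(a, b):
    '''Unroll two parameter matrices into one flat vector.'''
    return np.r_[a.flatten(), b.flatten()]

def deserialize(seq):
    '''Restore theta1 and theta2 from the flat vector.'''
    # input_size, hidden_size and num_labels are the globals set in "Initialization parameters"
    split = hidden_size * (input_size + 1)
    return (seq[:split].reshape(hidden_size, input_size + 1),
            seq[split:].reshape(num_labels, hidden_size + 1))

theta = serialize(theta1, theta2)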
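As a quick sanity check of the unroll-and-restore idea, the sketch below flattens two small matrices and recovers them. The names `pack` and `unpack` and the tiny layer sizes are illustrative assumptions, not part of the exercise code.

```python
import numpy as np

def pack(a, b):
    """Flatten two matrices into a single 1-D vector."""
    return np.concatenate([a.ravel(), b.ravel()])

def unpack(seq, shape_a, shape_b):
    """Split a 1-D vector back into two matrices of the given shapes."""
    split = shape_a[0] * shape_a[1]
    return seq[:split].reshape(shape_a), seq[split:].reshape(shape_b)

# Hypothetical tiny network: 3 inputs, 2 hidden units, 2 outputs (plus bias columns)
t1 = np.arange(8, dtype=float).reshape(2, 4)   # (hidden, input + 1)
t2 = np.arange(6, dtype=float).reshape(2, 3)   # (labels, hidden + 1)

flat = pack(t1, t2)
r1, r2 = unpack(flat, t1.shape, t2.shape)
print(flat.shape)                                       # (14,)
print(np.array_equal(r1, t1), np.array_equal(r2, t2))   # True True
```

Passing the shapes explicitly, rather than reading global sizes, is only a choice made to keep the sketch self-contained.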
Data encoding conversion
The labels y we read take the values 1, 2, ..., 10. We need to convert each label into a one-hot vector of length 10 whose single nonzero entry marks the class. For example, y[0] = 6 is converted into y[0] = [0,0,0,0,0,1,0,0,0,0].
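# One-hot encode the labels
def transform_y(y):
    # Newer scikit-learn releases name this argument sparse_output instead of sparse
    encoder = OneHotEncoder(sparse=False)
    y_onehot = encoder.fit_transform(y.reshape(-1, 1))
    return y_onehot

y_onehot = transform_y(y)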
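The same conversion can be sketched with NumPy alone, which makes the mapping from label to column explicit. The helper name `one_hot` is an assumption for illustration, and the sketch assumes the labels run from 1 to the number of classes, as they do in this dataset.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels 1..num_classes to one-hot rows."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    # Label k sets column k - 1, because the labels start at 1
    encoded[np.arange(labels.size), labels - 1] = 1
    return encoded

example = one_hot([6, 1, 10], 10)
print(example[0])   # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```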
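Cost function
The (unregularized) cost function is
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\left(h_\theta(x^{(i)})_k\right) - \left(1 - y_k^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})_k\right)\right]$$
where m is the number of training samples and K = 10 is the number of classes.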
# Unregularized cost, computed from the forward pass
def cost(theta, X, y):
    m = X.shape[0]
    # Run the network forward
    a1, z2, a2, z3, h = forward_propagate(X, theta)
    J = 0
    # Non-vectorized version
    # for i in range(m):
    #     part1 = -y[i] * np.log(h[i])
    #     part2 = (1 - y[i]) * np.log(1 - h[i])
    #     J += np.sum(part1 - part2)
    # J = J / len(X)
    # Vectorized version
    J = -y * np.log(h) - (1 - y) * np.log(1 - h)
    J = J.sum() / m
    return J
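To see how this cost behaves, here is a small sketch that evaluates the same expression on made-up outputs. The function name `cross_entropy` and the numbers are assumptions chosen for illustration; confident correct predictions should give a much lower cost than confident wrong ones.

```python
import numpy as np

def cross_entropy(h, y):
    """Per-class logistic loss, summed over classes and averaged over samples."""
    m = y.shape[0]
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m

# Two hypothetical samples with two classes each
y_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
h_good = np.array([[0.9, 0.1],
                   [0.2, 0.8]])   # mostly right
h_bad = np.array([[0.1, 0.9],
                  [0.8, 0.2]])    # mostly wrong

cost_good = cross_entropy(h_good, y_demo)
cost_bad = cross_entropy(h_bad, y_demo)
print(cost_good, cost_bad)
```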
Initialization parameters
# Parameter settings
input_size = 400
hidden_size = 25
num_labels = 10
learning_rate = 1  # used below as the regularization strength lambda, not as a step size
After setting these parameters, calling the cost function with the loaded weights gives a cost of 0.287629.
result = cost(theta, X, y_onehot)
# result = 0.2876291651613188
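Regularized cost function
The regularized cost function is defined as
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\left(h_\theta(x^{(i)})_k\right) - \left(1 - y_k^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})_k\right)\right] + \frac{\lambda}{2m}\left[\sum_{j=1}^{25}\sum_{k=1}^{400}\left(\Theta^{(1)}_{j,k}\right)^2 + \sum_{j=1}^{10}\sum_{k=1}^{25}\left(\Theta^{(2)}_{j,k}\right)^2\right]$$
The first term is exactly the cost function written above, so we can call it directly and only add the penalty term.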
# Regularized cost
def costReg(theta, X, y, learning_rate=1):
    m = X.shape[0]
    theta1, theta2 = deserialize(theta)
    # Penalty over all weights except the bias columns
    reg = (learning_rate / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    J = cost(theta, X, y) + reg
    return J
Note that the bias terms, that is, the first column of theta1 and of theta2, are not regularized.
With λ = 1, the costReg function returns 0.383770.
costReg(theta, X, y_onehot, learning_rate)
# 0.38376985909092354
Backpropagation
This part uses the backpropagation algorithm to compute the gradient, and then calls an advanced optimizer to minimize the cost function and train the neural network.
Gradient of the sigmoid function
$$g'(z) = \frac{d}{dz}g(z) = g(z)\left(1 - g(z)\right)$$
where
$$g(z) = \frac{1}{1 + e^{-z}}$$
# Gradient of the sigmoid, used in backpropagation
def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))

sigmoid_gradient(0)  # 0.25
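One way to convince yourself that the formula g(z)(1 - g(z)) is right is to compare it with a central finite difference. The sketch below does that at a few points; the names `sigmoid_fn` and `sigmoid_grad` and the step size are assumptions made for this example.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid_fn(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 3.0])
eps = 1e-5
# Central difference: slope between two points on either side of z
numeric = (sigmoid_fn(z + eps) - sigmoid_fn(z - eps)) / (2 * eps)

print(np.allclose(numeric, sigmoid_grad(z), atol=1e-8))   # True
print(sigmoid_grad(0.0))                                   # 0.25
```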
Random initialization
Random initialization of the parameters is very important when training a neural network. One effective strategy is to draw every value of Θ^(l) uniformly from a range (-ε, ε), with ε chosen from the number of units in the network:
$$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$$
where L_in and L_out are the numbers of units in the layers adjacent to Θ^(l). For l = 1, the adjacent layers have 400 and 25 units (26 if the hidden bias unit is counted), and either way the result is about 0.12. We therefore use ε = 0.12, which keeps the parameters small and makes training more efficient.
# Random initialization
size = hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)
params = np.random.uniform(-0.12, 0.12, size)
# params.shape == (10285,)
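The rule for ε can also be computed per layer rather than hard-coded. The following is a minimal sketch under that assumption; `init_epsilon` and `random_weights` are hypothetical helpers, whereas the code above uses the fixed value 0.12 for every parameter.

```python
import numpy as np

def init_epsilon(l_in, l_out):
    """Half-width of the sampling range: sqrt(6) / sqrt(l_in + l_out)."""
    return np.sqrt(6.0) / np.sqrt(l_in + l_out)

def random_weights(l_in, l_out, rng):
    """Weights of shape (l_out, l_in + 1), uniform in (-eps, eps); the +1 is the bias column."""
    eps = init_epsilon(l_in, l_out)
    return rng.uniform(-eps, eps, size=(l_out, l_in + 1))

rng = np.random.default_rng(0)   # fixed seed only to make the sketch repeatable
w1 = random_weights(400, 25, rng)
print(round(init_epsilon(400, 25), 4))   # 0.1188
print(w1.shape)                          # (25, 401)
```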
Backpropagation algorithm
This part implements the backpropagation algorithm and derives the formulas behind it. The general procedure is as follows. First, run a forward pass on a training sample to compute all activations in the network, including the hypothesis output hθ(x). Then, for the j-th node of layer l, compute an error term δ_j^(l) that measures how much that node was responsible for any error in the output. The parameters are then adjusted according to these error terms and optimized step by step.
Our neural network has three layers, and the third layer is the output layer, so its error is defined as
$$\delta_k^{(3)} = a_k^{(3)} - y_k$$
Here yk∈{0, 1} indicates whether the current training sample belongs to class k: 1 means it does, and 0 means it does not.
Next is the hidden layer (layer 2), whose error is defined as
$$\delta^{(2)} = \left(\Theta^{(2)}\right)^T\delta^{(3)} \odot g'\left(z^{(2)}\right)$$
where ⊙ is element-wise multiplication and the component belonging to the bias unit is dropped.
The first layer is the input layer and has no error term. The gradient contribution accumulated for each parameter matrix is
$$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}\left(a^{(l)}\right)^T$$
Finally, the (unregularized) gradient of the network is
$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}$$
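Next, let us derive where δ and Δ come from. The key point is to be clear about which parameters we want to optimize. Following the idea of gradient descent, we need the gradient of the cost function with respect to the two parameter matrices Θ^(1) and Θ^(2).
Assuming there is only one input sample, the cost function is
$$J(\theta) = -\sum_{k=1}^{K}\left[y_k\log(h_k) + (1 - y_k)\log(1 - h_k)\right]$$
The forward pass is (with a bias unit of 1 prepended to a^(1) and a^(2))
$$a^{(1)} = x,\quad z^{(2)} = \Theta^{(1)}a^{(1)},\quad a^{(2)} = g\left(z^{(2)}\right),\quad z^{(3)} = \Theta^{(2)}a^{(2)},\quad h = a^{(3)} = g\left(z^{(3)}\right)$$
Now we compute the gradient of the cost function with respect to the parameters. The core idea is the chain rule.
For Θ^(2), the chain rule gives
$$\frac{\partial J}{\partial\Theta^{(2)}} = \frac{\partial J}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial\Theta^{(2)}} = (h - y)\left(a^{(2)}\right)^T$$
Call the left-hand side Δ^(2) and the factor (h - y) the error δ^(3). This is the first backpropagation formula:
$$\delta^{(3)} = h - y$$
In essence, the error is the derivative of the cost function with respect to z, namely
$$\delta^{(l)} = \frac{\partial J}{\partial z^{(l)}}$$
Similarly, for Θ^(1)
$$\frac{\partial J}{\partial\Theta^{(1)}} = \frac{\partial J}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial\Theta^{(1)}} = \left[\left(\Theta^{(2)}\right)^T\delta^{(3)} \odot g'\left(z^{(2)}\right)\right]\left(a^{(1)}\right)^T = \delta^{(2)}\left(a^{(1)}\right)^T$$
In the second equality, the first two factors combine into
$$\frac{\partial J}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}} = \left(\Theta^{(2)}\right)^T\delta^{(3)}$$
and the remaining factors are
$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = g'\left(z^{(2)}\right),\qquad \frac{\partial z^{(2)}}{\partial\Theta^{(1)}} = a^{(1)}$$
The third equality just applies the definition of δ^(2).
This completes the key part of the derivation. The underlying principle is simply the chain rule, and it is not hard to follow once the forward pass is understood.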
The following is the code of the gradient function. Keeping track of the dimensions of every quantity saves a lot of detours.
def gradient(theta, X, y):
    m = X.shape[0]
    theta1, theta2 = deserialize(theta)
    a1, z2, a2, z3, h = forward_propagate(X, theta)
    delta3 = h - y  # (5000, 10)
    # Drop the bias column of theta2 when propagating the error back
    delta2 = (delta3 @ theta2[:, 1:]) * sigmoid_gradient(z2)  # (5000, 25)
    Delta2 = delta3.T @ a2 / m  # (10, 5000) @ (5000, 26) = (10, 26)
    Delta1 = delta2.T @ a1 / m  # (25, 5000) @ (5000, 401) = (25, 401)
    return Delta1, Delta2
Gradient checking
Gradient checking is used to verify that the backpropagation implementation is correct. In the neural network we are minimizing the cost function J(θ). To check its gradient, first unroll θ1 and θ2 into one long vector θ, and then apply the procedure below. It relies on approximation: the derivative at a point can be replaced by the slope of the line through two nearby points, provided the two points are close enough.
First form the perturbed vectors on either side of θ, where ε is a very small number and e_i is the i-th unit vector:
$$\theta^{(i+)} = \theta + \epsilon e_i,\qquad \theta^{(i-)} = \theta - \epsilon e_i$$
Then substitute them into the following formula to obtain the numerical estimate of the i-th component of the gradient:
$$f_i(\theta) \approx \frac{J\left(\theta^{(i+)}\right) - J\left(\theta^{(i-)}\right)}{2\epsilon}$$
The code runs very slowly, because it evaluates the cost twice for each of the 10285 parameters. Run with caution!
def gradient_checking(theta, X, y, e):
    def a_numeric_grad(plus, minus):
        """
        Numerical gradient for one parameter theta_i (central difference).
        """
        return (costReg(plus, X, y) - costReg(minus, X, y)) / (e * 2)

    numeric_grad = []
    for i in range(len(theta)):
        plus = theta.copy()
        minus = theta.copy()
        plus[i] = plus[i] + e
        minus[i] = minus[i] - e
        grad_i = a_numeric_grad(plus, minus)
        numeric_grad.append(grad_i)
    numeric_grad = np.array(numeric_grad)  # numerical estimate
    analytic_grad = gradientReg(theta, X, y, learning_rate)  # analytic gradient from backpropagation
    # Relative difference; it should be very small if backpropagation is correct
    diff = np.linalg.norm(numeric_grad - analytic_grad) / np.linalg.norm(numeric_grad + analytic_grad)
    print(diff)
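Because the full check is slow, it can help to try the same idea on a very small network first. The sketch below is one way to do that: it is self-contained, unregularized, and uses random data; all names (`tiny_cost`, `tiny_backprop`, `numeric_gradient`) and the layer sizes are assumptions made for this illustration.

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack_params(theta, n_in, n_hid, n_out):
    split = n_hid * (n_in + 1)
    return (theta[:split].reshape(n_hid, n_in + 1),
            theta[split:].reshape(n_out, n_hid + 1))

def tiny_cost(theta, X, Y, sizes):
    t1, t2 = unpack_params(theta, *sizes)
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), sig(a1 @ t1.T)])
    h = sig(a2 @ t2.T)
    return np.sum(-Y * np.log(h) - (1 - Y) * np.log(1 - h)) / m

def tiny_backprop(theta, X, Y, sizes):
    t1, t2 = unpack_params(theta, *sizes)
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    z2 = a1 @ t1.T
    a2 = np.hstack([np.ones((m, 1)), sig(z2)])
    h = sig(a2 @ t2.T)
    d3 = h - Y                                        # output-layer error
    d2 = (d3 @ t2[:, 1:]) * sig(z2) * (1 - sig(z2))   # hidden-layer error, bias dropped
    return np.concatenate([(d2.T @ a1 / m).ravel(), (d3.T @ a2 / m).ravel()])

def numeric_gradient(f, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

sizes = (3, 4, 2)   # hypothetical: 3 inputs, 4 hidden units, 2 classes
rng = np.random.default_rng(1)
X_small = rng.normal(size=(5, 3))
Y_small = np.eye(2)[rng.integers(0, 2, size=5)]
theta0 = rng.uniform(-0.5, 0.5, size=4 * 4 + 2 * 5)

num = numeric_gradient(lambda t: tiny_cost(t, X_small, Y_small, sizes), theta0)
ana = tiny_backprop(theta0, X_small, Y_small, sizes)
rel_diff = np.linalg.norm(num - ana) / np.linalg.norm(num + ana)
print(rel_diff < 1e-5)   # True
```

With only 26 parameters the loop finishes almost instantly, so the check can be rerun after every change to the backpropagation code.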
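Regularized neural network
The regularized gradient is
$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} \quad \text{for } j = 0$$
$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)} \quad \text{for } j \ge 1$$
def gradientReg(theta, X, y, learning_rate=1):
    m = X.shape[0]
    D1, D2 = gradient(theta, X, y)
    # Copy so that zeroing the bias columns does not modify the optimizer's parameter vector
    theta1, theta2 = (t.copy() for t in deserialize(theta))
    # Do not penalize the bias units
    theta1[:, 0] = 0
    theta2[:, 0] = 0
    regD1 = D1 + (learning_rate / m) * theta1
    regD2 = D2 + (learning_rate / m) * theta2
    return serialize(regD1, regD2)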
Parameter optimization
Here we optimize the parameters with an advanced optimization method, using the minimize function from scipy.optimize.
fmin = minimize(fun=costReg, x0=params, args=(X, y_onehot, learning_rate),
method='TNC', jac=gradientReg, options={'maxiter': 400})
The result is as follows
     fun: 0.5064413657213123
     jac: array([-1.29134381e-04, -2.11248326e-12,  4.38829369e-13, ...,
                 -2.98454162e-05, -1.96204232e-03, -1.77461205e-04])
 message: 'Converged (|f_n-f_(n-1)| ~= 0)'
    nfev: 139
     nit: 13
  status: 1
 success: True
       x: array([-0.0623484 , -0.06471579, -0.05614958, ..., -2.86694064,
                  0.87384526,  0.43249548])
Next, use the optimized parameters to make predictions.
# Compute predictions with the optimized theta
a1, z2, a2, z3, h = forward_propagate(X, fmin.x)
y_pred = np.array(np.argmax(h, axis=1) + 1)  # +1 because the labels run from 1 to 10
print(classification_report(y, y_pred))
              precision    recall  f1-score   support

           1       0.96      0.98      0.97       500
           2       0.97      0.97      0.97       500
           3       0.96      0.94      0.95       500
           4       0.96      0.98      0.97       500
           5       0.96      0.96      0.96       500
           6       0.98      0.98      0.98       500
           7       0.95      0.97      0.96       500
           8       0.97      0.96      0.97       500
           9       0.97      0.94      0.96       500
          10       0.99      0.99      0.99       500

    accuracy                           0.97      5000
   macro avg       0.97      0.97      0.97      5000
weighted avg       0.97      0.97      0.97      5000
It can be seen that the accuracy on the training set reaches 97% after parameter optimization.
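The prediction step is just an argmax over the output layer, shifted by one because the labels start at 1. A minimal sketch on made-up activations, with `predict_labels` as a hypothetical helper name:

```python
import numpy as np

def predict_labels(h):
    """Pick the most probable class; column 0 corresponds to label 1."""
    return np.argmax(h, axis=1) + 1

# Hypothetical output-layer activations for three samples and three classes
h_demo = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.3, 0.5]])
y_true = np.array([2, 1, 2])

pred = predict_labels(h_demo)
accuracy = np.mean(pred == y_true)   # fraction of samples predicted correctly
print(pred)       # [2 1 3]
print(accuracy)
```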
Visualize hidden layers
A good way to understand what a neural network has learned is to visualize what the hidden layer units capture. For our trained network, each row of θ1 is a 401-dimensional vector holding the parameters of one hidden unit. If we ignore the bias term, we get a 400-dimensional vector containing the weight from each input pixel to that hidden unit. One way to visualize it is therefore to reshape the 400-dimensional vector into a (20, 20) image and display it. (I do not fully understand this part yet.)
thetafinal1, thetafinal2 = deserialize(fmin.x)
hidden_layer = thetafinal1[:, 1:]
plot_100_image(hidden_layer)
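Since there are only 25 hidden units, another option is to tile them into a 5×5 grid so that each unit appears exactly once, whereas `plot_100_image` draws 100 rows at random with replacement. The sketch below only builds the image array; `tile_units` is a hypothetical helper and the random weights stand in for the trained ones.

```python
import numpy as np

def tile_units(weights, side, grid_rows, grid_cols):
    """Arrange each row of `weights` as a (side, side) tile in one big image array."""
    canvas = np.zeros((grid_rows * side, grid_cols * side))
    for idx, unit in enumerate(weights[:grid_rows * grid_cols]):
        r, c = divmod(idx, grid_cols)
        # Transpose each tile, as the plotting code above does for the digit images
        canvas[r * side:(r + 1) * side, c * side:(c + 1) * side] = unit.reshape(side, side).T
    return canvas

# Placeholder weights standing in for the trained theta1 without its bias column
fake_hidden = np.random.default_rng(0).normal(size=(25, 400))
grid = tile_units(fake_hidden, side=20, grid_rows=5, grid_cols=5)
print(grid.shape)   # (100, 100)
```

The resulting array could then be shown with `plt.matshow(grid, cmap='gray_r')`.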