"Handwritten Digit Recognition" Neural Network Study Notes

This article mainly refers to the book "Introduction to Deep Learning: Theory and Implementation with Python" (also known in English as "Deep Learning from Scratch").

"Handwritten digit recognition" can be said to be the "hello world" in the machine learning world. This article will briefly explain how to manually implement a neural network without using a mature machine learning library.

First, we need to import the training and test sets:

"""load_mnist(normalize=True, flatten=True, one_hot_label=False)
	读入MNIST数据集
    
    Parameters
    ----------
    normalize : 将图像的像素值正规化为0.0~1.0
    one_hot_label : 
        one_hot_label为True的情况下,标签作为one-hot数组返回
        one-hot数组是指[0,0,1,0,0,0,0,0,0,0]这样的数组
    flatten : 是否将图像展开为一维数组
    
    Returns
    -------
    (训练图像, 训练标签), (测试图像, 测试标签)
    """

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
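As a quick sanity check, with the default flatten=True the loaded arrays should have shapes along these lines (60,000 training images and 10,000 test images):

print(x_train.shape)  # (60000, 784) - 60,000 flattened 28x28 training images
print(t_train.shape)  # (60000, 10)  - one-hot labels
print(x_test.shape)   # (10000, 784)
print(t_test.shape)   # (10000, 10)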

Next, we want to implement a two-layer network: it takes the flattened image as input, processes it through a hidden layer, and finally outputs which of the digits 0-9 the image shows.

The process described above is called forward propagation, and it produces a prediction. That prediction is not necessarily correct, however, so we also need something called backpropagation to correct the model as it learns.

In our example the network consists of four layers: an Affine layer, a Relu layer, a second Affine layer, and a SoftmaxWithLoss layer. (A sketch of how they are wired together follows the list below.)

  • Affine layer: computes y = xW + b on the input matrix. Geometrically this is a linear transformation followed by a translation, which is why it is called an affine transformation layer.
  • Relu layer: the activation-function layer. It transforms the previous layer's output element-wise: values greater than 0 pass through unchanged, and values less than or equal to 0 become 0. For why an activation function is needed instead of passing the result straight through, see https://zhuanlan.zhihu.com/p/165194685
  • SoftmaxWithLoss layer: only needed for training; for inference alone it can be omitted. Softmax normalizes the output (so that the output values sum to 1), and the result is combined with the loss for the backpropagation calculation.
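For reference, here is a minimal sketch of how these layers can be wired together into the TwoLayerNet used later in this article. The attribute names (self.params, self.layers, self.lastLayer) match the ones that appear in the snippets below, and Affine, Relu and SoftmaxWithLoss are the layer classes whose forward and backward methods are shown next:

    import numpy as np
    from collections import OrderedDict

    class TwoLayerNet:
        def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
            # weights start as small random values, biases as zeros
            self.params = {}
            self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
            self.params['b1'] = np.zeros(hidden_size)
            self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
            self.params['b2'] = np.zeros(output_size)

            # the layers, in forward-propagation order
            self.layers = OrderedDict()
            self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
            self.layers['Relu1'] = Relu()
            self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
            self.lastLayer = SoftmaxWithLoss()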

Let's first look at the implementation of forward propagation:

  1. Forward propagation of the Affine layer:
    def forward(self, x):
        # handle tensor inputs: remember the original shape and flatten everything after the batch dimension
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)
        self.x = x

        out = np.dot(self.x, self.W) + self.b

        return out

Very simple: it is just y = xW + b, a matrix product plus a bias.
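For completeness, this forward pass assumes a constructor along these lines, which simply stores the weights, the bias, and placeholders for the values that backward() will later need (a sketch):

    class Affine:
        def __init__(self, W, b):
            self.W = W
            self.b = b
            self.x = None
            self.original_x_shape = None
            # gradients with respect to W and b, filled in by backward()
            self.dW = None
            self.db = None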

  2. Forward propagation of the Relu layer:
    def forward(self, x):
        self.mask = (x <= 0)
        out = x.copy()
        # equivalent to out[x <= 0] = 0; self.mask is computed above only so the result can be reused in backpropagation
        out[self.mask] = 0   

        return out

The Relu layer is also easy to understand: elements greater than 0 are kept unchanged, and elements less than or equal to 0 become 0.

Note that the operation here relies on NumPy's boolean-mask indexing.
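A tiny example of this boolean-mask indexing:

    import numpy as np

    x = np.array([[ 1.0, -0.5],
                  [-2.0,  3.0]])
    mask = (x <= 0)      # [[False  True]
                         #  [ True False]]
    out = x.copy()
    out[mask] = 0        # zeroes exactly the masked entries
    print(out)           # [[1. 0.]
                         #  [0. 3.]]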

  3. Forward propagation of SoftmaxWithLoss:
    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        
        return self.loss

Here t is the supervision (label) data; the loss value is obtained by computing the cross-entropy error between the softmax output and the labels.
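The helper functions softmax and cross_entropy_error are not shown in this article; a minimal sketch of them, assuming the labels are one-hot arrays as we load them here, could look like this:

    import numpy as np

    def softmax(x):
        # subtract the row-wise maximum for numerical stability, then normalize
        x = x - np.max(x, axis=-1, keepdims=True)
        exp_x = np.exp(x)
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def cross_entropy_error(y, t):
        # mean cross-entropy over the batch; the small constant avoids log(0)
        if y.ndim == 1:
            y = y.reshape(1, y.size)
            t = t.reshape(1, t.size)
        batch_size = y.shape[0]
        return -np.sum(t * np.log(y + 1e-7)) / batch_size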

Full forward propagation is then very simple: just call each layer's forward function in turn:

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x
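The training code at the end also calls loss() and accuracy() on the network. Minimal versions, written as methods of the same network class on top of predict(), might look like this (a sketch matching the attribute names used elsewhere in this article):

    def loss(self, x, t):
        # full forward pass, ending in the SoftmaxWithLoss layer
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1:               # one-hot labels -> class indices
            t = np.argmax(t, axis=1)
        return np.sum(y == t) / float(x.shape[0])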

Don't forget: as mentioned earlier, training the network also requires backpropagation. So what is backpropagation?

Backpropagation is essentially the process of computing the derivatives of the loss with respect to the parameters, and its purpose is to minimize the loss function. Besides backpropagation, the gradient can also be obtained by numerical differentiation, but that approach is computationally expensive, so we choose backpropagation.
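For comparison, the numerical-differentiation approach estimates every partial derivative with a central difference. A sketch is below; since it re-evaluates the loss twice for every single weight, it is far slower than backpropagation and is normally only used to double-check the backpropagated gradients:

    import numpy as np

    def numerical_gradient(f, x):
        # central difference (f(x+h) - f(x-h)) / (2h), element by element
        h = 1e-4
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            tmp = x[idx]
            x[idx] = tmp + h
            fxh1 = f(x)
            x[idx] = tmp - h
            fxh2 = f(x)
            grad[idx] = (fxh1 - fxh2) / (2 * h)
            x[idx] = tmp   # restore the original value
            it.iternext()
        return grad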

First implement the backpropagation algorithm for the following layers:

  1. Backpropagation of the Affine layer:
    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        dx = dx.reshape(*self.original_x_shape)  # restore the original input shape (for tensor inputs)
        return dx
  2. Backpropagation of the Relu layer:
    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout

        return dx
  3. Backpropagation of SoftmaxWithLoss:
    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        if self.t.size == self.y.size:  # when the labels are one-hot vectors
            dx = (self.y - self.t) / batch_size
        else:
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1
            dx = dx / batch_size

        return dx

Next, we use these backpropagations to compute the gradient:

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # collect the gradients computed by each Affine layer
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W2'], grads['b2'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads

With all the layers in place, we can now write the training code:

# load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
# build a two-layer network: 784 input neurons, 50 hidden neurons, 10 output neurons
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

The 784 of the input layer comes from 28*28, the size of a single image: each image is flattened into a one-dimensional array before being fed in. The size of the hidden layer can be chosen freely.
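For example, a quick check of what flattening does to one image:

import numpy as np

img = np.zeros((28, 28))        # a dummy 28x28 image
print(img.reshape(-1).shape)    # (784,) - this is what flatten=True feeds into the network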

# number of parameter updates
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

Next, define some hyperparameters, i.e. parameters that are set by hand rather than learned:

  • iters_num: the number of training iterations (parameter updates)
  • train_size: the total number of training samples
  • batch_size: the size of each batch. Training is done in batches because feeding a single sample of shape (784,) at a time is too inefficient. If we feed 100 samples at once, they are passed in as a matrix of shape (100, 784) and the output has shape (100, 10), so NumPy's matrix operations can do the work efficiently, which greatly speeds up training. This training method is called mini-batch learning.
  • learning_rate: the learning rate, i.e. how far the parameters are moved along the direction of gradient descent on each update

train_loss_list = []
train_acc_list = []
test_acc_list = []

These are mainly for recording training results.

iter_per_epoch = max(train_size / batch_size, 1)

During training, our data may cover only a certain kind of sample; that is, the training set may not contain all the variation the task requires. For example, if we take a neural network trained on 10,000 cursive digits and test it on digits written in regular script, the accuracy will drop sharply. This phenomenon is called overfitting. To detect it in time during training, it is common to prepare two sets of data: one for training and one for testing.

We define the variable iter_per_epoch so that, once per epoch, we can compare the accuracy on the training data with the accuracy on the test data and check whether they diverge significantly.

The complete code for training is as follows:

# load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
# build a two-layer network: 784 input neurons, 50 hidden neurons, 10 output neurons
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
# number of parameter updates
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []
train_acc_list = []
test_acc_list = []
# number of updates after which every training sample has been used once; e.g. with 10,000 training samples and a batch size of 100, every 100 updates is one epoch
iter_per_epoch = max(train_size / batch_size, 1)

for i in range(iters_num):
    # pick batch_size random indices
    batch_mask = np.random.choice(train_size, batch_size)
    # training inputs for this batch
    x_batch = x_train[batch_mask]
    # corresponding labels
    t_batch = t_train[batch_mask]

    # compute the gradients by backpropagation
    #grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch)
    
    # update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
        
    # record the loss after this parameter update
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)
    
    # once per epoch, compare the accuracy on the training data with the accuracy on the test data, to catch overfitting before it goes unnoticed
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print(train_acc, test_acc)

Original post: blog.csdn.net/qq_17758883/article/details/109212924