[Study Notes] Introduction to Deep Learning: Theory and Implementation Based on Python - Neural Network

3. Neural Network

3.1 From Perceptron to Neural Network

If a neural network is represented as a graph, as shown in the figure below, we call the leftmost column the input layer, the rightmost column the output layer, and the middle column the middle layer (hidden layer).

[Figure: a neural network drawn as a graph, with an input layer on the left, a hidden layer in the middle, and an output layer on the right]

In the network shown above, the bias $b$ is not drawn. If we want to show $b$ explicitly, we can draw the network as in the figure below: $b$ is added as the weight on an input signal that is always $1$. This perceptron takes the three signals $x_1, x_2, 1$ as inputs to the neuron, multiplies each by its weight, and sends them to the next neuron. The next neuron computes the sum of these weighted signals; if the sum exceeds $0$ it outputs $1$, otherwise it outputs $0$. Also, since the bias input signal is always $1$, that neuron is colored gray in the figure to distinguish it from the other neurons.

[Figure: the perceptron with the bias drawn explicitly, as a gray neuron whose input signal is always 1 with weight $b$]

We can express this behavior (output $1$ if the sum exceeds $0$, otherwise output $0$) with a function: $y = h(b + w_1 x_1 + w_2 x_2)$, where the function $h(x)$ is given by the following formula:

$$h(x) = \begin{cases} 0 & (x \le 0) \\ 1 & (x > 0) \end{cases}$$

The function $h(x)$ converts the sum of the input signals into an output signal; such a function is generally called an activation function. The role of the activation function is to decide how to activate the sum of the input signals.

The above formula can be refined into the following two formulas:

$$a = b + w_1 x_1 + w_2 x_2$$

$$y = h(a)$$

The calculation process of the activation function is shown in the figure below:

[Figure: the two-step computation inside a neuron: the weighted sum $a = b + w_1 x_1 + w_2 x_2$ is computed first, then the activation function converts it into the output $y = h(a)$]
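This two-step computation can be traced directly in NumPy. Below is a minimal sketch, using the step function as $h$ and arbitrary illustrative values for the weights and bias (the values are not taken from the text):

import numpy as np

def h(x):
	# step function: 1 if x > 0, otherwise 0
	return int(x > 0)

x = np.array([1.0, 1.0])   # input signals x1, x2
w = np.array([0.5, 0.5])   # weights w1, w2 (illustrative values)
b = -0.7                   # bias (illustrative value)

a = b + np.sum(w * x)      # step 1: a = b + w1*x1 + w2*x2
y = h(a)                   # step 2: y = h(a)
print(a, y)                # roughly 0.3 1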

3.2 Activation function

An activation function often used in neural networks is the sigmoid function, expressed by the following formula:

$$h(x) = \frac{1}{1 + \exp(-x)}$$

Before implementing it, let's first draw the graph of the step function, which outputs $1$ when the input exceeds $0$ and outputs $0$ otherwise. The step function can be implemented simply as follows:

import numpy as np

# The argument must be a scalar (a single real number)
def step_function(x):
	if x > 0:
		return 1
	else:
		return 0

# The argument can also be a NumPy array
def step_function(x):
	y = x > 0
	return y.astype(int)  # np.int is deprecated in recent NumPy; use the built-in int
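A quick check of the NumPy version: the comparison x > 0 produces a boolean array, which astype(int) converts into 0/1 integers:

x = np.array([-1.0, 1.0, 2.0])
y = x > 0
print(y)              # [False  True  True]
print(y.astype(int))  # [0 1 1]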

Then draw the function image:

import numpy as np
import matplotlib.pylab as plt

def step_function(x):
	return np.array(x > 0, dtype=int)  # dtype=np.int is deprecated; use int
x = np.arange(-5.0, 5.0, 0.1)
y = step_function(x)
plt.plot(x, y)
plt.ylim(-0.1, 1.1)  # specify the range of the y-axis
plt.show()

The result is shown in the figure below:

[Figure: graph of the step function; the output jumps from 0 to 1 at x = 0]

Next, we implement the sigmoid function:

def sigmoid(x):
	return 1 / (1 + np.exp(-x))

and graph the function:

x = np.arange(-5.0, 5.0, 0.1)
y = sigmoid(x)
plt.plot(x, y)
plt.ylim(-0.1, 1.1)  # specify the range of the y-axis
plt.show()

The result is shown in the figure below:

[Figure: graph of the sigmoid function, a smooth S-shaped curve rising from 0 to 1]

The sigmoid function is a smooth curve whose output changes continuously with the input, whereas the step function changes abruptly at the boundary $x = 0$. The smoothness of the sigmoid function is of great significance for the learning of neural networks.

Whereas the step function can only return $0$ or $1$, the sigmoid function can return real numbers such as $0.731\dots$ or $0.880\dots$ (this is related to the smoothness just mentioned). In other words, only a binary signal of $0$ or $1$ flows between the neurons of a perceptron, while a continuous real-valued signal flows in a neural network.

Although the step function and the sigmoid function differ in smoothness, they have similar shapes: when the input is small, the output is close to $0$ (or equals $0$); as the input grows, the output approaches $1$ (or becomes $1$). That is, when the input signal carries important information, both the step function and the sigmoid function output a large value; when the input signal is unimportant, both output a small value.
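To see this similarity in shape directly, the two functions can be drawn on the same axes. A small sketch reusing the step_function and sigmoid defined above:

x = np.arange(-5.0, 5.0, 0.1)
plt.plot(x, step_function(x), linestyle='--', label='step')  # dashed line for the step function
plt.plot(x, sigmoid(x), label='sigmoid')
plt.ylim(-0.1, 1.1)
plt.legend()
plt.show()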

The step function and the sigmoid function have another thing in common: both are nonlinear functions. The activation function of a neural network must be a nonlinear function; in other words, it cannot be a linear function. Why not? Because with a linear activation function, deepening the network is meaningless: if $h(x) = cx$, a three-layer network computes $y = h(h(h(x))) = c^3 x$, which a single layer with weight $c^3$ can already represent.

Next, we introduce another very important activation function: the ReLU (Rectified Linear Unit) function. The ReLU function outputs its input directly when the input is greater than $0$, and outputs $0$ when the input is less than or equal to $0$. It can be expressed by the following formula:

$$h(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$$

Its code implementation and function image are as follows:

def relu(x):
	return np.maximum(0, x)

[Figure: graph of the ReLU function; it is 0 for x ≤ 0 and equal to x for x > 0]

3.3 Operations on multidimensional arrays

A multidimensional array is simply a "collection of numbers": numbers arranged in a line, in a rectangle, in three dimensions, or (more generally) in $N$ dimensions are all called multidimensional arrays.

A = np.array([1, 2, 3, 4])
np.ndim(A)  # 1, get the number of dimensions of the array
A.shape  # (4,)
A.shape[0]  # 4

B = np.array([[1, 2], [3, 4], [5, 6]])
np.ndim(B)  # 2
B.shape  # (3, 2)

Next, let's introduce the product of matrices (two-dimensional arrays). For example, the product of two $2 \times 2$ matrices can be calculated as shown below:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$

The product of two matrices is obtained by multiplying the rows of the left matrix (horizontal) with the columns of the right matrix (vertical) element by element and summing. This operation can be implemented in Python with the following code:

A = np.array([[1, 2], [3, 4]])
A.shape  # (2, 2)
B = np.array([[5, 6], [7, 8]])
B.shape  # (2, 2)
np.dot(A, B)  # array([[19, 22], [43, 50]]); dot() computes the matrix (dot) product

It should be noted that in the product of multidimensional arrays, the number of elements in the corresponding dimensions of the two matrices must match: the number of columns of the first matrix (dimension 1) must equal the number of rows of the second matrix (dimension 0), as shown in the following figure:

[Figure: for the product of a (3, 2) matrix and a (2, 4) matrix, the inner dimensions (2 and 2) must match, and the result has shape (3, 4)]
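If the dimensions do not match, NumPy raises an error. For example, multiplying a matrix of shape (2, 3) by one of shape (2, 2) fails because 3 ≠ 2:

A = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
C = np.array([[1, 2], [3, 4]])        # shape (2, 2)
np.dot(A, C)  # raises ValueError: dimension 1 of A (3) does not match dimension 0 of C (2)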

Below we use NumPy matrices to implement a neural network. Here we take the simple network in the figure below as the example; it omits biases and activation functions and has only weights. A minimal implementation is sketched after the figure.

[Figure: a simple neural network with 2 input neurons and 3 output neurons, weights only]
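A minimal sketch of this weight-only network, assuming 2 inputs, 3 outputs, and arbitrary example weights (the values are illustrative, not taken from the figure):

X = np.array([1.0, 0.5])                          # input signals x1, x2
W = np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]])  # weights, shape (2, 3)
Y = np.dot(X, W)                                  # output signals y1, y2, y3
print(Y)  # approximately [0.2 0.5 0.8]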

3.4 Realization of three-layer neural network

Before introducing the processing in the neural network, let's first introduce symbols such as $w_{12}^{(1)}$ and $a_1^{(1)}$. Look at the figure below, which highlights only the weight from the input-layer neuron $x_2$ to the neuron $a_1^{(1)}$ of the next layer. The superscript $(1)$ on a weight or a hidden-layer neuron indicates its layer number (i.e., a weight of layer 1, a neuron of layer 1). The two numbers in the subscript of a weight are the index of the neuron in the next layer followed by the index of the neuron in the previous layer. For example, $w_{12}^{(1)}$ denotes the weight from the 2nd neuron $x_2$ of the previous layer to the 1st neuron $a_1^{(1)}$ of the next layer.

[Figure: the weight $w_{12}^{(1)}$ from the input neuron $x_2$ to the first neuron $a_1^{(1)}$ of layer 1]

Now look at the signal transmission from the input layer to the 1st neuron of layer 1, shown in the figure below:

[Figure: signal transmission from the input layer (including the bias) to the first neuron of layer 1]

Expressed as a mathematical formula, $a_1^{(1)}$ is calculated as the sum of the weighted signals and the bias: $a_1^{(1)} = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + b_1^{(1)}$

Using matrix multiplication, the weighted sums of layer 1 can be written compactly as follows:

$$A^{(1)} = X W^{(1)} + B^{(1)}$$

where $A^{(1)} = (a_1^{(1)}\ a_2^{(1)}\ a_3^{(1)})$, $X = (x_1\ x_2)$, $B^{(1)} = (b_1^{(1)}\ b_2^{(1)}\ b_3^{(1)})$, and

$$W^{(1)} = \begin{pmatrix} w_{11}^{(1)} & w_{21}^{(1)} & w_{31}^{(1)} \\ w_{12}^{(1)} & w_{22}^{(1)} & w_{32}^{(1)} \end{pmatrix}$$

Next, we use NumPy multidimensional arrays to implement the above formula, where the input signal, weight, and bias are set to arbitrary values:

X = np.array([1.0, 0.5])
W1 = np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]])
B1 = np.array([0.1, 0.2, 0.3])
print(W1.shape)  # (2, 3)
print(X.shape)  # (2,)
print(B1.shape) # (3,)
A1 = np.dot(X, W1) + B1

Next, we look at the computation of the activation function in layer 1. Represented as a graph, the process looks like the following figure:

[Figure: within layer 1, the weighted sums $a$ are converted by the activation function $h()$ into the signals $z$]

The weighted sum of the hidden layer (the sum of the weighted signals and the bias) is denoted by $a$, and the signal after conversion by the activation function is denoted by $z$. In the figure, $h()$ represents the activation function; here we use the sigmoid function. Implemented in Python, the code is as follows:

Z1 = sigmoid(A1)
print(A1)  # [0.3, 0.7, 1.1]
print(Z1)  # [0.57444252, 0.66818777, 0.75026011]

Next, let's implement the signal transmission from layer 1 to layer 2:

W2 = np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]])
B2 = np.array([0.1, 0.2])
print(Z1.shape) # (3,)
print(W2.shape) # (3, 2)
print(B2.shape) # (2,)
A2 = np.dot(Z1, W2) + B2
Z2 = sigmoid(A2)

[Figure: signal transmission from layer 1 to layer 2]

Last is the signal transmission from layer 2 to the output layer. The implementation of the output layer is basically the same as the previous ones; the only difference is that its activation function differs from that of the hidden layers. Here we define identity_function() (the "identity function") and use it as the activation function of the output layer. The identity function outputs its input as-is, so strictly speaking there is no need to define identity_function(); we do so only to keep the implementation consistent with the previous steps.

def identity_function(x):
	return x
W3 = np.array([[0.1, 0.3], [0.2, 0.4]])
B3 = np.array([0.1, 0.2])
A3 = np.dot(Z2, W3) + B3
Y = identity_function(A3)  # or simply Y = A3

[Figure: signal transmission from layer 2 to the output layer]

So far, we have walked through the implementation of a 3-layer neural network. Now let's collect all the previous code. Following the usual convention of neural network implementations, only the weights are written with capital letters such as W1; everything else (biases, intermediate results, etc.) is written in lowercase.

def init_network():
	network = {}
	network['W1'] = np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]])
	network['b1'] = np.array([0.1, 0.2, 0.3])
	network['W2'] = np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]])
	network['b2'] = np.array([0.1, 0.2])
	network['W3'] = np.array([[0.1, 0.3], [0.2, 0.4]])
	network['b3'] = np.array([0.1, 0.2])
	return network

def forward(network, x):
	W1, W2, W3 = network['W1'], network['W2'], network['W3']
	b1, b2, b3 = network['b1'], network['b2'], network['b3']
	a1 = np.dot(x, W1) + b1
	z1 = sigmoid(a1)
	a2 = np.dot(z1, W2) + b2
	z2 = sigmoid(a2)
	a3 = np.dot(z2, W3) + b3
	y = identity_function(a3)
	return y

network = init_network()
x = np.array([1.0, 0.5])
y = forward(network, x)
print(y)  # [ 0.31682708 0.69627909]

3.5 Design of output layer

Neural networks can be used for both classification and regression problems, but the activation function of the output layer has to be chosen accordingly. In general, the identity function is used for regression problems and the softmax function is used for classification problems.

The identity function outputs its input as-is, passing the input information along without any modification. The softmax function used in classification problems can be expressed by the following formula:

$$y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n} \exp(a_i)}$$

Implemented in Python:

def softmax(a):
	exp_a = np.exp(a)
	sum_exp_a = np.sum(exp_a)
	y = exp_a / sum_exp_a
	return y

Looking at this code, there is an overflow problem: the exponential function can easily produce extremely large values, and dividing such huge values by each other yields "indeterminate" results (nan). So we improve the formula as follows:

$$y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n} \exp(a_i)} = \frac{C \exp(a_k)}{C \sum_{i=1}^{n} \exp(a_i)} = \frac{\exp(a_k + \log C)}{\sum_{i=1}^{n} \exp(a_i + \log C)} = \frac{\exp(a_k + C')}{\sum_{i=1}^{n} \exp(a_i + C')}$$

This shows that adding (or subtracting) some constant to all inputs of the exponential function in softmax does not change the result. $C'$ can be any value, but to prevent overflow, the maximum value of the input signal is generally subtracted. For example:

a = np.array([1010, 1000, 990])
np.exp(a) / np.sum(np.exp(a))  # the softmax computation
# returns array([nan, nan, nan]); not computed correctly

c = np.max(a)  # 1010
a - c  # array([0, -10, -20])
np.exp(a - c) / np.sum(np.exp(a - c))
# returns array([9.99954600e-01, 4.53978686e-05, 2.06106005e-09])

To sum up, we can implement the softmax function as follows:

def softmax(a):
	c = np.max(a)
	exp_a = np.exp(a - c)  # countermeasure against overflow
	sum_exp_a = np.sum(exp_a)
	y = exp_a / sum_exp_a
	return y

The output of the softmax function is a real number between $0.0$ and $1.0$, and the outputs sum to $1$. This sum-to-$1$ property is an important property of softmax: because of it, the output of the softmax function can be interpreted as a "probability".
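A quick check of this property using the softmax defined above:

a = np.array([0.3, 2.9, 4.0])
y = softmax(a)
print(y)          # [0.01821127 0.24519181 0.73659691]
print(np.sum(y))  # 1.0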

The number of output-layer neurons needs to be chosen according to the problem to be solved. For classification problems, it is generally set to the number of classes. For example, for the problem of predicting which of the digits $0$ to $9$ an input image shows (a 10-class classification problem), the output layer is given $10$ neurons (in the figure below, $y_2$ has the largest output value).

[Figure: an output layer with 10 neurons corresponding to the digits 0–9; the neuron $y_2$ has the largest output value]

3.6 Handwritten digit recognition

Assuming that the learning (training) of the neural network has already been completed, we now use the learned parameters to implement the neural network's inference processing. This inference process is also called the forward propagation of the neural network.

The dataset used here is the MNIST dataset of images of handwritten digits. MNIST is one of the most famous datasets in machine learning, used in everything from simple experiments to published research papers.

The image data of MNIST are $28 \times 28$ pixel grayscale images ($1$ channel), with each pixel taking a value between $0$ and $255$. Each image is labeled with the corresponding digit, such as "7", "2", or "1".

Assume a convenient Python script mnist.py has been provided that handles everything from downloading the MNIST dataset to converting the data into NumPy arrays. Using the load_mnist() function from mnist.py, the MNIST data can be read in easily as follows.

import sys, os
sys.path.append(r'D:\VS Code Project\Deep Learning')  # setting for importing files from the parent directory (raw string keeps the backslashes literal)
from dataset.mnist import load_mnist

# the first call takes several minutes ...
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)

# print the shape of each piece of data
print(x_train.shape)  # (60000, 784)
print(t_train.shape)  # (60000,)
print(x_test.shape)  # (10000, 784)
print(t_test.shape)  # (10000,)

The load_mnist function returns the MNIST data in the form (training images, training labels), (test images, test labels). It also takes three arguments, as in load_mnist(normalize=True, flatten=True, one_hot_label=False). The first argument, normalize, sets whether to normalize the input images to values in $0.0 \sim 1.0$; if it is set to False, the pixels keep their original values in $0 \sim 255$. The second argument, flatten, sets whether to flatten each input image into a one-dimensional array; if it is set to False, each image is a $1 \times 28 \times 28$ three-dimensional array, and if it is set to True, each image is stored as a one-dimensional array of $784$ elements. The third argument, one_hot_label, sets whether to store the labels in one-hot representation, i.e., an array in which only the position of the correct label is $1$ and all other positions are $0$, such as [0,0,1,0,0,0,0,0,0,0]. When one_hot_label is False, the labels are simply stored as the digits themselves, such as 7 or 2; when it is True, the labels are stored as one-hot arrays.
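For example, assuming the same dataset.mnist script, switching on one_hot_label should turn each label into a length-10 one-hot row (a sketch; the shapes and values shown are what one would expect, not verified output):

(x_train, t_train), _ = load_mnist(flatten=True, normalize=False, one_hot_label=True)
print(t_train.shape)  # expected: (60000, 10), one one-hot row per training image
print(t_train[0])     # e.g. [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] since the first label is 5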

Next use the PIL module to display the first image of the training images:

import sys, os
sys.path.append(r'D:\VS Code Project\Deep Learning')  # raw string keeps the backslashes literal
import numpy as np
from dataset.mnist import load_mnist
from PIL import Image

def img_show(img):
	pil_img = Image.fromarray(np.uint8(img))
	pil_img.show()

(x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)
img = x_train[0]
label = t_train[0]
print(label)  # 5
print(img.shape)  # (784,)
img = img.reshape(28, 28)  # restore the image to its original shape
print(img.shape)  # (28, 28)
img_show(img)

The displayed results are shown in the figure below:

[Figure: the first training image, the handwritten digit 5]

The input layer of this neural network has $784$ neurons and the output layer has $10$ neurons. The number $784$ comes from the image size $28 \times 28 = 784$, and $10$ comes from the 10-class classification task (the digits $0 \sim 9$). Furthermore, this network has two hidden layers: the first hidden layer has $50$ neurons and the second has $100$. The values $50$ and $100$ can be set to any number.

Let us first define three functions:

import pickle

def get_data():
	(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, flatten=True, one_hot_label=False)
	return x_test, t_test

def init_network():
	with open("sample_weight.pkl", 'rb') as f:
		network = pickle.load(f)
	return network

def predict(network, x):
	W1, W2, W3 = network['W1'], network['W2'], network['W3']
	b1, b2, b3 = network['b1'], network['b2'], network['b3']
	a1 = np.dot(x, W1) + b1
	z1 = sigmoid(a1)
	a2 = np.dot(z1, W2) + b2
	z2 = sigmoid(a2)
	a3 = np.dot(z2, W3) + b3
	y = softmax(a3)
	return y

init_network() reads in the learned weight parameters saved in the pickle file sample_weight.pkl. This file stores the weight and bias parameters as a dictionary. The other two functions are basically the same as the code introduced earlier, so they need no further explanation. Now we use these three functions to implement the inference processing of the neural network and evaluate its recognition accuracy:

x, t = get_data()
network = init_network()
accuracy_cnt = 0
for i in range(len(x)):  # take out the images stored in x one by one
	y = predict(network, x[i])
	p = np.argmax(y)  # get the index of the element with the highest probability
	if p == t[i]:
		accuracy_cnt += 1
print("Accuracy:" + str(float(accuracy_cnt) / len(x)))
# Accuracy:0.9352

Next, we use the Python interpreter to check the shapes of the weights in each layer of the neural network above:

x, _ = get_data()
network = init_network()
W1, W2, W3 = network['W1'], network['W2'], network['W3']
x.shape  # (10000, 784)
x[0].shape  # (784,)
W1.shape  # (784, 50)
W2.shape  # (50, 100)
W3.shape  # (100, 10)

Confirm that the matrix shapes line up:

[Figure: the shapes of the arrays in the forward pass: X (784,), W1 (784, 50), W2 (50, 100), W3 (100, 10), Y (10,)]

Now let's consider packing multiple input images together. For example, suppose we want the predict() function to process $100$ images at once. To do this, we can change the shape of $x$ to $100 \times 784$, packing the $100$ images together as a single input. Such bundled input data is called a batch. "Batch" means "bundle": the images are bundled together like a stack of banknotes, as shown in the following figure:

[Figure: the array shapes in batch processing: X (100, 784), W1 (784, 50), W2 (50, 100), W3 (100, 10), Y (100, 10)]

Let's implement the batch-based code below:

x, t = get_data()
network = init_network()
batch_size = 100  # batch size
accuracy_cnt = 0
for i in range(0, len(x), batch_size):
	x_batch = x[i:i + batch_size]
	y_batch = predict(network, x_batch)
	p = np.argmax(y_batch, axis=1)
	accuracy_cnt += np.sum(p == t[i:i + batch_size])
print("Accuracy:" + str(float(accuracy_cnt) / len(x)))

Next section: [Study Notes] Introduction to Deep Learning: Theory and Implementation Based on Python - Learning of Neural Networks .

Origin blog.csdn.net/m0_51755720/article/details/128129048