Deep Learning Neural Networks That Beginners Can Understand (1)

Statement: I carefully studied the deep learning videos made by 3Blue1Brown and compiled my notes into this document. Since the article is limited in length, and so is my understanding, I recommend watching the three deep learning videos and related content that 3Blue1Brown has uploaded to Bilibili if you can. This document adds some insights and understanding from my own study. This is my first time using Markdown, so the formula formatting may be messy. If there are any mistakes, please correct me.

Part 1: The Structure of Neural Networks in Deep Learning

This section introduces only the simplest kind of network, the multi-layer perceptron (MLP). Other types of neural networks have their own strengths: convolutional neural networks are good at image recognition, and long short-term memory networks are good at speech recognition.

The name "neural network" comes from the structure of the human brain, which is built from neurons. What is a neuron, and how are neurons connected to form a network? For now, think of a neuron simply as a container that holds a number.

For easier understanding, consider the case of handwritten digit recognition, shown more vividly in the figure below. Each pixel of a 28*28 input image is treated as a neuron, giving 784 neurons in total. The number inside each neuron is the grayscale value of the corresponding pixel (0 represents a pure black pixel, 1 represents a pure white pixel). (I have an idea here: implement a GUI in Markdown.) The goal of this case is to recognize the digits 0 through 9 from these 784 pixels.
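
As a minimal sketch of this step (using a made-up image array rather than any data from the article), here is how a 28x28 grayscale image could be flattened into the 784 input activations with NumPy:

```python
import numpy as np

# Stand-in for a real digit image: 28x28 pixel intensities in 0..255.
image = np.random.randint(0, 256, size=(28, 28))

# Flatten to a vector of 784 values and scale into the 0..1 range,
# so each entry plays the role of one input neuron's activation
# (0 = pure black pixel, 1 = pure white pixel).
input_activations = image.flatten() / 255.0
print(input_activations.shape)   # (784,)
```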

The number held in each neuron is called its "activation". The greater the activation value, the brighter the neuron appears, and vice versa. The image, made up of 784 neurons, forms the first layer of the neural network. Now let's look at the last layer of the network. Its ten neurons represent the ten digits from 0 to 9, and their activation values also lie between 0 and 1. Note that these values do not state outright whether the image is a particular digit; they express how likely the input image is to be each of the ten digits!

In addition, there are several "hidden layers" between the input layer (784 neurons) and the output layer (10 neurons); the hidden layers do the actual work of recognizing the digits. In this case we choose two hidden layers (the reason for this choice is discussed below) with 16 neurons each (reason for this choice: it looks nice). In practical applications there is a lot of room to adjust and experiment with the structure of the network. The grayscale values of the 784 neurons were mentioned above: the pattern of activations in the input layer causes a particular pattern of activations in the next layer, the activations of the 16 neurons in the first hidden layer then produce a particular pattern in the 16 neurons of the second hidden layer, and finally a result appears in the output layer. The brightest neuron in the output layer is the digit the network believes the input image represents. See the animation below for the specific transfer process.
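
To make the architecture concrete, here is a small sketch that simply writes out the layer sizes described above. The two hidden-layer sizes are a design choice, not anything fixed by the problem:

```python
# The 784-16-16-10 architecture described above, as a list of layer sizes.
layer_sizes = [784, 16, 16, 10]   # input, hidden 1, hidden 2, output

for i, size in enumerate(layer_sizes):
    kind = "input" if i == 0 else "output" if i == len(layer_sizes) - 1 else "hidden"
    print(f"layer {i} ({kind}): {size} neurons")
```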

I believe many beginners (myself included) will ask: why should this layered structure be able to make intelligent judgments? In other words, why do we need intermediate layers at all? If we could judge the final digit directly from the 784 pixels (with a pile of if statements), why add two intermediate layers? What is the point of their existence?

Background: when we recognize a digit, we are really combining its component parts: 9 is a circle plus a vertical line; 8 can be seen as two circles; 4 can be seen as made of three straight strokes.

So ideally, since we want the output layer to give activation values for the ten digits (which in this layer can also be read as probabilities), we hope that each neuron in the second-to-last layer corresponds to one of those stroke components. Then when we input a digit containing a loop, such as 9 or 8, the activation of a certain neuron will be close to 1. Note that the loop here does not need to have a fixed shape; we simply hope that a loop pattern near the top of the image lights up this neuron. From the second hidden layer to the output layer, the network then only needs to learn which combinations of components produce which digits. At this point you may have another question: building hidden layers is troublesome enough, so isn't it even harder to recognize these components? And how do we decide which components we want?

Following the same idea we used to break the ten digits into component parts, the task of recognizing a circle can itself be broken into still finer pieces: first recognize the small edges in the figure of the digit, for example cutting the circle into five or more short edges pointing in different directions; likewise, the long vertical strokes in 1, 4, and 7 can be treated as several short vertical segments.

Therefore, we hope that each neuron in the first hidden layer corresponds to one of these short edges. Then, when an image is fed in, the eight to ten small-part neurons matching its short edges light up (lighting up here does not mean exactly 1), these in turn light up the neurons for the larger components, and finally the neuron for the corresponding digit lights up. This is the goal of the layered structure we hope will solve the recognition problem. This idea of turning a problem into abstract elements and peeling it apart layer by layer applies to many other artificial intelligence tasks as well, such as speech recognition: extract distinctive sounds from the raw audio, combine them into syllables, then into words, then into phrases and more abstract concepts.

So how do the activations in one layer determine the activations in the next? We need to design a mechanism that can combine pixels into short edges, combine short edges into patterns, and combine patterns into digits.

First, take a single neuron as an example: design a neuron in the first hidden layer whose goal is to recognize one small part, say, whether a particular region of the image contains an edge. What parameters should this network have? Which knobs and dials should we be able to turn so that the network can express this pattern, or any other pattern, such as several edges combining into a circle?

Note that for a given input image we obviously cannot change the activation values of the 784 input neurons. Instead, we assign a weight to each connection between our neuron and every neuron in the first layer, and then compute the weighted sum of all the first-layer activations with their corresponding weights. To make this easier to picture, the weights can be laid out as a grid: positive weights marked in green, negative weights in red, and the darker the color, the closer the weight is to 0. If we assign positive weights to the region we care about and 0 to all the other weights, then the weighted sum over all the pixels only adds up the pixel values in the region of interest! Furthermore, if we want to detect whether there is an edge here, we just assign negative weights to the surrounding pixels; then the weighted sum is largest when the middle pixels are bright and the surrounding pixels are dark. Why give the surrounding pixels negative weights? Doesn't that make the weighted sum smaller?
$$w_1a_1 + w_2a_2 + w_3a_3 + w_4a_4 + \cdots + w_na_n$$

where $w_i$ is the weight value and $a_i$ is the neuron activation value.
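
The following toy sketch illustrates this weighted sum with the weight layout just described: positive weights on a region we care about, negative weights on the pixels around it, zero everywhere else. The specific indices and values are made up for illustration only:

```python
import numpy as np

activations = np.random.rand(784)    # previous-layer activations, each in 0..1

weights = np.zeros(784)
weights[300:310] = 1.0               # pixels we care about: positive weights
weights[290:300] = -0.5              # surrounding pixels: negative weights
weights[310:320] = -0.5              # everything else stays at weight 0

weighted_sum = np.dot(weights, activations)
print(weighted_sum)
```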

The weighted sum computed this way can be of any size, but in this network we want activation values between 0 and 1. From calculus we know that the sigmoid function can squash any real number into the range 0 to 1. (Sigmoid is not the only activation function; you can read about the others yourself, for example "Mathematics in machine learning - activation functions (1): the Sigmoid function" on the CSDN blog.)
$$\text{Sigmoid function:}\quad \sigma(x) = \frac{1}{1+e^{-x}}, \quad x \in (-\infty, +\infty)$$
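
A minimal implementation of this squashing function, just to see the numbers it produces:

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-5), sigmoid(0), sigmoid(5))   # roughly 0.0067, 0.5, 0.9933
```
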
So the activation value of this neuron is really a measure of how positive its weighted sum is. But sometimes, even when the weighted sum is greater than 0, you don't want to light the neuron up (for example, you may only want its excitation to be meaningful when the weighted sum exceeds 10). In that case you add a bias so that the neuron is not activated indiscriminately: add the bias to the weighted sum before feeding the result into the sigmoid function.

To summarize: the weights tell you what pixel pattern this second-layer neuron pays attention to, and the bias tells you how large the weighted sum has to be before the neuron's excitation becomes meaningful.
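
Putting weights, bias, and sigmoid together for one neuron, here is a hedged sketch with random placeholder values (the bias of -10 mirrors the "fire only when the weighted sum exceeds about 10" example above; nothing here is trained):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

weights = np.random.randn(784)   # one weight per input pixel (random placeholders)
bias = -10.0                     # only fire when the weighted sum exceeds about 10
pixels = np.random.rand(784)     # input activations in 0..1

activation = sigmoid(np.dot(weights, pixels) + bias)
print(activation)                # a single value between 0 and 1
```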

The above explains just one neuron, but all 16 neurons in the second layer are connected to the 784 neurons in the input layer, and each of a second-layer neuron's 784 connections carries its own weight. Likewise, each neuron computes its own weighted sum, adds its own bias, and passes the result through sigmoid to produce its output. By now you can see that we need 784 * 16 weights and 16 biases, and that is only the connection between the input layer and the second layer; the connections between the second and third layers and between the third layer and the output layer also have their own weights and biases. In total, the whole network has 13,002 parameters! That is 13,002 knobs and dials on the network that can be adjusted to produce different results. So when we talk about how a machine learns, we are really talking about how the computer should set this huge collection of numeric parameters so that it solves the problem correctly.
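
The parameter count is easy to verify from the layer sizes; this short calculation reproduces the 13,002 figure:

```python
# Count the parameters of the 784-16-16-10 network described above.
layer_sizes = [784, 16, 16, 10]

n_weights = sum(m * n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))
n_biases = sum(layer_sizes[1:])

print(n_weights, n_biases, n_weights + n_biases)   # 12960 42 13002
```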

You might think that when there is only one hidden layer and the input and output layers have few neurons, you could set the weights and biases by hand. But adjusting the 13,002 parameters above manually? You wish... (And think further: that is for recognizing one picture. In machine learning you need to recognize thousands or tens of thousands of pictures. How would you tune them all by hand?)

With the above ideas, the computation of a single neuron can be written as follows:
$$a_0^{(1)} = \sigma\left(w_{0,0}a_0^{(0)} + w_{0,1}a_1^{(0)} + w_{0,2}a_2^{(0)} + \cdots + w_{0,n}a_n^{(0)} + b_0\right)$$

where $a_i^{(j)}$ denotes the activation value of neuron $i$ in layer $j$; $w_{m,n}$ denotes the weight between neuron $m$ in layer $j+1$ and neuron $n$ in layer $j$; and $b_k$ denotes the bias of neuron $k$.
In matrix form, the weight and bias operations between two layers of the network are:
$$\begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n}\\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \cdots & w_{k,n} \end{bmatrix} \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} = \begin{bmatrix} w_{0,0}a_0^{(0)}+\cdots+w_{0,n}a_n^{(0)}+b_0 \\ w_{1,0}a_0^{(0)}+\cdots+w_{1,n}a_n^{(0)}+b_1 \\ \vdots \\ w_{k,0}a_0^{(0)}+\cdots+w_{k,n}a_n^{(0)}+b_k \end{bmatrix}$$
The following visual GIF shows the activation value calculation process between layers:

Then each layer's weighted sums, with their biases added, are compressed by the activation function sigmoid:

$$\begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ \vdots \\ a_k^{(1)}\end{bmatrix} = \sigma\left(\begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,n}\\ w_{1,0} & w_{1,1} & \cdots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \cdots & w_{k,n} \end{bmatrix} \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}\right)$$

Here, a very clear and concise way is used to express the activation value operation between the two layers of the network:
$$a^{(1)} = \sigma\left(Wa^{(0)} + b\right)$$

where $W$ is the weight matrix, $a^{(0)}$ is the vector of activations of the previous layer, $a^{(1)}$ is the vector of activations of this layer, and $b$ is the bias vector of this layer.
This representation makes it much easier to write programs (many libraries, such as NumPy, have highly optimized matrix multiplication).
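
As a minimal sketch of that one-line formula in NumPy (with random, untrained weights and biases as placeholders), a single layer's forward computation could look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(W, a_prev, b):
    """One layer of the network: a^(1) = sigmoid(W a^(0) + b)."""
    return sigmoid(W @ a_prev + b)

# Example: the 784-neuron input layer feeding the first 16-neuron hidden layer.
W = np.random.randn(16, 784)   # random placeholder weights
b = np.random.randn(16)        # random placeholder biases
a0 = np.random.rand(784)       # input activations

a1 = layer_forward(W, a0, b)
print(a1.shape)   # (16,)
```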

That is the architecture of a neural network. Thank you very much for reading patiently; to finish, I would like to add and emphasize two points:

  • On the concept of activation values: once the weights and biases are fixed, every activation in the network is completely determined by the input image.
  • It is more accurate to think of a neuron as a function: its inputs are the outputs of all the neurons in the previous layer, and its output is a single value between 0 and 1. (In fact, the entire network is one function with 784 inputs and 10 outputs, just an extremely complicated one. Whispering: if this thing were not complicated, how could it look impressive?) A minimal sketch of this "network as a function" view follows right after these points.
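
Here is that "one big function" view in code, chaining the per-layer computation above through the 784-16-16-10 architecture. The parameters are random and untrained; this only shows the shape of the computation, not a working digit recognizer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def network(a0, params):
    """The whole 784 -> 10 network viewed as a single function.

    `params` is a list of (W, b) pairs, one per layer-to-layer transition.
    """
    a = a0
    for W, b in params:
        a = sigmoid(W @ a + b)
    return a

# Random, untrained parameters for the 784-16-16-10 architecture.
sizes = [784, 16, 16, 10]
params = [(np.random.randn(n, m), np.random.randn(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

output = network(np.random.rand(784), params)
print(output.shape)   # (10,): one value per digit 0..9
```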

Well, that wraps up the structure of neural networks in deep learning. I am still working overtime on the write-ups for the other two videos, on gradient descent and backpropagation! Note: I am just a "porter" of Yanyi's algorithm.

My contact information: [email protected]

Origin blog.csdn.net/HISJJ/article/details/126750270