Summary of introductory knowledge of deep learning

0. Foreword: After finishing the introductory deep-learning "fish book" (*Deep Learning from Scratch*), I have a general understanding of many basic concepts, so I am summarizing them promptly to make future look-ups easier.


1. The origin algorithm of neural network (deep learning) - perceptron:

  • Definition: A perceptron receives multiple input signals and outputs one signal. "Signal" here can be pictured as something with flow, like electric current in a wire or water in a river. A perceptron's signal takes only the two values 1/0 (flow / no flow).

  • The "AND gate, NAND gate, and OR gate" can be realized with a perceptron, but the "XOR gate" cannot be directly realized

  • The limitation of the (single-layer) perceptron is that it can only represent regions separated by a straight line, i.e., linearly separable problems.

  • The XOR gate can be realized by combining "AND gate, NAND gate, and OR gate": a 2-layer perceptron built from these gates realizes "XOR" (a code sketch follows below).
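A minimal NumPy sketch of these gates, in the spirit of the book's examples; the particular weights and biases are one workable choice among many:

```python
import numpy as np

def AND(x1, x2):
    x, w, b = np.array([x1, x2]), np.array([0.5, 0.5]), -0.7
    return int(np.sum(w * x) + b > 0)

def NAND(x1, x2):
    x, w, b = np.array([x1, x2]), np.array([-0.5, -0.5]), 0.7
    return int(np.sum(w * x) + b > 0)

def OR(x1, x2):
    x, w, b = np.array([x1, x2]), np.array([0.5, 0.5]), -0.2
    return int(np.sum(w * x) + b > 0)

def XOR(x1, x2):
    # 2-layer perceptron: feed the NAND and OR outputs into AND
    return AND(NAND(x1, x2), OR(x1, x2))

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, XOR(*pair))   # -> 0, 1, 1, 0
```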

  • Multi-layer perceptrons are very powerful: in theory, two layers of perceptrons can realize the functions of a computer, and by stacking layers perceptrons gain non-linear representational power. A multi-layer perceptron can be regarded as a neural network.

  • Activation function: converts the weighted sum of the input signals into the output signal. Neural networks use activation functions other than the step function. The activation function is the bridge connecting the perceptron and the neural network.

    • The sigmoid function: smooth (differentiable), which plays an important role in the learning of neural networks!
    • ReLU function
    • step function
  • The activation function used in the neural network is the smoothly varying sigmoid function, while the activation function used in the perceptron is the step function (see the sketch below).
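A minimal sketch of the three activation functions listed above, written for NumPy arrays:

```python
import numpy as np

def step_function(x):
    return (x > 0).astype(np.float64)   # perceptron-style output: 0 or 1

def sigmoid(x):
    return 1 / (1 + np.exp(-x))         # smooth S-curve with values in (0, 1)

def relu(x):
    return np.maximum(0, x)             # 0 for x <= 0, identity for x > 0

x = np.array([-1.0, 0.5, 2.0])
print(step_function(x))   # [0. 1. 1.]
print(sigmoid(x))         # [0.269 0.622 0.881] (rounded)
print(relu(x))            # [0.  0.5 2. ]
```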


2. Neural network:

  • A three-layer neural network consists of an input layer, hidden layers, and an output layer; signals are propagated forward layer by layer.
  • In the neural network, each layer passes an array to the next. Using the array operations of Python's NumPy library, the computation and transfer at each layer can be realized as matrix products (see the sketch after this list).
  • The design of the output-layer activation function: binary classification problems use the sigmoid function, multi-class classification problems use the softmax function, and regression problems use the identity function.
  • The steps involved in solving a machine learning problem are:
    • Learning: learn the model
    • Inference: use the learned model to reason about unknown data (classification)
    • Note: In the inference phase the output-layer activation (e.g., softmax) can generally be omitted, since it does not change which class scores highest; it is needed in the learning phase, where the loss is computed from the output.
  • Neural network model for handwritten digit recognition:
    • 1. Input: each picture in the data set is 28 × 28 = 784 pixels, so the input layer has 784 units.
    • 2. Output: the recognition result is one of the digits 0-9, so the output layer has 10 units.
  • Batch processing: As the example above shows, each 28 × 28 picture is flattened into a one-dimensional array of 784 values; feeding the model one picture at a time is relatively slow. Using arrays, a whole batch of pictures can be input at once: with 100 pictures per batch, the input is a two-dimensional array of 100 rows and 784 columns, and the final result is a two-dimensional array of 100 rows and 10 columns. Efficient computation is achieved through batch processing (see the sketch after this list).
  • Full connection: Full connection refers to the connection mode between two adjacent layers in which each unit of the current layer is connected to every unit of the previous layer. The counterpart of full connection is sparse connection, in which a unit of the current layer is connected to only some of the units of the previous layer.
  • Epoch: An epoch represents a process in which all samples in the training data set are used for forward propagation and back propagation of the neural network. In simple terms, an epoch means that the model completely observes the entire training data set once.
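A minimal sketch of a batched forward pass for the 784 → 10 digit-recognition shapes described above; the hidden-layer size (50) and the random weights are illustrative assumptions, not values from the post:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((784, 50)), np.zeros(50)   # hidden size 50 is an assumption
W2, b2 = 0.01 * rng.standard_normal((50, 10)), np.zeros(10)

X = rng.random((100, 784))     # a batch of 100 flattened 28x28 images
H = sigmoid(X @ W1 + b1)       # (100, 784) @ (784, 50) -> (100, 50)
Y = softmax(H @ W2 + b2)       # (100, 50)  @ (50, 10)  -> (100, 10)
print(Y.shape)                 # (100, 10): one row of class probabilities per image
```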

3. The learning stage of the neural network:

  • Definition: The learning of a neural network refers to the process of automatically obtaining the optimal weight parameters from the training data. To make learning possible, a loss function is used as the indicator: the purpose of learning is to find the weight parameters that minimize the loss function, and this can be done with the gradient method. Once suitable weight parameters have been found, the model counts as trained and can be used to solve classification or regression problems.
  • Advantage of neural networks (deep learning): features are learned by the machine, which reduces manual involvement; neural networks learn end-to-end.
  • Data in machine learning is divided into two parts:
    • Training data: also called supervised data; used to search for the optimal parameters.
    • Test data: used to evaluate the generalization ability of the model. If the model trained on the training data (with the optimal parameters) has high accuracy on the training data but its accuracy drops after switching to the test data, that phenomenon is called overfitting. If the model has high accuracy not only on the training data but also on the test data, its generalization ability is good.
  • Loss function: In principle any function could be used, but typically the mean squared error or the cross-entropy error is chosen. Taking the mean squared error as an example, E = (1/2) Σ_k (y_k − t_k)², where y_k is the model output and t_k the true value: clearly, the smaller the loss, the closer the model output is to the true value (a code sketch follows after this list).
  • Mini-batch learning: Judging a model by its loss function would require averaging the loss over all training data; when there is a lot of training data, computing this for every update is slow and unrealistic, so a portion can be selected from all the training data as an approximation. For example, randomly select 100 samples from 60,000 training samples, then learn using those 100. This learning method is called mini-batch learning (see the sketch after this list).
  • Summary: In neural network learning, finding the optimal parameters (weights and biases) means finding parameters that make the value of the loss function as small as possible. To do so, compute the derivative (gradient) of the loss with respect to the parameters, and use it as a guide to update the parameter values step by step.
  • The principle of differentiating the loss function with respect to a weight parameter: treat the parameter as a variable and differentiate the loss function with respect to it. If the derivative is negative, the loss decreases as the parameter grows, so increasing the parameter makes the loss smaller. If the derivative is positive, the loss increases as the parameter grows, so decreasing the parameter makes the loss smaller. When the derivative is 0, changing this parameter (locally) does not change the loss.
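A minimal NumPy sketch of the two loss functions named above; the small constant inside the log is a standard guard against log(0), and the one-hot label is an illustrative input:

```python
import numpy as np

def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t):
    delta = 1e-7                        # guard against log(0)
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])   # one-hot label: the true class is 2
y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
print(mean_squared_error(y, t))    # small when y is close to t
print(cross_entropy_error(y, t))   # likewise
```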
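And a sketch of mini-batch selection, drawing 100 random indices out of 60,000 with np.random.choice; x_train and t_train here stand in for a loaded training set:

```python
import numpy as np

x_train = np.random.random((60000, 784))          # stand-in for 60,000 flattened images
t_train = np.random.randint(0, 10, size=60000)    # stand-in for their labels

batch_mask = np.random.choice(x_train.shape[0], 100)   # 100 random indices
x_batch, t_batch = x_train[batch_mask], t_train[batch_mask]
print(x_batch.shape, t_batch.shape)                    # (100, 784) (100,)
```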

  • Application of two derivation methods in deep learning:
    • Numerical differentiation: computing the derivative from the definition of the derivative, e.g. with the central difference df/dx ≈ (f(x + h) − f(x − h)) / 2h for a small h (a code sketch follows after this list)
    • The second method is the error backpropagation method, which is equivalent to using the analytic derivative formulas and can be computed efficiently.
    • Generally, numerical differentiation is used to check whether error backpropagation is calculated correctly; this process is also called "gradient check".
    • Note that there are many parameters in deep learning, so the chain rule is generally used to find partial derivatives.
  • Gradient: the vector assembled from the partial derivatives with respect to all variables is called the gradient. The magnitude of the gradient tells how quickly the function value changes at that point, and the negative gradient points in the direction in which the function value decreases the most at each point.
    • Advance a certain distance along the (negative) gradient direction, recompute the gradient at the new location, advance again, and so on: this process of gradually reducing the function value is the "gradient method". The gradient method for finding the minimum of a function is called "gradient descent", and the one for finding the maximum is called "gradient ascent"; "gradient descent" is the one commonly used in deep learning. Expressed as a formula, with x0 and x1 as two parameters (the same holds for more) and η as the learning rate:

      x0 ← x0 − η ∂f/∂x0
      x1 ← x1 − η ∂f/∂x1
  • The realization of the learning process:
    1. Randomly select a part of the data (mini-batch) from the training data, and the goal is to reduce the value of the loss function of the mini-batch.
    2. Calculate the gradient of the loss function with respect to each weight parameter.
    3. The weight parameters are slightly updated along the gradient direction.
    4. Repeat 1, 2, and 3.
    Note: In the stochastic gradient descent algorithm (SGD), "stochastic" refers to the fact that the mini-batch data is randomly selected.
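A minimal sketch tying the pieces above together: a numerical gradient via the central difference, and a plain gradient-descent loop. The test function f, the learning rate 0.1, and the 100 steps are illustrative choices:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Central difference (f(x+h) - f(x-h)) / (2h), one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x[i]
        x[i] = orig + h; fxh1 = f(x)
        x[i] = orig - h; fxh2 = f(x)
        grad[i] = (fxh1 - fxh2) / (2 * h)
        x[i] = orig                     # restore the coordinate
    return grad

def gradient_descent(f, x0, lr=0.1, steps=100):
    x = x0.copy()
    for _ in range(steps):
        x -= lr * numerical_gradient(f, x)   # x <- x - eta * grad f(x)
    return x

f = lambda x: x[0] ** 2 + x[1] ** 2          # example function with minimum at (0, 0)
print(numerical_gradient(f, np.array([3.0, 4.0])))   # ~[6. 8.]
print(gradient_descent(f, np.array([3.0, 4.0])))     # approaches [0. 0.]
```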

4. Relevant learning skills:

  • parameter update
    • "Optimization": The purpose of neural network learning is to find the parameters that make the value of the loss function as small as possible. This is the problem of finding the optimal parameters. The process of solving this problem is optimization.
    • Common parameter optimization methods:
      • "Stochastic gradient descent method": referred to as SGD, this method is to find the optimal parameters, using the gradient (derivative) of the parameters as a clue, and updating the parameters along the gradient direction.
        - "Momentum": a method to mimic the concept of acceleration in physics
      • "AdGrad": will adjust the learning rate as the parameters change
      • "Adam": Combining the best of both approaches, Momentum and AdGrad
  • weight initial value
    • In neural network learning, the setting of weight initial value is very important.
    • The initial weights must be randomized, but sensibly scaled; if the initial values are set poorly, the vanishing-gradient problem occurs. Deep-learning frameworks generally use the Xavier initial value as the standard.
    • Summary: practice shows that when the activation function is ReLU, use the He initial value for the weights; when the activation function is an S-shaped curve such as sigmoid or tanh, use the Xavier initial value (see the sketch at the end of this section).
  • Add Batch Normalization layer
    • Batch Normalization layer definition: Insert a layer that normalizes the data distribution into the neural network
    • advantage:
      • Can make learning happen quickly (can increase learning rate)
      • less dependent on the initial value
      • suppress overfitting
  • Hyperparameters:
    • Definition: hyperparameters include the number of neurons in each layer, the batch size, the learning rate, the weight-decay strength, etc.; hyperparameters also affect how well the model performs
    • Validation data can be used to evaluate the quality of hyperparameters. If the test data were used to evaluate hyperparameters, the hyperparameters would end up overfitting the test data. The validation data is dedicated to tuning hyperparameters and is distinct from the training data and the test data.
    • Data division:
      • Training data: for learning of parameters (weights and biases)
      • Test data: used to test model accuracy
      • Validation data: e.g., first set aside 20% of the training data as validation data, for the purpose of tuning the hyperparameters
  • Overfitting solution:
    • Definition: Overfitting refers to the state in which the model can fit only the training data and cannot fit well other data not included in the training data.
    • Ways to suppress overfitting:
      • 1. Weight decay: suppresses overfitting by penalizing large weights during the learning process.
      • 2. Dropout: a method that randomly deletes neurons during the learning process (see the sketch at the end of this section).
  • Convolutional neural network:
    • Abbreviated CNN; it is widely used in settings such as image recognition and speech recognition.
    • Compared with the fully connected layers introduced earlier, the advantage of a CNN is that it can use shape information. When recognizing handwritten-digit pictures with fully connected layers, the two-dimensional pixel array of each picture must be stretched into a one-dimensional array before further processing; a CNN can take a two- or three-dimensional array as input, and thus preserves the shape information.
    • The filter parameters in a CNN correspond to the weights in a fully connected neural network, and the bias in a CNN is usually just a single value (per filter)
    • Convolution: the convolution (Conv) operation multiplies the input data element-wise by the filter at each position and sums the results; without padding, the output also becomes smaller than the input
    • Pooling: pooling takes the maximum value (or average value) within a target window; after pooling, the height and width become smaller
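A minimal sketch of the update rules behind SGD, Momentum, and AdaGrad mentioned above, operating on a dictionary of parameters; the hyperparameter values (lr, momentum) are common defaults, not prescriptions:

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for k in params:
            params[k] -= self.lr * grads[k]        # step along the negative gradient

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr, self.momentum, self.v = lr, momentum, {}
    def update(self, params, grads):
        for k in params:
            self.v[k] = self.momentum * self.v.get(k, 0.0) - self.lr * grads[k]
            params[k] += self.v[k]                 # velocity accumulates, like inertia

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr, self.h = lr, {}
    def update(self, params, grads):
        for k in params:
            self.h[k] = self.h.get(k, 0.0) + grads[k] ** 2
            params[k] -= self.lr * grads[k] / (np.sqrt(self.h[k]) + 1e-7)  # per-parameter decay
```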
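A sketch of the two weight-initialization scales from the summary above; n_in is the number of input units (fan-in) of the layer:

```python
import numpy as np

def xavier_init(n_in, n_out):
    # Xavier: standard deviation 1/sqrt(n_in); suits sigmoid/tanh layers.
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

def he_init(n_in, n_out):
    # He: standard deviation sqrt(2/n_in); suits ReLU layers.
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

W1 = he_init(784, 50)   # e.g., a ReLU hidden layer after a 784-pixel input
print(W1.std())         # roughly sqrt(2/784) ≈ 0.0505
```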
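And a sketch of a dropout layer as described above: during training each neuron is dropped with probability ratio, while at test time all neurons are kept and the output is scaled instead:

```python
import numpy as np

class Dropout:
    def __init__(self, ratio=0.5):
        self.ratio, self.mask = ratio, None
    def forward(self, x, train=True):
        if train:
            # keep each unit with probability (1 - ratio)
            self.mask = np.random.rand(*x.shape) > self.ratio
            return x * self.mask
        return x * (1.0 - self.ratio)   # compensate for the units dropped in training
```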


Origin blog.csdn.net/sz1125218970/article/details/131625751