Comprehensive summary of deep learning knowledge points_Deep learning summary


Neural Network and Deep Learning Structure (picture selected from "Neural Network and Deep Learning" - Qiu Xipeng)

Table of contents

Common Classification Algorithms

1. The concept of deep learning

1. Definition of deep learning

2. Deep Learning Applications

3. Main terms of deep learning

2. The basis of neural network

1. Neural Network Composition

Perceptron

multilayer perceptron

3. Introduction to forward propagation and back propagation

forward propagation

backpropagation

3. Hyperparameters

1 What are hyperparameters?

2. How to find the optimal value of hyperparameters?

3. Hyperparameter search general process?

4. Activation function

1. What is the activation function

2. Why use activation function?

3. Why does the activation function need a nonlinear function?

4. Common activation functions

5. Summary of optimization methods

1. Basic Gradient Descent Method

(1) Stochastic gradient descent method SGD

(2) Batch gradient descent method BGD

(3) Mini-batch gradient descent method MBGD

(4) Adagrad

2. Momentum momentum gradient descent

3. Adam optimizer

4. RMSprop optimizer

Summary of Optimization Algorithms

6. Loss function

1. Loss function definition

2 Regression loss function

(1) Mean square error loss function

3 Classification loss function

(1) Logistic loss function

4. Loss functions commonly used in neural networks

(1) ReLU + MSE

(2) Sigmoid + Logistic

(3) Softmax + Logistic

5. The difference between activation function, loss function and optimization function

7. CNN Convolutional Neural Network

Recommended article for beginners: How to explain convolution in an easy-to-understand manner? - Zhihu

CNN features:

CNN has two cores:

1. Local connections:

2. Parameter sharing:

CNN network introduction

1. Convolutional layer

1.1 Single channel input, single convolution kernel

1.2 Multi-channel input, single convolution kernel

1.3 Multi-channel input, multi-convolution kernel

1.4 padding

2. Pooling layer

3. Activation function

4. Fully connected layer

5. The amount of network parameters and calculation

Convolution layer parameters/Convolution calculation amount

6. Convolutional neural network training:

Basic training process:

CNN detailed solution:

8. Classic network introduction:

For detailed network introduction, please refer to this article:

Super detailed introduction to convolutional neural network

CNN summary

9. RNN recurrent neural network

Fundamentals of RNNs

10. LSTM long short-term memory neural network

LSTM and GRU study refer to my article:

Recommended reference materials and related in-depth learning materials for this article

Common Classification Algorithms

SVM, neural network, random forest, logistic regression, KNN, Bayesian

Common Supervised Learning Algorithms

Perceptrons, SVMs, Artificial Neural Networks, Decision Trees, Logistic Regression


1. The concept of deep learning

1. Definition of deep learning

Definition of deep learning: generally refers to training a multi-layer network structure to perform classification or regression on unseen data.

Deep learning classification: supervised learning methods include deep feedforward networks, convolutional neural networks, recurrent neural networks, etc.;

unsupervised learning methods include deep belief nets, deep Boltzmann machines, deep autoencoders, etc.

The idea of deep learning:

The basic idea of a deep neural network is to build a multi-layer network that represents the target at multiple levels, so that the high-level features of the later layers express the abstract semantic information of the data and yield more robust features.

2. Deep Learning Applications

Main applications in the field of image processing

  • Image classification (object recognition): Classification or recognition of the entire image
  • Object detection: Detect the position of objects in an image to identify objects
  • Image segmentation: segment specific objects in the image by edges
  • Image regression: predicting the coordinates of object components in an image

Main applications in the field of speech recognition

  • Speech Recognition: Recognize speech as text
  • Voiceprint recognition: identify whose voice it is
  • Speech Synthesis: Synthesize the voice of a specific person based on text

Main applications in the field of natural language processing

  • Language Model: Predict the next word based on previous words.
  • Sentiment Analysis: Analyze the sentiment expressed in the text (positive vs. negative, or multiple attitude types).
  • Neural Machine Translation: Translation between languages based on neural language models.
  • Neural Automatic Summarization: Automatically generate summaries from text.
  • Machine Reading Comprehension: Answering questions, completing multiple-choice questions, or filling in blanks by reading text.
  • Natural language reasoning: From one sentence (premise) to infer another sentence (conclusion).

Comprehensive application

  • Image description: According to the image, the description sentence of the image is given
  • Visual Q&A: Answer questions based on images or videos
  • Image Generation: Generate images from text descriptions
  • Video Generation: Automatically generate videos from stories

3. Main terms of deep learning

Refer to this article:

Detailed explanation of basic concepts of machine learning and deep learning - GoAl's Blog - CSDN Blog

2. The basis of neural network

1. Neural Network Composition

An artificial neural network (ANN for short) is an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed, parallel information processing. Depending on the complexity of the system, the network processes information by adjusting the connections among a large number of internal nodes, and it is capable of self-learning and self-adaptation. There are many types of neural networks, the most important of which is the multilayer perceptron. To describe neural networks in detail, we start with the simplest one.

Perceptron

The perceptron was proposed by Rosenblatt in 1957 and is the basis of neural networks and support vector machines.

The perceptron is inspired by biology; its reference object and theoretical basis are shown in the figure below. Our brain can be regarded as a biological neural network whose smallest unit is the neuron. Many neurons are connected into an intricate network, which we call a neural network. The neural networks discussed in machine learning and deep learning are really artificial neural networks (ANNs); "neural network" is simply the shortened name.

A biological neuron works roughly as follows. Signals, such as light received by the eyes or sound received by the ears, arrive at the dendrites, where they produce weak bioelectric stimuli. The dendrites pass these stimuli to the cell nucleus, which processes them together. When the combined signal reaches a certain threshold, the neuron is activated and produces an output, which travels along the axon and becomes a further signal received by the brain. This is the general working principle of a neuron in the human brain when it perceives.


A simple perceptron is shown below:

(Figure: a simple perceptron)

With appropriate weights w and bias b, a single perceptron unit can express a NAND gate. Here the weights are (−2, −2) and the bias is 3:

When the input is (0, 1), the perceptron computes 0 × (−2) + 1 × (−2) + 3 = 1, which is positive, so the output is 1.
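The NAND example can be checked with a few lines of Python. This is a minimal sketch, assuming the perceptron outputs 1 when the weighted sum plus bias is positive and 0 otherwise (the threshold at zero is an assumption, since the original figure is missing); the function name `perceptron` is illustrative.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum plus bias is positive, else 0 (assumed step rule)."""
    return int(np.dot(w, x) + b > 0)

# Weights and bias taken from the NAND example in the text.
w = np.array([-2, -2])
b = 3

# Evaluate the unit on all four binary input pairs.
nand_table = {
    (x1, x2): perceptron(np.array([x1, x2]), w, b)
    for x1 in (0, 1)
    for x2 in (0, 1)
}
print(nand_table)
```

Only the input (1, 1) gives a negative sum (−2 − 2 + 3 = −1), so it is the only input mapped to 0, which matches the NAND truth table.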

More complex perceptrons are composed of simple perceptron units:

multilayer perceptron

The multilayer perceptron is an extension of the perceptron. Its main feature is that it has multiple layers of neurons, so it is also called a deep neural network. Unlike a single perceptron, every neuron in layer $i$ of a multilayer perceptron is connected to every neuron in layer $i-1$.

The output layer can have more than 1 neuron. There can be just 1 hidden layer or several. An example of a neural network with multiple neurons in the output layer is shown in the figure below:

(Figure: a neural network with multiple neurons in the output layer)

2. What are the common model structures of neural networks?

The artificial neural network is composed of neuron models, and this information processing network composed of many neurons has a parallel distribution structure.


Among them, a circular node represents a neuron, and a square node represents a group of neurons.

The following figure contains most of the commonly used models:

3. Introduction to forward propagation and back propagation

Neural network computation has two main phases: forward propagation (FP) acts on the input of each layer and produces the output through layer-by-layer calculation; backpropagation (BP) acts on the output of the network and updates the network parameters from the deep layers to the shallow ones by computing gradients.

forward propagation

Suppose nodes $i, j, k, \dots$ in the previous layer are connected to node $w$ in the current layer. How is the value of node $w$ computed? Take the weighted sum of the values of nodes $i, j, k, \dots$ using the corresponding connection weights, add a bias term (omitted from the figure for simplicity), and pass the result through a nonlinear function (the activation function), such as ReLU or sigmoid. The result is the output of node $w$ in this layer.

Repeating this operation layer by layer finally yields the result of the output layer.
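The layer-by-layer computation described above can be sketched as follows. The 2-3-1 network, its hand-picked weights, and the use of ReLU in every layer are assumptions made only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Propagate x through a list of (W, b) pairs, applying ReLU after each layer."""
    a = x
    for W, b in layers:
        a = relu(W @ a + b)  # weighted sum plus bias, then activation
    return a

# Hypothetical 2-3-1 network with hand-picked weights.
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]), np.zeros(3)),
    (np.array([[1.0, 2.0, 1.0]]), np.array([-1.0])),
]

output = forward(np.array([2.0, 1.0]), layers)
print(output)
```

For the input (2, 1) the hidden layer produces (1, 1.5, 0) after ReLU, and the output node computes 1 + 3 + 0 − 1 = 3.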

backpropagation

Backpropagation - easy to understand_chengchaowei's blog-CSDN blog_backpropagation

The final result obtained through forward propagation (taking classification as an example) always contains some error, so how can the error be reduced? One of the most widely used algorithms is gradient descent, which needs the partial derivatives of the error with respect to each parameter in order to form the gradient. The following figure explains this using letters as an example:

This section is to be updated!

3. Hyperparameters

1 What are hyperparameters?

Hyperparameters: In the context of machine learning, hyperparameters are parameters whose values are set before the learning process starts, rather than parameters obtained through training. Usually the hyperparameters need to be optimized, and a set of optimal hyperparameters is selected for the learner to improve the performance and effect of learning.

Hyperparameters typically have the following characteristics:

1.  They define higher-level concepts about the model, such as its complexity or learning capacity.
2.  They cannot be learned directly from the data in the standard model training process and must be defined in advance.
3.  They can be decided by setting different values, training different models, and choosing the values that test better.

Specifically, the learning rate, the number of iterations of gradient descent, the number of hidden layers, the number of hidden units, and the choice of activation function all have to be set according to the actual situation. These settings control the final values of the model parameters, which is why they are called hyperparameters.

2. How to find the optimal value of hyperparameters?

When working with machine learning algorithms, there are always hyperparameters that are difficult to tune, for example the weight decay strength or the Gaussian kernel width. These parameters must be set manually, and the chosen values have a large influence on the results. Common methods for setting hyperparameters are:

  1. Guess and check: choose parameters based on experience or intuition and keep iterating.

  2. Grid search: let the computer try a set of values evenly distributed within a certain range.

  3. Random search: let the computer pick a set of values at random.

  4. Bayesian optimization: use Bayesian optimization to tune the hyperparameters. The difficulty is that the Bayesian optimization algorithm itself has many parameters to set.

  5. The MITIE method: perform local optimization starting from a good initial guess. It uses the BOBYQA algorithm with a carefully chosen starting point. Since BOBYQA only looks for the nearest local optimum, the success of this method depends largely on having a good starting point. In the case of MITIE a good starting point is known, but this is not a general solution, because usually you do not know where a good starting point is. On the plus side, this approach is great for finding local optima.

  6. LIPO: a recently proposed global optimization method. It has no parameters and has been shown to be better than random search.

3. Hyperparameter search general process?

The general process of hyperparameter search:

  1. Divide the dataset into a training set, a validation set, and a test set.
  2. Optimize the model parameters on the training set according to the model's performance metric.
  3. Search for the model's hyperparameters on the validation set according to the model's performance metric.
  4. Alternate between steps 2 and 3 to determine the final parameters and hyperparameters of the model, then evaluate the resulting model on the test set.

The search itself requires a search algorithm; common choices are grid search, random search, heuristic intelligent search, and Bayesian search.
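The four steps can be sketched with a grid search over a single hyperparameter. Everything concrete here is an assumption chosen to keep the example small: the synthetic one-dimensional regression data, the 200/50/50 split, the `train` and `mse` helpers, and the candidate learning rates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Step 1: split into training, validation and test sets.
X_train, X_val, X_test = X[:200], X[200:250], X[250:]
y_train, y_val, y_test = y[:200], y[200:250], y[250:]

def train(lr, steps=50):
    """Step 2: fit y = w * x by gradient descent on the training set."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * np.mean((w * X_train[:, 0] - y_train) * X_train[:, 0])
        w -= lr * grad
    return w

def mse(w, X_part, y_part):
    return float(np.mean((w * X_part[:, 0] - y_part) ** 2))

# Step 3: grid search over the learning rate, scored on the validation set.
grid = [0.0001, 0.001, 0.01, 0.1]
val_scores = {lr: mse(train(lr), X_val, y_val) for lr in grid}
best_lr = min(val_scores, key=val_scores.get)

# Step 4: evaluate the chosen configuration once on the held-out test set.
test_mse = mse(train(best_lr), X_test, y_test)
print(best_lr, test_mse)
```

Replacing the fixed `grid` with values drawn at random from a range turns the same loop into a random search; the test set is touched only once, after the hyperparameter has been chosen.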

4. Activation function

1. What is the activation function

Activation functions are essential for artificial neural network models to learn and represent very complex, nonlinear functions, because they introduce nonlinear properties into the network. As shown in the figure below, inside a neuron the inputs are weighted and summed, and the sum is then passed through a function, which is the activation function. **The activation function is introduced to increase the nonlinearity of the neural network model.** A layer without an activation function is equivalent to a matrix multiplication, and even if you stack several such layers, the result is still nothing more than a single matrix multiplication.

figure 1
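The claim that stacked layers without activation functions collapse into one matrix multiplication can be verified numerically. This is a small sketch with random weights; the shapes and variable names are arbitrary choices, and biases are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first layer weights
W2 = rng.normal(size=(2, 4))  # second layer weights
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)        # two layers, no activation
single_layer = (W2 @ W1) @ x             # one equivalent weight matrix
with_relu = W2 @ np.maximum(W1 @ x, 0)   # ReLU between the two layers

collapses = np.allclose(two_linear_layers, single_layer)
print(collapses)
```

With the ReLU in between, the network is in general no longer expressible as a single matrix applied to `x`, which is exactly the extra expressive power the activation function provides.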

2. Why use activation function?

  1. Activation functions play an important role in letting a model learn and represent very complex, nonlinear functions.
  2. Activation functions introduce nonlinear factors. If no activation function is used, the output signal is just a linear function of the input. A linear function is a first-degree polynomial; its complexity is limited, and its ability to learn complex function mappings from data is small. Without activation functions, neural networks cannot learn and model other complex types of data, such as images, video, audio, and speech.
  3. The activation function maps the current feature space into another space through a nonlinear mapping, so that the data can be classified more easily.

3. Why does the activation function need a nonlinear function?

  1. If all the components in the network are linear, then their composition is still linear and no different from a single linear classifier. This makes it impossible to approximate arbitrary nonlinear functions.
  2. Nonlinear activation functions make the network more powerful: they increase its ability to learn complex data and arbitrary function mappings that represent nonlinear relationships between inputs and outputs. With nonlinear activation functions, the network can produce nonlinear mappings from input to output.

4. Common activation functions

For detailed study of the activation function, refer to the following materials:

The Past and Present of Activation Functions_Without accumulating steps, you can go thousands of miles!

Summary of Frequently Asked Questions about Activation Functions

Deep Learning Notes_Summary and Comparison of Various Activation Functions_Jinghongyibo-CSDN Blog

Excellent summary reference:

(Figure: summary of common activation functions)

Some rules of thumb for choosing an activation function:
If the output is 0 or 1 (a binary classification problem), choose the sigmoid function for the output layer and the ReLU function for all other units. ReLU is the default choice in many cases: if you are not sure which activation function to use in the hidden layers, ReLU is the usual pick. Sometimes the tanh activation function is also used.
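For reference, the three activation functions named in this rule of thumb can be written in NumPy as below. This is a plain sketch of the standard definitions; the sample input `z` is arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Squashes inputs into (0, 1); a common choice for a binary output layer."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes inputs into (-1, 1) and is centered at zero."""
    return np.tanh(z)

def relu(z):
    """Passes positive inputs through unchanged and zeroes out negative ones."""
    return np.maximum(z, 0.0)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```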

5. Summary of optimization methods

1. Basic Gradient Descent Method

Introduction to gradient descent algorithm:

An easy-to-understand explanation of the gradient descent method!

Gradient descent learning introductory video recommendation:

[Gradient Descent] The soul of artificial intelligence, have you realized her true meaning?

[Dry goods] Essentials for deep learning: simple learning (gradient descent algorithm)_哔哩哔哩_bilibili

The deep learning network training process can be divided into two parts: the forward calculation process and the back propagation process. The forward calculation process refers to the forward calculation layer by layer according to the prescribed network structure through our pre-set convolution layer, pooling layer, etc., to obtain the predicted result. The backpropagation process is to adjust many parameters in the set network step by step, so that the prediction result can be closer to the real value.

In the backpropagation process, then, a key question is: how should the parameters be updated? Or, more specifically: in which direction should they be updated?

Clearly, the parameters should be updated in the direction in which the target loss function decreases fastest, that is, the direction of the negative gradient. Suppose the network parameters are $\theta$, the learning rate is $\eta$, and the loss function of the network is $J(\theta)$. The gradient of the loss with respect to $\theta$ is $\nabla_{\theta} J(\theta)$, so the update formula for $\theta$ can be expressed as:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta)$$
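The basic gradient descent update, in which the parameter moves against the gradient by a step scaled by the learning rate, can be tried on a toy problem. The quadratic loss J(θ) = (θ − 3)², the starting point, and the learning rate below are assumptions picked for illustration.

```python
def grad_J(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta = 0.0  # initial parameter value
eta = 0.1    # learning rate

for _ in range(100):
    theta = theta - eta * grad_J(theta)  # step against the gradient

print(theta)
```

Each step shrinks the distance to the minimizer θ = 3 by a factor of 0.8, so after 100 steps θ is very close to 3.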

In deep learning, there are three basic gradient descent algorithms: SGD, BGD, and MBGD, each with its own advantages and disadvantages. Which one to use depends on the amount of data and the number of parameters. The optimization algorithms used to train neural networks can be roughly divided into two categories: 1) those that adjust the learning rate to make optimization more stable; 2) those that correct the gradient estimate to speed up training.


(1) Stochastic gradient descent method SGD

Stochastic Gradient Descent (SGD) uses only a single training sample $(x^{(i)}, y^{(i)})$ in each iteration (parameter update), where $x$ is the input data and $y$ is the label. The parameter update expression is therefore:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})$$

Advantages: SGD computes the gradient on only one sample per iteration, so each update is fast, and it can also be used for online learning.

Disadvantages: (1) Because a single sample is random, the value of the target loss function fluctuates violently in practice. On the one hand, this fluctuation lets SGD jump to a new and possibly better local minimum. On the other hand, with a fixed learning rate the training does not settle at the minimum but keeps fluctuating around it. (2) Each iteration processes only one sample, so the parallelism of the GPU is not exploited and overall computational efficiency is low.

(2) Batch gradient descent method BGD

Batch Gradient Descent (BGD) uses all training samples in each update. The parameter update expression is as follows, where $J(\theta)$ is computed over the entire training set:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta)$$

Advantages and disadvantages: BGD is guaranteed to converge to the global minimum of a convex error surface and to a local minimum of a non-convex surface. However, every iteration uses all the data in the training set, so if the dataset is large, each iteration is very slow.

(3) Mini-batch gradient descent method MBGD

Mini-Batch Gradient Descent (MBGD) is a compromise between BGD and SGD. Each iteration uses batch_size training samples (written $n$ below). The parameter update expression is as follows:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$$

Advantages and disadvantages: Because each iteration uses multiple samples, MBGD converges more stably than SGD, and it also avoids the slow iterations of BGD when the dataset is large. MBGD is therefore the gradient descent method most often used to train deep learning networks.

In deep learning, the mini-batch size is generally 64 to 256. Because of the way computer memory is laid out and accessed, code tends to run faster when the mini-batch size is a power of 2.

(Figure: loss curves during BGD and MBGD training)

The figure above shows how the loss (cost) function changes during BGD and MBGD training. BGD reduces the cost function steadily and is guaranteed to converge to the global minimum of a convex error surface; the cost of MBGD oscillates more, but it can also eventually be optimized to the minimum of the loss.
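The three variants differ only in how many samples feed each update, which a single function with a `batch_size` argument can show. This is a minimal sketch on a synthetic, noiseless linear regression problem; the data, the MSE loss, the learning rate, and the function name `gradient_descent` are all illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, batch_size, eta=0.05, epochs=100, seed=0):
    """Fit y ~ X @ theta with an MSE loss; batch_size selects SGD, MBGD or BGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient on this batch
            theta -= eta * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
y = X @ np.array([2.0, -1.0])  # noiseless targets with known weights

theta_sgd = gradient_descent(X, y, batch_size=1)        # one sample per update
theta_mbgd = gradient_descent(X, y, batch_size=64)      # mini-batches of 64
theta_bgd = gradient_descent(X, y, batch_size=len(X))   # whole dataset per update
print(theta_sgd, theta_mbgd, theta_bgd)
```

On this easy problem all three recover weights close to (2, −1); the practical differences described above (update noise, cost per iteration, hardware utilization) show up on larger and noisier datasets.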

(4) Adagrad

(5) Adadelta

(6) RMSprop optimizer

The full name of the RMSProp algorithm is Root Mean Square Prop, an optimization algorithm proposed by Hinton in his Coursera course. The Momentum optimization algorithm only partly solves the problem of large swings during optimization.

To further reduce the excessive swings in the updates and speed up convergence, the RMSProp algorithm keeps an exponentially weighted average of the squared gradients of the weight W and the bias b. The effect is shown below: the blue path is the route taken by the Momentum optimization algorithm, and the green path is the route taken by the RMSProp optimization algorithm.

(Figure: optimization paths of Momentum (blue) and RMSProp (green))

Assume that in the t-th iteration, the formulas are as follows:

$$s_{dW} = \beta s_{dW} + (1 - \beta)\, dW^2$$

$$s_{db} = \beta s_{db} + (1 - \beta)\, db^2$$

$$W = W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$$

$$b = b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$$

In the above formulas, $s_{dW}$ and $s_{db}$ on the right-hand side are the squared-gradient averages accumulated over the previous t−1 iterations, $\beta$ is the decay rate of the accumulation, $\alpha$ is the learning rate, and $\epsilon$ is a small constant that prevents division by zero. The difference from Momentum is that RMSProp computes an exponentially weighted average of the squared gradients. Dividing by its square root damps the directions with large swings, so that the swing in each dimension becomes smaller, and it also makes the network converge faster.
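One RMSProp step, as described in this section (a running average of squared gradients that rescales each coordinate's update), might look like the sketch below. The ravine-shaped loss, the hyperparameter values, and the names `rmsprop_step` and `grad_J` are assumptions for illustration only.

```python
import numpy as np

def rmsprop_step(w, grad, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; s is the running average of squared gradients."""
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - alpha * grad / (np.sqrt(s) + eps)
    return w, s

def grad_J(w):
    """Gradient of a hypothetical ravine-shaped loss 0.5 * (100 * w0^2 + w1^2)."""
    return np.array([100.0, 1.0]) * w

w = np.array([1.0, 1.0])
s = np.zeros(2)
for _ in range(500):
    w, s = rmsprop_step(w, grad_J(w), s)

print(w)
```

Because each coordinate is divided by the root of its own squared-gradient average, the steep direction (curvature 100) and the shallow direction (curvature 1) take steps of similar size, which is the swing-damping effect the text describes. With a constant learning rate the iterates end up hovering near the minimum rather than landing on it exactly.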

2. Momentum momentum gradient descent

Momentum introduces the idea of an exponentially weighted moving average of the gradients: the current update direction depends not only on the current gradient but also on the weighted average of past gradients. For dimensions whose gradients keep pointing in the same direction, the momentum accumulates and the updates grow; for dimensions whose gradients change direction, the momentum cancels out and the updates shrink. This makes convergence faster without causing large swings.

The momentum gradient descent method speeds up learning, and it can also help escape local optima, as shown by the red line in the figure below:

(Figure: path of momentum gradient descent, shown as a red line)

The parameter update expression of momentum gradient descent (Momentum) is as follows:

$$v_t = \lambda v_{t-1} + \eta \nabla_{\theta} J(\theta)$$

$$\theta \leftarrow \theta - v_t$$

Here $\lambda$ is the momentum parameter. When $\lambda = 0$, the method is ordinary SGD; $0 < \lambda < 1$ gives SGD with momentum, and $\lambda$ is usually set to 0.9.

Disadvantages of plain SGD: SGD has trouble navigating ravines (regions where the surface curves much more steeply in one dimension than in another), which are common around local optima. In these regions, SGD oscillates across the slopes of the ravine while making only slow progress along the bottom toward the local optimum. Momentum is introduced to alleviate this problem.

Essentially, when using momentum, it's like we're pushing the ball down a hill. The ball builds up momentum as it rolls downhill, getting faster and faster along the way. The same thing happens for parameter updates: momentum accumulates and increases for dimensions whose gradients point in the same direction, and decreases for dimensions where gradients change directions. As a result, we obtain faster convergence and reduced oscillations.
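The ball-rolling picture can be made concrete by running plain gradient descent and momentum side by side on a ravine. This sketch assumes the update v ← λv + η∇J, θ ← θ − v, with λ = 0 reducing to plain gradient descent; the loss, step counts, and function names are illustrative.

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.01, lam=0.9):
    """One momentum update; lam = 0 gives plain gradient descent."""
    v = lam * v + eta * grad
    theta = theta - v
    return theta, v

def grad_J(theta):
    """Gradient of a hypothetical ravine-shaped loss 0.5 * (100 * t0^2 + t1^2)."""
    return np.array([100.0, 1.0]) * theta

def run(lam, steps=200):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        theta, v = momentum_step(theta, grad_J(theta), v, lam=lam)
    return theta

theta_plain = run(lam=0.0)     # no momentum
theta_momentum = run(lam=0.9)  # with momentum
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_momentum))
```

With the same learning rate, the plain run is still far from the minimum along the shallow direction after 200 steps, while the momentum run has accumulated enough velocity along the bottom of the ravine to get much closer.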

3. Adam optimizer

Adam is another method that adapts the learning rate of each parameter. It is roughly equivalent to RMSprop + Momentum: it uses estimates of the first and second moments of the gradient to adjust the learning rate of each parameter dynamically. The formulas are as follows, where $g_t$ denotes the gradient at step $t$.

The first and second moment estimates $m_t$ and $v_t$ behave like momentum and are initialized as $m_0 = 0$, $v_0 = 0$.

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradient, respectively:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Because $m_t$ and $v_t$ start at zero, the exponential moving averages are biased toward the initial value at the beginning of the iteration, so the values above need to be corrected. These biases are counteracted by computing bias-corrected first and second moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

These are then used to update the parameters in the same way as in RMSprop, which gives Adam's parameter update formula:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

In the Adam algorithm, the parameter $\beta_1$ corresponds to the β value in the Momentum algorithm and is generally 0.9; the parameter $\beta_2$ corresponds to the β value in the RMSProp algorithm and is generally 0.999; $\epsilon$ is a smoothing term, generally $10^{-8}$; and the learning rate $\eta$ needs to be fine-tuned during training.
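A single Adam update, with first and second moment estimates followed by bias correction, can be sketched as below. The default values follow the ones quoted in this section (β1 = 0.9, β2 = 0.999, ε = 1e-8); the learning rate 0.001, the sample gradient, and the function name `adam_step` are assumptions for the example.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([4.0, -0.5])  # hypothetical gradient at the first step

theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

At the first step the bias correction exactly undoes the zero initialization, so the update has magnitude close to the learning rate in every coordinate regardless of how large or small the gradient is: here both parameters move by about 0.001, in opposite directions.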

Summary of Optimization Algorithms

Comparison of the most comprehensive optimization methods for deep learning (SGD, Adagrad, Adadelta, Adam, Adamax, Nadam) bzdww


Summary of optimizer and learning rate decay methods in Pytorch

Summary of optimizer and learning rate decay methods in pytorch_ys1305's blog

6. Loss function

Getting started learning video:

[Linear regression, cost function, loss function] Understanding animation, falling in love with mathematics

1. Loss function definition

In machine learning tasks, most supervised learning algorithms have an objective function, and optimizing that objective function is what is meant by optimizing the algorithm. For example, classification and regression tasks use a loss function as the objective function for optimizing the model.
In BP neural networks, derivations usually use the mean square error as the loss function, but in practice cross entropy is often used instead. As shown in the figure below, different loss functions have different convergence speeds and performance during gradient descent.

  1. With mean square error as the loss function, convergence is slow and training may fall into a local optimum;
  2. With cross entropy as the loss function, convergence is faster than with mean square error, and the optimum is easier to find.

Therefore, understanding the types of loss functions and mastering the skills of using loss functions will help to deepen the understanding of deep learning.

(Figure: convergence of different loss functions during gradient descent)

The loss functions used for classification and regression models are different, and will be introduced separately below.

2 Regression loss function

(1) Mean square error loss function

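The Mean Squared Error (MSE) loss function is defined as follows, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples:

$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

code example

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average the squared differences over the last axis.
    return np.mean(np.square(y_pred - y_true), axis=-1)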

(2) Mean absolute error loss function

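The Mean Absolute Error (MAE) loss function is defined as follows, where $y_i$ is the true value and $\hat{y}_i$ is the predicted value:

$$L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|$$

code example

def mean_absolute_error(y_true, y_pred):
    # Average the absolute differences over the last axis.
    return np.mean(np.abs(y_pred - y_true), axis=-1)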

(3) Mean square error logarithmic loss function

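The Mean Squared Logarithmic Error (MSLE) loss function is defined as follows, where $y_i$ is the true value and $\hat{y}_i$ is the predicted value:

$$L_{MSLE} = \frac{1}{N} \sum_{i=1}^{N} \left( \log(\hat{y}_i + 1) - \log(y_i + 1) \right)^2$$

code example

def mean_squared_logarithmic_error(y_true, y_pred):
    # Clip to a small positive value so the logarithm is always defined.
    first_log = np.log(np.clip(y_pred, 10e-6, None) + 1.)
    second_log = np.log(np.clip(y_true, 10e-6, None) + 1.)
    return np.mean(np.square(first_log - second_log), axis=-1)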

(4) Mean absolute percentage error loss function

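The Mean Absolute Percentage Error (MAPE) loss function is defined as follows, where $y_i$ is the true value and $\hat{y}_i$ is the predicted value:

$$L_{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$$

code example

def mean_absolute_percentage_error(y_true, y_pred):
    # Clip the denominator to avoid division by zero.
    diff = np.abs((y_pred - y_true) / np.clip(np.abs(y_true), 10e-6, None))
    return 100 * np.mean(diff, axis=-1)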

(5) Summary

The mean square error loss function is the most widely used, and in most cases it performs well, so it serves as the basic yardstick among loss functions. Because it squares the errors, the mean square error penalizes outliers more heavily, while MAE is more robust to them. If there are many outliers in the data, consider using the mean absolute error loss as the loss function. In general, the data can also be preprocessed to avoid too many outliers.
The mean squared logarithmic error is computed much like the mean square error, except that the logarithm of each value is taken first, which compresses the range of the function output. The mean absolute percentage error computes the relative error between the predicted value and the true value. Both losses are intended for data with a large range (for example $[-10^5, 10^5]$), but in neural networks the input data is usually normalized to a reasonable range (for example $[-1, 1]$), after which the mean square error or mean absolute error is used to compute the loss.
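How differently MSE and MAE react to a single outlier can be seen in a small numerical comparison. The four-element arrays below are made-up data, and the two loss helpers are local restatements of the standard definitions.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_pred - y_true)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.1, 1.9, 3.1, 3.9])     # every prediction is off by 0.1
y_outlier = np.array([1.1, 1.9, 3.1, 13.9])  # the last prediction is off by 9.9

mse_growth = mse(y_true, y_outlier) / mse(y_true, y_clean)
mae_growth = mae(y_true, y_outlier) / mae(y_true, y_clean)
print(mse_growth, mae_growth)
```

The single bad prediction inflates the MSE by a factor in the thousands but the MAE by only a factor in the tens, because squaring magnifies large errors far more than small ones.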

3 Classification loss function

(1) Logistic loss function

The Logistic loss function is defined as follows:

(2) Negative log likelihood loss function

The negative log likelihood loss function (Negative Log Likelihood Loss) is defined as follows:

(3) Cross entropy loss function

The Logistic loss function and the negative log-likelihood loss function can only handle binary classification problems. To extend from two classes to M classes, the cross-entropy loss function (Cross Entropy Loss) is used, which is defined as follows:

code example

def cross_entropy(y_true, y_pred):
    # y_true is one-hot; the small constant keeps the logarithm finite.
    return -np.mean(y_true * np.log(y_pred + 10e-6))

(4) Hinge loss function

A typical classifier using Hinge loss is the SVM algorithm, because Hinge loss can be used to solve the margin maximization problem. Hinge loss is the most convenient choice when the classification model requires hard classification results, such as binary classification data where the classification result is 0 or 1, -1 or 1. The hinge loss function is defined as follows:

code example

def hinge(y_true, y_pred):
    # expects labels y_true in {-1, 1}; np is numpy (import numpy as np)
    return np.mean(np.maximum(1. - y_true * y_pred, 0.), axis=-1)

(5) Exponential loss function

A typical classifier using an Exponential loss function is the AdaBoost algorithm. The definition of the Exponential loss function is as follows:

code example

def exponential(y_true, y_pred):
    # np is numpy (import numpy as np)
    return np.sum(np.exp(-y_true * y_pred))

4. Loss functions commonly used in neural networks

The loss function of a neural network can be customized, provided that both the data itself and the optimization scheme used to solve it are taken into account. In other words, a custom loss function has to fit the form of the input data and must be differentiable by the optimization algorithm. Designing a custom loss function is therefore somewhat difficult. In actual engineering projects, it is common practice to choose the loss function together with the activation function. There are three commonly used combinations.

(1) ReLU + MSE

The mean square error loss function cannot deal with the vanishing gradient problem, whereas the Leaky ReLU activation function reduces it during computation. Therefore, if you need to use the mean square error loss in a neural network, generally pair it with an activation function such as Leaky ReLU that reduces vanishing gradients. In addition, because the mean square error is so general, it is commonly used as the baseline for measuring loss, so its performance as a loss function is neither especially good nor especially bad.

(2) Sigmoid + Logistic

The Sigmoid function causes the vanishing gradient problem. By the chain rule, backpropagation multiplies together the derivatives of the Sigmoid function, each of which lies in (0, 0.25]. When many such small numbers are multiplied, the product approaches zero and the gradient eventually vanishes. A Logistic-like loss function takes the logarithm, which turns the multiplication into a summation when it is differentiated and so avoids the vanishing gradient to a certain extent. This is why the combination of the Sigmoid activation function and the cross-entropy loss function is so common.
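The shrinking product can be checked numerically. The sketch below is only an illustration (the helper names are made up here): it raises the largest possible Sigmoid derivative, 0.25, to the power of the network depth, which is the best case for a chain of Sigmoid layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum value is 0.25, reached at x = 0

# Best case: every layer sits at x = 0, where the derivative is largest.
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_grad(0.0) ** depth)
```

Even in this best case the factor contributed by ten Sigmoid layers is already below one millionth.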

(3) Softmax + Logistic

Mathematically, the Softmax activation function returns a probability distribution over mutually exclusive output classes; that is, it converts the raw outputs into probabilities that sum to 1, such as (0.2, 0.8). In addition, the Logistic loss function is derived from maximum likelihood estimation on probabilities, so probability outputs make differentiation and computation easier for the optimization algorithm. This is why the output layer so often combines the Softmax activation function with the cross-entropy loss function.
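A small sketch of why this pairing is convenient: for Softmax followed by cross entropy, the gradient with respect to the raw scores reduces to "predicted probability minus label". The function names and the sample logits below are assumptions made for this example.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, probs, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, probs))

def grad_wrt_logits(y_true, probs):
    # For softmax followed by cross entropy, dL/dz_k simplifies to p_k - y_k.
    return [p - t for t, p in zip(y_true, probs)]

logits = [0.0, math.log(4.0)]   # chosen so that softmax gives roughly (0.2, 0.8)
y_true = [0.0, 1.0]
probs = softmax(logits)
print(probs)
print(cross_entropy(y_true, probs))
print(grad_wrt_logits(y_true, probs))
```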

Loss function learning reference:

Deep Learning - Detailed Introduction to Common Loss Functions

5. The difference between activation function, loss function and optimization function

**1. Activation function:** The input from the previous layer first passes through the transformation of the current network layer, and the activation function then applies a nonlinear mapping to produce the output. Common activation functions include sigmoid, tanh, relu, etc.

[Deep Learning] Activation Function of Neural Network_Beiwanghuacun-CSDN Blog

**2. Loss function:** A way to measure the gap between the predicted value of the output of the neural network and the actual value. Common loss functions include: least squares loss function, cross entropy loss function, smooth L1 loss function used in regression, etc.

**3. Optimization function:** How the loss value is passed from the output layer of the neural network back to the front layers and used to update the parameters. Examples include the basic gradient descent algorithm, stochastic gradient descent, batch gradient descent, gradient descent with momentum, Adagrad, Adadelta, Adam, etc.

[Deep Learning] Gradient Descent Algorithm, Optimization Method (SGD, Adagrad, Adam...)_Beiwanghuacun-CSDN Blog

7. CNN Convolutional Neural Network

Recommended material: Convolutional Neural Network CNN (packed with practical content) - CSDN Blog

Recommended article for beginners: How to explain convolution in an easy-to-understand manner? - Zhihu

A Convolutional Neural Network (CNN) is a type of Feedforward Neural Network that includes convolution calculations and has a deep structure. It is one of the representative algorithms of deep learning. CNNs are now widely used in many fields, such as face recognition, autonomous driving, Meitu Xiuxiu, and security.

Image processing is what CNN is best at. It was inspired by the human visual nervous system.

CNN features:

  1. It can effectively reduce the dimensionality of images with large data volumes into small data volumes.
  2. It can effectively preserve the image features and conform to the principle of image processing.

CNN has two cores:

The two main features of the convolutional layer are local connection and weight sharing, which are also called sparse connection and parameter sharing in some places.

  1. The local connection is realized through the convolution operation. The size of the local area is the size of the filter, which avoids the situation where a full connection has too many parameters to compute.
  2. Parameter sharing then reduces the actual number of parameters, which makes multi-layer networks feasible.

1. Local connections:

  • It is generally believed that images are spatially local: nearby pixels are closely related, while distant pixels are only weakly correlated. Therefore, each neuron does not need to perceive the whole image. It only needs to perceive a local region, and higher layers then combine the local information to obtain global information. The convolutional layer implements this with feature maps (each feature map is an array of neurons) that extract local features from the previous layer through local convolution filters. The convolutional layer is followed by a computational layer that performs local averaging and secondary feature extraction, and this secondary extraction reduces the feature resolution.
  • In other words, the network is locally connected: each neuron is connected only to some neurons in the previous layer and perceives only a part of the image, not the whole. (This is implemented with a sliding window.)

2. Parameter sharing:

  • In the local connection, the parameters of every neuron are the same; that is, the same convolution kernel is shared across the whole image. (Understanding: the convolution operation extracts local information region by region, and some statistical properties of one region are the same as those of other regions. Features learned in one part can therefore also be used in another part, so the same learned features can be applied at every position of the image.) Sharing one convolution kernel has a drawback: it extracts too few features. This can be compensated by adding more convolution kernels, so that multiple features are learned.

  • For a 100x100 pixel image, a single fully connected neuron operating on the whole image needs 100x100=10000 weights. If we use a 10x10 convolution kernel instead, the kernel has to be applied many times, but only 10x10=100 parameters are needed, plus one bias b, for a total of 101. The resulting image is still 100x100.

  • If the image is larger, the fully connected approach needs even more parameters. Performing feature extraction on the image with a 10x10 convolution kernel gives us one Feature Map.

  • A convolution kernel can extract only one feature, so we need several convolution kernels. With 6 convolution kernels we get 6 Feature Maps, which are stacked together to form a layer of neurons. These 6 Feature Maps need 101x6=606 parameters, which is still small compared with 10000.
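The parameter counts in the bullets above can be reproduced with a few lines of Python. This is only a sketch; the two helper functions are hypothetical names introduced for the example.

```python
def fully_connected_params(height, width):
    """One neuron connected to every pixel: one weight per pixel."""
    return height * width

def conv_params(kernel_size, num_kernels, in_channels=1):
    """Each kernel has kernel_size^2 * in_channels weights plus one bias."""
    return num_kernels * (kernel_size * kernel_size * in_channels + 1)

print(fully_connected_params(100, 100))  # 10000
print(conv_params(10, 1))                # 101
print(conv_params(10, 6))                # 606
```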

CNN network introduction

This part introduces the concepts and principles of the convolutional layer, pooling layer, activation function, and fully connected layer.

A typical CNN consists of 3 parts

1. Convolutional layer

Convolution is an efficient method for extracting image features. Generally, a square convolution kernel is slid over every pixel of the image. Each pixel value in the area where the image and the convolution kernel overlap is multiplied by the weight at the corresponding point of the kernel, the products are summed, and the bias is added, which gives one pixel value of the output image.
Images can be grayscale or color, and there can be one convolution kernel or several, so the convolution operation can be divided into the following three cases:

1.1 Single channel input, single convolution kernel

Here, single channel means that the input is a grayscale image, and single convolution kernel means that the number of convolution kernels is 1.

[Figure: a 5x5x1 grayscale image convolved with a 3x3x1 convolution kernel]

The figure shows a 5x5x1 grayscale image, where 1 means a single channel and 5x5 is the resolution, i.e. 5 rows and 5 columns of grayscale values. If a 3x3x1 convolution kernel is used to convolve this 5x5x1 grayscale image with bias b=1, the calculation for one output value is: (-1)x1+0x0+1x2+(-1)x5+0x4+1x2+(-1)x3+0x4+1x5+1=1 (be careful not to forget to add the bias 1).
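The arithmetic above can be checked with a short sketch. The patch and kernel values below are read off from the calculation in the text (the first factor of each product is the kernel weight, the second is the pixel value); the helper name `conv_at` is made up for this example.

```python
# 3x3 image patch under the kernel, taken from the calculation above
patch = [[1, 0, 2],
         [5, 4, 2],
         [3, 4, 5]]
# 3x3 kernel weights, taken from the same calculation
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
bias = 1

def conv_at(patch, kernel, bias):
    """Multiply overlapping values elementwise, sum them, then add the bias."""
    total = sum(p * k
                for patch_row, kernel_row in zip(patch, kernel)
                for p, k in zip(patch_row, kernel_row))
    return total + bias

print(conv_at(patch, kernel, bias))  # 1
```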

1.2 Multi-channel input, single convolution kernel

In most cases, the input picture is a color image composed of the three RGB colors, so it contains three layers of data: red, green, and blue. The depth (number of channels) of the convolution kernel must equal the number of channels of the input picture, so a 3x3x3 convolution kernel is used, where the last 3 matches the 3 channels of the input image. This convolution kernel has three channels, each with 9 randomly initialized parameters to be optimized, for a total of 27 weights w and one bias b.


Note: This is still the case of a single convolution kernel, but a convolution kernel can have multiple channels. By default, the number of channels of the convolution kernel is equal to the number of channels of the input image.

1.3 Multi-channel input, multi-convolution kernel

Multi-channel input with multiple convolution kernels is the most common form in deep neural networks. The convolution process is actually very simple. Take a 3-channel input and 2 convolution kernels as an example:

(1) First take one convolution kernel and convolve it with the 3-channel input. This is the same process as multi-channel input with a single convolution kernel, and it produces a 1-channel output, output1. Then take the second convolution kernel and perform the same operation to get the second output, output2.
(2) Stack output1 and output2, which have the same size, to get a 2-channel output.

For a more intuitive understanding, see the diagram:

[Figure: a 3-channel input convolved with 2 convolution kernels, giving a 2-channel output]
In the figure, input X:[1,h,w,3] means one input image with 3 channels, height h, and width w.
Convolution kernel W:[k,k,3,2] means that the convolution kernel size is k*k, the number of channels is 3, and the number of kernels is 2.

Summary:
(1) After the convolution operation, the number of output channels = the number of convolution kernels.
(2) The number of convolution kernels and the number of channels of a convolution kernel are different concepts. The number of convolution kernels in each layer is specified when designing the network, but the number of channels of the kernels is not necessarily given. By default, the number of channels of the convolution kernel = the number of input channels, because this is a necessary condition for the convolution operation.

(3) Number of biases = number of convolution kernels.
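To make the shape rules in this summary concrete, here is a deliberately naive pure-Python sketch of a stride-1 convolution without padding. The nested-list layout mirrors X:[h,w,3] and W:[k,k,3,2]; the function name and the all-ones test data are assumptions for illustration, not how a framework implements it.

```python
def conv2d(image, kernels, biases):
    """Convolution with stride 1 and no padding.
    image:   [h][w][c_in]
    kernels: [k][k][c_in][c_out]  (same layout as W:[k,k,3,2])
    biases:  [c_out], one bias per kernel
    returns: [h-k+1][w-k+1][c_out]
    """
    h, w, c_in = len(image), len(image[0]), len(image[0][0])
    k, c_out = len(kernels), len(kernels[0][0][0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            pixel = []
            for o in range(c_out):              # one output channel per kernel
                acc = biases[o]
                for di in range(k):
                    for dj in range(k):
                        for c in range(c_in):   # sum over all input channels
                            acc += image[i + di][j + dj][c] * kernels[di][dj][c][o]
                pixel.append(acc)
            row.append(pixel)
        out.append(row)
    return out

# Hypothetical data: a 4x4 RGB image of ones, two 3x3x3 kernels of ones, biases 0 and 1.
image = [[[1.0] * 3 for _ in range(4)] for _ in range(4)]
kernels = [[[[1.0, 1.0] for _ in range(3)] for _ in range(3)] for _ in range(3)]
out = conv2d(image, kernels, [0.0, 1.0])
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
print(out[0][0])                              # [27.0, 28.0]
```

The output has 2 channels because there are 2 kernels, and the two channels differ only by their biases.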

1.4 padding

In order to obtain the desired output image size after the convolution operation, padding is often used to fill the borders of the input. By default, the image is padded with zeros.

(1) All-zero padding: padding='same'
When same is used, the original image is automatically padded with zeros. When the stride is 1, this ensures that the output image has the same size as the input image.
Output size calculation formula: input length / stride (rounded up)
Implemented in TensorFlow as follows (taking 48 convolution kernels, kernel size 3, stride 1, and all-zero padding as an example):

layers.Conv2D(48, kernel_size=3, strides=1, padding='same')

(2) No padding: padding='valid'
When valid is used, no padding is performed and the convolution is applied directly. This is the default behavior of layers.Conv2D().
Output size calculation formula: (input length - kernel length) / stride + 1 (rounded down)
Implemented in TensorFlow as follows:

layers.Conv2D(48, kernel_size=3, strides=1, padding='valid')

(3) Custom padding
Padding is generally applied in four directions: top, bottom, left, and right. The number of columns p_w padded on the left and on the right is generally the same, and the number of rows p_h padded on the top and on the bottom should also be the same.
Output size calculation formula:

output height = (h + 2 x p_h - k) / s + 1 (rounded down)

output width = (w + 2 x p_w - k) / s + 1 (rounded down)

Here, h and w are the height and width of the original image, k is the size of the convolution kernel, and s is the stride.

In TensorFlow 2.0, the padding parameter for custom padding is set in the following format:
padding=[[0, 0], [up, down], [left, right], [0, 0]]

# For example, to pad one unit on each of the four sides (top, bottom, left, right):
layers.Conv2D(48, kernel_size=3, strides=1, padding=[[0,0], [1,1], [1,1], [0,0]])
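The three output size rules in this section can be sketched as plain Python functions. The function names are hypothetical, and the custom case assumes the same amount of padding `pad` on both sides of a dimension.

```python
import math

def out_size_same(n, stride):
    """padding='same': ceil(n / stride)."""
    return math.ceil(n / stride)

def out_size_valid(n, kernel, stride):
    """padding='valid': floor((n - kernel) / stride) + 1."""
    return (n - kernel) // stride + 1

def out_size_custom(n, kernel, stride, pad):
    """`pad` units added on each side: floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

print(out_size_same(32, 1))           # 32
print(out_size_same(5, 2))            # 3
print(out_size_valid(32, 3, 1))       # 30
print(out_size_custom(32, 3, 1, 1))   # 32
```

With a 3x3 kernel and stride 1, padding one unit on each side gives the same output size as padding='same'.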

2. Pooling layer

The roles of pooling are as follows:

1. It ensures the local invariance of the features extracted by the convolutional neural network.

2. It reduces the dimension and the number of parameters.

3. It is relatively simple to optimize.

In the convolutional layer, the height and width of the feature map can be reduced several times over by adjusting the stride parameter s, which in turn reduces the number of network parameters. Besides setting the stride, there is also a dedicated network layer that reduces the size: the pooling layer introduced here.

The pooling layer is also based on the idea of local correlation. It obtains new element values by sampling or aggregating information from a group of locally related elements. Two kinds of pooling are usually used for downsampling:
(1) Max Pooling, which selects the largest element value from the set of locally related elements.
(2) Average Pooling, which computes the average value of the set of locally related elements and returns it.
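Both kinds of pooling can be sketched in a few lines of pure Python. The function name `pool2d`, its parameters, and the 4x4 feature map are made up for this example; a 2x2 window with stride 2 is assumed.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over the map and keep its max or its average."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [4, 2, 0, 1],
        [3, 1, 2, 5]]
print(pool2d(fmap, mode="max"))   # [[7, 8], [4, 5]]
print(pool2d(fmap, mode="avg"))   # [[4.0, 5.0], [2.5, 2.0]]
```

The 4x4 map is reduced to 2x2 in both cases, and the pooling itself has no trainable parameters.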

3. Activation function

The activation function is also an indispensable part of the neural network. It adds nonlinearity and so improves the expressive power of the network. ReLU is the most commonly used activation function in convolutional neural networks, and Sigmoid is used less often.

For how to choose an appropriate activation function, refer to the blog post "Neural Network Construction: Summary of Activation Functions" or the introduction above.

4. Fully connected layer

The fully connected layer FC is called fully connected because each neuron has a connection relationship with each neuron in the previous and subsequent adjacent layers. As shown in the figure below, it is a simple two-layer fully connected network, the input is the feature, and the output is the predicted result.

The number of parameters of the fully connected layers can be calculated directly:

Number of parameters = sum over all layers of (number of nodes in the previous layer x number of nodes in this layer + number of nodes in this layer), where the first term counts the weights w and the second term counts the biases b.

For the two-layer fully connected network in the figure, training on black and white images with a resolution of only 28x28=784 already requires nearly 400,000 parameters to be optimized. High-resolution color images in real life have more pixels and carry three channels of red, green, and blue information. Too many parameters to be optimized can easily lead to overfitting. To avoid this, the original image is generally not fed directly into the fully connected network in practical applications.
Instead, convolutional feature extraction is performed on the original image first, the extracted features are fed to the fully connected network, and the fully connected network then calculates the classification scores.
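As a sketch of the parameter count for fully connected layers: each layer contributes (inputs x outputs) weights plus one bias per output. The hidden layer size of 500 below is an assumption chosen because it is consistent with "nearly 400,000" for a 784-input, 10-output network; the original figure is not available to confirm it.

```python
def dense_params(layer_sizes):
    """Sum of (inputs * outputs + outputs) over consecutive layer pairs."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 784 inputs -> 500 hidden nodes (assumed) -> 10 outputs
print(dense_params([784, 500, 10]))  # 397510
# a single layer from 5 nodes to 10 nodes
print(dense_params([5, 10]))         # 60
```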

5. The amount of network parameters and calculation

Convolution layer parameters/Convolution calculation amount

Convolution parameters = convolution kernel length x convolution kernel width x number of input channels x number of output channels + number of output channels (bias)
Convolution calculation amount = output data size x scale of convolution kernel x number of input channels

Example: input: 224x224x3, output: 224x224x64, convolution kernel: 3x3

  • Amount of parameters = 3x3x3x64+64
  • Calculations = 224x224x64x3x3x3

Convolution layer:

For example: the input is a 32x32x3 color picture, after the convolutional layer:

layers.Conv2D(100, kernel_size=3, strides=1, padding='same')

(1) Number of network parameters
The parameters are mainly those of the convolution kernels and the biases: 3x3x3x100+100=2800

(2) Calculation FLOPs
FLOPs (floating point operations) is the number of floating point operations, a measure of computation used in deep learning frameworks.
{32x32x[3x3+(3x3-1)]x3+32x32x(3-1)}x100
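The two numbers above can be evaluated with a short sketch. The function names are hypothetical; the FLOPs function follows the counting used in the expression above (multiplications and additions inside each channel, then additions across channels) and, like that expression, does not count the bias addition.

```python
def conv_param_count(k, c_in, c_out):
    """k*k*c_in weights per kernel, c_out kernels, plus one bias per kernel."""
    return k * k * c_in * c_out + c_out

def conv_flops(h_out, w_out, k, c_in, c_out):
    # per output pixel and per input channel: k*k multiplications and k*k - 1 additions
    per_channel = k * k + (k * k - 1)
    # c_in - 1 further additions merge the per-channel sums; the bias addition is not counted
    per_pixel = per_channel * c_in + (c_in - 1)
    return h_out * w_out * per_pixel * c_out

print(conv_param_count(3, 3, 100))     # 2800
print(conv_flops(32, 32, 3, 3, 100))   # 5427200
print(conv_param_count(3, 3, 64))      # 1792, i.e. 3x3x3x64+64 from the earlier example
```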

Fully connected layer:

For example, the number of nodes in the first layer is 5, and the number of nodes in the second layer is 10. Find the amount of network parameters and calculation FLOPS

(1) Network parameter quantity
The main source of network parameter quantity is neuron connection weight and bias : 5x10+10=60

(2) Calculation FLOPS
5x10+10=60

In 2015, Google researchers Sergey Ioffe et al. proposed the BN (Batch Normalization) layer, which is based on parameter standardization. After it was proposed, the BN layer was widely adopted in deep network models. It makes the choice of network hyperparameters less constrained, speeds up convergence, and improves performance.
For more information, please see: Neural Network Construction: BN Layer

6. Convolutional neural network training:

Basic training process:

Step 1: Initialize all convolution kernels and parameters/weights with random numbers.

Step 2: Take a training picture as input, perform the forward steps (convolution, ReLU, pooling, and forward propagation through the fully connected layer), and calculate the output probability of each category.

Step 3: Calculate the total error of the output layer.

Step 4: Use the backpropagation algorithm to calculate the gradient of the error with respect to all weights, and use gradient descent to update all convolution kernels and parameter/weight values so as to minimize the output error.

Note: The number of convolution kernels, the convolution kernel size, and the network architecture are fixed before Step 1 and do not change during training. Only the convolution kernel matrices and the neuron weights are updated.

Like the multi-layer neural network, the parameters of a convolutional neural network are trained with the error backpropagation algorithm. For training, the pooling layer is rewritten in the form of a multi-layer neural network, and the convolutional layer is likewise rewritten in the form of a multi-layer neural network.

CNN training in detail:

A CNN is essentially an input-to-output mapping. It can learn a large number of mapping relationships between inputs and outputs without needing any precise mathematical expression relating them. As long as the convolutional network is trained with known patterns, it acquires the ability to map inputs to outputs.

The convolutional network is trained with supervision, so its sample set consists of vector pairs of the form (input vector, ideal output vector). All these vector pairs should come from the actual running results of the system that the network is meant to simulate, and they can be collected from the running system.

1) Parameter initialization:

Before training starts, all weights should be initialized with different small random numbers. "Small" ensures that the network does not become saturated because of excessively large weights, which would cause training to fail; "different" ensures that the network can learn normally. In fact, if the weight matrix is initialized with identical values, the network cannot learn.

2) The training process consists of four steps in two stages

① The first stage: forward propagation stage

  • Take a sample from the sample set and input it to the network

  • Calculate the corresponding actual output; at this stage, the information is transformed step by step from the input layer and transmitted to the output layer. This process is also the process that the network performs normally after training

② The second stage: the backward propagation stage

  • Calculate the difference between the actual output and the corresponding ideal output

  • Adjust the weight matrix according to the method of minimizing the error

    The training process of the network is as follows:

  1. Select the training group: randomly pick N samples from the sample set as the training group;

  2. Set each weight and threshold to a small random value close to 0, and initialize the precision control parameter and the learning rate;

  3. Take an input pattern from the training group, feed it to the network, and give its target output vector;

  4. Calculate the output vector of the middle layer and then the actual output vector of the network;

  5. Compare the elements of the output vector with the elements of the target vector to calculate the output error; the error of the hidden units in the middle layer also needs to be calculated;

  6. Calculate the adjustment of each weight and of each threshold in turn;

  7. Adjust the weights and the thresholds;

  8. After M iterations, check whether the error meets the precision requirement. If not, return to step 3 and continue iterating; if it does, go to the next step;

  9. After training, save the weights and thresholds to a file. At this point each weight can be considered stable and the classifier has been formed. To train again, the weights and thresholds are loaded directly from the file, without initialization.
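The numbered procedure above can be sketched on the smallest possible "network": a single linear neuron y = w*x + b trained by per-sample gradient descent. Everything here (the function name, the learning rate, the precision value, the toy data) is an assumption for illustration; a real CNN would have many layers and would use a framework.

```python
import random

def train(samples, lr=0.1, eps=1e-6, max_epochs=10000, seed=0):
    """Train a single linear neuron y = w*x + b with per-sample gradient descent."""
    rng = random.Random(seed)
    # Step 2: small random initial values close to 0
    w, b = rng.uniform(-0.01, 0.01), rng.uniform(-0.01, 0.01)
    for epoch in range(max_epochs):
        total_error = 0.0
        for x, target in samples:        # Step 3: take one input pattern and its target
            output = w * x + b           # Step 4: forward pass
            err = output - target        # Step 5: output error
            total_error += 0.5 * err * err
            w -= lr * err * x            # Steps 6-7: compute and apply the adjustments
            b -= lr * err
        if total_error < eps:            # Step 8: precision check
            break
    return w, b, epoch + 1               # Step 9 would save w and b to a file

samples = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0), (1.5, 4.0)]  # toy data following y = 2x + 1
w, b, epochs = train(samples)
print(round(w, 2), round(b, 2), epochs)
```

The loop stops as soon as the summed error of one pass over the training group drops below the precision control parameter `eps`.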

8. Classic network introduction


  • LeNet-5


Number of neurons = number of convolution kernels X output feature map width X output feature map height
Number of trainable parameters of the convolution layer = number of convolution kernels X (convolution kernel width X convolution kernel height + 1) (1 means the bias)
Number of trainable parameters of the pooling layer = number of convolution kernels X (1 + 1) (the two 1s are the weighting coefficient and the bias; some pooling layers have no parameters)
Number of connections = number of convolution kernels X (convolution kernel width X convolution kernel height + 1) X output feature map width X output feature map height (1 means the bias)
Number of connections of the fully connected layer = number of convolution kernels X (number of input feature maps X convolution kernel width X convolution kernel height + 1) (the output feature map size is 1X1)
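As a worked instance of the first convolution formulas, the sketch below applies them to the first convolutional layer of LeNet-5 (C1), which uses six 5x5 kernels on a 32x32 single-channel input and produces 28x28 feature maps. The function name is made up for this example.

```python
def conv_layer_stats(num_kernels, k_w, k_h, out_w, out_h):
    """Return (neurons, trainable parameters, connections) for a single-channel conv layer."""
    neurons = num_kernels * out_w * out_h
    params = num_kernels * (k_w * k_h + 1)      # +1 is the bias
    connections = params * out_w * out_h
    return neurons, params, connections

# LeNet-5 C1: six 5x5 kernels, 28x28 output feature maps
print(conv_layer_stats(6, 5, 5, 28, 28))  # (4704, 156, 122304)
```

Because the kernels are shared across positions, only 156 parameters have to be learned even though the layer has 122,304 connections.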

  • AlexNet


  • Inception network


  • Residual network

For detailed network introduction, please refer to this article:

Super detailed introduction to convolutional neural network

CNN summary

Learning reference materials for this article:

Deep Learning in Deep Water - Task04 Convolutional Neural Network CNN_GoAl's Blog-CSDN Blog

Value of CNNs:

  1. Able to effectively reduce the dimensionality of a large amount of data into a small amount of data (without affecting the result)
  2. Ability to preserve the characteristics of pictures, similar to human visual principles

The basic principle of CNN:

  1. Convolutional layer – the main function is to preserve the characteristics of the image
  2. Pooling layer – the main function is to reduce the data dimension, which can effectively avoid overfitting
  3. Fully connected layer – output the results we want according to different tasks

Practical applications of CNNs:

  1. Image classification and retrieval
  2. Target location detection
  3. Target segmentation
  4. Face recognition
  5. Skeleton recognition

9. RNN Recurrent Neural Network

Fundamentals of RNNs

The structure of the traditional neural network is relatively simple: input layer - hidden layer - output layer, as shown below:

[Figure: traditional neural network]

The biggest difference between an RNN and a traditional neural network is that, at each step, the previous output is fed into the hidden layer together with the next input, and the two are trained together, as shown below:

[Figure: RNN difference]
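The idea of carrying the previous output forward can be sketched with a scalar RNN, where the new hidden state is tanh(w_x * x_t + w_h * h_prev + b). The weight values and function names below are hypothetical and chosen only so that the example runs.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar RNN: the new hidden state mixes the input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Feed the inputs one by one, passing each hidden state on to the next step."""
    h, states = h0, []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Although the second and third inputs are 0, the hidden states stay non-zero because they still carry information from the first input. With `w_h=0.0` the recurrence is cut and the network behaves like a traditional feedforward layer.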

RNN learning materials reference:

Big Talk Recurrent Neural Network (RNN)

10. LSTM Long Short-Term Memory Neural Network

1 Reasons for LSTM

RNNs run into great difficulties when dealing with long-term dependencies (nodes that are far apart in the time series), because computing the connection between distant nodes involves repeated multiplication of Jacobian matrices, which causes the gradient to vanish or explode. Among the remedies that have been proposed, the most successful and widely used is the gated RNN, and LSTM is the best known gated RNN. The leaky unit allows the RNN to accumulate long-term connections between distant nodes by designing the weight coefficients of the connections; the gated RNN generalizes this idea, allowing the coefficients to change at each time step and allowing the network to forget the information it has accumulated so far.

2 The difference between RNN and LSTM

All RNNs have a form of chains of repeating neural network modules. In a standard RNN, this repeated module has only a very simple structure, such as a tanh layer, as shown in the following figure:

LSTMs have the same chain structure, but the repeated module is different. Instead of a single neural network layer, there are four, interacting in a very specific way.

Note: The specific meanings of the above icons are as follows:

In the diagram above, each black line carries an entire vector from the output of one node to the inputs of other nodes. The pink circles represent pointwise operations, such as vector addition, and the yellow boxes are learned neural network layers. Lines that merge indicate that the vectors are concatenated, and lines that fork indicate that the content is copied and sent to different locations.

3 LSTM cores

LSTMs have the ability to remove or add information to the cell state through structures known as "gates". A gate is a method of selectively allowing information to pass through. They consist of a sigmoid neural network layer and a pointwise multiplication operation. The schematic diagram is as follows:

LSTM has three gates, namely the forget gate, the input layer gate and the output layer gate, to protect and control the cell state.

Forget gate

Object of action: cell state.

Function: Selectively forget the information in the cell state.

Operation steps: The gate reads h_{t-1} and x_t and outputs a value between 0 and 1 for each number in the cell state C_{t-1}. 1 means "keep completely", 0 means "discard completely". The schematic diagram is as follows:

Input gate

Object of action: cell state

Function: selectively record new information into the cell state.

Steps:

Step 1: The sigmoid layer called the "input gate layer" decides which values we are going to update.

Step 2: The tanh layer creates a vector of new candidate values C~_t that can be added to the state. Its schematic diagram is as follows:

Step 3: Update C_{t-1} to C_t. Multiply the old state by f_t, discarding the information we have decided to forget. Then add i_t * C~_t, the new candidate values scaled by how much we decided to update each state value. That is, C_t = f_t * C_{t-1} + i_t * C~_t. Its schematic diagram is as follows:

Output gate

Object of action: hidden layer ht

Function: Determine what value to output.

Steps:

Step 1: Use the sigmoid layer to determine which part of the cell state will be output.

Step 2: Process the cell state through tanh, and multiply it with the output of the sigmoid gate, and finally we will only output the part that we determine the output.

Its schematic diagram is as follows:
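The three gates described above can be put together in a minimal scalar sketch of one LSTM step. This is an illustration under stated assumptions, not a reference implementation: the state is a single number rather than a vector, and the parameter dictionary `params` and its values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. `p` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("forget"))        # forget gate: how much of c_prev to keep
    i_t = sigmoid(pre("input"))         # input gate: how much new information to write
    c_tilde = math.tanh(pre("cand"))    # candidate values
    c_t = f_t * c_prev + i_t * c_tilde  # updated cell state
    o_t = sigmoid(pre("output"))        # output gate: which part of the state to expose
    h_t = o_t * math.tanh(c_t)          # new hidden state
    return h_t, c_t

# Hypothetical parameters, chosen only to make the example run.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "cand", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

If the forget gate saturates near 1 and the input gate near 0, the cell state is passed through almost unchanged, which is how the LSTM keeps information over long time spans.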

LSTM and GRU study refer to my article: https://blog.csdn.net/qq_36816848/article/details/121616301


Interview summary:

Algorithm Job Interview Related

0. Algorithm Post Work Summary https://zhuanlan.zhihu.com/p/95922161
1. Artificial Intelligence Practical Interview Learning Roadmap https://github.com/tangyudi/Ai-Learn
2. Model Evaluation of Baimian Machine Learning https://zhuanlan.zhihu.com/p/78603645
3. Feature engineering for machine learning https://github.com/HadXu/feature-engineering-for-ml-zh
4. 500 deep learning questions https://github.com/scutan90/DeepLearning-500-questions
5. Deep Learning Unlimited Questions https://github.com/yoyoyo-yo/DeepLearningMugenKnock
6. Summary of Computer Vision Knowledge Points https://zhuanlan.zhihu.com/p/58776542
7. The most eye-catching achievement in the field of deep learning CV https://zhuanlan.zhihu.com/p/315605746
8. The technical roadmap for algorithm engineers https://zhuanlan.zhihu.com/p/192633890?utm_source=wechatTimeline_article_bottom&from=timeline

**Reverse interview:** https://github.com/yifeikong/reverse-interview-zh Questions to ask the interviewer at the end of a technical interview




Origin blog.csdn.net/feichangyanse/article/details/129377332