[Deep Learning] Chapter 5: Looking back at the network architecture - activation function - loss function - gradient descent

5. Review again: Network architecture - activation function - loss function - gradient descent

This section fills in, point by point, the details that the previous chapters left out.

The previous chapters showed the architecture diagram and data flow of a DNN in detail, from the input data through the parameters of each layer to the final output, and also wrote out the forward pass of a single neuron by hand. As you saw, it is just matrix multiplication. A real network has many layers, and each layer has many neurons, so writing a whole network by hand is obviously unrealistic. We still did it once for a single neuron so that everyone could see the mathematics behind it clearly.

Let's emphasize a few concepts:
(1) The nn.Linear() layer is called a linear layer, also called a fully connected layer.
(2) When someone asks how many layers your model has, the count usually does not include the input layer; only the hidden layers and the output layer are counted. Both the hidden layers and the output layer have parameters to learn, so the number of layers generally means the number of layers with parameters.
(3) Activation functions (also called activation layers) are generally not drawn on architecture diagrams, but each hidden linear layer must be followed by one. If there is no activation layer between two linear layers, the two layers can be merged into a single linear layer, so there is no point in having two. Even if the architecture diagram does not mark it, you should know that an activation layer follows each hidden linear layer.
(4) The number of neurons in the output layer is set according to the task. The activation function of the output layer is also set according to the task and is flexible: the output layer may have no activation function at all, or the activation may be folded into the loss function calculation, whichever is more convenient.
(5) The architecture is usually written as a class. This class must inherit from nn.Module and must implement the forward method.
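Point (3) can be checked numerically. The sketch below is only an illustration with arbitrary layer sizes: it stacks two linear layers without an activation and builds a single layer with W = W2 @ W1 and b = W2 @ b1 + b2 that produces the same output.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (sizes are arbitrary).
layer1 = nn.Linear(4, 3)
layer2 = nn.Linear(3, 2)

# Build the single equivalent layer: W = W2 @ W1, b = W2 @ b1 + b2.
merged = nn.Linear(4, 2)
with torch.no_grad():
    merged.weight.copy_(layer2.weight @ layer1.weight)
    merged.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)                      # 5 made-up samples
stacked_out = layer2(layer1(x))
merged_out = merged(x)
same = torch.allclose(stacked_out, merged_out, atol=1e-5)
print(same)
```

Because the two outputs agree up to floating-point error, the second layer adds no expressive power unless a nonlinear activation sits between the two.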

PyTorch packages linear layers, activation functions, convolutional layers, and so on as independent components in the nn module. By calling these components we can assemble our own network like building blocks, provided we know them well. Let's take a detailed look at the linear layer class nn.Linear().

1. nn.Linear()

After working through nn.Linear(), you should have mastered:
(1) The parameters of nn.Linear(), and how the parameters of consecutive layers must match.
(2) How to inspect the parameters of a layer you built with the parameters() method or the .weight and .bias attributes, and exactly what the rows and columns of the parameter matrix represent.
(3) The mathematics of how data propagates through the layer.
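As a minimal sketch of these three points (the sizes 3 and 2 and the random input are chosen only for illustration), the snippet below builds one linear layer, inspects its parameters, and reproduces its forward pass by hand.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 3 input features, 2 neurons (sizes chosen for illustration).
linear = nn.Linear(in_features=3, out_features=2)

# weight has shape (out_features, in_features): one row per neuron,
# one column per input feature. bias has one entry per neuron.
weight_shape = tuple(linear.weight.shape)
bias_shape = tuple(linear.bias.shape)

# parameters() yields the same tensors: weight first, then bias.
param_shapes = [tuple(p.shape) for p in linear.parameters()]

# The forward pass is x @ W^T + b.
x = torch.randn(5, 3)                       # 5 samples, 3 features
manual = x @ linear.weight.T + linear.bias
matches = torch.allclose(linear(x), manual, atol=1e-6)
print(weight_shape, bias_shape, param_shapes, matches)
```

If a second layer followed this one, its in_features would have to be 2, the out_features of this layer.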

2. Points to note when writing the architecture.
The picture below is the architecture we built in the previous chapter. Let’s break it down point by point:

Notes:
(1) Generally, the architecture is written as a class, and the class is then instantiated to produce our model.
(2) Point A: our my_model class inherits from the parent class nn.Module. This is required. nn.Module is the model-building class provided by PyTorch and the base class of all neural network modules. To define our own network we inherit from it and override its __init__ and forward methods. Many later steps, such as forward propagation, building the computation graph, backpropagation, gradient calculation, and running on the GPU, then do not have to be written by us. The definition of nn.Module runs to more than a thousand lines, so inheriting it is far more convenient.
(3) Point B is the __init__ method of our my_model class. We need it to define the layers, and its first parameter must be self. The next two parameters are passed in when the network is instantiated: in_features is the number of features of the input data and out_features is the number of output neurons.
(4) Point C calls the __init__ method of the parent class so that everything it sets up is available. This line is essentially boilerplate and can be written without thinking. It must be there: without it, assigning layers to self fails, so the class cannot even be instantiated, let alone used.
(5) Point D defines your own hidden layers and output layer; the self. prefix must not be omitted. This is basic Python knowledge. If it is not clear why, review the basics of Python classes systematically.
(6) Point E: pay attention to the parameter relationship between layers. The out_features of the previous layer must equal the in_features of the current layer, otherwise the forward pass fails with a shape mismatch.
(7) Point F defines the forward computation of the network. Here the parameters are self and x, which is the usual way to write it. The code inside forward is the calculation that the data flows through.
(8) Point G is the activation function: after passing through a linear layer, the data must pass through an activation function. The activation function of a hidden layer cannot be omitted.
(9) Point H is the activation function that follows the output layer. The output layer may or may not have one. If it does not, then for multi-class classification the softmax transformation still has to happen when the loss is calculated, so whether to write it in the architecture is up to you. For multi-class classification it is best to write log_softmax rather than plain softmax, and remember dim=1. We will discuss the activation functions and the softmax function separately later, and the pieces will connect then.
The above are the points to pay attention to when writing the architecture.
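The original figure is not reproduced here. The sketch below is a reconstruction that is only meant to be consistent with notes A to H: the hidden sizes 13 and 8, the attribute names, and the choice of relu are assumptions, not the author's exact code.

```python
import torch
from torch import nn
import torch.nn.functional as F


class my_model(nn.Module):                                 # A: inherit nn.Module
    def __init__(self, in_features, out_features):        # B: constructor
        super().__init__()                                 # C: run the parent constructor
        # D: layers are stored on self (hidden sizes 13 and 8 are hypothetical)
        self.linear1 = nn.Linear(in_features, 13)
        self.linear2 = nn.Linear(13, 8)                    # E: in_features == previous out_features
        self.output = nn.Linear(8, out_features)

    def forward(self, x):                                  # F: forward computation
        z1 = torch.relu(self.linear1(x))                   # G: hidden activation
        z2 = torch.relu(self.linear2(z1))                  # G: hidden activation
        return F.log_softmax(self.output(z2), dim=1)       # H: output activation


torch.manual_seed(0)
model = my_model(in_features=10, out_features=3)
x = torch.randn(4, 10)                                     # 4 made-up samples
log_probs = model(x)
print(log_probs.shape)
```

Because the output is log_softmax, exponentiating each row gives class probabilities that sum to 1, and the matching loss would be nn.NLLLoss.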

3. Activation function
In a DNN architecture, besides the fully connected linear layers, the most important component is the activation layer, so we discuss activation functions separately here, by category: hidden-layer activations and output-layer activations.

For a hidden layer, the main job of the activation function is the nonlinear transformation. Different activation functions have different characteristics, so each has its own strengths and weaknesses. For hidden layers we only cover sigmoid, tanh, and relu here. Other activation functions will be discussed when we meet them in CNNs and RNNs.

For the output layer, the role of the activation function is determined by your task.
a. For a regression task, it is best not to use any activation function in the output layer. The output is then just a linear combination of the features produced by the preceding layers.
b. For a binary classification task, the output layer activation is generally the sigmoid function, because sigmoid maps any number from negative infinity to positive infinity into the interval (0, 1), which can be read as a class probability and makes classification easy.
c. For a multi-class task, the output layer needs no activation function, which is also very common practice. Use the cross-entropy loss by calling nn.CrossEntropyLoss(reduction='mean'), and the two connect seamlessly. If you must add an activation function to the architecture, log_softmax is recommended, so that nn.NLLLoss() connects seamlessly. If you use softmax, you have to take the logarithm of the result first and then call nn.NLLLoss(); this is more cumbersome but not unusable.
When we get to the loss function we will demonstrate these conclusions, so in this part I will only cover the softmax function itself.

(1) The sigmoid, tanh, and relu activation functions of the hidden layer
As hidden-layer activation functions, they have two responsibilities. First, during forward propagation they provide the nonlinear transformation, which is where the power of a neural network comes from. Second, during backpropagation the activation function is one link in the chain rule, so it must at least not block the gradient from flowing back.

So when choosing an activation function we judge it on these two aspects. Let's look at how the different activation functions perform the nonlinear transformation and how they behave when gradients are backpropagated.

For the sigmoid function we can see:

  • In forward propagation, whatever value comes out of the linear transformation is mapped to a value between 0 and 1 by the sigmoid. The (0, 1) interval is what makes it useful: sigmoid is often placed after the output layer in binary classification, where the output must be a class probability. Logistic regression, written as a network, is simply one linear layer with a single output neuron followed by a sigmoid.
  • In backpropagation, the derivative is continuous and exists everywhere, and it is very simple to compute: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
  • The region outside roughly [-5, 5] is called the saturation region of the sigmoid. From the forward perspective, if the value coming from the linear layer falls outside this interval, the output is essentially 0 or 1, so the function can no longer tell such inputs apart. From the backward perspective, the derivative there is essentially 0 (and it never exceeds 0.25 anywhere). In the chain rule this derivative is one of the factors being multiplied, so when it approaches 0 the whole product approaches 0. In a deep network that uses sigmoid, the gradient therefore easily shrinks to 0 during backpropagation. This is called the vanishing gradient problem. Because w = w - lr*grad, a vanishing gradient means the parameters stop updating, the model stops learning, and the model cannot be trained.
  • The output of the sigmoid is not zero-mean, so the input to the next linear layer is not zero-mean either. In a very deep network this often makes the gradients unstable and hurts training. We will explain this in detail when we discuss model optimization.
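The saturation described above can be seen directly with autograd. This is a small illustration with hand-picked inputs: it computes the derivative of sigmoid at several points and compares it with the analytic form sigmoid(z) * (1 - sigmoid(z)).

```python
import torch

z = torch.tensor([-10.0, -5.0, 0.0, 5.0, 10.0], requires_grad=True)
s = torch.sigmoid(z)
s.sum().backward()            # z.grad now holds d sigmoid / dz at each point

# Analytic derivative: sigmoid(z) * (1 - sigmoid(z)); it peaks at 0.25 when z = 0.
analytic = (s * (1 - s)).detach()
print(s.detach())             # outputs squeezed toward 0 or 1 at the extremes
print(z.grad)                 # derivatives close to 0 at the extremes
```

At z = 0 the derivative is 0.25, its largest possible value, so every sigmoid layer in the chain multiplies the gradient by at most a quarter.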


The tanh activation function came after sigmoid, as a response to sigmoid's non-zero-mean output and the difficulty of training deep models with it. Its biggest advantage is a zero-mean output, and its derivative is not complicated either. But tanh has problems of its own: it also relies on exp, with more exp evaluations than sigmoid, and it also saturates for large inputs, so vanishing (or exploding) gradients remain hard to control. When your data is well behaved, with no extremely large or small inputs and outputs, tanh works better than sigmoid.
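A quick way to see the zero-mean property is to feed both functions the same symmetric inputs. The inputs below are made up for the illustration.

```python
import torch

z = torch.linspace(-3, 3, steps=7)          # symmetric inputs: -3, -2, ..., 3
sig_mean = torch.sigmoid(z).mean().item()   # centered on 0.5
tanh_mean = torch.tanh(z).mean().item()     # centered on 0
print(sig_mean, tanh_mean)
```

For inputs that are symmetric around 0, the sigmoid outputs average 0.5 while the tanh outputs average 0, which is why the next layer receives a zero-centered signal only with tanh.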


The relu activation function is currently the most commonly used one. Experience shows that SGD with ReLU converges faster than with sigmoid or tanh, and relu is very cheap to compute. But relu has a flaw: it is prone to the Dead ReLU Problem, also called neuron death or neuron inactivation. relu turns every negative result of the linear transformation into 0. A 0 passed to the next layer contributes nothing there, whatever the next layer's weights are, so for that input the neuron might as well not exist. In backpropagation, when relu has set the neuron's output to 0, the gradient through it is also 0, so the parameters of this neuron are not updated. The next forward pass then uses the same parameters, and the output will most likely be 0 again. If this happens for all inputs, the neuron no longer responds to any data, its parameters are never updated, and it is very hard for it to recover.
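The following sketch shows a dead neuron in isolation. The weights, bias, and inputs are set by hand (they are made up) so that the pre-activation is negative for every sample.

```python
import torch
from torch import nn

# One neuron whose parameters make its pre-activation negative for all samples below.
neuron = nn.Linear(2, 1)
with torch.no_grad():
    neuron.weight.fill_(-1.0)
    neuron.bias.fill_(-1.0)

x = torch.tensor([[0.5, 1.0], [2.0, 0.1], [1.5, 3.0]])   # all inputs positive
out = torch.relu(neuron(x))        # every output is clipped to 0
out.sum().backward()

print(out.squeeze())
print(neuron.weight.grad, neuron.bias.grad)
```

Both the outputs and the gradients are zero, so an update of the form w = w - lr*grad leaves this neuron exactly where it is.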

With the development of optimization techniques this problem has been mitigated to some extent, and we will demonstrate it when we discuss optimization. The point here is that although relu's tendency to deactivate neurons affects training, it is not necessarily a bad thing. Some argue that this property suppresses overfitting: deactivating some neurons reduces the complexity of the network. It can also push the network to spread its weights as evenly as possible rather than concentrating them on a few neurons, which is similar in effect to L2 regularization.

(2) Softmax function of the output layer.
As mentioned earlier, the output layer does not need an activation function, because the mathematical transformation can be written into the loss function instead.
If you insist on writing the activation function in the output layer rather than in the loss function, there are two cases. For binary classification, use the sigmoid function described above. For multi-class classification, use softmax or log_softmax, depending on which loss function you pair it with. So here we focus on the softmax function.

The formula of the softmax function is simple, and so is the principle. What is less simple is the code, which is a bit hard to understand at first. CNNs are often multi-class and therefore use softmax a lot, so to make the later CNN chapters easier, let's get softmax clear here. The following is how PyTorch implements softmax:

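The hardest part to understand here is the dim parameter, which can be understood with the help of the figure below. dim is the dimension along which the values are normalized so that they sum to 1: for a 2-D tensor with one row per sample and one column per class, dim=1 normalizes each row and dim=0 normalizes each column.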
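A small sketch of how dim behaves, using a made-up 2 x 3 tensor in which rows are samples and columns are classes:

```python
import torch

# 2 samples (rows) x 3 classes (columns); the numbers are arbitrary.
z = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])

by_row = torch.softmax(z, dim=1)   # normalize across classes, one sample at a time
by_col = torch.softmax(z, dim=0)   # normalize across samples, one class at a time

print(by_row.sum(dim=1))           # each row sums to 1
print(by_col.sum(dim=0))           # each column sums to 1
```

For class probabilities per sample, dim=1 is the one you want with this layout; the second row, whose three values are equal, comes out as one third for each class.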

From the formula, softmax computes e raised to each element, sums the results, and returns each element's share of that sum. The function is monotonic in each element: with the other elements fixed, as an element goes to negative infinity its softmax value approaches 0, and as it goes to positive infinity the value approaches 1. So the larger the value a neuron produces after the multiply, add, and intercept steps, the closer its probability is to 1; the smaller the value, the closer to 0.
However, because of the exponential, softmax easily overflows: e raised to a large number quickly exceeds what a floating-point number can hold, and a hand-written softmax then returns inf or nan (or raises an overflow error). The usual remedy is to subtract the largest element from every element before exponentiating, which does not change the result, or to work in log space with log_softmax. This is why a manual softmax can overflow while PyTorch's softmax function, which is written to be numerically stable, computes the value normally. So do not be surprised by small precision differences between a manual calculation and the library function; that is normal.
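The overflow is easy to reproduce. The sketch below uses made-up logits around 1000 and compares a naive softmax, a hand-written version that subtracts the maximum first, and torch.softmax.

```python
import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive formula: exp(1000) overflows float32 to inf, and inf / inf is nan.
naive = torch.exp(z) / torch.exp(z).sum()

# Shifted formula: subtracting the maximum leaves softmax unchanged mathematically.
shifted = torch.exp(z - z.max())
stable = shifted / shifted.sum()

builtin = torch.softmax(z, dim=0)
print(naive, stable, builtin)
```

The shift works because multiplying the numerator and the denominator by the same constant exp(-max) cancels out, while keeping every exponent at or below 0.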

4. Loss function

The process of building a deep learning model is: define the architecture -> define the loss function -> define the optimization algorithm -> solve for the model's w and b by minimizing the loss function -> feed new samples into the model for prediction.
As said before, the loss function is the soul of a deep learning model. This section discusses it in detail.

At the beginner stage we usually train either a regression model or a classification model. The most commonly used loss functions at this stage are:
When your model is a regression model, that is, when the label is a continuous value, the loss function is generally SSE (sum of squared errors) or the mean squared error MSE = 1/m * SSE. Either can serve as your loss function, which is what we usually just call the loss.
When your model is a classification model, we generally use the cross-entropy loss. Binary classification is just a special case of multi-class classification, so both use the cross-entropy loss.

These are the most common and basic loss functions. There are many others, such as L1 (mean absolute error), L2 (mean squared error), Huber loss, LogCosh loss, cross entropy (log loss), focal loss, and hinge loss. Which one to choose depends on your task and your goal: choose a function that properly quantifies the gap between your model's output and your target. Next, consider whether it converges, that is, whether a solution exists, whether it is differentiable, which optimization algorithm can solve it, and the time and space cost of solving it. A suitable loss makes the model converge faster and predict more accurately. For example, the gradient of the MSE loss is proportional to the prediction error, so the updates are large when the error is large and shrink as the predictions approach the labels. MSE and SSE differ only by the constant factor 1/m, so with the same learning rate SSE takes steps m times larger.

There is mathematics behind every loss function, and to understand the strengths and weaknesses of each one you have to analyze that mathematics. The blog post "Summary of Paddle Loss Function" (Paddle AI Studio Galaxy Community) explains the math behind the loss functions above, and their possible biases, very clearly.

Once the loss function is fixed, we look for its minimum. Finding the minimum is the same as finding the model parameters w and b, and the usual method is gradient descent.

Why find the minimum of the loss function? Take the regression model as an example, with MSE as the loss. The smaller the MSE, the closer each sample's prediction is to its true label, that is, the smaller the gap between predicted and actual results. So the smaller the loss, the better the model predicts, and the larger the loss, the worse the model. That is why the loss function is the soul of the model: it determines what the trained model looks like.

Why is finding the minimum of the loss function the same as finding the parameters w and b? Because the loss is a function of w and b, so minimizing it is a mathematical problem in those variables. Generally speaking, the procedure is to first convert the problem into a convex one, most commonly with a Lagrangian transformation (neural networks skip this step), and then find the minimum of the convex function. That second step is the optimization process, the algorithm used is called the optimization algorithm, and the most common one is gradient descent.

You can find plenty of material online about the mathematics behind loss functions, so here we focus on the code:
(1) MSE and SSE
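A minimal sketch with made-up predictions and labels: nn.MSELoss with reduction="mean" gives MSE, and the same class with reduction="sum" gives SSE.

```python
import torch
from torch import nn

yhat = torch.tensor([2.5, 0.0, 2.0, 8.0])     # made-up predictions
y = torch.tensor([3.0, -0.5, 2.0, 7.0])       # made-up labels

mse = nn.MSELoss(reduction="mean")(yhat, y)   # mean of the squared errors
sse = nn.MSELoss(reduction="sum")(yhat, y)    # sum of the squared errors
print(mse.item(), sse.item())                 # 0.375 1.5
```

The squared errors are 0.25, 0.25, 0, and 1, so SSE is 1.5 and MSE is 1.5 / 4 = 0.375.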

(2) Binary classification cross entropy

For the binary cross-entropy loss, nn provides two classes:
The BCEWithLogitsLoss class has the sigmoid function and the cross-entropy function built in. It computes the sigmoid of its input automatically, so you only need to pass zhat and the true labels.
The BCELoss class contains only the cross-entropy function, with no sigmoid layer, so you pass sigma (the sigmoid output) and the true labels.

BCE stands for binary cross entropy. In both classes the predicted value is the first argument and the true label is the second. The predicted values and the true labels must have the same data type and shape.
Both classes also take a reduction argument, which defaults to 'mean'. Set reduction='sum' to get the total error, or reduction='none' to get the error of each sample, which in our example with 1000 samples is a matrix with 1000 rows and 1 column.

Why are there two classes for the same error? Because computing sigmoid and the logarithm separately is numerically less precise. BCEWithLogitsLoss() combines the two steps in a numerically stable way, so its result is more accurate than applying sigmoid ourselves and then calling BCELoss(). When precision matters, use BCEWithLogitsLoss().
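The two classes can be compared side by side. The data below is randomly generated for the illustration (1000 samples, one output each); zhat follows the naming used above for the raw outputs.

```python
import torch
from torch import nn

torch.manual_seed(0)
zhat = torch.randn(1000, 1)                          # raw outputs (logits), made up
y = torch.randint(0, 2, (1000, 1)).float()           # 0/1 labels, same shape and dtype

loss_logits = nn.BCEWithLogitsLoss()(zhat, y)        # sigmoid is applied internally
loss_probs = nn.BCELoss()(torch.sigmoid(zhat), y)    # sigmoid applied by hand first

# reduction="none" keeps one loss value per sample.
per_sample = nn.BCEWithLogitsLoss(reduction="none")(zhat, y)
print(loss_logits.item(), loss_probs.item(), per_sample.shape)
```

For moderate inputs like these the two results agree closely; the difference in precision shows up when the logits are very large in magnitude and the hand-applied sigmoid rounds to exactly 0 or 1.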

(3) Multi-class cross entropy loss function
For multi-class models, the label vector is conceptually turned into a one-hot matrix (dummy variables), in which 1 marks the position of the true label and 0 marks all other classes. PyTorch's nn.CrossEntropyLoss and nn.NLLLoss accept class indices directly, so you rarely build this matrix by hand. The one-hot matrix is shown in the figure below:

If your model is a multi-class model, the number of neurons in the output layer equals the number of label categories. The output layer can either have no activation function or be followed by softmax (or log_softmax); no other activation function fits here. With a softmax layer, the network returns one class probability per label category.

The multi-class cross-entropy loss of one sample is -sum(yi*log(y_predi)), where y_predi is the softmax value returned by the network for class i and yi is the corresponding entry of the one-hot matrix. From this you can also see that binary cross entropy is a special case of multi-class cross entropy.
The calculation is therefore: take the logarithm of the softmax values, multiply by the one-hot matrix, sum over the classes, and finally sum (or, with reduction='mean', average) over all samples.
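The calculation can be written out by hand and checked against the library. The logits and labels below are made up, and the result is averaged over samples to match the default reduction='mean'.

```python
import torch
import torch.nn.functional as F

zhat = torch.tensor([[2.0, 0.5, 0.1],
                     [0.2, 1.5, 0.3]])         # made-up logits: 2 samples, 3 classes
y = torch.tensor([0, 1])                       # class index of each sample

onehot = F.one_hot(y, num_classes=3).float()   # rows [1, 0, 0] and [0, 1, 0]
log_probs = torch.log(torch.softmax(zhat, dim=1))

# -sum(y_i * log(y_pred_i)) per sample, then the mean over samples.
manual = -(onehot * log_probs).sum(dim=1).mean()
builtin = F.cross_entropy(zhat, y)             # takes logits and class indices
print(manual.item(), builtin.item())
```

Multiplying by the one-hot matrix simply picks out the log-probability of the true class, which is why the library can work from class indices alone.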

PyTorch packages each step of this process into a class so that we can combine them flexibly. If your architecture has no softmax layer, the simplest route is to call nn.CrossEntropyLoss directly on the raw outputs, since it applies log-softmax and the negative log-likelihood internally. Equivalently, call nn.LogSoftmax first to turn the outputs into log-probabilities and then nn.NLLLoss to calculate the cross-entropy loss.
If your architecture already ends with log_softmax, call nn.NLLLoss on its output. If it ends with plain softmax, take the logarithm of the probabilities first and then call nn.NLLLoss. Do not pass softmax outputs to nn.CrossEntropyLoss: it expects raw outputs and would apply softmax a second time.
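nn.CrossEntropyLoss expects raw outputs (logits) and applies log-softmax itself, so the valid pairings are the ones sketched below. The logits and labels are made up; the last line shows what goes wrong when softmax outputs are fed to nn.CrossEntropyLoss.

```python
import torch
from torch import nn

zhat = torch.tensor([[5.0, 0.0, 0.0],
                     [0.0, 5.0, 0.0]])        # made-up logits: 2 samples, 3 classes
y = torch.tensor([0, 1])                      # class indices

# Route 1: raw logits straight into CrossEntropyLoss.
ce = nn.CrossEntropyLoss()(zhat, y)

# Route 2: LogSoftmax first, then NLLLoss.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(zhat), y)

# Route 3: Softmax, take the log by hand, then NLLLoss.
nll_from_softmax = nn.NLLLoss()(torch.log(nn.Softmax(dim=1)(zhat)), y)

# Mistake: softmax outputs into CrossEntropyLoss applies softmax twice.
double_softmax = nn.CrossEntropyLoss()(nn.Softmax(dim=1)(zhat), y)
print(ce.item(), nll.item(), nll_from_softmax.item(), double_softmax.item())
```

The first three values agree. The fourth is much larger even though the predictions are confident and correct, because the second softmax flattens the probabilities.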

That is the code for our common loss functions. At the beginner stage these three are enough for regression and classification tasks. We will add more as we meet other loss functions in later models.

Summary: we build models to predict, and we want the predictions to be good. Prediction quality is reflected in the loss function, because the loss measures the gap between the predicted values and the true labels. A loss of 0 means every prediction is correct, and the smaller the loss, the better the predictions. This turns training into a mathematical problem: find the minimum of the loss function. That is also why the loss must be tied to prediction quality, and why we use SSE, MSE, and cross entropy, all of which measure the gap between predictions and true labels.
At the same time, the loss is a function of the parameters, so the parameters at which the loss is smallest are the parameters of our model: under them the loss is minimal and the model predicts best.

5. Gradient descent

With the loss function defined, the next step is to find the weights w and b that minimize it. This is the optimization process. There are many optimization methods, and the most common is gradient descent. We only introduce it briefly here: the DNN is the simplest architecture in deep learning, so we might as well use it as the example. We will add more when we study more complex models and meet more advanced optimization algorithms.
By now it should be clear that deep learning is not that difficult. Everything revolves around data, architecture, loss, and optimization, and the more advanced material later is built around the same points. That is why we only cover the most basic gradient descent algorithm here, mainly so that everyone understands the idea.

Let's start from the beginning. Training a neural network goes like this. A sample with, say, n features enters at the input layer (layer 0). Its values are adjusted by the weights w between layer 0 and layer 1, where adjusting means multiplying and adding, and the resulting number arrives at a neuron in the first hidden layer. The neuron applies a nonlinear function such as relu or sigmoid to turn this number into another number, which is the output of that hidden layer. The output is then adjusted by the weights between this layer and the next and passed on, and so on until the data reaches the output layer. For a regression model, the multiply-add result that reaches the output neuron can be output directly, or a function mapping can be applied first. Note that for regression the last layer generally has only one neuron.

For a binary classification model, the last layer can have 1 neuron or 2 neurons. With 1 neuron, the multiply-add result that reaches it is passed either through a step function, which outputs 0 or 1 directly, or through the sigmoid function, which outputs the probability p of one class; the other class has probability 1 - p, so the two class probabilities sum to 1. A list comprehension with a threshold of 0.5 then converts the probabilities into 0 and 1. With 2 neurons, the activation function cannot be sigmoid and must be softmax, which converts the two numbers into two class probabilities that sum to 1. The same threshold of 0.5 then gives the 0/1 prediction.

For multi-class classification, the number of neurons in the output layer equals the number of classes. After the multiply-add result reaches the output layer, softmax converts it into class probabilities that sum to 1. The final output of the model can be these probabilities, or a prediction derived from them in which the class with the highest probability is set to 1 and all other classes to 0.
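Turning probabilities into predictions takes only a line or two. The sketch below uses made-up output values: a threshold of 0.5 for the single-neuron binary case and the highest probability for the multi-class case.

```python
import torch

# Binary case: one output neuron, sigmoid gives the probability of class 1.
z_binary = torch.tensor([-2.0, 0.3, 1.5])
p1 = torch.sigmoid(z_binary)
binary_pred = [1 if p >= 0.5 else 0 for p in p1.tolist()]   # threshold 0.5

# Multi-class case: one output neuron per class, softmax, then the largest probability.
z_multi = torch.tensor([[0.1, 2.0, -1.0],
                        [1.2, 0.3, 0.4]])
probs = torch.softmax(z_multi, dim=1)
multi_pred = probs.argmax(dim=1).tolist()
print(binary_pred, multi_pred)   # [0, 1, 1] [1, 0]
```

argmax returns the index of the winning class, which carries the same information as a row with a 1 in that position and 0 everywhere else.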

The process above is the forward propagation of data. Before it starts, we randomly generate the w that each layer needs, so the data at each layer is the result of multiplying and adding. Each neuron has an input, the multiply-add result, and an output, the activation function applied to it. A forward pass therefore produces a lot of intermediate results.

Now for backpropagation. Why do we need it? Because the w used in forward propagation was generated randomly, and there is no reason for the output of a random w to be close to the true labels. So we start from an assumed set of w, that is, a random one, let the data propagate forward once, and then calculate the loss function: how far the output produced by this random w is from the true label y. For a regression model the loss is the mean squared error, and training means finding the w that minimizes it. For binary classification the loss is the binary cross entropy; the smaller it is, the closer the predicted class probabilities are to the true label y. For multi-class classification the loss is the multi-class cross entropy.

The question therefore becomes how to reach the minimum of the loss function, and that minimum depends on every weight w. We look at each w in turn, from right to left, and ask how much it contributes to the loss. For example, how much does the weight w between the first neuron of the output layer and the first neuron of the adjacent hidden layer contribute? In other words, if this w changes a little, how much does the loss change? That is the partial derivative of the loss with respect to w, and we find it with the chain rule.

First, differentiate the loss with respect to the output. The loss is some function f of the model's prediction, so this derivative is a function of the output, and the output was already computed in forward propagation; substituting it gives the value of the first factor. Second, differentiate the output with respect to the multiply-add result. Here the output plays the role of y, the multiply-add result plays the role of x, and the relationship between them is the activation function, which we assume is differentiable. The derivative is a function of x, the input to the activation function, which was also computed in forward propagation, so we substitute it to get the second factor. Third, differentiate the multiply-add result with respect to w. The multiply-add result is the output of the previous layer multiplied by w (plus the bias), so its derivative with respect to w is simply the output of the previous layer, again already computed in forward propagation. The product of these three factors is the result of the chain rule, and all the values it needs were obtained during forward propagation. At this point we have the derivative of the loss function with respect to this w.
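The three factors can be written out for a single neuron and compared with autograd. Everything in this sketch is an assumption made for illustration: one weight, a sigmoid activation, a squared-error loss, and made-up numbers.

```python
import torch

a_prev = torch.tensor(0.8)                    # output of the previous layer (made up)
y = torch.tensor(1.0)                         # true label (made up)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass: every intermediate value is reused in the backward pass.
z = w * a_prev + b                            # multiply-add result
yhat = torch.sigmoid(z)                       # activation output
loss = (yhat - y) ** 2                        # squared error for one sample

# Chain rule by hand: three factors multiplied together.
dloss_dyhat = 2 * (yhat - y)                  # loss with respect to the output
dyhat_dz = yhat * (1 - yhat)                  # output with respect to the multiply-add result
dz_dw = a_prev                                # multiply-add result with respect to w
manual_grad = (dloss_dyhat * dyhat_dz * dz_dw).detach()

loss.backward()                               # autograd computes the same quantity
print(manual_grad.item(), w.grad.item())
```

Every factor is evaluated from values the forward pass already produced, which is why backpropagation is cheap once the forward pass is done.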

The derivative we have just found tells us how much the loss function changes when w changes a little. We also call it the gradient. The gradient has both a direction and a magnitude. Once we have it, we know how to update w so that the loss goes down, so we start iterating: w_{t+1} = w_t - λ · (the gradient computed above), where λ is called the step size (also known as the learning rate). After many iterations we approach the w that minimizes the loss function.

So backpropagation means differentiating step by step, from the output back toward the input, to obtain the derivative of the loss function with respect to each w, that is, the gradient. We then move w in the direction opposite to the gradient, because moving that way makes the loss smaller. By iterating like this we find a set of w that minimizes the loss function. This is the learning process of our neural network, and the resulting set of w is the model we want. We can then feed new samples into this model and get reasonably good predictions.
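The whole cycle described above (forward pass, loss, backward pass, update) can be sketched in plain Python so that every derivative stays visible. This is an illustrative sketch under stated assumptions, not the article's PyTorch listing: the toy data, the tiny network (one sigmoid hidden neuron, one linear output, no biases), the starting weights, and the step size of 0.1 are all made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy regression data.
xs = [0.0, 1.0, 2.0]
ys = [0.5, 1.0, 1.5]

w1, w2 = 0.5, 0.5   # stand-ins for randomly generated weights
lr = 0.1            # step size λ (assumed value)
losses = []

for step in range(5):
    n = len(xs)

    # Forward propagation: hidden activation, then linear output.
    hs = [sigmoid(w1 * x) for x in xs]
    preds = [w2 * h for h in hs]

    # Loss: mean squared error.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    losses.append(loss)

    # Backpropagation: chain rule from the loss back to each weight.
    g1 = 0.0
    g2 = 0.0
    for x, h, p, y in zip(xs, hs, preds, ys):
        dL_dp = 2.0 * (p - y) / n              # loss w.r.t. prediction
        g2 += dL_dp * h                        # prediction w.r.t. w2 is h
        g1 += dL_dp * w2 * h * (1.0 - h) * x   # through w2, the sigmoid, then w1

    # Gradient descent: move each weight against its gradient.
    w1 -= lr * g1
    w2 -= lr * g2

    print(f"step {step}: loss = {loss:.4f}")
```

With a step size this small the printed loss goes down at every step, but after only five updates it is still far from zero.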

The above is the process of finding the minimum of the loss function. The difficult part is computing the partial derivatives. For a neural network, the loss as an explicit expression of every weight is far too complicated to write out by hand, so the derivatives could not be obtained directly, and neural networks saw little progress for about half a century. It was only after backpropagation was proposed, that is, applying the chain rule layer by layer and carrying out the computation on a computer, that neural networks made a qualitative leap.

The following code performs backpropagation and iterates 5 times:

Note: The backpropagation above can only be executed once. If you run loss.backward() a second time, an error is raised saying that the computation graph has already been freed. To solve this, add a parameter: loss.backward(retain_graph=True). If backward has already been run without it, the graph is gone, so we must run forward propagation again to rebuild the computation graph before executing this line again.
As you can see, PyTorch has packaged the backward differentiation for us. We only need one line of code, loss.backward(), but behind this one line is everything described in the text above, together with the corresponding mathematics. Let's take a look at the results of running the code above:

At this point we have gone through, step by step, creating data, building the neural network architecture, adding activation functions, forward propagation for prediction, computing the loss function, backpropagation to find the gradients, and updating the parameters with the simplest gradient descent method. That covers the basics of neural networks.
However, there is much more to neural networks than this, and not every step above goes smoothly. For example, in the example above we only iterated 5 times, and the loss is still very large, which means our predictions are still very poor. But can we simply iterate 50,000 times until the loss approaches 0? Not necessarily! There are many hurdles along the way, and we will explain them later. In the next chapter we will talk about the optimization of neural networks.


Origin blog.csdn.net/friday1203/article/details/135361146