5 Common Loss Functions for Training Deep Learning Neural Networks

Optimizing a neural network during training means first estimating the error of the model's current state and then updating the weights so that the error is lower at the next evaluation. The function used to estimate that error is called the loss function.

The choice of loss function is related to the specific predictive modeling problem (such as classification or regression) that the neural network model learns from examples. In this article we will introduce some commonly used loss functions, including:

  • Mean squared error, mean squared logarithmic error, and mean absolute error for regression models
  • Cross-entropy and hinge loss for binary classification models

Loss functions for regression models

Regression prediction models are mainly used to predict continuous values. So we'll use scikit-learn's make_regression() function to generate some simulated data, and use this data to build a regression model.

We will generate 20 input features: 10 of them will be meaningful, but 10 will be irrelevant to the problem.

We randomly generate 1,000 examples and specify a random seed so that the same 1,000 examples are generated whenever the code is run.
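A minimal sketch of this data-generation step (the noise level and variable names are assumptions, not details from the original post):

```python
from sklearn.datasets import make_regression

# 1,000 examples with 20 input features: 10 informative, 10 irrelevant to the target;
# a fixed random_state makes the same data appear on every run
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=0.1, random_state=1)
```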

Scaling real-valued input and output variables to a reasonable range often improves neural network performance. So we need to normalize the data.

StandardScaler is available in the scikit-learn library; to simplify the problem, we will scale all the data before splitting it into training and test sets.

Then we split the data equally into training and test sets.
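A sketch of the scaling and splitting steps, assuming a 500/500 split of the 1,000 examples as described above:

```python
from sklearn.preprocessing import StandardScaler

# standardize both the inputs and the target before splitting
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(len(y), 1))[:, 0]

# split the 1,000 examples equally into train and test sets
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
```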

To introduce different loss functions, we will develop a small multi-layer perceptron (MLP) model.

According to the problem definition, the model takes the 20 features as input. Since a single real value must be predicted, the output layer will have one node.

We use SGD for optimization, with a learning rate of 0.01 and a momentum of 0.9, both reasonable defaults. Training runs for 100 epochs; the test set is evaluated at the end of each epoch, and the learning curves are plotted.
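One possible sketch of such an MLP in Keras; the 25-node hidden layer is an assumption, and the loss passed to compile() is the one that changes in the sections that follow:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# 20 inputs -> one hidden layer -> a single linear output node for the real-valued target
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# SGD with learning rate 0.01 and momentum 0.9; the loss is set per the sections below
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='mse', optimizer=opt)

# 100 epochs, evaluating the test set at the end of each epoch
history = model.fit(trainX, trainy, validation_data=(testX, testy),
                    epochs=100, verbose=0)
```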

With the model defined, the loss functions can be introduced:

MSE

The most commonly used loss for regression problems is Mean Squared Error (MSE). It is the preferred loss function under maximum likelihood inference when the distribution of the target variable is Gaussian. So you should only switch to another loss function if you have a good reason to.

If "mse" or "mean_squared_error" is specified as the loss function when compiling the model in Keras, the mean squared error loss function is used.

The code below is a complete example of the above regression problem.
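A hedged reconstruction of what that complete example might look like (the hidden-layer size and the plotting details are assumptions):

```python
# regression with MSE loss: data generation, scaling, model, training, learning curves
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot

# generate and standardize the data
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=0.1, random_state=1)
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(len(y), 1))[:, 0]

# split equally into train and test sets
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

# define the MLP and compile it with mean squared error loss
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))

# fit the model and report train/test MSE
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# plot the learning curves of the MSE loss
pyplot.title('Mean Squared Error Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```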

Running the example first prints the mean squared error of the model on the training and test datasets; it is displayed as 0.000 because the values are rounded to three decimal places.

As you can see from the graph below, the model converges fairly quickly and the training and test performance remain equivalent. Given the performance and convergence behavior of the model, mean squared error is a good choice for regression problems.

MSLE

In regression problems with a wide range of target values, you may not want to penalize the model as heavily as mean squared error does when it predicts large values. Instead, you can first take the natural logarithm of each predicted value and then calculate the mean squared error. This loss is called MSLE, or Mean Squared Logarithmic Error.

It has the effect of relaxing the penalty for large differences in large predicted values. It may be a more appropriate loss measure when the model predicts unscaled quantities directly.

Use "mean_squared_logarithmic_error" as the loss function in keras

Below is the code change needed to use the MSLE loss function; the rest of the script follows the MSE example above.
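A sketch of that change; tracking MSE as a metric so the two losses can be compared across epochs is an assumption:

```python
# swap the loss to MSLE and keep MSE as a metric for comparison across epochs
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=SGD(learning_rate=0.01, momentum=0.9),
              metrics=['mse'])
```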

The model achieves slightly worse MSE on both the train and test datasets. This is because the distribution of the target variable is a standard Gaussian, which suggests that MSLE may not be a good fit for this problem.

The graph below tracks the MSE metric at each training epoch for comparison; the loss converges well, but the MSE may show signs of overfitting, as it drops until around epoch 20 and then starts to rise again.

MAE

Depending on the regression problem, the distribution of the target variable may be mostly Gaussian but contain outliers, such as large or small values far from the mean.

In this case, the mean absolute error, or MAE, is an appropriate loss function because it is more robust to outliers. It is calculated as the mean of the absolute differences between the actual and predicted values.

Use the "mean_absolute_error" loss function

Here is the code change needed to use MAE; everything else is unchanged from the MSE example.
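The corresponding sketch, again assuming only the compile call changes and that MSE is kept as a metric for comparison:

```python
# swap the loss to MAE; MSE is still tracked as a metric for comparison
model.compile(loss='mean_absolute_error',
              optimizer=SGD(learning_rate=0.01, momentum=0.9),
              metrics=['mse'])
```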

The result is as follows

As you can see from the graph below, MAE does converge, but the curve is bumpy. MAE is also not very suitable in this case because the target variable follows a Gaussian distribution without large outliers.

Loss functions for binary classification

A binary classification problem is a predictive modeling problem in which each example is assigned one of two labels. It is framed as predicting a value of 0 or 1 for the first or second class, and is usually implemented as predicting the probability that an example belongs to class 1.

We also use sklearn to generate data. The circle problem is used here. It has a two-dimensional plane with two concentric circles, where the points on the outer circle belong to class 0, and the points on the inner circle belong to class 1. To make learning more challenging, we also add statistical noise to the samples. The sample size is 1000 and 10% statistical noise is added.

A scatter plot of a dataset can help us understand the problem we are modeling. Listed below is a complete example.
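A minimal sketch of generating the dataset and plotting it (the plot styling is an assumption):

```python
from sklearn.datasets import make_circles
from numpy import where
from matplotlib import pyplot

# 1,000 points on two concentric circles, with 10% statistical noise
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# scatter plot colored by class value (0 = outer circle, 1 = inner circle)
for class_value in range(2):
    row_ix = where(y == class_value)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
pyplot.legend()
pyplot.show()
```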

The scatter plot is as follows, where the input variables determine the location of the points and the colors are the class values. 0 is blue, 1 is orange.

As before, half of the data is used for training and half for testing.

We again define a simple MLP model.

SGD is used for optimization, with a learning rate of 0.01 and a momentum of 0.99.

The model is trained for 200 epochs, and its performance is evaluated in terms of loss and classification accuracy.
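A hedged sketch of the split and the classification MLP described above; the 50-node hidden layer is an assumption, and X and y come from the circles snippet earlier:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# half of the 1,000 examples for training, half for testing
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

# simple MLP with a single sigmoid output node for the class-1 probability
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))

# SGD with learning rate 0.01 and momentum 0.99; the loss is set per the sections below
opt = SGD(learning_rate=0.01, momentum=0.99)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# 200 epochs, tracking loss and accuracy on the test set each epoch
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)
```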

BCE

BCE (binary cross-entropy) is the default loss function for binary classification problems. It is the preferred loss function under the maximum likelihood inference framework. Cross-entropy computes a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1.

When compiling a Keras model, binary_crossentropy can be specified as the loss function.

In order to predict the probability of class 1, the output layer must contain a node and a 'sigmoid' activation.

Here is the complete code:
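A hedged reconstruction of the complete cross-entropy example (the hidden-layer size and plotting details are assumptions):

```python
# binary classification on the circles problem with binary cross-entropy loss
from sklearn.datasets import make_circles
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot

# generate the circles dataset and split it in half
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

# single sigmoid output node predicts the probability of class 1
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=SGD(learning_rate=0.01, momentum=0.99),
              metrics=['accuracy'])

# fit and evaluate classification accuracy
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# learning curves for loss and accuracy
pyplot.subplot(211)
pyplot.title('Binary Cross-Entropy Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.subplot(212)
pyplot.title('Classification Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
```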

The model learns the problem relatively well, reaching about 85% accuracy on the training dataset and 83% on the test dataset. The two scores are close, indicating that the model is neither overfit nor underfit.

As shown in the image below, the training works well. Since the error between the probability distributions is continuous, the loss plot is smooth, while the accuracy line plot shows bumps, as examples in the training and test sets can only be predicted as correct or incorrect, providing less granular information.

Hinge

Support Vector Machine (SVM) models use the Hinge loss function as an alternative to cross-entropy to solve binary classification problems.

The target values are in the set {-1, 1}, and the loss is intended for binary classification. Hinge loss assigns larger errors when the actual and predicted class values have different signs. It sometimes performs better than cross-entropy on binary classification problems.

As a first step, we have to modify the value of the target variable to the set {-1, 1}.

In Keras it is called 'hinge'.

The output layer of the network must use a single node with the tanh activation function to output a single value between -1 and 1.

Here is the complete code:
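A hedged reconstruction of the complete hinge-loss example. The hidden-layer size and plotting details are assumptions, and classification accuracy is computed manually from the sign of the predictions, because the {-1, 1} targets do not match the assumptions of Keras's built-in accuracy metric:

```python
# binary classification on the circles problem with hinge loss and a tanh output
from sklearn.datasets import make_circles
from numpy import sign
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot

# generate the circles dataset and remap the {0, 1} labels to {-1, 1}
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
y[y == 0] = -1

# split in half for train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

# single tanh output node gives a value in [-1, 1]; compile with hinge loss
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))
model.compile(loss='hinge', optimizer=SGD(learning_rate=0.01, momentum=0.99))

# fit the model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)

# accuracy from the sign of the predicted values
train_acc = (sign(model.predict(trainX, verbose=0))[:, 0] == trainy).mean()
test_acc = (sign(model.predict(testX, verbose=0))[:, 0] == testy).mean()
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# learning curves of the hinge loss
pyplot.title('Hinge Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```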

Slightly worse performance than cross-entropy, with less than 80% accuracy on both train and test sets.

As can be seen from the figure below, the hinge loss has converged, and the classification accuracy shows that the model has converged as well.

It can be seen that BCE is the better choice for this problem, likely because of the noise points we added to the data.

https://avoid.overfit.cn/post/6f094f37a1174809b8d145510b8d6e28

Author: Onepagecode


Origin blog.csdn.net/m0_46510245/article/details/127360301