PyTorch deep learning practice (4) - detailed explanation of common activation functions and loss functions

0. Preface

Activation functions and loss functions are important components of deep learning models. Their choice largely determines the performance and accuracy of a deep neural network, and they need to be selected according to the characteristics of the specific problem and the data distribution. In this section, we introduce the activation functions and loss functions commonly used in deep learning and illustrate typical application scenarios for each.

1. Commonly used activation functions

Activation functions make the network highly non-linear, which is critical for modeling complex relationships between inputs and outputs. Without a nonlinear activation function, the network can only express a simple linear map: even with many hidden layers, the entire network is equivalent to a single-layer neural network. Only by adding nonlinear activation functions, applied across the layers of the network, do deep neural networks gain their remarkable ability to learn non-linear maps.
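To make this concrete, here is a minimal NumPy sketch (an added illustration, with arbitrary random weights) showing that two stacked linear layers without an activation collapse into a single linear map:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # weights of a first "layer" with no activation
W2 = rng.normal(size=(3, 8))   # weights of a second "layer" with no activation
x = rng.normal(size=(4,))

two_layers = W2 @ (W1 @ x)       # output of the stacked linear layers
one_layer = (W2 @ W1) @ x        # output of a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True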

1.1 Sigmoid activation function

The sigmoid function is one of the most widely used activation functions. It maps any real number into the interval (0, 1), and it can be used for binary classification problems.

The sigmoid function is defined as follows:

$sigmoid(x) = \frac{1}{1 + e^{-x}}$

Use Python to implement this function:

import numpy as np

def sigmoid(x):
    # squash any real-valued input into the interval (0, 1)
    return 1 / (1 + np.exp(-x))

The graph of the function is shown below. Its shape resembles the letter S, which is why it is also called an S-shaped growth curve:

[Figure: Sigmoid function curve]

  • Advantages of the sigmoid function: it is smooth and easy to differentiate.
  • Disadvantages of the sigmoid function: the backpropagation derivative involves division, so the computation is relatively expensive; moreover, the gradient easily vanishes during backpropagation, which limits the training of deep networks (see the sketch below).
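The vanishing-gradient issue can be seen from the derivative of the sigmoid, s(x)(1 - s(x)), which shrinks toward zero for large |x|; a minimal sketch (an added illustration, reusing the function defined above):

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest possible sigmoid gradient
print(sigmoid_grad(10.0))  # ~4.5e-05, the gradient nearly vanishes in the saturated region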

1.2 Tanh activation function

Tanh is a hyperbolic function and an improvement over the sigmoid activation function: it is symmetric and centered on zero, with a value range of (-1, 1). The tanh activation function is computed as follows:
$tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,sigmoid(2x) - 1$

Use Python to implement this function:

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
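
A quick numerical check of the identity tanh(x) = 2 sigmoid(2x) - 1 (an added snippet, reusing the sigmoid and tanh helpers defined above):

xs = np.linspace(-5, 5, 11)
# tanh expressed through sigmoid
print(np.allclose(tanh(xs), 2 * sigmoid(2 * xs) - 1))  # True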

The graph of the function is shown below. Tanh is a monotonically increasing odd function whose values lie in (-1, 1), and its graph is symmetric about the origin:

[Figure: Tanh function curve]

  • Advantages of the tanh function: it is an improvement over the sigmoid function, converges faster, and the loss value is less prone to oscillation.

  • Disadvantages of the tanh function: it does not solve the vanishing-gradient problem, and it still involves exponential operations, so the computation is relatively expensive.

1.3 ReLU activation function

The Rectified Linear Unit (ReLU) activation function is an excellent replacement for the sigmoid and tanh activation functions, and it is one of the most important breakthroughs in the field of deep learning. The ReLU activation function is computed as follows:

$relu(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$

Use Python to implement this function:

def relu(x):
    # output x for non-negative inputs, 0 otherwise
    return np.where(x > 0, x, 0)

The graph of the function is shown below. When the input is greater than or equal to 0, ReLU outputs the input unchanged; when the input is less than 0, ReLU outputs 0. Because the linear component for inputs greater than or equal to 0 has a constant derivative, while the other component has a derivative of 0, training a model with the ReLU function is much faster.

[Figure: ReLU function curve]

  • Advantages of the ReLU function: no vanishing-gradient problem, very low computational cost, and much faster convergence than the sigmoid and tanh functions.
  • Disadvantages of the ReLU function: when a large gradient update pushes a neuron's weights so far that its pre-activation stays negative, the ReLU derivative for that neuron is always zero, so its weights are never updated again; this is known as the dying ReLU problem (see the sketch below).
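The following minimal sketch (an added illustration in NumPy rather than PyTorch autograd) shows why such a "dead" unit stops learning: the ReLU derivative is 0 for every negative input, so a neuron whose pre-activation stays negative receives no gradient signal:

def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

pre_activations = np.array([-3.0, -0.5, 0.2, 4.0])
print(relu_grad(pre_activations))  # [0. 0. 1. 1.] -> negative inputs receive zero gradient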

1.4 Linear activation function

The output of a linear activation is simply the input value itself:

$linear(x) = x$

Use Python to implement this function:

def linear(x):
    return x

[Figure: Linear activation function]
This function is typically used only in the output layer of a neural network that solves a regression problem. Note that the linear activation function should not be used in hidden layers, because stacking linear layers without a nonlinearity is equivalent to a single linear layer.

1.5 Softmax activation function

The softmax function is usually applied just before a neural network produces its final output. It is typically used to determine the probability that an input belongs to one of n possible output classes in a given scenario. Suppose we want to classify images of digits into one of 10 possible classes (the digits 0 through 9). In this case, there are 10 output values, where each output value represents the probability that the input image belongs to the corresponding class. The softmax activation function is computed as follows:

$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$

The softmax activation provides a probability value for each class in the output, where i represents the index of the output. Use Python to implement this function:

def softmax(x):
    # exponentiate each element and normalize so that the outputs sum to 1
    return np.exp(x) / np.sum(np.exp(x))

The softmax function generally serves as the last layer of a neural network, accepting input values from the previous layer and converting them into probabilities. For example, suppose we want to classify an image whose possible labels are [apple, banana, lemon, pear]; if the last layer of the network outputs the values [1.0, 2.0, 3.0, 4.0], then after passing through the softmax function the output is [0.0320586, 0.08714432, 0.23688282, 0.64391426].
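We can verify this example with the softmax function implemented above (note: a common refinement, not shown in the original, is to subtract the maximum input before exponentiating for numerical stability; it does not change the result):

logits = np.array([1.0, 2.0, 3.0, 4.0])
probs = softmax(logits)
print(probs)        # [0.0320586  0.08714432 0.23688282 0.64391426]
print(probs.sum())  # 1.0 (up to floating-point rounding), i.e. a valid probability distribution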

2. Commonly used loss functions

The loss function computes the loss value, which the model uses to update each parameter through backpropagation. By reducing the loss between the real values and the predicted values, the model's predictions are driven closer to the real values, which is the goal of training. The choice of loss function depends on the type of problem and the desired output; the loss function must be a non-negative real-valued function.

2.1 Mean squared error

The error is the difference between the value predicted by the network and the actual value. We square the error because errors can be positive or negative, and squaring ensures that positive and negative errors do not cancel each other out. We compute the mean squared error (MSE) so that the errors of two datasets are comparable even when the datasets are not the same size. The mean squared error between the predicted value (p) and the actual value (y) is calculated as follows:

$mse(p, y) = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$

Use Python to implement this function:

def mse(p, y):
    # mean of the squared differences between predictions and targets
    return np.mean(np.square(p - y))

The mean squared error is usually used when a neural network needs to predict continuous values.
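For example, with a few made-up predictions and targets (an added illustration):

predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets = np.array([3.0, -0.5, 2.0, 7.0])
print(mse(predictions, targets))  # 0.375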

2.2 Mean absolute error

Mean absolute error (MAE) works much like mean squared error. It ensures that positive and negative errors do not cancel each other out by averaging the absolute differences between the actual and predicted values across all data points. The mean absolute error between the predicted value (p) and the actual value (y) is calculated as follows:

$mae(p, y) = \frac{1}{n} \sum_{i=1}^{n} |p_i - y_i|$
Use Python to implement this function:

def mae(p, y):
    # mean of the absolute differences between predictions and targets
    return np.mean(np.abs(p - y))

Similar to mean squared error, mean absolute error is often used to predict the value of a continuous variable.
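Using the same made-up predictions and targets as in the mean squared error example:

predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets = np.array([3.0, -0.5, 2.0, 7.0])
print(mae(predictions, targets))  # 0.5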

2.3 Categorical cross entropy

Cross entropy is a measure of the difference between two distributions (the actual distribution and the predicted distribution). Unlike the two loss functions above, it is widely used for discrete-valued outputs. The cross-entropy between two distributions is calculated as follows:

$-(y \log_2 p + (1-y) \log_2 (1-p))$

Here y is the actual result and p is the predicted result. The categorical cross-entropy between the predicted value (p) and the actual value (y) is implemented in Python as follows:

def categorical_cross_entropy(p, y):
    # cross-entropy between the actual value y and the predicted value p, using base-2 logarithms
    return -np.sum(y * np.log2(p) + (1 - y) * np.log2(1 - p))

The categorical cross-entropy loss has a high value when the predicted value is far from the actual value, and has a low value when it is close to the actual value.
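For instance (an added illustration with an actual label of 1), a confident correct prediction yields a small loss while a confident wrong prediction yields a large one:

print(categorical_cross_entropy(0.9, 1))  # ~0.15, prediction close to the actual value
print(categorical_cross_entropy(0.1, 1))  # ~3.32, prediction far from the actual value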

2.4 Implementing a custom loss function

In practical scenarios, we may have to implement a custom loss function for the problem to be solved, especially in complex networks involving object detection, generative adversarial networks, and so on. PyTorch provides a way to build a custom loss function simply by writing a Python function.
In this section, we'll implement a custom loss function that does the same thing as the prebuilt MSELoss function in nn.

(1) Import the data, construct the dataset and DataLoader, and define the neural network:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# toy regression data: each target is the sum of the two input features
x = [[1,2],[3,4],[5,6],[7,8]]
y = [[3],[7],[11],[15]]
X = torch.tensor(x).float()
Y = torch.tensor(y).float()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
X = X.to(device)
Y = Y.to(device)

class MyDataset(Dataset):
    def __init__(self, x, y):
        # x and y are already float tensors, so just copy them
        self.x = x.clone().detach()
        self.y = y.clone().detach()
    def __len__(self):
        return len(self.x)
    def __getitem__(self, ix):
        return self.x[ix], self.y[ix]

ds = MyDataset(X, Y)
dl = DataLoader(ds, batch_size=2, shuffle=True)

class MyNeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_to_hidden_layer = nn.Linear(2, 8)
        self.hidden_layer_activation = nn.ReLU()
        self.hidden_to_output_layer = nn.Linear(8, 1)
    def forward(self, x):
        x = self.input_to_hidden_layer(x)
        x = self.hidden_layer_activation(x)
        x = self.hidden_to_output_layer(x)
        return x

mynet = MyNeuralNet().to(device)

(2) Define a custom loss function that takes two tensor objects as input, computes the square of their difference, and returns the mean of the squared differences:

def my_mean_squared_error(_y, y):
    # squared difference between prediction (_y) and target (y), averaged over all elements
    loss = (_y - y)**2
    loss = loss.mean()
    return loss

(3) Using the same combination of inputs and outputs, call the built-in MSELoss function and compare its result with that of the custom function we built.

Use nn.MSELoss to compute the mean squared error loss:

loss_func = nn.MSELoss()
loss_value = loss_func(mynet(X),Y)
print(loss_value)
# tensor(151.1184, device='cuda:0', grad_fn=<MseLossBackward>)

Use the custom loss function my_mean_squared_error to output the loss value:

print(my_mean_squared_error(mynet(X),Y))
# tensor(151.1184, device='cuda:0', grad_fn=<MeanBackward0>)

The two functions return the same loss value. In general, whether to use a built-in loss function or a custom one depends on the problem we are solving.
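Once defined, the custom loss can be plugged into a training loop exactly like a built-in loss. A minimal sketch follows, where the optimizer, learning rate, and epoch count are illustrative assumptions rather than part of the original example:

from torch.optim import SGD

opt = SGD(mynet.parameters(), lr=0.001)
for epoch in range(50):
    for bx, by in dl:              # batches from the DataLoader defined above
        opt.zero_grad()
        batch_loss = my_mean_squared_error(mynet(bx), by)
        batch_loss.backward()      # backpropagate the custom loss
        opt.step()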

Summary

An activation function is a function in a deep neural network that "activates" the input signal and produces a non-linear output. In this section, the commonly used activation functions were introduced, including Sigmoid, Tanh, ReLU, and Softmax. The loss function is the function used when training a neural network to measure the gap between the model's predicted values and the real values; the commonly used loss functions introduced in this section include mean squared error, mean absolute error, and categorical cross-entropy.

Series links

PyTorch deep learning combat (1) - detailed explanation of neural network and model training process
PyTorch deep learning combat (2) - PyTorch basics
PyTorch deep learning combat (3) - use PyTorch to build neural networks
