Source: Deep Learning Enthusiasts, CV技术指南
This article is about 1,300 words; estimated reading time 6 minutes.
It gives you a well-rounded introduction to the Softmax function.
[Guide] Softmax is an activation function everyone is familiar with. However, many people know only its expression and its position in the network, and cannot explain the specific reasons and details behind it. This article provides that introduction.
Softmax is a mathematical function that normalizes a vector of values into the range between 0 and 1.
In this article, you'll learn about:
What is Softmax activation function and its mathematical expression?
How does it use the argmax() function?
Why is Softmax only used in the last layer of the neural network?
Misunderstanding of Softmax
What is Softmax activation function and its mathematical expression?
In deep learning, Softmax is used as an activation function that normalizes the output, scaling each value in a vector to lie between 0 and 1. It is used for classification tasks: the last layer of the network produces an N-dimensional vector, with one entry for each class in the classification task.
N-dimensional vector in the output layer of the network
Softmax normalizes those weighted-sum values so that each lies between 0 and 1 and they sum to 1. This is why most people treat these values as class probabilities, but that is a misconception, which we discuss later in this article.
The formula to implement the Softmax function:
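In standard notation (the usual definition, with θ the vector of raw scores and N the number of classes), it reads:

```latex
\sigma(\theta)_i = \frac{e^{\theta_i}}{\sum_{j=1}^{N} e^{\theta_j}}, \quad i = 1, \dots, N
```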
Using this mathematical expression, we calculate the normalized value for each class. Here θ(i) is the input we get from the flatten layer.
For each class, the numerator is the exponential of that class's score and the denominator is the sum of the exponentials over all classes. The Softmax function therefore yields values between 0 and 1 whose sum equals 1, which is why people see them as probabilities; that is their misconception.
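A minimal sketch of this computation in NumPy (the function name and example scores are hypothetical; subtracting the maximum is a standard stability step that does not change the result, since it cancels between numerator and denominator):

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the result is unchanged
    # because the factor e^{-max} cancels in numerator and denominator.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores from the last layer for a 4-class task
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
print(probs)        # every value lies in (0, 1)
print(probs.sum())  # the values sum to 1
```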
How does it use the argmax() function?
After applying the mathematical function above to each class, Softmax produces a value between 0 and 1 for each class.
Now we have one value per class. To decide which class the input belongs to, argmax() is applied to these values; it returns the index of the class with the maximum value after Softmax.
Visual interpretation of argmax
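The argmax step can be sketched as follows (the class names and scores are illustrative, not from the original article):

```python
import numpy as np

classes = ["cat", "dog", "bird"]             # hypothetical class labels
softmax_out = np.array([0.65, 0.25, 0.10])   # normalized values from Softmax

pred_idx = int(np.argmax(softmax_out))       # index of the maximum value
print(classes[pred_idx])                     # the predicted class
```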
Why is Softmax only used in the last layer of the neural network?
Now to the important part: Softmax is used only in the last layer, to normalize the values, while other activation functions (ReLU, leaky ReLU, sigmoid, and others) are used in the inner layers.
If we look at other activation functions such as ReLU, leaky ReLU, and sigmoid, they all operate on a single value at a time to introduce non-linearity. They cannot see what the other values are.
The Softmax function, by contrast, sums the exponentials of all class scores in its denominator to normalize them. Because it takes every class's value into account, it is used in the last layer: the network decides which class the input belongs to by analyzing all the values together.
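This contrast can be sketched directly (a minimal illustration, assuming NumPy; the vectors are arbitrary examples): changing one entry of the input leaves the ReLU outputs of the other entries untouched, but shifts every Softmax output.

```python
import numpy as np

def relu(x):
    # Elementwise: each output depends only on its own input value
    return np.maximum(0.0, x)

def softmax(x):
    exps = np.exp(x - np.max(x))
    # The denominator sums over ALL entries, so every output
    # depends on every value in the vector
    return exps / exps.sum()

a = np.array([1.0, 2.0])
b = np.array([1.0, 5.0])  # only the second entry changed

print(relu(a)[0], relu(b)[0])        # first output unchanged
print(softmax(a)[0], softmax(b)[0])  # first output changes too
```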
The Softmax activation function of the last layer
Misunderstanding of Softmax
The first and biggest misconception about Softmax is that its normalized outputs are probability values for each class, which is wrong. This misunderstanding arises because the values sum to 1, but they are just normalized values, not class probabilities.
Instead of using Softmax alone in the last layer, we often prefer Log Softmax, which simply takes the logarithm of the normalized values from the Softmax function.
Log Softmax improves on Softmax in numerical stability, in cheaper model training cost, and in penalizing large errors (the greater the error, the greater the penalty).
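The numerical-stability point can be sketched with the standard log-sum-exp trick (a minimal illustration; the scores are chosen to be extreme on purpose): computing log(softmax(x)) naively as log(exp(x)/sum(exp(x))) overflows for large scores, while the shifted form stays exact and finite.

```python
import numpy as np

def log_softmax(x):
    # Naively, exp(1000) overflows to inf and the result becomes nan.
    # Subtracting the max first (log-sum-exp trick) is mathematically
    # equivalent and keeps every intermediate value finite.
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

big = np.array([1000.0, 1001.0, 1002.0])  # scores a naive version cannot handle
out = log_softmax(big)
print(out)  # finite log-values; np.exp(out) sums to 1
```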
That is the Softmax function as used as an activation function in neural networks. I hope that after reading this article, you have a clear understanding of it.
Original link: https://medium.com/artificialis/softmax-function-and-misconception-4248917e5a1c
Editor: Huang Jiyan