Understanding the Sigmoid and Softmax functions

1. Sigmoid function

1.1 Properties and advantages of the function

In fact, the commonly mentioned logistic function is a sigmoid function; its graph is an S-shaped curve (the sigmoid curve).

g(z) = 1 / (1 + e^{-z})

Here z is a linear combination; for example, z may equal b + w1*x1 + w2*x2. Substituting a large positive number or a large-magnitude negative number into g(z), one can see that the result tends to 1 or 0.

A logistic function or logistic curve has the familiar "S" shape (the sigmoid curve).

That is, the sigmoid function compresses a real number into the interval (0, 1). When z is a very large positive number, g(z) is close to 1; when z is a very large (in magnitude) negative number, g(z) is close to 0.

What is the use of compressing values into (0, 1)? It lets us interpret the activation output as a "classification probability": for example, an activation output of 0.9 can be read as a 90% probability that the sample is positive.
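A minimal Python/NumPy sketch (the weights, inputs, and numbers below are illustrative choices of mine, not from the original post) showing g(z) compressing a linear combination into (0, 1), and how the output can be read as a positive-class probability:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# z as a linear combination, e.g. z = b + w1*x1 + w2*x2 as in the text
b, w1, w2 = 0.5, 2.0, -1.0           # hypothetical parameters
x1, x2 = 3.0, 1.0                    # hypothetical inputs
z = b + w1 * x1 + w2 * x2            # z = 5.5
print(sigmoid(z))                    # ~0.996, read as "99.6% probability of the positive class"

# Saturation: large positive / negative inputs are pushed toward 1 / 0
print(sigmoid(np.array([-100.0, -10.0, 0.0, 10.0, 100.0])))
# -> approximately [0.0, 0.000045, 0.5, 0.999955, 1.0]
```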

Advantages:

1. The output of the sigmoid function lies in (0, 1), so the output range is bounded, optimization is stable, and it can be used in the output layer.

2. It is a continuous function and easy to differentiate.

1.2 Disadvantages of the function

The sigmoid function also has its own drawbacks.
First, the most obvious is saturation. It is not hard to see from the curve that the derivative approaches 0 on both sides, i.e. g'(z) -> 0 as z -> ±∞. Specifically, during backpropagation the sigmoid's gradient contains a factor g'(z) (the derivative of the sigmoid with respect to its input), so once the input falls into the saturation region at either end, g'(z) becomes close to 0. The backpropagated gradient then becomes very small, the network parameters may barely be updated, and effective training becomes difficult; this phenomenon is known as the vanishing gradient. Generally, sigmoid networks exhibit vanishing gradients within about 5 layers.
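A small numerical check (a sketch of my own, not code from the post) of how the g'(z) factor in the backpropagated gradient nearly vanishes once the input reaches the saturation region:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """g'(z) = g(z) * (1 - g(z)); this factor multiplies the backpropagated gradient."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
    print(f"z = {z:5.1f}   g'(z) = {sigmoid_grad(z):.2e}")
# g'(0) = 0.25 is already the maximum; at z = 10 the factor is ~4.5e-05.
# Multiplying several such factors across layers drives the overall
# gradient toward zero -- the vanishing gradient phenomenon.
```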

Second, the activation function exhibits an offset phenomenon. The sigmoid's output values are all greater than 0, so the output mean is non-zero, which means the neurons in the next layer receive a non-zero-mean signal from the previous layer as input; this affects the gradients.
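A quick numerical illustration (the zero-mean Gaussian pre-activations are my assumed setup) of the offset phenomenon: sigmoid outputs are always positive, so their mean is non-zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=1.0, size=100_000)   # zero-mean pre-activations
a = sigmoid(z)                                     # activations passed to the next layer

print(z.mean())   # ~0.0
print(a.mean())   # ~0.5, always > 0: the sigmoid output is not zero-centered
```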

Third, it is computationally expensive, because the sigmoid function involves an exponential.

1.3 Derivation of the sigmoid derivative

g'(z) = d/dz [ 1 / (1 + e^{-z}) ] = e^{-z} / (1 + e^{-z})^2 = g(z) * (1 - g(z))

The derivation of the sigmoid derivative is simple and can easily be done by hand.
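As a sanity check, here is a short sketch (the step size and test points are my own choices) comparing the closed-form derivative g(z)(1 - g(z)) with a central-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)   # central difference
analytical = sigmoid_grad(z)

print(np.max(np.abs(numerical - analytical)))   # tiny, close to machine precision
```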

2. Softmax function

2.1 Softmax function: expression and properties

The softmax function, also known as the normalized exponential function, is the generalization of the binary-classification sigmoid function to multi-class classification; its purpose is to present the results of multi-class classification in the form of probabilities. The softmax computation is:

softmax(z)_i = e^{z_i} / sum_j e^{z_j}

 


Put simply, softmax takes raw outputs such as 3, 1, -3 and, through the softmax function, maps them to values in (0, 1) that sum to 1 (satisfying the properties of a probability distribution). We can then interpret each value as the probability of selecting the corresponding output node, and choose the node with the largest probability (i.e. the one corresponding to the largest value) as our prediction.

Because the softmax function widens the differences between the elements of the input vector (through the exponential function) before normalizing them into a probability distribution, the probability differences between classes become more pronounced when it is applied to classification: the probability produced for the largest value is closer to 1, so the output distribution is closer in form to the true distribution.
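A minimal softmax sketch in Python/NumPy applied to the example outputs 3, 1, -3 (subtracting the maximum before exponentiating is a standard numerical-stability trick I am adding; it does not change the result):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = e^{z_i} / sum_j e^{z_j}."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())      # subtracting the max does not change the result
    return exp_z / exp_z.sum()

probs = softmax([3.0, 1.0, -3.0])
print(probs)             # approximately [0.879, 0.119, 0.002], all in (0, 1)
print(probs.sum())       # 1.0 -- a valid probability distribution
print(np.argmax(probs))  # 0 -> the node with the largest raw output is the prediction

# The exponential widens gaps: inputs 3 and 1 differ by 2, but their
# probabilities differ by a factor of e^2 ≈ 7.4.
print(probs[0] / probs[1])
```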

2.2 Interpretations of the Softmax function

Softmax can be interpreted from three different angles; looking at the softmax function from these different angles gives a deeper understanding of its application scenarios.

2.2.1 A smooth approximation of arg max

Softmax can be viewed as a smooth approximation of arg max. Unlike the hard arg max operation, which selects only the maximum (producing a one-hot vector), softmax smooths this output: the 1 that arg max would place at the position of the maximum element is instead distributed across the positions according to the size of the corresponding input elements.
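A small illustration (my own sketch) of the "smooth arg max" view: compare the hard one-hot arg max vector with the softmax output, and note that scaling the inputs up pushes softmax toward that one-hot vector:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

z = np.array([3.0, 1.0, -3.0])

hard = np.eye(len(z))[np.argmax(z)]   # hard arg max as a one-hot vector: [1., 0., 0.]
print(hard)
print(softmax(z))         # smooth version: ~[0.879, 0.119, 0.002]
print(softmax(10.0 * z))  # scaling the inputs up pushes softmax toward the one-hot vector
```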

2.2.2 Normalization into a probability distribution

The output of the softmax function matches the basic form of the exponential family of distributions:

softmax(z)_i = e^{z_i} / Z

where Z = sum_j e^{z_j} is the normalizing constant (partition function).

It is not hard to see that softmax maps the input vector to a normalized probability distribution over the categories, i.e. a probability distribution over the classes (as also mentioned earlier). This is why, in deep learning, softmax is often used as the last layer of an MLP, paired with the cross-entropy loss function (which measures the difference between two distributions).
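A sketch of the usual pairing of a softmax output layer with the cross-entropy loss (the 3-class logits and target indices below are hypothetical):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def cross_entropy(probs, target_index):
    """With a one-hot target, H(p, q) = -sum_i p_i * log(q_i) reduces to -log(q_target)."""
    return -np.log(probs[target_index])

logits = np.array([3.0, 1.0, -3.0])   # hypothetical raw outputs of the last layer
probs = softmax(logits)

print(cross_entropy(probs, target_index=0))  # ~0.13: low loss, the model already favors class 0
print(cross_entropy(probs, target_index=2))  # ~6.13: high loss, class 2 was given low probability
```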

2.2.3 A mapping from unnormalized (joint) probabilities to probabilities

From the perspective of probabilistic graphical models, this form of softmax can be understood as a mapping from the joint probability to a normalized probability. You will then find that the conditional maximum entropy model and the softmax regression model are virtually identical, and there are many examples like this. Since the theory of probabilistic graphical models borrows heavily from thermodynamics, softmax can also be given a certain physical meaning from the perspective of a physical system.

3. Summary

• If the model's output classes are not mutually exclusive, and several classes can be selected at the same time, apply the Sigmoid function to the raw output values of the network.

• If the model's output classes are mutually exclusive, and only one class can be selected, apply the Softmax function to the raw output values of the network (see the sketch below).
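A side-by-side sketch of the two cases above, using the same hypothetical raw outputs for both:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

logits = np.array([2.0, 0.5, -1.0])   # hypothetical raw outputs for 3 labels/classes

# Non-mutually-exclusive labels: each output gets its own independent probability
print(sigmoid(logits))   # ~[0.88, 0.62, 0.27]; several labels can be "on" at the same time

# Mutually exclusive classes: one distribution over all classes, summing to 1
print(softmax(logits))   # ~[0.79, 0.18, 0.04]; pick the single most probable class
```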

 

Reference links:
https://zhuanlan.zhihu.com/p/69771964 (many examples for understanding the sigmoid and softmax functions)
https://zhuanlan.zhihu.com/p/79585726 (understanding the softmax function and the cross-entropy loss)

 
