Softmax classification and optimization

1, Basic content

The scores produced by a linear classifier are converted into probability values for multi-class classification: the SVM outputs raw scores, while Softmax outputs probabilities.
2, Sigmoid function

Expression (values lie in (0, 1)):

σ(x) = 1 / (1 + e^{-x})
Function plot:

[Figure: the S-shaped sigmoid curve]
The sigmoid function maps any real number to a probability value in (0, 1), so classification can be carried out by comparing probability values.
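A minimal sketch of this mapping in NumPy (the specific input scores are made-up examples):

```python
import numpy as np

def sigmoid(x):
    """Map any real number to a probability value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative scores approach 0, large positive scores approach 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```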

3, Softmax output

Softmax function: the input is a vector of class scores (arbitrary real numbers); the output is a vector in which every element lies between 0 and 1 and all elements sum to 1 (normalized class probabilities):

softmax(s)_j = e^{s_j} / Σ_k e^{s_k}
Loss function: cross-entropy loss

L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

where s_j is the j-th element of the class score vector and y_i is the index of the correct class.
Using the classification example above again, the computation for the cat class:

[Figure: the cat's scores exponentiated, normalized into probabilities, and turned into a loss] Exponentiation maps large scores to even larger positive values and maps negative scores to values close to zero; L_i is the loss value, computed from the probability assigned to the correct class.
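A minimal sketch of that computation; the scores below are made-up example numbers for three classes, not necessarily the ones in the figure:

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    """Softmax probabilities and cross-entropy loss L_i for one example."""
    shifted = scores - np.max(scores)         # shift for numerical stability
    exp_scores = np.exp(shifted)              # exponentiation: big scores dominate
    probs = exp_scores / np.sum(exp_scores)   # normalize so probabilities sum to 1
    return probs, -np.log(probs[correct_class])

# Made-up scores for three classes; index 0 ("cat") is taken as the correct class.
scores = np.array([3.2, 5.1, -1.7])
probs, loss = softmax_cross_entropy(scores, correct_class=0)
print(probs)  # ~[0.13, 0.87, 0.00]
print(loss)   # L_i = -log(0.13) ~ 2.04
```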

4, Comparing the SVM and Softmax loss functions

[Figure: comparison of the SVM hinge loss and the Softmax cross-entropy loss]

For the hinge loss, when the score of the correct class and the score of a wrong class are similar, the loss does not assess the model accurately (the loss value is close to 0 even though the model's classification performance is not actually good), so this kind of loss function is generally not used here.
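A small sketch of that difference; the scores are hypothetical, and the hinge loss uses a margin of 1:

```python
import numpy as np

def hinge_loss(scores, y):
    """Multiclass SVM (hinge) loss with margin 1 for one example."""
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return np.sum(margins)

def cross_entropy_loss(scores, y):
    """Softmax cross-entropy loss for one example."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return -np.log(probs[y])

# The correct class (index 0) beats the others by exactly the margin.
scores = np.array([2.0, 1.0, 1.0])
print(hinge_loss(scores, 0))          # 0.0   -> hinge loss is already "satisfied"
print(cross_entropy_loss(scores, 0))  # ~0.55 -> cross-entropy still penalizes close scores
```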

5, Optimization

[Figure: forward pass from the input data through the weights to the loss] Combining the input data with a set of weight parameters produces the class scores, from which the loss value is finally obtained; this sequence of steps is called the forward propagation process. Updating the weight parameters according to the loss value can then be done with the backpropagation algorithm.
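A small sketch of that forward pass for a single example; the shapes, the weight scale, and the label index here are illustrative assumptions:

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(3072)              # one flattened input image (e.g. 32x32x3)
W = 0.001 * np.random.randn(10, 3072)  # weight parameters for 10 classes
b = np.zeros(10)
correct_class = 3                      # hypothetical label for this example

scores = W.dot(x) + b                  # combine data and weights -> class scores
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax: scores -> probabilities
loss = -np.log(probs[correct_class])   # cross-entropy loss ends the forward pass
print(loss)
```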

5.1 Gradient descent (the fastest way down to the lowest point)

Gradient formula:

df(x)/dx = lim_{h→0} [ f(x + h) - f(x) ] / h
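In code, that limit can be approximated numerically with a small finite h; a sketch using a toy function rather than an actual loss:

```python
def numerical_gradient(f, x, h=1e-5):
    """Finite-difference approximation of the gradient formula df/dx."""
    return (f(x + h) - f(x)) / h

# f(x) = x**2 has derivative 2x, so the gradient at x = 3 should be about 6.
print(numerical_gradient(lambda x: x ** 2, 3.0))  # ~6.00001
```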
Gradient descent code implementation:

[Figure: minibatch gradient descent code] Batch size (the number of samples drawn from the raw data each step) is usually a power of 2 (32, 64, 128) in consideration of how the computer loads data, and generally the larger the better. step_size is the learning rate (it should not be set too large).
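A runnable sketch of such a training loop; the least-squares objective, data shapes, and hyperparameter values are illustrative assumptions chosen only so the gradient is easy to write out, with batch_size and step_size playing the roles described above:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 5)                     # raw training data: 1000 samples, 5 features
y = X.dot(np.array([1.0, -2.0, 3.0, 0.0, 5.0]))  # targets from a hidden linear rule
W = np.zeros(5)                                  # weights to be learned

batch_size = 64    # samples drawn per iteration, a power of 2
step_size = 1e-2   # learning rate; too large a value makes the loss diverge

for it in range(500):
    idx = np.random.choice(len(X), batch_size)    # draw a minibatch from the raw data
    Xb, yb = X[idx], y[idx]
    grad = Xb.T.dot(Xb.dot(W) - yb) / batch_size  # gradient of the (halved) mean squared error
    W -= step_size * grad                         # step downhill against the gradient

print(W)  # approaches [1, -2, 3, 0, 5]
```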

Judging the result of training the network from the loss values:

[Figure: loss curve over training iterations] The loss fluctuates locally but trends downward overall, which indicates the network is training as desired. (An epoch means one full pass through all of the data; completing one iteration only means that a batch-size amount of data has been processed. For example, with 10,000 training samples and a batch size of 100, one epoch corresponds to 100 iterations.)

5.2 Backpropagation
[Figure: computational graph of the forward pass] The figure above shows forward propagation; conversely, updating W from the loss L is called backpropagation, illustrated as follows:

[Figure: backpropagation illustrated on the computational graph] Suppose x, y, z are three input points that produce the loss value f through a series of operations; we now need to compute how much each point contributes to f (the partial derivatives).

[Figure: the computational graph connecting x, y, z to f]
Chain rule: for an intermediate variable q,

∂f/∂x = (∂f/∂q) · (∂q/∂x)
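A concrete sketch using an assumed example function f(x, y, z) = (x + y) * z, just to make the chain rule tangible (the function in the figure may differ):

```python
# Forward pass through the assumed example function f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node
f = q * z            # final value ("loss")

# Backward pass: apply the chain rule from f back to the inputs
df_dq = z            # f = q * z  =>  df/dq = z
df_dz = q            # f = q * z  =>  df/dz = q
df_dx = df_dq * 1.0  # q = x + y  =>  dq/dx = 1, so df/dx = df/dq * dq/dx
df_dy = df_dq * 1.0  # likewise df/dy = df/dq * dq/dy

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```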
The backpropagation procedure for a more complicated function proceeds as follows:

[Figure: node-by-node backpropagation through the computational graph of a more complex function]

A simplified way:

[Figure: the simplified backward pass]
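One common simplification of this kind is to collapse a group of nodes into a single sigmoid gate, whose local gradient has the closed form σ(x)(1 - σ(x)); the sketch below assumes the grouping is of that type:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.5
s = sigmoid(x)              # forward pass through the whole grouped gate at once
local_grad = s * (1 - s)    # the gate's local gradient in closed form
upstream = 1.0              # gradient arriving from the rest of the graph
dx = upstream * local_grad  # chain rule: pass the gradient back through the gate
print(s, dx)                # ~0.8176, ~0.1491
```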

The meaning of the gate units:

[Figure: how common gate units pass gradients backward]


Origin blog.csdn.net/qq_43660987/article/details/91613522