1. Basic content
Linear classification scores can be converted into probabilities for multi-class classification: an SVM outputs raw class scores, while Softmax outputs normalized probabilities.
2. The Sigmoid function
Expression (range (0, 1)): sigmoid(x) = 1 / (1 + e^(-x))
Function graph:
The Sigmoid function maps any real number to a probability value in the interval (0, 1), so classification can be performed according to the size of that probability.
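As a minimal sketch, the sigmoid mapping can be written in NumPy as follows (the example scores are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    # Map any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-2.0, 0.0, 3.0])  # hypothetical raw scores
probs = sigmoid(scores)
# Every output lies strictly between 0 and 1, and sigmoid(0) == 0.5.
```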
3. Softmax output
The softmax function takes a vector of arbitrary real-valued scores as input and outputs a vector in which every element lies between 0 and 1 and all elements sum to 1 (normalized class probabilities): softmax(s)_j = e^(s_j) / Σ_k e^(s_k)
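A straightforward NumPy sketch of this normalization (the input scores are an assumption for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract the max score before exponentiating for numerical
    # stability; this does not change the resulting probabilities.
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([3.2, 5.1, -1.7])  # hypothetical class scores
probs = softmax(scores)
# probs sums to 1 and each entry lies in (0, 1).
```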
Loss function: cross-entropy loss
L_i = -log( e^(s_{y_i}) / Σ_j e^(s_j) ), where s_j are the class scores and y_i is the correct class for sample i.
Continuing the classification example above, the loss for the cat class is computed the same way:
Exponentiation maps large scores to much larger positive values and maps negative scores to small positive numbers close to zero; L_i is the loss value, computed from the probability assigned to the correct class.
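This computation can be sketched directly; the scores below are assumed example values (with the correct "cat" class first), not taken from the original figure:

```python
import numpy as np

def cross_entropy_loss(scores, correct_class):
    # Exponentiate (shifted by the max for stability), normalize,
    # then take the negative log of the correct class's probability.
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[correct_class])

scores = np.array([3.2, 5.1, -1.7])  # hypothetical: cat, car, frog
loss = cross_entropy_loss(scores, correct_class=0)
# The cat class gets probability ~0.13, giving a loss of ~2.04.
```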
4. Comparing the SVM and Softmax loss functions
With hinge loss, when the correct class's score is only slightly higher than the other classes' scores, the loss is already close to 0 even though the model barely separates the classes, so hinge loss cannot accurately assess such a model; for this reason it is not used here.
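The contrast can be illustrated with a small sketch (the scores and margin below are assumptions chosen so the correct class leads by exactly the margin):

```python
import numpy as np

def hinge_loss(scores, correct, margin=1.0):
    # Multiclass SVM loss: sum of margin violations over wrong classes.
    margins = np.maximum(0, scores - scores[correct] + margin)
    margins[correct] = 0.0
    return np.sum(margins)

def cross_entropy(scores, correct):
    probs = np.exp(scores - np.max(scores))
    probs /= np.sum(probs)
    return -np.log(probs[correct])

scores = np.array([2.0, 1.0, 1.0])   # correct class 0 leads by the margin
h = hinge_loss(scores, correct=0)     # 0.0: hinge is fully satisfied
ce = cross_entropy(scores, correct=0) # ~0.55: still penalizes low confidence
```

Hinge loss stops caring once the margin is met, while cross-entropy keeps rewarding higher confidence in the correct class.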
5. Optimization
Combining the input data with a set of weight parameters yields a set of class scores and finally a loss value; this sequence of computations is called forward propagation. Updating the weight parameters from the loss value is implemented by the backpropagation algorithm.
5.1 Gradient descent (the fastest way to reach the lowest point)
Gradient update formula: W ← W - step_size * ∂L/∂W
Gradient descent code implementation:
batch_size (the number of samples drawn from the training data per iteration) is usually a power of 2 (32, 64, 128) to match how computers load memory; in general, larger is better. step_size is the learning rate (it should not be too large).
The loss value over the course of training the network:
The loss fluctuates locally but trends downward overall, which indicates the network is learning as desired. (An epoch is one full pass over all the data; a single iteration processes only one batch of batch_size samples.)
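The loop described above can be sketched as follows; the data, the least-squares loss (standing in for a real network's forward/backward pass), and all hyperparameter values are assumptions for illustration:

```python
import numpy as np

def loss_and_grad(W, X_batch, y_batch):
    # Toy stand-in for forward + backward propagation: a linear model
    # with least-squares loss. A classifier would use softmax +
    # cross-entropy here instead.
    diff = X_batch @ W - y_batch
    loss = 0.5 * np.mean(diff ** 2)
    grad = X_batch.T @ diff / len(X_batch)
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))          # synthetic training data
true_W = rng.normal(size=(5, 1))
y = X @ true_W                          # noiseless targets

W = np.zeros((5, 1))
batch_size = 32      # usually a power of 2
step_size = 0.1      # learning rate; too large and training diverges
for it in range(200):
    idx = rng.choice(len(X), batch_size, replace=False)  # sample a mini-batch
    loss, grad = loss_and_grad(W, X[idx], y[idx])
    W -= step_size * grad  # step against the gradient
```

Each iteration draws a fresh batch, so the loss fluctuates from step to step while trending downward overall, exactly the curve shape described above.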
5.2 Backpropagation
The figure above shows forward propagation; updating W in the reverse direction, starting from the loss L, is called backpropagation, illustrated as follows:
Suppose the three inputs x, y, z pass through a sequence of operations f to produce the loss value; we now need to compute each input's contribution to f, i.e. the partial derivative of f with respect to each input.
Chain rule:
The backpropagation procedure for a composite function is as follows:
A simplified view:
The meaning of gate units:
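The gate-by-gate picture can be sketched on a classic toy graph, f(x, y, z) = (x + y) * z (this specific function and its input values are assumptions chosen for illustration, not taken from the missing figure). An add gate distributes the incoming gradient unchanged to both inputs; a multiply gate routes each input's gradient through the other input's value:

```python
# Inputs (assumed example values)
x, y, z = -2.0, 5.0, -4.0

# Forward pass, gate by gate
q = x + y          # add gate:      q = 3.0
f = q * z          # multiply gate: f = -12.0

# Backward pass: apply the chain rule through each gate
df_df = 1.0                # gradient of the output w.r.t. itself
df_dq = z * df_df          # multiply gate: swap in the other input -> -4.0
df_dz = q * df_df          # multiply gate: -> 3.0
df_dx = 1.0 * df_dq        # add gate distributes the gradient -> -4.0
df_dy = 1.0 * df_dq        # add gate distributes the gradient -> -4.0
```

Local gradients are computed at each gate and multiplied together along the path back to each input, which is exactly the chain rule applied step by step.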