Softmax loss function and computing its gradient

The softmax loss for sample i, written in the numerically stable form (the maximum score is subtracted to avoid overflow with large values):

L_i = -\log\left(\frac{e^{f_{y_i}-\max_j f_j}}{\sum_j e^{f_j-\max_j f_j}}\right)
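As a quick check of why the maximum is subtracted, here is a small sketch (the score values are made up): with large scores the naive softmax overflows to inf and produces nan, while the shifted version is mathematically identical and stays finite.

```python
import numpy as np

f = np.array([1000.0, 1001.0, 1002.0])  # large scores for one sample

# Naive softmax: exp(1000) overflows to inf, so inf/inf gives nan everywhere
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(f) / np.sum(np.exp(f))

# Stable softmax: subtract max(f) first; the result is unchanged mathematically
shifted = f - np.max(f)
stable = np.exp(shifted) / np.sum(np.exp(shifted))

print(naive)   # [nan nan nan]
print(stable)  # a valid probability distribution summing to 1
```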

Averaging over all samples and adding the regularization penalty, the full loss is:

L = \frac{1}{N}\sum_i L_i + \frac{\lambda}{2}\sum_k\sum_l W_{k,l}^2

Let's start by looking at L_i.

Let f(i, j) be the (i, j)-th element of the score matrix f(X, W). As before, the score matrix is computed from the sample matrix and the weight matrix.

max(f_j) is the maximum score over all classes for the i-th sample. From the formula, every element of a sample's score row must have that maximum subtracted, which NumPy's broadcasting mechanism handles directly; the same broadcasting applies to the exponentiation and the normalizing sum. The loss can therefore be computed as:


   
   
    f = X.dot(W)  # N by C score matrix
    # f_max holds the maximum score of each row (axis=1), i.e. per sample
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))  # N by 1
    # Normalize each row of scores into a probability distribution over the classes. N by C
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
    for i in range(num_train):
        for j in range(num_class):
            if j == y[i]:
                loss += -np.log(prob[i, j])
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

The idea behind the loss: for sample i with correct class j, if prob[i, j] equals 1 the classification is confidently correct and the sample contributes nothing to the loss (since -log 1 = 0). If the sample is misclassified, prob[i, j] is less than 1 and it contributes a positive term. Optimizing the weights pushes prob[i, j] toward 1, minimizing the loss. Before training, the weights are random, so each of the 10 classes should get a probability of about 10%, and the loss should be close to -log(0.1) (ignoring the regularization penalty).
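A minimal sanity check of that last claim (the shapes here are made up, and zero weights stand in for "untrained"): with all-equal scores every class gets probability 1/10, so the unregularized loss is exactly -log(0.1).

```python
import numpy as np

num_train, dim, num_class = 50, 8, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((num_train, dim))
y = rng.integers(0, num_class, size=num_train)
W = np.zeros((dim, num_class))  # untrained stand-in: all scores equal

f = X.dot(W)
f_max = np.max(f, axis=1, keepdims=True)
prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
loss = -np.mean(np.log(prob[np.arange(num_train), y]))
print(loss)  # -log(0.1) = log(10) ≈ 2.3026
```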

The gradient is derived as follows:




In the figure above, p_{(i,m)} (note the lowercase p) is the one-hot label vector of a sample: a C×1 vector (assuming C classes) whose entry is 1 when m is the correct class and 0 elsewhere. P_m stands for P[i, m], the predicted probability of sample i for class m. When computing the loss we already obtained the probability matrix, so P[i, m] is known.
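In summary, the derivation ends with the following result (using P for the predicted probabilities and p for the one-hot labels, as above), which is what the code below implements:

\frac{\partial L_i}{\partial f_{i,m}} = P_{i,m} - p_{i,m}

\nabla_W L = \frac{1}{N}\, X^\top (P - p) + \lambda W

where the \lambda W term comes from differentiating the regularization penalty.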

The Jianshu (简书) user Deeplayer has a similar derivation that skips some of the intermediate steps and is more concise and clear.

The softmax_loss_naive version with the gradient added:


   
   
    num_train = X.shape[0]
    num_classes = W.shape[1]
    f = X.dot(W)  # N by C
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
    for i in range(num_train):
        for j in range(num_classes):
            if j == y[i]:
                loss += -np.log(prob[i, j])
                dW[:, j] += (1 - prob[i, j]) * X[i]
            else:
                dW[:, j] -= prob[i, j] * X[i]
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW = -dW / num_train + reg * W

Starting from softmax_loss_naive, let's see how to remove the loops:

The loss is accumulated in the inner loop, but only when j == y[i] do we add -np.log(prob[i, j]). So if we could set prob[i, j] to 1 for every element with j != y[i], then np.log of those entries would be 0 and we could sum over the whole matrix directly. In linear_svm.py we used margins[np.arange(num_train), y] = 0 to update a few selected elements, but there the condition was j == y[i]; here we need j != y[i], so another approach is required. For the dW matrix, we should build a new matrix based on prob: negate every element of prob, and add 1 to the elements where j == y[i]; multiplying X with this new matrix then gives the gradient. Both tasks can be accomplished with the keepProb indicator matrix used below.
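A tiny illustration of that indicator matrix (the sizes are made up: 3 samples, 4 classes). Row i of keepProb has a single 1 in column y[i] and 0 everywhere else:

```python
import numpy as np

y = np.array([2, 0, 3])  # correct class of each of 3 samples
num_train = len(y)

keepProb = np.zeros((num_train, 4))
keepProb[np.arange(num_train), y] = 1.0
print(keepProb)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```

Multiplying keepProb elementwise with np.log(prob) zeroes out every entry except the correct-class ones, so the whole matrix can be summed at once.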

The following code fills in the TODO in softmax.py:


   
   
    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################

The implementation:


   
   
    num_train = X.shape[0]
    num_classes = W.shape[1]
    f = X.dot(W)  # N by C
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
    keepProb = np.zeros_like(prob)
    keepProb[np.arange(num_train), y] = 1.0
    loss += -np.sum(keepProb * np.log(prob)) / num_train + 0.5 * reg * np.sum(W * W)
    dW += -np.dot(X.T, keepProb - prob) / num_train + reg * W
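As a closing check, the vectorized loss and gradient can be verified against a numerical gradient on small random data. This is only a sketch: the shapes and the standalone helper name softmax_loss_vectorized are my own, not part of the assignment interface.

```python
import numpy as np

def softmax_loss_vectorized(W, X, y, reg):
    # Standalone version of the vectorized code above
    num_train = X.shape[0]
    f = X.dot(W)
    f_max = np.max(f, axis=1, keepdims=True)
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
    keepProb = np.zeros_like(prob)
    keepProb[np.arange(num_train), y] = 1.0
    loss = -np.sum(keepProb * np.log(prob)) / num_train + 0.5 * reg * np.sum(W * W)
    dW = -X.T.dot(keepProb - prob) / num_train + reg * W
    return loss, dW

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
y = rng.integers(0, 3, size=5)
W = rng.standard_normal((4, 3)) * 0.01
reg = 0.1
loss, dW = softmax_loss_vectorized(W, X, y, reg)

# Numerical gradient by central differences, one weight at a time
num_dW = np.zeros_like(W)
h = 1e-5
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += h
        Wm = W.copy(); Wm[i, j] -= h
        num_dW[i, j] = (softmax_loss_vectorized(Wp, X, y, reg)[0]
                        - softmax_loss_vectorized(Wm, X, y, reg)[0]) / (2 * h)

print(np.max(np.abs(dW - num_dW)))  # should be tiny
```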
Origin blog.csdn.net/ruotianxia/article/details/104059020