Linux commands for file transfer between servers + cs231n_svm


When implementing the gradient calculation for the SVM loss function, I was stuck on three problems at first.
I found the answers later by studying this article: Derivation of the SVM loss function.

1. What is the standard SVM loss function?

$$L=\sum_{i}\sum_{j\neq y_{i}}\max\left(0,\; f(x_{i};W)_{j}-f(x_{i};W)_{y_{i}}+\Delta\right)+\lambda\sum_{k}\sum_{l}W_{k,l}^{2}$$

(a minimal NumPy sketch of this loss appears right after these three questions)

2. How should the summation in the loss function be handled when computing the gradient?

3. How should the max function in the loss function be handled when computing the gradient?
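
Before going into the derivation, here is a minimal NumPy sketch of the loss from question 1 for a single sample. The names (x, y, W, delta) and shapes are assumptions for illustration, not code from the assignment; the regularization term is added once over the whole batch, so it is omitted here.

```python
import numpy as np

def svm_loss_single(x, y, W, delta=1.0):
    """Multiclass SVM (hinge) loss for one sample.

    x : (D,)   flattened pixel vector
    y : int    index of the correct class
    W : (D, C) weight matrix (pixels x classes)
    """
    scores = x.dot(W)                                    # (C,) one score per class
    margins = np.maximum(0, scores - scores[y] + delta)  # hinge terms
    margins[y] = 0                                       # the j == y_i term is excluded
    return np.sum(margins)                               # L_i; lambda * sum(W**2) is added over the batch
```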

Let's walk through the derivation of the loss function's gradient in detail.

First, the structure of W: its shape is (number of pixels in each input x, number of classes).
The gradient matrix dW has the same structure: (number of pixels in each input x, number of classes). It is used to update the weight that each class assigns to each pixel, so that inputs get classified correctly. The update is organized per class, but the values actually updated go down to each individual pixel weight.
Loss function: for each input x, the loss is computed from the scores of all classes, and minimizing it drives correct classification of the input.
Gradient matrix dW: it consists of the gradient of each sample's loss L_i (i = 1...N) with respect to each class direction c (c = 1...C) of the weight matrix W, i.e. the rate at which the loss changes in each class direction.
Understanding the relationship between the loss and the gradient: the loss is differentiated with respect to W class by class, because the loss is computed from the per-class scores produced by W. In more detail, the loss is differentiated with respect to the values in each class direction (column) of W.
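
As a concrete illustration of the shapes described above (the sizes N, D, C below are assumptions for illustration):

```python
import numpy as np

N, D, C = 500, 3073, 10           # samples, pixels per sample (incl. bias), classes
X = np.random.randn(N, D)         # each row is one flattened input x
W = np.random.randn(D, C) * 1e-3  # (number of pixels, number of classes)

scores = X.dot(W)                 # (N, C): per-sample score for every class
dW = np.zeros_like(W)             # gradient matrix, same shape as W: (D, C)
```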
The following formula needs to be understood:

$$\nabla_{w}L_{i,j}=\frac{\partial L_{i}}{\partial w}=\begin{cases}\dfrac{\partial L_{i}}{\partial w_{j}}, & (j\neq y_{i})\\[4pt]\dfrac{\partial L_{i}}{\partial w_{y_{i}}}, & (j=y_{i})\end{cases}$$

It splits into cases for the different classes, and the best way to understand it is to work through the actual calculation.

The following are the formulas for the derivative of the loss with respect to the weights:

$$\nabla_{w_{y_{i}}}L_{i}=-\left(\sum_{j\neq y_{i}}\mathbb{1}\left(w_{j}^{T}x_{i}-w_{y_{i}}^{T}x_{i}+\Delta>0\right)\right)x_{i}$$

$$\nabla_{w_{j}}L_{i}=\mathbb{1}\left(w_{j}^{T}x_{i}-w_{y_{i}}^{T}x_{i}+\Delta>0\right)x_{i}$$

Here 1(x) is the indicator function: it equals 1 when x is true and 0 when x is false.

The first formula gives the gradient of the column W_{y_i} of W, i.e. the correct class for the i-th sample. It is computed by counting how many W_j fail to satisfy the margin and therefore contribute to the loss; that count, multiplied by x_i and negated, is the gradient for W_{y_i}. Its meaning (together with the SGD update below) is: since the goal for W_{y_i} is to make the loss L_i smaller, several copies of x_i are effectively added to W_{y_i} (the formula carries a negative sign, but SGD steps in the negative gradient direction, so the net effect is addition). The score contributed by W_{y_i} therefore grows, and the loss should become smaller in the next iteration. Why add x_i? Because x_i contains all of this sample's features. If another sample x_j also belongs to class y_i, then when its gradient is computed, a portion of its features will likewise be added to W_{y_i} (it is the same correct-class weight vector for both samples).

The second formula gives the gradient of a wrong-class column W_j for the i-th sample. Note that in the derivation, only the term of the sum that actually contains W_j contributes to the gradient; in every other term W_j appears only as a constant, and the derivative of a constant is zero. That is why the final form differs from the first formula and carries no summation symbol.
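
Putting the two formulas together, a naive per-sample implementation of the loss and gradient could look like the sketch below. This is an illustration under assumed names (W, X, y, reg, delta), not the reference solution; the indicator function becomes the `margin > 0` test.

```python
import numpy as np

def svm_loss_naive(W, X, y, reg, delta=1.0):
    """SVM loss and gradient with explicit loops.

    W: (D, C) weights, X: (N, D) samples, y: (N,) correct class indices,
    reg: regularization strength (lambda in the loss formula).
    """
    dW = np.zeros_like(W)
    loss = 0.0
    num_classes = W.shape[1]
    for i in range(X.shape[0]):
        scores = X[i].dot(W)
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + delta
            if margin > 0:               # indicator 1(...) is true: this j contributes
                loss += margin
                dW[:, j] += X[i]         # second formula: +x_i on the wrong-class column
                dW[:, y[i]] -= X[i]      # first formula: one more -x_i on the correct column
    loss += reg * np.sum(W * W)          # regularization term lambda * sum(W_kl^2)
    dW += 2 * reg * W                    # its gradient
    return loss, dW
```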

From the formulas above, the gradient matrix computed this way does give the direction of steepest descent used by the SGD update.
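
In an SGD step the weights move against this gradient, which is why the minus sign on the correct-class column ends up adding x_i to W_{y_i}. A schematic step, reusing the sketch above (learning_rate and reg are assumed values):

```python
learning_rate = 1e-3                           # assumed step size
loss, dW = svm_loss_naive(W, X, y, reg=1e-4)   # loss and gradient from the sketch above
W -= learning_rate * dW                        # step in the negative gradient direction
```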

These notes are mainly for my own learning and understanding; the main ideas follow the article referenced above.


Origin blog.csdn.net/m0_45290027/article/details/127331972