Linear classification: SVM and Softmax

1 Introduction

The previous section described image classification: the task of assigning a single label, chosen from a fixed set of categories, to an image. We also introduced the k-Nearest Neighbor (k-NN) classifier, which labels a test image by comparing it to the labeled images in the training set and taking the label of the most similar training images. The k-NN classifier has the following shortcomings:

(1) The classifier must remember all of the training data and store it for future comparisons with test data. This is wasteful of storage space, since datasets can easily reach gigabytes in size.
(2) Classifying a test image requires comparing it to every single training image, which is computationally expensive.

We are now going to develop a more powerful approach to image classification, one that extends naturally to neural networks and convolutional neural networks. The approach has two main components: a score function that maps the raw image data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground-truth labels. We will then cast this as an optimization problem in which the parameters of the score function are updated to minimize the loss.

2 Score function

The score function maps the pixel values of an image to a confidence score for each class; a higher score means the image is more likely to belong to that class. We will develop the approach with a concrete example. Assume a training set of images \(x_{i} \in R^{D}\), each with a corresponding label \(y_{i}\), where \(i = 1,2 \ldots N\) and \(y_{i} \in 1 \ldots K\). That is, we have N example images, each of dimension D, belonging to K distinct classes.

For example, in CIFAR-10 we have a training set of N = 50000 images, each with D = 32x32x3 = 3072 pixels, and K = 10, since the images fall into 10 distinct classes (dog, cat, car, etc.). We now define the score function \(f: R^{D} \rightarrow R^{K}\), which maps the raw image pixels to class scores.

2.1 Linear classifier

In this model we start with arguably the simplest possible score function, a linear mapping:

$f\left(x_{i}, W, b\right)=W x_{i}+b$

In the above formula we assume that each image \(x_{i}\) has been flattened into a single column vector of shape [D x 1]. The matrix W, of size [K x D], and the column vector b, of size [K x 1], are the parameters of the function. Taking CIFAR-10 as an example again, \(x_{i}\) contains all the pixels of the i-th image flattened into a [3072 x 1] column vector, W is [10 x 3072], and b is [10 x 1]. The function therefore takes 3072 numbers as input (the raw pixel values) and outputs 10 numbers (the class scores). W is called the weights, and b is called the bias vector, because it influences the output scores without interacting with the actual data \(x_{i}\). In practice, however, the terms weights and parameters are often used interchangeably.
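
As a concrete illustration, here is a minimal NumPy sketch of this linear mapping with CIFAR-10-sized shapes. The random values are only placeholders for parameters that would normally be learned, and the variable names are our own, not from any particular library.

```python
import numpy as np

D, K = 3072, 10  # CIFAR-10: 32x32x3 = 3072 pixels, 10 classes

x_i = np.random.randint(0, 256, size=(D, 1)).astype(np.float64)  # one image, flattened to [D x 1]
W = np.random.randn(K, D) * 0.001                                # weights, [K x D]
b = np.zeros((K, 1))                                             # bias vector, [K x 1]

scores = W.dot(x_i) + b  # [K x 1]: one score per class
print(scores.ravel())
```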

A few things to note:

(1) The single matrix multiplication \(W x_{i}\) effectively evaluates 10 separate classifiers in parallel (one per class), where each classifier is a row of W.
(2) The input data \(\left(x_{i}, y_{i}\right)\) are given and fixed, but we control the parameters W and b. Our goal is to set them so that the computed class scores match the ground-truth labels across the whole training set. We will describe in detail how this is done, but for now the intuition is that we want the correct class to score higher than the incorrect classes.
(3) An advantage of this approach is that the training data is used only to learn the parameters W and b; once training is complete, the training data can be discarded and only the learned parameters kept. A new test image can then simply be fed through the function and classified based on the computed scores.
(4) Classifying a test image requires only a single matrix multiplication and addition, which is much faster than comparing the test image against every training image as k-NN does (see the sketch below).
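
To make note (4) concrete, here is a small sketch (with made-up shapes and unlearned random parameters) that classifies a batch of test images with a single matrix multiplication followed by an argmax over the class scores:

```python
import numpy as np

D, K, num_test = 3072, 10, 5            # made-up sizes
W = np.random.randn(K, D) * 0.001       # pretend these were already learned
b = np.random.randn(K, 1) * 0.001
X_test = np.random.randn(D, num_test)   # each column is one flattened test image

scores = W.dot(X_test) + b              # [K x num_test]: one column of class scores per image
y_pred = np.argmax(scores, axis=0)      # predicted label for each test image
print(y_pred)
```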

2.2 Interpreting a linear classifier

A linear classifier computes the score of a class as a weighted sum of all the pixel values across the three color channels of the image. Depending on the values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image.

For example, imagine that the "ship" class is likely to be surrounded by a lot of blue (corresponding to water). The "ship" classifier would then have many positive weights across its blue channel (the presence of blue increases the "ship" score) and negative weights across the red and green channels (the presence of red or green decreases the "ship" score).

The figure below works through a concrete example of mapping an image to class scores:

For ease of visualization, the figure assumes the image has only 4 pixels (monochrome pixels; RGB channels are not considered here), and that there are 3 classes (red for cat, green for dog, blue for ship; note that the colors merely label the classes and have nothing to do with the RGB channels). The image pixels are first stretched into a column vector, which is multiplied by W to obtain the score for each class. Note that this particular W is not good at all: the cat score is very low. In fact, judging by the scores, this W is convinced the image is a dog.

2.2.1 Images as high-dimensional points

Since each image can be stretched into a high-dimensional column vector, we can interpret each image as a single point in that space (for example, each CIFAR-10 image is a point in a 3072-dimensional space). The entire dataset is then a set of labeled points.

Since we defined the score of each class as a weighted sum of all the pixel values, each class score is a linear function over this space. We cannot visualize a linear function over a 3072-dimensional space, but if we imagine squashing those dimensions down to two, we can try to visualize what the classifiers are doing:

The figure is a cartoon of image space, in which each image is a point and three classifiers are drawn. Taking the car classifier (in red) as an example, the red line marks the set of points that receive a car score of zero, and the red arrow points in the direction of increasing score. All points to the right of the red line have positive, linearly increasing scores, and all points to the left have negative, linearly decreasing scores.

As we saw above, every row of W is a classifier for one of the classes. The geometric interpretation of these numbers is that as we change a row of W, the corresponding line in pixel space rotates in different directions, while the bias b allows the corresponding line to translate. In particular, note that without the bias term, the score at \(x_{i} = 0\) would always be zero regardless of the weights, so every classifier's line would be forced to pass through the origin.

2.2.2 Linear classifiers as template matching

Another interpretation of the weights W is that each row corresponds to a template (sometimes also called a prototype) for one of the classes. The score of each class for an image is obtained by comparing the image to each template with an inner product (also called a dot product) and seeing which template matches best. In this view, the linear classifier is doing template matching, where the templates are learned. Alternatively, we can think of it as still running a form of k-NN, except that instead of comparing to every training image we compare to only a single image per class (and that image is learned, rather than taken from the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

2.2.3 Image data preprocessing

In the examples above we used the raw pixel values, which range from 0 to 255. In machine learning it is a very common practice to normalize the input features (in the case of images, every pixel can be thought of as a feature). In particular, it is important to center the data by subtracting the mean from every feature. For images this means computing a mean image across the training set and subtracting it from every image, so that the pixel values are roughly distributed in [-127, 127]. A further common step is to scale each feature so that its values lie in [-1, 1]. Zero-centering is particularly important, but we will have to wait until we discuss gradient descent to justify it in detail.
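
A minimal sketch of this preprocessing, assuming (hypothetically) that the flattened training and test images are stored as rows of `X_train` and `X_test`:

```python
import numpy as np

# Hypothetical data: rows are flattened images with raw pixel values in [0, 255].
X_train = np.random.randint(0, 256, size=(50000, 3072)).astype(np.float64)
X_test = np.random.randint(0, 256, size=(10000, 3072)).astype(np.float64)

mean_image = X_train.mean(axis=0)   # per-pixel mean, computed on the training set only
X_train -= mean_image               # training pixels now roughly in [-127, 127]
X_test -= mean_image                # apply the same shift to the test data

X_train /= 128.0                    # optional further scaling to roughly [-1, 1]
X_test /= 128.0
```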

3 Loss function

In the previous section we defined a score function that maps pixel values to class scores, parameterized by a weight matrix W. We have no control over the data \(\left(x_{i}, y_{i}\right)\), which is given and fixed, but we can adjust the weights so that the computed scores are consistent with the ground-truth labels across the training set, i.e., so that the correct class receives the highest score.

Going back to the earlier cat image, which received scores for the classes "cat", "dog" and "ship": we saw that that particular set of weights was quite poor, because the cat score (-96.8) was very low while the dog (437.9) and ship (61.95) scores were high. We will measure our unhappiness with outcomes like this using a loss function (sometimes also called a cost function or objective). Intuitively, the loss is high when the scores deviate strongly from the true labels, and low when they agree.

3.1 Multiclass SVM loss

The loss function can be defined in several specific ways. As a first example we describe a commonly used loss called the Multiclass Support Vector Machine (SVM) loss. The SVM loss is set up so that the SVM "wants" the score of the correct class to exceed the scores of the incorrect classes by at least some fixed margin \(\Delta\). It sometimes helps to anthropomorphize the loss function this way: the SVM "prefers" certain outcomes in the sense that they yield a lower loss value.

Let us now make this precise. Recall that the i-th example consists of the image pixels \(x_{i}\) and the label \(y_{i}\) specifying the correct class. The score function takes the pixels and computes the vector \(f\left(x_{i}, W\right)\) of class scores, which we abbreviate as s. For example, the score for the j-th class is the j-th element: \(s_{j} = f\left(x_{i}, W\right)_{j}\). The Multiclass SVM loss for the i-th example is then defined as follows:

$L_{i}=\sum_{j \neq y_{i}} \max \left(0, s_{j}-s_{y_{i}}+\Delta\right)$

An example will show how the formula works. Suppose there are three classes which receive the scores s = [13, -7, 11], and the first class is the correct one (i.e. \(y_{i} = 0\)). Also assume that \(\Delta\) (a hyperparameter we discuss in more detail later) is 10. The formula sums over all the incorrect classes \(\left(j \neq y_{i}\right)\), so we get two terms:

$L_{i}=\max \left(0,-7-13+10\right)+\max \left(0,11-13+10\right)$

The first term evaluates to 0, since [-7 - 13 + 10] is negative, and max(0, -) clamps it to zero. This pair contributes no loss because the correct class score (13) exceeds the incorrect class score (-7) by 20, well above the margin of 10; the SVM only cares that the gap is at least 10, and any larger gap is still clamped to zero loss. The second term computes [11 - 13 + 10] = 8: although the correct class has a higher score than the incorrect class (13 > 11), the gap of only 2 is smaller than the required margin of 10, so this term contributes a loss of 8. In summary, the SVM loss wants the score of the correct class \(y_{i}\) to be higher than the scores of the incorrect classes by at least \(\Delta\); whenever this is not the case, loss accumulates.
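
The worked example can be checked with a short, unvectorized sketch of the per-example Multiclass SVM loss (an illustration only, not an optimized implementation):

```python
import numpy as np

def svm_loss_single(scores, y, delta=10.0):
    """Multiclass SVM loss for one example.
    scores: vector of class scores, y: index of the correct class."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # the j == y_i term is excluded from the sum
    return np.sum(margins)

print(svm_loss_single(np.array([13.0, -7.0, 11.0]), y=0, delta=10.0))  # -> 8.0
```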

Since in this model we are working with a linear score function \(\left(f\left(x_{i}, W\right)=W x_{i}\right)\), we can also rewrite the loss function in the following equivalent form:

$L_{i}=\sum_{j \neq y_{i}} \max \left(0, w_{j}^{T} x_{i}-w_{y_{i}}^{T} x_{i}+\Delta\right)$

Here \(w_{j}\) is the j-th row of W reshaped as a column vector. This will not necessarily be the case once we start to consider more complex forms of the score function f.

One more piece of terminology: the thresholding at zero, \(\max(0, -)\), is often called the hinge loss. You will sometimes hear about the squared hinge loss SVM (also called L2-SVM), which uses \(\max(0, -)^{2}\) and penalizes violated margins more strongly (quadratically rather than linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better; this can be determined with cross-validation.
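
A one-line comparison of how the two variants penalize the violated margin from the example above (sketch only):

```python
violation = 11 - 13 + 10                # the margin violation from the example: 8
hinge = max(0, violation)               # standard hinge loss contributes 8
squared_hinge = max(0, violation) ** 2  # the L2-SVM penalizes the same violation quadratically: 64
print(hinge, squared_hinge)
```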

3.2 Regularization

There is a problem with the loss function presented above. Suppose we have a dataset and a set of weights W that correctly classifies every example (i.e. all the margins are met and \(L_{i} = 0\) for all i). The issue is that this W is not necessarily unique: there may be many similar W that correctly classify all the examples. One easy way to see this is that if some W gives zero loss on every example, then any scalar multiple \(\lambda W\) with \(\lambda > 1\) also gives zero loss, because scaling W uniformly stretches all the score differences, making their magnitudes even larger. For example, if the score gap between a correct class and its nearest incorrect class is 15, multiplying W by 2 would make the gap 30.

In other words, we wish to encode a preference for certain sets of weights W over others, in order to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W). The most common choice is the L2 penalty, which discourages large weights through an element-wise quadratic penalty over all parameters:

$R(W)=\sum_{k} \sum_{l} W_{k, l}^{2}$

In the expression above we sum up the squares of all the elements of W. Notice that the regularization term is not a function of the data; it depends only on the weights. Including the regularization penalty completes the full Multiclass SVM loss, which is made up of two components: the data loss (the average loss \(L_{i}\) over all examples) and the regularization loss. The full formula is:

$L=\underbrace{\frac{1}{N} \sum_{i} L_{i}}_{\text {data loss}}+\underbrace{\lambda R(W)}_{\text {regularization loss}}$

Or, expanding this out in its full form:

$L=\frac{1}{N} \sum_{i} \sum_{j \neq y_{i}}\left[\max \left(0, f\left(x_{i} ; W\right)_{j}-f\left(x_{i} ; W\right)_{y_{i}}+\Delta\right)\right]+\lambda \sum_{k} \sum_{l} W_{k, l}^{2}$

Here N is the number of training examples. The regularization penalty is added to the loss, weighted by a hyperparameter \(\lambda\). There is no simple way of setting this hyperparameter; it is usually determined by cross-validation.
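
Putting the pieces together, here is a minimal (unvectorized, for clarity) sketch of the full Multiclass SVM loss: the data loss averaged over N examples plus the L2 regularization term. The function and variable names are our own choices for illustration.

```python
import numpy as np

def full_svm_loss(W, X, y, delta=1.0, reg_lambda=1e-3):
    """W: [K x D] weights, X: [N x D] rows are examples, y: [N] correct class indices."""
    N = X.shape[0]
    data_loss = 0.0
    for i in range(N):
        scores = W.dot(X[i])                                   # [K] class scores for example i
        margins = np.maximum(0, scores - scores[y[i]] + delta)
        margins[y[i]] = 0
        data_loss += np.sum(margins)
    data_loss /= N
    reg_loss = reg_lambda * np.sum(W * W)                      # lambda * sum_k sum_l W_{k,l}^2
    return data_loss + reg_loss

# tiny made-up example
W = np.random.randn(3, 4) * 0.01
X = np.random.randn(5, 4)
y = np.array([0, 2, 1, 1, 0])
print(full_svm_loss(W, X, y))
```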

Besides the motivation above, including a regularization penalty brings many desirable properties, most of which we will come back to in later chapters. For example, with the L2 penalty the SVM acquires the appealing max margin property. (See the CS229 course notes if you are interested in the details.)

The most appealing property is that penalizing large weights tends to improve generalization, because it means that no single input dimension can have a very large influence on the scores all by itself. For example, suppose the input vector is \(x = [1,1,1,1]\) and consider two weight vectors \(w_{1} = [1,0,0,0]\) and \(w_{2} = [0.25,0.25,0.25,0.25]\). Then \(w_{1}^{T} x = w_{2}^{T} x = 1\), so both weight vectors give the same inner product, but the L2 penalty of \(w_{1}\) is 1.0 while the L2 penalty of \(w_{2}\) is only 0.25. Therefore, according to the L2 penalty, \(w_{2}\) is preferred because its regularization loss is smaller. Intuitively, this is because the weights in \(w_{2}\) are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account to small degrees rather than depending strongly on a few of them. As we will see in later lectures, this effect improves the generalization of the classifier and helps to avoid overfitting.
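
The arithmetic in this example is easy to verify with a couple of lines (using exactly the numbers above):

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1.dot(x), w2.dot(x))          # both inner products equal 1.0
print(np.sum(w1**2), np.sum(w2**2))  # L2 penalties: 1.0 vs 0.25
```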

Note that the biases do not have the same effect, since unlike the weights they do not control the strength of influence of an input dimension. It is therefore common to regularize only the weights W and not the biases b, although in practice this often turns out to make a negligible difference. Lastly, note that because of the regularization penalty we can never achieve a loss of exactly zero on all examples, since that would only be possible in the pathological setting W = 0.

3.3 Practical considerations: setting \(\Delta\)

You may have noticed that we glossed over the hyperparameter \(\Delta\) and its setting. What value should it take, and do we have to cross-validate it? It turns out that \(\Delta = 1.0\) is safe in nearly all cases. The hyperparameters \(\Delta\) and \(\lambda\) look like two different hyperparameters, but in fact they both control the same trade-off: the trade-off between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights W has a direct effect on the class scores (and hence on their differences): as we shrink all the values inside W the score differences become smaller, and as we scale the weights up the score differences all become larger. Therefore, the exact value of the margin (e.g. \(\Delta = 1\) or \(\Delta = 100\)) is in some sense meaningless, because the weights can shrink or stretch the differences arbitrarily. The real trade-off is how large we allow the weights to grow, which is controlled by the regularization strength \(\lambda\).

If you come to this class with previous experience with binary SVMs, you may have seen the loss for the i-th example written as:

$L_{i}=C \max \left(0,1-y_{i} w^{T} x_{i}\right)+R(W)$

Here C is a hyperparameter and \(y_{i} \in \{-1,1\}\). You can convince yourself that the formulation presented in this section contains the binary SVM as a special case: when there are only two classes, the multiclass formula reduces to the binary SVM formula above. C in this formula and \(\lambda\) in our formulation control the same trade-off and are related through \(C \propto \frac{1}{\lambda}\).

Note: the Multiclass SVM presented in this section is only one of several ways of formulating an SVM over multiple classes.

3.4 Softmax classifier

The SVM is one of two commonly used classifiers; the other is the Softmax classifier, which has a different loss function. If you have studied the binary logistic regression classifier, the Softmax classifier can be understood as its generalization to multiple classes. Unlike the SVM, which treats the outputs \(f\left(x_{i}, W\right)\) as (uncalibrated and possibly hard to interpret) scores for each class, the Softmax classifier gives a more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we describe shortly. In the Softmax classifier, the mapping function \(f\left(x_{i}; W\right)=W x_{i}\) stays unchanged, but we now interpret these scores as the unnormalized log probabilities of each class, and we replace the hinge loss with the cross-entropy loss, which has the form:

$L_{i}=-\log \left(\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}\right)$

This is equivalent to:

$L_{i}=-f_{y_{i}}+\log \left(\sum_{j} e^{f_{j}}\right)$

In the formula above, \(f_{j}\) denotes the j-th element of the vector of class scores \(f\). As before, the full loss over the dataset is the mean of \(L_{i}\) over all training examples together with the regularization loss \(R(W)\). The function \(f_{j}(z) = \frac{e^{z_{j}}}{\sum_{k} e^{z_{k}}}\) is called the softmax function: it takes a vector of arbitrary real-valued scores (in z) and squashes it into a vector of values between zero and one that sum to one. The full cross-entropy loss involving the softmax function may look intimidating at first, but it is in fact quite easy to motivate.
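
A minimal per-example sketch of the cross-entropy loss with the softmax function (written naively here; the numeric-stability fix is discussed in section 3.5):

```python
import numpy as np

def softmax_loss_single(scores, y):
    """Cross-entropy loss for one example.
    scores: vector of unnormalized class scores f, y: index of the correct class."""
    probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax: normalized class probabilities
    return -np.log(probs[y])                         # negative log probability of the correct class

print(softmax_loss_single(np.array([13.0, -7.0, 11.0]), y=0))
```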

3.4.1 Information theory view

The cross-entropy between a "true" distribution p and an estimated distribution q is defined as:

$H(p, q)=-\sum_{x} p(x) \log q(x)$

The Softmax classifier is therefore minimizing the cross-entropy between the estimated class probabilities (the quantity \(e^{f_{y_{i}}} / \sum_{j} e^{f_{j}}\) seen above) and the "true" distribution, which in this interpretation is the distribution where all probability mass sits on the correct class (i.e. p = [0, ... 1, ..., 0], with a single 1 in the \(y_{i}\)-th position). Moreover, since the cross-entropy can be written in terms of the entropy and the Kullback-Leibler divergence as \(H(p, q) = H(p) + D_{KL}(p \| q)\), and the entropy of the delta distribution p is zero, this is equivalent to minimizing the KL divergence between the two distributions. In other words, the cross-entropy objective "wants" the predicted distribution to have all of its mass on the correct class.

Note: the Kullback-Leibler divergence is also called the relative entropy; it measures the difference between two probability distributions defined over the same event space.

3.4.2 Probabilistic interpretation

Consider the following expression:

$P\left(y_{i} | x_{i}, W\right)=\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}$

It can be interpreted as the normalized probability assigned to the correct label \(y_{i}\), given the image \(x_{i}\) and parameterized by W. To see this, recall that the Softmax classifier interprets the scores inside the output vector f as unnormalized log probabilities. Exponentiating these scores gives the (unnormalized) probabilities, and the division normalizes them so that they sum to one. In this probabilistic view, we are minimizing the negative log probability of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that the regularization term \(R(W)\) in the full loss function can be interpreted as coming from a Gaussian prior over the weight matrix W, in which case we are performing Maximum a Posteriori (MAP) estimation instead of MLE.

3.5 Practical considerations for Softmax: numeric stability

The main issue is numerical stability. When writing code to compute the softmax function, the intermediate terms \(e^{f_{y_{i}}}\) and \(\sum_{j} e^{f_{j}}\) may be very large because of the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. If we multiply both the numerator and the denominator of the fraction by a constant C and push it into the sum, we obtain the following mathematically equivalent expression:

$\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}=\frac{C e^{f_{y_{i}}}}{C \sum_{j} e^{f_{j}}}=\frac{e^{f_{y_{i}}+\log C}}{\sum_{j} e^{f_{j}+\log C}}$

We are free to choose the value of C; it does not change any of the results, but it can improve the numerical stability of the computation. A common choice is \(\log C = -\max_{j} f_{j}\). This simply says that we should shift the values inside the vector f so that the highest value is zero.
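
A small sketch of the shift trick, with made-up scores chosen large enough to overflow the naive computation:

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])                # np.exp(789) overflows to inf
# p = np.exp(f) / np.sum(np.exp(f))                # naive version: numerically unstable

f_shifted = f - np.max(f)                          # shift so the highest score is 0 (logC = -max_j f_j)
p = np.exp(f_shifted) / np.sum(np.exp(f_shifted))  # safe, and mathematically the same result
print(p)
```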

3.6 Possibly confusing naming conventions

To be precise, the SVM classifier uses the hinge loss, sometimes also called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which squashes the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it makes no sense to speak of a "softmax loss", since softmax is just the squashing function, but the term is commonly used as shorthand.

3.7 SVM vs. Softmax

The following figure may help clarify the distinction between the SVM and Softmax classifiers:

Example of how the SVM and Softmax classifiers treat one data point differently. Both classifiers compute the same score vector f (here via matrix multiplication); the difference lies in how the scores in f are interpreted. The SVM treats them as class scores, and its loss function encourages the correct class (class 2, shown in blue) to have a score higher than the other classes by at least a margin. The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class, and encourages the normalized log probability of the correct class to be high and the rest to be low. The final loss is 1.58 for the SVM and 0.452 for the Softmax, but note that these two values are not comparable: loss values are only meaningful when compared within the same classifier on the same data.

The SVM computes uncalibrated scores that are not easy to interpret for every class. The Softmax classifier instead lets us compute "probabilities" for all the labels. For example, given an image, the SVM classifier might produce the scores [12.5, 0.6, -23.0] for the classes "cat", "dog" and "ship", while the Softmax classifier would instead compute the three labels' "probabilities" as, say, [0.9, 0.09, 0.01], which lets you read off its confidence in each class. Why do we put "probabilities" in quotes? Because how peaked or diffuse these probabilities are depends directly on the regularization strength λ, which is an input parameter you directly control. For example, suppose the unnormalized scores for three classes are [1, -2, 0]. The softmax function would then compute:

\([1,-2,0] \rightarrow\left[e^{1}, e^{-2}, e^{0}\right]=[2.71,0.14,1] \rightarrow[0.7,0.04,0.26]\)

Now, if the regularization strength λ were larger, the weights W would be penalized more heavily, leading to smaller weight values. Suppose the scores consequently came out half as large, say [0.5, -1, 0]. The softmax function would now compute:

\([0.5,-1,0] \rightarrow\left[e^{0.5}, e^{-1}, e^{0}\right]=[1.65,0.37,1] \rightarrow[0.55,0.12,0.33]\)

The probabilities are now more diffuse. Moreover, as the regularization strength λ keeps growing, the weights become smaller and smaller and the output probabilities approach the uniform distribution. The probabilities computed by the Softmax classifier are therefore better thought of as confidences: as with the SVM, the ordering of the scores is interpretable, but their absolute values (or differences) technically are not.
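
The two computations above can be reproduced with a short sketch that applies the softmax function to the original and to the halved scores:

```python
import numpy as np

def softmax(f):
    f = f - np.max(f)                    # shift for numeric stability
    return np.exp(f) / np.sum(np.exp(f))

print(softmax(np.array([1.0, -2.0, 0.0])))  # roughly [0.70, 0.04, 0.26]
print(softmax(np.array([0.5, -1.0, 0.0])))  # roughly [0.55, 0.12, 0.33] -- more diffuse
```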

In practice, the SVM and Softmax classifiers are usually comparable; the performance difference between them is small, and different people will have different opinions about which works better. Compared to the Softmax classifier, the SVM has a more "local" objective, which can be regarded either as a feature or as a bug. Consider an example that achieves the scores [10, -2, 3], where the first class is correct. An SVM (with \(\Delta = 1\)) sees that the correct class already exceeds the other classes by more than the margin, so it computes a loss of zero. The SVM does not care about the details of the individual scores: whether they were [10, -100, -100] or [10, 9, 9], the SVM would be indifferent, since the margin of 1 is satisfied and the loss is zero.

For the Softmax classifier the situation is different: it would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully satisfied with the scores it produces: the correct class could always have a higher probability and the incorrect classes a lower one, and the loss would always get smaller. The SVM, by contrast, is happy once the margins are satisfied, and it does not micromanage the exact scores beyond that constraint. This can be thought of as a feature: for example, a car classifier which is presumably spending most of its "effort" on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to.
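
To see this difference numerically, here is a sketch comparing the two losses on the two score vectors mentioned above (correct class first, \(\Delta = 1\), regularization ignored):

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    return np.sum(margins)

def softmax_loss(scores, y):
    f = scores - np.max(scores)          # stability shift
    return -np.log(np.exp(f[y]) / np.sum(np.exp(f)))

for s in (np.array([10.0, -100.0, -100.0]), np.array([10.0, 9.0, 9.0])):
    print(s, svm_loss(s, y=0), softmax_loss(s, y=0))
# The SVM loss is 0 for both; the softmax loss is near 0 for the first
# but noticeably larger (about 0.55) for the second.
```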
