[Machine Learning] SVM (support vector machine) for binary classification on a handwritten-digit dataset; analysis and comparison of linear classification models using hinge loss and cross-entropy loss; grid search

2022 Fall Machine Learning

1. Experimental requirements

  1. Consider two different kernel functions: i) linear kernel function; ii) Gaussian kernel function
  2. An off-the-shelf SVM software package may be called directly for the implementation
  3. Manually implement a linear classification model using hinge loss and cross-entropy loss, and compare their pros and cons

2. Experimental content

1) General theory of SVM model
2) Model and performance comparison and analysis using different kernel functions
3) Relationship between linear classification model using hinge loss and SVM model
4) Comparison between the linear classification model using hinge loss and the one using cross-entropy loss
5) Training process (including initialization method, hyperparameter selection, training techniques used, etc.)
6) Experimental results, analysis and discussion

Experimental data:

  • Dataset: Handwriting Dataset

  • Data description:

    • The first line is the header, which introduces the information represented by each column in the file; each of the remaining lines corresponds to a sample, and each sample contains 785 columns.

    • The first column is the label, with a value of 0 or 1; the remaining 784 columns are the pixel values of the handwritten image, ranging from 0 to 255.

3. Experimental procedure

1) General theory of SVM model

A Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification via supervised learning. Its learning goal is to find the hyperplane with the largest margin in the n-dimensional feature space; the original problem of finding the optimal parameters of this hyperplane can be transformed into the dual problem of solving a convex quadratic program.

Depending on whether the training samples are linearly separable in the feature space, two SVM variants with different emphases are derived: hard-margin SVM and soft-margin SVM. The former assumes the samples are linearly separable in their feature space (i.e., a hyperplane can be found that separates the two classes). The latter handles the linearly inseparable case by introducing slack variables $\xi_i \geq 0$ (indicating how far a sample falls on the wrong side of the margin), which allow the SVM to make mistakes on some samples at the cost of a corresponding penalty $\xi_i$. The margin here refers to the distance from the separating hyperplane to the sample points closest to it (the support vectors); for the model to perform as well as possible on unseen data, this margin should be made as large as possible.

In hard-margin SVM, to maximize the margin we solve

$$\min_{\mathbf{w},b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{s.t.}\quad y^{(l)}(\mathbf{w}^T\mathbf{x}^{(l)}+b)\ge 1,\quad l=1,2,\dots,N.$$

In soft-margin SVM, to maximize the margin while allowing slack we solve

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{l=1}^{N}\xi_l \quad \text{s.t.}\quad y^{(l)}(\mathbf{w}^T\mathbf{x}^{(l)}+b)\ge 1-\xi_l,\ \ \xi_l\ge 0,\quad l=1,2,\dots,N,$$

where C is the penalty coefficient that controls how strongly margin violations are punished.

In short, when the data are linearly separable, the optimal separating hyperplane is found directly in the original space. When they are not, slack variables are added and the samples are mapped from the low-dimensional input space to a high-dimensional feature space by a nonlinear mapping (a kernel function) so that they become linearly separable, and the optimal separating hyperplane is then found in that feature space. The defining characteristic of SVM is the "support vectors": the separating hyperplane is determined only by a small number of support vectors.
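
To make the role of the support vectors concrete, here is a small illustrative sketch (not part of the original experiments; the toy data are made up) that fits a linear SVC on six 2-D points and prints the hyperplane together with the support vectors that determine it:

import numpy as np
from sklearn import svm

# toy, linearly separable 2-D data (illustrative only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)  # only the points closest to the boundary
print("w =", clf.coef_, " b =", clf.intercept_)    # parameters of the separating hyperplane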

2) Models and performance comparison and analysis using different kernel functions

In this experiment, the SVM model provided by the sklearn package is used;

  1. Linear kernel function

Relevant code:

SVM Train

import pandas
from sklearn import svm

def svm_train(path, kernel):
    classifiers = []
    trainset = pandas.read_csv(path)
    label = trainset.values[:, 0]    # first column: label (0 or 1)
    data = trainset.values[:, 1:]    # remaining 784 columns: pixel values
    for k in kernel:
        clf = svm.SVC(kernel=k).fit(data, label)   # here k = 'linear', default C = 1
        train_score = clf.score(data, label)       # accuracy on the training set
        print("The train accuracy under kernel %s is %f: " % (clf.kernel, train_score))
        classifiers.append(clf)
    return classifiers

Here, the SVC model from sklearn.svm is used for training with the linear kernel function; fit computes the separating hyperplane, and all relevant attributes are stored in the classifier clf;
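
The test-set accuracy quoted below comes from a separate evaluation step that is not shown in the report; a minimal sketch of what it might look like (the file names and the function name svm_test are my assumptions) is:

def svm_test(path, classifiers):
    testset = pandas.read_csv(path)          # same CSV layout as the training file
    label = testset.values[:, 0]
    data = testset.values[:, 1:]
    for clf in classifiers:
        test_score = clf.score(data, label)  # mean accuracy on the test set
        print("The test accuracy under kernel %s is %f: " % (clf.kernel, test_score))

classifiers = svm_train('mnist_01_train.csv', ['linear'])   # hypothetical file names
svm_test('mnist_01_test.csv', classifiers)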

Experimental results:


In terms of training and test accuracy, scores of 1 and 0.999054 were obtained respectively;

With the linear kernel, the accuracy on the training set reaches 1 while the performance on the test set is comparatively worse, so there may be some overfitting when the linear kernel function is used.

Therefore, consider adjusting the penalty coefficient C. When C is relatively small, margin violations are penalized only weakly, meaning we do not insist on fitting the outliers; the model prioritizes a large margin, and the resulting support-vector set and hyperplane model stay simple:

def svm_train(path, kernel):
    classifiers = []
    trainset = pandas.read_csv(path)
    label = trainset.values[:, 0]
    data = trainset.values[:, 1:]
    for k in kernel:
        # very small C: slack variables are barely penalized, so the margin is maximized
        clf = svm.SVC(kernel=k, C=0.0000001).fit(data, label)
        train_score = clf.score(data, label)
        print("The train accuracy under kernel %s is %f: " % (clf.kernel, train_score))
        classifiers.append(clf)
    return classifiers

Finally, it was found that with C = 0.0000001 the model performed better on the test set than before, reaching a score of 0.999527.


This may be because, during training, the SVM's hyperplane does not need to separate every point in the training set: it maximizes the margin as far as possible rather than trying to classify all training data perfectly (sacrificing margin maximization amounts to sacrificing generalization), so it performs better on the test set than with the earlier C = 1.
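
One way to check this intuition (a sketch of mine, not one of the original experiments) is to see how the number of support vectors and the training accuracy change as C varies, assuming data and label are loaded from the training CSV as in svm_train above:

from sklearn import svm

for C in [1e-7, 1e-3, 1.0]:
    clf = svm.SVC(kernel='linear', C=C).fit(data, label)
    n_sv = clf.n_support_.sum()              # total number of support vectors across both classes
    print("C=%g  support vectors=%d  train accuracy=%.6f"
          % (C, n_sv, clf.score(data, label)))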

  2. Gaussian kernel function

SVM Train

def svm_train(path, kernel):
    classifiers = []
    trainset = pandas.read_csv(path)
    label = trainset.values[:, 0]
    data = trainset.values[:, 1:]
    for k in kernel:
        clf = svm.SVC(kernel=k).fit(data, label)   # here k = 'rbf' (Gaussian kernel), default C and gamma
        train_score = clf.score(data, label)
        print("The train accuracy under kernel %s is %f: " % (clf.kernel, train_score))
        classifiers.append(clf)
    return classifiers

This code is similar to the above: the SVC model from sklearn.svm is used for training, here with the Gaussian (RBF) kernel function;
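
For reference, the Gaussian (RBF) kernel computes K(x, x') = exp(-γ‖x − x'‖²); in recent sklearn versions SVC defaults to gamma='scale'. A tiny sketch of the kernel value itself (illustrative numbers, not from the experiment):

import numpy as np

def rbf_kernel(x1, x2, gamma):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([0.0, 1.0, 2.5])
print(rbf_kernel(x1, x2, gamma=0.01))   # close to 1 for nearby points, tends to 0 for distant ones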

Experimental results:

The accuracies achieved with the default parameters are as follows:


On the training set and test set, scores of 0.999921 and 0.999527 were obtained respectively;

  3. Performance of SVM under different kernels:

In summary we found that:


The SVM model performed best on the test set when the kernel function was the Gaussian kernel (0.999921). With the linear kernel, the accuracy on the training set reached 1 but the performance on the test set was comparatively worse, so there may be some overfitting with the linear kernel. Finally, by adjusting the penalty coefficient C, the SVM is encouraged to maximize the margin rather than to separate the training data perfectly.

Parameter adjustment:

(1) Grid search

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# data and label are loaded from the training CSV, as in svm_train above
model = svm.SVC()
params = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
    {'kernel': ['poly'],   'C': [1, 10], 'degree': [2, 3]},
    {'kernel': ['rbf'],    'C': [1, 10, 100, 1000],
     'gamma': [1, 0.1, 0.01, 0.001]}]
model = GridSearchCV(estimator=model, param_grid=params, cv=5, n_jobs=8)
model.fit(data, label)

print("Best parameters of the model:", model.best_params_)
print("Best model score:", model.best_score_)
print("Best model estimator:", model.best_estimator_)

Even with 8 parallel worker processes (n_jobs=8), the search runs very slowly;
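
Once the search finishes, more than just the single best configuration can be inspected through GridSearchCV's cv_results_ attribute; a small sketch (added by me, assuming model is the fitted GridSearchCV object above) that tabulates the top configurations:

import pandas

results = pandas.DataFrame(model.cv_results_)            # one row per parameter combination
cols = ['param_kernel', 'param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head(10))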


The best configuration found by the grid search was a polynomial kernel with degree 2.

Use the searched parameters to test:

def svm_train(path, kernel):
    classifiers = []
    trainset = pandas.read_csv(path)
    label = trainset.values[:, 0]
    data = trainset.values[:, 1:]
    for k in kernel:
        # clf = svm.SVC(kernel=k).fit(data, label)
        clf = svm.SVC(kernel='poly', degree=2).fit(data, label)
        train_score = clf.score(data, label)
        print("The train accuracy under kernel %s is %f: " % (clf.kernel, train_score))
        classifiers.append(clf)
    return classifiers

Measuring the accuracy again with these parameters, it turns out, somewhat unexpectedly, that the Gaussian kernel's accuracy on the test set is lower than that of the grid-searched polynomial kernel.

3) The relationship between the linear classification model using hinge loss and the SVM model

The formula for hinge loss is $L(z) = \max(0, 1-z)$, where $z = y\hat{y}$ and $\hat{y} = \mathbf{xw}$ is the classification score; that is, if a sample is classified correctly with margin at least 1, its hinge loss is 0, and otherwise it is $1 - y\hat{y}$.

Taking the gradient of the hinge loss with respect to $\mathbf{w}$:

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \begin{cases} -y\,\mathbf{x}^T, & y\,\mathbf{x}\mathbf{w} < 1 \\ \mathbf{0}, & \text{otherwise} \end{cases}$$

The plot of the hinge loss shows that it is zero for $z \geq 1$ and increases linearly as $z$ decreases below 1.

To train a linear classifier using Hinge Loss, the parameters used are:

num_epochs = 360
learning_rate = 0.00001

It should be noted that the labels in the original dataset are 0 and 1, while hinge loss expects labels of -1 and 1, so the labels must be converted to -1 and 1 before computing the hinge loss.
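
The report's manual implementation is not reproduced in the text; the following is a minimal sketch of such a training loop (function and variable names are mine), using the gradient derived above together with the stated label conversion and hyperparameters:

import numpy as np
import pandas

def train_hinge(path, num_epochs=360, learning_rate=0.00001):
    trainset = pandas.read_csv(path)
    y = trainset.values[:, 0].astype(float)
    X = trainset.values[:, 1:].astype(float)
    y = np.where(y == 0, -1.0, 1.0)           # convert labels from {0, 1} to {-1, 1}
    W = np.random.normal(0, 0.01, X.shape[1])
    B = 0.0
    for epoch in range(num_epochs):
        margin = y * (X @ W + B)
        mask = margin < 1                     # only samples violating the margin contribute a gradient
        grad_W = -(y[mask, None] * X[mask]).sum(axis=0) / len(y)
        grad_B = -y[mask].sum() / len(y)
        W -= learning_rate * grad_W
        B -= learning_rate * grad_B
    accuracy = np.mean(((X @ W + B) > 0) == (y > 0))
    return W, B, accuracy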


The accuracy of the linear classifier using Hinge Loss is about 0.996217

  • The relationship between them:

In the SVM, the slack variable used by the soft-margin SVM is exactly the hinge loss; the objective can be written as

$$\min_{\mathbf{w},b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{n=1}^{N}\xi_n,$$

where $\xi_n = \max(0,\, 1 - y_n\hat{y}_n) = L(z)$.

The slack variable indicates how far a sample falls on the wrong side of the margin, allowing the SVM to make mistakes on some samples. Here, for the linear classifier, we take the gradient of the hinge loss and update the parameters step by step with gradient descent. In my view, in the SVM, the hyperplane is determined only by the support vectors (the few points most relevant to the classification, so the view is local), and most of the remaining samples need not be kept; the linear classification model trained with hinge loss is the opposite: it considers all training samples (a global view).
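
Written as a single unconstrained objective, the connection becomes explicit: the soft-margin SVM problem above is equivalent to minimizing an L2-regularized hinge loss (a standard reformulation, stated here for reference):

$$\min_{\mathbf{w},b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{n=1}^{N}\xi_n \ \ \text{s.t.}\ \ y_n\hat{y}_n \ge 1-\xi_n,\ \xi_n\ge 0 \quad\Longleftrightarrow\quad \min_{\mathbf{w},b}\ \sum_{n=1}^{N}\max\bigl(0,\,1-y_n\hat{y}_n\bigr) + \frac{1}{2C}\lVert\mathbf{w}\rVert^2$$

so the hinge-loss linear classifier differs from the soft-margin SVM essentially only in the explicit regularization term and in how the optimum is computed (gradient descent versus quadratic programming).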

4) Comparison between hinge loss linear classification model and cross-entropy loss linear classification model

The cross-entropy loss is: $L(\mathbf{w}) = -y\log(\sigma(\mathbf{xw})) - (1-y)\log(1-\sigma(\mathbf{xw}))$

Its gradient with respect to $\mathbf{w}$ is:

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = (\sigma(\mathbf{xw}) - y)\,\mathbf{x}^T$$

Now a linear classification model is used:

$$f(\mathbf{x}) = \sigma(\mathbf{xw} + b)$$

To train linear classifiers using Hinge loss and Cross Entropy Loss respectively, the parameters used are:

num_epochs = 360
learning_rate = 0.00001

The learning rate is set to 0.00001 here because, with the cross-entropy loss, a learning rate that is too large easily causes numerical overflow when evaluating the loss (the parameters change too quickly and the sigmoid saturates), so the learning rate is lowered to slow down the updates.
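
One common way to mitigate the overflow just described (my suggestion, not something done in the original report) is to clip the sigmoid output before taking the logarithm:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)              # keep log() away from 0 when the sigmoid saturates
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))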

For W and B, the following method is used to initialize:

W = np.random.normal(0, 0.01 ** 2, (in_feature,))  # Gaussian initialization: 1-D array with in_feature entries
                                                    # (note: the second argument of np.random.normal is the standard deviation)
B = 0

Here W is initialized with a commonly used scheme: small random values drawn from a zero-mean Gaussian (scale 0.01² in the code above).
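
A minimal sketch of the corresponding training loop (again, the names are illustrative and this is not the report's exact code; it reuses the sigmoid helper from the sketch above and the gradient given earlier, with X and y loaded from the training CSV):

def train_cross_entropy(X, y, num_epochs=360, learning_rate=0.00001):
    # y keeps its original {0, 1} labels for the cross-entropy loss
    W = np.random.normal(0, 0.01 ** 2, X.shape[1])
    B = 0.0
    for epoch in range(num_epochs):
        p = sigmoid(X @ W + B)                # predicted probability of class 1
        grad_W = X.T @ (p - y) / len(y)       # gradient of the mean cross-entropy loss
        grad_B = np.mean(p - y)
        W -= learning_rate * grad_W
        B -= learning_rate * grad_B
    accuracy = np.mean((sigmoid(X @ W + B) > 0.5) == (y == 1))
    return W, B, accuracy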


The accuracy of the linear classifier using Hinge Loss is about 0.995745

The accuracy of the linear classifier using Cross Entropy Loss is about 0.999527

  • Experimental analysis:
    • Under the same parameters, we found that the accuracy of the linear classifier using Cross Entropy Loss is relatively higher;
    • Hinge loss pays relatively more attention to the misclassified samples: correctly classified samples (with margin at least 1) incur zero loss, while the loss of misclassified samples grows linearly. The cross-entropy loss, as its curve shows, is a convex function with a single optimum and no local optima, which makes it easy to optimize; a short numerical comparison of the two losses is given below.
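
A quick numerical comparison of the two losses at a few margin values z = y·ŷ (purely illustrative; for labels in {-1, 1} the cross-entropy loss can be written in terms of the margin as log(1 + e^{−z})):

import numpy as np

for z in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    hinge = max(0.0, 1.0 - z)
    logistic = np.log(1.0 + np.exp(-z))       # cross-entropy expressed in terms of the margin
    print("z=%5.1f  hinge=%.3f  cross-entropy=%.3f" % (z, hinge, logistic))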

4. Experiment summary

In this experiment, I reviewed the basic theory of SVM, compared and analyzed the performance of SVM with different kernel functions (linear and Gaussian), and then introduced linear classification models trained with hinge loss and cross-entropy loss and compared the performance of the two losses.

Overall, from my experimental results on this dataset, the best classification performance on the test set was obtained by the SVM with the Gaussian kernel, reaching a score of 0.999921; between the two linear classifiers, the model trained with cross-entropy loss performed best, reaching a score of 0.999527. With a larger learning rate, gradient overflow occurs easily; reducing the learning rate effectively solves this problem. For the SVM with the linear kernel, I found overfitting on the training set, and after adjusting the penalty coefficient C its performance on the test set improved.

This experiment gave me an opportunity to practice and apply what I have learned. My understanding of SVM has deepened, including the differences between SVM and other linear classifiers, the choice of loss functions, and the selection and tuning of hyperparameters.


Origin blog.csdn.net/m0_52387305/article/details/127342415