Support vector machine (SVM) case analysis

A support vector machine (SVM) is a binary classification model. A binary classification model describes how a set of features (independent variables X) relates to a label (dependent variable Y) that takes exactly two values. For example, suppose there are five features: height, age, education, income, and years of schooling. The dependent variable is 'whether the person smokes', and it takes only two values, smoking and non-smoking. Studying how these five features separate the two smoking categories is a binary classification problem. In practice, however, the label (dependent variable Y) often has many categories. For instance, a label Y for 'cuisine preference' might cover the eight major Chinese cuisines: Sichuan, Shandong, Cantonese, Fujian, Jiangsu, Zhejiang, Hunan, and Anhui, eight categories in total. In that case a 'multi-classification decision function' transformation is needed: put simply, separate binary SVMs are built for pairs of categories (choosing 2 out of the 8) and their results are then combined.

Common machine learning algorithms include decision trees, random forests, and Bayesian models, all of which are fairly interpretable: a decision tree repeatedly splits features at chosen split points to assign categories, a random forest is an ensemble of decision trees, and a Bayesian model is computed from the principles of Bayesian probability. The support vector machine is different: it uses constrained optimization to find an optimal solution, and that solution is a plane in feature space. Combined with the feature values, this plane completely separates the two categories 'smoking' and 'non-smoking'. Finding this separating plane is the core of the support vector machine algorithm.

The mathematics behind support vector machines is involved, but the intuition is not. It is enough to know that the algorithm solves for a 'separating plane' that divides the categories of the label (dependent variable Y). As with other machine learning algorithms, building an SVM generally requires scaling the data, setting the ratio of training to test data, and tuning the relevant parameters, so that the model performs well on both the training data and the test data.


Support vector machine model case

1  Background

This section uses the iris classification dataset for the case demonstration. It contains 150 samples with 4 feature attributes (4 independent variables X); the label (dependent variable Y) is the iris species, with three categories in total: bristle iris (setosa), color-changing iris (versicolor), and Virginia iris (virginica), referred to below as categories A, B, and C.
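The dataset described above can be inspected with sklearn's bundled copy of the iris data (a small sketch; the variable names are illustrative):

```python
# Load the iris dataset: 150 samples, 4 features, 3 classes.
# Labels 0/1/2 correspond to categories A, B and C in the text.
from collections import Counter
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)      # (150, 4)
print(Counter(y))   # 50 samples in each of the 3 classes
```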

2  Theory

The principle of the support vector machine model can be seen in the figure below.

For example, suppose red points mean "smoking" and yellow points mean "non-smoking". How do we find a plane that best separates the two groups? As the figure above shows, there are many possible divisions: both the left-hand and the right-hand divisions separate the points. But the right-hand division clearly keeps the groups 'further apart'. Finding the plane that separates the label categories most clearly is exactly what the support vector machine algorithm does. When dividing the points, the points closest to the plane should end up as far from it as possible. On the right, points A and B are the closest to the plane, and the algorithm tries to push such points as far from the plane as it can; this is what 'separating better' means. On the left, the two points closest to the plane are too close to it, so the division on the right is better.

In theory it is often possible to find a 'separating plane' that splits the points perfectly, but such a plane is rarely useful: it is only perfect mathematically and does not help with real business data. Moreover, forcing the points apart as much as possible easily leads to 'overfitting', where the model fits the training data perfectly but performs poorly on the test data. Such behavior can be penalized by setting the 'error term penalty coefficient'. In addition, constructing the separating plane may require a nonlinear function; in the SVM model this is called the 'kernel function'. It maps the features from a low-dimensional space (such as a two-dimensional XY plane) into a higher-dimensional space, and it has parameters of its own that can be set to find a better model.
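As a rough illustration of the penalty coefficient just described, the sketch below (an illustrative example, not SPSSAU's own code) fits an RBF-kernel SVM on the iris data with several values of C and prints training and test accuracy; a large gap between the two is the overfitting signal discussed above:

```python
# Sketch: how the penalty coefficient C affects train vs. test fit.
# Larger C punishes training errors harder and can overfit.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)
    print(f"C={C}: train={clf.score(X_tr, y_tr):.3f}, test={clf.score(X_te, y_te):.3f}")
```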

Combined with the principle of support vector machine, it involves the following parameters, as follows:

Among the parameters above, the error term penalty coefficient is a penalty value: the larger it is, the better the model tends to fit the training data, but the more easily it 'overfits'. If overfitting appears during tuning, it is recommended to lower this value. The SPSSAU default is 1, which is already small. The kernel function is the 'assistant' that maps low dimensions to high dimensions in the SVM algorithm; the recommended settings are as follows:

The kernel function coefficient (also called the gamma value) usually matters relatively little, and the default value can normally be used;

Highest power of the kernel function: if a polynomial kernel is used, a larger highest power (degree) can improve the model's fit on the training data, but it is more likely to cause 'overfitting'. It is recommended to try 2, 3, and 4 and compare (the default value is 3);

Multi-classification decision function: a basic SVM handles only binary classification. If the label (dependent variable Y) has multiple categories, for example 8 categories for the 8 major cuisines, the algorithm offers two methods. The first builds one SVM per category against all the remaining categories combined (as the negative class) and then integrates them, giving 8 SVMs in total; this is the ovr (one-vs-rest) method. The second pairs the categories two at a time, the ovo (one-vs-one) method: 8 categories form 8*(8-1)/2 = 28 pairs, so 28 SVMs are fitted and then integrated. The default is the ovr method.

Finally, the model convergence tolerance and the maximum number of iterations control the algorithm's internal search for the optimal solution; under normal circumstances they do not need to be changed.
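The model counts quoted above for the two multi-classification methods can be checked directly (a stdlib-only sketch; `comb` counts the ovo pairings):

```python
# With k classes: one-vs-rest (ovr) fits k binary SVMs,
# one-vs-one (ovo) fits k*(k-1)/2 pairwise SVMs.
from math import comb

k = 8  # e.g. the 8 regional cuisines
print("ovr models:", k)           # 8
print("ovo models:", comb(k, 2))  # 28
```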

3  Operations

This example operates as follows:

The default training set proportion is 0.8, i.e. 80% (150*0.8 = 120 samples) for training the support vector machine model, with the remaining 20% (30 samples) used as test data for model validation. Note that SVM involves distance calculations, so the features need to be scaled. The usual method is z-score standardization, whose purpose is to put the features on a common scale. Other scaling methods, such as interval scaling or min-max normalization, can also be used.
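The split and standardization steps described above might look as follows in sklearn (a sketch; SPSSAU performs the equivalent steps internally):

```python
# 80/20 split of the 150 iris samples, then z-score standardization.
# The scaler is fitted on the training data only, to avoid leaking
# test-set information into the preprocessing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)
print(len(X_train), len(X_test))  # 120 30

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0).round(3))  # roughly all zeros after scaling
```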

Then set the parameters as follows:

The error term penalty coefficient is set to 1. It can be raised to make the training data fit better, but the test set performance must then be watched, or 'overfitting' will occur. The data in this case has only 4 features; using ovr (the one-vs-rest method) for the multi-classification decision function reduces the number of models and speeds up the run. The model convergence tolerance and maximum number of iterations were kept at their default values.

4  SPSSAU output results

SPSSAU outputs five results in total: a basic information summary, model evaluation results for the training and test sets, the test set confusion matrix, a model summary table, and the model code, explained below:

In the table above, the basic information summary shows the class distribution of the dependent variable Y (the label). The model evaluation results (for both training and test sets) are used to judge the model's fit, especially on the test set, and the confusion matrix for the test data is also provided. The model summary table collects the various parameter values, and the core code used to build the SVM model is appended at the end.

5  Interpretation of results

Next, the most important item, the model's fit, is explained, as shown in the table below:

The table above provides four evaluation metrics for the training set and the test set respectively: precision, recall, f1-score, and accuracy, along with averaged metrics and sample counts, explained in the table below:

For a detailed explanation of these metrics, see the decision tree model help manual. The f1-score is usually sufficient for evaluation: it is 0.96 on the training data and remains high at 0.94 on the test data. The two values are close, which suggests there is no 'overfitting' and the model is good.
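These metrics can be reproduced in outline with sklearn (a sketch only; the 0.96/0.94 figures come from SPSSAU's own run, and this example's numbers will not match them exactly):

```python
# Fit an RBF-kernel SVM on standardized iris data and print the
# per-class precision / recall / f1-score plus overall accuracy.
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = SVC(C=1.0, kernel="rbf", gamma="scale").fit(scaler.transform(X_tr), y_tr)
y_pred = model.predict(scaler.transform(X_te))
print(classification_report(y_te, y_pred))
```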

Next, look at the 'confusion matrix' of the test data, which cross-tabulates the model's predictions against the actual values, as shown below:
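A confusion matrix for a comparable sklearn model can be computed as below (a sketch, not SPSSAU's output; the specific error counts will differ from the 2 misclassifications reported in the text):

```python
# Confusion matrix on the iris test split: rows are true classes,
# columns are predicted classes; off-diagonal entries are errors.
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
cm = confusion_matrix(y_te, y_pred)
print(cm)
print("misclassified:", cm.sum() - cm.trace())
```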

In the confusion matrix, larger values on the diagonal are better, since they mean the predicted value matches the true value. In the figure above, only 2 samples of category B are misclassified as category C and all the rest are correct, which means the support vector machine model performed well on the test data. Finally, SPSSAU outputs the model parameter values, as shown in the table below:

The parameter information above is only a summary restating the settings and serves no other purpose. Finally, SPSSAU outputs the core code used to build this support vector machine model with Python's sklearn package:

from sklearn import svm

model = svm.SVC(C=1.0, kernel='rbf', gamma='scale', tol=0.001, max_iter=2000, decision_function_shape='ovr')

model.fit(x_train, y_train)

6  Q&A

The following key points are involved, as follows:

  • Does the support vector machine model need standardization?
    Standardization is generally recommended, because the SVM model involves distance calculations and the features need to be put on a common scale. Z-score standardization is usually sufficient.
  • Saving predicted values
    When saving predicted values, SPSSAU generates a new column to store the category predicted by the model. Its coding matches the coding of the label (dependent variable Y) used in the model.
  • When building a support vector machine model in SPSSAU, how should categorical
    independent variables (X) be handled?
    They should be converted to dummy variables before being entered into the model. Click to view dummy variable handling.
    http://spssau.com/front/spssau/helps/otherdocuments/dummy.html
  • What is SPSSAU's criterion for judging whether a model is acceptable?
    In machine learning, the model is first trained on the training data and then checked on the test data. The usual criterion is that the model fits well on both the training data and the test data. Machine learning models easily 'overfit', producing deceptively good results, so the fit on the test data deserves particular attention. For a single model, the parameters can be varied and tuned; in addition, several model types, such as decision trees, random forests, and neural networks, can be compared to select the best one.
  • Where can I find more reference material on support vector machines (SVM)?
    More information about SVM is available in the official sklearn manual; click to view.
    https://scikit-learn.org/stable/modules/svm.html#svm-classification
  • Why does SPSSAU report abnormal data quality when running an SVM?
    Currently, SVM supports classification tasks, so the label (dependent variable Y) must be categorical data. If it is continuous quantitative data, or the sample size is too small (non-members can only analyze the first 100 samples), the model may fail to run and report abnormal data quality.
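For the dummy-variable question above, a minimal one-hot encoding sketch with sklearn's OneHotEncoder (the 'education' levels here are made-up illustrative values):

```python
# Convert a categorical feature into 0/1 dummy columns before
# feeding it to an SVM; one column is created per category level.
from sklearn.preprocessing import OneHotEncoder

X = [["primary"], ["college"], ["college"], ["graduate"]]
enc = OneHotEncoder().fit(X)
print(enc.categories_)          # the levels found in the data
print(enc.transform(X).toarray())  # one 0/1 column per level
```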


Origin blog.csdn.net/m0_37228052/article/details/132986762