Hyperparameter Tuning C and Gamma Parameters for Support Vector Machines (SVM)

Author: CSDN @ _Yakult_

Support Vector Machine (SVM) is a widely used supervised machine learning algorithm. It is mainly used for classification tasks, but it is also suitable for regression. In this article, we will dive into two important parameters of support vector machines: C and gamma. This article assumes you already have a basic understanding of the algorithm and focuses on these two parameters.

Most machine learning and deep learning algorithms have parameters that can be tuned, called hyperparameters. We need to set the hyperparameters before training the model. Hyperparameters are very important in building robust and accurate models. They help us find a balance between bias and variance, and thus prevent the model from overfitting or underfitting. To be able to tune hyperparameters, we need to understand what they mean and how they change the model; randomly trying a bunch of hyperparameter values would be a tedious and never-ending task.

Having emphasized the importance of hyperparameters, let's start discussing C and gamma. SVM creates a decision boundary that distinguishes between two or more classes. How to draw or determine the decision boundary is the most critical part of the SVM algorithm. Drawing the decision boundary is an easy task when the data points of different classes are linearly separable.

Figure 1 Linearly separable data points

However, real data are usually noisy and not linearly separable. A standard SVM tries to separate all positive and negative examples (i.e. two different classes) and does not allow any misclassified points. This leads to an overfit model, or in some cases, the inability to find the decision boundary using a standard SVM.

Consider the following data points in 2D space:

Figure 2 Standard SVM

Standard SVM attempts to separate the blue and red classes using the black curved line as a decision boundary. However, this is an overly specific classification that is likely to lead to overfitting. An overfit SVM can achieve high accuracy on the training set, but perform poorly on new, previously unseen examples. This model is very sensitive to noise, and even a small change in the value of a data point may change the classification result. An SVM with this black line as a decision boundary does not generalize well on this dataset.

To solve this problem, in 1995, Cortes and Vapnik introduced the concept of "soft margin" support vector machines, allowing some examples to be misclassified or on the wrong side of the decision boundary. Soft-margin SVMs generally result in better generalized models. In our example, the decision boundary for a soft-margin SVM might be the black line below:

Figure 3 Soft margin SVM

There are some misclassified points, but we get a more generalized model. In determining the decision boundary, soft-margin SVM attempts to solve the following optimization problem:

  • Maximize the distance (margin) between the decision boundary and the classes (i.e., the support vectors)
  • Maximize the number of points correctly classified in the training set

Clearly, there is a trade-off between these two goals. In order to correctly label all data points in the training set, the decision boundary may need to be very close to a particular class. However, in this case, the accuracy on the test dataset may be lower because the decision boundary is too sensitive to noise and small changes in independent variables. On the other hand, the decision boundary may be as far away from each class as possible, at the cost of some misclassified outliers. This tradeoff is controlled by the C parameter.
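For reference, this trade-off is usually written as the following optimization problem. This is the standard soft-margin formulation from the SVM literature; the article itself does not spell it out:

```latex
% Standard soft-margin SVM objective: C weighs the total margin violation
% against the margin width 2 / ||w||.
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i\!\left(w^{\top} x_i + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0
```

Here each slack variable ξ_i measures how far example i falls on the wrong side of the margin. A large C makes violations expensive (narrower margin, fewer training errors); a small C tolerates them (wider margin).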

The C parameter adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low, so a decision boundary with a larger margin is chosen at the cost of more misclassifications. If C is large, the SVM tries to minimize the number of misclassified examples because of the higher penalty, which results in a decision boundary with a smaller margin. The penalty is not the same for every misclassified example; it is proportional to how far the point lies on the wrong side of the boundary.
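As a minimal sketch of this behavior (using scikit-learn and a toy dataset of my own choosing, not anything from the article), the snippet below fits a linear SVC with several C values and reports the margin width and the number of support vectors:

```python
# Sketch: how C trades margin width against training errors (assumed toy setup).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs, so some misclassification is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin width
    print(f"C={C:>6}: margin={margin:.2f}, "
          f"support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.3f}")
```

Typically the small-C model shows a wider margin and more support vectors, while the large-C model narrows the margin to chase training accuracy.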

Before introducing the gamma parameter, we need to discuss the kernel trick. In some cases, data points that are not linearly separable are transformed with a kernel function so that they become linearly separable. The kernel function is a similarity measure: its input is the original features and its output is a similarity measure in the new feature space, where similarity means degree of proximity. Actually mapping the data points into a high-dimensional feature space would be an expensive operation, and the algorithm never does this explicitly. Kernelized SVMs compute the decision boundary from similarity measures in the high-dimensional feature space without performing the actual transformation. I think that is one of the reasons why it is called the kernel trick.
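To make the "similarity measure" idea concrete, here is a small sketch (my own illustration, not from the article) that computes the RBF kernel value between two points by hand and compares it with scikit-learn's rbf_kernel:

```python
# Sketch: the RBF kernel as a similarity measure, k(x, y) = exp(-gamma * ||x - y||^2).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
y = np.array([[1.0, 1.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x - y) ** 2))   # by hand
library = rbf_kernel(x, y, gamma=gamma)[0, 0]    # scikit-learn
print(manual, library)   # both ~0.3679 here
```

The value is 1 for identical points and decays toward 0 as the points move apart, which is exactly the "degree of proximity" described above.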

One of the most commonly used kernel functions is the radial basis function (RBF). The gamma parameter of the RBF kernel controls the influence distance of a single training point. A lower gamma value means a larger similarity radius, which results in more points being grouped together. For higher gamma values, points need to be very close to each other to be considered in the same group (or class). Therefore, models with very large gamma values tend to overfit. The following visualizations better explain the concept:

Figure 4 Decision regions with a low gamma value

Figure 5 Decision regions with a high gamma value

The first plot represents the case of a low gamma value. The similarity radius is large, so all points in a colored region are considered to belong to the same class. For example, a point located in the bottom-right corner is classified into the "green" category. The second plot, on the other hand, shows the case of a larger gamma value: for data points to be grouped into the same class, they must lie within a tightly bounded region. Therefore, even a small amount of noise can cause a data point to fall outside its class. Larger gamma values are likely to lead to overfitting.

As gamma decreases, the regions separating different classes become more generalized. Very large values of gamma lead to overly specific class regions (overfitting).
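A quick way to see this overfitting effect in practice (again a sketch on an arbitrary toy dataset, not the data behind the figures above) is to compare training and test accuracy across gamma values:

```python
# Sketch: large gamma memorizes the training set, small gamma generalizes more smoothly.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train={clf.score(X_tr, y_tr):.3f}, "
          f"test={clf.score(X_te, y_te):.3f}")
```

A growing gap between training and test accuracy as gamma increases is the overfitting signature described above.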

Gamma and C parameters

For a linear kernel, we only need to optimize the C parameter. However, if we want to use the RBF kernel, both the C and gamma parameters must be optimized simultaneously. If gamma is large, the effect of C becomes negligible. If gamma is small, C affects the model just as it does a linear model. Typical value ranges for C and gamma are listed below, although the exact optimal values vary from application to application; a grid-search sketch over these ranges follows the list:

  • 0.0001 < gamma < 10

  • 0.1 < C < 100
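Here is the grid-search sketch mentioned above. It searches C and gamma over roughly the ranges just listed using cross-validation; the dataset and grid spacing are my own illustrative choices:

```python
# Sketch: cross-validated grid search over C and gamma (assumed toy dataset).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

param_grid = {
    "C": np.logspace(-1, 2, 4),       # 0.1, 1, 10, 100
    "gamma": np.logspace(-4, 1, 6),   # 0.0001, 0.001, ..., 10
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```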

For SVMs, it is very important to remember that the input data needs to be normalized so that all features are on the same scale.
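A common way to do this (a sketch using scikit-learn's Pipeline, with my own choice of dataset) is to put the scaler and the SVM into one pipeline, so the scaling fitted on the training data is applied consistently during cross-validation and prediction:

```python
# Sketch: scaling features before the RBF SVM, wrapped in a single pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(model, X, y, cv=5).mean())
```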

Thanks for reading. Please feel free to let me know if you have any feedback.

Disclaimer:
As an author, I attach great importance to my own works and intellectual property rights. I hereby declare that all my original articles are protected by copyright law, and no one may publish them publicly without my authorization.
My articles have been paid for publication on some well-known platforms. I hope readers can respect intellectual property rights and refrain from infringement. Any free or paid (including commercial) publishing of paid articles on the Internet without my authorization will be regarded as a violation of my copyright, and I reserve the right to pursue legal responsibility.
Thank you readers for your attention and support to my article!

Origin blog.csdn.net/qq_35591253/article/details/131749790