Andrew Ng's "Machine Learning" - SVM Support Vector Machine

The data set and source files can be obtained from the GitHub project:
https://github.com/Raymond-Yang-2001/AndrewNg-Machine-Learing-Homework

1. Linear SVM

1.1 Starting from logistic regression

When classifying with logistic regression, we have $h_{\theta}(x)=\sigma(\theta^{\top}x)$, where $\sigma$ is the sigmoid function. Logistic regression predicts the positive class when $\theta^{\top}x\ge 0$ and the negative class when $\theta^{\top}x<0$. Its loss function is as follows:
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)-(1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right)\right]$$
The plots of the two terms $-\log{h_{\theta}(x)}$ and $-\log{(1-h_{\theta}(x))}$ are as follows:
(Figure: the two logistic cost curves.)
Modify the loss function so that when $y=1$, we require $\theta^{\top}x\ge 1$ rather than just $\theta^{\top}x\ge 0$; and when $y=0$, we require $\theta^{\top}x\le -1$ rather than just $\theta^{\top}x<0$.

Now consider the cost function of the SVM:
$$J(\theta)=C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_{1}(\theta^{\top}x^{(i)})+(1-y^{(i)})\mathrm{cost}_{0}(\theta^{\top}x^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}$$
Here $C$ plays the role of the regularization parameter; a larger $C$ corresponds to weaker regularization (it behaves like $1/\lambda$ in regularized logistic regression).
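The terms $\mathrm{cost}_{1}$ and $\mathrm{cost}_{0}$ are piecewise-linear (hinge-style) replacements for the two logistic cost curves, flat at zero once the margin condition is met. A minimal sketch under that assumption, with illustrative function names:

import numpy as np

def cost1(z):
    # loss for y = 1: zero once z = theta^T x >= 1, grows linearly otherwise
    return np.maximum(0, 1 - z)

def cost0(z):
    # loss for y = 0: zero once z = theta^T x <= -1, grows linearly otherwise
    return np.maximum(0, 1 + z)

def svm_cost(theta, X, y, C):
    # X: (m, n) samples (first column all ones), y: (m,) labels in {0, 1}
    z = X @ theta
    hinge = y * cost1(z) + (1 - y) * cost0(z)
    # theta_0 (the bias) is not regularized, matching the sum from j = 1 to n
    return C * np.sum(hinge) + 0.5 * np.sum(theta[1:] ** 2)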

In a linear SVM, unlike logistic regression, which outputs a classification probability, the hypothesis is defined as:
$$h_{\theta}(x)=\begin{cases}1, & \theta^{\top}x\ge 0\\ 0, & \text{otherwise}\end{cases}$$
In other words, the SVM classifier directly outputs the classification result.

1.2 Large margin classification and SVM

As mentioned earlier, the SVM cost function pushes $\theta^{\top}x\ge 1$ when $y=1$ (rather than merely $\theta^{\top}x\ge 0$), and $\theta^{\top}x\le -1$ when $y=0$ (rather than merely $\theta^{\top}x<0$). Using 0 as the threshold would already separate the two classes; the SVM further "widens" this decision threshold from the single point 0 to the interval $(-1,1)$. For this reason, the SVM is called a large margin classifier.

Consider the loss function of the linear SVM. Suppose we have found a $\theta$ that satisfies the above conditions; then the first term of the loss function is 0, and the optimization objective simplifies to:
$$\min_{\theta}\ \frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}=\frac{1}{2}||\theta||^{2}\qquad \mathrm{s.t.}\ \begin{cases}\theta^{\top}x^{(i)}\ge 1, & y^{(i)}=1\\ \theta^{\top}x^{(i)}\le -1, & y^{(i)}=0\end{cases}$$

From linear algebra we know that $\theta^{\top}x^{(i)}=\rho^{(i)}||\theta||$, where $\rho^{(i)}$ is the projection length of $x^{(i)}$ onto the direction of $\theta$.
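This identity is easy to check numerically; a tiny illustration with made-up vectors:

import numpy as np

theta = np.array([2.0, 1.0])
x_i = np.array([1.0, 3.0])

# projection length of x_i onto the direction of theta
rho_i = x_i @ theta / np.linalg.norm(theta)

print(theta @ x_i)                    # 5.0
print(rho_i * np.linalg.norm(theta))  # also 5.0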
Suppose one candidate decision boundary is as shown below: the blue line is the direction of $\theta$, and the green line perpendicular to it is the decision boundary.
(Figure: a decision boundary whose $\theta$ direction gives small projections $\rho^{(i)}$.)
In this case, the projections $\rho^{(i)}$ are relatively small, so to satisfy $\rho^{(i)}||\theta||\ge 1$ or $\rho^{(i)}||\theta||\le -1$, the norm $||\theta||$ must become very large. This clearly increases the value of the loss function, which runs against the optimization objective.

Consider another decision boundary:
(Figure: a decision boundary whose $\theta$ direction gives large projections $\rho^{(i)}$.)
In this case, the projections $\rho^{(i)}$ become larger, so $||\theta||$ can correspondingly become smaller. By making the margin larger, that is, by making these $\rho^{(i)}$ values large, the support vector machine can end up with a parameter vector of smaller norm. This is exactly what minimizing the objective function achieves, and it is why the support vector machine ends up being a large margin classifier: it tries to maximize these $\rho^{(i)}$, which are the distances from the training samples to the decision boundary.
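To make the geometric picture explicit (a standard derivation, added here for completeness; it takes the decision boundary to pass through the origin, i.e. the bias term is ignored): the distance from a sample to the boundary $\theta^{\top}x=0$ is
$$\mathrm{dist}(x^{(i)})=\frac{|\theta^{\top}x^{(i)}|}{||\theta||}=|\rho^{(i)}|,\qquad |\theta^{\top}x^{(i)}|\ge 1\;\Rightarrow\;\mathrm{dist}(x^{(i)})\ge\frac{1}{||\theta||}$$
so every sample is forced to lie at least $1/||\theta||$ away from the boundary, and minimizing $||\theta||^{2}$ makes this guaranteed margin as large as possible.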

1.3 Adjusting the regularization parameter

The data set used is visualized as follows:
(Figure: scatter plot of the linearly separable data set, with one outlier in the upper-left corner.)
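The snippets below assume that x, y, positive_data and negative_data have already been prepared. A minimal sketch of one way to do this (the file name ex6data1.mat and the keys 'X' and 'y' follow the original Coursera exercise; the path is an assumption):

import numpy as np
from scipy.io import loadmat

data = loadmat('ex6data1.mat')           # assumed path to the exercise data
x, y = data['X'], data['y']              # x: (m, 2) features, y: (m, 1) labels in {0, 1}
positive_data = x[y.ravel() == 1]
negative_data = x[y.ravel() == 0]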
Using regularization parameter C=1:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

svc = svm.LinearSVC(C=1, max_iter=1000)
svc.fit(x, y.ravel())
# collect the learned parameters as theta = [theta_0 (intercept), theta_1, theta_2]
theta1 = [svc.intercept_[0], svc.coef_[0, 0], svc.coef_[0, 1]]

x_ax = np.arange(0, 4, 0.1)
xx = np.array([1.5, 2.5])
# decision boundary: theta_0 + theta_1*x1 + theta_2*x2 = 0  =>  x2 = -theta_0/theta_2 - (theta_1/theta_2)*x1
y_ax = -theta1[0] / theta1[2] + (-theta1[1] / theta1[2]) * x_ax
print(theta1[0], -theta1[0] / theta1[2], -theta1[1] / theta1[2])
# a short segment along the direction of the weight vector (theta_1, theta_2)
yy = (theta1[2] / theta1[1]) * xx
plt.figure(figsize=(10, 8))
plt.scatter(x=positive_data[:, 0], y=positive_data[:, 1], s=10, color="red", label="positive")
plt.scatter(x=negative_data[:, 0], y=negative_data[:, 1], s=10, label="negative")
plt.plot(x_ax, y_ax, label="Decision Boundary")
plt.plot(xx, yy, label="Direction of Theta Vector")
plt.axis('equal')
plt.legend(loc='best', framealpha=0.5)
plt.show()

(Figure: decision boundary learned with C=1.)

It can be seen that the SVM has learned a good classifier and is not affected by the outlier in the upper-left corner.

Using regularization parameter C=100:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

svc2 = svm.LinearSVC(C=100, max_iter=100000)
svc2.fit(x, y.ravel())
theta2 = [svc2.intercept_[0], svc2.coef_[0, 0], svc2.coef_[0, 1]]
x_ax = np.arange(0, 4, 0.1)

# decision boundary: x2 = -theta_0/theta_2 - (theta_1/theta_2)*x1
y_ax = -theta2[0] / theta2[2] + (-theta2[1] / theta2[2]) * x_ax

plt.figure(figsize=(10, 8))
plt.scatter(x=positive_data[:, 0], y=positive_data[:, 1], s=10, color="red", label="positive")
plt.scatter(x=negative_data[:, 0], y=negative_data[:, 1], s=10, label="negative")
plt.plot(x_ax, y_ax, label="Decision Boundary")

# a short segment along the direction of the weight vector, shifted for display
xx = np.array([1, 1.5])
yy = (theta2[2] / theta2[1]) * xx
plt.plot(xx + 1, yy, label="Direction of Theta Vector")

plt.axis('equal')
plt.legend(loc=0, framealpha=0.5)
plt.show()

(Figure: decision boundary learned with C=100.)
It can be seen that when $C$ is large, the SVM is pulled by the outlier and overfitting occurs.

2. Nonlinear SVM (Gaussian kernel function)

The optimization objective of the linear SVM discussed above is based on the linear operation $\theta^{\top}x$. When we face more complex decision boundaries, a simple linear operation can no longer meet the need. Just as nonlinear activation functions are introduced in neural networks, in SVM we introduce nonlinear kernel functions to achieve more complex classification. This type of SVM is called a nonlinear SVM.

This is equivalent to replacing the original samples with a set of new features; the kernel function performs the nonlinear mapping from samples to these new features:
$$f^{(i)} \leftarrow x^{(i)}$$

2.1 Gaussian kernel

$$f_{i}=\mathrm{sim}(x,l^{(i)})=\exp\left(-\frac{||x-l^{(i)}||^{2}}{2\sigma^{2}}\right)$$
When $x$ and $l^{(i)}$ are close to each other, the kernel value is close to 1; when they are far apart, the kernel value is close to 0.
$$f^{(i)}=\begin{bmatrix}f^{(i)}_{0}=1\\ f^{(i)}_{1}=\mathrm{sim}(x^{(i)},l^{(1)})\\ \vdots\\ f^{(i)}_{m}=\mathrm{sim}(x^{(i)},l^{(m)})\end{bmatrix}$$
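A minimal sketch of this feature mapping in plain numpy; the function names are illustrative, and the landmarks $l^{(1)},\dots,l^{(m)}$ are taken to be the training samples themselves, which is why $f^{(i)}$ above has $m$ similarity components:

import numpy as np

def gaussian_kernel(x, l, sigma):
    # similarity between a sample x and a landmark l
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

def map_features(X, landmarks, sigma):
    # build f^(i) = [1, sim(x^(i), l^(1)), ..., sim(x^(i), l^(m))] for every sample
    m = landmarks.shape[0]
    F = np.ones((X.shape[0], m + 1))
    for i, x in enumerate(X):
        F[i, 1:] = [gaussian_kernel(x, l, sigma) for l in landmarks]
    return F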

The cost function of the nonlinear SVM then becomes:
$$J(\theta)=C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_{1}(\theta^{\top}f^{(i)})+(1-y^{(i)})\mathrm{cost}_{0}(\theta^{\top}f^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}$$

When the parameter $\sigma$ is larger, the features vary more smoothly ($||x-l^{(i)}||^{2}$ has less influence on the kernel value), so different samples become less distinguishable. This helps reduce the influence of individual outliers, lowers the variance of the model and reduces overfitting, but increases its bias. Conversely, when $\sigma$ is small, the features are more discriminative, giving the model higher variance and lower bias.
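A small numeric illustration of this effect, reusing the gaussian_kernel sketch above with made-up points. Note that sklearn's RBF kernel is written as $\exp(-\gamma||x-l||^{2})$, so its gamma parameter corresponds to $1/(2\sigma^{2})$: a small $\sigma$ means a large gamma.

x = np.array([1.0, 1.0])
l = np.array([2.0, 0.0])   # squared distance ||x - l||^2 = 2

for sigma in (0.5, 1.0, 3.0):
    # larger sigma -> kernel value closer to 1, i.e. samples look more alike
    print(sigma, gaussian_kernel(x, l, sigma))
# roughly: 0.5 -> 0.018, 1.0 -> 0.368, 3.0 -> 0.895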

2.2 Nonlinear classification

A visualization of the dataset for non-linear classification looks like this:
(Figure: scatter plot of the nonlinearly separable data set.)
Using regularization parameter C=100 (RBF kernel, gamma=10):

import numpy as np
import matplotlib.pyplot as plt


def show_boundary(svc, scale, fig_size, fig_dpi, positive_data, negative_data, term):
    """
    Show SVM classification boundary plot
    :param svc: instance of SVC, fitted and probability=True
    :param scale: plotting range, array [[x_min, x_max], [y_min, y_max]]
    :param fig_size: figure size, tuple (w, h)
    :param fig_dpi: figure dpi, int
    :param positive_data: positive data for dataset (n, d)
    :param negative_data: negative data for dataset (n, d)
    :param term: tolerance around probability 0.5 used to draw the boundary
    :return: None (shows the plot)
    """
    # evaluate the classifier on a dense grid covering the plotting area
    t1 = np.linspace(scale[0, 0], scale[0, 1], 500)
    t2 = np.linspace(scale[1, 0], scale[1, 1], 500)
    coordinates = np.array([[x, y] for x in t1 for y in t2])
    prob = svc.predict_proba(coordinates)
    # keep grid points whose predicted probability is close to 0.5: they approximate the boundary
    idx1 = np.where(np.logical_and(prob[:, 1] > 0.5 - term, prob[:, 1] < 0.5 + term))[0]
    my_bd = coordinates[idx1]
    plt.figure(figsize=fig_size, dpi=fig_dpi)
    plt.scatter(x=my_bd[:, 0], y=my_bd[:, 1], s=10, color="yellow", label="My Decision Boundary")
    plt.scatter(x=positive_data[:, 0], y=positive_data[:, 1], s=10, color="red", label="positive")
    plt.scatter(x=negative_data[:, 0], y=negative_data[:, 1], s=10, label="negative")
    plt.title('Decision Boundary')
    plt.legend(loc=2)
    plt.show()

from sklearn import svm
from sklearn.metrics import classification_report
svc100 = svm.SVC(C=100, kernel='rbf', gamma=10, probability=True)
svc100.fit(x,y.ravel())
report100 = classification_report(svc100.predict(x),y,digits=4)
print(report100)
show_boundary(svc100, scale=np.array([[0,1],[0.4,1]]), fig_size=fig_size, fig_dpi=fig_dpi,positive_data=positive_data,negative_data=negative_data, term=1e-3)
              precision    recall  f1-score   support

           0     0.9791    0.9542    0.9665       393
           1     0.9625    0.9830    0.9726       470

    accuracy                         0.9699       863
   macro avg     0.9708    0.9686    0.9696       863
weighted avg     0.9701    0.9699    0.9698       863

(Figure: decision boundary with C=100, gamma=10.)
Using regularization parameter C=1 (RBF kernel, gamma=10):

svc1 = svm.SVC(C=1, kernel="rbf", gamma=10, probability=True)
svc1.fit(x,y.ravel())
report1 = classification_report(svc1.predict(x),y,digits=4)
print(report1)
show_boundary(svc1, scale=np.array([[0,1],[0.4,1]]), fig_size=fig_size, fig_dpi=fig_dpi,positive_data=positive_data,negative_data=negative_data, term=1e-3)
              precision    recall  f1-score   support

           0     0.8851    0.8582    0.8715       395
           1     0.8833    0.9060    0.8945       468

    accuracy                         0.8841       863
   macro avg     0.8842    0.8821    0.8830       863
weighted avg     0.8841    0.8841    0.8840       863

(Figure: decision boundary with C=1, gamma=10.)
It can be seen that as the regularization parameter becomes smaller, the classification boundary becomes "smoother".

2.3 Parameter search

In machine learning applications, choosing hyperparameters is a key step: different parameter values lead to very different performance. One of the most commonly used methods for this is grid search (GridSearch).

Before implementing grid search, we first introduce a method for evaluating model performance: k-fold cross-validation. Usually, when training a model we split off a single fixed part of the training set as the validation set. k-fold cross-validation instead divides the training set into k parts and trains the model k times, each time using one part as the validation set and the remaining parts as the training set; the average score over the k validation sets is then used to evaluate the model. This method takes the distribution of the whole training set into account and usually reflects the generalization ability of the model better than a single fixed validation set.
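As a side note, sklearn exposes this procedure directly through cross_val_score; a minimal sketch that scores a single parameter setting with 5-fold cross-validation (it assumes train_x and train_y are already loaded, as in the grid search code further below):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# mean accuracy over 5 folds for one particular (C, gamma) setting
scores = cross_val_score(SVC(C=1, gamma=10), train_x, train_y.ravel(), cv=5)
print(scores.mean())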

The steps for grid search are:

  1. Specify a set of candidate values for each target parameter; the combinations of several parameters then form a structure similar to a "grid"
  2. For each parameter value combination, perform k-fold cross validation (in sklearn, k=5 is used by default)
  3. Select the parameter combination with the highest average score as the optimal parameter combination

The code is implemented as follows:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from SVM import show_boundary

candidate = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
parameters_grid = np.array([[c, gamma] for c in candidate for gamma in candidate])

score_list = []
kf = KFold(n_splits=5)
for param in parameters_grid:
    score = []
    for tr_idx, test_idx in kf.split(train_x, train_y):
        tr_x, tr_y = train_x[tr_idx], train_y[tr_idx]
        test_x, test_y = train_x[test_idx], train_y[test_idx]

        svc = SVC(C=param[0], gamma=param[1], probability=True)
        svc.fit(tr_x, tr_y.ravel())
        # score() returns the mean accuracy on the held-out fold
        score.append(svc.score(test_x, test_y.ravel()))
    score_list.append(score)

# average the 5 fold scores for each (C, gamma) combination and pick the best one
score_arr = np.array(score_list).mean(axis=1)
best_param = parameters_grid[np.argmax(score_arr)]
best_score = score_arr.max()
param_dict = {'C': best_param[0], 'gamma': best_param[1]}
best_svc = SVC(probability=True)
best_svc.set_params(**param_dict)
best_svc.fit(train_x, train_y.ravel())
print("Best parameters C={}, gamma={}, with average accuracy of {:.4f}".format(best_param[0], best_param[1], best_score))
Best parameters C=30.0, gamma=3.0, with average accuracy of 0.9244

Verify using sklearn's GridSearchCV:

from sklearn.model_selection import GridSearchCV

svc = SVC(probability=True)
parameters = {'C': candidate, 'gamma': candidate}
# GridSearchCV uses 5-fold cross-validation by default
clf = GridSearchCV(svc, parameters, n_jobs=-1)
clf.fit(train_x, train_y.ravel())
print("SKlearn result: C={}, gamma={}".format(clf.best_params_.get('C'), clf.best_params_.get('gamma')))
SKlearn result: C=30, gamma=3

Visualizing the data set and the classification boundary of the best model:
(Figure: decision boundary found with the best parameters.)

Original post: blog.csdn.net/d33332/article/details/128581821