Machine Learning - Support Vector Machines

Exercise 5: Support Vector Machines


introduce

In this exercise, we will use a support vector machine (SVM) to build a spam classifier.

Before starting the exercise, you need to download the following files for data upload :

  • data.tgz - Contains the data files used in this exercise

in:

  • ex5data1.mat - dataset example 1
  • ex5data2.mat - dataset example 2
  • ex5data3.mat - dataset example 3
  • spamTrain.mat - spam training set
  • spamTest.mat - spam test set
  • emailSample1.txt - email sample 1
  • emailSample2.txt - email sample 2
  • spamSample1.txt - Spam Sample 1
  • spamSample2.txt - Spam Sample 2
  • vocab.txt - Vocabulary

Throughout the exercise, the following mandatory assignments and marked *optional assignments are involved :

1 Support Vector Machine

We'll start using SVMs on some simple 2D datasets to see how they work. We will then do some preprocessing on a set of raw emails and use SVM to build a classifier on the processed emails to determine whether they are spam or not.

1.1 Dataset Example 1

The first thing we'll do is look at a simple 2D dataset and see how a linear SVM works on the dataset for different values ​​of C (similar to a regularization term in linear/logistic regression).

!tar -zxvf /home/jovyan/work/data.tgz -C /home/jovyan/work/
data/
data/emailSample1.txt
data/emailSample2.txt
data/ex5data1.mat
data/ex5data2.mat
data/ex5data3.mat
data/spamSample1.txt
data/spamSample2.txt
data/spamTest.mat
data/spamTrain.mat
data/vocab.txt

/vocab.txt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
 
raw_data = loadmat('/home/jovyan/work/data/ex5data1.mat')
raw_data
{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:28:43 2011',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[1.9643  , 4.5957  ],
        [2.2753  , 3.8589  ],
        [2.9781  , 4.5651  ],
        [2.932   , 3.5519  ],
        [3.5772  , 2.856   ],
        [4.015   , 3.1937  ],
        [3.3814  , 3.4291  ],
        [3.9113  , 4.1761  ],
        [2.7822  , 4.0431  ],
        [2.5518  , 4.6162  ],
        [3.3698  , 3.9101  ],
        [3.1048  , 3.0709  ],
        [1.9182  , 4.0534  ],
        [2.2638  , 4.3706  ],
        [2.6555  , 3.5008  ],
        [3.1855  , 4.2888  ],
        [3.6579  , 3.8692  ],
        [3.9113  , 3.4291  ],
        [3.6002  , 3.1221  ],
        [3.0357  , 3.3165  ],
        [1.5841  , 3.3575  ],
        [2.0103  , 3.2039  ],
        [1.9527  , 2.7843  ],
        [2.2753  , 2.7127  ],
        [2.3099  , 2.9584  ],
        [2.8283  , 2.6309  ],
        [3.0473  , 2.2931  ],
        [2.4827  , 2.0373  ],
        [2.5057  , 2.3853  ],
        [1.8721  , 2.0577  ],
        [2.0103  , 2.3546  ],
        [1.2269  , 2.3239  ],
        [1.8951  , 2.9174  ],
        [1.561   , 3.0709  ],
        [1.5495  , 2.6923  ],
        [1.6878  , 2.4057  ],
        [1.4919  , 2.0271  ],
        [0.962   , 2.682   ],
        [1.1693  , 2.9276  ],
        [0.8122  , 2.9992  ],
        [0.9735  , 3.3881  ],
        [1.25    , 3.1937  ],
        [1.3191  , 3.5109  ],
        [2.2292  , 2.201   ],
        [2.4482  , 2.6411  ],
        [2.7938  , 1.9656  ],
        [2.091   , 1.6177  ],
        [2.5403  , 2.8867  ],
        [0.9044  , 3.0198  ],
        [0.76615 , 2.5899  ],
        [0.086405, 4.1045  ]]),
 'y': array([[1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [1]], dtype=uint8)}

We represent this as a scatterplot, where the class labels are represented by symbols (+ for positive and o for negative).

data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
 
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
 
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
plt.show()

[External link picture transfer failed, the source site may have an anti-leeching mechanism, it is recommended to save the picture and upload it directly (img-Bsew0Di6-1686574855668)(output_5_0.png)]

from sklearn import svm
svc = svm.LinearSVC(C=1, loss='hinge', max_iter=1000)
svc
LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)

First, let's use C=1 to see how it turns out.

svc.fit(data[['X1', 'X2']], data['y'])
svc.score(data[['X1', 'X2']], data['y'])
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)





0.9803921568627451
data['SVM 1 Confidence'] = svc.decision_function(data[['X1', 'X2']])
 
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 1 Confidence'], cmap='seismic')
ax.set_title('SVM (C=1) Decision Confidence')
plt.show()

[External link picture transfer failed, the source site may have an anti-leeching mechanism, it is recommended to save the picture and upload it directly (img-og0GHihN-1686574855669)(output_9_0.png)]

Second, let's see what happens if the value of C is larger

svc2 = svm.LinearSVC(C=100, loss='hinge', max_iter=1000)
svc2.fit(data[['X1', 'X2']], data['y'])
svc2.score(data[['X1', 'X2']], data['y'])
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)





0.9411764705882353

This time we get a perfect classification of the training data, but by increasing the value of C we create a decision boundary that no longer fits the data. We can see this by looking at the confidence level of each class prediction as a function of the point's distance from the hyperplane.

data['SVM 2 Confidence'] = svc2.decision_function(data[['X1', 'X2']])
 
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 2 Confidence'], cmap='seismic')
ax.set_title('SVM (C=100) Decision Confidence')
plt.show()

[External link picture transfer failed, the source site may have an anti-leeching mechanism, it is recommended to save the picture and upload it directly (img-zOdDcLMX-1686574855670)(output_13_0.png)]

You can look at the color of the points near the border, the difference is a bit subtle. If you are in the exercise text, there will be plots where the decision boundary is shown as a line on the graph, helping to make the difference clearer.

1.2 SVM with Gaussian Kernel

Now we will move from a linear SVM to an SVM capable of nonlinear classification using kernels. We first take care of implementing a Gaussian kernel function. Although scikit-learn has a built-in Gaussian kernel, we will implement it from scratch for clarity.

1.2.1 Gaussian Kernel

You can think of the Gaussian kernel approximation as having a function that measures the "distance" between two samples. The Gaussian kernel also passes the bandwidth parameter θ \thetaθ is parameterized, which determines how quickly the similarity measure decreases (to 0) as examples get closer together.

The formula of the Gaussian kernel is as follows:
[External link picture transfer failed, the source site may have an anti-leeching mechanism, it is recommended to save the picture and upload it directly (img-mOjb47Ku-1686574855670)(5-1.png)]

Now, you need to code the code to implement the above formula for computing the Gaussian kernel :

Main points :

  • Define the gaussian_kernel function, the parameter list isx1,x2,sigma
  • x1, x2representing two samples, sigmarepresenting the parameter values ​​in the Gaussian kernel.
  • For the given test code, the output value should be 0.324652
###在这里填入代码###
def gaussian_kernel(x1, x2, sigma):
    return np.exp(-(np.sum((x1 - x2) ** 2)) / (2 * (sigma ** 2)))
###请运行并测试你的代码###
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2
 
gaussian_kernel(x1, x2, sigma)
0.32465246735834974

1.2.2 Dataset Example 2

Next, we'll examine another dataset, this time with a nonlinear decision boundary.

raw_data = loadmat('work/data/ex5data2.mat')
 
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
 
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
 
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=30, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=30, marker='o', label='Negative')
ax.legend()
plt.show()

[External link picture transfer failed, the source site may have an anti-leeching mechanism, it is recommended to save the picture and upload it directly (img-n0Et1xtg-1686574855671)(output_18_0.png)]

For this dataset, we will build a support vector machine classifier using the built-in RBF kernel and check its accuracy on the training data. To visualize the decision boundary, this time we will shade the points according to the predicted probability that the instance has a negative class label. As can be seen from the results, they are mostly correct.

csvc = svm.SVC(C=100, gamma=10, probability=True)
csvc
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=10, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
csvc.fit(data[['X1', 'X2']], data['y'])
csvc.score(data[['X1', 'X2']], data['y'])
1.0
data['Probability'] = csvc.predict_proba(data[['X1', 'X2']])[:,0]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=30, c=data['Probability'], cmap='Reds')
plt.show()

[External link picture transfer failed, the source site may have an anti-leeching mechanism, it is recommended to save the picture and upload it directly (img-jkkioetv-1686574855671)(output_22_0.png)]

1.2.3 Dataset example 3


For the third dataset, we give the training and validation sets, and find the optimal hyperparameters for the SVM model based on the validation set performance. While we could do this using scikit-learn's built-in grid search, for the purposes of following the exercise we'll implement a simple grid search from scratch.

Your task is to use the cross-validation set Xval, yval Xval, yvalX v a l , y v a l to determine the best parameter CCto useC andσ \sigmas .

You should write any other code necessary to help you search for the parameter CCC andσ \sigmaσ . forCCC andσ \sigmaσ , we recommend trying values ​​in multiplicative steps (e.g. 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30). Note that you should tryCCC andσ \sigmaAll possible pairs of values ​​for σ (for example,C = 0.3 , σ = 0.1 C = 0.3,\sigma= 0.1C=0.3,p=0.1 ). For example, if you try to use the CClisted aboveC andσ 2 \sigma^2pEach of the 8 values ​​of 2 , will end up training and evaluating (on the cross-validation set) a total of 8 2 = 64 8^2 = 6482=64 different models.

raw_data = loadmat('work/data/ex5data3.mat')
 
X = raw_data['X']
Xval = raw_data['Xval']
y = raw_data['y'].ravel()
yval = raw_data['yval'].ravel()
 
C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
 
best_score = 0
best_params = {
    
    'C': None, 'gamma': None}
 
###在这里填入代码###
 
for C in C_values:
    for gamma in gamma_values:
        svc = svm.SVC(C=C,gamma=gamma)
        svc.fit(X,y)
        score = svc.score(Xval,yval)
 
        if score > best_score:
            best_score =score
            best_params['C']=C
            best_params['gamma'] = gamma
 
best_score, best_params
(0.965, {'C': 0.3, 'gamma': 100})

2 Spam Classification


In this part, our goal is to use SVM to build a spam filter.

2.1 Mail preprocessing

In the practice text, there is a task that involves some text preprocessing to get the data in a format suitable for SVM processing. However, this task is trivial (map words to IDs in the dictionary provided for the exercise), while the rest of the preprocessing steps (such as HTML removal, stemming, normalization, etc.) are already done. Instead of reproducing these preprocessing steps, I'll skip the machine learning task, which involves building a classifier from a preprocessed training set, and a test dataset that converts spam and non-spam emails into vectors of word occurrences.

You need to load the data from file data/spamTrain.matand data/spamTest.mat, and build training and test sets respectively .

Main points :

  • Output the data you read in
  • Extract the correspondingX, y, Xtest, ytest
  • Make sure your data shapes are in order:(X(4000, 1899), y(4000,), Xtest(1000, 1899), ytest(1000,))
###在这里填入代码###
spam_train = loadmat('work/data/spamTrain.mat')
spam_test = loadmat('work/data/spamTest.mat')
 
print(spam_train)
print(spam_test)
 
X = spam_train['X']
Xtest = spam_test['Xtest']
y = spam_train['y'].ravel()
ytest = spam_test['ytest'].ravel()
 
X.shape, y.shape, Xtest.shape, ytest.shape
{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:27:25 2011', '__version__': '1.0', '__globals__': [], 'X': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'y': array([[1],
       [1],
       [0],
       ...,
       [1],
       [0],
       [0]], dtype=uint8)}
{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:27:39 2011', '__version__': '1.0', '__globals__': [], 'Xtest': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'ytest': array([[1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0]], dtype=uint8)}





((4000, 1899), (4000,), (1000, 1899), (1000,))

Each document has been converted to a vector with 1899 dimensions corresponding to 1899 words in the vocabulary. Their values ​​are binary, indicating whether a word is present in the document.

Thus, training evaluation is a matter of fitting a classifier to the test data.

2.2 Training SVM for spam classification

You need to train svm for spam classification and give the accuracy of training set and test set separately .

svc = svm.SVC()
svc.fit(X, y)
print('Training accuracy = {0}%'.format(np.round(svc.score(X, y) * 100, 2)))
print('Test accuracy = {0}%'.format(np.round(svc.score(Xtest, ytest) * 100, 2)))

Guess you like

Origin blog.csdn.net/chenyu128/article/details/131176881
Recommended