Andrew Ng Machine Learning Homework Python Implementation (6): Support Vector Machine

Table of contents

1 Support Vector Machine SVM 

1.1 Example Dataset 1

1.2 Gaussian Kernel SVM

1.2.1 Gaussian Kernel

1.2.2 Example Dataset 2

1.2.3 Example Dataset 3

2 Spam Classification

2.1 Preprocessing emails

2.1.1 Glossary

2.2 Extract features from emails

2.3 Training SVMs

Reference article


1 Support Vector Machine SVM 

       In the first half of this exercise, you will use SVMs on various 2D datasets, which will give you more intuition about how SVMs work and how to use Gaussian kernels with SVMs. In the second half, you will use an SVM to build a spam classifier.

1.1 Example Dataset 1

       This dataset can be separated by a linear boundary. First, we visualize the data. Note the outlier positive example in the upper-left corner; through this outlier we can observe how the SVM decision boundary changes.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
from sklearn import svm
from jupyterthemes import jtplot  # used to fix axis labels not displaying
jtplot.style(theme='chesterish')  # pick a plotting theme

path1 = r'E:/Code/ML/ml_learning/ex6-SVM/data/ex6data1.mat'
data1 = loadmat(path1)
X, y = data1['X'], data1['y']
# X holds the coordinate points, y the labels

def plotData(X, y):
    """数据可视化"""
    plt.figure(figsize=(8,5))
    # 将y展开,并用其值区分正负样本,cmap是设置颜色效果
    plt.scatter(X[:,0], X[:,1], c=y.flatten(), cmap='rainbow')
    plt.xlabel('X1')
    plt.ylabel('X2')
plotData(X,y)

       Next, instead of writing the SVM code ourselves, we call the sklearn package directly. Note that y must be flattened before it is passed to the model. After training the model, we visualize the decision boundary. Also note that we do not need to add x0 = 1 ourselves; the model adds it automatically.

# instantiate and train the models
# a list comprehension keeps the code short
models = [svm.SVC(C, kernel='linear') for C in [1, 100]]
clfs = [model.fit(X, y.ravel()) for model in models]
def plotBoundary(clf, X):
    '''Plot the decision boundary'''
    x_min, x_max = X[:,0].min()*1, X[:,0].max()*1
    y_min, y_max = X[:,1].min()*1,X[:,1].max()*1
    # draw the contour lines
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))# (500,500)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) #(250000,)
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)

titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in [1, 100]]
for model, title in zip(clfs, titles):
    plt.figure(figsize=(8,5))
    plotData(X, y)
    plotBoundary(model, X)
    plt.title(title)

        C is a parameter that controls the penalty for misclassified training examples, similar to 1/λ, where λ is the regularization parameter used earlier in logistic regression. It can be seen that when C = 1, the SVM puts the decision boundary in the gap between the positive and negative samples but misclassifies the outlier, while when C = 100 it classifies the outlier correctly, but its decision boundary does not fit the data naturally.
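
        To quantify this, here is a quick check (a sketch, reusing the clfs list trained above) that compares training accuracy for the two values of C; the C = 100 model should fit the training set more tightly.

# a sketch: compare training accuracy of the two models trained above
for C, clf in zip([1, 100], clfs):
    print('C = {:>3}: training accuracy = {:.3f}'.format(C, clf.score(X, y.flatten())))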

1.2 Gaussian Kernel SVM

        In this part of the exercise, you will use support vector machines for nonlinear classification. In particular, you will use a support vector machine with a Gaussian kernel on a dataset that is not linearly separable.

1.2.1 Gaussian Kernel

       Before implementing the SVM, we need to implement the Gaussian kernel function. The Gaussian kernel can be viewed as a similarity function that measures the distance between two examples. Its formula is as follows.

$$K_{\text{gaussian}}(x^{(i)}, x^{(j)}) = \exp\left(-\frac{\left\|x^{(i)}-x^{(j)}\right\|^{2}}{2\sigma^{2}}\right) = \exp\left(-\frac{\sum_{k=1}^{n}\left(x_{k}^{(i)}-x_{k}^{(j)}\right)^{2}}{2\sigma^{2}}\right)$$

def gaussKernel(x1, x2, sigma):
    """Gaussian kernel: similarity between samples x1 and x2"""
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))
gaussKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.) 

        After writing the function, we test the kernel on two sample points and obtain approximately 0.32465.
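
        As a cross-check (a sketch, not part of the exercise), sklearn's built-in rbf_kernel computes exp(-gamma·||x1 - x2||²), so with gamma = 1/(2σ²) it should agree with our gaussKernel:

from sklearn.metrics.pairwise import rbf_kernel
# rbf_kernel expects 2D arrays, one row per sample
x1, x2, sigma = np.array([[1, 2, 1]]), np.array([[0, 4, -1]]), 2.
print(rbf_kernel(x1, x2, gamma=1 / (2 * sigma**2)))  # ≈ 0.32465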

1.2.2 Example Dataset 2

        First, we visualize the data by simply calling the plotData function written earlier.

path2 = r'E:/Code/ML/ml_learning/ex6-SVM/data/ex6data2.mat'
data2 = loadmat(path2)
X2 = data2['X']
y2 = data2['y']
plotData(X2,y2)

       It can be seen that a linear decision boundary will not work here, but by using the Gaussian kernel in the SVM we can learn a nonlinear one. Note that we pass kernel='rbf' to the SVC function; sklearn's RBF kernel is exp(-gamma·|u-v|²), so gamma must be constructed from the formula we wrote above, i.e. gamma = 1/(2σ²).

sigma = 0.1
gamma = np.power(sigma, -2) / 2
clf2 = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model2 = clf2.fit(X2, y2.flatten())
plt.figure(figsize=(8,5))
plotData(X2,y2)
plotBoundary(model2, X2)

1.2.3 Example Dataset 3

       In this part, we continue to use the Gaussian-kernel SVM to find a nonlinear decision boundary, this time choosing the optimal parameters on a cross-validation set. First, we visualize the data.

path3 = r'E:/Code/ML/ml_learning/ex6-SVM/data/ex6data3.mat'
data3 = loadmat(path3)
X3, y3 = data3['X'], data3['y']
Xval, yval = data3['Xval'], data3['yval']
plotData(X3, y3)

Cvalues = [0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.]
sigmavalues = Cvalues
best_parameter, best_score = [0, 0], 0

for C in Cvalues:
    for sigma in sigmavalues:
        gamma = np.power(sigma,-2.)/2
        model = svm.SVC(C=C,kernel='rbf',gamma=gamma)
        model.fit(X3, y3.flatten())
        this_score = model.score(Xval, yval)
        if this_score > best_score:
            best_score = this_score
            best_parameter = [C, sigma]
print('best_parameter={}, best_score={}'.format(best_parameter, best_score))
# best_parameter=[1.0, 0.1], best_score=0.965
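
        Equivalently (a sketch, not part of the original exercise), sklearn's GridSearchCV can run the same search when given a PredefinedSplit that marks the training rows with -1 and the validation rows with 0:

from sklearn.model_selection import GridSearchCV, PredefinedSplit
X_all = np.vstack([X3, Xval])
y_all = np.vstack([y3, yval]).flatten()
# -1 = always in training, 0 = in the single validation fold
split = PredefinedSplit(np.r_[-np.ones(len(X3)), np.zeros(len(Xval))].astype(int))
param_grid = {'C': Cvalues, 'gamma': [np.power(s, -2.)/2 for s in sigmavalues]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=split)
search.fit(X_all, y_all)
print(search.best_params_, search.best_score_)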

        The optimal parameters are then used to learn the final decision boundary.

model = svm.SVC(C=1., kernel='rbf', gamma = np.power(0.1, -2.)/2)
model.fit(X3, y3.flatten())
plotData(X3, y3)
plotBoundary(model, X3)

2 Spam Classification

       In this part of the exercise, we will use an SVM to build our own spam filter, where y = 1 if an email is spam and y = 0 if it is not. We also need to convert each email into a feature vector. Note that the dataset included in this exercise is based on a subset of the SpamAssassin Public Corpus, and for the purposes of this exercise we only use the email body (not including the headers).

2.1 Preprocessing emails

        First, let's visualize an example.

path4 = r'E:/Code/ML/ml_learning/ex6-SVM/data/emailSample1.txt'
with open(path4) as f:
    email = f.read()
    print(email)
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
[email protected]

        From the content above we can see that the email contains a URL, an email address, numbers, and a dollar amount. Many emails contain these elements, but the specifics differ from email to email, so before processing we "normalize" these values so that they are all treated the same way. For example, we replace every URL with the unique string httpaddr to indicate that a URL was present. The idea is to let the spam classifier base its decisions on whether any URL is present rather than on whether a specific URL is present. Below are the mail preprocessing and normalization steps.

  • Lower-casing: the entire email is converted to lower case, so capitalization is ignored
  • Stripping HTML: many emails come with HTML formatting; we remove all HTML tags, leaving only the message content
  • Normalizing URLs: all URLs are replaced with the string httpaddr
  • Normalizing email addresses: all email addresses are replaced with the string emailaddr
  • Normalizing numbers: all numbers are replaced with the string number
  • Normalizing dollar signs: all dollar signs ($) are replaced with the string dollar
  • Word stemming: words are reduced to their stem form; e.g. discount, discounts, and discounted are all replaced with discount
  • Removal of non-words: non-words and punctuation are removed, and all whitespace (tabs, newlines, spaces) is trimmed to a single space character
import re  # regular expressions
from stemming.porter2 import stem  # English stemming algorithm
import nltk, nltk.stem.porter

def processEmail(email):
    """Perform the first six preprocessing steps"""
    email = email.lower()  # convert to lower case
    # match a <, then any characters that are not < or >, up to the closing >, i.e. match <...>
    email = re.sub('<[^<>]+>', ' ', email)
    # match everything after :// that is not whitespace, stopping at the first whitespace
    email = re.sub('(http|https)://[^\s]*', 'httpaddr', email)
    email = re.sub('[^\s]+@[^\s]+', 'emailaddr', email)  # replace email addresses with emailaddr
    email = re.sub('[\$]+', 'dollar', email)  # replace dollar signs with dollar
    email = re.sub('[\d]+', 'number', email)  # replace numbers with number
    return email
print(processEmail(email))

        Next, we extract word stems and remove non-alphanumeric content.

def email2TokenList(email):
    """Return a clean list of word tokens"""
    # instantiate the stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    # preprocess the email
    email = processEmail(email)
    # split into words
    tokens = re.split('[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
    token_list = []
    for token in tokens:
        # strip any remaining non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # skip empty strings (non-zero length is truthy)
        if not len(token): continue
        # run the token through the stemmer to get its root
        stemmed_word = stemmer.stem(token)
        token_list.append(stemmed_word)
    return token_list
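
        As a quick sanity check (a sketch), we can tokenize the sample email loaded earlier and inspect the first few stems:

print(email2TokenList(email)[:10])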

2.1.1 Glossary

        After preprocessing, we have the email's word list; the next step is to choose which words to use in the classifier and which to discard.

        In this exercise, we only consider the most frequently occurring words as our vocabulary. We are given a vocabulary file, vocab.txt, which stores 1899 words commonly used in practice. Finally, we determine which words in the vocabulary appear in the processed email and return the indices of those words; these are the indices of the training words we want.

def email2VocabIndices(email, vocab):
    """Extract the indices of the email's words in the vocabulary"""
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token]
    return index

2.2 Extract features from emails

def email2FeatureVector(email):
    """
    email转为词向量  ,n是vocab的长度,存在单词的相应位置为1,其余0
    """
    # 给单词那一列添加列名
    df = pd.read_table(r'E:/Code/ML/ml_learning/ex6-SVM/data/vocab.txt', names=['words'])
    # dataframe转array
    vocab= df.values
    # 长度与单词表长度相同
    vector = np.zeros(len(vocab))
    # 返回邮件单词索引
    vocab_indices = email2VocabIndices(email, vocab)
    # 将有单词的索引置1
    for i in vocab_indices:
        vector[i] = 1
    return vector
vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))
length of vector = 1899
num of non-zero = 45

2.3 Training SVMs

data1 = loadmat(r'E:/Code/ML/ml_learning/ex6-SVM/data/spamTrain.mat')
X, y = data1['X'], data1['y']
data2 = loadmat(r'E:/Code/ML/ml_learning/ex6-SVM/data/spamTest.mat')
Xtest, ytest = data2['Xtest'], data2['ytest']

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y.flatten())
predTrain = clf.score(X, y.flatten())
predTest = clf.score(Xtest, ytest.flatten())
predTrain, predTest
(0.99825, 0.989)
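
        Finally, as a sketch, the trained classifier can be applied to the sample email from section 2.1 by reusing the feature vector built by email2FeatureVector:

# predict expects a 2D array of shape (n_samples, n_features)
pred = clf.predict(vector.reshape(1, -1))
print('spam' if pred[0] == 1 else 'not spam')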

Reference article

Andrew Ng's machine learning and deep learning homework catalog [image restored]
