Face Recognition with Support Vector Machines

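The support vector machine (SVM) was one of the most widely used classification algorithms before deep neural networks became dominant.
**The question an SVM answers:** which decision boundary is the best one?
The optimization goal: find a separating line such that the points closest to it (in a binary problem, the nearest points from each of the two classes) are as far from the decision boundary as possible.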

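The task is to find the w and b that achieve this: first identify the point closest to the line, then maximize that point's distance to the line.
After simplification, the second step applies the method of Lagrange multipliers and asks which multipliers α maximize the resulting objective.
Negating that objective turns the problem into a minimization.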
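The steps above follow the standard hard-margin derivation. Written out, assuming labels y_i in {-1, +1} and training points x_i (this notation is an assumption, since the original formulas were images), it reads roughly as follows:

```latex
% Step 1: maximize the distance of the closest point to the boundary
\max_{w,b}\; \min_{i}\; \frac{y_i\,(w^\top x_i + b)}{\lVert w \rVert}

% Rescaling w and b so that the closest point satisfies y_i (w^T x_i + b) = 1
\min_{w,b}\; \frac{1}{2}\lVert w \rVert^2
\quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n

% Step 2: Lagrangian dual, maximized over the multipliers alpha
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\quad \text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0

% Negating the objective gives the equivalent minimization
\min_{\alpha}\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
  - \sum_{i=1}^{n} \alpha_i
\quad \text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0

% The weight vector is recovered from the multipliers
w = \sum_{i=1}^{n} \alpha_i y_i x_i
```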
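Consider a simple worked example of solving for the support vectors.

Once the problem is solved, the points that actually matter are the support vectors.
The final decision boundary is built only from the samples whose α is nonzero, often just one or two of them. If a sample's α equals 0, its term in the expression for w vanishes whatever its x and y are, so that sample has no influence.
In other words, an SVM's classification is determined by a small number of samples.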
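The claim that only samples with nonzero α matter can be checked numerically. Below is a minimal sketch, assuming scikit-learn's `SVC` with a linear kernel and a synthetic two-cluster dataset (the data parameters are chosen only for illustration). It rebuilds w from the support vectors alone and compares it with the fitted weights.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two clusters (illustrative parameters, not a fixed recipe)
X, y = make_blobs(n_samples=80, centers=2, random_state=0, cluster_std=0.60)
clf = SVC(kernel='linear').fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only;
# all other training samples have alpha_i = 0 and are not stored at all.
alpha_times_y = clf.dual_coef_                      # shape (1, n_support_vectors)
w_from_sv = alpha_times_y @ clf.support_vectors_    # w = sum_i alpha_i y_i x_i

n_support = len(clf.support_)
print(n_support, "of", len(X), "samples are support vectors")
print(np.allclose(w_from_sv, clf.coef_))
```

Every sample missing from `support_vectors_` contributes nothing to w, which is what makes the fitted model compact.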
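Soft margin: real data sometimes contains noisy points. If the boundary is forced to accommodate such a point, the resulting separator becomes poor. Slack variables are introduced to handle this.

C is a constant that we choose.
To keep the overall objective small, a large C forces the slack variables to be small, which means classification is strict and errors are barely tolerated.
With a small C the slack variables may be somewhat larger, which means more errors are tolerated.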
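For reference, the usual soft-margin objective that this description corresponds to is written below, with slack variables ξ_i and labels y_i in {-1, +1} (the notation is assumed here, since the original formula is not shown):

```latex
\min_{w,b,\xi}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
```

A large C makes every unit of slack expensive, so the optimizer keeps the ξ_i near zero. A small C makes slack cheap, so a wider margin with some violations becomes the better trade.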
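An advantage of support vector machines:
A problem that is not separable in a low-dimensional space can be mapped by a function into a higher-dimensional one. A problem that cannot be separated within a plane then becomes a problem in space, where a single surface can separate it.
This technique is called a kernel transformation.
Computing the kernel between every pair of n samples costs O(n^2) kernel evaluations.
The most commonly used kernel is the Gaussian kernel, which implicitly maps low-dimensional data into a higher-dimensional space and turns a linear SVM into a nonlinear one.

A kernel function can make a dataset that is not linearly separable separable.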
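To make the Gaussian kernel and its pairwise cost concrete, here is a small sketch that computes the kernel matrix by hand. The helper name `gaussian_kernel_matrix` and the demo data are hypothetical. The formula K[i, j] = exp(-gamma * ||x_i - x_j||^2) is the standard RBF definition, which scikit-learn exposes as `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_kernel_matrix(X, gamma):
    """Return K with K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs."""
    sq_norms = (X ** 2).sum(axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for every pair at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from rounding
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))          # 5 illustrative points in 2-D
K = gaussian_kernel_matrix(X_demo, gamma=0.5)
print(K.shape)                            # one entry per pair of samples
```

The matrix has n × n entries, which is where the quadratic cost comes from.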

Implementation: classify data with a support vector machine and explore how different kernels, C values, and gamma values affect the result.

%matplotlib inline
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Generate random sample data: 80 samples, 2 centers, cluster spread 0.6
x,y= make_blobs(n_samples=80,centers=2,random_state=0,cluster_std=0.60)
# cluster_std is the spread of each cluster: larger means more scattered, smaller means tighter.
plt.scatter(x[:,0],x[:,1],c=y,s=50)

This produces a scatter plot of the two clusters.
Part of x:

array([[  1.65991049e+00,   3.56289184e+00],
       [  1.60841463e+00,   4.01800537e-01],
       [  2.77180174e-01,   4.84428322e+00],
       [  9.14338767e-01,   4.55014643e+00],
       [  2.15527162e+00,   1.27868252e+00],
       [  3.18515794e+00,   8.90082233e-02],
       [  1.81336135e+00,   1.63113070e+00],
       [  2.18023251e+00,   1.48364708e+00],
       [  2.42371514e+00,   1.45098766e+00],
       [  2.13141478e+00,   1.13885728e+00],
       [  4.88382309e-01,   3.26801777e+00],
       [  2.23421043e+00,   1.69349520e+00],

The values of y:

array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1])

Building the model

# Import the support vector machine model
from sklearn.svm import SVC # Support vector classifier
model = SVC(kernel='linear')# a basic linear support vector machine
model.fit(x,y)

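Plotting function

# Plot the decision boundary and margins of a fitted SVC on a 2-D axis
def plot_function(model,ax=None,plot_support=None):
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Build a 30x30 grid covering the current axis limits
    x = np.linspace(xlim[0],xlim[1],30)
    y = np.linspace(ylim[0],ylim[1],30)
    Y,X = np.meshgrid(y,x)
    xy = np.vstack([X.ravel(),Y.ravel()]).T
    # Evaluate the decision function at every grid point
    P = model.decision_function(xy).reshape(X.shape)
    # Draw the boundary (level 0) and the two margins (levels -1 and 1)
    ax.contour(X,Y,P,colors='k',levels=[-1,0,1],alpha=0.5,linestyles=['--','-','--'])

    # Optionally circle the support vectors; the explicit edge color keeps the hollow markers visible
    if plot_support:
        ax.scatter(model.support_vectors_[:,0],
                  model.support_vectors_[:,1],
                  s=300,linewidth=1,edgecolors='k',facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)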

Plot:

plt.scatter(x[:,0],x[:,1],c=y,s=50)
plot_function(model)

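The result shows that two-dimensional data with a clear boundary between the classes can be separated by a linear classifier, which yields a good set of support vectors.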

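The middle line is the decision boundary, and the points lying on the two margin lines are the support vectors. In scikit-learn they are stored in the support_vectors_ attribute of the fitted classifier.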

model.support_vectors_
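# Coordinates of the support vectors
# As long as the support vectors stay the same, the decision boundary stays the same:
# new points that fall on the correct side, outside the margin, do not change the fit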

The support vectors:

array([[ 2.33812285,  3.43116792],
       [ 0.44359863,  3.11530945],
       [ 1.35139348,  2.06383637],
       [ 1.53853211,  2.04370263]])
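One way to check that the boundary depends only on the support vectors is to refit on the support vectors alone and compare the two models. The sketch below assumes the same synthetic data as above and default `SVC` settings. The small tolerance in the comparison allows for the solver's numerical precision.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=0, cluster_std=0.60)
full = SVC(kernel='linear').fit(X, y)

# Keep only the support vectors and refit on that small subset
idx = full.support_
reduced = SVC(kernel='linear').fit(X[idx], y[idx])

# Both models should label every training point identically
same_predictions = bool((full.predict(X) == reduced.predict(X)).all())
print(len(idx), "support vectors; identical predictions:", same_predictions)
print(full.coef_, reduced.coef_)
```

Dropping every non-support sample leaves the separating line essentially unchanged, because those samples have α = 0 in the solution.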

Next, fit a linear classifier to data whose classes are interleaved and not linearly separable:

from sklearn.datasets import make_circles
x,y = make_circles(100,factor=.1,noise=.1)
# Generate ring-shaped sample data: a small inner cluster surrounded by an outer circle
clf = SVC(kernel='linear').fit(x,y)
plt.scatter(x[:,0],x[:,1],c=y,s=50)
plot_function(clf,plot_support=False)

With samples like these a linear SVM fails: no straight line separates the two classes, so a kernel is needed.
Visualizing what a kernel does:

from mpl_toolkits import mplot3d
r = np.exp(-(x**2).sum(1))
def plot_3D(elev=30,azim=30,x=x,y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(x[:,0],x[:,1],r,c=y,s=50)
    ax.view_init(elev=elev,azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
plot_3D(elev=45,azim=45,x=x,y=y)

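In other words, samples that cannot be separated in the low-dimensional space are mapped into a higher-dimensional one. Here the extra radial feature r is larger for the inner points, so a plane can separate the two classes.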

# Use a radial basis function (RBF) kernel
clf = SVC(kernel='rbf',C=1E6)
# Many kernels can lift data to a higher dimension; the Gaussian RBF is only one of them
clf.fit(x,y)
plt.scatter(x[:,0],x[:,1],c=y,s=50)
plot_function(clf)

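The Gaussian kernel turns the linear decision boundary into a nonlinear one, which separates the samples much more cleanly.
For samples that are hard to separate, mapping them to a higher dimension is therefore often enough.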
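The difference between the two kernels can also be expressed as a number rather than a picture. Here is a minimal sketch that compares training accuracy on ring-shaped data. The `random_state=0` argument is an added assumption so the run is repeatable, and default `SVC` settings are used instead of the `C=1E6` above.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Ring-shaped data: an inner cluster surrounded by an outer circle
X, y = make_circles(100, factor=.1, noise=.1, random_state=0)

# Training accuracy of a linear kernel versus an RBF kernel
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)
print("linear:", linear_acc, "rbf:", rbf_acc)
```

The linear model cannot do much better than guessing on this layout, while the RBF model separates the rings.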
Effect of C on the support vectors:

x,y= make_blobs(n_samples=80,centers=2,random_state=0,cluster_std=0.80)
# Data points generated with a larger spread (cluster_std=0.80), so the clusters are closer together
fig,ax = plt.subplots(1,2,figsize=(16,6))
fig.subplots_adjust(left=0.0625,right=0.95,wspace=0.1)
for axi,C in zip(ax,[10,0.1]):
    model = SVC(kernel='linear',C=C).fit(x,y)
    axi.scatter(x[:,0],x[:,1],c=y,s=50)
    plot_function(model,axi)
    axi.scatter(model.support_vectors_[:,0],
               model.support_vectors_[:,1],
               s=300,lw=1,edgecolors='k',facecolors='none')
    axi.set_title('C = {0:.1f}'.format(C),size=14)

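A large C leaves little tolerance: points are barely allowed to violate the margin, so the margin is narrow. A small C tolerates more violations, which gives a wider margin and more support vectors.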
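The same comparison can be made without a plot. The sketch below assumes the same synthetic data and the two C values used above. It reports the margin width 2 / ||w|| and the number of support vectors for each C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=0, cluster_std=0.80)

margin_width, n_support = {}, {}
for C in [10, 0.1]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    # For a linear SVM the distance between the two margin lines is 2 / ||w||
    margin_width[C] = 2.0 / np.linalg.norm(model.coef_)
    n_support[C] = len(model.support_)

print("margin width:", margin_width)
print("support vectors:", n_support)
```

Lowering C should never shrink the margin, because the penalty on slack becomes cheaper relative to the ||w||^2 term.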
Effect of gamma on the result:

x,y= make_blobs(n_samples=80,centers=2,random_state=0,cluster_std=1.1)
# cluster_std is the spread of each cluster; above 1 the two clusters overlap noticeably
# Generate the data points
fig,ax = plt.subplots(1,2,figsize=(16,6))
fig.subplots_adjust(left=0.0625,right=0.95,wspace=0.1)
for axi,gamma in zip(ax,[10,0.1]):
    model = SVC(kernel='rbf',gamma=gamma).fit(x,y)
    axi.scatter(x[:,0],x[:,1],c=y,s=50,cmap='autumn')
    plot_function(model,axi)
    axi.scatter(model.support_vectors_[:,0],
               model.support_vectors_[:,1],
               s=300,lw=1,edgecolors='k',facecolors='none')

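The larger gamma is, the narrower each sample's region of influence becomes. The boundary then follows the training points closely and becomes irregular, which fits the training data better but risks overfitting. The smaller gamma is, the smoother the boundary becomes, but overlapping classes are harder to separate.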
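The overfitting risk can be checked by comparing training accuracy with cross-validated accuracy. This is a minimal sketch on the same synthetic data, assuming default `SVC` settings apart from gamma and a 5-fold split chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=0, cluster_std=1.1)

train_acc, cv_acc = {}, {}
for gamma in [10, 0.1]:
    model = SVC(kernel='rbf', gamma=gamma)
    # Accuracy on the data the model was fitted on
    train_acc[gamma] = model.fit(X, y).score(X, y)
    # Mean accuracy on held-out folds (cross_val_score refits clones of the model)
    cv_acc[gamma] = cross_val_score(model, X, y, cv=5).mean()

print("train:", train_acc)
print("cross-validated:", cv_acc)
```

A training score well above the cross-validated score for the large gamma would be the numerical sign of the irregular, overfitted boundary.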
Face recognition with a support vector machine

from sklearn.datasets import fetch_lfw_people
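# Load the Labeled Faces in the Wild (LFW) face dataset
faces = fetch_lfw_people(min_faces_per_person=60)
# Keep only the people who have at least 60 face images
print(faces.target_names)# names of the people kept
print(faces.images.shape)# number of images and their pixel dimensions
# Note: the automatic download is very slow. Let it run until the compressed archive appears, then stop it, download the lfw archive and lfw-funnel.gz from the web, create a folder named lfw_funneled, and copy all the lfw photos into it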

Output:

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)

Display a few of the faces:

fig,ax=plt.subplots(3,5)
for i,axi in enumerate(ax.flat):
    axi.imshow(faces.images[i],cmap='bone')# show a sample of the faces
    axi.set(xticks=[],yticks=[],xlabel=faces.target_names[faces.target[i]])

Each pixel of an image is one dimension, giving 62 × 47 = 2914 dimensions in total, so dimensionality reduction is needed.

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=150,whiten=True,random_state=42)
# Reduce the data to 150 dimensions
svc = SVC(kernel='rbf',class_weight='balanced')
model = make_pipeline(pca,svc)
# Chain PCA and SVC into one estimator: an SVC that works on the PCA-reduced data

Split the data into a training set and a test set.
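The original does not show the split itself. Here is a minimal sketch, assuming scikit-learn's `train_test_split` with its default 25% test share, which matches the 337 test samples in the classification report further down. The `random_state=42` value is an arbitrary choice, and the arrays are random stand-ins with the same shapes as `faces.data` and `faces.target`, so the snippet runs without downloading the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays shaped like faces.data (1348 x 2914) and faces.target (1348,)
rng = np.random.default_rng(0)
data = rng.random((1348, 2914), dtype=np.float32)
target = rng.integers(0, 8, size=1348)

# Default test_size=0.25: 337 samples for testing, 1011 for training
x_train, x_test, y_train, y_test = train_test_split(data, target, random_state=42)
print(x_train.shape, x_test.shape)   # (1011, 2914) (337, 2914)
```

With the real dataset, `data` and `target` would be replaced by `faces.data` and `faces.target`.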
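A quick look at the data:
faces.data.shape# shape of the data; prints (1348, 2914)
faces.data# the data is a matrix: one row per image, one column per pixel, similar to handwritten-digit recognition with KNN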

array([[ 114.        ,  122.33333588,  134.        , ...,    2.66666675,
           4.33333349,    5.        ],
       [  42.        ,   62.        ,   73.        , ...,  254.        ,
         252.        ,  249.        ],
       [  73.66666412,   78.66666412,   71.        , ...,  195.33332825,
         233.66667175,  233.33332825],
       ..., 
       [  29.66666603,   30.33333397,   30.        , ...,  135.33332825,
         141.        ,  145.        ],
       [  55.66666794,   60.        ,   69.66666412, ...,   22.        ,
          15.        ,    8.        ],
       [ 137.        ,   55.33333206,   43.        , ...,   42.33333206,
          46.        ,   47.        ]], dtype=float32)

faces.target# the names encoded as integers, convenient for matrix operations
Output: array([1, 3, 3, ..., 7, 3, 5], dtype=int64)
Cross-validation to find the best C and gamma

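from sklearn.model_selection import GridSearchCV
# Parameters to search over; the svc__ prefix addresses the SVC step of the pipeline
param_grid = {'svc__C':[1,5,10],'svc__gamma':[0.0001,0.0005,0.001]}
# 5-fold cross-validated grid search over the pipeline; a score from a single split would be unreliable
grid = GridSearchCV(model,param_grid=param_grid,cv=5)
grid.fit(x_train,y_train)# x_train, y_train: the training portion of faces.data and faces.target
# grid_scores_ exists only in scikit-learn < 0.20; newer versions expose the same numbers through grid.cv_results_
grid.grid_scores_, grid.best_params_, grid.best_score_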

Result:

([mean: 0.19090, std: 0.16492, params: {'svc__C': 1, 'svc__gamma': 0.0001},
  mean: 0.59050, std: 0.10427, params: {'svc__C': 1, 'svc__gamma': 0.0005},
  mean: 0.78042, std: 0.01349, params: {'svc__C': 1, 'svc__gamma': 0.001},
  mean: 0.64985, std: 0.06542, params: {'svc__C': 5, 'svc__gamma': 0.0001},
  mean: 0.79426, std: 0.01482, params: {'svc__C': 5, 'svc__gamma': 0.0005},
  mean: 0.80712, std: 0.01576, params: {'svc__C': 5, 'svc__gamma': 0.001},
  mean: 0.79327, std: 0.01480, params: {'svc__C': 10, 'svc__gamma': 0.0001},
  mean: 0.78536, std: 0.01664, params: {'svc__C': 10, 'svc__gamma': 0.0005},
  mean: 0.79525, std: 0.01796, params: {'svc__C': 10, 'svc__gamma': 0.001}],
 {'svc__C': 5, 'svc__gamma': 0.001},
 0.80712166172106825)
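On scikit-learn 0.20 and later, where `grid_scores_` is no longer available, a similar listing can be produced from `cv_results_`. The sketch below substitutes the small bundled digits dataset for the face data so that it runs without a download. The subset size, the number of PCA components, and the gamma grid are assumptions chosen for that stand-in and not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Small bundled dataset as a stand-in for the face data
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

model = make_pipeline(PCA(n_components=20, whiten=True, random_state=42),
                      SVC(kernel='rbf', class_weight='balanced'))
param_grid = {'svc__C': [1, 5, 10], 'svc__gamma': [0.001, 0.01, 0.1]}
grid = GridSearchCV(model, param_grid=param_grid, cv=5).fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter combination
results = grid.cv_results_
for mean, std, params in zip(results['mean_test_score'],
                             results['std_test_score'],
                             results['params']):
    print("mean: %.5f, std: %.5f, params: %r" % (mean, std, params))
print(grid.best_params_, grid.best_score_)
```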

from sklearn.metrics import confusion_matrix,classification_report,recall_score,accuracy_score
# The grid search exposes the best model it found
model = grid.best_estimator_
# Use the best model to predict on the test set
y_pred=model.predict(x_test)
cnf_matrix =confusion_matrix(y_test,y_pred)
# Print floats with 2 decimal places
np.set_printoptions(precision=2)
print(accuracy_score(y_test,y_pred))
cnf_matrix

The accuracy and the confusion matrix:

0.747774480712
Out[29]:
array([[ 13,   1,   1,   1,   0,   0,   0,   0],
       [  0,  51,   0,   2,   0,   0,   0,   1],
       [  1,   0,  23,   6,   3,   1,   0,   0],
       [  4,   4,   3, 105,   5,   4,   1,  10],
       [  0,   0,   0,   2,  18,   3,   1,   3],
       [  1,   1,   0,   4,   3,   7,   2,   0],
       [  1,   1,   0,   1,   1,   0,  10,   1],
       [  0,   5,   1,   2,   3,   1,   0,  25]], dtype=int64)

Classification report

from sklearn.metrics import classification_report
# Print per-class precision, recall, and F1 score
print(classification_report(y_test,y_pred,target_names=faces.target_names))

Result:

                   precision    recall  f1-score   support

     Ariel Sharon       0.65      0.81      0.72        16
     Colin Powell       0.81      0.94      0.87        54
  Donald Rumsfeld       0.82      0.68      0.74        34
    George W Bush       0.85      0.77      0.81       136
Gerhard Schroeder       0.55      0.67      0.60        27
      Hugo Chavez       0.44      0.39      0.41        18
Junichiro Koizumi       0.71      0.67      0.69        15
       Tony Blair       0.62      0.68      0.65        37

      avg / total       0.76      0.75      0.75       337

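Precision = correctly predicted positives (TP) / all samples predicted as positive (TP + FP)
Recall = correctly predicted positives (TP) / all samples that are actually positive (TP + FN)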
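These definitions can be checked against the confusion matrix printed earlier. In scikit-learn's layout rows are true classes and columns are predicted classes, so column sums give TP + FP and row sums give TP + FN. The sketch below copies the matrix from the output above and recomputes the per-class figures.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[ 13,   1,   1,   1,   0,   0,   0,   0],
               [  0,  51,   0,   2,   0,   0,   0,   1],
               [  1,   0,  23,   6,   3,   1,   0,   0],
               [  4,   4,   3, 105,   5,   4,   1,  10],
               [  0,   0,   0,   2,  18,   3,   1,   3],
               [  1,   1,   0,   4,   3,   7,   2,   0],
               [  1,   1,   0,   1,   1,   0,  10,   1],
               [  0,   5,   1,   2,   3,   1,   0,  25]])

tp = np.diag(cm)                     # correctly classified samples per class
precision = tp / cm.sum(axis=0)      # column sums = TP + FP
recall = tp / cm.sum(axis=1)         # row sums = TP + FN
accuracy = tp.sum() / cm.sum()       # 252 correct out of 337

print(np.round(precision, 2))
print(np.round(recall, 2))
print(accuracy)
```

For Ariel Sharon, for example, this gives precision 13/20 = 0.65 and recall 13/16 ≈ 0.81, matching the first row of the report.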
