02 Perceptron - Statistical Learning Methods

Last year I spent four months working through Andrew Ng's deep learning courses, so I have some grounding in deep learning, but I never studied machine learning itself systematically. A friend recommended two classic machine learning textbooks, Li Hang's Statistical Learning Methods and Zhou Zhihua's "watermelon book", so I bought both online. They have been sitting on my shelf for two months; I'm now starting with Li Hang's book, which my friend says is the easier entry point. I'll follow the same routine I used for deep learning:

  • Read the book and take notes
  • Do the exercises
  • Write it up on the blog

The perceptron: a linear model for binary classification. Its input is an instance's feature vector; its output is the instance's class, one of the two values +1 and -1.

Key points:

  • 1. When the training data are linearly separable, the primal form of the perceptron learning algorithm converges. The convergence proof is worth trying to derive yourself; without the standard idea, proving convergence from scratch would clearly be very hard. When the training set is not linearly separable, the algorithm does not converge and the iterates oscillate.
  • 2. In computing the line wx + b, the perceptron has in effect already applied a transformation that turns the separating line into the x-axis; the two sides of the line become "above" and "below" the axis, i.e. positive and negative.
  • 3. Because of that transformation, after rotating the picture so that wx + b becomes the x-axis, every point's distance to the x-axis is simply the value of wx + b. For points below the axis we need the absolute value |wx + b|. We want to sum the distances of all misclassified points, i.e. the total of |wx + b|, and minimize it. But that objective is trivially gamed: shrink w and b by the same factor, say to 0.5w and 0.5b, and the line stays exactly where it is while the value halves, and you can keep shrinking toward zero. So we add a constraint and divide the whole expression by the norm ||w||: whatever scale w has, we divide by its length. If w and b shrink proportionally, ||w|| shrinks by the same factor and the quotient does not move at all. Before dividing by the norm, |wx + b| is called the functional margin; after dividing, the geometric margin. The geometric margin is the actual physical distance, which no rescaling of the parameters can change. In machine learning, distances are usually measured by the geometric margin; otherwise no solution could be pinned down.
  • 4. The dual form of the perceptron doesn't seem to have anything special about it beyond precomputing the Gram matrix, which speeds up training.
  • 5. The perceptron cannot represent XOR (see the blog post). In fact, no linear classifier can solve the XOR problem, not just the perceptron.
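Point 3 can be checked numerically: scaling (w, b) changes the functional margin but leaves the geometric margin untouched. A minimal sketch (the particular w, b, and point are arbitrary choices for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])   # arbitrary weight vector, ||w|| = 5
b = 1.0
x = np.array([2.0, 1.0])   # an arbitrary point

def functional_margin(w, b, x):
    return abs(np.dot(w, x) + b)

def geometric_margin(w, b, x):
    return functional_margin(w, b, x) / np.linalg.norm(w)

# shrink w and b by the same factor: the line w.x + b = 0 is unchanged
f1, g1 = functional_margin(w, b, x), geometric_margin(w, b, x)
f2, g2 = functional_margin(0.5 * w, 0.5 * b, x), geometric_margin(0.5 * w, 0.5 * b, x)

print(f1, f2)  # functional margin halves: 11.0 vs 5.5
print(g1, g2)  # geometric margin is unchanged: 2.2 vs 2.2
```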

For the details, see the book's own exposition; the material in this chapter is not very hard. There is also a blog post with a nice introduction to the perceptron; points 2 and 3 above come from that post.

The exercise below uses the MNIST dataset: 28×28 grayscale images of handwritten digits, with 60,000 training images and 10,000 test images. See the detailed MNIST introduction for more.

1. Load the data

The data have already been downloaded from the official site and unzipped.

import numpy as np
import os

# training set: the IDX files are binary, so open them in 'rb' mode
with open('./minist_data/train-images.idx3-ubyte', 'rb') as f:
    loaded = np.fromfile(file=f, dtype=np.uint8)
    X_train = loaded[16:].reshape((60000, 784))  # skip the 16-byte header
    X_train = X_train.astype(np.int32)
print('X_train:', X_train.shape) # (60000, 784)


with open('./minist_data/train-labels.idx1-ubyte', 'rb') as f:
    loaded = np.fromfile(file=f, dtype=np.uint8)
    y_train = loaded[8:]                         # skip the 8-byte header
    y_train = y_train.astype(np.int32)
print('y_train:', y_train.shape) # (60000,)


# test set
with open('./minist_data/t10k-images.idx3-ubyte', 'rb') as f:
    loaded = np.fromfile(file=f, dtype=np.uint8)
    X_test = loaded[16:].reshape((10000, 784))
    X_test = X_test.astype(np.int32)
print('X_test:', X_test.shape) # (10000, 784)

with open('./minist_data/t10k-labels.idx1-ubyte', 'rb') as f:
    loaded = np.fromfile(file=f, dtype=np.uint8)
    y_test = loaded[8:].reshape((10000))
    y_test = y_test.astype(np.int32)
print('y_test:', y_test.shape) # (10000,)



X_train: (60000, 784)
y_train: (60000,)
X_test: (10000, 784)
y_test: (10000,)
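The offsets 16 and 8 in the slicing above skip the IDX file headers: a magic number plus the dimension sizes, stored as big-endian 32-bit integers. A sketch that parses such a header explicitly, run here on a hand-built buffer rather than the real files:

```python
import struct
import numpy as np

# fabricate a tiny IDX3-style buffer: header (magic 2051, 2 images, 2x2 pixels)
header = struct.pack('>IIII', 2051, 2, 2, 2)
pixels = bytes(range(2 * 2 * 2))          # 8 pixel bytes, values 0..7
buf = header + pixels

magic, n_images, n_rows, n_cols = struct.unpack_from('>IIII', buf, 0)
images = np.frombuffer(buf, dtype=np.uint8, offset=16).reshape(n_images, n_rows * n_cols)

print(magic, images.shape)   # 2051 (2, 4)
```

Label files carry only two header integers (magic 2049 and the item count), which is why the label arrays start at offset 8.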

2. Inspect the data

Check the data's type, shape, and raw contents, then display it as an image.

import matplotlib.pyplot as plt
%matplotlib inline

print (type(X_train),type(y_train))
print (X_train[0])
print (y_train[0])

img = X_train[0].reshape(28, 28)
plt.imshow(img, cmap='Greys', interpolation='nearest')
plt.show();
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   3  18  18  18 126 136 175  26 166 255
 247 127   0   0   0   0   0   0   0   0   0   0   0   0  30  36  94 154
 170 253 253 253 253 253 225 172 253 242 195  64   0   0   0   0   0   0
   0   0   0   0   0  49 238 253 253 253 253 253 253 253 253 251  93  82
  82  56  39   0   0   0   0   0   0   0   0   0   0   0   0  18 219 253
 253 253 253 253 198 182 247 241   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0  80 156 107 253 253 205  11   0  43 154
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0  14   1 154 253  90   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0 139 253 190   2   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0  11 190 253  70   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  35 241
 225 160 108   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  81 240 253 253 119  25   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0  45 186 253 253 150  27   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0  16  93 252 253 187
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0 249 253 249  64   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0  46 130 183 253
 253 207   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0  39 148 229 253 253 253 250 182   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0  24 114 221 253 253 253
 253 201  78   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0  23  66 213 253 253 253 253 198  81   2   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0  18 171 219 253 253 253 253 195
  80   9   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  55 172 226 253 253 253 253 244 133  11   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0 136 253 253 253 212 135 132  16
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
5

(figure: the first training image rendered as a 28×28 grayscale digit, a 5)

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(
    nrows=2,
    ncols=5,
    sharex=True,
    sharey=True, )

ax = ax.flatten()
for i in range(10):
    img = X_train[y_train == i][0].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()

(figure: one sample image for each digit 0-9)

fig, ax = plt.subplots(
    nrows=5,
    ncols=5,
    sharex=True,
    sharey=True, )

ax = ax.flatten()
for i in range(25):
    img = X_train[y_train == 7][i].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()

(figure: 25 sample images of the digit 7)

3. Binary perceptron classification on the digits 0 and 1

# binary classification: digit 0 vs digit 1
X_train1 = X_train[y_train == 0]
X_train2 = X_train[y_train == 1]
print (X_train1.shape)
print (X_train2.shape)
X_train12 = np.vstack([X_train1,X_train2])
print (X_train12.shape)

y12 = np.ones((X_train12.shape[0],1))
y12[:X_train1.shape[0],:] = -1

all_data = np.hstack([X_train12,y12])
print (all_data.shape)
print ('*'*30)

fig, ax = plt.subplots(
    nrows=2,
    ncols=5,
    sharex=True,
    sharey=True, )

ax = ax.flatten()
for i in range(10):
    img = X_train1[i].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')
(5923, 784)
(6742, 784)
(12665, 784)
(12665, 785)
******************************

(figure: the first ten images labeled 0)

# shuffle the data; use the first 80% for training, the last 20% for testing
np.random.shuffle(all_data)
XX_train = all_data[:int(0.8*all_data.shape[0]),:-1]
yy_train = all_data[:int(0.8*all_data.shape[0]),-1]
XX_test = all_data[int(0.8*all_data.shape[0]):,:-1]
yy_test = all_data[int(0.8*all_data.shape[0]):,-1]
print (XX_train.shape)
print (yy_train.shape)
print (XX_test.shape)
print (yy_test.shape)
print (yy_test[:10])
(10132, 784)
(10132,)
(2533, 784)
(2533,)
[-1. -1. -1. -1.  1. -1.  1.  1. -1. -1.]

The perceptron:

Model: the hypothesis space is the set of all linear classifiers defined on the feature space, i.e. the function family $\{f \mid f(x) = w \cdot x + b\}$.

Strategy: minimize the total distance from the misclassified points to the separating hyperplane $S$. With $M$ the set of misclassified points:

$\min\; L(w,b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$

$\nabla_w L(w,b) = -\sum_{x_i \in M} y_i x_i$
$\nabla_b L(w,b) = -\sum_{x_i \in M} y_i$

Algorithm: minimize the loss by stochastic gradient descent with learning rate $r$. For each point with $y_i (w \cdot x_i + b) \le 0$, update:

$w \leftarrow w + r\, y_i x_i$
$b \leftarrow b + r\, y_i$
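Before running this on MNIST, the update rule can be traced by hand on a tiny three-point dataset (the same data as the book's worked example: positive points (3,3) and (4,3), negative point (1,1), learning rate 1):

```python
import numpy as np

X = np.array([[3., 3.], [4., 3.], [1., 1.]])
y = np.array([1., 1., -1.])

w, b, r = np.zeros(2), 0.0, 1.0
changed = True
while changed:                      # cycle until no point is misclassified
    changed = False
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:
            w = w + r * yi * xi     # w <- w + r*y_i*x_i
            b = b + r * yi          # b <- b + r*y_i
            changed = True

print(w, b)   # converges to w = [1. 1.], b = -3.0, i.e. x1 + x2 - 3 = 0
```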

import time

def perceptron_pars(X_train, y_train, num_iterations, r):
    # primal-form perceptron trained by stochastic gradient descent
    n, m = X_train.shape
    w = np.zeros((1, m))
    b = 0
    for i in range(num_iterations):
        for j in range(n):
            # misclassified point: y_i * (w.x_i + b) <= 0
            if y_train[j] * (np.dot(w, X_train[j].T) + b) <= 0:
                w = w + r * y_train[j] * X_train[j]
                b = b + r * y_train[j]
        if i % max(1, num_iterations // 10) == 0:
            print('Run Percent:', float(i / num_iterations))

    return w, b

def cal_accuracy(X_test, y_test, w, b):
    # fraction of points that fall on the correct side of the hyperplane
    n, m = X_test.shape
    right_list = []
    for j in range(n):
        if y_test[j] * (np.dot(w, X_test[j].T) + b) >= 0:
            right_list.append(1)
    accruRate = len(right_list) / n
    return accruRate

def model(X_train,y_train,X_test,y_test,num_iterations,r):
    start = time.time()
    w,b = perceptron_pars (X_train,y_train,num_iterations,r)
    train_accruRate = cal_accuracy(X_train,y_train,w,b)
    test_accruRate = cal_accuracy(X_test,y_test,w,b)
    end = time.time()
    print ('train_accruRate:',train_accruRate)
    print ('test_accruRate:',test_accruRate)
    print ('Run time :',(end-start))
    return w,b

num_iterations = 30
w,b = model(XX_train,yy_train,XX_test,yy_test,num_iterations,r = 0.0001)
Run Percent: 0.0
Run Percent: 0.1
Run Percent: 0.2
Run Percent: 0.3
Run Percent: 0.4
Run Percent: 0.5
Run Percent: 0.6
Run Percent: 0.7
Run Percent: 0.8
Run Percent: 0.9
train_accruRate: 1.0
test_accruRate: 0.9980260560600079
Run time : 3.405052661895752
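Point 4 mentioned the dual form: there the weights are represented as $w = \sum_i \alpha_i y_i x_i$, and every inner product needed during training can be read off a precomputed Gram matrix $G_{ij} = x_i \cdot x_j$. A minimal sketch on the same three-point toy data rather than on MNIST:

```python
import numpy as np

X = np.array([[3., 3.], [4., 3.], [1., 1.]])
y = np.array([1., 1., -1.])
G = X @ X.T                          # Gram matrix, computed once up front

alpha, b, r = np.zeros(len(X)), 0.0, 1.0
changed = True
while changed:
    changed = False
    for i in range(len(X)):
        # f(x_i) = sum_j alpha_j * y_j * (x_j . x_i) + b, read from the Gram matrix
        if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
            alpha[i] += r            # a mistake on x_i bumps its coefficient
            b += r * y[i]
            changed = True

w = (alpha * y) @ X                  # recover w = sum_i alpha_i y_i x_i
print(alpha, b, w)   # alpha = [2. 0. 5.], b = -3.0, w = [1. 1.]
```

The recovered (w, b) matches the primal form, as it must; the speedup comes purely from never recomputing inner products.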

4. Binary classification with labels 0-4 as one class and 5-9 as the other

# with this split the error rate is still about 0.4 after 1000 iterations; the data are very likely not linearly separable
labels_train = []
labels_test = []
for i in range(len(y_train)):
    if y_train[i] >= 5:
        labels_train.append(1)
    else :
        labels_train.append(-1)

for i in range(len(y_test)):
    if y_test[i] >= 5:
        labels_test.append(1)
    else :
        labels_test.append(-1)
print (labels_train[:5])
print (labels_test[:5])
num_iterations = 30
w,b = model(X_train,labels_train,X_test,labels_test,num_iterations,r = 0.0001)
[1, -1, -1, -1, 1]
[1, -1, -1, -1, -1]
Run Percent: 0.0
Run Percent: 0.1
Run Percent: 0.2
Run Percent: 0.3
Run Percent: 0.4
Run Percent: 0.5
Run Percent: 0.6
Run Percent: 0.7
Run Percent: 0.8
Run Percent: 0.9
train_accruRate: 0.8119166666666666
test_accruRate: 0.8029
Run time : 19.708739757537842
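Since these two classes are not linearly separable, the plain perceptron keeps oscillating, so the accuracy after the last epoch is somewhat arbitrary. A common remedy (not from the book's algorithm, just a well-known variant) is the pocket algorithm: run the same updates, but keep "in the pocket" the best (w, b) seen so far. A sketch on synthetic overlapping data, with made-up hyperparameters:

```python
import numpy as np

rng = np.random.RandomState(0)
# two overlapping Gaussian blobs -> not linearly separable
X = np.vstack([rng.randn(50, 2) + 1.0, rng.randn(50, 2) - 1.0])
y = np.hstack([np.ones(50), -np.ones(50)])

def accuracy(w, b):
    return np.mean(np.sign(X @ w + b) == y)

w, b = np.zeros(2), 0.0
best_w, best_b, best_acc = w.copy(), b, accuracy(w, b)

for epoch in range(20):
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:
            w = w + 0.1 * yi * xi
            b = b + 0.1 * yi
    acc = accuracy(w, b)
    if acc > best_acc:               # pocket the best parameters so far
        best_w, best_b, best_acc = w.copy(), b, acc

print(best_acc >= accuracy(w, b))   # the pocketed pair is never worse: True
```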

5. Perceptron practice on constructed data

import numpy as np
np.random.seed(2)
data1 = np.random.uniform(2,10,(2,50))

x2 = np.random.uniform(-8,-2,50)
y2 = np.random.uniform(-4,4,50)
data2 = np.vstack([x2,y2])

plt.figure(figsize=(10,6), dpi=80)
plt.scatter(data1[0], data1[1],alpha=0.5)
plt.scatter(data2[0], data2[1],alpha=0.5)

# move the axis spines so they cross at the origin
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
plt.show();

(figure: scatter plot of the two constructed point clusters)

X_train = np.hstack([data1,data2]).T
print (X_train.shape)
y_train = np.ones((1,100))
y_train[:,50:] = -1
y_train = y_train.flatten()
print (y_train)
num_iterations = 10
w,b = model(X_train,y_train,X_train,y_train,num_iterations,r = 0.1)
(100, 2)
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Run Percent: 0.0
Run Percent: 0.1
Run Percent: 0.2
Run Percent: 0.3
Run Percent: 0.4
Run Percent: 0.5
Run Percent: 0.6
Run Percent: 0.7
Run Percent: 0.8
Run Percent: 0.9
train_accruRate: 1.0
test_accruRate: 1.0
Run time : 0.035977840423583984
plt.figure(figsize=(6,6), dpi=80)
plt.scatter(data1[0], data1[1],alpha=0.5)
plt.scatter(data2[0], data2[1],alpha=0.5)

x = np.linspace(-3,3,50)
w = w.flatten()
y =(-w[0]*x-b)/(w[1]+0.000001)
plt.plot(x, y,color = 'red')

# move the axis spines so they cross at the origin
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
plt.show();

(figure: the two clusters with the learned separating line drawn in red)
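The red line is the set of points with $w \cdot x + b = 0$; solving for the second coordinate gives $x_2 = (-w_1 x_1 - b) / w_2$, which is exactly the expression used in the plotting cell (the tiny constant added to w[1] there only guards against division by zero). A quick check with an assumed (w, b), not the trained ones:

```python
import numpy as np

w = np.array([2.0, -1.0])            # assumed weights for illustration
b = 0.5
x1 = np.linspace(-3, 3, 50)
x2 = (-w[0] * x1 - b) / w[1]         # boundary: w[0]*x1 + w[1]*x2 + b = 0

# every generated point lies exactly on the separating line
print(np.allclose(w[0] * x1 + w[1] * x2 + b, 0))   # True
```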

Reposted from blog.csdn.net/weixin_42432468/article/details/93532022