Statistical Learning Methods and Python Implementation (1): Perceptron


   iwehdio's blog on cnblogs: https://www.cnblogs.com/iwehdio/

 

1. Definitions

  Suppose the input space (feature space) is X ⊆ R^n, so that an input x ∈ X is an n-dimensional feature vector, and the output space is Y = {+1, -1}, where the output y represents the class of an instance. The function from the input space to the output space

  f(x) = sign(w·x + b)

  is called the perceptron.

  The model parameters are the weight vector w and the bias b; w·x denotes the inner product of w and x. sign(x) is the sign function, which takes the value +1 for x ≥ 0 and -1 for x < 0.

  The perceptron classifies by means of the separating hyperplane determined by the linear equation w·x + b = 0: substituting a feature vector into w·x + b, the sign of the result assigns the instance to the positive or the negative class.
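
  As a minimal sketch, the decision function can be written in a couple of lines of NumPy (the name perceptron_predict and the variables w, b, x are purely illustrative):

import numpy as np

# Perceptron decision function f(x) = sign(w·x + b)
def perceptron_predict(w, b, x):
    # sign convention from above: +1 if w·x + b >= 0, otherwise -1
    return 1 if np.dot(w, x) + b >= 0 else -1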

 

2. Learning strategy

  Clearly, the perceptron is a linear classifier. Therefore, to obtain a good classification result, the data set is required to be linearly separable, i.e. there must exist a hyperplane that divides the two classes of the data set completely correctly.

  Once the model has been defined, the first step of learning is to define a loss function.

  1. One option is the total number of misclassified points. However, this loss function is not a continuous function of the parameters, so it is not easy to optimize.

  2. Another option is the total distance from the misclassified points to the hyperplane. It characterizes the degree of misclassification over the whole data set and is a continuous function of the parameters.

  The total distance from the misclassified points to the hyperplane is: -(1/||w||) Σ_{xi∈M} yi(w·xi + b).

  Dropping the factor 1/||w||, the loss function is (M is the set of misclassified points): L(w, b) = -Σ_{xi∈M} yi(w·xi + b).

  The learning problem then becomes an optimization problem: find the values of w and b that minimize the loss function L. Stochastic gradient descent is chosen, which requires the gradients of the loss function L with respect to the parameters w and b, i.e. the partial derivatives of L with respect to w and b.

  It can be seen that, for a single misclassified sample (x, y), the gradient of L with respect to w is -y·x and the gradient with respect to b is -y.
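
  To illustrate these gradients, a single stochastic gradient step can be sketched as follows (sgd_step and eta are illustrative names and not part of the implementation later in the post):

import numpy as np

def sgd_step(w, b, x, y, eta):
    # Only a misclassified sample (y * (w·x + b) <= 0) triggers an update:
    # w <- w + eta*y*x  (the negative gradient of L w.r.t. w is y*x)
    # b <- b + eta*y    (the negative gradient of L w.r.t. b is y)
    if y * (np.dot(w, x) + b) <= 0:
        w = w + eta * y * x
        b = b + eta * y
    return w, b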

  The original form of the perceptron algorithm:

  a. Take arbitrary initial values w0 and b0 for the parameters, which determine an initial hyperplane.

  b. Select a sample from the data set, with input x and label y.

  c. Check whether y(w·x + b) ≤ 0, i.e. whether the sample is misclassified. If so, go to the next step; otherwise return to step b.

  d. Update the parameters w and b according to the gradients and the learning rate η, i.e. w ← w + ηyx and b ← b + ηy, until the selected misclassified point is correctly classified.

   e. Go back to step b, until there are no misclassified points left in the data set, or the accuracy or the loss function L reaches a preset threshold. This yields the model f(x) = sign(w·x + b).

  The idea of the original form of the perceptron algorithm is to keep selecting samples from the data set and updating the parameters until the misclassified samples are all correctly classified. A minimal sketch of this loop is given below.
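
  A minimal, self-contained sketch of this loop on a tiny synthetic data set (the data and names are purely illustrative and are not the MNIST experiment described below):

import numpy as np

# Toy linearly separable data: two samples per class
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])

w, b, eta = np.zeros(2), 0.0, 1.0
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified point
            w += eta * yi * xi              # w <- w + eta*y*x
            b += eta * yi                   # b <- b + eta*y
            errors += 1
    if errors == 0:                         # no misclassified points left
        break

print(w, b)   # parameters of the separating hyperplane w·x + b = 0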

 

 

3. Convergence

  If the data set is linearly separable, then there exists a hyperplane w_opt·x + b_opt = 0, with ||w_opt|| = 1, that classifies every sample of the data set correctly, i.e. yi(w_opt·xi + b_opt) > 0 for every sample.

  Then: a. There exists a lower bound γ > 0 such that yi(w_opt·xi + b_opt) ≥ γ for all samples under this ideal hyperplane.

     b. Let R = max ||x̂i||, where x̂i = (xiᵀ, 1)ᵀ is the augmented input vector; then the number of misclassification updates k made by the perceptron on the data set is bounded: k ≤ (R/γ)².

  That is, for a linearly separable data set, a hyperplane that completely separates the data can always be found after a finite number of updates. The proof proceeds mainly by establishing the relationship between two consecutive parameter updates and applying it recursively.
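
  For reference, a sketch of the two recursive inequalities that complete this argument (the standard Novikoff proof, written here in LaTeX with augmented vectors \hat{w} = (w^T, b)^T and \hat{x} = (x^T, 1)^T, where \hat{w}_k is the parameter vector after the k-th update):

\hat{w}_k \cdot \hat{w}_{opt} \ge k \eta \gamma
\qquad \text{and} \qquad
\|\hat{w}_k\|^2 \le k \eta^2 R^2,
\quad \text{so} \quad
k \eta \gamma \le \hat{w}_k \cdot \hat{w}_{opt} \le \|\hat{w}_k\| \le \sqrt{k}\, \eta R
\;\Longrightarrow\;
k \le \left( \frac{R}{\gamma} \right)^{2}.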

 

4. Dual form

  The basic idea of the dual form is to represent the perceptron parameters w and b as linear combinations of the training instances xi and their labels yi, and then obtain w and b by solving for the coefficients of this combination.

  From the original form of the algorithm it follows that, if ni denotes the number of parameter updates triggered by the i-th sample before it is correctly classified, and we write αi = ni·η, then the finally learned parameters are w = Σi αi·yi·xi and b = Σi αi·yi.

   The dual form of the perceptron algorithm (a minimal sketch of the procedure is given after the steps):

  a. Set the initial values of α and b to 0.

  b. Select a sample (xi, yi) from the data set, with input xi and label yi.

  c. Check whether yi(Σj αj·yj·(xj·xi) + b) ≤ 0; if so, update the parameters: αi ← αi + η, b ← b + η·yi.

   d. Go back to step b, until there are no misclassified points left in the data set, or the accuracy or the loss function L reaches a preset threshold. This yields the model f(x) = sign(Σj αj·yj·(xj·x) + b).
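
  A corresponding minimal sketch of the dual form on the same kind of toy data (again purely illustrative; the MNIST implementation follows in section 6):

import numpy as np

# Toy linearly separable data, as in the earlier sketch
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])

G = X @ X.T                       # Gram matrix of inner products xi·xj
alpha, b, eta = np.zeros(len(X)), 0.0, 1.0

for epoch in range(100):
    errors = 0
    for i in range(len(X)):
        # the decision value only needs inner products, read from the Gram matrix
        if y[i] * (np.sum(alpha * y * G[i]) + b) <= 0:
            alpha[i] += eta       # alpha_i <- alpha_i + eta
            b += eta * y[i]       # b <- b + eta*y_i
            errors += 1
    if errors == 0:
        break

w = (alpha * y) @ X               # recover w = sum_i alpha_i*y_i*x_i
print(alpha, w, b)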

 

5. Python implementation of the original form

 

  The MNIST handwritten digit data set is used: 60000 training samples and 10000 test samples of the handwritten digits 0-9, each of which can be reshaped into a 28×28 matrix.

   For this first attempt, the full 28*28-dimensional feature vector is used as the input.

  The first step is to read in the data. Keras is used to download and load the data set, and the samples are divided into two classes: '0' and non-'0'.

from tensorflow.keras.datasets import mnist
import numpy as np

# Load the MNIST data
(train_data, train_label), (test_data, test_label) = \
    mnist.load_data(r'E:\code\statistical_learning_method\Data_set\mnist.npz')


train_length = 60000
test_length = 10000

# Relabel: non-'0' digits get label +1, '0' digits get label -1
train_data = train_data[:train_length].reshape(train_length, 28 * 28)
train_label = np.array(train_label, dtype='int8')
for i in range(train_length):
    if train_label[i] != 0:
        train_label[i] = 1
    else:
        train_label[i] = -1
train_label = train_label[:train_length].reshape(train_length, )

test_data = test_data[:test_length].reshape(test_length, 28 * 28)
test_label = np.array(test_label, dtype='int8')
for i in range(test_length):
    if test_label[i] != 0:
        test_label[i] = 1
    else:
        test_label[i] = -1
test_label = test_label[:test_length].reshape(test_length, )

 

  The second step is to initialize the parameters and write a test function.

# Evaluate the model's loss and accuracy on a given data set
def test(w, b, data, label):
    loss, acc = 0, 0
    for n in range(data.shape[0]):
        x = np.mat(data[n]).T
        y = label[n]
        L = y * (w * x + b)
        L = L[0, 0]
        if L <= 0:
            loss -= L
        else:
            acc += 1
    loss /= data.shape[0] * np.linalg.norm(w)
    acc /= data.shape[0]
    print('loss', loss, '\t', 'acc', acc, '\n')
    return loss, acc

# Initialize the parameters
w_init = 0.01 * np.random.random([28*28])
b_init = 0.01 * np.random.random(1)[0]
yita = 1e-9    # learning rate
w = np.mat(w_init)
b = b_init

  

  Finally, train the perceptron model.

for k in range(200):

    w_temp = np.mat(np.zeros(28*28)).reshape(-1, 1)
    b_temp = 0
    # Shuffle the training set randomly
    rand = [i for i in range(train_length)]
    np.random.shuffle(rand)
    train_data_temp = train_data[rand]
    train_label_temp = train_label[rand]

    # Train the model
    for index in range(train_length):
        x = np.mat(train_data_temp[index]).T
        y = train_label_temp[index]

        # Loss term for this sample (non-positive means misclassified)
        L = y * (w * x + b) / np.linalg.norm(w)
        L = L[0, 0]

        # Update the parameters (w_temp and b_temp accumulate the updates within this epoch)
        if L <= 0:
            w_temp += yita * x * y
            b_temp += yita * y
            w += w_temp.T
            b += b_temp
    print('time', k)

    # Performance on the training and test sets
    train_loss, train_acc = test(w, b, train_data, train_label)
    test_loss, test_acc = test(w, b, test_data, test_label)

   After 200 iterations, the final results are:

 

   The accuracy is 0.9912 on the 60000-sample training set and 0.9911 on the 10000-sample test set.

 

6. Python implementation of the dual form

  The data are loaded in the same way as in the previous section, except that only the first 1000 training samples and 1000 test samples are used.

from tensorflow.keras.datasets import mnist
import numpy as np

(train_data, train_label), (test_data, test_label) = \
    mnist.load_data(r'E:\code\statistical_learning_method\Data_set\mnist.npz')


train_length = 1000
test_length = 1000

train_data = train_data[:train_length].reshape(train_length, 28 * 28)
train_label = np.array(train_label, dtype='int8')


for i in range(train_length):
    if train_label[i] != 0:
        train_label[i] = 1
    else:
        train_label[i] = -1
train_label = train_label[:train_length].reshape(train_length, )

test_data = test_data[:test_length].reshape(test_length, 28 * 28)
test_label = np.array(test_label, dtype='int8')
for i in range(test_length):
    if test_label[i] != 0:
        test_label[i] = 1
    else:
        test_label[i] = -1
test_label = test_label[:test_length].reshape(test_length, )

# Build the Gram matrix of inner products between the training samples
# (note: train_data is stored as uint8, so these inner products can overflow;
#  casting to a wider dtype such as float64 may be needed here)
G = []
for i in range(train_length):
    G_temp = np.zeros(train_length)
    for j in range(train_length):
        G_temp[j] = np.mat(train_data[i]) * np.mat(train_data[j]).T
    G.append(G_temp)

# Decision value w·x_j = sum_k alpha_k * y_k * (x_k · x_j), looked up via the Gram matrix
def sigma(xj, ai):

    sum0 = 0
    for k in range(train_length):
        sum0 += ai[k] * train_label[k] * G[k][xj]
    return sum0

# Accuracy on the training set (uses the precomputed Gram matrix)
def acc_in_train(ai):
    acc = 0
    for t in range(train_length):
        yi = train_label[t]
        if yi * (sigma(t, ai) + b) > 0:
            acc += 1
    return acc / train_length


# Accuracy on the test set (inner products with the test samples computed directly)
def acc_in_test(ai):
    acc = 0
    for t in range(test_length):
        sum1 = 0
        for xi in range(train_length):
            sum1 += ai[xi] * train_label[xi] * np.mat(train_data[xi]) * np.mat(test_data[t]).T
        yi = test_label[t]
        if yi * (sum1 + b) > 0:
            acc += 1
    return acc / test_length


a = np.zeros(train_length)    # alpha coefficients, one per training sample
b = 0                         # bias
yita = 1e-6                   # learning rate

# Train the model (dual form)
for i in range(200000):
    rand = [i for i in range(train_length)]
    np.random.shuffle(rand)
    for j in range(train_length):
        index = rand[j]
        y = train_label[index]
        L = y * (sigma(index, a) + b)
        if L <= 0:
            a[index] += yita
            b += yita * y
            print(index, L)

    print(acc_in_train(a))

print(acc_in_test(a))

  However, the dual form of the algorithm does not perform well: its training-set accuracy peaks at 0.89, and many attempted fixes brought no improvement. Numerically, the reason may be that the parameter w is too large, so each parameter update has only a very small effect on the loss function.

 

7. Other questions

  a. Why can the norm ||w|| of w be fixed to 1 when computing the loss function?

    Because the perceptron only cares about the sign of the loss function, not its magnitude, fixing the norm ||w|| of w to 1 simply makes the computation easier. In fact, for the geometric quantity Q = L/||w||, the total distance from the misclassified points to the hyperplane, we take L as the loss function and minimize it; the minimizer of L does not necessarily coincide with the minimizer of Q or with the maximizer of ||w||, but the perceptron does not care about this, requiring only that L be driven to 0 when the data set is linearly separable. The solution found in this way is therefore not necessarily unique, nor necessarily optimal.

  b. What is the advantage of the dual form?

    Because the Gram matrix of inner products over the training set can be computed in advance, training can be faster. (But then why does it perform worse than the original form here? Or is the implementation wrong?)

 

 

 Reference: Li Hang, Statistical Learning Methods (2nd Edition)

   A summary of the perceptron: https://www.cnblogs.com/pinard/p/6042320.html#!comments

 

iwehdio's blog on cnblogs: https://www.cnblogs.com/iwehdio/

 

 
