引言

本文为《机器学习实战》的读书笔记，增加了一些个人的理解。

基于逻辑回归和Sigmoid函数的分类：

假设有一些数据点，用一条直线对这些点进行拟合，这个拟合过程就称为回归，如下图所示：

在这里插入图片描述

在两个分类的情况下，我们想要一个函数，能输出所有输入的类别。
而Sigmoid函数它的输出值的范围刚好是[0,1]，并且很平滑，数学上也容易处理。它的公式如下：

$\sigma(z) = \frac{1}{1+e^{-z}}$

下图给出了Sigmoid函数在不同坐标尺度下的两条曲线。当x为0时，该函数值为0.5，随着x的增大，该函数值会趋近于1;
而随着x的减小，函数值趋近于0。如果刻度足够大，Sigmoid函数看起来很像一个阶跃函数(step function,阶跃函数是一种特殊的连续时间函数，是一个从0跳变到1的过程)。

在这里插入图片描述

为了实现逻辑回归分类器，我们可以在每个特征上都乘以一个回归系数(weight)，然后把所有的结果值相加，
将总和代入Sigmoid函数中，得到一个范围在0~1之间的数值。大于0.5的数据被分入1类，小于0.5的被归入0类。
所以，逻辑回归也可以被看成是一种概率估计。

现在的问题变成了：最佳回归系数是多少？

基于最优化方法的最佳回归系数确定

Sigmoid函数的输入是z，而z是由下面公式给出：
$z = w_0x_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n$

如果采用向量的写法，可以写成 $z = w^Tx$ ，其中向量 $x$ 是分类器的输入数据，向量 $w$ 是我们要找到的最佳系数。

$z = [w_0,w_1,\cdots,w_n] \left[ \begin{matrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{matrix} \right] = w^Tx$

为了寻找最佳系数，需要用到最优化理论的一些知识。首先介绍梯度上升的最优化方法。

还有一种写法是

$z = w \cdot x + b = \sum_i w_ix_i + b$
上面的 $i$ 从1开始，如果令 $x_0=1,w_0=b$ ，则可以写成第一种形式。

梯度上升法(Gradient ascent)

梯度上升法基于的思想是：要找到某函数的最大值，最好的方式是沿着该函数的梯度方向寻找。
将梯度记为 $\nabla$ ，函数 $f(x,y)$ 的梯度由下表示：

$\nabla f(x,y) = \dbinom{ \frac{\partial f(x,y)}{\partial x}} {\frac{\partial f(x,y)}{\partial y}}$

这个梯度意味着要沿x的方向移动 $\frac{\partial f(x,y)}{\partial x}$ ，沿y的方向移动 $\frac{\partial f(x,y)}{\partial y}$ 。

在这里插入图片描述
梯度上升算法到达每个点后都会重新估计移动的方向。从P0开始，计算完该点的梯度，
函数就根据梯度移动到下一点P1。在P1点，梯度再次被重新计算，并沿着新的梯度方向
移动到P2。如此循环迭代，直到满足停止条件。在迭代的过程中，梯度算法总是保证我们
能选取到最佳的移动方向。

上图中的梯度上升算法沿着梯度方向移动了一步。可以看到，梯度算法总是指向函数值增长
最快的方向。这里所说的是移动方向，而未提到移动量的大小。该量值称为步长，记作 $\alpha$ 。
用向量来表示的话，梯度算法的迭代公式如下：
$w:= w + \alpha \nabla_w f(w)$

该公式将一直被迭代执行，直到达到某个停止条件为止，比如迭代次数达到某个指定值或算法达到
某个可以允许的误差范围。

梯度下降算法
将上面公式中的加法变成减法就是梯度下降算法。它是用来求函数的最小值的。

在这里插入图片描述

下面写一个逻辑回归分类器的例子，数据集如上图所示。

testSet.txt:

-0.017612	14.053064	0
-1.395634	4.662541	1
-0.752157	6.538620	0
-1.322371	7.152853	0
0.423363	11.054677	0
0.406704	7.067335	1
0.667394	12.741452	0
-2.460150	6.866805	1
0.569411	9.548755	0
-0.026632	10.427743	0
0.850433	6.920334	1
1.347183	13.175500	0
1.176813	3.167020	1
-1.781871	9.097953	0
-0.566606	5.749003	1
0.931635	1.589505	1
-0.024205	6.151823	1
-0.036453	2.690988	1
-0.196949	0.444165	1
1.014459	5.754399	1
1.985298	3.230619	1
-1.693453	-0.557540	1
-0.576525	11.778922	0
-0.346811	-1.678730	1
-2.124484	2.672471	1
1.217916	9.597015	0
-0.733928	9.098687	0
-3.642001	-1.618087	1
0.315985	3.523953	1
1.416614	9.619232	0
-0.386323	3.989286	1
0.556921	8.294984	1
1.224863	11.587360	0
-1.347803	-2.406051	1
1.196604	4.951851	1
0.275221	9.543647	0
0.470575	9.332488	0
-1.889567	9.542662	0
-1.527893	12.150579	0
-1.185247	11.309318	0
-0.445678	3.297303	1
1.042222	6.105155	1
-0.618787	10.320986	0
1.152083	0.548467	1
0.828534	2.676045	1
-1.237728	10.549033	0
-0.683565	-2.166125	1
0.229456	5.921938	1
-0.959885	11.555336	0
0.492911	10.993324	0
0.184992	8.721488	0
-0.355715	10.325976	0
-0.397822	8.058397	0
0.824839	13.730343	0
1.507278	5.027866	1
0.099671	6.835839	1
-0.344008	10.717485	0
1.785928	7.718645	1
-0.918801	11.560217	0
-0.364009	4.747300	1
-0.841722	4.119083	1
0.490426	1.960539	1
-0.007194	9.075792	0
0.356107	12.447863	0
0.342578	12.281162	0
-0.810823	-1.466018	1
2.530777	6.476801	1
1.296683	11.607559	0
0.475487	12.040035	0
-0.783277	11.009725	0
0.074798	11.023650	0
-1.337472	0.468339	1
-0.102781	13.763651	0
-0.147324	2.874846	1
0.518389	9.887035	0
1.015399	7.571882	0
-1.658086	-0.027255	1
1.319944	2.171228	1
2.056216	5.019981	1
-0.851633	4.375691	1
-1.510047	6.061992	0
-1.076637	-3.181888	1
1.821096	10.283990	0
3.010150	8.401766	1
-1.099458	1.688274	1
-0.834872	-1.733869	1
-0.846637	3.849075	1
1.400102	12.628781	0
1.752842	5.468166	1
0.078557	0.059736	1
0.089392	-0.715300	1
1.825662	12.693808	0
0.197445	9.744638	0
0.126117	0.922311	1
-0.679797	1.220530	1
0.677983	2.556666	1
0.761349	10.693862	0
-2.168791	0.143632	1
1.388610	9.341997	0
0.317029	14.739025	0

第一列是x坐标,第二列是y坐标,第三列是所属类别

绘图代码如下：

import numpy as np
import matplotlib.pyplot as plt


def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    fr.close()
    return dataMat, labelMat


def plotDataSet():
    """
    绘制数据集
    :return:
    """
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataMat)[0]
    xcord1 = []
    ycord1 = []
    xcord2 = []
    ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s = 20, c = 'red', marker = 's',alpha=.5)#绘制正样本
    ax.scatter(xcord2, ycord2, s = 20, c = 'green',alpha=.5)            #绘制负样本
    plt.title('DataSet')                                                #绘制title
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

训练：使用梯度上升算法找到最佳参数

上图中有100个样本，每个点包含两个数值型特征：X1和X2。我们将通过使用梯度上升法找到最佳回归系数。

梯度上升法的伪代码如下：

每个回归系数初始化为1
重复R次：
    计算整个数据集的梯度
    使用alpha*gradient更新回归系数的向量
    返回回归系数向量

下面的代码是梯度上升算法的具体实现。

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

def gradAscent(dataMatIn,classLabels):
    dataMatrix = np.mat(dataMatIn) #将二维数组转换成np 矩阵 (100x3)
    labelMat = np.mat(classLabels).transpose() # 将1x100的矩阵转置成 100x1的矩阵
    m ,n = np.shape(dataMatrix) # dataMatrix的维度 100 x 3
    alpha = 0.001 # 学习率
    maxCycles = 500# 迭代次数
    weights = np.ones((n,1)) # 3 x 1 的矩阵，元素都是1
    for k in range(maxCycles): #迭代maxCycles次
        h = sigmoid(dataMatrix * weights) #这里厉害了，利用矩阵运算，同时对这100个数据进行计算 返回的结果是100x1的矩阵
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error

    return weights.getA()

if __name__ == '__main__':
    dataMat,labelMat = loadDataSet()
    weights = gradAscent(dataMat,labelMat)
    plotBestFit(weights)

sigmoid(inX)方法描述的就是 $\sigma(z) = \frac{1}{1+e^{-z}}$ ，这里用inX代替z。

而gradAscent(dataMatIn,classLabels)描述的就是 $w:= w + \alpha \nabla_w f(w)$ 。

我们详细分析下该方法，它有两个参数。

dataMatIn是一个100行3列的二维数组(实际类型是list，list中的元素是元组，说它是二维数组好理解一些)：[[1.0, -0.017612, 14.053064], [1.0, -1.395634, 4.662541], ... 每一行数据都是[1.0,x轴坐标,y轴坐标]这种形式。这里的1.0是 $x_0$ 。

classLabels是一个一维数组，表示dataMatIn中每一行元素所属的类别。

h = sigmoid(dataMatrix * weights) ,dataMatrix是100x3的矩阵，weights是3x1的矩阵，矩阵乘的结果就是100x1的矩阵。再看 $z = w^Tx$ ， $w^T$ 就是weights，而dataMatrix就是样本矩阵，要注意的是它里面包含了所有的样本。
不要在意是 $xw^T$ 还是 $w^Tx$ ，要看这两个矩阵的表示形式。

dataMatrix的表示形式为：

$\left[\begin{matrix} x^1_0,x^1_1, x^1_2 \\ x^2_0,x^2_1, x^2_2 \\ \vdots \\ x^n_0,x^n_1, x^n_2 \end{matrix} \right]$
$x^1_0,x^1_1, x^1_2$ 表示的是第一个样本的三个特征，在这里是 $1,\quad-0.017612,\quad14.053064$

error = labelMat - h
weights = weights + alpha * dataMatrix.transpose() * error

要理解这两段代码，需要进行数学上的一些证明。

证明可见另一篇文章机器学习入门之逻辑回归

最后得到更新权重的公式为：
$w_i \leftarrow w_i + \alpha \sum_n (\hat{y}^n - f_{w,b}(x^n))x_i^n$

$\hat{y}^n$ 是实际的类别， $f_{w,b}(x^n)$ 是预测的类别。

可以将上述公式改写成矢量化版本：
$w \leftarrow w + \alpha X^T (\hat{y}^n - f_{w,b}(X))$

labelMat - h就是 $\hat{y}^n - f_{w,b}(X)$ ，结果还是100x1的矩阵

$\alpha$ 是一个常数，上面说过 $X$ 是100x3的矩阵，因此取 $X^T$ 变成3x100的矩阵，就能与100x1的矩阵相乘，得到的结果为3x1的矩阵，也就是3个特征的权重。

alpha * dataMatrix.transpose() * error就是 $\alpha X^T (\hat{y}^n - f_{w,b}(X))$

随机梯度上升

梯度上升算法在每次更新回归系数时都要遍历整个数据集，该方法在处理100个左右的数据集时（需要300次乘法）尚可，但如果有数十亿样本和成千上万的特征，那么该方法的计算复杂度就太高了。

一种改进方法是一次只用一个样本点来更新回归系数，这样就可以大大减少计算量了，该方法称为随机梯度上升算法。

接下来看代码：

...
import random as random

...

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)  # 100 X 3
    weights = np.ones(n)  # 3 X 1
    for j in range(numIter):  # 迭代次数
        dataIndex = list(range(m))  # 保存可用的索引
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # 每次迭代时需要调整
            randIndex = int(random.uniform(0, len(dataIndex)))  # 随机选取更新  随机的由来
            h = sigmoid(sum(dataMatrix[randIndex] * weights))  # 计算h
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del (dataIndex[randIndex])  # 删除已用样本
    return weights

def plotBestFit(weights):
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataMat)[0]
    xcord1 = []
    ycord1 = []
    xcord2 = []
    ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  # 绘制正样本
    ax.scatter(xcord2, ycord2, s=30, c='green')  # 绘制负样本
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]  # 0 = w0 + w1x1 + w2y  -> y = (-w0 -w1x1)/w2   z = 0时，sigmoid函数取中间值,也就是0.5

    ax.plot(x, y)

    plt.title('BestFit')  # 绘制title
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights = stocGradAscent1(np.array(dataMat), labelMat)
    plotBestFit(weights)

分类效果如下：

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-wJLNFucH-1576909370088)(_v_images/20191221100512445_23899.png)]

该算法的第一个改进之处在于，alpha每次迭代的时候都会调整，并且永远不会减小到0.

因为这里还存在一个常数项。必须这样做的原因是为了保证在多次迭代之后新数据仍然具有一定的影响。如果需要处理的问题是动态变化的，那么可以适当加大上述常数项，来确保新的值获得更大的回归系数。另一点值得注意的是，在降低alpha的函数中，alpha每次减少1/(j+i)，其中j是迭代次数，i是样本点的下标。

第二个改进的地方在于更新回归系数(最优参数)时，只使用一个样本点，并且选择的样本点是随机的，每次迭代不使用已经用过的样本点。这样的方法，就有效地减少了计算量，并保证了回归效果。100个样本就可以更新100次回归系数，而梯度上升算法只能更新一次，这就是为什么收敛的快一些。并且由于样本的选取是随机的，也就是更新顺序是随机的，这会影响整个分类的准确的，导致每次运行得出的准确率都不一样。

我们可以看到分类效果是不错的，但是无法看出迭代次数与回归系数的关系，也不能直观的看到每个回归方法的收敛情况。因此，我们来绘制出系数与迭代次数的关系：

# -*- coding: utf-8 -*
import random as random

import matplotlib.pyplot as plt
import matplotlib

import numpy as np


def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    fr.close()
    return dataMat, labelMat


def plotDataSet():
    """
    绘制数据集
    :return:
    """
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataMat)[0]
    xcord1 = []
    ycord1 = []
    xcord2 = []
    ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=20, c='red', marker='s', alpha=.5)  # 绘制正样本
    ax.scatter(xcord2, ycord2, s=20, c='green', alpha=.5)  # 绘制负样本
    plt.title('DataSet')  # 绘制title
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()


def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))


def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()  # 转置
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    maxCycles = 500
    weights = np.ones((n, 1))
    weights_array = np.array([])

    # theta := theta + alpha X^T(y^ - g(X(\theta))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # 1/(1 + \theta^T x)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error
        weights_array = np.append(weights_array, weights)
    weights_array = weights_array.reshape(maxCycles, n)

    return weights.getA(), weights_array


def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)  # 100 X 3
    weights = np.ones(n)  # 3 X 1
    weights_array = np.array([])  # 存储每次更新的回归系数

    for j in range(numIter):  # 迭代次数
        dataIndex = list(range(m))  # 保存可用的索引
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # 每次迭代时需要调整
            randIndex = int(random.uniform(0, len(dataIndex)))  # 随机选取更新
            h = sigmoid(sum(dataMatrix[randIndex] * weights))  # 计算h
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            weights_array = np.append(weights_array, weights, axis=0)  # 添加回归系数到数组中

            del (dataIndex[randIndex])  # 删除已用样本
    weights_array = weights_array.reshape(numIter * m, n)  # 改变维度

    return weights, weights_array


def plotBestFit(weights):
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataMat)[0]
    xcord1 = []
    ycord1 = []
    xcord2 = []
    ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  # 绘制正样本
    ax.scatter(xcord2, ycord2, s=30, c='green')  # 绘制负样本
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[
        2]  # 0 = w0 + w1x1 + w2y  -> y = (-w0 -w1x1)/w2   z = 0时，sigmoid函数取中间值,也就是0.5

    ax.plot(x, y)

    plt.title('BestFit')  # 绘制title
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()


def plotWeights(weights_array1, weights_array2):
    font = matplotlib.font_manager.FontProperties(fname='C:/Windows/Fonts/simkai.ttf') #中文字体，不然会出现乱码
    # 将fig画布分隔成1行1列,不共享x轴和y轴,fig画布的大小为(13,8)
    # 当nrow=3,nclos=2时,代表fig画布被分为六个区域,axs[0][0]
    fig, axs = plt.subplots(nrows=3, ncols=2, sharex=False, sharey=False, figsize=(20, 10))
    x1 = np.arange(0, len(weights_array1), 1)
    # 绘制w0与迭代次数的关系
    axs[0][0].plot(x1, weights_array1[:, 0])
    axs0_title_text = axs[0][0].set_title(u'改进的随机梯度上升算法：回归系数与迭代关次数系', FontProperties=font)
    axs0_ylabel_text = axs[0][0].set_ylabel(u'W0', FontProperties=font)
    plt.setp(axs0_title_text, size=20, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=20, weight='bold', color='black')
    # 绘制w1与迭代次数的关系
    axs[1][0].plot(x1, weights_array1[:, 1])
    axs1_ylabel_text = axs[1][0].set_ylabel(u'W1', FontProperties=font)
    plt.setp(axs1_ylabel_text, size=20, weight='bold', color='black')
    # 绘制w2与迭代次数的关系
    axs[2][0].plot(x1, weights_array1[:, 2])
    axs2_xlabel_text = axs[2][0].set_xlabel(u'迭代次数', FontProperties=font)
    axs2_ylabel_text = axs[2][0].set_ylabel(u'W2', FontProperties=font)
    plt.setp(axs2_xlabel_text, size=20, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=20, weight='bold', color='black')

    x2 = np.arange(0, len(weights_array2), 1)
    # 绘制w0与迭代次数的关系
    axs[0][1].plot(x2, weights_array2[:, 0])
    axs0_title_text = axs[0][1].set_title(u'梯度上升算法：回归系数与迭代次数关系', FontProperties=font)
    axs0_ylabel_text = axs[0][1].set_ylabel(u'W0', FontProperties=font)
    plt.setp(axs0_title_text, size=20, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=20, weight='bold', color='black')
    # 绘制w1与迭代次数的关系
    axs[1][1].plot(x2, weights_array2[:, 1])
    axs1_ylabel_text = axs[1][1].set_ylabel(u'W1', FontProperties=font)
    plt.setp(axs1_ylabel_text, size=20, weight='bold', color='black')
    # 绘制w2与迭代次数的关系
    axs[2][1].plot(x2, weights_array2[:, 2])
    axs2_xlabel_text = axs[2][1].set_xlabel(u'迭代次数', FontProperties=font)
    axs2_ylabel_text = axs[2][1].set_ylabel(u'W2', FontProperties=font)
    plt.setp(axs2_xlabel_text, size=20, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=20, weight='bold', color='black')

    plt.show()


if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights1, weights_array1 = stocGradAscent1(np.array(dataMat), labelMat)
    weights2, weights_array2 = gradAscent(dataMat, labelMat)

    plotWeights(weights_array1, weights_array2)

在这里插入图片描述

从上图可以看到，我们改进的随机梯度上升算法收敛效果更好。我们一共有100个样本点，改进的随机梯度上升算法迭代次数是150次，但上图显示15000次迭代次数，原因是使用一次样本就更新了一下回归系数。
迭代150次，每次使用100个样本，也就更新15000次。从上图左侧的改进的随机梯度上升算法可以看出，在更新2000次时回归系统已经收敛了。相当于遍历整个数据集20次的时候，回归系统已经收敛，训练完成。

在看右侧的梯度上升算法的效果，梯度上升算法每次更新回归系数要遍历整个数据集。当迭代次数为300多次的时候（遍历数据集第300次），回归系数才收敛。
并且在大的波动停止后，都会有一些小的周期性波动。因为存在一些不能正确分类的样本点，在每次迭代时引发系数的剧烈改变。

而改进的随机梯度上升算法可以减少周期性的波动。

上面的数据集没有说明实际的意义，下面我们将使用随机梯度上升算法来解决病马的生死预测问题。

从疝气病症预测病马的死亡率

本节将使用逻辑回归来预测患有疝病的马的存活问题。这里的数据包含368个样本和28个特征。这种并不一定源自马的胃肠问题，其他问题也可能引发马疝病。该数据集中包含了医院检测马疝病的一些指标，有的指标比较主观，有的指标难以测量，例如马的疼痛级别。另外需要说明的是，除了主观和难以测量的问题外，还有30%的值是缺失的。下面将首先介绍如何处理数据集中的数据缺失问题，然后再利用逻辑回归和随机梯度上升算法来预测病马的生死。

准备数据：处理数据中的缺失值

数据中的缺失值是个非常棘手的问题。数据缺失究竟带来了什么问题？假设有100个样本和20个特征，这些数据都是机器收集回来的。若机器上的某个传感器损坏导致一个特征无效时该怎么办？此时是否要扔掉整个数据？这种情况下，另外19个特征怎么办？它们是否还可用？答案是肯定的。因为有时候数据相当昂贵，扔掉和重新获取都是不可取的，所以必须采用一些方法来解决这个问题。

下面给出了一些可选的做法：

使用可用特征的均值来填补缺失值
使用特殊值来填补缺失值，如-1
忽略有缺失值的样本
使用相似样本的均值填补缺失值
使用另外的机器学习算法预测缺失值

现在，我们对数据集进行预处理，使其可以顺利地使用分类算法。在预处理阶段需要做两件时间：第一，所有的缺失值必须用一个实数值来替换，因为我们使用的Numpy数据类型不允许包含缺失值。这里选择0来替换所有的缺失值，恰好能适用于逻辑回归。这样做的直觉在于，我们需要的是一个在更新时不会影响系数的值。回归系数的更新公式如下：

weights = weights + alpha * error * dataMatrix[randIndex]

如果某特征对应值为0，那么该特征的系数将不做更新，即：

weights = weights

另外，由于sigmod(0)=0.5，即它对结果的预测不具有任何倾向性，因此不会对误差造成任何影响。此外，该数据集中的特征值一般不为0，因此在某种意义上说它也满足特殊值这个要求。

第二件事情是，如果在测试数据集中发现了一条数据的类别标签已经缺失，那么我们的简单做法是将该条数据丢弃。
因为类别标签和特征不同，很难确定采用某个适合的值来替换。采用逻辑回归进行分类时这种做法是合理的。

原始的数据集经过预处理后保存成两个文件：horseColicTraining.txt和horseColicTest.txt。

现在我们有一个干净可用的数据集合一个不错的优化算法，下面将这些部分融合在一起训练处一个分类器，然后利用该分类器来预测病马的生死问题。

测试算法：用逻辑回归进行分类

使用逻辑回归方法进行分类并不需要做很多工作，所需做的只是把测试集上每个特征向量乘以最优化方法得来的回归系数，再将该乘积结果求和，
最后输入到Sigmoid函数中即可。如果对应的Sigmoid值大于0.5就预测类别标签为1，否则为0。


# -*- coding: utf-8 -*
import random as random

import numpy as np
import time


def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))


def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()  # 转置
    m, n = np.shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500
    weights = np.ones((n, 1))
    # theta := theta + alpha X^T(y^ - g(X(\theta))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # 1/(1 + \theta^T x)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error

    return weights.getA()


def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)  # 100 X 3
    weights = np.ones(n)  # 3 X 1
    for j in range(numIter):  # 迭代次数
        dataIndex = list(range(m))  # 保存可用的索引
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # 每次迭代时需要调整
            randIndex = int(random.uniform(0, len(dataIndex)))  # 随机选取更新
            h = sigmoid(sum(dataMatrix[randIndex] * weights))  # 计算h
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del (dataIndex[randIndex])  # 删除已用样本
    return weights


# 分类函数
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt') #训练数据
    frTest = open('horseColicTest.txt') #测试数据
    trainingSet = [] #训练集
    trainingLabels = [] #训练集的类别标签
    for line in frTrain.readlines():
        curLine = line.strip().split('\t')
        lineArr = [] #其实是个list
        for i in range(21): #20个特征
            lineArr.append(float(curLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(curLine[21]))
    trainWeights = stocGradAscent1(np.array(trainingSet),trainingLabels,500) # 随机梯度上升算法得来的系数
    errorCount = 0
    numTestVec = 0.0#测试集的数据数量
    for line in frTest.readlines():
        numTestVec += 1.0
        curLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):  # 20个特征
            lineArr.append(float(curLine[i]))
        if int(classifyVector(np.array(lineArr),trainWeights)) != int(curLine[21]): #预测的类别与实际类别不同
            errorCount += 1 #增加分类错误次数
    errorRate = (float(errorCount) / numTestVec)
    print("the error rate of this test is : %.2f%%" % (errorRate * 100))
    return errorRate

def multiTest():
    numTests = 10
    errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the avg error rate is : %.2f%%" % (numTests,errorSum * 100/float(numTests)))

if __name__ == '__main__':
    start = time.clock()
    colicTest()
    end = time.clock()
    print("time cost %.2f s" % ( end - start))

输出结果为：

the error rate of this test is : 35.82%
time cost 1.28 s

错误挺高的，并且耗时1.28s，而且每次运行的错误率是不同的。因为首先数据集本身有30%的数据缺失，还有使用的是改进的随机梯度上升算法，因为数据集本身就很小，使用改进的随机梯度上升算法不合适。我们再用梯度上升算法试试：

# -*- coding: utf-8 -*
import random as random

import numpy as np
import time


def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))


def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()  # 转置
    m, n = np.shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500
    weights = np.ones((n, 1))
    # theta := theta + alpha X^T(y^ - g(X(\theta))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # 1/(1 + \theta^T x)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error

    return weights.getA()


def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)  # 100 X 3
    weights = np.ones(n)  # 3 X 1
    for j in range(numIter):  # 迭代次数
        dataIndex = list(range(m))  # 保存可用的索引
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01  # 每次迭代时需要调整
            randIndex = int(random.uniform(0, len(dataIndex)))  # 随机选取更新
            h = sigmoid(sum(dataMatrix[randIndex] * weights))  # 计算h
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del (dataIndex[randIndex])  # 删除已用样本
    return weights


# 分类函数
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.

def colicTest():
    frTrain = open('horseColicTraining.txt') #训练数据
    frTest = open('horseColicTest.txt') #测试数据
    trainingSet = [] #训练集
    trainingLabels = [] #训练集的类别标签
    for line in frTrain.readlines():
        curLine = line.strip().split('\t')
        lineArr = [] #其实是个list
        for i in range(21): #20个特征
            lineArr.append(float(curLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(curLine[21]))
    trainWeights = gradAscent(np.array(trainingSet),trainingLabels) # 21 X 1
    errorCount = 0
    numTestVec = 0.0#测试集的数据数量
    for line in frTest.readlines():
        numTestVec += 1.0
        curLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):  # 20个特征
            lineArr.append(float(curLine[i]))
        if int(classifyVector(np.array(lineArr),trainWeights[:,0])) != int(curLine[21]): # trainWeights[:,0] 直接取第一列
            errorCount += 1 #增加分类错误次数
    errorRate = (float(errorCount) / numTestVec)
    print("the error rate of this test is : %.2f%%" % (errorRate * 100))
    return errorRate

# 运行10次，求平均错误率
def multiTest():
    numTests = 10
    errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the avg error rate is : %.2f%%" % (numTests,errorSum * 100/float(numTests)))

if __name__ == '__main__':
    start = time.clock()
    colicTest()
    end = time.clock()
    print("time cost %.2f s" % ( end - start))

输出

the error rate of this test is : 28.36%
time cost 0.03 s

可以看到耗时更少。错误率低。显然，使用随机梯度上升算法，反而效果不好。所以当数据集较小时，我们使用梯度上升算法。当数据集较大时，使用随机梯度上升算法。

总节

逻辑回归的目的是寻找一个非线性函数Sigmoid的最佳拟合参数，求解过程可以由最优化算法来完成。常用的是梯度上升算法，而它又可以简化为随机梯度上升算法。

当数据集较小时，我们使用梯度上升算法。当数据集较大时，使用随机梯度上升算法。

愤怒的可乐

发布了131 篇原创文章 · 获赞 38 · 访问量 12万+

私信关注

《机器学习实战》读书笔记——Logistic回归

引言

基于逻辑回归和Sigmoid函数的分类：

基于最优化方法的最佳回归系数确定

梯度上升法(Gradient ascent)

训练：使用梯度上升算法找到最佳参数

随机梯度上升

从疝气病症预测病马的死亡率

准备数据：处理数据中的缺失值

测试算法：用逻辑回归进行分类

总节

猜你喜欢