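Logistic Regression: Principles and Hands-On Code

I. Principles of Logistic Regression

(1) From linear regression to logistic regression

Suppose a sample is described by d attributes, $\mathbf x=(x_1,x_2,...,x_d)$. The corresponding linear regression model is: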

$\begin{aligned} f(\mathbf x)=\theta_1x_1+\theta_2x_2+...+\theta_dx_d+b \end{aligned}$

Let $b=\theta_0$ and $x_0=1$. The model can then be written as:

$\begin{aligned} f(\mathbf x) &= \theta_0x_0+\theta_1x_1+\theta_2x_2+...+\theta_dx_d \\ &= \theta^T \mathbf x \end{aligned}$

where $\theta=(\theta_0,\theta_1,\theta_2,...,\theta_d)^T$ and $\mathbf x=(x_0,x_1,x_2,...,x_d)^T$ is the sample extended with the constant $x_0=1$.
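A minimal sketch of this rewriting in Python, using only the standard library. The function name `linear_score` and the numeric values are illustrative assumptions, not part of the original article. The sample is extended with the constant $x_0=1$ so that the bias $b$ becomes $\theta_0$ and the model reduces to a single dot product.

```python
def linear_score(theta, x):
    """Return theta^T x, where x is the raw sample without the constant x_0."""
    augmented = [1.0] + list(x)          # prepend x_0 = 1
    if len(theta) != len(augmented):
        raise ValueError("theta must have exactly one more entry than x")
    return sum(t * v for t, v in zip(theta, augmented))

theta = [0.5, 2.0, -1.0]   # hypothetical values: theta_0 (the bias b), theta_1, theta_2
x = [3.0, 4.0]             # hypothetical sample with d = 2 attributes
print(linear_score(theta, x))   # 0.5*1 + 2.0*3 + (-1.0)*4 = 2.5
```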

Now consider a binary classification problem in which the label satisfies $y \in \{0,1\}$. The linear regression function $f(\mathbf x)$ outputs a real value, so we need to map that output to 0/1. The ideal mapping is the unit-step function:

$y=\begin{cases} 0, & z<0 \\ 0.5, & z=0 \\ 1, & z>0 \end{cases}$

where $z$ is the output of the linear regression function $f(\mathbf x)$.

However, the unit-step function is discontinuous, which makes it a poor fit for the iterative optimization algorithms used in machine learning. We therefore replace it with the logistic function (the sigmoid function), which is monotonic and differentiable:

$\begin{aligned} y=\frac{1}{1+e^{-z}} \end{aligned}$

The two functions are plotted below:

[Figure: the unit-step function and the logistic function]
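To make the comparison concrete, here is a small sketch that evaluates both functions at a few points. The names `unit_step` and `logistic` are chosen here for illustration, and the branch on the sign of `z` is an added safeguard against overflow in `math.exp`, not something the article prescribes.

```python
import math

def unit_step(z):
    """Unit-step function: 0 for z < 0, 0.5 at z = 0, 1 for z > 0."""
    if z < 0:
        return 0.0
    if z > 0:
        return 1.0
    return 0.5

def logistic(z):
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)            # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

# The logistic function moves smoothly from 0 to 1, while the step function jumps at z = 0.
for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(z, unit_step(z), round(logistic(z), 4))
```

Both functions agree at $z=0$ and in the limits $z \to \pm\infty$. Only the logistic function has a usable derivative everywhere.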

We then apply the logistic function to the output of the linear regression model. Setting $z=f(\mathbf x)$ gives the logistic regression model:

$\begin{aligned} h_\theta(\mathbf x)&=\frac{1}{1+e^{-f(\mathbf x)}}\\ &=\frac{1}{1+e^{- \theta^T \mathbf x}} \end{aligned}$

If $h_\theta(\mathbf x)>0.5$, the predicted class is $y=1$; if $h_\theta(\mathbf x)\le0.5$, the predicted class is $y=0$.
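The classification rule can be sketched as follows. The helper names `h` and `classify` and the parameter values are assumptions made for this example. The plain `math.exp(-z)` form is kept for readability and can overflow for very negative `z`.

```python
import math

def h(theta, x):
    """Logistic regression output h_theta(x); theta[0] is the bias theta_0."""
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(theta, x):
    """Return 1 if h_theta(x) > 0.5, otherwise 0."""
    return 1 if h(theta, x) > 0.5 else 0

theta = [-1.0, 1.0, 1.0]                 # hypothetical parameters
print(classify(theta, [2.0, 2.0]))       # z = 3,  so h > 0.5
print(classify(theta, [0.0, 0.0]))       # z = -1, so h < 0.5
print(classify(theta, [0.5, 0.5]))       # z = 0,  so h = 0.5, which falls in class 0
```

Because the logistic function is monotonic, $h_\theta(\mathbf x)>0.5$ is equivalent to $\theta^T \mathbf x>0$, so the threshold could equally be applied to the linear score.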

(2) Finding the optimal $\theta^*$ by maximum likelihood estimation (MLE)

Given a parameter vector $\theta = (\theta_0,\theta_1,\theta_2,...,\theta_d)$, the probability that a sample $\mathbf x=(x_1,x_2,...,x_d)$ belongs to class $y=1$ is:

$\begin{aligned} P(y=1|\mathbf x;\theta)=h_\theta(\mathbf x) \end{aligned}$

Correspondingly, the probability of class $y=0$ is:

$\begin{aligned} P(y=0|\mathbf x;\theta)=1-h_\theta(\mathbf x) \end{aligned}$

Because this is a binary classification problem, $y$ follows a 0/1 (Bernoulli) distribution, so the probability that sample $\mathbf x$ belongs to class $y$ (where $y$ is either 0 or 1) can be written as a single expression:

$\begin{aligned} P(y|\mathbf x;\theta)=(h_\theta(\mathbf x))^y(1-h_\theta(\mathbf x))^{1-y} \end{aligned}$
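A quick way to see why the single expression works: the exponent $y$ switches one factor on and the other off. The sketch below uses a hypothetical helper `class_probability` and a made-up model output of 0.8.

```python
def class_probability(h_value, y):
    """P(y | x; theta) for y in {0, 1}, given h_value = h_theta(x)."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    # For y = 1 the second factor is (1 - h)^0 = 1; for y = 0 the first factor is h^0 = 1.
    return (h_value ** y) * ((1.0 - h_value) ** (1 - y))

h_value = 0.8   # hypothetical model output for some sample
print(class_probability(h_value, 1))   # probability of class 1, equal to h
print(class_probability(h_value, 0))   # probability of class 0, equal to 1 - h
```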

Suppose we have m samples $((\mathbf x^1,y^1),(\mathbf x^2,y^2),...,(\mathbf x^m,y^m))$. The corresponding likelihood function is:

$\begin{aligned} L(\theta)&=\prod_{i=1}^m P(y^i | \mathbf x^i;\theta) \\ &=\prod_{i=1}^m (h_\theta(\mathbf x^i))^{y^i} (1-h_\theta(\mathbf x^i))^{1-y^i} \end{aligned}$

Taking the logarithm gives the log-likelihood function:

$\begin{aligned} \ell (\theta)&=\ln(L(\theta))\\ &=\sum_{i=1}^m \left[ y^i \ln(h_\theta(\mathbf x^i)) + (1-y^i)\ln(1-h_\theta(\mathbf x^i)) \right] \end{aligned}$
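The two formulas can be checked against each other numerically. In this sketch the toy dataset, the parameter values, and the function names are all assumptions for illustration. Each row of `X` already contains the constant $x_0=1$.

```python
import math

def h(theta, x):
    """h_theta(x) for a sample x that already contains x_0 = 1."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(theta, X, y):
    """L(theta): product of P(y^i | x^i; theta) over all samples."""
    result = 1.0
    for x_i, y_i in zip(X, y):
        p = h(theta, x_i)
        result *= (p ** y_i) * ((1.0 - p) ** (1 - y_i))
    return result

def log_likelihood(theta, X, y):
    """l(theta): sum of y ln(h) + (1 - y) ln(1 - h) over all samples."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = h(theta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]   # toy samples, first column is x_0
y = [0, 0, 1, 1]
theta = [0.0, 1.0]                                        # hypothetical parameters
print(likelihood(theta, X, y), log_likelihood(theta, X, y))
```

Working with the sum rather than the product is also the safer choice numerically: a product of many probabilities below 1 shrinks toward zero quickly as m grows.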

We use gradient ascent to find the maximum likelihood estimate.
To simplify the derivation, write $\ell(\theta)= \sum_{i=1}^m k^i(\theta)$. This is only a change of notation. Dropping the superscript $i$ for now and writing $h_\theta(\mathbf x)$ as $h(\theta^T \mathbf x)$, we have $k(\theta)= y \ln(h(\theta^T \mathbf x)) + (1-y)\ln(1-h(\theta^T \mathbf x))$. Using the property $h'(z)=h(z)(1-h(z))$ of the logistic function, the partial derivative of $k(\theta)$ with respect to the j-th parameter $\theta_j$ is:

$\begin{aligned} \frac {\partial}{\partial \theta_j} k(\theta) &=\left(y \frac{1}{h(\theta^T \mathbf x)}-(1-y) \frac{1}{1-h(\theta^T \mathbf x)} \right) \frac {\partial}{\partial \theta_j}h(\theta^T \mathbf x) \\ &=\left( y \frac{1}{h(\theta^T \mathbf x)}-(1-y) \frac{1}{1-h(\theta^T \mathbf x)} \right) h(\theta^T \mathbf x)(1-h(\theta^T \mathbf x)) \frac {\partial}{\partial \theta_j}\theta^T \mathbf x \\ &=\left( y(1-h(\theta^T \mathbf x))-(1-y)h(\theta^T \mathbf x) \right) x_j \\ &=(y-h(\theta^T \mathbf x))x_j \\ &=(y-h_\theta( \mathbf x ))x_j \end{aligned}$
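A derivation like this is easy to sanity-check with finite differences. The sketch below compares the closed form $(y-h_\theta(\mathbf x))x_j$ with a central-difference approximation of $\partial k/\partial\theta_j$ for one sample. The sample, the parameters, and the step `eps` are arbitrary values picked for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def k(theta, x, y):
    """Log-likelihood contribution of one sample (x already contains x_0 = 1)."""
    p = sigmoid(sum(t * v for t, v in zip(theta, x)))
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def analytic_gradient(theta, x, y):
    """Closed form from the derivation: (y - h_theta(x)) * x_j for every j."""
    p = sigmoid(sum(t * v for t, v in zip(theta, x)))
    return [(y - p) * x_j for x_j in x]

def numeric_gradient(theta, x, y, eps=1e-6):
    """Central-difference approximation of dk/dtheta_j for every j."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((k(up, x, y) - k(down, x, y)) / (2 * eps))
    return grad

theta = [0.2, -0.4, 0.7]     # hypothetical parameters
x = [1.0, 2.0, -1.5]         # hypothetical sample, x_0 = 1 included
print(analytic_gradient(theta, x, 1))
print(numeric_gradient(theta, x, 1))
```

The two printed vectors should agree to several decimal places. A mismatch would point to an error in either the derivation or the code.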

Restoring the superscript $i$ and summing $\frac {\partial}{\partial \theta_j} k^i(\theta)$ over all samples $(\mathbf x^i , y^i)=((x_1,x_2,...,x_d)^i,y^i)$, the partial derivative of the log-likelihood $\ell (\theta)$ with respect to the j-th parameter $\theta_j$ is:

$\begin{aligned} \frac {\partial}{\partial \theta_j} \ell (\theta) &=\frac {\partial}{\partial \theta_j} \left( \sum_{i=1}^m k^i(\theta) \right)\\ &= \sum_{i=1}^m (y^i-h_\theta( \mathbf x^i ))x_j^i \end{aligned}$

Here $j$ indexes the parameters, $j \in \{0,1,...,d\}$: the model is linear, so there is one parameter $\theta_j$ for each of the $d$ attributes of $\mathbf x$, plus $\theta_0$ for the constant $x_0=1$. The index $i \in \{1,2,...,m\}$ runs over the $m$ samples, and $x_j^i$ is the j-th attribute of the i-th sample.
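As a sketch, the full gradient is just the per-sample term $(y^i-h_\theta(\mathbf x^i))x_j^i$ accumulated over the dataset, where $x_j^i$ is the j-th attribute of the i-th sample. The function name `gradient` and the two-sample dataset are illustrative assumptions. Each row of `X` includes $x_0=1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """Gradient of the log-likelihood: component j is sum_i (y^i - h_theta(x^i)) * x_j^i."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = y_i - sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += err * x_ij
    return grad

X = [[1.0, 2.0], [1.0, -1.0]]   # two toy samples, first column is x_0 = 1
y = [1, 0]
# With theta = 0 every prediction is 0.5, so the errors are +0.5 and -0.5.
print(gradient([0.0, 0.0], X, y))   # [0.5 - 0.5, 0.5*2 + (-0.5)*(-1)] = [0.0, 1.5]
```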

**1. Solving for the optimal parameters $\theta^*$ from the MLE perspective**

The goal is to maximize the log-likelihood $\ell (\theta)$, so we move step by step in the direction of the gradient until we approach the maximum.
The parameter update rules for the three variants of the gradient method are listed below. They are usually named after gradient descent, but because MLE climbs along the gradient toward a maximum, the updates use a plus sign '+' and are in fact gradient ascent.
Suppose we have m samples $(\mathbf x^i , y^i)=((x_1,x_2,...,x_d)^i,y^i), i \in \{1,2,...,m\}$, each with $d$ features, and let $\alpha$ denote the step size.

Batch gradient descent (BGD)
Batch gradient descent uses all samples for every parameter update. With m samples, the gradient is computed from all m of them:

$\begin{aligned} \theta_j &= \theta_j + \alpha \frac {\partial}{\partial \theta_j} \ell (\theta) \\ &= \theta_j + \alpha \sum_{i=1}^m (y^i-h_\theta( \mathbf x^i ))x_j^i \end{aligned}$
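A minimal, standard-library sketch of the batch update on a toy one-feature dataset. The dataset, the step size 0.1, and the iteration count are assumptions chosen for this example. All components of $\theta$ are updated together from the same set of errors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.1, num_iters=200):
    """Maximize the log-likelihood with full-batch updates (rows of X include x_0 = 1)."""
    theta = [0.0] * len(X[0])
    for _ in range(num_iters):
        # errors y^i - h_theta(x^i) for all m samples, computed with the current theta
        errors = [y_i - sigmoid(sum(t * v for t, v in zip(theta, x_i)))
                  for x_i, y_i in zip(X, y)]
        # theta_j = theta_j + alpha * sum_i error_i * x_j^i, for every j at once
        theta = [theta[j] + alpha * sum(e * x_i[j] for e, x_i in zip(errors, X))
                 for j in range(len(theta))]
    return theta

X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]   # toy samples, first column is x_0
y = [0, 0, 1, 1]
print(batch_gradient_ascent(X, y))
```

Every iteration touches all m samples, which is why BGD becomes expensive on large datasets.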
Stochastic gradient descent (SGD)
Unlike BGD, SGD does not use all m samples to compute the gradient. It uses a single randomly chosen sample $(\mathbf x^i, y^i)$:

$\begin{aligned} \theta_j = \theta_j + \alpha (y^i-h_\theta( \mathbf x^i ))x_j^i \end{aligned}$
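The single-sample update can be sketched as below. The function names, the seed, and the toy data are assumptions for illustration. `random.Random(seed)` is used only to make the run repeatable.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(theta, x_i, y_i, alpha):
    """One SGD update from a single sample (x_i already contains x_0 = 1)."""
    err = y_i - sigmoid(sum(t * v for t, v in zip(theta, x_i)))
    return [t + alpha * err * v for t, v in zip(theta, x_i)]

def stochastic_gradient_ascent(X, y, alpha=0.1, num_steps=500, seed=0):
    """Repeat the single-sample update, picking a random sample each time."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    for _ in range(num_steps):
        i = rng.randrange(len(X))          # index of the randomly chosen sample
        theta = sgd_step(theta, X[i], y[i], alpha)
    return theta

X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]   # toy samples, first column is x_0
y = [0, 0, 1, 1]
print(stochastic_gradient_ascent(X, y))
```

Each step costs the same regardless of m, at the price of a noisier path toward the optimum.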
Mini-batch gradient descent (MBGD)
Mini-batch gradient descent is a compromise between BGD and SGD: each iteration uses a randomly selected subset of the m samples. If T samples are selected at random from the m samples, the update rule is:

$\begin{aligned} \theta_j = \theta_j + \alpha \sum_{t=1}^T (y^t-h_\theta( \mathbf x^t ))x_j^t \end{aligned}$
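A sketch of one mini-batch update. Drawing the T samples without replacement via `random.Random.sample` is an assumption of this example (the text only says they are chosen at random), as are the function name and the toy data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_step(theta, X, y, alpha, batch_size, rng):
    """One MBGD update computed from batch_size samples drawn without replacement."""
    batch = rng.sample(range(len(X)), batch_size)
    grad = [0.0] * len(theta)
    for t in batch:
        err = y[t] - sigmoid(sum(a * b for a, b in zip(theta, X[t])))
        for j, x_tj in enumerate(X[t]):
            grad[j] += err * x_tj
    return [theta_j + alpha * g for theta_j, g in zip(theta, grad)]

rng = random.Random(0)                                   # fixed seed for repeatability
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]   # toy samples, first column is x_0
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
for _ in range(300):
    theta = minibatch_step(theta, X, y, alpha=0.1, batch_size=2, rng=rng)
print(theta)
```

With `batch_size` equal to m the step coincides with BGD, and with `batch_size=1` it reduces to SGD.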

**2. Solving for the optimal parameters $\theta^*$ from the perspective of the loss function $loss(y^i , \hat y^i)$**

Let $loss(y^i , \hat y^i)$ be the loss of a single sample, with $\hat y^i=h_\theta(\mathbf x^i)$ the predicted probability. Summed over all samples, the loss equals the negative log-likelihood (NLL):

$\begin{aligned} \sum_{i=1}^m loss(y^i , \hat y^i) = - \ell (\theta) \end{aligned}$

Finding the optimal parameters $\theta^*$ therefore means following the negative gradient of the loss toward its minimum. This is the mirror image of the MLE view, because the two objectives differ only by a minus sign '-'. The three update rules accordingly use a minus sign:

Batch gradient descent (BGD)

$\begin{aligned} \theta_j &= \theta_j - \alpha \left(-\frac {\partial}{\partial \theta_j} \ell (\theta) \right) \\ &= \theta_j - \alpha \sum_{i=1}^m (h_\theta( \mathbf x^i )-y^i)x_j^i \end{aligned}$

Stochastic gradient descent (SGD)

$\begin{aligned} \theta_j = \theta_j - \alpha (h_\theta( \mathbf x^i )-y^i)x_j^i \end{aligned}$

Mini-batch gradient descent (MBGD)

$\begin{aligned} \theta_j = \theta_j - \alpha \sum_{t=1}^T (h_\theta( \mathbf x^t )-y^t)x_j^t \end{aligned}$
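The equivalence of the two views can be checked directly: one ascent step on the log-likelihood and one descent step on the negative log-likelihood land on the same parameters. The function names and the toy data in this sketch are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def nll(theta, X, y):
    """Negative log-likelihood, i.e. the loss summed over all samples."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict(theta, x_i)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

def ascent_step(theta, X, y, alpha):
    """theta_j + alpha * sum_i (y^i - h) * x_j^i: climbs the log-likelihood."""
    return [theta[j] + alpha * sum((y_i - predict(theta, x_i)) * x_i[j]
                                   for x_i, y_i in zip(X, y))
            for j in range(len(theta))]

def descent_step(theta, X, y, alpha):
    """theta_j - alpha * sum_i (h - y^i) * x_j^i: descends the loss."""
    return [theta[j] - alpha * sum((predict(theta, x_i) - y_i) * x_i[j]
                                   for x_i, y_i in zip(X, y))
            for j in range(len(theta))]

X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]   # toy samples, first column is x_0
y = [0, 0, 1, 1]
start = [0.0, 0.0]
print(ascent_step(start, X, y, 0.1), descent_step(start, X, y, 0.1))
print(nll(start, X, y), nll(descent_step(start, X, y, 0.1), X, y))
```

The second line of output shows the loss before and after the step, so it is easy to confirm that the step moved downhill.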

II. Hands-On Code

This part follows the code from the book Machine Learning in Action (《机器学习实战》). A few notes on the notation and on how the optimum is computed:
1. Machine Learning in Action uses $w$ where this article uses $\theta$, so the linear regression equation becomes:

$\begin{aligned} f(\mathbf x) &= w_0x_0+w_1x_1+w_2x_2+...+w_dx_d \\ &= w^T \mathbf x \end{aligned}$
2. The corresponding gradient ascent update rules are:
a. Batch gradient descent (BGD)
$\begin{aligned} w_j &= w_j + \alpha \frac {\partial}{\partial w_j} \ell (w) \\ &=w_j + \alpha \sum_{i=1}^m (y^i-h_w( \mathbf x^i ))x_j^i \end{aligned}$
b. Stochastic gradient descent (SGD)
$\begin{aligned} w_j = w_j + \alpha (y^i-h_w( \mathbf x^i ))x_j^i \end{aligned}$
c. Mini-batch gradient descent (MBGD)
$\begin{aligned} w_j = w_j + \alpha \sum_{t=1}^T (y^t-h_w( \mathbf x^t ))x_j^t \end{aligned}$
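The `gradAscent` function later in this article updates all components of $w$ at once with matrix operations. As a sketch of how the component-wise BGD rule maps to that form, assume (notation introduced here, not in the original) that $X$ is the $m \times (d+1)$ matrix whose i-th row is the sample $\mathbf x^i$ with $x_0=1$, that $\mathbf y$ is the column vector of labels, and that $\sigma$ is the logistic function applied element-wise:

$\begin{aligned} \mathbf h &= \sigma(X \mathbf w) \\ \mathbf w &= \mathbf w + \alpha X^T (\mathbf y - \mathbf h) \end{aligned}$

The j-th component of $X^T(\mathbf y-\mathbf h)$ is $\sum_{i=1}^m (y^i-h_w(\mathbf x^i))x_j^i$, where $x_j^i$ is the j-th attribute of the i-th sample, which is exactly the sum in the BGD rule.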

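(1) Read a dataset of 100 samples, build the logistic model, and find the optimum $\mathbf w^*$ with batch gradient ascent (BGD)

1. What the dataset looks like

[Figure: preview of the dataset]

2. Reading the dataset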

'''
    The model is z = b + w1*x1 + w2*x2. Let x0 = 1 and b = w0*x0. The first feature of each sample
    is x1 and the second is x2, so the first column of dataMat is all 1s (representing x0 = 1).
'''
def loadDataSet():
    dataMat = []; labelMat = []
    with open('testSet.txt') as fr:         # each line holds: x1 x2 label
        for line in fr:
            lineArr = line.strip().split()
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat,labelMat

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    print(dataMat)
    print(labelMat)

[Figure: printed contents of dataMat and labelMat]

3. Finding the optimum $\mathbf w^*$ with BGD

'''
    def sigmoid(inX): if inX is a number, returns sigmoid(inX). If inX is a NumPy vector such as [1,2,3],
    returns [sigmoid(1),sigmoid(2),sigmoid(3)]
'''
from numpy import *   # provides mat, shape, ones, exp, array, arange, sum and random used in this article

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

'''
   def gradAscent(dataMatIn, classLabels): applies the BGD update rule of logistic regression with step size 0.001 for 500 iterations
'''
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)             #convert to NumPy matrix, shape (m, n)
    labelMat = mat(classLabels).transpose() #convert to NumPy matrix, shape (m, 1)
    m,n = shape(dataMatrix)
    alpha = 0.001 #step size of gradient ascent
    maxCycles = 500 #number of iterations
    weights = ones((n,1))
    for k in range(maxCycles):              #heavy on matrix operations
        h = sigmoid(dataMatrix*weights)     #matrix mult
        error = (labelMat - h)              #vector subtraction
        weights = weights + alpha * dataMatrix.transpose()* error #matrix mult
    return weights

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    best_weights=gradAscent(dataMat,labelMat)
    print('Optimal W*=[w0,w1,w2]^T:\n',best_weights)

[Figure: printed weights found by BGD]

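4. Plotting the decision boundary found by BGD

A sample is classified as $y=1$ when the sigmoid output satisfies $h_w( \mathbf x^i )>0.5$ and as $y=0$ when $h_w( \mathbf x^i ) \le 0.5$, so the decision boundary is $h_w( \mathbf x^i )=0.5$, that is, $0= w_0x_0+w_1x_1+w_2x_2 =w_0+w_1x_1+w_2x_2$ (since $x_0=1$). For $w^*=[w_0,w_1,w_2]^T$, the decision boundary is the straight line $x_2=(-w_0-w_1x_1)/w_2$ (assuming $w_2 \ne 0$).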
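A short sketch that verifies the boundary formula numerically: any point $(x_1, x_2)$ on the line should give a model output of 0.5. The weights below are made-up values for illustration, not the weights learned from `testSet.txt`.

```python
import math

def boundary_x2(w, x1):
    """x2 coordinate of the decision boundary at x1, for w = [w0, w1, w2]."""
    w0, w1, w2 = w
    if w2 == 0:
        raise ValueError("w2 must be non-zero to solve for x2")
    return (-w0 - w1 * x1) / w2

def h(w, x1, x2):
    """Model output for the point (x1, x2), with x0 = 1."""
    z = w[0] + w[1] * x1 + w[2] * x2
    return 1.0 / (1.0 + math.exp(-z))

w = [4.0, 0.5, -0.6]                      # hypothetical weights
for x1 in (-3.0, 0.0, 1.0, 3.0):
    x2 = boundary_x2(w, x1)
    print(x1, x2, h(w, x1, x2))           # the output stays at 0.5 along the line
```

Points on one side of the line give outputs above 0.5 and points on the other side give outputs below 0.5. Which side is which depends on the sign of $w_2$.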

'''
    def plotBestFit(weights): plots the data from 'testSet.txt', then draws the decision boundary given by weights
'''
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    weights=weights.getA()                  # NumPy matrix -> ndarray
    dataMat,labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]                   # number of samples
    xcord1 = []; ycord1 = []                # points with label 1
    xcord2 = []; ycord2 = []                # points with label 0
    for i in range(n):
        if int(labelMat[i])== 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]   # decision boundary x2 = (-w0 - w1*x1)/w2
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    weights=gradAscent(dataMat,labelMat)
    plotBestFit(weights)

[Figure: data points and the decision boundary found by BGD]

(2) Finding the optimum $w^*$ with SGD

1. Plotting the decision boundary for SGD with a fixed step size of 0.01

def stocGradAscent0(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)
    m,n = shape(dataMatrix)
    alpha = 0.01                           # fixed step size
    weights = ones((n,1))  #initialize to all ones
    for i in range(m):                     # a single pass over the samples, in file order
        h = sigmoid(dataMatrix[i]*weights) # prediction for sample i only
        error = classLabels[i] - h
        weights = weights + alpha *(dataMatrix[i].transpose())*error
    return weights

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    best_weights=stocGradAscent0(dataMat,labelMat)
    print('Optimal W*=[w0,w1,w2]^T:\n', best_weights)
    plotBestFit(best_weights)

[Figure: decision boundary found by SGD with a fixed step size]

2. Plotting the decision boundary for SGD with a gradually decreasing step size

def stocGradAscent1(dataMatIn, classLabels, numIter=150):
    dataMatrix = mat(dataMatIn)
    m,n = shape(dataMatrix)
    weights = ones((n, 1))  # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))        # indices of the samples not yet used in this pass
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001    #alpha decreases with iteration, does not go to 0 because of the constant
            randIndex = int(random.uniform(0,len(dataIndex)))
            sampleIndex = dataIndex[randIndex]   # map the random position to an actual sample index
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha*(dataMatrix[sampleIndex].transpose())* error
            del(dataIndex[randIndex])     # each sample is used once per pass
    return weights

if __name__=='__main__':
    dataMat, labelMat=loadDataSet()
    best_weights=stocGradAscent1(dataMat,labelMat)
    print('Optimal W*=[w0,w1,w2]^T:\n', best_weights)
    plotBestFit(best_weights)

[Figure: decision boundary found by SGD with a decreasing step size]
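Two ideas in `stocGradAscent1` are worth isolating: the step-size schedule `4/(1.0+j+i)+0.0001` and the `dataIndex` list, which is meant to let each pass visit the samples in random order without repeats. The sketch below reproduces both with the standard library. The helper names `step_size` and `visiting_order` are hypothetical, and `random.Random.randrange` stands in for the NumPy call used in the article.

```python
import random

def step_size(j, i):
    """Step size used in pass j at inner step i (both 0-based)."""
    return 4 / (1.0 + j + i) + 0.0001

def visiting_order(m, rng):
    """Order in which one pass visits m samples: random, each sample exactly once."""
    remaining = list(range(m))
    order = []
    while remaining:
        pos = rng.randrange(len(remaining))   # random position in the remaining list
        order.append(remaining[pos])          # the sample index stored at that position
        del remaining[pos]                    # never pick that sample again in this pass
    return order

# The step size shrinks quickly at first and then flattens out above the constant 0.0001.
print([round(step_size(0, i), 4) for i in (0, 1, 9, 99)])
print(step_size(149, 99))
print(visiting_order(5, random.Random(0)))
```

Large early steps move the weights quickly toward a good region, while the small later steps damp the oscillation caused by individual samples. The constant term keeps late samples from being ignored entirely.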

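(3) Predicting the mortality of horses with colic

1. The horse colic dataset

[Figure: description of the horse colic dataset]

2. Handling missing values in the data

[Figure: how missing values in the data are handled]

3. Classifying with the logistic regression classifier

'''
    def classifyVector(inX, weights): inX is the feature vector of the sample to classify and weights holds the logistic regression parameters.
    Returns 1 if the logistic regression output is > 0.5, and 0 if the output is <= 0.5
'''
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0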

def colicTest():
    trainingSet = []; trainingLabels = []
    with open('horseColicTraining.txt') as frTrain:
        for line in frTrain:
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]   # first 21 columns are features
            trainingSet.append(lineArr)
            trainingLabels.append(float(currLine[21]))          # column 22 is the class label
    trainWeights = stocGradAscent1(trainingSet, trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    with open('horseColicTest.txt') as frTest:
        for line in frTest:
            numTestVec += 1.0
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]
            # int(float(...)) also accepts labels written as '1.0'
            if int(classifyVector(array(lineArr), trainWeights))!= int(float(currLine[21])):
                errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print ("the error rate of this test is: %f" % errorRate)
    return errorRate

'''
    def multiTest(): calls colicTest() 10 times and averages the results
'''
def multiTest():
    numTests = 10; errorSum=0.0
    for k in range(numTests):
        errorSum += colicTest()
    print ("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)) )


if __name__=='__main__':
    multiTest()