吴恩达机器学习作业Python实现(二)：logistic回归

1 Logistic regression

在这部分的练习中，你将建立一个逻辑回归模型来预测一个学生是否能进入大学。假设你是一所大学的行政管理人员，你想根据两门考试的结果，来决定每个申请人是否被录取。你有以前申请人的历史数据，可以将其用作逻辑回归训练集。对于每一个训练样本，你有申请人两次测评的分数以及录取的结果。为了完成这个预测任务，我们准备构建一个可以基于两次测试评分来评估录取可能性的分类模型。

1.1 Visualizing the data

在开始实现任何学习算法之前，如果可能的话，最好将数据可视化。

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()

	exam1	exam2	admitted
0	34.623660	78.024693	0
1	30.286711	43.894998	0
2	35.847409	72.902198	0
3	60.182599	86.308552	1
4	79.032736	75.344376	1

data.describe()

	exam1	exam2	admitted
count	100.000000	100.000000	100.000000
mean	65.644274	66.221998	0.600000
std	19.458222	18.582783	0.492366
min	30.058822	30.603263	0.000000
25%	50.919511	48.179205	0.000000
50%	67.032988	67.682381	1.000000
75%	80.212529	79.360605	1.000000
max	99.827858	98.869436	1.000000

让我们创建两个分数的散点图，并使用颜色编码来可视化，如果样本是正的（被接纳）或负的（未被接纳）。

positive = data[data.admitted.isin(['1'])]  # 1
negetive = data[data.admitted.isin(['0'])]  # 0

fig, ax = plt.subplots(figsize=(6,5))
ax.scatter(positive['exam1'], positive['exam2'], c='b', label='Admitted')
ax.scatter(negetive['exam1'], negetive['exam2'], s=50, c='r', marker='x', label='Not Admitted')
# 设置图例显示在图的上方
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width , box.height* 0.8])
ax.legend(loc='center left', bbox_to_anchor=(0.2, 1.12),ncol=3)
# 设置横纵坐标名
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()

看起来在两类间，有一个清晰的决策边界。现在我们需要实现逻辑回归，那样就可以训练一个模型来预测结果。

1.2 Sigmoid function

首先来回顾下 logistic回归的假设函数：
${{h}_{\theta }}\left( x \right)=g(\theta^{T}x)=\frac{1}{1+{{e}^{-{{\theta }^{T}}X}}}$
其中的 g代表一个常用的logistic function为S形函数（Sigmoid function）：
$g\left( z \right)=\frac{1}{1+e^{-z}}$

def sigmoid(z):
    return 1 / (1 + np.exp(- z))

让我们做一个快速的检查，来确保它可以工作。

x1 = np.arange(-10, 10, 0.1)
plt.plot(x1, sigmoid(x1), c='r')
plt.show()

感觉很不错~~~

1.3 Cost function

逻辑回归的代价函数如下，和线性回归的代价函数不一样，因为这个函数是凸的。

$J\left( \theta \right)=\frac{1}{m}\sum\limits_{i=1}^{m}{[-{{y}^{(i)}}\log \left( {{h}_{\theta }}\left( {{x}^{(i)}} \right) \right)-\left( 1-{{y}^{(i)}} \right)\log \left( 1-{{h}_{\theta }}\left( {{x}^{(i)}} \right) \right)]}$

${{h}_{\theta }}\left( x \right)=g(\theta^{T}x)$

def cost(theta, X, y):
    first = (-y) * np.log(sigmoid(X @ theta))
    second = (1 - y)*np.log(1 - sigmoid(X @ theta))
    return np.mean(first - second)

现在，我们要做一些设置，获取我们的训练集数据。

# add a ones column - this makes the matrix multiplication work out easier
if 'Ones' not in data.columns:
    data.insert(0, 'Ones', 1)

# set X (training data) and y (target variable)
X = data.iloc[:, :-1].as_matrix()  # Convert the frame to its Numpy-array representation.
y = data.iloc[:, -1].as_matrix()  # Return is NOT a Numpy-matrix, rather, a Numpy-array.

theta = np.zeros(X.shape[1])

让我们来检查矩阵的维度，确保一切良好。

X.shape, theta.shape, y.shape
# ((100, 3), (3,), (100,))

cost(theta, X, y)
# 0.6931471805599453

看起来不错，接下来，我们需要一个函数来计算我们的训练数据、标签和一些参数thate的梯度。

1.4 Gradient

这是批量梯度下降（batch gradient descent）
转化为向量化计算： $\frac{1}{m} X^T( Sigmoid(X\theta) - y )$
$\frac{\partial J\left( \theta \right)}{\partial {{\theta }_{j}}}=\frac{1}{m}\sum\limits_{i=1}^{m}{({{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}})x_{_{j}}^{(i)}}$

def gradient(theta, X, y):
    return (X.T @ (sigmoid(X @ theta) - y))/len(X)  
# the gradient of the cost is a vector of the same length as θ where the jth element (for j = 0, 1, . . . , n)

gradient(theta, X, y)
# array([ -0.1, -12.00921659, -11.26284221])

1.5 Learning θ parameters

注意，我们实际上没有在这个函数中执行梯度下降，我们仅仅在计算梯度。在练习中，一个称为“fminunc”的Octave函数是用来优化函数来计算成本和梯度参数。由于我们使用Python，我们可以用SciPy的“optimize”命名空间来做同样的事情。

这里我们使用的是高级优化算法，运行速度通常远远超过梯度下降。方便快捷。
只需传入cost函数，已经所求的变量theta，和梯度。cost函数定义变量时变量tehta要放在第一个，若cost函数只返回cost，则设置fprime=gradient。

import scipy.optimize as opt

这里使用fimin_tnc或者minimize方法来拟合，minimize中method可以选择不同的算法来计算，其中包括TNC

result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
result

# (array([-25.16131867,   0.20623159,   0.20147149]), 36, 0)

下面是第二种方法，结果是一样的

res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='TNC', jac=gradient)
res
# help(opt.minimize) 
# res.x  # final_theta

     fun: 0.2034977015894744
     jac: array([9.11457705e-09, 9.59621025e-08, 4.84073722e-07])
 message: 'Local minimum reached (|pg| ~= 0)'
    nfev: 36
     nit: 17
  status: 0
 success: True
       x: array([-25.16131867,   0.20623159,   0.20147149])

cost(result[0], X, y)

0.2034977015894744

1.6 Evaluating logistic regression

学习好了参数θ后，我们来用这个模型预测某个学生是否能被录取。

接下来，我们需要编写一个函数，用我们所学的参数theta来为数据集X输出预测。然后，我们可以使用这个函数来给我们的分类器的训练精度打分。

逻辑回归模型的假设函数：
\[{{h}_{\theta }}\left( x \right)=\frac{1}{1+{{e}^{-{{\theta }^{T}}X}}}\]
当 ${{h}_{\theta }}$ 大于等于0.5时，预测 y=1

当 ${{h}_{\theta }}$ 小于0.5时，预测 y=0 。

def predict(theta, X):
    probability = sigmoid(X@theta)
    return [1 if x >= 0.5 else 0 for x in probability]  # return a list

final_theta = result[0]
predictions = predict(final_theta, X)
correct = [1 if a==b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(X)
accuracy

0.89

可以看到我们预测精度达到了89%，not bad.

也可以用skearn中的方法来检验。

from sklearn.metrics import classification_report
print(classification_report(predictions, y))

             precision    recall  f1-score   support

          0       0.85      0.87      0.86        39
          1       0.92      0.90      0.91        61

avg / total       0.89      0.89      0.89       100

1.7 Decision boundary（决策边界）

$X \times \theta = 0$ (this is the line)

$\theta_0 + x_1\theta_1 + x_2\theta_2=0$

x1 = np.arange(130, step=0.1)
x2 = -(final_theta[0] + x1*final_theta[1]) / final_theta[2]

fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['exam1'], positive['exam2'], c='b', label='Admitted')
ax.scatter(negetive['exam1'], negetive['exam2'], s=50, c='r', marker='x', label='Not Admitted')
ax.plot(x1, x2)
ax.set_xlim(0, 130)
ax.set_ylim(0, 130)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('Decision Boundary')
plt.show()

2 Regularized logistic regression

在训练的第二部分，我们将要通过加入正则项提升逻辑回归算法。简而言之，正则化是成本函数中的一个术语，它使算法更倾向于“更简单”的模型（在这种情况下，模型将更小的系数）。这个理论助于减少过拟合，提高模型的泛化能力。这样，我们开始吧。

设想你是工厂的生产主管，你有一些芯片在两次测试中的测试结果。对于这两次测试，你想决定是否芯片要被接受或抛弃。为了帮助你做出艰难的决定，你拥有过去芯片的测试数据集，从其中你可以构建一个逻辑回归模型。

2.1 Visualizing the data

data2 = pd.read_csv('ex2data2.txt', names=['Test 1', 'Test 2', 'Accepted'])
data2.head()

	Test 1	Test 2	Accepted
0	0.051267	0.69956	1
1	-0.092742	0.68494	1
2	-0.213710	0.69225	1
3	-0.375000	0.50219	1
4	-0.513250	0.46564	1

def plot_data():
    positive = data2[data2['Accepted'].isin([1])]
    negative = data2[data2['Accepted'].isin([0])]

    fig, ax = plt.subplots(figsize=(8,5))
    ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')
    ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')
    ax.legend()
    ax.set_xlabel('Test 1 Score')
    ax.set_ylabel('Test 2 Score')
    
plot_data()

注意到其中的正负两类数据并没有线性的决策界限。因此直接用logistic回归在这个数据集上并不能表现良好，因为它只能用来寻找一个线性的决策边界。

所以接下会提到一个新的方法。

2.2 Feature mapping

一个拟合数据的更好的方法是从每个数据点创建更多的特征。

我们将把这些特征映射到所有的x1和x2的多项式项上，直到第六次幂。

https://www.zhihu.com/question/65020904

for i in 0..power
      for p in 0..i:
        output x1^(i-p) * x2^p```

${\rm{mapFeature}}(x)=\begin{bmatrix} {1}\\ {x_1}\\ {x_2}\\ {x_1^2}\\ {x1x2}\\ {x_2^2}\\ {x_1^3}\\ \vdots\\ {x_1x_2^5}\\ {x_2^6} \end{bmatrix}$

def feature_mapping(x1, x2, power):
    data = {}
    for i in np.arange(power + 1):
        for p in np.arange(i + 1):
            data["f{}{}".format(i - p, p)] = np.power(x1, i - p) * np.power(x2, p)

#     data = {"f{}{}".format(i - p, p): np.power(x1, i - p) * np.power(x2, p)
#                 for i in np.arange(power + 1)
#                 for p in np.arange(i + 1)
#             }
    return pd.DataFrame(data)

x1 = data2['Test 1'].as_matrix()
x2 = data2['Test 2'].as_matrix()

_data2 = feature_mapping(x1, x2, power=6)
_data2.head()

	f00	f01	f02	f03	f04	f05	f06	f10	f11	f12	...	f30	f31	f32	f33	f40	f41	f42	f50	f51	f60
0	1.0	0.69956	0.489384	0.342354	0.239497	0.167542	0.117206	0.051267	0.035864	0.025089	...	0.000135	0.000094	0.000066	0.000046	0.000007	0.000005	0.000003	3.541519e-07	2.477505e-07	1.815630e-08
1	1.0	0.68494	0.469143	0.321335	0.220095	0.150752	0.103256	-0.092742	-0.063523	-0.043509	...	-0.000798	-0.000546	-0.000374	-0.000256	0.000074	0.000051	0.000035	-6.860919e-06	-4.699318e-06	6.362953e-07
2	1.0	0.69225	0.479210	0.331733	0.229642	0.158970	0.110047	-0.213710	-0.147941	-0.102412	...	-0.009761	-0.006757	-0.004677	-0.003238	0.002086	0.001444	0.001000	-4.457837e-04	-3.085938e-04	9.526844e-05
3	1.0	0.50219	0.252195	0.126650	0.063602	0.031940	0.016040	-0.375000	-0.188321	-0.094573	...	-0.052734	-0.026483	-0.013299	-0.006679	0.019775	0.009931	0.004987	-7.415771e-03	-3.724126e-03	2.780914e-03
4	1.0	0.46564	0.216821	0.100960	0.047011	0.021890	0.010193	-0.513250	-0.238990	-0.111283	...	-0.135203	-0.062956	-0.029315	-0.013650	0.069393	0.032312	0.015046	-3.561597e-02	-1.658422e-02	1.827990e-02

5 rows × 28 columns

经过映射，我们将有两个特征的向量转化成了一个28维的向量。

在这个高维特征向量上训练的logistic回归分类器将会有一个更复杂的决策边界，当我们在二维图中绘制时，会出现非线性。

虽然特征映射允许我们构建一个更有表现力的分类器，但它也更容易过拟合。在接下来的练习中，我们将实现正则化的logistic回归来拟合数据，并且可以看到正则化如何帮助解决过拟合的问题。

2.3 Regularized Cost function

注意：不惩罚第一项 $\theta_0$

先获取特征，标签以及参数theta，确保维度良好。

# 这里因为做特征映射的时候已经添加了偏置项，所以不用手动添加了。
X = _data2.as_matrix()  
y = data2['Accepted'].as_matrix()
theta = np.zeros(X.shape[1])
X.shape, y.shape, theta.shape  # ((118, 28), (118,), (28,))

def costReg(theta, X, y, l=1):
    # 不惩罚第一项
    _theta = theta[1: ]
    reg = (l / (2 * len(X))) *(_theta @ _theta)  # _theta@_theta == inner product
    
    return cost(theta, X, y) + reg

costReg(theta, X, y, l=1)  #     0.6931471805599454

2.4 Regularized gradient

因为我们未对 ${{\theta }_{0}}$ 进行正则化，所以梯度下降算法将分两种情形：
\begin{align}
& Repeat\text{ }until\text{ }convergence\text{ }!!{!!\text{ } \
& \text{ }{{\theta }{0}}:={{\theta }{0}}-a\frac{1}{m}\sum\limits_{i=1}^{m}{[{{h}{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{{(i)}}]x_{_{0}}}{(i)}} \
& \text{ }{{\theta }{j}}:={{\theta }{j}}-a\frac{1}{m}\sum\limits{i=1}^{m}{[{{h}{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}}]x_{j}{(i)}}+\frac{\lambda }{m}{{\theta }{j}} \
& \text{ }!!}!!\text{ } \
& Repeat \
\end{align}

对上面的算法中 j=1,2,…,n 时的更新式子进行调整可得：
${{\theta }_{j}}:={{\theta }_{j}}(1-a\frac{\lambda }{m})-a\frac{1}{m}\sum\limits_{i=1}^{m}{({{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}})x_{j}^{(i)}}$

同样不惩罚第一个θ

def gradientReg(theta, X, y, l=1):
    reg = (1 / len(X)) * theta
    reg[0] = 0  
    return gradient(theta, X, y) + reg

gradientReg(theta, X, y, 1)

array([8.47457627e-03, 8.47457627e-03, 7.77711864e-05, 3.76648474e-02,
       2.34764889e-02, 3.93028171e-02, 3.10079849e-02, 3.87936363e-02,
       1.87880932e-02, 1.15013308e-02, 8.19244468e-03, 3.09593720e-03,
       4.47629067e-03, 1.37646175e-03, 5.03446395e-02, 7.32393391e-03,
       1.28600503e-02, 5.83822078e-03, 7.26504316e-03, 1.83559872e-02,
       2.23923907e-03, 3.38643902e-03, 4.08503006e-04, 3.93486234e-02,
       4.32983232e-03, 6.31570797e-03, 1.99707467e-02, 1.09740238e-03])

2.5 Learning parameters

result2 = opt.fmin_tnc(func=costReg, x0=theta, fprime=gradientReg, args=(X, y, 2))
result2

(array([ 0.57761135,  0.47056293,  1.09213933, -0.93555548, -0.15107417,
        -0.96567576, -0.49622178, -0.87226365,  0.5986215 , -0.47857791,
        -0.19652206, -0.10212812, -0.1513566 , -0.03407832, -1.868297  ,
        -0.25062387, -0.49045048, -0.20293012, -0.26033467,  0.02385201,
        -0.0290203 , -0.0543879 ,  0.01131411, -1.39767636, -0.16559351,
        -0.24745221, -0.29518657,  0.00854288]), 57, 4)

我们还可以使用高级Python库scikit-learn来解决这个问题。

from sklearn import linear_model#调用sklearn的线性回归包
model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y.ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

model.score(X, y)  # 0.8305084745762712

2.6 Evaluating logistic regression

final_theta = result2[0]
predictions = predict(final_theta, X)
correct = [1 if a==b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(correct)
accuracy

0.8135593220338984

或者用skearn中的方法来评估结果。

print(classification_report(y, predictions))

'''
                 precision    recall  f1-score   support

              0       0.87      0.75      0.80        60
              1       0.77      0.88      0.82        58
    
    avg / total       0.82      0.81      0.81       118
 '''

可以看到和skearn中的模型精确度差不多，这很不错。

2.6 Decision boundary（决策边界）

$X \times \theta = 0$ (this is the line)

x = np.linspace(-1, 1.5, 250)
xx, yy = np.meshgrid(x, x)

z = feature_mapping(xx.ravel(), yy.ravel(), 6).as_matrix()
z = z @ final_theta
z = z.reshape(xx.shape)

plot_data()
plt.contour(xx, yy, z, 0)
plt.ylim(-.8, 1.2)