Copyright notice: this is an original post by the author; please include a link to the post when reposting. https://blog.csdn.net/qq_41080850/article/details/85793054
Note: the training set ex2data1.txt used below comes from Andrew Ng's machine learning open course. It contains each student's scores on two exams and whether the student was admitted.
Implementation:
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Read in the data:
data = pd.read_csv('ex2data1.txt', names=['Exam1','Exam2','Admitted'])
data.head()  # inspect the first five rows
# Plot admission status against the two exam scores:
fig, axes = plt.subplots()
sns.scatterplot(x='Exam1',y='Exam2',hue='Admitted',s=100,style='Admitted',data=data,ax=axes)
axes.set_title("Student's admission situation")
fig.savefig('Admitting.png')
# Define the sigmoid function:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Plot the sigmoid function:
x = np.arange(-10,10,0.1)
fig, axes = plt.subplots()
axes.plot(x, sigmoid(x), 'r')
axes.set_title('sigmoid function')
fig.savefig('sigmoid.png')
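Two properties of the sigmoid function are worth keeping in mind for what follows: sigmoid(0) = 0.5 (which is why 0.5 becomes the classification threshold below), and it is symmetric in the sense that sigmoid(-z) = 1 - sigmoid(z). A quick standalone check:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# sigmoid(0) = 0.5: the decision threshold corresponds to z = X @ theta = 0
print(sigmoid(0.0))                                   # 0.5
# Symmetry: sigmoid(-z) = 1 - sigmoid(z)
print(np.isclose(sigmoid(3.0) + sigmoid(-3.0), 1.0))  # True
```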
# Preprocess the data:
# Insert a column of ones into data (the intercept term)
data.insert(0, 'Ones', 1)
# Split the data into features X and labels y
X = data.iloc[:,:-1]  # X is the first three columns of data (excluding the index)
y = data.iloc[:,-1]   # y is the last column of data
# Convert X and y to NumPy arrays
X = X.values  # X is a 2-D array
y = y.values  # y is a 1-D array
theta = np.zeros(3)  # theta is a 1-D array
# Define the cost function:
def cost(theta, X, y):
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))
# X @ theta is the product of a 2-D array and a 1-D array; it is equivalent to np.dot(X, theta)
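A useful sanity check for this cost function: with theta = 0, sigmoid(X @ theta) = 0.5 for every sample, so the cost is exactly ln(2) ≈ 0.693 regardless of the labels. A minimal sketch using synthetic data (not the course dataset):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Synthetic data: 4 samples, 3 features including the intercept column of ones
X = np.array([[1., 2., 3.],
              [1., 0., 1.],
              [1., 5., 2.],
              [1., 1., 1.]])
y = np.array([1., 0., 1., 0.])

# With theta = 0, every predicted probability is 0.5, so the cost is ln(2)
print(np.isclose(cost(np.zeros(3), X, y), np.log(2)))  # True
```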
# Define the (vectorized) gradient function:
def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)
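Before handing an analytic gradient to an optimizer, it is worth verifying it against a finite-difference approximation of the cost. A sketch on random synthetic data (not the course dataset):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(size=3)

# Central finite differences approximate each partial derivative of the cost
eps = 1e-6
num = np.array([(cost(theta + eps * np.eye(3)[i], X, y)
               - cost(theta - eps * np.eye(3)[i], X, y)) / (2 * eps)
                for i in range(3)])
print(np.allclose(num, gradient(theta, X, y), atol=1e-5))  # True
```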
# Solve for the theta parameters of the logistic regression model with scipy.optimize.minimize:
import scipy.optimize as opt
res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
# The fitted parameters are stored in res.x; here res.x is array([-25.1574502 , 0.20620065, 0.20144018])
# Define the prediction function:
def predict(theta, X):
    probability = sigmoid(X @ theta)
    return [1 if x >= 0.5 else 0 for x in probability]
# Compute the prediction accuracy:
theta_min = res.x
predictions = predict(theta_min, X)
correct = [1 if a == b else 0 for (a, b) in zip(predictions, y)]
accuracy = 100 * sum(correct) / len(correct)
# print('accuracy = {0}%'.format(accuracy))
# accuracy comes out to 89%
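The list-comprehension bookkeeping above can be collapsed into a single vectorized comparison with NumPy. A minimal sketch with hypothetical predictions and labels (not the fitted model's output):

```python
import numpy as np

# Hypothetical predictions and true labels, for illustration only
predictions = np.array([1, 0, 1, 1, 0])
y = np.array([1, 0, 0, 1, 0])

# predictions == y is a boolean array; its mean is the fraction correct
accuracy = np.mean(predictions == y) * 100
print(accuracy)  # 80.0
```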
# Plot the decision boundary:
# The decision boundary is theta_0*x_0 + theta_1*x_1 + theta_2*x_2 = 0 (with x_0 = 1)
new_theta = -(res.x / res.x[2])
x = np.arange(100, step=0.1)
y = new_theta[0] + new_theta[1]*x  # the boundary equation solved for x_2
fig,axes = plt.subplots()
sns.scatterplot(x='Exam1',y='Exam2',hue='Admitted',s=100,style='Admitted',data=data,ax=axes)
axes.set_title("Student's admission situation")
axes.plot(x,y,'black')
fig.savefig('Decision_boundary.png')
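Why this line is the decision boundary: a point lies on it exactly when X @ theta = 0, which is where sigmoid gives probability 0.5, i.e. where the prediction flips between the two classes. A standalone sketch with an illustrative theta (not the fitted values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative parameters: the boundary is theta_0 + theta_1*x1 + theta_2*x2 = 0
theta = np.array([-6.0, 1.0, 2.0])
x1 = np.array([0.0, 2.0, 4.0])
x2 = -(theta[0] + theta[1] * x1) / theta[2]  # solve the boundary equation for x2

# Every point constructed this way sits exactly on the boundary,
# so the model assigns each one probability 0.5
X = np.column_stack([np.ones_like(x1), x1, x2])
print(np.allclose(sigmoid(X @ theta), 0.5))  # True
```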
The scatter plot with the decision boundary added is shown below:
References:
Andrew Ng's machine learning open course
Li Hang, Statistical Learning Methods (《统计学习方法》)
https://chenrudan.github.io/blog/2016/01/09/logisticregression.html (highly recommended)
https://blog.csdn.net/han_xiaoyang/article/details/49123419
https://blog.csdn.net/lilyth_lilyth/article/details/10032993