Basic principles of logistic regression:
Almost identical to linear regression, except that the hypothesis function differs.
1. Hypothesis function
2. Cost function
3. Optimization algorithm for the cost function (gradient descent)
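The three pieces above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later: `sigmoid` is the hypothesis, and `gradient_descent` runs batch gradient descent on the logistic-regression cost (the toy AND data and the step size `alpha=0.5` are my own choices for the example).

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, iters=5000):
    """Batch gradient descent on the logistic-regression cost J(theta)."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # hypothesis h_theta(x) for every sample
        theta -= alpha * (X.T @ (h - y)) / m   # gradient step on J(theta)
    return theta

# Toy data: logical AND, with a bias column of ones prepended.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
theta = gradient_descent(X, y)
pred = (sigmoid(X @ theta) >= 0.5).astype(int)
print(pred)  # recovers y on this linearly separable toy set
```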
Key concepts to understand:
1. Logistic regression is for classification (e.g. positive-class vs. negative-class problems), and 0 ≤ hθ(x) ≤ 1 (via the sigmoid function).
2. if hθ(x) ≥ 0.5: predict y = 1
   else: predict y = 0
3. If y = 1, the cost is 0 when hθ(x) = 1 and grows as hθ(x) moves toward 0; if y = 0, the cost is 0 when hθ(x) = 0 and grows as hθ(x) moves toward 1. From these curves you can write down J(θ) and guarantee that it is convex.
4. J(θ) is derived from maximum likelihood. My intuition for maximum likelihood: as the name suggests, "most likely". A normal distribution is determined by a mean and a variance; if we pick the parameters under which the product of the probabilities of the observed points (or, after taking logs, their sum) is largest, those parameters are the maximum-likelihood estimate (this is my own understanding of it).
5. A cost of 0 is not necessarily a good thing: the fit may look great, but generalization can be poor. This is where L1 and L2 regularization come in.
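Point 3 above can be checked numerically. Below is a small sketch (the function name `point_cost` is mine) of the per-sample cost −y·log(h) − (1−y)·log(1−h): for y = 1 it vanishes as h → 1 and blows up as h → 0, and for y = 0 it is the mirror image.

```python
import numpy as np

def point_cost(h, y):
    """Per-sample logistic-regression cost: -y*log(h) - (1-y)*log(1-h)."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# y = 1: cost approaches 0 as h -> 1 and grows without bound as h -> 0
print(point_cost(0.99, 1), point_cost(0.5, 1), point_cost(0.01, 1))
# y = 0: mirror image -- cost approaches 0 as h -> 0
print(point_cost(0.01, 0), point_cost(0.5, 0), point_cost(0.99, 0))
```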
Code
This is a dataset I found online; it seems to be about college admissions (let's just assume it is). I didn't handle the outliers either — you can see from the scatter plot that it's really messy — but this is just a small example, so I got lazy and used it as-is.
# -*-coding:utf-8-*-
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/ex4x.dat', names=['score1', 'score2', 'admit'])
df_admint = df[df['admit'] == 1]
df_noadmint = df[df['admit'] == 0]
plt.scatter(df_admint['score1'], df_admint['score2'], marker='o', c='r')
plt.scatter(df_noadmint['score1'], df_noadmint['score2'], marker='x', c='b')
plt.show()
df = df.values  # .as_matrix() is deprecated; use .values instead
# print(df)
col = df.shape[1]
X = df[:, 0:col - 1]
y = df[:, col - 1:col]
# print(X, y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)  # fit the scaler on the training set only
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
LR = LogisticRegression()
scores = cross_val_score(LR, x_train_std, y_train.ravel(), cv=10, scoring='neg_mean_squared_error')  # negative mean squared error
print(scores.mean())
LR.fit(x_train_std, y_train.ravel())  # ravel() avoids the column-vector warning
# predict = LR.predict(x_test_std)
# print(predict)
# print(LR.coef_, LR.intercept_)
print(LR.score(x_test_std, y_test))
print(LR.score(x_train_std, y_train))
df = pd.read_csv('data/ex4y.dat', names=['score1', 'score2', 'admit'])
df = df.values
col = df.shape[1]
X_pre = df[:, 0:col - 1]
x_pre_std = sc.transform(X_pre)
# print(x_pre_std)
pre = LR.predict(x_pre_std)
print('predicted %s' % pre)
print('actual %s' % df[:, col - 1:col].T)
print("--" * 30)
# Playing with polynomial features while we're at it...
from sklearn.preprocessing import PolynomialFeatures
for i in range(1, 10):
    print("i = %s" % i)
    pf = PolynomialFeatures(degree=i)
    x_train_std_pf = pf.fit_transform(x_train_std)
    x_test_std_pf = pf.transform(x_test_std)
    pf_score = cross_val_score(LR, x_train_std_pf, y_train.ravel(), cv=10)  # default scoring: accuracy
    print("cross-validation accuracy %s" % pf_score.mean())
    LR.fit(x_train_std_pf, y_train.ravel())
    print("model score = %s" % LR.score(x_train_std_pf, y_train))
    print("model predict score = %s" % LR.score(x_test_std_pf, y_test))
Results
-0.2533333333333333
0.9
0.7777777777777778
predicted [0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1.]
actual [[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]]
------------------------------------------------------------
i = 1
cross-validation accuracy 0.7466666666666667
model score = 0.7777777777777778
model predict score = 0.9
i = 2
cross-validation accuracy 0.8066666666666666
model score = 0.8888888888888888
model predict score = 0.8
i = 3
cross-validation accuracy 0.8366666666666667
model score = 0.8888888888888888
model predict score = 0.8
i = 4
cross-validation accuracy 0.8316666666666667
model score = 0.8888888888888888
model predict score = 0.75
i = 5
cross-validation accuracy 0.7866666666666667
model score = 0.8888888888888888
model predict score = 0.65
i = 6
cross-validation accuracy 0.7866666666666667
model score = 0.8888888888888888
model predict score = 0.65
i = 7
cross-validation accuracy 0.7866666666666667
model score = 0.8888888888888888
model predict score = 0.65
i = 8
cross-validation accuracy 0.8066666666666666
model score = 0.8888888888888888
model predict score = 0.65
i = 9
cross-validation accuracy 0.8066666666666666
model score = 0.8888888888888888
model predict score = 0.65