Learning the sklearn library: linear models

Linear models make predictions using a linear function of the input features. Algorithms for learning linear models differ in:
(1) How a particular combination of coefficients and intercept is scored for its fit to the training data; different algorithms use different measures of "fit to the training set", called loss functions.
(2) Whether they use regularization, and if so, which kind.

The main parameter of linear models is the regularization parameter. If you assume that only a few features are actually important, you should use L1 regularization; otherwise, L2 regularization should be the default.

When working with large datasets, it is worth looking into the solver='sag' option of LogisticRegression and Ridge, which can be faster than the default.
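A minimal sketch of what that looks like (constructor arguments only; 'sag' is an accepted value of the solver parameter for both estimators):

from sklearn.linear_model import LogisticRegression, Ridge

# 'sag' = stochastic average gradient, often faster than the default solver on large datasets
ridge_sag = Ridge(solver = 'sag')
logreg_sag = LogisticRegression(solver = 'sag')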

Linear models for regression

$y = \sum_i w_i x_i + b$

Here $x_i$ denotes the features of a single data point, $w_i$ is the slope along each feature axis (the weight of each input feature), $w_i$ and $b$ are the parameters of the model to be learned, and $y$ is the prediction the model makes.
Learning the parameters $w_0$ and $b$ on the one-dimensional wave dataset:

import mglearn  
mglearn.plots.plot_linear_regression_wave() 

Linear regression (ordinary least squares)

Linear regression finds the parameters $w$ and $b$ that minimize the mean squared error between the predictions for the training set and the true regression targets $y$.

Mean squared error: the sum of the squared differences between the predictions and the true values, divided by the number of samples.
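As a quick illustration of this definition (the numbers are made up for the example):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0]) # true targets
y_pred = np.array([1.5, 1.5, 2.5]) # predictions
mse = np.mean((y_pred - y_true) ** 2) # sum of squared differences / number of samples
print(mse) # 0.25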
# Linear regression predictions on the wave dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.make_wave(n_samples = 60)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)

lr = LinearRegression().fit(X_train,y_train)

# sklearn always stores values derived from the training data in attributes that end with a trailing underscore, to distinguish them from parameters set by the user
print('lr.coef_:{}'.format(lr.coef_))
print('lr.intercept_:{}'.format(lr.intercept_))

# If the scores on the training and test sets are very close, the model may be underfitting
print('Training set score:{:.2f}'.format(lr.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test,y_test)))
# LinearRegression on a high-dimensional dataset: the extended Boston housing dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
lr = LinearRegression().fit(X_train,y_train)

# A large gap between training and test performance is a clear sign of overfitting
print('Training set score:{:.2f}'.format(lr.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test,y_test)))

Ridge regression

Ridge regression uses the same prediction formula as ordinary least squares, but adds an L2 regularization constraint so that each feature's influence on the output is as small as possible. A larger alpha means a more constrained model, so we expect the entries of coef_ for a large alpha to be smaller than those for a small alpha.
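Written out (this is the standard formulation behind sklearn's Ridge, not spelled out in the original text), ridge minimizes

$\min_w \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2$

so the L2 penalty term $\alpha \|w\|_2^2$ is what shrinks the coefficients toward 0 as alpha grows.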
# Ridge on a high-dimensional dataset: the extended Boston housing dataset
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
ridge = Ridge().fit(X_train,y_train)

# Ridge scores lower than LinearRegression on the training set but higher on the test set
# The linear model overfits this data; Ridge is a more constrained model and less prone to overfitting
print('Training set score:{:.2f}'.format(ridge.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test,y_test)))
# Adjusting alpha: increasing alpha pushes the coefficients closer to 0, which lowers training-set performance but may improve generalization
# Ridge on a high-dimensional dataset: the extended Boston housing dataset
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)

# default alpha = 1.0
ridge = Ridge().fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test,y_test)))

# alpha = 10
ridge10 = Ridge(alpha = 10).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge10.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge10.score(X_test,y_test)))

# alpha = 0.1
ridge01 = Ridge(alpha = 0.1).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge01.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge01.score(X_test,y_test)))

# Plot the coefficients
plt.plot(ridge.coef_,'s',label = "Ridge alpha = 1")
plt.plot(ridge10.coef_,'^',label = "Ridge alpha = 10")
plt.plot(ridge01.coef_,'v',label = "Ridge alpha = 0.1")

plt.xlabel("Coefficient index") # x-axis: index into coef_; x = i is the coefficient of the i-th feature
plt.ylabel("Coefficient magnitude") # y-axis: the value of that coefficient
plt.hlines(0,0,len(ridge.coef_)) # horizontal reference line at y = 0
plt.ylim(-25,25) # set the y-axis limits
plt.legend(loc = 'best')
import mglearn
# Fix alpha and vary the amount of training data
# Subsample the Boston housing data and evaluate LinearRegression and Ridge(alpha = 1) on subsets of increasing size
# Learning curves
mglearn.plots.plot_ridge_n_samples()
# The training performance of linear regression decreases as the training set grows
# With enough training data, regularization becomes less important

Lasso

Lasso also constrains the coefficients to be close to 0, but in a different way, using L1 regularization. The result of L1 regularization is that some coefficients are exactly 0 when using Lasso, which can be seen as a form of automatic feature selection.
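The corresponding objective (as documented for sklearn's Lasso) swaps the L2 penalty for an L1 penalty:

$\min_w \; \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$

It is the L1 term $\alpha \|w\|_1$ that drives some coefficients exactly to 0.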
# Applying Lasso to the extended Boston housing dataset
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np
import matplotlib.pyplot as plt

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
lasso = Lasso().fit(X_train,y_train)

# Poor performance on both the training and the test set indicates underfitting
print('Training set score:{:.2f}'.format(lasso.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0))) # number of features with nonzero coefficients

# Lasso also has a regularization parameter alpha (default 1.0) that controls how strongly coefficients are pushed toward 0. To reduce underfitting, decrease alpha and increase max_iter (the maximum number of iterations to run)
# This fits a more complex model
lasso001 = Lasso(alpha = 0.01, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(lasso001.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso001.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))

# But if alpha is set too low, the effect of regularization is removed and the model overfits
lasso00001 = Lasso(alpha = 0.0001, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(lasso00001.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso00001.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))

plt.plot(lasso.coef_,'s',label = 'Lasso alpha = 1')
plt.plot(lasso001.coef_,'^',label = 'Lasso alpha = 0.01')
plt.plot(lasso00001.coef_,'v',label = 'Lasso alpha = 0.0001')

plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.legend(ncol = 2,loc = (0,1.05)) # legend arranged in 2 columns
plt.ylim(-25,25)

sklearn also provides the ElasticNet class, which combines the penalties of Lasso and Ridge. It has two parameters to tune: one for L1 regularization and one for L2 regularization.
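A minimal sketch on the same train/test split as the Lasso code above (the alpha and l1_ratio values are illustrative, not tuned):

from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 is pure Lasso, l1_ratio = 0.0 is pure Ridge
enet = ElasticNet(alpha = 0.01, l1_ratio = 0.5, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(enet.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(enet.score(X_test,y_test)))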

Linear models for classification

$y = \sum_i w_i x_i + b > 0$

Instead of returning the weighted sum of the features, the formula thresholds the prediction at zero: if $y < 0$, class -1 is predicted; if $y > 0$, class +1. For linear models for classification, the decision boundary is a linear function of the input, i.e., a linear classifier separates two classes with a line, a plane, or a hyperplane.

# Apply two linear classification models to the forge dataset and visualize the decision boundaries
from sklearn.linear_model import LogisticRegression   # logistic regression
from sklearn.svm import LinearSVC      # linear support vector machine
import matplotlib.pyplot as plt       # needed for plt.subplots below
import mglearn

X,y = mglearn.datasets.make_forge()

fig,axes = plt.subplots(1,2,figsize = (10,3))

for model,ax in zip([LinearSVC(), LogisticRegression()],axes):
    clf = model.fit(X,y)
    # alpha here sets the transparency of the plotted boundary line
    mglearn.plots.plot_2d_separator(clf, X, fill = False, eps = 0.5, ax = ax, alpha = 0.7) # visualize the decision boundary
    mglearn.discrete_scatter(X[:,0],X[:,1],y,ax = ax) # plot the data points
    
    ax.set_title("{}".format(clf.__class__.__name__))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
    ax.legend(loc = "best")
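To see the threshold-at-zero rule from the formula above in action, both classifiers expose decision_function, which returns the signed score for each sample. A quick check appended after the loop, using the last fitted clf (note that sklearn reports the dataset's own labels, here 0 and 1 rather than -1 and +1):

scores = clf.decision_function(X) # one signed score per sample
print(scores[:5])
print(clf.predict(X)[:5]) # class 1 exactly where the score is positive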

The LogisticRegression and LinearSVC models use L2 regularization by default. The trade-off parameter that determines the strength of regularization is called C: the larger the value of C, the weaker the regularization.

# Decision boundaries of a linear SVM on the forge dataset for different values of C
import mglearn
mglearn.plots.plot_linear_svc_regularization()

In high-dimensional spaces, linear models for classification are very powerful. When considering many features, avoiding overfitting becomes increasingly important.

# A closer look at LogisticRegression on the breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X_train,X_test,y_train,y_test = train_test_split(cancer.data, cancer.target,stratify = cancer.target,random_state = 42)

#C = 1.0
logreg = LogisticRegression().fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg.score(X_test,y_test)))

#C = 100
logreg100 = LogisticRegression(C = 100).fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg100.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg100.score(X_test,y_test)))

#C = 0.01
logreg001 = LogisticRegression(C = 0.01).fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg001.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg001.score(X_test,y_test)))

plt.plot(logreg.coef_.T,'o',label = "C = 1")
plt.plot(logreg100.coef_.T,'^',label = "C = 100")
plt.plot(logreg001.coef_.T,'v',label = "C = 0.01")

plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)

plt.hlines(0,0,cancer.data.shape[1])
plt.ylim(-5,5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
# The coefficients can tell us which feature is associated with which class


# LogisticRegression with L1 regularization
for C, marker in zip([0.001,1,100],['o','^','v']):
    lr_l1 = LogisticRegression(C = C, penalty = "l1", solver = "liblinear").fit(X_train,y_train) # penalty = "l1" requires a solver that supports it, e.g. liblinear
    print("Training accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C,lr_l1.score(X_train,y_train)))
    print("Test accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C,lr_l1.score(X_test,y_test)))

    plt.plot(lr_l1.coef_.T,marker,label = "C={:.3f}".format(C))
    
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)

plt.hlines(0,0,cancer.data.shape[1])
plt.ylim(-5,5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend(loc = 3)

The penalty parameter of the model affects regularization, that is, whether the model uses all available features or selects only a subset of them.
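As with Lasso, the effect is easy to check by counting nonzero coefficients (lr_l1 is the last model fitted in the loop above):

import numpy as np

print('Number of features used: {}'.format(np.sum(lr_l1.coef_ != 0)))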

Linear models for multiclass classification

Many linear classification models are binary only and do not extend naturally to the multiclass case. A common technique for extending a binary algorithm to multiclass is one-vs.-rest.
In one-vs.-rest, a binary model is learned for each class, trying to separate that class from all other classes.

Each class then has its own binary classifier, so each class has a coefficient vector w and an intercept b; the class whose classifier gives the largest score is the predicted class label.
# A two-dimensional toy dataset containing 3 classes
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC    # needed for the classifier below
import numpy as np                   # needed for np.linspace below
import mglearn
import matplotlib.pyplot as plt

X,y = make_blobs(random_state = 42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)

# Train a LinearSVC classifier
linear_svm = LinearSVC().fit(X,y)
print("Coefficient shape:", linear_svm.coef_.shape) #三条线,两个特征
print("Intercept shape:", linear_svm.intercept_.shape)

line = np.linspace(-15,15)

for coef,intercept,color in zip(linear_svm.coef_, linear_svm.intercept_,['b','r','g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c = color)

plt.ylim(-10,15)
plt.xlim(-10,8)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0","Class 1","Class 2","Line class 0","Line class 1","Line class 2"], loc = (1.01,0.3))

mglearn.plots.plot_2d_classification(linear_svm, X, fill = True, alpha = .7)  # visualize the decision regions
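To make the rule "the largest score wins" concrete, compare an explicit argmax over the per-class scores with predict; for a multiclass LinearSVC, decision_function returns one score per class and sample:

scores = linear_svm.decision_function(X) # shape (n_samples, 3): one score per class
manual_pred = np.argmax(scores, axis = 1) # pick the class with the largest score
print(np.all(manual_pred == linear_svm.predict(X))) # True: predict applies exactly this rule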

Questions about the code and methods

train_test_split(X, y, stratify=y)
https://blog.csdn.net/weixin_37226516/article/details/62042550

Ordinary least squares (OLS)
https://blog.csdn.net/enjoy524/article/details/53556038

Python notes: details of the plt.plot() function
https://blog.csdn.net/cjcrxzz/article/details/79627483

Setting axis ranges in Matplotlib
https://blog.csdn.net/ccy950903/article/details/50688449

Matrix theory: vector norms and matrix norms
https://blog.csdn.net/pipisorry/article/details/51030563

Understanding regularization and regularization terms
https://blog.csdn.net/gshgsh1228/article/details/52199870

Deep learning: L0, L1 and L2 norms
https://blog.csdn.net/zchang81/article/details/70208061

Machine learning: sklearn.Lasso
https://www.jianshu.com/p/1177a0bcb306

Reposted from blog.csdn.net/thj19980720/article/details/83107912