Introduction to Machine Learning: Linear Regression, Ridge Regression, and Lasso Regression (Part 2)

I. Linear Regression

1. Overview of Linear Regression

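The goal of regression is to predict a numeric target value. The most direct approach is to write down a formula that computes the target from the inputs, the so-called regression equation, for example y = a*x1 + b*x2. Finding the regression coefficients (here a and b) is what "regression" refers to. How does prediction then work? Once the coefficients are known, multiply each input by its coefficient and add up the products; the sum is the predicted value. "Regression" usually means linear regression; nonlinear regression also exists but is not discussed here.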

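Suppose the input data are stored in a matrix X and the regression coefficients in a vector w. For a given input $x_1$, the prediction is $y_1 = x_1^T w$, so the problem is to find w. The most common approach is to choose the w that minimizes the error, where the error is the difference between the predicted y and the true y. Because positive and negative differences would cancel when summed, the squared error is used instead: the sum of the squared differences between predicted and true y values, which in matrix form is
$$
(y - Xw)^T (y - Xw)
$$
The problem is now to find the w that minimizes this expression. Its derivative with respect to w is $-2X^T(y - Xw)$. Setting the derivative to zero and solving, assuming $X^T X$ is invertible, gives
$$
\hat{w} = (X^T X)^{-1} X^T y
$$
This $\hat{w}$ is the estimate of w produced by this method, known as ordinary least squares.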
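The closed-form solution above can be checked numerically. The following is a minimal sketch on synthetic data; the helper name `ols_fit`, the data, and the "true" coefficients are assumptions made only for illustration. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np

def ols_fit(X, y):
    """Hypothetical helper: solve the normal equations X^T X w = X^T y for w."""
    # Solving the linear system avoids computing (X^T X)^{-1} explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (an assumption for this sketch): 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w_hat = ols_fit(X, y)
# Cross-check against NumPy's built-in least-squares solver
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_hat)
print(w_lstsq)
```

Both solvers should agree up to floating-point error, and with noise this small the estimate should lie close to `true_w`.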

Case study: diabetes regression analysis

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

# Import the datasets module and load the data
from sklearn import datasets
diabetes = datasets.load_diabetes()

# Build a DataFrame of features and an array of target values
train = DataFrame(data = diabetes.data,columns = diabetes.feature_names)
target = diabetes.target

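# Split the dataset into a training set and a test set
# train_test_split lives in the model_selection package
from sklearn.model_selection import train_test_split

# train: feature matrix
# target: target values (labels)
# test_size: fraction of the data held out as the test set
# random_state: random seed; a fixed seed produces the same split on every run
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.1, random_state=42)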

# Import the linear regression and KNN models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Fit the model on the training data
linear = LinearRegression()
linear.fit(X_train,y_train)
# Predict on the test data
y_ = linear.predict(X_test)

plt.plot(y_,label='Predict')
plt.plot(y_test,label='True')
plt.legend()

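# The quality of a linear regression fit is judged by the size of the error between true and predicted values
# Residual histogram: tall and narrow around 0 is good, short and wide is bad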
plt.hist(y_test - y_,rwidth=0.9,bins=10)
plt.xlabel('cost-value')
plt.ylabel('numbers')

# r2_score (goodness of fit) can also be used to evaluate a regression model
# When comparing regression models, a larger r2 score is better
from sklearn.metrics import r2_score
r2_score(y_test,y_)
#0.5514251914993504
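The `r2_score` call above computes the coefficient of determination, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A minimal sketch that reproduces it by hand follows; the function name `r2_manual` and the four-point toy data are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Toy values (an assumption for this sketch)
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

manual = r2_manual(y_true_demo, y_pred_demo)
library = r2_score(y_true_demo, y_pred_demo)
print(manual, library)  # both approximately 0.9486
```

A score of 1 means perfect prediction, 0 means no better than always predicting the mean, and negative values mean worse than the mean.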

# Fit a KNN regressor for comparison
knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train,y_train)
knn_y_ = knn.predict(X_test)

plt.plot(y_,label='Linear-Predict')
plt.plot(knn_y_,label='knn-Predict')
plt.plot(y_test,label='True')
plt.legend()

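# Compare KNN and linear regression using residual histograms
plt.figure(figsize=(10,4))
axes1 = plt.subplot(1,2,1)
# Use the same sign convention (true minus predicted) as the linear panel
axes1.hist(y_test - knn_y_,rwidth=0.9)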
axes1.set_xlabel('cost-value')
axes1.set_ylabel('numbers')
axes1.set_title('KNN')

axes2 = plt.subplot(1,2,2)
axes2.hist(y_test - y_,rwidth=0.9,bins=10)
axes2.set_xlabel('cost-value')
axes2.set_ylabel('numbers')
axes2.set_title('Linear')

Note: fixing the random seed makes the generated random numbers reproducible, for example:

np.random.seed(1)
np.random.randint(0,10,size=10)
array([5, 8, 9, 5, 0, 0, 1, 7, 6, 9])

II. Locally Weighted Linear Regression (LWLR)

1. Overview

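To counter the underfitting that linear regression can exhibit, some bias can be introduced into the estimate, which leads to locally weighted linear regression. In this algorithm each training point near the query point is given a weight, and an ordinary least-squares regression is then performed on this weighted subset. The resulting regression coefficients are
$$
\hat{w} = (X^T W X)^{-1} X^T W y
$$
where W is a diagonal matrix holding the weight of each data point. How are the weights chosen? A kernel function can be read as a measure of similarity between two points, so a kernel is used to weight each training point according to its similarity to the query point. With the most common choice, the Gaussian kernel, the weights are
$$
W(i,i) = \exp\left(\frac{\|x^{(i)} - x\|^2}{-2k^2}\right)
$$
where k is the bandwidth parameter. There is no fixed rule for choosing k, only rough ranges, so in practice one tries several values and keeps the one that gives the best result.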
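The weighted solution can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `lwlr_predict`, the noisy sine data, and the value k = 0.3 are all made up for the example, and the kernel uses the squared Euclidean distance.

```python
import numpy as np

def lwlr_predict(x_query, X, y, k=1.0):
    """Hypothetical helper: predict at x_query with locally weighted linear regression."""
    # Gaussian kernel weight for every training point
    diff = X - x_query
    weights = np.exp(np.sum(diff ** 2, axis=1) / (-2.0 * k ** 2))
    W = np.diag(weights)
    # Solve (X^T W X) w = X^T W y for the local coefficients
    w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ w

# Synthetic data (an assumption for this sketch): a noisy sine curve
rng = np.random.default_rng(0)
x_raw = np.sort(rng.uniform(0, 6, size=80))
X_demo = np.column_stack([np.ones_like(x_raw), x_raw])  # intercept column + feature
y_demo = np.sin(x_raw) + 0.05 * rng.normal(size=80)

x_q = np.array([1.0, np.pi / 2])  # query point, where sin(x) = 1
local_pred = lwlr_predict(x_q, X_demo, y_demo, k=0.3)

# With a very large k every weight is close to 1, so LWLR reduces to ordinary least squares
global_pred = lwlr_predict(x_q, X_demo, y_demo, k=1e6)
ols_w = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(local_pred, global_pred, x_q @ ols_w)
```

A small k follows the local shape of the data (and can overfit), while a large k approaches the single global line. Note that a separate linear system is solved for every query point, so prediction is more expensive than with ordinary linear regression.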

III. Ridge Regression

1. Overview

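When there are more features than samples, $X^T X$ is singular and cannot be inverted, so the least-squares formula from Section I breaks down. To solve this problem, statisticians introduced ridge regression. Put simply, ridge regression adds $\lambda I$ to the matrix $X^T X$ so that the sum is non-singular and $X^T X + \lambda I$ can be inverted. Here I is the m×m identity matrix (ones on the diagonal and zeros elsewhere, with m the number of features) and λ is a user-defined value. The regression coefficients are then computed as
$$
\hat{w} = (X^T X + \lambda I)^{-1} X^T y
$$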

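Ridge regression multiplies the identity matrix by the constant λ. Since the ones in I run along the whole diagonal and every other entry is 0, the matrix looks like a "ridge" of ones rising from a plane of zeros, which is where the name comes from.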

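Ridge regression was first used for the case where there are more features than samples; it is now also used to introduce bias into the estimate in order to obtain a better one. Here λ limits the sum of the squares of all the w. This penalty term shrinks the less important parameters, a technique known in statistics as shrinkage. Shrinkage suppresses unimportant parameters and so helps in understanding the data. To choose λ, try several values and keep the one that gives the smallest prediction error.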
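The ridge formula can be sketched directly in NumPy and compared with scikit-learn. In this sketch the helper name `ridge_fit` and the random data are assumptions for illustration. `Ridge` is created with `fit_intercept=False` so that it solves the same problem as the formula, which has no intercept term.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_fit(X, y, lam):
    """Hypothetical helper: solve (X^T X + lam * I) w = X^T y for w."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption): 5 samples, 8 features, so X^T X is singular
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 8))
y_demo = rng.normal(size=5)

rank = np.linalg.matrix_rank(X_demo.T @ X_demo)
print("rank of X^T X:", rank, "out of", X_demo.shape[1])

w_ridge = ridge_fit(X_demo, y_demo, lam=1.0)
sk_ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X_demo, y_demo)
print(np.allclose(w_ridge, sk_ridge.coef_, atol=1e-6))

# A larger lambda shrinks the coefficient vector further
norm_small = np.linalg.norm(ridge_fit(X_demo, y_demo, lam=1.0))
norm_large = np.linalg.norm(ridge_fit(X_demo, y_demo, lam=10.0))
print(norm_small, norm_large)
```

Even though the plain normal equations have no unique solution here, adding λI makes the system solvable, and increasing λ pulls the coefficients toward zero.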

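Advantages:

Shrinkage suppresses unimportant parameters, which makes the data easier to understand. In addition, shrinkage methods can give better predictions than plain linear regression.
Ridge regression is least squares with an added second-order (L2) regularization term. It is mainly suited to cases of severe overfitting or multicollinearity among the variables. Ridge regression is biased; the bias is accepted in exchange for lower variance.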

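Overfitting: making the hypothesis overly strict in order to make it consistent with the training data
bias: the error between the model's output on the samples and the true values
variance: the error between each model's output and the mean (expectation) of the outputs of all models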

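Characteristics

1. Ridge regression can handle the case where there are more features than samples
2. As a shrinkage method, ridge regression indicates which features are more or less important, an effect somewhat similar to dimensionality reduction
3. A shrinkage method can be seen as adding bias to a model while reducing its variance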

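Application scenarios:

1. Fewer data points than variables
2. Collinearity among the variables (the least-squares coefficients are unstable and have large variance)
3. Data that are highly correlated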

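Multicollinearity means that the explanatory variables in a linear regression model are exactly or highly correlated with one another, which distorts the model estimates or makes them hard to estimate accurately.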
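One way to see why collinearity makes least squares unstable is to look at the condition number of $X^T X$: the larger it is, the more small changes in the data are amplified in the solution. The sketch below uses made-up data in which the second column is almost a copy of the first; the variable names and the value λ = 1 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # almost an exact copy of x1
X_collinear = np.column_stack([x1, x2])
X_independent = rng.normal(size=(n, 2))

# Condition number of X^T X with and without collinearity
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)
cond_independent = np.linalg.cond(X_independent.T @ X_independent)

# Adding lambda * I, as ridge regression does, improves the conditioning
lam = 1.0
cond_ridge = np.linalg.cond(X_collinear.T @ X_collinear + lam * np.eye(2))

print(f"collinear:   {cond_collinear:.3e}")
print(f"independent: {cond_independent:.3e}")
print(f"ridge-fixed: {cond_ridge:.3e}")
```

The collinear matrix is many orders of magnitude worse conditioned than the independent one, and the ridge term brings it back to a manageable size.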

Case study: ridge regression

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

n_samples = 5
n_features = 3

# Generate sample data with 5 rows and 3 columns
train = np.random.random(size=(n_samples,n_features))
target = [1,2,3,4,5]

from sklearn.linear_model import LinearRegression
# Create and fit the linear regression model
linear = LinearRegression()
linear.fit(train,target)

# Compare ordinary linear regression with ridge regression and Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

n_samples = 200
n_features = 10
# Generate the sample data
train = np.random.random(size=(n_samples,n_features))
target = np.random.randint(0,10,size=200)

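# alpha is the ridge penalty coefficient
# The larger alpha is, the more strongly the coefficients are shrunk
# The smaller alpha is, the closer the coefficients are to those of plain linear regression
ridge = Ridge(alpha=10)

linear = LinearRegression()
# alpha is the Lasso penalty coefficient
# The larger the penalty, the smaller the influence of the coefficients
# The smaller the penalty, the larger the influence of the coefficients
lasso = Lasso(alpha=0.1)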

ridge.fit(train,target)
linear.fit(train,target)
lasso.fit(train,target)

import matplotlib.pyplot as plt
%matplotlib inline

ridge_coef = ridge.coef_
line_coef = linear.coef_
lasso_coef = lasso.coef_

plt.plot(ridge_coef,label='Ridge')
plt.plot(line_coef,label='Linear')
plt.plot(lasso_coef,label='Lasso')

plt.legend()
plt.title('coefs')

IV. Lasso Regression (Least Absolute Shrinkage and Selection Operator)

1. Overview

Lasso stands for least absolute shrinkage and selection operator.

Like ridge regression, it adds a penalty function in order to identify and remove collinearity among the features.

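Adding a constraint on the parameters w achieves an effect similar to ridge regression:
$$
\|w\|_1 = \sum_{i=1}^{d} |w_i| \le \lambda
$$
When λ is small enough, some of the weaker coefficients are forced all the way to 0.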
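The way Lasso drives coefficients to exactly 0 can be observed with scikit-learn. One caveat: `Lasso` uses the penalty form of the problem, where `alpha` multiplies the L1 norm in the objective, so a larger `alpha` plays the role of a smaller λ in the constraint above. The data, the `alpha` values, and the variable names in this sketch are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption): 10 features, only the first 3 are informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [4.0, -3.0, 2.0]
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=100)

# Fit with increasing penalty strength and record the coefficients
coefs = {}
for alpha in (0.01, 0.5, 2.0):
    coefs[alpha] = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    print(alpha, "non-zero coefficients:", int(np.sum(coefs[alpha] != 0)))

# For scikit-learn's objective, any alpha at or above max|Xc^T yc| / n
# (computed on centered data) sets every coefficient to exactly 0
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y_demo)
all_zero = Lasso(alpha=alpha_max * 1.01).fit(X_demo, y_demo)
print("all coefficients zero:", bool(np.all(all_zero.coef_ == 0)))
```

Ridge regression, by contrast, shrinks coefficients toward 0 but generally does not make them exactly 0, which is why Lasso is also used for feature selection.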

Comprehensive case study: Boston housing prices

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

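# Import the package and load the data
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn import datasets
boston = datasets.load_boston()
train = boston.data
target = boston.target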

# Split off a test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(train,target,test_size=0.2,random_state=2)

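# Fit the three models and compare their predictions
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score
linear = LinearRegression()
# alpha=0 switches the penalty off, so ridge and Lasso reduce to ordinary least squares
ridge = Ridge(alpha=0)
lasso = Lasso(alpha=0)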

linear.fit(X_train,y_train)
ridge.fit(X_train,y_train)
lasso.fit(X_train,y_train)

y1_ = linear.predict(X_test)
y2_ = ridge.predict(X_test)
y3_ = lasso.predict(X_test)

print("linear:{},ridge:{},lasso:{}".format(r2_score(y_test,y1_),r2_score(y_test,y2_),r2_score(y_test,y3_)))

#linear:0.778720987477258,ridge:0.778720987477258,lasso:0.7787209874772579

# The three scores are essentially identical, so ordinary linear regression is the one to use
plt.plot(y1_,label='Predict')
plt.plot(y_test,label='True')
plt.xlabel('samples')
plt.ylabel('prices')
plt.legend()

V. Ordinary Linear Regression, Ridge Regression, and Lasso Regression

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Score the model from the true and predicted values
from sklearn.metrics import r2_score 

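np.random.seed(42) # Fix NumPy's random seed so the generated numbers are reproducible
n_samples,n_features = 50,200
# Generate the samples
x = np.random.randn(n_samples,n_features)
# Coefficients, i.e. w
coef = 3*np.random.randn(n_features)
# Indices used to zero out coefficients
inds = np.arange(n_features)
# Shuffle the indices
np.random.shuffle(inds)
# Zero out all but 10 of the coefficients
coef[inds[10:]] = 0
# Target values
y = np.dot(x,coef)
# Add noise
# Note: np.random.normal(n_samples) draws a single value with mean n_samples, so this adds
# one constant offset to every target. np.random.normal(size=n_samples) would give
# per-sample noise, but would also alter the scores reported below
y += 0.01*np.random.normal(n_samples)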

# Training data
x_train,y_train = x[:n_samples//2],y[:n_samples//2]
# Test data
x_test,y_test = x[n_samples//2:],y[n_samples//2:]

Using ordinary linear regression

# Ordinary linear regression
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
# Fit the model on the training data
reg = lreg.fit(x_train,y_train)
# Predict on the test data
y_pred = lreg.predict(x_test)
r2_score(y_test,y_pred)

# Output: 0.04200879234206356

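Using ridge regression

# Ridge regression
from sklearn.linear_model import Ridge
ridge = Ridge()
# Fit the model on the training data
reg2 = ridge.fit(x_train,y_train)
# Predict on the test data
y_pred = ridge.predict(x_test)
# Score the regression model
r2_score(y_test,y_pred)

# Output: 0.04340880021578697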

Using Lasso regression

# Lasso regression
from sklearn.linear_model import Lasso
lasso = Lasso()
reg3 = lasso.fit(x_train,y_train)
y_pred_lasso = lasso.predict(x_test)
r2_score(y_test,y_pred_lasso)

# Output: 0.2429444024252334

Plot the coefficients for comparison

# Plot the fitted coefficients
plt.figure(figsize=(12,8))
# Linear regression
plt.subplot(221)
plt.plot(reg.coef_,color = 'lightgreen',lw = 2,label = 'lr coefficients')
plt.legend()
# Ridge regression
plt.subplot(222)
plt.plot(reg2.coef_,color = 'red',lw = 2,label = 'ridge coefficients')
plt.legend()
# Lasso regression
plt.subplot(223)
plt.plot(reg3.coef_,color = 'gold',lw = 2,label = 'lasso coefficients')
plt.legend()
# True coefficients w used to generate the data
plt.subplot(224)
plt.plot(coef,color = 'navy',lw = 2,label = 'original coefficients')
plt.legend()

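Summary:

Like classification, regression is a process of predicting a target value. The difference is that regression predicts a continuous variable, while classification predicts a discrete one. Regression is one of the most powerful tools in statistics. Given an input matrix X, the least-squares method can be applied directly if the inverse of $X^T X$ exists. The best regression coefficients for the features are obtained by minimizing the sum of squared errors, and the quality of a fit can be judged by the correlation between the predicted and true values. When there are fewer samples than features, the inverse of $X^T X$ cannot be computed directly, and ridge regression can be considered instead.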