Implementing Linear Regression with scikit-learn

The linear regression algorithm is mainly used to solve regression problems, and it is the foundation of many powerful nonlinear models. Whether we use simple linear regression or multiple linear regression, the idea is the same. Suppose we have found the best-fit line y = ax + b (for simple linear regression; in multiple linear regression each sample's features form a vector, with one coefficient per feature). Then for each sample point x^(i), the line predicts ŷ^(i) = ax^(i) + b, while the true value is y^(i), and we want the gap between y^(i) and ŷ^(i) to be as small as possible.
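
To make "as small as possible" precise, linear regression minimizes the sum of squared residuals over all m samples, which is the standard least-squares objective (written here in LaTeX for reference; the original post only gives the line equation):

% least-squares objective: choose a and b to minimize the total squared error
\min_{a,\,b} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \min_{a,\,b} \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right)^2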

Next, let's implement linear regression with scikit-learn. As usual, we start by importing the common libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Here we use the Boston housing dataset and drop the samples whose target hits the cap value of 50. In practice such a capped extreme often means the true value could not be recorded, for example because of survey limits or instrument range, so we remove the points where y equals 50 from the dataset:

boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this requires an older version

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

Next we split the data into training and test sets, construct the estimator, and fit it on the training set:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

Before checking the regression performance, we can first look at the coefficients and intercept of the fitted equation ŷ:

lin_reg.coef_
lin_reg.intercept_
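
Each coefficient belongs to one feature, so sorting them gives a rough idea of which features push the predicted price down or up (a small inspection sketch of mine; note the features are not standardized, so the magnitudes are only indicative):

# sort features by their learned coefficient, from most negative to most positive
sorted_idx = np.argsort(lin_reg.coef_)
print(boston.feature_names[sorted_idx])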

Finally, let's look at the linear regression's score and its predictions on the test set:

lin_reg.score(X_test, y_test)
lin_reg.predict(X_test)
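
For a regressor, score returns the R² (coefficient of determination) on the given data, so the two lines above are closely related: you can reproduce the score by hand from the predictions:

from sklearn.metrics import r2_score

y_predict = lin_reg.predict(X_test)
print(r2_score(y_test, y_predict))   # identical to lin_reg.score(X_test, y_test)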

With that, a multiple linear regression model is complete. On my machine the score is 0.80089168995191, and your result should land in the same ballpark. In the previous post we used the kNN algorithm to solve a classification problem, but kNN can also solve regression problems, so let's see how it performs on the same dataset.
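
The idea behind kNN regression is simple: to predict a new point, take the k training samples closest to it and average their target values. A minimal numpy sketch of that idea (my own illustration, not scikit-learn's actual implementation):

def knn_regress(X_train, y_train, x, k=5):
    """Predict x's target as the mean y of its k nearest neighbors (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)  # distance from x to every training sample
    nearest = np.argsort(distances)[:k]              # indices of the k closest samples
    return y_train[nearest].mean()

print(knn_regress(X_train, y_train, X_test[0]))      # rough prediction for one test sample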

First, import the relevant libraries (continuing from the code above; libraries that were already imported are not imported again):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

Train the model and check the result:

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)

On my machine the score is 0.60, which is quite a bit worse than linear regression. But we may not be using the best hyperparameters here, so let's run a grid search:

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
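
In this grid, p is searched only in the distance-weighted branch; it is the exponent of the Minkowski metric that KNeighborsRegressor uses to measure distance, where p = 1 gives the Manhattan distance and p = 2 (the default) gives the Euclidean distance:

% Minkowski distance between two points x and y with n features
d_p(x, y) = \left( \sum_{j=1}^{n} \lvert x_j - y_j \rvert^{p} \right)^{1/p}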

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

Check the best hyperparameters:

grid_search.best_params_

Check the score:

grid_search.best_estimator_.score(X_test, y_test)
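
Note that we score best_estimator_ on the held-out test set instead of reading grid_search.best_score_: the latter is the mean cross-validated R² on the training folds, so the two numbers usually differ:

print(grid_search.best_score_)                            # CV score on the training folds
print(grid_search.best_estimator_.score(X_test, y_test))  # score on unseen test data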

On my machine the result is 0.73. That is still somewhat worse than linear regression, but it is at least in the same ballpark. The complete code follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this requires an older version

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

lin_reg.coef_
lin_reg.intercept_

lin_reg.score(X_test, y_test)

from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)

from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

grid_search.best_params_

grid_search.best_estimator_.score(X_test, y_test)

Reposted from blog.csdn.net/sinat_33150417/article/details/83338942