Python Regression Analysis Summary: Linear Models and Ridge Regression

1. Overview of Regression Analysis

The target (dependent variable) is continuous. Regression finds a functional relationship between the dependent variable and the independent variables, which is then used to predict the target.

Common regression methods: linear regression, ridge regression, non-linear regression.

Goal of regression fitting: estimate the parameters of the function relating the independent variables to the dependent variable.

Fitting iteratively shrinks the gap between predicted and true values; ideally the remaining gap (the error term) behaves like random noise with mean 0 and constant variance.

2. Loss Function

[Original image unavailable: loss function formula]
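The original image cannot be recovered; for linear regression the loss it presumably showed is the standard least-squares (sum of squared errors) objective:

```latex
J(w) = \sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2
```

where \(h_w(x) = w^\top x + b\) is the model's prediction for sample \(x\).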

3. Optimization Algorithms

Methods for driving the loss function to its minimum.

Methods:

Normal equation
Gradient descent

[Original image unavailable: normal equation and gradient descent formulas]
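The original image cannot be recovered; in standard form the two methods it presumably illustrated are:

```latex
% Normal equation: closed-form least-squares solution
w = (X^\top X)^{-1} X^\top y

% Gradient descent: iterative update with learning rate \alpha
w \leftarrow w - \alpha \frac{\partial J(w)}{\partial w}
```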

4. Python APIs

4.1.1 statsmodels.formula.api.OLS(): ordinary least squares model fitting - commonly used
4.1.2 scipy.stats.linregress(): linear fitting
4.1.3 scipy.optimize.curve_fit(): fitting a regression function
Usage reference: https://blog.csdn.net/weixin_41685388/article/details/104346268
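As a minimal sketch of the `scipy.stats.linregress` API listed above (the toy data is illustrative, not from the original post):

```python
# Toy example: fit a perfectly linear dataset y = 2x + 1
from scipy import stats

x = [0, 1, 2, 3, 4]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

result = stats.linregress(x, y)
print(result.slope)      # slope of the fitted line, ~2.0
print(result.intercept)  # intercept of the fitted line, ~1.0
print(result.rvalue)     # correlation coefficient, ~1.0 for a perfect fit
```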

5. scikit-learn Linear Model APIs

1. sklearn.linear_model.LinearRegression(fit_intercept=True)

- Optimized via the normal equation
- fit_intercept: whether to fit the intercept
- LinearRegression.coef_: the regression coefficients
- LinearRegression.intercept_: the intercept
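A minimal sketch of these attributes on toy data (the data is illustrative, not from the post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression(fit_intercept=True)  # solved in closed form (least squares)
model.fit(X, y)

print(model.coef_)       # regression coefficients, ~[2.]
print(model.intercept_)  # intercept, ~1.0
```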

2. sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)

- SGDRegressor implements learning via stochastic gradient descent; it supports different loss functions and regularization penalties for fitting linear regression models.
- loss: the loss type
    - loss="squared_loss": ordinary least squares (renamed "squared_error" in newer scikit-learn versions)
- fit_intercept: whether to fit the intercept (True/False)
- eta0=0.01: the initial learning rate
- learning_rate: how the learning rate is updated during iteration
    - "constant": eta = eta0
    - "optimal": eta = 1.0/(alpha*(t+t0))
    - "invscaling": eta = eta0/pow(t, power_t)
        - power_t=0.25: defined in the parent class
    - For a constant learning rate, pass learning_rate='constant' and set eta0.
- SGDRegressor.coef_: the regression coefficients
- SGDRegressor.intercept_: the intercept

6. Regression Performance Evaluation

The smaller the mean squared error, the better the model, relatively speaking.
[Original image unavailable: mean squared error formula]
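The missing image presumably showed the mean squared error, which `sklearn.metrics.mean_squared_error` computes as:

```latex
MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2
```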

7. Underfitting and Overfitting

Underfitting: the model fits poorly; add more data and feature variables.

Overfitting: the model is too complex; accuracy is high on the training set, but performance drops sharply on the test set.

- Cause: too many raw features, some of them noisy; the model becomes overly complex because it tries to accommodate every training data point.
- Remedies for linear models:
    - Regularization: L1 regularization, L2 regularization

[Original image unavailable: L1/L2 regularization formulas]
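The missing image presumably showed the regularized objectives; in standard form the two penalties are:

```latex
% L1 (lasso): penalizes the absolute values of the weights
J(w) = \sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j} |w_j|

% L2 (ridge): penalizes the squared weights
J(w) = \sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j} w_j^2
```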

8. Improving Linear Regression: Ridge Regression

Ridge regression is still a linear regression; it simply adds an L2 regularization penalty when fitting the regression equation, which mitigates overfitting.

API: sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)

- Linear regression with L2 regularization
- alpha: regularization strength, the penalty coefficient λ. The stronger the regularization, the smaller the weight coefficients; the weaker it is, the larger they are.
    - Typical values: 0~1 or 1~10
- solver: automatically chooses an optimization method based on the data
    - sag: stochastic average gradient descent, chosen when both the dataset and the feature count are large
- normalize: whether to standardize the data automatically (this parameter has since been removed from scikit-learn)
    - With normalize=False, standardize the data yourself with preprocessing.StandardScaler before calling fit
- Ridge.coef_: the regression weights
- Ridge.intercept_: the intercept
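A minimal sketch of the alpha shrinkage behavior described above (toy data, not from the post):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data roughly following y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

ridge_small = Ridge(alpha=0.1).fit(X, y)    # weak regularization
ridge_large = Ridge(alpha=100.0).fit(X, y)  # strong regularization

# Stronger regularization shrinks the weight toward zero
print(ridge_small.coef_, ridge_large.coef_)
```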

9. Example Code

from sklearn.datasets import load_boston  # Boston housing dataset (note: removed in scikit-learn 1.2)
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.preprocessing import StandardScaler  # feature standardization
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge  # normal-equation, SGD, and ridge estimators
from sklearn.metrics import mean_squared_error  # mean squared error
import joblib  # model saving/loading (formerly sklearn.externals.joblib, removed in newer scikit-learn)


def linear1():
    """
    Predict Boston housing prices using the normal-equation solver.
    :return:
    """
    # 1) Load the data
    boston = load_boston()  # Boston housing dataset

    # 2) Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # 3) Standardize features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4) Fit the estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)

    # 5) Inspect the model
    print("Normal equation - weight coefficients:\n", estimator.coef_)
    print("Normal equation - intercept:\n", estimator.intercept_)

    # 6) Evaluate the model
    y_predict = estimator.predict(x_test)
    print("Predicted prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Normal equation - mean squared error:\n", error)

    return None


def linear2():
    """
    Predict Boston housing prices using gradient-descent optimization.
    :return:
    """
    # 1) Load the data
    boston = load_boston()
    print("Data shape:\n", boston.data.shape)

    # 2) Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # 3) Standardize features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4) Fit the estimator
    estimator = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=10000, penalty="l1")
    estimator.fit(x_train, y_train)

    # 5) Inspect the model
    print("Gradient descent - weight coefficients:\n", estimator.coef_)
    print("Gradient descent - intercept:\n", estimator.intercept_)

    # 6) Evaluate the model
    y_predict = estimator.predict(x_test)
    print("Predicted prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent - mean squared error:\n", error)

    return None


def linear3():
    """
    Predict Boston housing prices using ridge regression.
    :return:
    """
    # 1) Load the data
    boston = load_boston()
    print("Data shape:\n", boston.data.shape)

    # 2) Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # 3) Standardize features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4) Fit the estimator
    estimator = Ridge(alpha=0.5, max_iter=10000)
    estimator.fit(x_train, y_train)

    # Save the model
    joblib.dump(estimator, "my_ridge.pkl")

    # To load the model instead, comment out step 4) and the save above
#     estimator = joblib.load("my_ridge.pkl")

    # 5) Inspect the model
    print("Ridge regression - weight coefficients:\n", estimator.coef_)
    print("Ridge regression - intercept:\n", estimator.intercept_)

    # 6) Evaluate the model
    y_predict = estimator.predict(x_test)
    print("Predicted prices:\n", y_predict)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression - mean squared error:\n", error)

    return None

if __name__ == "__main__":
    # Code 1: normal-equation optimization for Boston housing prediction
    linear1()
    # Code 2: gradient-descent optimization for Boston housing prediction
    linear2()
    # Code 3: ridge regression for Boston housing prediction
    linear3()

Output:

Normal equation - weight coefficients:
[-0.63330277 1.14524456 -0.05645213 0.74282329 -1.95823403 2.70614818
-0.07544614 -3.29771933 2.49437742 -1.85578218 -1.7518438 0.8816005
-3.92011059]
Normal equation - intercept:
22.62137203166228
Predicted prices:
[28.23494214 31.51307591 21.11158648 32.66626323 20.00183117 19.06699551
21.0961119 19.61374904 19.61770489 32.88592905 20.9786404 27.52841267
15.54828312 19.78740662 36.89507874 18.81564352 9.34846191 18.49591496
30.67162831 24.30515001 19.06869647 34.10872969 29.82133504 17.52652164
34.90809099 26.5518049 34.71029597 27.42733357 19.096319 14.92856162
30.86006302 15.8783044 37.1757242 7.80943257 16.23745554 17.17366271
7.46619503 20.00428873 40.58796715 28.93648294 25.25640752 17.73215197
38.74782311 6.87753104 21.79892653 25.2879307 20.43140241 20.47297067
17.25472052 26.14086662 8.47995047 27.51138229 30.58418801 16.57906517
9.35431527 35.54126306 32.29698317 21.81396457 17.60000884 22.07940501
23.49673392 24.10792657 20.13898247 38.52731389 24.58425972 19.7678374
13.90105731 6.77759905 42.04821253 21.92454718 16.8868124 22.58439325
40.75850574 21.40493055 36.89550591 27.19933607 20.98475235 20.35089273
25.35827725 22.19234062 31.13660054 20.39576992 23.99395511 31.54664956
26.74584297 20.89907127 29.08389387 21.98344006 26.29122253 20.1757307
25.49308523 24.08473351 19.89049624 16.50220723 15.21335458 18.38992582
24.83578855 16.59840245 20.88232963 26.7138003 20.75135414 17.87670216
24.2990126 23.37979066 21.6475525 36.8205059 15.86479489 21.42514368
32.81282808 33.74331087 20.62139404 26.88700445 22.65319133 17.34888735
21.67595777 21.65498295 27.66634446 25.05030923 23.74424639 14.65940118
15.19817822 3.8188746 29.18611337 20.67170992 22.3295488 28.01966146
28.59358258]
Normal equation - mean squared error:
20.630254348291196
Data shape:
(506, 13)

Gradient descent - weight coefficients:
[-0.18032203 1.13156337 0. 0.53527539 -1.98019369 2.31577395
0. -3.16472059 2.5717045 -1.80879957 -1.47919941 0.84244604
-3.6687529 ]
Gradient descent - intercept:
[22.93172969]
Predicted prices:
[27.90852137 30.73533552 21.37117576 31.30919582 20.39907435 20.23369773
21.44665125 20.06044518 19.95900751 31.96641058 21.26970571 28.06166368
16.50815474 20.25008639 35.45153789 19.60300052 10.21564488 19.17587013
29.84210678 24.26185529 20.09879751 32.77148848 28.67052105 19.46612627
33.75579242 26.39603334 33.80980471 26.95023173 20.77382329 15.43737743
29.79702013 16.73273481 35.59696223 13.63071412 17.02373162 18.52857957
11.49951798 21.15629995 38.19107541 28.27722831 25.17917274 18.94408519
36.95511701 9.72267754 22.79788979 25.25897212 19.86829513 20.88146848
17.43209885 27.34946393 9.83760308 27.2421653 29.39622146 18.91506767
11.56747614 34.13073256 31.84853002 21.65448141 18.10864237 22.30039092
23.61171872 24.28836203 20.5987606 36.94854241 24.17762474 20.82746777
15.44126342 10.28894764 39.4179656 22.14764366 18.63845997 22.62668211
38.63039022 21.61091557 35.28731204 27.02976161 20.9912817 20.71607518
25.08804361 21.70828681 30.30797708 20.65632595 23.38139007 30.5529622
26.40554838 21.98453023 28.63517581 22.0662799 26.17296457 20.70494038
25.44215375 23.52914762 20.78856897 22.41964858 16.62951378 19.4470062
24.76931554 18.28384239 21.85787669 26.53102025 21.90177616 19.20800037
24.29724135 23.54203217 22.04699928 35.16568535 16.71146682 21.38738079
31.88119591 33.21219461 20.98633043 26.89005626 21.63023398 17.87273172
21.95797063 21.65649016 27.17987661 24.78220419 23.85997069 16.48811907
18.34503687 7.31727711 28.70887538 21.47951722 22.45958368 27.62175362
28.23593409]
Gradient descent - mean squared error:
24.848731722199563
Data shape:
(506, 13)

Ridge regression - weight coefficients:
[-0.62710135 1.13221555 -0.07373898 0.74492864 -1.93983515 2.71141843
-0.07982198 -3.27753496 2.44876703 -1.81107644 -1.74796456 0.88083243
-3.91211699]
Ridge regression - intercept:
22.62137203166228
Predicted prices:
[28.23082349 31.50636545 21.12739377 32.65793823 20.02076945 19.06632771
21.106687 19.61624365 19.63161548 32.86596512 20.9946695 27.50329913
15.55414648 19.79639417 36.88392371 18.80672342 9.38096 18.50907253
30.67484295 24.30753141 19.0666843 34.09564382 29.80095002 17.51949727
34.8916544 26.5394645 34.68264723 27.42856108 19.09405963 14.98997618
30.8505874 15.81996969 37.18247113 7.85916465 16.25653448 17.15490009
7.48867279 19.99147768 40.57329959 28.95128807 25.25723034 17.73738109
38.75700749 6.87711291 21.78043375 25.27159224 20.45456114 20.48220948
17.25258857 26.1375367 8.5448374 27.49204889 30.58183066 16.58438621
9.37182303 35.52269097 32.24958654 21.87431027 17.60876103 22.08124517
23.50114904 24.09591554 20.15605099 38.49857046 24.64026646 19.75933465
13.91713858 6.78030217 42.04984214 21.92558236 16.8702938 22.59592875
40.74980559 21.4284924 36.88064128 27.18855416 21.04326386 20.36536628
25.36109432 22.27869444 31.14592486 20.39487869 23.99757481 31.54428168
26.76210157 20.89486664 29.07215993 21.99603204 26.30599891 20.11183257
25.47912071 24.0792631 19.89111149 16.56247916 15.22770226 18.38342191
24.82070397 16.60156656 20.86675004 26.71162923 20.74443479 17.8825254
24.28515984 23.37007961 21.58413976 36.79386382 15.88357121 21.47915185
32.79931234 33.71603437 20.62134398 26.83678658 22.68850452 17.37312422
21.67296898 21.67559608 27.66601539 25.0712154 23.73692967 14.64799906
15.21577315 3.82030283 29.17847194 20.66853036 22.33184243 28.0180608
28.56771983]
Ridge regression - mean squared error:
20.644810227653515



Reposted from blog.csdn.net/weixin_41685388/article/details/104495705