Boston housing price predictive analysis based on linear regression

This post analyzes and predicts prices on the Boston housing dataset in three ways: the normal equation, gradient descent, and ridge regression with L2 regularization, and compares the differences between the three methods.

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import joblib  # formerly sklearn.externals.joblib, removed in newer scikit-learn
import numpy as np
class HousePredict:
    """
    Price prediction on the Boston housing dataset
    """
    
    def __init__(self):

        # 1. Load the data
        lb = load_boston()

        # 2. Split the dataset into a training set and a test set
        x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
        # print(y_train, y_test)

        # 3. Both the features and the target must be standardized, with two separate scalers
        # 3.1 Standardize the features
        self.std_x = StandardScaler()

        self.x_train = self.std_x.fit_transform(x_train)
        self.x_test = self.std_x.transform(x_test)


        # 3.2 Standardize the target
        self.std_y = StandardScaler()

        self.y_train = self.std_y.fit_transform(y_train.reshape(-1, 1))  # reshape to a 2-D column vector
        self.y_test = self.std_y.transform(y_test.reshape(-1, 1))


    def mylinear(self):
        """
        Prediction via the normal equation (closed-form solution)
        :return: None
        """

        # To predict with a previously saved model, load it directly:
        # model = joblib.load("./tmp/test.pkl")
        # y_predict = self.std_y.inverse_transform(model.predict(self.x_test))
        # print("Predictions from the saved model:", y_predict)

        # Estimator prediction: solve for the coefficients via the normal equation
        lr = LinearRegression()
        lr.fit(self.x_train, self.y_train)

        print("Normal equation regression coefficients:", lr.coef_)

        # Save the trained model
        # joblib.dump(lr, "./tmp/test.pkl")

        # Predict house prices on the test set
        y_lr_predict = self.std_y.inverse_transform(lr.predict(self.x_test))
        # print("Normal equation predicted price of each house in the test set:", y_lr_predict)
        print("Normal equation mean squared error:", mean_squared_error(self.std_y.inverse_transform(self.y_test), y_lr_predict))

        return None
    
    def mysdg(self):
        """
        House price prediction via stochastic gradient descent
        :return: None
        """
        sgd = SGDRegressor()
        # SGDRegressor expects a 1-D target, so flatten the (n, 1) column vector
        sgd.fit(self.x_train, self.y_train.ravel())

        print("Gradient descent regression coefficients:", sgd.coef_)

        # Predict house prices on the test set (reshape back to 2-D for the scaler)
        y_sgd_predict = self.std_y.inverse_transform(sgd.predict(self.x_test).reshape(-1, 1))

        # print("Gradient descent predicted price of each house in the test set:", y_sgd_predict)
        print("Gradient descent mean squared error:", mean_squared_error(self.std_y.inverse_transform(self.y_test), y_sgd_predict))

        return None
    
    
    def myridge(self):
        """
        House price prediction via ridge regression (with L2 regularization)
        """
        rd = Ridge(alpha=1.0)
        rd.fit(self.x_train, self.y_train)

        print("Ridge regression coefficients:", rd.coef_)

        # Predict house prices on the test set
        y_rd_predict = self.std_y.inverse_transform(rd.predict(self.x_test))

        # print("Ridge regression predicted price of each house:", y_rd_predict)
        print("Ridge regression mean squared error:", mean_squared_error(self.std_y.inverse_transform(self.y_test), y_rd_predict))

        return None

if __name__ == "__main__":
    A = HousePredict()
    A.mylinear()
    A.mysdg()
    A.myridge()
Sample output:

Normal equation regression coefficients: [[-0.10843933  0.13470414  0.00828142  0.08736748 -0.2274728   0.25791114
   0.0185931  -0.33169482  0.27340519 -0.22995446 -0.20995577  0.08854303
  -0.40967023]]

Normal equation mean squared error: 20.334736834357248
Gradient descent regression coefficients: [-0.08498404  0.07094101 -0.03414044  0.11407245 -0.09152116  0.3256401
 -0.0071226  -0.2071317   0.07391015 -0.06095605 -0.17955743  0.08442426
 -0.35757617]

Gradient descent mean squared error: 21.558873305580214
Ridge regression coefficients: [[-0.10727714  0.13281388  0.00561734  0.0878943  -0.22348981  0.25929669
   0.0174662  -0.32810805  0.26380776 -0.22163145 -0.20871114  0.08831287
  -0.4076144 ]]

Ridge regression mean squared error: 20.37300555358197
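
For reference, the LinearRegression fit above corresponds to the closed-form normal equation w = (X^T X)^(-1) X^T y. Below is a minimal NumPy sketch of that computation (scikit-learn itself uses an SVD-based least-squares solver, but for well-conditioned data the result is the same; the usage line is hypothetical):

import numpy as np

def normal_equation(X, y):
    # Append a bias column of ones so the intercept is learned as an extra weight
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (Xb^T Xb) w = Xb^T y directly rather than forming the inverse explicitly
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return w[:-1], w[-1]  # coefficients, intercept

# Hypothetical usage with the standardized arrays built in HousePredict.__init__:
# coef, intercept = normal_equation(x_train_scaled, y_train_scaled)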

Overfitting: a hypothesis fits the training data better than other hypotheses do, but fails to fit data outside the training set well; when this happens, the hypothesis is said to overfit. (The model is too complex.)

  • Cause: too many original features, some of them noisy; the model is so complex that it tries to account for each individual data point

  • Solutions:
    • 1. Feature selection: eliminate highly correlated features (hard to do)
    • 2. Cross-validation (so that all of the data gets used for training)
    • 3. L2 regularization (see the sketch after this list)
      • Effect: makes every element of W very small, close to 0 (lowering the weights reduces the influence of higher-order feature terms)
      • Advantage: the smaller the parameters, the simpler the model, and the simpler the model, the harder it is to overfit
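
A small sketch of that L2 effect on synthetic data (not part of the original program; the 13-column shape simply mirrors the Boston feature count): as alpha grows, Ridge shrinks the weights toward zero.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 13)                        # synthetic features
y = X @ rng.randn(13) + rng.randn(100) * 0.5  # noisy linear target

for alpha in (0.01, 1.0, 100.0, 10000.0):
    rd = Ridge(alpha=alpha).fit(X, y)
    # The summed weight magnitude decreases as the penalty strength grows
    print(alpha, np.abs(rd.coef_).sum())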

Underfitting: a hypothesis cannot fit the training data well, and also fails to fit data outside the training set; when this happens, the hypothesis is said to underfit. (The model is too simple.)

  • Cause: too few features of the data were learned
  • Solution: increase the number of features (a sketch follows below)
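
As a quick illustration of that remedy on invented data: a plain line underfits a quadratic target, and adding polynomial features gives the model enough capacity.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.randn(200) * 0.1       # quadratic target

# A plain linear model underfits (R^2 near 0 even on its own training data)
print(LinearRegression().fit(X, y).score(X, y))

# Adding squared features gives the model enough capacity to fit
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
print(LinearRegression().fit(X_poly, y).score(X_poly, y))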

1. Evaluation of LinearRegression and SGDRegressor

2. Characteristics: linear regression is the simplest and easiest-to-use regression model.

This limits its use to some extent; even so, when the relationships between features are unknown, linear regression is still the first choice for most systems.

Small-scale data: LinearRegression (does not address overfitting) and similar estimators

Large-scale data: SGDRegressor (see the incremental-learning sketch below)
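
One concrete reason SGDRegressor suits large-scale data is incremental learning: its partial_fit method consumes mini-batches, so the full dataset never has to sit in memory. A minimal sketch with synthetic batches (batch size and loop count are arbitrary choices here):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
true_w = rng.randn(13)
sgd = SGDRegressor()

for _ in range(100):                           # pretend each batch streams in from disk
    X_batch = rng.randn(256, 13)
    y_batch = X_batch @ true_w + rng.randn(256) * 0.1
    sgd.partial_fit(X_batch, y_batch)          # one SGD update pass over this batch

print(sgd.coef_)                               # approaches true_w as batches accumulate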

Ridge compared with LinearRegression

Ridge regression: the regression coefficients it obtains are more realistic and more reliable. In addition, it reduces the variance of the parameter estimates, making them more stable. This is of great practical value in research where ill-conditioned data is common.
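
In practice the alpha=1.0 used above is only a starting point; one reasonable approach (sketched here on synthetic data) is to let RidgeCV pick it by cross-validation:

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(150, 13)
y = X @ rng.randn(13) + rng.randn(150)

# RidgeCV runs built-in (leave-one-out) cross-validation over the candidate alphas
rd_cv = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
print("best alpha:", rd_cv.alpha_)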


Source: www.cnblogs.com/ohou/p/11946107.html