Python-Machine Learning-Boston House Price Regression Analysis

1. Purpose

        Using the Boston house price data set as the subject: get to know and understand the data, master the gradient descent method and the basics of regression analysis, master the general approach to model regularization, and interpret the results of the regression analysis.


2. Background knowledge and requirements

1. Background knowledge

        The Boston house price data set records the median house prices in the suburbs of Boston in the mid-1970s, together with 13 indicators collected for the city at that time; the goal is to find the relationship between those indicators and house prices.

        The data set contains 506 samples. In this article, the first 400 samples are used as the training and validation set and the remaining 106 samples as the test set. The data can be loaded directly with load_boston from the datasets module of Python's sklearn library, or downloaded from the address below.

        Dataset download address: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ .

        The meaning of each feature in the dataset is as follows:

        CRIM: per capita crime rate by town
        ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
        INDUS: proportion of non-retail business acres per town
        CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
        NOX: nitric oxides concentration (parts per 10 million)
        RM: average number of rooms per dwelling
        AGE: proportion of owner-occupied units built before 1940
        DIS: weighted distances to five Boston employment centers
        RAD: index of accessibility to radial highways
        TAX: full-value property-tax rate per $10,000
        PTRATIO: pupil-teacher ratio by town
        B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
        LSTAT: percentage of lower-status population
        MEDV (target): median value of owner-occupied homes, in $1000s


        What we usually call regression refers to linear regression. What can regression be used for? We do regression to find the relationship between variables, for example the relationship between sales volume and shipping distance, between the number of defective products and the age of a machine tool, or between celebrities' divorce rate and their age.

        The purpose of regression is to predict the numerical target value. If you want to predict the power of a car, you may write the following equation:

        P=0.0015*S-0.99*R

        Here P stands for power, S for annual salary, and R for the time spent listening to the radio. This is the regression equation, and 0.0015 and -0.99 are called the regression coefficients (regression weights): the weight with which each independent variable influences the dependent variable. There is of course also nonlinear regression, for example involving the product of S and R, but the regression discussed here is linear regression. The general equation of linear regression is as follows; it may also include an intercept C (the initial value of Y), which is 0 if absent:

        Y=WX^{T}+C

        The input data are stored in the matrix X and the regression coefficients in the vector W, so all we have to do is find W. How do we find W? The most common approach is to find the W with the smallest error, where error means the difference between the predicted value and the true value. Simply summing these errors would let positive and negative errors cancel out, so the squared error is used instead:

        \sum_{i=1}^{m}(y_{i}-wx_{i}^{T})^{2}

        All we have to do is minimize this quantity, i.e. find the optimal estimate of w. This is one of the most common problems in statistics, and there are many ways to solve it, such as least squares estimation or optimization methods such as gradient descent. This article focuses on the gradient descent method; the others are not covered here.
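        As a point of reference, the least squares estimate mentioned above has the closed form W = (X^T X)^{-1} X^T y. The following is a minimal numpy sketch of that estimate, using made-up data (the variable names and values are illustrative only, not from this article's dataset):

import numpy as np

# made-up data: 5 samples, 2 features, target generated from known weights plus an intercept of 0.5
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = X @ np.array([2.0, -1.0]) + 0.5

# add a column of ones so the intercept is learned as an extra weight
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# normal equation: w = (X^T X)^{-1} X^T y  (least squares estimate)
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w)   # approximately [0.5, 2.0, -1.0]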

        The general steps of regression are as follows:

(1) Data collection: using any method

(2) Prepare the data: regression requires numerical data, so nominal data needs to be converted into binary data; if there are multiple categories, one-hot encoding can be used (see the sketch after this list)

(3) Analyze the data: if possible, visualizing the data in two-dimensional plots helps to analyze and understand it, and allows comparing the fit before and after

(4) Training algorithm: To find the regression coefficient, various methods can be used.

(5) Test the algorithm: use the coefficient of determination R^{2} or the fit between the predicted values and the data as the quantitative evaluation criterion

(6) Use algorithm: Use the established regression equation to predict the input variables and give a predicted value.
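        For step (2), a minimal sketch of converting a nominal feature into one-hot (binary) columns with pandas is shown below. The column name "color" and its values are made up purely for illustration:

import pandas as pd

# hypothetical nominal feature with three categories
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
# one-hot encode: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)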


        We want to find the minimum of the squared error. The method we use is gradient descent (there is also a corresponding gradient ascent method). The idea is: to find the minimum (or maximum) of a function, the best way is to search along the direction of the gradient. So what is a gradient? For a one-variable function of x, the gradient is simply its derivative. Mathematically, the gradient is defined as follows:

                                                \bigtriangledown f(x,y)=\binom{\frac{\partial f(x,y)}{\partial x}}{\frac{\partial f(x,y)}{\partial y}}

        We denote the gradient by \bigtriangledown. The gradient always points in the direction in which the function value grows fastest (the opposite direction is the direction of fastest decrease). However, the gradient only gives a direction; it does not say how far to move along it. We use \alpha for the distance moved, i.e. the step size, and then the iterative formula of the gradient descent algorithm is easy to write down:

                                                w:=w-\alpha \bigtriangledown_{w}f(w)

        The formula is applied repeatedly until some condition is met (for example, until the error is smaller than a given value). The gradient ascent algorithm simply changes the minus sign to a plus sign: gradient ascent is used to find the maximum of a function, gradient descent to find the minimum.
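        As a small illustration of the update rule w := w - \alpha\bigtriangledown f(w), the sketch below minimizes the one-variable function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the starting point, step size, and iteration count are chosen arbitrarily:

# gradient descent on f(w) = (w - 3)**2, whose gradient is f'(w) = 2*(w - 3)
w = 0.0          # arbitrary starting point
alpha = 0.1      # step size (learning rate)
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad
print(w)         # converges towards the minimizer w = 3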


        As just mentioned, the purpose of iterating the gradient descent algorithm is to make the squared error as small as possible (or smaller than some threshold). Dividing the squared error by the sample size m gives:

                                                L(y,y^{\prime})=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-y_{i}^{\prime})^{2}

        This is called the Mean Squared Error (MSE); there is also the Mean Absolute Error (MAE):

                                                L(y,y^{\prime})=\frac{1}{m}\sum_{i=1}^{m}\left | y_{i}-y_{i}^{\prime}\right |

        In linear regression, MSE (the L2 loss) is easy to compute, but MAE (the L1 loss) is more robust to outliers. Because MSE squares the error, points whose predictions are far from the true values carry much more weight than they do under MAE, so when the data contains outliers, MAE is the better choice of loss function. RMSE is simply the square root of MSE.
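        The following is a minimal numpy sketch of the three error measures just described (the function names are my own, not from any library):

import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))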

        The loss function L(y,y^{\prime}) measures the difference between the model's predicted value y^{\prime} and the true value y. It is a non-negative real-valued function, and in general the smaller the loss, the better the model fits. Regularization adds certain rules (constraints) to the loss function to shrink the solution space, and thus reduces the chance of finding an overfitted solution (for example, one where, in order to minimize the loss, the fitted function becomes non-smooth and overfits). Concretely, this is done by preventing the coefficients of the high-order terms from becoming too large. An example is given below.

        Suppose the regression equation is y=w_{1}x_{1}+w_{2}x_{2}^{5}. To keep the coefficient of the high-order term from becoming too large, we can write the loss function as follows:

                                                L(y,y^{\prime})=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-y_{i}^{\prime})^{2}+1000w_{2}

        To minimize this loss function we are forced to make w_{2} very small, and this is how regularization is realized. In general, the coefficient written above as 1000 is denoted \lambda, and since each w can be positive or negative, we square it. This gives the general form of the regularized loss function:

                                                L(y,y^{\prime})=\frac{1}{2m}\left[\sum_{i=1}^{m}(y_{i}-y_{i}^{\prime})^{2}+\lambda \sum_{j=1}^{n}w_{j}^{2}\right]

        The formula for the gradient changes accordingly. To make the derivative easier to compute (the factor of 2 cancels when differentiating), the \frac{1}{m} is written as \frac{1}{2m}; this has little effect on the result as a whole. The update formula for w is obtained by taking the gradient of the loss above and multiplying it by \alpha (the step size):

                                                w=w-\alpha \left[\frac{1}{m}(wx^{T}x-y^{T}x)+\frac{\lambda }{m}w\right]
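        A single step of this update, written with the same shapes as the formula (x as an m-by-n matrix with the intercept column already added, w as a row vector of weights), might look like the sketch below; the names and data are illustrative only, and the full implementation used in this article appears in section 3:

import numpy as np

alpha, lam = 0.01, 1.0
x = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])   # m=3 samples: intercept column plus one feature
y = np.array([5.0, 7.0, 9.0])                         # true values
w = np.zeros(x.shape[1])                              # row vector of weights, initialized to 0
m = len(x)

# gradient of the regularized loss: (1/m) * (w x^T x - y^T x) + (lambda/m) * w
grad = (w @ x.T @ x - y @ x) / m + (lam / m) * w
w = w - alpha * grad                                  # one gradient descent step
print(w)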


2. Requirements

        Answer the following questions:

  1. What is the error of the model? Use MSE and RMSE as examples.
  2. How does the error change as the gradient descent iterations progress? Show this with a plot.
  3. How should the regularization term of the model be chosen? Draw a plot (regularization weight on the x-axis, MSE on the y-axis) to illustrate how the regularization weight is selected.
  4. How should the input features be preprocessed (feature selection, row normalization, column normalization)?
  5. Write a program that implements the training and testing of the model, without using SPSS or calling other machine learning toolkits.

3. Python code implementation

        First import the relevant library:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams          # used below to set plotting font parameters
from math import sqrt                    # used below to compute RMSE
from sklearn.datasets import load_boston

        Save the data in CSV format and take a quick look at it:

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston['target']
df.to_csv('./boston.csv', index=None)
data = pd.read_csv(r'./boston.csv')
data.head()

        The result is as follows:

         Check the data for null values:

data.isnull().sum()

        No null values were found.

        Look at the number of rows and columns of the data:

data.shape

        The data has 506 rows and 14 columns. Since the features have different scales (dimensions), the data is normalized to eliminate the influence of scale. Three common normalization methods are min-max normalization, standard deviation (z-score) normalization, and decimal scaling normalization.

        The code for each normalization method is as follows:

(data - data.min())/(data.max() - data.min())   # min-max normalization
(data - data.mean())/data.std()                 # standard deviation (z-score) normalization
data/10**np.ceil(np.log10(data.abs().max()))    # decimal scaling normalization

        This article uses standard deviation normalization. After the normalization is done, the data is split into a training set and a test set: the first 400 rows are used as the training set and the remaining 106 rows as the test set:
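        The application of the chosen normalization is not shown explicitly above; a minimal way to apply standard deviation (z-score) normalization to the whole table before splitting would be the line below. Note that this also standardizes the target column MEDV, so later predictions are in standardized units; MEDV could be excluded here if prices should stay in their original scale.

data = (data - data.mean()) / data.std()   # standard deviation (z-score) normalization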

cols = data.shape[1]                 # cols is the number of columns of data
# take the first 400 rows as the training set
train_x = data.iloc[:400,:cols-1]    # the first 13 columns are the independent variables (features)
train_y = data.iloc[:400,cols-1:cols]# the last column is the dependent variable (target)
# take the last 106 rows as the test set
test_x = data.iloc[400:,:cols-1]     
test_y = data.iloc[400:,cols-1:cols]

        With the data collection and preparation done, we now build the linear regression model by defining a linear model class:

class linearRegression:
    """python语言实现线性回归(梯度下降法)"""
    
    def __init__(self, alpha, times, l):
        """初始化方法
        
        参数解释:
        alpha:float
               步长,也被叫做学习率(权重调整的幅度)
        times: int
                循环迭代的次数,达到一定次数终止迭代
        l: int
                正则化参数
                
        """
        self.alpha = alpha
        self.times = times
        self.l= l
        
    # loss function: (1/(2m)) * sum of squared errors
    def CostFunction(self, x, y, w):
        inner = np.power(y-(w*x.T).T, 2)
        return np.sum(inner)/(2*len(x))
    
    # regularized loss: the loss function plus the L2 penalty term
    def regularizedcost(self, x, y, w):
        reg = (self.l/(2*len(x))) * (np.power(w, 2).sum())
        return self.CostFunction(x, y, w) + reg     
    
    def fit(self,x,y):
        """建立模型
        
        参数解释:
        x:自变量,特征矩阵
        y:因变量,真实值
        """
        # 将数据转换成numpy矩阵
        x = np.matrix(x.values)
        y = np.matrix(y.values)
        # 添加截距列,值为1,axis=1 添加列
        x = np.insert(x,0,1,axis=1)
        # 创建权重向量W,初始值默认为0,长度比特征数量多1(多出的一项为截距)。
        self.w_ = np.zeros(x.shape[1])
        # 创建损失列表,用来保存每次迭代后的损失值。
        self.loss_ = []
        
        #进行times次数的迭代,在每次迭代过程中,计算损失值,不断调整权重值,使得损失值不断下降。
        for i in range(self.times):
            # 计算正则化损失函数值:
            loss = self.regularizedcost(x, y, self.w_)
            #加入到损失列表
            self.loss_.append(loss)
            # 根据步长调整权重w_
            self.w_ = self.w_ - (self.alpha / len(x)) * (self.w_  * x.T * x   - y.T * x) - (self.alpha * self.l / len(x)) * self.w_

    def predict(self, x):
        """根据参数传递的样本,对样本数据进行预测
        
        参数解释:
        x:测试样本。
        
        返回值:
        result:预测结果。
        """
        
        x = np.asarray(x)
        x = np.insert(x,0,1,axis=1)
        result = np.dot(x, self.w_.T) 
        return result

Create a linear regression model, set the initial parameters, and compute the error (loss) values:

alpha = 0.001   # step size (learning rate)
times = 10000   # number of iterations
l = 0           # regularization weight (0 = no regularization)
lr = linearRegression(alpha, times, l)
lr.fit(train_x, train_y)
MSE = lr.loss_                                       # note: loss_ stores the 1/(2m) cost, i.e. half the MSE (plus the penalty when l > 0)
RMSE = list(map(lambda num: sqrt(num), lr.loss_))    # element-wise square root of the loss values

Take a look at the results and performance on the test set:

result = lr.predict(test_x)
# set font parameters so that non-ASCII labels and minus signs render correctly in the plot
rcParams["font.family"] = "SimHei"
rcParams["axes.unicode_minus"] = False
plt.figure(figsize=(10,10))
plt.plot(result, "ro-", label="Predicted value")
plt.plot(test_y.values, "go-", label="True value") # test_y is a pandas object, so convert it to an ndarray with .values
plt.title("Linear regression prediction (gradient descent)")
plt.xlabel("Sample index")
plt.ylabel("Predicted house price")
plt.legend()
plt.show()
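        The loss list above tracks the training loss only; to also report the error on the held-out test set (requirement 1), one possible sketch, reusing `lr`, `test_x`, and `test_y` from above (the variable names here are my own), is:

pred = lr.predict(test_x)                        # predictions on the 106 test rows
test_err = np.asarray(test_y).ravel() - np.asarray(pred).ravel()
test_MSE = np.mean(test_err ** 2)                # mean squared error on the test set
test_RMSE = np.sqrt(test_MSE)                    # root mean squared error
print("Test MSE:", test_MSE, "Test RMSE:", test_RMSE)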

 

A plot of the error (MSE) as a function of the number of iterations:

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(np.arange(times), MSE, 'r') # np.arange() returns an evenly spaced array of iteration indices
ax.set_xlabel('Number of iterations')
ax.set_ylabel('Error (MSE)')
ax.set_title('Error (MSE) as a function of the number of iterations')
plt.show()

 Plot the error (RMSE) as a function of the number of iterations:

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(np.arange(times), RMSE, 'r') # np.arange() returns an evenly spaced array of iteration indices
ax.set_xlabel('Number of iterations')
ax.set_ylabel('Error (RMSE)')
ax.set_title('Error (RMSE) as a function of the number of iterations')
plt.show()

 Plot the error (MSE) as a function of the weight of the regularization term:

lambdas = np.arange(0, 101, 10)     # candidate regularization weights to sweep over (the original range(0, l) is empty when l = 0)
MSE_ = []
for l_ in lambdas:
    lr = linearRegression(alpha, times, l_)
    lr.fit(train_x, train_y)
    MSE_.append(lr.loss_[-1])       # final training loss for this regularization weight (includes the penalty term)
fig, ax = plt.subplots(figsize=(20,10))
ax.plot(lambdas, MSE_, 'r')
ax.set_xlabel('Regularization weight')
ax.set_ylabel('Error (MSE)')
ax.set_title('Error (MSE) as a function of the regularization weight')
plt.show()

You're done.
