<<Introduction to Machine Learning from Zero>>Linear Regression-House Price Forecasting Problem

1. Prerequisites and problem background

1.1 Prerequisites

        This practical example assumes that you already understand the concept of linear regression in machine learning, basic Python syntax, the meaning of metrics such as MSE and R2_score, and how linear regression is solved mathematically. The rest of this blog takes that knowledge for granted.

1.2 The house price prediction problem and the source data for this exercise (the usa_housing_price.csv file is linked at the end of the article)

        The following is a partial screenshot of the data in usa_housing_price.csv. Attributes such as Avg. Area Income and Avg. Area House Age affect the Price in an area.
Partial data screenshot of usa_housing_price.csv

        Based on the usa_housing_price.csv data, we build linear regression models to predict a reasonable house price. The task is divided into three steps:
        1. Using the area (size) as the only input, build a single-factor model, evaluate its performance, and visualize the linear regression predictions.
        2. Using income, house age, number of rooms, population, and area as input variables, build a multi-factor model and evaluate its performance.
        3. Predict a reasonable house price for Income=65000, House Age=5, Number of Rooms=5, Population=30000, size=200.

2. Solving the problem step by step

2.1 Using area as input, build a single-factor model, evaluate its performance, and visualize the linear regression predictions

        2.1.1 Read the usa_housing_price.csv file into memory with the pandas read_csv(path) method, then use the head() method to preview the first rows of the file, as shown in the following code and figure:

        Note that the path passed to read_csv is the local path where usa_housing_price.csv is stored. It differs from machine to machine, so replace it with your own.

import pandas as pd
import numpy as np
data = pd.read_csv('D:/Google/picture/usa_housing_price.csv')
data.head()

        The output of data.head() has the same columns as the header of the source table, but it shows only the first few rows of the file (five by default).
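
        Beyond head(), a few other pandas calls give a quick overview of the table before modeling. The sketch below is only an illustration: the rows are made-up placeholder values rather than data from usa_housing_price.csv, and io.StringIO stands in for the file path. Only the column names are taken from the code in this article.

```python
import io

import pandas as pd

# Made-up placeholder rows, NOT values from usa_housing_price.csv.
# Only the column names are taken from the article's code.
csv_text = """Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Area Population,size,Price
60000,5.0,6.0,30000,180,1000000
70000,6.0,7.0,35000,210,1300000
65000,4.5,6.5,28000,195,1150000
"""

# io.StringIO stands in for the local file path passed to read_csv.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)           # (number of rows, number of columns)
print(list(data.columns))   # names used later with data.loc[:, name]
print(data.dtypes)          # LinearRegression needs numeric columns
print(data.isnull().sum())  # number of missing values in each column
```

        With the real file, only the read_csv argument changes; the four checks stay the same.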

        2.1.2 Import matplotlib and plot Price against each influencing factor (Price on the y-axis, the factor on the x-axis):

from matplotlib import pyplot as plt
fig = plt.figure(figsize=(10,10)) 
fig1 = plt.subplot(231) # first plot in a grid of 2 rows and 3 columns; the same pattern applies below
plt.scatter(data.loc[:,'Avg. Area Income'],data.loc[:,'Price'])  # scatter plot via plt.scatter(x,y); the same applies below
plt.title('Price VS Income')
fig2 = plt.subplot(232) 
plt.scatter(data.loc[:,'Avg. Area House Age'],data.loc[:,'Price'])  
plt.title('Price VS Age')
fig3 = plt.subplot(233) 
plt.scatter(data.loc[:,'Avg. Area Number of Rooms'],data.loc[:,'Price']) 
plt.title('Price VS Rooms')
fig4 = plt.subplot(234) 
plt.scatter(data.loc[:,'Area Population'],data.loc[:,'Price'])  
plt.title('Price VS Population')
fig5 = plt.subplot(235) 
plt.scatter(data.loc[:,'size'],data.loc[:,'Price'])  #plt.scatter(x,y)
plt.title('Price VS size')
plt.show()

Scatter plots of Price against each of the five factors
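
        The five nearly identical subplot calls can also be written as a loop. The following is a minimal sketch, not the original author's code: the function name plot_price_vs_factors and the factor_titles mapping are hypothetical, and the demo values are made up. The Agg backend is selected only so the sketch runs without a display; remove that line in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
from matplotlib import pyplot as plt


def plot_price_vs_factors(data, factor_titles, target="Price"):
    """Draw one scatter plot of `target` against each factor on a 2x3 grid."""
    fig = plt.figure(figsize=(10, 10))
    for position, (column, title) in enumerate(factor_titles.items(), start=1):
        ax = fig.add_subplot(2, 3, position)  # same grid as subplot(231)..subplot(235)
        ax.scatter(data[column], data[target])
        ax.set_title(f"{target} VS {title}")
    return fig


# Column names from the article, mapped to the short titles used in its plots.
factor_titles = {
    "Avg. Area Income": "Income",
    "Avg. Area House Age": "Age",
    "Avg. Area Number of Rooms": "Rooms",
    "Area Population": "Population",
    "size": "size",
}

# Made-up placeholder values; with the real file, pass the DataFrame from read_csv.
demo = {name: [1, 2, 3] for name in factor_titles}
demo["Price"] = [10, 20, 30]

fig = plot_price_vs_factors(demo, factor_titles)
```

        Because the function only uses data[column], it works with a pandas DataFrame as well as with the plain dict used here.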

        2.1.3 Take the area factor size as x and the corresponding house price Price as y, and train a linear regression model with the sklearn package. Then use size as the input variable to predict Price, and evaluate the single-factor model with MSE and R2_score:

        Define x and y, train the linear regression model, then use the trained model to predict y from x (size) and print the result:

#define x and y
x = data.loc[:,'size']
y = data.loc[:,'Price']
x = np.array(x).reshape(-1,1) # sklearn expects a 2-D array of shape (n_samples, 1), so reshape x into a single column
#set up the linear regression model
from sklearn.linear_model import LinearRegression
LR1 = LinearRegression()
LR1.fit(x,y)
y_predict_1 = LR1.predict(x)
print(y_predict_1)
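
        To see what fit actually learns, the sketch below trains the same kind of model on made-up points that lie exactly on y = 2x + 1 (these are not values from the CSV) and reads back the slope and intercept through the coef_ and intercept_ attributes. It also shows the shape that reshape(-1,1) produces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on y = 2x + 1 (not data from the article).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A 1-D array of shape (4,) becomes a 2-D column of shape (4, 1):
# one row per sample, one column per feature.
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # fitted k in y = k*x + b
intercept = model.intercept_  # fitted b

print(X.shape, slope, intercept)
```

        For the model in this article, LR1.coef_ and LR1.intercept_ give the same two quantities for the size-to-Price line.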

The value of y_predict_1:
Screenshot of the printed y_predict_1 values
        The performance of the linear regression model is evaluated by comparing the predicted values y_predict_1 with the actual values y, using MSE and R2_score as the criteria (the smaller the MSE, the better; the closer R2_score is to 1, the better):

#evaluate the model
from sklearn.metrics import mean_squared_error,r2_score
mean_squared_error_1 = mean_squared_error(y,y_predict_1) #MSE
r2_score_1 = r2_score(y,y_predict_1) 
print(mean_squared_error_1,r2_score_1)
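
        For reference, both metrics can be computed by hand from their standard definitions. The sketch below uses only numpy; the helper names mse and r2 and the three sample values are made up for illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return float(1.0 - ss_res / ss_tot)


# Made-up values: two exact predictions and one that is off by 1.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]

print(mse(y_true, y_pred), r2(y_true, y_pred))
```

        MSE is expressed in squared units of Price, so its absolute size depends on the scale of the data, while R2_score is unitless and easier to compare across models.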

Values of MSE and R2_score:
Screenshot of the printed MSE and R2_score values
        Plot the source data as a scatter plot with size as x and Price as y, then draw the predicted y_predict_1 against the same x as a red line. The more tightly the points cluster around the line, the better the fit:

fig6 = plt.figure(figsize=(8,5))
plt.scatter(x,y) # scatter plot of the source data
plt.plot(x,y_predict_1,'r') # line plot of the predictions; 'r' means red
plt.show()

Scatter plot of size against Price with the fitted regression line in red

2.2 Using income, house age, number of rooms, population, and area as input variables, build a multi-factor model and evaluate its performance

        Take income, house age, and the other factors together as x, keep Price as y, and repeat the single-factor procedure: train the model on the source data, use the multi-factor x as input to predict y, and evaluate the model with MSE and R2_score:

#define x_multi: all factors (every column except Price)
x_multi = data.drop(['Price'],axis = 1)  # drop the Price column from data
x_multi.head()
#set up 2nd linear model
LR_multi = LinearRegression()
#train the model
LR_multi.fit(x_multi,y)
y_predict_multi = LR_multi.predict(x_multi)
print(y_predict_multi)
mean_squared_error_multi = mean_squared_error(y,y_predict_multi) #MSE
r2_score_multi = r2_score(y,y_predict_multi) 
print(mean_squared_error_multi,r2_score_multi)
fig7 = plt.figure(figsize=(8,5)) # scatter plot of y against y_predict_multi (the closer to a straight line with slope k=1, the better)
plt.scatter(y,y_predict_multi)
plt.show()

Scatter plot of y against y_predict_multi (the closer the points lie to the straight line with slope k=1, that is y_predict_multi = y, the better)

        The multi-factor MSE and R2_score values are shown below. The MSE is many times lower than that of the single-factor model, and R2_score is much closer to 1. This indicates that the multi-factor linear regression model fits the data better than the single-factor model, and that the house price Price is affected by several factors rather than by area alone:
Screenshot of the printed multi-factor MSE and R2_score values
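
        The effect of adding factors can be reproduced on a tiny example. The sketch below is an illustration under stated assumptions: the data is made up so that y depends on two features, and np.linalg.lstsq is used as a stand-in for LinearRegression (both solve an ordinary least-squares problem). The helper name fit_and_mse is hypothetical.

```python
import numpy as np


def fit_and_mse(features, y):
    """Least-squares fit of y on the feature columns plus an intercept; returns the training MSE."""
    X = np.column_stack([features, np.ones(len(y))])  # last column models the intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ coef - y) ** 2))


# Made-up data: y depends on BOTH x1 and x2 (not values from the article).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 3.0 * x1 + 2.0 * x2 + 1.0

mse_single = fit_and_mse(x1.reshape(-1, 1), y)         # one factor only
mse_multi = fit_and_mse(np.column_stack([x1, x2]), y)  # both factors

print(mse_single, mse_multi)
```

        Note that when the metrics are computed on the same rows used for training, a least-squares model with extra factors can never have a higher MSE than the model without them, so what matters is how large the improvement is.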

3. Predict a reasonable house price for Income=65000, House Age=5, Number of Rooms=5, Population=30000, size=200

        In the second step we trained a multi-factor linear regression model. We now pass the values above to that model as the input x, in the same column order as x_multi, and the returned y is the predicted Price:

x_test = [65000,5,5,30000,200]
x_test = np.array(x_test).reshape(1,-1)
y_test_predict = LR_multi.predict(x_test)
print(y_test_predict)

        Note that the predict method expects a 2-D array-like of shape (n_samples, n_features) rather than a flat list, which is why x_test is converted with reshape(1,-1) into one row of five features. The following is the predicted house price:
Screenshot of the predicted house price
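
        The shape requirement can be checked on a small stand-in model. The sketch below uses made-up training rows with two features (not data from the article) and shows that a nested list and a reshaped array both work, while a flat list is rejected with a ValueError.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training rows with two features each (not data from the article).
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = X_train[:, 0] + 2.0 * X_train[:, 1]

model = LinearRegression().fit(X_train, y_train)

# A nested list is already 2-D (1 sample, 2 features), so predict accepts it.
nested = model.predict([[5.0, 6.0]])

# A flat array reshaped with (1, -1) gives the same single-row input.
reshaped = model.predict(np.array([5.0, 6.0]).reshape(1, -1))

# A flat 1-D input is rejected: sklearn cannot tell samples from features.
try:
    model.predict([5.0, 6.0])
    flat_rejected = False
except ValueError:
    flat_rejected = True

print(nested, reshaped, flat_rejected)
```

        Since LR_multi was fitted on a DataFrame, recent scikit-learn versions may also warn that an array input has no feature names. Passing a one-row DataFrame with the same column names as x_multi is one way to avoid that, and it guards against mixing up the column order.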

        The source data for this blog, usa_housing_price.csv, is available at the link below (extraction code: 1234):
usa_housing_price.csv


Origin blog.csdn.net/qq_32575047/article/details/117445825