Python: a linear regression model analysis

First, the requirements

The Boston house price dataset is a well-known machine learning dataset comprising 506 records. Each record describes one housing area through
13 attributes, and the house price is stored in boston.target (the MEDV attribute). The full description can be viewed with the following statement:
print(boston.DESCR)
The attributes are interpreted as follows:

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 square feet
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (1 if the tract borders the river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built before 1940
  • DIS: weighted distance to five Boston employment centers
  • RAD: index of accessibility to radial highways
  • TAX: full-value property tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • LSTAT: percentage of lower-status (low-income) population
  • MEDV: median value of owner-occupied homes, in $1000s

Complete the following data processing and analysis tasks:
(1) On a single canvas, draw a scatter plot of the house price against each variable, and analyze in detail the relationship
between each variable and the house price.
(2) Compute the correlation coefficient between each variable and the house price (using the correlation function df.corr()).
(3) Build a linear regression model of the house price on all variables, write out the model expression, and analyze the significance of the model.
(4) Remove the variables whose coefficient tests are not significant, and re-fit the linear model.
(5) Select the variables whose correlation coefficient with the house price is greater than or equal to 0.5 as the independent variables, take the house price as the dependent variable,
build a linear regression model, and draw the predicted and true house prices as a line chart.
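
Tasks (3) and (4) ask for a model on all variables plus a significance test on each coefficient. The following is a minimal sketch of one way to do that, not necessarily the method intended by the assignment. It assumes ordinary least squares with two-sided t-tests, drops the least significant variable one at a time (backward elimination), and runs on synthetic data. The names `ols_with_pvalues`, `drop_insignificant`, `x0`, `x1` and `x2` are hypothetical and used only for illustration.

```python
import numpy as np
from scipy import stats


def ols_with_pvalues(X, y):
    """Fit y = b0 + X @ b by least squares; return coefficients and t-test p-values."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares estimate
    resid = y - A @ beta
    dof = n - k - 1                               # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # unbiased error variance
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the estimates
    t = beta / np.sqrt(np.diag(cov))              # t statistic per coefficient
    p = 2 * stats.t.sf(np.abs(t), dof)            # two-sided p-values
    return beta, p


def drop_insignificant(X, y, names, alpha=0.05):
    """Drop the least significant variable until every remaining p-value is below alpha."""
    names = list(names)
    beta, p = ols_with_pvalues(X, y)
    while names and p[1:].max() >= alpha:         # p[0] belongs to the intercept
        worst = int(np.argmax(p[1:]))
        X = np.delete(X, worst, axis=1)
        del names[worst]
        beta, p = ols_with_pvalues(X, y)
    return names, beta, p


# Synthetic stand-in data (assumption): x0 and x1 drive y, x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 3))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=506)

kept, beta, p = drop_insignificant(X, y, ["x0", "x1", "x2"])
print("kept variables:", kept)
print("coefficients (intercept first):", beta)
print("p-values (intercept first):", p)
```

With the real data, `X` would hold the 13 attribute columns and `y` the MEDV column. The model expression is then the intercept plus each kept coefficient times its variable.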

Second, the code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
# from scipy.misc import factorial  # depends on scipy, specifically version 1.2.0
# This file depends on pillow


boston = pd.read_csv('./dataset/boston_house.csv')
df = pd.DataFrame(boston)

# print(df.iloc[:, -1])

# Variables whose correlation coefficient with the price is at least 0.5
x_has = []
y_predict = []
plt.figure(1)
# Task 1: scatter plot of the house price against each variable
plt.rcParams['font.sans-serif'] = 'SimHei'   # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False   # keep minus signs displayable with that font
for i in range(13):
    plt.subplot(7, 2, i + 1)
    plt.scatter(df.iloc[:, i], df.iloc[:, -1], marker='o', c='g')  # x, y, marker, green

    # Task 2: correlation coefficient between this variable and the price
    dfi = df[[df.columns[i], df.columns[-1]]]
    r = dfi.corr().iloc[0, 1]
    print('\n', dfi.corr(), r, '\n')

    # Task 3: univariate linear regression of the price on this variable
    # .values converts the column to an array; reshape(-1, 1) turns it into a single column
    x_linear = df.iloc[:, i].values.reshape(-1, 1)
    y_linear = df.iloc[:, -1].values.reshape(-1, 1)
    lreg = LinearRegression()
    lreg.fit(x_linear, y_linear)
    message0 = 'Univariate linear regression equation: ' + '\ty' + '=' + str(lreg.intercept_[0]) + ' + ' + str(lreg.coef_[0][0]) + '*x'

    n = len(x_linear)
    y_prd = lreg.predict(x_linear)
    if r >= 0.5:  # selection rule of task 5
        x_has.append(i)
        y_predict.append(y_prd)
    Regression = np.sum((y_prd - np.mean(y_linear))**2)  # regression sum of squares
    Residual = np.sum((y_linear - y_prd)**2)             # residual sum of squares
    R_square = Regression / (Regression + Residual)      # coefficient of determination R^2
    F = (Regression / 1) / (Residual / (n - 2))          # F statistic
    # Significance level alpha = 0.05; pearsonr expects 1-D inputs
    p_value = stats.pearsonr(x_linear.ravel(), y_linear.ravel())[1]
    if p_value < 0.05:
        ms1_1 = 'significant'
    else:
        ms1_1 = 'not significant'
    message1 = 'Significance test (p-value): ' + str(p_value) + ' ' + ms1_1
    print(message0, '\n', message1)


# Task 4
print(x_has, y_predict)
plt.show()


# Task 5: line chart of the true and predicted prices
plt.figure(2)
for i in range(len(x_has)):
    plt.plot(df.iloc[:,-1].values,marker='o',c='g')  # true values in green
    plt.plot(y_predict[i],marker='o',c='r')          # predicted values in red
plt.show()
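
The script above fits one univariate model per variable. Task (5) can also be read as fitting a single model on all selected variables together. The following is a minimal sketch under that assumption, using synthetic data. The frame `demo` and its columns `a`, `b`, `c` and `price` are hypothetical stand-ins for the housing table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing table (assumption, not the real data):
# "a" raises the price, "b" lowers it, "c" is unrelated noise.
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
demo["price"] = 4.0 * demo["a"] - 4.0 * demo["b"] + rng.normal(scale=0.5, size=n)

# Correlation of every variable with the target column.
corr = demo.corr()["price"].drop("price")
selected = corr[corr >= 0.5].index.tolist()   # threshold from task (5)

# Fit price on the selected variables jointly by least squares.
A = np.column_stack([np.ones(n), demo[selected].to_numpy()])
coef, *_ = np.linalg.lstsq(A, demo["price"].to_numpy(), rcond=None)
predicted = A @ coef

print("selected variables:", selected)
print("coefficients (intercept first):", coef)
```

The signed threshold `corr >= 0.5` excludes strongly negative correlations, such as `b` here. If those should count as well, `corr.abs() >= 0.5` is the usual alternative. The `predicted` array and the true price column can then be passed to `plt.plot`, as in the second figure above.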

Origin blog.csdn.net/Chengang98/article/details/92403599