Multiple Linear Regression in Python

Objective

Analyze the relationship between the concentrations of the main air pollutants and the air quality index (AQI).

Data Analysis

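The dataset contains air pollutant concentrations scraped from the Tianqihoubao (天气后报) website, covering Beijing from October 28, 2013 to January 31, 2016. It also includes the air quality level, the AQI, and the ranking for the day.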
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import statsmodels.api as sm

Linear Regression

1. Data Preprocessing

data = pd.read_csv("beijing.csv", index_col=0)
data.head()
   AQI index  AQI day rankings  PM25  PM10  So2  No2    Co  O3
1        306               106   255   277   30  105  2.60  15
2         62                22    39    62   10   46  0.91  27
3         99                61    71   101   11   72  1.18  14
4        176                98   135   162   10   96  1.62   2
5        231               102   181   202   14  100  1.89   0
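X = data.iloc[:, 2:8]     # predictors: PM25, PM10, So2, No2, Co, O3
X = sm.add_constant(X)    # add the intercept column "const"
y = data.iloc[:, 0]       # response: AQI index
print(X.head())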
   const  PM25  PM10  So2  No2    Co  O3
1    1.0   255   277   30  105  2.60  15
2    1.0    39    62   10   46  0.91  27
3    1.0    71   101   11   72  1.18  14
4    1.0   135   162   10   96  1.62   2
5    1.0   181   202   14  100  1.89   0

2. The model

model1 = sm.OLS(y, X)     # specify the OLS model
result = model1.fit()     # fit the model by least squares
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:              AQI index   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     3549.
Date:                Thu, 02 Apr 2020   Prob (F-statistic):               0.00
Time:                        20:43:20   Log-Likelihood:                -3378.3
No. Observations:                 822   AIC:                             6771.
Df Residuals:                     815   BIC:                             6804.
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         26.4656      2.099     12.610      0.000      22.346      30.585
PM25           0.9506      0.019     50.834      0.000       0.914       0.987
PM10           0.2412      0.015     15.691      0.000       0.211       0.271
So2           -0.0212      0.038     -0.555      0.579      -0.096       0.054
No2           -0.2624      0.047     -5.601      0.000      -0.354      -0.170
Co            -1.5038      1.109     -1.356      0.175      -3.680       0.672
O3             0.0468      0.018      2.621      0.009       0.012       0.082
==============================================================================
Omnibus:                      351.197   Durbin-Watson:                   1.782
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5876.885
Skew:                           1.489   Prob(JB):                         0.00
Kurtosis:                      15.756   Cond. No.                         733.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
result.f_pvalue # p-value of the overall F-test; a value near 0 means the linear relationship is significant
0.0
result.params # regression coefficients
const    26.465624
PM25      0.950583
PM10      0.241180
So2      -0.021246
No2      -0.262374
Co       -1.503839
O3        0.046783
dtype: float64
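
As a quick sanity check on these coefficients, the prediction for the first row of the data can be worked out by hand. The sketch below is plain Python, with the coefficients and predictor values copied from the output above; the variable names are illustrative only. In the notebook, `result.predict(X)` would return the fitted values for every row.

```python
# Coefficients copied from result.params above.
params = {
    "const": 26.465624,
    "PM25": 0.950583,
    "PM10": 0.241180,
    "So2": -0.021246,
    "No2": -0.262374,
    "Co": -1.503839,
    "O3": 0.046783,
}

# First row of X as printed by X.head(); const is the intercept column.
row = {"const": 1.0, "PM25": 255, "PM10": 277, "So2": 30,
       "No2": 105, "Co": 2.60, "O3": 15}

# Linear prediction: sum of coefficient * predictor value.
predicted_aqi = sum(params[name] * row[name] for name in params)
observed_aqi = 306  # AQI index of the same row in data.head()

print(f"predicted AQI: {predicted_aqi:.1f}")
print(f"residual:      {observed_aqi - predicted_aqi:.1f}")
```

The prediction comes out at roughly 304.3 against an observed AQI of 306, so the residual for this row is small.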

Improved Model

Since the p-values for So2 and Co are greater than 0.05, we exclude these two variables and refit the model.
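
This selection rule can be sketched as a simple filter on the p-values. The numbers below are copied from the P>|t| column of the first summary; in the notebook the same values are available as `result.pvalues`. The names `p_values`, `alpha`, `keep`, and `dropped` are illustrative, and the 0.05 threshold is the one stated above.

```python
# p-values copied from the P>|t| column of the first model's summary
# (the intercept is left out because it is always kept).
p_values = {
    "PM25": 0.000,
    "PM10": 0.000,
    "So2": 0.579,
    "No2": 0.000,
    "Co": 0.175,
    "O3": 0.009,
}

alpha = 0.05  # significance threshold used in the text

# Keep predictors whose coefficient is significant at the chosen level.
keep = [name for name, p in p_values.items() if p < alpha]
dropped = [name for name in p_values if name not in keep]

print("keep:", keep)
print("drop:", dropped)
```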

data = pd.read_csv("beijing.csv", index_col=0)
data.head()
   AQI index  AQI day rankings  PM25  PM10  So2  No2    Co  O3
1        306               106   255   277   30  105  2.60  15
2         62                22    39    62   10   46  0.91  27
3         99                61    71   101   11   72  1.18  14
4        176                98   135   162   10   96  1.62   2
5        231               102   181   202   14  100  1.89   0
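X = data.iloc[:, [2, 3, 5, 7]]    # keep PM25, PM10, No2, O3 (So2 and Co dropped)
X = sm.add_constant(X)            # add the intercept column "const"
y = data.iloc[:, 0]               # response: AQI index
print(X.head())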
   const  PM25  PM10  No2  O3
1    1.0   255   277  105  15
2    1.0    39    62   46  27
3    1.0    71   101   72  14
4    1.0   135   162   96   2
5    1.0   181   202  100   0
model2 = sm.OLS(y, X)     # specify the reduced OLS model
result = model2.fit()     # fit the model by least squares
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:              AQI index   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     5318.
Date:                Thu, 02 Apr 2020   Prob (F-statistic):               0.00
Time:                        21:35:18   Log-Likelihood:                -3379.7
No. Observations:                 822   AIC:                             6769.
Df Residuals:                     817   BIC:                             6793.
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         25.9959      2.064     12.598      0.000      21.945      30.046
PM25           0.9378      0.016     58.347      0.000       0.906       0.969
PM10           0.2417      0.015     15.864      0.000       0.212       0.272
No2           -0.2891      0.044     -6.613      0.000      -0.375      -0.203
O3             0.0560      0.017      3.297      0.001       0.023       0.089
==============================================================================
Omnibus:                      337.402   Durbin-Watson:                   1.783
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5783.530
Skew:                           1.401   Prob(JB):                         0.00
Kurtosis:                      15.689   Cond. No.                         711.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
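
One way to compare the two fits is through the information criteria reported in both summaries. The sketch below recomputes them from the log-likelihoods using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k counts the estimated coefficients including the intercept. The inputs are copied from the two summaries, and the helper name `aic_bic` is illustrative rather than part of statsmodels.

```python
import math


def aic_bic(log_likelihood, k, n):
    """Return (AIC, BIC) for a model with k estimated coefficients and n observations."""
    aic = 2 * k - 2 * log_likelihood
    bic = k * math.log(n) - 2 * log_likelihood
    return aic, bic


n = 822  # No. Observations in both summaries

# Full model: intercept + 6 predictors, Log-Likelihood -3378.3
aic1, bic1 = aic_bic(-3378.3, k=7, n=n)
# Reduced model: intercept + 4 predictors, Log-Likelihood -3379.7
aic2, bic2 = aic_bic(-3379.7, k=5, n=n)

print(f"full model:    AIC={aic1:.1f}  BIC={bic1:.1f}")
print(f"reduced model: AIC={aic2:.1f}  BIC={bic2:.1f}")
```

The recomputed values agree with the reported ones up to rounding (AIC 6771 and BIC 6804 for the full model, AIC 6769 and BIC 6793 for the reduced one). Both criteria are slightly lower for the reduced model while R-squared stays at 0.963, which is consistent with dropping So2 and Co.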

Origin www.cnblogs.com/jiaxinwei/p/12623207.html