Data Processing Exercise

The problems come from an advanced programming course.

Given a CSV file, complete the following two problems:


The corresponding code is as follows:

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load Anscombe's quartet; the file has the columns 'dataset', 'x' and 'y'.
anascombe = pd.read_csv('anscombe.csv')

# Per-dataset mean and variance of x and y, and the x-y correlation.
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())
print(anascombe.groupby('dataset').corr())

# Fit y ~ x on a random subset (about 70% of the rows) of each dataset.
dataset_names = ['I', 'II', 'III', 'IV']
for i in dataset_names:
    subset = anascombe[anascombe.dataset == i]
    n = len(subset)
    is_train = np.random.rand(n) < 0.7  # unseeded, so every run splits differently
    train = subset[is_train].reset_index(drop=True)
    test = subset[~is_train].reset_index(drop=True)  # held out, not used below

    lin_model = smf.ols('y ~ x', train).fit()
    print(lin_model.summary())

# Scatter plot of y against x, one panel per dataset.
g = sns.FacetGrid(anascombe, col='dataset')
g.map(plt.scatter, 'x', 'y')
plt.show()
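
The group statistics can also be cross-checked without the CSV file or any third-party library. The sketch below is not part of the original script: it assumes that anscombe.csv holds the standard Anscombe's quartet values (typed in by hand here), and the helper names mean, sample_var and pearson_r are made up for illustration. The variance divides by n - 1, which is what pandas .var() does by default.

```python
import math

# Anscombe's quartet, typed in by hand (assumed to match anscombe.csv).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def sample_var(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# One row per dataset: mean(x), mean(y), var(x), var(y), corr(x, y).
stats = {
    name: (mean(xs), mean(ys), sample_var(xs), sample_var(ys), pearson_r(xs, ys))
    for name, (xs, ys) in QUARTET.items()
}
for name, row in stats.items():
    print(name, " ".join(f"{v:.6f}" for v in row))
```

The four rows agree to about two decimal places, which is the point of the quartet: these statistics alone cannot tell the datasets apart.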

程序命令行输出:

dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.650
Model:                            OLS   Adj. R-squared:                  0.592
Method:                 Least Squares   F-statistic:                     11.15
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:18:34   Log-Likelihood:                -12.931
No. Observations:                   8   AIC:                             29.86
Df Residuals:                       6   BIC:                             30.02
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4459      1.497      1.634      0.153      -1.216       6.108
x              0.5464      0.164      3.339      0.016       0.146       0.947
==============================================================================
Omnibus:                        0.157   Durbin-Watson:                   3.211
Prob(Omnibus):                  0.925   Jarque-Bera (JB):                0.343
Skew:                          -0.096   Prob(JB):                        0.842
Kurtosis:                       2.004   Cond. No.                         27.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     15.10
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00464
Time:                        12:18:34   Log-Likelihood:                -15.546
No. Observations:                  10   AIC:                             35.09
Df Residuals:                       8   BIC:                             35.70
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0642      1.169      2.621      0.031       0.369       5.760
x              0.4842      0.125      3.886      0.005       0.197       0.772
==============================================================================
Omnibus:                        1.436   Durbin-Watson:                   2.438
Prob(Omnibus):                  0.488   Jarque-Bera (JB):                0.889
Skew:                          -0.413   Prob(JB):                        0.641
Kurtosis:                       1.795   Cond. No.                         27.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.699e+06
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           2.08e-12
Time:                        12:18:34   Log-Likelihood:                 29.314
No. Observations:                   6   AIC:                            -54.63
Df Residuals:                       4   BIC:                            -55.04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0098      0.003   1498.423      0.000       4.002       4.017
x              0.3451      0.000   1303.508      0.000       0.344       0.346
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.677
Prob(Omnibus):                    nan   Jarque-Bera (JB):                2.907
Skew:                           1.640   Prob(JB):                        0.234
Kurtosis:                       3.933   Cond. No.                         29.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  "anyway, n=%i" % int(n))
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1633: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1554: RuntimeWarning: divide by zero encountered in double_scalars
  return self.ess/self.df_model
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                      -inf
Date:                Sun, 10 Jun 2018   Prob (F-statistic):                nan
Time:                        12:18:34   Log-Likelihood:                -13.393
No. Observations:                   9   AIC:                             28.79
Df Residuals:                       8   BIC:                             28.98
Df Model:                           0
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1107      0.006     18.991      0.000       0.097       0.124
x              0.8856      0.047     18.991      0.000       0.778       0.993
==============================================================================
Omnibus:                        0.591   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.744   Jarque-Bera (JB):                0.509
Skew:                          -0.052   Prob(JB):                        0.775
Kurtosis:                       1.840   Cond. No.                          inf
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
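
The last summary above (Df Model 0, Cond. No. inf, identical t values for both coefficients) is what one would expect if the training subset of dataset IV contained only rows with x = 8. In the standard quartet, dataset IV has a single row with x = 19, so a random split that leaves it out makes x constant and the design matrix singular. The sketch below is a reconstruction under assumptions, not the author's code: it assumes the standard quartet values and guesses that the two held-out rows were (19, 12.50) and (8, 5.25). It then computes the minimum-norm least-squares solution by hand, which is the kind of solution a pseudo-inverse solver returns for a singular design.

```python
# Dataset IV of Anscombe's quartet (assumed to match anscombe.csv).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Hypothetical reconstruction of the 9-row training subset: which rows were
# held out is a guess, chosen because it reproduces the printed coefficients.
held_out = {(19, 12.50), (8, 5.25)}
train = [(x, y) for x, y in zip(x4, y4) if (x, y) not in held_out]

xs = [x for x, _ in train]
ys = [y for _, y in train]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# With constant x the spread of x is zero, so the usual slope Sxy / Sxx
# would divide by zero.
sxx = sum((x - x_mean) ** 2 for x in xs)

# Every design row is (1, c) with c = 8. Any (intercept, slope) with
# intercept + c * slope = y_mean fits equally well; the minimum-norm
# choice is proportional to (1, c).
c = xs[0]
intercept = y_mean / (1 + c * c)
slope = c * y_mean / (1 + c * c)

print(f"Sxx = {sxx}")
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

Under these assumptions the values match the Intercept and x rows of the summary above. The coefficients carry no information about how y depends on x, because the training data contains no variation in x.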

Output of a second run of the script (data1.py). The per-dataset statistics are the same as in the first run; the regression summaries differ because the random train/test split differs:
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.144
Model:                            OLS   Adj. R-squared:                 -0.070
Method:                 Least Squares   F-statistic:                    0.6714
Date:                Sun, 10 Jun 2018   Prob (F-statistic):              0.459
Time:                        12:20:16   Log-Likelihood:                -9.2736
No. Observations:                   6   AIC:                             22.55
Df Residuals:                       4   BIC:                             22.13
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.5660      3.535      1.575      0.190      -4.249      15.381
x              0.2723      0.332      0.819      0.459      -0.650       1.195
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.587
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.403
Skew:                           0.513   Prob(JB):                        0.818
Kurtosis:                       2.252   Cond. No.                         66.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.658
Method:                 Least Squares   F-statistic:                     18.33
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00268
Time:                        12:20:16   Log-Likelihood:                -15.103
No. Observations:                  10   AIC:                             34.21
Df Residuals:                       8   BIC:                             34.81
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8740      1.120      2.565      0.033       0.291       5.457
x              0.5000      0.117      4.281      0.003       0.231       0.769
==============================================================================
Omnibus:                        1.425   Durbin-Watson:                   2.338
Prob(Omnibus):                  0.490   Jarque-Bera (JB):                0.931
Skew:                          -0.471   Prob(JB):                        0.628
Kurtosis:                       1.840   Cond. No.                         28.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.652e+05
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           3.71e-14
Time:                        12:20:16   Log-Likelihood:                 31.802
No. Observations:                   7   AIC:                            -59.60
Df Residuals:                       5   BIC:                            -59.71
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0036      0.004   1102.706      0.000       3.994       4.013
x              0.3456      0.000    874.754      0.000       0.345       0.347
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.583
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.574
Skew:                           0.284   Prob(JB):                        0.750
Kurtosis:                       1.717   Cond. No.                         29.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.803
Model:                            OLS   Adj. R-squared:                  0.754
Method:                 Least Squares   F-statistic:                     16.34
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:20:16   Log-Likelihood:                -8.3460
No. Observations:                   6   AIC:                             20.69
Df Residuals:                       4   BIC:                             20.28
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3904      1.264      2.683      0.055      -0.118       6.899
x              0.4795      0.119      4.042      0.016       0.150       0.809
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.450
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.200
Skew:                           0.199   Prob(JB):                        0.905
Kurtosis:                       2.199   Cond. No.                         27.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
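
The regression summaries change between runs because the split is drawn from an unseeded random generator. The sketch below shows one way to make such a split repeatable and to reject a training subset in which x does not vary. It uses only the standard library; split_indices and has_varying_x are hypothetical helpers, not functions from the original script. With NumPy, the same effect is usually obtained by seeding the generator before calling it.

```python
import random


def split_indices(n, train_frac=0.7, seed=0):
    """Return (train, test) index lists; the same seed gives the same split."""
    rng = random.Random(seed)
    is_train = [rng.random() < train_frac for _ in range(n)]
    train = [i for i, keep in enumerate(is_train) if keep]
    test = [i for i, keep in enumerate(is_train) if not keep]
    return train, test


def has_varying_x(xs, train_idx):
    """True if the training rows contain at least two distinct x values."""
    return len({xs[i] for i in train_idx}) > 1


# x values of dataset IV in the standard quartet (assumed to match the CSV).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

# Pick the first seed whose training subset keeps some variation in x,
# so that a slope can be estimated at all.
seed = next(s for s in range(1000) if has_varying_x(x4, split_indices(len(x4), seed=s)[0]))
train_idx, test_idx = split_indices(len(x4), seed=seed)
print("seed:", seed)
print("train rows:", train_idx)
print("test rows:", test_idx)
```

Fixing the seed makes a run repeatable, but with only 11 rows per dataset the fitted coefficients still depend heavily on which rows land in the training subset.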

Plot output:


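In the third plot, x and y come closest to a straight line, apart from a single outlier, and in the regression analysis the third dataset also has the smallest errors (R-squared of 1.000 in both runs). The fit is that close only because both random training subsets (6 and 7 observations) happened to leave the outlier out.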
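
How much the third dataset depends on one point can be checked by fitting the line twice, once on all 11 rows and once without the row that has the largest residual. This is a minimal sketch and not the original analysis: it assumes the standard quartet values for dataset III, and fit_line is a hypothetical helper that implements ordinary least squares for a single predictor.

```python
# Dataset III of Anscombe's quartet (assumed to match anscombe.csv).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; also returns R^2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Fit on all rows, then drop the row with the largest absolute residual.
full = fit_line(x3, y3)
residuals = [abs(y - (full[0] + full[1] * x)) for x, y in zip(x3, y3)]
worst = residuals.index(max(residuals))

clean_x = x3[:worst] + x3[worst + 1:]
clean_y = y3[:worst] + y3[worst + 1:]
clean = fit_line(clean_x, clean_y)

print(f"all rows:        intercept={full[0]:.4f} slope={full[1]:.4f} R^2={full[2]:.4f}")
print(f"without row {worst}:   intercept={clean[0]:.4f} slope={clean[1]:.4f} R^2={clean[2]:.4f}")
```

With all rows the slope is close to 0.5, as in the other datasets. Without the outlier, the slope and intercept land near the values in the dataset III summaries above, which suggests that those fits describe the outlier-free line rather than the full dataset.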

Reposted from blog.csdn.net/qq_35783731/article/details/80640303