R-squared and goodness of fit in linear regression

Original link: http://tecdat.cn/?p=6267

 

I have recently been teaching a modeling course, and have been reading and thinking about the concept of goodness of fit. R-squared, the proportion of variation in the outcome Y explained by the covariates X, is often described as a measure of goodness of fit. This certainly seems reasonable, since R-squared measures how close the model's predicted (fitted) values are to the observed values of Y.

 

However, an important point to remember is that R-squared does not tell us whether our model is correctly specified. In other words, it does not tell us whether we have correctly specified how the expected outcome Y depends on the covariates. In particular, a high R-squared value does not necessarily mean that our model is correctly specified. This is easiest to illustrate with a simple example.
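As a quick reminder (this standard definition is not part of the original post), R-squared is one minus the ratio of the residual sum of squares to the total sum of squares:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

where \hat{y}_i are the fitted values and \bar{y} is the sample mean of the observed outcomes. A value near 1 means the fitted values track the observed outcomes closely; it says nothing about whether the assumed functional form is right.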

First, we will simulate some data in R. To do this, we randomly generate values of X from a standard normal distribution (mean zero, variance one). We then generate the outcome Y as X plus a random error, again drawn from a standard normal distribution:

n <- 1000
set.seed(512312)
x <- rnorm(n)
y <- x + rnorm(n)

We can then fit a linear regression model for Y, with X as the covariate:

mod1 <- lm(y ~ x)
summary(mod1)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8571 -0.6387 -0.0022  0.6050  3.0716 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02193    0.03099   0.708    0.479    
x            0.93946    0.03127  30.040   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.98 on 998 degrees of freedom
Multiple R-squared:  0.4748,    Adjusted R-squared:  0.4743 
F-statistic: 902.4 on 1 and 998 DF,  p-value: < 2.2e-16

 

We can also plot the data, with the fitted line from the model overlaid:
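The original post shows only the resulting figure, not the plotting code; a minimal base-R sketch that would produce it, using the mod1 fit above, is:

# scatter plot of the simulated data
plot(x, y, col = "grey")
# add the fitted regression line from mod1
abline(mod1, col = "red", lwd = 2)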

 

 

Figure: observed (X, Y) data with the fitted regression line overlaid.

Now let us regenerate the data, but this time making the expected value of Y an exponential function of X:

 
x <- rnorm(n)
y <- exp(x) + rnorm(n)

Of course, in practice we do not simulate our data; we observe or collect it, and then try to fit a reasonable model to it. So, as before, we can start by fitting a simple linear regression model, which assumes that the expectation of Y is a linear function of X:
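The fitting code itself is not shown in the original post; the output below can be reproduced with something like the following (the object name mod2 is my assumption; the formula matches the Call line in the output):

# fit the misspecified linear model to the exponential data
mod2 <- lm(y ~ x)
summary(mod2)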

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5022 -0.9963 -0.1706  0.6980 21.7411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.65123    0.05220   31.63   <2e-16 ***
x            1.53517    0.05267   29.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.651 on 998 degrees of freedom
Multiple R-squared:  0.4598,    Adjusted R-squared:  0.4593 
F-statistic: 849.5 on 1 and 998 DF,  p-value: < 2.2e-16

Unlike in the first case, the parameter estimates we obtain (1.65, 1.54) are not unbiased estimates of parameters in the "true" data-generating mechanism, in which the expectation of Y is exp(X) rather than a linear function of X. Moreover, we obtain an R-squared value of 0.46, again indicating that X (entered linearly) explains a considerable amount of the variation in Y. We might take this to mean that the model we are using, in which the expectation of Y is linear in X, is reasonable. But if we again plot the observed data and overlay the fitted line:
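As before, the plotting code is not shown in the original; a sketch along the same lines, using the assumed mod2 object:

# scatter plot of the new data
plot(x, y, col = "grey")
# overlay the (misspecified) linear fit
abline(mod2, col = "red", lwd = 2)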

 

The fitted line superimposed on the observed data clearly shows that the model we are using is not correctly specified, even though the R-squared value is quite large. In particular, we see that for low and high values of X, the fitted values are too small. This is a consequence of the fact that the expectation of Y depends on exp(X), while the model we have used assumes it is a linear function of X.

This simple example shows that although R-squared is an important measure, a high value does not mean that our model is correctly specified. Arguably, a better description of R-squared is as a measure of "explained variance". To assess whether our model is correctly specified, we should use model diagnostic techniques, such as plots of residuals against the covariates or against the linear predictor.
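For instance, a minimal sketch of such a diagnostic in base R, using the mod2 object assumed above:

# residuals against fitted values; systematic curvature indicates misspecification
plot(fitted(mod2), resid(mod2),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

For the exponential example, this plot would show a clear pattern in the residuals, revealing the misspecification that the high R-squared value conceals.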

Thank you for reading this article. If you have any questions, please leave a comment below!

 

 
