Original link: http://tecdat.cn/?p=6267
I have recently been teaching a modeling course, and have been reading and thinking about the notion of goodness of fit. R-squared, the proportion of the variation in the outcome Y that is explained by the covariates X, is commonly described as a measure of goodness of fit. This certainly seems reasonable, because R-squared measures how close the observed Y values are to the values predicted (fitted) by the model.
An important point to remember, however, is that R-squared gives us no information about whether our model is correctly specified. That is, it does not tell us whether we have correctly specified how the expectation of the outcome Y depends on the covariates. In particular, a high R-squared value does not necessarily mean that our model is correctly specified. This is easiest to illustrate with a simple example.
First, we simulate some data using R. To do this, we randomly generate X values from the standard normal distribution (mean zero, variance one). We then generate the outcome Y as X plus a random error, which is also drawn from the standard normal distribution.
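The simulation step described above can be sketched as follows. The seed and the sample size `n` are assumptions made for illustration; the article does not state the values it used.

```r
# Simulate data from a correctly specified linear mechanism.
set.seed(1234)       # assumed seed, only so the sketch is reproducible
n <- 1000            # assumed sample size

x <- rnorm(n)        # covariate: standard normal (mean 0, variance 1)
y <- x + rnorm(n)    # outcome: X plus a standard normal error

head(cbind(x, y))    # inspect the first few simulated pairs
```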
We can then fit the (correct) linear regression model for Y, with X as the covariate.
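A minimal sketch of this fit, assuming the same hypothetical seed and sample size as before, uses `lm()` and reads R-squared from the model summary. The object name `mod1` is chosen here for illustration. Since Y is X plus independent unit-variance noise, the slope estimate should be close to 1 and R-squared close to 0.5.

```r
# Regenerate the linear data (seed and n are assumptions).
set.seed(1234)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# Fit the correctly specified model E(Y | X) = b0 + b1 * X.
mod1 <- lm(y ~ x)

coef(mod1)                        # intercept and slope estimates
r2_1 <- summary(mod1)$r.squared   # proportion of variation explained
r2_1
```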
We can also plot the data, overlaid with the fitted line from the model.
Figure: observed (X, Y) data with the fitted line overlaid.
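A plot of this kind could be produced with base R graphics roughly as follows. The seed, sample size, colours and the name `mod1` are illustrative assumptions rather than the article's own settings.

```r
# Regenerate the linear data and refit (seed and n are assumptions).
set.seed(1234)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)
mod1 <- lm(y ~ x)

# Scatter plot of the observed data with the fitted regression line on top.
plot(x, y, xlab = "X", ylab = "Y")
abline(reg = mod1, col = "red", lwd = 2)
```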
Now let us regenerate the data, but this time generating Y so that its expectation is an exponential function of X:
x <- rnorm(n)
y <- exp(x) + rnorm(n)
Of course, in practice we do not simulate our data: we observe or collect it, and then try to fit a reasonable model. So, as before, we might start by fitting a simple linear regression model, which assumes that the expectation of Y is a linear function of X.
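As a sketch, the misspecified fit might look like this. The seed, `n`, and the name `mod2` are assumptions, so the exact estimates will differ from any figures quoted in the article.

```r
# Data whose true mean is exp(X), not a linear function of X.
set.seed(1234)           # assumed seed
n <- 1000                # assumed sample size
x <- rnorm(n)
y <- exp(x) + rnorm(n)

# Fit the (misspecified) straight-line model anyway.
mod2 <- lm(y ~ x)

coef(mod2)                        # intercept and slope of the fitted line
r2_2 <- summary(mod2)$r.squared   # can be sizeable despite misspecification
r2_2
```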
Unlike in the first case, the parameter estimates we obtain (1.65, 1.54) are not unbiased estimates of the parameters of the "true" data-generating mechanism, in which the expectation of Y is a linear function of exp(X). Moreover, we obtain an R-squared value of 0.46, which again indicates that X (included linearly) explains a considerable amount of the variation in Y. We might take this to mean that the model we have used, in which the expectation of Y is linear in X, is reasonable. However, consider plotting the observed data again, overlaid with the fitted line.
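A rough way to draw this plot is sketched below, under an assumed seed and sample size. The dashed curve showing the true mean exp(X) is an addition for illustration and is not part of the original figure.

```r
# Regenerate the exponential-mean data and refit (seed and n are assumptions).
set.seed(1234)
n <- 1000
x <- rnorm(n)
y <- exp(x) + rnorm(n)
mod2 <- lm(y ~ x)

# Observed data with the fitted straight line.
plot(x, y, xlab = "X", ylab = "Y")
abline(reg = mod2, col = "red", lwd = 2)

# Extra (not in the original): the true mean curve, for comparison.
curve(exp(x), add = TRUE, col = "blue", lty = 2, lwd = 2)
```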
Overlaying the fitted line on the observed data makes it clear that the model we have used is not correctly specified, even though the R-squared value is fairly large. In particular, we see that for both low and high values of X, the fitted values are too small. This is a consequence of the fact that the expectation of Y depends on exp(X), whereas the model we have used assumes that it is a linear function of X.
This simple example illustrates that, although R-squared is an important measure, a high value does not mean that our model is correctly specified. Arguably, a better way to describe R-squared is as a measure of "explained variation". To assess whether our model is correctly specified, we should use model diagnostic techniques, such as plots of the residuals against the covariates or against the linear predictor.
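One such diagnostic, a plot of residuals against the covariate, can be sketched as follows. The seed, sample size, the `lowess()` smoother and the tail cut-offs at -2 and 2 are all illustrative assumptions. For a correctly specified model the smoothed residuals should hover around zero; here they curve upwards at both ends of the range of X.

```r
# Regenerate the exponential-mean data and refit (seed and n are assumptions).
set.seed(1234)
n <- 1000
x <- rnorm(n)
y <- exp(x) + rnorm(n)
mod2 <- lm(y ~ x)

# Residuals versus the covariate, with a zero line and a smooth trend.
res <- resid(mod2)
plot(x, res, xlab = "X", ylab = "Residual")
abline(h = 0, lty = 2)
lines(lowess(x, res), col = "red", lwd = 2)

# Numerical check: average residual in the lower and upper tails of X.
low_tail  <- mean(res[x < -2])
high_tail <- mean(res[x > 2])
c(low = low_tail, high = high_tail)
```

Positive average residuals in both tails mean the fitted line lies below the data there, which matches what the overlaid plot shows.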