Anscombe Quartet & Linear Regression Analysis
Overview
This article assesses how reliable linear regression is as a method. Using Excel, we run a linear regression on each of the four data sets in Anscombe's quartet to determine which of the resulting regression equations are valid, which are not, and how to handle the ones that are not.
1. Data analysis process
Use the "Data Analysis" tool that comes with Excel to perform linear regression analysis on the quartet data set.
For the detailed steps, see the articles:
Excel installation & linear regression
Excel does linear regression analysis
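For readers who want to cross-check the Excel output, the same quantities can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this article: the function name `simple_ols` is hypothetical, and the data are the commonly published values for Anscombe's first data set, not values copied from this article.

```python
import math

# Commonly published values for Anscombe's data set I
# (an assumption: they are not copied from this article).
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def simple_ols(x, y):
    """Ordinary least squares with one predictor (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - sse / sst,
        # Standard error of the regression, with n - 2 degrees of freedom.
        "std_error": math.sqrt(sse / (n - 2)),
    }


result = simple_ols(x1, y1)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The p-value is left out because it needs the t distribution, which the standard library does not provide.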
2. Quartet data set
1. Data One
Linear regression analysis results:
Scatter plot (axis ranges adjusted for easier viewing; the same applies below)
Linear fitting equation
Related values:
Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6665
Value of P: 0.00217
Standard error: 1.2366
Fitting equation: y=0.5001*x+3.0001
2. Data Two
Linear regression analysis results:
Scatter plot
Linear fitting equation
Related values:
Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6662
Value of P: 0.002719
Standard error: 1.237214
Fitting equation: y=0.5*x+3.0009
3. Data Three
Linear regression analysis results:
Scatter plot
Linear fitting equation
Related values:
Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6663
Value of P: 0.002176
Standard error: 1.236311
Fitting equation: y=0.4997*x+3.0025
4. Data Four
Linear regression analysis results:
Scatter plot
Linear fitting equation
Related values:
Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6667
Value of P: 0.002165
Standard error: 1.235695
Fitting equation: y=0.4999*x+3.0017
5. Validity judgment
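Rounding the values above makes the point: the mean of x is 9.0 and the mean of y is 7.5 in every set; the p-values are all about 0.002; the R^2 values are all 0.67 (a correlation coefficient of about 0.82); and the fitted regression line is y = 3.0 + 0.5x in every case.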
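The agreement between the rounded statistics can be reproduced with a short sketch. It assumes the commonly published values for Anscombe's quartet (not copied from this article), and the helper `summarize` is a hypothetical name introduced only for illustration.

```python
import math

# Commonly published values for Anscombe's quartet
# (an assumption: they are not copied from this article).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return rounded (mean_x, mean_y, slope, intercept, r) for one data set."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return (round(mean_x, 1), round(mean_y, 1),
            round(slope, 1), round(intercept, 1), round(r, 2))


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

After rounding, all four tuples come out identical, which is exactly why the summary statistics alone cannot tell the data sets apart.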
Judged by these summary statistics alone, the four data sets appear to describe almost the same situation. In fact they are very different, and plotting them shows four distinct patterns. The first set is what most people picture on seeing the statistics above: it is the most "normal" of the four. The second set actually follows a precise quadratic relationship; a linear model has simply been applied to it incorrectly, and the resulting statistics still match those of the first set. The third set describes an exact linear relationship apart from one outlier, and that outlier distorts the statistics above, the correlation in particular. The fourth set is a more extreme example: its outlier biases all of the statistics, including the mean, variance, correlation, and regression line. Only the regression equation for the first set is therefore a sound description of its data. The second set calls for a quadratic model, and the outliers in the third and fourth sets should be examined before a line is fitted.
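The three problem cases can also be checked numerically. The sketch below is one possible set of diagnostics, not the method used in this article: it assumes the commonly published quartet values, and `fit_line` is a hypothetical helper. For the second set it looks at second differences of y (constant second differences over equally spaced x indicate a parabola); for the third it drops the point with the largest residual and refits; for the fourth it checks what remains once the point at x = 19 is set aside.

```python
# Commonly published values for Anscombe's quartet
# (an assumption: they are not copied from this article).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]


def fit_line(x, y):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - sse / sst


# Set II: sort by x (the integers 4..14), then take second differences of y.
y2_sorted = [yi for _, yi in sorted(zip(x123, y2))]
first_diff = [b - a for a, b in zip(y2_sorted, y2_sorted[1:])]
second_diff = [round(b - a, 2) for a, b in zip(first_diff, first_diff[1:])]
print("Set II second differences:", second_diff)

# Set III: remove the point with the largest absolute residual, then refit.
slope, intercept, _ = fit_line(x123, y3)
residuals = [abs(yi - (intercept + slope * xi)) for xi, yi in zip(x123, y3)]
worst = residuals.index(max(residuals))
x3_clean = x123[:worst] + x123[worst + 1:]
y3_clean = y3[:worst] + y3[worst + 1:]
slope_clean, intercept_clean, r2_clean = fit_line(x3_clean, y3_clean)
print(f"Set III without x={x123[worst]}: "
      f"y={slope_clean:.4f}*x+{intercept_clean:.4f}, R^2={r2_clean:.5f}")

# Set IV: without the point at x = 19, x takes a single value,
# so no regression line can be fitted to what remains.
x4_rest = [xi for xi in x4 if xi != 19]
x4_has_spread = len(set(x4_rest)) > 1
print("Set IV has spread in x without the extreme point:", x4_has_spread)
```

Note that a largest-residual rule would not flag the extreme point in the fourth set, because the fitted line passes through it; that point has to be identified from its x value or from the plot.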
3. Summary and reference materials
1. Summary
To analyze a data set effectively, you cannot rely on summary statistics such as the mean and the mean squared error alone. It is best to start with a plot: combining the numbers with the shape of the data is what reveals the actual trend.
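As a rough illustration of looking at the shape first, even a crude text plot is enough to expose the curvature in the second data set. The sketch below uses only the Python standard library; `ascii_scatter` is a hypothetical helper, the data are the commonly published values for Anscombe's data set II rather than values copied from this article, and a real charting tool (such as the Excel scatter plots above) is the better choice in practice.

```python
# Commonly published values for Anscombe's data set II
# (an assumption: they are not copied from this article).
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def ascii_scatter(x, y, width=41, height=12):
    """Draw points as '*' on a character grid; assumes x and y are not constant."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each coordinate to a column/row index; row 0 is the top.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - yi) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return "\n".join("|" + "".join(line) for line in grid)


plot = ascii_scatter(x2, y2)
print(plot)
```

The printed points rise, flatten, and fall again, a pattern that the regression statistics for this data set give no hint of.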