Anscombe's Quartet & Linear Regression Analysis

Overview

This article judges the effectiveness of the linear regression method. Using Excel, we run a linear regression analysis on each of the four data sets in Anscombe's quartet to determine which of the resulting regression equations are valid and which are not, and how the invalid ones might be handled.

1. Data analysis process

Use the "Data Analysis" function that comes with Excel to perform a linear regression analysis on each data set in the quartet.
For the specific operating procedure, refer to the articles:

Excel installation & linear regression.
Excel does linear regression analysis.
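For readers without Excel at hand, the same kind of fit can be sketched in a few lines of Python. This is only an illustration of ordinary least squares, not the Excel procedure itself. The post does not list the data, so the values below are assumed to be the standard published values of the first data set, and the helper name linear_fit is made up for this example.

```python
from math import sqrt

# Data set one of Anscombe's quartet.
# Assumption: standard published values; the post itself does not list them.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def linear_fit(x, y):
    """Ordinary least squares for y = slope*x + intercept (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy
    std_error = sqrt(sse / (n - 2))  # residual standard error
    return slope, intercept, r_squared, std_error


slope, intercept, r_squared, std_error = linear_fit(x, y)
print(f"y = {slope:.4f}*x + {intercept:.4f}")
print(f"R^2 = {r_squared:.4f}, standard error = {std_error:.4f}")
```

The P value that Excel reports needs the F (or t) distribution, which the Python standard library does not provide, so it is left out of this sketch.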

2. Quartet data set

1. Data one

Linear regression analysis results:
[Figure: Excel regression output]
Scatter plot (axis ranges adjusted for easier viewing; the same applies below):
[Figure: scatter plot]
Linear fitting equation:
[Figure: fitted line]
Related values:

Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6665
Value of P: 0.00217
Standard error: 1.2366
Fitting equation: y=0.5001*x+3.0001

2. Data two

Linear regression analysis results:
[Figure: Excel regression output]
Scatter plot:
[Figure: scatter plot]
Linear fitting equation:
[Figure: fitted line]
Related values:

Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6662
Value of P: 0.002179
Standard error: 1.237214
Fitting equation: y=0.5*x+3.0009

3. Data three

Linear regression analysis results:
[Figure: Excel regression output]
Scatter plot:
[Figure: scatter plot]
Linear fitting equation:
[Figure: fitted line]
Related values:

Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6663
Value of P: 0.002176
Standard error: 1.236311
Fitting equation: y=0.4997*x+3.0025

4. Data four

Linear regression analysis results:
[Figure: Excel regression output]
Scatter plot:
[Figure: scatter plot]
Linear fitting equation:
[Figure: fitted line]
Related values:

Average value of x: 9
Average value of y: 7.5009
Value of R^2: 0.6667
Value of P: 0.002165
Standard error: 1.235695
Fitting equation: y=0.4999*x+3.0017
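The four sets of related values can be cross-checked outside Excel. The sketch below recomputes the mean of x, the mean of y, the slope, the intercept and R^2 for each data set and rounds them to two decimals. The data are assumed to be the standard published values of the quartet (the post does not list them), and the names quartet and summarize are invented for this example.

```python
# Anscombe's quartet.
# Assumption: standard published values; the post itself does not list them.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "one": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "two": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "three": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "four": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
             [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return (mean x, mean y, slope, intercept, R^2) of a least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return mean_x, mean_y, slope, intercept, r_squared


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, [round(v, 4) for v in stats])

# Collect the distinct summaries after rounding to two decimals.
rounded = {tuple(round(v, 2) for v in stats) for stats in summaries.values()}
print(len(rounded), "distinct summary after rounding")
```

If the assumed data are right, the set rounded collapses to a single entry, which is the numerical side of the observation that the four data sets cannot be told apart from these statistics.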

5. Validity judgment

Rounding the results to two decimal places makes the point:

In all four data sets the average of x is 9.0 and the average of y is 7.5, the P value is about 0.002, R^2 is 0.67, and the linear regression fitting equation is y=3.0+0.5x. Judged by these statistics alone, the four data sets appear to describe very similar situations. In fact they are very different, and plotting them in a chart shows four completely different patterns.

The first data set is what most people picture when they see the statistics above. It is the most "normal" of the four, and the linear regression equation is valid for it. The second data set actually follows a precise quadratic relationship; a linear model is simply the wrong choice for it, yet the resulting statistics match those of the first set. The third data set describes an exact linear relationship with a single outlier, and that outlier distorts the statistics above, the correlation in particular. The fourth data set is a more extreme example: its outlier distorts all of the statistics, including the mean, variance, correlation, and regression line. For the last three data sets, the fitted equation is therefore not a valid description of the data.
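As an illustration of how an invalid fit might be handled, the sketch below takes the third data set, finds the point with the largest residual, drops it, and fits the line again. Removing a point is only one possible remedy and is not prescribed by the post; whether it is justified depends on why the point is unusual. The data are assumed to be the standard published values, and the helper fit is a made-up name.

```python
# Data set three of Anscombe's quartet.
# Assumption: standard published values; the post itself does not list them.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit(x, y):
    """Least-squares slope, intercept and R^2 for y = slope*x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)


slope, intercept, r2 = fit(x, y)

# Residual of every point against the line fitted to all eleven points.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
worst = max(range(len(x)), key=lambda i: abs(residuals[i]))
print("largest residual at", (x[worst], y[worst]))

# Refit without that single point.
x_clean = x[:worst] + x[worst + 1:]
y_clean = y[:worst] + y[worst + 1:]
slope_clean, intercept_clean, r2_clean = fit(x_clean, y_clean)
print(f"all points:      y = {slope:.4f}*x + {intercept:.4f}, R^2 = {r2:.4f}")
print(f"outlier removed: y = {slope_clean:.4f}*x + {intercept_clean:.4f}, R^2 = {r2_clean:.4f}")
```

With the outlier removed, the remaining ten points lie almost exactly on a line with a noticeably smaller slope, which is the "exact linear relationship" described above. The second data set would call for a different remedy, such as fitting a quadratic instead of a line.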

3. Summary and reference materials

1. Summary

A set of data cannot be analyzed effectively from the average, mean square error, and similar statistics alone. It is best to start with a plot: combining the numbers with the shape of the data is what reveals the actual trend of a data set.
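A chart is normally drawn with a charting tool such as the scatter chart in Excel. As a dependency-free stand-in, the sketch below prints a rough character-grid scatter plot of the second data set, which is enough to see the curve that the summary statistics hide. The data are assumed to be the standard published values, and text_scatter is a hypothetical helper written for this example.

```python
# Data set two of Anscombe's quartet.
# Assumption: standard published values; the post itself does not list them.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def text_scatter(x, y, width=40, height=12):
    """Return a rough character-grid scatter plot of the points (hypothetical helper)."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each coordinate onto the grid; both ranges are assumed non-empty.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)


plot = text_scatter(x, y)
print(plot)
```

The grid is coarse, so it only shows the overall shape; a real chart remains the better tool for spotting a single outlier.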

2. Reference materials

The importance of graphs: Anscombe's four sets of data.

Origin blog.csdn.net/QWERTYzxw/article/details/114944978