Linear regression analysis in Excel: the Galton data set and Anscombe's quartet

1. Linear regression analysis of the Galton data set

(1) Regression analysis of the average height of the parents and the height of one of the children

1. Preprocess the data
① Calculate the average height of each family's parents in Excel.
Select the cell where the average should appear, click the function button (marked with a red box in the original screenshot), and enter the range of the data to be averaged. Then select the whole output column, starting from that cell, and press Ctrl+D to fill the formula down so that the average is calculated for every row.
② Keep the height of only one child per family.
On the Data tab, select the range to deduplicate, choose Remove Duplicates, choose to expand the selection, and then check only the column used for deduplication.
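The two preprocessing steps can also be sketched outside Excel. The following is a minimal Python illustration, not the procedure used in this post: the row layout and all values are hypothetical, made up only to show the idea of averaging the parents' heights and keeping the first row of each family.

```python
# Hypothetical rows: (family id, father height, mother height, child height).
# The values are made up for illustration and are NOT taken from the Galton data.
rows = [
    ("F1", 70.0, 66.0, 71.0),
    ("F1", 70.0, 66.0, 68.5),
    ("F2", 68.0, 64.0, 67.0),
    ("F2", 68.0, 64.0, 65.0),
    ("F3", 72.0, 65.0, 70.5),
]


def preprocess(rows):
    """Return one (family, mid-parent height, child height) tuple per family."""
    seen = set()
    result = []
    for family, father, mother, child in rows:
        if family in seen:
            continue  # keep only the first row of each family
        seen.add(family)
        midparent = (father + mother) / 2  # average height of the two parents
        result.append((family, midparent, child))
    return result


pairs = preprocess(rows)
for family, midparent, child in pairs:
    print(family, midparent, child)
```

Because only the first row of each family is kept, which child survives depends on how the rows are sorted.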
2. Perform the regression analysis
Choose Data -> Data Analysis -> Regression -> OK.
The average height of each couple is used as the independent variable X, and the height of one of their children as the dependent variable Y. Enter the ranges of the corresponding data and check the output options as in the original screenshot.
Excel then generates the chart.
3. Add a trend line
Select the data points, right-click, and choose Add Trendline.
Select Linear and check the option to display the equation on the chart.
Other parts of the chart can then be selected and adjusted to make the chart clearer and more specific, which gives the final chart.
4. Explanation of the results
The fitted equation is y=0.5702x+31.801: when the parents' average height increases by 1 unit, the child's height increases by 0.5702 units on average.
The coefficient of determination R-squared is about 0.12, so the parents' average height explains only about 12% of the variation in the child's height; the linear relationship is weak and the degree of fit is low. In the analysis of variance table, the F statistic exceeds the critical value from the F table, so the regression is statistically significant. The P value is much less than 0.01, which indicates that the slope is significantly different from zero, even though the fit itself is weak.
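To make the quantities in the regression output concrete, here is a minimal Python sketch of simple least-squares regression that computes the slope, intercept, R-squared and F statistic by hand. The function name and the sample heights are hypothetical and are not the Galton values, so the printed numbers will not match the results above.

```python
def linear_regression(xs, ys):
    """Fit y = slope * x + intercept by least squares.

    Returns (slope, intercept, r_squared, f_statistic).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares: variation the line does not explain.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_residual / ss_total

    # F statistic for one predictor: explained variation divided by the
    # residual mean square with n - 2 degrees of freedom.
    f_statistic = (ss_total - ss_residual) / (ss_residual / (n - 2))
    return slope, intercept, r_squared, f_statistic


# Hypothetical mid-parent and child heights, for illustration only.
midparent = [64.0, 66.0, 67.0, 68.0, 70.0, 71.0]
child = [66.0, 65.0, 69.0, 67.0, 70.0, 69.0]

slope, intercept, r_squared, f_statistic = linear_regression(midparent, child)
print(f"y = {slope:.4f}x + {intercept:.3f}")
print(f"R^2 = {r_squared:.4f}, F = {f_statistic:.3f}")
```

With a single predictor, F equals R^2 / (1 - R^2) * (n - 2), so a small R-squared can still give a large, significant F when the sample is large.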

(2) Regression analysis of the height of the father and the height of one of the children

The regression equation is obtained in the same way as above. The regression results are as follows.
The fitted equation is y=0.2962x+49.27: when the father's height increases by 1 unit, the child's height increases by 0.2962 units on average.
R-squared is about 0.06, so the father's height explains only about 6% of the variation in the child's height, and the degree of fit is very low. In the analysis of variance table, the F statistic exceeds the critical value from the F table, so the regression is statistically significant. The P value is much less than 0.01, which indicates that the slope is significantly different from zero.
Comparing the two analyses, the regression on the parents' average height fits the child's height better than the regression on the father's height alone. However, the deduplication step has a problem: the child kept for each family is the tallest child in that family. I think that if the children's average height were used, or one child were selected at random, the regression equation would fit the data better. Because I am not very familiar with Excel and the data set is relatively large, this step would take some time, so I have not verified this idea.
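The alternatives mentioned here (tallest child, average of the children, or a randomly chosen child) are easy to express in code. The following Python sketch only shows the grouping step under assumed data: the rows are hypothetical, and it does not verify whether either alternative actually improves the fit on the Galton data.

```python
import random
from collections import defaultdict

# Hypothetical rows: (family id, mid-parent height, child height).
# Made-up values, NOT taken from the Galton data.
rows = [
    ("F1", 68.0, 71.0),
    ("F1", 68.0, 68.5),
    ("F1", 68.0, 66.0),
    ("F2", 66.0, 67.0),
    ("F2", 66.0, 65.0),
]


def reduce_children(rows, how):
    """Collapse each family to one (mid-parent, child) pair using `how`."""
    heights = defaultdict(list)
    midparent = {}
    for family, mp, child in rows:
        heights[family].append(child)
        midparent[family] = mp
    return [(midparent[f], how(hs)) for f, hs in heights.items()]


tallest = reduce_children(rows, max)                            # what deduplication kept
average = reduce_children(rows, lambda hs: sum(hs) / len(hs))   # mean of the children
rng = random.Random(0)                                          # fixed seed for repeatability
sampled = reduce_children(rows, rng.choice)                     # one random child per family

print(tallest)
print(average)
print(sampled)
```

Each of the three lists can then be passed to the same regression procedure to compare the resulting R-squared values.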

(3) Regression analysis of father's height and son's height

1. Process the data
Filter the data so that only the rows containing sons' heights remain.
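The same filter can be written in a few lines of Python. This is a sketch under assumptions: the field names ("father", "sex", "height") and the values are hypothetical and may not match the column names of the actual data file.

```python
# Hypothetical records; field names and values are assumptions for illustration.
records = [
    {"family": "F1", "father": 70.0, "mother": 66.0, "sex": "M", "height": 71.0},
    {"family": "F1", "father": 70.0, "mother": 66.0, "sex": "F", "height": 65.5},
    {"family": "F2", "father": 68.0, "mother": 64.0, "sex": "M", "height": 67.0},
    {"family": "F2", "father": 68.0, "mother": 64.0, "sex": "F", "height": 63.0},
]

# Keep only the sons, as (father height, son height) pairs for the regression.
father_son = [(r["father"], r["height"]) for r in records if r["sex"] == "M"]
# The same rows paired with the mother's height instead.
mother_son = [(r["mother"], r["height"]) for r in records if r["sex"] == "M"]

print(father_son)
print(mother_son)
```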
2. Perform the regression analysis
The method is the same as above. The regression results are as follows.

The fitted equation is y=0.2547x+49.872: when the father's height increases by 1 unit, the son's height increases by 0.2547 units on average. The positive slope also shows that the father's height and the son's height are positively correlated.
R-squared is about 0.7969, which indicates a relatively strong linear relationship between the father's height and the son's height. In the analysis of variance table, the F statistic exceeds the critical value from the F table, so the regression is statistically significant. The P value is much smaller than 0.01, which indicates that the slope is significantly different from zero.

(4) Regression analysis of mother's height and son's height

The method is the same as above.
In the fitted equation the slope is negative, so the mother's height and the son's height are negatively correlated, but R-squared is very small, which indicates that there is essentially no linear relationship between the two.
Comparing the two analyses, the son's height is more strongly correlated with the father's height. The father's height is positively correlated with the son's height, while the mother's height shows essentially no correlation with it. In these results, the son's height appears to be associated mainly with the father's height.
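The sign and strength of each relationship can be summarized by the Pearson correlation coefficient r. In simple linear regression the slope has the same sign as r, and R-squared equals r squared. A minimal Python sketch follows; the height lists are hypothetical and chosen only so that one correlation is positive and the other negative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights, NOT taken from the Galton data.
father = [66.0, 68.0, 69.0, 70.0, 72.0, 74.0]
mother = [65.0, 62.0, 66.0, 63.0, 64.0, 61.0]
son = [67.0, 68.5, 69.0, 71.0, 71.5, 73.0]

r_father = pearson_r(father, son)
r_mother = pearson_r(mother, son)
print(f"father vs son: r = {r_father:.3f}, r^2 = {r_father ** 2:.3f}")
print(f"mother vs son: r = {r_mother:.3f}, r^2 = {r_mother ** 2:.3f}")
```

A negative r with a very small r squared, as reported for the mother's height, means the direction of the estimated slope is not meaningful on its own.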

2. Linear regression analysis of Anscombe's quartet

1. Table 1
The scatter plot shows that a straight line does not capture the trend of the original data well, so the linear regression equation is not appropriate. Testing other trend-line types shows that a 6th-degree polynomial follows the data more closely than the linear regression equation, although with so few data points a polynomial of such high degree may simply be fitting noise.
2. Table 2
The scatter plot shows that a straight line does not capture the trend of the original data, so the linear regression equation is not appropriate. Testing other trend-line types shows that a second-order polynomial represents the trend of the data better than the linear regression equation.
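The comparison between a linear and a second-order fit can be reproduced with a small least-squares routine. The sketch below assumes that Table 2 corresponds to the second set of the published Anscombe quartet and uses those published values; the helper functions are written for illustration and solve the normal equations directly, which is adequate for low degrees only.

```python
# Second set of Anscombe's quartet (published values); assumed to be "Table 2".
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - tail) / m[r][r]
    return coeffs


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    size = degree + 1
    a = [[sum(v ** (i + j) for v in xs) for j in range(size)] for i in range(size)]
    b = [sum(w * v ** i for v, w in zip(xs, ys)) for i in range(size)]
    return solve(a, b)


def r_squared(xs, ys, coeffs):
    """Share of the variation in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    fitted = [sum(c * v ** k for k, c in enumerate(coeffs)) for v in xs]
    ss_res = sum((w - f) ** 2 for w, f in zip(ys, fitted))
    ss_tot = sum((w - mean_y) ** 2 for w in ys)
    return 1 - ss_res / ss_tot


linear = polyfit(x, y, 1)
quadratic = polyfit(x, y, 2)
print("linear    R^2:", round(r_squared(x, y, linear), 4))
print("quadratic R^2:", round(r_squared(x, y, quadratic), 4))
```

The quadratic fit explains almost all of the variation, while the straight line explains only about two thirds of it.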
3. Table 3
The scatter plot shows that a straight line basically captures the trend of the original data, with only a very few extreme values. The linear regression equation can therefore basically reflect how the data change.
4. Table 4
The scatter plot shows that a straight line does not represent the trend of the original data, so the linear regression equation does not hold; the data basically cannot be described by a linear relationship. The independent and dependent variables should probably be exchanged before the regression analysis is carried out again.
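One way to see why a straight line is misleading here is to look at how much the independent variable actually varies. The sketch below assumes that Table 4 corresponds to the fourth set of the published Anscombe quartet: all x values but one are identical, so the fitted slope is determined entirely by a single point.

```python
# Fourth set of Anscombe's quartet (published values); assumed to be "Table 4".
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return sxy / sum_sq_dev(xs)


# Drop the single point with x = 19 and look at what is left.
rest = [(a, b) for a, b in zip(x4, y4) if a != 19]
rest_x = [a for a, _ in rest]
rest_y = [b for _, b in rest]
mean_rest_y = sum(rest_y) / len(rest_y)

print("variation in x with the extreme point:   ", sum_sq_dev(x4))
print("variation in x without the extreme point:", sum_sq_dev(rest_x))
print("fitted slope:", slope(x4, y4))
# The line simply joins the mean of the x = 8 group to the single x = 19 point.
print("slope through the two groups:", (12.50 - mean_rest_y) / (19 - 8))
```

Without the extreme point the x values have zero variation, so no slope can be estimated from them at all.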
The results for the four tables show that the R value and the P value alone are not good indicators of whether a regression equation is appropriate. The four data sets give the same R value and P value, yet not all of them are well described by the regression equation.
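This can be checked numerically. The following Python sketch fits a straight line to each of the four published Anscombe data sets, listed in their standard order, which is assumed to match the table numbering above. The `fit` helper is written here for illustration.

```python
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def fit(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    syy = sum((b - mean_y) ** 2 for b in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # square of the correlation coefficient
    return slope, intercept, r_squared


summary = {name: fit(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (slope, intercept, r_squared) in summary.items():
    print(f"{name}: y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

All four sets give nearly the same line and the same R-squared, even though their scatter plots look completely different.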

Summary

The linear regression analysis of the two data sets shows that the R value and the P value can be used to describe the quality of a fitted regression equation, but not for every data set. Whether a regression equation is applicable requires further analysis and judgment, such as inspecting the scatter plot, before the equation can be trusted.

Origin blog.csdn.net/qq_43279579/article/details/114950002