Medical Cases | Linear Regression

1. Case introduction

A physician plans to study the effect of total cholesterol and triglycerides on fasting blood glucose in diabetic patients. A researcher collected measurements of total cholesterol, triglycerides, and fasting blood glucose from 40 diabetic patients. The task is to carry out a statistical analysis of this research question.

2. Problem Analysis

This case studies the effect of several variables (total cholesterol and triglycerides) on another variable (fasting blood glucose), which can be examined with multiple linear regression analysis. The key assumptions of multiple linear regression analysis are as follows:

Assumption 1: Linearity - there is a linear relationship between the dependent variable and the independent variables.

Assumption 2: Independence - the observations are independent of each other, that is, there is no autocorrelation between the residuals.

Assumption 3: Normality - the residuals are approximately normally distributed.

Assumption 4: Homoscedasticity - the residuals have the same variance across values of the independent variables.

Assumption 5: No multicollinearity - there is no multicollinearity among the independent variables.

Violation of one or more of these assumptions can make the results of a linear regression analysis unreliable. Therefore, assumptions 1-5 need to be checked with software.

3. Testing the assumptions

(1) Testing Assumption 1: Linearity

Multiple linear regression analysis requires a linear relationship between the dependent variable Y and the independent variables X. For continuous independent variables, whether a linear relationship exists can be judged intuitively by drawing a scatter plot of the independent variable against the dependent variable. For categorical independent variables (such as education level), the linearity requirement does not apply.

Use SPSSAU to draw the scatter plots of Y (fasting blood glucose) against X1 (total cholesterol) and X2 (triglycerides): select [Scatter plot] in the [Visualization] module, drag the variables to the corresponding analysis boxes on the right, and click to start the analysis, as shown below:

The scatter plots output by SPSSAU are as follows:

① Scatter plot of total cholesterol and fasting blood glucose

Drawing a scatter plot with fasting blood glucose on the Y axis and total cholesterol on the X axis shows an approximately linear relationship between the two.

② Scatter plot of triglycerides and fasting blood glucose

Similarly, the scatter plot of triglycerides against fasting blood glucose shows an approximately linear relationship between the two.

In summary, the data in this case can be considered to satisfy assumption 1: there is a linear relationship between the dependent variable and the independent variables.
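For readers who want to reproduce this check outside of SPSSAU, the sketch below draws the same two scatter plots in Python. The file and column names are assumptions for illustration, not the actual names in the study data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to match the actual data.
df = pd.read_csv("diabetes.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, x_col, x_label in [
    (axes[0], "total_cholesterol", "Total cholesterol"),
    (axes[1], "triglycerides", "Triglycerides"),
]:
    ax.scatter(df[x_col], df["fasting_glucose"])
    ax.set_xlabel(x_label)
    ax.set_ylabel("Fasting blood glucose")
plt.tight_layout()
plt.show()
```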

(2) Testing Assumption 2: Independence

Linear regression analysis assumes that the observations are independent of each other, that is, that there is no autocorrelation between the residuals. Autocorrelation of the residuals can be checked with the Durbin-Watson (DW) test.

The SPSSAU linear regression output includes the DW test result, as shown below:

The DW statistic takes values between 0 and 4. A value close to 0 indicates positive autocorrelation, and a value close to 4 indicates negative autocorrelation. It is generally accepted that a DW value between 1.5 and 2.5 indicates no autocorrelation problem. The table above shows that the DW value in this case is 2.0437, so there is no autocorrelation and the data satisfy assumption 2: the observations are independent of each other.
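Outside of SPSSAU, the same statistic can be computed with statsmodels. This is a minimal sketch that continues the hypothetical data frame from the scatter plot example; the model fitted here is reused in the sketches below.

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Fit the multiple linear regression (column names are assumed, as above).
X = sm.add_constant(df[["total_cholesterol", "triglycerides"]])
y = df["fasting_glucose"]
model = sm.OLS(y, X).fit()

# DW values between roughly 1.5 and 2.5 suggest no residual autocorrelation.
print(f"Durbin-Watson: {durbin_watson(model.resid):.4f}")
```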

(3) Testing Assumption 3: Normality

Normality in the linear regression assumptions means that the residuals (that is, the random disturbance term) approximately follow a normal distribution. First obtain the residual values: when running the linear regression in SPSSAU, check "Save residual and predicted value". The operation is as follows:

There are many methods for testing normality, such as histograms, P-P plots / Q-Q plots, and statistical tests. In this case, the P-P plot is used, and the residual P-P plot is obtained as follows:

If the points in a P-P plot lie approximately along the diagonal, the data are close to a normal distribution. As the figure above shows, the residual P-P plot is approximately a diagonal straight line, so the residuals can be considered approximately normally distributed and assumption 3 is satisfied.
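A residual P-P plot can also be drawn directly with statsmodels, continuing the fitted model from the sketch above:

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# P-P plot of the residuals against a fitted normal distribution;
# points near the 45-degree line indicate approximate normality.
sm.ProbPlot(model.resid, fit=True).ppplot(line="45")
plt.title("Residual P-P plot")
plt.show()
```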

(4) Testing Assumption 4: Homoscedasticity

Homoscedasticity in multiple linear regression means that the residuals have the same variance across values of the independent variables, that is, the residuals have the same degree of dispersion everywhere. It can be checked by plotting the standardized residuals (ordinate) against the standardized predicted values (abscissa).

① Standardizing the data

First, standardize the saved residual and predicted values. In the SPSSAU [Data Processing] module, select [Generate Variables], select the residual and predicted values, choose standardization under "Dimension Processing", and click "Confirm Processing". The operation is as follows:

② Drawing the scatter plot

Draw a scatter plot with the standardized predicted value on the X axis and the standardized residual on the Y axis. The resulting plot is as follows:

If the assumption of homoscedasticity holds, the points in the scatter plot should be roughly evenly distributed, and their spread should not change as the standardized predicted value changes. The figure above shows that the points are fairly evenly distributed with no obvious trend, so assumption 4 can be considered satisfied: the residuals approximately have equal variance.
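The same diagnostic plot can be built from the fitted model in the sketches above: standardize the predicted values and residuals as z-scores and plot one against the other.

```python
import matplotlib.pyplot as plt

# z-score the predicted values and residuals of the fitted model.
pred_std = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
resid_std = (model.resid - model.resid.mean()) / model.resid.std()

# An even horizontal band with no funnel shape suggests homoscedasticity.
plt.scatter(pred_std, resid_std)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.show()
```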

(5) Testing Assumption 5: No multicollinearity

Multiple linear regression assumes that there is no multicollinearity among the independent variables. If multicollinearity is present, the linear relationships among the independent variables make the estimated regression coefficients unstable and inflate their standard errors, which reduces the accuracy of predictions. Multicollinearity also renders the t tests and p values meaningless, making it impossible to judge the effect of each independent variable on the dependent variable. In multiple linear regression analysis, the variance inflation factor (VIF) is usually used to detect multicollinearity.

The SPSSAU linear regression output includes the collinearity diagnostics, as follows:

SPSSAU outputs both the VIF and the tolerance (tolerance = 1/VIF; either one may be reported, and the VIF is more common). It is generally accepted that a VIF greater than 5 (or a tolerance less than 0.2) indicates a serious multicollinearity problem. The table above shows that all VIF values are less than 5, so there is no multicollinearity problem among the independent variables in this case, and assumption 5 is satisfied.
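VIF and tolerance can also be computed with statsmodels; a sketch using the design matrix X from the regression fitted above:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute VIF and tolerance for each independent variable (skip the intercept).
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.3f}, tolerance = {1 / vif:.3f}")
```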

If a collinearity problem does arise, it can be addressed by removing collinear variables, using stepwise regression or ridge regression, or increasing the sample size.

In summary, the data in this case meet the assumptions of using multiple linear regression analysis and can be analyzed.

4. Linear regression analysis

The linear regression analysis results of this case are as follows:

Testing a multiple linear regression model has two parts: ① a significance test of the overall relationship between the independent variables and the dependent variable (F test); ② a significance test of each independent variable's effect on the dependent variable (t test). The two tests serve different purposes.
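Both tests are reported by the model fitted in the sketches above; for example:

```python
# Overall model test (F) and per-coefficient tests (t) from the fitted model.
print(f"F = {model.fvalue:.4f}, p = {model.f_pvalue:.4f}")  # F test of the whole model
print(model.tvalues)  # t statistic of each coefficient
print(model.pvalues)  # corresponding p values
```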

(1) Overall model test

\begin{aligned}H_0&:\beta_1=\beta_2=\cdots=\beta_m=0\\H_1&:\beta_1,\beta_2,\cdots,\beta_m\text{ are not all }0\end{aligned}

Test statistic:

F=\frac{MS_\text{regression}}{MS_\text{residual}}

When H0 is true, the statistic F follows an F distribution with degrees of freedom m and n-m-1, where n is the sample size and m is the number of independent variables in the regression model. If all of the regression coefficients were 0, Y would have no relationship with the independent variables and there would be no point in fitting a regression equation. Therefore, when the test rejects H0, the regression model is said to be statistically significant.

The F-test output of the SPSSAU multiple linear regression analysis is as follows:

From the regression model results in the table above, F = 9.2572 with p = 0.0005 < 0.05, so the null hypothesis H0 is rejected: the regression model is statistically significant.
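As a cross-check, the reported p value can be recomputed from the F statistic and its degrees of freedom; with n = 40 patients and m = 2 independent variables, the degrees of freedom are 2 and 37:

```python
from scipy.stats import f

# Upper-tail probability of F(2, 37) at the observed statistic.
p = f.sf(9.2572, dfn=2, dfd=37)
print(f"p = {p:.5f}")  # ~0.00055, consistent with the reported p = 0.0005
```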

(2) Tests of individual regression coefficients

The significance test of the regression coefficients examines the effect of each independent variable on the dependent variable and is carried out with a t test. SPSSAU outputs the t-test results for each independent variable as follows:

The table above shows that the p values of the t tests for total cholesterol and triglycerides are both less than 0.05, indicating that both variables have a significant effect on fasting blood glucose.

(3) Comparing effect sizes

The sizes of the independent variables' effects on the dependent variable are compared using standardized regression coefficients. The larger the absolute value of a standardized regression coefficient, the greater the influence of that independent variable on the dependent variable.

A standardized regression coefficient is the coefficient obtained after both the independent variables and the dependent variable have been standardized. Standardization removes differences in units and orders of magnitude, making different variables comparable, so standardized regression coefficients can be used to compare the effects of different independent variables on the dependent variable.

The analysis results show that the standardized regression coefficients of total cholesterol and triglycerides are 0.4788 and 0.2944, respectively, indicating that both have a significant positive effect on fasting blood glucose, with total cholesterol having the larger effect.
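Standardized coefficients can be reproduced by z-scoring every variable and refitting; a sketch with the hypothetical variables above:

```python
import statsmodels.api as sm

# Standardize (z-score) the variables, then refit: the slopes of this
# model are the standardized regression coefficients.
cols = ["total_cholesterol", "triglycerides"]
Xz = (df[cols] - df[cols].mean()) / df[cols].std()
yz = (y - y.mean()) / y.std()
print(sm.OLS(yz, sm.add_constant(Xz)).fit().params)  # intercept is ~0
```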

(4) Model formula

From the analysis results, the model formula is: fasting blood glucose = 4.985 + 0.212 * total cholesterol + 0.351 * triglycerides. The R-square of the model is 0.334, meaning that total cholesterol and triglycerides explain 33.4% of the variation in fasting blood glucose.

Special reminder: the unstandardized regression coefficients are used to construct the regression model. They are the original coefficients of the independent variables in the equation and reflect the effect of a one-unit change in an independent variable on the dependent variable, holding the other independent variables constant. Only the regression equation built from the unstandardized coefficients can be used to predict the dependent variable.
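As an illustration, a prediction using the unstandardized equation above (the input values are hypothetical and must be in the same units as the study data):

```python
def predict_fasting_glucose(total_cholesterol: float, triglycerides: float) -> float:
    """Prediction from the unstandardized equation reported above."""
    return 4.985 + 0.212 * total_cholesterol + 0.351 * triglycerides

# Hypothetical patient with total cholesterol 5.0 and triglycerides 1.5:
print(predict_fasting_glucose(5.0, 1.5))  # 4.985 + 1.06 + 0.5265 = 6.5715
```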

More analysis results can be viewed in SPSSAU, so I won't repeat them here.

5. Conclusion

This case uses multiple linear regression analysis to study the effect of total cholesterol and triglycerides on fasting blood glucose. The study finds that both total cholesterol and triglycerides have a significant positive effect on fasting blood glucose, with total cholesterol having the larger effect.

6. Knowledge tips

(1) What is an appropriate R-square value?

The R-square value measures the model's explanatory power; for example, 0.3 means the independent variables X explain 30% of the variation in the dependent variable Y. The value lies between 0 and 1, and larger is better, but there is no fixed standard in practice: in some fields 0.1 or even 0.05 is acceptable, while in others values above 0.8 are expected. In general, it is enough to report this value without worrying too much about its size, because most of the time the main interest is whether X has an effect on Y.

(2) What if a regression coefficient is extremely small or extremely large?

If the data are measured in very large units, whether for the independent variable X or the dependent variable Y, the regression coefficients in the results may come out extremely small or extremely large. This is normal, but it is common to log-transform the data uniformly to reduce the 'extremely large or small coefficient' problem caused by the units.
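A uniform log transform is a one-line operation; a sketch with the hypothetical data frame above (all values must be strictly positive):

```python
import numpy as np

# Log-transform the variables so unit-driven scale differences shrink;
# in a log-log model, coefficients read approximately as elasticities.
df_log = np.log(df[["total_cholesterol", "triglycerides", "fasting_glucose"]])
```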
