In the era of big data, data analysis is undoubtedly one of the most popular technologies. With the development and growth of China's medical and health services, medical workers have a growing demand for data analysis methods. Medical data analysis has become a hot field: an interdisciplinary subject spanning medicine, statistics, and computer science. Data analysis is inseparable from software, and R is free, open-source software that provides advanced statistical computing and visualization capabilities.
The predecessor of R is the S language, an interpreted language dedicated to statistical analysis that John M. Chambers and his colleagues at Bell Labs began developing in 1976. S later evolved into the commercial product S-PLUS, which was widely used by statisticians around the world.
In 1992, Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand developed a new language based on S for teaching purposes and named it R after the initial letter of their first names. In 1995, R was released as open-source software, and the two authors also recruited other developers to contribute to its maintenance. By 1997, an 11-member R core team had been established; since 2011 the team has been maintained at around 20 people.
In its early days, R was used mainly by statisticians in academia, but it gradually spread to scholars in many other fields. Especially with the explosion of big data, more and more people with computer science and engineering backgrounds have joined the community to improve R's computing engine, performance, and packages, which has greatly promoted the development of the language.
Why use R to analyze data
Medical data analysis combines statistics with medical expertise, and both statistical computation and data visualization depend on software. There are many popular statistical and graphing packages on the market, such as SAS, SPSS, and Stata. Why choose R? R has the following advantages.
(1) Most statistical software requires payment, while R is released under the GNU General Public License and can be used and distributed free of charge.
(2) R runs on multiple platforms, such as Windows, macOS, and various versions of Linux and UNIX; some users even run R in browsers and on mobile operating systems.
(3) Programming in R is simple: you only need to be familiar with the parameters and usage of the relevant functions, without understanding the details of their implementation.
(4) R is small but powerful and is known as the "Swiss Army Knife" of data analysis. The base installation is less than 100 MB, and most functionality lives in extension packages, which cover cutting-edge data analysis methods across many industries.
(5) R enables reproducible analysis. Users can escape repetitive analysis work and share the analysis process with peers. With R and its extension packages, users can mix R code and markup text in a single document and automatically generate analysis reports.
But R also has some inherent shortcomings, such as a relatively steep learning curve and uneven quality of third-party packages. While studying this book, readers are advised to "learn by doing": enter the code in the book, observe the output, and try changing the code (for example, the parameters of a function) to master its usage. In addition, readers should prefer packages from the official repository or packages recommended by experienced users.
Recommended books:
Medical Data Analysis in Practice with the R Language
Written to be accessible to readers without a statistics background, this book emphasizes practice and application, focusing on the ideas and methods of data analysis and on the essence, characteristics, application conditions, and interpretation of results, while minimizing the derivation and calculation of statistical methods. Chapters 1 and 2 introduce the basics of the R language; Chapter 3 introduces data preprocessing, covering basic data handling and some advanced data manipulation skills; Chapter 4 introduces data visualization with R; Chapter 5 introduces basic statistical analysis, including descriptive statistics and various single-factor analysis methods; Chapters 6 to 8 use real data to introduce the three regression models most commonly used in medical research, namely linear regression, logistic regression, and Poisson regression; Chapter 9 introduces survival analysis; Chapters 10 to 12 introduce the most commonly used multivariate methods, namely cluster analysis, discriminant analysis, principal component analysis, and factor analysis; Chapter 13 introduces the evaluation indicators and calculation methods of clinical diagnostic tests; Chapter 14 introduces the meta-analysis methods commonly used in evidence-based medicine.
This book is suitable for undergraduate or graduate students in clinical medicine, public health, and other medical-related majors. It can also serve as a data analysis reference for students and researchers in other fields. Readers can work through the book chapter by chapter or selectively consult the relevant chapters for the problems they encounter in practice. I hope this book will give readers a deeper understanding of data analysis and further promote the popularization of the R language in China.
Table of Contents
- Chapter 1 Introduction to R Language
- Chapter 2 Creating a Data Set
- Chapter 3 Data Frame Operation
- Chapter 4 Data Visualization
- Chapter 5 Basic Statistical Analysis
- Chapter 6 Linear Regression Analysis
- Chapter 7 Logistic Regression Analysis
- Chapter 8 Poisson Regression Analysis
- Chapter 9 Survival Analysis
- Chapter 10 Cluster Analysis
- Chapter 11 Discriminant Analysis
- Chapter 12 Principal Component Analysis and Factor Analysis
- Chapter 13 Evaluation of Clinical Diagnostic Tests
- Chapter 14 Meta Analysis
Sample chapter: Chapter 6 Linear Regression Analysis
In medical research and practice, it is often necessary to explore the relationship between an outcome variable and other variables. For example, the blood glucose level of diabetic patients may be affected by various biochemical indicators such as insulin, serum total cholesterol, and triglycerides. The linear regression model introduced in this chapter contains a continuous outcome variable (or dependent variable) and one or more explanatory variables (or independent variables). When there is only one explanatory variable, the model is called simple linear regression; when there is more than one explanatory variable, it is called multiple linear regression.
6.1 Simple linear regression
The simple linear regression model assumes that the dependent variable Y is affected by only one independent variable X, and that there is an approximately linear relationship between them. The model can be expressed as

Y = α + βX + ε

Here the dependent variable Y is decomposed into two parts. One part is determined by X and varies linearly with X as the linear function α + βX, where α is called the constant term (intercept) and β is called the regression coefficient (slope). The other part is the influence of other random factors, regarded as random error and denoted ε, which is assumed to follow a normal distribution with mean 0 and variance σ².
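Since the UCR data set is not distributed with this excerpt, the model can be illustrated with simulated data: we generate Y from known coefficients (the values α = 1.5 and β = 0.15 below are arbitrary illustrative choices, not the book's estimates) and check that lm() recovers them.

```r
# Sketch of the model Y = alpha + beta*X + epsilon on simulated data.
# alpha = 1.5 and beta = 0.15 are arbitrary illustrative values.
set.seed(123)
x <- runif(100, min = 6, max = 16)          # 100 "ages" between 6 and 16
y <- 1.5 + 0.15 * x + rnorm(100, sd = 0.2)  # add N(0, 0.2^2) random error
fit <- lm(y ~ x)
coef(fit)   # estimates should be close to the true values 1.5 and 0.15
```

As the sample size grows, the least-squares estimates converge to the true α and β.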
These parameters are usually estimated by the method of least squares. The following example illustrates the establishment, solution, diagnosis, and interpretation of the model. The data in this chapter come from a survey by an endemic-disease research institution on the age and urine creatinine content of children with Kashin-Beck disease. The data were entered and saved as "UCR.rdata" in the exercises of Chapter 3; now use the function load() to load them.
> load("UCR.rdata")
> library(epiDisplay)
> des(UCR)
UCR in Kaschin-Beck disease children
No. of observations = 18
Variable Class Description
1 age integer Age in years
2 ucr numeric Urine creatinine (mmol)
3 group factor Type of children
> summary(UCR)
age ucr group
Min. : 6.00 Min. :2.210 0: 8
1st Qu.: 8.25 1st Qu.:2.672 1:10
Median :10.00 Median :3.010
Mean :10.50 Mean :3.016
3rd Qu.:12.00 3rd Qu.:3.315
Max. :16.00 Max. :3.980
The data frame UCR contains 3 variables and 18 records. There are no missing values, so the data can be used directly for analysis.
6.1.1 Fitting a simple linear regression model
The first step in any data analysis is to explore the data. A scatter plot is a very useful tool for judging whether there is a linear relationship between two variables. Let us first explore the relationship between age and urine creatinine content with a scatter plot.
> plot(ucr ~ age, data = UCR,
+ xlab = "Age in years", ylab = "Urine creatinine (mmol)")
It can be seen from Figure 6-1 that urine creatinine content increases with age and shows a linear trend.
Figure 6-1 Scatter diagram of age and urine creatinine content
In order to fit the regression line, use the function lm() to build a linear regression model.
> mod <- lm(ucr ~ age, data = UCR)
> mod
Call:
lm(formula = ucr ~ age, data = UCR)

Coefficients:
(Intercept)          age
     1.4549       0.1487
Printing the model object mod directly displays only very limited information. To learn more about the model, we can use the function attributes() to view the attributes of the model object.
> attributes(mod)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
$class
[1] "lm"
The object mod contains a total of 12 components, and we can extract any one of them individually. For example, the following command retrieves the fitted values of the model.
> mod$fitted.values
1 2 3 4 5 6 7
3.387824 3.090454 2.793083 2.347028 2.644398 2.941769 3.239139
8 9 10 11 12 13 14
2.495713 2.941769 2.793083 3.090454 3.239139 3.685194 3.833879
15 16 17 18
2.644398 2.495713 2.941769 3.685194
6.1.2 Interpretation of model output results
Use the function summary() to summarize most of the attributes of the model.
> summary(mod)
Call:
lm(formula = ucr ~ age, data = UCR)
Residuals:
Min 1Q Median 3Q Max
-0.43440 -0.13828 -0.01111 0.14738 0.41823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.45492 0.20712 7.025 2.87e-06 ***
age 0.14869 0.01904 7.807 7.60e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2289 on 16 degrees of freedom
Multiple R-squared: 0.7921, Adjusted R-squared: 0.7791
F-statistic: 60.95 on 1 and 16 DF, p-value: 7.597e-07
The first part of the output above shows the formula used in the model call. The second part gives the distribution of the residuals: the median is close to 0, the absolute values of the maximum (0.41823) and minimum (−0.43440) are very close, and the absolute values of the lower quartile (−0.13828) and upper quartile (0.14738) are also very close, which shows that the residual distribution is roughly symmetric. The third part gives the estimated regression coefficients, including the constant term (intercept) and the coefficient (slope) for the effect of age on urine creatinine content. The constant term 1.45492 represents the urine creatinine content at age 0, which is obviously not meaningful in itself; the corresponding p value of 2.87 × 10⁻⁶ only indicates that the constant differs significantly from 0. The coefficient of the variable age is 0.14869, meaning that urine creatinine content increases by 0.14869 mmol on average for each additional year of age. Although 0.14869 is small, it differs significantly from 0 (p = 7.60 × 10⁻⁷).
The value of the coefficient of determination (R²) is 0.7921, which means that 79.2% of the variation in the data can be explained by the model; the adjusted coefficient of determination is 0.7791. The calculation of both is given in the analysis of variance below. The last part reports the residual standard error and uses the F statistic to test the effect of the variable age. The p value of this test (7.597 × 10⁻⁷) equals the p value of the t test on the regression coefficient above. The F test appears more often in the model's analysis of variance table.
> summary(aov(mod))
Df Sum Sq Mean Sq F value Pr(>F)
age 1 3.194 3.194 60.95 7.6e-07 ***
Residuals 16 0.839 0.052
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The analysis of variance table above decomposes the degrees of freedom, sums of squares, and mean squares of the outcome variable (ucr) by source of variation (here there are only two sources: age and residual). "Square" here refers to the squared difference between a value and the mean. The total sum of squares of the variation in urine creatinine content (ucr) is therefore:
> SST <- sum((UCR$ucr - mean(UCR$ucr))^2); SST
[1] 4.033028
The residual sum of squares is:
> SSR <- sum(residuals(mod)^2); SSR
[1] 0.8385279
The regression sum of squares, i.e., the sum of squared differences between the fitted values and the overall mean, is:
> SSW <- sum((fitted(mod) - mean(UCR$ucr))^2); SSW
[1] 3.1945
The regression and residual sums of squares add up to the total sum of squares. The coefficient of determination is the ratio of the regression sum of squares to the total sum of squares.
> SSW/SST
[1] 0.7920848
The coefficient of determination can also be interpreted as the proportion of the total variation in the dependent variable explained by the independent variable; in this case, age explains 79% of the total variation in urine creatinine. The adjusted coefficient of determination (R²adj) adds a "penalty" for the number of variables and is mainly meaningful in multiple linear regression. It is calculated as

R²adj = 1 − (1 − R²) × (n − 1) / (n − k − 1)

where R² is the coefficient of determination, n is the sample size, and k is the number of variables. Here the sample size is 18 and the number of variables is 1, so:
> Radj <- 1 - (1 - SSW / SST) * ((18 - 1) / (18 - 2)); Radj
[1] 0.7790901
This is the adjusted coefficient of determination displayed by the command summary(mod).
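The sum-of-squares decomposition and the adjusted R² formula used above hold for any least-squares fit. Since the UCR data are not shipped with this excerpt, here is a quick check on simulated data:

```r
# Verify SST = SS(regression) + SS(residual) and the adjusted R^2 formula
# on simulated data (n = 30, one predictor, so k = 1).
set.seed(42)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
m <- lm(y ~ x)
SST   <- sum((y - mean(y))^2)               # total sum of squares
SSreg <- sum((fitted(m) - mean(y))^2)       # regression sum of squares
SSres <- sum(residuals(m)^2)                # residual sum of squares
all.equal(SST, SSreg + SSres)               # TRUE: the decomposition holds
r2     <- SSreg / SST
r2.adj <- 1 - (1 - r2) * (30 - 1) / (30 - 1 - 1)
all.equal(r2.adj, summary(m)$adj.r.squared) # TRUE: matches summary()
```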
Divide the mean square of age (with 1 degree of freedom) by the mean square of the residual to get the value of the F statistic.
> resid.msq <- sum(residuals(mod)^2)/mod$df.residual
> Fvalue <- SSW/resid.msq; Fvalue
[1] 60.95444
Using the F value and its two corresponding degrees of freedom (1 for the variable age and 16 for the residual), the p value for the test of the age effect can be calculated.
> pf(Fvalue, df1 = 1, df2 = 16, lower.tail = FALSE)
[1] 7.597353e-07
The function pf() calculates the p value from a given F value and its two degrees of freedom. The result agrees with the output of summary(aov(mod)). The parameter lower.tail is set to FALSE to obtain the area under the curve to the right of the F value.
Regression analysis and analysis of variance gave the same conclusion, that is, there is a significant linear relationship between age and urine creatinine content.
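In simple linear regression, the regression t test and the analysis-of-variance F test must agree because F = t² when there is a single predictor; this is why the two p values above are identical. A quick check on simulated data:

```r
# In simple linear regression the overall F statistic equals the square of
# the slope's t statistic, so the two tests are equivalent.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
s <- summary(lm(y ~ x))
t_val <- s$coefficients["x", "t value"]
f_val <- unname(s$fstatistic["value"])
all.equal(t_val^2, f_val)   # TRUE
```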
6.1.3 Regression diagnosis
Now, we can add a regression line to Figure 6-1.
> abline(mod)
The intercept of the regression line in Figure 6-2 is about 1.45 and the slope is 0.15. The fitted value (expected value) refers to the corresponding urine creatinine content on the regression line for a given age value. The residual value is the difference between the observed value and the expected value. The residual value can be drawn with the following command:
> points(UCR$age, fitted(mod), pch=18, col = "blue")
> segments(UCR$age, UCR$ucr, UCR$age, fitted(mod), col = "green")
Figure 6-2 Scatter plot and regression line of age and urine creatinine content
You can also use the function residuals() to extract the residual of each sample point from the model:
> res <- residuals(mod); res
1 2 3 4 5 6
0.15217606 -0.08045368 0.29691648 0.13297194 -0.08439837 0.41823135
7 8 9 10 11 12
-0.05913872 0.15428690 0.06823144 0.03691649 -0.17045359 -0.14913887
13 14 15 16 17 18
0.29480588 0.05612084 -0.43439827 -0.10571309 -0.20176854 -0.32519425
Some of the residual values are positive and some are negative, indicating that some points in the scatter plot are above the fitted line and some are below the line. The 15th record has the largest residual absolute value, and its corresponding point is the farthest from the fitted straight line. We can also check the sum of residuals and the sum of squares:
> sum(res)
[1] -1.387779e-17
> sum(res^2)
[1] 0.8385279
The sum of the residuals is almost equal to 0; the sum of the squares of the residuals is the same as the result of the previous calculation. If the model fits well, the distribution of the residuals should be normal. A common way to test the normality of residuals is to look at their histograms (as shown in Figure 6-3).
> hist(res)
It can be seen from Figure 6-3 that the residuals are roughly normally distributed. For such a small sample, however, a better way to check normality is a scatter plot of the expected standard normal scores against the residuals (Figure 6-4), called a normal Q-Q plot. If the points in a normal Q-Q plot cluster along a straight line, the residuals are approximately normally distributed.
> qqnorm(res)
> qqline(res)
Figure 6-3 Histogram of residual distribution
Figure 6-4 The normal QQ plot of the residual distribution
As can be seen from Figure 6-4, the scattered points are basically on a straight line. Quantitatively, the Shapiro-Wilk test can be used.
> shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.98546, p-value = 0.9888
The null hypothesis of the Shapiro-Wilk test is that the data follow a normal distribution. The p value of 0.9888 gives no evidence against normality of the residuals.
Finally, we can make a scatter plot between the residuals and the fitted values to see the distribution pattern of the residuals. The plotting results are shown in Figure 6-5.
> plot(fitted(mod), res, xlab = "Fitted values", type = "n")
> text(fitted(mod), res, labels = rownames(UCR))
> abline(h = 0, col = "blue")
Figure 6-5 Scatter plot of residuals and fitted values
In the above commands, we first set the parameter type to "n" in plot() (so that no points are displayed), then use text() to label each point with its record number, and finally add a blue horizontal line as a reference. Figure 6-5 shows no obvious pattern, so the residuals can be considered independent of the fitted (expected) values. In summary, we can conclude that the residuals are random and normally distributed.
In fact, the residual diagnosis diagram of the model can also be obtained by the following command:
> par(mfrow = c(2, 2))
> plot(mod)
> par(mfrow = c(1, 1))
Because plot(mod) produces 4 graphs, we first use par() to divide the canvas into two rows and two columns, then restore the default single-panel layout after drawing. Figure 6-6 shows not only the two residual diagnostic plots drawn before but also the scale-location plot (lower left) and the residuals-vs-leverage plot (lower right). The scale-location plot is mainly used to check the variance of the residuals, and the residuals-vs-leverage plot is mainly used to identify outliers, high-leverage points, and influential points. Figure 6-6 suggests that the model fits well and satisfies the assumptions of the linear model.
From the above analysis we conclude that urine creatinine content is related to the child's age: with each additional year of age, urine creatinine increases by 0.14869 mmol on average. Apart from age, other influences on urine creatinine content are attributed to random error or to factors not considered here.
Figure 6-6 Residual error diagnosis diagram
6.2 Hierarchical linear regression
A data set usually contains multiple variables collected in a study, and it is often meaningful to explore the relationship between two variables within the categories of a third, categorical variable. This is essentially a problem of comparing regression equations: first check whether the two (or more) regression lines are parallel; if they are, then check whether their intercepts are equal. In the analysis of Section 6.1 we treated all children as a whole, without considering disease status. In fact, the data set also contains a factor variable group that distinguishes normal children from children with Kashin-Beck disease.
In Section 6.1 we established the simple linear regression model mod and fitted a regression line; the results indicate that the child's age has a significant effect on urine creatinine content. Next we explore whether this effect differs significantly between the two types of children.

If the effect of age on urine creatinine content is assumed to be the same for normal and diseased children, two parallel regression lines can be fitted by adding the variable group to the model.
> mod1 <- lm(ucr ~ age + group, data = UCR)
> summary(mod1)
Call:
lm(formula = ucr ~ age + group, data = UCR)
Residuals:
Min 1Q Median 3Q Max
-0.29885 -0.15905 0.01675 0.14186 0.34023
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.44893 0.18427 7.863 1.06e-06 ***
age 0.16156 0.01785 9.049 1.83e-07 ***
group1 -0.23256 0.10181 -2.284 0.0373 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2037 on 15 degrees of freedom
Multiple R-squared: 0.8457, Adjusted R-squared: 0.8252
F-statistic: 41.12 on 2 and 15 DF, p-value: 8.162e-07
On average, urine creatinine content increases by 0.16156 mmol per year of age (p < 0.001); at the same age, the urine creatinine content of diseased children is on average 0.23256 mmol lower than that of normal children (p = 0.0373).
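Whether adding group improves the model can also be assessed with a nested-model F test via anova(). Since the UCR data are not included in this excerpt, the sketch below uses simulated data with a built-in group effect (the coefficients 1.5, 0.15, and −0.25 are illustrative only):

```r
# Compare nested models with and without the group term.
set.seed(7)
age   <- rep(6:14, times = 2)
group <- factor(rep(0:1, each = 9))
ucr   <- 1.5 + 0.15 * age - 0.25 * (group == "1") + rnorm(18, sd = 0.1)
m0 <- lm(ucr ~ age)           # age only
m1 <- lm(ucr ~ age + group)   # parallel lines, one per group
anova(m0, m1)                 # the F-test row assesses the group effect
```

Because only one parameter is added, this F test is equivalent to the t test on the group1 coefficient reported by summary(m1).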
Now draw the two parallel regression lines. First make a scatter plot, using blue hollow circles for normal (unaffected) children and red solid circles for diseased children.
> col <- ifelse(UCR$group == 0, "blue", "red")
> pch <- ifelse(UCR$group == 0, 1, 19)
> plot(ucr ~ age, data = UCR,
+ xlab="Age in years", ylab = "Urine creatinine (mmol)",
+ col = col, pch = pch)
> legend("topleft",
+ legend = c("Normal children", "Diseased children"),
+ col = c("blue", "red"), pch = c(1, 19))
Then, use the function abline() to draw a regression line for each group. The parameter a in the function abline() is the intercept, and the parameter b is the slope, both of which can be derived from the coefficients of the model. The function coef() can be used to extract the coefficients contained in the model mod1:
> coef(mod1)
(Intercept) age group1
1.4489268 0.1615603 -0.2325586
For both groups, the slope is fixed:
> b <- coef(mod1)[2]; b
age
0.1615603
For the unaffected group, the variable group takes the value 0, and its intercept is the first term of the coefficient, denoted as a0:
> a0 <- coef(mod1)[1]; a0
(Intercept)
1.448927
For the disease group, the variable group takes the value 1, and its intercept is the sum of the first and third terms of the coefficient, denoted as a1, and its value is:
> a1 <- coef(mod1)[1] + coef(mod1)[3]; a1
(Intercept)
1.216368
Consistent with the scatter plot that has been drawn above, we use blue to represent the non-diseased group and red to represent the diseased group, and draw two regression lines respectively, as shown in Figure 6-7.
> abline(a = a0, b = b, col = "blue")
> abline(a = a1, b = b, col = "red")
Figure 6-7 Comparison of the two groups of regression line intercepts
In the above model we assumed that the effect of age on urine creatinine content is the same in both groups, but this assumption needs to be tested statistically. We can fit regression lines with different slopes and then test whether the difference between the two slopes is significant. This requires a model containing the interaction between the variables age and group.
> mod2 <- lm(ucr ~ age + group + age:group, data = UCR)
In the formula of the model, "age + group + age:group" can be abbreviated as "age*group". There are 4 coefficients in model mod2:
> coef(mod2)
(Intercept) age group1 age:group1
1.66166670 0.13916666 -0.56593453 0.03306943
Therefore, mod2 can be simplified as: ucr = 1.662 + 0.139 × age – 0.566 × group + 0.033 × age × group.
The constant term 1.662 is the intercept of the fitted line for the unaffected group (obtained by setting both age and group to 0). For the intercept of the diseased group, set age to 0 and group to 1: the 2nd and 4th terms on the right-hand side vanish (because age is 0), but the third term remains (because group is 1). Its coefficient is negative, indicating that the intercept of the diseased group is smaller than that of the unaffected group. Denoting the intercepts of the unaffected and diseased groups by a0 and a1, respectively:
> a0 <- coef(mod2)[1]; a0
(Intercept)
1.661667
> a1 <- coef(mod2)[1] + coef(mod2)[3]; a1
(Intercept)
1.095732
The slope of the unaffected group is the second coefficient, because group is 0; the slope of the diseased group is the sum of the second and fourth coefficients, because group is 1. Denoting the slopes of the unaffected and diseased groups by b0 and b1, respectively:
> b0 <- coef(mod2)[2]; b0
age
0.1391667
> b1 <- coef(mod2)[2] + coef(mod2)[4]; b1
age
0.1722361
We can use these coefficients to draw two regression lines, as shown in Figure 6-8. The process is similar to the previous one. The code is as follows:
> col <- ifelse(UCR$group == 0, "blue", "red")
> pch <- ifelse(UCR$group == 0, 1, 19)
> plot(ucr ~ age, data = UCR,
+ xlab="Age in years", ylab = "Urine creatinine (mmol)",
+ pch = pch, col = col)
> legend("topleft",
+ legend = c("Normal children", "Diseased children"),
+ col = c("blue", "red"), pch = c(1, 19))
> abline(a = a0, b = b0, col = "blue")
> abline(a = a1, b = b1, col = "red")
Figure 6-8 Comparison of the slope of the regression line between the two groups
Figure 6-8 shows that the two regression lines are not parallel (the slopes are unequal), suggesting that the effect of age differs between the two groups. This difference may be due to chance (random error) or to the disease. Next, use the function summary() to check whether each regression coefficient is statistically significant.
> summary(mod2)
Call:
lm(formula = ucr ~ age * group, data = UCR)
Residuals:
Min 1Q Median 3Q Max
-0.31927 -0.13327 -0.00125 0.15403 0.30667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.66167 0.30982 5.363 0.000100 ***
age 0.13917 0.03170 4.390 0.000617 ***
group1 -0.56593 0.40174 -1.409 0.180743
age:group1 0.03307 0.03853 0.858 0.405152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2055 on 14 degrees of freedom
Multiple R-squared: 0.8535, Adjusted R-squared: 0.8221
F-statistic: 27.18 on 3 and 14 DF, p-value: 4.257e-06
The coefficient of the interaction term "age:group1" is not statistically significant (p = 0.405152); that is, the difference between the two slopes can be attributed to chance. There is therefore no need to include the interaction term in the model, and mod1 is the final model.
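The same conclusion can be reached by comparing the two nested models directly: when a single parameter is added, the anova() F test is equivalent to the t test on the interaction coefficient. A sketch on simulated data (the simulated model below has parallel slopes, i.e. no true interaction; all coefficient values are illustrative):

```r
# Test the interaction by comparing nested models; with a single added
# parameter, the anova() p value equals the t-test p value for age:group1.
set.seed(11)
age   <- rep(6:14, times = 2)
group <- factor(rep(0:1, each = 9))
ucr   <- 1.5 + 0.15 * age - 0.25 * (group == "1") + rnorm(18, sd = 0.1)
m1 <- lm(ucr ~ age + group)   # parallel slopes
m2 <- lm(ucr ~ age * group)   # group-specific slopes
p_anova <- anova(m1, m2)[2, "Pr(>F)"]
p_t     <- summary(m2)$coefficients["age:group1", "Pr(>|t|)"]
all.equal(p_anova, p_t)   # TRUE: the two tests agree
```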
6.3 Multiple linear regression
Relationships in the real world are intricate: a change in one variable is often related to changes in many others. For example, a person's heart rate is related to age, weight, and vital capacity. In linear regression, when there is more than one explanatory variable, the model is called a multiple linear regression model.
The multiple linear regression model assumes that the dependent variable Y is affected by k independent variables X1, X2, …, Xk, and that the relationship between Y and these variables is linear. The model can be expressed as

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where β0 is called the constant term, β1, β2, …, βk are called partial regression coefficients, and the random error ε follows a normal distribution with mean 0 and variance σ². These parameters are usually estimated from the sample observations by the method of least squares.
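A minimal sketch of the model: simulate Y from known coefficients with two predictors and check that lm() recovers them (the values 1, 2, and −3 are arbitrary illustrative choices, not from any data set in the book).

```r
# Multiple linear regression: Y = b0 + b1*X1 + b2*X2 + epsilon.
set.seed(5)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(200, sd = 0.5)
coef(lm(y ~ x1 + x2))   # estimates close to the true values 1, 2, -3
```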
6.3.1 Fitting multiple linear regression models
The following uses the data set cystfibr in the ISwR package as an example to introduce multiple linear regression. This data set comes from a study of lung function in patients with cystic fibrosis. First install and load the ISwR package, and then view the variables in the data set cystfibr.
> library(ISwR)
> data(cystfibr)
> ?cystfibr
> str(cystfibr)
'data.frame': 25 obs. of 10 variables:
$ age : int 7 7 8 8 8 9 11 12 12 13 ...
$ sex : int 0 1 0 1 0 0 1 1 0 1 ...
$ height: int 109 112 124 125 127 130 139 150 146 155 ...
$ weight: num 13.1 12.9 14.1 16.2 21.5 17.5 30.7 28.4 25.1 31.5 ...
$ bmp : int 68 65 64 67 93 68 89 69 67 68 ...
$ fev1 : int 32 19 22 41 52 44 28 18 24 23 ...
$ rv : int 258 449 441 234 202 308 305 369 312 413 ...
$ frc : int 183 245 268 146 131 155 179 198 194 225 ...
$ tlc : int 137 134 147 124 104 118 119 103 128 136 ...
$ pemax : int 95 85 100 85 95 80 65 110 70 95 ...
According to the description in the help file of the data set cystfibr, the variable sex represents gender, where 0 represents male and 1 represents female, which needs to be converted into a factor here.
> cystfibr$sex <- factor(cystfibr$sex, labels = c("male", "female"))
The last five variables in the data set are fev1 (forced expiratory volume in the first second), rv (residual volume), frc (functional residual volume), tlc (total lung volume), and pemax (maximum expiratory pressure). These variables are indicators for measuring lung function, and we can use the function cor() to see the correlation between them.
> cor(cystfibr[,6:10])
fev1 rv frc tlc pemax
fev1 1.0000000 -0.6658557 -0.6651149 -0.4429945 0.4533757
rv -0.6658557 1.0000000 0.9106029 0.5891391 -0.3155501
frc -0.6651149 0.9106029 1.0000000 0.7043999 -0.4172078
tlc -0.4429945 0.5891391 0.7043999 1.0000000 -0.1816157
pemax 0.4533757 -0.3155501 -0.4172078 -0.1816157 1.0000000
The correlation coefficient matrix shows fairly strong correlations among these 5 variables. To simplify the problem, the variable fev1 is chosen as the outcome variable for a multiple linear regression model.
> fit1 <- lm(fev1 ~ age + sex + height + weight + bmp, data = cystfibr)
> summary(fit1)
Call:
lm(formula = fev1 ~ age + sex + height + weight + bmp, data = cystfibr)
Residuals:
Min 1Q Median 3Q Max
-10.711 -5.635 -3.155 7.309 16.979
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.28794 43.46384 0.444 0.6622
age -0.16431 1.26446 -0.130 0.8980
sexfemale -10.04122 3.60773 -2.783 0.0118 *
height -0.07645 0.27694 -0.276 0.7855
weight 0.20274 0.53153 0.381 0.7071
bmp 0.33373 0.32219 1.036 0.3133
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.788 on 19 degrees of freedom
Multiple R-squared: 0.5123, Adjusted R-squared: 0.384
F-statistic: 3.992 on 5 and 19 DF, p-value: 0.01209
Like simple linear regression, multiple linear regression uses least squares to fit the model, and the output is similar, so it is not repeated here. Note the "female" appended to the variable sex: R handles categorical variables (factors) via dummy variables, with the level not displayed (here, "male") serving as the reference group. The results show that only the t test for the variable sex (gender) is statistically significant: the fev1 value of female patients is on average about 10 units lower than that of male patients, while the other variables are not significant. The overall F test of the model is significant (p = 0.01209). The coefficient of determination is 0.5123, indicating that 51% of the variation in fev1 can be explained by the five independent variables.
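The dummy-variable coding mentioned above can be illustrated with simulated data: under R's default treatment contrasts, the first level of a factor is the reference, and the reported coefficient is the mean difference from that reference (the effect size of −10 below is illustrative, not taken from cystfibr).

```r
# Treatment (dummy) coding of a factor: "male" is the reference level,
# and the coefficient "sexfemale" estimates the female - male difference.
set.seed(3)
sex <- factor(rep(c("male", "female"), each = 25),
              levels = c("male", "female"))
y   <- 50 - 10 * (sex == "female") + rnorm(50, sd = 1)
coef(lm(y ~ sex))["sexfemale"]   # close to the true difference of -10
```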