Why should medical researchers learn R language? Which book is best for learning?

In the era of big data, data analysis is one of the most sought-after skills. As China's medical and health services develop, medical workers have a growing need for data analysis methods, and medical data analysis has become a hot field: an interdisciplinary subject spanning medicine, statistics, and computer science. Data analysis is inseparable from software, and R is free, open-source software that provides advanced statistical computing and visualization capabilities.

The predecessor of the R language is the S language, an interpreted language dedicated to statistical analysis developed by John M. Chambers and his colleagues at Bell Labs in 1976. S later evolved into the commercial product S-PLUS and was widely used by statisticians around the world.

In 1992, Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand developed a new language based on S for teaching purposes and named it R after the initials of their first names. In 1995, R was released as open-source software, and the two authors brought in other developers to work on its continued development. By 1997 an 11-person R core team had been established; since 2011 the team has been maintained at about 20 people.

In its early days, the R language was used mainly by statisticians in academia, but it gradually spread to scholars in many other fields. Especially with the explosion of big data, more and more people with computer science and engineering backgrounds have joined the community, improving R's computing engine, performance, and packages, which has greatly promoted the development of the language.

Why use R to analyze data

Medical data analysis combines statistics with medical expertise, and both statistical computation and data visualization depend on computer software. There are many popular statistics and graphing packages on the market, such as SAS, SPSS, and Stata. Why choose R? R has the following advantages.

(1) Most statistical software requires payment, whereas R is released under the GNU General Public License and can be used and distributed free of charge.

(2) R runs on multiple platforms, such as Windows, macOS, and various versions of Linux and UNIX. Some users even run R in browsers and on mobile operating systems.

(3) R programming is simple: you only need to be familiar with the parameters and usage of the relevant functions, without understanding the details of their implementation.

(4) R is small but powerful and is known as the "Swiss Army Knife" of data analysis. The R installer is less than 100 MB, and most functionality lives in extension packages, which cover cutting-edge data analysis methods across many industries.

(5) R enables reproducible analysis. Users can escape repetitive analysis work, share their analysis process with peers, and benefit from others' work in turn. With R and its extension packages, users can mix R code and markup text in a single document and automatically generate analysis reports.

R also has some inherent shortcomings, such as a relatively steep learning curve and uneven quality of third-party packages. When working through this book, readers are advised to "learn by doing": type in the code from the book, observe the output, and try modifying the code (for example, the parameters of a function) to master R's usage. It is also recommended to use packages from the official repository or packages recommended by experienced users.

Recommended books:

R Language Medical Data Analysis in Practice


Written to be accessible to readers without a statistics background, this book emphasizes practice and application. It focuses on the ideas and methods of data analysis, on the essence, characteristics, and applicable conditions of each method, and on the interpretation of results, while minimizing statistical derivation and calculation. Chapters 1 and 2 introduce the basics of the R language; Chapter 3 covers data preprocessing, from basic data handling to some advanced data manipulation skills; Chapter 4 shows how to use R for data visualization; Chapter 5 presents basic statistical analysis methods, including descriptive statistics and various single-factor analyses; Chapters 6 to 8 use real data to introduce the three regression models most commonly used in medical research, namely linear regression, logistic regression, and Poisson regression; Chapter 9 covers survival analysis; Chapters 10 to 12 introduce the most commonly used multivariate methods, namely cluster analysis, discriminant analysis, principal component analysis, and factor analysis; Chapter 13 covers the evaluation indicators and calculation methods of clinical diagnostic tests; and Chapter 14 introduces the meta-analysis methods commonly used in evidence-based medicine.

This book is suitable for undergraduate or graduate students in clinical medicine, public health, and other medicine-related majors, and can also serve as a data analysis reference for students and researchers in other fields. Readers can work through it chapter by chapter from beginning to end, or selectively consult the relevant chapters for problems encountered in practice. I hope this book gives readers a deeper understanding of data analysis and further promotes the popularization of R in China.

Table of Contents

  • Chapter 1 Introduction to R Language
  • Chapter 2 Creating a Data Set
  • Chapter 3 Data Frame Operation
  • Chapter 4 Data Visualization
  • Chapter 5 Basic Statistical Analysis
  • Chapter 6 Linear Regression Analysis
  • Chapter 7 Logistic Regression Analysis
  • Chapter 8 Poisson Regression Analysis
  • Chapter 9 Survival Analysis
  • Chapter 10 Cluster Analysis
  • Chapter 11 Discriminant Analysis
  • Chapter 12 Principal Component Analysis and Factor Analysis
  • Chapter 13 Evaluation of Clinical Diagnostic Tests
  • Chapter 14 Meta Analysis

Sample chapter excerpt: Chapter 6 Linear Regression Analysis

In medical research and practice, it is often necessary to explore the relationship between an outcome variable and other variables. For example, the blood glucose of diabetic patients may be affected by various biochemical indicators such as insulin, serum total cholesterol, and triglycerides. The linear regression model introduced in this chapter involves a continuous outcome variable (dependent variable) and one or more explanatory variables (independent variables). When there is only one explanatory variable, the model is called simple linear regression; when there is more than one, it is called multiple linear regression.

6.1 Simple linear regression

The simple linear regression model assumes that the dependent variable Y is affected by only one independent variable X, and that there is an approximately linear relationship between them. The model can be expressed as

Y = α + βX + ε

Here the dependent variable Y is decomposed into two parts: one part varies linearly with X and is determined by the linear function α + βX, where α is called the constant term (intercept) and β is called the regression coefficient (slope); the other part reflects the influence of other random factors and is treated as a random error, represented by ε, which is assumed to follow a normal distribution with mean 0 and variance σ².
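The role of the random error ε can be made concrete with a small simulation (a sketch, not the book's data): generate y from a known α and β plus normal noise, then check that lm() approximately recovers them.

```r
# Simulation sketch: Y = alpha + beta*X + epsilon with alpha = 1.5,
# beta = 0.15, and epsilon ~ N(0, 0.2^2), then fit with lm().
set.seed(42)
x <- seq(1, 20, length.out = 200)
y <- 1.5 + 0.15 * x + rnorm(200, mean = 0, sd = 0.2)
fit <- lm(y ~ x)
coef(fit)  # estimates close to the true intercept 1.5 and slope 0.15
```

With more observations or less noise, the estimates land even closer to the true parameters.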

These parameters are usually estimated by the method of least squares. The following example illustrates the establishment, solution, diagnosis, and interpretation of the model. The data in this chapter come from a survey by an endemic-disease research institution on the ages of children with Kashin-Beck disease and their urine creatinine content. We entered and saved the data as "UCR.rdata" in the exercises of Chapter 3. Now use the function load() to load the data.

> load("UCR.rdata")
> library(epiDisplay)
> des(UCR)
UCR in Kaschin-Beck disease children 
 No. of observations =  18 
  Variable     Class         Description                
1 age          integer       Age in years               
2 ucr          numeric       Urine creatinine (mmol)
3 group        factor        Type of children  
> summary(UCR)
     age           ucr         group 
 Min.   : 6.00   Min.   :2.210  0: 8  
 1st Qu.: 8.25   1st Qu.:2.672  1:10  
 Median :10.00   Median :3.010         
 Mean   :10.50   Mean   :3.016         
 3rd Qu.:12.00   3rd Qu.:3.315         
 Max.   :16.00   Max.   :3.980

The data frame UCR contains 3 variables and 18 records. There are no missing values in the data, so it can be used directly for analysis.

6.1.1 Fitting a simple linear regression model

For any data analysis, the first step is always to explore the data. A scatter plot is a very useful tool for judging whether there is a linear relationship between variables. Let us first explore the relationship between age and urine creatinine content with a scatter plot.

> plot(ucr ~ age, data = UCR,
+       xlab = "Age in years", ylab = "Urine creatinine (mmol)")

It can be seen from Figure 6-1 that urine creatinine content increases with age and shows a linear trend.


Figure 6-1 Scatter diagram of age and urine creatinine content

In order to fit the regression line, use the function lm() to build a linear regression model.

> mod <- lm(ucr ~ age, data = UCR)
> mod

Call:
lm(formula = ucr ~ age, data = UCR)

Coefficients:
(Intercept)          age  
     1.4549       0.1487

Directly printing the model object mod displays only very limited information. To get more information about the model, we can use the function attributes() to view the attributes of the model object.

> attributes(mod)
$names
 [1] "coefficients"   "residuals"  "effects"   "rank"    "fitted.values"
 [6] "assign"       "qr"         "df.residual" "xlevels" "call"   
[11] "terms"        "model"        
$class
[1] "lm"

The object mod contains 12 attributes, and we can extract any attribute individually. For example, the following command gets the fitted values of the model.

> mod$fitted.values
      1      2        3       4         5       6        7 
3.387824 3.090454 2.793083 2.347028 2.644398 2.941769 3.239139 
      8       9      10       11       12       13      14 
2.495713 2.941769 2.793083 3.090454 3.239139 3.685194 3.833879 
     15      16      17       18 
2.644398 2.495713 2.941769 3.685194
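As a quick check (a toy example, not the UCR data), the fitted values are nothing more than the estimated regression equation applied to each observed x:

```r
# Sketch: fitted(m) equals intercept + slope * x for every observation.
x <- c(2, 4, 6, 8)
y <- c(1.1, 2.0, 2.9, 4.2)
m <- lm(y ~ x)
manual <- coef(m)[1] + coef(m)[2] * x
all.equal(unname(fitted(m)), unname(manual))  # TRUE
```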

6.1.2 Interpretation of model output results

Use the function summary() to summarize most of the attributes of the model.

> summary(mod)
Call:
lm(formula = ucr ~ age, data = UCR)
Residuals:
    Min      1Q   Median      3Q     Max 
-0.43440 -0.13828 -0.01111  0.14738  0.41823 
Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.45492    0.20712  7.025 2.87e-06 ***
age        0.14869    0.01904  7.807 7.60e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2289 on 16 degrees of freedom
Multiple R-squared:  0.7921,   Adjusted R-squared:  0.7791 
F-statistic: 60.95 on 1 and 16 DF,  p-value: 7.597e-07

The first part of the output above shows the formula of the model call. The second part gives the distribution of the residuals: the median is close to 0, the absolute values of the maximum (0.41823) and minimum (−0.43440) are very close, and the absolute values of the lower quartile (−0.13828) and upper quartile (0.14738) are also very close. This indicates that the distribution of the residuals is roughly symmetric. The third part gives the estimated regression coefficients, including the constant term (intercept) and the coefficient (slope) for the effect of age on urine creatinine content. The constant term 1.45492 represents the urine creatinine content at age 0, which is obviously meaningless; its p-value of 2.87 × 10−6 only indicates that the constant term differs significantly from 0. The coefficient of the variable age is 0.14869, meaning that urine creatinine content increases by 0.14869 mmol on average for each additional year of age. Although 0.14869 is small, it differs significantly from 0 (p-value 7.60 × 10−7).

The coefficient of determination (R²) is 0.7921, meaning that 79.2% of the variation in the data can be explained by the model; the adjusted coefficient of determination is 0.7791. We give the calculation of both in the analysis of variance of the model below. The last part summarizes the residuals in more detail and uses the F statistic to test the effect of the variable age. The p-value of this test (7.597 × 10−7) equals the p-value of the t test on the regression coefficient above. The F test appears more often in the model's analysis-of-variance table.

> summary(aov(mod))
           Df  Sum Sq  Mean Sq  F value  Pr(>F)    
age         1  3.194   3.194     60.95  7.6e-07 ***
Residuals  16  0.839   0.052                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The analysis-of-variance table above decomposes the degrees of freedom, sum of squares, and mean square of the outcome variable (ucr) by source of variation (here only two sources: age and residual). "Square" here refers to the squared difference between a value and the mean. Therefore, the total sum of squares of the variation in urine creatinine content (ucr) is:

> SST <- sum((UCR$ucr - mean(UCR$ucr))^2); SST
[1] 4.033028

The residual sum of squares is:

> SSR <- sum(residuals(mod)^2); SSR
[1] 0.8385279

The regression sum of squares, i.e., the sum of squared differences between the fitted values and the overall mean, is:

> SSW <- sum((fitted(mod) - mean(UCR$ucr))^2); SSW
[1] 3.1945

These last two sums of squares add up to the total sum of squares. The coefficient of determination is the ratio of the regression sum of squares to the total sum of squares.

> SSW/SST
[1] 0.7920848

The coefficient of determination can also be interpreted as the proportion of the total variation of the dependent variable that is explained by the independent variable. In this case, age explains 79% of the total variation in urine creatinine. The adjusted coefficient of determination (R²adj) adds a "penalty" for the number of variables and is more meaningful in multiple linear regression. Its formula is

R²adj = 1 − (1 − R²) × (n − 1) / (n − k − 1)

where R² is the coefficient of determination, n is the sample size, and k is the number of variables. Here the sample size is 18 and the number of variables is 1, so:

> Radj <- 1 - (1 - SSW / SST) * ((18 - 1) / (18 - 2)); Radj
[1] 0.7790901

This is the adjusted coefficient of determination displayed by the command summary(mod).

Divide the mean square of age (with 1 degree of freedom) by the mean square of the residual to get the value of the F statistic.

> resid.msq <- sum(residuals(mod)^2)/mod$df.residual
> Fvalue <- SSW/resid.msq; Fvalue
[1] 60.95444

Using the F value and its two corresponding degrees of freedom (1 for the variable age and 16 for the residuals), the p-value for testing the effect of age can be calculated.

> pf(Fvalue, df1 = 1, df2 = 16, lower.tail = FALSE)
[1] 7.597353e-07

The function pf() calculates the p-value from a given F value and the two corresponding degrees of freedom. This result is consistent with the output of the command summary(aov(mod)). The parameter lower.tail is set to FALSE to obtain the area under the curve to the right of the F value.

Regression analysis and analysis of variance give the same conclusion: there is a significant linear relationship between age and urine creatinine content.
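The agreement between the two tests is not a coincidence: with a single predictor, the slope's t statistic squared equals the model's F statistic. A toy example (not the UCR data) illustrates this:

```r
# Sketch: for one predictor, t^2 for the slope equals the overall F value,
# so the t test and the F test give identical p-values.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.2, 3.9, 5.1, 5.8)
m <- lm(y ~ x)
tval <- summary(m)$coefficients["x", "t value"]
fval <- summary(m)$fstatistic[["value"]]
all.equal(tval^2, fval)  # TRUE
```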

6.1.3 Regression diagnosis

Now, we can add a regression line to Figure 6-1.

> abline(mod)

The intercept of the regression line in Figure 6-2 is about 1.45 and the slope is about 0.15. The fitted value (expected value) is the urine creatinine content on the regression line for a given age. The residual is the difference between the observed value and the expected value. The residuals can be drawn with the following commands:

> points(UCR$age, fitted(mod), pch=18, col = "blue")
> segments(UCR$age, UCR$ucr, UCR$age, fitted(mod), col = "green")


Figure 6-2 Scatter plot and regression line of age and urine creatinine content

You can also use the function residuals() to extract the residual of each sample point from the model:

> res <- residuals(mod); res
        1         2         3           4         5         6 
 0.15217606 -0.08045368  0.29691648  0.13297194 -0.08439837  0.41823135 
        7         8         9          10        11        12 
-0.05913872  0.15428690  0.06823144  0.03691649 -0.17045359 -0.14913887 
       13        14        15          16        17        18 
 0.29480588  0.05612084 -0.43439827 -0.10571309 -0.20176854 -0.32519425

Some residuals are positive and some are negative, indicating that some points in the scatter plot lie above the fitted line and some below it. The 15th record has the largest absolute residual, and its point is farthest from the fitted line. We can also check the sum of the residuals and the sum of their squares:

> sum(res)
[1] -1.387779e-17
> sum(res^2)
[1] 0.8385279

The sum of the residuals is almost exactly 0, and the sum of squared residuals matches the earlier calculation. If the model fits well, the residuals should be normally distributed. A common way to check the normality of residuals is to look at their histogram (as shown in Figure 6-3).

> hist(res)

As Figure 6-3 shows, the residuals are roughly normally distributed. However, for such a small sample, a better way to check normality is a scatter plot of the expected standard normal scores against the residuals (as shown in Figure 6-4), called a normal Q-Q plot. If the points in the normal Q-Q plot cluster along a straight line, the residuals are normally distributed.

> qqnorm(res)
> qqline(res)


Figure 6-3 Histogram of residual distribution


Figure 6-4 The normal QQ plot of the residual distribution

As can be seen from Figure 6-4, the points lie roughly on a straight line. For a quantitative check, the Shapiro-Wilk test can be used.

> shapiro.test(res)
  Shapiro-Wilk normality test
data:  res
W = 0.98546, p-value = 0.9888

The null hypothesis of the Shapiro-Wilk test is that the data follow a normal distribution. The p-value above is 0.9888, so the assumption that the residuals are normally distributed is not rejected.

Finally, we can make a scatter plot of the residuals against the fitted values to inspect the distribution pattern of the residuals. The result is shown in Figure 6-5.

> plot(fitted(mod), res, xlab = "Fitted values", type = "n")
> text(fitted(mod), res, labels = rownames(UCR))
> abline(h = 0, col = "blue")


Figure 6-5 Scatter plot of residuals and fitted values

In the commands above, we first set the parameter type to "n" in the function plot() (meaning no points are drawn), then use the function text() to label each point with its record number, and finally add a blue horizontal line as a reference. Figure 6-5 shows no obvious pattern, so the residuals can be considered independent of the fitted values (expected values). In summary, we can conclude that the residuals are random and normally distributed.

In fact, the residual diagnosis diagram of the model can also be obtained by the following command:

> par(mfrow = c(2, 2))
> plot(mod)
> par(mfrow = c(1, 1))

Because the output of the command plot(mod) contains four plots, we first use the function par() to divide the canvas into two rows and two columns, then restore the default single-panel canvas after plotting. Figure 6-6 shows not only the two residual diagnostic plots drawn earlier, but also the scale-location plot (lower left) and the residuals-versus-leverage plot (lower right). The scale-location plot is mainly used to check the variance of the residuals, and the residuals-versus-leverage plot is mainly used to identify outliers, high-leverage points, and influential points. Figure 6-6 indicates that the model fits reasonably well and satisfies the assumptions of the linear model.

From the above analysis we can conclude that urine creatinine content is related to the children's age: for each additional year of age, urine creatinine increases by 0.14869 mmol on average. Apart from age, the remaining influences on urine creatinine content may be random error or factors not yet considered.


Figure 6-6 Residual diagnostic plots

6.2 Stratified linear regression

A data set usually contains multiple variables collected during research, and it is often meaningful to explore the relationship between two variables within each category of a third, categorical variable. This is essentially a problem of comparing regression equations: first check whether the two (or more) regression lines are parallel; if they are parallel, then check whether their intercepts are equal. In the analysis in Section 6.1 we treated all children as a whole, without considering whether a child was diseased. In fact, the data set also contains a factor variable group that distinguishes normal children from children with Kashin-Beck disease.

In Section 6.1 we established a simple linear regression model mod and fitted a regression line. The model indicates that a child's age has a significant effect on urine creatinine content. Next, we explore whether the effect of age on urine creatinine content differs significantly between the two types of children.

If we assume that age has the same effect on urine creatinine content in normal children and diseased children, then two parallel regression lines can be fitted by adding the variable group to the model.

> mod1 <- lm(ucr ~ age + group, data = UCR)
> summary(mod1)
Call:
lm(formula = ucr ~ age + group, data = UCR)
Residuals:
    Min      1Q   Median      3Q     Max 
-0.29885 -0.15905  0.01675  0.14186  0.34023 
Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.44893    0.18427   7.863 1.06e-06 ***
age         0.16156    0.01785   9.049 1.83e-07 ***
group1     -0.23256    0.10181  -2.284   0.0373 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2037 on 15 degrees of freedom
Multiple R-squared:  0.8457, Adjusted R-squared:  0.8252 
F-statistic: 41.12 on 2 and 15 DF,  p-value: 8.162e-07

On average, urine creatinine content increased by 0.16156 mmol for each additional year of age (p < 0.001); at the same age, the urine creatinine content of diseased children was on average 0.23256 mmol lower than that of normal children (p = 0.0373).

Now draw the two parallel regression lines. First make a scatter plot, using blue hollow circles for normal (unaffected) children and red solid circles for diseased children.

> col <- ifelse(UCR$group == 0, "blue", "red")
> pch <- ifelse(UCR$group == 0, 1, 19)
> plot(ucr ~ age, data = UCR,
+      xlab="Age in years", ylab = "Urine creatinine (mmol)", 
+      col = col, pch = pch)
> legend("topleft", 
+        legend = c("Normal children", "Diseased children"), 
+        col = c("blue", "red"), pch = c(1, 19))

Then use the function abline() to draw a regression line for each group. In abline(), the parameter a is the intercept and b is the slope, both of which can be derived from the model coefficients. The function coef() extracts the coefficients of the model mod1:

> coef(mod1)
(Intercept)        age       group1 
  1.4489268   0.1615603   -0.2325586

For both groups, the slope is fixed:

> b <- coef(mod1)[2]; b
     age 
0.1615603

For the unaffected group, the variable group takes the value 0, and its intercept is the first term of the coefficient, denoted as a0:

> a0 <- coef(mod1)[1]; a0
(Intercept) 
  1.448927

For the disease group, the variable group takes the value 1, and its intercept is the sum of the first and third terms of the coefficient, denoted as a1, and its value is:

> a1 <- coef(mod1)[1] + coef(mod1)[3]; a1
(Intercept) 
  1.216368

Consistent with the scatter plot that has been drawn above, we use blue to represent the non-diseased group and red to represent the diseased group, and draw two regression lines respectively, as shown in Figure 6-7.

> abline(a = a0, b = b, col = "blue")
> abline(a = a1, b = b, col = "red")


Figure 6-7 Comparison of the two groups of regression line intercepts

In the model above, we assumed that age has a constant effect on urine creatinine content in both groups. Whether this assumption holds requires a statistical test. We can fit regression lines with different slopes and then test whether the difference between the two slopes is significant. This requires a model containing the interaction term between the variables age and group.

> mod2 <- lm(ucr ~ age + group + age:group, data = UCR)

In the formula of the model, "age + group + age:group" can be abbreviated as "age*group". There are 4 coefficients in model mod2:

> coef(mod2)
(Intercept)      age        group1    age:group1 
 1.66166670  0.13916666 -0.56593453   0.03306943

Therefore, mod2 can be simplified as: ucr = 1.662 + 0.139 × age – 0.566 × group + 0.033 × age × group.

The constant term 1.662 is the intercept of the fitted line for the unaffected group (because age and group are both 0 there). For the intercept of the diseased group, the 2nd and 4th terms on the right-hand side are 0 (because age is 0), but the third term remains (because group is 1). The coefficient of that term is negative, indicating that the intercept of the diseased group is smaller than that of the unaffected group. Denoting the intercepts of the unaffected and diseased groups by a0 and a1, respectively:

> a0 <- coef(mod2)[1]; a0
(Intercept) 
  1.661667 
> a1 <- coef(mod2)[1] + coef(mod2)[3]; a1
(Intercept) 
  1.095732

Because group is 0 for the unaffected group, its slope is simply the second coefficient. For the diseased group, group is 1, so its slope is the sum of the second and fourth coefficients. Denoting the slopes of the unaffected and diseased groups by b0 and b1, respectively:

> b0 <- coef(mod2)[2]; b0
     age 
0.1391667 
> b1 <- coef(mod2)[2] + coef(mod2)[4]; b1
     age 
0.1722361

We can use these coefficients to draw two regression lines, as shown in Figure 6-8. The process is similar to the previous one. The code is as follows:

> col <- ifelse(UCR$group == 0, "blue", "red")
> pch <- ifelse(UCR$group == 0, 1, 19)
> plot(ucr ~ age, data = UCR,
+      xlab="Age in years", ylab = "Urine creatinine (mmol)", 
+      pch = pch, col = col)
> legend("topleft", 
+        legend = c("Normal children", "Diseased children"), 
+        col = c("blue", "red"), pch = c(1, 19))
> abline(a = a0, b = b0, col = "blue")
> abline(a = a1, b = b1, col = "red")


Figure 6-8 Comparison of the slope of the regression line between the two groups

Figure 6-8 shows that the two regression lines are not parallel (their slopes differ), suggesting that the effect of age is not the same in the two groups. This difference may be due to chance (random error) or to the disease. Next, use the function summary() to check whether each regression coefficient is statistically significant.

> summary(mod2)
Call:
lm(formula = ucr ~ age * group, data = UCR)
Residuals:
    Min      1Q    Median      3Q      Max 
-0.31927 -0.13327 -0.00125  0.15403  0.30667 
Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.66167   0.30982  5.363 0.000100 ***
age         0.13917   0.03170  4.390 0.000617 ***
group1      -0.56593   0.40174 -1.409 0.180743    
age:group1   0.03307   0.03853  0.858 0.405152    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2055 on 14 degrees of freedom
Multiple R-squared:  0.8535, Adjusted R-squared:  0.8221 
F-statistic: 27.18 on 3 and 14 DF,  p-value: 4.257e-06

The coefficient of the interaction term "age:group1" is not statistically significant (p = 0.405152); that is, the difference between the two slopes can be attributed to chance. Therefore, the interaction term need not be included in the model, and mod1 is the final model.
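The same conclusion can be reached by comparing the two nested models directly with anova(), which performs an F test on the terms that differ. The sketch below uses simulated data, since UCR.rdata is not reproduced here:

```r
# Sketch: nested-model comparison with anova(). Data are simulated from
# a parallel-lines model (no true interaction), mimicking the UCR setup.
set.seed(1)
d <- data.frame(age = rep(6:15, times = 2),
                group = factor(rep(0:1, each = 10)))
d$ucr <- 1.45 + 0.15 * d$age - 0.23 * (d$group == 1) + rnorm(20, sd = 0.2)
m1 <- lm(ucr ~ age + group, data = d)  # parallel lines
m2 <- lm(ucr ~ age * group, data = d)  # separate slopes
anova(m1, m2)  # row 2 gives the F test for the interaction term
```

Because the models differ by a single term, this F test is equivalent to the t test on the interaction coefficient in summary(m2).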

6.3 Multiple linear regression

Relationships between things in the real world are intricate, and changes in one variable are often related to changes in many others. For example, a person's heart rate is related to age, weight, and vital capacity. In linear regression, when there is more than one explanatory variable, the model is called a multiple linear regression model.

The multiple linear regression model assumes that the dependent variable Y is affected by multiple independent variables X1, X2, …, Xk, and that there is a linear relationship between the dependent variable and these independent variables. The model can be expressed as

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where β0 is called the constant term and β1, β2, …, βk are called the partial regression coefficients; the random error ε follows a normal distribution with mean 0 and variance σ². These parameters are usually estimated from the sample observations by the least squares method.

6.3.1 Fitting multiple linear regression models

The following uses the data set cystfibr in the ISwR package as an example of multiple linear regression. The data come from a study of lung function in patients with cystic fibrosis. First install and load the ISwR package, then view the variables in the data set cystfibr.

> library(ISwR)
> data(cystfibr)
> ?cystfibr
> str(cystfibr)
'data.frame':    25 obs. of  10 variables:
 $ age   : int  7 7 8 8 8 9 11 12 12 13 ...
 $ sex   : int  0 1 0 1 0 0 1 1 0 1 ...
 $ height: int  109 112 124 125 127 130 139 150 146 155 ...
 $ weight: num  13.1 12.9 14.1 16.2 21.5 17.5 30.7 28.4 25.1 31.5 ...
 $ bmp   : int  68 65 64 67 93 68 89 69 67 68 ...
 $ fev1  : int  32 19 22 41 52 44 28 18 24 23 ...
 $ rv    : int  258 449 441 234 202 308 305 369 312 413 ...
 $ frc   : int  183 245 268 146 131 155 179 198 194 225 ...
 $ tlc   : int  137 134 147 124 104 118 119 103 128 136 ...
 $ pemax : int  95 85 100 85 95 80 65 110 70 95 ...

According to the help file for the data set cystfibr, the variable sex represents gender, where 0 is male and 1 is female; it needs to be converted into a factor here.

> cystfibr$sex <- factor(cystfibr$sex, labels = c("male", "female"))

The last five variables in the data set are fev1 (forced expiratory volume in the first second), rv (residual volume), frc (functional residual capacity), tlc (total lung capacity), and pemax (maximum expiratory pressure). These are indicators of lung function, and we can use the function cor() to examine the correlations among them.

> cor(cystfibr[,6:10])
           fev1       rv       frc        tlc       pemax
fev1   1.0000000 -0.6658557 -0.6651149 -0.4429945  0.4533757
rv    -0.6658557  1.0000000  0.9106029  0.5891391 -0.3155501
frc   -0.6651149  0.9106029  1.0000000  0.7043999 -0.4172078
tlc   -0.4429945  0.5891391  0.7043999  1.0000000 -0.1816157
pemax  0.4533757 -0.3155501 -0.4172078 -0.1816157  1.0000000

The correlation matrix shows fairly strong correlations among these 5 variables. To simplify the problem, we choose fev1 as the outcome variable and build a multiple linear regression model.

> fit1 <- lm(fev1 ~ age + sex + height + weight + bmp, data = cystfibr)
> summary(fit1)
Call:
lm(formula = fev1 ~ age + sex + height + weight + bmp, data = cystfibr)
Residuals:
   Min     1Q    Median    3Q    Max 
-10.711  -5.635  -3.155   7.309  16.979 
Coefficients:
           Estimate Std. Error t value Pr(>|t|)  
(Intercept)  19.28794   43.46384  0.444  0.6622  
age         -0.16431    1.26446 -0.130  0.8980  
sexfemale   -10.04122    3.60773 -2.783  0.0118 *
height      -0.07645     0.27694 -0.276  0.7855  
weight       0.20274     0.53153  0.381  0.7071  
bmp          0.33373     0.32219  1.036  0.3133  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.788 on 19 degrees of freedom
Multiple R-squared:  0.5123, Adjusted R-squared:  0.384 
F-statistic: 3.992 on 5 and 19 DF,  p-value: 0.01209

Like simple linear regression, multiple linear regression fits the model by least squares, and the output is similar, so we do not repeat the details here. Note the "female" appended to the variable sex: R handles categorical variables (factors) via dummy variables, and the level not displayed (here "male") serves as the reference group. The results show that only the t test for the variable sex (gender) is statistically significant: the fev1 of female patients is on average about 10 units lower than that of male patients, while the other variables are not statistically significant. The model's analysis of variance (F test) is significant (p = 0.01209). The coefficient of determination is 0.5123, indicating that 51% of the variation in fev1 can be explained by these five independent variables.
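The dummy-variable coding mentioned above can be inspected directly with model.matrix(); a minimal sketch (using a hand-made factor rather than the cystfibr data):

```r
# Sketch: R expands a factor into 0/1 dummy columns, with the first level
# ("male" here) as the reference group absorbed into the intercept.
sex <- factor(c("male", "female", "male"), levels = c("male", "female"))
model.matrix(~ sex)  # column "sexfemale" is 0, 1, 0
```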


Origin blog.csdn.net/epubit17/article/details/108616992