What data structures does the R language include?

In most cases, structured medical data is a data set composed of many rows and many columns. In R, this kind of data set is called a data frame. Before learning about data frames, let's first look at the other data structures used to store data: vectors, factors, matrices, arrays, and lists. These structures differ in what they store, how they are created, and how they are manipulated. Knowing their basic concepts and operations lets us process data flexibly and efficiently.

2.1.1 Vector

A vector is a one-dimensional array used to store numeric, character, and logical data. A scalar can be regarded as a vector with only one element. The function c() can be used to create a vector, for example:

> x1 <- c(2, 4, 1, -2, 5)
> x2 <- c("one", "two", "three")
> x3 <- c(TRUE, FALSE, TRUE, FALSE)

Here x1 is a numeric vector, x2 is a character vector, and x3 is a logical vector. All elements of a vector must be of the same type; if types are mixed, R coerces them to a common type. To create a vector that follows a regular pattern, R provides some convenient operators and functions, such as:

> x4 <- 1:5   # equivalent to x4 <- c(1, 2, 3, 4, 5)
> x5 <- seq(from = 2, to = 10, by = 2)   # equivalent to x5 <- c(2, 4, 6, 8, 10)
> x6 <- rep("a", times = 4)   # equivalent to x6 <- c("a", "a", "a", "a")
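The rule that all elements of a vector share one type can be checked directly. The short sketch below uses illustrative names (mixed, flags) that are not part of the original text; it shows that c() silently coerces mixed inputs to the most general type involved.

```r
# Mixing numbers, strings and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
print(mixed)
print(class(mixed))

# Mixing numbers and logicals: TRUE/FALSE become 1/0
flags <- c(2, TRUE, FALSE)
print(flags)
print(class(flags))
```

Checking class() after building a vector is a cheap way to notice such coercion before it affects later calculations.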

Sometimes we only want to use part of a vector, that is, to select a subset of it. Suppose we have a vector of integers running from 3 up to at most 100 in steps of 7. What is the value of its fifth element?

> x <- seq(from = 3, to = 100, by = 7)
> x
 [1]  3 10 17 24 31 38 45 52 59 66 73 80 87 94

Note that the last element of the vector x is 94, not 100, because adding another step of 7 to 94 would exceed 100.

> x[5]
[1] 31

The numbers in square brackets "[ ]" are called subscripts, which specify the index position of the vector. In the above command, x[5] represents the fifth element of the vector, and its value is 31. The following command displays the 4th, 6th, and 7th elements of the vector:

> x[c(4, 6, 7)]
[1] 24 38 45

A subscript can also be negative, which removes the elements at the specified positions. For example, to remove the first 4 elements of x, enter the following code (note the parentheses in the command):

> x[-(1:4)]
 [1] 31 38 45 52 59 66 73 80 87 94
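The parentheses matter because unary minus binds more tightly than the ":" operator. The sketch below, with illustrative names idx_wrong and idx_right, shows what each form produces and uses tryCatch() to capture the error that R raises when negative and positive subscripts are mixed.

```r
x <- seq(from = 3, to = 100, by = 7)

# Without parentheses, -1:4 means (-1):4, i.e. -1, 0, 1, 2, 3, 4
idx_wrong <- -1:4
# With parentheses, -(1:4) means -1, -2, -3, -4
idx_right <- -(1:4)
print(idx_wrong)
print(idx_right)

# Mixing negative and positive subscripts is an error in R;
# tryCatch() turns the error into a plain string for display
result <- tryCatch(x[idx_wrong], error = function(e) "subscript error")
print(result)

# The parenthesized form drops the first four elements as intended
print(x[idx_right])
```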

Operations in R are vectorized, for example:

> weight <- c(68, 72, 57, 90, 65, 81)
> height <- c(1.75, 1.80, 1.65, 1.90, 1.72, 1.87)
> bmi <- weight / height ^ 2
> bmi
[1] 22.20408 22.22222 20.93664 24.93075 21.97134 23.16337

In the above calculation of bmi, the operators "/" and "^" are applied element by element, so the result is still a vector. If the vectors involved in a calculation have different lengths, R recycles the shorter vector to match the longer one, and it gives a warning when the longer length is not a multiple of the shorter length.

> a <- 1:5
> b <- 1:3
> a + b
[1] 2 4 6 5 7
Warning message:
In a + b : longer object length is not a multiple of shorter object length

The length of vector a above is 5, and the length of vector b is 3. When calculating a + b, because b is shorter than a, the elements of b are recycled, starting again from the first one. Therefore, in the final output, the fourth element of a (4) is added to the first element of b (1), and the fifth element of a (5) is added to the second element of b (2).
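To make the recycling rule concrete, the following sketch contrasts a case where the longer length is a multiple of the shorter one (no warning) with the unequal case from the text, written out by hand using rep_len(). The variable names are illustrative.

```r
a <- 1:6
b <- 1:2

# length(a) is a multiple of length(b): b is recycled three times, no warning
s1 <- a + b
print(s1)

# A single number is a length-one vector and is recycled against every element
s2 <- a * 10
print(s2)

# The unequal case from the text, with the recycling made explicit
a5 <- 1:5
b3 <- 1:3
manual <- a5 + rep_len(b3, length(a5))   # b3 becomes 1, 2, 3, 1, 2
print(manual)
```

Writing the recycling explicitly with rep_len() avoids the warning and documents that the length mismatch is intended.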

R provides a wide variety of functions for calculating statistics. The commonly used statistical functions are shown in Table 2-1, and they make it very convenient to summarize a vector. The following code shows the output of several of these functions applied to the vector bmi.

> length(bmi)     # length of the vector bmi
[1] 6
> mean(bmi)       # mean of bmi
[1] 22.5714
> var(bmi)        # sample variance of bmi
[1] 1.841265
> sd(bmi)         # sample standard deviation of bmi
[1] 1.356932
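As a cross-check on what var() and sd() compute, the sketch below recalculates the sample variance with the n - 1 denominator using only vectorized arithmetic. The names manual_mean, manual_var and manual_sd are illustrative.

```r
weight <- c(68, 72, 57, 90, 65, 81)
height <- c(1.75, 1.80, 1.65, 1.90, 1.72, 1.87)
bmi <- weight / height^2

n <- length(bmi)
manual_mean <- sum(bmi) / n
# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((bmi - manual_mean)^2) / (n - 1)
manual_sd <- sqrt(manual_var)

# all.equal() compares numbers up to a small numerical tolerance
print(all.equal(manual_mean, mean(bmi)))
print(all.equal(manual_var, var(bmi)))
print(all.equal(manual_sd, sd(bmi)))
```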

Table 2-1 Commonly used statistical functions

 

2.1.2 Factor

Generally speaking, variables are divided into numerical, nominal, and ordinal types. Nominal variables are categorical variables with no order relationship, such as gender, blood type, and ethnicity. Ordinal variables are categorical variables whose categories have a natural order, such as a patient's condition (poor, improved, excellent). In R, nominal and ordinal variables are called factors. Factors are very important in R because they determine how data is displayed and analyzed. In stored data, categorical variables are often coded as integer vectors, so before analysis it is often necessary to convert them into factors with the function factor(). For example:

> sex <- c(1, 2, 1, 1, 2, 1, 2)
> sex.f <- factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
> sex.f
[1] Male   Female Male   Male   Female Male   Female
Levels: Male Female

The above command first defines a variable sex to represent gender, where 1 stands for male and 2 for female. The function factor() then converts sex into a factor and saves it as the object sex.f. The parameter levels gives the codes used in the original variable, and the parameter labels gives the label for each level. These two parameters are matched position by position, so they must be given in corresponding order. What distinguishes a factor from an ordinary character variable is that it has a levels attribute, which can be viewed with the function levels():

> levels(sex.f)
[1] "Male"   "Female"
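Internally, a factor stores integer codes that point into its levels. The sketch below, with illustrative names codes and counts, makes this visible and also tabulates the factor.

```r
sex <- c(1, 2, 1, 1, 2, 1, 2)
sex.f <- factor(sex, levels = c(1, 2), labels = c("Male", "Female"))

# Integer codes: the position of each value within levels(sex.f)
codes <- as.integer(sex.f)
print(codes)

# The labels can be recovered as an ordinary character vector
print(as.character(sex.f))

# table() counts the observations in each level
counts <- table(sex.f)
print(counts)
```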

In statistical models, R treats the first level of a factor as the reference group by default. We often need to reorder the levels in order to change the reference group. This can be done in two ways. The first is to change the order of the parameters levels and labels in the function factor(), for example:

> sex.f1 <- factor(sex, levels = c(2, 1), labels = c("Female", "Male"))
> sex.f1
[1] Male   Female Male   Male   Female Male   Female
Levels: Female Male

The second method is to use the function relevel( ):

> sex.f1 <- relevel(sex.f, ref = "Female")
> sex.f1
[1] Male   Female Male   Male   Female Male   Female
Levels: Female Male

To create an ordered factor, specify the parameter ordered = TRUE in the function factor(). For example:

> status <- c(1, 2, 2, 3, 1, 2, 2)
> status.f <- factor(status,
+                    levels = c(1, 2, 3),
+                    labels = c("Poor", "Improved", "Excellent"),
+                    ordered = TRUE)
> status.f
[1] Poor      Improved  Improved  Excellent Poor      Improved  Improved
Levels: Poor < Improved < Excellent
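Because the levels of an ordered factor have a defined order, comparison operators and functions such as max() work on it. The sketch below illustrates this; the names at_least_improved and best are illustrative.

```r
status <- c(1, 2, 2, 3, 1, 2, 2)
status.f <- factor(status,
                   levels = c(1, 2, 3),
                   labels = c("Poor", "Improved", "Excellent"),
                   ordered = TRUE)

# Comparisons follow the level order Poor < Improved < Excellent
at_least_improved <- status.f >= "Improved"
print(at_least_improved)

# max() returns the highest level that occurs in the data
best <- max(status.f)
print(best)

# is.ordered() distinguishes ordered factors from nominal ones
print(is.ordered(status.f))
```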

2.1.3 Matrix

A matrix is a two-dimensional array composed of rows and columns. All elements of a matrix have the same mode (numeric, character, or logical). In most cases the elements are numeric. Matrices have many mathematical properties and operations that are used in statistical computation, for example in factor analysis and generalized linear models. The function matrix() is commonly used to create a matrix, for example:

> M <- matrix(1:6, nrow = 2)
> M
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

The above command uses the vector 1:6 to create a matrix with 2 rows. R automatically calculates the number of columns from the length of the vector and the number of rows set by the parameter nrow. The parameter byrow defaults to FALSE, which means that the values are filled in column by column. To fill the matrix row by row, set byrow to TRUE.
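The effect of byrow is easiest to see by building the same data both ways. This is a minimal sketch with illustrative names by_col and by_row.

```r
# Default: values are filled in column by column
by_col <- matrix(1:6, nrow = 2)
print(by_col)

# byrow = TRUE: values are filled in row by row
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_row)

# Filling 2 rows by row gives the same values as filling
# 3 rows by column and then transposing
same <- all(by_row == t(matrix(1:6, nrow = 3)))
print(same)
```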

Common matrix operations can be implemented in R, such as matrix addition, matrix multiplication, matrix inversion, matrix transposition, the determinant of a square matrix, and the eigenvalues and eigenvectors of a square matrix.

In matrix multiplication, the number of columns in the first matrix must equal the number of rows in the second matrix, and the operator is "%*%". First create two matrices:

> mat1 <- matrix(1:6, nrow = 3)
> mat1
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> mat2 <- matrix(5:10, nrow = 2)
> mat2
     [,1] [,2] [,3]
[1,]    5    7    9
[2,]    6    8   10

The function dim() can get the dimension of the matrix, that is, the number of rows and columns:

> dim(mat1)
[1] 3 2
> dim(mat2)
[1] 2 3

The result shows that mat1 is a matrix with 3 rows and 2 columns, and mat2 is a matrix with 2 rows and 3 columns, so they can be multiplied, and the result should be a matrix with 3 rows and 3 columns.

> mat1 %*% mat2
     [,1] [,2] [,3]
[1,]   29   39   49
[2,]   40   54   68
[3,]   51   69   87

The matrix transpose operation is to exchange the rows and columns of the matrix. For example, find the transpose matrix of matrix mat1:

> t(mat1)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

The determinant and the inverse of a square matrix can be computed with the functions det() and solve(), for example:

> mat3 <- matrix(1:4, nrow = 2)
> det(mat3)
[1] -2
> solve(mat3)
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
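One way to confirm that solve() returned the inverse is to multiply it by the original matrix and compare the product with the identity matrix. The sketch below does this with all.equal(), which tolerates small floating-point differences; the names inv3 and product are illustrative.

```r
mat3 <- matrix(1:4, nrow = 2)
inv3 <- solve(mat3)

# A matrix times its inverse should be the identity matrix
product <- mat3 %*% inv3
print(product)

# diag(2) is the 2 x 2 identity matrix
is_identity <- isTRUE(all.equal(product, diag(2)))
print(is_identity)

# The determinant of the inverse is the reciprocal of the determinant
print(all.equal(det(inv3), 1 / det(mat3)))
```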

In addition, we can also sum or average the matrix by row and column, for example:

> rowSums(mat1)
[1] 5 7 9
> colSums(mat1)
[1]  6 15
> rowMeans(mat1)
[1] 2.5 3.5 4.5
> colMeans(mat1)
[1] 2 5

The functions for matrix operations cannot all be described in detail here. When necessary, readers can consult the relevant documentation on CRAN to learn more about them.

Accessing elements by index is another basic matrix operation. As with vectors, we use "[ ]" to index a matrix. The difference is that for a matrix a comma separates the row index from the column index inside "[ ]". For example, to select the first two rows and the first two columns of the matrix mat1, use the following command:

> mat1[1:2, 1:2]
     [,1] [,2]
[1,]    1    4
[2,]    2    5

If the row number or column number is omitted, it means that all rows or all columns are selected, for example:

> mat1[2:3,]
     [,1] [,2]
[1,]    2    5
[2,]    3    6
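A detail worth knowing when indexing matrices is that selecting a single row or column returns a plain vector unless drop = FALSE is given. The sketch below shows both behaviours; row_vec, row_mat and single are illustrative names.

```r
mat1 <- matrix(1:6, nrow = 3)

# One element: row 3, column 2
single <- mat1[3, 2]
print(single)

# Selecting one row drops the dimensions and returns a vector
row_vec <- mat1[2, ]
print(row_vec)
print(dim(row_vec))   # NULL, because a vector has no dim attribute

# drop = FALSE keeps the result as a 1 x 2 matrix
row_mat <- mat1[2, , drop = FALSE]
print(row_mat)
print(dim(row_mat))
```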

2.1.4 Array

An array usually refers to a multi-dimensional array, which is similar to a matrix but can have more than 2 dimensions. Arrays have a special dimension (dim) attribute. The following commands turn a vector into an array by assigning it dimensions. Note the order in which the values are filled in.

> A <- 1:24
> dim(A) <- c(3, 4, 2)
> A
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

The same array can also be created with the function array(), which additionally lets us attach names to each dimension through the parameter dimnames.

> dim1 <- c("A1", "A2", "A3")
> dim2 <- c("B1", "B2", "B3", "B4")
> dim3 <- c("C1", "C2")
> array(1:24, dim = c(3, 4, 2), dimnames = list(dim1, dim2, dim3))
, , C1

   B1 B2 B3 B4
A1  1  4  7 10
A2  2  5  8 11
A3  3  6  9 12

, , C2

   B1 B2 B3 B4
A1 13 16 19 22
A2 14 17 20 23
A3 15 18 21 24
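Array elements are indexed in the same way as matrix elements, with one subscript per dimension. The sketch below stores the named array in a variable called arr (an illustrative name, since the original command does not assign it) and selects elements by position and by name.

```r
arr <- array(1:24, dim = c(3, 4, 2),
             dimnames = list(c("A1", "A2", "A3"),
                             c("B1", "B2", "B3", "B4"),
                             c("C1", "C2")))

# By position: row 2, column 3, layer 1
by_pos <- arr[2, 3, 1]
print(by_pos)

# By name: the same element
by_name <- arr["A2", "B3", "C1"]
print(by_name)

# Omitting the first two subscripts selects a whole layer as a matrix
slice <- arr[, , "C2"]
print(slice)
```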

2.1.5 List

The list is the most flexible and most complex data structure in R. Its components can be objects of different types, for example vectors, arrays, tables, or any other kind of object.

> list1 <- list(a = 1, b = 1:5, c = c("red", "blue", "green"))
> list1
$a
[1] 1

$b
[1] 1 2 3 4 5

$c
[1] "red"   "blue"  "green"

Note that the arguments of the function list() are a series of named components, each assigned from an existing object or a value. When the list is displayed, each component name is prefixed with the symbol "$".
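Components of a list can be extracted in several ways. The sketch below compares "$", "[[ ]]" and "[ ]" on list1; the names b_vec, second_colour and sub_list are illustrative.

```r
list1 <- list(a = 1, b = 1:5, c = c("red", "blue", "green"))

# "$" and "[[ ]]" both return the component itself
b_vec <- list1$b
second_colour <- list1[["c"]][2]
print(b_vec)
print(second_colour)

# Single brackets return a shorter list, not the component
sub_list <- list1["a"]
print(class(sub_list))

# names() lists the component names
print(names(list1))
```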

In everyday data analysis, creating a list by hand is not a common task. However, many functions return a list. For example:

> set.seed(123)
> dat <- rnorm(10) 
> bp <- boxplot(dat)
> class(bp)
[1] "list"

The above commands use the function rnorm() to generate a random sample of 10 numbers from the standard normal distribution. To make the result reproducible, we first call the function set.seed() to set the seed of the random number generator. If the seed is not set, the results are likely to differ on every run. We then use the function boxplot() to draw a box plot of the sample and save its return value as bp. The function class() shows the class of an object; here bp is a list. View the contents of this list:

> bp
$stats
            [,1]
[1,] -1.26506123
[2,] -0.56047565
[3,] -0.07983455
[4,]  0.46091621
[5,]  1.71506499

$n
[1] 10

$conf
           [,1]
[1,] -0.5901626
[2,]  0.4304935

$out
numeric(0)

$group
numeric(0)

$names
[1] "1"

The list bp contains several components. To view or use one of them, refer to it with the "$" symbol. For example, to view the component stats of the list bp, enter:

> bp$stats
            [,1]
[1,] -1.26506123
[2,] -0.56047565
[3,] -0.07983455
[4,]  0.46091621
[5,]  1.71506499
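When only the returned list is of interest, boxplot() can be called with plot = FALSE so that no figure is drawn. The sketch below uses this to inspect the components by name and to confirm that the third row of stats is the sample median; med is an illustrative name.

```r
set.seed(123)
dat <- rnorm(10)

# plot = FALSE returns the summary list without drawing the box plot
bp <- boxplot(dat, plot = FALSE)

# names() shows which components the list contains
print(names(bp))

# The third row of bp$stats is the median of the sample
med <- bp$stats[3, 1]
print(med)
print(all.equal(med, median(dat)))

# str() gives a compact overview of every component
str(bp)
```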

2.1.6 Data Frame

A data frame is a two-dimensional structure consisting of rows and columns, where rows represent observations or records and columns represent variables or indicators. It is similar to a data set in Excel, SAS, or SPSS. A data frame looks much like a matrix, and many matrix operations, such as selecting subsets, also apply to data frames. Unlike a matrix, the columns of a data frame can hold data of different modes (numeric, character, and so on). A data frame is created with the function data.frame(). For example, the following code creates a data frame with 5 observations and 4 variables:

> ID <- 1:5
> sex <- c("male", "female", "male", "female", "male")
> age <- c(25, 34, 38, 28, 52)
> pain <- c(1, 3, 2, 2, 3)
> pain.f <- factor(pain,
+                  levels = 1:3,
+                  labels = c("mild", "medium", "severe"))
> patients <- data.frame(ID, sex, age, pain.f)
> patients
  ID    sex age pain.f
1  1   male  25   mild
2  2 female  34 severe
3  3   male  38 medium
4  4 female  28 medium
5  5   male  52 severe

A data frame is essentially a list. To display or use a variable (column) of the data frame, use the "$" symbol followed by the variable name. For example:

> patients$age
[1] 25 34 38 28 52
> mean(patients$age)
[1] 35.4
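Since a data frame behaves both like a list and like a matrix, both styles of access work on it. The sketch below rebuilds the patients data frame so that it runs on its own, then selects rows and columns with matrix-style subscripts; over30 and age_col are illustrative names.

```r
patients <- data.frame(
  ID = 1:5,
  sex = c("male", "female", "male", "female", "male"),
  age = c(25, 34, 38, 28, 52),
  pain.f = factor(c(1, 3, 2, 2, 3),
                  levels = 1:3,
                  labels = c("mild", "medium", "severe"))
)

# List-style access: a data frame is a list of equal-length columns
print(is.list(patients))
age_col <- patients[["age"]]

# Matrix-style access: rows where age > 30, columns ID and age
over30 <- patients[patients$age > 30, c("ID", "age")]
print(over30)

# str() summarizes the type of every column
str(patients)
```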

Most structured medical data sets are presented in the form of data frames. Therefore, data frames are the most commonly processed data structure in this book. The operation of the data frame will be discussed in detail in Chapter 3.

2.1.7 Conversion of data types

When conducting data analysis, the analyst needs to know the type of the data, because the choice of analysis method depends closely on it. R provides a series of functions for testing the data type of an object, as well as functions for converting one data type to another. These functions all belong to the base package. Table 2-2 lists some of the commonly used ones.

Table 2-2 Functions for testing and converting data types

 

Functions whose names start with "is." return TRUE or FALSE, and functions whose names start with "as." convert an object to the corresponding type. For example:

> x <- c(2, 5, 8)
> is.numeric(x)
[1] TRUE
> is.vector(x)
[1] TRUE
> y <- as.character(x)
> y
[1] "2" "5" "8"
> is.numeric(y)
[1] FALSE
> is.character(y)
[1] TRUE
> z <- c(TRUE, FALSE, TRUE, FALSE)
> is.logical(z)
[1] TRUE
> as.numeric(z)
[1] 1 0 1 0
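Two conversions deserve some care in practice. The sketch below, using illustrative names, shows that as.numeric() applied to a factor returns the internal codes rather than the labels, and that strings which are not numbers become NA.

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

# as.numeric() on a factor gives the integer codes, not the labels
codes <- as.numeric(f)
print(codes)

# Converting through character recovers the numeric values
values <- as.numeric(as.character(f))
print(values)

# Strings that cannot be parsed as numbers become NA (with a warning,
# suppressed here so the example runs quietly)
bad <- suppressWarnings(as.numeric(c("2", "five", "8")))
print(bad)
```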

This article is excerpted from "R Language Medical Data Analysis Actual Combat"

 

  • An introduction to medical statistics, recommended by Professor Yu Songlin of Tongji Medical College, Huazhong University of Science and Technology
  • Emphasizes hands-on practice and application, highlighting the nature of each problem and the overall structure
  • Contains a large number of R program examples and graphics to give you a deeper understanding of data analysis

The book is divided into 14 chapters. Chapters 1 to 3 introduce the basic usage of the R language; Chapter 4 introduces data visualization; Chapter 5 introduces basic statistical analysis methods; Chapters 6 to 8 introduce the three regression models most commonly used in medical research; Chapter 9 introduces the basic methods of survival analysis; Chapters 10 to 12 introduce several commonly used multivariate statistical analysis methods; Chapter 13 introduces the statistical evaluation indicators for clinical diagnostic tests and how to calculate them; Chapter 14 introduces the meta-analysis methods commonly used in medical research practice.
This book is suitable for undergraduates and graduate students in clinical medicine, public health, and other medicine-related majors, and can also serve as a reference for students and researchers in other fields who are learning data analysis. By reading it, readers can not only learn to use R and related packages to solve practical problems quickly, but also gain a deeper understanding of data analysis.

Origin blog.csdn.net/epubit17/article/details/108586511