Data structures and type conversion in R

The articles and code are archived in the GitHub repository https://github.com/timerring/dive-into-AI and can also be obtained by replying "R language" to the public account AIShareLab.

The first step in any data analysis is to create a dataset in the desired format. In R, this task consists of two steps: first choose a data structure to store the data, and then enter or import the data into this data structure. The following describes the various data structures used in R to store data.

R data structures

  • In most cases, structured data is a dataset consisting of many rows and many columns. In R, such a dataset is called a data frame.
  • Before learning about data frames, let's get acquainted with some other data structures used to store data: vectors, factors, matrices, arrays, and lists.

1.1 Vector

A vector is a one-dimensional array used to store numeric, character, or logical data. A scalar can be thought of as a vector with only one element. The function c( ) can be used to create vectors, for example:

x1 <- c(2, 4, 1, -2, 5)
x2 <- c("one", "two", "three")
x3 <- c(TRUE, FALSE, TRUE, FALSE)

Here x1 is a numeric vector, x2 is a character vector, and x3 is a logical vector. All elements of a vector must have the same data type. To create vectors that follow a regular pattern, R provides some convenient operators and functions, such as:

x4 <- 1:5     # equivalent to x4 <- c(1, 2, 3, 4, 5)
x5 <- seq(from = 2, to = 10, by = 2)  # equivalent to x5 <- c(2, 4, 6, 8, 10)
x6 <- rep("a", times = 4)  # equivalent to x6 <- c("a", "a", "a", "a")
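
The rule that all elements of a vector share one type can be checked with a small sketch. The names v1 and v2 are hypothetical and do not appear in the original text; the point is that c( ) does not reject mixed input but coerces it to a common type.

```r
# Mixing types in c( ) does not fail; R coerces to a common type.
v1 <- c(1, TRUE, FALSE)   # logical values become numeric: 1 1 0
v2 <- c(1, "a", TRUE)     # everything becomes character: "1" "a" "TRUE"
class(v1)                 # "numeric"
class(v2)                 # "character"
```

It is therefore worth checking class( ) after building a vector from mixed sources.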

Sometimes we only need part of a vector, that is, a subset of it. Suppose x is the sequence that starts at 3 and increases in steps of 7 without exceeding 100. What is the value of its 5th element?

x <- seq(from = 3, to = 100, by = 7)
# Show the 5th element
x[5]
# Show the 4th, 6th, and 7th elements
x[c(4, 6, 7)]

The numbers in square brackets "[ ]" are called subscripts (indices); they specify positions in the vector. In the commands above, x[5] is the 5th element of the vector, whose value is 31.

Subscripts can also be negative, which removes the elements at the specified positions. For example, to remove the first 4 elements of x, enter the following code (note the parentheses in the command):

x[-(1:4)]
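
Why the parentheses matter can be seen in the following sketch. The variable res and the use of tryCatch( ) are additions for illustration only; they are not part of the original example.

```r
x <- seq(from = 3, to = 100, by = 7)
length(x)                 # 14 elements: 3, 10, ..., 94
x[-(1:4)]                 # drops positions 1 to 4, leaving 10 elements

# Without parentheses, -1:4 means (-1):4, i.e. -1, 0, 1, 2, 3, 4.
# Mixing negative and positive subscripts is an error in R.
res <- tryCatch(x[-1:4], error = function(e) "error")
res                       # "error"
```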

Operations in R are vectorized, for example:

weight <- c(68, 72, 57, 90, 65, 52)
height <- c(1.75, 1.80, 1.65, 1.90, 1.72, 1.65)
bmi <- weight / height ^ 2
bmi

In the calculation of bmi above, the operators "/" and "^" are applied element by element, so the result is again a vector. If the vectors involved have different lengths, R recycles the shorter vector until it matches the length of the longer one. When the longer length is not a multiple of the shorter one, R also gives a warning message.

a <- 1:5
b <- 1:3
a + b
# Warning message in a + b:
# “longer object length is not a multiple of shorter object length”
# 2 4 6 5 7
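
For comparison, here is a minimal sketch of recycling that produces no warning, because the longer length is a multiple of the shorter one. The vectors p and q are hypothetical names chosen for this illustration.

```r
p <- 1:6
q <- c(10, 20)
p + q     # q is recycled three times: 11 22 13 24 15 26

# A single number is a vector of length 1 and is recycled the same way.
p * 2     # 2 4 6 8 10 12
```

Because recycling can be silent, it is worth checking vector lengths before combining data from different sources.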

Commonly Used Statistical Functions

Function      Description
length(x)     number of elements in x
mean(x)       arithmetic mean of x
median(x)     median of x
var(x)        sample variance of x
sd(x)         sample standard deviation of x
range(x)      range of x (minimum and maximum)
min(x)        minimum value of x
max(x)        maximum value of x
quantile(x)   quantiles of x
sum(x)        sum of all elements in x
scale(x)      standardize x (center and scale)
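
As a quick illustration, several of these functions can be applied to the vector x1 from section 1.1 (called x here). The variable z is a hypothetical name; note that scale( ) returns a one-column matrix rather than a plain vector.

```r
x <- c(2, 4, 1, -2, 5)
length(x)     # 5
mean(x)       # 2
median(x)     # 2
var(x)        # 7.5
sd(x)         # square root of 7.5
range(x)      # -2 5
sum(x)        # 10
z <- scale(x) # one-column matrix with mean 0 and standard deviation 1
```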

1.2 Factor

In general, variables can be divided into numerical, nominal and ordinal.

Nominal variables are categorical variables without an order relationship, such as a person's gender, blood type, or ethnicity. Ordinal variables are categorical variables whose categories have a natural order, such as a patient's condition (poor, improved, excellent). In R, nominal and ordinal variables are represented as factors.

Factors are very important in R because they determine how data is presented and analyzed. In stored data, categorical variables are often coded as vectors of integers. Therefore, before data analysis, it is often necessary to convert them into factors with the function factor( ).

# First define a variable sex, where 1 denotes male and 2 denotes female.
sex <- c(1, 2, 1, 1, 2, 1, 2)
# Then convert sex to a factor with factor( ) and store it as sex.f.
# The parameter levels gives the category values of the original variable,
# and the parameter labels gives the label for each level.
sex.f <- factor(sex,
                levels = c(1, 2),
                labels = c("Male", "Female"))
sex.f
# ============ Output =============
# Male Female Male Male Female Male Female
# Levels: Male Female

Note that the values of these two parameters must correspond one-to-one, because R pairs them by position. The difference between a factor and an ordinary character variable is that a factor has a levels attribute. The levels of a factor can be viewed with the function levels( ):

levels(sex.f)
# 'Male' 'Female'
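
The integer storage behind a factor can be inspected directly. The following sketch rebuilds sex.f so that it runs on its own; as.integer( ) and table( ) are base R functions that the original text does not use at this point.

```r
sex <- c(1, 2, 1, 1, 2, 1, 2)
sex.f <- factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
as.integer(sex.f)     # underlying integer codes: 1 2 1 1 2 1 2
levels(sex.f)         # "Male" "Female"
table(sex.f)          # counts per level: Male 4, Female 3
is.character(sex.f)   # FALSE: a factor is not a character vector
```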

Change the sorting order of factor levels → change reference group

In statistical models, R treats the first level of a factor as the reference group. We often need to change the order of the factor levels in order to change the reference group, which can be done in two ways. The first is to change the order of the parameters levels and labels in the function factor( ), for example:

sex.f1 <- factor(sex, levels = c(2, 1), labels = c("Female", "Male"))
sex.f1
# Male Female Male Male Female Male Female
# Levels: Female Male

The second way is to use the function relevel( ):

sex.f1 <- relevel(sex.f, ref = "Female")
sex.f1
# Male Female Male Male Female Male Female
# Levels: Female Male
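
A short check confirms that both methods lead to the same level order while leaving the observations untouched. The names by_factor and by_relevel are hypothetical and used only in this sketch.

```r
sex <- c(1, 2, 1, 1, 2, 1, 2)
sex.f <- factor(sex, levels = c(1, 2), labels = c("Male", "Female"))

by_factor  <- factor(sex, levels = c(2, 1), labels = c("Female", "Male"))
by_relevel <- relevel(sex.f, ref = "Female")

levels(by_factor)    # "Female" "Male"
levels(by_relevel)   # "Female" "Male"

# The observations themselves are unchanged; only the level order differs.
all(as.character(by_relevel) == as.character(sex.f))   # TRUE
```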

Ordered factors: ordered = TRUE

To represent an ordered factor, the parameter ordered = TRUE needs to be specified in the function factor( ). For example:

status <- c(1, 2, 2, 3, 1, 2, 2)
status.f <- factor(
  status,
  levels = c(1, 2, 3),
  labels = c("Poor", "Improved", "Excellent"),
  ordered = TRUE
)
status.f
# Poor Improved Improved Excellent Poor Improved Improved
# Levels: Poor < Improved < Excellent
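
What an ordered factor adds over an unordered one is that its levels can be compared. A minimal sketch, rebuilding status.f so that it runs on its own:

```r
status <- c(1, 2, 2, 3, 1, 2, 2)
status.f <- factor(status, levels = c(1, 2, 3),
                   labels = c("Poor", "Improved", "Excellent"),
                   ordered = TRUE)

is.ordered(status.f)          # TRUE
status.f >= "Improved"        # FALSE TRUE TRUE TRUE FALSE TRUE TRUE
sum(status.f >= "Improved")   # 5
max(status.f)                 # Excellent
```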

1.3 Matrix

A matrix is a two-dimensional array consisting of rows and columns. All elements of a matrix have the same mode (numeric, character, or logical). In most cases the elements are numeric; such matrices support many mathematical operations and are used in statistical computations such as factor analysis and generalized linear models.

1.3.1 Create: matrix( )

The function matrix( ) is often used to create matrices, for example:

M <- matrix(1:6, nrow = 2)
M

R automatically calculates the number of columns from the length of the vector and the number of rows set by the parameter nrow. The parameter byrow defaults to FALSE, that is, the values are filled in column by column. To fill by row, set the parameter byrow to TRUE.
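
The effect of byrow is easiest to see side by side. The names by_col and by_row are hypothetical and used only in this sketch.

```r
by_col <- matrix(1:6, nrow = 2)                 # default: fill column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row
by_col[1, ]   # first row: 1 3 5
by_row[1, ]   # first row: 1 2 3
dim(by_row)   # 2 3
```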

Common matrix operations can be implemented in R, such as matrix addition, matrix multiplication, matrix inversion, matrix transposition, determinant of square matrix, eigenvalue and eigenvector of square matrix, etc.

1.3.2 Multiplication: %*%

Matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second matrix; its operator is %*%.

First create two matrices:

mat1 <- matrix(1:6, nrow = 3)
mat1
mat2 <- matrix(5:10, nrow = 2)
mat2
# The function dim( ) returns the dimensions of a matrix, i.e. the numbers of rows and columns
dim(mat1)
# 3 2
dim(mat2)
# 2 3
mat1 %*% mat2
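
The result and the dimension requirement can be checked as follows. The names prod12 and res, and the use of tryCatch( ), are additions for this sketch and not part of the original example.

```r
mat1 <- matrix(1:6, nrow = 3)    # 3 x 2
mat2 <- matrix(5:10, nrow = 2)   # 2 x 3
prod12 <- mat1 %*% mat2          # 3 x 3 result
prod12[1, 1]                     # 1*5 + 4*6 = 29
dim(prod12)                      # 3 3

# mat1 %*% mat1 is not defined (2 columns vs. 3 rows), so R raises an error.
res <- tryCatch(mat1 %*% mat1, error = function(e) "non-conformable")
res                              # "non-conformable"
```

Note that mat1 * mat2 with the plain "*" operator would attempt element-wise multiplication instead, which is a different operation.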

1.3.3 Transpose: t( )

The transpose of a matrix exchanges its rows and columns. For example, to find the transpose of the matrix mat1:

t(mat1)

1.3.4 Determinant and inverse matrix: det( ), solve( )

The determinant and the inverse of a square matrix can be computed with the functions det( ) and solve( ), respectively, for example:

mat3 <- matrix(1:4, nrow = 2)
det(mat3)
# -2
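
The example above only shows det( ). A minimal sketch of solve( ) on the same matrix follows; the name inv3 is hypothetical.

```r
mat3 <- matrix(1:4, nrow = 2)
inv3 <- solve(mat3)    # inverse of mat3
inv3                   # rows: (-2, 1.5) and (1, -0.5)
mat3 %*% inv3          # identity matrix, up to rounding error
```

Since the determinant is nonzero, the inverse exists; for a singular matrix solve( ) stops with an error.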

1.3.5 Sum or average by row and column: rowSums, colSums, rowMeans, colMeans

For example:

rowSums(mat1)
colSums(mat1)
rowMeans(mat1)
colMeans(mat1)
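
For mat1 the results are easy to verify by hand. The sketch below also uses the base R function apply( ), which is not mentioned in the original text, to show the general pattern these four functions are shortcuts for.

```r
mat1 <- matrix(1:6, nrow = 3)
rowSums(mat1)           # 5 7 9
colMeans(mat1)          # 2 5
apply(mat1, 1, sum)     # same values as rowSums(mat1); 1 means rows
apply(mat1, 2, mean)    # same values as colMeans(mat1); 2 means columns
```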

1.4 Arrays

The term array usually refers to a multidimensional array, which is similar to a matrix but has more than 2 dimensions. Arrays have a special dimension attribute (dim).

The following commands define an array by adding dimensions to a vector. Pay attention to the order in which the values are filled in.

Since arrays are not displayed well in a notebook, it is recommended to display them with print( ). The code below therefore calls print( ) when displaying an array.

A <- 1:24
dim(A) <- c(3, 4, 2)
# A   # arrays are not displayed well in a notebook; print( ) solves this
print(A)

The array above can also be created with the function array( ), which can add names to each dimension.

dim1 <- c("A1", "A2", "A3")
dim2 <- c("B1", "B2", "B3", "B4")
dim3 <- c("C1", "C2")
print(array(1:24, dim = c(3, 4, 2), dimnames = list(dim1, dim2, dim3)))
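
Individual elements of an array are selected with one subscript per dimension, by position or by name. A minimal sketch, storing the named array in a variable A for illustration:

```r
A <- array(1:24, dim = c(3, 4, 2),
           dimnames = list(c("A1", "A2", "A3"),
                           c("B1", "B2", "B3", "B4"),
                           c("C1", "C2")))
A[2, 3, 1]            # 8
A["A2", "B3", "C2"]   # 20
dim(A[, , 1])         # 3 4: one layer is an ordinary matrix
```

The values follow from the fill order: the first dimension varies fastest, then the second, then the third.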

1.5 List

The list is the most flexible and most complex data structure in R, and it can be composed of objects of different types. For example, a list can combine vectors, arrays, tables, and objects of any other type.

list1 <- list(a = 1, b = 1:5, c = c("red", "blue", "green"))
list1
# $a
# 1
# $b
# 1 2 3 4 5
# $c
# 'red' 'blue' 'green'
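
The elements of a list can be extracted in several ways. The sketch below contrasts "$" and "[[ ]]", which return the element itself, with single brackets, which return a shorter list.

```r
list1 <- list(a = 1, b = 1:5, c = c("red", "blue", "green"))
list1$b             # the element itself: 1 2 3 4 5
list1[["c"]][2]     # "blue"
class(list1["a"])   # "list": single brackets return a sub-list
length(list1)       # 3
```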

Creating lists by hand is not a common task in everyday data analysis. However, the return value of many functions is a list. For example:

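# To make the result reproducible, set the seed of the random number generator with set.seed( ) first. Without a seed, the result will most likely differ on every run.
set.seed(123)
# Use rnorm( ) to draw a random sample of 10 numbers from the standard normal distribution.
dat <- rnorm(10)
# Use boxplot( ) to draw a box plot of this random sample and save the result as bp.
bp <- boxplot(dat)
# The function class( ) shows the type of an object; here bp is a list.
class(bp)
# 'list'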

The list bp contains multiple objects; typing bp at the console displays all of them. To view or use a particular object, refer to it with the symbol "$". For example, to view the contents of the object stats in the list bp, type bp$stats. If you are interested in the other objects in the list, see the documentation of boxplot.stats.
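
A minimal sketch of inspecting the list returned by boxplot( ). It assumes the argument plot = FALSE, which is not used in the original example, so that the statistics are computed without drawing the plot.

```r
set.seed(123)
dat <- rnorm(10)
bp <- boxplot(dat, plot = FALSE)   # plot = FALSE: compute only, draw nothing
names(bp)     # "stats" "n" "conf" "out" "group" "names"
bp$stats      # 5 x 1 matrix: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$n          # 10
```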

1.6 Data frame

A data frame is a two-dimensional structure consisting of rows and columns, where rows represent observations or records and columns represent variables or indicators. Data frames are similar to data sets in Excel, SAS, and SPSS. Data frames look very similar to matrices, and many operations of matrices also apply to data frames, such as subset selection.

Unlike a matrix, different columns in a data frame can be data in different modes (numeric, character, etc.). Data frames can be created with the function data.frame( ). For example, the following code creates a data frame with 5 observations and 4 variables:

ID <- 1:5
sex <- c("male", "female", "male", "female", "male")
age <- c(25, 34, 38, 28, 52)
pain <- c(1, 3, 2, 2, 3)      
pain.f <- factor(pain, levels = 1:3, labels = c("mild", "medium", "severe"))   
patients <- data.frame(ID, sex, age, pain.f)
patients

A data frame is essentially a list. To display or use a variable (column) of the data frame, use the symbol "$" followed by the variable name. For example:

patients$age
mean(patients$age)
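
The following sketch shows the list nature of a data frame together with matrix-style subset selection. It uses a reduced version of patients (without pain.f) so that it runs on its own; the name older is hypothetical.

```r
patients <- data.frame(
  ID  = 1:5,
  sex = c("male", "female", "male", "female", "male"),
  age = c(25, 34, 38, 28, 52)
)
is.list(patients)                        # TRUE
dim(patients)                            # 5 3
patients[["age"]]                        # same as patients$age
older <- patients[patients$age > 30, ]   # rows where age exceeds 30
nrow(older)                              # 3
mean(patients$age)                       # 35.4
```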

Most structured medical datasets are presented as data frames, so data frames are the most commonly processed data structures.

Data type conversion: is., as.

When performing data analysis, analysts need to be familiar with the types of their data, because the choice of analysis method is closely tied to the data type. R provides a series of functions for testing the data type of an object, as well as functions for converting one data type to another. These functions all live in the base package; some commonly used ones are listed below:

Functions for testing and converting data types

Test               Convert
is.numeric( )      as.numeric( )
is.character( )    as.character( )
is.logical( )      as.logical( )
is.factor( )       as.factor( )
is.vector( )       as.vector( )
is.matrix( )       as.matrix( )
is.array( )        as.array( )
is.data.frame( )   as.data.frame( )
is.list( )         as.list( )
is.table( )        as.table( )

Functions beginning with is. return either TRUE or FALSE, and functions beginning with as. convert the object to the corresponding type. For example:

x <- c(2, 5, 8)
is.numeric(x)
# TRUE
is.vector(x)
# TRUE
y <- as.character(x)
y
# '2' '5' '8'
is.numeric(y)
# FALSE
is.character(y)
# TRUE
z <- c(TRUE, FALSE, TRUE, FALSE)
is.logical(z)
# TRUE
as.numeric(z)
# 1 0 1 0
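
Two conversions deserve extra care. The sketch below uses hypothetical variables f and bad: applying as.numeric( ) to a factor returns its integer level codes rather than the labels, and strings that do not represent numbers become NA.

```r
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # level codes: 1 2 1
as.numeric(as.character(f))    # original values: 10 20 10

# Values that cannot be converted become NA (with a warning).
bad <- suppressWarnings(as.numeric(c("3.5", "abc")))
bad                            # 3.5 NA
```

Converting through as.character( ) first is the usual way to recover numbers stored as factor labels.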

Reference: Zhao Jun, "Practical Combat of R Language Medical Data Analysis"

Origin blog.csdn.net/m0_52316372/article/details/132372102