R language strategies for processing large data sets

The article and code are archived in the GitHub repository https://github.com/timerring/dive-into-AI ; they can also be obtained from the public account [AIShareLab] by replying "R language".

In practice, data analysts may face data sets with hundreds of thousands of records and hundreds of variables. Processing such a large data set demands a relatively large amount of memory, so try to use a 64-bit operating system and a machine with plenty of RAM; otherwise the analysis may take too long or even become impossible. In addition, effective strategies for handling the data can greatly improve the efficiency of the analysis.

1. Clean up your workspace

In order to obtain the largest possible memory space for data analysis, it is recommended to first clean up the workspace when starting any new analysis project.

# rm(list = ls(all = TRUE))

The function ls() lists the objects in the current workspace. Its all.names argument (abbreviated to all above) defaults to FALSE; setting it to TRUE ensures that all objects are cleared, including hidden ones whose names begin with a dot.

In addition, during the analysis, use rm(object1, object2, ...) to remove temporary objects and objects that are no longer needed in a timely manner.
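For example (the object names below are hypothetical), intermediate results can be removed as soon as they are no longer needed, and gc() can be called afterwards to prompt R to release the freed memory:

# Hypothetical temporary objects created during an analysis
tmp_matrix <- matrix(rnorm(1e6), ncol = 100)
tmp_means  <- colMeans(tmp_matrix)
# Remove them once they are no longer needed, then run garbage collection
rm(tmp_matrix, tmp_means)
gc()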

2. Quickly read .csv files

.csv files take up little space and can be viewed and generated by Excel, so they are widely used to store data. The read.csv() function introduced earlier reads .csv files easily, but for large data sets it is too slow and sometimes even throws an error. In that case, use the read_csv() function from the readr package or the fread() function from the data.table package, the latter being faster (roughly twice as fast as the former).

The data.table package provides an enhanced version of the data frame that greatly speeds up data manipulation. It is especially suitable for users who need to process large data sets (say, 1 GB to 100 GB) in memory. However, its syntax differs considerably from other R packages, and it takes some time to learn.
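As a quick illustration (the file name bigdata.csv is hypothetical; substitute any large .csv file), the two faster readers are used as follows:

# install.packages(c("readr", "data.table"))  # install first if necessary
library(readr)
library(data.table)

dat1 <- read_csv("bigdata.csv")                   # returns a tibble
dat2 <- fread("bigdata.csv")                      # returns a data.table, usually the fastest
dat3 <- fread("bigdata.csv", data.table = FALSE)  # return an ordinary data frame instead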

3. Simulate a large data set

For ease of explanation, a large data set is simulated below, which contains 50,000 records and 200 variables.

bigdata <- as.data.frame(matrix(rnorm(50000 * 200), ncol = 200))
# Two nested for loops and R's built-in constant letters (the lowercase English letters) are used to name the 200 variables.
varnames <- NULL
# The outer loop supplies the first character of each variable name (a ~ t).
for (i in letters[1:20]) {
  # The inner loop appends the numbers 1 ~ 10 to each letter, using "_" as the separator.
  for (j in 1:10) {
    # The function paste() concatenates strings.
    varnames <- c(varnames, paste(i, j, sep = "_"))
  }
}
names(bigdata) <- varnames
names(bigdata)

If you prefer not to use nested loops, consider the following alternatives:

# Unfortunately, apply() introduces extra spaces here, because as.matrix()
# pads the numeric column to a common width.
# apply(expand.grid(1:10, letters[1:20]), 1, function(x) paste(x[2], x[1], sep = "_"))
# sprintf("%s_%s", expand.grid(1:10, letters[1:20])[, 2], expand.grid(1:10, letters[1:20])[, 1])

# Or:
# as.vector(t(outer(letters[1:20], 1:10, paste, sep = "_")))

4. Eliminate unnecessary variables

Before conducting the formal analysis, it helps to drop variables that are not needed for the moment, in order to reduce the memory burden. The select family of functions in the dplyr package comes in handy here, especially when combined with the starts_with(), ends_with(), and contains() helpers from the tidyselect package.

First load these two packages:

library(dplyr)
library(tidyselect)

Next, we will give an example of how to use the select series of functions to select or eliminate variables.

subdata1 <- select(bigdata, starts_with("a"))
names(subdata1)
# "a_1" "a_2" "a_3" "a_4" "a_5" "a_6" "a_7" "a_8" "a_9" "a_10"
subdata2 <- select(bigdata, ends_with("2"))
names(subdata2)
# "a_2" "b_2" "c_2" "d_2" "e_2" "f_2" "g_2" "h_2" "i_2" "j_2" "k_2" "l_2" "m_2" "n_2" "o_2" "p_2" "q_2" "r_2" "s_2" "t_2"

The starts_with() and ends_with() helpers match the prefix and suffix of variable names, respectively. In the commands above, subdata1 selects all variables whose names begin with a, while subdata2 selects all variables whose names end with 2.

If you want to select all variables starting with a or b, you can use the following command:

# subdata3 <- select(bigdata, c(starts_with("a"), starts_with("b")))
subdata3 <- select_at(bigdata, vars(starts_with("a"), starts_with("b"))) # note: the syntax differs slightly from select()
names(subdata3)

To select all variables whose names contain certain characters, use the contains() helper. For example, to select all variables whose names contain the character 1, enter the following command:

# subdata4 <- select(bigdata, c(contains("1")))
subdata4 <- select_at(bigdata, vars(contains("1")))
names(subdata4)

Note that all variables ending with 10 also contain the character 1.
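If you want only the variables ending in 1 and not those ending in 10, a regular expression via the matches() helper (also from tidyselect) avoids this problem. A small sketch (the object name subdata4b is just for illustration):

# "_1$" matches names ending in "_1" exactly, so a_10, b_10, ... are excluded
subdata4b <- select_at(bigdata, vars(matches("_1$")))
names(subdata4b)
# "a_1" "b_1" "c_1" ... "t_1"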

If you want to exclude certain variables, simply put a minus sign (-) in front of starts_with(), ends_with(), or contains(). For example, to drop all variables whose names contain 1 or 5, use the following command:

# subdata5 <- select(bigdata, c(-contains("1"), -contains("5")))
subdata5 <- select_at(bigdata, vars(-contains("1"), -contains("5")))
names(subdata5)
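Note that select_at() and vars() are superseded in dplyr 1.0.0 and later; the same selections can be written directly inside select() using the Boolean operators supported by tidyselect. A rough equivalent of the examples above, assuming a recent version of dplyr:

# Variables starting with "a" or "b" (same result as subdata3)
select(bigdata, starts_with("a") | starts_with("b"))
# Variables whose names contain neither "1" nor "5" (same result as subdata5)
select(bigdata, !contains("1") & !contains("5"))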

5. Select a random sample of the data set

Processing all records of a large data set often reduces the efficiency of analysis. When writing code, you can extract only part of the records to test the program in order to optimize the code and eliminate bugs.

# The size argument specifies the number of rows to draw.
sampledata1 <- sample_n(subdata5, size = 500)
nrow(sampledata1)
# 500
# The size argument specifies the proportion of all rows to draw.
sampledata2 <- sample_frac(subdata5, size = 0.02)
nrow(sampledata2)
# 1000

The functions sample_n() and sample_frac() both draw rows at random from a data frame. In the former, the size argument specifies the number of rows; in the latter, it specifies the proportion of all rows.

It should be noted that the strategies discussed above are only suitable for gigabyte-scale data sets. Whatever tool you use, working with data sets ranging from terabytes to petabytes is a challenge. Several R packages can handle terabyte-scale data, such as RHIPE, RHadoop, and RevoScaleR. Their learning curves are relatively steep and require some understanding of high-performance computing; you can explore them on your own if necessary, but they are not covered here.

sample_n() and sample_frac() have been superseded; the dplyr documentation recommends using slice_sample() instead.

# Use slice_sample() instead.
sampledata1 <- slice_sample(subdata5, n = 500)
nrow(sampledata1)
sampledata2 <- slice_sample(subdata5, prop = 0.02)
nrow(sampledata2)
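Since the rows are drawn at random, the subset changes on every run. When debugging code, it can be useful to fix the random seed first so that the same subset is reproduced each time, for example:

set.seed(123)  # any fixed seed makes the draw reproducible
sampledata3 <- slice_sample(subdata5, n = 500)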

Origin: https://blog.csdn.net/m0_52316372/article/details/132534028