R parallel computing - parallel example 1

Foreword:

As a rule of thumb, parallel processing is worth considering when a job takes more than about 3 minutes to run.

This may sound complicated, but parallel computing is simple.

When you have a repetitive task and it takes up too much of your precious time, why not use parallel computing to save time?
Even if you have a single task, you can benefit from parallel processing by breaking the task into smaller parts.

Two widely used parallel processing packages are parallel and foreach.

1- Parallel computing preparation stage:

The main purpose of using parallel computing in R is to improve running speed. By default, R runs on a single core, while current computers are multi-core. If only one core runs the program and the other cores sit idle, those resources are wasted.
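To see how many cores a machine offers before building a cluster, a minimal sketch follows. Leaving one core free for the main R session is a common convention and an assumption here, not something the original text prescribes; `n_workers` is an illustrative name.

```r
library(parallel)

n_cores <- detectCores()  # may return NA on some platforms
if (is.na(n_cores)) n_cores <- 1L

# Leave one core for the operating system and the main R session
# (a common convention, not a requirement)
n_workers <- max(1L, n_cores - 1L)
print(n_workers)
```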

library(parallel)

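# Set the number of cores and create the cluster
num_cores <- detectCores()
cl <- makeCluster(num_cores)

# Run the task in parallel (`data` and `your_function` are placeholders)
result <- parLapply(cl, data, your_function)

# Shut down the cluster
stopCluster(cl)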

Process: set the number of cores for parallel computing --> run the parallel computation --> shut down the cluster.

Whichever parallel computing package you use, the workflow follows these three steps: (1) set the number of cores and create the cluster; (2) run the computation in parallel; (3) shut the cluster down.
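The three steps can be tried end to end with concrete inputs. In this sketch the toy input, the `square` helper, and the choice of 2 workers are assumptions for illustration only; they are not part of the original example.

```r
library(parallel)

# Hypothetical worker function and toy input (not from the original post)
square <- function(x) x^2
inputs <- as.list(1:8)

# Step 1: create a small cluster (2 workers is an arbitrary choice)
cl <- makeCluster(2)

# Step 2: run the task in parallel; the finally clause guarantees that
# step 3 happens even if the computation fails
squares <- tryCatch(
  parLapply(cl, inputs, square),
  finally = stopCluster(cl)  # Step 3: shut the cluster down
)

squares <- unlist(squares)
print(squares)
```

Wrapping the call in `tryCatch(..., finally = ...)` is one way to avoid leaving worker processes running after an error.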

2- Comparison of various methods

2.0 Generate the data:

# create test data
set.seed(1234)
df <- data.frame(matrix(data = rnorm(1e7),  ncol = 1000))
dim(df)

Goal: sum each row of this data frame (10,000 rows by 1,000 columns).

2.1 Using the For loop

Run time: 3.83 mins (timings vary between runs; the two runs recorded in the comments below took 2.77 and 2.40 mins)

# for Example 1: append each row sum with c()
times1 <- Sys.time()
results <- c()

for (i in 1:dim(df)[1]) {
  results <- c(results, sum(df[i,]))
}

times2 <- Sys.time()
print(times2 - times1) 
#2.77314 mins


# for Example 2: assign each row sum by index
times1 <- Sys.time()
results <- c()

for (i in 1:dim(df)[1]) {
  results[i] <- sum(df[i,])
}

times2 <- Sys.time()
print(times2 - times1) 
#2.404386 mins
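Both loops above grow `results` one element at a time. A common variant is to preallocate the result vector and time the loop with `system.time()`. The sketch below uses a much smaller stand-in data frame, `small_df` (an assumption, chosen so the loop finishes quickly), so its timing is not comparable to the figures above.

```r
# Smaller stand-in for df so the loop finishes quickly (an assumption;
# the original test uses 10,000 rows by 1,000 columns)
set.seed(1234)
small_df <- data.frame(matrix(rnorm(200 * 50), ncol = 50))

n <- nrow(small_df)
loop_sums <- numeric(n)  # preallocate instead of growing the vector

elapsed <- system.time(
  for (i in seq_len(n)) {
    loop_sums[i] <- sum(small_df[i, ])
  }
)

print(elapsed["elapsed"])
```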

2.2 Using the apply function

When it comes to loops, we usually think of for and while loops, but the apply function family is often a good replacement for them.

# Method 2: apply() over rows
times1 <- Sys.time()
apply(df,1,sum)

times2 <- Sys.time()
print(times2 - times1) 
#0.5269971 secs

2.3 Using the rowSums() function from base R

# Method 3: rowSums()
times1 <- Sys.time()
rowSums(df)
times2 <- Sys.time()
print(times2 - times1)
#0.146533 secs
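Before comparing speed, it is worth confirming that the methods return the same values. A minimal check on a small stand-in data frame (`small_df` is an assumption, not the original `df`) could look like this:

```r
set.seed(1234)
small_df <- data.frame(matrix(rnorm(200 * 50), ncol = 50))

by_apply   <- apply(small_df, 1, sum)
by_rowsums <- rowSums(small_df)

# Both return one value per row; compare values, ignoring names
same <- isTRUE(all.equal(unname(by_apply), unname(by_rowsums)))
print(same)
```

`all.equal()` is used instead of `identical()` because the two functions may accumulate floating-point sums slightly differently.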

2.4 Using the parallel package

Here, the data is split into 8 parts, one per core (1:8), which gives a list object. The parLapply() function then runs the calculation on each part.

# Method 4: parallel package
# load R Package
library(parallel)
# check cores numbers
detectCores()
# set cores numbers
num_cores <- 8
# start times
times1 <- Sys.time()
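# split data into contiguous blocks of rows, one per core, so that
# unlist() later returns the sums in the original row order
chunks <- split(df, f = rep(1:num_cores, each = ceiling(nrow(df) / num_cores), length.out = nrow(df)))
class(chunks) # list
length(chunks)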
# create parallel
cl <- makeCluster(num_cores)

# computed in parallel
results <- parLapply(cl, chunks, function(chunk){
  apply(chunk, 1, sum)
})

# Turn off the cluster for parallel computing
stopCluster(cl)

# combine result
final_result <- unlist(results)

times2 <- Sys.time()

print(times2 - times1) 
#3.047416 secs
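With any chunked approach, it helps to verify that the combined result lines up with the original row order. The sketch below is one possible variant, not the method timed above: it uses `splitIndices()` from the parallel package to build contiguous row blocks, calls `rowSums()` inside each worker, and compares against a serial result. The small `small_df` and the 2 workers are assumptions for illustration.

```r
library(parallel)

set.seed(1234)
small_df <- data.frame(matrix(rnorm(200 * 50), ncol = 50))
num_workers <- 2  # arbitrary small value for the sketch

# splitIndices() returns contiguous blocks of row indices, one per worker
index_blocks <- splitIndices(nrow(small_df), num_workers)
chunks <- lapply(index_blocks, function(idx) small_df[idx, , drop = FALSE])

cl <- makeCluster(num_workers)
chunk_sums <- tryCatch(
  parLapply(cl, chunks, rowSums),
  finally = stopCluster(cl)
)

# Blocks are contiguous, so concatenating restores the original row order
parallel_sums <- unlist(chunk_sums, use.names = FALSE)
print(isTRUE(all.equal(parallel_sums, unname(rowSums(small_df)))))
```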

2.5 Using the foreach package

# Install the packages once if they are not already available
install.packages("foreach")
install.packages("doParallel")
library(foreach)
library(doParallel)
# Create a cluster and register it
cl <- makeCluster(8)
registerDoParallel(cl)


# Run the parallel computation
time1 <- Sys.time()
x2 <- foreach(i = 1:dim(df)[1], .combine = c) %dopar% {
  sum(df[i,])
}
time2 <- Sys.time()
print(time2-time1)

# Don't forget to shut down the cluster when the computation is done
stopImplicitCluster()
stopCluster(cl)
# 53.63808 secs
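The loop above creates one task per row, so 10,000 very small tasks are sent to the workers; that per-task overhead is a plausible reason it is slower than apply(), although this has not been measured here. One possible variant hands each worker a block of rows instead. In the sketch below, `small_df`, the 2 workers, and the fallback branch (used only when foreach or doParallel is not installed) are assumptions for illustration.

```r
library(parallel)

set.seed(1234)
small_df <- data.frame(matrix(rnorm(200 * 50), ncol = 50))
index_blocks <- splitIndices(nrow(small_df), 2)

have_pkgs <- requireNamespace("foreach", quietly = TRUE) &&
  requireNamespace("doParallel", quietly = TRUE)

if (have_pkgs) {
  library(foreach)
  library(doParallel)
  cl <- makeCluster(2)
  registerDoParallel(cl)
  # One task per block of rows instead of one task per row
  block_sums <- tryCatch(
    foreach(idx = index_blocks, .combine = c) %dopar% {
      unname(rowSums(small_df[idx, , drop = FALSE]))
    },
    finally = stopCluster(cl)
  )
} else {
  # Fallback so the sketch still runs without the two packages
  block_sums <- unlist(lapply(index_blocks, function(idx) {
    unname(rowSums(small_df[idx, , drop = FALSE]))
  }))
}

print(length(block_sums))
```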

Reference:

Rtips multi-core parallel computing