R Parallel Computing (1): Using the foreach Function

What the numbers of CPU cores and threads mean

 

The number of cores of a processor generally refers to the number of physical cores. A dual-core processor contains two independent CPU core units, and a quad-core processor contains four; these cores are the central computing units that process data.

Generally one core corresponds to one thread, but Intel's hyper-threading technology lets a single core run two threads, so a 6-core CPU can run 12 threads. The thread count is a logical concept: the number of virtual CPU cores.

As an analogy, imagine the CPU as a bank: the CPU cores are the tellers, and the threads are the service windows. The more tellers and windows there are, the more business can be handled at the same time, and the faster the service.

Normally one teller staffs one window. With hyper-threading, one teller manages two windows at once, handling both with the left and right hands, which greatly improves the core's utilization and speeds up processing.

Why does R need parallel computing?

From a memory perspective, R uses an in-memory computing model: the data to be processed must first be loaded into memory (RAM). The advantage is high computational efficiency and speed; the disadvantage is that the size of problems it can handle is limited by the amount of RAM.

On the other hand, R works on only a single thread by default, so leaving all other threads idle is clearly a waste.

For example, if R ran on a CPU with 260 cores, a single-threaded R program could use at most 1/260 of the computing power, wasting the other 259 cores. Today's computers generally have 4-16 cores; without parallel computing, a high-end machine and a low-end one run R code at roughly the same speed.

Parallel computing lets multiple threads, or all of them, work at the same time.

R parallel computing

Check the number of physical cores and threads of your computer

library(parallel)
detectCores(logical = FALSE)  # number of physical cores
# 18

install.packages("future")
library(future)
availableCores()  # number of usable threads
# 36

Example 1

Use a for loop to perform a repetitive task: each iteration generates 100,000 random numbers from the standard normal distribution and computes their mean, and the process is repeated 30,000 times. The code is as follows:

#example 1
timestart <- Sys.time()
x <- numeric()
for (i in 1:30000) {
  x[i] <- mean(rnorm(1e5))
}
timeend <- Sys.time()
runningtime <- timeend - timestart
print(runningtime)
# 3.83 mins

Try to avoid for loops in R; they are very slow. When writing code, prefer vectorized programming: the general idea is to operate on whole vectors or matrices, so that every element undergoes the same operation at once.
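A hedged sketch of that idea: the loop above can be rewritten with replicate() from the apply family, which pre-allocates the result instead of growing x element by element (a smaller iteration count is used here so it runs quickly; replicate() still loops internally, but it is more idiomatic than the explicit for loop):

```r
# Same task as the for loop above, written with replicate():
# each call draws 1e5 standard normal numbers and returns their mean.
set.seed(1)
x <- replicate(100, mean(rnorm(1e5)))  # 100 iterations instead of 30000, for brevity
str(x)  # numeric vector of length 100, values close to 0
```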

1. The simplest use of foreach (without parallelism)

foreach() with %do% replaces the for loop: each iteration again computes the mean of 100,000 standard normal random numbers, for 30,000 iterations in total. The timing shows that the speed is similar to the for loop. Note that foreach returns a list. One advantage of foreach is that, just like a for loop, multiple statements can be written between the curly braces after %do%.

library(foreach)
timestart <- Sys.time()
x2 <- foreach(i = 1:30000) %do% {
  mean(rnorm(1e5))
}
timeend <- Sys.time()
runningtime <- timeend - timestart
print(runningtime)
# 3.91 mins

2. Parallel computing with foreach

Use foreach for parallel computing by replacing %do% above with %dopar%. Before doing so, load the doParallel package, then create a cluster and register it. Taking a computer with 18 physical cores and 36 threads as an example, rerun the earlier code that computes means of normal random numbers:

The same task finishes in about 19 seconds, nearly 10 times faster than a single core! foreach returns a list by default; if a vector or matrix is preferred, use the .combine parameter to specify how the outputs are combined.

library(foreach)
library(doParallel)
# Create a cluster and register it
cl <- makeCluster(18)
registerDoParallel(cl)

# Start the parallel computation
time1 <- Sys.time()
x2 <- foreach(i = 1:3e4, .combine = c) %dopar% {
  mean(rnorm(1e5))
}
time2 <- Sys.time()
print(time2 - time1)
# 19 sec with cl <- makeCluster(36)
# 21 sec with cl <- makeCluster(18)

# Don't forget to shut down the cluster when finished
stopImplicitCluster()
stopCluster(cl)

The foreach function can also use functions such as rbind or cbind to output the results in matrix form:

timestart <- Sys.time()
x4 <- matrix(0,nrow=3e4,ncol=6)
for (i in 1:30000) {
  x4[i,] <- summary(rnorm(1e5))
}
timeend <- Sys.time()
runningtime <- timeend - timestart
print(runningtime)
#6.036685 mins


# Create a cluster and register it
cl <- makeCluster(36)
registerDoParallel(cl)
# Start the parallel computation
timestart <- Sys.time()
x <- foreach(i = 1:3e4, .combine = rbind) %dopar% {
  summary(rnorm(1e5))
}
timeend <- Sys.time()
runningtime <- timeend - timestart
print(runningtime)
stopImplicitCluster()
stopCluster(cl)
# 26.86028 secs

3. Some important parameters of the foreach function

(1) .packages

The code after %dopar% often relies on third-party R packages, which must be specified via .packages. Take the random forest algorithm, widely used in machine learning, as an example:

library(doParallel)
library(randomForest)
cl <- makeCluster(4)
registerDoParallel(cl)
x <- matrix(runif(500), 100)
y <- gl(2, 50)
rf <- foreach(
  ntree = rep(250, 4), .combine = combine, .packages = 'randomForest') %dopar% {
    randomForest(x, y, ntree = ntree)
  }
stopCluster(cl)

(2) .errorhandling

The .errorhandling argument controls the response when an error occurs inside the loop; the default is "stop". For example, an error is deliberately raised here when i == 5:

x <- foreach(i = 1:10, .combine = c,.errorhandling = "stop") %dopar% {
  if(i == 5)
    stop('STOP!')
  i
}
Error in { : task 5 failed - "STOP!"

x
Error: object 'x' not found

A single error wastes the four tasks that already finished, and if a job that has been running for hours is killed by one failing iteration, you will feel like smashing the keyboard. In that case, set .errorhandling to "remove" to skip the failing iterations, for example:

x <- foreach(i = 1:10, .combine = c,.errorhandling = "remove") %dopar% {
  if(i == 5)
    stop('STOP!')
  i
}
x
[1]  1  2  3  4  6  7  8  9 10

As shown, the failing fifth iteration was skipped. Of course, we usually still want to find out where the error occurred and fix it; for that, set .errorhandling to "pass":

x <- foreach(i = 1:7, .combine = rbind, .errorhandling = "pass") %dopar% {
  if(i == 5)
    stop('STOP!')
  c(i, i^2)
}
x
         message call      
result.1 1       1         
result.2 2       4         
result.3 3       9         
result.4 4       16        
result.5 "STOP!" Expression
result.6 6       36        
result.7 7       49 

# The error message from the fifth iteration was recorded

4. Some notes on environment and variable scope

An R function has not only arguments and a body, but also its own environment. The concept of an environment may be unfamiliar to most R users. Briefly:

x <- 1:3
# Create a function named f
f <- function(x){
  k <- 3
  h <- function(){
    x <- sqrt(x)
  }
  print(environment(h))
  return(x + k)
}
# Check the environment in which f lives
environment(f)
<environment: R_GlobalEnv> 

Running environment(f) tells us that the function f() was created in the top-level environment R_GlobalEnv. Inside f(), we created a new function h() and a variable k. In C terms, k would be a local variable of f(); in R, both k and h() are treated as local to f(), and both are invisible to the top-level environment. Running f() gives this result:

f(x)
<environment: 0x0000021b918e0310>
[1] 4 5 6

The print call tells us that h() does not live in R_GlobalEnv but in an environment named 0x0000021b918e0310 (which is in fact a memory address).

If you change the code to run like this:

x1 <- 1:3
f <- function(x){
  x2 <- 5
  return(h())
}

h <- function(){
  x1 + x2
}
f(x1)
Error in h() : object 'x2' not found

An error is raised. Because h() is defined in the top-level environment R_GlobalEnv, the local variable x2 created inside f()'s own environment is invisible to h(). This is the concept of scope.

Now let's look at how foreach behaves with respect to scope, for example:

x1 <- 1
x2 <- 2
f <- function(x1) {
  foreach(i = 1:100, .combine = c)  %dopar% {
    x1 + x2 + i
  }
}
f(x1) 
Error in { : task 1 failed - "object 'x2' not found"

As shown above, running the function f() reports an error.

The reason is that when foreach is written inside a function, at run time foreach exports all the necessary variables from that function's environment, but it does not automatically pull in variables from the enclosing environment (here R_GlobalEnv). In this code, x2 is not a variable of f(), so it is not in f()'s environment, and foreach cannot find it. There are two solutions:

(1) Make x2 a parameter of f():

x1 <- 1
x2 <- 2
f <- function(x1, x2) {
  foreach(i = 1:3, .combine = c)  %dopar% {
    x1 + x2 + i
  }
}
f(x1, x2) 
[1] 4 5 6

(2) Use the .export argument of foreach:

x1 <- 1
x2 <- 2
f <- function(x1) {
  foreach(i = 1:3, .combine = c, .export = 'x2')  %dopar% {
    x1 + x2 + i
  }
}
f(x1) 
[1] 4 5 6

5. Final thoughts

(1) The trade-off between time (computation speed) and space (memory usage): parallel computing in R consumes a lot of memory, so choose the number of CPU cores carefully (or reduce memory usage in the code). If the data are large and memory runs out, the computer will freeze. Feasible workarounds:

① Install more memory (RAM)

② Split the task: for example, split 10,000 records into five chunks of 2,000
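A minimal sketch of the splitting idea (the numbers are illustrative: 10,000 elements cut into five chunks of 2,000, each processed in turn so that only one chunk's intermediate results occupy memory at a time):

```r
# Split 10,000 items into five chunks of 2,000 and process them one at a time.
dat <- 1:10000
chunks <- split(dat, ceiling(seq_along(dat) / 2000))
length(chunks)              # 5 chunks of 2000 elements each
res <- sapply(chunks, sum)  # process each chunk (here simply a sum)
```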

(2) Given sufficient memory, more cores are not always better: when I timed one of my models from 1 core to 36 cores (across several data sizes), once the number of workers exceeded 1.5 times the physical core count (18 × 1.5 = 27 cores), the potential of hyper-threading was essentially exhausted and parallelism brought little further speedup.

Therefore, for a CPU that supports hyper-threading, it is advisable to use 1.5 times the number of physical cores as the upper limit for parallel computing; run detectCores(logical = FALSE) to check your machine's physical core count.
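Following that rule of thumb, the worker-count choice can be sketched as below (the 1.5 factor is the heuristic from this section, not an official recommendation):

```r
library(parallel)
phys <- detectCores(logical = FALSE)    # physical core count
n_workers <- max(1, floor(phys * 1.5))  # heuristic upper limit from this section
n_workers
```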

(3) Parallelism is fun, but mind CPU cooling, especially on laptops: with poor heat dissipation, all-core operation can throttle the clock frequency or even crash the machine within minutes.

# Appendix: the difference between global and local variables
MM.fun <- function(){
  b <- 1 + m  # m is neither defined here nor globally
  return(b)
}

cfun <- function(c){
  test <- 1:10
  m <- length(test)  # m is local to cfun and invisible inside MM.fun
  aa <- MM.fun()     # so this call fails
  f <- c + aa
  return(f)
}

cfun(1)
# Error in MM.fun() : object 'm' not found

References:

What do the numbers of CPU cores and threads mean, and what are their relationship and differences? — Zhihu (zhihu.com)

Speed boosting in R: Writing efficient code & parallel programming | R-bloggers

[Multicore Spring] Parallel Computing in R Language - Zhihu (zhihu.com)


Origin blog.csdn.net/u011375991/article/details/131272023