Use memoization to optimize your R code

This article is a translation (with some omissions) of "Optimize your R Code using Memoization":

https://www.inwt-statistics.com/read-blog/optimize-your-r-code-using-memoization.html

This article describes how to apply a programming technique called "memoization" to speed up your R code and resolve performance bottlenecks. Wikipedia defines it as follows:

In computing,... memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning cached results when the same input occurs again.

Source: https://en.wikipedia.org/wiki/Memoization

If you just want a speed boost and are happy to rely on an off-the-shelf implementation of this technique, you can find two packages on CRAN: R.cache and memoise. Below I will also implement it from scratch, first to show that memoization is not magic but a fairly simple technique, and second to show that R can be much faster than C++!
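As a quick taste of the off-the-shelf route, here is a minimal sketch using memoise (assuming the package is installed; slowSqrt is a made-up stand-in for an expensive function):

```r
## slowSqrt simulates an expensive computation
slowSqrt <- function(x)
{
    Sys.sleep(1)
    sqrt(x)
}

fastSqrt <- memoise::memoise(slowSqrt)

system.time(fastSqrt(2)) # first call: about 1 second
system.time(fastSqrt(2)) # same input again: answered from the cache, almost instant
```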

Performance optimization in R

When I read about high-performance computing, R seems to have a bad reputation when it comes to raw computing speed. Thanks to Rcpp, integrating C++ into your R projects is as easy as it can be. Not necessarily easy, though: you still need to learn some C++. I often feel that discussions around this topic ignore the cost of implementation, and that recommending a switch to another language too quickly comes at a price. Often, however, plain old R techniques can improve runtime performance while remaining easy to implement. How? Usually by redefining the problem (cheating), breaking the computation into pieces (divide and conquer), or finding the bottleneck and optimizing it (making the most of the language)!

Of course, any of these steps might still force us to end up using C++ or something else. But we can push that point further out with some brain power and a toolbox of optimization strategies. There is a lot to learn when it comes to optimization; memoization can be one trick in that toolbox to magically make R code run faster. It works by avoiding unnecessary computations, i.e. by never computing the same thing twice.

When is R slow

To start this exercise, let's look at when R becomes slow, and how we can subsequently improve it. I took this example from the Rcpp documentation, which was apparently created in response to a question on StackOverflow asking why R's recursive function calls are so slow.

The challenge is to compute Fibonacci numbers using one of the least computationally efficient definitions. Of course the example is contrived: there are far more efficient ways to calculate the Fibonacci sequence. But the recursive definition will make your CPU fan go wild, which is why it's interesting. Below you can find a "rookie" implementation of the algorithm in both languages. They look more or less the same but, as you can see, their timings are very different. Increase N a little and the R implementation will quickly hit its limit.

N <- 35 ## the position in the fibonacci sequence to compute

fibRcpp <- Rcpp::cppFunction(
    '
    int fibonacci(const int x)
    {
        if (x == 0) return(0);
        if (x == 1) return(1);
        return (fibonacci(x - 1)) + fibonacci(x - 2);
    }
    ')

fibR <- function(x)
{
    ## Non-optimised R
    if (x == 0) return(0)
    if (x == 1) return(1)
    Recall(x - 1) + Recall(x - 2)
}

rbenchmark::benchmark(
    baseR = fibR(N),
    Rcpp = fibRcpp(N),
    columns = c("test", "replications", "elapsed", "user.self"),
    order = "elapsed",
    replications = 1)
    
##    test replications elapsed user.self
## 2  Rcpp            1    0.07      0.06
## 1 baseR            1   29.27     27.69
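As an aside, here is one of the "more efficient ways" mentioned above: an iterative version computes the same number in linear time, without any recursion.

```r
fibIter <- function(n)
{
    ## iterative Fibonacci: linear in n instead of exponential
    if (n < 2) return(n)
    a <- 0
    b <- 1
    for (i in 2:n) {
        tmp <- a + b
        a <- b
        b <- tmp
    }
    b
}

fibIter(35) # 9227465, computed in microseconds
```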

When is R fast(er)

Now let's see how we can define a function that computes the same number while avoiding all unnecessary computations. Note that in the recursive definition, to compute the Nth number we have to compute the (N-1)th and (N-2)th numbers, which leads to an explosion in the number of computations. But if we want the Fibonacci number for N = 35 and already have the results for N = 34 and N = 33, we don't have to recompute them; we can simply reuse what we already know. Let's see how to do this:

fibRMemoise <- local(
    {
        memory <- list()
        function(x)
        {
            valueName <- as.character(x)
            if (!is.null(memory[[valueName]])) return(memory[[valueName]])
            if (x == 0) return(0)
            if (x == 1) return(1)
            res <- Recall(x - 1) + Recall(x - 2)
            memory[[valueName]] <<- res # store results
            res
        }
    })

What we do:

  1. Check whether the result is already known
    • If it is, return the stored result and stop there (compute nothing)
    • If it is not, go to step 2
  2. Compute the required result (here: a Fibonacci number)
    • Store the result before leaving the function
    • Then return it

So the idea is pretty simple. The only complication is that we need a closure. Here we use local: local creates a new scope (an environment) and runs the code inside it. The function thus has access to that environment, i.e. to memory, but memory is not visible in the global environment: it is local to the function definition. We also need the super-assignment operator (<<-) so that we can assign values to memory from inside the function. Now let's see what we've gained besides abstraction and extra code:
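The local/<<- machinery is easiest to see in isolation. A tiny counter (unrelated to Fibonacci) shows both pieces:

```r
counter <- local(
    {
        count <- 0
        function()
        {
            count <<- count + 1 # super-assignment writes to the enclosing environment
            count
        }
    })

counter()       # 1
counter()       # 2
exists("count") # FALSE: count lives only in the closure's environment
```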

rbenchmark::benchmark(
    baseR = fibR(N),
    Rcpp = fibRcpp(N),
    memoization = fibRMemoise(N),
    columns = c("test", "replications", "elapsed", "user.self"),
    order = "elapsed",
    replications = 1)

##          test replications elapsed user.self
## 3 memoization            1    0.00      0.00
## 2        Rcpp            1    0.04      0.04
## 1       baseR            1   32.03     31.67

Did you see that? R is actually faster than C++! And if you have the time to wait for the C++ implementation, we can push N further and watch the C++ version quickly hit its limit:

N <- 50 # not very far, but with memoization Int64 is the limit.

rbenchmark::benchmark(
    # baseR = fibR(N), # not good anymore!
    Rcpp = fibRcpp(N),
    memoization = fibRMemoise(N),
    columns = c("test", "replications", "elapsed", "user.self"),
    order = "elapsed",
    replications = 1)

##          test replications elapsed user.self
## 2 memoization            1    0.00      0.00
## 1        Rcpp            1   87.67     87.24

Great, a remarkably effective trick. It also shows why performance comparisons between languages often amount to comparing apples and oranges. And yes, R can be fast after all :)

Memoization in R

There is one problem with the definition above: it is not very general. Our memoization is still entangled with the definition of Fibonacci numbers. However, we can define a higher-order function that separates the memoization from the algorithm:

memoise <- function(fun)
{
    memory <- list()
    function(x)
    {
        valueName <- as.character(x)
        if (!is.null(memory[[valueName]])) return(memory[[valueName]])
        res <- fun(x)
        memory[[valueName]] <<- res
        res
    }
}
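Using it is then just a matter of wrapping a function. A sketch with a made-up slow function (this assumes memoise() from above is in scope):

```r
## slowSquare is a made-up stand-in for an expensive computation
slowSquare <- function(x)
{
    Sys.sleep(0.5)
    x^2
}

fastSquare <- memoise(slowSquare)

system.time(fastSquare(4)) # ~0.5 s: computed and stored
system.time(fastSquare(4)) # ~0 s: answered from memory
```

One caveat: wrapping fibR this way would only cache top-level calls. The recursive calls inside fibR still go to the unmemoised function, so for the full effect the recursion itself must go through the memoised version, as fibRMemoise does above.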

This is the technique in its essence, and it is neither long nor complicated. In principle, this is what you will find in R.cache and memoise. Of course, both packages add functionality, for example control over how and where the memory is stored, possibly on disk. The function above also only accepts a single argument; both packages remove that limitation and add various other useful things.
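For example, a disk-backed cache can be set up along these lines (a sketch assuming memoise >= 2.0 together with the cachem package are installed; slowSquare is a made-up example):

```r
slowSquare <- function(x)
{
    Sys.sleep(1)
    x^2
}

## results are written to disk, so they survive across R sessions
fastSquare <- memoise::memoise(
    slowSquare,
    cache = cachem::cache_disk(dir = file.path(tempdir(), "memo-cache")))

fastSquare(4) # computed once, then read from disk on later calls
```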

When to use memoization

When and why would you use memoization? Probably not for everyday tasks like the Fibonacci sequence; even there, we would do it differently in practice. The actual use cases I have in mind are quite different. Here are some ideas:

  • We can reduce the number of calls to an API. Most providers (e.g. Google Maps) limit the number of calls you are allowed per day. With memoization you can quickly build an in-memory or on-disk cache, which lets you switch back to an "old" set of parameters without querying the API again.
  • Calls to a database, or loading data in general. Think of a Shiny application where changes in the UI trigger calls to a database, e.g. when you have parameterized queries. Caching the results of those queries can speed up your application considerably when users switch back and forth between settings.
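The second idea could be sketched like this (everything here is hypothetical: fetchSales, the query, and the connection con are made up, and the memoise package is assumed to be installed):

```r
## hypothetical: an expensive parameterized query
fetchSales <- function(year)
{
    ## in a real application, something like:
    ## DBI::dbGetQuery(con, "SELECT * FROM sales WHERE year = ?", params = list(year))
    Sys.sleep(1) # simulate the round trip
    data.frame(year = year, total = 42)
}

cachedSales <- memoise::memoise(fetchSales)

cachedSales(2020) # hits the database once
cachedSales(2020) # same parameter: answered from the cache
```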

Whatever we do, there is one important property a function must have for memoization to be useful. Wikipedia:

A function can only be memoized if it is referentially transparent; that is, only if calling the function has exactly the same effect as replacing that function call with its return value. (Special case exceptions to this restriction exist, however.)

Source: https://en.wikipedia.org/wiki/Memoization#Overview

In other words: we need to make sure that the result of the function really depends only on its input arguments. Are you confident that your database call or API call has this property? If so, memoization may be useful. But be careful: memoization means a cache, a cache means state management (when and how do you update it?), and state management leads to hard-to-debug problems. Interesting ones, indeed.
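A counter-example makes the restriction concrete: memoising a function whose result does not depend only on its arguments silently returns stale values (memoise() here is the hand-rolled version from above):

```r
## NOT referentially transparent: the result depends on the current time
stamp <- function(x) paste(x, Sys.time())

cachedStamp <- memoise(stamp)

first <- cachedStamp("a")
Sys.sleep(2)
second <- cachedStamp("a")
identical(first, second) # TRUE: the cache happily serves a stale timestamp
```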

Reprinted in: https://www.cnblogs.com/xuruilong100/p/9824997.html
