About Unified Memory in CUDA

What is unified memory?

In CUDA 6, NVIDIA introduced one of the most important programming model improvements in CUDA history: Unified Memory (hereinafter UM). On a typical PC today, CPU memory and GPU memory are physically separate and communicate over the PCI-E bus. Before CUDA 6.0, programmers had to be aware of this and reflect it in their code: memory had to be allocated on both the CPU side and the GPU side, and data had to be copied back and forth manually. Compare the classic CPU version of a sort routine with its UM version:

CPU version:

void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    usedata(data);
    free(data);
}

GPU version with Unified Memory (CUDA 6.0+):

void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();

    usedata(data);
    cudaFree(data);
}

It can be clearly seen that the two pieces of code are strikingly similar. The only differences in the GPU version are:

1. Memory is allocated with cudaMallocManaged instead of malloc (and freed with cudaFree instead of free).
2. Since the CPU and GPU execute asynchronously, cudaDeviceSynchronize must be called after launching the kernel.

Before CUDA 6.0, achieving the same thing required code like the following:

void sortfile(FILE *fp, int N)    
{
    char *h_data, *d_data;
    h_data = (char *)malloc(N);
    cudaMalloc(&d_data, N);

    fread(h_data, 1, N, fp);

    cudaMemcpy(d_data, h_data, N, cudaMemcpyHostToDevice);

    qsort<<<...>>>(d_data, N, 1, compare);

    // No explicit synchronization is needed: cudaMemcpy synchronizes before transferring the data
    cudaMemcpy(h_data, d_data, N, cudaMemcpyDeviceToHost);

    usedata(h_data);
    free(h_data);
    cudaFree(d_data);
}

So far, the main advantages of UM can be summarized as follows:

1. It simplifies code and the memory model.
2. The CPU side and the GPU side can share a single pointer, with no separate allocations, which is easier to manage and reduces the amount of code.
3. It enables closer language integration, reducing the syntax differences between CUDA and host languages.
4. It makes code easier to port.

Deep Copy

Wait... judging from the description so far, it does not seem like the amount of code has been reduced all that much. So let's consider a very common situation: suppose we have a struct like this:

struct dataElem {
    int data1;
    int data2;
    char *text;
};

We might do something like this:

void launch(dataElem *elem) 
{
    dataElem *d_elem;
    char *d_text;

    int textlen = strlen(elem->text);

    // Allocate space on the GPU for the struct and for the text
    cudaMalloc(&d_elem, sizeof(dataElem));
    cudaMalloc(&d_text, textlen);
    // Copy the data over to the GPU
    cudaMemcpy(d_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
    cudaMemcpy(d_text, elem->text, textlen, cudaMemcpyHostToDevice);
    // Update the device-side text pointer so it points at the newly allocated device text buffer
    cudaMemcpy(&(d_elem->text), &d_text, sizeof(d_text), cudaMemcpyHostToDevice);

    // In the end, the CPU and the GPU each hold their own copy of elem
    kernel<<< ... >>>(d_elem);
}

But with CUDA 6.0 and the introduction of UM, it can simply be:

void launch(dataElem *elem) 
{
    kernel<<< ... >>>(elem); 
}
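This simple version assumes that elem, and the text buffer it points to, already live in managed memory. A minimal sketch of how that allocation might look (the make_elem helper and its parameters are assumptions for illustration, not part of the original example):

// Hypothetical helper: build a dataElem entirely in Unified Memory
dataElem *make_elem(int d1, int d2, const char *s)
{
    dataElem *elem;
    cudaMallocManaged(&elem, sizeof(dataElem));
    elem->data1 = d1;
    elem->data2 = d2;

    size_t len = strlen(s) + 1;
    cudaMallocManaged(&elem->text, len);
    memcpy(elem->text, s, len);

    return elem;   // the same pointer is valid on both the CPU and the GPU
}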

Obviously, in the deep-copy case UM greatly reduces the amount of code. Before UM, because the two address spaces were separate, memory had to be allocated and copied manually again and again, which is very cumbersome, especially for non-CUDA programmers. The more complex the actual data structure, the bigger the gap between the two becomes.

Consider the common linked list, which is essentially a nested data structure built out of pointers. Without UM, sharing a linked list between the CPU and the GPU is very hard to handle, and moving it between the two memory spaces is very complicated.

Using UM here has the following advantages (a minimal sketch follows the list):

1. Linked-list elements can be passed directly between the CPU and the GPU.
2. List elements can be modified from either the CPU or the GPU.
3. Complicated synchronization problems are avoided.
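
As a minimal sketch, assuming nodes allocated with cudaMallocManaged (the Node layout, the push_front helper and the increment_list kernel below are illustrative, not from the original article), the list can be built on the CPU and then walked and modified by a GPU kernel through the very same pointers:

#include <cstdio>

struct Node {
    int   value;
    Node *next;
};

// Hypothetical helper: prepend a node allocated in Unified Memory
Node *push_front(Node *head, int value)
{
    Node *n;
    cudaMallocManaged(&n, sizeof(Node));
    n->value = value;
    n->next  = head;
    return n;
}

// Walk the list on the GPU and modify it in place
__global__ void increment_list(Node *head)
{
    for (Node *p = head; p != NULL; p = p->next)
        p->value += 1;
}

void demo_list()
{
    Node *head = NULL;
    for (int i = 0; i < 4; ++i)
        head = push_front(head, i);     // built on the CPU

    increment_list<<<1, 1>>>(head);     // the same pointers are used on the GPU
    cudaDeviceSynchronize();

    for (Node *p = head; p != NULL; p = p->next)
        printf("%d\n", p->value);       // read back on the CPU
}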

In fact, even before UM appeared, zero-copy memory (pinned host memory) could be used to solve this problem. Even so, UM remains worthwhile, because accesses to pinned host memory are limited by PCI-Express performance, whereas UM can achieve better performance. That comparison will not be discussed in depth in this article.

How to use Unified Memory in C++

Since modern C++ avoids calling raw memory-allocation functions such as malloc explicitly and instead wraps allocation in new, UM can be used by overloading operator new (and operator delete):

class Managed {
public:
    void *operator new(size_t len)
    {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        return ptr;
    }

    void operator delete(void *ptr)
    {
        cudaFree(ptr);
    }
};

By inheriting from this class, a custom C++ class gets pass-by-reference through UM; to also get pass-by-value under UM, call cudaMallocManaged in the copy constructor. The following String class illustrates this:

// Inheriting from Managed gives pass-by-reference through UM
class String : public Managed {
    int length;
    char *data;

public:
    // A copy constructor that allocates managed memory enables pass-by-value
    String (const String &s) {
        length = s.length;
        cudaMallocManaged(&data, length);
        memcpy(data, s.data, length);
    }
};
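
A minimal usage sketch (the two kernels and the demo function below are hypothetical placeholders, not from the original article): because String inherits from Managed, operator new places the object in managed memory and it can be passed to a kernel by pointer; because the copy constructor allocates managed memory, it can also be passed by value:

// Hypothetical kernels, bodies omitted for brevity
__global__ void kernel_by_pointer(String *s) { /* use s on the GPU */ }
__global__ void kernel_by_value(String s)    { /* s.data was copied into managed memory on the host */ }

void demo_string(const String &src)
{
    // Managed::operator new uses cudaMallocManaged, so the same pointer
    // is valid on the CPU and the GPU (pass-by-reference).
    String *s = new String(src);
    kernel_by_pointer<<<1, 1>>>(s);

    // The copy constructor runs on the host and allocates managed memory
    // for the copy's data, enabling pass-by-value.
    kernel_by_value<<<1, 1>>>(*s);

    cudaDeviceSynchronize();
    delete s;   // Managed::operator delete calls cudaFree
}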

Comparison of the advantages and disadvantages of Unified Memory and Unified Virtual Addressing

In fact, Unified Virtual Addressing (hereinafter referred to as UVA) has been supported since CUDA 4.0; please do not confuse it with Unified Memory. Although UM depends on UVA, they are not the same thing. To clarify the difference, we first need to know which kinds of memory UVA covers:

1. device memory (possibly on a different GPU)
2. on-chip shared memory
3. host memory

Thread-private memory such as local memory and the registers inside an SM is obviously outside UVA's scope. What UVA provides is a single, unified virtual address space for the memories listed above. This is what enables zero-copy: memory is allocated on the CPU side (pinned), mapped into the CUDA virtual address space, and every GPU access to it goes over PCI-E. Also note that UVA never migrates data for you.
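A minimal zero-copy sketch, assuming a UVA-capable GPU (the scale kernel and buffer size are illustrative assumptions): pinned host memory is allocated with cudaHostAlloc and mapped into the device address space, and the GPU then reads and writes it in place over PCI-E, with no migration:

__global__ void scale(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   // every access travels over PCI-E
}

void demo_zero_copy(int n)
{
    float *h_buf;
    // Pinned host memory, mapped into the CUDA address space
    cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_buf[i] = (float)i;

    // Under UVA the host pointer can be passed to the kernel directly
    // (cudaHostGetDevicePointer would return the same address)
    scale<<<(n + 255) / 256, 256>>>(h_buf, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_buf);
}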

A more in-depth comparison and performance analysis between the two is beyond the scope of this article and may be discussed in a follow-up article.

Common questions:

1. Q: Does UM eliminate the copies between system memory and GPU memory?
A: No. The copying still happens; it is simply handed over to the CUDA runtime and made transparent to the programmer. The overhead of memory copies still exists, and race conditions still have to be considered to keep the data on the CPU and GPU sides consistent. Simply put, if you are already very good at managing memory manually, UM will not give you better performance; it only reduces your workload.
2. Q: Since the copies have not been eliminated, this seems like something that could be handled purely at compile time. Why does UM still require compute capability 3.0 or higher? Is it just to trick everyone into buying new cards?
A: Wait... we have actually glossed over a lot of implementation details so far. The copies cannot be eliminated in principle, and the compiler cannot know everything at compile time. More importantly, GPU architectures from Pascal onwards provide 49-bit virtual addressing and on-demand page migration. A 49-bit address space is large enough to cover the entire system memory plus the memory of all GPUs, and the page migration engine can migrate any addressable page into GPU memory via page faults, so GPU threads can access non-resident memory.
In short, cards with the newer architectures let the GPU access "oversubscribed" memory without any change to the program code, so the GPU can handle out-of-core computations (computations whose data exceeds local physical memory).
Moreover, Pascal and Volta even support system-wide atomic memory operations that can span multiple GPUs, which greatly simplifies code in multi-GPU setups.
At the same time, for programs with scattered data accesses, on-demand page migration can fault memory in at page granularity instead of migrating whole allocations, saving data-migration cost. (CPUs have had demand paging for a long time, and the principle is very similar.) A minimal sketch of oversubscription follows this answer.
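
As a minimal sketch of oversubscription on Pascal or newer (the sizes and the touch kernel below are illustrative assumptions): cudaMallocManaged can allocate more memory than the GPU physically has, and pages are migrated in on demand as the kernel faults on them:

__global__ void touch(char *buf, size_t n)
{
    // Grid-stride loop so a modest launch can cover a huge allocation
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        buf[i] = 1;   // touching a non-resident page triggers on-demand migration
}

void demo_oversubscription()
{
    // Deliberately larger than the GPU's physical memory (illustrative size)
    size_t bytes = 64ULL << 30;   // 64 GB

    char *buf;
    if (cudaMallocManaged(&buf, bytes) != cudaSuccess) return;

    touch<<<1024, 256>>>(buf, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf);
}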


Origin blog.csdn.net/daijingxin/article/details/122462167