Analysis of the Latest Unified Memory Mechanism in GPUs

Simplify GPU application development with heterogeneous memory management


Heterogeneous Memory Management (HMM) is a CUDA memory management feature that extends the simplicity and productivity of the CUDA Unified Memory programming model to include system-allocated memory on systems with PCIe-attached NVIDIA GPUs. System-allocated memory is memory that is ultimately allocated by the operating system; for example, via malloc, mmap, the C++ new operator (which uses the preceding mechanisms), or related system routines that set up CPU-accessible memory for the application.

Previously, on PCIe-based machines, the GPU could not directly access system-allocated memory; it could only access memory that came from special allocators such as cudaMalloc or cudaMallocManaged.

When HMM is enabled, all application threads (GPU or CPU) have direct access to all of the application's system-allocated memory. As with Unified Memory (which can be thought of as a subset or predecessor of HMM), there is no need to manually copy system-allocated memory between processors: memory is automatically placed on the CPU or GPU based on how each processor uses it.

In the CUDA driver stack, CPU and GPU page faults are typically used to discover where memory should be placed. This automatic placement already happens with Unified Memory; HMM simply extends the behavior to cover system-allocated memory as well as cudaMallocManaged memory.

This new ability to directly read and write the full application memory address space significantly improves programmer productivity for all programming models built on top of CUDA: CUDA C++, Fortran, standard parallelism in Python, ISO C++, ISO Fortran, OpenACC, OpenMP, and many others.

In fact, as the examples below show, HMM simplifies GPU programming to the point where it is almost as accessible as CPU programming. Some highlights:

  • When writing a GPU program, functions do not require explicit memory management; therefore, the initial "first draft" program can be small and simple. Explicit memory management (for performance tuning) can be deferred until later stages of development.
  • GPU programming is now practical for programming languages that do not distinguish between CPU and GPU memory.
  • Large applications can be accelerated by the GPU without extensive memory management refactoring or changes to third-party libraries (whose source code is not always available).

As an aside, newer hardware platforms such as the NVIDIA Grace Hopper Superchip natively support the Unified Memory programming model through hardware-based memory coherence between all CPUs and GPUs. On such systems HMM is not necessary and is, in fact, automatically disabled. One way to think about this is that HMM is a software-based way of providing the same programming model as the NVIDIA Grace Hopper Superchip.

Unified memory before HMM

The original CUDA Unified Memory feature introduced in 2013 allows you to accelerate CPU programs with only a few changes, as follows:

CPU-only version

void sortfile(FILE* fp, int N) {
  char* data;
  data = (char*)malloc(N);

  fread(data, 1, N, fp);
  qsort(data, N, 1, cmp);

  use_data(data);
  free(data);
}

Original CUDA Unified Memory version

void sortfile(FILE* fp, int N) {
  char* data;
  cudaMallocManaged(&data, N);

  fread(data, 1, N, fp);
  qsort<<<...>>>(data, N, 1, cmp);
  cudaDeviceSynchronize();

  use_data(data);
  cudaFree(data);
}

This programming model is simple, clear, and powerful. Over the past 10 years, it has enabled countless applications to benefit easily from GPU acceleration. Still, there is room for improvement: note that a special allocator is required, cudaMallocManaged, along with the corresponding cudaFree.

What if we could go one step further and get rid of those as well? That is exactly what HMM does.

Unified memory after HMM

On systems with HMM (details below), continue to use malloc and free:

CPU-only version

void sortfile(FILE* fp, int N) {
  char* data;
  data = (char*)malloc(N);

  fread(data, 1, N, fp);
  qsort(data, N, 1, cmp);

  use_data(data);
  free(data);
}

CUDA Unified Memory with HMM

void sortfile(FILE* fp, int N) {
  char* data;
  data = (char*)malloc(N);

  fread(data, 1, N, fp);
  qsort<<<...>>>(data, N, 1, cmp);
  cudaDeviceSynchronize();

  use_data(data);
  free(data);
}

With HMM, memory management is now the same between the two.
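
To make this concrete, here is a minimal CUDA C++ sketch that is not from the original post: a hypothetical fill kernel writes directly to malloc-allocated memory. It assumes an HMM-enabled (or hardware-coherent) system; on older PCIe systems without HMM, the GPU access would fault.

#include <cstdio>
#include <cstdlib>

// Illustrative kernel (assumption, not from the original post):
// GPU threads write directly into malloc'd memory.
__global__ void fill(int* data, int n, int value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = value;
}

int main() {
  const int n = 1 << 20;
  int* data = (int*)malloc(n * sizeof(int));  // plain system allocation, no CUDA allocator

  // With HMM (or hardware coherence), GPU threads can dereference this pointer directly.
  fill<<<(n + 255) / 256, 256>>>(data, n, 42);
  cudaDeviceSynchronize();

  std::printf("%d\n", data[0]);  // CPU reads the result through the same pointer
  free(data);
}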

System-allocated memory and CUDA allocators

GPU applications that use the CUDA memory allocators work "as is" on systems with HMM. The main difference on these systems is that system allocation APIs such as malloc, C++ new, or mmap now create allocations that can be accessed from GPU threads without calling any CUDA API to tell CUDA that these allocations exist. The following table lists the differences between the most common CUDA memory allocators on systems with HMM:

Memory allocator                          Placement                Migratable   Accessible from
                                                                                CPU   GPU   RDMA
System allocated (malloc, mmap, …)        First-touch GPU or CPU   Y            Y     Y     Y
CUDA managed (cudaMallocManaged)          First-touch GPU or CPU   Y            Y     Y     N
CUDA device-only (cudaMalloc, …)          GPU                      N            N     Y     Y
CUDA host-pinned (cudaMallocHost, …)      CPU                      N            Y     Y     Y

Overview of system and CUDA memory allocators on systems with HMM

In general, CUDA delivers better performance when the chosen allocator better expresses the intent of the application. With HMM, these choices become performance optimizations that no longer need to be made up front, before memory is first accessed from the GPU. HMM lets developers focus on parallelizing their algorithms first and defer allocator-related optimizations until their overhead is justified by improved performance.

Seamless GPU acceleration in C++, Fortran, and Python

HMM makes it easier to program NVIDIA GPUs with standardized and portable programming languages such as Python, which do not distinguish between CPU and GPU memory and assume that all threads can access all memory, as well as with languages described by international standards such as ISO Fortran and ISO C++.

These languages provide concurrency and parallelism facilities that allow implementations to automatically dispatch computation to GPUs and other devices. For example, since C++17, the standard library algorithms in the <algorithm> header accept an execution policy that lets implementations run them in parallel.
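
As a small illustrative sketch (not from the original post), the following plain ISO C++ uses std::transform with the parallel execution policy. Compiled with nvc++ -stdpar=gpu, the implementation may offload it to the GPU, with HMM or hardware coherence making the std::vector storage visible to GPU threads:

#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
  std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);

  // Parallel execution policy: the implementation (for example, nvc++ -stdpar=gpu)
  // is free to run this on the GPU. No CUDA API calls, no explicit copies.
  std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                 [](float a, float b) { return 2.0f * a + b; });

  std::printf("%f\n", y[0]);  // prints 4.000000
}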

Sort files in-place from the GPU

For example, prior to HMM, sorting a file larger than CPU memory in place was complicated: you first had to sort smaller pieces of the file and then merge them into a fully sorted file. With HMM, an application can use mmap to map the file on disk into memory and read and write it directly from the GPU. For more details, see the HMM sample code file_before.cpp and file_after.cpp on GitHub.

Before HMM: dynamic allocation

void sortfile(FILE* fp, int N) {
  std::vector<char> buffer;
  buffer.resize(N);
  fread(buffer.data(), 1, N, fp);

  // std::sort runs on the GPU:
  std::sort(std::execution::par,
    buffer.begin(), buffer.end(),
    std::greater{});
  use_data(std::span{buffer});
}

After HMM: memory-mapped allocation

void sortfile(int fd, int N) {
  auto buffer = (char*)mmap(NULL, N,
     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  // std::sort runs on the GPU:
  std::sort(std::execution::par,
    buffer, buffer + N,
    std::greater{});
  use_data(std::span{buffer, N});
}

The NVIDIA C++ Compiler (NVC++) implementation of the parallel std::sort algorithm sorts the file on the GPU when the -stdpar=gpu option is used. There are a number of restrictions on the use of this option, which are detailed in the HPC SDK documentation.

  • Before HMM: the GPU could only access dynamically allocated heap memory in code compiled by NVC++. That is, automatic variables on CPU thread stacks, global variables, and memory-mapped files could not be accessed from the GPU (see the examples below).
  • After HMM: the GPU can access all system-allocated memory, including heap memory dynamically allocated in CPU code compiled by other compilers and third-party libraries, automatic variables on CPU thread stacks, global variables in CPU memory, memory-mapped files, and so on.

Atomic memory operations and synchronization primitives

HMM supports all memory operations, including atomic memory operations. That is, programmers can use atomics to synchronize GPU and CPU threads through flags. Although some parts of the C++ std::atomic API use system calls that are not yet available on the GPU (such as std::atomic::wait and std::atomic::notify_all/notify_one), most of the C++ concurrency-primitive APIs can readily be used for message passing between GPU and CPU threads.

For more information, see the HPC SDK C++ Parallel Algorithms: Interoperability with the C++ Standard Library documentation and the atomic_flag.cpp HMM sample code on GitHub. You can extend this set with CUDA C++; for more details, see the ticket_lock.cpp HMM sample code on GitHub.
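
To give an idea of what such an extension could look like, here is a hedged sketch of a ticket lock usable from both CPU and GPU threads. It is an assumption for illustration, not the actual ticket_lock.cpp sample; it relies on cuda::std::atomic from libcu++ (system scope by default) and assumes compilation with nvc++ -stdpar=gpu, which compiles functions reached from parallel algorithms for the GPU:

#include <cuda/std/atomic>

// Minimal ticket-lock sketch (assumption, not the GitHub sample).
// cuda::std::atomic defaults to system scope, so CPU and GPU threads
// can contend for the same lock instance.
struct ticket_lock {
  cuda::std::atomic<int> next{0};     // next ticket to hand out
  cuda::std::atomic<int> serving{0};  // ticket currently being served

  struct guard_t {
    ticket_lock& l;
    explicit guard_t(ticket_lock& lk) : l(lk) {}
    ~guard_t() { l.serving.fetch_add(1, cuda::std::memory_order_release); }  // unlock
  };

  guard_t guard() {
    // Take a ticket, then spin until that ticket is being served.
    const int ticket = next.fetch_add(1, cuda::std::memory_order_relaxed);
    while (serving.load(cuda::std::memory_order_acquire) != ticket) { /* spin */ }
    return guard_t(*this);
  }
};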

Before HMM: CPU←→GPU message passing

int main() {
  // Variables allocated with cudaMallocManaged
  std::atomic<int>* flag;
  int* msg;
  cudaMallocManaged(&flag, sizeof(std::atomic<int>));
  cudaMallocManaged(&msg, sizeof(int));
  new (flag) std::atomic<int>(0);
  *msg = 0;

  // Start a different CPU thread…
  auto t = std::jthread([&] {
    // … that launches and waits
    // on a GPU kernel completing
    std::for_each_n(
      std::execution::par,
      &msg, 1, [&](int* msg) {
        // GPU thread writes message…
        *msg = 42;       // all accesses via ptrs
        // …and signals completion…
        flag->store(1);  // all accesses via ptrs
    });
  });

  // CPU thread waits on GPU thread
  while (flag->load() == 0);   // all accesses via ptrs
  // …and reads the message:
  std::cout << *msg << std::endl;
  // …the GPU kernel and thread
  // may still be running here…
}

After HMM: CPU←→GPU message passing

int main() {
  // Variables on the CPU thread stack:
  std::atomic<int> flag = 0;  // Atomic
  int msg = 0;                // Message

  // Start a different CPU thread…
  auto t = std::jthread([&] {
    // … that launches and waits
    // on a GPU kernel completing
    std::for_each_n(
      std::execution::par,
      &msg, 1, [&](int& msg) {
        // GPU thread writes message…
        msg = 42;
        // …and signals completion…
        flag.store(1);
    });
  });

  // CPU thread waits on GPU thread
  while (flag.load() == 0);
  // …and reads the message:
  std::cout << msg << std::endl;
  // …the GPU kernel and thread
  // may still be running here…
}

Before HMM: CPU←→GPU locks

int main() {
  // Variables allocated with cudaMallocManaged
  ticket_lock* lock;   // Lock
  int* msg;            // Message
  cudaMallocManaged(&lock, sizeof(ticket_lock));
  cudaMallocManaged(&msg, sizeof(int));
  new (lock) ticket_lock();
  *msg = 0;

  // Start a different CPU thread…
  auto t = std::jthread([&] {
    // … that launches and waits
    // on a GPU kernel completing
    std::for_each_n(
      std::execution::par,
      &msg, 1, [&](int* msg) {
        // GPU thread takes lock…
        auto g = lock->guard();
        // … and sets message (no atomics)
        *msg += 1;
    }); // GPU thread releases lock here
  });

  {
    // Concurrently with the GPU thread
    // … CPU thread takes lock…
    auto g = lock->guard();
    // … and sets message (no atomics)
    *msg += 1;
  } // CPU thread releases lock here

  t.join();  // Wait on GPU kernel completion
  std::cout << *msg << std::endl;
}

After HMM: CPU←→GPU locks

int main() {
  // Variables on the CPU thread stack:
  ticket_lock lock;    // Lock
  int msg = 0;         // Message

  // Start a different CPU thread…
  auto t = std::jthread([&] {
    // … that launches and waits
    // on a GPU kernel completing
    std::for_each_n(
      std::execution::par,
      &msg, 1, [&](int& msg) {
        // GPU thread takes lock…
        auto g = lock.guard();
        // … and sets message (no atomics)
        msg += 1;
    }); // GPU thread releases lock here
  });

  {
    // Concurrently with the GPU thread
    // … CPU thread takes lock…
    auto g = lock.guard();
    // … and sets message (no atomics)
    msg += 1;
  } // CPU thread releases lock here

  t.join();  // Wait on GPU kernel completion
  std::cout << msg << std::endl;
}

Accelerate Complex HPC Workloads Using HMM

Research groups working on large, long-lived HPC applications have for years wanted more efficient and portable programming models for heterogeneous platforms. m-AIA is a multiphysics solver developed at the Institute of Aerodynamics at RWTH Aachen University in Germany and contains nearly 300,000 lines of code. For more information, see Accelerating C++ CFD Code with OpenACC. The initial prototype no longer uses OpenACC; it is partially accelerated on the GPU using the ISO C++ programming model described above, which was not available when the prototype work was done.

HMM enabled the team to accelerate new m-AIA workloads that interface with GPU-agnostic third-party libraries such as FFTW and pnetcdf, which are used for initial conditions and I/O and are unaware that the GPU is directly accessing the same memory.

Rapid development with memory-mapped I/O

An interesting capability provided by HMM is memory-mapped file I/O directly from the GPU. It lets developers read files from supported storage or disk directly, without staging them in system memory and without copying the data to high-bandwidth GPU memory. It also lets application developers easily process input data larger than the available physical system memory, without building an iterative data-ingestion and computation pipeline.

To demonstrate this capability, the team wrote a sample application that builds a histogram of hourly total precipitation for every day of the year from the ERA5 reanalysis dataset. For more details, see ERA5 Global Reanalysis.

The ERA5 dataset consists of hourly estimates of several atmospheric variables. In the dataset, the total precipitation data for each month is stored in a separate file. We used 40 years of total precipitation data from 1981 to 2020, for a total of 480 input files and roughly 1.3 TB of input data.

Using the Unix mmap API, the input files can be mapped into a contiguous virtual address space. With HMM, this virtual address range can be passed as input to a CUDA kernel, which then directly reads the values to build the histogram of hourly total precipitation for all days of the year.

The resulting histogram resides in GPU memory and can be used to easily compute interesting statistics, such as the mean monthly precipitation for the northern hemisphere. As an example, we also computed the mean hourly precipitation for February and August. To see the code for this application, visit HMM_sample_code on GitHub.

Before HMM: batched and pipelined memory transfers

size_t chunk_sz = 70_gb;   // user-defined literal for gigabytes, defined elsewhere
std::vector<char> buffer(chunk_sz);

for (auto fp : files)
  for (size_t off = 0; off < N; off += chunk_sz) {
    fread(buffer.data(), 1, chunk_sz, fp);
    cudaMemcpy(dev, buffer.data(), chunk_sz, cudaMemcpyHostToDevice);

    histogram<<<...>>>(dev, chunk_sz, out);
    cudaDeviceSynchronize();
  }

After HMM: memory map and transfer on demand

char* buffer = (char*)mmap(NULL, alloc_size,
                           PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS,
                           -1, 0);
for (auto fd : files)   // file_offset and fileByteSize are per-file values
  mmap(buffer + file_offset, fileByteSize,
       PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);

histogram<<<...>>>(buffer, total_N, out);
cudaDeviceSynchronize();

Enable and detect HMM

HMM is enabled automatically whenever the CUDA toolkit and driver detect that your system can support it. The requirements are documented in detail in the CUDA 12.2 Release Notes: General CUDA. You need:

  • NVIDIA CUDA 12.2 with the open-source r535_00 driver or later. See the NVIDIA Open GPU Kernel Modules installation documentation for details.
  • A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+ or 6.3+.
  • GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper or newer.
  • 64-bit x86 CPUs.

Query the addressing mode property to verify that HMM is enabled:

$ nvidia-smi -q | grep Addressing
Addressing Mode : HMM

To detect systems where the GPU can access system-allocated memory, query cudaDevAttrPageableMemoryAccess.

Additionally, systems such as the NVIDIA Grace Hopper Superchip support ATS (Address Translation Services), which behaves like HMM. In fact, the programming model is the same on HMM and ATS systems, so for most programs checking cudaDevAttrPageableMemoryAccess alone is enough.

However, for performance tuning and other advanced programming, you can also query cudaDevAttrPageableMemoryAccessUsesHostPageTables to differentiate between HMM and ATS. The table below shows how to interpret the results.

Attribute                                           HMM   ATS
cudaDevAttrPageableMemoryAccess                     1     1
cudaDevAttrPageableMemoryAccessUsesHostPageTables   0     1

For portable applications that only need to know whether the programming model exposed by HMM or ATS is available, querying the pageable memory access attribute is usually sufficient.
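
As a small sketch (not from the original post), these queries can be issued with cudaDeviceGetAttribute; the interpretation of the two flags follows the table above:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  int pageable = 0, usesHostPageTables = 0;

  // 1 if the GPU can access pageable (system-allocated) memory: HMM or ATS.
  cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, device);
  // 1 on ATS systems (for example, Grace Hopper), 0 on HMM systems.
  cudaDeviceGetAttribute(&usesHostPageTables,
                         cudaDevAttrPageableMemoryAccessUsesHostPageTables, device);

  if (!pageable)
    std::printf("System-allocated memory is not GPU-accessible on this system\n");
  else
    std::printf("GPU can access system-allocated memory via %s\n",
                usesHostPageTables ? "ATS" : "HMM");
}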

Unified Memory Performance Tips

The semantics of the pre-existing Unified Memory performance hints have not changed. For applications that already use CUDA Unified Memory on hardware-coherent systems such as NVIDIA Grace Hopper, the main change is that HMM lets them run "as is" on more systems, within the constraints listed above.

Pre-existing unified memory hints also apply to system-allocated memory on HMM systems:

  1. __host__ cudaError_t cudaMemPrefetchAsync(const void* ptr, size_t nbytes, int device, cudaStream_t stream):
     asynchronously prefetches memory to a GPU (GPU device ID) or to the CPU (cudaCpuDeviceId).
  2. __host__ cudaError_t cudaMemAdvise(const void* ptr, size_t nbytes, cudaMemoryAdvise advice, int device): hints the system about:
   • the preferred location of the memory:
     cudaMemAdviseSetPreferredLocation, or
   • the devices that will access the memory: cudaMemAdviseSetAccessedBy, or
   • devices that mostly read memory that is only infrequently modified:
     cudaMemAdviseSetReadMostly.
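
As a hedged sketch (not from the original post), the following shows one way these hints might be applied to plain malloc memory on an HMM system; the compute kernel and the prefetch-then-compute pattern are illustrative assumptions, not a prescribed recipe:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: doubles each element in place.
__global__ void compute(float* data, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) data[i] = 2.0f * data[i];
}

int main() {
  const size_t n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  float* data = (float*)malloc(bytes);            // plain system allocation
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // initialized on the CPU

  int device = 0;
  // Optional tuning hints: prefer GPU residency and prefetch before the kernel runs.
  cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
  cudaMemPrefetchAsync(data, bytes, device, 0);

  compute<<<(n + 255) / 256, 256>>>(data, n);

  // Prefetch back to the CPU for host-side post-processing.
  cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
  cudaDeviceSynchronize();

  std::printf("%f\n", data[0]);                   // prints 2.000000
  free(data);
}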

Going a bit further, CUDA 12.2 adds a new API, cudaMemAdvise_v2, which lets applications choose which NUMA node a given memory range should prefer. This comes into play when HMM places the memory contents on the CPU side.

As always, memory management hints may improve or degrade performance. Behavior depends on the application and workload, but none of the hints affect the correctness of the application.

Limitations of HMM in CUDA 12.2

The initial HMM implementation in CUDA 12.2 delivers new functionality without regressing the performance of any pre-existing applications. The limitations of HMM in CUDA 12.2 are documented in detail in the CUDA 12.2 Release Notes: General CUDA. The main limitations are:

  • HMM is only available for x86_64, other CPU architectures are not yet supported.
  • HMM on HugeTLB allocations is not supported.
  • GPU atomic operations on file-backed memory and HugeTLBfs memory are not supported.
  • fork(2) without subsequent exec(3) is not fully supported.
  • Page migrations are processed in chunks of 4 KB page size.

Stay tuned for future CUDA driver updates that will address HMM limitations and improve performance.

Summary

HMM simplifies the programming model by removing the need for explicit memory management in GPU programs running on common PCIe-based (typically x86) machines. Programmers can use malloc, C++ new, and mmap directly, just as they already do for CPU programming.

HMM further boosts programmer productivity by making a wide range of standard programming-language features safe to use in CUDA programs, with no need to worry about accidentally handing system-allocated memory to a CUDA kernel.

HMM enables a seamless transition to and from the new NVIDIA Grace Hopper Superchip and similar machines. On PCIe-based machines, HMM provides the same simplified programming model that is used on the NVIDIA Grace Hopper Superchip.

Source: blog.csdn.net/kunhe0512/article/details/132533305