G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations

MICRO'23

Abstract

The authors propose G10, a unified GPU memory and storage architecture.

Based on this finding: tensor behavior in DL training is highly predictable.

G10 integrates GPU memory, host memory, and flash memory into a single memory space with unified memory access and transparent data migration. On top of this unified memory, G10 uses compile-time analysis to extract the characteristics of tensors in DL and drive the subsequent data scheduling.

1. Introduction

Today, training DL models on GPUs runs into the GPU memory wall: model and dataset sizes keep growing, but GPU memory does not grow at the same pace, so training is limited by GPU memory capacity. (Large-model sizes grow by roughly 410x every two years, while the memory capacity of GPUs, the foundation of AI compute, grows only about 2x every two years. GPU memory capacity therefore severely restricts the size of trainable models and has become a major bottleneck for developing and deploying AI technology. This is called the GPU memory wall problem. See the figure below, which is not from this paper.)

[Figure: growth of model size vs. GPU memory capacity (the GPU memory wall); not from this paper]

Therefore, the authors use low-cost memory (host memory, flash memory) to expand GPU memory, based on two findings:

  1. At any point in a DNN training iteration, only a small fraction of tensors are in active use.

  2. Most tensors' inactive periods are long enough to cover data migration.

Faced with three challenges:

  1. The memory requirements and lifetime of each tensor must be obtained.

  2. Decisions must be made about which tensors to migrate, when, and to where.

  3. Migration should be automatic, without manual setup by the programmer.

G10 consists of three parts:

  1. tensor vitality analyzer: Extracts tensor characteristics. It tracks each tensor over the execution graph generated by the compiler and obtains each tensor's size, lifetime, and related information. This runs at compile time, i.e., before model training starts.

  2. tensor migration scheduler: Plan tensor migration in advance. Migrate as many long-lived tensors as possible to external memory.

  3. Unified memory system: Simplifies GPU memory management and enables transparent tensor migration. Expand the UVM of the GPU and merge GPU memory, host memory, and flash memory into a unified memory space. Implement a unified page table that includes addresses on GPU memory, host memory, and flash memory, so that when performing tensor migration, you only need to specify the virtual address.

G10 is implemented on top of the open-source GPU simulator UVMSmart.

2. Background and Motivation

Approaches to Scaling GPU Memory
  1. Expand GPU memory with host memory:

    UVM (Unified Virtual Memory) allows programmers to access host and GPU memory through a single virtual address space; data consistency is guaranteed by the GPU hardware and runtime. Data migration is performed on demand: if the GPU accesses data that is not in GPU memory, a page fault is triggered and the data is migrated from the host to GPU memory. When GPU memory is full, an existing page is evicted to the host via LRU to make room for the incoming data. However, host memory alone cannot keep up with the growing memory requirements of DL. (A minimal UVM usage sketch follows this list.)

  2. Expand GPU memory with flash memory:

    Using an SSD to expand GPU memory incurs communication overhead. The GPU and SSD can communicate over PCIe, for example, but the overhead remains high because flash access is slow. One option is to overlap DNN computation with the communication to hide the transfer latency, but doing so requires a thorough understanding of tensor characteristics, which is what this paper studies. (A sketch of the overlap technique also follows this list.)
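To make the UVM behavior above concrete, here is a minimal generic CUDA sketch (not G10-specific): the managed allocation shares one virtual address space between host and GPU, and pages migrate on demand when either side touches them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch every element; pages not resident in GPU memory fault in on demand,
// and the UVM runtime evicts other pages (e.g., via LRU) if memory is full.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1ull << 28;   // ~1 GiB of floats; may oversubscribe GPU memory
    float *data = nullptr;
    // One virtual address space shared by host and GPU; physical placement
    // (host vs. device) is managed by the driver, not the programmer.
    cudaMallocManaged(&data, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // first touched on the host

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();
    printf("data[0] = %.1f\n", data[0]);             // pages migrate back on access
    cudaFree(data);
    return 0;
}
```

And a sketch of the overlap idea for data staged in host/flash-backed memory: while one layer computes, the next layer's data is transferred in a separate stream, so the transfer latency hides behind computation. This is the general technique, not G10's mechanism (G10 achieves the equivalent through its unified memory system and planned migrations); the function and variable names are ours.

```cuda
#include <cuda_runtime.h>

__global__ void layer_kernel(float *act, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) act[i] = act[i] * 0.5f + 1.0f;        // stand-in for one DNN layer
}

// While layer k runs in `compute`, the data needed by layer k+1 is copied in
// `xfer`, so the transfer overlaps with computation instead of adding to it.
// For true async overlap, h_next should be pinned host memory (cudaHostAlloc).
void pipelined_step(float *d_curr, float *d_next, const float *h_next, size_t n,
                    cudaStream_t compute, cudaStream_t xfer) {
    layer_kernel<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(d_curr, n);
    cudaMemcpyAsync(d_next, h_next, n * sizeof(float),
                    cudaMemcpyHostToDevice, xfer);   // runs concurrently with the kernel
    cudaStreamSynchronize(xfer);                     // layer k+1's data must be resident
    cudaStreamSynchronize(compute);                  // before launching layer k+1
}
```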

3. GPU Memory Characterization

Analyze the training process of DNN models: analyze the dataflow graph and profile the CUDA kernels.

The analysis yields four observations.

Observation 01 : Small memory requirement of active tensors

[Figure: memory footprint of active vs. inactive tensors across DNN workloads]

This experiment confirms that, at any given time, a large portion of tensors are indeed inactive.

Only a small part of the tensors are active and occupy only a small part of the memory (basically less than 10%)

Observation 02 : Long unused time of inactive tensors

[Figure: distribution (CDF) of tensor inactive time across DNN workloads]

Most tensors are active for only a small fraction of the time. About 50% of the tensors in BERT and ViT have inactive periods longer than 10^5 μs, and about 60% of the tensors in ResNet152 and Inception have inactive periods longer than 10^7 μs, far longer than SSD access latency (e.g., 20 μs), so these tensors can be swapped out.

In a typical DNN dataflow graph, a tensor is used only twice; with branch and join structures it may be used a few more times, but still a bounded number of times, and the overall structure remains roughly linear.

This experiment shows that most tensors' inactive periods are long enough to offload them to the SSD and prefetch them back from the SSD to the GPU in time.

Observation 03 : Diversity of inactive tensors

[Figure 4: distribution of tensor inactive time vs. tensor size]

As Fig 4 shows, tensor inactive times and sizes are spread over a wide range, and about 60%-80% of tensors are swappable (inactive time > SSD latency). But different tensors have different inactive times and sizes: while a tensor is swapped out, GPU memory usage is reduced for that period, so a good algorithm should pick the tensors that maximize the benefit (the longest periods and the largest memory reduction).

Observation 04 : Complexity of scheduling tensor swapping

As tensors are swapped in and out, GPU memory usage changes dynamically, so a static policy for deciding when to swap in and out cannot make globally optimal decisions; each migration changes the memory pressure that later decisions see.

4. G10 Design

[Figure: G10 architecture overview]

Three main parts:

  1. Tensor Vitality Analysis: Tracks each tensor at compile time, obtaining its size and lifetime, as well as dependencies between tensors.

  2. Smart Tensor Migration Scheduler: Based on the tensor information obtained above, it plans tensor eviction and prefetching to optimize performance. (Each migration affects the memory state that later migration decisions see, so a dynamic algorithm is needed.) G10 then inserts eviction and prefetch instructions into the compiled program, and the GPU executes these instructions at run time.

  3. Unified Memory/Storage Space: Merges GPU memory, host memory, and SSD into a unified memory space, making tensor migration transparent to developers.

4.2 Tensor Vitality Analysis

This step mainly derives each tensor's inactive periods (estimated from GPU kernel execution times).

[Figure: a tensor's lifetime on the execution timeline, with born/dead points and inactive periods]

born: The tensor is used for the first time

dead: After the tensor is used for the last time, it can be deallocated (deallocate)

The inputs and outputs of the currently executing operator are active tensors and must be in GPU memory; a tensor that is not currently used and is not yet dead is an inactive tensor.

inactive time period: The time when a tensor is inactive but not dead

PS: A tensor may have multiple inactive time periods and may be swapped in and out multiple times.

According to the inactive time period, you can know when a tensor can be swapped out and when it needs to be swapped in.

How is this done? (How are a tensor's inactive periods obtained?)

Since DNN execution is predictable (predictable dataflow), G10 performs offline profiling at compile time to obtain tensor information, uses GPU kernel execution times to estimate the inactive periods, and estimates each tensor's eviction and prefetch overhead from its size and the available bandwidth.
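As a rough illustration of this compile-time step, here is a small host-side sketch (our own simplification, with made-up type and function names, not the paper's code): it derives inactive periods from the kernel schedule and estimates migration cost as size / bandwidth.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical compile-time analysis: given the estimated kernel schedule and
// each tensor's use sites, derive inactive periods and migration overheads.
struct InactivePeriod { double start_us, end_us; };

struct TensorInfo {
    size_t bytes;
    std::vector<int> used_by;              // indices of kernels that read/write it
    std::vector<InactivePeriod> inactive;  // gaps between consecutive uses
};

void analyze_vitality(std::vector<TensorInfo> &tensors,
                      const std::vector<double> &kernel_start_us,
                      const std::vector<double> &kernel_end_us) {
    for (auto &t : tensors) {
        for (size_t u = 0; u + 1 < t.used_by.size(); ++u) {
            double gap_start = kernel_end_us[t.used_by[u]];
            double gap_end   = kernel_start_us[t.used_by[u + 1]];
            if (gap_end > gap_start)
                t.inactive.push_back({gap_start, gap_end});
        }
    }
}

// Estimated one-way migration latency for a tensor over a given link.
double migration_us(size_t bytes, double bandwidth_bytes_per_us) {
    return bytes / bandwidth_bytes_per_us;
}

// A tensor is worth evicting in a period only if the gap covers both the
// eviction and the prefetch back.
bool swappable(const TensorInfo &t, const InactivePeriod &p, double bw) {
    return (p.end_us - p.start_us) > 2.0 * migration_us(t.bytes, bw);
}
```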

4.3 Smart Tensor Eviction

There are three main challenges to consider:

  1. Different tensors have different sizes and inactive times, so evicting them relieves GPU memory pressure by different amounts.

  2. Migrate to host memory or SSD? The host memory bandwidth is higher and the SSD capacity is larger.

  3. The migration bandwidth should be fully utilized, and an appropriate time point should be chosen for each migration.

Selecting eviction candidates

GPU memory pressure: The part of the memory occupied by non-evicted tensors that exceeds the GPU memory capacity (that is, the shaded area in the figure below)

Benefit: the reduction in GPU memory pressure while the tensor is evicted

Cost: the I/O overhead of evicting and later prefetching the tensor

[Figure 7: how evicting a tensor changes GPU memory pressure over time]

❓❓❓ Why, when evicting X in Fig 7(2), is the length of time during which memory pressure == GPU capacity equal to the time spent evicting X? ❓❓❓
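A rough sketch of the candidate-selection idea above (our simplification; the paper's actual algorithm also accounts for when each eviction overlaps with the pressure curve): score each candidate by the memory-pressure reduction it buys per unit of I/O time, then greedily pick candidates until the projected excess fits in GPU memory.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical benefit/cost scoring for eviction candidates (names are ours).
struct Candidate {
    size_t bytes;          // tensor size
    double inactive_us;    // length of the inactive period under consideration
    double evict_us;       // estimated time to write the tensor out
    double prefetch_us;    // estimated time to bring it back
};

// Benefit: bytes of pressure removed, weighted by how long they stay removed.
// Cost: the migration I/O time this eviction consumes.
double score(const Candidate &c) {
    double freed_for_us = c.inactive_us - c.evict_us - c.prefetch_us;
    if (freed_for_us <= 0) return 0.0;                     // not worth evicting
    return (double)c.bytes * freed_for_us / (c.evict_us + c.prefetch_us);
}

// Greedily evict the best-scoring candidates until the excess over GPU
// capacity (the shaded "memory pressure" area) is expected to be covered.
std::vector<Candidate> pick_evictions(std::vector<Candidate> cands, size_t excess_bytes) {
    std::sort(cands.begin(), cands.end(),
              [](const Candidate &a, const Candidate &b) { return score(a) > score(b); });
    std::vector<Candidate> chosen;
    size_t freed = 0;
    for (const auto &c : cands) {
        if (freed >= excess_bytes || score(c) == 0.0) break;
        chosen.push_back(c);
        freed += c.bytes;
    }
    return chosen;
}
```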

Choosing eviction destination

Host memory or SSD?

Swap out to the SSD first: host memory capacity is relatively small, so swapping out to host memory first would quickly saturate it, after which tensors would still have to be moved to the SSD.

When the SSD bandwidth is saturated, tensors are swapped out to host memory instead.

[Figure: choosing between SSD and host memory as the eviction destination]
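A minimal sketch of this destination policy (the field names and exact tie-breaking are our assumption): prefer the SSD, and spill to host memory only while the SSD channel is busy with in-flight migrations.

```cuda
// Hypothetical destination choice: SSD first (large capacity), host memory
// only while the SSD link is saturated, so both links stay busy.
enum class Dest { SSD, HOST };

struct LinkState {
    double ssd_busy_until_us;    // when the SSD channel becomes free again
    double host_busy_until_us;   // when the CPU-GPU DMA channel becomes free
};

Dest choose_destination(double now_us, const LinkState &links) {
    // If the SSD channel is free now, or frees up no later than the host
    // channel, evict to SSD; otherwise spill to host memory.
    if (links.ssd_busy_until_us <= now_us ||
        links.ssd_busy_until_us <= links.host_busy_until_us)
        return Dest::SSD;
    return Dest::HOST;
}
```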

4.4 Smart Tensor Prefetching

[Figure: latest safe prefetch time vs. optimized prefetch time on the execution timeline]

latest safe prefetch time: the latest time a prefetch can start without making the GPU wait for the tensor

optimized prefetch time: the earliest time a prefetch can start without pushing GPU memory pressure above GPU memory capacity

Prefetching at the latest safe prefetch time requires accurate knowledge of the tensor's inactive time and of the available bandwidth; otherwise the GPU may end up waiting for the tensor. Therefore the optimized prefetch time is preferred; if no optimized prefetch time exists, the latest safe prefetch time is used.
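A small sketch of how the two prefetch points could be combined (our own formulation, with microsecond timestamps on the planned timeline; `earliest_fit_us` stands in for the optimized prefetch time and is negative if none exists):

```cuda
#include <algorithm>

// Hypothetical prefetch scheduling for one tensor (names are ours).
struct PrefetchPlan { double start_us; bool uses_latest_safe; };

// next_use_us:     when the tensor is needed again by a kernel.
// transfer_us:     estimated time to bring it back (size / bandwidth).
// earliest_fit_us: earliest time at which bringing it back no longer pushes
//                  projected memory pressure above GPU capacity (< 0 if none).
PrefetchPlan schedule_prefetch(double next_use_us, double transfer_us,
                               double earliest_fit_us) {
    double latest_safe = next_use_us - transfer_us;  // any later and the GPU stalls
    if (earliest_fit_us >= 0.0) {
        // Prefer the optimized time: start as early as capacity allows, but
        // never later than the latest safe point.
        return {std::min(earliest_fit_us, latest_safe), false};
    }
    return {latest_safe, true};  // fall back to the latest safe prefetch time
}
```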

Code Instrumentation

[Figure: compiled program instrumented with eviction and prefetch instructions]

The compiler inserts eviction, prefetch, and related instructions into the program at the scheduled points.
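Purely illustrative: what the instrumented kernel stream might look like. The call names `g10_pre_evict` / `g10_prefetch` and the kernel/tensor IDs are made up for this sketch (declarations only); they stand for whatever migration instructions the compiler actually emits.

```cuda
// Illustrative only: migration calls interleaved with the original kernels.
void g10_pre_evict(int tensor_id);   // start writing the tensor out (asynchronous)
void g10_prefetch(int tensor_id);    // start bringing the tensor back (asynchronous)
void launch_kernel(int kernel_id);   // original compute kernel

void instrumented_training_step() {
    launch_kernel(/*conv1_forward*/ 0);
    g10_pre_evict(/*conv1_activations*/ 12);   // not needed until the backward pass
    launch_kernel(/*conv2_forward*/ 1);
    // ... more forward kernels ...
    g10_prefetch(/*conv1_activations*/ 12);    // issued early enough to hide the transfer
    launch_kernel(/*conv1_backward*/ 7);
}
```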

4.5 Unified GPU Memory and Storage

[Figure 10: the unified GPU memory/storage system and its migration workflow]

When a G10 API call is encountered, the smart migration handler looks up the corresponding data address in the unified page table, places the request on the migration metadata queues, waits for further requests, performs eviction/prefetch at batch granularity, and decides whether each batch takes the CPU (host memory) path or the SSD path.

A translated excerpt from the paper:

As shown in Figure 10, for tensor eviction and prefetching, G10 relies on a unified page table to identify the physical location of the tensor (1). For pre-eviction, G10 looks for GPU pages to evict; the migration metadata is then stored in the corresponding migration metadata queue (2). The migration arbiter selects several page migrations to form the next migration batch and stores them in the migration set (3). During this process, the G10 driver also communicates with the GPU to allocate GPU memory as needed. Migrations in the migration set are issued in batches periodically; the corresponding SSD-GPU data transfers are handled via direct storage access (DSA), and CPU-GPU data transfers via DMA (4). After data migration, the unified page table and the corresponding TLB entries are updated (5).

G10 builds a unified page table that stores the address mappings across GPU memory, host memory, and the SSD.

Specifically, the GPU MMU maintains the unified page table. When the GPU accesses data that is not in GPU memory, the data is brought to the GPU implicitly through the page fault mechanism, which makes the GPU wait. Thanks to the prefetching design, however, page faults are rarely triggered.
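To make the unified mapping concrete, here is a hypothetical layout of one page table entry (field names and bit widths are our assumption, not the paper's format): a single virtual page can resolve to GPU memory, host memory, or an SSD page, and migration only rewrites this entry.

```cuda
#include <cstdint>

// Hypothetical unified page table entry: one virtual page maps to a physical
// location that may be in GPU memory, host memory, or on the SSD.
enum class Location : uint8_t { GPU = 0, HOST = 1, SSD = 2 };

struct UnifiedPTE {
    uint64_t physical_page : 52;  // frame in GPU/host DRAM, or SSD logical page
    uint64_t location      : 2;   // which tier currently holds the page (Location)
    uint64_t present       : 1;   // resident somewhere in the unified space
    uint64_t dirty         : 1;   // must be written back before discarding
    uint64_t migrating     : 1;   // a migration for this page is in flight
    uint64_t reserved      : 7;
};

// Migration only rewrites this entry (and the TLB); the virtual address the
// GPU program uses never changes, which is what makes migration transparent.
```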

❓❓❓ Where is the unified page table kept? GPU memory? Host memory? SSD? On the GPU?? ❓❓❓

A translated excerpt from the paper:

When incorporating SSDs into the UVM system, we follow the approach described in prior research [3]. We rely on the host page fault mechanism to complete address translation: after an access to a page that is not in GPU memory, the GPU page fault handler issues an interrupt to the host, which is responsible for moving the data. To simulate SSD internals and capture their activities, such as garbage collection (GC) and flash chip accesses, we developed an SSD simulator based on SSDSim [5] and integrated it into our simulation framework. Since we measure overall system performance in our experiments, internal SSD activity is therefore taken into account.

5. Implementation Details

Tensor vitality analyzer

The tensor vitality analyzer takes the DNN model and the execution time of each kernel as input.

Simulator framework

To simulate DNN execution more accurately, the model is first run on an A100 and all kernel executions are traced; the simulator then replays this kernel trace.

❓❓❓What should I do if the model takes up too much memory and cannot be executed on the GPU?❓❓❓

7. Evaluation

Baselines

Ideal: GPU with unlimited onboard memory

Base UVM: a basic GPU-CPU-SSD UVM system with on-demand data migration driven by page faults

DeepUM+

FlashNeuron

[Figures: evaluation results comparing G10 against the baselines]

Robustness

My Conclusion

The optimization techniques themselves (offloading, prefetching) are already mature. The main difficulty lies in bringing the SSD into UVM, which amounts to changing the architecture.


Origin blog.csdn.net/illyh/article/details/134620397