TensorFlow source code analysis: from a GPU OOM error to TensorFlow's BFC memory management

Foreword

Running GPU training on the platform resulted in a CUDA OOM, with the following error message:

E  Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11711807488

No GPU-related options had been set for the session. TensorFlow's suggestion is to use two parameters to control the GPU memory allocation:

# add gpu growth flags
tf_config.gpu_options.allow_growth = True
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.1

per_process_gpu_memory_fraction parameter

  • The per_process_gpu_memory_fraction parameter is a per-process memory factor for a single GPU: a fraction value that determines how much of the GPU's memory the process takes, and thus how much memory is left for the system. If it is not set, TensorFlow occupies memory greedily: when the available memory is less than 2GB, it reserves only 225MB for the system, and when the available memory is 2GB or more, it reserves 5% of the available memory (a fraction of 0.05) and at least 300MB. Setting the factor lets you effectively control how much memory you actually take:
  int64 allocated_memory;
  double config_memory_fraction =
      options.config.gpu_options().per_process_gpu_memory_fraction();
  if (config_memory_fraction == 0) {
    // Fraction unset: greedily take all available memory, minus the
    // minimum amount reserved for the system.
    allocated_memory = available_memory;
    const int64 min_system_memory = MinSystemMemory(available_memory);
    if (min_system_memory < allocated_memory) {
      allocated_memory -= min_system_memory;
    }
  } else {
    // Fraction set: take that fraction of the GPU's *total* memory.
    allocated_memory = total_memory * config_memory_fraction;
  }
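
The reservation rule described above lives in MinSystemMemory. A minimal sketch of that heuristic, matching the constants described above (platform-specific adjustments in the real function are omitted):

  #include <algorithm>
  #include <cstdint>

  // Sketch of the MinSystemMemory heuristic: below 2GiB of available
  // memory, reserve 225MiB for the system; otherwise reserve
  // max(300MiB, 5% of available memory).
  std::int64_t MinSystemMemory(std::int64_t available_memory) {
    if (available_memory < (std::int64_t{1} << 31)) {  // < 2GiB
      return 225 * 1024 * 1024;                        // 225MiB
    }
    const std::int64_t five_percent =
        static_cast<std::int64_t>(available_memory * 0.05);
    return std::max<std::int64_t>(300 * 1024 * 1024, five_percent);
  }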

  • If you are running on a platform where GPU memory is already heavily used, the remaining memory on each GPU is not necessarily the same. The fraction is applied to the GPU's total memory rather than to what is actually left, so a single factor cannot adapt the allocation to each process, and a request that exceeds the remaining memory simply fails.

allow_growth parameter

This parameter may look odd: allow_growth literally means allowing growth, that is, allowing memory to keep being allocated later? In fact, when TensorFlow starts it does not actually request the memory; the value computed at initialization only controls how much memory may actually be used later.

TensorFlow runs a layer of virtual memory management, BFC, on top of the physical allocation.

BFC memory allocation

This is a virtual memory allocator implementing a simplified version of Doug Lea's malloc (dlmalloc). It defragments memory by coalescing, implementing the 'best-fit with coalescing' algorithm, and it requires that all memory allocations go through this interface.

1 Chunk structure

1.1 Structure

The chunk is TensorFlow's smallest memory unit: it consists of a number of contiguous 256-byte (kMinAllocationSize) memory blocks, and TensorFlow's memory management is built on top of chunks.
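
For reference, here is a trimmed-down sketch of the Chunk structure, adapted from BFCAllocator in the TensorFlow source (some bookkeeping fields are omitted); its fields are discussed in the following subsections.

  // A chunk: a run of contiguous 256-byte blocks carved out of a region.
  struct Chunk {
    size_t size = 0;            // buffer size, a multiple of 256 bytes
    size_t requested_size = 0;  // what the client actually asked for
    int64 allocation_id = -1;   // -1 when the chunk is not in use
    void* ptr = nullptr;        // start address of the chunk's memory

    // Handles of the chunks occupying the memory immediately before
    // ('ptr - prev->size') and after ('ptr + size') this chunk.
    ChunkHandle prev = kInvalidChunkHandle;
    ChunkHandle next = kInvalidChunkHandle;

    bool in_use() const { return allocation_id != -1; }
  };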


1.1.1 ChunkHandle

A ChunkHandle is an index into the chunk vector: TensorFlow keeps a vector of all chunks, and a chunk's ChunkHandle is simply its subscript in that vector.

    // If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
    // preceding the memory used by this chunk.  E.g., It should start
    // at 'ptr - prev->size'
    ChunkHandle prev = kInvalidChunkHandle;

    // If not kInvalidChunkHandle, the memory referred to by 'next' is directly
    // following the memory used by this chunk.  E.g., It should be at
    // 'ptr + size'
    ChunkHandle next = kInvalidChunkHandle;

The Chunk structure holds two ChunkHandles, prev and next (both indexes into the vector of all chunks), which point to the chunks occupying the adjacent contiguous memory blocks immediately before and after this one.

1.1.2 ptr pointer

ptr is a memory pointer to the starting position of the chunk's memory; because a chunk covers contiguous memory, only its size needs to be recorded in addition.

1.2 Allocating a chunk

TensorFlow keeps a vector of all chunks. To avoid frequently creating and destroying chunk entries, released chunks are reused; to find released chunks quickly, TensorFlow threads them into a linked list, with free_chunks_list_ pointing at the head of the list.


1.3 Deletion of chunks

Chunks are reused, so deleting a chunk must erase its state, such as ptr. Of course, this does not release the memory that ptr points to; it only invalidates the ChunkHandle that the Region records for that address, and then pushes the chunk onto the head of the linked list of released chunks, so that free_chunks_list_ points at the newly released chunk.
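
A minimal sketch of this recycling scheme, following the shape of BFCAllocator::AllocateChunk / DeallocateChunk in the TensorFlow source (simplified; the free list is threaded through the chunks' next fields):

  BFCAllocator::ChunkHandle BFCAllocator::AllocateChunk() {
    if (free_chunks_list_ != kInvalidChunkHandle) {
      // Reuse a previously released chunk entry from the free list.
      ChunkHandle h = free_chunks_list_;
      Chunk* c = ChunkFromHandle(h);
      free_chunks_list_ = c->next;
      return h;
    }
    // No released entry available: grow the chunk vector by one slot.
    ChunkHandle h = chunks_.size();
    chunks_.resize(h + 1);
    return h;
  }

  void BFCAllocator::DeallocateChunk(ChunkHandle h) {
    // Erase the chunk's state (but not the memory it pointed to) and
    // push the entry onto the head of the free list.
    Chunk* c = ChunkFromHandle(h);
    c->ptr = nullptr;
    c->next = free_chunks_list_;
    free_chunks_list_ = h;
  }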

2 Region

A Region is a single allocated block of contiguous memory. A Region can be split into multiple chunks, and each chunk in turn covers several consecutive 256-byte memory blocks.


2.1 Allocating a Region

A Region is only requested when memory actually needs to be used:

  size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);
  void* mem_addr = suballocator_->Alloc(32, bytes);

From the code above we can see that each Region allocation is bounded by two quantities:

curr_region_allocation_bytes_ parameter

  if (allow_growth) {
    // 1MiB smallest initial allocation, unless total memory available
    // is less.
    curr_region_allocation_bytes_ =
        RoundedBytes(std::min(total_memory, size_t{1048576}));
  } else {
    curr_region_allocation_bytes_ = RoundedBytes(total_memory);
  }

The allow_growth here is the parameter set at the beginning:

tf_config.gpu_options.allow_growth = True

When allow_growth is turned off, curr_region_allocation_bytes_ defaults to the full remaining memory size, i.e. there is only one Region.

When allow_growth is turned on, curr_region_allocation_bytes_ starts at a minimum of 1MiB and there can be multiple Regions; by default curr_region_allocation_bytes_ grows by a factor of 2, i.e. each Region requested is itself contiguous, and Region sizes grow at a rate of at least 2x.

If the memory actually being requested is larger than curr_region_allocation_bytes_, curr_region_allocation_bytes_ is doubled until the required size is covered:

  bool increased_allocation = false;
  while (rounded_bytes > curr_region_allocation_bytes_) {
    curr_region_allocation_bytes_ *= 2;
    increased_allocation = true;
  }

available_bytes parameter

available_bytes is the remaining memory that can still be allocated. At initialization, TensorFlow queries the GPU's available memory, and every subsequent Region allocation is subtracted from that figure; in other words, the GPU's remaining memory is obtained only once, at the start of the program, for the entire run. If the program runs on a shared GPU platform, the actual remaining memory keeps changing, while the available memory TensorFlow saw was captured at startup (without actually being reserved), so memory requests made later during computation have a high chance of hitting OOM.

2.2 The Region's ChunkHandles

Each Region is divided into an array of ChunkHandles, one per 256 bytes, and each ChunkHandle points into the chunk vector discussed in the previous chapter.
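
A simplified sketch of how a Region maps addresses back to chunk handles, adapted from AllocationRegion in the TensorFlow source (the typedef and constants at the top approximate the ones in the BFC source, with 256-byte granularity):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  using ChunkHandle = std::size_t;
  constexpr ChunkHandle kInvalidChunkHandle = SIZE_MAX;
  constexpr int kMinAllocationBits = 8;  // 2^8 = 256 bytes

  class AllocationRegion {
   public:
    AllocationRegion(void* ptr, std::size_t memory_size)
        : ptr_(ptr),
          memory_size_(memory_size),
          end_ptr_(static_cast<char*>(ptr) + memory_size) {
      // One handle slot per 256-byte block in the region.
      handles_.assign(memory_size >> kMinAllocationBits, kInvalidChunkHandle);
    }

    void* ptr() const { return ptr_; }
    void* end_ptr() const { return end_ptr_; }

    ChunkHandle get_handle(const void* p) const { return handles_[IndexFor(p)]; }
    void set_handle(const void* p, ChunkHandle h) { handles_[IndexFor(p)] = h; }
    void erase(const void* p) { set_handle(p, kInvalidChunkHandle); }

   private:
    // Offset of 'p' from the region start, in 256-byte units.
    std::size_t IndexFor(const void* p) const {
      auto p_int = reinterpret_cast<std::uintptr_t>(p);
      auto base_int = reinterpret_cast<std::uintptr_t>(ptr_);
      return static_cast<std::size_t>((p_int - base_int) >> kMinAllocationBits);
    }

    void* ptr_ = nullptr;               // start address of the region
    std::size_t memory_size_ = 0;       // total bytes in the region
    void* end_ptr_ = nullptr;           // one past the last byte
    std::vector<ChunkHandle> handles_;  // one entry per 256-byte offset
  };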

2.3 Region array

Each request for contiguous memory produces one Region, and multiple Regions form a Region vector:

private:
    std::vector<AllocationRegion> regions_;

How do we locate the Region a chunk belongs to? Each Region records its start and end addresses, and each chunk stores its own start address; comparing the chunk's address against each Region's address range determines the Region it belongs to.
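
A minimal sketch of that lookup, written as a linear scan for clarity (the actual region manager in the TensorFlow source keeps regions_ sorted by address and locates the region with a binary search):

  // Find the region whose [ptr, end_ptr) range contains address 'p'
  // (uses the AllocationRegion sketch from section 2.2).
  const AllocationRegion* RegionFor(
      const void* p, const std::vector<AllocationRegion>& regions) {
    for (const AllocationRegion& r : regions) {
      if (p >= r.ptr() && p < r.end_ptr()) {
        return &r;
      }
    }
    return nullptr;  // address not owned by any region
  }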

3 Bin

Region and Chunk were discussed in the previous chapters, but when new memory is allocated, finding a matching free chunk quickly and efficiently matters a great deal, and searching for free chunks inside each Region would clearly be very inefficient. TensorFlow therefore builds a global set of bins over the chunks. Each bin manages free chunks within a certain size range: bin number bin_num covers sizes from (2^bin_num)*256 to (2^(bin_num+1))*256 - 1 bytes, and every chunk is a memory block whose size is a multiple of 256 bytes.
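
Mapping a requested size to its bin is then essentially a base-2 logarithm of the size in 256-byte units. A sketch under those assumptions (kNumBins is 21 in the TensorFlow source):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  // Compute the bin number for a byte size.
  // Bin b holds free chunks of size [2^b * 256, 2^(b+1) * 256).
  int BinNumForSize(std::size_t bytes) {
    constexpr int kNumBins = 21;
    std::uint64_t v = std::max<std::size_t>(bytes, 256) >> 8;  // 256-byte units
    int b = 0;
    while (v >>= 1) ++b;               // floor(log2(v))
    return std::min(b, kNumBins - 1);  // oversized requests clamp to last bin
  }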


Each Bin keeps a set of free chunks, sorted by size:

    typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
    // List of free chunks within the bin, sorted by chunk size.
    // Chunk * not owned.
    FreeChunkSet free_chunks;
  • An allocation request for some amount of memory arrives
  • The size is used to compute the Bin it belongs to
  • The free chunk set of that Bin is traversed; if no fit is found, the search continues in ever larger Bins until a free chunk is found
  • If still nothing is found, memory really has to be requested from the driver: a block of curr_region_allocation_bytes_ is allocated as a new Region, which initially forms one large chunk; that chunk is inserted into the free chunk set of its Bin as a free block, and the search is retried
  • If a chunk is found, it is checked whether the free chunk is at least twice the required memory
  • To avoid wasting memory, such an oversized free chunk is split into 2 chunks: the smaller chunk is handed to the program, and the remaining larger chunk is re-inserted into the free chunk set of the corresponding Bin (see the sketch after this list)
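
A condensed sketch of this search loop, loosely following BFCAllocator::FindChunkPtr in the TensorFlow source (rounding, locking and statistics are omitted):

  void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes) {
    // Walk from the first suitable bin up to the largest bin.
    for (; bin_num < kNumBins; bin_num++) {
      Bin* b = BinFromIndex(bin_num);
      // free_chunks is sorted by size, so the first fit is the best fit.
      for (auto citer = b->free_chunks.begin();
           citer != b->free_chunks.end(); ++citer) {
        const ChunkHandle h = *citer;
        Chunk* chunk = ChunkFromHandle(h);
        if (chunk->size >= rounded_bytes) {
          // Remove the chunk from the bin's free set before handing it out.
          RemoveFreeChunkIterFromBin(&b->free_chunks, citer);
          // If the chunk is at least twice as large as needed, split it and
          // return the remainder to the free set of the appropriate bin.
          if (chunk->size >= rounded_bytes * 2) {
            SplitChunk(h, rounded_bytes);
            chunk = ChunkFromHandle(h);  // the chunk vector may have moved
          }
          chunk->allocation_id = next_allocation_id_++;
          return chunk->ptr;
        }
      }
    }
    return nullptr;  // caller extends with a new Region and retries
  }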

4 Merging and splitting of chunks

To use memory more efficiently, larger chunks are split into smaller ones; the splitting strategy was introduced in the previous chapter. When chunks are released, TensorFlow tries to merge them, and the merging strategy is: only chunks whose memory addresses are adjacent can be merged.

Remember a chunk's prev and next?

  BFCAllocator::ChunkHandle h_neighbor = c->next;  // c's old neighbor
  new_chunk->prev = h;           // the new chunk sits right after c
  new_chunk->next = h_neighbor;
  c->next = h_new_chunk;
  if (h_neighbor != kInvalidChunkHandle) {
    ChunkFromHandle(h_neighbor)->prev = h_new_chunk;  // neighbor points back
  }

A split produces two adjacent chunks. When a large chunk is split in two, the new chunk's prev points to the original chunk and its next points to the original chunk's old neighbor; at the same time, that neighbor's prev is updated to point to the new chunk.

When a chunk is released, its prev and next are checked, and if the chunk that prev or next points to is not in use, a merge is attempted.
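
A sketch of the merge step, modeled on BFCAllocator::Merge in the TensorFlow source (bin bookkeeping is omitted; h2 is assumed to be h1's next, i.e. address-adjacent):

  void BFCAllocator::Merge(BFCAllocator::ChunkHandle h1,
                           BFCAllocator::ChunkHandle h2) {
    Chunk* c1 = ChunkFromHandle(h1);
    Chunk* c2 = ChunkFromHandle(h2);
    // Fix up the doubly linked list: h1 absorbs h2, so h1's new neighbor
    // is whatever used to follow h2.
    ChunkHandle h3 = c2->next;
    c1->next = h3;
    if (h3 != kInvalidChunkHandle) {
      ChunkFromHandle(h3)->prev = h1;
    }
    // The merged chunk now covers both memory ranges.
    c1->size += c2->size;
    // Recycle h2's entry onto the released-chunk list (see section 1.3).
    DeleteChunk(h2);
  }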

