Day 6 CUDA memory management

Overview of the memory allocation and data-copy workflow

  1. Allocate a block of memory on the GPU and store its address in mem_device
  2. Allocate a block of memory on the CPU, store its address in mem_host, and write a value into the block it points to
  3. Copy the data from the block pointed to by mem_host to the block pointed to by mem_device
  4. Allocate a block of page-locked memory on the CPU and store its address in mem_page_locked
  5. Finally, copy the data from the block pointed to by mem_device back to the mem_page_locked block on the CPU

Memory model

  1. Memory is divided into

     - Host Memory: CPU-side memory, i.e. system RAM
     
     - Device Memory: GPU-side memory, i.e. video memory (VRAM)
         - Device memory is further divided into:
                 - Global Memory (3)
                 - Register Memory (1)
                 - Texture Memory (2)
                 - Shared Memory (2)
                 - Constant Memory (2)
                 - Local Memory (3)
         
         - The key rule of thumb: the closer a memory is to the compute chip, the faster it is, the smaller its capacity, and the more expensive it is
         - The number in parentheses indicates the distance from the compute chip
    
  2. Allocate GPU memory with cudaMalloc; the allocation is made on the current device selected by cudaSetDevice

  3. Allocate page-locked memory, also known as pinned memory, with cudaMallocHost

    • Page-locked memory is host memory that the CPU can access directly
    • Page-locked memory can also be accessed directly by the GPU, via DMA (Direct Memory Access)
      • Note that GPU access to it performs relatively poorly, because host memory is far from the GPU (across PCIe, etc.), so it is not suitable for large data transfers
    • Page-locked memory occupies physical memory; excessive use degrades system performance (it undermines virtual-memory mechanisms such as paging)
  4. cudaMemcpy

    • If the host memory is not page-locked, then:
      • The process of Device To Host is equivalent to
        • pinned = cudaMallocHost
        • copy Device to pinned
        • copy pinned to Host
        • free pinned
      • The process of Host To Device is equivalent to
        • pinned = cudaMallocHost
        • copy Host to pinned
        • copy pinned to Device
        • free pinned
    • If the host memory is page-locked, then:
      • The process of Device To Host is equivalent to
        • copy Device to Host
      • The process of Host To Device is equivalent to
        • copy Host to Device
  • It is recommended to release allocations in the reverse order in which they were made (last allocated, first released), e.g.:
    checkRuntime(cudaFreeHost(memory_page_locked));
    delete [] memory_host;
    checkRuntime(cudaFree(memory_device)); 

Memory allocated through a CUDA API generally has a corresponding CUDA API for releasing it; memory allocated with new is released with delete.


// CUDA runtime header
#include <cuda_runtime.h>

#include <stdio.h>
#include <string.h>

#define checkRuntime(op)  __check_cuda_runtime((op), #op, __FILE__, __LINE__)

bool __check_cuda_runtime(cudaError_t code, const char* op, const char* file, int line){
    if(code != cudaSuccess){
        // Look up the error name and human-readable message for the failed call
        const char* err_name = cudaGetErrorName(code);
        const char* err_message = cudaGetErrorString(code);
        printf("runtime error %s:%d  %s failed. \n  code = %s, message = %s\n", file, line, op, err_name, err_message);
        return false;
    }
    return true;
}

int main(){

    int device_id = 0;
    checkRuntime(cudaSetDevice(device_id));

    float* memory_device = nullptr;
    checkRuntime(cudaMalloc(&memory_device, 100 * sizeof(float))); // pointer to device

    float* memory_host = new float[100];
    memory_host[2] = 520.25;
    checkRuntime(cudaMemcpy(memory_device, memory_host, sizeof(float) * 100, cudaMemcpyHostToDevice)); // copy host data to device memory

    float* memory_page_locked = nullptr;
    checkRuntime(cudaMallocHost(&memory_page_locked, 100 * sizeof(float))); // the returned address is that of the allocated pinned memory, stored in memory_page_locked
    checkRuntime(cudaMemcpy(memory_page_locked, memory_device, sizeof(float) * 100, cudaMemcpyDeviceToHost)); // copy device data back into pinned host memory

    printf("%f\n", memory_page_locked[2]);
    checkRuntime(cudaFreeHost(memory_page_locked));
    delete [] memory_host;
    checkRuntime(cudaFree(memory_device)); 

    return 0;
}

Origin blog.csdn.net/qq_38973721/article/details/129796765