GPU Memory Hierarchy
A PhD student at the Institute of Chemistry, Chinese Academy of Sciences
Research, development, and application of parallel computational simulation software
Email: [email protected] (discussion of issues is welcome)
Abstract
GPU memory comes in several types that differ in speed and capacity, and understanding them matters a great deal for tuning the performance of GPU programs. This article addresses the following questions:
1. What types of memory are there?
2. How do you check the sizes of the device memories?
3. How fast is each memory to access?
4. How are the different levels of storage related?
5. Caveats, and the advantages and disadvantages of each kind of storage.
Main text
(Figure: GPU memory configuration diagram)
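Concerning question 2, here is a minimal sketch of how one might query the per-device memory limits with cudaGetDeviceProperties; the file name and printed labels are only illustrative, the cudaDeviceProp fields are the standard ones:

//query_mem.cu -- sketch: query device memory sizes
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    printf("global memory      : %zu bytes\n", prop.totalGlobalMem);
    printf("constant memory    : %zu bytes\n", prop.totalConstMem);
    printf("shared mem / block : %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers / block  : %d\n", prop.regsPerBlock);
    return 0;
}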
① Register memory
Advantage: the fastest access speed of all!
Disadvantage: limited in number.
Use: ordinary variables defined inside a __global__ or __device__ function are register variables.
Example:
//kernel.cu
__global__ void register_test()
{
    int a = 1;
    double b = 2.0;
}

//main.cu
int main()
{
    int nBlock = 100;
    register_test<<<nBlock, 128>>>();
    return 0;
}
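A follow-up note: to see how many registers a kernel actually consumes, one way (assuming you build with nvcc) is to request a resource-usage report, e.g. nvcc -Xptxas -v kernel.cu, which prints the number of registers used by each kernel.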
② Shared memory
Advantages:
1. Cached on chip; roughly two orders of magnitude faster than global memory.
2. All threads within the same thread block can read and write it.
3. Its lifetime is tied to the thread block.
Disadvantage: limited size.
Use: the keyword __shared__, e.g. __shared__ double A[128];
Suitable scenarios: data that is read repeatedly, for example a reduction sum a = Σ A[i]. Shared memory is an important means of program optimization! If the data is not reused, as in the vector addition C[i] = A[i] + B[i], there is no need to cache A and B in shared memory. (A reduction sketch follows the example below.)
//kernel.cu
__global__ void shared_test()
{
    __shared__ double A[128];
    int a = 1;
    double b = 2.0;
    int tid = threadIdx.x;
    A[tid] = a;
}
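As a concrete illustration of the reduction-sum use case mentioned above, here is a minimal sketch of a block-wise sum that caches the data in shared memory; the kernel name block_sum and the fixed block size of 128 threads are assumptions made only for this example:

//reduce.cu -- sketch: shared-memory reduction (sum within one block of 128 threads)
__global__ void block_sum(const double *A, double *blockResult, int n)
{
    __shared__ double cache[128];                   // one slot per thread in the block
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (id < n) ? A[id] : 0.0;            // load each element from global memory once
    __syncthreads();

    // tree reduction inside the block: each step halves the number of active threads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockResult[blockIdx.x] = cache[0];         // one partial sum per block
}

Each block writes one partial sum; the partial sums can then be added on the host or in a second kernel launch.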
③ Global memory
Advantages:
1. The largest capacity (on the order of GB).
2. The host can read and write it, e.g. through cudaMemcpy (a host-side sketch follows the kernel below).
3. Its lifetime is longer than that of a kernel.
4. All threads can access it.
Disadvantage: the slowest memory to access.
//kernel.cu
__global__ void global_test(int *B)
{
    double b = 2.0;
    int tid = threadIdx.x;
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int a = B[id];
}
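A minimal host-side sketch of how the array B above could be allocated, filled, and handed to the kernel through cudaMemcpy; the array size and launch configuration are assumptions for illustration only:

//main.cu -- sketch: host interaction with global memory via cudaMemcpy
#include <cuda_runtime.h>

__global__ void global_test(int *B);                // the kernel shown above

int main()
{
    const int n = 100 * 128;                        // assumed: 100 blocks of 128 threads
    int *hB = new int[n];
    for (int i = 0; i < n; i++) hB[i] = i;          // fill on the host

    int *dB = nullptr;
    cudaMalloc((void **)&dB, n * sizeof(int));                    // allocate global memory
    cudaMemcpy(dB, hB, n * sizeof(int), cudaMemcpyHostToDevice);  // host -> device

    global_test<<<100, 128>>>(dB);                  // every thread can read B
    cudaDeviceSynchronize();

    cudaMemcpy(hB, dB, n * sizeof(int), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(dB);
    delete[] hB;
    return 0;
}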
④ Texture memory
Advantage: faster than plain global memory reads.
Disadvantage: it takes four steps to use, which is a bit of a hassle.
Suitable scenario: larger arrays that only need to be read; accessing them through textures speeds things up.
Four steps are needed (illustrated here with a one-dimensional float array); beginners should type the code out by hand themselves!
Step 1: declare the texture reference as a global variable:
texture<float, 1, cudaReadModeElementType> tex1D_load;
Step 2: bind the texture.
Step 3: use it (fetch from the texture inside the kernel).
Step 4: unbind the texture.
The details are in the code below (again, it is best to type it out yourself!):
#include <iostream>
#include <time.h>
#include <assert.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"
#include <ctime>
#include <stdio.h>

using namespace std;

// first step: declare the texture reference as a global variable
texture<float, 1, cudaReadModeElementType> tex1D_load;

__global__ void kernel(float *d_out, int size)
{
    // tex1D_load is a global variable and does not appear in the parameter list
    int index;
    index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size)
    {
        // third step: fetch the value from texture memory
        d_out[index] = tex1Dfetch(tex1D_load, index);
        printf("%f\n", d_out[index]);
    }
}

int main()
{
    int size = 120;
    size_t Size = size * sizeof(float);
    float *harray;
    float *d_in;
    float *d_out;

    harray = new float[size];
    checkCudaErrors(cudaMalloc((void **)&d_out, Size));
    checkCudaErrors(cudaMalloc((void **)&d_in, Size));

    // initialize host memory
    for (int m = 0; m < 4; m++)
    {
        printf("m = %d\n", m);
        for (int loop = 0; loop < size; loop++)
        {
            harray[loop] = loop + m * 1000;
        }
        // copy into d_in
        checkCudaErrors(cudaMemcpy(d_in, harray, Size, cudaMemcpyHostToDevice));

        // second step: bind the texture (0 means no offset)
        checkCudaErrors(cudaBindTexture(0, tex1D_load, d_in, Size));

        int nBlocks = (size - 1) / 128 + 1;
        kernel<<<nBlocks, 128>>>(d_out, size);   // third step: use the texture in the kernel
        cudaUnbindTexture(tex1D_load);           // fourth step: unbind the texture
        getLastCudaError("Kernel execution failed");
        checkCudaErrors(cudaDeviceSynchronize());
    }
    delete[] harray;
    checkCudaErrors(cudaFree(d_in));
    checkCudaErrors(cudaFree(d_out));
    return 0;
}
The memory types are summarized in the following table:

Memory type | Access speed                   | Size                   | Accessible by             | Lifetime
Register    | fastest                        | very limited           | the owning thread         | the thread
Shared      | ~100x faster than global       | limited (per block)    | all threads in the block  | the thread block
Global      | slowest                        | largest (GB level)     | all threads and the host  | longer than the kernel
Texture     | faster than plain global reads | bound to global memory | all threads (read-only)   | while bound
Conclusion
A PhD student at the Institute of Chemistry, Chinese Academy of Sciences
Research, development, and application of parallel computational simulation software
Email: [email protected] (discussion is welcome, by private message or email!)
May the program benefit more people!