GPU memory hierarchy

Xiaopu, PhD student in chemistry at the Chinese Academy of Sciences (CAS)

Works on the research, development, and application of parallel computing simulation software

Email: [email protected] (discussion of related issues is welcome)

 

Abstract

GPU memory comes in several varieties, each with its own speed and capacity; understanding them matters a great deal when tuning the performance of a GPU program. This article addresses the following questions:

1) What types of memory does a GPU have? 2) How do you check the size of each memory on a device? 3) How fast is each memory to access? 4) How do the different levels of storage relate to each other? 5) What are the cautions, advantages, and disadvantages of each type of storage?

 

Main text

[Figure: GPU memory architecture diagram]

① Register memory (Register Memory)

Advantages: the fastest access speed of any GPU memory!

Disadvantages: available only in limited numbers

Usage: ordinary variables defined inside a __global__ or __device__ function are register variables.

Example:

 

//kernel.cu
__global__ void register_test()
{
    // Ordinary local variables inside a kernel live in registers.
    int    a = 1;
    double b = 2.0;
}

//main.cu
int main()
{
    int nBlock = 100;
    register_test<<<nBlock, 128>>>();
    return 0;
}
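Because registers are allocated per thread, a kernel that uses many of them limits how many threads can be resident at once. To see how many registers a kernel actually uses, you can compile with the standard nvcc flag -Xptxas -v (this tip is mine, not from the original post):

nvcc -Xptxas -v kernel.cu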

 

② Shared memory (Shared Memory)

Advantages:

1. Acts as a fast buffer: one to two orders of magnitude faster than global memory.

2. All threads within a thread block can read and write it.

3. Its lifetime is tied to that of the thread block.

Drawback: its size is limited.

Usage: the keyword __shared__, e.g. __shared__ double A[128];

Applicable scenarios:

Use it when data is accessed repeatedly, for example a reduction sum a = Σ A[i] (a sketch of this follows the example below).

If a variable is not accessed repeatedly, there is no benefit: for vector addition C[i] = A[i] + B[i], there is no need to cache A and B in shared memory.

It is an important means of program optimization!

 

//kernel.cu
__global__ void shared_test()
{
    // A lives in shared memory: every thread in the block can read and write it.
    __shared__ double A[128];

    double a   = 1.0;
    int    tid = threadIdx.x;

    A[tid] = a;   // each thread fills its own slot
}
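As promised above, here is a minimal sketch of the reduction-sum use case. It is my own illustration, not code from the original post; the kernel name block_sum is hypothetical, and the block size is assumed to be a power of two no larger than 128:

__global__ void block_sum(const double *in, double *out, int n)
{
    __shared__ double cache[128];              // one slot per thread

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    cache[tid] = (gid < n) ? in[gid] : 0.0;    // stage data in shared memory
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = cache[0];            // one partial sum per block
}

Each element of in is read from global memory exactly once; all the repeated accesses of the reduction then hit fast shared memory, which is exactly where the speedup comes from.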

 

③ Global memory (Global Memory)

Advantages:

1. The largest space (GB scale).

2. The host side can interact with it, e.g. through cudaMemcpy (a host-side sketch follows the kernel below).

3. Its lifetime is longer than a single kernel launch.

4. All threads can access it.

Disadvantage: the slowest memory to access.

 

//kernel.cu
__global__ void global_test(int *B)
{
    // B points to global memory allocated on the host with cudaMalloc.
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + tid;   // global thread index
    int a   = B[id];                           // read from global memory
}
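A minimal host-side sketch (my own illustration, with an assumed problem size) showing how B is allocated in global memory, filled via cudaMemcpy, and handed to the kernel above:

#include <cuda_runtime.h>

int main()
{
    const int    n  = 100 * 128;          // 100 blocks of 128 threads (assumed)
    const size_t sz = n * sizeof(int);

    int *hB = new int[n];                 // host buffer
    for (int i = 0; i < n; i++) hB[i] = i;

    int *dB = nullptr;
    cudaMalloc((void **)&dB, sz);         // allocate global memory on the device
    cudaMemcpy(dB, hB, sz, cudaMemcpyHostToDevice);

    global_test<<<100, 128>>>(dB);
    cudaDeviceSynchronize();

    cudaFree(dB);
    delete[] hB;
    return 0;
}

Because dB outlives the launch, a second kernel could reuse the same data without another copy; that is the "lifetime longer than a kernel" advantage in practice.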

 

 

④ Texture memory (Texture Memory)

Advantage: faster than plain global memory access.

Disadvantages: using it takes four steps, which is a little cumbersome.

Applicable scenario: large arrays that only need to be read; accessing them through textures speeds things up.

The four steps (using a one-dimensional float array as an example); beginners should type the code out by hand!

Step 1: declare the texture space as a global variable:

texture<float, 1, cudaReadModeElementType> tex1D_load;

Step 2: bind the texture.

Step 3: fetch through the texture inside the kernel.

Step 4: unbind the texture.

The full code is below (again, best to type it yourself!):

#include <iostream>
#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

using namespace std;

// Step 1: declare the texture space as a global variable.
texture<float, 1, cudaReadModeElementType> tex1D_load;

__global__ void kernel(float *d_out, int size)
{
    // tex1D_load is a global variable and does not appear in the parameter list.
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size)
    {
        // Step 3: fetch the value from texture memory.
        d_out[index] = tex1Dfetch(tex1D_load, index);
        printf("%f\n", d_out[index]);
    }
}

int main()
{
    int size = 120;
    size_t Size = size * sizeof(float);
    float *harray;
    float *d_in;
    float *d_out;

    harray = new float[size];
    checkCudaErrors(cudaMalloc((void **)&d_out, Size));
    checkCudaErrors(cudaMalloc((void **)&d_in, Size));

    for (int m = 0; m < 4; m++)
    {
        printf("m = %d\n", m);

        // Initialize the host array.
        for (int loop = 0; loop < size; loop++)
        {
            harray[loop] = loop + m * 1000;
        }

        // Copy the host array into d_in.
        checkCudaErrors(cudaMemcpy(d_in, harray, Size, cudaMemcpyHostToDevice));

        // Step 2: bind the texture; the 0 means no offset.
        checkCudaErrors(cudaBindTexture(0, tex1D_load, d_in, Size));

        int nBlocks = (size - 1) / 128 + 1;
        kernel<<<nBlocks, 128>>>(d_out, size);   // step 3 happens inside the kernel

        cudaUnbindTexture(tex1D_load);           // Step 4: unbind the texture
        getLastCudaError("Kernel execution failed");
        checkCudaErrors(cudaDeviceSynchronize());
    }

    delete[] harray;
    checkCudaErrors(cudaFree(d_in));
    checkCudaErrors(cudaFree(d_out));
    return 0;
}
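One caveat of mine, not from the original post: the texture-reference API above was standard on the CUDA 10.x toolkit this article targets, but it was deprecated in CUDA 11 and removed in CUDA 12. On newer toolkits the same 1-D fetch is done with a texture object. A minimal sketch of the replacement:

// Kernel: the texture object is passed as an ordinary parameter.
__global__ void kernel_obj(cudaTextureObject_t tex, float *d_out, int size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size)
        d_out[index] = tex1Dfetch<float>(tex, index);
}

// Host side: create the object instead of calling cudaBindTexture.
cudaResourceDesc resDesc = {};
resDesc.resType                = cudaResourceTypeLinear;
resDesc.res.linear.devPtr      = d_in;
resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = Size;

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
kernel_obj<<<nBlocks, 128>>>(tex, d_out, size);
cudaDestroyTextureObject(tex);    // replaces cudaUnbindTexture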

 

 

The points above are summarized in the following table:

Memory     Speed                            Size              Scope                      Lifetime
Register   fastest                          severely limited  one thread                 the thread
Shared     1-2 orders faster than global    limited (KB)      all threads in a block     the thread block
Global     slowest                          largest (GB)      all threads + the host     managed by the host
Texture    faster than plain global reads   a global region   all threads (read-only)    bind/unbind on host
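The abstract also promised a way to check how much of each memory a given device has. A minimal sketch (mine, not from the original post) using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("global memory    : %zu MB\n", prop.totalGlobalMem >> 20);
    printf("shared mem/block : %zu KB\n", prop.sharedMemPerBlock >> 10);
    printf("registers/block  : %d\n",     prop.regsPerBlock);
    printf("constant memory  : %zu KB\n", prop.totalConstMem >> 10);
    return 0;
}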

Conclusion

Xiaopu, PhD student in chemistry at the Chinese Academy of Sciences (CAS)

Works on the research, development, and application of parallel computing simulation software

Email: [email protected] (discussion of these topics is welcome, by private message or email!)

May these programs benefit more people!

