[Parallel Programming 5] GPU

GPU Part 1

Prerequisites

  1. A CUDA-enabled Nvidia graphics card is required
    On Linux, view graphics card information: lspci | grep -i vga
    If the machine has an Nvidia card, find it with: lspci | grep -i nvidia
    Using a card code from that output, such as "03.00.0", view the details: lspci -v -s 03.00.0
    View card usage (Nvidia only): nvidia-smi
    Watch usage continuously (refresh once per second): watch -n 1 nvidia-smi
  2. PyCUDA is required (Linux installation: apt install python3-pycuda); a quick check that it can see the card is sketched below.
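
As a quick check that PyCUDA can see the card (a minimal sketch, not from the original text; device index 0 assumes a single GPU):

import pycuda.driver as cuda

cuda.init()                                     # initialize the CUDA driver
print('detected GPUs:', cuda.Device.count())    # number of CUDA-capable devices

dev = cuda.Device(0)
print('name:', dev.name())
print('compute capability:', dev.compute_capability())
print('total memory: %d MB' % (dev.total_memory() // (1024 * 1024)))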

Basic use

1. Low-level operations

  1. Preparation: sudo apt install python3-pycuda
  2. Create a matrix in CPU memory
  3. Copy the matrix from CPU memory to GPU memory
  4. Write C code that makes the GPU do the computation
  5. Copy the resulting matrix from GPU memory back to CPU memory
  (The full example below follows these five steps.)

Kernels and thread hierarchy:
One of the most important elements of a CUDA program is the kernel, which represents the code that can be executed in parallel.
Each kernel is executed by compute units called threads. Unlike CPU threads, GPU threads are lightweight, so switching between them does not hurt performance.
To organize the number of threads a kernel needs into a logical form, CUDA defines a two-level structure. At the top level is the so-called grid of blocks, a two-dimensional grid of thread blocks, where each block is a three-dimensional array of threads. (In simple terms, a CUDA launch consists of multiple blocks, and each block consists of multiple threads; the work is split across the threads of each block.) A sketch of this hierarchy follows.
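
As a sketch of these two levels (an added illustration, not the book's example; the name addOne is made up): a launch of 4 blocks of 256 threads covers a 1000-element vector, and each thread derives its global index from blockIdx, blockDim and threadIdx:

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

mod = SourceModule('''
    __global__ void addOne(float *v, int n)
    {
        // global index = block offset + offset inside the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)      // guard: the last block overshoots the vector
            v[i] += 1.0f;
    }
''')

v = np.zeros(1000, dtype=np.float32)
v_gpu = cuda.mem_alloc(v.nbytes)
cuda.memcpy_htod(v_gpu, v)

addOne = mod.get_function('addOne')
addOne(v_gpu, np.int32(len(v)), block=(256, 1, 1), grid=(4, 1))   # grid of blocks

cuda.memcpy_dtoh(v, v_gpu)
print(v[:5])    # -> [1. 1. 1. 1. 1.]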

A thread block is assigned to a Streaming Multiprocessor (SM), and its threads are further divided into groups called warps, whose size depends on the GPU architecture.
To maximize the concurrency of the SM, the threads within the same warp must execute the same instruction; otherwise thread divergence occurs and the diverging paths are serialized.
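
The warp size itself can be read from the device attributes (a small sketch; on current Nvidia architectures the value is 32):

import pycuda.autoinit
import pycuda.driver as cuda

dev = cuda.Device(0)
print('warp size:', dev.get_attribute(cuda.device_attribute.WARP_SIZE))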

Example:
double each matrix element on the GPU

import pycuda.driver as cuda
import pycuda.autoinit  # init GPU
from pycuda.compiler import SourceModule

import numpy as np

# 1. Create the matrix in CPU memory
a = np.random.randn(5, 5)   # an m*n matrix
a = a.astype(np.float32)    # many Nvidia cards only support single precision

# 2. Move the matrix from CPU memory to GPU memory
a_gpu = cuda.mem_alloc(a.nbytes)    # allocate GPU memory (a flat 1-D buffer)
cuda.memcpy_htod(a_gpu, a)          # copy CPU (host) memory to GPU (device) memory

# 3. Compute on the GPU
# build the GPU module from C source
mod = SourceModule('''
    __global__ void doubleMatrix(float *a)
    {
        // flatten the (x, y) thread index into the 1-D array: gpu -> sm -> warp
        int idx = threadIdx.x + threadIdx.y * 5;
        a[idx] *= 2;
    }
''')

func = mod.get_function('doubleMatrix')  # fetch the compiled kernel from the module
func(a_gpu, block=(5, 5, 1))             # pass the argument and the thread layout along (x, y, z)

# 4. Move the result from GPU memory back to CPU memory
a_doubled = np.empty_like(a)        # allocate CPU memory for the result
cuda.memcpy_dtoh(a_doubled, a_gpu)  # copy GPU (device) memory to CPU (host) memory

print('original matrix')
print(a)
print('double matrix')
print(a_doubled)

Note 1: The statement import pycuda.autoinit automatically selects which GPU to use according to the number of available GPUs, and creates the GPU context needed by the code that follows (simply importing it is enough).

Note 2: astype(np.float32) converts the matrix items to single precision, since many Nvidia cards only support single precision.

Note 3: When calling the C GPU function, the block parameter sets how the threads are distributed; (5, 5, 1) tells the GPU to allocate a block of 5 x 5 x 1 threads along (x, y, z).
Inside the C function, threadIdx is a structure with three fields, x, y and z, whose values differ for each thread (tying the code to the thread hierarchy described above). Because the GPU memory passed to the C function is a dynamically allocated one-dimensional array, indexing the matrix requires flattening: threadIdx.y is multiplied by the number of elements per row and added to threadIdx.x. For example, the element at row y = 2, column x = 3 maps to index 3 + 2 * 5 = 13.
For more information: explore how CUDA divides work into thread blocks.

Note 4: In the C GPU function, the __global__ keyword marks the function as a kernel: it runs on the GPU (the device) and must be invoked from host code.
Regarding PyCUDA and memory, a CUDA-enabled GPU offers four types of memory, and using them well is the key to maximizing the available resources:
Registers: each thread is assigned its own registers, and each thread can access only its own, even within the same thread block.
Shared memory: each thread block has shared memory that its threads share internally; this memory is fast.
Constant memory: every thread in a grid can access this memory, but only for reading. Data in constant memory persists for the duration of the application.
Global memory: all threads of the grid (across all kernels) can access global memory.
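
To illustrate shared memory (a hedged sketch, not an example from the book; reverseTile is a made-up name): a kernel declares a per-block buffer with the __shared__ qualifier, and __syncthreads() makes every thread of the block wait until the buffer is fully written:

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np

mod = SourceModule('''
    __global__ void reverseTile(float *a)
    {
        __shared__ float tile[32];      // one buffer per thread block
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = a[base + t];          // each thread loads one element
        __syncthreads();                // wait until the whole tile is loaded
        a[base + t] = tile[blockDim.x - 1 - t];     // write it back reversed
    }
''')

a = np.arange(64, dtype=np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod.get_function('reverseTile')(a_gpu, block=(32, 1, 1), grid=(2, 1))
cuda.memcpy_dtoh(a, a_gpu)
print(a[:4])    # -> [31. 30. 29. 28.]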

For more information: explore the PyCUDA memory model.

2. Python wrapper (gpuarray)

With gpuarray, kernels can be invoked without writing them explicitly: data is stored directly on the computing device (the GPU) and the computation is performed on that device.

Example:
double each matrix element on the GPU with gpuarray

import pycuda.autoinit
import pycuda.gpuarray as gpuarray

import numpy as np

x = np.random.randn(5, 5)       # create the matrix in CPU memory
x_gpu = gpuarray.to_gpu(x)      # copy it to GPU memory

x_doubled = (2 * x_gpu).get()   # compute on the GPU, then fetch the result back

print('x:')
print(x)
print('x_doubled:')
print(x_doubled)
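
gpuarray also combines with PyCUDA's other helpers; for instance, ElementwiseKernel compiles a one-line C expression into a kernel applied at every index i (a sketch, reusing the 5 x 5 single-precision setup from above):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
import numpy as np

double_it = ElementwiseKernel(
    'float *y, float *x',           # C argument list
    'y[i] = 2.0f * x[i]',           # expression evaluated at every index i
    'double_it')

x_gpu = gpuarray.to_gpu(np.random.randn(5, 5).astype(np.float32))
y_gpu = gpuarray.empty_like(x_gpu)
double_it(y_gpu, x_gpu)             # launches the generated kernel
print(y_gpu.get())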

Other libraries

There are other libraries that make CUDA programming convenient. For example, NumbaPro is a Python compiler that provides a programming interface based on the CUDA API; it lets you keep writing CUDA code in Python, is designed for array-oriented computing tasks, and is widely used with the numpy library.
NumbaPro: a GPU programming library that provides many GPU-accelerated numerical routines.
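
NumbaPro has since been retired and its CUDA support was folded into the open-source Numba package; below is a minimal sketch using today's numba.cuda interface (an update relative to the original text, which predates this change):

import numpy as np
from numba import cuda

@cuda.jit
def double_kernel(a):       # compiled to a CUDA kernel by Numba
    i = cuda.grid(1)        # global thread index across the whole grid
    if i < a.size:          # guard against the last partial block
        a[i] *= 2

a = np.arange(16, dtype=np.float32)
d_a = cuda.to_device(a)             # copy to the GPU
double_kernel[1, 32](d_a)           # launch: 1 block of 32 threads
print(d_a.copy_to_host())           # -> [0. 2. 4. ...]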


1. Reference book: "Python Parallel Programming Cookbook"

Origin: www.cnblogs.com/maplesnow/p/12044372.html