GPU Parallel Programming for NVIDIA Cards with C++ and CUDA: Writing GPU Programs in Python with the numba Library

Documentation

  numba documentation
  NVIDIA forum documentation
  numba official site overview

Fixing a numba error

  When writing CUDA programs with numba in Python, you may run into the following error:

NvvmSupportError: libNVVM cannot be found. Do conda install cudatoolkit: 

This usually means the CUDA environment variables were not picked up, so add them to ~/.bashrc or /etc/profile:

    export CUDA_HOME=/usr/local/cuda
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so
    export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/

Then apply them with the source command (or restart the shell) so the variables take effect; this resolves the error.
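
To confirm the fix, you can check that numba now sees the GPU; a minimal sanity check using numba's own helpers:

from numba import cuda

# True once numba can locate the CUDA toolkit and a supported GPU
print(cuda.is_available())
# Prints a summary of the CUDA devices numba detected
cuda.detect()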

Workflow

  A CUDA program written with numba in Python follows roughly the same workflow as a CUDA program in C. The difference is that the numba version is much shorter, while the C version is verbose; in exchange, numba gives up some runtime performance.
  First, copy the data from host (CPU) memory to GPU device memory. Next, the GPU loads and executes the kernel. Finally, copy the data from GPU memory back to CPU memory. The following simple example illustrates this workflow.

import numpy as np
import time
from numba import cuda

# The decorator marks a kernel function that runs on the GPU
@cuda.jit
def add_one(array):
    # Global thread indices; the explicit form is
    #   x = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    #   y = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y
    # cuda.grid(2) is the concise equivalent for a 2D grid
    x, y = cuda.grid(2)
    # Guard against out-of-bounds threads, then do the parallel update
    if x < array.shape[0] and y < array.shape[1]:
        array[x, y] += 1

# Main entry point
if __name__ == "__main__":
    # Create a 2D array
    array = np.zeros((1024, 1024), dtype=np.uint8)
    # Copy the data from CPU memory to GPU memory
    d_array = cuda.to_device(array)
    # Plan the thread block
    blockDim = (16, 16)
    # Plan the grid of blocks
    gridDim = (array.shape[0] // blockDim[0], array.shape[1] // blockDim[1])
    # Start timing
    start = time.time()
    # Launch the GPU kernel
    add_one[gridDim, blockDim](d_array)
    # Wait for the kernel to finish before reading the clock
    cuda.synchronize()
    # Stop timing
    end = time.time()
    # Copy the data from GPU memory back to CPU memory
    # (the old to_host() method has been removed; copy_to_host() is current)
    array = d_array.copy_to_host()
    # Print the elapsed time
    print('The Time is %f s' % (end - start))
    # Print the result
    print(array)

Memory

Create an empty array on the GPU; usage is similar to numpy.empty():

 numba.cuda.device_array(shape, dtype=np.float64, strides=None, order='C', stream=0)

Create an empty array with the same shape and dtype as the array ary:

numba.cuda.device_array_like(ary, stream=0)

Copy data to the GPU device:

numba.cuda.to_device(obj, stream=0, copy=True, to=None)
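
A brief sketch tying these allocation calls together (the array names are illustrative):

import numpy as np
from numba import cuda

ary = np.arange(16, dtype=np.float32)
d_out = cuda.device_array(ary.shape, dtype=ary.dtype)  # uninitialized device array
d_like = cuda.device_array_like(ary)                   # same shape/dtype as ary
d_in = cuda.to_device(ary)                             # device copy of ary's data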

Create a CUDA stream:

stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)

Copy data from the GPU back to the CPU:

hary = d_ary.copy_to_host()
hary = d_ary.copy_to_host(stream=stream)
ary = np.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)

Convert another object type into a CUDA array:

numba.cuda.as_cuda_array(obj)

Check whether an object is a CUDA array:

 numba.cuda.is_cuda_array(obj)
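
A quick sketch of the predicate; as_cuda_array accepts any object exposing the __cuda_array_interface__ and views its memory without copying:

import numpy as np
from numba import cuda

d_arr = cuda.to_device(np.arange(4))
print(cuda.is_cuda_array(d_arr))         # True: backed by device memory
print(cuda.is_cuda_array(np.arange(4)))  # False: plain host array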

Putting the pieces together: copy data from the CPU to the GPU, run a kernel, and copy the result back:

arr = np.arange(1000)
d_arr = cuda.to_device(arr)
my_kernel[100, 100](d_arr)
result_array = d_arr.copy_to_host()

Check the memory layout of an array:

is_c_contiguous(self)
is_f_contiguous(self)

Resize an array; usage is similar to numpy.ndarray.reshape():

reshape(self, *newshape, **kws)
d_arr = d_arr.reshape(20, 50, order='F')

Shared memory arrays and block-level thread synchronization:

numba.cuda.shared.array(shape, type)
numba.cuda.syncthreads()
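
A minimal sketch of a block-level sum using a shared buffer and cuda.syncthreads(); the kernel name and sizes are illustrative:

import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block; shared array shapes must be compile-time constants

@cuda.jit
def block_sum(data, partial):
    # One buffer per block, visible to all threads of that block
    tmp = cuda.shared.array(TPB, dtype=float32)
    i = cuda.grid(1)
    tid = cuda.threadIdx.x
    tmp[tid] = data[i] if i < data.size else 0.0
    cuda.syncthreads()  # wait until every thread has filled its slot
    # Tree reduction within the block
    s = TPB // 2
    while s > 0:
        if tid < s:
            tmp[tid] += tmp[tid + s]
        cuda.syncthreads()
        s //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = tmp[0]

data = np.random.rand(1 << 20).astype(np.float32)
nblocks = (data.size + TPB - 1) // TPB
partial = np.zeros(nblocks, dtype=np.float32)
block_sum[nblocks, TPB](data, partial)
print(partial.sum(), data.sum())  # the two sums should be close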

Local memory arrays (private per-thread scratch space):

numba.cuda.local.array(shape, type)
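
A short illustrative sketch of per-thread scratch space:

from numba import cuda, float32

@cuda.jit
def smooth(values, out):
    # Each thread gets its own private scratch array in local memory
    buf = cuda.local.array(3, dtype=float32)
    i = cuda.grid(1)
    if 0 < i < values.size - 1:
        buf[0] = values[i - 1]
        buf[1] = values[i]
        buf[2] = values[i + 1]
        out[i] = (buf[0] + buf[1] + buf[2]) / 3.0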

Constant memory arrays:

numba.cuda.const.array_like(arr)
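
cuda.const.array_like copies a host array that is known at compile time into constant memory; a minimal sketch (the names are illustrative):

import numpy as np
from numba import cuda

WEIGHTS = np.array([0.25, 0.5, 0.25], dtype=np.float32)

@cuda.jit
def weighted(values, out):
    # The global host array is baked into constant memory at compile time
    w = cuda.const.array_like(WEIGHTS)
    i = cuda.grid(1)
    if 0 < i < values.size - 1:
        out[i] = w[0] * values[i - 1] + w[1] * values[i] + w[2] * values[i + 1]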

Defer the release of array memory (a context manager that postpones deallocations until the block exits, rather than freeing immediately):

 numba.cuda.defer_cleanup()
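
A brief sketch of its use; deallocations requested inside the block only happen when the block exits:

import numpy as np
from numba import cuda

with cuda.defer_cleanup():
    d_tmp = cuda.to_device(np.zeros(1024))
    del d_tmp  # the actual release is postponed until the with-block ends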

Supported atomic operations


class numba.cuda.atomic

    Namespace for atomic operations

    class add(ary, idx, val)

        Perform atomic ary[idx] += val. Supported on int32, float32, and float64 operands only.

        Returns the old value at the index location as if it is loaded atomically.

    class compare_and_swap(ary, old, val)

        Conditionally assign val to the first element of a 1D array ary if the current value matches old.

        Returns the current value as if it is loaded atomically.

    class max(ary, idx, val)

        Perform atomic ary[idx] = max(ary[idx], val).

        Supported on int32, int64, uint32, uint64, float32, float64 operands only.

        Returns the old value at the index location as if it is loaded atomically.

    class min(ary, idx, val)

        Perform atomic ary[idx] = min(ary[idx], val).

        Supported on int32, int64, uint32, uint64, float32, float64 operands only.

        Returns the old value at the index location as if it is loaded atomically.
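
A minimal sketch of cuda.atomic.add used to build a histogram (the kernel name and sizes are illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def histogram(values, bins):
    i = cuda.grid(1)
    if i < values.size:
        b = values[i] % bins.size
        # Many threads may hit the same bin, so the increment must be atomic
        cuda.atomic.add(bins, b, 1)

values = np.random.randint(0, 16, size=10000).astype(np.int32)
bins = np.zeros(16, dtype=np.int32)
histogram[40, 256](values, bins)  # numba copies host arrays in and out
print(bins.sum())                 # 10000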


Random numbers


numba.cuda.random.create_xoroshiro128p_states(n, seed, subsequence_start=0, stream=0)

    Returns a new device array initialized for n random number generators.

    This initializes the RNG states so that each state in the array corresponds to a subsequence of the main sequence separated by 2**64 steps from the others. Therefore, as long as no CUDA thread requests more than 2**64 random numbers, all of the RNG states produced by this function are guaranteed to be independent.

    The subsequence_start parameter can be used to advance the first RNG state by a multiple of 2**64 steps.

    Parameters

            n (int) – number of RNG states to create

            seed (uint64) – starting seed for list of generators

            subsequence_start (uint64) –

            stream (CUDA stream) – stream to run initialization kernel on

numba.cuda.random.init_xoroshiro128p_states(states, seed, subsequence_start=0, stream=0)

    Initialize RNG states on the GPU for parallel generators.

    This initializes the RNG states so that each state in the array corresponds to a subsequence of the main sequence separated by 2**64 steps from the others. Therefore, as long as no CUDA thread requests more than 2**64 random numbers, all of the RNG states produced by this function are guaranteed to be independent.

    The subsequence_start parameter can be used to advance the first RNG state by a multiple of 2**64 steps.

    Parameters

            states (1D DeviceNDArray, dtype=xoroshiro128p_dtype) – array of RNG states

            seed (uint64) – starting seed for list of generators

numba.cuda.random.xoroshiro128p_uniform_float32(states, index)

    Return a float32 in range [0.0, 1.0) and advance states[index].

    Parameters

            states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states

            index (int64) – offset in states to update

    Return type

        float32

numba.cuda.random.xoroshiro128p_uniform_float64(states, index)

    Return a float64 in range [0.0, 1.0) and advance states[index].

    Parameters

            states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states

            index (int64) – offset in states to update

    Return type

        float64

numba.cuda.random.xoroshiro128p_normal_float32(states, index)

    Return a normally distributed float32 and advance states[index].

    The return value is drawn from a Gaussian of mean=0 and sigma=1 using the Box-Muller transform. This advances the RNG sequence by two steps.

    Parameters

            states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states

            index (int64) – offset in states to update

    Return type

        float32

numba.cuda.random.xoroshiro128p_normal_float64(states, index)

    Return a normally distributed float64 and advance states[index].

    The return value is drawn from a Gaussian of mean=0 and sigma=1 using the Box-Muller transform. This advances the RNG sequence by two steps.

    Parameters

            states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states

            index (int64) – offset in states to update

    Return type

        float64


from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np

@cuda.jit
def compute_pi(rng_states, iterations, out):
    """Find the maximum value in values and store in result[0]"""
    thread_id = cuda.grid(1)

    # Compute pi by drawing random (x, y) points and finding what
    # fraction lies inside a unit circle
    inside = 0
    for i in range(iterations):
        x = xoroshiro128p_uniform_float32(rng_states, thread_id)
        y = xoroshiro128p_uniform_float32(rng_states, thread_id)
        if x**2 + y**2 <= 1.0:
            inside += 1

    out[thread_id] = 4.0 * inside / iterations

threads_per_block = 64
blocks = 24
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)

compute_pi[blocks, threads_per_block](rng_states, 10000, out)
print('pi:', out.mean())


Reposted from blog.csdn.net/wanchaochaochao/article/details/106843374