Background

In the article on the problems encountered while compiling and installing LitmusRT, we built and installed the LitmusRT real-time operating system and can boot it normally. Now we need to build and install a GPU-accelerated third-party library, OpenCL or OpenACC.

Again, do not try to install the Nvidia driver inside a virtual machine: the VM's graphics card is virtualized, so the Nvidia .ko kernel module cannot be loaded. I am therefore using a 64-bit Ubuntu 16.04 desktop in the lab, which has the NVIDIA driver, CUDA 10.2 and 10.1, gcc 7 and g++ 7 installed.

OpenCL

If CUDA is available, Nvidia is the first choice; after all, Nvidia is the leader in the GPU industry. But if we have to use a virtual machine, we need to install the Intel version of OpenCL instead.

Nvidia version

1. Download the installer: http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run

2. Before running the .run file, make sure your gcc and g++ are both version 3.4; if not, install that version and switch to it.

3. Copy the .run file to the server and run it.

root@sundata:/data/szc# ./gpucomputingsdk_4.0.17_linux.run

It asks for the installation directory and the CUDA directory; just press Enter to accept the defaults.

Then copy some header and library files into /usr/lib, /usr/local/lib and /usr/local/include:

root@sundata:/data/szc# cp -r /usr/local/cuda-10.0/extras/CUPTI/include/* /usr/local/include

root@sundata:/data/szc# cp -r /snap/gnome-3-34-1804/36/usr/include/* /usr/local/include

root@sundata:/data/szc# cp /usr/local/cuda-10.0/lib64/libOpenCL.so.1.1 /usr/local/lib/libOpenCL.so

root@sundata:/data/szc# cp /usr/lib/x86_64-linux-gnu/libGLU.so.1.3.1 /usr/lib/libGLU.so

root@sundata:/data/szc# cp /snap/gnome-3-34-1804/60/usr/lib/x86_64-linux-gnu/libGL.so.1.0.0 /usr/lib/libGL.so

root@sundata:/data/szc# cp /snap/gnome-3-34-1804/60/usr/lib/x86_64-linux-gnu/libX11.so.6.3.0 /usr/lib/libX11.so

root@sundata:/data/szc# cp /usr/lib/x86_64-linux-gnu/libXmu.so.6.2.0 /usr/lib/libXmu.so

The names and locations of these files were found as follows.

First change into the OpenCL installation directory and run make:

root@sundata:~/NVIDIA_GPU_Computing_SDK/OpenCL# make

It reports that a header file or a library cannot be found (cannot find -lxxx).

Run locate on the missing header file or library xxx to get its path, then copy the header file into /usr/local/include and the library file into /usr/local/lib.

I could not find glut.so on this machine, so I had to install this dependency:

root@sundata:~/NVIDIA_GPU_Computing_SDK/OpenCL# apt-get install freeglut3-dev

If you hit this error:

ld cannot find crt1.o: No such file or directory

You need to set an environment variable; the path below is where locate found crt1.o:

export LIBRARY_PATH=$LIBRARY_PATH:/snap/gnome-3-34-1804/36/usr/lib/x86_64-linux-gnu/

4. Finally, go to the OpenCL installation directory and run make again:

root@sundata:~/NVIDIA_GPU_Computing_SDK/OpenCL# make

You will see a batch of executables in the bin/linux/release/ directory; run one of them.

Or check whether the OpenCL SDK can be detected.

If the SDK and the devices that support it are detected, you are done.

PS: During the build I also ran into link errors such as "libstdc++.so.6: error adding symbols: DSO missing from command line" and "Nonrepresentable section on output". After repeated attempts, what fixed them was downgrading gcc and g++ to 3.4 and re-running the .run installer. These link errors actually appeared after I had changed the gcc and g++ versions. None of the fixes I found online helped, and neither did editing common/common_opencl.mk; only re-running the .run file resolved the problem.

5. To write our own OpenCL programs, we first have to copy all of the CUDA header files into /usr/local/include:

root@sundata:/data/szc# cp -r /usr/local/cuda-10.0/include/* /usr/local/include/

Then write the sample code:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// OpenCL source code
const char* OpenCLSource[] = {
    "__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
    "{",
    " // Index of the elements to add \n",
    " unsigned int n = get_global_id(0);",
    " // Sum the n'th element of vectors a and b and store in c \n",
    " c[n] = a[n] + b[n];",
    "}"
};

// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
// Number of elements in the vectors to be added
#define SIZE 2048

int main(int argc, char **argv) {
    // Two integer source vectors in Host memory
    int HostVector1[SIZE], HostVector2[SIZE];
    // Initialize with some interesting repeating data
    int c;
    for(c = 0; c < SIZE; c++) {
        HostVector1[c] = InitialData1[c%20];
        HostVector2[c] = InitialData2[c%20];
    }

    //Get an OpenCL platform
    cl_platform_id cpPlatform;
    clGetPlatformIDs(1, &cpPlatform, NULL);

    // Get a GPU device
    cl_device_id cdDevice;
    clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);

    // Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
    cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, NULL);

    // Create a command-queue on the GPU device
    cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, 0, NULL);

    // Allocate GPU memory for source vectors AND initialize from CPU memory
    cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
        CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
    cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
        CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);

    // Allocate output memory on GPU
    cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, NULL);

    // Create OpenCL program with source code
    cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);

    // Build the program (OpenCL JIT compilation)
    clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

    // Create a handle to the compiled OpenCL function (Kernel)
    cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);

    // In the next step we associate the GPU memory with the Kernel arguments
    clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem),(void*)&GPUOutputVector);
    clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
    clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);

    // Launch the Kernel on the GPU
    size_t WorkSize[1] = {SIZE}; // one dimensional Range
    clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, NULL);

    // Copy the output in GPU memory back to CPU memory

    int HostOutputVector[SIZE];
    clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0,
        SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);

    // Cleanup
    clReleaseKernel(OpenCLVectorAdd);
    clReleaseProgram(OpenCLProgram);
    clReleaseCommandQueue(cqCommandQueue);
    clReleaseContext(GPUContext);
    clReleaseMemObject(GPUVector1);
    clReleaseMemObject(GPUVector2);
    clReleaseMemObject(GPUOutputVector);

    // Print out the results
    int Rows;
    for (Rows = 0; Rows < (SIZE/20); Rows++, printf("\t")) {
        for(c = 0; c <20; c++) {
            printf("%c",(char)HostOutputVector[Rows * 20 + c]);
        }
    }

    printf("\n\nThe End\n\n");
    return 0;
}
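A note on the kernel source in the sample: clCreateProgramWithSource receives it as seven separate strings and treats them as one concatenated text, which is why the two `//` comment fragments end in "\n" (without it, the comment would swallow the code that follows). The sketch below only illustrates that behaviour on the host; `join_kernel_source` and `KERNEL_PART_COUNT` are hypothetical names introduced here, not part of OpenCL or of the original sample.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The seven kernel fragments from the sample. */
const char *const kernel_parts[] = {
    "__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
    "{",
    " // Index of the elements to add \n",
    " unsigned int n = get_global_id(0);",
    " // Sum the n'th element of vectors a and b and store in c \n",
    " c[n] = a[n] + b[n];",
    "}"
};

/* Fragment count derived from the array instead of a hand-typed 7. */
#define KERNEL_PART_COUNT (sizeof(kernel_parts) / sizeof(kernel_parts[0]))

/* Hypothetical helper (not an OpenCL call): join the fragments into one
 * heap-allocated string. Returns NULL if malloc fails; the caller frees it. */
char *join_kernel_source(const char *const *parts, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        assert(parts[i] != NULL);
        total += strlen(parts[i]);
    }

    char *joined = malloc(total + 1);
    if (joined == NULL)
        return NULL;

    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(parts[i]);
        memcpy(joined + offset, parts[i], len);
        offset += len;
    }
    joined[offset] = '\0';
    return joined;
}
```

Printing the joined string with puts() shows the text the OpenCL compiler works on, and passing KERNEL_PART_COUNT instead of the literal 7 keeps the count in step with the array if fragments are added later.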

Compile

root@sundata:/data/szc# gcc test_opencl.c -o test_opencl -lOpenCL

Run
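As a cross-check on what the run should print, the same addition can be done on the CPU. The sketch below is an assumption-labelled helper set (`vector_add_reference`, `expected_row` and `outputs_match` are names made up for this illustration); it reuses the sample's seed data and decodes the sums as characters the way the sample's printf("%c") loop does. The 20 sums decode to "Hello OpenCL World! ", so each tab-separated group in the program output should read that way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Seed data copied from the sample. */
const int seed_a[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
const int seed_b[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

/* CPU reference for the VectorAdd kernel: out[i] = a[i] + b[i]. */
void vector_add_reference(const int *a, const int *b, int *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Decode the 20 reference sums as characters.
 * text must have room for 21 bytes (20 characters plus the terminator). */
void expected_row(char *text)
{
    int sums[20];
    vector_add_reference(seed_a, seed_b, sums, 20);
    for (size_t i = 0; i < 20; i++)
        text[i] = (char)sums[i];
    text[20] = '\0';
}

/* Returns 1 when the device output equals the CPU reference, 0 otherwise. */
int outputs_match(const int *device_out, const int *reference, size_t n)
{
    return memcmp(device_out, reference, n * sizeof(int)) == 0;
}
```

In the sample, outputs_match could be called on HostOutputVector and a reference computed from HostVector1 and HostVector2 to confirm the GPU result before printing it.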

PS: the official manual on the website creates the context like this:

 cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL,  NULL);

But this call fails, which you can confirm by passing a pointer to an error variable as the last parameter:

    ....

    cl_int ciErr1;
    cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, &ciErr1);

    if (ciErr1 != CL_SUCCESS) {
        printf("Error in clCreateContext, error: %d\n", ciErr1);
        return -1;
    }

Running this check reports an error, so the context has to be created with

cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, NULL);

instead.
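The check above prints only a number, which is hard to interpret on its own. Below is a minimal sketch of a lookup for a few common codes. The numeric values are the ones defined in CL/cl.h; the function names `cl_error_name` and `cl_succeeded` are hypothetical helpers introduced here. Which code the failing call returned on the author's machine is not recorded, so the list is only a convenience. One frequently seen case is CL_INVALID_PLATFORM when no platform is named in the properties argument.

```c
#include <stdio.h>

/* Map a few OpenCL status codes (values as defined in CL/cl.h) to names. */
const char *cl_error_name(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -31: return "CL_INVALID_DEVICE_TYPE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    default:  return "unknown OpenCL error code";
    }
}

/* Hypothetical helper: returns 1 if the call succeeded; otherwise prints
 * the call name, the symbolic error and the raw code to stderr, returns 0. */
int cl_succeeded(int code, const char *what)
{
    if (code == 0)
        return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_error_name(code), code);
    return 0;
}
```

With this in place the check could read `if (!cl_succeeded(ciErr1, "clCreateContextFromType")) return -1;`, and the same helper can wrap the other calls in the sample, which currently ignore their status codes.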

Intel version

1. Install the dependencies

(base) root@ubuntu:/home/szc# apt-get install clinfo
(base) root@ubuntu:/home/szc# apt install dkms xz-utils openssl libnuma1 libpciaccess0 bc curl libssl-dev lsb-core libicu-dev
(base) root@ubuntu:/home/szc# echo "deb http://download.mono-project.com/repo/debian wheezy main" | sudo tee /etc/apt/sources.list.d/mono-xamarin.list
(base) root@ubuntu:/home/szc# apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
(base) root@ubuntu:/home/szc# apt-get update
(base) root@ubuntu:/home/szc# apt-get install mono-complete

2. Download the Intel OpenCL SDK package http://registrationcenter-download.intel.com/akdlm/irc_nas/vcp/16284/intel_sdk_for_opencl_applications_2020.0.270.tar.gz , upload it to the Ubuntu machine, extract it and enter its directory

(base) root@ubuntu:/home/szc# tar -zxvf intel_sdk_for_opencl_applications_2020.0.270.tar.gz
(base) root@ubuntu:/home/szc# cd intel_sdk_for_opencl_applications_2020.0.270

3. Then run the installation script.

(base) root@ubuntu:/home/szc/intel_sdk_for_opencl_applications_2020.0.270# ./install.sh

4. Accept the defaults throughout. When it finishes, check that the installation is complete: clinfo should list the Intel(R) CPU Runtime for OpenCL(TM) Applications platform.

(base) root@ubuntu:/home/szc/intel_sdk_for_opencl_applications_2020.0.270# clinfo

Number of platforms                               1

  Platform Name                                   Intel(R) CPU Runtime for OpenCL(TM) Applications

  Platform Vendor                                 Intel(R) Corporation

  Platform Version                                OpenCL 2.1 LINUX

  Platform Profile                                FULL_PROFILE

  Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint

  Platform Host timer resolution                  1ns

  Platform Extensions function suffix             INTEL


  Platform Name                                   Intel(R) CPU Runtime for OpenCL(TM) Applications
Number of devices                                 1
  Device Name                                     Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 2.1 (Build 0)
  Driver Version                                  18.1.0.0920
  Device OpenCL C Version                         OpenCL C 2.0
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               4
  Max clock frequency                             2200MHz
  Device Partition                                (core)
    Max number of sub-devices                     4
    Supported partition types                     by counts, equally, by names (Intel)
  Max work item dimensions                        3
  Max work item sizes                             8192x8192x8192
  Max work group size                             8192
  Preferred work group size multiple              128
  Max sub-groups per work group                   1
  Preferred / native vector sizes                 
    char                                                 1 / 32      
    short                                                1 / 16      
    int                                                  1 / 8       
    long                                                 1 / 4       
    half                                                 0 / 0        (n/a)
    float                                                1 / 8       
    double                                               1 / 4        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              6233903104 (5.806GiB)
  Error Correction support                        No
  Max memory allocation                           1558475776 (1.451GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   Yes
    Atomics                                       Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             65536 (64KiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        262144
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             480
    Max size for 1D images from buffer            97404736 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   64 bytes
    Pitch alignment for 2D image buffers          64 bytes
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 480
    Max number of write image args                480
    Max number of read/write image args           480
  Max number of pipe args                         16
  Max active pipe reservations                    65535
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max constant buffer size                        131072 (128KiB)
  Max number of constant args                     480
  Max size of kernel argument                     3840 (3.75KiB)
  Queue properties (on host)                      
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Local thread execution (Intel)                Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                4294967295 (4GiB)
    Max size                                      4294967295 (4GiB)
  Max queues on device                            4294967295
  Max events on device                            4294967295
  Prefer user sync for interop                    No
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    Sub-group independent forward progress        No
    IL version                                    SPIR-V_1.0
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint
NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [INTEL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

OpenACC

Environment: Ubuntu server with the Nvidia driver and CUDA installed, and a wired network connection

1. Download the archive and extract it

root@sundata:/data/szc# wget https://developer.download.nvidia.com/hpc-sdk/20.7/nvhpc_2020_207_Linux_x86_64_cuda_multi.tar.gz
root@sundata:/data/szc# tar xpzf nvhpc_2020_207_Linux_x86_64_cuda_multi.tar.gz

2. Install

root@sundata:/data/szc# nvhpc_2020_207_Linux_x86_64_cuda_multi/install

The installer asks for a few settings; just choose the single system install and your own installation path.

3. Test

First set the environment variable, replacing /root/NVIDIA_GPU_Computing_SDK/hpc_sdk with your own installation path:

root@sundata:/data/szc# export PATH=/root/NVIDIA_GPU_Computing_SDK/hpc_sdk/Linux_x86_64/2020/compilers/bin/:$PATH

Then change to one of the example directories under the installation path, then build and run it:

root@sundata:/data/szc# cd ~/NVIDIA_GPU_Computing_SDK/hpc_sdk/Linux_x86_64/2020/examples/OpenMP
root@sundata:~/NVIDIA_GPU_Computing_SDK/hpc_sdk/Linux_x86_64/2020/examples/OpenMP# make NTHREADS=4 matmul_test

[Screenshot: output of the run]
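The bundled example above comes from the OpenMP directory. For an OpenACC-specific check, a small hand-written loop can serve as a first test. The following is a minimal sketch, not taken from the SDK examples; `acc_vector_add` is a name chosen for this illustration. With an OpenACC-enabled compiler the loop is offloaded to the GPU; a compiler without OpenACC support ignores the pragma and runs the loop serially, so the function gives the same result either way.

```c
#include <assert.h>

/* Element-wise c = a + b over n elements.
 * copyin: a and b are transferred to the device before the loop.
 * copyout: c is transferred back to the host after the loop. */
void acc_vector_add(const int *a, const int *b, int *c, int n)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Assuming the function is saved together with a small main() in a file such as vecadd.c (a hypothetical name), it could be built with the SDK's C compiler as `nvc -acc -Minfo=accel vecadd.c -o vecadd`; the -Minfo=accel output reports whether the loop was actually parallelized for the GPU.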

Finally, here is the official manual: https://docs.nvidia.com/hpc-sdk/archive/20.7/index.html

Conclusion

These files are not small. If downloading from the official sites is slow, you can get them from my Baidu network disk:

Intel OpenCL: link: https://pan.baidu.com/s/1a9_H5tbsfFjdMmPFlJbUzg, extraction code: 060s

NVIDIA OpenCL: link: https://pan.baidu.com/s/1J_qrL-PREONvIYnz1F7DoQ, extraction code: c2ci

NVIDIA OpenACC: link: https://pan.baidu.com/s/1hQKKtrq4c6TEfXMuE_RC5w, extraction code: 9u7o

Installing OpenCL and OpenACC