High-Performance Computing SIMD Experiments: CPU SIMD and GPU SIMD

CPU SIMD

Intel SIMD

Code:

Run on my PC:

Analysis:

This program compares the computation speed with and without the SSE SIMD instructions. It evaluates sqrt(x)/x, the square root of x divided by x, for each element of an array. With SSE, four floating-point values are processed in parallel by a single instruction. The result shows that the SIMD version is noticeably faster than the scalar one.

Kunpeng SIMD (ARM NEON)

Code:

Run on a HUAWEI TaiShan 200 server:

Analysis:

The program uses intrinsics from the ARM NEON instruction set to process 16 uint8_t values in parallel. The add3 function adds 3 to each element of a 16-element uint8x16_t vector: vmovq_n_u8 creates a vector `three` whose 16 uint8_t lanes each hold the value 3, and vaddq_u8 adds `three` to the corresponding elements of `data`, writing the result back to `data`. The print_uint8 function prints a uint8x16_t vector by storing its elements into an array and printing them. The main function defines a uint8_data array of 16 uint8_t values, loads it into the 16-element vector `data`, prints the initial values with print_uint8, calls add3 to update `data`, and finally prints the result with print_uint8.

GPU SIMD

CUDA installation information

Press Win+R and enter cmd to open the command line, then run nvcc -V to view the CUDA installation information. The version installed here is 11.7.

deviceQuery in CUDA

On my machine, the demo suite is located at

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\extras\demo_suite

Running deviceQuery prints detailed device information, including the graphics card model, its compute capability, memory size, clock frequency, and so on.

vector calculation in CUDA

Code:

Result:

This is the result of running the source code

After adding a timer: the original code runs with a single thread block, and the time is 0.03677 s.

After changing the launch parameters so that each thread block contains only two threads, five thread blocks are needed to compute in parallel, and the speed improves noticeably.

Analysis:

The program creates a 10-element array of floating-point numbers and copies it from host memory to graphics card (device) memory. A kernel running on the GPU then computes the square of each element and stores the result in device memory. Finally, the program copies the results back to host memory and prints each element's index and its squared value.

The calculation is done in parallel: each thread block is configured with 32 threads, and each thread squares one array element. Since the array has only 10 elements, a single thread block suffices.

PI calculation in CUDA

Code:

Result:

This is the result of running the unmodified source code, in which each thread block contains 1024 threads and only one thread block is launched.

This is the code adjusted to the thread-block limit of my graphics card, launching fourteen thread blocks; the speed increased roughly tenfold.

Analysis:

This program calculates pi using CUDA. The sumHost and sumDev arrays hold the partial results in host and graphics card memory respectively. The cudaMalloc() function allocates space on the graphics card and cudaMemset() zeroes it; the cal_pi kernel then computes pi in parallel on the GPU according to its input parameters, storing the result in the sum array passed to it. Each thread, numbered idx = blockIdx.x * blockDim.x + threadIdx.x, accumulates the area of part of the region: in each loop iteration it adds the area of a small rectangle whose left endpoint is (i+0.5)*step and whose right endpoint is (i+1.5)*step to sum[idx]. Before launching, numBlocks and threadsPerBlock specify the number of thread blocks and the number of threads per block, passed to the kernel as <<<numBlocks, threadsPerBlock>>>. Finally, the program outputs the value of pi and the time the calculation took.

Original post: blog.csdn.net/lijj0304/article/details/130907858