CUDA (xiii): Manually Allocating and Copying Memory (NVIDIA Course, Part 5)

Advanced Content

The following sections are intended for learners with extra time who want to dig deeper. They introduce more intermediate-level techniques, involving some manual memory management and the use of non-default streams to overlap kernel execution and memory copies.

After understanding the techniques listed below, you can try to use them to further optimize the n-body simulation.


Manual Memory Allocation and Copying

Although cudaMallocManaged and cudaMemPrefetchAsync perform well and greatly simplify memory migration, it is sometimes worth using more manual methods of memory allocation. This is especially true when data is known to be accessed only on the device or only on the host, and the cost of on-demand automatic migration can be avoided.

In addition, manual memory management lets you use non-default streams to overlap data transfers with computation. In this section you will learn some basic manual memory allocation and copying techniques, then extend them to overlap data copies with computation.

Here are some CUDA commands for manual memory management:

  • cudaMalloc allocates memory directly on the active GPU. This prevents all GPU page faults, at the cost that the pointer it returns cannot be accessed by host code.
  • cudaMallocHost allocates memory directly on the CPU. It also pins (page-locks) the memory, which allows the memory to be copied to or from the GPU asynchronously. Too much pinned memory can interfere with CPU performance, so use it only with intention. Release pinned memory with cudaFreeHost.
  • cudaMemcpy copies (rather than transfers) memory, either from host to device or from device to host.

Manual Memory Management Example

Here is some code demonstrating the use of these CUDA API calls.

int *host_a, *device_a;        // Define host-specific and device-specific arrays.
cudaMalloc(&device_a, size);   // `device_a` is immediately available on the GPU.
cudaMallocHost(&host_a, size); // `host_a` is immediately available on CPU, and is page-locked, or pinned.

initializeOnHost(host_a, N);   // No CPU page faulting since memory is already allocated on the host.

// `cudaMemcpy` takes the destination, source, size, and a CUDA-provided variable for the direction of the copy.
cudaMemcpy(device_a, host_a, size, cudaMemcpyHostToDevice);

kernel<<<blocks, threads, 0, someStream>>>(device_a, N);

// `cudaMemcpy` can also copy data from device to host.
cudaMemcpy(host_a, device_a, size, cudaMemcpyDeviceToHost);

verifyOnHost(host_a, N);

cudaFree(device_a);
cudaFreeHost(host_a);          // Free pinned memory like this.

Exercise: Manually Allocate Host and Device Memory

The most recent iteration of the vector addition application, [01-stream-init-solution], uses cudaMallocManaged to allocate managed memory that is first used on the device by the initialization kernels, then on the device by the vector addition kernel, and finally by the host for verification; at each step the memory is migrated automatically. This approach is sensible, but it is worth experimenting with some manual memory allocation and copying methods to observe their effect on application performance.

Refactor the [01-stream-init-solution] application so that it does not use cudaMallocManaged. To do this, you need to do the following (a sketch of these changes follows the list):

  • Replace calls to cudaMallocManaged with calls to cudaMalloc.
  • Create an additional vector that will be used for verification on the host. Because memory allocated with cudaMalloc is not available on the host, this is required; allocate this host vector with cudaMallocHost.
  • After the addVectorsInto kernel has finished running, use cudaMemcpy to copy the vector containing the addition results into the host vector created with cudaMallocHost.
  • Use cudaFreeHost to release the memory allocated with cudaMallocHost.
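
A minimal sketch of these changes, assuming device pointers a, b, c, a pinned host pointer h_c, and a byte count size (these names are illustrative and may differ from the actual solution file):

float *a, *b, *c;                // Device-only vectors.
float *h_c;                      // Host vector used only for verification.

cudaMalloc(&a, size);            // Was cudaMallocManaged(&a, size); likewise for `b` and `c`.
cudaMalloc(&b, size);
cudaMalloc(&c, size);
cudaMallocHost(&h_c, size);      // Pinned host memory that will receive the result.

// ... initialization kernels and `addVectorsInto` run on the device as before ...

cudaMemcpy(h_c, c, size, cudaMemcpyDeviceToHost);  // Copy the result back for verification.
checkElementsAre(7, h_c, N);

cudaFree(a); cudaFree(b); cudaFree(c);
cudaFreeHost(h_c);               // Pinned memory is released with cudaFreeHost.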
!nvcc -arch=sm_70 -o vector-add-manual-alloc 06-stream-init/solutions/01-stream-init-solution.cu -run
Success! All values calculated correctly.

After completing the refactor, open the executable in a new nvvp session, then use the timeline to do the following:

  • Note that the Unified Memory section of the timeline no longer exists.
  • Compare this timeline with that of the previous refactor, and use the timeline ruler to compare the runtime of cudaMalloc in the current application with that of cudaMallocManaged in the previous application.
  • Notice how the initialization kernels in the current application start running later than they did in the previous iteration. After examining the timeline, you will find that the difference is the amount of time taken by cudaMallocHost. This clearly illustrates the difference between memory transfers and memory copies: when copying memory, as you are doing now, the data exists in two different places in the system. Allocating a fourth host vector, compared with allocating only three vectors in the previous iteration, incurs a minor performance cost.

Using Streams to Overlap Data Transfers and Code Execution

The following slides give a visual summary of the upcoming material. Click through the slides, then continue for a deeper look at the topics in the following sections.

%%HTML

<div align="center"><iframe src="https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-01-V1/AC_STREAMS_NVVP-zh/NVVP-Streams-3-zh.pptx" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

In addition to cudaMemcpy, cudaMemcpyAsync can copy memory asynchronously from host to device or from device to host, as long as the host memory is pinned, which can be done by allocating it with cudaMallocHost.

Similar to kernel execution, cudaMemcpyAsync is by default asynchronous only with respect to the host. By default, it executes in the default stream and is therefore a blocking operation with respect to other CUDA operations executing on the GPU. However, cudaMemcpyAsync accepts a non-default stream as an optional fifth argument. By passing it a non-default stream, the memory transfer can execute concurrently with other CUDA operations occurring in other non-default streams, as sketched below.
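
As a minimal sketch (reusing device_a, host_a, and size from the earlier example, with host_a allocated via cudaMallocHost):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Without the fifth argument, the copy runs in the default stream and blocks
// other CUDA operations on the GPU:
cudaMemcpyAsync(device_a, host_a, size, cudaMemcpyHostToDevice);

// With a non-default stream as the optional fifth argument, the copy can run
// concurrently with operations in other non-default streams:
cudaMemcpyAsync(device_a, host_a, size, cudaMemcpyHostToDevice, stream);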

A common and useful pattern is to combine pinned host memory, asynchronous memory copies in non-default streams, and kernel execution in non-default streams, so that memory transfers overlap with kernel execution.

In the following example, rather than waiting for the entire memory copy to complete before starting the kernel, we copy and process the required segments of data, letting each segment's copy/work pair run in its own non-default stream. Using this technique, work on some parts of the data can begin while memory transfers for later segments occur concurrently. Extra care must be taken with this technique to calculate segment-specific values for the number of operations and the offset location within arrays, as shown here:

int N = 2<<24;
int size = N * sizeof(int);

int *host_array;
int *device_array;

cudaMallocHost(&host_array, size);               // Pinned host memory allocation.
cudaMalloc(&device_array, size);                 // Allocation directly on the active GPU device.

initializeData(host_array, N);                   // Assume this application needs to initialize on the host.

const int numberOfSegments = 4;                  // This example demonstrates slicing the work into 4 segments.
int segmentN = N / numberOfSegments;             // A value for a segment's worth of `N` is needed.
size_t segmentSize = size / numberOfSegments;    // A value for a segment's worth of `size` is needed.

// For each of the 4 segments...
for (int i = 0; i < numberOfSegments; ++i)
{
  // Calculate the index where this particular segment should operate within the larger arrays.
  int segmentOffset = i * segmentN;

  // Create a stream for this segment's worth of copy and work.
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  
  // Asynchronously copy segment's worth of pinned host memory to device over non-default stream.
  cudaMemcpyAsync(&device_array[segmentOffset],  // Take care to access correct location in array.
                  &host_array[segmentOffset],    // Take care to access correct location in array.
                  segmentSize,                   // Only copy a segment's worth of memory.
                  cudaMemcpyHostToDevice,
                  stream);                       // Provide optional argument for non-default stream.
                  
  // Execute segment's worth of work over same non-default stream as memory copy.
  kernel<<<number_of_blocks, threads_per_block, 0, stream>>>(&device_array[segmentOffset], segmentN);
  
  // `cudaStreamDestroy` will return immediately (is non-blocking), but will not actually destroy stream until
  // all stream operations are complete.
  cudaStreamDestroy(stream);
}

Exercise: Overlap Kernel Execution and Memory Copy Back to Host

The most recent iteration of the vector add application, [01-manual-malloc-solution.cu], currently performs all of its vector addition work on the GPU and then, after that work completes, copies the memory back to the host for verification.

Refactor [01-manual-malloc-solution.cu] so that it performs the vector addition in segments across four non-default streams, so that the asynchronous memory copies can start without waiting for all of the vector addition work to complete. If you run into problems, refer to the solution code posted later in this section.

!nvcc -arch=sm_70 -o vector-add-manual-alloc 07-manual-malloc/solutions/01-manual-malloc-solution.cu -run
Success! All values calculated correctly.

After completing the refactor, open the executable in a new nvvp session, then use the timeline to do the following:

  • Note when the device-to-host memory copies begin: is it before or after all of the kernel work has completed?
  • Notice that the four memory copy segments do not themselves overlap. Even in separate non-default streams, only one memory transfer in a given direction (here DtoH) can occur at a time. The performance gain here comes from being able to start the transfers earlier than would otherwise be possible, and it is not hard to imagine that in an application whose workload is more substantial than a trivial addition, the memory copies would not only start earlier but also overlap with kernel execution.

The final modified code is posted below; this sample code covers manual memory allocation, asynchronous memory copies, and related operations.

#include <stdio.h>

__global__
void initWith(float num, float *a, int N)
{

  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    a[i] = num;
  }
}

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main()
{
  int deviceId;
  int numberOfSMs;

  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;
  float *h_c;

  cudaMalloc(&a, size);
  cudaMalloc(&b, size);
  cudaMalloc(&c, size);
  cudaMallocHost(&h_c, size);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  threadsPerBlock = 256;
  numberOfBlocks = 32 * numberOfSMs;

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  /*
   * Create 3 streams used to initialize the 3 data vectors in parallel.
   */

  cudaStream_t stream1, stream2, stream3;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
  cudaStreamCreate(&stream3);

  /*
   * Give each `initWith` launch its own non-default stream.
   */

  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(3, a, N);
  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(4, b, N);
  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream3>>>(0, c, N);

  //addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
  //cudaMemcpy(h_c, c, size, cudaMemcpyDeviceToHost);

  // Wait for the initialization kernels in stream1, stream2, and stream3 to
  // finish before the segmented addition kernels below read `a`, `b`, and `c`.
  cudaDeviceSynchronize();

  for(int i = 0; i<4; ++i){
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      
      addVectorsInto<<<numberOfBlocks/4, threadsPerBlock, 0, stream>>>(&c[i*N/4], &a[i*N/4], &b[i*N/4], N/4);
      cudaMemcpyAsync(&h_c[i*N/4], &c[i*N/4], size/4, cudaMemcpyDeviceToHost, stream);
      cudaStreamDestroy(stream);
  }

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  checkElementsAre(7, h_c, N);

  /*
   * Destroy streams when they are no longer needed.
   */

  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
  cudaStreamDestroy(stream3);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  cudaFreeHost(h_c);
}

PS:

Hard work pays off. Although the course says it takes only 8 hours, it took me far more than that, but fortunately it was rewarding!

Origin blog.csdn.net/Felaim/article/details/104575488