Accelerating pytorch network pre- and post-processing with GPU-based numpy (cupy)

Background

Python scripts serving convolutional neural networks on the server side often need to convert image data read by cv2 (numpy.ndarray) into a network input tensor, run inference, and then convert the results from tensor back to numpy.ndarray.

Since the data read by cv2 lives in host memory, the pytorch framework performs the following conversion before the data reaches the GPU:

It checks whether the tensor is already a torch.cuda.FloatTensor in GPU memory; if not, it implicitly calls its asynchronous copy method .async_copy() to transfer the data from host memory to GPU memory so it is ready for GPU inference. This implicit transfer is often the time-consuming part.
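To make that copy explicit, here is a minimal sketch of the conventional flow, with comments marking where each array lives (the file name is illustrative):

    import cv2
    import numpy as np
    import torch

    img = cv2.imread('sample.jpg')                 # numpy.ndarray, host (CPU) memory
    t = torch.from_numpy(img.astype(np.float32))   # torch.FloatTensor, still host memory
    t = t.cuda()                                   # implicit host -> GPU copy happens here (the costly step)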

Solution: allocate the space directly in GPU memory.

  Libraries: cupy, dlpack

1. Preprocessing

A typical pytorch preprocessing pipeline looks like this:

    # Allocate a torch.FloatTensor in host memory
    batch_input = torch.zeros(len(image_list), 3, target_height, target_width)

    for index in range(len(image_list)):
        # image -> numpy.ndarray, resized to the network input size
        img = cv2.resize(image_list[index].copy(), (target_width, target_height))
        # uint8 -> float32
        t_img = np.asarray(img, np.float32)
        # HWC -> CHW transpose
        m_img = t_img.transpose((2, 0, 1))
        # numpy.ndarray -> torch.FloatTensor + image normalization
        n_img = transform(torch.from_numpy(m_img))
        # Assemble the batch
        batch_input[index, :] = n_img

    # torch.FloatTensor -> torch.cuda.FloatTensor (implicit host -> GPU copy)
    batch_input = batch_input.cuda()
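For reference, the transform above is typically a torchvision normalization. A plausible definition, consistent with the mean/std values used later in this post (an assumed setup, not code from the original project):

    import torchvision.transforms as T

    # Subtract the per-channel mean and divide by std (values match gpu_transform below)
    transform = T.Normalize(mean=[102.9801, 115.9465, 122.7717], std=[1.0, 1.0, 1.0])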

If this batch is sent to the GPU as-is, the data conversion shown in Figure 1 takes place.

Using cupy instead, the numpy operations can be replaced as follows:

    import cupy as cp

    # Allocate the batch buffer directly in GPU memory with cupy
    batch_input = cp.zeros((len(image_list), 3, target_height, target_width), dtype=cp.float32)

    for index in range(len(image_list)):
        # image -> numpy.ndarray, resized to the network input size
        img = cv2.resize(image_list[index], (target_width, target_height))
        # numpy.uint8 -> cupy.float32 (single host -> device copy)
        t_img = cp.asarray(img, cp.float32)
        # HWC -> CHW transpose, performed on the GPU
        m_img = t_img.transpose((2, 0, 1))
        # Image normalization on the GPU (see gpu_transform below)
        n_img = gpu_transform(m_img)
        # Assemble the batch
        batch_input[index, :] = n_img

    # cupy.ndarray -> torch.cuda.FloatTensor (zero-copy via dlpack; already GPU-resident)
    batch_data = from_dlpack(toDlpack(batch_input))
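A quick sanity check on the result (a sketch; the dlpack round trip is zero-copy, so the tensor already lives on the GPU and no further .cuda() call is needed):

    assert batch_data.is_cuda                       # already a torch.cuda.FloatTensor
    assert batch_data.shape[0] == len(image_list)   # batch dimension intact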

With this change, the conversion process becomes: [figure: the data stays in GPU memory throughout]

A few points to note:

1.1 Since cupy allocates space directly in GPU memory, no implicit .async_copy() call is needed to move the data out of host memory. The difference is visible in the comparison below:

Transfer time before GPU inference with the implicit copy: [figure]

Transfer time before GPU inference without the implicit copy: [figure]
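The difference can be reproduced with a minimal timing sketch (the shape is illustrative, and torch.cuda.synchronize() is needed so the host timer sees the real GPU cost):

    import time
    import cupy as cp
    import torch

    shape = (8, 3, 512, 512)  # illustrative batch shape

    # Conventional path: allocate in host memory, then implicit host -> GPU copy
    torch.cuda.synchronize()
    start = time.perf_counter()
    batch = torch.zeros(shape).cuda()
    torch.cuda.synchronize()
    print('host alloc + copy: %.4f s' % (time.perf_counter() - start))

    # cupy path: allocate directly in GPU memory, no host-side staging
    torch.cuda.synchronize()
    start = time.perf_counter()
    batch = cp.zeros(shape, dtype=cp.float32)
    cp.cuda.Stream.null.synchronize()
    print('direct GPU alloc:  %.4f s' % (time.perf_counter() - start))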

1.2 A cupy.ndarray cannot be converted directly to a torch.cuda.FloatTensor; dlpack is required as the intermediate format. The conversion goes as follows:

    from cupy.core.dlpack import toDlpack
    from cupy.core.dlpack import fromDlpack
    from torch.utils.dlpack import to_dlpack
    from torch.utils.dlpack import from_dlpack
    import torch

    # tensor -> cupy
    cupy_data = fromDlpack(to_dlpack(tensor_data))

    # cupy -> tensor
    tensor_data = from_dlpack(toDlpack(cupy_data))
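Note that the cupy.core.dlpack import path reflects older cupy releases. In more recent versions (roughly cupy >= 10 and pytorch >= 1.10; check your installed versions), the same zero-copy round trip can go through the standard DLPack protocol:

    import cupy as cp
    import torch

    tensor_data = torch.zeros(2, 3, device='cuda')
    # tensor -> cupy (zero-copy; both views share the same GPU memory)
    cupy_data = cp.from_dlpack(tensor_data)
    # cupy -> tensor (zero-copy)
    tensor_back = torch.from_dlpack(cupy_data)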

1.3 In the pytorch framework, some projects require image normalization and others do not. When normalization is needed before the image is fed into the network (typically subtracting the mean and dividing by the standard deviation), torchvision.transform is generally used. But those built-in functions only accept a CPU-side torch.FloatTensor, which means that to use them the cupy GPU data would have to be moved back to a CPU torch.FloatTensor, and that conversion is bound to waste resources. Instead, rewrite the transform function to run on the GPU:

    self.mean = cp.array([102.9801, 115.9465, 122.7717])
    self.std = cp.array([1., 1., 1.])

    def gpu_transform(self, img):
        # Per-channel mean subtraction and division by std, entirely on the GPU
        for index in range(img.shape[0]):
            img[index, :] -= self.mean[index]
            img[index, :] /= self.std[index]
        return img

All of the above operations run on the GPU, so their time cost is almost negligible.
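As an aside, the per-channel loop can also be written with cupy broadcasting, which avoids the Python-level loop (a sketch, assuming img is a CHW cupy.float32 array):

    def gpu_transform(self, img):
        # Broadcast the (C,) mean/std over the (C, H, W) image in one operation
        img -= self.mean[:, None, None]
        img /= self.std[:, None, None]
        return img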

2. Post-processing

This section applies to segmentation networks, where GPU memory for the generated mask must be allocated in advance. The common practice of allocating a torch.FloatTensor and moving it to the GPU implicitly calls .async_copy() to transfer it, which again consumes a lot of time. Similar to the preprocessing above, cupy can be used to allocate the mask space directly on the GPU and then convert it to a torch.cuda.FloatTensor.

    # Allocated on the GPU by cupy; the dlpack conversion is zero-copy, so no .cuda() call is needed
    mask_gpu = from_dlpack(toDlpack(cp.zeros((len(image_list), self.num_classes, ori_img_size[0], ori_img_size[1]), dtype=cp.float32)))

Mask allocation time with pytorch: [figure]

Mask allocation time with cupy: [figure]

3. Comparison of cupy-based and conventional pre- and post-processing times


Source: blog.51cto.com/14503791/2447917