Background
Python scripts that run convolutional neural networks on the server side often need to convert image data from cv2 (numpy.ndarray) into a network tensor, run inference, and then convert the results from tensor back to numpy.ndarray.
Because the data read by cv2 is stored in host memory, taking the pytorch framework as an example, the following data conversion takes place before the data is sent to the GPU:
It checks whether the data is already a torch.cuda.FloatTensor in GPU memory. If not, it implicitly calls the .async_copy() function to copy the data from host memory to GPU memory so that it is ready for GPU inference, and this step is often time-consuming.
Solution: allocate the space directly in GPU memory
Libraries: cupy, dlpack
First, preprocessing
Typical pytorch preprocessing looks like this:
import cv2
import numpy as np
import torch

# Allocate the batch as a torch.FloatTensor in host (CPU) memory
batch_input = torch.zeros(len(image_list), 3, target_height, target_width)
for index in range(len(image_list)):
    # Resize the image (numpy.ndarray); cv2.resize takes (width, height)
    img = cv2.resize(image_list[index].copy(), (target_width, target_height))
    # uint8 -> float32
    t_img = np.asarray(img, np.float32)
    # Transpose HWC -> CHW
    m_img = t_img.transpose((2, 0, 1))
    # numpy.ndarray -> torch.FloatTensor, then image normalization
    n_img = transform(torch.from_numpy(m_img))
    # Write the image into the batch
    batch_input[index, :] = n_img
# torch.FloatTensor -> torch.cuda.FloatTensor
# (.cuda() returns a new tensor, so the result must be kept)
batch_input = batch_input.cuda()
When this batch is moved to the GPU, the data conversion shown in Figure 1 occurs.
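The dtype and layout changes in the loop above can be checked on a tiny array. This is only an illustrative sketch: the 2 x 4 image is a made-up stand-in for a resized cv2 frame, not data from the article.

```python
import numpy as np

# Hypothetical stand-in for a resized cv2 image:
# height 2, width 4, 3 channels, dtype uint8 (HWC layout).
img = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)

t_img = np.asarray(img, np.float32)   # uint8 -> float32 (makes a copy)
m_img = t_img.transpose((2, 0, 1))    # HWC -> CHW (a view, no copy)

print(img.shape, img.dtype)      # (2, 4, 3) uint8
print(m_img.shape, m_img.dtype)  # (3, 2, 4) float32

# Channel c of pixel (y, x) moves from img[y, x, c] to m_img[c, y, x].
same_value = bool(m_img[0, 1, 2] == img[1, 2, 0])
print(same_value)                # True
```

The transpose only changes strides, so the real cost in the conventional path comes from the later copy to the GPU rather than from this step.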
Replace the numpy operations with cupy:
import cupy as cp
from cupy.core.dlpack import toDlpack
from torch.utils.dlpack import from_dlpack

# Allocate the batch directly in GPU memory as a cupy.ndarray
batch_input = cp.zeros((len(image_list), 3, target_height, target_width), dtype=cp.float32)
for index in range(len(image_list)):
    # Resize on the CPU; the result is still a numpy.ndarray
    img = cv2.resize(image_list[index], (target_width, target_height))
    # numpy.uint8 -> cupy.float32
    t_img = cp.asarray(img, cp.float32)
    # Transpose HWC -> CHW (done by cupy)
    m_img = t_img.transpose((2, 0, 1))
    # Image normalization
    n_img = gpu_transform(m_img)
    # Write the image into the batch
    batch_input[index, :] = n_img
# cupy.ndarray -> torch.cuda.FloatTensor
batch_data = from_dlpack(toDlpack(batch_input)).cuda()
The conversion process now becomes:
A couple of points:
1.1 Because cupy allocates the space directly in GPU memory, there is no implicit .async_copy() call to transfer the data from host memory. The difference is visible in the comparison:
Time with the implicit call before the GPU transfer, as shown below:
Time without the implicit call before the GPU transfer, as shown below:
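A comparison like this can be reproduced with a small timing helper. The sketch below is an assumption about how one might measure it, not the author's measurement code; time_call and its sync parameter are hypothetical names. The sync hook matters because GPU work is queued asynchronously, so the clock should only be read after the queued work has finished.

```python
import time

def time_call(fn, repeats=10, sync=None):
    """Return the mean wall-clock time of fn() in milliseconds.

    sync is an optional zero-argument callable that blocks until queued
    GPU work has finished (for example torch.cuda.synchronize).
    """
    if sync is not None:
        sync()  # make sure earlier GPU work is not counted
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for the work launched by fn() to complete
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats

# CPU-only demonstration; on a GPU machine one would pass the
# preprocessing function as fn and sync=torch.cuda.synchronize.
cpu_ms = time_call(lambda: sum(range(10_000)))
print(f"{cpu_ms:.3f} ms per call")
```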
1.2 A cupy.ndarray cannot be converted directly to a torch.cuda.FloatTensor; dlpack is required as an intermediate format. The conversion is as follows:
from cupy.core.dlpack import toDlpack
from cupy.core.dlpack import fromDlpack
from torch.utils.dlpack import to_dlpack
from torch.utils.dlpack import from_dlpack
import torch
#tensor->cupy
cupy_data = fromDlpack(to_dlpack(tensor_data))
#cupy->tensor
tensor_data = from_dlpack(toDlpack(cupy_data))
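The conversion through dlpack shares the GPU buffer instead of copying it, which a small round trip can confirm. The sketch below is hedged in two ways. First, it assumes a cupy release that provides the ndarray.toDlpack() method and the top-level cupy.fromDlpack function; depending on the installed version, the cupy.core.dlpack path used above may not be importable, and newer releases may prefer cupy.from_dlpack. Second, dlpack_roundtrip is a hypothetical helper that returns None on a machine without cupy, torch, or a CUDA device.

```python
def dlpack_roundtrip():
    """Round-trip a small array cupy -> torch -> cupy through DLPack.

    Returns True if both conversions shared memory, or None when cupy,
    torch, or a CUDA device is not available.
    """
    try:
        import cupy as cp
        import torch
        from torch.utils.dlpack import from_dlpack, to_dlpack
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    cupy_data = cp.zeros((2, 3), dtype=cp.float32)

    # cupy -> tensor: the tensor wraps the same GPU buffer
    tensor_data = from_dlpack(cupy_data.toDlpack())
    tensor_data += 1.0                    # in-place write on the torch side
    torch.cuda.synchronize()
    seen_by_cupy = bool((cupy_data == 1.0).all())

    # tensor -> cupy: again no copy is made
    cupy_again = cp.fromDlpack(to_dlpack(tensor_data))
    cupy_again *= 2.0                     # in-place write on the cupy side
    torch.cuda.synchronize()
    seen_by_torch = bool((tensor_data == 2.0).all())

    return bool(tensor_data.is_cuda and seen_by_cupy and seen_by_torch)

result = dlpack_roundtrip()
print(result)
```

Because the buffer is shared, an in-place change on one side is visible on the other, so the cupy array should not be modified while the tensor is still in use.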
1.3 In the pytorch framework, some projects require image normalization and others do not. When an image has to be normalized before it is fed to the network (typically subtracting the mean and dividing by the standard deviation), torchvision.transforms is generally used. But the built-in function accepts only a CPU-side torch.FloatTensor, which means that to use the built-in transform you would have to convert the cupy GPU data back to a CPU torch.FloatTensor, and that conversion wastes resources. Instead, rewrite the transform function:
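# Per-channel mean and std (these assignments belong in the class's __init__)
self.mean = cp.array([102.9801, 115.9465, 122.7717])
self.std = cp.array([1., 1., 1.])

def gpu_transform(self, img):
    # Normalize each channel of a CHW image in place on the GPU
    for index in range(img.shape[0]):
        img[index, :] -= self.mean[index]
        img[index, :] /= self.std[index]
    return img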
All of the steps above run on the GPU, and the time they take is almost negligible.
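The per-channel loop in gpu_transform can also be written with broadcasting, which handles a whole CHW image or an NCHW batch in one expression. This is a sketch rather than the author's code: normalize_chw is a hypothetical name, the result is a new array instead of an in-place update, and the block falls back to numpy when cupy or a GPU is not available so that it can be run anywhere.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)          # fails if cupy is installed but no GPU is usable
    xp = cp
except Exception:
    xp = np              # CPU fallback so the sketch still runs

MEAN = xp.asarray([102.9801, 115.9465, 122.7717], dtype=xp.float32)
STD = xp.asarray([1.0, 1.0, 1.0], dtype=xp.float32)

def normalize_chw(img, mean=MEAN, std=STD):
    """Normalize a (3, H, W) or (N, 3, H, W) float32 array per channel."""
    # Reshape the statistics to (3, 1, 1) so they broadcast over H and W.
    return (img - mean[:, None, None]) / std[:, None, None]

# Hypothetical input: one 2 x 2 image and a batch of four such images.
single = xp.full((3, 2, 2), 128.0, dtype=xp.float32)
batch = xp.full((4, 3, 2, 2), 128.0, dtype=xp.float32)

out_single = normalize_chw(single)
out_batch = normalize_chw(batch)
print(out_single.shape, out_batch.shape)  # (3, 2, 2) (4, 3, 2, 2)
```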
Second, post-processing
This section applies to segmentation networks, i.e., networks for which the mask space has to be allocated on the GPU in advance. The common practice is to allocate the space as a torch.cuda.FloatTensor, which implicitly calls .async_copy() to move it to the GPU and also consumes a lot of time. As in the preprocessing step, the mask space can be allocated with cupy and then converted to a torch.cuda.FloatTensor.
mask_gpu = from_dlpack(toDlpack(cp.zeros(
    (len(image_list), self.num_classes, ori_img_size[0], ori_img_size[1]),
    dtype=cp.float32))).cuda()
Time to allocate the mask with pytorch:
Time to allocate the mask with cupy:
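One reason the mask allocation is worth optimizing is simply its size, since the buffer scales with batch size, class count, and the original image resolution. The arithmetic below is a rough illustration; mask_bytes is a hypothetical helper and the sizes (4 images, 21 classes, 1080 x 1920) are made-up values, not figures from the article.

```python
def mask_bytes(batch, num_classes, height, width, itemsize=4):
    """Bytes needed for a (batch, num_classes, height, width) mask.

    itemsize=4 corresponds to float32.
    """
    return batch * num_classes * height * width * itemsize

# Hypothetical sizes: 4 images, 21 classes, 1080 x 1920 originals
size = mask_bytes(4, 21, 1080, 1920)
print(size)                    # 696729600
print(round(size / 2**20, 1))  # 664.5 (MiB)
```

A buffer of this size that is first created in host memory and then copied has to cross the host-to-device link, which is the cost that allocating it directly on the GPU avoids.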
Third, pre- and post-processing time with cupy compared with the conventional approach