C++ calls Python files: an analysis of why deep learning models built with TensorFlow and PyTorch cannot use the GPU.

This article analyzes the stubborn problems that come up when C++ calls a Python deep learning model: loading the model onto the GPU, releasing the model from GPU memory, insufficient GPU memory when calling the GPU, excessive memory usage, low utilization, the GPU refusing to run, and so on.

(Note: Regarding the problem of low GPU utilization and how to use the GPU more efficiently for deep learning, please see my other article: Low GPU utilization and very low CPU utilization in PyTorch and TensorFlow: a summary and analysis of slow model training.)

1. C++ calls the deep learning model built in Python

1.1 Project description

I have recently been helping with a project: a QT application written in C++ for face recognition and analysis. C++ is responsible for the application side: the interface, multi-threading, real-time display, and so on. Python is responsible for the deep learning side. Several deep learning models are built in Python, including ResNet101, LSTM, VGG16, and other networks, used for feature extraction, expression classification, temporal analysis, and other algorithmic logic.
The C++ side has a face detection option which needs to call the GPU for speed. Since the algorithms here are written in Python, we never intended to convert the TensorFlow or PyTorch code to C++. Moreover, because different people are responsible for different modules, our deep learning models span TensorFlow, Keras, and PyTorch; converting three different frameworks into one usable piece of C++ code with a conversion package is not easy.
For convenience, the deep learning code is written as standalone functions in Python, and C++ calls the Python code directly through the Python interface. In other words, you import the deep learning code you wrote in Python (for example, the file DNN_algorithm.py) into C++ and call the relevant functions from C++, just as if you were running them in Python.

1.2 C++ and Python code construction

After you write the deep learning code in Python, C++ can use all the functions and classes in your DNN_algorithm.py. In my Python code, the first step is to load the trained model via a function named load_model(); then the model_predict(image) function performs face detection (that is, prediction) on an image.

2. Python deep learning model code (file name: DNN_algorithm.py)

In the usual way, build the model in Python, train it, and save it. For prediction, just load the trained weight file directly. Guides for this part can be found everywhere, so I will not go into it further.

	class DNN_model():
		def __init__(self):
			self.vgg_model = None
			self.LSTM_model = None
		
		def load_model(self):
			# Construct the model architectures; TensorFlow, Keras, or PyTorch all work.
			VGG_net = build_vgg_net()    # placeholder: your VGG16 definition
			LSTM_net = build_lstm_net()  # placeholder: your LSTM definition
			# Load the trained weights into the constructed model architectures.
			self.vgg_model = load_weights(VGG_net, 'vgg_net.h5')
			self.LSTM_model = load_weights(LSTM_net, 'lstm_net.h5')
		
		def model_predict(self, image):
			class_out = self.vgg_model.predict(image)
			temporal_out = self.LSTM_model.predict(image)
			return class_out, temporal_out
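
For concreteness, here is a minimal sketch of what the architecture builder and load_weights() could look like under Keras (the layer sizes, the build_lstm_net name, and the .h5 file names are placeholders for illustration, not the project's real code):

	from keras.models import Sequential
	from keras.layers import Dense, LSTM
	
	def build_lstm_net():
		# Placeholder architecture; replace with your real network definition.
		net = Sequential()
		net.add(LSTM(128, input_shape=(16, 512)))
		net.add(Dense(7, activation='softmax'))
		return net
	
	def load_weights(net, weight_file):
		# Keras loads trained weights into an already constructed architecture.
		net.load_weights(weight_file)
		return net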

3. C++ calls the relevant functions of the Python code built above

This section only covers how C++ calls the few Python functions built above: first import the Python module by file name, then call the relevant functions. For the details of how C++ calls Python, search for Python.h; it is the official interface that Python provides for use from C++.

    //Import the module: load the deep learning code DNN_algorithm.py, written in Python, into C++ for easy calling.
    m_pModule = PyImport_ImportModule("DNN_algorithm");
    //My Python code is a class, so get the class object and instantiate it first;
    //only then can its member functions be called.
    PyObject* pClass = PyObject_GetAttrString(m_pModule, "DNN_model");
    PyObject* m_pInstanceME = PyObject_CallObject(pClass, NULL);
    //Now every function and class inside DNN_algorithm.py can be used.
    //My Python code first loads the trained model; the function is load_model().
    //Then model_predict(image) performs face detection on an image.
    PyObject* pResult = NULL;
    //Call the Python function load_model to load the model.
    pResult = PyObject_CallMethod(m_pInstanceME, "load_model", NULL);
    PyObject* pFunc = NULL;
    //Get the Python prediction function model_predict.
    pFunc = PyObject_GetAttrString(m_pInstanceME, "model_predict");
    //Hand the image argList captured on the C++ side to the Python function for prediction.
    PyObject* pResult2 = PyEval_CallObject(pFunc, argList);
    

4. Loading the model onto the GPU and computing with the GPU

4.1 C++ loads TensorFlow and Keras models to GPU

The problems I encountered here were with TensorFlow and Keras; loading under PyTorch caused no trouble. So let's talk about TensorFlow and Keras: how the model gets loaded onto the GPU and run when C++ calls it.

In fact, you only need to add the lines below on the Python side that C++ calls, and your model is loaded onto the GPU automatically. PyTorch does not work this way: PyTorch needs to move the model to the device explicitly:

model = model.to(device)  # this is the loading method of PyTorch
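
As a minimal sketch of that explicit placement (assuming GPU 0, matching the CUDA_VISIBLE_DEVICES setting below):

	import torch
	
	# Pick the GPU if one is available, otherwise fall back to the CPU.
	device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
	model = model.to(device)   # move the model parameters onto the device
	image = image.to(device)   # input tensors must be moved as well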

	# This is the Keras / TensorFlow loading method
	import os
	import tensorflow as tf
	os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
	os.environ['CUDA_VISIBLE_DEVICES'] = '0'
	os.environ["TF_CPP_MIN_LOG_LEVEL"] = '3'
	
	## If your GPU memory is tight and you do not want TF and Keras to grab a huge chunk of it, the following also imposes limits.
	config = tf.ConfigProto()
	config.gpu_options.per_process_gpu_memory_fraction = 0.5  # the process may occupy at most 50% of the GPU memory
	config.gpu_options.allow_growth = True  # allocate memory on demand
	sess = tf.Session(config = config)
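
One detail worth adding: if Keras runs on top of the TF 1.x backend, the session configured above also has to be handed over to Keras, otherwise Keras silently creates its own default session. A short sketch:

	from keras import backend as K
	K.set_session(sess)  # make Keras use the session configured above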
  • At this point, check the GPU status in your task manager, including memory and CUDA usage. After C++ calls the load_model function, check whether your GPU's memory usage and utilization have gone up.
  1. If the memory usage does not go up, check whether your model was loaded at all. Run the code under pure Python first and see whether the model loads onto the GPU there; if not, it is a Python code problem.
  2. If Python can load the model onto the GPU, but the model does not load onto the GPU after C++ calls the code, then the problem is in how C++ calls Python. Check whether your C++ code calls Python correctly. If you are not sure, first write a simple print function and call it from C++ (see the sketch after this list); once that works, call the model functions with the same calling pattern.
  • My Python code below is a class: you have to instantiate this class on the C++ side before you can call its member functions.
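
Following point 2 above, a minimal test module (the file name test_call.py and the function hello are hypothetical, just for verifying the call path):

	# test_call.py -- minimal module to verify the C++ -> Python call path
	def hello(name):
		print('called from C++: %s' % name)
		return 0

If C++ can import this module and call hello() successfully, the same calling pattern should work for the member functions of DNN_model.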

4.2 After C++ loads the model onto the GPU and the neural network finishes running the prediction function, the GPU memory stays occupied and is not released

Under Python, once the model's prediction function model_predict() finishes and the script exits, the GPU memory is released directly. So under pure Python there is no need to worry about the memory the model occupies.
When C++ calls the deep learning model written in Python, as described above, the model is first constructed, the weight file is loaded, and then the model predicts, processing the collected images stage by stage. Afterwards, our interface can do other business, such as browsing, report analysis, and so on. But at this point the GPU is still occupied: only when you close the exe or exit the whole program is the model, loaded onto the GPU for neural network prediction (inference), released along with the GPU memory it occupies.

  • Clearing the cache

Under Python, the following methods can be used to clear the cache and collect garbage. (PS: this only clears some temporary variables; the effect is actually small, and the GPU memory usage does not go down.)

    # Requires: import gc; import tensorflow as tf; from keras import backend as K
    def delete_model(self):
        del self.vgg_model        # delete the model
        del self.LSTM_model       # delete the model
        gc.collect()              # collect temporary variables and garbage data
        K.clear_session()         # clear the Keras session
        tf.reset_default_graph()  # reset the graph

Please note: if you have not received new data for the moment, so the GPU temporarily has no image data to process (perhaps a new image will be collected after ten-odd seconds, so the model may be needed at any time), there is no need to release and delete the memory. If you delete and release the GPU memory and new image data then arrives, you have to reload everything onto the GPU, and that process is very time-consuming.

  • Forcibly releasing the memory occupied by the GPU
    When the face detection task is finished, the GPU memory is still occupied because the current program is still running.
    If, in your business, other algorithms need the GPU, or the GPU is used by other processing threads, you can shut down your face detection task and completely clear the GPU cache and memory usage. Personally this feels a bit like a kill. After executing the code below, your GPU memory is released instantly, because all the previously loaded models are closed. This is forcible.
	from numba import cuda
	
	cuda.select_device(0)  # select your device id: the GPU we designated above for processing
	cuda.close()           # then close all CUDA contexts in this thread

The following is a brief description of the Numba library.

cuda.close()
Explicitly close all contexts in the current thread.
Compiled functions are associated with the CUDA context. This makes it not very useful to close and create new devices, though it is certainly useful for choosing which device to use when the machine has multiple GPUs.

Numba is a Python library that uses CUDA cores to perform fast calculations on the GPU, mainly for high-performance computing. Its advertised features:
1. Numba: High Productivity for High-Performance Computing
2. GPU-Accelerated Libraries for Python
3. Massive Parallelism with CUDA Python
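
To give a feel for Numba, a minimal sketch of a CUDA kernel written in Python (illustrative only, not part of the face detection code above):

	import numpy as np
	from numba import cuda
	
	@cuda.jit
	def add_kernel(a, b, out):
		i = cuda.grid(1)  # global thread index
		if i < out.size:
			out[i] = a[i] + b[i]
	
	a = np.arange(1024, dtype=np.float32)
	b = np.ones_like(a)
	out = np.zeros_like(a)
	add_kernel[4, 256](a, b, out)  # launch 4 blocks x 256 threads; NumPy arrays are copied automatically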

If, after executing the close above, you still want to load the model and make predictions, problems occur, because your CUDA context was forcibly closed. If you want to run again, you can only close the program and rerun the code; trying to detect faces again within the same program raises an error.
Therefore, cuda.close() is only suitable for forcibly shutting down the GPU and leaving it to other tasks; the current task cannot use it again.

4.3 Solving the failure to load the model onto the GPU the next time, after the previous round of prediction (that is, the use of the deep learning model) has finished

If your application written in C++, such as the QT interface, needs to run this deep learning prediction task, then keep collecting images or other data and perform face detection again, and loading the model onto the GPU fails at that point, the cause should be that the session from the last run was not cleared, or those cached variables were not cleared. So you need to clear the session after each execution of the deep learning prediction model_predict(). With delete_model(), the model can be loaded onto the GPU again; with cuda.close(), it cannot.
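
Putting the pieces together, the per-round lifecycle on the Python side looks roughly like this (a sketch using the DNN_model class and delete_model() from above):

	model = DNN_model()
	model.load_model()                    # round 1: occupies GPU memory
	result = model.model_predict(image)
	model.delete_model()                  # clear session / graph / temporaries
	
	model.load_model()                    # round 2: loading succeeds again
	# ...whereas after cuda.close() this second load would fail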

5. The GPU stays occupied and the model loads successfully, but GPU utilization is 0 or very low

At this point, open your resource manager. The GPU memory is occupied, yet the utilization figure in the CUDA column stays at 0. Check whether your model code is actually performing the forward pass of prediction, that is, whether model_predict is running, or whether the model has read in image data and is producing output. If it really is in the prediction stage, GPU utilization should be 50% or 80%; it cannot be zero.
The biggest reason is this: your model spends most of its time waiting in the data preprocessing stage, including image resizing, face alignment, color space conversion, feature detection, and filtering (in my case, most of the time went into optical flow processing of the images). So everything feels very slow and it feels as if the GPU is not being used; I once suspected the deep learning code itself. In the end it was an OpenCV image preprocessing problem. Your GPU utilization shows jitter in the form of small spikes: the model is in fact predicting and the GPU is being used, but each forward pass is extremely fast, so the real-time utilization column shows only small pulse-like blips.
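
To confirm this, it helps to time the two stages separately. A sketch (preprocess() stands in for your resize / alignment / optical flow pipeline):

	import time
	
	t0 = time.time()
	frame = preprocess(image)          # CPU side: resize, alignment, optical flow, ...
	t1 = time.time()
	out = model.model_predict(frame)   # GPU side: forward pass
	t2 = time.time()
	print('preprocess: %.3f s, inference: %.3f s' % (t1 - t0, t2 - t1))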
Solution: use CUDA to accelerate the image preprocessing. OpenCV-Python 4.x and above fully supports CUDA for certain specific functions, so you can now call CUDA-backed, GPU-accelerated image processing functions from Python.
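
As an illustration (this requires an OpenCV build compiled with CUDA support; the plain pip wheel does not include it), a sketch of a GPU-accelerated resize:

	import cv2
	
	gpu_img = cv2.cuda_GpuMat()
	gpu_img.upload(image)                               # host -> device
	gpu_resized = cv2.cuda.resize(gpu_img, (224, 224))  # resize on the GPU
	resized = gpu_resized.download()                    # device -> host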

  • TODO: In the next blog post, I will talk about how I use CUDA and the GPU to accelerate image preprocessing. People usually think the GPU is mainly for accelerating deep learning; in fact, some time-consuming image algorithms have CUDA versions as well.

Reference

1. Numba: High-Performance Python with CUDA Acceleration
2. Numba for device management
3. CUDA Device Management
4. C++ calls a Python neural network model: the model was loaded on the GPU but cannot run on the GPU; the CPU runs the model instead.

Origin: blog.csdn.net/qq_32998593/article/details/107465671