TensorFlow GPU training: a detailed tutorial

First, a detailed look at the nvidia-smi command, which displays information about the graphics card:

cmd: nvidia-smi


[Screenshot: nvidia-smi output for a Tesla T4]

GPU: the index of the GPU in this machine (numbering starts from 0 when there are multiple cards). In the screenshot the index is 0

Fan: fan speed (0%-100%); N/A means the card has no fan

Name: GPU model; in the screenshot: Tesla T4

Temp: GPU temperature (if the GPU gets too hot, it will lower its clock frequency)

Perf: the GPU's performance state, from P0 (maximum performance) to P12 (minimum performance); in the screenshot: P0

Persistence-M: persistence mode state. Persistence mode draws more power, but starting a new GPU application takes less time; in the screenshot: Off

Pwr: Usage/Cap: power consumption; Usage is the current draw and Cap is the maximum

Bus-Id: the GPU's bus address, in the form domain:bus:device.function

Disp.A: Display Active, whether a display is initialized on the GPU

Memory-Usage: video memory usage

Volatile GPU-Util: GPU utilization

Uncorr. ECC: whether error checking and correction (ECC) is enabled; 0/disabled, 1/enabled

Compute M.: compute mode; 0/DEFAULT, 1/EXCLUSIVE_PROCESS, 2/PROHIBITED

Processes: for each process, the video memory used, the process ID, and the GPU it occupies

Refresh the status every few seconds: nvidia-smi -l seconds

For example, to refresh every two seconds: nvidia-smi -l 2
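
If you want to poll the same information from Python (say, to log utilization while training), you can call nvidia-smi in its CSV query mode via subprocess. A minimal sketch, assuming nvidia-smi is on the PATH:

import subprocess

# Query index, name, utilization, and memory usage in machine-readable CSV form
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
for line in result.stdout.strip().splitlines():
    print(line)  # e.g. "0, Tesla T4, 35 %, 1024 MiB, 15360 MiB"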

How TensorFlow uses the graphics card

1. Direct use (default)

By default, this approach occupies essentially all of the free video memory on every graphics card in the machine. Note that this means all cards, not just the one in use: the program may need only a single GPU, yet it still reserves the others, whether or not it can actually use them.

import tensorflow as tf
from tensorflow import keras

with tf.compat.v1.Session() as sess:
    # Input images are 224x224 RGB, with 20 classes
    shape, classes = (224, 224, 3), 20
    # Build Keras's ResNet50 model
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Train the model (sparse_categorical_crossentropy expects integer labels)
    model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=20, batch_size=6, verbose=2)
    # Save the trained model to a file
    model.save('resnet_model_dog_n_face.h5')
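
Before training, it is worth checking that TensorFlow actually sees a GPU at all; tf.config.list_physical_devices is the standard TF 2.x call for this:

import tensorflow as tf

# An empty list means TensorFlow will silently fall back to the CPU
gpus = tf.config.list_physical_devices('GPU')
print(gpus)  # e.g. [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]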

2. Allocating a fixed fraction of video memory

This method differs from direct use above in that it does not occupy all of the video memory. For example, written as below, the program occupies 60% of the memory of each graphics card.

from tensorflow.compat.v1 import ConfigProto  # TF 2.x compat-mode import

config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
with tf.compat.v1.Session(config=config) as sess:
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)
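
In native TF 2.x code (without the compat session), the closest equivalent is a hard per-process memory cap set through a logical device configuration. A sketch, assuming a 16 GB card, where 60% is roughly 9830 MB:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Cap this process at about 60% of a 16 GB card; the limit is in MB, not a fraction
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=9830)])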

3. Dynamic allocation

This method requests video memory on demand: it allocates memory only as needed, but it never releases memory back once allocated. And if other programs have already taken all of the remaining video memory, the allocation fails and an error is raised.

Choose among the three methods according to the scenario.

The first occupies all the video memory up front, so as long as the model fits in video memory there is no fragmentation to hurt computing performance. This configuration is well suited to deployed applications.

The second and third suit several people sharing one server. The second can waste video memory when the fraction is larger than the program needs; the third avoids that waste within a single program, but it can easily crash when its next memory request can no longer be satisfied.

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # request video memory on demand
with tf.compat.v1.Session(config=config) as sess:
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)
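
The native TF 2.x equivalent of allow_growth is set_memory_growth, which has to run before any GPU is initialized. A minimal sketch:

import tensorflow as tf

# Must be called before TensorFlow touches the GPUs
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)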

4. Specify the GPU

When running TensorFlow on a server with multiple GPUs, you can specify from Python which GPU the program sees, with the following code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # set before TensorFlow initializes CUDA
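
Alternatively, an already-running TF 2.x program can hide cards through set_visible_devices. A sketch, assuming the machine has at least three GPUs:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 2:
    # Expose only the third card (index 2) to this process
    tf.config.set_visible_devices(gpus[2], 'GPU')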

Finally, a complete example, ResNet50 image classification:


config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # request video memory on demand
with tf.compat.v1.Session(config=config) as sess:
    # Input images are 224x224 RGB, with 20 classes
    shape, classes = (224, 224, 3), 20
    # Build Keras's ResNet50 model
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Train the model (sparse_categorical_crossentropy expects integer labels)
    model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=20, batch_size=6, verbose=2)
    # Save the trained model to a file
    model.save('resnet_model_dog_n_face.h5')
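
To run this example end to end without a real dataset, you can substitute random arrays for train_x/train_y and test_x/test_y (defined before calling model.fit). These are hypothetical placeholders, just to verify that the pipeline executes:

import numpy as np

# Hypothetical placeholder data: random 224x224 RGB images with integer labels in [0, 20)
train_x = np.random.rand(60, 224, 224, 3).astype("float32")
train_y = np.random.randint(0, 20, size=(60,))
test_x = np.random.rand(12, 224, 224, 3).astype("float32")
test_y = np.random.randint(0, 20, size=(12,))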


Origin blog.csdn.net/qiqi_ai_/article/details/128950971