[GCU experience] Running the ResNet50 model with PyTorch on the GCU and testing GCU performance

1. Environment

Platform: the Qizhi (OpenI) community, https://openi.pcl.ac.cn/

2. Accelerator card introduction

The Yunsui T20 is a second-generation AI training accelerator card for data centers, built on the Suisi 2.0 chip. It offers broad model coverage, strong performance, and an open software ecosystem, and supports a wide range of AI training scenarios. It also scales flexibly and underpins industry-leading AI compute cluster solutions.

Advantages

  • High-precision training backed by abundant computing power
  • Computing-power scaling over dedicated interconnect channels
  • Broad ecosystem support
  • Open and efficient development tools

3. Code repository

https://openi.pcl.ac.cn/Enflame/GCU_Pytorch1.10.0_Example

4. Model + Dataset

ResNet50 + imagenet_raw

5. Running results

Single card, 1 epoch

    "model": "resnet50",
    "local_rank": 0,
    "batch_size": 256,
    "epochs": 1,
    "training_step_per_epoch": -1,
    "eval_step_per_epoch": -1,
    "acc1": 6.467013835906982,
    "acc5": 20.52951431274414,
    "device": "dtu",
    "skip_steps": 2,
    "train_fps_mean": 706.7805865954374,
    "train_fps_min": 668.1171056579481,
    "train_fps_max": 755.529550208285,
    "training_time": "0:12:27"

fps_mean: 706.78
acc1: 6.47
Elapsed time: 12 minutes 27 seconds

8 cards, 1 epoch

    "model": "resnet50",
    "local_rank": 5,
    "batch_size": 256,
    "epochs": 1,
    "training_step_per_epoch": -1,
    "eval_step_per_epoch": -1,
    "acc1": 3.02734375,
    "acc5": 12.5,
    "device": "dtu",
    "skip_steps": 2,
    "train_fps_mean": 704.4055937610347,
    "train_fps_min": 702.2026238348252,
    "train_fps_max": 706.744240295003,
    "training_time": "0:07:04"

fps_mean: 704.41
acc1: 3.03
Elapsed time: 7 minutes 4 seconds
Linearity: 99.72%

Single card, 100 epochs

    "model": "resnet50",
    "local_rank": 0,
    "batch_size": 64,
    "epochs": 100,
    "training_step_per_epoch": -1,
    "eval_step_per_epoch": -1,
    "acc1": 87.13941955566406,
    "acc5": 97.31570434570312,
    "device": "dtu",
    "skip_steps": 2,
    "train_fps_mean": 488.19604076742735,
    "train_fps_min": 249.3976374646114,
    "train_fps_max": 568.624496005538,
    "training_time": "4:45:13"

fps_mean: 488.20
acc1: 87.14
Elapsed time: 4 hours 45 minutes 13 seconds

8 cards, 100 epochs

    "model": "resnet50",
    "local_rank": 0,
    "batch_size": 64,
    "epochs": 100,
    "training_step_per_epoch": -1,
    "eval_step_per_epoch": -1,
    "acc1": 82.2265625,
    "acc5": 96.875,
    "device": "dtu",
    "skip_steps": 2,
    "train_fps_mean": 481.25022732778297,
    "train_fps_min": 267.4726081053424,
    "train_fps_max": 509.6326762775301,
    "training_time": "1:18:22"

fps_mean: 481.25
acc1: 82.22
Elapsed time: 1 hour 18 minutes 22 seconds
Linearity: 98.58%
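
The linearity figures appear to be the ratio of per-card throughput on 8 cards to single-card throughput. A quick check against the reported train_fps_mean values (my own reconstruction, not the platform's documented formula) reproduces the 100-epoch figure exactly:

    # Reconstructing "linearity" (scaling efficiency) from the reported
    # train_fps_mean values, assuming it is per-card fps on 8 cards
    # divided by fps on 1 card.

    def linearity(multi_card_fps: float, single_card_fps: float) -> float:
        return multi_card_fps / single_card_fps * 100

    # 100-epoch runs: matches the reported 98.58%
    print(f"{linearity(481.25022732778297, 488.19604076742735):.2f}%")  # 98.58%

    # single-epoch runs: ~99.7%, close to the reported 99.72%
    print(f"{linearity(704.4055937610347, 706.7805865954374):.2f}%")    # 99.66%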

6. Example of code migration

https://openi.pcl.ac.cn/OpenIOSSG/MNIST_PytorchExample_GCU/src/branch/master/train_for_c2net.py
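
The linked example shows the typical migration pattern: import Enflame's PyTorch GCU plugin, obtain the GCU device, and move the model and tensors to it with .to(device), just as with CUDA. Below is a minimal sketch of that pattern; the calls torch_gcu.gcu_device() and torch_gcu.optimizer_step() are my reading of the linked example, so verify the exact names against the repository before use.

    import torch
    import torch.nn as nn
    import torch_gcu  # Enflame's PyTorch GCU plugin (assumed preinstalled in the platform image)

    # was: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = torch_gcu.gcu_device()

    model = nn.Linear(10, 2).to(device)   # models move with .to(device), as on GPU
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(8, 10).to(device)     # input tensors move the same way
    y = torch.randint(0, 2, (8,)).to(device)

    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # On GCU, the plain optimizer.step() is replaced by a plugin call
    # (assumption based on the linked train_for_c2net.py):
    torch_gcu.optimizer_step(optimizer, model)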

7. Suggestions

Experience
By reviewing the code examples, you can quickly grasp how to migrate code from CPU/GPU to GCU. Besides running the code provided by Enflame (Suiyuan Technology), I also tried migrating the PyTorch code from Mu Li's d2l course to the GCU a while ago; on the whole, most of it migrated smoothly. Some torch-based notebook code I had run before could likewise be adapted to the GCU following the example, and it ran successfully.
The only problem I encountered is that a run sometimes prints a long stream of compilation messages. I am not sure what causes this, and in those cases the run usually takes somewhat longer than on a GPU; perhaps something in my code triggers it, but it does not happen often.
Overall the GCU's training speed feels good, and it is very convenient to run the demo code by following the README.

Suggestions

  1. The overall speed of the GCU is quite fast. When I have time later, I will compare its speed and accuracy against CPU and GPU platforms.
  2. The current training scripts produce no progress output. Log output can be added by editing the .py files, but I would suggest a tutorial with example code showing beginners how to add logging, since they may not know what to modify; a generic sketch follows this list.
  3. Could the GCU platform support more frameworks in the future, such as TensorFlow and MindSpore?
  4. The one inconvenience is that changing the demo's hyperparameters requires finding and editing the corresponding .sh script. It would be more convenient if hyperparameters could be set directly when creating a task.
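
For suggestion 2, here is a minimal, generic sketch of the kind of logging code such a tutorial could show. It is not taken from the GCU scripts; the names (epoch, step, loss value, batch size) are placeholders for whatever the actual training loop uses.

    import logging
    import time

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logger = logging.getLogger("train")

    LOG_EVERY = 50  # emit one line every 50 training steps

    def log_step(epoch, step, loss_value, batch_size, step_start):
        # Call inside the training loop after each step; step_start is the
        # time.time() value taken just before the step ran.
        if step % LOG_EVERY == 0:
            fps = batch_size / max(time.time() - step_start, 1e-9)
            logger.info("epoch %d step %d loss %.4f fps %.1f",
                        epoch, step, loss_value, fps)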

Source: blog.csdn.net/yichao_ding/article/details/130081018