AX620A running a self-trained yolov5s model: whole-process record (Windows)


Foreword

The previous article, "Ubuntu22.04 Builds AX620A Official Routine Development Environment", recorded how to set up the AX620A development environment. In the meantime the board finally arrived, so I tried running the yolov5 model I trained myself and recorded the whole process.

1. Build a model conversion environment (using GPU acceleration)

  The model conversion tool is called the Pulsar toolchain, and an official Docker container environment is provided for it. Since I did not want to set up a physical Ubuntu machine, I had to find a way to do everything under Windows.

  1. Install the Docker Desktop client for Windows
    Download address: Click here
    The installation is straightforward; there are no special steps.
    After installation, open Docker Desktop and let it initialize and start, then open cmd and enter the following command to pull the toolchain image:
docker pull sipeed/pulsar
  2. After the image downloads successfully, you should be able to see it in the Images list of Docker Desktop.
  3. At this point I could have started the container and converted the model right away, but then I noticed that the toolchain supports GPU acceleration, while the image does not ship with the corresponding CUDA environment, so this part took quite a bit of extra work.
    The idea is: install WSL2, install an Ubuntu subsystem, install the NVIDIA CUDA toolkit in the subsystem, and then install nvidia-docker. After that, any container started from Windows cmd with the --gpus all flag will have a complete CUDA environment available.
  • Install Ubuntu 20.04 with WSL2
    Forgive my ignorance: until then I had no idea Windows had such a convenient feature; I had always used virtual machines before. Search Baidu for how to enable WSL2; my Windows 10 is updated to the latest version and I had no problems.

(1) View the list of Linux distributions supported by the Windows platform

wsl --list --online
The following is a list of valid distributions that can be installed.
Install using "wsl --install -d <Distro>".

NAME            FRIENDLY NAME
Ubuntu          Ubuntu
Debian          Debian GNU/Linux
kali-linux      Kali Linux Rolling
openSUSE-42     openSUSE Leap 42
SLES-12         SUSE Linux Enterprise Server v12
Ubuntu-16.04    Ubuntu 16.04 LTS
Ubuntu-18.04    Ubuntu 18.04 LTS
Ubuntu-20.04    Ubuntu 20.04 LTS

(2) Set WSL2 as the default version for installed distributions

$ wsl --set-default-version 2

(3) Install Ubuntu 20.04

$ wsl --install -d  Ubuntu-20.04

(4) Set the default subsystem to Ubuntu 20.04

$ wsl --setdefault Ubuntu-20.04

(5) View subsystem information; the asterisk marks the default system

$ wsl -l -v
  NAME                   STATE           VERSION
* Ubuntu-20.04           Stopped         2
  docker-desktop-data    Running         2
  docker-desktop         Running         2
  • Install CUDA in Ubuntu 20.04
    Enter the following commands in the Ubuntu subsystem terminal under Windows and execute them one by one:
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin

sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb

sudo dpkg -i cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb

sudo cp /var/cuda-repo-wsl-ubuntu-11-7-local/cuda-B81839D3-keyring.gpg /usr/share/keyrings/

sudo apt-get update

sudo apt-get -y install cuda
  • Install nvidia-docker in Ubuntu 20.04
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

sudo apt-get install -y nvidia-docker2

At this point the CUDA-related environment is installed. You can run nvidia-smi in the Ubuntu terminal to confirm that the graphics card is recognized.

  • Start the container
    In the Windows cmd terminal, enter the following command to start the pulsar container (note that --name must come before the image name):
docker run -it --net host --gpus all --shm-size 12g --name ax620 -v D:\ax620_data:/data sipeed/pulsar

Where:
--gpus all lets the container use the GPU.
--shm-size 12g sets the shared memory available to the container; choose the value according to your actual RAM. 12 GB caused no problems when converting the yolov5s model.
--name ax620 names the container, so that later you can simply enter docker start ax620 && docker attach ax620 in cmd to get back into the container environment.
-v D:\ax620_data:/data maps the specified local path to the container's /data directory. This folder holds the project files used for model conversion, which can be downloaded here.

2. Convert the self-trained yolov5 model

  1. I used the latest yolov5 v6.2 release. After training the .pt model file following the standard process, export it to ONNX format with export.py:
python export.py --weights yourModel.pt --simplify --include onnx
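(Optional) Before editing the graph, it can be worth sanity-checking the exported file. A minimal sketch, assuming the onnx Python package is installed, the same model path used in the extraction script below, and the default 640x640 export size:
import onnx

# Hypothetical path to the exported model; adjust to your own file location
model = onnx.load(r"D:\ax620_data\model\yourModel.onnx")
onnx.checker.check_model(model)  # raises an exception if the graph is invalid

# Print the network input; for a default yolov5s export this is "images", 1x3x640x640
for inp in model.graph.input:
    dims = [d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)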
  2. The post-processing of the three ONNX output layers should be removed, because that part will be handled in the AX620 program instead. You can refer to this article: "Aixin Yuanzhi AX620A Deployment yolov5 6.0 Model Record". The Python code for this step is as follows:
import onnx

input_path = r"D:\ax620_data\model\yourModel.onnx"
output_path = r"D:\ax620_data\model\yourModel_rip.onnx"
input_names = ["images"]
output_names = ["onnx::Reshape_326","onnx::Reshape_364","onnx::Reshape_402"]

onnx.utils.extract_model(input_path, output_path, input_names, output_names)

Note that the three layer names onnx::Reshape_326, onnx::Reshape_364 and onnx::Reshape_402 may differ in your model; use the netron tool to confirm them. The extracted model now ends at those three tensors. Note: my last output shape here is 1x18x20x20, because my model has only one class; each anchor predicts (4 + 1 + nc) channels, so with nc = 1 and 3 anchors the output has (4 + 1 + 1) * 3 = 18 channels (with the 80 COCO classes it would be 255).
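If netron is not at hand, the candidate names can also be listed with a few lines of Python. A minimal sketch, assuming the same model path as the script above and that the tensors feeding the detection head's Reshape nodes are the ones to keep:
import onnx

model = onnx.load(r"D:\ax620_data\model\yourModel.onnx")

# Print the input tensor name of every Reshape node; the three at the end of
# the detection head are the output names to pass to extract_model above.
for node in model.graph.node:
    if node.op_type == "Reshape":
        print(node.name, "<-", node.input[0])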

  3. Enter the pulsar container environment to convert the model (onnx -> joint):
pulsar build --input model/yourModel_rip.onnx --output model/yourModel.joint --config config/yolov5s.prototxt --output_config config/output_config.prototxt

Where:

  • yolov5s.prototxt is the conversion configuration file, which needs to be downloaded from Aixin Yuanzhi's Baidu network disk: download here
  • Randomly select 1,000 images from your training set, package them as coco_1000.tar, and put the archive in the container's /data/dataset directory for quantization calibration (one way to build the archive is sketched after this list).
    Model conversion is relatively slow; even with GPU acceleration it takes several minutes. Finally, your joint model file will be generated in the container's /data/model directory.
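One way to build the coco_1000.tar calibration archive mentioned above from the Windows side. A minimal sketch, assuming your training images live in a hypothetical D:\datasets\train\images folder and are JPEG files:
import random
import tarfile
from pathlib import Path

# Hypothetical folder holding the training images; adjust to your dataset layout
src = Path(r"D:\datasets\train\images")
picked = random.sample(sorted(src.glob("*.jpg")), 1000)

# D:\ax620_data is mapped to /data in the container, so the archive ends up
# at /data/dataset/coco_1000.tar
with tarfile.open(r"D:\ax620_data\dataset\coco_1000.tar", "w") as tar:
    for img in picked:
        tar.add(img, arcname=img.name)  # store the files flat inside the archive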
  4. Model simulation and comparison: check the accuracy difference between the converted model and the original ONNX. As the output below shows, the cosine similarity of the three outputs is very high, so the conversion error is very small.
pulsar run model/yourModel_rip.onnx  model/yourModel.joint --input images/test.jpg --config config/output_config.prototxt --output_gt gt/
 
...(run output omitted)
[<frozen super_pulsar.func_wrappers.pulsar_run.compare>:82] Score compare table:
------------------------  ----------------  ------------------
Layer: onnx::Reshape_326  2-norm RE: 4.08%  cosine-sim: 0.9994
Layer: onnx::Reshape_364  2-norm RE: 4.49%  cosine-sim: 0.9991
Layer: onnx::Reshape_402  2-norm RE: 6.77%  cosine-sim: 0.9981
------------------------  ----------------  ------------------

3. Board test

  1. Build your own test program from the AXERA-TECH/ax-samples repository.
    Because this is a self-trained model with only 1 class, while the official yolov5 sample program uses the 80 classes of the COCO dataset, a few adjustments are required.
  • In the examples directory, copy ax_yolov5s_steps.cc and save it as ax_yolov5s_my.cc
  • Modify ax-samples/examples/base/detection.hpp and add a generate_proposals_n function, where cls_num is the number of model classes:
static void generate_proposals_n(int cls_num, int stride, const float* feat, float prob_threshold, std::vector<Object>& objects,
                                 int letterbox_cols, int letterbox_rows, const float* anchors, float prob_threshold_unsigmoid)
{
    int anchor_num = 3;
    int feat_w = letterbox_cols / stride;
    int feat_h = letterbox_rows / stride;
    // int cls_num = 80;
    int anchor_group;
    if (stride == 8)
        anchor_group = 1;
    if (stride == 16)
        anchor_group = 2;
    if (stride == 32)
        anchor_group = 3;

    auto feature_ptr = feat;

    for (int h = 0; h <= feat_h - 1; h++)
    {
        for (int w = 0; w <= feat_w - 1; w++)
        {
            for (int a = 0; a <= anchor_num - 1; a++)
            {
                if (feature_ptr[4] < prob_threshold_unsigmoid)
                {
                    feature_ptr += (cls_num + 5);
                    continue;
                }

                //process cls score
                int class_index = 0;
                float class_score = -FLT_MAX;
                for (int s = 0; s <= cls_num - 1; s++)
                {
                    float score = feature_ptr[s + 5];
                    if (score > class_score)
                    {
                        class_index = s;
                        class_score = score;
                    }
                }
                //process box score
                float box_score = feature_ptr[4];
                float final_score = sigmoid(box_score) * sigmoid(class_score);

                if (final_score >= prob_threshold)
                {
                    float dx = sigmoid(feature_ptr[0]);
                    float dy = sigmoid(feature_ptr[1]);
                    float dw = sigmoid(feature_ptr[2]);
                    float dh = sigmoid(feature_ptr[3]);
                    float pred_cx = (dx * 2.0f - 0.5f + w) * stride;
                    float pred_cy = (dy * 2.0f - 0.5f + h) * stride;
                    float anchor_w = anchors[(anchor_group - 1) * 6 + a * 2 + 0];
                    float anchor_h = anchors[(anchor_group - 1) * 6 + a * 2 + 1];
                    float pred_w = dw * dw * 4.0f * anchor_w;
                    float pred_h = dh * dh * 4.0f * anchor_h;
                    float x0 = pred_cx - pred_w * 0.5f;
                    float y0 = pred_cy - pred_h * 0.5f;
                    float x1 = pred_cx + pred_w * 0.5f;
                    float y1 = pred_cy + pred_h * 0.5f;

                    Object obj;
                    obj.rect.x = x0;
                    obj.rect.y = y0;
                    obj.rect.width = x1 - x0;
                    obj.rect.height = y1 - y0;
                    obj.label = class_index;
                    obj.prob = final_score;
                    objects.push_back(obj);
                }

                feature_ptr += (cls_num + 5);
            }
        }
    }
}

  • Modify ax_yolov5s_my.cc. Around line 44, change the class names to match your model:
const char* CLASS_NAMES[] = {
    "piggy"};		// change the model class names

Around line 252, decode the prediction boxes with the new generate_proposals_n:
for (uint32_t i = 0; i < io_info->nOutputSize; ++i)
        {
            auto& output = io_info->pOutputs[i];
            auto& info = joint_io_arr.pOutputs[i];

            auto ptr = (float*)info.pVirAddr;

            // output index 0/1/2 corresponds to stride 8/16/32
            int32_t stride = (1 << i) * 8;
            // det::generate_proposals_255(stride, ptr, PROB_THRESHOLD, proposals, input_w, input_h, ANCHORS, prob_threshold_unsigmoid);
            // decode the prediction boxes with the new generate_proposals_n (cls_num = 1)
            det::generate_proposals_n(1, stride, ptr, PROB_THRESHOLD, proposals, input_w, input_h, ANCHORS, prob_threshold_unsigmoid);
        }
  • Modify the ax-samples/examples/CMakeLists.txt file and add the new example under line 101:
      axera_example (ax_crnn                  ax_crnn_steps.cc)
      axera_example (ax_yolov5_my             ax_yolov5s_my.cc)   # add the new source file to the build
else() # ax630a support
      axera_example (ax_classification        ax_classification_steps.cc)
      axera_example (ax_yolov5s               ax_yolov5s_steps.cc)
      axera_example (ax_yolo_fastest          ax_yolo_fastest_steps.cc)
      axera_example (ax_yolov3                ax_yolov3_steps.cc)
endif()
  • Cross-compile the project, or compile directly on the board, to generate the ax_yolov5_my executable.
  2. Copy the program and the joint model file to the same directory on the board and execute:
 ./ax_yolov5_my -m yourModel.joint -i /home/images/test.jpg

--------------------------------------
[INFO]: Virtual npu mode is 1_1

Tools version: 0.6.1.14
4111370
run over: output len 3
--------------------------------------
Create handle took 470.84 ms (neu 21.46 ms, axe 0.00 ms, overhead 449.38 ms)
--------------------------------------
Repeat 1 times, avg time 25.28 ms, max_time 25.28 ms, min_time 25.28 ms
--------------------------------------
detection num: 6
 0:  97%, [ 198,   97,  437,  328], piggy
 0:  96%, [  35,  190,  260,  377], piggy
 0:  95%, [ 575,  166,  811,  478], piggy
 0:  93%, [ 526,   47,  630,  228], piggy
 0:  91%, [ 191,   55,  301,  197], piggy
 0:  76%, [   0,  223,   57,  324], piggy
[AX_SYS_LOG] Waiting thread(2971660832) to exit
[AX_SYS_LOG] AX_Log2ConsoleRoutine terminated!!!
exit[AX_SYS_LOG] join thread(2971660832) ret:0

Even with half of the computing power allocated to the ISP (virtual NPU mode 1_1), the single-image inference time of yolov5s reaches about 25 ms, which is quite impressive!

Origin: blog.csdn.net/flamebox/article/details/127249243