Running a self-trained yolov5s model on the AX620A: full process record (Windows)
Foreword
The previous article, "Ubuntu 22.04 Builds the AX620A Official Routine Development Environment", recorded setting up the AX620A development environment. In the meantime the board finally arrived, so I tried running the yolov5 model I trained myself and recorded the whole process.
1. Build a model conversion environment (using GPU acceleration)
The model conversion tool is called the Pulsar toolchain, and an official docker container environment is provided. Since I was too lazy to set up a physical Ubuntu machine, I had to find a way to do everything under Windows~
- Install the Docker-Desktop client for the Windows platform. Download address: Click here. Installation is brainless; there are no special steps.
After installation, open Docker-Desktop and the program will initialize and start. Then open cmd and enter the command to pull the toolchain image:
docker pull sipeed/pulsar
- After the image downloads successfully, you should be able to see it in the Images list of Docker-Desktop.
- Originally I could have started the container and converted the model right here, but then I saw that the toolchain supports gpu acceleration, while the corresponding cuda environment is not installed in the image, which caused quite a bit of trouble.
The idea is: install WSL2, install the Ubuntu subsystem, install the NVIDIA CUDA toolkit in the subsystem, then install nvidia-docker. After this, any container started under Windows cmd (with the "--gpus all" parameter added) will have a complete cuda environment.
- Use WSL2 to install Ubuntu 20.04. Forgive my ignorance: until then I didn't know such a convenient thing existed under Windows; I used to only install virtual machines... For how to enable WSL2, just search for it. In any case, my Win10 is updated to the latest version and had no problems.
(1) View the list of Linux distributions supported on the Windows platform
wsl --list --online
The following is a list of valid distributions that can be installed.
Install using 'wsl --install -d <Distro>'.
NAME FRIENDLY NAME
Ubuntu Ubuntu
Debian Debian GNU/Linux
kali-linux Kali Linux Rolling
openSUSE-42 openSUSE Leap 42
SLES-12 SUSE Linux Enterprise Server v12
Ubuntu-16.04 Ubuntu 16.04 LTS
Ubuntu-18.04 Ubuntu 18.04 LTS
Ubuntu-20.04 Ubuntu 20.04 LTS
(2) Set newly installed distributions to use WSL2 by default
$ wsl --set-default-version 2
(3) Install Ubuntu20.04
$ wsl --install -d Ubuntu-20.04
(4) Set the default subsystem to Ubuntu20.04
$ wsl --setdefault Ubuntu-20.04
(5) View subsystem information; the asterisk marks the default system
$ wsl -l -v
NAME STATE VERSION
* Ubuntu-20.04 Stopped 2
docker-desktop-data Running 2
docker-desktop Running 2
- Install cuda in the Ubuntu subsystem
Enter the following commands one by one in the Ubuntu subsystem terminal under Windows:
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-11-7-local/cuda-B81839D3-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
- Install nvidia-docker in the Ubuntu subsystem
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
At this point the cuda environment is installed. You can run nvidia-smi in the Ubuntu terminal to confirm that the graphics card is recognized.
- Start the container
In the cmd terminal of windows, enter the following command to start the pulsar container
docker run -it --net host --gpus all --shm-size 12g --name ax620 -v D:\ax620_data:/data sipeed/pulsar
Among them:
--gpus all lets the container use the gpu.
--shm-size 12g specifies the shared memory available to the container; choose the value according to your actual memory size. This setting gave me no problems converting the yolov5s model.
--name ax620 specifies the container name, so later you can simply enter docker start ax620 && docker attach ax620 in cmd to get back into the container environment.
-v D:\ax620_data:/data maps the specified path on the local hard disk to the /data directory of the container. This folder holds the projects used for model conversion, which can be downloaded here.
2. Convert yolov5 self-training model
- Yolov5 uses the latest v6.2 release. After training the pt model file through the standard process, use export.py to export it to onnx format:
python export.py --weights yourModel.pt --simplify --include onnx
- The post-processing of the three onnx output layers should be stripped, because that part will be handled in the ax620 program. You can refer to this article: "Axera AX620A Deploying the yolov5 6.0 Model". The python code for the processing is as follows:
import onnx
input_path = r"D:\ax620_data\model\yourModel.onnx"
output_path = r"D:\ax620_data\model\yourModel_rip.onnx"
input_names = ["images"]
output_names = ["onnx::Reshape_326","onnx::Reshape_364","onnx::Reshape_402"]
onnx.utils.extract_model(input_path, output_path, input_names, output_names)
Note that the names of the three tensors Reshape_326, Reshape_364, and Reshape_402 may differ in your model; use the netron tool to confirm them. A model with the following endings is obtained. Note: my last output shape here is 1x18x20x20, because my model has only one class: (4+1+1)*3=18 channels.
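The channel-count formula above generalizes to any class count. A quick sanity check in plain Python (the function name is my own, for illustration only):

```python
def head_channels(num_classes, num_anchors=3):
    # yolov5 detect head layout per anchor: x, y, w, h, objectness + one score per class
    return (5 + num_classes) * num_anchors

print(head_channels(1))   # 18 for a single-class model
print(head_channels(80))  # 255 for the 80-class coco model
```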
- Enter the pulsar container environment to do the model conversion, onnx -> joint:
pulsar build --input model/yourModel_rip.onnx --output model/yourModel.joint --config config/yolov5s.prototxt --output_config config/output_config.prototxt
Among them, the yolov5s.prototxt configuration file needs to be downloaded from Axera's Baidu network disk: download here.
- Randomly select 1,000 images from your training set, package them as coco_1000.tar, and put them in the container's /data/dataset directory for quantization calibration.
Model conversion is relatively slow; even with GPU acceleration it takes several minutes. In the end, your joint model file will be generated in the container's /data/model directory.
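The random selection and packaging of the calibration set described above can be scripted; here is a minimal sketch (the directory paths and function name are placeholders of my own):

```python
import random
import tarfile
from pathlib import Path

def pack_calibration_set(image_dir, out_tar, count=1000, seed=0):
    """Randomly pick `count` images and pack them into a flat tar archive."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.seed(seed)
    picked = random.sample(images, min(count, len(images)))
    with tarfile.open(out_tar, "w") as tar:
        for img in picked:
            tar.add(img, arcname=img.name)  # store without the directory prefix
    return len(picked)

# pack_calibration_set("datasets/train/images", "coco_1000.tar")
```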
- Model simulation and bisection: test the accuracy difference between the converted model and the original onnx. You can see that the cosine similarity of the three outputs is very high, and the accuracy error is very small.
pulsar run model/yourModel_rip.onnx model/yourModel.joint --input images/test.jpg --config config/output_config.prototxt --output_gt gt/
... (run output omitted)
[<frozen super_pulsar.func_wrappers.pulsar_run.compare>:82] Score compare table:
------------------------ ---------------- ------------------
Layer: onnx::Reshape_326 2-norm RE: 4.08% cosine-sim: 0.9994
Layer: onnx::Reshape_364 2-norm RE: 4.49% cosine-sim: 0.9991
Layer: onnx::Reshape_402 2-norm RE: 6.77% cosine-sim: 0.9981
------------------------ ---------------- ------------------
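The two metrics in the table can be reproduced on flattened output tensors. This is my own illustration of the formulas, not pulsar's actual code:

```python
import math

def compare_outputs(ref, test):
    """2-norm relative error and cosine similarity between two flat tensors."""
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    diff = [a - b for a, b in zip(ref, test)]
    rel_err = norm(diff) / norm(ref)
    dot = sum(a * b for a, b in zip(ref, test))
    cos_sim = dot / (norm(ref) * norm(test))
    return rel_err, cos_sim

# identical tensors: relative error 0, cosine similarity ~1
print(compare_outputs([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```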
3. Board test
- Build your own test program from the AXERA-TECH/ax-samples repository.
Because this is a self-trained model with only 1 class, while the official yolov5 test program uses the 80 classes of the coco dataset, a little adjustment is required.
- In the examples directory, copy ax_yolov5s_steps.cc and save it as ax_yolov5s_my.cc
- Modify ax-samples/examples/base/detection.hpp and add a generate_proposals_n function, where cls_num is the number of model classes:
static void generate_proposals_n(int cls_num, int stride, const float* feat, float prob_threshold, std::vector<Object>& objects,
                                 int letterbox_cols, int letterbox_rows, const float* anchors, float prob_threshold_unsigmoid)
{
    int anchor_num = 3;
    int feat_w = letterbox_cols / stride;
    int feat_h = letterbox_rows / stride;
    // int cls_num = 80;
    int anchor_group;
    if (stride == 8)
        anchor_group = 1;
    if (stride == 16)
        anchor_group = 2;
    if (stride == 32)
        anchor_group = 3;
    auto feature_ptr = feat;
    for (int h = 0; h <= feat_h - 1; h++)
    {
        for (int w = 0; w <= feat_w - 1; w++)
        {
            for (int a = 0; a <= anchor_num - 1; a++)
            {
                if (feature_ptr[4] < prob_threshold_unsigmoid)
                {
                    feature_ptr += (cls_num + 5);
                    continue;
                }
                // process cls score
                int class_index = 0;
                float class_score = -FLT_MAX;
                for (int s = 0; s <= cls_num - 1; s++)
                {
                    float score = feature_ptr[s + 5];
                    if (score > class_score)
                    {
                        class_index = s;
                        class_score = score;
                    }
                }
                // process box score
                float box_score = feature_ptr[4];
                float final_score = sigmoid(box_score) * sigmoid(class_score);
                if (final_score >= prob_threshold)
                {
                    float dx = sigmoid(feature_ptr[0]);
                    float dy = sigmoid(feature_ptr[1]);
                    float dw = sigmoid(feature_ptr[2]);
                    float dh = sigmoid(feature_ptr[3]);
                    float pred_cx = (dx * 2.0f - 0.5f + w) * stride;
                    float pred_cy = (dy * 2.0f - 0.5f + h) * stride;
                    float anchor_w = anchors[(anchor_group - 1) * 6 + a * 2 + 0];
                    float anchor_h = anchors[(anchor_group - 1) * 6 + a * 2 + 1];
                    float pred_w = dw * dw * 4.0f * anchor_w;
                    float pred_h = dh * dh * 4.0f * anchor_h;
                    float x0 = pred_cx - pred_w * 0.5f;
                    float y0 = pred_cy - pred_h * 0.5f;
                    float x1 = pred_cx + pred_w * 0.5f;
                    float y1 = pred_cy + pred_h * 0.5f;
                    Object obj;
                    obj.rect.x = x0;
                    obj.rect.y = y0;
                    obj.rect.width = x1 - x0;
                    obj.rect.height = y1 - y0;
                    obj.label = class_index;
                    obj.prob = final_score;
                    objects.push_back(obj);
                }
                feature_ptr += (cls_num + 5);
            }
        }
    }
}
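The box decode inside the loop follows the standard yolov5 formulas. Restated in Python for one prediction (my own re-statement for clarity, not code from ax-samples):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(raw, grid_x, grid_y, stride, anchor_w, anchor_h):
    """Decode one raw (dx, dy, dw, dh) prediction into a center-size box."""
    dx, dy, dw, dh = (sigmoid(v) for v in raw)
    cx = (dx * 2.0 - 0.5 + grid_x) * stride
    cy = (dy * 2.0 - 0.5 + grid_y) * stride
    w = (dw * 2.0) ** 2 * anchor_w  # same as dw * dw * 4 * anchor_w in the C++ above
    h = (dh * 2.0) ** 2 * anchor_h
    return cx, cy, w, h

# raw zeros: sigmoid gives 0.5, so the center lands exactly on the grid cell
# and the width/height equal the anchor size
print(decode_box((0.0, 0.0, 0.0, 0.0), grid_x=3, grid_y=4, stride=8, anchor_w=10, anchor_h=13))
# -> (28.0, 36.0, 10.0, 13.0)
```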
- Revise ax_yolov5s_my.cc.
Line 44, change the model class names:
const char* CLASS_NAMES[] = {"piggy"};
Line 252, decode the prediction boxes with the new generate_proposals_n:
for (uint32_t i = 0; i < io_info->nOutputSize; ++i)
{
    auto& output = io_info->pOutputs[i];
    auto& info = joint_io_arr.pOutputs[i];
    auto ptr = (float*)info.pVirAddr;
    int32_t stride = (1 << i) * 8;
    // det::generate_proposals_255(stride, ptr, PROB_THRESHOLD, proposals, input_w, input_h, ANCHORS, prob_threshold_unsigmoid);
    det::generate_proposals_n(1, stride, ptr, PROB_THRESHOLD, proposals, input_w, input_h, ANCHORS, prob_threshold_unsigmoid);
}
- Modify the ax-samples/examples/CMakeLists.txt file and add a line under line 101:
Line 101: axera_example (ax_crnn ax_crnn_steps.cc)
axera_example (ax_yolov5_my ax_yolov5s_my.cc) # add the new source file to the build
else() # ax630a support
axera_example (ax_classification ax_classification_steps.cc)
axera_example (ax_yolov5s ax_yolov5s_steps.cc)
axera_example (ax_yolo_fastest ax_yolo_fastest_steps.cc)
axera_example (ax_yolov3 ax_yolov3_steps.cc)
endif()
- Cross-compile the project, or compile directly on the board, to generate the ax_yolov5_my executable.
- Copy the program and joint model files to the same directory on the board and execute:
./ax_yolov5_my -m yourModel.joint -i /home/images/test.jpg
--------------------------------------
[INFO]: Virtual npu mode is 1_1
Tools version: 0.6.1.14
4111370
run over: output len 3
--------------------------------------
Create handle took 470.84 ms (neu 21.46 ms, axe 0.00 ms, overhead 449.38 ms)
--------------------------------------
Repeat 1 times, avg time 25.28 ms, max_time 25.28 ms, min_time 25.28 ms
--------------------------------------
detection num: 6
0: 97%, [ 198, 97, 437, 328], piggy
0: 96%, [ 35, 190, 260, 377], piggy
0: 95%, [ 575, 166, 811, 478], piggy
0: 93%, [ 526, 47, 630, 228], piggy
0: 91%, [ 191, 55, 301, 197], piggy
0: 76%, [ 0, 223, 57, 324], piggy
[AX_SYS_LOG] Waiting thread(2971660832) to exit
[AX_SYS_LOG] AX_Log2ConsoleRoutine terminated!!!
exit[AX_SYS_LOG] join thread(2971660832) ret:0
As you can see, even with half of the computing power allocated to the isp, single-image inference with yolov5s takes only about 25 ms, which is quite impressive!
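For reference, converting the reported average latency into throughput (model inference only, excluding any pre/post-processing):

```python
avg_time_ms = 25.28  # avg time from the board test output above
fps = 1000.0 / avg_time_ms
print(round(fps, 1))  # -> 39.6 frames per second
```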