[RKNN] YOLOv5: PyTorch and ONNX model outputs are inconsistent after pytorch2onnx, and accuracy is reduced.

After converting a model trained with YOLOv5 to ONNX and then to RKNN, testing showed:

  1. The RKNN models, both quantized and unquantized, test with lower accuracy than the PyTorch model.
  2. The ONNX model's test accuracy is also lower than the PyTorch model's, and it is close to the RKNN models' accuracy.

Given these results, and since the ONNX model is the upstream of the RKNN model, something must already go wrong at the ONNX step. So I examined the PyTorch-to-ONNX stage and found that accuracy is indeed reduced by the conversion.

This article records that process. Please feel free to offer suggestions on the problems raised here; after all, the problems have only just been identified, and some of them remain open.

1. Converting PyTorch to ONNX: torch.onnx.export

In YOLOv5's export.py, inside def export_onnx(), add the following code to check whether the exported ONNX model produces the same outputs as the PyTorch model. The code is as follows:

torch.onnx.export(
    model.cpu() if dynamic else model,  # --dynamic only compatible with cpu
    im.cpu() if dynamic else im,
    f,
    verbose=False,
    opset_version=opset,
    export_params=True,  # store the trained parameter weights in the model file
    do_constant_folding=True,  # execute constant folding for optimization
    input_names=['images'],
    output_names=output_names,
    dynamic_axes={
        'images': {0: 'batch_size'},  # variable-length axes
        'output': {0: 'batch_size'},
    }
)

# Checks
model_onnx = onnx.load(f)  # load onnx model
onnx.checker.check_model(model_onnx)  # check onnx model

import onnxruntime
import numpy as np

print('onnxruntime run start', f)
sess = onnxruntime.InferenceSession('best.onnx')
print('sess run start')
output = sess.run(['output0'], {'images': im.detach().numpy()})[0]

print('pytorch model inference start')
pytorch_result = model(im)[0].detach().numpy()

print('allclose start')
print('output:', output)
print('pytorch_result:', pytorch_result)
assert np.allclose(output, pytorch_result), 'the output is different between pytorch and onnx !!!'
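Note that np.allclose uses rtol=1e-05 and atol=1e-08 by default, which is very strict for FP32 inference, so the assertion above can fail even when the export is basically healthy. A looser, explicit check (my own variant, not from YOLOv5) might look like:

# explicit tolerances better matched to FP32 numerical drift
ok = np.allclose(output, pytorch_result, rtol=1e-3, atol=1e-5)
print('consistent within FP32 tolerance:', ok)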

Printing both outputs and marking the clearly differing values shows the discrepancy (screenshot omitted here).
You can also use my standalone version below: after converting to ONNX, it evaluates the difference between the exported ONNX file and the .pt model, as follows.

Reference, from the official PyTorch docs: (OPTIONAL) EXPORTING A MODEL FROM PYTORCH TO ONNX AND RUNNING IT USING ONNX RUNTIME

import os
import platform
import sys
import warnings
from pathlib import Path
import torch

FILE = Path(__file__).resolve()
ROOT = FILE.parents[0]  # YOLOv5 root directory
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))  # add ROOT to PATH
if platform.system() != 'Windows':
    ROOT = Path(os.path.relpath(ROOT, Path.cwd()))  # relative

from models.experimental import attempt_load
from models.yolo import ClassificationModel, Detect, DetectionModel, SegmentationModel
from utils.dataloaders import LoadImages
from utils.general import (LOGGER, Profile, check_dataset, check_img_size, check_requirements, check_version,
                           check_yaml, colorstr, file_size, get_default_args, print_args, url2file, yaml_save)
from utils.torch_utils import select_device, smart_inference_mode


import numpy as np
def cosine_distance(arr1, arr2):
    # flatten the arrays to shape (16128, 7)
    arr1_flat = arr1.reshape(-1, 7)
    arr2_flat = arr2.reshape(-1, 7)

    # a cosine-style similarity score: a (7, 7) matrix of cross-channel
    # dot products, normalized by the global norms, then averaged
    cosine_distance = np.dot(arr1_flat.T, arr2_flat) / (np.linalg.norm(arr1_flat) * np.linalg.norm(arr2_flat))

    return cosine_distance.mean()
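Incidentally, because cosine_distance averages a cross-channel dot-product matrix normalized by global norms, its values are far from 1 even for nearly identical arrays (see the 0.04 values in the tests below). If a per-row cosine similarity is what you want, a sketch (my addition, not part of the original script) could be:

def rowwise_cosine_similarity(arr1, arr2, eps=1e-12):
    # mean cosine similarity over the prediction rows; ~1.0 means near-identical
    a = arr1.reshape(-1, 7)
    b = arr2.reshape(-1, 7)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float((num / den).mean())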


def check_onnx(model, im):
    import onnxruntime
    import numpy as np

    print('onnxruntime run start')
    sess = onnxruntime.InferenceSession('best.onnx')
    print('sess run start')
    output = sess.run(['output0'], {'images': im.detach().numpy()})[0]

    print('pytorch model inference start')
    with torch.no_grad():
        pytorch_result = model(im)[0].detach().numpy()

    print('allclose start')
    print('output:', output, output.shape)
    print('pytorch_result:', pytorch_result, pytorch_result.shape)
    cosine_dis = cosine_distance(output, pytorch_result)
    print('cosine_dis:', cosine_dis)

    # check equality up to 4 decimal places; raises if they differ
    # np.testing.assert_almost_equal(pytorch_result, output, decimal=4)

    # compare ONNX Runtime and PyTorch results
    np.testing.assert_allclose(pytorch_result, output, rtol=1e-03, atol=1e-05)

    # assert np.allclose(output, pytorch_result), 'the output is different between pytorch and onnx !!!'

import cv2
from utils.augmentations import letterbox
def preprocess(img, device):
    img = cv2.resize(img, (512, 512))

    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)
    img = torch.from_numpy(img).to(device)
    img = img.float()
    img /= 255  # scale 0-255 to 0.0-1.0
    if len(img.shape) == 3:
        img = img[None]  # add batch dimension
    return img
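Note that the script imports letterbox but preprocess uses a plain cv2.resize, which does not preserve aspect ratio. For closer parity with YOLOv5's own dataloader you could letterbox instead; a sketch under that assumption:

def preprocess_letterbox(img, device, imgsz=(512, 512)):
    # letterbox keeps the aspect ratio and pads to the target size,
    # matching YOLOv5's own preprocessing
    img = letterbox(img, imgsz, auto=False)[0]
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)
    img = torch.from_numpy(img).to(device).float() / 255
    return img[None] if img.ndim == 3 else img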
def main(
        weights=ROOT / 'weights/best.pt',  # weights path
        imgsz=(512, 512),  # image (height, width)
        batch_size=1,  # batch size
        device='cpu',  # cuda device, i.e. 0 or 0,1,2,3 or cpu
        inplace=False,  # set YOLOv5 Detect() inplace=True
        dynamic=False,  # ONNX/TF/TensorRT: dynamic axes

):
    # Load PyTorch model
    device = select_device(device)
    model = attempt_load(weights, device=device, inplace=True, fuse=True)  # load FP32 model

    # Checks
    imgsz *= 2 if len(imgsz) == 1 else 1  # expand

    # Input
    gs = int(max(model.stride))  # grid size (max stride)
    imgsz = [check_img_size(x, gs) for x in imgsz]  # verify img_size are gs-multiples
    im = torch.zeros(batch_size, 3, *imgsz).to(device)  # image size(1,3,320,192) BCHW iDetection
    # im = cv2.imread(r'F:\tmp\yolov5_multiDR\data\0000005_20200929_M_063Y16640.jpeg')
    # im = preprocess(im, device)

    print(im.shape)
    # Update model
    model.eval()
    for k, m in model.named_modules():
        if isinstance(m, Detect):
            m.inplace = inplace
            m.dynamic = dynamic
            m.export = True

    warnings.filterwarnings(action='ignore', category=torch.jit.TracerWarning)  # suppress TracerWarning
    check_onnx(model, im)

if __name__ == "__main__":
    main()

Test 1: the input is an all-zeros array. The consistency check reports:

Mismatched elements: 76 / 112896 (0.0673%)
Max absolute difference:  0.00053406
Max relative difference:      2.2101

output: [[[     3.1054       3.965      8.9553 ...  6.8545e-07     0.36458     0.53113]
  [     9.0205      2.5498       13.39 ...  6.2585e-07     0.18449     0.70698]
  [     20.786      2.2233      13.489 ...  2.3842e-06    0.033101     0.95657]
  ...
  [     419.42      493.04      106.14 ...  8.4937e-06     0.24135     0.60916]
  [     485.68      500.22      46.923 ...  1.1176e-05     0.33573     0.48875]
  [     488.37      503.87      68.881 ...  5.9605e-08  0.00030029     0.99639]]] (1, 16128, 7)
pytorch_result: [[[     3.1054       3.965      8.9553 ...  7.0523e-07     0.36458     0.53113]
  [     9.0205      2.5498       13.39 ...  6.0181e-07     0.18449     0.70698]
  [     20.786      2.2233      13.489 ...  2.4172e-06    0.033101     0.95657]
  ...
  [     419.42      493.04      106.14 ...  8.5151e-06     0.24135     0.60916]
  [     485.68      500.22      46.923 ...  1.1174e-05     0.33573     0.48875]
  [     488.37      503.87      68.881 ...  9.3094e-08   0.0003003     0.99639]]] (1, 16128, 7)
cosine_dis: 0.04229331

Test 2: the input is a local image loaded from disk. The consistency check reports:

Mismatched elements: 158 / 112896 (0.14%)
Max absolute difference:   0.0016251
Max relative difference:      1.2584

output: [[[     3.0569      2.4338      10.758 ...  2.0862e-07     0.16333     0.78551]
  [     11.028      2.0251      13.407 ...  3.5763e-07    0.090503     0.88087]
  [     19.447      1.8957      13.431 ...  6.8545e-07    0.047358     0.95029]
  ...
  [     418.66       487.8      80.157 ...  1.4573e-05     0.65453     0.23448]
  [     472.99      491.78      79.313 ...  1.3232e-05     0.79356     0.15061]
  [     496.41      488.49      44.447 ...  2.6256e-05     0.89966     0.08772]]] (1, 16128, 7)
pytorch_result: [[[     3.0569      2.4338      10.758 ...  2.5371e-07     0.16333     0.78551]
  [     11.028      2.0251      13.407 ...  3.3069e-07    0.090503     0.88087]
  [     19.447      1.8957      13.431 ...  6.6051e-07    0.047358     0.95029]
  ...
  [     418.66       487.8      80.157 ...  1.4618e-05     0.65453     0.23448]
  [     472.99      491.78      79.313 ...  1.3215e-05     0.79356     0.15061]
  [     496.41      488.49      44.447 ...  2.6262e-05     0.89966     0.08772]]] (1, 16128, 7)
cosine_dis: 0.04071107

Quite a few data points differ between the two outputs. This suggests that the exported model does not compute exactly the same function as the PyTorch model, so the same input produces slightly different results.

Within a certain tolerance, though, the results are consistent: I verified that the first three decimal places all match, and differences only start appearing at the fourth. A quick way to probe this is sketched below.
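For example, dropped into check_onnx after the comparison, this small probe (my addition) reports how many decimal places agree, using the same np.testing.assert_almost_equal check that is commented out in the script:

# probe how many decimal places agree between the two outputs
for d in (2, 3, 4):
    try:
        np.testing.assert_almost_equal(pytorch_result, output, decimal=d)
        print(f'consistent to {d} decimal places')
    except AssertionError:
        print(f'differs starting at {d} decimal places')
        break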

So what can be done to reduce or even eliminate this difference? If you have knowledge or experience in this area, you are welcome to offer guidance in the comments. Thank you. Two things I would try first are sketched below; these are my own guesses, not verified fixes.
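One is to rule out ONNX Runtime's graph optimizations as the source of the drift by running the exported model with them disabled; the other is to re-export with do_constant_folding=False to see whether folding changes the numerics. A sketch of the first, reusing the im tensor from the script above:

import onnxruntime as ort

# run the exported model with all graph optimizations disabled, so any
# remaining difference comes from the export itself rather than runtime rewrites
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
sess_raw = ort.InferenceSession('best.onnx', sess_options=so)
output_raw = sess_raw.run(['output0'], {'images': im.detach().numpy()})[0]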

2. The new PyTorch-to-ONNX conversion: torch.onnx.dynamo_export

Refer to the official PyTorch documentation on torch.onnx.export model conversion: (OPTIONAL) EXPORTING A MODEL FROM PYTORCH TO ONNX AND RUNNING IT USING ONNX RUNTIME

That tutorial provides the official evaluation code for comparing a PyTorch model against its exported ONNX counterpart, checking that the outputs are consistent for the same input. The comparison is:

np.testing.assert_allclose(actual, desired, rtol=1e-07, atol=0, equal_nan=True, err_msg='', verbose=True)

where:

  • rtol: relative tolerance
  • atol: absolute tolerance
  • the check requires that the difference between actual and desired does not exceed atol + rtol * abs(desired) elementwise; otherwise an assertion error is raised

So this is an evaluation within an allowed error range: as long as the tolerance requirement is met, the check passes. In this test case, the outputs did pass with the tolerance values set above.
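To make the tolerance rule concrete, here is a small standalone example (my own numbers, not from the tutorial):

import numpy as np

desired = np.array([0.5, 1e-6])
actual = np.array([0.50004, 1.5e-6])
# passes: |0.50004 - 0.5| = 4e-5 <= 1e-5 + 1e-3 * 0.5 = 5.1e-4,
# and |1.5e-6 - 1e-6| = 5e-7 <= 1e-5 + 1e-3 * 1e-6
np.testing.assert_allclose(actual, desired, rtol=1e-3, atol=1e-5)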

However, there is a twist: the tutorial includes a reminder (screenshot omitted here) pointing to the newer exporter. Following the torch.onnx.dynamo_export link leads to: EXPORT A PYTORCH MODEL TO ONNX

That page follows the same process of exporting the model and then evaluating consistency, but I noticed that the official check there is not done within an error tolerance: the outputs are shown to be completely consistent (screenshot omitted), which is good news. At that point I started verifying it.

2.1. Verification results

Meanwhile, I noticed that YOLOv5 had been updated to v7.0.0, so I upgraded it, and I also updated PyTorch to the latest 2.1.0 so that I could try converting to ONNX with torch.onnx.dynamo_export.

With everything ready, running the following code to export the ONNX model produced an error.

export_output = torch.onnx.dynamo_export(model.cpu() if dynamic else model,
                                         im.cpu() if dynamic else im)
export_output.save("my_image_classifier.onnx")
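As an aside, the failure can be caught so that a larger script keeps running; a small sketch using the exception class named in the error described below:

try:
    export_output = torch.onnx.dynamo_export(model.cpu() if dynamic else model,
                                             im.cpu() if dynamic else im)
    export_output.save("my_image_classifier.onnx")
except torch.onnx.OnnxExporterError as e:
    # the exporter also writes report_dynamo_export.sarif to the working directory
    print('dynamo_export failed:', e)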

2.2. Conversion failed

The export fails with torch.onnx.OnnxExporterError and reports that a SARIF file was generated. The message then explains what a SARIF file is (it can be viewed with the VS Code SARIF extension or the SARIF web viewer) and finally suggests reporting the error on PyTorch's GitHub issues.

A file named report_dynamo_export.sarif was produced. Opening it, the recorded information is as follows:

{
 "runs":[
  {
   "tool":{
    "driver":{
     "name":"torch.onnx.dynamo_export",
     "contents":[
      "localizedData",
      "nonLocalizedData"
     ],
     "language":"en-US",
     "rules":[],
     "version":"2.1.0+cu118"
    }
   },
   "language":"en-US",
   "newlineSequences":[
    "\r\n",
    "\n"
   ],
   "results":[]
  }
 ],
 "version":"2.1.0",
 "schemaUri":"https://docs.oasis-open.org/sarif/sarif/v2.1.0/cs01/schemas/sarif-schema-2.1.0.json"
}
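The report can also be inspected programmatically; a minimal sketch (note that its results list is empty, so no diagnostics were actually recorded):

import json

with open('report_dynamo_export.sarif') as fh:
    report = json.load(fh)
print(report['runs'][0]['tool']['driver']['name'])   # torch.onnx.dynamo_export
print('diagnostics:', report['runs'][0]['results'])  # [] -- empty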

This looks more like an environment log than a diagnosis. Searching the web, I found similar error reports but no solution. I don't know whether this is because the feature is still in beta and not yet well adapted.

If you have run into the same problem, please leave a comment with any guidance on where the problem lies and how to solve it. Thanks.

3. Summary

Originally I wanted to verify whether the final converted RKNN model was consistent with the original PyTorch model. It turned out that the difference already exists after the ONNX conversion stage, and the RKNN test results are close to the ONNX model's results. The problem exists for both the quantized and the unquantized RKNN model.

I also found that for the quantized RKNN model, changing the quantization method in the config stage does improve performance, bringing it almost up to the unquantized version; see the sketch below.
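For reference, quantization behaviour is chosen in the toolkit's config step. A hedged sketch using rknn-toolkit2 (parameter names vary across toolkit versions, and the platform and normalization values here are placeholders):

from rknn.api import RKNN

rknn = RKNN()
rknn.config(
    mean_values=[[0, 0, 0]],       # placeholder normalization values
    std_values=[[255, 255, 255]],
    quantized_algorithm='mmse',    # e.g. 'normal' vs 'mmse'
    quantized_method='channel',    # per-channel vs per-layer quantization
    target_platform='rk3588',      # placeholder target
)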

I originally hoped that exporting ONNX with PyTorch's new exporter would solve this problem. However, it still appears to be in beta, I don't know where the failure comes from, and I have not gotten it to run yet; help from those familiar with it would be appreciated.

Finally, if you have encountered the same problem, please leave a comment and share where the problem lies and how to solve it. Thanks.

Origin: blog.csdn.net/wsLJQian/article/details/133783202