After converting a model trained with yolo v5 to onnx and then to rknn, testing found:
- The rknn models, both quantized and non-quantized, have lower test accuracy than the pytorch model.
- The onnx model also has lower test accuracy than the pytorch model, and its accuracy is closer to that of the rknn model.
Since onnx is the upstream of the rknn model and the onnx model already shows the accuracy drop, the problem must be introduced at or before the onnx step. So I checked the pytorch to onnx stage and found that the conversion itself reduces accuracy.
This article records that process. Suggestions on the open questions are welcome: the problem has been located, but it is not fully solved yet.
1. Convert pytorch to onnx: torch.onnx.export
In def export_onnx() of yolo v5 export.py, add the following code to check whether the output of the exported onnx model is consistent with that of the pytorch model. The code is as follows:
torch.onnx.export(
    model.cpu() if dynamic else model,  # --dynamic only compatible with cpu
    im.cpu() if dynamic else im,
    f,
    verbose=False,
    opset_version=opset,
    export_params=True,  # store the trained weights in the model file
    do_constant_folding=True,  # apply constant folding as an optimization
    input_names=['images'],
    output_names=output_names,
    # keys must match input_names / output_names ('images' and 'output0' here)
    dynamic_axes={
        'images': {0: 'batch_size'},  # variable length axes
        'output0': {0: 'batch_size'},
    })

# Checks
model_onnx = onnx.load(f)  # load onnx model
onnx.checker.check_model(model_onnx)  # check onnx model

# Compare the ONNX Runtime output with the pytorch output for the same input
import onnxruntime
import numpy as np
print('onnxruntime run start', f)
sess = onnxruntime.InferenceSession(str(f), providers=['CPUExecutionProvider'])  # load the file just exported
print('sess run start')
output = sess.run(['output0'], {'images': im.detach().cpu().numpy()})[0]
print('pytorch model inference start')
pytorch_result = model(im)[0].detach().cpu().numpy()
print('allclose start')
print('output:', output)
print('pytorch_result:', pytorch_result)
assert np.allclose(output, pytorch_result), 'the output is different between pytorch and onnx !!!'
The output results are printed, and the areas with obvious differences are marked, as shown below:
You can also use my standalone version below directly. It runs after the onnx conversion and evaluates the difference between the converted onnx file and the pt file, as follows:
Reference, official pytorch tutorial: (OPTIONAL) EXPORTING A MODEL FROM PYTORCH TO ONNX AND RUNNING IT USING ONNX RUNTIME
import os
import platform
import sys
import warnings
from pathlib import Path

import torch

FILE = Path(__file__).resolve()
ROOT = FILE.parents[0]  # YOLOv5 root directory
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))  # add ROOT to PATH
if platform.system() != 'Windows':
    ROOT = Path(os.path.relpath(ROOT, Path.cwd()))  # relative

from models.experimental import attempt_load
from models.yolo import ClassificationModel, Detect, DetectionModel, SegmentationModel
from utils.dataloaders import LoadImages
from utils.general import (LOGGER, Profile, check_dataset, check_img_size, check_requirements, check_version,
                           check_yaml, colorstr, file_size, get_default_args, print_args, url2file, yaml_save)
from utils.torch_utils import select_device, smart_inference_mode
import numpy as np
def cosine_distance(arr1, arr2):
    # flatten the arrays to shape (16128, 7)
    arr1_flat = arr1.reshape(-1, 7)
    arr2_flat = arr2.reshape(-1, 7)
    # NOTE: this is the mean of the 7x7 matrix arr1_flat.T @ arr2_flat scaled by the
    # Frobenius norms, not a per-row cosine distance; identical inputs do not generally give 0
    scaled = np.dot(arr1_flat.T, arr2_flat) / (np.linalg.norm(arr1_flat) * np.linalg.norm(arr2_flat))
    return scaled.mean()
def check_onnx(model, im):
    import onnxruntime
    import numpy as np
    print('onnxruntime run start')
    sess = onnxruntime.InferenceSession('best.onnx')  # expects best.onnx in the working directory
    print('sess run start')
    output = sess.run(['output0'], {'images': im.detach().cpu().numpy()})[0]
    print('pytorch model inference start')
    with torch.no_grad():
        pytorch_result = model(im)[0].detach().cpu().numpy()
    print('allclose start')
    print('output:', output, output.shape)
    print('pytorch_result:', pytorch_result, pytorch_result.shape)
    cosine_dis = cosine_distance(output, pytorch_result)
    print('cosine_dis:', cosine_dis)
    # check equality to 4 decimal places; raise an error if they differ
    # np.testing.assert_almost_equal(pytorch_result, output, decimal=4)
    # compare ONNX Runtime and PyTorch results
    np.testing.assert_allclose(pytorch_result, output, rtol=1e-03, atol=1e-05)
    # assert np.allclose(output, pytorch_result), 'the output is different between pytorch and onnx !!!'
import cv2
from utils.augmentations import letterbox


def preprocess(img, device):
    img = cv2.resize(img, (512, 512))  # plain resize to the export size (no letterbox padding)
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)
    img = torch.from_numpy(img).to(device)
    img = img.float()
    img /= 255  # 0 - 255 to 0.0 - 1.0
    if len(img.shape) == 3:
        img = img[None]  # add batch dimension
    return img
def main(
        weights=ROOT / 'weights/best.pt',  # weights path
        imgsz=(512, 512),  # image (height, width)
        batch_size=1,  # batch size
        device='cpu',  # cuda device, i.e. 0 or 0,1,2,3 or cpu
        inplace=False,  # set YOLOv5 Detect() inplace=True
        dynamic=False,  # ONNX/TF/TensorRT: dynamic axes
):
    # Load PyTorch model
    device = select_device(device)
    model = attempt_load(weights, device=device, inplace=True, fuse=True)  # load FP32 model

    # Checks
    imgsz *= 2 if len(imgsz) == 1 else 1  # expand

    # Input
    gs = int(max(model.stride))  # grid size (max stride)
    imgsz = [check_img_size(x, gs) for x in imgsz]  # verify img_size are gs-multiples
    im = torch.zeros(batch_size, 3, *imgsz).to(device)  # all-zero input, shape (batch_size, 3, H, W)
    # im = cv2.imread(r'F:\tmp\yolov5_multiDR\data\0000005_20200929_M_063Y16640.jpeg')
    # im = preprocess(im, device)
    print(im.shape)

    # Update model
    model.eval()
    for k, m in model.named_modules():
        if isinstance(m, Detect):
            m.inplace = inplace
            m.dynamic = dynamic
            m.export = True

    warnings.filterwarnings(action='ignore', category=torch.jit.TracerWarning)  # suppress TracerWarning
    check_onnx(model, im)


if __name__ == "__main__":
    main()
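A note on the cosine_distance helper in the script: it averages a 7x7 matrix (arr1_flat.T times arr2_flat) scaled by the Frobenius norms, so the value it prints (about 0.04 in the tests below) is hard to interpret as a distance. A minimal sketch of a per-row cosine similarity is given here for comparison. The function name rowwise_cosine_similarity and the random stand-in data are my own assumptions, not part of the original script, and the numbers it prints are not model outputs.

```python
import numpy as np


def rowwise_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between matching rows of a and b (last axis = features)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a.reshape(-1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1])
    dot = np.sum(a * b, axis=1)  # one dot product per row
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / np.maximum(norms, eps)  # eps guards against all-zero rows


# Stand-in data with the (1, N, 7) layout of the outputs; random values, NOT model outputs
rng = np.random.default_rng(0)
reference = rng.random((1, 8, 7))
perturbed = reference + 1e-4 * rng.standard_normal(reference.shape)

sim = rowwise_cosine_similarity(reference, perturbed)
print('min similarity:', sim.min())
print('mean cosine distance:', (1.0 - sim).mean())
```

With this definition, identical arrays give a similarity of 1 for every row (distance 0), which makes the number easier to read than the averaged matrix.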
Test 1: The image is an array of all 0s. The consistency check is as follows:
Mismatched elements: 76 / 112896 (0.0673%)
Max absolute difference: 0.00053406
Max relative difference: 2.2101
output: [[[ 3.1054 3.965 8.9553 ... 6.8545e-07 0.36458 0.53113]
[ 9.0205 2.5498 13.39 ... 6.2585e-07 0.18449 0.70698]
[ 20.786 2.2233 13.489 ... 2.3842e-06 0.033101 0.95657]
...
[ 419.42 493.04 106.14 ... 8.4937e-06 0.24135 0.60916]
[ 485.68 500.22 46.923 ... 1.1176e-05 0.33573 0.48875]
[ 488.37 503.87 68.881 ... 5.9605e-08 0.00030029 0.99639]]] (1, 16128, 7)
pytorch_result: [[[ 3.1054 3.965 8.9553 ... 7.0523e-07 0.36458 0.53113]
[ 9.0205 2.5498 13.39 ... 6.0181e-07 0.18449 0.70698]
[ 20.786 2.2233 13.489 ... 2.4172e-06 0.033101 0.95657]
...
[ 419.42 493.04 106.14 ... 8.5151e-06 0.24135 0.60916]
[ 485.68 500.22 46.923 ... 1.1174e-05 0.33573 0.48875]
[ 488.37 503.87 68.881 ... 9.3094e-08 0.0003003 0.99639]]] (1, 16128, 7)
cosine_dis: 0.04229331
Test 2: The image is a loaded local image, and the consistency check is as follows:
Mismatched elements: 158 / 112896 (0.14%)
Max absolute difference: 0.0016251
Max relative difference: 1.2584
output: [[[ 3.0569 2.4338 10.758 ... 2.0862e-07 0.16333 0.78551]
[ 11.028 2.0251 13.407 ... 3.5763e-07 0.090503 0.88087]
[ 19.447 1.8957 13.431 ... 6.8545e-07 0.047358 0.95029]
...
[ 418.66 487.8 80.157 ... 1.4573e-05 0.65453 0.23448]
[ 472.99 491.78 79.313 ... 1.3232e-05 0.79356 0.15061]
[ 496.41 488.49 44.447 ... 2.6256e-05 0.89966 0.08772]]] (1, 16128, 7)
pytorch_result: [[[ 3.0569 2.4338 10.758 ... 2.5371e-07 0.16333 0.78551]
[ 11.028 2.0251 13.407 ... 3.3069e-07 0.090503 0.88087]
[ 19.447 1.8957 13.431 ... 6.6051e-07 0.047358 0.95029]
...
[ 418.66 487.8 80.157 ... 1.4618e-05 0.65453 0.23448]
[ 472.99 491.78 79.313 ... 1.3215e-05 0.79356 0.15061]
[ 496.41 488.49 44.447 ... 2.6262e-05 0.89966 0.08772]]] (1, 16128, 7)
cosine_dis: 0.04071107
A number of output elements differ (76 and 158 out of 112896 in the two tests), which suggests that some parameters or operations in the exported model are not identical to the pytorch model, so the same input produces slightly different outputs.
Within a certain tolerance, however, the results are consistent. For example, I verified that the first three digits after the decimal point all match; differences start to appear at the fourth digit.
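To see where the differences sit rather than only whether the check passes, a small report like the following may help. This is a sketch under my own assumptions: the function name diff_report and the toy arrays are hypothetical, and the decimal count uses the same criterion as np.testing.assert_almost_equal (absolute difference below 1.5 * 10**(-decimal)).

```python
import numpy as np


def diff_report(actual, desired, rtol=1e-3, atol=1e-5):
    """Summarize element-wise differences between two output arrays."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    abs_diff = np.abs(actual - desired)
    # same criterion as np.testing.assert_allclose
    mismatch = abs_diff > atol + rtol * np.abs(desired)
    worst = np.unravel_index(np.argmax(abs_diff), abs_diff.shape)
    # largest number of decimals for which every element still agrees
    decimals = 0
    while decimals < 15 and abs_diff.max() < 1.5 * 10.0 ** -(decimals + 1):
        decimals += 1
    return {
        'max_abs_diff': float(abs_diff.max()),
        'worst_index': tuple(int(i) for i in worst),
        'mismatch_per_column': mismatch.reshape(-1, mismatch.shape[-1]).sum(axis=0).tolist(),
        'agreeing_decimals': decimals,
    }


# Toy arrays (hypothetical values, not model outputs): one element is off by 0.001
desired = np.array([[[10.0, 20.0, 0.5], [30.0, 40.0, 0.25]]])
actual = desired.copy()
actual[0, 1, 2] += 0.001

report = diff_report(actual, desired)
print(report)
```

Applied to output and pytorch_result, the per-column counts would show whether the mismatches concentrate in the box coordinates or in the confidence and class columns.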
So what can be done to reduce or even eliminate this difference? If you have knowledge or experience in this area, guidance in the comment area is welcome. Thank you.
2. The new pytorch to onnx exporter: torch.onnx.dynamo_export
The official pytorch tutorial on model conversion with torch.onnx.export is: (OPTIONAL) EXPORTING A MODEL FROM PYTORCH TO ONNX AND RUNNING IT USING ONNX RUNTIME
That tutorial contains the evaluation code officially provided by pytorch for comparing the pytorch model with the exported onnx model: it checks that the outputs are consistent for the same input. The comparison uses:
np.testing.assert_allclose(actual, desired, rtol=1e-07, atol=0, equal_nan=True, err_msg='', verbose=True)
where:
- rtol: relative tolerance
- atol: absolute tolerance
- the difference between actual and desired must not exceed atol + rtol * abs(desired); otherwise an AssertionError is raised
So this is an evaluation within an allowed error range: as long as the tolerance requirement is met, the check passes. In the tutorial's test case, the outputs do pass with the tolerance values it sets.
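The tolerance rule can be checked by hand on a few numbers. The sketch below applies atol + rtol * abs(desired) element by element, with rtol=1e-03 and atol=1e-05 as in check_onnx. The helper name within_tolerance is my own, and only the pair 7.0523e-07 / 6.8545e-07 is taken from the Test 1 printout; the other values are made up for illustration.

```python
import numpy as np


def within_tolerance(actual, desired, rtol=1e-3, atol=1e-5):
    """Element-wise form of the assert_allclose criterion."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return np.abs(actual - desired) <= atol + rtol * np.abs(desired)


# actual = pytorch result, desired = onnx output, matching the argument order in check_onnx
desired = np.array([488.37, 0.36458, 6.8545e-07])
actual = np.array([488.38, 0.36558, 7.0523e-07])  # first two entries are hypothetical

mask = within_tolerance(actual, desired)
print(mask)  # [ True False  True]
```

The large value passes because rtol scales with its magnitude, the middle value fails with an absolute error of 0.001, and the near-zero pair passes even though it differs by about 2.9% in relative terms, because atol dominates there.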
However, there is a twist: the tutorial page carries a reminder, as follows:
So I followed the torch.onnx.dynamo_export link to the tutorial: EXPORT A PYTORCH MODEL TO ONNX
That tutorial follows the same process: export the model, then run a consistency evaluation. I found that the official example does not use a tolerance-based evaluation there, but the following:
Completely consistent output would be good news. With that, I started verification.
2.1. Verification results
At the same time, I found that yolo v5 had been updated to v7.0.0, so I upgraded yolo and also updated pytorch to the latest version, 2.1.0, so that I could use torch.onnx.dynamo_export to try converting to an onnx model.
After everything was ready, exporting the onnx model with the following code produced an error.
export_output = torch.onnx.dynamo_export(model.cpu() if dynamic else model,
                                         im.cpu() if dynamic else im)
export_output.save("my_image_classifier.onnx")
2.2. Conversion failed
The export gives a failure message: torch.onnx.OnnxExporterError. The conversion to an onnx model failed, and a SARIF file was generated. The message then explains what a SARIF file is and says it can be viewed with the VS Code SARIF extension or a SARIF web viewer. Finally, it asks that the error be reported as an issue on pytorch's GitHub.
The run produced a SARIF file named report_dynamo_export.sarif. Open the file and the recorded information is as follows:
{
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "torch.onnx.dynamo_export",
          "contents": [
            "localizedData",
            "nonLocalizedData"
          ],
          "language": "en-US",
          "rules": [],
          "version": "2.1.0+cu118"
        }
      },
      "language": "en-US",
      "newlineSequences": [
        "\r\n",
        "\n"
      ],
      "results": []
    }
  ],
  "version": "2.1.0",
  "schemaUri": "https://docs.oasis-open.org/sarif/sarif/v2.1.0/cs01/schemas/sarif-schema-2.1.0.json"
}
This looks more like a log collected from the running environment than a diagnosis: the results list is empty, so the file does not say what went wrong. Searching online, I found similar error messages but no solution. I don't know whether this is because the function is still in beta and not yet well adapted.
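For larger reports, a few lines of Python can list what a SARIF file actually contains. The sketch below is an assumption-laden illustration: summarize_sarif is a hypothetical helper, the embedded text is a trimmed copy of the report above (the newlineSequences field is left out), and the message lookup follows the SARIF 2.1.0 layout in which each result carries a message object with a text field.

```python
import json

# Trimmed copy of report_dynamo_export.sarif as shown above
sarif_text = """
{
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "torch.onnx.dynamo_export",
          "rules": [],
          "version": "2.1.0+cu118"
        }
      },
      "results": []
    }
  ],
  "version": "2.1.0"
}
"""


def summarize_sarif(sarif):
    """Return one summary dict per run: tool name, version, rule and result counts."""
    summary = []
    for run in sarif.get('runs', []):
        driver = run.get('tool', {}).get('driver', {})
        results = run.get('results', [])
        summary.append({
            'tool': driver.get('name'),
            'tool_version': driver.get('version'),
            'num_rules': len(driver.get('rules', [])),
            'num_results': len(results),
            'messages': [r.get('message', {}).get('text', '') for r in results],
        })
    return summary


summary = summarize_sarif(json.loads(sarif_text))
print(summary)
```

For this report the summary shows zero rules and zero results, which matches the impression that the file records the environment but no diagnostic. To read the real file, json.load on the opened report_dynamo_export.sarif can replace json.loads.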
If you have encountered the same problem, please leave a comment on where the problem lies and how to solve it. Thank you.
3. Summary
Originally I wanted to verify whether the final converted rknn model was consistent with the original pytorch model. In the end, I found that the difference already exists after the conversion to onnx, and that the test results of rknn are closer to those of the onnx model. This holds whether the rknn model is quantized or unquantized.
I also found that for the quantized rknn model, changing the quantization method in the config stage does improve the model's performance, bringing it close to the unquantized version.
I originally thought that exporting onnx with pytorch's new export function could solve this problem. However, it is still in beta, and I don't know where the problem lies. I have not gotten it to run yet and still need help from experts.
Finally, if you have encountered the same problem, please leave a comment on where the problem lies and how to solve it. Thank you.