yolov5 ONNX inference: an example and working notes (including a walkthrough of the latest detect.py source)

foreword

Recently I exported a yolov5 model to ONNX format and wanted to write a script to verify the results, to see whether they differ from inferring directly with the .pt file. The official detect.py can already run inference on all the supported model formats, but whenever I used it I felt I did not really understand what it was doing, and it also pulls in a lot of external modules, so I decided to write a standalone inference script. Of course, simply feeding an empty tensor through the model would be trivial; what I want is to run a real prediction and save the annotated image.

If this is too long to read, skip to the final source code at the end of the article. The code has also been placed on GitHub, the link is here.

2022/09/07: the CPU version of the inference script was updated on GitHub.


preparation and ideas

First of all, onnxruntime must be installed (the GPU build, onnxruntime-gpu, in my case), along with torch, torchvision, opencv-python and a matching CUDA toolkit; I won't go into the installation details here.
The idea is as follows:
1. Preprocess the image and resize/pad it to the input size of the ONNX model.
2. Run inference to get all candidate boxes, then use non_max_suppression to drop the unqualified ones, i.e. boxes below the confidence threshold and overlapping boxes above the IoU threshold.
3. Convert the box coordinates back to the original image size (the image was resized and padded during preprocessing).
4. Draw the boxes and label names on the original image and save it (with OpenCV's cv2.rectangle and cv2.putText).
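
Put together, the whole flow is only a handful of calls. Below is a minimal sketch of that skeleton; the helpers letterbox, non_max_suppression, scale_coords and Annotator are the copies taken from the yolov5 repo in the final script at the end of this article, and the file names are placeholders:

# sketch of the standalone pipeline (helpers defined in the final script)
import cv2
import numpy as np
import torch
import onnxruntime

session = onnxruntime.InferenceSession('best.onnx', providers=['CPUExecutionProvider'])

img0 = cv2.imread('test.png')                               # 1. original BGR image
img = letterbox(img0, (640, 640), stride=32, auto=False)    #    resize + pad to the model input size
img = np.ascontiguousarray(img.transpose((2, 0, 1))[::-1])  #    HWC to CHW, BGR to RGB
im = torch.from_numpy(img).float()[None] / 255              #    add batch dim, scale to 0-1

y = session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: im.numpy()})[0]  # 2. raw output
det = non_max_suppression(torch.from_numpy(y), 0.25, 0.45)[0]                                    #    drop weak/overlapping boxes
det[:, :4] = scale_coords(im.shape[2:], det[:, :4], img0.shape).round()                          # 3. back to original-image pixels

annotator = Annotator(img0, line_width=3)                   # 4. draw and save
for *xyxy, conf, cls in det:
    annotator.box_label(xyxy, f'{int(cls)} {conf:.2f}')
cv2.imwrite('result.png', img0)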


read the source code

First, let's see what is written in detect.py, and then locate what we need to complete the task.

1. Load the onnx model

# detect.py
device = select_device(device)
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
stride, names, pt = model.stride, model.names, model.pt
imgsz = check_img_size(imgsz, s=stride)  # check image size

The first line, select_device, simply decides whether to run on CPU or GPU; since we have CUDA, device will be cuda.
The second line, DetectMultiBackend, is how the latest yolov5 source (August 2022) loads the various model formats. This differs from earlier versions: in many yolov5 projects from 2021 found online, this spot uses the attempt_load function instead. So it is worth looking more closely at how DetectMultiBackend is implemented.
The third line pulls out some basic information about the model: stride is usually 32, names is the list of label names, and pt says whether you are inferring with a PyTorch .pt weight file. Since we are using ONNX, pt=False, which matters later.
The fourth line checks that the input image size is a multiple of 32, which yolov5 requires (the same constraint holds during training); if it is not, the size is adjusted.
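
check_img_size itself just rounds each requested dimension up to the nearest multiple of the stride. A simplified stand-in with the same rounding behaviour might look like this (the function body below is my own sketch, not the yolov5 original):

# sketch: round an image size up to a multiple of the stride
import math

def check_img_size(imgsz, s=32):
    if isinstance(imgsz, int):
        imgsz = [imgsz]
    new_size = [max(math.ceil(x / s) * s, s) for x in imgsz]
    if new_size != list(imgsz):
        print(f'img size {imgsz} is not a multiple of {s}, adjusting to {new_size}')
    return new_size[0] if len(new_size) == 1 else new_size

print(check_img_size(640))         # 640, already fine
print(check_img_size((650, 480)))  # [672, 480]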
Then let's take a closer look at this DetectMultiBackend:

# common.py
elif onnx:  # ONNX Runtime
     LOGGER.info(f'Loading {w} for ONNX Runtime inference...')
     cuda = torch.cuda.is_available()
     check_requirements(('onnx', 'onnxruntime-gpu' if cuda else 'onnxruntime'))
     import onnxruntime
     providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if cuda else ['CPUExecutionProvider']
     session = onnxruntime.InferenceSession(w, providers=providers)
     meta = session.get_modelmeta().custom_metadata_map  # metadata
     if 'stride' in meta:
         stride, names = int(meta['stride']), eval(meta['names'])

This code is in common.py under the models folder. The function is long, so we only look at the ONNX branch we care about: it first checks whether CUDA is available and that the matching onnxruntime build (onnxruntime-gpu) is installed; if so, 'CUDAExecutionProvider' is put first in the provider list, and then the InferenceSession is created. This is the piece of code that actually loads the ONNX model.
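
Once the InferenceSession exists, it is worth printing its input and output signatures once; that tells you the tensor names, shapes and dtypes the model expects, which is everything a standalone script needs to know. For example (the names in the comments are typical for a yolov5 export, yours may differ):

# inspect the loaded onnx model
import onnxruntime

session = onnxruntime.InferenceSession('best.onnx', providers=['CPUExecutionProvider'])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)    # e.g. images [1, 3, 640, 640] tensor(float)
for out in session.get_outputs():
    print(out.name, out.shape, out.type)    # e.g. output0 [1, 25200, 7] tensor(float)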

2. Preprocess the image

# detect.py
if webcam:
    view_img = check_imshow()
    cudnn.benchmark = True  # set True to speed up constant image size inference
    dataset = LoadStreams(source, img_size=imgsz, stride=stride, auto=pt)
    bs = len(dataset)  # batch_size
else:
    dataset = LoadImages(source, img_size=imgsz, stride=stride, auto=pt)
    bs = 1  # batch_size

Because we only infer single images, we take the LoadImages branch in the else block. Two of its parameters matter, img_size and auto:
img_size is the input size of the ONNX model, (640, 640) by default; this value is fixed when you export the ONNX model.
auto indicates whether you are running a .pt model; we are using ONNX, so auto must be False.
Next, look at the specific implementation of this function (in dataloaders.py under utils):

# dataloaders.py
else:
    # Read image
    self.count += 1
    img0 = cv2.imread(path)  # BGR
    assert img0 is not None, f'Image Not Found {path}'
    s = f'image {self.count}/{self.nf} {path}: '

    # Padded resize
    img = letterbox(img0, self.img_size, stride=self.stride, auto=self.auto)[0]

    # Convert
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)

Because we only have one image, we go straight to the else branch: read the image with cv2.imread, resize and pad it to 640x640 with letterbox, transpose the dimensions, and finally make the array contiguous. So when writing our own script, we only need to copy the letterbox function (in augmentations.py under utils) to reproduce the preprocessing.
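
To make the dimension changes concrete, here is what happens to a hypothetical 1080x1920 input on its way through those steps, using the letterbox copy from the final script with a 640x640 model input and auto=False; the shapes in the comments are the point:

# preprocessing, with the resulting shapes
import cv2
import numpy as np

img0 = cv2.imread('test.png')                              # (1080, 1920, 3)  HWC, BGR, uint8
img = letterbox(img0, (640, 640), stride=32, auto=False)   # (640, 640, 3)    resized to 640x360, then padded top/bottom
img = img.transpose((2, 0, 1))[::-1]                       # (3, 640, 640)    CHW, RGB
img = np.ascontiguousarray(img)                            # same shape, contiguous memory for torch.from_numpy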

3. Make inferences

# detect.py
im = torch.from_numpy(im).to(device)
im = im.half() if model.fp16 else im.float()  # uint8 to fp16/32
im /= 255  # 0 - 255 to 0.0 - 1.0
if len(im.shape) == 3:
    im = im[None]  # expand for batch dim
t2 = time_sync()
dt[0] += t2 - t1

# Inference
visualize = increment_path(save_dir / Path(path).stem, mkdir=True) if visualize else False
pred = model(im, augment=augment, visualize=visualize)

Finally, the inference itself. The first line converts the array to a tensor for the following steps. The second line chooses half or full precision; fp16 is off for our ONNX model, so im.float() is the branch we take. The next two lines scale the pixels to 0-1 and add the batch dimension; the timing and visualize lines after that are not needed for our script. The last line is the actual inference. To see what it does, jump to common.py under models:

# common.py
elif self.onnx:  # ONNX Runtime
     im = im.cpu().numpy()  # torch to numpy
     y = self.session.run([self.session.get_outputs()[0].name], {self.session.get_inputs()[0].name: im})[0]

It is really just two lines: convert the tensor back to a numpy array, then feed it to the ONNX session to run inference.
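
In other words, the model only ever sees a float32 numpy array of shape (1, 3, 640, 640). You can check the round trip with a dummy input before wiring up real images (assuming the usual 640x640 export):

# quick shape check with a dummy input
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession('best.onnx', providers=['CPUExecutionProvider'])
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
y = session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: dummy})[0]
print(y.shape)   # (1, 25200, 7) for my two-class model exported at 640x640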

4. Remove redundant boxes

With the yolov5s model I use here, step three ends with a (1, 25200, 7) tensor. It can be read as 25200 candidate boxes (3 anchors on each cell of the 80x80, 40x40 and 20x20 grids, i.e. 3 x (6400 + 1600 + 400) = 25200 for a 640x640 input), each carrying 4 box coordinates, 1 objectness confidence and one score per class (2 classes in my model, hence 7 values). All the redundant boxes have to be removed, and the implementation in detect.py is a single line:

# detect.py
pred = non_max_suppression(pred, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)

Here pred is the prediction obtained in step three. conf_thres defaults to 0.25 and iou_thres to 0.45; boxes scoring below those thresholds are removed. classes restricts prediction to specific categories and defaults to None, and agnostic_nms defaults to False.
The function lives in general.py under utils. The implementation is a bit involved, but we can strip out whatever we don't need; for instance the classes filter is usually unnecessary, so the related code can be deleted.

# general.py
box = xywh2xyxy(x[:, :4])
iou = box_iou(boxes[i], boxes)

I single out these two lines because they call functions defined in other files: the first converts boxes from (x, y, w, h) form to (x1, y1, x2, y2) coordinates, and the second computes the IoU between boxes. xywh2xyxy also lives in general.py, while box_iou is in metrics.py under utils.
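
Both are short tensor utilities, and a toy box makes their behaviour obvious (xywh2xyxy and box_iou here are the copies included in the final script):

# toy check of the two helpers
import torch

xywh = torch.tensor([[100., 100., 40., 20.]])     # center x, center y, width, height
xyxy = xywh2xyxy(xywh)
print(xyxy)                                       # tensor([[ 80.,  90., 120., 110.]])

other = torch.tensor([[100., 100., 140., 110.]])  # an overlapping box, already in xyxy form
print(box_iou(xyxy, other))                       # tensor([[0.2000]])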

5. Mark and save the picture

# detect.py
for i, det in enumerate(pred):
    det[:, :4] = scale_coords(im.shape[2:], det[:, :4], img0.shape).round()
#initialize annotator
annotator = Annotator(img0, line_width=3)
#annotate the image
for *xyxy, conf, cls in reversed(det):
    c = int(cls)  # integer class
    label = f'{names[c]} {conf:.2f}'
    annotator.box_label(xyxy, label, color=colors(c, True))

I made slight modifications here so the structure is easier to see.
The first step converts the coordinates in the tensor from step four back to the original image size using scale_coords; then the Annotator class does the labeling. Once the label text is defined, Annotator draws the box and text, and the image can be saved.
scale_coords is in general.py under utils, and Annotator is in plots.py under utils.
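
The rescaling itself is just the inverse of the letterbox step: subtract the padding, divide by the scaling gain, then clip to the image border. A toy check, reusing the scale_coords copy from the final script and the 1080x1920 example from the preprocessing step (gain 1/3, 140 px of padding above and below):

# toy check: map a box from the 640x640 letterboxed image back to 1080x1920
import torch

det = torch.tensor([[200., 300., 400., 500., 0.9, 0.]])   # x1, y1, x2, y2, conf, cls on the 640x640 image
det[:, :4] = scale_coords((640, 640), det[:, :4], (1080, 1920, 3)).round()
print(det[:, :4])   # tensor([[ 600.,  480., 1200., 1080.]])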


Summarize

Reading the source code is the best way to improve. If you only ever call other people's code you will never be more than a parameter tuner, so keep at it!


final code

After the reading above and a long round of debugging, the script below came out (it is a bit long):

#inference only for onnx
import onnxruntime
import torch
import torchvision
import cv2
import numpy as np
import time
w = 'best.onnx'  # model file name, change as needed
cuda = torch.cuda.is_available()
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if cuda else ['CPUExecutionProvider']
session = onnxruntime.InferenceSession(w, providers=providers)
# warmup to reduce the first inference time (not useful in my case, see the note at the end)
# t1 = time.time()
# im = torch.zeros((1,3,640,640), dtype=torch.float, device=torch.device('cuda'))
# im = im.cpu().numpy()  # torch to numpy
# y = session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: im})[0]
# t2 = time.time()
# print(t2-t1)
#preprocess img to array
def letterbox(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleFill=False, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)

    # Scale ratio (new / old)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better val mAP)
        r = min(r, 1.0)

    # Compute padding
    ratio = r, r  # width, height ratios
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding
    if auto:  # minimum rectangle
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding
    elif scaleFill:  # stretch
        dw, dh = 0.0, 0.0
        new_unpad = (new_shape[1], new_shape[0])
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios

    dw /= 2  # divide padding into 2 sides
    dh /= 2

    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border
    return im
def xywh2xyxy(x):
    # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
    y = x.clone() if isinstance(x, torch.Tensor) else np.copy(x)
    y[:, 0] = x[:, 0] - x[:, 2] / 2  # top left x
    y[:, 1] = x[:, 1] - x[:, 3] / 2  # top left y
    y[:, 2] = x[:, 0] + x[:, 2] / 2  # bottom right x
    y[:, 3] = x[:, 1] + x[:, 3] / 2  # bottom right y
    return y
def box_area(box):
    # box = xyxy(4,n)
    return (box[2] - box[0]) * (box[3] - box[1])
def box_iou(box1, box2, eps=1e-7):
    # inter(N,M) = (rb(N,M,2) - lt(N,M,2)).clamp(0).prod(2)
    (a1, a2), (b1, b2) = box1[:, None].chunk(2, 2), box2.chunk(2, 1)
    inter = (torch.min(a2, b2) - torch.max(a1, b1)).clamp(0).prod(2)

    # IoU = inter / (area1 + area2 - inter)
    return inter / (box_area(box1.T)[:, None] + box_area(box2.T) - inter + eps)
def non_max_suppression(prediction,
                        conf_thres=0.25,
                        iou_thres=0.45,
                        agnostic=False,
                        max_det=300):
    bs = prediction.shape[0]  # batch size
    xc = prediction[..., 4] > conf_thres  # candidates
    # Settings
    # min_wh = 2  # (pixels) minimum box width and height
    max_wh = 7680  # (pixels) maximum box width and height
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()
    redundant = True  # require redundant detections
    merge = False  # use merge-NMS
    output = [torch.zeros((0, 6), device = prediction.device)] * bs
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x[xc[xi]]  # confidence
        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box (center x, center y, width, height) to (x1, y1, x2, y2)
        box = xywh2xyxy(x[:, :4])

        # Detections matrix nx6 (xyxy, conf, cls)
        conf, j = x[:, 5:].max(1, keepdim=True)
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]
        # Apply finite constraint
        # if not torch.isfinite(x).all():
        #     x = x[torch.isfinite(x).all(1)]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_nms:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence

        # Batched NMS
        c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
        boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        if i.shape[0] > max_det:  # limit detections
            i = i[:max_det]
        if merge and (1 < n < 3E3):  # Merge NMS (boxes merged using weighted mean)
            # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)
            iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix
            weights = iou * scores[None]  # box weights
            x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(1, keepdim=True)  # merged boxes
            if redundant:
                i = i[iou.sum(1) > 1]  # require redundancy

        output[xi] = x[i]
    return output
def scale_coords(img1_shape, coords, img0_shape, ratio_pad=None):
    # Rescale coords (xyxy) from img1_shape to img0_shape
    if ratio_pad is None:  # calculate from img0_shape
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])  # gain  = old / new
        pad = (img1_shape[1] - img0_shape[1] * gain) / 2, (img1_shape[0] - img0_shape[0] * gain) / 2  # wh padding
    else:
        gain = ratio_pad[0][0]
        pad = ratio_pad[1]

    coords[:, [0, 2]] -= pad[0]  # x padding
    coords[:, [1, 3]] -= pad[1]  # y padding
    coords[:, :4] /= gain
    clip_coords(coords, img0_shape)
    return coords
def clip_coords(boxes, shape):
    # Clip bounding xyxy bounding boxes to image shape (height, width)
    if isinstance(boxes, torch.Tensor):  # faster individually
        boxes[:, 0].clamp_(0, shape[1])  # x1
        boxes[:, 1].clamp_(0, shape[0])  # y1
        boxes[:, 2].clamp_(0, shape[1])  # x2
        boxes[:, 3].clamp_(0, shape[0])  # y2
    else:  # np.array (faster grouped)
        boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, shape[1])  # x1, x2
        boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, shape[0])  # y1, y2
class Annotator:
    def __init__(self, im, line_width=None):
        assert im.data.contiguous, 'Image not contiguous. Apply np.ascontiguousarray(im) to Annotator() input images.'
        self.im = im
        self.lw = line_width or max(round(sum(im.shape) / 2 * 0.003), 2)  # line width

    def box_label(self, box, label='', color=(128, 128, 128), txt_color=(255, 255, 255)):
        # Add one xyxy box to image with label
        p1, p2 = (int(box[0]), int(box[1])), (int(box[2]), int(box[3]))
        cv2.rectangle(self.im, p1, p2, color, thickness=self.lw, lineType=cv2.LINE_AA)
        if label:
            tf = max(self.lw - 1, 1)  # font thickness
            w, h = cv2.getTextSize(label, 0, fontScale=self.lw / 3, thickness=tf)[0]  # text width, height
            outside = p1[1] - h >= 3
            p2 = p1[0] + w, p1[1] - h - 3 if outside else p1[1] + h + 3
            cv2.rectangle(self.im, p1, p2, color, -1, cv2.LINE_AA)  # filled
            cv2.putText(self.im,
                        label, (p1[0], p1[1] - 2 if outside else p1[1] + h + 2),
                        0,
                        self.lw / 3,
                        txt_color,
                        thickness=tf,
                        lineType=cv2.LINE_AA)

    def rectangle(self, xy, fill=None, outline=None, width=1):
        # Add rectangle to image (PIL-only)
        self.draw.rectangle(xy, fill, outline, width)

    def text(self, xy, text, txt_color=(255, 255, 255)):
        # Add text to image (PIL-only)
        w, h = self.font.getsize(text)  # text width, height
        self.draw.text((xy[0], xy[1] - h + 1), text, fill=txt_color, font=self.font)

    def result(self):
        # Return annotated image as array
        return np.asarray(self.im)
class Colors:
    def __init__(self):
        # hex = matplotlib.colors.TABLEAU_COLORS.values()
        hexs = ('FF3838', 'FF9D97', 'FF701F', 'FFB21D', 'CFD231', '48F90A', '92CC17', '3DDB86', '1A9334', '00D4BB',
                '2C99A8', '00C2FF', '344593', '6473FF', '0018EC', '8438FF', '520085', 'CB38FF', 'FF95C8', 'FF37C7')
        self.palette = [self.hex2rgb(f'#{c}') for c in hexs]
        self.n = len(self.palette)

    def __call__(self, i, bgr=False):
        c = self.palette[int(i) % self.n]
        return (c[2], c[1], c[0]) if bgr else c

    @staticmethod
    def hex2rgb(h):  # rgb order (PIL)
        return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))


colors = Colors()  # create instance for 'from utils.plots import colors'
img0 = cv2.imread('test.png')  # input image file name, change as needed
img = letterbox(img0, (640,640), stride=32, auto=False)  # only pt models use auto=True, but we are onnx
img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
img = np.ascontiguousarray(img)
im = torch.from_numpy(img).to(torch.device('cuda'))
im = im.float()
im /= 255  # 0 - 255 to 0.0 - 1.0
if len(im.shape) == 3:
    im = im[None]  # expand for batch dim
im = im.cpu().numpy()  # torch to numpy
y = session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: im})[0]  # inference with the onnx model to get the raw output
#non_max_suppression to remove redundant boxes
y = torch.from_numpy(y).to(torch.device('cuda'))
pred = non_max_suppression(y, conf_thres = 0.25, iou_thres = 0.45, agnostic= False, max_det=1000)
# transform coordinates to the original picture size
for i, det in enumerate(pred):
    det[:, :4] = scale_coords(im.shape[2:], det[:, :4], img0.shape).round()
print(det)
# label names, change to match your model
names = ['nofall', 'fall']
#initialize annotator
annotator = Annotator(img0, line_width=3)
#annotate the image
for *xyxy, conf, cls in reversed(det):
    c = int(cls)  # integer class
    label = f'{names[c]} {conf:.2f}'
    annotator.box_label(xyxy, label, color=colors(c, True))
# output file name, change as needed (note that 'test.png' here would overwrite the input image)
cv2.imwrite('test.png', img0)

A final word on the warmup. yolov5's detect.py first passes an empty tensor through the model to preload it, so that latency is lower on the first real prediction. In my test the empty-tensor pass took 0.73 s and the real prediction 0.01 s.
If you skip the warmup and predict directly, the first prediction takes roughly the sum of those two times, but from the second prediction onwards it drops to 0.01 s anyway, so warmup adds nothing in my scenario and I commented it out.
