Implementing YOLOv5 Model Inference with the OpenVINO Python API

Overview

This document describes how to export a YOLOv5 model to the OpenVINO IR format and run inference on it in Python with the openvino module.

The document mainly covers the following:

  • openvino module installation
  • Description of the model format
  • The basic openvino API, including initialization, model loading, retrieval of model parameters, model inference, etc.
  • Preprocessing of image data
  • Post-processing of inference results, including NMS, conversion of cxcywh coordinates to xyxy coordinates, etc.
  • Key method calls and parameter descriptions
  • Complete sample code

1. Environment deployment

Prerequisites

  1. (Windows) Visual Studio 2019
  2. Anaconda3 or Miniconda3

Note: Confirm that the machine has an Intel CPU before using openvino, otherwise it cannot be used.

Installing openvino

The following package versions are used in this article:

  • pytorch 1.7.1+cu110
  • onnxruntime-gpu 1.7.0

conda create -n openvino python=3.8 -y

conda activate openvino

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

pip install onnxruntime-gpu==1.7.0
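
Note that the commands above only prepare the PyTorch environment used for exporting the model; the openvino runtime itself still needs to be installed into the same environment. A typical installation from PyPI (pin whichever release matches your setup) is:

pip install openvino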

Model conversion (OpenVINO IR export)

The official YOLOv5 pre-trained models can be downloaded through the official download link; the model format is .pt. The official YOLOv5 project provides an export script that converts the .pt model to other formats, including OpenVINO IR (see the project link).

Model export command:

python export.py --weights yolov5s.pt --include openvino

Note: For installing and configuring the environment required to run the export script, refer to the official project's README documentation; it is not repeated here.

After the command completes, a folder named yolov5s_openvino_model is created alongside yolov5s.pt, containing the IR model files. The file structure is as follows:

yolov5s_openvino_model
├── yolov5s.bin
├── yolov5s.mapping
└── yolov5s.xml
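
The .xml file describes the network topology and the .bin file stores the weights (the .mapping file records operation-name mapping information produced during conversion). When loading, read_model only needs the .xml path and will automatically pick up the .bin file with the same name, but both paths can also be passed explicitly. A minimal sketch:

from openvino.runtime import Core

core = Core()
# the weights argument is optional; by default the .bin next to the .xml is used
model = core.read_model(model="yolov5s_openvino_model/yolov5s.xml",
                        weights="yolov5s_openvino_model/yolov5s.bin")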

2. OpenVINO basic API

2.1 Initialization

from openvino.runtime import Core
core = Core()
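
As a quick sanity check that the installation works, the runtime version can be printed (a small optional sketch; get_version is part of openvino.runtime):

from openvino.runtime import get_version

# print the installed OpenVINO runtime version string
print(get_version())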

2.2 Get device information

devices = core.available_devices

for device in devices:
    device_name = core.get_property(device, "FULL_DEVICE_NAME")
    print(f"{device}: {device_name}")

Example code output:

device_name: 12th Gen Intel(R) Core(TM) i5-12400F
device_name: NVIDIA GeForce RTX 2060 SUPER (dGPU)

2.3 Loading the model

Load the OpenVINO intermediate representation (IR) model file and compile it for a device; compile_model returns a CompiledModel object that is used for inference.

# load the openvino IR model
yolo_model_path = "weights/yolov5s_openvino_model/yolov5s.xml"
model = core.read_model(model=yolo_model_path)
compiled_model = core.compile_model(model=model, device_name="AUTO")

2.4 Get model input and output information

# Get input layer information
input_layer = model.inputs[0]
print(f"input_layer: {input_layer}")

# Get output layer information
output_layer = model.outputs
print(f"output_layer: {output_layer}")

Example code output:

input_layer: <Output: names[images] shape[1,3,640,640] type: f32>
output_layer: [<Output: names[output] shape[1,25200,85] type: f32>, <Output: names[345] shape[1,3,80,80,85] type: f32>, <Output: names[403] shape[1,3,40,40,85] type: f32>, <Output: names[461] shape[1,3,20,20,85] type: f32>]

As can be seen, the model has only one input layer but four output layers, which matches the multi-head output structure of the YOLOv5 model.
During inference, we only need to focus on the output layer named output.
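
For example, that detection head can be selected from the compiled model by name (or by index 0); a minimal sketch using the compiled model from section 2.3:

detection_output = compiled_model.output("output")
print(detection_output.shape)  # [1, 25200, 85]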

Furthermore, for the subsequent preprocessing and postprocessing of model inference, we need to obtain relevant information about the input and output layers of the model, including:

  • The name, shape, and type of the input layer
  • The name, shape, and type corresponding to the output layer
# If there is only one input layer, any_name can be called directly to get its name
input_name = input_layer.any_name
print(f"input_name: {input_name}")
N, C, H, W = input_layer.shape
print(f"N: {N}, C: {C}, H: {H}, W: {W}")
input_dtype = input_layer.element_type
print(f"input_dtype: {input_dtype}")

output_name = output_layer[0].any_name
print(f"output_name: {output_name}")
output_shape = output_layer[0].shape
print(f"output_shape: {output_shape}")
output_dtype = output_layer[0].element_type
print(f"output_dtype: {output_dtype}")

Example output:

input_name: images
N: 1, C: 3, H: 640, W: 640
input_dtype: <Type: 'float32'>
output_name: output
output_shape: [1,25200,85]
output_dtype: <Type: 'float32'>

2.5 Model Inference

import cv2
import numpy as np

image = cv2.imread(str(image_filename))
# image.shape = (height, width, channels)

# N, C, H, W = batch size, number of channels, height, width
N, C, H, W = input_layer.shape
# OpenCV resize expects the destination size as (width, height)
resized_image = cv2.resize(src=image, dsize=(W, H))
# resized_image.shape = (height, width, channels)

# HWC -> NCHW, add the batch dimension and cast to float32
input_data = np.expand_dims(np.transpose(resized_image, (2, 0, 1)), 0).astype(np.float32)
# input_data.shape = (N, C, H, W)

# a single output node is used as the key to look up the result
output_layer = compiled_model.output(0)

# for single-input models only
result = compiled_model(input_data)[output_layer]

# for multiple inputs, pass them in a list
result = compiled_model([input_data])[output_layer]

# or use a dictionary, where the key is the input tensor name or index
result = compiled_model({input_layer.any_name: input_data})[output_layer]
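
Each of the 25200 rows of this output holds the box in cxcywh format, an objectness confidence, and one score per class (85 = 4 + 1 + 80 for the COCO model), which is what the post-processing below relies on. A small sketch of slicing the result accordingly:

detections = result[0]              # shape (25200, 85)
boxes_cxcywh = detections[:, :4]    # cx, cy, w, h at the 640x640 input scale
objectness = detections[:, 4]       # objectness confidence per candidate box
class_scores = detections[:, 5:]    # per-class scores (80 classes for COCO)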

3. Key code

3.1 Image data preprocessing

Data preprocessing includes resize, normalization, color channel conversion, NCHW dimension conversion, etc.

Before resizing, a very common trick for handling non-square images is to take the longest side of the image, create a square canvas of that size, place the original image in the upper-left corner, and fill the rest with black. The advantage is that the aspect ratio of the original image is preserved and its content is not distorted.

 # image preprocessing, the trick is to make the frame to be a square but not twist the image
row, col, _ = frame.shape  # get the row and column of the origin frame array
_max = max(row, col)  # get the max value of row and column
input_image = np.zeros((_max, _max, 3), dtype=np.uint8)  # create a new array with the max value
input_image[:row, :col, :] = frame  # paste the original frame  to make the input_image to be a square

After padding the image, continue with the resize, normalization, and color channel conversion operations.

blob = cv2.dnn.blobFromImage(input_image, scalefactor=1 / 255.0, size=(640, 640), swapRB=True, crop=False)

  • image: the input image data as a numpy.ndarray with shape (H, W, C) and BGR channel order.
  • scalefactor: normalization coefficient for the pixel values, usually 1/255.0.
  • size: the resize target, determined by the model's input requirements; here (640, 640).
  • swapRB: whether to swap the color channels, i.e. convert BGR to RGB; True means swap, False means do not swap. Since opencv reads image data in BGR order and the YOLOv5 model expects RGB input, the channels must be swapped here.
  • crop: whether to crop the image; False means no cropping.

The blobFromImage function returns a four-dimensional Mat object in NCHW dimension order; the shape of the data is (1, 3, 640, 640).
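
For reference, the same preprocessing can be written out with plain numpy and opencv calls, which makes each step of blobFromImage explicit (a sketch, assuming input_image is the padded square image from above):

resized = cv2.resize(input_image, (640, 640))            # resize to the model input size
rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)           # BGR -> RGB (swapRB=True)
normalized = rgb.astype(np.float32) / 255.0              # scalefactor = 1 / 255.0
blob = np.expand_dims(normalized.transpose(2, 0, 1), 0)  # HWC -> NCHW plus batch dim
# blob.shape == (1, 3, 640, 640)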

3.2 Post-processing of inference results

Since the inference results contain a large number of overlapping bboxes, they need to be processed with NMS first, then filtered by comparing each bbox's confidence against the user-defined confidence threshold; this yields the final bboxes together with their corresponding classes and confidences.

3.2.1 NMS

The opencv-python module provides the NMSBoxes method for NMS processing.

cv2.dnn.NMSBoxes(bboxes, scores, score_threshold, nms_threshold, eta=None, top_k=None)

  • bboxes: the bbox list, with shape (N, 4), where N is the number of bboxes and the 4 values are x, y, w, h.
  • scores: the confidence list corresponding to the bboxes, with shape (N,), where N is the number of bboxes.
  • score_threshold: confidence threshold; bboxes with scores below it are filtered out.
  • nms_threshold: the NMS (IoU) threshold.

The return value of NMSBoxes is a list of indices of the kept bboxes, with shape (M,), where M is the number of kept bboxes.
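
A tiny standalone sketch of the call (the box and score values below are hypothetical, just to show the input and output shapes):

import cv2

# three hypothetical candidate boxes in (x, y, w, h) format and their confidences
boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [200, 200, 50, 50]]
scores = [0.9, 0.8, 0.75]

indices = cv2.dnn.NMSBoxes(boxes, scores, 0.6, 0.4)
print(indices)  # e.g. [0 2]: the second box heavily overlaps the first and is suppressed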

3.2.2 score_threshold filtering

Using the index list produced by NMS, keep the corresponding bboxes and then filter out those whose confidence is lower than score_threshold.
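
In array form this amounts to two indexing steps (a sketch, assuming output_data holds one image's (25200, 85) result and indices comes from NMSBoxes):

kept = output_data[indices]                # rows kept by NMS
kept = kept[kept[:, 4] > score_threshold]  # drop low-confidence boxes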

3.2.3 bbox coordinate conversion and restoration

The bbox coordinates output by the YOLOv5 model are in cxcywh format and need to be converted to xyxy format. In addition, since the image was padded and resized before inference, the bbox coordinates also need to be scaled back to the size of the original image.
The conversion method is as follows:

# Get the size of the original (padded) image
# note: numpy's shape is (rows, cols, channels); since the padded image is square, the order does not matter here
image_width, image_height, _ = input_image.shape
# Compute the scale factors relative to the 640x640 model input
x_factor = image_width / INPUT_WIDTH    # INPUT_WIDTH = 640
y_factor = image_height / INPUT_HEIGHT  # INPUT_HEIGHT = 640

# Convert cxcywh coordinates to xyxy coordinates and scale back to the padded image
x1 = int((x - w / 2) * x_factor)
y1 = int((y - h / 2) * y_factor)
w = int(w * x_factor)
h = int(h * y_factor)
x2 = x1 + w
y2 = y1 + h

x1, y1, x2, y2 are then the xyxy coordinates of the bbox in the original image.

4. Sample code

There are two versions of the source code: one is a straight sequence of function calls, which is more convenient for debugging; the other is packaged into a class, which is easier to integrate into other projects.

4.1 Unpackaged

"""

this file is to demonstrate how to use openvino to do inference with yolov5 model exported from onnx to openvino format
"""

from typing import List

import cv2
import numpy as np
import time
from pathlib import Path
from openvino.runtime import Core, CompiledModel


def build_model(model_path: str) -> CompiledModel:
    """
    build and compile the model with the openvino runtime
    Args:
        model_path: the path of the model, pointing to the OpenVINO IR .xml file

    Returns:
        the compiled model object
    """
    # load the model
    core = Core()
    model = core.read_model(model_path)
    for device in core.available_devices:
        print(device)
    compiled_model = core.compile_model(model=model, device_name="AUTO")
    # output_layer = compiled_model.output(0)
    return compiled_model


def inference(image: np.ndarray, model: CompiledModel) -> np.ndarray:
    """
    run the model on the input image
    Args:
        image: the input image in numpy array format, the shape should be (height, width, channel),
        the color channels should be in BGR order, like the original opencv image format
        model: the compiled model object

    Returns:
        the output data of the model, the shape should be (1, 25200, nc+5), nc is the number of classes
    """
    # image preprocessing: resize, normalization, channel swap from BGR to RGB, and conversion to blob format
    # this produces a 4-dimensional Mat with NCHW dimensions order
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (INPUT_WIDTH, INPUT_HEIGHT), swapRB=True, crop=False)

    output_layer = model.output(0)

    start = time.perf_counter()
    # inference
    outs = model([blob])[output_layer]
    end = time.perf_counter()

    # print("inference time: ", end - start)

    # the shape of the output data is (1, 25200, nc+5), nc is the number of classes
    return outs


def xywh_to_xyxy(bbox_xywh, image_width, image_height):
    """
    Convert bounding box coordinates from (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max) format.

    Parameters:
        bbox_xywh (list or tuple): Bounding box coordinates in (center_x, center_y, width, height) format.
        image_width (int): Width of the image.
        image_height (int): Height of the image.

    Returns:
        tuple: Bounding box coordinates in (x_min, y_min, x_max, y_max) format.
    """
    center_x, center_y, width, height = bbox_xywh
    x_min = max(0, int(center_x - width / 2))
    y_min = max(0, int(center_y - height / 2))
    x_max = min(image_width - 1, int(center_x + width / 2))
    y_max = min(image_height - 1, int(center_y + height / 2))
    return x_min, y_min, x_max, y_max


def wrap_detection(
        input_image: np.ndarray,
        output_data: np.ndarray,
        labels: List[str],
        confidence_threshold: float = 0.6
) -> (List[int], List[float], List[List[int]]):
    # the shape of the output_data is (25200,5+nc),
    # the first 5 elements are [x, y, w, h, confidence], the rest are prediction scores of each class

    # numpy's shape is (rows, cols, channels); the padded input image is square, so the order does not matter here
    image_width, image_height, _ = input_image.shape
    x_factor = image_width / INPUT_WIDTH
    y_factor = image_height / INPUT_HEIGHT

    # transform the output_data[:, 0:4] from (x, y, w, h) to (x_min, y_min, x_max, y_max)
    # output_data[:, 0:4] = np.apply_along_axis(xywh_to_xyxy, 1, output_data[:, 0:4], image_width, image_height)

    indices = cv2.dnn.NMSBoxes(output_data[:, 0:4].tolist(), output_data[:, 4].tolist(), 0.6, 0.4)

    # print(indices)
    raw_boxes = output_data[:, 0:4][indices]
    raw_confidences = output_data[:, 4][indices]
    raw_class_prediction_probabilities = output_data[:, 5:][indices]

    criteria = raw_confidences > confidence_threshold
    raw_class_prediction_probabilities = raw_class_prediction_probabilities[criteria]
    raw_boxes = raw_boxes[criteria]
    raw_confidences = raw_confidences[criteria]

    bounding_boxes, confidences, class_ids = [], [], []
    for class_prediction_probability, box, confidence in zip(raw_class_prediction_probabilities, raw_boxes,
                                                             raw_confidences):
        # find the least and most probable classes' indices and their probabilities
        # min_val, max_val, min_loc, mac_loc = cv2.minMaxLoc(class_prediction_probability)
        most_probable_class_index = np.argmax(class_prediction_probability)
        label = labels[most_probable_class_index]
        confidence = float(confidence)


        x, y, w, h = box
        left = int((x - 0.5 * w) * x_factor)
        top = int((y - 0.5 * h) * y_factor)
        width = int(w * x_factor)
        height = int(h * y_factor)
        bounding_box = [left, top, width, height]
        bounding_boxes.append(bounding_box)
        confidences.append(confidence)
        class_ids.append(most_probable_class_index)

    return class_ids, confidences, bounding_boxes


coco_class_names = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
                    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
                    "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
                    "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball",
                    "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
                    "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
                    "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
                    "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
                    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
                    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier",
                    "toothbrush"]
# generate different colors for coco classes
colors = np.random.uniform(0, 255, size=(len(coco_class_names), 3))

INPUT_WIDTH = 640
INPUT_HEIGHT = 640
CONFIDENCE_THRESHOLD = 0.7
NMS_THRESHOLD = 0.45


def video_detector(video_src):
    cap = cv2.VideoCapture(video_src)

    # 3. inference and show the result in a loop
    while cap.isOpened():
        success, frame = cap.read()
        start = time.perf_counter()
        if not success:
            break

        # image preprocessing, the trick is to make the frame to be a square but not twist the image
        row, col, _ = frame.shape  # get the row and column of the origin frame array
        _max = max(row, col)  # get the max value of row and column
        input_image = np.zeros((_max, _max, 3), dtype=np.uint8)  # create a new array with the max value
        input_image[:row, :col, :] = frame  # paste the original frame  to make the input_image to be a square

        # inference
        output_data = inference(input_image, net)  # the shape of output_data is (1, 25200, 85)

        # define coco dataset class names dictionary

        # 4. wrap the detection result
        class_ids, confidences, boxes = wrap_detection(input_image, output_data[0], coco_class_names)

        # wrap_detection(input_image, output_data[0], coco_class_names) ##

        # 5. draw the detection result on the frame
        for (class_id, confidence, box) in zip(class_ids, confidences, boxes):
            color = colors[int(class_id) % len(colors)]
            label = coco_class_names[int(class_id)]

            # cv2.rectangle(frame, box, color, 2)

            # print(type(box), box[0], box[1], box[2], box[3], box)
            xmin, ymin, width, height = box
            cv2.rectangle(frame, (xmin, ymin), (xmin + width, ymin + height), color, 2)
            # cv2.rectangle(frame, box, color, 2)
            # cv2.rectangle(frame, [box[0], box[1], box[2], box[3]], color, thickness=2)

            # cv2.rectangle(frame, (box[0], box[1] - 20), (box[0] + 100, box[1]), color, -1)
            cv2.putText(frame, str(label), (box[0], box[1] - 5), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)

        finish = time.perf_counter()
        FPS = round(1.0 / (finish - start), 2)
        cv2.putText(frame, str(FPS), (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
        # 6. show the frame
        cv2.imshow("frame", frame)

        # 7. press 'q' to exit
        if cv2.waitKey(1) == ord('q'):
            break

    # 8. release the capture and destroy all windows
    cap.release()
    cv2.destroyAllWindows()


if __name__ == '__main__':
    # steps to use the openvino runtime to run the yolov5 IR model and show the result:

    # 1. load the model
    model_path = Path("weights/yolov5s_openvino_model/yolov5s.xml")
    # model_path = Path("weights/POT_INT8_openvino_model/yolov5s_int8.xml")
    net = build_model(str(model_path))
    # 2. load the video capture
    video_source = 0
    # video_source = 'rtsp://admin:[email protected]:554/h264/ch1/main/av_stream'
    video_detector(video_source)

    exit(0)

4.2 Encapsulated into a class

import cv2
import numpy as np
from pathlib import Path
import time
from typing import List
from glob import glob
from openvino.runtime import Core


class YoloV5OpenvinoInference:
    def __init__(self, model_path: str,
                 imgsize: int = 640,
                 labels: List[str] = None,
                 score_threshold: float = 0.6,
                 nms_threshold: float = 0.45):
        self.load_model(model_path)
        self.imgsize = imgsize
        self.score_threshold = score_threshold
        self.nms_threshold = nms_threshold
        self.coco_class_names = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
                                 "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
                                 "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
                                 "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard",
                                 "sports ball",
                                 "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
                                 "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
                                 "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake",
                                 "chair",
                                 "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
                                 "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
                                 "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier",
                                 "toothbrush"]
        if labels is None or len(labels) == 0:
            self.labels = self.coco_class_names
        else:
            self.labels = labels
        self.colors = np.random.uniform(0, 255, size=(len(self.labels), 3))

        self.x_factor = 1
        self.y_factor = 1
        self.xyxy = True

    def load_model(self, model_path: str) -> None:

        if not Path(model_path).exists():
            raise FileNotFoundError(f"model file {model_path} not found")
        self.core = Core()
        self.loaded_model = self.core.read_model(model_path)
        self.compiled_model = self.core.compile_model(model=self.loaded_model, device_name="AUTO")
        self.output_layer = self.loaded_model.output(0)

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        row, col, _ = image.shape
        _max = max(row, col)
        input_image = np.zeros((_max, _max, 3), dtype=np.uint8)
        input_image[:row, :col, :] = image
        image_width, image_height, _ = input_image.shape
        self.x_factor = image_width / self.imgsize
        self.y_factor = image_height / self.imgsize
        return input_image

    def inference(self, image: np.ndarray) -> np.ndarray:
        image = self.preprocess(image)
        blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (self.imgsize, self.imgsize), swapRB=True, crop=False)
        start = time.perf_counter()
        outs = self.compiled_model([blob])

        # TODO
        # model.output(0) and the keys of the inference result may have different types
        # (openvino.runtime.ConstOutput vs openvino.runtime.Output), which can make indexing
        # the result dictionary fail; as a temporary workaround, look the output node up by name
        output_layer = [item for item in outs if item.any_name == "output"][0]
        outs = outs[output_layer]
        # outs = self.compiled_model([blob])['output']
        end = time.perf_counter()
        print("inference time: ", end - start)
        return outs[0]

    def wrap_detection(self, result: np.ndarray, to_xyxy: bool = True) -> (List[int], List[float], List[List[int]]):

        # using NMS algorithm to filter out the overlapping bounding boxes
        indices = cv2.dnn.NMSBoxes(result[:, 0:4].tolist(), result[:, 4].tolist(), 0.6, 0.4)
        # get the real data after filtering
        result = result[indices]

        # filter the bounding boxes with confidence lower than the threshold
        result = result[result[:, 4] > self.score_threshold]

        bounding_boxes, confidences, classes = [], [], []
        for item in result:
            box = item[0:4]
            confidence = float(item[4])
            class_prediction_probability = item[5:]
            most_probable_class_index = np.argmax(class_prediction_probability)
            #
            x, y, w, h = box
            left = int((x - 0.5 * w) * self.x_factor)
            top = int((y - 0.5 * h) * self.y_factor)
            width = int(w * self.x_factor)
            height = int(h * self.y_factor)

            if to_xyxy:
                self.xyxy = True
                bounding_box = [left, top, left + width, top + height]
            else:
                self.xyxy = False
                bounding_box = [left, top, width, height]

            bounding_boxes.append(bounding_box)
            confidences.append(confidence)
            classes.append(most_probable_class_index)

        return classes, confidences, bounding_boxes

    def detect(self, image: np.ndarray, visualize=True) -> np.ndarray:
        # inference() already pads the image to a square via preprocess()
        result = self.inference(image)
        class_ids, confidences, boxes = self.wrap_detection(result, to_xyxy=True)

        if visualize:
            for (class_id, confidence, box) in zip(class_ids, confidences, boxes):
                color = self.colors[int(class_id) % len(self.colors)]
                label = self.labels[int(class_id)]
                if self.xyxy:
                    cv2.rectangle(image, (box[0], box[1]), (box[2], box[3]), color, 2)
                else:
                    cv2.rectangle(image, box, color, 2)

                cv2.rectangle(image, (box[0], box[1] - 20), (box[0] + 100, box[1]), color, -1)
                cv2.putText(image, str(label), (box[0], box[1] - 5), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
        return image


def video_detector(video_src):
    cap = cv2.VideoCapture(video_src)

    while cap.isOpened():
        success, frame = cap.read()

        if not success:
            break

        frame = yolo_v5.detect(frame)
        cv2.imshow("frame", frame)

        if cv2.waitKey(1) == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


def image_detector(image_src):
    image = cv2.imread(image_src)
    image = yolo_v5.detect(image)
    cv2.imshow("image", image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()


if __name__ == '__main__':
    model_path = "weights/yolov5s_openvino_model/yolov5s.xml"
    yolo_v5 = YoloV5OpenvinoInference(model_path=model_path)

    video_source = 0
    # video_source = 'rtsp://admin:[email protected]:554/h264/ch1/main/av_stream'
    video_detector(video_source)

    # image_path = "data/images/bus.jpg"
    # image_detector(image_path)
    exit(0)

Reference link

Origin blog.csdn.net/LJX_ahut/article/details/132248340