Video object detection with YOLOV

Table of contents

 Introduction to YOLOV



 Introduction to YOLOV

Paper address: https://arxiv.org/pdf/2208.09686.pdf

Code address: https://github.com/YuHengsss/YOLOV

| Model | Size | mAP@50 (val) | Speed, 2080Ti batch size=1 (ms) | Weights |
|---|---|---|---|---|
| YOLOX-s | 576 | 69.5 | 9.4 | google |
| YOLOX-l | 576 | 76.1 | 14.8 | google |
| YOLOX-x | 576 | 77.8 | 20.4 | google |
| YOLOV-s | 576 | 77.3 | 11.3 | google |
| YOLOV-l | 576 | 83.6 | 16.4 | google |
| YOLOV-x | 576 | 85.5 | 22.7 | google |
| YOLOV-x + post | 576 | 87.5 | - | - |

Video object detection (VID) is challenging due to the high variation in object appearance and the various degradations that affect some frames. On the positive side, compared with still images, detection in one frame of a video can be supported by other frames. Therefore, how to aggregate features across different frames is the key to the VID problem.

Most existing aggregation algorithms are tailored for two-stage detectors, which, due to their two-stage nature, tend to be computationally expensive. The researchers propose a simple yet effective strategy that addresses this problem with marginal overhead while significantly improving accuracy. Specifically, unlike the traditional two-stage pipeline, they advocate selecting region-level candidates after the one-stage detection head, which avoids processing large numbers of low-quality candidates. In addition, a new module is built to evaluate the relationship between the target frame and its reference frames and to guide the aggregation.

Extensive experiments and ablation studies verify the effectiveness of the proposed design and show that it outperforms other state-of-the-art VID methods in both effectiveness and efficiency. The YOLOX-based model achieves promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU), making it attractive for large-scale or real-time applications.

Video object detection can be viewed as an advanced version of still image object detection. Intuitively, a video sequence can be processed by feeding frames one by one into a still image object detector. However, in this way, temporal information across frames will be wasted, which may be key to removing/reducing the ambiguity that occurs within a single image.

As shown in the figure above, degradations such as motion blur, camera defocus, and occlusion frequently occur in video frames and significantly increase the difficulty of detection. For example, looking only at the last frame in that figure, it would be difficult, if not impossible, for a human to tell where the object is and what it is. Video sequences, on the other hand, provide richer information than a single still image, so the prediction for a given frame can be supported by other frames in the same sequence. Therefore, how to efficiently aggregate temporal cues from different frames is crucial for accuracy. As the figure shows, the researchers' proposed method gives the correct answer.

New framework

Considering the characteristics of video (various degradations but rich temporal information), how to seek supporting information for the target frame (the key frame) from other frames, rather than processing frames individually, plays a key role in improving video detection accuracy. The significant accuracy gains of recent attempts confirm the importance of temporal aggregation for this problem. However, most existing methods are based on two-stage techniques.

As mentioned before, their main disadvantage is relatively slow inference compared with one-stage baselines. To alleviate this limitation, the researchers place region/feature selection after the prediction head of a one-stage detector.

[Figure: overall architecture of the proposed framework]

The researchers chose YOLOX as the base detector to demonstrate their main claims. The proposed framework is shown in the figure above.

Let’s review the traditional two-stage pipeline:

1) First "select" a large number of candidate areas as proposals; 

2) Determine whether each proposal is a target and to which class it belongs.

The computational bottleneck mainly comes from processing a large number of low-confidence region candidates.

As can be seen from the above figure, the proposed framework also contains two stages. The difference is that its first stage is prediction (discarding a large number of low-confidence regions), while the second stage can be viewed as region-level refinement (exploiting other frames through aggregation).

Through this principle, the new design benefits from both the efficiency of a one-stage detector and the accuracy gained from temporal aggregation. It is worth emphasizing that such a small design difference can lead to a huge difference in performance. The proposed strategy generalizes to many base detectors, such as YOLOX, FCOS and PPYOLOE.
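Below is a minimal, hypothetical PyTorch sketch of the two stages described above. It is not the authors' code: the function names (select_proposals, refine_with_references) and the values of conf_thresh and top_k are illustrative, and the second stage uses a plain average only as a stand-in for the paper's attention-based aggregation.

import torch

def select_proposals(boxes, scores, feats, conf_thresh=0.05, top_k=30):
    # Stage 1: discard low-confidence candidates right after the one-stage
    # prediction head, so only a handful of regions per frame reach the
    # aggregation stage.
    keep = scores > conf_thresh
    boxes, scores, feats = boxes[keep], scores[keep], feats[keep]
    order = scores.argsort(descending=True)[:top_k]
    return boxes[order], scores[order], feats[order]

def refine_with_references(key_feats, ref_feats):
    # Stage 2 (placeholder): refine the key frame's proposal features with
    # features pooled from the reference frames; the paper uses an
    # attention-style module here, a plain average is only for illustration.
    pooled = ref_feats.mean(dim=0, keepdim=True)
    return key_feats + pooled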

[Figure: the feature aggregation module with average pooling over reference features]

Furthermore, given the properties of softmax, a small number of reference features may hold most of the weight. In other words, low-weighted features are often effectively ignored, which limits the diversity of reference features that can be exploited afterwards.

To avoid this risk, the researchers introduce average pooling over reference features (A.P.). Specifically, all references with similarity scores higher than a threshold τ are selected, and average pooling is applied to them. Note that the similarity in this work is computed as N(Vc)N(Vc)^T, where the operator N(·) denotes layer normalization, which keeps the values within a certain range and removes the effect of scale differences. In this way, more information from the relevant features is retained. The average-pooled features and the key features are then fed into a linear projection layer for the final classification. The process is shown in the figure above.

One may ask whether N(Qc)N(Kc)^T or N(Qr)N(Kr)^T could serve as the similarity instead. This is indeed another option, but in practice, due to the difference between Q and K, it is not as stable during training as the chosen formulation.
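To make the A.P. step concrete, here is a rough PyTorch sketch, assuming Vc is an (N, C) matrix holding the classification features of the key proposal and its references. The function name, the threshold value and the final projection are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def average_pool_references(v_c, key_idx, tau=0.75, proj=None):
    # N(.) is layer normalization; it removes scale differences so the
    # similarity N(Vc) N(Vc)^T stays within a comparable range.
    v_norm = F.layer_norm(v_c, v_c.shape[-1:])
    sim = v_norm @ v_norm.t()                      # pairwise similarity, shape (N, N)

    key_feat = v_c[key_idx]                        # feature of the key proposal
    mask = sim[key_idx] > tau                      # references scoring above the threshold tau
    pooled = v_c[mask].mean(dim=0)                 # average pooling over the selected references

    fused = torch.cat([key_feat, pooled], dim=-1)  # concatenate key and pooled features
    # A linear projection layer would then map `fused` to the final class scores.
    return proj(fused) if proj is not None else fused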

The pre-trained weights used here detect animal categories. A single frame takes about 300 ms on a GTX 1060 GPU.

Video prediction code:

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Copyright (c) Megvii, Inc. and its affiliates.

import argparse
import os
import time
from loguru import logger

import cv2

import torch

from yolox.data.data_augment import ValTransform
from yolox.data.datasets import COCO_CLASSES
from yolox.data.datasets.vid_classes import VID_classes
# from yolox.data.datasets.vid_classes import OVIS_classes as VID_classes
from yolox.exp import get_exp
from yolox.utils import fuse_model, get_model_info, postprocess, vis
import random

IMAGE_EXT = [".jpg", ".jpeg", ".webp", ".bmp", ".png"]


def make_parser():
    parser = argparse.ArgumentParser("YOLOX Demo!")
    # parser.add_argument(
    #     "demo", default="video", help="demo type, eg. image, video and webcam"
    # )
    parser.add_argument("-expn", "--experiment_name", type=str, default=None)
    parser.add_argument("-n", "--name", type=str, default=None, help="model name")

    parser.add_argument("--path", default=r"C:\Users\Administrator\Videos\f7.mp4", help="path to images or video")

    parser.add_argument("--camid", type=int, default=0, help="webcam demo camera id")

    # exp file
    parser.add_argument("-f", "--exp_file", default='../exps/yolov/yolov_s.py', type=str, help="pls input your expriment description file", )
    parser.add_argument("-c", "--ckpt", default='../yolov_s.pth', type=str, help="ckpt for eval")
    # parser.add_argument("-c", "--ckpt", default='../yoloxs_vid.pth', type=str, help="ckpt for eval")
    parser.add_argument("--device", default="gpu", type=str, help="device to run our model, can either be cpu or gpu", )
    parser.add_argument("--dataset", default='vid', type=str, help="Decide pred classes")
    parser.add_argument("--conf", default=0.05, type=float, help="test conf")
    parser.add_argument("--nms", default=0.5, type=float, help="test nms threshold")
    parser.add_argument("--tsize", default=576, type=int, help="test img size")
    parser.add_argument("--fp16", dest="fp16", default=True, action="store_true", help="Adopting mix precision evaluating.", )
    parser.add_argument("--legacy", dest="legacy", default=False, action="store_true", help="To be compatible with older versions", )
    parser.add_argument("--fuse", dest="fuse", default=False, action="store_true", help="Fuse conv and bn for testing.", )
    parser.add_argument("--trt", dest="trt", default=False, action="store_true", help="Using TensorRT model for testing.", )
    parser.add_argument('--output_dir', default='out', help='path where to save, empty for no saving')
    parser.add_argument('--gframe', default=32, type=int, help='global frame num (number of frames processed together)')
    parser.add_argument('--save_result', default=True)
    return parser


def get_image_list(path):
    image_names = []
    for maindir, subdir, file_name_list in os.walk(path):
        for filename in file_name_list:
            apath = os.path.join(maindir, filename)
            ext = os.path.splitext(apath)[1]
            if ext in IMAGE_EXT:
                image_names.append(apath)
    return image_names


class Predictor(object):
    def __init__(self, model, exp, cls_names=COCO_CLASSES, trt_file=None, decoder=None, device="cpu", legacy=False, ):
        self.model = model
        self.cls_names = cls_names
        self.decoder = decoder
        self.num_classes = exp.num_classes
        self.confthre = exp.test_conf
        self.nmsthre = exp.nmsthre
        self.test_size = exp.test_size
        self.device = device
        self.preproc = ValTransform(legacy=legacy)
        self.model = model.half()  # keep the model in half precision; this assumes GPU inference

    def inference(self, img, img_path=None):
        tensor_type = torch.cuda.HalfTensor
        if self.device == "gpu":
            img = img.cuda()
            img = img.type(tensor_type)
        with torch.no_grad():
            t0 = time.time()
            outputs, outputs_ori = self.model(img, nms_thresh=self.nmsthre)
            logger.info("Infer time: {:.4f}s".format(time.time() - t0))
        return outputs

    def visual(self, output, img, ratio, cls_conf=0.0):

        if output is None:
            return img
        bboxes = output[:, 0:4]

        # preprocessing: resize
        bboxes /= ratio

        cls = output[:, 6]
        scores = output[:, 4] * output[:, 5]

        vis_res = vis(img, bboxes, scores, cls, cls_conf, self.cls_names)
        return vis_res


def image_demo(predictor, vis_folder, path, current_time, save_result):
    # Note: this image demo is kept from the original YOLOX demo script; main()
    # below only calls imageflow_demo(), so this branch is not exercised here.
    if os.path.isdir(path):
        files = get_image_list(path)
    else:
        files = [path]
    files.sort()
    for image_name in files:
        outputs, img_info = predictor.inference(image_name, [image_name])
        result_image = predictor.visual(outputs[0], img_info, predictor.confthre)
        if save_result:
            save_folder = os.path.join(vis_folder, time.strftime("%Y_%m_%d_%H_%M_%S", current_time))
            os.makedirs(save_folder, exist_ok=True)
            save_file_name = os.path.join(save_folder, os.path.basename(image_name))
            logger.info("Saving detection result in {}".format(save_file_name))
            cv2.imwrite(save_file_name, result_image)
        ch = cv2.waitKey(0)
        if ch == 27 or ch == ord("q") or ch == ord("Q"):
            break


def imageflow_demo(predictor, vis_folder, current_time, args):
    gframe = args.gframe
    cap = cv2.VideoCapture(args.path)
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)  # float
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)  # float
    fps = cap.get(cv2.CAP_PROP_FPS)
    save_folder = os.path.join(vis_folder, time.strftime("%Y_%m_%d_%H_%M_%S", current_time))

    os.makedirs(save_folder, exist_ok=True)
    ratio = min(predictor.test_size[0] / height, predictor.test_size[1] / width)
    save_path = os.path.join(save_folder, os.path.basename(args.path))  # basename handles both Windows and POSIX paths
    logger.info(f"video save_path is {save_path}")
    vid_writer = cv2.VideoWriter(save_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (int(width), int(height)))
    frames = []
    outputs = []
    ori_frames = []
    while True:
        ret_val, frame = cap.read()
        if ret_val:
            ori_frames.append(frame)
            frame, _ = predictor.preproc(frame, None, predictor.test_size)
            frames.append(torch.tensor(frame))
        else:
            break
    res = []
    frame_len = len(frames)
    index_list = list(range(frame_len))
    # Shuffle the frames (and an index list with the same seed) so that each clip
    # of `gframe` frames mixes frames from across the whole video; the shuffled
    # index list is used later to restore the original frame order.
    random.seed(41)
    random.shuffle(index_list)
    random.seed(41)
    random.shuffle(frames)

    split_num = int(frame_len / gframe)  # number of full clips of `gframe` frames

    for i in range(split_num):
        res.append(frames[i * gframe:(i + 1) * gframe])
    res.append(frames[split_num * gframe:])  # remaining frames (also avoids an error when frame_len < gframe)

    for ele in res:
        if not ele:
            continue
        ele = torch.stack(ele)
        t0 = time.time()
        outputs.extend(predictor.inference(ele))
    outputs = [j for _, j in sorted(zip(index_list, outputs))]
    for output, img in zip(outputs, ori_frames[:len(outputs)]):

        result_frame = predictor.visual(output, img, ratio, cls_conf=args.conf)

        if args.save_result:
            cv2.imshow("sdf",result_frame)
            cv2.waitKey(0)
            vid_writer.write(result_frame)


def main(exp, args):
    if not args.experiment_name:
        args.experiment_name = exp.exp_name

    file_name = os.path.join(exp.output_dir, args.experiment_name)
    os.makedirs(file_name, exist_ok=True)

    vis_folder = None
    if args.save_result:
        vis_folder = os.path.join(file_name, "vis_res")
        os.makedirs(vis_folder, exist_ok=True)

    if args.trt:
        args.device = "gpu"

    logger.info("Args: {}".format(args))

    if args.conf is not None:
        exp.test_conf = args.conf
    if args.nms is not None:
        exp.nmsthre = args.nms
    if args.tsize is not None:
        exp.test_size = (args.tsize, args.tsize)

    model = exp.get_model()
    logger.info("Model Summary: {}".format(get_model_info(model, exp.test_size)))

    if args.device == "gpu":
        model.cuda()
    model.eval()

    if not args.trt:
        if args.ckpt is None:
            ckpt_file = os.path.join(file_name, "best_ckpt.pth")
        else:
            ckpt_file = args.ckpt
        logger.info("loading checkpoint")
        ckpt = torch.load(ckpt_file, map_location="cpu")
        # load the model state dict
        model.load_state_dict(ckpt["model"])
        logger.info("loaded checkpoint done.")

    if args.fuse:
        logger.info("\tFusing model...")
        model = fuse_model(model)

    if args.trt:
        assert not args.fuse, "TensorRT model does not support model fusing!"
        trt_file = os.path.join(file_name, "model_trt.pth")
        assert os.path.exists(trt_file), "TensorRT model is not found!\n Run python3 tools/trt.py first!"
        model.head.decode_in_inference = False
        decoder = model.head.decode_outputs
        logger.info("Using TensorRT to inference")
    else:
        trt_file = None
        decoder = None
    if args.dataset == 'vid':
        predictor = Predictor(model, exp, VID_classes, trt_file, decoder, args.device, args.legacy)
    else:
        predictor = Predictor(model, exp, COCO_CLASSES, trt_file, decoder, args.device, args.legacy)
    current_time = time.localtime()

    imageflow_demo(predictor, vis_folder, current_time, args)


if __name__ == "__main__":
    args = make_parser().parse_args()
    exp = get_exp(args.exp_file, args.name)

    main(exp, args)
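If this prediction script is saved as, say, vid_demo.py in the YOLOV repository (the filename here is an assumption), it can be run roughly as follows; with the defaults above, usually only --path needs to be changed:

python vid_demo.py -f ../exps/yolov/yolov_s.py -c ../yolov_s.pth --path your_video.mp4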


Origin blog.csdn.net/jacke121/article/details/134917329