Camera smoking behavior detection system based on YOLOv5 (PyTorch)

Table of contents

1. Dataset introduction

 1.1 Dataset division

 1.2 Generate txt through voc_label.py

1.3 Small target definition

2. Improving smoking behavior detection performance based on YOLOv5

2.1 Using multi-scale to improve the detection accuracy of small targets

2.2 Analysis of multi-scale training results

2.3 Adding BiFormer based on multi-scale: building an efficient pyramid network architecture based on dynamic sparse attention

2.3.1 BiFormer principle introduction

2.3.2 Analysis of experimental results


1. Dataset introduction

Smoking behavior was captured through a camera; a total of 1812 images were collected and labeled, and the training set, validation set, and test set were randomly split in an 8:1:1 ratio.

 1.1 Dataset division

Generate trainval.txt, train.txt, val.txt, and test.txt with split_train_val.py:

# coding:utf-8

import os
import random
import argparse

parser = argparse.ArgumentParser()
# Path to the xml annotation files; change it for your own data (xml files are usually stored under Annotations)
parser.add_argument('--xml_path', default='Annotations', type=str, help='input xml label path')
# Output directory for the split lists; point it to ImageSets/Main under your own data
parser.add_argument('--txt_path', default='ImageSets/Main', type=str, help='output txt label path')
opt = parser.parse_args()

trainval_percent = 0.9
train_percent = 0.8
xmlfilepath = opt.xml_path
txtsavepath = opt.txt_path
total_xml = os.listdir(xmlfilepath)
if not os.path.exists(txtsavepath):
    os.makedirs(txtsavepath)

num = len(total_xml)
list_index = range(num)
tv = int(num * trainval_percent)
tr = int(tv * train_percent)
trainval = random.sample(list_index, tv)
train = random.sample(trainval, tr)

file_trainval = open(txtsavepath + '/trainval.txt', 'w')
file_test = open(txtsavepath + '/test.txt', 'w')
file_train = open(txtsavepath + '/train.txt', 'w')
file_val = open(txtsavepath + '/val.txt', 'w')

for i in list_index:
    name = total_xml[i][:-4] + '\n'
    if i in trainval:
        file_trainval.write(name)
        if i in train:
            file_train.write(name)
        else:
            file_val.write(name)
    else:
        file_test.write(name)

file_trainval.close()
file_train.close()
file_val.close()
file_test.close()

 1.2 Generate txt through voc_label.py

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import os
from os import getcwd

sets = ['train', 'val']
classes = ["smoke"]   # change to your own class names
abs_path = os.getcwd()
print(abs_path)

def convert(size, box):
    dw = 1. / (size[0])
    dh = 1. / (size[1])
    x = (box[0] + box[1]) / 2.0 - 1
    y = (box[2] + box[3]) / 2.0 - 1
    w = box[1] - box[0]
    h = box[3] - box[2]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return x, y, w, h

def convert_annotation(image_id):
    in_file = open('Annotations/%s.xml' % (image_id), encoding='UTF-8')
    out_file = open('labels/%s.txt' % (image_id), 'w')
    tree = ET.parse(in_file)
    root = tree.getroot()
    size = root.find('size')
    w = int(size.find('width').text)
    h = int(size.find('height').text)
    for obj in root.iter('object'):
        difficult = obj.find('difficult').text
        #difficult = obj.find('Difficult').text
        cls = obj.find('name').text
        if cls not in classes or int(difficult) == 1:
            continue
        cls_id = classes.index(cls)
        xmlbox = obj.find('bndbox')
        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text),
             float(xmlbox.find('ymax').text))
        b1, b2, b3, b4 = b
        # clip annotation boxes that exceed the image bounds
        if b2 > w:
            b2 = w
        if b4 > h:
            b4 = h
        b = (b1, b2, b3, b4)
        bb = convert((w, h), b)
        out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')

wd = getcwd()
for image_set in sets:
    if not os.path.exists('labels/'):
        os.makedirs('labels/')
    image_ids = open('ImageSets/Main/%s.txt' % (image_set)).read().strip().split()
    list_file = open('%s.txt' % (image_set), 'w')
    for image_id in image_ids:
        list_file.write(abs_path + '/images/%s.jpg\n' % (image_id))
        convert_annotation(image_id)
    list_file.close()
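
As a quick sanity check of the convert() function above (the numbers here are made up, not taken from the dataset): a 640×480 image with a VOC box xmin=100, xmax=300, ymin=200, ymax=400 becomes a normalized YOLO label (x_center, y_center, w, h):

# assumes convert() from voc_label.py above is in scope
x, y, w, h = convert((640, 480), (100.0, 300.0, 200.0, 400.0))
print(x, y, w, h)   # ≈ 0.311 0.623 0.3125 0.417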

Judging from the images, this task falls under small-target detection.

1.3 Small target definition


1) Taking COCO, the common benchmark dataset in object detection, as an example: a small object is one smaller than 32×32 pixels, a medium object is between 32×32 and 96×96, and a large object is larger than 96×96;
2) In practical application scenarios, a definition relative to the original image is usually preferred: take the square root of the ratio between the area of the object's bounding box (width × height) and the area of the whole image; if the result is less than 3%, the object is called a small target (see the small sketch below).
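
A minimal sketch of the relative-size rule in 2), using hypothetical box and image sizes (not taken from this dataset):

# An object counts as "small" if sqrt(box_area / image_area) < 3%.
def is_small_target(box_w, box_h, img_w, img_h, thresh=0.03):
    return (box_w * box_h / (img_w * img_h)) ** 0.5 < thresh

print(is_small_target(30, 25, 1920, 1080))   # True: a 30x25 box in a 1080p frame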

2. Improving smoking behavior detection performance based on YOLOv5

Results of the raw YOLOv5 model:

2.1 Using multi-scale to improve the detection accuracy of small targets

Accuracy-boosting trick: tiny target detection based on YOLOv5, using a multi-head detection head to improve small-target detection accuracy (AI Little Monster's Blog, CSDN).
 

Principle introduction: to detect such tiny targets better, a new detection head fed by P2-layer features is added to the YOLOv5 model (structure shown in Figure 2). The P2 detection head works on a 160×160 feature map, which corresponds to only two down-sampling operations in the backbone and therefore keeps much richer low-level feature information. The P2 features obtained from the top-down and bottom-up paths of the neck are fused by concat with the backbone feature of the same scale, so the output feature combines all three inputs, allowing the P2 head to detect small targets quickly and effectively. Together with the original three detection heads, the extra P2 head effectively alleviates the negative impact of scale variance. Because it is generated from low-level, high-resolution feature maps, this head is more sensitive to tiny targets. Adding it increases the computation and memory overhead of the model, but brings a large improvement in tiny-target detection.
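
A minimal PyTorch sketch of the concat-style fusion that feeds such a P2 head. This only illustrates the idea, it is not the actual YOLOv5 model code; the channel sizes and the 640×640 input assumption are hypothetical:

import torch
import torch.nn as nn

class P2Fusion(nn.Module):
    # Fuse an upsampled neck feature (stride 8, 80x80 for a 640x640 input)
    # with the stride-4 backbone feature (P2, 160x160) via concat, producing
    # the high-resolution feature an extra P2 detection head would consume.
    def __init__(self, neck_ch=128, p2_ch=64, out_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(neck_ch, p2_ch, 1)            # 1x1 conv before upsampling
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.fuse = nn.Conv2d(p2_ch * 2, out_ch, 3, padding=1)

    def forward(self, p3_neck, p2_backbone):
        x = self.up(self.reduce(p3_neck))                     # (B, p2_ch, 160, 160)
        x = torch.cat([x, p2_backbone], dim=1)                # concat fusion
        return self.fuse(x)

feat = P2Fusion()(torch.randn(1, 128, 80, 80), torch.randn(1, 64, 160, 160))
print(feat.shape)   # torch.Size([1, 64, 160, 160])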

2.2 Analysis of multi-scale training results

confusion_matrix.png: columns represent predicted classes and rows represent actual classes. The values on the diagonal are the proportions of correct predictions, while the off-diagonal elements are the mis-predicted parts. Higher values on the diagonal are better, meaning more predictions are correct.

 The figure above is the confusion matrix from training the smoking detector; the classes are smoke and background FP. The plot is normalized over each column, and the proportion of correct smoke predictions is 89%.

F1_curve.png: the relationship between the F1 score and confidence (x-axis). The F1 score is a classification metric, the harmonic mean of precision and recall, ranging from 0 to 1; higher is better.

TP: the ground truth is positive and the prediction is positive;

FN: the ground truth is positive and the prediction is negative;

FP: the ground truth is negative and the prediction is positive;

TN: the ground truth is negative and the prediction is negative;

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 × (Precision × Recall) / (Precision + Recall)
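
A tiny sketch of these formulas in Python (the counts below are made up for illustration, not taken from the training run):

def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf1(tp=89, fp=11, fn=15))   # ≈ (0.89, 0.856, 0.873)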

 labels_correlogram.jpg: shows the pairwise correlation between each label dimension and the others. The labels in the image are in xywh space.

 labels.jpg:

(1, 1) the number of instances of each class

(1, 2) the ground-truth bounding boxes

(2, 1) the distribution of the center coordinates of the ground-truth labels

(2, 2) the distribution of the widths and heights of the ground-truth labels

 P_curve.png: the relationship between precision and confidence, with confidence on the x-axis. As the figure below shows, the higher the confidence, the higher the precision.

 PR_curve.png: the precision-recall curve, where P stands for precision and R for recall; it shows the relationship between the two.

 R_curve.png : The relationship between recall and confidence

 results.png

 mAP_0.5:0.95 is the mAP averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05.
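
A minimal sketch of that averaging, using made-up per-threshold AP values purely to show the computation (real values come from the evaluation code):

import numpy as np

iou_thresholds = np.linspace(0.5, 0.95, 10)              # 0.50, 0.55, ..., 0.95
aps = np.linspace(0.90, 0.40, num=len(iou_thresholds))   # dummy AP at each threshold
map_50_95 = aps.mean()                                   # mAP_0.5:0.95
print(iou_thresholds, map_50_95)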

 Prediction results:

2.3 Adding BiFormer based on multi-scale: building an efficient pyramid network architecture based on dynamic sparse attention

YOLOv5/YOLOv7 introduces CVPR 2023 BiFormer: building an efficient pyramid network architecture based on dynamic sparse attention, with a significant improvement on small targets (AI Little Monster's Blog, CSDN).

2.3.1 BiFormer principle introduction

Paper: https://arxiv.org/pdf/2303.08810.pdf

Background: the attention mechanism is one of the core building blocks of Vision Transformers and can capture long-range dependencies. However, this power comes with a huge computational burden and memory overhead, since pairwise token interactions must be computed across all spatial locations. To alleviate this, a series of works introduce hand-crafted, content-independent sparsity into attention, for example by restricting attention to local windows, axial stripes, or dilated windows.

Method: the paper proposes dynamic sparse attention via bi-level routing. For a query, irrelevant key-value pairs are first filtered out at a coarse region level, and fine-grained token-to-token attention is then applied over the union of the remaining candidate regions (i.e., the routed regions). The proposed bi-level routing attention has a simple yet effective implementation that exploits sparsity to save computation and memory while involving only GPU-friendly dense matrix multiplications. A new general Vision Transformer, called BiFormer, is built on this basis.

 Figure (a) shows the original attention, which operates globally and therefore has high computational complexity and a large memory footprint. Figures (b)-(d) alleviate the complexity by introducing sparse attention, such as local windows, axial stripes, and dilated windows, while Figure (e) achieves image-adaptive sparsity through deformable attention over irregular grids. The authors argue that most of these methods try to alleviate the problem by introducing hand-crafted, content-independent sparsity into the attention mechanism. This paper therefore proposes a novel dynamic sparse attention via bi-level routing, which allows more flexible, content-aware computation allocation and gives the model dynamic, query-aware sparsity, as shown in Figure (f).
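
A heavily simplified, single-head PyTorch sketch of the bi-level routing idea (region-level routing followed by token-level attention). This is an illustration only: the official BiFormer implementation is multi-headed, adds a local context enhancement branch, and handles many details omitted here; all sizes below are hypothetical.

import torch
import torch.nn as nn

class BiLevelRoutingAttentionSketch(nn.Module):
    # Split the feature map into S x S regions, route each query region to its
    # top-k most related regions, then apply token-to-token attention only over
    # the tokens of those routed regions.
    def __init__(self, dim, num_regions=4, topk=2):
        super().__init__()
        self.S = num_regions
        self.topk = topk
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, H, W, C)
        B, H, W, C = x.shape
        S, hr, wr = self.S, H // self.S, W // self.S
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def to_regions(t):                              # -> (B, S*S, hr*wr, C)
            t = t.view(B, S, hr, S, wr, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, S * S, hr * wr, C)

        qr, kr, vr = map(to_regions, (q, k, v))

        # coarse routing: region-to-region affinity on pooled queries/keys,
        # keep only the top-k regions for every query region
        affinity = qr.mean(2) @ kr.mean(2).transpose(-2, -1)    # (B, S*S, S*S)
        topk_idx = affinity.topk(self.topk, dim=-1).indices     # (B, S*S, topk)

        # fine attention: gather key/value tokens of the routed regions, then
        # plain attention restricted to those tokens
        idx = topk_idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
        k_sel = torch.gather(kr.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx).reshape(B, S * S, -1, C)
        v_sel = torch.gather(vr.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx).reshape(B, S * S, -1, C)
        attn = (qr @ k_sel.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v_sel                      # (B, S*S, hr*wr, C)

        # restore the spatial layout
        out = out.view(B, S, S, hr, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)

attn = BiLevelRoutingAttentionSketch(dim=64)
print(attn(torch.randn(1, 32, 32, 64)).shape)   # torch.Size([1, 32, 32, 64])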

2.3.2 Analysis of experimental results

With BiFormer added on top of the multi-scale model, the mAP is further improved to 0.899.

Origin: blog.csdn.net/m0_63774211/article/details/132632191