Mask detection based on YOLOv5s

1. YOLOv5 algorithm principle and network structure

        YOLOv5 comes in four variants, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, ordered by network depth and width. This article uses YOLOv5s, which has the smallest network structure and the fastest image inference speed, at 0.007 s per image. The YOLOv5 network consists of four parts: Input, Backbone, Neck, and Head. The network structure diagram is shown in the figure.

YOLOv5 network structure diagram

1.1 Input

        The input stage uses Mosaic data augmentation, adaptive anchor box calculation, and uniform image scaling. Mosaic augmentation stitches four images together with random scaling, random cropping, and random arrangement, which greatly enriches the detection dataset, in particular adding many small targets and making the trained network more robust. In YOLOv5, each dataset gets anchor boxes with its own initial widths and heights. During training, predicted boxes are generated from these initial anchors and compared against the ground-truth boxes; the difference drives the backward pass that iteratively updates the network parameters, so the adaptive anchor calculation converges on the optimal anchor values. Because training images come in different sizes, the original images must first be scaled uniformly to a fixed size before being fed into the network. The images in this article are 608×608×3.
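        As a concrete illustration of the uniform scaling step, the sketch below resizes an image so its longer side matches the target size and pads the remainder, preserving the aspect ratio. This is a minimal, hypothetical helper written for illustration, not YOLOv5's exact letterbox implementation:

import cv2
import numpy as np

def letterbox(img, new_size=608, pad_value=114):
    # Minimal sketch: scale the longer side to new_size, keep the aspect
    # ratio, and pad the rest with a constant gray value.
    h, w = img.shape[:2]
    scale = new_size / max(h, w)  # one scale factor preserves the aspect ratio
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=np.uint8)
    top = (new_size - resized.shape[0]) // 2
    left = (new_size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas, scale, (left, top)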

1.2 Backbone

        Backbone includes the Focus, CSP, and SPP modules. Focus does not exist in the other YOLO algorithms; it performs a slicing operation. For example, a 608×608×3 image becomes 304×304×12 after slicing in the Focus structure, and a convolution with 32 kernels then turns it into a 304×304×32 feature map. YOLOv4 has only one CSP structure, whereas YOLOv5 has two, CSP1_x and CSP2_x, used in the Backbone and the Neck respectively (in YOLOv4 the CSP structure exists only in the Backbone). The CSP structure is shown in Figure 1, where the x in CSP1_x means there are x Res units (residual components) in CSP1, and the x in CSP2_x means there are 2x CBLs in CSP2; x controls the depth of the network. In the Backbone, SPP (Spatial Pyramid Pooling) max-pools the feature map at several scales and splices the resulting feature maps together.

(1) Focus structure

        The key to the Focus structure is the slicing operation. In the slicing demonstration, a 4×4×3 feature map is sliced into a 2×2×12 feature map. When the 608×608×3 three-channel image enters the Focus structure, slicing first turns it into a 304×304×12 feature map; a convolution with 32 kernels then turns it into a 304×304×32 feature map. Note that the Focus structure in YOLOv5s uses 32 convolution kernels, while the other three variants use larger numbers of kernels.

Focus structure diagram
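        The slicing step itself is just strided indexing: every second pixel goes into one of four sub-maps, which are stacked along the channel axis. A minimal PyTorch sketch of the operation described above (the subsequent 32-kernel convolution is omitted):

import torch

def focus_slice(x):
    # (B, 3, 608, 608) -> (B, 12, 304, 304): four pixel-parity sub-maps
    # concatenated on the channel dimension.
    return torch.cat([x[..., ::2, ::2],     # top-left pixels
                      x[..., 1::2, ::2],    # bottom-left pixels
                      x[..., ::2, 1::2],    # top-right pixels
                      x[..., 1::2, 1::2]],  # bottom-right pixels
                     dim=1)

x = torch.randn(1, 3, 608, 608)
print(focus_slice(x).shape)  # torch.Size([1, 12, 304, 304])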

(2) CSP structure

        YOLOv5 contains two CSP structures: CSP1_X in the Backbone network, and CSP2_X in the Neck. In the Backbone, the convolution kernels in the CSP module are 3×3 with a stride of 2, so for an input image of 608×608 the feature map changes as 608×608 → 304×304 → 152×152 → 76×76 → 38×38 → 19×19, finally yielding a 19×19 feature map.

        Advantages of using the CSP module: 1. It enhances the learning ability of the network, so the trained model stays lightweight while maintaining high accuracy. 2. It reduces computing bottlenecks. 3. It reduces memory cost.

CSP structure diagram
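        To make the diagram concrete, here is a simplified PyTorch sketch of a CSP1_x-style block built from the CBL and Res unit components named above. It is an illustration under simplifying assumptions (equal channel counts on both branches); YOLOv5's actual module splits channels and differs in detail:

import torch
import torch.nn as nn

class CBL(nn.Module):
    # Conv + BatchNorm + LeakyReLU, the basic unit in the diagrams.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x):
        return self.block(x)

class ResUnit(nn.Module):
    # Residual component: two CBLs plus a shortcut connection.
    def __init__(self, c):
        super().__init__()
        self.cbl1, self.cbl2 = CBL(c, c, 1), CBL(c, c, 3)

    def forward(self, x):
        return x + self.cbl2(self.cbl1(x))

class CSP1(nn.Module):
    # Cross-stage partial block: a main branch with x Res units and a
    # plain-convolution branch, concatenated and fused back together.
    def __init__(self, c, x=1):
        super().__init__()
        self.main = nn.Sequential(CBL(c, c), *[ResUnit(c) for _ in range(x)])
        self.shortcut = nn.Conv2d(c, c, 1, bias=False)
        self.fuse = CBL(2 * c, c)

    def forward(self, inp):
        return self.fuse(torch.cat([self.main(inp), self.shortcut(inp)], dim=1))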

(3) SPP structure

        A detailed introduction to the SPP structure can be found at: https://www.cnblogs.com/zongfa/p/9076311.html

SPP structure diagram
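        The SPP module is short enough to sketch directly. Assuming the 5/9/13 max-pool kernel sizes used in YOLOv5, each pool uses stride 1 and padding k//2 so the spatial size is unchanged, and the results are spliced with the input along the channel axis:

import torch
import torch.nn as nn

class SPP(nn.Module):
    # Spatial Pyramid Pooling: parallel max-pools at several kernel sizes,
    # concatenated together with the input on the channel axis.
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 512, 19, 19)
print(SPP()(x).shape)  # torch.Size([1, 2048, 19, 19])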

1.3Neck

        The Neck adopts an FPN+PAN structure. The FPN is top-down, using upsampling to pass and fuse information to obtain the predicted feature maps; the PAN then adds a bottom-up feature pyramid. The specific structure is shown in the figure.

FPN+PAN structure diagram 
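        A toy sketch of the two fusion passes, assuming for simplicity that the three input levels already share one channel count and that fusion is plain addition (the real Neck fuses with concatenation followed by CSP2 blocks and strided convolutions):

import torch
import torch.nn.functional as F

def fpn_pan_fuse(c3, c4, c5):
    # FPN pass: top-down, upsample deeper features and fuse with shallower ones.
    p4 = c4 + F.interpolate(c5, scale_factor=2, mode='nearest')
    p3 = c3 + F.interpolate(p4, scale_factor=2, mode='nearest')
    # PAN pass: bottom-up, downsample fused features and fuse again.
    n4 = p4 + F.max_pool2d(p3, 2)
    n5 = c5 + F.max_pool2d(n4, 2)
    return p3, n4, n5  # prediction feature maps at three scales

c3, c4, c5 = (torch.randn(1, 256, s, s) for s in (76, 38, 19))
print([t.shape[-1] for t in fpn_pan_fuse(c3, c4, c5)])  # [76, 38, 19]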

1.4 Prediction

        Prediction involves the bounding box loss function and non-maximum suppression (NMS). YOLOv5 uses GIOU_Loss as the loss function, which effectively handles the case where the bounding boxes do not overlap. In the post-processing stage, a weighted NMS operation filters the many candidate boxes down to the optimal target box.
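        For reference, the sketch below implements plain greedy NMS for boxes in (x1, y1, x2, y2) form. YOLOv5's post-processing uses a weighted variant of this idea, and in practice the library call torchvision.ops.nms performs this step:

import torch

def nms_sketch(boxes, scores, iou_thresh=0.45):
    # Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    # beyond iou_thresh, and repeat on the remainder.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        lt = torch.max(boxes[best, :2], boxes[rest, :2])   # intersection top-left
        rb = torch.min(boxes[best, 2:], boxes[rest, 2:])   # intersection bottom-right
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_best = (boxes[best, 2:] - boxes[best, :2]).prod()
        area_rest = (boxes[rest, 2:] - boxes[rest, :2]).prod(dim=1)
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_thresh]
    return keep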

(1) GIOU_Loss loss function

        The loss function of a target detection algorithm generally consists of two parts: the Classification Loss and the Bounding Box Regression Loss. The regression loss has evolved in recent years as: Smooth L1 Loss → IOU_Loss (2016) → GIOU_Loss (2019) → DIOU_Loss (2020) → CIOU_Loss (2020).

IOU_Loss picture

 The figure shows IOU_Loss: the yellow box is the predicted box and the blue box is the ground-truth box. Let the intersection of the predicted box and the ground-truth box be A, and their union be B. IOU is defined as the intersection A divided by the union B, and the corresponding loss is:

IOU = A / B,  IOU_Loss = 1 − IOU

        IOU_Loss is simple, but it has two problems.

        Problem 1: when the predicted box and the ground-truth box do not intersect, as in state 1 of figure (a), the IOU is 0, which cannot reflect the distance between the two boxes. The loss function is then not differentiable, so IOU_Loss cannot optimize the case where the predicted box and the ground-truth box do not intersect.

Diagram of IOU_Loss special cases

        Problem 2: when the predicted box and the ground-truth box have the same size, the IOU may also be the same, as in states 2 and 3 of figures (b) and (c) above. IOU_Loss then cannot distinguish between the two different situations. GIOU_Loss was therefore introduced as an improvement.

GIOU_Loss graph

         In the figure, the yellow box is the predicted box and the blue box is the ground-truth box. Let C be the smallest enclosing rectangle of the predicted box and the ground-truth box, and define the difference set as C minus the union B. Then:

GIOU = IOU − |C − B| / |C|,  GIOU_Loss = 1 − GIOU

        By also measuring the smallest enclosing box, GIOU_Loss improves the way the overlap is measured and mitigates the shortcomings of plain IOU_Loss.
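        Putting the two definitions above into code, a sketch of GIOU_Loss for axis-aligned boxes in (x1, y1, x2, y2) form:

import torch

def giou_loss(pred, target, eps=1e-7):
    # Intersection A
    iw = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(min=0)
    ih = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(min=0)
    inter = iw * ih
    # Union B
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box C
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c_area = cw * ch + eps
    giou = iou - (c_area - union) / c_area  # subtract |C - B| / |C|
    return 1.0 - giou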

2. Experiments and results

2.1 Experimental data set and experimental environment

2.1.1 Dataset

        The dataset is Face Mask Detection from Kaggle. It contains 853 images in three categories: people wearing masks, people not wearing masks, and people wearing masks incorrectly. Dataset link: https://www.kaggle.com/andrewmvd/face-mask-detection

        Dataset samples:

        The downloaded dataset cannot be used directly, because YOLOv5 does not read the VOC-style xml annotation files; it expects YOLO-format txt labels. So first organize the data according to the directory format shown below:
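        Concretely, based on the paths the conversion script below reads, the expected starting layout is:

VOCdevkit
└── VOC2007
    ├── Annotations   (the xml files from the Kaggle download)
    └── JPEGImages    (the image files)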

        Then run the following code to convert the dataset into one YOLOv5 can use:

import xml.etree.ElementTree as ET
import pickle
import os
from os import listdir, getcwd
from os.path import join
import random
from shutil import copyfile

classes = ["with_mask", "without_mask","mask_weared_incorrect"]
# classes=["ball"]

TRAIN_RATIO = 80


def clear_hidden_files(path):
    dir_list = os.listdir(path)
    for i in dir_list:
        abspath = os.path.join(os.path.abspath(path), i)
        if os.path.isfile(abspath):
            if i.startswith("._"):
                os.remove(abspath)
        else:
            clear_hidden_files(abspath)


def convert(size, box):
    dw = 1. / size[0]
    dh = 1. / size[1]
    x = (box[0] + box[1]) / 2.0
    y = (box[2] + box[3]) / 2.0
    w = box[1] - box[0]
    h = box[3] - box[2]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return (x, y, w, h)


def convert_annotation(image_id):
    in_file = open('VOCdevkit/VOC2007/Annotations/%s.xml' % image_id)
    out_file = open('VOCdevkit/VOC2007/YOLOLabels/%s.txt' % image_id, 'w')
    tree = ET.parse(in_file)
    root = tree.getroot()
    size = root.find('size')
    w = int(size.find('width').text)
    h = int(size.find('height').text)

    for obj in root.iter('object'):
        difficult = obj.find('difficult').text
        cls = obj.find('name').text
        if cls not in classes or int(difficult) == 1:
            continue
        cls_id = classes.index(cls)
        xmlbox = obj.find('bndbox')
        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text),
             float(xmlbox.find('ymax').text))
        bb = convert((w, h), b)
        out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')
    in_file.close()
    out_file.close()


wd = os.getcwd()
data_base_dir = os.path.join(wd, "VOCdevkit/")
if not os.path.isdir(data_base_dir):
    os.mkdir(data_base_dir)
work_space_dir = os.path.join(data_base_dir, "VOC2007/")
if not os.path.isdir(work_space_dir):
    os.mkdir(work_space_dir)
annotation_dir = os.path.join(work_space_dir, "Annotations/")
if not os.path.isdir(annotation_dir):
    os.mkdir(annotation_dir)
clear_hidden_files(annotation_dir)
image_dir = os.path.join(work_space_dir, "JPEGImages/")
if not os.path.isdir(image_dir):
    os.mkdir(image_dir)
clear_hidden_files(image_dir)
yolo_labels_dir = os.path.join(work_space_dir, "YOLOLabels/")
if not os.path.isdir(yolo_labels_dir):
    os.mkdir(yolo_labels_dir)
clear_hidden_files(yolo_labels_dir)
yolov5_images_dir = os.path.join(data_base_dir, "images/")
if not os.path.isdir(yolov5_images_dir):
    os.mkdir(yolov5_images_dir)
clear_hidden_files(yolov5_images_dir)
yolov5_labels_dir = os.path.join(data_base_dir, "labels/")
if not os.path.isdir(yolov5_labels_dir):
    os.mkdir(yolov5_labels_dir)
clear_hidden_files(yolov5_labels_dir)
yolov5_images_train_dir = os.path.join(yolov5_images_dir, "train/")
if not os.path.isdir(yolov5_images_train_dir):
    os.mkdir(yolov5_images_train_dir)
clear_hidden_files(yolov5_images_train_dir)
yolov5_images_test_dir = os.path.join(yolov5_images_dir, "val/")
if not os.path.isdir(yolov5_images_test_dir):
    os.mkdir(yolov5_images_test_dir)
clear_hidden_files(yolov5_images_test_dir)
yolov5_labels_train_dir = os.path.join(yolov5_labels_dir, "train/")
if not os.path.isdir(yolov5_labels_train_dir):
    os.mkdir(yolov5_labels_train_dir)
clear_hidden_files(yolov5_labels_train_dir)
yolov5_labels_test_dir = os.path.join(yolov5_labels_dir, "val/")
if not os.path.isdir(yolov5_labels_test_dir):
    os.mkdir(yolov5_labels_test_dir)
clear_hidden_files(yolov5_labels_test_dir)

# Truncate the image-list files, then reopen them for appending.
train_file = open(os.path.join(wd, "yolov5_train.txt"), 'w')
test_file = open(os.path.join(wd, "yolov5_val.txt"), 'w')
train_file.close()
test_file.close()
train_file = open(os.path.join(wd, "yolov5_train.txt"), 'a')
test_file = open(os.path.join(wd, "yolov5_val.txt"), 'a')
list_imgs = os.listdir(image_dir)  # list image files
for i in range(0, len(list_imgs)):
    path = os.path.join(image_dir, list_imgs[i])
    if not os.path.isfile(path):
        continue
    image_path = image_dir + list_imgs[i]
    voc_path = list_imgs[i]
    (nameWithoutExtension, extension) = os.path.splitext(os.path.basename(image_path))
    annotation_name = nameWithoutExtension + '.xml'
    annotation_path = os.path.join(annotation_dir, annotation_name)
    label_name = nameWithoutExtension + '.txt'
    label_path = os.path.join(yolo_labels_dir, label_name)
    prob = random.randint(1, 100)  # roll once per image for the train/val split
    if prob < TRAIN_RATIO:  # train dataset
        if os.path.exists(annotation_path):
            train_file.write(image_path + '\n')
            convert_annotation(nameWithoutExtension)  # convert label
            copyfile(image_path, yolov5_images_train_dir + voc_path)
            copyfile(label_path, yolov5_labels_train_dir + label_name)
    else:  # validation dataset
        if os.path.exists(annotation_path):
            test_file.write(image_path + '\n')
            convert_annotation(nameWithoutExtension)  # convert label
            copyfile(image_path, yolov5_images_test_dir + voc_path)
            copyfile(label_path, yolov5_labels_test_dir + label_name)
train_file.close()
test_file.close()

         After running the above code, the following directory format will be generated:
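        Based on the directories the script creates, the generated layout is:

VOCdevkit
├── VOC2007
│   ├── Annotations
│   ├── JPEGImages
│   └── YOLOLabels    (intermediate txt labels)
├── images
│   ├── train
│   └── val
└── labels
    ├── train
    └── val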

        You can see that running the code generates the images and labels folders in the VOCdevkit directory, each containing train and val subfolders. This is the required dataset: train holds the training set and val the validation set.

2.1.2 Dataset annotation

        Because annotated xml files are already provided with the dataset, no further annotation is done here. Dataset annotation is typically done with LabelImg.

2.1.3 Experimental environment

       The experiments run on Ubuntu 20.04.2 with dual Intel 4110 CPUs, 64 GB of memory, an RTX 2080 Ti graphics card, and Anaconda3.

2.2 YOLOv5 network training

       Generally, to shorten training time and reach better accuracy, we load pre-trained weights before training. Version 5.0 of yolov5 provides several pre-trained weights, and we can choose among them according to our needs; the figure below lists their names and sizes. In general, the larger the pre-trained weights, the higher the training accuracy but the slower the detection speed. The pre-trained weights can be downloaded from the YOLOv5 releases page. The weights used to train our own dataset here are yolov5s.pt.

        We first clone the YOLOv5 code locally with git (git clone https://github.com/ultralytics/yolov5.git), which produces the following directory layout:

        Then we place the downloaded yolov5s.pt weight file in the root directory of YOLOv5, and place the dataset in the YOLOv5 root directory as well:

        Here is an overview of the repository's directory layout:

├── data: mainly stores hyperparameter and dataset configuration files (these yaml files configure the paths of the training, test, and validation sets, as well as the number of detection classes and the class names), plus some official images provided for testing. When training your own dataset you need to modify one of these yaml files; however, it is not recommended to place the dataset itself under this path, but rather in a directory at the same level as the yolov5 project.

├── models: mainly contains the configuration files and functions for network construction, covering the four versions of the project, namely s, m, l, and x. As the names suggest, these versions grow in size; their detection speed goes from fast to slow while their accuracy goes from low to high, the usual speed-accuracy trade-off. To train your own dataset, modify the corresponding yaml file to train your own model.

├── utils: stores utility functions, including the loss functions, metrics, and plotting helpers.

├── weights: Place the trained weight parameters.

├── detect.py: runs detection with the trained weights, on images, videos, or a camera stream.

├── train.py: Function to train your own data set.

├── test.py: Function to test the training results.

├── requirements.txt: a text file listing the version-pinned packages the yolov5 project depends on; use it to install the matching versions.

        Then we modify the models/yolov5s.yaml file to change nc to 3, because ours is a 3-class problem, and create a biaoqing.yaml file in the data directory with the following content:

train: Mask_Datas/images/train  # train images (relative to 'path') 128 images
val: Mask_Datas/images/val  # val images (relative to 'path') 128 images


# Classes
nc: 3  # number of classes
names: ['with_mask', 'without_mask', 'mask_weared_incorrect']  # class names

        It specifies the paths of the training and validation sets, as well as the classes.

        Then we modify the train.py file in the YOLOv5 root directory, adjusting the hyperparameters as shown below. --weights specifies the pre-trained weights, --cfg the model configuration file, and --data the dataset configuration file. --epochs sets the number of training epochs (the default is 300; change it as needed), and --batch-size sets how many images are read at a time, usually a multiple of 8, chosen according to your machine's performance. An equivalent command line would be, for example, python train.py --weights yolov5s.pt --cfg models/yolov5s.yaml --data data/biaoqing.yaml --epochs 200 --batch-size 16.

        After modifying these parameters, we upload the code to the running environment as follows:

        We install the required libraries via pip install -r requirements.txt, then run python train.py to start training. The following is the training output.

        After 200 epochs of training, we obtain best.pt, the best-performing weight file produced during the run.

2.3 Experimental results and analysis 

 

        We view the results.png file as shown below:

As the number of iterations increases during training, the various metrics change as shown in the figure above. Their meanings are as follows:

GIoU: the closer the value is to 0, the more accurately the predicted boxes are drawn (box regression loss).

Objectness: the closer the value is to 0, the more accurate the objectness prediction (confidence loss).

Classification: the closer the value is to 0, the more accurate the classification (class loss).

Precision: the number of correctly detected targets divided by the total number of detections, i.e. TP / (TP + FP). The closer to 1, the higher the precision.

Recall: the number of correctly detected targets divided by the total number of ground-truth targets, i.e. TP / (TP + FN). The closer to 1, the higher the recall.

mAP@0.5 and mAP@0.5:0.95: AP is the area under the Precision-Recall curve, evaluated at an IoU threshold of 0.5 and averaged over thresholds from 0.5 to 0.95 respectively. The closer to 1, the higher the accuracy.

Confidence curve

P-R curve

Confusion matrix

 Experimental results:
