Detailed interpretation of YOLOV5

Detailed explanation of YOLOV5 detection algorithm

Preface

This article explains deep-learning-based object detection in detail, taking YOLOV5 as an example.

The overall process of object detection based on deep learning

Object detection based on deep learning mainly includes two parts: training and testing.

training phase

The purpose of training is to learn the parameters of the detection network from the training data set, which contains a large number of images together with their label information (object positions and categories). The training phase mainly consists of data preprocessing, the detection network, label assignment, and loss calculation.

1. Data preprocessing

The purpose of data preprocessing is to increase the diversity of the training data and thereby improve the detection ability of the network.
The preprocessing methods adopted by YOLOV5 mainly include: flipping, scaling, distortion, color space (HSV) transformation, and Mosaic.
Flip:

image = image.transpose(Image.FLIP_LEFT_RIGHT)  # flip the image horizontally with PIL
box[:, [0,2]] = iw - box[:, [2,0]]              # after flipping, the x coordinates of the boxes must be mirrored as well

Scaling:

# The real image width iw and height ih are usually not equal, so a distortion-free resize is used:
# the longer side is scaled to the network input size and the remaining area is padded with gray bars.

scale = min(w/iw, h/ih) # iw, ih: original image size; w, h: network input size; scale: resize factor
nw = int(iw*scale)      # image width after scaling
nh = int(ih*scale)      # image height after scaling
dx = (w-nw)//2          # x position of the scaled image on the gray canvas
dy = (h-nh)//2          # y position of the scaled image on the gray canvas
image   = image.resize((nw,nh), Image.BICUBIC)       # resize the input image to the scaled size
new_image   = Image.new('RGB', (w,h), (128,128,128)) # create a 3-channel gray canvas of size (w, h)
new_image.paste(image, (dx, dy))                     # paste the scaled image onto the canvas at (dx, dy)
image_data  = np.array(new_image, np.float32)        # convert to a numpy array
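
The letterbox code above only transforms the image; the ground-truth boxes must be scaled and shifted by the same factors. A minimal sketch, mirroring the box handling used in the Mosaic code further below:

# A minimal sketch (not the author's exact code): after letterboxing,
# the ground-truth boxes are scaled and shifted with the same factors.
if len(box) > 0:
    box[:, [0, 2]] = box[:, [0, 2]] * nw / iw + dx  # scale and shift x coordinates
    box[:, [1, 3]] = box[:, [1, 3]] * nh / ih + dy  # scale and shift y coordinates
    box[:, 0:2][box[:, 0:2] < 0] = 0                # clip boxes to the padded image
    box[:, 2][box[:, 2] > w] = w
    box[:, 3][box[:, 3] > h] = h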

Distortion:

new_ar = iw/ih * self.rand(1-jitter,1+jitter) / self.rand(1-jitter,1+jitter)  # iw, ih: original image size; jitter: distortion factor
scale = self.rand(.25, 2)
if new_ar < 1:
   nh = int(scale*h)
   nw = int(nh*new_ar)
else:
   nw = int(scale*w)
   nh = int(nw/new_ar)
image = image.resize((nw,nh), Image.BICUBIC)  # resize to the distorted size (applied in both branches)

Color space (HSV) transformation:

r  = np.random.uniform(-1, 1, 3) * [hue, sat, val] + 1  # random gains; hue, sat, val here are the jitter amplitudes
hue, sat, val   = cv2.split(cv2.cvtColor(image_data, cv2.COLOR_RGB2HSV))  # convert to HSV and split the channels (image_data is expected to be a uint8 RGB array here)
dtype  = image_data.dtype
x  = np.arange(0, 256, dtype=r.dtype)
lut_hue = ((x * r[0]) % 180).astype(dtype)          # lookup tables for each channel
lut_sat = np.clip(x * r[1], 0, 255).astype(dtype)
lut_val = np.clip(x * r[2], 0, 255).astype(dtype)
image_data = cv2.merge((cv2.LUT(hue, lut_hue), cv2.LUT(sat, lut_sat), cv2.LUT(val, lut_val)))
image_data = cv2.cvtColor(image_data, cv2.COLOR_HSV2RGB)  # back to RGB

Mosaic:

from random import sample, shuffle

import numpy as np
from PIL import Image

# rand(a, b) and cvtColor(image) below are small helpers from the referenced repo
# (a uniform random number in [a, b] and a conversion to RGB, respectively).
train_annotation_path = '1.txt'
with open(train_annotation_path, encoding='utf-8') as f:
    train_lines = f.readlines()
jitter = 0.3
h, w = [640, 640]
min_offset_x = rand(0.3, 0.7)
min_offset_y = rand(0.3, 0.7)
image_datas = []
box_datas   = []
index       = 0
lines = sample(train_lines, 3)
lines.append(train_lines[index])
shuffle(lines)  # randomly pick 4 images from the training set to stitch together
for line in lines:
    line_content = line.split()
    image = Image.open(line_content[0])
    image = cvtColor(image)  # cvtColor: helper from the referenced repo that ensures an RGB image
    iw, ih = image.size
    box = np.array([np.array(list(map(int,box.split(',')))) for box in line_content[1:]])
    new_ar = iw / ih * rand(1 - jitter, 1 + jitter) / rand(1 - jitter, 1 + jitter)
    scale = rand(.4, 1)
    if new_ar < 1:
        nh = int(scale * h)
        nw = int(nh * new_ar)
    else:
        nw = int(scale * w)
        nh = int(nw / new_ar)
    image = image.resize((nw, nh), Image.BICUBIC)
    if index == 0:  # compute where each of the four images is placed on the canvas
        dx = int(w * min_offset_x) - nw
        dy = int(h * min_offset_y) - nh
    elif index == 1:
        dx = int(w * min_offset_x) - nw
        dy = int(h * min_offset_y)
    elif index == 2:
        dx = int(w * min_offset_x)
        dy = int(h * min_offset_y)
    elif index == 3:
        dx = int(w * min_offset_x)
        dy = int(h * min_offset_y) - nh
    new_image = Image.new('RGB', (w, h), (128, 128, 128))
    new_image.paste(image, (dx, dy))
    image_data = np.array(new_image)

    index = index + 1
    box_data = []
    if len(box) > 0:  # re-map the boxes and clip any part that falls outside the image
        np.random.shuffle(box)
        box[:, [0, 2]] = box[:, [0, 2]] * nw / iw + dx
        box[:, [1, 3]] = box[:, [1, 3]] * nh / ih + dy
        box[:, 0:2][box[:, 0:2] < 0] = 0
        box[:, 2][box[:, 2] > w] = w
        box[:, 3][box[:, 3] > h] = h
        box_w = box[:, 2] - box[:, 0]
        box_h = box[:, 3] - box[:, 1]
        box = box[np.logical_and(box_w > 1, box_h > 1)]
        box_data = np.zeros((len(box), 5))
        box_data[:len(box)] = box

    image_datas.append(image_data)
    box_datas.append(box_data)
cutx = int(w * min_offset_x)
cuty = int(h * min_offset_y)
new_image = np.zeros([h, w, 3])
new_image[:cuty, :cutx, :] = image_datas[0][:cuty, :cutx, :]
new_image[cuty:, :cutx, :] = image_datas[1][cuty:, :cutx, :]
new_image[cuty:, cutx:, :] = image_datas[2][cuty:, cutx:, :]
new_image[:cuty, cutx:, :] = image_datas[3][:cuty, cutx:, :] 
new_image       = np.array(new_image, np.uint8)
merge_bbox = []
for i in range(len(box_datas)):  # adjust the boxes on the stitched image so they stay inside their quadrant
    for box in box_datas[i]:
        tmp_box = []
        x1, y1, x2, y2 = box[0], box[1], box[2], box[3]

        if i == 0:
            if y1 > cuty or x1 > cutx:
                continue
            if y2 >= cuty and y1 <= cuty:
                y2 = cuty
            if x2 >= cutx and x1 <= cutx:
                x2 = cutx

        if i == 1:
            if y2 < cuty or x1 > cutx:
                continue
            if y2 >= cuty and y1 <= cuty:
                y1 = cuty
            if x2 >= cutx and x1 <= cutx:
                x2 = cutx

        if i == 2:
            if y2 < cuty or x2 < cutx:
                continue
            if y2 >= cuty and y1 <= cuty:
                y1 = cuty
            if x2 >= cutx and x1 <= cutx:
                x1 = cutx

        if i == 3:
            if y1 > cuty or x2 < cutx:
                continue
            if y2 >= cuty and y1 <= cuty:
                y2 = cuty
            if x2 >= cutx and x1 <= cutx:
                x1 = cutx
        tmp_box.append(x1)
        tmp_box.append(y1)
        tmp_box.append(x2)
        tmp_box.append(y2)
        tmp_box.append(box[-1])
        merge_bbox.append(tmp_box)

2. Detection network

The detection network generally includes a backbone feature extraction network, a feature fusion network, and a prediction network.

Backbone Feature Extraction Network
YOLOV5 uses CSPDarknet as its backbone feature extraction network. Its main building blocks are described below.

(1) The Focus structure
The Focus module is essentially a slicing operation on the input tensor: values are taken every other pixel along the w and h directions of each channel, so each channel becomes four channels and the 3 input channels become 12. In this way, spatial information is moved into the channel dimension before the first convolution.

class Focus(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
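
A quick shape check of the Focus module (illustrative; it assumes the usual YOLOv5 Conv helper, i.e. Conv2d + BatchNorm + activation, is defined as in the repository):

# Illustrative shape check: spatial resolution halves, channels go 3 -> 12
# before the convolution maps them to c2.
import torch

x = torch.randn(1, 3, 640, 640)
focus = Focus(c1=3, c2=64, k=3)
print(focus(x).shape)  # torch.Size([1, 64, 320, 320])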

(2) The CSPLayer (C3) structure
This module borrows from the CSPNet design: a residual structure is nested inside another residual structure. self.cv2 forms the large residual (shortcut) branch, while self.m is the stack of nested Bottleneck blocks that forms the main branch.

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

(3) The SPP structure
Features are extracted by max pooling with several different kernel sizes and then concatenated, which enlarges the receptive field of the network.

class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))


Feature Fusion Network
YOLOV5 uses an FPN+PAN structure for feature fusion, combining the feature maps produced by the last three stages of the backbone. In the top-down (FPN) path, deep features are upsampled and concatenated with shallower features; in the bottom-up (PAN) path, the fused shallow features are downsampled and concatenated with the deeper features again. A CSP2 structure, also derived from CSPNet, is used in the neck to strengthen feature fusion.
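
The following is a simplified, illustrative sketch of the FPN+PAN fusion pattern (module names, channel numbers, and the plain convolutions are placeholders; the real YOLOv5 neck uses CSP/C3 blocks):

import torch
import torch.nn as nn

class SimpleFPNPAN(nn.Module):
    """Simplified illustration of FPN+PAN fusion: top-down upsample+concat,
    then bottom-up downsample+concat. Not the exact YOLOv5 neck."""
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.lat5  = nn.Conv2d(c5, c4, 1)                          # reduce the deepest feature
        self.lat4  = nn.Conv2d(c4, c3, 1)
        self.fuse4 = nn.Conv2d(c4 + c4, c4, 3, padding=1)          # top-down fusion at stride 16
        self.fuse3 = nn.Conv2d(c3 + c3, c3, 3, padding=1)          # top-down fusion at stride 8
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)     # bottom-up path
        self.fuse4b = nn.Conv2d(c3 + c3, c4, 3, padding=1)
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.fuse5b = nn.Conv2d(c4 + c4, c5, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode='nearest')

    def forward(self, feat3, feat4, feat5):
        # top-down (FPN): deep semantics flow to shallow layers via upsampling
        p5 = self.lat5(feat5)                                      # (c4, 20, 20)
        p4 = self.fuse4(torch.cat([self.up(p5), feat4], 1))        # (c4, 40, 40)
        p4r = self.lat4(p4)                                        # (c3, 40, 40)
        p3 = self.fuse3(torch.cat([self.up(p4r), feat3], 1))       # (c3, 80, 80) -> small-object head
        # bottom-up (PAN): shallow localization flows back down via strided convolution
        n4 = self.fuse4b(torch.cat([self.down3(p3), p4r], 1))      # (c4, 40, 40) -> medium head
        n5 = self.fuse5b(torch.cat([self.down4(n4), p5], 1))       # (c5, 20, 20) -> large head
        return p3, n4, n5

f3, f4, f5 = torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20)
print([t.shape for t in SimpleFPNPAN()(f3, f4, f5)])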

Prediction network
YOLOV5 obtains its final predictions with a single 1x1 convolution applied to each of the three fused feature maps:

yolo_head = nn.Conv2d(c1, len(anchors_mask[2]) * (5 + num_classes), 1)  # output channels on each feature layer: for each of the 3 anchors, the 4 box parameters, the confidence, and the probabilities of all classes
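
For illustration, with 80 classes (e.g. COCO) and 3 anchors per feature point, the head outputs 3 x (5 + 80) = 255 channels on each of the three feature maps (the channel count c1 below is a placeholder):

import torch
import torch.nn as nn

num_classes = 80          # assumed here for illustration (COCO)
num_anchors = 3           # anchors per feature point
c1 = 1024                 # channels of the deepest fused feature map (placeholder value)
yolo_head = nn.Conv2d(c1, num_anchors * (5 + num_classes), 1)

x = torch.randn(1, c1, 20, 20)   # 20x20 feature map for a 640x640 input
print(yolo_head(x).shape)        # torch.Size([1, 255, 20, 20])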

3. Label assignment and loss calculation

Label assignment provides the ground-truth targets that the detector's predictions are matched against. In object detection, common label assignment criteria include the IoU criterion, the distance criterion, the likelihood estimation criterion, and bipartite matching. Based on the assignment result, loss functions are used to compute the classification and regression losses, and the weights of the detection network are updated by backpropagation. Commonly used classification losses include the cross-entropy loss and the focal loss; commonly used regression losses include the smooth L1 loss, the IoU loss, and the GIoU loss.


The selection of positive samples during YOLOV5 training is divided into two steps: finding the matching prior boxes, and matching feature points.

Find the matching prior boxes
YOLOV5 places 9 prior boxes (anchors) across its three feature layers. For each ground-truth box, the width and height of each prior box are divided by the width and height of the GT box, and the GT width and height are also divided by the prior box width and height; the larger of the two ratios is kept. Each of the 9 resulting ratios is then compared with a preset threshold: if it is below the threshold, the prior box is close in size to the ground-truth box and can be used as a positive sample for training.
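
A minimal sketch of this width/height ratio test (the helper name is made up; the threshold and anchor sizes are common default values shown only for illustration):

import numpy as np

def match_anchors(gt_wh, anchors_wh, threshold=4.0):
    """Boolean (N, 9) mask: which prior boxes are close enough in size to each GT box.
    gt_wh: (N, 2) ground-truth widths/heights; anchors_wh: (9, 2) prior widths/heights."""
    ratio = gt_wh[:, None, :] / anchors_wh[None, :, :]       # (N, 9, 2)
    worst = np.maximum(ratio, 1.0 / ratio).max(axis=2)       # larger of the two ratios, per prior box
    return worst < threshold                                 # below the preset threshold -> positive sample

# example: one 100x80 box against three of the default anchors
gt = np.array([[100.0, 80.0]])
anchors = np.array([[10, 13], [116, 90], [373, 326]], dtype=np.float32)
print(match_anchors(gt, anchors))   # [[False  True False]]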

Match feature points
The previous step selected the prior boxes and therefore fixed the size of the prediction, but not its position. We then compute which grid cell the center of the ground-truth box falls into; the feature point at the top-left corner of that cell is responsible for predicting the box. To increase the number of positive samples, the two neighbouring grid cells closest to the center of the ground-truth box are also assigned, so each ground-truth box corresponds to three feature points, each carrying the prior-box sizes selected in the previous step.
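
A rough sketch of this feature-point assignment (the helper name and example values are made up for illustration):

def responsible_cells(cx, cy, stride, grid_w, grid_h):
    """Cell containing the GT center plus the two neighbouring cells closest to the center."""
    gx, gy = cx / stride, cy / stride          # center in grid units
    i, j = int(gx), int(gy)                    # cell containing the center
    cells = [(i, j)]
    # nearest horizontal neighbour: left if the center is in the left half of the cell, else right
    cells.append((i - 1, j) if gx - i < 0.5 else (i + 1, j))
    # nearest vertical neighbour: up if the center is in the top half of the cell, else down
    cells.append((i, j - 1) if gy - j < 0.5 else (i, j + 1))
    # keep only cells inside the feature map
    return [(a, b) for a, b in cells if 0 <= a < grid_w and 0 <= b < grid_h]

# example: a GT centered at (100, 250) on the stride-8 (80x80) feature map
print(responsible_cells(100, 250, stride=8, grid_w=80, grid_h=80))
# [(12, 31), (13, 31), (12, 30)]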

Loss calculation
The total loss consists of three parts: regression loss, confidence loss, and classification loss.
Regression loss: the adjustment parameters predicted by the network are applied to the matched prior boxes to obtain the predicted boxes, and an IoU-based loss is computed between the predicted boxes and the ground-truth boxes.

Confidence loss: a cross-entropy loss computed at the feature points, according to whether each point is a positive or negative sample, i.e. whether it contains an object.

Classification loss: a cross-entropy loss computed between the class of the ground-truth box and the predicted class scores.
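
A simplified sketch of how the three terms can be combined (an illustration only, not the exact YOLOv5 loss, which also weights the terms and uses an improved IoU variant):

import torch
import torch.nn as nn

def bbox_iou(box1, box2, eps=1e-7):
    """IoU between paired boxes in (x1, y1, x2, y2) format, both of shape (N, 4)."""
    inter_x1 = torch.max(box1[:, 0], box2[:, 0])
    inter_y1 = torch.max(box1[:, 1], box2[:, 1])
    inter_x2 = torch.min(box1[:, 2], box2[:, 2])
    inter_y2 = torch.min(box1[:, 3], box2[:, 3])
    inter = (inter_x2 - inter_x1).clamp(0) * (inter_y2 - inter_y1).clamp(0)
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    return inter / (area1 + area2 - inter + eps)

bce = nn.BCEWithLogitsLoss()

def detection_loss(pred_boxes, gt_boxes, pred_obj, target_obj, pred_cls, target_cls):
    box_loss = (1.0 - bbox_iou(pred_boxes, gt_boxes)).mean()  # regression: 1 - IoU on positive samples
    obj_loss = bce(pred_obj, target_obj)                      # confidence: all feature points
    cls_loss = bce(pred_cls, target_cls)                      # classification: positive samples only
    return box_loss + obj_loss + cls_loss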

testing phase

The test image is fed into the trained detection network to obtain raw predictions, which are then post-processed by decoding and non-maximum suppression (NMS); the result is the category and location of every object detected in the image.
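
A rough sketch of the decoding step for one feature layer (it assumes the raw head output has already been reshaped so that the last dimension holds the box parameters tx, ty, tw, th):

import torch

def decode_predictions(raw, anchors_wh, stride):
    """YOLOv5-style decoding sketch for one feature map.
    raw: (num_anchors, grid_h, grid_w, 4); anchors_wh: (num_anchors, 2) in pixels."""
    na, gh, gw, _ = raw.shape
    yv, xv = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing='ij')
    grid = torch.stack((xv, yv), dim=-1).float()                   # (gh, gw, 2) cell indices
    p = raw.sigmoid()
    xy = (p[..., 0:2] * 2.0 - 0.5 + grid) * stride                 # box centers in input-image pixels
    wh = (p[..., 2:4] * 2.0) ** 2 * anchors_wh.view(na, 1, 1, 2)   # widths/heights scaled from the priors
    return torch.cat((xy, wh), dim=-1)                             # (na, gh, gw, 4) boxes as cx, cy, w, h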

NMS
Non-maximum suppression works a bit like repeatedly selecting a maximum: for each class, the prediction box with the highest confidence is taken out first, and the IoU between it and every remaining prediction box is computed. Boxes with a large overlap are discarded, while boxes with a small overlap are kept and processed in the next round; in this way all highly overlapping duplicate boxes are removed.
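
A minimal per-class NMS sketch following the description above:

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop remaining boxes whose IoU with it is too large, repeat.
    boxes: (N, 4) as x1, y1, x2, y2; scores: (N,). Returns the indices of the kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-7)
        order = order[1:][iou < iou_threshold]   # keep only boxes with small overlap for the next round
    return keep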

This article references the code and blog of Bubbliiiing; many thanks.

Original post: blog.csdn.net/qq_50027359/article/details/127861763