YOLOv5 model and code interpretation

Backbone: New CSPDarknet-53
Neck: SPPF, New CSP-PAN
Head: YOLOv3 head
The figure below shows the YOLOv5l architecture:
(figure)

Improvements:

1. Focus module

The Focus module splits the input into 2x2 patches of adjacent pixels and gathers the pixels at the same position (same color in the figure) within each patch, producing 4 subsampled feature maps. These are concatenated along the channel dimension and passed to a 3x3 convolution. The whole operation is equivalent to a single 6x6 convolution with stride 2.

(figure)

Code:

class Focus(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)
        # self.contract = Contract(gain=2)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
        # return self.conv(self.contract(x))

x[..., ::2, ::2] takes every second row and every second column starting from index 0 (the values in rows 1 and 3 and columns 1 and 3 in the figure); x[..., 1::2, ::2], x[..., ::2, 1::2] and x[..., 1::2, 1::2] pick out the other three positions of each 2x2 patch.

(figure)
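A tiny sketch (my own, assuming a recent PyTorch) showing what the four strided slices inside Focus.forward() do to a dummy tensor:

import torch

x = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).reshape(2, 3, 4, 4)

# the four strided slices used inside Focus.forward()
patches = [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]]
y = torch.cat(patches, dim=1)

print(y.shape)  # torch.Size([2, 12, 2, 2]): (b, c, h, w) -> (b, 4c, h/2, w/2), spatial detail moved into channels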

2. SPPF

SPP is replaced with SPPF.
Two consecutive MaxPools with k=5 are equivalent to one MaxPool with k=9.
Three consecutive MaxPools with k=5 are equivalent to one MaxPool with k=13.
Instead of running the k=5/9/13 pools in parallel as SPP does, the author chains k=5 MaxPools serially and concatenates the intermediate outputs, which gives the same receptive fields with less computation and still helps with the multi-scale target problem to some extent.
(figure)

(figure)
Code:

class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)   # this is the first CBL (Conv-BN-activation) block
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)
            # y1 and y2 above are the first and second poolings of x
            return self.cv2(torch.cat([x, y1, y2, self.m(y2)], 1))
        # concatenate x, y1 (pooled once), y2 (pooled twice) and self.m(y2) (pooled three times), then apply the second CBL
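A small sanity check (my own sketch, not from the repo) confirming that chaining k=5 max-pools reproduces the k=9 and k=13 pools of the original SPP:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
m5 = nn.MaxPool2d(5, stride=1, padding=2)
m9 = nn.MaxPool2d(9, stride=1, padding=4)
m13 = nn.MaxPool2d(13, stride=1, padding=6)

y5 = m5(x)
assert torch.equal(m5(y5), m9(x))       # two k=5 pools == one k=9 pool
assert torch.equal(m5(m5(y5)), m13(x))  # three k=5 pools == one k=13 pool

So SPPF concatenates [x, m5(x), m5(m5(x)), m5(m5(m5(x)))], which covers the same receptive fields as SPP's parallel k=5/9/13 pools while reusing the intermediate results.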

3. Data augmentation

3.1 Mosaic

Four images are combined into one (unchanged from YOLOv4).
(figure)
Code:

def load_mosaic(self, index):
    # YOLOv5 4-mosaic loader. Loads 1 image + 3 random images into a 4-image mosaic
    labels4, segments4 = [], []
    s = self.img_size       # image size

    '''
    Mosaic pipeline:
    1. Initialize a background canvas of size (2*s, 2*s).
    2. Randomly pick a mosaic center in the range (-x, 2*s + x) for x in mosaic_border.
    3. Randomly pick three more images and place the four images at the top-left, top-right,
       bottom-left and bottom-right of the center; parts that do not fit are cropped away.
    4. Shift the label boxes by the corresponding placement offsets.
    '''

    # randomly pick the mosaic center; for s=640 and mosaic_border=[-320, -320] this lands in [320, 960]
    yc, xc = (int(random.uniform(-x, 2 * s + x)) for x in self.mosaic_border)  # mosaic center x, y
    indices = [index] + random.choices(self.indices, k=3)  # 3 additional image indices, picked at random
    random.shuffle(indices)
    for i, index in enumerate(indices):
        # Load image
        img, _, (h, w), img_label = load_image_label(self, index)

        # place img in img4
        if i == 0:  # top left
            img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  # base image with 4 tiles: a 2s x 2s gray canvas
            # start/end of the top-left tile on the large canvas; its bottom-right corner is (xc, yc)
            x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
            # matching crop region in the source image, clipping any part that would fall outside the canvas
            x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
        elif i == 1:  # top right
            x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
            x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        elif i == 2:  # bottom left
            x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        elif i == 3:  # bottom right
            x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)

        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]  # img4[ymin:ymax, xmin:xmax]
        padw = x1a - x1b
        padh = y1a - y1b

        # Labels: after placing/cropping the tile, shift the labels by the same offsets
        labels, segments = img_label.copy(), self.segments[index].copy() # labels (array): (num_gt_perimg, [cls_id, poly])
        if labels.size:
            # labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padw, padh)  # normalized xywh to pixel xyxy format
            labels[:, [1, 3, 5, 7]] = img_label[:, [1, 3, 5, 7]] + padw
            labels[:, [2, 4, 6, 8]] = img_label[:, [2, 4, 6, 8]] + padh
            segments = [xyn2xy(x, w, h, padw, padh) for x in segments]
        labels4.append(labels)
        segments4.extend(segments)

3.2 Copy paste

Randomly paste some targets into the image. This requires segment data, i.e. the instance segmentation information of each target.
(figure)

3.3 Random affine(Rotation, Scale, Translation and Shear)

Random affine transformations; according to the hyperparameters in the default configuration file, only Scale and Translation (scaling and translation) are actually enabled.

3.4 MixUp

Two images are blended together with a certain transparency. Whether it helps is unclear: there is no paper and no ablation experiment. In the code only the larger models use MixUp, and then only with a 10% probability each time.
(figure)

3.5 Albumentations

Mainly filtering, histogram equalization, image-quality changes, and so on. In the code this is only enabled if the albumentations package is installed, and that package is commented out in the project's requirements.txt, so it is disabled by default.

3.6 Augment HSV(Hue, Saturation, Value)

Randomly adjust hue, saturation and value.

3.7 Random horizontal flip

4. Training strategy

Multi-scale training (0.5~1.5x): assuming the input size is set to 640 x 640, the training size is randomly sampled between 0.5 x 640 and 1.5 x 640; the sampled values are always integer multiples of 32, because the network downsamples by up to 32x (see the sketch after this list).
AutoAnchor (for training custom data): when training on your own dataset, anchor templates can be re-clustered from the targets in your data.
Warmup and cosine LR scheduler: warm up at the start of training, then decay the learning rate with a cosine schedule.
EMA (Exponential Moving Average): can be understood as adding momentum to the parameter updates so that they evolve more smoothly.
Mixed precision training: reduces memory usage and speeds up training, provided the GPU hardware supports it.
Evolve hyper-parameters: hyper-parameter search; without tuning experience it is best to keep the defaults.
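A minimal sketch of how the multi-scale size can be drawn as a multiple of the stride (this mirrors the idea described above, not necessarily the exact YOLOv5 code):

import random

def random_train_size(imgsz=640, gs=32, lo=0.5, hi=1.5):
    # pick a size in [lo*imgsz, hi*imgsz] and round it down to a multiple of the stride gs
    return random.randrange(int(imgsz * lo), int(imgsz * hi) + gs) // gs * gs

print([random_train_size() for _ in range(5)])  # e.g. [352, 640, 928, 480, 736]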

5. Bounding box prediction

(figure)
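The two lines below implement the following decoding (reconstructed here from the code; $c_x, c_y$ is the cell's top-left grid coordinate, $p_w, p_h$ the anchor size in pixels, $s$ the stride of the feature map):

$$
b_x = \big(2\sigma(t_x) - 0.5 + c_x\big)\cdot s,\quad
b_y = \big(2\sigma(t_y) - 0.5 + c_y\big)\cdot s,\quad
b_w = p_w\,\big(2\sigma(t_w)\big)^2,\quad
b_h = p_h\,\big(2\sigma(t_h)\big)^2
$$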
Code:

y = x[i].sigmoid()
y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh

(figure)

IoU code:

# compute the IoU of two boxes (optionally GIoU, DIoU or CIoU)
def bbox_iou(box1, box2, x1y1x2y2=True, GIoU=False, DIoU=False, CIoU=False, eps=1e-7):
    # Returns the IoU of box1 to box2. box1 is 4, box2 is nx4
    box2 = box2.T

    # Get the coordinates of bounding boxes
    if x1y1x2y2:  # x1, y1, x2, y2 = box1
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[0], box1[1], box1[2], box1[3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[0], box2[1], box2[2], box2[3]
    else:  # transform from xywh to xyxy
        b1_x1, b1_x2 = box1[0] - box1[2] / 2, box1[0] + box1[2] / 2
        b1_y1, b1_y2 = box1[1] - box1[3] / 2, box1[1] + box1[3] / 2
        b2_x1, b2_x2 = box2[0] - box2[2] / 2, box2[0] + box2[2] / 2
        b2_y1, b2_y2 = box2[1] - box2[3] / 2, box2[1] + box2[3] / 2

    # Intersection area
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)

    # Union Area
    w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1 + eps
    w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1 + eps
    union = w1 * h1 + w2 * h2 - inter + eps

    iou = inter / union
    if CIoU or DIoU or GIoU:
        # c is the smallest box enclosing both boxes
        cw = torch.max(b1_x2, b2_x2) - torch.min(b1_x1, b2_x1)  # convex (smallest enclosing box) width
        ch = torch.max(b1_y2, b2_y2) - torch.min(b1_y1, b2_y1)  # convex height
        if CIoU or DIoU:  # Distance or Complete IoU https://arxiv.org/abs/1911.08287v1
            # c2 is the squared diagonal length of the enclosing box
            c2 = cw ** 2 + ch ** 2 + eps  # convex diagonal squared
            # squared distance between the box centers
            rho2 = ((b2_x1 + b2_x2 - b1_x1 - b1_x2) ** 2 +
                    (b2_y1 + b2_y2 - b1_y1 - b1_y2) ** 2) / 4  # center distance squared
            if CIoU:  # https://github.com/Zzh-tju/DIoU-SSD-pytorch/blob/master/utils/box/box_utils.py#L47
                v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
                with torch.no_grad():
                    alpha = v / (v - iou + (1 + eps))
                return iou - (rho2 / c2 + v * alpha)  # CIoU
            else:
                return iou - rho2 / c2  # DIoU
        else:  # GIoU https://arxiv.org/pdf/1902.09630.pdf
            c_area = cw * ch + eps  # convex area
            return iou - (c_area - union) / c_area  # GIoU
    else:
        return iou  # IoU
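A quick usage sketch of the function above (the boxes are made up for illustration), with torch and math imported as the function requires:

b1 = torch.tensor([50., 50., 20., 20.])                          # one box as (cx, cy, w, h)
b2 = torch.tensor([[55., 55., 20., 20.], [80., 80., 10., 10.]])  # n boxes
print(bbox_iou(b1, b2, x1y1x2y2=False, CIoU=True))               # CIoU of b1 against each box in b2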

6. Loss calculation

The loss consists of three parts:

  • Classes loss: classification loss, using BCE loss. Note that only the classification loss of positive samples is computed.
  • Objectness loss: obj loss, also BCE loss. The obj target here is the CIoU between the predicted bounding box and the GT box, and the obj loss is computed over all samples. (This differs from v3/v4, which only use a 0/1 target for whether an object is present.)
  • Location loss: box regression loss, using CIoU loss. Only computed for positive samples.
    (figure)

The obj loss on each prediction feature layer is weighted differently; the layer responsible for small targets gets a relatively larger weight.
(figure)
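Written out, the total loss is a weighted sum of the three parts, with the obj term additionally balanced across the three prediction layers. The balance values shown are the defaults I recall from the YOLOv5 code and should be treated as an assumption, not a quotation:

$$
\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{obj}\sum_{i=1}^{3} w_i\,\mathcal{L}_{obj}^{(i)} + \lambda_{box}\,\mathcal{L}_{CIoU},
\qquad w \approx [4.0,\ 1.0,\ 0.4] \text{ for } P3,\ P4,\ P5
$$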

To put it simply, bounding box regression means:
finding translation and scaling coefficients that bring the predicted box as close as possible to the true box; the coefficients that satisfy this condition are the regression coefficients.
"As close as possible" is quantified by constructing a loss function that measures how similar the two boxes are: the more similar, the smaller the loss. By continuously decreasing the loss we obtain suitable translation and scaling coefficients; decreasing the loss is the optimization step, the usual neural-network routine.
compute_loss builds this loss function; its box-regression component compares the anchors transformed by the regression coefficients against the positive-sample ground truth, and training searches for the coefficients that map the anchors assigned to positive samples onto their ground-truth boxes. The final confidence output by the program is objectness confidence x classification confidence. Keep this in mind.

Code:

smooth_BCE
(figure)
Label smoothing turns the hard targets [1, 0] into, e.g., [0.95, 0.05], which helps prevent overfitting.

def smooth_BCE(eps=0.1):  # https://github.com/ultralytics/yolov3/issues/238#issuecomment-598028441
    # return positive, negative label smoothing BCE targets
    return 1.0 - 0.5 * eps, 0.5 * eps
self.cp, self.cn = smooth_BCE(eps=h.get('label_smoothing', 0.0))

FocalLoss

Not enabled by default. The term p_t is built so that it is close to 1 for well-classified samples (both positive and negative), which makes the weight of easily classified samples very small and the weight of misclassified samples comparatively large. This addresses two things:
1. the imbalance between positive and negative samples in one-stage object detection;
2. down-weighting easy samples so that the loss function focuses on hard samples.

class FocalLoss(nn.Module):
    # Wraps focal loss around existing loss_fcn(), i.e. criteria = FocalLoss(nn.BCEWithLogitsLoss(), gamma=1.5)
    def __init__(self, loss_fcn, gamma=1.5, alpha=0.25):
        super().__init__()
        self.loss_fcn = loss_fcn  # must be nn.BCEWithLogitsLoss()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = loss_fcn.reduction
        self.loss_fcn.reduction = 'none'  # required to apply FL to each element

    def forward(self, pred, true):
        loss = self.loss_fcn(pred, true)
        # p_t = torch.exp(-loss)
        # loss *= self.alpha * (1.000001 - p_t) ** self.gamma  # non-zero power for gradient stability

        # TF implementation https://github.com/tensorflow/addons/blob/v0.7.1/tensorflow_addons/losses/focal_loss.py
        pred_prob = torch.sigmoid(pred)  # prob from logits
        # p_t is the predicted probability of the ground-truth class: pred_prob for positives, (1 - pred_prob) for negatives
        p_t = true * pred_prob + (1 - true) * (1 - pred_prob)
        alpha_factor = true * self.alpha + (1 - true) * (1 - self.alpha)
        # (1 - p_t) ** gamma down-weights easy samples and up-weights hard ones
        modulating_factor = (1.0 - p_t) ** self.gamma
        loss *= alpha_factor * modulating_factor

        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        else:  # 'none'
            return loss
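A minimal usage sketch, mirroring the comment in the class header:

import torch
import torch.nn as nn

criterion = FocalLoss(nn.BCEWithLogitsLoss(), gamma=1.5, alpha=0.25)

logits = torch.randn(8, 80)        # raw predictions (before sigmoid)
targets = torch.zeros(8, 80)
targets[:, 3] = 1.0                # pretend class 3 is the positive class
print(criterion(logits, targets))  # scalar loss; reduction='mean' is inherited from BCEWithLogitsLoss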

build_targets

def build_targets(self, p, targets):


p is the list of outputs from the prediction heads, e.g.
p[0].shape: torch.Size([16, 3, 80, 80, 85])  # 16 = batch size, 3 = number of anchors, 80/40/20 = feature-map sizes of the 3 detection heads, 85 = 80 COCO classes + 4 (x, y, w, h) + 1 (objectness)
p[1].shape: torch.Size([16, 3, 40, 40, 85])
p[2].shape: torch.Size([16, 3, 20, 20, 85])


targets holds the GT-box information with shape (n, 6), where n is the number of GT boxes in the whole batch; 190 GT boxes are used as the running example below.
The 6 values per row are (image index within the batch, target class, x, y, w, h).


The main work of build_targets is processing the GT boxes:
1. Replicate the GT boxes 3 times, because each head has three anchor shapes; before filtering, every GT box is paired with each of the three anchors.
2. Filter out pairs where the ratio between the GT box w/h and the anchor w/h exceeds the hyperparameter anchor_t.
3. For each remaining GT box, three grid cells are used to predict it: the cell containing the box center plus the two neighbouring cells closest to the center, as shown below:
(figure)
If the center of a GT box is (51.7, 44.8), it is predicted by cell (51, 44), which contains the center, and by the two nearest neighbouring cells (51, 45) and (52, 44).

# Build targets for compute_loss(); input targets is (image, class, x, y, w, h)

na, nt = self.na, targets.shape[0]     # na: 3 anchors, nt: 190 GT boxes

# Outputs: classes, boxes, (image, anchor, grid) indices and matched anchors

tcls, tbox, indices, anch = [], [], [], []
ai = torch.arange(na, device=targets.device).float().view(na, 1).repeat(1, nt)          # anchor indices, shape (3, 190): one row per anchor
targets = torch.cat((targets.repeat(na, 1, 1), ai[:, :, None]), 2)

# Append the anchor index to each target so we know which anchor it is paired with; the targets are repeated three times, once per anchor. The figure below shows the processed targets, shape (3, 190, 7): the red box is the original targets and the blue box is the anchor index added to each GT box.
(figure)

g = 0.5  # bias
off = torch.tensor([[0, 0],
                    [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
                    # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
                    ], device=targets.device).float() * g  # offsets

# Iterate over the prediction feature maps; self.nl is the number of detection layers

for i in range(self.nl):  # for each detection head
anchors = self.anchors[i]

When i = 0 the anchors are (in grid units of that feature map):
anchors
tensor([[1.25000, 1.62500],
        [2.00000, 3.75000],
        [4.12500, 2.87500]], device='cuda:0')

gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]

gain is now [1., 1., 80., 80., 80., 80., 1.]; the 80s line up with the xywh of the GT-box entries.

t = targets * gain

The xywh in targets is normalized to [0, 1]; multiplying by gain maps it onto the feature-map size of this detection head.

if nt:
    r = t[:, :, 4:6] / anchors[:, None]

The shape of r is [3, 190, 2]; the two values are the ratios of the GT box w and h to the anchor w and h.

    j = torch.max(r, 1. / r).max(2)[0] < self.hyp['anchor_t']
    t = t[j]  # filter out pairs whose ratio exceeds anchor_t

Keep only the pairs whose width/height ratio lies between 1/4 and 4 (anchor_t defaults to 4).

gxy = t[:, 2:4]  # float center coordinates of the filtered GT boxes
gxi = gain[[2, 3]] - gxy   # the same coordinates measured from the bottom-right corner of the image instead of the top-left

After the replication there are 3 x 190 = 570 GT-anchor pairs in total; 271 remain after the filtering above.

j, k = ((gxy % 1. < g) & (gxy > 1.)).T

Using the top-left corner of the image as the origin, take the fractional part of each center coordinate: an entry is true when the fraction is less than 0.5 (and the coordinate is greater than 1, which excludes the border cells). The true positions of j mark GT boxes close to the left edge of their grid cell and those of k mark boxes close to the top edge; j and k both have shape (271).

l, m = ((gxi % 1. < g) & (gxi > 1.)).T

Using the bottom-right corner of the image as the origin, take the fractional part of the center point in the same way. l and m (both of shape (271)) mark GT boxes close to the right edge and the bottom edge of their grid cell respectively, so l is essentially the opposite of j, and m of k.

j = torch.stack((torch.ones_like(j), j, k, l, m))

Stack j, k, l, m together with an extra all-true row into one tensor; after stacking, j has shape (5, 271).

t = t.repeat((5, 1, 1))[j]

Before this line t has shape (271, 7). t is repeated 5 times and then filtered with j:
the first copy keeps all GT boxes (thanks to the all-true row added in the previous step),
the second copy keeps the GT boxes close to the left edge of their cell,
the third copy keeps those close to the top edge,
the fourth copy keeps those close to the right edge, and
the fifth copy keeps those close to the bottom edge.
After filtering, t has shape (808, 7): each GT box is kept roughly three times, once from the all-true copy and once for each of its two nearest neighbouring cells.

 offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]

offsets has shape (808, 2) and holds the x/y offset for each of the 808 retained GT boxes.
For the first copy (all GT boxes) the offset is [0, 0], i.e. no shift.
For the second copy (boxes near the left edge of their cell) the offset is [0.5, 0]; since the code below computes gxy - offsets, a positive 0.5 shifts the point into the cell on the left, meaning that cell is also used to predict the box.
For the third copy (boxes near the top edge) the offset is [0, 0.5], shifting the point into the cell above.
For the fourth copy (boxes near the right edge) the offset is [-0.5, 0], shifting the point into the cell on the right.
For the fifth copy (boxes near the bottom edge) the offset is [0, -0.5], shifting the point into the cell below.
The x coordinate of a GT box center is either close to the left or to the right edge of its cell, and the y coordinate is either close to the top or to the bottom, so each GT box is true in three of the five copies.
In other words, each GT box is predicted by three grid cells: the one containing its center and its two nearest neighbours. YOLOv3 only uses the cell containing the center; this is one of the differences from YOLOv3.

else:
    t = targets[0]
    offsets = 0

b, c = t[:, :2].long().T  # image, class
gxy = t[:, 2:4]  # grid xy
gwh = t[:, 4:6]  # grid wh
gij = (gxy - offsets).long()

Shift each center point toward its selected neighbouring cell and round down; gij has shape (808, 2) and holds the indices of the grid cells responsible for each target.

gi, gj = gij.T  # grid xy indices

# Append the results for this detection layer

a = t[:, 6].long()  # anchor indices
indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))  # image, anchor, grid indices
tbox.append(torch.cat((gxy - gij, gwh), 1)) # box
anch.append(anchors[a]) # anchors
tcls.append(c)  # class
return tcls, tbox, indices, anch
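A self-contained sketch (my own illustration, not code from the repo) of which three cells are selected for a GT center such as (51.7, 44.8), using the same g = 0.5 and offset logic as above:

import torch

g = 0.5
off = torch.tensor([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]], dtype=torch.float32) * g

gxy = torch.tensor([[51.7, 44.8]])        # GT center in grid units
gxi = torch.tensor([[80.0, 80.0]]) - gxy  # same point measured from the bottom-right of an 80x80 grid

j, k = ((gxy % 1.0 < g) & (gxy > 1.0)).T  # close to the left / top edge of the cell
l, m = ((gxi % 1.0 < g) & (gxi > 1.0)).T  # close to the right / bottom edge of the cell
mask = torch.stack((torch.ones_like(j), j, k, l, m))

centers = gxy.repeat(5, 1, 1)[mask]
offsets = (torch.zeros_like(gxy)[None] + off[:, None])[mask]
cells = (centers - offsets).long()
print(cells.tolist())  # [[51, 44], [52, 44], [51, 45]]: the center cell plus its two nearest neighbours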

7. Grid sensitivity

In v3 the sigmoid function is used so that the predicted center falls inside its grid cell. However, if the true center lies on the cell boundary (e.g. the top-left or bottom-right corner), the raw network output would have to be negative or positive infinity, and such extreme values are practically unreachable.
(figure)

To solve this, the offset range is rescaled from the original (0, 1) to (-0.5, 1.5), so that offsets of exactly 0 or 1 become easy to reach.
(figure)
Besides adjusting the predicted offset of the anchor relative to the top-left corner of the grid cell, YOLOv5 also changes the formulas for the predicted width and height. The author's explanation is that the original formulas put no limit on the predicted width and height, which could cause gradient explosion and unstable training. The original formulas were:
(figure)
and they were adjusted to:
(figure)
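Reconstructed from the figures and from the decode code in section 5 (so treat the notation as my reading of it): the original YOLO formulas are $b_x = \sigma(t_x) + c_x$, $b_w = p_w e^{t_w}$ (and likewise for $y$, $h$); the adjusted YOLOv5 formulas are

$$
b_x = 2\sigma(t_x) - 0.5 + c_x,\quad
b_y = 2\sigma(t_y) - 0.5 + c_y,\quad
b_w = p_w\big(2\sigma(t_w)\big)^2,\quad
b_h = p_h\big(2\sigma(t_h)\big)^2
$$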

The figure below compares the original y = e^x scaling curve with the adjusted one. The adjusted scale factor is limited to (0, 4); this 4 reappears in the positive-sample matching below.
(figure)

8. Positive sample matching

The GT-to-anchor matching in v5 differs from v4.
In YOLOv4 the IoU between each GT box and the anchor templates is computed directly, and the match succeeds whenever the IoU exceeds a set threshold.
In YOLOv5 the author instead computes the width ratio and height ratio between each GT box and each anchor template,
(figure)
then takes the maximum over these ratios and their reciprocals. This can be read as the largest relative mismatch between GT box and anchor in the width or height direction (a ratio of 1 means no mismatch).
(figure)
If r_max < 4 (4 being the adjusted scale factor from the previous section), the match succeeds.
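The ratio test from the two figures, written out so it matches the r = t[:, :, 4:6] / anchors and torch.max(r, 1. / r) lines in build_targets:

$$
r_w = \frac{w_{gt}}{w_a},\qquad r_h = \frac{h_{gt}}{h_a},\qquad
r^{max} = \max\big(r_w,\ 1/r_w,\ r_h,\ 1/r_h\big),\qquad
\text{match if } r^{max} < \text{anchor\_t} = 4
$$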
In the figure below, x1 is the original anchor size and x4 is the anchor scaled up by 4; the first GT box exceeds the anchor x 4 size and therefore fails to match.
(figure)
Positive sample assignment details:

The remaining step is to project the GT onto the corresponding prediction feature layer and locate the grid cell containing its center. Note that three cells are marked in the figure: because the predicted center offset range has been adjusted to (-0.5, 1.5), any grid cell whose top-left corner is within that range of the GT center can regress its anchors onto the GT. This greatly increases the number of positive samples.
(figure)
(figure)

Tips:

1. EMA

Problem background:
Deep-learning training usually cannot find the global optimum and spends most of its time oscillating around local optima, so the final weights may well be a poor local solution. One remedy is to average these nearby solutions and let the network load the averaged weights for prediction. Weight averaging methods such as EMA follow this idea.
Definition:
(figure)
EMA can be viewed, approximately, as an average of the v values over the past 1/(1-β) steps.
The ordinary average over the past n steps looks like this:
(figure)
Comparing with EMA, the two formulas take the same form when β = (n-1)/n. Note that the two averages are not strictly equal; the analogy is only meant to aid understanding.

In fact, when computing the EMA, contributions older than about 1/(1-β) steps have decayed to a weight of roughly 1/e. Expanding v_t gives:
(figure)
In deep learning, θ_t is the model weights at step t and v_t is the shadow weights at step t. The shadow weights are maintained throughout gradient descent but do not take part in training. The underlying assumption is that during the last n steps the model weights jitter around the actual optimum, so averaging the last n steps gives a more robust model.
The formula used in deep learning is:
(figure)
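Reconstructed from the standard EMA definition the figures refer to (my reconstruction, not the original images), with $\theta_t$ the model weights and $v_t$ the shadow weights:

$$
v_t = \beta\, v_{t-1} + (1-\beta)\,\theta_t
\;=\;(1-\beta)\sum_{i=0}^{t-1}\beta^{\,i}\,\theta_{t-i}
\qquad (\beta = \text{decay},\ \text{e.g. } 0.999,\ v_0 = 0)
$$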

Steps:

  1. Initialize the EMA object
  2. EMA.register()
  3. During training, call EMA.update() after each parameter update to refresh the shadow weights
  4. At evaluation time, call EMA.apply_shadow() (a minimal sketch follows this list)
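A minimal sketch of such an EMA helper, using the register/update/apply_shadow interface named in the steps above (a hypothetical class for illustration, not the ModelEMA implementation in the YOLOv5 repo):

import torch

class EMA:
    # Keeps shadow copies of the trainable parameters: shadow = decay * shadow + (1 - decay) * param
    def __init__(self, model, decay=0.999):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

    def register(self):
        # take an initial snapshot of all trainable parameters
        for name, p in self.model.named_parameters():
            if p.requires_grad:
                self.shadow[name] = p.detach().clone()

    @torch.no_grad()
    def update(self):
        # call after each optimizer step
        for name, p in self.model.named_parameters():
            if p.requires_grad:
                self.shadow[name].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

    def apply_shadow(self):
        # swap the shadow weights in for evaluation, keeping a backup so training can continue
        for name, p in self.model.named_parameters():
            if p.requires_grad:
                self.backup[name] = p.detach().clone()
                p.data.copy_(self.shadow[name])

    def restore(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad:
                p.data.copy_(self.backup[name])
        self.backup = {}

Typical usage: ema = EMA(model); ema.register(); call ema.update() after each optimizer.step(); call ema.apply_shadow() before validation and ema.restore() afterwards.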

How does this amount to averaging?
With decay = 0.999, intuitively, over the last ~1000 training steps the model has essentially converged and is just jittering; the moving average is roughly an average over those last 1000 jittering steps, so the resulting weights are more robust.

Advantages:
1. Low memory: the running average can be maintained without storing the past 10 or 100 historical values of v. (It is less exact than averaging stored snapshots, but the latter costs more memory and compute.)
2. No extra training time and no extra hyper-parameter tuning; at test time you only need to run a few additional evaluations and pick the best result.

The figure below shows temperature data (blue dots), a fitted curve (red line), and the EMA curve (green line). The green line lags noticeably and follows the trend like a shadow, hence the name shadow variable.
(figure)

2. Parameters

A PyTorch model saves two kinds of state:

parameter: updated by the optimizer during backpropagation, i.e. trainable.
buffer: not updated by the optimizer, i.e. not trainable.

Each kind is stored in its own OrderedDict, and both end up in what model.state_dict() returns and saves.

Example:
When we want to store extra variables on the network (such as the anchors in yolov5) that are only needed for simple post-processing, they have to be registered on the module. The available APIs are:

  • self.register_buffer(): not trainable;
  • self.register_parameter(), nn.parameter.Parameter(), nn.Parameter(): trainable.
    (figure)
    Member variables:
    _buffers: filled by self.register_buffer(); requires_grad defaults to False, not trainable.
    _parameters: variables defined via self.register_parameter(), nn.parameter.Parameter() or nn.Parameter() are stored here; their requires_grad defaults to True.
    _modules: sub-modules defined with nn.Sequential(), nn.Conv2d(), etc. are stored here.

    Member functions:
    self.state_dict(): an OrderedDict holding the inference state of the network, including both parameters and buffers.
    self.named_parameters(): an iterator over the names + tensors of all trainable parameters in self._modules and self._parameters, including BN's bn.weight and bn.bias.
    self.parameters(): same as self.named_parameters() but without the names.
    self.named_buffers(): an iterator over the names + tensors of all non-trainable, registered buffers in the network, including BN's bn.running_mean, bn.running_var and bn.num_batches_tracked.
    self.buffers(): same as self.named_buffers() but without the names.
    net.named_modules(): an iterator over the names + layers of the module structure stored in self._modules.
    net.modules(): same, without the names.
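A small sketch illustrating the difference (dummy module, not from the YOLOv5 source):

import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(3))                  # trainable, stored in _parameters
        self.register_buffer('anchors', torch.tensor([10., 13.]))   # not trainable, stored in _buffers

m = Demo()
print([n for n, _ in m.named_parameters()])  # ['weight']
print([n for n, _ in m.named_buffers()])     # ['anchors']
print(list(m.state_dict().keys()))           # ['weight', 'anchors']: both end up in the saved state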

3. IOU

The evolution of bounding-box regression losses in recent years: Smooth L1 Loss -> IoU Loss (2016) -> GIoU Loss (2019) -> DIoU Loss (2020) -> CIoU Loss (2020).

a. IOU_Loss
IoU: the intersection-over-union between the GT box and the predicted box.
IOU_Loss = 1 - IoU
(figure)
However, there are two problems:
1. When IoU = 0, the loss does not reflect the distance between the two boxes; the gradient is zero, so IOU_Loss cannot optimize the case where the boxes do not intersect.
(figure)
2. When two predictions have the same IoU and the same size, IOU_Loss cannot distinguish how they intersect the GT box.
(figure)
GIOU_Loss was therefore introduced as an improvement.

b. GIOU_Loss
GIOU_Loss = 1 - GIoU = 1 - (IoU - |C \ (A ∪ B)| / |C|)
where A is the predicted box, B is the GT box, and C is the smallest box enclosing A and B; their relationship is shown below.
(figure)

(figure)
GIoU solves the IoU = 0 case, but not the equal-IoU case: when the prediction lies entirely inside the GT box and has the same size, the area C \ (A ∪ B) is identical in all three states, so the GIoU values are also identical. GIoU then degenerates to IoU and cannot distinguish the relative positions.
(figure)
DIOU_Loss was therefore introduced as an improvement.

c. DIOU_Loss
A good box-regression loss should consider three geometric factors: overlap area, center distance, and aspect ratio.
DIOU_Loss = 1 - DIoU = 1 - (IoU - squared center distance / squared diagonal of the enclosing box)

(figure)
DIOU_Loss accounts for the overlap area and the center distance; when the GT box encloses the prediction, it directly measures the distance between the two boxes, so DIOU_Loss converges faster.
(figure)
However, it still ignores the aspect ratio.
(figure)
In the three cases above, the GT box encloses the prediction and DIOU_Loss should in principle help,
but the prediction centers are identical, so by the DIOU_Loss formula all three cases get the same value.
CIOU_Loss was therefore introduced as an improvement.

d. CIOU_Loss
CIOU_Loss keeps the DIOU_Loss terms and adds one more penalty term that accounts for the aspect-ratio consistency between the prediction and the GT box.
(figure)
Here v measures the consistency of the aspect ratios and can be defined as shown below.
(figure)
CIOU_Loss thus covers all three geometric factors of box regression: overlap area, center distance, and aspect ratio.
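The four losses written out in their standard formulations (they line up with the bbox_iou code earlier; $\rho$ is the distance between the two box centers, $c$ the diagonal of the smallest enclosing box):

$$
\begin{aligned}
\mathcal{L}_{IoU}  &= 1 - IoU\\
\mathcal{L}_{GIoU} &= 1 - IoU + \frac{|C \setminus (A \cup B)|}{|C|}\\
\mathcal{L}_{DIoU} &= 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2}\\
\mathcal{L}_{CIoU} &= 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,\quad
v = \frac{4}{\pi^2}\Big(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\Big)^2,\quad
\alpha = \frac{v}{(1 - IoU) + v}
\end{aligned}
$$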

Summary:
IOU_Loss: considers only the overlap area between the predicted box and the GT box.

GIOU_Loss: on top of IoU, handles the case where the boxes do not overlap.

DIOU_Loss: on top of IoU/GIoU, adds the center-distance information.

CIOU_Loss: on top of DIoU, adds the aspect-ratio information.

YOLOv4 (and YOLOv5, see the loss section above) adopt CIOU_Loss for box regression, which makes the predicted boxes regress faster and more accurately.

