Deep Learning (27) - YOLO Series (6)

Just saying: if you need the source code, visit Jane's GitHub: Waiting for you here.
Hey, long time no see. I finished debugging yolov7 yesterday, and I really did my best to understand every line. This should be the last article of my YOLO series, but that limit applies only to the topic of YOLO and detection; the series itself is not over and will continue. Once detection is done, we will talk about segmentation. As usual, we follow the same process, starting with the data.
Note: I will not explain every line here, only the content I consider truly core. In other words, what I write is the core of the core, so pay attention! The notes I took while debugging fill three full pages, so of course I have to publish a blog to share the joy with you.

1. Data

The data used in this article is the surface defect dataset (NEU-DET) published by Northeastern University; it has been uploaded to the resources, so grab it yourself. The dataset covers six typical surface defects of hot-rolled steel strip: rolled-in scale (RS), patches (Pa), crazing (Cr), pitted surface (PS), inclusion (In), and scratches (Sc). It contains 1,800 grayscale images: 300 samples for each of the six defect types. For the detection task, annotations give the class and location of each defect; in the sample images, a yellow box marks a defect's location and a green label gives its class.
To train on it, just point the training configuration at a data.yaml like this:

train: ..\NEU-DET\train.txt
val: ..\NEU-DET\val.txt

# number of classes
nc: 6

# class names
names: ['crazing', 'inclusion', 'patches', 'pitted_surface', 'rolled-in_scale', 'scratches']

Data reading is straightforward. The author cleverly uses a cache: reading the data is slow only the first time, and afterwards the generated cache is used to load it. What I want to discuss here is the mosaic method, the key technique in the data augmentation part.

1.1 mosaic method

  • Core idea: Mosaic data augmentation is a technique commonly used in computer vision tasks. It stitches images from several different samples into one large composite image and uses that composite as training data, which improves the model's robustness and generalization.

  • Steps for usage:

    • Randomly select four different training images. [Note: it does not have to be 4; it can be 9, or any layout of your own design, but you must remember to transform the image and its labels together.]
    • These selected images are stitched together in certain proportions to form a large composite image.
    • Randomly crop synthetic images to obtain training samples.
    • Label adjustments are performed on the cropped samples to fit the new image layout.
    • Use the cropped samples as training data and input them into the model for training.
  • Problems this method addresses: it introduces more variation and complexity, improving the model's ability to recognize and understand relationships across different scenes, scales, and objects. By stitching multiple samples together, the model learns a richer feature representation and behaves more stably and reliably when it encounters similar situations.

  • Applicable scenarios: It is often used in tasks such as target detection, semantic segmentation, and image classification, which can help the model better cope with complex scenes, small targets, occlusions, and perspective changes.

# 4-image mosaic
def load_mosaic(self, index):
    # loads images in a 4-mosaic

    labels4, segments4 = [], []
    s = self.img_size
    yc, xc = [int(random.uniform(-x, 2 * s + x)) for x in self.mosaic_border]  # mosaic center x, y
    indices = [index] + random.choices(self.indices, k=3)  # 3 additional image indices
    for i, index in enumerate(indices):
        # Load image
        img, _, (h, w) = load_image(self, index)

        # place img in img4
        if i == 0:  # top left
            img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  # base image with 4 tiles
            x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
        elif i == 1:  # top right
            x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
            x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        elif i == 2:  # bottom left
            x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        elif i == 3:  # bottom right
            x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)

        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]  # img4[ymin:ymax, xmin:xmax]
        padw = x1a - x1b
        padh = y1a - y1b

        # Labels
        labels, segments = self.labels[index].copy(), self.segments[index].copy()
        if labels.size:
            labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padw, padh)  # normalized xywh to pixel xyxy format
            segments = [xyn2xy(x, w, h, padw, padh) for x in segments]
        labels4.append(labels)
        segments4.extend(segments)

    # Concat/clip labels
    labels4 = np.concatenate(labels4, 0)
    for x in (labels4[:, 1:], *segments4):
        np.clip(x, 0, 2 * s, out=x)  # clip when using random_perspective()
    # img4, labels4 = replicate(img4, labels4)  # replicate

    # Augment
    #img4, labels4, segments4 = remove_background(img4, labels4, segments4)
    #sample_segments(img4, labels4, segments4, probability=self.hyp['copy_paste'])
    img4, labels4, segments4 = copy_paste(img4, labels4, segments4, probability=self.hyp['copy_paste'])
    img4, labels4 = random_perspective(img4, labels4, segments4,
                                       degrees=self.hyp['degrees'],
                                       translate=self.hyp['translate'],
                                       scale=self.hyp['scale'],
                                       shear=self.hyp['shear'],
                                       perspective=self.hyp['perspective'],
                                       border=self.mosaic_border)  # border to remove

    return img4, labels4
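
The label adjustment above hinges on xywhn2xyxy: the normalized (cx, cy, w, h) boxes are scaled to pixels and shifted by the tile offsets padw/padh. For reference, a minimal numpy version of that helper (mirroring the one in utils/general.py) looks roughly like this:

import numpy as np

def xywhn2xyxy(x, w=640, h=640, padw=0, padh=0):
    # nx4 boxes: normalized [cx, cy, w, h] -> pixel [x1, y1, x2, y2],
    # shifted by the mosaic tile offset (padw, padh)
    y = np.copy(x)
    y[:, 0] = w * (x[:, 0] - x[:, 2] / 2) + padw  # top-left x
    y[:, 1] = h * (x[:, 1] - x[:, 3] / 2) + padh  # top-left y
    y[:, 2] = w * (x[:, 0] + x[:, 2] / 2) + padw  # bottom-right x
    y[:, 3] = h * (x[:, 1] + x[:, 3] / 2) + padh  # bottom-right y
    return y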

yolov7 also provides a version that stitches 9 images together:

def load_mosaic9(self, index):
    # loads images in a 9-mosaic

    labels9, segments9 = [], []
    s = self.img_size
    indices = [index] + random.choices(self.indices, k=8)  # 8 additional image indices
    for i, index in enumerate(indices):
        # Load image
        img, _, (h, w) = load_image(self, index)

        # place img in img9
        if i == 0:  # center
            img9 = np.full((s * 3, s * 3, img.shape[2]), 114, dtype=np.uint8)  # base image with 9 tiles
            h0, w0 = h, w
            c = s, s, s + w, s + h  # xmin, ymin, xmax, ymax (base) coordinates
        elif i == 1:  # top
            c = s, s - h, s + w, s
        elif i == 2:  # top right
            c = s + wp, s - h, s + wp + w, s
        elif i == 3:  # right
            c = s + w0, s, s + w0 + w, s + h
        elif i == 4:  # bottom right
            c = s + w0, s + hp, s + w0 + w, s + hp + h
        elif i == 5:  # bottom
            c = s + w0 - w, s + h0, s + w0, s + h0 + h
        elif i == 6:  # bottom left
            c = s + w0 - wp - w, s + h0, s + w0 - wp, s + h0 + h
        elif i == 7:  # left
            c = s - w, s + h0 - h, s, s + h0
        elif i == 8:  # top left
            c = s - w, s + h0 - hp - h, s, s + h0 - hp

        padx, pady = c[:2]
        x1, y1, x2, y2 = [max(x, 0) for x in c]  # allocate coords

        # Labels
        labels, segments = self.labels[index].copy(), self.segments[index].copy()
        if labels.size:
            labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padx, pady)  # normalized xywh to pixel xyxy format
            segments = [xyn2xy(x, w, h, padx, pady) for x in segments]
        labels9.append(labels)
        segments9.extend(segments)

        # Image
        img9[y1:y2, x1:x2] = img[y1 - pady:, x1 - padx:]  # img9[ymin:ymax, xmin:xmax]
        hp, wp = h, w  # height, width previous

    # Offset
    yc, xc = [int(random.uniform(0, s)) for _ in self.mosaic_border]  # mosaic center x, y
    img9 = img9[yc:yc + 2 * s, xc:xc + 2 * s]

    # Concat/clip labels
    labels9 = np.concatenate(labels9, 0)
    labels9[:, [1, 3]] -= xc
    labels9[:, [2, 4]] -= yc
    c = np.array([xc, yc])  # centers
    segments9 = [x - c for x in segments9]

    for x in (labels9[:, 1:], *segments9):
        np.clip(x, 0, 2 * s, out=x)  # clip when using random_perspective()
    # img9, labels9 = replicate(img9, labels9)  # replicate

    # Augment
    #img9, labels9, segments9 = remove_background(img9, labels9, segments9)
    img9, labels9, segments9 = copy_paste(img9, labels9, segments9, probability=self.hyp['copy_paste'])
    img9, labels9 = random_perspective(img9, labels9, segments9,
                                       degrees=self.hyp['degrees'],
                                       translate=self.hyp['translate'],
                                       scale=self.hyp['scale'],
                                       shear=self.hyp['shear'],
                                       perspective=self.hyp['perspective'],
                                       border=self.mosaic_border)  # border to remove

    return img9, labels9

As the code above shows, with four images the mosaic center is chosen first, and the images are placed in turn from top-left to bottom-right. With nine images, the first image is placed in the center, then images two through nine are placed around it, going around clockwise starting from the top. Afterwards a center point is chosen and the canvas is cropped.
Why do it this way?

  • When the data is loaded, each image is expected to come in at 640, while the image entering the network is expected to be 1280. With four images the canvas is already 1280, so after cropping or padding around the chosen center the composite can be used directly, and there is more freedom. Nine images stitched together give 1920, so a random center point is picked and a crop is taken, which is no obstacle. You can customize all of this; nothing special is required, except that the image and the labels must go through exactly the same transformations so they stay in full correspondence.
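
A quick sanity check of those sizes (a sketch assuming img_size = 640):

s = 640           # self.img_size
canvas4 = 2 * s   # 1280: the 4-mosaic canvas
canvas9 = 3 * s   # 1920: the 9-mosaic canvas
# load_mosaic9 picks yc, xc in [0, s] and crops img9[yc:yc + 2*s, xc:xc + 2*s],
# so the 1280x1280 crop always stays inside the 1920x1920 canvas.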

2. Model

The detailed structure of the model will not be covered here; the focus is the final Detect head. The model taps three layers at different network depths: the earlier, shallower layers learn fine, local details, while the later, deeper layers learn more global features. For each layer, three anchors are predicted per grid cell, so Detect returns a list of length 3, one entry per layer, where each element has shape [batch, anchor_no, feature_map_height, feature_map_width, 11], with 11 = [x, y, w, h, confidence, one_hot_6_classes].

class Detect(nn.Module):
    stride = None  # strides computed during build
    export = False  # onnx export
    end2end = False
    include_nms = False
    concat = False

    def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
        super(Detect, self).__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.zeros(1)] * self.nl  # init grid
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)
                y = x[i].sigmoid()
                if not torch.onnx.is_in_onnx_export():
                    y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                else:
                    xy, wh, conf = y.split((2, 2, self.nc + 1), 4)  # y.tensor_split((2, 4, 5), 4)  # torch 1.8.0
                    xy = xy * (2. * self.stride[i]) + (self.stride[i] * (self.grid[i] - 0.5))  # new xy
                    wh = wh ** 2 * (4 * self.anchor_grid[i].data)  # new wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, -1, self.no))

        if self.training:
            out = x
        elif self.end2end:
            out = torch.cat(z, 1)
        elif self.include_nms:
            z = self.convert(z)
            out = (z, )
        elif self.concat:
            out = torch.cat(z, 1)
        else:
            out = (torch.cat(z, 1), x)

        return out

    @staticmethod
    def _make_grid(nx=20, ny=20):
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()

    def convert(self, z):
        z = torch.cat(z, 1)
        box = z[:, :, :4]
        conf = z[:, :, 4:5]
        score = z[:, :, 5:]
        score *= conf
        convert_matrix = torch.tensor([[1, 0, 1, 0], [0, 1, 0, 1], [-0.5, 0, 0.5, 0], [0, -0.5, 0, 0.5]],
                                           dtype=torch.float32,
                                           device=z.device)
        box @= convert_matrix                          
        return (box, score)
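
To make the shapes concrete, here is a hypothetical smoke test of the Detect head above for our 6-class dataset; the channel widths (128/256/512) are assumptions for illustration, and the anchor values are yolov7's defaults:

import torch

anchors = ([12, 16, 19, 36, 40, 28],        # P3/8
           [36, 75, 76, 55, 72, 146],       # P4/16
           [142, 110, 192, 243, 459, 401])  # P5/32
det = Detect(nc=6, anchors=anchors, ch=(128, 256, 512))  # assumed channel widths
det.stride = torch.tensor([8., 16., 32.])
det.train()  # training mode returns the raw per-layer tensors

feats = [torch.randn(2, c, 640 // int(s), 640 // int(s))
         for c, s in zip((128, 256, 512), det.stride)]
out = det(feats)
print([o.shape for o in out])
# [2, 3, 80, 80, 11], [2, 3, 40, 40, 11], [2, 3, 20, 20, 11]
# 11 = 4 box + 1 objectness + 6 classes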

3. Training

The training process is the same as before; the core is computing the loss between the predictions produced by the model and the labels.
The loss consists of three parts:

1. the IoU between target and predict (box loss)
2. the objectness (confidence) loss of the anchor
3. the class label loss of the anchor
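
In code this comes down to a weighted sum; a sketch, with the 'box' / 'obj' / 'cls' gain key names assumed from the hyperparameter file:

# sketch: total loss as a weighted sum of the three parts
loss = lbox * hyp['box'] + lobj * hyp['obj'] + lcls * hyp['cls']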
  • After entering the loss calculation, the targets must first be processed by build_target.

  • build_target needs to select the positive samples first, via find_positive.

  • There are three layers in total, and each layer's predicted values are matched against the targets for the loss. Each layer has three anchors, and the anchors are numbered; this number is first appended to the targets, going from the original (b, c, x, y, w, h) to (b, c, x, y, w, h, anchor_no).

  • To select positive samples, the sizes of the ground truth and the anchors are compared first, in a two-step screening. The first step filters out ground truths that are too large or too small: the ratio of the ground truth's h and w to the anchor's h and w must fall between 1/4 and 4. Before this, the original targets must be restored to real coordinates and sizes: the raw targets are relative values in [0, 1], and t = target * gain rescales them to the resolution of this layer's feature map, instead of numbers between 0 and 1.

  • This screens out part of the ground truths. Then, to add some extra grid points on top of the original ones, g = 0.5 is used, which adds the two nearest neighboring cells for each original cell; with g = 1 you would add all four neighboring cells [to cover as many candidates as possible].

  • Compute the IoU between target and predict, and derive the IoU loss as its negative logarithm: the larger the IoU (< 1), the larger log(IoU) (still < 0) and the smaller its negative (still > 0). Then topk takes the min(10, num_predictions) largest IoUs per target, and their (rounded) sum determines how many anchors each positive sample gets (see the sketch after this list).

  • The target's class is an int and is converted to one-hot; the predicted class probabilities are multiplied by the confidence, which amounts to a weighting. The BCE-with-logits loss is then computed between this product and the target.

  • The final cost is the weighted sum of the IoU loss and the class loss.

  • According to the anchor counts determined above, the predictions matched to each target are obtained. The matching_matrix has size [number of positive-sample targets × number of predicted anchors].

  • To prevent one anchor from being assigned to multiple positive samples, matching_matrix is summed column by column. Where the sum is greater than 1, the final cost values are compared and only the assignment with the smallest cost is kept. This double screening is based mainly on the IoU between target and predict and on the predicted class.
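
Putting the last few bullets together, here is a simplified, self-contained sketch of this dynamic-k matching; the names, shapes, and the 3.0 cost weight are illustrative, not the author's exact build_target:

import torch

def dynamic_k_matching(ious, cls_cost, n_candidates=10):
    # ious:     [n_gt, n_pred] pairwise IoU between each ground truth and
    #           the candidate predictions that survived the ratio filter
    # cls_cost: [n_gt, n_pred] class cost (BCE-with-logits between the
    #           one-hot target class and the conf-weighted class predictions)
    n_gt, n_pred = ious.shape
    iou_loss = -torch.log(ious.clamp(min=1e-8))  # bigger IoU -> smaller cost
    cost = cls_cost + 3.0 * iou_loss             # weighted sum (illustrative weight)

    matching_matrix = torch.zeros_like(cost)
    # dynamic k: the summed top-min(10, n_pred) IoUs decide how many
    # predictions each ground truth is matched with
    topk_ious, _ = torch.topk(ious, min(n_candidates, n_pred), dim=1)
    dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1)
    for g in range(n_gt):
        _, idx = torch.topk(cost[g], k=int(dynamic_ks[g]), largest=False)
        matching_matrix[g, idx] = 1.0

    # if one prediction is matched to several ground truths (column sum > 1),
    # keep only the ground truth with the smallest cost
    multiple = matching_matrix.sum(0) > 1
    if multiple.any():
        best_gt = cost[:, multiple].argmin(dim=0)
        matching_matrix[:, multiple] = 0.0
        matching_matrix[best_gt, multiple] = 1.0
    return matching_matrix  # [n_gt, n_pred], 1 marks a positive assignment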

4. Inference

To make inference faster and more efficient:

  • Fusing BN with convolutional weight parameters
  • Merging 1×1 convolutions into 3×3 convolutions
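
Conv-BN fusion is a standard trick: the BN affine transform is folded into the convolution's weights and bias. A minimal sketch of how the folding works (an illustration, not the author's code):

import torch
import torch.nn as nn

def fuse_conv_and_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    #   = (scale * W) x + (beta + scale * (b - mean)),
    #   where scale = gamma / sqrt(var + eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_(bn.bias + scale * (b - bn.running_mean))
    return fused

Similarly, a 1×1 kernel can be absorbed into a 3×3 one by zero-padding it, e.g. torch.nn.functional.pad(w1x1, [1, 1, 1, 1]), which is the idea behind merging parallel branches at inference time.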

Let's stop here today. I'm hungry and off to eat; the code that fuses BN and convolution for inference will be uploaded next time.

Origin blog.csdn.net/qq_43368987/article/details/131695995