In-depth understanding of YOLOX

Foreword

YOLOX appears in recent (2022) surveys of progress in object detection. YOLOX is work from Megvii, and the source code is open sourced. Megvii also gave a detailed answer on Zhihu to the question "How to evaluate Megvii's open-source YOLOX, whose results are better than YOLOv5?" — a fair case of "the waves behind the Yangtze pushing on the waves ahead". YOLOX's innovations lie in the decoupled head, SimOTA and other techniques. This blog explains the principles behind these innovations and where they come from, and also walks through an implementation and deployment of YOLOX (based on TensorFlow 2). While reading, you can also consult the references at the end to deepen your understanding of YOLOX.

The baseline of YOLOX is YOLOv3; by adding various tricks, YOLOX-Darknet53 is obtained. Its structure diagram is shown below. Note that the CBS module is similar to the CBL module in YOLOv3, the only difference being the activation function: CBS uses SiLU, while CBL uses LeakyReLU.
(figure: YOLOX-Darknet53 network structure)
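
For reference, a CBS block can be written in a few lines of Keras. This is a minimal sketch assuming TensorFlow 2; the helper CBS and its arguments are illustrative, not the official implementation:

from tensorflow.keras import layers

def CBS(x, filters, kernel_size, strides=1):
    # Conv + BatchNorm + SiLU ("CBS"); a YOLOv3-style CBL block is identical
    # except that it ends with LeakyReLU instead of SiLU
    x = layers.Conv2D(filters, kernel_size, strides=strides, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('swish')(x)  # SiLU and swish are the same function
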
In addition, YOLOX comes in the YOLOX-s, YOLOX-m, YOLOX-l and YOLOX-x series. As in YOLOv5, these variants are obtained by scaling the depth and width of the network. Finally, for lightweight deployment, YOLOX also provides the YOLOX-Nano and YOLOX-Tiny models.

Focus

The Focus module did not first appear in YOLOX; it was already used in YOLOv5. Since YOLOv5 has not been analyzed on this blog before, the Focus module is introduced here. Focus slices the image, taking one value for every other pixel, so that four sub-images are obtained, as shown in the figure below. This concentrates the W and H information into the channel dimension and expands the input channels by a factor of four: an input of [Batch_size, 640, 640, 3] becomes an output of [Batch_size, 320, 320, 12].

(figure: Focus slicing operation)
For more on what the Focus module does, see the following link: YOLOv5 Focus() Layer

import torch
import torch.nn as nn

class Focus(nn.Module):
    # Focus wh information into c-space
    # (Conv is YOLOv5's Conv block: Conv2d + BatchNorm2d + activation)
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)
        # self.contract = Contract(gain=2)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))

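A quick check of the slicing above with a dummy tensor (PyTorch; the trailing Conv is omitted) shows the shape transformation:

import torch

x = torch.randn(1, 3, 640, 640)                     # (batch, channels, H, W)
y = torch.cat([x[..., ::2, ::2],  x[..., 1::2, ::2],
               x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1)
print(y.shape)                                      # torch.Size([1, 12, 320, 320])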

Decoupled head

The principle of decoupling is explained in detail in Revisiting the Sibling Head in Object Detector and Rethinking Classification and Localization for Object Detection. The first paper points out that in earlier detectors, predicting object coordinates and object classes with the same detection head (called the sibling head in the paper) hurts both tasks, because localization and classification focus on different parts of a feature. This misalignment can be resolved with task-aware spatial disentanglement (TSD). The idea is that, for a given instance, the features of some salient regions carry rich information for classification, while the features near the boundary matter for boundary regression, as shown in the figure below. TSD therefore improves the model by decoupling the two tasks.
(figure: task-aware spatial disentanglement — classification- vs. localization-sensitive regions)

The second paper, Rethinking Classification and Localization for Object Detection, reaches the same conclusion: the preferences of the localization task and the classification task are inconsistent, so putting them in the same detection head is unreasonable. It further shows that a fully connected head is better suited to classification while a convolutional head is better suited to localization, and therefore proposes a Double-Head structure that is similar to what YOLOX now uses.
(figure: Double-Head detection structure)

Based on this background, YOLOX adopts the detection head below. The structure not only improves detection performance but also speeds up convergence; the price of decoupling the head is a small increase in computational complexity.
(figure: YOLOX decoupled head)

The structure of YOLOX's detection head is shown in the figure below.
(figure: YOLOX detection head structure)
The different branches take the feature maps of different scales output by FPN+PAN, so objects can be classified and localized accurately across scales. With an input of [640, 640, 3], downsampling by 8, 16 and 32 gives feature maps of 80 × 80, 40 × 40 and 20 × 20. The 85 channels are 80 classes + 4 box coordinates + 1 objectness confidence. Concatenating the three scales yields an 85 × 8400 output, where 8400 = 80 × 80 + 40 × 40 + 20 × 20, i.e. the prediction boxes from all three scales.
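
To make the structure concrete, here is a minimal Keras sketch of the head for one feature level, assuming TensorFlow 2; the function decoupled_head, its channel width and the exact Conv/BN/activation layout are illustrative and differ from the official implementation:

import tensorflow as tf
from tensorflow.keras import layers

def decoupled_head(feat, num_classes=80, width=256):
    # feat: one FPN+PAN feature map, e.g. (batch, 80, 80, C)
    stem = layers.Conv2D(width, 1, padding='same', activation='swish')(feat)

    # classification branch: two 3x3 convs -> num_classes channels
    cls = layers.Conv2D(width, 3, padding='same', activation='swish')(stem)
    cls = layers.Conv2D(width, 3, padding='same', activation='swish')(cls)
    cls_out = layers.Conv2D(num_classes, 1, padding='same')(cls)

    # regression branch: two 3x3 convs -> 4 box channels and 1 objectness channel
    reg = layers.Conv2D(width, 3, padding='same', activation='swish')(stem)
    reg = layers.Conv2D(width, 3, padding='same', activation='swish')(reg)
    reg_out = layers.Conv2D(4, 1, padding='same')(reg)
    obj_out = layers.Conv2D(1, 1, padding='same')(reg)

    # (batch, H, W, 4 + 1 + num_classes), i.e. 85 channels for COCO
    return tf.concat([reg_out, obj_out, cls_out], axis=-1)

Each of the three FPN+PAN levels gets its own head; flattening and concatenating the three outputs gives the 8400 = 80 × 80 + 40 × 40 + 20 × 20 predictions discussed above.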


Strong data augmentation

This part concerns the input pipeline. The YOLOX paper uses Mosaic augmentation, shown below:
(figure: Mosaic augmentation)
together with MixUp:
(figure: MixUp augmentation)
The figures come from Reference 1.
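
MixUp itself is only a few lines. The sketch below is a hypothetical NumPy helper, not the exact YOLOX implementation; it blends two images of the same size and keeps the boxes of both:

import numpy as np

def mixup(image_1, boxes_1, image_2, boxes_2, alpha=0.5):
    # blend the two images with a Beta-distributed weight
    lam = np.random.beta(alpha, alpha)
    image = lam * image_1.astype(np.float32) + (1 - lam) * image_2.astype(np.float32)
    # keep the annotation boxes of both source images
    boxes = np.concatenate([boxes_1, boxes_2], axis=0)
    return image.astype(np.uint8), boxes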

Two points deserve special attention:

  1. During the last 15 epochs of training, these two data augmentations are turned off.
  2. Because the data augmentation is so strong, the authors found that ImageNet pre-training no longer brings a benefit, so all models are trained from scratch.

Anchor Free

For Anchor Free, please refer to the following papers:

The anchor-free approach is applied in YOLOX-Darknet53. The difference between anchor-free and anchor-based methods is simply whether anchors are used (which admittedly sounds like a tautology); the anchor's role is to serve as a reference. Recall YOLOv3: if a target falls in a certain grid cell of the feature map, the benchmark anchor of that cell is regressed toward the ground-truth box to obtain offsets. In YOLOX, all prediction boxes of the model are contained directly in the 85 × 8400 output, which covers the prediction boxes of all three scales.
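
As a concrete illustration (all numbers made up), decoding a single feature point on the stride-32 map works as follows; this mirrors the box_xy / box_wh formulas in the post-processing code later in this post:

import numpy as np

stride = 32                              # the 20x20 map of a 640x640 input
gx, gy = 5, 7                            # grid cell of the feature point
tx, ty, tw, th = 0.3, -0.2, 0.5, 0.1     # raw regression outputs at that point

cx = (tx + gx) * stride                  # box center x in input-image pixels
cy = (ty + gy) * stride                  # box center y
w  = np.exp(tw) * stride                 # box width  -- no anchor prior, only the stride
h  = np.exp(th) * stride                 # box height
print(cx, cy, w, h)                      # ≈ 169.6, 217.6, 52.8, 35.4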

Given these prediction boxes and the annotation boxes of each image, the next step is to associate the model's 8400 predictions with the ground-truth boxes on the image and select the positive samples. This association is label assignment; in YOLOX it rests on multi positives and SimOTA.

Multi positives

(figure: multi positives / center sampling)
Instead of treating only the single grid cell that contains the object center as a positive sample, YOLOX also takes the 3 × 3 area around the center as positives ("center sampling"), which greatly increases the number of high-quality positive samples.

SimOTA

For OTA, see the paper OTA: Optimal Transport Assignment for Object Detection. In SimOTA, different targets are assigned different numbers of positive samples (dynamic k). Take the ants and the watermelon from Megvii's official Zhihu answer as an example: a traditional assignment scheme gives the watermelon and the ants in the same scene the same number of positive samples, so either the ants get many low-quality positives or the watermelon gets only one or two — neither distribution is appropriate. The key to dynamic assignment is how to determine k. SimOTA selects, for each target, the 10 candidate boxes with the highest IoU against the ground-truth box, sums those IoUs and rounds down to obtain k, and then keeps the k candidates with the lowest cost. Before that, the boxes are filtered. A preliminary screening is done first (a NumPy sketch of the two filters follows the list below):

  • Judgment based on the center point: keep the anchor points (feature points) that fall inside the gt_box rectangle.
  • Judgment based on the target center: centred on the gt center, take a square whose side is 5 strides (2.5 strides in each direction) and keep all anchor points inside it.
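
A minimal NumPy sketch of these two filters (the function preliminary_screening is hypothetical; center_radius = 2.5 strides follows the public YOLOX code, everything else is simplified):

import numpy as np

def preliminary_screening(points_xy, strides, gt_boxes, center_radius=2.5):
    # points_xy: (A, 2) centers of all feature points in input-image pixels
    # strides:   (A,)   stride of the level each point belongs to
    # gt_boxes:  (G, 4) ground-truth boxes as x1, y1, x2, y2
    x, y = points_xy[:, 0][None, :], points_xy[:, 1][None, :]             # (1, A)
    x1, y1, x2, y2 = [gt_boxes[:, i][:, None] for i in range(4)]          # (G, 1)

    # filter 1: the feature point falls inside the gt box
    in_box = (x > x1) & (x < x2) & (y > y1) & (y < y2)                    # (G, A)

    # filter 2: the feature point falls inside a square of side 5 strides
    # centred on the gt center
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    r = center_radius * strides[None, :]
    in_center = (x > cx - r) & (x < cx + r) & (y > cy - r) & (y < cy + r)

    # a point survives if it passes either filter for any ground truth
    return (in_box | in_center).any(axis=0)                               # (A,)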

After the preliminary screening, a finer screening is performed:

  • Preliminary positive sample information extraction
  • Loss function calculation
  • Cost calculation
  • SimOTA solution

The fine screening is mainly SimOTA, described here. Suppose the current image has 3 ground-truth boxes and 1000 positive-sample candidates survived the preliminary screening; further calculations are performed on these. (A NumPy sketch of steps 2–4 follows the list.)

  1. First compute, between the 1000 candidate boxes and the 3 gt_boxes, the classification loss cls_loss and the localization loss iou_loss. The class confidence used here is the product of the conditional class probability and the objectness prior. From cls_loss and iou_loss a cost matrix is built with dimension [3, 1000]: 3 ground-truth boxes by 1000 candidate boxes.

  2. From the 1000 candidate boxes, select for each target the k candidates with the largest IoU (topk_ious), where k is 10. From these, dynamic_k is then determined.
    (figure: dynamic_k obtained by summing the top-10 IoUs of each target)
    From the figure, target boxes 1 and 3 are each assigned 3 candidate boxes, while target box 2 is assigned 4.

  3. Using the cost matrix of shape [3, 1000] computed earlier, a loop over the targets selects, for each target box, the dynamic_k candidate boxes with the lowest cost.
    (figure: matching matrix — the candidate box in the fifth column is selected by both target boxes 1 and 2)

  4. If the same candidate box is matched to several gt boxes (here the candidate in the fifth column is associated with both target boxes 1 and 2), compare their cost values and keep only the gt with the smaller cost, filtering the duplicate out.

  5. Compute the losses for the filtered prediction boxes. Note that iou_loss and cls_loss are computed only between the target boxes and the filtered positive-sample predictions, while obj_loss is still computed over all 8400 predictions.
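
The NumPy sketch below condenses steps 2–4 (the function simota_matching is illustrative and simplified; in the real code the cost is a weighted sum of cls_loss and iou_loss over the pre-filtered candidates):

import numpy as np

def simota_matching(cost, ious, n_candidate_k=10):
    # cost: (G, A) weighted cls + iou loss between each gt and each candidate
    # ious: (G, A) IoU between each gt and each candidate box
    num_gt, num_cand = cost.shape
    matching = np.zeros_like(cost, dtype=np.int32)

    # step 2: dynamic_k = sum of the top-10 IoUs of each gt, floored, at least 1
    topk_ious = -np.sort(-ious, axis=1)[:, :n_candidate_k]
    dynamic_ks = np.maximum(topk_ious.sum(axis=1).astype(np.int32), 1)

    # step 3: for every gt keep the dynamic_k candidates with the lowest cost
    for g in range(num_gt):
        idx = np.argsort(cost[g])[:dynamic_ks[g]]
        matching[g, idx] = 1

    # step 4: a candidate claimed by several gts goes to the gt with the lower cost
    multi = matching.sum(axis=0) > 1
    if multi.any():
        best_gt = cost[:, multi].argmin(axis=0)
        matching[:, multi] = 0
        matching[best_gt, multi] = 1
    return matching  # (G, A) 0/1 matrix of the final positive samples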

If this part is still unclear, the articles in the reference section at the end explain it in more detail.

Code

Input preprocessing

Input preprocessing mainly performs Mosaic augmentation on the images; the code is as follows:

# Excerpt from the data loader of a TensorFlow 2 YOLOX implementation. It assumes
# module-level imports (numpy as np, cv2, PIL.Image) and the class helpers
# self.rand, cvtColor and self.merge_bboxes.
def get_random_data_with_Mosaic(self, annotation_line, input_shape, max_boxes=500, jitter=0.3, hue=.1, sat=0.7, val=0.4):
        h, w = input_shape
        min_offset_x = self.rand(0.3, 0.7)
        min_offset_y = self.rand(0.3, 0.7)

        image_datas = [] 
        box_datas   = []
        index       = 0
        for line in annotation_line:
            line_content = line.split()
            image = Image.open(line_content[0])
            image = cvtColor(image)
            
            iw, ih = image.size
            box = np.array([np.array(list(map(int,box.split(',')))) for box in line_content[1:]])
            flip = self.rand()<.5
            if flip and len(box)>0:
                image = image.transpose(Image.FLIP_LEFT_RIGHT)
                box[:, [0,2]] = iw - box[:, [2,0]]
            new_ar = iw/ih * self.rand(1-jitter,1+jitter) / self.rand(1-jitter,1+jitter)
            scale = self.rand(.4, 1)
            if new_ar < 1:
                nh = int(scale*h)
                nw = int(nh*new_ar)
            else:
                nw = int(scale*w)
                nh = int(nw/new_ar)
            image = image.resize((nw, nh), Image.BICUBIC)
            # paste the four images around the split point (cutx, cuty):
            # index 0 -> top-left, 1 -> bottom-left, 2 -> bottom-right, 3 -> top-right
            if index == 0:
                dx = int(w*min_offset_x) - nw
                dy = int(h*min_offset_y) - nh
            elif index == 1:
                dx = int(w*min_offset_x) - nw
                dy = int(h*min_offset_y)
            elif index == 2:
                dx = int(w*min_offset_x)
                dy = int(h*min_offset_y)
            elif index == 3:
                dx = int(w*min_offset_x)
                dy = int(h*min_offset_y) - nh
            
            new_image = Image.new('RGB', (w,h), (128,128,128))
            new_image.paste(image, (dx, dy))
            image_data = np.array(new_image)

            index = index + 1
            box_data = []
            if len(box)>0:
                np.random.shuffle(box)
                box[:, [0,2]] = box[:, [0,2]]*nw/iw + dx
                box[:, [1,3]] = box[:, [1,3]]*nh/ih + dy
                box[:, 0:2][box[:, 0:2]<0] = 0
                box[:, 2][box[:, 2]>w] = w
                box[:, 3][box[:, 3]>h] = h
                box_w = box[:, 2] - box[:, 0]
                box_h = box[:, 3] - box[:, 1]
                box = box[np.logical_and(box_w>1, box_h>1)]
                box_data = np.zeros((len(box),5))
                box_data[:len(box)] = box
            
            image_datas.append(image_data)
            box_datas.append(box_data)
        cutx = int(w * min_offset_x)
        cuty = int(h * min_offset_y)

        new_image = np.zeros([h, w, 3])
        new_image[:cuty, :cutx, :] = image_datas[0][:cuty, :cutx, :]
        new_image[cuty:, :cutx, :] = image_datas[1][cuty:, :cutx, :]
        new_image[cuty:, cutx:, :] = image_datas[2][cuty:, cutx:, :]
        new_image[:cuty, cutx:, :] = image_datas[3][:cuty, cutx:, :]

        new_image       = np.array(new_image, np.uint8)
        # random gains for hue / saturation / value jitter
        r               = np.random.uniform(-1, 1, 3) * [hue, sat, val] + 1
        # convert the mosaic image to HSV and apply the gains through look-up tables
        hue, sat, val   = cv2.split(cv2.cvtColor(new_image, cv2.COLOR_RGB2HSV))
        dtype           = new_image.dtype
        x       = np.arange(0, 256, dtype=r.dtype)
        lut_hue = ((x * r[0]) % 180).astype(dtype)
        lut_sat = np.clip(x * r[1], 0, 255).astype(dtype)
        lut_val = np.clip(x * r[2], 0, 255).astype(dtype)

        new_image = cv2.merge((cv2.LUT(hue, lut_hue), cv2.LUT(sat, lut_sat), cv2.LUT(val, lut_val)))
        new_image = cv2.cvtColor(new_image, cv2.COLOR_HSV2RGB)
        new_boxes = self.merge_bboxes(box_datas, cutx, cuty)
        box_data = np.zeros((max_boxes, 5))
        if len(new_boxes)>0:
            if len(new_boxes)>max_boxes: new_boxes = new_boxes[:max_boxes]
            box_data[:len(new_boxes)] = new_boxes
        return new_image, box_data

Output post-processing

The output post-processing is mainly used at inference time. As noted above, YOLOX outputs feature maps at three scales (20 × 20, 40 × 40, 80 × 80), and each feature map has three branches: box regression, classification and objectness confidence. During inference, consider for example the 20 × 20 branch: if one of its feature points falls inside the ground-truth box of an object, that point is responsible for predicting the object.
(figure: the red feature points on the left are offset to the green predicted box centers on the right)

Specific steps are as follows:

  1. Compute the predicted center point: the first two channels of the regression output give the offsets added to the feature-point coordinates. After the offset, the three red feature points in the left picture become the three green points in the right picture;
  2. Compute the width and height of the prediction box: the last two channels of the regression output are exponentiated to obtain the box width and height;
  3. After all the boxes are obtained, non-maximum suppression is performed, and the final result can be obtained.
    The code is as follows:
# Decoding and NMS for the three output feature maps (TensorFlow 2 / Keras excerpt).
# Assumes: import tensorflow as tf; from tensorflow.keras import backend as K;
# yolo_correct_boxes maps boxes back to the original (possibly letterboxed) image.
def get_output(outputs, num_classes, input_shape, max_boxes = 100, confidence=0.5, nms_iou=0.3, letterbox_image=True):
    image_shape = K.reshape(outputs[-1], [-1])
    batch_size = K.shape(outputs[0])[0]
    grids = []
    strides = []
    hw = [K.shape(x)[1:3] for x in outputs]
    '''
    outputs before:
    batch_size, 80, 80, 4+1+num_classes
    batch_size, 40, 40, 4+1+num_classes
    batch_size, 20, 20, 4+1+num_classes

    outputs after:
    batch_size, 8400, 4+1+num_classes
    '''
    outputs = tf.concat([tf.reshape(x, [batch_size, -1, 5 + num_classes]) for x in outputs], axis = 1)
    for i in range(len(hw)):
        grid_x, grid_y  = tf.meshgrid(tf.range(hw[i][1]), tf.range(hw[i][0]))
        grid            = tf.reshape(tf.stack((grid_x, grid_y), 2), (1, -1, 2))
        shape           = tf.shape(grid)[:2]
        grids.append(tf.cast(grid, K.dtype(outputs)))
        strides.append(tf.ones((shape[0], shape[1], 1)) * input_shape[0] / tf.cast(hw[i][0], K.dtype(outputs)))
    grids = tf.concat(grids, axis=1)
    strides = tf.concat(strides, axis=1)
    box_xy = (outputs[..., :2] + grids) * strides / K.cast(input_shape[::-1], K.dtype(outputs))
    box_wh = tf.exp(outputs[..., 2:4]) * strides / K.cast(input_shape[::-1], K.dtype(outputs))
    box_confidence  = K.sigmoid(outputs[..., 4:5])
    box_class_probs = K.sigmoid(outputs[..., 5: ])
    boxes = yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
    box_scores  = box_confidence * box_class_probs

    mask = box_scores >= confidence
    max_boxes_tensor = K.constant(max_boxes, dtype='int32')
    boxes_out   = []
    scores_out  = []
    classes_out = []
    for c in range(num_classes):
        class_boxes      = tf.boolean_mask(boxes, mask[..., c])
        class_box_scores = tf.boolean_mask(box_scores[..., c], mask[..., c])
        nms_index = tf.image.non_max_suppression(class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=nms_iou)

        class_boxes         = K.gather(class_boxes, nms_index)
        class_box_scores    = K.gather(class_box_scores, nms_index)
        classes             = K.ones_like(class_box_scores, 'int32') * c

        boxes_out.append(class_boxes)
        scores_out.append(class_box_scores)
        classes_out.append(classes)
    boxes_out      = K.concatenate(boxes_out, axis=0)
    scores_out     = K.concatenate(scores_out, axis=0)
    classes_out    = K.concatenate(classes_out, axis=0)

    return boxes_out, scores_out, classes_out

The hw variable holds the height and width of each output feature map (strictly speaking, the spatial shape of each output tensor).
The key variables are grids and strides. Their shapes are as follows:
(figures: shapes of grids and strides)
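
Their construction can be checked with a few lines of NumPy (shapes only, assuming a 640 × 640 input):

import numpy as np

hw = [(80, 80), (40, 40), (20, 20)]
grids, strides = [], []
for h, w in hw:
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    grid = np.stack((gx, gy), 2).reshape(1, -1, 2)              # (1, h*w, 2)
    grids.append(grid)
    strides.append(np.full((1, grid.shape[1], 1), 640 / h))     # stride 8, 16 or 32
grids   = np.concatenate(grids,   axis=1)
strides = np.concatenate(strides, axis=1)
print(grids.shape, strides.shape)                               # (1, 8400, 2) (1, 8400, 1)
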
In box_xy = (outputs[..., :2] + grids) * strides / K.cast(input_shape[::-1], K.dtype(outputs)) there are two parts: (outputs[..., :2] + grids) and strides / input_shape[::-1]. The first part adds the grid coordinates to the predicted offsets to obtain the box centers on each feature map; for example, the first 6400 of the 8400 values belong to the 80 × 80 map. The second part then maps these feature-map coordinates back to normalized input-image coordinates.

References

  1. A complete explanation of the core foundation of Yolox in the Yolo series
  2. How to evaluate Megvii's open source YOLOX, the effect is better than YOLOv5


Origin: blog.csdn.net/u012655441/article/details/123799503