[Computer Vision | Object Detection] Essentials: An Introduction to Common Object Detection Algorithms (3)

Thirty-one, FoveaBox

FoveaBox is an anchor-free framework for object detection. Rather than using predefined anchors to enumerate possible positions, scales, and aspect ratios in the search for objects, FoveaBox directly learns the likelihood of object existence and the bounding box coordinates, without any anchor references. This is achieved by: (a) predicting category-sensitive semantic maps for the likelihood that an object is present, and (b) generating a category-agnostic bounding box for each position that may contain an object. The scale of the target box is naturally associated with the feature pyramid representation of each input image.

It is a single, unified network consisting of a backbone network and two task-specific subnetworks. The backbone is an off-the-shelf convolutional network responsible for computing a convolutional feature map over the entire input image. The first subnetwork performs per-pixel classification on the backbone's output; the second subnetwork performs bounding-box prediction at the corresponding locations.
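
A minimal PyTorch sketch of this two-subnet head. The layer counts and channel widths here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def tower(in_ch, out_ch, n_convs=4):
    """A small conv tower; FoveaBox-style heads stack a few 3x3 convs."""
    layers, ch = [], in_ch
    for _ in range(n_convs):
        layers += [nn.Conv2d(ch, 256, 3, padding=1), nn.ReLU(inplace=True)]
        ch = 256
    layers.append(nn.Conv2d(256, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

class FoveaHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.cls_subnet = tower(in_ch, num_classes)  # per-pixel class likelihood map
        self.box_subnet = tower(in_ch, 4)            # class-agnostic box coordinates

    def forward(self, feats):
        # feats: list of FPN feature maps; the head is shared across pyramid levels
        return [(self.cls_subnet(f), self.box_subnet(f)) for f in feats]

feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
for cls_map, box_map in FoveaHead()(feats):
    print(cls_map.shape, box_map.shape)
```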


Thirty-two, MobileDet

MobileDet is a family of object detection models developed for mobile accelerators. MobileDets make extensive use of regular convolutions on the EdgeTPU and DSPs, especially in the early stages of the network, where depthwise convolutions tend to be less efficient. Placing regular convolutions strategically in the network via neural architecture search improves the latency-versus-accuracy tradeoff of object detection on these accelerators. By incorporating regular convolutions into the search space and directly optimizing the network architecture for object detection, a family of efficient object detection models is obtained.
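
A sketch of the two building blocks this tradeoff is about, under the common MobileNet-style conventions (exact widths and activation choices are assumptions): a depthwise inverted bottleneck versus a "fused" variant in which a single regular 3x3 convolution replaces the 1x1 expansion plus depthwise convolution:

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Depthwise block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, c_in, c_out, expand=4, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
        )
    def forward(self, x): return self.block(x)

class FusedBottleneck(nn.Module):
    """Regular-conv block: a full 3x3 conv replaces expand + depthwise.
    Heavier in FLOPs, but often faster on EdgeTPU/DSP hardware."""
    def __init__(self, c_in, c_out, expand=4, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
        )
    def forward(self, x): return self.block(x)

x = torch.randn(1, 32, 64, 64)
print(InvertedBottleneck(32, 32)(x).shape, FusedBottleneck(32, 32)(x).shape)
```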


Thirty-three, YOLOP

YOLOP is a panoptic driving perception network that handles traffic object detection, drivable area segmentation, and lane detection simultaneously. It consists of one encoder for feature extraction and three decoders that handle the specific tasks. It can be seen as a lightweight version of the HydraNet model used in Tesla's self-driving system.

YOLOP uses a lightweight CNN from Scaled-YOLOv4 as the encoder to extract features from the image. These feature maps are then fed to three decoders that complete their respective tasks. The detection decoder is based on YOLOv4, one of the best-performing single-stage detection networks at the time, for two main reasons: (1) single-stage detection networks are faster than two-stage ones; (2) the grid-based prediction mechanism of single-stage detectors is more closely related to the two semantic segmentation tasks, whereas instance segmentation is usually combined with region-based detectors such as Mask R-CNN. The feature maps output by the encoder fuse semantic features at different levels and scales, and the segmentation branches can use them to complete pixel-wise semantic prediction.
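
A toy sketch of the one-encoder/three-decoder layout. The real YOLOP uses a CSPDarknet backbone with an FPN/PAN neck and a YOLOv4-style anchor-based detection head; the modules below are stand-ins that only show how the shared features are computed once and consumed by three task heads:

```python
import torch
import torch.nn as nn

class TinyYOLOP(nn.Module):
    def __init__(self, num_anchors=3, num_classes=1):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the CSPDarknet encoder
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True),
        )
        # grid-based detection output: (x, y, w, h, obj) + classes per anchor
        self.det_head = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)
        self.da_seg_head = nn.Conv2d(64, 2, 1)    # drivable area: bg/fg
        self.lane_seg_head = nn.Conv2d(64, 2, 1)  # lane line: bg/fg

    def forward(self, x):
        f = self.encoder(x)  # shared features, computed once for all tasks
        return self.det_head(f), self.da_seg_head(f), self.lane_seg_head(f)

det, da, lane = TinyYOLOP()(torch.randn(1, 3, 256, 256))
print(det.shape, da.shape, lane.shape)
```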


Thirty-four, Context-aware Visual Attention-based (CoVA) webpage object detection pipeline

CoVA is a pipeline for object detection on rendered webpages: given a page screenshot and its DOM elements, it classifies page elements (e.g., as price, title, or image) by combining each element's visual appearance with contextual information from its neighboring elements through an attention mechanism.


Thirty-five, Side-Aware Boundary Localization

Side-Aware Boundary Localization (SABL) is a method for precise localization in object detection, where each side of the bounding box is localized separately by a dedicated network branch. Empirically, the authors observe that when manually annotating the bounding box of an object, it is often much easier to align each side of the box to the object boundary than to move and resize the box as a whole. Inspired by this observation, SABL positions each side of the bounding box separately based on its surrounding context.

The authors design a bucketing scheme to improve localization accuracy. For each side of the bounding box, this scheme divides the target space into multiple buckets and then determines the bounding box in two steps: it first searches for the correct bucket, i.e., the bucket in which the boundary lies, then uses the centerline of the selected bucket as a coarse estimate and performs a fine regression by predicting a residual offset. This scheme enables very precise localization even in the presence of displacements with large variance. Furthermore, to preserve precisely localized bounding boxes during non-maximum suppression, the authors also propose adjusting the classification score based on the bucketing confidence, which further improves performance.
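
A minimal NumPy sketch of the two-step decoding for one box side. The shapes and the convention that offsets are expressed in units of the bucket width are illustrative assumptions:

```python
import numpy as np

def locate_side(side_min, side_max, bucket_scores, bucket_offsets):
    """Two-step localization of ONE box side.
    [side_min, side_max] is the search range for that side, split into
    len(bucket_scores) buckets; bucket_offsets are fine residuals
    measured in bucket widths."""
    n = len(bucket_scores)
    width = (side_max - side_min) / n
    k = int(np.argmax(bucket_scores))              # step 1: coarse bucket search
    centerline = side_min + (k + 0.5) * width      # centerline of chosen bucket
    return centerline + bucket_offsets[k] * width  # step 2: fine regression

# e.g. the left side of a box searched in [100, 180] with 8 buckets
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.0, 0.0, 0.0])
offsets = np.array([0.0, 0.1, -0.2, 0.0, 0.0, 0.0, 0.0, 0.0])
print(locate_side(100, 180, scores, offsets))  # bucket 2 center 125.0 - 2.0 = 123.0
```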


Thirty-six, Dynamic R-CNN

Dynamic R-CNN is an object detection method that automatically adjusts the label assignment criterion (IoU threshold) and the shape of the regression loss function (parameters of Smooth L1 Loss) based on the statistics of the proposals during training. The motivation is that in previous two-stage object detectors, there is an inconsistency problem between the fixed network settings and the dynamic training process. For example, fixed label assignment strategies and regression loss functions cannot adapt to changes in the distribution of proposals, which is not conducive to training high-quality detectors.

It consists of two components, Dynamic Label Assignment and Dynamic Smooth L1 Loss, designed for the classification and regression branches respectively.

For Dynamic Label Assignment, we want the model to be able to distinguish high-IoU proposals, so we gradually raise the IoU threshold for positive/negative samples according to the proposal distribution during training. Specifically, the threshold is set from a certain percentile of the proposals' IoUs, since that statistic reflects the quality of the overall distribution.

For Dynamic Smooth L1 Loss, we want the shape of the regression loss function to adapt to changes in the error distribution and to preserve the contribution of high-quality samples to training. This is achieved by adjusting the β parameter of the Smooth L1 Loss according to the regression error distribution; β controls where the loss switches from quadratic to linear, and hence the gradient magnitude for small errors.
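
A simplified NumPy sketch of the two dynamic updates, assuming per-iteration statistics; the actual Dynamic R-CNN smooths these statistics over several iterations (a mean for the IoU threshold, a median for β), and the values of K below are hyperparameters:

```python
import numpy as np

def update_iou_threshold(proposal_ious, top_k=75):
    """Dynamic Label Assignment: take the K-th highest proposal IoU
    as the new positive/negative threshold."""
    return float(np.sort(proposal_ious)[-top_k])

def update_beta(regression_errors, k=10):
    """Dynamic Smooth L1: take the K-th smallest regression error as beta,
    so small errors keep contributing large gradients as proposals improve."""
    return float(np.sort(regression_errors)[k - 1])

def smooth_l1(x, beta):
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x**2 / beta, x - 0.5 * beta)

ious = np.random.beta(2, 2, size=512)          # fake proposal IoUs, one iteration
errs = np.abs(np.random.randn(512)) * 0.1      # fake regression errors
print(update_iou_threshold(ious), update_beta(errs))
```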


Thirty-seven, DAFNe

DAFNe is a dense one-stage anchor-free deep model for oriented object detection. It is a deep neural network that makes predictions on a dense grid over the input image, giving it a simpler architectural design and easier optimization than two-stage networks. It also reduces prediction complexity by avoiding bounding-box anchors. This enables a tighter fit to oriented objects and thus better separation of bounding boxes, especially when objects are densely distributed. In addition, it introduces an orientation-aware generalization of the center-ness function to arbitrary quadrilaterals, which takes object orientation into account and down-weights low-quality predictions accordingly.


Thirty-eight, RPDet

RPDet (or RepPoints Detector) is an anchor-free, two-stage object detection model based on deformable convolution. Representative points (RepPoints) serve as the basic object representation throughout the detection system. Starting from a center point, the first set of RepPoints is obtained via regressed offsets from that center. The learning of these RepPoints is driven by two objectives: (1) a distance loss between the top-left and bottom-right corners of the pseudo box induced from the points and those of the ground-truth bounding box; (2) the object recognition loss of the subsequent stage.
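
A small sketch of the min-max conversion that induces a pseudo box from a point set (the RepPoints paper also explores partial min-max and moment-based conversion functions):

```python
import torch

def points_to_pseudo_box(pts):
    """Convert a set of RepPoints of shape (n, 2) into a pseudo box
    (x1, y1, x2, y2) by taking min/max over the points."""
    x1, y1 = pts[:, 0].min(), pts[:, 1].min()
    x2, y2 = pts[:, 0].max(), pts[:, 1].max()
    return torch.stack([x1, y1, x2, y2])

pts = torch.tensor([[12., 30.], [48., 22.], [40., 64.], [20., 55.]])
print(points_to_pseudo_box(pts))  # tensor([12., 22., 48., 64.])
```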


Thirty-nine, RetinaNet-RS

RetinaNet-RS is an object detection model produced by a model scaling method that varies the input resolution and the ResNet backbone depth. For RetinaNet, the input resolution is scaled from 512 up to 768 and the ResNet backbone depth from 50 up to 152. Since RetinaNet performs dense single-stage object detection, the authors found that enlarging the input resolution yields high-resolution feature maps and therefore many more anchors to process, which leads to higher-capacity dense prediction heads and expensive NMS. For RetinaNet, scaling therefore stops at an input resolution of 768 × 768.
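
A small arithmetic sketch of why resolution scaling is costly for dense detectors; it assumes RetinaNet's standard FPN strides (P3-P7) and 9 anchors per location:

```python
def num_anchors(input_size, strides=(8, 16, 32, 64, 128), anchors_per_loc=9):
    """Total anchor count across FPN levels for a square input; it grows
    quadratically with the input size."""
    return sum((input_size // s) ** 2 * anchors_per_loc for s in strides)

for size in (512, 640, 768):
    print(size, num_anchors(size))  # 512 -> 49104, 768 -> 110484
```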


Forty, NAS-FCOS

NAS-FCOS consists of two sub-networks: an FPN f and a set of prediction heads h with a shared structure. A significant difference from other FPN-based one-stage detectors is that the heads have partially shared weights: only the last few layers of the prediction heads are tied by their weights, and the number of shared layers is determined automatically by the search algorithm. Note that both the FPN and the heads belong to the actual search space and can contain more layers than a fixed hand-crafted design.
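
A PyTorch sketch of a partially shared head. In NAS-FCOS the split point between level-specific and shared layers is found by the search algorithm; here it is simply a constructor argument, and the plain conv-ReLU layers are stand-ins for the searched operations:

```python
import torch
import torch.nn as nn

class PartiallySharedHead(nn.Module):
    """Heads where only the LAST n_shared layers tie their weights across
    FPN levels; the first n_private layers are level-specific."""
    def __init__(self, num_levels=5, n_private=2, n_shared=2, ch=128):
        super().__init__()
        def conv():
            return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.private = nn.ModuleList(
            nn.Sequential(*[conv() for _ in range(n_private)])
            for _ in range(num_levels))
        self.shared = nn.Sequential(*[conv() for _ in range(n_shared)])  # one copy

    def forward(self, feats):
        # each level gets its own early layers, then the shared tail
        return [self.shared(branch(f)) for branch, f in zip(self.private, feats)]

feats = [torch.randn(1, 128, s, s) for s in (64, 32, 16, 8, 4)]
print(PartiallySharedHead()(feats)[0].shape)
```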


Forty-one, ExtremeNet

ExtremeNet is a bottom-up object detection framework that detects the four extreme points of an object (top-most, left-most, bottom-most, right-most). It uses a keypoint estimation framework to find the extreme points, predicting four multimodal heatmaps per object category. In addition, it uses one heatmap per category to predict the object center, computed as the average of the left/right extreme points in the x dimension and of the top/bottom extreme points in the y dimension. Extreme points are grouped into objects with a purely geometric approach: all combinations of extreme-point predictions (one per map) are enumerated, and a combination is kept if and only if the predicted score of its geometric center in the center heatmap is higher than a predefined threshold.
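
A brute-force sketch of this geometric grouping, assuming per-map candidate peaks have already been extracted as (x, y) coordinates; the real implementation also checks geometric validity and keypoint scores:

```python
import itertools
import numpy as np

def group_extreme_points(tops, lefts, bottoms, rights, center_heatmap, thresh=0.3):
    """Enumerate one candidate per extreme-point map; keep a combination iff
    its geometric center scores above thresh in the center heatmap."""
    boxes = []
    for t, l, b, r in itertools.product(tops, lefts, bottoms, rights):
        cx = int(round((l[0] + r[0]) / 2))  # center x: average of left/right x
        cy = int(round((t[1] + b[1]) / 2))  # center y: average of top/bottom y
        if center_heatmap[cy, cx] > thresh:
            boxes.append((l[0], t[1], r[0], b[1]))  # x1, y1, x2, y2
    return boxes

heat = np.zeros((128, 128))
heat[40, 50] = 0.9  # a confident center at (x=50, y=40)
print(group_extreme_points([(50, 10)], [(20, 40)], [(50, 70)], [(80, 40)], heat))
```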


Forty-two, M2Det

M2Det is a single-stage object detection model that uses a Multi-Level Feature Pyramid Network (MLFPN) to extract features from the input image, then, similar to SSD, generates dense bounding boxes and category scores from the learned features, followed by a non-maximum suppression (NMS) operation to produce the final results.
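
A minimal sketch of the final filtering step only, using torchvision's NMS on made-up dense predictions (the boxes and scores below are random stand-ins for an SSD-style head's output; 0.45 is a typical SSD-style IoU threshold, not necessarily M2Det's exact setting):

```python
import torch
from torchvision.ops import nms

boxes = torch.rand(1000, 4) * 300
boxes[:, 2:] += boxes[:, :2]                    # ensure x2 > x1 and y2 > y1
scores = torch.rand(1000)                       # per-box confidence scores
keep = nms(boxes, scores, iou_threshold=0.45)   # suppress overlapping boxes
print(boxes[keep].shape, scores[keep].shape)
```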


Forty-three, U2-Net

U2-Net is a two-level nested U-structure architecture designed for salient object detection (SOD). The architecture allows the network to go deep and maintain high resolution without significantly increasing memory and computation cost. This is achieved through a nested U-structure: at the bottom level, a novel ReSidual U-block (RSU) module extracts intra-stage multi-scale features without degrading the feature map resolution; at the top level, there is a U-Net-like structure in which each stage is filled by an RSU block.
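
A toy depth-2 RSU block to show the idea of a small U-Net wrapped in a residual connection; the real RSU blocks in U2-Net are deeper (RSU-7 down to RSU-4F) and this depth/width choice is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, dilation=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TinyRSU(nn.Module):
    """A ReSidual U-block: encode, downsample once, decode, then add the
    residual so output resolution matches the input."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.conv_in = conv_bn_relu(c_in, c_out)
        self.enc1 = conv_bn_relu(c_out, c_mid)
        self.enc2 = conv_bn_relu(c_mid, c_mid)      # runs at half resolution
        self.dec1 = conv_bn_relu(2 * c_mid, c_out)  # fuses skip + upsampled path

    def forward(self, x):
        xin = self.conv_in(x)                       # residual source
        e1 = self.enc1(xin)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        up = F.interpolate(e2, size=e1.shape[2:], mode="bilinear",
                           align_corners=False)
        return self.dec1(torch.cat([e1, up], dim=1)) + xin  # residual fusion

print(TinyRSU(3, 16, 64)(torch.randn(1, 3, 64, 64)).shape)
```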


Forty-four, RFB Net

RFB Net is a single-stage object detector that utilizes receptive field blocks. It uses a VGG16 backbone and is otherwise very similar to the SSD architecture.
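
A simplified Receptive Field Block to illustrate the core idea: parallel branches pair small kernels with increasingly dilated 3x3 convolutions (mimicking the eccentricity of receptive fields), then fuse with a shortcut. The real RFB module also uses 1xk/kx1 factorized convolutions and a scaling factor; the branch widths here are assumptions:

```python
import torch
import torch.nn as nn

class SimpleRFB(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 4
        def branch(k, d):
            # kernel size k sets the "eccentricity", dilation d the field size
            return nn.Sequential(
                nn.Conv2d(c_in, c, 1), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, k, padding=k // 2), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 3, padding=d, dilation=d), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([branch(1, 1), branch(3, 3), branch(5, 5)])
        self.fuse = nn.Conv2d(3 * c, c_out, 1)
        self.shortcut = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        out = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return torch.relu(out + self.shortcut(x))

print(SimpleRFB(128, 256)(torch.randn(1, 128, 38, 38)).shape)
```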


Forty-five, PP-YOLOv2

PP-YOLOv2 is an object detector that extends PP-YOLO with a number of improvements:

The FPN is upgraded with a Path Aggregation Network (PAN) that adds a bottom-up aggregation path.
The Mish activation function is adopted (see the sketch after this list).
The input size is increased.
The IoU-aware branch is computed with a soft-label format loss.
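
A minimal sketch of two of these ingredients: the Mish activation and a soft-label style IoU-aware objective, where the branch is trained against the measured IoU values in [0, 1] rather than hard 0/1 labels. This is an illustrative formulation, not PP-YOLOv2's exact loss code:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

def iou_aware_loss(pred_logits, ious):
    """Soft-label form: binary cross-entropy against continuous IoU targets."""
    return F.binary_cross_entropy_with_logits(pred_logits, ious)

pred = torch.randn(8)   # IoU-branch logits for 8 positive samples
ious = torch.rand(8)    # their measured IoUs with matched ground truth
print(mish(torch.tensor([-1.0, 0.0, 1.0])), iou_aware_loss(pred, ious))
```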



Source: blog.csdn.net/wzk4869/article/details/132863435