An overview of anchor-free applications: object detection, instance segmentation, and multi-object tracking

Author|Yang Yang@zhihu

Source|https://zhuanlan.zhihu.com/p/163266388

Edit|Geezer Platform

I have been following anchor-free work since May of last year. This time, I took the opportunity of a paper-reading session in our group to organize some anchor-free related work. On one hand, I will share some recent work in object detection; on the other hand, I will walk through the popular network models CenterNet and FCOS with you, and look at how the authors redesigned them when migrating them to other tasks such as segmentation and multi-object tracking.


1. Anchor-free methods in object detection

First of all, why do we have anchors at all? For years, object detection was usually modeled as classifying and regressing a set of candidate regions. In single-stage detectors, these candidate regions are the anchors generated by sliding windows; in two-stage detectors, the candidate regions are the proposals generated by the RPN, but the RPN itself still classifies and regresses sliding-window anchors.


The anchor-free methods I list here solve detection by other means. CornerNet characterizes the bounding box by predicting a pair of keypoints (top-left and bottom-right corners); CenterNet and FCOS characterize the box by predicting the center point of the object and its distances to the box edges; ExtremeNet detects the four extreme points of an object and assembles them into a detection box; AutoAssign, a recent paper, proposes a new positive/negative label assignment strategy for anchor-free detectors; Point-Set Anchors, a recent ECCV 2020 work, proposes a more generalized point-based anchor representation that unifies the three tasks of object detection, instance segmentation, and pose estimation, which we will expand on later.


First, let's briefly review the network architecture of FCOS: C3, C4, and C5 denote feature maps of the backbone network, and P3 to P7 are the feature levels used for the final prediction. Each of these five feature maps is followed by a head with three branches, used for classification, center-ness, and box regression. The overall architecture is very simple, and many people modify the output branches of FCOS to solve other tasks such as instance segmentation, keypoint detection, and object tracking.
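
To make the head structure concrete, here is a minimal PyTorch sketch of an FCOS-style head. The layer counts and channel width follow the common setting (four 3×3 convs, 256 channels), but names such as `FCOSHead` and `num_classes` are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """A minimal FCOS-style head: two conv towers feeding three output branches."""
    def __init__(self, in_channels=256, num_classes=80, num_convs=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)
        self.centerness = nn.Conv2d(in_channels, 1, 3, padding=1)  # shares the reg tower
        self.bbox_pred = nn.Conv2d(in_channels, 4, 3, padding=1)   # (l, t, r, b) distances

    def forward(self, feats):
        # feats: list of FPN maps P3..P7, each (N, C, H, W)
        outs = []
        for x in feats:
            c, r = self.cls_tower(x), self.reg_tower(x)
            outs.append((self.cls_logits(c), self.centerness(r),
                         torch.relu(self.bbox_pred(r))))  # distances must be non-negative
        return outs
```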

Below I list three adjustments the original authors made when updating the paper (FCOS v2). First, a new center sampling strategy: when assigning positive and negative samples, the stride of each level is taken into account to shrink the region around the ground-truth center in which a location counts as positive, instead of directly checking whether the location falls inside the gt bbox as in FCOS v1. This center sampling reduces the number of hard-to-discriminate samples, and the accuracy gap between using and not using the center-ness branch also shrinks. Second, the regression loss is replaced with the GIoU loss. Third, in FCOS v2 different feature levels normalize the regression targets by their own stride (in FCOS v1 the outputs were multiplied by a learnable scalar, which is retained in FCOS v2 but matters less).
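
As a rough sketch of what this center sampling looks like (my own paraphrase; the `radius` hyperparameter scaled by the per-level stride follows the common FCOS v2 setting):

```python
import torch

def center_sampling_mask(points, gt_boxes, stride, radius=1.5):
    """points: (P, 2) xy locations on one FPN level; gt_boxes: (G, 4) as x1,y1,x2,y2.
    A point counts as positive for a gt only if it lies inside a (2*radius*stride)-sized
    box around the gt center (clipped to the gt box itself)."""
    cx = (gt_boxes[:, 0] + gt_boxes[:, 2]) / 2
    cy = (gt_boxes[:, 1] + gt_boxes[:, 3]) / 2
    r = radius * stride
    # sampling region around each gt center, clipped to the gt box
    x1 = torch.maximum(cx - r, gt_boxes[:, 0]); y1 = torch.maximum(cy - r, gt_boxes[:, 1])
    x2 = torch.minimum(cx + r, gt_boxes[:, 2]); y2 = torch.minimum(cy + r, gt_boxes[:, 3])
    px, py = points[:, 0, None], points[:, 1, None]          # (P, 1)
    inside = (px > x1) & (px < x2) & (py > y1) & (py < y2)   # (P, G)
    return inside
```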


To improve FCOS further, especially in unstable environments where sensor noise or incomplete data are common, the detector should account for the confidence of its localization predictions. One line of work proposes adding a branch that predicts the uncertainty of the bbox.

The uncertainty here is obtained by predicting a distribution over the four offsets of the bbox. Each sample is assumed independent, and the four bbox offsets are modeled by a multivariate Gaussian with a diagonal covariance matrix, both output by the network. On top of the three FCOS losses (classification, center-ness, and regression), a new loss that measures the uncertainty of the bbox offsets is added. Let's take a closer look at how it is implemented.


The box offsets here are represented by (l, r, t, b). The network outputs the mean μ, a 4-dimensional vector giving the predicted bbox offsets, together with the diagonal of the covariance matrix Σ of the multivariate Gaussian mentioned earlier.

This is the loss added to the network to measure the uncertainty of the bbox offsets; focus on the term to the left of the red line. When the predicted μ differs greatly from the real bbox offsets, the network tends to output a large standard deviation, meaning the uncertainty at that point is large. Of course, there is a regularization-like term behind it, so the deviation cannot grow without limit.
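
A minimal sketch of such a loss, written as the negative log-likelihood of the ground-truth offsets under the predicted diagonal Gaussian (my own reconstruction from the description above; the first term is the error weighted by the variance, and the log-variance term plays the regularizing role):

```python
import torch

def gaussian_nll_loss(mu, log_var, target):
    """mu, log_var, target: (N, 4) tensors over the (l, r, t, b) offsets.
    Predicting log(sigma^2) keeps the variance positive and numerically stable."""
    inv_var = torch.exp(-log_var)
    # error term: large when mu is far from target *and* sigma is small,
    # so the network can lower the loss by admitting uncertainty...
    err = 0.5 * inv_var * (target - mu) ** 2
    # ...but the log-variance term penalizes unbounded sigma (the regularizer)
    reg = 0.5 * log_var
    return (err + reg).sum(dim=1).mean()
```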


Compared with FCOS using the same ResNet-50 backbone, it improves AP by 0.8 points on the COCO dataset, and comparing the two losses, the regression behaves better as well.


Next, let's look at how the point-based network of "Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation" uses the regression idea to unify object detection, instance segmentation, and pose estimation. The authors claim this is the first work to unify these three tasks.

The author observes that in object detection, positive samples are represented either by anchors whose IoU with the gt exceeds a threshold, or by the center point of the object. Whether the method is anchor-based or anchor-free, localizing a positive sample in the original image is formulated as regression: either directly regressing the rectangle coordinates, or regressing the rectangle's width and height plus an offset from its center. To some extent, the anchor only encodes prior information; it can be a center point or a rectangle, and it also shapes many design choices, such as positive/negative sample assignment and feature selection for classification and regression. The authors' idea is to propose a more generalized anchor that provides a better prior and applies to more tasks than just object detection.

For instance segmentation and object detection, the leftmost anchor in the figure is used; it has two parts, a center point and n ordered anchor points. At each image location, the scale and aspect ratio of the bounding box are varied to form several anchors, which, like anchor-based methods, involves setting some hyperparameters. For pose estimation, the anchor is the most common pose in the training set. The regression task for object detection is relatively simple: regress from the center point or the top-left/bottom-right corner points. For instance segmentation, the author uses specific matching criteria to match the anchor points of the green point-set anchor in the right figure to the points of the yellow gt instance, converting segmentation into a regression task.

The three figures on the right illustrate the matching criteria: connecting each green anchor point to the nearest gt point; connecting each green point to the nearest gt edge; and, on the far right, the author's optimized scheme. The corner points use nearest-point matching, and the four matched corner points divide the gt contour into four regions by angle; anchor points on the top and bottom boundaries are then matched by dropping vertical lines to valid gt points (points outside their region are invalid, such as the hollow green points in the figure).
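
A minimal sketch of the simplest of these criteria, nearest-point matching, which turns each anchor point's matched contour point into a regression target (my own illustration; `anchor_pts` and `gt_contour` are hypothetical names):

```python
import torch

def nearest_point_targets(anchor_pts, gt_contour):
    """anchor_pts: (n, 2) ordered points of a point-set anchor.
    gt_contour: (m, 2) points sampled along the gt instance contour.
    Returns (n, 2) regression targets: the offset from each anchor point
    to its nearest contour point."""
    # pairwise squared distances, (n, m)
    d2 = ((anchor_pts[:, None, :] - gt_contour[None, :, :]) ** 2).sum(-1)
    nearest = gt_contour[d2.argmin(dim=1)]   # (n, 2) matched gt points
    return nearest - anchor_pts              # offsets to regress
```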


Overall, Point-Set Anchors replace traditional rectangular anchors with the proposed new anchor design and attach a parallel regression branch to the head for instance segmentation or pose estimation. The figure shows the network architecture. Like RetinaNet, the authors use feature levels of different scales. The head contains sub-networks for classification, for segmentation/pose regression, and for detection box regression. Each sub-network consists of four 3×3 convolutional layers with stride 1, a FAM module used only for the pose estimation task, and an output layer. The table below lists the dimensions of the output layer for the three tasks.


The loss function is simple: focal loss for classification and L1 loss for the regression tasks.
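
A compact sketch of how these two losses might be combined (illustrative only; `alpha`, `gamma`, and the loss weight follow common RetinaNet-style defaults rather than values from the paper):

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, reg_preds, reg_targets,
                   alpha=0.25, gamma=2.0, reg_weight=1.0):
    """cls_logits/cls_targets: (N, K) logits and one-hot labels;
    reg_preds/reg_targets: (M, D) regression outputs for positive samples."""
    p = torch.sigmoid(cls_logits)
    pt = p * cls_targets + (1 - p) * (1 - cls_targets)
    at = alpha * cls_targets + (1 - alpha) * (1 - cls_targets)
    focal = -at * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))
    l1 = F.l1_loss(reg_preds, reg_targets, reduction='sum')
    num_pos = max(reg_targets.shape[0], 1)   # normalize by positive count
    return focal.sum() / num_pos + reg_weight * l1 / num_pos
```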

In addition to target normalization and embedding prior knowledge into the shape of the anchor, the author also mentions how the anchor can be used further, to aggregate features in a transformation-invariant way and to extend the method to multi-stage learning:

(1) Replace the learnable offsets in deformable convolution with the positions of the points of the point-set anchor (see the sketch after this list).

(2) Since regressing human body shape in one shot is relatively hard, demanding both strong feature extraction and handling the differences between keypoints, the author proposes using the first-stage pose prediction directly as the anchor of a second stage (for classification, mask or pose regression, and bounding box regression), adding a refinement stage for pose estimation.
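
A rough sketch of idea (1) using torchvision's deformable convolution, where the sampling offsets are derived from the anchor geometry instead of being predicted by a conv layer (my own illustration; it assumes the anchor points are mapped onto the 3×3 = 9 sampling locations of the kernel):

```python
import torch
from torchvision.ops import deform_conv2d

def anchor_guided_features(feat, anchor_offsets, weight, bias=None):
    """feat: (N, C, H, W) feature map.
    anchor_offsets: (N, 18, H, W) -- (dy, dx) pairs for each of the 9 sampling
    points of a 3x3 kernel, computed at every location from the point-set
    anchor geometry, replacing the usual learned offset branch.
    weight: (C_out, C, 3, 3) deformable conv weight."""
    return deform_conv2d(feat, anchor_offsets, weight, bias, padding=1)
```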


2. Three models in the field of instance segmentation

All three follow the practice of FCOS, migrating the anchor-free idea from object detection to instance segmentation. I will not go into the network details here, only into what adjustments each makes to the overall FCOS architecture when solving instance segmentation.


The first is CenterMask. I put it first because its idea is very direct: the structure can be understood as FCOS plus the mask branch of Mask R-CNN.


We can compare it with FCOS: the input image goes through FCOS to get the target boxes, and this part is unchanged. After that, similar to Mask R-CNN, ROIAlign crops the corresponding region, resizes it to 14×14, and the mask branch computes the loss. The idea is very simple.
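
A minimal sketch of this step using torchvision's `roi_align` (illustrative; the 14×14 crop size follows the text, while the tiny mask head here is a generic stand-in, not CenterMask's exact spatial-attention head):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class MaskBranch(nn.Module):
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True))
        self.deconv = nn.ConvTranspose2d(in_channels, in_channels, 2, stride=2)
        self.mask_logits = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, feat, boxes, stride):
        # feat: (N, C, H, W); boxes: list of (K_i, 4) detected boxes from FCOS,
        # given in image coordinates, hence the 1/stride spatial scale
        rois = roi_align(feat, boxes, output_size=(14, 14),
                         spatial_scale=1.0 / stride, aligned=True)
        return self.mask_logits(self.deconv(self.convs(rois)))  # (K, classes, 28, 28)
```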


The second is EmbedMask. While maintaining comparable accuracy, its fastest configuration reaches about three times the speed of Mask R-CNN. It is a one-stage method: roughly, semantic-segmentation-style per-pixel predictions are produced first, and then clustering-like means are used to group pixels of the same instance together to obtain the final instance segmentation result.


The overall network structure is shown in the figure above, again an FPN. On the highest-resolution feature map P3, pixel embedding is applied: each pixel is embedded into a D-dimensional vector, so the result is an H×W×D feature map. Then a proposal head, i.e. the head of a conventional detector, is applied to each of P3, P4, P5, P6, and P7 in turn; the improvement is that each proposal is also embedded into a D-dimensional vector. A margin defines the degree of association between two embeddings: if the distance between a pixel embedding and a proposal embedding is smaller than the margin, the pixel and the proposal are considered to belong to the same instance. However, a hand-crafted margin causes problems, so this paper proposes a learnable margin, letting the network learn a margin for each proposal, shown as the proposal margin path in the results figure. Compared with FCOS, EmbedMask adds the blue modules in the figure.
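
A minimal sketch of this association step (my own paraphrase: the learnable margin is used as a Gaussian-like soft assignment, which is roughly what `sigma` does below; all names are illustrative):

```python
import torch

def instance_mask(pixel_embed, proposal_embed, sigma):
    """pixel_embed: (D, H, W) per-pixel embeddings from P3.
    proposal_embed: (D,) embedding of one detected proposal.
    sigma: scalar learnable margin for this proposal.
    Returns (H, W) soft foreground probability for this instance."""
    d2 = ((pixel_embed - proposal_embed[:, None, None]) ** 2).sum(dim=0)
    prob = torch.exp(-d2 / (2 * sigma ** 2))   # close in embedding space -> near 1
    return prob                                # threshold (e.g. > 0.5) for a hard mask
```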

Although EmbedMask and CenterMask both build instance segmentation on a one-stage detection algorithm, the core point is unchanged: both generate masks from the proposals of a good enough detector, and this proves very effective. Instance segmentation built on a strong detector not only finds more masks; generating those masks in turn improves the detector itself. That is why the box AP of both instance segmentation methods is higher than that of plain FCOS, which is to be expected.


The third paper is PolarMask, also based on FCOS, which unifies instance segmentation into the FCN framework. PolarMask proposes a new way of modeling instances: it divides 360 degrees in polar coordinates into 36 rays and obtains the object contour by predicting, in each of these 36 directions, the distance from the polar center to the object edge.
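
A minimal sketch of decoding such a prediction back into a contour polygon (a straightforward polar-to-Cartesian conversion; variable names are illustrative):

```python
import torch

def decode_polar_contour(center, distances):
    """center: (2,) xy polar center; distances: (36,) predicted distances,
    one per 10-degree ray. Returns (36, 2) contour vertices."""
    angles = torch.arange(36) * (2 * torch.pi / 36)
    xs = center[0] + distances * torch.cos(angles)
    ys = center[1] + distances * torch.sin(angles)
    return torch.stack([xs, ys], dim=1)   # connect in order to form the mask polygon
```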


3. Some work in the field of multi-object tracking

Here we mainly compare two extensions of CenterNet. First, a brief introduction to the MOT (Multi-Object Tracking) task: it requires detecting the objects in every frame of a video and assigning an id to each object so the target can be tracked over time.


CenterTrack is by the original authors of CenterNet. When extending object detection to multi-object tracking, the authors solve tracking by tracking the center points of objects. The MOT task has two keys: first, detect the objects in every frame, including occluded ones; second, match object ids across the time dimension.


The red area in the figure below is what solves the tracking task. The inputs are the image at time t, the image at time t-1, and all objects detected at time t-1. This is where it differs from the plain detection task: four new input channels are added (three of them are the previous image; how the remaining channel is computed is expanded on later).

In the output part, besides the heatmap of detected center peaks and the feature map of predicted width and height, the network also outputs a 2-channel offset, which represents how far each object moved between the two frames.


On the left is the network's input and on the right its output. Mathematically, I denotes an input image and b ∈ T denotes a tracked bbox of the previous frame; the outputs on the right are the detected center-peak heatmap, the width/height feature map, and the object-motion offset.


Above are the specific expressions of the three loss functions used in training, corresponding to the center peak points, the width/height feature map, and the object-motion offset. For center-point prediction, focal loss is used: x and y index positions on the heatmap and c is the category. Y is a heatmap with values in [0, 1], rendered as Gaussian-shaped peaks: for each object center of a given category, a peak is rendered on the corresponding channel, and each location takes the maximum over all Gaussians covering it (p is a center point and q a heatmap location). The previous frame's detections are rendered the same way into a single-channel heatmap that becomes part of the network input; together with the three-channel image of the previous frame, this forms the four newly added input channels for the tracking task.
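
A minimal sketch of that Gaussian rendering step (the max-over-objects rule follows the text; the fixed `sigma` is a simplification of the usual CenterNet-style size-dependent radius):

```python
import torch

def render_heatmap(centers, shape, sigma=2.0):
    """centers: (K, 2) xy object centers on the output grid; shape: (H, W).
    Returns an (H, W) heatmap where each location keeps the max over all
    object Gaussians -- usable both as a training target and, rendered from
    the previous frame's detections, as the extra 1-channel tracking input."""
    H, W = shape
    ys = torch.arange(H).view(H, 1).float()
    xs = torch.arange(W).view(1, W).float()
    heat = torch.zeros(H, W)
    for cx, cy in centers:
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = torch.maximum(heat, g)   # max, not sum, so peaks stay at 1
    return heat
```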

For the width/height and offset losses, a simple L1 loss is used. With a good enough offset prediction, the network can associate targets with the previous frame: for each detected position p, we assign it the id of the closest object at the previous moment, and if there is no previous-frame target within a radius κ, we start a new track.
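
A minimal sketch of this greedy association (my own paraphrase; `kappa` plays the role of the radius κ, a matched previous id is not marked as consumed here for brevity):

```python
import torch

def associate(detections, offsets, prev_centers, prev_ids, kappa, next_id):
    """detections: (N, 2) current centers; offsets: (N, 2) predicted motion;
    prev_centers: (M, 2); prev_ids: list of M ids. Returns N assigned ids."""
    ids = []
    for det, off in zip(detections, offsets):
        projected = det - off                       # where this object was at t-1
        if len(prev_centers):
            d = torch.linalg.norm(prev_centers - projected, dim=1)
            j = int(d.argmin())
            if d[j] < kappa:
                ids.append(prev_ids[j]); continue
        ids.append(next_id); next_id += 1           # no match: start a new track
    return ids, next_id
```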


FairMOT is also based on CenterNet and is contemporaneous with CenterTrack. Unlike CenterTrack, which predicts the motion offset of target boxes between consecutive frames, it draws on the idea of re-identification and adds a Re-ID branch next to the detection branch, treating the embedding of a target's id as a classification task. At training time, all object instances with the same id across the training set are treated as one class. A 128-dimensional embedding vector is attached to each point on the feature map and mapped to a class score p(k), where k ranges over the K identity classes (the ids that have appeared); together with the one-hot encoding of the gt id, the loss is computed with softmax.
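
A minimal sketch of such a Re-ID branch (illustrative names; the 128-dim embedding follows the text, while the plain linear identity classifier is a simplifying assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReIDBranch(nn.Module):
    def __init__(self, in_channels=64, embed_dim=128, num_ids=1000):
        super().__init__()
        self.embed = nn.Conv2d(in_channels, embed_dim, 3, padding=1)
        self.classifier = nn.Linear(embed_dim, num_ids)  # used only at train time

    def forward(self, feat, centers, gt_ids=None):
        # feat: (N, C, H, W); centers: (K, 3) long tensor of (batch, y, x) centers
        emb = self.embed(feat)                                      # (N, 128, H, W)
        vecs = emb[centers[:, 0], :, centers[:, 1], centers[:, 2]]  # (K, 128)
        if gt_ids is None:
            return F.normalize(vecs, dim=1)   # test time: embeddings for matching
        return F.cross_entropy(self.classifier(vecs), gt_ids)  # train: id classes
```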

2020-7-24 update: some readers may have doubts about mapping embeddings to classification here. When many new people appear in subsequent frames, can FairMOT give them correct new ids? The answer is that the classification loss is only used during training; in the testing phase, the cosine distance between embeddings is used for matching, and when the re-ID embedding is unreliable, bbox IoU is used for matching instead. Specifically, if a re-ID embedding fails to match a bbox, IoU against the candidate tracking boxes of the previous frame is used: a similarity matrix between them is computed, and the Hungarian algorithm produces the final assignment.
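
A rough sketch of that test-time matching (cosine similarity plus the Hungarian algorithm via scipy; the IoU fallback is omitted for brevity and all names are illustrative):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_by_embedding(track_embs, det_embs, threshold=0.4):
    """track_embs: (T, 128) existing tracks; det_embs: (D, 128) new detections.
    Both assumed L2-normalized. Returns a list of (track_idx, det_idx) pairs."""
    cost = 1.0 - track_embs @ det_embs.T          # cosine distance matrix (T, D)
    rows, cols = linear_sum_assignment(cost.numpy())
    # keep only sufficiently confident matches; unmatched detections get new ids
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < threshold]
```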


Finally, here are links to some technical articles that benefited me during this study and write-up:

Tourbillon: Object Detection: Anchor-Free Era

FY.Wei: Unified Object Detection, Instance Segmentation, and Human Pose Estimation Using Point-set Anchor

Chen Kai: Reincarnation of object detection: anchor-based and anchor-free https://zhuanlan.zhihu.com/p/62372897
