Foreword
The previous article gave a detailed overview of object detection. This post serves as an extension and supplement, recording the latest (2022) progress in object detection, focusing on well-known networks that dominate the COCO test-dev leaderboard. For details, please refer to the relevant papers and code.
Swin Transformer V2
Paper address: Swin Transformer V2: Scaling Up Capacity and Resolution
Code address: Swin Transformer V2 Code
This work scales Swin Transformer up to 3 billion parameters and trains it with images of up to 1,536×1,536 resolution. By scaling up network capacity and resolution, Swin Transformer V2 sets records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification.

Scaling up vision models has not been explored as widely as scaling NLP language models, partly because of the following difficulties in training and application: 1) large vision models often suffer from training instability; 2) many downstream vision tasks require high-resolution images or windows, and it is unclear how to effectively transfer a model pre-trained at low resolution to higher resolutions; 3) at high image resolutions, GPU memory consumption becomes a bottleneck.

To address these issues, the research team proposes several techniques, using Swin Transformer as a case study: 1) a residual post-normalization technique combined with a scaled cosine attention method to improve the training stability of large vision models; 2) a log-spaced continuous position bias technique to efficiently transfer models pre-trained on low-resolution images and windows to their higher-resolution counterparts. In addition, the team shared key implementation details that lead to significant savings in GPU memory consumption, making it feasible to train large vision models on conventional GPUs.
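The scaled cosine attention mentioned above replaces the usual dot-product similarity with the cosine between query and key vectors, divided by a scalar temperature, so attention logits stay bounded regardless of model size. The following is a minimal single-head sketch in NumPy, assuming a fixed temperature `tau` (in the actual model, tau is a learnable per-head parameter and a log-spaced continuous position bias term is added to the logits):

```python
import numpy as np

def scaled_cosine_attention(q, k, v, tau=0.1):
    """Illustrative sketch of scaled cosine attention (Swin Transformer V2).

    q, k, v: arrays of shape (num_tokens, head_dim).
    tau: temperature; logits are bounded in [-1/tau, 1/tau].
    """
    # L2-normalize queries and keys so their dot products become cosines.
    q_hat = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k_hat = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = (q_hat @ k_hat.T) / tau
    # Numerically stable softmax over the key dimension.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: 4 tokens with an 8-dimensional head.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_cosine_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because the cosine is bounded, the attention logits cannot blow up as activations grow in very deep, very wide models, which is what stabilizes large-capacity training.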
Swin Transformer
Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Code: Swin Transformer Code
Dynamic Head
Paper: Dynamic Head: Unifying Object Detection Heads with Attentions
Code: Dynamic Head Code
YOLOF
Paper: You Only Look One-level Feature
Code: YOLOF Code
YOLOR
Paper: You Only Learn One Representation: Unified Network for Multiple Tasks
Code: YOLOR Code
YOLOX
Paper: YOLOX: Exceeding YOLO Series in 2021
Code: YOLOX Code
Scaled-YOLOv4
Paper: Scaled-YOLOv4: Scaling Cross Stage Partial Network
Code: Scaled-YOLOv4 Code
Scale-Aware Trident Networks
Paper: Scale-Aware Trident Networks for Object Detection
Code: Scale-Aware Trident Networks Code
DETR
Paper: End-to-End Object Detection with Transformers
Code: DETR Code
Dynamic R-CNN
Paper: Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training
Code: Dynamic R-CNN Code