Author: Cheng He
Compilation: ronghuaiyang
Guided reading
Transformers are being applied to more and more CV tasks. This article reviews some of the related developments.
Transformer architectures have achieved state-of-the-art results in many natural language processing tasks. A major breakthrough among Transformer models is arguably GPT-3, released in the middle of this year, whose paper received a NeurIPS 2020 Best Paper award.
In computer vision, CNNs have been the dominant models for vision tasks since 2012. As more efficient architectures emerge, computer vision and natural language processing are converging, and using Transformers for vision tasks has become a new research direction that aims to reduce architectural complexity and to explore scalability and training efficiency.
Here are a few of the better-known projects in this line of work:
DETR (End-to-End Object Detection with Transformers) uses Transformers for object detection and segmentation.
Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale) uses a Transformer for image classification.
Image GPT (Generative Pretraining from Pixels) uses a Transformer for pixel-level image completion, much as other GPT models complete text.
End-to-end Lane Shape Prediction with Transformers uses a Transformer for lane marking detection in autonomous driving.
Structure
In general, there are mainly two model architectures in the related work that adopts Transformer in CV. One is a pure Transformer structure, and the other is a hybrid structure that combines CNNs/backbone with Transformers.
Pure Transformer
Hybrid (CNNs + Transformer)
Vision Transformer is a purely self-attention-based Transformer architecture that uses no CNN, while DETR is an example of a hybrid architecture that combines convolutional neural networks (CNNs) and Transformers.
Some questions
Why use Transformers in CV, and how?
What are the results on the benchmarks?
What are the constraints and challenges of using Transformers in CV?
Which structure is more efficient and flexible? Why?
The answers follow from a closer look at ViT, DETR, and Image GPT.
Vision Transformer
Vision Transformer (ViT) applies a pure Transformer architecture directly to a sequence of image patches for classification and achieves excellent results. It outperforms state-of-the-art convolutional networks on many image classification tasks while requiring significantly fewer pre-training computational resources (at least 4x fewer).
[Figure: Vision Transformer model structure]
Image sequence patches
The image is split into fixed-size patches, and linear projections of these patches are fed into the Transformer together with their positions in the image. The remaining steps are a clean, standard Transformer encoder followed by a classification head; ViT does not use a Transformer decoder.
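The patch splitting and linear projection described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name image_to_patch_embeddings, the 224×224×3 input, the embedding width of 768 and the random (untrained) projection weights are all chosen for the example and are not taken from the article.

```python
import numpy as np


def image_to_patch_embeddings(image, patch_size, projection):
    """Split an (H, W, C) image into flattened patches and project them.

    projection has shape (patch_size * patch_size * C, D).
    Returns an array of shape (N, D), with N = (H // P) * (W // P).
    """
    height, width, channels = image.shape
    p = patch_size
    if height % p or width % p:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = height // p, width // p
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (N, P * P * C)
    patches = image.reshape(rows, p, cols, p, channels)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, p * p * channels)
    # One shared linear projection turns every flattened patch into a token.
    return patches @ projection


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                         # stand-in for a real image
projection = rng.normal(size=(16 * 16 * 3, 768)) * 0.02   # random, untrained weights
tokens = image_to_patch_embeddings(image, 16, projection)
print(tokens.shape)  # (196, 768)
```

Each row of tokens corresponds to one 16×16 patch, taken in raster order (left to right, top to bottom), which is the sequence the Transformer encoder would consume.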
Adding positional embeddings to the patch embeddings preserves spatial information. The paper tries several ways of encoding this information: no positional embedding, 1D positional embeddings, 2D positional embeddings, and relative positional embeddings.
[Figure: Comparison of different positional encoding strategies]
An interesting finding is that 2D positional embeddings bring no significant performance gain over 1D positional embeddings.
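To make the 1D versus 2D distinction concrete, here is a minimal NumPy sketch. It assumes one common way of building a 2D embedding, namely concatenating a half-width row embedding with a half-width column embedding; the function names, the 14×14 patch grid and the random stand-in values are assumptions for illustration only.

```python
import numpy as np


def add_position_embeddings(tokens, pos_table):
    """tokens: (N, D); pos_table: (N, D), one vector per patch position."""
    return tokens + pos_table


def build_2d_position_table(row_table, col_table):
    """row_table: (rows, D/2), col_table: (cols, D/2) -> (rows * cols, D).

    Each patch (in raster order) gets its row embedding concatenated
    with its column embedding.
    """
    rows, cols = row_table.shape[0], col_table.shape[0]
    row_part = np.repeat(row_table, cols, axis=0)   # row index changes slowly
    col_part = np.tile(col_table, (rows, 1))        # column index changes fast
    return np.concatenate([row_part, col_part], axis=1)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))                # 14 x 14 patches, D = 768
# 1D: a single table indexed by the patch's position in the sequence.
pos_1d = rng.normal(size=(196, 768)) * 0.02         # random stand-in for learned values
# 2D: separate tables for the row and the column of each patch.
pos_2d = build_2d_position_table(rng.normal(size=(14, 384)) * 0.02,
                                 rng.normal(size=(14, 384)) * 0.02)
with_1d = add_position_embeddings(tokens, pos_1d)
with_2d = add_position_embeddings(tokens, pos_2d)
print(with_1d.shape, with_2d.shape)  # (196, 768) (196, 768)
```

Both variants produce inputs of the same shape; they differ only in how the position table is parameterized.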
Datasets
The model is pretrained on deduplicated data from several large datasets and then fine-tuned on (smaller) downstream-task datasets.
The ILSVRC-2012 ImageNet dataset has 1k classes and 1.3 million images
ImageNet-21k has 21k classes and 14 million images
JFT has 18k classes and 303 million high-resolution images
Model variants
Like other popular Transformer models (GPT, BERT, RoBERTa), ViT comes in different model sizes (Base, Large and Huge) with different numbers of Transformer layers and attention heads. For example, ViT-L/16 denotes the Large (24-layer) ViT model with an input patch size of 16×16.
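As a small illustration of this naming scheme, the sketch below parses a variant name into its size letter and patch size. The helper parse_vit_name is hypothetical. Only the 24-layer Large value comes from the text above; the Base (12) and Huge (32) layer counts are filled in from the ViT paper's model table and should be treated as assumptions here.

```python
import re

# Layer counts per size letter (B = Base, L = Large, H = Huge).
# Only L = 24 is stated in the article; B and H are assumptions
# taken from the ViT paper's model table.
VIT_LAYERS = {"B": 12, "L": 24, "H": 32}


def parse_vit_name(name):
    """Parse a name such as 'ViT-L/16' into size, layer count and patch size."""
    match = re.fullmatch(r"ViT-([BLH])/(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized ViT variant name: {name!r}")
    size, patch_size = match.group(1), int(match.group(2))
    return {"size": size, "layers": VIT_LAYERS[size], "patch_size": patch_size}


print(parse_vit_name("ViT-L/16"))  # {'size': 'L', 'layers': 24, 'patch_size': 16}
```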
Note that the smaller the input patch size, the higher the computational cost, because the number of input patches is N = HW / P^2, where (H, W) is the resolution of the original image and P is the side length of each square patch. A 14×14 patch size is therefore more computationally expensive than a 16×16 patch size.
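A quick calculation shows the effect. The 224×224 input resolution below is an assumption chosen for the example. The last printed column counts token pairs, since self-attention compares every token with every other token, so that part of the cost grows roughly with N squared.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping square patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


for patch_size in (32, 16, 14):
    n = num_patches(224, 224, patch_size)
    print(patch_size, n, n * n)
# 32 49 2401
# 16 196 38416
# 14 256 65536
```

Going from 16×16 to 14×14 patches raises the sequence length from 196 to 256 tokens for the same image.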
Benchmark results
[Figure: Benchmark results for image classification]
The results above show that the model outperforms existing SOTA models on several popular benchmark datasets.
Vision Transformers (ViT-H/14, ViT-L/16) pretrained on the JFT-300M dataset outperform the ResNet baseline (ResNet152x4, pretrained on the same JFT-300M dataset) on all test datasets, while using far fewer computational resources (TPUv3-core-days) during pre-training. Even ViT pretrained on ImageNet-21k outperforms the baseline.
Model performance vs dataset size
[Figure: Pre-training dataset size vs model performance]
The graph above shows the effect of dataset size on model performance. When the pre-training dataset is small, ViT does not perform well; when the training data is sufficient, it outperforms the previous SOTA.
Which structure is more efficient?
As mentioned at the beginning, architectures that use Transformers for computer vision differ: some replace CNNs entirely with Transformers (ViT), some replace them partially, and some combine CNNs with Transformers (DETR). The results below show the performance of each model architecture under the same computational budget.
[Figure: Performance and computational cost of different model architectures]
The experiments above show that:
The pure Transformer architecture (ViT) is more efficient and scalable than traditional CNNs (ResNet BiT) in both model size and computational scale
Hybrid architectures (CNNs + Transformer) outperform the pure Transformer at smaller model sizes and perform very similarly at larger model sizes
Key points of ViT
Uses a Transformer architecture (pure or hybrid)
The input image is split into patches that are flattened into a sequence
Beats SOTA on multiple image recognition benchmarks
Cheaper to pre-train on large datasets
More scalable and computationally efficient
DETR
DETR is the first object detection framework to successfully use the Transformer as the main building block of its pipeline. It matches the performance of the previous SOTA method (a highly optimized Faster R-CNN) with a simpler and more flexible pipeline.
[Figure: DETR, an object detection pipeline combining a CNN and a Transformer]
The picture above shows DETR, a hybrid pipeline with a CNN and a Transformer as its main building blocks. Here is the flow:
A CNN learns a 2D representation of the image and extracts features
The CNN output is flattened and supplemented with positional encodings, then fed into a standard Transformer encoder
The Transformer decoder predicts classes and bounding boxes by passing its output embeddings to a feed-forward network (FFN)
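The data flow in the steps above can be traced with array shapes alone. The NumPy sketch below is not DETR's implementation: the Transformer itself is left out and replaced by a random stand-in, the prediction heads are reduced to single linear layers, and every size (256 channels, a 25×34 feature map, 100 decoder outputs, 92 class scores) is a hypothetical value picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def flatten_feature_map(feature_map, positional_encoding):
    """(D, H, W) features + (D, H, W) positional encoding -> (H * W, D) sequence."""
    d, h, w = feature_map.shape
    combined = feature_map + positional_encoding
    return combined.reshape(d, h * w).T


def prediction_heads(decoder_output, class_weights, box_weights):
    """Map each decoder embedding to class scores and a normalized box."""
    class_logits = decoder_output @ class_weights                   # (Q, num_scores)
    boxes = 1.0 / (1.0 + np.exp(-(decoder_output @ box_weights)))   # (Q, 4), in [0, 1]
    return class_logits, boxes


# Step 1: stand-in for the CNN backbone output (hypothetical sizes).
feature_map = rng.normal(size=(256, 25, 34))
positional_encoding = rng.normal(size=(256, 25, 34))

# Step 2: flatten to a sequence that a Transformer encoder could consume.
sequence = flatten_feature_map(feature_map, positional_encoding)
print(sequence.shape)  # (850, 256)

# Step 3: random stand-in for the decoder output; the Transformer is not implemented here.
decoder_output = rng.normal(size=(100, 256))
class_logits, boxes = prediction_heads(
    decoder_output,
    rng.normal(size=(256, 92)) * 0.02,   # untrained class head weights
    rng.normal(size=(256, 4)) * 0.02,    # untrained box head weights
)
print(class_logits.shape, boxes.shape)  # (100, 92) (100, 4)
```

Every decoder embedding yields one class prediction and one box, so the whole set of predictions comes out in a single pass.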
Simpler Pipelines
[Figure: Comparison of a traditional object detection pipeline and DETR]
Traditional object detection methods, such as Faster R-CNN, involve multiple steps for anchor generation and non-maximum suppression (NMS). DETR drops these hand-designed components, significantly simplifying the object detection pipeline.
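For readers unfamiliar with NMS, here is a minimal sketch of the greedy variant, to show the kind of hand-designed post-processing step that DETR does away with. It is a generic textbook version, not the code of any particular detector; the box format (x1, y1, x2, y2) and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box is dropped because it overlaps the higher-scoring first box almost entirely; the threshold that decides this is exactly the sort of hand-tuned parameter the simplified pipeline avoids.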
Amazing results when extended to panoptic segmentation
In the paper, the authors further extend the DETR pipeline to panoptic segmentation, a recently popular and challenging pixel-level recognition task. Briefly, panoptic segmentation unifies two different tasks: traditional semantic segmentation (assigning a class label to each pixel) and instance segmentation (detecting and segmenting each object instance). Using one model architecture to solve both tasks is a very smart idea.
[Figure: Panoptic segmentation at pixel level]
The image above shows an example of panoptic segmentation. With its unified pipeline, DETR surpasses very competitive baselines.
Attention visualization
The figure below shows the attention of the Transformer decoder on predictions. The attention scores of different objects are represented by different colors.
Looking at the colors, it is striking how well the model understands the image globally through self-attention, which helps it handle overlapping bounding boxes. In particular, the zebra legs shown in orange are classified correctly even though they partially overlap with the blue and green regions.
[Figure: Decoder attention visualization for predicted objects]
Key points of DETR
Uses a Transformer for a simpler and more flexible pipeline
Matches SOTA on object detection tasks
Outputs the final set of predictions directly and in parallel, which is more efficient
Unified architecture for object detection and segmentation
Detection of large objects improves significantly, but detection of small objects degrades
Image GPT
Image GPT is a GPT-2 Transformer model trained on pixel sequences to perform image completion. Like general-purpose pretrained language models, it is designed to learn high-quality unsupervised image representations. It predicts the next pixel autoregressively, without any knowledge of the 2D structure of the input image.
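The idea of treating an image as a 1D pixel sequence can be shown with a toy example. The sketch below only prepares the data: it flattens a tiny grid in raster order and lists the (context, next pixel) pairs an autoregressive model would be trained on. The function names and the 2×2 "image" are made up for illustration, and no model is involved.

```python
def to_pixel_sequence(image):
    """Flatten a 2D grid of pixel values row by row (raster order)."""
    return [pixel for row in image for pixel in row]


def next_pixel_examples(sequence):
    """Build (context, target) pairs for next-pixel prediction."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


image = [[0, 1],
         [2, 3]]
sequence = to_pixel_sequence(image)
print(sequence)                       # [0, 1, 2, 3]
print(next_pixel_examples(sequence))  # [([0], 1), ([0, 1], 2), ([0, 1, 2], 3)]
```

Once flattened, the sequence carries no explicit row or column information, which is what predicting without knowledge of the 2D structure means. Image completion then amounts to supplying the first part of the sequence as context and generating the rest one pixel at a time.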
Features from the pretrained Image GPT achieve state-of-the-art performance on some classification benchmarks and approach state-of-the-art unsupervised accuracy on ImageNet.
The figure below shows model-generated completions of human-provided half-images, including some creative completions.
[Figure: Image completion from Image GPT]
Highlights of Image GPT:
Uses the same Transformer architecture as GPT-2 in NLP
Unsupervised learning without human labeling
Requires more computation to produce competitive representations
The learned features achieve SOTA performance on classification benchmarks with low-resolution datasets
Summary
The great success of Transformers in natural language processing has carried over into computer vision, where applying them has become a new research direction.
The Transformer proves to be a simple and scalable framework for computer vision tasks such as image recognition, classification and segmentation, or simply for learning global image representations.
Compared with traditional methods, it has significant advantages in training efficiency.
In terms of architecture, it can be used on its own (pure Transformer) or in a hybrid design combined with CNNs.
It also faces challenges, such as DETR's weaker performance on small objects and ViT's weaker performance when the pre-training dataset is small.
Transformers are emerging as a more general framework for learning from sequence data, including text, images, and time series data.
—END—
Original English: https://towardsdatascience.com/transformer-in-cv-bbdb58bf335e