Transformers in Computer Vision


Author: Cheng He

Translated by: ronghuaiyang

Overview

Transformers are being applied to more and more computer vision tasks. This article surveys some of the related developments.


Transformer architectures have achieved state-of-the-art results on many natural language processing tasks. A major milestone for Transformer models was GPT-3, released in mid-2020, whose paper received a NeurIPS 2020 Best Paper award.


In computer vision, CNNs have been the dominant models for vision tasks since 2012. As ever more efficient architectures emerge, computer vision and natural language processing are converging, and using Transformers for vision tasks has become a new research direction aimed at reducing architectural complexity and exploring scalability and training efficiency.

Here are a few of the more well-known projects in related work:

  • DETR (End-to-End Object Detection with Transformers) uses Transformers for object detection and segmentation.

  • Vision Transformer (AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE) uses a Transformer for image classification.

  • Image GPT (Generative Pretraining from Pixels) uses a Transformer for pixel-level image completion, just as other GPT models complete text.

  • End-to-end Lane Shape Prediction with Transformers uses a Transformer for lane marking detection in autonomous driving.

Structure

In general, two main model architectures appear in work that adopts Transformers for CV. One is a pure Transformer structure, and the other is a hybrid structure that combines CNN backbones with Transformers.

  • Pure Transformer

  • Hybrid: (CNNs+ Transformer)

Vision Transformer is a pure, fully self-attention-based Transformer architecture that does not use a CNN, while DETR is an example of a hybrid model that combines convolutional neural networks (CNNs) with Transformers.

Some questions

  • Why use Transformers in CV, and how?

  • What are the results on benchmarks?

  • What are the constraints and challenges of using Transformers in CV?

  • Which structure is more efficient and flexible? Why?

You'll find these questions answered in the following in-depth look at ViT, DETR, and Image GPT.

Vision Transformer

Vision Transformer (ViT) applies a pure Transformer architecture directly to a sequence of image patches for classification tasks and achieves excellent results. It outperforms state-of-the-art convolutional networks on many image classification tasks while requiring substantially fewer (at least 4x less) pre-training computational resources.

[Figure: Vision Transformer model structure]

Image patches as a sequence

The image is split into fixed-size patches, and linear projections of these patches are fed into the Transformer together with their positions in the image. The remaining steps are a clean, standard Transformer encoder followed by a classification head.

By adding positional embeddings to the image patch embeddings, spatial/positional information is preserved globally. The paper tries different strategies for encoding this spatial information, including no positional encoding, 1D/2D positional embeddings, and relative positional embeddings.

[Figure: Comparison of different positional encoding strategies]

An interesting finding is that 2D positional embeddings did not bring significant performance gains compared to 1D positional embeddings.
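
The following is a minimal sketch, in PyTorch, of how an image can be turned into a sequence of patch embeddings with a prepended class token and learned 1D positional embeddings. It is illustrative only, not the official ViT code; the ViT-Base-style hyperparameters (224×224 input, 16×16 patches, 768-dimensional embeddings) are assumptions chosen for the example.

```python
# Illustrative sketch of ViT-style patch embedding (not the official implementation).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to splitting the image into patches
        # and applying a shared linear projection to each patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                      # [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))   # learned 1D positions

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                         # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)         # (B, 196, dim) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the class token
        return x + self.pos_embed                # add positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting token sequence is what gets fed into the standard Transformer encoder.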

Datasets

The model is pretrained on deduplicated data from multiple large datasets and then fine-tuned on (smaller) downstream datasets.

  • The ILSVRC-2012 ImageNet dataset has 1k classes and 1.3 million images

  • ImageNet-21k has 21k classes and 14 million images

  • JFT has 18k classes and 303 million high resolution images

variant of the model

[Figure: ViT model variants]

Like other popular Transformer models (GPT, BERT, RoBERTa), ViT (Vision Transformer) comes in different model sizes (Base, Large, and Huge) with different numbers of Transformer layers and heads. For example, ViT-L/16 denotes a Large (24-layer) ViT model with an input image patch size of 16×16.

Note that the smaller the input patch size, the higher the computational cost, because the number of input patches is N = HW/P², where (H, W) is the resolution of the original image and P is the patch size. This means that 14×14 patches are more computationally expensive than 16×16 patches.
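
As a quick illustration of that formula (assuming a 224×224 input image, which is only an example resolution):

```python
# Sequence length N = H*W / P^2 for an assumed 224x224 input image.
H = W = 224
for P in (16, 14):
    N = (H * W) // (P * P)
    print(f"{P}x{P} patches -> sequence length {N}")
# 16x16 patches -> sequence length 196
# 14x14 patches -> sequence length 256
```

A longer patch sequence means more tokens for self-attention to process, hence the higher cost of the /14 variants.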

Benchmark results

[Figure: Image classification benchmark results]

The above results show that the model outperforms existing SOTA models on several popular benchmark datasets.

Vision Transformers (ViT-H/14, ViT-L/16) pretrained on the JFT-300M dataset outperform the ResNet baseline (ResNet152x4, pretrained on the same JFT-300M dataset) on all test datasets, while requiring far fewer computational resources (TPUv3-core-days) for pre-training. Even the ViT pretrained on ImageNet-21k outperforms the baseline.

Model performance vs dataset size

[Figure: Pretraining dataset size vs. model performance]

The graph above shows the effect of dataset size on model performance. When the size of the pre-training dataset is small, ViT does not perform well, and when the training data is sufficient, it outperforms the previous SOTA.

Which structure is more efficient?

As mentioned at the beginning, architectural designs that use Transformers for computer vision differ: some replace CNNs entirely (ViT), some replace them partially, and some combine CNNs with Transformers (DETR). The results below show the performance of each model architecture under the same computational budget.

[Figure: Performance and computational cost of different model architectures]

The above experiments show that:

  • The pure Transformer architecture (ViT) is more efficient and scalable than traditional CNNs (ResNet BiT) at both smaller and larger compute scales

  • Hybrid architectures (CNNs + Transformers) outperform pure Transformers at smaller model sizes, and come very close at larger model sizes

Key points of ViT (Vision Transformer)

  • Use Transformer architecture (pure or hybrid)

  • The input image is split into patches and flattened into a sequence

  • Beats SOTA on multiple image recognition benchmarks

  • Cheaper to pre-train on large datasets

  • More scalable and computationally efficient

DETR

DETR is the first object detection framework to successfully use a Transformer as the main building block of the pipeline. It matches the performance of the previous SOTA method (a highly optimized Faster R-CNN) with a simpler and more flexible pipeline.

[Figure: DETR's object detection pipeline combining a CNN and a Transformer]

The figure above shows DETR, a hybrid pipeline with a CNN and a Transformer as the main building blocks. The flow is as follows (a minimal code sketch follows the list):

  1. A CNN learns a 2D representation of the image and extracts features

  2. The CNN output is flattened, combined with positional encodings, and fed into a standard Transformer encoder

  3. The Transformer decoder passes its output embeddings to a feed-forward network (FFN), which predicts classes and bounding boxes
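
Below is a minimal, simplified sketch of this hybrid pipeline in PyTorch. It is not the official DETR implementation: positional encodings, the Hungarian matching loss, and auxiliary heads are omitted, and the sizes (ResNet-50 backbone, 256-dimensional Transformer, 100 object queries, 91 COCO classes) are assumptions chosen to mirror the paper's defaults.

```python
# Simplified sketch of a DETR-style hybrid pipeline (illustrative only).
import torch
import torch.nn as nn
import torchvision

class SimpleDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, dim=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)  # torchvision >= 0.13 API
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # CNN feature extractor
        self.input_proj = nn.Conv2d(2048, dim, kernel_size=1)           # reduce channel dimension
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.query_embed = nn.Parameter(torch.randn(num_queries, dim))  # learned object queries
        self.class_head = nn.Linear(dim, num_classes + 1)               # classes + "no object"
        self.bbox_head = nn.Linear(dim, 4)                              # box (cx, cy, w, h)

    def forward(self, images):                         # images: (B, 3, H, W)
        feat = self.input_proj(self.backbone(images))  # (B, 256, H/32, W/32)
        src = feat.flatten(2).transpose(1, 2)          # flatten 2D features into a sequence
        # (the real model adds positional encodings to src here; omitted in this sketch)
        tgt = self.query_embed.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, tgt)                # decoder output, one embedding per query
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = SimpleDETR()(torch.randn(1, 3, 512, 512))
print(logits.shape, boxes.shape)  # (1, 100, 92) (1, 100, 4)
```

Each of the 100 object queries yields one class prediction (including a "no object" class) and one normalized bounding box, so the final set of predictions is emitted directly and in parallel.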

Simpler Pipelines

[Figure: Comparison of the traditional object detection pipeline and DETR]

Traditional object detection methods, such as Faster R-CNN, involve multiple hand-designed steps such as anchor generation and NMS (non-maximum suppression). DETR abandons these components, significantly simplifying the object detection pipeline.

Amazing results when extended to panoptic segmentation

In the paper, they further extend the DETR pipeline to panoptic segmentation, a recently popular and challenging pixel-level recognition task. Briefly, panoptic segmentation unifies two different tasks: traditional semantic segmentation (assigning a class label to each pixel) and instance segmentation (detecting and segmenting each object instance). Using one model architecture to solve both tasks (detection and segmentation) is a very clever idea.

[Figure: Pixel-level panoptic segmentation]

The image above shows an example of panoptic segmentation. With DETR's unified pipeline, it surpasses very competitive baselines.

Attention visualization

The figure below shows the attention of the Transformer decoder on predictions. The attention scores of different objects are represented by different colors.

Looking at the colors/attention, it is striking how well the model understands the image at a global scale through self-attention, resolving the problem of overlapping bounding boxes. In particular, the legs of the orange zebra are classified well even though they partially overlap with the blue and green ones.

[Figure: Decoder attention visualization for predicted objects]

Key Points of DETR

  • Uses Transformers for a simpler and more flexible pipeline

  • Matches SOTA performance on object detection tasks

  • Outputs the final set of predictions directly and in parallel, which is more efficient

  • Unified architecture for object detection and segmentation

  • Detection performance on large objects is significantly improved, but performance on small objects degrades

Image GPT

Image GPT is a GPT-2 Transformer model trained on pixel sequences for image completion. Like a general pretrained language model, it is designed to learn high-quality unsupervised image representations. It predicts the next pixel autoregressively without any knowledge of the 2D structure of the input image.
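
A rough sketch of this training objective is shown below, assuming 32×32 inputs and a 512-entry color palette for the quantized pixels; the model here is a small generic causal Transformer, not OpenAI's released architecture, and all sizes are assumptions for the example.

```python
# Illustrative sketch of an Image GPT-style next-pixel objective (not OpenAI's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, seq_len, dim = 512, 32 * 32, 256        # quantized palette, flattened 32x32 image
embed = nn.Embedding(vocab_size, dim)
pos = nn.Parameter(torch.zeros(1, seq_len, dim))      # learned positions over the 1D sequence
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=4)  # made causal via the mask below
head = nn.Linear(dim, vocab_size)

pixels = torch.randint(0, vocab_size, (2, seq_len))   # (B, 1024) quantized pixel values
x = embed(pixels) + pos
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
h = decoder(x, mask=causal)                            # each position only attends to earlier pixels
logits = head(h)                                       # (B, 1024, 512)

# Next-pixel prediction loss: position t predicts pixel t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size), pixels[:, 1:].reshape(-1))
print(loss.item())
```

After training, sampling pixels one at a time from the same model yields the image completions shown below.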

Features from pretrained image GPT achieve state-of-the-art performance on some classification benchmarks and approach state-of-the-art unsupervised accuracy on ImageNet.

The figure below shows completions generated by the model: a human-provided half image is given as input, followed by creative completions produced by the model.

[Figure: Image completions from Image GPT]

Highlights of Image GPT:

  • Use the same transformer architecture as GPT-2 in NLP

  • Unsupervised learning without human labeling

  • More computation is required to generate a competitive representation

  • The learned features achieve SOTA performance on classification benchmarks on low-resolution datasets

Summary

Following their great success in natural language processing, Transformers are now being explored in the field of computer vision and have become a new research direction.

  • Transformers prove to be a simple and scalable framework for computer vision tasks such as image recognition, classification, and segmentation, or simply for learning global image representations.

  • Compared with traditional methods, it has significant advantages in training efficiency.

  • Architecturally, they can be used as pure Transformers or in hybrid form combined with CNNs.

  • Challenges remain, such as DETR's lower performance on small objects and ViT's poor performance when the pre-training dataset is small.

  • Transformers are emerging as a more general framework for learning from sequence data, including text, images, and time series data.


—END—

Original English: https://towardsdatascience.com/transformer-in-cv-bbdb58bf335e


Origin: blog.csdn.net/qq_33431368/article/details/123606127