Training Vision Transformers for Image Retrieval Paper Notes



Overview

To understand this paper, the first prerequisite is to understand Transformers: why the paper applies them to image retrieval, what their natural advantages are, and what their success rests on.

Why the author wrote this article

Transformers were originally used in NLP, and recent work found that they also perform well for image classification, so the authors decided to try them for image retrieval.

What they did

The authors propose a transformer-based image retrieval method: a vision transformer is used to generate the image descriptors, and the model is trained with a metric-learning objective (this objective combines a contrastive loss with a differential entropy regularizer).
In plain words: first the Vision Transformer is used directly, then it is fine-tuned by adding a common loss, and finally a regularizer is added on top of the first two steps; all three variants are evaluated experimentally.

Hard work pays off

The authors' experiments lead to the conclusion: compared with the convolution-based methods previously common in practice, using transformers for image retrieval is feasible, and very strong!

The authors' claimed contributions

As stated in the original text, the contribution is huge:
"
(1) 提出了一种简单的方法来训练视觉transformer(It can be used for both category-based retrieval and specific object retrieval). Compared with convolutional models with similar capacity, the performance is better.

(2) Category-level retrieval - around the corner Stanford Online Product, In-Shopand CUB-200the excellent performance of the three datasets

(3) For specific object retrieval- the ablation experiments on ROxfordand RParisalso show that the results of transformers and convolutional networks at higher resolution and higher frequencies are comparable, especially for short vector representation (128 components).

(4) It is proved that the differential entropy regularizer enhances the contrast loss and improves the overall performance. "

Structure at a glance

Figure 1
As shown in the figure above, the paper trains a transformer model with a Siamese structure for image retrieval. Two input images are mapped into a common feature space by the transformer. During training, the contrastive loss is strengthened by the use of an entropy regularizer.


1. Supplementary knowledge

1.1 Introduction

Theoretical basis: in computer vision, learning a similarity metric has many direct applications, such as content-based image retrieval, face recognition, and person re-identification.

Research significance: this is an interesting development, because compared with the currently dominant convolutional architectures, vision models based on Transformers have a different inductive bias.

1.2 Transformers

Key role: a memory-like function, i.e. the model can draw on context from the entire sequence. Transformers are often used for pre-training.
Get a general understanding of the structure of Transformers.
The input is a sequence of vectors (column vectors by default), and each layer passes the sequence through a multi-head attention block (whose most important role is to handle long-range dependencies), producing a new sequence of the same length as the original. In NLP each word is represented by a vector, and for the vector corresponding to each word there is a learnable vector (i.e. one updated during backpropagation) that determines which other vectors in the sequence it attends to, so that the representation can draw on context information.
The Transformer adds a special CLS token at the beginning of the input sequence. The output vector corresponding to the CLS token is determined by all of the input vectors through the CLS token's attention weights over them, so it can represent the entire input sequence.
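To make the role of the CLS token concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code): a learnable CLS vector is prepended to the token sequence, one multi-head self-attention layer lets every token attend to every other, and the CLS output is read off as the sequence-level representation.

```python
import torch
import torch.nn as nn

class TinyEncoderWithCLS(nn.Module):
    """Minimal sketch: prepend a learnable CLS token, apply one
    multi-head self-attention layer, and read the CLS output as the
    global representation. Dimensions are illustrative only."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable CLS vector
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                  # tokens: (batch, seq_len, dim)
        b = tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)  # one CLS per sequence
        x = torch.cat([cls, tokens], dim=1)     # (batch, seq_len + 1, dim)
        x = self.norm(x)
        out, _ = self.attn(x, x, x)             # every token attends to every token
        return out[:, 0]                        # CLS output summarizes the sequence

# toy usage
enc = TinyEncoderWithCLS()
seq = torch.randn(2, 10, 64)                    # 2 sequences of 10 token vectors
global_repr = enc(seq)                          # shape (2, 64)
```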

1.3 Siamese structure

The algorithm in this paper uses a Siamese structure. As the name literally suggests, a Siamese network can be thought of as a "conjoined" neural network: the two branches are connected by sharing their weights.
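A small sketch of the weight-sharing idea (again my own illustration, not the paper's implementation): a single encoder instance processes both images, so its parameters are shared by construction, and the two embeddings can then be compared in the common feature space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseWrapper(nn.Module):
    """Sketch of a Siamese setup: a single encoder (any backbone that maps
    an image to a vector) is applied to both inputs, so weights are shared."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # shared weights: one module, used twice

    def forward(self, img_a, img_b):
        za = F.normalize(self.encoder(img_a), dim=-1)  # L2-normalized descriptors
        zb = F.normalize(self.encoder(img_b), dim=-1)
        return za, zb

# toy usage with a stand-in encoder (a ViT backbone would be used in practice)
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
siamese = SiameseWrapper(toy_encoder)
a, b = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
za, zb = siamese(a, b)
similarity = (za * zb).sum(dim=-1)  # cosine similarity per image pair
```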

2. Algorithm details

Previously, category-level retrieval and particular object retrieval were two major tasks handled with different techniques. This paper uses the same method for both problems.
Here is a step-by-step introduction to its different components:

  1. O: extract features directly from the vision transformer backbone, pre-trained on ImageNet
  2. L: fine-tune the transformer with metric learning, in particular using a contrastive loss
  3. R: regularize the output feature space

Prior knowledge: what a Vision Transformer (ViT) is.
ViT architecture: the input image is first decomposed into M fixed-size patches (e.g. 16×16). Each patch is linearly projected into a vector-shaped token, and the M tokens are fed to the transformer, whose operations are invariant to their ordering; position information is therefore injected by adding a learnable 1-D position encoding vector to each input token. In addition, a learnable CLS token is added to the input sequence so that its corresponding output is used as the global image representation.
The transformer consists of L layers, and each layer consists of two main modules: a multi-head self-attention (MSA) layer, which applies self-attention over different projections of the input tokens, and a feed-forward network (FFN).

This paper chooses to use the DeiT-Small variant of the ViT architecture introduced by Touvron et al. (2020) as the basic model.
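To make the patch / position / CLS pipeline concrete, here is a minimal PyTorch sketch of the ViT front end (an illustration under my own assumptions; 384 is the embedding width of DeiT-Small, the other sizes are only for shape checking):

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Sketch of the ViT front end: split the image into fixed-size patches,
    linearly project each patch to a token, add a learnable 1-D position
    embedding, and prepend a learnable CLS token."""

    def __init__(self, img_size=224, patch_size=16, dim=384):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2   # M patches
        # a stride-p convolution is a standard way to do "split + linear projection"
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, M, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)            # (B, M + 1, dim)
        return tokens + self.pos_embed                      # add position information

tokens = PatchEmbedSketch()(torch.randn(1, 3, 224, 224))    # (1, 197, 384)
```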

2.1 IRT-O: Off-the-shelf transformer features

  • Features are extracted directly from the transformer pre-trained on ImageNet
  • In the ViT architecture, the pre-classification layer outputs M + 1 vectors, corresponding to the M input patches plus the CLS token, all embedded in the same space
  • PCA is then used to reduce the dimensionality, which lowers the amount of computation and reduces over-fitting

L2 normalization and dimensionality reduction are applied here. The essence of normalization is to discard some complex but unimportant information, reducing the difficulty of fitting. In transformer models, layer normalization is also used throughout; its purpose is to effectively reduce model variance, avoid over-fitting, and accelerate convergence.
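A hedged sketch of the IRT-O recipe described above, assuming a generic `backbone` that returns the pre-classifier CLS feature; the PCA step uses scikit-learn, and the exact whitening details of the paper may differ:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

@torch.no_grad()
def extract_descriptors(backbone, images):
    """IRT-O style sketch: take the (pre-classifier) CLS output of a
    pre-trained transformer as the raw image descriptor.
    `backbone` is assumed to map (B, 3, H, W) -> (B, D) CLS features."""
    feats = backbone(images)
    return F.normalize(feats, dim=-1)              # L2 normalization

def reduce_with_pca(train_feats: np.ndarray, test_feats: np.ndarray, d_out=128):
    """Fit PCA on training descriptors, project both sets to d_out dims,
    then re-apply L2 normalization (a common retrieval recipe)."""
    pca = PCA(n_components=d_out)
    pca.fit(train_feats)
    out_train = pca.transform(train_feats)
    out_test = pca.transform(test_feats)
    # renormalize so that the dot product equals cosine similarity
    out_train /= np.linalg.norm(out_train, axis=1, keepdims=True)
    out_test /= np.linalg.norm(out_test, axis=1, keepdims=True)
    return out_train, out_test
```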

2.2 IRT-L: Metric learning for image retrieval

Metric learning is also called similarity learning. As the name implies, when used for image retrieval it compares two pictures to determine whether they show the same object: the greater the similarity, the higher the probability of a successful retrieval.

This paper adopts the contrastive loss combined with the previously proposed cross-batch memory, and sets the margin β = 0.5 as the metric-learning objective.
With a batch size of N, the author defines the contrastive loss in terms of the following quantities:
zi: the encoded low-dimensional representation of sample i (after L2 normalization)
yi: the label of sample i
The contrastive loss for sample zi maximizes its similarity to samples sharing the label yi; its purpose is to pull similar samples closer together, while negative samples are only penalized within the margin.
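The equation image from the original post is missing. A margin-based contrastive loss consistent with the description above (cosine similarities of L2-normalized descriptors, positives pulled together, negatives penalized only above the margin β) can be written as follows; this is a reconstruction, not necessarily the paper's exact formula:

$$\mathcal{L}_{\mathrm{contr}} = \frac{1}{N}\sum_{i=1}^{N}\Big[\sum_{j \neq i,\; y_j = y_i}\big(1 - z_i^{\top} z_j\big) \;+\; \sum_{j:\, y_j \neq y_i}\big[\,z_i^{\top} z_j - \beta\,\big]_{+}\Big]$$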

To better understand the meaning of the margin: a negative pair only contributes to the loss while its similarity is still above β, so once two dissimilar samples have been pushed far enough apart they stop being penalized.
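A small PyTorch sketch of this margin behaviour (illustrative only; in the paper the loss is additionally combined with a cross-batch memory of past embeddings, which is omitted here):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, labels, beta=0.5):
    """Margin-based contrastive loss over a batch of L2-normalized embeddings.
    Positive pairs are pulled together; negative pairs only contribute when
    their cosine similarity exceeds the margin beta."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t()                                    # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # positive-pair mask
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (1.0 - sim)[same & ~eye]                     # pull positives together
    neg = F.relu(sim - beta)[~same]                    # push negatives below beta
    return (pos.sum() + neg.sum()) / len(z)

# toy usage
z = torch.randn(8, 384, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = contrastive_loss(z, labels)
loss.backward()
```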

2.3 IRT-R: Regularizing the output feature space

IRT-R adds a differential entropy regularizer on top of the contrastive loss: it spreads the L2-normalized descriptors more uniformly over the unit hypersphere, which strengthens the contrastive loss (the equations and figures from the original post are not reproduced here).
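For illustration only: differential-entropy regularizers of this kind are typically built from nearest-neighbour distances (a Kozachenko-Leonenko style estimator, as in the "spreading vectors" line of work). The sketch below shows such a term and may differ from the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def entropy_regularizer(z, eps=1e-8):
    """KoLeo-style sketch: encourage embeddings to spread over the unit
    hypersphere by rewarding a large distance to each point's nearest
    neighbour (the returned value is added to the loss as a penalty)."""
    z = F.normalize(z, dim=-1)
    diff = z.unsqueeze(1) - z.unsqueeze(0)               # (n, n, d) pairwise differences
    dist = (diff.pow(2).sum(-1) + eps).sqrt()            # pairwise Euclidean distances
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    dist = dist.masked_fill(mask, float("inf"))          # ignore self-distances
    nn_dist, _ = dist.min(dim=1)                         # distance to nearest neighbour
    return -torch.log(nn_dist + eps).mean()              # small when points are spread out

# combined objective sketch:
# total_loss = contrastive_loss(z, labels) + lam * entropy_regularizer(z)
```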

2.4 Analysis

[figures omitted]

3. Experimental results

First the datasets and implementation details are described, and then the actual results are discussed.

3.1 Datasets

3.1.1 Category-level Retrieval

There are 3 commonly used datasets here:
1.) Stanford Online Products (SOP)
Originally collected to study metric-learning problems. It consists of images of products sold online, representing 22,634 classes. The first 11,318 classes are used for training and the remaining 11,316 classes for testing.

2.) CUB-200-2011
A fine-grained dataset released by Caltech, and a standard benchmark image dataset for fine-grained recognition research. It contains 11,788 bird images covering 200 bird subcategories.

3.) In-Shop
This is a collection of shop (seller) photos: each product ID has multiple shop images taken from different angles, all placed in the same folder.

The paper reports the Recall@K evaluation metric, giving a direct comparison with previous methods.
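For reference, Recall@K counts a query as a hit if at least one same-class item appears among its K nearest neighbours. A small sketch of this computation (my own illustration, not the paper's evaluation code; when the query set is the database itself, the self-match must additionally be excluded):

```python
import numpy as np

def recall_at_k(query_feats, db_feats, query_labels, db_labels, ks=(1, 10, 100, 1000)):
    """Fraction of queries with at least one same-label item among the
    top-K retrieved database entries (dot product on L2-normalized features)."""
    sims = query_feats @ db_feats.T                    # (num_queries, num_db)
    order = np.argsort(-sims, axis=1)                  # ranked database indices per query
    results = {}
    for k in ks:
        topk_labels = db_labels[order[:, :k]]          # labels of the top-k neighbours
        hit = (topk_labels == query_labels[:, None]).any(axis=1)
        results[k] = hit.mean()
    return results
```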

3.1.2 Particular Object Retrieval

  1. Training phase
    SfM120k dataset: this dataset is obtained by applying structure from motion and 3D reconstruction to a large collection of unlabeled images. Positive samples are selected so that they observe enough 3D points in common with the query image, while negative samples come from different 3D models. 551 3D models are used for training and 162 for validation.

  2. Evaluation phase
    The paper reports results on the revisited benchmarks of the Oxford and Paris datasets (ROxford and RParis). The two datasets each contain 70 query images depicting buildings, plus 4,993 and 6,322 database images respectively, in which the queried buildings may appear. The revisited benchmark consists of three splits, Easy (E), Medium (M) and Hard (H), grouped progressively by the difficulty of the query/database pairs.

3.2 Experimental details

  1. Category-level Retrieval
    All models are optimized with AdamW; the learning rate is 3 × 10⁻⁵, the weight decay is 5 × 10⁻⁴, and the batch size is 64. For all experiments, unless otherwise specified, the contrastive loss margin is set to β = 0.5 and the entropy regularization strength to λ = 0.7. Standard data augmentation is used: images are resized to 256×256, then randomly flipped horizontally and randomly cropped to 224×224 (a minimal configuration sketch follows after this list).
  2. Particular Object Retrieval
    Each batch consists of 5 tuples (each tuple: 1 anchor, 1 positive sample, and 5 negative samples). For each epoch, 2,000 positive pairs and 22,000 negative-sample candidates are randomly selected (using hard negative mining). The margin is set to β = 0.85, and the model is fine-tuned for 100 epochs.
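A minimal configuration sketch for the category-level setup in item 1 above, assuming a standard PyTorch/torchvision pipeline and a generic `model` standing in for the DeiT-Small backbone; the paper's actual training script is not reproduced here:

```python
import torch
from torchvision import transforms

def build_training_setup(model):
    """Sketch of the reported category-level recipe: AdamW with lr 3e-5,
    weight decay 5e-4, batch size 64, contrastive margin beta=0.5,
    entropy weight lambda=0.7, and resize / flip / crop augmentation."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=5e-4)
    train_transform = transforms.Compose([
        transforms.Resize((256, 256)),       # resize to 256x256
        transforms.RandomHorizontalFlip(),   # random horizontal flip
        transforms.RandomCrop(224),          # random 224x224 crop
        transforms.ToTensor(),
    ])
    hparams = {"batch_size": 64, "beta": 0.5, "lambda": 0.7}
    return optimizer, train_transform, hparams
```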

3.3 Results

  1. Category-level Retrieval:
    Table 2
    The table above shows that the IRT-R variant used in the paper achieves strong Recall@K on all three datasets, reaching state of the art at Top-1. On SOP in particular, the metric reaches state of the art for all four reported values of K (1, 10, 100, 1000).

  2. Particular Object Retrieval:
    [results table omitted]

3.4 Ablation experiments

  1. Different supervision methods

  2. Choice of feature extractor: pooling methods

  3. Performance across objective functions

  4. The regularization hyperparameter λ

4. Conclusion

This article mainly studies how to adapt the transformer architecture to metric learning and image retrieval.

  1. The contrastive loss formulation is revisited, and it is shown that, as with convolutional models, a regularizer based on a differential entropy loss that spreads the vectors over the unit hypersphere improves the performance of the transformer-based model, thereby advancing category-level image retrieval.
  2. It is demonstrated that, for particular object retrieval, in settings where a fair comparison is possible, transformer-based models are an effective alternative to convolutional backbones, especially for short vector representations. Their performance is competitive even compared with more complex convolutional networks.


Origin blog.csdn.net/weixin_47651805/article/details/114794535