Translation of the CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept.
Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks.
We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.
The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
We release our code and pre-trained model weights at https://github.com/openai/CLIP.

1. Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019).
Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset-specific customization.
Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset-specific training data.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets.
However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009).
Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.

Over 20 years ago, Mori et al. (1999) explored improving content-based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data-efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations.
They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks.
Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches.
For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012).
Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task.
When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time.
Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text.
However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively.
Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images.
In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale.
Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision.
We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020).
We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others.
We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient.
We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.

2. Approach

2.1 Natural Language Supervision

At the core of our approach is the idea of learning perception from supervision contained in natural language.
As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.

We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal.
All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).

Learning from natural language has several potential strengths over other training methods.
It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”.
Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet.
Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer.
In the following subsections, we detail the specific approach we settled on.

2.2 Creating a Sufficiently Large Dataset

Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each.
By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality.
Many images use automatically generated filenames like 20160716_113957.JPG as “titles” or contain “descriptions” of camera exposure settings.
After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research.
To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.
To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries.
We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
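
As a rough illustration of the approximate class balancing, the sketch below caps the number of pairs kept per query. The record schema and function name are hypothetical; the actual query list and pipeline are not described beyond the numbers quoted above.

from collections import defaultdict

def balance_by_query(pairs, max_per_query=20_000):
    # pairs: iterable of (image_url, text, matched_query) records (illustrative schema)
    kept, counts = [], defaultdict(int)
    for image_url, text, query in pairs:
        if counts[query] < max_per_query:
            counts[query] += 1
            kept.append((image_url, text))
    return kept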

2.3 Selecting an Efficient Pre-Training Method

State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2.
When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting.
In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method.
In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image.
This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019).
Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a).
Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text.
Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings.
We optimize a symmetric cross entropy loss over these similarity scores.
In Figure 3 we include pseudocode of the core of an implementation of CLIP.
To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).
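
Since Figure 3 is not reproduced here, below is a minimal PyTorch-style sketch of the symmetric contrastive objective described above. It is an illustrative reconstruction under our own naming (clip_loss, the toy call at the end), not the released implementation or the paper's exact pseudocode.

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # image_features, text_features: [n, d] outputs of the image and text encoders for n paired examples
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # n x n cosine similarities between every image and every text, scaled by the learned temperature
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # the n real pairs lie on the diagonal; the off-diagonal entries are the incorrect pairings
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # symmetric cross-entropy over the image-to-text and text-to-image directions
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2

# toy usage with random features standing in for encoder outputs
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=1 / 0.07)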

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020).
We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b).
We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space.
We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with the details of current image-only self-supervised representation learning methods.
We also remove the text transformation function t_u from Zhang et al. (2020), which samples a single sentence uniformly at random from the text, since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence. We also simplify the image transformation function t_v: a random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter τ, which controls the range of the logits in the softmax, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
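
To make the log-parameterization concrete, here is a small sketch (assuming PyTorch; the class and attribute names are ours). The clamp mirrors the clipping of the logit scale to at most 100 mentioned in Section 2.5.

import math
import torch
import torch.nn as nn

class LogitScale(nn.Module):
    # The temperature tau is learned as log(1 / tau), i.e. a log-parameterized multiplicative scalar.
    def __init__(self, init_tau: float = 0.07):
        super().__init__()
        self.log_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_tau)))

    def forward(self, cosine_similarities: torch.Tensor) -> torch.Tensor:
        # exponentiate back to a positive scale; clamp so the logits are never scaled by more than 100
        scale = self.log_scale.exp().clamp(max=100.0)
        return scale * cosine_similarities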

2.4 Choosing and Scaling a Model

We consider two different architectures for the image encoder.
For the first, we use ResNet-50 (He et al., 2016a)as the base architecture for the image encoder due to its widespread adoption and proven performance.
We make several modifications to the original version using the ResNetD improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism.
The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020).
We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
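
A rough sketch of the attention pooling described above: a single layer of multi-head QKV attention whose query is the global average-pooled image representation. This is our own simplified reconstruction (it omits, for example, positional embeddings), not the exact released module.

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: [batch, channels, height, width] output of the ResNet trunk
        tokens = feature_map.flatten(2).transpose(1, 2)   # [batch, h*w, channels]
        query = tokens.mean(dim=1, keepdim=True)          # global average pool conditions the query
        pooled, _ = self.attn(query, tokens, tokens)      # one "transformer-style" multi-head QKV attention layer
        return pooled.squeeze(1)                          # [batch, channels]

# usage: AttentionPool(2048)(torch.randn(2, 2048, 7, 7)) has shape [2, 2048]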

The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019).
As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76.
The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text which is layer normalized and then linearly projected into the multi-modal embedding space.
Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.
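
To make the [EOS] readout concrete, a small sketch of how the text feature could be taken from the transformer output and projected; the function, argument names, and the way the [EOS] position is located are illustrative assumptions rather than the released code.

import torch
import torch.nn as nn

def text_feature(hidden_states: torch.Tensor,    # [batch, seq_len, width] transformer activations
                 token_ids: torch.Tensor,        # [batch, seq_len] BPE token ids
                 eos_id: int,
                 ln_final: nn.LayerNorm,
                 text_projection: torch.Tensor): # [width, embed_dim] linear projection
    eos_pos = (token_ids == eos_id).int().argmax(dim=-1)                 # position of [EOS] in each sequence
    x = hidden_states[torch.arange(hidden_states.shape[0]), eos_pos]     # activations at the [EOS] token
    x = ln_final(x)                                                      # layer normalization
    return x @ text_projection                                           # map into the multi-modal embedding space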

While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019), which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model.
For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP’s performance to be less sensitive to the capacity of the text encoder.
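
One simple reading of “allocating additional compute equally” is to give depth, width, and resolution an equal share of the target FLOP multiplier, under the rough assumption that compute scales as depth × width² × resolution². The sketch below is our interpretation, not a formula taken from the paper.

def equal_compute_scaling(compute_multiplier: float):
    # Assumes FLOPs are roughly proportional to depth * width**2 * resolution**2 (a simplification).
    share = compute_multiplier ** (1.0 / 3.0)  # each dimension absorbs one third of the FLOPs growth
    depth_mult = share                         # FLOPs grow linearly with depth
    width_mult = share ** 0.5                  # FLOPs grow quadratically with width
    resolution_mult = share ** 0.5             # FLOPs grow quadratically with input resolution
    return depth_mult, width_mult, resolution_mult

# e.g. equal_compute_scaling(4.0) -> roughly (1.59, 1.26, 1.26) for an RN50x4-scale model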

2.5 Training

We train a series of 5 ResNets and 3 Vision Transformers.
For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14.
We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016).
Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints.
The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100, which we found necessary to prevent training instability.
We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used.
The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings.
The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs.
For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019).
We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
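
A hedged sketch of the optimizer setup described above: decoupled weight decay applied only to weights that are not gains or biases, with a cosine learning-rate schedule. The parameter-splitting heuristic and all names here are ours; the learning rate, weight decay, and step count are placeholders rather than the paper's values.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model: torch.nn.Module, lr: float, weight_decay: float, total_steps: int):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # biases and gain-style parameters (e.g. LayerNorm/BatchNorm scales) are excluded from weight decay
        if p.ndim < 2 or name.endswith("bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    optimizer = AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # cosine learning-rate decay
    return optimizer, scheduler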

3. Experiments

3.1 Zero-Shot Transfer

3.1.1. MOTIVATION

In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image classification (Lampert et al., 2009). We instead use the term in a broader sense and study generalization to unseen datasets.
We motivate this as a proxy for performing unseen tasks, as aspired to in the zero-data learning paper of Larochelle et al. (2008).
While much research in the field of unsupervised learning focuses on the representation learning capabilities of machine learning systems, we motivate studying zero-shot transfer as a way of measuring the task-learning capabilities of machine learning systems.
In this view, a dataset evaluates performance on a task on a specific distribution.
However, many popular computer vision datasets were created by the research community primarily as benchmarks to guide the development of generic image classification methods rather than measuring performance on a specific task.
While it is reasonable to say that the SVHN dataset measures the task of street number transcription on the distribution of Google Street View photos, it is unclear what “real” task the CIFAR-10 dataset measures.
It is clear, however, what distribution CIFAR-10 is drawn from - TinyImages (Torralba et al., 2008).
On these kinds of datasets, zero-shot transfer is more an evaluation of CLIP’s robustness to distribution shift and domain generalization rather than task generalization.
Please see Section 3.3 for analysis focused on this.

To our knowledge, Visual N-Grams (Li et al., 2017) first studied zero-shot transfer to existing image classification datasets in the manner described above.
It is also the only other work we are aware of that has studied zero-shot transfer to standard image classification datasets using a generically pre-trained model and serves as the best reference point for contextualizing CLIP.
Their approach learns the parameters of a dictionary of 142,806 visual n-grams (spanning 1- to 5- grams) and optimizes these n-grams using a differential version of Jelinek-Mercer smoothing to maximize the probability of all text n-grams for a given image.
In order to perform zero-shot transfer, they first convert the text of each of the dataset’s class names into its n-gram representation and then compute its probability according to their model, predicting the one with the highest score.

Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP.
To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages.
While GPT-1 (Radford et al., 2018) focused on pre-training as a transfer learning method to improve supervised fine-tuning, it also included an ablation study demonstrating that the performance of four heuristic zero-shot transfer methods improved steadily over the course of pre-training, without any supervised adaption.
This analysis served as the basis for GPT-2 (Radford et al., 2019) which focused exclusively on studying the task-learning capabilities of language models via zero-shot transfer.

3.1.2. USING CLIP FOR ZERO-SHOT TRANSFER

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability.
For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax.
Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork (Ha et al., 2016) which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent.
Lei Ba et al. (2015) first introduced a zero-shot image classifier of this form while the idea of generating a classifier from natural language dates back to at least Elhoseiny et al. (2013). Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions.
For zero-shot evaluation, we cache the zero-shot classifier once it has been computed by the text encoder and reuse it for all subsequent predictions.
This allows the cost of generating it to be amortized across all the predictions in a dataset.
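
The following is a sketch of the zero-shot procedure just described, assuming hypothetical encode_image, encode_text, and tokenize functions. The text for each class here is simply its name; the prompt templates of Section 3.1.4 would be layered on top in practice.

import torch
import torch.nn.functional as F

@torch.no_grad()
def build_zero_shot_classifier(encode_text, tokenize, class_names):
    # Encode every class name once and cache the result; the cost is amortized over the whole dataset.
    tokens = tokenize(list(class_names))
    class_weights = F.normalize(encode_text(tokens), dim=-1)   # [num_classes, embed_dim]
    return class_weights

@torch.no_grad()
def zero_shot_predict(encode_image, images, class_weights, logit_scale):
    image_features = F.normalize(encode_image(images), dim=-1)       # [batch, embed_dim]
    logits = logit_scale * image_features @ class_weights.t()        # temperature-scaled cosine similarities
    probs = logits.softmax(dim=-1)                                   # probability distribution over class names
    return probs.argmax(dim=-1), probs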

3.1.3. INITIAL COMPARISON TO VISUAL N-GRAMS

In Table 1 we compare Visual N-Grams to CLIP. The best CLIP model improves accuracy on ImageNet from a proof of concept 11.5% to 76.2% and matches the performance of the original ResNet-50 despite using none of the 1.28 million crowd-labeled training examples available for this dataset.
Additionally, the top-5 accuracy of CLIP models is noticeably higher than their top-1 accuracy, and this model has a 95% top-5 accuracy, matching Inception-V4 (Szegedy et al., 2016).
The ability to match the performance of a strong, fully supervised baseline in a zero-shot setting suggests CLIP is a significant step towards flexible and practical zero-shot computer vision classifiers.
As mentioned above, the comparison to Visual N-Grams is meant for contextualizing the performance of CLIP and should not be interpreted as a direct methods comparison between CLIP and Visual N-Grams, as many performance-relevant differences between the two systems were not controlled for.
For instance, we train on a dataset that is 10x larger, use a vision model that requires nearly 100x more compute per prediction, likely used over 1000x their training compute, and use a transformer-based model which did not exist when Visual N-Grams was published.
As a closer comparison, we trained a CLIP ResNet-50 on the same YFCC100M dataset that Visual N-Grams was trained on and found it matched their reported ImageNet performance within a V100 GPU day.
This baseline was also trained from scratch instead of being initialized from pre-trained ImageNet weights as in Visual N-Grams.

CLIP also outperforms Visual N-Grams on the other 2 reported datasets. On aYahoo, CLIP achieves a 95% reduction in the number of errors, and on SUN, CLIP more than doubles the accuracy of Visual N-Grams.
To conduct a more comprehensive analysis and stress test, we implement a much larger evaluation suite detailed in Appendix A.
In total we expand from the 3 datasets reported in Visual N-Grams to include over 30 datasets and compare to over 50 existing computer vision systems to contextualize results.

3.1.4. PROMPT ENGINEERING AND ENSEMBLING

Most standard image classification datasets treat the information that names or describes classes, which is what enables natural-language-based zero-shot transfer, as an afterthought.
The vast majority of datasets annotate images with just a numeric id for the label and contain a file mapping these ids back to their names in English.
Some datasets, such as Flowers102 and GTSRB, don’t appear to include this mapping at all in their released versions, preventing zero-shot transfer entirely.
For many datasets, we observed that these labels may have been chosen somewhat haphazardly, without anticipating the issues of zero-shot transfer, which relies on a task description in order to transfer successfully.


A common issue is polysemy. When the name of a class is the only information provided to CLIP’s text encoder it is unable to differentiate which word sense is meant due to the lack of context. In some cases multiple meanings of the same word might be included as different classes in the same dataset! This happens in ImageNet which contains both construction cranes and cranes that fly.
Another example is found in classes of the Oxford-IIIT Pet dataset, where the word boxer is, from context, clearly referring to a breed of dog, but to a text encoder lacking that context it could just as likely refer to a type of athlete.


Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way.
To help bridge this distribution gap, we found the prompt template “A photo of a {label}.” to be a good default that helps specify that the text is about the content of the image.
This often improves performance over the baseline of using only the label text. For instance, just using this prompt improves accuracy on ImageNet by 1.3%.
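
(Translator’s note: below is a minimal sketch of how this prompt template can be applied with the released CLIP package from the openai/CLIP repository. The model name, label list, and logit scale are illustrative choices by the translator, not values taken from the paper.)

```python
import torch
import clip  # the openai/CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["golden retriever", "tabby cat", "crane"]  # hypothetical label set

# Wrap each bare label in the default prompt template so the text looks more like
# the full-sentence captions seen during pre-training.
prompts = [f"A photo of a {label}." for label in class_names]

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

# For a PIL image `image`, zero-shot classification is cosine similarity over the prompts:
# image_features = model.encode_image(preprocess(image).unsqueeze(0).to(device))
# image_features /= image_features.norm(dim=-1, keepdim=True)
# probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```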


Similar to the “prompt engineering” discussion around GPT-3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task.
A few non-exhaustive examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example, on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well.
Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too.
For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance.
Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”
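
(Translator’s note: the per-dataset customizations described above can be captured as a small table of templates. The keys and exact strings below are the translator’s illustrative reconstruction, not the authors’ released prompt files.)

```python
# Illustrative per-dataset prompt templates following the customizations described above.
DATASET_TEMPLATES = {
    "oxford_pets":   "A photo of a {}, a type of pet.",
    "food101":       "A photo of {}, a type of food.",
    "fgvc_aircraft": "A photo of a {}, a type of aircraft.",
    "ocr":           'A photo of the text "{}".',
    "satellite":     "A satellite photo of a {}.",
}

def build_prompts(dataset, class_names, default="A photo of a {}."):
    """Fill the dataset's template (or the default) with each class name."""
    template = DATASET_TEMPLATES.get(dataset, default)
    return [template.format(name) for name in class_names]

# build_prompts("oxford_pets", ["boxer", "beagle"])
# -> ['A photo of a boxer, a type of pet.', 'A photo of a beagle, a type of pet.']
```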

Similar to the "hint engineering" discussion around GPT-3, we also observe that zero-shot performance can be significantly improved by tailoring the hint text for each task.
The following are some non-exhaustive examples. We found it helpful for class assignment on several fine-grained image classification datasets. For example on Oxford-IIIT Pets, using "A photo of a {label}, a type of pet" to help provide context works well.
Also, specifying a food on Food101 and an aircraft on FGVC Aircraft is helpful.

For the OCR dataset, we found that putting quotes around the text or numbers to be recognized improves performance.
Finally, we found that on satellite image classification datasets it helps to specify that an image is of this form, we use a variation of "a satellite photo of a {label}".

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance.
These classifiers are computed by using different context prompts such as “A photo of a big {label}” and “A photo of a small {label}”.
We construct the ensemble over the embedding space instead of probability space.
This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions.
We’ve observed ensembling across many generated zero-shot classifiers to reliably improve performance and use it for the majority of datasets.
On ImageNet, we ensemble 80 different context prompts and this improves performance by an additional 3.5% over the single default prompt discussed above.
When considered together, prompt engineering and ensembling improve ImageNet accuracy by almost 5%.
In Figure 4 we visualize how prompt engineering and ensembling change the performance of a set of CLIP models compared to the contextless baseline approach of directly embedding the class name as done in Li et al. (2017).
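
(Translator’s note: a sketch of ensembling in embedding space as described above: each class’s prompt embeddings are averaged and re-normalized once, so the cached classifier weights cost no more at prediction time than a single prompt. The function and template names are the translator’s, not from the official code.)

```python
import torch
import clip

def build_ensembled_classifier(model, class_names, templates, device="cuda"):
    """Average normalized text embeddings over prompt templates, one column per class."""
    with torch.no_grad():
        weights = []
        for name in class_names:
            tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each prompt embedding
            mean = emb.mean(dim=0)                      # ensemble in embedding space
            weights.append(mean / mean.norm())          # re-normalize the averaged embedding
        return torch.stack(weights, dim=1)              # shape: (embed_dim, num_classes)

# templates = ["a photo of a big {}.", "a photo of a small {}."]  # e.g. the 80 ImageNet prompts
# logits = 100.0 * image_features @ build_ensembled_classifier(model, class_names, templates)
```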


3.1.5. ANALYSIS OF ZERO-SHOT CLIP PERFORMANCE


Original text: https://zhuanlan.zhihu.com/p/600847090


Since task-agnostic zero-shot classifiers for computer vision have been understudied, CLIP provides a promising opportunity to gain a better understanding of this type of model. In this section, we conduct a study of various properties of CLIP’s zero-shot classifiers. As a first question, we look simply at how well zero-shot classifiers perform. To contextualize this, we compare to the performance of a simple off-the-shelf baseline: fitting a fully supervised, regularized, logistic regression classifier on the features of the canonical ResNet-50. In Figure 5 we show this comparison across 27 datasets. Please see Appendix A for details of datasets and setup.
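
(Translator’s note: the off-the-shelf baseline mentioned here is a standard linear probe. The sketch below uses scikit-learn on stand-in features; the random arrays and the regularization strength are placeholders, since the real features come from a frozen ResNet-50 and the paper selects the regularizer with a hyperparameter sweep.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder arrays standing in for pre-extracted ResNet-50 features and labels.
X_train, y_train = rng.normal(size=(1000, 2048)), rng.integers(0, 10, size=1000)
X_test, y_test = rng.normal(size=(200, 2048)), rng.integers(0, 10, size=200)

# Fully supervised, L2-regularized logistic regression on frozen features (a linear probe).
clf = LogisticRegression(C=3.16, max_iter=1000)
clf.fit(X_train, y_train)
print("linear-probe accuracy:", clf.score(X_test, y_test))
```
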
Zero-shot CLIP outperforms this baseline slightly more often than not and wins on 16 of the 27 datasets. Looking at individual datasets reveals some interesting behavior. On fine-grained classification tasks, we observe a wide spread in performance. On two of these datasets, Stanford Cars and Food101, zero-shot CLIP outperforms logistic regression on ResNet-50 features by over 20%, while on two others, Flowers102 and FGVCAircraft, zero-shot CLIP underperforms by over 10%. On OxfordPets and Birdsnap, performance is much closer. We suspect these differences are primarily due to varying amounts of per-task supervision between WIT and ImageNet. On “general” object classification datasets such as ImageNet, CIFAR10/100, STL10, and PascalVOC2007, performance is relatively similar, with a slight advantage for zero-shot CLIP in all cases. On STL10, CLIP achieves 99.3% overall, which appears to be a new state of the art despite not using any training examples. Zero-shot CLIP significantly outperforms a ResNet-50 on two datasets measuring action recognition in videos. On Kinetics700, CLIP outperforms a ResNet-50 by 14.5%. Zero-shot CLIP also outperforms a ResNet-50’s features by 7.7% on UCF101. We speculate this is due to natural language providing wider supervision for visual concepts involving verbs, compared to the noun-centric object supervision in ImageNet.
Looking at where zero-shot CLIP notably underperforms, we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), and self-driving related tasks such as German traffic sign recognition (GTSRB) and recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks. By contrast, non-expert humans can robustly perform several of these tasks, such as counting, satellite image classification, and traffic sign recognition, suggesting significant room for improvement. However, we caution that it is unclear whether measuring zero-shot transfer, as opposed to few-shot transfer, is a meaningful evaluation for difficult tasks that a learner has no prior experience with, such as lymph node tumor classification for almost all humans (and possibly CLIP).
While comparing zero-shot performance to fully supervised models contextualizes the task-learning capabilities of CLIP, comparing to few-shot methods is a more direct comparison, since zero-shot is its limit. In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression on the features of many image models including the best publicly available ImageNet models, self-supervised learning methods, and CLIP itself. While it is intuitive to expect zero-shot to underperform one-shot, we instead find that zero-shot CLIP matches the performance of 4-shot logistic regression on the same feature space. This is likely due to an important difference between the zero-shot and few-shot approach. First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee.
A potential resolution of this discrepancy between zero-shot and few-shot performance is to use CLIP’s zero-shot classifier as a prior for the weights of the few-shot classifier. While adding an L2 penalty towards the generated weights is a straightforward implementation of this idea, we found that hyperparameter optimization would often select for such a large value of this regularizer that the resulting few-shot classifier was “just” the zero-shot classifier. Research into better methods of combining the strength of zero-shot transfer with the flexibility of few-shot learning is a promising direction for future work.
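
(Translator’s note: a sketch of the idea floated above, penalizing a few-shot linear classifier toward the zero-shot weights instead of toward zero. This is the translator’s reconstruction, not the authors’ implementation; the optimizer, step count, and penalty weight are arbitrary.)

```python
import torch
import torch.nn.functional as F

def few_shot_with_zero_shot_prior(features, labels, w_zero, lam=1.0, steps=500, lr=0.1):
    """Fit class weights on few-shot data with an L2 penalty toward the zero-shot classifier.

    features: (n, d) image embeddings, labels: (n,) class ids, w_zero: (num_classes, d).
    """
    w = w_zero.clone().requires_grad_(True)  # initialize at the zero-shot classifier
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(features @ w.T, labels)   # fit the few labeled examples
        loss = loss + lam * (w - w_zero).pow(2).sum()    # pull weights back toward the prior
        loss.backward()
        opt.step()
    return w.detach()
```
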
When comparing zero-shot CLIP to few-shot logistic regression on the features of other models, zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in our evaluation suite, which uses the features of a BiT-M ResNet-152x2 trained on ImageNet-21K. We are certain that a BiT-L model trained on JFT-300M would perform even better but these models have not been publicly released. That a BiT-M ResNet-152x2 performs best in a 16-shot setting is somewhat surprising since, as analyzed in Section 3.2, the Noisy Student EfficientNet-L2 outperforms it in a fully supervised setting by almost 5% on average across 27 datasets.
In addition to studying the average performance of zero-shot CLIP and few-shot logistic regression, we also examine performance on individual datasets. In Figure 7, we show estimates for the number of labeled examples per class that a logistic regression classifier on the same feature space requires to match the performance of zero-shot CLIP. Since zero-shot CLIP is also a linear classifier, this estimates the effective data efficiency of zero-shot transfer in this setting. In order to avoid training thousands of linear classifiers, we estimate the effective data efficiency based on a log-linear interpolation of the performance of a 1, 2, 4, 8, and 16-shot (when possible) and a fully supervised linear classifier trained on each dataset. We find that zero-shot transfer can have widely varying efficiency per dataset, from less than 1 labeled example per class to 184. Two datasets, Flowers102 and EuroSAT, underperform one-shot models. Half of the datasets require fewer than 5 examples per class, with a median of 5.4. However, the mean estimated data efficiency is 20.8 examples per class. This is due to the 20% of datasets where supervised classifiers require many labeled examples per class in order to match performance. On ImageNet, zero-shot CLIP matches the performance of a 16-shot linear classifier trained on the same feature space.
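
(Translator’s note: the log-linear interpolation used for Figure 7 can be reproduced roughly as below; the accuracy numbers in the example are made up purely to show the mechanics, not taken from the paper.)

```python
import numpy as np

def examples_per_class_to_match(shots, accuracies, zero_shot_acc):
    """Estimate, by log-linear interpolation of a few-shot curve, how many labeled
    examples per class a linear probe needs to match zero-shot accuracy."""
    log_shots = np.log2(np.asarray(shots, dtype=float))
    accs = np.asarray(accuracies, dtype=float)  # assumed monotonically increasing
    if zero_shot_acc <= accs[0]:
        return float(shots[0])                  # matched at (or below) the smallest shot count
    # Interpolate accuracy -> log2(shots), then invert the log.
    return float(2.0 ** np.interp(zero_shot_acc, accs, log_shots))

# Illustrative few-shot curve: accuracies at 1, 2, 4, 8, 16 shots and full supervision.
print(examples_per_class_to_match([1, 2, 4, 8, 16, 1000], [0.40, 0.48, 0.55, 0.60, 0.64, 0.75], 0.58))
```
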
If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve. In Figure 8 we compare CLIP’s zero-shot performance with fully supervised linear classifiers across datasets. The dashed y = x line represents an “optimal” zero-shot classifier that matches the performance of its fully supervised equivalent. For most datasets, zero-shot classifiers still underperform fully supervised classifiers by 10% to 25%, suggesting that there is still plenty of headroom for improving CLIP’s task-learning and zero-shot transfer capabilities.
There is a positive correlation of 0.82 (p-value < 10⁻⁶) between zero-shot performance and fully supervised performance, suggesting that CLIP is relatively consistent at connecting underlying representation and task learning to zero-shot transfer. However, zero-shot CLIP only approaches fully supervised performance on 5 datasets: STL10, CIFAR10, Food101, OxfordPets, and Caltech101. On all 5 datasets, both zero-shot accuracy and fully supervised accuracy are over 90%. This suggests that CLIP may be more effective at zero-shot transfer for tasks where its underlying representations are also high quality. The slope of a linear regression model predicting zero-shot performance as a function of fully supervised performance estimates that for every 1% improvement in fully supervised performance, zero-shot performance improves by 1.28%. However, the 95th-percentile confidence intervals still include values of less than 1 (0.93-1.79).
Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size (Hestness et al., 2017; Kaplan et al., 2020). The GPT family of models has so far demonstrated consistent improvements in zero-shot performance across a 1000x increase in training compute. In Figure 9, we check whether the zero-shot performance of CLIP follows a similar scaling pattern. We plot the average error rate of the 5 ResNet CLIP models across 39 evaluations on 36 different datasets and find that a similar log-log linear scaling trend holds for CLIP across a 44x increase in model compute. While the overall trend is smooth, we found that performance on individual evaluations can be much noisier. We are unsure whether this is caused by high variance between individual training runs on sub-tasks (as documented in D’Amour et al. (2020)) masking a steadily improving trend or whether performance is actually non-monotonic as a function of compute on some tasks.
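
(Translator’s note: the log-log linear trend described here is just a straight-line fit in log space; the compute and error values below are placeholders for illustration, not the measurements behind Figure 9.)

```python
import numpy as np

# Placeholder (relative compute, average zero-shot error) pairs for a family of models.
compute = np.array([1.0, 2.7, 7.4, 20.0, 44.0])
error = np.array([0.42, 0.38, 0.34, 0.30, 0.27])

# Fit error ~ a * compute^b, i.e. a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
print(f"error ~ {np.exp(log_a):.3f} * compute^{b:.3f}")
```
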

Don't worry, the translation isn't finished yet. The paper is quite long and I have been busy with work recently, so I will finish the remaining sections later; remember to bookmark this post and come back to it.
