Google launches innovative SynCLR: using AI-generated data for efficient visual representation learning, opening a new era of self-training!

Google has launched an innovative synthetic-image framework whose distinguishing feature is that it does not rely on real data at all. The framework first synthesizes image captions, then generates the corresponding images from those captions, and finally trains a model with contrastive learning to recognize and understand these images. Surprisingly, the method performs well on a variety of downstream tasks. Let's see how this magic works!

Paper title: Learning Vision from Models Rivals Learning Vision from Data
Paper link: https://arxiv.org/pdf/2312.17742.pdf

Introduction

Collecting large-scale real data comes with many challenges: unfiltered large datasets are cheap but offer limited benefit, while carefully curated small datasets are more accurate but limit how broadly a model can be applied. To overcome these obstacles, a new study proposes a distinctive solution: learning visual representations from synthetic data. The method generates a large number of image captions and corresponding images and performs contrastive learning on them, treating images that share the same caption as matching positive examples. The research team emphasizes that this synthetic-data-based approach not only scales well but also matches the performance of traditional methods across a variety of downstream tasks.

Traditional learning methods ("learning from data") extract knowledge purely from real data. A typical example is the CLIP model, which learns directly from paired text and image data and achieves an impressive 80.2% linear-transfer accuracy on ImageNet.

Hybrid learning methods ("hybrid") take a two-pronged approach, combining real text with generated images. The StableRep model operates under this framework: it learns from a text dataset together with an image generator, and reaches a strong 76.7% linear-transfer accuracy on ImageNet.

The model-generated approach proposed in this paper ("learning from models"), SynCLR, marks a further leap. By learning from synthetic text and synthetic images, without ever touching real data, it remains competitive with CLIP on ImageNet, achieving an excellent 80.7% linear-transfer accuracy.

Method

The core innovation of SynCLR is that it uses generative models to redefine the granularity of visual classes. Unlike traditional self-supervised and supervised methods, SynCLR uses captions as class definitions, where each caption describes a visual class in detail. The neat thing about this approach is that images can be grouped by the semantics they share through a caption, rather than being limited to broad category labels such as "golden retriever". In experiments, this fine-grained, caption-based notion of class proved superior to traditional self-supervised and supervised training. The system consists of the following three steps:

Generating image captions

First, the authors generate a large corpus of image captions. To do this, they leverage the in-context learning capability of large language models (LLMs), carefully designing a set of prompt templates that guide the LLM to produce relevant text for a given context.

By curating concept lists from existing datasets such as ImageNet-21k and Places-365, the authors build a prompt for each concept that guides the LLM to generate descriptive and creative image captions. Central to this process is ensuring that the generated captions both accurately describe plausible image content and exhibit enough diversity to cover a wide range of visual concepts. This diversity is crucial: it ensures that the generated image set represents as many scenes, objects, and activities as possible, improving the generalization of the learned visual representation.

In this way, the authors synthesize a large and diverse set of image captions, which are then fed to an image generation model to produce corresponding synthetic images. Pairing these synthetic images with their synthetic captions yields a rich dataset for training visual representation learning models. This makes it possible to train visual models without any real image data, offering an innovative and effective complement to traditional visual learning methods that rely on real datasets.

In the figure below, the research team provides a context (left) that guides the model to generate a descriptive caption for a given pair of concepts (such as "tiger, forest" or "groom, wedding ceremony"). In the actual generated results (right), for the concept pair "red fox, yard", the model produced the caption "wild red fox sitting on a partially snow covered front yard of a house in the suburbs of a small city". During generation, three such in-context examples are randomly selected for each inference.
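To make this prompting scheme concrete, here is a minimal Python sketch of how such a few-shot prompt could be assembled. The in-context examples, concept pairs, and the choice of three sampled examples are illustrative stand-ins rather than the paper's actual templates, and the LLM call itself is left as a placeholder.

```python
import random

# Illustrative few-shot prompt builder for caption synthesis. The in-context
# examples below are made up for demonstration; the real templates and concept
# lists (e.g. drawn from ImageNet-21k / Places-365) differ.
IN_CONTEXT_EXAMPLES = [
    ("tiger, forest", "a tiger walking through a dense green forest at dawn"),
    ("groom, wedding ceremony", "a groom in a black suit smiling during an outdoor wedding ceremony"),
    ("sailboat, harbor", "a small sailboat moored in a quiet harbor at sunset"),
]

def build_prompt(concept_pair: str, k: int = 3) -> str:
    """Assemble a prompt from k randomly chosen in-context examples plus the new concept pair."""
    examples = random.sample(IN_CONTEXT_EXAMPLES, k=min(k, len(IN_CONTEXT_EXAMPLES)))
    lines = ["Write a short, descriptive image caption for each pair of concepts."]
    for concepts, caption in examples:
        lines.append(f"Concepts: {concepts}\nCaption: {caption}")
    lines.append(f"Concepts: {concept_pair}\nCaption:")
    return "\n\n".join(lines)

prompt = build_prompt("red fox, yard")
print(prompt)
# The prompt would then be sent to an LLM; the completion is kept as one
# synthetic caption, and repeating with different concept pairs and sampled
# examples yields a large, diverse caption corpus.
```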

Image generation

The research team generates images by running the reverse diffusion process from different random noise initializations. The classifier-free guidance (CFG) scale plays a crucial role here, balancing sample quality, text-image consistency, and sample diversity. To generate a set of different images for each caption, the team varies the random noise input, enriching the diversity of the generated images.
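As an illustration of this step, the sketch below uses the open-source diffusers library to sample several images for one caption, varying only the random seed, with guidance_scale acting as the CFG ratio. The checkpoint name and the guidance value are assumptions for demonstration, not the paper's exact configuration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch: sample several images for one caption, varying only the random seed.
# guidance_scale is the classifier-free guidance (CFG) ratio that trades off
# text-image consistency against sample diversity. Checkpoint and CFG value
# are illustrative; requires a GPU and the diffusers library.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "wild red fox sitting on a partially snow covered front yard of a house"

images = []
for seed in range(4):  # several samples per caption -> multiple positives later
    generator = torch.Generator(device="cuda").manual_seed(seed)
    out = pipe(caption, guidance_scale=2.5, generator=generator)
    images.append(out.images[0])
```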

Representation learning

The representation learning method builds on StableRep and uses a multi-positive contrastive learning loss. The core idea is to align, in embedding space, images generated from the same caption, while incorporating several techniques from other self-supervised methods, including a patch-level masked image modeling objective.

StableRep

The StableRep method minimizes a cross-entropy loss computed over the similarities between samples. This trains the model to recognize which images were generated from the same caption and to distinguish them from images generated from different captions.

iBOT

The iBOT method adopts a masked image modeling objective: local patches are masked, and the model must predict the tokenized representations of these masked patches. This adapts the DINO objective from the image level to the patch level.
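The following simplified sketch illustrates the patch-level masked-prediction idea with random tensors standing in for real ViT outputs: a projection head maps patch embeddings to prototype logits, and a cross-entropy between teacher and student distributions is computed only at masked positions. In the actual method the teacher is a separate EMA copy of the student and its head, so this is only a schematic illustration.

```python
import torch
import torch.nn.functional as F

# Schematic patch-level masked-prediction loss with random stand-in tensors.
# student_patches: patch embeddings of the masked input; teacher_patches:
# embeddings of the full input (in practice from a separate EMA teacher).
B, N, D, K = 8, 196, 768, 8192            # batch, patches, embed dim, prototypes
student_patches = torch.randn(B, N, D)
teacher_patches = torch.randn(B, N, D)
head = torch.nn.Linear(D, K)              # projection onto prototype logits
mask = torch.rand(B, N) < 0.3             # ~30% of patches are masked

student_logits = head(student_patches)
with torch.no_grad():
    teacher_probs = F.softmax(head(teacher_patches) / 0.04, dim=-1)  # sharpened targets

# cross-entropy between teacher and student patch distributions, masked positions only
log_p = F.log_softmax(student_logits / 0.1, dim=-1)
loss = -(teacher_probs * log_p).sum(dim=-1)[mask].mean()
```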

Exponential Moving Average (EMA)

EMA was first introduced into self-supervised learning by MoCo; here the EMA model encodes the crops and produces the targets for the iBOT loss. During training, the EMA momentum follows a cosine schedule, smoothing the update of the teacher parameters so that the model stays stable throughout training.
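A minimal sketch of an EMA teacher update with a cosine momentum schedule is shown below; the base momentum of 0.994 and the schedule shape are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import math
import torch

# EMA teacher update with a cosine momentum schedule: the momentum ramps from
# a base value toward 1.0 over training, so the teacher changes ever more slowly.
def ema_momentum(step: int, total_steps: int, base: float = 0.994) -> float:
    return 1.0 - (1.0 - base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, m: float) -> None:
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)

# inside the training loop:
#   m = ema_momentum(step, total_steps)
#   update_teacher(student, teacher, m)
```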

Multi-crop strategy

As a way to improve computational efficiency, the multi-crop strategy lets the model learn from multiple views and contexts, increasing the diversity of training samples and improving the generalization of the representation. Concretely, StableRep minimizes the cross-entropy between a ground-truth assignment and a contrastive assignment. In this framework there is an encoded anchor sample and a set of encoded candidate samples; the contrastive assignment distribution gives the model's predicted probability that the anchor and each candidate were generated from the same caption, while an indicator function marks whether two samples actually come from the same caption.
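The sketch below spells out this multi-positive contrastive objective with random placeholder embeddings: an indicator over caption IDs defines the ground-truth assignment, a temperature-scaled softmax over similarities gives the contrastive assignment, and the loss is the cross-entropy between the two. The temperature, dimensions, and batch layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Multi-positive contrastive loss sketch: every candidate sharing the anchor's
# caption counts as a positive. Embeddings and caption IDs are placeholders.
def multi_positive_contrastive_loss(anchors, candidates, anchor_ids, cand_ids, tau=0.1):
    a = F.normalize(anchors, dim=-1)                      # (A, D) anchor embeddings
    c = F.normalize(candidates, dim=-1)                   # (C, D) candidate embeddings
    logits = a @ c.t() / tau                              # anchor-candidate similarities

    match = (anchor_ids[:, None] == cand_ids[None, :]).float()  # indicator: same caption?
    gt = match / match.sum(dim=1, keepdim=True)           # ground-truth assignment

    log_q = F.log_softmax(logits, dim=1)                  # contrastive assignment (log)
    return -(gt * log_q).sum(dim=1).mean()                # cross-entropy between the two

# e.g. 4 captions x 4 synthetic images each, 128-d embeddings
ids = torch.arange(4).repeat_interleave(4)
loss = multi_positive_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128), ids, ids)
```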

Experiments

The research team pre-trained their models for up to 500k steps with a large batch size of 8192 captions, with all pre-training performed at 224x224 resolution. They compared SynCLR with OpenAI's CLIP, OpenCLIP, and DINO v2, which represent different ways of learning from data. Notably, the ViT-B/14 and ViT-L/14 models in DINO v2 are distilled from a ViT-g model, which gives DINO v2 an advantage in the comparison.

ImageNet linear evaluation

For a fair comparison, all models use the cls token of the last block as the representation (in contrast to the reported DINO v2 results, which concatenate features from multiple layers). As shown in Table 6 of the paper, SynCLR reaches 80.7% with the ViT-B architecture and 83.0% with ViT-L. These results are comparable to models that learn directly from real data (such as CLIP and DINO v2), even though SynCLR uses only synthetic data.
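For reference, this is a minimal sketch of what such a linear evaluation looks like: the backbone is frozen, its cls-token feature is used as the representation, and only a linear classifier is trained. The backbone interface, optimizer settings, and feature dimension are assumptions for illustration, not the paper's protocol.

```python
import torch
import torch.nn.functional as F

# Linear evaluation sketch: freeze the pretrained backbone, take its cls-token
# feature as the representation, and train only a linear classifier on top.
# `backbone` is assumed to return that feature; `loader` yields (images, labels).
def train_linear_probe(backbone, loader, feat_dim=768, num_classes=1000, epochs=10):
    clf = torch.nn.Linear(feat_dim, num_classes).cuda()
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    backbone.eval()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images.cuda())            # frozen cls-token features
            loss = F.cross_entropy(clf(feats), labels.cuda())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```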

UperNet semantic segmentation

For UperNet semantic segmentation, the research team used a single scale of 512x512 resolution; models with a 14x14 patch size use 518x518 resolution instead. They used 600M synthetic images and compared against models including MoCo v3, SimMIM, MAE, PeCo, data2vec, iBOT, BEiT v2, CLIP, and OpenCLIP, which were mostly pre-trained on real ImageNet data. SynCLR achieves 54.3% and 57.7% mIoU at standard and high resolution respectively, performing very well compared with models trained on real data.

ImageNet image classification

SynCLR's performance on ImageNet image classification is also noteworthy. Using 600M synthetic images, SynCLR is compared with models pre-trained on datasets such as IN21K, WIT-400M, and LAION-2B. SynCLR reaches 85.8% top-1 accuracy with ViT-B and 87.9% with ViT-L, better than most models trained on real data.

These results clearly show that, despite relying entirely on synthetic data, SynCLR is comparable to models trained on real data for visual representation learning, demonstrating the effectiveness and great potential of this approach.

Summary

The authors make the following key points and conclusions:

Reasons to learn from generative models: a major advantage of generative models is that a single model can play the role of hundreds or thousands of datasets at once. In traditional pipelines, researchers often need to collect separate datasets for different image categories (cars, flowers, cats, dogs, and so on); systems like DINO v2 build powerful and robust representations by curating and integrating large numbers of such datasets.

Significant advantages of generative models: compared with traditional data collection and annotation pipelines, generative models offer a more efficient and broader way to cover visual concepts, eliminating much of the time and resources spent collecting and annotating real images.

The paper highlights the critical role of synthetic data in visual representation learning. Although synthetic data may not match real data in classification accuracy, it has proven highly effective for training visual representation models. The resulting representations can then be adapted to downstream tasks with relatively small amounts of real data, demonstrating the practicality and adaptability of synthetic data.


Origin blog.csdn.net/xixiaoyaoww/article/details/135369801