LIMoE: Learning multiple modalities using MoE

Paper link: Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Published at (conference): NeurIPS 2022

1. Background introduction

In practical applications, multimodal data are usually high-dimensional, and processing them can incur large computational and storage overhead. Sparse models help alleviate this problem because only a small fraction of their parameters is active for any given input, so high-dimensional data can be handled efficiently. Taking sparsity into account in multimodal learning can therefore improve performance and efficiency while reducing computational and storage costs.
A number of research works and practical systems combine multimodal data with sparse models to better process information from different sensors or data sources, either by converting multimodal data into sparse representations or by building sparsity directly into the multimodal model. This combination helps address the challenges of processing large-scale multimodal data, and in the context of multimodal learning, introducing sparsity is one effective strategy for handling such complex data.

Dense models that learn many different tasks simultaneously (multitask learning) or sequentially (continual learning) often suffer from negative interference: when tasks are too diverse, it can be better to train a separate model for each task, and catastrophic forgetting means the model degrades on earlier tasks as new tasks are added.

Sparse models

Sparse models stand out as one of the most promising directions for the future of deep learning. Instead of every part of the model processing every input ("dense" modeling), sparse models use conditional computation to learn to route each input to different "experts" within a potentially very large network. This has many benefits.

  • Model size can be increased while keeping computational cost constant. This is an efficient and greener way to scale models and is key to high performance.
  • Sparsity also naturally partitions the neural network, which helps avoid both of the problems described above. By not applying the entire model to all inputs, the "experts" in the model can focus on different tasks or data types while still leveraging the model's shared components.

The Google Research team has been working on sparsity for a long time. Today's AI models are typically trained to do only one thing; Pathways would enable a single model to be trained to do thousands or millions of things. Pathways summarizes the research vision of building one large model that can handle thousands of tasks and many data modalities.
So far, sparse unimodal models for language (Switch Transformer, Task-MoE, GLaM) and for computer vision (Vision MoE) have made considerable progress. The Google team is now taking an important step toward the Pathways vision by building large sparse models that process images and text simultaneously through a modality-agnostic router. The setting they study is multimodal contrastive learning, which requires a deep understanding of both images and text in order to align pictures with their correct textual descriptions. To date, the strongest models for this task rely on an independent network for each modality (the "two-tower" approach).

2. Content summary

This paper proposes LIMoE, the first large-scale multimodal architecture built from a mixture of experts (MoE). It processes both images and text, using sparsely activated experts that naturally specialize. LIMoE outperforms comparable dense multimodal models and two-tower approaches on zero-shot image classification. The largest LIMoE achieves 84.1% zero-shot accuracy on ImageNet, comparable to far more expensive state-of-the-art models. Sparsity allows LIMoE to scale gracefully and to learn to handle very different inputs, resolving the tension between being a generalist and a specialist.


The LIMoE architecture contains many "experts", and a router decides which tokens (image patches or parts of sentences) are sent to which experts. After processing by the expert layers (gray) and the shared dense layers (brown), a final output layer computes a single vector representation of the image or of the text.

Sparse Mixture-of-Experts Models

Transformers represent data as sequences of vectors (or tokens). Although originally developed for text, they can be applied to most things that can be represented as a sequence of tokens, such as images, video, and audio. Recent large-scale MoE models add expert layers to the Transformer architecture (for example GShard and ST-MoE in natural language processing, and Vision MoE for vision tasks).

The standard Transformer is made up of many "blocks", each containing several different layers. One of these layers is a feed-forward network (FFN). For LIMoE and the works cited above, this single FFN is replaced by an expert layer containing many parallel FFNs, each of which is an expert. Given a sequence of tokens to process, a simple router learns to predict which experts should process which tokens. Only a small number of experts are activated per token, which means that although model capacity is significantly increased by having many experts, the actual computational cost is kept under control by using them sparsely. If only one expert is activated, the model's cost is roughly the same as that of the standard Transformer.
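To make the routing idea concrete, below is a minimal sketch of an expert layer with top-1 routing in PyTorch. It is not the paper's implementation; all names (`MoELayer`, `d_model`, `num_experts`, etc.) are illustrative, and details such as expert capacity limits are omitted for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Replaces a Transformer FFN with parallel expert FFNs and a learned router (simplified sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Each expert is an ordinary two-layer feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The router is a single linear layer producing one logit per expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_tokens, d_model) -- image patches and text tokens look identical here.
        gate_probs = F.softmax(self.router(tokens), dim=-1)   # (num_tokens, num_experts)
        top_prob, top_expert = gate_probs.max(dim=-1)         # top-1 routing: one expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_expert == e
            if mask.any():
                # Only the selected expert runs on each token, scaled by its gate probability.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(tokens[mask])
        return out

# Example: route 6 tokens of width 8 through 4 experts.
layer = MoELayer(d_model=8, d_hidden=32, num_experts=4)
print(layer(torch.randn(6, 8)).shape)  # torch.Size([6, 8])
```

With top-1 routing, the per-token compute is roughly that of a single FFN, even though the layer holds the parameters of all four experts.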

LIMoE does exactly this, activating one expert per token, thus matching the computational cost of its dense baselines. The difference is that the LIMoE router may see tokens from either image or text data.

A characteristic failure mode occurs when the MoE model tries to send all tokens to the same expert. Typically this is addressed with auxiliary losses, extra training objectives that encourage balanced expert use. Handling multiple modalities together with sparsity leads to new failure modes that existing auxiliary losses cannot fix. To overcome this, the paper develops a new auxiliary loss (see the paper for details) and uses routing prioritization (BPR) during training. Together these two innovations yield stable, high-performing multimodal models.
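LIMoE's new auxiliary loss is described in the paper; as a point of reference, here is a minimal sketch of the classic load-balancing auxiliary loss used in earlier sparse models such as Switch Transformer (a simplified standard formulation, not LIMoE's new loss), which encourages the router to spread tokens evenly across experts.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_expert: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style balance loss (reference sketch, not LIMoE's new loss).

    router_logits: (num_tokens, num_experts) raw router outputs.
    top_expert:    (num_tokens,) index of the expert each token was dispatched to.
    """
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    # f_e: fraction of tokens dispatched to each expert (hard assignment).
    dispatch_fraction = F.one_hot(top_expert, num_experts).float().mean(dim=0)
    # p_e: mean router probability assigned to each expert (soft assignment).
    mean_prob = probs.mean(dim=0)
    # Minimized when both distributions are uniform, i.e. experts are used evenly.
    return num_experts * torch.sum(dispatch_fraction * mean_prob)

# Example usage with random logits for 16 tokens and 4 experts.
logits = torch.randn(16, 4)
assignment = logits.argmax(dim=-1)
print(load_balancing_loss(logits, assignment))
```

This term is added to the main training objective with a small weight; the paper's contribution is a new loss of this kind that remains effective when image and text tokens share one router.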


The new auxiliary loss (LIMoE aux) and routing prioritization (BPR) stabilize training and improve overall performance (left), and also increase the routing success rate (middle and right). A low success rate means the router does not use all available experts and drops many tokens because individual experts' capacity is reached, which usually indicates that the sparse model is not learning well. The combination introduced by LIMoE ensures high routing success rates for both images and text, significantly improving performance.

Contrastive Learning

In multimodal contrastive learning, the model is trained on paired image-text data (such as photos and their captions). Typically, an image model extracts a representation of the image and a separate text model extracts a representation of the text. The contrastive learning objective encourages the image and text representations of the same pair to be close, and those of different pairs to be far apart. A model with such aligned representations can be adapted to new tasks without any additional training data ("zero-shot"): for example, an image will be classified as "dog" rather than "cat" if its representation is closer to the representation of the word "dog" than to that of the word "cat". This idea extends to thousands of categories and is known as zero-shot image classification.
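As a concrete illustration of this objective and of zero-shot classification, here is a minimal CLIP-style sketch. The function names and the fixed temperature are assumptions for illustration; the loss works the same whether the embeddings come from two towers or, as in LIMoE, from one shared sparse model.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    Pair i (image_emb[i], text_emb[i]) is pulled together; all other pairings are pushed apart.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))            # the diagonal holds the true pairs
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def zero_shot_classify(image_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """Assign each image to the class whose text embedding is most similar (no extra training)."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(class_text_emb, dim=-1).t()
    return sims.argmax(dim=-1)

# Example with random embeddings: 8 image-text pairs, then classify against 3 class prompts.
imgs, txts = torch.randn(8, 64), torch.randn(8, 64)
print(contrastive_loss(imgs, txts))
print(zero_shot_classify(imgs, torch.randn(3, 64)))
```

In practice the class text embeddings come from prompts such as "a photo of a dog", so classification over thousands of categories needs only their text embeddings.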

CLIP and ALIGN (both two-tower models) scaled up this process, achieving zero-shot classification accuracies of 76.2% and 76.4% on ImageNet. This paper examines single-tower models that compute both image and text representations and finds that this degrades the performance of dense models, possibly due to negative interference or insufficient capacity. However, a computationally matched LIMoE not only improves on the single-tower dense model but also outperforms two-tower dense models. The paper trains a series of models using a training scheme similar to CLIP's. The dense L/16 model achieves 73.5% zero-shot accuracy, while LIMoE-L/16 reaches 78.6%, even better than CLIP's more expensive two-tower L/14 model (76.2%). As shown below, LIMoE's use of sparsity provides a significant performance boost over dense models of comparable cost.

For a given computational cost (x-axis), LIMoE models (circles, solid line) significantly outperform their dense baselines (triangles, dashed line). The labels indicate the size of the underlying Transformer, increasing from left (S/32) to right (L/16): S (small), B (base), and L (large) refer to the model scale, and the number is the patch size, where smaller patches mean a larger architecture.


LiT and BASIC pushed the zero-shot accuracy of dense two-tower models to 84.5% and 85.6%, respectively. Beyond scaling, these methods rely on specialized pre-training schemes that repurpose image models that are already of extremely high quality. LIMoE-H/14 does not benefit from any pre-training or modality-specific components, yet still achieves a comparable 84.1% zero-shot accuracy when trained from scratch. It is also interesting to compare model sizes: LiT and BASIC have 2.1B and 3B parameters respectively, while LIMoE-H/14 has 5.6B parameters in total but, thanks to sparsity, applies only 675M parameters per token, making it more lightweight.

Experiment Analysis

The motivation behind LIMoE is that sparse conditional computation lets a generalist multimodal model still develop the specialization needed to understand each modality well. This section analyzes LIMoE's expert layers and uncovers some interesting phenomena.

  • The emergence of modality-specialized experts. In the experimental training setup there are many more image tokens than text tokens, so all experts tend to process at least some images, but some experts process mostly images, others mostly text, and others both.
    The figure shows the distribution over LIMoE's eight experts; the percentages indicate the share of image tokens processed by each expert. There are one or two experts that clearly specialize in text (shown mostly in blue), usually two to four image experts (mostly red), and the rest fall somewhere in between.

  • There are also clear qualitative patterns among the image experts. For example, in most LIMoE models there is an expert that handles all image patches containing text.
    In the accompanying example, there are experts that handle fauna and greenery and experts that handle human hands. The figure shows which image tokens are sent to which experts at one layer of LIMoE-H/14 (LIMoE selects one expert per token). Semantic experts specializing in particular topics such as plants or wheels emerge even though the model was never explicitly trained to form them.

3. Article summary

Multimodal models that handle many tasks are a promising path forward, with two key ingredients for success: 1. scale; 2. the ability to exploit synergies between different tasks and modalities while avoiding interference between them. Sparse conditional computation is an excellent way to achieve both. It enables high-performance, efficient generalist models that also have the specialization and flexibility needed to excel at individual tasks, as demonstrated by LIMoE's strong performance at lower computational cost.

Source: blog.csdn.net/cold_code486/article/details/134896108