Meta-Transformer: A Unified Framework for Multimodal Learning


Meta-Transformer is a new framework for multimodal learning that processes and correlates information from multiple modalities, such as natural language, images, point clouds, audio, video, time series, and tabular data. Although there are inherent gaps between these modalities, Meta-Transformer exploits a frozen encoder to extract high-level semantic features from input data in a shared token space, without requiring paired multimodal training data. The framework consists of a unified data tokenizer, a modality-shared encoder, and task-specific heads for various downstream tasks. It is the first effort to perform unified learning across modalities with unpaired data. Experiments show that it can handle a wide range of tasks, from basic perception to practical applications and data mining.

Meta-Transformer

Data-to-sequence tokenization

The researchers propose a meta-tokenization scheme that converts data from different modalities (such as text, images, point clouds, and audio) into token embeddings that live in a shared embedding space.

For natural language, they use WordPiece embeddings with a 30,000-token vocabulary, which splits words into subwords and converts each input text into a sequence of token embeddings.
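As a minimal illustration of this step, the sketch below uses the Hugging Face transformers library with the bert-base-uncased WordPiece vocabulary, which is an assumption for demonstration rather than necessarily the exact tokenizer the authors used:

```python
from transformers import BertTokenizer

# Illustrative sketch: WordPiece tokenization via Hugging Face `transformers`.
# bert-base-uncased ships a ~30k-entry WordPiece vocabulary; this is an assumed
# stand-in, not confirmed to be the authors' exact tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Meta-Transformer unifies multimodal learning."
print(tokenizer.tokenize(text))      # subword pieces; rare words are split into "##"-prefixed parts
print(tokenizer(text)["input_ids"])  # token IDs that are then mapped to token embeddings
```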

For images, they reshape each image into a sequence of flattened 2D patches and then use a projection layer to map them to the embedding dimension. The same operation also works for infrared images, while a linear projection is used for hyperspectral images. For video recognition, the 2D convolutional patch-embedding layers are replaced with 3D convolutions.
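The following PyTorch sketch shows the patch-tokenization idea under assumed hyperparameters (224×224 input, 16×16 patches, 768-dimensional embeddings, i.e. typical ViT defaults rather than values confirmed by the paper):

```python
import torch
import torch.nn as nn

# Minimal sketch of image patch tokenization: split an image into non-overlapping
# patches, flatten each patch, and linearly project it to the shared embedding
# dimension. Hyperparameters are illustrative, not taken from the paper.
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "flatten patch + linear projection".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```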

For point clouds, farthest point sampling (FPS) is used to transform the original point cloud from the raw input space to the token embedding space: it samples a representative skeleton of the original point cloud at a fixed sampling ratio. Then k-nearest neighbors (k-NN) grouping gathers neighboring points, and an adjacency matrix is built to capture the structural information of 3D objects and scenes.
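A simplified PyTorch sketch of FPS and k-NN grouping is given below; the sampling ratio and neighborhood size are illustrative assumptions, and a real implementation would be batched and GPU-optimized:

```python
import torch

# Rough sketch of farthest point sampling (FPS) and k-NN grouping for one point
# cloud; a simplified illustration, not the authors' implementation.
def farthest_point_sampling(points, num_samples):
    """points: (N, 3) -> indices of num_samples representative points."""
    N = points.shape[0]
    selected = torch.zeros(num_samples, dtype=torch.long)
    distances = torch.full((N,), float("inf"))
    farthest = torch.randint(0, N, (1,)).item()
    for i in range(num_samples):
        selected[i] = farthest
        dist = ((points - points[farthest]) ** 2).sum(dim=1)
        distances = torch.minimum(distances, dist)   # distance to nearest selected point
        farthest = torch.argmax(distances).item()    # pick the point farthest from the set
    return selected

def knn_group(points, centers, k):
    """Group the k nearest neighbors of each sampled center point."""
    dists = torch.cdist(centers, points)             # (M, N) pairwise distances
    idx = dists.topk(k, largest=False).indices       # (M, k) neighbor indices
    return points[idx]                               # (M, k, 3) local neighborhoods

cloud = torch.randn(1024, 3)
centers_idx = farthest_point_sampling(cloud, 128)    # fixed sampling ratio of 1/8 (assumed)
groups = knn_group(cloud, cloud[centers_idx], k=16)
print(groups.shape)  # torch.Size([128, 16, 3])
```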

For audio, the waveform is preprocessed with a Mel filter bank, using a Hamming window to split the signal into intervals. The resulting spectrogram is then split into patches along the time and frequency dimensions, and these patches are flattened into a token sequence.
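The sketch below approximates this pipeline with torchaudio; the sample rate, FFT size, hop length, and 16×16 patch size are assumptions for illustration, not values taken from the paper:

```python
import torch
import torchaudio

# Illustrative preprocessing sketch (parameter values are assumptions): compute a
# Mel spectrogram with a Hamming window, then cut it into patches along the
# frequency and time axes and flatten each patch into a token.
waveform = torch.randn(1, 16000)  # one second of synthetic 16 kHz audio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160,
    n_mels=128, window_fn=torch.hamming_window,
)(waveform)                                   # (1, n_mels=128, time_frames)

patch = 16
f_bins = (mel.shape[1] // patch) * patch      # crop to multiples of the patch size
t_bins = (mel.shape[2] // patch) * patch
spec = mel[:, :f_bins, :t_bins]
patches = spec.unfold(1, patch, patch).unfold(2, patch, patch)  # (1, F/16, T/16, 16, 16)
tokens = patches.reshape(1, -1, patch * patch)                  # flatten each patch
print(tokens.shape)
```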

Unified Encoder

After converting the raw inputs of the various modalities into token embeddings, the researchers use a unified Transformer encoder with frozen parameters to encode these tokens. The encoder, based on the ViT model, is pretrained with contrastive learning on the LAION-2B dataset to improve its generic token-encoding ability. For text understanding, they use the pretrained text tokenizer from CLIP to convert sentences into subwords and then into word embeddings.

In what the authors refer to as "modality-agnostic learning", a learnable token (x_CLS) is added to the beginning of the token embedding sequence. The final hidden state of this token serves as a summary representation of the input sequence and is typically used for recognition tasks. Positional embeddings are also added to the token embeddings.

The Transformer encoder consists of multiple stacked multi-head self-attention layers and MLP blocks that process these embedding sequences. The authors note that adding more complex 2D-aware positional embeddings does not significantly improve image recognition performance.
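Putting these pieces together, here is a minimal sketch of the modality-agnostic encoding step: a learnable class token is prepended, positional embeddings are added, and the sequence passes through a frozen stack of self-attention and MLP blocks. The dimensions and depth are illustrative (ViT-Base-like), not the authors' exact configuration:

```python
import torch
import torch.nn as nn

# Minimal sketch of the modality-agnostic encoding step (dimensions are assumed):
# prepend a learnable x_CLS token, add positional embeddings, and run the sequence
# through a frozen stack of multi-head self-attention + MLP blocks.
class UnifiedEncoder(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12, max_len=1024):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # The shared encoder is kept frozen; only the tokenizers and task heads train.
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self, tokens):                              # tokens: (B, N, embed_dim)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        x = torch.cat([cls, tokens], dim=1)                 # prepend x_CLS
        x = x + self.pos_embed[:, : x.shape[1]]
        x = self.encoder(x)
        return x[:, 0]                                      # final hidden state of x_CLS

summary = UnifiedEncoder()(torch.randn(2, 196, 768))
print(summary.shape)  # torch.Size([2, 768])
```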

Experimental results

Meta-Transformer models show good results, although not always better than other state-of-the-art methods, across a variety of language and image understanding tasks.

On the text understanding tasks of the GLUE benchmark, Meta-Transformer scores relatively high on sentiment classification, paraphrase, duplicate detection, natural language inference, and answerability tasks. While it underperforms models such as BERT, RoBERTa, and ChatGPT, it shows promise for natural language understanding, especially after fine-tuning.

On image understanding tasks, Meta-Transformer outperforms models such as the Swin Transformer series and InternImage in several respects. When combined with the CLIP text encoder, it delivers strong zero-shot classification results. It also outperforms other models on object detection and semantic segmentation, demonstrating its proficiency in image understanding.

Meta-Transformer also proves effective on infrared and hyperspectral image recognition, tested on the RegDB and Indian Pines datasets, respectively. Although it does not top the leaderboards, its results are still very good, demonstrating its potential to handle the challenges of infrared and hyperspectral imagery.

On X-ray image processing, Meta-Transformer achieved a performance of 94.1%, indicating its utility in medical image analysis.

On point cloud understanding tasks, the Meta-Transformer achieves higher accuracy scores with fewer trainable parameters compared to other models on the ModelNet-40, S3DIS, and ShapeNetPart datasets, emphasizing its efficiency in this area.

On audio recognition tasks, Meta-Transformer is competitive with existing audio Transformer models such as AST and SSAST, reaching 97.0% accuracy when its parameters are tuned. Although AST performs well, such models require far more trainable parameters.

On video understanding tasks, tested on the UCF101 dataset, Meta-Transformer does not outperform other state-of-the-art methods in accuracy. It stands out, however, for its significantly fewer trainable parameters, suggesting the potential benefits of unified multimodal learning and lower architectural complexity.

On time series forecasting tasks, Meta-Transformer outperforms several existing methods on benchmarks such as the ETTh1, Traffic, Weather, and Exchange datasets, while requiring only a small number of trainable parameters.

On tabular data understanding tasks, Meta-Transformer performs well on the Adult Census and Bank Marketing datasets. It outperforms other models on the Bank Marketing dataset, suggesting its potential for understanding complex tabular data.

On the graph understanding task of the PCQM4M-LSC dataset, the current Meta-Transformer architecture does not perform well at learning structural data: the Graphormer model outperforms it, so there is room for improvement here.

On the classification task of the Ego4D dataset, Meta-Transformer reaches an accuracy of 73.9%. Collectively, these findings highlight the versatility and effectiveness of Meta-Transformer across different domains.

Several of the results above show that Meta-Transformer has fewer parameters and is more efficient. One of its main limitations is the computational complexity of O(n² × D), where n is the token sequence length and D is the embedding dimension, inherited from self-attention.

By Andrew Lukyanenko

Paper and source code:

https://avoid.overfit.cn/post/27688397b91a48f680d3e5e3ca9e9f86
