Language Is Not All You Need: A Large Language Model Across Modalities


This paper is a specific implementation of KOSMOS-1, which integrates vision and LLM; it does not build a pipeline of images and LLM like visualGPT, but a unified large model of joint weights of images and LLM. Very useful for human-computer interaction scenarios.

Paper: Language Is Not All You Need: Aligning Perception with Language Models

github: https://github.com/microsoft/unilm


Figure 1: KOSMOS-1 is a multimodal large language model (MLLM) capable of perceiving multimodal input, following instructions, and performing in-context learning, not only for language tasks but also for multimodal tasks. In this work, we align vision with large language models (LLMs), advancing the trend from LLMs to MLLMs.

Abstract

A key step toward artificial general intelligence lies in the grand convergence of language, multimodal perception, action, and world modeling. In this study, we introduce KOSMOS-1, a multimodal large language model (MLLM). Specifically, we train KOSMOS-1 from scratch on a web-scale multimodal corpus, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a range of tasks without any gradient updates or fine-tuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodality and from multimodality to language. Furthermore, we introduce a Raven IQ test dataset to evaluate the nonverbal reasoning ability of MLLMs.

Figure 2: Selected examples generated from KOSMOS-1. The blue box is the input prompt, and the pink box is the KOSMOS-1 output. These examples include (1)-(2) visual explanation, (3)-(4) visual question answering, (5) web page question answering, (6) simple math equations, and (7)-(8) number recognition.

Figure 3: Selected examples generated from KOSMOS-1. The blue box is the input prompt, and the pink box is the KOSMOS-1 output. These examples include (1)-(2) image captioning, (3)-(6) visual question answering, (7)-(8) OCR, and (9)-(11) visual dialogue.

Table 1: We evaluate the performance of KOSMOS-1 on language, perceptual language, and vision tasks in zero-shot and few-shot learning settings.

1 Introduction: From LLMs to MLLMs

Large language models (LLMs) have been successfully used as a general interface for various natural language tasks [BMR+20]. The LLM-based interface is adaptable to the task as long as we can convert the input and output to text. For example, the input of a summarization task is a document and the output is its summary. So we can feed an input document into a language model and then produce a generated summary.

Despite successful applications in natural language processing, using LLMs remains challenging for multimodal data such as images and audio. As a fundamental part of intelligence, multimodal perception is a necessary condition for achieving artificial general intelligence, both for knowledge acquisition and for grounding in the real world. More importantly, unlocking multimodal input [TMC+21, HSD+22, WBD+22, ADL+22, AHR+22, LLSH23] greatly expands the application domain of language models to areas such as multimodal machine learning, document intelligence, and robotics.

In this work, we introduce KOSMOS-1, a multimodal large language model (MLLM). The goal is to align perception with LLMs so that the model can see and talk. Specifically, we follow the approach of METALM [HSD+22] to train the KOSMOS-1 model from scratch. As shown in Figure 1, a Transformer-based language model is regarded as the general-purpose interface, and perception modules are connected to the language model. We train the model on web-scale multimodal corpora, namely text data, arbitrarily interleaved images and text, and image-caption pairs. Furthermore, we calibrate the instruction-following capability across modalities by transferring language-only instruction data.

As shown in Table 1, the KOSMOS-1 model natively supports language, perception-language, and vision tasks. We also show some generated examples in Figures 2 and 3. In addition to various natural language tasks, the KOSMOS-1 model natively handles a wide range of perception-intensive tasks, including visual dialogue, visual explanation, visual question answering, image captioning, simple math equations, OCR, and zero-shot image classification with descriptions. We also construct an IQ benchmark following Raven's Progressive Matrices [JR03, CJS90] to evaluate the nonverbal reasoning ability of MLLMs. These examples demonstrate that native support for multimodal perception opens new opportunities to apply LLMs to novel tasks. Furthermore, we demonstrate that MLLMs improve commonsense reasoning performance over LLMs, suggesting that cross-modal transfer facilitates knowledge acquisition.

The main conclusions are as follows:

From LLMs to MLLMs. Appropriately handling perception is a necessary step towards artificial general intelligence. The ability to perceive multimodal inputs is crucial for LLMs. First, multimodal perception enables LLMs to acquire commonsense knowledge beyond textual descriptions. Second, aligning perception with LLMs opens the door to new tasks, such as robotics and document intelligence. Third, the ability of perception unifies various APIs, because GUI is the most natural and unified way of interaction. For example, MLLMs can directly read screens or extract numbers from receipts. We train the KOSMOS-1 model on a web-scale multimodal corpus to ensure that the model can learn robustly from diverse sources. We not only use large-scale text corpora, but also mine high-quality image-caption pairs and arbitrarily interleaved image and text documents from the web.

Use language models as a general-purpose interface. Following the idea proposed by METALM [HSD+22], we treat the language model as a general-purpose task layer. Thanks to the openness of the output space, we are able to unify various task predictions as text. In addition, language models handle natural language instructions and action sequences (like programming languages) well. LLMs also serve as basic reasoners [WWS+22], complementing perception modules in complex tasks. So it is natural to align world, action, and multimodal perception with a general-purpose interface, i.e., the language model.

New capabilities of MLLMs. As shown in Table 1, MLLMs open up new uses and possibilities in addition to the capabilities found in previous LLMs [BMR+20, CND+22]. First, we can do zero-shot and few-shot multimodal learning by using natural language instructions and examples. Second, we observed promising signals for nonverbal reasoning by evaluating the Raven IQ test, which measures fluid reasoning abilities in humans. Third, MLLMs naturally support multi-turn interactions of general modalities, such as multimodal dialogue.

2 KOSMOS-1: Multimodal Large Language Model

As shown in Figure 1, KOSMOS-1 is a multimodal language model that can perceive general modalities, follow instructions, learn in context, and generate output. Given the previous context, the model learns to generate text in an autoregressive manner. Specifically, the backbone of KOSMOS-1 is a Transformer-based causal language model. Besides text, other modalities are also embedded and fed into the language model, with the Transformer decoder serving as a general-purpose interface for multimodal input. We train KOSMOS-1 on multimodal corpora, including unimodal data, cross-modal paired data, and interleaved multimodal data. Once the model is trained, we can directly evaluate it in zero-shot and few-shot settings on both language and multimodal tasks.

2.1 Input Representation

The Transformer decoder perceives general modalities in a unified way. For the input format, we flatten the input into a sequence decorated with special tokens. Specifically, we use <s> and </s> to denote the start and end of a sequence. The special tokens <image> and </image> denote the start and end of encoded image embeddings. For example, "<s> document </s>" is a text input, and "<s> paragraph <image> image embedding </image> paragraph </s>" is an interleaved image-text input. Table 21 in the appendix shows some examples of input formats.
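To make the flattening concrete, here is a minimal sketch (an illustration, not the released implementation) of how an interleaved image-text input could be turned into one sequence with the special tokens above; the tokenizer interface and the number of placeholder slots per image are assumptions.

```python
# A minimal sketch of flattening an interleaved image-text input into one
# sequence with special tokens. The tokenizer and the number of placeholder
# slots per image are assumptions, not the released KOSMOS-1 code.

BOS, EOS = "<s>", "</s>"
IMG_START, IMG_END = "<image>", "</image>"
NUM_IMG_SLOTS = 64  # hypothetical number of embeddings kept per image


def flatten_interleaved(segments, tokenize):
    """segments: a list of str (text) or non-str objects (images)."""
    tokens = [BOS]
    for seg in segments:
        if isinstance(seg, str):
            tokens.extend(tokenize(seg))
        else:
            # Placeholder slots; at embedding time they are replaced by the
            # image embeddings produced by the vision encoder and Resampler.
            tokens.append(IMG_START)
            tokens.extend(["<img_embed>"] * NUM_IMG_SLOTS)
            tokens.append(IMG_END)
    tokens.append(EOS)
    return tokens


# Example: flatten_interleaved(["A cat sitting on a sofa.", some_image], str.split)
```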

Embedding modules encode text tokens and other input modalities into vectors, which are then fed into the decoder. For input tokens, we use a lookup table to map them into embeddings. For continuous-signal modalities (e.g., images and audio), it is also feasible to represent inputs as discrete codes and then treat them as "foreign languages" [WBD+22, WCW+23]. In this work, we follow [HSD+22] and use a vision encoder as the embedding module for input images. Furthermore, a Resampler [ADL+22] is used as an attention-pooling mechanism to reduce the number of image embeddings.
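The following is a small PyTorch sketch, in the spirit of the Resampler [ADL+22], of attention pooling that reduces a variable number of image patch embeddings to a fixed, smaller set of vectors; the dimensions, the single cross-attention layer, and the projection to the language-model width are illustrative assumptions.

```python
# A minimal attention-pooling sketch: a small set of learned query vectors
# cross-attends over the image patch embeddings, reducing e.g. 257 CLIP
# ViT-L/14 tokens to a fixed, smaller number. Sizes are assumptions.

import torch
import torch.nn as nn


class AttentionPooler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, 2048)  # project to the language-model width

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim) from a frozen vision encoder
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(q, image_feats, image_feats)  # cross-attention
        return self.proj(pooled)  # (batch, num_queries, 2048)


# pooled = AttentionPooler()(torch.randn(2, 257, 1024))  # -> (2, 64, 2048)
```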

2.2 Multimodal Large Language Models (MLLMs)

After obtaining the embeddings of the input sequences, we feed them into a Transformer-based decoder. A left-to-right causal model processes sequences in an autoregressive manner, generating the next token by conditioning on past time steps. Causal masks are used to mask future information. A softmax classifier on top of Transformer is used to generate tokens over the vocabulary.

MLLMs serve as general interfaces [HSD+22] that can interact with natural language and multimodal inputs. This framework is flexible to handle various data types as long as we can represent the input as a vector. MLLMs combine the best of both worlds. First, language models naturally inherit the ability to learn and follow instructions in context. Second, the perception is aligned with the language model by training on a multimodal corpus.

The implementation is based on the TorchScale library [MWH+22], which is designed for large-scale model training. Compared to the standard Transformer architecture, we include the following modifications:

MAGNETO We use MAGNETO [WMH+22], a Transformer variant, as the backbone architecture. MAGNETO has better training stability and superior performance on various modalities. It introduces an additional LayerNorm in each sublayer (i.e., multi-head self-attention, and feed-forward networks). This approach fundamentally improves optimization with a theoretically derived initialization method [WMD+22], which allows us to efficiently scale up models without pain.
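As a rough illustration of the sub-LayerNorm idea (our own sketch, not the TorchScale code), the feed-forward sublayer below adds an extra LayerNorm before the output projection in addition to the usual pre-LayerNorm.

```python
# A rough sketch of a "sub-LN" feed-forward sublayer: besides the usual
# pre-LayerNorm, an extra LayerNorm sits just before the output projection.
# Dimensions follow the model sizes reported in Section 3.2.

import torch.nn as nn


class SubLNFeedForward(nn.Module):
    def __init__(self, dim: int = 2048, ffn_dim: int = 8192):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)       # standard pre-LN
        self.fc1 = nn.Linear(dim, ffn_dim)
        self.act = nn.GELU()
        self.norm_mid = nn.LayerNorm(ffn_dim)  # the extra "sub" LayerNorm
        self.fc2 = nn.Linear(ffn_dim, dim)

    def forward(self, x):
        residual = x
        x = self.fc2(self.norm_mid(self.act(self.fc1(self.norm_in(x)))))
        return residual + x
```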

XPOS We employ XPOS [SDP+22] relative position encoding to better model long contexts. The method generalizes better to different lengths, i.e., train on short and test on longer sequences. In addition, XPOS optimizes the attention resolution so that location information can be captured more accurately. In both interpolation and extrapolation settings, the XPOS method is both efficient and effective.

2.3 Training Objectives

KOSMOS-1 is trained on web-scale multimodal corpora, including unimodal data (e.g., a text corpus), cross-modal paired data (e.g., image-caption pairs), and interleaved multimodal data (e.g., documents that arbitrarily interleave images and text). Specifically, we use unimodal data for representation learning. For example, language-modeling pre-training on text data enables instruction following, in-context learning, and various language tasks. Furthermore, cross-modal paired data and interleaved data align the perception of general modalities with language models. Interleaved data is also naturally suited to the multimodal language modeling task. We provide more details on training data collection in Section 3.1.

The training of the model is performed with the task of next token prediction, i.e., learning to generate the next token based on the previous context. The training objective is to maximize the log-likelihood of the tokens in the examples. Note that only discrete tokens, such as text tokens, are counted in the training loss. Multimodal language modeling is a scalable way to train models. More importantly, the emergence of various capabilities makes the training task beneficial for downstream applications.
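A minimal sketch of this objective is shown below: standard next-token prediction where positions occupied by image embeddings are excluded from the cross-entropy loss. Tensor shapes and the ignore-index convention are assumptions.

```python
# A minimal sketch of the next-token-prediction loss where only discrete text
# tokens contribute; positions holding image embeddings are masked out.

import torch.nn.functional as F


def lm_loss(logits, targets, is_text_token):
    """logits: (B, T, V); targets: (B, T) token ids; is_text_token: (B, T) bool."""
    # Shift so position t predicts token t+1, as in standard causal LM training.
    logits = logits[:, :-1]
    targets = targets[:, 1:].clone()
    mask = is_text_token[:, 1:]
    targets[~mask] = -100  # ignored by cross_entropy
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```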

3 Model training

3.1 Multimodal training data

Models are trained on web-scale multimodal corpora. The training dataset consists of a text corpus, image-caption pairs, and interleaved data of images and text.

Text Corpus We train our model using The Pile [GBB+20] and Common Crawl (CC). The Pile is a large English text dataset for training large-scale language models from various sources. We excluded data splits from GitHub, arXiv, Stack Exchange, and PubMed Central. We also include Common Crawl snapshots (2020-50 and 2021-04) datasets, CC-Stories, and RealNews datasets [SPP+19, SPN+22]. The entire dataset has been cleaned of duplicate and near-duplicate documents and filtered out downstream task data. For a detailed description of the training text corpus, see Appendix B.1.1.

Image-caption pairs Image-caption pairs are constructed from several datasets, including English LAION-2B [SBV+22], LAION-400M [SVB+21], COYO-700M [BPK+22], and Conceptual Captions [SDGS18, CSDS21]. English LAION-2B, LAION-400M, and COYO-700M were collected from Common Crawl web data by extracting image sources and the corresponding alt-text. Conceptual Captions is also sourced from Internet pages. More details can be found in Appendix B.1.2.

Interleaved Image-Text Data We collect interleaved multimodal data from Common Crawl snapshots, a publicly available web archive. We use a filtering process to select approximately 71M pages from the original 2B pages in the snapshot. We then extract text and images from the HTML of each selected web page. For each document, we limit the number of images to five to reduce noise and redundancy. We also randomly discard half of the documents that contain only one image, to increase diversity. We provide more details on the data collection process in Appendix B.1.3. Using this corpus enables KOSMOS-1 to handle interleaved text and images and improves its few-shot ability.
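An illustrative sketch of the document-level filtering described above, assuming a simple dictionary representation of documents: cap each document at five images and randomly drop half of the documents that contain only one image.

```python
# An illustrative sketch of the document-level filtering described above.
# The dictionary field names are assumptions.

import random


def filter_interleaved_docs(docs):
    kept = []
    for doc in docs:
        images = doc["images"][:5]              # keep at most five images
        if len(images) == 1 and random.random() < 0.5:
            continue                            # drop half of single-image docs
        kept.append({**doc, "images": images})
    return kept
```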

3.2 Training Settings

The MLLM component has 24 layers, a hidden dimension of 2048, an FFN intermediate size of 8192, and 32 attention heads, for a total of about 1.3 billion parameters. We use MAGNETO's initialization method to ensure training stability. To speed up convergence, image representations are obtained from a pre-trained CLIP ViT-L/14 model with a feature dimension of 1024. During training, images are preprocessed to a resolution of 224×224, and we freeze the parameters of the CLIP model except for the last layer. The total number of parameters of KOSMOS-1 is about 1.6 billion. More details on hyperparameters can be found in Appendix A.

We use a batch size of 1.2 million tokens (0.5 million from the text corpus, 0.5 million from image-caption pairs, and 0.2 million from interleaved data) and train KOSMOS-1 for 300k steps, corresponding to about 360 billion tokens. We use the AdamW optimizer with β = (0.9, 0.98), a weight decay of 0.01, and a dropout rate of 0.1. The learning rate is increased to 2e-4 over the first 375 warmup steps and then linearly decayed to 0 over the remaining training steps. We use SentencePiece [KR18] to tokenize the text. We preprocess the data in the "full sentence" format [LOG+19], where each input sequence is packed with complete sentences sampled continuously from one or more documents.
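For clarity, here is a hedged sketch of the stated optimization recipe in PyTorch: AdamW with β = (0.9, 0.98) and weight decay 0.01, linear warmup to 2e-4 over 375 steps, then linear decay to 0 over the remaining steps of a 300k-step run.

```python
# A hedged sketch of the stated schedule: AdamW with betas=(0.9, 0.98),
# weight decay 0.01, 375-step linear warmup to 2e-4, then linear decay to 0.

import torch
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 300_000, 375, 2e-4


def make_optimizer_and_scheduler(model):
    opt = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
                            betas=(0.9, 0.98), weight_decay=0.01)

    def lr_lambda(step):
        if step < WARMUP_STEPS:
            return step / WARMUP_STEPS
        return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

    return opt, LambdaLR(opt, lr_lambda)
```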

3.3 Language-Only Instruction Tuning

To better align KOSMOS-1 with human instructions, we perform language-only instruction tuning [LHV+23, HSLS22]. Specifically, we continue training the model on instruction data in the format (instruction, input, output). The instruction data is language-only and is mixed with the training corpus. Tuning is performed as language modeling. Note that the instruction and input are not counted in the loss. Section 4.9.1 shows that the improvement in instruction-following ability transfers across modalities.
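A simple sketch of how an (instruction, input, output) example could be packed for language modeling with the loss restricted to the output span; the separator format and tokenizer interface are assumptions.

```python
# A sketch of packing an (instruction, input, output) triple so that only the
# output tokens contribute to the loss. Separators and tokenizer are assumed.

def build_instruction_example(instruction, inp, output, tokenize):
    prompt_ids = tokenize(f"{instruction}\n{inp}\n")
    output_ids = tokenize(output + "</s>")
    input_ids = prompt_ids + output_ids
    # -100 marks positions excluded from the loss (instruction and input).
    labels = [-100] * len(prompt_ids) + output_ids
    return input_ids, labels
```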

We combine Unnatural Instructions [HSLS22] and FLANv2 [LHV+23] as our instruction dataset. Unnatural Instructions is a dataset created by using large language models to generate instructions for various natural language processing tasks. There are 68,478 instruction-input-output triples in its core dataset. FLANv2 is a set of datasets covering various types of language understanding tasks, such as reading comprehension, commonsense reasoning, and closed-book question answering. We randomly select 54k instruction examples from FLANv2 to augment our instruction dataset. Details of training hyperparameter settings are described in the appendix.

4 Evaluation

MLLMs can handle language tasks as well as perception-intensive tasks. We evaluate KOSMOS-1 on various types of tasks as follows:

• Language tasks
  - Language understanding
  - Language generation
  - OCR-free text classification

• Cross-modal transfer
  - Commonsense reasoning

• Nonverbal reasoning
  - IQ test (Raven's Progressive Matrices)

• Perception-language tasks
  - Image captioning
  - Visual question answering
  - Web page question answering

• Vision tasks
  - Zero-shot image classification
  - Zero-shot image classification with descriptions

4.1 Perception-Language Tasks

We assessed the perceptual-linguistic abilities of KOSMOS-1 in a visual-linguistic context. Specifically, we conduct zero-shot and few-shot experiments on two widely used tasks, including image captioning and visual question answering. Image captioning involves generating natural language descriptions of images, while visual question answering aims to answer natural language questions about images.

4.1.1 Evaluation setup

We evaluate caption generation on MS COCO Caption [LMB+14] and Flickr30k [YLHH14]. We use the COCO Karpathy split [KFF17], which re-partitions the train2014 and val2014 images [LMB+14] into 113,287 training, 5,000 validation, and 5,000 test images. For Flickr30k, we evaluate on the Karpathy-split test set. The image resolution is 224×224. We generate captions using beam search with a beam size of 5. In the few-shot setting, we randomly sample examples from the training set. We use COCOEvalCap to compute CIDEr [VLZP15] and SPICE [AFJG16] scores as evaluation metrics. We prompt KOSMOS-1 with "An image of" for the zero-shot and few-shot caption generation experiments.

For the visual question answering task, we evaluate zero-shot and few-shot results on the test-dev set of VQAv2 [GKSS+17] and the test-dev set of VizWiz [GLS+18], respectively. The image resolution is 224×224. We use greedy decoding. When computing VQA accuracy, we follow the normalization rules of the VQAv2 evaluation code. We evaluate VQA performance in an open-ended setting, where KOSMOS-1 generates answers and stops at the </s> ("end of sequence") token. The prompt for the visual question answering task is "Question: {question} Answer: {answer}".
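For reference, here is a simplified sketch of the standard VQA accuracy metric (the official VQAv2 evaluation code additionally normalizes punctuation and articles and averages over annotator subsets): a predicted answer scores min(#matching annotators / 3, 1).

```python
# A simplified sketch of VQA accuracy; the official evaluation code performs
# additional answer normalization and averaging over annotator subsets.

def vqa_accuracy(predicted: str, human_answers: list) -> float:
    pred = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)
```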

4.1.2 Results

Image Captioning: Table 2 shows the zero-shot captioning performance on the COCO Karpathy test split and the Flickr30k test set. KOSMOS-1 achieves remarkable results in the zero-shot setting on both image captioning datasets. Specifically, our model achieves a CIDEr score of 67.1 on the Flickr30k dataset, compared to 60.6 and 61.5 for the Flamingo-3B and Flamingo-9B models, respectively. Notably, our model achieves this with only 1.6B parameters, whereas the Flamingo models are much larger. This demonstrates the strength of our model in zero-shot image captioning.

Table 2: Zero-shot image captioning results on the COCO captioning Karpathy test split and the Flickr30k test set. ∗ Flamingo [ADL+22] prompts with two examples from the downstream task while removing their corresponding images (i.e., similar to few-shot text prompting). The other models do not include any examples in the prompt.

Table 3: Few-shot image captioning results on COCO captioning Karpathy test and Flickr30k test. CIDEr scores are reported.

Table 4: Zero-shot visual question answering results on VQAv2 and VizWiz. We report VQA accuracy scores. "∗": Flamingo [ADL+22] constructs zero-shot prompts using two examples from the downstream task, with the corresponding images removed (i.e., similar to few-shot text prompting), while the other models are evaluated in a true zero-shot setting.

Table 5: Results of few-shot visual question answering on VQAv2 and VizWiz. VQA accuracy scores are reported.

Figure 4: Top: an example from the Raven IQ test. Bottom: evaluation of KOSMOS-1 on the Raven IQ test. The input prompt consists of the flattened image matrix and verbal instructions. We append each candidate image to the prompt separately and query the model whether it is correct. The final prediction is the candidate for which the model assigns the highest probability to "yes".

Visual Question Answering: Table 4 reports the zero-shot visual question answering results on VQAv2 and VizWiz. We show that KOSMOS-1 can better handle the diversity and complexity of the VizWiz dataset. KOSMOS-1 achieved higher accuracy and robustness than the Flamingo-3B and Flamingo-9B models. Furthermore, our model is competitive with Flamingo on the VQAv2 dataset.

4.2 IQ Test: Nonverbal Reasoning

Raven's Progressive Matrices [CJS90, JR03] is one of the most commonly used tests to assess nonverbal reasoning ability. Nonverbal reasoning ability usually reflects a person's intelligence quotient (IQ). Figure 4 shows an example. Given eight images in a 3 × 3 matrix, the task is to determine the next element from six similar candidates.

The task requires the model to perform zero-shot nonverbal reasoning without explicit fine-tuning. The Raven IQ test is analogous to in-context learning for language models, the difference being whether the context is nonverbal or verbal. To infer the answer, the model has to recognize abstract concepts and identify the underlying patterns across the given images. Therefore, the IQ task is a good testbed for measuring nonverbal in-context learning ability.

4.2.1 Evaluation settings

To evaluate the performance of KOSMOS-1 on zero-shot nonverbal reasoning, we constructed a Raven IQ test dataset. It consists of 50 examples collected from different websites. Each example presents three (i.e., a 2×2 matrix), four, or eight (i.e., a 3×3 matrix) images. The goal is to predict the next one. Each instance has six candidate images, with a unique correct completion. We evaluate models by their accuracy scores. The evaluation dataset is available at .

Figure 4 shows how KOSMOS-1 is evaluated on the Raven IQ test. The matrix images are flattened and fed into the model one by one. To help the model better understand the desired task, we also use the textual instructions "Here are three/four/eight images:", "The following image is:", and "Is it correct?" for conditioning. We append each possible candidate to the context separately and compare the probability that the model outputs "yes" in a closed-ended setting. The candidate that yields the largest probability is taken as the prediction.
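A hedged sketch of this candidate-scoring procedure is given below; the `yes_prob` scoring function is an assumed interface standing in for the model's closed-ended probability of answering "yes".

```python
# A hedged sketch of the Raven IQ evaluation: append each candidate image to
# the flattened matrix in turn, and pick the candidate for which the model
# assigns the highest probability to "yes". `yes_prob` is an assumed API.

def predict_raven_answer(model, matrix_images, candidate_images, yes_prob,
                         count_word="eight"):
    """yes_prob(model, images, text) -> P("yes") under the closed-ended query."""
    text = (f"Here are {count_word} images: "
            "The following image is: Is it correct?")
    best, best_score = None, float("-inf")
    for idx, cand in enumerate(candidate_images):
        score = yes_prob(model, matrix_images + [cand], text)
        if score > best_score:
            best, best_score = idx, score
    return best
```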

4.2.2 Results

Table 6 shows the evaluation results on the IQ test dataset. KOSMOS-1 improves over the random baseline by 5.3% without language-only instruction tuning and by 9.3% with it. The results show that KOSMOS-1 is able to perceive abstract conceptual patterns in a nonverbal context and then deduce the next element among multiple choices. To the best of our knowledge, this is the first time a model has been evaluated on such a zero-shot Raven IQ test. Although there is still a large performance gap between the current model and the average adult, KOSMOS-1 demonstrates the potential of MLLMs to perform zero-shot nonverbal reasoning by aligning perception with language models.

Table 6: Zero-shot generalization on the Raven IQ test.

4.3 OCR-Free Language Understanding

OCR-free language understanding is a task that focuses on understanding text rendered in images without relying on optical character recognition (OCR). For example, in the Rendered SST-2 task, sentences from the Stanford Sentiment Treebank [SPW+13] dataset are rendered as images. The model is asked to predict the sentiment of the text in an image. This task evaluates the model's ability to read and understand the meaning of words and sentences directly from images.

4.3.1 Evaluation settings

We evaluate OCR-free language understanding on the Rendered SST-2 [RKH+21] test set and the HatefulMemes [KFM+20] validation set. We use accuracy as the metric for Rendered SST-2 and report ROC AUC for the HatefulMemes dataset. For Rendered SST-2, we use the prompt "Question: What is the sentiment of this opinion? Answer: {answer}", where the answer is either positive or negative. For the HatefulMemes task, the prompt is "Question: Does this image contain genuine hate speech? Answer: {answer}", where the answer is either yes or no.

4.3.2 Results

As shown in Table 7, KOSMOS-1 achieves a ROC AUC of 63.9% on the HatefulMemes validation set and a test accuracy of 67.1% on Rendered SST-2. It surpasses CLIP ViT-L and Flamingo-9B, which achieve 63.3% and 57.0% AUC on the HatefulMemes task, respectively. Note that Flamingo explicitly feeds OCR text into the prompt, while KOSMOS-1 does not access any external tools or resources. This suggests that KOSMOS-1 has a built-in ability to read and understand the text in rendered images.

Table 7: Zero-shot generalization for language understanding without OCR. We report the accuracy score.

4.4 Web page question answering

Web question answering aims to find answers to questions from web pages. This requires the model to understand the semantics and structure of the text. The structure of web pages, such as tables, lists, and HTML layout, plays a key role in the arrangement and display of information. This task helps us evaluate the model's ability to understand the semantics and structure of web pages.

4.4.1 Evaluation settings

We compare performance on the Web-based Structured Reading Comprehension (WebSRC) dataset [CZC+21]. For comparison, we train a language model (LLM) on the same text corpus and with the same training settings as KOSMOS-1. The LLM takes the text extracted from the web page as input. Its prompt template is "According to the following web page background, extract the answer from the given text like this: Question: Who is the publisher of this book? Answer: Penguin Books Ltd. Background: {WebText} Q: {Question} A: {Answer}", where {WebText} denotes the text extracted from the web page. KOSMOS-1 uses the same prompt, with the image additionally prepended to it. Two example images from WebSRC are shown in Appendix C.3. Following the original paper [CZC+21], we use exact match (EM) and F1 score as evaluation metrics.

4.4.2 Results

The experimental results are summarized in Table 8. We observe that KOSMOS-1 outperforms the LLM, suggesting that KOSMOS-1 can benefit from the layout and style information in web page images. In addition, we evaluate the performance of KOSMOS-1 when the extracted text is also provided in the prompt. The results show that the extracted text contributes +12.0/+20.7 EM/F1 to KOSMOS-1, indicating that the benefit gained from modeling images does not come at the cost of its language capabilities.

Table 8: Zero-shot performance on the WebSRC task. We report exact match (EM) and F1 scores.

4.5 Multimodal Chain-of-Thought Prompting

Chain-of-thought prompting [WWS+22] allows large language models to generate a series of reasoning steps and decompose a multi-step problem into intermediate steps, which can significantly improve performance on complex tasks. Inspired by chain-of-thought prompting, we study multimodal chain-of-thought prompting with KOSMOS-1. As shown in Figure 5, we decompose a perception-language task into two steps. In the first stage, given an image, we use a prompt to guide the model to generate a rationale. In the second stage, the model is given the rationale together with a task-aware prompt to produce the final result.

Figure 5: Multimodal chain-of-thought prompting enables KOSMOS-1 to first generate a rationale and then tackle complex question-answering and reasoning tasks.

4.5.1 Evaluation settings

We evaluate multimodal chain-of-thought prompting on Rendered SST-2. We use the prompt "Details about this image:" to generate the image content as the rationale. We then predict the sentiment using the prompt "{rationale} Question: What is the sentiment of this opinion? Answer: {answer}", where the answer is either positive or negative.
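The two-stage procedure can be sketched as follows; `generate` stands in for the model's text-generation interface (an assumption), and the prompts are the ones quoted above.

```python
# A two-stage sketch of multimodal chain-of-thought prompting on Rendered
# SST-2. `generate(model, image, prompt) -> str` is an assumed interface.

def multimodal_cot_sentiment(model, image, generate):
    # Stage 1: elicit a rationale describing the image content.
    rationale = generate(model, image, "Details about this image:")
    # Stage 2: condition on the rationale to answer the actual task.
    prompt = f"{rationale} Question: What is the sentiment of this opinion? Answer:"
    return generate(model, image, prompt)  # expected: "positive" or "negative"
```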

4.5.2 Results

We conduct experiments to evaluate the performance of multimodal chain-of-thought prompting. Table 9 shows that multimodal chain-of-thought prompting achieves a score of 72.9, which is 5.8 points higher than standard prompting. By generating intermediate content, the model can recognize the text in the image and infer the sentiment of the sentence more accurately.

4.6 Zero Shot Image Classification

We report the performance of zero-shot image classification on ImageNet [DDS+09]. Image classification understands the entire image as a whole and aims to assign a label to the image. We map each label to a category name in natural language. The model is prompted to predict class names for zero-shot image classification.

Table 9: Multimodal chain-of-thought (CoT) prompting on the Rendered SST-2 task.

Figure 6: Verbal descriptions in context can help KOSMOS-1 better recognize visual categories.

4.6.1 Evaluation settings

Given an input image, we concatenate the image with the prompt "The photo of the". The input is then fed into the model to get the category name of the image. We evaluate the model on ImageNet [DDS+09], which contains 1.28M training images and 50k validation images, with a total of 1k object categories. A prediction is considered correct if it is exactly the same as the ground truth class name. The image resolution used for evaluation is 224×224. We use beam search to generate class names with a beam size of 2.
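As a hedged illustration of the restricted setting, the sketch below scores every ImageNet class name by its log-likelihood given the image and the prompt and picks the best; this is a simplification of the constrained beam-search decoding (beam size 2) described above, and `sequence_logprob` is an assumed interface.

```python
# A hedged sketch of restricted zero-shot classification: limit candidates to
# the 1k ImageNet class names and score each by its log-likelihood. This is a
# simplification of constrained beam search; `sequence_logprob` is assumed.

def classify_restricted(model, image, class_names, sequence_logprob):
    """sequence_logprob(model, image, prompt, continuation) -> float."""
    prompt = "The photo of the"
    scores = {name: sequence_logprob(model, image, prompt, " " + name)
              for name in class_names}
    return max(scores, key=scores.get)
```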

4.6.2 Results

As shown in Table 10, we report zero-shot results under restricted and unrestricted settings. The difference between the two settings is whether we use 1k object class names to limit the decoding. KOSMOS-1 significantly outperforms GIT [WYH+22] by 4.6% under the restricted setting and 2.1% under the unrestricted setting.

Table 10: Zero-shot image classification on ImageNet. For constrained results, we use 1k ImageNet object category names for constrained decoding. We report the top-1 accuracy score.

4.7 Zero-Shot Image Classification with Descriptions

As mentioned above, a standard approach to image classification is to prompt the model with specific names for the objects depicted in the images. However, there are also some classification rules customized for different users and scenarios, such as fine-grained classification of complex animal subspecies. We can leverage natural language descriptions to guide KOSMOS-1 to differentiate images in the zero-shot setting, which makes the decision-making process more interpretable.

Table 11: Detailed descriptions of the different categories for in-context image classification.

4.7.1 Evaluation settings

Following CUB [WBW+11], we construct a bird classification dataset containing images and natural language descriptions of categories. The dataset has three sets of binary image classifications. Each group contains two animal classes with similar appearance. Our goal is to classify images according to the description of the class. Table 11 shows the data samples. The first set is from [WBW+11] and the other two sets are from the website. Each category contains twenty images.

The evaluation procedure is shown in Figure 6. For the zero-shot setup, we provided two category-specific detailed descriptions and used the template "Question: what is the name of {general category} in the picture? Answer:" to prompt the model for the name of a specific category in an open-ended fashion. To evaluate the effectiveness of providing verbal descriptions in context, we also implement a zero-shot baseline with no prompted descriptions. Instead, we provide corresponding specific names in the prompt.

4.7.2 Results

The evaluation results are shown in Table 12. We observe that providing descriptions in context can significantly improve image classification accuracy. The consistent improvements show that KOSMOS-1 can perceive the intent of instructions and that concepts in the language modality are well aligned with visual features in the vision modality.

Table 12: Results of zero-shot image classification with and without verbal description

4.8 Language tasks

The model is evaluated on a language task given a task instruction (i.e., zero-shot) or a few demonstration examples (i.e., few-shot). Text input is fed directly to the model, as in normal language models.

4.8.1 Evaluation settings

We train a language model (LLM) baseline using the same text corpus and training settings. We evaluate KOSMOS-1 and the LLM baseline on eight language tasks, including cloze and completion tasks (e.g., StoryCloze, HellaSwag), Winograd-style tasks (e.g., Winograd, Winogrande), commonsense reasoning (e.g., PIQA), and three datasets from SuperGLUE [WPN+19]: BoolQ, CB, and COPA. A detailed description of these datasets is provided in Appendix C.2. We experiment with zero-shot and few-shot settings, randomly sampling demonstrations for each test example from the training set. We set the number of shots to 0, 1, and 4.

4.8.2 Results

Table 13 presents the in-context learning performance on language tasks. Compared with the LLM, KOSMOS-1 achieves comparable or even better performance on cloze/completion and commonsense reasoning tasks. In terms of results averaged across all these datasets, the LLM performs better in the zero-shot and one-shot settings, while our model performs better in the few-shot (k = 4) setting. The results show that KOSMOS-1 also handles language-only tasks well and achieves strong performance across datasets. Furthermore, Section 4.9.2 shows that the MLLM learns visual commonsense knowledge better than the LLM.

Table 13: Performance comparison between KOSMOS-1 and LLM on language tasks. We reimplemented a language model using the same text data and training settings. For a fair comparison, neither model uses instruction tuning.

4.9 Cross-modal transfer

Cross-modal transfer capability allows models to learn from one modality (e.g., text, image, audio, etc.) and transfer knowledge to other modalities. This skill enables the model to perform various tasks between different modalities. In this section, we evaluate the cross-modal transfer capability of KOSMOS-1 on several benchmarks.

4.9.1 Transfer from Language to Multimodality: Language-Only Instruction Tuning

To evaluate the effect of language-only instruction tuning, we conduct an ablation study using four datasets: COCO, Flickr30k, VQAv2, and VizWiz. These datasets cover image captioning and visual question answering. The evaluation metrics are CIDEr scores for COCO/Flickr30k and VQA accuracy for VQAv2/VizWiz.

Table 14 shows the experimental results. Language-only instruction tuning boosts our model's performance by 1.9 points on Flickr30k, 4.3 points on VQAv2, and 1.3 points on VizWiz. Our experiments show that language-only instruction tuning transfers its benefits across modalities.

Table 14: Ablation study on language-only instruction tuning. We report CIDEr scores for COCO and Flickr30k, and VQA accuracy scores for VQAv2 and VizWiz.

4.9.2 Transferring from Multimodality to Language: Visual Commonsense Reasoning

Visual commonsense reasoning tasks require understanding properties of everyday objects in the real world, such as color, size, and shape. These tasks are challenging for language models because they may require more information about object properties than is available in the text. To investigate visual commonsense capabilities, we compare the zero-shot performance of KOSMOS-1 and LLM on a visual commonsense inference task.

Evaluation Settings We compare KOSMOS-1 and the LLM baseline on three object commonsense reasoning datasets: RELATIVESIZE [BHCF16], MEMORYCOLOR [NHJ21], and COLORTERMS [BBBT12]. Table 15 shows some examples of the object size and color reasoning tasks. RELATIVESIZE contains 486 object pairs from 41 physical objects; the model needs to predict the size relationship between two objects in a binary question-answering format with "yes"/"no" answers. MEMORYCOLOR and COLORTERMS require the model to predict the color of an object from a set of 11 color labels in a multiple-choice format. We use only text as input and do not include any images. We measure the accuracy of the models on these three datasets.

Table 15: Evaluation examples on object size and color reasoning

Results Table 16 presents the zero-shot performance of KOSMOS-1 and the LLM on the visual commonsense reasoning tasks. KOSMOS-1 outperforms the LLM by 1.5% on RELATIVESIZE, 14.7% on MEMORYCOLOR, and 9.7% on COLORTERMS. The consistent improvements show that KOSMOS-1 benefits from visual knowledge to perform the corresponding visual commonsense reasoning. The reason for KOSMOS-1's superior performance is its modality-transfer capability, which enables the model to transfer visual knowledge to language tasks. In contrast, the LLM has to rely on textual knowledge and cues to answer visual commonsense questions, which limits its ability to reason about object attributes.

Table 16: Zero-shot visual commonsense inference on the RELATIVESIZE, MEMORYCOLOR, and COLORTERMS datasets. We report the accuracy score

5 Conclusion

In this work, we introduce KOSMOS-1, a multimodal large language model capable of perceiving general modalities, following instructions, and performing in-context learning. The model, trained on a web-scale multimodal corpus, achieves promising results on a wide range of language and multimodal tasks. We show that moving from LLMs to MLLMs unlocks new capabilities and opportunities. In the future, we would like to scale up KOSMOS-1 in terms of model size [MWH+22, WMH+22, CDH+22] and integrate speech [WCW+23] capability into KOSMOS-1. Furthermore, KOSMOS-1 can serve as a unified interface for multimodal learning, for example, enabling instructions and examples to control text-to-image generation.
