Kunlun Wanwei's Tiangong large model tops the multimodal leaderboard

The Kunlun Wanwei Tiangong large model ranked first by overall score in a multimodal large language model (MLLM) evaluation conducted by Tencent Youtu Lab and Xiamen University. According to the company's announcement, "this marks that the Tiangong large model has reached a world-leading level in multimodality, and will strongly support key breakthroughs across the company's AI business matrix going forward."

Tencent Youtu Lab and Xiamen University carried out the first comprehensive quantitative evaluation of MLLMs worldwide on the newly released MME benchmark and published 16 rankings: two overall leaderboards covering perception and cognition, plus 14 sub-leaderboards. MME evaluates multimodal large language models on 14 subtasks spanning perceptual and cognitive abilities. Skywork-MM, the model from Kunlun Wanwei's Tiangong multimodal team, ranked first on the combined list, placing first on the perception leaderboard and second on the cognition leaderboard.

Ranked first on the perception leaderboard

Ranked second on the cognition leaderboard
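
For context on how these leaderboard numbers are produced, here is a minimal sketch of MME-style scoring, assuming the benchmark's published protocol in which each image carries two yes/no questions and each subtask reports per-question accuracy (acc) plus per-image strict accuracy (acc+, both questions correct); the toy data below is hypothetical.

```python
# Minimal sketch of MME-style subtask scoring; assumes the published
# protocol (two yes/no questions per image). Toy data is hypothetical.
from collections import defaultdict

def mme_subtask_score(results):
    """results: list of (image_id, correct) pairs, two entries per image."""
    per_image = defaultdict(list)
    for image_id, correct in results:
        per_image[image_id].append(correct)

    total_questions = sum(len(v) for v in per_image.values())
    acc = 100.0 * sum(c for v in per_image.values() for c in v) / total_questions
    acc_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
    return acc + acc_plus  # subtask score, max 200

results = [("img1", True), ("img1", True), ("img2", True), ("img2", False)]
print(mme_subtask_score(results))  # 75.0 acc + 50.0 acc+ = 125.0
```

Summing the per-subtask scores over the perception and cognition subtask groups yields the two overall leaderboard totals.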

The latest paper from Kunlun Tiangong's multimodal team notes that, on the data side, the team constructed more diverse instruction fine-tuning data to combat hallucination: the data strengthens the model's understanding of image features and improves the multimodal language model's instruction-following ability, thereby reducing hallucinated outputs. Skywork-MM shows a marked improvement in reducing hallucinations.
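
As a purely hypothetical illustration of what one such instruction fine-tuning record might look like (the field names and content are invented, not Skywork-MM's actual schema), the idea is that answers grounded in visible image content teach the model to follow instructions without inventing details:

```python
# Hypothetical multimodal instruction-tuning record; the schema is invented
# for illustration and is not Skywork-MM's actual data format.
sample = {
    "image": "scenes/street_food_stall.jpg",  # made-up path
    "instruction": "How many people are waiting in line at the stall?",
    # The answer describes only what is visible, discouraging hallucination.
    "answer": "Three people are standing in line; a fourth is walking away.",
}
```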

Skywork-MM also improves Chinese instruction following and the recognition of Chinese-specific scenes through careful data composition, mitigating the effect of cultural bias on multimodal understanding. For example, existing large models struggle to accurately recognize the TV show "If You Are the One", a typically Chinese scene, whereas Skywork-MM recognizes such Chinese scenes reliably.

On the model side, the team freezes both the visual model and the large language model completely, so that the visual features learned during CLIP pre-training are preserved and the language ability of the LLM is not degraded. To better associate visual features with language features, the overall model adds a learnable visual feature sampler and a LoRA adapter for the language model. Training proceeds in two stages: the first stage uses large-scale bilingual image-text pair data to learn associations between image concepts and language concepts; the second stage uses multimodal fine-tuning data for instruction tuning.
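
A minimal PyTorch sketch of this setup, assuming standard building blocks: the visual encoder and the LLM stay frozen, while a learnable query-based feature sampler and LoRA adapters hold all of the trainable parameters. Class names and dimensions here are illustrative, not Skywork-MM's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)         # keep the LLM weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class VisualSampler(nn.Module):
    """Learnable queries cross-attend to frozen visual features and emit a
    fixed number of tokens in the language model's embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096,
                 num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8,
                                          kdim=vis_dim, vdim=vis_dim,
                                          batch_first=True)

    def forward(self, vis_feats):                    # (B, N, vis_dim)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.attn(q, vis_feats, vis_feats)  # (B, queries, llm_dim)
        return out

# Stage 1 would train the sampler on image-text pairs; stage 2 would add
# LoRA-based instruction tuning. In both stages only these modules train:
def trainable_parameters(model: nn.Module):
    return [p for p in model.parameters() if p.requires_grad]
```

Because gradients flow only through the sampler and the LoRA layers, the CLIP-pretrained visual features and the LLM's language ability are preserved, as the paper's design intends.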

In addition, Skywork-MM uses comparatively little image-text pair data (about 50M pairs), far less than the more than 100M pairs used by other existing MLLMs.

Source: www.oschina.net/news/257052