The second wave of Alibaba Cloud Tongyi Qianwen open source! The large vision language model Qwen-VL launches on the ModelScope community

The second wave of Tongyi Qianwen open source is here. On August 25, Alibaba Cloud released Qwen-VL, a large vision language model, and open-sourced it directly in one step. Qwen-VL is built on the 7-billion-parameter Tongyi Qianwen model Qwen-7B as its base language model; it accepts both image and text input and can understand multi-modal information. In mainstream multi-modal task benchmarks and multi-modal chat evaluations, Qwen-VL far outperforms generalist models of the same scale.

Qwen-VL is a Vision Language (VL) model that supports multiple languages, including Chinese and English. Compared with earlier VL models, Qwen-VL not only offers basic image recognition, description, question answering, and dialogue, but also adds new capabilities such as visual grounding and understanding of text within images.

Multimodality is one of the key directions in the technological evolution toward general artificial intelligence. The industry widely believes that moving from single-sense language models that accept only text to multi-modal models with "all five senses" that take text, images, audio, and other inputs opens the possibility of a major leap in the intelligence of large models. Multimodality improves a large model's understanding of the world and greatly expands its range of application scenarios.

Vision is humans' primary sense, and it is also the first multi-modal capability researchers want to give large models. Following its earlier M6 and OFA series of multi-modal models, the Alibaba Cloud Tongyi Qianwen team has open-sourced Qwen-VL, a Large Vision Language Model (LVLM) based on Qwen-7B. Qwen-VL and its visual AI assistant Qwen-VL-Chat are now available on the ModelScope community; both are open source, free, and available for commercial use.

Users can download the models directly from the ModelScope community, or access and call Qwen-VL and Qwen-VL-Chat through Alibaba Cloud's DashScope (Lingji) platform.

Qwen-VL can be used in scenarios such as knowledge Q&A, image captioning, image question answering, document question answering, and fine-grained visual grounding.

For example, suppose a foreign tourist who does not read Chinese goes to a hospital and cannot find the right department. He photographs the floor directory and asks Qwen-VL "Which floor is the Orthopedics Department on?" and "Which floor is the Otolaryngology Department on?", and Qwen-VL answers in text based on the image; this is image question answering. Or, given a photo of the Bund in Shanghai, ask Qwen-VL to find the Oriental Pearl Tower, and it can accurately mark the corresponding building with a detection box; this is visual grounding.
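To make the example concrete, here is a minimal sketch of image question answering and visual grounding with Qwen-VL-Chat via the Hugging Face transformers library, following the trust_remote_code usage pattern documented in the official repository. The image URLs and prompts are placeholders, and method names such as from_list_format and chat come from the model's custom remote code, so check the model card for the exact interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Image question answering: the question is answered from the picture content.
query = tokenizer.from_list_format([
    {"image": "https://example.com/hospital_floor_guide.jpg"},  # placeholder image URL
    {"text": "Which floor is the Otolaryngology Department on?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Visual grounding: ask the model to locate an object; the reply embeds the
# bounding box as <ref>...</ref><box>(x1,y1),(x2,y2)</box> text tags.
query = tokenizer.from_list_format([
    {"image": "https://example.com/shanghai_bund.jpg"},  # placeholder image URL
    {"text": "Find the Oriental Pearl Tower in the picture."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

Because the grounding result is returned as coordinate tags in plain text, downstream code can parse the box and draw it on the original image.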

Qwen-VL is the industry's first general-purpose model to support open-domain grounding in Chinese. Open-domain visual grounding determines how precisely a large model can "see", that is, whether it can accurately locate what it is asked to find in an image, and it is crucial for applying VL models in real scenarios such as robot control.

Qwen-VL uses Qwen-7B as its base language model and adds a visual encoder to the architecture so that the model can accept visual input; its design and training pipeline equip it with fine-grained perception and understanding of visual signals. Qwen-VL supports image input at 448x448 resolution, whereas previous open-source LVLMs usually handled only 224x224. On top of Qwen-VL, the Tongyi Qianwen team used an alignment mechanism to build the LLM-based visual AI assistant Qwen-VL-Chat, which lets developers quickly build dialogue applications with multi-modal capabilities, as in the sketch below.
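As a rough illustration of building such a dialogue application, the sketch below loads Qwen-VL-Chat from the ModelScope community and runs a two-turn multi-modal conversation. The model ID, image URL, and prompts are illustrative assumptions; consult the ModelScope model card for the authoritative usage.

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_id = "qwen/Qwen-VL-Chat"  # assumed ModelScope model ID; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Turn 1: describe an image.
query = tokenizer.from_list_format([
    {"image": "https://example.com/street_scene.jpg"},  # placeholder image URL
    {"text": "Describe this picture."},
])
response, history = model.chat(tokenizer, query=query, history=None)

# Turn 2: a follow-up question that depends on the dialogue history, which is
# how a multi-modal assistant application keeps context across turns.
response, history = model.chat(
    tokenizer, query="What is the tallest building in it?", history=history
)
print(response)
```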

In standard English benchmarks covering four types of multi-modal tasks (zero-shot captioning, VQA, DocVQA, and grounding), Qwen-VL achieved the best results among open-source LVLMs of the same size. To test multi-modal dialogue ability, the Tongyi Qianwen team built TouchStone, a test set scored with a GPT-4-based mechanism, and compared Qwen-VL-Chat against other models; Qwen-VL-Chat achieved the best open-source LVLM results in both the Chinese and English alignment evaluations.

In early August, Alibaba Cloud open-sourced the Tongyi Qianwen 7-billion-parameter general model Qwen-7B and the dialogue model Qwen-7B-Chat, becoming the first major Chinese technology company to join the large-model open-source movement. The Tongyi Qianwen open-source models attracted wide attention as soon as they launched: they made the Hugging Face trending list that week, earned more than 3,400 stars on GitHub in less than a month, and have been downloaded more than 400,000 times in total.

Open source address:

ModelScope community:

Qwen-VL     Tongyi Qianwen-VL (pre-trained model)

Qwen-VL-Chat     Tongyi Qianwen-VL-Chat

Model demo: Tongyi Qianwen multi-modal dialogue Demo

HuggingFace

Qwen-VL   Qwen/Qwen-VL · Hugging Face

Qwen-VL-Chat   Qwen/Qwen-VL-Chat · Hugging Face

GitHub

GitHub - QwenLM/Qwen-VL: The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Technical paper address:

https://arxiv.org/abs/2308.12966

Source: https://blog.csdn.net/GZZN2019/article/details/132491824