High-quality multimodal corpus "Scholar Wanjuan" released as open source

Following the establishment of the "Large Model Corpus Data Alliance" (hereinafter the "corpus data alliance") at the 2023 World Artificial Intelligence Conference in July this year, the Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory) announced on August 14 that, together with corpus data alliance members such as China Central Radio and Television Station, People's Daily Online, the National Meteorological Center, the China Institute of Scientific and Technological Information, Shanghai Press Industry Group, and Shanghai Media Group, it has jointly released the "Scholar Wanjuan" 1.0 multimodal pre-training corpus.

"Scholar Wanjuan" 1.0 currently includes three parts: text data set, graphic data set, and video data set. The total amount of open source data exceeds 2TB. Combining the rich content accumulation of members of the Corpus Data Alliance and the leading data processing capabilities of the Shanghai AI Lab, "Scholar Wanjuan" will provide academia and industry with high-quality large-scale multimodal models that are more aligned with mainstream Chinese values Pre-training corpus.

"Scholar Wan Juan" link: https://opendatalab.org.cn/WanJuan1.0

On July 6 this year, the Shanghai AI Lab officially released the newly upgraded "Scholar" general large model system, which includes three major models: the Scholar multimodal large model, the Scholar Puyu language large model, and the Scholar Sky realistic 3D large model. Among them, the high-quality language model InternLM-7B has been open-sourced; its performance leads Llama-2-7B on many mainstream evaluations, and it comes with a full-chain open system covering data, training, and evaluation. The Scholar Puyu open source system also provides free commercial licenses for enterprises, lowering the threshold for large model applications and fully empowering the industry.

"Scholar Wanjuan" has been used for the pre-training of the large-scale model of the scholar, and its open source release will further reduce the threshold for large-scale model technology exploration and implementation.

It is understood that OpenDataLab, the main team behind "Scholar Wanjuan", aims to build a super-large-scale, high-quality, multimodal open data service platform for artificial intelligence developers and is committed to building the infrastructure for domestic open data resources. The platform currently hosts 5,500 shared multimodal datasets, covering more than 1 trillion tokens of text corpus, 6 billion images, 800 million video clips, and 1 million 3D models.

Source: www.oschina.net/news/253784