The China Large Model Corpus Data Alliance welcomes 9 new members and releases its second batch of open corpus data

To improve the supply of corpus data, promote the high-quality development of the large model industry, and accelerate application innovation and industry adoption, the first Data Talk·Open Day event, hosted by the China Large Model Corpus Data Alliance (hereinafter referred to as the "Corpus Data Alliance"), was held at the Shanghai Artificial Intelligence Laboratory on September 8.

Nine new member units have joined the "China Large Model Corpus Data Alliance": China Patent Technology Development Corporation, Shanghai Arbitration Commission, Shanghai Library (Shanghai Institute of Science and Technology Information), Shanghai Data Exchange, Shanghai Social Credit Promotion Center, Shanghai Midu Information Technology Co., Ltd., Shanghai Titanium Robot Co., Ltd., East China Normal University Press Co., Ltd., and Shanghai Urban Construction City Operation (Group) Co., Ltd. Alliance members will jointly provide more diverse data-element support for the in-depth development and high-level application of large model technology.

Officials from the Artificial Intelligence Development Division and the Informatization Promotion Division (Big Data Development Division) of the Shanghai Municipal Commission of Economy and Information Technology attended the open day and offered guidance.
 

On behalf of the main sponsoring unit, Wang Yanfeng, Assistant Director of Shanghai Artificial Intelligence Laboratory, shared the current development status and future prospects of the Corpus Data Alliance. He also introduced the OpenDataLab Pushu Artificial Intelligence open data platform and the first batch of multi-modal pre-training corpora released by the alliance: Scholar·Wanjuan 1.0.

Wang Yanfeng, Assistant Director of Shanghai Artificial Intelligence Laboratory

New member units join the "China Large Model Corpus Data Alliance"

Following the release of Scholar·Wanjuan on August 14, the Corpus Data Alliance launched its second batch of open source corpus datasets: Honey Nest·Pollen 1.0. Several other alliance member units have reportedly also prepared open source plans for their corpus data and will gradually enter the release queue.

According to Liu Yidong, Chief Technology Officer of Midu Information, Honey Nest·Pollen 1.0 is based mainly on Internet media data, and its total count has so far exceeded 100 million. The dataset has been used to train the Midu series of large models, which provide intelligent generative services in vertical fields such as government affairs and media, including knowledge Q&A, content generation, automatic generation of analysis reports, and review and polishing of manuscripts.

During the event, Zhang Jian, Deputy General Manager of the Market Development Department of Shanghai Data Exchange, and Sun Hui, CTO of The Paper, each gave a keynote speech, sharing their innovative practices in strengthening the high-quality supply of large model corpus data.

In the future, the Corpus Data Alliance will continue to act as a "circle of friends", pooling resources from all parties and leveraging the strengths of its member units to jointly promote a high-level supply of corpus data and provide data support for the development of large models.

China Large Model Corpus Data Alliance

The alliance was jointly initiated by Shanghai Artificial Intelligence Laboratory and 10 units including China Central Radio and Television, People's Daily Online, National Meteorological Center, China Institute of Scientific and Technological Information, and Shanghai Media Group. To respond to the demand for high-quality, large-scale, safe and trustworthy corpus data resources in the development of large models, and to support large model scientific research and the related industrial ecosystem, the establishment of the Large Model Corpus Data Alliance was announced at the opening ceremony of the World Artificial Intelligence Conference on July 6, 2023. By linking institutions engaged in model training, data supply, academic research, and third-party services, it aims to jointly create multi-knowledge, multi-modal, standardized, high-quality corpus data, to explore a contribution-based and sustainable incentive mechanism, and to build an international and open large model corpus data ecosystem.

To download corpus data and learn more about the Large Model Corpus Data Alliance, please visit: https://opendatalab.com/

Origin blog.csdn.net/OpenDataLab/article/details/132810418