How to Fight the Battle for China's Open Source Large Models? In Three Steps.

Recently, at the open source large model sub-forum of the 2023 Open Atom Global Open Source Summit held in Beijing, I, as the initiator of the large model SIG (Special Interest Group, a kind of working group) under the TOC of the Open Atom Foundation, gave a keynote speech entitled "Open Source Collaboration: The Core Infrastructure of the Next Generation Digital Economy". The official media review of the sub-forum is available at https://mp.weixin.qq.com/s/6eXK-6ztLUBpw2qK61ZsTg .

Here, I would like to expand what I shared at the forum that day into a more detailed blog post, and to call for broader participation.

First of all, why are large language models important?

ChatGPT caught fire, and the heat has not subsided. On November 30, 2022, OpenAI launched the ChatGPT service based on GPT-3.5, delivering an unprecedented shock to the global Internet and technology industries. There has never been a product like it: it reached 100 million active users in just two months, many users are willing to pay for it and use it every day, and new uses keep emerging. Why? In my simple analysis, its popularity comes down to the following two reasons.

  • The first reason is that ChatGPT revealed the capabilities of AI to the general public for the first time through a consumer-facing (toC) application. Since 2017, the well-known AI scientist Andrew Ng, who led Google Brain and Baidu's AI efforts, has been saying that "AI is the new electricity", meaning that AI will completely change how humans work and live. Yet for a long time it was difficult for ordinary people to feel the charm of AI directly. They may have experienced Alipay's face-scan payment on their phones, short-video recommendations on Douyin, or shopping recommendations on JD.com; a great deal of artificial intelligence supports these scenes behind the curtain, but it is not felt directly. ChatGPT, by contrast, provides a dialog box that everyone can use. You only need to type what you want to say, and it responds like an omniscient prodigy, giving all kinds of reasonable (or seemingly reasonable) answers, and performing very professionally at Chinese-English translation, article summarization, seeking advice, and so on. For the first time, ordinary people can feel the capabilities of AI in such a convenient form. Just as electric lights brought electricity into millions of households, ChatGPT shows the capabilities of artificial intelligence to all ordinary people.
  • The second reason is that natural language interaction greatly lowers the threshold of the experience. Progress in interaction can greatly drive progress in technology, products, and industries. Recall the relevant history: when Apple's Steve Jobs released the first iPhone in 2007, it brought a revolutionary change in interaction. People could use their fingers instead of keyboards or styluses to interact with the phone and with all kinds of mobile applications; through tapping, dragging, multi-touch, and other gestures they could easily use applications on the phone. This completely changed the mobile phone industry and fueled the rapid development of the mobile Internet. But for humans, dialogue is an even more natural way to interact than fingers. Getting most work done just by talking to a computer system has long been the dream of many people. Yet previous dialogue tools, such as Apple's Siri and Baidu's Xiaodu, have been out for years, and their understanding of natural language and their interaction logic remain far from people's expectations. This time, with the release of ChatGPT, everyone found that talking to it is not such a painful thing: it understands questions very well, and it answers them logically (even if its answers to some questions are confidently delivered nonsense). From now on, natural language interaction will become the default interface for human-computer interaction, because the interactive experience is more natural.

All this has just begun. ChatGPT is just one application of large models. Beyond it there are many more large-model applications, such as Midjourney's AI image generation, Runway's AI video generation, and the AIGC productivity tools that Adobe, Microsoft, and others are integrating into their products. As large-model technology advances, and human imagination with it, more related AI tools will appear and greatly improve the efficiency of our daily work and life.

But large models are more than just a productivity tool; they may be more important than you think. A large model condenses the knowledge of the world and will completely change how knowledge is created, inherited, and applied; its impact on knowledge is comparable to that of papermaking and printing among the Four Great Inventions. Knowledge is the primary productive force, and a complete change in how knowledge is created, inherited, and applied will have a major impact on industry, agriculture, education, and military affairs across society. That is why I believe the large model is the core infrastructure of the next generation digital economy.

Therefore, China's large-model construction cannot afford to lag behind. But how should this battle be fought?

Let's take a look at the current situation. At present, only China and the United States are truly building large-scale models, but compared with OpenAI, our gap is very obvious:

  • Our models still lag GPT-3.5 by several months, and GPT-4 has already been out for several months on top of that.
  • There is still a big gap between our Chinese datasets and the English datasets in both quantity and quality.
  • Our computing power is still very limited.
  • Our development ecosystem built on top of large models has only just begun.

Therefore, we can only draw on China's tradition of "concentrating efforts to do big things" and pool the manpower and resources of relevant domestic enterprises, universities, and research institutes to build large models together. But what strategy should we choose?

Look at the domestic companies that have entered the large-model arena: a "war of a hundred models" has recently broken out, with Baidu, Alibaba, Huawei, 360, NetEase, SenseTime, Tsinghua's Zhipu, Beijing's Zhiyuan (BAAI), and others all launching their own large-model products. Among them there is actually a lot of repetitive work, including but not limited to the following:

  • Collection and cleaning of Chinese corpora (a minimal cleaning sketch follows this list).
  • Annotation and organization of Chinese instruction-alignment training sets.
  • Alignment with relevant national compliance regulations.
  • Adaptation and optimization for domestic computing power, including scheduling and optimization of training and inference.
  • Optimization of the related training programs, especially the RLHF part.
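
To make the first item concrete, here is a minimal sketch of the kind of corpus-cleaning step each team currently rebuilds on its own: whitespace normalization, length filtering, exact deduplication, and a compliance hook. The thresholds and the `is_compliant` placeholder are illustrative assumptions, not any team's actual pipeline.

```python
import hashlib
import re

def is_compliant(text: str) -> bool:
    """Placeholder for the compliance filtering required by Chinese regulations.
    A real pipeline would apply vetted keyword lists and classifiers here."""
    return True  # assumption: the actual rules would be maintained collaboratively

def clean_corpus(docs):
    """Deduplicate and filter raw web documents (thresholds are illustrative)."""
    seen = set()
    for text in docs:
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
        if len(text) < 50:                         # drop tiny fragments
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                         # exact deduplication
            continue
        seen.add(digest)
        if not is_compliant(text):
            continue
        yield text

cleaned = list(clean_corpus(["例子文本。" * 20, "例子文本。" * 20, "太短"]))
print(len(cleaned))  # 1: the duplicate and the short fragment are removed
```

If each participant contributes filters to one shared pipeline like this instead of writing their own, the cleaning work is done once rather than a hundred times.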

If we can adopt an open source approach and collaborate on these repetitive tasks in a more efficient way, we can avoid reinventing the wheel and provide a better foundation for the innovation ecosystem.

Of course, the large model we need is one that can keep evolving, and one on which a healthy ecosystem can grow (active development, healthy competition, and both technology and business thriving). Combined with the Open Atom Foundation's mission and values, we can only promote this through open source collaboration.

Some readers may wonder: could the various resources (data, manpower, computing power, and so on) be concentrated to support one or two companies or institutions to develop large models rapidly? I think, first of all, this is not in line with the positioning of the Open Atom Foundation; second, I do not believe such a "chosen one" or "appointed" approach can work in the current situation. The risks of operating in a "chosen" manner are too high, including technical risks, team risks, and moral hazards. The technical iteration of large models is very fast. The current mainstream is the decoder-only Transformer architecture, but how many years will that last? Under the demands of multi-modality, and especially multi-modal alignment, it is hard to say whether decoder-only Transformers will remain the best approach. We cannot gamble everything on one technical route; the risk of betting on a single route is too high, because we cannot afford to lose this battle. In addition, if we select one or a few companies or institutions, whether their teams are strong enough and whether they can operate soundly over the long term is an open question; can they really shoulder such a responsibility and deliver step by step? Finally, concentrating computing power resources on one or a few organizations also carries high moral hazard: behind the computing power are large numbers of machines and sky-high financial investment, and behind the datasets is likewise a huge investment of resources, so serious corruption problems may arise under a monopoly. For the above reasons, the Open Atom Foundation can only rely on open source collaboration and long-term effort to promote the construction of large models. (Of course, this does not rule out some institutions using concentrated power to support one or two organizations; once they have dealt with the above risks, they may move faster and see results sooner.)

Now that the route of open source collaboration is settled, let's look at how to operate it.

First, what is the goal of the collaboration?

The key deliverable is one or more open source general-purpose large models, trained on domestic computing power from open source datasets (compliant with relevant domestic regulations) using open source training programs.

Let me briefly break this down and reason backward from the desired outcome.

I predict the industrial landscape around large models over the next few decades as follows:

  1. First, a few companies provide general-purpose large-model services; candidates include Baidu, Alibaba, and others.
  2. Second, many companies provide industry-specific large-model services, covering finance, energy, manufacturing, and other industries.
  3. Finally, hundreds or even thousands of technology companies provide privatized large-model services inside enterprises, for specific scenarios such as knowledge management, software development, and supply chain. Every enterprise will run many large-model services, most of them private deployments inside the enterprise, with a smaller number calling large-model API services on the public network.

So how will an open source general-purpose model support these industrial forms?

  1. Enterprises providing general large-model services can add their own unique competitive features on top of the open source general model, provide better capabilities with some private data, or offer advantages in underlying scheduling and optimization.
  2. Enterprises providing industry large-model services can add industry-specific data on top of the open source general model.
  3. A private large model inside an enterprise can be the open source general model plus the enterprise's own private data.

In every case above, the open source general-purpose large model is the key technical base (a minimal fine-tuning sketch follows).
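
As one concrete illustration of "adding private or industry data on top of an open base model", here is a minimal parameter-efficient fine-tuning sketch using the Hugging Face peft library with LoRA. The base checkpoint, target module names, and hyperparameters are placeholder assumptions, not a recommendation of any particular model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumption: any open source causal LM checkpoint would do as the base.
base = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA attaches small trainable low-rank matrices to attention projections,
# so an industry or in-house team trains only a fraction of the weights.
lora = LoraConfig(
    r=8,                                 # rank of the update matrices (assumed)
    lora_alpha=16,
    target_modules=["query_key_value"],  # module names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

The trained adapter weights are tiny compared to the base model, so an enterprise can keep them private while the common base stays open, which is exactly the pattern described above.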

On top of the open source general model there is also a whole technology stack, including development frameworks, vector databases, and so on, which can likewise be co-built through open source. The development libraries and platforms that support the underlying computing power scheduling and optimization can also be co-built in an open source manner. As a small example of one layer of this stack, a retrieval sketch using a vector index follows.
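
At its core, the vector-database layer comes down to embedding text and searching by similarity. This sketch uses faiss, with random vectors standing in for real embeddings; the dimension and data are arbitrary assumptions.

```python
import faiss
import numpy as np

dim = 768                          # assumed embedding dimension
index = faiss.IndexFlatL2(dim)     # exact L2 search; fine for small corpora

# Stand-ins for document embeddings that a real encoder model would produce.
doc_vectors = np.random.rand(1000, dim).astype("float32")
index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)   # top-5 nearest documents
print(ids[0])  # indices of the most similar documents
```

A production system would swap the flat index for an approximate one and the random vectors for embeddings from a real model, but the interface stays this simple.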

Then we co-build the large models themselves (covering computing power, datasets, and algorithms), the development technology stack above them, and the domestic computing power scheduling and optimization below them. The implementation can proceed in three steps, according to the following plan.

The "three types of datasets and three types of models" in the plan come from my simplification of ChatGPT's training process into three steps (a code sketch of the pipeline follows the list):

1. Take tens of terabytes of corpus from the Internet and run unsupervised learning to obtain a pre-trained model, also known as the Base model;

2. Take tens of thousands of manually labeled instruction examples and run supervised learning to obtain an instruction-tuned model, also known as the SFT model;

3. Take tens of thousands of human preference labels and run reinforcement learning to obtain the final dialogue model, called the Assistant model, also known as the Chat model.
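
To make these three steps concrete in code, here is a minimal sketch using the Hugging Face stack (transformers plus trl, with trl's 2023-era API). The checkpoint name, the query, and the constant reward are placeholder assumptions. Steps 1 and 2 both minimize next-token prediction loss and differ only in data (raw corpus versus instruction pairs), so only the step 3 RLHF update is spelled out.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import respond_to_batch

# Placeholder checkpoint: in practice this would be the SFT model from step 2.
model_name = "gpt2"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(model_name)

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)

# One RLHF iteration: the model answers a prompt, a reward scores the answer,
# and PPO nudges the policy toward higher-reward responses.
query = tokenizer.encode("What is open source collaboration?", return_tensors="pt")
response = respond_to_batch(model, query)

# Real RLHF gets this reward from a reward model trained on human preference
# labels; a constant is used here only to keep the sketch self-contained.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query[0]], [response[0]], reward)
```

Looping this over a large batch of prompts, with a genuine reward model, is essentially what turns the SFT model into the Assistant model.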

The concrete implementation of the three steps is as follows:

  1. Step 1: Gather various open source datasets (with Chinese as a distinctive focus), plus data compliance and cleaning programs (processing and cleaning the original datasets according to China's laws and regulations), and build a domestic model and data hosting service, similar to Hugging Face (a data-loading sketch follows this list).
  2. Step 2: Gather various open source training programs, organize a shared computing power pool, and train various general-purpose open source large models on that pool.
  3. Step 3: Continuously optimize and update the general-purpose models, train models for mobile devices, combine industry data to obtain open source industry models, and so on.
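
For step 1, the hosting service would expose dataset and model downloads much like the Hugging Face Hub. The sketch below uses the `datasets` library with a hypothetical repository name (`open-sig/chinese-web-corpus`) standing in for a compliance-cleaned Chinese corpus; the tokenizer and block length are likewise placeholders.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical repository name: a stand-in for a compliance-cleaned Chinese
# corpus published on a domestic hosting service with a Hub-style API.
corpus = load_dataset("open-sig/chinese-web-corpus", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def tokenize(batch):
    # Truncate documents into fixed-length blocks for next-token pretraining.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_stream = corpus.map(tokenize, batched=True)
# train_stream can then feed a standard causal-LM training loop in step 2.
```

Streaming matters here: tens of terabytes of corpus cannot be downloaded in full before training, so the hosting service and the training programs have to be designed around it.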

The route and steps are as above, but it is easy to talk on paper and hard to implement step by step. Fortunately, the TOC's values at the Foundation are openness, transparency, and pragmatism. I believe that if we take it one step at a time and make every step solid, we will get good long-term results and provide the most fundamental data, algorithms, and models for China's open source large-model technology ecosystem and business ecosystem. I hope the whole process can be transparent and traceable: any company or organization with sufficient financial resources should be able to build a computing power cluster according to the Foundation's documents, download the various datasets and programs, train the three types of large models from scratch (the Base model, the SFT model, and the dialogue model), and then perform various fine-tunes to adapt them to its own scenarios, offer the result as an industry large-model service, or run it as an internal enterprise large-model service.

Building on the TOC's large model SIG, the Open Atom Foundation has established an open source large model working committee to collaborate on three fronts: data sharing, algorithm open-sourcing, and co-construction of public computing power infrastructure.

People from all walks of life are welcome to join the working committee; please contact the Open Atom Foundation (official website: https://www.openatom.org/ ).
