The Hundred-Model War Is in Full Swing, and Open Source AI Draws Its Sword


Liu Tiandong: interview host, Open Source Rainforest consultant, co-founder of Kaiyuanshe, official member of the Apache Software Foundation

Tan Zhongyi: interview guest, official member of the Apache Software Foundation, member of the LF AI & Data TAC, chair of the OpenAtom Open Source Foundation TOC (Technical Oversight Committee), senior architect at 4Paradigm

Is AI open source a mainstream trend?

What is your view of artificial intelligence and open source? What opportunities and challenges will they bring to human society?

Tan Zhongyi: Open source is a model of social collaboration, and AI has been open source from the beginning, because AI originated with professors and PhD students in academia. Their main activity is publishing papers, and they need their research results to be reproducible, so the programs and code behind those results must be open source. From the emergence of AI to the present, open source has therefore always been the default choice. There are some projects that are not open source (for example, OpenAI has not open-sourced GPT-3.5 or GPT-4), but they are a minority.

Looking at the broader trend, many of AI's problems and challenges need to be solved through transparent, large-scale collaboration at a global level, and that can only be done through open source. Open source will therefore inevitably become the mainstream way of working in AI.

Open source large models and multimodality

You previously wrote an article, "How to Fight the Battle of China's Open Source Large Models: Three Steps." Can you briefly summarize its main ideas?

Tan Zhongyi: In that article, I first analyzed why large models are so important. As everyone has seen, ChatGPT was released in November last year, more than half a year ago, and it has become very popular in China. I think there are two main reasons:

1. For the first time, the general public could experience the capabilities of AI through a consumer (To C) application.

As Andrew Ng said, "AI is the new electricity." AI has created huge commercial value inside many companies, such as the advertising system behind Baidu search, the recommendation system of ByteDance's Douyin, and various e-commerce platforms. We call these the "search, advertising, and recommendation" scenarios. Behind them are recommendation models with hundreds of billions of parameters serving traffic, but ordinary users have had relatively little awareness of this.

ChatGPT lets us talk to it in a very natural way. It is like an intelligent butler that can answer all kinds of questions, and this immediately demonstrates the capabilities of AI. Looking back at the age of electricity, it was the electric light invented by Edison that really brought electricity into ordinary households. ChatGPT is like that electric light: it quickly brought a To C AI application to the public, causing it to break through and explode.

2. It introduces a new mode of interaction, and changes in how users interact with computers have always triggered epoch-making shifts.

The first way to interact with computers was the command line. Later, Windows and Mac introduced graphical interfaces, which pushed the PC into ordinary households. Then Apple replaced the keyboard and stylus of traditional mobile phones with the finger as the interaction tool, setting off the smartphone revolution. Now natural language interaction is arriving: we only need to express ourselves in natural language, and the software can understand and act on it. I believe any software that becomes popular after ChatGPT will adapt to this "language user interface" (LUI) style of interaction.

ChatGPT combines these two characteristics, and it has given the AI industry an unprecedented boost.

In fact, I have worked in AI for many years. The AI scenarios with the largest commercial value used to be concentrated in "search, advertising, and recommendation," but overall the field was relatively monotonous, with no major technical breakthrough, until ChatGPT appeared and opened another door. The commercially valuable scenarios we saw before all belonged to what we call decision-making AI, that is, judging yes or no: will the user click, and so on. After large generative models emerged, generative AI ignited the entire market. Within enterprises there is now decision-making AI serving the "search, advertising, and recommendation" scenarios, and there is also new generative AI transforming all To B software to improve efficiency in every scenario.

I think the most powerful thing about large language models is that, through massive training, they condense most of the world's knowledge. GPT-3.5 has about 175 billion parameters; with multimodality added, GPT-4 is said to contain about 2 trillion parameters, and even GPT-4 is not a multimodal large model in the strict sense.

What is multimodal alignment? Why do you say that the current GPT-4 has not yet achieved true multimodal alignment?

Tan Zhongyi: Multimodality means that patterns can be discovered not only from text data, but also from other forms of data such as video and audio. Why do I say the current GPT-4 is not truly multimodal? Because it has not yet achieved multimodal alignment.

Suppose we watch a video. It contains images, speech, subtitles, and so on. If, along the same timeline, we connect the images, speech, and text together and discover patterns from them jointly, that is what we call "multimodal alignment." For a large model at that level, I think 2 trillion parameters are not enough; it may need to be multiplied by 100 or even 1,000. Such a large model will become a vast, comprehensive encyclopedia, and future learning and education will basically interact with it. Knowledge is power, and it will shape many sectors such as industry, education, and national defense, so I define it as "the core infrastructure of the next-generation digital economy."
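The idea of joining streams on a shared timeline can be made concrete with a small sketch. This is purely illustrative (hypothetical data, not a real training pipeline): it aligns image frames, audio transcript segments, and subtitles by timestamp, producing the kind of jointly indexed records a multimodal model would learn from.

```python
# Illustrative sketch of "multimodal alignment": joining image, audio,
# and subtitle streams on a shared timeline so they can be consumed jointly.
from bisect import bisect_right

def align(frames, audio, subtitles):
    """Each stream is a sorted list of (timestamp_seconds, payload).
    For every video frame, pick the latest audio and subtitle entry
    at or before that frame's timestamp."""
    def latest(stream, t):
        i = bisect_right([ts for ts, _ in stream], t)
        return stream[i - 1][1] if i else None

    return [(t, img, latest(audio, t), latest(subtitles, t))
            for t, img in frames]

# Toy streams: one entry per second of a three-second clip.
frames = [(0.0, "frame0"), (1.0, "frame1"), (2.0, "frame2")]
audio = [(0.0, "hello"), (1.5, "world")]
subs = [(0.5, "Hello!"), (1.8, "World!")]

aligned = align(frames, audio, subs)
print(aligned[2])  # → (2.0, 'frame2', 'world', 'World!')
```

The point is only the data shape: once each modality carries a timestamp, a single merge pass yields records where picture, speech, and text describe the same moment.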

Open source collaboration

We are now in the stage of the hundred-model war. With limited resources, how should everyone collaborate?

Tan Zhongyi: First, China should build an open source foundation model, or Base Model, whose capabilities are not inferior to GPT-3.5 or GPT-4. This model would not be offered as an online service; instead, enterprises would fine-tune it and then deploy it on their own private data. The data, programs, and other assets required to train such a large model can be built jointly through open source.

Do you think it is possible to build an open source Base Model?

Tan Zhongyi: Of course it's possible! It becomes possible if you do it, and it will never be possible if you don't. But this needs to be decided by the organizer. Given the risks on the technical route, the challenges of team management, and ethical considerations, the open source foundation cannot take shortcuts for quick wins, such as picking one company and backing it fully; that would be a monopoly. Instead, we should do what is the common denominator of all the participants in the hundred-model war, that is, the public, foundational work that everyone needs. The first step should be open source datasets: large models need datasets, and those datasets must also comply with Chinese laws and regulations.

Compared with Singapore, some European countries, and the United States, we still have a lot to do on open data. Doesn't creating such a dataset seem quite challenging?

Tan Zhongyi: It looks quite challenging, but it is actually not as difficult as imagined. Many Chinese companies and organizations have already open-sourced some of their data, including Baidu, Zhiyuan (BAAI), the Shanghai Artificial Intelligence Society, and others. So we only need to bring these datasets under the management of the OpenAtom Open Source Foundation, establish a good update mechanism, and add compliance-checking tools to ensure their legality, thereby producing the high-quality dataset everyone needs. Such a dataset accumulates continuously, and once the accumulation reaches a certain scale, it can become a significant player in this field. This player is not there to compete in the hundred-model war; it is everyone's friend. So the first step, creating an open source dataset, is relatively easy and feasible.

Is the data these companies have open-sourced raw data or metadata?

Tan Zhongyi: It is all raw data. Of course, it still needs to be cleaned before it can be used for pretraining. Large model training is divided into three steps:

Step one, pre-training, requires a massive corpus and is done through unsupervised learning. Although the corpus required is very large, no manual annotation is needed, so the cost is relatively low.

Step two, instruction tuning, requires manual annotation: human experts write a variety of high-quality questions and answers. GPT-3.5 is said to have used roughly 50,000 or more manually annotated examples.

Step three, reinforcement learning from human feedback (RLHF), also requires manual annotation.

Of these three types of data, the dataset in step one is by far the largest, the data in step two can also partly be found on the Internet, and the data in step three is particularly scarce.
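The three stages above can be sketched as a pipeline. This is a toy illustration with hypothetical function names, not a real training loop; each stage simply records what kind of data it consumed, to show the ordering and the sharp drop in data volume from stage to stage.

```python
# Toy sketch of the three training stages: pretraining, instruction
# tuning, and RLHF. Real pipelines use ML frameworks; here each stage
# only records the data it was given.
from dataclasses import dataclass, field

@dataclass
class Model:
    stages: list = field(default_factory=list)

def pretrain(model, corpus):
    # Stage 1: unsupervised learning over a huge unlabeled corpus.
    model.stages.append(("pretrain", len(corpus)))
    return model

def instruction_tune(model, qa_pairs):
    # Stage 2: supervised fine-tuning on human-written Q&A pairs.
    model.stages.append(("instruction_tune", len(qa_pairs)))
    return model

def rlhf(model, preference_data):
    # Stage 3: reinforcement learning from human preference rankings.
    model.stages.append(("rlhf", len(preference_data)))
    return model

# Hypothetical sizes chosen to mirror the point above: the pretraining
# corpus dwarfs the annotated data of stages two and three.
corpus = ["raw web text"] * 1_000_000
qa_pairs = [("question", "expert answer")] * 50_000
prefs = [("prompt", "answer A preferred over answer B")] * 5_000

model = rlhf(instruction_tune(pretrain(Model(), corpus), qa_pairs), prefs)
```

The only annotation cost sits in the last two stages, which is why the speaker's proposed open source dataset effort starts with the bulky but cheap stage-one corpus.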

Model Development and Open Source Licensing

Will the privacy and protection of this data involve relevant laws?

Tan Zhongyi: Yes, so we need not only raw data but also compliance tools. As a first step, we run the raw data through compliance tools to obtain clean data, which can then be used for either pretraining or fine-tuning.
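As a minimal sketch of that raw-to-clean step (with a hypothetical rule, not a real compliance toolchain), a cleaning pass might redact personal identifiers and drop empty records before the text reaches pretraining:

```python
# Minimal sketch of a data-cleaning step: redact phone-number-like
# digit runs and drop empty records. A real compliance tool would
# apply many more rules (PII, copyright, content regulations).
import re

PHONE = re.compile(r"\b\d{11}\b")  # e.g. 11-digit mobile numbers

def clean(records):
    cleaned = []
    for text in records:
        text = PHONE.sub("[REDACTED]", text).strip()
        if text:
            cleaned.append(text)
    return cleaned

raw = ["Call me at 13800138000 please", "   ", "Plain safe text"]
print(clean(raw))  # → ['Call me at [REDACTED] please', 'Plain safe text']
```

The design point is that cleaning is a pure function from raw records to clean records, so rule sets can be versioned and re-run as regulations evolve.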

Is the foundation communicating with the state about compliance laws? How do you plan to adapt these tools to make the data compliant?

Tan Zhongyi: This is already underway. The Cyberspace Administration of China (CAC) has formulated many regulations, and it also works with some commercial companies that build data compliance tools conforming to those regulations. One of them, RealAI (Ruilai Intelligence), is a company founded by Academician Zhang Bo of Tsinghua University.

The second step is to train the model into a Base Model and keep it continuously updated, which requires cooperation with some domestic computing centers.

The third step is model customization. There may be industry-specific models, models for mobile devices, and models for specific scenarios, such as a model specialized for coding.

Recently, Meta released Llama 2, which is open and commercially usable. How do you think it will change the landscape of the large model market?

Tan Zhongyi: Ever since Llama came out, it has been considered the most usable base large model, and the various "alpaca" models built on top of it were all fine-tuned from Llama. Andrej Karpathy, who recently returned to OpenAI, once said: "Llama is the best open source large model I have ever seen." Although the previous version was accidentally leaked, many people in the industry were already using it, and Llama 2, released not long ago, is easier to use and more capable. Among open source large models today, Llama 2 is the one many enterprises can choose, thanks to its well-earned reputation for quality.

Many people point out that although Llama 2 is commercially usable, it is not open source. Do you have any suggestions or feedback on that?

Tan Zhongyi: On the license question, there are two main points. First, it is indeed not a traditional OSI-certified open source license, because it places several restrictions on how users may use it. On the other hand, the definition of open source has not been updated in the roughly 25 years since it was launched in 1998, which is a very strange thing. Recently I have heard that OSI will release something new for open AI; we can wait and see.

From a pragmatic perspective, we need to keep pace with the times. If a license cannot match the business model, its vitality will be limited. When the GPL was released, it was formulated for the environment of its time, when copyright was everywhere; today everyone has come to accept copyleft. Now we instead need to balance open source and commercialization, so I am looking forward to seeing how the OSD (Open Source Definition) will be updated in this area.

Therefore, by the existing definition of open source, Llama 2 is not an open source product, but we do not think this will remain the case forever; some changes are needed.

Conclusion

Any suggestions for the next steps of Open Source Rainforest? What do you expect it to do?

Tan Zhongyi: For Open Source Rainforest, I think we need to stick to a clear positioning, stay user-centered, build an open source knowledge system across stages such as understanding open source, using open source, and contributing to open source, and jointly build a prosperous open source ecosystem. Make Open Source Rainforest a brand and continuously produce content in various forms, such as interviews with notable figures and three-person panel discussions, to attract more participants and strengthen the brand.

Reprinted from | Open Source Rainforest

Editor | Mei Hao

Related Reading

A profound open source gathering: KCC@Beijing 9.2 event review

KCC@Dalian | A private brainstorming session about open source business


Introduction to Kaiyuanshe


Kaiyuanshe (English name: "KAIYUANSHE") was established in 2014. It is an open source community formed by individual volunteers who contribute to the open source cause, based on the principles of "contribution, consensus, and co-governance." Kaiyuanshe has always upheld the concepts of vendor neutrality, public welfare, and non-profit operation, with the vision of "rooted in China, contributing to the world, and promoting open source as a way of life in the new era," and the mission of "open source governance, international integration, community development, and project incubation," to create a healthy and sustainable open source ecosystem.


Kaiyuanshe works closely with communities, universities, enterprises, and government-related units that support open source. It was also the first Chinese member of OSI, the global open source license certification organization.


Since 2016, it has held the China Open Source Annual Conference (COSCon) every year, continuously released the "China Open Source Annual Report," and co-launched the "China Open Source Pioneers List" and the "China Open Source Code Power List," all of which have had wide influence at home and abroad.



Reprinted from: blog.csdn.net/kaiyuanshe/article/details/132928831