It took 10 people two months to build a large model! Backed by 16 top-conference papers in one year: none of the best models on the market are open source

Source | Qubit | Public account QbitAI

A company founded in Shenzhen in May this year has a team of fewer than 10 people.

What they have set out to do is no small matter: take on AGI.

Where does the confidence come from? First, look at the team's track record; second, look at its current results on the leaderboards.

Over the past year, these people published a total of 16 large-model-related papers at top conferences such as CVPR, ICML, and ECCV, one of which was nominated for best paper at ACL 2023.

And the results since founding the company? Two months after its establishment, the model it trained ranked among the top three on the C-Eval leaderboard, beating ChatGPT and Claude-v1.3 in Chinese-language ability.

This is the work of Symbiotic Matrix.

Its model GS-LLM first appeared on the list at the end of July and has stayed in the first echelon among the 65 entrants on the C-Eval leaderboard ever since.

So, who is Symbiotic Matrix?


10 people challenge AGI

Symbiotic Matrix aims to build an industry data refining factory on top of its self-developed AGI technology.

The team's work centers on its self-developed large model, GS-LLM.

The model's parameter scale ranges from 7B to 130B and can be tailored to users' actual needs.

Two versions based on GS-LLM hold places on C-Eval: the 10-billion-parameter GS-LLM-Beta, and the mini version GS-LLM-Beta-Mini with fewer than 10 billion parameters.

The mini version was launched because many users found that their existing computing environments (even cloud environments) could not support local deployment of the larger model.

Test results show that the multi-billion-parameter version of GS-LLM-Beta performs well, with a best ranking of 6th on C-Eval.


One reason it can stay near the top of the C-Eval list is that Symbiotic Matrix has built a completely independent training framework, which provides fairly complete technical support for the entire training process.

The second point is data, which the company attaches great importance to.

Symbiotic Matrix CEO Zhang Lin gave a simple example:

Compare model training to a person's growth: if all someone has read since childhood are novels with no nutritional value, that person's overall ability will not be very strong.

Last year, the team found in an experiment that once the training data reaches a certain order of magnitude, a jump in data quality can actually bring about qualitative changes in the model.

"In other words, if you have a relatively small-scale (such as tens of billions) model and feed it high-quality data, the training results will be very close to the results of hundreds of billions of levels." Zhang Lin said.

This experiment also led the team to pay more attention to data quality and to systematic ways of obtaining high-quality data.

In fact, this point has recently drawn growing attention across the industry. Microsoft's recent study "Textbooks Are All You Need" shows that scaling up is not the only way forward; high-quality data is crucial.

As a result, the Symbiotic Matrix team built an engineered data-cleaning system that cleans data continuously, 24 hours a day.

To date, the team has cleaned about 20T of text data usable for training. "Data at this level can support training a very large model."
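The article does not describe the pipeline's internals. As a rough illustration of what one stage of such a cleaning system might look like, here is a minimal, hypothetical Python sketch that filters and deduplicates raw text; the thresholds and heuristics are assumptions for illustration, not Symbiotic Matrix's actual rules:

```python
import hashlib
import re

def clean_corpus(lines, min_chars=50, max_symbol_ratio=0.3):
    """Illustrative cleaning pass: whitespace normalization, length filter,
    symbol-ratio filter, and exact deduplication via hashing.
    All thresholds are placeholders, not the company's actual settings."""
    seen = set()
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()      # normalize whitespace
        if len(text) < min_chars:                      # drop very short fragments
            continue
        symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
        if symbols / len(text) > max_symbol_ratio:     # drop symbol-heavy noise
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                             # skip exact duplicates
            continue
        seen.add(digest)
        yield text

if __name__ == "__main__":
    raw = [
        "This is a sample document about training data quality and cleaning.",
        "@@@###!!!",
        "This is a sample document about training data quality and cleaning.",
    ]
    for doc in clean_corpus(raw):
        print(doc)
```

A production system would add further stages (language identification, near-duplicate detection, toxicity and quality scoring), but the basic filter-and-deduplicate loop above captures the general shape of such a pipeline.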

However, Zhang Lin also revealed that Symbiotic Matrix will not make the team's cleaned data public in the short term.


So what exactly is the data refining factory the team wants to build?

Zhang Lin explained that if a large model is understood as a "compression of information", then the model itself is a large database of parameters.

What the data refining factory does is share and trade the parameter data produced after a model has been trained.

After all, a large model's capabilities are carried in its parameters, so trading parameters is effectively exchanging capabilities. The market needs diverse model capabilities, and "parameter trading is the most efficient path."

The data referred to here is not the kind everyone is used to seeing, but parameter data. The data we usually talk about is a piece of text or an image; what the factory holds are the parameters of trained models, and it is these parameters that are commercially traded.

"The raw data is directly traded, which is constrained by large quantities and privacy issues." Zhang Lin explained that the concept of data trading has been proposed for many years, but it has not been fully accepted by the market. The team believes that if data is to be truly circulated, It needs to be more reasonable, safe and effective, so data transactions at the parameter level were finally determined.

In the team's vision, once the data refining factory is up and running, some data will no longer need to be trained on repeatedly, improving efficiency and reducing costs.

Doing large-model systems well with fewer people and fewer resources

Amid the large-model craze, how to evaluate large models has become an important question, which is why so many leaderboards have sprung up.

After Symbiotic Matrix appeared on C-Eval, outside observers focused on two main points:

Beyond the strong results, the other point of interest is that they are a small team, a rarity on the list.


The team says the leaderboard is neither the only nor the most authoritative benchmark, but appearing on it just one month after the company's founding, and at one point reaching the top three, reflects that "we use fewer people and fewer resources to build large-model systems well."

That's right: the Symbiotic Matrix team has fewer than 10 people.

The headcount is small, but every one of them is a strong player:

CEO Zhang Lin, CTO Wang Junjie, and the other core members of the team all come from the IDEA Research Institute and have rich hands-on experience with the domestic Fengshenbang open-source pre-trained model system (Fengshenbang reportedly now has more than 98 open-source pre-trained models).

Zhang Lin holds a Ph.D. from the State University of New York and has published more than 30 papers at top computer science conferences. He was previously a senior researcher at the Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Institute (IDEA).

Wang Junjie holds a PhD in computer science from Waseda University and was previously a core member of the Fengshenbang large model team.

[Photo: Zhang Lin]

Looking at the current AI market, a small team doing AI well is not without precedent. Behind Midjourney, the best-known text-to-image model and an oft-cited benchmark for new-era organizations, there are only 11 people. In the AI 2.0 era, many large-model startup teams emphasizing "small but beautiful" have emerged at home and abroad.

Of course, Zhang Lin said, the deeper reason is that large models are not projects that can simply be built by piling on manpower; they call for a small, elite team to ensure efficiency.

He said that when training a model, technical aspects such as operator optimization and mixed precision, along with the communication issues of running hundreds of GPUs at once, all test a team's engineering ability. If a small team can solve the engineering problems it encounters and improve efficiency, there is no need for a large team to solve them.
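The article only names these techniques. For readers unfamiliar with them, here is a minimal, generic PyTorch sketch of mixed-precision training with automatic loss scaling; it illustrates the technique in general and is not Symbiotic Matrix's training framework:

```python
import torch
import torch.nn as nn

# Toy model and data; mixed precision mainly pays off on CUDA hardware.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in float16 where safe, float32 where needed.
    with torch.autocast(device_type=device, dtype=torch.float16,
                        enabled=(device == "cuda")):
        loss = loss_fn(model(x), y)

    # GradScaler guards small float16 gradients against underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Operator optimization and multi-GPU communication (e.g., efficient all-reduce across hundreds of cards) sit on top of basics like this and are where much of the engineering difficulty he describes actually lies.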

In addition, a small technical core is more conducive to independent thinking and to exploring possibilities without being bound by convention, whereas piling on manpower can easily drag down overall efficiency.

By his estimate, the top talent in large models in China "may add up to only about 100 people", leaving little room to form a large team.

So for some time to come, the team will stay at "fewer than ten people".

Ultimately, this reflects a different understanding of the paradigms and concepts behind the AI 2.0 era versus the AI 1.0 era.


In the course of the conversation, Zhang Lin also spoke directly about another area where the team's view differs from mainstream voices: its stance on open source versus closed source.

When the free, commercially usable LLaMA-2 was released some time ago, many said it would be a huge blow to startups, because LLaMA-2 can meet most companies' needs for lower cost and customization.

"LLaMA-2 has not changed the market structure." In the eyes of the Symbiosis team, truly leading teams do not open source core technologies.

Zhang Lin added that at the current stage, the significance of open source lies more in educating the market than in driving commercialization.

Just as the Raspberry Pi is meaningful for electronics enthusiasts but will not change the laptop market, LLaMA-2 is more valuable for entry-level users but will have little impact on those looking to commercialize.

Symbiotic Matrix has many more "non-mainstream" views and understandings like this.

For example, the team does not believe that large models are the end point of general AI, nor that ChatGPT represents the ultimate direction.

They are also cautious about unicorn-style rapid expansion, paying more attention to team cohesion and accumulating technology.

……

As for its future path, Symbiotic Matrix has chosen to stay closed source in the short term, and may open-source selectively in the future when the opportunity is right.

Open source needs clear business-driven goals. Large-model technology is still in a stage of rapid iteration and fierce competition, and open-sourcing core technology risks giving up a first-mover advantage.


Origin blog.csdn.net/lqfarmer/article/details/133181824