The correct posture for large open source language models


Author | Nathan Lambert

Compiled by OneFlow

Translation | Yang Ting, Wan Zilin

Today, many companies feel compelled to join the competition over open-source large language models (LLMs). Releasing some form of open-source LLM has become a marker of strength for machine learning companies. Recently, Mistral AI closed a funding round and released a powerful 7-billion-parameter language model.

While broader participation in the open-source machine learning ecosystem is beneficial and seen by many as an important trend, we now need to shift perspective and push open-source ML companies beyond releasing models and toward long-term business strategies and competitive advantages. These companies need to do two things: first, stay competitive with well-capitalized rivals; second, build and maintain a moat that creates product stickiness. So far, open-source ML companies have done well in neither area.

This article focuses on the first point: if the status quo holds, can open-source LLM companies close the gap with giants such as Google or OpenAI?

This question is aimed at companies like Mistral and Stability (which pour all their money into compute rather than building products) and at those who believe open source is the best way forward for large language models.

(The author, Nathan Lambert, holds a PhD in artificial intelligence from Berkeley and is a machine learning scientist at Hugging Face. This article was compiled and published by OneFlow; please contact us for authorization before reprinting. Original text: https://www.interconnects.ai/p/are-open-llms-viable#footnote-anchor-1-13759732)

1. Short-term trends in open source LLM: Seeds, bubbles and experiments

Echoing the leak of the original LLaMa model on BitTorrent, Mistral released its first model via Twitter, along with a torrent link. This release method is very interesting and very much in keeping with the times.

This year and next may be the golden age of open-source LLMs. Meta has grown rapidly with LLaMa as its computing platform. LLaMa 2 is very practical and can easily be fine-tuned for all sorts of interesting tasks, and many small-scale developers support it by improving its ecosystem. LLaMa 3/4 is targeting GPT-4-level capabilities.

Because Meta successfully established LLaMa as a computing platform, LLaMa 2 can now be hosted on Google Cloud Platform (GCP) and AWS Bedrock. The distribution process runs: Meta releases the model → developers experiment with it → developers recommend it to management for experimental products → business decision-makers find cloud providers to host it. Google and Amazon may have agreed to a revenue-sharing deal with Meta to host the model. Although Meta's license has been criticized, LLaMa 2's success is undeniable.

How many models can achieve the same leverage as Meta's LLaMa? Success appears to depend largely on when and where a model is released. Currently, LLaMa 2 is the best general-purpose open-source LLM. Top-performing code models may also get hosted, along with a few other unique models, but that number is still far lower than the number of vendors currently releasing models.

The key to whether the hype is sustainable is figuring out how many people are engaging with LLaMa and open-source LLMs because company executives are encouraging them to use and experiment with LLM products. For technology development, understanding the infrastructure and limitations of open-source LLMs in specific use cases is critical, and the detail that the products are not yet that useful does not matter while funding is abundant. This is why the LLM market is frothy.

LLaMa is a springboard for verifying whether LLM products can gain traction, which makes it a valuable foothold. Because of this, it is natural to ask: besides Meta, will other companies enter this field?

2. Mid-term trends in LLM: Sharing, data and competition

Many LLM providers with open-source leanings have announced their intentions. As the picture gradually becomes clearer, we will see new model versions released every few months to determine who is best among them. The problem is that those adopting an "open-source LLM as a platform" strategy do not truly follow through on openness.

True open source for LLMs lies in open data and open training codebases, so we need to take action to encourage people to open up these resources. The tweet below, along with some offline discussions I had after Mistral's release, made me realize that a lack of data transparency could soon become an existential issue for the open-source LLM ecosystem.


NOTE: Many data-release details may actually be subject to the litigation against OpenAI and Meta, in which unlicensed use of the books dataset was documented. If model suppliers prove to be in the clear on this issue, there is nothing to worry about.

One of the main assertions of open-source technology is that by involving more stakeholders, everyone benefits and progress comes faster. This applies to security, reliability, and functionality. Given that pre-training a modern LLM can often be reduced to collecting good data and applying it to the model efficiently, the main difference between the open-source models we use, such as LLaMa and the models released by Adept, Mistral, and Stability, is the data.

There are fundamental capital constraints on this assertion. Many open-source teams are perhaps 1/20 the size of larger teams at OpenAI, Google, and Meta. Each team pursues similar goals in different ways, and more people really can accomplish more. The advantage of open source is that all participants can share the most significant details with each other; through integration (especially debugging and improvement by the pro bono community), the open-source camp can pool 20 times the human resources, saving time for each organization and improving efficiency. While open source does not bring a huge advantage to any single participant, it creates opportunities for the entire ecosystem.

Open-source organizations should understand that the models released today have no real direct commercial value (1). As LLMs are gradually integrated into the modern economy over the coming years, these releases mainly serve recruiting and public relations. Although open-source LLMs are rapidly closing the gap with OpenAI, given how quickly the state of the art iterates (roughly every three months), the competitive advantage of any open model is short-lived and lasts only until the next model appears.

While some players may believe they gain an advantage by keeping their data private, in reality they are losing competitiveness. Within the open-source space they compete with each other, and OpenAI may quickly eat into their market share through faster development. Mistral and other players whose only business model is open source need to put more emphasis on openness until they develop their own products. Otherwise, they will only reinforce our reliance on the deep resources of companies like Meta to bridge the gap between open-source and closed-source language models.

Companies such as Mistral train high-quality large language models in the open. As a business strategy, however, this does not make much sense: training and releasing a good LLM without a product is not a viable business strategy.

If there is never a plan to launch a product, there is no real business model, so at least do some good and share the model's details with everyone. For well-capitalized companies such as Meta, open-sourcing large language models does not affect the bottom line, but for smaller players, open source may end in acquisition or bankruptcy.

This is the beginning of a long story between open-source and closed-source language models. Previous mistakes planted the seeds of a tragedy that foreshadows how the story ends: massive consolidation of LLM vendors.

If we are serious about thinking about a future where open source thrives on its original principles, then we need to put more pressure on the decision-makers who decide what data can and cannot be open source. In the meantime, what else should these companies be doing to guide community development?

Looking back at the tweet quoted above, few organizations have been able to join the list due to legal liability concerns (the issue the author and commenters are currently focused on), but the data involved is well worth exploring.

For example, Mistral trained on some data from 2023 (2), which means they didn't just download The Pile or the usual Common Crawl archives. Mistral should also open resources such as its web crawler or the data-processing scripts it used. It's time for a new norm of pre-training data releases to emerge. In addition, Stability deserves credit for releasing some data information about its latest 3-billion-parameter StableLM, despite industry rumors that the business is struggling.

The situation becomes more complicated with the emergence of RLHF. I'm not sure anyone knows how to communicate the constraints required to re-collect RLHF data. Iterative training and data collection with external vendors adds significant complexity compared to pre-training. The open-source field has yet to fully replicate the RLHF behavior that OpenAI and Google achieve, and this gap is particularly evident when the various LLaMa 2 variants are the only data points.
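To make the coordination cost concrete, here is a minimal sketch of the iterative RLHF loop described above. Every function body is an illustrative stub of my own (the preference labeler, reward model, and policy update are toy assumptions, not a real pipeline): in practice each step involves a model provider, an external data vendor, and weeks of iteration, which is exactly what is hard to reproduce in the open.

```python
# A toy sketch of the iterative RLHF data/training loop. All bodies are
# stand-in stubs (assumptions), chosen only to show the loop's shape.

def generate_responses(policy, prompts):
    # Step 1: sample candidate responses from the current policy model.
    return {p: [policy(p), policy(p)] for p in prompts}

def collect_preferences(candidates):
    # Step 2: an external vendor labels the preferred response per prompt.
    # Stubbed here as "prefer the longer response".
    return {p: max(pair, key=len) for p, pair in candidates.items()}

def train_reward_model(preferences):
    # Step 3: fit a reward model to the preference labels (stub: length).
    return lambda text: len(text)

def update_policy(policy, reward_model):
    # Step 4: optimize the policy against the reward model (e.g. PPO).
    # Stubbed: the "improved" policy appends detail to its answers.
    return lambda prompt: policy(prompt) + " (with more detail)"

def rlhf_iteration(policy, prompts, rounds=3):
    # Each round needs fresh data collection, so complexity compounds
    # across rounds rather than being a one-off cost like pre-training.
    for _ in range(rounds):
        candidates = generate_responses(policy, prompts)
        preferences = collect_preferences(candidates)
        reward_model = train_reward_model(preferences)
        policy = update_policy(policy, reward_model)
    return policy

base_policy = lambda prompt: f"Answer to: {prompt}"
final_policy = rlhf_iteration(base_policy, ["What is RLHF?"])
```

The point of the sketch is that the loop closes over an external labeling step: releasing the final weights (as with LLaMa 2) does not release the process needed to re-run it.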

I predict that the future of open source LLM has the following two possibilities:

  1. Open source companies continue to increase their efforts in open source. Open source communities work together to quickly solve many problems through the wisdom of the crowd, and companies have sufficient time to develop products to solve business propositions.

  2. Open-source LLMs maintain the status quo and progress falls behind. Open-source vendors play a game of musical chairs and are acquired one by one over roughly 18 months as their money runs out, unless they find other revenue streams and put their massive GPU fleets to use. Only open-source players with large-scale product use cases survive, with open source helping them gain insight into their models.

In the field of LLM, when everyone focuses on the debate between open source and closed source, it is now time to pay more attention to openness and transparency, not just as a means of public relations.

It turns out that most rational people are increasingly cautious about organizations that raise money to train open models first and think about use cases later. The current focus of LLM development is products. The underlying technology will keep improving, but only the new products it drives will be valuable. This is exactly the economic cycle we are in, and there are lessons to be learned from it.

If we don't think deeply about this issue early, we will waste huge investments.


Image via Midjourney

3. Long-term trends in open source LLM: Proprietary models, scaling and challenges

The future of resource sharing is unclear, and scaling-law predictions of capital requirements are not optimistic for open-source companies. Open-source companies need to be able to raise these funds; otherwise it's all just talk.

The key factor that could change this situation is open-source companies building an absolute advantage in the niche areas their products focus on. They can then publish a model, collect community feedback in its area of expertise, and speed up the feedback loop between iterations, similar to how Adept released a multimodal model demonstrating its ability to learn from YouTube tutorials. This means Mistral has to find a product direction (no easy task).

As I mentioned in my article about the LLM development path:

The open-source field will develop LLMs that are more capable on specific sets of requirements, but less comprehensive. This means that, compared with GPT-4's strong performance across the board, an open-source model will pick 10-50% of the benchmarks as targets to outperform GPT-4 on, while still lagging on the rest.

Direct head-to-head competition is not a viable short- or long-term strategy. I think most companies are aware of this and are desperate to find a solution, but many users on Twitter don't seem to be aware of this yet and will cheer just because a model is released. In fact, there are many other influencing factors.

At the same time, I predict that the cost of training SOTA language models will increase by about 5 times per year over the next 5-10 years. By 2028, the cost of training a model could easily reach tens or even hundreds of billions of dollars. I haven't even fully factored this into the discussion, but it further underscores that smaller companies need to specialize in specific areas to build a competitive advantage.
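A quick back-of-the-envelope check shows why a 5x-per-year multiplier compounds so brutally. The $100M base figure for a 2023 SOTA training run is my own assumption for illustration, not a number from the article:

```python
# Compound the predicted 5x/year growth in SOTA training cost.
# Base figure is an assumed $100M for 2023 (illustrative only).

cost_usd = 100e6        # assumed 2023 SOTA training cost
growth_per_year = 5     # the article's predicted annual multiplier

for year in range(2024, 2029):
    cost_usd *= growth_per_year
    print(year, f"${cost_usd / 1e9:.1f}B")
```

Under that assumption the 2028 figure is 100e6 × 5^5 = $312.5B, which is consistent with the "tens or even hundreds of billions of dollars" claim above.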

This is a way for everyone to get a smaller, more localized model for the tasks they are interested in. We need to return to the two principles of open source: personalization and security.

There will be more companies joining the open source model space. xAI is expected to open source models, Mosaic will release powerful models, Contextual may also release some practical models, and there are a few companies operating in the shadows that have not been mentioned. Beyond the data details, the next question is how industry dynamics will unfold as models become more powerful. We've seen a broad trend across the industry where the most powerful model vendors are becoming increasingly insular. For the open source ecosystem to thrive, we need a comprehensive push and aggressive momentum building, but we're not seeing that happening yet.

Notes:

(1) The situation with Meta is a bit subtle. For Meta, there may be more to gain from leveraging proprietary models in products than through licensing.

(2) I didn’t find any tweets containing screenshots of recent models answering questions.


Try OneFlow: github.com/Oneflow-Inc/oneflow/


Origin blog.csdn.net/OneFlow_Official/article/details/133802157