Why are open source large language models important?

In the field of LLMs, what does open source actually mean? If the open source community had a truly open source LLM, with its weights, datasets, code, and infrastructure publicly available, what important benefits would we gain from it?

The authors of this article are Vikram Sreekanti and Joseph E. Gonzalez. The former is the co-founder and CEO of Aqueduct; the latter is a professor at the University of California, Berkeley and the co-founder of Run LLM. Here, they discuss the importance and core value of open source LLMs.

(This article was compiled and published by OneFlow. Please contact us for authorization before reprinting. Original text: https://generatingconversation.substack.com/p/why-open-source-llms-matter)

Authors | Vikram Sreekanti & Joseph E. Gonzalez

Compilation | OneFlow

Translation | Wan Zilin

Open source is truly fascinating. As members of Berkeley, which has a long tradition of open source, we are generally big fans of open source software. But frankly, much of the discussion around open source is extremely vague. Open source advocates often emphasize the undoubted advantages of open source LLMs, but rarely specify what they actually want to see.

This got us thinking about the importance of open source LLMs and the benefits they might bring.

But first, let's anchor the discussion on a concrete question: for LLMs, what exactly does open source mean? Here are several possible definitions:

  • Publicly available weights: Models like LLaMA 2 and Mistral fall into this category. These models release the weight files that make up the model under a fairly permissive license, so that users can obtain the weights and make custom deployments.

  • Publicly available datasets: To the best of our knowledge, no mainstream open source LLM currently does this, but making the training data publicly available would have an important impact: it would enable the community to understand a model's potential biases and flaws.

  • Publicly available training code and infrastructure: Until now, most big model builders have kept this a closely guarded secret. Because the training process involves a large number of configuration parameters, along with reinforcement learning from human feedback (RLHF), disclosing this information would help the community understand the model at a fundamental level.

As discussed elsewhere, the process of creating the datasets and the expertise embedded in the training process are kept strictly confidential. Mainstream open source model vendors release little (or no) information about the data they use, much to the dismay of the open source community. So far, we've mostly seen publicly available model weights, but very little information about datasets, training code, and infrastructure.

Let's go back to the original question. Assuming that open source advocates win this battle and we had a truly open source large language model, with its weights, datasets, code, and infrastructure all available, what significant value would we gain from it?

  • Community supervision: Understanding a model's blind spots and flaws is critical for future model improvement and alignment research. Simply chatting with a model like GPT or using its API already surfaces many blind spots, and researchers can push the boundaries further by hosting models themselves to test probing strategies. Whether visibility into a model's underlying dataset would provide valuable insight into its biases remains to be explored. Obviously, the editorial choices made by model builders (such as removing or including certain data) matter; however, given the massive investment in data and the potential legal risks, the likelihood that we will see these datasets made public in their entirety is very slim (barring government intervention).

  • Recreating models: The absence of dataset and code information is very frustrating for the open source community. Ideally, open access would allow researchers to experiment with different model parameters and alignment approaches by recreating existing models. But the reality is that the scale of these models makes recreating them impractical, if not downright impossible. The GPU cost required for training alone is prohibitive, and the infrastructure and labor costs required for RLHF are even more unaffordable. Unlike ordinary storage infrastructure, where users can actually swap MinIO in for AWS S3, the hardware and time costs of recreating a model rule this kind of experimentation out. Community efforts are insufficient to recreate models at the scale of GPT (or even LLaMA); the public sector or large research institutions may make some progress, but bottom-up experimentation remains out of the question. Alignment research will most likely have to be treated as an add-on to existing models.

  • Self-hosting and custom deployment: This is a hot topic: in some highly security-sensitive scenarios, enterprises may need customized large models. Still, we are confident that OpenAI and Azure (and correspondingly Anthropic + AWS, and GCP) will solve this problem. Given the huge gap in model quality, users will have little incentive to choose an open source LLM if they can safely deploy private models (especially with appropriate data-sharing protections). Just this week, we spoke with a technology company with a market capitalization of approximately $100 billion that was negotiating terms for sharing private data with a major cloud provider for that provider's LLM deployment. The reality is that mainstream model suppliers have the advantages of economies of scale and efficient deployment, and it is difficult for other competitors to surpass them.

  • Proprietary models: This is something we mentioned in our previous article, and it is also the most persuasive argument. Open source LLMs are a good foundation for developing proprietary models. Although the GPT fine-tuning API is powerful, it only fine-tunes via LoRA (rather than full weight updates), which prevents users from applying more advanced model specialization techniques (such as RLHF or RLCF) that are likely to be extremely valuable as proprietary models mature (see the sketch below). This is where open source models are most likely to flourish in the coming years.
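
To make this concrete, below is a minimal sketch of what specialization can look like when the weights themselves are open, using the Hugging Face transformers and peft libraries. The model name and every hyperparameter here are illustrative assumptions rather than details from the article; the point is only that, with open weights, the user chooses between a cheap LoRA adapter, a full weight update, or something more advanced.

```python
# Minimal sketch: LoRA fine-tuning an open-weight model with Hugging Face
# transformers + peft. Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"   # any open-weight model with a permissive license
tokenizer = AutoTokenizer.from_pretrained(base)   # used to prepare your fine-tuning data
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA attaches small, trainable low-rank matrices to selected projection
# layers, so only a tiny fraction of the parameters is updated -- roughly what
# a hosted fine-tuning API does for you behind the scenes.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base model

# Because the full weights are open, nothing stops a full weight update instead:
# skip the adapter and train `model` directly (at far greater GPU cost), or layer
# RLHF-style preference tuning on top with a separate reward model.
```

The specific library is beside the point; what matters is that the choice between a lightweight adapter and deeper specialization belongs to the user rather than to the API provider.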

Open source models are already strong when it comes to specialization. Someone pointed out that Code-LLaMA 34B is already the best code model available, and we strongly agree! It is an excellent success story for domain-specific models. Unfortunately, fine-tuning can still be very expensive because of the GPU and time investment required to train the model. Fortunately, we already know from many practical cases (including our own work) that fine-tuned models do not need to reach the scale and generality of models like GPT-4.

This line of thinking leads to an obvious conclusion: the open source model does not need to get better, it just needs to get smaller and more specialized. Previous articles have pointed out that open source LLMs face a gap of roughly two orders of magnitude in cost and scale to catch up with GPT. If they can clear that hurdle, it would improve companies' ability to specialize models effectively and provide a viable path forward for open source software.

We have long believed in the value of open source, but it has become clear that open source models cannot compete with hosted general-purpose models on quality. That is not a failure, however, but a new opportunity. Users who fine-tune models do not want the most general model; they want a model that is well trained for their task. If open source models can be lightweight while maintaining high quality, that is the opportunity in the future market, with a whole new field of specialization waiting to be opened up.

Try OneFlow: https://github.com/Oneflow-Inc/oneflow/
