It's not ChatGPT and OpenAI that scare Google

The following text is a recently leaked document shared by an anonymous individual on a public Discord server, with permission to republish. It comes from a researcher within Google. We have verified its authenticity, made minor formatting changes, and removed links to internal web pages. The document reflects the personal views of one Google employee, not those of the company as a whole. While we disagree with what is written below, and we have consulted other researchers on it, we will express our own opinion in a separate column for subscribers. Here we are simply a vehicle for sharing a document that raises some very interesting points.

we don't have a moat

Neither does OpenAI

We've done a lot of looking over our shoulders at OpenAI. Who will cross the next milestone? What will the next move be?

But the uncomfortable truth is that we are not positioned to win this arms race, and neither is OpenAI. While we've been bickering, a third faction has been quietly eating our lunch.

I'm talking, of course, about open source. To put it bluntly, they are lapping us. Things we consider "major open problems" are solved and in people's hands today. Just to name a few:

  • LLMs on a Phone: People run base models at 5 tokens per second on a Pixel 6.

  • Scalable Personal AI: You can fine-tune a personalized AI on your laptop in an evening.

  • Responsible release: This one isn't so much "solved" as obviated. Entire websites are full of art models with no restrictions whatsoever, and text is not far behind.

  • Multimodal: The current multimodal ScienceQA SOTA trains in under an hour.

While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly. Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B params. And they are doing so in weeks, not months. This has profound implications for us:

  • We have no secret sauce. Our best hope is to learn from and collaborate with what others are doing outside Google. We should prioritize enabling 3P integrations.

  • People won't pay for a restricted model when free, unrestricted alternatives are comparable in quality. We should consider where our real added value is.

  • Giant models are slowing us down. In the long run, the best models are the ones that can be iterated on quickly. Now that we know what is possible in the <20B parameter regime, we should make small variants more than an afterthought.

what happened

In early March, the open source community got its hands on its first really capable foundation model when Meta's LLaMA was leaked to the public. It had no instruction or conversation tuning, and no RLHF. Nonetheless, the community immediately understood the significance of what it had been given.

What followed was a huge outpouring of innovation, with only days between major developments (see the timeline below for a full breakdown). Here we are, barely a month later, and there are variants with instruction tuning, quantization, quality improvements, human evals, multimodality, RLHF, and more, many of which build on each other.

Most importantly, they have solved the scaling problem to the extent that anyone can tinker. Many of the new ideas come from ordinary people. The barrier to entry for training and experimentation has dropped from the total output of a major research organization to one person, an evening, and a beefy laptop.

why we could have seen it coming

In many ways, this shouldn't be a surprise to anyone. The current renaissance in open source LLMs comes hot on the heels of a renaissance in image generation. The similarities have not been lost on the community, with many calling this the "Stable Diffusion moment" for LLMs.

In both cases, low-cost public involvement was enabled by a vastly cheaper mechanism for fine-tuning called low-rank adaptation, or LoRA, combined with a significant breakthrough in scale (latent diffusion for image synthesis, Chinchilla for LLMs). In both cases, access to a sufficiently high-quality model kicked off a flurry of ideas and iteration from individuals and institutions around the world. In both cases, this quickly outpaced the large players.

These contributions were pivotal in the image generation space, setting Stable Diffusion on a different path from Dall-E. Having an open model led to product integrations, marketplaces, user interfaces, and innovations that simply didn't happen for Dall-E.

The effect was palpable: rapid domination in terms of cultural impact versus the OpenAI solution, which became increasingly irrelevant. Whether the same thing will happen for LLMs remains to be seen, but the broad structural elements are the same.

what did we miss

The innovations that drive open source's recent success directly address problems we're still grappling with. Paying more attention to what they do can help us avoid reinventing the wheel.

LoRA is an incredibly powerful technique; we should probably be paying more attention to it

LoRA works by representing model updates as low-rank factorizations, which reduces the size of the update matrices by a factor of up to several thousand. This allows model fine-tuning at a fraction of the cost and time. Being able to personalize a language model in a few hours on consumer hardware is a big deal, particularly for aspirations that involve incorporating new and diverse knowledge in near real-time. This technology is underexploited inside Google, even though it directly impacts some of our most ambitious projects.
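To make the claimed savings concrete, here is a minimal sketch of the low-rank idea in PyTorch. It is an illustration under assumed dimensions, not alpaca-lora or any particular library's implementation; all names are hypothetical.

```python
import torch

# LoRA in one picture: instead of learning a full d x k update to a frozen
# weight W, learn a rank-r factorization B @ A with r << min(d, k).
d, k, r = 4096, 4096, 8

W = torch.randn(d, k)                            # frozen pretrained weight
A = (torch.randn(r, k) * 0.01).requires_grad_()  # trainable, r x k
B = torch.zeros(d, r, requires_grad=True)        # trainable, d x r; zero-init so the delta starts at 0

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Frozen path plus low-rank correction; only A and B receive gradients.
    return x @ W.T + x @ (B @ A).T

# Trainable parameters: r * (d + k) instead of d * k for full fine-tuning.
print(f"full: {d * k:,}  LoRA: {r * (d + k):,}  (~{d * k // (r * (d + k))}x fewer)")
```

At rank 8 on a 4096-square matrix this is a 256x reduction; smaller ranks and larger matrices push the factor toward the thousands mentioned above.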

Retraining a model from scratch is a tough road

Part of what makes LoRA so effective is that, like other forms of fine-tuning, it's stackable. Improvements like instruction tuning can be applied and then leveraged as other contributors add dialogue, or reasoning, or tool use. While the individual fine-tunings are low rank, their sum need not be, allowing full-rank updates to the model to accumulate over time.

This means that as new and better datasets and tasks become available, the model can be cheaply kept up to date without ever having to pay the cost of a full run.
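A toy numerical check of that rank argument, assuming three independent rank-r fine-tunes of the same weight (all shapes illustrative):

```python
import torch

d, k, r = 512, 512, 4

# Three independent fine-tunes (say: instructions, dialogue, tool use),
# each contributing its own rank-r delta B_i @ A_i to the same base weight.
deltas = [torch.randn(d, r) @ torch.randn(r, k) for _ in range(3)]
merged = sum(deltas)  # what the model sees: W + sum_i B_i @ A_i

print([torch.linalg.matrix_rank(m).item() for m in deltas])  # each ~= 4
print(torch.linalg.matrix_rank(merged).item())               # ~= 12: ranks add
```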

In contrast, training a giant model from scratch discards not only the pre-training, but also any iterative improvements made on top. In the open source world, these improvements can quickly take hold, making full retraining prohibitively expensive.

We should consider whether each new application or idea really requires an entirely new model. If we do have significant architectural improvements that prevent direct reuse of model weights, then we should invest in more aggressive forms of distillation that allow us to retain as much functionality as possible from the previous generation.

If we can iterate faster on a small model, the larger model will not be more capable in the long run

For the most popular model sizes, a LoRA update is very cheap to produce (~$100). This means almost anyone with an idea can generate one and distribute it. Training times under a day are the norm. At that pace, the cumulative effect of all of these fine-tunings quickly overcomes the starting size disadvantage. Indeed, in terms of engineer-hours, these models improve far faster than we can manage with our largest variants, and the best are already largely indistinguishable from ChatGPT. Focusing on maintaining some of the largest models on the planet actually puts us at a disadvantage.

Data quality scales better than data size

Many of these projects are saving time by training on small, highly curated datasets. This suggests there is some flexibility in data scaling laws. The existence of such datasets follows the line of thinking in Data Doesn't Do What You Think, and they are fast becoming the standard way to do training outside Google. These datasets are built using synthetic methods (e.g. filtering the best responses from an existing model) and by scavenging from other projects, neither of which is dominant at Google. Fortunately, these high-quality datasets are open source, so they are free to use.
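A sketch of what such synthetic curation might look like in practice. The quality signal, example data, and file names are assumptions for illustration; the point is only that the filtering step is a few lines, not a research program:

```python
import json

def curate(examples, score_fn, keep_fraction=0.1):
    """Keep the highest-scoring fraction of candidate training examples.

    score_fn stands in for whatever quality signal is available: a reward
    model, a stronger model grading outputs, or community upvotes.
    """
    ranked = sorted(examples, key=score_fn, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

# Hypothetical pool of model-generated candidates with quality scores.
pool = [
    {"prompt": "Explain LoRA", "response": "detailed answer...", "score": 0.92},
    {"prompt": "Explain LoRA", "response": "vague answer...", "score": 0.31},
]
with open("curated.jsonl", "w") as f:
    for ex in curate(pool, score_fn=lambda ex: ex["score"], keep_fraction=0.5):
        f.write(json.dumps(ex) + "\n")
```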

Competing directly with open source is a losing proposition

Recent developments have direct, immediate implications for our business strategy. Who would pay for a Google product with usage restrictions when there is a free, high-quality alternative without them?

We shouldn't expect to be able to catch up. There's a reason the modern internet runs on open source. Open source has some significant advantages that we cannot replicate.

we need them more than they need us

Keeping our technology secret was always a tenuous proposition. Google researchers are departing for other companies on a regular cadence, so we can assume they know everything we know, and they will continue to as long as that pipeline is open.

But holding on to a competitive advantage in technology becomes even harder now that cutting-edge research in LLMs is affordable. Research institutions all over the world are building on each other's work, exploring the solution space in a breadth-first way that far outstrips our own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other.

Individuals are not subject to licenses to the same degree as corporations

Much of this innovation is happening on top of the model weights leaked from Meta. While that will inevitably change as truly open models get better, the point is that they don't have to wait. The legal cover afforded by "personal use" and the impracticality of prosecuting individuals mean that individuals can get access to these technologies while they are hot.

Being your own customer means you understand the use case

Browsing through the models people are creating in the image generation space, there is a vast outpouring of creativity, from anime generators to HDR landscapes. These models are used and created by people who are deeply immersed in their particular subgenre, lending a depth of knowledge and empathy we cannot hope to match.

Owning the Ecosystem: Making Open Source Work for Us

Paradoxically, one clear winner in all of this is Meta. Because the leaked model is theirs, they're effectively getting an entire planet of free labor. Since most open source innovation happens on top of their architecture, there's nothing stopping them from incorporating it directly into their products.

The value of owning an ecosystem cannot be overemphasized. Google itself has successfully used this paradigm in its open source products such as Chrome and Android. By owning the platform on which innovation happens, Google solidifies itself as a thought leader and direction setter, gaining the ability to shape ideas bigger than itself.

The more tightly we control our models, the more attractive we make open alternatives. Google and OpenAI have both gravitated defensively toward release patterns that allow them to retain tight control over how their models are used. But this control is a fiction. Anyone seeking to use LLMs for unsanctioned purposes can simply take their pick of the freely available models.

Google should position itself as a leader in the open source community, leading by engaging with the broader conversation rather than ignoring it. This might mean taking uncomfortable steps like publishing model weights for small ULM variants. This necessarily means giving up some control over our model. But such compromises are inevitable. We cannot hope to both drive and control innovation.

Conclusion: What about OpenAI?

All this talk of open source can feel unfair given OpenAI's current closed policy. Why should we share if they won't? But the fact of the matter is we are already sharing everything with them in the form of a steady stream of poached senior researchers. Until we stem that tide, secrecy is a moot point.

And in the end, OpenAI doesn't matter. They are making the same mistakes we are in their posture relative to open source, and their ability to maintain an edge is necessarily in question. Open source alternatives can and will eventually eclipse them unless they change their stance. In this respect, at least, we can make the first move.

timeline so far

February 24, 2023 - LLaMA launches

Meta launched LLaMA, open-sourcing the code but not the weights. At this point, LLaMA was not instruction or conversation tuned. Like many current models, it is a relatively small model (available at 7B, 13B, 33B, and 65B parameters) that has been trained for a relatively large amount of time, and it is therefore quite capable relative to its size.

March 3, 2023 - The inevitable happened

Within a week, LLaMA was leaked to the public. The impact on the community cannot be overstated. Existing licenses prevented it from being used for commercial purposes, but suddenly anyone was able to experiment. From this point forward, innovations came hard and fast.

March 12, 2023 - Language Models on a Toaster

A little over a week later, Artem Andreenko got the model running on a Raspberry Pi. At this point the model runs too slowly to be practical because the weights have to be paged in and out of memory. Still, it sets the stage for an onslaught of miniaturization.

March 13, 2023 - Fine-tuning on a laptop

The next day, Stanford released Alpaca, which added instruction tuning to LLaMA. More important than the actual weights, though, was Eric Wang's alpaca-lora repository, which did this training "within hours on a single RTX 4090" using low-rank fine-tuning.

Suddenly, anyone could fine-tune the model to do anything, kicking off a race to the bottom in low-budget fine-tuning projects. Papers proudly describe their total spend of a few hundred dollars. What's more, the low-rank updates can easily be distributed separately from the original weights, making them independent of Meta's original license. Anyone can share and apply them.
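Why the license point holds, as a sketch: the adapter file contains only the two small factor matrices, not the base weights, so it can be shared on its own and merged at load time by anyone who already has LLaMA. The shapes, names, and file format here are illustrative, not any project's actual format:

```python
import torch

d, k, r = 4096, 4096, 8

# One layer's LoRA patch: just the two small factors, a few hundred KB,
# versus gigabytes for the base checkpoint it modifies.
adapter = {"A": torch.randn(r, k), "B": torch.randn(d, r)}
torch.save(adapter, "alpaca_style_adapter.pt")

def apply_adapter(base_weight: torch.Tensor, patch: dict, scale: float = 1.0) -> torch.Tensor:
    """Merge a low-rank delta into a frozen base weight at load time."""
    return base_weight + scale * (patch["B"] @ patch["A"])
```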

March 18, 2023 - Now it's fast

Georgi Gerganov uses 4-bit quantization to run LLaMA on a MacBook CPU. It is the first "no GPU" solution that is fast enough to be practical.
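For flavor, here is a simplified version of the kind of blockwise 4-bit quantization involved. This is a sketch in the spirit of such schemes, not llama.cpp's exact on-disk format:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Blockwise absmax quantization: each block stores int4 codes plus
    one float scale, cutting memory roughly 8x versus float32."""
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7 + 1e-12  # symmetric int4 range -7..7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_4bit(w)
print(np.abs(dequantize(q, s) - w).max())  # small per-block reconstruction error
```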

March 19, 2023 - 13B models achieve "parity" with Bard

The next day, a cross-university collaboration released Vicuna and used GPT-4-powered evaluation to provide qualitative comparisons of model outputs. While the evaluation method is questionable, the model really does perform better than earlier variants. Training cost: $300.

It's worth noting that they were able to use data from ChatGPT while circumventing the limitations of its API - they just sampled "impressive" examples of ChatGPT conversations posted on sites like ShareGPT.
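The GPT-4-powered eval amounts to a prompt along the following lines. This is a hedged reconstruction of the general approach, not Vicuna's verbatim prompt; `llm` is a placeholder for whatever judge-model completion call is available:

```python
JUDGE_PROMPT = """\
You are a helpful and precise assistant for checking the quality of answers.

[Question]
{question}

[Assistant 1]
{answer_a}

[Assistant 2]
{answer_b}

Rate each answer on a scale of 1 to 10 and briefly explain your reasoning.
"""

def compare(question: str, answer_a: str, answer_b: str, llm) -> str:
    # `llm` is any callable that sends a prompt to a strong judge model
    # (e.g. GPT-4) and returns its completion as text.
    return llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
```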

March 25, 2023 - Choose your own model

Nomic created GPT4All, which is both a model and, more importantly, an ecosystem. For the first time, we see models (including Vicuna) being gathered together in one place. Training cost: $100.

March 28, 2023 - Open source GPT-3

Cerebras (not to be confused with our own Cerebra) trains the GPT-3 architecture using the optimal compute schedule implied by Chinchilla and the optimal scaling implied by μ-parameterization. This vastly outperforms existing GPT-3 clones and represents the first confirmed use of μ-parameterization "in the wild". These models are trained from scratch, meaning the community is no longer dependent on LLaMA.

March 28, 2023 - One hour multimodal training

LLaMA-Adapter uses a novel parameter-efficient fine-tuning (PEFT) technique to introduce instruction tuning and multimodality in one hour of training. Impressively, it does so with just 1.2 million learnable parameters. The model achieves a new SOTA on multimodal ScienceQA.
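The general shape of the technique, as a heavily simplified sketch: a small set of learnable prompt tokens is injected into the frozen model, gated by a zero-initialized scalar so training starts exactly at the pretrained behavior. This illustrates the prefix-adapter idea, not the paper's actual code:

```python
import torch
import torch.nn as nn

class PromptAdapter(nn.Module):
    """Learnable prompt tokens with a zero-initialized gate (simplified)."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))  # zero at init: no effect yet

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Prepend gated prompt tokens to a frozen layer's input sequence.
        batch = hidden.shape[0]
        prompt = self.gate.tanh() * self.prompt
        return torch.cat([prompt.expand(batch, -1, -1), hidden], dim=1)

# With ~10 tokens at dim 4096 across a few dozen layers, the trainable
# parameter count lands in the low millions, matching the 1.2M scale cited.
```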

April 3, 2023 - Real humans can't tell the difference between a 13B open model and ChatGPT

Berkeley introduced Koala, a conversational model trained entirely on free data.

They took the critical step of measuring real human preferences between their model and ChatGPT. While ChatGPT still holds a slight edge, more than 50% of the time users either prefer Koala or have no preference. Training cost: $100.

April 15, 2023 - ChatGPT level open source RLHF

Open Assistant launched a model and, more importantly, a dataset for alignment via RLHF. Their model is close to ChatGPT in terms of human preference (48.3% vs. 51.7%). In addition to LLaMA, they showed that this dataset can be applied to Pythia-12B, giving people the option to run the model using a fully open stack. Moreover, because the dataset is publicly available, RLHF goes from unachievable to cheap and easy for small experimenters.
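What such a preference dataset enables, in sketch form: the standard pairwise loss for training a reward model on human rankings. This is the generic formulation, not Open Assistant's specific training code:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: push the preferred response's scalar
    reward above the rejected one's. The reward model then supervises RLHF."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar rewards for a batch of human preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(preference_loss(chosen, rejected))
```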
