After foundation models with tens and hundreds of billions of parameters, are we entering a data-centric era?

In recent years, the emergence of foundation models such as GPT-3, CLIP, DALL-E, Imagen, and Stable Diffusion has been astonishing. These models exhibit generative and in-context learning capabilities that were unimaginable just a few years ago. This article explores the commoditization of these large-scale technologies. Such models are no longer the exclusive domain of industry giants; their value increasingly lies in how a field and its key problems are described, and at the core of that is data. The full impact of the rapid development of foundation models is still unclear, so much of what follows is speculation.

prompt: "taco cat" (don't take it too seriously)

 

From a machine learning perspective, the notion of a task is absolutely fundamental: we create training data to specify a task and generalize by training on it. For decades, the field has therefore held two main beliefs:

  • "Useless input, useless output", that is, the data/feature information input to the model determines the success or failure of the model.
  •  "Too many parameters will lead to overfitting." In the past 20 years, the development of general and sparse models has become popular. A common belief is that sparse models with fewer parameters help reduce overfitting and thus generalize better.
These views are generally sound, but they are also somewhat misleading.
Foundation models are changing our understanding of tasks because a single model can be trained on broad data and applied to many tasks. Even users who do not yet have a clear picture of their target task can apply these models without task-specific training. The models can be steered through natural language or other interfaces, which lets domain experts drive them directly and hope to experience the magic immediately in new settings. In this exploratory process, a user's first step is not to curate a task-specific training set but to ponder, sketch, and rapidly iterate on ideas. With a foundation model in hand, we want to understand how it transfers to a whole range of tasks, including some we have not anticipated yet.
To benefit from the next wave of AI development, we may need to re-examine the limitations (and the wisdom) of the previously dominant views. In this post, we start from there, discuss what is changing with foundation models, and conclude with how we see foundation models fitting alongside traditional approaches.

Garbage in, garbage out - that's it?

Task-agnostic foundation models are exploding. So far, much of the attention has gone to model architecture and engineering, but signs of convergence are starting to appear. Is there precedent for data becoming the foundation and the key point of differentiation? In supervised machine learning we have already seen the pendulum swing between model-centric and data-centric approaches.
In a series of projects in the second half of the 2010s, feature quality was key. In the old paradigm, features were the vehicle for encoding domain knowledge. Those features were brittle, and practitioners had to master low-level details of how to represent that information in order to get more stable and reliable predictions.
Deep learning succeeded precisely because people are bad at this. The deep learning revolution was in full swing, and the stream of new models appearing on arXiv was mind-boggling. These models took previously manual steps, such as feature engineering, and automated them entirely: given raw data such as text and images, deep models learn good representations on their own, which was a huge productivity boost. But these models are not perfect, and domain knowledge still matters. So how do you get it into the model?
We saw that users were using training data as the vehicle for injecting information, describing their application, and interacting with the model, but all of it happened in the dark, without tools, theory, or abstractions. We thought users should have basic programmatic abstractions over their own data, and the Snorkel project (and later the company) was born. Conceptually, this pushed us into the era of data-centric AI and weak supervision. Two important lessons follow:

  • Once a technology stabilizes, value shifts back to the data. In this case, with the emergence of frameworks such as TensorFlow, PyTorch, MXNet, and Theano, deep learning itself became commoditized, but describing a specific problem still requires specifying its data distribution, task definition, and so on. Success therefore depends on how well the relevant information is brought into the model.
  • We can (and must) handle noise. Basic mathematics and engineering can, in principle, manage noise. Users find it hard to express their knowledge perfectly in training data, and different data sources vary in quality. While working on the theory of weak supervision, we found that models can learn a great deal from noisy data (not all messy data is bad data). In short, avoid feeding in useless information, but don't be too picky about the data either; a minimal sketch of this programmatic, noise-tolerant labeling idea follows below.
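To make the idea concrete, here is a minimal, self-contained sketch of programmatic weak supervision in the spirit of Snorkel. The labeling functions and the simple vote-counting combiner are illustrative assumptions for this post, not Snorkel's actual API; a real label model would also estimate each source's accuracy rather than weighting every vote equally.

```python
# Minimal sketch of programmatic weak supervision in the spirit of Snorkel.
# The labeling functions and the majority-vote combiner below are
# illustrative assumptions, not Snorkel's actual API.

ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_refund(text: str) -> int:
    """Noisy heuristic: mentions of 'refund' suggest a complaint."""
    return POS if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text: str) -> int:
    """Noisy heuristic: gratitude suggests a non-complaint."""
    return NEG if "thank" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks]

def weak_label(text: str) -> int:
    """Combine noisy votes; a real label model would also estimate
    each source's accuracy instead of counting votes equally."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POS if sum(v == POS for v in votes) >= sum(v == NEG for v in votes) else NEG

if __name__ == "__main__":
    docs = ["I want a refund immediately", "Thanks, that solved it"]
    print([weak_label(d) for d in docs])  # -> [1, 0]
```

The point is not the heuristics themselves but the abstraction: knowledge enters the pipeline as code over data, where it can be inspected, combined, and corrected.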
prompt: "noisy image". Did you see anything interesting from the noisy image?

Put simply, data encodes your problem and your analysis; even when the technology is commoditized, the value of the data remains. So it's not that garbage is suddenly good; rather, the distinction should not be drawn too absolutely. Data is useful or useless depending on whether it is exploited in the most effective way.
Foundation models are trained on huge amounts of data and applied across many tasks, which brings new challenges for data management. As models and architectures continue to be commoditized, we need to understand how to manage massive amounts of data efficiently so that the models generalize across the ways they are used.
 

Too many parameters can lead to overfitting?


Why do we see these magical in-context capabilities? How much do modeling choices (architecture and algorithms) contribute? Do the magical properties of large language models come from some mysterious model configuration?
About a decade ago, the rough generalization theory in machine learning held that if a model is parsimonious (that is, unable to fit too many spurious features), then it will generalize. There are more precise statements of this, and they are major achievements of theory such as VC dimension and Rademacher complexity. Along the way, it began to seem that a small number of parameters was also necessary for generalization. But that is not the case. Overparameterization was considered a major problem, yet large models now stand as counterexamples: models with far more parameters than data points can fit all kinds of wildly complicated functions (even random labels), and still generalize.
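As a toy illustration of that counterexample (ours, not from the original post), the snippet below fits completely random labels with a linear model that has more parameters than data points. Interpolating noise perfectly is easy in this regime, so parameter count alone cannot be what explains generalization.

```python
# With more parameters than data points, even a linear model can fit
# completely random labels exactly (generic Gaussian features).
import numpy as np

rng = np.random.default_rng(0)
n_points, n_params = 50, 200                  # overparameterized: 200 > 50
X = rng.normal(size=(n_points, n_params))
y = rng.integers(0, 2, size=n_points).astype(float)   # random labels

# Minimum-norm interpolating solution via the pseudo-inverse.
w = np.linalg.pinv(X) @ y
train_error = np.abs(X @ w - y).max()
print(f"max training error on random labels: {train_error:.2e}")  # ~1e-13
```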
The idea of overparameterization misled us, and recent insights have opened new directions. We see magical capabilities emerge in these large models, but the prevailing notion is that they are unlocked only by specific architectures trained at a scale few people can access. One direction of our work, and of others, is to try to coax these magical capabilities out of simple, classical models. Our recent state-space models build on decades of signal-processing work (and thus fit the classical mold), yet they exhibit some in-context capability.
Even more surprising, even the classic bidirectional BERT models show in-context ability! We suspect many people are still writing related papers; send them to us and we will read and cite them carefully. We believe the magical properties of in-context learning are all around us, and that the universe is more magical than we understand. Or, to put it more soberly, perhaps humans are simply not very good at reasoning about conditional probability.
Things seem to be working well within the big-model framework. The magical capabilities of foundation models appear stable and commoditizable, and data is seen as the point of differentiation within that.
 

Maybe it's time for data-centric foundation models?


Are we repeating the data-centric shift we saw in supervised learning? In other words, are the models and their engineering becoming commoditized?
The rise of commoditized models and open-source information. We are seeing foundation models being commoditized and put to work; it feels a lot like what happened with deep learning. For us, the strongest evidence of a model's commoditization is its availability. Two main forces are at play: people have needs (stability and so on), and large companies can meet them. The rise of open source did not come from hobbyist interest alone; large corporations and others decided they needed such things (see the rise of Python).

Waiting for the newest mega-company to come out with a new oversized model?

Where does the biggest differentiation come from? Data! The tools are increasingly available, but foundation models are not necessarily usable off the shelf. How should they be deployed? Wait for the next super-company to release the next super-sized model? That is one option, but we would call it nihilism. Whether such a model will be open source is hard to say. What about foundation-model applications on private data that cannot be sent to an API? Will the model have 100 trillion parameters, and how many users will be able to access and use it? What was the model trained on? Mostly public data, so there is little guarantee it will know anything about what you actually care about.

How do you preserve the magic of the foundation model and make it work for you? You need to manage the data behind foundation models effectively (data is critical!) and exploit great open-source models at test time (adjusting the input and context data at test time is critical!): data management and data-centric scaling.

Prediction: smarter dataset collection methods will lead to smaller but finer models. The scaling-law papers that opened our eyes deserve attention, such as OpenAI's original scaling-law study and DeepMind's Chinchilla. Although we now have a default reference architecture (Transformers), token counts only partly capture the information content of the data. Experience tells us that data varies widely in subject matter and quality. We have a hunch that what really matters is the actual bits of information, accounting for overlap and ordering; an information-theoretic quantity like entropy may drive the evolution of both large and small foundation models.
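As a rough illustration of what such scaling rules imply, the sketch below applies the widely cited Chinchilla-style rule of thumb (roughly 20 training tokens per parameter, with training compute approximated as 6·N·D FLOPs). The constants are approximations drawn from the literature, not exact values from those papers.

```python
# Back-of-the-envelope sketch of a Chinchilla-style compute-optimal split.
# Assumes C ≈ 6·N·D FLOPs and D ≈ 20·N tokens; both constants are rough.
import math

def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly balance model size and data
    for a given training-compute budget."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal(budget)
        print(f"C={budget:.0e} FLOPs -> ~{n/1e9:.1f}B params, ~{d/1e9:.0f}B tokens")
```

Under these assumptions, a larger budget is best spent on more data as well as more parameters, which is exactly why dataset collection and curation become the differentiator.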
Information input and computation at test time. A foundation model is not necessarily usable off the shelf, but test-time computation can be spent in quite different ways. Given the cost and privacy drawbacks of closed-source model APIs, we recently showed that an open-source foundation model with 30x fewer parameters can beat OpenAI's closed-source model on the canonical benchmark by using the small model efficiently at test time; the approach is called Ask Me Anything (AMA) Prompting. At test time, users steer the foundation model with prompts, natural-language descriptions of the tasks they care about, and prompt design can have a huge impact on performance. Getting a prompt exactly right is complex and difficult, so AMA instead uses a collection of noisy prompts of varying quality and handles the resulting noise with statistical theory. AMA has many sources of inspiration: Maieutic Prompting, Reframing GPT-k, AI Chains, and more. The point is that we can spend test-time computation in new ways, rather than prompting the model just once. It is not only about data management at training time but also about adjusting input and context data at test time.
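The sketch below captures the spirit of this idea under simplifying assumptions: several noisy reformulations of a question are sent to a model and their answers are aggregated. The `query_model` callable is a hypothetical stand-in for any text-generation API, and the plain majority vote stands in for AMA's weak-supervision aggregation.

```python
# Minimal sketch of AMA-style prompting: ask several noisy reformulations
# of the same question and aggregate the answers. `query_model` is a
# hypothetical stand-in for a real model call; AMA itself aggregates with
# a weak-supervision label model rather than a simple majority vote.
from collections import Counter
from typing import Callable, List

def make_prompt_variants(question: str, context: str) -> List[str]:
    return [
        f"{context}\nQuestion: {question}\nAnswer:",
        f"{context}\nBased on the passage above, {question}\nAnswer yes or no:",
        f"Passage: {context}\nIs the following true? {question}\nAnswer:",
    ]

def ama_answer(question: str, context: str,
               query_model: Callable[[str], str]) -> str:
    answers = [query_model(p).strip().lower()
               for p in make_prompt_variants(question, context)]
    # Treat each prompt as a noisy voter and return the most common answer.
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    fake_model = lambda prompt: "yes"   # stand-in for a real model call
    print(ama_answer("Does the passage mention a dog?", "A dog barked.", fake_model))
```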

prompt: "really small AI model"

 

From AMA we can see that small models already have solid reasoning ability across many tasks, while the key value of large models seems to lie in memorizing factual data. Small models do underperform there, so how do we bring in data and information to close the gap? Oddly enough, we use SGD to store facts in a neural network, converting them into fuzzy floating-point values, an abstraction that seems far less efficient than a DRAM-backed key-value store. Yet the AMA results suggest that the gap between small and large models is much smaller on facts that change over time or are domain-specific. We are building self-supervised models that can edit the facts they retrieve (for obvious business reasons), which requires additional software tooling to run as a service. So it is very important to let the model call an index. Time will tell whether this is a good enough reason to use this kind of model.
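A minimal sketch of what "having the model call an index" could look like appears below. The fact store, the `retrieve` helper, and the `small_model` callable are all hypothetical stand-ins, not part of any released system; the point is simply that facts live in an editable key-value store rather than in the model's weights.

```python
# Sketch: time-varying facts live in an external key-value store that can be
# edited in place, and the model only reasons over what the retriever returns.
# `small_model` is a hypothetical stand-in for any small open model.
from typing import Callable, Dict

FACT_INDEX: Dict[str, str] = {
    "ceo of acme corp": "Jane Doe",        # editable without retraining
    "capital of france": "Paris",
}

def retrieve(query: str) -> str:
    key = query.lower().rstrip("?")
    return FACT_INDEX.get(key, "no matching fact")

def answer_with_retrieval(query: str, small_model: Callable[[str], str]) -> str:
    fact = retrieve(query)
    prompt = f"Fact: {fact}\nQuestion: {query}\nAnswer:"
    return small_model(prompt)

if __name__ == "__main__":
    echo_model = lambda p: p.split("Fact: ")[1].split("\n")[0]  # stand-in
    print(answer_with_retrieval("CEO of Acme Corp?", echo_model))  # Jane Doe
```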
Where does this lead? Foundation models working hand in hand with traditional methods. Assume data-centric progress continues at both ends, exploration and deployment. In a fast-iterating, task-agnostic exploration phase, we make off-the-shelf general-purpose foundation models more useful and efficient through data management and test-time strategies. Users who leave the exploration phase with a clearer task definition can then apply data-centric AI and manage their training data (your own data matters), Snorkel-style, leveraging and combining multiple prompts and/or foundation models to train smaller, faster "proprietary" models. These models can be deployed in real production environments and are more accurate on specific tasks and specific data. Alternatively, foundation models can be used to improve weakly supervised techniques; some lab and Snorkel members have won a UAI award for this. A minimal sketch of this handoff appears below.
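Here is a minimal sketch of that exploration-to-deployment handoff under stated assumptions: `fm_label` is a hypothetical stand-in for a prompted foundation model, and its noisy labels are distilled into a small scikit-learn classifier that can be deployed on its own.

```python
# Sketch of the "exploration -> deployment" handoff: noisy labels produced by
# prompting a foundation model are distilled into a small, fast model.
# `fm_label` is a hypothetical stand-in, not a real API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def fm_label(ticket: str) -> int:
    """Stand-in: pretend a prompted foundation model tags bug reports as 1."""
    return 1 if any(w in ticket.lower() for w in ("crash", "error")) else 0

unlabeled_tickets = [
    "App crashes when I open settings",
    "Getting an error on login",
    "Please add a dark mode",
    "Would love an export-to-CSV feature",
]
weak_labels = [fm_label(t) for t in unlabeled_tickets]

# Distill the weak labels into a small, deployable classifier.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(unlabeled_tickets)
small_model = LogisticRegression().fit(features, weak_labels)

print(small_model.predict(vectorizer.transform(["error when saving a file"])))
```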
At the end of the day, it is data that takes a model to production. Data is the only thing that is not commoditized. We still believe Snorkel's view of data is the way forward: you need programmatic abstractions, a way to express, combine, and iteratively correct different data sources and supervision signals in order to train deployable models for your end task.

Original link:

https://hazyresearch.stanford.edu/blog/2022-10-11-datacentric-fms?continueFlag=9b370bd6ba97021f1b1a646918a103d5

In the current data-centric era, the field of machine vision is also iterating rapidly. Coovally is a typical data-centric machine vision platform that covers the complete AI modeling workflow, AI project management, and AI system deployment management. It helps users quickly batch-evaluate the performance of a variety of machine learning and deep learning models, greatly lowering the barrier to putting AI models into engineering practice, and it provides packaged AI capabilities for business users, teaching them how to fish rather than handing them a fish.

Coovally not only provides visual modeling for the entire deep learning workflow, but also has a built-in hyperparameter search engine, a resource scheduling engine, and an assisted labeling engine. It offers online inference services and multi-resource sharing, and is truly data-centric. Coovally is already widely used in scenarios such as manufacturing quality inspection, geological disaster monitoring, power-industry equipment monitoring, diagnosis of specific medical conditions, intelligent transportation, and smart parks.

Origin blog.csdn.net/Bella_zhang0701/article/details/128575218