Yann LeCun, Turing Award winner: No one will use GPT models in five years; the world model is the future of AGI

Source | Tencent Technology ID | qqtech

At the opening ceremony of the Beijing Zhiyuan (BAAI) Conference on June 9, 2023, Yann LeCun, one of the three giants of deep learning, delivered a remote keynote entitled "Towards Machines That Can Learn, Reason, and Plan".

LeCun has been dismissive of ChatGPT since its release, arguing that it contains nothing fundamentally new. In today's lecture, delivered from Paris at 4 a.m. local time, he was still full of fighting spirit. He laid out his case against GPT: autoregressive models are fundamentally flawed because they have no ability to plan or reason. A large language model that generates autoregressively, purely by predicting the next token from probabilities, cannot fundamentally solve the problem of hallucinations and errors, and as the generated text grows longer, the chance of error increases exponentially.

The currently popular AutoGPT, chain-of-thought prompting and the like appear to be able to decompose tasks, and augmented language models that work through complex problems step by step make large language models look as if they can plan. LeCun dismissed this too, arguing that they merely lean on search and other external tools to appear capable of planning and reasoning, without relying on any understanding of the world of their own.

Impressive performance but narrow applicability, nowhere near as smart as a human, and with flaws that cannot be fixed: that is LeCun's verdict on current artificial intelligence.

So what is the next step for artificial intelligence on the road to AGI?

LeCun's answer is the world model: an architecture that not only imitates the human brain at the neural level, but also mirrors the brain's cognitive modules. Its biggest difference from a large language model is that it has the ability to plan and predict (the world model) and to weigh costs (the cost module).

Through the world model, the system can genuinely understand the world and predict and plan for the future. Through the cost module, combined with a simple requirement (plan the future along the path that minimizes the cost of action), it can eliminate potential toxicity and unreliability.

But how will this future be realized? How is the world model to be learned? LeCun offered only some directional ideas, such as training with self-supervised learning and building a hierarchical, multi-level mode of reasoning. He also admitted that no deep learning system has been trained to do this before, and nobody yet knows how.

Professor Zhu Jun of Tsinghua University, looking at this architecture, was probably a little puzzled: it looks very much like the ideal model of the symbolist school of classical artificial intelligence. In the Q&A session he asked whether LeCun had considered combining symbolism with deep learning.

LeCun, who spent more than a decade challenging the dominance of Minsky-style symbolism and who stuck to the neural-network path when almost no one believed in it, gave a simple answer: "Symbolic logic is not differentiable, and the two systems are not compatible."

The following is a transcript of the core of LeCun's talk and the full Q&A with Professor Zhu Jun, compiled by the editors of Tencent News:

The pitfalls of machine learning

The first thing I will say is this: compared to humans and animals, machine learning is not particularly good. For decades we have used supervised learning, which requires too many labels. Reinforcement learning works, but it requires an enormous number of trials to learn anything. In recent years, of course, we have been using a lot of self-supervised learning. But it turns out these systems are somewhat specialized and brittle: they make stupid mistakes, they don't really reason, and they don't plan. Of course, their responses are very fast. When we compare them to animals and humans, animals and humans can learn new tasks extremely quickly, understand how the world works, can reason and plan, and have a level of common sense that machines still do not have. This is a problem that was identified in the early days of artificial intelligence.

This is partly due to the fact that current machine learning systems have an essentially constant number of computational steps between input and output. That is why they really can't reason and plan the way humans and some animals can. So how do we get machines to understand how the world works, to predict the consequences of their actions the way animals and humans do, to follow chains of reasoning with an arbitrary number of steps, or to plan complex tasks by breaking them down into sequences of subtasks?

That is the question I want to ask. But before I get to it, let me talk a little about self-supervised learning and how it has really taken over the machine learning world in the past few years. This has been advocated for quite a long time, seven or eight years, and it has really happened: much of the success of machine learning that we see today is due to self-supervised learning, especially in natural language processing, text understanding, and text generation.

So, what is self-supervised learning? Self-supervised learning is the idea of capturing dependencies within the input. We are not trying to map an input to an output; we are just given an input. In the most common paradigm, we mask a portion of the input and feed it to a machine learning system, then we reveal the rest of the input and train the system to capture the dependencies between the part it sees and the part it does not. Sometimes this is done by predicting the missing part, sometimes not exactly.

This idea can be explained in just a few minutes.

This is the idea of self-supervised learning. It's called self-supervised because we are basically using supervised learning methods, but applying them to the input itself rather than to a separate output provided by a human. The example I'm showing here is video prediction, where you show a system a short clip of video and train it to predict what happens next in the video. But it's not just about predicting the future; it could also be predicting missing data in the middle. This type of approach has had amazing success in natural language processing, and all the success we've seen recently with large language models is a version of this idea.

Okay, so as I said, this self-supervised learning technique involves taking a piece of text, deleting some of the words in it, and then training a very large neural network to predict the missing words. In doing so, the neural network learns a good internal representation that can be used for downstream supervised tasks such as translation or text classification. It has been incredibly successful. Also successful are generative AI systems for producing images, videos, or text. In the case of text, these systems are autoregressive: instead of predicting randomly missing words, we use self-supervised learning to predict only the last word. You take a sequence of words, mask out the last word, and train the system to predict it.
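To make the masked-word setup concrete, here is a minimal, hedged sketch of masked-token self-supervised training in PyTorch. It is an editor's illustration, not the systems discussed in the talk; the tiny transformer, the vocabulary size, and the 15% masking rate are all toy assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1000, 64, 0

class TinyMaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                               # tokens: (batch, seq)
        return self.head(self.encoder(self.embed(tokens)))   # (batch, seq, vocab)

model = TinyMaskedLM()
tokens = torch.randint(1, vocab_size, (8, 32))      # a batch of token-id sequences
mask = torch.rand(tokens.shape) < 0.15              # hide roughly 15% of the input
corrupted = tokens.masked_fill(mask, mask_id)       # replace hidden tokens with a [MASK] id

logits = model(corrupted)
# The loss is computed only at the masked positions: predict what was hidden.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```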

These are not necessarily words but subword tokens. Once the system has been trained on a large amount of data, you can use what is called autoregressive prediction: predict the next token, append that token to the input, predict the next token, append it to the input, and repeat the process. That is what autoregressive LLMs are, and that is what the popular models of the past few months or years do. Some of these come from my colleagues at Meta, at FAIR: BlenderBot, Galactica, and LLaMA, which are open source. Stanford's Alpaca is a fine-tuned version of LLaMA. Google has LaMDA and Bard, DeepMind has Chinchilla, and of course there are OpenAI's ChatGPT and GPT-4. If you train them on something like one or two trillion tokens of text, the performance of these systems is phenomenal.
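The autoregressive loop just described (predict the next token, feed it back in, repeat) can be sketched as follows. This is an editor's illustration under the assumption that `model` returns next-token logits for every position; it is not the code of any of the models named above.

```python
import torch

@torch.no_grad()
def generate(model, prompt_tokens, max_new_tokens=50, temperature=1.0):
    """prompt_tokens: (1, seq_len) tensor of token ids; model returns (1, seq, vocab) logits."""
    tokens = prompt_tokens.clone()
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]                       # logits for the last position only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample the next token
        tokens = torch.cat([tokens, next_token], dim=1)        # append it and repeat
    return tokens
```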

But in the end they make really stupid mistakes. They make factual errors and logical errors; they are inconsistent. Their reasoning abilities are limited, they can produce toxic content, and they have no knowledge of the underlying reality because they are trained purely on text, which means a large part of human knowledge is completely out of their reach. And they can't really plan their answers. There is a lot of research on this. Nevertheless, these systems work surprisingly well as writing aids and for generating code to help programmers.

So you can ask them to write code in various languages, and it works quite well. It gives you a great starting point. You can ask them to generate text, and they can elaborate or embellish a story, but that does not make these systems very good as information retrieval systems or search engines, or whenever you just want factual information. They are helpful as writing aids and for generating first drafts, especially if you are not a native speaker of the language you are writing in. They are not well suited to producing factual and consistent answers, so they have to be retrained for that, and even then correct behavior depends on the relevant content being in the training set.

Then there are things like reasoning, planning, and arithmetic, which they are not good at, and for which they call tools such as search engines, calculators, and database queries. This is a very hot research topic at the moment: how to get these systems to call tools to do the things they are not good at. This is the so-called augmented language model. I co-authored a survey paper on this topic with some of my colleagues at FAIR, covering the various techniques being proposed to augment language models. We can easily be fooled by their fluency into thinking they are smart, but they are actually not that smart. They are pretty good at retrieving memories, roughly speaking, but they don't have any understanding of how the world works. Autoregressive models also have a major flaw. Imagine the set of all possible answers, that is, all possible output token sequences, as a tree, represented here by a circle. Within this huge tree there is a small subtree corresponding to the correct answers to the given prompt. If we assume that every token produced has, on average, an independent probability e of taking us outside the set of correct answers, then the probability that an answer of length n is still correct is (1 - e)^n.

This means there is an exponentially divergent process that drives us out of the tree of correct answer sequences, and it is inherent to the autoregressive prediction process. There is no way around it other than making e as small as possible, so we have to redesign the system so that it does not behave this way. Others have pointed out the limitations of these systems as well. I co-wrote a paper, actually a philosophy paper, with my colleague Jacob Browning, who is a philosopher, about the limitations of training AI systems on language alone.
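A quick numerical illustration of the divergence argument above, as a back-of-the-envelope sketch using the talk's assumption that each token independently has probability e of leaving the set of correct answers:

```python
# P(answer of length n is still correct) = (1 - e) ** n under the independence assumption.
for e in (0.01, 0.02, 0.05):
    for n in (10, 100, 1000):
        print(f"e = {e:<4}  n = {n:<4}  P(correct) = {(1 - e) ** n:.4f}")
```

Even with e = 0.01, a 1000-token answer stays correct with probability of only about 0.00004, which is the exponential decay LeCun is pointing to.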

The fact that these systems have no experience of the physical world makes them very limited. There are papers written by cognitive scientists, like the one on the left from the MIT group, that basically say these systems possess very limited intelligence compared to what we observe in humans and animals. There are also papers from traditional AI researchers with little background in machine learning who tried to analyze the planning capabilities of these systems and concluded that they cannot really plan or reason, at least not in the way planning is understood in classical artificial intelligence. So how are humans and animals able to learn so quickly? What we see is that babies learn a tremendous amount of background knowledge about how the world works in their first few months of life. They learn very basic concepts such as object permanence, the fact that the world is three-dimensional, the difference between animate and inanimate objects, the notion of stability, and natural categories, as well as very basic things like gravity: when an object is not supported, it falls. According to the chart drawn up by my colleague Emmanuel Dupoux, babies learn this by about nine months of age.

So if you show a five-month-old baby the scene at the bottom left here, where a cart sits on a platform, you push the cart off the platform, and it appears to float in the air, a five-month-old baby is not surprised. But a ten-month-old baby will be very surprised and will stare at the scene like the little girl at the bottom, because by then the baby has learned that objects are not supposed to stay in the air; they are supposed to fall under gravity. These basic concepts are learned in the first few months of life, and I think we should replicate with machines this ability to learn how the world works by watching the world or experiencing it. So why can any teenager learn to drive in about 20 hours of practice, while we still don't have fully reliable level-5 autonomous driving, at least not without a lot of engineering, maps, lidar, and various sensors? Clearly something very important is missing from the autoregressive approach. Why do we have fluent systems that can pass the bar exam or medical exams, but no domestic robot that can clear the dining table and fill the dishwasher? That is something any ten-year-old can learn in minutes, and we still don't have machines that come close. So we are clearly missing something extremely important. We are nowhere near human-level intelligence with the AI systems we currently have.

Future Challenges of Machine Learning

So how do we do this? In fact, I've sort of identified three big challenges for AI in the coming years:

Learning representations and predictive models of the world, preferably with self-supervised learning.

Learning to reason: this corresponds to ideas in psychology such as Daniel Kahneman's System 1 versus System 2. System 1 covers the actions or behaviors that correspond to subconscious computation, the things you do without thinking about them; System 2 is what you do consciously and deliberately, using your full mental capacity. The autoregressive model basically only does System 1, which is not very smart at all.

The last one is learning to plan complex sequences of actions hierarchically, by breaking complex tasks down into simpler ones.

About a year ago I wrote a position paper and put it up for open review, so you can all take a look. It is basically my proposal for where I think AI research should go over the next ten years. It revolves around the idea that we can organize various modules into a so-called cognitive architecture, and at the heart of this system is the world model.

World Models: The Road to AGI

A world model is something the system can use to essentially imagine a scenario: imagine what will happen, perhaps as a consequence of its own actions. The whole system aims to find, based on the world model's predictions, a sequence of actions that minimizes a series of costs. You can think of the cost as a measure of how uncomfortable this agent is. By the way, many of these modules have corresponding subsystems in the brain: the cost module and the world model correspond to the prefrontal cortex, short-term memory corresponds to the hippocampus, the actor may be the premotor area, and the perception system is the back of the brain, where the perceptual analysis of all the sensors takes place.

The system works by processing the current state of the world, together with previous ideas about the world that it may have stored in memory. Then it uses the world model to predict what would happen if the world just kept evolving, or what the consequences of actions it might take as an agent would be. That happens inside this yellow action block: the actor module proposes a sequence of actions, the world model simulates the world and computes what happens as a consequence of those actions, and then a cost is computed. The system then optimizes the sequence of actions so that the cost predicted by the world model is minimized.

What I should say is that whenever you see an arrow pointing in one direction, there is also a gradient flowing backwards. So I'm assuming all these modules are differentiable, and we can infer the sequence of actions by backpropagating gradients so as to minimize the cost. This is not minimization over parameters: it is minimization over actions, a minimization over latent variables, and it is done at inference time.

So there are really two ways to use the system. One is similar to System 1, which I call Mode 1 here; it is basically reactive. The system observes the state of the world, runs it through the perceptual encoder to produce a representation of the state of the world, and runs that directly through a policy network, so the actor just directly produces an action.

Mode 2 is where you observe the world and extract a representation of the state of the world, s[0]. Then the system imagines a sequence of actions from a[0] up to some horizon T. The predicted states are fed into a cost function, and the whole purpose of the system is basically to find the sequence of actions that minimizes the cost according to those predictions. So here the world model is applied repeatedly at every time step, predicting the world state at time T+1 from the world representation at time T and a proposed action. The idea is very similar to what people in optimal control call model predictive control, and in deep learning a number of models have been proposed that use this idea to plan trajectories.
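Here is a minimal sketch of this Mode-2 procedure: roll a differentiable world model forward over a proposed action sequence, score the predicted states with a cost, and optimize the actions themselves (not any parameters) by gradient descent at inference time. The toy dynamics, the quadratic cost, and all dimensions are the editor's assumptions, not LeCun's implementation.

```python
import torch

state_dim, action_dim, T = 4, 2, 10

def world_model(s, a):            # toy dynamics: s[t+1] = f(s[t], a[t])
    return s + 0.1 * torch.tanh(a @ torch.ones(action_dim, state_dim))

def cost(s, goal):                # "discomfort" = squared distance to the goal state
    return ((s - goal) ** 2).sum()

s0 = torch.zeros(state_dim)
goal = torch.ones(state_dim)
actions = torch.zeros(T, action_dim, requires_grad=True)   # latent variables to optimize
opt = torch.optim.SGD([actions], lr=0.5)

for step in range(200):           # inference-time optimization, not training
    opt.zero_grad()
    s, total = s0, 0.0
    for t in range(T):
        s = world_model(s, actions[t])   # predict the next state
        total = total + cost(s, goal)    # accumulate predicted cost
    total.backward()              # gradients flow back through the world model
    opt.step()                    # update the action sequence, not the model
```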

The question here is: how exactly do we learn this world model? Skipping ahead, what we are looking to do is a more sophisticated version in which we have a hierarchical system that passes through a chain of encoders extracting more and more abstract representations of the state of the world, and uses world models at different levels of abstraction that predict the state of the world over different time scales. A high level here means, for example: if I want to go from New York to Beijing, the first thing I need to do is get to the airport and then catch a plane to Beijing. That would be a high-level representation of the plan, and the final cost function could represent, say, my distance from Beijing. The first action would be: go to the airport, and my state would be: am I at the airport? The second action would be: take a plane to Beijing. How do I get to the airport from, say, my office in New York? The first thing I need to do is go down to the street and hail a taxi and tell the driver to go to the airport. How do I get down to the street? I need to get up from my chair, go to the exit, open the door, walk out to the street, and so on. You can imagine decomposing this task all the way down to the millisecond level, with millisecond-level control needed only at that lowest scale.
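A hypothetical sketch of that decomposition, purely to illustrate the hierarchy LeCun is describing; the task names and the hand-written lookup table are illustrative stand-ins for what a learned hierarchical planner would produce.

```python
def plan(task):
    """Expand an abstract task into progressively lower-level steps."""
    subtasks = {
        "go from my New York office to Beijing": ["go to the airport", "fly to Beijing"],
        "go to the airport": ["get down to the street", "hail a taxi", "ride to the airport"],
        "get down to the street": ["get up from the chair", "walk to the exit",
                                   "open the door", "step outside"],
    }
    if task not in subtasks:        # below this level, hand off to low-level motor control
        return [task]
    steps = []
    for sub in subtasks[task]:
        steps += plan(sub)          # recursively expand each subgoal
    return steps

print(plan("go from my New York office to Beijing"))
```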

So all complex tasks get done hierarchically in this way, and that is a big problem we don't know how to solve with machine learning today. The architecture I'm showing here: nobody has built it yet, and nobody has shown it can be made to work. So I think hierarchical planning is a big challenge.

The cost function can consist of two sets of cost modules and will be modulated by the system to decide what task to accomplish at any given time. So within the cost there are two submodules. Some are built-in, intrinsic costs that are hard-coded and immutable. As you can imagine, those cost functions will enforce safety guardrails to ensure that the system behaves properly and is not dangerous, not toxic, and so on. This is a huge advantage of these architectures: you can optimize the cost at inference time.

You can guarantee that those criteria and those objectives will be enforced and will be met by the system's output. This is very different from an autoregressive LLM, which basically has no way to guarantee that its output is good, non-toxic, and safe.
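A small sketch of the two-part cost just described, combining a hard-coded intrinsic guardrail term with a task-dependent term; the planner then minimizes their sum at inference time so the guardrails constrain every output. The specific penalty and weighting below are illustrative assumptions.

```python
import torch

def intrinsic_cost(state):
    # Hard-coded and never trained: e.g. penalize any state whose "danger" feature is positive.
    return torch.relu(state[..., 0])

def task_cost(state, goal):
    # Trainable or configurable objective for the task at hand.
    return ((state - goal) ** 2).sum(dim=-1)

def total_cost(state, goal, guardrail_weight=10.0):
    # The planner minimizes this sum, so the guardrail term shapes every plan it produces.
    return guardrail_weight * intrinsic_cost(state) + task_cost(state, goal)
```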

Yann LeCun X Zhu Jun Q&A session

Zhu Jun:

Hello, Professor LeCun. Really happy to see you again. I will moderate the question-and-answer session. First of all, thank you again for getting up so early to give such a thoughtful presentation and share so many insights. Given the time constraints, I have selected a few questions to ask you.

As you discussed in your talk, there are many problems with generative models, and I agree with most of what you said, but I do have a question about the basic principles of these models. Generative models, by definition, produce multiple possible outputs, and creativity is a desirable property when we exploit the diversity of generative models, so we are often happy for models to produce diverse results. Does this mean that inconsistencies, such as factual or logical errors, are actually unavoidable for such a model? In many cases, even when you have data, the data may contain contradictory facts. You also mentioned the uncertainty of prediction. So that's my first question. What are your thoughts on this?

Yann LeCun:

That's right. I don't think the problems of autoregressive prediction models and generative models can be solved while keeping autoregressive generation. These systems are inherently uncontrollable. So I think they will have to be replaced by the kind of architecture I'm proposing, where reasoning happens at inference time by optimizing a cost against certain criteria. That is the only way to make them controllable, steerable, and capable of planning, that is, able to plan their answers. When you give a presentation like the one I just gave, you plan the course of the presentation, right? You go from point to point and explain each point. When you design a talk, you plan it in your head rather than improvising word by word the way a large language model does. Maybe at a lower level you are improvising, but at a higher level you are planning. So the need for planning is really obvious, and the fact that humans and many animals have the ability to plan is, I think, an intrinsic property of intelligence. So my prediction is that within a relatively short number of years, certainly within five years, no one in their right mind will still be using autoregressive LLMs. These systems will soon be abandoned, because they cannot be fixed.

Zhu Jun:

OK. I guess another question about control: in your design and framework, a key component is the intrinsic cost module, right? It is basically designed to determine the nature of the agent's behavior. After reading the open reviews of your position paper, I share a concern raised in one of the online comments, which says essentially that this module might not work as specified. Maybe the agent finally [screen freezes].

Yann LeCun:

Getting the system's cost module right will not be a trivial task, but I think it is a fairly well-defined one. It requires a lot of careful engineering and fine-tuning, and some of that cost may be obtained through training rather than just design. This is very similar to policy evaluation in reinforcement learning (the critic in an actor-critic structure, which evaluates the outputs the actor produces) or to the so-called reward model in the context of LLMs: a module that looks at the system's internal state as a whole and maps it to a cost. You can train a neural network to predict the cost by exposing it to lots of outputs, letting the system generate lots of outputs and having someone or something rate them. That gives you a target for a cost function. You can train the network to predict it, and then backpropagate gradients through it to make sure the cost function is satisfied. So I think that, in terms of design, we are going to have to move from designing architectures and designing LLMs to designing cost functions, because these cost functions will drive the properties and behavior of the system. Contrary to some of my colleagues who are more pessimistic about the future, I think it is very feasible to design costs that are consistent with human values. It is not the case that getting it wrong once means an AI system escapes control and takes over the world, and there are many ways to design these things well before we deploy them.
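As a rough sketch of the "train a network to predict the cost from rated outputs" idea mentioned above (an editor's illustration with made-up embedding sizes, mirroring a reward-model or critic setup rather than any specific system):

```python
import torch
import torch.nn as nn

cost_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(cost_model.parameters(), lr=1e-3)

def train_step(output_embeddings, ratings):
    """output_embeddings: (N, 128) representations of generated outputs.
    ratings: (N, 1) scores from human or automated raters, where higher means worse."""
    predicted_cost = cost_model(output_embeddings)
    loss = nn.functional.mse_loss(predicted_cost, ratings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Once trained, the cost model is differentiable, so gradients can be backpropagated
# through it at inference time to steer the system toward low-cost outputs.
```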

Zhu Jun:

I agree with that. So, another technical question related to this: I noticed that in your hierarchical JEPA design almost all the modules are differentiable, right? So you can train them with backpropagation. But there is another field, symbolic logic, which represents the non-differentiable part, and perhaps in the intrinsic cost module it could somehow express the constraints we want. Do you have any particular ideas about connecting the two fields, or would you simply set symbolic logic aside?

Yann LeCun:

Right. So yes, the reality is that there is a subfield of neuro-symbolic architectures that tries to combine trainable neural networks with symbolic manipulation or something like that. I am very skeptical of these approaches, because symbolic manipulation is not differentiable, so it is basically incompatible with deep learning and gradient-based learning, and certainly with the kind of gradient-based inference I am describing. So I think we should make every effort to use differentiable modules everywhere, including for the cost functions. Now there may be a certain number of situations where the cost we can implement is not differentiable, and in that case the optimizer performing inference may have to use combinatorial optimization rather than gradient-based optimization. But I think that should be a last resort, because zeroth-order, gradient-free optimization is much less efficient than gradient-based optimization. So if you can have a differentiable approximation of your cost function, you should use it whenever possible. To some extent we already do this: when we train a classifier, the cost we actually want to minimize, the error rate, is not differentiable, so we minimize a differentiable proxy, the cross-entropy between the system's output and the desired output distribution, or something like squared error or hinge loss. These are basically upper bounds on the non-differentiable 0-1 loss, which we cannot easily optimize. So in the same old-fashioned way, we have to use a cost function that is a differentiable approximation of the cost we actually want to minimize.
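A small numerical sketch of that last point, comparing the non-differentiable 0-1 loss with two differentiable surrogates, the hinge loss and the base-2 logistic (cross-entropy-style) loss, both of which upper-bound the 0-1 loss. This is the editor's illustration of the general idea, not code from the talk.

```python
import torch

margin = torch.linspace(-2.0, 2.0, steps=9)         # y * f(x): positive means correct side
zero_one = (margin <= 0).float()                     # the loss we actually care about
hinge = torch.clamp(1 - margin, min=0)               # differentiable upper bound
log_loss = torch.log2(1 + torch.exp(-margin))        # base-2 logistic loss, also an upper bound

for m, z, h, l in zip(margin.tolist(), zero_one.tolist(), hinge.tolist(), log_loss.tolist()):
    print(f"margin={m:+.2f}  0-1={z:.0f}  hinge={h:.2f}  log-loss={l:.2f}")
```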

Zhu Jun:

My next question is inspired by our next speaker, Professor Tegmark, who will give a live talk after you. We have heard that you will be taking part in a debate about the current state and future of AGI. Since most of us will not be able to attend, could you share some of the key points in advance? We would love to hear your insights on this.

Yann LeCun:

Okay, this is going to be a debate with four participants. The debate will revolve around the question of whether AI systems pose an existential risk to humanity. Max and Yoshua Bengio will argue the "yes" side: that powerful AI systems could pose an existential risk to humanity. On the "no" side will be me and Melanie Mitchell of the Santa Fe Institute. Our thesis will not be that AI is risk-free; our contention is that these risks, while present, can be mitigated or kept in check through careful engineering. My argument is that asking people today whether we can guarantee that superintelligent systems are safe for humans is an unanswerable question, because we do not have a design for a superintelligent system. You cannot make something safe until you have a basic design for it. It's like asking an aerospace engineer in 1930, "Can you make a turbojet safe and reliable?" The engineer would answer, "What's a turbojet?" because turbojets had not been invented in 1930. We are in a similar situation. It is a bit premature to claim that we cannot make these systems safe, because we haven't invented them yet. Once we invent them, and maybe they will resemble the blueprint I have proposed, then it will be worth discussing how to make them safe, and it seems to me the answer will be to design the objectives that they minimize at inference time. That is what makes a system safe. Obviously, if you imagine that the superintelligent AI systems of the future will be autoregressive LLMs, then of course we should be afraid, because these systems are not controllable; they may escape our control and spout gibberish. But a system of the type I am describing can, I think, be made safe, and I am pretty sure it will be. It will require careful engineering design. It will not be easy, just as it has not been easy to make turbojets reliable over the past seven decades. Turbojets are now incredibly reliable: you can cross an ocean in a twin-engine airplane with essentially unbelievable safety. That took careful engineering, and it is really difficult; most of us have no idea how turbojets are designed to be safe. So it is not crazy that it is hard to imagine today how to make a superintelligent AI system safe; it simply has not been designed yet.

Zhu Jun:

OK, thank you for your insights and answers. Speaking also as an engineer, I thank you again. Thank you so much.

Yann LeCun:

Thank you very much.


Source: blog.csdn.net/lqfarmer/article/details/131308914