Controversies and Limitations of Large Language Models


Yoav Goldberg, a professor at Bar-Ilan University in Israel, shares his views on the capabilities and limitations of large language models, and on his position regarding language understanding. (The following content was compiled and published by OneFlow with the author's authorization. Please contact OneFlow for permission before reprinting this translation. Original text: https://gist.github.com/yoavg/59d174608e92e845c8994ac2e234c8a9)

Author | Yoav Goldberg

Compiled and translated by OneFlow | Yang Ting, Xu Jiayu, Jia Chuan

1

Introduction

Around 2014-2017, as neural-network-based NLP methods were taking off, I gave a semi-academic, semi-popular lecture arguing that perfect language modeling amounts to human-level intelligence. Around the same time, on an academic panel, someone asked: what would you do if you had unlimited computing power and no labor costs to worry about? My answer at the time was, "I would train an enormous language model, just to show that computing power alone does not solve everything." Yes, I know that sounds glib. But is it really so? And how does it square with the "perfect language modeling is intelligence" story I just told?

2

Perfect language modeling is AI-complete

My lecture, "Teaching Computers to Understand Language", revolved around Claude Shannon's "guessing game" and language modeling. It opened with game-playing AI, then quickly turned to "a different kind of game" that Shannon invented in 1951: "guess the next letter". The game operator chooses a passage of text, reveals only a prefix, and hides the rest; the player has to guess the first hidden letter in as few guesses as possible.
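For concreteness, here is a toy sketch of the game loop (my own illustration, not from the lecture): the operator hides the continuation of a chosen text, and the player's score is the number of guesses needed to find the next hidden letter. A language model plays this game by ranking possible next letters by their probability given the prefix.

```python
# Toy version of Shannon's letter-guessing game: the "player" is any function
# that proposes letters in some order of preference given the visible prefix.
import string

def play_round(text: str, prefix_len: int, player) -> int:
    """Return how many guesses the player needs for the first hidden letter."""
    prefix, hidden = text[:prefix_len], text[prefix_len:]
    target = hidden[0]
    for n_guesses, guess in enumerate(player(prefix), start=1):
        if guess == target:
            return n_guesses
    raise ValueError("player never guessed the hidden letter")

def naive_player(prefix: str):
    # A (bad) player that ignores the prefix and guesses in a fixed order;
    # a language model would instead rank letters by how likely each one is
    # to continue the prefix.
    yield from " " + string.ascii_lowercase

print(play_round("the cat sat on the mat", prefix_len=12, player=naive_player))
```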

To show what playing the game well requires, I gave several examples drawing on different levels of linguistic knowledge and language understanding (from morphology through several levels of syntax, semantics, pragmatics, and sociolinguistics). The upshot is that people are so good at this game without any practice that there is no real room for them to improve, which is why players don't find it a very interesting game.

I then noted that computers are still much worse at this game than humans, but that in the process of training computers to play it, we acquire a great deal of implicit knowledge about language. And while perfect language modeling is still a long way off, we have been making steady progress, and this is how machine translation works today!

I also said that computers were still not very good at this, and understandably so: the game is "AI-complete", and truly playing it "at a human level" would mean solving every other problem in AI and exhibiting human-like intelligence.

Why? Because the game entails completing arbitrary text prefixes, including very long ones, including dialogues and every possible conversation prefix, including every description of experience that can be expressed in human language, and including every answer to every question that could be asked, advanced mathematics, philosophy, and more.

In short, to play this game well you need to understand the text, understand the situation it describes, and be able to put yourself in that situation and respond, which really does amount to imitating human experience and thought. (One might push back by noting that humans would also want to ask about images, scenes, or other perceptual inputs that the model never observes, but I think you get my drift.)

That is Shannon's guessing game (a.k.a. "language modeling"), and that is why playing it at a human level requires human-level intelligence.

3

Building large language models does not solve all problems

If achieving perfect language modeling requires intelligence (is "AI-complete"), why did I insist that building the largest possible language model would not "solve everything"? Was I wrong?

The answer is that I did not believe a very large language model built on the technology of the time (RNNs/LSTMs or Transformers) would bring us anywhere close to "perfect language modeling".

So was I wrong? In part, yes. I was certainly surprised by the power that large language models have demonstrated. It turns out there is a "phase transition" somewhere between 60B and 175B parameters that lets these models show astonishing strength. Large language models do far more than I thought an RNN/LSTM/Transformer trained on text could ever do; when I said "they won't solve everything", they now do everything I had in mind at the time.

Current language models (the first version of ChatGPT) really have "solved" all the problems about language understanding that worried me then, and in that sense I was wrong. But in another sense I was not, because they have not solved everything, at least not yet. Moreover, and importantly, the performance of today's language models is not achieved through language modeling alone, in the sense I had in mind back then, as I will elaborate below.

Next, I will briefly describe how current-day LMs differ from what people previously understood language models to be, list some problems that, in my view, large language models have not yet "solved", and also mention some arguments that are valid but neither relevant nor interesting for my purposes.

4

Natural Language Modeling vs Curated Language Modeling

What does "Current language model performance not achieved by language modeling" mean? As far as I know, the first version demonstration of a large language model (170B parameter level, GPT-3) is trained on natural text data, that is, the training data comes from books, the Internet, social networks, etc., and later Series models (BLOOM, OPT) also use similar data. This is very close to Shannon's game, and also what most people think of as "language modeling" in the past few decades, these models have excellent performance. But ChatGPT is different from this.

How is ChatGPT different? There are three conceptual steps between GPT-3 and ChatGPT: instruction tuning, code, and RLHF. All three are interesting in my view; RLHF interests me comparatively the least, though it has received the most attention. This account is somewhat informal; perhaps one day I will turn it into a more formal argument, and I hope readers can take some inspiration from it in the meantime.

Training a model on "plain text" data, as "traditional" language models are trained, does have some obvious theoretical limitations. The most obvious is that such training has no connection to anything "outside the text", so the model cannot acquire "meaning" or "communicative intent"; in other words, such a model is not "grounded". The symbols it operates on are just symbols: they can relate to one another, but they are not anchored in the real world. The model may know the symbol "blue", but it has no grounding in the real-world "blue" that the symbol refers to.

Previously, models were trained only on "found" data. With instruction fine-tuning, model trainers began training on found data together with specific data created by humans for the purpose (in machine learning this is called "supervised learning", i.e. learning from annotated examples). For example, a human annotator writes something like "Please summarize this text", followed by a text and its summary, or "Translate this text into formal language", followed by a text and its formal rewrite. Annotators create many such instructions for many tasks (summarization, translation, and so on), and these examples are then added to the model's training data.
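For concreteness, here is a minimal sketch, with invented field names and toy content, of how such instruction examples might be written down and flattened back into plain text, so that the model can still be trained on them by ordinary next-token prediction. Actual instruction datasets differ in format and scale; treat this purely as an illustration.

```python
# Hypothetical instruction-tuning records: the field names and wording are
# invented for illustration; real datasets differ in format and scale.
instruction_examples = [
    {
        "instruction": "Please summarize this text",
        "input": "The cat sat on the mat. It then fell asleep in the sun.",
        "output": "A cat napped on a mat in the sun.",
    },
    {
        "instruction": "Translate this text into formal language",
        "input": "gonna grab food, brb",
        "output": "I am going to get some food and will return shortly.",
    },
]

def to_training_text(example: dict) -> str:
    """Flatten an annotated example into plain text, so the model can still be
    trained on it with ordinary next-token prediction."""
    return f"{example['instruction']}\n\n{example['input']}\n\n{example['output']}"

for ex in instruction_examples:
    print(to_training_text(ex))
    print("---")
```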

Why is this important? At its core, the model is still doing language modeling, learning to predict the next word from text alone, but the human annotators inject a degree of grounding into that text: certain symbols (such as "summarize", "translate", "formal") are consistently used together with the concepts or tasks they denote.

Because these symbols (or "instructions") always appear at the beginning of the text, somewhat apart from the rest of the data, the model can associate the human concept of a "summary" with the act of producing a summary. In other words, this helps the model learn the communicative intent of a user who asks for a "summary" in their "instruction".

Some may object that such cases already occur naturally in large text collections and that the model has learned from them there, so what is new? I would argue, however, that it is probably much easier to learn from direct instructions than from non-instruction data (compare being told "this is a dog" with having to infer it from what people say about dogs). Moreover, shifting the distribution of the training data toward these annotated cases can fundamentally change the model's behavior and the degree of "grounding" it has, and explicit instruction data likely requires far less training text to get there.

In addition, the latest generation of models is also trained on programming-language code, which includes natural-language instructions and descriptions (in the form of code comments) together with the corresponding programming-language code. Why is this important? Because it produces a very direct form of "grounding".

We have two separate systems in the text stream: one for human language and one for programming language.

We observe direct interaction between these two systems: the human language describes a concept (or intent), which is then realized in the form of a corresponding program. This direct interaction amounts to a "form-to-meaning pairing", and we can learn much more from it than from "form alone". (I also suspect the latest models are trained on executions, i.e. pairs of programs and their outputs, which is an even stronger form of grounding: denotations. This is no longer "just" language modeling.)
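To make this "form-to-meaning pairing" concrete, here is a small illustration of my own (not from the original text) of the kind of material code corpora naturally contain: a natural-language description, the program that realizes it, and, where executions are included, the program's output.

```python
# A natural-language description (form) paired with the program that
# realizes it (meaning), the kind of pairing code corpora contain "for free".

# "Return the three largest numbers in a list, in descending order."
def three_largest(numbers: list[int]) -> list[int]:
    return sorted(numbers, reverse=True)[:3]

# If execution traces are also in the training data, the pairing gets even
# stronger: the program is linked to its denotation (its actual output).
print(three_largest([4, 1, 9, 7, 3]))  # -> [9, 7, 4]
```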

Finally there is RLHF ("reinforcement learning from human feedback"). Here, roughly speaking, the model observes conversations between two humans, one playing the user and the other playing the "AI", which demonstrate how the AI should respond in different situations. This helps the model learn how to conduct a conversation and how to keep track of information across the state of the conversation (something that is very hard to learn from "found" data alone). These human demonstrations are also the source of all the "It is not appropriate to..." and other formulaic/templated responses we observe from the models. It is a way of training the model, by demonstration, to "behave nicely".
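As a rough illustration, here is a minimal sketch of what such a demonstrated conversation might look like once written down as training data. The role labels, format, and content are invented for illustration; they are not the actual data or format used for ChatGPT.

```python
# A hypothetical demonstration dialogue, written by two human annotators
# (one playing "user", one playing "assistant"); formats used in practice
# are not public in this detail, so this is purely illustrative.
demonstration = [
    {"role": "user", "content": "Can you help me write an apology email to a client?"},
    {"role": "assistant", "content": "Of course. Could you tell me what happened and who the client is?"},
    {"role": "user", "content": "We shipped their order two weeks late."},
    {"role": "assistant", "content": "Here is a draft: 'Dear ..., we sincerely apologize for the two-week delay...'"},
]

def render_dialogue(turns: list[dict]) -> str:
    """Flatten a multi-turn demonstration into one text sequence, so the model
    can be trained on it with ordinary next-token prediction."""
    return "\n".join(f"{t['role'].upper()}: {t['content']}" for t in turns)

print(render_dialogue(demonstration))
```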

These, then, are the three conceptual steps behind ChatGPT (there may of course be others). Together they are why I consider it very different from "traditional" language models, why it may not "obey" some of the limitations we (or I) expect language models to have, and why it performs better on so many tasks: it is a supervised model with access to additional modalities, and it has also been explicitly trained by demonstration to follow instructions given in the form of conversations.

5

What is missing?

First, here are some common arguments people make about language models. These arguments may well be valid, but I find them neither enlightening nor relevant to what I am discussing here:

  • Language models are uneconomical. Models are expensive to train and expensive to use.

Indeed, this currently appears to be true. But costs tend to come down over time. The problem also deserves broader context: yes, training is environmentally costly, but we do not train very many of these models, and their total energy consumption is negligible compared with everything else humans spend energy on. Besides, I am not sure what the cost has to do with questions like "are these things interesting?" or "are these things useful?"; it is simply an economic issue.

  • Models have a lot of biases and stereotypes.

They do, because they model human language, and humans themselves are biased and hold stereotypes. This means we need to be cautious when applying these models to real-world tasks, but it does not make them less valid, less useful, or less interesting from a scientific point of view.

  • Models don't really understand language.

Indeed, they don't. So what? Let's focus on what they can do, and work on improving the things they cannot.

  • A model can never truly understand language.

So what? They do some things really well; why not focus on those? If you don't care about those things, simply don't concern yourself with them. Those who truly want to understand language in depth may prefer to explore other avenues; for me, an approximation of understanding is good enough.

  • Models cannot understand language like humans.

Well, models are not humans. Their mechanisms differ from ours in some respects, yet they can still tell us a great deal about the structure of language, and whatever they cannot tell us, we can seek from other sources.

  • Meaning cannot be learned from training on form alone.

These models are no longer trained on form alone; see the previous section.

  • The models merely stitch together pieces of text they have seen before, according to statistical regularities.

Isn't it amazing that "mere statistics" can get a model this far? Large models combine words and phrases in remarkably powerful ways, and out of the many, many wrong ways to combine pieces from the corpus, they still manage to pick out the "meaningful" combinations. That is pretty remarkable.

  • We don't know what impact these things might have on society.

True, and the same holds for any new technology or discovery. We can and should try to study these impacts carefully, but that does not make the models any less interesting, less useful, or less worthy of research; on the contrary, it gives us one more angle worth studying.

  • Models do not cite sources.

I understand the wish for certain applications to cite their sources: you don't want to be misled by the model. But in my view this is not a core problem of language models. People don't really "cite sources" either: we rarely attribute our knowledge to a specific single source, and when we do, it is usually an after-the-fact process of rationalization, of going to look for a source to cite. That behavior can be replicated.

From an applied point of view (say, if you want to build a search system, a paper-writing assistant, or a general question-answering system), it certainly makes sense to work on linking generated statements to sources, whether through the generation process itself, through a post-processing step, or through a retrieve-then-generate setup. Many people do, but this is not really about language understanding. I think the more meaningful, or more constructive, questions are: (1) how to separate "core" knowledge of language and reasoning from knowledge of specific facts about specific "things", and (2) how to achieve "knowledge of knowledge" (see below).
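As a rough illustration of what a retrieve-then-generate setup can look like, here is a minimal sketch. It describes no particular system; `search_index` and `generate` are placeholders standing in for a retrieval backend and a language model.

```python
# Minimal retrieve-then-generate sketch. `search_index` and `generate` are
# placeholders for a real retrieval backend and a real language model.
from typing import Callable

def retrieve_then_generate(
    question: str,
    search_index: Callable[[str, int], list[dict]],  # returns [{"url": ..., "text": ...}]
    generate: Callable[[str], str],                   # language model: prompt -> text
    k: int = 3,
) -> dict:
    passages = search_index(question, k)
    context = "\n\n".join(f"[{i + 1}] {p['text']}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the numbered passages below, "
        "and cite the passage numbers you used.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = generate(prompt)
    # The "sources" are whatever the retriever returned, so attribution is a
    # property of the pipeline rather than of the language model itself.
    return {"answer": answer, "sources": [p["url"] for p in passages]}
```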

6

What are the real limitations and gaps at the moment?

Below are some real challenges facing current "large language models" (including the latest version of ChatGPT). These are of course only my personal views, and the list is surely incomplete. In a certain sense, these problems keep the models from "fully understanding" language; they are tasks the models still cannot do, or at least do poorly:

  • Relating multiple texts to one another. In training, these models consume texts one by one, as isolated pieces of information. They may pick up patterns that recur across texts, but they have no notion of how texts relate to real-world "events". In particular, if a model is trained on several news stories about the same event, it has no way of knowing that these texts all describe the same thing, and no way of distinguishing that case from several texts describing similar but unrelated events. In this sense, the models cannot (are not able to) form a coherent, complete picture of the world from all the texts they "read".

  • A notion of time. The models have no sense of which events preceded which in their training data; apart from what is explicitly stated, they effectively have no notion of time at all. So while they may learn localized facts such as "Obama became president in 2009", and reason about events explicitly dated before or after that, they cannot grasp the passage of time. If they read "Obama is the current president of the United States" in some texts and "Obama is no longer president" in others, they cannot tell which statement came later or which is true now; they might hold "Obama is the current president of the United States", "Trump is the current president of the United States", and "Biden is the current president of the United States" to be true simultaneously. Likewise, they have essentially no useful way of interpreting statements like "X is Y's latest album" and how such statements relate to one another over time.

  • "knowledge" of knowledge. The model doesn't really "know what it knows", or even what "knows" means. All they do is guess the next token in the process, and that guess may be based on exact knowledge already acquired, or pure guesswork. The training of the model and the training data have no clear mechanism to distinguish between these two situations, and there is no clear mechanism to act differently according to these situations. "Confidently fabricating facts" is a good proof. From learning from demonstration (RLHF), the model "realizes" that some answers should be treated with caution. Perhaps the model even learns to correlate this degree of care with the degree to which certain facts, entities or themes are involved in the training data and reflected in its internal weights by the data. In this sense, they exhibit some "knowledge of knowledge." But when they get past the initial phase of refusal to answer and go into "text generation mode", they lose all this "knowledge of knowledge" and very quickly transition into "bullshit" mode. Even on matters where it expressly says (at various stages) that there is no relevant knowledge.

  • Numbers and math. These models are poorly equipped for mathematical computation: their basic building blocks are "word pieces", which do not line up with any convenient numerical base, and they have no proper way of learning the relations between numbers (such as "+1" or "greater than") in a meaningful, consistent manner. Large language models do perform well on some problems involving numbers, but there are far better ways of representing numbers and mathematics than the machinery we give these models, so it is genuinely surprising that they manage as much as they do. Still, I do not expect much progress here without more explicit modeling. (A toy illustration of the "word pieces" issue appears after this list.)

  • Rare events, high-recall settings, and high-coverage settings. By their very nature, these models focus on the common and the likely, so I am skeptical of their ability to learn rare events from data, to recall rare events, or to recall all instances of something. I am not certain they cannot; they might. But for now I remain skeptical.

  • Data hunger. This is perhaps the biggest technical problem facing large language models today: they need enormous amounts of data. To reach their impressive performance, they must be trained on trillions of words. "...and humans learn from only a tiny fraction of that" is obviously true, but it does not move me much: so what? A model does not have to mimic humans to be useful. There is, however, a more worrying implication: most human languages simply do not have that much data, and certainly not that much data in digital form.

Why does this matter? Because it means it will be very hard to replicate, for other languages, what we have achieved so far in understanding English: for my native Hebrew, say, or even for far more widely spoken languages such as German, French, Arabic, Chinese, or Hindi, not to mention the "low-resource" languages spoken in places like Africa and the Philippines.

A lot of data is available for these languages too, just not as much as for English. The "instruction tuning" technique may reduce the amount of raw text needed, but then the instruction data itself has to be created, which is an enormous effort for every new language we want to add. And if we believe (as I do) that training on code plus language matters, that becomes yet another big hurdle to building comparable models for other languages.

So, can translation solve the problem? After all, machine translation has also made great progress: translate the text into English, run the model in English, and translate the result back. We can do this, but it only works at a very superficial level. Different languages come from different geographic regions, and those regions have their own cultures, norms, stories, and events that differ in many ways from those of English-speaking regions. Even a concept as simple as "city" varies across communities and places, to say nothing of concepts like "politeness" or "violence". And then there is concrete "factual" knowledge of people, historical events, important places, plants, customs, and so on, which will simply not be reflected in English training data and cannot be transferred through translation.

So data hunger is a real problem if we want to use language-understanding and "AI" technologies in languages other than English.

For those of us who care about social impact, the combination of data hunger with English- and US-centricity is certainly a major consideration.

  • Modularity. Earlier, in the discussion of common arguments, I raised what I think is an important question: how can we separate the "core" knowledge of language and reasoning from knowledge of specific facts about specific "things"? Whether we can solve this will strongly affect how well we can address several of the other problems. If we can modularize the "core language understanding and reasoning" component and separate it from the "knowledge" component, we may be able to cope much better with data hunger and the cultural-knowledge gap, handle and control bias and stereotypes better, and get "knowledge of knowledge" almost for free. (Many people are working on "retrieval-augmented language models". I cannot say whether this is the right way to attack the problem; I tend to suspect a more fundamental solution is needed, though experience shows my intuitions about such things are not always reliable.)
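Returning to the "Numbers and math" item above, here is the promised toy sketch of the "word pieces" issue. The vocabulary and greedy splitting below are invented for illustration; real subword tokenizers (BPE, SentencePiece, and so on) differ in detail, but the mismatch between pieces and place value is the same kind of problem.

```python
# Toy greedy subword tokenizer over an invented vocabulary; real tokenizers
# differ, but the mismatch it illustrates is the same: numerals get split
# into pieces that bear no fixed relation to place value.
VOCAB = {"12", "3", "45", "6", "123", "4", "56", "1", "2", "5", " "}

def tokenize(text: str, vocab=VOCAB) -> list[str]:
    pieces, i = [], 0
    while i < len(text):
        # Take the longest vocabulary item matching at position i.
        match = max((v for v in vocab if text.startswith(v, i)), key=len, default=text[i])
        pieces.append(match)
        i += len(match)
    return pieces

print(tokenize("123456"))    # ['123', '45', '6']
print(tokenize("12 3456"))   # ['12', ' ', '3', '45', '6']  -- same digits, different pieces
```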

7

Conclusion

The power of large language models is astonishing. Language modeling alone is not enough, but "current-day language models" are in fact more than language models, and they can do far more than we expected. Still, if we care about "inclusive" language understanding, large language models are not sufficient; and even if we don't, they are not sufficient yet.


Welcome to star and try OneFlow: https://github.com/Oneflow-Inc/oneflow/
