Zhang Junlin: Changes in Interaction Modes Brought by Large Language Models

From: Heart of the Machine

Speech: Zhang Junlin


On July 8, at the 2023 WAIC AI Developer Forum hosted by Heart of the Machine, Zhang Junlin, head of new technology R&D at Sina Weibo, delivered a keynote speech titled "Natural Language Interaction: The Change in Interaction Modes Brought by Large Language Models". He focused on the changes that large language models bring to human-computer interaction. His core point: whether it is human-computer interaction or AI-to-AI interaction, natural language is the medium, so the way people operate on data will become simpler and more unified. The large language model stands at the center of human-computer interaction; the complex intermediate processes are hidden behind the scenes, and the model handles them through Planning + Programming.

[Figure]

The following is the content of Zhang Junlin's speech, which Heart of the Machine has edited without changing the original meaning.

Many people, both in China and abroad, are now building large models, and there are two core issues to consider when doing so:

The first is the base model. Building a powerful base model requires a great deal of data, computing power, and money. Although ChatGPT shocked the world when it came out, the main reason is not the base model. Powerful large base models were not born the moment ChatGPT appeared; they developed gradually. From 2020 onward, the scale of models developed abroad grew steadily and their performance improved step by step. ChatGPT's base model may be better than earlier ones, but there was no qualitative leap. So the base model is not the main reason ChatGPT is so influential.

The second is the large model's ability to understand instructions. If you ask why ChatGPT has such a big influence, the main reason is here. ChatGPT made it possible for large models to understand human language and commands. This may be the most critical factor, and it is also what makes ChatGPT distinct from past large models.

There is an ancient line of poetry particularly well suited to describing these two key components of large models: "The swallows that in olden days nested before the halls of the noble Wang and Xie families now fly into the homes of ordinary people."

"Tangqian Yan" is a large model of the base, but before the ChatGPT era, it was mainly researchers who were paying attention and improving it. "Flying to the homes of ordinary people" refers to RLHF, which is Instruct Learning (instruction learning). It is RLHF that allows all of us to use natural language to interact with large models. That way, everyone can use it, and everyone can appreciate the power of its base model. I think this is the fundamental reason why ChatGPT can cause such a big sensation.

The topic I want to share today is "Natural Language Interaction". I think this may be the most fundamental change that ChatGPT-style large language models (LLMs) bring us.

Traditional human-computer interaction

First, let's take a look at the traditional human-computer interaction methods.

[Figure]

The essence of human-computer interaction is the relationship between people and data. People act in the environment and generate various types of data, which fall into two categories: unstructured data, such as text, images, and video; and structured data. Enterprises may care more about structured data, because much of their internal data exists as databases or tables. People need to process these various types of data; typical operations include creating, adding, deleting, modifying, and querying.

Before large models appeared, how did people interact with data? They could not work with data directly; they needed an intermediary, and that intermediary is application software. Even the simplest text editing requires a text editor, and a more advanced text-processing tool is Word; to make a spreadsheet you need Excel, to operate a database you need MySQL, and to process images you need Photoshop.

From this we can see one characteristic of the traditional interaction mode: different types of data require different application software, i.e., a diversified set of interfaces. Another characteristic is that traditional interaction is complex and cumbersome, and much of the data must be handled by professionals. Take image processing as an example: even with Photoshop, it may be hard for an ordinary person to process images well, because the operations involved are very complicated and you have to find the functions you need in multi-level menus. Only trained professionals can do it.

To sum up, before the emergence of large models, our relationship and interaction with data were complex, cumbersome and diverse.

Human-computer interaction in the era of large models

After the emergence of large models, what essential changes have taken place in the situation?

[Figure]

In fact, there is only one key change: the large language model now stands at the center of human-computer interaction.

In the past, people interacted with a certain kind of data through a certain application. Now they interact with the large model, and the way of doing so is direct and unified: natural language. In other words, if you want to do something, you just tell the large model directly.

Essentially, this is still a relationship between people and data, but with large models, the application software is shielded behind the scenes.

Consider the future trend. In the short term, for text and other unstructured data, LLMs can replace some application software; for example, multimodal large models can replace Photoshop. In other words, an LLM can complete some common tasks by itself and no longer needs applications behind the scenes for functional support. At present, most structured data still needs corresponding management software, but the large model has come to the foreground while the management software hides in the background. Taking a longer view, the large model may gradually replace the various functional software. In a few years, it is likely that only a large model will remain in the middle.

This is the fundamental change in the way people interact with data once large models are available, and it is a very important one.

Human-computer interaction in the era of large models looks very simple: if you want to do something, just say it and leave the rest to the LLM. But what is actually happening behind the scenes?

[Figure]

Let me give an example. Apple's products have a particularly good reputation. Why? Because they give users an extremely simple way to operate and hide the complicated parts behind the scenes.

Large models are similar to Apple's approach to software. The interaction between the LLM and people looks very simple, but in fact the LLM does all the complicated work for users behind the scenes.

The complex work the LLM does behind the scenes can be broken down into three broad categories:

1. Understanding natural language. The language understanding ability of large models is already very strong; even relatively small large models understand language very well.

2. Task planning. For a complex task, the best approach is usually to split it into several simple subtasks and solve them one by one; the results are generally better. This is what task planning is responsible for.

3. Formal language. Although natural language is used for human-machine interaction, subsequent data processing generally requires formal languages: Programming (code), APIs, SQL, module calls, and so on. The forms vary, but in the end it comes down to Programming, because an API is essentially a function call to an external tool, SQL is a special-purpose programming language, and a module call is effectively an API. I believe that as large models develop, their internal formal language is likely to be unified around programming logic. That is, after a complex task is planned into simpler subtasks, the solution to each subtask usually takes the external form of Programming or API calls.

Now let's look at how people interact with different types of data in the era of large models. There is no need to discuss plain text; ChatGPT is the typical example.

Manipulating unstructured data with natural language

Let's start with unstructured data, beginning with images.

[Figure]

As shown in the figure, this is a typical Planning + Programming pattern, and it fully illustrates the three things the large model does behind the scenes described above. In this example, people manipulate pictures through language, including adding, deleting, modifying, and querying. This work is called Visual Programming, and it won a best paper award at CVPR 2023.

Take a still from The Big Bang Theory as an example. The user submits a group photo and gives the model a task: "Mark the names of the 7 protagonists of the TV series The Big Bang Theory on the picture."

How does the LLM accomplish this task? First, the LLM maps the task into program statements (the five lines below). Why five? This is Planning: each line is a subtask, and the subtasks are executed in sequence.

Briefly, the meaning of each program statement: the first recognizes the faces in the picture; the second has the language model issue a query to find the names of the protagonists of The Big Bang Theory; the third has the model match faces to names, which is a classification task; the fourth has the model draw a box around each face in the picture and attach the name; the fifth outputs the edited picture.
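To make this concrete, here is a minimal sketch of what such a five-step generated program might look like. It is an illustrative reconstruction in Python with stub functions; the function names and behavior are assumptions, not the actual modules or syntax of the Visual Programming paper.

```python
# Illustrative reconstruction of the five-step program described above.
# Function names and return values are stubs/assumptions.

def face_detect(image):
    """Stub: return bounding boxes of detected faces."""
    return [{"box": (10, 10, 50, 50)}, {"box": (70, 10, 110, 50)}]

def llm_query(question):
    """Stub: ask the language model a factual question."""
    return ["Leonard", "Sheldon"]  # shortened to two names for the sketch

def match_faces_to_names(faces, names):
    """Stub: the classification step pairing each detected face with a name."""
    return [dict(face, name=name) for face, name in zip(faces, names)]

def draw_labels(image, labeled_faces):
    """Stub: draw a box and a name on the image for each matched face."""
    return f"{image} with labels {[f['name'] for f in labeled_faces]}"

# The five "program statements" the LLM plans and emits, executed in order:
image = "bigbang_still.jpg"
faces = face_detect(image)                                                  # 1. recognize faces
names = llm_query("Who are the 7 protagonists of 'The Big Bang Theory'?")   # 2. query the names
matched = match_faces_to_names(faces, names)                                # 3. match faces to names
edited = draw_labels(image, matched)                                        # 4. box faces, attach names
print(edited)                                                               # 5. output the edited picture
```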

You can see that this process includes both Planning and Programming, fused together. The Planning step is not easy to see, but it is there.

The same applies to video. Given a video, you can ask questions in natural language, such as: "What is this person doing in this video?"

[Figure]

Of course, this task is inherently multimodal. The model first needs to encode the video: the upper part of the figure is the encoded text information and the lower part is the encoded visual information, with the text coming from speech recognition. So this is a model that integrates text and images. Although it is not drawn in the figure, various planning and reasoning also happen inside it.

Let me digress and share my view on multimodal large models. Everyone is generally optimistic about the multimodal direction, but I am not as optimistic as most people. The reason is simple: many models that process text and images together do work, but not because image or video technology has made a breakthrough; rather, the text model is so capable that it carries the image model along. In other words, the text and image capabilities are not on an equal footing: text is strong, images are weak, and text compensates for images. There are still serious unsolved technical obstacles in image and video; a "technical dark cloud" hangs over image processing. Without a breakthrough there, multimodality will face great obstacles, and it will be hard to make significant progress in applications.

Manipulating structured data with natural language

Now let's look at structured data, which includes three typical types: tables, databases (SQL), and knowledge graphs.

First, tables. We can manipulate tabular data through natural language, and Microsoft's Office Copilot already does this. The question is how. Of course we don't know exactly how Microsoft did it, but other researchers have done similar work.

[Figure]

Here's an example. A sales data table has many columns, and the data in the columns are related: for example, one column is the quantity sold, one column is the unit price, and one column is the sales revenue. LLMs handle tables well because they have learned a lot of knowledge during pre-training, such as sales = unit price × quantity. The LLM can use this learned knowledge to process tabular data.

For this sales table, the user can issue a query: "Highlight the records whose sales are between 200 and 500."

The LLM (here GPT-4) first plans this task into subtasks, in this case three: 1) filter out the entries with sales between 200 and 500; 2) highlight their background in blue; 3) embed the highlighted data back into the table.

How does the model perform the first step, filtering? Here is a brief description of the process. The first step is writing the prompt; as you can see in the figure, this prompt is very long.

Writing prompts has become a science of its own. Some people say writing a prompt is like casting a spell; I think it is more like doing PUA on the large model. We can think of the large model as a person who can play many different roles. To get it to do the task at hand, we need to coax it into the role best suited to the task, so we write a prompt to induce that role: "You are very knowledgeable, you are especially suited to this job, you should do it professionally, don't be careless," and so on.

Then we tell it the schema of the table (the meaning of each column). GPT-4 generates an API call, a filter that selects the data between 200 and 500 from all the data. But look at the part marked in red: the model wrote the parameters of the API incorrectly. What do we do then? We can give it documentation to learn from; the documentation contains many examples showing how to call the API and fill in the parameters in this situation. After reading it, GPT-4 corrected the API call and got it right.

Then the call is executed, and the required data is filtered out.
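A minimal sketch of this flow, under the assumption of hypothetical API and function names: the subtasks follow the plan described above, while the "API" the model calls and the table contents are made up for illustration.

```python
# Minimal sketch of the table-manipulation flow described above.
# The table contents and all function names are illustrative assumptions.

TABLE = [
    {"product": "A", "unit_price": 10, "quantity": 30, "sales": 300},
    {"product": "B", "unit_price": 20, "quantity": 5,  "sales": 100},
    {"product": "C", "unit_price": 15, "quantity": 30, "sales": 450},
]

def filter_rows(table, column, low, high):
    """The 'API' the model is asked to call for the filtering subtask.
    (In the flow described above, GPT-4's first call had wrong parameters and
    was corrected after API documentation was added to the prompt.)"""
    return [row for row in table if low <= row[column] <= high]

def highlight(rows):
    """Stub for the 'color the background blue' subtask."""
    return [dict(row, highlighted=True) for row in rows]

def merge_back(table, highlighted_rows):
    """Stub for embedding the highlighted rows back into the original table."""
    by_key = {row["product"]: row for row in highlighted_rows}
    return [by_key.get(row["product"], row) for row in table]

# The three planned subtasks, executed in order:
selected = filter_rows(TABLE, column="sales", low=200, high=500)  # 1. filter 200-500
lit = highlight(selected)                                         # 2. highlight in blue
result = merge_back(TABLE, lit)                                   # 3. merge back into the table
print(result)
```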

Another type of structured data is the database. Our goal is to operate the database with natural language, which essentially means mapping natural language into SQL statements.

[Figure]

As an example, here is Google's SQL-PaLM, built on PaLM 2. PaLM 2 is the new large base model Google released this year, positioned to benchmark against GPT-4.

SQL-PaLM operates on the database in two ways. One is in-context learning: give the model some examples consisting of a database schema, natural-language questions, and the corresponding SQL statements, then ask new questions and have the model output SQL. The other is fine-tuning.
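The in-context-learning mode can be pictured roughly as in the sketch below. The prompt shape, schema, and the stand-in model call are general assumptions for illustration, not SQL-PaLM's actual format.

```python
# Rough sketch of an in-context-learning prompt for text-to-SQL.
# The prompt layout and schema are assumptions, not SQL-PaLM's actual format.

PROMPT = """
-- Database schema
CREATE TABLE orders (id INT, customer TEXT, amount REAL, order_date TEXT);

-- Example
Question: What is the total amount ordered by 'Alice'?
SQL: SELECT SUM(amount) FROM orders WHERE customer = 'Alice';

-- New question
Question: How many orders were placed in 2023?
SQL:
"""

def generate_sql(prompt: str) -> str:
    """Stand-in for the model call; a real system would send `prompt` to the
    LLM and return the SQL it completes after the final 'SQL:' marker."""
    return "SELECT COUNT(*) FROM orders WHERE order_date LIKE '2023%';"

print(generate_sql(PROMPT))
```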

How well does the model perform now? On fairly complex database tables, the accuracy is about 78%, which is close to being practical. This means that as the technology continues its rapid development, it is very likely that SQL statements will no longer need to be written by humans; in the future you will just say what you want and leave the rest to the machine.

Another typical kind of structured data is the knowledge graph. The question is the same: how do we manipulate a knowledge graph with natural language?

[Figure]

Here the user asks: "Which country is Obama from?" How does the large language model answer? It also plans, splitting the task into operable API calls on the knowledge graph. The queries retrieve two pieces of sub-knowledge, and the model then reasons over them to output the correct answer: "United States".
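A toy sketch of that decomposition, using a made-up triple store and helper names (both are assumptions for illustration, not a real knowledge-graph API):

```python
# Toy sketch: decompose "Which country is Obama from?" into knowledge-graph
# lookups plus a reasoning step. Triples and helper names are illustrative.

TRIPLES = {
    ("Obama", "born_in"): "Honolulu",
    ("Honolulu", "located_in"): "United States",
}

def kg_lookup(subject, relation):
    """The 'operable API' of the knowledge graph: a one-hop lookup."""
    return TRIPLES[(subject, relation)]

# Planned sub-queries, then a chained (multi-hop) inference:
birthplace = kg_lookup("Obama", "born_in")        # sub-knowledge 1
country = kg_lookup(birthplace, "located_in")     # sub-knowledge 2
print(country)                                    # -> United States
```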

The relationship between the large model and the environment

So far we have talked about the relationship between people and data; there is also the relationship between large models and the environment. The most typical case is robotics, now generally called embodied intelligence: how to give a robot a brain and let it move about in the world.

[Figure]

Embodied intelligence has five core elements: understanding language, task planning, physical execution of actions, receiving environmental feedback, and learning from feedback.

The biggest difference between embodied intelligence in the large-model era and before is that the core of these five links can now be taken over by the large model. To command the robot, you only need to issue a request in natural language; all five steps are planned and controlled by the large model. The large model gives the robot a powerful brain: it understands human language and commands better and uses the world knowledge learned during pre-training to guide behavior, which is a qualitative improvement over previous methods.

But there is a problem: many researchers are not optimistic about this direction. Why? If you use physical robots to learn and act in the real world, you face high cost and low data-acquisition efficiency: physical robots are very expensive, their range of action in the real world is limited, data collection is slow, learning is slow, and they cannot afford to fall, because falling means heavy maintenance costs.

What more people are doing is building a virtual environment in which the robot can explore. A virtual environment alleviates the problems of high cost and low data-acquisition efficiency. Minecraft is a commonly used virtual environment: it is an open world, similar to wilderness survival, in which game characters can learn to survive better, so it is especially suitable as a stand-in for robots acting in the real world.

[Figure]

The cost of a virtual environment is very low and the data-acquisition efficiency is very high, but there is also a problem: the complexity of a virtual world is far lower than that of the real world.

Voyager, developed by Nvidia, lets the robot explore an unfamiliar environment in Minecraft. The large model behind it is also GPT-4, and the agent and GPT-4 communicate in natural language. As shown on the left side of the figure, the model learns tasks of increasing difficulty step by step, from the simplest wood-cutting to making a crafting table and fighting zombies. In the general machine-learning context, this "from easy to hard, step by step" learning mode is called curriculum learning. The tasks in the curriculum are all generated by GPT-4 according to the current state; you only need to "PUA" the model in the prompt to make it generate the next task, from easy to difficult.

Suppose the current task is to fight zombies. Faced with this task, GPT-4 automatically generates a corresponding "fight zombies" function, program code that can run in the Minecraft environment. When writing this function, it can reuse tools formed while solving earlier, simpler tasks; for example, the stone sword and shield the character previously learned to craft can now be used directly through API function calls. Once the "fight zombies" function is formed, the code is executed to interact with the environment. If an error occurs, the error information is fed back to GPT-4, which further corrects the program. If the program executes successfully, the experience is added to the skill library as new knowledge that can be reused next time. Then, following curriculum learning, GPT-4 generates the next, more difficult task.
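The loop described above can be summarized in the hedged sketch below. Names such as propose_next_task, write_code, and SKILL_LIBRARY are placeholders for the roles described in the talk, not Voyager's real API.

```python
# Hedged sketch of a Voyager-style loop: curriculum task generation, code
# generation, execution with error feedback, and a growing skill library.
# All names are placeholders, not Voyager's actual interfaces.

SKILL_LIBRARY = {}   # reusable skills learned earlier (e.g. "craft stone sword")

def propose_next_task(state):
    """Stand-in for GPT-4 generating the next task, from easy to hard."""
    return "fight a zombie"

def write_code(task, skills, error=None):
    """Stand-in for GPT-4 writing a function for the task, reusing earlier
    skills; if an execution error is passed in, it revises the code."""
    return f"def do_{task.replace(' ', '_')}(): reuse({list(skills)})"

def run_in_environment(code):
    """Stand-in for executing the generated code in the game environment.
    Returns (success, error_message)."""
    return True, None

state = {"inventory": ["stone_sword", "shield"]}
task = propose_next_task(state)                    # curriculum learning
error = None
for attempt in range(3):                           # retry loop with feedback
    code = write_code(task, SKILL_LIBRARY, error)  # generate / revise the program
    ok, error = run_in_environment(code)           # interact with the environment
    if ok:
        SKILL_LIBRARY[task] = code                 # store as a reusable skill
        break                                      # then move on to a harder task
print(SKILL_LIBRARY)
```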

Large models of the future

So far I have shown, from the perspectives of the relationship between people and data and between large models and the environment, that natural language interaction is ubiquitous. Next, let's look at how natural language interaction works when AI interacts with AI.

[Figure]

What research progress on base models has been worth noting in the past six months? Apart from continued growth in model scale, the overall progress has not been great; most new progress has concentrated on the instruct part, thanks to Meta's open-source LLaMA model. Regarding base-model progress, I think two things are worth attention. One is the rapid growth of the model's input window length; this technology is advancing very fast, and open-source models will soon handle inputs of 100K length or even longer. The other is the augmentation of large models.

I believe the large model of the future will, with high probability, look like the one shown in the figure above. Past large models were static, single models; the future large model should be composed of multiple agents with different roles that communicate in natural language and cooperate to complete tasks. The agents can also call external tools through a natural-language interface to remedy the shortcomings of current large models, such as outdated data, severe hallucination, and weak arithmetic ability.

[Figure]

At present, the way large models use tools is relatively uniform. As shown in the figure, a large number of available external tools can be managed through an API management platform. Given a user's question, the model judges whether a tool is needed. If so, it decides which tool to use, calls that tool's API with the appropriate parameters, takes the result returned by the tool, composes an answer, and returns the answer to the user.
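A minimal sketch of that tool-use flow: decide whether a tool is needed, pick one, call its API, and compose the answer. The registry and decision logic here are illustrative assumptions, not a specific platform's API.

```python
# Minimal sketch of the tool-use flow described above.
# The registry, tools, and decision rules are illustrative assumptions.

TOOL_REGISTRY = {
    "weather": lambda city: f"Sunny, 28°C in {city}",    # stand-in for a weather API
    "calculator": lambda expr: str(eval(expr)),          # stand-in for a calculator tool
}

def decide_tool(question):
    """Stand-in for the model judging whether (and which) tool is needed."""
    if "weather" in question.lower():
        return "weather", {"city": "Shanghai"}
    if any(op in question for op in "+-*/"):
        return "calculator", {"expr": question}
    return None, None

def answer(question):
    tool, args = decide_tool(question)
    if tool is None:
        return "The LLM answers directly."
    result = TOOL_REGISTRY[tool](**args)                 # call the tool's API
    return f"Answer composed from the tool result: {result}"

print(answer("What is the weather in Shanghai today?"))
```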

[Figure]

Agents are a technology worth watching, but we still lack a unified definition of an agent in the era of large models. You can think of agents as different roles assigned to large language models, which complete tasks through a division of labor. Agents are a research direction with a long history, going back decades, but in the large-model era, thanks to the abilities of LLMs, agents have completely different capabilities and enormous technical potential. As for the definition, I feel the traditional definition of an agent may no longer fit the new situation; the large-model era may need to redefine what an agent is.

The figure above shows an agent system that simulates human society in a game-sandbox environment. Each agent has its own occupation and identity; different agents can communicate in natural language and hold various gatherings. It looks like a prototype of the science-fiction series Westworld.

[Figure]

If we summarize the modes of cooperation among agents, there are two main types: competitive and collaborative. In the competitive mode, different agents question, argue with, and debate one another in order to reach better results. In the collaborative mode, agents divide the work according to roles and abilities, each taking on part of the task, and complete it by helping one another.
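As a toy illustration of the competitive (debate) mode, the sketch below has two agents, played by the same model under different role prompts, critique each other for a few rounds before keeping a final answer. The llm() stand-in and the role names are assumptions for illustration.

```python
# Toy sketch of the "competitive" (debate) mode between two role-prompted agents.
# The llm() stand-in and role names are illustrative assumptions.

def llm(role, question, opponent_answer=None):
    """Stand-in for a role-prompted LLM call."""
    if opponent_answer is None:
        return f"[{role}] initial answer to: {question}"
    return f"[{role}] critique and revision of: {opponent_answer}"

def debate(question, rounds=2):
    answer_a = llm("Proposer", question)                  # first proposal
    for _ in range(rounds):
        answer_b = llm("Critic", question, answer_a)      # challenge the proposal
        answer_a = llm("Proposer", question, answer_b)    # revise in response
    return answer_a                                       # keep the final answer

print(debate("Which schema design is better for this table?"))
```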

Finally, let's discuss the advantages and problems of natural language interaction. The advantage is that it is more natural, more convenient, and more unified, and users need almost no learning cost to get things done. But natural language also has drawbacks, such as vagueness and ambiguity.

[Figure]

The vagueness of natural language means it is sometimes not easy to express your true intention clearly. You think you have made it clear, but you have not, and you may not even realize it. This is why using a large model places relatively high demands on prompt writing: if users cannot state their intentions clearly, the model cannot do the job well.

The problem of ambiguity in natural language has always existed and is ubiquitous. For example, "Bring me the apple" can have different meanings, and listeners can understand it differently. How do you let the large model know which meaning is intended?

Given the vagueness and ambiguity of natural language, from the perspective of human-computer interaction, large models should become more proactive in interaction in the future; that is, the model should actively ask the user questions. If the model thinks something in the user's words is unclear and it is not sure what is meant, it should ask "What do you mean?" or "Do you mean this?" This is something large models should strengthen in the future.
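A toy sketch of this "proactive clarification" idea: if the request is judged ambiguous, ask the user back instead of guessing. The ambiguity table and function names are illustrative assumptions.

```python
# Toy sketch of proactive clarification: detect an ambiguous term and ask back.
# The ambiguity table and names are illustrative assumptions.

AMBIGUOUS_TERMS = {
    "apple": ["the fruit", "the Apple device"],
}

def respond(request):
    for term, senses in AMBIGUOUS_TERMS.items():
        if term in request.lower():
            return f"Do you mean {senses[0]} or {senses[1]}?"  # ask, don't guess
    return "OK, doing it."

print(respond("Bring me the apple"))
```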

Thank you everyone!




Origin: blog.csdn.net/qq_27590277/article/details/131799293