What else can NLP do? Beihang University, ETH Zurich, the Hong Kong University of Science and Technology, the Chinese Academy of Sciences, and other institutions jointly publish a 110-page paper systematically laying out the post-ChatGPT technology chain

Xi Xiaoyao's tech sharing
Source | Heart of the Machine

Everything starts with the birth of ChatGPT...

The once-peaceful NLP community was jolted by this sudden "monster"! Overnight, the entire NLP field was transformed: industry rushed to follow up, capital poured in, and the race to replicate ChatGPT was on; academia, meanwhile, fell into confusion. Little by little, everyone began to believe that "NLP is solved!"

However, judging from the recently very active NLP research community and its steady stream of excellent work, this is not the case; if anything, "NLP just got real!"

Over the past few months, Beihang University, Mila, the Hong Kong University of Science and Technology, ETH Zurich, the University of Waterloo, Dartmouth College, the University of Sheffield, the Chinese Academy of Sciences, and many other institutions have, after systematic and comprehensive research, produced a 110-page paper that systematically lays out the technology chain of the post-ChatGPT era: interaction.

Paper address:

https://arxiv.org/abs/2305.13246

Project resources:
https://github.com/InteractiveNLP-Team

Different from traditional notions of interaction such as "human-in-the-loop (HITL)" and "writing assistants", the interaction discussed in this paper takes a higher and more comprehensive perspective:

  1. For industry: if large models struggle with problems such as factuality and timeliness, can "ChatGPT + X" solve them? Or, as with ChatGPT Plugins, can the model interact with tools to book tickets, order meals, and draw pictures in one step? In other words, systematic technical frameworks can alleviate some limitations of today's large models.

  2. For academia: what is real AGI? As early as 2020, Yoshua Bengio, one of the three giants of deep learning and a Turing Award winner, helped sketch the blueprint of an interactive language model [1]: a language model that interacts with the environment, and even socially with other agents, in order to acquire the most comprehensive semantic representation of language. To some extent, interaction with the environment and with other people is what creates human intelligence.

Therefore, letting a language model (LM) interact with external entities, as well as with itself, can not only help bridge the inherent shortcomings of large models, but may also mark an important milestone on the road to the ultimate ideal of AGI!

What is interaction?

In fact, the concept of "interaction" was not invented by the authors out of thin air. Since the advent of ChatGPT, many papers addressing new problems in NLP have appeared, for example:

  1. Tool Learning with Foundation Models describes how language models can use tools to reason or perform real-world operations [2];

  2. Foundation Models for Decision Making: Problems, Methods, and Opportunities explains how to use language models for decision-making tasks [3];

  3. ChatGPT for Robotics: Design Principles and Model Abilities explains how ChatGPT can empower robots [4];

  4. Augmented Language Models: a Survey explains how to augment language models with Chain-of-Thought, tool use, and similar techniques, and points out that a language model's use of tools can have a practical impact on the external world (i.e., act) [5];

  5. Sparks of Artificial General Intelligence: Early experiments with GPT-4 illustrates how GPT-4 performs various types of tasks, including cases of interaction with people, environments, and tools [6].

It is clear that the focus of NLP research has gradually shifted from "how to build a model" to "how to build a framework", that is, how to bring more entities into the training and inference loop of a language model. The most typical example is the well-known Reinforcement Learning from Human Feedback (RLHF), whose basic principle is to let the language model learn from interaction (feedback) with people [7]; this idea became the finishing touch of ChatGPT.

It is therefore fair to say that "interaction" is one of the most mainstream paths of NLP development after ChatGPT. The paper defines and systematically deconstructs "interactive NLP" for the first time and, organizing the discussion mainly by the object of interaction, covers the pros and cons of the various technical approaches and their application considerations as comprehensively as possible, including:

  1. LMs interact with humans to better understand and meet user needs, personalize responses, align with human values, and improve the overall user experience;

  2. LMs interact with knowledge bases to enrich the factual knowledge expressed in language, make responses more relevant to the knowledge context, and dynamically use external information to generate more accurate responses;

  3. LMs interact with models and tools to efficiently decompose and solve complex reasoning tasks, exploit specialized knowledge for specific subtasks, and foster the emergence of social behaviors among agents;

  4. LMs interact with the environment to learn grounded representations of entities and to handle embodied tasks such as reasoning, planning, and decision-making tied to environmental observations.

Therefore, under the interaction framework, a language model is no longer just a language model, but a language-based agent that can observe, act, and receive feedback.
For each interaction object, the authors use the term "X-in-the-loop" to mean that the object participates in language model training or inference in a cascaded, looped, feedback-driven, or iterative fashion.
Interacting with people

Having the language model interact with people can take three forms:

  1. Communicating via prompts

  2. Learning from feedback

  3. Tuning via configuration

In addition, to ensure scalable deployment, models or programs are often used to simulate human behavior or preferences, i.e., learning from human simulation.

In general, the core problem to be solved when interacting with people is alignment: how to make the language model's responses better match users' needs and be more helpful, harmless, and well-grounded, giving users a better experience.

"Communicating Using Prompts" focuses primarily on the real-time and continuous nature of the interaction, that is, multiple rounds of dialogue emphasizing the continuous nature. This is in line with the idea of ​​Conversational AI [8]. That is, through multiple rounds of dialogue, let the user continue to ask questions, so that the response of the language model is slowly aligned with the user's preference in the dialogue. This approach usually does not require adjustment of model parameters during the interaction.

"Learning with feedback" is currently the main way of alignment, which is to let users give feedback to the language model's response. This feedback can be a "good/bad" label describing the preference, or a more natural language form. for detailed feedback. Models need to be trained so that these feedbacks are as high as possible. A typical example is the RLHF [7] used by InstructGPT. First, the reward model is trained using the user-labeled preference feedback data for the model response, and then this reward model is used to train the language model with a certain RL algorithm to maximize the reward (as shown in the figure below ).

Training language models to follow instructions with human feedback [7]
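As a concrete sketch of the first step, the reward model is typically trained with a pairwise preference loss over human-ranked response pairs; the snippet below shows a minimal version of that objective. The toy tensors stand in for reward-model outputs; this is an illustration of the loss described in [7], not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry) objective from RLHF [7]:
    # maximize log-sigmoid of (reward of preferred response - reward of rejected one).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards for a batch of 4 (chosen, rejected) response pairs.
r_chosen = torch.randn(4, requires_grad=True)   # would come from the reward model
r_rejected = torch.randn(4)
loss = preference_loss(r_chosen, r_rejected)
loss.backward()   # in practice, gradients flow into the reward model's parameters
```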

"Tuning with configuration" is a special interaction method that allows users to directly adjust the hyperparameters of the language model (such as temperature), or the cascading method of the language model, etc. A typical example is Google’s AI Chains [9]. Language models with different preset prompts are connected to each other to form an inference chain for processing workflow tasks. Users can drag and drop through a UI to adjust the node connection mode of this chain .

"Learning from human simulation" can facilitate the large-scale deployment of the above three methods, because it is unrealistic to use real users, especially in the training process. For example, RLHF usually needs to use a reward model to simulate user preferences. Another example is ITG [10] of Microsoft Research, which simulates user editing behavior through an oracle model.

Recently, Stanford professor Percy Liang and colleagues constructed a very systematic evaluation scheme for human-LM interaction: Evaluating Human-Language Model Interaction [11]. Interested readers can refer to that paper or to the survey.

Interacting with the knowledge base


Interaction between a language model and a knowledge base involves three steps:

  1. Identify the source of supplementary knowledge: Knowledge Source

  2. Retrieve the knowledge: Knowledge Retrieval

  3. Use the knowledge for augmentation: for details, see the Interaction Message Fusion section of the paper; it is not covered further here.

In general, interacting with a knowledge base can alleviate the language model's "hallucination" problem, i.e., improve the factuality and accuracy of its output; it can also improve the model's timeliness and help supplement its knowledge and abilities (as shown in the figure below).

MineDojo [16]: when a language-model agent encounters a task it does not know how to do, it can look up learning materials in the knowledge base and then complete the task with their help.

"Knowledge Source" is divided into two types, one is closed corpus knowledge (Corpus Knowledge), such as WikiText, etc. [15]; the other is open network knowledge (Internet Knowledge), such as knowledge that can be obtained by using search engines[ 14].

"Knowledge Retrieval" is divided into four ways:

  1. Sparse retrieval, based on sparse language representations and lexical matching, such as n-gram matching or BM25.

  2. Dense retrieval, based on dense language representations and semantic matching, such as using a single-tower or dual-tower model as the retriever.

  3. Generative retrieval, a relatively new approach whose representative work is the Differentiable Search Index by Yi Tay et al. at Google [12]: all knowledge is stored in the parameters of a language model, and given a query, the model directly outputs the doc id or content of the corresponding knowledge, because the language model is the knowledge base [13]!

  4. Reinforcement-learning-based retrieval, also a cutting-edge approach, such as OpenAI's WebGPT [14], which uses human feedback to train the model to retrieve correct knowledge.
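As a concrete illustration of the sparse-retrieval route, here is a minimal, self-contained BM25 retriever used to build a retrieval-augmented prompt. The scoring follows the standard Okapi BM25 formula; the corpus and query are made-up examples, and the final prompt would be fed to any language model.

```python
import math
from collections import Counter

class BM25:
    """Minimal Okapi BM25 over a whitespace-tokenized corpus."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.raw = docs
        self.docs = [d.lower().split() for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.df = Counter(t for d in self.docs for t in set(d))
        self.k1, self.b = k1, b

    def idf(self, t):
        # Standard BM25 idf with +1 smoothing to keep it non-negative.
        return math.log((self.N - self.df[t] + 0.5) / (self.df[t] + 0.5) + 1)

    def score(self, query, i):
        tf, dl = Counter(self.docs[i]), len(self.docs[i])
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1)
            / (tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
            for t in query.lower().split()
        )

    def top_k(self, query, k=2):
        order = sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)
        return [self.raw[i] for i in order[:k]]

# Made-up corpus and query; the retrieved passages augment the prompt.
corpus = [
    "The Transformer architecture was introduced in 2017.",
    "BM25 is a classic sparse lexical retrieval function.",
    "Dense retrievers embed queries and documents into vectors.",
]
context = "\n".join(BM25(corpus).top_k("when was the transformer introduced"))
prompt = ("Answer using only the context below.\n"
          f"Context:\n{context}\n"
          "Question: When was the Transformer introduced?")
print(prompt)   # feed this augmented prompt to any language model
```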

Interacting with models or tools

The main purpose of having a language model interact with models or tools is to decompose complex tasks, e.g., breaking a complex reasoning task into several subtasks, which is also the core idea of Chain-of-Thought [17]. Different subtasks can then be solved by models or tools with the appropriate capabilities: calculation can be handed to a calculator, retrieval to a retrieval model. This type of interaction not only improves the language model's reasoning, planning, and decision-making abilities, but also alleviates hallucination and inaccurate output. In particular, when a tool performs a specific subtask, it may affect the external world, e.g., posting a WeChat Moments update via the WeChat API; this is called "Tool-Oriented Learning" [2].

Moreover, a complex task is sometimes hard to decompose explicitly. In that case, different roles or skills can be assigned to different language models, which then implicitly and automatically work out some division of labor to decompose the task. This type of interaction can not only simplify the solving of complex tasks, but also simulate human society and help construct some form of agent society.

The authors group models and tools together mainly because the two are not necessarily distinct categories: a search-engine tool and a retriever model, for instance, are not essentially different. The authors draw the distinction in terms of "after the task is decomposed, which kind of object takes on which kind of subtask".

When a language model interacts with models or tools, there are three types of operations:

  1. Thinking: the model interacts with itself, performing task decomposition, reasoning, and so on;

  2. Acting: the model calls other models or external tools to aid its reasoning or to produce a real effect on the external world;

  3. Collaborating: multiple language-model agents communicate and collaborate with each other to complete specific tasks or to simulate human social behavior.

Note: Thinking here refers mainly to "Multi-Stage Chain-of-Thought", in which different reasoning steps correspond to different invocations of the language model (multiple model runs), rather than Vanilla CoT [17], where the model runs once and outputs thought + answer (a single model run). This distinction is partly inherited from ReAct [18].

Typical works on Thinking include ReAct [18], Least-to-Most Prompting [19], and Self-Ask [20]. For example, Least-to-Most Prompting [19] first decomposes a complex problem into several simpler subproblems and then iteratively calls the language model to solve them one by one. Typical works on Acting include ReAct [18], HuggingGPT [21], and Toolformer [22]. For example, Toolformer [22] preprocesses the language model's pre-training corpus into a form containing tool-use prompts, so that the trained model can automatically invoke the right external tool (search engine, translator, clock, calculator, etc.) at the right time while generating text, solving specific subproblems.
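Here is a minimal sketch of the Least-to-Most pattern just described: one LM call decomposes the problem, then further calls solve the subquestions in order, each conditioned on the answers so far. The `llm` stub and prompts are illustrative, not the paper's exact prompts.

```python
# Minimal Least-to-Most Prompting [19] sketch: decompose, then solve
# subquestions sequentially, feeding earlier answers back into the context.
def llm(prompt: str) -> str:
    # Stand-in for a real LM call; replace with an actual API.
    return "subquestion A\nsubquestion B" if "Decompose" in prompt else "<answer>"

def least_to_most(question: str) -> str:
    plan = llm("Decompose this problem into simpler subquestions, one per line:\n" + question)
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]
    solved = ""
    for sub in subquestions:
        answer = llm(solved + "\nQ: " + sub + "\nA:")   # solve one subquestion
        solved += "\nQ: " + sub + "\nA: " + answer      # and remember its answer
    return llm(solved + "\nQ: " + question + "\nA:")    # final answer uses the chain

print(least_to_most("How many lights are on after I toggle two switches?"))
```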

Collaborating mainly includes:

  1. Closed-loop interaction: e.g., Socratic Models [23], where a large language model, a vision-language model, and an audio-language model interact in a closed loop to complete complex QA tasks grounded in a visual environment.

  2. Theory of Mind: aims to let one agent understand and predict the state of another agent so that they can interact efficiently. For example, MindCraft [24], an Outstanding Paper at EMNLP 2021, endows two different language models with different but complementary skills and lets them collaborate through communication to complete specific tasks in the Minecraft world. The well-known professor Graham Neubig has also paid close attention to this research direction recently, e.g., [25].

  3. Communicative Agents: aims to let multiple agents communicate and collaborate with each other. The most typical example is Stanford's Generative Agents [26], which recently took the world by storm: in a sandbox environment, a crowd of agents "animated" by large models move about freely and spontaneously exhibit human-like social behaviors such as chatting and greeting, with a distinct "Westworld" flavor (as shown in the figure below). Another well-known work is CAMEL [27], a new paper by the authors of DeepGCN, in which two agents powered by large models develop games and even trade stocks while communicating with each other, requiring little human intervention; in it, the authors explicitly propose the concept of an "LLM Society". A minimal sketch of this kind of two-agent loop follows the figure below.

Generative Agents: Interactive Simulacra of Human Behavior, https://arxiv.org/pdf/2304.03442.pdf
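The two-agent loop can be sketched as follows, in the spirit of CAMEL [27]: two role-prompted models take turns over a shared transcript. The roles, the stub `llm`, and the turn structure are illustrative assumptions, not the CAMEL implementation.

```python
# Minimal communicative-agents loop: two role-prompted LMs converse over a
# shared transcript to make progress on a task, CAMEL-style [27].
def llm(system: str, history: list) -> str:
    # Stand-in for a real chat LM; replace with an actual API call.
    return f"<{system.split('.')[0]} responds to: {history[-1][:40]}>"

def role_play(task: str, rounds: int = 3) -> list:
    user_sys = "You are a product manager. Give one instruction at a time for: " + task
    asst_sys = "You are a programmer. Carry out the latest instruction."
    transcript, message = [], "Task: " + task
    for _ in range(rounds):
        reply = llm(asst_sys, transcript + [message])   # programmer acts
        transcript += [message, reply]
        message = llm(user_sys, transcript)             # manager issues next step
    return transcript

for turn in role_play("build a simple snake game"):
    print(turn)
```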

Interacting with the environment

The language model and the environment occupy two different quadrants: the language model is built on abstract textual symbols and excels at high-level tasks such as reasoning, planning, and decision-making, while the environment is built on concrete perceptual signals (such as visual and auditory information) and simulates or naturally produces low-level tasks, such as providing observations, feedback, and state transitions (e.g., a "creeper" appearing in front of you in Minecraft).
Therefore, for the language model to interact with the environment effectively and efficiently, two lines of effort are needed:

  1. Modality Grounding: enable the language model to process multimodal information such as images and audio;

  2. Affordance Grounding: enable the language model to perform possible and appropriate actions on possible and appropriate objects at the scale of a concrete scene in the environment.

The most typical form of Modality Grounding is the vision-language model. In general, one can use a single-tower model such as OFA [28], a dual-tower model such as BridgeTower [29], or interaction between a language model and a visual model as in BLIP-2 [30]. We say no more here; readers can consult the paper for details.
Affordance Grounding mainly involves two questions: given a task, how to perform (1) scene-scale perception and (2) possible actions. For example:

In the scene above, given the task "Please turn off the lights in the living room", scene-scale perception requires finding all the lights within the red box rather than selecting the kitchen lights circled in green, which are not in the living room; possible actions requires determining a feasible way to turn off each light, e.g., a pull-cord lamp requires a "pull" action, while a switch-controlled light requires a "toggle switch" action.

In general, Affordance Grounding can be handled with a value function attached to the environment, as in SayCan [31], or with a dedicated grounding model such as Grounded Decoding [32]. It can even be handled by interacting with people, models, tools, and so on (as shown in the figure below).

Inner Monologue [33]
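The value-function route can be sketched as follows: each candidate skill gets a combined score of task relevance (from the LM) times current feasibility (from an affordance/value estimate), and the best-scoring skill is executed, as in SayCan [31]. Both scorers below are toy stand-ins for illustration only.

```python
# Minimal SayCan-style [31] skill selection: combine the LM's usefulness score
# with an affordance (value) score of what the environment currently allows.
def lm_usefulness(instruction: str, skill: str) -> float:
    # Toy proxy: lexical overlap; a real system uses LM log-likelihoods.
    return 1.0 + len(set(instruction.split()) & set(skill.split()))

def affordance_value(skill: str, observation: str) -> float:
    # Toy proxy: a skill is feasible only if its object is visible in the scene.
    return 1.0 if skill.split()[-1] in observation else 0.1

def pick_skill(instruction, observation, skills):
    # Combined score: task relevance x current feasibility.
    return max(skills, key=lambda s: lm_usefulness(instruction, s) * affordance_value(s, observation))

skills = ["toggle the wall switch", "pull the lamp cord", "open the fridge"]
obs = "living room: wall switch, lamp cord, sofa"
print(pick_skill("turn off the lights in the living room", obs, skills))
```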

What to interact with: the interaction interface

In the Interaction Interface chapter of the paper, the authors systematically discuss the usage, advantages, and disadvantages of different interaction languages and interaction media, including:

  1. Natural language: e.g., few-shot examples, task instructions, role assignment, and even structured natural language. The discussion focuses on its characteristics and role in generalization and expressiveness.

  2. Formal language: e.g., code, grammars, mathematical formulas. The discussion focuses on its characteristics and role in parsability and reasoning ability.

  3. Machine language: e.g., soft prompts, discretized visual tokens. The discussion focuses on its characteristics and role with respect to generalization, the information-bottleneck theory, and interaction efficiency.

  4. Editing: mainly operations such as deleting, inserting, replacing, and retaining text. Its rationale, history, advantages, and current limitations are discussed.

  5. Shared memory: mainly hard memory and soft memory. The former records historical states in a log as memory; the latter uses a readable and writable external memory module to store tensors. The paper discusses the characteristics, roles, and limitations of both (a minimal sketch of the two follows this list).
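As a minimal sketch of the two memory styles in item 5 (hard memory as a textual log, soft memory as a readable/writable tensor bank), consider the following; the shapes and the similarity-weighted read are illustrative assumptions.

```python
import numpy as np

class HardMemory:
    """Hard memory: append-only textual log of interaction history."""
    def __init__(self):
        self.log = []

    def write(self, event: str) -> None:
        self.log.append(event)

    def read(self, last_n: int = 5) -> str:
        # Recent history can be spliced back into the prompt as context.
        return "\n".join(self.log[-last_n:])

class SoftMemory:
    """Soft memory: external read/write bank of tensors."""
    def __init__(self, slots: int = 16, dim: int = 64):
        self.bank = np.zeros((slots, dim))

    def write(self, slot: int, vector: np.ndarray) -> None:
        self.bank[slot] = vector

    def read(self, query: np.ndarray) -> np.ndarray:
        # Attention-style read: similarity-weighted mixture of all slots.
        weights = np.exp(self.bank @ query)
        return (weights / weights.sum()) @ self.bank

hm = HardMemory()
hm.write("user asked to book a flight to Zurich")
print(hm.read())
```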

How to interact: interaction methods

The paper also discusses the various interaction methods comprehensively, thoroughly, and systematically, mainly including:

  1. Prompting: invoking the language model purely through prompt engineering, without adjusting model parameters. This covers In-Context Learning, Chain of Thought, tool use, and cascaded reasoning chains (Prompt Chaining), with detailed discussion of the principles, roles, various tricks, and limitations of each prompting technique, e.g., considerations of controllability and robustness.

  2. Fine-Tuning: adjusting model parameters so the model can learn and update from interaction information. This covers Supervised Instruction Tuning, Parameter-Efficient Fine-Tuning, Continual Learning, and Semi-Supervised Fine-Tuning, with detailed discussion of their principles, roles, advantages, practical considerations, and limitations. It also covers part of Knowledge Editing (i.e., editing the knowledge inside a model).

  3. Active Learning: an interactive active-learning algorithmic framework.

  4. Reinforcement Learning: an interactive reinforcement-learning framework, covering online RL, offline RL, learning from human feedback (RLHF), learning from environment feedback (RLEF), learning from AI feedback (RLAIF), and other methods.

  5. Imitation Learning: an interactive imitation-learning framework, covering online and offline imitation learning.

  6. Interaction Message Fusion: a unified framework for all of the interaction methods above; within this framework, the paper also expands outward to discuss different knowledge/information fusion schemes, such as cross-attention fusion and constrained decoding (a minimal constrained-decoding sketch follows this list).
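As a minimal illustration of the constrained-decoding idea mentioned in item 6, the sketch below masks the probability mass of disallowed tokens at every sampling step. The toy vocabulary, uniform stand-in logits, and lexical constraint are all illustrative assumptions.

```python
import math
import random

# Toy vocabulary and a hard lexical constraint (never emit unsafe tokens).
vocab = ["turn", "off", "the", "light", "lamp", "explode"]
banned = {"explode"}

def lm_logits(prefix):
    # Stand-in for a real LM forward pass: uniform logits here.
    return [0.0] * len(vocab)

def constrained_step(prefix):
    logits = lm_logits(prefix)
    # Mask disallowed tokens by zeroing their probability mass before sampling.
    weights = [0.0 if t in banned else math.exp(l) for t, l in zip(vocab, logits)]
    return random.choices(vocab, weights=weights)[0]

tokens = []
for _ in range(5):
    tokens.append(constrained_step(tokens))
print(" ".join(tokens))   # "explode" can never appear in the output
```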

Other discussions

Due to space constraints, this article does not cover the remaining discussions in detail, such as evaluation, applications, ethics and safety, and future directions. These topics nevertheless occupy 15 pages of the paper, so readers are encouraged to consult the original for more details. An outline of these contents follows:

Evaluation of interaction

The paper's discussion of evaluation revolves around a set of keywords (presented as a figure in the original article).

Major Applications of Interactive NLP

  • Controllable Text Generation

    • Interacting with people: RLHF's "thought-stamping" phenomenon, etc.

    • Interacting with knowledge: Knowledge-Aware Fine-Tuning [34], etc.

    • Interacting with models and tools: Classifier-Guided CTG, etc.

    • Interacting with the environment: affordance grounding, etc.

  • Interactive Writing Assistant

    • Content Support

    • Content Checking and Polishing

    • Content Enrichment

    • Content Co-creation

  • Embodied AI

    • Observation and Manipulation: basic

    • Navigation and Exploration: advanced (e.g., long-horizon embodied tasks)

    • Multi-Role Tasks: advanced

  • Games (text games)

    • Interactive Text Game Platforms: game platforms built around interactive text

    • Playing Text-Only Games: how interactive language models play pure text games

    • Powering Text-Aided Games: how interactive language models can power games in which text is one medium

  • Other applications

    • Domain and task specialization: e.g., how to build interaction-based language-model frameworks specialized for finance, medicine, and other fields.

    • Personalization and personality: e.g., how to build interaction-based language models tailored to a specific user or endowed with a specific personality.

    • Model-based Evaluation

Ethics and Safety

This part discusses the impact of interactive language models on education, as well as ethics and safety issues such as social bias and privacy.

Future directions and challenges

  • Alignment: how to make model outputs more harmless, better aligned with human values, and better justified.

  • Social Embodiment: the grounding problem of language models; how to further advance their embodiment and socialization.

  • Plasticity: how to keep a model's knowledge continuously updated without forgetting previously acquired knowledge.

  • Speed & Efficiency: inference speed and training efficiency; how to accelerate both without hurting performance.

  • Context Length: the limit on the context window; how to extend the window to handle longer text.

  • Long Text Generation: how to keep the language model performing well when generating extremely long text.

  • Accessibility: how to move language models from closed source to open source, and how to deploy them on edge devices such as in-vehicle systems and laptops without excessive performance loss.

  • Analysis: analysis and interpretability of language models, e.g., how to predict performance after scaling up to guide large-model development, and how to explain the internal mechanisms of large models.

  • Creativity: how to make language models more creative, better able to use metaphors and analogies, and able to create new knowledge.

  • Evaluation: how to better evaluate general-purpose large models and the interactive characteristics of language models.

References

[1] Experience Grounds Language, https://arxiv.org/abs/2004.10151
[2] Tool Learning with Foundation Models
[3] Foundation Models for Decision Making: Problems, Methods, and Opportunities
[4] ChatGPT for Robotics: Design Principles and Model Abilities
[5] Augmented Language Models: a Survey
[6] Sparks of Artificial General Intelligence: Early experiments with GPT-4
[7] Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155
[8] Conversational AI, http://coai.cs.tsinghua.edu.cn/
[9] AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts, https://arxiv.org/abs/2110.01691
[10] Interactive Text Generation
[11] Evaluating Human-Language Model Interaction
[12] Transformer Memory as a Differentiable Search Index, https://arxiv.org/abs/2202.06991
[13] Language Models as Knowledge Bases?, https://arxiv.org/abs/1909.01066
[14] WebGPT: Browser-assisted question-answering with human feedback, https://arxiv.org/abs/2112.09332
[15] Atlas: Few-shot Learning with Retrieval Augmented Language Models, https://arxiv.org/pdf/2208.03299.pdf
[16] MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, https://arxiv.org/pdf/2206.08853.pdf
[17] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/abs/2201.11903
[18] ReAct: Synergizing Reasoning and Acting in Language Models, https://arxiv.org/abs/2210.03629
[19] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, https://arxiv.org/pdf/2205.10625.pdf
[20] Measuring and Narrowing the Compositionality Gap in Language Models, https://ofir.io/self-ask.pdf
[21] HuggingGPT, https://arxiv.org/abs/2303.17580
[22] Toolformer: Language Models Can Teach Themselves to Use Tools, https://arxiv.org/abs/2302.04761
[23] Socratic Models, https://arxiv.org/pdf/2204.00598.pdf
[24] MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks, https://aclanthology.org/2021.emnlp-main.85/
[25] Computational Language Acquisition with Theory of Mind, https://openreview.net/forum?id=C2ulri4duIs
[26] Generative Agents: Interactive Simulacra of Human Behavior, https://arxiv.org/pdf/2304.03442.pdf
[27] CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society, https://www.camel-ai.org/
[28] OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, https://arxiv.org/abs/2202.03052
[29] BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning, https://arxiv.org/abs/2206.08657
[30] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, https://arxiv.org/pdf/2301.12597.pdf
[31] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, https://say-can.github.io/
[32] Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control, https://grounded-decoding.github.io/
[33] Inner Monologue: Embodied Reasoning through Planning with Language Models, https://innermonologue.github.io/
[34] Large Language Models with Controllable Working Memory, https://arxiv.org/abs/2211.05110
