A 6,000-word read: 10 major challenges in current large language model (LLM) research

After a period of rapid development in large language models, what are the current mainstream research directions and common challenges? This article by Chip Huyen was translated with the author's permission. You are welcome to follow Chip Huyen on Twitter: @chipro


*When reprinting this article, please be sure to include the names of the author and translator, and links to all references.

Open challenges in LLM research

*This article is about 6,600 words, written by Chip Huyen, translator: Alpha Rabbit

Source link: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

Never in my life have I seen so many smart people working toward the same goal at once: making LLMs better. After talking with many people in industry and academia, I noticed that ten major research directions have emerged. The two directions currently receiving the most attention are hallucination (fabricated output) and context learning.

Personally, I am most interested in direction 3 below (multimodality), direction 5 (new architectures), and direction 6 (GPU alternatives).

Top 10 Open Challenges in LLM Research

  1. Reduce and evaluate hallucinations (fabricated output)

  2. Optimize context length and context construction

  3. Incorporate other data modalities

  4. Make LLMs faster and cheaper

  5. Design new model architectures

  6. Develop GPU alternatives

  7. Improve the usability of agents

  8. Improve learning from human preferences

  9. Improve the efficiency of chat interfaces

  10. Build LLMs for non-English languages


1. Reduce and evaluate hallucinations

Hallucination is a topic that has already been discussed at length, so I will keep this brief. Hallucination happens when an AI model makes things up. For many creative use cases, hallucination is a feature; for most other use cases, it is a bug. I recently took part in a panel discussion on LLMs with experts from Dropbox, Langchain, Elastics, and Anthropic. According to them, the first obstacle enterprises must overcome to put LLMs into production is hallucination.

Reducing models' hallucination and developing metrics to evaluate it is a booming research topic, and many startups are focusing on this problem. There are also practical tricks to reduce the probability of hallucination, such as adding more context to the prompt, chain-of-thought (CoT) prompting, self-consistency, or explicitly asking the model to keep its response concise. A minimal sketch of the self-consistency trick appears below.
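The idea behind self-consistency (Wang et al., 2022, cited below) is to sample several reasoning paths at non-zero temperature and majority-vote on the final answers. Here is a minimal sketch, assuming a `generate` callable that stands in for whatever LLM call your stack provides; the demo function below is a toy stub, not a real model.

```python
import random
from collections import Counter
from typing import Callable

def self_consistent_answer(prompt: str,
                           generate: Callable[[str], str],
                           n_samples: int = 5) -> str:
    # Sampling the model several times (temperature > 0) yields diverse
    # reasoning paths; the answer most paths agree on is less likely to
    # be a one-off hallucination.
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM call; a real `generate` would sample the model
# and parse the final answer out of its chain of thought.
demo = lambda prompt: random.choice(["42", "42", "42", "41"])
print(self_consistent_answer("What is 6 x 7?", demo))
```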

The following is a series of papers and reference materials on hallucination:

  • Survey of Hallucination in Natural Language Generation (Ji et al., 2022)

  • How Language Model Hallucinations Can Snowball (Zhang et al., 2023)

  • A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity (Bang et al., 2023)

  • Contrastive Learning Reduces Hallucination in Conversations (Sun et al., 2022)

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)

  • SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)

  • A simple example of fact-checking and hallucination detection from NVIDIA's NeMo-Guardrails

2. Optimize context length and context construction

Most questions require context. For example, if we ask ChatGPT "Which Vietnamese restaurant is the best?", the needed context is "where?", because the best Vietnamese restaurant in Vietnam is not the same as the best Vietnamese restaurant in the United States: the scope of the question differs.

According to the paper SITUATEDQA: Incorporating Extra-Linguistic Contexts into QA (Zhang & Choi, 2021), a significant portion of answers to information-seeking questions are context-dependent; in the Natural Questions NQ-Open dataset, for example, about 16.5%.

(NQ-Open:https://ai.google.com/research/NaturalQuestions)

I personally think this proportion will be even higher in the cases enterprises actually encounter. For example, suppose a company builds a chatbot for customer support. For this chatbot to answer any question a customer has about any product, the context it needs is likely that customer's history or that product's information. Because the language model "learns" from the context provided to it, this process is also called in-context learning.

[Figure: Context required for customer support inquiries]

Context length is especially important for RAG (Retrieval-Augmented Generation), which has become the dominant pattern for industry LLM applications. Specifically, RAG works in two main stages:

Stage 1: Chunking (also known as indexing)

Collect all the documents the LLM will use, split them into chunks, feed the chunks into an embedding model to generate embeddings, and store those embeddings in a vector database.

Stage 2: Query

When a user sends a query, such as "Will my insurance policy cover drug X?", an embedding model converts this query into an embedding, which we call QUERY_EMBEDDING. The vector database then retrieves the chunks whose embeddings are most similar to QUERY_EMBEDDING.
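To make the two stages concrete, here is a minimal, self-contained sketch in Python. The `embed` function is a toy placeholder (a normalized bag-of-bytes vector) standing in for a real embedding model, and the "vector database" is just a NumPy array searched by cosine similarity; this illustrates the flow, not a production setup.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy placeholder embedding: bag of bytes, L2-normalized.
    # A real system would call an embedding model here.
    v = np.zeros(256)
    for b in text.encode("utf-8"):
        v[b] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Stage 1: chunking / indexing
def build_index(documents: list[str], chunk_size: int = 200):
    chunks = [doc[i:i + chunk_size]
              for doc in documents
              for i in range(0, len(doc), chunk_size)]
    embeddings = np.stack([embed(c) for c in chunks])
    return chunks, embeddings

# Stage 2: query
def retrieve(query: str, chunks, embeddings, top_k: int = 3):
    query_embedding = embed(query)          # QUERY_EMBEDDING in the text above
    scores = embeddings @ query_embedding   # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top]         # chunks to place into the prompt

docs = ["Policy A covers drug X for members over 18...",
        "Policy B excludes experimental treatments..."]
chunks, embs = build_index(docs)
print(retrieve("Will my insurance cover drug X?", chunks, embs, top_k=1))
```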


The longer the context length, the more chunks we can squeeze into the context. The more information the model has access to, the higher the quality of its response, right?

Not always. How much context a model can use and how efficiently the model uses that context are two different questions. In parallel with increasing context length, we are also working on making context use more efficient; some call this "prompt engineering" or "prompt construction". For example, a recent paper shows that models understand information at the beginning and end of the input much better than information in the middle: Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023).

3. Incorporate other data modalities (multimodality)

In my opinion, multimodality is very powerful, yet underrated. Here is why multimodality matters:

First, many application scenarios require multimodal data, especially in industries with mixed data modalities such as healthcare, robotics, e-commerce, retail, gaming, and entertainment. For example:

  • Medical prediction often requires both text (e.g., doctor's notes, patient questionnaires) and images (e.g., CT, X-ray, MRI scans).

  • Product metadata usually contains images, videos, descriptions, and even tabular data (such as production date, weight, color). You may want to automatically fill in missing product information based on user reviews or product photos, or to let users search for products using visual attributes such as shape or color.

Second, multimodality promises a large boost in model performance. Shouldn't a model that understands both text and images perform better than one that understands only text? Text-based models consume so much text that we worry we will soon run out of Internet data to train them. Once text is exhausted, we will need to leverage other data modalities.


One use case that I'm particularly excited about is where multimodal technology allows visually impaired people to navigate both the internet and the real world.

The following is a series of papers and reference materials related to multimodality:

  • [CLIP] Learning Transferable Visual Models From Natural Language Supervision (OpenAI, 2021)

  • Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind, 2022)

  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce, 2023)

  • KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models (Microsoft, 2023)

  • PaLM-E: An embodied multimodal language model (Google, 2023)

  • LLaVA: Visual Instruction Tuning (Liu et al., 2023)

  • NeVA: NeMo Vision and Language Assistant (NVIDIA, 2023)

4. Make LLMs faster and cheaper

When GPT-3.5 was first released in late November 2022, many people expressed concerns about the latency and cost of using it in production. However, the latency/cost calculus has changed rapidly since then. Within half a year, the community had found ways to create models that perform very close to GPT-3.5 while requiring only about 2% of GPT-3.5's memory footprint.

The lesson here is: if you create something good enough, people will find a way to make it fast and cost-effective.


The following is the performance of Guanaco 7B compared with ChatGPT (GPT-3.5) and GPT-4, as reported in the Guanaco paper. Please note: overall, these performance comparisons are far from perfect, and evaluating LLMs is very, very hard.

[Figure: Performance comparison of Guanaco 7B with ChatGPT GPT-3.5 and GPT-4]

Four years ago, when I started writing notes for what became the "Model Compression" section of the book Designing Machine Learning Systems, I wrote about four main techniques for model optimization/compression:

  • Quantization: the most general model optimization method to date. Quantization reduces model size by using fewer bits to represent the model's parameters; for example, floating-point numbers can be represented with 16 or even 4 bits instead of 32.

  • Knowledge distillation: training a small model to imitate a larger model or an ensemble of models.

  • Low-rank factorization: the key idea is to replace high-dimensional tensors with lower-dimensional ones to reduce the parameter count. For example, a 3x3 tensor can be decomposed into the product of a 3x1 and a 1x3 tensor, so that only 6 parameters are needed instead of 9.

  • Pruning: removing weights (or larger structural components) that contribute little to the model's output.

All four of these techniques remain applicable and popular today. Alpaca was trained with knowledge distillation; QLoRA combines low-rank factorization and quantization. A minimal sketch of two of these ideas follows.
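As a concrete illustration of two of these ideas, here is a toy NumPy sketch, assuming symmetric int8 quantization and a truncated-SVD factorization; it shows the underlying math only, not any particular library's implementation.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Represent float32 weights as int8 plus a single float scale."""
    scale = max(np.abs(w).max(), 1e-12) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale  # dequantize as q.astype(np.float32) * scale

def low_rank(w: np.ndarray, r: int):
    """Approximate W (m x n) as A (m x r) @ B (r x n) with r << min(m, n).
    Storage drops from m*n to r*(m + n) parameters."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :r] * s[:r], vt[:r, :]

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
a, b = low_rank(w, r=32)
print("max quantization error:", np.abs(w - q.astype(np.float32) * scale).max())
print("relative low-rank error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```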

5. Design a new model architecture

Since AlexNet in 2012, we have seen many architectures rise and fall, including LSTMs, seq2seq, and others. Against that backdrop, the Transformer's staying power is remarkable. It has been around since 2017, and how long this architecture will remain dominant is an open question.

Developing a new architecture that surpasses the Transformer is not easy. The Transformer has been heavily optimized over the past six years, and a new architecture must run on the hardware people care about, at the scale they care about, today.

Note: Google originally designed Transformer to run quickly on TPU, and later optimized it on GPU.

In 2021, S4 from Chris Ré's lab attracted widespread attention; see Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al., 2021). Chris Ré's lab is still vigorously developing new architectures; one recent example, developed with the startup Together, is Monarch Mixer (Fu et al., 2023).

Their key idea is that in the existing Transformer architecture, attention has quadratic complexity in the sequence length, while the MLP has quadratic complexity in the model dimension. Architectures with sub-quadratic complexity would be more efficient.
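For concreteness, the usual back-of-envelope cost accounting for one Transformer layer, with sequence length $n$ and model dimension $d$, is:

```latex
% The attention scores QK^T form an n-by-n matrix -> quadratic in n;
% the MLP maps d -> 4d -> d per token             -> quadratic in d.
\underbrace{O(n^{2} d)}_{\text{attention}} \;+\; \underbrace{O(n d^{2})}_{\text{MLP}}
```

A sub-quadratic architecture aims to beat one or both of these terms.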

[Figure: Monarch Mixer]

6. Develop GPU alternatives

Since AlexNet in 2012, GPUs have been the dominant hardware for deep learning. In fact, one commonly cited reason for AlexNet's popularity is that it was the first paper to successfully use GPUs to train neural networks. Before GPUs, if you wanted to train a model at AlexNet's scale, you had to use thousands of CPUs, like the system Google released a few months before AlexNet. Compared with thousands of CPUs, a few GPUs were far more accessible to PhD students and researchers, which set off the deep learning research boom.

Over the past decade, many companies, both large enterprises and startups, have attempted to create new hardware for artificial intelligence. The most notable attempts include Google's TPU, Graphcore's IPU (how is the IPU doing?), and Cerebras. SambaNova raised more than a billion dollars to develop new AI chips but appears to have pivoted toward being a generative AI platform.

For a while, there were great expectations for quantum computing, with key players including:

  • IBM QPU

  • Google's quantum computer reached a major milestone in quantum error correction, reported in Nature earlier this year. Its quantum virtual machine is publicly accessible through Google Colab.

  • Research laboratories such as the MIT Center for Quantum Engineering, the Max Planck Institute of Quantum Optics, the Chicago Quantum Exchange, Oak Ridge National Laboratory, and others.

Another equally exciting direction is photonic chips. I don't know much about this area, so please correct me if I'm wrong. Existing chips use electricity to transmit data, which consumes a lot of energy and creates delays. Photonic chips, on the other hand, use photons to transmit data, harnessing the speed of light for faster, more efficient computing. Various startups in this space have raised hundreds of millions of dollars, including Lightmatter ($270 million), Ayar Labs ($220 million), Lightelligence ($200 million+), and Luminous Computing ($115 million).

The following is a timeline of progress for the three main approaches to photonic matrix computation, excerpted from the paper Photonic matrix multiplication lights up photonic accelerator and beyond (Zhou et al., Nature, 2022). The three approaches are plane light conversion (PLC), Mach-Zehnder interferometer (MZI), and wavelength division multiplexing (WDM).


7. Improve the usability of agents

An agent is an LLM that can take actions (think of it as an assistant that completes various tasks on your behalf, hence "agent"), such as browsing the Internet, sending emails, or making bookings. Compared with the other research directions in this article, this may be one of the newest. Because of its novelty and enormous potential, people are wildly enthusiastic about agents: Auto-GPT is now the 25th most popular repo on GitHub by star count, and GPT-Engineer is another popular repo. A minimal sketch of the control loop behind such agents appears below.
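For intuition, here is a minimal sketch of a tool-calling agent loop. The tools and the `llm` function are stubs standing in for a real browser, email client, and model; actual agent frameworks such as Auto-GPT are far more elaborate, so treat this as a schematic under those assumptions.

```python
from typing import Callable

# Toy "tools" the agent may invoke; a real agent wires these to a browser,
# an email client, a booking API, etc.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"(stub) top result for {q!r}",
    "send_email": lambda body: "(stub) email sent",
}

def llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call. A real model would decide,
    # from the prompt, which tool to call next or whether to finish.
    return "FINISH: here is my answer"

def run_agent(task: str, max_steps: int = 5) -> str:
    history = f"Task: {task}"
    for _ in range(max_steps):
        decision = llm(history)
        if decision.startswith("FINISH:"):
            return decision.removeprefix("FINISH:").strip()
        # Expected action format: "CALL <tool>: <argument>"
        name, _, arg = decision.removeprefix("CALL ").partition(":")
        observation = TOOLS[name.strip()](arg.strip())
        history += f"\n{decision}\nObservation: {observation}"
    return "(gave up after max_steps)"

print(run_agent("Find a Vietnamese restaurant and email me the address."))
```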

Despite the excitement, doubts remain about whether LLMs are reliable and performant enough to be entrusted with taking actions. One application scenario has already emerged, however: using agents for social research, as in the famous Stanford experiment showing emergent social behavior in a small society of generative agents. For example, starting from the single user-specified idea that one agent wants to host a Valentine's Day party, the agents autonomously spread party invitations over the following two days, made new friends, and invited one another to the party (Generative Agents: Interactive Simulacra of Human Behavior, Park et al., 2023).

Perhaps the most notable startup in this space is Adept, founded by two co-authors of the Transformer paper and a former OpenAI VP; it has raised nearly $500 million to date. Last year, they demonstrated their agent browsing the Internet and adding a new account to Salesforce.

8. Improve learning from human preferences

RLHF (Reinforcement Learning from Human Feedback) is cool, but a bit finicky. It wouldn't be surprising if people found better ways to train LLMs. Meanwhile, many issues in RLHF remain unresolved, such as:

① How do we express human preferences mathematically?

Currently, human preference is determined through comparison: a human annotator judges whether response A is better than response B. However, this does not capture how much better response A is than response B.
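For concreteness, the standard reward-modeling recipe (used by InstructGPT, among others) fits these pairwise labels with a Bradley-Terry style model over a learned reward $r(x, y)$:

```latex
% Probability that annotators prefer response A over response B for prompt x,
% given a learned scalar reward r(x, y):
P(y_A \succ y_B \mid x) = \sigma\big(r(x, y_A) - r(x, y_B)\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Note that only the binary preference enters the training signal; the margin, i.e. how much better A actually is than B, is never observed.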

② What are human preferences?

Anthropic measures the quality of its models' output along three axes: helpful, honest, and harmless. See Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022).

DeepMind tries to generate responses that please the majority of people. See Fine-tuning language models to find agreement among humans with diverse preferences (Bakker et al., 2022).

Furthermore, do we want an AI that can take a stand, or a vanilla AI that deflects from any potentially controversial topic?

③ Whose preferences are "human" preferences? Should differences in culture, religion, political leaning, and so on be taken into account? Obtaining training data that adequately represents all potential users is a major challenge.

For example, among the annotators of OpenAI's InstructGPT data, none were over 65, and the annotators were mainly Filipino and Bangladeshi. See InstructGPT: Training language models to follow instructions with human feedback (Ouyang et al., 2022).

[Figure: Nationality statistics of InstructGPT annotators]

While community-led efforts are laudable in intent, they can produce biased data. For example, in the OpenAssistant dataset, 201 of the 222 respondents (90.5%) self-identified as male. Jeremy Howard has a great thread about this on Twitter:


9. Improve the efficiency of chat interfaces

Since ChatGPT, there has been discussion about whether chat is a suitable interface for a variety of tasks.

See:

  • Natural language is the lazy user interface (Austin Z. Henley, 2023)

  • Why Chatbots Are Not the Future (Amelia Wattenberger, 2023)

  • What Types of Questions Require Conversation to Answer? A Case Study of AskReddit Questions (Huang et al., 2023)

  • AI chat interfaces could become the primary user interface to read documentation (Tom Johnson, 2023)

  • Interacting with LLMs with Minimal Chat (Eugene Yan, 2023)

However, this is not a new topic. In many countries, especially in Asia, chat has been used as the interface for super apps for about a decade; Dan Grover wrote about this back in 2014.


In 2016, when many thought apps were dead and chatbots were the future, the discussion got heated again:

  • On chat as interface (Alistair Croll, 2016)

  • Is the Chatbot Trend One Big Misunderstanding? (Will Knight, 2016)

  • Bots won't replace apps. Better apps will replace apps (Dan Grover, 2016)

I personally like the chat interface for the following reasons:

① The chat interface is one that everyone, even people who have never used a computer or the Internet before, can quickly learn to use (universality). In the early 2010s, when I was volunteering in a low-income neighborhood in Kenya, I was amazed at how familiar everyone there was with banking on their phones via text message. No one in that community had a computer.

② The chat interface is accessible. If your hands are busy with something else, you can use voice instead of text.

③ Chat is also a very robust interface: you can make any request of it and it will reply, even if the reply is not necessarily good.

However, I believe the chat interface can still be improved in several respects:

① Allow multiple messages per turn

Currently, we more or less assume a single message per conversational turn. But that is not how my friends and I text. Usually, I need multiple messages to complete a thought, because I need to insert different data (images, locations, links), I may have forgotten something in a previous message, or I simply don't want to cram everything into one giant paragraph.

② Multimodal input

In the field of multimodal applications, most of the effort goes into building better models and much less into building better interfaces. Take Nvidia's NeVA chatbot, for example. I'm no user-experience expert, but I suspect there is room for improvement here.

PS: Sorry to single out the NeVA team here; even so, your work is still pretty cool!


③ Integrate generative AI into workflows

Linus Lee covers this well in his talk "Generative AI interface beyond chats." For example, if you want to ask a question about a column of a chart you're working on, you should be able to just point at that column and ask.

④ Message editing and deletion

How does editing or deletion of user input change the flow of the conversation with the chatbot?

10. Build LLMs for non-English languages

We know that current English-first LLMs don't work nearly as well for many other languages, whether in performance, latency, or speed. See:

  • ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning (Lai et al., 2023)

  • All languages are NOT created (tokenized) equal (Yennie Jun, 2023)
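To see the tokenization disparity concretely, here is a minimal sketch using OpenAI's tiktoken library. The two sentences are rough translations of each other (a hedged example; exact token counts depend on the encoding), and non-English text usually costs noticeably more tokens, which means higher latency and cost.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/GPT-4

# Roughly equivalent sentences in two languages.
samples = {
    "English": "How are you today?",
    "Vietnamese": "Hôm nay bạn thế nào?",
}
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens")
```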

I am only aware of attempts to train Vietnamese LLMs (such as the efforts of the Symato community). However, several early readers of this article told me they didn't think I should include this direction, for the following reason:

This is less a research question than a logistics question; we already know how to do it, and someone just needs to put in the money and effort. However, that is not entirely correct. Most languages are considered low-resource languages: compared with English or Chinese, many have far less high-quality data available, and so may require different techniques to train large language models. See:

    • Low-resource Languages: A Review of Past Work and Future Challenges (Magueresse et al., 2020)

    • JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages (Agić et al., 2019)

The more pessimistic believe that in the future many languages will disappear, and the Internet will consist of two universes in two languages: English and Chinese. This school of thought is not new; does anyone remember Esperanto?

The impact of AI tools such as machine translation and chatbots on language learning remains unclear. Will they help people learn new languages faster, or will they eliminate the need to learn new languages altogether?

Conclusion

Let me know if I missed anything in this article. For additional perspectives, consult the comprehensive paper Challenges and Applications of Large Language Models (Kaddour et al., 2023).

Some of the questions above are harder than others. For example, I think question 10, building LLMs for non-English languages, would be relatively straightforward given enough time and resources.

Question 1, reducing hallucination, will be much harder, because hallucination is just LLMs doing their probabilistic thing.

Question 4, making LLMs faster and cheaper, will never be completely solved. Great progress has already been made in this area and more will come, but improvement in this direction will be ongoing.

Questions 5 and 6, new architectures and new hardware, are very challenging, but they are inevitable over time. Because of the symbiotic relationship between architecture and hardware (new architectures need to be optimized for common hardware, and hardware needs to support common architectures), they may well be solved by the same company.

Some problems cannot be solved with technical knowledge alone. For example, question 8, improving methods for learning from human preferences, may be more a policy problem than a technical one. Question 9, improving the efficiency of chat interfaces, is more a user-experience problem. We need more people with non-technical backgrounds to work with us on these problems.

What research direction are you most interested in? What do you think is the most promising solution to these problems? Would love to hear your opinion.
