Authors of T5, RoBERTa, Wudao·Aquila, Zidong Taichu, and CPM discuss cutting-edge foundation-model technology丨A guide to avoiding pitfalls in large-model research and entrepreneurship

Overview

Large language models are more popular than ever, and the research directions of leading scholars serve as a beacon. So what do the main contributors to important large-model projects think? On June 9, the "Frontier Technology of Foundation Models" forum at the Beijing Zhiyuan Conference invited key authors of models such as T5, RoBERTa, Wudao·Aquila, Zidong Taichu, and CPM.

Caption: five guests in discussion, including Liu Zhiyuan, associate professor at Tsinghua University and Zhiyuan scholar; Liu Yinhan, co-founder and CTO of Birch.ai; Liu Jing, researcher at the Institute of Automation, Chinese Academy of Sciences; Zhou Yanqi, research scientist at Google; and Liu Pengfei, associate professor (joining remotely)

These young scholars shared their views on the key challenges of research in the era of large models, and on how to see the opportunities large models bring from the vantage points of startups, big companies, universities, and research institutes. On site, they offered thoughtful technical advice and commentary.

· Multiple sources confirm that GPT-4 is a sparse model. ——Zhou Yanqi

· For large models to gain cognitive ability, they must move from single modality to multimodality. ——Liu Jing

· Our attitude toward clients: humble to the dust, responsive to every request, on call at all times. ——Liu Yinhan

· Personally, I think the reward model is very important, and RLHF is not. ——Liu Pengfei

· The foundation model has become the "CPU" of the AI large-model era, and the single most heavily invested-in part of any "product". ——Lin Yonghua

Liu Yinhan: Building a real-time AI system with RLHF


In recent years, there has been much research on large language models in the directions of prompt tuning and fine-tuning. In this talk, Liu Yinhan of Birch.ai explained the value of RLHF for large language models from the perspective of products and customers.


Today's world is an era where humans and machines coexist. Due to the limitations of machines' understanding of human society, machines cannot completely replace humans in the short term, and they exist more as assistants to humans. As an assistant, although the general-purpose large language model can complete some common tasks well, it still lacks personalized services for individual users, users in certain professional fields, and corporate users. In this regard, building a real-time AI system can provide a good solution.

A real-time AI system can quantify customer feedback, for example by counting how many edits a customer makes to evaluate whether the AI's output meets their requirements. Using this data, the model is trained with reinforcement learning, producing more personalized generations.

An example of building a real-time system with human feedback and a large language model: consider a user requesting a return from customer service. The large model can look up the user's history and, based on company policy, decide whether to approve the return or offer the user a discount. Human agents usually give users more humane answers. With a real-time system in place, the model can absorb those human answers and learn to imitate them through reinforcement learning. The model can also track users' subsequent behavior to determine which answers are more likely to retain customers, and increase their training weight accordingly.

At the technical level, based on OpenAI's InstructGPT paper from March of last year and the PPO method, Birch built its own system and obtained a better policy than the initial SFT model. Their evaluation signal comes from user feedback. Roughly speaking, PPO can be understood as polishing each piece of text while holding the estimated overall "value" of the article fixed.
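As a rough, hypothetical illustration (not Birch's actual system), the core of PPO is the clipped surrogate objective, which limits how far each update can move the policy away from the one that generated the feedback:

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for a single sample.

    ratio     -- pi_new(a|s) / pi_old(a|s), the policy probability ratio
    advantage -- how much better this output was than a baseline, e.g.
                 derived from a reward model trained on user feedback
    eps       -- clipping range; moving the ratio beyond 1 +/- eps gains nothing
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Minimize the negative of the pessimistic (min) objective.
    return -min(unclipped, clipped)
```

For a positive advantage the loss stops improving once the ratio exceeds 1 + eps, which is what keeps the fine-tuned policy close to the SFT starting point.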

Liu Yinhan believes that generative AI today offers only point solutions. What we really need is a platform on which AI helps humans save time more efficiently. In the future, large language models should become a platform and an ecosystem, not just a text output.

Zhou Yanqi: Scaling LLMs with sparse MoE models


Throughout the history of deep learning, progress has been built on hardware, and rapid hardware advances have fueled the recent boom in large models. But as we approach the limits of Moore's Law, it is no longer feasible to keep scaling dense large language models simply by doubling parameters or doubling tokens; that is inefficient and unsustainable. We need a more sustainable way to scale large language models.

A paper from Baidu showed that a model's performance is predictable given its size and the total amount of training data. A few years later, OpenAI also published a scaling law for large models relating compute, dataset size, and parameter count. This has allowed more companies and institutions to train their own large models.
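As a sketch of what such a law looks like, a Chinchilla-style formulation expresses loss as a sum of power laws in parameter count and token count. The constants below are illustrative placeholders, not fitted values from the papers mentioned:

```python
def predicted_loss(n_params, n_tokens,
                   a=406.4, b=410.7, alpha=0.34, beta=0.28, irreducible=1.69):
    """Illustrative scaling law L(N, D) = E + A / N^alpha + B / D^beta.

    n_params -- model parameter count N
    n_tokens -- training token count D
    a, b, alpha, beta, irreducible -- placeholder fitted coefficients
    """
    return irreducible + a / n_params ** alpha + b / n_tokens ** beta
```

Under this form, loss falls predictably as either N or D grows, which is what lets an institution budget a training run before launching it.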

Consider Google's T5 model. T5 retains most of the original Transformer architecture, and one of its greatest contributions is framing all NLP tasks as text-to-text tasks. Another contribution was open-sourcing the C4 dataset, which benefits the entire research community.

Since T5, competition among large companies has grown increasingly fierce. T5 has 11B parameters, GPT-3 has 175B, and PaLM, released in 2022, has 540B. But pushing dense models much beyond this scale is very difficult; according to multiple sources, even GPT-4 uses a sparse architecture.


Zhou Yanqi therefore presented a way to scale large language models with sparse MoE (Mixture-of-Experts) layers. Take the GLaM model as an example: it contains 1.2T parameters, but only 97B activated parameters, far fewer than GPT-3; in other words, it is a sparsely activated MoE. It is a decoder-only model like GPT-3, yet GLaM achieves better performance than GPT-3.

But token-based MoE also has limitations. A poor expert-routing strategy (for example, one that causes load imbalance) can leave some experts under-trained, resulting in under- or over-specialized experts. To solve this, they proposed a routing algorithm called expert choice. Previous work assigns a fixed number of experts to each token with a top-k function, regardless of the relative importance of different tokens. Instead of letting tokens choose their top-k experts, expert choice lets each expert choose its top-k tokens. Each token can thus be routed to a variable number of experts, while each expert has a fixed capacity. Building on this, to further improve the MoE approach they proposed a non-uniform architecture, Brainformers, which optimizes the Transformer design by exploring an architecture search space to improve performance; it is more than 5x faster than the GLaM baseline.
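A minimal sketch of the expert-choice idea described above (toy code, not the paper's implementation): each expert selects its top-`capacity` tokens by router score, so expert load is fixed by construction while each token may be handled by a variable number of experts:

```python
def expert_choice_routing(scores, capacity):
    """Expert-choice routing sketch.

    scores[t][e] -- router score of token t for expert e
    capacity     -- number of tokens each expert takes
    Returns a dict mapping expert index -> sorted list of chosen token indices.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        # Each expert ranks all tokens by its own score and keeps the best.
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment
```

Note how load balance falls out for free: every expert processes exactly `capacity` tokens, while an important token can be picked by several experts and an unimportant one by none.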

How, then, can a language model be kept up to date, adapting a base model, say a pre-trained GPT-4, to some target downstream domain? Zhou Yanqi's team proposed lifelong learning with a progressively growing mixture of experts. The method grows the parameter count sublinearly as new training data is introduced, and adds a representation loss so the model does not forget earlier training data.

Liu Jing: Simple regression and thinking of multi-modal pre-training


Liu Jing gave a keynote titled "A Brief Review of and Reflections on Multimodal Pre-training", covering three questions: why pay attention to multimodal large models, how to train them, and how to develop them next. She noted that today's large models have upended the deep-learning-centered AI paradigm of the past decade or so; large models that can mine information from large-scale unsupervised data are expected to break through the bottlenecks of current AI applications. At the same time, Liu Jing observed that multimodal data is ubiquitous: much of human expression happens through seeing, listening, and thinking, and is not necessarily recorded in words. Therefore, for large models to acquire cognitive ability, they must move from single modality to multimodality.

At present, large-scale data, Transformer-based foundation models, and self-supervised learning give models good generality and cross-modal correlation capabilities; this is the basis of large models. But for a large model to serve practical applications, adaptation and fine-tuning are essential. Models with hundreds of billions or trillions of parameters make full-parameter fine-tuning very difficult, so fine-tuning such models more efficiently and cheaply has become an important research direction. To this end, the community has proposed methods including prompt tuning, adapters, and LoRA, aiming at low-cost incremental fine-tuning.
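A minimal sketch of the LoRA idea (pure Python, illustrative only): the frozen weight matrix W is augmented by a low-rank product A·B, and only the small factors A and B are trained:

```python
def lora_forward(x, w, a, b, scale=1.0):
    """LoRA sketch: y = x @ (W + scale * A @ B).

    x -- input rows;  w -- frozen weight (d x k)
    a -- trainable factor (d x r);  b -- trainable factor (r x k), r small
    Pure-Python matmul on nested lists; a real system would use tensors.
    """
    def matmul(m1, m2):
        return [[sum(m1[i][k] * m2[k][j] for k in range(len(m2)))
                 for j in range(len(m2[0]))] for i in range(len(m1))]

    delta = matmul(a, b)  # the low-rank update A @ B
    w_eff = [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
             for i in range(len(w))]
    return matmul(x, w_eff)
```

With rank r much smaller than the matrix dimensions, the trainable values in A and B are a tiny fraction of W's entries, which is where the cost savings come from.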

Future directions for multimodal pre-trained models include improving performance through stronger language models, larger vision and audio models, and more data. On this trend, Liu Jing remarked: "Scaling up is an effective path; by accumulating data and model capacity, performance can be further improved. But this path does not suit everyone, especially academia, whose strength is not the blind pursuit of scale. Models also need to be refined and optimized in other directions."


Lin Yonghua: Engineering the "CPU" of AI


For models at the scale of tens or even hundreds of billions of parameters, training costs are enormous. In her talk "The Wudao·Aquila Large Model: Engineering the 'CPU' of AI", Lin Yonghua argued for an engineering approach: building a "large-model evolution pipeline" that continuously improves training efficiency so the foundation model can keep delivering value to industry. She noted that the foundation model has become the "CPU" of the AI large-model era, the single most heavily invested-in part of any "product". By a rough estimate, training a 33-billion-parameter model on 1T tokens of data requires an investment of roughly 20 million RMB, covering compute, data, evaluation, labor, and other costs.

Therefore, only by adopting a systematic, standardized, and sustainable training process can the basic model release the potential for subsequent model capability improvement and empower the industry to land. Building a large model through engineering includes the following steps: data collection and processing are the foundation, model training is the core, model evaluation can control the direction of phased training, and continuous iteration allows the model to continuously improve.


In the talk, Lin Yonghua introduced the Aquila language model as a product of this engineering approach. It is the first open-source language model with bilingual Chinese-English knowledge that supports a commercial license agreement and meets domestic data-compliance requirements. The series includes the Aquila base models (7B, 33B), the AquilaChat dialogue models (7B, 33B), and the AquilaCode-7B text-to-code generation model.

The Aquila base models (7B, 33B) inherit the architectural design strengths of GPT-3, LLaMA, and others, replace a batch of underlying operators with more efficient implementations, redesign and implement a Chinese-English bilingual tokenizer, and upgrade the BMTrain parallel training method, achieving training efficiency nearly 8x that of Megatron + DeepSpeed ZeRO-2.

The AquilaChat dialogue models (7B, 33B) support fluent text dialogue and multilingual generation tasks. By defining extensible special instruction specifications, AquilaChat can call other models and tools and is easy to extend. For example, calling Zhiyuan's open-source AltDiffusion multilingual text-to-image model enables fluent text-to-image generation; with Zhiyuan's InstructFace multi-step controllable text-to-image model, multi-step controllable editing of face images is easy to achieve.

The AquilaCode-7B text-to-code model, built on the strong base capabilities of Aquila-7B, achieves high performance with a small dataset and few parameters, and is currently the best-performing open-source code model supporting both Chinese and English. It is trained on code data with compliant open-source licenses after high-quality filtering. In addition, AquilaCode-7B has been trained on both NVIDIA and domestic chips.

Most importantly, the Aquila language model is built for sustainable iteration: its training data, training methods, and performance will continue to improve, and the resulting "model tree" will continue to be open-sourced.

Finally, Lin Yonghua said that only by creating a sustainable, forward-looking large-model training paradigm, with a closed loop of data, training, evaluation, and iteration, can the foundation model play the same core, fundamental role as the CPU in a computer system and become infrastructure for economic development.

Round table forum: Tips in the era of large models

Liu Zhiyuan: What technology do you think needs the most attention in the era of large-scale models?

Liu Pengfei: Pay attention to the data structure in model pre-training.

The importance of data work has been proven in the supervised fine-tuning (SFT) stage, and some articles now argue that pre-training will soon "exhaust" the available natural-language text. On the principle that pre-training is about adding information, not just adding data, how to incorporate the structural information in multimodal data into the model is the direction I will consider next.

At the same time, the very existence of prompt engineering is a bad sign, a consequence of the black-box nature of large models. Precisely because we don't know how data is "stored" during pre-training, we try all kinds of prompts when "retrieving" it. If the structure of the data were transparent enough, I believe the problem would become somewhat simpler.

The reward model is very important; personally, I think RLHF (Reinforcement Learning from Human Feedback) is not. We need a high-quality reward model: not merely binary, and not merely fine-grained scoring, but ideally generative, outputting a distribution or a function that represents the probability or expectation of the agent doing well or badly.

Liu Zhiyuan: You all have different backgrounds, spanning startups, research institutes, big companies, and universities. Drawing on personal experience, how can one best play to one's own advantages in the era of large models?

Liu Yinhan: I have two periods of work experience.

While I was an AI researcher at Facebook from early 2019 to 2020, Google had built BERT, the first generation of large pre-trained models, and I participated in and led the development of RoBERTa and BART. Facebook later went on to release the OPT model and some of the latest large language models.

My impression from Facebook is that all its leaders were very interested in large language models, focused on "big", and invested regardless of cost; it didn't matter how much money was spent, and in the end they would open-source the technology.

During that period, everyone kept discussing the upper limits of models, parameters, and data. The whole industry wanted to explore what large language models could do.

That changed when I started a company. I found it important to look at large language models rationally, especially in narrow domains. For example, users of medical and healthcare products care about disease knowledge and medication plans, but not much about trivial matters like flight and hotel bookings.

So my conclusion is: a general-purpose large language model is completely unnecessary for a startup in a vertical field, because such a startup cares more about domain expertise.

On the other hand, from a practical standpoint, large language models are very expensive. Sometimes a medium-sized, more "focused" model is more useful.

Liu Jing: The mission of universities and research institutes is to conduct innovative and useful research, and large models are an example. Our advantage in innovation is a steady stream of student resources and the ability to plan long-term research goals, unlike companies that need short-term results. Therefore, we can continue to innovate more steadily and lead the frontier direction.

For example, within large language models, academia can explore stronger self-supervised algorithms, better data cleaning, and stronger model collaboration.

When choosing a direction, you must have a good eye and choose a useful direction. The path of large models has not seen the end. Our research direction should focus on using small and high-quality data to obtain capabilities comparable to large models, and then better serve applications.

Another field suited to academia is "AI for science". Collaborations with bioengineering and brain science require long-term investment to bear fruit.

Zhou Yanqi: It is still difficult for startups to surpass the established giants. Take the competition between OpenAI and Google: Google is not behind. It has the world's largest cloud computing platform, the most powerful TPU and GPU resources, and the best system- and software-level technology. Moreover, large companies clearly care more about long-term issues; whether on data standards or model safety, they are clearly more compliant.

Liu Pengfei: First, university faculty should take up their responsibility as scholars, for example by investigating questions such as how important RLHF really is. These may be things startups don't want to spend time researching.

Second, map out the battlefields of all parties, academia, industry, VCs, and startups, and clarify the role each should play, so that everyone in this field performs their part and does it better.

Furthermore, help the field find the direction of scientific progress, dare to put forward different views, and produce a more accurate direction. Especially when evaluating large models, find a reliable and fair evaluation method and avoid detours.

Finally, cultivate students and let them know the growth path. They don't need to be talented, as long as they have interest and enthusiasm, they can go forward together.

Liu Zhiyuan: In the field of large models, what do you want to do most? If you have enough budget, how do you want to solve it?

Liu Yinhan: I want a high-quality dataset, because data always matters more than architecture; architectural changes may amount to little more than tweaks. Large language models should be built into an ecosystem that goes beyond text, like a personal secretary that records your needs and is on call.

Liu Jing: I want to continue to work on multimodal dialogue, so that people and machines can communicate freely with pictures, texts and sounds. The long-term goal is to allow robots to use various senses to perceive and explore the world, and communicate with humans.

Zhou Yanqi: The short-term goal is to study large language models in large companies, build a super large distributed system, reduce the cost of large language models, and make it as fast as Google search. The long-term goal is to understand the principles of large language models and explore whether it is possible to use stronger computing power or quantum computers.

The short-term goal is to make language models' mathematical problem-solving as strong as GPT-4's answers to other kinds of questions, and to find the secret to doing so. Beyond that, given 10,000 GPUs, I would train from scratch to improve my own understanding of and ability to work with data.

 Audience Q&A

Audience A: Can robots perform a variety of tasks the way ChatGPT does, such as fetching a glass of water? Where does the difficulty lie?

Liu Jing: Whether the robot can perform various tasks like ChatGPT, the key is to get through perception and decision-making. Robots need to be able to see, locate, and perform tasks, rather than passively receiving pictures or text. The current multimodal large model cannot truly integrate multimedia information, nor can it ask questions or interact with the environment. There is still a lot of work to be done for robots to be like humans, but the route is clear, and better results will emerge in the future.

Audience B: Three questions. First, for the guests working at big companies: what kind of opportunity would prompt you to leave Google and start a company?

Secondly, what do students of scientific research in colleges and universities think about entrepreneurship?

Finally, for students who start companies: with what mindset should they handle clients' demands and pressure?

Zhou Yanqi: Whenever things go wrong for me, I want to leave Google, but I feel that Google has a better environment and resources. If I can't solve it at Google, I may not be able to use my talents in other companies.

If I leave Google, it's probably because I have something I really want to do. For example, create explosive products such as ChatGPT. At present, Google has not restricted the pace of my research, and I will not leave for the time being.

Liu Jing: Stick to what you want to do, and choose entrepreneurship or scientific research according to your own characteristics and timing. The multi-modal large models of our scientific research institutes are no worse than those of enterprises, and have advantages in video understanding.

Liu Yinhan: Dealing with clients can be summed up in one phrase: humble to the dust, responsive to every request, always on call.

Audience C: How do you think about using large language models for reasoning? Especially in the direction of mathematical reasoning. Some people think that the language model should not "learn" math problems, and should call tools to assist the language model.

Liu Pengfei: Mathematical reasoning is a basic capability of large language models, but it also needs to be combined with other tools to improve efficiency and performance. I recommend first analyzing the types and characteristics of different math problems, then choosing the most suitable method for each, without rejecting any approach outright. Large language models have advantages on complex multi-step reasoning and formalization problems, but they still need continuous improvement.

Audience D: How to solve the hallucination problem in large language model training?

Zhou Yanqi: Two ways. First of all, a larger language model can be used as a quality inspection model to evaluate the security and authenticity of the data generated by the small model. Secondly, you can use Google search or other indexing tools to add references to the generated data, so that users can trace the source and credibility of the data. Of course, it can also be realized by combining detection models and search tools.

Audience E: How to "break" the limitation of the sequence length of a large language model?

Liu Yinhan: I use a sliding-window approach: generate within each window separately, then merge the results. Note that the training data must be kept aligned across windows; otherwise quality drops noticeably.
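A toy sketch of the windowing step (an assumption about the general approach, not Birch's code): split a long token sequence into overlapping windows so a fixed-context model can handle each piece, with the overlap helping keep adjacent pieces aligned when the results are merged:

```python
def sliding_windows(tokens, window, stride):
    """Split a long token sequence into overlapping windows.

    tokens -- the full sequence
    window -- maximum context length the model can process
    stride -- step between window starts; stride < window gives overlap
    """
    if window >= len(tokens):
        return [tokens]  # fits in one context, no windowing needed
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + stride, stride)]
```

Each window is then processed independently and the overlapping regions are used to stitch the outputs back together.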

Zhou Yanqi: GPT-4 has faced similar problems; the computational bottleneck is the attention mechanism. Fully connected attention should be replaced with more efficient variants. One option is sparse attention: local attention plus fixed-span fully connected attention, which is somewhat analogous to the MoE approach.
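A toy illustration of the mask such a scheme might use (hypothetical, for intuition only): each position attends to a local window of recent neighbors plus fixed-stride "global" positions, giving far fewer nonzero entries than the full O(n²) causal mask:

```python
def sparse_attention_mask(seq_len, local_window, global_stride):
    """Causal sparse-attention mask combining local and strided-global patterns.

    mask[i][j] is True if position i may attend to position j (j <= i):
    either j is within `local_window` of i, or j is a fixed global
    position occurring every `global_stride` tokens.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):  # causal: only past and current positions
            if i - j < local_window or j % global_stride == 0:
                mask[i][j] = True
    return mask
```

In a real model this mask would gate the attention logits, so each token computes scores only against its local neighborhood and the shared global positions.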

Audience F: The mathematical reasoning performance of large models such as GPT-4 is poor. How to optimize with small models?

Liu Pengfei: Mathematical reasoning in large models requires a full-stack approach spanning pre-training, supervised fine-tuning (SFT), and other stages. In pre-training, a relevant corpus should be constructed so the model learns basic mathematical and reasoning concepts, such as the greatest common divisor. In the SFT stage, multi-step mathematical reasoning data needs to be expanded so the large model can adapt.

Audience G: What do you think of prompt engineering as a profession, will it develop into a discipline?

Zhou Yanqi: Prompt engineer may become the fastest-disappearing profession. I have been working on soft prompts; gradually, no manual prompt work will be needed.

Liu Zhiyuan: Please share a sentence to end today's forum.

Liu Yinhan: Three sentences. I studied chemical engineering as an undergraduate, then taught myself computer science; I was fortunate to do NLP research and publish papers, and now I am starting a company. Nothing is constant, and nothing stays popular forever, but there will always be new things. So keep changing yourself, embrace the new, find the direction you love, and chase your dreams rather than drift with the tide.

Liu Jing: First of all, we must be firm. In the next three to five years, large models will subvert many fields. Second, persist. Stick to what you think is valuable. Third, don't blindly follow the trend.

Zhou Yanqi: Look to the future: in research, consider not just the next 5 months but the next 5 or 10 years.

Liu Pengfei: As Bill Gates once said, artificial intelligence requires a sense of responsibility. What is the ultimate goal? If it promotes the betterment of all humanity, then whatever we do is not wrong.



Origin blog.csdn.net/BAAIBeijing/article/details/131238489