[The 10th Issue of the MaYin Free Book Giveaway] "Reinforcement Learning: Principles and Python Practice"

Table of contents

1. What is AI alignment?

2. Why study AI alignment?

3. Common methods of AI alignment


1. What is AI alignment?

AI alignment refers to making the behavior of an artificial intelligence system consistent with human intentions and values.

Artificial intelligence systems may suffer from "misalignment" problems. Take a question-and-answer system such as ChatGPT as an example: its answers may contain remarks that endanger the reunification of the motherland, insult martyrs, vilify the Chinese nation, incite violence, or use obscene language, all of which are illegal or inconsistent with core socialist values. Its answers may also exhibit flattery, coercion and inducement, or idle gossip that interferes with users achieving their intended goals. The process of eliminating such misalignment in an AI system is called AI alignment.

Figure: ChatGPT misalignment behavior

2. Why study AI alignment?

According to this definition, all artificial intelligence problems (including AI ethics, AI governance, explainable AI, and even the most basic regression and classification problems) can be regarded as AI alignment problems. So why did academia coin the new concept of "AI alignment", and what is the value of studying it?

In fact, the concept of AI alignment is inseparable from the birth of general-purpose large models such as ChatGPT. A general-purpose large model must complete many different tasks at once, and different tasks carry different expectations: some call for more imagination, others for strict respect for facts; some call for rational objectivity, others for delicate and rich emotion. This diversity of tasks means the large model must be aligned in all respects, not just a few. Traditional research usually aligns a model along a single dimension; for a general-purpose model such as ChatGPT, that amounts to fixing one problem only for another to surface, and it cannot cover everything.

As the scale of machine learning models keeps growing and neural networks become widely used, humans can no longer fully understand and explain every behavior of artificial intelligence. For example, some of AlphaGo's Go moves are still not fully understood by humans. In the future, there may even be artificial intelligence that surpasses humans in every respect (such as MOSS in "The Wandering Earth"). Traditional alignment methods clearly cannot meet the alignment requirements of such artificial intelligence.

3. Common methods of AI alignment

AI alignment is inseparable from human involvement. Human evaluation of and feedback on an AI system can identify where it is misaligned and guide it to improve.

Common methods for AI alignment include imitation learning and reinforcement learning from human feedback (RLHF). ChatGPT adopts both of these alignment methods.

Figure: ChatGPT training steps (image source: https://openai.com/blog/chatgpt)

The figure above diagrams ChatGPT's training steps. The first step uses collected demonstration data for supervised learning; this part performs AI alignment through imitation learning. However, ChatGPT's training team believes that imitation learning alone cannot fully meet the alignment requirements.
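As a rough illustration of this supervised step, the sketch below fine-tunes a small causal language model on (prompt, answer) demonstrations with a standard language-modeling loss. The model name, the demonstration data, and the hyperparameters are placeholder assumptions for illustration, not ChatGPT's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder demonstration data: (prompt, human-written answer) pairs.
demo_pairs = [
    ("Explain what reinforcement learning is.",
     " Reinforcement learning trains an agent by trial and error to maximize reward."),
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any causal language model works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for prompt, answer in demo_pairs:
    # A standard language-modeling loss on "prompt + answer" makes the model
    # imitate the human demonstration (behavior cloning).
    batch = tokenizer(prompt + answer, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```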

The reasons imitation learning cannot fully meet the alignment requirements may be as follows. The demonstration data used by imitation learning is limited in scope and cannot cover every situation, so an AI trained on such a dataset will inevitably behave in misaligned ways in some edge cases. In addition, even when the training objective is essentially optimized, the model may still perform poorly on certain individual samples; such samples can be quite important and may carry significant legal or public-opinion risks.

For this reason, ChatGPT's training process further uses reinforcement learning from human feedback: the second and third steps in the diagram apply RLHF.

The second step builds a reward model from human feedback. In this step, the people providing feedback can concentrate on the issues they consider most important, ensuring the reward model is correct on those issues. If unanticipated problems are discovered in later testing, the reward model can be patched by providing additional feedback samples. Through such human intervention and continuous iteration of feedback, the reward model steadily improves and becomes aligned with human expectations.

When training the reward model from feedback, the language model first outputs several alternative answers for each prompt, and humans then rank those answers. Compared with asking people to write reference answers directly, this approach better preserves the creativity of the language model itself, and it also makes feedback faster and cheaper to collect.
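A minimal sketch of how such ranked feedback can train a reward model is shown below, assuming each human ranking has already been reduced to (preferred, rejected) answer pairs and that answers have been encoded into fixed-size feature vectors. The network, feature dimensions, and data are illustrative placeholders, not ChatGPT's actual reward model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: a scalar score head on top of precomputed answer features."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, features):            # features: (batch, hidden_size)
        return self.score(features).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Placeholder features for a preferred and a rejected answer to the same prompt.
chosen_feats, rejected_feats = torch.randn(4, 768), torch.randn(4, 768)

# Pairwise ranking loss: the answer humans preferred should receive the higher reward.
r_chosen = reward_model(chosen_feats)
r_rejected = reward_model(rejected_feats)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()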

The third step uses the reward model for reinforcement learning. The PPO algorithm mentioned in the diagram is a reinforcement learning algorithm; by applying it, the system's behavior is aligned with the reward model.
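The sketch below shows the core of the PPO clipped objective that this step relies on, with toy tensors standing in for response log-probabilities and for advantages derived from reward-model scores. A real RLHF loop also adds a KL penalty against the supervised model, which is omitted here for brevity; all values are placeholders for illustration.

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that sampled the data.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate objective, i.e. minimizes its negative.
    return -torch.min(unclipped, clipped).mean()

# Toy values: log-probabilities of sampled responses and advantages computed
# from reward-model scores (placeholders, not real training data).
old_logprobs = torch.tensor([-1.2, -0.8, -2.0])
new_logprobs = torch.tensor([-1.0, -0.9, -1.5], requires_grad=True)
advantages = torch.tensor([0.5, -0.3, 1.0])

loss = ppo_clip_loss(new_logprobs, old_logprobs, advantages)
loss.backward()
```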

The successful application of reinforcement learning from human feedback to large models such as ChatGPT has made it the most popular alignment technique for large models. At present, most large models use this technique for alignment.

Further reading


"Reinforcement Learning: Principles and Python Practice"

Written by Xiao Zhiqing

Demystifying PPO and RLHF, the key technologies behind ChatGPT

Complete theory: covers the main theory and common algorithms of reinforcement learning, walking you through the key technical points behind ChatGPT;

Strong practicality: every chapter includes a programming case, and each deep reinforcement learning algorithm comes with side-by-side TensorFlow and PyTorch implementations;

Rich supporting resources: provides chapter-by-chapter summaries of key points and a variety of end-of-chapter exercises, plus online resources such as a Gym source-code walkthrough, a development environment setup guide, and exercise answers to support self-study.

  • Two copies of the book will be given away this time
  • Event time: until 2023-11-21
  • How to participate: follow the blogger, then like, favorite, and comment below this article.
