[Highlights from the NVIDIA GTC conference] Deep reinforcement learning from real-world datasets

Foreword

The talk covered in this article is from the NVIDIA GTC conference.
Original video link: https://register.nvidia.com/flow/nvidia/gtcspring2023/attendeeportal/page/sessioncatalog/session/1666649323930001EDPn

The common ingredients of large-scale success in machine learning are large models and large amounts of GPU training, and most of the datasets involved are labeled. Although this recipe achieves good results in the traditional sense, the GPU training is costly and depends on very large datasets.
However, in recent years more and more unlabeled data has been used, and this has become a very important part of machine learning. This naturally leads to reinforcement learning, a machine learning framework for reasoning directly about decisions and their consequences. Reconciling reinforcement learning with the data-driven paradigm in which most modern machine learning systems operate is difficult, because reinforcement learning in its classical form is an active, online learning paradigm. Can we get the best of both worlds: a data-driven approach, as in supervised or unsupervised learning, that can leverage large previously collected datasets, together with the decision-making formalism of reinforcement learning, which can reason about decisions and their consequences? The rest of the talk shows how this becomes possible with offline reinforcement learning, which enables effective pre-training from suboptimal multi-task data, broad generalization in real-world domains, and compelling applications in settings such as robotics and dialogue systems.

1. Offline Reinforcement Learning Basics

Offline reinforcement learning means reinforcement learning from previously collected experience data, without interacting with the environment. Unlike online reinforcement learning, it can be trained purely by analyzing stored historical data. Below, RL is used as shorthand for reinforcement learning.
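To make the setting concrete, here is a minimal Python sketch of the offline RL data flow with a placeholder update step; the dataset contents and names are illustrative assumptions, not details from the talk.

```python
import numpy as np

# A fixed dataset of logged transitions, e.g. collected earlier by people
# or by an older policy. The learner never calls env.step() during training.
rng = np.random.default_rng(0)
N = 10_000
dataset = {
    "obs":      rng.normal(size=(N, 4)).astype(np.float32),
    "action":   rng.integers(0, 2, size=N),
    "reward":   rng.normal(size=N).astype(np.float32),
    "next_obs": rng.normal(size=(N, 4)).astype(np.float32),
    "done":     np.zeros(N, dtype=bool),
}

def sample_batch(data, batch_size=256):
    """Sample a random minibatch from the fixed dataset."""
    idx = rng.integers(0, N, size=batch_size)
    return {k: v[idx] for k, v in data.items()}

for step in range(1_000):
    batch = sample_batch(dataset)
    # q_update(batch)  # placeholder: a TD / Q-learning update using only logged data
```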

1.1 Comparison of offline RL and imitation learning

Consider moving from the green point to the red point: imitation learning can only repeat the demonstrated trajectories, whereas offline RL can extract a near-optimal trajectory from a collection of messy, suboptimal trajectories.
Offline RL can combine the strengths of different parts of the dataset to achieve overall optimality, as the toy example below illustrates.
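To make this trajectory-"stitching" idea concrete, here is a toy Python example (my own illustration, not from the talk): two suboptimal logged trajectories each cover only part of the route from start to goal, yet tabular Q-learning over their combined transitions recovers the full path.

```python
import numpy as np

# Logged transitions: (state, action, reward, next_state, done)
# Trajectory 1 (suboptimal): 0 -> 1 -> 4  (takes a wrong turn into dead end 4)
# Trajectory 2 (suboptimal): 1 -> 2 -> 3  (starts mid-route, reaches goal 3, reward 1)
# Neither trajectory alone goes all the way from the start (0) to the goal (3).
transitions = [
    (0, 0, 0.0, 1, False),
    (1, 1, 0.0, 4, True),
    (1, 0, 0.0, 2, False),
    (2, 0, 1.0, 3, True),
]

n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

# Tabular Q-learning restricted to the logged transitions (fully offline).
for _ in range(500):
    for s, a, r, s2, done in transitions:
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += lr * (target - Q[s, a])

# The greedy policy "stitches" the two trajectories into 0 -> 1 -> 2 -> 3.
print(Q.argmax(axis=1)[:3])  # expected: action 0 at states 0, 1 and 2
```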

1.2 Conservative Q-learning

This algorithm is somewhat similar to adversarial training. As shown on the slide, suppose the green curve is the true Q-function and the blue curve is the fitted Q-function, which tries to match the green curve.
The first line of the formula is a regularization term: it searches for an adversarial distribution with high Q-values and minimizes the Q-values under that distribution. This locates the overestimated points and pushes them down, which is very effective at preventing overestimation.
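For reference, the conservative Q-learning objective as written in the original CQL paper has roughly the following form (the exact notation on the slide may differ):

$$
\min_{Q}\;\max_{\mu}\;\;\alpha\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot\mid s)}\big[Q(s,a)\big]-\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big)+\tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi}\bar{Q}(s,a)\big)^{2}\Big]
$$

The first term is the regularizer described above: it pushes down Q-values under the adversarially chosen distribution $\mu$ while pushing them up on actions that actually appear in the dataset $\mathcal{D}$; the second term is the ordinary Bellman error on logged transitions.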
In the example from the talk, a single neural network trained with this algorithm achieves good results.

2. Offline RL Pre-training for Robotics

2.1 PTR

PTR is a policy trained in a straightforward manner on all tasks in the Bridge dataset.
The policy is first pre-trained on the entire dataset and then fine-tuned on the new task with roughly 10 demonstrations, while data from the Bridge dataset continues to be replayed during fine-tuning to prevent forgetting. The new task is represented by the last index of the one-hot task vector.
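A rough sketch of this fine-tuning recipe is given below; the task count, mixing ratio, and function names are illustrative assumptions, not details from the PTR paper or the talk.

```python
import numpy as np

NUM_TASKS = 80                 # assumed size of the one-hot task vector
NEW_TASK_ID = NUM_TASKS - 1    # the new task reuses the last slot of the one-hot vector

def one_hot(task_id, num_tasks=NUM_TASKS):
    v = np.zeros(num_tasks, dtype=np.float32)
    v[task_id] = 1.0
    return v

def finetune_batch(bridge_data, new_task_data, batch_size=256, new_frac=0.5):
    """Mix replayed Bridge-dataset transitions with new-task demos so that
    fine-tuning on the new task does not cause forgetting."""
    n_new = int(batch_size * new_frac)
    new_idx = np.random.randint(0, len(new_task_data), size=n_new)
    old_idx = np.random.randint(0, len(bridge_data), size=batch_size - n_new)
    return [new_task_data[i] for i in new_idx] + [bridge_data[i] for i in old_idx]

# Example usage with dummy transition records (real ones would hold images, actions, rewards):
bridge = [{"task": one_hot(t % (NUM_TASKS - 1))} for t in range(1000)]
new = [{"task": one_hot(NEW_TASK_ID)} for _ in range(10)]   # ~10 demos of the new task
batch = finetune_batch(bridge, new)
```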
Offline RL pre-training also helps to improve the performance of PTR.

3. Offline RL for Large Language Models

After training, a visual dialogue task is used for evaluation, which shows that offline RL can use the data from the dialogue process to figure out how to optimize the conversation.
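As a hypothetical illustration of how logged dialogues can be turned into transitions for an offline RL algorithm (the data format and reward definition are my assumptions, not details from the talk):

```python
def dialogue_to_transitions(dialogue, final_score):
    """Convert one logged dialogue into (state, action, reward, next_state, done)
    tuples: the state is the conversation history so far, the action is the
    agent's next utterance, and only the final turn carries the reward
    (e.g. whether the guesser picked the correct image in visual dialogue)."""
    transitions = []
    history = []
    for i, (agent_utterance, user_reply) in enumerate(dialogue):
        state = list(history)
        history.append(agent_utterance)
        history.append(user_reply)
        done = (i == len(dialogue) - 1)
        reward = final_score if done else 0.0
        transitions.append((state, agent_utterance, reward, list(history), done))
    return transitions

# Example: a two-turn logged dialogue where the task eventually succeeded.
logged = [("Is it an animal?", "Yes"), ("Is it the cat on the left?", "Yes, correct")]
for t in dialogue_to_transitions(logged, final_score=1.0):
    print(t)
```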

4. The impact of offline RL on humans

By observing how people interact with each other and learning how actions affect behaviour, it is possible to learn how humans influence each other in subtle and unexpected ways.
Robots trained this way can also influence human behaviour, and larger datasets will enable them to recognize more subtle patterns.

Original article: blog.csdn.net/weixin_47665864/article/details/129712018