Distributed Prioritized Experience Replay (DISTRIBUTED PRIORITIZED EXPERIENCE REPLAY, Ape-X)

Foreword

Distributed Prioritized Experience Replay (DPER) is an experience replay method for training deep reinforcement learning agents. In deep reinforcement learning, experience replay is commonly used to store and reuse the experience an agent gathers while interacting with its environment. The basic idea is to store the agent's experience in a replay buffer and then sample batches of experience from it to train the agent's neural network. This has two main advantages: first, it breaks the correlations between consecutive samples that would otherwise harm training; second, it improves data efficiency, because the agent can reuse past experience many times. However, traditional experience replay samples uniformly at random and does not take the importance of each experience into account. As a result, important experiences are replayed no more often than unimportant ones, which can limit learning.

DPER introduces prioritized experience replay, in which the probability of selecting an experience depends on its priority. The priority can be computed from the experience's TD error (temporal-difference error) or from another measure of importance. Experiences with higher priority are more likely to be selected for training, letting the agent focus on the most informative experiences. In addition, DPER distributes the work of generating and replaying experience across multiple compute nodes working in parallel. This distributed setup speeds up experience replay and improves learning efficiency.
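To make the idea concrete, below is a minimal Python sketch of proportional prioritized sampling, where each transition's priority is derived from the magnitude of its TD error. This is only an illustration of the concept, not the implementation used in the paper; the class name, the small constant added to priorities, and the default exponent are illustrative choices.

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized replay buffer (illustrative sketch only)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities shape sampling (0 = uniform)
        self.data = []              # stored transitions
        self.priorities = []        # one priority per stored transition
        self.pos = 0                # next write position (circular buffer)

    def add(self, transition, td_error):
        priority = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()                      # P(i) proportional to priority
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        return idx, [self.data[i] for i in idx], probs[idx]

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```

Ape-X uses the same idea, but the buffer is shared by hundreds of actors and implemented far more efficiently (for example with a sum-tree, as in Schaul et al., 2016) than this linear-scan sketch.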

To sum up, Distributed Prioritized Experience Replay combines the advantages of prioritized experience replay and distributed computing, improving both the learning efficiency and the final performance of agents in deep reinforcement learning. It has achieved strong results on many reinforcement learning tasks and has become a standard tool in the field.

Paper translation

DISTRIBUTED PRIORITIZED EXPERIENCE REPLAY

Abstract

We propose a distributed architecture for deep reinforcement learning that enables agents to learn effectively from orders of magnitude more data than was previously possible. The algorithm decouples acting from learning: Actors select actions according to a shared neural network, interact with their own instances of the environment, and accumulate the resulting experience in a shared experience replay memory; the Learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the Actors. Our architecture substantially improves on the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.

1 Introduction

A broad trend in deep learning is that combining more computation (Dean et al., 2012) with more powerful models (Kaiser et al., 2017) and larger datasets (Deng et al., 2009) yields more impressive results. It is reasonable to expect that similar principles hold for deep reinforcement learning, and there is a growing number of examples supporting this: the effective use of greater computational resources has been a key factor in the success of algorithms such as Gorila (Nair et al., 2015), A3C (Mnih et al., 2016), GPU Advantage Actor-Critic (Babaeizadeh et al., 2017), distributed PPO (Heess et al., 2017), and AlphaGo (Silver et al., 2016). Deep learning frameworks such as TensorFlow (Abadi et al., 2016) support distributed training, making large-scale machine learning systems easier to implement and deploy. Nevertheless, most current research in deep reinforcement learning still focuses on improving performance within the computational budget of a single machine, and the question of how best to use more resources remains comparatively unexplored.

In this paper, we describe an approach to scaling up deep reinforcement learning by generating more data and selecting from it in a prioritized fashion (Schaul et al., 2016). Standard approaches to distributed training of neural networks focus on parallelizing the computation of gradients to optimize the parameters faster (Dean et al., 2012). In contrast, we distribute the generation and selection of experience data, and find that this alone is sufficient to improve results. This is complementary to distributing gradient computation, and the two approaches can be combined, but in this work we focus purely on data generation.

We use this distributed architecture to scale up variants of Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), and evaluate them on the Arcade Learning Environment benchmark (Bellemare et al., 2013) and on a range of continuous control tasks. Our architecture achieves a new state of the art on Atari games in a fraction of the wall-clock time of the previous state of the art, and without per-game hyperparameter tuning. We empirically investigate the scalability of our framework, analyzing how prioritization affects performance as the number of data-generating workers increases. Our experiments include analyses of factors such as replay capacity, the recency of experience, and the use of different data-generation policies by different workers. Finally, we discuss implications for deep reinforcement learning agents that may apply beyond our distributed framework.

2 Background

Distributed Stochastic Gradient Descent

Distributed stochastic gradient descent is widely used in supervised learning to speed up the training of deep neural networks by computing the gradients used to update their parameters in parallel. The resulting parameter updates can be applied synchronously (Krizhevsky, 2014) or asynchronously (Dean et al., 2012). Both approaches have proven effective and have become an increasingly standard part of the deep learning toolbox. Inspired by this, Nair et al. (2015) applied distributed asynchronous parameter updates and distributed data generation to deep reinforcement learning. Asynchronous parameter updates and parallel data generation have also been used successfully within a single machine, in a multi-threaded rather than distributed setting (Mnih et al., 2016). GPU Asynchronous Actor-Critic (GA3C; Babaeizadeh et al., 2017) and Parallel Advantage Actor-Critic (PAAC; Clemente et al., 2017) adapt this approach to make efficient use of GPUs.

Distributed Importance Sampling

A complementary family of techniques for speeding up training is based on reducing variance through importance sampling (Hastings, 1970). This has been shown to be useful in the context of neural networks (Hinton, 2007). By sampling non-uniformly from the dataset and weighting the updates according to the sampling probability to counteract the bias thereby introduced, convergence can be accelerated by reducing the variance of the gradients. One approach is to select samples with probability proportional to the L2 norm of the corresponding gradients. In supervised learning, this approach has been successfully extended to the distributed setting (Alain et al., 2015). An alternative is to rank samples according to their latest known loss value and make the sampling probability a function of the rank rather than of the loss itself (Loshchilov & Hutter, 2015).
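As a small illustration of the weighting idea (a sketch, not code from any of the cited works), the importance-sampling correction for non-uniform sampling can be computed as follows; the function name and the normalization by the maximum weight are illustrative conventions.

```python
import numpy as np

def importance_weights(sample_probs, buffer_size, beta=0.4):
    """Weights that counteract the bias of non-uniform sampling (sketch).

    sample_probs: sampling probabilities P(i) of the drawn samples.
    beta: how fully the bias is corrected (1.0 = full correction).
    """
    w = (1.0 / (buffer_size * np.asarray(sample_probs))) ** beta
    return w / w.max()   # normalize so the weights only scale updates down
```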

Prioritized Experience Replay

Experience replay (Lin, 1992) has long been used in reinforcement learning to improve data efficiency. It is particularly useful when training neural network function approximators with stochastic gradient descent, as in Neural Fitted Q-Iteration (Riedmiller, 2005) and Deep Q-Learning (Mnih et al., 2015). Experience replay can also help prevent overfitting by allowing the agent to learn from data generated by previous policies. Prioritized experience replay (Schaul et al., 2016) extends the classic prioritized sweeping idea (Moore & Atkeson, 1993) to work with deep neural network function approximators. It is closely related to the importance sampling techniques discussed in the previous section, but uses a more general class of biased sampling procedures that focus learning on the most "surprising" experiences. Biased sampling is particularly helpful in reinforcement learning, where the reward signal may be sparse and the data distribution depends on the agent's policy. Prioritized experience replay is therefore used by many agents, such as Prioritized Dueling DQN (Wang et al., 2016), UNREAL (Jaderberg et al., 2017), DQfD (Hester et al., 2017), and Rainbow (Hessel et al., 2017). In an ablation study investigating the relative importance of several algorithmic components (Hessel et al., 2017), prioritization was found to be among the most important factors for agent performance.

3 DISTRIBUTED PRIORITIZED EXPERIENCE REPLAY

In this paper, we extend Prioritized Experience Replay to distributed settings and show that it is a highly scalable approach to deep reinforcement learning. We introduce several key modifications that make this extensibility possible, and we call our method Ape-X.

Figure 1: The Ape-X architecture in a nutshell: multiple Actors, each with its own instance of the environment, generate experience, add it to the shared experience replay memory, and compute initial priorities for the data. The (single) Learner samples from this memory and updates both the network and the priorities of the experiences in memory. The Actors' networks are periodically updated with the latest network parameters from the Learner.

Like Gorila (Nair et al., 2015), we decompose the standard deep reinforcement learning algorithm into two parts that run concurrently with no high-level synchronization. The first part consists of stepping through the environment, evaluating a policy implemented as a deep neural network, and storing the observed data in a replay memory; we call this acting. The second part consists of sampling batches of data from the memory to update the policy parameters; we call this learning. In principle, both acting and learning may be distributed across many workers. In our experiments, hundreds of Actors run on CPUs to generate data, while a single Learner running on a GPU samples the most useful experiences (Figure 1). Pseudocode for the Actor and the Learner is shown in Algorithms 1 and 2. Updated network parameters are periodically communicated from the Learner to the Actors.
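The following schematic Python sketch conveys the shape of these two loops. It is not the paper's Algorithm 1 or 2: `env`, `policy_net`, `replay`, `learner`, `optimizer`, and all of their methods are hypothetical placeholders standing in for the real components.

```python
def actor_loop(env, policy_net, replay, learner, local_buffer_size=100):
    """One Actor: act with the latest policy copy, buffer locally, send batches."""
    policy_net.set_parameters(learner.get_parameters())   # periodic parameter sync (assumed API)
    state, local_buffer = env.reset(), []
    while True:
        action = policy_net.select_action(state)           # e.g. epsilon-greedy
        next_state, reward, done = env.step(action)
        local_buffer.append((state, action, reward, next_state, done))
        if len(local_buffer) >= local_buffer_size:
            priorities = policy_net.initial_priorities(local_buffer)  # computed locally, online
            replay.add_batch(local_buffer, priorities)                # batched, asynchronous send
            local_buffer = []
            policy_net.set_parameters(learner.get_parameters())
        state = env.reset() if done else next_state


def learner_loop(q_net, replay, optimizer, batch_size=512):
    """The Learner: sample prioritized batches, update the network and the priorities."""
    while True:
        idx, batch, probs = replay.sample(batch_size)
        loss, td_errors = q_net.loss_and_td_errors(batch, probs)
        optimizer.step(loss)
        replay.update_priorities(idx, td_errors)            # feed updated priorities back
        replay.remove_old_data_if_over_capacity()
```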

Compared to Nair et al. (2015), we use a shared, centralized replay memory, and instead of sampling uniformly, we sample by priority so that the most useful data is drawn more often. Since priorities are shared, high-priority data discovered by any Actor can benefit the whole system. Priorities can be defined in various ways, depending on the learning algorithm; two instances are described in the next section. In Prioritized DQN (Schaul et al., 2016), the priorities of new transitions are initialized to the maximum priority seen so far, and are only updated once they are sampled. This does not scale well: due to the large number of Actors in our architecture, waiting for the Learner to update priorities would result in a myopic focus on the most recent data, which would all carry the maximum priority. Instead, we take advantage of the computation the Actors in Ape-X are already doing to evaluate their local copies of the policy, and have them also compute suitable priorities for new transitions online. This ensures that data entering the replay memory has more accurate priorities, at no extra cost. Sharing experience has some advantages over sharing gradients. Low-latency communication is not as important as in distributed SGD, because experience data becomes stale more slowly than gradients, as long as the learning algorithm is robust to off-policy data. Throughout the system, we exploit this by batching all communication with the centralized replay memory, increasing efficiency and throughput at the cost of some added latency. With this approach, performance is not limited even when Actors and Learner run in different data centers. Finally, by learning off-policy (see Sutton & Barto, 1998; 2017) we can further exploit Ape-X's ability to combine data from many distributed Actors, by assigning different exploration policies to different Actors and thereby increasing the diversity of the experience they jointly encounter. As we will see in the results, this can be enough to make progress on difficult exploration problems.

3.1 APE-X DQN

The general framework we describe can be combined with different learning algorithms. First, we combine it with a variant of DQN (Mnih et al., 2015) incorporating some of the components of Rainbow (Hessel et al., 2017). More specifically, we use double Q-learning (van Hasselt, 2010; van Hasselt et al., 2016) with multi-step bootstrap targets (see Sutton, 1988; Sutton & Barto, 1998; 2017; Mnih et al., 2016) as the learning algorithm, and a dueling network architecture (Wang et al., 2016) as the function approximator $q(\cdot, \cdot, \theta)$.

For every element of the batch, this leads to computing the loss

$$l_t(\theta) = \tfrac{1}{2}\big(G_t - q(S_t, A_t, \theta)\big)^2,$$

with the multi-step double Q-learning target

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q\big(S_{t+n}, \operatorname*{argmax}_{a} q(S_{t+n}, a, \theta), \theta^-\big),$$

where $t$ is a time index for an experience sampled from the replay memory, starting with state $S_t$ and action $A_t$, and $\theta^-$ denotes the parameters of the target network (Mnih et al., 2015), a slowly updated copy of the online parameters. Multi-step returns are truncated if the episode ends within fewer than $n$ steps. In principle, Q-learning variants are off-policy methods, so we are free to choose the policy used to generate the data. In practice, however, the choice of behavior policy does affect both exploration and the quality of the function approximation. Furthermore, we use multi-step returns without off-policy correction, which could in theory adversely affect the value estimates. Nonetheless, in Ape-X DQN each Actor executes a different policy, so that experience is generated by a variety of policies, and we rely on the prioritization mechanism to pick out the most effective experiences. In our experiments, the Actors use $\varepsilon$-greedy policies with different values of $\varepsilon$. Low-$\varepsilon$ policies allow deeper exploration of the environment, while high-$\varepsilon$ policies prevent over-specialization.
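For concreteness, a small sketch of the truncated multi-step return with a double-Q bootstrap, and of the absolute TD error used as a priority, might look as follows; `q_online` and `q_target` are assumed, hypothetical functions returning a vector of action values for a state.

```python
import numpy as np

def n_step_double_q_target(rewards, bootstrap_state, discount,
                           q_online, q_target, episode_ended):
    """Truncated n-step return with a double-Q bootstrap (illustrative sketch).

    rewards: list of up to n rewards R_{t+1}, ..., R_{t+n}
    q_online(s) / q_target(s): hypothetical functions returning a vector of action values
    episode_ended: True if the episode terminated within these n steps (no bootstrap)
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (discount ** k) * r
    if not episode_ended:
        best_action = int(np.argmax(q_online(bootstrap_state)))      # argmax from the online net
        g += (discount ** len(rewards)) * q_target(bootstrap_state)[best_action]
    return g

def td_error_priority(g, q_value_taken):
    """The priority of a transition is the magnitude of its TD error."""
    return abs(g - q_value_taken)
```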

3.2 APE-X DPG

To test the generality of the framework, we also combined it with a continuous-action policy gradient method based on DDPG (Lillicrap et al., 2016), an implementation of deterministic policy gradients (Silver et al., 2014) that is also similar to older approaches (Werbos, 1990; Prokhorov & Wunsch, 1997), and evaluated it on continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018).

Figure 2: Left: aggregated results across 57 Atari games, evaluated from random no-op starts. Right: Atari training curves for selected games, against baselines. Blue: Ape-X DQN with 360 Actors; orange: A3C; purple: Rainbow; green: DQN. See the appendix for details and longer-run results for all games.

The setup of Ape-X DPG is similar to Ape-X DQN, but the Actor's policy is now represented explicitly by a separate policy network, in addition to the Q-network. The two networks are optimized independently by minimizing different losses on the sampled experience. We denote the parameters of the policy and Q-networks by $\phi$ and $\psi$ respectively, and adopt the same convention for the target networks. The Q-network outputs an action-value estimate $q(s, a, \psi)$ given a state $s$ and a multi-dimensional action $a$. It is updated using temporal-difference learning with a multi-step bootstrapped target. The loss of the Q-network can be written as $l_t(\psi) = \tfrac{1}{2}\big(G_t - q(S_t, A_t, \psi)\big)^2$, where

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q\big(S_{t+n}, \pi(S_{t+n}, \phi^-), \psi^-\big).$$

The policy network outputs an action $A_t = \pi(S_t, \phi)$. The policy parameters are updated by gradient ascent on the estimated Q-value, using the policy gradient

$$\nabla_\phi J(\phi) = \mathbb{E}\big[\nabla_\phi \pi(S_t, \phi)\, \nabla_a q(S_t, a, \psi)\big|_{a = \pi(S_t, \phi)}\big].$$

Note that this relies on the policy parameters $\phi$ affecting the Q-network only through the action.
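A condensed, PyTorch-style sketch of the two updates is given below, under the assumption that `policy`, `q_net`, and their target copies are callable modules; it is meant only to illustrate the structure of the critic and actor losses, not to reproduce the authors' implementation.

```python
import torch

def ddpg_style_update(policy, q_net, policy_target, q_target,
                      policy_opt, q_opt, batch, gamma_n):
    """One Ape-X-DPG-style update on a sampled batch (illustrative sketch).

    batch: tensors (states, actions, n_step_rewards, bootstrap_states, not_done),
           where n_step_rewards is the already-discounted sum of n rewards.
    gamma_n: the discount factor already raised to the n-step horizon.
    """
    states, actions, n_step_rewards, bootstrap_states, not_done = batch

    # Critic: multi-step bootstrapped target uses the *target* policy and Q networks.
    with torch.no_grad():
        target_actions = policy_target(bootstrap_states)
        targets = n_step_rewards + gamma_n * not_done * q_target(bootstrap_states, target_actions)
    q_loss = 0.5 * (targets - q_net(states, actions)).pow(2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor: ascend the estimated Q-value of the policy's own actions
    # (phi influences the critic only through the action, as noted above).
    policy_loss = -q_net(states, policy(states)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```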

Further details of the Ape-X DPG algorithm are provided in the appendix (the appendix is not translated here).

4 Experiments

4.1 ATARI Games

In our first set of experiments we evaluate Ape-X DQN on Atari and show state-of-the-art results on this standard reinforcement learning benchmark. We use 360 Actor machines (each using one CPU core) to feed data into the replay memory as fast as they can generate it: about 139 frames per second (FPS) each, for a total of roughly 50K FPS, which corresponds to roughly 12.5K transitions per second (because of a fixed action repeat of 4). Actors batch experience data locally before sending it to the replay memory: up to 100 transitions are buffered at a time and then sent asynchronously in batches of B = 50. The Learner asynchronously prefetches up to 16 batches of 512 transitions and computes updates for 19 such batches per second, meaning that gradients are computed for about 9.7K transitions per second on average. To reduce memory and bandwidth requirements, observations are compressed with the PNG codec when sent and stored in the replay memory. The Learner computes and applies gradients while prefetching data, and also handles parameter requests from the Actors asynchronously.
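The throughput figures quoted above fit together as simple arithmetic; the short snippet below just restates them (the numbers are the approximate values reported in the text):

```python
# Back-of-the-envelope throughput figures for the Atari setup (approximate).
actors, fps_per_actor, action_repeat = 360, 139, 4
env_fps = actors * fps_per_actor                 # = 50,040 frames/s  (~50K FPS)
transitions_per_s = env_fps / action_repeat      # = 12,510 transitions/s added to the replay

batches_per_s, batch_size = 19, 512
learner_transitions_per_s = batches_per_s * batch_size   # = 9,728 transitions/s consumed (~9.7K)
```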

Table 1: Median human-normalized scores across 57 Atari games. Notes: a) Tesla P100. b) more than 100 CPUs were used, with a varying number of cores per CPU machine. c) evaluated on only 49 games. d) hyperparameters were tuned per game.

Each Actor copies the network parameters from the Learner every 400 frames (about 2.8 seconds). Actor $i \in \{0, \ldots, N-1\}$ executes an $\varepsilon_i$-greedy policy, where $\varepsilon_i = \varepsilon^{1 + \frac{i}{N-1}\alpha}$ with $\varepsilon = 0.4$ and $\alpha = 7$; each $\varepsilon_i$ is held constant during training. During training, episode length is limited to 50,000 frames. The shared experience replay memory has a soft limit of 2 million transitions: new data is always allowed to be added, so as not to slow down the Actors, but every 100 learning steps any excess data above this capacity threshold is removed in batches, in FIFO order. The actual median size of the memory is 2,035,050 transitions. Data is sampled with proportional prioritization, using a priority exponent of 0.6 and an importance-sampling exponent of 0.4. In Figure 2 (left), we compare the median human-normalized score across all 57 games to several baselines: DQN, Prioritized DQN, Distributional DQN (Bellemare et al., 2017), Rainbow, and Gorila. In all cases, performance is measured at the end of training under the no-op starts test regime (Mnih et al., 2015). On the right, we show initial learning curves for a subset of 6 games (curves shown are for the greediest Actor); full learning curves for all games are in the appendix. Considering that Ape-X can use more computation than most baselines, one might expect it to train faster. Figure 2 shows that this is indeed the case. Perhaps more surprisingly, our agents also reach significantly higher final performance. In Table 1, we compare the median human-normalized performance of Ape-X DQN on the Atari benchmark to the corresponding metrics of the baseline agents as reported in their respective publications. Wherever possible, we report results for both the no-op starts and the human-starts regimes. The human-starts regime (Nair et al., 2015) amounts to a more challenging generalization test, because the agent is initialized from starts drawn randomly from games played by human experts. On both metrics, Ape-X outperforms all baselines.
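The per-actor exploration schedule described above can be reproduced with a few lines of Python (a small sketch; the function name is illustrative):

```python
def actor_epsilons(num_actors, epsilon=0.4, alpha=7.0):
    """Per-actor exploration rates: epsilon_i = epsilon ** (1 + alpha * i / (N - 1))."""
    if num_actors == 1:
        return [epsilon]
    return [epsilon ** (1 + alpha * i / (num_actors - 1)) for i in range(num_actors)]

# e.g. actor_epsilons(360)[0] == 0.4 (the most exploratory actor), while the last value
# is 0.4 ** 8 ≈ 0.00066 (a nearly greedy actor).
```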

4.2 Continuous Control

In the second set of experiments, we evaluate Ape-X DPG on four continuous control tasks. In the manipulator domain, the agent must learn to bring a ball to a specified location. In the humanoid domain, the agent must learn to control a humanoid body to solve three distinct tasks of increasing complexity: standing, walking, and running. Since we learn from features rather than from pixels, the observation space is much smaller than in the Atari domain, and we therefore use small, fully connected networks (details in the appendix). With 64 Actors in this domain we obtain a total of about 14K FPS (and the same number of transitions per second; no action repeat is used here). We process 86 batches of 256 transitions per second, i.e., roughly 22K transitions per second. Figure 3 shows that Ape-X DPG achieves very good performance on all four tasks. The figure shows Ape-X DPG with varying numbers of Actors: as the number of Actors increases, our agent solves these problems increasingly quickly and reliably, surpassing a standard DDPG baseline trained for over 10 times as long. A concurrent paper (Barth-Maron et al., 2018) extends this work by combining Ape-X DPG with a distributional value function, and the resulting algorithm is applied successfully to further continuous control tasks.

Figure 3: Performance of Ape-X DPG on four continuous control tasks as a function of wall-clock time. Performance improves as the number of Actors increases. The black dashed line indicates the best performance achieved by a standard DDPG baseline over 5 days of training.

Figure 4: Scaling the number of Actors. Performance improves consistently as the number of Actors grows from 8 to 256; note that the number of learning updates performed does not depend on the number of Actors.

5 Analysis

In this section we describe additional Ape-X DQN experiments on Atari that help us better understand the framework and investigate the contributions of its different components.

First, we investigate how performance scales with the number of Actors. We trained for 35 hours on a subset of 6 Atari games with varying numbers of Actors (8, 16, 32, 64, 128, and 256), fixing the size of the shared experience replay memory at 1 million transitions in all experiments. Figure 4 shows that performance keeps improving as the number of Actors grows. The appendix contains learning curves for additional games and a scalability comparison with and without prioritized replay. It is perhaps surprising that performance improves so substantially purely by increasing the number of Actors, without changing the rate of network parameter updates, the network architecture, or the update rule. We hypothesize that the proposed architecture helps with a common failure mode of deep reinforcement learning, in which the policy discovered is a local optimum in parameter space rather than a global one, for example due to insufficient exploration. Using a large number of Actors with varying amounts of exploration helps to discover promising new courses of action, and prioritized replay ensures that when this happens, the learning algorithm focuses its effort on this important information.

Next, we investigated the effect of varying the capacity of the replay memory (see Figure 5). We used a setup with 256 Actors, giving a total environment frame rate of about 37K per second (roughly 9K transitions per second). With such a large number of Actors, the contents of the memory are replaced much faster than for most DQN-like agents. We observed a small benefit from larger replay capacities, which we attribute to high-priority experiences being retained for longer and replayed more often. As noted above, a single Learner machine trains the network at a median of 19 batches per second, each containing 512 transitions, i.e., about 9.7K transitions processed per second.

Finally, we ran additional experiments to disentangle two potentially confounding factors in our scalability analysis: the recency of the experience data in the replay memory, and the diversity of the data-generating policies. A full description of these experiments is deferred to the appendix; in summary, neither factor alone can explain the performance we observe. We therefore conclude that the results are due largely to the positive effects of gathering more experience data overall, namely better exploration of the environment and better avoidance of overfitting.

Figure 5: Varying the capacity of the replay memory. Agents with larger replay memories performed better in most games. Each curve corresponds to a single run, smoothed over 20 points. The curve for Wizard Of Wor with a replay size of 250K is incomplete because training diverged; we did not observe this for the other replay sizes.

 6 Conclusion

We have designed, implemented, and analyzed a distributed framework for prioritized replay in deep reinforcement learning. This architecture achieves state-of-the-art results on a range of discrete and continuous tasks, both in wall-clock learning speed and in final performance. In this paper we focused on applying the Ape-X framework to DQN and DPG, but it can also be combined with other off-policy reinforcement learning updates. For methods that use temporally extended sequences (e.g., Mnih et al., 2016; Wang et al., 2017), the Ape-X framework can be adapted to prioritize sequences of past experience rather than individual transitions. Ape-X is designed for settings in which large amounts of data can be generated in parallel. This includes simulated environments, but also a variety of real-world applications such as farms of robotic arms, self-driving cars, online recommender systems, and other multi-user systems in which data is generated from many instances of the same environment (see Silver et al., 2013). In applications where data is costly to acquire, our method will not be directly applicable. Overfitting is an issue for powerful function approximators: generating more training data is the simplest way of addressing it, but the approach may also inform data-efficient solutions. Many deep reinforcement learning algorithms are fundamentally limited by their ability to explore large domains efficiently. Ape-X uses a simple yet effective mechanism to address this: generating a broad diversity of experience, then identifying and learning from the most useful events. The success of this approach suggests that simple and direct approaches to exploration may be feasible, even for synchronous agents. Our architecture shows that distributed systems are now suitable not only for research but potentially also for large-scale applications of deep reinforcement learning. We hope the algorithms, architecture, and analysis we have presented will help accelerate future efforts in this direction.

Paper summary

This paper, from DeepMind, proposes the Ape-X framework and combines it with DQN for discrete control and with DDPG for continuous control. It relies on the following combination of algorithmic properties:

Centralized Replay Memory

This is a mechanism for storing and replaying previous experience (i.e., decisions and their outcomes). The approach described here uses a shared, centralized replay memory, in contrast to Nair et al. (2015), who use a distributed replay memory.

Prioritized Sampling

When selecting experiences from the replay memory for learning, instead of sampling uniformly at random, the most useful data is sampled preferentially. This means that certain experiences, such as those that contribute most to improving the model, are sampled more frequently.

Priority Sharing

All Actors share the same priorities, which means that high-priority data found by any Actor can benefit the entire system.

Actors Compute Priorities

To make priorities more accurate, each Actor computes the priority of new experience as it is generated, rather than waiting for the Learner to update it. This removes the delay of waiting for the Learner and prevents the system from focusing too heavily on the most recent data (which would otherwise all carry the default maximum priority).

Experience Sharing vs. Gradient Sharing

Actors in the system share experience rather than gradients. With shared experience, low-latency communication matters less than it does in distributed stochastic gradient descent (SGD), because experience data becomes stale more slowly than gradients do.

Batched Communication

The system increases efficiency and throughput by batching all communication with the central replay memory, although this can add some latency.

Off-Policy Learning

Because learning is off-policy, the Actors can follow behavior policies that differ from the policy being learned. The method can therefore gather a broader range of experience from many Actors, improving learning.

Actors and Learners in Different Locations

Actors and the Learner can even run in different data centers without limiting the performance of the system.

Diverse Exploration Policies

By assigning different exploration policies to different Actors, the diversity of the experience they jointly encounter is increased. The system can thus learn about a wider variety of situations and strategies, giving it richer options when solving problems.

Capability to Address Complex Exploration Problems

The authors mention that the combination of various elements in this approach, especially the adoption of diverse exploration policies and off-policy learning, is sufficient to achieve progress on complex exploration problems.


Origin blog.csdn.net/wenquantongxin/article/details/130834036