DRL Algorithm Collection

1. Points of attention (difficulties)

  1. What are the training labels for DRL?
    There are no fixed labels as in supervised learning; the reward signal plays that role. Let the agent learn for a period of time and then test it like an exam; the network parameters are trained repeatedly through this cycle.

  2. Model-free and model-based in reinforcement learning

    • An agent with a complete model of the environment can predict the outcome of each action and the corresponding reward, so it can optimize its policy without experimenting. This type of agent is called a "model-based" agent.

    • In contrast, agents that do not model the environment are called "model-free" agents. These agents must learn by interacting with the environment and make action decisions based on their experience. They can only learn by trial and error, and need more time and experience to reach the same level of performance as model-based agents.

    • The SARSA algorithm is a model-free, on-policy learning algorithm.

  3. Q value

    • The Q value in reinforcement learning is the value of taking a certain action in a certain state; it represents the expected sum of rewards from choosing that action until the terminal state. The Q value guides the agent toward the decision that maximizes cumulative reward.
    • The predicted Q value is the output of the neural network, i.e., its estimate of the Q value of the current state-action pair. The target Q value is computed from the Bellman equation (the reward plus the discounted value of the next state) and serves as the regression target for the predicted Q value.
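
A minimal sketch of how the two quantities are typically computed in a DQN-style setting (the network and tensor names are illustrative assumptions, not code from this article):

```python
import torch

def q_values(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # predicted Q value: the network's estimate for the actions that were actually taken
    pred_q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # target Q value: Bellman target r + gamma * max_a' Q_target(s', a'), zero at terminal states
    with torch.no_grad():
        target_q = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return pred_q, target_q
```
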
  4. Policy Learning and Value Learning

    • Policy learning directly learns a policy function: input a state, output an action or a probability distribution over actions. Its advantages are that it can handle continuous action spaces and can represent stochastic policies; its disadvantages are slow convergence and a tendency to get stuck in local optima.
    • Value learning learns a value function and derives the policy indirectly: input a state or a state-action pair, output a state value or a Q value. Its advantage is fast convergence, since the Bellman equation can be used for iterative updates; its disadvantages are that continuous action spaces are hard to handle and stochastic policies are hard to represent.

    In general, both policy learning and value learning interact with the environment to obtain reward signals for updating their parameters. Some algorithms use both a policy and a value function to make decisions, such as actor-critic (Actor-Critic), where the actor is a policy function and the critic is a value function: the actor updates the policy based on the critic's evaluation, and the critic updates the value function based on the environment's reward.

  5. Policy gradient (REINFORCE and the AC algorithm)

    • REINFORCE approximates Q^π with actually observed returns, while actor-critic methods approximate Q^π with a neural network.
  • About the difference between the lowercase r_t and uppercase R_t notation

  • The relationship between episodes and the number of evaluations

    • The number of evaluations is how many times you run the agent in the environment with a fixed policy; an episode is one run from the start state to a terminal state in the environment.
    • If you run only one episode per evaluation, the number of evaluations equals the number of episodes. If you run multiple episodes per evaluation, or the environment has no clear terminal state, the two are not equal.
  • State value V and action value Q

    • The state value is the expected return obtained by following a certain policy from a certain state. It reflects the potential benefit of reaching that state.
    • The action value is the expected return obtained by performing a given action in a given state and then following a certain policy. It reflects the potential payoff of choosing that action.
    • Both state value and action value are predictions of future rewards, but the state value must consider all possible action choices, while the action value considers one specific action choice (their relationship is shown below).
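
The standard relationship between the two (a well-known identity, stated here since the original figure is not reproduced):

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a), \qquad Q^{\pi}(s, a) = \mathbb{E}\big[\, r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s,\, a_t = a \,\big]$$
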
  • Updates in the AC Algorithm


    • Q-learning is an off-policy algorithm: it uses a greedy strategy to pick the highest-value action of the next state when forming the update target, but that action may not be the one actually executed. SARSA is an on-policy algorithm: it uses an ε-greedy strategy to choose the actual action for the next state and then executes that action.


    • Q-learning is off-policy because it uses the maximum-value action of the next state, rather than the action actually executed, when updating the Q value. That is, the policy Q-learning uses during learning may differ from the policy it actually executes, which can help it avoid locally optimal solutions.

    • SARSA is on-policy because it uses the action actually taken in the next state when updating the Q value; the policy used during learning is the same as the policy actually executed. This lets SARSA adapt to changes in the environment and makes it more sensitive to mistakes or penalties.

    • The update formulas of Q-learning and SARSA differ. Q-learning updates the current state's value function with the maximum-value action of the next state: $Q(s,a) \leftarrow Q(s,a) + \alpha \big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big)$. SARSA updates it with the action actually taken in the next state: $Q(s,a) \leftarrow Q(s,a) + \alpha \big( r + \gamma\, Q(s',a') - Q(s,a) \big)$. (A code sketch of both updates follows this list.)

    • The behavioral characteristics of Q-learning and SARSA also differ. Q-learning is aggressive: it only cares about eventually obtaining the maximum reward and ignores traps or penalties encountered along the way. SARSA is conservative: it is sensitive to mistakes or death and tries to avoid actions that may lead to negative outcomes.
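
A minimal tabular sketch of the two update rules, assuming a Q table stored as a 2-D NumPy array and an ε-greedy helper (all names are illustrative):

```python
import numpy as np

def eps_greedy(Q, s, eps=0.1):
    return np.random.randint(Q.shape[1]) if np.random.rand() < eps else int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    # off-policy: the target uses the greedy (max) action of the next state
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # on-policy: the target uses the action a2 actually chosen (e.g. eps-greedily) in s2
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
```
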

  • Since some algorithms (such as A2C and PPO) use a stochastic policy by default, you should set deterministic=True when calling .predict() during testing. Does this give better results?

    • In reinforcement learning there are two types of policies: deterministic policies and stochastic policies. A deterministic policy outputs a single definite action for each state. A stochastic policy outputs a probability distribution over actions for each state and then samples an action from it.

    • Some reinforcement learning algorithms (such as A2C and PPO) use a stochastic policy by default, which increases exploration and robustness.

    • However, at test time, if you want the optimal action you should set deterministic=True, so that the action with the highest probability is output directly instead of being randomly sampled.

    • The reason is that during training the stochastic policy helps the agent visit more states and actions, while during testing the deterministic policy outputs the action the policy currently considers best. A usage sketch follows.
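
A minimal usage sketch, assuming the Stable-Baselines3 API that the .predict() / deterministic=True wording refers to (the environment name and timestep count are illustrative):

```python
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10_000)          # training: the stochastic policy encourages exploration

obs = model.get_env().reset()
action, _state = model.predict(obs, deterministic=True)   # testing: take the most probable action
```
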

2. Comparison and differences between algorithms

  • Q-learning: Q-learning is a value-function-based reinforcement learning algorithm for problems with discrete state and action spaces. It converges quickly, but it may not work well for problems with continuous state and action spaces.

  • Deep Q Network (DQN): DQN is a Q-learning algorithm built on deep neural networks, which can handle continuous state spaces with discrete action spaces. DQN has better generalization ability and stability, but the training process is more complicated.

  • Double DQN (DDQN): The computation of the TD target is decomposed into a selection step and an evaluation step, which alleviates the overestimation caused by maximization. The overestimation problem is that when DQN picks the action that maximizes the action value function, it often overestimates that action's value in some situations, causing learning bias and instability.

    • DQN updates directly with the maximum of the candidate Q values at time t+1 from the target Q network.
    • DDQN uses the Q network at time t+1 to select the action with the maximum Q, and then uses that action to read the Q value from the target Q network.
    • In this way the updated Q value is less than or equal to the one DQN would use, which mitigates the overestimation problem.
    • DDQN's solution is to separate action selection from action evaluation: the Q network selects the optimal action, and the target network evaluates the value of that action. This reduces the effect of overestimation and improves learning performance (see the sketch below).
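
A sketch of the two target computations, assuming q_net and target_net are PyTorch modules mapping states to per-action Q values (all names are illustrative):

```python
import torch

def td_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compare the DQN and Double DQN target computations (illustrative sketch)."""
    with torch.no_grad():
        # DQN: the target network both selects and evaluates the next action (max over Q)
        dqn = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
        # Double DQN: the online network selects the action, the target network evaluates it
        a_star = q_net(next_states).argmax(dim=1, keepdim=True)
        ddqn = rewards + gamma * (1 - dones) * target_net(next_states).gather(1, a_star).squeeze(1)
    return dqn, ddqn
```
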

  • Policy Gradient: Policy Gradient is a policy-optimization-based reinforcement learning algorithm that works with discrete and continuous state spaces and discrete and continuous action spaces. It optimizes the policy function directly, but training is relatively unstable.

  • Actor-Critic: Actor-Critic is a reinforcement learning algorithm that combines a value function and a policy function, and it works with discrete and continuous state and action spaces. It optimizes the policy and the value function at the same time, but training is also more complicated.

  • Proximal Policy Optimization (PPO): PPO is a policy-optimization-based reinforcement learning algorithm for discrete and continuous state and action spaces. It has better stability and convergence speed, but lower training efficiency.

  • On-policy: SARSA, REINFORCE, and A2C are all on-policy algorithms. They require that experience be collected by the current target policy rather than reusing outdated experience, so experience replay does not apply to them.

  • Distributional DQN (original reference link):

    • Why it exists: in the traditional DQN, the network outputs only an estimate of the expected action value Q, which discards a lot of information. Suppose two actions both have an expected value of 20: the first action yields 10 in 90% of cases and 110 in 10% of cases, while the second yields 25 in 50% of cases and 15 in the other 50%. Although the expectations are the same, if you want to reduce risk you should choose the latter action; outputting only the expectation hides the risk behind it.
    • Why does the distributional deep Q network not overestimate rewards but rather underestimate them? Because its output is a distribution over a bounded range: we must set a limit, for example an output range from -10 to 10. What if the obtained reward exceeds 10, say 100? It is treated as if it were not seen, so extreme and very large reward values are discarded. Hence, with a distributional deep Q network, rewards tend to be underestimated rather than overestimated.
    • Distributional DQN is characterized by learning not only the action value function Q(s, a), i.e., the expected return of taking action a in state s, but also the return distribution Z(s, a), i.e., the probability distribution of the return from taking action a in state s.
    • Distributional DQN uses a neural network to output a discrete distribution: on a set of predefined support points, it gives the probability of each support point. This network can be viewed as a classifier whose input is a state s and action a and whose output is a vector of probabilities over the categories (support points). To train it, Distributional DQN uses a KL-divergence-based loss, which measures the difference between two distributions. A sketch of such a head is given below.
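
An illustrative C51-style distributional head (the layer sizes, atom count, and value range are assumptions, not the original code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
SUPPORT = torch.linspace(V_MIN, V_MAX, N_ATOMS)           # predefined support points z_i

class DistributionalHead(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions * N_ATOMS))

    def forward(self, s):
        logits = self.net(s).view(-1, self.n_actions, N_ATOMS)
        probs = F.softmax(logits, dim=-1)                  # one distribution per action
        q = (probs * SUPPORT).sum(dim=-1)                  # expected return per action
        return probs, q

# Training minimizes the cross-entropy/KL divergence between the predicted distribution and
# the projected Bellman target distribution (the projection step is omitted in this sketch).
```
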
  • The difference between PPO and A3C

    • A3C is an asynchronous actor-critic algorithm that uses multiple parallel agents to explore different environments and update their gradients to a shared global network. A3C can efficiently utilize multi-core CPUs, but may also suffer from sample dependencies and unstable updates.
    • PPO is a proximal policy optimization algorithm that uses a single agent to collect experience and updates the policy with a clipped objective function. PPO prevents policy updates from being too large or too small, thereby ensuring steady policy improvement. PPO can also use multiple agents to collect data in parallel, but the parameters are updated synchronously.
    • The PPO algorithm also uses the actor-critic framework, but it introduces a clipped objective function to limit the update range of the policy, preventing updates that are too large or too small and ensuring steady improvement. Its advantages are stability, efficiency, and easier hyperparameter tuning; its disadvantage is that it may require more epochs and larger batch sizes.
    • Generally speaking, PPO is more stable and efficient than A3C, and it is easier to adjust parameters.
    • The PPO algorithm is an on-policy algorithm: its target policy and behavior policy are the same, and it can only be updated with data generated by the current policy. However, PPO uses importance sampling and the proximal policy optimization method so that the same batch of data can be used for several updates without causing drastic changes in the policy distribution. This reduces the amount of sampling needed, improves data utilization, and avoids the high-variance and low-efficiency problems of on-policy algorithms.
  • epoch

    • The epoch in reinforcement learning usually means that the agent performs a certain number of steps in the environment or completes a certain number of tasks.
    • For example, if the agent is looking for an exit in a maze environment, an epoch can refer to one attempt (successful or not) by the agent to reach the exit from the starting point, or to a fixed number of steps the agent takes in the environment.
  • soft update factor

    • The soft update coefficient is a method for updating the target network parameters in reinforcement learning. It updates the target network with a convex combination of the current network parameters and the target network parameters; that is, the target network parameters $\theta_i^{-}$ are updated from the current network parameters $\theta_i$ according to the following formula:

$$\theta_{i+1}^{-} \leftarrow (1-\epsilon)\,\theta_i^{-} + \epsilon\,\theta_i = \theta_i^{-} + \epsilon\,(\theta_i - \theta_i^{-})$$

where 0 < $\epsilon$ << 1 is the soft update coefficient, which controls how fast the target network parameters change. The smaller the soft update coefficient, the slower the target network parameters change and the more stable the algorithm, but the slower the convergence; the larger the coefficient, the faster the parameters change and the less stable the algorithm, but the faster the convergence. A suitable soft update coefficient lets the algorithm converge both stably and quickly.

Soft update coefficients are generally used in reinforcement learning algorithms for continuous action spaces, such as DDPG, TD3, and SAC. Reinforcement learning algorithms for discrete action spaces usually use hard updates instead: the target network parameters $\theta_i^{-}$ are synchronized with the Q network parameters $\theta_i$ every fixed C steps. Hard and soft updates are related: when the target network update interval C and the soft update coefficient $\epsilon$ satisfy $\epsilon C = 1$, the two update methods perform comparably. A sketch of both updates is given below.
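
A minimal PyTorch sketch of soft (Polyak) versus hard target updates, assuming net and target_net are modules with identical architectures (names are illustrative):

```python
import torch

@torch.no_grad()
def soft_update(net, target_net, eps=0.005):
    # theta_target <- (1 - eps) * theta_target + eps * theta
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - eps).add_(p.data, alpha=eps)

def hard_update(net, target_net):
    # copy all parameters (and buffers) at once, typically every C steps
    target_net.load_state_dict(net.state_dict())
```
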

  • The difference between load_state_dict and parameters().copy

    • load_state_dict is a method of the torch.nn.Module class. It loads the model's parameters and buffers from a dictionary. If the strict argument is True, the keys in the dictionary must exactly match those returned by the model's state_dict() method.
    • Copying via parameters() operates on torch.nn.Parameter objects one at a time: you iterate over the iterator returned by parameters() and copy each parameter's data, producing copies of individual parameters rather than loading the entire model at once. It is usually used to copy some parameters of a model, not the whole model (a comparison sketch follows).
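
A small sketch of the two ways to copy parameters into a target network (names are illustrative):

```python
import copy
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
target_net = copy.deepcopy(net)

# 1) load_state_dict: load all parameters and buffers from a state dict at once
target_net.load_state_dict(net.state_dict())

# 2) per-parameter copy: iterate over parameters() and copy each tensor's data in place
with torch.no_grad():
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.copy_(p)
```
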
  • The difference between torch.mean(F.mse_loss()) and nn.MSELoss()

    • torch.mean(F.mse_loss()) and nn.MSELoss() compute the same quantity; they are just used differently. nn.MSELoss() is a module that wraps the functional F.mse_loss, and since F.mse_loss already averages by default (reduction='mean'), wrapping it in torch.mean is redundant.
    • Generally, nn.MSELoss() is the more idiomatic choice in neural network training code, because it is configured once and reused as part of the model and training setup, while calling F.mse_loss directly is convenient for quick one-off computations in testing or evaluation, as sketched below.
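
A short check that the forms agree (illustrative values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pred, target = torch.randn(8, 1), torch.randn(8, 1)

loss_a = nn.MSELoss()(pred, target)                               # module form
loss_b = F.mse_loss(pred, target)                                 # functional form (mean by default)
loss_c = torch.mean(F.mse_loss(pred, target, reduction="none"))   # explicit mean over elementwise losses

assert torch.allclose(loss_a, loss_b) and torch.allclose(loss_b, loss_c)
```
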
  • About Policy Gradient:

In the AC algorithm, the general steps for computing the policy gradient in code are as follows (a compact sketch is given after the list):
1. Define the actor network and the critic network, which are used to output the policy and the value function respectively.
2. Sample a trajectory, recording each state, action, and reward.
3. For each moment, calculate the cumulative reward or use the output of the critic network as the target value.
4. For each moment, calculate the logarithmic probability corresponding to the output of the actor network, multiply it by the target value, and obtain the policy gradient.
5. For each moment, calculate the mean square error between the output of the critic network and the target value, and obtain the gradient of the value function.
6. Use the optimizer to update the parameters of the actor network and the critic network.
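
A compact PyTorch sketch of the steps above (all names are illustrative; Monte Carlo returns are used as the target values, as in step 3):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def ac_update(actor, critic, actor_opt, critic_opt, states, actions, rewards, gamma=0.99):
    # step 3: cumulative (discounted) rewards as target values
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    values = critic(states).squeeze(-1)                 # critic's value estimates
    advantages = (returns - values).detach()            # detach: no gradient flows into the critic here

    # step 4: policy gradient = log-prob of taken actions times the target/advantage
    dist = Categorical(logits=actor(states))
    actor_loss = -(dist.log_prob(actions) * advantages).mean()

    # step 5: value loss = MSE between critic output and target
    critic_loss = F.mse_loss(values, returns)

    # step 6: optimizer updates
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```
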

3. Algorithm Analysis Notes

code link.

1. Improved greedy algorithm

  • Improve the exploration rate, which previously stayed constant: let the desire to explore decay as the number of plays grows, for example by setting it to the reciprocal of the play count.
  • The Upper Confidence Bound (UCB) algorithm: explore more, i.e., favor machines that have been played less. A UCB score is maintained for each machine, and its exploration bonus gradually shrinks as the number of plays increases (see the sketch below).
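
A sketch of the two exploration tweaks for a multi-armed bandit (all names and constants are illustrative):

```python
import numpy as np

n_arms = 10
counts = np.zeros(n_arms)        # times each arm was played
values = np.zeros(n_arms)        # running mean reward per arm
t = 1                            # total number of plays so far

# 1) decaying epsilon-greedy: exploration probability set to the reciprocal of the play count
eps = 1.0 / t
if np.random.rand() < eps:
    action = np.random.randint(n_arms)
else:
    action = int(np.argmax(values))

# 2) UCB: add an exploration bonus that shrinks as an arm is played more
ucb = values + np.sqrt(2.0 * np.log(t) / (counts + 1e-8))
action = int(np.argmax(ucb))
```
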

2. Dyna Q algorithm

In a nutshell: offline "rumination", learning new things by reviewing the past (planning from stored experience).

  • It accelerates the convergence of the Q-learning algorithm by using the environment model, and it can also make predictions about the environment to improve the strategy.

  • The basic idea of the Dyna Q algorithm is to add an environment model on top of the Q-learning algorithm to simulate the dynamic changes of the environment. When the agent interacts with the environment, in addition to updating the Q value, the environment model is also updated. The environment model records the agent's experience, including the reward of each state and the transition to the next state, and the agent uses this information to speed up convergence when updating the Q value.

  • In the Dyna Q algorithm, each time the Q value is updated, a previously experienced state is randomly selected, and its next state and reward are predicted based on the environment model. In this way, on the basis of the Q-learning algorithm, more training data can be added to further improve the decision-making ability of the agent. In addition, the states and rewards predicted in the environment model are also used to update the Q value to improve the agent's policy.

  • In general, the Dyna Q algorithm is a reinforcement learning algorithm that combines model learning and model utilization, which can effectively improve the convergence speed and performance of reinforcement learning algorithms. However, in practical applications, maintaining the environment model makes the computational cost of Dyna Q relatively high. (A minimal tabular sketch follows.)
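
A minimal tabular Dyna-Q sketch. The environment interface (env.reset, env.step returning (next_state, reward, done), env.sample_action, env.n_actions) is an assumption made for illustration:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                 # Q[(s, a)]
    model = {}                             # model[(s, a)] = (r, s_next)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if random.random() < eps else \
                max(range(env.n_actions), key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)

            # (a) direct RL update, as in Q-learning
            target = r + gamma * max(Q[(s2, x)] for x in range(env.n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # (b) model learning: remember what happened
            model[(s, a)] = (r, s2)

            # (c) planning: replay randomly chosen remembered transitions through the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps2, x)] for x in range(env.n_actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```
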

3. Delayed updates of the target network (next_model) in DQN

  • During training, the target network is used to produce the targets against which the action-value network is evaluated. If the target keeps changing, training becomes difficult and unstable, so the target needs to be temporarily fixed.
  • About detach() (A2C algorithm)
    • The target is calculated from the value of the next state (next_value) and the rewards (rewards), and the next-state value is output by the critic network, so the target depends on the critic's parameters.
    • Without detach, the target would propagate gradients back into the critic's parameters during backpropagation and affect the critic's update.
    • But we only want to use this target to evaluate the value of the current state, not to update the critic through it, so detach is added to cut it off from the computation graph (see the sketch below).
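
A sketch of where detach() sits in an A2C-style loss computation (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def a2c_losses(critic, log_probs, states, next_states, rewards, dones, gamma=0.99):
    values = critic(states).squeeze(-1)                   # V(s): depends on the critic's parameters
    with torch.no_grad():                                 # target built from the next-state value
        next_values = critic(next_states).squeeze(-1)
    targets = rewards + gamma * next_values * (1 - dones)

    advantages = (targets - values).detach()              # detach: the actor loss must not
                                                          # backpropagate into the critic
    actor_loss = -(log_probs * advantages).mean()
    critic_loss = F.mse_loss(values, targets)
    return actor_loss, critic_loss
```
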

4. Monte Carlo approximation to expectation

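The original figure is not reproduced here; the standard form of the approximation it illustrates is:

$$\mathbb{E}_{x \sim p}\big[f(x)\big] \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim p$$
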

5. The difference between deterministic policies and stochastic policies in reinforcement learning

  • A deterministic policy means that for each state there is one definite action choice, with no randomness involved. For example, in a maze, a deterministic policy might be: if there is a left turn, turn left; if there is no left turn, go straight; if you hit a dead end, turn back.
  • A stochastic policy means that for each state there is a probability distribution over actions, and the action is chosen according to those probabilities. For example, in a maze, a stochastic policy might be: if there is a left turn, turn left with probability 0.8 and go straight with probability 0.2; if there is no left turn, go straight with probability 0.6 and turn right with probability 0.4; if you hit a dead end, turn back with probability 0.9 and stay put with probability 0.1.

Both deterministic and stochastic policies have advantages and disadvantages. Deterministic policies are simple, efficient, and easy to implement, but they may get stuck in local optima, lack exploration, and are unsuitable for partially observable or stochastic environments. Stochastic policies enhance exploration, adapt to uncertainty, and avoid premature convergence or oscillation, but they may increase variance, reduce efficiency, and be harder to optimize.

6. A3C

1) Asynchronous, concurrent, multi-threaded

  • Asynchronous (asynchronous) means that an operation will not block the execution of the program, but will be performed in the background. When the operation is completed, the program will be notified or a callback function will be executed. Asynchronous operations can be performed in a single-threaded or multi-threaded environment, which can improve the responsiveness and efficiency of the program.
  • Concurrent means that multiple operations can be executed alternately within the same time period, but not necessarily at the same time. Concurrency can be implemented on single-core or multi-core processors, which can improve program throughput and utilization.
  • Multithreading means that a program can create multiple threads to perform different tasks, and each thread has its own execution flow and context. Multithreading can achieve concurrency or parallelism, depending on the number of processor cores and scheduling algorithms. Multithreading can improve program performance and concurrency.
  • Parallel means that multiple operations can be executed at the same time. Parallelism requires multi-core or distributed processors to achieve, which can increase the speed and scale of the program.
  • In short: asynchrony means program execution is not blocked but carried out in the background; concurrency means multiple operations are executed alternately, but not necessarily simultaneously; multithreading means the program creates multiple threads to perform different tasks; parallelism means multiple operations are performed at the same time.
  • Multithreading means that a program creates multiple threads to perform different tasks, while parallelism means that multiple operations are performed at the same time. Multithreading can achieve parallelism, but parallelism does not necessarily require multithreading.

2) Algorithm principle

A3C uses asynchronous gradient updates. After a worker collects its own independent batch of experience, it independently pushes its gradients to the Global Network; once the global parameters are updated, that worker resets its local copy with the latest parameters and starts its next cycle.

3) Specific update process – reference link

  • One worker, say worker i, finishes its current cycle (completes an episode), so it uses the experience of this trajectory to compute the actor loss and critic loss, computes the gradients and uploads them to the global network; the global network obtains new parameters after gradient descent and sends them back to worker i, so the parameters of the main network and worker i are updated at the same time;
  • Meanwhile, the other workers are still running their own loops, completely unaffected. When worker i starts a new loop, it is therefore using a policy different from the other workers'.
  • This process repeats each time some worker finishes its own cycle.

7. DDPG

Another article.

8. PPO

Another article!!!!

9. About the target network

(1) The target network in the AC algorithm

  • The relationship between the target network and the critic network is a fixed-target-network technique, introduced to solve the instability that arises when updating the value function.
  • The target network has the same structure as the critic network, but its parameters are updated with a delay: it is not synchronized with the critic at every step, and only copies the critic network's parameters at intervals. This makes the target values more stable and prevents the value function from oscillating or diverging due to rapidly changing parameters.
  • The target network is only used to compute target values, not to update parameters, so its gradients do not need to be backpropagated into the critic network.

(2) When is the target network required and when not?

  • Target networks are mainly used in Q-learning based algorithms, such as DQN and DDPG, which use value function networks to estimate Q values and select actions based on Q values. The role of the target network is to generate the target Q value, which is used to calculate the loss function and gradient update of the value function network.

  • The target network is not strictly necessary, but using one makes training more stable, because it reduces the mismatch between the target Q value and the value network's output, thereby reducing oscillation and instability during training.

  • If the target network is not used, both the target Q value and the value function network output will vary with the same set of parameters, which may cause the training to fall into a loop or diverge.

  • If using policy gradient-based algorithms such as REINFORCE or PPO, then no target network is needed, since they directly learn a policy network to output the probability distribution of actions, and update the policy network according to the advantage function or reward estimate.

(3) Can PPO's value network (critic) use a target network?

  • PPO has a value network that is used to estimate state values and to compute advantage functions or reward estimates. In theory, PPO could also use a target network to generate the target state value, thereby stabilizing the training of the value network. However, PPO is an on-policy algorithm, which only uses the experience generated by the current policy to update the network, and does not use the experience generated by historical policies. This means that PPO needs to discard the previously collected experience after each policy update, since it is no longer applicable to the new policy. The advantage is that PPO can guarantee consistency between the policy and the value function, that is, the value function can well estimate the performance of the current policy.

  • If PPO used a target network, its parameters would lag behind those of the value network, which could make the target state values inconsistent with the value network's output; that is, the target state values might not be a good estimate of the current policy's performance. The cost would be reduced learning efficiency, because the targets could not reflect changes in the environment and policy in time.

  • Therefore, PPO usually does not use a target network, but directly uses the value network to generate the target state value. The advantage is that PPO can improve learning efficiency, because it can reflect changes in the environment and policy in time.

10. SAC

Zhihu article
The SAC algorithm solves control problems via policy iteration; off-policy data mainly affects the policy-evaluation part, because that part needs to approximate $Q^\pi$ and $V^\pi$. SAC addresses reinforcement learning problems with both discrete and continuous action spaces.

(1) Training objectives: the larger the return (which also includes the entropy term H), the better; the more accurate the action value q output by the Q critic network (by the Bellman equation, whether q is accurate depends on whether v is accurate), the better; and the more accurate the state value v output by the V critic network, the better.


(2) Algorithm advantage

  • SAC can increase the exploratory nature of the strategy, allowing the strategy to search the state space more extensively, while giving up obviously invalid actions, which can speed up the learning process and avoid falling into local optimum.
  • SAC can capture the multi-modality of the strategy and allow the strategy to allocate probabilities among multiple near-optimal actions, which can improve the robustness and adaptability of the strategy.
  • SAC can be updated using off-policy data, allowing the policy to use the collected data more efficiently, and can also reduce the overhead of sampling.
  • SAC can stably train a stochastic actor-critic network, making it easier for the policy to converge and generalize.

11. TD3 algorithm

What is the TD3 algorithm? (with code and code analysis) - Zhang Sijun's article on Zhihu

12. Priority sampling and importance sampling

  1. Priority sampling is a method of assigning sampling probability according to the importance of the data, which can be used to reduce data redundancy and bias and improve data diversity and representativeness. Priority sampling usually needs a priority function to measure the importance of each data point, such as the TD error, the reward magnitude, or state sparsity. Priority sampling can be used to improve the experience replay mechanism, allowing reinforcement learning algorithms to learn from key experiences faster. For example, (Schaul et al., 2016) proposed a deep Q-network algorithm based on prioritized sampling.

  2. Importance sampling is a method of adjusting learning updates according to the bias of the data, which can be used to reduce the bias and variance of the data and improve its validity and stability. Importance sampling usually requires computing an importance weight that measures the deviation between each data point's target distribution and behavior distribution, such as the ratio of policy probabilities or the ratio of value-function gradients. Importance sampling can be used to improve policy evaluation and policy improvement mechanisms, allowing reinforcement learning algorithms to learn more accurately from different policies. For example, (Precup et al., 2000) proposed an importance-sampling-based off-policy evaluation algorithm.

  3. The difference and connection between importance sampling and prioritized experience replay are as follows:

    • Importance sampling is a statistical method for sampling from one distribution while computing an expected value under another distribution. It handles situations where the sampling distribution does not match the target distribution; for example, in reinforcement learning, when the policy changes, the samples in the experience replay buffer may no longer follow the current policy's distribution.
    • Prioritized experience replay is an experience replay method built on importance sampling: it assigns each experience sample a priority according to its temporal-difference error (TD-error), so that more important samples are more likely to be drawn. It can improve the learning efficiency and stability of deep reinforcement learning algorithms.
    • The connection is that prioritized experience replay uses the idea of importance sampling: samples are drawn from a non-uniform distribution, but the gradient update should correspond to a uniform distribution. To remove the resulting bias, prioritized experience replay applies importance-sampling weights (IS-weights), based on the inverse of the sampling probability, to correct the gradient update (see the sketch below).
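
A minimal sketch of proportional prioritized sampling with importance-sampling weights (the exponents and epsilon are common choices, not values from this article):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4):
    """td_errors: 1-D array of TD-errors for the stored transitions."""
    priorities = (np.abs(td_errors) + 1e-6) ** alpha
    probs = priorities / priorities.sum()                 # non-uniform sampling distribution
    idx = np.random.choice(len(td_errors), batch_size, p=probs)

    # importance-sampling weights correct the bias introduced by non-uniform sampling
    n = len(td_errors)
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()                              # normalize for stability
    return idx, weights                                   # weights multiply the per-sample loss
```
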

13. Normalization for Reinforcement Learning

Normalization is a commonly used data preprocessing method. It removes the effect of scale from the data and makes the values fed into the neural network roughly follow a standard normal distribution, which makes the network easier to train.

In reinforcement learning, normalization can be applied to state information, action information, and reward information, but in different ways and with different effects.

  • State information: The state information fed into the neural network can be normalized to zero mean and unit variance. This avoids feeding in values with very large absolute magnitude, which would require many update steps for the network parameters to adapt.

  • Action information: When the action space is continuous, it can be restricted to a fixed range, such as -1 to 1. The mean and variance of the action information are then close to 0 and 1, which is also good for training the neural network. More specifically:

    • Ensuring Stability: Normalizing actions can help improve the stability of an algorithm. In reinforcement learning, an agent may need to choose actions with different ranges of values. If the actions are not standardized, the value ranges of different actions vary greatly, which may lead to unstable training and difficult convergence.

    • Improve convergence speed: Standardizing actions can speed up the convergence speed of training. By mapping the action value to a fixed range (such as [-1, 1] or [0, 1]), it can make it easier for the agent to learn and adjust the action strategy and reduce the training time.

    • Generalization ability: Normalizing actions can improve the generalization ability of the agent. By limiting the action range to a certain range, the agent can better adapt to the action requirements of different environments and tasks. This generalization ability enables the trained agent to better cope with changes in the unknown environment and action space.

  • Compatibility: Standardizing actions can make the agent's policy easier to integrate with other models or systems. For example, in control tasks that interact with the physical environment, normalizing actions can make the action values output by the agent match the control inputs of the physical system, making it easier to achieve actual control.

  • Reward information: Rewards should not be normalized (re-centered), as this would destroy the meaning of the reward signal. However, rewards can be scaled, i.e., multiplied by a constant factor to adjust their magnitude. This prevents the algorithm from failing to converge when the reward variance is too large, or converging too slowly when the rewards are too small. A sketch of both techniques follows.
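
A sketch of running state normalization plus constant reward scaling (all names and the scale factor are illustrative, not from the original code):

```python
import numpy as np

class RunningNorm:
    """Track a running mean/std and normalize observations to roughly zero mean, unit variance."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        # combine batch statistics with the running statistics
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot = self.count + n
        self.mean += delta * n / tot
        self.var = (self.var * self.count + batch_var * n + delta**2 * self.count * n / tot) / tot
        self.count = tot

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

REWARD_SCALE = 0.1                    # rewards are only scaled by a constant, not re-centered

def scale_reward(r):
    return r * REWARD_SCALE
```
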

If you want more details:

4. A code flow pipeline link

Zhihu link!!

(1)

Most DRL algorithms here refer to the off-policy algorithms DDPG, TD3, SAC, etc., the on-policy algorithms A3C, PPO, etc., and their variants. The only difference between most algorithms lies in how the Q value is computed and how the environment is explored. For DQN-type algorithms, you only need to regard the actor as arg max(Q1, ..., Qn) and the critic as a Q network.

(2) Algorithm selection

Zhihu link!

5. Error collection

RuntimeError: Found dtype Double but expected Float

You used torch.from_numpy to convert a NumPy array into a torch tensor, but this method keeps the original data type by default, and the default dtype of a NumPy array is double-precision floating point (float64). Add .float() after torch.from_numpy to convert the tensor to single precision, as sketched below.
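
A minimal illustration of the fix (array contents are arbitrary):

```python
import numpy as np
import torch

arr = np.random.rand(4, 3)                 # dtype: float64 by default
x_bad = torch.from_numpy(arr)              # dtype: torch.float64 -> triggers the error in a float32 model
x_ok = torch.from_numpy(arr).float()       # dtype: torch.float32 -> matches the model's parameters
```
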


Origin blog.csdn.net/qq_45889056/article/details/129978353