Introduction to Reinforcement Learning: Policy Gradient Methods

In this chapter we discuss policy gradient methods, which learn a parameterized policy π(a|s,θ) directly rather than deriving it from action-value estimates.

Policy Approximation and its Advantages

  1. The approximate policy can approach a deterministic policy, whereas with ε-greedy action selection over action values there is always an ε probability of selecting a random action.
  2. In problems with significant function approximation, the best approximate policy may be stochastic (both points are illustrated in the sketch below).
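
Both points can be seen with a softmax over numerical action preferences, $\pi(a|s,\theta) = e^{h(s,a,\theta)} / \sum_b e^{h(s,b,\theta)}$. The sketch below is only an illustration (the three-action preference vectors are made up for the example): as the preferences separate, the policy approaches a deterministic one, and with nearly equal preferences it can represent a genuinely stochastic optimum.

```python
import numpy as np

def softmax_policy(h):
    """Action probabilities from numerical action preferences h(s, a, theta)."""
    z = h - h.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical preferences for three actions in some state.
print(softmax_policy(np.array([1.0, 0.5, 0.2])))   # mildly stochastic
print(softmax_policy(np.array([10.0, 0.5, 0.2])))  # nearly deterministic
print(softmax_policy(np.array([0.7, 0.7, 0.0])))   # the best policy may mix two actions
```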

The Policy Gradient Theorem

There is also an important theoretical advantage: with continuous policy parameterization, the action probabilities change smoothly as a function of the learned parameters, whereas with ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values, if that change makes a different action maximal.
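
Here is a small numeric illustration of that contrast, under made-up numbers: a two-action softmax policy whose single parameter θ is the preference gap, versus ε-greedy over two estimated action values. A tiny change in θ barely moves the softmax probabilities, while an equally tiny change in the value estimates flips the ε-greedy probabilities from (0.95, 0.05) to (0.05, 0.95).

```python
import numpy as np

EPSILON = 0.1

def softmax_probs(theta):
    """Two-action softmax policy with preference gap theta (illustrative)."""
    e = np.exp([theta, 0.0])
    return e / e.sum()

def eps_greedy_probs(q):
    """epsilon-greedy probabilities over two action-value estimates q."""
    probs = np.full(2, EPSILON / 2)
    probs[np.argmax(q)] += 1 - EPSILON
    return probs

# A tiny change in theta moves the softmax probabilities only slightly ...
print(softmax_probs(0.001), softmax_probs(-0.001))
# ... but the same tiny change in the value estimates flips epsilon-greedy.
print(eps_greedy_probs([0.500, 0.499]), eps_greedy_probs([0.499, 0.500]))
```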

The Objective Function Is the Value

In the episodic case, performance is defined as the value of the start state of the episode:

$$
J(\theta) \doteq v_{\pi_\theta}(s_0),
$$

where $v_{\pi_\theta}$ is the true value function for $\pi_\theta$, the policy determined by $\theta$. The policy gradient theorem then states that

$$
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s,\theta),
$$

where $\mu(s)$ is the on-policy distribution under $\pi$ (the fraction of time spent in $s$) and the gradient is with respect to $\theta$.

Proof of the Policy Gradient Theorem

With just elementary calculus and re-arranging of terms, the gradient of the state-value function can be unrolled:

$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla \Big[ \sum_a \pi(a|s)\, q_\pi(s,a) \Big] \\
&= \sum_a \Big[ \nabla \pi(a|s)\, q_\pi(s,a) + \pi(a|s)\, \nabla q_\pi(s,a) \Big] \\
&= \sum_a \Big[ \nabla \pi(a|s)\, q_\pi(s,a) + \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla v_\pi(s') \Big] \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a|x)\, q_\pi(x,a),
\end{aligned}
$$

after repeated unrolling, where $\Pr(s \to x, k, \pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under $\pi$. It follows that

$$
\nabla J(\theta) = \nabla v_\pi(s_0) = \sum_s \eta(s) \sum_a \nabla \pi(a|s)\, q_\pi(s,a) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s,\theta),
$$

where $\eta(s)$ is the expected number of visits to $s$ in an episode and $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$ is the on-policy distribution.

REINFORCE: Monte Carlo Policy Gradient

Because the right-hand side of the policy gradient theorem is a sum over states weighted by how often they occur under $\pi$, it can be written as an expectation and then sampled:

$$
\begin{aligned}
\nabla J(\theta) &\propto \mathbb{E}_\pi\!\left[ \sum_a q_\pi(S_t,a)\, \nabla \pi(a|S_t,\theta) \right] \\
&= \mathbb{E}_\pi\!\left[ q_\pi(S_t,A_t)\, \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \right]
= \mathbb{E}_\pi\!\left[ G_t\, \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \right],
\end{aligned}
$$

using $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t,A_t)$. This gives the REINFORCE update, applied once per time step of a complete episode (a Monte Carlo algorithm):

$$
\theta_{t+1} \doteq \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}
= \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t|S_t,\theta_t).
$$
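
Below is a minimal NumPy sketch of this update for a linear-softmax policy. The environment interface (`env.reset()` returning a feature vector and `env.step(action)` returning `(next_features, reward, done)`) and the `SoftmaxPolicy` class are assumptions made for the example, not something from the text.

```python
import numpy as np

def softmax(h):
    z = h - h.max()
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """Linear action preferences h(s, a, theta) = theta[a] . x(s) with a softmax on top."""
    def __init__(self, n_features, n_actions):
        self.theta = np.zeros((n_actions, n_features))

    def probs(self, x):
        return softmax(self.theta @ x)

    def sample(self, x):
        return np.random.choice(len(self.theta), p=self.probs(x))

    def grad_log_pi(self, x, a):
        # Gradient of ln pi(a|s, theta) for the linear-softmax parameterization:
        # x(s) added to the chosen action's row, minus pi(b|s) * x(s) for every row b.
        g = -np.outer(self.probs(x), x)
        g[a] += x
        return g

def reinforce_episode(env, policy, alpha=1e-3, gamma=1.0):
    """One episode of REINFORCE: collect the trajectory, then update theta at every step."""
    states, actions, rewards = [], [], []
    x, done = env.reset(), False
    while not done:
        a = policy.sample(x)
        x_next, r, done = env.step(a)
        states.append(x); actions.append(a); rewards.append(r)
        x = x_next

    G, returns = 0.0, []
    for r in reversed(rewards):          # compute the returns G_t by backward recursion
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for x, a, G_t, t in zip(states, actions, returns, range(len(states))):
        # theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t|S_t, theta)
        policy.theta += alpha * (gamma ** t) * G_t * policy.grad_log_pi(x, a)
```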

REINFORCE with Baseline

1507799-5cb1e62ec82dd5f9.png

The baseline can be any function, even a random variable, as long as it does not vary with a; the equation remains valid because the subtracted quantity is zero:
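
Writing out the subtracted term makes the claim explicit:

$$
\sum_a b(s)\, \nabla \pi(a|s,\theta) = b(s)\, \nabla \sum_a \pi(a|s,\theta) = b(s)\, \nabla 1 = 0,
$$

so subtracting $b(s)$ changes the variance of the gradient estimate but not its expected value.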

The policy gradient theorem with a baseline leads to a new version of the REINFORCE update:

$$
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big( q_\pi(s,a) - b(s) \big)\, \nabla \pi(a|s,\theta),
\qquad
\theta_{t+1} \doteq \theta_t + \alpha \big( G_t - b(S_t) \big)\, \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}.
$$

One natural choice for the baseline is an estimate of the state value, $\hat v(S_t,\mathbf{w})$, where $\mathbf{w}$ is a weight vector learned by one of the value-estimation methods of the earlier chapters.

With $\delta_t \doteq G_t - \hat v(S_t,\mathbf{w})$, both parameter vectors are updated at every time step of the episode:

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta_t\, \nabla \hat v(S_t,\mathbf{w}),
\qquad
\theta \leftarrow \theta + \alpha^{\theta}\, \gamma^t\, \delta_t\, \nabla \ln \pi(A_t|S_t,\theta).
$$

Adding the baseline greatly reduces the variance of the updates and speeds learning.
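
A hedged sketch of these two updates, reusing the `SoftmaxPolicy` class and the environment interface assumed in the REINFORCE sketch above, with a linear value estimate $\hat v(s,\mathbf{w}) = \mathbf{w}^\top x(s)$ as the baseline:

```python
import numpy as np

def reinforce_with_baseline_episode(env, policy, w,
                                    alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0):
    """One episode of REINFORCE with a learned linear state-value baseline w . x(s)."""
    trajectory = []
    x, done = env.reset(), False
    while not done:
        a = policy.sample(x)
        x_next, r, done = env.step(a)
        trajectory.append((x, a, r))
        x = x_next

    G, returns = 0.0, []
    for _, _, r in reversed(trajectory):         # returns G_t by backward recursion
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for t, ((x, a, _), G_t) in enumerate(zip(trajectory, returns)):
        delta = G_t - w @ x                      # G_t - v_hat(S_t, w)
        w += alpha_w * delta * x                 # gradient of a linear v_hat is x
        policy.theta += alpha_theta * (gamma ** t) * delta * policy.grad_log_pi(x, a)
    return w
```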

Actor–Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic.

REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.

First consider one-step actor–critic methods, the analog of the TD methods introduced in Chapter 6, such as TD(0), Sarsa(0), and Q-learning.

One-step actor–critic methods replace the full return of REINFORCE with the one-step return, using a learned state-value function as the critic:

$$
\begin{aligned}
\theta_{t+1} &\doteq \theta_t + \alpha \big( R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w}) - \hat v(S_t,\mathbf{w}) \big)\, \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)} \\
&= \theta_t + \alpha\, \delta_t\, \nabla \ln \pi(A_t|S_t,\theta_t),
\end{aligned}
$$

where $\delta_t \doteq R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w}) - \hat v(S_t,\mathbf{w})$ is the TD error and $\mathbf{w}$ is updated by semi-gradient TD(0).
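
A sketch of the fully online version, again under the same assumptions (linear-softmax actor, linear critic, and the made-up environment interface). Unlike REINFORCE, nothing is stored: both parameter vectors are updated at every step from the TD error.

```python
import numpy as np

def one_step_actor_critic_episode(env, policy, w,
                                  alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0):
    """One episode of one-step actor-critic with a linear critic v_hat(s, w) = w . x(s)."""
    x, done = env.reset(), False
    I = 1.0                                      # gamma^t factor from the episodic algorithm
    while not done:
        a = policy.sample(x)
        x_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ x_next     # bootstrap target is 0 at terminal states
        delta = r + gamma * v_next - (w @ x)     # one-step TD error
        w += alpha_w * delta * x                 # critic: semi-gradient TD(0)
        policy.theta += alpha_theta * I * delta * policy.grad_log_pi(x, a)  # actor
        I *= gamma
        x = x_next
    return w
```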

Adding Eligibility Traces

Actor–critic with eligibility traces (episodic) keeps separate trace vectors $\mathbf{z}^{\mathbf{w}}$ and $\mathbf{z}^{\theta}$ for the critic and the actor:

$$
\begin{aligned}
\mathbf{z}^{\mathbf{w}} &\leftarrow \gamma \lambda^{\mathbf{w}}\, \mathbf{z}^{\mathbf{w}} + \nabla \hat v(S_t,\mathbf{w}), &
\mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta_t\, \mathbf{z}^{\mathbf{w}}, \\
\mathbf{z}^{\theta} &\leftarrow \gamma \lambda^{\theta}\, \mathbf{z}^{\theta} + I\, \nabla \ln \pi(A_t|S_t,\theta), &
\theta &\leftarrow \theta + \alpha^{\theta}\, \delta_t\, \mathbf{z}^{\theta},
\end{aligned}
$$

with $I = \gamma^t$ as before.
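
A sketch of the same loop with accumulating traces added, under the same assumptions as before; the only change is that the TD error now multiplies the trace vectors rather than the instantaneous gradients.

```python
import numpy as np

def actor_critic_traces_episode(env, policy, w,
                                alpha_theta=1e-3, alpha_w=1e-2,
                                gamma=1.0, lam_theta=0.9, lam_w=0.9):
    """One episode of episodic actor-critic with accumulating eligibility traces."""
    z_w = np.zeros_like(w)
    z_theta = np.zeros_like(policy.theta)
    x, done = env.reset(), False
    I = 1.0
    while not done:
        a = policy.sample(x)
        x_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ x_next
        delta = r + gamma * v_next - (w @ x)
        z_w = gamma * lam_w * z_w + x                                         # critic trace (linear v_hat)
        z_theta = gamma * lam_theta * z_theta + I * policy.grad_log_pi(x, a)  # actor trace
        w += alpha_w * delta * z_w
        policy.theta += alpha_theta * delta * z_theta
        I *= gamma
        x = x_next
    return w
```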

Policy Gradient for Continuing Problems

In the continuing case, with no episode boundaries, performance is defined in terms of the average rate of reward per time step:

$$
J(\theta) \doteq r(\pi) \doteq \lim_{h\to\infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \big]
= \sum_s \mu(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r \mid s,a)\, r,
$$

where $\mu$ is the steady-state distribution under $\pi$, $\mu(s) \doteq \lim_{t\to\infty} \Pr\{S_t = s \mid A_{0:t} \sim \pi\}$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption).

The steady-state distribution is the special distribution under which, if actions are selected according to $\pi$, the distribution over states remains unchanged:

$$
\sum_s \mu(s) \sum_a \pi(a|s,\theta)\, p(s' \mid s,a) = \mu(s'), \qquad \text{for all } s' \in \mathcal{S}.
$$

Values in the continuing case are defined with respect to the differential return,

$$
G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots,
$$

and with these definitions the policy gradient theorem holds, with equality, in the same form as in the episodic case.
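
These definitions lead to a differential (average-reward) one-step actor–critic; the sketch below shows the change relative to the episodic version, assuming the same linear-softmax policy and environment interface as in the earlier sketches. There is no discounting; instead a running estimate of $r(\pi)$ is maintained and subtracted in the TD error.

```python
import numpy as np

def differential_actor_critic(env, policy, w, n_steps,
                              alpha_theta=1e-3, alpha_w=1e-2, alpha_rbar=1e-2):
    """Continuing (average-reward) one-step actor-critic: no discounting, learned R-bar."""
    r_bar = 0.0                                  # running estimate of the average reward r(pi)
    x = env.reset()
    for _ in range(n_steps):
        a = policy.sample(x)
        x_next, r, _ = env.step(a)               # continuing task: ignore any 'done' flag
        delta = r - r_bar + (w @ x_next) - (w @ x)   # differential TD error
        r_bar += alpha_rbar * delta              # update the average-reward estimate
        w += alpha_w * delta * x                 # critic: differential semi-gradient TD(0)
        policy.theta += alpha_theta * delta * policy.grad_log_pi(x, a)  # actor
        x = x_next
    return w, r_bar
```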

Proof of the Policy Gradient Theorem (Continuing Case)

The proof parallels the episodic case: differentiating the Bellman equation for the differential value function,

$$
\nabla v_\pi(s) = \sum_a \Big[ \nabla \pi(a|s)\, q_\pi(s,a) + \pi(a|s) \Big( -\nabla r(\theta) + \sum_{s'} p(s'|s,a)\, \nabla v_\pi(s') \Big) \Big],
$$

then solving for $\nabla r(\theta)$, summing both sides weighted by $\mu(s)$, and using the steady-state property causes the $\nabla v_\pi$ terms to cancel, leaving

$$
\nabla J(\theta) = \nabla r(\theta) = \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s,\theta).
$$

Policy Parameterization for Continuous Actions

For continuous action spaces, the policy can be the density of a normal distribution whose mean and standard deviation are parameterized functions of the state:

$$
\pi(a|s,\theta) \doteq \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left( -\frac{\big(a - \mu(s,\theta)\big)^2}{2\,\sigma(s,\theta)^2} \right),
$$

where the $\pi$ under the square root is the number $\pi \approx 3.14159$, not the policy. For example, $\mu(s,\theta) \doteq \theta_\mu^\top \mathbf{x}_\mu(s)$ and $\sigma(s,\theta) \doteq \exp\!\big(\theta_\sigma^\top \mathbf{x}_\sigma(s)\big)$, so that $\sigma$ is always positive.
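
A minimal sketch of this parameterization for a one-dimensional action, with linear features (an assumption for the example). `grad_log_pi` returns the two eligibility vectors, for $\theta_\mu$ and $\theta_\sigma$, that would multiply $G_t$ or $\delta_t$ in the updates above; the earlier update code would need a small adaptation to handle the two parameter vectors.

```python
import numpy as np

class GaussianPolicy:
    """Policy over a scalar continuous action: a ~ N(mu(s), sigma(s)^2), linear parameterizations."""
    def __init__(self, n_features):
        self.theta_mu = np.zeros(n_features)
        self.theta_sigma = np.zeros(n_features)   # sigma = exp(theta_sigma . x) starts at 1

    def mu_sigma(self, x):
        return self.theta_mu @ x, np.exp(self.theta_sigma @ x)

    def sample(self, x):
        mu, sigma = self.mu_sigma(x)
        return np.random.normal(mu, sigma)

    def grad_log_pi(self, x, a):
        """Gradients of ln pi(a|s) with respect to theta_mu and theta_sigma."""
        mu, sigma = self.mu_sigma(x)
        g_mu = (a - mu) / sigma**2 * x                    # d ln pi / d theta_mu
        g_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x      # d ln pi / d theta_sigma
        return g_mu, g_sigma
```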

Reposted from blog.csdn.net/weixin_34120274/article/details/87082916