Deterministic Policy Gradient Algorithms
Paper link
Notes
Motivation
The earliest policy gradient algorithms were stochastic.
"Stochastic" here refers to a stochastic policy \(\pi_\theta(a|s)=P[a\mid s;\theta]\). But stochastic policies can run into trouble in high-dimensional continuous action spaces: the gradient has to account for the effect of every possible action in each state, so far more \((s,a)\) samples are needed to form an accurate estimate.
For a deterministic policy \(a=\mu_\theta(s)\), it was previously believed that no (model-free) policy gradient existed (and one obvious further objection is insufficient exploration).
Against this conventional wisdom, the paper proposes the deterministic policy gradient, i.e. DPG.
It works off-policy: a stochastic behavior policy selects actions for exploration, while a deterministic target policy is learned.
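As a minimal sketch of the deterministic update the paper builds toward (a hypothetical scalar toy problem, not the paper's setup): with a linear policy \(a=\mu_\theta(s)=\theta s\) and a known critic, the actor ascends \(\nabla_\theta \mu_\theta(s)\,\nabla_a Q(s,a)|_{a=\mu_\theta(s)}\), with no importance ratio and no integral over actions.

```python
import numpy as np

# Hypothetical toy: deterministic policy a = mu_theta(s) = theta * s on scalar
# states/actions, with a known critic Q(s, a) = -(a - 2s)^2 whose optimum is
# theta = 2. Not the paper's experiments, just the shape of the update.
rng = np.random.default_rng(0)

def dq_da(s, a):
    # Analytic gradient of the toy critic Q(s, a) = -(a - 2s)^2 w.r.t. a.
    return -2.0 * (a - 2.0 * s)

theta = 0.0
for _ in range(200):
    s = rng.uniform(-1.0, 1.0, size=64)   # states from a behavior distribution
    a = theta * s                         # deterministic action a = mu_theta(s)
    # Deterministic policy gradient: E[ grad_theta mu_theta(s) * grad_a Q(s, a) ]
    grad = np.mean(s * dq_da(s, a))       # grad_theta mu_theta(s) = s
    theta += 0.1 * grad

print(round(theta, 2))   # converges toward 2.0
```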
policy gradient
\[J(\pi_\theta)=\int_S \rho^\pi(s)\int_A \pi_\theta (s,a)\,r(s,a)\,da\,ds=E_{s\sim \rho^\pi ,a\sim \pi_\theta}[r(s,a)]\]
where \(\rho^\pi(s') = \int_S \sum_{t=1}^{\infty}\gamma^{t-1}p_1(s)\,p(s\to s',t,\pi)\,ds\) is the discounted state distribution (\(p_1\) is the initial-state density, and \(p(s\to s',t,\pi)\) the density of reaching \(s'\) from \(s\) after \(t\) steps under \(\pi\)).
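For intuition, the discounted state distribution of a small Markov chain (hypothetical numbers) can be evaluated both from the series definition and in closed form, since \(\sum_{t\ge 1}\gamma^{t-1}P^{t-1}=(I-\gamma P)^{-1}\):

```python
import numpy as np

# Sketch for a toy 2-state Markov chain (hypothetical transition numbers):
# rho = sum_t gamma^(t-1) p1 P^(t-1), which equals p1 @ inv(I - gamma * P).
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])      # state-transition matrix under the policy
p1 = np.array([1.0, 0.0])       # initial state distribution

# Truncated series evaluation of the definition (t shifted to start at 0)
rho_series = sum(gamma**t * p1 @ np.linalg.matrix_power(P, t)
                 for t in range(1000))

# Closed form via the resolvent (I - gamma P)^{-1}
rho_closed = p1 @ np.linalg.inv(np.eye(2) - gamma * P)

print(np.allclose(rho_series, rho_closed))  # True
```

Note that \(\rho^\pi\) is unnormalized: its components sum to \(1/(1-\gamma)\), not 1.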
stochastic policy gradient
policy gradient theorem:
\[\nabla_\theta J(\pi_\theta)=\int_S \rho^\pi(s)\int_A \nabla_\theta \pi_\theta (s,a)\,Q^\pi(s,a)\,da\,ds=E_{s\sim \rho^\pi ,a\sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\,Q^\pi(s,a)]\]
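The likelihood-ratio form on the right can be sketched on a one-step bandit (hypothetical toy; with one-step episodes, the observed reward stands in for \(Q^\pi\)):

```python
import numpy as np

# Hypothetical toy: Gaussian policy pi_theta = N(theta, 1) over actions,
# reward r(a) = -(a - 3)^2, so the optimal mean is theta = 3.
rng = np.random.default_rng(1)

theta = 0.0
for _ in range(2000):
    a = rng.normal(theta, 1.0, size=256)   # a ~ pi_theta
    r = -(a - 3.0) ** 2                    # reward plays the role of Q^pi here
    score = a - theta                      # grad_theta log N(a; theta, 1)
    theta += 0.01 * np.mean(score * r)     # E[ grad log pi * Q ]

print(round(theta, 1))   # approaches 3.0
```

The estimator is unbiased but noisy; the critic in the next section exists precisely to reduce that variance.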
stochastic Actor-Critic algorithm
The critic estimates the action-value function by temporal-difference (TD) learning, \(Q^w(s,a)\approx Q^\pi(s,a)\), and substitutes it for the true \(Q^\pi\) in the policy gradient:
\[\nabla_\theta J(\pi_\theta)\approx\int_S \rho^\pi(s)\int_A \nabla_\theta \pi_\theta (s,a)\,Q^w(s,a)\,da\,ds=E_{s\sim \rho^\pi ,a\sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\,Q^w(s,a)]\]
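The resulting actor-critic loop can be sketched on the same kind of one-step bandit (hypothetical setup; with one-step episodes the TD update reduces to regression of \(Q^w\) onto the observed reward):

```python
import numpy as np

# Hypothetical toy: policy N(theta, 1), reward r(a) = -(a - 3)^2 (optimum at
# theta = 3). The critic Q_w is linear in features of d = a - theta and is fit
# by squared TD error; the actor update uses Q_w in place of the return.
rng = np.random.default_rng(2)

theta = 0.0          # actor parameter: policy is N(theta, 1)
w = np.zeros(3)      # critic weights for features [1, d, d^2]

for _ in range(3000):
    a = rng.normal(theta, 1.0, size=128)
    r = -(a - 3.0) ** 2
    d = a - theta
    phi = np.stack([np.ones_like(d), d, d * d], axis=-1)
    td_err = r - phi @ w
    w += 0.1 * np.mean(td_err[:, None] * phi, axis=0)   # critic step
    theta += 0.01 * np.mean(d * (phi @ w))              # actor uses Q_w, not r

print(round(theta, 1))   # approaches 3.0
```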
Off-policy AC
The behavior policy that generates the data differs from the target policy: \(\beta(a|s)\neq \pi_\theta(a|s)\). The objective becomes the value of \(\pi\) averaged over the behavior policy's state distribution:
\[J_\beta(\pi_\theta)=\int_S \rho^\beta(s)V^\pi(s)\,ds=\int_S \int_A \rho^\beta(s)\,\pi_\theta (s,a)\,Q^\pi(s,a)\,da\,ds\]
\[\nabla_\theta J_\beta(\pi_\theta)\approx\int_S \int_A \rho^\beta(s)\,\nabla_\theta \pi_\theta (s,a)\,Q^\pi(s,a)\,da\,ds=E_{s\sim \rho^\beta ,a\sim \beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)} \nabla_\theta \log \pi_\theta(s,a)\,Q^\pi(s,a)\right]\]
The approximation drops the term involving \(\nabla_\theta Q^\pi(s,a)\), following the off-policy actor-critic of Degris et al.; since actions are sampled from \(\beta\) rather than \(\pi_\theta\), the importance ratio \(\pi_\theta(a|s)/\beta(a|s)\) corrects for the mismatch (note \(\beta\) itself does not depend on \(\theta\)).
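A sketch of the importance-weighted estimator (hypothetical toy): actions come from a fixed behavior policy, and the ratio reweights each sample so the expectation matches the target policy.

```python
import numpy as np

# Hypothetical toy: behavior policy beta = N(0, 2^2) generates actions, while
# we improve the target policy pi_theta = N(theta, 1). Reward r(a) = -(a - 1)^2,
# so the optimal target mean is theta = 1.
rng = np.random.default_rng(3)

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

theta = 0.0
for _ in range(4000):
    a = rng.normal(0.0, 2.0, size=256)     # a ~ beta, NOT pi_theta
    r = -(a - 1.0) ** 2                    # reward stands in for Q^pi
    ratio = gauss_pdf(a, theta, 1.0) / gauss_pdf(a, 0.0, 2.0)
    score = a - theta                      # grad_theta log pi_theta(a)
    theta += 0.005 * np.mean(ratio * score * r)

print(round(theta, 1))   # approaches 1.0
```

For a deterministic target policy the ratio over actions disappears entirely, which is one of the practical attractions of DPG in the off-policy setting.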