Proof of the Soft Bellman Equation and Soft Value Iteration

Prerequisites for this section: the basics of the soft value function and the Policy Improvement proof in Soft Q-Learning.

First, recall the definition of the soft value function:

$$V_{\mathrm{soft}}^{\pi}(\mathbf{s}) \triangleq \log \int \exp \left(Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a})\right) d \mathbf{a}$$
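For intuition: in a discrete action space the integral becomes a sum, so $V_{\mathrm{soft}}$ is simply a log-sum-exp over the Q-values of a state. A minimal sketch (the `q_values` array is made up purely for illustration):

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical Q_soft(s, a) for the three actions available in one state.
q_values = np.array([1.0, 2.0, 0.5])

v_soft = logsumexp(q_values)        # V_soft(s) = log sum_a exp(Q_soft(s, a))
policy = np.exp(q_values - v_soft)  # pi(a|s) = exp(Q_soft - V_soft)

print(v_soft)             # soft value of the state
print(policy.sum())       # 1.0: the policy is properly normalized
```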

Assuming $\pi(\mathbf{a} \mid \mathbf{s}) = \exp \left(Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a})-V_{\mathrm{soft}}^{\pi}(\mathbf{s})\right)$, we have:

$$\begin{aligned} Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a}) &=r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[\mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}^{\prime}\right)\right)+\mathbb{E}_{\mathbf{a}^{\prime} \sim \pi\left(\cdot \mid \mathbf{s}^{\prime}\right)}\left[Q_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]\right] \\ &=r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}\right)\right] \end{aligned}$$
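To see why the second equality holds, note that the assumed policy is properly normalized ($\int \exp(Q_{\mathrm{soft}}^{\pi}-V_{\mathrm{soft}}^{\pi})\, d\mathbf{a} = 1$ by the definition of $V_{\mathrm{soft}}^{\pi}$), and that it satisfies $-\log \pi(\mathbf{a}^{\prime} \mid \mathbf{s}^{\prime}) = V_{\mathrm{soft}}^{\pi}(\mathbf{s}^{\prime}) - Q_{\mathrm{soft}}^{\pi}(\mathbf{s}^{\prime}, \mathbf{a}^{\prime})$. The entropy term therefore exactly cancels the expected Q-value:

$$\mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}^{\prime}\right)\right)+\mathbb{E}_{\mathbf{a}^{\prime} \sim \pi\left(\cdot \mid \mathbf{s}^{\prime}\right)}\left[Q_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]=\mathbb{E}_{\mathbf{a}^{\prime} \sim \pi\left(\cdot \mid \mathbf{s}^{\prime}\right)}\left[-\log \pi\left(\mathbf{a}^{\prime} \mid \mathbf{s}^{\prime}\right)+Q_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]=V_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}\right)$$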

Finally, we define the soft value iteration operator $\mathcal{T}$:

$$\mathcal{T} Q(\mathbf{s}, \mathbf{a}) \triangleq r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[\log \int \exp Q\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right) d \mathbf{a}^{\prime}\right]$$
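A minimal sketch of soft value iteration, repeatedly applying $\mathcal{T}$ until the Q-values stop changing. The toy reward array `R[s, a]`, transition tensor `P[s, a, s']`, and discount `gamma` below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, P, gamma=0.99, tol=1e-8):
    """Iterate Q <- T Q, where
    (T Q)(s, a) = R(s, a) + gamma * E_{s'~P}[ logsumexp_{a'} Q(s', a') ].
    R: (S, A) rewards; P: (S, A, S) transition probabilities."""
    Q = np.zeros_like(R)
    while True:
        V = logsumexp(Q, axis=1)             # soft value of each next state
        Q_new = R + gamma * (P @ V)          # (P @ V)[s, a] = sum_s' P[s,a,s'] V(s')
        if np.max(np.abs(Q_new - Q)) < tol:  # the contraction guarantees termination
            return Q_new
        Q = Q_new

# Toy 2-state, 2-action MDP (made-up numbers, purely for illustration).
R = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
print(soft_value_iteration(R, P))
```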

Since $\mathcal{T}$ is a contraction mapping in the sup-norm $\|\cdot\|_{\infty}$, repeatedly applying it converges to a unique fixed point, which completes the proof.
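The contraction property itself rests on the log-sum-exp being 1-Lipschitz in the sup-norm; a sketch of the argument used in the paper's appendix: let $\varepsilon = \|Q_1 - Q_2\|_{\infty}$, then

$$\log \int \exp Q_1\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right) d \mathbf{a}^{\prime} \leq \log \int \exp \left(Q_2\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)+\varepsilon\right) d \mathbf{a}^{\prime}=\varepsilon+\log \int \exp Q_2\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right) d \mathbf{a}^{\prime}$$

and symmetrically with $Q_1$ and $Q_2$ swapped, so $\|\mathcal{T} Q_1-\mathcal{T} Q_2\|_{\infty} \leq \gamma\|Q_1-Q_2\|_{\infty}$, i.e. $\mathcal{T}$ is a $\gamma$-contraction.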


For the full proof, see the paper: Reinforcement Learning with Deep Energy-Based Policies.

Reposted from blog.csdn.net/weixin_39059031/article/details/104771388