Hands on RL: Off-policy Maximum Entropy Actor-Critic (SAC)


The Soft Actor-Critic (SAC) method is also known as the off-policy maximum entropy actor-critic algorithm.

1. Theoretical Foundations

1.1 Maximum Entropy Reinforcement Learning (MERL)

The original MERL work introduces the concept of entropy, defined as follows:

If a random variable $x$ follows the probability distribution $P$, then the entropy $\mathcal{H}(P)$ of $x$ is
$$\mathcal{H}(P)=\mathbb{E}_{x\sim P}[-\log p(x)]$$
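As a quick numerical check of this definition (a minimal sketch, not part of the original derivation), the entropy of a small categorical distribution can be computed directly:

import torch
from torch.distributions import Categorical

# a toy categorical distribution over 4 outcomes (the probabilities are arbitrary)
probs = torch.tensor([0.1, 0.2, 0.3, 0.4])
dist = Categorical(probs=probs)

# H(P) = E_{x~P}[-log p(x)], computed exactly by summing over the support
entropy_manual = -(probs * probs.log()).sum()
print(entropy_manual.item(), dist.entropy().item())   # the two values agree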
The objective of a standard RL algorithm is to find the policy that maximizes the cumulative reward:
$$\pi^*_{\text{std}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho_\pi}[r(s_t, a_t)]$$
The objective of RL with entropy maximization is
$$\pi^*_{\text{MERL}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[r(s_t,a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\big]$$
where $\rho_\pi$ denotes the distribution of state-action pairs induced by the policy $\pi$, and $\alpha$ is the temperature coefficient that controls how much weight is placed on the entropy term.

Similarly, we can introduce entropy into the action-value function and the state-value function of RL.

The value functions of a standard RL algorithm are
$$\begin{aligned} \text{standard Q function:} \quad Q^\pi(s,a) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\,\Big|\,s_0=s, a_0=a\Big] \\ \text{standard V function:} \quad V^\pi(s) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\,\Big|\,s_0=s\Big] \end{aligned}$$

Following the MERL objective, introducing entropy into the value functions yields the Soft Value Functions (SVF):
$$\begin{aligned} \text{Soft Q function:} \quad Q_{\text{soft}}^\pi(s,a) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) + \alpha \sum_{t=1}^\infty\gamma^t\mathcal{H}(\pi(\cdot|s_t)) \,\Big|\, s_0=s, a_0=a\Big] \\ \text{Soft V function:} \quad V_{\text{soft}}^\pi(s) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t \big( r(s_t,a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \big)\,\Big|\, s_0=s\Big] \end{aligned}$$

The Soft Bellman equation then takes the following form:
$$\begin{aligned} Q_{\text{soft}}^\pi(s,a) & = \mathbb{E}_{s^\prime \sim p(s^\prime|s,a),\, a^\prime\sim \pi}\Big[r(s,a) + \gamma \big( Q_{\text{soft}}^\pi(s^\prime,a^\prime) + \alpha \mathcal{H}(\pi(\cdot|s^\prime)) \big)\Big] \\ & = \mathbb{E}_{s^\prime\sim p(s^\prime|s,a)} \big[r(s,a) + \gamma V_{\text{soft}}^\pi(s^\prime)\big] \end{aligned}$$

$$\begin{aligned} V_{\text{soft}}^\pi(s) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t \big( r(s_t,a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \big)\,\Big|\, s_0=s\Big] \\ & = \mathbb{E}_{a\sim\pi}\Big[ \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) + \alpha \sum_{t=1}^\infty\gamma^t\mathcal{H}(\pi(\cdot|s_t)) \,\Big|\, s_0=s, a_0=a\Big] + \alpha \mathcal{H}(\pi(\cdot|s))\Big] \\ & = \mathbb{E}_{a\sim\pi}\big[Q_{\text{soft}}^\pi(s,a)\big] + \alpha \mathcal{H}(\pi(\cdot|s)) \\ & = \mathbb{E}_{a\sim\pi}\big[Q_{\text{soft}}^\pi(s,a) - \alpha \log \pi(a|s)\big] \end{aligned}$$

1.2 Soft Policy Evaluation and Soft Policy Improvement in SAC

The value-iteration formula for the soft Q function is
$$\begin{aligned} Q_{\text{soft}}^\pi(s,a) & = \mathbb{E}_{s^\prime \sim p(s^\prime|s,a),\, a^\prime\sim \pi}\Big[r(s,a) + \gamma \big( Q_{\text{soft}}^\pi(s^\prime,a^\prime) + \alpha \mathcal{H}(\pi(\cdot|s^\prime)) \big)\Big] \qquad (1.1) \\ & = \mathbb{E}_{s^\prime\sim p(s^\prime|s,a)} \big[r(s,a) + \gamma V_{\text{soft}}^\pi(s^\prime)\big] \qquad (1.2) \end{aligned}$$

The value-iteration formula for the soft V function is

$$V_{\text{soft}}^\pi(s)= \mathbb{E}_{a\sim\pi}\big[Q_{\text{soft}}^\pi(s,a) - \alpha \log \pi(a|s)\big] \qquad (1.3)$$

If we only intend to maintain a single Q-value function in SAC, value iteration with Eq. (1.1) is sufficient; if we maintain both a Q function and a V function, then Eqs. (1.2) and (1.3) are used for value iteration.
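To make Eqs. (1.2) and (1.3) concrete, here is a tiny tabular soft policy evaluation sketch on a randomly generated MDP; the MDP, the fixed policy, and all names here are illustrative assumptions, not part of the code in Section 2:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.2

# a random MDP: transitions P[s, a, s'], rewards r[s, a], and a fixed stochastic policy pi[s, a]
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions)); pi /= pi.sum(axis=1, keepdims=True)

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Eq. (1.3): V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]
    V = (pi * (Q - alpha * np.log(pi))).sum(axis=1)
    # Eq. (1.2): Q(s,a) = r(s,a) + gamma * E_{s'~p}[V(s')]
    Q = r + gamma * P @ V

print(np.round(V, 3))   # soft state values of the fixed policy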

The loss functions used during training are given directly below.

Loss function of the V-value function:

$$J_V(\psi) = \mathbb{E}_{s_t\sim\mathcal{D}} \Big[\tfrac{1}{2}\big(V_{\psi}(s_t) - \mathbb{E}_{a_t\sim\pi_{\phi}}[Q_\theta(s_t,a_t)- \log \pi_\phi(a_t|s_t)]\big)^2 \Big]$$
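A rough PyTorch sketch of this loss (the names v_net, q_net, and actor are placeholders for the value, Q, and policy networks; the actor is assumed to return a sampled action together with its log-probability):

import torch
import torch.nn.functional as F

def v_loss(v_net, q_net, actor, states):
    """J_V(psi): regress V(s) toward a one-sample estimate of E_{a~pi}[Q(s,a) - log pi(a|s)]."""
    with torch.no_grad():
        actions, log_probs = actor(states)          # a ~ pi(.|s) and log pi(a|s)
        target = q_net(states, actions) - log_probs
    return 0.5 * F.mse_loss(v_net(states), target)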

Loss function of the Q-value function:

$$\begin{aligned} J_Q(\theta) &= \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}} \Big[ \tfrac{1}{2}\big( Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t) \big)^2 \Big] \\ \hat{Q}(s_t,a_t) &= r(s_t,a_t) + \gamma \mathbb{E}_{s_{t+1}\sim p}[V_{\psi}(s_{t+1})] \\ \hat{Q}(s_t,a_t) &= r(s_t,a_t) + \gamma Q_\theta(s_{t+1},a_{t+1}) \end{aligned}$$

where the first form of $\hat{Q}$ is used when a separate V network is maintained, and the second form when only Q networks are kept.

Loss function of the policy $\pi$:

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim \mathcal{D}} \Big[ \mathbb{D}_{KL}\Big( \pi_\phi(\cdot|s_t) \,\Big|\Big|\, \frac{\exp(Q_\theta(s_t,\cdot))}{Z_\theta(s_t)} \Big) \Big]$$

where $Z_\theta(s_t)$ is the partition function.

The KL divergence is defined as follows.

Suppose a random variable $\xi$ has two probability distributions $P$ and $Q$, where $P$ is the true distribution and $Q$ is an easier-to-obtain approximation. If $\xi$ is a discrete random variable, the KL divergence from $P$ to $Q$ is defined as

$$\mathbb{D}_{KL}(P\,||\,Q) = \sum_i P(i)\ln\frac{P(i)}{Q(i)}$$

If $\xi$ is a continuous random variable, the KL divergence from $P$ to $Q$ is defined as

$$\mathbb{D}_{KL}(P\,||\,Q) = \int^\infty_{-\infty}p(x)\ln\frac{p(x)}{q(x)}\, dx$$
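A minimal numerical check of the discrete definition (the two distributions are arbitrary toy values):

import torch
import torch.nn.functional as F

p = torch.tensor([0.4, 0.4, 0.2])   # "true" distribution P
q = torch.tensor([0.3, 0.3, 0.4])   # approximating distribution Q

# D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i))
kl_manual = (p * (p / q).log()).sum()

# F.kl_div expects the log-probabilities of Q as input and P as target
kl_torch = F.kl_div(q.log(), p, reduction='sum')
print(kl_manual.item(), kl_torch.item())   # both give the same value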

Using the discrete definition of the KL divergence, the loss function of the policy $\pi$ can be expanded as follows:

$$\begin{aligned} J_\pi(\phi) & = \mathbb{E}_{s_t\sim \mathcal{D}} \Big[ \mathbb{D}_{KL}\Big( \pi_\phi(\cdot|s_t) \,\Big|\Big|\, \frac{\exp(Q_\theta(s_t,\cdot))}{Z_\theta(s_t)} \Big) \Big] \\ & = \mathbb{E}_{s_t\sim\mathcal{D}} \Big[ \sum \pi_\phi(\cdot|s_t) \ln\frac{\pi_\phi(\cdot|s_t)\,Z_\theta(s_t)}{\exp(Q_\theta(s_t,\cdot))} \Big] \\ & = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim \pi_\phi} \Big[ \ln\pi_\phi(\cdot|s_t) + \ln Z_\theta(s_t) - Q_\theta(s_t,\cdot) \Big] \\ & = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim \pi_\phi} \Big[ \ln\pi_\phi(\cdot|s_t) - Q_\theta(s_t,\cdot) \Big] \qquad (1.4) \end{aligned}$$

The last step holds because the partition function $Z_\theta$ does not depend on the policy $\pi$, so it has no effect on the gradient and can simply be dropped from the objective.

Next, the reparameterization trick is introduced:

$$a_t = f_\phi(\epsilon_t; s_t), \quad \epsilon\sim\mathcal{N}$$

Substituting this into Eq. (1.4) gives

$$J_\pi(\phi)=\mathbb{E}_{s_t\sim\mathcal{D},\, \epsilon_t\sim\mathcal{N}} \Big[ \ln\pi_\phi\big(f_\phi(\epsilon_t;s_t)\,|\,s_t\big) - Q_\theta\big(s_t,f_\phi(\epsilon_t;s_t)\big) \Big]$$
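A rough sketch of what $f_\phi(\epsilon_t;s_t)$ looks like for a tanh-squashed Gaussian policy; it mirrors the change-of-variables correction used in the full code of Section 2, with mu and std standing for the outputs of a policy network:

import torch
from torch.distributions import Normal

def reparameterized_action(mu, std, eps=1e-7):
    """a = tanh(mu + std * epsilon), with log pi(a|s) corrected for the tanh squashing."""
    dist = Normal(mu, std)
    u = dist.rsample()                  # u = mu + std * epsilon, differentiable w.r.t. mu and std
    a = torch.tanh(u)                   # squash the action into (-1, 1)
    # log pi(a|s) = log N(u; mu, std) - log(1 - tanh(u)^2)
    log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + eps)
    return a, log_prob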

Finally, we simply keep collecting data and minimizing these two loss functions until they converge.

1.3 Two Q-Value Networks

The policy network of SAC is different from the policy network in DDPG, which outputs the action directly and deterministically. The policy network of SAC outputs the mean and standard deviation of a Gaussian distribution over the continuous action space, and the continuous action is then sampled from this Gaussian. SAC may or may not maintain a V-value network, but what it does require are two Q-value networks, the target networks of those two Q networks, and one policy network $\pi$. Why maintain two Q-value networks? This reduces the overestimation problem of Q-value networks: the smaller of the two Q values is used when computing the loss functions. Suppose the two Q-value networks are $Q_{\theta_1}, Q_{\theta_2}$ and their corresponding target networks are $Q_{\theta_1^-}, Q_{\theta_2^-}$.

Then the loss function of the Q-value function becomes

$$\begin{aligned} J_Q(\theta) & = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}} \Big[ \tfrac{1}{2}\big( Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t) \big)^2 \Big] \qquad (1.5) \\ \hat{Q}(s_t,a_t) & = r(s_t,a_t) + \gamma V_\psi(s_{t+1}) \qquad (1.6) \\ V_\psi(s_{t+1}) & = Q_\theta(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1}) \qquad (1.7) \end{aligned}$$

Combining Eqs. (1.5), (1.6), and (1.7) gives

$$J_Q(\theta) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}} \Big[ \tfrac{1}{2}\Big( Q_\theta(s_t,a_t) - \Big(r(s_t,a_t)+\gamma \big(\min_{j=1,2}Q_{\theta_j^-}(s_{t+1},a_{t+1}) - \alpha \log\pi(a_{t+1}|s_{t+1}) \big) \Big) \Big)^2 \Big]$$
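A rough PyTorch sketch of this critic update, using a one-sample estimate per transition; actor, q1, q2, target_q1, and target_q2 are placeholder names, and alpha is the temperature (the SAC class below implements the same computation in its calc_target and update methods):

import torch
import torch.nn.functional as F

def critic_losses(q1, q2, target_q1, target_q2, actor, batch, gamma, alpha):
    """Double-Q TD target: r + gamma * (min_j Q_target_j(s', a') - alpha * log pi(a'|s'))."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next, log_prob_next = actor(s_next)       # a' ~ pi(.|s')
        min_q_next = torch.min(target_q1(s_next, a_next), target_q2(s_next, a_next))
        td_target = r + gamma * (1.0 - done) * (min_q_next - alpha * log_prob_next)
    loss1 = 0.5 * F.mse_loss(q1(s, a), td_target)
    loss2 = 0.5 * F.mse_loss(q2(s, a), td_target)
    return loss1, loss2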

Because SAC is an off-policy algorithm, target Q networks $Q_{\theta^-}$ are used to make training more stable; there are two target networks, one for each Q network. The target Q networks in SAC are updated in the same way as in DDPG, i.e. by soft (Polyak) updates.
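For reference, a DDPG-style soft update is just an exponential moving average of the parameters, the same idea as the soft_update method in the code below:

import torch

@torch.no_grad()
def soft_update(net, target_net, tau=0.005):
    """target <- tau * online + (1 - tau) * target, applied parameter-wise."""
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)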

The loss function of the policy $\pi_\phi$ derived earlier is

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim \pi_\phi} \Big[ \ln\pi_\phi(\cdot|s_t) - Q_\theta(s_t,\cdot)\Big]$$

After applying the reparameterization trick, it becomes

$$J_\pi(\phi)=\mathbb{E}_{s_t\sim\mathcal{D},\, \epsilon_t\sim\mathcal{N}} \Big[ \ln\pi_\phi\big(f_\phi(\epsilon_t;s_t)\,|\,s_t\big) - Q_\theta\big(s_t,f_\phi(\epsilon_t;s_t)\big) \Big]$$

Taking the minimum over the two Q networks, the objective can be rewritten as

$$J_\pi(\phi)= \mathbb{E}_{s_t\sim\mathcal{D},\, \epsilon_t\sim\mathcal{N}} \Big[ \ln\pi_\phi\big(f_\phi(\epsilon_t;s_t)\,|\,s_t\big) - \min_{j=1,2}Q_{\theta_j}\big(s_t,f_\phi(\epsilon_t;s_t)\big)\Big]$$
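A corresponding sketch of the actor update, again with placeholder names; the actor is assumed to return a reparameterized action and its log-probability, as in the earlier sketch:

import torch

def actor_loss(actor, q1, q2, states, alpha):
    """J_pi: minimize alpha * log pi(a|s) - min_j Q_j(s, a) with a = f_phi(eps; s)."""
    actions, log_probs = actor(states)      # reparameterized, so gradients flow back into phi
    min_q = torch.min(q1(states, actions), q2(states, actions))
    return (alpha * log_probs - min_q).mean()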


1.4 Tricks

SAC also incorporates common techniques from other algorithms, such as a replay buffer to provide approximately independent and identically distributed samples, and the target-network idea from Double DQN, here applied to two Q networks. One of the most important tricks in SAC is automatically adjusting the entropy regularization term. How the coefficient of the entropy term is chosen matters a great deal, and different states call for different amounts of entropy: in states where the optimal action is uncertain, the entropy should be larger, while in states where the optimal action is fairly certain, the entropy can be smaller. To adjust the entropy term automatically, SAC rewrites the reinforcement learning objective as a constrained optimization problem:

$$\max_\pi \mathbb{E}_\pi\Big[ \sum_t r(s_t,a_t) \Big] \quad \text{s.t.} \quad \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[-\log \pi_t(a_t|s_t)\big] \ge \mathcal{H}_0$$

where $\mathcal{H}_0$ is a predefined minimum policy-entropy threshold. After some mathematical simplification, the loss function of the temperature $\alpha$ can be obtained as

$$J(\alpha) = \mathbb{E}_{s_t\sim \mathcal{D},\, a_t\sim \pi(\cdot|s_t)}\big[-\alpha\log\pi_t(a_t|s_t)-\alpha\mathcal{H}_0\big]$$
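A short sketch of this temperature update in practice, optimizing log(alpha) rather than alpha directly, as the full code below also does; the names are placeholders:

import torch

log_alpha = torch.zeros(1, requires_grad=True)       # alpha = exp(log_alpha)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs, target_entropy):
    """J(alpha) = E[-alpha * log pi(a|s) - alpha * H_0]; log_probs come from the current policy."""
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()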

1.5 Pseudocode

[Figure: SAC pseudocode]

2. Code implementation

2.1 SAC for continuous action spaces

Using the continuous-action Pendulum-v1 environment from Gymnasium, the complete code is as follows.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

from tqdm import tqdm
import collections
import random
import numpy as np
import matplotlib.pyplot as plt
import gym

# replay buffer
class ReplayBuffer():
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
    
    def add(self, s, a, r, s_, d):
        self.buffer.append((s, a, r, s_, d))
    
    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return np.array(states), actions, rewards, np.array(next_states), dones
    
    def size(self):
        return len(self.buffer)

# Actor
class PolicyNet_Continuous(nn.Module):
    """The action follows a Gaussian distribution; output the mean mu and standard deviation std."""
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound):
        super(PolicyNet_Continuous, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(in_features=hidden_dim, out_features=action_dim)
        self.fc_std = nn.Sequential(
            nn.Linear(in_features=hidden_dim, out_features=action_dim),
            nn.Softplus()
        )
        self.action_bound = action_bound

    def forward(self, s):
        x = self.fc1(s)
        mu = self.fc_mu(x)
        std = self.fc_std(x)
        distribution = Normal(mu, std)
        normal_sample = distribution.rsample()
        normal_log_prob = distribution.log_prob(normal_sample)
        # get action limit to [-1,1]
        action = torch.tanh(normal_sample)
        # change-of-variables correction for the tanh squashing:
        # log pi(a|s) = log N(u; mu, std) - log(1 - tanh(u)^2), where action = tanh(u)
        tanh_log_prob = normal_log_prob - torch.log(1 - action.pow(2) + 1e-7)
        # get action bounded
        action = action * self.action_bound
        return action, tanh_log_prob


# Critic
class QValueNet_Continuous(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet_Continuous, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim + action_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(in_features=hidden_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc_out = nn.Linear(in_features=hidden_dim, out_features=1)
    
    def forward(self, s, a):
        cat = torch.cat([s,a], dim=1)
        x = self.fc1(cat)
        x = self.fc2(x)
        return self.fc_out(x)

# maximize entropy deep reinforcement learning SAC
class SAC_Continuous():
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound,
                    actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma,
                    device):
        # actor
        self.actor = PolicyNet_Continuous(state_dim, hidden_dim, action_dim, action_bound).to(device)
        # two critics
        self.critic1 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        self.critic2 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        # two target critics
        self.target_critic1 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        self.target_critic2 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        # initialize with same parameters
        self.target_critic1.load_state_dict(self.critic1.state_dict())
        self.target_critic2.load_state_dict(self.critic2.state_dict())
        # specify optimizers
        self.optimizer_actor = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.optimizer_critic1 = torch.optim.Adam(self.critic1.parameters(), lr=critic_lr)
        self.optimizer_critic2 = torch.optim.Adam(self.critic2.parameters(), lr=critic_lr)
        # optimizing log(alpha) instead of alpha itself keeps training stable
        self.log_alpha = torch.tensor(np.log(0.01), dtype=torch.float, requires_grad = True)
        self.optimizer_log_alpha = torch.optim.Adam([self.log_alpha], lr=alpha_lr)

        self.target_entropy = target_entropy
        self.gamma = gamma
        self.tau = tau
        self.device = device
    
    def take_action(self, state):
        state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
        action, _ = self.actor(state)
        return [action.item()]
    
    # calculate td target
    def calc_target(self, rewards, next_states, dones):
        next_action, log_prob = self.actor(next_states)
        entropy = -log_prob
        q1_values = self.target_critic1(next_states, next_action)
        q2_values = self.target_critic2(next_states, next_action)
        next_values = torch.min(q1_values, q2_values) + self.log_alpha.exp() * entropy
        td_target = rewards + self.gamma * next_values * (1-dones)
        return td_target

    # soft update method
    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0-self.tau) + param.data * self.tau)
        
    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1,1).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.float).view(-1,1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1,1).to(self.device)

        rewards = (rewards + 8.0) / 8.0     # reshape the Pendulum rewards (roughly in [-16, 0]) to ease training

        # update two Q-value network
        td_target = self.calc_target(rewards, next_states, dones).detach()
        critic1_loss = torch.mean(F.mse_loss(td_target, self.critic1(states, actions)))
        critic2_loss = torch.mean(F.mse_loss(td_target, self.critic2(states, actions)))

        self.optimizer_critic1.zero_grad()
        critic1_loss.backward()
        self.optimizer_critic1.step()
        self.optimizer_critic2.zero_grad()
        critic2_loss.backward()
        self.optimizer_critic2.step()

        # update policy network
        new_actions, log_prob = self.actor(states)
        entropy = - log_prob
        q1_value = self.critic1(states, new_actions)
        q2_value = self.critic2(states, new_actions)
        actor_loss = torch.mean(-self.log_alpha.exp() * entropy - torch.min(q1_value, q2_value))
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()

        # update temperature alpha
        alpha_loss = torch.mean((entropy - self.target_entropy).detach() * self.log_alpha.exp())
        self.optimizer_log_alpha.zero_grad()
        alpha_loss.backward()
        self.optimizer_log_alpha.step()

        # soft update target Q-value network
        self.soft_update(self.critic1, self.target_critic1)
        self.soft_update(self.critic2, self.target_critic2)


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size, render, seed_number):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d'%(i+1)) as pbar:
            for i_episode in range(int(num_episodes/10)):
                observation, _ = env.reset(seed=seed_number)
                done = False
                episode_return = 0

                while not done:
                    if render:
                        env.render()
                    action = agent.take_action(observation)
                    observation_, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    replay_buffer.add(observation, action, reward, observation_, done)
                    # swap states
                    observation = observation_
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'rewards': b_r,
                            'next_states': b_ns,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if(i_episode+1) % 10 == 0:
                    pbar.set_postfix({
                        'episode': '%d'%(num_episodes/10 * i + i_episode + 1),
                        'return': "%.3f"%(np.mean(return_list[-10:]))
                    })
                pbar.update(1)
    env.close()
    return return_list

def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0)) 
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size-1, 2)
    begin = np.cumsum(a[:window_size-1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))

def plot_curve(return_list, mv_return, algorithm_name, env_name):
    episodes_list = list(range(len(return_list)))
    plt.plot(episodes_list, return_list, c='gray', alpha=0.6)
    plt.plot(episodes_list, mv_return)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('{} on {}'.format(algorithm_name, env_name))
    plt.show()



if __name__ == "__main__":

    # reproducible
    seed_number = 0
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)

    num_episodes = 150     # episodes length
    hidden_dim = 128        # hidden layers dimension
    gamma = 0.98            # discounted rate
    actor_lr = 1e-4         # lr of actor
    critic_lr = 1e-3        # lr of critic
    alpha_lr = 1e-4
    tau = 0.005             # soft update parameter
    buffer_size = 10000
    minimal_size = 1000
    batch_size = 64

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    env_name = 'Pendulum-v1'

    render = False
    if render:
        env = gym.make(id=env_name, render_mode='human')
    else:
        env = gym.make(id=env_name)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]  
    action_bound = env.action_space.high[0]
    # target entropy is set to the negative of the action-space dimension
    target_entropy = - env.action_space.shape[0]

    replaybuffer = ReplayBuffer(buffer_size)
    agent = SAC_Continuous(state_dim, hidden_dim, action_dim, action_bound, actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma, device)
    return_list = train_off_policy_agent(env, agent, num_episodes, replaybuffer, minimal_size, batch_size, render, seed_number)

    mv_return = moving_average(return_list, 9)
    plot_curve(return_list, mv_return, 'SAC', env_name)

The return curve obtained is as follows:

[Figure: training return curve of SAC on Pendulum-v1]

2.2 SAC for discrete action spaces

Using the discrete-action CartPole-v1 environment from Gymnasium, the code is as follows.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

from tqdm import tqdm
import collections
import random
import numpy as np
import matplotlib.pyplot as plt
import gym

# replay buffer
class ReplayBuffer():
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
    
    def add(self, s, a, r, s_, d):
        self.buffer.append((s, a, r, s_, d))
    
    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return np.array(states), actions, rewards, np.array(next_states), dones
    
    def size(self):
        return len(self.buffer)

# Actor
class PolicyNet_Discrete(nn.Module):
    """The action space follows a discrete distribution; output the probability of each action."""
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet_Discrete, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(in_features=hidden_dim, out_features=action_dim),
            nn.Softmax(dim=1)
        )

    def forward(self, s):
        x = self.fc1(s)
        return self.fc2(x)


# Critic
class QValueNet_Discrete(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet_Discrete, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc2 = nn.Linear(in_features=hidden_dim, out_features=action_dim)
    
    def forward(self, s):
        x = self.fc1(s)
        return self.fc2(x)

# maximize entropy deep reinforcement learning SAC
class SAC_Discrete():
    def __init__(self, state_dim, hidden_dim, action_dim,
                    actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma,
                    device):
        # actor
        self.actor = PolicyNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        # two critics
        self.critic1 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        self.critic2 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        # two target critics
        self.target_critic1 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        self.target_critic2 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        # initialize with same parameters
        self.target_critic1.load_state_dict(self.critic1.state_dict())
        self.target_critic2.load_state_dict(self.critic2.state_dict())
        # specify optimizers
        self.optimizer_actor = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.optimizer_critic1 = torch.optim.Adam(self.critic1.parameters(), lr=critic_lr)
        self.optimizer_critic2 = torch.optim.Adam(self.critic2.parameters(), lr=critic_lr)
        # optimizing log(alpha) instead of alpha itself keeps training stable
        self.log_alpha = torch.tensor(np.log(0.01), dtype=torch.float, requires_grad = True)
        self.optimizer_log_alpha = torch.optim.Adam([self.log_alpha], lr=alpha_lr)

        self.target_entropy = target_entropy
        self.gamma = gamma
        self.tau = tau
        self.device = device
    
    def take_action(self, state):
        state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = Categorical(probs)
        action = action_dist.sample()
        return action.item()
    
    # calculate td target
    def calc_target(self, rewards, next_states, dones):
        next_probs = self.actor(next_states)
        next_log_probs = torch.log(next_probs + 1e-8)
        entropy = -torch.sum(next_probs * next_log_probs, dim=1, keepdim=True)
        q1_values = self.target_critic1(next_states)
        q2_values = self.target_critic2(next_states)
        min_qvalue = torch.sum(next_probs * torch.min(q1_values, q2_values),
                               dim=1,
                               keepdim=True)
        next_value = min_qvalue + self.log_alpha.exp() * entropy
        td_target = rewards + self.gamma * next_value * (1 - dones)
        return td_target

    # soft update method
    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0-self.tau) + param.data * self.tau)
        
    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1,1).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.int64).view(-1,1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1,1).to(self.device)

        # no reward reshaping is needed for CartPole (the (reward + 8) / 8 rescaling used for Pendulum does not apply here)

        # update two Q-value network
        td_target = self.calc_target(rewards, next_states, dones).detach()
        critic1_loss = torch.mean(F.mse_loss(td_target, self.critic1(states).gather(dim=1,index=actions)))
        critic2_loss = torch.mean(F.mse_loss(td_target, self.critic2(states).gather(dim=1,index=actions)))

        self.optimizer_critic1.zero_grad()
        critic1_loss.backward()
        self.optimizer_critic1.step()
        self.optimizer_critic2.zero_grad()
        critic2_loss.backward()
        self.optimizer_critic2.step()

        # update policy network
        probs = self.actor(states)
        log_probs = torch.log(probs + 1e-8)
        entropy = -torch.sum(probs * log_probs, dim=1, keepdim=True)
        q1_value = self.critic1(states)
        q2_value = self.critic2(states)
        min_qvalue = torch.sum(probs * torch.min(q1_value, q2_value),
                               dim=1,
                               keepdim=True)  # expectation over actions computed directly from the probabilities
        actor_loss = torch.mean(-self.log_alpha.exp() * entropy - min_qvalue)
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()

        # update temperature alpha
        alpha_loss = torch.mean((entropy - self.target_entropy).detach() * self.log_alpha.exp())
        self.optimizer_log_alpha.zero_grad()
        alpha_loss.backward()
        self.optimizer_log_alpha.step()

        # soft update target Q-value network
        self.soft_update(self.critic1, self.target_critic1)
        self.soft_update(self.critic2, self.target_critic2)


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size, render, seed_number):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d'%(i+1)) as pbar:
            for i_episode in range(int(num_episodes/10)):
                observation, _ = env.reset(seed=seed_number)
                done = False
                episode_return = 0

                while not done:
                    if render:
                        env.render()
                    action = agent.take_action(observation)
                    observation_, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    replay_buffer.add(observation, action, reward, observation_, done)
                    # swap states
                    observation = observation_
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'rewards': b_r,
                            'next_states': b_ns,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if(i_episode+1) % 10 == 0:
                    pbar.set_postfix({
                        'episode': '%d'%(num_episodes/10 * i + i_episode + 1),
                        'return': "%.3f"%(np.mean(return_list[-10:]))
                    })
                pbar.update(1)
    env.close()
    return return_list

def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0)) 
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size-1, 2)
    begin = np.cumsum(a[:window_size-1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))

def plot_curve(return_list, mv_return, algorithm_name, env_name):
    episodes_list = list(range(len(return_list)))
    plt.plot(episodes_list, return_list, c='gray', alpha=0.6)
    plt.plot(episodes_list, mv_return)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('{} on {}'.format(algorithm_name, env_name))
    plt.show()



if __name__ == "__main__":

    # reproducible
    seed_number = 0
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)

    num_episodes = 200    # episodes length
    hidden_dim = 128        # hidden layers dimension
    gamma = 0.98            # discounted rate
    actor_lr = 1e-3         # lr of actor
    critic_lr = 1e-2        # lr of critic
    alpha_lr = 1e-2
    tau = 0.005             # soft update parameter
    buffer_size = 10000
    minimal_size = 500
    batch_size = 64


    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    env_name = 'CartPole-v1'

    render = False
    if render:
        env = gym.make(id=env_name, render_mode='human')
    else:
        env = gym.make(id=env_name)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n  

    # target entropy is initialized to -1
    target_entropy = -1

    replaybuffer = ReplayBuffer(buffer_size)
    agent = SAC_Discrete(state_dim, hidden_dim, action_dim, actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma, device)
    return_list = train_off_policy_agent(env, agent, num_episodes, replaybuffer, minimal_size, batch_size, render, seed_number)

    mv_return = moving_average(return_list, 9)
    plot_curve(return_list, mv_return, 'SAC', env_name)

Reference

Hands on RL

SAC (Soft Actor-Critic) reading notes

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
