算法SAC
- 基于动态规划的贝尔曼方城如下所示:
则,基于最大熵的软贝尔曼方程可以描述为如下的形式:
可以这么理解soft贝尔曼方程,就是在原有的贝尔曼方程的基础上添加了一个熵项。
另外一个角度理解soft-贝尔曼方程:
首先,将熵项作为奖励函数的一部分,写为如下的形式:
然后将这个 r s o f t r_{soft} rsoft带入到贝尔曼方程中去,然后通过进一步的转化形式,就可以得到可以得到如下的形式:
Q s o f t ( s t , a t ) = r ( s t , a t ) + γ α E s t + 1 ∼ ρ H ( π ( ⋅ ∣ s t + 1 ) ) + γ E s t + 1 , a t + 1 [ Q s o f t ( s t + 1 , a t + 1 ) ] = r ( s t , a t ) + γ E s t + 1 ∼ ρ , a t + 1 ∼ π [ Q s o f t ( s t + 1 , a t + 1 ) ] + γ α E s t + 1 ∼ ρ H ( π ( ⋅ ∣ s t + 1 ) = r ( s t , a t ) + γ E s t + 1 ∼ ρ , a t + 1 ∼ π [ Q s o f t ( s t + 1 , a t + 1 ) ] + γ E s t + 1 ∼ ρ E a t + 1 ∼ π [ − α log [ π ( a t + 1 ∣ s t + 1 ) ) = r ( s t , a t ) + γ E s t + 1 ∼ ρ [ E a t + 1 ∼ π [ Q s o f t ( s t + 1 , a t + 1 ) α log ( π ( a t + 1 ∣ s t + 1 ) ) ] ] = r ( s t , a t ) + γ E s t + 1 , a t + 1 [ Q s o f t ( s t + 1 , a t + 1 ) − α log [ π ( a t + 1 ∣ s t + 1 ) ) ] \begin{aligned} Q_{s o f t}\left(s_{t}, a_{t}\right) & =r\left(s_{t}, a_{t}\right)+\gamma \alpha \mathbb{E}_{s_{t+1} \sim \rho} H\left(\pi\left(\cdot \mid s_{t+1}\right)\right)+\gamma \mathbb{E}_{s_{t+1}, a_{t+1}}\left[Q_{s o f t}\left(s_{t+1}, a_{t+1}\right)\right] \\ & =r\left(s_{t}, a_{t}\right)+\gamma \mathbb{E}_{s_{t+1} \sim \rho, a_{t+1} \sim \pi}\left[Q_{s o f t}\left(s_{t+1}, a_{t+1}\right)\right]+\gamma \alpha \mathbb{E}_{s_{t+1} \sim \rho} H\left(\pi\left(\cdot \mid s_{t+1}\right)\right. \\ & =r\left(s_{t}, a_{t}\right)+\gamma \mathbb{E}_{s_{t+1} \sim \rho, a_{t+1} \sim \pi}\left[Q_{s o f t}\left(s_{t+1}, a_{t+1}\right)\right]+\gamma \mathbb{E}_{s_{t+1} \sim \rho} \mathbb{E}_{a_{t+1} \sim \pi}[-\alpha \operatorname{\log \left[\pi\left(a_{t+1} \mid s_{t+1}\right)\right)} \\ & =r\left(s_{t}, a_{t}\right)+\gamma \mathbb{E}_{s_{t+1} \sim \rho}\left[\mathbb{E}_{a_{t+1} \sim \pi}\left[Q_{s o f t}\left(s_{t+1}, a_{t+1}\right) \operatorname{\alpha } \log \left(\pi\left(a_{t+1} \mid s_{t+1}\right)\right)\right]\right] \\ & =r\left(s_{t}, a_{t}\right)+\gamma \mathbb{E}_{s_{t+1}, a_{t+1}}\left[Q_{s o f t}\left(s_{t+1}, a_{t+1}\right)-\alpha \log \left[\pi\left(a_{t+1} \mid s_{t+1}\right)\right)\right] \end{aligned} Qsoft(st,at)=r(st,at)+γαEst+1∼ρH(π(⋅∣st+1))+γEst+1,at+1[Qsoft(st+1,at+1)]=r(st,at)+γEst+1∼ρ,at+1∼π[Qsoft(st+1,at+1)]+γαEst+1∼ρH(π(⋅∣st+1)=r(st,at)+γEst+1∼ρ,at+1∼π[Qsoft(st+1,at+1)]+γEst+1∼ρEat+1∼π[−αlog[π(at+1∣st+1))=r(st,at)+γEst+1∼ρ[Eat+1∼π[Qsoft(st+1,at+1)αlog(π(at+1∣st+1))]]=r(st,at)+γEst+1,at+1[Qsoft(st+1,at+1)−αlog[π(at+1∣st+1))]
会得出同样的结论。
看上公式的最后一项,以及根据Q和v的关系函数:
Q s o f t ( s t , a t ) = r ( s t , a t ) + γ E s t + 1 , a t + 1 [ Q s o f t ( s t + 1 , a t + 1 ) − α log [ π ( a t + 1 ∣ s t + 1 ) ) ] \begin{aligned} Q_{s o f t}\left(s_{t}, a_{t}\right) & =r\left(s_{t}, a_{t}\right)+\gamma \mathbb{E}_{s_{t+1}, a_{t+1}}\left[Q_{s o f t}\left(s_{t+1}, a_{t+1}\right)-\alpha \log \left[\pi\left(a_{t+1} \mid s_{t+1}\right)\right)\right] \end{aligned} Qsoft(st,at)=r(st,at)+γEst+1,at+1[Qsoft(st+1,at+1)−αlog[π(at+1∣st+1))]
Q ( s t , a t ) = r ( s t , a t ) + γ E s t + 1 ∼ ρ [ V ( s t + 1 ) ] Q\left(s_t, a_t\right)=r\left(s_t, a_t\right)+\gamma \mathbb{E}_{s_{t+1} \sim \rho}\left[V\left(s_{t+1}\right)\right] Q(st,at)=r(st,at)+γEst+1∼ρ[V(st+1)]
对比一下上面的这两个公式,也就是等号右侧应该是相等的,因此,软值函数形式为: