Table of Contents
- Markov Decision Process
- Definition
- Related Concepts
- Policy for MDP Agent
- Definition
- Judgement for Policy
- Value Functions
- TD formula for value functions
- Relation of V and Q
- Policy Criterion
- Policy Improvement Theorem
- Optimal Policy
- Reinforcement Learning
- Fundamental RL Algorithms: Policy Iteration
- The Whole Procedure
- Convergence Analysis
- Fundamental RL Algorithms: Value Iteration
- Iterative ways to derive $Q^{*}$
- The Whole Procedure
- Convergence Analysis
References:
1. Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo (eds.)
2. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Next post:
https://blog.csdn.net/Jinyindao243052/article/details/107051507
Markov Decision Process
Definition
Suppose there is an interactive system consisting of an environment and an agent.
Its running process is as follows. At the beginning of each step $t=0,1,\dots$, the agent knows the environment state $s_t$. It then chooses an action $a_t\in \mathcal{A}$ that is available in state $s_t$ and applies it. After action $a_t$ is applied, the environment transitions to a new state $s_{t+1}$ according to a probability distribution $T(s_t,a_t)$ over $\mathcal{S}$, generates a reward $r_t=R(s_t,a_t,s_{t+1})$, and reports its new state and the reward to the agent.
Here $\mathcal{S}$ is the set of possible environment states, called the state space. $\mathcal{A}$ is the set of possible agent actions, called the action space. $T: \mathcal{S} \times \mathcal{A}\rightarrow \mathcal{P}(\mathcal{S})$ is called the transition function, where $\mathcal{P}(\mathcal{S})$ is the set of probability distributions over $\mathcal{S}$. The probability of ending up in state $s'$ after taking an applicable action $a$ in state $s$ is denoted $T(s,a,s')$. $R: \mathcal{S}\times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is called the reward function; the reward received for ending up in state $s'$ after taking action $a$ in state $s$ is $R(s,a,s')$.
Note that the probability distribution of the next state $s_{t+1}$ is determined only by $s_t$ and $a_t$, and is not directly influenced by earlier states or actions. Likewise, the reward $r_t$ is determined by $s_t$, $a_t$, and $s_{t+1}$, and is not directly influenced by earlier states, actions, or rewards. This property is called the Markov property.
In the theoretical analysis, we assume the process never terminates.
The discrete-time running process of the foregoing system is called a Markov Decision Process.
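To make the tuple concrete, here is a minimal Python sketch (not from the original text) of how a finite MDP could be stored; the class name `MDP`, the field names `states`, `actions`, `T`, `R`, `gamma`, and the two-state toy example are all illustrative assumptions.

```python
# A minimal finite-MDP container; all names here are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int

@dataclass
class MDP:
    states: List[State]                          # state space S
    actions: List[Action]                        # action space A
    T: Dict[Tuple[State, Action, State], float]  # T(s, a, s'): transition probability
    R: Dict[Tuple[State, Action, State], float]  # R(s, a, s'): reward
    gamma: float                                 # discount rate in (0, 1)

# Toy two-state example: action 0 stays in the current state, action 1 switches state.
mdp = MDP(
    states=[0, 1],
    actions=[0, 1],
    T={(s, a, (s if a == 0 else 1 - s)): 1.0 for s in [0, 1] for a in [0, 1]},
    R={(s, a, s2): float(s2) for s in [0, 1] for a in [0, 1] for s2 in [0, 1]},
    gamma=0.9,
)
```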
Related Concepts
Discounted Cumulative Reward: The discounted cumulative reward of the process from step $t$ is defined as $G_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, where $\gamma \in (0,1)$ is a parameter called the discount rate.
An MDP is usually represented by a tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$, where $\mathcal{S}$ is its state space, $\mathcal{A}$ its action space, $T$ its transition function, $R$ its reward function, and $\gamma$ its discount rate.
History: The sequence $(s_0,a_0,s_1,a_1,\dots)$ is called the history of the MDP process (denoted $\tau$). The probability of $(s_t, a_t, \dots, s_{t'})$ given $s_t$ and $a_t,\dots,a_{t'-1}$ is $\prod_{i=t}^{t'-1}T(s_i,a_i,s_{i+1})$.
Trajectory: The sequence $(s_0,a_0,r_0,s_1,a_1,r_1,\dots)$ is called the trajectory of the MDP process (denoted $\mathcal{E}$).
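As a quick illustration of $G_t$, the following hedged snippet computes a truncated discounted cumulative reward from a finite list of rewards; the function name `discounted_return` and its arguments are assumptions for illustration only.

```python
def discounted_return(rewards, gamma, t=0):
    """Truncated estimate of G_t = sum_k gamma^k * r_{t+k} from a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Example: rewards r_0..r_3 with gamma = 0.9 gives 1 + 0.9^2 + 0.9^3.
print(discounted_return([1.0, 0.0, 1.0, 1.0], gamma=0.9))
```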
Policy for MDP Agent
Definition
Policy: A deterministic policy $\pi$ is a function $\pi: \mathcal{S}\rightarrow \mathcal{A}$, which tells the agent which action to take in a particular state.
A stochastic policy $\pi$ is a function $\pi: \mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions over $\mathcal{A}$. It tells the agent the probability of taking each action in a particular state. The probability of action $a$ under $\pi(s)$ is denoted $\pi(a|s)$.
Adopting a deterministic policy $\pi$ is equivalent to adopting the stochastic policy $\tilde{\pi}$ with
$$\tilde{\pi}(a|s)=\begin{cases} 1, & a=\pi(s)\\ 0, & a\neq \pi(s) \end{cases},\quad s\in\mathcal{S}$$
Therefore, we only discuss stochastic policies in the following.
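The equivalence above can be sketched in code: a deterministic policy (represented here, as an assumption, by a dict mapping states to actions) is lifted to the one-hot stochastic policy $\tilde\pi$.

```python
def to_stochastic(det_policy, actions):
    """Lift a deterministic policy {s: a} to a one-hot stochastic policy {s: {a: prob}}."""
    return {s: {a: (1.0 if a == det_policy[s] else 0.0) for a in actions}
            for s in det_policy}

pi_det = {0: 1, 1: 0}                     # pi(s)
pi_tilde = to_stochastic(pi_det, [0, 1])  # pi_tilde(a|s)
```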
Judgement for Policy
Value Functions
State Value Function (V function): The state value function of policy $\pi$ is defined as $V^{\pi}: \mathcal{S}\rightarrow \mathbb{R}$,
$$V^{\pi}(s)=\mathbb{E}_{\tau |s_{0}=s, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau |s_{0}=s, \pi}\{G_{0}\},\quad s\in \mathcal{S}$$
where $\tau=(s_{0}, a_{0}, s_{1}, a_{1}, \dots)$ is the sequence of state-action transitions. $V^{\pi}(s)$ is the expected discounted cumulative reward of the process when $s_0=s$ and policy $\pi$ is adopted.
For any sequence $[S_0,A_0,S_1,\dots]$ with $S_0=s$, $S_i\in \mathcal{S}$, $A_i\in \mathcal{A}$:
$$p(\tau|_{s_0=s,\pi}=[S_0,A_0,S_1,\dots])=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$
$$p(\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots])=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$
where $\tau_t=(s_t,a_t,s_{t+1},a_{t+1},\dots)$ denotes the sub-sequence from step $t$.
Therefore the probability distributions of the sequences $\tau|_{s_0=s,\pi}$ and $\tau_t|_{s_t=s,\pi}$ are the same. Moreover, when $\tau|_{s_0=s, \pi}=\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]$, we have $G_0=G_t=\sum_{i=0}^{\infty}\gamma^{i}R(S_i,A_i,S_{i+1})$.
Therefore $\mathbb{E}_{\tau|s_{0}=s, \pi}\{G_{0}\}=\mathbb{E}_{\tau_t|s_{t}=s, \pi}\{G_{t}\}$, so $V^{\pi}(s)=\mathbb{E}_{\tau_t|s_{t}=s,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ under the condition that $s_t=s$ and $\pi$ is followed afterwards.
State-Action Value Function (Q function): The state-action value function of policy $\pi$ is defined as $Q^{\pi}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$,
$$Q^{\pi}(s,a)=\mathbb{E}_{\tau |s_{0}=s, a_{0}=a, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau |s_{0}=s, a_{0}=a, \pi}\{G_0\},\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
$Q^{\pi}(s,a)$ is the expected discounted cumulative reward of the process from step $0$ when $s_0=s$, $a_0=a$, and policy $\pi$ is adopted. Similarly, $Q^{\pi}(s,a)=\mathbb{E}_{\tau_t|s_{t}=s, a_{t}=a,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ under the condition that $s_t=s$, $a_t=a$ and $\pi$ is followed afterwards.
TD formula for value functions
- The TD($t$) ($t=1,2,\dots$) formula for the V function is $V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\big]$, $s\in \mathcal{S}$.
Proof.
$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s, \pi}\,\mathbb{E}_{(a_{k+t},\dots)|(s_k,\dots,s_{k+t}), s_k=s,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s, \pi}\big[V^{\pi}(s_{k+t})\big]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big] \end{aligned}$$
- The TD($t$) ($t=1,2,\dots$) formula for the Q function is $Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\big]$, $s\in \mathcal{S}$, $a\in \mathcal{A}$.
Proof.
$$\begin{aligned} Q^{\pi}(s,a)&=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t}, a_{k+t})|s_k=s, a_k=a, \pi}\,\mathbb{E}_{(s_{k+t+1},\dots)|(s_k,\dots,s_{k+t}, a_{k+t}), s_k=s,a_k=a,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t}, a_{k+t})|s_k=s, a_k=a, \pi}\big[Q^{\pi}(s_{k+t}, a_{k+t})\big]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t}, a_{k+t})\Big] \end{aligned}$$
Corollary.
(1)
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big]$$
This is called the Bellman Equation for the V function.
The proof follows directly from the TD(1) formula for V and is omitted.
(2)
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\Big[R(s,a,s')+\gamma \sum_{a'} \pi(a'|s')\, Q^{\pi}(s',a')\Big]$$
This is called the Bellman Equation for the Q function.
The proof follows directly from the TD(1) formula for Q and is omitted.
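For reference, here is a hedged sketch of the two Bellman equations as one-step backups over an explicit model; it assumes the dictionary-based `MDP` container from the earlier sketch, with `V`, `Q`, and `pi` as dictionaries of current estimates (all names illustrative).

```python
def bellman_V(s, pi, V, mdp):
    """Right-hand side of the Bellman equation for the V function."""
    return sum(
        pi[s][a] * mdp.T.get((s, a, s2), 0.0)
        * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
        for a in mdp.actions for s2 in mdp.states
    )

def bellman_Q(s, a, pi, Q, mdp):
    """Right-hand side of the Bellman equation for the Q function."""
    return sum(
        mdp.T.get((s, a, s2), 0.0)
        * (mdp.R.get((s, a, s2), 0.0)
           + mdp.gamma * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in mdp.actions))
        for s2 in mdp.states
    )
```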
Relation of V and Q
The relation between the $V$ and $Q$ functions is given by the following theorems.
Theorem 1.
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\,Q^{\pi}(s,a),\quad s\in \mathcal{S}$$
Proof.
$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau |s_{0}=s, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}\\ &=\mathbb{E}_{a_0|s_0=s, \pi}\,\mathbb{E}_{\tau |s_{0}=s, a_0, \pi}\{G_{0}\}\\ &=\mathbb{E}_{a_0|s_0=s, \pi}\{Q^{\pi}(s,a_0)\}\\ &=\sum_{a}\pi(a|s)\,Q^{\pi}(s,a) \end{aligned}$$
Theorem 2. For $t=1,2,\dots$, it holds that
$$V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t}, a_{k+t})\Big],\quad s\in \mathcal{S}$$
$$Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big],\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
The proof is similar to the foregoing proofs and is omitted.
Corollary.
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big],\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
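A short sketch of the two directions of the V-Q relation (Theorem 1 and the corollary above), under the same assumed dictionary representation:

```python
def V_from_Q(s, pi, Q, mdp):
    """V(s) = sum_a pi(a|s) Q(s, a)   (Theorem 1)."""
    return sum(pi[s][a] * Q[(s, a)] for a in mdp.actions)

def Q_from_V(s, a, V, mdp):
    """Q(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma V(s')]   (Corollary)."""
    return sum(mdp.T.get((s, a, s2), 0.0)
               * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
               for s2 in mdp.states)
```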
Policy Criterion
A policy $\pi$ is said to be not worse than a policy $\pi'$ (written $\pi \geq \pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in \mathcal{S}$.
A policy $\pi$ is said to be strictly better than a policy $\pi'$ (written $\pi > \pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in \mathcal{S}$ and there exists a state $s_0$ with $V^{\pi}(s_0)> V^{\pi'}(s_0)$.
Policy Improvement Theorem
Theorem. For any policies $\pi$ and $\tilde\pi$ satisfying
$$V^{\pi}(s) \leq \sum_{a}\tilde\pi(a|s)\,Q^{\pi}(s,a),\quad s\in \mathcal{S},$$
it holds that $\tilde\pi\geq \pi$.
Proof.
$$V^{\pi}(s)=\mathbb{E}_{s_0|s_0=s,\tilde\pi}\, V^{\pi}(s_0)$$
$$\begin{aligned} \mathbb{E}_{s_0|s_0=s,\tilde\pi} V^{\pi}(s_0)&=\mathbb{E}_{s_0|s_0=s,\tilde\pi} \Big[\sum_{a}\pi(a|s_0)Q^{\pi}(s_0,a)\Big]\\ &\leq \mathbb{E}_{s_0|s_0=s,\tilde\pi}\Big[\sum_{a}\tilde\pi(a|s_0)Q^{\pi}(s_0,a)\Big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\big[Q^{\pi}(s_0,a_0)\big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\Big[\sum_{s'}T(s_0, a_0,s')\big(R(s_0, a_0,s')+\gamma V^{\pi}(s')\big)\Big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\,\mathbb{E}_{s_1|(s_0,a_0),s_0=s,\tilde\pi}\big[r_0+\gamma V^{\pi}(s_1)\big]\\ &=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}[r_0]+\gamma\,\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}\big[ V^{\pi}(s_1)\big] \end{aligned}$$
For $i=0,1,\dots$, it can be derived in a similar way that
$$\mathbb{E}_{(s_0,\dots,s_i)|s_0=s,\tilde\pi}\big[ V^{\pi}(s_{i})\big]\leq \mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}[r_i]+\gamma\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i+1})\big]$$
Then it is easily derived that for $i=0,1,\dots$,
$$\begin{aligned} V^{\pi}(s) &\leq \sum_{l=0}^{i}\mathbb{E}_{(s_0,\dots,s_{l+1})|s_0=s, \tilde\pi}\big[\gamma^{l} r_{l}\big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\big[V^{\pi}(s_{i+1})\big]\\ &=\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\Big[\sum_{l=0}^{i}\gamma^{l}r_{l}\Big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\big[V^{\pi}(s_{i+1})\big] \end{aligned}$$
Letting $i\rightarrow \infty$ gives $V^{\pi}(s)\leq \mathbb{E}_{\tau|s_0=s, \tilde\pi}\big[\sum_{l=0}^{\infty}\gamma^{l}r_{l}\big]=V^{\tilde\pi}(s)$.
Therefore, $\tilde\pi\geq \pi$.
Corollary.
For any policy $\pi$, define $\tilde\pi$ by
$$\tilde\pi(a|s)= \begin{cases} 1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a')\\ 0, & a\neq\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a') \end{cases},\quad s\in \mathcal{S}$$
Then it holds that $\tilde\pi\geq \pi$.
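This corollary is exactly the greedy-improvement step used later in policy iteration; a minimal sketch, assuming `Q` is a dict keyed by `(s, a)` and `mdp` is the container from the earlier sketch:

```python
def greedy_policy(Q, mdp):
    """pi_tilde(s) = argmax_a Q(s, a), returned as a deterministic policy dict."""
    return {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}
```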
Optimal Policy
A policy $\pi^{*}$ is called an optimal policy if $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $s\in \mathcal{S}$ and for every policy $\pi \neq \pi^{*}$. (In the following, $V^{\pi^{*}}(s)$ and $Q^{\pi^{*}}(s,a)$ are written as $V^{*}(s)$ and $Q^{*}(s,a)$.)
The relation between $V^{*}$ and $Q^{*}$ is given as follows:
$$V^{*}(s)=\max_{a\in\mathcal{A}}Q^{*}(s,a),\quad s\in \mathcal{S}$$
An optimal policy is given by:
$$\pi^{*}(a|s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\\ 0, & a\neq \mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\end{cases},\quad s\in \mathcal{S}$$
Proof.
If this did not hold, there would exist a better policy according to the corollary of the policy improvement theorem.
Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
Reinforcement Learning
Reinforcement Learning Algorithm: Algorithms that seek the optimal policy for MDPs are collectively known as reinforcement learning algorithms. We will introduce several reinforcement learning algorithms in the following content.
Fundamental RL Algorithms: Policy Iteration
We now introduce an RL algorithm called policy iteration which, at each iteration, produces a policy that is not worse than the previous one and ultimately converges to an optimal policy. It applies to finite MDPs (MDPs with finite state and action spaces).
Recall that, for a deterministic policy $\pi$,
$$Q^{\pi}(s,a)=\sum_{s'}T(s, a, s')\big( R(s, a, s')+\gamma\, Q^{\pi}(s',\pi(s')) \big)$$
This suggests an iterative way to update an estimate of the $Q$ function:
$$\hat{Q}^{\pi}(s,a)\leftarrow\sum_{s'}T(s, a, s')\big( R(s, a, s')+\gamma\, \hat{Q}^{\pi}(s',\pi(s')) \big)$$
The Whole Procedure
Require: initial deterministic policy $\pi$
Construct a function $\hat{Q}^{\pi}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $Q^{\pi}$.
repeat // policy evaluation
  $\Delta=0$
  for $s \in \mathcal{S}$, $a\in \mathcal{A}$ do
    $q=\hat{Q}^{\pi}(s,a)$
    $\hat{Q}^{\pi}(s,a)=\sum\limits_{s'}T(s, a, s')\big( R(s, a, s')+\gamma \hat{Q}^{\pi}(s',\pi(s')) \big)$
    $\Delta=\max(\Delta, |q - \hat{Q}^{\pi}(s,a)|)$
until $\Delta<\sigma$
policy-stable = true
for $s \in \mathcal{S}$ do // policy improvement
  $b=\pi(s)$
  $\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}\hat{Q}^{\pi}(s,a)$
  if $b\neq \pi(s)$ then policy-stable = false
if policy-stable == false then go to policy evaluation, else return $\pi$.
Policy Evaluation
Update $\hat{Q}^{\pi}(s,a)$ in the aforementioned iterative way based on the Bellman equation.
Policy Improvement
Update $\pi(s)$ in a greedy way:
$$\pi(s) =\mathop{\mathrm{argmax}}\limits_{a} \hat{Q}^{\pi}(s,a)$$
*Note: Policy-stable means that the greedy policy did not change in this iteration, that is, $\pi_{k+1}$ is as good as $\pi_{k}$.
Reference: Wiering M., van Otterlo M. Reinforcement Learning: State of the Art. Springer, 2012, p. 22.
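Putting the two phases together, the following is a hedged Python sketch of the whole procedure above; it assumes the dictionary-based MDP container from the earlier sketches, and `sigma` plays the role of the evaluation tolerance $\sigma$.

```python
def policy_iteration(mdp, sigma=1e-8):
    """Policy iteration sketch: tabular policy evaluation + greedy improvement."""
    pi = {s: mdp.actions[0] for s in mdp.states}  # arbitrary initial deterministic policy
    Q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}
    while True:
        # Policy evaluation: sweep Bellman backups until the largest change is below sigma.
        while True:
            delta = 0.0
            for s in mdp.states:
                for a in mdp.actions:
                    q_old = Q[(s, a)]
                    Q[(s, a)] = sum(
                        mdp.T.get((s, a, s2), 0.0)
                        * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * Q[(s2, pi[s2])])
                        for s2 in mdp.states)
                    delta = max(delta, abs(q_old - Q[(s, a)]))
            if delta < sigma:
                break
        # Policy improvement: act greedily with respect to the evaluated Q.
        policy_stable = True
        for s in mdp.states:
            b = pi[s]
            pi[s] = max(mdp.actions, key=lambda a: Q[(s, a)])
            if b != pi[s]:
                policy_stable = False
        if policy_stable:
            return pi, Q
```

For instance, calling `policy_iteration(mdp)` on the toy two-state MDP sketched earlier returns the greedy policy together with its Q estimate.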
Convergence Analysis
It can be proved that policy iteration generates a monotonically improving sequence of policies. Let $\pi'$ denote the greedy policy obtained from $\pi$ in the improvement step; then for every $s\in\mathcal{S}$:
$$\begin{aligned} V^{\pi}(s)=Q^{\pi}(s,\pi(s)) & \leq Q^{\pi}(s,\pi'(s))\\ & = \sum_{s'}T(s,\pi'(s),s')\big[R(s,\pi'(s),s')+\gamma V^{\pi}(s')\big]\\ &=\mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\big\{r_0 + \gamma V^{\pi}(s_{1})\big\}\\ &\leq \mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\big\{r_0 + \gamma Q^{\pi}(s_{1}, \pi'(s_1))\big\}\\ &= \mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\Big\{r_0 + \gamma \sum_{s'}T(s_1,\pi'(s_1),s')\big[R({s_1}, \pi'(s_1),s')+\gamma V^{\pi}(s')\big]\Big\}\\ &=\mathbb{E}_{(s_0, a_0, s_1)|s_0=s, \pi'}\big\{r_0+\gamma\, \mathbb{E}_{(a_1, s_2)|s_1, \pi'}\{r_1+\gamma V^{\pi}(s_2)\}\big\}\\ &=\mathbb{E}_{(s_0, a_0, s_1, a_1, s_2)| s_0=s, \pi'}\big\{r_0 + \gamma r_1+\gamma^{2} V^{\pi}(s_2)\big\}\\ & \leq \dots \\ & \leq \mathbb{E}_{(s_{0},a_{0},\dots,s_{n-1},a_{n-1},s_{n})|s_{0}=s, \pi'} \Big\{\sum_{t=0}^{n-1}\gamma^{t} r_{t}+\gamma^{n}V^{\pi}(s_{n})\Big\}\\ & \leq \dots \end{aligned}$$
Therefore, $V^{\pi}(s)\leq \mathbb{E}_{\tau | s_{0} = s, \pi'}\big\{\sum_{t=0}^{\infty}\gamma^{t} r_{t}\big\} = V^{\pi'}(s)$ for all $s \in \mathcal{S}$.
Because a finite MDP has only a finite number of policies, π \pi π must converge in a finite number of iterations.
Suppose $\pi$ converges to $\pi^{*}$. Then it is easily derived from the stopping criterion that:
$$\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma V^{*}(s')\big),\quad \forall s\in \mathcal{S}$$
Then the Bellman optimality equation holds.
- Each policy $\pi_{k+1}$ is a strictly better policy than $\pi_{k}$, until the algorithm stops.
- The policy iteration algorithm completely separates the evaluation and improvement phases.
- Although policy iteration computes the optimal policy for a given MDP in finite time, it is relatively inefficient. In particular the first step, the policy evaluation step, is computationally expensive.
Fundamental RL Algorithms: Value Iteration
From the preceding content we know that $\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$. Therefore, once $Q^{*}(s,a)$ is known, $\pi^{*}(s)$ is known as well. In this section we present an algorithm that computes $Q^{*}(s,a)$ iteratively: value iteration.
Iterative ways to derive $Q^{*}$
From the relation between $Q^{*}$ and $V^{*}$,
$$Q^{*}(s, a) = \sum_{s' \in \mathcal{S}} T(s, a, s') \big[R(s, a, s') + \gamma V^{*}(s')\big]$$
two iterative schemes for updating an estimate of $Q^{*}$ can be derived:
(1)
$$\hat{Q}^{*}(s, a)\leftarrow \sum_{s'}T(s, a, s')\big(R(s, a, s')+\gamma \hat{V}^{*}(s')\big)$$
$$\hat{V}^{*}(s)\leftarrow\max_{a}\hat{Q}^{*}(s, a)$$
(2)
$$\hat{Q}^{*}(s, a) \leftarrow \sum_{s'} T(s, a, s') \Big( R(s, a, s') + \gamma \max_{a'} \hat{Q}^{*}(s', a') \Big)$$
Given state $s$, action $a$, reward $r$ and next state $s'$, it is possible to approximate $Q^{*}(s, a)$ by iteratively solving the Bellman recurrence equation
$$Q_{i+1}(s,a)=\mathbb{E}_{s'}\big[r+\gamma\max_{a'}Q_{i}(s',a')\big]$$
Quoted from: Multiagent cooperation and competition with deep reinforcement learning.
The value iteration algorithm computes $Q^{*}$ based on these two update schemes.
The Whole Procedure
Version 1:
Construct a function $\hat{Q}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $Q^{*}$. Construct a function $\hat{V}: \mathcal{S}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $V^{*}$.
repeat
  $\Delta=0$
  for $s \in \mathcal{S}$ do
    $v=\hat{V}(s)$
    for $a\in \mathcal{A}$ do
      $\hat{Q}(s,a)=\sum\limits_{s'}T(s, a, s')\big( R(s, a, s')+\gamma \hat{V}(s') \big)$ // update $\hat{Q}(s,a)$ with the first scheme
    $\hat{V}(s)=\max_{a}\hat{Q}(s,a)$
    $\Delta=\max(\Delta, |v - \hat{V}(s)|)$
until $\Delta<\sigma$
Version 2:
Construct a function $\hat{Q}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $Q^{*}$.
repeat
  $\Delta=0$
  for $s \in \mathcal{S}$, $a\in \mathcal{A}$ do
    $q=\hat{Q}(s,a)$
    $\hat{Q}(s,a)=\sum\limits_{s'}T(s, a, s')\big( R(s, a, s')+\gamma \max\limits_{a'}\hat{Q}(s',a') \big)$ // update $\hat{Q}(s,a)$ with the second scheme
    $\Delta=\max(\Delta, |q - \hat{Q}(s,a)|)$
until $\Delta<\sigma$
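A corresponding hedged sketch of Version 2 (the $Q$-only update), under the same assumed dictionary-based MDP container:

```python
def value_iteration(mdp, sigma=1e-8):
    """Value iteration sketch: Q(s,a) <- sum_s' T * (R + gamma * max_a' Q(s',a'))."""
    Q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}
    while True:
        delta = 0.0
        for s in mdp.states:
            for a in mdp.actions:
                q_old = Q[(s, a)]
                Q[(s, a)] = sum(
                    mdp.T.get((s, a, s2), 0.0)
                    * (mdp.R.get((s, a, s2), 0.0)
                       + mdp.gamma * max(Q[(s2, a2)] for a2 in mdp.actions))
                    for s2 in mdp.states)
                delta = max(delta, abs(q_old - Q[(s, a)]))
        if delta < sigma:
            break
    # Greedy policy with respect to the converged Q estimate.
    pi = {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}
    return pi, Q
```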
Convergence Analysis
$V_0 \rightarrow Q_1 \rightarrow V_1 \rightarrow Q_2 \rightarrow V_2 \rightarrow Q_3 \rightarrow V_3 \rightarrow Q_4 \rightarrow V_4 \rightarrow \dots \rightarrow V^{*}$. Value iteration is guaranteed to converge in the limit to $V^{*}$, and then $\pi^{*}(s)= \pi_{\mathrm{greedy}}(V^{*})(s)= \mathop{\mathrm{argmax}}\limits_a Q^{*}(s,a)$.
Reference: Wiering M., van Otterlo M. Reinforcement Learning: State of the Art. Springer, 2012, p. 22.