Painless Reinforcement Learning Notes, Lesson 1


Table of Contents

  • Markov Decision Process
    • Definition
    • Related Concepts
  • Policy for MDP Agent
    • Definition
    • Judgement for Policy
      • Value Functions
      • TD formula for value functions
      • Relation of V and Q
      • Policy Criterion
      • Policy Improvement Theorem
      • Optimal Policy
      • Reinforcement Learning
  • Fundamental RL Algorithms: Policy Iteration
    • The Whole Procedure
    • Convergence Analysis
  • Fundamental RL Algorithms: Value Iteration

Reference books:
1. Reinforcement Learning: State of the Art, Marco Wiering and Martijn van Otterlo
2. Reinforcement Learning: An Introduction, R. S. Sutton and A. G. Barto

Next post:
https://blog.csdn.net/Jinyindao243052/article/details/107051507

Markov Decision Process

Definition

Suppose there is an interactive system, shown below, that consists of an environment and an agent.
(Figure: the agent–environment interaction loop)
Its running process is as follows. At the beginning of step $t=0,1,\dots$, the agent knows the current environment state $s_t$. It then chooses an action $a_t\in \mathcal{A}$ that is available in state $s_t$ and applies it. After action $a_t$ is applied, the environment transitions to a new state $s_{t+1}$ according to a probability distribution $T(s_t,a_t)$ over $\mathcal{S}$, generates a reward $r_t=R(s_t,a_t,s_{t+1})$, and reports the new state and the reward to the agent.

Here $\mathcal{S}$ is the set of possible environment states, called the state space, and $\mathcal{A}$ is the set of possible agent actions, called the action space. $T: \mathcal{S} \times \mathcal{A}\rightarrow \mathcal{P}(\mathcal{S})$ is called the transition function, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over $\mathcal{S}$; the probability of ending up in state $s'$ after applying an applicable action $a$ in state $s$ is denoted $T(s,a,s')$. $R: \mathcal{S}\times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is called the reward function; the reward for ending up in state $s'$ after doing action $a$ in state $s$ is $R(s,a,s')$.

Note that the probability distribution of the next state $s_{t+1}$ is determined only by $s_t$ and $a_t$; it is not directly influenced by any earlier states or actions. Likewise, the reward $r_t$ is determined by $s_t$, $a_t$, and $s_{t+1}$ and is not directly influenced by earlier states, actions, or rewards. This property is called the Markov property.

In the theoretical analysis, we assume the process never terminates.

The discrete-time running process of such a system is called a Markov Decision Process (MDP).
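As a concrete illustration, here is a minimal sketch of how a finite MDP $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$ could be stored as plain arrays. The array layout `T[s, a, s']`, `R[s, a, s']` and the class name `FiniteMDP` are conventions chosen here for illustration, not anything prescribed by the text.

```python
# A minimal sketch of a finite MDP <S, A, T, R, gamma> as plain NumPy arrays.
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    T: np.ndarray       # transition probabilities, shape (S, A, S); each T[s, a, :] sums to 1
    R: np.ndarray       # rewards R(s, a, s'), shape (S, A, S)
    gamma: float        # discount rate in (0, 1)

# Tiny 2-state, 2-action example (the numbers are arbitrary).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.ones((2, 2, 2))                     # reward 1 for every transition
mdp = FiniteMDP(T=T, R=R, gamma=0.9)
assert np.allclose(mdp.T.sum(axis=-1), 1.0)  # each T(s, a, .) is a distribution
```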

Related Concepts

Discounted Cumulative Reward: Define the discounted cumulative reward of the process from step $t$ as $G_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, where $\gamma \in (0,1)$ is a parameter called the discount rate.

An MDP is usually represented by a tuple $\left\langle \mathcal{S}, \mathcal{A}, T, R, \gamma\right\rangle$, where $\mathcal{S}$ is its state space, $\mathcal{A}$ its action space, $T$ its transition function, $R$ its reward function, and $\gamma$ its discount rate.

History: The sequence $(s_0,a_0,s_1,a_1,\dots)$ is defined as the history of the MDP process (denoted as $\tau$). Given $s_t$ and $a_t,\dots,a_{t'-1}$, the probability of the segment $(s_t, a_t, \dots, s_{t'})$ is $\prod_{i=t}^{t'-1}T(s_i,a_i,s_{i+1})$.

Trajectory: The sequence $(s_0,a_0,r_0,s_1,a_1,r_1,\dots)$ is defined as the trajectory of the MDP process (denoted as $\mathcal{E}$).
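To make the trajectory and discounted-return definitions concrete, here is a small sketch that samples a finite prefix of a trajectory under a stochastic policy and computes the truncated return $G_0$. The helper names and the `policy[s, a]` layout are assumptions of this sketch, following the array conventions above.

```python
# A minimal sketch: sample a length-H prefix of a trajectory (s_0, a_0, r_0, s_1, ...)
# under a stochastic policy, then compute the truncated discounted return G_0.
import numpy as np

def sample_trajectory(T, R, policy, s0, horizon, rng=np.random.default_rng()):
    traj = []
    s = s0
    for _ in range(horizon):
        a = rng.choice(T.shape[1], p=policy[s])        # a_t ~ pi(.|s_t)
        s_next = rng.choice(T.shape[2], p=T[s, a])     # s_{t+1} ~ T(s_t, a_t)
        r = R[s, a, s_next]                            # r_t = R(s_t, a_t, s_{t+1})
        traj.append((s, a, r))
        s = s_next
    return traj

def discounted_return(traj, gamma):
    # G_0 = sum_k gamma^k r_k, truncated at the sampled horizon
    return sum(gamma ** k * r for k, (_, _, r) in enumerate(traj))
```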

Policy for MDP Agent

Definition


Policy: A deterministic policy $\pi$ is a function $\pi: \mathcal{S}\rightarrow \mathcal{A}$, which tells the agent the action to take in a particular state.

A stochastic policy $\pi$ is defined as $\pi: \mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions over actions in $\mathcal{A}$. It gives the agent the probability of taking each action in a particular state. Denote the probability of action $a$ under $\pi(s)$ as $\pi(a|s)$.

Adopting a deterministic policy π \pi π is equivalent to adopting a stochastic policy π ~ \tilde{\pi} π~,

$$\tilde{\pi}(a|s)=\begin{cases} 1, & a=\pi(s)\\ 0, & a\neq \pi(s) \end{cases},\quad s\in\mathcal{S}$$

Therefore, we only discuss stochastic policies in what follows.
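The equivalence above can be made concrete with a small helper that turns a deterministic policy into its one-hot stochastic form; the array representation is an assumption of this sketch, not part of the text.

```python
# A minimal sketch of the two policy representations: a deterministic policy as a
# state-indexed action array, and its equivalent stochastic policy pi(a|s) as a
# one-hot row per state.
import numpy as np

def deterministic_to_stochastic(det_policy, n_actions):
    """det_policy[s] = pi(s); returns a matrix whose rows are tilde_pi(.|s)."""
    pi = np.zeros((len(det_policy), n_actions))
    pi[np.arange(len(det_policy)), det_policy] = 1.0   # probability 1 on a = pi(s)
    return pi

det = np.array([1, 0])                 # pi(s0) = a1, pi(s1) = a0
print(deterministic_to_stochastic(det, n_actions=2))
# [[0. 1.]
#  [1. 0.]]
```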

Judgement for Policy

Value Functions

State Value Function (V function): The state value function of policy $\pi$ is defined as $V^{\pi}: \mathcal{S}\rightarrow \mathbb{R}$,

$$V^{\pi}(s)=\mathbb{E}_{\tau |s_{0}=s, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau |s_{0}=s, \pi}\{G_{0}\}, \quad s\in \mathcal{S}$$

where $\tau=(s_{0}, a_{0}, s_{1}, a_{1}, \dots)$ is the sequence of state–action transitions. $V^{\pi}(s)$ is the expected discounted cumulative reward of the process when $s_0=s$ and the agent follows policy $\pi$.

For any sequence $[S_0,A_0,S_1,\dots]$ with $S_0=s$, $S_i\in \mathcal{S}$, $A_i\in \mathcal{A}$:

$$p(\tau|_{s_0=s,\pi}=[S_0,A_0,S_1,\dots])=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$

$$p(\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots])=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$

Therefore the probability distributions of the sequences $\tau|_{s_0=s,\pi}$ and $\tau_t|_{s_t=s,\pi}$ are the same. When $\tau|_{s_0=s,\pi}=\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]$, we have $G_0=G_t=\sum_{i=0}^{\infty}\gamma^{i}R(S_i,A_i,S_{i+1})$.

Therefore $\mathbb{E}_{\tau |s_{0}=s, \pi}\{G_{0}\}=\mathbb{E}_{\tau_t|s_{t}=s, \pi}\{G_{t}\}$, and so $V^{\pi}(s)=\mathbb{E}_{\tau_t|s_{t}=s,\pi}[G_t]$, the expected discounted cumulative reward of the process from step $t$ under the condition $s_t=s$ and following $\pi$ afterwards.

State–Action Value Function (Q function): The state–action value function of policy $\pi$ is defined as $Q^{\pi}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$,

$$Q^{\pi}(s,a)=\mathbb{E}_{\tau |s_{0}=s, a_{0}=a, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau |s_{0}=s, a_{0}=a, \pi}\{G_0\}, \quad s\in \mathcal{S},\ a\in \mathcal{A}$$

$Q^{\pi}(s,a)$ is the expected discounted cumulative reward of the process from step $0$ when $s_0=s$, $a_0=a$, and the agent follows policy $\pi$. Similarly, $Q^{\pi}(s,a)=\mathbb{E}_{\tau_t|s_{t}=s, a_{t}=a,\pi}[G_t]$, the expected discounted cumulative reward from step $t$ under the condition $s_t=s$, $a_t=a$ and following $\pi$ afterwards.
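Since $V^{\pi}(s)$ is just an expectation of $G_0$, it can be estimated by straightforward Monte Carlo averaging over sampled trajectories. The following sketch truncates the infinite sum at a finite horizon; all array names follow the conventions assumed in the earlier sketches.

```python
# A minimal sketch: estimate V^pi(s) by averaging the truncated discounted return
# G_0 over many sampled trajectories that start at s.
import numpy as np

def mc_state_value(T, R, policy, gamma, s, n_episodes=1000, horizon=200,
                   rng=np.random.default_rng(0)):
    n_actions, n_states = T.shape[1], T.shape[2]
    total = 0.0
    for _ in range(n_episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):                       # truncate the infinite sum
            a = rng.choice(n_actions, p=policy[state])
            nxt = rng.choice(n_states, p=T[state, a])
            g += discount * R[state, a, nxt]
            discount *= gamma
            state = nxt
        total += g
    return total / n_episodes                          # approximates E[G_0 | s_0 = s, pi]
```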

TD formula for value functions

  • The TD(t) ($t=1,2,\dots$) formula for the V function is $V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\big]$, $s\in \mathcal{S}$.

Proof.

$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s, \pi}\,\mathbb{E}_{(a_{k+t},\dots)|(s_k,\dots,s_{k+t}),\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s, \pi}\big[V^{\pi}(s_{k+t})\big]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big] \end{aligned}$$

  • The TD(t) ($t=1,2,\dots$) formula for the Q function is $Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\big]$, $s\in \mathcal{S}$, $a\in \mathcal{A}$.

Proof.

$$\begin{aligned} Q^{\pi}(s,a)&=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t},a_{k+t})|s_k=s, a_k=a, \pi}\,\mathbb{E}_{(s_{k+t+1},\dots)|(s_k,\dots,s_{k+t},a_{k+t}),\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t},a_{k+t})|s_k=s, a_k=a, \pi}\big[Q^{\pi}(s_{k+t}, a_{k+t})\big]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t}, a_{k+t})\Big] \end{aligned}$$

Corollary.

(1)
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big]$$
This is called the Bellman equation for the V function.

The proof follows easily from the TD(1) formula for V and is omitted.

(2)
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\Big[R(s,a,s')+\gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a')\Big]$$
This is called the Bellman equation for the Q function.

The proof follows easily from the TD(1) formula for Q and is omitted.
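For a finite MDP the Bellman equation for the V function is linear in the $|\mathcal{S}|$ unknowns $V^{\pi}(s)$, so it can be solved exactly. Below is a minimal sketch under the same array conventions assumed earlier (`T[s,a,s']`, `R[s,a,s']`, `policy[s,a]`).

```python
# A minimal sketch of exact policy evaluation: write the Bellman equation as
# (I - gamma * P^pi) V = r^pi and solve the linear system directly.
import numpy as np

def evaluate_policy_exact(T, R, policy, gamma):
    # P^pi(s, s') = sum_a pi(a|s) T(s, a, s')
    P_pi = np.einsum('sa,sax->sx', policy, T)
    # r^pi(s) = sum_a pi(a|s) sum_{s'} T(s, a, s') R(s, a, s')
    r_pi = np.einsum('sa,sax,sax->s', policy, T, R)
    n = T.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # V^pi as a length-S vector
```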

Relation of V and Q

The relation between the $V$ and $Q$ functions is given by the following theorems.

Theorem 1.
$$V^{\pi}(s)=\sum_{a}\pi(a|s)Q^{\pi}(s,a), \quad s\in \mathcal{S}$$
Proof.
$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau |s_{0}=s, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}\\ &=\mathbb{E}_{a_0|s_0=s, \pi}\,\mathbb{E}_{\tau |s_{0}=s, a_0, \pi}\{G_{0}\}\\ &=\mathbb{E}_{a_0|s_0=s, \pi}\{Q^{\pi}(s,a_0)\}\\ &=\sum_{a}\pi(a|s)Q^{\pi}(s,a) \end{aligned}$$
Theorem 2. For $t=1,2,\dots$, it holds that
$$V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t}, a_{k+t})\Big],\quad s\in \mathcal{S}$$

$$Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big],\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
The proof is similar to the foregoing proofs and is omitted.

Corollary.

$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big], \quad s\in \mathcal{S},\ a\in \mathcal{A}$$
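Theorem 1 and the corollary above translate directly into two small conversion helpers; the array conventions are the ones assumed in the earlier sketches.

```python
# A minimal sketch of the two conversion identities between V^pi and Q^pi.
import numpy as np

def v_from_q(Q, policy):
    # V^pi(s) = sum_a pi(a|s) Q^pi(s, a)        (Theorem 1)
    return (policy * Q).sum(axis=1)

def q_from_v(T, R, V, gamma):
    # Q^pi(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma V^pi(s')]   (corollary)
    return np.einsum('sax,sax->sa', T, R + gamma * V[None, None, :])
```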

Policy Criterion

A policy $\pi$ is said to be not worse than a policy $\pi'$ (written $\pi \geq \pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in \mathcal{S}$.

A policy $\pi$ is said to be strictly better than a policy $\pi'$ (written $\pi > \pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in \mathcal{S}$ and there exists a state $s_0$ such that $V^{\pi}(s_0)> V^{\pi'}(s_0)$.

Policy Improvement Theorem

Theorem. For any policies $\pi$ and $\tilde\pi$ satisfying
$$V^{\pi}(s) \leq \sum_{a}\tilde\pi(a|s)Q^{\pi}(s,a),\quad s\in \mathcal{S},$$
it holds that $\tilde\pi\geq \pi$.

Proof.

$V^{\pi}(s)=\mathbb{E}_{s_0|s_0=s,\tilde\pi}\big[V^{\pi}(s_0)\big]$ (trivially, since $s_0=s$), and

$$\begin{aligned} \mathbb{E}_{s_0|s_0=s,\tilde\pi}\big[V^{\pi}(s_0)\big]&=\mathbb{E}_{s_0|s_0=s,\tilde\pi}\Big[\sum_{a}\pi(a|s_0)Q^{\pi}(s_0,a)\Big]\\ &\leq \mathbb{E}_{s_0|s_0=s,\tilde\pi}\Big[\sum_{a}\tilde\pi(a|s_0)Q^{\pi}(s_0,a)\Big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\big[Q^{\pi}(s_0,a_0)\big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\Big[\sum_{s'}T(s_0,a_0,s')\big(R(s_0,a_0,s')+\gamma V^{\pi}(s')\big)\Big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\,\mathbb{E}_{s_1|(s_0,a_0),\tilde\pi}\big[r_0+\gamma V^{\pi}(s_1)\big]\\ &=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}[r_0]+\gamma\,\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}\big[V^{\pi}(s_1)\big] \end{aligned}$$
For $i=0,1,\dots$, it can be derived in a similar way that

$$\mathbb{E}_{(s_0,\dots,s_i)|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i})\big]\leq \mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}[r_i]+\gamma\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i+1})\big]$$

Then it is easily derived that, for $i=0,1,\dots$,
$$\begin{aligned} V^{\pi}(s) &\leq \sum_{l=0}^{i}\mathbb{E}_{(s_0,\dots,s_{l+1})|s_0=s, \tilde\pi}\big[\gamma^{l} r_{l}\big]+\gamma^{i+1}\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\big[V^{\pi}(s_{i+1})\big]\\ &=\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\Big[\sum_{l=0}^{i}\gamma^{l}r_{l}\Big]+\gamma^{i+1}\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\big[V^{\pi}(s_{i+1})\big] \end{aligned}$$
Letting $i\rightarrow \infty$ (assuming bounded rewards, the last term vanishes because $\gamma<1$), we obtain $V^{\pi}(s)\leq \mathbb{E}_{\tau|s_0=s, \tilde\pi}\big[\sum_{l=0}^{\infty}\gamma^{l}r_{l}\big]=V^{\tilde\pi}(s)$.

Therefore, $\tilde\pi\geq \pi$.

Corollary.

For any policy $\pi$, let $\tilde\pi$ be the greedy policy
$$\tilde\pi(a|s)= \begin{cases} 1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a')\\ 0, & \text{otherwise} \end{cases},\quad s\in \mathcal{S}.$$
Then $\tilde\pi\geq \pi$.
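The corollary suggests a simple computational step, often called greedy policy improvement: given any estimate of $Q^{\pi}$, put all probability mass on an action that maximizes it. A minimal sketch, returning the result in the one-hot stochastic representation used above:

```python
# A minimal sketch of greedy policy improvement from a Q-table.
import numpy as np

def greedy_policy(Q):
    n_states, n_actions = Q.shape
    pi = np.zeros_like(Q)
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0   # mass 1 on argmax_a Q(s, a)
    return pi
```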

Optimal Policy

A policy $\pi^{*}$ is called an optimal policy if $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $s\in \mathcal{S}$ and every policy $\pi \neq \pi^{*}$. (In what follows we write $V^{*}(s)$ and $Q^{*}(s,a)$ for $V^{\pi^{*}}(s)$ and $Q^{\pi^{*}}(s,a)$.)

The relation between $V^{*}$ and $Q^{*}$ is given by:
$$V^{*}(s)=\max_{a}Q^{*}(s,a),\quad s\in \mathcal{S}$$

An optimal policy is given by:
$$\pi^{*}(a|s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\\ 0, & a\neq \mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a') \end{cases},\quad s\in \mathcal{S}$$
Proof.

If this did not hold, there would exist a better policy by the corollary of the policy improvement theorem, contradicting the optimality of $\pi^{*}$.

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Reinforcement Learning

Reinforcement Learning Algorithm: Algorithms that seek an optimal policy for MDPs are collectively known as reinforcement learning algorithms. We will introduce several of them in the sections that follow.

Fundamental RL Algorithms: Policy Iteration

We now introduce an RL algorithm called policy iteration which, at each iteration, yields a policy at least as good as the previous one and ultimately converges to an optimal policy. It applies to finite MDPs (MDPs with finite state and action spaces).

Recall that for a deterministic policy $\pi$,
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big( R(s,a,s')+\gamma Q^{\pi}(s',\pi(s')) \big)$$
This gives an iterative way to improve an estimate of the $Q$ function:
$$\hat{Q}^{\pi}(s,a)\leftarrow\sum_{s'}T(s,a,s')\big( R(s,a,s')+\gamma \hat{Q}^{\pi}(s',\pi(s')) \big)$$

The Whole Procedure

Require: initial policy $\pi$
Construct a function $\hat{Q}^{\pi}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimator of $Q^{\pi}$.
repeat // policy evaluation
\quad $\Delta=0$
\quad for $s \in \mathcal{S}$, $a\in \mathcal{A}$ do
\quad\quad $q=\hat{Q}^{\pi}(s,a)$
\quad\quad $\hat{Q}^{\pi}(s,a)=\sum\limits_{s'}T(s,a,s')\big( R(s,a,s')+\gamma \hat{Q}^{\pi}(s',\pi(s')) \big)$
\quad\quad $\Delta=\max(\Delta, |q - \hat{Q}^{\pi}(s,a)|)$
until $\Delta<\sigma$
policy-stable = true
for $s \in \mathcal{S}$ do // policy improvement
\quad $b=\pi(s)$
\quad $\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}\hat{Q}^{\pi}(s,a)$
\quad if $b\neq \pi(s)$ then policy-stable = false
if policy-stable == false then go to policy evaluation, else return $\pi$.
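Below is a minimal, runnable sketch of the whole procedure for a finite MDP, with Q-based policy evaluation by repeated sweeps followed by greedy improvement. The threshold argument `sigma`, the all-zero initial policy, and the array conventions (`T[s,a,s']`, `R[s,a,s']`) are assumptions of this sketch, not part of the pseudocode above.

```python
# A minimal sketch of policy iteration for a finite MDP.
import numpy as np

def policy_iteration(T, R, gamma, sigma=1e-8):
    n_states, n_actions = T.shape[0], T.shape[1]
    pi = np.zeros(n_states, dtype=int)          # initial deterministic policy
    Q = np.zeros((n_states, n_actions))         # arbitrary initial estimator of Q^pi
    while True:
        # --- policy evaluation: sweep until the largest change is below sigma ---
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_old = Q[s, a]
                    # Q(s,a) <- sum_s' T(s,a,s') (R(s,a,s') + gamma * Q(s', pi(s')))
                    Q[s, a] = np.sum(T[s, a] * (R[s, a] + gamma * Q[np.arange(n_states), pi]))
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < sigma:
                break
        # --- policy improvement: greedy with respect to the current Q estimate ---
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):           # policy-stable
            return pi, Q
        pi = new_pi
```

Using the tiny `FiniteMDP` example sketched earlier, `policy_iteration(mdp.T, mdp.R, mdp.gamma)` would return a deterministic policy (as a state-indexed action array) together with its Q-table.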

Policy Evaluation
Update $Q^{\pi}(s,a)$ in the aforementioned iterative way, based on the Bellman equation.

Policy Improvement

Update $\pi(s)$ in a greedy way:
$$\pi(s) =\mathop{\mathrm{argmax}}\limits_{a} Q^{\pi}(s,a)$$

*Note: policy-stable means that the greedy improvement step did not change the policy, i.e., $\pi_{k+1}(s)=\pi_{k}(s)$ for every $s$, so $\pi_{k+1}$ is as good as $\pi_{k}$.

Reference: Wiering M., van Otterlo M. Reinforcement Learning: State of the Art. Springer, 2012, p. 22.

Convergence Analysis

It can be proved that policy iteration generates a monotonically improving sequence of policies. Let $\pi'$ denote the policy obtained from $\pi$ by the greedy improvement step, so that $Q^{\pi}(s,\pi'(s))=\max_a Q^{\pi}(s,a)\geq Q^{\pi}(s,\pi(s))$. Then
$$\begin{aligned} V^{\pi}(s)=Q^{\pi}(s,\pi(s)) & \leq Q^{\pi}(s,\pi'(s))\\ & = \sum_{s'}T(s,\pi'(s),s')\big[R(s,\pi'(s),s')+\gamma V^{\pi}(s')\big]\\ &=\mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\big\{r_0 + \gamma V^{\pi}(s_{1})\big\}\\ &\leq \mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\big\{r_0 + \gamma Q^{\pi}(s_{1}, \pi'(s_1))\big\}\\ &= \mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\Big\{r_0 + \gamma \sum_{s'}T(s_1,\pi'(s_1),s')\big[R(s_1, \pi'(s_1),s')+\gamma V^{\pi}(s')\big]\Big\}\\ &=\mathbb{E}_{(s_0, a_0, s_1)|s_0=s, \pi'}\big\{r_0+\gamma\, \mathbb{E}_{(a_1, s_2)|s_1, \pi'}\{r_1+\gamma V^{\pi}(s_2)\}\big\}\\ &=\mathbb{E}_{(s_0, a_0, s_1, a_1, s_2)| s_0=s, \pi'}\big\{r_0 + \gamma r_1+\gamma^{2} V^{\pi}(s_2)\big\}\\ & \leq \dots \\ & \leq \mathbb{E}_{(s_{0},a_{0},\dots,s_{n-1},a_{n-1},s_{n})|s_{0}=s, \pi'} \Big\{\sum_{t=0}^{n-1}\gamma^{t} r_{t}+\gamma^{n}V^{\pi}(s_{n})\Big\}\\ & \leq \dots \end{aligned}$$
Therefore, $V^{\pi}(s)\leq \mathbb{E}_{\tau | s_{0} = s, \pi'}\big\{\sum_{t=0}^{\infty}\gamma^{t} r_{t}\big\} = V^{\pi'}(s)$ for all $s \in \mathcal{S}$.

Because a finite MDP has only finitely many deterministic policies, $\pi$ must converge after a finite number of iterations.

Suppose $\pi$ converges to $\pi^{*}$. It then follows from the stopping criterion that
$$\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma V^{*}(s')\big),\quad \forall s\in \mathcal{S}$$
so the Bellman optimality equation holds.

  • Each policy $\pi_{k+1}$ is strictly better than $\pi_{k}$ until the algorithm stops.
  • The policy iteration algorithm completely separates the evaluation and improvement phases.
  • Although policy iteration computes the optimal policy for a given MDP in finite time, it is relatively inefficient. In particular the first step, the policy evaluation step, is computationally expensive.

Fundamental RL Algorithms: Value Iteration

From the preceding sections we know that $\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$. Hence, once $Q^{*}(s,a)$ is known, $\pi^{*}(s)$ follows immediately. In this section we present an algorithm that computes $Q^{*}(s,a)$ iteratively: value iteration.

Iterative ways to compute $Q^{*}$

Q ∗ Q^{*} Q π ∗ \pi^{*} π 的关系式
Q ∗ ( s , a ) = ∑ s ′ ∈ S T ( s , a , s ′ ) [ R ( s , a , s ′ ) + γ V ∗ ( s ′ ) ] Q^{*}(s, a) = \sum_{s' \in S} T(s, a, s') [R(s, a, s') + \gamma V^{*}(s')] Q(s,a)=sST(s,a,s)[R(s,a,s)+γV(s)]
可以得到两种迭代更新 Q ∗ Q^{*} Q estimator的方式:

(1)
$$\hat{Q}^{*}(s, a)\leftarrow \sum_{s'}T(s, a, s')\big(R(s, a, s')+\gamma \hat{V}^{*}(s')\big)$$

$$\hat{V}^{*}(s)\leftarrow\max_{a}\hat{Q}^{*}(s, a)$$
(2)
$$\hat{Q}^{*}(s, a) \leftarrow \sum_{s'} T(s, a, s') \Big( R(s, a, s') + \gamma \max_{a'} \hat{Q}^{*}(s', a') \Big)$$

Given state $s$, action $a$, reward $r$ and next state $s'$, it is possible to approximate $Q^{*}(s, a)$ by iteratively solving the Bellman recurrence equation
$$Q_{i+1}(s,a)=\mathbb{E}_{s'}\big[r+\gamma\max_{a'}Q_{i}(s',a')\big]$$
(Quoted from: Multiagent cooperation and competition with deep reinforcement learning.)

The value iteration algorithm computes $Q^{*}$ based on these two update rules.

The Whole Procedure

Version 1:

Construct a function $\hat{Q}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as the estimate of $Q^{*}$, and a function $\hat{V}: \mathcal{S}\rightarrow \mathbb{R}$ arbitrarily as the estimate of $V^{*}$.
repeat
\quad $\Delta=0$
\quad for $s \in \mathcal{S}$ do
\quad\quad $v=\hat{V}(s)$
\quad\quad for $a\in \mathcal{A}$ do
\quad\quad\quad $\hat{Q}(s,a)=\sum\limits_{s'}T(s,a,s')\big( R(s,a,s')+\gamma \hat{V}(s') \big)$ // update $\hat{Q}(s,a)$ with rule (1)
\quad\quad $\hat{V}(s)=\max\limits_{a}\hat{Q}(s,a)$
\quad\quad $\Delta=\max(\Delta, |v - \hat{V}(s)|)$
until $\Delta<\sigma$
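A minimal sketch of version 1, keeping an explicit estimate $\hat{V}$ and alternating the two updates; the array conventions and the threshold `sigma` are the same assumptions as in the earlier sketches.

```python
# A minimal sketch of value iteration, version 1 (explicit V_hat).
import numpy as np

def value_iteration_v(T, R, gamma, sigma=1e-8):
    n_states = T.shape[0]
    V = np.zeros(n_states)                        # arbitrary initial estimate of V*
    while True:
        # Q_hat(s,a) = sum_s' T(s,a,s') (R(s,a,s') + gamma * V_hat(s'))
        Q = np.einsum('sax,sax->sa', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                     # V_hat(s) = max_a Q_hat(s, a)
        delta = np.abs(V_new - V).max()
        V = V_new
        if delta < sigma:
            break
    return V, Q
```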

Version 2:

Construct a function $\hat{Q}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as the estimate of $Q^{*}$.
repeat
\quad $\Delta=0$
\quad for $s \in \mathcal{S}$, $a\in \mathcal{A}$ do
\quad\quad $q=\hat{Q}(s,a)$
\quad\quad $\hat{Q}(s,a)=\sum\limits_{s'}T(s,a,s')\big( R(s,a,s')+\gamma \max\limits_{a'}\hat{Q}(s',a') \big)$ // update $\hat{Q}(s,a)$ with rule (2)
\quad\quad $\Delta=\max(\Delta, |q - \hat{Q}(s,a)|)$
until $\Delta<\sigma$
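And a corresponding sketch of version 2, which keeps only $\hat{Q}$ and folds the $\max$ into the update; the greedy optimal policy is read off at the end. Array conventions and `sigma` are the same assumptions as before.

```python
# A minimal sketch of value iteration, version 2 (Q-only), plus greedy policy extraction.
import numpy as np

def value_iteration(T, R, gamma, sigma=1e-8):
    n_states, n_actions = T.shape[0], T.shape[1]
    Q = np.zeros((n_states, n_actions))           # arbitrary initial estimate of Q*
    while True:
        V = Q.max(axis=1)                         # max_a' Q(s', a') for every s'
        Q_new = np.einsum('sax,sax->sa', T, R + gamma * V[None, None, :])
        delta = np.abs(Q_new - Q).max()
        Q = Q_new
        if delta < sigma:
            break
    pi_star = Q.argmax(axis=1)                    # pi*(s) = argmax_a Q*(s, a)
    return pi_star, Q
```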

Convergence Analysis

The iterates form the sequence $V_0 \rightarrow Q_1 \rightarrow V_1 \rightarrow Q_2 \rightarrow V_2 \rightarrow Q_3 \rightarrow V_3 \rightarrow Q_4 \rightarrow V_4 \rightarrow \dots \rightarrow V^*$. Value iteration is guaranteed to converge to $V^{*}$ in the limit, and then $\pi^{*}(s)= \pi_{\mathrm{greedy}}(V^*)(s)= \mathop{\mathrm{argmax}}\limits_{a} Q^*(s,a)$.

Reference: Wiering M., van Otterlo M. Reinforcement Learning: State of the Art. Springer, 2012, p. 22.


