Table of Contents
- Markov Decision Process
- Definition
- Related Concepts
- Policy for MDP Agent
- Definition
- Judgement for Policy
- Value Functions
- TD formula for value functions
- Relation of V and Q
- Policy Criterion
- Policy Improvement Theorem
- Optimal Policy
- Reinforcement Learning
- Fundamental RL Algorithms: Policy Iteration
- The Whole Procedure
- Convergence Analysis
- Fundamental RL Algorithms: Value Iteration
- Iterative ways to derive $Q^{*}$
- The Whole Procedure
- Convergence Analysis
References:
1. Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo (eds.)
2. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Next post:
https://blog.csdn.net/Jinyindao243052/article/details/107051507
Markov Decision Process
Definition
Suppose there is an interactive system consisting of an environment and an agent.
Its running process is as follows. At the beginning of each step $t=0,1,\dots$, the agent knows the environment state $s_t$. It then chooses an action $a_t\in \mathcal{A}$ that is available in state $s_t$ and applies it. After action $a_t$ is applied, the environment transitions to a new state $s_{t+1}$ according to a probability distribution $T(s_t,a_t)$ over $\mathcal{S}$, generates a reward $r_t=R(s_t,a_t,s_{t+1})$, and reports its new state and the reward to the agent.
Here $\mathcal{S}$ is the set of possible environment states, called the state space. $\mathcal{A}$ is the set of possible agent actions, called the action space. $T: \mathcal{S} \times \mathcal{A}\rightarrow \mathcal{P}(\mathcal{S})$ is called the transition function, where $\mathcal{P}(\mathcal{S})$ is the set of probability distributions over $\mathcal{S}$. The probability of ending up in state $s'$ after taking an applicable action $a$ in state $s$ is denoted $T(s,a,s')$. $R: \mathcal{S}\times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is called the reward function; the reward received for ending up in state $s'$ after taking action $a$ in state $s$ is $R(s,a,s')$.
Note that the probability distribution of the next state $s_{t+1}$ is determined only by $s_t$ and $a_t$, and is not directly influenced by earlier states or actions. Likewise, the reward $r_t$ is determined by $s_t$, $a_t$, and $s_{t+1}$, and is not directly influenced by earlier states, actions, or rewards. This property is called the Markov property.
In the theoretical analysis, we assume the process never terminates.
The discrete-time running process of the foregoing system is called a Markov Decision Process.
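To make the tuple concrete, here is a minimal Python sketch (not from the original text) of how a finite MDP could be stored; the class name `MDP`, the field names `states`, `actions`, `T`, `R`, `gamma`, and the two-state toy example are all illustrative assumptions.

```python
# A minimal finite-MDP container; all names here are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int

@dataclass
class MDP:
    states: List[State]                          # state space S
    actions: List[Action]                        # action space A
    T: Dict[Tuple[State, Action, State], float]  # T(s, a, s'): transition probability
    R: Dict[Tuple[State, Action, State], float]  # R(s, a, s'): reward
    gamma: float                                 # discount rate in (0, 1)

# Toy two-state example: action 0 stays in the current state, action 1 switches state.
mdp = MDP(
    states=[0, 1],
    actions=[0, 1],
    T={(s, a, (s if a == 0 else 1 - s)): 1.0 for s in [0, 1] for a in [0, 1]},
    R={(s, a, s2): float(s2) for s in [0, 1] for a in [0, 1] for s2 in [0, 1]},
    gamma=0.9,
)
```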
Related Concepts
Discounted Cumulative Reward: The discounted cumulative reward of the process from step $t$ is defined as $G_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, where $\gamma \in (0,1)$ is a parameter called the discount rate.
An MDP is usually represented by a tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$, where $\mathcal{S}$ is its state space, $\mathcal{A}$ its action space, $T$ its transition function, $R$ its reward function, and $\gamma$ its discount rate.
History: The sequence $(s_0,a_0,s_1,a_1,\dots)$ is called the history of the MDP process (denoted $\tau$). The probability of $(s_t, a_t, \dots, s_{t'})$ given $s_t$ and $a_t,\dots,a_{t'-1}$ is $\prod_{i=t}^{t'-1}T(s_i,a_i,s_{i+1})$.
Trajectory: The sequence $(s_0,a_0,r_0,s_1,a_1,r_1,\dots)$ is called the trajectory of the MDP process (denoted $\mathcal{E}$).
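As a quick illustration of $G_t$, the following hedged snippet computes a truncated discounted cumulative reward from a finite list of rewards; the function name `discounted_return` and its arguments are assumptions for illustration only.

```python
def discounted_return(rewards, gamma, t=0):
    """Truncated estimate of G_t = sum_k gamma^k * r_{t+k} from a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Example: rewards r_0..r_3 with gamma = 0.9 gives 1 + 0.9^2 + 0.9^3.
print(discounted_return([1.0, 0.0, 1.0, 1.0], gamma=0.9))
```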
Policy for MDP Agent
Definition
Policy: A deterministic policy $\pi$ is a function $\pi: \mathcal{S}\rightarrow \mathcal{A}$, which tells the agent which action to take in a particular state.
A stochastic policy $\pi$ is a function $\pi: \mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions over $\mathcal{A}$. It tells the agent the probability of taking each action in a particular state. The probability of action $a$ under $\pi(s)$ is denoted $\pi(a|s)$.
Adopting a deterministic policy $\pi$ is equivalent to adopting the stochastic policy $\tilde{\pi}$ with
$$\tilde{\pi}(a|s)=\begin{cases} 1, & a=\pi(s)\\ 0, & a\neq \pi(s) \end{cases},\quad s\in\mathcal{S}$$
Therefore, we only discuss stochastic policies in the following.
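The equivalence above can be sketched in code: a deterministic policy (represented here, as an assumption, by a dict mapping states to actions) is lifted to the one-hot stochastic policy $\tilde\pi$.

```python
def to_stochastic(det_policy, actions):
    """Lift a deterministic policy {s: a} to a one-hot stochastic policy {s: {a: prob}}."""
    return {s: {a: (1.0 if a == det_policy[s] else 0.0) for a in actions}
            for s in det_policy}

pi_det = {0: 1, 1: 0}                     # pi(s)
pi_tilde = to_stochastic(pi_det, [0, 1])  # pi_tilde(a|s)
```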
Judgement for Policy
Value Functions
State Value Function (V function): The state value function of policy $\pi$ is defined as $V^{\pi}: \mathcal{S}\rightarrow \mathbb{R}$,
$$V^{\pi}(s)=\mathbb{E}_{\tau |s_{0}=s, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau |s_{0}=s, \pi}\{G_{0}\},\quad s\in \mathcal{S}$$
where $\tau=(s_{0}, a_{0}, s_{1}, a_{1}, \dots)$ is the sequence of state-action transitions. $V^{\pi}(s)$ is the expected discounted cumulative reward of the process when $s_0=s$ and policy $\pi$ is adopted.
For any sequence $[S_0,A_0,S_1,\dots]$ with $S_0=s$, $S_i\in \mathcal{S}$, $A_i\in \mathcal{A}$:
$$p(\tau|_{s_0=s,\pi}=[S_0,A_0,S_1,\dots])=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$
$$p(\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots])=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$
where $\tau_t=(s_t,a_t,s_{t+1},a_{t+1},\dots)$ denotes the sub-sequence from step $t$.
Therefore the probability distributions of the sequences $\tau|_{s_0=s,\pi}$ and $\tau_t|_{s_t=s,\pi}$ are the same. Moreover, when $\tau|_{s_0=s, \pi}=\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]$, we have $G_0=G_t=\sum_{i=0}^{\infty}\gamma^{i}R(S_i,A_i,S_{i+1})$.
Therefore $\mathbb{E}_{\tau|s_{0}=s, \pi}\{G_{0}\}=\mathbb{E}_{\tau_t|s_{t}=s, \pi}\{G_{t}\}$, so $V^{\pi}(s)=\mathbb{E}_{\tau_t|s_{t}=s,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ under the condition that $s_t=s$ and $\pi$ is followed afterwards.
State-Action Value Function (Q function): The state-action value function of policy $\pi$ is defined as $Q^{\pi}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$,
$$Q^{\pi}(s,a)=\mathbb{E}_{\tau |s_{0}=s, a_{0}=a, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau |s_{0}=s, a_{0}=a, \pi}\{G_0\},\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
$Q^{\pi}(s,a)$ is the expected discounted cumulative reward of the process from step $0$ when $s_0=s$, $a_0=a$, and policy $\pi$ is adopted. Similarly, $Q^{\pi}(s,a)=\mathbb{E}_{\tau_t|s_{t}=s, a_{t}=a,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ under the condition that $s_t=s$, $a_t=a$ and $\pi$ is followed afterwards.
TD formula for value functions
- The TD($t$) ($t=1,2,\dots$) formula for the V function is $V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\big]$, $s\in \mathcal{S}$.
Proof.
$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s, \pi}\,\mathbb{E}_{(a_{k+t},\dots)|(s_k,\dots,s_{k+t}), s_k=s,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s, \pi}\big[V^{\pi}(s_{k+t})\big]\\ &=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big] \end{aligned}$$
- The TD($t$) ($t=1,2,\dots$) formula for the Q function is $Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\big]$, $s\in \mathcal{S}$, $a\in \mathcal{A}$.
Proof.
$$\begin{aligned} Q^{\pi}(s,a)&=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t}, a_{k+t})|s_k=s, a_k=a, \pi}\,\mathbb{E}_{(s_{k+t+1},\dots)|(s_k,\dots,s_{k+t}, a_{k+t}), s_k=s,a_k=a,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t}, a_{k+t})|s_k=s, a_k=a, \pi}\big[Q^{\pi}(s_{k+t}, a_{k+t})\big]\\ &=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t}, a_{k+t})\Big] \end{aligned}$$
Corollary.
(1)
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big]$$
This is called the Bellman Equation for the V function.
The proof follows directly from the TD(1) formula for V and is omitted.
(2)
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\Big[R(s,a,s')+\gamma \sum_{a'} \pi(a'|s')\, Q^{\pi}(s',a')\Big]$$
This is called the Bellman Equation for the Q function.
The proof follows directly from the TD(1) formula for Q and is omitted.
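For reference, here is a hedged sketch of the two Bellman equations as one-step backups over an explicit model; it assumes the dictionary-based `MDP` container from the earlier sketch, with `V`, `Q`, and `pi` as dictionaries of current estimates (all names illustrative).

```python
def bellman_V(s, pi, V, mdp):
    """Right-hand side of the Bellman equation for the V function."""
    return sum(
        pi[s][a] * mdp.T.get((s, a, s2), 0.0)
        * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
        for a in mdp.actions for s2 in mdp.states
    )

def bellman_Q(s, a, pi, Q, mdp):
    """Right-hand side of the Bellman equation for the Q function."""
    return sum(
        mdp.T.get((s, a, s2), 0.0)
        * (mdp.R.get((s, a, s2), 0.0)
           + mdp.gamma * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in mdp.actions))
        for s2 in mdp.states
    )
```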
Relation of V and Q
The relation between the $V$ and $Q$ functions is given by the following theorems.
Theorem 1.
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\,Q^{\pi}(s,a),\quad s\in \mathcal{S}$$
Proof.
$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau |s_{0}=s, \pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}\\ &=\mathbb{E}_{a_0|s_0=s, \pi}\,\mathbb{E}_{\tau |s_{0}=s, a_0, \pi}\{G_{0}\}\\ &=\mathbb{E}_{a_0|s_0=s, \pi}\{Q^{\pi}(s,a_0)\}\\ &=\sum_{a}\pi(a|s)\,Q^{\pi}(s,a) \end{aligned}$$
Theorem 2. For $t=1,2,\dots$, it holds that
$$V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t}, a_{k+t})\Big],\quad s\in \mathcal{S}$$
$$Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s, a_k=a, \pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big],\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
The proof is similar to the foregoing proofs and is omitted.
Corollary.
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big],\quad s\in \mathcal{S},\ a\in \mathcal{A}$$
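A short sketch of the two directions of the V-Q relation (Theorem 1 and the corollary above), under the same assumed dictionary representation:

```python
def V_from_Q(s, pi, Q, mdp):
    """V(s) = sum_a pi(a|s) Q(s, a)   (Theorem 1)."""
    return sum(pi[s][a] * Q[(s, a)] for a in mdp.actions)

def Q_from_V(s, a, V, mdp):
    """Q(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma V(s')]   (Corollary)."""
    return sum(mdp.T.get((s, a, s2), 0.0)
               * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
               for s2 in mdp.states)
```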
Policy Criterion
A policy $\pi$ is said to be not worse than a policy $\pi'$ (written $\pi \geq \pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in \mathcal{S}$.
A policy $\pi$ is said to be strictly better than a policy $\pi'$ (written $\pi > \pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in \mathcal{S}$ and there exists a state $s_0$ with $V^{\pi}(s_0)> V^{\pi'}(s_0)$.
Policy Improvement Theorem
Theorem. For any policies $\pi$ and $\tilde\pi$ satisfying
$$V^{\pi}(s) \leq \sum_{a}\tilde\pi(a|s)\,Q^{\pi}(s,a),\quad s\in \mathcal{S},$$
it holds that $\tilde\pi\geq \pi$.
Proof.
$$V^{\pi}(s)=\mathbb{E}_{s_0|s_0=s,\tilde\pi}\, V^{\pi}(s_0)$$
$$\begin{aligned} \mathbb{E}_{s_0|s_0=s,\tilde\pi} V^{\pi}(s_0)&=\mathbb{E}_{s_0|s_0=s,\tilde\pi} \Big[\sum_{a}\pi(a|s_0)Q^{\pi}(s_0,a)\Big]\\ &\leq \mathbb{E}_{s_0|s_0=s,\tilde\pi}\Big[\sum_{a}\tilde\pi(a|s_0)Q^{\pi}(s_0,a)\Big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\big[Q^{\pi}(s_0,a_0)\big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\Big[\sum_{s'}T(s_0, a_0,s')\big(R(s_0, a_0,s')+\gamma V^{\pi}(s')\big)\Big]\\ &=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\,\mathbb{E}_{s_1|(s_0,a_0),s_0=s,\tilde\pi}\big[r_0+\gamma V^{\pi}(s_1)\big]\\ &=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}[r_0]+\gamma\,\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}\big[ V^{\pi}(s_1)\big] \end{aligned}$$
For $i=0,1,\dots$, it can be derived in a similar way that
$$\mathbb{E}_{(s_0,\dots,s_i)|s_0=s,\tilde\pi}\big[ V^{\pi}(s_{i})\big]\leq \mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}[r_i]+\gamma\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i+1})\big]$$
Then it is easily derived that for $i=0,1,\dots$,
$$\begin{aligned} V^{\pi}(s) &\leq \sum_{l=0}^{i}\mathbb{E}_{(s_0,\dots,s_{l+1})|s_0=s, \tilde\pi}\big[\gamma^{l} r_{l}\big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\big[V^{\pi}(s_{i+1})\big]\\ &=\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\Big[\sum_{l=0}^{i}\gamma^{l}r_{l}\Big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s, \tilde\pi}\big[V^{\pi}(s_{i+1})\big] \end{aligned}$$
Letting $i\rightarrow \infty$ gives $V^{\pi}(s)\leq \mathbb{E}_{\tau|s_0=s, \tilde\pi}\big[\sum_{l=0}^{\infty}\gamma^{l}r_{l}\big]=V^{\tilde\pi}(s)$.
Therefore, $\tilde\pi\geq \pi$.
Corollary.
For any policy $\pi$, define $\tilde\pi$ by
$$\tilde\pi(a|s)= \begin{cases} 1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a')\\ 0, & a\neq\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a') \end{cases},\quad s\in \mathcal{S}$$
Then it holds that $\tilde\pi\geq \pi$.
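This corollary is exactly the greedy-improvement step used later in policy iteration; a minimal sketch, assuming `Q` is a dict keyed by `(s, a)` and `mdp` is the container from the earlier sketch:

```python
def greedy_policy(Q, mdp):
    """pi_tilde(s) = argmax_a Q(s, a), returned as a deterministic policy dict."""
    return {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}
```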
Optimal Policy
A policy $\pi^{*}$ is called an optimal policy if $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $s\in \mathcal{S}$ and for every policy $\pi \neq \pi^{*}$. (In the following, $V^{\pi^{*}}(s)$ and $Q^{\pi^{*}}(s,a)$ are written as $V^{*}(s)$ and $Q^{*}(s,a)$.)
The relation between $V^{*}$ and $Q^{*}$ is given as follows:
$$V^{*}(s)=\max_{a\in\mathcal{A}}Q^{*}(s,a),\quad s\in \mathcal{S}$$
An optimal policy is given by:
$$\pi^{*}(a|s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\\ 0, & a\neq \mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\end{cases},\quad s\in \mathcal{S}$$
Proof.
If this did not hold, there would exist a better policy according to the corollary of the policy improvement theorem.
Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
Reinforcement Learning
Reinforcement Learning Algorithm: Algorithms that seek the optimal policy for MDPs are collectively known as reinforcement learning algorithms. We will introduce several reinforcement learning algorithms in the following content.
Fundamental RL Algorithms: Policy Iteration
We now introduce an RL algorithm called policy iteration which, at each iteration, produces a policy that is not worse than the previous one and ultimately converges to an optimal policy. It applies to finite MDPs (MDPs with finite state and action spaces).
Recall that, for a deterministic policy $\pi$,
$$Q^{\pi}(s,a)=\sum_{s'}T(s, a, s')\big( R(s, a, s')+\gamma\, Q^{\pi}(s',\pi(s')) \big)$$
This suggests an iterative way to update an estimate of the $Q$ function:
$$\hat{Q}^{\pi}(s,a)\leftarrow\sum_{s'}T(s, a, s')\big( R(s, a, s')+\gamma\, \hat{Q}^{\pi}(s',\pi(s')) \big)$$
The Whole Procedure
Require: initial deterministic policy $\pi$
Construct a function $\hat{Q}^{\pi}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $Q^{\pi}$.
repeat // policy evaluation
  $\Delta=0$
  for $s \in \mathcal{S}$, $a\in \mathcal{A}$ do
    $q=\hat{Q}^{\pi}(s,a)$
    $\hat{Q}^{\pi}(s,a)=\sum\limits_{s'}T(s, a, s')\big( R(s, a, s')+\gamma \hat{Q}^{\pi}(s',\pi(s')) \big)$
    $\Delta=\max(\Delta, |q - \hat{Q}^{\pi}(s,a)|)$
until $\Delta<\sigma$
policy-stable = true
for $s \in \mathcal{S}$ do // policy improvement
  $b=\pi(s)$
  $\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}\hat{Q}^{\pi}(s,a)$
  if $b\neq \pi(s)$ then policy-stable = false
if policy-stable == false then go to policy evaluation, else return $\pi$.
Policy Evaluation
Update $\hat{Q}^{\pi}(s,a)$ in the aforementioned iterative way based on the Bellman equation.
Policy Improvement
Update $\pi(s)$ in a greedy way:
$$\pi(s) =\mathop{\mathrm{argmax}}\limits_{a} \hat{Q}^{\pi}(s,a)$$
*Note: Policy-stable means that the greedy policy did not change in this iteration, that is, $\pi_{k+1}$ is as good as $\pi_{k}$.
Reference: Wiering M., van Otterlo M. Reinforcement Learning: State of the Art. Springer, 2012, p. 22.
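Putting the two phases together, the following is a hedged Python sketch of the whole procedure above; it assumes the dictionary-based MDP container from the earlier sketches, and `sigma` plays the role of the evaluation tolerance $\sigma$.

```python
def policy_iteration(mdp, sigma=1e-8):
    """Policy iteration sketch: tabular policy evaluation + greedy improvement."""
    pi = {s: mdp.actions[0] for s in mdp.states}  # arbitrary initial deterministic policy
    Q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}
    while True:
        # Policy evaluation: sweep Bellman backups until the largest change is below sigma.
        while True:
            delta = 0.0
            for s in mdp.states:
                for a in mdp.actions:
                    q_old = Q[(s, a)]
                    Q[(s, a)] = sum(
                        mdp.T.get((s, a, s2), 0.0)
                        * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * Q[(s2, pi[s2])])
                        for s2 in mdp.states)
                    delta = max(delta, abs(q_old - Q[(s, a)]))
            if delta < sigma:
                break
        # Policy improvement: act greedily with respect to the evaluated Q.
        policy_stable = True
        for s in mdp.states:
            b = pi[s]
            pi[s] = max(mdp.actions, key=lambda a: Q[(s, a)])
            if b != pi[s]:
                policy_stable = False
        if policy_stable:
            return pi, Q
```

For instance, calling `policy_iteration(mdp)` on the toy two-state MDP sketched earlier returns the greedy policy together with its Q estimate.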
Convergence Analysis
It can be proved that policy iteration generates a monotonically improving sequence of policies. Let $\pi'$ denote the greedy policy obtained from $\pi$ in the improvement step; then for every $s\in\mathcal{S}$:
$$\begin{aligned} V^{\pi}(s)=Q^{\pi}(s,\pi(s)) & \leq Q^{\pi}(s,\pi'(s))\\ & = \sum_{s'}T(s,\pi'(s),s')\big[R(s,\pi'(s),s')+\gamma V^{\pi}(s')\big]\\ &=\mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\big\{r_0 + \gamma V^{\pi}(s_{1})\big\}\\ &\leq \mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\big\{r_0 + \gamma Q^{\pi}(s_{1}, \pi'(s_1))\big\}\\ &= \mathbb{E}_{(s_0, a_0, s_1)| s_0=s, \pi'}\Big\{r_0 + \gamma \sum_{s'}T(s_1,\pi'(s_1),s')\big[R({s_1}, \pi'(s_1),s')+\gamma V^{\pi}(s')\big]\Big\}\\ &=\mathbb{E}_{(s_0, a_0, s_1)|s_0=s, \pi'}\big\{r_0+\gamma\, \mathbb{E}_{(a_1, s_2)|s_1, \pi'}\{r_1+\gamma V^{\pi}(s_2)\}\big\}\\ &=\mathbb{E}_{(s_0, a_0, s_1, a_1, s_2)| s_0=s, \pi'}\big\{r_0 + \gamma r_1+\gamma^{2} V^{\pi}(s_2)\big\}\\ & \leq \dots \\ & \leq \mathbb{E}_{(s_{0},a_{0},\dots,s_{n-1},a_{n-1},s_{n})|s_{0}=s, \pi'} \Big\{\sum_{t=0}^{n-1}\gamma^{t} r_{t}+\gamma^{n}V^{\pi}(s_{n})\Big\}\\ & \leq \dots \end{aligned}$$
Therefore, $V^{\pi}(s)\leq \mathbb{E}_{\tau | s_{0} = s, \pi'}\big\{\sum_{t=0}^{\infty}\gamma^{t} r_{t}\big\} = V^{\pi'}(s)$ for all $s \in \mathcal{S}$.
Because a finite MDP has only a finite number of policies, π \pi π must converge in a finite number of iterations.
Suppose $\pi$ converges to $\pi^{*}$. Then it is easily derived from the stopping criterion that:
$$\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma V^{*}(s')\big),\quad \forall s\in \mathcal{S}$$
Then the Bellman optimality equation holds.
- Each policy $\pi_{k+1}$ is a strictly better policy than $\pi_{k}$, until the algorithm stops.
- The policy iteration algorithm completely separates the evaluation and improvement phases.
- Although policy iteration computes the optimal policy for a given MDP in finite time, it is relatively inefficient. In particular the first step, the policy evaluation step, is computationally expensive.
Fundamental RL Algorithms: Value Iteration
From the preceding content we know that $\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$. Therefore, once $Q^{*}(s,a)$ is known, $\pi^{*}(s)$ is known as well. In this section we present an algorithm that computes $Q^{*}(s,a)$ iteratively: value iteration.
Iterative ways to derive $Q^{*}$
From the relation between $Q^{*}$ and $V^{*}$,
$$Q^{*}(s, a) = \sum_{s' \in \mathcal{S}} T(s, a, s') \big[R(s, a, s') + \gamma V^{*}(s')\big]$$
two iterative schemes for updating an estimate of $Q^{*}$ can be derived:
(1)
$$\hat{Q}^{*}(s, a)\leftarrow \sum_{s'}T(s, a, s')\big(R(s, a, s')+\gamma \hat{V}^{*}(s')\big)$$
$$\hat{V}^{*}(s)\leftarrow\max_{a}\hat{Q}^{*}(s, a)$$
(2)
$$\hat{Q}^{*}(s, a) \leftarrow \sum_{s'} T(s, a, s') \Big( R(s, a, s') + \gamma \max_{a'} \hat{Q}^{*}(s', a') \Big)$$
Given state $s$, action $a$, reward $r$ and next state $s'$, it is possible to approximate $Q^{*}(s, a)$ by iteratively solving the Bellman recurrence equation
$$Q_{i+1}(s,a)=\mathbb{E}_{s'}\big[r+\gamma\max_{a'}Q_{i}(s',a')\big]$$
Quoted from: Multiagent cooperation and competition with deep reinforcement learning.
The value iteration algorithm computes $Q^{*}$ based on these two update schemes.
The Whole Procedure
Version 1:
Construct a function $\hat{Q}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $Q^{*}$. Construct a function $\hat{V}: \mathcal{S}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $V^{*}$.
repeat
  $\Delta=0$
  for $s \in \mathcal{S}$ do
    $v=\hat{V}(s)$
    for $a\in \mathcal{A}$ do
      $\hat{Q}(s,a)=\sum\limits_{s'}T(s, a, s')\big( R(s, a, s')+\gamma \hat{V}(s') \big)$ // update $\hat{Q}(s,a)$ with the first scheme
    $\hat{V}(s)=\max_{a}\hat{Q}(s,a)$
    $\Delta=\max(\Delta, |v - \hat{V}(s)|)$
until $\Delta<\sigma$
Version 2:
Construct a function $\hat{Q}: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ arbitrarily as an estimate of $Q^{*}$.
repeat
  $\Delta=0$
  for $s \in \mathcal{S}$, $a\in \mathcal{A}$ do
    $q=\hat{Q}(s,a)$
    $\hat{Q}(s,a)=\sum\limits_{s'}T(s, a, s')\big( R(s, a, s')+\gamma \max\limits_{a'}\hat{Q}(s',a') \big)$ // update $\hat{Q}(s,a)$ with the second scheme
    $\Delta=\max(\Delta, |q - \hat{Q}(s,a)|)$
until $\Delta<\sigma$
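A corresponding hedged sketch of Version 2 (the $Q$-only update), under the same assumed dictionary-based MDP container:

```python
def value_iteration(mdp, sigma=1e-8):
    """Value iteration sketch: Q(s,a) <- sum_s' T * (R + gamma * max_a' Q(s',a'))."""
    Q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}
    while True:
        delta = 0.0
        for s in mdp.states:
            for a in mdp.actions:
                q_old = Q[(s, a)]
                Q[(s, a)] = sum(
                    mdp.T.get((s, a, s2), 0.0)
                    * (mdp.R.get((s, a, s2), 0.0)
                       + mdp.gamma * max(Q[(s2, a2)] for a2 in mdp.actions))
                    for s2 in mdp.states)
                delta = max(delta, abs(q_old - Q[(s, a)]))
        if delta < sigma:
            break
    # Greedy policy with respect to the converged Q estimate.
    pi = {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}
    return pi, Q
```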
Convergence Analysis
$V_0 \rightarrow Q_1 \rightarrow V_1 \rightarrow Q_2 \rightarrow V_2 \rightarrow Q_3 \rightarrow V_3 \rightarrow Q_4 \rightarrow V_4 \rightarrow \dots \rightarrow V^{*}$. Value iteration is guaranteed to converge in the limit to $V^{*}$, and then $\pi^{*}(s)= \pi_{\mathrm{greedy}}(V^{*})(s)= \mathop{\mathrm{argmax}}\limits_a Q^{*}(s,a)$.
Reference: Wiering M., van Otterlo M. Reinforcement Learning: State of the Art. Springer, 2012, p. 22.