---
title: Markov Decision Process
description: Markov Decision Process and Bellman Equations
sidebar_position: 2
---

A Markov Decision Process (MDP) is a mathematical abstraction for sequential decision-making. In particular, a finite MDP captures the essential mathematical structure of the reinforcement learning problem, where "finite" means that the state, action, and reward sets are all finite. A system possesses the Markov property if its next state depends only on the current state and action, not on the earlier history; all relevant historical information is implicitly contained in the current state.

In reinforcement learning, we distinguish between the agent and the environment. The agent is the decision-maker, and everything else constitutes the environment. The agent receives a state from the environment and makes a decision (an action); in return, it receives a reward and the next state. Note that it is sometimes necessary to distinguish the state from the observation: the agent might not perceive every aspect of the environment's state, only a partial observation.

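
To make the interaction loop concrete, here is a minimal sketch in Python. The `ToyEnv` class and its `reset`/`step` interface are hypothetical stand-ins invented for illustration, not taken from any particular library.

```python
import random

class ToyEnv:
    """A tiny made-up episodic environment, used only to illustrate the loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 taken in state 0 ends the episode with reward +1;
        # anything else yields reward 0 and the episode continues.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True
        return self.state, 0.0, False

def random_policy(state):
    # The decision depends only on the current state (Markov property);
    # here the agent simply picks an action uniformly at random.
    return random.choice([0, 1])

def run_episode(env, policy, max_steps=100):
    state = env.reset()
    trajectory = []                                    # (state, action, reward) triples
    for _ in range(max_steps):
        action = policy(state)                         # agent decides
        next_state, reward, done = env.step(action)    # environment responds
        trajectory.append((state, action, reward))
        if done:
            break
        state = next_state
    return trajectory

print(run_episode(ToyEnv(), random_policy))
```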
## Definition of Markov Decision Process

In a finite Markov Decision Process, the next state and reward follow a fixed probability distribution determined by the previous state and action:

$$p(s',r | s,a) \doteq \Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$

where $s',s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$. Here, $p$ characterizes the dynamics of the Markov Decision Process.

From this four-argument dynamics function, other quantities describing the environment can be derived.

For example, the state-transition probabilities are:

$$p(s'|s,a) \doteq \Pr\{S_t = s' | S_{t-1} = s, A_{t - 1}=a\} = \sum_{r\in\mathcal{R}}p(s',r|s,a)$$

The expected reward for a state-action pair is:

$$r(s,a) \doteq \mathbb{E}[R_t | S_{t-1}=s, A_{t-1}=a] = \sum_{r\in\mathcal{R}}r \sum_{s'\in\mathcal{S}} p(s',r|s,a)$$

The expected reward for a state-action-next-state triple is:

$$r(s,a, s') \doteq \mathbb{E}[R_t | S_{t-1}=s, A_{t-1}=a, S_t = s'] = \sum_{r\in\mathcal{R}}r \frac{p(s',r|s,a)}{p(s'|s,a)}$$

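
These derived quantities are easy to compute once the four-argument dynamics are available. The sketch below stores a small made-up dynamics table as a dictionary mapping `(s, a)` to a distribution over `(s', r)` pairs; the states, actions, rewards, and probabilities are invented purely for illustration.

```python
# Hypothetical dynamics p(s', r | s, a), stored as
#   dynamics[(s, a)] = {(s_next, r): probability}.
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 2.0): 1.0},
    ("s1", "a0"): {("s1", 0.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s'|s,a) = sum over r of p(s',r|s,a)."""
    return sum(p for (sn, _), p in dynamics[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """r(s,a) = sum over s',r of r * p(s',r|s,a)."""
    return sum(r * p for (_, r), p in dynamics[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    """r(s,a,s') = [sum over r of r * p(s',r|s,a)] / p(s'|s,a)."""
    numer = sum(r * p for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)
    return numer / transition_prob(s, a, s_next)

print(transition_prob("s0", "a0", "s1"))             # 0.5
print(expected_reward("s0", "a0"))                   # 0.5
print(expected_reward_given_next("s0", "a0", "s1"))  # 1.0
```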
## Reward and Return

In reinforcement learning, the agent's objective is to maximize cumulative reward. We call this cumulative reward the return. In its simplest form:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$$

However, the return defined above is only suitable for episodic tasks, i.e., tasks with a natural terminal state; it does not apply to continuing tasks. For continuing tasks, a discount factor is generally introduced, so that rewards further in the future are weighted less heavily:

$$ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$

where $0 \le \gamma \le 1$ is the discount rate; for continuing tasks we need $\gamma < 1$ so that the infinite sum converges (assuming bounded rewards). The return can also be written in a recursive form:

$$ G_t \doteq R_{t+1} + \gamma G_{t+1}$$

Both forms can be unified into a single expression, allowing either $T = \infty$ (continuing tasks) or $\gamma = 1$ (episodic tasks), but not both:

$$ G_t \doteq \sum_{k=t+1}^T \gamma^{k-t-1}R_k$$

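
The recursive form suggests a simple way to compute returns from a recorded episode: sweep backwards over the rewards. A minimal sketch follows; the reward list and discount rate are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for every time step of a finished episode, where rewards[t]
    holds R_{t+1}. Works backwards using G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0                              # G_T = 0 beyond the terminal step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g       # the recursive form of the return
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
# approximately [2.62, 1.8, 2.0]
```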
## Policy and Value Function

Almost all reinforcement learning algorithms involve estimating value functions, which measure the expected return from a given state or state-action pair. A value function is always defined with respect to a particular way of selecting actions; this mapping from states to a distribution over actions is called the policy. A policy $\pi(a|s)$ gives the probability of taking action $a$ in state $s$.

The state-value function is defined as:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t | S_t = s] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \bigg| S_t = s\right]$$

The action-value function is defined similarly:

$$q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t | S_t = s, A_t = a] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \bigg| S_t = s, A_t = a\right]$$

Both value functions can be estimated from experience: simply maintain the average of the returns observed after each state (or state-action pair) while following a fixed policy. By the law of large numbers, these averages converge to the expected values. This is the main idea behind Monte Carlo methods. When the number of states is very large, however, keeping a separate average per state is no longer feasible; in that case we use function approximation to fit the value function, and the quality of the estimate depends on the chosen function class. Deep learning falls into this category.

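
A minimal sketch of this Monte Carlo idea (every-visit averaging of returns per state); the trajectories and discount rate below are made up for illustration.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9):
    """Estimate v_pi(s) by averaging the returns observed after each visit to s.
    Each episode is a list of (state, reward) pairs, where the reward is the
    one received after leaving that state, collected while following pi."""
    total = defaultdict(float)   # sum of returns observed from each state
    count = defaultdict(int)     # number of visits to each state
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):  # backwards, so G_t is at hand
            g = reward + gamma * g
            total[state] += g
            count[state] += 1
    return {s: total[s] / count[s] for s in total}

episodes = [[("s0", 0.0), ("s1", 1.0)],
            [("s0", 1.0)]]
print(mc_state_values(episodes))   # approximately {'s1': 1.0, 's0': 0.95}
```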
Mirroring the recursive form of the return, the state-value function can also be written recursively; this is the Bellman equation.

$$
\begin{align*}
 v_\pi(s) &\doteq \mathbb{E}_\pi[G_t | S_t = s] \\
 &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
 &= \sum_a \pi(a|s)\sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \mathbb{E}_\pi[G_{t+1}|S_{t+1}=s']] \\
 &= \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]
\end{align*}
$$

A similar Bellman equation can be derived for the action-value function.

$$
\begin{align*}
q_\pi(s,a) &\doteq \mathbb{E}_\pi[G_t | S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\
&= \sum_{s'}\sum_r p(s',r|s,a)[r + \gamma \sum_{a'}\pi(a'|s')\mathbb{E}_\pi[G_{t+1}|S_{t+1}=s', A_{t+1}=a']] \\
&= \sum_{s',r} p(s',r|s,a)[r + \gamma \sum_{a'}\pi(a'|s')q_\pi(s',a')]
\end{align*}
$$

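
The Bellman equation for $v_\pi$ can be turned directly into an update rule: repeatedly replace $v(s)$ by its right-hand side until the values stop changing. This is iterative policy evaluation; the sketch below applies it to a made-up two-state MDP, with the dynamics stored in the same `(s, a) -> {(s', r): probability}` format as in the earlier sketch.

```python
# A made-up two-state MDP and a uniform random policy, for illustration only.
dynamics = {
    ("s0", "a0"): {("s1", 0.0): 1.0},
    ("s0", "a1"): {("s0", 1.0): 1.0},
    ("s1", "a0"): {("s1", 0.0): 1.0},
    ("s1", "a1"): {("s0", 2.0): 1.0},
}
states = ["s0", "s1"]
actions = ["a0", "a1"]
policy = {s: {a: 0.5 for a in actions} for s in states}   # pi(a|s) = 0.5
gamma = 0.9

def evaluate_policy(policy, theta=1e-8):
    """Sweep the Bellman equation as an assignment until the values converge."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s_next])
                                   for (s_next, r), p in dynamics[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

print(evaluate_policy(policy))   # approximate v_pi for both states
```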
## Optimal Policy and Optimal Value Function

If a policy achieves an expected return at least as high as that of every other policy in every state, we call it an optimal policy. The optimal state-value function is defined, for all $s \in \mathcal{S}$, as:

$$ v_\ast(s) \doteq \max_\pi v_\pi(s)$$

There may be multiple optimal policies, but they all share the same optimal value function (otherwise one of them would fail to be optimal in some state).

Correspondingly, there is also an optimal action-value function:

$$q_\ast(s,a) \doteq \max_\pi q_\pi(s,a) $$

The two are related as follows:

$$q_\ast(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\ast(S_{t+1}) | S_t=s, A_t=a] $$

Based on the properties of the optimal policy, the Bellman optimality equation can be derived:

$$
\begin{align*}
v_\ast(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_\ast}(s, a) \\
&= \max_a \mathbb{E}_{\pi_\ast}[G_t | S_t=s, A_t=a] \\
&= \max_a \mathbb{E}_{\pi_\ast}[R_{t+1} + \gamma G_{t+1} | S_t=s, A_t=a] \\
&= \max_a \mathbb{E}[R_{t+1} + \gamma v_\ast(S_{t+1}) | S_t=s, A_t=a] \\
&= \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma v_\ast(s')]
\end{align*}
$$

Similarly, the Bellman optimality equation for the action-value function can be obtained:

$$
\begin{align*}
q_\ast(s,a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'}q_\ast(S_{t+1}, a')|S_t=s, A_t=a] \\
&= \sum_{s',r} p(s',r|s,a)[r + \gamma \max_{a'}q_\ast(s',a')]
\end{align*}
$$

For a finite MDP, the Bellman optimality equation has a unique solution. It is really a system of equations, one per state: with $|\mathcal{S}|$ states there are $|\mathcal{S}|$ equations in $|\mathcal{S}|$ unknowns. When the dynamics function is known, this system of nonlinear equations can, in principle, be solved with any suitable method.

Deriving an optimal policy from the optimal state-value function requires only a one-step search: in each state, choose an action that maximizes the expression inside the Bellman optimality equation. With the optimal action-value function it is even more convenient: the action with the largest $q_\ast(s,a)$ in a given state is already a best action.

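
A minimal sketch of both steps: solving the Bellman optimality equation by turning it into an iterative update (value iteration), then extracting a greedy policy with the one-step search described above. The toy MDP is the same made-up one used in the policy-evaluation sketch.

```python
# The same made-up MDP as in the policy-evaluation sketch.
dynamics = {
    ("s0", "a0"): {("s1", 0.0): 1.0},
    ("s0", "a1"): {("s0", 1.0): 1.0},
    ("s1", "a0"): {("s1", 0.0): 1.0},
    ("s1", "a1"): {("s0", 2.0): 1.0},
}
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9

def backup(v, s, a):
    """One-step lookahead: sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    return sum(p * (r + gamma * v[s_next])
               for (s_next, r), p in dynamics[(s, a)].items())

def value_iteration(theta=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(backup(v, s, a) for a in actions)   # Bellman optimality update
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # One-step search: in each state, act greedily with respect to v_*.
    greedy = {s: max(actions, key=lambda a: backup(v, s, a)) for s in states}
    return v, greedy

v_star, pi_star = value_iteration()
print(v_star)    # approximately {'s0': 10.0, 's1': 11.0}
print(pi_star)   # {'s0': 'a1', 's1': 'a1'}
```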
Although the Bellman optimality equation has an exact solution in principle, it can rarely be computed directly. Doing so relies on three conditions:

1. Accurate knowledge of the environment's dynamics function.
2. Sufficient computational resources.
3. Satisfaction of the Markov property.

These conditions are rarely all met in practice, so most reinforcement learning methods can be viewed as ways of solving the Bellman optimality equation approximately.

## Approximate Optimal Solutions

Besides the conditions above, memory is also a significant limitation in practice. If the task's state set is small enough, approximate value functions can be stored in memory as a table; such approaches are known as tabular methods. In many practical problems, however, the table would be far too large to fit in memory, and function approximation must be used to represent the value function.

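
As an illustration of the function-approximation idea, here is a minimal sketch of a linear value-function approximator trained toward Monte Carlo returns (a gradient Monte Carlo style update). The feature mapping and the (state, return) samples are made up for illustration; real applications use a problem-specific encoding.

```python
import numpy as np

def features(state):
    """Hypothetical feature vector for a state; real applications would use
    something problem-specific (tile coding, a neural network, ...)."""
    return np.array([1.0, float(state)])

def gradient_mc_update(w, state, g, alpha=0.05):
    """Nudge w so that v_hat(state) = w . x(state) moves toward the return g."""
    x = features(state)
    error = g - w @ x                 # observed return minus current estimate
    return w + alpha * error * x      # gradient step for a linear approximator

# Made-up (state, observed return) pairs, as Monte Carlo rollouts might produce.
samples = [(0, 1.0), (1, 2.0), (2, 2.5)]
w = np.zeros(2)
for _ in range(500):
    for state, g in samples:
        w = gradient_mc_update(w, state, g)

print([round(float(w @ features(s)), 2) for s, _ in samples])
# roughly [1.08, 1.83, 2.58]: the best linear fit rather than an exact match,
# which is the price (and the point) of approximation
```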
Approximation can also bring unexpected benefits. For instance, suboptimal actions in rarely visited states barely affect the overall return, and online learning naturally devotes more updates to frequently visited states, making their value estimates more accurate. Computation can therefore be saved without introducing excessive error.