A Markov Decision Process (MDP) is a model of how the state of a system evolves as different actions are applied to it. It consists of the following quantities:
- A set of states $S$ where the agent can be at any time.
- A set of actions $A$. Any action can change the state.
- A transition function $T$ such that $T(s, a, s') = P(s'|s, a)$. It is a probability distribution, i.e. $\sum_{s'\in S}T(s, a, s')=1$ for all $s\in S$ and $a\in A$.
- A reward function $r:S\times A \rightarrow \mathbb{R}$.
These together form an MDP $(S, A, T, r)$. Starting from an initial state, the goal of RL is to find a trajectory with the largest return (cumulative reward). To handle infinite or very long trajectories, a discounted return is used in which later rewards are down-weighted by a discount factor $\gamma \in (0, 1)$:
$ R(\tau) = \sum_{t=0}^{\infty}\gamma^{t}r(s_t, a_t) $
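As a small, self-contained illustration (not from the source; the function name and numbers are invented), the discounted return of a finite trajectory of rewards can be computed directly:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over the given per-step rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: rewards 1, 0, 0, 1 with gamma = 0.9 gives 1 + 0.9**3 = 1.729
print(discounted_return([1, 0, 0, 1], gamma=0.9))
```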
Markov assumption: the next state depends only on the current state and the current action, not on any earlier ones.
The rest of these notes build up value iteration, an algorithm to find the best trajectories of an MDP.
A stochastic policy, or policy in short, is a conditional probability distribution $\pi(a|s)$ over actions given the current state, i.e. $\sum_{a\in A}\pi(a|s)=1$ for all $s\in S$.
The value function of a policy $\pi$, evaluated at a starting state $s_0$, is the expected discounted return of the trajectories obtained by following $\pi$:
$ V^{\pi}(s_0) = \mathbb{E}_{a_t \sim \pi(s_t)} [R(\tau)] = \mathbb{E}_{a_t \sim \pi(s_t)} [\sum_{t=0}^{\infty}\gamma^{t}r(s_t, a_t)] $
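To make the expectation concrete, here is a hedged, self-contained sketch (all names and numbers are invented for illustration) that estimates $V^{\pi}(s_0)$ for a tiny tabular MDP by averaging discounted returns over sampled trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP; all numbers are arbitrary.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],                 # r[s, a]
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],                # pi[s, a] = pi(a | s)
               [0.2, 0.8]])
gamma = 0.9

def rollout_return(s0, horizon=200):
    """Sample one trajectory under pi and return its truncated discounted return."""
    s, ret = s0, 0.0
    for t in range(horizon):
        a = rng.choice(2, p=pi[s])        # sample an action from the policy
        ret += gamma ** t * r[s, a]
        s = rng.choice(2, p=P[s, a])      # sample the next state from the transition model
    return ret

# Monte Carlo estimate of V^pi(s0 = 0)
print(np.mean([rollout_return(0) for _ in range(2000)]))
```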
All algorithms in RL follow the same pattern: by the Markov assumption, the value at state $s_0$ can be written as the sum of two terms, the reward of the action $a_0$ taken at $s_0$ and the value function of the rest of the trajectory, which is very intuitive:
$ V^{\pi}(s_0) = \mathbb{E}_{a_0 \sim \pi(s_0)}[ r(s_0, a_0) + \gamma\, \mathbb{E}_{s_1 \sim P(s_1 | s_0, a_0)}[ V^{\pi}(s_1) ] ] $
We can write this explicitly in terms of the probabilities, for any state $s$, as:
$
V^{\pi}(s) = \sum_{a\in A} \pi(a|s) [
r(s,a) + \gamma \sum_{s' \in S} P(s' | s, a) V^{\pi}(s')
]
$
for all $s \in S$.
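Since this equation is linear in $V^{\pi}$, for a small finite MDP it can also be solved directly; the sketch below is not from the source and reuses the toy arrays from the sampling example (repeated so it runs standalone):

```python
import numpy as np

# Same toy MDP as before: P[s, a, s'], r[s, a], pi[s, a]; all numbers arbitrary.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.2, 0.8]])
gamma = 0.9

# Policy-averaged reward r_pi(s) and state-to-state transition matrix P_pi(s, s').
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)

# Bellman equation V = r_pi + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(V_pi)   # should be close to the Monte Carlo estimate above for state 0
```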
Another useful notation is the action-value function, which is the value obtained by taking action $a_0$ at state $s_0$ and following the policy afterwards:
$ Q^\pi(s_0, a_0) = r(s_0, a_0) + \gamma\, \mathbb{E}_{s_1 \sim P(s_1 | s_0, a_0)}[ V^\pi(s_1) ] $
or, similar to before, we can expand it in terms of itself as:
$
Q^\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S}P(s'|s, a) \sum_{a' \in A}\pi(a' | s')Q^\pi(s', a')
$
for all $s \in S$ and $a \in A$.
The optimal policy is the one that maximizes the value function of the starting state:
$ \pi^* = \argmax_{\pi} V^\pi(s_0) $
Furthermore, we write $V^* \equiv V^{\pi^*}$ and $Q^* \equiv Q^{\pi^*}$.
For a deterministic policy, we can write:
$ \pi^*(s) = \argmax_{a\in A} [ r(s, a) + \gamma \sum_{s' \in S} P(s' | s, a) V^*(s') ] $
A mnemonic for this statement: the optimal action at state $s$ (for a deterministic policy) is the one that maximizes the sum of the reward $r(s, a)$ from the first stage and the average of the optimal returns of the trajectories starting from every possible next state $s'$ in the second stage.
This is the principle of dynamic programming: the remainder of an optimal policy is also optimal. It follows from the formulas above by plugging in the optimal policy. Hence, for a deterministic optimal policy, we have:
$ V^*(s) = \max_{a\in A} [ r(s, a) + \gamma \sum_{s' \in S} P(s' | s, a) V^*(s') ] $
Or for a stochastic policy, the optimal value function is:
$
V^{*}(s) = \sum_{a\in A} \pi^*(a|s) [
r(s,a) + \gamma \sum_{s' \in S} P(s' | s, a) V^{*}(s')
]
$
for all $s \in S$.
We can turn the principle of dynamic programming into an algorithm for finding the optimal value function, called value iteration. The idea is to initialize the value function at iteration $0$ to some values $V_0(s)$ for all $s \in S$, and then repeat the following update:
$
V_{k+1}(s) = \max_{a \in A} [
r(s, a) + \gamma \sum_{s' \in S}P(s'|s, a)V_{k}(s')
]
$
for all $s \in S$.
It turns out that as $k \rightarrow \infty$, $V_k$ converges to the optimal value function $V^*$ regardless of the initialization $V_0$.
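A self-contained sketch of this update (not from the source; the toy `P` and `r` arrays and the stopping tolerance are invented for illustration), including the greedy extraction of $\pi^*$ from the final $V$:

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are arbitrary.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])    # r[s, a]
gamma = 0.9

V = np.zeros(P.shape[0])                   # V_0, arbitrary initialization
for k in range(1000):
    # Q_k(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V_k(s')
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    V_new = Q.max(axis=1)                  # V_{k+1}(s) = max_a Q_k(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop when the update barely changes V
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)                 # greedy (deterministic) policy from V*
print(V, pi_star)
```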
The same value iteration algorithm can be written for the Action-value function.
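As a sketch (the source does not spell it out; the quantities are the same as above), the corresponding update is
$
Q_{k+1}(s, a) = r(s, a) + \gamma \sum_{s' \in S}P(s'|s, a)\max_{a' \in A} Q_{k}(s', a')
$
for all $s \in S$ and $a \in A$.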
The value iteration algorithm allows us to compute the optimal value function $V^{\pi^*}$ of the optimal deterministic policy $\pi^*$. We can use the same kind of iteration to compute the value function associated with any other, potentially stochastic, policy $\pi$:
$
V_{k+1}^\pi(s) = \sum_{a\in A} \pi(a|s) [
r(s,a) + \gamma \sum_{s' \in S}P(s'|s,a) V_k^\pi(s')
]
$
for all $s \in S$.
This algorithm is known as policy evaluation and is useful for computing the value function of any given policy. Again, as before, as $k \rightarrow \infty$, $V_k^\pi$ converges to $V^\pi$ regardless of the initialization.
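A self-contained sketch of policy evaluation (the toy arrays are repeated so it runs standalone; numbers are arbitrary), which should agree with the linear-solve sketch earlier:

```python
import numpy as np

# Same toy MDP and policy as in the earlier sketches.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.2, 0.8]])
gamma = 0.9

V = np.zeros(P.shape[0])                          # V_0^pi, arbitrary initialization
for k in range(1000):
    # V_{k+1}(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_{s'} P(s'|s,a) V_k(s') ]
    V_new = (pi * (r + gamma * np.einsum('sat,t->sa', P, V))).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop at (approximate) convergence
        V = V_new
        break
    V = V_new

print(V)   # value function of the stochastic policy pi
```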
Since the material at d2l.ai was not complete, I switched to the Stanford course for the rest of RL.