RL is a learning method that uses a reward signal to guide an agent in its search for the best path to a goal. An RL system has the following components:
- $s$: state, the agent's observation of the environment.
- $a$: action, how the agent interacts with the environment.
- $\pi(s)$: policy, the rule by which the agent decides which action $a$ to take in state $s$.
- $r$: reward, the feedback the agent receives after performing an action.
- $\gamma$: discount factor, how much the agent cares about future rewards.
- $V$: value function.
- $Q$: (quality) action-value function.
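These symbols map directly onto a standard environment-interaction loop. Below is a minimal sketch, assuming the `gymnasium` CartPole environment purely for illustration (the repo's own scripts are referenced at the end of this section), showing where $s$, $a$, and $r$ appear in code.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)       # s: the agent's observation of the environment

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()   # a: a random stand-in for a learned policy pi(s)
    state, reward, terminated, truncated, _ = env.step(action)   # r: reward for this action
    total_reward += reward
    done = terminated or truncated

print("episode return:", total_reward)
env.close()
```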
- At each time step $t$, the agent chooses an action based on the current state and receives a reward $r_t$.
- Along the path, by following the policy $\pi$, the total reward $R$ is accumulated (a small numerical sketch of this sum appears after this list): $$R_{\pi} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots + \gamma^T r_{t+T}$$ where $\gamma$ is the discount factor, which makes recent rewards more important than future ones.
- The ultimate goal is to maximize the accumulated reward $R$. However, since the environment is stochastic, the agent will not always get the same reward $r_t$ for the same action $a$. Therefore, the accumulated reward $R$ is expressed as an expectation, which is called the Value: $$V_{\pi}(s) = E\big[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^T r_{t+T} \mid s, \pi\big]$$
- Given this learning objective, we should of course choose the action that maximizes the value function. This is expressed by the Bellman equation:
$$V(s) = \max_a[r(s,a) + \gamma V(s')]$$
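As a quick numerical illustration of the discounted return $R_\pi$ above, the sketch below accumulates it from a hand-made list of per-step rewards with $\gamma = 0.9$ (the rewards are invented for illustration):

```python
# R = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... : later rewards count less.
def discounted_return(rewards, gamma=0.9):
    R = 0.0
    for k, r in enumerate(rewards):
        R += (gamma ** k) * r
    return R

# Example: four steps with zero reward, then reward 1 at the goal.
print(discounted_return([0, 0, 0, 0, 1]))   # 0.9**4 = 0.6561
```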
In the grid-world example below, the agent gets a reward of $r=1$ when it reaches the goal at $s=(3,3)$.
- We start near the goal, at $s=(3,3)$. The value of this state is $V(s=(3,3)) = 1$, where the reward is $r=1$ and the episode stops (there is no further state).
- Then we move one step back, to the left, to $s=(3,2)$. The value of this state is $V(s=(3,2)) = 0 + 0.9 \cdot V(s=(3,3)) = 0.9$.
- Then we continue this procedure for all the other states.
This procedure is basically "Dynamic Programming".
- We build the value map starting from the locations near the goal and extend it to all the other locations in the map.
- From the value map above, we can easily plan the agent's moves: at each state, step toward the neighboring state with the highest value (see the sketch below).
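A minimal sketch of this backward value sweep (value iteration), assuming a 4x4 grid, a single reward of 1 at the goal $(3,3)$, deterministic up/down/left/right moves, and $\gamma = 0.9$; the grid size and reward layout are assumptions chosen to match the worked example above.

```python
import numpy as np

GAMMA = 0.9
GOAL = (3, 3)
ROWS, COLS = 4, 4

V = np.zeros((ROWS, COLS))

# Repeatedly back up V(s) = max_a [r(s,a) + gamma * V(s')] until the map converges.
for _ in range(100):
    V_new = np.zeros_like(V)
    for i in range(ROWS):
        for j in range(COLS):
            if (i, j) == GOAL:
                V_new[i, j] = 1.0            # reward r=1 at the goal, then stop
                continue
            best = 0.0
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # up, down, left, right
                ni, nj = i + di, j + dj
                if 0 <= ni < ROWS and 0 <= nj < COLS:
                    best = max(best, 0.0 + GAMMA * V[ni, nj])   # r=0 away from the goal
            V_new[i, j] = best
    if np.allclose(V_new, V):
        break
    V = V_new

print(np.round(V, 3))   # e.g. V[3, 2] = 0.9, V[3, 1] = 0.81, ...
```

Reading the greedy action off this map, i.e. moving to the neighbor with the highest value, recovers the plan described above.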
In the example above, the action is 100% certain, e.g. a turn-left command will definitely move the agent to the left. However, in practice there are always uncertainties, e.g. a turn-left command may move the agent to the left in 80% of cases, while in the remaining 20% of cases it may move to the right or forward.
To make the problem easier, we require the process to be a Markov Process, that is:

In a Markov process, the future state depends only on the current state, not on the sequence of past states.

And the Bellman equation is rewritten in stochastic form, replacing the single next state with an expectation over the possible next states $s'$:

$$V(s) = \max_a\Big[r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s')\Big]$$
- From the Bellman equation, we define the quality of an action (left figure) as: $$Q(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \;\Rightarrow\; V(s) = \max_a Q(s,a)$$ The value of the state, $V(s)$, is the maximum quality $Q(s,a)$ over actions.
- Then, substituting $V(s') = \max_{a'} Q(s',a')$ (right figure), we can write the Q-function independently of $V(s)$: $$Q(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a')$$
- For shorter notation, we can leave the stochastic transition implicit and simplify the Q-function as follows.

$$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s',a')$$
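To make the relation between $Q$ and $V$ concrete, here is a tiny sketch for a single state and the turn-left action from the 80%/20% example above; the transition probabilities, next-state values, and the other actions' Q-values are made-up numbers for illustration only.

```python
GAMMA = 0.9

# Hypothetical one-step model for state s and action a = "left":
# the command succeeds 80% of the time, otherwise the agent slips.
P_next = {"s_left": 0.8, "s_right": 0.1, "s_forward": 0.1}   # P(s' | s, a=left)
V_next = {"s_left": 0.9, "s_right": 0.5, "s_forward": 0.6}   # assumed V(s') values
r = 0.0                                                      # assumed r(s, a=left)

# Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s')
Q_left = r + GAMMA * sum(p * V_next[s2] for s2, p in P_next.items())
print(round(Q_left, 4))          # 0.9 * (0.8*0.9 + 0.1*0.5 + 0.1*0.6) = 0.747

# V(s) = max_a Q(s,a): the state's value is the best action's quality.
Q_all = {"left": Q_left, "right": 0.40, "forward": 0.55}     # other Q-values assumed
V_s = max(Q_all.values())
print(V_s)
```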
- Theoretically, if $Q(s,a)$ were known ahead of time, everything would be very easy: we would just choose the action that maximizes the quality.
- However, $Q(s,a)$ is only an estimate, which hopefully gets closer and closer to the true value along the learning path. Therefore, after taking an action, there is always a mismatch between the previous estimate and the re-estimated quality. This amount of mismatch is called the temporal difference (TD): $$TD_t(s,a) = r(s,a) + \gamma \max_{a'} Q_{t-1}(s',a') - Q_{t-1}(s,a)$$
- And we update the Q-function as follows:
$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha\, TD_t(s,a)$$ Note that when $\alpha=1$, this reduces to the ideal Bellman equation, i.e. we discard all the experience learned in the past and overwrite the Q-function with the latest quality estimate. However, this is not a good idea, because the newly estimated value is not necessarily the best value to represent the state; for example, it may come from a rare event due to the random process.
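Putting the update rule together, below is a minimal tabular Q-learning sketch using epsilon-greedy exploration on the `gymnasium` FrozenLake environment; the environment choice and hyperparameters are assumptions made for illustration only, and the repo's own implementation is the q_learning.py referenced below.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount factor, exploration rate

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy: mostly exploit argmax_a Q(s,a), occasionally explore.
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))

        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated

        # TD_t(s,a) = r + gamma * max_a' Q(s',a') - Q_{t-1}(s,a)
        td = r + gamma * np.max(Q[s_next]) - Q[s, a]
        # Q_t(s,a) = Q_{t-1}(s,a) + alpha * TD_t(s,a)
        Q[s, a] += alpha * td

        s = s_next
```

With $\alpha = 0.1$ the new estimate is blended into the old one rather than replacing it, which is exactly the point made above about $\alpha = 1$.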
- The algorithm is implemented in q_learning.py.
- An example of cartpole is shown in cartpole.py.
- An example of mountain car is shown in mountain_car.py.