Problem Statement: Draw or describe the optimal state-value function for the golf example (below).
To formulate playing a hole of golf as a reinforcement learning task, we count a penalty (negative reward) of -1 for each stroke until we hit the ball into the hole. The state is the location of the ball. The value of a state is the negative of the number of strokes to the hole from that location. Our actions are how we aim and swing at the ball, of course, and which club we select. Let us take the former as given and consider just the choice of club, which we assume is either a putter or a driver. The upper part of Figure 3.3 shows a possible state-value function, $v_{\textrm{putt}}(s)$, for the policy that always uses the putter.
Figure 3.3: A golf example: the state-value function for putting (upper) and the optimal action-value function for using the driver (lower).
Recall the relationship $v_*(s) = \max\limits_a q_*(s, a)$. In the golf example, there are just two actions: choosing the club from {putter, driver}. So we can rewrite this relationship more specifically as

$$v_*(s) = \max\{\, q_*(s, \textrm{putter}),\ q_*(s, \textrm{driver}) \,\}.$$
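As a quick illustration of this maximization (with made-up placeholder q-values for a few discretized ball locations, not numbers from the figure):

```python
# Toy sketch: v*(s) = max over the two clubs of q*(s, a).
# The q-values below are illustrative placeholders, not taken from Figure 3.3.
q_star = {
    "tee":     {"putter": -6.0, "driver": -3.0},
    "fairway": {"putter": -4.0, "driver": -2.0},
    "green":   {"putter": -1.0, "driver": -2.0},
}

# The optimal state value at each location is the better of the two clubs.
v_star = {s: max(q.values()) for s, q in q_star.items()}
print(v_star)  # {'tee': -3.0, 'fairway': -2.0, 'green': -1.0}
```

Note that on the green the putter dominates, while everywhere else the driver does, which is exactly the structure the observations below exploit.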
Note the following observations:
We are told that the value of a state is the negative of the number of strokes to the hole from that location. So locations closer to the hole, which require fewer strokes to sink, have greater (i.e. less negative) value. So we know that the optimal state-value function will roughly increase as the location approaches the hole. At the hole itself, the optimal state-value will assign a value of 0.
We are told that a putt can sink a ball anywhere on the green in one shot. So the optimal state-value function will assign -1 to all points on the green.
From the lower half of Figure 3.3, we can tell that a driver can hit the ball from the tee to the -2 region and from the -2 region to the green. So the optimal policy would use a driver for the first shot, a driver for the second shot (reaching the green), and then putt into the hole on the third shot. This means that the optimal state-value function diagram will look the same as the optimal action-value function for the driver (the lower half of Figure 3.3), except on the green, where every location is instead assigned the value -1.
Visually, the optimal state-value function assigns 0 at the hole, -1 everywhere on the green, -2 throughout the region from which a single drive can reach the green, and -3 at the tee and anywhere else from which two drives are needed to reach the green.
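The backward reasoning above can be checked with a tiny value-iteration sketch. The chain below is an assumed toy model (states and transitions are mine, not the book's): deterministic moves, reward -1 per stroke, and a putter that sinks the ball from anywhere on the green but barely advances it elsewhere.

```python
# Minimal toy model of the golf hole (assumed transitions, not from Figure 3.3).
# Each stroke costs -1; "hole" is terminal with value 0.
transitions = {  # transitions[state][club] -> next state
    "tee":   {"driver": "mid",   "putter": "tee"},   # assume a putt from the tee gains nothing
    "mid":   {"driver": "green", "putter": "tee"},   # "mid" stands in for the -2 region
    "green": {"driver": "mid",   "putter": "hole"},  # a putt sinks from anywhere on the green
}

v = {s: 0.0 for s in transitions}
v["hole"] = 0.0  # terminal state

# Synchronous Bellman optimality backups: v(s) <- max_a [ -1 + v(s') ].
for _ in range(10):
    v.update({s: max(-1 + v[s2] for s2 in acts.values())
              for s, acts in transitions.items()})

print(v)  # {'tee': -3.0, 'mid': -2.0, 'green': -1.0, 'hole': 0.0}
```

The fixed point matches the description above: -3 from the tee, -2 from the drive-to-green region, -1 on the green, 0 at the hole.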

