10  Temporal difference methods for prediction

One of the most fundamental concepts in reinforcement learning is temporal difference (TD) learning. TD learning is a combination of Monte Carlo (MC) and dynamic programming (DP) ideas: like MC, TD is model-free and can learn directly from experience. Like DP, TD updates estimates based on other learned estimates, without waiting for a final outcome (it bootstraps). That is, TD can learn on-line and does not need to wait until the whole sample-path has been observed. Due to bootstrapping, TD in general learns more efficiently than MC. In this module prediction using TD is considered.

10.1 Learning outcomes

By the end of this module, you are expected to:

  • Describe what Temporal Difference (TD) learning is.
  • Formulate the incremental update formula for TD learning.
  • Define the temporal-difference error.
  • Interpret the role of a fixed step-size.
  • Identify key advantages of TD methods over DP and MC methods.
  • Explain the TD(0) prediction algorithm.
  • Understand the benefits of learning online with TD compared to MC methods.

The learning outcomes relate to the overall learning goals number 3, 4, 6, 9, and 12 of the course.

10.2 Textbook readings

For this week, you will need to read Chapter 6 up to and including Section 6.3 in Sutton and Barto (2018). Read it before continuing this module. A summary of the book notation can be seen here.

Slides for this module can be seen here. You do not have to look at them before the lecture!

10.3 What is TD learning?

Given a policy \(\pi\), we want to estimate the state-value function. Recall that the state value function is \[ v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]. \] where the return is \[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1} \]
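To make the recursive form of the return concrete, here is a minimal Python sketch (not code from the course notebook) that computes a discounted return backwards using \(G_t = R_{t+1} + \gamma G_{t+1}\); the function name and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + ... by working backwards
    through the rewards using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up rewards R_{t+1}, R_{t+2}, R_{t+3} observed after time t
print(discounted_return([1, 0, 2], gamma=0.9))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```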

Let \(V\) denote the state-value estimate. Under MC prediction we used an incremental update formula: \[ V(S_t) \leftarrow V(S_t) + \alpha_n\left[G_t - V(S_t)\right], \] where \(n\) denotes the number of observations and \(\alpha_n\) the step-size. Different values of \(\alpha_n\) were discussed in Chapter 9. Here we assumed a stationary environment (the state set, transition probabilities, etc. are the same for each stage \(t\)), e.g. for the sample average \(\alpha_n = 1/n\). If the environment is non-stationary (e.g. transition probabilities change over time) then a fixed step-size may be more appropriate. Let us for the remainder of this module consider a non-stationary process with a fixed step-size: \[ V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]. \]
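As a small illustration of the two step-size choices, the sketch below (illustrative Python with made-up returns, not course code) applies the incremental update with a sample-average step-size \(\alpha_n = 1/n\) and with a fixed step-size \(\alpha = 0.1\).

```python
def incremental_update(v, g, alpha):
    """One incremental update V <- V + alpha * (G - V)."""
    return v + alpha * (g - v)

returns = [2.0, 0.0, 4.0]   # made-up observed returns for one state

# Sample-average step-size alpha_n = 1/n (suitable for a stationary environment)
v = 0.0
for n, g in enumerate(returns, start=1):
    v = incremental_update(v, g, alpha=1.0 / n)
print(v)  # 2.0 = the sample average of the returns

# Fixed step-size (suitable for a non-stationary environment)
v = 0.0
for g in returns:
    v = incremental_update(v, g, alpha=0.1)
print(v)  # a recency-weighted average of the returns and the initial estimate
```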

Note, as pointed out in Section 5.5, a fixed step-size corresponds to a weighted average of the past observed returns and the initial estimate of the value of \(S_t\): \[ \begin{align} V_{n+1} &= V_n +\alpha \left[G_n - V_n\right] \nonumber \\ &= \alpha G_n + (1 - \alpha)V_n \nonumber \\ &= \alpha G_n + (1 - \alpha)[\alpha G_{n-1} + (1 - \alpha)V_{n-1}] \nonumber \\ &= \alpha G_n + (1 - \alpha)\alpha G_{n-1} + (1 - \alpha)^2 V_{n-1} \nonumber \\ & \vdots \nonumber \\ &= (1-\alpha)^n V_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} G_i \end{align} \] That is, recent observations are given larger weights than old observations.
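The identity above can be checked numerically; the short sketch below (illustrative Python with made-up returns) applies the fixed step-size update repeatedly and compares the result with the closed-form weighted average.

```python
alpha, V1 = 0.2, 1.0
returns = [3.0, 0.0, 2.0, 1.0]   # made-up returns G_1, ..., G_n
n = len(returns)

# Repeated incremental updates with a fixed step-size
v = V1
for g in returns:
    v = v + alpha * (g - v)

# Closed-form: (1-alpha)^n * V_1 + sum_i alpha * (1-alpha)^(n-i) * G_i
w = (1 - alpha) ** n * V1 + sum(
    alpha * (1 - alpha) ** (n - i) * g for i, g in enumerate(returns, start=1)
)

print(v, w)  # the two values agree; the most recent return G_n has the largest weight
```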

For MC prediction we needed the whole sample path to get the realized return \(G_t\). However, since \[ \begin{align} v_\pi(s) &= \mathbb{E}_\pi[G_t | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1}| S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1}| S_t = s] + \gamma \mathbb{E}_\pi[v_\pi(S_{t+1}) | S_t = s], \end{align} \] then, given a realized reward \(R_{t+1}\), an estimate of the return \(G_t\) is \(R_{t+1} + \gamma V(S_{t+1})\) and the incremental update becomes: \[ V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]. \tag{10.1}\] As a result, we do not have to generate a whole sample-path (as for MC) to update the state-value estimate \(V(S_t)\) of state \(s = S_t\). Instead we only have to wait until the next state is observed and update the estimate of \(S_t\) given the estimate of the next state \(S_{t+1}\). As the estimate of \(S_{t+1}\) improves, the estimate of \(S_t\) also improves. The incremental update in Equation 10.1 is called TD(0) or one-step TD because it uses a one-step lookahead to update the estimate. Note that updating the estimates using TD resembles the DP update: \[ V(s = S_t) \leftarrow \sum_{a \in \mathcal{A}}\pi(a | s)\left( r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) V(s')\right) \] Here we updated the value by taking the expectation over all possible next states, which was only possible because we had a model. By using TD, we do not need a model to estimate the state-value.

The term \[ \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \] is called the temporal-difference error (TD error), since it is the difference between the current estimate \(V(S_t)\) and the better estimate \(R_{t+1} + \gamma V(S_{t+1})\).
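As a small illustration of Equation 10.1 and the TD error, here is a minimal sketch (illustrative Python, not code from the course notebook) of a single one-step TD update; storing \(V\) as a dictionary and the state labels are assumptions made for the example.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One-step TD update: V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]."""
    td_target = r + gamma * V[s_next]   # one-step estimate of the return G_t
    td_error = td_target - V[s]         # the TD error delta_t
    V[s] += alpha * td_error
    return td_error

# V stored as a dict mapping (made-up) states to value estimates
V = {"s1": 0.0, "s2": 0.5}
delta = td0_update(V, s="s1", r=1.0, s_next="s2", alpha=0.1, gamma=0.9)
print(V["s1"], delta)  # 0.145 and 1.45
```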

10.4 TD prediction

We can now formulate a TD(0) algorithm for predicting the state-values of a policy (see Figure 10.1). No stopping criterion is given, but one could stop when only small changes in the state-values are observed.

Figure 10.1: TD(0) policy prediction (Sutton and Barto 2018).

The algorithm is given for a process with episodes; however, it also works for continuing processes. In that case the inner loop runs over an infinite number of time-steps.
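A possible implementation of the episodic version could look like the sketch below (illustrative Python, not the book's pseudocode). The environment interface with `reset()` and `step(action)` returning `(next_state, reward, terminal)`, the `policy` function and the default parameter values are all assumptions of this sketch.

```python
def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) prediction of state-values for a given policy (episodic sketch)."""
    V = {}                                    # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()                       # initial state of the episode
        terminal = False
        while not terminal:                   # inner loop over the time-steps of one episode
            a = policy(s)                     # action chosen by the policy being evaluated
            s_next, r, terminal = env.step(a)
            v_s = V.get(s, 0.0)
            v_next = 0.0 if terminal else V.get(s_next, 0.0)   # value of a terminal state is 0
            V[s] = v_s + alpha * (r + gamma * v_next - v_s)    # one-step TD update
            s = s_next
    return V
```

For a continuing process the outer loop over episodes would be dropped and the inner loop would simply run indefinitely.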

10.4.1 TD prediction for action-values

Later we will use TD to improve the policy (control). Since we do not have a model, we need to estimate action-values instead, and the optimal policy can then be found using \(q_*\) (see Equation 7.3). To find \(q_*\), we first need to predict the action-values of a policy \(\pi\), and the incremental update in Equation 10.1 must be modified to use \(Q\)-values: \[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]. \]

Note that, given a policy \(\pi\), you need to know \(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}\), or SARSA for short, before you can make an update. This acronym gives its name to the SARSA algorithm for control in Chapter 11. Note that to ensure exploration of all action-values we need e.g. an \(\epsilon\)-soft behavioral policy.
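The corresponding update for action-values could be sketched as below (illustrative Python, not course code); storing \(Q\) as a dictionary keyed by (state, action) pairs and the state/action labels are assumptions of the example.

```python
def q_td0_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """TD(0) update of an action-value using the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})."""
    target = r + gamma * Q[(s_next, a_next)]      # one-step estimate of the return
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Q stored as a dict keyed by (state, action) pairs (made-up states and actions)
Q = {("s1", "left"): 0.0, ("s2", "right"): 1.0}
q_td0_update(Q, "s1", "left", r=0.0, s_next="s2", a_next="right", alpha=0.5, gamma=0.9)
print(Q[("s1", "left")])  # 0.45
```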

10.5 Benefits of TD methods

Let us try to summarize the benefits of TD prediction:

  • TD methods do not require a model of the environment (compared to DP).
  • TD methods can be implemented online, which can speed convergence (compared to MC methods which must wait until the end of the sample-path).
  • TD methods can learn from each transition regardless of what subsequent actions are taken, whereas off-policy MC methods can only learn from the tail of an episode in which the actions follow the target policy.
  • TD methods converge to the true value function for a sufficiently small fixed step-size parameter, or with probability 1 for a suitably decreasing step-size.
  • TD methods generally converge faster than MC methods, although this has not been formally proven.
  • TD methods are extremely useful for continuing tasks that cannot be broken down into episodes as required by MC methods.
  • TD can be seen as a method for prediction learning where you try to predict what happens next given your current action, get new information, and then make a new prediction. That is, you do not need a training set (as in supervised learning); instead the reward signal is observed as time goes by.
  • TD methods are good for sequential decision problems (multi-step prediction).
  • TD methods are scalable in the sense that computations do not grow exponentially with the problem size.

An example illustrating that TD methods converge faster than MC methods is given in Exercise 10.6.1.

10.6 Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Sometimes hints and solutions can be revealed. Beware, you will not learn by giving up too early. Put some effort into finding a solution!

10.6.1 Exercise - A random walk

Consider an MDP with states 2-6 and two terminal states 1 and 7. The possible transitions are given in Figure 10.2. All episodes start in the centre state, 4, and then proceed either left or right by one state on each step. We assume the stochastic policy \(\pi\) is used where each direction is chosen with equal probability. Episodes terminate either on the left (state 1) or on the right (state 7). When an episode terminates on the right, a reward of 1 is received; all other rewards are zero. If the discount factor equals 1, the state-value of each state is the probability of terminating on the right when starting from that state.

Figure 10.2: Possible transitions between states and rewards.
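If you want to experiment outside the notebook, a minimal sketch of this environment could look as follows (illustrative Python; the notebook may structure the environment differently). It only generates episodes under the equiprobable policy and does not answer the questions.

```python
import random

def random_walk_episode(start=4, left_terminal=1, right_terminal=7):
    """Generate one episode as a list of (state, reward, next_state) transitions.
    Each step moves left or right with equal probability; the reward is 1 only
    when the episode terminates in the right terminal state."""
    s, transitions = start, []
    while s not in (left_terminal, right_terminal):
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == right_terminal else 0.0
        transitions.append((s, reward, s_next))
        s = s_next
    return transitions

print(random_walk_episode())   # a random episode, e.g. [(4, 0.0, 5), (5, 0.0, 6), (6, 1.0, 7)]
```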

Consider this section in the Colab notebook to see the questions. Use your own copy if you already have one.