Temporal Difference (TD) learning is the most novel and central idea in reinforcement learning. It combines the advantages of Dynamic Programming (DP) and Monte Carlo (MC) methods.
Temporal Difference (TD) Prediction
The TD method (in particular, one-step TD(0)) updates the state-value function as follows:

V(St) ← V(St) + α [ Rt+1 + γ V(St+1) - V(St) ]

At each time step, the value of the previous state St is updated using the reward Rt+1 and the estimated value V(St+1) observed at the current time step.
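As a concrete illustration, here is a minimal sketch of tabular TD(0) prediction for a fixed policy. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the hyperparameter defaults are my own assumptions for illustration, not something prescribed by the book.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    Assumes a Gym-style interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # state-value estimates, default 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(St) toward the one-step target
            # Rt+1 + gamma * V(St+1); terminal states have value 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```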
As a reminder, below are the MC update and the DP update:
- MC method (constant-α MC): V(St) ← V(St) + α [ Gt - V(St) ]
  - Differences from the TD update:
    - In MC, the state value is updated with the return Gt, which is only observed at the end of the episode (see the sketch after this list).
    - In TD, the state value is updated immediately at the next time step (online). This usually results in much faster convergence for the TD estimate.
  - Similarity to the TD update:
    - Neither method requires a model of the environment to update the state-value function.
- DP method (expected update): vπ(s) ← Eπ[ Rt+1 + γ vπ(St+1) | St = s ]
  - Differences from the TD update:
    - In DP, the expected update sweeps over all possible next states and actions, which also requires a model of the environment.
    - In TD, only the transition actually visited along the trajectory is used (a sample update).
  - Similarity to the TD update:
    - Both are bootstrapping methods: they update the state-value estimate using existing state-value estimates of successor states.
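To make the contrast concrete, here is a minimal sketch of every-visit, constant-α MC prediction, mirroring the TD(0) sketch above. The environment interface and hyperparameter defaults are again assumptions for illustration only.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Constant-alpha Monte Carlo prediction, for contrast with TD(0).

    The whole episode must finish before any value is updated, because
    the return Gt is only known at the end of the episode.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Generate a complete episode first.
        episode = []  # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state
        # Work backwards to compute returns and update every visited state.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])  # MC update toward the full return
    return V
```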
Optimality of Temporal Difference (TD) Learning
Batch updating of TD(0):
- At every time step t at which a non-terminal state is visited, compute the increment α [ Rt+1 + γ V(St+1) - V(St) ], but do not apply it immediately.
- The state-value function V is updated only once, by the sum of all the increments, after a full pass over the batch.
- After each new episode, all episodes seen so far are treated as one batch and presented repeatedly to the algorithm until the value function converges.
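Below is a minimal sketch of batch TD(0) following these rules. The episode format (lists of (state, reward, next_state) transitions, with None marking a terminal transition) and the convergence tolerance are my own assumptions for illustration.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    """Batch TD(0): sweep the whole batch of episodes repeatedly,
    accumulating increments, and apply their sum once per sweep,
    until the value function stops changing.

    `episodes` is a list of episodes; each episode is a list of
    (state, reward, next_state) transitions, with next_state=None
    at the terminal step.
    """
    V = defaultdict(float)
    while True:
        increments = defaultdict(float)
        for episode in episodes:
            for state, reward, next_state in episode:
                target = reward + (gamma * V[next_state] if next_state is not None else 0.0)
                increments[state] += alpha * (target - V[state])
        # Apply the summed increments once per sweep over the batch.
        for state, delta in increments.items():
            V[state] += delta
        if max(abs(d) for d in increments.values()) < tol:
            break
    return dict(V)
```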
Example
Let’s say you want to estimate the state-value function by hand from the 8 episodes of observations below, where each observation is a state (A or B) followed by a reward. This is the batch-updating example from Sutton and Barto; the episodes are:
- A, 0, B, 0
- B, 1 (six such episodes)
- B, 0

- It is easy to see that V(B) = 3/4, because 6 of the 8 returns observed from B are 1 and 2 of them are 0.
- For A, batch TD and batch MC treat the data differently:
- MC
  - MC sees only one episode containing state A, and that episode has a return of 0, so it estimates V(A) = 0.
  - This answer minimizes the mean-squared error on the training data.
- TD(0) batch
  - Batch TD(0) uses V(B) = 3/4 to estimate V(A) = 3/4, because in the data A always transitions to B with a reward of 0.
  - This is the answer we would expect from the transition diagram of the underlying Markov process.
  - Batch TD(0) converges to the value function of the maximum-likelihood model of the Markov process, so if we believe the process is Markov, the TD answer is more suitable.
  - This is called the certainty-equivalence estimate, because it is equivalent to computing the exact value function of the estimated model as if that model were known with certainty rather than approximated. (I need to understand this better.)
- Nonbatch TD(0) can also converge faster than constant-α MC, because at every step it moves toward a better estimate, even though it does not move all the way there.
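The sketch below reproduces these numbers on the 8 episodes above, reusing the batch_td0 function from the earlier sketch; the batch MC estimate is simply the average observed return from each state.

```python
# Each episode is a list of (state, reward, next_state) transitions,
# matching the format expected by batch_td0 above.
episodes = [[("A", 0, "B"), ("B", 0, None)]] \
         + [[("B", 1, None)]] * 6 \
         + [[("B", 0, None)]]

# Batch TD(0) estimate: V(A) is bootstrapped from V(B).
V_td = batch_td0(episodes, alpha=0.01, gamma=1.0)
print(V_td["A"], V_td["B"])  # both approach 0.75

# Batch MC estimate: average the observed returns from each state.
returns = {"A": [], "B": []}
for episode in episodes:
    G = 0.0
    tail = []
    for state, reward, _ in reversed(episode):
        G += reward  # gamma = 1
        tail.append((state, G))
    for state, G in tail:
        returns[state].append(G)
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc["A"], V_mc["B"])  # 0.0 and 0.75
```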
Reference: most of the material in this post comes from the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.