Temporal Difference (TD) learning is the most novel and central idea of reinforcement learning. It combines the advantages of Dynamic Programming and Monte Carlo methods.

## Temporal Difference (TD) Prediction

The TD method (in particular, one-step TD(0)) updates the state-value function as follows:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

At each time step, the value of the previous state is updated toward the one-step target $R_{t+1} + \gamma V(S_{t+1})$ observed at the current time step.
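
To make the update concrete, here is a minimal sketch of tabular TD(0) prediction in Python. The `env.reset()`/`env.step()` interface and the `policy` callable are assumptions chosen for illustration, not the book's pseudocode:

```python
def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    Assumed interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), policy(state) -> action.
    """
    V = {}  # state-value estimates, defaulting to 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # The value of a terminal state is 0 by definition.
            v_next = 0.0 if done else V.get(next_state, 0.0)
            # TD(0) update: move V(S_t) toward the one-step target
            # R_{t+1} + gamma * V(S_{t+1}).
            td_error = reward + gamma * v_next - V.get(state, 0.0)
            V[state] = V.get(state, 0.0) + alpha * td_error
            state = next_state
    return V
```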

As a reminder, here are the MC update and the DP update:

- MC update: $V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$ (a code sketch follows this list)
  - Differences from the TD update:
    - In MC, the state value is updated by the return $G_t$, which is only observed at the end of the episode.
    - In TD, the state value is updated immediately at the next time step (online). This results in much faster convergence of the TD estimate.
  - Similarity to the TD update:
    - Neither method requires a model of the environment to update the state-value function.
- DP update: $V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]$
  - Differences from the TD update:
    - In DP, each update requires a sweep over the entire action space (and, through the model, all successor states).
    - In TD, only the action actually taken along the trajectory is used.
  - Similarity to the TD update:
    - Both are bootstrapping methods: they use the current state-value estimates to build the target for updating the state-value function.
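For comparison, here are minimal sketches of the MC and DP counterparts, under assumed encodings: an MC episode is a list of `(state, reward)` pairs (the reward received on leaving the state), and the DP model maps `(state, action)` to a distribution over `(next_state, reward)` outcomes.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit MC update: the target is the full return G_t,
    which is only available after the episode has terminated."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G  # return from this state onward
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))
    return V


def dp_expected_update(V, state, actions, policy_probs, model, gamma=1.0):
    """One DP expected update for `state`: it sweeps every action and every
    successor state, which is why it requires the full model p(s', r | s, a)."""
    new_value = 0.0
    for a in actions:
        for (next_state, reward), prob in model[(state, a)].items():
            # policy_probs[(state, a)] plays the role of pi(a | s).
            new_value += policy_probs[(state, a)] * prob * (
                reward + gamma * V.get(next_state, 0.0))
    return new_value
```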

## Optimality of Temporal Difference (TD) Learning

Batch updating with TD(0) works as follows:

- The increment is computed at every time step $t$ at which a non-terminal state is visited, but it is not applied immediately.
- The state-value function $V$ is updated only once per pass, by the sum of all the increments.
- After each new episode, all episodes seen so far are treated as a batch and presented repeatedly to the algorithm until the value function converges.
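
A sketch of batch TD(0) under these rules, assuming each episode is recorded as a list of `(state, reward, next_state)` transitions with `next_state=None` at termination (an encoding chosen here for illustration):

```python
def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-8):
    """Repeatedly sweep the whole batch, accumulate the TD increments,
    and apply their sum once per sweep, until V stops changing."""
    V = {}
    while True:
        increments = {}
        for episode in episodes:
            for state, reward, next_state in episode:
                v_next = 0.0 if next_state is None else V.get(next_state, 0.0)
                td_error = reward + gamma * v_next - V.get(state, 0.0)
                increments[state] = increments.get(state, 0.0) + alpha * td_error
            # Note: V itself is only changed after the full sweep below.
        for state, inc in increments.items():
            V[state] = V.get(state, 0.0) + inc
        if max(map(abs, increments.values()), default=0.0) < tol:
            return V
```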

### Example

Suppose we want to estimate the state-value function by hand from the eight episodes of observations below (each entry is a state, A or B, followed by the reward received):

- A, 0, B, 0
- B, 1
- B, 1
- B, 1
- B, 1
- B, 1
- B, 1
- B, 0

- It's easy to observe that V(B) = 3/4: of the eight returns observed from state B, six are 1 and two are 0.
- For state A, however, TD and MC give different estimates:
- MC:
  - MC sees only one episode containing state A, and that episode has a return of 0 from A, so it estimates V(A) = 0.
  - This estimate minimizes the mean-squared error on the training set.

- Batch TD(0):
  - Batch TD(0) uses V(B) = 3/4 to estimate V(A) = 3/4: every observed transition out of A goes to B with reward 0, so V(A) = 0 + V(B).
  - This is the answer we would expect from modeling the observed data as a Markov process.
  - TD converges to the value function that would be exactly correct for the maximum-likelihood model of the Markov process. If we believe the process is Markov, TD is the more suitable method.
  - This is called the certainty-equivalence estimate, because it is equivalent to assuming that the estimated model of the underlying process is known with certainty rather than being an approximation.
  - Non-batch TD(0) can also converge faster than MC, because each step still moves toward this better estimate. (The numbers are verified in the sketch after this list.)

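The contrast can be checked directly. The snippet below computes the batch MC estimate on the eight episodes; running the `batch_td0` sketch from the previous section on the same data gives the TD answer. The transition encoding is the same assumed one as above.

```python
# The eight episodes as (state, reward, next_state) transitions;
# next_state=None marks termination.
episodes = (
    [[("A", 0, "B"), ("B", 0, None)]]
    + [[("B", 1, None)] for _ in range(6)]
    + [[("B", 0, None)]]
)

# Batch MC estimate: average the (undiscounted) returns observed from
# each state across all episodes.
returns = {}
for ep in episodes:
    G = 0.0
    for state, reward, _ in reversed(ep):
        G += reward
        returns.setdefault(state, []).append(G)
V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V_mc)                    # {'B': 0.75, 'A': 0.0}

# Batch TD(0), using the batch_td0 sketch from the previous section:
# print(batch_td0(episodes))   # {'B': 0.75, 'A': 0.75} (approximately)
```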

**Reference:** most of the material in this post comes from the book *Reinforcement Learning: An Introduction* by Richard S. Sutton and Andrew G. Barto.
