# On-policy Control with Approximate Value Functions

This is a continuation of the notes on Approximate Function Methods in Reinforcement Learning.

## Episodic Sarsa with Function Approximation

Reminder of what Sarsa is

• State, Action, Reward, State, Action
• On-policy TD Control
• Estimate Q(s, a) by interacting with the environment and updating Q(s, a) at every time step

To apply it with function approximation in a discrete space (the stacked representation): construct features by crossing states with actions so that the number of features equals the number of state-action pairs. Every state-action pair then maps to its own unique feature, which works for small state-action spaces.

It’s very similar to tabular Sarsa, except that Q(s, a) is replaced by the approximate version q_hat(s, a, w).
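
As a concrete illustration, here is a minimal sketch of the episodic semi-gradient Sarsa update with a linear q_hat over one-hot state-action features (the tabular crossing described above). The helper names, alpha, and gamma are illustrative assumptions, not from the lecture.

```python
import numpy as np

def features(s, a, n_states, n_actions):
    """One unique binary feature per (state, action) pair (tabular crossing)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q_hat(s, a, w, n_states, n_actions):
    """Linear approximate action value."""
    return w @ features(s, a, n_states, n_actions)

def sarsa_update(w, s, a, r, s_next, a_next, done, alpha, gamma, n_states, n_actions):
    """One episodic semi-gradient Sarsa step: w <- w + alpha * delta * grad q_hat."""
    x = features(s, a, n_states, n_actions)
    target = r if done else r + gamma * q_hat(s_next, a_next, w, n_states, n_actions)
    delta = target - w @ x            # TD error
    return w + alpha * delta * x      # gradient of a linear q_hat is just x
```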

## Exploration under function approximation

Optimistic initial values cause the agent to explore the action space because it initially expects a high return.

• Tabular
  • Straightforward to implement: set a high initial value in the table for every state-action pair.
• Function approximation
  • Linear/binary features – set the weights to high values.
  • Non-linear – unclear how to make the output optimistic. Even if we can, visits to other states update the shared weights, which also changes the values of unvisited states.
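
For the linear/binary case, a small sketch of how setting the weights high yields optimistic values; the feature sizes and the optimistic value below are assumptions for illustration.

```python
import numpy as np

# With binary features where exactly n_active features are on for any
# (state, action) pair (e.g., tile coding), setting every weight to
# optimistic_value / n_active makes every initial q_hat equal optimistic_value.
n_features = 4096          # assumed feature-vector size
n_active = 8               # assumed number of active features per pair
optimistic_value = 100.0   # assumed optimistic value

w = np.full(n_features, optimistic_value / n_active)
# Initially, q_hat(s, a, w) = sum of the n_active active weights = 100.0 everywhere.
```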

Epsilon-greedy relies on random action selection to explore the state-action space.

• Tabular
  • With probability epsilon, the action is chosen uniformly at random; otherwise the greedy action is taken.
• Function approximation
  • It can still be applied, but epsilon-greedy is not a directed exploration method.
  • Not as systematic as optimistic initial values.
  • Exploration under function approximation is still an open area of research.
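
A minimal sketch of epsilon-greedy action selection on top of an approximate q_hat; the q_hat signature and the rng argument are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q_hat, s, w, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one under q_hat."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))                    # explore
    q = np.array([q_hat(s, a, w) for a in range(n_actions)])
    return int(np.argmax(q))                                   # exploit

# Example: a = epsilon_greedy(q_hat, s, w, n_actions=4, epsilon=0.1,
#                             rng=np.random.default_rng(0))
```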

## Average Reward

Average reward applies to

• continuing tasks, where the interaction never ends
• when there is no discounting (discount rate = 1)
• it can be approximated with a discount rate close to 1 (e.g., 0.999), but the resulting large returns can be difficult to learn with the earlier methods
• without discounting, the sum of rewards in a continuing task would be infinite
• r(pi) = the average reward per time step under policy pi

• The first expression is the average reward over time in the infinite-horizon limit
• The second expression writes the same quantity as the limit of the expected reward at time step t
• The third expression sums the reward weighted by the steady-state distribution over states under pi, the policy’s probability of each action given the state, and the transition probabilities over next state and reward given the state-action pair
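
Written out, the three expressions described above correspond to the standard average-reward definition (following Sutton & Barto); the slide’s exact notation may differ:

```latex
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \right]
            = \lim_{t \to \infty} \mathbb{E}\!\left[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \right]
            = \sum_{s} \mu_\pi(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
```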

### Differential Return

Motivating question: the average reward can be used to compare which policy is better. What can we use to compare state-action values?

We can use the differential return: how much more reward the agent will get from the current state and action, compared to the average reward it would get by following the policy, summed over time steps (a Cesàro sum).
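Concretely, the differential return subtracts the average reward from every future reward; this is the standard form (the slide’s notation may differ):

```latex
G_t \doteq \bigl(R_{t+1} - r(\pi)\bigr) + \bigl(R_{t+2} - r(\pi)\bigr) + \bigl(R_{t+3} - r(\pi)\bigr) + \cdots
```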

To make use of the differential return, we can let the agent take a different state-action trajectory at first and then follow the given policy. That yields some difference over the first few time steps, and a positive differential return shows that the deviation is better than the policy’s average behavior.

### Bellman Equations for Differential Return

Notice the only differences from the standard Bellman equations are that the average reward r(pi) is subtracted from the immediate reward, and there is no discounting.
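
For reference, the differential Bellman equations in their standard form (reconstructed here, not copied from the slide) are:

```latex
v_\pi(s)    = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r - r(\pi) + v_\pi(s') \bigr]
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \Bigl[ r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
```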

### Differential Sarsa

The main differences are

• Need to track an estimate of the average reward R_bar, updated with a fixed step size beta; updating it from the TD error (rather than the raw reward) gives a lower-variance update
• A differential version of the TD error delta (the average-reward estimate is subtracted), which multiplies the gradient to update the weights w
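
A minimal sketch of the differential semi-gradient Sarsa update for a linear q_hat, following the structure above; x and x_next stand for the feature vectors of (S, A) and (S', A'), and alpha, beta are assumed step sizes.

```python
import numpy as np

def differential_sarsa_update(w, avg_reward, x, r, x_next, alpha, beta):
    """One differential semi-gradient Sarsa step for a linear q_hat."""
    # Differential TD error: the average-reward estimate replaces discounting.
    delta = r - avg_reward + w @ x_next - w @ x
    avg_reward += beta * delta          # update R_bar from delta (lower variance)
    w = w + alpha * delta * x           # gradient of a linear q_hat is x
    return w, avg_reward
```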

## Preferences-Parameters Confound

Where do rewards come from?

• The agent designer has an objective reward function that specifies preferences over agent behavior, but it is often too sparse and delayed to learn from directly
• The single reward function confounds two roles
• Expresses agent designer’s preferences over behavior
• Serves as the RL agent’s goal/purpose, and thus becomes the parameters of the agent’s actual behavior