Tabular vs Function Methods
In reinforcement learning, several methods are called tabular methods because they store the value function as a table mapping each state (or state-action pair) to its value:
 Dynamic Programming
 Policy Iteration
 Value Iteration
 Generalized Policy Iteration (GPI)
 Monte Carlo Method
 Temporal Difference
 Q-learning (off-policy TD)
 Sarsa (on-policy TD)
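To make the "table" concrete, here is a minimal sketch of a tabular Q-learning update in Python; the dictionary-backed table, step size, and discount are illustrative choices, not anything prescribed above.

```python
from collections import defaultdict

# The "table" of tabular methods: one entry per (state, action) pair.
# alpha (step size) and gamma (discount) are illustrative values.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next, actions):
    """Off-policy TD (Q-learning): move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Sarsa would replace the max over actions with the action actually taken by the policy, which is what makes it on-policy.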
However, when the state space is so enormous that it cannot be tracked in a table, we can use parameterized functions to approximate these value functions.
 The weight representation is more compact than enumerating all values
 Note that when a weight changes, it can affect multiple outputs (this is similar to supervised learning)
Linear value function approximation
 The value is approximated using a linear combination of the features
 Linear functions have limited expressiveness, but we can be creative in constructing new features
 Tabular value functions are a special case of linear functions
 Use one weight for each state
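A minimal sketch of the linear form, assuming NumPy; the 5-state setup is illustrative. With one-hot features, the linear case reduces exactly to the tabular case: one weight per state.

```python
import numpy as np

def v_hat(x, w):
    """v(s) is approximated as a linear combination of features: w . x(s)."""
    return np.dot(w, x)

def one_hot(s, n_states):
    """The tabular special case: one feature, and hence one weight, per state."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(5)                 # 5 illustrative states
print(v_hat(one_hot(2, 5), w))  # with one-hot features, v(s) is just w[s]
```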
Nonlinear value function approximation
 A neural network is one example
Generalization and Discrimination
 Generalization: an update to the value of one state affects the values of other states
  Recognizing that some states are similar lets us treat them similarly (update both)
  Extreme generalization: aggregate all states and predict a single average value
 Discrimination: the ability to assign different values to two states
  Tabular methods have maximal discrimination since they treat each state independently
 We need to find a way to balance the two and make a tradeoff, as illustrated in the sketch after this list
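A small illustration of the tradeoff (a sketch, assuming NumPy; the feature vectors are made up): two states share a feature, so a single gradient update generalizes from one to the other.

```python
import numpy as np

x_a = np.array([1.0, 1.0])  # state A's (illustrative) features
x_b = np.array([0.0, 1.0])  # state B shares the second feature with A
w = np.zeros(2)

# One gradient step moving v(A) = w . x_a toward a target of 1.0
w += 0.5 * (1.0 - w @ x_a) * x_a

print(w @ x_a, w @ x_b)  # 1.0 0.5 -- v(B) moved too: generalization
# With one-hot features (the tabular case), only v(A) would change:
# maximal discrimination, zero generalization.
```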
Estimate Value Functions as Supervised Learning
 Can we estimate the value functions as a supervised learning problem?
 In supervised learning, the target is fixed; RL has a nonstationary target:
  The environment/policy is not stationary
  In GPI, we learn both the policy and the value function; the target values are nonstationary because of bootstrapping
 In supervised learning, training examples come in a batch; RL data arrive online and are temporally correlated
 The Monte Carlo method can be framed like supervised learning
  One (state S, return G) pair for each state visited in an episode (see the sketch after this list)
  Learn a value function of a given policy
 Why is MC not quite supervised learning?
  Because the samples are temporally correlated
 TD can also be framed like supervised learning, but with more complications than MC
  It also has (state S, target G) pairs
  But the target G depends on w (the target is not fixed), because it bootstraps from the current value estimate
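A sketch of how MC produces its "training set", assuming an episode stored as a list of (S_t, R_{t+1}) pairs; that representation is an assumption made here for illustration.

```python
def mc_training_pairs(episode, gamma=1.0):
    """Turn one episode [(S_0, R_1), (S_1, R_2), ...] into (state, return) pairs.

    Each pair (S_t, G_t) looks like one labeled example for the value function,
    but consecutive pairs are temporally correlated, unlike i.i.d. data.
    """
    pairs, G = [], 0.0
    for s, r in reversed(episode):
        G = r + gamma * G  # return from t: G_t = R_{t+1} + gamma * G_{t+1}
        pairs.append((s, G))
    pairs.reverse()
    return pairs
```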
Value Error Objective
 We want to measure the error between the value function approximation and the true value function v_pi (the objective is written out after this list)
 The error is weighted differently across the state space using a distribution mu(s)
  mu(s) is often the fraction of time spent in state s, known as the on-policy distribution
   it is stationary in continuing tasks
   in episodic tasks, mu(s) can depend on the initial state
  mu(s) can be defined as the fraction of time spent in each state when following the policy pi
 We would like to minimize this value error function by adjusting the weights w
 We can use gradient descent
 The stochastic gradient descent step follows, with U_t replacing v_pi(S_t), because we use U_t as an estimate of v_pi (the true v_pi is unknown):
  w_{t+1} = w_t + \alpha \, [U_t - \hat{v}(S_t, w_t)] \, \nabla \hat{v}(S_t, w_t)
 If U_t is unbiased (its expectation is v_pi(S_t)), then SGD is guaranteed to converge to a local optimum under the usual step-size conditions
 For linear value function approximation, the gradient is just the feature (each weight corresponds to a feature)
 Even the function that minimizes the value error is not necessarily the true value function
  The reason is that the true value function may not be representable at all, given the choice of function parameterization, features, and objective
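Written out, the objective described above is the standard mean-squared value error (as in Sutton & Barto), and for the linear case the gradient in the SGD step is just the feature vector:

```latex
% Mean-squared value error, weighted by the on-policy distribution \mu(s):
\overline{VE}(w) = \sum_{s} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, w)\bigr]^2
% For a linear approximation \hat{v}(s, w) = w^\top x(s):
\nabla \hat{v}(s, w) = x(s)
```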
Gradient Monte Carlo for Policy Evaluation
 The gradient is easy to calculate when the approximation function is linear
 Summing over all states is usually not possible, and the distribution mu(s) is hard to know
 We can instead make a small update at each state in the sequence of (state, value) samples generated by following policy pi
  Each step is an update (a form of stochastic gradient descent)
  Since we don't have v_pi(s), we can use the sampled return G from state s (G is an unbiased estimate of v_pi(s))
 State Aggregation
  when we update the state value for one state, the state values of all the states in the same group are updated
  it's a form of linear approximation (the group's one-hot indicator is the feature)
  Example: a random walk over states 1 to 1000 where each step jumps up to 100 states to the left or right; the 1000 states are aggregated into 10 groups (see the sketch below)
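A minimal sketch of gradient Monte Carlo with state aggregation on this random walk; the episode-generation details, episode count, and step size here are my own simplifications for illustration.

```python
import random

N, GROUPS = 1000, 10
GROUP_SIZE = N // GROUPS   # 100 states per group
alpha = 2e-5               # small illustrative step size
w = [0.0] * GROUPS         # one weight per group: v_hat(s) = w[group(s)]

def group(s):
    return (s - 1) // GROUP_SIZE

def run_episode():
    """Random walk: from state 500, jump 1..100 left or right each step;
    terminate with reward -1 off the left end, +1 off the right end."""
    s, visited = 500, []
    while 1 <= s <= N:
        visited.append(s)
        s += random.randint(1, 100) * random.choice([-1, 1])
    return visited, (-1.0 if s < 1 else 1.0)

for _ in range(20000):
    visited, G = run_episode()  # undiscounted: G_t equals the final reward
    for s in visited:           # gradient MC: w += alpha * (G - v_hat(s)) * x(s)
        w[group(s)] += alpha * (G - w[group(s)])
```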
Semi-gradient TD(0) – biased target
 Replace the return target with a biased estimate that depends on w (a bootstrapped estimate of v_pi(s))
 It is not guaranteed to converge in general, because the update uses only part of the true gradient: the target itself depends on w, but we do not differentiate through it. Such methods are called semi-gradient methods.
 However,
  it usually converges more quickly because the target has lower variance
  it also enables continual and online learning
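A sketch of the semi-gradient TD(0) update for a linear value function, assuming NumPy; the `env`, `policy`, and feature-map `x` interfaces are assumptions made here for illustration, not anything from the notes.

```python
import numpy as np

def semi_gradient_td0(env, x, policy, w, alpha=0.01, gamma=1.0):
    """One episode of semi-gradient TD(0) with linear v_hat(s) = w . x(s).

    The target r + gamma * w.x(s') is treated as a constant: we do not
    differentiate through its dependence on w. That is the 'semi' part.
    """
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step(policy(s))
        target = r if done else r + gamma * np.dot(w, x(s_next))
        w += alpha * (target - np.dot(w, x(s))) * x(s)  # grad of v_hat is x(s)
        s = s_next
    return w
```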
In the same 1000-state random walk example,
 the TD gradient update is biased
 TD learns faster than MC
 fast early learning usually matters more in practice than long-term (asymptotic) performance
Linear function approximation (linear TD)
 Linear in the weights
 x(s) is called a feature vector for state s
 For linear methods, features are basis functions
The update at each time step is
 w_{t+1} = w_t + \alpha \bigl( R_{t+1} + \gamma\, w_t^\top x_{t+1} - w_t^\top x_t \bigr) x_t
where x_t is just shorthand for x(S_t). At steady state, the expected update is
 E[w_{t+1} \mid w_t] = w_t + \alpha (b - A w_t)
where
 b = E[R_{t+1}\, x_t], \qquad A = E\bigl[ x_t (x_t - \gamma x_{t+1})^\top \bigr]
If the weights converge, they must satisfy b - A w_{TD} = 0, i.e. w_{TD} = A^{-1} b, where w_{TD} is called the TD fixed point.
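Given sampled transitions, A and b can be estimated empirically and the fixed point solved for directly; this is the idea behind least-squares TD (LSTD). A sketch, assuming NumPy and transitions stored as (x_t, R_{t+1}, x_{t+1}) triples, with x_{t+1} taken as the zero vector at termination; that data layout is an assumption for illustration.

```python
import numpy as np

def td_fixed_point(transitions, gamma=1.0):
    """Solve A w = b for w_TD, with A and b estimated from samples.

    A = E[x_t (x_t - gamma * x_{t+1})^T],   b = E[R_{t+1} x_t].
    """
    d = len(transitions[0][0])
    A, b = np.zeros((d, d)), np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, np.asarray(x) - gamma * np.asarray(x_next))
        b += r * np.asarray(x)
    return np.linalg.solve(A, b)  # w_TD = A^{-1} b (A assumed invertible)
```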