# Monte Carlo Methods in RL

In reinforcement learning, Monte Carlo (MC) methods are a collection of methods for estimating value functions and discovering optimal policies through experience – sampling of states, actions, and rewards.

• They do not require prior knowledge of the environment’s dynamics (transition tables) and yet can still attain optimal behavior.
• They average sample returns.
• They are applicable to episodic tasks, because the estimates are updated at the end of each episode.
• They sample along trajectories of state transitions, as opposed to sweeping through all states as in Dynamic Programming.
• MC methods do not bootstrap like DP methods, so there is no iterative solving of one estimate from another.
• The goal is to estimate the value of a state by averaging the returns observed starting from that state.
• They can handle non-stationarity with the idea of generalized policy iteration (GPI).
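The “return” being averaged above is the discounted sum of rewards, computed most naturally by walking backward through an episode via G_t = R_{t+1} + gamma * G_{t+1}. A minimal sketch (function name and sample rewards are illustrative):

```python
def returns_from_rewards(rewards, gamma=0.9):
    """Given rewards R_1..R_T of one episode, return G_0..G_{T-1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# e.g. rewards [1.0, 0.0, 2.0] with gamma=0.5 give returns [1.5, 1.0, 2.0]
print(returns_from_rewards([1.0, 0.0, 2.0], gamma=0.5))
```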

## Monte Carlo Prediction: learning the state-value function

To estimate the state-value function v_pi(s) for state s under a policy pi, there are two methods:

• First-visit MC method:
  • Given a trajectory, which is a sequence S0, A0, R1, S1, A1, R2, …
  • Take the average (over many trajectories) of the returns following the first visit to state s.
• Every-visit MC method:
  • Given a trajectory, which is a sequence S0, A0, R1, S1, A1, R2, …
  • Take the average (over many trajectories) of the returns following every visit to state s.
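The two methods above differ in one line of code. A sketch, assuming episodes are given as lists of (state, reward) pairs where the reward is the one received after leaving that state (the function name and data layout are mine):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi by averaging returns per state over episodes."""
    returns = defaultdict(list)
    for episode in episodes:
        states = [s for s, _ in episode]
        g = 0.0
        # walk backward so G_t for every step is computed in one pass
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = r + gamma * g
            if first_visit and s in states[:t]:
                continue  # only the first visit to s contributes
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

On a single episode A→B→A with rewards 1, 0, 2 (gamma = 1), first-visit gives v(A) = 3 (the return from the first A only), while every-visit averages the returns 3 and 2 to get v(A) = 2.5.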

### Example: a Blackjack game

• States
  • Player’s hand total
  • Dealer’s showing card
  • Whether the player has a usable Ace
• Policy: the player follows a fixed policy (when to stick or hit)
  • Definitions: hit = draw another card; stick = stop drawing more cards
  • For example, assume a simple player policy: hit if the player’s total is under 20
  • This is the policy that we evaluate the state-value function under. You can imagine that a different policy would generate a different state-value function.
• Environment
  • The dealer hits if the dealer’s total is under 17
  • This is needed to simulate the game play.
• Sampling some trajectories (generating episodes)
  • With the player policy and the environment (the dealer’s strategy), we can simulate the game many times and obtain many trajectories of game play.
  • For each trajectory, we can calculate the return following each state.
  • The exact mechanism is to first generate the whole trajectory and then walk backward, applying the gamma discount to get the return at each time step.
• Here the first-visit and every-visit MC methods differ by which returns they use
  • First-visit only uses the first visit to a state in the trajectory, so at most one return for a given state s is obtained from one trajectory.
  • Every-visit can record multiple returns for a given state; but in Blackjack, since we keep drawing cards, it is not possible to return to an earlier state within one trajectory.
  • So it does not matter which method you use for the Blackjack game.
• Note that a trajectory usually does not visit all states; this is in contrast to dynamic programming, where you sweep over all possible states.
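The simulation described above can be sketched as follows. This is a deliberately simplified Blackjack (infinite deck, no usable-ace logic, aces count 1), so it is illustrative rather than a faithful reproduction of the book’s example; all names are mine:

```python
import random

def draw():
    # infinite deck: ranks 1-13, face cards count 10
    return min(random.randint(1, 13), 10)

def play_episode():
    """One game under the fixed policies: player hits below 20,
    dealer hits below 17. Returns (visited states, final reward)."""
    dealer_shown = draw()
    dealer = dealer_shown + draw()
    player = draw() + draw()
    states = []
    while player < 20:                    # player policy: hit under 20
        states.append((player, dealer_shown))
        player += draw()
    if player > 21:
        return states, -1                 # player busts
    states.append((player, dealer_shown))
    while dealer < 17:                    # environment: dealer hits under 17
        dealer += draw()
    if dealer > 21 or player > dealer:
        return states, 1
    return states, 0 if player == dealer else -1

def estimate_v(num_episodes=50_000):
    totals, counts = {}, {}
    for _ in range(num_episodes):
        states, reward = play_episode()
        # gamma = 1 and the only reward comes at the end, so the return
        # following every visited state is just the final reward
        for s in states:
            totals[s] = totals.get(s, 0.0) + reward
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}
```

Note that the player’s total strictly increases with every hit, so no state repeats within an episode; that is exactly why first-visit and every-visit coincide for this game.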

## Monte Carlo estimation of action-value function

If you have the environment dynamics (the transition graph with probabilities and rewards), then the action-value function q_pi is easy to calculate from the state-value function v_pi. However, if the transition graph is unknown, it is useful to estimate the action-value function directly, because it is what we need to find an optimal policy (we greedify with respect to action values).

The action-value function under a policy pi is the expected return after taking action a in state s and thereafter following policy pi. Again we have two methods (both converge to the true expected values):

• First-visit MC method: average the returns following the first time in each episode that state s was visited and action a was selected
• Every-visit MC method: average the returns following every visit to state s in which action a was selected

One problem is that many state-action pairs may never be visited under a given policy, leaving no returns to average for those pairs. This hinders learning the optimal policy, since we cannot estimate all state-action values.

### Exploring starts

Exploring starts is a method to guarantee that all state-action pairs are visited: each simulation begins at a random state and takes a random action from that state, so every state-action pair has a nonzero probability of being selected as the start. However, this can be difficult to achieve in practice. Think about a self-driving car: how do you simulate all possible starting state-action pairs? Another option is to use stochastic policies that have a nonzero probability of selecting every action in each state (see a later section).
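A sketch of an exploring-starts episode generator, assuming a generic `step(s, a) -> (next_state, reward, done)` environment interface (the interface and all names here are illustrative, not from the text):

```python
import random

def generate_episode_es(start_states, actions, policy, step, max_steps=100):
    """Generate one episode: random start state and random first action
    (the exploring start), then follow `policy` thereafter."""
    s = random.choice(start_states)       # random nonterminal start state
    a = random.choice(actions)            # random first action
    episode = []
    for _ in range(max_steps):
        s_next, r, done = step(s, a)
        episode.append((s, a, r))
        if done:
            break
        s = s_next
        a = policy[s]                     # thereafter follow the policy
    return episode
```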

## Applying Generalized Policy Iteration to Monte Carlo Control

GPI for Monte Carlo is the same as in DP: we alternate between evaluating (E) the action-value function and improving (I) the policy.

• The difference here is that Monte Carlo methods must estimate the action-value function directly, because without a transition model they cannot derive q() from v().
• This is also why Monte Carlo methods do not bootstrap (unlike DP): they do not update their value estimates based on other value estimates.
• What GPI shares between DP and MC is that we need not fully evaluate “q under policy pi” before each improvement step (as in value iteration); we can even take this to the extreme with in-place, episode-by-episode updates.

Combining this with the exploring starts above, we get Monte Carlo ES for estimating the optimal policy.

The idea is that ES guarantees all state-action pairs are visited as episode starts, irrespective of what policy is in force. This drives the policy to converge to an optimal policy, since q() is fully evaluated in the limit and can converge to q*().
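The loop can be sketched on a toy corridor MDP (the MDP, names, and episode count are mine, not from the text): states 0..2 are nonterminal, state 3 is terminal, every step costs -1, so the optimal policy always moves right.

```python
import random
from collections import defaultdict

N = 4  # states 0..3; state 3 is terminal

def step(s, a):
    s2 = min(s + 1, N - 1) if a == 'R' else max(s - 1, 0)
    return s2, -1, s2 == N - 1

def mc_es(num_episodes=2000, gamma=1.0):
    q_sum, q_cnt = defaultdict(float), defaultdict(int)
    policy = {s: 'L' for s in range(N - 1)}      # deliberately bad start
    for _ in range(num_episodes):
        # exploring start: random nonterminal state, random first action
        s, a = random.randrange(N - 1), random.choice('LR')
        episode = []
        for _ in range(50):                      # cap episode length
            s2, r, done = step(s, a)
            episode.append((s, a, r))
            if done:
                break
            s, a = s2, policy[s2]
        # evaluation + improvement: first-visit updates, walking backward
        g = 0.0
        pairs = [(s, a) for s, a, _ in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = r + gamma * g
            if (s, a) in pairs[:t]:
                continue                         # first visit only
            q_sum[s, a] += g
            q_cnt[s, a] += 1
            # greedify at s over actions that have at least one sample
            policy[s] = max(
                'LR',
                key=lambda b: q_sum[s, b] / q_cnt[s, b] if q_cnt[s, b]
                else float('-inf'),
            )
    return policy
```

Despite starting from the all-left policy, exploring starts eventually samples every (state, action) pair, and greedification flips each state to 'R'.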

## Monte Carlo Control without Exploring Starts

It is hard to guarantee that all state-action pairs are visited under ES, so there are two ways to deal with this:

• On-policy methods
  • Evaluate/improve the policy that is being used to make decisions
  • Epsilon-soft policies
    • “Soft” refers to the fact that all action probabilities in the policy are nonzero
    • Epsilon controls the amount of exploration (how often we deviate from the greedy action)
    • Every action is selected with probability at least epsilon/|A(s)|, where |A(s)| is the cardinality of the action space at state s
      • With probability epsilon, select an action uniformly at random
      • The remaining bulk of the probability, 1 - epsilon + epsilon/|A(s)|, goes to the greedy action
    • The policy improvement theorem still applies, so the policy converges to an optimal epsilon-soft policy
    • The need for ES is eliminated by the “softness” of the policy
    • This policy cannot be fully optimal because it keeps exploring even after convergence
• Off-policy methods
  • Evaluate/improve a policy different from the one used to generate the data (the state-action trajectories)
  • Behavior policy – used for exploring and generating trajectories through the state-action space
    • Coverage assumption: the behavior policy must have nonzero probability for every state-action pair that has nonzero probability under the target policy
  • Target policy – the policy we are trying to make optimal; typically a deterministic, greedy policy
  • Importance sampling – use trajectories from the behavior policy to estimate returns under the target policy
    • The importance sampling ratio
      • The numerator is the probability of the trajectory under the target policy
      • The denominator is the probability of the trajectory under the behavior policy
      • Although the p() terms (transition probabilities) are unknown, they cancel out, so we still do not need to know the dynamics of the MDP
    • Finally, the expected return of state s under the target policy equals the expected return of state s under the behavior policy weighted by the importance sampling ratio
    • Two varieties:
      • Ordinary importance sampling – simple average of the weighted returns; unbiased but can have unbounded variance
      • Weighted importance sampling – weighted average of the returns, normalized by the sum of the ratios; finite variance, preferred in practice
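The two estimators can be sketched as follows, given per-episode returns G_i and importance sampling ratios rho_i for the state being evaluated (function names are mine):

```python
def ordinary_is(rhos, returns):
    """Ordinary importance sampling: simple average of rho * G.
    Unbiased, but the variance can be unbounded."""
    return sum(p * g for p, g in zip(rhos, returns)) / len(returns)

def weighted_is(rhos, returns):
    """Weighted importance sampling: rho-weighted average of G.
    Biased but consistent, with much lower variance in practice."""
    total = sum(rhos)
    return sum(p * g for p, g in zip(rhos, returns)) / total if total else 0.0
```

For example, with rho = [2.0, 0.5] and G = [1.0, 2.0], the ordinary estimate is (2·1 + 0.5·2)/2 = 1.5, while the weighted estimate is 3/2.5 = 1.2.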

Reference: most of the material in this post comes from the book Reinforcement Learning: An Introduction by Sutton and Barto.