Reinforcement Learning Primer

Reinforcement learning is going to be “the next big thing” in machine learning after 2022, so let’s understand some basic on how it works.

RL pic

  • Agent: trying to make decision
  • Environment: everything outside the agent

Supervised, unsupervised, and reinforcement learning

People like to classify machine learning into these 3 categories

  • Supervised learning: the model can train with label of what is expected to predict the label from features
  • Unsupervised learning: the model has no label to train and is expected to learn some structure about the data via the features
  • Reinforcement learning: the model does not have direct label to train but can interact with the “environment” by taking actions and observations, and the model is expected to learn how to optimally behave in the environment to maximize some reward


  • A_t: action at time t
  • R_t: reward at time t

Reward function

The expected reward function at time t with action a is a function of the action at time t

RL q star

Note that this function is not given to the agent in the reinforcement learning framework. It’s learned about the model.

  • This is bad because it makes it hard for the agent to make decision on the action not knowing the expected reward function
  • This is good because it’s often hard to find this function anyway and we can let the agent learn this function from the data

The estimated reward function at time t is denoted by Q(a) a time t

RL big Q

This function is what the agent has to work with because it does not know q*. The goal is to have Q() be close to q*() as much as possible.

Exploration and Exploitation

  • Exploitation:
    • Taking the best action a that maximize the one-step reward using Q()
    • This is optimal move if there is only one move left to make (greedy)
  • Exploration:
    • Taking anything that is not the best action
    • Help to refine Q() so that the agent understands the true expected reward function better
    • Better Q() means high chance of making the best decision in the future
    • This is optimal when there are many moves left to make and the agent has high uncertainty on the expected reward (there might another action that could have higher reward)

Traditionally there are some way to solve for the “best” strategy to balance exploration and exploitation by making assumption about the prior knowledge and stationarity of the problem. But this is not practical in most situation, so RL is basically a way to more adaptively balance exploitation and exploration to achieve maximum total reward at the end.

Action-value Methods

Action-value Methods refer a collection of methods to (1) estimate the value of actions and (2) make decision based on the estimates.

RL Q estimate

The expression for Q above simply means just take the sample average of historical reward for each action a as an estimate of what the expected reward should be for taking action a. As t goes to infinity, we expect Q converges to q*.

For a given action a, which has been selected n times, we have the sample average reward as

RL sample average reward

To compute this efficiently at each time this action is selected, we can use an incremental update rule

RL incremental update sample average

This form is very commonly used where we call the 1/n a “step size” and make it a configurable parameter later.

RL action Qt

The greedy action is the optimal a that maximizes Q at a given t.

Of course, we cannot just behave greedily at the time. One simple strategy is to act greedily most of the time but explore other actions with small probability epsilon. In the long run, as we Q approximates q* very well, the agent will be behaving optimally 1-epsilon probability of the time.

Multi-Armed Bandit Problem

To get started with the simplest set up of an reinforcement learning problem, see Multi-Armed Bandit Problem

Non-stationary problem

When the mean of each bandit can change over time, the agent needs to be adapt with incremental update rule

RL incremental update rule

where alpha is between 0 and 1. It can also be rewritten as below

RL incremental update Q R

This shows that Qn depends on Q1 (aka Initial Action Value) and all to previous Rewards. These weights sum up to 1. So this is also called the exponential recency-weighted average.

Reference: most of the material of this post comes from the book Reinforcement Learning

To learn more about RL:

Related Posts

9 thoughts on “Reinforcement Learning Primer

Leave a Reply

Your email address will not be published. Required fields are marked *