About the course notes

This site contains the course notes for the course “Reinforcement Learning for Business” held at Aarhus BSS. It consists of a set of learning modules. The course is an elective mainly for the Operations and Supply Chain Analytics and Business Intelligence programme and is intended to give you an introduction to Reinforcement Learning (RL). You can expect the site to be updated while the course runs. The date listed above is the last time the site was updated.

Learning outcomes

By the end of this module, you are expected to:

  • Understand the prerequisites and the goals for the course.
  • Have downloaded the textbook.
  • Know how the course is organized.
  • Have installed R and RStudio.
  • Have annotated the online notes.

The learning outcomes relate to the overall learning goals 3, 5 and 6 of the course.

Purpose of the course

The purpose of this course is to give an introduction to, and knowledge about, reinforcement learning (RL).

RL may be seen as

  • An approach to modelling sequential decision-making problems.
  • An approach for learning good decision making under uncertainty from experience.
  • A set of mathematical models for learning-based decision making.
  • A way of optimizing decisions in a sequential decision model, that is, of making a good sequence of decisions.
  • A way of estimating and finding near-optimal decisions for a stochastic process with sequential decision making.
  • A model in which, given the state of a system, the agent takes actions to maximize future rewards. Often the agent does not know the underlying setting and is therefore bound to learn from experience.

RL can also be seen as a way of modelling intuition. As humans, we often learn by trial and error. For instance, when playing a game, our strategy is based on the game rules and on what we have experienced works in previous plays. In an RL setting, the system has specific states, actions and a reward structure, that is, the rules of the game, and it is up to the agent to solve the game, i.e. to find actions that maximize the reward. Typically, the agent starts with totally random trials and finishes with sophisticated tactics and superhuman skills. By leveraging the power of search and many trials, RL is an effective way to find good actions.

A classic RL example is the bandit problem: you are in a casino and in each round you have to choose one of many slot machines (one-armed bandits). However, you do not know the payout distributions of the machines. In the beginning, you will probably just try out machines (exploration) and then, after some learning, you will prefer the best ones (exploitation). The problem is that if you play one machine most of the time, you gain little information about the others and may never find the best machine (the exploration-exploitation dilemma). RL focuses on finding a balance between exploring uncharted territory and exploiting current knowledge.
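To make the exploration-exploitation dilemma concrete, below is a minimal simulation sketch in R of an epsilon-greedy strategy on a small bandit problem. The number of machines, the normal payout distributions and the value of epsilon are made up for illustration only; bandit problems are treated properly in Module 2.

```r
## A minimal epsilon-greedy bandit simulation (illustrative sketch only).
## The payout distributions and the value of epsilon are made-up assumptions.
set.seed(42)
trueMeans <- c(1.0, 1.5, 2.0, 0.5)   # mean payout of each machine (unknown to the agent)
epsilon <- 0.1                       # probability of exploring a random machine
n <- 1000                            # number of rounds
Q <- rep(0, length(trueMeans))       # estimated value of each machine
N <- rep(0, length(trueMeans))       # number of times each machine has been played
for (t in 1:n) {
  a <- if (runif(1) < epsilon) sample(length(Q), 1) else which.max(Q)  # explore or exploit
  r <- rnorm(1, mean = trueMeans[a], sd = 1)                           # observed payout
  N[a] <- N[a] + 1
  Q[a] <- Q[a] + (r - Q[a]) / N[a]   # incremental sample-average update
}
round(Q, 2)   # estimates should be close to trueMeans
N             # the best machine should have been played most often
```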

The course starts by giving a general overview of RL and introducing bandit problems. Next, the mathematical framework of Markov decision processes (MDPs) is introduced as a classical formalization of sequential decision making. Here actions influence not only immediate rewards, but also subsequent situations, or states, and therefore also future rewards. An MDP assumes that the dynamics of the underlying process and the reward structure are known explicitly by the decision maker. In the last part of the course, we go beyond decision making in known environments and study RL methods for stochastic control.
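As a tiny illustration of what it means for the dynamics and reward structure to be known explicitly, the sketch below defines a hypothetical two-state, two-action MDP in R by its transition probability matrices and expected rewards. The states, actions and numbers are made up; MDPs are covered properly in Module 3.

```r
## Illustrative sketch of a known MDP: two states and two actions with
## made-up transition probabilities and expected rewards.
states  <- c("low", "high")
actions <- c("wait", "invest")
## P[[a]][s, s'] = probability of moving from state s to state s' under action a
P <- list(
  wait   = matrix(c(0.9, 0.1,
                    0.3, 0.7), nrow = 2, byrow = TRUE,
                  dimnames = list(states, states)),
  invest = matrix(c(0.5, 0.5,
                    0.1, 0.9), nrow = 2, byrow = TRUE,
                  dimnames = list(states, states))
)
## R[s, a] = expected immediate reward of taking action a in state s
R <- matrix(c(1, -2,
              3,  2), nrow = 2, byrow = TRUE,
            dimnames = list(states, actions))
P$invest["low", ]   # the decision maker knows these probabilities explicitly
R["high", "wait"]   # ... and the rewards
```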

Learning goals of the course

After having participated in the course, you must, in addition to achieving general academic skills, demonstrate:

Knowledge of

  1. RL for Bandit problems
  2. Markov decision processes and ways to optimize them
  3. the exploration vs exploitation challenge in RL and approaches for addressing this challenge
  4. the role of policy evaluation with stochastic approximation in the context of RL

Skills to

  1. define the key features of RL that distinguish it from other machine learning techniques
  2. discuss fundamental concepts in RL
  3. describe the mathematical framework of Markov decision processes
  4. formulate and solve Markov and semi-Markov decision processes for realistic problems with finite state space under different objectives
  5. apply fundamental techniques, results and concepts of RL on selected RL problems.
  6. given an application problem, decide if it should be formulated as a RL problem and define it formally (in terms of the state space, action space, dynamics and reward model)

Competences to

  1. identify areas where RL is valuable
  2. select and apply the appropriate RL model for a given business problem
  3. interpret and communicate the results from RL

Reinforcement learning textbook

The course uses the free textbook Reinforcement Learning: An Introduction by Sutton and Barto (2018). The book is essential reading for anyone wishing to learn the theory and practice of modern reinforcement learning. Read the weekly readings before the lecture to better understand the material and perform better in the course.

Sutton and Barto’s book is the most cited publication in RL research, and is responsible for making RL accessible to people around the world. The new edition, released in 2018, offers improved notation and explanations, additional algorithms, advanced topics, and many new examples; and it’s totally free. Just follow the citation link to download it.

Course organization

Each week covers a learning module, which is related to a chapter in the textbook. The learning path in a typical week is:

  • Before lectures: Read the chapter in the textbook and consider the extra module material.
  • Lectures (on campus).
  • After lectures: Module Exercises (in groups).

Lectures will not cover the whole curriculum but focus on the main parts. In some weeks tutorials are given in which we focus on a specific RL problem.

This module gives a short introduction to the course. The rest of the site consists of different parts, each containing teaching modules about specific topics:

  • Part I gives you a general introduction to RL and the bandit problem.

  • Part II considers RL sequential decision problems where the state and action spaces are small enough that values can be represented as arrays, or tables. We start by considering bandit problems (Module 2), an RL problem in which there is only a single state. Next, Markov decision processes (where the full model is known) are considered as a general modelling framework (Module 3), and the concepts of policies and value functions are discussed (Module 4). Model-based algorithms for finding the optimal policy (dynamic programming) are given in Module 5. The next modules consider model-free methods for finding the optimal policy, i.e. methods that do not require full knowledge of the transition probabilities and rewards of the process. Monte Carlo sampling methods are presented in Module 6 and …

The appendix contains different modules that may be helpful for you, including hints on how to work in groups, how to get help if you are stuck, and how to annotate the course notes.

Programming software

We use R as the programming language, and it is assumed that you are familiar with using R. R is a programming language and free software environment. R can be run from a terminal, but in general you use the IDE (integrated development environment) RStudio to run R and save your work. R and RStudio can either be run on your laptop or via RStudio Cloud, which runs R in the cloud from your browser.

If you need a brush-up on your R programming skills, have a look at Module A in the appendix.
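If you want to verify that your installation works, you can run a few lines like the following in the RStudio console (a simple sanity check, not part of the curriculum):

```r
## A quick sanity check of your R installation (illustrative only).
R.version.string                 # print the installed R version
x <- rnorm(100)                  # simulate 100 standard normal numbers
mean(x); sd(x)                   # basic summaries
plot(cumsum(x), type = "l")      # a simple base plot of a random walk
```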

Acknowledgements and license

Materials are taken from various places, and I would like to thank all for their inspiration.

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).

Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Sometimes solutions can be seen by pressing the button beside a question. But beware: you will not learn by giving up too early. Put some effort into finding a solution!

Exercise - How to annotate

The online course notes can be annotated using hypothes.is. You can create both private and public annotations. Collaborative annotation helps people connect to each other and to what they are reading, even when they are keeping their distance. You may also use public notes to point out spelling errors, unclear content, etc. in the notes.

  1. Sign up at hypothes.is. If you are using Chrome, you may also install the Chrome extension.
  2. Go back to this page and log in in the upper right corner (there should be some icons, e.g. <).
  3. Select some text and try to annotate it using both a private and public annotation (you may delete it again afterwards).
  4. Go to the slides for this module and try to annotate the page with a private comment.

Exercise - Templates

An RMarkdown template of the course notes and exercises is available on GitHub. You can download the repository and keep your own notes during the course in an R project.

  1. Open RStudio and do: File > New Project … > Version Control > Git. Add https://github.com/bss-osca/rl-student as the repository URL and create the project.
  2. Run renv::restore() from the R command line to install the needed packages (this may take some time). If you experience errors, try to install the packages one at a time using install.packages("pkg name").
  3. Open e.g. the file 01_rl-intro.Rmd and try to knit it using the Knit button in the upper left corner. An HTML file with the output will be generated. You should be able to add your own notes and solve the exercises using the Rmd file for each module.
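If you prefer working from the console, a compact sketch of steps 2 and 3 could look like the lines below. The package name in the fallback line is only an example, and the Knit button does the same as rmarkdown::render().

```r
## Console sketch of steps 2-3 above.
renv::restore()                        # install the packages recorded in the lockfile
# install.packages("tidyverse")        # example fallback if a single package fails
rmarkdown::render("01_rl-intro.Rmd")   # knit the module file to HTML
```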

References

Sutton, R. S., and A. G. Barto. 2018. Reinforcement Learning: An Introduction. Second Edition. MIT Press. http://incompleteideas.net/book/the-book.html.