About the course notes

This site contains the course notes for the course “Reinforcement Learning for Business” held at Aarhus BSS. It consists of a set of learning modules. The course is an elective mainly for the Operations and Supply Chain Analytics and Business Intelligence programme and is intended to give you an introduction to Reinforcement Learning (RL). You can expect the site to be updated while the course runs. The date listed above is the last time the site was updated.

Learning outcomes

By the end of this module, you are expected to:

  • Understand the prerequisites and the goals for the course.
  • Have downloaded the textbook.
  • Know how the course is organized.
  • Have installed R and RStudio.
  • Have annotated the online notes.

The learning outcomes relate to the overall learning goals 3, 5 and 6 of the course.

Purpose of the course

The purpose of this course is to give an introduction to, and knowledge about, reinforcement learning (RL).

RL may be seen as

  • An approach to modelling sequential decision-making problems.
  • An approach for learning good decision making under uncertainty from experience.
  • A set of mathematical models for learning-based decision making.
  • A way of optimizing decisions in a sequential decision model, that is, of making a good sequence of decisions.
  • A way of estimating and finding near-optimal decisions for a stochastic process with sequential decision making.
  • A model in which, given the state of a system, the agent takes actions to maximize future rewards. Often the agent does not know the underlying setting and is therefore bound to learn from experience.

RL can also be seen as a way of modelling intuition. As humans, we often learn by trial and error. For instance, when playing a game, our strategy is based on the game rules and on what we have experienced works in previous plays. In an RL setting, the system has specific states, actions and a reward structure, that is, the rules of the game, and it is up to the agent to solve the game, i.e. to find actions that maximize the reward. Typically, the agent starts with totally random trials and finishes with sophisticated tactics and superhuman skills. By leveraging the power of search and many trials, RL is an effective way to find good actions.

A classic RL example is the bandit problem: you are in a casino and in each round you have to choose one of many slot machines (one-armed bandits). However, you do not know the payout distributions of the machines. In the beginning, you will probably just try out machines (exploration) and then, after some learning, you will prefer the best ones (exploitation). The problem is that if you play one machine most of the time, you gain little information about the others and may never find the best machine (the exploration-exploitation dilemma). RL focuses on finding a balance between exploring uncharted territory and exploiting current knowledge.
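To make the exploration-exploitation dilemma concrete, below is a minimal simulation sketch in R of an epsilon-greedy strategy on a small bandit problem. The number of machines, the normal payout distributions and the value of epsilon are made up for illustration only; bandit problems are treated properly in Module 2.

```r
## A minimal epsilon-greedy bandit simulation (illustrative sketch only).
## The payout distributions and the value of epsilon are made-up assumptions.
set.seed(42)
trueMeans <- c(1.0, 1.5, 2.0, 0.5)   # mean payout of each machine (unknown to the agent)
epsilon <- 0.1                       # probability of exploring a random machine
n <- 1000                            # number of rounds
Q <- rep(0, length(trueMeans))       # estimated value of each machine
N <- rep(0, length(trueMeans))       # number of times each machine has been played
for (t in 1:n) {
  a <- if (runif(1) < epsilon) sample(length(Q), 1) else which.max(Q)  # explore or exploit
  r <- rnorm(1, mean = trueMeans[a], sd = 1)                           # observed payout
  N[a] <- N[a] + 1
  Q[a] <- Q[a] + (r - Q[a]) / N[a]   # incremental sample-average update
}
round(Q, 2)   # estimates should be close to trueMeans
N             # the best machine should have been played most often
```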

The course starts by giving a general overview of RL and introducing bandit problems. Next, the mathematical framework of Markov decision processes (MDPs) is introduced as a classical formalization of sequential decision making. Here actions influence not only immediate rewards, but also subsequent situations, or states, and therefore also future rewards. An MDP assumes that the dynamics of the underlying process and the reward structure are known explicitly by the decision maker. In the last part of the course, we go beyond decision making in known environments and study RL methods for stochastic control.
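As a tiny illustration of what it means for the dynamics and reward structure to be known explicitly, the sketch below defines a hypothetical two-state, two-action MDP in R by its transition probability matrices and expected rewards. The states, actions and numbers are made up; MDPs are covered properly in Module 3.

```r
## Illustrative sketch of a known MDP: two states and two actions with
## made-up transition probabilities and expected rewards.
states  <- c("low", "high")
actions <- c("wait", "invest")
## P[[a]][s, s'] = probability of moving from state s to state s' under action a
P <- list(
  wait   = matrix(c(0.9, 0.1,
                    0.3, 0.7), nrow = 2, byrow = TRUE,
                  dimnames = list(states, states)),
  invest = matrix(c(0.5, 0.5,
                    0.1, 0.9), nrow = 2, byrow = TRUE,
                  dimnames = list(states, states))
)
## R[s, a] = expected immediate reward of taking action a in state s
R <- matrix(c(1, -2,
              3,  2), nrow = 2, byrow = TRUE,
            dimnames = list(states, actions))
P$invest["low", ]   # the decision maker knows these probabilities explicitly
R["high", "wait"]   # ... and the rewards
```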

Learning goals of the course

After having participated in the course, you must, in addition to achieving general academic skills, demonstrate:

Knowledge of

  1. RL for Bandit problems
  2. Markov decision processes and ways to optimize them
  3. the exploration vs exploitation challenge in RL and approaches for addressing this challenge
  4. the role of policy evaluation with stochastic approximation in the context of RL

Skills to

  1. define the key features of RL that distinguish it from other machine learning techniques
  2. discuss fundamental concepts in RL
  3. describe the mathematical framework of Markov decision processes
  4. formulate and solve Markov and semi-Markov decision processes for realistic problems with finite state space under different objectives
  5. apply fundamental techniques, results and concepts of RL on selected RL problems.
  6. given an application problem, decide if it should be formulated as a RL problem and define it formally (in terms of the state space, action space, dynamics and reward model)

Competences to

  1. identify areas where RL is valuable
  2. select and apply the appropriate RL model for a given business problem
  3. interpret and communicate the results from RL

Reinforcement learning textbook

The course uses the free textbook Reinforcement Learning: An Introduction by Sutton and Barto (2018). The book is essential reading for anyone wishing to learn the theory and practice of modern reinforcement learning. Read the weekly readings before the lecture to better understand the material and perform better in the course.

Sutton and Barto’s book is the most cited publication in RL research, and is responsible for making RL accessible to people around the world. The new edition, released in 2018, offers improved notation and explanations, additional algorithms, advanced topics, and many new examples; and it’s totally free. Just follow the citation link to download it.

Course organization

Each week covers a learning module, which is related to a chapter in the textbook. The learning path in a typical week is:

  • Before lectures: Read the chapter in the textbook and consider the extra module material.
  • Lectures (on campus).
  • After lectures: Module Exercises (in groups).

Lectures will not cover the whole curriculum but focus on the main parts. In some weeks tutorials are given in which we focus on a specific RL problem.

This module gives a short introduction to the course. The rest of the site consists of different parts, each containing teaching modules about specific topics:

  • Part I gives you a general introduction to RL and the bandit problem.

  • Part II considers RL sequential decision problems where the state and action spaces are small enough that values can be represented as arrays, or tables. We start by considering bandit problems (Module 2), an RL problem in which there is only a single state. Next, Markov decision processes (where the full model is known) are considered as a general modelling framework (Module 3), and the concepts of policies and value functions are discussed (Module 4). Model-based algorithms for finding the optimal policy (dynamic programming) are given in Module 5. The next modules consider model-free methods for finding the optimal policy, i.e. methods that do not require full knowledge of the transition probabilities and rewards of the process. Monte Carlo sampling methods are presented in Module 6 and …

The appendix contains different modules that may be helpful for you, including hints on how to work in groups, how to get help if you are stuck, and how to annotate the course notes.

Programming software

We use R as the programming language, and it is assumed that you are familiar with using R. R is a programming language and free software environment. R can be run from a terminal, but in general you use the IDE (integrated development environment) RStudio to run R and save your work. R and RStudio can either be run on your laptop or via RStudio Cloud, which runs R in the cloud from your browser.

If you need a brush-up on your R programming skills, have a look at Module A in the appendix.
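If you want to verify that your installation works, you can run a few lines like the following in the RStudio console (a simple sanity check, not part of the curriculum):

```r
## A quick sanity check of your R installation (illustrative only).
R.version.string                 # print the installed R version
x <- rnorm(100)                  # simulate 100 standard normal numbers
mean(x); sd(x)                   # basic summaries
plot(cumsum(x), type = "l")      # a simple base plot of a random walk
```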

Acknowledgements and license

Materials are taken from various places, and I would like to thank all for their inspiration.

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).

Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Sometimes solutions can be seen by pressing the button beside a question. But beware: you will not learn by giving up too early. Put some effort into finding a solution!

Exercise - How to annotate

The online course notes can be annotated using hypothes.is. You can create both private and public annotations. Collaborative annotation helps people connect to each other and to what they are reading, even when they are keeping their distance. You may also use public notes to point out spelling errors, unclear content, etc. in the notes.

  1. Sign up at hypothes.is. If you are using Chrome, you may also install the Chrome extension.
  2. Go back to this page and log in in the upper right corner (there should be some icons, e.g. <).
  3. Select some text and try to annotate it using both a private and public annotation (you may delete it again afterwards).
  4. Go to the slides for this module and try to annotate the page with a private comment.

Exercise - Templates

An RMarkdown template of the course notes and exercises is available on GitHub. You can download the repository and keep your own notes during the course in an R project.

  1. Open RStudio and do: File > New Project … > Version Control > Git. Add https://github.com/bss-osca/rl-student as the repository URL and create the project.
  2. Run renv::restore() from the R command line to install the needed packages (this may take some time). If you experience errors, try to install the packages one at a time using install.packages("pkg name").
  3. Open e.g. the file 01_rl-intro.Rmd and try to knit it using the Knit button in the upper left corner. An HTML file with the output will be generated. You should be able to add your own notes and solve the exercises using the Rmd file for each module.
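If you prefer working from the console, a compact sketch of steps 2 and 3 could look like the lines below. The package name in the fallback line is only an example, and the Knit button does the same as rmarkdown::render().

```r
## Console sketch of steps 2-3 above.
renv::restore()                        # install the packages recorded in the lockfile
# install.packages("tidyverse")        # example fallback if a single package fails
rmarkdown::render("01_rl-intro.Rmd")   # knit the module file to HTML
```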

References

Sutton, R. S., and A. G. Barto. 2018. Reinforcement Learning: An Introduction. Second Edition. MIT Press. http://incompleteideas.net/book/the-book.html.