class: center, middle, inverse, title-slide

.title[
# An introduction to Reinforcement Learning (RL)
]
.author[
### Lars Relund Nielsen
]

---
layout: true

---

## Learning outcomes

* Describe what RL is.
* Be able to identify different sequential decision problems.
* Know what Business Analytics is and identify RL in that framework.
* Memorise different names for RL and how it fits in a Machine Learning framework.
* Formulate the building blocks of an RL model (environment, agent, data, states, actions, rewards and policies).
* Run your first RL algorithm and evaluate its solution.

---

## What is reinforcement learning?

RL can be seen as

* An approach for modelling sequential decision-making problems.
* An approach for learning good decision making under uncertainty from experience.
* Mathematical models for learning-based decision making.
* Trying to optimize decisions in a sequential decision model, that is, making a good sequence of decisions.
* Estimating and finding near-optimal decisions for a stochastic process with sequential decision making.
* A model where, given the state of a system, the agent wants to take actions that maximize future reward. Often the agent does not know the underlying setting and, thus, is bound to learn from experience.

---

## Sequential decision problems

Examples (with possible actions) are:

* Playing backgammon (how to move the checkers).
* [Driving a car](https://arxiv.org/pdf/1807.00412.pdf) (left, right, forward, back, brake, stop, ...).
* How to [invest/maintain a portfolio of stocks](https://medium.com/ibm-data-ai/reinforcement-learning-the-business-use-case-part-2-c175740999) (buy, sell, amount).
* [Control an inventory](https://www.youtube.com/watch?v=pxWkg2N0l9c) (wait, buy, amount).
* Vehicle routing (routes).
* Maintain a spare part (wait, maintain).
* [Robot operations](https://arxiv.org/pdf/2103.14295.pdf) (sort, move, ...).
* [Dairy cow treatment/replacement](http://dx.doi.org/10.1016/j.ejor.2019.01.050) (treat, replace, ...).
* Recommender systems, e.g. [Netflix recommendations](https://scale.com/blog/Netflix-Recommendation-Personalization-TransformX-Scale-AI-Insights) (videos).

Note that current decisions have an impact on the future.

---

## RL and intuition

RL can be seen as a way of modelling intuition. An RL model has specific states, actions and a reward structure, and our goal as an agent is to find good decisions/actions that maximize the total reward.

The agent learns using, for instance:

* totally random trials (in the beginning),
* sophisticated tactics and superhuman skills (in the end).

That is, as the agent learns, the reward estimate of a given action becomes better.

As humans, we often learn by trial and error too:

* Learning to walk (by falling/pain).
* Learning to play (strategy is based on the game rules and what we have experienced to work in previous plays).

This can also be seen as learning the reward of our actions.
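---

## RL and intuition: a toy example

To make "the reward estimate of a given action becomes better" concrete, here is a toy R sketch (not part of the course code). Three hypothetical actions with made-up true mean rewards are tried at random, and a running average of the observed reward is kept for each; after many trials the estimates are close to the true means.

```r
set.seed(42)
true_mean <- c(a1 = 0.2, a2 = 0.5, a3 = 0.8)  # unknown to the agent (made-up values)
est <- c(a1 = 0, a2 = 0, a3 = 0)              # current reward estimates
n   <- c(a1 = 0, a2 = 0, a3 = 0)              # number of trials per action

for (i in 1:1000) {
  a <- sample(names(true_mean), 1)             # totally random trial
  r <- rnorm(1, mean = true_mean[a], sd = 0.1) # observed (noisy) reward
  n[a]   <- n[a] + 1
  est[a] <- est[a] + (r - est[a]) / n[a]       # update the running average
}
round(est, 2)
```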
---
layout: true

## RL in a Business Analytics framework

---
background-image: url("./img/analytics_plot1.png")

---
background-image: url("./img/analytics_plot2.png")

---
background-image: url("./img/analytics_plot3.png")

---
background-image: url("./img/analytics_plot4.png")

---
background-image: url("./img/analytics_plot5.png")

---
background-image: url("./img/analytics_plot6.png")

---
background-image: url("./img/analytics_plot7.png")

---
background-image: url("./img/analytics_plot8.png")

---
background-image: url("./img/analytics_plot9.png")

---
layout: true

---

## RL in different research disciplines

.pull-left[
RL is used in many research fields under different names:

- RL (most used) originated from computer science and AI.
- *Approximate dynamic programming (ADP)* is mostly used within operations research.
- *Neuro-dynamic programming* (when states are represented using a neural network).
- RL is closely related to *Markov decision processes* (a mathematical model for a sequential decision problem).
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="img/rl-names.png" alt="Adapted from Silver (2015)." width="100%" />
<p class="caption">Adapted from Silver (2015).</p>
</div>
]

---

## RL in a Machine Learning framework

.pull-left[
* **Supervised learning:** Given data `\((x_i, y_i)\)` learn to predict `\(y\)` from `\(x\)`, i.e. find `\(y \approx f(x)\)` (e.g. regression).
* **Unsupervised learning:** Given data `\((x_i)\)` learn patterns using `\(x\)`, i.e. find `\(f(x)\)` (e.g. clustering).
<!-- * Often assume that data are independent and identically distributed (iid). -->
* **RL:** Given state `\(x\)` you take an action and observe the reward `\(r\)` and the new state `\(x'\)`.
  - There is no supervisor `\(y\)`, only a reward signal `\(r\)`.
  - Your goal is to find a policy that optimizes the total reward.
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="img/rl-ml.png" alt="Adapted from Silver (2015)." width="100%" />
<p class="caption">Adapted from Silver (2015).</p>
</div>
]

---
layout: true

## The RL data-stream

---

.left-column-wide[
- Agent: The one who takes the action (computer, robot, decision maker).
- Environment: The system/world where observations and rewards are found.
- Data are revealed sequentially as you take actions:
  * `\((O_0)\)`
]

.right-column-small[
<div class="figure" style="text-align: center">
<img src="img/rl-intro-unnamed-chunk-11-1.png" alt="Agent-environment representation." width="100%" />
<p class="caption">Agent-environment representation.</p>
</div>
]

---

.left-column-wide[
- Agent: The one who takes the action (computer, robot, decision maker).
- Environment: The system/world where observations and rewards are found.
- Data are revealed sequentially as you take actions:
  * `\((O_0, A_0)\)`
]

.right-column-small[
<div class="figure" style="text-align: center">
<img src="img/rl-intro-unnamed-chunk-12-1.png" alt="Agent-environment representation." width="100%" />
<p class="caption">Agent-environment representation.</p>
</div>
]

---

.left-column-wide[
- Agent: The one who takes the action (computer, robot, decision maker).
- Environment: The system/world where observations and rewards are found.
- Data are revealed sequentially as you take actions:
  * `\((O_0, A_0, R_1, O_1)\)`
]

.right-column-small[
<div class="figure" style="text-align: center">
<img src="img/rl-intro-unnamed-chunk-13-1.png" alt="Agent-environment representation." width="100%" />
<p class="caption">Agent-environment representation.</p>
</div>
]

---

.left-column-wide[
- Agent: The one who takes the action (computer, robot, decision maker).
- Environment: The system/world where observations and rewards are found.
- Data are revealed sequentially as you take actions:
  * `\((O_0, A_0, R_1, O_1, A_1)\)`
]

.right-column-small[
<div class="figure" style="text-align: center">
<img src="img/rl-intro-unnamed-chunk-14-1.png" alt="Agent-environment representation." width="100%" />
<p class="caption">Agent-environment representation.</p>
</div>
]

---

.left-column-wide[
- Agent: The one who takes the action (computer, robot, decision maker).
- Environment: The system/world where observations and rewards are found.
- Data are revealed sequentially as you take actions:
  * `\((O_0, A_0, R_1, O_1, A_1, R_2, O_2)\)`
]

.right-column-small[
<div class="figure" style="text-align: center">
<img src="img/rl-intro-unnamed-chunk-15-1.png" alt="Agent-environment representation." width="100%" />
<p class="caption">Agent-environment representation.</p>
</div>
]

---

.left-column-wide[
- Agent: The one who takes the action (computer, robot, decision maker).
- Environment: The system/world where observations and rewards are found.
- Data are revealed sequentially as you take actions:
  * `\((O_0, A_0, R_1, O_1, A_1, R_2, O_2, \ldots)\)`
- History at time `\(t\)`:
  `$$H_t = (O_0, A_0, R_1, O_1, \ldots, A_{t-1}, R_t, O_t)$$`
- Your goal is to find a policy that maximizes the total future reward.
]

.right-column-small[
<div class="figure" style="text-align: center">
<img src="img/rl-intro-unnamed-chunk-16-1.png" alt="Agent-environment representation." width="100%" />
<p class="caption">Agent-environment representation.</p>
</div>
]

---
layout:false

## Reward: a closer look

- The reward `\(R_t\)` is a number representing the reward at time `\(t\)` (negative if a cost).
  * Playing backgammon (0 during play, 1 when winning, -1 when losing).
  * How to invest/maintain a portfolio of stocks (the profit).
  * Control an inventory (inventory cost, lost sales cost).
  * Vehicle routing (transportation cost).

--

- Reward may be delayed, not instantaneous (the consequences of your decision now are revealed later).

--

- RL assumption: all goals can be transformed into the maximisation of expected total future (cumulative) reward.

<!-- - Time really matters (sequential/dynamic system, non iid data). -->
<!-- - Agent’s actions affect the subsequent data it receives. -->

---
layout:true

## History vs state

---

.left-column-wide[
- The history is the sequence of observations, actions and rewards
  `$$H_t = (O_0, A_0, R_1, O_1, \ldots, A_{t-1}, R_t, O_t).$$`
]

.right-column-small[
<img src="img/rl-intro-unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.left-column-wide[
- The history is the sequence of observations, actions and rewards
  `$$H_t = (O_0, A_0, R_1, O_1, \ldots, A_{t-1}, R_t, O_t).$$`
- The state `\(S_t\)` is the information used to take the next action.
]

.right-column-small[
<img src="img/rl-intro-unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.left-column-wide[
- The history is the sequence of observations, actions and rewards
  `$$H_t = (O_0, A_0, R_1, O_1, \ldots, A_{t-1}, R_t, O_t).$$`
- The state `\(S_t\)` is the information used to take the next action.
- The next action `\(A_t\)` depends on the history, i.e. a state is a function of the history `\(S_t = f(H_t)\)`.
  * Choosing `\(S_t = H_t\)` is bad, since the history grows with `\(t\)`.
  * Instead, just store the information needed for taking the next action.
  * Markov state: given the present state, the future is independent of the past.
]

.right-column-small[
<img src="img/rl-intro-unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]

---
layout:false
layout:true

---

## Policy

- A *policy* is the agent’s behaviour.
- It is a map from state to action, i.e. a function
  `$$a = \pi(s)$$`
  saying that given the agent is in state `\(s\)` we choose action `\(a\)`.
- Given state `\(S_t\)`, the goal is to find a policy that maximizes the total future reward.

---

## Value of a state

- We use the *value function* to predict the future reward in a state `\(s\)`, e.g. the expected discounted future reward:
  `$$V_\pi(s) = \mathbb{E}_\pi(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s).$$`
- Discount factor `\(\gamma=0\)`: Only care about the present reward.
- Discount factor `\(\gamma=1\)`: Future reward is as beneficial as immediate reward. Can be used if the time-horizon is finite.
- Discount factor `\(\gamma<1\)`: Rewards nearer the present are more beneficial. Note that `\(V(s)\)` converges to a finite number even if the time-horizon is infinite (see the small R example in the appendix).
- A policy that maximizes the total future reward given state `\(s\)`:
  `$$\pi^* = \arg\max_{\pi\in\Pi}(V_\pi(s)).$$`

<!-- ## Model free vs Model based -->

---

## Exploitation vs Exploration

- Exploitation: Taking the action assumed to be optimal with respect to the data observed so far.
  * Gives better predictions of the value function (given the current policy).
  * Prevents the agent from discovering potentially better decisions (a better policy).
- Exploration: Not taking the action that seems to be optimal.
  * The agent explores to find states it may otherwise not see and hence can update the value function for these states.
- Examples:
  * Movie recommendation: recommending the user’s best-rated movie type (exploitation) or trying another movie type (exploration).
  * Oil drilling: drilling at the best known location (exploitation) or trying a new location (exploration).

---

## Let's play Tic-Tac-Toe

We start with an empty board and have at most 9 moves (a player may win before all moves are used). The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game.

Reward for a player is 1 for 'win', 0.5 for 'draw', and 0 for 'loss'. These values can be seen as the probability of winning.

<table style="width: 100%; border: 0px !important;">
<tr>
<td><table class="table table-bordered" style="width: auto !important; margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td style="text-align:center;"> . </td>
<td style="text-align:center;"> . </td>
<td style="text-align:center;"> X </td>
</tr>
<tr>
<td style="text-align:center;"> . </td>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> .
</td>
</tr>
<tr>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> O </td>
<td style="text-align:center;"> O </td>
</tr>
</tbody>
</table>
</td>
<td><table class="table table-bordered" style="width: auto !important; margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> . </td>
<td style="text-align:center;"> X </td>
</tr>
<tr>
<td style="text-align:center;"> . </td>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> . </td>
</tr>
<tr>
<td style="text-align:center;"> O </td>
<td style="text-align:center;"> O </td>
<td style="text-align:center;"> O </td>
</tr>
</tbody>
</table>
</td>
<td><table class="table table-bordered" style="width: auto !important; margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> O </td>
</tr>
<tr>
<td style="text-align:center;"> O </td>
<td style="text-align:center;"> O </td>
<td style="text-align:center;"> X </td>
</tr>
<tr>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> X </td>
<td style="text-align:center;"> O </td>
</tr>
</tbody>
</table>
</td>
</tr>
</table>

---

## Gameplay

Let `\(S_t\)` denote the board state before the opponent makes a move.

<img src="img/rl-intro-hgf-1.png" width="100%" height="80%" style="display: block; margin: auto;" />

---

## Learning to play

- Define `\(V(S)\)` to be 1 if we win, 0 if we lose and 0.5 otherwise (the reward/probability of winning).
- Most of the time we *exploit* our knowledge with probability `\(1-\epsilon\)`, i.e. choose the action which gives us the highest estimated reward and update the value of a state using
  `$$V(S_t) = V(S_t) + \alpha(V(S_{t+1})-V(S_t))$$`
  where `\(\alpha\)` is the *step-size* parameter (a small R sketch of this update is given in the appendix slides).
- Sometimes we *explore* with probability `\(\epsilon\)` and choose another action/move than the one that seems optimal.

---

## Let us have a look at the code

- Open your `rl-student` R project in RStudio.
- Compile the `01_rl-intro.Rmd` file.
- Let us have a look at the code ...

---
layout:false

# References

Silver, D. (2015). _Lectures on Reinforcement Learning_. URL: [https://www.davidsilver.uk/teaching/](https://www.davidsilver.uk/teaching/).
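---

## Appendix: discounted return in R

A minimal R sketch of the discounted return behind the value function slide (not part of the course code; the reward stream is made up for illustration). It shows why `\(\gamma < 1\)` gives a finite value even for a very long horizon.

```r
# Discounted return of a reward sequence r = (R_{t+1}, R_{t+2}, ...):
# sum of gamma^(k-1) * r[k] for k = 1, 2, ...
disc_return <- function(r, gamma) sum(gamma^(seq_along(r) - 1) * r)

r <- rep(1, 1000)            # a long stream of reward 1 (made-up example)
disc_return(r, gamma = 0)    # 1: only the present reward counts
disc_return(r, gamma = 0.9)  # about 10, i.e. close to 1/(1 - 0.9)
disc_return(r, gamma = 1)    # 1000: grows with the horizon
```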
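---

## Appendix: the value update in R

A small sketch of the `\(\epsilon\)`-greedy choice and the value update from the "Learning to play" slide. The state names and numbers are made up; the full Tic-Tac-Toe example is covered when we look at the course code.

```r
V <- c(s1 = 0.5, s2 = 0.5, s3 = 0.5)  # value estimates for hypothetical states
epsilon <- 0.1                        # exploration probability
alpha   <- 0.2                        # step-size parameter

# Pick the next state: exploit (highest value) with probability 1 - epsilon, else explore.
choose_next <- function(V, epsilon) {
  if (runif(1) < epsilon) sample(names(V), 1) else names(which.max(V))
}

s_next <- choose_next(V, epsilon)
# Move the value of the current state (here s1) towards the value of the next state:
V["s1"] <- V["s1"] + alpha * (V[s_next] - V["s1"])
V
```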