class: center, middle, inverse, title-slide

.title[
# Monte Carlo methods for prediction and control
]
.author[
### Lars Relund Nielsen
]

---

layout: true

<!-- Templates -->
<!-- .pull-left[] .pull-right[] -->
<!-- knitr::include_graphics("img/bandit.png") -->
<!-- .left-column-wide[] .right-column-small[] -->

---

## Learning outcomes

* Identify the difference between model-based and model-free RL.
* Describe how MC methods can be used to estimate value functions from sample data.
* Do MC prediction to estimate the value function for a given policy.
* Explain why it is important to maintain exploration in MC algorithms.
* Do policy improvement (control) using MC in a GPI algorithm.
* Compare different ways of exploring the state-action space.
* Argue why off-policy learning can help deal with the exploration problem.
* Use importance sampling to estimate the expected value of a target distribution using samples from a different distribution.
* Use importance sampling in off-policy learning to predict the value function.
* Explain how to modify the MC GPI algorithm for off-policy learning.

---

## Monte Carlo methods for RL

* Monte Carlo (MC) is an estimation method which involves a random component.
* Use MC methods to learn state and action values by sampling and averaging returns.
* MC methods do not use the dynamics (unlike DP, where the current state-value is estimated using the next state-values).
* Estimate values by considering different sample-paths (state, action and reward realizations).
* MC methods are model-free since they do not require full knowledge of the transition probabilities and rewards (a model of the environment).
* MC methods learn the value function directly from experience.
* The sample-path can be generated using simulation (which requires some knowledge of the environment).
* We consider MC methods for processes with episodes, i.e. where there is a terminal state.
* Example: Blackjack (calculating transition probabilities may be tedious and error-prone). Instead we can simulate a game (a sample-path).
* Use generalised policy iteration, but learn the value function from experience.

---

## MC prediction (evaluation)

Given policy `\(\pi\)`, we want to estimate the state-value function:

`$$v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s],$$`

where the return is

`$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}.$$`

Procedure:

1. Generate sample-paths `\(S_0, A_0, R_1, S_1, A_1, \ldots, S_{T-1}, A_{T-1}, R_T\)`.
2. Calculate `\(G_t\)` for each state in the sample-path.
3. Use the average of the realized returns for each state as an estimate.

With enough observations, the average converges to the true state-value under the policy `\(\pi\)`.

---

## First and every visit MC

Given a policy `\(\pi\)` and a set of sample-paths, there are two ways to estimate the state values `\(v_\pi(s)\)`:

* First visit MC: average the returns following the first visit to state `\(s\)` in each sample-path.
  - Generates iid (independent and identically distributed) estimates of `\(v_\pi(s)\)` with finite variance.
  - Converges to the expected value by the law of large numbers.
* Every visit MC: average the returns following every visit to state `\(s\)`.
  - Does not generate independent estimates, but still converges.
  - Easier to use since we don't have to check whether the state has already been visited.

The book mostly focuses on first visit MC, but the R code uses every visit MC. A sketch of first visit MC prediction is given on the next slide.
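---

## First visit MC prediction in R (sketch)

A minimal sketch (not the code used in the course), assuming each episode is a data frame with a column `s` holding `\(S_t\)` and a column `r` holding `\(R_{t+1}\)`; the helper name `mc_predict_first_visit` is made up for illustration:

```r
## First visit MC prediction: average the return from the first visit to each state.
mc_predict_first_visit <- function(episodes, gamma = 1) {
  returns <- list()                    # realized returns stored per state
  for (ep in episodes) {
    g <- 0
    g_t <- numeric(nrow(ep))
    for (t in nrow(ep):1) {            # backwards: G_t = R_{t+1} + gamma * G_{t+1}
      g <- ep$r[t] + gamma * g
      g_t[t] <- g
    }
    first <- !duplicated(ep$s)         # keep only the first visit to each state
    for (i in which(first)) {
      s <- as.character(ep$s[i])
      returns[[s]] <- c(returns[[s]], g_t[i])
    }
  }
  sapply(returns, mean)                # estimate of v_pi(s) for each visited state
}
```

For every visit MC, drop the `duplicated()` filter and average over all visits instead.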
---

## MC prediction algorithm

<img src="img/mc-prediction.png" width="100%" style="display: block; margin: auto;" />

---

## MC prediction of action-values

* With a model of the environment we only need to estimate the state-value function (and use the Bellman optimality equations).
* Without a model it is useful to estimate action-values since the optimal policy can be found using:
  `$$\pi_*(s) = \arg\max_{a \in \mathcal{A}} q_*(s, a).$$`
* To find `\(q_*\)`, we first need to predict action-values for a policy `\(\pi\)`.
* Problem: If `\(\pi\)` is deterministic, we only estimate one action-value per state.
* Some exploration is needed:
  - Make `\(\pi\)` stochastic with a non-zero probability of selecting each state-action pair (e.g. `\(\epsilon\)`-soft).
  - Exploring starts: Use a non-zero probability for each state-action pair of being selected as the start of a sample-path.

---

## Generalized policy iteration

.left-column-wide[.midi[

Generalised Policy Iteration (GPI) considers different policy evaluation and improvement strategies. The idea is to generate a sequence of policies and action-value functions

`$$\pi_0 \xrightarrow[]{E} q_{\pi_0} \xrightarrow[]{I} \pi_1 \xrightarrow[]{E} q_{\pi_1} \xrightarrow[]{I} \pi_2 \xrightarrow[]{E} \ldots \xrightarrow[]{I} \pi_* \xrightarrow[]{E} q_{*}.$$`

The algorithm repeatedly alternates between two steps:

* Evaluation (E): Here we use MC evaluation. Note we don't have to evaluate the action-values precisely.
* Improvement (I): Select the next policy greedily:
  `$$\pi(s) = \arg\max_a q(s, a).$$`

By using greedy action selection, the policy improvement theorem applies, i.e. `\(\pi_{k+1}\)` is not worse than `\(\pi_{k}\)`.
]]

.right-column-small[
<img src="img/policy-ite-general.png" width="100%" style="display: block; margin: auto;" />
]

---

## GPI convergence

* Model-based GPI (transition probability matrix and reward distribution are known): GPI converges if all states are visited during the algorithm.
* Model-free GPI: We cannot simply use a 100% greedy strategy all the time, since all our action-values are estimates. An element of exploration must be used to estimate the action-values.

For convergence to the optimal policy a model-free GPI algorithm must satisfy:

1. Infinite exploration: all state-action pairs `\((s,a)\)` should be explored infinitely many times as the number of iterations goes to infinity, i.e. `\(\lim_{k\rightarrow\infty} n_k(s, a) = \infty\)`.
2. Greedy in the limit: while we maintain infinite exploration, we do eventually need to converge to the optimal policy:
   `$$\lim_{k\rightarrow\infty} \pi_k(a|s) = 1 \text{ for } a = \arg\max_{a'} q(s, a').$$`

---

## GPI with exploring starts

<img src="img/mc-gpi-es.png" width="90%" style="display: block; margin: auto;" />

---

## GPI using `\(\epsilon\)`-soft policies

.left-column-wide[

* Use an `\(\epsilon\)`-soft policy to ensure infinite exploration.
* An `\(\epsilon\)`-soft policy satisfies: `$$\pi(a|s) \geq \epsilon/|\mathcal{A}(s)|.$$`
* The `\(\epsilon\)`-greedy policy puts probability `\(1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}\)` on the maximal action and `\(\frac{\epsilon}{|\mathcal{A}(s)|}\)` on each of the others.
* Note `\(\epsilon\)`-soft policies are always stochastic and hence ensure infinite exploration of the `\((s,a)\)` pairs.
* The algorithm finds the best `\(\epsilon\)`-soft policy (not necessarily the optimal policy).
* To ensure the 'greedy in the limit' convergence (and find the optimal policy) one may decrease `\(\epsilon\)` as the number of iterations increases (e.g. `\(\epsilon = 1/k\)`).

A sketch of an `\(\epsilon\)`-greedy policy in R is given on the next slide.
]

.right-column-small[
<img src="img/mc-unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
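---

## An `\(\epsilon\)`-greedy policy in R (sketch)

A minimal sketch of `\(\epsilon\)`-greedy action selection (one particular `\(\epsilon\)`-soft policy), assuming the action-value estimates of a state are stored in a named vector `q_s`; the helper name `eps_greedy_action` is made up for illustration:

```r
## Choose an action: probability 1 - epsilon + epsilon/|A(s)| on the greedy
## action and epsilon/|A(s)| on each of the others.
eps_greedy_action <- function(q_s, epsilon = 0.1) {
  n <- length(q_s)
  prob <- rep(epsilon / n, n)                  # epsilon/|A(s)| on every action
  greedy <- which.max(q_s)                     # ties broken by the first maximum
  prob[greedy] <- prob[greedy] + 1 - epsilon   # extra mass on the greedy action
  sample(names(q_s), size = 1, prob = prob)
}

eps_greedy_action(c(up = 1.2, down = 0.7, left = -0.1), epsilon = 0.2)
```

Since every action keeps probability at least `\(\epsilon/|\mathcal{A}(s)|\)`, all state-action pairs keep being explored when the policy is followed.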
---

## An `\(\epsilon\)`-soft GPI algorithm

<img src="img/mc-gpi-on-policy.png" width="80%" style="display: block; margin: auto;" />

---

## On-policy vs off-policy

* Until now we have used *on-policy* algorithms: we evaluate and improve the same policy that is used to make decisions.
* To ensure infinite exploration we used exploring starts and `\(\epsilon\)`-soft policies.
* Another approach is *off-policy* algorithms, which consider two policies:
  - A policy `\(b\)` used to generate the sample-paths (behaviour policy).
  - A policy `\(\pi\)` that is learned for control (target policy).
* We update the target policy using the sample-paths from the behaviour policy.
* To ensure infinite exploration the behaviour policy can be e.g. `\(\epsilon\)`-soft.
* The target policy `\(\pi\)` may be deterministic by using greedy selection with respect to the action-value estimates (greedy in the limit is satisfied).
* Off-policy learning methods are powerful and can be used to learn from data generated by a conventional non-learning controller or by a human expert.

---

## Importance sampling

How do we estimate the expected return under the target policy when we only have sample-paths from the behaviour policy? Importance sampling: estimate expected values under one distribution given samples from another.

.pull-left[
Two distributions `\(a\)` and `\(b\)` with samples from `\(b\)`:

$$
`\begin{align}
\mathbb{E}_{a}[X] &= \sum_{x\in X} a(x)x \\
&= \sum_{x\in X} b(x)\frac{a(x)}{b(x)}x \\
&= \sum_{x\in X} b(x)\rho(x)x \\
&= \mathbb{E}_{b}\left[\rho(X)X\right]
\end{align}`
$$
]

.pull-right[
* `\(\rho(x) = a(x)/b(x)\)` denotes the *importance sampling ratio*.
* Given samples `\((x_1,\ldots,x_n)\)` from `\(b\)` we can then estimate the expectation using the sample average

$$
`\begin{align}
\mathbb{E}_{a}[X] &= \mathbb{E}_{b}\left[\rho(X)X\right] \\
&\approx \frac{1}{n}\sum_{i = 1}^n \rho(x_i)x_i. \\
\end{align}`
$$
]

---

## Off-policy importance sampling

* Given a target policy `\(\pi\)` and a behaviour policy `\(b\)` we want to estimate the action-values:
  `$$q_\pi(s,a) = \mathbb{E}_{\pi}[G_t|S_t = s, A_t = a] = \mathbb{E}_{b}[\rho(G_t)G_t|S_t = s, A_t = a] \approx \frac{1}{n} \sum_{i = 1}^n \rho_iG_i,$$`
  where, given the sample-paths, we have `\(n\)` observations of the return `\((G_1, \ldots, G_n)\)` from visits to `\((s,a)\)`.
* We need to find `\(\rho_i\)` for a given sample-path `\(S_t, A_t, R_{t+1}, \ldots, R_T, S_T\)` with return `\(G_t\)`:

$$
`\begin{align}
\rho(G_t) &= \frac{\Pr{}(S_t, A_t, \dots, S_T| S_t = s, A_t = a, \pi)}{\Pr{}(S_t, A_t, \dots, S_T| S_t = s, A_t = a, b)} \\
&= \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)\Pr{}(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)\Pr{}(S_{k+1}|S_k, A_k)}\\
&=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}.
\end{align}`
$$

<!-- * Note the transition probabilities cancel out (ratio independent of MDP dynamics). -->

---

## MC prediction algorithm

We can now modify the MC prediction algorithm to use off-policy learning:

<img src="img/mc-prediction.png" width="100%" style="display: block; margin: auto;" />

---

## Off-policy MC prediction (changes)

* Generate a sample-path using policy `\(b\)` instead of `\(\pi\)`.
* Add a variable `\(W\)` representing the importance sampling ratio, which must be set to 1 on the line containing `\(G \leftarrow 0\)`.
* Modify the line `\(G \leftarrow \gamma G + R_{t+1}\)` to `\(G \leftarrow \gamma WG + R_{t+1}\)` since the return now has to be weighted with the importance sampling ratio.
* Add a line after the last one with `\(W \leftarrow W \pi(A_t|S_t)/b(A_t|S_t)\)`, i.e. we update the importance sampling ratio.
* If `\(\pi(A_t|S_t) = 0\)` then we may stop the inner loop earlier (`\(W=0\)` for the remaining `\(t\)`).
* An incremental update of `\(V\)` can be done (only if we do not modify `\(G\)`, i.e. keep `\(G \leftarrow \gamma G + R_{t+1}\)`):

$$
V(s) \leftarrow V(s) + \frac{1}{n} \left[WG - V(s)\right],
$$

  where `\(n\)` denotes the number of realized returns and `\(G\)` the current realized return.

---

## Weighted importance sampling

Different importance sampling methods:

.frame[
.pull-left[
Ordinary importance sampling:
`$$q_\pi(s,a) \approx \frac{1}{n} \sum_{i = 1}^n \rho_iG_i$$`
Higher variance.
]
.pull-right[
Weighted importance sampling:
`$$q_\pi(s,a) \approx \frac{1}{\sum_{i = 1}^n \rho_i} \sum_{i = 1}^n \rho_iG_i.$$`
Smaller variance (faster convergence).
]]

Incremental update for weighted importance sampling (writing `\(W_i = \rho_i\)` and `\(C_n = \sum_{i = 1}^n W_i\)`):

$$
`\begin{align}
q_\pi(s,a) &\approx V_{n+1} = \frac{1}{\sum_{i = 1}^n \rho_i} \sum_{i = 1}^n \rho_iG_i \\
&= \frac{1}{C_n} \sum_{i = 1}^n W_iG_i = V_n + \frac{W_n}{C_n} (G_n - V_n).
\end{align}`
$$

Sketches in R of the two estimators and of the incremental update are given at the end of the slides.

---

## Off-policy prediction algorithm

<img src="img/mc-off-policy-prediction.png" width="90%" style="display: block; margin: auto;" />

---

## Off-policy GPI

<img src="img/mc-off-policy-gpi.png" width="85%" style="display: block; margin: auto;" />

<!-- # References -->
<!-- ```{r, results='asis', echo=FALSE} -->
<!-- PrintBibliography(bib) -->
<!-- ``` -->
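---

## Importance sampling in R (sketch)

A small sketch illustrating ordinary and weighted importance sampling on a toy discrete distribution; the outcomes `x` and the probabilities `a` (target) and `b` (behaviour) below are made-up numbers for illustration only:

```r
set.seed(875)
x <- c(1, 2, 3)                       # possible outcomes
a <- c(0.7, 0.2, 0.1)                 # target distribution (plays the role of pi)
b <- rep(1/3, 3)                      # behaviour distribution used for sampling
idx <- sample(1:3, size = 10000, replace = TRUE, prob = b)
rho <- a[idx] / b[idx]                # importance sampling ratios
sum(x * a)                            # true expected value under the target (1.4)
mean(rho * x[idx])                    # ordinary importance sampling estimate
sum(rho * x[idx]) / sum(rho)          # weighted importance sampling estimate
```

In off-policy MC the outcomes are returns `\(G_i\)` and the ratios are products of `\(\pi(A_k|S_k)/b(A_k|S_k)\)` along each sample-path.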
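---

## Incremental weighted update in R (sketch)

A minimal sketch of the incremental update `\(V_{n+1} = V_n + \frac{W_n}{C_n}(G_n - V_n)\)` with `\(C_n = C_{n-1} + W_n\)`; the helper name `wis_update` is made up for illustration:

```r
## Process one realized return G with weight W (the importance sampling ratio).
wis_update <- function(est, G, W) {
  est$C <- est$C + W                          # C_n = C_{n-1} + W_n
  if (est$C > 0) {
    est$V <- est$V + (W / est$C) * (G - est$V)
  }
  est
}

est <- list(V = 0, C = 0)
est <- wis_update(est, G = 10, W = 2)
est <- wis_update(est, G = 4, W = 1)
est$V                                         # (2 * 10 + 1 * 4) / (2 + 1) = 8
```

This matches the incremental update given on the weighted importance sampling slide.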