class: center, middle, inverse, title-slide

.title[
# On-Policy Control with Approximation
]
.author[
### Lars Relund Nielsen
]

---
layout: true

<!-- Templates -->
<!-- .pull-left[] .pull-right[] -->
<!-- knitr::include_graphics("img/bandit.png") -->
<!-- .left-column-wide[] .right-column-small[] -->

---

## Learning outcomes

- Describe how to extend semi-gradient prediction to action-value approximation.
- Implement episodic one-step semi-gradient SARSA with `\(\epsilon\)`-greedy improvement.
- Describe how to generalize to `\(n\)`-step semi-gradient SARSA in episodic tasks and explain the bias–variance trade-off as `\(n\)` increases (from TD toward Monte Carlo).
- Explain why, in continuing tasks with function approximation, the discounted objective lacks a reliable local improvement guarantee, motivating a shift to average reward.
- Define and interpret differential returns and differential value functions for the average-reward setting.
- Derive the Bellman equations under the average-reward criterion.
- Describe differential TD errors and the corresponding semi-gradient updates (state-value and action-value forms) using a running estimate of the average reward.
- Explain how to update the estimate of the average reward.

---

## On-Policy Control with Approximation

- In the previous module: focus on predicting the state values of a policy using function approximation.
- Now: emphasis on control, i.e. finding an optimal policy through function approximation of action values `\(\hat q(s, a, \mathbf{w})\)`.
- The focus is on on-policy methods.
- Episodic case: the extension from prediction to control is straightforward.
- Continuing case: discounting is not suitable for finding an optimal policy. Here we have to switch from the discounted objective to an average-reward objective.

---

## Episodic Semi-gradient Control - Action values

- Consider episodic tasks.
- Goal is to find a good policy.
- The action-value function is approximated by `\(\hat q(s,a,\mathbf{w}) \approx q^\pi(s,a)\)` with weights `\(\mathbf{w}\)`.
- Training examples are now `\((S_t, A_t) \mapsto U_t\)`, where `\(U_t\)` is a target approximating `\(q^\pi(S_t,A_t)\)`.
- The one-step semi-gradient target is
  `$$U_t = R_{t+1} + \gamma\, \hat q(S_{t+1}, A_{t+1}, \mathbf{w}_t),$$`
  which bootstraps from the next state–action estimate.
- An `\(\epsilon\)`-greedy policy can be used for exploration.
- Learning is done using semi-gradient stochastic gradient descent on the squared error:
  `$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\big[U_t - \hat q(S_t,A_t,\mathbf{w}_t)\big]\nabla_{\mathbf{w}} \hat q(S_t,A_t,\mathbf{w}_t).$$`

---

## Episodic Semi-gradient Control - Improvement

- The action-value update is the analogue of the semi-gradient TD update for state values used in prediction.
- Policy improvement and action selection are needed for doing control.
- If the action set is discrete and not too large, we can use the techniques developed so far:
  - Exploration using an `\(\epsilon\)`-greedy policy.
  - Update action values using SARSA targets.
- Generalized policy iteration can then be carried out and converge if we use on-policy methods.

--

Will it work if the action space is continuous?

---

## Pseudo code for the episodic semi-gradient SARSA

<img src="img/1001_Ep_Semi_Grad_Sarsa.png" width="100%" style="display: block; margin: auto;" />

---

## What if we use an off-policy algorithm?

- Can this algorithm be modified to use Q-learning?
- In general this may not work, since Q-learning is an off-policy algorithm.
- Here we may diverge due to the “deadly triad” (off-policy + bootstrapping + approximation). There is no general convergence guarantee.
- This is true even with linear features and a fixed `\(\epsilon\)`-greedy behaviour policy.
- For further details you may read Chapter 11 in the book.

What about modifying the algorithm to use expected SARSA?

---

## `\(n\)`-step SARSA

- To extend one-step SARSA we may extend our forward view to `\(n\)` steps, i.e.
the target accumulates the next `\(n\)` rewards before bootstrapping:
  `$$U_t = G_{t:t+n} = \sum_{k=1}^{n} \gamma^{k-1}R_{t+k}\;+\;\gamma^{n}\,\hat q(S_{t+n}, A_{t+n}, \mathbf{w}).$$`
- If the episode terminates before `\(t+n\)`, then `\(G_{t:t+n}\)` is just the episodic return `\(G_t\)`.
- The update now becomes
  `$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\big[G_{t:t+n} - \hat q(S_t,A_t,\mathbf{w})\big]\nabla_{\mathbf{w}} \hat q(S_t,A_t,\mathbf{w}).$$`
<!-- - Exploration could be `\(\epsilon\)`-greedy with respect to `\(\hat q\)`. -->
- The choice of look-ahead steps `\(n\)` involves a bias–variance trade-off:
  - `\(n=1\)` enables rapid learning but can be shortsighted.
  - Large `\(n\)` approaches Monte Carlo methods, increasing variance.
- In practice, small to moderate `\(n\)` often works best (faster learning and good stability).

---

## The Average Reward Criterion

The average-reward formulation treats continuing tasks by optimizing the long-run reward rate instead of discounted returns. The performance of a policy `\(\pi\)` is defined as the *steady-state average*
$$
`\begin{align}
r(\pi) &= \lim_{h\to\infty}\frac{1}{h}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{h} R_t\right] \\
&= \sum_s \mu_\pi(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\,r \\
&= \sum_s \mu_\pi(s) \sum_a \pi(a|s)\, r(s,a),
\end{align}`
$$
where `\(r(s,a) = \sum_{s',r} p(s',r|s,a)\,r\)` is the expected one-step reward. This holds if the Markov chain induced by `\(\pi\)` is *ergodic*, i.e., the steady-state distribution `\(\mu_\pi(s)\)` exists and is independent of the starting state.

---

## Bellman equations (average reward criterion)

To measure preferences between states and actions without discounting, we introduce *differential returns* that subtract the average reward rate at each step:
`$$G_t \doteq \sum_{k=0}^{\infty}\big(R_{t+1+k}-r(\pi)\big).$$`
The *differential action-value* function then becomes
$$
`\begin{align}
q^\pi(s,a) &= \mathbb{E}_\pi[G_t\mid S_t=s, A_t=a] \\
&= \sum_{s',r} p(s',r\mid s,a)\Big(r - r(\pi) + \sum_{a'} \pi(a'\mid s')\,q^\pi(s',a')\Big).
\end{align}`
$$
The differential value functions satisfy Bellman equations analogous to the discounted case, but without `\(\gamma\)` and with rewards centered by `\(r(\pi)\)`.

---

## Control for Continuing Tasks (Average Reward)

- Replace discounted TD errors with differential TD errors.
- Keep a running estimate `\(\bar R\)` of the average reward.
- With function approximation `\(\hat q(s,a,\mathbf{w})\)` and a running estimate `\(\bar R_t \approx r(\pi)\)`, the target is
  `$$U_t = R_{t+1}-\bar R_t + \hat q(S_{t+1},A_{t+1},\mathbf{w}_t).$$`
- The semi-gradient update becomes
  `$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\big[U_t - \hat q(S_t,A_t,\mathbf{w})\big]\,\nabla_{\mathbf{w}} \hat q(S_t,A_t,\mathbf{w}).$$`
- How do we update the average-reward estimate? This can be done incrementally with a small step size `\(\beta\)` to ensure stability:
  `$$\bar R \leftarrow \bar R + \beta\,\delta_t^q, \qquad \delta_t^q = U_t - \hat q(S_t,A_t,\mathbf{w}).$$`

<!-- Control replaces policy terms with maximization as usual, defining optimal differential values `\(v^*\)` and `\(q^*\)` and coupling learning with `\(\epsilon\)`-greedy improvement over `\(\hat q\)`. -->

- Note that the familiar properties from the discounted setting carry over; we just use differential returns instead.

---

## Why avoid the discounted reward?

- The discounted reward criterion is ill-suited for truly continuing tasks once function approximation enters the picture.
- Note that function approximation couples states: changing `\(\mathbf{w}\)` changes the action values in many states at once.
- The policy improvement theorem does not apply with function approximation in the discounted setting.
- Approximation errors can be amplified by discounting, and greedy improvement is not guaranteed to improve the policy.
- This loss of the policy improvement theorem means discounted control lacks a firm local-improvement foundation under approximation.
- The best solution is to replace the discounted criterion with the average-reward criterion and use differential values. Here the policy improvement theorem holds.
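---

## Sketch: episodic semi-gradient SARSA in Python

As a minimal sketch of the episodic algorithm, the pseudo code can be written out with linear function approximation and an `\(\epsilon\)`-greedy policy. The environment interface (`env.reset()`, `env.step(a)` returning `(s', r, done)`) and the feature map `features(s, a)` are illustrative assumptions, not part of the Colab tutorial.

```python
import numpy as np

def epsilon_greedy(w, features, s, actions, epsilon, rng):
    # with probability epsilon pick a random action, otherwise a greedy one
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q = [features(s, a) @ w for a in actions]
    return actions[int(np.argmax(q))]

def semi_gradient_sarsa(env, features, actions, d, episodes=200,
                        alpha=0.2, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                      # weights of the linear q-hat
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(w, features, s, actions, epsilon, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            x = features(s, a)
            if done:
                target = r               # U_t = R_{t+1} at a terminal transition
            else:
                a2 = epsilon_greedy(w, features, s2, actions, epsilon, rng)
                target = r + gamma * (features(s2, a2) @ w)
            # semi-gradient SGD step; for a linear q-hat the gradient is x
            w += alpha * (target - x @ w) * x
            if not done:
                s, a = s2, a2
    return w
```

With one-hot features over state–action pairs this reduces to tabular SARSA, which is a useful sanity check before trying richer feature maps.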
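---

## Sketch: differential semi-gradient SARSA in Python

The differential updates for continuing tasks can be sketched in the same style. Again, the continuing-task environment interface (`env.step(a)` returning `(s', r)`, no terminal states) and the linear feature map `features(s, a)` are illustrative assumptions.

```python
import numpy as np

def differential_sarsa(env, features, actions, d, steps=5000,
                       alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                              # weights of the linear q-hat
    r_bar = 0.0                                  # running estimate of r(pi)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        q = [features(s, a) @ w for a in actions]
        return actions[int(np.argmax(q))]

    s = env.reset()
    a = eps_greedy(s)
    for _ in range(steps):
        s2, r = env.step(a)                      # continuing task: no terminal states
        a2 = eps_greedy(s2)
        x = features(s, a)
        # differential TD error: reward centered by the average-reward estimate
        delta = r - r_bar + features(s2, a2) @ w - x @ w
        r_bar += beta * delta                    # update the average-reward estimate
        w += alpha * delta * x                   # semi-gradient update
        s, a = s2, a2
    return w, r_bar
```

Note that no discount factor appears anywhere; `r_bar` plays the role that `\(\gamma\)` played in keeping the updates bounded, and `\(\beta\)` is kept small so `r_bar` changes slowly relative to `\(\mathbf{w}\)`.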
---

## Colab

Let us consider an example in the [Colab tutorial][colab-13-approx-control].

<!-- # References -->
<!-- ```{r, results='asis', echo=FALSE} -->
<!-- PrintBibliography(bib) -->
<!-- ``` -->

[BSS]: https://bss.au.dk/en/
[bi-programme]: https://masters.au.dk/businessintelligence
[course-help]: https://github.com/bss-osca/rl/issues
[cran]: https://cloud.r-project.org
[cheatsheet-readr]: https://rawgit.com/rstudio/cheatsheets/master/data-import.pdf
[course-welcome-to-the-tidyverse]: https://github.com/rstudio-education/welcome-to-the-tidyverse
[Colab]: https://colab.google/
[colab-01-intro-colab]: https://colab.research.google.com/drive/1o_Dk4FKTsDxPYxTXBRAUEsfPYU3dJhxg?usp=sharing
[colab-03-rl-in-action]: https://colab.research.google.com/drive/18O9MruUBA-twpIDpc-9boXQw-cSjkRoD?usp=sharing
[colab-03-rl-in-action-ex]: https://colab.research.google.com/drive/18O9MruUBA-twpIDpc-9boXQw-cSjkRoD#scrollTo=JUKOdK_UqKRJ&line=3&uniqifier=1
[colab-04-python]: https://colab.research.google.com/drive/1_TQoJVTJPiXbynegeUtzTWBgktpL5VQT?usp=sharing
[colab-04-debug-python]: https://colab.research.google.com/drive/1JHVxbE89iJ8CGJuwY-A4aEEbWYXMH4dp?usp=sharing
[colab-05-bandit]: https://colab.research.google.com/drive/19-tUda-gBb40NWHjpSQboqWq18jYpHPs?usp=sharing
[colab-05-ex-bandit-adv]: https://colab.research.google.com/drive/19-tUda-gBb40NWHjpSQboqWq18jYpHPs#scrollTo=Df1pWZ-DZB7v&line=1
[colab-05-ex-bandit-coin]: https://colab.research.google.com/drive/19-tUda-gBb40NWHjpSQboqWq18jYpHPs#scrollTo=gRGiE26m3inM
[colab-08-dp]: https://colab.research.google.com/drive/1PrLZ2vppqnq0xk0_Qu7UiW3fASftZUX6?usp=sharing
[colab-08-dp-ex-storage]: https://colab.research.google.com/drive/1PrLZ2vppqnq0xk0_Qu7UiW3fASftZUX6#scrollTo=nY6zWiv_3ikg&line=21&uniqifier=1
[colab-08-dp-sec-dp-gambler]: https://colab.research.google.com/drive/1PrLZ2vppqnq0xk0_Qu7UiW3fASftZUX6#scrollTo=GweToDSPd5gj&line=1&uniqifier=1
[colab-08-dp-sec-dp-maintain]:
https://colab.research.google.com/drive/1PrLZ2vppqnq0xk0_Qu7UiW3fASftZUX6#scrollTo=HQnlVuuufR_Q&line=1&uniqifier=1
[colab-08-dp-sec-dp-car]: https://colab.research.google.com/drive/1PrLZ2vppqnq0xk0_Qu7UiW3fASftZUX6#scrollTo=xERxGYQDkR87&line=1&uniqifier=1
[colab-09-mc]: https://colab.research.google.com/drive/1I4gBqDqYQAEPOVlMqTyBG1AKSHTgyDm-?usp=sharing
[colab-09-mc-sec-mc-seasonal-ex]: https://colab.research.google.com/drive/1I4gBqDqYQAEPOVlMqTyBG1AKSHTgyDm-#scrollTo=1BzUCPQxstvQ&line=3&uniqifier=1
[colab-10-td-pred]: https://colab.research.google.com/drive/1JhLDAtc-5lJ3fzp7natjT_ea_JRiQS7d?usp=sharing
[colab-10-td-pred-sec-ex-td-pred-random]: https://colab.research.google.com/drive/1JhLDAtc-5lJ3fzp7natjT_ea_JRiQS7d#scrollTo=1BzUCPQxstvQ&line=4&uniqifier=1
[colab-11-td-control]: https://colab.research.google.com/drive/1EC7qmhZqirQdfV1lDn5wqabGlgE49Ghw?usp=sharing
[colab-11-td-control-sec-td-control-storage]: https://colab.research.google.com/drive/1EC7qmhZqirQdfV1lDn5wqabGlgE49Ghw#scrollTo=1BzUCPQxstvQ&line=3&uniqifier=1
[colab-11-td-control-sec-td-control-car]: https://colab.research.google.com/drive/1EC7qmhZqirQdfV1lDn5wqabGlgE49Ghw#scrollTo=5CcNmaUVXekC&line=1&uniqifier=1
[colab-12-approx-pred]: https://colab.research.google.com/drive/1-kh0SiNucJrzUUnIOLSidcA2RO5J1rvY?usp=sharing
[colab-13-approx-control]: https://colab.research.google.com/drive/1aTPzgxC2_4O1TStfmiEAArf4kxhEVoFU?usp=sharing
[colab-14-policy-gradient]: https://colab.research.google.com/drive/1noa3mzdi4sLyBB9GCzsV9__5ikOwwSn4?usp=sharing
[DataCamp]: https://www.datacamp.com/
[datacamp-signup]: https://www.datacamp.com/groups/shared_links/45955e75eff4dd8ef9e8c3e7cbbfaff9e28e393b38fc25ce24cb525fb2155732
[datacamp-r-intro]: https://learn.datacamp.com/courses/free-introduction-to-r
[datacamp-r-rmarkdown]: https://campus.datacamp.com/courses/reporting-with-rmarkdown
[datacamp-r-communicating]: https://learn.datacamp.com/courses/communicating-with-data-in-the-tidyverse
[datacamp-r-communicating-chap3]: https://campus.datacamp.com/courses/communicating-with-data-in-the-tidyverse/introduction-to-rmarkdown
[datacamp-r-communicating-chap4]: https://campus.datacamp.com/courses/communicating-with-data-in-the-tidyverse/customizing-your-rmarkdown-report
[datacamp-r-intermediate]: https://learn.datacamp.com/courses/intermediate-r
[datacamp-r-intermediate-chap1]: https://campus.datacamp.com/courses/intermediate-r/chapter-1-conditionals-and-control-flow
[datacamp-r-intermediate-chap2]: https://campus.datacamp.com/courses/intermediate-r/chapter-2-loops
[datacamp-r-intermediate-chap3]: https://campus.datacamp.com/courses/intermediate-r/chapter-3-functions
[datacamp-r-intermediate-chap4]: https://campus.datacamp.com/courses/intermediate-r/chapter-4-the-apply-family
[datacamp-r-functions]: https://learn.datacamp.com/courses/introduction-to-writing-functions-in-r
[datacamp-r-tidyverse]: https://learn.datacamp.com/courses/introduction-to-the-tidyverse
[datacamp-r-strings]: https://learn.datacamp.com/courses/string-manipulation-with-stringr-in-r
[datacamp-r-dplyr]: https://learn.datacamp.com/courses/data-manipulation-with-dplyr
[datacamp-r-dplyr-bakeoff]: https://learn.datacamp.com/courses/working-with-data-in-the-tidyverse
[datacamp-r-ggplot2-intro]: https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2
[datacamp-r-ggplot2-intermediate]: https://learn.datacamp.com/courses/intermediate-data-visualization-with-ggplot2
[dplyr-cran]: https://CRAN.R-project.org/package=dplyr
[google-form]: https://forms.gle/s39GeDGV9AzAXUo18
[google-grupper]: https://docs.google.com/spreadsheets/d/1DHxthd5AQywAU4Crb3hM9rnog2GqGQYZ2o175SQgn_0/edit?usp=sharing
[GitHub]: https://github.com/
[git-install]: https://git-scm.com/downloads
[github-actions]: https://github.com/features/actions
[github-pages]: https://pages.github.com/
[gh-rl-student]: https://github.com/bss-osca/rl-student
[gh-rl]: https://github.com/bss-osca/rl
[happy-git]:
https://happygitwithr.com
[hg-install-git]: https://happygitwithr.com/install-git.html
[hg-why]: https://happygitwithr.com/big-picture.html#big-picture
[hg-github-reg]: https://happygitwithr.com/github-acct.html#github-acct
[hg-git-install]: https://happygitwithr.com/install-git.html#install-git
[hg-exist-github-first]: https://happygitwithr.com/existing-github-first.html
[hg-exist-github-last]: https://happygitwithr.com/existing-github-last.html
[hg-credential-helper]: https://happygitwithr.com/credential-caching.html
[hypothes.is]: https://web.hypothes.is/
[Jupyter]: https://jupyter.org/
[osca-programme]: https://masters.au.dk/operationsandsupplychainanalytics
[Peergrade]: https://peergrade.io
[peergrade-signup]: https://app.peergrade.io/join
[point-and-click]: https://en.wikipedia.org/wiki/Point_and_click
[pkg-bookdown]: https://bookdown.org/yihui/bookdown/
[pkg-openxlsx]: https://ycphs.github.io/openxlsx/index.html
[pkg-ropensci-writexl]: https://docs.ropensci.org/writexl/
[pkg-jsonlite]: https://cran.r-project.org/web/packages/jsonlite/index.html
[Python]: https://www.python.org/
[Positron]: https://positron.posit.co/
[PyCharm]: https://www.jetbrains.com/pycharm/
[VSCode]: https://code.visualstudio.com/
[R]: https://www.r-project.org
[RStudio]: https://rstudio.com
[rstudio-cloud]: https://rstudio.cloud/spaces/176810/join?access_code=LSGnG2EXTuzSyeYaNXJE77vP33DZUoeMbC0xhfCz
[r-cloud-mod12]: https://rstudio.cloud/spaces/176810/project/2963819
[r-cloud-mod13]: https://rstudio.cloud/spaces/176810/project/3020139
[r-cloud-mod14]: https://rstudio.cloud/spaces/176810/project/3020322
[r-cloud-mod15]: https://rstudio.cloud/spaces/176810/project/3020509
[r-cloud-mod16]: https://rstudio.cloud/spaces/176810/project/3026754
[r-cloud-mod17]: https://rstudio.cloud/spaces/176810/project/3034015
[r-cloud-mod18]: https://rstudio.cloud/spaces/176810/project/3130795
[r-cloud-mod19]: https://rstudio.cloud/spaces/176810/project/3266132
[rstudio-download]:
https://rstudio.com/products/rstudio/download/#download
[rstudio-customizing]: https://support.rstudio.com/hc/en-us/articles/200549016-Customizing-RStudio
[rstudio-key-shortcuts]: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
[rstudio-workbench]: https://www.rstudio.com/wp-content/uploads/2014/04/rstudio-workbench.png
[r-markdown]: https://rmarkdown.rstudio.com/
[ropensci-writexl]: https://docs.ropensci.org/writexl/
[r4ds-pipes]: https://r4ds.had.co.nz/pipes.html
[r4ds-factors]: https://r4ds.had.co.nz/factors.html
[r4ds-strings]: https://r4ds.had.co.nz/strings.html
[r4ds-iteration]: https://r4ds.had.co.nz/iteration.html
[stat-545]: https://stat545.com
[stat-545-functions-part1]: https://stat545.com/functions-part1.html
[stat-545-functions-part2]: https://stat545.com/functions-part2.html
[stat-545-functions-part3]: https://stat545.com/functions-part3.html
[slides-welcome]: https://bss-osca.github.io/rl/slides/00-rl_welcome.html
[slides-m1-3]: https://bss-osca.github.io/rl/slides/01-welcome_r_part.html
[slides-m4-5]: https://bss-osca.github.io/rl/slides/02-programming.html
[slides-m6-8]: https://bss-osca.github.io/rl/slides/03-transform.html
[slides-m9]: https://bss-osca.github.io/rl/slides/04-plot.html
[slides-m83]: https://bss-osca.github.io/rl/slides/05-joins.html
[sutton-notation]: https://bss-osca.github.io/rl/misc/sutton-notation.pdf
[tidyverse-main-page]: https://www.tidyverse.org
[tidyverse-packages]: https://www.tidyverse.org/packages/
[tidyverse-core]: https://www.tidyverse.org/packages/#core-tidyverse
[tidyverse-ggplot2]: https://ggplot2.tidyverse.org/
[tidyverse-dplyr]: https://dplyr.tidyverse.org/
[tidyverse-tidyr]: https://tidyr.tidyverse.org/
[tidyverse-readr]: https://readr.tidyverse.org/
[tidyverse-purrr]: https://purrr.tidyverse.org/
[tidyverse-tibble]: https://tibble.tidyverse.org/
[tidyverse-stringr]: https://stringr.tidyverse.org/
[tidyverse-forcats]: https://forcats.tidyverse.org/
[tidyverse-readxl]:
https://readxl.tidyverse.org
[tidyverse-googlesheets4]: https://googlesheets4.tidyverse.org/index.html
[tutorial-markdown]: https://commonmark.org/help/tutorial/
[tfa-course]: https://bss-osca.github.io/tfa/
[video-install]: https://vimeo.com/415501284
[video-rstudio-intro]: https://vimeo.com/416391353
[video-packages]: https://vimeo.com/416743698
[video-projects]: https://vimeo.com/319318233
[video-r-intro-p1]: https://www.youtube.com/watch?v=vGY5i_J2c-c
[video-r-intro-p2]: https://www.youtube.com/watch?v=w8_XdYI3reU
[video-r-intro-p3]: https://www.youtube.com/watch?v=NuY6jY4qE7I
[video-subsetting]: https://www.youtube.com/watch?v=hWbgqzsQJF0&list=PLjTlxb-wKvXPqyY3FZDO8GqIaWuEDy-Od&index=10&t=0s
[video-datatypes]: https://www.youtube.com/watch?v=5AQM-yUX9zg&list=PLjTlxb-wKvXPqyY3FZDO8GqIaWuEDy-Od&index=10
[video-control-structures]: https://www.youtube.com/watch?v=s_h9ruNwI_0
[video-conditional-loops]: https://www.youtube.com/watch?v=2evtsnPaoDg
[video-functions]: https://www.youtube.com/watch?v=ffPeac3BigM
[video-tibble-vs-df]: https://www.youtube.com/watch?v=EBk6PnvE1R4
[video-dplyr]: https://www.youtube.com/watch?v=aywFompr1F4
[wiki-snake-case]: https://en.wikipedia.org/wiki/Snake_case
[wiki-camel-case]: https://en.wikipedia.org/wiki/Camel_case
[wiki-interpreted]: https://en.wikipedia.org/wiki/Interpreted_language
[wiki-literate-programming]: https://en.wikipedia.org/wiki/Literate_programming
[wiki-csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[wiki-json]: https://en.wikipedia.org/wiki/JSON