class: center, middle, inverse, title-slide

.title[
# Dynamic Programming
]
.author[
### Lars Relund Nielsen
]

---
layout: true

<!-- Templates -->
<!-- .pull-left[] .pull-right[] -->
<!-- knitr::include_graphics("img/bandit.png") -->
<!-- .left-column-wide[]  .right-column-small[] -->

---
## Learning outcomes

* Describe the distinction between policy evaluation and control.
* Identify when DP can be applied, as well as its limitations.
* Explain and apply iterative policy evaluation for estimating state-values given a policy.
* Interpret the policy improvement theorem.
* Explain and apply policy iteration for finding an optimal policy.
* Explain and apply value iteration for finding an optimal policy.
* Describe the ideas behind generalized policy iteration.
* Interpret the distinction between synchronous and asynchronous dynamic programming methods.

---
## Dynamic Programming

* Algorithms for computing optimal policies given full information about the dynamics.
* Must satisfy the *principle of optimality*:
  - An optimal policy must consist of optimal sub-policies.
  - The optimal value function in a state can be calculated using the optimal value functions of future states.
* Two main problems arise when using DP:
  - Full information may not be available, e.g. the rewards or transition probabilities are unknown.
  - The value function must be calculated in all states using the rewards, actions, and transition probabilities.
* Note that the term *programming* in DP has nothing to do with a computer program; it comes from the fact that the mathematical model is called a "program".

---
## Policy evaluation vs control

.pull-left[
* Policy evaluation: Given a policy `\(\pi\)`, find the value function `\(v_\pi\)`
  `$$\pi \xrightarrow[]{E} v_\pi$$`
  Recall that
  `$$v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s],$$`
  where the return (discounted reward) is
  `$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$`
]

--

.pull-right[
* Control: Find the best policy or optimize the value function (using the Bellman optimality equations).

<img src="img/control.png" width="100%" style="display: block; margin: auto;" />
]

---
## Policy evaluation

We can find the state-value function by solving the system of Bellman equations:
`\begin{equation}
v_\pi(s) = \sum_{a \in \mathcal{A}}\pi(a | s)\left( r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_\pi(s')\right).
\end{equation}`
However, this can be computationally expensive. Instead we use an iterative method:

* Start from an initial approximation `\(v_0\)` chosen arbitrarily, e.g. `\(v_0(s) = 0 \: \forall s\)` (the value of a terminal state is always 0).
* Update the value function using sweeps over the state space:
`\begin{equation}
v_{k+1}(s) = \sum_{a \in \mathcal{A}}\pi(a | s)\left( r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_k(s')\right)
\end{equation}`

The sweeps continue until the value function converges to `\(v_\pi\)`, i.e. until the change in the value function during a sweep is below a certain threshold. A code sketch is given on the next slide.
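---
## Policy evaluation - a code sketch

A minimal sketch in R of the iterative method above (a simplified illustration, not the course implementation shown later on the "Policy evaluation in R" slide). It assumes the MDP is stored as arrays with hypothetical names: `pr[s, a, sN]` holds `\(p(s' | s, a)\)`, `rew[s, a]` holds `\(r(s, a)\)`, and `piM[s, a]` holds `\(\pi(a | s)\)`:

```r
# Iterative policy evaluation - sketch (array names are assumptions, not course code).
policy_eval <- function(piM, pr, rew, gamma = 0.9, theta = 1e-6) {
  nS <- dim(pr)[1]
  v <- rep(0, nS)                 # v_0(s) = 0 for all states
  repeat {
    delta <- 0
    vNew <- v
    for (s in 1:nS) {             # one sweep over the state space
      q <- rew[s, ] + gamma * as.vector(pr[s, , ] %*% v)  # q(s,a) under v_k
      vNew[s] <- sum(piM[s, ] * q)                        # expectation over pi(a|s)
      delta <- max(delta, abs(vNew[s] - v[s]))
    }
    v <- vNew                     # v_{k+1}
    if (delta < theta) break      # change during sweep below threshold => stop
  }
  v
}
```

For a small MDP stored in this format, `policy_eval(piM, pr, rew)` returns a vector with the approximate `\(v_\pi(s)\)` for each state.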
---
## Policy evaluation - visualization

.pull-left[.midi[
Sweep 1:
`$$v_{1}(s) = \sum_{a \in \mathcal{A}}\pi(a | s)\left( r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_0(s')\right)$$`
Sweep 2:
`$$v_{2}(s) = \sum_{a \in \mathcal{A}}\pi(a | s)\left( r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_1(s')\right)$$`
`$$\vdots$$`
Sweep k:
`$$v_{k}(s) = \sum_{a \in \mathcal{A}}\pi(a | s)\left( r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_{k-1}(s')\right)$$`
]]
.pull-right[.vertical-center[
<img src="img/policy-eval-ite.png" width="90%" style="display: block; margin: auto;" />
]]

---
## Policy improvement (control)

* Given the optimal value function, we find the optimal deterministic policy by greedily choosing the best action:
`$$\pi_*(s) = \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_*(s')\right).$$`
* What if we apply greedy action selection to the value function of an arbitrary policy `\(\pi\)`:
`$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_\pi(s')\right)?$$`
Then
`$$q_\pi(s, \pi'(s)) \geq q_\pi(s, \pi(s)) = v_\pi(s) \quad \forall s \in \mathcal{S}.$$`

---
## Policy improvement theorem (control)

.question[
Let `\(\pi\)` and `\(\pi'\)` be any pair of deterministic policies such that
`\begin{equation}
q_\pi(s, \pi'(s)) \geq v_\pi(s) \quad \forall s \in \mathcal{S}.
\end{equation}`
Then `\(v_{\pi'}(s) \geq v_\pi(s)\)` for all `\(s \in \mathcal{S}\)`; that is, `\(\pi'\)` is at least as good as `\(\pi\)`.
]

Implications:

* If we find a policy `\(\pi'\)` by applying greedy action selection to policy `\(\pi\)`, then `\(\pi'\)` is at least as good as `\(\pi\)`.
* If `\(\pi'(s) = \pi(s), \forall s\in\mathcal{S}\)`, then `\(\pi\)` must be optimal.
* If `\(\pi'(s) \neq \pi(s)\)` for some state, then policy `\(\pi'\)` is strictly better than policy `\(\pi\)`, since there is at least one state `\(s\)` for which `\(v_{\pi'}(s) > v_\pi(s)\)`.

---
## Policy iteration (control)

.left-column-wide[
Given the policy improvement theorem, we can now improve policies iteratively (see the code sketch on the next slide):

1. Pick an arbitrary initial policy `\(\pi\)`.
2. Given a policy `\(\pi\)`, estimate `\(v_\pi(s)\)` via the policy evaluation algorithm.
3. Generate a new, improved policy `\(\pi' \geq \pi\)` using *greedy* action selection. If `\(\pi'=\pi\)` then stop (an optimal policy has been found); otherwise go to Step 2.

The sequence of calculations will be:
`$$\pi_0 \xrightarrow[]{E} v_{\pi_0} \xrightarrow[]{I} \pi_1 \xrightarrow[]{E} v_{\pi_1} \xrightarrow[]{I} \pi_2 \xrightarrow[]{E} v_{\pi_2} \ldots \xrightarrow[]{I} \pi_* \xrightarrow[]{E} v_{\pi_*}$$`
The number of iterations needed is often small.
]
.right-column-small[
<img src="img/policy-ite-pen.png" width="100%" style="display: block; margin: auto;" />
<img src="img/policy-ite-path.png" width="100%" style="display: block; margin: auto;" />
]
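---
## Policy iteration - a code sketch

A minimal sketch in R of the loop above (again a simplified illustration, not the course implementation shown later). It reuses `policy_eval` and the assumed array layout from the policy evaluation sketch:

```r
# Policy iteration - sketch (hypothetical array layout as before).
policy_iteration <- function(pr, rew, gamma = 0.9) {
  nS <- dim(pr)[1]; nA <- dim(pr)[2]
  piD <- rep(1L, nS)                      # Step 1: arbitrary initial policy (action 1 everywhere)
  repeat {
    piM <- matrix(0, nS, nA)              # deterministic policy as a pi(a|s) matrix
    piM[cbind(1:nS, piD)] <- 1
    v <- policy_eval(piM, pr, rew, gamma) # Step 2: policy evaluation
    piNew <- sapply(1:nS, function(s)     # Step 3: greedy policy improvement
      which.max(rew[s, ] + gamma * as.vector(pr[s, , ] %*% v)))
    if (all(piNew == piD)) break          # pi' = pi  =>  optimal policy found
    piD <- piNew
  }
  list(policy = piD, v = v)
}
```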
---
## Value iteration (control)

.left-column-wide[.midi[
Value iteration does one sweep of policy evaluation and one sweep of policy improvement in a single update, using the Bellman optimality equation as an update rule:
`$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s'|s, a)v_k(s')\right)$$`
VI does not require a stored policy. Instead, after the last sweep of value iteration, the optimal policy `\(\pi_*\)` is extracted greedily:
`$$\pi_*(s) = \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma\sum_{s' \in \mathcal{S}} p(s' | s, a) v_*(s')\right)$$`
The sequence of calculations will be (see the code sketch on the next slide):
`$$v_{0} \xrightarrow[]{EI} v_{1} \xrightarrow[]{EI} v_{2} \ldots \xrightarrow[]{EI} v_{*} \xrightarrow[]{G} \pi_{*}$$`
]]
.right-column-small[
<img src="img/value-ite-path.png" width="100%" style="display: block; margin: auto;" />
]
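---
## Value iteration - a code sketch

A minimal sketch in R (same assumed array layout as before; the course implementation is shown later). Note the synchronous two-array update: `vNew` (i.e. `\(v_{k+1}\)`) is computed entirely from the old `v` (i.e. `\(v_k\)`):

```r
# Value iteration - sketch (hypothetical array layout as before).
value_iteration <- function(pr, rew, gamma = 0.9, theta = 1e-6) {
  nS <- dim(pr)[1]
  v <- rep(0, nS)
  repeat {
    vNew <- sapply(1:nS, function(s)      # combined evaluation/improvement sweep
      max(rew[s, ] + gamma * as.vector(pr[s, , ] %*% v)))
    delta <- max(abs(vNew - v))
    v <- vNew
    if (delta < theta) break              # value function has (approximately) converged
  }
  piD <- sapply(1:nS, function(s)         # extract the greedy (optimal) policy from v
    which.max(rew[s, ] + gamma * as.vector(pr[s, , ] %*% v)))
  list(policy = piD, v = v)
}
```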
---
## Generalized policy iteration

.left-column-wide[.midi[
Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement interact, using different evaluation and improvement strategies. For instance, only partial policy evaluation may be used:

* Value iteration: improve the state-values using only a single sweep of the state space.

Asynchronous state-value updates (see the sketch on the next slide):

* In-place DP: keep just one array of state-values and use updated values immediately.
* Prioritized sweeping: update "significant" state-values first using a priority queue.

GPI works as long as the whole state space keeps being explored.
]]
.right-column-small[
<img src="img/policy-ite-general.png" width="100%" style="display: block; margin: auto;" />
]
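---
## In-place value iteration - a code sketch

A minimal sketch in R of an asynchronous (in-place) sweep, under the same assumed array layout. In contrast to the two-array update in the value iteration sketch, only one array `v` is stored, and updated values are used immediately within the same sweep:

```r
# In-place (asynchronous) value iteration - sketch.
value_iteration_inplace <- function(pr, rew, gamma = 0.9, theta = 1e-6) {
  nS <- dim(pr)[1]
  v <- rep(0, nS)
  repeat {
    delta <- 0
    for (s in 1:nS) {   # states updated one at a time, in place
      vOld <- v[s]
      v[s] <- max(rew[s, ] + gamma * as.vector(pr[s, , ] %*% v))
      delta <- max(delta, abs(v[s] - vOld))
    }
    if (delta < theta) break
  }
  v
}
```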
---
## Policy evaluation in R (evaluation)

<img src="img/policy-eval.png" width="100%" style="display: block; margin: auto;" />

---
## Policy iteration in R (control)

<img src="img/policy-iteration.png" width="75%" style="display: block; margin: auto;" />

---
## Value iteration in R (control)

<img src="img/value-iteration.png" width="100%" style="display: block; margin: auto;" />

<!-- # References -->
<!-- ```{r, results='asis', echo=FALSE} -->
<!-- PrintBibliography(bib) -->
<!-- ``` -->