class: center, middle, inverse, title-slide .title[ # Transform data ] .author[ ### Lars Relund Nielsen ] --- layout: true --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- .question[ What makes this data not tidy? ] <br> <img src="img/hiv-est-prevalence-15-49.png" width="95%" style="display: block; margin: auto;" /> .footnote[ Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data) ] --- ## Summary tables .question[ Is each of the following a dataset or a summary table? ] .midi[ .pull-left[ ``` ## # A tibble: 87 × 3 ## name height mass ## <chr> <int> <dbl> ## 1 Luke Skywalker 172 77 ## 2 C-3PO 167 75 ## 3 R2-D2 96 32 ## 4 Darth Vader 202 136 ## 5 Leia Organa 150 49 ## 6 Owen Lars 178 120 ## # ℹ 81 more rows ``` ] .pull-right[ ``` ## # A tibble: 3 × 2 ## gender avg_height ## <chr> <dbl> ## 1 feminine 167. ## 2 masculine 177. ## 3 <NA> 175 ``` ] ] --- ## A grammar of data wrangling... ... based on the concepts of functions as verbs that manipulate data frames .pull-left[ <img src="img/dplyr-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ .midi[ - `select`: pick columns by name - `arrange`: reorder rows - `slice`: pick rows using index(es) - `filter`: pick rows matching criteria - `distinct`: filter for unique rows - `mutate`: add new variables - `summarise`: reduce variables to values - `group_by`: for grouped operations - ... (many more) ] ] --- ## Rules of **dplyr** functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame --- ## Data: Hotel bookings - Data from two hotels: one resort and one city hotel - Observations: Each row represents a hotel booking <!-- - Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonia et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5)) --> ``` ## # A tibble: 119,390 × 32 ## hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number ## <chr> <dbl> <dbl> <dbl> <chr> <dbl> ## 1 Resort Hotel 0 342 2015 July 27 ## 2 Resort Hotel 0 737 2015 July 27 ## 3 Resort Hotel 0 7 2015 July 27 ## 4 Resort Hotel 0 13 2015 July 27 ## 5 Resort Hotel 0 14 2015 July 27 ## 6 Resort Hotel 0 14 2015 July 27 ## # ℹ 119,384 more rows ## # ℹ 26 more variables: arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>, ## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>, meal <chr>, ## # country <chr>, market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>, ## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, ## # assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>, agent <chr>, ## # company <chr>, days_in_waiting_list <dbl>, customer_type <chr>, adr <dbl>, … ``` --- ## Examples and code Let us have a look at the examples at Posit Cloud. --- layout: false class: center, title-slide, middle, inverse # Mutating joins --- ## Mutating joins Mutating joins allow you to combine variables from multiple tables. There are different types of joins. We will consider: - `inner_join()`: all rows from x where there are matching values in y, return all combinations of multiple matches in the case of multiple matches - `left_join()`: all rows from x - `right_join()`: all rows from y - `full_join()`: all rows from both x and y While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations. --- ## Setup For the next few slides... .pull-left[ ``` r band_members ``` ``` ## # A tibble: 3 × 2 ## name band ## <chr> <chr> ## 1 Mick Stones ## 2 John Beatles ## 3 Paul Beatles ``` ] .pull-right[ ``` r band_instruments ``` ``` ## # A tibble: 3 × 2 ## name plays ## <chr> <chr> ## 1 John guitar ## 2 Paul bass ## 3 Keith guitar ``` ] --- ## `inner_join()` .pull-left[ ``` r band_members %>% inner_join(band_instruments) ``` ``` ## # A tibble: 2 × 3 ## name band plays ## <chr> <chr> <chr> ## 1 John Beatles guitar ## 2 Paul Beatles bass ``` ] .pull-right[ <img src="img/inner-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## `left_join()` .pull-left[ ``` r band_members %>% left_join(band_instruments) ``` ``` ## # A tibble: 3 × 3 ## name band plays ## <chr> <chr> <chr> ## 1 Mick Stones <NA> ## 2 John Beatles guitar ## 3 Paul Beatles bass ``` ] .pull-right[ <img src="img/left-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## `right_join()` .pull-left[ ``` r band_members %>% right_join(band_instruments) ``` ``` ## # A tibble: 3 × 3 ## name band plays ## <chr> <chr> <chr> ## 1 John Beatles guitar ## 2 Paul Beatles bass ## 3 Keith <NA> guitar ``` ] .pull-right[ <img src="img/right-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## `full_join()` .pull-left[ ``` r band_members %>% full_join(band_instruments) ``` ``` ## # A tibble: 4 × 3 ## name band plays ## <chr> <chr> <chr> ## 1 Mick Stones <NA> ## 2 John Beatles guitar ## 3 Paul Beatles bass ## 4 Keith <NA> guitar ``` ] .pull-right[ <img src="img/full-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## Filtering joins Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types: - `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x. - `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x These are most useful for diagnosing join mismatches. If you're worried about what observations your joins will match, start with a `semi_join()` or `anti_join()`. `semi_join()` and `anti_join()` never duplicate; they only remove observations. --- ## `semi_join()` .pull-left[ ``` r band_members %>% semi_join(band_instruments) ``` ``` ## # A tibble: 2 × 2 ## name band ## <chr> <chr> ## 1 John Beatles ## 2 Paul Beatles ``` ] .pull-right[ <img src="img/semi-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## `anti_join()` .pull-left[ ``` r band_members %>% anti_join(band_instruments) ``` ``` ## # A tibble: 1 × 2 ## name band ## <chr> <chr> ## 1 Mick Stones ``` ] .pull-right[ <img src="img/anti-join.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## What if column names are not the same? .pull-left[ ``` r band_members ``` ``` ## # A tibble: 3 × 2 ## name band ## <chr> <chr> ## 1 Mick Stones ## 2 John Beatles ## 3 Paul Beatles ``` ``` r band_instruments2 ``` ``` ## # A tibble: 3 × 2 ## artist plays ## <chr> <chr> ## 1 John guitar ## 2 Paul bass ## 3 Keith guitar ``` ] .pull-right[ ``` r band_members %>% full_join( band_instruments2, by = c("name" = "artist")) ``` ``` ## # A tibble: 4 × 3 ## name band plays ## <chr> <chr> <chr> ## 1 Mick Stones <NA> ## 2 John Beatles guitar ## 3 Paul Beatles bass ## 4 Keith <NA> guitar ``` ] [BSS]: https://bss.au.dk/en/ [course-help]: https://github.com/bss-osca/tfa/issues [cran]: https://cloud.r-project.org [cheatsheet-readr]: https://rawgit.com/rstudio/cheatsheets/master/data-import.pdf [course-welcome-to-the-tidyverse]: https://github.com/rstudio-education/welcome-to-the-tidyverse [DataCamp]: https://www.datacamp.com/ [datacamp-signup]: https://www.datacamp.com/groups/shared_links/c90b55dfb7c72d4f8184f5e53ac5c2521e67a220a9e40778ee28178b284eef77 [datacamp-r-intro]: https://learn.datacamp.com/courses/free-introduction-to-r [datacamp-r-rmarkdown]: https://campus.datacamp.com/courses/reporting-with-rmarkdown [datacamp-r-communicating]: https://learn.datacamp.com/courses/communicating-with-data-in-the-tidyverse [datacamp-r-communicating-chap3]: https://campus.datacamp.com/courses/communicating-with-data-in-the-tidyverse/introduction-to-rmarkdown [datacamp-r-communicating-chap4]: https://campus.datacamp.com/courses/communicating-with-data-in-the-tidyverse/customizing-your-rmarkdown-report [datacamp-r-intermediate]: https://learn.datacamp.com/courses/intermediate-r [datacamp-r-intermediate-chap1]: https://campus.datacamp.com/courses/intermediate-r/chapter-1-conditionals-and-control-flow [datacamp-r-intermediate-chap2]: https://campus.datacamp.com/courses/intermediate-r/chapter-2-loops [datacamp-r-intermediate-chap3]: https://campus.datacamp.com/courses/intermediate-r/chapter-3-functions [datacamp-r-intermediate-chap4]: https://campus.datacamp.com/courses/intermediate-r/chapter-4-the-apply-family [datacamp-r-functions]: https://learn.datacamp.com/courses/introduction-to-writing-functions-in-r [datacamp-r-tidyverse]: https://learn.datacamp.com/courses/introduction-to-the-tidyverse [datacamp-r-strings]: https://learn.datacamp.com/courses/string-manipulation-with-stringr-in-r [datacamp-r-dplyr]: https://learn.datacamp.com/courses/data-manipulation-with-dplyr [datacamp-r-dplyr-bakeoff]: https://learn.datacamp.com/courses/working-with-data-in-the-tidyverse [datacamp-r-ggplot2-intro]: https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2 [datacamp-r-ggplot2-intermediate]: https://learn.datacamp.com/courses/intermediate-data-visualization-with-ggplot2 [dplyr-cran]: https://CRAN.R-project.org/package=dplyr [debug-in-r]: https://rstats.wtf/debugging-r-code.html [excel-vs-r]: https://www.jessesadler.com/post/excel-vs-r/ [google-form]: https://forms.gle/s39GeDGV9AzAXUo18 [google-grupper]: https://docs.google.com/spreadsheets/d/1DHxthd5AQywAU4Crb3hM9rnog2GqGQYZ2o175SQgn_0/edit?usp=sharing [GitHub]: https://github.com/ [git-install]: https://git-scm.com/downloads [github-actions]: https://github.com/features/actions [github-pages]: https://pages.github.com/ [happy-git]: https://happygitwithr.com [hg-install-git]: https://happygitwithr.com/install-git.html [hg-why]: https://happygitwithr.com/big-picture.html#big-picture [hg-github-reg]: https://happygitwithr.com/github-acct.html#github-acct [hg-git-install]: https://happygitwithr.com/install-git.html#install-git [hg-exist-github-first]: https://happygitwithr.com/existing-github-first.html [hg-exist-github-last]: https://happygitwithr.com/existing-github-last.html [hg-credential-helper]: https://happygitwithr.com/credential-caching.html [hypothes.is]: https://web.hypothes.is/ [osca-programme]: https://masters.au.dk/operationsandsupplychainanalytics [Peergrade]: https://peergrade.io [peergrade-signup]: https://app.peergrade.io/join [point-and-click]: https://en.wikipedia.org/wiki/Point_and_click [pkg-bookdown]: https://bookdown.org/yihui/bookdown/ [pkg-openxlsx]: https://ycphs.github.io/openxlsx/index.html [pkg-ropensci-writexl]: https://docs.ropensci.org/writexl/ [pkg-jsonlite]: https://cran.r-project.org/web/packages/jsonlite/index.html [R]: https://www.r-project.org [RStudio]: https://rstudio.com [posit-cloud]: https://posit.cloud/spaces/426101/join?access_code=GVlexpHQRCXUAiqRiC9ux_KgoAXUyilsrVtxjaB8 [r-cloud-mod7]: https://posit.cloud/spaces/426101/content/6689201 [r-cloud-mod8]: https://posit.cloud/spaces/426101/content/6689202 [r-cloud-mod9]: https://posit.cloud/spaces/426101/content/6689203 [r-cloud-mod10]: https://posit.cloud/spaces/426101/content/6689204 [r-cloud-mod11]: https://posit.cloud/spaces/426101/content/6689205 [r-cloud-mod12]: https://posit.cloud/spaces/426101/content/6689206 [r-cloud-mod13]: https://posit.cloud/spaces/426101/content/6689207 [r-cloud-mod14]: https://posit.cloud/spaces/426101/content/6689208 [r-cloud-mod15]: https://posit.cloud/spaces/426101/content/6689209 [rstudio-download]: https://rstudio.com/products/rstudio/download/#download [rstudio-customizing]: https://support.rstudio.com/hc/en-us/articles/200549016-Customizing-RStudio [rstudio-key-shortcuts]: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts [rstudio-workbench]: https://www.rstudio.com/wp-content/uploads/2014/04/rstudio-workbench.png [r-markdown]: https://rmarkdown.rstudio.com/ [ropensci-writexl]: https://docs.ropensci.org/writexl/ [r4ds-pipes]: https://r4ds.had.co.nz/pipes.html [r4ds-factors]: https://r4ds.had.co.nz/factors.html [r4ds-strings]: https://r4ds.had.co.nz/strings.html [r4ds-iteration]: https://r4ds.had.co.nz/iteration.html [stat-545]: https://stat545.com [stat-545-functions-part1]: https://stat545.com/functions-part1.html [stat-545-functions-part2]: https://stat545.com/functions-part2.html [stat-545-functions-part3]: https://stat545.com/functions-part3.html [slides]: https://bss-osca.github.io/tfa/slides/ [slides-welcome]: https://bss-osca.github.io/tfa/slides/00-tfa_welcome.html [slides-m1-3]: https://bss-osca.github.io/tfa/slides/01-welcome_r_part.html [slides-m4-5]: https://bss-osca.github.io/tfa/slides/02-programming.html [slides-m6-8]: https://bss-osca.github.io/tfa/slides/03-transform.html [slides-m9]: https://bss-osca.github.io/tfa/slides/04-plot.html [slides-m83]: https://bss-osca.github.io/tfa/slides/05-joins.html [tidyverse-main-page]: https://www.tidyverse.org [tidyverse-packages]: https://www.tidyverse.org/packages/ [tidyverse-core]: https://www.tidyverse.org/packages/#core-tidyverse [tidyverse-ggplot2]: https://ggplot2.tidyverse.org/ [tidyverse-dplyr]: https://dplyr.tidyverse.org/ [tidyverse-tidyr]: https://tidyr.tidyverse.org/ [tidyverse-readr]: https://readr.tidyverse.org/ [tidyverse-purrr]: https://purrr.tidyverse.org/ [tidyverse-tibble]: https://tibble.tidyverse.org/ [tidyverse-stringr]: https://stringr.tidyverse.org/ [tidyverse-forcats]: https://forcats.tidyverse.org/ [tidyverse-readxl]: https://readxl.tidyverse.org [tidyverse-googlesheets4]: https://googlesheets4.tidyverse.org/index.html [tutorial-markdown]: https://commonmark.org/help/tutorial/ [Udemy]: https://www.udemy.com/ [vba-yt-course1]: https://www.youtube.com/playlist?list=PLpOAvcoMay5S_hb2D7iKznLqJ8QG_pde0 [vba-course1-hello]: https://youtu.be/f42OniDWaIo [vba-yt-course2]: https://www.youtube.com/playlist?list=PL3A6U40JUYCi4njVx59-vaUxYkG0yRO4m [vba-course2-devel-tab]: https://youtu.be/awEOUaw9q58 [vba-course2-devel-editor]: https://youtu.be/awEOUaw9q58 [vba-course2-devel-project]: https://youtu.be/fp6PTbU7bXo [vba-course2-devel-properties]: https://youtu.be/ks2QYKAd9Xw [vba-course2-devel-hello]: https://youtu.be/EQ6tDWBc8G4 [video-install]: https://vimeo.com/415501284 [video-rstudio-intro]: https://vimeo.com/416391353 [video-packages]: https://vimeo.com/416743698 [video-projects]: https://vimeo.com/319318233 [video-r-intro-p1]: https://www.youtube.com/watch?v=vGY5i_J2c-c [video-r-intro-p2]: https://www.youtube.com/watch?v=w8_XdYI3reU [video-r-intro-p3]: https://www.youtube.com/watch?v=NuY6jY4qE7I [video-subsetting]: https://www.youtube.com/watch?v=hWbgqzsQJF0&list=PLjTlxb-wKvXPqyY3FZDO8GqIaWuEDy-Od&index=10&t=0s [video-datatypes]: https://www.youtube.com/watch?v=5AQM-yUX9zg&list=PLjTlxb-wKvXPqyY3FZDO8GqIaWuEDy-Od&index=10 [video-control-structures]: https://www.youtube.com/watch?v=s_h9ruNwI_0 [video-conditional-loops]: https://www.youtube.com/watch?v=2evtsnPaoDg [video-functions]: https://www.youtube.com/watch?v=ffPeac3BigM [video-tibble-vs-df]: https://www.youtube.com/watch?v=EBk6PnvE1R4 [video-dplyr]: https://www.youtube.com/watch?v=aywFompr1F4 [wiki-snake-case]: https://en.wikipedia.org/wiki/Snake_case [wiki-camel-case]: https://en.wikipedia.org/wiki/Camel_case [wiki-interpreted]: https://en.wikipedia.org/wiki/Interpreted_language [wiki-literate-programming]: https://en.wikipedia.org/wiki/Literate_programming [wiki-csv]: https://en.wikipedia.org/wiki/Comma-separated_values [wiki-json]: https://en.wikipedia.org/wiki/JSON