Module 11 Introduction to tidyverse and RMarkdown

The tidyverse is a collection of R packages designed for data science. RMarkdown documents support the concept of literate programming where you weave R code together with text (written in Markdown) to produce elegantly formatted documents.

A template project for this module is given on Posit Cloud (open it and use it while reading the notes).

Learning path diagram

It is recommended that you follow the green learning path; however, you may prefer a different learning style. The learning path diagram contains links to alternative online content (video or reading). Note that this content is an alternative to the standard learning path: use it instead of the standard path, not in addition to it. The learning path may also contain extra content that is NOT part of the syllabus (only look at it if you want more information).

11.1 Learning outcomes

By the end of this module, you are expected to be able to:

Describe what the tidyverse package is.
Explain the ideas behind reproducible reports and literate programming.
Create your first RMarkdown document and add some code and text.

The learning outcomes relate to overall learning goals 7, 17 and 18 of the course.

11.2 The tidyverse package

The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The core tidyverse includes the packages that you are likely to use in everyday data analyses. In tidyverse 1.3.0, the following packages are included in the core tidyverse:

dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. We are going to use dplyr in Module 13.
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details. We are going to use ggplot2 in Module 14.
tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. We are going to use readr in Module 12.
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive. This package is not covered in this course.
tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what has not. Tibbles are data frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code. We are going to use tibbles in Module 13.
stringr provides a cohesive set of functions designed to make working with strings as easy as possible. You have already worked a bit with stringr in Exercise 8.7.8.
forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. This package is not covered in this course.

Small introductions (with examples) to the packages are given on their documentation pages (follow the links above). The tidyverse also includes many other packages with more specialized usage. They are not loaded automatically with library(tidyverse), so you will need to load each one with its own call to library().
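The difference between core and non-core packages can be sketched as follows. This is a minimal illustration, assuming the tidyverse is installed; readxl is used here only as an example of a non-core tidyverse package, and the names core, core_attached and readxl_attached are made up for the illustration.

```r
# Attaching the tidyverse attaches only the core packages.
library(tidyverse)

# The core packages listed above (tidyverse 1.3.0).
core <- c("dplyr", "ggplot2", "tidyr", "readr",
          "purrr", "tibble", "stringr", "forcats")

# Check that each core package is now on the search path.
core_attached <- paste0("package:", core) %in% search()
names(core_attached) <- core
core_attached

# readxl is installed together with the tidyverse but is not core
# (assumption for this example), so it needs its own call to library().
library(readxl)
readxl_attached <- "package:readxl" %in% search()
readxl_attached
```

Both checks should report TRUE once the corresponding library() call has been run.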

11.3 Writing reproducible reports

The concept of literate programming was originally introduced by Donald Knuth in 1984. In a nutshell, Knuth envisioned a new programming paradigm where computer scientists focus on weaving code together with text as documentation.

That is, when we do an Analytics project, we are interested in writing reports that contain R code for importing data, wrangling and analysis. At the same time, the document should contain our comments about the code, plots, analysis, results, etc. The document is then rendered to an output format such as HTML, PDF or Word, which is presented to the decision maker. Note that the document can be seen as “the source code” for the report communicated to the decision maker.

Some developers have created tools to enable others to write better literate programs. They use a markup language made for authoring. We are going to focus on RMarkdown. In RMarkdown documents you can weave R code together with text (written in Markdown) to produce elegantly formatted output.

In fact this book is written in RMarkdown by using

a set of RMarkdown documents bound together as a collection using the bookdown package,
rendered to a web page using RStudio,
shared on GitHub,
built by GitHub Actions,
and published on GitHub Pages.

This may seem complicated at first. However, after setup, it makes life much easier, since we can

update the book more easily,
share and collaborate on the book more easily,
update the web page automatically,
keep history of the book source,
keep the book source at a single location.

RMarkdown documents are reproducible. Anybody who works with data has at some point heard a colleague say ‘Well, it works on my computer’, expressing dismay at the fact that you cannot reproduce their results. Ultimately, reproducible means that the results can be reproduced given access to the original data, software, and code. In practice it may be hard to make your project totally reproducible. For instance, people may be using a different operating system, other versions of the software, etc. That is, there are different levels of reproducibility. In this course, we will focus on RMarkdown only. See Module 11 for more info about levels of reproducibility.
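Since differences in operating system and software versions are one reason results may not reproduce, a small first step is to record them together with your results. The sketch below uses base R's sessionInfo(); the object name si is made up for the illustration, and this is only one possible way of documenting your setup.

```r
# Record the R version, operating system and attached packages
# used to produce the results.
si <- sessionInfo()

# The R version as a text string.
si$R.version$version.string

# Print the full overview, e.g. at the end of a report.
si
```

Adding such a call in the last chunk of an RMarkdown document makes it easier for others to see which setup produced the report.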

An introduction to RMarkdown is given in Chapters 3 and 4 of the DataCamp course Communicating with Data in the Tidyverse. Note that you may skip Chapters 1 and 2 and still understand most of the questions in Chapters 3 and 4 (otherwise just see the solution). You are expected to have completed the chapters before continuing this module!

The RMarkdown cheatsheet may be useful. Find the newest version in RStudio Help > Cheatsheets. All chunk options for R code can be seen here.
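To give an idea of what an RMarkdown document contains, the sketch below writes a very small document to a temporary file and renders it if possible. The title, the text and the chunk label rows are made up for the illustration, and rendering is only attempted if the rmarkdown package and pandoc are available. In practice you would normally create the file in RStudio and press the Knit button instead.

```r
# Three backticks mark the start and end of a code chunk in RMarkdown.
fence <- strrep("`", 3)

# The lines of a minimal RMarkdown document: a YAML header,
# Markdown text, a code chunk with a chunk option, and inline R code.
rmd_lines <- c(
  "---",
  'title: "My first report"',
  "output: html_document",
  "---",
  "",
  "## Data",
  "",
  "The number of rows in the built-in *mtcars* data is:",
  "",
  paste0(fence, "{r rows, echo = TRUE}"),
  "nrow(mtcars)",
  fence,
  "",
  "The mean of mpg is `r mean(mtcars$mpg)`."
)

# Save the document to a temporary file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Render to HTML (only if rmarkdown and pandoc are available).
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
}
```

Changing echo = TRUE to echo = FALSE in the chunk header hides the code in the rendered report while still showing its output.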

11.4 Tibbles

Tibbles are a modern take on the data frame, keeping what time has proven to be effective and throwing out what has not. Tibbles are stricter than data frames: they do not change variable names or types, do not do partial matching, and complain more, e.g. when a variable does not exist. This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Moreover, tibbles have an enhanced print method and can have columns that are lists.

Let us see a few examples:

library(tidyverse)  # attaches tibble (and the other core packages)
tbl1 <- tibble(name = c("Lars", "Susan", "Hans"), age = c(23, 56, 45))
tbl1
#> # A tibble: 3 × 2
#>   name    age
#>   <chr> <dbl>
#> 1 Lars     23
#> 2 Susan    56
#> 3 Hans     45
tbl2 <- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
tbl2
#> # A tibble: 3 × 2
#>       x y         
#>   <int> <list>    
#> 1     1 <int [5]> 
#> 2     2 <int [10]>
#> 3     3 <int [20]>
tbl3 <- as_tibble(mtcars)
tbl3
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
tbl4 <- tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
tbl4
#> # A tibble: 2 × 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5

Note that we can always coerce a data frame to a tibble (tbl3) or create one directly using tibble. Another way to create a tibble is with tribble. Here column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form.
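The stricter behaviour of tibbles mentioned at the start of this section can be sketched with a small comparison. The objects df and tb and their column names are made up for the illustration.

```r
library(tibble)

# The same data as a data frame and as a tibble.
df <- data.frame(`my var` = 1:3, value = c(2.5, 3.5, 4.5))
tb <- tibble(`my var` = 1:3, value = c(2.5, 3.5, 4.5))

# data.frame() changes the non-syntactic name; tibble() keeps it.
names(df)
names(tb)

# Partial matching: the data frame silently returns the column value,
# while the tibble returns NULL and warns about an unknown column.
df_partial <- df$val
tb_partial <- tb$val
df_partial
tb_partial
```

The warning from the tibble is the kind of complaint that makes you notice a misspelled column name early.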

Tibbles have a refined print method that shows only the first 10 rows and only the columns that fit on your screen. Hence, your console is not overwhelmed with data, which makes it much easier to work with large data. In addition to its name, each column reports its type. To see a full view of the data, you can use RStudio’s built-in data viewer:

View(tbl3)
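If you want to see more in the console without opening the data viewer, the print method can be adjusted. A minimal sketch, assuming the tibble package is attached; the numbers of rows chosen are arbitrary.

```r
library(tibble)

tbl3 <- as_tibble(mtcars)

# Show 15 rows instead of the default 10.
print(tbl3, n = 15)

# Show all columns, regardless of the width of the console.
print(tbl3, n = 5, width = Inf)

# A transposed overview with one line per column.
glimpse(tbl3)
```

Here n controls the number of rows and width the number of characters used per line.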

11.5 Recap

tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
RMarkdown is an example of literate programming.
The core tidyverse includes the packages that you are likely to use in everyday data analyses.
The concept of literate programming is a programming paradigm which focuses on weaving code together with text as documentation. That is, we are interested in writing reports containing both text and R code for importing data, wrangling and analysis.
Reproducibility means that the results can be reproduced given access to the original data, software, and code.
In practice it may be hard to make your project totally reproducible. That is, there are different levels of reproducibility.
RMarkdown documents are an attempt to make reproducible documents and combine R code and markdown text.
All chunk options for R code in RMarkdown documents can be seen here.
The RMarkdown cheatsheet may be useful. Find the newest version in RStudio Help > Cheatsheets. For Markdown syntax see Help > Markdown Quick Reference.
Tibbles are a modern take on the data frame, keeping what time has proven to be effective and throwing out what has not.
Tibbles are stricter than data frames: they do not change variable names or types, do not do partial matching, and complain more, e.g. when a variable does not exist.
Tibbles have an enhanced print method and can have columns that are lists.

You may also have a look at the slides for this module.

11.6 Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Some of the solutions can be seen by pressing the button at each question. Beware, you will not learn by giving up too early. Put some effort into finding a solution! Always practice using shortcuts in RStudio (see Tools > Keyboard Shortcuts Help).

Go to the Tools for Analytics workspace and download/export the TM11 project. Open it on your laptop and have a look at the files in the exercises folder which can be used as a starting point.

11.6.1 Exercise (your first RMarkdown exercise)

Load the tfa package:

# If tfa package is not installed then run
# install.packages("remotes")
# remotes::install_github("bss-osca/tfa-package", upgrade = FALSE)  
library(tfa)

The package contains templates for exercises etc. Go to File > New File > R Markdown…. In the pop-up box select From template in the left column and then TFA Exercise. Press OK and a new RMarkdown document will be opened.

Change the meta text (e.g. the title and add your name) in the yaml.
Render/compile the document by pressing the Knit button (or Ctrl+Shift+K).

Change echo = TRUE to echo = FALSE in the first chunk setup and render the document. What has happened?

You can easily go to a chunk using the navigation in the bottom left of the source window.

Try to change fig.asp = 0.25 to e.g. 0.5 in Chunk 10 (and set eval = TRUE). What happens? Note: You may need to call install.packages("ggraph") if you get the error: Error in library(ggraph) : there is no package called 'ggraph'.
Create a new section ## Question 4 and add text in italic: What is the sum of all setup costs?

Add a code chunk solving Question 4 above.

Add a line of text with the result.

11.6.2 Exercise (tibbles)

Solve this exercise using an R script file.

Convert the dataset airquality to a tibble.

airquality |> as_tibble()
#> # A tibble: 153 × 6
#>    Ozone Solar.R  Wind  Temp Month   Day
#>    <int>   <int> <dbl> <int> <int> <int>
#>  1    41     190   7.4    67     5     1
#>  2    36     118   8      72     5     2
#>  3    12     149  12.6    74     5     3
#>  4    18     313  11.5    62     5     4
#>  5    NA      NA  14.3    56     5     5
#>  6    28      NA  14.9    66     5     6
#>  7    23     299   8.6    65     5     7
#>  8    19      99  13.8    59     5     8
#>  9     8      19  20.1    61     5     9
#> 10    NA     194   8.6    69     5    10
#> # ℹ 143 more rows
airquality
#>     Ozone Solar.R Wind Temp Month Day
#> 1      41     190  7.4   67     5   1
#> 2      36     118  8.0   72     5   2
#> 3      12     149 12.6   74     5   3
#> 4      18     313 11.5   62     5   4
#> 5      NA      NA 14.3   56     5   5
#> 6      28      NA 14.9   66     5   6
#> 7      23     299  8.6   65     5   7
#> 8      19      99 13.8   59     5   8
#> 9       8      19 20.1   61     5   9
#> 10     NA     194  8.6   69     5  10
#> 11      7      NA  6.9   74     5  11
#> 12     16     256  9.7   69     5  12
#> 13     11     290  9.2   66     5  13
#> 14     14     274 10.9   68     5  14
#> 15     18      65 13.2   58     5  15
#> 16     14     334 11.5   64     5  16
#> 17     34     307 12.0   66     5  17
#> 18      6      78 18.4   57     5  18
#> 19     30     322 11.5   68     5  19
#> 20     11      44  9.7   62     5  20
#> 21      1       8  9.7   59     5  21
#> 22     11     320 16.6   73     5  22
#> 23      4      25  9.7   61     5  23
#> 24     32      92 12.0   61     5  24
#> 25     NA      66 16.6   57     5  25
#> 26     NA     266 14.9   58     5  26
#> 27     NA      NA  8.0   57     5  27
#> 28     23      13 12.0   67     5  28
#> 29     45     252 14.9   81     5  29
#> 30    115     223  5.7   79     5  30
#> 31     37     279  7.4   76     5  31
#> 32     NA     286  8.6   78     6   1
#> 33     NA     287  9.7   74     6   2
#> 34     NA     242 16.1   67     6   3
#> 35     NA     186  9.2   84     6   4
#> 36     NA     220  8.6   85     6   5
#> 37     NA     264 14.3   79     6   6
#> 38     29     127  9.7   82     6   7
#> 39     NA     273  6.9   87     6   8
#> 40     71     291 13.8   90     6   9
#> 41     39     323 11.5   87     6  10
#> 42     NA     259 10.9   93     6  11
#> 43     NA     250  9.2   92     6  12
#> 44     23     148  8.0   82     6  13
#> 45     NA     332 13.8   80     6  14
#> 46     NA     322 11.5   79     6  15
#> 47     21     191 14.9   77     6  16
#> 48     37     284 20.7   72     6  17
#> 49     20      37  9.2   65     6  18
#> 50     12     120 11.5   73     6  19
#> 51     13     137 10.3   76     6  20
#> 52     NA     150  6.3   77     6  21
#> 53     NA      59  1.7   76     6  22
#> 54     NA      91  4.6   76     6  23
#> 55     NA     250  6.3   76     6  24
#> 56     NA     135  8.0   75     6  25
#> 57     NA     127  8.0   78     6  26
#> 58     NA      47 10.3   73     6  27
#> 59     NA      98 11.5   80     6  28
#> 60     NA      31 14.9   77     6  29
#> 61     NA     138  8.0   83     6  30
#> 62    135     269  4.1   84     7   1
#> 63     49     248  9.2   85     7   2
#> 64     32     236  9.2   81     7   3
#> 65     NA     101 10.9   84     7   4
#> 66     64     175  4.6   83     7   5
#> 67     40     314 10.9   83     7   6
#> 68     77     276  5.1   88     7   7
#> 69     97     267  6.3   92     7   8
#> 70     97     272  5.7   92     7   9
#> 71     85     175  7.4   89     7  10
#> 72     NA     139  8.6   82     7  11
#> 73     10     264 14.3   73     7  12
#> 74     27     175 14.9   81     7  13
#> 75     NA     291 14.9   91     7  14
#> 76      7      48 14.3   80     7  15
#> 77     48     260  6.9   81     7  16
#> 78     35     274 10.3   82     7  17
#> 79     61     285  6.3   84     7  18
#> 80     79     187  5.1   87     7  19
#> 81     63     220 11.5   85     7  20
#> 82     16       7  6.9   74     7  21
#> 83     NA     258  9.7   81     7  22
#> 84     NA     295 11.5   82     7  23
#> 85     80     294  8.6   86     7  24
#> 86    108     223  8.0   85     7  25
#> 87     20      81  8.6   82     7  26
#> 88     52      82 12.0   86     7  27
#> 89     82     213  7.4   88     7  28
#> 90     50     275  7.4   86     7  29
#> 91     64     253  7.4   83     7  30
#> 92     59     254  9.2   81     7  31
#> 93     39      83  6.9   81     8   1
#> 94      9      24 13.8   81     8   2
#> 95     16      77  7.4   82     8   3
#> 96     78      NA  6.9   86     8   4
#> 97     35      NA  7.4   85     8   5
#> 98     66      NA  4.6   87     8   6
#> 99    122     255  4.0   89     8   7
#> 100    89     229 10.3   90     8   8
#> 101   110     207  8.0   90     8   9
#> 102    NA     222  8.6   92     8  10
#> 103    NA     137 11.5   86     8  11
#> 104    44     192 11.5   86     8  12
#> 105    28     273 11.5   82     8  13
#> 106    65     157  9.7   80     8  14
#> 107    NA      64 11.5   79     8  15
#> 108    22      71 10.3   77     8  16
#> 109    59      51  6.3   79     8  17
#> 110    23     115  7.4   76     8  18
#> 111    31     244 10.9   78     8  19
#> 112    44     190 10.3   78     8  20
#> 113    21     259 15.5   77     8  21
#> 114     9      36 14.3   72     8  22
#> 115    NA     255 12.6   75     8  23
#> 116    45     212  9.7   79     8  24
#> 117   168     238  3.4   81     8  25
#> 118    73     215  8.0   86     8  26
#> 119    NA     153  5.7   88     8  27
#> 120    76     203  9.7   97     8  28
#> 121   118     225  2.3   94     8  29
#> 122    84     237  6.3   96     8  30
#> 123    85     188  6.3   94     8  31
#> 124    96     167  6.9   91     9   1
#> 125    78     197  5.1   92     9   2
#> 126    73     183  2.8   93     9   3
#> 127    91     189  4.6   93     9   4
#> 128    47      95  7.4   87     9   5
#> 129    32      92 15.5   84     9   6
#> 130    20     252 10.9   80     9   7
#> 131    23     220 10.3   78     9   8
#> 132    21     230 10.9   75     9   9
#> 133    24     259  9.7   73     9  10
#> 134    44     236 14.9   81     9  11
#> 135    21     259 15.5   76     9  12
#> 136    28     238  6.3   77     9  13
#> 137     9      24 10.9   71     9  14
#> 138    13     112 11.5   71     9  15
#> 139    46     237  6.9   78     9  16
#> 140    18     224 13.8   67     9  17
#> 141    13      27 10.3   76     9  18
#> 142    24     238 10.3   68     9  19
#> 143    16     201  8.0   82     9  20
#> 144    13     238 12.6   64     9  21
#> 145    23      14  9.2   71     9  22
#> 146    36     139 10.3   81     9  23
#> 147     7      49 10.3   69     9  24
#> 148    14      20 16.6   63     9  25
#> 149    30     193  6.9   70     9  26
#> 150    NA     145 13.2   77     9  27
#> 151    14     191 14.3   75     9  28
#> 152    18     131  8.0   76     9  29
#> 153    20     223 11.5   68     9  30

Print the tibble and the original data frame and compare the difference.

# here misc is a list with lists
dat <- tibble(name = c("Hans", "Ole"), 
              age = c(23, 45), 
              misc = list(
                 list(status = 1, comment = "Too young"), 
                 list(comment = "Potential candidate")))
dat
#> # A tibble: 2 × 3
#>   name    age misc            
#>   <chr> <dbl> <list>          
#> 1 Hans     23 <named list [2]>
#> 2 Ole      45 <named list [1]>
dat$misc[[1]]
#> $status
#> [1] 1
#> 
#> $comment
#> [1] "Too young"

Create a tibble with 3 columns of data type string/character, double and list.