Module 12 Importing and exporting data

To do data-driven analytics, you first must import some data; that is, take data from a database, file, web API etc. and transform it into a data frame/table. Moreover, after processing the data, you often want to export or store some of the results. This module introduces you to different ways of importing and exporting data.

A template project for this module is given on Posit Cloud (open it and use it while reading the notes).

Learning path diagram

It is recommended that you follow the green learning path; however, you may prefer a different learning style. The learning path diagram contains links to alternative online content (video or reading). Note that this is an alternative to the standard learning path, which you may use instead (you should not do both). The learning path may also contain extra content that is NOT part of the syllabus (only look at it if you want more information).

12.1 Learning outcomes

By the end of this module, you are expected to be able to:

  • Import and export csv files in different formats.
  • Import and export data from Excel.
  • Import and export data from Google Sheets.
  • Write to a text file.
  • Save data using R’s native format.
  • Read and write to a json file.

The learning outcomes relate to the overall learning goals number 7 and 13 of the course.

12.2 CSV files

CSV files contain comma-separated values (csv) in plain text and are often named using the file suffix .csv. Each line of the file is a data record. Each record consists of one or more fields, separated by a common delimiter (e.g. a comma, semicolon or tab). The CSV file format is not fully standardized: different delimiters may be used, fields may be surrounded by quotation marks, text may contain escape characters, and the encoding of the file may not be known. Despite these problems, CSV files are commonly used since they are easy to exchange and read.
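To make this concrete, here is a made-up example of the same record stored with two different conventions. Note how quoting protects the comma inside the text field, and how a semicolon delimiter is often paired with a decimal comma:

```
name,height,comment
"Hans Søgaard",1.84,"Likes skiing, and hiking"

name;height;comment
"Hans Søgaard";1,84;"Likes skiing, and hiking"
```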

We will use the readr package for reading and writing. An overview of its functions can be seen in the cheatsheet.

12.2.1 Reading a CSV file

In general, use the following functions:

  • read_csv: Read a file with delimiter ,.
  • read_csv2: Read a file with delimiter ;.
  • read_delim: Read a file with a delimiter set by you.
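For files using a less common delimiter, read_delim lets you set it yourself. A minimal sketch (the tab-separated file below is made up and written on the fly purely for illustration):

```r
library(readr)

# Create a small tab-separated file for illustration (made-up data)
tsv_file <- tempfile(fileext = ".tsv")
write_lines(c("name\tage", "Hans\t35", "Yvette\t28"), tsv_file)

# Specify the delimiter explicitly with the delim argument
dat <- read_delim(tsv_file, delim = "\t")
dat
```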

12.2.1.1 Reading an unknown CSV file

To import a CSV file properly, you need to know the delimiter, whether the file has a header, and the encoding. If you are not sure, have a look at the file by opening it in a text editor or read the first few lines:

csv_file <- readr_example("mtcars.csv") # csv file
lines <- read_lines(csv_file, n_max = 3)
lines
#> [1] "\"mpg\",\"cyl\",\"disp\",\"hp\",\"drat\",\"wt\",\"qsec\",\"vs\",\"am\",\"gear\",\"carb\""
#> [2] "21,6,160,110,3.9,2.62,16.46,0,1,4,4"                                                     
#> [3] "21,6,160,110,3.9,2.875,17.02,0,1,4,4"
cat(lines, sep = "\n")
#> "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
#> 21,6,160,110,3.9,2.62,16.46,0,1,4,4
#> 21,6,160,110,3.9,2.875,17.02,0,1,4,4

It seems that the delimiter is a comma, so we may try to read the file using read_csv:

dat <- read_csv(csv_file)
head(dat)
#> # A tibble: 6 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

CSV files should always be saved using encoding UTF-8. However, sometimes you may have encoding problems when you read a file:

csv_file <- system.file("extdata/persons.csv", package = "tfa")
read_csv(csv_file)
#> # A tibble: 3 × 2
#>   first    last        
#>   <chr>    <chr>       
#> 1 "Hans"   "S\xf8gaard"
#> 2 "\xc5ge" "\xd8kse"   
#> 3 "Yvette" "L\xe6ske"

Note that some of the characters are not converted correctly. This is usually because the file encoding is not UTF-8. In this case, try to guess the encoding using:

guess_encoding(csv_file)
#> # A tibble: 1 × 2
#>   encoding   confidence
#>   <chr>           <dbl>
#> 1 ISO-8859-1       0.27
dat <- read_csv(csv_file, locale = locale(encoding = "ISO-8859-1"))
dat
#> # A tibble: 3 × 2
#>   first  last   
#>   <chr>  <chr>  
#> 1 Hans   Søgaard
#> 2 Åge    Økse   
#> 3 Yvette Læske

12.2.2 Writing to CSV files

Given a tibble/data frame, export it using write_csv:

csv_file <- "testing.csv"
write_csv(dat, file = csv_file) 
write_csv2(dat, file = "testing_semicolon.csv")   # use a semicolon as delimiter

You can now always import the data again using read_csv:

read_csv(csv_file)
#> # A tibble: 3 × 2
#>   first  last   
#>   <chr>  <chr>  
#> 1 Hans   Søgaard
#> 2 Åge    Økse   
#> 3 Yvette Læske
guess_encoding(csv_file)
#> # A tibble: 3 × 2
#>   encoding     confidence
#>   <chr>             <dbl>
#> 1 UTF-8              1   
#> 2 windows-1252       0.31
#> 3 windows-1250       0.25

Note that write_csv always saves the file using encoding UTF-8.

In a few cases, you may need to save a CSV file that can be read by Excel. For this purpose use:

write_excel_csv2(dat, csv_file)

The CSV file can now be opened correctly in Excel.

12.3 Excel

There are several packages in R for reading and writing Excel files. We will use the readxl package, which is part of the tidyverse ecosystem, for reading Excel files. The package supports both the legacy .xls format and the modern xml-based .xlsx format. Let us use one of the example files provided by the package:

xlsx_file <- system.file("extdata/datasets.xlsx", package = "readxl")

It is always a good idea to have a look at the file before you import from it. You can open it from R by using:

browseURL(xlsx_file)

Data can be read using:

library(readxl)
xlsx <- read_excel(xlsx_file)   # reads the first sheet
xlsx
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ℹ 140 more rows
xlsx <- read_excel(xlsx_file, sheet = 2)   # reads the second sheet
xlsx
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
xlsx <- read_excel(xlsx_file, sheet = "quakes")   # reads a named sheet
xlsx
#> # A tibble: 1,000 × 5
#>      lat  long depth   mag stations
#>    <dbl> <dbl> <dbl> <dbl>    <dbl>
#>  1 -20.4  182.   562   4.8       41
#>  2 -20.6  181.   650   4.2       15
#>  3 -26    184.    42   5.4       43
#>  4 -18.0  182.   626   4.1       19
#>  5 -20.4  182.   649   4         11
#>  6 -19.7  184.   195   4         12
#>  7 -11.7  166.    82   4.8       43
#>  8 -28.1  182.   194   4.4       15
#>  9 -28.7  182.   211   4.7       35
#> 10 -17.5  180.   622   4.3       19
#> # ℹ 990 more rows
xlsx <- read_excel(xlsx_file, sheet = "mtcars", range = "A5:G11", col_names = FALSE)   # reads a range
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...3`
#> • `` -> `...4`
#> • `` -> `...5`
#> • `` -> `...6`
#> • `` -> `...7`
colnames(xlsx) <- read_excel(xlsx_file, sheet = "mtcars", range = "A1:G1", col_names = FALSE)   # reads the column names
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...3`
#> • `` -> `...4`
#> • `` -> `...5`
#> • `` -> `...6`
#> • `` -> `...7`
xlsx
#> # A tibble: 7 × 7
#>     mpg   cyl  disp    hp  drat    wt  qsec
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.4     6  258    110  3.08  3.22  19.4
#> 2  18.7     8  360    175  3.15  3.44  17.0
#> 3  18.1     6  225    105  2.76  3.46  20.2
#> 4  14.3     8  360    245  3.21  3.57  15.8
#> 5  24.4     4  147.    62  3.69  3.19  20  
#> 6  22.8     4  141.    95  3.92  3.15  22.9
#> 7  19.2     6  168.   123  3.92  3.44  18.3

Writing to an Excel file can be done using the openxlsx package. To write to a new file use:

library(openxlsx)
dat <- trees   # test dataset
head(dat)
#>   Girth Height Volume
#> 1   8.3     70   10.3
#> 2   8.6     65   10.3
#> 3   8.8     63   10.2
#> 4  10.5     72   16.4
#> 5  10.7     81   18.8
#> 6  10.8     83   19.7
write.xlsx(dat, "test1.xlsx", sheetName = "trees") # start at cell A1
write.xlsx(dat, "test2.xlsx", sheetName = "trees", startCol = "C", startRow = 3)

If you want to append a sheet to an existing file use:

xlsx_file <- system.file("extdata/datasets.xlsx", package = "tfa")
file.copy(xlsx_file, "test.xlsx")         # copy the file so we can make some tests
#> [1] TRUE
wb <- loadWorkbook(file = "test.xlsx")    # read the workbook
addWorksheet(wb = wb, sheetName = "trees")
writeData(wb, sheet = "trees", x = dat)
saveWorkbook(wb, file = "test.xlsx", overwrite = TRUE)

12.4 Google Sheets

You can import and export Google Sheets using the googlesheets4 package from the tidyverse. In general, you need to be logged in as a Google user to read and write data; the package will ask you to authenticate when needed. However, if you only want to read data from a public sheet, you can use gs4_deauth to skip this:

library(googlesheets4)
gs4_deauth()

To read data use:

url <- "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
read_sheet(url)
read_sheet(url, sheet = 3)
range_read(url, sheet = 2, n_max = 3)
range_read(url, range = "Africa!A5:C15")

To write data to a new file use:

gs4_auth()
gs <- gs4_create("test", sheets = c("Sheet 1", "Sheet 2"))
write_sheet(dat, ss = gs)
range_write(gs, dat, sheet = "Sheet 1", range = "C4")
gs4_browse(gs)  # have a look at the file in a browser

To see the results, have a look at your Google sheet test in your browser.

12.5 Text files

You can read and write plain text files using the readr package. Often, however, you write to a text file because you want to save some kind of log when you run your script. Here sink is an excellent function, since it redirects your R output to a file. To save the output without messages, warnings and errors use:

sink(file = "ex1.log", split = TRUE)  # open the file for output
cat("This is a string\n... and on a new line\n\n")
print("This is another string")
head(mtcars)
rep(1, 4)
message("A message.")
warning("A warning.")
rep(3, f)  # an error
cat("\nLast line\n")
sink()  # close the file again
# file.show("ex1.log")   # to view in external viewer

Let us have a look at the content of the file (run cat(read_file("ex1.log"))):

This is a string
... and on a new line

[1] "This is another string"
                   mpg cyl disp  hp drat   wt qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
[1] 1 1 1 1

Last line

Note that messages, warnings and errors are not included in the output. If you want to include them, use:

zz <- file("ex2.log", open = "wt")
sink(zz, type = "output")   # open the file for output
sink(zz, type = "message")  # open the same file for messages, errors and warnings
cat("This is a string\n... and on a new line\n\n")
print("This is another string")
head(mtcars)
rep(1, 4)
message("A message.")
warning("A warning.")
rep(3, f)  # an error
cat("\nLast line\n")
sink()  # close the file for output
sink()  # close the file for messages, errors and warnings

That is, we call sink two times. Let us have a look at the content of the file:

This is a string
... and on a new line

[1] "This is another string"
                   mpg cyl disp  hp drat   wt qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
[1] 1 1 1 1
A message.
Warning message:
A warning. 
Error: object 'f' not found

Last line

12.6 R’s native binary format

In general, we can distinguish between two main types of data files: information is either binary encoded (basically just 0’s and 1’s) or stored as plain text. What we have considered so far is storing data in text files.

Text files can be read by humans and computers alike. The great thing about plain text is its simplicity and ease of use: any programming language can read a plain text file. Text files are good for storing tabular data but lack type-safety and have limited precision for numeric values. Binary file formats cannot be read by humans but allow space-efficient data compression. Furthermore, binary formats may be difficult to read and write using other programs.

Like most other programming languages, R comes with its own binary format. We will focus on the Rds data format, which is optimized for speed and compression ratio. To save and read an R object use:

dat <- list(x = c(2, 5, 6), y = "A string", z = mtcars)
saveRDS(dat, file = "test.rds")
readRDS("test.rds")
#> $x
#> [1] 2 5 6
#> 
#> $y
#> [1] "A string"
#> 
#> $z
#>                      mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.88 17.0  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.21 19.4  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.44 17.0  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.46 20.2  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.57 15.8  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.25 18.0  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.42 17.8  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.34 17.4  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.20 19.5  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.61 18.5  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.83 19.9  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.46 20.0  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.52 16.9  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.44 17.3  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.84 15.4  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.85 17.1  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.94 18.9  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.78 18.6  1  1    4    2

Note that here we have saved a non-tabular R object (a list).
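To see the trade-off between text and binary storage in practice, we can store the same data frame both ways and compare (a small sketch using base R only; the temporary file names are generated on the fly):

```r
# Store the same data frame as text (CSV) and in R's binary rds format
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
write.csv(mtcars, csv_file)
saveRDS(mtcars, rds_file)   # binary, compressed by default

file.size(csv_file)   # size in bytes of the text version
file.size(rds_file)   # the binary version is typically smaller

# The binary round trip preserves the object exactly,
# including row names and column types
identical(mtcars, readRDS(rds_file))
```

In contrast, reading the CSV back with read.csv would require you to restore the row names and check the column types yourself.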

12.7 Json

JavaScript Object Notation (json) is an open standard text file format and data-interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types. It can be used to store non-tabular data in text format and is often used for data exchange in web APIs.

Let us try to read and write to a json file using the jsonlite package.

library(jsonlite)
dat <- list(x = c(2, 5, 6), y = "A string", z = head(mtcars))
write_json(dat, "test.json", pretty = TRUE)
lst <- read_json("test.json", simplifyDataFrame = TRUE, simplifyVector = TRUE)
lst
#> $x
#> [1] 2 5 6
#> 
#> $y
#> [1] "A string"
#> 
#> $z
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1

The content of the json file looks like:

{
  "x": [2, 5, 6],
  "y": ["A string"],
  "z": [
    {
      "mpg": 21,
      "cyl": 6,
      "disp": 160,
      "hp": 110,
      "drat": 3.9,
      "wt": 2.62,
      "qsec": 16.46,
      "vs": 0,
      "am": 1,
      "gear": 4,
      "carb": 4,
      "_row": "Mazda RX4"
    },
    {
      "mpg": 21,
      "cyl": 6,
      "disp": 160,
      "hp": 110,
      "drat": 3.9,
      "wt": 2.875,
      "qsec": 17.02,
      "vs": 0,
      "am": 1,
      "gear": 4,
      "carb": 4,
      "_row": "Mazda RX4 Wag"
    },
    {
      "mpg": 22.8,
      "cyl": 4,
      "disp": 108,
      "hp": 93,
      "drat": 3.85,
      "wt": 2.32,
      "qsec": 18.61,
      "vs": 1,
      "am": 1,
      "gear": 4,
      "carb": 1,
      "_row": "Datsun 710"
    },
    {
      "mpg": 21.4,
      "cyl": 6,
      "disp": 258,
      "hp": 110,
      "drat": 3.08,
      "wt": 3.215,
      "qsec": 19.44,
      "vs": 1,
      "am": 0,
      "gear": 3,
      "carb": 1,
      "_row": "Hornet 4 Drive"
    },
    {
      "mpg": 18.7,
      "cyl": 8,
      "disp": 360,
      "hp": 175,
      "drat": 3.15,
      "wt": 3.44,
      "qsec": 17.02,
      "vs": 0,
      "am": 0,
      "gear": 3,
      "carb": 2,
      "_row": "Hornet Sportabout"
    },
    {
      "mpg": 18.1,
      "cyl": 6,
      "disp": 225,
      "hp": 105,
      "drat": 2.76,
      "wt": 3.46,
      "qsec": 20.22,
      "vs": 1,
      "am": 0,
      "gear": 3,
      "carb": 1,
      "_row": "Valiant"
    }
  ]
}

12.8 Recap

  • To do data-driven analytics, you first must import some data. That is, take data from a database, file, web API etc. and transform it into a data frame/table.

  • CSV files contain delimiter separated values in plain text and are often named using the file suffix .csv.

  • Each line of a csv file is a data record. Each record consists of one or more fields, separated by a common delimiter (e.g. a comma, semicolon or tab).

  • The readxl package can be used to read Excel files.

  • Writing to an Excel file can be done using the openxlsx package.

  • You can import and export Google Sheets using the googlesheets4 package from the tidyverse.

  • Use sink to save the output of your R script.

  • There are two main types of data files. Information is either binary encoded or stored as text files.

  • Text files can be read by humans and computers alike. The great thing about plain text is their simplicity and their ease of use: any programming language can read a plain text file.

  • Text files are good for storing tabular data but lack type-safety and have limited precision for numeric values.

  • Binary file formats cannot be read by humans but allow space-efficient data compression. Moreover, they can be used to save non-tabular data.

  • Like most other programming languages, R comes with its own binary format. We will focus on the Rds data format, which is optimized for speed and compression ratio.

  • Json is an open standard text file format and data interchange format. It can be used to store non-tabular data in text format. It is often used for data exchange in web APIs.

You may also have a look at the slides for this module.

12.9 Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Solutions to some of the exercises can be seen by pressing the button at each question. Beware: you will not learn by giving up too early. Put some effort into finding a solution! Always practice using shortcuts in RStudio (see Tools > Keyboard Shortcuts Help).

Go to the Tools for Analytics workspace and download/export the TM12 project. Open it on your laptop and have a look at the files in the exercises folder which can be used as a starting point.

12.9.1 Exercise (Statistikbanken)

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

You can use the API from Statistikbanken to download a lot of data sets. Let us consider airports in Denmark (data set with table id FLYV41):

url <- "https://api.statbank.dk/v1/data/FLYV41/CSV?lang=en&LUFTHAVN=*&Tid=*&Transport=*"

  1. Use cat(read_lines(url, n_max = 3), sep = "\n") to have a look at the delimiter used.
  2. Import the csv file.

  3. Try to retrieve information and get an overview of the data by running:

    library(jsonlite)
    url <- "https://api.statbank.dk/v1/tableinfo/FLYV41?lang=en"
    lst <- read_json(url, simplifyVector = T)
    View(lst)

    Note the data returned is in json format, so we use read_json to read the data into a list.

  4. Create a function info(tab_id) that returns a list with components description, unit and variables from the information for a data set with table id tab_id.

  5. Information about all the data sets can be retrieved using:

    url <- "https://api.statbank.dk/v1/tables?lang=en"
    lst <- jsonlite::read_json(url, simplifyVector = T)
    View(lst)

    Have a look at the row for FLYV41.

  6. Given the information about variables in a data set, we can construct the url to retrieve the data in csv format:

    tab_id <- "FLYV41"
    url <- str_c("https://api.statbank.dk/v1/tableinfo/", tab_id, "?lang=en")
    lst <- read_json(url, simplifyVector = T) 
    col_id <- c(1,3)  # column ids in lst$variables$id
    cols <- lst$variables$id[col_id]
    url <- str_c("https://api.statbank.dk/v1/data/", tab_id, "/CSV?lang=en&", str_c(cols, collapse = "=*&"), "=*") |> 
       URLencode()
    url
    #> [1] "https://api.statbank.dk/v1/data/FLYV41/CSV?lang=en&LUFTHAVN=*&Tid=*"

    Create a function get_data(tab_id, col_id) that retrieves a data set.

  7. Use the function get_data to retrieve data for tab_id = "FOLK1A" and col_id = c(2, 3, 5) and save it as a csv file with a comma as delimiter.
  8. Save the data in an Excel file and a Google sheet.

12.9.2 Exercise (tuples in OPL)

In the algebraic modeling language OPL (Optimization Programming Language) used by IBM ILOG CPLEX Optimization Studio, you can define tuples to contain various information. For example consider tuples defined as:

tuple nurse {
  string name;
  int experience;   // highest is best
}

tuple shift {
   string departmentName;
   string day;
   int startTime;
   int endTime;
}

A nurse tuple is then defined as <"Anne", 11> and a shift tuple as <"Consultation", "Monday", 12, 18>.

A set of tuples can be defined using:

{nurse} nurses = ...;
{shift} shifts = ...;

where the ... operator means that the sets are read from a data text file:

nurses = {
   <"Anne", 11>, 
   <"Bethanie", 4>, 
   <"Betsy", 2>
};

shifts = {
   <"Emergency", "Monday", 2, 8>,
   <"Emergency", "Monday", 8, 12>, 
   <"Emergency", "Monday", 12, 18> 
};

You can now use the sets to define decision variables \(x_{ns}\), equal to one if nurse \(n\) is assigned to shift \(s\).

In this exercise we will try to generate the data text file given tibbles with data.

  1. Try to generate a text file named test.dat using function write_lines with content

    nurses = {
       <"Anne", 11>
    };

Load datasets

# remotes::install_github("bss-osca/tfa-package", dependencies = FALSE)  # if tfa not installed
library(tfa)
library(tidyverse)
nurses <- read_csv(system.file("extdata/nurses.csv", package = "tfa"))
shifts <- read_csv(system.file("extdata/shifts.csv", package = "tfa"))

  2. Transform all character columns in nurses so they start and end with ". Some hints are given in Hint 1.
  3. Unite all columns into a new column named tuple where each column is separated with ,. Hint: have a look at the unite function. All columns can be selected using everything().
  4. Add < and > to the start and end of the tuple column.
  5. Extract the tuple column and transform it into a string with collapse = ",\n".
  6. Create a function write_tuple that takes nurses as input and writes the tuples to a file.

The name of an object can be extracted as a string using

deparse(substitute(nurses))
#> [1] "nurses"

  7. Modify write_tuple so it works if shifts are given as input instead of nurses.
  8. Modify write_tuple with a new input argument append, which is false by default. If true, the file is appended to instead of overwritten.
  9. Write nurses and shifts to a single data file.