Module 12 Importing and exporting data
To do data-driven analytics, you first must import some data; that is, take data from a database, file, web API, etc. and transform it into a data frame/table. Moreover, after processing the data, you often want to export or store some of the results. This module introduces you to different ways of importing and exporting data.
A template project for this module is given on Posit Cloud (open it and use it while reading the notes).
Learning path diagram
12.1 Learning outcomes
By the end of this module, you are expected to be able to:
- Import and export csv files in different formats.
- Import and export data from Excel.
- Import and export data from Google Sheets.
- Write to a text file.
- Save data using R’s native format.
- Read and write to a json file.
The learning outcomes relate to the overall learning goals number 7 and 13 of the course.
12.2 CSV files
CSV files contain comma separated values (csv) in plain text and are often named using the file suffix .csv. Each line of the file is a data record. Each record consists of one or more fields, separated by a common delimiter (e.g. a comma, a semicolon or an underscore). The CSV file format is not fully standardized: different delimiters may be used, fields may be surrounded by quotation marks, text may contain escape characters, and the encoding of the file may not be known. Despite these problems, CSV files are commonly used since they are easy to exchange and read.
We will use the readr package for reading and writing. An overview of the functions can be found in the readr cheatsheet.
12.2.1 Reading a CSV file
In general, use one of the following functions:
- read_csv: reads a file with delimiter , (comma).
- read_csv2: reads a file with delimiter ; (semicolon).
- read_delim: reads a file with a delimiter set by you (see the sketch below).
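For example, a file where fields are separated by a vertical bar could be read as follows. This is a minimal sketch; the file name and the delimiter are made up for illustration:
dat <- read_delim("my_file.txt", delim = "|") # read a |-delimited file into a tibble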
12.2.1.1 Reading an unknown CSV file
To import a CSV file properly, you need to know the delimiter, whether the file has a header, and the encoding. If you are not sure, have a look at the file by opening it in a text editor or try to read a few lines:
library(readr)
csv_file <- readr_example("mtcars.csv") # path to an example csv file shipped with readr
lines <- read_lines(csv_file, n_max = 3)
lines
#> [1] "\"mpg\",\"cyl\",\"disp\",\"hp\",\"drat\",\"wt\",\"qsec\",\"vs\",\"am\",\"gear\",\"carb\""
#> [2] "21,6,160,110,3.9,2.62,16.46,0,1,4,4"
#> [3] "21,6,160,110,3.9,2.875,17.02,0,1,4,4"
cat(lines, sep = "\n")
#> "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
#> 21,6,160,110,3.9,2.62,16.46,0,1,4,4
#> 21,6,160,110,3.9,2.875,17.02,0,1,4,4
It seems that the delimiter is a comma, and we may try to read the file using read_csv:
dat <- read_csv(csv_file)
head(dat)
#> # A tibble: 6 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
CSV files should always be saved using encoding UTF-8. However, sometimes you may have encoding problems when you read a file:
csv_file <- system.file("extdata/persons.csv", package = "tfa")
read_csv(csv_file)
#> # A tibble: 3 × 2
#> first last
#> <chr> <chr>
#> 1 "Hans" "S\xf8gaard"
#> 2 "\xc5ge" "\xd8kse"
#> 3 "Yvette" "L\xe6ske"
Note that some of the characters are not converted correctly. This is usually because the file encoding is not UTF-8. In this case, try to guess the encoding using:
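guess_encoding(csv_file)
The encoding with the highest confidence can then be passed to read_csv via the locale argument. A minimal sketch (windows-1252 is our guess for this file; use the encoding that guess_encoding reports):
dat <- read_csv(csv_file, locale = locale(encoding = "windows-1252")) # re-read using the guessed encoding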
12.2.2 Writing to CSV files
Given a tibble/data frame, export it using write_csv:
csv_file <- "testing.csv"
write_csv(dat, file = csv_file)
write_csv2(dat, file = "testing_semicolon.csv") # use a semicolon as delimiter
You can now always import the data again using read_csv:
read_csv(csv_file)
#> # A tibble: 3 × 2
#> first last
#> <chr> <chr>
#> 1 Hans Søgaard
#> 2 Åge Økse
#> 3 Yvette Læske
We can check the encoding of the file we just wrote:
guess_encoding(csv_file)
#> # A tibble: 3 × 2
#> encoding confidence
#> <chr> <dbl>
#> 1 UTF-8 1
#> 2 windows-1252 0.31
#> 3 windows-1250 0.25
Note that write_csv always saves the file using encoding UTF-8.
In a few cases, you may need to save a CSV file that can be read by Excel. For this purpose use:
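One option is readr's write_excel_csv, which writes UTF-8 with a byte order mark so that Excel detects the encoding correctly (the file name below is arbitrary):
write_excel_csv(dat, "testing_excel.csv") # a csv file that opens correctly in Excel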
The CSV file can now be opened correctly in Excel.
12.3 Excel
There are different packages in R for reading and writing Excel files. We will use the readxl package for reading Excel files, which is part of the tidyverse. The package supports both the legacy .xls format and the modern xml-based .xlsx format. Let us use one of the example files provided by the package:
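A sketch, assuming the datasets.xlsx example file that ships with readxl (it contains the iris, mtcars and quakes sheets used below):
xlsx_file <- readxl::readxl_example("datasets.xlsx") # path to the example file
xlsx_file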
It is always a good idea to have a look at the file before you import from it. You can open it from R by using:
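One option (an assumption on our part) is openXL from the openxlsx package, which opens the file in your spreadsheet program:
openxlsx::openXL(xlsx_file) # open the file for inspection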
Data can be read using:
library(readxl)
xlsx <- read_excel(xlsx_file) # reads the first sheet
xlsx
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ℹ 140 more rows
xlsx <- read_excel(xlsx_file, sheet = 2) # reads the second sheet
xlsx
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
xlsx <- read_excel(xlsx_file, sheet = "quakes") # reads a named sheet
xlsx
#> # A tibble: 1,000 × 5
#> lat long depth mag stations
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -20.4 182. 562 4.8 41
#> 2 -20.6 181. 650 4.2 15
#> 3 -26 184. 42 5.4 43
#> 4 -18.0 182. 626 4.1 19
#> 5 -20.4 182. 649 4 11
#> 6 -19.7 184. 195 4 12
#> 7 -11.7 166. 82 4.8 43
#> 8 -28.1 182. 194 4.4 15
#> 9 -28.7 182. 211 4.7 35
#> 10 -17.5 180. 622 4.3 19
#> # ℹ 990 more rows
xlsx <- read_excel(xlsx_file, sheet = "mtcars", range = "A5:G11", col_names = FALSE) # reads a range
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...3`
#> • `` -> `...4`
#> • `` -> `...5`
#> • `` -> `...6`
#> • `` -> `...7`
colnames(xlsx) <- read_excel(xlsx_file, sheet = "mtcars", range = "A1:G1", col_names = FALSE) # reads the column names
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...3`
#> • `` -> `...4`
#> • `` -> `...5`
#> • `` -> `...6`
#> • `` -> `...7`
xlsx
#> # A tibble: 7 × 7
#> mpg cyl disp hp drat wt qsec
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.4 6 258 110 3.08 3.22 19.4
#> 2 18.7 8 360 175 3.15 3.44 17.0
#> 3 18.1 6 225 105 2.76 3.46 20.2
#> 4 14.3 8 360 245 3.21 3.57 15.8
#> 5 24.4 4 147. 62 3.69 3.19 20
#> 6 22.8 4 141. 95 3.92 3.15 22.9
#> 7 19.2 6 168. 123 3.92 3.44 18.3
Writing to an Excel file can be done using the openxlsx package. To write to a new file use:
library(openxlsx)
dat <- trees # test dataset
head(dat)
#> Girth Height Volume
#> 1 8.3 70 10.3
#> 2 8.6 65 10.3
#> 3 8.8 63 10.2
#> 4 10.5 72 16.4
#> 5 10.7 81 18.8
#> 6 10.8 83 19.7
write.xlsx(dat, "test1.xlsx", sheetName = "trees") # start at cell A1
write.xlsx(dat, "test2.xlsx", sheetName = "trees", startCol = "C", startRow = 3)
If you want to append a sheet to a file use:
xlsx_file <- system.file("extdata/datasets.xlsx", package = "tfa")
file.copy(xlsx_file, "test.xlsx") # copy the file so we can make some tests
#> [1] TRUE
wb <- loadWorkbook(file = "test.xlsx") # read the workbook
addWorksheet(wb = wb, sheetName = "trees")
writeData(wb, sheet = "trees", x = dat)
saveWorkbook(wb, file = "test.xlsx", overwrite = TRUE)
12.4 Google Sheets
You can import and export to Google Sheets using the googlesheets4 package from the tidyverse. To read and write data you, in general, need to be logged in as a Google user; the package will ask you when needed. However, if you only want to read data from a public sheet, you can use gs4_deauth to skip this:
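library(googlesheets4)
gs4_deauth() # read public sheets without logging in to a Google account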
To read data use:
url <- "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
read_sheet(url)
read_sheet(url, sheet = 3)
range_read(url, sheet = 2, n_max = 3)
range_read(url, range = "Africa!A5:C15")
To write data to a new file use:
gs4_auth()
gs <- gs4_create("test", sheets = c("Sheet 1", "Sheet 2"))
write_sheet(dat, ss = gs)
range_write(gs, dat, sheet = "Sheet 1", range = "C4")
gs4_browse(gs) # have a look at the file in a browser
To see the results, have a look at your Google sheet named test in your browser.
12.5 Text files
You can read and write plain text files using the readr package. However, mostly you want to write to a text file to save some kind of log when you run your script. Here sink is an excellent function, since it redirects your R output. To capture the output without messages, errors and warnings use:
sink(file = "ex1.log", split = TRUE) # open the file for output
cat("This is a string\n... and on a new line\n\n")
print("This is another string")
head(mtcars)
rep(1, 4)
message("A message.")
warning("A warning.")
rep(3, f) # an error
cat("\nLast line\n")
sink() # close the file again
# file.show("ex1.log") # to view in external viewer
Let us have a look at the content of the file (run cat(read_file("ex1.log"))):
This is a string
... and on a new line
[1] "This is another string"
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
[1] 1 1 1 1
Last line
Note that messages, errors and warnings are not included in the output. If you want to include them, use:
zz <- file("ex2.log", open = "wt")
sink(zz, type = "output") # open the file for output
sink(zz, type = "message") # open the same file for messages, errors and warnings
cat("This is a string\n... and on a new line\n\n")
print("This is another string")
head(mtcars)
rep(1, 4)
message("A message.")
warning("A warning.")
rep(3, f) # an error
cat("\nLast line\n")
sink() # close the file for output
sink() # close the file for messages, errors and warnings
That is, we call sink twice: once for output and once for messages. Let us have a look at the content of the file:
This is a string
... and on a new line
[1] "This is another string"
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
[1] 1 1 1 1
A message.
Warning message:
A warning.
Error: object 'f' not found
Last line
12.6 R’s native binary format
In general, we can distinguish between two main types of data/files: information is either binary encoded (basically just 0s and 1s) or stored as text. What we have considered so far is storing data in text files.
Text files can be read by humans and computers alike. The great thing about plain text is its simplicity and ease of use: any programming language can read a plain text file. Text files are good for storing tabular data but lack type safety and have limited precision for numeric values. Binary file formats cannot be read by humans but allow space-efficient data compression. Furthermore, binary formats may be difficult to read and write from other programs.
Like most other programming languages, R comes with its own binary format. We will focus on the Rds format, which is optimized for speed and compression ratio. To save and read an R object use:
dat <- list(x = c(2, 5, 6), y = "A string", z = mtcars)
saveRDS(dat, file = "test.rds")
readRDS("test.rds")
#> $x
#> [1] 2 5 6
#>
#> $y
#> [1] "A string"
#>
#> $z
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.88 17.0 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 1 1 4 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.0 0 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.57 15.8 0 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 18.0 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.42 17.8 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.34 17.4 0 0 3 4
#> Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.61 18.5 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.46 20.0 1 0 3 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.52 16.9 0 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.44 17.3 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.84 15.4 0 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.85 17.1 0 0 3 2
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.78 18.6 1 1 4 2
Note that here we have saved a non-tabular R object (a list).
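As a side note, readr also provides write_rds and read_rds, thin wrappers around saveRDS and readRDS that fit nicely into pipes (unlike saveRDS, write_rds does not compress by default):
write_rds(dat, "test.rds") # readr wrapper around saveRDS
read_rds("test.rds") # readr wrapper around readRDS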
12.7 Json
JavaScript Object Notation (json) is an open-standard text file and data-interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types. It can be used to store non-tabular data in text format and is often used for data exchange in web APIs.
Let us try to read and write to a json file using the jsonlite package.
library(jsonlite)
dat <- list(x = c(2, 5, 6), y = "A string", z = head(mtcars))
write_json(dat, "test.json", pretty = TRUE)
lst <- read_json("test.json", simplifyDataFrame = TRUE, simplifyVector = TRUE)
lst
#> $x
#> [1] 2 5 6
#>
#> $y
#> [1] "A string"
#>
#> $z
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
The content of the json file looks like:
{
"x": [2, 5, 6],
"y": ["A string"],
"z": [
{
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.62,
"qsec": 16.46,
"vs": 0,
"am": 1,
"gear": 4,
"carb": 4,
"_row": "Mazda RX4"
},
{
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.875,
"qsec": 17.02,
"vs": 0,
"am": 1,
"gear": 4,
"carb": 4,
"_row": "Mazda RX4 Wag"
},
{
"mpg": 22.8,
"cyl": 4,
"disp": 108,
"hp": 93,
"drat": 3.85,
"wt": 2.32,
"qsec": 18.61,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 1,
"_row": "Datsun 710"
},
{
"mpg": 21.4,
"cyl": 6,
"disp": 258,
"hp": 110,
"drat": 3.08,
"wt": 3.215,
"qsec": 19.44,
"vs": 1,
"am": 0,
"gear": 3,
"carb": 1,
"_row": "Hornet 4 Drive"
},
{
"mpg": 18.7,
"cyl": 8,
"disp": 360,
"hp": 175,
"drat": 3.15,
"wt": 3.44,
"qsec": 17.02,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 2,
"_row": "Hornet Sportabout"
},
{
"mpg": 18.1,
"cyl": 6,
"disp": 225,
"hp": 105,
"drat": 2.76,
"wt": 3.46,
"qsec": 20.22,
"vs": 1,
"am": 0,
"gear": 3,
"carb": 1,
"_row": "Valiant"
}
]
}
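Besides files, jsonlite can also convert between R objects and JSON strings, which is typically what you need when exchanging data with a web API. A minimal sketch:
json_str <- toJSON(list(x = c(2, 5, 6), y = "A string"), pretty = TRUE) # R object to JSON string
json_str
fromJSON(json_str) # JSON string back to an R object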
12.8 Recap
- To do data-driven analytics you first must import some data; that is, take data from a database, file, web API, etc. and transform it into a data frame/table.
- CSV files contain delimiter separated values in plain text and are often named using the file suffix .csv.
- Each line of a csv file is a data record. Each record consists of one or more fields, separated by a common delimiter (e.g. a comma, a semicolon or an underscore).
- The readxl package can be used to read Excel files.
- Writing to an Excel file can be done using the openxlsx package.
- You can import and export to Google Sheets using the googlesheets4 package from the tidyverse.
- Use sink to save the output of your R script.
- There are two main types of data files: information is either binary encoded or stored as text.
- Text files can be read by humans and computers alike. The great thing about plain text is its simplicity and ease of use: any programming language can read a plain text file.
- Text files are good for storing tabular data but lack type safety and have limited precision for numeric values.
- Binary file formats cannot be read by humans but allow space-efficient data compression. Moreover, they can be used to save non-tabular data.
- Like most other programming languages, R comes with its own binary format. We will focus on the Rds format, which is optimized for speed and compression ratio.
- Json is an open-standard text file and data-interchange format. It can be used to store non-tabular data in text format and is often used for data exchange in web APIs.
12.9 Exercises
Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Some solutions to each exercise can be seen by pressing the button at each question. Beware: you will not learn by giving up too early. Put some effort into finding a solution! Always practice using shortcuts in RStudio (see Tools > Keyboard Shortcuts Help).
Go to the Tools for Analytics workspace and download/export the TM12 project. Open it on your laptop and have a look at the files in the exercises folder, which can be used as a starting point.
12.9.1 Exercise (Statistikbanken)
Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).
You can use the API from Statistikbanken to download a lot of data sets. Let us consider airports in Denmark (data set with table id FLYV41):
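The data in csv format can be retrieved from a url of the following form (this is the url that is constructed later in this exercise; treat it here as a given starting point):
url <- "https://api.statbank.dk/v1/data/FLYV41/CSV?lang=en&LUFTHAVN=*&Tid=*"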
- Use cat(read_lines(url, n_max = 3), sep = "\n") to have a look at the delimiter used.
Import the csv file.
Try to retrieve information and get an overview over the data by running:
library(jsonlite)
url <- "https://api.statbank.dk/v1/tableinfo/FLYV41?lang=en"
lst <- read_json(url, simplifyVector = TRUE)
View(lst)
Note the data returned is in json format, so we use read_json to read the data into a list.
Create a function info(tab_id) that returns a list with components description, unit and variables from the information for a data set with table id tab_id.
Information about all the data sets can be retrieved using:
url <- "https://api.statbank.dk/v1/tables?lang=en"
lst <- jsonlite::read_json(url, simplifyVector = TRUE)
View(lst)
Have a look at the row for FLYV41.
Given the information about variables in a data set we can construct the url to retrieve the data in csv format:
tab_id <- "FLYV41"
url <- str_c("https://api.statbank.dk/v1/tableinfo/", tab_id, "?lang=en")
lst <- read_json(url, simplifyVector = TRUE)
col_id <- c(1, 3) # column ids in lst$variables$id
cols <- lst$variables$id[col_id]
url <- str_c("https://api.statbank.dk/v1/data/", tab_id, "/CSV?lang=en&", str_c(cols, collapse = "=*&"), "=*") |> URLencode()
url
#> [1] "https://api.statbank.dk/v1/data/FLYV41/CSV?lang=en&LUFTHAVN=*&Tid=*"
Create a function get_data(tab_id, col_id) that retrieves a data set.
- Use the function get_data to retrieve data for tab_id = "FOLK1A" and col_id = c(2, 3, 5), and save it as a csv file with a comma as delimiter.
- Save the data in an Excel file and a Google sheet.
12.9.2 Exercise (tuples in OPL)
In the algebraic modeling language OPL (Optimization Programming Language), used by IBM ILOG CPLEX Optimization Studio, you can define tuples to hold various pieces of information. For example, consider tuples defined as:
tuple nurse {
string name;
int experience; // highest is best
}
tuple shift {
string departmentName;
string day;
int startTime;
int endTime;
}
A nurse tuple is then defined as <"Anne", 11> and a shift tuple as <"Consultation", "Monday", 12, 18>.
A set of tuples can be defined using:
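For instance (a sketch using standard OPL syntax for declaring tuple sets that are initialized from a data file):
{nurse} nurses = ...;
{shift} shifts = ...;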
where the ... operator means that the sets are read from a data text file:
nurses = {
<"Anne", 11>,
<"Bethanie", 4>,
<"Betsy", 2>
};
shifts = {
<"Emergency", "Monday", 2, 8>,
<"Emergency", "Monday", 8, 12, 4, 7>,
<"Emergency", "Monday", 12, 18, 2, 5>
};
You can now use the sets to define decision variables \(x_{ns}\) equal to one if nurse \(n\) is assigned to shift \(s\).
In this exercise we will try to generate the data text file given tibbles with data.
Try to generate a text file named test.dat, using the function write_lines, with the content shown above.
Load the datasets:
# remotes::install_github("bss-osca/tfa-package", dependencies = FALSE) # if tfa not installed
library(tfa)
library(tidyverse)
nurses <- read_csv(system.file("extdata/nurses.csv", package = "tfa"))
shifts <- read_csv(system.file("extdata/shifts.csv", package = "tfa"))
- Transform all character columns in nurses so they start and end with a quotation mark ("). Some hints are given in Hint 1.
- Unite all columns into a new column named tuple, where each column is separated with a comma (,). Hint: have a look at the unite function. All columns can be selected using everything().
- Add < and > to the start and end of the tuple column.
- Extract the tuple column and transform it into a string with collapse = ",\n".
- Create a function write_tuple that takes nurses as input and writes the tuples to a file.
The name of an object can be extracted as a string using
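deparse(substitute(x)) # standard base R idiom (a sketch): inside a function f(x), this returns the name of the object passed as x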
- Modify write_tuple so it works if shifts are given as input instead of nurses.
- Modify write_tuple with a new input argument append, which is false by default. If true, then the file is not overwritten.
- Write nurses and shifts to a single data file.