Module 13 Transforming data

In this module, we consider transformation of data. In general, raw data may be messy and need to be structured in a tidy way. Tidying your data means storing it in a structured form suitable for analysis: a tibble (data frame) where each column is a variable and each row is an observation. Tidy data is important because the consistent structure lets you focus your effort on questions about the data.

Given a raw dataset, the first step is to clean it and transform it into a tidy format. Given tidy data, you next often need to transform it. Transformation includes narrowing in on observations of interest (e.g. only observations from a specific year or warehouse) and creating new variables based on existing variables (like the cost of using a machine on a given day, given its idle time). Together, tidying and transforming are called wrangling, because it can be a hard process to get your data into the right form.
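
As a small preview, a transformation could look like this (a sketch using a made-up tibble; the machine data and the cost rate are invented for illustration):

library(tidyverse)

# Toy data: daily idle time (hours) for two machines
machines <- tibble(
  machine = c("M1", "M1", "M2", "M2"),
  year    = c(2022, 2023, 2022, 2023),
  idle    = c(1.5, 2.0, 0.5, 1.0)
)

machines |>
  filter(year == 2023) |>      # narrow in on observations of interest
  mutate(cost = idle * 80)     # create a new variable (assumed cost of 80 per idle hour)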

In this module, you will learn how to work with tibbles using the dplyr package, which is part of the tidyverse.

A template project for this module is given on Posit Cloud (open it and use it while reading the notes).

Learning path diagram

It is recommended that you follow the green learning path; however, you may prefer a different learning style. The learning path diagram contains links to alternative online content (video or reading). Note that this is an alternative to the standard learning path that you may use instead (you should not do both). The learning path may also have extra content that is NOT part of the syllabus (only look at it if you want more information)!

13.1 Learning outcomes

By the end of this module, you are expected to be able to:

  • Describe what tidy data and wrangling are.
  • Apply the most common string functions.
  • Apply tidy operations to data.
  • Transform data.
  • Clean data.

The learning outcomes relate to the overall learning goals number 7, 11-14 and 18 of the course.

13.2 Working with data in the tidyverse

An excellent introduction to transforming data using the tidyverse is given in the interactive DataCamp course Data Manipulation with dplyr. Please complete the course before continuing.

13.3 Mutating joins

Mutating joins allow you to combine variables from multiple tables. There are four types of mutating join, which differ in their behavior when a match is not found. We’ll illustrate each with a simple example:

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")
df1
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <int>
#> 1     1     2
#> 2     2     1
df2
#> # A tibble: 2 × 3
#>       x     a b    
#>   <dbl> <dbl> <chr>
#> 1     3    10 a    
#> 2     1    10 a

Note that column x is present in both tables and used when joining them.
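
By default the join columns are guessed from the common column names, which is why the examples below print a message like Joining with `by = join_by(x)`. You can state the key explicitly and silence the message (a sketch assuming dplyr 1.1.0 or later, where join_by() is available):

df1 |> 
  inner_join(df2, by = join_by(x))
# or, equivalently, using the older string form:
df1 |> 
  inner_join(df2, by = "x")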

  • inner_join(df1, df2) only includes observations that match in both df1 and df2.

    df1 |> 
      inner_join(df2) 
    #> Joining with `by = join_by(x)`
    #> # A tibble: 1 × 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1     1     2    10 a
  • left_join(df1, df2) includes all observations in df1, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.

    df1 |> 
      left_join(df2)
    #> Joining with `by = join_by(x)`
    #> # A tibble: 2 × 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1     1     2    10 a    
    #> 2     2     1    NA <NA>
  • right_join(df1, df2) includes all observations in df2. It’s equivalent to left_join(df2, df1), but the columns and rows will be ordered differently.

    df1 |> 
      right_join(df2)
    #> Joining with `by = join_by(x)`
    #> # A tibble: 2 × 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1     1     2    10 a    
    #> 2     3    NA    10 a
    df2 |> 
      left_join(df1)
    #> Joining with `by = join_by(x)`
    #> # A tibble: 2 × 4
    #>       x     a b         y
    #>   <dbl> <dbl> <chr> <int>
    #> 1     3    10 a        NA
    #> 2     1    10 a         2
  • full_join() includes all observations from df1 and df2.

    df1 |> 
      full_join(df2)
    #> Joining with `by = join_by(x)`
    #> # A tibble: 3 × 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1     1     2    10 a    
    #> 2     2     1    NA <NA> 
    #> 3     3    NA    10 a

The left, right and full joins are collectively known as outer joins. When a row doesn’t match in an outer join, the new variables are filled in with missing values.

While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:

df1 <- tibble(x = c(1, 1, 2), y = 1:3)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 |> 
  left_join(df2, relationship = "many-to-many")
#> Joining with `by = join_by(x)`
#> # A tibble: 5 × 3
#>       x     y z    
#>   <dbl> <int> <chr>
#> 1     1     1 a    
#> 2     1     1 b    
#> 3     1     2 a    
#> 4     1     2 b    
#> 5     2     3 a

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(df1, df2) keeps all observations in df1 that have a match in df2.
  • anti_join(df1, df2) drops all observations in df1 that have a match in df2.

These are most useful for diagnosing join mismatches. If you’re worried about which observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate rows; they only remove observations.

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

# Four rows to start with:
df1 |> nrow()
#> [1] 4
# And we get four rows after the join
df1 |> inner_join(df2, by = "x", relationship = "many-to-many") |> nrow()
#> [1] 4
# But only two rows actually match
df1 |> semi_join(df2, by = "x") |> nrow()
#> [1] 2
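
To also see which rows in df1 have no match in df2, anti_join can be used (a small sketch based on the data above; the expected output is shown as comments):

# Rows in df1 without a match in df2
df1 |> anti_join(df2, by = "x")
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <int>
#> 1     3     3
#> 2     4     4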

13.4 Recap

  • We consider transformation of tidy data where data are stored using a tibble (data frame) where each column is a variable, and each row is an observation/case.

  • Tidy data is important because the consistent structure lets you focus your effort on questions about the data.

  • Transformation includes narrowing in on observations of interest (e.g. only observations from a specific year or warehouse), creating new variables based on existing variables (like the cost of using the machine that day given idle time).

  • Tidying and transforming are called wrangling, because it can be a hard process to get your data in the right form.

  • The package dplyr provides a consistent set of verbs that helps you solve the most common data manipulation challenges:

    • The filter function chooses rows (cases/observations) that meet specific criteria.
    • The select function chooses columns (variables) based on their names.
    • The arrange function reorders the rows.
    • The transmute function adds/modifies columns (variables) and drops all other columns.
    • The mutate function adds/modifies columns (variables).
    • The group_by function groups rows by one or more variables for groupwise operations.
    • The ungroup function removes the current grouping.
    • The count function counts rows based on a grouping.
    • The summarise function reduces multiple values down to a single summary.
    • The distinct function selects unique/distinct rows.
    • The pull function can be used to extract columns as vectors (it is similar to $).
  • Some nice-to-know functions to use inside e.g. summarise or mutate are

    • The n() function counts the number of rows in a group.
    • The n_distinct function counts the number of unique values in a group.
    • The first function returns the first value in a group (remember to order the rows as needed).
    • The slice_min and slice_max functions select rows with highest or lowest values of a variable.
    • The across function makes it easy to apply the same transformation to multiple columns.
  • Use print(n = Inf) in a pipe to print all rows.

  • Use the pipe operator |> to connect operations.

  • Use functions glimpse, tail, head, View to have a look at the data.

  • The skim function in the skimr package provides a broad overview of summary statistics.

  • Use as.character, as.numeric, etc. to convert data to a different type.

  • Use nrow and ncol functions to get the number of rows and columns of the data.

  • The ‘Data transformation with dplyr’ cheatsheet is very useful. Find the newest version in RStudio Help > Cheatsheets.
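
As a small illustration of how several of these verbs combine in a pipe (a sketch using a made-up tibble):

library(tidyverse)

sales <- tibble(
  warehouse = c("A", "A", "B", "B", "B"),
  year      = c(2022, 2023, 2022, 2023, 2023),
  revenue   = c(100, 120, 80, 90, 95)
)

sales |>
  filter(year == 2023) |>      # choose rows
  group_by(warehouse) |>       # group for groupwise operations
  summarise(
    n = n(),                   # number of rows in each group
    revenue = sum(revenue)     # total revenue in each group
  ) |>
  arrange(desc(revenue))       # reorder rows in descending order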

You may also have a look at the slides for this module.

13.5 Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Solutions to some of the questions can be seen by pressing the button at each question. Beware, you will not learn by giving up too early. Put some effort into finding a solution! Always practice using shortcuts in RStudio (see Tools > Keyboard Shortcuts Help).

Go to the Tools for Analytics workspace and download/export the TM13 project. Open it on your laptop and have a look at the files in the exercises folder which can be used as a starting point.

13.5.1 Exercise (gapminder)

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

The gapminder data set provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007 for 142 countries. The data can be loaded using the gapminder package:

library(gapminder)
data(gapminder, package = "gapminder")
gapminder
#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ℹ 1,694 more rows

Let us try to examine the dataset (use pipes |> as much as possible).

  1. Use glimpse, summary and tail to examine the data.

  2. Use count to count the number of

        a) countries,

        b) continents,

        c) countries per continent.

  3. Retrieve a vector with all distinct continent values.

  4. Subset rows to find:

        a) all rows with life expectancy less than 29 years,

        b) all rows for Rwanda after year 1979,

        c) all rows for Rwanda, Afghanistan or France.

  5. Select columns

        a) year and life expectancy,

        b) country and GDP per capita.

  6. Subset your data set to find all rows with GDP per capita greater than 40000 in Europe or with GDP per capita less than 500 in Africa.
  7. Use mutate to calculate each country’s GDP (population times GDP per capita).

In general GDP numbers are large and abstract. Let us try to calculate relative numbers.

  8. Use mutate to calculate GDP per capita relative to mean GDP per capita in Denmark over the whole period (gdpPercap divided by the mean of Danish gdpPercap). Have a look at the calculated data. Do the numbers seem reasonable? I perceive Denmark to be a “high GDP” country, so I predict that the distribution of gdpPercapRel is located below 1, possibly even well below.

  9. Use arrange to order

        a) data by year then country, as opposed to the current ordering by country then year,

        b) data from 2007, sorted on life expectancy,

        c) data from 2007, sorted on life expectancy in descending order. Hint: use desc() inside arrange.

  10. Use select to

        a) rename year to yr and keep all other columns (the select helper everything may be used),

        b) remove pop,

        c) reorder columns in order year, pop, … (remaining).

  11. Use group_by and summarize to find the

        a) number of observations per continent,

        b) number of countries per continent (use n_distinct inside summarize to count the number of distinct observations),

        c) average life expectancy by continent,

        d) minimum and maximum life expectancies seen by year in Asia.

  12. Sometimes you do not want to collapse the \(n\) rows for each group into one row. That is, you do not want to use summarize but instead mutate within your groups. Try to make a new variable that gives the years of life expectancy gained (lost) relative to 1952, for each individual country.
  13. Which country experienced the sharpest 5-year drop in life expectancy in each continent? Recall that the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.

13.5.2 Exercise (babynames)

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

The package babynames contains the dataset babynames provided by the U.S. Social Security Administration: for each year from 1880 to 2017, it gives the number of children of each sex given each name. All names with at least 5 uses are included (source: http://www.ssa.gov/oact/babynames/limits.html).

Install it using

install.packages("babynames")

We will use the skimr package to get an overview of babynames:

library(babynames)
library(skimr)
skim(babynames)
Table 13.1: Data summary
Name babynames
Number of rows 1924665
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1 1 1 0 2 0
name 0 1 2 15 0 97310 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1975 34 1880 1951 1985 2003 2017.00 ▁▂▃▅▇
n 0 1 181 1533 5 7 12 32 99686.00 ▇▁▁▁▁
prop 0 1 0 0 0 0 0 0 0.08 ▇▁▁▁▁

  1. Which of these is NOT a way to select the name and n columns together?

    select(babynames, -c(year, sex, prop))
    select(babynames, name:n)
    select(babynames, starts_with("n"))
    select(babynames, ends_with("n"))
  2. Use filter and the logical operators to find:

        a) all of the names where prop is greater than or equal to 0.08,

        b) all of the children named “Sea”.

  3. Use Boolean operators to return only the rows that contain:

        a) boys named Sue,

        b) names that were used by exactly 5 or 6 children in 1880,

        c) names that are one of Acura, Lexus, or Yugo.

  4. What is the smallest value of n? What is the largest?
  5. Write a sequence of functions that filters babynames to just the girls that were born in 2017, then select the name and n columns, then arrange the results so that the most popular names are near the top.
  6. Trim babynames to just the rows that contain your name and your sex.
  7. Extract the rows where name == "Khaleesi". Then use summarise() to find the total number of children named Khaleesi and the first year Khaleesi appeared in the data.
  8. Use group_by(), summarise(), and arrange() to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name.
  9. Use group_by() to calculate the total number of children born each year over time.
  10. Column prop denotes the proportion given year and sex. Use mutate() and min_rank() to rank each row in babynames from largest prop to lowest prop given year and sex. What happens if you do the same using the n column?
  11. Filter the results to find all names with rank == 1 after 2009.

13.5.3 Exercise (profit)

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

Consider the dataset profit containing quarterly financial records for each customer, product, etc.:

library(skimr)
path <- system.file("extdata/profit_raw.csv", package = "tfa")
profit <- read_csv(path)
skim(profit)
Table 13.2: Data summary
Name profit
Number of rows 24546
Number of columns 9
_______________________
Column type frequency:
character 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Quarter 0 1 1 2 0 12 0
Channel 0 1 3 3 0 10 0
Customer ID 0 1 3 3 0 18 0
Country 0 1 3 14 0 47 0
Product Line 0 1 14 25 0 6 0
Revenue 63 1 1 7 0 1210 0
Product Cost 61 1 3 6 0 1139 0
Customer Service Cost 10 1 1 6 0 464 0
Profit 0 1 3 7 0 966 0

Note that the dataset seems to need cleaning: all columns are strings (some should be numbers) and there seem to be missing values. You may start by having a look at the dataset using:

View(profit)

First focus on column Quarter which currently has 12 distinct values:

profit |> distinct(Quarter)
#> # A tibble: 12 × 1
#>    Quarter
#>    <chr>  
#>  1 Q3     
#>  2 1      
#>  3 Q4     
#>  4 Q1     
#>  5 Q2     
#>  6 2      
#>  7 4      
#>  8 q1     
#>  9 q4     
#> 10 q3     
#> 11 q2     
#> 12 3

You would like it to be a numeric with values 1-4.

  1. Use mutate, str_remove and as.numeric to convert the column to a numeric by removing all ‘q’ and ‘Q’ values.

Let us look at the next columns:

profit |> distinct(Channel) |> pull()
#>  [1] "ATM" "BRH" "INT" "MAL" "EML" "CCT" "TEL" "MOP" "DSA" "EVE"
profit |> distinct(`Customer ID`) |> pull()
#>  [1] "FRT" "MRT" "PBI" "MAM" "EBP" "RPB" "WEB" "WEM" "HEC" "STF" "IAS" "CRE" "INB" "CAM" "AGR" "SBE"
#> [17] "AFF" "MFN"
profit |> distinct(Country) |> pull()
#>  [1] "USA"            "Canada"         "Great Britain"  "Finland"        "New Zealand"   
#>  [6] "Brazil"         "Mexico"         "Germany"        "Puerto Rico"    "Hong Kong"     
#> [11] "Japan"          "Columbia"       "Switzerland"    "Uruguay"        "Netherlands"   
#> [16] "Korea"          "Venezuela"      "Panama"         "Sweden"         "China"         
#> [21] "Guatemala"      "South Africa"   "Malaysia"       "Nigeria"        "Denmark"       
#> [26] "France"         "India"          "Taiwan"         "Norway"         "Chile"         
#> [31] "Indonesia"      "Ireland"        "Thailand"       "Peru"           "Spain"         
#> [36] "Belgium"        "Poland"         "Ecuador"        "Costa Rica"     "Australia"     
#> [41] "Israel"         "Guam"           "Oman"           "Singapore"      "Argentina"     
#> [46] "Czechoslovakia" "Philippines"
profit |> distinct(`Product Line`) |> pull()
#> [1] "Credit Products"           "Deposit Products"          "Revolving Credit Products"
#> [4] "Other Products"            "Third Party Products"      "Fee Based Products"

These seem to be okay. The last columns should be numbers. Let us consider Revenue.

profit |> distinct(Revenue) |> pull() |> head(n = 100)
#>   [1] "$ 6044"  "$ 4686"  "$ 6063"  "$ 4682"  "$ 6320"  "$ 2993"  "$ 3355"  "$ 5716"  "$ 3347" 
#>  [10] "$ 2624"  "$ 3629"  "$ 5612"  "$ 4618"  "$ 2080"  "$ 2788"  "$ 2829"  "$ 2898"  "$ 5232" 
#>  [19] "$ 2949"  "$ 5565"  "$ 2153"  "$ 3097"  "$ 1920"  "$ 4041"  "$ 5931"  "$ 1605"  "$ 2026" 
#>  [28] "$ 1687"  "$ 5075"  "$ 4223"  "$ 2456"  "$ 1924"  "$ 1578"  "$ 3235"  "$ 5123"  "$ 1560" 
#>  [37] "$ 1945"  "$ 6060"  "$ 1222"  "$ 1660"  "$ 3000"  "$ 2970"  "$ 1631"  "$ 1215"  "$ 1759" 
#>  [46] "$ 3285"  "$ 2048"  "$ 2173"  "$ 3353"  "$ 1162"  "$ 1232"  "$ 1561"  "$ 1123"  "$ 1794" 
#>  [55] "$ 1202"  "$ 1510"  "$ 4472"  "$ 2370"  "$ 2581"  "$ 2761"  "$ 6371"  "$ 1972"  "$ 1562" 
#>  [64] "$ 2742"  "$ 4598"  "$ 5322"  "$ 3411"  NA        "$ 1569"  "$ 2852"  "$ 1622"  "$ 2505" 
#>  [73] "$ 1596"  "$ 1447"  "$ 1690"  "$ 2448"  "$ 1593"  "$ 1876"  "$ 6591"  "$ 1611"  "$ 1254" 
#>  [82] "Unknown" "$ 842"   "$ 1529"  "$ 1439"  "$ 762"   "$ 1959"  "$ 4382"  "$ 1407"  "$ 909"  
#>  [91] "$ 1549"  "$ 2161"  "$ 1331"  "$ 727"   "$ 1462"  "$ 1067"  "$ 833"   "$ 1675"  "$ 1524" 
#> [100] "$ 1285"

Most values start with a dollar sign. Let us have a look at the other ones:

profit |> filter(!str_starts(Revenue, fixed("$"))) 
#> # A tibble: 95 × 9
#>    Quarter Channel `Customer ID` Country       `Product Line`            Revenue `Product Cost`
#>      <dbl> <chr>   <chr>         <chr>         <chr>                     <chr>   <chr>         
#>  1       3 INT     MAM           USA           Deposit Products          Unknown $ 1008        
#>  2       3 MAL     RPB           USA           Credit Products           ?       $ 1420        
#>  3       1 MAL     WEM           Great Britain Other Products            ?       $ 87          
#>  4       3 ATM     MFN           Germany       Fee Based Products        unknown $ 47          
#>  5       3 ATM     PBI           Costa Rica    Third Party Products      Unknown $ 51          
#>  6       1 ATM     PBI           Chile         Deposit Products          Unknown $ 58          
#>  7       4 CCT     MRT           Great Britain Revolving Credit Products ?       $ 27          
#>  8       4 ATM     MAM           Taiwan        Third Party Products      unknown $ 55          
#>  9       4 MAL     WEB           Japan         Other Products            unknown $ 40          
#> 10       2 CCT     MAM           Netherlands   Credit Products           unknown $ 14          
#> # ℹ 85 more rows
#> # ℹ 2 more variables: `Customer Service Cost` <chr>, Profit <chr>
na_values <- profit |> 
  filter(!str_starts(Revenue, fixed("$"))) |> 
  distinct(Revenue) |> 
  pull(Revenue)
na_values
#> [1] "Unknown" "?"       "unknown"

The expression is a bit complex, so let us break it up. The function fixed matches the string ‘$’ literally; this is necessary since the dollar sign has a special meaning in regular expressions (beyond the scope here). The function str_starts checks if the string starts with a dollar sign, and the logical negation (NOT) gives the complementary set.
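
A small illustration (a sketch; stringr is loaded as part of the tidyverse, and the expected output is shown as comments):

str_starts(c("$ 6044", "Unknown", "?"), fixed("$"))
#> [1]  TRUE FALSE FALSE
# Without fixed(), the pattern would be interpreted as a regular expression.
# The logical negation gives the complementary set:
!str_starts(c("$ 6044", "Unknown", "?"), fixed("$"))
#> [1] FALSE  TRUE  TRUE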

Note that different strings have been used to indicate NA values (Unknown, ?, unknown). Let us first use a single value (a question mark) to indicate NA:

profit <- profit |> 
  mutate(Revenue = str_replace_all(Revenue, c("unknown" = "?", "Unknown" = "?")))

Next, we replace all ? with NA:

profit <- profit |> 
  mutate(Revenue = na_if(Revenue, "?"))
profit |> # check
  filter(!str_starts(Revenue, fixed("$"))) 
#> # A tibble: 0 × 9
#> # ℹ 9 variables: Quarter <dbl>, Channel <chr>, Customer ID <chr>, Country <chr>,
#> #   Product Line <chr>, Revenue <chr>, Product Cost <chr>, Customer Service Cost <chr>,
#> #   Profit <chr>

Finally, we remove all dollar signs:

profit <- profit |> 
  mutate(Revenue = str_remove(Revenue, fixed("$ ")) |> as.numeric())
profit
#> # A tibble: 24,546 × 9
#>    Quarter Channel `Customer ID` Country `Product Line`            Revenue `Product Cost`
#>      <dbl> <chr>   <chr>         <chr>   <chr>                       <dbl> <chr>         
#>  1       3 ATM     FRT           USA     Credit Products              6044 $ 3998        
#>  2       1 ATM     MRT           USA     Credit Products              4686 $ 3229        
#>  3       4 ATM     PBI           USA     Deposit Products             6063 $ 7440        
#>  4       1 ATM     PBI           USA     Deposit Products             4682 $ 6127        
#>  5       4 ATM     MRT           USA     Deposit Products             6320 $ 7913        
#>  6       3 BRH     MAM           USA     Deposit Products             2993 $ 1034        
#>  7       4 BRH     PBI           USA     Revolving Credit Products    3355 $ 4355        
#>  8       3 ATM     FRT           USA     Revolving Credit Products    5716 $ 5617        
#>  9       4 BRH     PBI           USA     Deposit Products             3347 $ 4229        
#> 10       1 BRH     PBI           USA     Credit Products              2624 $ 1960        
#> # ℹ 24,536 more rows
#> # ℹ 2 more variables: `Customer Service Cost` <chr>, Profit <chr>

As one pipe:

profit <- profit |> 
  mutate(Revenue = str_replace_all(Revenue, c("unknown" = "?", "Unknown" = "?"))) |> 
  mutate(Revenue = na_if(Revenue, "?")) |> 
  mutate(Revenue = str_remove(Revenue, fixed("$ ")) |> as.numeric())

  2. Convert the remaining columns to numeric like shown for Revenue above.
  3. Use the across function to apply the operations in Question 2 to a set of columns. Hint: see the examples on the help page of across.
  4. Write one pipe that does all the cleaning.
  5. Validate that revenue - product costs - customer service cost equals profit. If you see small rounding errors (less than or equal to one) then recalculate the profit.
  6. Recalculate values in columns Revenue to Profit if possible.
  7. Find the two rows with the highest profit in each quarter.
  8. Find the two customers with the highest profit in each quarter. Is the result the same as in Question 7?
  9. Find the product line, customer, channel, country and quarter with the highest profit.
  10. Are there rows with the same customer in different countries?
  11. Sort the data in descending order with respect to profit and then revenue.
  12. Which product line has the highest and the lowest total cost?
  13. Assume that the customer service cost increases by 5%. How will that affect the profit for each product line?

rm(profit)

13.5.4 Exercise (fisheries)

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

The dataset called fisheries contains world fisheries harvest for 2005. The tonnage from capture and aquaculture is listed by country.

You need the tidyverse package as usual:

library(tidyverse)
# install tfa package using remotes::install_github("bss-osca/tfa-package", upgrade = FALSE)     

We load the needed datasets:

fisheries <- read_csv(system.file("extdata/fisheries.csv", package = "tfa"))
fisheries
#> # A tibble: 216 × 4
#>    country             capture aquaculture  total
#>    <chr>                 <dbl>       <dbl>  <dbl>
#>  1 Afghanistan            1000        1200   2200
#>  2 Albania                7886         950   8836
#>  3 Algeria               95000        1361  96361
#>  4 American Samoa         3047          20   3067
#>  5 Andorra                   0           0      0
#>  6 Angola               486490         655 487145
#>  7 Antigua and Barbuda    3000          10   3010
#>  8 Argentina            755226        3673 758899
#>  9 Armenia                3758       16381  20139
#> 10 Aruba                   142           0    142
#> # ℹ 206 more rows
continents <- read_csv(system.file("extdata/continents.csv", package = "tfa"))
continents
#> # A tibble: 247 × 2
#>    country           continent
#>    <chr>             <chr>    
#>  1 Afghanistan       Asia     
#>  2 Åland Islands     Europe   
#>  3 Albania           Europe   
#>  4 Algeria           Africa   
#>  5 American Samoa    Oceania  
#>  6 Andorra           Europe   
#>  7 Angola            Africa   
#>  8 Anguilla          Americas 
#>  9 Antigua & Barbuda Americas 
#> 10 Argentina         Americas 
#> # ℹ 237 more rows

Some mean statistics:

fisheries |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 3
#>   capture aquaculture   total
#>     <dbl>       <dbl>   <dbl>
#> 1 421916.     508368. 930284.

  1. Use a mutating join to add a continent column to the fisheries dataset. Are there some countries which do not belong to a continent?
  2. Filter out countries whose total harvest was less than 100,000 tons.
  3. If there are still countries that do not belong to a continent, then assign them to the closest continent.
  4. Add the column aquaculture_perc = aquaculture / total and explain the variable.
  5. Calculate the mean aquaculture percentage (we’ll call it mean_ap for short) for continents in the fisheries data.
  6. Now expand your calculations to also calculate the minimum and maximum aquaculture percentage for continents in the fisheries data and store the summary table in a data frame called fisheries_summary_continent.
  7. Take the fisheries_summary_continent data frame and order the results in descending order of mean aquaculture percentage.
  8. If you have already read the module about visualizations, then try to make some relevant plots.

13.5.5 Exercise (company ranking)

This exercise is a slightly modified version of an exam assignment (exam 2021-A2).

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

The dataset companies, in the tfa package, lists approx. 1000 of the world’s biggest companies, measured by sales, profits, assets and market value. The columns/variables are:

  • name: the name of the company.
  • country: the country the company is situated in.
  • category: the products the company produces.
  • sales: the amount of sales of the company in billion USD.
  • profits: the profit of the company in billion USD.
  • assets: the assets of the company in billion USD.
  • marketvalue: the market value of the company in billion USD.

You can load the dataset using:

# remotes::install_github("bss-osca/tfa-package", build = FALSE)  # run if tfa not installed
library(tidyverse)
companies <- read_csv(system.file("extdata/companies.csv", package = "tfa"))

Use the dplyr package in tidyverse to calculate relevant summary tables (data frames) and answer the following questions.

  1. How many rows and columns does the dataset have?
  2. How many different companies are we considering, how many different product categories and how many different countries? Hint: the skimr package might be useful.
  3. What are the 3 biggest companies with respect to market value?
  4. For each country find the company with the highest profit. Which company has the highest profit in Denmark?
  5. Which 4 product categories have the highest total market value?
  6. Create a new data frame only containing rows from Denmark and with columns name, category and a column value which equals the sum of columns profits, assets and marketvalue. Which company has the lowest value?

13.5.6 Exercise (Titanic)

This exercise is a slightly modified version of an exam assignment (reexam 2021-A2).

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

The dataset titanic, given in the appendix, lists approx. 1300 passengers on the Titanic. The columns/variables are:

  • pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd).
  • survived: Survival (0 = No; 1 = Yes).
  • name: Name.
  • sex: Sex.
  • age: Age.
  • fare: Passenger Fare.
  • cabin: Cabin number.
  • embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
  • boat: Lifeboat number.

You can read the dataset file titanic.csv into the dataset dat using

# remotes::install_github("bss-osca/tfa-package", build = FALSE)  # run if tfa not installed
library(tidyverse)
dat <- read_csv(system.file("extdata/titanic.csv", package = "tfa"))

Use the dplyr package in tidyverse to calculate relevant summary tables (data frames) and answer the following questions.

  1. Create a new column named male which is true if the person is a male.
  2. How many persons are we considering, how many are men (in percent), and how many survived?
  3. What percentage of the females survived (and what percentage of the males)?
  4. Define children as people with age below 19. How many children survived?
  5. Did relatively more people survive in first class compared to third class?
  6. What percentage of the persons who entered a lifeboat died?
  7. How many persons with Hansen in their name survived?

13.5.7 Exercise (covid)

This exercise is a slightly modified version of an exam assignment (reexam 2022-A2).

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

Consider COVID-19 data obtained from Our World in Data in the file covid.csv. The dataset contains data from different countries. Some of the columns/variables are:

  • cases: New confirmed cases of COVID-19.
  • deaths: New deaths attributed to COVID-19.
  • icu_patients: Number of COVID-19 patients in intensive care units (ICUs) on a given day.
  • hosp_patients: Number of COVID-19 patients in hospital on a given day.
  • tests: Total tests for COVID-19.
  • positive_rate: The share of COVID-19 tests that are positive, given as a rolling 7-day average.
  • vac: Total number of people who received at least one vaccine dose.
  • fully_vac: Total number of people who received all doses prescribed by the vaccination protocol.
  • population: Country population.

Other columns are date, country, month and year. You can read the dataset file using

# remotes::install_github("bss-osca/tfa-package", build = FALSE)  # run if tfa not installed
library(tidyverse)
dat <- read_csv(system.file("extdata/covid.csv", package = "tfa"))

Use the dplyr package in tidyverse to calculate relevant summary tables (tibbles/data frames) and answer the following questions.

  1. Which countries are considered and what is the timespan of the data?
  2. What is the number of new confirmed cases on November 22nd, 2021 in Denmark?
  3. Calculate the total number of confirmed cases and deaths. Hint: you may use the cumsum function to add up all cases until a given date. You may here consider NA values in the cases and deaths columns as equal to zero (e.g. using replace_na(cases, 0)). What is the total number of deaths in Norway up to October 10th, 2021?
  4. For each country, calculate the number of tests done in each month of a given year. Which country had the highest number of tests per capita in March 2021?
  5. Consider the United Kingdom. Which month had the highest number of ICU patients on a given day?

13.5.8 Exercise (election)

This exercise is a slightly modified version of an exam assignment (exam 2022-A2).

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

The dataset in the file elections.csv lists election votes for municipal elections in Denmark. The columns/variables are:

  • area: municipality,
  • party: political party,
  • year: election year,
  • validVotes: Number of (valid) votes,
  • personalVotes: total number of personal votes,
  • listedMen: men listed,
  • listedWomen: women listed,
  • electedMen: elected men,
  • electedWomen: elected women.

You can read the dataset file into the dataset dat using

# remotes::install_github("bss-osca/tfa-package", build = FALSE)  # run if tfa not installed
library(tidyverse)
dat <- read_csv(system.file("extdata/elections.csv", package = "tfa"))

Use the dplyr package in tidyverse to calculate relevant summary tables (data frames) and answer the following questions.

  1. How many different municipalities were there in Denmark in 2017?
  2. In which election year was the total number of votes highest?
  3. Consider the 2017 election. How many votes and elected candidates did the different parties get in total (sort your result descending by votes)?
  4. Calculate the percentage of females listed for each year. Are there on average more women than men listed? Is there any tendency over the years?
  5. Consider the 2017 election. In which municipality did the Liberal Democratic Party get the highest percentage of votes?

13.5.9 Exercise (orders)

This exercise is a slightly modified version of an exam assignment (reexam 2023-A1).

Use the exercise R markdown template to solve this exercise (File > New File > R Markdown…, select From template and then TFA Exercise).

Consider the dataset in the file orders.csv with purchase orders for a group of ships.

The dataset contains a row for each item used in an order. The columns/variables are:

  • ship: The ship considered.
  • order_id: Order id. An order is a group of items purchased in one batch from a single supplier.
  • item_id: Item id.
  • item_desc: Item description.
  • quantity: Number of items ordered.
  • price: Price per unit.
  • order_date: Order date.
  • delivery_date: Expected delivery date when order is made.
  • delivery_place: Delivery place.
  • recieved_date: Actual date the order is received.
  • supplier: Supplier for the order.
  • delivery_cost: Delivery cost.
  • order_year: Year the order was placed.

You can read the dataset file into the dataset dat using

# remotes::install_github("bss-osca/tfa-package", build = FALSE)  # run if tfa not installed
library(tidyverse)
dat <- read_csv(system.file("extdata/orders.csv", package = "tfa"))

Use the dplyr package in tidyverse to calculate relevant summary tables (tibbles/data frames) and answer/complete the following questions/tasks:

  1. How many ships, suppliers, and different items are considered?
  2. Do all orders use a single supplier and delivery place?
  3. For each order, calculate/find the number of different items, number of items, total price, total delivery cost, and order year (missing values are assumed to be zero). Which order has the highest delivery cost?
  4. For each order, find the supplier and delivery place. Hint: The first function may be used to select the first item within a group. Which supplier and delivery place are used most?

  5. Add a column to the dataset equal to the value of the items in a row, calculated as the price times the quantity. Next, create a summary table named res1 that for each item id calculates the aggregated value and arranges them in descending order.

Which item is the most costly one with respect to the calculated value in res1?

Given the dataset res1, add two new columns:

  • The relative value for each item as the ratio between the item’s value and the total value of all items.
  • The cumulative relative value. Hint: You can use the cumsum function here.

Consider the most costly item with respect to the calculated value in res1. On which ships is the item used and how many times?

13.5.10 Exercise (jobs)

This exercise is a slightly modified version of an exam assignment (exam 2023-A1).

Consider the dataset in the file jobs.csv with engine maintenance jobs for a group of ships.

The dataset contains a row for each item used. The columns/variables are:

  • ship: The ship considered.
  • job_id: Maintenance job id. A job is a collection of items replaced.
  • job_desc: Job description.
  • item_id: Item id.
  • item_name: Item name.
  • item_quantity: Number of items used.
  • item_manufaturer: Item manufacturer.
  • component_id: Engine component id.
  • component_desc: Engine component description.
  • done_date: Date the job finished.
  • year: Year of done date.
  • days: Days since the item was last used for maintenance on the ship.

You can access the dataset file at location

# remotes::install_github("bss-osca/tfa-package", build = FALSE)  # run if tfa not installed
path <- system.file("extdata/jobs.csv", package = "tfa")

Use the dplyr package in tidyverse to calculate relevant summary tables (tibbles/data frames) and answer/complete the following questions/tasks.

  1. Find the delimiter used in the csv file and load the file into a tibble/data frame called dat.
  2. Provide a short overview of the data. How many rows are there in the data, what is the number of different ships, and what is the range of dates?
  3. What is the total number of different jobs, the number of different jobs per ship, and the average number of jobs per ship? Hint: The function n_distinct may be used to find distinct values within a group.
  4. What is the minimum, average, and maximum number of different items considered in each job? What is the minimum, average, and maximum number of items used for maintenance in each job?
  5. It seems that some items considered in a job are not used anyway. Which ship has the most jobs where an item should have been used but was not (the quantity is zero)?
  6. Which item is the most used one for maintenance?