Module 10 Functions

To understand computations in R, two slogans are helpful:

Everything that exists is an object.

Everything that happens is a function call.

John Chambers

Writing functions is a core activity of an R programmer. It represents the key step of the transition from a user to a programmer. Functions have inputs and outputs. Functions (and control structures) are what make your code more dynamic.

Functions are often used to encapsulate a sequence of expressions that needs to be executed numerous times, perhaps under slightly different conditions. Functional programming is a programming paradigm, that is, a style of writing code. Rather than repeating the code, functions and control structures allow one to build code in blocks. As a result, your code becomes more structured, more readable and much easier to maintain and debug (find errors).

A template project for this module is given on Posit Cloud (open it and use it while reading the notes).

Learning path diagram

It is recommended that you follow the green learning path; however, you may prefer a different learning style. The learning path diagram contains links to alternative online content (video or reading). Note that this is an alternative to the standard learning path that you may use instead (you should not do both). The learning path may also have extra content that is NOT part of the syllabus (only look at it if you want more info)!

10.1 Learning outcomes

By the end of this module, you are expected to be able to:

  • Call a function.
  • Formulate a function with different input arguments.
  • Describe why functions are important in R.
  • Set defaults for input arguments.
  • Return values from functions.
  • Explain how variable scope and precedence work.
  • Document functions.

The learning outcomes relate to the overall learning goals number 2, 3, 4 and 10 of the course.

10.2 DataCamp course

An excellent introduction to functions is given in Chapter 3 in the DataCamp course Intermediate R. Please complete the chapter before continuing.

10.3 Functions returning multiple objects

Functions in R only return a single object. However, note that the object may be a list. That is, if you want to return multiple values, store them in a list. A simple example:

test <- function() {
  # the function does some work and calculates some results
  res1 <- 45
  res2 <- "Success"
  res3 <- c(4, 7, 9)
  res4 <- list(cost = 23, profit = 200)
  lst <- list(days = res1, run = res2, id = res3, money = res4)
  return(lst)
}
test()
#> $days
#> [1] 45
#> 
#> $run
#> [1] "Success"
#> 
#> $id
#> [1] 4 7 9
#> 
#> $money
#> $money$cost
#> [1] 23
#> 
#> $money$profit
#> [1] 200
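
The elements of a returned list can afterwards be picked out by name. A minimal sketch, assuming a hypothetical function run_summary (not part of the notes) that returns a list similar to the one above:

```r
# Hypothetical function returning several results in one list
run_summary <- function() {
  list(days = 45, run = "Success", money = list(cost = 23, profit = 200))
}

res <- run_summary()   # store the returned list
res$days               # extract a single element by name
#> [1] 45
res[["run"]]           # same idea using double brackets
#> [1] "Success"
res$money$profit       # element of a nested list
#> [1] 200
```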

10.4 The ... argument

The special argument ... indicates a variable number of arguments and is usually used to pass arguments on to functions called inside the function. Consider the following example:

library(stringr)  # provides str_c (also loaded with the tidyverse)
my_name <- function(first = "Lars", last = "Nielsen") {
  str_c(first, last, sep = " ")
}
my_name()
#> [1] "Lars Nielsen"

cite_text <- function(text, ...) {
  str_c(text, ', -', my_name(...))
}
cite_text("Learning by doing is the best way to learn how to program!")
#> [1] "Learning by doing is the best way to learn how to program!, -Lars Nielsen"
cite_text("Learning by doing is the best way to learn how to program!", last = "Relund")
#> [1] "Learning by doing is the best way to learn how to program!, -Lars Relund"
cite_text("To be or not to be", first = "Shakespeare", last = "")
#> [1] "To be or not to be, -Shakespeare "

Note that in the first function call, we use the defaults in my_name. In the second call, we override the default last name, and in the last call, we override both arguments.
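
The same pattern works when the inner function is a base R function. As a sketch only, the hypothetical function mean_rounded below forwards ... to mean, so a caller can set options such as na.rm without mean_rounded having to list them:

```r
# Hypothetical helper: round the mean, forwarding extra arguments to mean()
mean_rounded <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits)
}

v <- c(2, NA, 5)
mean_rounded(v)                 # mean() keeps its default na.rm = FALSE
#> [1] NA
mean_rounded(v, na.rm = TRUE)   # na.rm is passed on to mean() through ...
#> [1] 3.5
```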

If you need to retrieve/capture the content of the ... argument, put it in a list:

test <- function(...) {
  return(list(...))
}
test(x = 4, y = "hey", z = 1:5)
#> $x
#> [1] 4
#> 
#> $y
#> [1] "hey"
#> 
#> $z
#> [1] 1 2 3 4 5
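
Once captured, the list can be inspected like any other list. A small sketch, where count_args is a hypothetical helper introduced only for illustration:

```r
# Hypothetical helper that reports what was passed through ...
count_args <- function(...) {
  args <- list(...)                       # capture the arguments in a list
  cat("Number of arguments:", length(args), "\n")
  cat("Argument names:", names(args), "\n")
  invisible(args)                         # return the list without printing it
}

res <- count_args(x = 4, y = "hey", z = 1:5)
#> Number of arguments: 3
#> Argument names: x y z
res$z
#> [1] 1 2 3 4 5
```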

10.5 Documenting your functions

It is always a good idea to document your functions. In fact, this is always done for functions in a package. For instance, try ?mutate and see the documentation in the Help tab.

Assume that you have written a function

subtract <- function(x, y) {
  return(x-y)
}

In RStudio you can insert a Roxygen documentation skeleton by placing the cursor on the first line of the function and choosing Code > Insert Roxygen Skeleton (Ctrl+Alt+Shift+R):

#' Title
#'
#' @param x 
#' @param y 
#' @return
#' @export
#' @examples
subtract <- function(x, y) {
  return(x-y)
}

You can now modify the documentation to:

#' Subtract two vectors
#'
#' @param x First vector.
#' @param y Vector to be subtracted.
#' @return The difference.
#' @export
#' @examples
#' subtract(x = c(5,5), y = c(2,3))
subtract <- function(x, y) {
  return(x-y)
}

Note

  • Parameters/function arguments are documented using the @param tag.
  • Return value is documented using the @return tag.
  • Under the @examples tag you can insert some examples.
  • Ignore the @export tag. This is used if you include your function in your own package. Package development is beyond the scope of this course. If you are interested, have a look at the book by Wickham (2015).

A list of further tags can be seen in the vignette Rd (documentation) tags.
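
As a further sketch of the same tags, here is how a function with a default argument might be documented. The function subtract_const is hypothetical and only meant as an illustration; the @export tag is left out since the function is not part of a package.

```r
#' Subtract a constant from a vector
#'
#' @param x Numeric vector.
#' @param value Number to subtract (default is 1).
#' @return A numeric vector of the same length as x.
#' @examples
#' subtract_const(c(5, 5))
#' subtract_const(c(5, 5), value = 2)
subtract_const <- function(x, value = 1) {
  return(x - value)
}

subtract_const(c(5, 5))              # uses the default value = 1
#> [1] 4 4
subtract_const(c(5, 5), value = 2)   # overrides the default
#> [1] 3 3
```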

10.6 Example - Job sequencing

Recall the job sequencing problem in Section 5.8, which considers the problem of determining the best sequencing of jobs on a machine. A set of startup costs is given for 5 jobs:

startup_costs <- c(27, 28, 32, 35, 26)
startup_costs
#> [1] 27 28 32 35 26

Moreover, when changing from one job to another job, the setup costs are given as:

setup_costs <- matrix(c(
  NA, 35, 22, 44, 12,
  49, NA, 46, 38, 17,
  46, 12, NA, 29, 41,
  23, 37, 31, NA, 26,
  17, 23, 28, 34, NA), 
  byrow = TRUE, nrow = 5)
setup_costs
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA   35   22   44   12
#> [2,]   49   NA   46   38   17
#> [3,]   46   12   NA   29   41
#> [4,]   23   37   31   NA   26
#> [5,]   17   23   28   34   NA

For instance, the setup cost from Job 2 to Job 4 is 38.
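
Rows correspond to the current job and columns to the next job, so a setup cost can be looked up by indexing the matrix. A small sketch (the matrix is repeated so the snippet runs on its own):

```r
setup_costs <- matrix(c(
  NA, 35, 22, 44, 12,
  49, NA, 46, 38, 17,
  46, 12, NA, 29, 41,
  23, 37, 31, NA, 26,
  17, 23, 28, 34, NA),
  byrow = TRUE, nrow = 5)

setup_costs[2, 4]   # setup cost from Job 2 to Job 4
#> [1] 38
setup_costs[2, ]    # setup costs from Job 2 to every other job
#> [1] 49 NA 46 38 17
```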

The goal of the problem is to determine a sequence of jobs which minimizes the total setup cost including the startup cost.

One possible way to find a sequence is to use a greedy strategy:

Greedy Algorithm
Step 0: Start with the job which has minimal startup cost.
Step 1: Select the next job as the job not already done 
        with minimal setup cost given current job. 
Step 2: Set next job in Step 1 to current job and 
        go to Step 1 if not all jobs are done.

In R the greedy algorithm can be implemented as:

#' Calculate a job sequence based on a greedy algorithm
#' 
#' @param startup Startup costs.
#' @param setup Setup costs.
#' @return A list with the job sequence and total setup costs.
greedy <- function(startup, setup) {
  jobs <- nrow(setup)
  cur_job <- which.min(startup)
  cost <- startup[cur_job]
  # cat("Start job:", cur_job, "\n")
  job_seq <- cur_job
  setup[, cur_job] <- NA
  for (i in 1:(jobs-1)) {
    next_job <- which.min(setup[cur_job, ])
    # cat("Next job:", next_job, "\n") 
    cost <- cost + setup[cur_job, next_job]
    job_seq <- c(job_seq, next_job)
    cur_job <- next_job
    setup[, cur_job] <- NA
  }
  # print(setup)
  return(list(seq = job_seq, cost = cost))
}
greedy(startup_costs, setup_costs)
#> $seq
#> [1] 5 1 3 2 4
#> 
#> $cost
#> [1] 115

First, the job with minimum startup cost is found using the function which.min, and cost is initialized to its startup cost. The commented-out cat statements can be enabled for debugging, and job_seq is initialized with the first job. Next, we need a way of ignoring jobs already done. Here we set the column of the setup cost matrix to NA for each job already done, so it will not be selected by which.min. The for loop runs 4 times (once for each remaining job), selecting the next job and accumulating the total cost. Finally, the job sequence and the total cost are returned as a list.
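
The trick of masking finished jobs with NA relies on which.min skipping missing values. A minimal sketch using the first row of the setup costs:

```r
costs <- c(NA, 35, 22, 44, 12)   # setup costs from Job 1 to the other jobs
which.min(costs)                 # NA entries are ignored
#> [1] 5
costs[5] <- NA                   # mark Job 5 as done
which.min(costs)                 # next cheapest job among those left
#> [1] 3
```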

A well-known better strategy is to:

Better Algorithm
Step 0: Subtract minimum of startup and setup cost for each job from setup and 
        startup costs (that is columnwise)
Step 1: Call the greedy algorithm with the modified costs. Note that the total 
        cost returned has to be modified a bit.

The better strategy implemented in R:

#' Calculate a job sequence based on a better (greedy) algorithm
#' 
#' @param startup Startup costs.
#' @param setup Setup costs.
#' @return A list with the job sequence and total setup costs.
better <- function(startup, setup) {
  jobs <- nrow(setup)
  min_col_val <- apply(rbind(startup, setup), 2, min, na.rm = TRUE)
  startup <- startup - min_col_val
  min_mat <- matrix(rep(min_col_val, jobs), ncol = jobs, byrow = TRUE)
  setup <- setup - min_mat
  lst <- greedy(startup, setup)
  lst$cost <- lst$cost + sum(min_col_val)
  return(lst)
}
better(startup_costs, setup_costs)
#> $seq
#> [1] 4 1 3 2 5
#> 
#> $cost
#> [1] 109

First, the number of jobs is identified. Next, we need to find the minimum value in each column. Here we use the apply function. The first argument is the setup matrix with the startup costs added as a row. The second argument is 2, indicating that the third argument should be applied to each column (if it were 1, it would be applied to each row). The third argument is the function to apply to each column (here min). The last argument is an optional argument passed on to the min function. With the current values, min_col_val equals 17, 12, 22, 29, and 12. Afterwards, the minimum values are subtracted from each column. Note that for subtracting the minimum values from the setup costs, we first create a matrix with the minimum values (min_mat). Finally, we call the greedy algorithm with the new costs and correct the returned cost by adding back the sum of the minimum values.
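
As an aside, the column-wise subtraction could also be written with the base R function sweep, which is not used in the notes. The sketch below repeats the data and checks that both approaches give the same reduced setup costs:

```r
startup_costs <- c(27, 28, 32, 35, 26)
setup_costs <- matrix(c(
  NA, 35, 22, 44, 12,
  49, NA, 46, 38, 17,
  46, 12, NA, 29, 41,
  23, 37, 31, NA, 26,
  17, 23, 28, 34, NA),
  byrow = TRUE, nrow = 5)

# Minimum of startup and setup costs in each column
min_col_val <- apply(rbind(startup_costs, setup_costs), 2, min, na.rm = TRUE)

# Approach from the notes: a matrix with the minimum values repeated in each row
min_mat <- matrix(rep(min_col_val, 5), ncol = 5, byrow = TRUE)
reduced1 <- setup_costs - min_mat

# Alternative: sweep() subtracts min_col_val[j] from column j ("-" is the default)
reduced2 <- sweep(setup_costs, 2, min_col_val)

all.equal(reduced1, reduced2)
#> [1] TRUE
reduced2[4, ]   # reduced setup costs from Job 4
#> [1]  6 25  9 NA 14
```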

10.7 Recap

Writing functions is a core activity of an R programmer. It represents the key step of the transition from a user to a programmer. Functions have inputs and outputs. Functions (and control structures) are what make your code more dynamic.

Functions are often used to encapsulate a sequence of expressions that needs to be executed numerous times, perhaps under slightly different conditions. Functional programming is a programming paradigm, that is, a style of writing code. Rather than repeating the code, functions and control structures allow one to build code in blocks. As a result, your code becomes more structured, more readable and much easier to maintain and debug (find errors).

Functions can be defined using the function() directive.

The named arguments (input values) can have default values. Moreover, R passes arguments by value. That is, an R function cannot change the variable that you input to that function.
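
A small sketch of what this means in practice, using a hypothetical function add_one: the function works on its own copy of the input, so the variable in the calling environment is left unchanged.

```r
# Hypothetical function that modifies its argument
add_one <- function(v) {
  v[1] <- v[1] + 1   # changes the local copy only
  return(v)
}

x <- c(1, 2, 3)
y <- add_one(x)
x   # unchanged after the call
#> [1] 1 2 3
y   # the modified copy returned by the function
#> [1] 2 2 3
```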

A function can be called using its name and its arguments can be specified by name or by position in the argument list.
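
For illustration, consider a hypothetical function power with one default argument. The calls below match arguments by position, by name, or rely on the default:

```r
# Hypothetical function with a default value for the second argument
power <- function(base, exponent = 2) {
  return(base^exponent)
}

power(3)                        # exponent takes its default value
#> [1] 9
power(3, 3)                     # arguments matched by position
#> [1] 27
power(exponent = 3, base = 2)   # arguments matched by name, in any order
#> [1] 8
```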

A function returns the value of the last expression evaluated in its body, or the value given to the return flow control statement when it is used explicitly (good coding practice).
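
Both ways of returning a value are shown in the sketch below; sign_label is a hypothetical function used only for illustration.

```r
# Hypothetical function using both an explicit and an implicit return
sign_label <- function(x) {
  if (x < 0) {
    return("negative")   # explicit return: the function stops here
  }
  "non-negative"         # last expression evaluated is returned
}

sign_label(-3)
#> [1] "negative"
sign_label(4)
#> [1] "non-negative"
```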

Scoping refers to the rules R uses to look up the value of variables. A function will first look inside the body of the function to identify all the variables. If all variables exist, no further search is required. Otherwise, R will look one level up to see if the variable exists.
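
A minimal sketch of the lookup, with hypothetical names rate, add_tax and add_tax_local: the first function finds rate one level up, while the second uses its own local rate and leaves the outer variable untouched.

```r
rate <- 0.5   # variable defined outside the functions

add_tax <- function(price) {
  price * (1 + rate)   # rate is not defined in the body, so R looks one level up
}
add_tax(100)
#> [1] 150

add_tax_local <- function(price) {
  rate <- 0.25         # local variable hides the outer rate
  price * (1 + rate)
}
add_tax_local(100)
#> [1] 125
rate                   # the outer variable is unchanged
#> [1] 0.5
```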

Functions can be assigned to R objects just like any other R object.
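
For example, a function can be given a second name, passed as an argument to another function, or stored in a list. A short sketch with hypothetical names:

```r
square <- function(x) x^2

f <- square            # assign the function to another name
f(4)
#> [1] 16

sapply(1:3, square)    # pass the function as an argument
#> [1] 1 4 9

ops <- list(sq = square, cube = function(x) x^3)   # store functions in a list
ops$cube(2)
#> [1] 8
```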

Document your functions using the Roxygen skeleton!

You may also have a look at the slides for this module.

10.8 Exercises

Below you will find a set of exercises. Always have a look at the exercises before you meet in your study group and try to solve them yourself. If you are stuck, see the help page. Some of the solutions to each exercise can be seen by pressing the button at each question. Beware, you will not learn by giving up too early. Put some effort into finding a solution! Always practice using shortcuts in RStudio (see Tools > Keyboard Shortcuts Help).

Go to the Tools for Analytics workspace and download/export the TM10 project. Open it on your laptop and have a look at the files in the exercises folder which can be used as a starting point.

10.8.1 Exercise (defining functions)

Solve this exercise using a script file.

  1. Create a function sum_n that for any given value, say \(n\), computes the sum of the integers from 1 to n (inclusive). Use the function to determine the sum of integers from 1 to 5000. Document your function too.
  2. Write a function compute_s_n that for any given \(n\) computes the sum \(S_n = 1^2 + 2^2 + 3^2 + \dots + n^2\). Report the value of the sum when \(n=10\).
  3. Define an empty numerical vector s_n of size 25 using s_n <- vector("numeric", 25) and store the results of \(S_1, S_2, \dots, S_{25}\) in it using a for-loop. Confirm that the formula for the sum is \(S_n= n(n+1)(2n+1)/6\) for \(n = 1, \ldots, 25\).
  4. Write a function biggest which takes two integers as arguments. Let the function return 1 if the first argument is larger than the second and return 0 otherwise.
  5. Write a function that returns the shipping cost as 10% of the total cost of an order (input argument).
  6. Given Question 5, rewrite the function so the percentage is an input argument with a default of 10%.
  7. Given Question 5, the shipping cost can be split into parts. One part is gasoline which is 50% of the shipping cost. Write a function that has total cost as input argument and calculates the gasoline cost, using the function defined in Question 5 inside it.
  8. Given Question 6, the shipping cost can be split into parts. One part is gasoline which is 50% of the shipping cost. Write a function that has total cost as input argument and calculates the gasoline cost, using the function defined in Question 6 inside it. Hint: Use the ... argument to pass arguments to shipping_cost.
  9. Given Question 8, write a function costs that, given total cost, returns the total cost, shipping cost and gasoline cost.

10.8.2 Exercise (Euclidean distances)

This exercise is a slightly modified version of an exam assignment (exam 2021-A1).

The Euclidean distance between two points \(p = (p_1,p_2)\) and \(q = (q_1,q_2)\) can be calculated using the formula \[ d(p,q) = \sqrt{(p_1-q_1)^2 + (p_2-q_2)^2}.\]

  1. Calculate the distance between points \(p = (10,10)\) and \(q = (4,3)\) using the formula.
  2. Consider 4 points in a matrix (one in each row):

    p_mat <- matrix(c(0, 7, 8, 2, 10, 16, 8, 12), nrow = 4)
    p_mat
    #>      [,1] [,2]
    #> [1,]    0   10
    #> [2,]    7   16
    #> [3,]    8    8
    #> [4,]    2   12

    The distance matrix of p_mat is a 4 times 4 matrix where entry (i,j) contains the distance from the point in row i to the point in row j.

    Calculate the distance matrix of p_mat.

  3. Create a function calc_distances with the following features (implement as many as you can):

    • Takes a matrix p_mat with a point in each row as input argument.
    • Takes two additional input arguments from and to with default values 1:nrow(p_mat).
    • Returns the distance matrix with values calculated for rows in the from input argument and columns in the to input argument. The other entries equal NA.
    • The function should work for different p_mat (you may assume that the matrix always has two columns).

    You may test your code using:

    p_mat <- matrix(c(10, 9, 15, 15, 11, 19, 12, 11, 7, 15), nrow = 5)
    calc_distances(p_mat)
    calc_distances(p_mat, to = 3:4)
    calc_distances(p_mat, from = c(1, nrow(p_mat)), to = 3:4)

10.8.3 Exercise (scope)

  1. After running the code below, what is the value of variable x?
x <- 3
my_func <- function(y){
  x <- 5
  return(y + 5)
}
my_func(7)
  2. Are there any problems with the following code?
x <- 3
my_func <- function(y){
  return(y + x) 
}
my_func(7)
  3. Have a look at the documentation for operator <<- (run ?'<<-'). After running the code below, what is the value of variable x?
x <- 3
my_func <- function(y){
  x <- 4
  x <<- 5
  return(y + 5)
}
  4. After running the code below, what is the value of variable x and what is the output of the function call?
x <- 3
my_func <- function(y){
  x <- 4
  x <<- 5
  return(y + x)
}
my_func(7)

10.8.4 Exercise (time conversion)

This exercise is a slightly modified version of an exam assignment (exam 2022-A1).

  1. Make functions:
  • SecToMin which takes an input argument sec in seconds and returns the number converted to minutes.
  • SecToHours which takes an input argument sec in seconds and returns the number converted to hours.
  • MinToSec which takes an input argument min in minutes and returns the number converted to seconds.
  • MinToHours which takes an input argument min in minutes and returns the number converted to hours.
  • HoursToMin which takes an input argument hours in hours and returns the number converted to minutes.
  • HoursToSec which takes an input argument hours in hours and returns the number converted to seconds.

All numbers may be decimal numbers, e.g. 90 seconds is 1.5 minutes and 1.5 hours is 90 minutes.

  2. Make a function ConvertTime which takes two input arguments:
  • val A number.
  • unit A string that can take values “sec”, “min” and “hours”.

The function should return val converted to seconds, minutes and hours with features:

  • works for all possible values for unit,
  • uses the functions in Question 1,
  • returns a vector with 3 numbers (seconds, minutes and hours) or NA if unit does not equal “sec”, “min” or “hours”.

References

Wickham, Hadley. 2015. R Packages: Organize, Test, Document, and Share Your Code. O’Reilly Media. http://r-pkgs.had.co.nz/.