25 Writing Functions

Learning Goals

After this lesson, you should be able to:

Explain why someone might want to write a function
Describe the syntax of a function definition
Describe a strategy for writing functions
Write functions in order to better organize code

Required Packages

This chapter uses the following packages:

dplyr
readxl
stringr

Chapter 14 explains how to install and load packages.

At this point, you’ve learned all of the basic skills necessary to explore a dataset in R. This chapter and the next focus on how to organize and automate your code so that it’s concise, clear, and effective. This will help you and your collaborators avoid tedious, redundant work, reproduce results efficiently, and run code in specialized environments for research computing, such as high-performance computing clusters.

The main way to interact with R is by calling functions, which was first explained way back in Section 12.3.4. This chapter explains how to write your own functions.

25.1 Why Write a Function?

Two reasons to write functions are to:

Reuse general-purpose code. Suppose, for example, that every time you read a data frame into R, you standardize the column names by making them lowercase and replacing spaces with underscores. By turning the code to do this into a function, you can reuse it across many projects without having to copy, paste, and edit (which can introduce errors).
Encapsulate code that carries out a single step in a multi-step computation. For instance, imagine you need to read a dataset, clean it, and finally transform some of the features before you do any analysis. Writing a function for each step (read_data, clean_data, transform_data, and so on) makes it easier to describe the entire computation in code without being overwhelmed by the details. It also means you can focus on one specific step at a time: while you work on clean_data, the focus is on cleaning, and you can assume reading the data will be handled elsewhere. You can also test each function to make sure that step works correctly before moving on to the next.

Functions are the building blocks for solving problems. Break problems down into steps and write a function for each step. That way you can:

Test that each step works correctly.
Describe the sequence of steps concisely.
Modify, reuse, or repurpose steps.

At first, you might not know how small (or big) each step should be. As a suggestion, most functions should be more than one line of code but short enough to fit on one screen.

Tip

Writing code is a lot like writing an essay. Each line of code is a sentence that expresses one or two ideas. Each function definition is a paragraph, and should have a singular purpose and coherent flow. You can use comments to make an outline and as clarifying footnotes.

As with any writing, you’ll typically need to go through several revisions before you arrive at exactly what you want.

25.2 Function Definitions

Think of a function as a factory. It takes raw materials, runs them through some machinery, and produces a final product:

Raw materials: the inputs to a function are called arguments. Each argument is assigned to a parameter, a placeholder variable.
Machinery: code in the body of a function computes something from the arguments.
Final product: the output of a function is called the return value.

The function keyword defines a new function. It can have any number of parameters. The body must be enclosed in curly braces { } if it’s more than one line of code. The function will automatically return the value of its last line. So a function definition looks like this:

function(parameter1, parameter2, ...) {
  # The body of the function (the code) goes between the curly braces { }.

  # The return value (the result) goes on the last line.
}

When you define a function, you’ll usually want to save it in a variable so that you can call it later. Choose a descriptive name that describes what the function does. Since functions do things, it often makes sense to include a verb in the name.

Important

Indent lines of code between curly braces by 2 or 4 spaces to make it easier to see where the block of code starts and ends.

Viewing Function Definitions

Almost every command in R is a function, even the arithmetic operators and the parentheses! You can view the definition of a function by typing its name without trailing parentheses (in contrast to how you call functions).

For example, let’s look at the body of the append function, which appends a value to the end of a list or vector:

append

function (x, values, after = length(x)) 
{
    lengx <- length(x)
    if (!after) 
        c(values, x)
    else if (after >= lengx) 
        c(x, values)
    else c(x[1L:after], values, x[(after + 1L):lengx])
}
<bytecode: 0x558c98f842f0>
<environment: namespace:base>

Don’t worry if you can’t understand everything the append function’s code does yet. It will make more sense later on, after you’ve written a few functions of your own.

Many of R’s built-in functions are not entirely written in R code. You can spot these by calls to the special .Primitive or .Internal functions in their code.

For instance, the sum function is not written in R code:

sum

function (..., na.rm = FALSE)  .Primitive("sum")

As a demonstration, let’s create a function that detects negative numbers. It should take a vector of numbers as input, compare them to zero, and then return the logical result from the comparison as output. Here’s the code:

is_negative = function(x) x < 0

The name of the function, is_negative, describes what the function does and includes a verb. The parameter x is the input. The return value is the result x < 0.

Any time you write a function, the first thing you should do afterwards is test that it actually works. Try the is_negative function out on a few test cases:

is_negative(6)

[1] FALSE

is_negative(-1.1)

[1] TRUE

x = c(5, -1, -2, 0, 3)
is_negative(x)

[1] FALSE  TRUE  TRUE FALSE FALSE

Warning

Be cautious about using variables in the body of a function if they’re assigned somewhere else. For example:

x = 100

f = function() x ** 2

f()

[1] 10000

Doing so makes your function harder for other people (and future you) to use because the inputs are not clearly labeled. It can also introduce subtle bugs. Use parameters for inputs instead:

f = function(z) z ** 2

f(x)

[1] 10000

It’s also okay to use variables assigned in the body of the function:

f = function(z) {
  result = z ** 2
  result
}

f(x)

[1] 10000

25.3 How to Write a Function

Before you write a function, you should:

Write down what you want the function to do, in detail. Pay particular attention to what the inputs and outputs are. Use comments, a separate document, or even a piece of paper. If it’s difficult to explain in words, try drawing a picture. The key is to make sure that the goal is clear before you start to write code.
Check whether there’s already a built-in function. Search online and in the R documentation.

If there isn’t a function that does what you want, it’s time to write one. To do so:

Write the code for a simple case. For data problems, use a small dataset. Don’t worry about turning the code into a function yet. Make sure that the result is correct.
Once you’ve got the code for the simple case working, wrap it in a function definition. Identify the parts of the code that will change for other cases. The dataset is usually one of these, but there might also be others. Replace these with parameters. Make sure the function returns the result.
Test calling the function with the inputs for the simple case. It should return exactly the same result as it did in step 3. If not, figure out why and fix it.
Test calling the function with the inputs for a different case. Make sure that the result is correct. If it isn’t, figure out why and fix it.
Repeat step 6 for a few more cases. Make sure to test cases that are uncommon or unusual, but realistically possible. Also test what happens in cases of erroneous inputs (does your function emit an error? does the error message explain the problem clearly?). By the end of this step, you should feel fairly confident that your function works correctly.

Keep in mind that writing a function is a process of refinement, even for experienced programmers. Even after going through all of the steps above, you might come across a new, unexpected case or discover a bug that requires you to edit your function.

25.4 Example: Getting Largest Values

Suppose we want a function to get the top 3 largest values in a vector. The main input to the function is the vector. We could also try to write the function so that it returns the top n values instead of being fixed at 3. Then n is another input. The output is a vector of 3 (or n) values.

There’s no function built into R to do this. There might be a function in a package, but we’ll go ahead and write the function ourselves for the sake of learning.

We’ll start by writing the code for a simple case. Suppose we’re interested in this vector:

x = c(1, 10, 20, -3)

We can sort the values with the sort function, and then use the head function to get the first 3:

sorted = sort(x, decreasing = TRUE)
head(sorted, 3)

[1] 20 10  1

Now that we’ve solved the simple case, let’s wrap the code in a function definition. The vector will change from case to case, so we’ll replace x with a parameter named vec. We might also want to change the number of values returned, so we’ll also replace 3 with a parameter named n. So the function definition is:

get_largest = function(vec, n) {
  sorted = sort(vec, decreasing = TRUE)
  head(sorted, n)
}

The name of the function, get_largest, describes what the function does and includes a verb. If this function will be used frequently, a shorter name, such as largest, might be preferable (compare to the head function).

Try the get_largest function on the simple case to make sure it returns the same result:

get_largest(x, 3)

[1] 20 10  1

That looks correct, so let’s try some other cases:

# What if we ask for fewer elements?
get_largest(x, 2)

[1] 20 10

# What if we ask for too many elements?
get_largest(x, 5)

[1] 20 10  1 -3

y = c(-1, -2, -3)
get_largest(y, 2)

[1] -1 -2

# What if the vector contains strings?
z = c("d", "a", "t", "a", "l", "a", "b")
get_largest(z, 5)

[1] "t" "l" "d" "b" "a"

Notice that the parameters vec and n inside the function do not exist as variables outside of the function:

vec

Error: object 'vec' not found

In general, R keeps parameters and variables you define inside of a function separate from variables you define outside of a function. You can read more about the specific rules for how R searches for variables in DataLab’s Intermediate R workshop reader.

Note: Default Arguments

As a function for quickly summarizing data, get_largest would be more convenient if the parameter n for the number of values to return was optional (again, compare to the head function). You can make the parameter n optional by setting a default argument: an argument assigned to the parameter if no argument is assigned in the call to the function. You can use = to assign default arguments to parameters when you define a function with the function keyword.

Here’s a new definition of get_largest with the default n = 5:

get_largest = function(vec, n = 5) {
  sorted = sort(vec, decreasing = TRUE)
  head(sorted, n)
}

After making a change, it’s a good idea to test the function again:

get_largest(x)

[1] 20 10  1 -3

get_largest(y)

[1] -1 -2 -3

get_largest(z)

[1] "t" "l" "d" "b" "a"

Note: Returning Multiple Values

A function returns one R object, but sometimes computations have multiple results. In that case, return the results in a vector, list, or other data structure.

For example, let’s make a function that computes the mean and median for a vector. We’ll return the results in a named list, although we could also use a named vector:

compute_mean_med = function(x) {
  m1 = mean(x)
  m2 = median(x)
  list(mean = m1, median = m2)
}

compute_mean_med(c(1, 2, 3, 1))

$mean
[1] 1.75

$median
[1] 1.5

The names make the result easier to understand for the caller of the function, although they certainly aren’t required here.

Note: The return Keyword

We’ve already seen that a function will automatically return the value of its last line. The return keyword causes a function to return a result immediately, without running any subsequent code in its body.

It only makes sense to use return from inside of an if-expression. If your function doesn’t have any if-expressions, you don’t need to use return.

For example, suppose you want the get_largest function to immediately return NULL if the argument for vec is a list. Here’s the code, along with some test cases:

get_largest = function(vec, n = 5) {
  if (is.list(vec))
    return(NULL)

  sorted = sort(vec, decreasing = TRUE)
  head(sorted, n)
}

get_largest(x)

[1] 20 10  1 -3

get_largest(z)

[1] "t" "l" "d" "b" "a"

get_largest(list(1, 2))

NULL

Alternatively, you could make the function raise an error by calling the stop function. Whether it makes more sense to return NULL or print an error depends on how you plan to use the get_largest function.

Notice that the last line of the get_largest function still doesn’t use the return keyword. It’s idiomatic to only use return when strictly necessary.

25.5 Case Study: U.S. Alternative Fueling Stations, Part I

The United States Department of Energy collects data about alternative (non-petroleum) fuel distribution and use within the country. They publish the data they collect on their Alternative Fuels Data Center (AFDC) website. In this case study, which consists of three parts, we’ll clean and analyze an AFDC dataset about alternative fueling stations.

Important

Click here to download the 2007-2013 U.S. Alternative Fueling Stations dataset.

If you haven’t already, we recommend you create a directory for this workshop. In your workshop directory, create a data/ subdirectory. Download and save the dataset in the data/ subdirectory.

Documentation for the U.S. Alternative Fueling Stations Dataset

The dataset is an Excel file with a documentation sheet and one data sheet for each year. Each data sheet contains counts broken down by state and fuel type. The format of the data sheets changes in 2014.

The source of this dataset is the historical counts on the U.S. DoE’s Alternative Fueling Station Counts by State page.

The dataset is in an Excel file with a separate sheet for each year. We’d like to be able to read the data for every year and combine it all into a single data frame. To get started, let’s focus on a single step: reading the data for just one year. We’ll use the first year, 2007, since the data for the pre-2014 years have a simpler structure.

R doesn’t provide a built-in function to read Excel files, so we’ll use the readxl package. Install the package if you haven’t already, and then load it:

# install.packages("readxl")
library("readxl")

The readxl package’s read_excel function reads a single sheet from an Excel file. We can set the function’s sheet parameter to select the sheet by position or name. In the alternative fueling stations dataset, each data sheet is named after the associated year. So the code to read and print the 2007 data is:

path = "data/2007-2023_us_alt_fuels.xlsx"
stations = read_excel(path, sheet = "2007")

New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`

head(stations)

# A tibble: 6 × 9
  Station Counts by State and …¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
  <chr>                          <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 State                          Biod… CNG   E85   <NA>  <NA>  LNG   <NA>  Total
2 <NA>                           <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
3 <NA>                           <NA>  <NA>  <NA>  Elec… Hydr… <NA>  Prop… <NA> 
4 Alabama                        13    3     3     0     0     0     52    71   
5 <NA>                           <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
6 Alaska                         0     1     0     0     0     0     10    11   
# ℹ abbreviated name: ¹`Station Counts by State and Fuel Type`

From this, we can see that the column names are actually in the 2nd row of the sheet (row 1 in the data frame). So let’s change the call to read_excel to skip the 1st row:

stations = read_excel(path, sheet = "2007", skip = 1)

New names:
• `` -> `...5`
• `` -> `...6`
• `` -> `...8`

head(stations)

# A tibble: 6 × 9
  State   Biodiesel   CNG   E85 ...5     ...6       LNG ...8    Total
  <chr>       <dbl> <dbl> <dbl> <chr>    <chr>    <dbl> <chr>   <dbl>
1 <NA>           NA    NA    NA <NA>     <NA>        NA <NA>       NA
2 <NA>           NA    NA    NA Electric Hydrogen    NA Propane    NA
3 Alabama        13     3     3 0        0            0 52         71
4 <NA>           NA    NA    NA <NA>     <NA>        NA <NA>       NA
5 Alaska          0     1     0 0        0            0 10         11
6 <NA>           NA    NA    NA <NA>     <NA>        NA <NA>       NA

Columns 5, 6, and 8 are still missing names. The names of these columns are in row 2 of the data frame. We’ll use the stringr package, which provides functions for processing strings, to help here. Install the package if you haven’t already, and then load it:

# install.packages("stringr")
library("stringr")

We can use stringr’s str_starts function to detect the column names that start with .... The first argument is the strings and the second is the pattern to detect. Call the fixed function on the pattern make stringr treat it as-is. Then we can use indexing to get the column names from row 2:

names = names(stations)
is_dot_name = str_starts(names, fixed("..."))
names[is_dot_name] = as.character(stations[2, is_dot_name])
names

[1] "State"     "Biodiesel" "CNG"       "E85"       "Electric"  "Hydrogen" 
[7] "LNG"       "Propane"   "Total"

Let’s also make the column names lowercase, so that they’re easy to type. We can use stringr’s str_to_lower function to do this:

names = str_to_lower(names)
names

[1] "state"     "biodiesel" "cng"       "e85"       "electric"  "hydrogen" 
[7] "lng"       "propane"   "total"

Assign the names back to the column names:

names(stations) = names
head(stations)

# A tibble: 6 × 9
  state   biodiesel   cng   e85 electric hydrogen   lng propane total
  <chr>       <dbl> <dbl> <dbl> <chr>    <chr>    <dbl> <chr>   <dbl>
1 <NA>           NA    NA    NA <NA>     <NA>        NA <NA>       NA
2 <NA>           NA    NA    NA Electric Hydrogen    NA Propane    NA
3 Alabama        13     3     3 0        0            0 52         71
4 <NA>           NA    NA    NA <NA>     <NA>        NA <NA>       NA
5 Alaska          0     1     0 0        0            0 10         11
6 <NA>           NA    NA    NA <NA>     <NA>        NA <NA>       NA

Many rows are partially or completely blank. We only want to keep the rows that contain counts. So let’s use dplyr’s filter function to remove all of the rows where biodiesel is missing:

# install.packages("dplyr")
library("dplyr")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

stations = filter(stations, !is.na(biodiesel))
head(stations)

# A tibble: 6 × 9
  state      biodiesel   cng   e85 electric hydrogen   lng propane total
  <chr>          <dbl> <dbl> <dbl> <chr>    <chr>    <dbl> <chr>   <dbl>
1 Alabama           13     3     3 0        0            0 52         71
2 Alaska             0     1     0 0        0            0 10         11
3 Arizona           10    37    13 12       1            3 58        134
4 Arkansas           3     3     4 0        0            0 41         51
5 California        39   186     6 367      23          29 206       856
6 Colorado          30    21    45 2        0            0 59        157

The electric, hydrogen, and propane columns are character columns, but they should be numeric. Let’s convert the electric column. We’ll leave hydrogen and propane as-is (with a comment in the code), since we’re not going to use them in our analysis. While we’re at it, we’ll also add a year column with the year:

stations$electric = as.numeric(stations$electric)
# TODO: hydrogen, propane
stations$year = 2007

head(stations)

# A tibble: 6 × 10
  state      biodiesel   cng   e85 electric hydrogen   lng propane total  year
  <chr>          <dbl> <dbl> <dbl>    <dbl> <chr>    <dbl> <chr>   <dbl> <dbl>
1 Alabama           13     3     3        0 0            0 52         71  2007
2 Alaska             0     1     0        0 0            0 10         11  2007
3 Arizona           10    37    13       12 1            3 58        134  2007
4 Arkansas           3     3     4        0 0            0 41         51  2007
5 California        39   186     6      367 23          29 206       856  2007
6 Colorado          30    21    45        2 0            0 59        157  2007

This data frame is clean enough for analysis.

We want to carry out the same steps for other years of data, so let’s turn the code into a function. Gather together all of the code and enclose it in a function definition:

function() {
  path = "data/2007-2023_us_alt_fuels.xlsx"
  stations = read_excel(path, sheet = "2007", skip = 1)

  names = names(stations)
  is_dot_name = str_starts(names, fixed("..."))
  names[is_dot_name] = as.character(stations[2, is_dot_name])
  names = str_to_lower(names)
  names(stations) = names

  stations = filter(stations, !is.na(biodiesel))

  stations$electric = as.numeric(stations$electric)
  # TODO: hydrogen, propane
  stations$year = 2007
}

The inputs are the path to the dataset and the year (which is also the sheet name). The output is the cleaned data frame. Add parameters for the inputs and put the output on the last line. Let’s call the function read_fuel_sheet. You can also optionally add some comments to clarify how the function works:

read_fuel_sheet = function(path, year) {
  sheet = as.character(year)
  stations = read_excel(path, sheet = sheet, skip = 1)

  # Clean up the column names.
  names = names(stations)
  is_dot_name = str_starts(names, fixed("..."))
  names[is_dot_name] = as.character(stations[2, is_dot_name])
  names = str_to_lower(names)
  names(stations) = names

  # Remove blank rows.
  stations = filter(stations, !is.na(biodiesel))

  # Correct column types and add year column.
  stations$electric = as.numeric(stations$electric)
  # TODO: hydrogen, propane
  stations$year = year

  stations
}

Test the new function to make sure it returns the same result on the example case:

read_fuel_sheet(path, 2007)

New names:
• `` -> `...5`
• `` -> `...6`
• `` -> `...8`

# A tibble: 52 × 10
   state       biodiesel   cng   e85 electric hydrogen   lng propane total  year
   <chr>           <dbl> <dbl> <dbl>    <dbl> <chr>    <dbl> <chr>   <dbl> <dbl>
 1 Alabama            13     3     3        0 0            0 52         71  2007
 2 Alaska              0     1     0        0 0            0 10         11  2007
 3 Arizona            10    37    13       12 1            3 58        134  2007
 4 Arkansas            3     3     4        0 0            0 41         51  2007
 5 California         39   186     6      367 23          29 206       856  2007
 6 Colorado           30    21    45        2 0            0 59        157  2007
 7 Connecticut         1    10     2        3 0            0 16         32  2007
 8 Delaware            3     1     1        0 0            0 3           8  2007
 9 District o…         1     1     3        0 1            0 0           6  2007
10 Florida            14    17    11        2 1            0 49         94  2007
# ℹ 42 more rows

The result looks okay, so try a few more cases:

read_fuel_sheet(path, 2008)

New names:
• `` -> `...5`
• `` -> `...6`
• `` -> `...8`

# A tibble: 52 × 10
   state       biodiesel   cng   e85 electric hydrogen   lng propane total  year
   <chr>           <dbl> <dbl> <dbl>    <dbl> <chr>    <dbl> <chr>   <dbl> <dbl>
 1 Alabama            11     3     6        0 0            0 40         60  2008
 2 Alaska              0     1     0        0 0            0 10         11  2008
 3 Arizona            10    40    23        5 1            5 51        135  2008
 4 Arkansas            2     3     7        0 0            0 37         49  2008
 5 California         36   184    13      376 26          28 199       862  2008
 6 Colorado           18    18    65        0 0            0 43        144  2008
 7 Connecticut         1     9     4        3 1            0 13         31  2008
 8 Delaware            3     1     1        0 0            0 2           7  2008
 9 District o…         1     1     3        0 1            0 0           6  2008
10 Florida            12    15    18        3 2            0 47         97  2008
# ℹ 42 more rows

read_fuel_sheet(path, 2010)

New names:
• `` -> `...5`
• `` -> `...6`
• `` -> `...8`

# A tibble: 52 × 10
   state       biodiesel   cng   e85 electric hydrogen   lng propane total  year
   <chr>           <dbl> <dbl> <dbl>    <dbl> <chr>    <dbl> <chr>   <dbl> <dbl>
 1 Alabama             5     4    15        0 0            1 128       153  2010
 2 Alaska              1     2     0        0 0            0 8          11  2010
 3 Arizona            14    36    32        8 1            5 61        157  2010
 4 Arkansas            6     4    13        1 0            0 51         75  2010
 5 California         35   213    54      428 22          32 227      1011  2010
 6 Colorado           14    23    86        5 1            0 50        179  2010
 7 Connecticut         1    13     1        4 2            0 16         37  2010
 8 Delaware            3     1     1        0 0            0 3           8  2010
 9 District o…         1     2     3        1 1            0 0           8  2010
10 Florida            16    14    42        7 0            0 75        154  2010
# ℹ 42 more rows

The function seems to generalize well to other years where the data have the same structure. The structure of the data changed in 2014. Let’s test the function on one of these years:

read_fuel_sheet(path, 2015)

Warning: Unknown or uninitialised column: `electric`.

Error in `$<-`:
! Assigned data `as.numeric(stations$electric)` must be compatible with
  existing data.
✖ Existing data has 52 rows.
✖ Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 0 to size 52.

The function raises an error, which isn’t too surprising. It looks like we still have some work to do to make the function general enough to read the data for 2014 and subsequent years. We’ll save that for part II of this case study (Section 26.2}.