6 Language Fundamentals

This chapter is part 1 (of 2) of Thinking in R, a workshop series about how R works and how to examine code critically. The major topics of this chapter are how R stores and locates variables (including functions) defined in your code and in packages, and how some of R’s object-oriented programming systems work.

Learning Objectives

After completing this session, learners should be able to:

Explain what an environment is and how R uses them
Explain how R looks up variables
Explain what attributes are and how R uses them
Get and set attributes
Explain what (S3) classes are and how R uses them
Explain R’s (S3) method dispatch system
Create an (S3) class
Describe R’s other object-oriented programming systems at a high level

6.1 Variables & Environments

Assigning and looking up values of variables are fundamental operations in R, as in most programming languages. They were likely among the first operations you learned, and now you use them instinctively. This section is a deep dive into what R actually does when you assign a variable and how R looks up the values of those variables later. Understanding the process and the data structures involved will introduce you to new programming strategies, make it easier to reason about code, and help you identify potential bugs.

6.1.1 What’s an Environment?

The foundation of how R stores and looks up variables is a data structure called an environment. Every environment has two parts:

A frame, which is a collection of names and associated R objects.
A parent or enclosing environment, which must be another environment.

For now, you’ll learn how to create environments and how to assign and get values from their frames. Parent environments will be explained in a later section.

You can use the new.env function to create a new environment:

e = new.env()
e

## <environment: 0x55ccbb278e50>

Unlike most objects, printing an environment doesn’t print its contents. Instead, R prints its type (which is environment) and a unique identifier (0x55ccbb278e50 in this case).

The unique identifier is actually the memory address of the environment. Every object you use in R is stored as a series of bytes in your computer’s random-access memory (RAM). Each byte in memory has a unique address, similar to how each house on a street has a unique address. Memory addresses are usually just numbers counting up from 0, but they’re often written in hexadecimal (base 16, indicated by the prefix 0x) because it’s more concise. For the purposes of this reader, you can just think of the memory address as a unique identifier.

To see the names in an environment’s frame, you can call the ls or names function on the environment:

ls(e)

## character(0)

names(e)

## character(0)

You just created the environment e, so its frame is currently empty. The printout character(0) means R returned a character vector of length 0.

You can assign an R object to a name in an environment’s frame with the dollar sign $ operator or the double square bracket [[ operator, similar to how you would assign a named element of a list. For example, one way to assign the number 8 to the name "lucky" in the environment e’s frame is:

e$lucky = 8

Now there’s a name defined in the environment:

ls(e)

## [1] "lucky"

names(e)

## [1] "lucky"

Here’s another example of assigning an object to a name in the environment:

e[["my_message"]] = "May your coffee kick in before reality does."

You can assign any type of R object to a name in an environment, including other environments.

The ls function ignores names that begin with a dot . by default. For example:

e$.x = list(1, sin)
ls(e)

## [1] "lucky"      "my_message"

You can pass the argument all.names = TRUE to make the function return all names in the frame:

ls(e, all.names = TRUE)

## [1] ".x"         "lucky"      "my_message"

Alternatively, you can just use the names function, which always returns all names in an environment’s frame.

Objects in an environment’s frame don’t have positions or any particular order, so they must always be assigned to a name. R raises an error if you try to assign an object to a position:

e[[3]] = 10

## Error in e[[3]] = 10: wrong args for environment subassignment

As you might expect, you can also use the dollar sign operator and double square bracket operator to get objects in an environment by name:

e$my_message

## [1] "May your coffee kick in before reality does."

e[["lucky"]]

## [1] 8

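You can use the exists function to check whether a specific name exists in an environment’s frame. By default, exists also searches the environment’s parents (covered in Section 6.1.6), so pass inherits = FALSE if you want to check only the frame itself: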

exists("hi", e)

## [1] FALSE

exists("lucky", e)

## [1] TRUE

Finally, you can remove a name and object from an environment’s frame with the rm function. Make sure to pass the environment as the argument to the envir parameter when you do this:

rm("lucky", envir = e)
exists("lucky", e)

## [1] FALSE

6.1.2 Reference Objects

Environments are reference objects, which means they don’t follow R’s copy-on-write rule. Under this rule, which applies to most types of objects, R automatically and silently makes a copy when you modify an object that other variables also refer to, so those other variables remain unchanged.

As an example, lists follow the copy-on-write rule. Suppose you assign a list to variable x, assign x to y, and then make a change to x:

x = list()
x$a = 10
x

## $a
## [1] 10

y = x
x$a = 20
y

## $a
## [1] 10

When you run y = x, R makes y refer to the same object as x, without using any additional memory. When you run x$a = 20, the copy-on-write rule applies, so R creates and modifies a copy of the object. From then on, x refers to the modified copy and y refers to the original.

Environments don’t follow the copy-on-write rule, so repeating the example with an environment produces a different result:

e_x = new.env()
e_x$a = 10
e_x$a

## [1] 10

e_y = e_x
e_x$a = 20
e_y$a

## [1] 20

As before, e_y = e_x makes both e_y and e_x refer to the same object. The difference is that when you run e_x$a = 20, the copy-on-write rule does not apply and R does not create a copy of the environment. As a result, the change to e_x is also reflected in e_y.

Environments and other reference objects can be confusing since they behave differently from most objects. You usually won’t need to construct or manipulate environments directly, but it’s useful to know how to inspect them.
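One place where the difference matters in practice is when you pass an object to a function. The following is a minimal sketch; the names set_a, lst, and env are hypothetical and not part of this chapter’s running example, so call rm(set_a, lst, env) afterwards if you want later ls() listings to match the text:

```r
# Hypothetical helper: sets the element `a` on whatever container it receives.
set_a = function(container) {
  container$a = 99
  invisible(NULL)
}

lst = list(a = 1)
env = new.env()
env$a = 1

set_a(lst)  # copy-on-write: only a copy inside the call is modified
set_a(env)  # reference object: the original environment is modified

lst$a  # 1
env$a  # 99
```

The list is unchanged after the call, while the environment carries the change back to the caller.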

6.1.3 The Local Environment

Think of environments as containers for variables. Whenever you assign a variable, R assigns it to the frame of an environment. Whenever you get a variable, R searches through one or more environments for its value.

When you start R, R creates a special environment called the global environment to store variables you assign at the prompt or the top level of a script. You can use the globalenv function to get the global environment:

g = globalenv()
g

## <environment: R_GlobalEnv>

The global environment is easy to recognize because its unique identifier is R_GlobalEnv rather than its memory address (even though it’s stored in your computer’s memory like any other object).

The local environment is the environment where the assignment operators <- and = assign variables. Think of the local environment as the environment that’s currently active. The local environment varies depending on the context where you run an expression. You can get the local environment with the environment function:

loc = environment()
loc

## <environment: R_GlobalEnv>

As you can see, at the R prompt or the top level of an R script, the local environment is just the global environment.

Except for names, the functions introduced in Section 6.1.1 default to the local environment if you don’t set the envir parameter. This makes them convenient for inspecting or modifying the local environment’s frame:

ls(loc)

## [1] "e"          "e_x"        "e_y"        "g"          "loc"       
## [6] "source_rmd" "x"          "y"

ls()

## [1] "e"          "e_x"        "e_y"        "g"          "loc"       
## [6] "source_rmd" "x"          "y"

If you assign a variable, it appears in the local environment’s frame:

coffee = "Right. No coffee. This is a terrible planet."
ls()

## [1] "coffee"     "e"          "e_x"        "e_y"        "g"         
## [6] "loc"        "source_rmd" "x"          "y"

loc$coffee

## [1] "Right. No coffee. This is a terrible planet."

Conversely, if you assign an object in the local environment’s frame, you can access it as a variable:

loc$tea = "Tea isn't coffee!"
tea

## [1] "Tea isn't coffee!"

6.1.4 Call Environments

Every time you call (not define) a function, R creates a new environment. R uses this call environment as the local environment while the code in the body of the function runs. As a result, assigning variables in a function doesn’t affect the global environment, and they generally can’t be accessed from outside of the function.

For example, consider this function which assigns the variable hello:

my_hello = function() {
  hello = "from the other side"
}

Even after calling the function, there’s no variable hello in the global environment:

my_hello()
names(g)

##  [1] "loc"        "my_hello"   "tea"        "e_x"        "x"         
##  [6] "e_y"        "y"          "coffee"     "source_rmd" "e"         
## [11] "g"          ".First"

As further demonstration, consider this modified version of my_hello, which returns the call environment:

my_hello = function() {
  hello = "from the other side"
  environment()
}

The call environment is not the global environment:

e = my_hello()
e

## <environment: 0x55ccbd771558>

And the variable hello exists in the call environment, but not in the global environment:

exists("hello", g)

## [1] FALSE

exists("hello", e)

## [1] TRUE

e$hello

## [1] "from the other side"

Each call to a function creates a new call environment. So if you call my_hello again, it returns a different environment (pay attention to the memory address):

e2 = my_hello()
e

## <environment: 0x55ccbd771558>

e2

## <environment: 0x55ccbdcf2748>

By creating a new environment for every call, R isolates code in the function body from code outside of the body. As a result, most R functions have no side effects. This is a good thing, since it means you generally don’t have to worry about calls assigning, reassigning, or removing variables in other environments (such as the global environment!).

The local function provides another way to create a new local environment in which to run code. However, it’s usually preferable to define and call a function, since that makes it easier to test and reuse the code.
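As a rough sketch of how local behaves, the example below evaluates a block in a new environment and keeps only the result. The variable names values and total are hypothetical:

```r
total = local({
  values = c(3, 5, 8)  # `values` exists only inside the local() call
  sum(values)          # the value of the last expression is returned
})

total  # 16

# `values` was never assigned in the environment that called local().
exists("values", inherits = FALSE)
```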

6.1.5 Lexical Scoping

A function can access variables outside of its local environment, but only if those variables exist in the environment where the function was defined (not called). This property is called lexical scoping.

For example, assign a variable tea and function get_tea in the global environment:

tea = "Tea isn't coffee!"
get_tea = function() {
  tea
}

Then the get_tea function can access the tea variable:

get_tea()

## [1] "Tea isn't coffee!"
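To see why the distinction between "defined" and "called" matters, consider this sketch, in which a function is called from inside another function that has its own local variable of the same name. The names drink, show_drink, and call_from_elsewhere are hypothetical:

```r
drink = "global coffee"

# Defined in the global environment, so non-local variables are looked up there.
show_drink = function() {
  drink
}

call_from_elsewhere = function() {
  drink = "local tea"  # local to this call; not visible to show_drink
  show_drink()
}

call_from_elsewhere()  # "global coffee"
```

Even though show_drink is called from a place where drink is "local tea", it finds the variable in the environment where it was defined.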

Note that variable lookup takes place when a function is called, not when it’s defined. This is called dynamic lookup.

For example, the result from get_tea changes if you change the value of tea:

tea = "Tea for two."
get_tea()

## [1] "Tea for two."

tea = "Tea isn't coffee!"
get_tea()

## [1] "Tea isn't coffee!"

When a local variable (a variable in the local environment) and a non-local variable have the same name, R almost always prioritizes the local variable. For instance:

get_local_tea = function() {
  tea = "Earl grey is tea!"
  tea
}

get_local_tea()

## [1] "Earl grey is tea!"

The function body assigns the local variable tea to "Earl grey is tea!", so R returns that value rather than "Tea isn't coffee!". In other words, local variables mask, or hide, non-local variables with the same name.

There’s only one case where R doesn’t prioritize local variables. To see it, consider this call:

mean(1:20)

## [1] 10.5

The variable mean must refer to a function, because it’s being called—it’s followed by parentheses ( ), the call syntax. In this situation, R ignores local variables that aren’t functions, so you can write code such as:

mean = 10
mean(1:10)

## [1] 5.5

That said, defining a local variable with the same name as a function can still be confusing, so it’s usually considered a bad practice.

To help you reason about lexical scoping, you can get the environment where a function was defined by calling the environment function on the function itself. For example, the get_tea function was defined in the global environment:

environment(get_tea)

## <environment: R_GlobalEnv>

6.1.6 Variable Lookup

The key to how R looks up variables and how lexical scoping works is that in addition to a frame, every environment has a parent environment.

When R evaluates a variable in an expression, it starts by looking for the variable in the local environment’s frame.

For example, at the prompt, tea is a local variable because that’s where you assigned it. If you enter tea at the prompt, R finds tea in the local environment’s frame and returns the value:

tea

## [1] "Tea isn't coffee!"

On the other hand, in the get_tea function from Section 6.1.5, tea is not a local variable:

get_tea = function() {
  tea
}

To make this more concrete, consider a function which just returns its call environment:

get_call_env = function() {
  environment()
}

The call environment clearly doesn’t contain the tea variable:

e = get_call_env()
ls(e)

## character(0)

When a variable doesn’t exist in the local environment’s frame, then R gets the parent environment of the local environment.

You can use the parent.env function to get the parent environment of an environment. For the call environment e, the parent environment is the global environment, because that’s where get_call_env was defined:

parent.env(e)

## <environment: R_GlobalEnv>

When R can’t find tea in the call environment’s frame, R gets the parent environment, which is the global environment. Then R searches for tea in the global environment, finds it, and returns the value.

R repeats the lookup process for as many parents as necessary, stopping only when it finds the variable or reaches a special environment called the empty environment, which will be explained in Section 6.1.7.

The lookup process also hints at how R finds variables and functions such as pi and sqrt that clearly aren’t defined in the global environment. They’re defined in parent environments of the global environment.

The get function looks up a variable by name:

get("pi")

## [1] 3.141593

You can use the get function to look up a variable starting from a specific environment or to control how R looks up the variable. For example, if you set inherits = FALSE, R will not search any parent environments:

get("pi", inherits = FALSE)

## Error in get("pi", inherits = FALSE): object 'pi' not found

As with most functions for inspecting and modifying environments, use the get function sparingly. R already provides a much simpler way to get a variable: the variable’s name.
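The lookup process described in this section can be approximated in a few lines of R. This is only an illustrative sketch, not how R implements lookup internally: the function name lookup is hypothetical, and the sketch ignores special cases such as the function-call rule from Section 6.1.5.

```r
# Hypothetical helper: search `env` and then each parent in turn for `name`.
lookup = function(name, env) {
  while (!identical(env, emptyenv())) {
    # Check only this environment's frame, not its parents.
    if (exists(name, envir = env, inherits = FALSE)) {
      return(get(name, envir = env, inherits = FALSE))
    }
    env = parent.env(env)  # move one step up the chain
  }
  stop("object '", name, "' not found")
}

lookup("pi", globalenv())  # 3.141593
```

Starting from the global environment, the loop walks through parent environments until it reaches one whose frame contains pi.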

6.1.7 The Search Path

R also uses environments to manage packages. Each time you load a package with library or require, R creates a new environment:

The frame contains the package’s local variables.
The parent environment is the environment of the previous package loaded.
This new environment becomes the parent of the global environment.

R always loads several built-in packages at startup, which contain variables and functions such as pi and sqrt. Thus the global environment is never the top-level environment. For instance:

g = globalenv()
e = parent.env(g)
e

## <environment: package:stats>
## attr(,"name")
## [1] "package:stats"
## attr(,"path")
## [1] "/usr/lib/R/library/stats"

e = parent.env(e)
e

## <environment: package:graphics>
## attr(,"name")
## [1] "package:graphics"
## attr(,"path")
## [1] "/usr/lib/R/library/graphics"

Notice that package environments use package: and the name of the package as their unique identifier rather than their memory address.

The chain of package environments is called the search path. The search function returns the search path:

search()

## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

The base environment (identified by base) is always the topmost environment on the search path. You can use the baseenv function to get the base environment:

baseenv()

## <environment: base>

The base environment’s parent is the special empty environment (identified by R_EmptyEnv), which contains no variables and has no parent. You can use the emptyenv function to get the empty environment:

emptyenv()

## <environment: R_EmptyEnv>

Understanding R’s process for looking up variables and the search path is helpful for resolving conflicts between the names of variables in packages.
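To connect the search path to parent environments, you can walk the chain yourself with parent.env. The helper name env_chain is hypothetical, and the exact packages listed will depend on what is attached in your session:

```r
# Hypothetical helper: collect the name of every environment from `env`
# up to and including the empty environment.
env_chain = function(env) {
  labels = character(0)
  while (!identical(env, emptyenv())) {
    labels = c(labels, environmentName(env))
    env = parent.env(env)
  }
  c(labels, environmentName(env))
}

env_chain(globalenv())
```

The result should list the same environments as search(), in the same order, followed by R_EmptyEnv. Note that environmentName reports the global environment as R_GlobalEnv, whereas search labels it .GlobalEnv.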

6.1.7.1 The Colon Operators

The double-colon operator :: gets a variable in a specific package. Two common uses:

Disambiguate which package you mean when several packages have variables with the same names.
Get a variable from a package without loading the package.

For example:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

stats::filter

## function (x, filter, method = c("convolution", "recursive"), 
##     sides = 2L, circular = FALSE, init = NULL) 
## {
##     method <- match.arg(method)
##     x <- as.ts(x)
##     storage.mode(x) <- "double"
##     xtsp <- tsp(x)
##     n <- as.integer(NROW(x))
##     if (is.na(n)) 
##         stop(gettextf("invalid value of %s", "NROW(x)"), domain = NA)
##     nser <- NCOL(x)
##     filter <- as.double(filter)
##     nfilt <- as.integer(length(filter))
##     if (is.na(nfilt)) 
##         stop(gettextf("invalid value of %s", "length(filter)"), 
##             domain = NA)
##     if (anyNA(filter)) 
##         stop("missing values in 'filter'")
##     if (method == "convolution") {
##         if (nfilt > n) 
##             stop("'filter' is longer than time series")
##         sides <- as.integer(sides)
##         if (is.na(sides) || (sides != 1L && sides != 2L)) 
##             stop("argument 'sides' must be 1 or 2")
##         circular <- as.logical(circular)
##         if (is.na(circular)) 
##             stop("'circular' must be logical and not NA")
##         if (is.matrix(x)) {
##             y <- matrix(NA, n, nser)
##             for (i in seq_len(nser)) y[, i] <- .Call(C_cfilter, 
##                 x[, i], filter, sides, circular)
##         }
##         else y <- .Call(C_cfilter, x, filter, sides, circular)
##     }
##     else {
##         if (missing(init)) {
##             init <- matrix(0, nfilt, nser)
##         }
##         else {
##             ni <- NROW(init)
##             if (ni != nfilt) 
##                 stop("length of 'init' must equal length of 'filter'")
##             if (NCOL(init) != 1L && NCOL(init) != nser) {
##                 stop(sprintf(ngettext(nser, "'init' must have %d column", 
##                   "'init' must have 1 or %d columns", domain = "R-stats"), 
##                   nser), domain = NA)
##             }
##             if (!is.matrix(init)) 
##                 dim(init) <- c(nfilt, nser)
##         }
##         ind <- seq_len(nfilt)
##         if (is.matrix(x)) {
##             y <- matrix(NA, n, nser)
##             for (i in seq_len(nser)) y[, i] <- .Call(C_rfilter, 
##                 x[, i], filter, c(rev(init[, i]), double(n)))[-ind]
##         }
##         else y <- .Call(C_rfilter, x, filter, c(rev(init[, 1L]), 
##             double(n)))[-ind]
##     }
##     tsp(y) <- xtsp
##     class(y) <- if (nser > 1L) 
##         c("mts", "ts")
##     else "ts"
##     y
## }
## <bytecode: 0x55ccbd723f58>
## <environment: namespace:stats>

dplyr::filter

## function (.data, ..., .by = NULL, .preserve = FALSE) 
## {
##     check_by_typo(...)
##     by <- enquo(.by)
##     if (!quo_is_null(by) && !is_false(.preserve)) {
##         abort("Can't supply both `.by` and `.preserve`.")
##     }
##     UseMethod("filter")
## }
## <bytecode: 0x55ccb9836048>
## <environment: namespace:dplyr>

ggplot2::ggplot

## function (data = NULL, mapping = aes(), ..., environment = parent.frame()) 
## {
##     UseMethod("ggplot")
## }
## <bytecode: 0x55ccbcce8e00>
## <environment: namespace:ggplot2>

The related triple-colon operator ::: gets a private variable in a package. Generally these are private for a reason! Only use ::: if you’re sure you know what you’re doing.
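As a smaller illustration of how the double-colon operator resolves masking, the sketch below deliberately masks a base function with a hypothetical global definition and then removes it again:

```r
# A (hypothetical) global definition that masks base R's sum function.
sum = function(...) "masked!"

sum(1, 2, 3)        # "masked!": the global environment is searched first
base::sum(1, 2, 3)  # 6: the double-colon operator goes straight to the package

rm(sum)             # remove the mask so sum refers to base::sum again
```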

6.2 Closures

A closure is a function together with an enclosing environment. In order to support lexical scoping, every R function is a closure (except a few very special built-in functions). The enclosing environment is generally the environment where the function was defined.

Recall that you can use the environment function to get the enclosing environment of a function:

f = function() 42
environment(f)

## <environment: R_GlobalEnv>

Since the enclosing environment exists whether or not you call the function, you can use the enclosing environment to store and share data between calls.

You can use the superassignment operator <<- to assign to a variable in an ancestor environment (if the variable already exists there) or in the global environment (if the variable does not already exist).

For example, suppose you want to make a function that returns the number of times it’s been called:

counter = 0
count = function() {
  counter <<- counter + 1
  counter
}

In this example, the enclosing environment is the global environment. Each time you call count, it assigns a new value to the counter variable in the global environment.
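The other case, where the variable does not already exist in any ancestor environment, can be sketched as follows. The names set_flag and flag_demo are hypothetical:

```r
set_flag = function() {
  # No ancestor environment defines `flag_demo`, so `<<-` creates it
  # in the global environment.
  flag_demo <<- TRUE
  invisible(NULL)
}

exists("flag_demo")  # FALSE in a fresh session
set_flag()
flag_demo            # TRUE
```

This is one reason to be careful with <<-: a typo in the variable name silently creates a new global variable instead of raising an error.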

6.2.1 Tidy Closures

The count function has a side effect: it reassigns a non-local variable. As discussed in Section 6.1.4, functions with side effects make code harder to understand and reason about. Use side effects sparingly and try to isolate them from the global environment.

When side effects aren’t isolated, several things can go wrong. The function might overwrite the user’s variables:

counter = 0
count()

## [1] 1

Or the user might overwrite the function’s variables:

counter = "hi"
count()

## Error in counter + 1: non-numeric argument to binary operator

For functions that rely on storing information in their enclosing environment, there are several different ways to make sure the enclosing environment is isolated. Two of these are:

Define and return the function from the body of another function. The second function is called a factory function because it produces (returns) the first. The enclosing environment of the first function is the call environment of the second.
Define the function inside of a call to local.

Here’s a template for the first approach:

make_fn = function() {
  # Define variables in the enclosing environment here:

  # Define and return the function here:
  function() {
    # ...
  }
}

f = make_fn()
# Now you can call f() as you would any other function.

For example, you can use the template for the counter function:

make_count = function() {
  counter = 0
  
  function() {
    counter <<- counter + 1
    counter
  }
}

count = make_count()

Then calling count has no effect on the global environment:

counter = 10
count()

## [1] 1

counter

## [1] 10
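The second approach, defining the function inside a call to local, might look like the following sketch. The name count_local is hypothetical, chosen so that it doesn’t replace the count function above:

```r
count_local = local({
  counter = 0  # lives in the environment created by local()

  # The function's enclosing environment is that same environment.
  function() {
    counter <<- counter + 1
    counter
  }
})

count_local()  # 1
count_local()  # 2
```

This is more compact than a factory function when you only need one counter, but a factory is the better fit if you want several independent counters.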

6.3 Attributes

An attribute is named metadata attached to an R object. Attributes provide basic information about objects and play an important role in R’s class system, so most objects have attributes. Some common attributes are:

class – the class
row.names – row names
names – element names or column names
dim – dimensions (on matrices)
dimnames – names of dimensions (on matrices)

R provides helper functions to get and set the values of the common attributes. These functions usually have the same name as the attribute. For example, the class function gets or sets the class attribute:

class(mtcars)

## [1] "data.frame"

row.names(mtcars)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

An attribute can have any name and any value. You can use the attr function to get or set an attribute by name:

attr(mtcars, "row.names")

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

attr(mtcars, "foo") = 42
attr(mtcars, "foo")

## [1] 42

You can get all of the attributes attached to an object with the attributes function:

attributes(mtcars)

## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"
## 
## $foo
## [1] 42

You can use the structure function to set multiple attributes on an object:

mod_mtcars = structure(mtcars, foo = 50, bar = 100)
attributes(mod_mtcars)

## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"
## 
## $foo
## [1] 50
## 
## $bar
## [1] 100

Vectors usually don’t have attributes:

attributes(5)

## NULL

But the class function still returns a class:

class(5)

## [1] "numeric"

When a helper function exists to get or set an attribute, use the helper function rather than attr. This will make your code clearer and ensure that attributes with special behavior and requirements, such as dim, are set correctly.
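For instance, here is a small sketch that uses the dim helper for a common attribute and attr for a custom one. The variable v and the attribute name units are hypothetical:

```r
v = 1:6
attributes(v)            # NULL: a plain vector starts with no attributes

dim(v) = c(2, 3)         # the dim helper turns the vector into a 2-by-3 matrix
attr(v, "units") = "cm"  # a custom attribute has no helper, so use attr

is.matrix(v)             # TRUE
attr(v, "units")         # "cm"
```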

6.4 S3

R provides several systems for object-oriented programming (OOP), a programming paradigm where code is organized into a collection of “objects” that interact with each other. These systems provide a way to create new data structures with customized behavior, and also underpin how some of R’s built-in functions work.

The S3 system is particularly important for understanding R, because it’s the oldest and most widely used. This section focuses on S3, while Section 6.5 provides an overview of R’s other OOP systems.

The central idea of S3 is that some functions can be generic, meaning they perform different computations (and run different code) for different classes of objects.

Conversely, every object has at least one class, which dictates how the object behaves. For most objects, the class is independent of type and is stored in the class attribute. You can get the class of an object with the class function. For example, the class of a data frame is data.frame:

class(mtcars)

## [1] "data.frame"

Some objects have more than one class. One example of this is matrices:

m = matrix()
class(m)

## [1] "matrix" "array"

When an object has multiple classes, they’re stored in the class attribute in order from highest to lowest priority. So the matrix m will primarily behave like a matrix, but it can also behave like an array. The priority of classes is often described in terms of a child-parent relationship: array is the parent class of matrix, or equivalently, the class matrix inherits from the class array.
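If you want to test whether an object has a particular class anywhere in its class vector, base R’s inherits function does that. A short sketch using a date-time object, which has two classes (the variable name now is hypothetical):

```r
now = Sys.time()

class(now)                   # "POSIXct" "POSIXt"
class(now)[1]                # "POSIXct": the highest-priority class

inherits(now, "POSIXt")      # TRUE: the parent class is in the class vector
inherits(now, "data.frame")  # FALSE
```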

6.4.1 Method Dispatch

A function is generic if it selects and calls another function, called a method, based on the class of one of its arguments. A generic function can have any number of methods, and each must have the same signature, or collection of parameters, as the generic. Think of a generic function’s methods as the range of different computations it can perform, or alternatively as the range of different classes it can accept as input.

Method dispatch, or just dispatch, is the process of selecting a method based on the class of an argument. You can identify S3 generics because they always call the UseMethod function, which initiates S3 method dispatch.

Many of R’s built-in functions are generic. One example is the split function, which splits a data frame or vector into groups:

split

## function (x, f, drop = FALSE, ...) 
## UseMethod("split")
## <bytecode: 0x55ccbac6ef00>
## <environment: namespace:base>

Another is the plot function, which creates a plot:

plot

## function (x, y, ...) 
## UseMethod("plot")
## <bytecode: 0x55ccbcaacb28>
## <environment: namespace:base>

The UseMethod function requires the name of the generic (as a string) as its first argument. The second argument is optional and specifies the object to use for method dispatch. By default, the first argument to the generic is used for method dispatch. So for split, the argument for x is used for method dispatch. R checks the class of the argument and selects a matching method.

You can use the methods function to list all of the methods of a generic. The methods for split are:

methods(split)

## [1] split.data.frame split.Date       split.default    split.POSIXct   
## see '?methods' for accessing help and source code

Method names always have the form GENERIC.CLASS, where GENERIC is the name of the generic and CLASS is the name of a class. For instance, split.data.frame is the split method for objects with class data.frame.

Methods named GENERIC.default are a special case: they are default methods, selected only if none of the other methods match the class during dispatch. So split.default is the default method for split. Most generic functions have a default method.

Methods are ordinary R functions. For instance, the code for split.data.frame is:

split.data.frame

## function (x, f, drop = FALSE, ...) 
## {
##     if (inherits(f, "formula")) 
##         f <- .formula2varlist(f, x)
##     lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), 
##         function(ind) x[ind, , drop = FALSE])
## }
## <bytecode: 0x55ccbaf26f60>
## <environment: namespace:base>

Sometimes methods are defined privately in packages and can’t be accessed by typing their name at the prompt. You can use the getAnywhere function to get the code for these methods. For instance, to get the code for plot.data.frame:

getAnywhere(plot.data.frame)

## A single object matching 'plot.data.frame' was found
## It was found in the following places
##   registered S3 method for plot from namespace graphics
##   namespace:graphics
## with value
## 
## function (x, ...) 
## {
##     plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L], 
##         ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab, 
##         ...)
##     if (!is.data.frame(x)) 
##         stop("'plot.data.frame' applied to non data frame")
##     if (ncol(x) == 1) {
##         x1 <- x[[1L]]
##         if (class(x1)[1L] %in% c("integer", "numeric")) 
##             stripchart(x1, ...)
##         else plot(x1, ...)
##     }
##     else if (ncol(x) == 2) {
##         plot2(x, ...)
##     }
##     else {
##         pairs(data.matrix(x), ...)
##     }
## }
## <bytecode: 0x55ccbcb9eab0>
## <environment: namespace:graphics>
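
If you want the method itself as a function object rather than a printed report, one alternative is the getS3method function from the utils package, which takes the generic name and the class name as strings. This is a small sketch, assuming the graphics package is loaded (as it is in a default R session):

```r
# Look up the plot method for class data.frame by generic and class name
m = getS3method("plot", "data.frame")

is.function(m)     # the result is an ordinary function
names(formals(m))  # its parameters: "x" and "..."
```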

As a demonstration of method dispatch, consider this code to split the mtcars dataset by number of cylinders:

split(mtcars, mtcars$cyl)

The split function is generic and dispatches on its first argument. In this case, the first argument is mtcars, which has class data.frame. Since the method split.data.frame exists, R calls split.data.frame with the same arguments you used to call the generic split function. In other words, R calls:

split.data.frame(mtcars, mtcars$cyl)
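
As a quick check of this claim, you can compare the result of calling the generic with the result of calling the method directly:

```r
# Call the generic, which dispatches to split.data.frame
via_generic = split(mtcars, mtcars$cyl)

# Call the method directly, skipping dispatch
via_method = split.data.frame(mtcars, mtcars$cyl)

identical(via_generic, via_method)  # TRUE
names(via_generic)                  # one group per cylinder count: "4" "6" "8"
```

In everyday code, call the generic rather than the method so that dispatch can choose the right method for whatever object you pass.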

When an object has more than one class, method dispatch considers them from left to right. For instance, matrices created with the matrix function have class matrix and also class array. If you pass a matrix to a generic function, R will first look for a matrix method. If there isn’t one, R will look for an array method. If there still isn’t one, R will look for a default method. If there’s no default method either, then R raises an error.
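
The fallback order described above can be sketched with a small generic. The name kind and its methods are hypothetical, and the example assumes R 4.0.0 or later, where matrices have class matrix and array:

```r
# Hypothetical generic with an array method and a default method,
# but no matrix method
kind = function(x) UseMethod("kind")
kind.array = function(x) "array method"
kind.default = function(x) "default method"

m = matrix(1:6, nrow = 2)

kind(m)    # no kind.matrix exists, so R falls back to kind.array
kind(1:6)  # a plain vector matches neither class, so kind.default is used
```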

The sloop package provides useful functions for inspecting S3 classes, generics, and methods, as well as the method dispatch process. For example, you can use the s3_dispatch function to see which method will be selected when you call a generic:

# install.packages("sloop")
library("sloop")
s3_dispatch(split(mtcars, mtcars$cyl))

## => split.data.frame
##  * split.default

The selected method is indicated with an arrow =>, while methods that were not selected are indicated with a star *. See ?s3_dispatch for complete details about the output from the function.

6.4.2 Creating Objects

S3 classes are defined implicitly by their associated methods. To create a new class, decide what its structure will be and define some methods. To create an object of the class, set an object’s class attribute to the class name.

For example, let’s create a generic function get_age that returns the age of an animal in terms of a typical human lifespan. First define the generic:

get_age = function(animal) {
  UseMethod("get_age")
}

Next, let’s create a class Human to represent a human. Since humans are animals, let’s make each Human also have class Animal. You can use any type of object as the foundation for a class, but lists are often a good choice because they can store multiple named elements. Here’s how to create a Human object with a field age_years to store the age in years:

lyra = list(age_years = 13)
class(lyra) = c("Human", "Animal")

Class names can include any characters that are valid in R variable names. One common convention is to make them start with an uppercase letter, to distinguish them from variables.
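
To confirm which classes an object has, you can use the class and inherits functions. The sketch below recreates the lyra object so that it runs on its own:

```r
lyra = list(age_years = 13)
class(lyra) = c("Human", "Animal")

class(lyra)               # "Human" "Animal"
inherits(lyra, "Animal")  # TRUE, because Animal is one of lyra's classes
inherits(lyra, "Dog")     # FALSE

# Removing the class attribute recovers the underlying list
is.list(unclass(lyra))    # TRUE
```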

If you want to make constructing an object of a given class less ad hoc (and less error-prone), define a constructor function that returns a new object of that class. A common convention is to give the constructor function the same name as the class:

Human = function(age_years) {
  obj = list(age_years = age_years)
  class(obj) = c("Human", "Animal")
  obj
}

asriel = Human(45)
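
A constructor is also a natural place to check arguments before building the object. The following is only a sketch of that idea: the name new_human and the specific checks are assumptions made for illustration, not part of the example above.

```r
# Hypothetical validating constructor for the Human class
new_human = function(age_years) {
  # Reject anything that is not a single non-negative number
  stopifnot(is.numeric(age_years), length(age_years) == 1, age_years >= 0)

  structure(list(age_years = age_years), class = c("Human", "Animal"))
}

marisa = new_human(35)
class(marisa)  # "Human" "Animal"

# Invalid input raises an error instead of creating a malformed object
result = try(new_human("thirteen"), silent = TRUE)
inherits(result, "try-error")  # TRUE
```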

The get_age generic doesn’t have any methods yet, so R raises an error if you call it (regardless of the argument’s class):

get_age(lyra)

## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('Human', 'Animal')"

Let’s define a method for Animal objects. The method will just return the value of the age_years field:

get_age.Animal = function(animal) {
  animal$age_years
}

get_age(lyra)

## [1] 13

get_age(asriel)

## [1] 45

Notice that the get_age generic still raises an error for objects that don’t have class Animal:

get_age(3)

## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('double', 'numeric')"

Now let’s create a class Dog to represent dogs. Like the Human class, a Dog is a kind of Animal and has an age_years field. Each Dog will also have a breed field to store the breed of the dog:

Dog = function(age_years, breed) {
  obj = list(age_years = age_years, breed = breed)
  class(obj) = c("Dog", "Animal")
  obj
}

pongo = Dog(10, "dalmatian")

Since a Dog is an Animal, the get_age generic returns a result:

get_age(pongo)

## [1] 10

Recall that the goal of this example was to make get_age return the age of an animal in terms of a human lifespan. For the purposes of this example, assume a dog’s age in “human years” is about 5 times its age in actual years. You can implement a get_age method for Dog to take this into account:

get_age.Dog = function(animal) {
  animal$age_years * 5
}

Now the get_age generic returns an age in terms of a human lifespan whether its argument is a Human or a Dog:

get_age(lyra)

## [1] 13

get_age(pongo)

## [1] 50

You can create new data structures in R by creating classes, and you can add functionality to new or existing generics by creating new methods. Before creating a class, think about whether R already provides a data structure that suits your needs.
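
For instance, you can add methods for R’s built-in format and print generics so that Animal objects display in a compact way. This is a minimal sketch; the output format is an arbitrary choice made for this example, and the Dog constructor is repeated so the code runs on its own:

```r
Dog = function(age_years, breed) {
  obj = list(age_years = age_years, breed = breed)
  class(obj) = c("Dog", "Animal")
  obj
}

# Hypothetical format method: builds a one-line description
format.Animal = function(x, ...) {
  paste0("<", class(x)[1], ", ", x$age_years, " years old>")
}

# Hypothetical print method: prints the formatted description
print.Animal = function(x, ...) {
  cat(format(x), "\n", sep = "")
  invisible(x)
}

pongo = Dog(10, "dalmatian")
print(pongo)  # prints <Dog, 10 years old>
```

Because both methods are defined for Animal, they apply to every class that inherits from Animal, including Human and Dog.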

It’s uncommon to create new classes in the course of a typical data analysis, but many packages do provide new classes. Regardless of whether you ever create a new class, understanding the details means understanding how S3 works, and thus how R’s many S3 generic functions work.

As a final note, while exploring S3 methods you may also encounter the NextMethod function. The NextMethod function redirects dispatch to the method that is the next closest match for an object’s class. You can learn more by reading ?NextMethod.
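
As a sketch of how NextMethod can be used, the Dog method from the example above could be rewritten to reuse the Animal method instead of reading the age_years field itself. This is one possible design, shown with the definitions repeated so that it runs on its own:

```r
get_age = function(animal) {
  UseMethod("get_age")
}

get_age.Animal = function(animal) {
  animal$age_years
}

# Alternative Dog method: ask the next method (get_age.Animal) for the
# age in years, then convert to "human years"
get_age.Dog = function(animal) {
  NextMethod() * 5
}

pongo = structure(
  list(age_years = 10, breed = "dalmatian"),
  class = c("Dog", "Animal")
)

get_age(pongo)  # 50
```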

6.5 Other Object Systems

R provides many systems for object-oriented programming besides S3. Some are built into the language, while others are provided by packages. A few of the most popular systems are:

S4 – S4 is built into R and is the most widely-used system after S3. Like S3, S4 frames OOP in terms of generic functions and methods. The major differences are that S4 is stricter (the structure of each class must be formally defined) and that S4 generics can dispatch on the classes of multiple arguments instead of just one. The fields of an S4 object are called slots, and R provides a special operator @ to access them. Most of the packages in the Bioconductor project use S4.
Reference classes – Objects created with the S3 and S4 systems generally follow the copy-on-write rule, but this can be inefficient for some programming tasks. The reference class system is built into R and provides a way to create reference objects with a formal class structure (in the spirit of S4). This system is more like OOP systems in languages like Java and Python than S3 or S4 are. The reference class system is sometimes jokingly called “R5”, but that isn’t an official name.
R6 – An alternative to reference classes, provided by the R6 package and created by Winston Chang, a developer at Posit (formerly RStudio). R6 aims to be simpler and faster than reference classes.
S7 – A new OOP system being developed collaboratively by representatives from several different important groups in the R community, including the R core developers, Bioconductor, and Posit.
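
To give a rough feel for the two built-in systems in the list above, here is a minimal sketch using the methods package. The class, generic, and field names (Animal4, age_of, Counter, and so on) are hypothetical and chosen only for this example.

```r
library(methods)

# S4: the class structure is declared formally, slot by slot
setClass("Animal4", representation(name = "character", age_years = "numeric"))

# S4 generics call standardGeneric rather than UseMethod
setGeneric("age_of", function(x) standardGeneric("age_of"))
setMethod("age_of", "Animal4", function(x) x@age_years)

iorek = new("Animal4", name = "Iorek", age_years = 30)
age_of(iorek)  # 30
iorek@name     # slots are accessed with @

# S4 rejects objects that do not match the class definition
bad = try(new("Animal4", name = "Iorek", age_years = "old"), silent = TRUE)
inherits(bad, "try-error")  # TRUE

# Reference classes: objects are modified in place rather than copied
Counter = setRefClass("Counter",
  fields = list(n = "numeric"),
  methods = list(
    increment = function() {
      n <<- n + 1
      invisible(.self)
    }
  )
)

a = Counter$new(n = 0)
b = a          # b refers to the same object as a, not a copy
b$increment()
a$n            # 1, because the change made through b is visible through a
```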

Many of these systems are described in more detail in Hadley Wickham’s book Advanced R.