6 Language Fundamentals
This chapter is part 1 (of 2) of Thinking in R, a workshop series about how R works and how to examine code critically. The major topics of this chapter are how R stores and locates variables (including functions) defined in your code and in packages, and how some of R’s object-oriented programming systems work.
Learning Objectives
After completing this session, learners should be able to:
- Explain what environments are and how R uses them
- Explain how R looks up variables
- Explain what attributes are and how R uses them
- Get and set attributes
- Explain what (S3) classes are and how R uses them
- Explain R’s (S3) method dispatch system
- Create an (S3) class
- Describe R’s other object-oriented programming systems at a high level
6.1 Variables & Environments
Assigning and looking up values of variables are fundamental operations in R, as in most programming languages. They were likely among the first operations you learned, and now you use them instinctively. This section is a deep dive into what R actually does when you assign a variable and how R looks up the values of those variables later. Understanding the process and the data structures involved will introduce you to new programming strategies, make it easier to reason about code, and help you identify potential bugs.
6.1.1 What’s an Environment?
The foundation of how R stores and looks up variables is a data structure called an environment. Every environment has two parts:
- A frame, which is a collection of names and associated R objects.
- A parent or enclosing environment, which must be another environment.
For now, you’ll learn how to create environments and how to assign and get values from their frames. Parent environments will be explained in a later section.
You can use the new.env function to create a new environment:
## <environment: 0x55ccbb278e50>
Unlike most objects, printing an environment doesn’t print its contents. Instead, R prints its type (which is environment) and a unique identifier (0x55ccbb278e50 in this case).
The unique identifier is actually the memory address of the environment. Every object you use in R is stored as a series of bytes in your computer’s random-access memory (RAM). Each byte in memory has a unique address, similar to how each house on a street has a unique address. Memory addresses are usually just numbers counting up from 0, but they’re often written in hexadecimal (base 16, indicated by the prefix 0x) because it’s more concise. For the purposes of this reader, you can just think of the memory address as a unique identifier.
To see the names in an environment’s frame, you can call the ls or names function on the environment:
## character(0)
## character(0)
You just created the environment e, so its frame is currently empty. The printout character(0) means R returned a character vector of length 0.
You can assign an R object to a name in an environment’s frame with the dollar sign $ operator or the double square bracket [[ operator, similar to how you would assign a named element of a list. For example, one way to assign the number 8 to the name "lucky" in the environment e’s frame is:
Now there’s a name defined in the environment:
## [1] "lucky"
## [1] "lucky"
Here’s another example of assigning an object to a name in the environment:
You can assign any type of R object to a name in an environment, including other environments.
The ls function ignores names that begin with a dot . by default. For example:
## [1] "lucky" "my_message"
You can pass the argument all.names = TRUE to make the function return all names in the frame:
## [1] ".x" "lucky" "my_message"
Alternatively, you can just use the names function, which always returns all names in an environment’s frame.
Objects in an environment’s frame don’t have positions or any particular order, so they must always be assigned to a name. R raises an error if you try to assign an object to a position:
## Error in e[[3]] = 10: wrong args for environment subassignment
As you might expect, you can also use the dollar sign operator and double square bracket operator to get objects in an environment by name:
## [1] "May your coffee kick in before reality does."
## [1] 8
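You can use the exists function to check whether a specific name exists in an environment’s frame. Note that by default exists also searches parent environments; pass inherits = FALSE to restrict the check to the frame itself: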
## [1] FALSE
## [1] TRUE
Finally, you can remove a name and object from an environment’s frame with the rm function. Make sure to pass the environment as the argument to the envir parameter when you do this:
## [1] FALSE
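The input code for the steps in this section is not reproduced above, so here is a minimal sketch of the whole sequence. The value assigned to .x is a placeholder chosen for illustration, and "hi" is a hypothetical example of a name that was never assigned.

```r
# Create an empty environment and inspect its frame.
e = new.env()
ls(e)                              # character(0)

# Assign by name with $ or [[, as with a list.
e$lucky = 8
e[["my_message"]] = "May your coffee kick in before reality does."
e$.x = 1                           # placeholder value; dot-prefixed name

ls(e)                              # omits .x
ls(e, all.names = TRUE)            # includes .x
names(e)                           # also includes .x

# Get values by name.
e$my_message
e[["lucky"]]

# Check this frame only (inherits = FALSE skips the parents).
exists("hi", envir = e, inherits = FALSE)      # FALSE
exists("lucky", envir = e, inherits = FALSE)   # TRUE

# Remove a name; envir says which environment to remove it from.
rm("lucky", envir = e)
exists("lucky", envir = e, inherits = FALSE)   # FALSE
```

Passing inherits = FALSE keeps each check limited to the frame of e, which matches how the section describes frames.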
6.1.2 Reference Objects
Environments are reference objects, which means they don’t follow R’s copy-on-write rule. Under that rule, which applies to most types of objects, R automatically and silently makes a copy when you modify an object, so that any other variables that refer to the object remain unchanged.
As an example, lists follow the copy-on-write rule. Suppose you assign a list to variable x, assign x to y, and then make a change to x:
## $a
## [1] 10
## $a
## [1] 10
When you run y = x, R makes y refer to the same object as x, without using any additional memory. When you run x$a = 20, the copy-on-write rule applies, so R creates and modifies a copy of the object. From then on, x refers to the modified copy and y refers to the original.
Environments don’t follow the copy-on-write rule, so repeating the example with an environment produces a different result:
## [1] 10
## [1] 20
As before, e_y = e_x makes both e_y and e_x refer to the same object. The difference is that when you run e_x$a = 20, the copy-on-write rule does not apply and R does not create a copy of the environment. As a result, the change to e_x is also reflected in e_y.
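A minimal sketch of the two cases side by side, assuming x starts out as a list with a single element a equal to 10 (the original input code is not shown here):

```r
# Lists follow copy-on-write: modifying x leaves y unchanged.
x = list(a = 10)
y = x
x$a = 20
y$a                   # 10

# Environments are reference objects: e_x and e_y name the same object.
e_x = new.env()
e_x$a = 10
e_y = e_x
e_x$a = 20
e_y$a                 # 20
identical(e_x, e_y)   # TRUE
```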
Environments and other reference objects can be confusing since they behave differently from most objects. You usually won’t need to construct or manipulate environments directly, but it’s useful to know how to inspect them.
6.1.3 The Local Environment
Think of environments as containers for variables. Whenever you assign a variable, R assigns it to the frame of an environment. Whenever you get a variable, R searches through one or more environments for its value.
When you start R, R creates a special environment called the global environment to store variables you assign at the prompt or the top level of a script. You can use the globalenv function to get the global environment:
## <environment: R_GlobalEnv>
The global environment is easy to recognize because its unique identifier is R_GlobalEnv rather than its memory address (even though it’s stored in your computer’s memory like any other object).
The local environment is the environment where the assignment operators <- and = assign variables. Think of the local environment as the environment that’s currently active. The local environment varies depending on the context where you run an expression. You can get the local environment with the environment function:
## <environment: R_GlobalEnv>
As you can see, at the R prompt or the top level of an R script, the local environment is just the global environment.
Except for names, the functions introduced in Section 6.1.1 default to the local environment if you don’t set the envir parameter. This makes them convenient for inspecting or modifying the local environment’s frame:
## [1] "e" "e_x" "e_y" "g" "loc"
## [6] "source_rmd" "x" "y"
## [1] "e" "e_x" "e_y" "g" "loc"
## [6] "source_rmd" "x" "y"
If you assign a variable, it appears in the local environment’s frame:
## [1] "coffee" "e" "e_x" "e_y" "g"
## [6] "loc" "source_rmd" "x" "y"
## [1] "Right. No coffee. This is a terrible planet."
Conversely, if you assign an object in the local environment’s frame, you can access it as a variable:
## [1] "Tea isn't coffee!"
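The steps in this section can be sketched as follows. The variable names g and loc are taken from the ls output shown earlier in the section; treating them as the global and local environments is an assumption, since the original input code is not shown.

```r
g = globalenv()        # the global environment
loc = environment()    # the local environment (the global one at top level)

# Assigning a variable adds it to the local environment's frame.
coffee = "Right. No coffee. This is a terrible planet."
exists("coffee", envir = loc, inherits = FALSE)   # TRUE

# Assigning into the local frame directly creates an ordinary variable.
loc$tea = "Tea isn't coffee!"
tea
```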
6.1.4 Call Environments
Every time you call (not define) a function, R creates a new environment. R uses this call environment as the local environment while the code in the body of the function runs. As a result, assigning variables in a function doesn’t affect the global environment, and they generally can’t be accessed from outside of the function.
For example, consider this function which assigns the variable hello:
Even after calling the function, there’s no variable hello in the global environment:
## [1] "loc" "my_hello" "tea" "e_x" "x"
## [6] "e_y" "y" "coffee" "source_rmd" "e"
## [11] "g" ".First"
As further demonstration, consider this modified version of my_hello, which returns the call environment:
The call environment is not the global environment:
## <environment: 0x55ccbd771558>
And the variable hello exists in the call environment, but not in the global environment:
## [1] FALSE
## [1] TRUE
## [1] "from the other side"
Each call to a function creates a new call environment. So if you call my_hello again, it returns a different environment (pay attention to the memory address):
## <environment: 0x55ccbd771558>
## <environment: 0x55ccbdcf2748>
By creating a new environment for every call, R isolates code in the function body from code outside of the body. As a result, most R functions have no side effects. This is a good thing, since it means you generally don’t have to worry about calls assigning, reassigning, or removing variables in other environments (such as the global environment!).
The local function provides another way to create a new local environment in which to run code. However, it’s usually preferable to define and call a function, since that makes it easier to test and reuse the code.
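As a rough sketch of the ideas in this section, the version of my_hello below returns its call environment, and a call to local runs code in a fresh environment. The value "from the other side" comes from the output above; the names e1, e2, and result, and the string used inside local, are illustrative assumptions.

```r
my_hello = function() {
  hello = "from the other side"
  environment()    # return this call's environment
}

e1 = my_hello()
e2 = my_hello()

identical(e1, e2)                               # FALSE: one environment per call
exists("hello", envir = e1, inherits = FALSE)   # TRUE
e1$hello

# local() evaluates code in a new environment without defining a function.
result = local({
  hello = "hello from local"   # hypothetical value; assigned only inside local()
  nchar(hello)
})
result
```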
6.1.5 Lexical Scoping
A function can access variables outside of its local environment, but only if those variables exist in the environment where the function was defined (not called). This property is called lexical scoping.
For example, assign a variable tea and function get_tea in the global environment:
Then the get_tea function can access the tea variable:
## [1] "Tea isn't coffee!"
Note that variable lookup takes place when a function is called, not when it’s defined. This is called dynamic lookup.
For example, the result from get_tea changes if you change the value of tea:
## [1] "Tea for two."
## [1] "Tea isn't coffee!"
When a local variable (a variable in the local environment) and a non-local variable have the same name, R almost always prioritizes the local variable. For instance:
## [1] "Earl grey is tea!"
The function body assigns the local variable tea to "Earl grey is tea!", so R returns that value rather than "Tea isn't coffee!". In other words, local variables mask, or hide, non-local variables with the same name.
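The scoping behavior described so far can be sketched in a few lines. The string values match the outputs above; the function name get_local_tea is a hypothetical stand-in, since the original input code is not shown.

```r
tea = "Tea isn't coffee!"
get_tea = function() tea      # tea is not local to get_tea
get_tea()                     # "Tea isn't coffee!"

tea = "Tea for two."          # lookup happens at call time...
get_tea()                     # ...so the result is now "Tea for two."
tea = "Tea isn't coffee!"     # restore the original value

get_local_tea = function() {
  tea = "Earl grey is tea!"   # local variable masks the non-local one
  tea
}
get_local_tea()               # "Earl grey is tea!"
tea                           # unchanged: "Tea isn't coffee!"
```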
There’s only one case where R doesn’t prioritize local variables. To see it, consider this call:
## [1] 10.5
The variable mean must refer to a function, because it’s being called: it’s followed by parentheses ( ), the call syntax. In this situation, R ignores local variables that aren’t functions, so you can write code such as:
## [1] 5.5
That said, defining a local variable with the same name as a function can still be confusing, so it’s usually considered a bad practice.
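A small sketch of this special case, assuming a hypothetical function named summarize that defines a local numeric vector called mean:

```r
summarize = function() {
  mean = c(1, 10)   # a local, non-function variable named mean
  mean(mean)        # the called name resolves to the function; the argument is the vector
}
summarize()         # 5.5
```

The call works, but reading mean(mean) takes extra effort, which is why the practice is discouraged.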
To help you reason about lexical scoping, you can get the environment where a function was defined by calling the environment function on the function itself. For example, the get_tea function was defined in the global environment:
## <environment: R_GlobalEnv>
6.1.6 Variable Lookup
The key to how R looks up variables and how lexical scoping works is that in addition to a frame, every environment has a parent environment.
When R evaluates a variable in an expression, it starts by looking for the variable in the local environment’s frame.
For example, at the prompt, tea is a local variable because that’s where you assigned it. If you enter tea at the prompt, R finds tea in the local environment’s frame and returns the value:
## [1] "Tea isn't coffee!"
On the other hand, in the get_tea function from Section 6.1.5, tea is not a local variable:
To make this more concrete, consider a function which just returns its call environment:
The call environment clearly doesn’t contain the tea variable:
## character(0)
When a variable doesn’t exist in the local environment’s frame, then R gets the parent environment of the local environment.
You can use the parent.env function to get the parent environment of an environment. For the call environment e, the parent environment is the global environment, because that’s where get_call_env was defined:
## <environment: R_GlobalEnv>
When R can’t find tea in the call environment’s frame, R gets the parent environment, which is the global environment. Then R searches for tea in the global environment, finds it, and returns the value.
R repeats the lookup process for as many parents as necessary, stopping only when it finds the variable or reaches a special environment called the empty environment, which is explained in Section 6.1.7. If R reaches the empty environment without finding the variable, it raises an error.
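To make the lookup process concrete, here is a rough sketch that imitates it with an explicit loop. The function find_var is a hypothetical helper written for illustration; it is not part of R and is not how R implements lookup internally.

```r
find_var = function(name, env = parent.frame()) {
  # Walk up the chain of parents until the empty environment is reached.
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(env)   # found: this environment's frame defines the name
    }
    env = parent.env(env)
  }
  NULL              # reached the empty environment without finding the name
}

get_call_env = function() environment()
e = get_call_env()

ls(e)                              # character(0): the call frame is empty
find_var("pi", e)                  # an ancestor of e, not e itself
find_var("no_such_name_xyz", e)    # NULL
```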
The lookup process also hints at how R finds variables and functions such as pi and sqrt that clearly aren’t defined in the global environment. They’re defined in parent environments of the global environment.
The get function looks up a variable by name:
## [1] 3.141593
You can use the get function to look up a variable starting from a specific environment or to control how R looks up the variable. For example, if you set inherits = FALSE, R will not search any parent environments:
## Error in get("pi", inherits = FALSE): object 'pi' not found
As with most functions for inspecting and modifying environments, use the get function sparingly. R already provides a much simpler way to get a variable: the variable’s name.
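A brief sketch of both uses of get, assuming a freshly created environment e whose frame is empty. The tryCatch wrapper is only there so the sketch can run to completion despite the error.

```r
e = new.env()            # empty frame; its parent is the current environment

get("pi", envir = e)     # found in an ancestor of e

# With inherits = FALSE only e's own frame is searched, so the lookup fails.
result = tryCatch(
  get("pi", envir = e, inherits = FALSE),
  error = function(err) conditionMessage(err)
)
result                   # the text of the error message
```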
6.1.7 The Search Path
R also uses environments to manage packages. Each time you load a package with library or require, R creates a new environment:
- The frame contains the package’s exported variables.
- The parent environment is the environment of the previous package loaded.
- This new environment becomes the parent of the global environment.
R always loads several built-in packages at startup, which contain variables and functions such as pi and sqrt. Thus the global environment is never the top-level environment. For instance:
## <environment: package:stats>
## attr(,"name")
## [1] "package:stats"
## attr(,"path")
## [1] "/usr/lib/R/library/stats"
## <environment: package:graphics>
## attr(,"name")
## [1] "package:graphics"
## attr(,"path")
## [1] "/usr/lib/R/library/graphics"
Notice that package environments use package: and the name of the package as their unique identifier rather than their memory address.
The chain of package environments is called the search path. The search function returns the search path:
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
The base environment (identified by base) is always the topmost environment. You can use the baseenv function to get the base environment:
## <environment: base>
The base environment’s parent is the special empty environment (identified by R_EmptyEnv), which contains no variables and has no parent. You can use the emptyenv function to get the empty environment:
## <environment: R_EmptyEnv>
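One way to see that the search path is just a chain of parent environments is to walk it by hand. The loop below is an illustrative sketch; the variable names env and chain are arbitrary.

```r
env = globalenv()
chain = character(0)

# Follow parent links from the global environment to the empty environment.
while (!identical(env, emptyenv())) {
  chain = c(chain, environmentName(env))
  env = parent.env(env)
}

chain                               # starts with "R_GlobalEnv", ends with "base"
length(chain) == length(search())   # one entry per element of the search path
```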
Understanding R’s process for looking up variables and the search path is helpful for resolving conflicts between the names of variables in packages.
6.1.7.1 The Colon Operators
The double-colon operator :: gets a variable in a specific package. Two common uses:
- Disambiguate which package you mean when several packages have variables with the same names.
- Get a variable from a package without loading the package.
For example:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## function (x, filter, method = c("convolution", "recursive"),
## sides = 2L, circular = FALSE, init = NULL)
## {
## method <- match.arg(method)
## x <- as.ts(x)
## storage.mode(x) <- "double"
## xtsp <- tsp(x)
## n <- as.integer(NROW(x))
## if (is.na(n))
## stop(gettextf("invalid value of %s", "NROW(x)"), domain = NA)
## nser <- NCOL(x)
## filter <- as.double(filter)
## nfilt <- as.integer(length(filter))
## if (is.na(nfilt))
## stop(gettextf("invalid value of %s", "length(filter)"),
## domain = NA)
## if (anyNA(filter))
## stop("missing values in 'filter'")
## if (method == "convolution") {
## if (nfilt > n)
## stop("'filter' is longer than time series")
## sides <- as.integer(sides)
## if (is.na(sides) || (sides != 1L && sides != 2L))
## stop("argument 'sides' must be 1 or 2")
## circular <- as.logical(circular)
## if (is.na(circular))
## stop("'circular' must be logical and not NA")
## if (is.matrix(x)) {
## y <- matrix(NA, n, nser)
## for (i in seq_len(nser)) y[, i] <- .Call(C_cfilter,
## x[, i], filter, sides, circular)
## }
## else y <- .Call(C_cfilter, x, filter, sides, circular)
## }
## else {
## if (missing(init)) {
## init <- matrix(0, nfilt, nser)
## }
## else {
## ni <- NROW(init)
## if (ni != nfilt)
## stop("length of 'init' must equal length of 'filter'")
## if (NCOL(init) != 1L && NCOL(init) != nser) {
## stop(sprintf(ngettext(nser, "'init' must have %d column",
## "'init' must have 1 or %d columns", domain = "R-stats"),
## nser), domain = NA)
## }
## if (!is.matrix(init))
## dim(init) <- c(nfilt, nser)
## }
## ind <- seq_len(nfilt)
## if (is.matrix(x)) {
## y <- matrix(NA, n, nser)
## for (i in seq_len(nser)) y[, i] <- .Call(C_rfilter,
## x[, i], filter, c(rev(init[, i]), double(n)))[-ind]
## }
## else y <- .Call(C_rfilter, x, filter, c(rev(init[, 1L]),
## double(n)))[-ind]
## }
## tsp(y) <- xtsp
## class(y) <- if (nser > 1L)
## c("mts", "ts")
## else "ts"
## y
## }
## <bytecode: 0x55ccbd723f58>
## <environment: namespace:stats>
## function (.data, ..., .by = NULL, .preserve = FALSE)
## {
## check_by_typo(...)
## by <- enquo(.by)
## if (!quo_is_null(by) && !is_false(.preserve)) {
## abort("Can't supply both `.by` and `.preserve`.")
## }
## UseMethod("filter")
## }
## <bytecode: 0x55ccb9836048>
## <environment: namespace:dplyr>
## function (data = NULL, mapping = aes(), ..., environment = parent.frame())
## {
## UseMethod("ggplot")
## }
## <bytecode: 0x55ccbcce8e00>
## <environment: namespace:ggplot2>
The related triple-colon operator ::: gets a private variable in a package. Generally these are private for a reason! Only use ::: if you’re sure you know what you’re doing.
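A short sketch of both operators using only the stats package, which ships with R. The choice of plot.lm as an example of a private (unexported) variable is ours, made for illustration.

```r
# :: works whether or not the package is attached with library().
f = stats::filter
environmentName(environment(f))               # "stats"

# plot.lm is defined in the stats namespace but is not exported...
"plot.lm" %in% getNamespaceExports("stats")   # FALSE

# ...so it has to be reached with ::: instead of ::.
is.function(stats:::plot.lm)                  # TRUE
```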
6.2 Closures
A closure is a function together with an enclosing environment. In order to support lexical scoping, every R function is a closure (except a few very special built-in functions). The enclosing environment is generally the environment where the function was defined.
Recall that you can use the environment function to get the enclosing environment of a function:
## <environment: R_GlobalEnv>
Since the enclosing environment exists whether or not you call the function, you can use the enclosing environment to store and share data between calls.
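You can use the superassignment operator <<- to assign to a variable in an ancestor environment. R searches the ancestors of the local environment for an existing variable with that name and reassigns the first one it finds; if no such variable exists, R creates the variable in the global environment.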
For example, suppose you want to make a function that returns the number of times it’s been called:
In this example, the enclosing environment is the global environment. Each time you call count, it assigns a new value to the counter variable in the global environment.
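The code for this example is not shown above. A minimal sketch consistent with the description, assuming counter starts at 0 and using first and second as illustrative names:

```r
counter = 0

count = function() {
  counter <<- counter + 1   # reassigns counter in the enclosing environment
  counter
}

first = count()
second = count()
c(first, second, counter)   # 1 2 2
```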
6.2.1 Tidy Closures
The count function has a side effect: it reassigns a non-local variable. As discussed in Section 6.1.4, functions with side effects make code harder to understand and reason about. Use side effects sparingly and try to isolate them from the global environment.
When side effects aren’t isolated, several things can go wrong. The function might overwrite the user’s variables:
## [1] 1
Or the user might overwrite the function’s variables:
## Error in counter + 1: non-numeric argument to binary operator
For functions that rely on storing information in their enclosing environment, there are several different ways to make sure the enclosing environment is isolated. Two of these are:
- Define and return the function from the body of another function. The second function is called a factory function because it produces (returns) the first. The enclosing environment of the first function is the call environment of the second.
- Define the function inside of a call to local.
Here’s a template for the first approach:
make_fn = function() {
  # Define variables in the enclosing environment here:

  # Define and return the function here:
  function() {
    # ...
  }
}

f = make_fn()
# Now you can call f() as you would any other function.
For example, you can use the template for the count function:
make_count = function() {
  counter = 0

  function() {
    counter <<- counter + 1
    counter
  }
}

count = make_count()
Then calling count has no effect on the global environment:
## [1] 1
## [1] 10
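The second approach, defining the function inside a call to local, can be sketched the same way. This is a minimal illustration that assumes the same counter logic as make_count; the names a and b are arbitrary.

```r
count = local({
  counter = 0               # lives only in the environment created by local()

  function() {
    counter <<- counter + 1
    counter
  }
})

a = count()
b = count()
c(a, b)                       # 1 2
environment(count)$counter    # 2: the state is stored with the closure
```

The factory approach is handy when you need several independent counters, while local is enough for a single function.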
6.3 Attributes
An attribute is named metadata attached to an R object. Attributes provide basic information about objects and play an important role in R’s class system, so most objects have attributes. Some common attributes are:
- class – the class
- row.names – row names
- names – element names or column names
- dim – dimensions (on matrices)
- dimnames – names of dimensions (on matrices)
R provides helper functions to get and set the values of the common attributes. These functions usually have the same name as the attribute. For example, the class function gets or sets the class attribute:
## [1] "data.frame"
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
An attribute can have any name and any value. You can use the attr function to get or set an attribute by name:
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
## [1] 42
You can get all of the attributes attached to an object with the attributes function:
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
##
## $row.names
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## $class
## [1] "data.frame"
##
## $foo
## [1] 42
You can use the structure function to set multiple attributes on an object:
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
##
## $row.names
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## $class
## [1] "data.frame"
##
## $foo
## [1] 50
##
## $bar
## [1] 100
Vectors usually don’t have attributes:
## NULL
But the class function still returns a class:
## [1] "numeric"
When a helper function exists to get or set an attribute, use the helper function rather than attr. This will make your code clearer and ensure that attributes with special behavior and requirements, such as dim, are set correctly.
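The attribute operations in this section can be sketched together as follows. The attribute names foo and bar and their values match the outputs above; working on a copy named cars and the vector v are assumptions made for illustration.

```r
cars = mtcars                  # work on a copy of the built-in dataset

class(cars)                    # helper for the class attribute
attr(cars, "foo") = 42         # set an arbitrary attribute by name
attr(cars, "foo")

# structure() sets several attributes at once and returns the object.
cars = structure(cars, foo = 50, bar = 100)
names(attributes(cars))

v = c(1, 2, 3)
attributes(v)                  # NULL
class(v)                       # "numeric"

dim(v) = c(1, 3)               # the dim helper checks the new dimensions
is.matrix(v)                   # TRUE
```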
6.4 S3
R provides several systems for object-oriented programming (OOP), a programming paradigm where code is organized into a collection of “objects” that interact with each other. These systems provide a way to create new data structures with customized behavior, and also underpin how some of R’s built-in functions work.
The S3 system is particularly important for understanding R, because it’s the oldest and most widely used. This section focuses on S3, while Section 6.5 provides an overview of R’s other OOP systems.
The central idea of S3 is that some functions can be generic, meaning they perform different computations (and run different code) for different classes of objects.
Conversely, every object has at least one class, which dictates how the object behaves. For most objects, the class is independent of type and is stored in the class attribute. You can get the class of an object with the class function. For example, the class of a data frame is data.frame:
## [1] "data.frame"
Some objects have more than one class. One example of this is matrices:
## [1] "matrix" "array"
When an object has multiple classes, they’re ordered from highest to lowest priority. So the matrix m will primarily behave like a matrix, but it can also behave like an array. The priority of classes is often described in terms of a child-parent relationship: array is the parent class of matrix, or equivalently, the class matrix inherits from the class array.
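You can check these relationships with the inherits function. The sketch below assumes R 4.0.0 or later for the matrix results, and the class names child and parent are hypothetical.

```r
m = matrix(1:6, nrow = 2)
class(m)                     # "matrix" "array" in R 4.0.0 or later
inherits(m, "matrix")        # TRUE
inherits(m, "data.frame")    # FALSE

# An object with an explicit two-element class attribute.
obj = structure(list(), class = c("child", "parent"))
inherits(obj, "parent")      # TRUE

# which = TRUE reports each class's position in class(obj).
inherits(obj, c("parent", "child"), which = TRUE)   # 2 1
```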
6.4.1 Method Dispatch
A function is generic if it selects and calls another function, called a method, based on the class of one of its arguments. A generic function can have any number of methods, and each must have the same signature, or collection of parameters, as the generic. Think of a generic function’s methods as the range of different computations it can perform, or alternatively as the range of different classes it can accept as input.
Method dispatch, or just dispatch, is the process of selecting a method based on the class of an argument. You can identify S3 generics because they always call the UseMethod function, which initiates S3 method dispatch.
Many of R’s built-in functions are generic. One example is the split function, which splits a data frame or vector into groups:
## function (x, f, drop = FALSE, ...)
## UseMethod("split")
## <bytecode: 0x55ccbac6ef00>
## <environment: namespace:base>
Another is the plot function, which creates a plot:
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x55ccbcaacb28>
## <environment: namespace:base>
The UseMethod function requires the name of the generic (as a string) as its first argument. The second argument is optional and specifies the object to use for method dispatch. By default, the first argument to the generic is used for method dispatch. So for split, the argument for x is used for method dispatch. R checks the class of the argument and selects a matching method.
You can use the methods function to list all of the methods of a generic. The methods for split are:
## [1] split.data.frame split.Date split.default split.POSIXct
## see '?methods' for accessing help and source code
Method names always have the form GENERIC.CLASS, where GENERIC is the name of the generic and CLASS is the name of a class. For instance, split.data.frame is the split method for objects with class data.frame.
Methods named GENERIC.default are a special case: they are default methods, selected only if none of the other methods match the class during dispatch. So split.default is the default method for split. Most generic functions have a default method.
Methods are ordinary R functions. For instance, the code for split.data.frame is:
## function (x, f, drop = FALSE, ...)
## {
## if (inherits(f, "formula"))
## f <- .formula2varlist(f, x)
## lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...),
## function(ind) x[ind, , drop = FALSE])
## }
## <bytecode: 0x55ccbaf26f60>
## <environment: namespace:base>
Sometimes methods are defined privately in packages and can’t be accessed by typing their name at the prompt. You can use the getAnywhere function to get the code for these methods. For instance, to get the code for plot.data.frame:
## A single object matching 'plot.data.frame' was found
## It was found in the following places
## registered S3 method for plot from namespace graphics
## namespace:graphics
## with value
##
## function (x, ...)
## {
## plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L],
## ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab,
## ...)
## if (!is.data.frame(x))
## stop("'plot.data.frame' applied to non data frame")
## if (ncol(x) == 1) {
## x1 <- x[[1L]]
## if (class(x1)[1L] %in% c("integer", "numeric"))
## stripchart(x1, ...)
## else plot(x1, ...)
## }
## else if (ncol(x) == 2) {
## plot2(x, ...)
## }
## else {
## pairs(data.matrix(x), ...)
## }
## }
## <bytecode: 0x55ccbcb9eab0>
## <environment: namespace:graphics>
As a demonstration of method dispatch, consider this code to split the mtcars dataset by number of cylinders:
The split function is generic and dispatches on its first argument. In this case, the first argument is mtcars, which has class data.frame. Since the method split.data.frame exists, R calls split.data.frame with the same arguments you used to call the generic split function. In other words, R calls:
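The code for this step is not shown here. As a sketch, assuming the grouping variable was mtcars$cyl, calling the generic and calling the method directly give the same result:

```r
by_cyl = split(mtcars, mtcars$cyl)              # call the generic
direct = split.data.frame(mtcars, mtcars$cyl)   # call the method directly

identical(by_cyl, direct)   # TRUE
names(by_cyl)               # "4" "6" "8"
```

In everyday code, call the generic and let R select the method for you.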
When an object has more than one class, method dispatch considers them from left to right. For instance, matrices created with the matrix function have class matrix and also class array. If you pass a matrix to a generic function, R will first look for a matrix method. If there isn’t one, R will look for an array method. If there still isn’t one, R will look for a default method. If there’s no default method either, then R raises an error.
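The same fallback order can be seen with a small generic of your own. Everything in this sketch is hypothetical: the generic describe, its methods, and the classes child and parent exist only for illustration.

```r
describe = function(x, ...) UseMethod("describe")
describe.parent = function(x, ...) "parent method"
describe.default = function(x, ...) "default method"

obj = structure(list(), class = c("child", "parent"))

describe(obj)    # no describe.child exists, so R falls back to describe.parent
describe(1:3)    # no class-specific method matches, so R uses describe.default
```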
The sloop package provides useful functions for inspecting S3 classes, generics, and methods, as well as the method dispatch process. For example, you can use the s3_dispatch function to see which method will be selected when you call a generic:
## => split.data.frame
## * split.default
The selected method is indicated with an arrow =>, while methods that were not selected are indicated with a star *. See ?s3_dispatch for complete details about the output from the function.
6.4.2 Creating Objects
S3 classes are defined implicitly by their associated methods. To create a new class, decide what its structure will be and define some methods. To create an object of the class, set an object’s class attribute to the class name.
For example, let’s create a generic function get_age that returns the age of an animal in terms of a typical human lifespan. First define the generic:
Next, let’s create a class Human to represent a human. Since humans are animals, let’s make each Human also have class Animal. You can use any type of object as the foundation for a class, but lists are often a good choice because they can store multiple named elements. Here’s how to create a Human object with a field age_years to store the age in years:
Class names can include any characters that are valid in R variable names. One common convention is to make them start with an uppercase letter, to distinguish them from variables.
If you want to make constructing objects of a class less ad hoc (and less error-prone), define a constructor function that returns a new object of that class. A common convention is to give the constructor function the same name as the class:
Human = function(age_years) {
  obj = list(age_years = age_years)
  class(obj) = c("Human", "Animal")
  obj
}
asriel = Human(45)
The get_age generic doesn’t have any methods yet, so R raises an error if you call it (regardless of the argument’s class):
## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('Human', 'Animal')"
Let’s define a method for Animal objects. The method will just return the value of the age_years field:
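A minimal sketch of such a method follows. It repeats the generic and creates two Human objects so that it runs on its own; the object name lyra and the argument name animal are assumptions:

```r
get_age = function(animal) {
  UseMethod("get_age")
}

# S3 method names have the form generic.class.
get_age.Animal = function(animal) {
  animal$age_years
}

# Two objects with classes Human and Animal (lyra is a hypothetical name).
lyra = structure(list(age_years = 13), class = c("Human", "Animal"))
asriel = structure(list(age_years = 45), class = c("Human", "Animal"))

get_age(lyra)
get_age(asriel)
```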
## [1] 13
## [1] 45
Notice that the get_age generic still raises an error for objects that don’t have class Animal:
## Error in UseMethod("get_age"): no applicable method for 'get_age' applied to an object of class "c('double', 'numeric')"
Now let’s create a class Dog to represent dogs. Like the Human class, a Dog is a kind of Animal and has an age_years field. Each Dog will also have a breed field to store the breed of the dog:
Dog = function(age_years, breed) {
  obj = list(age_years = age_years, breed = breed)
  class(obj) = c("Dog", "Animal")
  obj
}
pongo = Dog(10, "dalmatian")
Since a Dog is an Animal, the get_age generic returns a result:
## [1] 10
Recall that the goal of this example was to make get_age return the age of an animal in terms of a human lifespan. A dog’s age in “human years” is about 5 times its age in actual years. You can implement a get_age method for Dog to take this into account:
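A minimal sketch of such a method, repeating the generic and the Animal method so that it runs on its own (the argument name animal is an assumption):

```r
get_age = function(animal) {
  UseMethod("get_age")
}

# Fallback for any Animal: report the age in years unchanged.
get_age.Animal = function(animal) {
  animal$age_years
}

# More specific method for Dog objects. Dog comes before Animal in the class
# vector, so R selects this method first.
get_age.Dog = function(animal) {
  animal$age_years * 5
}

pongo = structure(
  list(age_years = 10, breed = "dalmatian"),
  class = c("Dog", "Animal")
)
get_age(pongo)  # 10 * 5 = 50
```

Objects that have class Animal but not class Dog are unaffected, because R only reaches get_age.Dog when Dog appears in the class vector.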
Now the get_age generic returns an age in terms of a human lifespan whether its argument is a Human or a Dog:
## [1] 13
## [1] 50
You can create new data structures in R by creating classes, and you can add functionality to new or existing generics by creating new methods. Before creating a class, think about whether R already provides a data structure that suits your needs.
It’s uncommon to create new classes in the course of a typical data analysis, but many packages do provide new classes. Regardless of whether you ever create a new class, understanding the details means understanding how S3 works, and thus how R’s many S3 generic functions work.
As a final note, while exploring S3 methods you may also encounter the NextMethod function. The NextMethod function redirects dispatch to the method that is the next closest match for an object’s class. You can learn more by reading ?NextMethod.
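As a hedged sketch, here is one way the Dog method from the example above could use NextMethod to reuse the Animal method instead of reading the field itself. This variant is an illustration and an assumption, not the version used earlier in this section:

```r
get_age = function(animal) {
  UseMethod("get_age")
}

get_age.Animal = function(animal) {
  animal$age_years
}

# Ask the next method in line (here get_age.Animal) for the age in years,
# then convert the result to human years.
get_age.Dog = function(animal) {
  years = NextMethod()
  years * 5
}

pongo = structure(
  list(age_years = 10, breed = "dalmatian"),
  class = c("Dog", "Animal")
)
get_age(pongo)  # get_age.Animal returns 10, which is multiplied by 5
```

The advantage of this design is that any change to how Animal objects store their age only needs to be made in get_age.Animal.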
6.5 Other Object Systems
R provides many systems for object-oriented programming besides S3. Some are built into the language, while others are provided by packages. A few of the most popular systems are:
- S4 – S4 is built into R and is the most widely used system after S3. Like S3, S4 frames OOP in terms of generic functions and methods. The major differences are that S4 is stricter (the structure of each class must be formally defined) and that S4 generics can dispatch on the classes of multiple arguments instead of just one. R provides a special field operator @ to access fields of an S4 object. Most of the packages in the Bioconductor project use S4.
- Reference classes – Objects created with the S3 and S4 systems generally follow the copy-on-write rule, but this can be inefficient for some programming tasks. The reference class system is built into R and provides a way to create reference objects with a formal class structure (in the spirit of S4). This system is more like OOP systems in languages like Java and Python than S3 or S4 are. The reference class system is sometimes jokingly called “R5”, but that isn’t an official name.
- R6 – An alternative to reference classes created by Winston Chang, a developer at Posit (formerly RStudio). It claims to be simpler and faster than reference classes.
- S7 – A new OOP system being developed collaboratively by representatives from several different important groups in the R community, including the R core developers, Bioconductor, and Posit.
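As a rough illustration of the two built-in systems in the list above, the sketch below defines one S4 class and one reference class. The names Person, greet, and Counter are hypothetical and chosen only for this example:

```r
library(methods)

# S4: the class structure is declared formally before any object exists.
setClass("Person", slots = c(name = "character", age_years = "numeric"))
setGeneric("greet", function(x, ...) standardGeneric("greet"))
setMethod("greet", "Person", function(x, ...) {
  # The @ operator accesses a field of an S4 object.
  paste(x@name, "is", x@age_years, "years old")
})

p = new("Person", name = "Ann", age_years = 30)
greet(p)

# Reference classes: objects are modified in place rather than copied.
Counter = setRefClass("Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function() {
      count <<- count + 1
      invisible(.self)
    }
  )
)

a = Counter$new(count = 0)
b = a          # b refers to the same object as a; no copy is made
b$increment()
a$count        # 1, because a and b are the same object
```

With an S3 or S4 object, modifying b would leave a unchanged. The shared update seen here is what distinguishes reference objects from copy-on-write objects.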
Many of these systems are described in more detail in Hadley Wickham’s book Advanced R.