= "Nick"
name if (name == "Nick") {
# If name is Nick:
message("We went down the TRUE branch")
= "Hi Nick, nice to see you again!"
msg else {
} # Anything else:
= "Nice to meet you!"
msg }
We went down the TRUE branch
After completing this chapter, learners should be able to:
By now, you’ve learned all of the basic skills necessary to explore a data set in R. The focus of this chapter is how to organize your code so that it’s concise, clear, and easy to automate. This will help you and your collaborators avoid tedious, redundant work, reproduce results efficiently, and run code in specialized environments for scientific computing, such as high-performance computing clusters.
Sometimes you’ll need code to do different things, depending on a condition. If-expressions provide a way to write conditional code.
For example, suppose we want to greet one person differently from the others:
= "Nick"
name if (name == "Nick") {
# If name is Nick:
message("We went down the TRUE branch")
= "Hi Nick, nice to see you again!"
msg else {
} # Anything else:
= "Nice to meet you!"
msg }
We went down the TRUE branch
Indent code inside of the if-expression by 2 or 4 spaces. Indentation makes your code easier to read.
The condition in an if-expression has to be a scalar:
= c("Nick", "Susan")
name if (name == "Nick") {
= "Hi Nick!"
msg else {
} = "Nice to meet you!"
msg }
Error in if (name == "Nick") {: the condition has length > 1
You can chain together if-expressions:
= "Susan"
name if (name == "Nick") {
= "Hi Nick, nice to see you again!"
msg else if (name == "Peter") {
} = "Go away Peter, I'm busy!"
msg else {
} = "Nice to meet you!"
msg
} msg
[1] "Nice to meet you!"
If-expressions return the value of the last expression in the evaluated block:
= "Tom"
name = if (name == "Nick") {
msg "Hi Nick, nice to see you again!"
else {
} "Nice to meet you!"
} msg
[1] "Nice to meet you!"
Curly braces { }
are optional for single-line expressions:
= "Nick"
name = if (name == "Nick") "Hi Nick, nice to see you again!" else
msg "Nice to meet you!"
msg
[1] "Hi Nick, nice to see you again!"
But you have to be careful if you don’t use them:
# NO GOOD:
= if (name == "Nick")
msg "Hi Nick, nice to see you again!"
else
"Nice to meet you!"
Error: <text>:4:1: unexpected 'else'
3: "Hi Nick, nice to see you again!"
4: else
^
The else
block is optional:
= "Hi"
msg = "Tom"
name if (name == "Nick")
= "Hi Nick, nice to see you again!"
msg msg
[1] "Hi"
When there’s no else
block, the value of the else
block is NULL
:
= "Tom"
name = if (name == "Nick")
msg "Hi Nick, nice to see you again!"
msg
NULL
The main way to interact with R is by calling functions, which was first explained way back in Section 1.2.4. Since then, you’ve learned how to use many of R’s built-in functions. This section explains how you can write your own functions.
To start, let’s briefly review what functions are, and some of the jargon associated with them. It’s useful to think of functions as factories: raw materials (inputs) go in, products (outputs) come out. We can also represent this visually:
Programmers use several specific terms to describe the parts and usage of functions:
Almost every command in R is a function, even the arithmetic operators and the parentheses! You can view the body of a function by typing its name without trailing parentheses (in contrast to how you call functions). The body of a function is usually surrounded by curly braces {}
, although they’re optional if the body only contains one line of code. Indenting code inside of curly braces by 2-4 spaces also helps make it visually distinct from other code.
For example, let’s look at the body of the append
function, which appends a value to the end of a list or vector:
append
function (x, values, after = length(x))
{
lengx <- length(x)
if (!after)
c(values, x)
else if (after >= lengx)
c(x, values)
else c(x[1L:after], values, x[(after + 1L):lengx])
}
<bytecode: 0x5e37d4ff6588>
<environment: namespace:base>
Don’t worry if you can’t understand everything the append
function’s code does yet. It will make more sense later on, after you’ve written a few functions of your own.
Many of R’s built-in functions are not entirely written in R code. You can spot these by calls to the special .Primitive
or .Internal
functions in their code.
For instance, the sum
function is not written in R code:
sum
function (..., na.rm = FALSE) .Primitive("sum")
The function
keyword creates a new function. Here’s the syntax:
function(parameter1, parameter2, ...) {
# Your code goes here
# The result goes here
}
A function can have any number of parameters, and will automatically return the value of the last line of its body.
A function is a value, and like any other value, if you want to reuse it, you need to assign it to variable. Choosing descriptive variable names is a good habit. For functions, that means choosing a name that describes what the function does. It often makes sense to use verbs in function names.
Let’s write a function that gets the largest values in a vector. The inputs or arguments to the function will be the vector in question and also the number of values to get. Let’s call these vec
and n
, respectively. The result will be a vector of the n
largest elements. Here’s one way to write the function:
= function(vec, n) {
get_largest = sort(vec, decreasing = TRUE)
sorted head(sorted, n)
}
The name of the function, get_largest
, describes what the function does and includes a verb. If this function will be used frequently, a shorter name, such as largest
, might be preferable (compare to the head
function).
Any time you write a function, the first thing you should do afterwards is test that it actually works. Let’s try the get_largest
function on a few test cases:
= c(1, 10, 20, -3)
x get_largest(x, 2)
[1] 20 10
get_largest(x, 3)
[1] 20 10 1
= c(-1, -2, -3)
y get_largest(y, 2)
[1] -1 -2
= c("d", "a", "t", "a", "l", "a", "b")
z get_largest(z, 3)
[1] "t" "l" "d"
Notice that the parameters vec
and n
inside the function do not exist as variables outside of the function:
vec
Error in eval(expr, envir, enclos): object 'vec' not found
In general, R keeps parameters and variables you define inside of a function separate from variables you define outside of a function. You can read more about the specific rules for how R searches for variables in DataLab’s Intermediate R reader.
As a function for quickly summarizing data, get_largest
would be more convenient if the parameter n
for the number of values to return was optional (again, compare to the head
function). You can make the parameter n
optional by setting a default argument: an argument assigned to the parameter if no argument is assigned in the call to the function. You can use =
to assign default arguments to parameters when you define a function with the function
keyword. Here’s a new definition of the function with the default n = 5
:
= function(vec, n = 5) {
get_largest = sort(vec, decreasing = TRUE)
sorted head(sorted, n)
}
After making this change, it’s a good idea to test the function again:
get_largest(x)
[1] 20 10 1 -3
get_largest(y)
[1] -1 -2 -3
get_largest(z)
[1] "t" "l" "d" "b" "a"
We’ve already seen that a function will automatically return the value of its last line.
The return
keyword causes a function to return a result immediately, without running any subsequent code in its body. It only makes sense to use return
from inside of an if-expression. If your function doesn’t have any if-expressions, you don’t need to use return
.
For example, suppose you want the get_largest
function to immediately return NULL
if the argument for vec
is a list. Here’s the code, along with some test cases:
= function(vec, n = 5) {
get_largest if (is.list(vec))
return(NULL)
= sort(vec, decreasing = TRUE)
sorted head(sorted, n)
}
get_largest(x)
[1] 20 10 1 -3
get_largest(z)
[1] "t" "l" "d" "b" "a"
get_largest(list(1, 2))
NULL
Alternatively, you could make the function raise an error by calling the stop
function. Whether it makes more sense to return NULL
or print an error depends on how you plan to use the get_largest
function.
Notice that the last line of the get_largest
function still doesn’t use the return
keyword. It’s idiomatic to only use return
when strictly necessary.
A function returns one R object, but sometimes computations have multiple results. In that case, return the results in a vector, list, or other data structure.
For example, let’s make a function that computes the mean and median for a vector. We’ll return the results in a named list, although we could also use a named vector:
= function(x) {
compute_mean_med = mean(x)
m1 = median(x)
m2 list(mean = m1, median = m2)
}compute_mean_med(c(1, 2, 3, 1))
$mean
[1] 1.75
$median
[1] 1.5
The names make the result easier to understand for the caller of the function, although they certainly aren’t required here.
Before you write a function, it’s useful to go through several steps:
Write down what you want to do, in detail. It can also help to draw a picture of what needs to happen.
Check whether there’s already a built-in function. Search online and in the R documentation.
Write the code to handle a simple case first. For data science problems, use a small dataset at this step.
Let’s apply this in one final example: a function that detects leap years. A year is a leap year if either of these conditions is true:
That means the years 2004 and 2000 are leap years, but the year 2200 is not. Here’s the code and a few test cases:
# If year is divisible by 4 and not 100 -> leap
# If year is divisible by 400 -> leap
= 2004
year = function(year) {
is_leap if (year %% 4 == 0 & year %% 100 != 0) {
= TRUE
leap else if (year %% 400 == 0) {
} = TRUE
leap else {
} = FALSE
leap
}
leap
}is_leap(400)
[1] TRUE
is_leap(1997)
[1] FALSE
Functions are the building blocks for solving larger problems. Take a divide-and-conquer approach, breaking large problems into smaller steps. Use a short function for each step. This approach makes it easier to:
These exercises are meant to challenge you, so they’re quite difficult compared to the previous ones. Don’t get disheartened, and if you’re able to complete them, excellent work!
Create a function compute_day
which uses the Doomsday algorithm to compute the day of week for any given date in the 1900s. The function’s parameters should be year
, month
, and day
. The function’s return value should be a day of week, as a string (for example, "Saturday"
).
Hint: the modulo operator is %%
in R.