After this lesson, you should be able to:
[ versus [[
The way to get and set elements of a data structure is by indexing. Sometimes this is also called subsetting or (element) extraction. Indexing is a fundamental operation in R, key to reasoning about how to solve problems with the language.
We first saw indexing in Section 12.9, where we used $, the dollar sign operator, to get and set data frame columns. We saw indexing again in Section 13.1.2, where we used [, the indexing or square bracket operator, to get and set elements of vectors.
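The indexing operator [ is R’s primary operator for indexing. It works in four different ways, depending on the type of the index you use. These four ways to select elements are:
1. All elements, with a blank index
2. By position, with a numeric index
3. By name, with a character index
4. By condition, with a logical index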
Let’s examine each in more detail. We’ll use this vector as an example, to keep things concise:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x
 a  b  c  d  e
10 20 30 40 50
Even though we’re using a vector here, the indexing operator works with almost all data structures, including factors, lists, matrices, and data frames. We’ll look at unique behavior for some of these later on.
The first way to use [ to select elements is to leave the index blank. This selects all elements:
x[]
 a  b  c  d  e
10 20 30 40 50
This way of indexing is rarely used for getting elements, since it’s the same as entering the variable name without the indexing operator. Instead, its main use is for setting elements. Suppose we want to set all the elements of x to 5. You might try writing this:
x = 5
x
[1] 5
Rather than setting each element to 5, this sets x to the scalar 5, which is not what we want. Let’s reset the vector and try again, this time using the indexing operator:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x[] = 5
x
a b c d e
5 5 5 5 5
As you can see, now all the elements are 5. So the indexing operator is necessary to specify that we want to set the elements rather than the whole variable.
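The same idea carries over to other data structures. As a minimal sketch (the matrix m below is a made-up example, not part of the lesson's data), a blank index replaces every element while keeping the object's shape, whereas plain assignment replaces the whole object:

```r
# m and m2 are hypothetical matrices used only for illustration.
m = matrix(1:6, nrow = 2)

# Blank index: every element is replaced, the dimensions are kept.
m[] = 0
dim(m)    # [1] 2 3

# No index: the variable itself is replaced by a single number.
m2 = matrix(1:6, nrow = 2)
m2 = 0
dim(m2)   # NULL
```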
Let’s reset x one more time, so that we can use it again in the next example:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
The second way to use [ is to select elements by position. This happens when you use an integer or numeric index. We already saw the basics of this in Section 13.1.2.
The positions of the elements in a vector (or other data structure) correspond to numbers starting from 1 for the first element. This way of indexing is frequently used together with the sequence operator : to get ranges of values. For instance, let’s get the 2nd through 4th elements of x:
x[2:4]
 b  c  d
20 30 40
You can also use this way of indexing to set specific elements or ranges of elements. For example, let’s set the 3rd and 5th elements of x to 9 and 7, respectively:
x[c(3, 5)] = c(9, 7)
x
 a  b  c  d  e
10 20  9 40  7
When getting elements, you can repeat numbers in the index to get the same element more than once. You can also use the order of the numbers to control the order of the elements:
x[c(2, 1, 2, 2)]
 b  a  b  b
20 10 20 20
Finally, if the index contains only negative numbers, the elements at those positions are excluded rather than selected. For instance, let’s get all elements except the 1st and 5th:
x[-c(1, 5)]
 b  c  d
20  9 40
When you index by position, the index should always be all positive or all negative. Using a mix of positive and negative numbers causes R to emit an error rather than returning elements, since it’s unclear what the result should be:
x[c(-1, 2)]
Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
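Positions are often computed rather than typed by hand. The sketch below assumes a separate vector v (a hypothetical name, so the lesson's x is left alone) and uses the base function length() to reach the end of the vector:

```r
# v is a hypothetical vector used only for illustration.
v = c(a = 10, b = 20, c = 30, d = 40, e = 50)

# length() gives the position of the last element.
last = v[length(v)]

# A negative position excludes that element instead.
all_but_last = v[-length(v)]

# Combine with the sequence operator : to get the last three elements.
last_three = v[(length(v) - 2):length(v)]
```

Here last holds element e, all_but_last holds a through d, and last_three holds c, d, and e.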
The third way to use [ is to select elements by name. This happens when you use a character vector as the index, and only works with named data structures.
Like indexing by position, you can use indexing by name to get or set elements. You can also use it to repeat elements or change the order. Let’s get elements a, c, d, and a again from the vector x:
y = x[c("a", "c", "d", "a")]
y
 a  c  d  a
10  9 40 10
Element names are generally unique, but if they’re not, indexing by name gets or sets the first element whose name matches the index:
y["a"]
 a
10
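Setting by name works the same way as getting by name. As a small sketch with a hypothetical vector v, the code below sets two existing elements and then assigns to a name that is not yet in the vector, which appends a new element:

```r
# v is a hypothetical vector used only for illustration.
v = c(a = 10, b = 20, c = 30)

# Set existing elements by name.
v[c("a", "c")] = c(1, 3)

# Assigning to a name that is not in the vector appends a new element.
v["z"] = 99

names(v)   # [1] "a" "b" "c" "z"
```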
Let’s reset x again to prepare for learning about the final way to index:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
The fourth and final way to use [ is to select elements based on a condition. This happens when you use a logical vector as the index. The logical vector should have the same length as what you’re indexing; if it is shorter, R recycles it.
To understand indexing by condition, we first need to learn about congruent vectors. Two vectors are congruent if they have the same length and they correspond element-by-element.
For example, suppose you do a survey that records each respondent’s favorite animal and age. These are two different vectors of information, but each person will have a response for both. So you’ll have two vectors that are the same length:
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)
The 1st element of each vector corresponds to the 1st person, the 2nd to the 2nd person, and so on. These vectors are congruent.
Notice that columns in a data frame are always congruent!
When you index by condition, the index should generally be congruent to the object you’re indexing. Elements where the index is TRUE are kept and elements where the index is FALSE are dropped.
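To make the keep/drop rule concrete, here is a minimal sketch with a hand-written logical index. The names v and keep are hypothetical. The last line uses an index shorter than the vector, so R recycles it:

```r
# v and keep are hypothetical names used only for illustration.
v = c(a = 10, b = 20, c = 30, d = 40, e = 50)

# One TRUE or FALSE per element: TRUE keeps, FALSE drops.
keep = c(TRUE, FALSE, TRUE, FALSE, TRUE)
kept = v[keep]                  # elements a, c, e

# A shorter index is recycled to TRUE FALSE TRUE FALSE TRUE.
recycled = v[c(TRUE, FALSE)]    # also elements a, c, e
```

Relying on recycling is easy to get wrong, so an index with the same length as the object is usually the safer choice.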
If you create the index from a condition on the object, it’s automatically congruent. For instance, let’s make a condition based on the vector x:
is_small = x < 25
is_small
    a     b     c     d     e
 TRUE  TRUE FALSE FALSE FALSE
The 1st element in the logical vector is_small corresponds to the 1st element of x, the 2nd to the 2nd, and so on. The vectors x and is_small are congruent.
It makes sense to use is_small as an index for x, and it gives us all the elements less than 25:
x[is_small]
 a  b
10 20
Of course, you can also avoid using an intermediate variable for the condition:
x[x > 10]
 b  c  d  e
20 30 40 50
If you create the index some other way (not using the object), make sure that it’s still congruent to the object. Otherwise, the subset returned from indexing might not be meaningful.
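The survey vectors from earlier give a simple case where the index comes from a different, but congruent, vector. This sketch builds a condition on age and uses it to index animal. The names over_30 and older_animals are introduced here only for illustration:

```r
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)

# The condition is computed from age...
over_30 = age > 30

# ...but it is congruent to animal, so it is a meaningful index for it.
older_animals = animal[over_30]   # "dog" "iguana"

# A quick sanity check that the lengths match.
length(over_30) == length(animal)   # TRUE
```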
You can also use indexing by condition to set elements, just as the other ways of indexing can be used to set elements. For instance, let’s set all the elements of x that are greater than 10 to the missing value NA:
x[x > 10] = NA
x
 a  b  c  d  e
10 NA NA NA NA
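The reverse operation, replacing missing values, follows the same pattern. This sketch assumes a hypothetical vector v and uses the base function is.na(), which returns a logical vector that is TRUE wherever a value is missing:

```r
# v is a hypothetical vector used only for illustration.
v = c(a = 10, b = NA, c = 30, d = NA)

# Count the missing values: TRUE counts as 1 in sum().
n_missing = sum(is.na(v))   # 2

# Set every missing element to 0.
v[is.na(v)] = 0
```

After the assignment, v holds 10, 0, 30, and 0, with the names unchanged.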
All of the conditions we’ve seen so far have been written in terms of a single test. If you want to use more sophisticated conditions, R provides operators to negate and combine logical vectors. These operators are useful for working with logical vectors even outside the context of indexing.
The NOT operator ! converts TRUE to FALSE and FALSE to TRUE:
x = c(TRUE, FALSE, TRUE, TRUE, NA)
x
[1]  TRUE FALSE  TRUE  TRUE    NA
!x
[1] FALSE  TRUE FALSE FALSE    NA
You can use ! with a condition:
y = c("hi", "hello")
!(y == "hi")
[1] FALSE  TRUE
The NOT operator is vectorized.
R also has operators for combining logical values.
The AND operator & returns TRUE only when both arguments are TRUE. Here are some examples:
FALSE & FALSE
[1] FALSE
TRUE & FALSE
[1] FALSE
FALSE & TRUE
[1] FALSE
TRUE & TRUE
[1] TRUE
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
[1]  TRUE FALSE FALSE
The OR operator | returns TRUE when at least one argument is TRUE. Let’s see some examples:
FALSE | FALSE
[1] FALSE
TRUE | FALSE
[1] TRUE
FALSE | TRUE
[1] TRUE
TRUE | TRUE
[1] TRUE
c(TRUE, FALSE) | c(TRUE, TRUE)
[1] TRUE TRUE
Be careful: everyday English is less precise than logic. You might say:
I want all subjects with age over 50 and all subjects that like cats.
But in logic this means:
(subject age over 50) OR (subject likes cats)
So think carefully about whether you need both conditions to be true (AND) or at least one (OR).
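As a sketch of the difference, the code below applies both readings of the sentence to the small survey vectors from earlier. The result names either and both are hypothetical:

```r
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)

# OR: respondents who are over 50, or like cats, or both.
either = animal[age > 50 | animal == "cat"]   # "cat" "iguana"

# AND: respondents who are over 50 and also like cats.
both = animal[age > 50 & animal == "cat"]     # character(0), nobody matches
```

With AND, no respondent satisfies both conditions, so the result is an empty character vector.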
Rarely, you might want exactly one condition to be true. The XOR (eXclusive OR) function xor() returns TRUE when exactly one argument is TRUE. For example:
xor(FALSE, FALSE)
[1] FALSE
xor(TRUE, FALSE)
[1] TRUE
xor(TRUE, TRUE)
[1] FALSE
The AND, OR, and XOR operators are vectorized.
The second argument is irrelevant in some conditions:
- FALSE & (anything) is always FALSE
- TRUE | (anything) is always TRUE
Now imagine you have FALSE & long_computation(). You can save time by skipping long_computation(). A short-circuit operator does exactly that.
R has two short-circuit operators:
- && is a short-circuited &
- || is a short-circuited |
These operators only evaluate the second argument if it is necessary to determine the result. Here are some examples:
TRUE && FALSE
[1] FALSE
TRUE && TRUE
[1] TRUE
TRUE || TRUE
[1] TRUE
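One way to see the skipping happen is to count how often the second argument runs. In this sketch, slow_check() is a hypothetical stand-in for an expensive computation, and calls is a counter it increments each time it is evaluated:

```r
# slow_check() and calls are hypothetical, for illustration only.
calls = 0
slow_check = function() {
  calls <<- calls + 1   # record that the function actually ran
  TRUE
}

# Short-circuit operators: the result is known from the first argument,
# so slow_check() is never evaluated.
r1 = FALSE && slow_check()
r2 = TRUE || slow_check()
calls_after_short = calls   # 0

# The vectorized operator always evaluates both arguments.
r3 = FALSE & slow_check()
calls_after_full = calls    # 1
```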
The short-circuit operators are not vectorized—they only accept length-1 arguments:
c(TRUE, FALSE) && c(TRUE, TRUE)
Error in c(TRUE, FALSE) && c(TRUE, TRUE): 'length = 2' in coercion to 'logical(1)'
Because of this, you can’t use short-circuit operators for indexing. Their main use is in writing conditions for if-expressions, which we’ll learn about later on.
Prior to R 4.3.0, short-circuit operators didn’t raise an error for inputs with length greater than 1 (and thus were a common source of bugs).
This section explains how to get and set data in a data frame, expanding on the indexing techniques you learned in Section 14.1. Under the hood, every data frame is a list, so first you’ll learn about indexing lists.
Lists are a container for other types of R objects. When you select an element from a list, you can either keep the container (the list) or discard it. The indexing operator [ almost always keeps containers.
As an example, let’s get some elements from a small list:
x = list(first = c(1, 2, 3), second = sin, third = c("hi", "hello"))
y = x[c(1, 3)]
y
$first
[1] 1 2 3
$third
[1] "hi"    "hello"
class(y)
[1] "list"
The result is still a list. Even if we get just one element, the result of indexing a list with [ is a list:
class(x[1])
[1] "list"
Sometimes this will be exactly what we want. But what if we want to get the first element of x so that we can use it in a vectorized function? Or in a function that only accepts numeric arguments? We need to somehow get the element and discard the container.
The solution to this problem is the extraction operator [[, which is also called the double square bracket operator. The extraction operator is the primary way to get and set elements of lists and other containers.
Unlike the indexing operator [, the extraction operator always discards the container:
x[[1]]
[1] 1 2 3
class(x[[1]])
[1] "numeric"
The tradeoff is that the extraction operator can only get or set one element at a time. Note that the element can be a vector, as above. Because it can only get or set one element at a time, the extraction operator can only index by position or name. Blank and logical indexes are not allowed.
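The sketch below shows why discarding the container matters when passing an element to a function. It assumes a small list called items (a hypothetical name, so the lesson's x is left alone) and uses the base functions is.numeric() and mean():

```r
# items is a hypothetical list used only for illustration.
items = list(first = c(1, 2, 3), third = c("hi", "hello"))

# [ keeps the container, so the result is a list, not a numeric vector.
is.numeric(items[1])      # FALSE

# [[ discards the container, so numeric functions work on the result.
is.numeric(items[[1]])    # TRUE
avg = mean(items[[1]])    # 2

# [[ also works by name, and can set a single element.
items[["first"]] = c(10, 20)
new_avg = mean(items[["first"]])   # 15
```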
The final difference between the index operator [ and the extraction operator [[ has to do with how they handle invalid indexes. The index operator [ returns NA for invalid vector elements, and NULL for invalid list elements:
c(1, 2)[10]
[1] NA
x[10]
$<NA>
NULL
On the other hand, the extraction operator [[ raises an error for invalid elements:
x[[10]]
Error in x[[10]]: subscript out of bounds
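If an index might be out of range, you can check it first or catch the error. The sketch below is one possible approach, not the only one: it assumes a hypothetical list items and uses the base function tryCatch() to fall back to NULL. It also shows a case worth knowing about: for lists, extracting by a name that does not exist returns NULL rather than raising an error.

```r
# items is a hypothetical list used only for illustration.
items = list(first = c(1, 2, 3), third = c("hi", "hello"))

# Check the position against the length before extracting.
in_range = 10 <= length(items)   # FALSE

# Or catch the error and substitute a fallback value.
fallback = tryCatch(items[[10]], error = function(e) NULL)
is.null(fallback)                # TRUE

# By name, a missing list element gives NULL instead of an error.
by_name = items[["fourth"]]
is.null(by_name)                 # TRUE
```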
The indexing operator [ and the extraction operator [[ both work with any data structure that has elements. However, you’ll generally use the indexing operator [ to index vectors, and the extraction operator [[ to index containers (such as lists).
For two-dimensional objects, like matrices and data frames, you can pass the indexing operator [ or the extraction operator [[ a separate index for each dimension. The rows come first:
DATA[ROWS, COLUMNS]
For instance, let’s get the first 3 rows and all columns of the least terns data:
terns[1:3, ]
  year site_name site_name_2013_2018 site_name_1988_2001
1 2000 PITTSBURG POWER PLANT Pittsburg Power Plant NA_2013_2018 POLYGON
2 2000 ALBANY CENTRAL AVE NA_NO POLYGON Albany Central Avenue
3 2000 ALAMEDA POINT Alameda Point NA_2013_2018 POLYGON
site_abbr region_3 region_4 event bp_min bp_max fl_min fl_max total_nests
1 PITT_POWER S.F._BAY S.F._BAY LA_NINA 15 15 16 18 15
2 AL_CENTAVE S.F._BAY S.F._BAY LA_NINA 6 12 1 1 20
3 ALAM_PT S.F._BAY S.F._BAY LA_NINA 282 301 200 230 312
nonpred_eggs nonpred_chicks nonpred_fl nonpred_ad pred_control pred_eggs
1 3 0 0 0 4
2 NA NA NA NA NA
3 124 81 2 1 17
pred_chicks pred_fl pred_ad pred_pefa pred_coy_fox pred_meso pred_owlspp
1 2 0 0 N N N N
2 NA NA NA
3 0 0 0 N N N N
pred_corvid pred_other_raptor pred_other_avian pred_misc total_pefa
1 Y Y N N 0
2 NA
3 N Y Y N 0
total_coy_fox total_meso total_owlspp total_corvid total_other_raptor
1 0 0 0 4 2
2 NA NA NA NA NA
3 0 0 0 0 6
total_other_avian total_misc first_observed last_observed first_nest
1 0 0 2000-05-11 2000-08-05 2000-05-26
2 NA NA
3 11 0 2000-05-01 2000-08-19 2000-05-16
first_chick first_fledge
1 2000-06-18 2000-07-08
2
3 2000-06-07 2000-06-30
As we saw in Section 14.1.1, leaving an index blank means all elements.
As another example, let’s get the 3rd and 5th row, and the 2nd and 4th column:
terns[c(3, 5), c(2, 4)]
                                     site_name  site_name_1988_2001
3                                ALAMEDA POINT NA_2013_2018 POLYGON
5 OCEANO DUNES STATE VEHICULAR RECREATION AREA NA_2013_2018 POLYGON
Mixing several different ways of indexing is allowed. For example, we can get the same rows as above, but select the columns by name instead of by position:
terns[c(3, 5), c("year", "site_name")]
  year                                    site_name
3 2000                                ALAMEDA POINT
5 2000 OCEANO DUNES STATE VEHICULAR RECREATION AREA
For data frames, it’s especially common to index the rows by condition and the columns by name. For instance, let’s get the site_name and bp_min columns for all year 2000 observations in the least terns data set:
result = terns[terns$year == 2000, c("site_name", "bp_min")]
head(result)
                                     site_name bp_min
1                        PITTSBURG POWER PLANT     15
2                           ALBANY CENTRAL AVE      6
3                                ALAMEDA POINT    282
4                               KETTLEMAN CITY      2
5 OCEANO DUNES STATE VEHICULAR RECREATION AREA      4
6              RANCHO GUADALUPE DUNES PRESERVE      9
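If you don't have the least terns data at hand, the same pattern can be tried on a tiny data frame. The sketch below uses a made-up data frame called birds (its columns and values are invented for illustration) to index rows by condition and columns by name, and to contrast [ with [[ on a single column:

```r
# birds is a made-up data frame used only for illustration.
birds = data.frame(
  year = c(2000, 2000, 2004),
  site = c("A", "B", "C"),
  nests = c(15, 20, 312)
)

# Rows by condition, columns by name.
small = birds[birds$year == 2000, c("site", "nests")]
nrow(small)              # 2

# With a single index, [ keeps the container (a data frame)...
class(birds["nests"])    # "data.frame"

# ...while [[ discards it and returns the column itself.
class(birds[["nests"]])  # "numeric"
```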
Also see the section on the drop parameter for a case where the [ operator behaves in a surprising way.