14  Indexing

Note: Learning Goals

After this lesson, you should be able to:

  • Index vectors with empty, integer, string, and logical arguments
  • Negate or combine conditions with logic operators
  • Describe when to use [ versus [[
  • Index data frames to get specific rows, columns, or subsets

14.1 Indexing

The way to get and set elements of a data structure is by indexing. Sometimes this is also called subsetting or (element) extraction. Indexing is a fundamental operation in R, key to reasoning about how to solve problems with the language.

We first saw indexing in Section 12.9, where we used $, the dollar sign operator, to get and set data frame columns. We saw indexing again in Section 13.1.2, where we used [, the indexing or square bracket operator, to get and set elements of vectors.

The indexing operator [ is R’s primary operator for indexing. It works in four different ways, depending on the type of the index you use. These four ways to select elements are:

  1. All elements, with no index
  2. By position, with a numeric index
  3. By name, with a character index
  4. By condition, with a logical index

Let’s examine each in more detail. We’ll use this vector as an example, to keep things concise:

x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x
 a  b  c  d  e 
10 20 30 40 50 

Even though we’re using a vector here, the indexing operator works with almost all data structures, including factors, lists, matrices, and data frames. We’ll look at unique behavior for some of these later on.

14.1.1 All Elements

The first way to use [ to select elements is to leave the index blank. This selects all elements:

x[]
 a  b  c  d  e 
10 20 30 40 50 

This way of indexing is rarely used for getting elements, since it’s the same as entering the variable name without the indexing operator. Instead, its main use is for setting elements. Suppose we want to set all the elements of x to 5. You might try writing this:

x = 5
x
[1] 5

Rather than setting each element to 5, this sets x to the scalar 5, which is not what we want. Let’s reset the vector and try again, this time using the indexing operator:

x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x[] = 5
x
a b c d e 
5 5 5 5 5 

As you can see, now all the elements are 5. So the indexing operator is necessary to specify that we want to set the elements rather than the whole variable.

Let’s reset x one more time, so that we can use it again in the next example:

x = c(a = 10, b = 20, c = 30, d = 40, e = 50)

14.1.2 By Position

The second way to use [ is to select elements by position. This happens when you use an integer or numeric index. We already saw the basics of this in Section 13.1.2.

The positions of the elements in a vector (or other data structure) correspond to numbers starting from 1 for the first element. This way of indexing is frequently used together with the sequence operator : to get ranges of values. For instance, let’s get the 2nd through 4th elements of x:

x[2:4]
 b  c  d 
20 30 40 

You can also use this way of indexing to set specific elements or ranges of elements. For example, let’s set the 3rd and 5th elements of x to 9 and 7, respectively:

x[c(3, 5)] = c(9, 7)
x
 a  b  c  d  e 
10 20  9 40  7 

When getting elements, you can repeat numbers in the index to get the same element more than once. You can also use the order of the numbers to control the order of the elements:

x[c(2, 1, 2, 2)]
 b  a  b  b 
20 10 20 20 

Finally, if the index contains only negative numbers, the elements at those positions are excluded rather than selected. For instance, let’s get all elements except the 1st and 5th:

x[-c(1, 5)]
 b  c  d 
20  9 40 

When you index by position, the index should always be all positive or all negative. Using a mix of positive and negative numbers causes R to emit an error rather than return elements, since it’s unclear what the result should be:

x[c(-1, 2)]
Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

14.1.3 By Name

The third way to use [ is to select elements by name. This happens when you use a character vector as the index, and only works with named data structures.

Like indexing by position, you can use indexing by name to get or set elements. You can also use it to repeat elements or change the order. Let’s get elements a, c, d, and a again from the vector x:

y = x[c("a", "c", "d", "a")]
y
 a  c  d  a 
10  9 40 10 

Element names are generally unique, but if they’re not, indexing by name gets or sets the first element whose name matches the index:

y["a"]
 a 
10 
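
The same first-match rule applies when setting. As a minimal sketch, assuming a hypothetical vector dup with a repeated name (it is not part of the lesson's data), only the first matching element changes:

```r
# Hypothetical vector with the name "a" repeated
dup = c(a = 10, c = 9, d = 40, a = 10)

# Setting by name changes only the first element named "a"
dup["a"] = 0
dup

# To reach every element with that name, index by a condition on names()
all_a = dup[names(dup) == "a"]
all_a
```

The second element named a keeps its original value of 10, which is one reason to keep element names unique.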

Let’s reset x again to prepare for learning about the final way to index:

x = c(a = 10, b = 20, c = 30, d = 40, e = 50)

14.1.4 By Condition

The fourth and final way to use [ is to select elements based on a condition. This happens when you use a logical vector as the index. The logical vector should have the same length as what you’re indexing; if it’s shorter, R recycles it to match.
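
To make the recycling concrete, here is a small sketch. The vector v is a hypothetical copy of the example vector, used so that x is left untouched:

```r
# Hypothetical copy of the example vector
v = c(a = 10, b = 20, c = 30, d = 40, e = 50)

# The length-2 index is recycled to TRUE FALSE TRUE FALSE TRUE,
# so this keeps the 1st, 3rd, and 5th elements
odd_positions = v[c(TRUE, FALSE)]
odd_positions
```

Recycling like this is occasionally handy, but in most cases a logical index with the same length as the object is easier to reason about.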

Congruent Vectors

To understand indexing by condition, we first need to learn about congruent vectors. Two vectors are congruent if they have the same length and they correspond element-by-element.

For example, suppose you do a survey that records each respondent’s favorite animal and age. These are two different vectors of information, but each person will have a response for both. So you’ll have two vectors that are the same length:

animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)

The 1st element of each vector corresponds to the 1st person, the 2nd to the 2nd person, and so on. These vectors are congruent.

Notice that columns in a data frame are always congruent!

Back to Indexing

When you index by condition, the index should generally be congruent to the object you’re indexing. Elements where the index is TRUE are kept and elements where the index is FALSE are dropped.

If you create the index from a condition on the object, it’s automatically congruent. For instance, let’s make a condition based on the vector x:

is_small = x < 25
is_small
    a     b     c     d     e 
 TRUE  TRUE FALSE FALSE FALSE 

The 1st element in the logical vector is_small corresponds to the 1st element of x, the 2nd to the 2nd, and so on. The vectors x and is_small are congruent.

It makes sense to use is_small as an index for x, and it gives us all the elements less than 25:

x[is_small]
 a  b 
10 20 

Of course, you can also avoid using an intermediate variable for the condition:

x[x > 10]
 b  c  d  e 
20 30 40 50 

If you create the index some other way (not using the object), make sure that it’s still congruent to the object. Otherwise, the subset returned from indexing might not be meaningful.
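
As a sketch of an index built from a different object, consider the survey vectors from the congruent vectors example. A condition on age is congruent to animal, so it makes sense to use it as an index for animal. The name over_30 is introduced here only for illustration:

```r
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)

# A condition on age gives one TRUE/FALSE per respondent
over_30 = age > 30

# Favorite animals of the respondents older than 30
animal[over_30]
```

This is meaningful only because the 1st element of age and the 1st element of animal describe the same person, and so on down the vectors.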

You can also use indexing by condition to set elements, just as the other ways of indexing can be used to set elements. For instance, let’s set all the elements of x that are greater than 10 to the missing value NA:

x[x > 10] = NA
x
 a  b  c  d  e 
10 NA NA NA NA 

14.1.5 Logic

All of the conditions we’ve seen so far have been written in terms of a single test. If you want to use more sophisticated conditions, R provides operators to negate and combine logical vectors. These operators are useful for working with logical vectors even outside the context of indexing.

Negation

The NOT operator ! converts TRUE to FALSE and FALSE to TRUE:

x = c(TRUE, FALSE, TRUE, TRUE, NA)
x
[1]  TRUE FALSE  TRUE  TRUE    NA
!x
[1] FALSE  TRUE FALSE FALSE    NA

You can use ! with a condition:

y = c("hi", "hello")
!(y == "hi")
[1] FALSE  TRUE

The NOT operator is vectorized.
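
Because it is vectorized, ! can negate a condition inside an index. A minimal sketch, using a hypothetical numeric vector v with the same values as the example vector from Section 14.1:

```r
# Hypothetical numeric vector for illustration
v = c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Keep the elements that are NOT less than 25
not_small = v[!(v < 25)]
not_small
```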

Combinations

R also has operators for combining logical values.

The AND operator & returns TRUE only when both arguments are TRUE. Here are some examples:

FALSE & FALSE
[1] FALSE
TRUE & FALSE
[1] FALSE
FALSE & TRUE
[1] FALSE
TRUE & TRUE
[1] TRUE
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
[1]  TRUE FALSE FALSE

The OR operator | returns TRUE when at least one argument is TRUE. Let’s see some examples:

FALSE | FALSE
[1] FALSE
TRUE | FALSE
[1] TRUE
FALSE | TRUE
[1] TRUE
TRUE | TRUE
[1] TRUE
c(TRUE, FALSE) | c(TRUE, TRUE)
[1] TRUE TRUE

Be careful: everyday English is less precise than logic. You might say:

I want all subjects with age over 50 and all subjects that like cats.

But in logic this means:

(subject age over 50) OR (subject likes cats)

So think carefully about whether you need both conditions to be true (AND) or at least one (OR).
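
Here is how the two readings differ for the small survey from the congruent vectors example. This is only a sketch; the variable names either and both are made up for illustration:

```r
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)

# OR: respondents over 50, together with respondents who like cats
either = age > 50 | animal == "cat"
either    # FALSE TRUE TRUE

# AND: respondents who are over 50 and also like cats (nobody here)
both = age > 50 & animal == "cat"
both      # FALSE FALSE FALSE
```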

Rarely, you might want exactly one condition to be true. The XOR (eXclusive OR) function xor() returns TRUE when exactly one argument is TRUE. For example:

xor(FALSE, FALSE)
[1] FALSE
xor(TRUE, FALSE)
[1] TRUE
xor(TRUE, TRUE)
[1] FALSE

The AND, OR, and XOR operators are vectorized.
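
Since these operators are vectorized, they can combine conditions inside an index. A short sketch, again with a hypothetical numeric vector v:

```r
# Hypothetical numeric vector for illustration
v = c(a = 10, b = 20, c = 30, d = 40, e = 50)

# AND: elements strictly between 10 and 50
middle = v[v > 10 & v < 50]
middle    # elements b, c, d

# OR: elements at either extreme
ends = v[v < 20 | v > 40]
ends      # elements a, e
```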

Short-circuiting

The second argument is irrelevant in some conditions:

  • FALSE & (anything) is always FALSE
  • TRUE | (anything) is always TRUE

Now imagine you have FALSE & long_computation(). You can save time by skipping long_computation(). A short-circuit operator does exactly that.

R has two short-circuit operators:

  • && is a short-circuited &
  • || is a short-circuited |

These operators only evaluate the second argument if it is necessary to determine the result. Here are some examples:

TRUE && FALSE
[1] FALSE
TRUE && TRUE
[1] TRUE
TRUE || TRUE
[1] TRUE
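
To check that the second argument really is skipped, here is a minimal sketch. The function long_computation() is a hypothetical stand-in defined only for this example; it raises an error if it is ever called:

```r
# Hypothetical stand-in for an expensive computation: fails if it ever runs
long_computation = function() stop("long_computation() was evaluated")

# && does not evaluate its second argument here, so no error is raised
skipped = FALSE && long_computation()
skipped

# & evaluates both arguments, so the same call fails.
# try() catches the error so that the script can continue.
result = try(FALSE & long_computation(), silent = TRUE)
inherits(result, "try-error")
```

The first expression returns FALSE without calling the function, while the second one triggers the error.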

The short-circuit operators are not vectorized—they only accept length-1 arguments:

c(TRUE, FALSE) && c(TRUE, TRUE)
Error in c(TRUE, FALSE) && c(TRUE, TRUE): 'length = 2' in coercion to 'logical(1)'

Because of this, you can’t use short-circuit operators for indexing. Their main use is in writing conditions for if-expressions, which we’ll learn about later on.

Note

Prior to R 4.3.0, short-circuit operators didn’t raise an error for inputs with length greater than 1 (and thus were a common source of bugs).

14.2 Indexing Data Frames

This section explains how to get and set data in a data frame, expanding on the indexing techniques you learned in Section 14.1. Under the hood, every data frame is a list, so first you’ll learn about indexing lists.

14.2.1 Indexing Lists

Lists are a container for other types of R objects. When you select an element from a list, you can either keep the container (the list) or discard it. The indexing operator [ almost always keeps containers.

As an example, let’s get some elements from a small list:

x = list(first = c(1, 2, 3), second = sin, third = c("hi", "hello"))
y = x[c(1, 3)]
y
$first
[1] 1 2 3

$third
[1] "hi"    "hello"
class(y)
[1] "list"

The result is still a list. Even if we get just one element, the result of indexing a list with [ is a list:

class(x[1])
[1] "list"

Sometimes this will be exactly what we want. But what if we want to get the first element of x so that we can use it in a vectorized function? Or in a function that only accepts numeric arguments? We need to somehow get the element and discard the container.

The solution to this problem is the extraction operator [[, which is also called the double square bracket operator. The extraction operator is the primary way to get and set elements of lists and other containers.

Unlike the indexing operator [, the extraction operator always discards the container:

x[[1]]
[1] 1 2 3
class(x[[1]])
[1] "numeric"

The tradeoff is that the extraction operator can only get or set one element at a time. Note that the element can be a vector, as above. Because it can only get or set one element at a time, the extraction operator can only index by position or name. Blank and logical indexes are not allowed.
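
As a sketch of extracting by name and of setting with [[, consider a hypothetical list lst modeled on the one above (without the function element):

```r
# Hypothetical list for illustration
lst = list(first = c(1, 2, 3), third = c("hi", "hello"))

# Extract by name: the result is the element itself, not a list
lst[["third"]]

# Set one element: the whole element is replaced by the new value
lst[["first"]] = c(10, 20)
lst[["first"]]

# An extracted element can go straight into a vectorized function
total = sum(lst[["first"]])
total
```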

The final difference between the indexing operator [ and the extraction operator [[ has to do with how they handle invalid indexes. The indexing operator [ returns NA for invalid vector elements, and NULL for invalid list elements:

c(1, 2)[10]
[1] NA
x[10]
$<NA>
NULL

On the other hand, the extraction operator [[ raises an error for invalid elements:

x[[10]]
Error in x[[10]]: subscript out of bounds

The indexing operator [ and the extraction operator [[ both work with any data structure that has elements. However, you’ll generally use the indexing operator [ to index vectors, and the extraction operator [[ to index containers (such as lists).
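
Since every data frame is a list of columns, the same distinction applies to data frames. A minimal sketch, assuming a small hypothetical data frame obs (not the least terns data):

```r
# Hypothetical data frame, used only for illustration
obs = data.frame(site = c("A", "B", "C"), count = c(15, 6, 282))

# [ keeps the container: the result is a one-column data frame
class(obs["count"])      # "data.frame"

# [[ discards the container: the result is the column as a vector
class(obs[["count"]])    # "numeric"

# $ gets the same column as [[ with a name
identical(obs$count, obs[["count"]])
```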

14.2.2 Two-dimensional Indexing

For two-dimensional objects, like matrices and data frames, you can pass the indexing operator [ or the extraction operator [[ a separate index for each dimension. The rows come first:

DATA[ROWS, COLUMNS]

For instance, let’s get the first 3 rows and all columns of the least terns data:

terns[1:3, ]
  year             site_name   site_name_2013_2018   site_name_1988_2001
1 2000 PITTSBURG POWER PLANT Pittsburg Power Plant  NA_2013_2018 POLYGON
2 2000    ALBANY CENTRAL AVE         NA_NO POLYGON Albany Central Avenue
3 2000         ALAMEDA POINT         Alameda Point  NA_2013_2018 POLYGON
   site_abbr region_3 region_4   event bp_min bp_max fl_min fl_max total_nests
1 PITT_POWER S.F._BAY S.F._BAY LA_NINA     15     15     16     18          15
2 AL_CENTAVE S.F._BAY S.F._BAY LA_NINA      6     12      1      1          20
3    ALAM_PT S.F._BAY S.F._BAY LA_NINA    282    301    200    230         312
  nonpred_eggs nonpred_chicks nonpred_fl nonpred_ad pred_control pred_eggs
1            3              0          0          0                      4
2           NA             NA         NA         NA                     NA
3          124             81          2          1                     17
  pred_chicks pred_fl pred_ad pred_pefa pred_coy_fox pred_meso pred_owlspp
1           2       0       0         N            N         N           N
2          NA      NA      NA                                             
3           0       0       0         N            N         N           N
  pred_corvid pred_other_raptor pred_other_avian pred_misc total_pefa
1           Y                 Y                N         N          0
2                                                                  NA
3           N                 Y                Y         N          0
  total_coy_fox total_meso total_owlspp total_corvid total_other_raptor
1             0          0            0            4                  2
2            NA         NA           NA           NA                 NA
3             0          0            0            0                  6
  total_other_avian total_misc first_observed last_observed first_nest
1                 0          0     2000-05-11    2000-08-05 2000-05-26
2                NA         NA                                        
3                11          0     2000-05-01    2000-08-19 2000-05-16
  first_chick first_fledge
1  2000-06-18   2000-07-08
2                         
3  2000-06-07   2000-06-30

As we saw in Section 14.1.1, leaving an index blank means all elements.

As another example, let’s get the 3rd and 5th rows, and the 2nd and 4th columns:

terns[c(3, 5), c(2, 4)]
                                     site_name  site_name_1988_2001
3                                ALAMEDA POINT NA_2013_2018 POLYGON
5 OCEANO DUNES STATE VEHICULAR RECREATION AREA NA_2013_2018 POLYGON

Mixing several different ways of indexing is allowed. So for example, we can get the same rows as above, but select columns by name instead of by position (here, the year and site_name columns):

terns[c(3, 5), c("year", "site_name")]
  year                                    site_name
3 2000                                ALAMEDA POINT
5 2000 OCEANO DUNES STATE VEHICULAR RECREATION AREA

For data frames, it’s especially common to index the rows by condition and the columns by name. For instance, let’s get the site_name and bp_min columns for all year 2000 observations in the least terns data set:

result = terns[terns$year == 2000, c("site_name", "bp_min")]
head(result)
                                     site_name bp_min
1                        PITTSBURG POWER PLANT     15
2                           ALBANY CENTRAL AVE      6
3                                ALAMEDA POINT    282
4                               KETTLEMAN CITY      2
5 OCEANO DUNES STATE VEHICULAR RECREATION AREA      4
6              RANCHO GUADALUPE DUNES PRESERVE      9
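
The same pattern works for setting values. The sketch below uses a small hypothetical data frame obs rather than the least terns data, so the column names and values are made up for illustration:

```r
# Hypothetical data frame standing in for a larger data set
obs = data.frame(
  site = c("A", "B", "C", "A"),
  year = c(2000, 2000, 2001, 2001),
  count = c(15, 6, 282, 20)
)

# Get: rows by condition, columns by name
y2000 = obs[obs$year == 2000, c("site", "count")]
y2000

# Set: mark the counts for all year 2001 observations as missing
obs[obs$year == 2001, "count"] = NA
obs$count
```

Only the rows where the condition is TRUE are modified; the year 2000 counts stay as they were.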

Also see the section on the drop parameter for a case where the [ operator behaves in a surprising way.