# 2 Data Structures

The previous chapter introduced R and gave you enough background to do some simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need in order to use R effectively in the long term. Specifically, it begins with a deep dive into R’s various data structures and data types, then explains a variety of ways to get and set their elements.

#### Learning Objectives

- Create vectors, including sequences
- Identify whether a function is vectorized or not
- Check the type and class of an object
- Coerce an object to a different type
- Describe matrices and lists
- Describe and differentiate `NA`, `NaN`, `Inf`, and `NULL`
- Identify, create, and relevel factors
- Index vectors with empty, integer, string, and logical arguments
- Negate or combine conditions with logic operators

## 2.1 Vectors

A *vector* is a collection of values. Vectors are the fundamental unit of data
in R, and you’ve already used them in the previous sections.

For instance, each column in a data frame is a vector. So the `quarter` column in the earnings data from Section 1.6 is a vector. Take a look at it now. You can use `head` to avoid printing too much. Set the second argument to `10` so that exactly 10 values are printed:

`head(earn$quarter, 10)`

`## [1] 1 2 3 4 1 2 3 4 1 2`

Like all vectors, this vector is *ordered*, which just means the values, or *elements*, have specific positions. The value of the 1st element is `1`, the 2nd is `2`, the 5th is also `1`, and so on.

Notice that the elements of this vector are all integers. This isn’t just a
quirk of the earnings data set. In R, the elements of a vector must all be the
same type of data (we say the elements are *homogeneous*). A vector can contain
integers, decimal numbers, strings, or several other types of data, but not a
mix of these all at once.

The other columns in the earnings data frame are also vectors. For instance, the `age` column is a vector of strings:

`head(earn$age)`

```
## [1] "16 years and over" "16 years and over" "16 years and over"
## [4] "16 years and over" "16 years and over" "16 years and over"
```

Vectors can contain any number of elements, including 0 or 1 element. Unlike mathematics, R does not distinguish between vectors and *scalars* (solitary values). So as far as R is concerned, a solitary value, like `3`, is a vector with 1 element.

You can check the length of a vector (and other objects) with the `length` function:

`length(3)`

`## [1] 1`

`length("hello")`

`## [1] 1`

`length(earn$age)`

`## [1] 4224`

Since the last of these is a column from the data frame `earn`, the length is the same as the number of rows in `earn`.

### 2.1.1 Creating Vectors

Sometimes you’ll want to create your own vectors. You can do this by concatenating several vectors together with the `c` function. It accepts any number of vector arguments, and combines them into a single vector:

`c(1, 2, 19, -3)`

`## [1] 1 2 19 -3`

`c("hi", "hello")`

`## [1] "hi" "hello"`

`c(1, 2, c(3, 4))`

`## [1] 1 2 3 4`

If the arguments you pass to the `c` function have different data types, R will attempt to convert them to a common data type that preserves the information:

`c(1, "cool", 2.3)`

`## [1] "1" "cool" "2.3"`

Section 2.2.2 explains the rules for this conversion in more detail.

The colon operator `:` creates vectors that contain sequences of integers. This is useful for creating “toy” data to test things on, and later we’ll see that it’s also important in several other contexts. Here are a few different sequences:

`1:3`

`## [1] 1 2 3`

`-3:5`

`## [1] -3 -2 -1 0 1 2 3 4 5`

`10:1`

`## [1] 10 9 8 7 6 5 4 3 2 1`

Beware that both endpoints are included in the sequence, even in sequences like `1:0`, and that the difference between elements is always `-1` or `1`. If you want more control over the generated sequence, use the `seq` function instead.
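As a brief sketch of what that extra control looks like, the example below uses the `by` and `length.out` arguments of `seq`. The variable names and values are only for illustration and are not part of the earnings data:

```r
# Sequences with a step other than 1 or -1
evens = seq(2, 10, by = 2)           # 2 4 6 8 10
quarters = seq(0, 1, by = 0.25)      # 0.00 0.25 0.50 0.75 1.00
grid = seq(1, 10, length.out = 4)    # 4 evenly spaced values: 1 4 7 10

# The colon operator counts down when the first endpoint is larger,
# so 1:0 has two elements rather than zero
length(1:0)                          # 2

# seq_len(n) is a safer way to count from 1 to n, because it returns
# an empty vector when n is 0
length(seq_len(0))                   # 0
```

The `seq_len` function is worth remembering for cases where the endpoint might be 0.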

### 2.1.2 Indexing Vectors

You can access individual elements of a vector with the *indexing operator* `[` (also called the *square bracket operator*). The syntax is:

`VECTOR[INDEXES]`

Here `INDEXES` is a vector of positions of elements you want to get or set.

For example, let’s make a vector with 5 elements and get the 2nd element:

```
x = c(4, 8, 3, 2, 1)
x[2]
```

`## [1] 8`

Now let’s get the 3rd and 1st element:

`x[c(3, 1)]`

`## [1] 3 4`

You can use the indexing operator together with the assignment operator to assign elements of a vector:

```
x[3] = 0
x
```

`## [1] 4 8 0 2 1`

Indexing is among the most frequently used operations in R, so take some time to try it out with a few different vectors and indexes. We’ll revisit indexing in Section 2.4 to learn a lot more about it.

### 2.1.3 Vectorization

Let’s look at what happens if we call a mathematical function, like `sin`, on a vector:

```
x = c(1, 3, 0, pi)
sin(x)
```

`## [1] 8.414710e-01 1.411200e-01 0.000000e+00 1.224647e-16`

This gives us the same result as if we had called the function separately on each element. That is, the result is the same as:

`c(sin(1), sin(3), sin(0), sin(pi))`

`## [1] 8.414710e-01 1.411200e-01 0.000000e+00 1.224647e-16`

Of course, the first version is much easier to type.

Functions that take a vector argument and get applied element-by-element like this are said to be *vectorized*. Most functions in R are vectorized, especially math functions. Some examples include `sin`, `cos`, `tan`, `log`, `exp`, and `sqrt`.

Functions that are not vectorized tend to be ones that combine or aggregate values in some way. For instance, the `sum`, `mean`, `median`, `length`, and `class` functions are not vectorized.
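One rough way to check whether a function is vectorized is to compare the length of its result to the length of its input. This is only a heuristic sketch on toy data, not a formal test:

```r
x = c(1, 4, 9, 16)

# A vectorized function returns one result per element
length(sqrt(x)) == length(x)    # TRUE

# An aggregating function collapses the vector to a single value
length(sum(x)) == length(x)     # FALSE
```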

A function can be vectorized across multiple arguments. This is easiest to understand in terms of the arithmetic operators. Let’s see what happens if we add two vectors together:

```
x = c(1, 2, 3, 4)
y = c(-1, 7, 10, -10)
x + y
```

`## [1] 0 9 13 -6`

The elements are paired up and added according to their positions. The other arithmetic operators are also vectorized:

`x - y`

`## [1] 2 -5 -7 14`

`x * y`

`## [1] -1 14 30 -40`

`x / y`

`## [1] -1.0000000 0.2857143 0.3000000 -0.4000000`

### 2.1.4 Recycling

When a function is vectorized across multiple arguments, what happens if the vectors have different lengths? Whenever you think of a question like this as you’re learning R, the best way to find out is to create some toy data and test it yourself. Let’s try that now:

```
x = c(1, 2, 3, 4)
y = c(-1, 1)
x + y
```

`## [1] 0 3 2 5`

The elements of the shorter vector are *recycled* to match the length of the longer vector. That is, after the second element, the elements of `y` are repeated to make a vector with the same length as `x` (because `x` is longer), and then vectorized addition is carried out as usual.

Here’s what that looks like written down:

```
    1  2  3  4
+  -1  1 -1  1
--------------
    0  3  2  5
```

If the length of the longer vector is not a multiple of the length of the shorter vector, R issues a warning, but still returns the result. The warning is meant as a reminder, because unintended recycling is a common source of bugs:

```
x = c(1, 2, 3, 4, 5)
y = c(-1, 1)
x + y
```

```
## Warning in x + y: longer object length is not a multiple of shorter object
## length
```

`## [1] 0 3 2 5 4`

Recycling might seem strange at first, but it’s convenient if you want to use a specific value (or pattern of values) with a vector. For instance, suppose you want to multiply all the elements of a vector by `2`. Recycling makes this easy:

`2 * c(1, 2, 3)`

`## [1] 2 4 6`

When you use recycling, most of the time one of the arguments will be a scalar like this.
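If you would rather make recycling explicit, one option is the `rep` function with its `length.out` argument. The sketch below reuses the toy vectors from earlier in this section, and `y_recycled` is just an illustrative name:

```r
x = c(1, 2, 3, 4)
y = c(-1, 1)

# rep() with length.out spells out what recycling does behind the scenes
y_recycled = rep(y, length.out = length(x))
y_recycled                            # -1 1 -1 1

# The explicit version gives the same result as relying on recycling
identical(x + y, x + y_recycled)      # TRUE
```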

## 2.2 Data Types & Classes

Data can be categorized into different *types* based on sets of shared
characteristics. For instance, statisticians tend to think about whether data
are numeric or categorical:

- numeric
- continuous (real or complex numbers)
- discrete (integers)

- categorical
- nominal (categories with no ordering)
- ordinal (categories with some ordering)

Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible. Categorizing data this way is useful for reasoning about which methods to apply to which data.

In R, data objects are categorized in two different ways:

- The *class* of an R object describes what the object does, or the role that it plays. Sometimes objects can do more than one thing, so objects can have more than one class. The `class` function, which debuted in Section 1.6, returns the classes of its argument.
- The *type* of an R object describes what the object is. Technically, the type corresponds to how the object is stored in your computer’s memory. Each object has exactly one type. The `typeof` function returns the type of its argument.

Of the two, classes tend to be more important than types. If you aren’t sure what an object is, checking its classes should be the first thing you do.

The built-in classes you’ll use all the time correspond to vectors and lists (which we’ll learn more about in Section 2.2.1):

Class | Example | Description |
---|---|---|
logical | `TRUE`, `FALSE` | Logical (or Boolean) values |
integer | `-1L`, `1L`, `2L` | Integer numbers |
numeric | `-2.1`, `7`, `34.2` | Real numbers |
complex | `3-2i`, `-8+0i` | Complex numbers |
character | `"hi"`, `"YAY"` | Text strings |
list | `list(TRUE, 1, "hi")` | Ordered collection of heterogeneous elements |

R doesn’t distinguish between scalars and vectors, so the class of a vector is the same as the class of its elements:

`class("hi")`

`## [1] "character"`

`class(c("hello", "hi"))`

`## [1] "character"`

In addition, for most vectors, the class and the type are the same:

```
x = c(TRUE, FALSE)
class(x)
```

`## [1] "logical"`

`typeof(x)`

`## [1] "logical"`

The exception to this rule is numeric vectors, which have type `double` for historical reasons:

`class(pi)`

`## [1] "numeric"`

`typeof(pi)`

`## [1] "double"`

`typeof(3)`

`## [1] "double"`

The word “double” here stands for *double-precision floating point
number*, a standard way to represent real numbers on computers.

By default, R assumes any numbers you enter in code are numeric, even if they’re integer-valued.

The class `integer` also represents integer numbers, but it’s not used as often as `numeric`. A few functions, such as the sequence operator `:` and the `length` function, return integers. You can also force R to create an integer by adding the suffix `L` to a number, but there are no major drawbacks to using the `double` default:

`class(1:3)`

`## [1] "integer"`

`class(3)`

`## [1] "numeric"`

`class(3L)`

`## [1] "integer"`

Besides the classes for vectors and lists, there are several built-in classes that represent more sophisticated data structures:

Class | Description |
---|---|
function | Functions |
factor | Categorical values |
matrix | Two-dimensional ordered collection of homogeneous elements |
array | Multi-dimensional ordered collection of homogeneous elements |
data.frame | Data frames |

For these, the class is usually different from the type. We’ll learn more about most of these later on.
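To see the difference between class and type for a few of these structures, here is a small sketch. The objects `f` and `m` are toy examples made up for illustration:

```r
f = factor(c("low", "high", "low"))
m = matrix(1:4, 2, 2)

class(f)       # "factor"
typeof(f)      # "integer": factors store integer codes for their levels
typeof(m)      # "integer": the type of the matrix elements
class(mean)    # "function"
typeof(mean)   # "closure": the type of ordinary R functions

# inherits() checks for one class among possibly several
inherits(m, "matrix")    # TRUE
```

The `inherits` function is handy when an object has more than one class and you only care about one of them.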

### 2.2.1 Lists

A *list* is an ordered data structure where the elements can have different
types (they are *heterogeneous*). This differs from a vector, where the
elements all have to have the same type, as we saw in Section 2.1.
The tradeoff is that most vectorized functions do not work with lists.

You can make an ordinary list with the `list` function:

```
x = list(1, c("hi", "bye"))
class(x)
```

`## [1] "list"`

`typeof(x)`

`## [1] "list"`

For ordinary lists, the type and the class are both `list`. In Section 2.4, we’ll learn how to get and set list elements, and in later sections we’ll learn more about when and why to use lists.

You’ve already seen one list, the earnings data frame:

`class(earn)`

`## [1] "data.frame"`

`typeof(earn)`

`## [1] "list"`

Under the hood, data frames are lists, and each column is a list element. Because the class is `data.frame` rather than `list`, R treats data frames differently from ordinary lists. This difference is apparent in how data frames are printed compared to ordinary lists.

### 2.2.2 Implicit Coercion

R’s types fall into a natural hierarchy of expressiveness:

`logical` → `integer` → `double` → `complex` → `character`

Each type on the right is more expressive than the ones to its left. That is, with the convention that `FALSE` is `0` and `TRUE` is `1`, we can represent any logical value as an integer. In turn, we can represent any integer as a double, and any double as a complex number. By writing the number out, we can also represent any complex number as a string.

The point is that no information is lost as we follow the arrows from left to
right along the types in the hierarchy. In fact, R will automatically and
silently convert from types on the left to types on the right as needed. This
is called *implicit coercion*.

As an example, consider what happens if we add a logical value to a number:

`TRUE + 2`

`## [1] 3`

R automatically converts the `TRUE` to the numeric value `1`, and then carries out the arithmetic as usual.

We’ve already seen implicit coercion at work once before, when we learned the `c` function. Since the elements of a vector all have to have the same type, if you pass several different types to `c`, then R tries to use implicit coercion to make them the same:

```
x = c(TRUE, "hi", 1, 1+3i)
class(x)
```

`## [1] "character"`

`x`

`## [1] "TRUE" "hi" "1" "1+3i"`

Implicit coercion is strictly one-way; it never occurs in the other direction. If you want to coerce a type on the right to one on the left, you can do it explicitly with one of the `as.TYPE` functions. For instance, the `as.numeric` (or `as.double`) function coerces to numeric:

`as.numeric("3.1")`

`## [1] 3.1`
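A few more of the `as.TYPE` functions are sketched below. The input values are arbitrary examples, and `bad` is just an illustrative name:

```r
as.integer("42")      # 42
as.integer(3.9)       # 3: the fractional part is dropped, not rounded
as.logical("TRUE")    # TRUE
as.character(1.5)     # "1.5"

# When a string can't be interpreted as a number, the result is NA and R
# issues a warning (suppressed here to keep the output short)
bad = suppressWarnings(as.numeric("three"))
bad                   # NA
```

Because information can be lost when coercing in this direction, it’s a good idea to check the result for `NA` values.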

There are a few types that fall outside of the hierarchy entirely, like functions. Implicit coercion doesn’t apply to these. If you try to use these types where it doesn’t make sense to, R generally returns an error:

`sin + 3`

`## Error in sin + 3: non-numeric argument to binary operator`

If you try to use these types as elements of a vector, you get back a list instead:

```
x = c(1, 2, sum)
class(x)
```

`## [1] "list"`

Understanding how implicit coercion works will help you avoid bugs, and can also be a time-saver. For example, we can use implicit coercion to succinctly count how many elements of a vector satisfy some condition:

```
x = c(1, 3, -1, 10, -2, 3, 8, 2)
condition = x < 4
sum(condition)    # or sum(x < 4)
```

`## [1] 6`

If you still don’t quite understand how the code above works, try inspecting each variable. In general, inspecting each step or variable is a good strategy for understanding why a piece of code works (or doesn’t work!). Here the implicit coercion happens in the third line.
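The same trick works with other functions that expect numbers. As a small sketch using the same toy vector, `mean` gives the proportion of elements that satisfy a condition, and `as.integer` shows the coerced values directly:

```r
x = c(1, 3, -1, 10, -2, 3, 8, 2)

as.integer(x < 4)   # 1 1 1 0 1 1 0 1
sum(x < 4)          # 6 elements are less than 4
mean(x < 4)         # 0.75, the proportion of elements less than 4
```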

### 2.2.3 Matrices & Arrays

A *matrix* is the two-dimensional analogue of a vector. The elements, which are
arranged into rows and columns, are ordered and homogeneous.

You can create a matrix from a vector with the `matrix` function. By default, the columns are filled first:

```
# A matrix with 2 rows and 3 columns:
matrix(1:6, 2, 3)
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```
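If you want the rows to be filled first instead, the `matrix` function has a `byrow` argument. Here is a short sketch, where `m_by_row` is just an illustrative name:

```r
# Same data as above, but filled one row at a time
m_by_row = matrix(1:6, 2, 3, byrow = TRUE)
m_by_row

# The dimensions are unchanged: 2 rows and 3 columns
dim(m_by_row)
```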

The class of a matrix always includes `matrix` (in R 4.0.0 and later, a matrix also has the class `array`), and the type matches the type of the elements:

```
x = matrix(c("a", "b", NA, "c"), 2, 2)
x
```

```
##      [,1] [,2]
## [1,] "a"  NA
## [2,] "b"  "c"
```

`class(x)`

`## [1] "matrix" "array"`

`typeof(x)`

`## [1] "character"`

You can use the matrix multiplication operator `%*%` to multiply two matrices with compatible dimensions.

An *array* is a further generalization of matrices to higher dimensions. You can create an array from a vector with the `array` function. The characteristics of arrays are almost identical to matrices, but the class of an array is always `array`.
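As a minimal sketch of both ideas, the example below creates a three-dimensional array and multiplies two small matrices. The names `a`, `m1`, and `m2` and the values in them are made up for illustration:

```r
# A 2 x 3 x 2 array: the dim argument gives the size of each dimension
a = array(1:12, dim = c(2, 3, 2))
dim(a)          # 2 3 2
class(a)        # "array"
typeof(a)       # "integer"

# Matrix multiplication needs compatible dimensions: (2 x 3) %*% (3 x 2)
m1 = matrix(1:6, 2, 3)
m2 = matrix(1:6, 3, 2)
dim(m1 %*% m2)  # 2 2
```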

### 2.2.4 Factors

A feature is *categorical* if it measures a qualitative category. For example, the genres `rock`, `blues`, `alternative`, `folk`, and `pop` are categories.

R uses the class `factor` to represent categorical data. Visualizations and statistical models sometimes treat factors differently than other data types, so it’s important to make sure you have the right data type. If you’re ever unsure, remember that you can check the class of an object with the `class` function.

When you load a data set, R usually can’t tell which features are categorical. That means identifying and converting the categorical features is up to you. For beginners, it can be difficult to understand whether a feature is categorical or not. The key is to think about whether you want to use the feature to divide the data into groups.

For example, if we want to know how many songs are in the `rock` genre, we first need to divide the songs by genre, and then count the number of songs in each group (or at least the `rock` group).

As a second example, months recorded as numbers can be categorical or not, depending on how you want to use them. You might want to treat them as categorical (for example, to compute the maximum rainfall in each month) or you might want to treat them as numbers (for example, to compute the number of months between two events).

The bottom line is that you have to think about what you’ll be doing in the analysis. In some cases, you might treat a feature as categorical only for part of the analysis.

Let’s think about which features are categorical in the earnings data set. To refresh our memory of what’s in the data set, we can look at the structural summary:

`str(earn)`

```
## 'data.frame': 4224 obs. of 8 variables:
## $ sex : chr "Both Sexes" "Both Sexes" "Both Sexes" "Both Sexes" ...
## $ race : chr "All Races" "All Races" "All Races" "All Races" ...
## $ ethnic_origin : chr "All Origins" "All Origins" "All Origins" "All Origins" ...
## $ age : chr "16 years and over" "16 years and over" "16 years and over" "16 years and over" ...
## $ year : int 2010 2010 2010 2010 2011 2011 2011 2011 2012 2012 ...
## $ quarter : int 1 2 3 4 1 2 3 4 1 2 ...
## $ n_persons : int 96821000 99798000 101385000 100120000 98329000 100593000 101447000 101458000 100830000 102769000 ...
## $ median_weekly_earn: int 754 740 740 752 755 753 753 764 769 771 ...
```

The columns `n_persons` and `median_weekly_earn` are quantitative rather than categorical, since they measure quantities of people and dollars, respectively.

The `sex`, `race`, `ethnic_origin`, and `age` columns are all categorical, since they are all qualitative measurements. We can see this better if we use the `table` function to compute frequencies for the values in the columns:

`table(earn$sex)`

```
##
## Both Sexes Men Women
## 1408 1408 1408
```

`table(earn$race)`

```
##
## All Races Asian Black or African American
## 2244 660 660
## White
## 660
```

`table(earn$ethnic_origin)`

```
##
## All Origins Hispanic or Latino
## 3564 660
```

`table(earn$age)`

```
##
## 16 to 19 years 16 to 24 years 16 years and over 20 to 24 years
## 132 660 660 132
## 25 to 34 years 25 to 54 years 25 years and over 35 to 44 years
## 132 660 660 132
## 45 to 54 years 55 to 64 years 55 years and over 65 years and over
## 132 132 660 132
```

Each column has only a few unique values, repeated many times. These are ideal for grouping the data. If age had been recorded as a number, rather than a range, it would probably be better to treat it as quantitative, since there would be far more unique values. Columns with many unique values don’t make good categorical features, because each group will only have a few elements!

That leaves us with the `year` and `quarter` columns. It’s easy to imagine grouping the data by year or quarter, but these are also clearly numbers. These columns can be treated as quantitative or categorical data, depending on how we want to use them to analyze the data.

Let’s convert the `age` column to a factor. To do this, use the `factor` function:

```
age = factor(earn$age)
head(age)
```

```
## [1] 16 years and over 16 years and over 16 years and over 16 years and over
## [5] 16 years and over 16 years and over
## 12 Levels: 16 to 19 years 16 to 24 years 16 years and over ... 65 years and over
```

Notice that factors are printed differently than strings.

The categories of a factor are called *levels*. You can list the levels with the `levels` function:

`levels(age)`

```
## [1] "16 to 19 years" "16 to 24 years" "16 years and over"
## [4] "20 to 24 years" "25 to 34 years" "25 to 54 years"
## [7] "25 years and over" "35 to 44 years" "45 to 54 years"
## [10] "55 to 64 years" "55 years and over" "65 years and over"
```

Factors remember all possible levels even if you take a subset:

`age[1:3]`

```
## [1] 16 years and over 16 years and over 16 years and over
## 12 Levels: 16 to 19 years 16 to 24 years 16 years and over ... 65 years and over
```

This is another way factors are different from strings. Factors “remember” all possible levels even if they aren’t present. This ensures that if you plot a factor, the missing levels will still be represented on the plot.

You can make a factor forget levels that aren’t present with the `droplevels` function:

`droplevels(age[1:3])`

```
## [1] 16 years and over 16 years and over 16 years and over
## Levels: 16 years and over
```
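By default, `factor` orders the levels alphabetically. If a different order makes more sense, you can set it with the `levels` argument. The sketch below uses a made-up `sizes` vector rather than the earnings data:

```r
sizes = factor(c("medium", "small", "large", "small"))
levels(sizes)    # alphabetical by default: "large" "medium" "small"

# Pass the levels explicitly to control their order
sizes = factor(sizes, levels = c("small", "medium", "large"))
levels(sizes)    # "small" "medium" "large"

# Functions like table() report results in level order
table(sizes)
```

Changing the order of the levels does not change the values themselves, only how they are ordered in summaries and plots.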

## 2.3 Special Values

R has four *special* values to represent missing or invalid data.

### 2.3.1 Missing Values

The value `NA`, called the *missing value*, represents missing entries in a data set. It’s implied that the entries are missing due to how the data was collected, although there are exceptions. As an example, imagine the data came from a survey, and respondents chose not to answer some questions. In the data set, their answers for those questions can be recorded as `NA`.

The missing value is a chameleon: it can be a logical, integer, numeric, complex, or character value. By default, the missing value is logical, and the other types occur through coercion (Section 2.2.2):

`class(NA)`

`## [1] "logical"`

`class(c(1, NA))`

`## [1] "numeric"`

`class(c("hi", NA, NA))`

`## [1] "character"`

The missing value is also contagious: it represents an unknown quantity, so using it as an argument to a function usually produces another missing value. The idea is that if the inputs to a computation are unknown, generally so is the output:

`NA - 3`

`## [1] NA`

`mean(c(1, 2, NA))`

`## [1] NA`

As a consequence, testing whether an object is equal to the missing value with `==` doesn’t return a meaningful result:

`5 == NA`

`## [1] NA`

`NA == NA`

`## [1] NA`

You can use the `is.na` function instead:

`is.na(5)`

`## [1] FALSE`

`is.na(NA)`

`## [1] TRUE`

`is.na(c(1, NA, 3))`

`## [1] FALSE TRUE FALSE`

Missing values are a feature that sets R apart from most other programming languages.
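Since missing values are contagious, many aggregating functions, including `mean` and `sum`, accept an `na.rm` argument that removes missing values before computing. Here is a small sketch with made-up `scores`:

```r
scores = c(4, NA, 7, 10)   # hypothetical survey scores

mean(scores)                  # NA, because one input is unknown
mean(scores, na.rm = TRUE)    # 7: the mean of 4, 7, and 10

# Implicit coercion makes it easy to count the missing values
sum(is.na(scores))            # 1
```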

### 2.3.2 Infinity

The value `Inf` represents infinity, and can be numeric or complex. You’re most likely to encounter it as the result of certain computations:

`13 / 0`

`## [1] Inf`

`class(Inf)`

`## [1] "numeric"`

You can use the `is.infinite` function to test whether a value is infinite:

`is.infinite(3)`

`## [1] FALSE`

`is.infinite(c(-Inf, 0, Inf))`

`## [1] TRUE FALSE TRUE`

### 2.3.3 Not a Number

The value `NaN`, called *not a number*, represents a quantity that’s undefined mathematically. For instance, dividing 0 by 0 is undefined:

`0 / 0`

`## [1] NaN`

`class(NaN)`

`## [1] "numeric"`

Like `Inf`, `NaN` can be numeric or complex.

You can use the `is.nan` function to test whether a value is `NaN`:

`is.nan(c(10.1, log(-1), 3))`

`## Warning in log(-1): NaNs produced`

`## [1] FALSE TRUE FALSE`

### 2.3.4 Null

The value `NULL` represents a quantity that’s undefined in R. Most of the time, `NULL` indicates the absence of a result. For instance, vectors don’t have dimensions, so the `dim` function returns `NULL` for vectors:

`dim(c(1, 2))`

`## NULL`

`class(NULL)`

`## [1] "NULL"`

`typeof(NULL)`

`## [1] "NULL"`

Unlike the other special values, `NULL` has its own unique type and class.

You can use the `is.null` function to test whether a value is `NULL`:

`is.null("null")`

`## [1] FALSE`

`is.null(NULL)`

`## [1] TRUE`
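To help tell the four special values apart, the sketch below applies each test function to the same toy vector, `vals`. One detail to watch for is that `is.na` also returns `TRUE` for `NaN`:

```r
vals = c(1, NA, NaN, Inf)

is.na(vals)         # FALSE TRUE TRUE FALSE: NaN also counts as missing
is.nan(vals)        # FALSE FALSE TRUE FALSE
is.infinite(vals)   # FALSE FALSE FALSE TRUE

# NULL has length 0 and vanishes when combined into a vector
length(NULL)        # 0
length(c(1, NULL))  # 1
```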

## 2.4 Indexing

The way to get and set elements of a data structure is by *indexing*. Sometimes
this is also called *subsetting* or (element) *extraction*. Indexing is a
fundamental operation in R, key to reasoning about how to solve problems with
the language.

We first saw indexing in Section 1.6, where we used `$`, the dollar sign operator, to get and set data frame columns. We saw indexing again in Section 2.1.2, where we used `[`, the indexing or square bracket operator, to get and set elements of vectors.

The indexing operator `[` is R’s primary operator for indexing. It works in four different ways, depending on the type of the index you use. These four ways to select elements are:

- All elements, with no index
- By position, with a numeric index
- By name, with a character index
- By condition, with a logical index

Let’s examine each in more detail. We’ll use this vector as an example, to keep things concise:

```
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x
```

```
## a b c d e
## 10 20 30 40 50
```

Even though we’re using a vector here, the indexing operator works with almost all data structures, including factors, lists, matrices, and data frames. We’ll look at unique behavior for some of these later on.

### 2.4.1 All Elements

The first way to use `[` to select elements is to leave the index blank. This selects all elements:

`x[]`

```
## a b c d e
## 10 20 30 40 50
```

This way of indexing is rarely used for getting elements, since it’s the same as entering the variable name without the indexing operator. Instead, its main use is for setting elements. Suppose we want to set all the elements of `x` to `5`. You might try writing this:

```
x = 5
x
```

`## [1] 5`

Rather than setting each element to `5`, this sets `x` to the scalar `5`, which is not what we want. Let’s reset the vector and try again, this time using the indexing operator:

```
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x[] = 5
x
```

```
## a b c d e
## 5 5 5 5 5
```

As you can see, now all the elements are `5`. So the indexing operator is necessary to specify that we want to set the elements rather than the whole variable.

Let’s reset `x` one more time, so that we can use it again in the next example:

`x = c(a = 10, b = 20, c = 30, d = 40, e = 50)`

### 2.4.2 By Position

The second way to use `[` is to select elements by position. This happens when you use an integer or numeric index. We already saw the basics of this in Section 2.1.2.

The positions of the elements in a vector (or other data structure) correspond to numbers starting from 1 for the first element. This way of indexing is frequently used together with the sequence operator `:` to get ranges of values. For instance, let’s get the 2nd through 4th elements of `x`:

`x[2:4]`

```
## b c d
## 20 30 40
```

You can also use this way of indexing to set specific elements or ranges of elements. For example, let’s set the 3rd and 5th elements of `x` to `9` and `7`, respectively:

```
x[c(3, 5)] = c(9, 7)
x
```

```
## a b c d e
## 10 20 9 40 7
```

When getting elements, you can repeat numbers in the index to get the same element more than once. You can also use the order of the numbers to control the order of the elements:

`x[c(2, 1, 2, 2)]`

```
## b a b b
## 20 10 20 20
```

Finally, if the index contains only negative numbers, the elements at those positions are excluded rather than selected. For instance, let’s get all elements except the 1st and 5th:

`x[-c(1, 5)]`

```
## b c d
## 20 9 40
```

When you index by position, the index should always be all positive or all negative. Using a mix of positive and negative numbers causes R to emit an error rather than returning elements, since it’s unclear what the result should be:

`x[c(-1, 2)]`

`## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts`

### 2.4.3 By Name

The third way to use `[` is to select elements by name. This happens when you use a character vector as the index, and only works with named data structures.

Like indexing by position, you can use indexing by name to get or set elements. You can also use it to repeat elements or change the order. Let’s get elements `a`, `c`, `d`, and `a` again from the vector `x`:

```
y = x[c("a", "c", "d", "a")]
y
```

```
## a c d a
## 10 9 40 10
```

Element names are generally unique, but if they’re not, indexing by name gets or sets the first element whose name matches the index:

`y["a"]`

```
## a
## 10
```
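The names themselves can be inspected or changed with the `names` function. The sketch below uses a separate toy vector, `v`, so that it doesn’t disturb `x`:

```r
v = c(a = 10, b = 20, c = 30)
names(v)                 # "a" "b" "c"

# Assigning to names() renames the elements
names(v) = c("first", "second", "third")
v["second"]              # the element with value 20

# For a vector like this one, a name that doesn't exist gives NA
# rather than an error
v["zzz"]
```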

Let’s reset `x` again to prepare for learning about the final way to index:

`x = c(a = 10, b = 20, c = 30, d = 40, e = 50)`

### 2.4.4 By Condition

The fourth and final way to use `[` is to select elements based on a condition. This happens when you use a logical vector as the index. The logical vector should have the same length as what you’re indexing, and will be recycled if it doesn’t.

#### Congruent Vectors

To understand indexing by condition, we first need to learn about congruent
vectors. Two vectors are *congruent* if they have the same length and they
correspond element-by-element.

For example, suppose you do a survey that records each respondent’s favorite animal and age. These are two different vectors of information, but each person will have a response for both. So you’ll have two vectors that are the same length:

```
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)
```

The 1st element of each vector corresponds to the 1st person, the 2nd to the 2nd person, and so on. These vectors are congruent.

Notice that columns in a data frame are always congruent!

#### Back to Indexing

When you index by condition, the index should generally be congruent to the object you’re indexing. Elements where the index is `TRUE` are kept and elements where the index is `FALSE` are dropped.

If you create the index from a condition on the object, it’s automatically congruent. For instance, let’s make a condition based on the vector `x`:

```
is_small = x < 25
is_small
```

```
## a b c d e
## TRUE TRUE FALSE FALSE FALSE
```

The 1st element in the logical vector `is_small` corresponds to the 1st element of `x`, the 2nd to the 2nd, and so on. The vectors `x` and `is_small` are congruent.

It makes sense to use `is_small` as an index for `x`, and it gives us all the elements less than `25`:

`x[is_small]`

```
## a b
## 10 20
```

Of course, you can also avoid using an intermediate variable for the condition:

`x[x > 10]`

```
## b c d e
## 20 30 40 50
```

If you create the index some other way (not using the object), make sure that it’s still congruent to the object. Otherwise, the subset returned from indexing might not be meaningful.
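As a sketch of an index built from a different but congruent vector, the example below reuses the survey vectors from the discussion of congruent vectors. Since each position refers to the same respondent in both vectors, a condition on one can be used to index the other:

```r
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)

# Favorite animals of respondents older than 30
animal[age > 30]          # "dog" "iguana"

# Ages of respondents whose favorite animal is a cat
age[animal == "cat"]      # 24
```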

You can also use indexing by condition to set elements, just as the other ways of indexing can be used to set elements. For instance, let’s set all the elements of `x` that are greater than `10` to the missing value `NA`:

```
x[x > 10] = NA
x
```

```
## a b c d e
## 10 NA NA NA NA
```
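A common practical use of this pattern is replacing a placeholder code with `NA`. The sketch below assumes a hypothetical `temps` vector in which `-999` marks a missing reading. It also uses the `which` function, which returns the positions where a logical vector is `TRUE`:

```r
temps = c(12, -999, 15, -999, 18)   # hypothetical data; -999 means "missing"

# Replace the placeholder with a proper missing value
temps[temps == -999] = NA
temps                        # 12 NA 15 NA 18

# Positions of the missing readings
which(is.na(temps))          # 2 4
```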

### 2.4.5 Logic

All of the conditions we’ve seen so far have been written in terms of a single test. If you want to use more sophisticated conditions, R provides operators to negate and combine logical vectors. These operators are useful for working with logical vectors even outside the context of indexing.

#### Negation

The *NOT operator* `!` converts `TRUE` to `FALSE` and `FALSE` to `TRUE`:

```
x = c(TRUE, FALSE, TRUE, TRUE, NA)
x
```

`## [1] TRUE FALSE TRUE TRUE NA`

`!x`

`## [1] FALSE TRUE FALSE FALSE NA`

You can use `!` with a condition:

```
y = c("hi", "hello")
!(y == "hi")
```

`## [1] FALSE TRUE`

The NOT operator is vectorized.

#### Combinations

R also has operators for combining logical values.

The *AND operator* `&` returns `TRUE` only when both arguments are `TRUE`. Here
are some examples:

`FALSE & FALSE`

`## [1] FALSE`

`TRUE & FALSE`

`## [1] FALSE`

`FALSE & TRUE`

`## [1] FALSE`

`TRUE & TRUE`

`## [1] TRUE`

`c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)`

`## [1] TRUE FALSE FALSE`

The *OR operator* `|` returns `TRUE` when at least one argument is `TRUE`.
Let’s see some examples:

`FALSE | FALSE`

`## [1] FALSE`

`TRUE | FALSE`

`## [1] TRUE`

`FALSE | TRUE`

`## [1] TRUE`

`TRUE | TRUE`

`## [1] TRUE`

`c(TRUE, FALSE) | c(TRUE, TRUE)`

`## [1] TRUE TRUE`

Be careful: everyday English is less precise than logic. You might say:

I want all subjects with age over 50 and all subjects that like cats.

But in logic this means:

`(subject age over 50) OR (subject likes cats)`

So think carefully about whether you need both conditions to be true (AND) or at least one (OR).
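As a sketch of the difference, consider a small made-up data frame. The `subjects` data and its column names are hypothetical and are not part of the earnings data set.

```r
# Hypothetical survey data for illustration
subjects = data.frame(
  name = c("Ana", "Ben", "Cy", "Dee"),
  age = c(62, 35, 51, 28),
  likes_cats = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

over_50 = subjects$age > 50

# OR: subjects over 50, plus subjects that like cats
either = subjects$name[over_50 | subjects$likes_cats]
either   # "Ana" "Ben" "Cy"

# AND: only subjects that are over 50 and also like cats
both = subjects$name[over_50 & subjects$likes_cats]
both     # "Cy"
```

The English sentence in the example above asks for the first result, even though it uses the word “and”.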

Rarely, you might want *exactly one* condition to be true. The *XOR (eXclusive
OR) function* `xor()` returns `TRUE` when exactly one argument is `TRUE`. For
example:

`xor(FALSE, FALSE)`

`## [1] FALSE`

`xor(TRUE, FALSE)`

`## [1] TRUE`

`xor(TRUE, TRUE)`

`## [1] FALSE`

The AND, OR, and XOR operators are vectorized.
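Vectorization is what makes these operators useful for indexing. A brief sketch, again using a hypothetical vector `z`:

```r
# Hypothetical vector for illustration
z = c(a = 10, b = 20, c = 30, d = 40, e = 50)

# AND: elements strictly between 15 and 45
middle = z[z > 15 & z < 45]
middle   # keeps b, c, d: 20 30 40

# OR: elements below 15 or above 45
ends = z[z < 15 | z > 45]
ends     # keeps a, e: 10 50

# xor() also works element by element
one_only = xor(c(TRUE, TRUE, FALSE), c(TRUE, FALSE, FALSE))
one_only # FALSE TRUE FALSE
```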

#### Short-circuiting

The second argument is irrelevant in some conditions:

- `FALSE &` is always `FALSE`
- `TRUE |` is always `TRUE`

Now imagine you have `FALSE & long_computation()`. You can save time by
skipping `long_computation()`. A *short-circuit operator* does exactly that.

R has two short-circuit operators:

- `&&` is a short-circuited `&`
- `||` is a short-circuited `|`

These operators only evaluate the second argument if it is necessary to determine the result. Here are some examples:

`TRUE && FALSE`

`## [1] FALSE`

`TRUE && TRUE`

`## [1] TRUE`

`TRUE || TRUE`

`## [1] TRUE`

`c(TRUE, FALSE) && c(TRUE, TRUE)`

```
## Warning in c(TRUE, FALSE) && c(TRUE, TRUE): 'length(x) = 2 > 1' in coercion to
## 'logical(1)'
## Warning in c(TRUE, FALSE) && c(TRUE, TRUE): 'length(x) = 2 > 1' in coercion to
## 'logical(1)'
```

`## [1] TRUE`

For the final expression, notice R only combines the first element of each
vector and warns that the others are ignored. (In R 4.3.0 and later, this
expression raises an error instead of a warning.) In other words, the
short-circuit operators are *not* vectorized! Because of this, you generally
**should not use** the short-circuit operators for indexing. Their main use is
in writing conditions for if-statements, which we’ll learn about later on.
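One way to observe short-circuiting is to count how often the second argument actually runs. The sketch below is only illustrative: `long_computation` and `calls` are hypothetical names, and the function is a cheap stand-in for genuinely expensive work.

```r
calls = 0

# Hypothetical stand-in for an expensive function; it counts its own calls
long_computation = function() {
  calls <<- calls + 1
  TRUE
}

r1 = FALSE && long_computation()  # result is already FALSE; call is skipped
r2 = TRUE || long_computation()   # result is already TRUE; call is skipped
calls_after_short_circuit = calls # still 0

r3 = FALSE & long_computation()   # `&` evaluates both arguments
calls_after_and = calls           # now 1
```

All three expressions return the expected logical value, but only the vectorized `&` pays the cost of calling the function.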

## 2.5 Exercises

### 2.5.1 Exercise

The `rep` function is another way to create a vector. Read the help file for
the `rep` function.

- What does the `rep` function do to create a vector? Give an example.
- The `rep` function has parameters `times` and `each`. What does each do, and how do they differ? Give examples for both.
- Can you set both of `times` and `each` in a single call to `rep`? If the function raises an error, explain what the error message means. If the function returns a result, explain how the result corresponds to the arguments you chose.

### 2.5.2 Exercise

Considering how implicit coercion works (Section 2.2.2):

- Why does `"3" + 4` raise an error?
- Why does `"TRUE" == TRUE` return `TRUE`?
- Why does `"FALSE" < TRUE` return `TRUE`?

### 2.5.3 Exercise

- Section 2.3.1 described the missing value as a “chameleon” because it can have many different types. Is `Inf` also a chameleon? Use examples to justify your answer.
- The missing value is also “contagious” because using it as an argument usually produces another missing value. Is `Inf` contagious? Again, use examples to justify your answer.

### 2.5.4 Exercise

The `table` function is useful for counting all sorts of things, not just level
frequencies for a factor. For instance, you can use `table` to count how many
`TRUE` and `FALSE` values there are in a logical vector.

- For the earnings data, how many rows had median weekly earnings below $750?
- Based on how the data is structured, is your answer in part 1 the same as the number of quarters that had median weekly earnings below $750? Explain.