5 Files, Packages, and Data
This lesson will focus on working with data files in R. It will reinforce understanding of the command line, as well as RStudio, demonstrate finding and loading packages in R, and introduce new ways of inspecting and working with data.
After this lesson, you should be able to:
- Identify some common file extensions
- Read and write CSV files with R
- Read and write RDS files with R
- Install packages from CRAN
- Recognize categorical data
- Explain what factors are and when to use them
- Explain the purpose of R’s special values, especially missing values
- Explain the four different types of indexes and how to use them
- Explain why the [[ operator is necessary and how to use it
- Explain the syntax to subset a data frame
5.1 Working with Files
5.1.1 Setup
To follow along, download this zip file!
Navigate to where you want to save your work:
cd ~/Documents/
Next, make a directory:
mkdir files_in_r
cd files_in_r
Copy the downloaded zip file into that directory:
cp ~/Downloads/best_in_show.zip .
Unzip the file:
unzip best_in_show.zip
Navigate to the newly created directory:
cd best_in_show
5.1.2 Exploring Files
When working with files, it’s important to gather information and constantly test any assumptions you have.
This process is a key part of programming and of working in the command line.
Let’s start by seeing what we have, which we do with the ls command:
ls
Remember that ls can be modified with flags. For example, to see all the files, including hidden ones, use the -a flag:
ls -a
You can see more information about the files with the -l flag:
ls -l
Flags can be combined for ls:
ls -la
You can use du -h to see the disk usage (file size) of a given file:
du -h dogs.csv
The -h flag stands for “human-readable”; by default, du displays sizes in block units. Being aware of the size of a file early on can help you debug issues with running out of disk space, as well as issues later in the analysis process. For example, reading too many large files into R can overload your system’s RAM (your computer’s working memory).
You can view the disk usage for all the files in the directory with the wildcard symbol *:
du -h *
5.1.2.1 File Extensions
Most of the time, you can guess the format of a file by looking at its extension, the characters (usually three) after the last dot . in the file name. For example, the extension .jpg or .jpeg indicates a JPEG image file. Some operating systems hide extensions by default, but you can find instructions to change this setting online by searching for “show file extensions” and your operating system’s name. The extension is just part of the file’s name, so it should be taken as a hint about the file’s format rather than a guarantee.
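As a small illustration in R, the file_ext function from the tools package (which ships with R) returns the extension of a file name. The file names below are only examples, and the function looks at the name alone, so it cannot confirm what the file actually contains.

```r
# file_ext() extracts the characters after the last dot in each name.
# It inspects only the name, never the contents of the file.
files <- c("dogs.csv", "dogs.rds", "photo.jpeg", "notes")
extensions <- tools::file_ext(files)
extensions
```

A name without a dot, such as the last one, yields an empty string.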
5.1.2.2 Text Files
A text file is one that contains human-readable lines of text. You can check this by opening the file with a text editor such as Microsoft Notepad or macOS TextEdit. Many file formats use text in order to make the format easier to work with.
On the command line, you can get information about the type of a file by using the file command followed by the name of a file:
file dogs.csv
Note that file uses a series of tests to determine the file type (learn more by reading man file), and may not always report the type correctly.
The output of file is the file name followed by a colon and then a description of the file type.
In this case, the output tells us that dogs.csv is a CSV text file.
A comma-separated values (CSV) file records tabular data using one line per row, with commas separating columns.
From the command line we can read text files with vim:
vim dogs.csv
To see the type of all the files in the directory, you can use the wildcard * operator:
file *
5.1.2.3 Binary Files
A binary file is one that’s not human-readable. You can’t just read off the data if you open a binary file in a text editor, but they have a number of other advantages. Compared to text files, binary files are often faster to read and take up less storage space (bytes).
For demonstration’s sake, see what happens when you try to use vim to ‘read’ a binary data file:
vim dogs.rds
Notice that the editor displays data, but it isn’t human-readable: it looks like a bunch of random symbols, with perhaps the occasional recognizable word.
5.1.2.4 Common Data File Types
Name | Extension | Tabular? | Text? |
---|---|---|---|
Comma-separated Values | .csv | Yes | Yes |
Tab-separated Values | .tsv | Yes | Yes |
Fixed-width File | .fwf | Yes | Yes |
Microsoft Excel | .xlsx | Yes | No |
Microsoft Excel 1993-2007 | .xls | Yes | No |
Apache Arrow | .feather | Yes | No |
R Data | .rds | Sometimes | No |
R Data | .rda | Sometimes | No |
Plaintext | .txt | Sometimes | Yes |
Extensible Markup Language | .xml | No | Yes |
JavaScript Object Notation | .json | No | Yes |
5.1.3 Reading and Writing Files in R
R has many functions for working with the file system, including functions for reading and writing files.
5.1.3.1 The Working Directory
The working directory is the starting point R uses for relative paths. Think of the working directory as the directory R is currently “at” or watching.
The function getwd returns the absolute path for the current working directory, as a string. It doesn’t require any arguments:
getwd()
On your computer, the output from getwd will likely be different. This is a very useful function for getting your bearings when you write relative paths. If you write a relative path and it doesn’t work as expected, the first thing to do is check the working directory.
The related setwd function changes the working directory. It takes one argument: a path to the new working directory. Here’s an example:
setwd("..")
# Now check the working directory.
getwd()
Generally, you should avoid using calls to setwd in your R scripts and R Markdown files. Calling setwd makes your code more difficult to understand, and can always be avoided by using appropriate relative paths. If you call setwd with an absolute path, it also makes your code less portable to other computers. It’s fine to use setwd interactively (in the R console), but avoid making your saved code dependent on it.
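A minimal sketch of the alternative: build a relative path with file.path and test it with file.exists before reading. The directory name data and the file name dogs.csv are placeholders for illustration, not a required layout.

```r
# Build a path relative to the working directory instead of calling setwd().
# "data" and "dogs.csv" are placeholder names for this sketch.
path <- file.path("data", "dogs.csv")
path

# file.exists() is a quick way to test a relative path before using it.
if (file.exists(path)) {
  dogs <- read.csv(path)
} else {
  message("Can't find ", path, " from ", getwd())
}
```

If the path is not found, the message shows the working directory, which is usually the first thing to check.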
When working in RStudio, you can set the working directory at the start of your session in Session -> Set Working Directory -> To Source File Location.
Another function that’s useful for dealing with the working directory and file system is list.files. The list.files function returns the names of all of the files and directories inside of a directory. It accepts a path to a directory as an argument, or assumes the working directory if you don’t pass a path. For instance:
# List files and directories in ~/.
list.files("~/")
# List files and directories in the working directory.
list.files()
If you call list.files with an invalid path or an empty directory, the output is character(0):
list.files("/this/path/is/fake/")
character(0)
Later on, we’ll learn about what character(0) means more generally.
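One way to put this to use, sketched under the assumption that you want only the CSV files in the working directory: the pattern parameter of list.files filters names with a regular expression, and length tells you whether anything matched.

```r
# Keep only names that end in ".csv" (pattern is a regular expression).
csv_files <- list.files(pattern = "\\.csv$")

# An empty result is character(0), which has length 0.
if (length(csv_files) == 0) {
  message("No CSV files in ", getwd())
} else {
  print(csv_files)
}
```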
5.1.3.2 Reading a CSV File
Let’s go ahead and read the dogs.csv file we extracted from the zip file at the start.
R provides a very easy built-in function for reading CSV files, and a variety of other formats for text files containing tabular data.
To read a CSV file into R, use read.csv:
dogs = read.csv('dogs.csv')
5.1.3.2.1 Inspecting the Data
dogs = readRDS('data/dogs.rds')
Whenever you import data into R, it is crucial to check that things went as expected. Start by looking at the output of the read.csv function, which we saved into dogs.
Let’s see what the output is. We can check what the object is with the class function:
class(dogs)
[1] "data.frame"
We can see that the read.csv function returned a data frame. This makes sense because data frames represent tabular data, and CSV files contain tabular data.
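To see the correspondence between CSV text and a data frame on a small scale, here is a sketch that passes made-up CSV content to read.csv through its text parameter instead of a file. The breeds and weights are invented for illustration.

```r
# Each line of CSV text becomes one row; commas separate the columns.
csv_text <- "breed,weight
Papillon,5.5
Beagle,22"

tiny <- read.csv(text = csv_text)
class(tiny)
dim(tiny)
```

The first line supplies the column names, so two data lines produce a data frame with two rows and two columns.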
We can get more information with the str function. str concisely gives information about the content of an R object:
str(dogs)
'data.frame': 172 obs. of 18 variables:
$ breed : chr "Border Collie" "Border Terrier" "Brittany" "Cairn Terrier" ...
$ group : Factor w/ 7 levels "herding","hound",..: 1 5 4 5 4 4 4 6 1 1 ...
$ datadog : num 3.64 3.61 3.54 3.53 3.34 3.33 3.3 3.26 3.25 3.22 ...
$ popularity_all : int 45 80 30 59 130 63 27 38 60 20 ...
$ popularity : int 39 61 30 48 81 51 27 33 49 20 ...
$ lifetime_cost : num 20143 22638 22589 21992 20224 ...
$ intelligence_rank: int 1 30 19 35 31 18 20 8 10 6 ...
$ longevity : num 12.5 14 12.9 13.8 12.5 ...
$ ailments : int 2 0 0 2 1 0 2 5 1 5 ...
$ price : num 623 833 618 435 750 800 465 740 530 465 ...
$ food_cost : num 324 324 466 324 324 324 674 324 466 405 ...
$ grooming : Factor w/ 3 levels "daily","weekly",..: 2 2 2 2 2 2 2 2 2 1 ...
$ kids : Factor w/ 3 levels "high","medium",..: 3 1 2 1 1 1 1 2 3 1 ...
$ megarank_kids : int 1 2 3 4 5 6 7 8 9 11 ...
$ megarank : int 29 1 11 2 4 5 6 22 52 8 ...
$ size : Factor w/ 3 levels "large","medium",..: 2 3 2 3 2 2 3 3 2 3 ...
$ weight : num NA 13.5 35 14 NA 30 25 NA NA 22 ...
$ height : num 20 NA 19 10 18 16 14.5 9.5 18.5 14.5 ...
Let’s check the dimensions of our dataset:
dim(dogs)
[1] 172 18
Recall we can access the number of rows with:
nrow(dogs)
[1] 172
And the number of columns:
ncol(dogs)
[1] 18
To display the first rows from the dataset, use head:
head(dogs)
breed group datadog popularity_all popularity
1 Border Collie herding 3.64 45 39
2 Border Terrier terrier 3.61 80 61
3 Brittany sporting 3.54 30 30
4 Cairn Terrier terrier 3.53 59 48
5 Welsh Springer Spaniel sporting 3.34 130 81
6 English Cocker Spaniel sporting 3.33 63 51
lifetime_cost intelligence_rank longevity ailments price food_cost grooming
1 20143 1 12.52 2 623 324 weekly
2 22638 30 14.00 0 833 324 weekly
3 22589 19 12.92 0 618 466 weekly
4 21992 35 13.84 2 435 324 weekly
5 20224 31 12.49 1 750 324 weekly
6 18993 18 11.66 0 800 324 weekly
kids megarank_kids megarank size weight height
1 low 1 29 medium NA 20
2 high 2 1 small 13.5 NA
3 medium 3 11 medium 35.0 19
4 high 4 2 small 14.0 10
5 high 5 4 medium NA 18
6 high 6 5 medium 30.0 16
And to display the last rows from the dataset, use tail:
tail(dogs)
breed group datadog popularity_all popularity
167 Vizsla sporting NA 37 NA
168 Weimaraner sporting NA 32 NA
169 Welsh Terrier terrier NA 99 NA
170 Wire Fox Terrier terrier NA 100 NA
171 Wirehaired Pointing Griffon sporting NA 92 NA
172 Xoloitzcuintli non-sporting NA 155 NA
lifetime_cost intelligence_rank longevity ailments price food_cost grooming
167 NA 25 12.50 0 935 NA <NA>
168 NA 21 NA 1 562 NA weekly
169 NA 53 NA 0 843 NA weekly
170 NA 51 13.17 0 668 NA <NA>
171 NA 46 8.80 0 755 NA <NA>
172 NA NA NA NA 717 NA <NA>
kids megarank_kids megarank size weight height
167 <NA> NA NA medium NA 22.5
168 high NA NA large NA 25.0
169 high NA NA small 20.0 15.0
170 <NA> NA NA small 17.5 15.0
171 <NA> NA NA medium NA 22.0
172 <NA> NA NA medium NA 16.5
5.1.3.3 Writing an RDS
You can save any R object, such as a data frame, as an RDS file. RDS files are a great option for storing data that is intended to be loaded into R. Data saved as RDS can be quickly and accurately loaded out of and back into R without losing any information.
This isn’t always the case when saving data in plain-text formats such as CSV. The RDS format preserves any R-specific metadata attached to the object you save. For data frames, this matters if your data contains factors, dates, or other class attributes that a CSV can’t represent: after loading a CSV, you would need to repeat the steps that parsed the data into those classes.
Additionally, RDS files often take significantly less disk space, since they are compressed by default, and they are generally faster to read.
However, it’s important to keep in mind that RDS files are meant to be used only in R. If you save data as an RDS file, you are assuming that whoever uses that data will have access to, and an understanding of, R.
As a result, it’s common to use the RDS format for saving intermediate data in a project. When exporting results to a collaborator or publishing them online, you will most likely want a widely used plain-text format such as CSV.
Use saveRDS to save our data as an RDS file with the rds file extension:
saveRDS(dogs, "./outputs/dogs.rds")
You can load data saved in RDS files with readRDS:
dogs = readRDS("./outputs/dogs.rds")
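The difference between the two formats can be sketched with a round trip through temporary files. The data frame below is made up for illustration; it has a factor with a custom level order and a Date column, two kinds of metadata that a CSV cannot record.

```r
# A made-up data frame with a factor (custom level order) and a Date column.
sample_df <- data.frame(
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large")),
  seen = as.Date(c("2024-01-05", "2024-02-10", "2024-03-15"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# Round trip through RDS: classes and level order survive.
saveRDS(sample_df, rds_path)
from_rds <- readRDS(rds_path)
identical(from_rds, sample_df)

# Round trip through CSV: the columns come back without their
# original classes, so they would need to be converted again.
write.csv(sample_df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
class(from_csv$size)
class(from_csv$seen)
```

The RDS copy is identical to the original, while the CSV copy has lost the Date class and the level order of the factor.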
5.1.3.4 Writing a CSV
We just saved and read the dogs data as an RDS file, and we can practice saving data in other forms, such as a comma separated values (CSV) file. Because we will be re-using the class survey data from the first week, let’s go ahead and save this data frame as a CSV in your working directory.
First, you will want to create a folder called data/ in your working directory. You can do this in your console with the dir.create() function (this is like the mkdir command used in the command line). (Hint: make sure you are in your class working directory.) You can run the following in your console:
dir.create("data")
You can also use a point-and-click method by finding the New Folder button in the bottom right pane of RStudio, under the Files tab.
Next, let’s manually create the my.data data frame once more, by copying and pasting the code below.
pets <- c("Cats rule, dogs drool", "Cats rule, dogs drool",
          "Cats rule, dogs drool", "Cats rule, dogs drool",
          "Cats rule, dogs drool", "Woof", "Woof", "Cats rule, dogs drool",
          "Woof", "Woof", "Cats rule, dogs drool")
place <- c("Shah's", "Red 88 noodle bar", "UC Davis CoHo", "Thai Canteen",
           "Tim's Hawaiian", "Peet's coffee and Blaze pizza", "Good Friends",
           "in-n-out", "In n Out", "Mishka's!", "California Coffee")
time.min <- c(1, 5, 1, 4, 3, 1, 5, 4, 4, 4, 1)
distance.mi <- c(472, 0.9, 1.2, 0.6, 0.6, 0.2, 1, 0.8, 0.8, 0.7, 0.3)
major <- c("Computer Science", "Genetics & Genomics", "Computer Science",
           "Computer Science", "Science and Technology Studies",
           "Biomedical Engineering", "Economics", "Computer science",
           "Computer Science and Engineering", "Spanish Linguistics",
           "Computer Science")
my.data <- data.frame(place, distance.mi, time.min, major, pets)
Now that we have a data frame called my.data, we can use the write.csv function to save this data frame as a CSV in our data folder. Let’s call it class_survey.csv.
write.csv(my.data, "data/class_survey.csv", row.names = FALSE)
Now this data will be available to us for future use without having to copy and paste.
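To see what the row.names argument changes, here is a small sketch that writes a made-up data frame to a temporary file both ways and reads it back. The data frame and its column names are invented for illustration.

```r
# A made-up two-row data frame written to a temporary file.
scores <- data.frame(name = c("a", "b"), score = c(1, 2))
path <- tempfile(fileext = ".csv")

# With the default row.names = TRUE, the row names become an extra column.
write.csv(scores, path)
with_row_names <- names(read.csv(path))
with_row_names

# With row.names = FALSE, only the original columns are written.
write.csv(scores, path, row.names = FALSE)
without_row_names <- names(read.csv(path))
without_row_names
```

The extra column has no header in the file, so read.csv names it X when the file is loaded again.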
5.1.3.5 Excel Files in R
Excel is very popular in the data analysis world. Millions of people use Excel to input, clean, analyze, and store data. R doesn’t provide a built-in function to load Excel files. Fortunately, members of the R community share code for a variety of tasks, including loading Excel files.
5.2 Packages
Many of the most useful parts of R do not come pre-loaded when you install R. Packages bundle together code, documentation, and data. They’re easy to share and easy to include in your own code. Users have contributed thousands of R packages, which can be found online.
You can think of a package as one or more functions that are related to a specific task, that you can include in your code.
Packages need to be installed on your system and then loaded into your R session.
5.2.1 CRAN
The Comprehensive R Archive Network (CRAN) is the main website that makes R packages accessible.
5.2.1.1 readxl
readxl is a package written to provide functions for working with Excel files in R.
5.2.2 Using Packages
To use an R package, it first needs to be installed on your system, and then loaded into the R session.
5.2.2.1 Installing Packages
You can install packages from CRAN onto your system using install.packages. It will search for the package on CRAN, and download the code onto your computer in a place that R can access.
To install the readxl package, we pass the name to install.packages:
install.packages("readxl")
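Since a package only needs to be installed once per system, it can help to check whether it is already present before installing. A minimal sketch using requireNamespace, which reports whether a package is available without attaching it to the session:

```r
# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise; quietly = TRUE suppresses the message when it is missing.
pkg <- "readxl"
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is already installed")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\")")
}
```

This sketch only reports the status rather than installing anything, so it is safe to run repeatedly.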
5.2.2.2 Loading Packages
Even if the package is on your system, it is not automatically loaded into R.
Every time you restart R, you will need to reload each package that your script uses. Do so with library at the top of your script for each package that you will use.
This signals to you and anyone else that uses your script which packages are required to run the code, and will stop the execution of the script if any of the packages are not found.
To load the readxl package we installed in the previous step, use library:
library("readxl")
This will load all the functions, data, and documentation from the readxl package, so we can now access them in our R session.
To see all the installed packages, you can run library without any arguments:
library()
This displays all the installed packages as well as the paths R searches to find them.
5.2.2.3 Example: Load Excel Data
With the excel_sheets function in readxl, we can list all the sheets in an Excel spreadsheet:
sheets = excel_sheets("./data/dogs.xlsx")
We can then load the data with read_xlsx:
data = read_xlsx("./data/dogs.xlsx")
5.3 Factors, Special Values, and Indexing
5.3.1 Factors
A feature in a data set is categorical if it measures a qualitative category. Some examples of categories are:
- Music genres rock, blues, alternative, folk, pop
- Colors red, green, blue, yellow
- Answers yes, no
- Months January, February, and so on
In some cases, a feature can be interpreted as categorical or quantitative. For instance, months can be interpreted as categories, but they can also be interpreted as numbers. There’s not necessarily a “correct” interpretation; each can be useful for different kinds of analyses.
R uses the class factor to represent categorical data. For instance, in the dogs data set, the group column is a factor:
class(dogs$group)
[1] "factor"
Visualizations and statistical models sometimes treat factors differently than other data types, so it’s important to think about whether you want R to interpret data as categorical.
When you load a data set, R usually can’t tell which features are categorical. That means identifying and converting the categorical features is up to you. For beginners, it can be difficult to understand whether a feature is categorical or not. The key is to think about whether you want to use the feature to divide the data into groups.
For example, if we want to know how many songs are in the rock genre, we first need to divide the songs by genre, and then count the number of songs in each group (or at least the rock group).
As a second example, months recorded as numbers can be categorical or not, depending on how you want to use them. You might want to treat them as categorical (for example, to compute max rainfall in each month) or you might want to treat them as numbers (for example, to compute the number of months between two events).
The bottom line is that you have to think about what you’ll be doing in the analysis. In some cases, you might treat a feature as categorical only for part of the analysis.
You can use the factor function to convert a vector into a factor:
colors = c("red", "green", "red", "blue")
colors = factor(colors)
colors
[1] red green red blue
Levels: blue green red
Notice that factors are printed differently than strings.
The categories of a factor are called levels. You can list the levels with the levels function:
levels(colors)
[1] "blue" "green" "red"
Factors remember all possible levels even if you take a subset:
colors[1:3]
[1] red green red
Levels: blue green red
This is another way factors are different from strings. Factors “remember” all possible levels even if they aren’t present. This ensures that if you plot a factor, the missing levels will still be represented on the plot.
You can make a factor forget levels that aren’t present with the droplevels function:
droplevels(colors[1:3])
[1] red green red
Levels: green red
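Dividing data into groups and counting them, as in the song example earlier in this section, can be sketched with a factor and the table function. The size values below are made up, and the levels parameter of factor is used to set the order of the categories explicitly.

```r
sizes <- c("small", "large", "medium", "small")

# By default, levels are sorted alphabetically: large, medium, small.
levels(factor(sizes))

# The levels parameter sets the order explicitly.
sizes <- factor(sizes, levels = c("small", "medium", "large"))

# table() counts the elements in each level, in level order.
size_counts <- table(sizes)
size_counts
```

Setting the level order is mostly useful when the categories have a natural order that differs from alphabetical order.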
5.3.2 Special Values
R has four special values to represent missing or invalid data.
5.3.2.1 Missing Values
The value NA is called the missing value. Most of the time, missing values originate from how the data were collected (as opposed to computer errors). As an example, imagine the data came from a survey, and respondents chose not to answer some questions. In the data set, their answers for those questions might be recorded as NA.
Of course, there are sometimes exceptions where missing values are the result of a computation. When you see missing values in a data set, you should think carefully about what the cause might be. Sometimes documentation or other parts of the data set provide clues.
The missing value is a chameleon: it can be a logical, integer, numeric, complex, or character value. By default, the missing value is logical, and the other types occur through coercion (Section 3.8.2):
class(NA)
[1] "logical"
class(c(1, NA))
[1] "numeric"
class(c("hi", NA, NA))
[1] "character"
The missing value is also contagious: it represents an unknown quantity, so using it as an argument to a function usually produces another missing value. The idea is that if the inputs to a computation are unknown, generally so is the output:
NA - 3
[1] NA
mean(c(1, 2, NA))
[1] NA
As a consequence, testing whether an object is equal to the missing value with == doesn’t return a meaningful result:
5 == NA
[1] NA
NA == NA
[1] NA
You can use the is.na function instead:
is.na(5)
[1] FALSE
is.na(NA)
[1] TRUE
is.na(c(1, NA, 3))
[1] FALSE TRUE FALSE
Missing values are a feature that sets R apart from most other programming languages.
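In practice, you often want to count missing values or compute a summary that ignores them. A short sketch, using the first five height values from the dogs data shown earlier:

```r
heights <- c(20, NA, 19, 10, 18)

# Count the missing values.
n_missing <- sum(is.na(heights))
n_missing

# Many summary functions take na.rm = TRUE to drop missing values first.
mean_height <- mean(heights, na.rm = TRUE)
mean_height

# Or remove them yourself by indexing with the negated test.
heights[!is.na(heights)]
```

Whether dropping missing values is appropriate depends on why they are missing, so treat na.rm = TRUE as a deliberate choice rather than a default.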
5.3.2.2 Not a Number
The value NaN, called not a number, represents a quantity that’s undefined mathematically. For instance, dividing 0 by 0 is undefined:
0 / 0
[1] NaN
class(NaN)
[1] "numeric"
NaN can be numeric or complex.
You can use the is.nan function to test whether a value is NaN:
is.nan(c(10.1, log(-1), 3))
Warning in log(-1): NaNs produced
[1] FALSE TRUE FALSE
5.3.2.3 Infinity
The value Inf represents infinity, and can be numeric or complex. You’re most likely to encounter it as the result of certain computations:
13 / 0
[1] Inf
class(Inf)
[1] "numeric"
You can use the is.infinite function to test whether a value is infinite:
is.infinite(3)
[1] FALSE
is.infinite(c(-Inf, 0, Inf))
[1] TRUE FALSE TRUE
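The special values interact with each other in ways that are easy to mix up. The following sketch collects a few cases and uses is.finite, a companion test that is TRUE only for ordinary numbers:

```r
-1 / 0            # negative infinity
Inf - Inf         # undefined, so the result is NaN

# is.finite() is FALSE for Inf, NaN, and NA alike.
is.finite(c(1, Inf, NaN, NA))

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
is.na(c(NA, NaN))
is.nan(c(NA, NaN))
```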
5.3.3 Indexing
The way to get and set elements of a data structure is by indexing. Sometimes this is also called subsetting or (element) extraction. Indexing is a fundamental operation in R, key to reasoning about how to solve problems with the language.
We first saw indexing in Section 3.6.1, where we used [, the indexing or square bracket operator, to get and set elements of vectors. We saw indexing again in Section 3.7, where we used $, the dollar sign operator, to get and set data frame columns.
The indexing operator [ is R’s primary operator for indexing. It works in four different ways, depending on the type of the index you use:
- An empty index selects all elements
- A numeric index selects elements by position
- A character index selects elements by name
- A logical index selects elements for which the index is TRUE
Let’s explore each in more detail. We’ll use this vector as an example, to keep things concise:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x
a b c d e
10 20 30 40 50
Even though we’re using a vector here, the indexing operator works with almost all data structures, including factors, lists, matrices, and data frames. We’ll look at unique behavior for some of these later on.
5.3.3.1 All Elements
The first way to use [ to select elements is to leave the index blank. This selects all elements:
x[]
a b c d e
10 20 30 40 50
This way of indexing is rarely used for getting elements, since it’s the same as entering the variable name without the indexing operator. Instead, its main use is for setting elements. Suppose we want to set all the elements of x to 5. You might try writing this:
x = 5
x
[1] 5
Rather than setting each element to 5, this sets x to the scalar 5, which is not what we want. Let’s reset the vector and try again, this time using the indexing operator:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
x[] = 5
x
a b c d e
5 5 5 5 5
As you can see, now all the elements are 5. So the indexing operator is necessary to specify that we want to set the elements rather than the whole variable.
Let’s reset x one more time, so that we can use it again in the next example:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
5.3.3.2 By Position
The second way to use [ is to select elements by position. This happens when you use an integer or numeric index. We already saw the basics of this in Section 3.6.1.
The positions of the elements in a vector (or other data structure) correspond to numbers starting from 1 for the first element. This way of indexing is frequently used together with the sequence operator : to get ranges of values. For instance, let’s get the 2nd through 4th elements of x:
x[2:4]
b c d
20 30 40
You can also use this way of indexing to set specific elements or ranges of elements. For example, let’s set the 3rd and 5th elements of x to 9 and 7, respectively:
x[c(3, 5)] = c(9, 7)
x
a b c d e
10 20 9 40 7
When getting elements, you can repeat numbers in the index to get the same element more than once. You can also use the order of the numbers to control the order of the elements:
x[c(2, 1, 2, 2)]
b a b b
20 10 20 20
Finally, if the index contains only negative numbers, the elements at those positions are excluded rather than selected. For instance, let’s get all elements except the 1st and 5th:
x[-c(1, 5)]
b c d
20 9 40
When you index by position, the index should always be all positive or all negative. Using a mix of positive and negative numbers causes R to emit an error rather than returning elements, since it’s unclear what the result should be:
x[c(-1, 2)]
Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
5.3.3.3 By Name
The third way to use [ is to select elements by name. This happens when you use a character vector as the index, and only works with named data structures.
Like indexing by position, you can use indexing by name to get or set elements. You can also use it to repeat elements or change the order. Let’s get elements a, c, d, and a again from the vector x:
y = x[c("a", "c", "d", "a")]
y
a c d a
10 9 40 10
Element names are generally unique, but if they’re not, indexing by name gets or sets the first element whose name matches the index:
y["a"]
a
10
Let’s reset x again to prepare for learning about the final way to index:
x = c(a = 10, b = 20, c = 30, d = 40, e = 50)
5.3.3.4 By Condition
The fourth and final way to use [ is to select elements based on a condition. This happens when you use a logical vector as the index. The logical vector should have the same length as what you’re indexing, and will be recycled (that is, repeated) if it doesn’t.
Congruent Vectors
To understand indexing by condition, we first need to learn about congruent vectors. Two vectors are congruent if they have the same length and they correspond element-by-element.
For example, suppose you do a survey that records each respondent’s favorite animal and age. These are two different vectors of information, but each person will have a response for both. So you’ll have two vectors that are the same length:
animal = c("dog", "cat", "iguana")
age = c(31, 24, 72)
The 1st element of each vector corresponds to the 1st person, the 2nd to the 2nd person, and so on. These vectors are congruent.
Notice that columns in a data frame are always congruent!
Back to Indexing
When you index by condition, the index should generally be congruent to the object you’re indexing. Elements where the index is TRUE are kept and elements where the index is FALSE are dropped.
If you create the index from a condition on the object, it’s automatically congruent. For instance, let’s make a condition based on the vector x:
is_small = x < 25
is_small
a b c d e
TRUE TRUE FALSE FALSE FALSE
The 1st element in the logical vector is_small corresponds to the 1st element of x, the 2nd to the 2nd, and so on. The vectors x and is_small are congruent.
It makes sense to use is_small as an index for x, and it gives us all the elements less than 25:
x[is_small]
a b
10 20
Of course, you can also avoid using an intermediate variable for the condition:
x[x > 10]
b c d e
20 30 40 50
If you create the index some other way (not using the object), make sure that it’s still congruent to the object. Otherwise, the subset returned from indexing might not be meaningful.
You can also use indexing by condition to set elements, just as the other ways of indexing can be used to set elements. For instance, let’s set all the elements of x that are greater than 10 to the missing value NA:
x[x > 10] = NA
x
a b c d e
10 NA NA NA NA
5.3.3.5 Logic
All of the conditions we’ve seen so far have been written in terms of a single test. If you want to use more sophisticated conditions, R provides operators to negate and combine logical vectors. These operators are useful for working with logical vectors even outside the context of indexing.
Negation
The NOT operator ! converts TRUE to FALSE and FALSE to TRUE:
x = c(TRUE, FALSE, TRUE, TRUE, NA)
x
[1] TRUE FALSE TRUE TRUE NA
!x
[1] FALSE TRUE FALSE FALSE NA
You can use ! with a condition:
y = c("hi", "hello")
!(y == "hi")
[1] FALSE TRUE
The NOT operator is vectorized.
Combinations
R also has operators for combining logical values.
The AND operator & returns TRUE only when both arguments are TRUE. Here are some examples:
FALSE & FALSE
[1] FALSE
TRUE & FALSE
[1] FALSE
FALSE & TRUE
[1] FALSE
TRUE & TRUE
[1] TRUE
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
[1] TRUE FALSE FALSE
The OR operator | returns TRUE when at least one argument is TRUE. Let’s see some examples:
FALSE | FALSE
[1] FALSE
TRUE | FALSE
[1] TRUE
FALSE | TRUE
[1] TRUE
TRUE | TRUE
[1] TRUE
c(TRUE, FALSE) | c(TRUE, TRUE)
[1] TRUE TRUE
Be careful: everyday English is less precise than logic. You might say:
I want all subjects with age over 50 and all subjects that like cats.
But in logic this means:
(subject age over 50) OR (subject likes cats)
So think carefully about whether you need both conditions to be true (AND) or at least one (OR).
Rarely, you might want exactly one condition to be true. The XOR (eXclusive OR) function xor() returns TRUE when exactly one argument is TRUE. For example:
xor(FALSE, FALSE)
[1] FALSE
xor(TRUE, FALSE)
[1] TRUE
xor(TRUE, TRUE)
[1] FALSE
The AND, OR, and XOR operators are vectorized.
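Because these operators are vectorized, they combine naturally with indexing by condition. Here is a sketch of the survey example above, with made-up, congruent vectors (one element per subject):

```r
# Made-up, congruent survey vectors: one element per subject.
age <- c(62, 35, 51, 18)
likes_cats <- c(FALSE, TRUE, TRUE, FALSE)
subject <- c("s1", "s2", "s3", "s4")

# Subjects over 50, plus subjects that like cats: at least one is true.
either <- subject[age > 50 | likes_cats]
either

# Both conditions true for the same subject.
both <- subject[age > 50 & likes_cats]
both

# Exactly one condition true.
exactly_one <- subject[xor(age > 50, likes_cats)]
exactly_one
```

Only the third subject satisfies both conditions, while the first three satisfy at least one.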
Short-circuiting
The second argument is irrelevant in some conditions:
- FALSE & is always FALSE
- TRUE | is always TRUE
Now imagine you have FALSE & long_computation(). You can save time by skipping long_computation(). A short-circuit operator does exactly that.
R has two short-circuit operators:
- && is a short-circuited &
- || is a short-circuited |
These operators only evaluate the second argument if it is necessary to determine the result. Here are some examples:
TRUE && FALSE
[1] FALSE
TRUE && TRUE
[1] TRUE
TRUE || TRUE
[1] TRUE
Notice that each argument in these examples is a single logical value. The short-circuit operators are not vectorized! Older versions of R silently used only the first element of a longer vector and ignored the rest; since R 4.3.0, passing a vector with more than one element to && or || is an error. Because of this, you should not use the short-circuit operators for indexing. Their main use is in writing conditions for control structures (Section 4.1) and loops (Section 4.3).
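A short sketch of short-circuiting in action. In the first expression, stop would raise an error if it were evaluated, so getting a result shows that it was skipped. The second part is a typical guard in a condition; the vector v is a made-up example.

```r
# The right-hand side is never evaluated, so stop() does not run.
skipped <- FALSE && stop("this is never reached")
skipped

# A typical guard: only look at v[1] after confirming that v has elements.
v <- numeric(0)
if (length(v) > 0 && v[1] > 0) {
  message("first element is positive")
} else {
  message("empty, or first element is not positive")
}
```

Without the length check, the condition v[1] > 0 would be NA for an empty vector, and if would raise an error.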
5.3.3.6 Indexing Lists
Lists are a container for other types of R objects. When you select an element from a list, you can either keep the container (the list) or discard it. The indexing operator [ almost always keeps containers.
As an example, let’s get some elements from a small list:
x = list(first = c(1, 2, 3), second = sin, third = c("hi", "hello"))
y = x[c(1, 3)]
y
$first
[1] 1 2 3
$third
[1] "hi" "hello"
class(y)
[1] "list"
The result is still a list. Even if we get just one element, the result of indexing a list with [ is a list:
class(x[1])
[1] "list"
Sometimes this will be exactly what we want. But what if we want to get the first element of x so that we can use it in a vectorized function? Or in a function that only accepts numeric arguments? We need to somehow get the element and discard the container.
The solution to this problem is the extraction operator [[, which is also called the double square bracket operator. The extraction operator is the primary way to get and set elements of lists and other containers.
Unlike the indexing operator [, the extraction operator always discards the container:
x[[1]]
[1] 1 2 3
class(x[[1]])
[1] "numeric"
The trade off is that the extraction operator can only get or set one element at a time. Note that the element can be a vector, as above. Because it can only get or set one element at a time, the extraction operator can only index by position or name. Blank and logical indexes are not allowed.
The final difference between the index operator [ and the extraction operator [[ has to do with how they handle invalid indexes. The index operator [ returns NA for invalid vector elements, and NULL for invalid list elements:
c(1, 2)[10]
[1] NA
x[10]
$<NA>
NULL
On the other hand, the extraction operator [[ raises an error for invalid elements:
x[[10]]
Error in x[[10]]: subscript out of bounds
The indexing operator [ and the extraction operator [[ both work with any data structure that has elements. However, you’ll generally use the indexing operator [ to index vectors, and the extraction operator [[ to index containers (such as lists).
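The practical difference shows up as soon as you pass the result to another function. This sketch recreates the small list from above and compares the two operators, along with the $ operator, which for named lists works like [[ with a name:

```r
x <- list(first = c(1, 2, 3), second = sin, third = c("hi", "hello"))

# [[ returns the element itself, so it can go straight into mean().
mean(x[["first"]])

# [ returns a one-element list, which mean() can't average:
# it returns NA with a warning (suppressed here).
suppressWarnings(mean(x["first"]))

# For named lists, $ is shorthand for [[ with a name.
x$third
```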
5.3.3.7 Indexing Data Frames
For two-dimensional objects, like matrices and data frames, you can pass the indexing operator [ or the extraction operator [[ a separate index for each dimension. The rows come first:
DATA[ROWS, COLUMNS]
For instance, let’s get the first 3 rows and all columns of the dogs data:
dogs[1:3, ]
breed group datadog popularity_all popularity lifetime_cost
1 Border Collie herding 3.64 45 39 20143
2 Border Terrier terrier 3.61 80 61 22638
3 Brittany sporting 3.54 30 30 22589
intelligence_rank longevity ailments price food_cost grooming kids
1 1 12.52 2 623 324 weekly low
2 30 14.00 0 833 324 weekly high
3 19 12.92 0 618 466 weekly medium
megarank_kids megarank size weight height
1 1 29 medium NA 20
2 2 1 small 13.5 NA
3 3 11 medium 35.0 19
As we saw in Section 5.3.3.1, leaving an index blank means all elements.
As another example, let’s get the 3rd and 5th row, and the 2nd and 4th column:
dogs[c(3, 5), c(2, 4)]
group popularity_all
3 sporting 30
5 sporting 130
Mixing several different ways of indexing is allowed. So for example, we can get the same rows as above, but select columns by name instead of by position:
dogs[c(3, 5), c("breed", "longevity")]
breed longevity
3 Brittany 12.92
5 Welsh Springer Spaniel 12.49
For data frames, it’s especially common to index the rows by condition and the columns by name. For instance, let’s get the breed, popularity, and weight columns for all rows with toy dogs:
result = dogs[dogs$group == "toy", c("breed", "popularity", "weight")]
head(result)
breed popularity weight
8 Papillon 33 NA
13 Affenpinscher 84 NA
16 Chihuahua 14 5.5
28 Maltese 23 5.0
29 Pomeranian 17 5.0
30 Shih Tzu 11 12.5
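One thing to watch for when indexing rows by condition is missing values in the column you test. The sketch below uses a small made-up data frame (not the dogs data) to show what happens, and one possible workaround using which, which returns the positions where a condition is TRUE:

```r
# A made-up data frame with one missing weight.
small_df <- data.frame(breed = c("A", "B", "C"), weight = c(5, NA, 30))

# The condition is NA for breed B, so the result includes a row of NAs.
with_na_row <- small_df[small_df$weight > 10, ]
with_na_row

# which() returns the positions where the condition is TRUE and skips NA.
only_matches <- small_df[which(small_df$weight > 10), ]
only_matches
```

Combining the condition with !is.na(small_df$weight) using & is another way to get the same rows.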
If you use two-dimensional indexing with [ to select exactly one column, you get a vector:
result = dogs[1:3, 2]
class(result)
[1] "factor"
The container is dropped, even though the indexing operator [ usually keeps containers. This also occurs for matrices. You can control this behavior with the drop parameter:
result = dogs[1:3, 2, drop = FALSE]
class(result)
[1] "data.frame"
The default is drop = TRUE.