5 Appendix
5.1 More About Comparisons
5.1.1 Equality
The ==
operator is the primary way to test whether two values are equal, as
explained in Section 1.2.3. Nonetheless, equality can be defined
in many different ways, especially when dealing with computers. As a result, R
also provides several different functions to test for different kinds of
equality. This section describes tests of equality in more detail, and also
describes some other important details of comparisons.
5.1.1.1 The ==
Operator
The ==
operator tests whether its two arguments have the exact same
representation as a binary number in your computer’s memory. Before
testing the arguments, the operator applies R’s rules for vectorization
(Section 2.1.3), recycling (Section 2.1.4), and
implicit coercion (Section 2.2.2). Until you’ve fully
internalized these three rules, some results from the equality operator may
seem surprising. For example:
# Recycling:
c(1, 2) == c(1, 2, 1, 2)
## [1] TRUE TRUE TRUE TRUE
# Implicit coercion:
TRUE == 1
## [1] TRUE
TRUE == "TRUE"
## [1] TRUE
1 == "TRUE"
## [1] FALSE
The length of the result from the equality operator is usually the same as its longest argument (with some exceptions).
5.1.1.2 The all.equal
Function
The all.equal
function tests whether its two arguments are equal up to some
acceptable difference called a tolerance. Computer representations for
decimal numbers are inherently imprecise, so it’s necessary to allow for very
small differences between computed numbers. For example:
x = 0.5 - 0.3
y = 0.3 - 0.1
# FALSE on most machines:
x == y
## [1] FALSE
# TRUE:
all.equal(x, y)
## [1] TRUE
The all.equal
function does not apply R’s rules for vectorization, recycling,
or implicit coercion. The function returns TRUE
when the arguments are equal,
and returns a string summarizing the differences when they are not. For
instance:
all.equal(1, c(1, 2, 1))
## [1] "Numeric: lengths (1, 3) differ"
The all.equal
function is often used together with the isTRUE
function,
which tests whether the result is TRUE
:
all.equal(3, 4)
## [1] "Mean relative difference: 0.3333333"
isTRUE(all.equal(3, 4))
## [1] FALSE
You should generally use the all.equal
function when you want to compare
decimal numbers.
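The size of the acceptable difference can be changed through the tolerance parameter of all.equal. The following is a minimal sketch; the numbers 1 and 1.001 are chosen only for illustration:

```r
# 1 and 1.001 differ by about 0.1%, which is far larger than the default
# tolerance (roughly 1.5e-8), so the default comparison reports a difference.
isTRUE(all.equal(1, 1.001))
## [1] FALSE

# Loosening the tolerance to 1% makes the same comparison succeed.
isTRUE(all.equal(1, 1.001, tolerance = 0.01))
## [1] TRUE
```

A looser tolerance is a judgment call that depends on how precise your data really are, so treat the value above as a placeholder.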
5.1.1.3 The identical
Function
The identical
function checks whether its arguments are completely identical,
including their metadata (names, dimensions, and so on). For instance:
x = list(a = 1)
y = list(a = 1)
z = list(1)
identical(x, y)
## [1] TRUE
identical(x, z)
## [1] FALSE
The identical
function does not apply R’s rules for vectorization, recycling,
or implicit coercion. The result is always a single logical value.
You’ll generally use the identical
function to compare non-vector objects
such as lists or data frames. The function also works for vectors, but most of
the time the equality operator ==
is sufficient.
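For vectors, the difference between identical and the equality operator shows up when the values match but the type or metadata does not. A short sketch, with example values chosen only for illustration:

```r
# The values are equal, but the types differ (integer versus double).
1L == 1
## [1] TRUE
identical(1L, 1)
## [1] FALSE

# The values are equal, but only one of the vectors has names.
x = c(a = 1, b = 2)
y = c(1, 2)
all(x == y)
## [1] TRUE
identical(x, y)
## [1] FALSE
```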
5.1.2 The %in%
Operator
Another common comparison is to check whether elements of one vector are
contained in another vector at any position. For instance, suppose you want
to check whether 1 or 2 appear anywhere in a longer vector x. Here’s how
to do it:
x = c(3, 4, 2, 7, 3, 7)
c(1, 2) %in% x
## [1] FALSE TRUE
R returns FALSE for the 1 because there’s no 1 in x, and returns TRUE
for the 2 because there is a 2 in x.
Notice that this is different from comparing with the equality operator ==.
If you use the equality operator, the shorter vector is recycled until its
length matches the longer one, and then compared element-by-element. For the
example, this means only the elements at odd-numbered positions are compared to
1
, and only the elements at even-numbered positions are compared to 2
:
c(1, 2) == x
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
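The result of %in% always has one element per element of its left-hand argument, so swapping the arguments answers a different question. A brief sketch using the same x as above:

```r
x = c(3, 4, 2, 7, 3, 7)

# Which elements of x are equal to 1 or 2? One result per element of x.
x %in% c(1, 2)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE

# Does 2 appear anywhere in x? This gives the same answer as 2 %in% x.
any(x == 2)
## [1] TRUE
```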
5.1.3 Summarizing Comparisons
The comparison operators are vectorized, so they compare their arguments element-by-element:
c(1, 2, 3) < c(1, 3, -3)
## [1] FALSE TRUE FALSE
c("he", "saw", "her") == c("she", "saw", "him")
## [1] FALSE TRUE FALSE
What if you want to summarize whether all the elements in a vector are equal
(or unequal)? You can use the all
function on any logical vector to get a
summary. The all
function takes a vector of logical values and returns TRUE
if all of them are TRUE
, and returns FALSE
otherwise:
all(c(1, 2, 3) < c(1, 3, -3))
## [1] FALSE
The related any
function returns TRUE
if any one element is TRUE
, and
returns FALSE
otherwise:
any(c("hi", "hello") == c("hi", "bye"))
## [1] TRUE
5.1.4 Other Pitfalls
New programmers sometimes incorrectly think they need to append == TRUE to
their comparisons. This is redundant, makes your code harder to understand, and
wastes computational time. Comparisons already return logical values. If the
result of the comparison is TRUE, then TRUE == TRUE is again just TRUE.
If the result is FALSE, then FALSE == TRUE is again just FALSE. Likewise,
if you want to invert a condition, choose an appropriate operator rather than
appending == FALSE.
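As a small illustration of this pitfall (the vector x here is made up for the example):

```r
x = c(5, -2, 8)

# Redundant: the comparison already returns logical values.
(x > 0) == TRUE
## [1] TRUE FALSE TRUE

# Preferred: use the comparison directly.
x > 0
## [1] TRUE FALSE TRUE

# To invert, use the opposite operator (or !) rather than == FALSE.
x <= 0
## [1] FALSE TRUE FALSE
!(x > 0)
## [1] FALSE TRUE FALSE
```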
5.2 Variable Scope & Lookup
5.2.1 Local Variables
A variable’s scope is the section of code where it exists and is accessible.
The exists
function checks whether a variable is in scope:
exists("zz")
## [1] FALSE
zz = 3
exists("zz")
## [1] TRUE
When you create a function, you create a new scope. Variables defined inside of a function are local to the function. Local variables cannot be accessed from outside:
rescale = function(x, center, scale) {
  centered = x - center
  centered / scale
}
centered
## Error in eval(expr, envir, enclos): object 'centered' not found
exists("centered")
## [1] FALSE
Local variables are reset each time the function is called:
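f = function() {
  # inherits = FALSE checks only the function's own scope, so a variable
  # named z defined elsewhere (such as at the top level) is ignored
  is_z_in_scope = exists("z", inherits = FALSE)
  z = 42
  is_z_in_scope
}
f()
## [1] FALSE
f()
## [1] FALSE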
5.2.2 Lexical Scoping
A function can use variables defined outside (non-local), but only if those variables are in scope where the function was defined. This property is called lexical scoping.
Let’s see how this works in practice. First, we’ll define a variable cats and
then define a function get_cats in the same place (the top level, not inside
any functions). As a result, the cats variable is in scope inside of the
get_cats function:
cats = 3
get_cats = function() cats
get_cats()
## [1] 3
Now let’s define a variable dogs inside of a function create_dogs. We’ll
also define a function get_dogs at the top level. The variable dogs is not
in scope at the top level, so it’s not in scope inside of the get_dogs
function:
create_dogs = function() {
  dogs = "hello"
}
get_dogs = function() dogs
create_dogs()
get_dogs()
## Error in get_dogs(): object 'dogs' not found
Variables defined directly in the R console are global and available to any function.
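Lexical scoping also applies to functions defined inside other functions. The sketch below is illustrative only; make_greeter, greet, and greeting are hypothetical names that do not appear elsewhere in this section:

```r
make_greeter = function() {
  greeting = "hello"
  # greet is defined inside make_greeter, so greeting is in scope for it.
  greet = function() greeting
  greet
}

greet = make_greeter()
greet()
## [1] "hello"

# greeting is local to make_greeter, so it is not in scope at the top level.
exists("greeting")
## [1] FALSE
```

Compare this with get_dogs above: the only difference is where the function was defined.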
Local variables mask (hide) non-local variables with the same name:
get_parrot = function() {
  parrot = 3
  parrot
}
parrot = 42
get_parrot()
## [1] 3
There’s one exception to this rule. We often use variables that refer to functions in calls:
#mean()
In this case, the variable must refer to a function, so R ignores local variables that aren’t functions. For example:
my_mean = function() {
  mean = 0
  mean(c(1, 2, 3))
}
my_mean()
## [1] 2
my_get_cats = function() {
  get_cats = 10
  get_cats()
}
my_get_cats()
## [1] 3
5.2.3 Dynamic Lookup
Variable lookup happens when a function is called, not when it’s defined. This is called dynamic lookup.
For example, the result from get_cats, which accesses the global variable
cats, changes if we change the value of cats:
cats = 10
get_cats()
## [1] 10
cats = 20
get_cats()
## [1] 20
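One practical consequence of dynamic lookup is that a function which reads a global variable can change behavior without any change to its code. The sketch below uses hypothetical names (threshold, is_large, is_large2) and shows one possible way to make the dependency explicit by passing the value as a parameter:

```r
threshold = 10
is_large = function(x) x > threshold

is_large(15)
## [1] TRUE

# Changing the global variable changes what the function returns.
threshold = 20
is_large(15)
## [1] FALSE

# Passing the value as a parameter makes the dependency visible in the call.
is_large2 = function(x, threshold) x > threshold
is_large2(15, threshold = 10)
## [1] TRUE
```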
5.2.4 Summary
This section covered a lot of details about R’s rules for variable scope and lookup. Here are the key takeaways:
- Function definitions (or local()) create a new scope.
- Local variables:
  - Are private
  - Get reset for each call
  - Mask non-local variables (exception: function calls)
- Lexical scoping: where a function is defined determines which non-local variables are in scope.
- Dynamic lookup: when a function is called determines values of non-local variables.
5.3 String Processing
So far, we’ve mostly worked with numbers or categories that are ready to use for data analysis. In practice, data sets often require some cleaning before or during data analysis. One common data cleaning task is editing or extracting parts of strings.
We’ll use the stringr package to process strings. Like ggplot2 (Section 3.3), the package is part of the Tidyverse. R also has built-in functions for string processing. The main advantage of stringr is that its functions use a common set of parameters, so they’re easier to learn and remember.
stringr has detailed documentation and also a cheatsheet.
The first time you use stringr, you’ll have to install it with
install.packages
(the same as any other package). Then you can load the
package with the library
function:
# install.packages("stringr")
library("stringr")
The typical syntax of a stringr function is:
str_NAME(string, pattern, ...)
Where:
- NAME describes what the function does
- string is the string to search within or transform
- pattern is the pattern to search for
- ... is additional, function-specific arguments
The str_detect
function detects whether the pattern appears within the
string. Here’s an example:
str_detect("hello", "el")
## [1] TRUE
str_detect("hello", "ol")
## [1] FALSE
Most of the stringr functions are vectorized in the string
parameter. For
instance:
str_detect(c("hello", "goodbye", "lo"), "lo")
## [1] TRUE FALSE TRUE
Most of the stringr functions also have support for regular
expressions, a powerful language for describing patterns. Several
punctuation characters, such as .
and ?
have special meanings in the
regular expressions language. You can disable these special meanings by putting
the pattern in a call to fixed
:
str_detect("a", ".")
## [1] TRUE
str_detect("a", fixed("."))
## [1] FALSE
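The difference is easier to see with a pattern that mixes ordinary characters and a special one. A short sketch, where the vector words is made up for the example:

```r
library("stringr")

words = c("cat", "cot", "ct", "c.t")

# As a regular expression, . matches any single character.
str_detect(words, "c.t")
## [1] TRUE TRUE FALSE TRUE

# Inside fixed, . only matches a literal period.
str_detect(words, fixed("c.t"))
## [1] FALSE FALSE FALSE TRUE
```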
You can learn more about regular expressions here.
There are a lot of stringr functions. We’ll focus on two that are especially important, and some of their variants:
str_split
str_replace
You can find a complete list of stringr functions, with examples, in the documentation.
5.3.1 Splitting Strings
The str_split
function splits the string at each position that matches the
pattern. The characters that match are thrown away.
For example, suppose we want to split a sentence into words. Since there’s a space between each word, we can use a space as the pattern:
x = "The students in this workshop are great!"

result = str_split(x, " ")
result
## [[1]]
## [1] "The" "students" "in" "this" "workshop" "are" "great!"
The str_split
function always returns a list with one element for each input
string. Here the list only has one element because x
only has one element. We
can get the first element with:
result[[1]]
## [1] "The" "students" "in" "this" "workshop" "are" "great!"
We have to use the extraction operator [[
here because result
is a list (for a
vector, we could use the indexing operator [
instead). Notice that in the
printout for result
, R gives us a hint that we should use [[
by printing
[[1]]
.
To see why the function returns a list, consider what happens if we try to split two different sentences at once:
x = c(x, "Are you listening?")

result = str_split(x, " ")
result[[1]]
## [1] "The" "students" "in" "this" "workshop" "are" "great!"
result[[2]]
## [1] "Are" "you" "listening?"
Each sentence has a different number of words, so the vectors in the result have different lengths. So a list is the only way to store both.
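One way to confirm this is to count the pieces in each list element with the built-in length and lengths functions. This is a small sketch; the variable name sentences is used only here:

```r
library("stringr")

sentences = c("The students in this workshop are great!", "Are you listening?")
result = str_split(sentences, " ")

# The list has one element per input string.
length(result)
## [1] 2

# Each element holds a different number of words.
lengths(result)
## [1] 7 3
```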
The str_split_fixed
function is almost the same as str_split
, but takes a
third argument for the number of pieces to split each string into. Because the
number of pieces is fixed, the function can return the result in a matrix instead of a
list. For example:
str_split_fixed(x, " ", 3)
## [,1] [,2] [,3]
## [1,] "The" "students" "in this workshop are great!"
## [2,] "Are" "you" "listening?"
The str_split_fixed function is often more convenient than str_split
because the nth piece of each input string is just the nth column of the
result.
For example, suppose we want to get the area code from some phone numbers:
phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)

result[, 1]
## [1] "717" "629" "903"
5.3.2 Replacing Parts of Strings
The str_replace
function replaces the pattern the first time it appears in
the string. The replacement goes in the third argument.
For instance, suppose we want to change the word "dog"
to "cat"
:
x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")
## [1] "cats are great, dogs are fun" "cats are fluffy"
The str_replace_all
function replaces the pattern every time it appears in
the string:
str_replace_all(x, "dog", "cat")
## [1] "cats are great, cats are fun" "cats are fluffy"
We can also use the str_replace
and str_replace_all
functions to delete
part of a string by setting the replacement to the empty string ""
.
For example, suppose we want to delete the comma:
str_replace(x, ",", "")
## [1] "dogs are great dogs are fun" "dogs are fluffy"
In general, stringr functions with the _all
suffix affect all matches.
Functions without _all
only affect the first match.
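The same distinction matters when deleting. In the sketch below, the string x is made up so that the pattern appears more than once:

```r
library("stringr")

x = "a, b, c"

# Only the first comma is removed.
str_replace(x, ",", "")
## [1] "a b, c"

# Every comma is removed.
str_replace_all(x, ",", "")
## [1] "a b c"
```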
5.4 Date Processing
Besides strings, dates and times are another kind of data that require special attention to prepare for analysis. This is especially important if you want to do anything that involves sorting dates, like making a line plot with dates on one axis. Dates may not be sorted correctly if they haven’t been converted to one of R’s date classes.
There are several built-in functions and also many packages for date processing. As with visualization and string processing, the Tidyverse packages have the best combination of simple design and clear documentation. There are three Tidyverse packages for processing dates and times:
- lubridate, the primary package for working with dates and times
- hms, a package specifically for working with times
- clock, a new package for working with dates and times
We’ll focus on the lubridate package. As always, you’ll have to install the package if you haven’t already, and then load it:
# install.packages("lubridate")
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
The most common task is to convert a string into a date or time class. For instance, when you load a data set, you might have dates that look like this:
dates = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982")
dates
## [1] "Jan 10, 2021" "Sep 3, 2018" "Feb 28, 1982"
These are strings, so it’s relatively difficult to sort the dates, do
arithmetic on them, or extract just one part (such as the year). There are
several lubridate functions to automatically convert strings into dates. They
are named with one letter for each part of the date. For instance, the dates in
the example have the month (m), then the day (d), and then the year (y), so we
can use the mdy
function:
result = mdy(dates)
result
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
class(result)
## [1] "Date"
Notice that the dates now have class Date
, one of R’s built-in classes for
representing dates, and that they print differently. You can find a full list
of the automatic string to date conversion functions in the lubridate
documentation.
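Other orderings follow the same naming scheme. The sketch below uses ymd and dmy on made-up date strings, and also shows why conversion matters for sorting:

```r
library("lubridate")

# Year, then month, then day:
ymd("2021-01-10")
## [1] "2021-01-10"

# Day, then month, then year:
dmy("10/01/2021")
## [1] "2021-01-10"

# Strings sort alphabetically, while converted dates sort chronologically.
dates = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982")
sort(dates)
## [1] "Feb 28, 1982" "Jan 10, 2021" "Sep 3, 2018"
sort(mdy(dates))
## [1] "1982-02-28" "2018-09-03" "2021-01-10"
```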
Occasionally, a date string may have a format that lubridate can’t convert
automatically. In that case, you can use the fast_strptime
function to
describe the format in detail. At a minimum, the function requires two
arguments: the vector of strings to convert and a format string.
The format string describes the format of the dates, and is based on the syntax
of strptime
, a function provided by many programming languages for converting
strings to dates (including R). In a format string, a percent sign %
followed
by a character is called a specification and has a special meaning. Here are
a few of the most useful ones:
Specification | Description | January 29, 2015 |
---|---|---|
%Y | 4-digit year | 2015 |
%y | 2-digit year | 15 |
%m | 2-digit month | 01 |
%B | full month name | January |
%b | short month name | Jan |
%d | day of month | 29 |
%% | literal % | % |
You can find a complete list in ?fast_strptime
. Other characters in the
format string do not have any special meaning. Write the format string so that
it matches the format of the dates you want to convert.
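As a brief sketch of how specifications and ordinary characters combine, here are two made-up date strings that both describe January 29, 2015:

```r
library("lubridate")

# Day, month, and 4-digit year separated by slashes.
fast_strptime("29/01/2015", "%d/%m/%Y")
## [1] "2015-01-29 UTC"

# A 2-digit year, with literal text in front of the date.
fast_strptime("Filed on 01-29-15", "Filed on %m-%d-%y")
## [1] "2015-01-29 UTC"
```

The slashes, dashes, and the words "Filed on" have no special meaning, so they must appear in the format string exactly as they appear in the data.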
For example, let’s try converting an unusual time format:
odd_time = "6 minutes, 32 seconds after 10 o'clock"
fast_strptime(odd_time, "%M minutes, %S seconds after %H o'clock")
## [1] "0-01-01 10:06:32 UTC"
R usually represents dates with the class Date
, and date-times with the
classes POSIXct
and POSIXlt
. The difference between the two date-time
classes is somewhat technical, but you can read more about it in ?POSIXlt
.
There is no built-in class to represent times alone, which is why the result in
the example above includes a date. Nonetheless, the hms package provides the
hms
class to represent times without dates.
Once you’ve converted a string to a date, the lubridate package provides a variety of functions to get or set the parts individually. Here are a few examples:
day(result)
## [1] 10 3 28
month(result)
## [1] 1 9 2
You can find a complete list in the lubridate documentation.
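The same functions can also set a part when used on the left side of an assignment, and converted dates support arithmetic. A short sketch that rebuilds the dates from the example above:

```r
library("lubridate")

result = mdy(c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982"))

# Get the year of each date.
year(result)
## [1] 2021 2018 1982

# Set a part by assigning to it: move every date to the first of its month.
day(result) = 1
result
## [1] "2021-01-01" "2018-09-01" "1982-02-01"

# Subtracting two dates gives the time between them.
ymd("2021-01-10") - ymd("2021-01-01")
## Time difference of 9 days
```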