5 Appendix

5.1 More About Comparisons

5.1.1 Equality

The == operator is the primary way to test whether two values are equal, as explained in Section 1.2.3. Nonetheless, equality can be defined in many different ways, especially when dealing with computers. As a result, R also provides several different functions to test for different kinds of equality. This describes tests of equality in more detail, and also describes some other important details of comparisons.

5.1.1.1 The == Operator

The == operator tests whether its two arguments have the exact same representation as a binary number in your computer’s memory. Before testing the arguments, the operator applies R’s rules for vectorization (Section 2.1.3), recycling (Section 2.1.4), and implicit coercion (Section 2.2.2). Until you’ve fully internalized these three rules, some results from the equality operator may seem surprising. For example:

# Recycling:
c(1, 2) == c(1, 2, 1, 2)
## [1] TRUE TRUE TRUE TRUE
# Implicit coercion:
TRUE == 1
## [1] TRUE
TRUE == "TRUE"
## [1] TRUE
1 == "TRUE"
## [1] FALSE

The length of the result from the equality operator is usually the same as its longest argument (with some exceptions).

5.1.1.2 The all.equal Function

The all.equal function tests whether its two arguments are equal up to some acceptable difference called a tolerance. Computer representations for decimal numbers are inherently imprecise, so it’s necessary to allow for very small differences between computed numbers. For example:

x = 0.5 - 0.3
y = 0.3 - 0.1

# FALSE on most machines:
x == y
## [1] FALSE
# TRUE:
all.equal(x, y)
## [1] TRUE

The all.equal function does not apply R’s rules for vectorization, recycling, or implicit coercion. The function returns TRUE when the arguments are equal, and returns a string summarizing the differences when they are not. For instance:

all.equal(1, c(1, 2, 1))
## [1] "Numeric: lengths (1, 3) differ"

The all.equal function is often used together with the isTRUE function, which tests whether the result is TRUE:

all.equal(3, 4)
## [1] "Mean relative difference: 0.3333333"
isTRUE(all.equal(3, 4))
## [1] FALSE

You should generally use the all.equal function when you want to compare decimal numbers.

5.1.1.3 The identical Function

The identical function checks whether its arguments are completely identical, including their metadata (names, dimensions, and so on). For instance:

x = list(a = 1)
y = list(a = 1)
z = list(1)

identical(x, y)
## [1] TRUE
identical(x, z)
## [1] FALSE

The identical function does not apply R’s rules for vectorization, recycling, or implicit coercion. The result is always a single logical value.

You’ll generally use the identical function to compare non-vector objects such as lists or data frames. The function also works for vectors, but most of the time the equality operator == is sufficient.

5.1.2 The %in% Operator

Another common comparison is to check whether elements of one vector are contained in another vector at any position. For instance, suppose you want to check whether 1 or 2 appear anywhere in a longer vector x. Here’s how to do it:

x = c(3, 4, 2, 7, 3, 7)
c(1, 2) %in% x
## [1] FALSE  TRUE

R returns FALSE for the 1 because there’s no 1 in x, and returns TRUE for the 2 because there is a 2 in x.

Notice that this is different from comparing with the equality operator ==. If you use use the equality operator, the shorter vector is recycled until its length matches the longer one, and then compared element-by-element. For the example, this means only the elements at odd-numbered positions are compared to 1, and only the elements at even-numbered positions are compared to 2:

c(1, 2) == x
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

5.1.3 Summarizing Comparisons

The comparison operators are vectorized, so they compare their arguments element-by-element:

c(1, 2, 3) < c(1, 3, -3)
## [1] FALSE  TRUE FALSE
c("he", "saw", "her") == c("she", "saw", "him")
## [1] FALSE  TRUE FALSE

What if you want to summarize whether all the elements in a vector are equal (or unequal)? You can use the all function on any logical vector to get a summary. The all function takes a vector of logical values and returns TRUE if all of them are TRUE, and returns FALSE otherwise:

all(c(1, 2, 3) < c(1, 3, -3))
## [1] FALSE

The related any function returns TRUE if any one element is TRUE, and returns FALSE otherwise:

any(c("hi", "hello") == c("hi", "bye"))
## [1] TRUE

5.1.4 Other Pitfalls

New programmers sometimes incorrectly think they need to append == TRUE to their comparisons. This is redundant, makes your code harder to understand, and wastes computational time. Comparisons already return logical values. If the result of the comparison is TRUE, then TRUE == TRUE is again just TRUE. If the result is FALSE, then FALSE == TRUE is again just FALSE. Likewise, if you want to invert a condition, choose an appropriate operator rather than appending == FALSE.

5.2 Variable Scope & Lookup

5.2.1 Local Variables

A variable’s scope is the section of code where it exists and is accessible. The exists function checks whether a variable is in scope:

exists("zz")
## [1] FALSE
zz = 3
exists("zz")
## [1] TRUE

When you create a function, you create a new scope. Variables defined inside of a function are local to the function. Local variables cannot be accessed from outside:

rescale = function(x, center, scale) {
  centered = x - center
  centered / scale
}

centered
## Error in eval(expr, envir, enclos): object 'centered' not found
exists("centered")
## [1] FALSE

Local variables are reset each time the function is called:

f = function() {
  is_z_in_scope = exists("z")
  z = 42
  
  is_z_in_scope
}

f()
## [1] TRUE
f()
## [1] TRUE

5.2.2 Lexical Scoping

A function can use variables defined outside (non-local), but only if those variables are in scope where the function was defined. This property is called lexical scoping.

Let’s see how this works in practice. First, we’ll define a variable cats and then define a function get_cats in the same place (the top level, not inside any functions). As a result, the cats variable is in scope inside of the get_cats function:

cats = 3
get_cats = function() cats
get_cats()
## [1] 3

Now let’s define a variable dogs inside of a function create_dogs. We’ll also define a function get_dogs at the top level. The variable dogs is not in scope at the top level, so it’s not in scope inside of the get_dogs function:

create_dogs = function() {
  dogs = "hello"
}
get_dogs = function() dogs
create_dogs()
get_dogs()
## Error in get_dogs(): object 'dogs' not found

Variables defined directly in the R console are global and available to any function.

Local variables mask (hide) non-local variables with the same name:

get_parrot = function() {
  parrot = 3
  
  parrot
}
parrot = 42
get_parrot()
## [1] 3

There’s one exception to this rule. We often use variables that refer to functions in calls:

#mean()

In this case, the variable must refer to a function, so R ignores local variables that aren’t functions. For example:

my_mean = function() {
  mean = 0
  
  mean(c(1, 2, 3))
}
my_mean()
## [1] 2
my_get_cats = function() {
  get_cats = 10
  
  get_cats()
}
my_get_cats()
## [1] 3

5.2.3 Dynamic Lookup

Variable lookup happens when a function is called, not when it’s defined. This is called dynamic lookup.

For example, the result from get_cats, which accesses the global variable cats, changes if we change the value of cats:

cats = 10
get_cats()
## [1] 10
cats = 20
get_cats()
## [1] 20

5.2.4 Summary

This section covered a lot of details about R’s rules for variable scope and lookup. Here are the key takeaways:

  • Function definitions (or local()) create a new scope.

  • Local variables

    • Are private
    • Get reset for each call
    • Mask non-local variables (exception: function calls)
  • Lexical scoping: where a function is defined determines which non-local variables are in scope.

  • Dynamic lookup: when a function is called determines values of non-local variables.

5.3 String Processing

So far, we’ve mostly worked with numbers or categories that are ready to use for data analysis. In practice, data sets often require some cleaning before or during data analysis. One common data cleaning task is editing or extracting parts of strings.

We’ll use the stringr package to process strings. Like ggplot2 (Section 3.3), the package is part of the Tidyverse. R also has built-in functions for string processing. The main advantage of stringr is that its functions use a common set of parameters, so they’re easier to learn and remember.

stringr has detailed documentation and also a cheatsheet.

The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function:

# install.packages("stringr")
library("stringr")

The typical syntax of a stringr function is:

str_NAME(string, pattern, ...)

Where:

  • NAME describes what the function does
  • string is the string to search within or transform
  • pattern is the pattern to search for
  • ... is additional, function-specific arguments

The str_detect function detects whether the pattern appears within the string. Here’s an example:

str_detect("hello", "el")
## [1] TRUE
str_detect("hello", "ol")
## [1] FALSE

Most of the stringr functions are vectorized in the string parameter. For instance:

str_detect(c("hello", "goodbye", "lo"), "lo")
## [1]  TRUE FALSE  TRUE

Most of the stringr functions also have support for regular expressions, a powerful language for describing patterns. Several punctuation characters, such as . and ? have special meanings in the regular expressions language. You can disable these special meanings by putting the pattern in a call to fixed:

str_detect("a", ".")
## [1] TRUE
str_detect("a", fixed("."))
## [1] FALSE

You can learn more about regular expressions here.

There are a lot of stringr functions. We’ll focus on two that are especially important, and some of their variants:

  • str_split
  • str_replace

You can find a complete list of stringr functions, with examples, in the documentation.

5.3.1 Splitting Strings

The str_split function splits the string at each position that matches the pattern. The characters that match are thrown away.

For example, suppose we want to split a sentence into words. Since there’s a space between each word, we can use a space as the pattern:

x = "The students in this workshop are great!"

result = str_split(x, " ")
result
## [[1]]
## [1] "The"      "students" "in"       "this"     "workshop" "are"      "great!"

The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. We can get the first element with:

result[[1]]
## [1] "The"      "students" "in"       "this"     "workshop" "are"      "great!"

We have to use the extraction operator [[ here because x is a list (for a vector, we could use the indexing operator [ instead). Notice that in the printout for result, R gives us a hint that we should use [[ by printing [[1]].

To see why the function returns a list, consider what happens if we try to split two different sentences at once:

x = c(x, "Are you listening?")

result = str_split(x, " ")
result[[1]]
## [1] "The"      "students" "in"       "this"     "workshop" "are"      "great!"
result[[2]]
## [1] "Are"        "you"        "listening?"

Each sentence has a different number of words, so the vectors in the result have different lengths. So a list is the only way to store both.

The str_split_fixed function is almost the same as str_split, but takes a third argument for the maximum number of splits to make. Because the number of splits is fixed, the function can return the result in a matrix instead of a list. For example:

str_split_fixed(x, " ", 3)
##      [,1]  [,2]       [,3]                         
## [1,] "The" "students" "in this workshop are great!"
## [2,] "Are" "you"      "listening?"

The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result.

For example, suppose we want to get the area code from some phone numbers:

phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)

result[, 1]
## [1] "717" "629" "903"

5.3.2 Replacing Parts of Strings

The str_replace function replaces the pattern the first time it appears in the string. The replacement goes in the third argument.

For instance, suppose we want to change the word "dog" to "cat":

x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")
## [1] "cats are great, dogs are fun" "cats are fluffy"

The str_replace_all function replaces the pattern every time it appears in the string:

str_replace_all(x, "dog", "cat")
## [1] "cats are great, cats are fun" "cats are fluffy"

We can also use the str_replace and str_replace_all functions to delete part of a string by setting the replacement to the empty string "".

For example, suppose we want to delete the comma:

str_replace(x, ",", "")
## [1] "dogs are great dogs are fun" "dogs are fluffy"

In general, stringr functions with the _all suffix affect all matches. Functions without _all only affect the first match.

5.4 Date Processing

Besides strings, dates and times are another kind of data that require special attention to prepare for analysis. This is especially important if you want to do anything that involves sorting dates, like making a line plot with dates on one axis. Dates may not be sorted correctly if they haven’t been converted to one of R’s date classes.

There several built-in functions and also many packages for date processing. As with visualization and string processing, the Tidyverse packages have the best combination of simple design and clear documentation. There are three Tidyverse packages for processing dates and times:

  • lubridate, the primary package for working with dates and times
  • hms, a package specifically for working with times
  • clock, a new package for working with dates and times

We’ll focus on the lubridate package. As always, you’ll have to install the package if you haven’t already, and then load it:

# install.packages("lubridate")
library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

The most common task is to convert a string into a date or time class. For instance, when you load a data set, you might have dates that look like this:

dates = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982")
dates
## [1] "Jan 10, 2021" "Sep 3, 2018"  "Feb 28, 1982"

These are strings, so it’s relatively difficult to sort the dates, do arithmetic on them, or extract just one part (such as the year). There are several lubridate functions to automatically convert strings into dates. They are named with one letter for each part of the date. For instance, the dates in the example have the month (m), then the day (d), and then the year (y), so we can use the mdy function:

result = mdy(dates)
result
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
class(result)
## [1] "Date"

Notice that the dates now have class Date, one of R’s built-in classes for representing dates, and that they print differently. You can find a full list of the automatic string to date conversion functions in the lubridate documentation.

Occasionally, a date string may have a format that lubridate can’t convert automatically. In that case, you can use the fast_strptime function to describe the format in detail. At a minimum, the function requires two arguments: the vector of strings to convert and a format string.

The format string describes the format of the dates, and is based on the syntax of strptime, a function provided by many programming languages for converting strings to dates (including R). In a format string, a percent sign % followed by a character is called a specification and has a special meaning. Here are a few of the most useful ones:

Specification Description January 29, 2015
%Y 4-digit year 2015
%y 2-digit year 15
%m 2-digit month 01
%B full month name January
%b short month name Jan
%d day of month 29
%% literal % %

You can find a complete list in ?fast_strptime. Other characters in the format string do not have any special meaning. Write the format string so that it matches the format of the dates you want to convert.

For example, let’s try converting an unusual time format:

odd_time = "6 minutes, 32 seconds after 10 o'clock"
fast_strptime(odd_time, "%M minutes, %S seconds after %H o'clock")
## [1] "0-01-01 10:06:32 UTC"

R usually represents dates with the class Date, and date-times with the classes POSIXct and POSIXlt. The difference between the two date-time classes is somewhat technical, but you can read more about it in ?POSIXlt.

There is no built-in class to represent times alone, which is why the result in the example above includes a date. Nonetheless, the hms package provides the hms class to represent times without dates.

Once you’ve converted a string to a date, the lubridate package provides a variety of functions to get or set the parts individually. Here are a few examples:

day(result)
## [1] 10  3 28
month(result)
## [1] 1 9 2

You can find a complete list in the lubridate documentation.