10 Strings and Regular Expressions

This lesson introduces several concepts related to working with text data (‘strings’), particularly working with the stringr package and writing patterns with regular expressions.

10.1 Learning Objectives

After this lesson, you should be able to:

  • Print strings with cat
  • Read and write escape sequences and raw strings
  • With the stringr package:
    • Split strings on a pattern
    • Replace parts of a string that match a pattern
    • Extract parts of a string that match a pattern
  • Read and write regular expressions, including:
    • Anchors ^ and $
    • Character classes []
    • Quantifiers ?, *, and +
    • Groups ()

10.2 Printing Output

The cat function prints a string in the R console. If you pass multiple arguments, they will be concatenated:

cat("Hello")
## Hello
cat("Hello", "Nick")
## Hello Nick

Pitfall 1: Printing a string is different from returning a string. The cat function only prints (and always returns NULL). For example:

f = function() {
  cat("Hello")
}

x = f()
## Hello
x
## NULL

If you just want to concatenate some strings (but not necessarily print them), use paste instead of cat. The paste function returns a string. The str_c function in stringr (a package we’ll learn about later in this lesson) can also concatenate strings.

Pitfall 2: Remember to print strings with the cat function, not the print function. The print function prints R’s representation of an object, the same as if you had entered the object in the console without calling print.

For instance, print prints quotes around strings, whereas cat does not:

print("Hello")
## [1] "Hello"
cat("Hello")
## Hello

10.3 Escape Sequences

In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:

  1. Write quotes or backslashes within a string
  2. Write characters that don’t appear on your keyboard (for example, characters in a foreign language)

For example, the escape sequence \n corresponds to the newline character. Notice that the cat function translates \n into a literal new line, whereas the print function doesn’t:

x = "Hello\nNick"

cat(x)
## Hello
## Nick
print(x)
## [1] "Hello\nNick"

As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string:

x = 'She said, "Hi"'

cat(x)
## She said, "Hi"
y = "She said, \"Hi\""

cat(y)
## She said, "Hi"

Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:

x = "\\"

cat(x)
## \

There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R.

10.3.1 Raw Strings

A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions, which we’ll do later in this lesson.

Raw strings begin with r" and an opening delimiter (, [, or {. Raw strings end with a matching closing delimiter and quote. For example:

x = r"(quotes " and backslashes \)"

cat(x)
## quotes " and backslashes \

Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.

10.4 Character Encodings

Computers store data as numbers. In order to store text on a computer, we have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97.

Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).

Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that support and are synonymous with Unicode. The most popular of these is UTF-8.

In R, we can write Unicode characters with the escape sequence \U followed by the number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So we can write an a as:

x = "\U61" # or "\u61"

x
## [1] "a"

Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So the string "\U1f408" is the cat emoji.

Note that being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The NerdFont project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji.

10.4.1 Character Encodings in Text Files

Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.

Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:

read.csv("my_data.csv", fileEncoding = "UTF-8")

Other reader functions may use a different parameter to set the encoding, so always check the documentation. On computers where the native language is not set to English, it can also help to set R’s native language to English with Sys.setlocale(locale = "English").

Encoding problems when writing a file are slightly more complicated to fix. See this blog post for thorough explanation.

10.5 stringr and the tidyverse

10.5.1 The Tidyverse

The Tidyverse is a popular collection of packages for doing data science in R. The packages are made by many of the same people that make RStudio. They provide alternatives to R’s built-in tools for:

  • Manipulating strings (package stringr)
  • Making visualizations (package ggplot2)
  • Reading files (package readr)
  • Manipulating data frames (packages dplyr, tidyr, tibble)
  • And more

Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. Whether to use base R or the Tidyverse is mostly subjective. As a result, the Tidyverse is somewhat polarizing in the R community. It’s useful to be literate in both, since both are popular.

One advantage of the Tidyverse is that the packages are usually well-documented. For example, there are documentation websites and cheat sheets for most Tidyverse packages.

10.5.2 stringr

The rest of this lesson uses stringr, the Tidyverse package for string processing. R also has built-in functions for string processing. The main advantage of stringr is that all of the functions use a common set of parameters, so they’re easier to learn and remember.

The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function:

# install.packages("stringr")
library(stringr)

The typical syntax of a stringr function is:

str_NAME(string, pattern, ...)

Where:

  • NAME describes what the function does
  • string is the string to search within or transform
  • pattern is the pattern to search for
  • ... is additional, function-specific arguments

For example, the str_detect function detects whether the pattern appears within the string:

str_detect("hello", "el")
## [1] TRUE
str_detect("hello", "ol")
## [1] FALSE

Most of the stringr functions are vectorized:

str_detect(c("hello", "goodbye", "lo"), "lo")
## [1]  TRUE FALSE  TRUE

There are a lot of stringr functions. The remainder of this lesson focuses on three that are especially important, as well as some of their variants:

  • str_split_fixed
  • str_replace
  • str_match

You can find a complete list of stringr functions with examples in the documentation or cheat sheet.

10.5.2.1 Splitting Strings

The str_split function splits the string at each position that matches the pattern. The characters that match are thrown away.

For example, suppose we want to split a sentence into words. Since there’s a space between each word, we can use a space as the pattern:

x = "The students in this class are great!"

result = str_split(x, " ")
result
## [[1]]
## [1] "The"      "students" "in"       "this"     "class"    "are"      "great!"

The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. We can get the first element with:

result[[1]]
## [1] "The"      "students" "in"       "this"     "class"    "are"      "great!"

We have to use the double square bracket [[ operator here because x is a list (for a vector, we could use the single square bracket operator instead). Notice that in the printout for result, R gives us a hint that we should use [[ by printing [[1]].

To see why the function returns a list, consider what happens if we try to split two different sentences at once:

x = c(x, "Are you listening?")

result = str_split(x, " ")
result[[1]]
## [1] "The"      "students" "in"       "this"     "class"    "are"      "great!"
result[[2]]
## [1] "Are"        "you"        "listening?"

Each sentence has a different number of words, so the vectors in the result have different lengths. So a list is the only way to store both.

The str_split_fixed function is almost the same as str_split, but takes a third argument for the maximum number of splits to make. Because the number of splits is fixed, the function can return the result in a matrix instead of a list. For example:

str_split_fixed(x, " ", 3)
##      [,1]  [,2]       [,3]                      
## [1,] "The" "students" "in this class are great!"
## [2,] "Are" "you"      "listening?"

The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result.

For example, suppose we want to get the area code from some phone numbers:

phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)

result[, 1]
## [1] "717" "629" "903"

10.5.2.2 Replacing Parts of Strings

The str_replace function replaces the pattern the first time it appears in the string. The replacement goes in the third argument.

For instance, suppose we want to change the word "dog" to "cat":

x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")
## [1] "cats are great, dogs are fun" "cats are fluffy"

The str_replace_all function replaces the pattern every time it appears in the string:

str_replace_all(x, "dog", "cat")
## [1] "cats are great, cats are fun" "cats are fluffy"

We can also use the str_replace and str_replace_all functions to delete part of a string by setting the replacement to the empty string "".

For example, suppose we want to delete the comma:

str_replace(x, ",", "")
## [1] "dogs are great dogs are fun" "dogs are fluffy"

In general, stringr functions with the _all suffix affect all matches. Functions without _all only affect the first match.

We’ll learn about str_match at the end of the next section.

10.6 Regular Expressions

The stringr functions (including the ones we just learned) use a special language called regular expressions or regex for the pattern. The regular expressions language is also used in many other programming languages besides R.

A regular expression can describe a complicated pattern in just a few characters, because some characters, called metacharacters, have special meanings. Letters and numbers are never metacharacters. They’re always literal.

Here are a few examples of metacharacters (we’ll look at examples in the subsequent sections):

Metacharacter Meaning
. any single character (wildcard)
\ escape character (in both R and regex)
^ beginning of string
$ end of string
[ab] 'a' or 'b'
[^ab] any character except 'a' or 'b'
? previous character appears 0 or 1 times
* previous character appears 0 or more times
+ previous character appears 1 or more times
() make a group

More metacharacters are listed on the stringr cheat sheet, or in ?regex.

The str_view function is especially helpful for testing regular expressions. It opens a browser window with the first match in the string highlighted. We’ll use it in the subsequent regex examples.

The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.

10.6.1 The Wildcard

The regex wildcard character is . and matches any single character.

For example:

x = "dog"
str_view(x, "d.g")
## [1] │ <dog>

By default, regex searches from left to right:

str_view(x, ".")
## [1] │ <d><o><g>

10.6.2 Escape Sequences

Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character.

For example, suppose we want to match a literal dot .. The regex for a literal dot is \.. Since backslashes in R strings have to be escaped, the R string for this regex is "\\.. Then the regex works:

str_view("this.string", "\\.")
## [1] │ this<.>string

The double backslash can be confusing, and it gets worse if we want to match a literal backslash. We have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:

str_view("this\\that", "\\\\")
## [1] │ this<\>that

Raw strings are helpful here, because they make the backslash literal in R strings (but still not in regex). We can use raw strings to write the above as:

str_view(r"(this\that)", r"(\\)")
## [1] │ this<\>that

You can turn off regular expressions entirely in stringr with the fixed function:

str_view(x, fixed("."))

It’s good to turn off regular expressions whenever you don’t need them, both to avoid mistakes and because they take longer to compute.

10.6.3 Anchors

By default, a regex will match anywhere in the string. If you want to force a match at specific place, use an anchor.

The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the match.

For example, suppose we want to match an a at the beginning of the string:

x = c("abc", "cab")

str_view(x, "a")
## [1] │ <a>bc
## [2] │ c<a>b
str_view(x, "^a")
## [1] │ <a>bc

It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string.

Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the match.

10.6.4 Character Classes

In regex, square brackets [ ] create a character class. A character class counts as one character, but that character can be any of the characters inside the square brackets. The square brackets themselves don’t count as characters in the match.

For example, suppose we want to match a c followed by either a or t:

x = c("ca", "ct", "cat", "cta")

str_view(x, "c[ta]")
## [1] │ <ca>
## [2] │ <ct>
## [3] │ <ca>t
## [4] │ <ct>a

You can use a dash - in a character class to create a range. For example, to match letters p through z:

str_view(x, "c[p-z]")
## [2] │ <ct>
## [4] │ <ct>a

Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-].

Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot.

A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c:

str_view("abcdef", "[^abc]")
## [1] │ abc<d><e><f>

10.6.5 Quantifiers

Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.

For example, the ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional.

For example:

x = c("abc", "ab", "ac", "abbc")

str_view(x, "ab?c")
## [1] │ <abc>
## [3] │ <ac>

The * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all.

str_view(x, "ab*c")
## [1] │ <abc>
## [3] │ <ac>
## [4] │ <abbc>

The + quantifier means the preceding character must appear 1 or more times.

Quantifiers are greedy, meaning they always match as many characters as possible.

10.6.6 Groups

In regex, parentheses create a group. Groups can be affected by quantifiers, making it possible to repeat a pattern (rather than just a character). The parentheses themselves don’t count as characters in the match.

For example:

x = c("cats, dogs, and frogs", "cats and frogs")

str_view(x, "cats(, dogs,)? and frogs")
## [1] │ <cats, dogs, and frogs>
## [2] │ <cats and frogs>

10.6.7 Extracting Matches

Groups are especially useful with the stringr functions str_match and str_match_all.

The str_match function extracts the overall match to the pattern, as well as the match to each group. So you can use str_match to split a string in more complicated ways than str_split, or to extract specific pieces of a string.

For example, suppose we want to split an email address:

str_match("naulle@ucdavis.edu", "([^@]+)@(.+)[.](.+)")
##      [,1]                 [,2]     [,3]      [,4] 
## [1,] "naulle@ucdavis.edu" "naulle" "ucdavis" "edu"