10 Strings and Regular Expressions
This lesson introduces several concepts related to working with text data (‘strings’), particularly working with the stringr
package and writing patterns with regular expressions.
10.1 Learning Objectives
After this lesson, you should be able to:
- Print strings with
cat
- Read and write escape sequences and raw strings
- With the
stringr
package:- Split strings on a pattern
- Replace parts of a string that match a pattern
- Extract parts of a string that match a pattern
- Read and write regular expressions, including:
- Anchors
^
and$
- Character classes
[]
- Quantifiers
?
,*
, and+
- Groups
()
- Anchors
10.2 Printing Output
The cat
function prints a string in the R console. If you pass multiple
arguments, they will be concatenated:
## Hello
## Hello Nick
Pitfall 1: Printing a string is different from returning a string. The cat
function only prints (and always returns NULL
). For example:
## Hello
## NULL
If you just want to concatenate some strings (but not necessarily print them),
use paste
instead of cat
. The paste
function returns a string. The
str_c
function in stringr (a package we’ll learn about later in this lesson)
can also concatenate strings.
Pitfall 2: Remember to print strings with the cat
function, not the print
function. The print
function prints R’s representation of an object, the
same as if you had entered the object in the console without calling print
.
For instance, print
prints quotes around strings, whereas cat
does not:
## [1] "Hello"
## Hello
10.3 Escape Sequences
In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:
- Write quotes or backslashes within a string
- Write characters that don’t appear on your keyboard (for example, characters in a foreign language)
For example, the escape sequence \n
corresponds to the newline character.
Notice that the cat
function translates \n
into a literal new line, whereas
the print
function doesn’t:
## Hello
## Nick
## [1] "Hello\nNick"
As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string:
## She said, "Hi"
## She said, "Hi"
Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:
## \
There’s a complete list of escape sequences for R in the ?Quotes
help file.
Other programming languages also use escape sequences, and many of them are the
same as in R.
10.3.1 Raw Strings
A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions, which we’ll do later in this lesson.
Raw strings begin with r"
and an opening delimiter (
, [
, or {
. Raw
strings end with a matching closing delimiter and quote. For example:
## quotes " and backslashes \
Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.
10.4 Character Encodings
Computers store data as numbers. In order to store text on a computer, we have
to agree on a character encoding, a system for mapping characters to numbers.
For example, in ASCII, one of the most
popular encodings in the United States, the character a
maps to the
number 97.
Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).
Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that support and are synonymous with Unicode. The most popular of these is UTF-8.
In R, we can write Unicode characters with the escape sequence \U
followed by
the number for the character in base 16. For instance, the number for a
in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61
. So we can write
an a
as:
## [1] "a"
Unicode escape sequences are usually only used for characters that are not easy
to type. For example, the cat emoji is number 1f408
(in base 16) in Unicode.
So the string "\U1f408"
is the cat emoji.
Note that being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The NerdFont project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji.
10.4.1 Character Encodings in Text Files
Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.
Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:
Other reader functions may use a different parameter to set the encoding, so
always check the documentation. On computers where the native language is not
set to English, it can also help to set R’s native language to English with
Sys.setlocale(locale = "English")
.
Encoding problems when writing a file are slightly more complicated to fix. See this blog post for thorough explanation.
10.5 stringr
and the tidyverse
10.5.1 The Tidyverse
The Tidyverse is a popular collection of packages for doing data science in R. The packages are made by many of the same people that make RStudio. They provide alternatives to R’s built-in tools for:
- Manipulating strings (package
stringr
) - Making visualizations (package
ggplot2
) - Reading files (package
readr
) - Manipulating data frames (packages
dplyr
,tidyr
,tibble
) - And more
Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. Whether to use base R or the Tidyverse is mostly subjective. As a result, the Tidyverse is somewhat polarizing in the R community. It’s useful to be literate in both, since both are popular.
One advantage of the Tidyverse is that the packages are usually well-documented. For example, there are documentation websites and cheat sheets for most Tidyverse packages.
10.5.2 stringr
The rest of this lesson uses stringr
, the Tidyverse package for string
processing. R also has built-in functions for string processing. The main
advantage of stringr is that all of the functions use a common set of
parameters, so they’re easier to learn and remember.
The first time you use stringr, you’ll have to install it with
install.packages
(the same as any other package). Then you can load the
package with the library
function:
The typical syntax of a stringr function is:
str_NAME(string, pattern, ...)
Where:
NAME
describes what the function doesstring
is the string to search within or transformpattern
is the pattern to search for...
is additional, function-specific arguments
For example, the str_detect
function detects whether the pattern appears
within the string:
## [1] TRUE
## [1] FALSE
Most of the stringr functions are vectorized:
## [1] TRUE FALSE TRUE
There are a lot of stringr functions. The remainder of this lesson focuses on three that are especially important, as well as some of their variants:
str_split_fixed
str_replace
str_match
You can find a complete list of stringr functions with examples in the documentation or cheat sheet.
10.5.2.1 Splitting Strings
The str_split
function splits the string at each position that matches the
pattern. The characters that match are thrown away.
For example, suppose we want to split a sentence into words. Since there’s a space between each word, we can use a space as the pattern:
## [[1]]
## [1] "The" "students" "in" "this" "class" "are" "great!"
The str_split
function always returns a list with one element for each input
string. Here the list only has one element because x
only has one element. We
can get the first element with:
## [1] "The" "students" "in" "this" "class" "are" "great!"
We have to use the double square bracket [[
operator here because x
is a
list (for a vector, we could use the single square bracket operator instead).
Notice that in the printout for result
, R gives us a hint that we should use
[[
by printing [[1]]
.
To see why the function returns a list, consider what happens if we try to split two different sentences at once:
## [1] "The" "students" "in" "this" "class" "are" "great!"
## [1] "Are" "you" "listening?"
Each sentence has a different number of words, so the vectors in the result have different lengths. So a list is the only way to store both.
The str_split_fixed
function is almost the same as str_split
, but takes a
third argument for the maximum number of splits to make. Because the number of
splits is fixed, the function can return the result in a matrix instead of a
list. For example:
## [,1] [,2] [,3]
## [1,] "The" "students" "in this class are great!"
## [2,] "Are" "you" "listening?"
The str_split_fixed
function is often more convenient than str_split
because the n
th piece of each input string is just the n
th column of the
result.
For example, suppose we want to get the area code from some phone numbers:
phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)
result[, 1]
## [1] "717" "629" "903"
10.5.2.2 Replacing Parts of Strings
The str_replace
function replaces the pattern the first time it appears in
the string. The replacement goes in the third argument.
For instance, suppose we want to change the word "dog"
to "cat"
:
## [1] "cats are great, dogs are fun" "cats are fluffy"
The str_replace_all
function replaces the pattern every time it appears in
the string:
## [1] "cats are great, cats are fun" "cats are fluffy"
We can also use the str_replace
and str_replace_all
functions to delete
part of a string by setting the replacement to the empty string ""
.
For example, suppose we want to delete the comma:
## [1] "dogs are great dogs are fun" "dogs are fluffy"
In general, stringr functions with the _all
suffix affect all matches.
Functions without _all
only affect the first match.
We’ll learn about str_match
at the end of the next section.
10.6 Regular Expressions
The stringr
functions (including the ones we just learned) use a special
language called regular expressions or regex for the pattern. The regular
expressions language is also used in many other programming languages besides
R.
A regular expression can describe a complicated pattern in just a few characters, because some characters, called metacharacters, have special meanings. Letters and numbers are never metacharacters. They’re always literal.
Here are a few examples of metacharacters (we’ll look at examples in the subsequent sections):
Metacharacter | Meaning |
---|---|
. |
any single character (wildcard) |
\ |
escape character (in both R and regex) |
^ |
beginning of string |
$ |
end of string |
[ab] |
'a' or 'b' |
[^ab] |
any character except 'a' or 'b' |
? |
previous character appears 0 or 1 times |
* |
previous character appears 0 or more times |
+ |
previous character appears 1 or more times |
() |
make a group |
More metacharacters are listed on the stringr cheat sheet, or in ?regex
.
The str_view
function is especially helpful for testing regular expressions.
It opens a browser window with the first match in the string highlighted. We’ll
use it in the subsequent regex examples.
The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.
10.6.1 The Wildcard
The regex wildcard character is .
and matches any single character.
For example:
## [1] │ <dog>
By default, regex searches from left to right:
## [1] │ <d><o><g>
10.6.2 Escape Sequences
Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character.
For example, suppose we want to match a literal dot .
. The regex for a
literal dot is \.
. Since backslashes in R strings have to be escaped, the R
string for this regex is "\\.
. Then the regex works:
## [1] │ this<.>string
The double backslash can be confusing, and it gets worse if we want to match a literal backslash. We have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:
## [1] │ this<\>that
Raw strings are helpful here, because they make the backslash literal in R strings (but still not in regex). We can use raw strings to write the above as:
## [1] │ this<\>that
You can turn off regular expressions entirely in stringr with the fixed
function:
It’s good to turn off regular expressions whenever you don’t need them, both to avoid mistakes and because they take longer to compute.
10.6.3 Anchors
By default, a regex will match anywhere in the string. If you want to force a match at specific place, use an anchor.
The beginning of string anchor is ^
. It marks the beginning of the string,
but doesn’t count as a character in the match.
For example, suppose we want to match an a
at the beginning of the string:
## [1] │ <a>bc
## [2] │ c<a>b
## [1] │ <a>bc
It doesn’t make sense to put characters before ^
, since no characters can
come before the beginning of the string.
Likewise, the end of string anchor is $
. It marks the end of the string, but
doesn’t count as a character in the match.
10.6.4 Character Classes
In regex, square brackets [ ]
create a character class. A character class
counts as one character, but that character can be any of the characters inside
the square brackets. The square brackets themselves don’t count as characters
in the match.
For example, suppose we want to match a c
followed by either a
or t
:
## [1] │ <ca>
## [2] │ <ct>
## [3] │ <ca>t
## [4] │ <ct>a
You can use a dash -
in a character class to create a range. For example, to
match letters p
through z
:
## [2] │ <ct>
## [4] │ <ct>a
Ranges also work with numbers and capital letters. To match a literal dash,
place the dash at the end of the character class (instead of between two other
characters), as in [abc-]
.
Most metacharacters are literal when inside a character class. For example,
[.]
matches a literal dot.
A hat ^
at the beginning of the character class negates the class. So for
example, [^abc]
matches any one character except for a
, b
, or c
:
## [1] │ abc<d><e><f>
10.6.5 Quantifiers
Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.
For example, the ?
quantifier means the preceding character can appear 0 or
1 times. In other words, ?
makes the preceding character optional.
For example:
## [1] │ <abc>
## [3] │ <ac>
The *
quantifier means the preceding character can appear 0 or more times.
In other words, *
means the preceding character can appear any number of
times or not at all.
## [1] │ <abc>
## [3] │ <ac>
## [4] │ <abbc>
The +
quantifier means the preceding character must appear 1 or more times.
Quantifiers are greedy, meaning they always match as many characters as possible.
10.6.6 Groups
In regex, parentheses create a group. Groups can be affected by quantifiers, making it possible to repeat a pattern (rather than just a character). The parentheses themselves don’t count as characters in the match.
For example:
## [1] │ <cats, dogs, and frogs>
## [2] │ <cats and frogs>
10.6.7 Extracting Matches
Groups are especially useful with the stringr functions str_match
and
str_match_all
.
The str_match
function extracts the overall match to the pattern, as well as
the match to each group. So you can use str_match
to split a string in more
complicated ways than str_split
, or to extract specific pieces of a string.
For example, suppose we want to split an email address:
## [,1] [,2] [,3] [,4]
## [1,] "naulle@ucdavis.edu" "naulle" "ucdavis" "edu"