This lesson introduces several concepts related to working with text data (“strings”), particularly working with the stringr
package and writing patterns with regular expressions.
After this lesson, you should be able to:
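- Print strings with cat
- Use the stringr package
- Write regular expressions that use anchors (^ and $), character classes ([]), quantifiers (?, *, and +), and groups (())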
The message
function prints a string in the R console. If you pass multiple arguments, they are concatenated:
message("Hello")
Hello
message("Hello", "Nick")
HelloNick
Printing a string is different from returning a string. The message function only prints (and always returns NULL). For example:
f = function() {
  message("Hello")
}
x = f()
Hello
x
NULL
If you just want to concatenate some strings (but not necessarily print them), use paste instead of message. The paste function returns a string. The str_c function in stringr (a package we’ll learn about later in this lesson) can also concatenate strings.
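As a minimal sketch of returning rather than printing, the base R functions paste and paste0 both give back the concatenated string. Note that paste puts a space between its arguments by default, whereas message adds no separator. The variable names are only for illustration.

```r
# paste() returns the concatenated string; by default it inserts a space.
greeting = paste("Hello", "Nick")
greeting    # "Hello Nick"

# paste0() concatenates with no separator, as message() does when printing.
greeting0 = paste0("Hello", "Nick")
greeting0   # "HelloNick"

# The sep argument of paste() sets a custom separator.
dashed = paste("Hello", "Nick", sep = "-")
dashed      # "Hello-Nick"
```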
Remember to print strings with the message function, not the print function. The print function prints R’s representation of an object, the same as if you had entered the object in the console without calling print.
For instance, print prints quotes around strings, whereas message does not:
print("Hello")
[1] "Hello"
message("Hello")
Hello
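In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to write characters that are otherwise awkward or impossible to put in a string, such as newlines, quotes, and backslashes.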
For example, the escape sequence \n corresponds to the newline character. Notice that the message function translates \n into a literal new line, whereas the print function doesn’t:
x = "Hello\nNick"
message(x)
Hello
Nick
print(x)
[1] "Hello\nNick"
As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string:
x = 'She said, "Hi"'
message(x)
She said, "Hi"
y = "She said, \"Hi\""
message(y)
She said, "Hi"
Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:
x = "\\"
message(x)
\
There’s a complete list of escape sequences for R in the ?Quotes
help file. Other programming languages also use escape sequences, and many of them are the same as in R.
A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions, which we’ll do later in this lesson.
Raw strings begin with r" and an opening delimiter: (, [, or {. Raw strings end with a matching closing delimiter and quote. For example:
x = r"(quotes " and backslashes \)"
message(x)
quotes " and backslashes \
Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.
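To check that a raw string really is the same string as its escaped equivalent, we can compare the two spellings. This is a small sketch that assumes R 4.0 or newer; the Windows-style path is a made-up example.

```r
# A made-up Windows-style path, written two ways.
escaped = "C:\\Users\\nick\\data.csv"    # ordinary string: backslashes escaped
raw_path = r"(C:\Users\nick\data.csv)"   # raw string: backslashes are literal

# Both spellings produce exactly the same string.
identical(escaped, raw_path)   # TRUE

# Square brackets and curly braces work as delimiters too.
identical(r"[a\b]", "a\\b")    # TRUE
identical(r"{a\b}", "a\\b")    # TRUE
```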
Computers store data as numbers. In order to store text on a computer, we have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a
maps to the number 97.
Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).
Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that support Unicode and are often treated as synonymous with it. The most popular of these is UTF-8.
In R, we can write Unicode characters with the escape sequence \U followed by the number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So we can write an a as:
x = "\U61" # or "\u61"
x
[1] "a"
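If you want to look up these numbers yourself, base R can convert between characters, their Unicode numbers, and base 16. The following is a brief sketch using the built-in functions utf8ToInt, intToUtf8, sprintf, and strtoi; the variable names are just for illustration.

```r
# utf8ToInt() gives the Unicode number of a single character.
code = utf8ToInt("a")
code                       # 97

# sprintf() with "%x" writes a number in base 16.
hex = sprintf("%x", code)
hex                        # "61"

# strtoi() converts a base 16 string back to a number.
strtoi("61", base = 16L)   # 97

# intToUtf8() goes the other way, from number to character.
intToUtf8(97)              # "a"
```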
Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408
(in base 16) in Unicode. So the string "\U1f408"
is the cat emoji.
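To see the difference between a character's Unicode number and the bytes an encoding uses to store it, we can inspect the bytes with the base R function charToRaw. This sketch relies on the fact that R stores strings written with \U escapes in UTF-8.

```r
# The letter a (number 61 in base 16) is stored as a single byte in UTF-8.
charToRaw("a")

# The cat emoji (number 1f408 in base 16) takes four bytes in UTF-8,
# and none of them is simply the number 1f408.
cat_bytes = charToRaw("\U1f408")
cat_bytes           # f0 9f 90 88
length(cat_bytes)   # 4
```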
Being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The NerdFont project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji.
Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.
Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:
read.csv("my_data.csv", fileEncoding = "UTF-8")
Other reader functions may use a different parameter to set the encoding, so always check the documentation. On computers where the native language is not set to English, it can also help to set R’s native language to English with Sys.setlocale(locale = "English")
.
Encoding problems when writing a file are slightly more complicated to fix. See this blog post for a thorough explanation.
stringr and the Tidyverse
The Tidyverse is a popular collection of packages for doing data science in R. The packages are made by many of the same people who make RStudio. They provide alternatives to R’s built-in tools for:
- String processing (stringr)
- Visualization (ggplot2)
- Reading files (readr)
- Manipulating data frames (dplyr, tidyr, tibble)
Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. Whether to use base R or the Tidyverse is mostly subjective. As a result, the Tidyverse is somewhat polarizing in the R community. It’s useful to be literate in both, since both are popular.
One advantage of the Tidyverse is that the packages are usually well-documented. For example, there are documentation websites and cheat sheets for most Tidyverse packages.
stringr
The rest of this lesson uses stringr
, the Tidyverse package for string processing. R also has built-in functions for string processing. The main advantage of stringr is that all of the functions use a common set of parameters, so they’re easier to learn and remember.
The first time you use stringr, you’ll have to install it with install.packages
(the same as any other package). Then you can load the package with the library
function:
# install.packages("stringr")
library("stringr")
The typical syntax of a stringr function is:
str_NAME(string, pattern, ...)
Where:
- NAME describes what the function does
- string is the string to search within or transform
- pattern is the pattern to search for
- ... is additional, function-specific arguments
For example, the str_detect function detects whether the pattern appears within the string:
str_detect("hello", "el")
[1] TRUE
str_detect("hello", "ol")
[1] FALSE
Most of the stringr functions are vectorized:
str_detect(c("hello", "goodbye", "lo"), "lo")
[1] TRUE FALSE TRUE
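Because str_detect returns one TRUE or FALSE per input string, its result can be used directly to subset or count. The following is a small sketch, assuming stringr is installed; the variable names are only for illustration.

```r
library(stringr)

words = c("hello", "goodbye", "lo")

# One logical value per element of words.
has_lo = str_detect(words, "lo")

# Use the logical vector to keep only the strings that match.
words[has_lo]   # "hello" "lo"

# sum() counts how many strings matched.
sum(has_lo)     # 2
```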
There are a lot of stringr functions. The remainder of this lesson focuses on three that are especially important, as well as some of their variants:
str_split_fixed
str_replace
str_match
You can find a complete list of stringr functions with examples in the documentation or cheat sheet.
The str_split
function splits the string at each position that matches the pattern. The characters that match are thrown away.
For example, suppose we want to split a sentence into words. Since there’s a space between each word, we can use a space as the pattern:
x = "The students in this class are great!"
result = str_split(x, " ")
result
[[1]]
[1] "The" "students" "in" "this" "class" "are" "great!"
The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. We can get the first element with:
result[[1]]
[1] "The" "students" "in" "this" "class" "are" "great!"
We have to use the double square bracket [[ operator here because result is a list (for a vector, we could use the single square bracket operator instead). Notice that in the printout for result, R gives us a hint that we should use [[ by printing [[1]].
To see why the function returns a list, consider what happens if we try to split two different sentences at once:
x = c(x, "Are you listening?")
result = str_split(x, " ")
result[[1]]
[1] "The" "students" "in" "this" "class" "are" "great!"
result[[2]]
[1] "Are" "you" "listening?"
Each sentence has a different number of words, so the vectors in the result have different lengths. So a list is the only way to store both.
The str_split_fixed function is almost the same as str_split, but takes a third argument for the number of pieces to split each string into. Because the number of pieces is fixed, the function can return the result in a matrix instead of a list. For example:
str_split_fixed(x, " ", 3)
     [,1]  [,2]       [,3]
[1,] "The" "students" "in this class are great!"
[2,] "Are" "you"      "listening?"
The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result.
For example, suppose we want to get the area code from some phone numbers:
phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)
result[, 1]
[1] "717" "629" "903"
The str_replace
function replaces the pattern the first time it appears in the string. The replacement goes in the third argument.
For instance, suppose we want to change the word "dog"
to "cat"
:
x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")
[1] "cats are great, dogs are fun" "cats are fluffy"
The str_replace_all
function replaces the pattern every time it appears in the string:
str_replace_all(x, "dog", "cat")
[1] "cats are great, cats are fun" "cats are fluffy"
We can also use the str_replace
and str_replace_all
functions to delete part of a string by setting the replacement to the empty string ""
.
For example, suppose we want to delete the comma:
str_replace(x, ",", "")
[1] "dogs are great dogs are fun" "dogs are fluffy"
In general, stringr functions with the _all
suffix affect all matches. Functions without _all
only affect the first match.
We’ll learn about str_match
at the end of the next section.
The stringr
functions (including the ones we just learned) use a special language called regular expressions or regex for the pattern. The regular expressions language is also used in many other programming languages besides R.
A regular expression can describe a complicated pattern in just a few characters, because some characters, called metacharacters, have special meanings. Letters and numbers are never metacharacters. They’re always literal.
Here are a few examples of metacharacters (we’ll look at examples in the subsequent sections):
Metacharacter | Meaning |
---|---|
. | any single character (wildcard) |
\ | escape character (in both R and regex) |
^ | beginning of string |
$ | end of string |
[ab] | 'a' or 'b' |
[^ab] | any character except 'a' or 'b' |
? | previous character appears 0 or 1 times |
* | previous character appears 0 or more times |
+ | previous character appears 1 or more times |
() | make a group |
More metacharacters are listed on the stringr cheat sheet, or in ?regex
.
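The str_view function is especially helpful for testing regular expressions. It shows each string that contains a match, with every match marked by angle brackets < > (older versions of stringr instead open a viewer with the first match highlighted). We’ll use it in the subsequent regex examples.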
The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.
The regex wildcard character is .
and matches any single character.
For example:
x = "dog"
str_view(x, "d.g")
[1] │ <dog>
By default, regex searches from left to right:
str_view(x, ".")
[1] │ <d><o><g>
Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character.
For example, suppose we want to match a literal dot. The regex for a literal dot is \. (a backslash followed by a dot). Since backslashes in R strings have to be escaped, the R string for this regex is "\\.". Then the regex works:
str_view("this.string", "\\.")
[1] │ this<.>string
The double backslash can be confusing, and it gets worse if we want to match a literal backslash. We have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:
str_view("this\\that", "\\\\")
[1] │ this<\>that
Raw strings are helpful here, because they make the backslash literal in R strings (but still not in regex). We can use raw strings to write the above as:
str_view(r"(this\that)", r"(\\)")
[1] │ this<\>that
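One way to keep the two layers of escaping straight is to print the pattern with message. This shows the string after R has processed its own escape sequences, which is the regex that stringr actually receives. A small sketch, assuming R 4.0 or newer for the raw string:

```r
# The R string "\\\\" holds two backslash characters, i.e. the regex \\
pattern = "\\\\"
message(pattern)
nchar(pattern)                # 2

# The raw string spelling is exactly the same string.
identical(pattern, r"(\\)")   # TRUE

# Likewise, the R string "\\." holds two characters, i.e. the regex \.
nchar("\\.")                  # 2
```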
You can turn off regular expressions entirely in stringr with the fixed
function:
str_view(x, fixed("."))
It’s good to turn off regular expressions whenever you don’t need them, both to avoid mistakes and because they take longer to compute.
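The difference between a regex pattern and a fixed pattern is easiest to see side by side. The following is a brief sketch, assuming stringr is installed; the example strings are made up.

```r
library(stringr)

strings = c("dog", "d.g")

# As a regex, "." is a wildcard, so it matches in both strings.
str_detect(strings, ".")          # TRUE TRUE

# With fixed(), "." only matches a literal dot.
str_detect(strings, fixed("."))   # FALSE TRUE

# fixed() works with other stringr functions too, such as str_split().
pieces = str_split("a.b.c", fixed("."))[[1]]
pieces                            # "a" "b" "c"
```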
By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor.
The beginning of string anchor is ^
. It marks the beginning of the string, but doesn’t count as a character in the match.
For example, suppose we want to match an a
at the beginning of the string:
x = c("abc", "cab")
str_view(x, "a")
[1] │ <a>bc
[2] │ c<a>b
str_view(x, "^a")
[1] │ <a>bc
It doesn’t make sense to put characters before ^
, since no characters can come before the beginning of the string.
Likewise, the end of string anchor is $
. It marks the end of the string, but doesn’t count as a character in the match.
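As a short sketch of the end of string anchor (assuming stringr is installed), the code below checks for a c at the end of each string, and then uses both anchors together to require that the whole string match the pattern.

```r
library(stringr)

x = c("abc", "cab")

# Match a c only at the end of the string.
ends_with_c = str_detect(x, "c$")
ends_with_c                     # TRUE FALSE

# With both anchors, the pattern must cover the entire string.
str_detect(x, "^abc$")          # TRUE FALSE
str_detect("abcabc", "^abc$")   # FALSE
```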
In regex, square brackets [ ]
create a character class. A character class counts as one character, but that character can be any of the characters inside the square brackets. The square brackets themselves don’t count as characters in the match.
For example, suppose we want to match a c
followed by either a
or t
:
x = c("ca", "ct", "cat", "cta")
str_view(x, "c[ta]")
[1] │ <ca>
[2] │ <ct>
[3] │ <ca>t
[4] │ <ct>a
You can use a dash -
in a character class to create a range. For example, to match letters p
through z
:
str_view(x, "c[p-z]")
[2] │ <ct>
[4] │ <ct>a
Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-]
.
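The following sketch tries out ranges of digits and capital letters, as well as a literal dash at the end of a character class. It assumes stringr is installed, and the example strings are made up.

```r
library(stringr)

ids = c("a1", "B2", "cc", "x-y")

# Any digit from 0 through 9.
str_detect(ids, "[0-9]")    # TRUE TRUE FALSE FALSE

# Any capital letter from A through Z (matching is case-sensitive).
str_detect(ids, "[A-Z]")    # FALSE TRUE FALSE FALSE

# a, b, c, or a literal dash (the dash comes last in the class).
str_detect(ids, "[abc-]")   # TRUE FALSE TRUE TRUE
```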
Most metacharacters are literal when inside a character class. For example, [.]
matches a literal dot.
A hat ^
at the beginning of the character class negates the class. So for example, [^abc]
matches any one character except for a
, b
, or c
:
str_view("abcdef", "[^abc]")
[1] │ abc<d><e><f>
Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.
For example, the ?
quantifier means the preceding character can appear 0 or 1 times. In other words, ?
makes the preceding character optional.
For example:
x = c("abc", "ab", "ac", "abbc")
str_view(x, "ab?c")
[1] │ <abc>
[3] │ <ac>
The *
quantifier means the preceding character can appear 0 or more times. In other words, *
means the preceding character can appear any number of times or not at all.
str_view(x, "ab*c")
[1] │ <abc>
[3] │ <ac>
[4] │ <abbc>
The +
quantifier means the preceding character must appear 1 or more times.
Quantifiers are greedy, meaning they always match as many characters as possible.
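Here is a brief sketch of the + quantifier and of greedy matching, assuming stringr is installed. It uses str_extract, a stringr function not otherwise covered in this lesson, which returns the text of the first match.

```r
library(stringr)

x = c("abc", "ab", "ac", "abbc")

# With +, at least one b is required between a and c.
str_detect(x, "ab+c")                     # TRUE FALSE FALSE TRUE

# Greedy: b+ takes all four b's, not just the first one.
str_extract("abbbbc", "ab+")              # "abbbb"

# Greedy: .+ runs to the last quote in the string, not the nearest one.
str_extract("say 'a' and 'b'", "'.+'")    # "'a' and 'b'"
```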
In regex, parentheses create a group. Groups can be affected by quantifiers, making it possible to repeat a pattern (rather than just a character). The parentheses themselves don’t count as characters in the match.
For example:
x = c("cats, dogs, and frogs", "cats and frogs")
str_view(x, "cats(, dogs,)? and frogs")
[1] │ <cats, dogs, and frogs>
[2] │ <cats and frogs>
Groups are especially useful with the stringr functions str_match
and str_match_all
.
The str_match
function extracts the overall match to the pattern, as well as the match to each group. So you can use str_match
to split a string in more complicated ways than str_split
, or to extract specific pieces of a string.
For example, suppose we want to split an email address:
str_match("naulle@ucdavis.edu", "([^@]+)@(.+)[.](.+)")
     [,1]                 [,2]     [,3]      [,4]
[1,] "naulle@ucdavis.edu" "naulle" "ucdavis" "edu"
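Like most stringr functions, str_match is vectorized: it returns a matrix with one row per input string, so each group ends up in its own column. As a sketch (assuming stringr is installed), here is the earlier phone number example redone with groups. The pattern is just one possible way to write it.

```r
library(stringr)

phones = c("717-555-3421", "629-555-8902", "903-555-6781")

# Three groups of digits separated by literal dashes.
parts = str_match(phones, "([0-9]+)-([0-9]+)-([0-9]+)")

# Column 1 is the overall match; columns 2 to 4 are the groups.
parts[, 1]   # "717-555-3421" "629-555-8902" "903-555-6781"
parts[, 2]   # "717" "629" "903"

# A string that does not match gives a row of NA values.
no_match = str_match("no phone here", "([0-9]+)-([0-9]+)-([0-9]+)")
no_match
```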