23 Strings & Regular Expressions

Learning Goals

After this lesson, you should be able to:

Use escape codes in strings to represent non-keyboard characters
Explain what a text encoding is
Use the stringr package to detect, extract, and change patterns in strings
Use the regular expressions mini-language

Required Packages

This chapter uses the following packages:

dplyr
purrr
stringr

Chapter 14 explains how to install and load packages.

Strings represent text, but even if your datasets are composed entirely of numbers, you’ll need to know how to work with strings. Text formats for data are widespread: comma-separated values (CSV), tab-separated values (TSV), JavaScript object notation (JSON), various markup languages (HTML, XML, YAML, TOML), and more. When you read data in these formats into R, sometimes R will correctly convert the values to appropriate non-string types. The rest of the time, you need to know how to work with strings so that you can fix whatever went wrong and convert the data yourself.

String processing encompasses a variety of tasks such as searching for patterns within strings, extracting data from within strings, splitting strings into component parts, and removing or replacing unwanted characters (excess whitespace, punctuation, and so on).

23.1 String Fundamentals

This section introduces several fundamental concepts related to working with strings.

23.1.1 Escape Sequences

In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:

Write quotes or backslashes within a string
Write characters that don’t appear on your keyboard (for example, characters in a foreign language)

For example, the escape sequence \n corresponds to the newline character. Notice that the message function prints \n as a literal new line:

x = "Hello\nNick"

message(x)

Hello
Nick

On the other hand, when you let R print a string or other object automatically (without explicitly calling message), it prints a programmer-friendly representation. For strings, the representation shows escape codes:

x

[1] "Hello\nNick"

This makes it easy to identify characters that are not normally visible, such as whitespace, and also makes it easy to copy the string. You can use the built-in print function to get a representation for any object.
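As a quick check that an escape sequence stands for a single character, the sketch below counts the characters in the string with the built-in nchar function (which this chapter does not otherwise use), and then prints the string both ways:

```r
x = "Hello\nNick"

# The escape sequence \n is one character, not two
nchar(x)
# [1] 10

# print shows the representation; message shows the actual characters
print(x)
message(x)
```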

Note

To print the actual characters in a string, use the message function.

To print a representation of the characters in a string, use the print function. The representation is useful to identify characters that are not normally visible, such as tabs and the characters that mark the end of a line.

There are at least two more functions in R that can print things: cat and show. The cat function is similar to message, but bypasses R’s system for keeping track of messages. The show function is similar to print, but specifically designed for R’s S4 objects, which we won’t cover. You generally shouldn’t use these two functions directly.

As another example, suppose we want to put literal quotation marks in a string. We can either enclose the string in the other kind of quotation marks (single versus double), or escape the quotation marks in the string:

x = 'She said, "Hi"'
message(x)

She said, "Hi"

y = "She said, \"Hi\""
message(y)

She said, "Hi"

Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:

x = "\\"
message(x)

\

There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R.

Tip 23.1

A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions (covered in Section 23.3).

Raw strings begin with r" and an opening delimiter (, [, or {. Raw strings end with a matching closing delimiter and quote. For example:

x = r"(quotes " and backslashes \)"

message(x)

quotes " and backslashes \

Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.
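To see that a raw string and an escaped string can describe the same text, here is a small sketch that writes a made-up Windows-style file path both ways (the path is only an illustration, and the code assumes R 4.0 or later):

```r
# A hypothetical file path, written two ways
escaped = "C:\\Users\\nick\\data.csv"
raw = r"(C:\Users\nick\data.csv)"

# Both strings contain exactly the same characters
identical(escaped, raw)
# [1] TRUE
```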

23.1.2 Character Encodings

Computers store data as numbers. In order to store text on a computer, people have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97.

Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).

Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that implement Unicode. The most popular of these is UTF-8.

In R, you can write Unicode characters with the escape sequence \u followed by up to 4 hexadecimal (base 16) digits, or \U followed by up to 8 hexadecimal digits, that give the number for the character. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So you can write an a as:

x = "\u61"
x

[1] "a"

Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So the string "\U1f408" is the cat emoji:

"\U1f408"

[1] "🐈"
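If you want to explore the mapping between characters, numbers, and bytes yourself, one possible approach uses the base R functions utf8ToInt, intToUtf8, as.hexmode, and charToRaw. None of these are covered elsewhere in this chapter, so treat this as an optional sketch:

```r
# Character to Unicode number, and back
utf8ToInt("a")
# [1] 97
intToUtf8(97)
# [1] "a"

# The same number in base 16, as used after \u
format(as.hexmode(97))
# [1] "61"

# UTF-8 stores "a" in one byte, but needs four bytes for the cat emoji
length(charToRaw(enc2utf8("a")))
# [1] 1
length(charToRaw(enc2utf8("\U1f408")))
# [1] 4
```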

Tip

Whether you can see a Unicode character depends on whether the current font has a glyph (image representation) for that character. Most fonts only have glyphs for a few languages.

Make sure to use a font with good Unicode coverage if you love emoji or expect to work with many different languages. The Noto Fonts project aims to create a collection of fonts with a common style and complete language coverage. The NerdFonts project patches fonts commonly used for programming so that they have better coverage of symbols.

Note

Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.

Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:

read.csv("my_data.csv", fileEncoding = "UTF-8")

Other reader functions may use a different parameter to set the encoding, so always check the documentation.

On computers where the native language is not set to English, it can also help to set R’s native language to English with Sys.setlocale(locale = "English").

Encoding problems when writing a file are slightly more complicated to fix. See this blog post for a thorough explanation.
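To illustrate why the encoding matters, the following sketch builds three bytes by hand and then decodes them with the base R function iconv. The bytes and the name are made up for this example, and iconv is not otherwise used in this chapter. The same idea applies when a reader function is told which encoding a file uses:

```r
# Three bytes that spell "Zoë" in the Latin-1 encoding (an assumed example)
bytes = as.raw(c(0x5a, 0x6f, 0xeb))

# Tell iconv which encoding the bytes use, and convert them to UTF-8
x = iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")

x == "Zo\u00eb"
# [1] TRUE

# In UTF-8 the ë takes two bytes, so the converted string has four bytes
length(charToRaw(x))
# [1] 4
```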

23.2 The stringr Package

Although R has built-in functions for string processing, we recommend using the stringr package for all of your string processing needs. The package is part of the Tidyverse, a collection of packages introduced in Section 14.3. Major advantages of stringr over other packages and R’s built-in functions include:

Correctness: the package builds on International Components for Unicode (ICU), the Unicode Consortium’s own library for handling text encodings
Discoverability: every function’s name begins with str_ so they’re easy to discover, remember, and identify in code
Interface consistency: the first argument is always the string to process, the second argument is always the pattern to match (if applicable)
Vectorization: most of the functions are vectorized in the first and second arguments

stringr has detailed documentation and also a cheatsheet.

The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function:

# install.packages("stringr")
library(stringr)

The typical syntax of a stringr function is:

str_name(string, pattern, ...)

Where:

name describes what the function does
string is a string to search within or transform
pattern is a pattern to search for, if applicable
... is additional, function-specific arguments

For example, the str_detect function detects whether a pattern appears within a string. The function returns TRUE if the pattern is found and FALSE if it isn’t:

str_detect("hello", "el")

[1] TRUE

str_detect("hello", "ol")

[1] FALSE

Most of the stringr functions are vectorized in the string parameter:

str_detect(c("hello", "goodbye", "lo"), "lo")

[1]  TRUE FALSE  TRUE
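Vectorization also applies to the pattern. As a small sketch (assuming stringr is installed), you can test one string against several patterns, or pair up strings and patterns element by element:

```r
library(stringr)

# One string, several patterns
str_detect("hello", c("el", "ol", "lo"))
# [1]  TRUE FALSE  TRUE

# Strings and patterns are paired up element by element
str_detect(c("hello", "goodbye"), c("lo", "lo"))
# [1]  TRUE FALSE
```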

As another example, the str_sub function extracts a substring from a string, given the substring’s position. The first argument is the string, the second is the position of the substring’s first character, and the third is the position of the substring’s last character:

str_sub("You speak of destiny as if it was fixed.", 5, 9)

[1] "speak"

The str_sub function is especially useful for extracting data from strings that have a fixed width (although the readr package’s read_fwf is usually a better choice if you have a fixed-width file).
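For instance, here is a minimal sketch with made-up fixed-width codes, where the first two characters are a state and the last four are a year. The codes and variable names are hypothetical:

```r
library(stringr)

# Hypothetical fixed-width codes: 2-letter state, then 4-digit year
codes = c("CA2021", "NY2022", "TX2023")

state = str_sub(codes, 1, 2)
year = as.numeric(str_sub(codes, 3, 6))

state
# [1] "CA" "NY" "TX"
year
# [1] 2021 2022 2023

# Negative positions count from the end of the string
str_sub(codes, -2, -1)
# [1] "21" "22" "23"
```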

There are a lot of stringr functions. Five that are especially important and are explained in this reader are:

str_detect, to test whether a string contains a pattern
str_sub, to extract a substring at a given position from a string
str_replace, to replace or remove parts of a string
str_split_fixed, to split a string into parts
str_match, to extract data from a string

You can find a complete list of functions with examples on the stringr documentation’s reference page and the cheatsheet.

23.3 Regular Expressions

The stringr functions use a special language called regular expressions or regex to describe patterns in strings. Many other programming languages also have string processing tools that use regular expressions, so fluency with regular expressions is a valuable, transferrable skill.

You can use a regular expression to describe a complicated pattern in just a few characters because some characters, called metacharacters, have special meanings. Metacharacters are usually punctuation characters. They are never letters or numbers, which always have their literal meaning.

This table lists some of the most useful metacharacters:

Metacharacter	Meaning
`.`	any one character (wildcard)
`\`	escape character (in both R and regex), see Section 23.8.2
`^`	the beginning of string (not a character)
`$`	the end of string (not a character)
`[ab]`	one character, either `'a'` or `'b'`
`[^ab]`	one character, anything except `'a'` or `'b'`
`?`	the previous character appears 0 or 1 times
`*`	the previous character appears 0 or more times
`+`	the previous character appears 1 or more times
`()`	make a group
`|`	match left OR right side (not a character)

Section 23.8 provides examples of how most of the metacharacters work.
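As a brief preview, this sketch combines a few of the metacharacters from the table with str_detect. The test strings are made up for illustration:

```r
library(stringr)

x = c("cat", "cot", "coat", "ct")

# c, then any one character, then t, and nothing else
str_detect(x, "^c.t$")
# [1]  TRUE  TRUE FALSE FALSE

# c, then one or more of a or o, then t
str_detect(x, "^c[ao]+t$")
# [1]  TRUE  TRUE  TRUE FALSE

# Either "oa" or "ct" anywhere in the string
str_detect(x, "oa|ct")
# [1] FALSE FALSE  TRUE  TRUE
```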

23.4 Replacing Parts of Strings

Replacing part of a string is a common string processing task. For instance, quantitative data often contain non-numeric characters such as commas, currency symbols, and percent signs. These must be removed before converting to numeric data types. Replacement and removal go hand-in-hand, since removal is equivalent to replacing part of a string with the empty string "".

The str_replace function replaces the first part of a string that matches a pattern (from left to right), while the related str_replace_all function replaces every part of a string that matches a pattern. Most stringr functions that do pattern matching come in a pair like this: one to process only the first match and one to process every match.

As an example, suppose you want to remove commas from a vector of numbers (in strings) so that you can cast them to a numeric data type with as.numeric. You want to remove all of the commas, so str_replace_all is the function to use. The first argument is the string and the second is the pattern. The third argument is the replacement. In this case, the pattern is "," and the replacement is the empty string "". The code wraps the pattern in the fixed function, which tells stringr to match the pattern literally rather than as a regular expression. So the code to remove the commas and cast the numbers is:

x = c("1,000,000", "525,600", "42")
as.numeric(str_replace_all(x, fixed(","), ""))

[1] 1000000  525600      42

The str_replace function wouldn’t work as well for this task, since it only replaces the first match to the pattern:

str_replace(x, fixed(","), "")

[1] "1000,000" "525600"   "42"
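The fixed function used in these examples matters most when the pattern contains a regex metacharacter. The following sketch uses a made-up number with dots as thousands separators to show the difference:

```r
library(stringr)

# A number that uses dots as thousands separators
x = "1.000.000"

# As a regex, "." is the wildcard, so every character is removed
str_replace_all(x, ".", "")
# [1] ""

# With fixed, "." only matches a literal dot
str_replace_all(x, fixed("."), "")
# [1] "1000000"
```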

You can also use these functions to replace or remove longer patterns within words. For instance, suppose you want to change the word "dog" to "cat":

x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, fixed("dog"), "cat")

[1] "cats are great, dogs are fun" "cats are fluffy"

str_replace_all(x, fixed("dog"), "cat")

[1] "cats are great, cats are fun" "cats are fluffy"

As a final example, you can use the replacement functions and a regex pattern to replace repeated spaces with a single space. This is a good standardization step if you’re working with text. The key is to use the regex quantifier +, which means a character “repeats one or more times” in the pattern, and to use a single space " " as the replacement:

x = "This    sentence  has  extra      space."
str_replace_all(x, " +", " ")

[1] "This sentence has extra space."

The str_squish function does the same thing, and also removes all whitespace from the beginning and end of a string.
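A short sketch of str_squish, alongside the related stringr function str_trim for comparison (the example string is made up):

```r
library(stringr)

x = "  This    sentence  has  extra      space.  "

# Collapse repeated whitespace and trim both ends
str_squish(x)
# [1] "This sentence has extra space."

# str_trim only removes whitespace at the beginning and end
str_trim(x)
# [1] "This    sentence  has  extra      space."
```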

23.5 Splitting Strings

Distinct data in a text are generally separated by a character like a space or a comma, to make them easy for people to read. Often these separators also make the data easy for R to parse. The idea is to split the string into a separate value at each separator.

The str_split function splits a string at each match to a pattern. The matching characters—that is, the separators—are discarded.

For example, suppose you want to split several numbers separated by commas and spaces:

x = "21, 32.3, 5, 64"
result = str_split(x, ", ")
result

[[1]]
[1] "21"   "32.3" "5"    "64"

The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. You can get the first element with:

result[[1]]

[1] "21"   "32.3" "5"    "64"

You can then convert the values to numbers with as.numeric.
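Putting the split and the conversion together might look like this sketch (the name values is just for illustration):

```r
library(stringr)

x = "21, 32.3, 5, 64"
result = str_split(x, ", ")

# Convert the pieces of the first (and only) string to numbers
values = as.numeric(result[[1]])
values
# [1] 21.0 32.3  5.0 64.0

sum(values)
# [1] 122.3
```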

To see why the str_split function always returns a list, consider what happens if you try to split two different strings at once:

x = c(x, "10, 15, 1.3")
result = str_split(x, ", ")
result

[[1]]
[1] "21"   "32.3" "5"    "64"  

[[2]]
[1] "10"  "15"  "1.3"

Each string has a different number of parts, so the vectors in the result have different lengths, and a list is the natural way to store them.

You can also use the str_split function to split a sentence into words. Use spaces for the split:

sentences = c(
  "The wonderful, wonderful cat!",
  "Who is this duke of zill anyway?"
)

words = str_split(sentences, " ")
words

[[1]]
[1] "The"        "wonderful," "wonderful"  "cat!"      

[[2]]
[1] "Who"     "is"      "this"    "duke"    "of"      "zill"    "anyway?"
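Notice that the punctuation stays attached to the words. One possible way to handle this, sketched here with functions already introduced, is to remove the punctuation with str_replace_all before splitting. The character class only lists the three punctuation marks that appear in these sentences:

```r
library(stringr)

sentences = c(
  "The wonderful, wonderful cat!",
  "Who is this duke of zill anyway?"
)

# Remove commas, exclamation points, and question marks, then split
cleaned = str_replace_all(sentences, "[,!?]", "")
words = str_split(cleaned, " ")

words[[1]]
# [1] "The"       "wonderful" "wonderful" "cat"
```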

Tip

When you know exactly how many parts you expect a string to have, use the str_split_fixed function instead of str_split. It requires a third argument for the number of pieces to return. Because the number of pieces is fixed, the function returns the results in a matrix rather than a list. For example:

x = c("1, 2, 3", "10, 20, 30")
str_split_fixed(x, ", ", 3)

     [,1] [,2] [,3]
[1,] "1"  "2"  "3" 
[2,] "10" "20" "30"

The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result.

For example, suppose you want to get the area codes from some phone numbers:

phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)

result[, 1]

[1] "717" "629" "903"
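It can help to know what str_split_fixed does when a string has fewer or more pieces than you asked for. This sketch uses made-up strings to show both cases:

```r
library(stringr)

# Fewer pieces than requested: the extra columns are empty strings
parts = str_split_fixed(c("a-b-c", "d-e"), "-", 3)
parts[2, ]
# [1] "d" "e" ""

# More separators than needed: the last piece keeps the remainder
rest = str_split_fixed("a-b-c-d", "-", 2)
rest[1, ]
# [1] "a"     "b-c-d"
```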

23.6 Extracting Matches

Occasionally, you might need to extract parts of a string in a more complicated way than string splitting allows. One solution is to write a regular expression that will match all of the data you want to capture, with parentheses ( ), the regex metacharacter for a group, around each distinct value. Then you can use the str_match function to extract the groups. Section 23.8.6 presents some examples of regex groups.

For example, suppose you want to split an email address into three parts: the user name, the domain name, and the top-level domain. To create a regular expression that matches email addresses, you can use the @ and . in the address as anchors. The surrounding characters are generally alphanumeric, which you can represent with the “word” metacharacter \w:

\w+@\w+[.]\w+

Next, put parentheses ( ) around each part that you want to extract:

(\w+)@(\w+)[.](\w+)

Finally, use this pattern in str_match, adding extra backslashes so that everything is escaped correctly:

x = "datalab@ucdavis.edu"
regex = "(\\w+)@(\\w+)[.](\\w+)"
str_match(x, regex)

     [,1]                  [,2]      [,3]      [,4] 
[1,] "datalab@ucdavis.edu" "datalab" "ucdavis" "edu"

The function extracts the overall match to the pattern, as well as the match to each group.

The pattern in this example doesn’t work for all possible email addresses, since user names can contain dots and other characters that are not alphanumeric. You could generalize the pattern if necessary. The point is that the str_match function and groups provide an extremely flexible way to extract data from strings.
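As one possible generalization, the sketch below allows dots in the user name and domain name by replacing \w with the character class [\w.]. This is only an illustration, not a complete email pattern, and the second and third test strings are made up:

```r
library(stringr)

x = c("datalab@ucdavis.edu", "first.last@mail.ucdavis.edu", "not an email")

# [\w.]+ matches one or more word characters or dots
regex = "([\\w.]+)@([\\w.]+)[.](\\w+)"
m = str_match(x, regex)

# User names (strings that don't match produce NA)
m[, 2]
# [1] "datalab"    "first.last" NA

# Domain names and top-level domains
m[, 3]
# [1] "ucdavis"      "mail.ucdavis" NA
m[, 4]
# [1] "edu" "edu" NA
```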

23.7 Case Study: U.S. Cold Storage

The U.S. Department of Agriculture (USDA) publishes a variety of datasets online, particularly through its National Agricultural Statistics Service (NASS). Unfortunately, most of them are published in PDF or semi-structured text format, which makes reading the data into R or other statistical software a challenge.

The USDA NASS posts monthly reports about stocks of agricultural products in refrigerated warehouses. In this case study, you’ll use string processing functions to extract a table of data from the December 2022 report.

Important

Click here to download the December 2022 Cold Storage report.

If you haven’t already, we recommend you create a directory for this workshop. In your workshop directory, create a data/ subdirectory. Download and save the dataset in the data/ subdirectory.

The goal is to extract the first table, about “Nuts, Dairy Products, Frozen Eggs, and Frozen Poultry,” from the report.

The report is a semi-structured mix of natural language text and fixed-width tables. As a consequence, most functions for reading tabular data will not work well on the entire report. You could try to use a function for reading fixed-width data, such as read.fwf or the readr package’s read_fwf on only the lines containing a table. Another approach, which is shown here, is to use string processing functions to find and extract the table.

The readLines function reads a text file into a character vector with one element for each line. This makes the function useful for reading unstructured or semi-structured text. Use the function to read the report:

report = readLines("data/2022-12_cold-storage.txt")
head(report)

[1] ""                                                                            
[2] "Cold Storage"                                                                
[3] ""                                                                            
[4] "ISSN: 1948-903X"                                                             
[5] ""                                                                            
[6] "Released December 22, 2022, by the National Agricultural Statistics Service "

In the report, tables always begin and end with lines that contain only dashes -. By locating these all-dash lines, you can locate the tables. Like str_detect, the str_which function tests whether strings in a vector match a pattern. The only difference is that str_which returns the indexes of the strings that matched (as if you had called which) rather than a logical vector. Use str_which to find the all-dash lines:

# The regex means:
#   ^  beginning of string
#   -+ one or more dashes
#   $  end of string

dashes = str_which(report, "^-+$")
head(report[dashes], 2)

[1] "--------------------------------------------------------------------------------------------------------------------------"
[2] "--------------------------------------------------------------------------------------------------------------------------"
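If you want to experiment without the report, here is a sketch of the same idea on a made-up miniature report. The lines vector is hypothetical and much simpler than the real file:

```r
library(stringr)

# A made-up miniature report with one table
lines = c("Title", "-----", "a : b", "-----", "1 : 2", "-----", "Footer")

# str_detect returns a logical vector; str_which returns indexes
str_detect(lines, "^-+$")
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
dashes = str_which(lines, "^-+$")
dashes
# [1] 2 4 6

# The lines strictly between the second and third dash lines
lines[(dashes[2] + 1):(dashes[3] - 1)]
# [1] "1 : 2"
```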

Each table contains three dash lines—one separates the header and body. The header and body of the first table are:

report[dashes[1]:dashes[2]]

[1] "--------------------------------------------------------------------------------------------------------------------------"
[2] "                                      :             :                           :     November 30, 2022     :   Public    "
[3] "                                      :        Stocks in all warehouses         :      as a percent of      :  warehouse  "
[4] "                                      :             :                           :                           :   stocks    "
[5] "               Commodity              :-----------------------------------------------------------------------------------"
[6] "                                      :November 30, : October 31, :November 30, :November 30, : October 31, :November 30, "
[7] "                                      :    2021     :    2022     :    2022     :    2021     :    2022     :    2022     "
[8] "--------------------------------------------------------------------------------------------------------------------------"

bod = report[dashes[2]:dashes[3]]
head(bod)

[1] "--------------------------------------------------------------------------------------------------------------------------"
[2] "                                      :  ------------ 1,000 pounds -----------        ---- percent ----      1,000 pounds "
[3] "                                      :                                                                                   "
[4] "Nuts                                  :                                                                                   "
[5] "Shelled                               :                                                                                   "
[6] "  Pecans .............................:     30,906        38,577        34,489        112            89                   "

The columns have fixed widths, so extracting the columns is relatively easy with str_sub if you can get the offsets. In the last line of the header, the columns are separated by colons :. Thus you can use the str_locate_all function, which returns the locations of a pattern in a string, to get the offsets:

# The regex means:
#   [^:]+  one or more characters, excluding colons
#   (:|$)  a colon or the end of the line

cols = str_locate_all(report[dashes[2] - 1], "[^:]+(:|$)")
# Like str_split, str_locate_all returns a list
cols = cols[[1]]
cols

     start end
[1,]     1  39
[2,]    40  53
[3,]    54  67
[4,]    68  81
[5,]    82  95
[6,]    96 109
[7,]   110 122
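To check your understanding of the pattern, here is the same technique on a short made-up header line, where the offsets are easy to count by hand. The names header and toy_cols are hypothetical:

```r
library(stringr)

# Characters 1-7, 8-13, and 14-18 form three colon-delimited columns
header = "name  :  a  :  b  "
toy_cols = str_locate_all(header, "[^:]+(:|$)")[[1]]

toy_cols[, "start"]
# [1]  1  8 14
toy_cols[, "end"]
# [1]  7 13 18

# The same offsets split a body line with the same layout
str_sub("pecan :  1  :  2  ", toy_cols)
# [1] "pecan :" "  1  :" "  2  "
```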

You can use these offsets with str_sub to break a line in the body of the table into columns:

str_sub(bod[6], cols)

[1] "  Pecans .............................:"
[2] "     30,906   "                         
[3] "     38,577   "                         
[4] "     34,489   "                         
[5] "     112      "                         
[6] "      89      "                         
[7] "             "

In order to process every line in the body of the table in one vectorized call, we have to use str_sub_all rather than str_sub. The result is a list with one element for each line (that is, a list of rows), but we want a list of columns, so we also need to transpose the list with the purrr package’s list_transpose function:

library("purrr")
tab = str_sub_all(bod, cols)
tab = list_transpose(tab)
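The difference between rows and columns is easier to see on a tiny example. This sketch uses two made-up lines and hand-written offsets, and assumes stringr 1.5.0 or later and purrr 1.0.0 or later:

```r
library(stringr)
library(purrr)

lines = c("abcdef", "ghijkl")
# Two columns: characters 1-3 and characters 4-6
pos = cbind(start = c(1, 4), end = c(3, 6))

# One list element per line (rows)
rows = str_sub_all(lines, pos)
rows[[1]]
# [1] "abc" "def"

# One list element per column
columns = list_transpose(rows)
columns[[1]]
# [1] "abc" "ghi"
columns[[2]]
# [1] "def" "jkl"
```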

The columns still contain undesirable punctuation and whitespace, but we can remove these with str_replace_all and str_squish. We can use purrr’s map function to call these on each column:

# The regex means:
#   ,     a comma
#   |     OR
#   [.]*  zero or more literal dots
#   :     a colon
#   $     the end of the line

tab = map(tab, \(col) {
  col = str_replace_all(col, ",|[.]*:$", "")
  str_squish(col)
})
tab[[1]]

 [1] "---------------------------------------"
 [2] ""                                       
 [3] ""                                       
 [4] "Nuts"                                   
 [5] "Shelled"                                
 [6] "Pecans"                                 
 [7] "In-Shell"                               
 [8] "Pecans"                                 
 [9] ""                                       
[10] "Dairy products"                         
[11] "Butter"                                 
[12] "Natural cheese"                         
[13] "American"                               
[14] "Swiss"                                  
[15] "Other"                                  
[16] "Total natural cheese"                   
[17] ""                                       
[18] "Frozen eggs"                            
[19] "Whites"                                 
[20] "Yolks"                                  
[21] "Whole and mixed"                        
[22] "Unclassified"                           
[23] "Total frozen eggs"                      
[24] ""                                       
[25] "Frozen poultry"                         
[26] "Chicken"                                
[27] "Broilers fryers and roasters"           
[28] "Hens mature chickens"                   
[29] "Breasts and breast meat"                
[30] "Drumsticks"                             
[31] "Leg quarters"                           
[32] "Legs"                                   
[33] "Thigh and thigh quarters"               
[34] "Thigh Meat"                             
[35] "Wings"                                  
[36] "Paws and feet"                          
[37] "Other"                                  
[38] "Total chicken"                          
[39] ""                                       
[40] "Turkey"                                 
[41] "Whole turkeys"                          
[42] "Toms"                                   
[43] "Hens"                                   
[44] "Total whole turkeys"                    
[45] "Breasts"                                
[46] "Legs"                                   
[47] "Mechanically deboned meat"              
[48] "Other"                                  
[49] "Unclassified"                           
[50] "Total turkey"                           
[51] ""                                       
[52] "Ducks"                                  
[53] ""                                       
[54] "Total frozen poultry"                   
[55] "---------------------------------------"
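To see what the cleaning step does to individual cells, here is a sketch that applies the same pattern to three cells copied from the Pecans row shown earlier:

```r
library(stringr)

cells = c(
  "  Pecans .............................:",
  "     30,906   ",
  "             "
)

# Remove commas and the trailing dots and colon, then squish whitespace
cleaned = str_squish(str_replace_all(cells, ",|[.]*:$", ""))
cleaned
# [1] "Pecans" "30906"  ""
```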

To turn the list of columns into an actual data frame, we can use the dplyr package’s bind_cols function:

library("dplyr")
tab = bind_cols(tab)
head(tab, 10)

# A tibble: 10 × 7
   ...1                                      ...2  ...3  ...4  ...5  ...6  ...7 
   <chr>                                     <chr> <chr> <chr> <chr> <chr> <chr>
 1 "---------------------------------------" "---… "---… "---… "---… "---… "---…
 2 ""                                        "---… "100… "---… "---… "ent… "100…
 3 ""                                        ""    ""    ""    ""    ""    ""   
 4 "Nuts"                                    ""    ""    ""    ""    ""    ""   
 5 "Shelled"                                 ""    ""    ""    ""    ""    ""   
 6 "Pecans"                                  "309… "385… "344… "112" "89"  ""   
 7 "In-Shell"                                ""    ""    ""    ""    ""    ""   
 8 "Pecans"                                  "637… "443… "476… "75"  "107" ""   
 9 ""                                        ""    ""    ""    ""    ""    ""   
10 "Dairy products"                          ""    ""    ""    ""    ""    ""

The first few rows and the last row can be removed, since they don’t contain data. Then we can convert the individual columns to appropriate data types:

tab = tab[-c(1:3, nrow(tab)), ]
tab[2:7] = map(tab[2:7], as.numeric)
head(tab, 10)

# A tibble: 10 × 7
   ...1               ...2   ...3   ...4  ...5  ...6   ...7
   <chr>             <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
 1 "Nuts"               NA     NA     NA    NA    NA     NA
 2 "Shelled"            NA     NA     NA    NA    NA     NA
 3 "Pecans"          30906  38577  34489   112    89     NA
 4 "In-Shell"           NA     NA     NA    NA    NA     NA
 5 "Pecans"          63788  44339  47638    75   107     NA
 6 ""                   NA     NA     NA    NA    NA     NA
 7 "Dairy products"     NA     NA     NA    NA    NA     NA
 8 "Butter"         210473 239658 199695    95    83 188566
 9 "Natural cheese"     NA     NA     NA    NA    NA     NA
10 "American"       834775 831213 815655    98    98     NA

The data frame is now sufficiently clean that we could use it for a simple analysis. Of course, there are many things we could do to improve the extracted data frame, such as identifying categories and subcategories in the first column, removing rows that are completely empty, and adding column names. These entail more string processing and data frame manipulation—if you want to practice your R skills, try doing them on your own.
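As a starting point, this sketch shows two of those improvements on a toy stand-in for the table. The data are a small made-up subset, and the column names are hypothetical, so choose names that suit your analysis:

```r
library(dplyr)

# A toy stand-in for the first two columns of the extracted table
toy = bind_cols(list(
  c("Nuts", "Pecans", "", "Butter"),
  c(NA, 30906, NA, 210473)
))

# Hypothetical column names
names(toy) = c("commodity", "stocks_2021_11")

# Drop rows where the commodity name is empty
toy = filter(toy, commodity != "")
toy$commodity
# [1] "Nuts"   "Pecans" "Butter"
```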

23.8 Regular Expression Examples

Important

This section is intended as a reference and is not taught in the live session.

This section provides examples of several different regular expression metacharacters and other features. Most of the examples use the str_view function, which is especially helpful for testing regular expressions. The function displays the string with the parts that match highlighted (recent versions of stringr highlight every match, while older versions highlight only the first).

The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.

23.8.1 The Wildcard

The regex wildcard character is . and matches any single character. For example:

x = "dog"
str_view(x, "d.g")

By default, regex searches from left to right:

str_view(x, ".")
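If you prefer plain text output, one alternative is the stringr function str_extract, which returns the first match as a string. The sketch below uses it with a few made-up strings:

```r
library(stringr)

x = c("dog", "dig", "dg", "doog")

# The wildcard must match exactly one character
str_extract(x, "d.g")
# [1] "dog" "dig" NA    NA

# With several possible matches, the leftmost one is returned
str_extract("dog", ".")
# [1] "d"
```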

23.8.2 Escape Sequences

Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character.

For example, suppose you want to match a literal dot (.). The regex for a literal dot is \. (a backslash followed by a dot). Since backslashes in R strings have to be escaped, the R string for this regex is "\\.". For example:

str_view("this.string", "\\.")

The double backslash can be confusing, and it gets worse if you want to match a literal backslash. You have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:

str_view("this\\that", "\\\\")

Raw strings (see Tip 23.1) make regular expressions easier to read, because R treats backslashes in a raw string literally (the regex engine still treats each backslash as the beginning of a regex escape sequence). You can use a raw string to write the above as:

str_view(r"(this\that)", r"(\\)")
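To confirm how the two layers of escaping work, this sketch counts the characters in the pattern and checks that the raw string form is the same pattern. The second test string is made up:

```r
library(stringr)

pattern = "\\\\"

# The R string contains two characters (two backslashes)
nchar(pattern)
# [1] 2

# The regex engine reads those two characters as one literal backslash
str_detect(c("this\\that", "this/that"), pattern)
# [1]  TRUE FALSE

# The raw string form is the same pattern
identical(pattern, r"(\\)")
# [1] TRUE
```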

23.8.3 Anchors

By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor.

The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the pattern.

For example, suppose you want to match an a at the beginning of the string:

x = c("abc", "cab")

str_view(x, "a")

str_view(x, "^a")

It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string.

Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the pattern.
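For example, this sketch uses str_detect to test the end of string anchor, and then both anchors together:

```r
library(stringr)

x = c("abc", "cab")

# Match a c at the end of the string
str_detect(x, "c$")
# [1]  TRUE FALSE

# Using both anchors forces the pattern to match the whole string
str_detect(x, "^ab$")
# [1] FALSE FALSE
str_detect(x, "^cab$")
# [1] FALSE  TRUE
```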

23.8.4 Character Classes

In regex, square brackets [ ] denote a character class. A character class matches exactly one character, but that character can be any of the characters inside of the square brackets. The square brackets themselves don’t count as characters in the pattern.

For example, suppose you want to match c followed by either a or t:

x = c("ca", "ct", "cat", "cta")

str_view(x, "c[ta]")

You can use a dash - in a character class to create a range. For example, to match letters p through z:

str_view(x, "c[p-z]")

Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-].

Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot.

A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c:

str_view("abcdef", "[^abc]")
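Here is a sketch of a few more character classes, using made-up strings and the stringr function str_extract to show the first match as text:

```r
library(stringr)

x = c("room 12", "lab B7", "none")

# One or more digits
str_extract(x, "[0-9]+")
# [1] "12" "7"  NA

# A capital letter followed by a digit
str_detect(x, "[A-Z][0-9]")
# [1] FALSE  TRUE FALSE

# A dash at the end of the class is literal
str_extract("x-y", "[abc-]")
# [1] "-"
```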

23.8.5 Quantifiers

Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.

For example, the question mark ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional. For example:

x = c("abc", "ab", "ac", "abbc")

str_view(x, "ab?c")

The star * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all. For instance:

str_view(x, "ab*c")

The plus + quantifier means the preceding character must appear 1 or more times.
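To compare the three quantifiers side by side, this sketch tests each one against the same strings with str_detect:

```r
library(stringr)

x = c("abc", "ab", "ac", "abbc")

# ? allows zero or one b
str_detect(x, "ab?c")
# [1]  TRUE FALSE  TRUE FALSE

# * allows zero or more b
str_detect(x, "ab*c")
# [1]  TRUE FALSE  TRUE  TRUE

# + requires one or more b
str_detect(x, "ab+c")
# [1]  TRUE FALSE FALSE  TRUE
```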

Quantifiers are greedy, meaning they always match as many characters as possible. In this example, notice that the pattern matches the entire string, even though it could also match just abba:

str_view("abbabbba", ".+a")

You can add a question mark ? after another quantifier to make it non-greedy:

str_view("abbabbba", ".+?a")
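The difference is easier to compare as text. This sketch uses the stringr function str_extract, which returns the first match, on the same string and on a made-up HTML snippet:

```r
library(stringr)

# Greedy: as many characters as possible
str_extract("abbabbba", ".+a")
# [1] "abbabbba"

# Non-greedy: as few characters as possible
str_extract("abbabbba", ".+?a")
# [1] "abba"

# A common use: matching a single tag rather than everything between tags
str_extract("<b>bold</b>", "<.+>")
# [1] "<b>bold</b>"
str_extract("<b>bold</b>", "<.+?>")
# [1] "<b>"
```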

23.8.6 Groups

In regex, parentheses ( ) denote a group. The parentheses themselves don’t count as characters in the pattern. Groups are useful for repeating or extracting specific parts of a pattern (see Section 23.6).

Quantifiers can act on groups in addition to individual characters. For example, suppose you want to make the entire substring ", dogs," optional in a pattern, so that both of the test strings in this example match:

x = c("cats, dogs, and frogs", "cats and frogs")

str_view(x, "cats(, dogs,)? and frogs")
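As a final sketch, the same pattern can be used with str_detect to confirm that both strings match, and with str_match to see what the group captured. The last line uses a made-up string to show a quantifier repeating a group:

```r
library(stringr)

x = c("cats, dogs, and frogs", "cats and frogs")
regex = "cats(, dogs,)? and frogs"

str_detect(x, regex)
# [1] TRUE TRUE

# The group is NA when the optional part is absent
m = str_match(x, regex)
m[, 2]
# [1] ", dogs," NA

# A quantifier after a group repeats the whole group
str_extract("ababab!", "(ab)+")
# [1] "ababab"
```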