1 String & Date Processing

This chapter is part 1 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis. The major topics of this chapter are how to convert dates and times into appropriate R data types and how to extract and clean up data from strings (including numbers with non-numeric characters such as $, %, and ,).

Learning Objectives

After completing this session, learners should be able to:

  • Explain why we use special data structures for dates & times
  • Identify the correct data structure for a given date/time
  • Use lubridate to parse a date
  • Use the date format string mini-language
  • Use escape codes in strings to represent non-keyboard characters
  • Explain what a text encoding is
  • Use the stringr package to detect, extract, and change patterns in strings
  • Use the regular expressions mini-language

1.1 The Tidyverse

For working with dates, times, and strings, we recommend using packages from the Tidyverse, a popular collection of packages for doing data science. Compared to R’s built-in functions, we’ve found that the functions in Tidyverse packages are generally easier to learn and use. They also provide additional features and have more robust support for characters outside of the Latin alphabet.

Although they’re developed by many different members of the R community, Tidyverse packages follow a unified design philosophy, and thus have many interfaces and data structures in common. The packages provide convenient and efficient alternatives to built-in R functions for many tasks, including:

  • Reading and writing files (package readr)
  • Processing dates and times (packages lubridate, hms)
  • Processing strings (package stringr)
  • Reshaping data (package tidyr)
  • Making visualizations (package ggplot2)
  • And more

Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. As a consequence, the Tidyverse is sometimes polarizing in the R community. It’s useful to be literate in both base R and the Tidyverse, since both are popular.

One major advantage of the Tidyverse is that the packages are usually well-documented and provide lots of examples. Every package has a documentation website and the most popular ones also have cheatsheets.

1.2 Parsing Dates & Times

When working with dates and times, you might want to:

  • Use them to sort other data
  • Add or subtract an offset
  • Get components like the month, day, or hour
  • Compute derived components like the day of week or quarter
  • Compute differences

Even though this list isn’t exhaustive, it shows that there are lots of things you might want to do. In order to do them in R, you must first make sure that your dates and times are represented by appropriate data types. Most of R’s built-in functions for loading data do not automatically recognize dates and times. This section describes several data types that represent dates and times, and explains how to use R to parse—break down and convert—dates and times to these types.

1.2.1 The lubridate Package

As explained in Section 1.1, we recommend the Tidyverse packages for working with dates and times over other packages or R’s built-in functions. There are two such packages:

  • lubridate, the primary package for working with dates and times
  • hms, a package specifically for working with time-of-day values and durations

This chapter only covers lubridate, since it’s more useful in most situations. The package has detailed documentation and a cheatsheet.

You’ll have to install the package if you haven’t already, and then load it:

# install.packages("lubridate")
library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Perhaps the most common task you’ll need to do with date and time data is convert from strings to more appropriate data types. This is because R’s built-in functions for reading data from a text format, such as read.csv, read dates and times as strings. For example, here are some dates as strings:

date_strings = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982")
date_strings
## [1] "Jan 10, 2021" "Sep 3, 2018"  "Feb 28, 1982"

You can tell that these are dates, but as far as R is concerned, they’re text.

The lubridate package provides a variety of functions to automatically parse strings into date or time objects that R understands. These functions are named with one letter per component of the date or time. The order of the letters must match the order of the components in the string you want to parse.

In the example, the strings have the month (m), then the day (d), and then the year (y), so you can use the mdy function to parse them automatically:

dates = mdy(date_strings)
dates
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
class(dates)
## [1] "Date"

Notice that the dates now have class Date, one of R’s built-in classes for representing dates, and that R prints them differently. Now R recognizes that the dates are in fact dates, so they’re ready to use in an analysis.

There is a complete list of the automatic parsing functions in the lubridate documentation.
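The same naming scheme extends to other component orders. For example, strings with day-month-year or year-month-day components can be parsed with the dmy and ymd functions:

dmy("28 February 1982")
## [1] "1982-02-28"
ymd("2021/01/10")
## [1] "2021-01-10"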

Note: a relatively new package, clock, tries to solve some problems with the Date class people have identified over the years. The package is in the r-lib collection of packages, which provide low-level functionality complementary to the Tidyverse. Eventually, it may be preferable to use the classes in clock rather than the Date class, but for now, the Date class is still suitable for most tasks.

Occasionally, a date or time string may have a format that lubridate can’t parse automatically. In that case, you can use the fast_strptime function to describe the format in detail. At a minimum, the function requires two arguments: a vector of strings to parse and a format string.

The format string describes the format of the dates or times, and is based on the syntax of strptime, a function provided by many programming languages (including R) to parse date or time strings. In a format string, a percent sign % followed by a character is called a specification and has a special meaning. Here are a few of the most useful ones:

Specification Description Example (for 2015-01-29 21:32:55)
%Y 4-digit year 2015
%m 2-digit month 01
%d 2-digit day 29
%H 2-digit hour 21
%M 2-digit minute 32
%S 2-digit second 55
%% literal % %
%y 2-digit year 15
%B full month name January
%b short month name Jan

You can find a complete list in ?fast_strptime. Other characters in the format string do not have any special meaning. Write the format string so that it matches the format of the dates you want to parse.

For example, let’s try parsing an unusual time format:

time_string = "6 minutes, 32 seconds after 10 o'clock"
time = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock")
time
## [1] "0-01-01 10:06:32 UTC"
class(time)
## [1] "POSIXlt" "POSIXt"

R represents date-times with the classes POSIXlt and POSIXct. There’s no built-in class to represent times alone, which is why the result in the example above includes a date.

Internally, a POSIXlt object is a list with elements to store different date and time components. On the other hand, a POSIXct object is a single floating point number (type double) that counts seconds since the beginning of 1970. If you want to store your time data in a data frame, use POSIXct objects, since data frames don’t work well with columns of lists. You can control whether fast_strptime returns a POSIXlt or POSIXct object by setting the lt parameter to TRUE or FALSE:

time_ct = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock",
  lt = FALSE)

class(time_ct)
## [1] "POSIXct" "POSIXt"

Another common task is combining the numeric components of a date or time into a single object. You can use the make_date and make_datetime functions to do this. The parameters are named for the different components. For example:

make_date(day = 10, year = 2023, month = 1)
## [1] "2023-01-10"

These functions are vectorized, so you can use them to combine the components of many dates or times at once. They’re especially useful for reconstructing dates and times from tabular datasets where each component is stored in a separate column.
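For instance, given separate vectors of components, a single call produces a vector of dates:

years = c(2021, 2018, 1982)
months = c(1, 9, 2)
days = c(10, 3, 28)
make_date(year = years, month = months, day = days)
## [1] "2021-01-10" "2018-09-03" "1982-02-28"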

After you’ve converted your date and time data to appropriate types, you can do any of the operations listed at the beginning of this section. For example, you can use lubridate’s period function to create an offset to add to a date or time:

dates
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
dates + period(1, "month")
## [1] "2021-02-10" "2018-10-03" "1982-03-28"

You can also use lubridate functions to get or set the components. These functions usually have the same name as the component. For instance:

day(dates)
## [1] 10  3 28
month(dates)
## [1] 1 9 2
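The component functions can also appear on the left-hand side of an assignment to set a component. For example, to change the month of every date to December:

d = dates
month(d) = 12
d
## [1] "2021-12-10" "2018-12-03" "1982-12-28"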

See the lubridate documentation for even more details about what you can do.

1.2.2 Case Study: Ocean Temperatures

The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes ocean temperature data, collected by sensor buoys off the coast, on the National Data Buoy Center (NDBC) website. California also has many sensors collecting ocean temperature data that are not administered by the federal government. Data from these sensors are published on the California Ocean Observing Systems (CALOOS) Data Portal.

Suppose you’re a researcher who wants to combine ocean temperature data from both sources to use in R. Both publish the data in comma-separated value (CSV) format, but record dates, times, and temperatures differently. Thus you need to be careful that the dates and times are parsed correctly.

Download these two 2021 datasets:

  1. 2021_noaa-ndbc_46013.txt, from NOAA buoy 46013, off the coast of Bodega Bay (DOWNLOAD)(source)
  2. 2021_ucdavis_bml_wts.csv, from the UC Davis Bodega Bay Marine Laboratory’s sensors (DOWNLOAD)(source)

The NOAA data has a fixed-width format, which means each column has a fixed width in characters over all rows. The readr package provides a function read_fwf that can automatically guess the column widths and read the data into a data frame. The column names appear in the first row and column units appear in the second row, so read those rows separately:

# install.packages("readr")
library("readr")

noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt"
noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1)
## Rows: 2 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 
## chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
noaa = read_fwf(noaa_path, skip = 2)
## Rows: 3323 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 
## chr  (4): X2, X3, X4, X5
## dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(noaa) = as.character(noaa_headers[1, ])
names(noaa)[1] = "YY"

The dates and times for the observations are separated into component columns, and the read_fwf function does not convert some of these to numbers automatically. You can use as.numeric to convert them to numbers:

cols = 2:5
noaa[cols] = lapply(noaa[cols], as.numeric)

Finally, use the make_datetime function to combine the components into date-time objects:

noaa_dt = make_datetime(year = noaa$YY, month = noaa$MM, day = noaa$DD,
  hour = noaa$hh, min = noaa$mm)
noaa$date = noaa_dt
head(noaa_dt)
## [1] "2021-01-01 00:00:00 UTC" "2021-01-01 00:10:00 UTC"
## [3] "2021-01-01 00:20:00 UTC" "2021-01-01 00:30:00 UTC"
## [5] "2021-01-01 00:40:00 UTC" "2021-01-01 00:50:00 UTC"

That takes care of the dates in the NOAA data.

The Bodega Marine Lab data is CSV format, which you can read with read.csv or the readr package’s read_csv function. The latter is faster and usually better at guessing column types. The column names appear in the first row and the column units appear in the second row. The read_csv function handles the names automatically, but you’ll have to remove the unit row as a separate step:

bml = read_csv("data/ocean_data/2021_ucdavis_bml_wts.csv")
## Rows: 87283 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): time, sea_water_temperature, z
## dbl (1): sea_water_temperature_qc_agg
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bml = bml[-1, ]

The dates and times of the observations were loaded as strings. You can use lubridate’s ymd_hms function to automatically parse them:

bml_dt = ymd_hms(bml$time)
bml$date = bml_dt
head(bml_dt)
## [1] "2020-12-31 09:06:00 UTC" "2020-12-31 09:12:00 UTC"
## [3] "2020-12-31 09:18:00 UTC" "2020-12-31 09:24:00 UTC"
## [5] "2020-12-31 09:30:00 UTC" "2020-12-31 09:36:00 UTC"

Now you have date and time objects for both datasets, so you can combine the two. For example, you could extract the date and water temperature columns from each, create a new column identifying the data source, and then row-bind the datasets together.
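As a sketch of that last step (assuming the NOAA water temperature column is named WTMP, as in NDBC files—check names(noaa) to confirm):

noaa_sub = data.frame(date = noaa$date,
  temp = as.numeric(noaa$WTMP), source = "NOAA")
bml_sub = data.frame(date = bml$date,
  temp = as.numeric(bml$sea_water_temperature), source = "BML")
combined = rbind(noaa_sub, bml_sub)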

1.3 String Fundamentals

Strings represent text, but even if your datasets are composed entirely of numbers, you’ll need to know how to work with strings. Text formats for data are widespread: comma-separated values (CSV), tab-separated values (TSV), JavaScript object notation (JSON), a panoply of markup languages (HTML, XML, YAML, TOML), and more. When you read data in these formats into R, sometimes R will correctly convert the values to appropriate non-string types. The rest of the time, you need to know how to work with strings so that you can fix whatever went wrong and convert the data yourself.

This section introduces several fundamental concepts related to working with strings. The next section, Section 1.4.1, describes the stringr package for working with strings. The last section, Section 1.4.2, builds on both and explains how to do powerful pattern matching.

1.3.1 Printing

There are two different ways to print strings: you can print a representation of the characters in the string or you can print the actual characters in the string.

To print a representation of the characters in a string, use the print function. The representation is useful to identify characters that are not normally visible, such as tabs and the characters that mark the end of a line.

To print the actual characters in a string, use the message function.

This important difference in how the print and message functions print strings is demonstrated in the next section.

You can learn more about different ways to print output in R by reading Section 3.5.

1.3.2 Escape Sequences

In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:

  1. Write quotes or backslashes within a string
  2. Write characters that don’t appear on your keyboard (for example, characters in a foreign language)

For example, the escape sequence \n corresponds to the newline character. Notice that the message function translates \n into a literal new line, whereas the print function doesn’t:

x = "Hello\nNick"

message(x)
## Hello
## Nick
print(x)
## [1] "Hello\nNick"

As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string:

x = 'She said, "Hi"'

message(x)
## She said, "Hi"
y = "She said, \"Hi\""

message(y)
## She said, "Hi"

Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:

x = "\\"

message(x)
## \

There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R.

1.3.3 Raw Strings

A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions (covered in Section 1.4.2).

Raw strings begin with r" and an opening delimiter (, [, or {. Raw strings end with a matching closing delimiter and quote. For example:

x = r"(quotes " and backslashes \)"

message(x)
## quotes " and backslashes \

Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.

1.3.4 Character Encodings

Computers store data as numbers. In order to store text on a computer, people have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97.

Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the late 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).

Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that implement Unicode. The most popular of these is UTF-8.

In R, you can write Unicode characters with the escape sequence \U followed by the number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So you can write an a as:

x = "\U61" # or "\u61"

x
## [1] "a"

Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So the string "\U1f408" is the cat emoji.

Note that being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The Nerd Fonts project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji.

1.3.4.1 Character Encodings in Text Files

Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.

Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:

read.csv("my_data.csv", fileEncoding = "UTF-8")

Other reader functions may use a different parameter to set the encoding, so always check the documentation. On computers where the native language is not set to English, it can also help to set R’s locale to English with Sys.setlocale(locale = "English") (this locale name works on Windows; Unix-like systems use names such as "en_US.UTF-8").

Encoding problems when writing a file are slightly more complicated to fix. See this blog post for thorough explanation.

1.4 Processing Strings

String processing encompasses a variety of tasks such as searching for patterns within strings, extracting data from within strings, splitting strings into component parts, and removing or replacing unwanted characters (excess whitespace, punctuation, and so on). If you work with data, sooner or later you’ll run into a dataset in text format that needs a few text corrections before or after you read it into R, and for that you’ll find familiarity with string processing invaluable.

1.4.1 The stringr Package

Although R has built-in functions for string processing, we recommend using the stringr package for all of your string processing needs. The package is part of the Tidyverse, a collection of packages introduced in Section 1.1. Major advantages of stringr over other packages and R’s built-in functions include:

  • Correctness: the package builds on International Components for Unicode (ICU), the Unicode Consortium’s own library for handling text encodings
  • Discoverability: every function’s name begins with str_ so they’re easy to discover, remember, and identify in code
  • Interface consistency: the first argument is always the string to process, the second argument is always the pattern to match (if applicable)
  • Vectorization: most of the functions are vectorized in the first and second argument

stringr has detailed documentation and also a cheatsheet.

The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function:

# install.packages("stringr")
library(stringr)

The typical syntax of a stringr function is:

str_name(string, pattern, ...)

Where:

  • name describes what the function does
  • string is a string to search within or transform
  • pattern is a pattern to search for, if applicable
  • ... is additional, function-specific arguments

For example, the str_detect function detects whether a pattern appears within a string. The function returns TRUE if the pattern is found and FALSE if it isn’t:

str_detect("hello", "el")
## [1] TRUE
str_detect("hello", "ol")
## [1] FALSE

Most of the stringr functions are vectorized in the string parameter:

str_detect(c("hello", "goodbye", "lo"), "lo")
## [1]  TRUE FALSE  TRUE
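The pattern parameter is vectorized as well, so you can pair each string with its own pattern:

str_detect(c("hello", "goodbye"), c("ell", "bye"))
## [1] TRUE TRUE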

As another example, the str_sub function extracts a substring from a string, given the substring’s position. The first argument is the string, the second is the position of the substring’s first character, and the third is the position of the substring’s last character:

str_sub("You speak of destiny as if it was fixed.", 5, 9)
## [1] "speak"

The str_sub function is especially useful for extracting data from strings that have a fixed width (although the readr package’s read_fwf is usually a better choice if you have a fixed-width file).

There are a lot of stringr functions. Five that are especially important and are explained in this reader are:

  • str_detect, to test whether a string contains a pattern
  • str_sub, to extract a substring at a given position from a string
  • str_replace, to replace or remove parts of a string
  • str_split_fixed, to split a string into parts
  • str_match, to extract data from a string

You can find a complete list of functions with examples on the stringr documentation’s reference page and the cheatsheet.

1.4.2 Regular Expressions

The stringr functions use a special language called regular expressions or regex to describe patterns in strings. Many other programming languages also have string processing tools that use regular expressions, so fluency with regular expressions is a transferable skill.

You can use a regular expression to describe a complicated pattern in just a few characters because some characters, called metacharacters, have special meanings. Metacharacters are usually punctuation characters. They are never letters or numbers, which always have their literal meaning.

This table lists some of the most useful metacharacters:

Metacharacter Meaning
. any one character (wildcard)
\ escape character (in both R and regex), see Section 1.3.2
^ the beginning of string (not a character)
$ the end of string (not a character)
[ab] one character, either 'a' or 'b'
[^ab] one character, anything except 'a' or 'b'
? the previous character appears 0 or 1 times
* the previous character appears 0 or more times
+ the previous character appears 1 or more times
() make a group
| match left OR right side (not a character)

Section 1.5 provides examples of how most of the metacharacters work. Even more examples are presented in the stringr package’s regular expressions vignette. You can find a complete listing of regex metacharacters in ?regex or on the stringr cheatsheet.
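To get a feel for a few of these metacharacters, here are some quick examples with str_detect (assuming stringr is loaded):

str_detect(c("cat", "cot", "ct"), "c.t")
## [1]  TRUE  TRUE FALSE
str_detect(c("color", "colour"), "colou?r")
## [1] TRUE TRUE
str_detect(c("banana", "urban"), "^ban")
## [1]  TRUE FALSE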

You can disable regular expressions in a stringr function by calling the fixed function on the pattern. For example, to test whether a string contains a literal dot .:

x = c("No dot", "Lotsa dots...")
str_detect(x, fixed("."))
## [1] FALSE  TRUE

It’s a good idea to call fixed on any pattern that doesn’t contain regex metacharacters, because it communicates to the reader that you’re not using regex, it helps to prevent bugs, and it provides a small speed boost.

1.4.3 Replacing Parts of Strings

Replacing part of a string is a common string processing task. For instance, quantitative data often contain non-numeric characters such as commas, currency symbols, and percent signs. These must be removed before converting to numeric data types. Replacement and removal go hand-in-hand, since removal is equivalent to replacing part of a string with the empty string "".

The str_replace function replaces the first part of a string that matches a pattern (from left to right), while the related str_replace_all function replaces every part of a string that matches a pattern. Most stringr functions that do pattern matching come in a pair like this: one to process only the first match and one to process every match.

As an example, suppose you want to remove commas from a number so that you can convert it with as.numeric, which returns NA for numbers that contain commas. You want to remove all of the commas, so str_replace_all is the function to use. As usual, the first argument is the string and the second is the pattern. The third argument is the replacement, which is the empty string "" in this case:

x = "1,000,000"
str_replace_all(x, ",", "")
## [1] "1000000"

The str_replace function doesn’t work as well for this task, since it only replaces the first match to the pattern:

str_replace(x, ",", "")
## [1] "1000,000"

You can also use these functions to replace or remove longer patterns within words. For instance, suppose you want to change the word "dog" to "cat":

x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")
## [1] "cats are great, dogs are fun" "cats are fluffy"
str_replace_all(x, "dog", "cat")
## [1] "cats are great, cats are fun" "cats are fluffy"

As a final example, you can use the replacement functions and a regex pattern to replace repeated spaces with a single space. This is a good standardization step if you’re working with text. The key is to use the regex quantifier +, which means a character “repeats one or more times” in the pattern, and to use a single space " " as the replacement:

x = "This    sentence  has  extra      space."
str_replace_all(x, " +", " ")
## [1] "This sentence has extra space."

If you just want to trim (remove) all whitespace from the beginning and end of a string, you can use the str_trim function instead.
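For example, str_trim removes only the whitespace at the ends, leaving interior spaces alone:

str_trim("   lots of    space   ")
## [1] "lots of    space"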

1.4.4 Splitting Strings

In text, distinct values are generally separated by a character such as a space or a comma, to make them easy for people to read. Often these separators also make the data easy for R to parse. The idea is to split the string into a separate value at each separator.

The str_split function splits a string at each match to a pattern. The matching characters—that is, the separators—are discarded.

For example, suppose you want to split several numbers separated by commas and spaces:

x = "21, 32.3, 5, 64"
result = str_split(x, ", ")
result
## [[1]]
## [1] "21"   "32.3" "5"    "64"

The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. You can get the first element with:

result[[1]]
## [1] "21"   "32.3" "5"    "64"

You then convert the values with as.numeric.
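For example:

as.numeric(result[[1]])
## [1] 21.0 32.3  5.0 64.0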

To see why the str_split function always returns a list, consider what happens if you try to split two different strings at once:

x = c(x, "10, 15, 1.3")
result = str_split(x, ", ")
result
## [[1]]
## [1] "21"   "32.3" "5"    "64"  
## 
## [[2]]
## [1] "10"  "15"  "1.3"

Each string has a different number of parts, so the vectors in the result have different lengths, and a list is the only structure that can store them.

You can also use the str_split function to split a sentence into words. Use spaces for the split:

x = "The students in this workshop are great!"
str_split(x, " ")
## [[1]]
## [1] "The"      "students" "in"       "this"     "workshop" "are"      "great!"

When you know exactly how many parts you expect a string to have, use the str_split_fixed function instead of str_split. It accepts a third argument for the maximum number of splits to make. Because the number of splits is fixed, the function can return the result in a matrix instead of a list. For example:

x = c("1, 2, 3", "10, 20, 30")
str_split_fixed(x, ", ", 3)
##      [,1] [,2] [,3]
## [1,] "1"  "2"  "3" 
## [2,] "10" "20" "30"

The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result.

For example, suppose you want to get the area codes from some phone numbers:

phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)

result[, 1]
## [1] "717" "629" "903"

1.4.5 Extracting Matches

Occasionally, you might need to extract parts of a string in a more complicated way than string splitting allows. One solution is to write a regular expression that will match all of the data you want to capture, with parentheses ( ), the regex metacharacter for a group, around each distinct value. Then you can use the str_match function to extract the groups. Section 1.5.6 presents some examples of regex groups.

For example, suppose you want to split an email address into three parts: the user name, the domain name, and the top-level domain. To create a regular expression that matches email addresses, you can use the @ and . in the address as anchors. The surrounding characters are generally alphanumeric, which you can represent with the “word” metacharacter \w:

\w+@\w+[.]\w+

Next, put parentheses ( ) around each part that you want to extract:

(\w+)@(\w+)[.](\w+)

Finally, use this pattern in str_match, adding extra backslashes so that everything is escaped correctly:

x = "datalab@ucdavis.edu"
regex = "(\\w+)@(\\w+)[.](\\w+)"
str_match(x, regex)
##      [,1]                  [,2]      [,3]      [,4] 
## [1,] "datalab@ucdavis.edu" "datalab" "ucdavis" "edu"

The function extracts the overall match to the pattern, as well as the match to each group. The pattern in this example doesn’t work for all possible email addresses, since user names can contain dots and other characters that are not alphanumeric. You could generalize the pattern if necessary. The point is that the str_match function and groups provide an extremely flexible way to extract data from strings.
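For instance, one possible (still not fully general) refinement allows dots, plus signs, and hyphens in the user name, and dots in the domain:

x = "first.last+news@mail.ucdavis.edu"
regex = "([\\w.+-]+)@([\\w.-]+)[.](\\w+)"
str_match(x, regex)
##      [,1]                               [,2]              [,3]           [,4] 
## [1,] "first.last+news@mail.ucdavis.edu" "first.last+news" "mail.ucdavis" "edu"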

1.4.6 Case Study: U.S. Warehouse Stocks

The U.S. Department of Agriculture (USDA) publishes a variety of datasets online, particularly through its National Agricultural Statistics Service (NASS). Unfortunately, most of these are published in PDF or semi-structured text format, which makes reading the data into R or other statistical software a challenge.

The USDA NASS posts monthly reports about stocks of agricultural products in refrigerated warehouses. In this case study, you’ll use string processing functions to extract a table of data from the December 2022 report.

To begin, download the report and save it somewhere on your computer. Then open the file in a text editor (or RStudio) to inspect it. The goal is to extract the first table, about “Nuts, Dairy Products, Frozen Eggs, and Frozen Poultry,” from the report.

The report is a semi-structured mix of natural language text and fixed-width tables. As a consequence, most functions for reading tabular data will not work well on the entire report. You could try to use a function for reading fixed-width data, such as read.fwf or the readr package’s read_fwf on only the lines containing a table. Another approach, which is shown here, is to use string processing functions to find and extract the table.

The readLines function reads a text file into a character vector with one element for each line. This makes the function useful for reading unstructured or semi-structured text. Use the function to read the report:

report = readLines("data/cost1222.txt")
head(report)
## [1] ""                                                                            
## [2] "Cold Storage"                                                                
## [3] ""                                                                            
## [4] "ISSN: 1948-903X"                                                             
## [5] ""                                                                            
## [6] "Released December 22, 2022, by the National Agricultural Statistics Service "

In the report, tables always begin and end with lines that contain only dashes -. By locating these all-dash lines, you can locate the tables. Like str_detect, the str_which function tests whether strings in a vector match a pattern. The only difference is that str_which returns the indexes of the strings that matched (as if you had called which) rather than a logical vector. Use str_which to find the all-dash lines:

# The regex means:
#   ^  beginning of string
#   -+ one or more dashes
#   $  end of string

dashes = str_which(report, "^-+$")
head(report[dashes], 2)
## [1] "--------------------------------------------------------------------------------------------------------------------------"
## [2] "--------------------------------------------------------------------------------------------------------------------------"

Each table is delimited by three all-dash lines: one at the top, one separating the header from the body, and one at the bottom. The header and body of the first table are:

report[dashes[1]:dashes[2]]
## [1] "--------------------------------------------------------------------------------------------------------------------------"
## [2] "                                      :             :                           :     November 30, 2022     :   Public    "
## [3] "                                      :        Stocks in all warehouses         :      as a percent of      :  warehouse  "
## [4] "                                      :             :                           :                           :   stocks    "
## [5] "               Commodity              :-----------------------------------------------------------------------------------"
## [6] "                                      :November 30, : October 31, :November 30, :November 30, : October 31, :November 30, "
## [7] "                                      :    2021     :    2022     :    2022     :    2021     :    2022     :    2022     "
## [8] "--------------------------------------------------------------------------------------------------------------------------"
bod = report[dashes[2]:dashes[3]]
head(bod)
## [1] "--------------------------------------------------------------------------------------------------------------------------"
## [2] "                                      :  ------------ 1,000 pounds -----------        ---- percent ----      1,000 pounds "
## [3] "                                      :                                                                                   "
## [4] "Nuts                                  :                                                                                   "
## [5] "Shelled                               :                                                                                   "
## [6] "  Pecans .............................:     30,906        38,577        34,489        112            89                   "

The columns have fixed widths, so extracting the columns is relatively easy with str_sub if you can get the offsets. In the last line of the header, the columns are separated by colons :. Thus you can use the str_locate_all function, which returns the locations of a pattern in a string, to get the offsets:

# The regex means:
#   [^:]+  one or more characters, excluding colons
#   (:|$)  a colon or the end of the line

cols = str_locate_all(report[dashes[2] - 1], "[^:]+(:|$)")
# Like str_split, str_locate_all returns a list
cols = cols[[1]]
cols
##      start end
## [1,]     1  39
## [2,]    40  53
## [3,]    54  67
## [4,]    68  81
## [5,]    82  95
## [6,]    96 109
## [7,]   110 122

You can use these offsets with str_sub to break a line in the body of the table into columns:

str_sub(bod[6], cols)
## [1] "  Pecans .............................:"
## [2] "     30,906   "                         
## [3] "     38,577   "                         
## [4] "     34,489   "                         
## [5] "     112      "                         
## [6] "      89      "                         
## [7] "             "

Because of the way str_sub is vectorized, you can’t process every line in the body of the table in one vectorized call. Instead, you can use sapply to call str_sub on each line:

# Set USE.NAMES to make the table easier to read
tab = sapply(bod, str_sub, cols, USE.NAMES = FALSE)
# sapply returns one column per input line, so transpose the result
tab = t(tab)
head(tab)
##      [,1]                                      [,2]            
## [1,] "---------------------------------------" "--------------"
## [2,] "                                      :" "  ------------"
## [3,] "                                      :" "              "
## [4,] "Nuts                                  :" "              "
## [5,] "Shelled                               :" "              "
## [6,] "  Pecans .............................:" "     30,906   "
##      [,3]             [,4]             [,5]             [,6]            
## [1,] "--------------" "--------------" "--------------" "--------------"
## [2,] " 1,000 pounds " "-----------   " "     ---- perc" "ent ----      "
## [3,] "              " "              " "              " "              "
## [4,] "              " "              " "              " "              "
## [5,] "              " "              " "              " "              "
## [6,] "     38,577   " "     34,489   " "     112      " "      89      "
##      [,7]           
## [1,] "-------------"
## [2,] "1,000 pounds "
## [3,] "             "
## [4,] "             "
## [5,] "             "
## [6,] "             "

The columns still contain undesirable punctuation and whitespace, but you can remove these with str_replace_all and str_trim. Since the table is a matrix, it’s necessary to use apply to process it column-by-column:

# The regex means:
#   ,     a comma
#   |     OR
#   [.]*  zero or more literal dots
#   :     a colon
#   $     the end of the line

tab = apply(tab, 2, function(col) {
  col = str_replace_all(col, ",|[.]*:$", "")
  str_trim(col)
})
head(tab)
##      [,1]                                      [,2]            
## [1,] "---------------------------------------" "--------------"
## [2,] ""                                        "------------"  
## [3,] ""                                        ""              
## [4,] "Nuts"                                    ""              
## [5,] "Shelled"                                 ""              
## [6,] "Pecans"                                  "30906"         
##      [,3]             [,4]             [,5]             [,6]            
## [1,] "--------------" "--------------" "--------------" "--------------"
## [2,] "1000 pounds"    "-----------"    "---- perc"      "ent ----"      
## [3,] ""               ""               ""               ""              
## [4,] ""               ""               ""               ""              
## [5,] ""               ""               ""               ""              
## [6,] "38577"          "34489"          "112"            "89"            
##      [,7]           
## [1,] "-------------"
## [2,] "1000 pounds"  
## [3,] ""             
## [4,] ""             
## [5,] ""             
## [6,] ""

The first few rows and the last row can be removed, since they don’t contain data. Then you can convert the table to a data frame and convert the individual columns to appropriate data types:

tab = tab[-c(1:3, nrow(tab)), ]
tab = data.frame(tab)
tab[2:7] = lapply(tab[2:7], as.numeric)
head(tab, 10)
##                X1     X2     X3     X4  X5  X6     X7
## 1            Nuts     NA     NA     NA  NA  NA     NA
## 2         Shelled     NA     NA     NA  NA  NA     NA
## 3          Pecans  30906  38577  34489 112  89     NA
## 4        In-Shell     NA     NA     NA  NA  NA     NA
## 5          Pecans  63788  44339  47638  75 107     NA
## 6                     NA     NA     NA  NA  NA     NA
## 7  Dairy products     NA     NA     NA  NA  NA     NA
## 8          Butter 210473 239658 199695  95  83 188566
## 9  Natural cheese     NA     NA     NA  NA  NA     NA
## 10       American 834775 831213 815655  98  98     NA

The data frame is now sufficiently clean that you could use it for a simple analysis. Of course, there are many things you could do to improve the extracted data frame, such as identifying categories and subcategories in the first column, removing rows that are completely empty, and adding column names. These entail more string processing and data frame manipulation—if you want to practice your R skills, try doing them on your own.
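As a sketch of one such cleanup step, here is how you might drop completely empty rows and add column names. The example uses a small stand-in for the extracted data frame, and the column names are hypothetical; match them against the report's header before using them:

```r
# A small stand-in for the extracted table
tab = data.frame(
  X1 = c("Nuts", "Pecans", ""),
  X2 = c(NA, 30906, NA),
  X3 = c(NA, 38577, NA)
)

# Drop rows with a blank label and no values in the numeric columns
empty = tab$X1 == "" & rowSums(!is.na(tab[-1])) == 0
tab = tab[!empty, ]

# Add descriptive column names (hypothetical; check against the report)
names(tab) = c("commodity", "stocks_nov_2021", "stocks_oct_2022")
```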

1.5 Regular Expression Examples

This section provides examples of several different regular expression metacharacters and other features. Most of the examples use the str_view function, which is especially helpful for testing regular expressions. The function displays an HTML-rendered version of the string with the first match highlighted.

The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.

1.5.1 The Wildcard

The regex wildcard character is . and matches any single character. For example:

x = "dog"
str_view(x, "d.g")
## [1] │ <dog>

By default, regex searches from left to right:

str_view(x, ".")
## [1] │ <d><o><g>

1.5.2 Escape Sequences

Like R, regular expressions can contain escape sequences that begin with a backslash. These are processed separately from R's escape sequences, after the R string itself has been parsed. The main use for escape sequences in regex is to turn a metacharacter into a literal character.

For example, suppose you want to match a literal dot .. The regex for a literal dot is \.. Since backslashes in R strings have to be escaped, the R string for this regex is "\\.". For example:

str_view("this.string", "\\.")
## [1] │ this<.>string

The double backslash can be confusing, and it gets worse if you want to match a literal backslash. You have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:

str_view("this\\that", "\\\\")
## [1] │ this<\>that

Raw strings (see Section 1.3.3) make regular expressions easier to read, because they make backslashes literal (but they still mark the beginning of an escape sequence in regex). You can use a raw string to write the above as:

str_view(r"(this\that)", r"(\\)")
## [1] │ this<\>that

1.5.3 Anchors

By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor.

The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the pattern.

For example, suppose you want to match an a at the beginning of the string:

x = c("abc", "cab")

str_view(x, "a")
## [1] │ <a>bc
## [2] │ c<a>b
str_view(x, "^a")
## [1] │ <a>bc

It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string.

Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the pattern.
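For example, to match an a only when it appears at the end of the string:

```r
x = c("abc", "cba")

str_view(x, "a$")
## [2] │ cb<a>
```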

1.5.4 Character Classes

In regex, square brackets [ ] denote a character class. A character class matches exactly one character, but that character can be any of the characters inside of the square brackets. The square brackets themselves don’t count as characters in the pattern.

For example, suppose you want to match c followed by either a or t:

x = c("ca", "ct", "cat", "cta")

str_view(x, "c[ta]")
## [1] │ <ca>
## [2] │ <ct>
## [3] │ <ca>t
## [4] │ <ct>a

You can use a dash - in a character class to create a range. For example, to match c followed by any letter from p through z:

str_view(x, "c[p-z]")
## [2] │ <ct>
## [4] │ <ct>a

Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-].

Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot.
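For example, a character class can combine a range with literal metacharacters. Here [.-] matches a literal dot or a literal dash between two digits (the dash is literal because it comes at the end of the class):

```r
x = c("3-4", "3.4", "34")

str_view(x, "[0-9][.-][0-9]")
## [1] │ <3-4>
## [2] │ <3.4>
```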

A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c:

str_view("abcdef", "[^abc]")
## [1] │ abc<d><e><f>

1.5.5 Quantifiers

Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.

For example, the question mark ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional. For example:

x = c("abc", "ab", "ac", "abbc")

str_view(x, "ab?c")
## [1] │ <abc>
## [3] │ <ac>

The star * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all. For instance:

str_view(x, "ab*c")
## [1] │ <abc>
## [3] │ <ac>
## [4] │ <abbc>

The plus + quantifier means the preceding character must appear 1 or more times.
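With the same test strings as above, ab+c requires at least one b between the a and the c:

```r
x = c("abc", "ab", "ac", "abbc")

str_view(x, "ab+c")
## [1] │ <abc>
## [4] │ <abbc>
```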

Quantifiers are greedy, meaning they always match as many characters as possible. In this example, notice that the pattern matches the entire string, even though it could also match just abba:

str_view("abbabbba", ".+a")
## [1] │ <abbabbba>

You can add a question mark ? after another quantifier to make it non-greedy:

str_view("abbabbba", ".+?a")
## [1] │ <abba><bbba>

1.5.6 Groups

In regex, parentheses ( ) denote a group. The parentheses themselves don’t count as characters in the pattern. Groups are useful for repeating or extracting specific parts of a pattern (see Section 1.4.5).

Quantifiers can act on groups in addition to individual characters. For example, suppose you want to make the entire substring ", dogs," optional in a pattern, so that both of the test strings in this example match:

x = c("cats, dogs, and frogs", "cats and frogs")

str_view(x, "cats(, dogs,)? and frogs")
## [1] │ <cats, dogs, and frogs>
## [2] │ <cats and frogs>
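Quantifiers can also repeat a group. For example, (ha)+ matches one or more consecutive repetitions of ha:

```r
x = c("ha", "hahaha", "hah")

str_view(x, "(ha)+")
## [1] │ <ha>
## [2] │ <hahaha>
## [3] │ <ha>h
```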