1 String & Date Processing
This chapter is part 1 (of 2) of Cleaning & Reshaping Data, a workshop series
about how to prepare data for analysis. The major topics of this chapter are
how to convert dates and times into appropriate R data types and how to extract
and clean up data from strings (including numbers with non-numeric characters
such as `$`, `%`, and `,`).
Learning Objectives
After completing this session, learners should be able to:
- Explain why we use special data structures for dates & times
- Identify the correct data structure for a given date/time
- Use lubridate to parse a date
- Use the date format string mini-language
- Use escape codes in strings to represent non-keyboard characters
- Explain what a text encoding is
- Use the stringr package to detect, extract, and change patterns in strings
- Use the regular expressions mini-language
1.1 The Tidyverse
For working with dates, times, and strings, we recommend using packages from the Tidyverse, a popular collection of packages for doing data science. Compared to R’s built-in functions, we’ve found that the functions in Tidyverse packages are generally easier to learn and use. They also provide additional features and have more robust support for characters outside of the Latin alphabet.
Although they’re developed by many different members of the R community, Tidyverse packages follow a unified design philosophy, and thus have many interfaces and data structures in common. The packages provide convenient and efficient alternatives to built-in R functions for many tasks, including:
- Reading and writing files (package readr)
- Processing dates and times (packages lubridate, hms)
- Processing strings (package stringr)
- Reshaping data (package tidyr)
- Making visualizations (package ggplot2)
- And more
Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. As a consequence, the Tidyverse is sometimes polarizing in the R community. It’s useful to be literate in both base R and the Tidyverse, since both are popular.
One major advantage of the Tidyverse is that the packages are usually well-documented and provide lots of examples. Every package has a documentation website and the most popular ones also have cheatsheets.
1.2 Parsing Dates & Times
When working with dates and times, you might want to:
- Use them to sort other data
- Add or subtract an offset
- Get components like the month, day, or hour
- Compute derived components like the day of week or quarter
- Compute differences
Even though this list isn’t exhaustive, it shows that there are lots of things you might want to do. In order to do them in R, you must first make sure that your dates and times are represented by appropriate data types. Most of R’s built-in functions for loading data do not automatically recognize dates and times. This section describes several data types that represent dates and times, and explains how to use R to parse—break down and convert—dates and times to these types.
1.2.1 The lubridate Package
As explained in Section 1.1, we recommend the Tidyverse packages for working with dates and times over other packages or R’s built-in functions. There are two:
- lubridate, the primary package for working with dates and times
- hms, a package specifically for working with time-of-day values
This chapter only covers lubridate, since it’s more useful in most situations. The package has detailed documentation and a cheatsheet.
You’ll have to install the package if you haven’t already, and then load it:
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Perhaps the most common task you’ll need to do with date and time data is
convert from strings to more appropriate data types. This is because R’s
built-in functions for reading data from a text format, such as `read.csv`,
read dates and times as strings. For example, here are some dates as strings:
## [1] "Jan 10, 2021" "Sep 3, 2018" "Feb 28, 1982"
You can tell that these are dates, but as far as R is concerned, they’re text.
The lubridate package provides a variety of functions to automatically parse strings into date or time objects that R understands. These functions are named with one letter per component of the date or time. The order of the letters must match the order of the components in the string you want to parse.
In the example, the strings have the month (`m`), then the day (`d`), and then
the year (`y`), so you can use the `mdy` function to parse them automatically:
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
## [1] "Date"
Notice that the dates now have class `Date`, one of R’s built-in classes for
representing dates, and that R prints them differently. Now R recognizes that
the dates are in fact dates, so they’re ready to use in an analysis.
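Putting the pieces together, a minimal sketch of this parsing step (the variable name `dates` is illustrative):

```r
library("lubridate")

# Parse month-day-year strings into Date objects
dates = mdy(c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982"))
class(dates)  # "Date"
```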
There is a complete list of the automatic parsing functions in the lubridate documentation.
Note: a relatively new package, clock, tries to solve some problems with the
`Date` class that people have identified over the years. The package is in the
r-lib collection of packages, which provide low-level functionality
complementary to the Tidyverse. Eventually, it may be preferable to use the
classes in clock rather than the `Date` class, but for now, the `Date` class is
still suitable for most tasks.
Occasionally, a date or time string may have a format that lubridate can’t
parse automatically. In that case, you can use the `fast_strptime` function to
describe the format in detail. At a minimum, the function requires two
arguments: a vector of strings to parse and a format string.
The format string describes the format of the dates or times, and is based on
the syntax of `strptime`, a function provided by many programming languages
(including R) to parse date or time strings. In a format string, a percent sign
`%` followed by a character is called a specification and has a special
meaning. Here are a few of the most useful ones:
| Specification | Description | 2015-01-29 21:32:55 |
|---|---|---|
| `%Y` | 4-digit year | 2015 |
| `%m` | 2-digit month | 01 |
| `%d` | 2-digit day | 29 |
| `%H` | 2-digit hour | 21 |
| `%M` | 2-digit minute | 32 |
| `%S` | 2-digit second | 55 |
| `%%` | literal % | % |
| `%y` | 2-digit year | 15 |
| `%B` | full month name | January |
| `%b` | short month name | Jan |
You can find a complete list in `?fast_strptime`. Other characters in the
format string do not have any special meaning. Write the format string so that
it matches the format of the dates you want to parse.
For example, let’s try parsing an unusual time format:
time_string = "6 minutes, 32 seconds after 10 o'clock"
time = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock")
time
## [1] "0-01-01 10:06:32 UTC"
## [1] "POSIXlt" "POSIXt"
R represents date-times with the classes `POSIXlt` and `POSIXct`. There’s no
built-in class to represent times alone, which is why the result in the example
above includes a date.
Internally, a `POSIXlt` object is a list with elements to store different date
and time components. On the other hand, a `POSIXct` object is a single floating
point number (type `double`). If you want to store your time data in a data
frame, use `POSIXct` objects, since data frames don’t work well with columns of
lists. You can control whether `fast_strptime` returns a `POSIXlt` or `POSIXct`
object by setting the `lt` parameter to `TRUE` or `FALSE`:
time_ct = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock",
lt = FALSE)
class(time_ct)
## [1] "POSIXct" "POSIXt"
Another common task is combining the numeric components of a date or time into
a single object. You can use the `make_date` and `make_datetime` functions to
do this. The parameters are named for the different components. For example:
## [1] "2023-01-10"
These functions are vectorized, so you can use them to combine the components of many dates or times at once. They’re especially useful for reconstructing dates and times from tabular datasets where each component is stored in a separate column.
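As a sketch of how these functions are called (the component values here are illustrative):

```r
library("lubridate")

# Combine numeric components into a date and a date-time
make_date(year = 2023, month = 1, day = 10)
make_datetime(year = 2023, month = 1, day = 10, hour = 9, min = 30, sec = 0)
```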
After you’ve converted your date and time data to appropriate types, you can do
any of the operations listed at the beginning of this section. For example, you
can use lubridate’s `period` function to create an offset to add to a date or
time:
## [1] "2021-01-10" "2018-09-03" "1982-02-28"
## [1] "2021-02-10" "2018-10-03" "1982-03-28"
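A sketch of the offset computation shown above, reusing the example dates from earlier in this section:

```r
library("lubridate")

dates = mdy(c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982"))
# Add a one-month offset to each date
dates + period(1, "month")
```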
You can also use lubridate functions to get or set the components. These functions usually have the same name as the component. For instance:
## [1] 10 3 28
## [1] 1 9 2
See the lubridate documentation for even more details about what you can do.
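For instance, the component getters could be used like this (a sketch, reusing the example dates):

```r
library("lubridate")

dates = mdy(c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982"))
day(dates)    # day of the month for each date
month(dates)  # month number for each date
```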
1.2.2 Case Study: Ocean Temperatures
The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes ocean temperature data collected by sensor buoys off the coast on the National Data Buoy Center (NDBC) website. California also has many sensors collecting ocean temperature data that are not administered by the federal government. Data from these is published on the California Ocean Observing Systems (CALOOS) Data Portal.
Suppose you’re a researcher who wants to combine ocean temperature data from both sources to use in R. Both publish the data in comma-separated value (CSV) format, but record dates, times, and temperatures differently. Thus you need to be careful that the dates and times are parsed correctly.
Download these two 2021 datasets:
- `2021_noaa-ndbc_46013.txt`, from NOAA buoy 46013, off the coast of Bodega Bay (DOWNLOAD) (source)
- `2021_ucdavis_bml_wts.csv`, from the UC Davis Bodega Bay Marine Laboratory’s sensors (DOWNLOAD) (source)
The NOAA data has a fixed-width format, which means each column has a fixed
width in characters over all rows. The readr package provides a function
`read_fwf` that can automatically guess the column widths and read the data
into a data frame. The column names appear in the first row and column units
appear in the second row, so read those rows separately:
# install.packages("readr")
library("readr")
noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt"
noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1)
## Rows: 2 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##
## chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 3323 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##
## chr (4): X2, X3, X4, X5
## dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
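The body of the file can then be read with the header rows skipped. A sketch, where the exact argument choices are assumptions:

```r
library("readr")

noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt"
# Skip the name and unit rows, then reuse the names from the first header row
noaa = read_fwf(noaa_path, skip = 2)
names(noaa) = unlist(noaa_headers[1, ])
```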
The dates and times for the observations are separated into component columns,
and the `read_fwf` function does not convert some of these to numbers
automatically. You can use `as.numeric` to convert them to numbers:
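A sketch of that conversion, assuming the component columns are named as in the NOAA file (`MM`, `DD`, `hh`, `mm`) and that these are the ones read as text:

```r
# Convert character component columns to numbers
noaa$MM = as.numeric(noaa$MM)
noaa$DD = as.numeric(noaa$DD)
noaa$hh = as.numeric(noaa$hh)
noaa$mm = as.numeric(noaa$mm)
```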
Finally, use the `make_datetime` function to combine the components into
date-time objects:
noaa_dt = make_datetime(year = noaa$YY, month = noaa$MM, day = noaa$DD,
hour = noaa$hh, min = noaa$mm)
noaa$date = noaa_dt
head(noaa_dt)
## [1] "2021-01-01 00:00:00 UTC" "2021-01-01 00:10:00 UTC"
## [3] "2021-01-01 00:20:00 UTC" "2021-01-01 00:30:00 UTC"
## [5] "2021-01-01 00:40:00 UTC" "2021-01-01 00:50:00 UTC"
That takes care of the dates in the NOAA data.
The Bodega Marine Lab data is in CSV format, which you can read with `read.csv`
or the readr package’s `read_csv` function. The latter is faster and usually
better at guessing column types. The column names appear in the first row and
the column units appear in the second row. The `read_csv` function handles the
names automatically, but you’ll have to remove the unit row as a separate step:
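A sketch of the reading step (the file path and the variable name `bml` are assumptions):

```r
library("readr")

bml = read_csv("data/ocean_data/2021_ucdavis_bml_wts.csv")
# The second row of the file contains units, so drop the first data row
bml = bml[-1, ]
```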
## Rows: 87283 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): time, sea_water_temperature, z
## dbl (1): sea_water_temperature_qc_agg
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dates and times of the observations were loaded as strings. You can use
lubridate’s `ymd_hms` function to automatically parse them:
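A sketch, assuming the data frame is called `bml` and the column is `time` (as in the column specification above):

```r
library("lubridate")

bml_dt = ymd_hms(bml$time)
head(bml_dt)
```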
## [1] "2020-12-31 09:06:00 UTC" "2020-12-31 09:12:00 UTC"
## [3] "2020-12-31 09:18:00 UTC" "2020-12-31 09:24:00 UTC"
## [5] "2020-12-31 09:30:00 UTC" "2020-12-31 09:36:00 UTC"
Now you have date and time objects for both datasets, so you can combine the
two. For example, you could extract the `date` and water temperature columns
from each, create a new column identifying the data source, and then row-bind
the datasets together.
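A hypothetical sketch of that combination; the data frame name `bml` and the NOAA temperature column name `WTMP` are assumptions:

```r
library("lubridate")

# Stack the two sources with a column identifying where each row came from
combined = rbind(
  data.frame(date = noaa$date, temp = as.numeric(noaa$WTMP),
             source = "NOAA"),
  data.frame(date = ymd_hms(bml$time),
             temp = as.numeric(bml$sea_water_temperature),
             source = "BML")
)
```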
1.3 String Fundamentals
Strings represent text, but even if your datasets are composed entirely of numbers, you’ll need to know how to work with strings. Text formats for data are widespread: comma-separated values (CSV), tab-separated values (TSV), JavaScript object notation (JSON), a panoply of markup languages (HTML, XML, YAML, TOML), and more. When you read data in these formats into R, sometimes R will correctly convert the values to appropriate non-string types. The rest of the time, you need to know how to work with strings so that you can fix whatever went wrong and convert the data yourself.
This section introduces several fundamental concepts related to working with strings. The next section, Section 1.4.1, describes the stringr package for working with strings. The last section, Section 1.4.2, builds on both and explains how to do powerful pattern matching.
1.3.1 Printing
There are two different ways to print strings: you can print a representation of the characters in the string or you can print the actual characters in the string.
To print a representation of the characters in a string, use the `print`
function. The representation is useful to identify characters that are not
normally visible, such as tabs and the characters that mark the end of a line.
To print the actual characters in a string, use the `message` function.
This important difference in how the `print` and `message` functions print
strings is demonstrated in the next section.
You can learn more about different ways to print output in R by reading Section 3.5.
1.3.2 Escape Sequences
In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:
- Write quotes or backslashes within a string
- Write characters that don’t appear on your keyboard (for example, characters in a foreign language)
For example, the escape sequence `\n` corresponds to the newline character.
Notice that the `message` function translates `\n` into a literal new line,
whereas the `print` function doesn’t:
## Hello
## Nick
## [1] "Hello\nNick"
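Code along these lines produces the output above:

```r
x = "Hello\nNick"
message(x)  # prints the characters, with a real line break
print(x)    # prints the representation, with the escape sequence
```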
As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string:
## She said, "Hi"
## She said, "Hi"
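Code like the following produces the output above:

```r
message('She said, "Hi"')    # enclose the string in single quotes
message("She said, \"Hi\"")  # or escape the double quotes
```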
Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:
## \
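For example, a string containing the two-backslash escape sequence prints as one backslash:

```r
message("\\")  # prints a single backslash
```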
There’s a complete list of escape sequences for R in the `?Quotes` help file.
Other programming languages also use escape sequences, and many of them are the
same as in R.
1.3.3 Raw Strings
A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions (covered in Section 1.4.2).
Raw strings begin with `r"` and an opening delimiter `(`, `[`, or `{`. Raw
strings end with a matching closing delimiter and quote. For example:
## quotes " and backslashes \
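Code like this produces the output above:

```r
message(r"(quotes " and backslashes \)")
```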
Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.
1.3.4 Character Encodings
Computers store data as numbers. In order to store text on a computer, people
have to agree on a character encoding, a system for mapping characters to
numbers. For example, in ASCII, one of the most popular encodings in the
United States, the character `a` maps to the number 97.
Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).
Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that implement the Unicode standard. The most popular of these is UTF-8.
In R, you can write Unicode characters with the escape sequence `\U` followed
by the number for the character in base 16. For instance, the number for `a`
in Unicode is 97 (the same as in ASCII). In base 16, 97 is `61`. So you can
write an `a` as:
## [1] "a"
Unicode escape sequences are usually only used for characters that are not easy
to type. For example, the cat emoji is number `1f408` (in base 16) in Unicode.
So the string `"\U1f408"` is the cat emoji.
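For example:

```r
"\U61"              # the letter "a"
message("\U1f408")  # prints the cat emoji
```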
Note that being able to see printed Unicode characters also depends on whether the font your computer is using has a glyph (image representation) for that character. Many fonts are limited to a small number of languages. The NerdFont project patches fonts commonly used for programming so that they have better Unicode coverage. Using a font with good Unicode coverage is not essential, but it’s convenient if you expect to work with many different natural languages or love using emoji.
1.3.4.1 Character Encodings in Text Files
Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.
Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:
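A sketch with base R’s `read.csv`; the file name here is hypothetical:

```r
x = read.csv("data.csv", fileEncoding = "UTF-8")
```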
Other reader functions may use a different parameter to set the encoding, so
always check the documentation. On computers where the native language is not
set to English, it can also help to set R’s native language to English with
`Sys.setlocale(locale = "English")`.
Encoding problems when writing a file are slightly more complicated to fix. See this blog post for a thorough explanation.
1.4 Processing Strings
String processing encompasses a variety of tasks such as searching for patterns within strings, extracting data from within strings, splitting strings into component parts, and removing or replacing unwanted characters (excess whitespace, punctuation, and so on). If you work with data, sooner or later you’ll run into a dataset in text format that needs a few text corrections before or after you read it into R, and for that you’ll find familiarity with string processing invaluable.
1.4.1 The stringr Package
Although R has built-in functions for string processing, we recommend using the stringr package for all of your string processing needs. The package is part of the Tidyverse, a collection of packages introduced in Section 1.1. Major advantages of stringr over other packages and R’s built-in functions include:
- Correctness: the package builds on International Components for Unicode (ICU), the Unicode Consortium’s own library for handling text encodings
- Discoverability: every function’s name begins with `str_`, so they’re easy to discover, remember, and identify in code
- Interface consistency: the first argument is always the string to process, the second argument is always the pattern to match (if applicable)
- Vectorization: most of the functions are vectorized in the first and second argument
stringr has detailed documentation and also a cheatsheet.
The first time you use stringr, you’ll have to install it with
`install.packages` (the same as any other package). Then you can load the
package with the `library` function:
library("stringr")
The typical syntax of a stringr function is `str_name(string, pattern, ...)`, where:
- `name` describes what the function does
- `string` is a string to search within or transform
- `pattern` is a pattern to search for, if applicable
- `...` is additional, function-specific arguments
For example, the `str_detect` function detects whether a pattern appears
within a string. The function returns `TRUE` if the pattern is found and
`FALSE` if it isn’t:
## [1] TRUE
## [1] FALSE
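For instance, code along these lines produces `TRUE` and then `FALSE` (the example strings are illustrative):

```r
library("stringr")

str_detect("tiger", "tig")  # TRUE: "tig" appears in "tiger"
str_detect("tiger", "cat")  # FALSE: "cat" does not
```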
Most of the stringr functions are vectorized in the `string` parameter:
## [1] TRUE FALSE TRUE
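For example (illustrative strings):

```r
library("stringr")

# One result per input string
str_detect(c("tiger", "lion", "cat"), "t")  # TRUE FALSE TRUE
```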
As another example, the `str_sub` function extracts a substring from a
string, given the substring’s position. The first argument is the string, the
second is the position of the substring’s first character, and the third is the
position of the substring’s last character:
## [1] "speak"
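A sketch (the input string is illustrative):

```r
library("stringr")

# Characters 5 through 9 of "loudspeaker" spell "speak"
str_sub("loudspeaker", 5, 9)  # "speak"
```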
The `str_sub` function is especially useful for extracting data from strings
that have a fixed width (although the readr package’s `read_fwf` is usually
a better choice if you have a fixed-width file).
There are a lot of stringr functions. Five that are especially important and are explained in this reader are:
- `str_detect`, to test whether a string contains a pattern
- `str_sub`, to extract a substring at a given position from a string
- `str_replace`, to replace or remove parts of a string
- `str_split_fixed`, to split a string into parts
- `str_match`, to extract data from a string
You can find a complete list of functions with examples on the stringr documentation’s reference page and the cheatsheet.
1.4.2 Regular Expressions
The stringr functions use a special language called regular expressions or regex to describe patterns in strings. Many other programming languages also have string processing tools that use regular expressions, so fluency with regular expressions is a transferable skill.
You can use a regular expression to describe a complicated pattern in just a few characters because some characters, called metacharacters, have special meanings. Metacharacters are usually punctuation characters. They are never letters or numbers, which always have their literal meaning.
This table lists some of the most useful metacharacters:
| Metacharacter | Meaning |
|---|---|
| `.` | any one character (wildcard) |
| `\` | escape character (in both R and regex), see Section 1.3.2 |
| `^` | the beginning of string (not a character) |
| `$` | the end of string (not a character) |
| `[ab]` | one character, either `a` or `b` |
| `[^ab]` | one character, anything except `a` or `b` |
| `?` | the previous character appears 0 or 1 times |
| `*` | the previous character appears 0 or more times |
| `+` | the previous character appears 1 or more times |
| `()` | make a group |
| `\|` | match left OR right side (not a character) |
Section 1.5 provides examples of how most of the metacharacters work. Even
more examples are presented in the stringr package’s regular expressions
vignette. You can find a complete listing of regex metacharacters in `?regex`
or on the stringr cheatsheet.
You can disable regular expressions in a stringr function by calling the
`fixed` function on the pattern. For example, to test whether a string contains
a literal dot `.`:
## [1] FALSE TRUE
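Code along these lines produces the output above (the input strings are illustrative):

```r
library("stringr")

# Without fixed(), the dot would match any character
str_detect(c("abc", "a.c"), fixed("."))  # FALSE TRUE
```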
It’s a good idea to call `fixed` on any pattern that doesn’t contain regex
metacharacters, because it communicates to the reader that you’re not using
regex, it helps to prevent bugs, and it provides a small speed boost.
1.4.3 Replacing Parts of Strings
Replacing part of a string is a common string processing task. For instance,
quantitative data often contain non-numeric characters such as commas, currency
symbols, and percent signs. These must be removed before converting to numeric
data types. Replacement and removal go hand-in-hand, since removal is
equivalent to replacing part of a string with the empty string `""`.
The `str_replace` function replaces the first part of a string that matches a
pattern (from left to right), while the related `str_replace_all` function
replaces every part of a string that matches a pattern. Most stringr
functions that do pattern matching come in a pair like this: one to process
only the first match and one to process every match.
As an example, suppose you want to remove commas from a number so that you can
convert it with `as.numeric`, which returns `NA` for numbers that contain
commas. You want to remove all of the commas, so `str_replace_all` is the
function to use. As usual, the first argument is the string and the second is
the pattern. The third argument is the replacement, which is the empty string
`""` in this case:
## [1] "1000000"
The `str_replace` function doesn’t work as well for this task, since it only
replaces the first match to the pattern:
## [1] "1000,000"
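A sketch of both calls (the input number is reconstructed from the outputs above):

```r
library("stringr")

str_replace_all("1,000,000", ",", "")  # "1000000"
str_replace("1,000,000", ",", "")      # "1000,000": only the first comma
# After removing all commas, the conversion succeeds
as.numeric(str_replace_all("1,000,000", ",", ""))  # the number 1000000
```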
You can also use these functions to replace or remove longer patterns within
words. For instance, suppose you want to change the word `"dog"` to `"cat"`:
## [1] "cats are great, dogs are fun" "cats are fluffy"
## [1] "cats are great, cats are fun" "cats are fluffy"
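Code like the following produces the output above:

```r
library("stringr")

x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")      # replaces only the first "dog" in each
str_replace_all(x, "dog", "cat")  # replaces every "dog"
```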
As a final example, you can use the replacement functions and a regex pattern
to replace repeated spaces with a single space. This is a good standardization
step if you’re working with text. The key is to use the regex quantifier `+`,
which means a character “repeats one or more times” in the pattern, and to use
a single space `" "` as the replacement:
## [1] "This sentence has extra space."
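Code along these lines produces the output above (the exact spacing in the input is illustrative):

```r
library("stringr")

x = "This  sentence   has  extra   space."
str_replace_all(x, " +", " ")  # "This sentence has extra space."
```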
If you just want to trim (remove) all whitespace from the beginning and end
of a string, you can use the `str_trim` function instead.
1.4.4 Splitting Strings
Distinct data in a text are generally separated by a character like a space or a comma, to make them easy for people to read. Often these separators also make the data easy for R to parse. The idea is to split the string into a separate value at each separator.
The `str_split` function splits a string at each match to a pattern. The
matching characters—that is, the separators—are discarded.
For example, suppose you want to split several numbers separated by commas and spaces:
## [[1]]
## [1] "21" "32.3" "5" "64"
The `str_split` function always returns a list with one element for each input
string. Here the list only has one element because `x` only has one element.
You can get the first element with:
## [1] "21" "32.3" "5" "64"
You then convert the values with `as.numeric`.
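Putting this together, a sketch (the input string and separator are reconstructed from the output above):

```r
library("stringr")

x = "21, 32.3, 5, 64"
result = str_split(x, ", ")
result[[1]]              # "21" "32.3" "5" "64"
as.numeric(result[[1]])  # the four numbers
```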
To see why the `str_split` function always returns a list, consider what
happens if you try to split two different strings at once:
## [[1]]
## [1] "21" "32.3" "5" "64"
##
## [[2]]
## [1] "10" "15" "1.3"
Each string has a different number of parts, so the vectors in the result have different lengths. So a list is the only way to store them.
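A sketch of that two-string case (inputs reconstructed from the output above):

```r
library("stringr")

str_split(c("21, 32.3, 5, 64", "10, 15, 1.3"), ", ")
```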
You can also use the `str_split` function to split a sentence into words. Use
spaces for the split:
## [[1]]
## [1] "The" "students" "in" "this" "workshop" "are" "great!"
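Code like this produces the output above:

```r
library("stringr")

str_split("The students in this workshop are great!", " ")
```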
When you know exactly how many parts you expect a string to have, use the
`str_split_fixed` function instead of `str_split`. It accepts a third argument
for the number of pieces to return. Because the number of pieces is fixed, the
function can return the result in a matrix instead of a list. For example:
## [,1] [,2] [,3]
## [1,] "1" "2" "3"
## [2,] "10" "20" "30"
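A sketch that produces a matrix like the one above; the separator here is an assumption:

```r
library("stringr")

# Two strings, each split into exactly three pieces
str_split_fixed(c("1-2-3", "10-20-30"), "-", 3)
```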
The `str_split_fixed` function is often more convenient than `str_split`
because the nth piece of each input string is just the nth column of the
result.
For example, suppose you want to get the area codes from some phone numbers:
phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)
result[, 1]
## [1] "717" "629" "903"
1.4.5 Extracting Matches
Occasionally, you might need to extract parts of a string in a more complicated
way than string splitting allows. One solution is to write a regular expression
that will match all of the data you want to capture, with parentheses ( )
,
the regex metacharacter for a group, around each distinct value. Then you can
use the str_match
function to extract the groups. Section 1.5.6
presents some examples of regex groups.
For example, suppose you want to split an email address into three parts: the
user name, the domain name, and the top-level domain. To create a regular
expression that matches email addresses, you can use the `@` and `.` in the
address as anchors. The surrounding characters are generally alphanumeric,
which you can represent with the “word” metacharacter `\w`:
\w+@\w+[.]\w+
Next, put parentheses `( )` around each part that you want to extract:
(\w+)@(\w+)[.](\w+)
Finally, use this pattern in `str_match`, adding extra backslashes so that
everything is escaped correctly:
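Code along these lines produces the output above; note the doubled backslashes needed to write `\w` inside an R string:

```r
library("stringr")

x = "datalab@ucdavis.edu"
str_match(x, "(\\w+)@(\\w+)[.](\\w+)")
```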
## [,1] [,2] [,3] [,4]
## [1,] "datalab@ucdavis.edu" "datalab" "ucdavis" "edu"
The function extracts the overall match to the pattern, as well as the match to
each group. The pattern in this example doesn’t work for all possible email
addresses, since user names can contain dots and other characters that are not
alphanumeric. You could generalize the pattern if necessary. The point is that
the `str_match` function and groups provide an extremely flexible way to
extract data from strings.
1.4.6 Case Study: U.S. Warehouse Stocks
The U.S. Department of Agriculture (USDA) publishes a variety of datasets online, particularly through its National Agricultural Statistics Service (NASS). Unfortunately, most of them are published in PDF or semi-structured text format, which makes reading the data into R or other statistical software a challenge.
The USDA NASS posts monthly reports about stocks of agricultural products in refrigerated warehouses. In this case study, you’ll use string processing functions to extract a table of data from the December 2022 report.
To begin, download the report and save it somewhere on your computer. Then open the file in a text editor (or RStudio) to inspect it. The goal is to extract the first table, about “Nuts, Dairy Products, Frozen Eggs, and Frozen Poultry,” from the report.
The report is a semi-structured mix of natural language text and fixed-width
tables. As a consequence, most functions for reading tabular data will not work
well on the entire report. You could try to use a function for reading
fixed-width data, such as `read.fwf` or the readr package’s `read_fwf`, on
only the lines containing a table. Another approach, which is shown here, is to
use string processing functions to find and extract the table.
The `readLines` function reads a text file into a character vector with one
element for each line. This makes the function useful for reading unstructured
or semi-structured text. Use the function to read the report:
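A sketch, where the file path is an assumption:

```r
report = readLines("data/2022-12-22_cold_storage.txt")
head(report)
```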
## [1] ""
## [2] "Cold Storage"
## [3] ""
## [4] "ISSN: 1948-903X"
## [5] ""
## [6] "Released December 22, 2022, by the National Agricultural Statistics Service "
In the report, tables always begin and end with lines that contain only dashes
`-`. By locating these all-dash lines, you can locate the tables. Like
`str_detect`, the `str_which` function tests whether strings in a vector match
a pattern. The only difference is that `str_which` returns the indexes of the
strings that matched (as if you had called `which`) rather than a logical
vector. Use `str_which` to find the all-dash lines:
# The regex means:
# ^ beginning of string
# -+ one or more dashes
# $ end of string
dashes = str_which(report, "^-+$")
head(report[dashes], 2)
## [1] "--------------------------------------------------------------------------------------------------------------------------"
## [2] "--------------------------------------------------------------------------------------------------------------------------"
Each table contains three all-dash lines: one at the beginning, one separating the header from the body, and one at the end. The header and body of the first table are:
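A sketch of the code to extract the two pieces, assuming the first table runs from the first all-dash line to the third (the variable bod is reused below):

```r
# Header: first dash line through second dash line
header = report[dashes[1]:dashes[2]]
# Body: second dash line through third dash line
bod = report[dashes[2]:dashes[3]]

header
head(bod, 6)
```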
## [1] "--------------------------------------------------------------------------------------------------------------------------"
## [2] " : : : November 30, 2022 : Public "
## [3] " : Stocks in all warehouses : as a percent of : warehouse "
## [4] " : : : : stocks "
## [5] " Commodity :-----------------------------------------------------------------------------------"
## [6] " :November 30, : October 31, :November 30, :November 30, : October 31, :November 30, "
## [7] " : 2021 : 2022 : 2022 : 2021 : 2022 : 2022 "
## [8] "--------------------------------------------------------------------------------------------------------------------------"
## [1] "--------------------------------------------------------------------------------------------------------------------------"
## [2] " : ------------ 1,000 pounds ----------- ---- percent ---- 1,000 pounds "
## [3] " : "
## [4] "Nuts : "
## [5] "Shelled : "
## [6] " Pecans .............................: 30,906 38,577 34,489 112 89 "
The columns have fixed widths, so extracting the columns is relatively easy with str_sub if you can get the offsets. In the last line of the header, the columns are separated by colons (:). Thus you can use the str_locate_all function, which returns the locations of a pattern in a string, to get the offsets:
# The regex means:
# [^:]+ one or more characters, excluding colons
# (:|$) a colon or the end of the line
cols = str_locate_all(report[dashes[2] - 1], "[^:]+(:|$)")
# Like str_split, str_locate_all returns a list
cols = cols[[1]]
cols
## start end
## [1,] 1 39
## [2,] 40 53
## [3,] 54 67
## [4,] 68 81
## [5,] 82 95
## [6,] 96 109
## [7,] 110 122
You can use these offsets with str_sub to break a line in the body of the table into columns:
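For instance, applied to the Pecans line shown above (its position in bod is assumed):

```r
# str_sub accepts a two-column matrix of (start, end) positions,
# so passing cols extracts all seven columns at once
str_sub(bod[6], cols)
```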
## [1] " Pecans .............................:"
## [2] " 30,906 "
## [3] " 38,577 "
## [4] " 34,489 "
## [5] " 112 "
## [6] " 89 "
## [7] " "
Because of the way str_sub is vectorized, you can’t process every line in the body of the table in one vectorized call. Instead, you can use sapply to call str_sub on each line:
# Set USE.NAMES to make the table easier to read
tab = sapply(bod, str_sub, cols, USE.NAMES = FALSE)
# The sapply function transposes the table
tab = t(tab)
head(tab)
## [,1] [,2]
## [1,] "---------------------------------------" "--------------"
## [2,] " :" " ------------"
## [3,] " :" " "
## [4,] "Nuts :" " "
## [5,] "Shelled :" " "
## [6,] " Pecans .............................:" " 30,906 "
## [,3] [,4] [,5] [,6]
## [1,] "--------------" "--------------" "--------------" "--------------"
## [2,] " 1,000 pounds " "----------- " " ---- perc" "ent ---- "
## [3,] " " " " " " " "
## [4,] " " " " " " " "
## [5,] " " " " " " " "
## [6,] " 38,577 " " 34,489 " " 112 " " 89 "
## [,7]
## [1,] "-------------"
## [2,] "1,000 pounds "
## [3,] " "
## [4,] " "
## [5,] " "
## [6,] " "
The columns still contain undesirable punctuation and whitespace, but you can remove these with str_replace_all and str_trim. Since the table is a matrix, it’s necessary to use apply to process it column-by-column:
# The regex means:
# , a comma
# | OR
# [.]* zero or more literal dots
# : a colon
# $ the end of the line
tab = apply(tab, 2, function(col) {
col = str_replace_all(col, ",|[.]*:$", "")
str_trim(col)
})
head(tab)
## [,1] [,2]
## [1,] "---------------------------------------" "--------------"
## [2,] "" "------------"
## [3,] "" ""
## [4,] "Nuts" ""
## [5,] "Shelled" ""
## [6,] "Pecans" "30906"
## [,3] [,4] [,5] [,6]
## [1,] "--------------" "--------------" "--------------" "--------------"
## [2,] "1000 pounds" "-----------" "---- perc" "ent ----"
## [3,] "" "" "" ""
## [4,] "" "" "" ""
## [5,] "" "" "" ""
## [6,] "38577" "34489" "112" "89"
## [,7]
## [1,] "-------------"
## [2,] "1000 pounds"
## [3,] ""
## [4,] ""
## [5,] ""
## [6,] ""
The first few rows and the last row can be removed, since they don’t contain data. Then you can convert the table to a data frame and convert the individual columns to appropriate data types:
tab = tab[-c(1:3, nrow(tab)), ]
tab = data.frame(tab)
tab[2:7] = lapply(tab[2:7], as.numeric)
head(tab, 10)
## X1 X2 X3 X4 X5 X6 X7
## 1 Nuts NA NA NA NA NA NA
## 2 Shelled NA NA NA NA NA NA
## 3 Pecans 30906 38577 34489 112 89 NA
## 4 In-Shell NA NA NA NA NA NA
## 5 Pecans 63788 44339 47638 75 107 NA
## 6 NA NA NA NA NA NA
## 7 Dairy products NA NA NA NA NA NA
## 8 Butter 210473 239658 199695 95 83 188566
## 9 Natural cheese NA NA NA NA NA NA
## 10 American 834775 831213 815655 98 98 NA
The data frame is now sufficiently clean that you could use it for a simple analysis. Of course, there are many things you could do to improve the extracted data frame, such as identifying categories and subcategories in the first column, removing rows that are completely empty, and adding column names. These entail more string processing and data frame manipulation—if you want to practice your R skills, try doing them on your own.
1.5 Regular Expression Examples
This section provides examples of several different regular expression metacharacters and other features. Most of the examples use the str_view function, which is especially helpful for testing regular expressions. The function displays a rendered version of the string with the matches highlighted.
The RegExr website is also helpful for testing regular expressions; it provides an interactive interface where you can write regular expressions and see where they match a string.
1.5.1 The Wildcard
The regex wildcard character is a dot (.), and it matches any single character. For example:
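A sketch of the example, assuming stringr is loaded as elsewhere in the chapter; the pattern d.g is an assumption consistent with the output:

```r
library(stringr)

# The wildcard matches any single character, so d.g matches "dog"
str_view("dog", "d.g")
```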
## [1] │ <dog>
By default, regex searches from left to right:
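For instance, the pattern . by itself matches each character in turn:

```r
str_view("dog", ".")
```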
## [1] │ <d><o><g>
1.5.2 Escape Sequences
Like R, regular expressions can contain escape sequences that begin with a backslash. These are processed separately from, and after, R’s own escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character.
For example, suppose you want to match a literal dot (.). The regex for a literal dot is \., and since backslashes in R strings have to be escaped, the R string for this regex is "\\.". For example:
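A sketch, with the test string assumed from the output:

```r
# \. in regex (written "\\." in R) matches a literal dot
str_view("this.string", "\\.")
```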
## [1] │ this<.>string
The double backslash can be confusing, and it gets worse if you want to match a literal backslash. You have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:
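A sketch, with the test string assumed from the output:

```r
# "this\\that" contains one literal backslash; the regex \\
# (written "\\\\" in R) matches it
str_view("this\\that", "\\\\")
```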
## [1] │ this<\>that
Raw strings (see Section 1.3.3) make regular expressions easier to read, because they make backslashes literal (but they still mark the beginning of an escape sequence in regex). You can use a raw string to write the above as:
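Using raw strings (available in R 4.0 and later):

```r
# Raw strings leave backslashes literal, so the regex \\ is r"(\\)"
str_view(r"(this\that)", r"(\\)")
```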
## [1] │ this<\>that
1.5.3 Anchors
By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor.
The beginning-of-string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the pattern. For example, suppose you want to match an a at the beginning of the string. The unanchored pattern a matches any a:
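A sketch, with the test strings assumed from the output:

```r
str_view(c("abc", "cab"), "a")
```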
## [1] │ <a>bc
## [2] │ c<a>b
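With the anchor, only the a at the beginning of the string matches:

```r
str_view(c("abc", "cab"), "^a")
```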
## [1] │ <a>bc
It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string.
Likewise, the end-of-string anchor is $. It marks the end of the string, but doesn’t count as a character in the pattern.
1.5.4 Character Classes
In regex, square brackets [ ] denote a character class. A character class matches exactly one character, but that character can be any of the characters inside of the square brackets. The square brackets themselves don’t count as characters in the pattern. For example, suppose you want to match c followed by either a or t:
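A sketch, with the test strings assumed from the output:

```r
# [at] matches exactly one character: either a or t
str_view(c("ca", "ct", "cat", "cta"), "c[at]")
```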
## [1] │ <ca>
## [2] │ <ct>
## [3] │ <ca>t
## [4] │ <ct>a
You can use a dash - in a character class to create a range. For example, to match c followed by any letter p through z:
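Applied to the same test strings (the pattern c[p-z] is assumed from the output):

```r
# Only t falls in the range p-z, so only ct matches
str_view(c("ca", "ct", "cat", "cta"), "c[p-z]")
```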
## [2] │ <ct>
## [4] │ <ct>a
Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-].
Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot.
A hat ^ at the beginning of the character class negates the class. So, for example, [^abc] matches any one character except for a, b, or c:
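A sketch:

```r
str_view("abcdef", "[^abc]")
```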
## [1] │ abc<d><e><f>
1.5.5 Quantifiers
Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.
For example, the question mark ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional. For example:
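A sketch, with the test strings assumed from the output:

```r
# b is optional, so abc and ac match, but abbc does not
str_view(c("abc", "abbc", "ac"), "ab?c")
```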
## [1] │ <abc>
## [3] │ <ac>
The star * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all. For instance:
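A sketch, with the test strings assumed from the output:

```r
# b can appear any number of times, including zero
str_view(c("abc", "abd", "ac", "abbc"), "ab*c")
```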
## [1] │ <abc>
## [3] │ <ac>
## [4] │ <abbc>
The plus + quantifier means the preceding character must appear 1 or more times.
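For instance, a quick sketch:

```r
# b must appear at least once, so ac does not match
str_view(c("ac", "abc", "abbc"), "ab+c")
```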
Quantifiers are greedy, meaning they always match as many characters as possible. In this example, notice that the pattern matches the entire string, even though it could also match just abba:
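The pattern here is assumed to be .+a, which is consistent with the output:

```r
# Greedy: .+ grabs as much as possible before the final a,
# so the match covers the whole string
str_view("abbabbba", ".+a")
```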
## [1] │ <abbabbba>
You can add a question mark ? after another quantifier to make it non-greedy:
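With the same assumed pattern made non-greedy:

```r
# Non-greedy: .+? grabs as little as possible, so there are two
# shorter matches instead of one long one
str_view("abbabbba", ".+?a")
```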
## [1] │ <abba><bbba>
1.5.6 Groups
In regex, parentheses ( ) denote a group. The parentheses themselves don’t count as characters in the pattern. Groups are useful for repeating or extracting specific parts of a pattern (see Section 1.4.5).
Quantifiers can act on groups in addition to individual characters. For example, suppose you want to make the entire substring ", dogs," optional in a pattern, so that both of the test strings in this example match:
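A sketch of the pattern, with the optional part in a group:

```r
# The group (, dogs,) is optional, so both strings match in full
str_view(
  c("cats, dogs, and frogs", "cats and frogs"),
  "cats(, dogs,)? and frogs"
)
```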
## [1] │ <cats, dogs, and frogs>
## [2] │ <cats and frogs>