After this lesson, you should be able to:
The major topic of this chapter is how to convert dates and times into appropriate R data types.
For working with dates, times, and strings, we recommend using packages from the Tidyverse, a popular collection of packages for doing data science. Compared to R’s built-in functions, we’ve found that the functions in Tidyverse packages are generally easier to learn and use. They also provide additional features and have more robust support for characters outside of the Latin alphabet.
Although they’re developed by many different members of the R community, Tidyverse packages follow a unified design philosophy, and thus have many interfaces and data structures in common. The packages provide convenient and efficient alternatives to built-in R functions for many tasks, including:
Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. As a consequence, the Tidyverse is sometimes polarizing in the R community. It’s useful to be literate in both base R and the Tidyverse, since both are popular.
One major advantage of the Tidyverse is that the packages are usually well-documented and provide lots of examples. Every package has a documentation website and the most popular ones also have cheatsheets.
When working with dates and times, you might want to:
Even though this list isn’t exhaustive, it shows that there are lots of things you might want to do. In order to do them in R, you must first make sure that your dates and times are represented by appropriate data types. Most of R’s built-in functions for loading data do not automatically recognize dates and times. This section describes several data types that represent dates and times, and explains how to use R to parse—break down and convert—dates and times to these types.
As explained in the Tidyverse section above, we recommend the Tidyverse packages for working with dates and times over other packages or R’s built-in functions. There are two:
This chapter only covers lubridate, since it’s more useful in most situations. The package has detailed documentation and a cheatsheet.
You’ll have to install the package if you haven’t already, and then load it:
# install.packages("lubridate")
library("lubridate")
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Perhaps the most common task you’ll need to do with date and time data is convert from strings to more appropriate data types. This is because R’s built-in functions for reading data from a text format, such as read.csv, read dates and times as strings. For example, here are some dates as strings:
date_strings = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982")
date_strings
[1] "Jan 10, 2021" "Sep 3, 2018" "Feb 28, 1982"
You can tell that these are dates, but as far as R is concerned, they’re text.
The lubridate package provides a variety of functions to automatically parse strings into date or time objects that R understands. These functions are named with one letter per component of the date or time. The order of the letters must match the order of the components in the string you want to parse.
In the example, the strings have the month (m), then the day (d), and then the year (y), so you can use the mdy function to parse them automatically:
dates = mdy(date_strings)
dates
[1] "2021-01-10" "2018-09-03" "1982-02-28"
class(dates)
[1] "Date"
Notice that the dates now have class Date, one of R’s built-in classes for representing dates, and that R prints them differently. Now R recognizes that the dates are in fact dates, so they’re ready to use in an analysis.
There is a complete list of the automatic parsing functions in the lubridate documentation.
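As a small illustration of how the letter order in the function name follows the component order in the string, the sketch below parses the same calendar date written three ways. The example strings are made up for illustration and are not from any dataset in this chapter.

```r
library("lubridate")

# The same calendar date, written with the components in different orders.
same_date = c(
  ymd("2021-01-10"),   # year, month, day
  dmy("10/01/2021"),   # day, month, year
  mdy("01/10/21")      # month, day, 2-digit year
)

# All three strings parse to the same Date value.
same_date
class(same_date)
```

Choosing the wrong function for a string (for instance, `dmy` on a month-first string) can silently produce a different date or `NA`, so it is worth checking a few parsed values by eye.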
Note: a relatively new package, clock, tries to solve some problems with the Date class people have identified over the years. The package is in the r-lib collection of packages, which provide low-level functionality complementary to the Tidyverse. Eventually, it may be preferable to use the classes in clock rather than the Date class, but for now, the Date class is still suitable for most tasks.
Occasionally, a date or time string may have a format that lubridate can’t parse automatically. In that case, you can use the fast_strptime function to describe the format in detail. At a minimum, the function requires two arguments: a vector of strings to parse and a format string.
The format string describes the format of the dates or times, and is based on the syntax of strptime, a function provided by many programming languages (including R) to parse date or time strings. In a format string, a percent sign % followed by a character is called a specification and has a special meaning. Here are a few of the most useful ones:
| Specification | Description | 2015-01-29 21:32:55 |
|---|---|---|
| %Y | 4-digit year | 2015 |
| %m | 2-digit month | 01 |
| %d | 2-digit day | 29 |
| %H | 2-digit hour | 21 |
| %M | 2-digit minute | 32 |
| %S | 2-digit second | 55 |
| %% | literal % | % |
| %y | 2-digit year | 15 |
| %B | full month name | January |
| %b | short month name | Jan |
You can find a complete list in ?fast_strptime. Other characters in the format string do not have any special meaning. Write the format string so that it matches the format of the dates you want to parse.
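To make the idea concrete, here is a minimal sketch that combines a few of the specifications above with literal characters. The input string and its day-before-month layout are invented for illustration.

```r
library("lubridate")

# A made-up timestamp: year/day/month, then the hour and minute separated
# by a literal "h".
odd_string = "2015/29/01 21h32"

# Each specification consumes one component; "/", " ", and "h" must match
# the input literally.
parsed = fast_strptime(odd_string, "%Y/%d/%m %Hh%M")
parsed
```

If the format string does not match the input, the result is `NA` rather than an error, so it is a good idea to check for missing values after parsing.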
For example, let’s try parsing an unusual time format:
time_string = "6 minutes, 32 seconds after 10 o'clock"
time = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock")
time
[1] "0-01-01 10:06:32 UTC"
class(time)
[1] "POSIXlt" "POSIXt"
R represents date-times with the classes POSIXlt and POSIXct. There’s no built-in class to represent times alone, which is why the result in the example above includes a date.
Internally, a POSIXlt object is a list with elements to store different date and time components. On the other hand, a POSIXct object is a single floating point number (type double). If you want to store your time data in a data frame, use POSIXct objects, since data frames don’t work well with columns of lists. You can control whether fast_strptime returns a POSIXlt or POSIXct object by setting the lt parameter to TRUE or FALSE:
time_ct = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock",
  lt = FALSE)
class(time_ct)
[1] "POSIXct" "POSIXt"
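The difference between the two classes can be inspected with base R alone. The following sketch uses an arbitrary example timestamp and only built-in functions.

```r
# Build the same instant in both representations (base R only).
lt = as.POSIXlt("2015-01-29 21:32:55", tz = "UTC")
ct = as.POSIXct(lt)

# A POSIXlt object is a list of components that can be accessed by name.
typeof(lt)
lt$hour
lt$min

# A POSIXct object is a single number (seconds since 1970-01-01 UTC).
typeof(ct)

# The POSIXct form fits naturally in a data frame column.
df = data.frame(when = ct)
class(df$when)
```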
Another common task is combining the numeric components of a date or time into a single object. You can use the make_date and make_datetime functions to do this. The parameters are named for the different components. For example:
make_date(day = 10, year = 2023, month = 1)
[1] "2023-01-10"
These functions are vectorized, so you can use them to combine the components of many dates or times at once. They’re especially useful for reconstructing dates and times from tabular datasets where each component is stored in a separate column.
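For instance, a sketch of the vectorized use on a small made-up table (the data frame `parts` and its values are hypothetical) might look like this:

```r
library("lubridate")

# A toy table with one date component per column.
parts = data.frame(
  year  = c(2021, 2018, 1982),
  month = c(1, 9, 2),
  day   = c(10, 3, 28)
)

# make_date works on whole columns at once and returns a Date vector.
parts$date = make_date(year = parts$year, month = parts$month, day = parts$day)
parts$date
```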
After you’ve converted your date and time data to appropriate types, you can do any of the operations listed at the beginning of this section. For example, you can use lubridate’s period function to create an offset to add to a date or time:
dates
[1] "2021-01-10" "2018-09-03" "1982-02-28"
dates + period(1, "month")
[1] "2021-02-10" "2018-10-03" "1982-03-28"
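One detail worth knowing about period arithmetic is what happens when the resulting date does not exist. The sketch below uses lubridate's `days` and `months` helpers and the `%m+%` operator on an arbitrary example date; it is an illustration of the behavior, not a recommendation for every analysis.

```r
library("lubridate")

jan31 = ymd("2021-01-31")

# Adding days always lands on a real date.
plus_week = jan31 + days(7)
plus_week

# Adding a month would give February 31, which does not exist, so the
# result is NA.
plus_month = jan31 + months(1)
plus_month

# The %m+% operator instead rolls back to the last day of the month.
plus_month_rolled = jan31 %m+% months(1)
plus_month_rolled
```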
You can also use lubridate functions to get or set the components. These functions usually have the same name as the component. For instance:
day(dates)
[1] 10 3 28
month(dates)
[1] 1 9 2
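The examples above get components. Setting a component uses the same function on the left-hand side of an assignment, as in this short sketch with an arbitrary date:

```r
library("lubridate")

d = ymd("2021-01-10")

# Get a component.
year(d)

# Set components by assigning to the accessor.
day(d) = 1
month(d) = 6
d
```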
See the lubridate documentation for even more details about what you can do.
We’re still developing this section and will likely replace it with the CA Parks & Recreation Fleet dataset.
The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes ocean temperature data collected by sensor buoys off the coast on the National Data Buoy Center (NDBC) website. California also has many sensors collecting ocean temperature data that are not administered by the federal government. Data from these sensors is published on the California Ocean Observing Systems (CALOOS) Data Portal.
Suppose you’re a researcher who wants to combine ocean temperature data from both sources to use in R. Both publish the data as plain text (NOAA in a fixed-width format, the Bodega Marine Laboratory in comma-separated value, or CSV, format), but they record dates, times, and temperatures differently. Thus you need to be careful that the dates and times are parsed correctly.
Download these two 2021 datasets:
- 2021_noaa-ndbc_46013.txt, from NOAA buoy 46013, off the coast of Bodega Bay (DOWNLOAD) (source)
- 2021_ucdavis_bml_wts.csv, from the UC Davis Bodega Bay Marine Laboratory’s sensors (DOWNLOAD) (source)

The NOAA data has a fixed-width format, which means each column has a fixed width in characters over all rows. The readr package provides a function read_fwf that can automatically guess the column widths and read the data into a data frame. The column names appear in the first row and column units appear in the second row, so read those rows separately:
# install.packages("readr")
library("readr")
noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt"
noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1)
Rows: 2 Columns: 18
── Column specification ────────────────────────────────────────────────────────
chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
noaa = read_fwf(noaa_path, skip = 2)
Rows: 3323 Columns: 18
── Column specification ────────────────────────────────────────────────────────
chr (4): X2, X3, X4, X5
dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(noaa) = as.character(noaa_headers[1, ])
names(noaa)[1] = "YY"

The dates and times for the observations are separated into component columns, and the read_fwf function does not convert some of these to numbers automatically. You can use as.numeric to convert them to numbers:
cols = 2:5
noaa[cols] = lapply(noaa[cols], as.numeric)

Finally, use the make_datetime function to combine the components into date-time objects:
noaa_dt = make_datetime(year = noaa$YY, month = noaa$MM, day = noaa$DD,
hour = noaa$hh, min = noaa$mm)
noaa$date = noaa_dt
head(noaa_dt)
[1] "2021-01-01 00:00:00 UTC" "2021-01-01 00:10:00 UTC"
[3] "2021-01-01 00:20:00 UTC" "2021-01-01 00:30:00 UTC"
[5] "2021-01-01 00:40:00 UTC" "2021-01-01 00:50:00 UTC"
That takes care of the dates in the NOAA data.
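If you don't have the NOAA file at hand, the same convert-then-combine steps can be tried on a toy data frame. In this sketch, the data frame `obs` and its values are made up; only the column names mirror the ones used above.

```r
library("lubridate")

# Toy stand-in for the NOAA table: the year is numeric, but the other
# components were read as zero-padded strings.
obs = data.frame(
  YY = c(2021, 2021),
  MM = c("01", "01"),
  DD = c("01", "01"),
  hh = c("00", "00"),
  mm = c("00", "10"),
  stringsAsFactors = FALSE
)

# Convert the string columns to numbers.
cols = c("MM", "DD", "hh", "mm")
obs[cols] = lapply(obs[cols], as.numeric)

# Combine the components into POSIXct date-times (UTC by default).
obs$date = make_datetime(year = obs$YY, month = obs$MM, day = obs$DD,
  hour = obs$hh, min = obs$mm)
obs$date
```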
The Bodega Marine Lab data is in CSV format, which you can read with read.csv or the readr package’s read_csv function. The latter is faster and usually better at guessing column types. The column names appear in the first row and the column units appear in the second row. The read_csv function handles the names automatically, but you’ll have to remove the unit row as a separate step:
bml = read_csv("data/ocean_data/2021_ucdavis_bml_wts.csv")
Rows: 87283 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): time, sea_water_temperature, z
dbl (1): sea_water_temperature_qc_agg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bml = bml[-1, ]

The dates and times of the observations were loaded as strings. You can use lubridate’s ymd_hms function to automatically parse them:
bml_dt = ymd_hms(bml$time)
bml$date = bml_dt
head(bml_dt)
[1] "2020-12-31 09:06:00 UTC" "2020-12-31 09:12:00 UTC"
[3] "2020-12-31 09:18:00 UTC" "2020-12-31 09:24:00 UTC"
[5] "2020-12-31 09:30:00 UTC" "2020-12-31 09:36:00 UTC"
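As a quick check of what ymd_hms returns, the sketch below parses two timestamps typed in by hand (matching the first two values printed above, though the exact string layout in the file is an assumption) and computes the gap between them.

```r
library("lubridate")

# Two hand-typed timestamps in year-month-day hour:minute:second order.
stamps = c("2020-12-31 09:06:00", "2020-12-31 09:12:00")
parsed = ymd_hms(stamps)

# The result is a POSIXct vector, so it can go straight into a data frame.
class(parsed)

# Subtracting date-times gives the elapsed time between observations.
gap = parsed[2] - parsed[1]
gap
```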
Now you have date and time objects for both datasets, so you can combine the two. For example, you could extract the date and water temperature columns from each, create a new column identifying the data source, and then row-bind the datasets together.
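A minimal sketch of that last step, using two small made-up data frames in place of the real datasets, might look like the following. The names `noaa_small`, `bml_small`, `temperature`, and `source`, and all of the temperature values, are hypothetical. With the real data you would select the actual temperature columns, and convert the Bodega Marine Lab temperatures with as.numeric first, since they were read as strings.

```r
library("lubridate")

# Toy stand-ins for the date and temperature columns of each dataset.
noaa_small = data.frame(
  date = ymd_hms(c("2021-01-01 00:00:00", "2021-01-01 00:10:00")),
  temperature = c(12.1, 12.0)
)
bml_small = data.frame(
  date = ymd_hms("2021-01-01 00:06:00"),
  temperature = 11.8
)

# Label each row with the dataset it came from.
noaa_small$source = "noaa"
bml_small$source = "bml"

# Row-bind the two tables and sort the result by time.
combined = rbind(noaa_small, bml_small)
combined = combined[order(combined$date), ]
combined
```

Because both date columns are POSIXct, sorting interleaves the observations from the two sources in time order.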