20  Dates & Times

Note

Learning Goals

After this lesson, you should be able to:

  • Explain why we use special data structures for dates & times
  • Identify the correct data structure for a given date/time
  • Use the lubridate package to parse dates
  • Use the date format string mini-language
Important

Required Packages

This chapter uses the following packages:

  • dplyr
  • lubridate
  • readr

Chapter 14 explains how to install and load packages.

Given a dataset with dates or times, you might want to:

  • Sort the observations chronologically
  • Extract components, such as the year or month
  • Compute the amount of time between two dates or times

You can do all of these things and more with R, but only if your dates and times are represented by appropriate data types.

20.1 The lubridate Package

As explained in Section 14.3, we recommend the Tidyverse packages for working with dates and times over other packages or R’s built-in functions. There are two:

  • lubridate, the primary package for working with dates and times
  • hms, a package specifically for working with time-of-day values

This chapter only covers lubridate, since it’s more useful in most situations. The package has detailed documentation and a cheatsheet.

Note

A relatively new package, clock, tries to solve some problems with the Date class people have identified over the years. The package is in the r-lib collection of packages, which provide low-level functionality complementary to the Tidyverse. Eventually, it may be preferable to use the classes in clock rather than the Date class, but for now, the Date class is still suitable for most tasks.

You’ll have to install the package if you haven’t already, and then load it:

# install.packages("lubridate")
library("lubridate")

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union

20.2 Parsing Dates

Many popular file formats for data, such as CSV, do not store metadata about which values are dates and times, so these values get typed as strings. Dates and times extracted from text are also naturally strings.

The first step to working with dates and times typed as strings is to parse them: break them down into their components and cast them to more suitable data types. To demonstrate parsing dates, consider a vector of date strings:

date_strings = c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982")
date_strings
[1] "Jan 10, 2021" "Sep 3, 2018"  "Feb 28, 1982"

You can tell that these are dates, but as far as R is concerned, they’re text.

The lubridate package provides a variety of functions to automatically parse strings into date or time objects that R understands. These functions are named with one letter per component of the date or time. The order of the letters must match the order of the components in the string you want to parse.

The example strings have the month (m), then the day (d), and then the year (y), so you can use the mdy function to parse them automatically:

dates = mdy(date_strings)
dates
[1] "2021-01-10" "2018-09-03" "1982-02-28"
class(dates)
[1] "Date"

Notice that R prints the dates differently now. Thanks to the mdy function, they now have class Date, one of R’s built-in classes for representing dates. R recognizes that the dates are in fact dates, so they’re ready to use in an analysis.

There is a complete list of the automatic parsing functions in the lubridate documentation.
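The same naming pattern covers other orderings and datetimes. As a quick sketch (dmy and ymd_hms are two more of lubridate's automatic parsing functions):

```r
library("lubridate")

# Day-month-year strings parse with dmy:
dmy("10 January 2021")          # "2021-01-10"

# Strings with both a date and a time parse with functions like ymd_hms:
ymd_hms("2021-01-10 08:30:15")  # "2021-01-10 08:30:15 UTC"
```

The functions are fairly forgiving about separators and month spellings, so long as the order of the components matches the function name.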

Occasionally, a date or time string may have a format that lubridate can’t parse automatically. In that case, you can use the fast_strptime function to describe the format in detail.

At a minimum, the function requires two arguments: a vector of strings to parse and a format string describing their format. In a format string, a percent sign % followed by a character is called a specifier and represents a component of a date or time. Here are some useful specifiers:

Specifier   Description         Example (2015-01-29 21:32:55)
%Y          4-digit year        2015
%m          2-digit month       01
%d          2-digit day         29
%H          2-digit hour        21
%M          2-digit minute      32
%S          2-digit second      55
%%          literal %           %
%y          2-digit year        15
%B          full month name     January
%b          short month name    Jan

Other characters in the format string don’t have any special meaning. Write the format string so that it matches the format of the dates you want to parse.
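For instance, the example timestamp from the table above can be parsed with a format string built from those specifiers (a minimal sketch):

```r
library("lubridate")

# The format string mirrors the layout of the timestamp exactly,
# including the dashes, spaces, and colons:
ts = fast_strptime("2015-01-29 21:32:55", "%Y-%m-%d %H:%M:%S")
ts  # "2015-01-29 21:32:55 UTC"
```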

Note

Most programming languages use format strings to parse and format dates and times. There isn’t a standard, but the idea seems to have originated with the strptime and strftime functions in the C programming language.

You can find a complete list of specifiers for lubridate in ?fast_strptime.

To demonstrate fast_strptime, let’s try parsing an unusual time format:

time_string = "6 minutes, 32 seconds after 10 o'clock"
time = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock")
time
[1] "0-01-01 10:06:32 UTC"
class(time)
[1] "POSIXlt" "POSIXt" 

R represents datetimes (which combine a date and a time) with the classes POSIXlt and POSIXct. There’s no built-in class to represent times alone, which is why the result in the example above includes a date.

Note

Internally, a POSIXlt object is a list with elements to store different date and time components. On the other hand, a POSIXct object is a single floating point number (type double).

If you want to store your time data in a data frame, use POSIXct objects, since data frames don’t work well with columns of lists.
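You can see the difference with base R's unclass function, which strips an object's class and reveals the underlying data (a quick sketch using base R only):

```r
# A POSIXct object is just a number under the hood:
ct = as.POSIXct("2021-01-10 08:30:00", tz = "UTC")
unclass(ct)          # a single number: seconds since 1970-01-01 UTC

# A POSIXlt object is a list of components:
lt = as.POSIXlt(ct)
names(unclass(lt))   # includes "sec", "min", "hour", "mday", ...
```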

You can control whether fast_strptime returns a POSIXlt or POSIXct object by setting the lt parameter to TRUE or FALSE:

time_ct = fast_strptime(time_string, "%M minutes, %S seconds after %H o'clock",
  lt = FALSE)

class(time_ct)
[1] "POSIXct" "POSIXt" 

20.3 Creating Dates

Another common task is combining the numeric components of a date or time into a single object. You can use the make_date and make_datetime functions to do this. The parameters are named for the different components. For example:

events = make_datetime(
  year = c(2023, 2002),
  month = c(1, 8),
  day = c(10, 16),
  hour = c(8, 14),
  min = c(3, 59)
)
events
[1] "2023-01-10 08:03:00 UTC" "2002-08-16 14:59:00 UTC"

These functions are vectorized, so you can use them to combine the components of many dates or times at once. They’re especially useful for reconstructing dates and times from tabular datasets where each component is stored in a separate column.
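For example, here's a sketch with a small hypothetical data frame (trips) that stores each date component in its own column:

```r
library("lubridate")

# Hypothetical data: one column per date component
trips = data.frame(year = c(2023, 2002), month = c(1, 8), day = c(10, 16))

# Combine the component columns into a single Date column:
trips$date = make_date(trips$year, trips$month, trips$day)
trips$date  # "2023-01-10" "2002-08-16"
```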

20.4 Extracting Components

Once you have dates or times, you might want to extract components from them. For instance, suppose we want to get the years from the dates we parsed earlier. You can use lubridate functions to get or set the components. These functions usually have the same name as the component. For instance, the year function gets the year:

year(dates)
[1] 2021 2018 1982

Similarly, the month function gets the month:

month(dates)
[1] 1 9 2
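Two features worth knowing: many component functions accept extra arguments (for example, month can return month names rather than numbers), and most can also set a component via assignment. A brief sketch:

```r
library("lubridate")

dates = mdy(c("Jan 10, 2021", "Sep 3, 2018", "Feb 28, 1982"))

# Get month names instead of numbers:
month(dates, label = TRUE)  # Jan Sep Feb

# Set a component on a copy: move every date to the first of its month
firsts = dates
day(firsts) = 1
firsts  # "2021-01-01" "2018-09-01" "1982-02-01"
```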

See the lubridate documentation for even more details about what you can do.

20.5 Durations & Periods

Occasionally, you might need to adjust dates or times by adding an offset. For example, suppose we want to add 30 days to the dates from earlier. The duration function creates a duration (class Duration), which represents a fixed amount of time. So to add 30 days:

dates + duration(30, "days")
[1] "2021-02-09" "2018-10-03" "1982-03-30"

What if we want to add 1 month to the dates instead of 30 days? The length of a month varies. The duration function gives the average amount of time in a month, but that won’t be correct for some months. The solution is to use a period (class Period) rather than a duration. The period function creates a period. So to add 1 month (of variable length):

dates + period(1, "month")
[1] "2021-02-10" "2018-10-03" "1982-03-28"

This increments each date by exactly 1 month, regardless of how many days that is.

Tip

There are helper functions for common periods named after the associated unit of time. For example, another way to add a 1-month period to the dates is with the months function:

dates + months(1)
[1] "2021-02-10" "2018-10-03" "1982-03-28"

Similarly, there are helper functions for common durations. These are also named after the associated unit of time, but always begin with the prefix d. So to add a 30-day duration to the dates:

dates + ddays(30)
[1] "2021-02-09" "2018-10-03" "1982-03-30"

20.6 Case Study: CA Parks & Recreation Fleet

The government of California publishes data about its fleet of vehicles on the California Open Data portal. As of March 2025, the dataset includes all non-confidential vehicles owned by agencies from 2015 to 2023. We’ll use a subset of this data to compute how many vehicles the CA Department of Parks and Recreation purchased each year from 2019 to 2023. The dataset is published as a messy CSV, so we’ll need to do some cleaning and parse the dates before we can use it.

Important

Click here to download the CA Parks & Recreation Fleet dataset.

If you haven’t already, we recommend you create a directory for this workshop. In your workshop directory, create a data/ subdirectory. Download and save the dataset in the data/ subdirectory.

Each row in the dataset contains measurements from one vehicle-year combination.

Click here to download the documentation for the columns.

This dataset is a subset of the much larger CA State Fleet dataset.

To get started, read the data set from wherever you saved it:

fleet = read.csv("data/2015-2023_ca_parks_fleet.csv")
head(fleet)
                               agency report_year disposed equipment_number
1 Parks and Recreation, Department of        2015       No            93462
2 Parks and Recreation, Department of        2015       No           100380
3 Parks and Recreation, Department of        2015      Yes           117844
4 Parks and Recreation, Department of        2015       No           115503
5 Parks and Recreation, Department of        2015       No             3554
6 Parks and Recreation, Department of        2015       No           D12628
  asset_category model_year               make_model postal_code
1         Ground       1976              TUCKER 1643       96142
2         Ground       1979        FLEETWOOD HOMETTE       94954
3         Ground       2001          MACHETE CHIPPER       91302
4         Ground       1987       KIT OFFICE TRAILER       95430
5         Ground       1960 MAVERICK MFG CORP WELDER       95430
6         Ground       1991          GEARMORE GE200B       93449
              asset_type weight_class passenger_vehicle payload_rating
1 Construction Equipment                             No             NA
2     Non-Self Propelled                            Yes             NA
3     Non-Self Propelled                            Yes             NA
4     Non-Self Propelled                            Yes             NA
5     Non-Self Propelled   Light Duty               Yes             NA
6     Non-Self Propelled                            Yes             NA
  shipping_weight wheel_type tire_size fuel_type engine_configuration
1              NA                         Diesel            Dedicated
2              NA                                                    
3              NA                       Gasoline            Dedicated
4              NA                                                    
5              NA                                                    
6              NA                                                    
  emissions_type_code              primary_application secondary_application
1                  NA Maintenance of public facilities                      
2                  NA Maintenance of public facilities                      
3                  NA Maintenance of public facilities                      
4                  NA Maintenance of public facilities                      
5                  NA Maintenance of public facilities                      
6                  NA Maintenance of public facilities                      
  acquisition_delivery_date suv_justification pickup_4_by_4_justification
1                  1/1/2001                                              
2                  1/1/1979                                              
3                 1/31/2001                                              
4                  1/1/2001                                              
5                  1/1/2001                                              
6                  1/1/1991                                              
  acquisition_method purchase_price annual_lease_rate acquisition_mileage
1           Purchase           2000                NA               Null 
2           Purchase          11483                NA               Null 
3           Purchase          26870                NA               Null 
4           Purchase          10197                NA               Null 
5           Purchase            500                NA               Null 
6           Purchase           2500                NA               Null 
  disposition_date transferred_to disposition_method disposition_reason
1                                                                      
2                                                                      
3         6/7/2018                                                     
4                                                                      
5                                                                      
6                                                                      
  disposition_mileage disposition_sold_amount total_miles
1                  NA                      NA          NA
2                  NA                      NA          NA
3                  NA                      NA          NA
4                  NA                      NA          NA
5                  NA                      NA          NA
6                  NA                      NA          NA

Since we want to compute the number of acquisitions each year, we’ll focus on two columns: acquisition_method and acquisition_delivery_date. The acquisition_method column indicates whether the vehicle was purchased, donated, transferred from a different agency, or something else. We’re only interested in purchases, so we can filter the others out:

library("dplyr")

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
fleet = filter(fleet, acquisition_method == "Purchase")

In order to determine how many vehicles were purchased each year, we need to extract the year from acquisition_delivery_date. Let’s take a look at this column:

head(fleet$acquisition_delivery_date)
[1] "1/1/2001"  "1/1/1979"  "1/31/2001" "1/1/2001"  "1/1/2001"  "1/1/1991" 

It looks like the dates are in month-day-year format, so we can use lubridate’s mdy function:

dates = mdy(fleet$acquisition_delivery_date)
Warning: 4149 failed to parse.

Uh-oh. According to the warning, lubridate couldn’t parse some of the dates. This means they probably don’t follow the month-day-year format. When lubridate can’t parse a date, it returns a missing value. We can use this to figure out what the dates look like:

is_bad_date = is.na(dates)
bad_dates = fleet$acquisition_delivery_date[is_bad_date]
head(bad_dates, 20)
 [1] ""      ""      ""      ""      ""      ""      ""      ""      "44832"
[10] "44462" "42270" "44454" "42262" "44088" "43719" "43719" "43719" "43718"
[19] "41892" "44078"

Instead of month-day-year dates, they’re numbers like 44832, 44462, and 42270. These numbers probably seem inscrutable, but there is an explanation for them: Microsoft Excel, perhaps the most popular tool for working with tabular data, stores dates by counting days from 31 December 1899. So 1 is 1 January 1900, 32 is 1 February 1900, and so on. Unfortunately, the Excel developers incorrectly assumed 1900 was a leap year, so all of the counts are off by 1 after 28 February 1900. Most modern Excel-compatible spreadsheet programs fix this by counting from 30 December 1899 rather than 31 December, so that only dates before 1 March 1900 have different numbers.

We can convert the numbers in the bad_dates variable into dates by treating them as offsets to 30 December 1899:

start_date = make_date(1899, 12, 30)

day_offsets = as.numeric(bad_dates)
fixed_dates = start_date + days(day_offsets)
head(fixed_dates, 20)
 [1] NA           NA           NA           NA           NA          
 [6] NA           NA           NA           "2022-09-28" "2021-09-23"
[11] "2015-09-23" "2021-09-15" "2015-09-15" "2020-09-14" "2019-09-11"
[16] "2019-09-11" "2019-09-11" "2019-09-10" "2014-09-10" "2020-09-04"

We can insert these back into the dates vector at their original positions. Then we can replace the acquisition_delivery_date column with dates:

dates[is_bad_date] = fixed_dates

fleet$acquisition_delivery_date = dates

With the dates in hand, we can compute the number of purchases from 2019 to 2023. Since each row in the dataset represents a vehicle-year, and we just want to count vehicles, we must first get a subset of unique vehicles. We can do this by grouping on the equipment_number column, which is a unique identifier for each vehicle, and getting the first row in each group:

by_vehicle = group_by(fleet, equipment_number)
vehicles = filter(by_vehicle, row_number() == 1)

We can compute the number of purchases by extracting the years with lubridate’s year function and then counting them with the table function:

table(year(vehicles$acquisition_delivery_date))

1950 1958 1961 1964 1965 1966 1967 1968 1970 1971 1972 1973 1974 1975 1976 1977 
   1    2    1    2    3    2    3    3    1    1    3    5    3    6    9    3 
1978 1979 1980 1981 1982 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 
  10   10    4    7    5    6   12   25   21   28   54   39   17   11   46   78 
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 
  79   74  119  111  164  235  741  140  220   88  233  118  398  342  388  122 
2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 
  97  219  142  139  263  237  219  206  345  254  189  227  192 

It’s unclear whether the dataset contains complete records for 2023, so the count for that year might be too small. It’s also likely that the agency purchases vehicles months or years in advance of their delivery, so these counts are more accurately described as deliveries rather than purchases.

20.7 Case Study: Ocean Temperatures

The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes ocean temperature data, collected by sensor buoys off the coast, on the National Data Buoy Center (NDBC) website. California also has many sensors collecting ocean temperature data that are not administered by the federal government. Data from these sensors is published on the California Ocean Observing Systems (CALOOS) Data portal.

Suppose you’re a researcher who wants to combine ocean temperature data from both sources to use in R. Both publish the data as plain text, but they record dates, times, and temperatures differently, so you need to be careful that the dates and times are parsed correctly.

Important

Download these two 2021 datasets:

  1. 2021_noaa-ndbc_46013.txt, from NOAA buoy 46013, off the coast of Bodega Bay (DOWNLOAD)(source)
  2. 2021_ucdavis_bml_wts.csv, from the UC Davis Bodega Bay Marine Laboratory’s sensors (DOWNLOAD)(source)

The NOAA data has a fixed-width format, which means each column has a fixed width in characters over all rows. The readr package provides a function read_fwf that can automatically guess the column widths and read the data into a data frame. The column names appear in the first row and column units appear in the second row, so read those rows separately:

# install.packages("readr")
library("readr")

noaa_path = "data/ocean_data/2021_noaa-ndbc_46013.txt"
noaa_headers = read_fwf(noaa_path, n_max = 2, guess_max = 1)
Rows: 2 Columns: 18
── Column specification ────────────────────────────────────────────────────────

chr (18): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
noaa = read_fwf(noaa_path, skip = 2)
Rows: 3323 Columns: 18
── Column specification ────────────────────────────────────────────────────────

chr  (4): X2, X3, X4, X5
dbl (14): X1, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(noaa) = as.character(noaa_headers[1, ])
names(noaa)[1] = "YY"

The dates and times for the observations are separated into component columns, and the read_fwf function does not convert some of these to numbers automatically. You can use as.numeric to convert them to numbers:

cols = 2:5
noaa[cols] = lapply(noaa[cols], as.numeric)

Finally, use the make_datetime function to combine the components into date-time objects:

noaa_dt = make_datetime(year = noaa$YY, month = noaa$MM, day = noaa$DD,
  hour = noaa$hh, min = noaa$mm)
noaa$date = noaa_dt
head(noaa_dt)
[1] "2021-01-01 00:00:00 UTC" "2021-01-01 00:10:00 UTC"
[3] "2021-01-01 00:20:00 UTC" "2021-01-01 00:30:00 UTC"
[5] "2021-01-01 00:40:00 UTC" "2021-01-01 00:50:00 UTC"

That takes care of the dates in the NOAA data.

The Bodega Marine Lab data is in CSV format, which you can read with read.csv or the readr package’s read_csv function. The latter is faster and usually better at guessing column types. The column names appear in the first row and the column units appear in the second row. The read_csv function handles the names automatically, but you’ll have to remove the unit row as a separate step:

bml = read_csv("data/ocean_data/2021_ucdavis_bml_wts.csv")
Rows: 87283 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): time, sea_water_temperature, z
dbl (1): sea_water_temperature_qc_agg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bml = bml[-1, ]

The dates and times of the observations were loaded as strings. You can use lubridate’s ymd_hms function to automatically parse them:

bml_dt = ymd_hms(bml$time)
bml$date = bml_dt
head(bml_dt)
[1] "2020-12-31 09:06:00 UTC" "2020-12-31 09:12:00 UTC"
[3] "2020-12-31 09:18:00 UTC" "2020-12-31 09:24:00 UTC"
[5] "2020-12-31 09:30:00 UTC" "2020-12-31 09:36:00 UTC"

Now you have date and time objects for both datasets, so you can combine the two. For example, you could extract the date and water temperature columns from each, create a new column identifying the data source, and then row-bind the datasets together.
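As a sketch of that last step, here's one way to do it with dplyr's bind_rows, using tiny stand-in data frames (noaa_demo and bml_demo) and assuming the NOAA water temperature column is named WTMP, its name in NDBC standard meteorological files:

```r
library("dplyr")
library("lubridate")

# Tiny stand-ins for the noaa and bml data frames built above (assumption:
# the NOAA water temperature column is named WTMP):
noaa_demo = data.frame(date = ymd_hms("2021-01-01 00:00:00"), WTMP = 11.4)
bml_demo = data.frame(date = ymd_hms("2021-01-01 00:06:00"),
  sea_water_temperature = "11.2")

# Keep only the shared columns, tag each row with its source, then row-bind:
noaa_temps = data.frame(date = noaa_demo$date, temp_c = noaa_demo$WTMP,
  source = "noaa")
bml_temps = data.frame(date = bml_demo$date,
  temp_c = as.numeric(bml_demo$sea_water_temperature), source = "bml")
temps = bind_rows(noaa_temps, bml_temps)
temps
```

With a source column in place, you can group by source to compare the two sensors, or sort by date to interleave the observations.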