median(terns$pred_eggs, na.rm = TRUE)[1] 6.5
After this lesson, you should be able to:
summarize and mutate functionsThis chapter uses the following packages:
Section 1.5 explains how to install and load packages.
This chapter explains how to use the dplyr package to summarize columns and groups within data frames. This is a critical skill in data analysis, and a building block for more complicated methods.
Section 2.5.2 showed how to compute the range and mean of a column with R’s built-in range and mean functions. These functions summarize or aggregate the elements in the column, reducing them to a smaller number of values (usually one). For example, suppose we want to compute the median number of predator-related egg mortalities across all sites and years in the least terns data. Then we can use the median function:
median(terns$pred_eggs, na.rm = TRUE)[1] 6.5
There are missing values in the column, so we have to set na.rm = TRUE to get a numerical result (Section 6.5.1 explains more about missing values).
The dplyr package provides functions to make it easier to summarize multiple columns. See the [documentation][dplyr] for a complete list. Perhaps the most useful is the summarize function. With summarize, the equivalent of the code above is:
library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
summarize(terns, median(pred_eggs, na.rm = TRUE)) median(pred_eggs, na.rm = TRUE)
1 6.5
As you can see, the code is actually a bit longer! The summarize function is most useful when you want to summarize multiple columns, because you can select the columns in all of the same ways as with the select function. For example, suppose we want the medians for columns bp_min, bp_max, fl_min, and fl_max:
summarize(
terns,
med_bp_min = median(bp_min, na.rm = TRUE),
med_bp_max = median(bp_max, na.rm = TRUE),
med_fl_min = median(fl_min, na.rm = TRUE),
med_fl_max = median(fl_max, na.rm = TRUE)
) med_bp_min med_bp_max med_fl_min med_fl_max
1 30 38 7 9
Notice that you can set the names of the columns in the resulting data frame through the parameter names (such as med_bp_min) in the call to summarize.
Categorical columns are useful for dividing observations into groups. You can use dplyr’s group_by function to group rows in a data frame by one or more columns.
To demonstrate the group_by function, let’s compute the median number of nests by region for the least terns data. We’ll use the region_3 column. After grouping with group_by, you can use the summarize function to compute summaries of columns, and dplyr will compute a separate result for each group:
by_region = group_by(terns, region_3)
summarize(by_region, median(total_nests, na.rm = TRUE))# A tibble: 6 × 2
region_3 `median(total_nests, na.rm = TRUE)`
<chr> <dbl>
1 ARIZONA 3
2 CENTRAL 17
3 KINGS 1
4 S.F._BAY 35
5 SACRAMENTO 1
6 SOUTHERN 64
The summarize function can compute grouped summaries for multiple columns. For example, suppose we also want the median number of breeding pairs, and want to compute the maximum for both features as well. We can use parameter names to give each summary statistic a unique, easy-to-type name:
summarize(
by_region,
med_total_nests = median(total_nests, na.rm = TRUE),
med_bp_max = median(bp_max, na.rm = TRUE),
max_total_nests = max(total_nests, na.rm = TRUE),
max_bp_max = max(bp_max, na.rm = TRUE)
)# A tibble: 6 × 5
region_3 med_total_nests med_bp_max max_total_nests max_bp_max
<chr> <dbl> <dbl> <int> <dbl>
1 ARIZONA 3 2 3 2
2 CENTRAL 17 17 83 69
3 KINGS 1 1 3 3
4 S.F._BAY 35 32 550 495
5 SACRAMENTO 1 1 2 1
6 SOUTHERN 64 56 1741 1691
Some dplyr functions only make sense when used with grouping. One is the n function, which returns the number of rows in a group. For example, one way to find all year and climate event combinations present in the least terns data is to run:
summarize(
group_by(terns, year, event),
n()
)`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by year and event.
ℹ Output is grouped by year.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(year, event))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
# A tibble: 21 × 3
# Groups: year [21]
year event `n()`
<int> <chr> <int>
1 2000 LA_NINA 29
2 2004 NEUTRAL 34
3 2005 NEUTRAL 31
4 2006 NEUTRAL 33
5 2007 EL_NINO 37
6 2008 LA_NINA 38
7 2009 NEUTRAL 40
8 2010 EL_NINO 40
9 2011 LA_NINA 41
10 2012 LA_NINA 41
# ℹ 11 more rows
From this result we can conclude that there is only one climate event for each year in the dataset.
For tasks like standardizing features on a per-group basis, it’s necessary to compute summaries on groups and then map them back to the original observations. Use the mutate function for this rather than summarize. For example, to compute mean total nests for each year:
by_year = group_by(terns, year)
mutate(
by_year,
year,
mean_total_nests = mean(total_nests, na.rm = TRUE),
# Keep only the mentioned columns, to make the result easier to see.
.keep = "used"
)# A tibble: 791 × 3
# Groups: year [21]
year total_nests mean_total_nests
<int> <int> <dbl>
1 2000 15 183.
2 2000 20 183.
3 2000 312 183.
4 2000 3 183.
5 2000 5 183.
6 2000 9 183.
7 2000 32 183.
8 2000 22 183.
9 2000 73 183.
10 2000 252 183.
# ℹ 781 more rows
The mutate function returns a data frame with the same number of rows as the original data frame, unlike the summarize function, which returns a data frame with the same number of rows as there are groups.
Use summarize if you want to compute grouped aggregates on their own. For example, you might use these as a summary, to make a visualization, or to use in further computations that are not on the original data frame.
Use mutate if you want to compute grouped aggregates matched to the rows of the original data frame.
fl_min) for each year and region combination.year for each site.unique and length to find the number of unique years for each site.