Use the dplyr package to join data based on common columns
Describe the different types of joins
Identify which types of joins to use when faced with a relational dataset
4.1 Relational Datasets
Think about how you would organize restaurant reviews data. Each restaurant has a name, address, phone number, operating hours, and other restaurant level details. Because the number of reviews for each restaurant will vary, we could put the data in a single table with one row for each review. Then the table is at the review level, and we’d have to repeat restaurant level details across all reviews for each restaurant.
The review level table is convenient if we want to analyze reviews, but not so convenient if we want to analyze restaurants. For instance, suppose we want to count how many restaurants open before 10 a.m. In order to avoid counting each restaurant multiple times, we have to reduce the table to one row per restaurant before we compute the count.
The problem with putting the restaurant review data in a single table is that it consists of observations at two different levels: the restaurant level and the review level. From this perspective, an intuitive solution is to put the data in two tables: a “restaurants” table where each row is a restaurant and a “reviews” table where each row is a review. Then we can choose the most suitable table for each question we want to answer.
Each review is associated with a restaurant, and each restaurant is associated with some reviews, so the two tables are related even though they’re separate. We can keep track of the relationship by including a column for restaurant name (or a unique restaurant identifier) in both tables. This makes the data relational: multiple tables where the relationships between them are expressed through columns in common.
Note
Most database software are designed to efficiently store and query relational data, so in computing contexts, database is often synonymous with relational data.
For some questions, we’ll need to use both the restaurants table and the reviews table. For example, suppose we want to count how many restaurants open before 10 a.m. and have at least 1 five-star review. We can use the reviews table to compute the number of five-star reviews for each restaurant, combine these with the restaurants table, and then count restaurants that meet our conditions in the resulting table. Operations that combine two related tables based on columns they have in common are called joins. The two tables are conventionally called the left table and right table.
There are a few different kinds of joins. We’ll use a simplified, fictitious version of the restaurant reviews data to demonstrate some of them. The restaurants table contains names and phone numbers for three restaurants:
restaurants =data.frame(id =c(1, 2, 3),name =c("Alice's Restaurant", "The Original Beef", "The Pie Hole"),phone =c("555-3213", "555-1111", "555-9983"))restaurants
id name phone
1 1 Alice's Restaurant 555-3213
2 2 The Original Beef 555-1111
3 3 The Pie Hole 555-9983
The reviews table contains scores from five restaurant reviews:
The dplyr package provides a wide variety of functions for working with data frames. In this chapter, we’ll focus almost exclusively on the functions for combining data frames, but the other parts of the package are also useful. Like most Tidyverse packages, it comes with detailed documentation and a cheatsheet.
Install the package if you haven’t already, and then load it:
# install.packages("dplyr")library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Note
If you’ve ever used SQL, you’re probably familiar with relational datasets and recognize commands like SELECT, LEFT JOIN, and GROUP BY. The dplyr package borrows some of the ideas of SQL, and provides select, left_join, and group_by functions (among others) that are reminiscent of their SQL counterparts.
If you haven’t used SQL, don’t worry—all of the functions will be explained in detail.
4.3 Inner Joins
An inner join only keeps rows from the left table if they match rows in the right table and vice-versa.
You can use dplyr’s inner_join function to join two data frames with an inner join. The first argument is the left table and the second argument is the right table. Set the by parameter to the name of the column to use to match the rows. Try joining the restaurants table and the reviews table:
inner_join(restaurants, reviews)
Joining with `by = join_by(id)`
id name phone score
1 1 Alice's Restaurant 555-3213 4.7
2 2 The Original Beef 555-1111 3.5
3 2 The Original Beef 555-1111 4.8
4 2 The Original Beef 555-1111 4.0
The inner join keeps rows where id is 1 or 2, since these values appear in both tables. It drops the row where id is 3 in restaurants and the row where id is 4 in reviews. The resulting table has 4 rows and the columns from both data frames.
Tip
If the columns you want to use to join two data frames have different names, set by = join_by(LCOL == RCOL), replacing LCOL with the name of the column in the left table and RCOL with the name of the column in the right table.
4.4 Left & Right Joins
A left join keeps all rows from the left table and only keeps rows from the right table if they match. Missing values fill any spaces where there was no match.
You can use dplyr’s left_join function to join two data frames with a left join. The arguments are the same as for the inner_join function. Try a left join on the restaurant reviews data:
left_join(restaurants, reviews)
Joining with `by = join_by(id)`
id name phone score
1 1 Alice's Restaurant 555-3213 4.7
2 2 The Original Beef 555-1111 3.5
3 2 The Original Beef 555-1111 4.8
4 2 The Original Beef 555-1111 4.0
5 3 The Pie Hole 555-9983 NA
The left join keeps all of the rows from restaurants, and matches rows from reviews when possible. There are no reviews where the id is 3, so score is missing for that row.
A left join is asymmetric, so switching the order of the tables will generally produce a different result:
left_join(reviews, restaurants)
Joining with `by = join_by(id)`
id score name phone
1 4 4.2 <NA> <NA>
2 2 3.5 The Original Beef 555-1111
3 1 4.7 Alice's Restaurant 555-3213
4 2 4.8 The Original Beef 555-1111
5 2 4.0 The Original Beef 555-1111
A right join is equivalent to a left join with the order of the tables switched. Because of this, some relational data tools don’t have a right join command (only a left join command). You can use dplyr’s right_join function to join two data frames with a right join.
4.5 Full Joins
A full join keeps all rows from both tables. Missing values fill any spaces where there was no match.
You can use dplyr’s full_join function to join two data frames with a full join. The arguments are the same as for the inner_join and left_join function. Try a full join on the restaurant reviews data:
full_join(restaurants, reviews)
Joining with `by = join_by(id)`
id name phone score
1 1 Alice's Restaurant 555-3213 4.7
2 2 The Original Beef 555-1111 3.5
3 2 The Original Beef 555-1111 4.8
4 2 The Original Beef 555-1111 4.0
5 3 The Pie Hole 555-9983 NA
6 4 <NA> <NA> 4.2
NoteSee Also
There are a few more kinds of joins. See the dplyr documentation for details.
4.6 Dataset: CA Crash Reporting System
The California Highway Patrol publish data about vehicle crashes in the state as the California Crash Reporting System (CCRS). The CCRS is a relational data set with three tables per year which describe crash events, parties involved, and all injuries, witnesses, and passengers. Let’s use the 2024 CCRS data for Sacramento and Yolo Counties to compute statistics about crashes in those counties.
Important
Click here to download the 2024 Sacramento & Yolo Crash dataset (3 CSV files).
If you haven’t already, we recommend you create a directory for this workshop. In your workshop directory, create a data/ subdirectory. Download and save the dataset in the data/ subdirectory.
NoteDocumentation for the 2024 Sacramento & Yolo Crash Dataset
The dataset consists of three tables:
2024_sac-yolo_crashes.csv, where each row is a crash.
2024_sac-yolo_parties.csv, where each row is a person directly involved in the crash (not a witness or passenger).
2024_sac-yolo_injured-witness-passenger.csv, where each row is an injured person (including drivers), witness, or passenger.
Click here to download the documentation for the source dataset.