2. Untidy & Relational Data

Learning Goals

After this lesson, you should be able to:

- Explain what it means for data to be tidy
- Use Polars to reshape data
- Explain what a relational data set is
- Use Polars to join data based on common columns
- Describe the different types of joins
- Identify which types of joins to use when faced with a relational data set
This chapter is part 2 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis. The major topics of this chapter are how to reshape untidy data sets with pivots and how to combine related data sets with joins.
2.1. Reshaping Untidy Data
The structure of a data set—its shape and organization—has enormous influence on how difficult it will be to analyze, so making structural changes is an important part of the cleaning process. Researchers conventionally arrange tabular data so that each row contains a single observation or case, and each column contains a single kind of measurement or identifier, called a feature.
Important
What constitutes an observation depends on what questions you want to answer with the data. Sometimes you might need to work with a data set at several different levels of observational unit to answer all of your questions. We’ll see an example of this soon.
In 2014, Hadley Wickham refined and formalized the conventions for tabular data by introducing the concept of tidy data. Paraphrasing Wickham, the rules for tidy data are:
1. Every column is a single feature.
2. Every row is a single observation.
3. Every cell is a single value.
These rules ensure that all of the values in a data set are visually organized and are easy to access with indexing operations. They’re also specific enough to make tidiness a convenient standard for functions that operate on tabular data, so many packages for Python and other programming languages are designed from the ground up for working with tidy data.
This section explains how to reshape untidy data into tidy data. While reshaping can seem tricky at first, making sure your data set has the right structure before you begin analysis saves time and frustration in the long run.
2.1.1. An Untidy Data Set
The City of Davis has two bike counters: one is at the intersection of 3rd Street and University Avenue (the 3rd Street bike obelisk) and the other is at the intersection of Loyola Drive and Pole Line Road. The City publishes data from the bike counters online. DataLab combined the City’s 2020 bike counts, aggregated to the day level, with precipitation and wind data from the U.S. National Oceanic and Atmospheric Administration’s weather station at Sacramento Metropolitan Airport (this was the nearest weather station with complete records for 2020). We’ll use this data set to demonstrate how to transform untidy data.
Important
Click here to download the 2020 Davis bike counts data set.
If you haven’t already, we recommend you create a directory for this workshop.
In your workshop directory, create a data/
subdirectory. Download and save
the data set in the data/
subdirectory.
Documentation for the 2020 Davis Bike Counts Data Set
Each row in the data set contains measurements from one date-variable combination.
Column | Description |
---|---|
date | The date of measurement |
variable | What was measured: third (bike count at 3rd Street & University Avenue), loyola (bike count at Loyola Drive & Pole Line Road), prcp (total precipitation), awnd (average wind speed) |
value | The measured value |
The source for the bike counts is the City of Davis’ Bike and Pedestrian Statistics web page. The source for the total precipitation and average wind speed is NOAA’s weather station at Sacramento Metropolitan Airport.
The data set is saved in a Parquet file, which you can read with the pl.read_parquet function:
import polars as pl
bikes = pl.read_parquet("data/2020_davis_bikes.parquet")
bikes.head()
date | variable | value |
---|---|---|
date | str | f64 |
2020-01-01 | "third" | 332.0 |
2020-01-01 | "loyola" | 181.0 |
2020-01-01 | "awnd" | 1.7 |
2020-01-01 | "prcp" | 0.0 |
2020-01-02 | "third" | 357.0 |
This data is not tidy, because it breaks rule 1. The value
column contains
many different features—they even have different units! Soon we’ll reshape
the data set to make it tidy.
Before you reshape a data set, you should also think about what role each column serves:

- Identifiers (or indexes) are labels that distinguish observations from one another. They're often but not always categorical. Examples include names or identification numbers, treatment groups, and dates or times. In the bike counts data, the date column is an identifier.
- Measurements are the values collected for each observation and typically the values of research interest. In the bike counts data, the value column is a measurement.
A clear understanding of which columns are identifiers and which are measurements makes it easier to write the code to reshape.
2.1.2. Rows into Columns
In order to make the Davis bike counts data tidy, the measurements in the value column need to be moved into four separate columns, one for each of the categories in the variable column.
You can use the .pivot
method to pivot a data frame, creating new columns
from values in the rows. This makes the data frame wider (and shorter), so some
data frame packages call this operation pivot_wider
. Let’s pivot the bikes
data frame on the variable
column to create four new columns filled with
values from the value
column.
The .pivot method's most important parameters are:

- on – The column that contains names for the new columns.
- values – The column(s) that contains values for the new columns.
- index – The identifier columns, which are not pivoted. This defaults to all columns except those in on and values.
Here’s how to use the method to make bikes
tidy:
bikes2 = bikes.pivot(on = "variable", values = "value")
bikes2.head()
date | third | loyola | awnd | prcp |
---|---|---|---|---|
date | f64 | f64 | f64 | f64 |
2020-01-01 | 332.0 | 181.0 | 1.7 | 0.0 |
2020-01-02 | 357.0 | 401.0 | 3.0 | 0.0 |
2020-01-03 | 426.0 | 504.0 | 1.0 | 0.0 |
2020-01-04 | 337.0 | 475.0 | 3.2 | 2.0 |
2020-01-05 | 370.0 | 563.0 | 2.5 | 0.0 |
The method automatically removes duplicate values from the date column, so each date appears in a single row and the original correspondence with the pivoted values is maintained.
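As a quick check (a sketch, not part of the original workshop), compare the shapes before and after the pivot; if every date has all four measurements, bikes2 should have a quarter as many rows as bikes:

# bikes has one row per date-variable pair; bikes2 has one row per date.
print(bikes.shape)
print(bikes2.shape)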
The new bikes2
data frame contains all of the data from bikes
, but now the
measurements for each date share a row. In other words, the observational units
for bikes2
are dates, which is convenient for investigating how individual
features change over time, as well as making same-time comparisons between
features. To illustrate this, we can use Polars’ .plot.scatter
method to make
a scatter plot of the bike counts at 3rd Street against the counts at Loyola
Drive:
bikes2.plot.scatter(x = "third", y = "loyola")
The plot shows that Loyola Drive occasionally has days with much higher traffic than 3rd Street, but it’s difficult to tell whether 3rd or Loyola is typically busier (that is, whether there are more points above or below the \(y = x\) line).
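One way to get a rough numeric answer (a minimal sketch, not part of the original analysis) is to compute the fraction of days on which the Loyola Drive count exceeds the 3rd Street count:

# Fraction of days where Loyola Drive is busier than 3rd Street.
bikes2.select(
    (pl.col("loyola") > pl.col("third")).mean().alias("loyola_busier_frac")
)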
2.1.3. Columns into Rows
Suppose we want a better answer to the question of whether Loyola Drive or 3rd Street tends to be busier. One way to get it is to make a line plot of the counts for each site over time. This treats each site as a group within the data, so we can make comparisons between groups.
Comparing groups in a data frame is generally easier when each row corresponds
to an observation from one group. Let’s reshape the bikes2
data frame so that
the observational units are date-site combinations. To do this, the third
and
loyola
columns need to be transformed into two new columns: one for
measurements (the counts) and one for identifiers (the sites). It might help to
visualize this as stacking the two separate columns third
and loyola
together, one on top of the other, and then adding a second column with the
corresponding site names.
You can use the .unpivot
method to unpivot a data frame, creating new
rows from values in the columns. This is the inverse of a pivot. It makes the
data frame longer (and narrower), so some data frame packages call this
operation pivot_longer
. We’ll unpivot the bikes2
data frame on the third
and loyola
columns.
The .unpivot method's parameters are:

- on – The columns to stack into a new column; the names of these columns will also go into a new column.
- index – The identifier columns, which are not unpivoted.
- variable_name – Name for the new identifier column.
- value_name – Name for the new measurement column.
For .unpivot
, it’s important to set both on
and index
, since any columns
you don’t include in one or the other will be dropped.
The code to unpivot bikes2
is:
bikes3 = bikes2.unpivot(
on = ["third", "loyola"],
index = ["date", "prcp", "awnd"],
variable_name = "site",
value_name = "count"
)
bikes3.head()
date | prcp | awnd | site | count |
---|---|---|---|---|
date | f64 | f64 | str | f64 |
2020-01-01 | 0.0 | 1.7 | "third" | 332.0 |
2020-01-02 | 0.0 | 3.0 | "third" | 357.0 |
2020-01-03 | 0.0 | 1.0 | "third" | 426.0 |
2020-01-04 | 2.0 | 3.2 | "third" | 337.0 |
2020-01-05 | 0.0 | 2.5 | "third" | 370.0 |
For bikes3
, the observational units are date-site combinations, as planned.
This is convenient for comparing the two sites to each other with statistics
and visualizations. We can use Polars’ .plot.line
method to make a line plot
of the counts for the two sites:
plot = bikes3.plot.line(x = "date", y = "count", color = "site")
# Make the plot 600 pixels wide.
plot.properties(width = 600)
From this plot, we can see that Loyola Drive was generally busier for the first 3 months of 2020. Traffic dropped at both sites in mid-March, probably due to the COVID-19 pandemic. The drop was sharper at Loyola Drive than 3rd Street, so 3rd Street was generally busier for the remaining 9 months of 2020.
Note
We didn’t use prcp
and awnd
(and they don’t differ between sites anyway), so
we could’ve let .unpivot
drop them. The resulting data frame would have the
same observational units as bikes3
, but would also be a subset of the
original bikes
data frame (albeit with different column names).
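For instance (a sketch of that alternative, not shown above), unpivoting with only date as the index drops the weather columns:

# Same reshape, but prcp and awnd are dropped because they appear in
# neither on nor index.
bikes2.unpivot(
    on = ["third", "loyola"],
    index = "date",
    variable_name = "site",
    value_name = "count"
)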
2.1.4. Case Study: SMART Ridership
Sonoma-Marin Area Rail Transit (SMART) is a relatively new single-line passenger light rail service between the San Francisco Bay and Santa Rosa. They publish data about monthly ridership online, but the format is slightly messy. Let’s clean and reshape the data in order to make a plot of ridership over time.
Important
Click here to download the SMART Ridership data set (version 2025-02).
If you haven’t already, we recommend you create a directory for this workshop.
In your workshop directory, create a data/
subdirectory. Download and save
the data set in the data/
subdirectory.
Documentation for the SMART Ridership Data Set
The source for the data set is the SMART Ridership Reports web page.
The data set is saved as a Microsoft Excel file. Before reading an Excel file, it’s a good idea to manually inspect it with spreadsheet software to figure out how the data are organized. The SMART data set contains two tables on the left side of the first sheet: one for total monthly ridership and one for average weekday ridership (by month). Let’s focus on the total monthly ridership table.
You can use Polars’ pl.read_excel
function to read sheets from an Excel file.
Important
The pl.read_excel function depends on the fastexcel package. Make sure to install the package (for example, with python -m pip install fastexcel) if you haven't already.
Use pl.read_excel
to read the SMART Ridership data:
smart = pl.read_excel("data/2025-02_smart_ridership.xlsx")
smart.head()
Sonoma-Marin Area Rail Transit (SMART) Ridership | __UNNAMED__1 | __UNNAMED__2 | __UNNAMED__3 | __UNNAMED__4 | __UNNAMED__5 | __UNNAMED__6 | __UNNAMED__7 | __UNNAMED__8 | __UNNAMED__23 | __UNNAMED__24 | __UNNAMED__25 | __UNNAMED__26 | __UNNAMED__27 | __UNNAMED__28 | __UNNAMED__29 | __UNNAMED__30 | __UNNAMED__31 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
"Total Monthly Ridership" | null | null | null | null | null | null | null | null | "Bicycles on SMART" | null | null | null | null | null | null | null | null |
"Month" | "FY18" | "FY19" | "FY20" | "FY21" | "FY22" | "FY23" | "FY24" | "FY25" | "Month" | "FY18" | "FY19" | "FY20" | "FY21" | "FY22" | "FY23" | "FY24" | "FY25" |
"Jul" | "-" | "63864" | "62851" | "9427" | "24627" | "43752" | "65779" | "88022" | "Jul" | "-" | "6543" | "6666" | "2143" | "3999" | "7310" | "9575" | "10953" |
"Aug" | "54484" | "74384" | "65352" | "8703" | "25020" | "48278" | "72171" | "91894" | "Aug" | "528" | "7772" | "7999" | "1887" | "4297" | "8775" | "11276" | "13046" |
"Sep" | "65019" | "62314" | "62974" | "8910" | "27967" | "49134" | "68506" | "92855" | "Sep" | "6147" | "7249" | "8032" | "1786" | "4733" | "9033" | "10453" | "14334" |
The total monthly ridership table corresponds to rows 1:14
and the first 9
columns:
smart = smart[1:14, :9]
smart
Sonoma-Marin Area Rail Transit (SMART) Ridership | __UNNAMED__1 | __UNNAMED__2 | __UNNAMED__3 | __UNNAMED__4 | __UNNAMED__5 | __UNNAMED__6 | __UNNAMED__7 | __UNNAMED__8 |
---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | str | str |
"Month" | "FY18" | "FY19" | "FY20" | "FY21" | "FY22" | "FY23" | "FY24" | "FY25" |
"Jul" | "-" | "63864" | "62851" | "9427" | "24627" | "43752" | "65779" | "88022" |
"Aug" | "54484" | "74384" | "65352" | "8703" | "25020" | "48278" | "72171" | "91894" |
"Sep" | "65019" | "62314" | "62974" | "8910" | "27967" | "49134" | "68506" | "92855" |
"Oct" | "57453" | "65492" | "57222" | "9851" | "26997.5" | "59322" | "70807" | "96599" |
… | … | … | … | … | … | … | … | … |
"Feb" | "54797" | "51130" | "71676" | "7412" | "26652" | "49724" | "62090" | "81731" |
"Mar" | "57312" | "58091" | "33624" | "9933" | "35291" | "53622" | "67421" | null |
"Apr" | "56631" | "60256" | "4571" | "11908" | "34258" | "58551" | "80340" | null |
"May" | "59428" | "64036" | "5308" | "13949" | "38655" | "65416" | "85448" | null |
"Jun" | "61828" | "55700" | "8386" | "20469" | "41525" | "67162" | "80434" | null |
The first row is a header, so let’s use it to set the column names and remove
it from the data frame. You can get a row as a tuple (rather than a data frame)
with the .row
method:
smart.columns = smart.row(0)
smart = smart[1:, :]
smart
Month | FY18 | FY19 | FY20 | FY21 | FY22 | FY23 | FY24 | FY25 |
---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | str | str |
"Jul" | "-" | "63864" | "62851" | "9427" | "24627" | "43752" | "65779" | "88022" |
"Aug" | "54484" | "74384" | "65352" | "8703" | "25020" | "48278" | "72171" | "91894" |
"Sep" | "65019" | "62314" | "62974" | "8910" | "27967" | "49134" | "68506" | "92855" |
"Oct" | "57453" | "65492" | "57222" | "9851" | "26997.5" | "59322" | "70807" | "96599" |
"Nov" | "56125" | "52774" | "64966" | "8145" | "26575" | "51383" | "65445" | "78132" |
… | … | … | … | … | … | … | … | … |
"Feb" | "54797" | "51130" | "71676" | "7412" | "26652" | "49724" | "62090" | "81731" |
"Mar" | "57312" | "58091" | "33624" | "9933" | "35291" | "53622" | "67421" | null |
"Apr" | "56631" | "60256" | "4571" | "11908" | "34258" | "58551" | "80340" | null |
"May" | "59428" | "64036" | "5308" | "13949" | "38655" | "65416" | "85448" | null |
"Jun" | "61828" | "55700" | "8386" | "20469" | "41525" | "67162" | "80434" | null |
The FY
columns need to be cast to floats, and FY18
uses a hyphen -
to
indicate a missing value. We can use the .replace
method to replace the
hyphen with a missing value and then cast the columns:
smart = smart.with_columns(
pl.exclude("Month").replace("-", None).cast(pl.Float64)
)
smart.head()
Month | FY18 | FY19 | FY20 | FY21 | FY22 | FY23 | FY24 | FY25 |
---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"Jul" | null | 63864.0 | 62851.0 | 9427.0 | 24627.0 | 43752.0 | 65779.0 | 88022.0 |
"Aug" | 54484.0 | 74384.0 | 65352.0 | 8703.0 | 25020.0 | 48278.0 | 72171.0 | 91894.0 |
"Sep" | 65019.0 | 62314.0 | 62974.0 | 8910.0 | 27967.0 | 49134.0 | 68506.0 | 92855.0 |
"Oct" | 57453.0 | 65492.0 | 57222.0 | 9851.0 | 26997.5 | 59322.0 | 70807.0 | 96599.0 |
"Nov" | 56125.0 | 52774.0 | 64966.0 | 8145.0 | 26575.0 | 51383.0 | 65445.0 | 78132.0 |
There’s still a lot of cleaning to do. The identifiers in this data set are the months and years, and they’re split between the row and column names. Each row contains data from several different years, so the data set is not tidy. In addition, the years are indicated in fiscal years (FY), which begin in July rather than January, so some of the years need to be adjusted.
To make the data set tidy, it needs to be reshaped so that the values in the
various fiscal year columns are all in one column. In other words, the data set
needs to be unpivoted (Section 2.1.3) on all of the FY
columns.
Listing all of the columns would be tedious, but fortunately Polars provides a
shortcut: the pl.selectors
namespace, which is conventionally imported as
cs
(for “column selectors”), provides many helpful functions for selecting
columns. You can use the cs.starts_with
function to select all columns that
start with a string. Let’s try this out in the code to unpivot the FY
columns:
import polars.selectors as cs
smart = smart.unpivot(
on = cs.starts_with("FY"),
index = "Month",
variable_name = "fiscal_year",
value_name = "count"
)
smart.head()
Month | fiscal_year | count |
---|---|---|
str | str | f64 |
"Jul" | "FY18" | null |
"Aug" | "FY18" | 54484.0 |
"Sep" | "FY18" | 65019.0 |
"Oct" | "FY18" | 57453.0 |
"Nov" | "FY18" | 56125.0 |
In order to use the months and years in the data, we need to convert them to
dates. As a first step towards this, we can replace the FY
prefix in the new
fiscal_year
column with 20
and cast the column to integers. We can use the
.str.replace
method to do the replacement:
smart = smart.with_columns(
pl.col("fiscal_year").str.replace("FY", "20").cast(pl.Int64)
)
smart.head()
Month | fiscal_year | count |
---|---|---|
str | i64 | f64 |
"Jul" | 2018 | null |
"Aug" | 2018 | 54484.0 |
"Sep" | 2018 | 65019.0 |
"Oct" | 2018 | 57453.0 |
"Nov" | 2018 | 56125.0 |
Next, we can use the .str.to_date
and .dt.month
methods to create a new
column of month numbers:
smart = smart.with_columns(
month_num = pl.col("Month").str.to_date("%b").dt.month()
)
smart.head()
Month | fiscal_year | count | month_num |
---|---|---|---|
str | i64 | f64 | i8 |
"Jul" | 2018 | null | 7 |
"Aug" | 2018 | 54484.0 | 8 |
"Sep" | 2018 | 65019.0 | 9 |
"Oct" | 2018 | 57453.0 | 10 |
"Nov" | 2018 | 56125.0 | 11 |
Now we need to transform the fiscal years in the fiscal_year
column into
calendar years. A SMART fiscal year extends from July to the following June and
is named after the calendar year at the end of the fiscal year. So from July to
December, the calendar year is the fiscal year minus 1. We can use the pl.when function to conditionally subtract 1 from the fiscal year:
smart = smart.with_columns(
cal_year =
pl.col("fiscal_year") +
pl.when(pl.col("month_num") >= 7).then(-1).otherwise(0)
)
smart.head()
Month | fiscal_year | count | month_num | cal_year |
---|---|---|---|---|
str | i64 | f64 | i8 | i64 |
"Jul" | 2018 | null | 7 | 2017 |
"Aug" | 2018 | 54484.0 | 8 | 2017 |
"Sep" | 2018 | 65019.0 | 9 | 2017 |
"Oct" | 2018 | 57453.0 | 10 | 2017 |
"Nov" | 2018 | 56125.0 | 11 | 2017 |
Finally, we can use the pl.date
function to construct dates from the
cal_year
and month_num
columns:
smart = smart.with_columns(date = pl.date("cal_year", "month_num", 1))
smart.head()
Month | fiscal_year | count | month_num | cal_year | date |
---|---|---|---|---|---|
str | i64 | f64 | i8 | i64 | date |
"Jul" | 2018 | null | 7 | 2017 | 2017-07-01 |
"Aug" | 2018 | 54484.0 | 8 | 2017 | 2017-08-01 |
"Sep" | 2018 | 65019.0 | 9 | 2017 | 2017-09-01 |
"Oct" | 2018 | 57453.0 | 10 | 2017 | 2017-10-01 |
"Nov" | 2018 | 56125.0 | 11 | 2017 | 2017-11-01 |
With the dates in the date
column and the counts in the count
column, we
have everything we need to make a plot of SMART ridership over time. We can use
the .plot.line
method to make the plot:
plot = smart.plot.line(x = "date", y = "count")
# Make the plot 600 pixels wide.
plot.properties(width = 600)
Notice the huge drop (more than 90%) in April 2020 due to the COVID-19 pandemic!
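To quantify the drop (a quick sketch), compare the April counts across years:

# April ridership by calendar year; 2020 is a small fraction of 2019.
smart.filter(pl.col("month_num") == 4).select("cal_year", "count").sort("cal_year")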
2.2. Working with Relational Data
Think about how you would organize restaurant reviews data. Each restaurant has a name, address, phone number, operating hours, and other restaurant-level details. Because the number of reviews for each restaurant varies, we could put the data in a single table with one row for each review. Then the table is at the review level, and we'd have to repeat the restaurant-level details across all reviews for each restaurant.

The review-level table is convenient if we want to analyze reviews, but not so convenient if we want to analyze restaurants. For instance, suppose we want to count how many restaurants open before 10 a.m. To avoid counting each restaurant multiple times, we have to reduce the table to one row per restaurant before we compute the count.
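To make this concrete, here's a minimal sketch with a hypothetical single-table layout (restaurant, opens_hour, and stars are made-up column names):

# Hypothetical review-level table: one row per review, with
# restaurant-level details repeated across reviews.
flat = pl.DataFrame({
    "restaurant": ["A", "A", "B", "C", "C", "C"],
    "opens_hour": [8, 8, 11, 9, 9, 9],
    "stars": [5, 4, 3, 5, 2, 4],
})

# Reduce to one row per restaurant before counting, so a restaurant
# isn't counted once per review.
flat.unique(subset = "restaurant").filter(pl.col("opens_hour") < 10).height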
The problem with putting the restaurant review data in a single table is that it consists of observations at two different levels: the restaurant level and the review level. From this perspective, an intuitive solution is to put the data in two tables: a “restaurants” table where each row is a restaurant and a “reviews” table where each row is a review. Then we can choose the most suitable table for each question we want to answer.
Each review is associated with a restaurant, and each restaurant is associated with some reviews, so the two tables are related even though they’re separate. We can keep track of the relationship by including a column for restaurant name (or a unique restaurant identifier) in both tables. This makes the data relational: multiple tables where the relationships between them are expressed through columns in common.
Note
Most database software is designed to efficiently store and query relational data, so in computing contexts, database is often synonymous with relational data.
For some questions, we’ll need to use both the restaurants table and the reviews table. For example, suppose we want to count how many restaurants open before 10 a.m. and have at least 1 five-star review. We can use the reviews table to compute the number of five-star reviews for each restaurant, combine these with the restaurants table, and then count restaurants that meet our conditions in the resulting table. Operations that combine two related tables based on columns they have in common are called joins. The two tables are conventionally called the left table and right table.
There are a few different kinds of joins. We’ll use a simplified, fictitious
version of the restaurant reviews data to demonstrate some of them. The
restaurants
table contains names and phone numbers for three restaurants:
restaurants = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice's Restaurant", "The Original Beef", "The Pie Hole"],
"phone": ["555-3213", "555-1111", "555-9983"]
})
restaurants
id | name | phone |
---|---|---|
i64 | str | str |
1 | "Alice's Restaurant" | "555-3213" |
2 | "The Original Beef" | "555-1111" |
3 | "The Pie Hole" | "555-9983" |
The reviews
table contains scores from five restaurant reviews:
reviews = pl.DataFrame({
"id": [4, 2, 1, 2, 2],
"score": [4.2, 3.5, 4.7, 4.8, 4.0],
})
reviews
id | score |
---|---|
i64 | f64 |
4 | 4.2 |
2 | 3.5 |
1 | 4.7 |
2 | 4.8 |
2 | 4.0 |
2.2.1. Inner Joins
An inner join only keeps rows from the left table if they match rows in the right table and vice-versa.
You can use the .join
method to join two data frames with Polars. The data
frame on the left side of .join
is the left table. The right table is the
first argument to .join
. The .join
method also requires an argument for the
on
parameter, which should be the name of the column to use to match the two
data frames. By default, .join
does an inner join.
Try joining the restaurants
table and the reviews
table:
restaurants.join(reviews, on = "id")
id | name | phone | score |
---|---|---|---|
i64 | str | str | f64 |
2 | "The Original Beef" | "555-1111" | 3.5 |
1 | "Alice's Restaurant" | "555-3213" | 4.7 |
2 | "The Original Beef" | "555-1111" | 4.8 |
2 | "The Original Beef" | "555-1111" | 4.0 |
The inner join keeps rows where id
is 1
or 2
, since these values appear
in both tables. It drops the row where id
is 3
in restaurants
and the row
where id
is 4
in reviews
. The resulting table has 4 rows and the columns
from both data frames.
Tip
If the column you want to use to join two data frames has a different name in
each one, set the left_on
and right_on
parameters instead of on
.
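For example (a sketch using a hypothetical column name), if the reviews table called its key restaurant_id instead of id:

# Hypothetical: rename the key to simulate mismatched column names.
reviews2 = reviews.rename({"id": "restaurant_id"})
restaurants.join(reviews2, left_on = "id", right_on = "restaurant_id")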
2.2.2. Left & Right Joins
A left join keeps all rows from the left table and only keeps rows from the right table if they match. Missing values fill any spaces where there was no match.
You can do a left join with the .join
method by setting how = "left"
. Try a
left join on the restaurant reviews data:
restaurants.join(reviews, on = "id", how = "left")
id | name | phone | score |
---|---|---|---|
i64 | str | str | f64 |
1 | "Alice's Restaurant" | "555-3213" | 4.7 |
2 | "The Original Beef" | "555-1111" | 3.5 |
2 | "The Original Beef" | "555-1111" | 4.8 |
2 | "The Original Beef" | "555-1111" | 4.0 |
3 | "The Pie Hole" | "555-9983" | null |
The left join keeps all of the rows from restaurants
, and matches rows from
reviews
when possible. There are no reviews where the id
is 3
, so score
is missing for that row.
A left join is asymmetric, so switching the order of the tables will generally produce a different result:
reviews.join(restaurants, on = "id", how = "left")
id | score | name | phone |
---|---|---|---|
i64 | f64 | str | str |
4 | 4.2 | null | null |
2 | 3.5 | "The Original Beef" | "555-1111" |
1 | 4.7 | "Alice's Restaurant" | "555-3213" |
2 | 4.8 | "The Original Beef" | "555-1111" |
2 | 4.0 | "The Original Beef" | "555-1111" |
A right join is equivalent to a left join with the order of the tables
switched. Because of this, some relational data tools don’t have a right join
command (only a left join command). You can do a right join with the .join
method by setting how = "right"
.
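As a quick sketch, the right join below should match the left join above, up to the order of the columns:

# Equivalent to reviews.join(restaurants, on = "id", how = "left"),
# up to column order.
restaurants.join(reviews, on = "id", how = "right")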
2.2.3. Full Joins
A full join keeps all rows from both tables. Missing values fill any spaces where there was no match.
You can do a full join with the .join
method by setting how = "full"
. Try a
full join on the restaurant reviews data:
restaurants.join(reviews, on = "id", how = "full")
id | name | phone | id_right | score |
---|---|---|---|---|
i64 | str | str | i64 | f64 |
null | null | null | 4 | 4.2 |
2 | "The Original Beef" | "555-1111" | 2 | 3.5 |
1 | "Alice's Restaurant" | "555-3213" | 1 | 4.7 |
2 | "The Original Beef" | "555-1111" | 2 | 4.8 |
2 | "The Original Beef" | "555-1111" | 2 | 4.0 |
3 | "The Pie Hole" | "555-9983" | null | null |
See also
There are a few more kinds of joins. The Polars User Guide has a complete list with examples.
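Returning to the motivating question (a sketch, not part of the original text), a join followed by a group-by summarizes reviews per restaurant while keeping restaurants that have no reviews:

# Number of reviews and average score per restaurant; The Pie Hole
# has no reviews, so its count is 0 and its average is null.
(
    restaurants
    .join(reviews, on = "id", how = "left")
    .group_by("id", "name")
    .agg(
        n_reviews = pl.col("score").count(),
        avg_score = pl.col("score").mean(),
    )
)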
2.2.4. Case Study: CA Crash Reporting System
The California Highway Patrol publishes data about vehicle crashes in the state as the California Crash Reporting System (CCRS). The CCRS is a relational data set with three tables per year, which describe crash events, the parties involved, and all injuries, witnesses, and passengers. Let's use the 2024 CCRS data for Sacramento and Yolo Counties to compute statistics about crashes in those counties.
Important
Click here to download the 2024 Sacramento & Yolo Crash data set (3 CSV files).
If you haven’t already, we recommend you create a directory for this workshop.
In your workshop directory, create a data/
subdirectory. Download and save
the data set in the data/
subdirectory.
Documentation for the 2024 Sacramento & Yolo Crash Data Set
The data set consists of three tables:
- 2024_sac-yolo_crashes.csv, where each row is a crash.
- 2024_sac-yolo_parties.csv, where each row is a person directly involved in the crash (not a witness or passenger).
- 2024_sac-yolo_injured-witness-passenger.csv, where each row is an injured person (including drivers), witness, or passenger.
Click here to download the documentation for the source data set.
This data set is a subset of the much larger CA Crash Reporting System data set.
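As a starting point (a sketch under assumptions: the file names follow the list above, and crash_id is a hypothetical stand-in for whatever crash identifier the tables actually share), you might load the tables like this:

# Load the three related tables.
crashes = pl.read_csv("data/2024_sac-yolo_crashes.csv")
parties = pl.read_csv("data/2024_sac-yolo_parties.csv")
injured = pl.read_csv("data/2024_sac-yolo_injured-witness-passenger.csv")

# A join on the shared crash identifier would then link parties to
# crashes; check the documentation for the real column name:
# crashes.join(parties, on = "crash_id", how = "left")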