2. Untidy & Relational Data

Learning Goals

After this lesson, you should be able to:

- Explain what it means for data to be tidy
- Use Polars to reshape data
- Explain what a relational data set is
- Use Polars to join data based on common columns
- Describe the different types of joins
- Identify which types of joins to use when faced with a relational data set
This chapter is part 2 (of 2) of Cleaning & Reshaping Data, a workshop series about how to prepare data for analysis. The major topics of this chapter are how to reshape untidy data sets with pivots and how to combine related data sets with joins.
2.1. Reshaping Untidy Data
The structure of a data set—its shape and organization—has enormous influence on how difficult it will be to analyze, so making structural changes is an important part of the cleaning process. Researchers conventionally arrange tabular data so that each row contains a single observation or case, and each column contains a single kind of measurement or identifier, called a feature.
Important
What constitutes an observation depends on what questions you want to answer with the data. Sometimes you might need to work with a data set at several different levels of observational unit to answer all of your questions. We’ll see an example of this soon.
In 2014, Hadley Wickham refined and formalized the conventions for tabular data by introducing the concept of tidy data. Paraphrasing Wickham, the rules for tidy data are:
1. Every column is a single feature.
2. Every row is a single observation.
3. Every cell is a single value.
These rules ensure that all of the values in a data set are visually organized and are easy to access with indexing operations. They’re also specific enough to make tidiness a convenient standard for functions that operate on tabular data, so many packages for Python and other programming languages are designed from the ground up for working with tidy data.
This section explains how to reshape untidy data into tidy data. While reshaping can seem tricky at first, making sure your data set has the right structure before you begin analysis saves time and frustration in the long run.
2.1.1. An Untidy Data Set
The City of Davis has two bike counters: one is at the intersection of 3rd Street and University Avenue (the 3rd Street bike obelisk) and the other is at the intersection of Loyola Drive and Pole Line Road. The City publishes data from the bike counters online. DataLab combined the City’s 2020 bike counts, aggregated to the day level, with precipitation and wind data from the U.S. National Oceanic and Atmospheric Administration’s weather station at Sacramento Metropolitan Airport (this was the nearest weather station with complete records for 2020). We’ll use this data set to demonstrate how to transform untidy data.
Important
Click here to download the 2020 Davis bike counts data set.
If you haven’t already, we recommend you create a directory for this workshop.
In your workshop directory, create a data/
subdirectory. Download and save
the data set in the data/
subdirectory.
Documentation for the 2020 Davis Bike Counts Data Set
Each row in the data set contains measurements from one date-variable combination.
Column | Description |
---|---|
date | The date of measurement |
variable | What was measured: third (bike count at 3rd Street & University Avenue), loyola (bike count at Loyola Drive & Pole Line Road), prcp (total precipitation), awnd (average wind speed) |
value | The measured value |
The source for the bike counts is the City of Davis’ Bike and Pedestrian Statistics web page. The source for the total precipitation and average wind speed is NOAA’s weather station at Sacramento Metropolitan Airport.
The data set is saved in a Parquet file, which you can read with the pl.read_parquet function:
import polars as pl
bikes = pl.read_parquet("data/2020_davis_bikes.parquet")
bikes.head()
date | variable | value |
---|---|---|
date | str | f64 |
2020-01-01 | "third" | 332.0 |
2020-01-01 | "loyola" | 181.0 |
2020-01-01 | "awnd" | 1.7 |
2020-01-01 | "prcp" | 0.0 |
2020-01-02 | "third" | 357.0 |
This data is not tidy, because it breaks rule 1. The value
column contains
many different features—they even have different units! Soon we’ll reshape
the data set to make it tidy.
Before you reshape a data set, you should also think about what role each column serves:

- Identifiers (or indexes) are labels that distinguish observations from one another. They're often but not always categorical. Examples include names or identification numbers, treatment groups, and dates or times. In the bike counts data, the date column is an identifier.
- Measurements are the values collected for each observation and typically the values of research interest. In the bike counts data, the value column is a measurement.
A clear understanding of which columns are identifiers and which are measurements makes it easier to write the code to reshape.
2.1.2. Rows into Columns
In order to make the Davis bike counts data tidy, the measurements in the value column need to be moved into four separate columns, one for each of the categories in the variable column.
You can use the .pivot
method to pivot a data frame, creating new columns
from values in the rows. This makes the data frame wider (and shorter), so some
data frame packages call this operation pivot_wider
. Let’s pivot the bikes
data frame on the variable
column to create four new columns filled with
values from the value
column.
The .pivot method's most important parameters are:

- on – The column that contains names for the new columns.
- values – The column(s) that contains values for the new columns.
- index – The identifier columns, which are not pivoted. This defaults to all columns except those in on and values.
Here’s how to use the method to make bikes
tidy:
bikes2 = bikes.pivot(on = "variable", values = "value")
bikes2.head()
date | third | loyola | awnd | prcp |
---|---|---|---|---|
date | f64 | f64 | f64 | f64 |
2020-01-01 | 332.0 | 181.0 | 1.7 | 0.0 |
2020-01-02 | 357.0 | 401.0 | 3.0 | 0.0 |
2020-01-03 | 426.0 | 504.0 | 1.0 | 0.0 |
2020-01-04 | 337.0 | 475.0 | 3.2 | 2.0 |
2020-01-05 | 370.0 | 563.0 | 2.5 | 0.0 |
The method automatically removes duplicate values from the date column, so each date appears in a single row and the original correspondence with the pivoted values is maintained.
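As a quick check (a sketch, not part of the original workshop), compare the shapes before and after the pivot; if every date has all four measurements, bikes2 should have a quarter as many rows as bikes:

# bikes has one row per date-variable pair; bikes2 has one row per date.
print(bikes.shape)
print(bikes2.shape)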
The new bikes2
data frame contains all of the data from bikes
, but now the
measurements for each date share a row. In other words, the observational units
for bikes2
are dates, which is convenient for investigating how individual
features change over time, as well as making same-time comparisons between
features. To illustrate this, we can use Polars’ .plot.scatter
method to make
a scatter plot of the bike counts at 3rd Street against the counts at Loyola
Drive:
bikes2.plot.scatter(x = "third", y = "loyola")
The plot shows that Loyola Drive occasionally has days with much higher traffic than 3rd Street, but it’s difficult to tell whether 3rd or Loyola is typically busier (that is, whether there are more points above or below the \(y = x\) line).
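One way to get a rough numeric answer (a minimal sketch, not part of the original analysis) is to compute the fraction of days on which the Loyola Drive count exceeds the 3rd Street count:

# Fraction of days where Loyola Drive is busier than 3rd Street.
bikes2.select(
    (pl.col("loyola") > pl.col("third")).mean().alias("loyola_busier_frac")
)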
2.1.3. Columns into Rows
Suppose we want a better answer to the question of whether Loyola Drive or 3rd Street tends to be busier. One way to get it is to make a line plot of the counts for each site over time. This treats each site as a group within the data, so we can make comparisons between groups.
Comparing groups in a data frame is generally easier when each row corresponds
to an observation from one group. Let’s reshape the bikes2
data frame so that
the observational units are date-site combinations. To do this, the third
and
loyola
columns need to be transformed into two new columns: one for
measurements (the counts) and one for identifiers (the sites). It might help to
visualize this as stacking the two separate columns third
and loyola
together, one on top of the other, and then adding a second column with the
corresponding site names.
You can use the .unpivot
method to unpivot a data frame, creating new
rows from values in the columns. This is the inverse of a pivot. It makes the
data frame longer (and narrower), so some data frame packages call this
operation pivot_longer
. We’ll unpivot the bikes2
data frame on the third
and loyola
columns.
The .unpivot method's parameters are:

- on – The columns to stack into a new column; the names of these columns will also go into a new column.
- index – The identifier columns, which are not unpivoted.
- variable_name – Name for the new identifier column.
- value_name – Name for the new measurement column.
For .unpivot
, it’s important to set both on
and index
, since any columns
you don’t include in one or the other will be dropped.
The code to unpivot bikes2
is:
bikes3 = bikes2.unpivot(
on = ["third", "loyola"],
index = ["date", "prcp", "awnd"],
variable_name = "site",
value_name = "count"
)
bikes3.head()
date | prcp | awnd | site | count |
---|---|---|---|---|
date | f64 | f64 | str | f64 |
2020-01-01 | 0.0 | 1.7 | "third" | 332.0 |
2020-01-02 | 0.0 | 3.0 | "third" | 357.0 |
2020-01-03 | 0.0 | 1.0 | "third" | 426.0 |
2020-01-04 | 2.0 | 3.2 | "third" | 337.0 |
2020-01-05 | 0.0 | 2.5 | "third" | 370.0 |
For bikes3
, the observational units are date-site combinations, as planned.
This is convenient for comparing the two sites to each other with statistics
and visualizations. We can use Polars’ .plot.line
method to make a line plot
of the counts for the two sites:
plot = bikes3.plot.line(x = "date", y = "count", color = "site")
# Make the plot 600 pixels wide.
plot.properties(width = 600)
From this plot, we can see that Loyola Drive was generally busier for the first 3 months of 2020. Traffic dropped at both sites in mid-March, probably due to the COVID-19 pandemic. The drop was sharper at Loyola Drive than 3rd Street, so 3rd Street was generally busier for the remaining 9 months of 2020.
Note
We didn’t use prcp
and awnd
(and they don’t differ between sites anyway), so
we could’ve let .unpivot
drop them. The resulting data frame would have the
same observational units as bikes3
, but would also be a subset of the
original bikes
data frame (albeit with different column names).
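For instance (a sketch of that alternative, not shown above), unpivoting with only date as the index drops the weather columns:

# Same reshape, but prcp and awnd are dropped because they appear in
# neither on nor index.
bikes2.unpivot(
    on = ["third", "loyola"],
    index = "date",
    variable_name = "site",
    value_name = "count"
)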
2.1.4. Case Study: SMART Ridership
Sonoma-Marin Area Rail Transit (SMART) is a relatively new single-line passenger light rail service between the San Francisco Bay and Santa Rosa. They publish data about monthly ridership online, but the format is slightly messy. Let’s clean and reshape the data in order to make a plot of ridership over time.
Important
Click here to download the SMART Ridership data set (version 2025-02).
If you haven’t already, we recommend you create a directory for this workshop.
In your workshop directory, create a data/
subdirectory. Download and save
the data set in the data/
subdirectory.
Documentation for the SMART Ridership Data Set
The source for the data set is the SMART Ridership Reports web page.
The data set is saved as a Microsoft Excel file. Before reading an Excel file, it’s a good idea to manually inspect it with spreadsheet software to figure out how the data are organized. The SMART data set contains two tables on the left side of the first sheet: one for total monthly ridership and one for average weekday ridership (by month). Let’s focus on the total monthly ridership table.
You can use Polars’ pl.read_excel
function to read sheets from an Excel file.
Important
The pl.read_excel function depends on the fastexcel package. Make sure to install the package (for example, with python -m pip install fastexcel) if you haven't already.
Use pl.read_excel
to read the SMART Ridership data:
smart = pl.read_excel("data/2025-02_smart_ridership.xlsx")
smart.head()
Sonoma-Marin Area Rail Transit (SMART) Ridership | __UNNAMED__1 | __UNNAMED__2 | __UNNAMED__3 | __UNNAMED__4 | __UNNAMED__5 | __UNNAMED__6 | __UNNAMED__7 | __UNNAMED__8 | __UNNAMED__23 | __UNNAMED__24 | __UNNAMED__25 | __UNNAMED__26 | __UNNAMED__27 | __UNNAMED__28 | __UNNAMED__29 | __UNNAMED__30 | __UNNAMED__31 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
"Total Monthly Ridership" | null | null | null | null | null | null | null | null | "Bicycles on SMART" | null | null | null | null | null | null | null | null |
"Month" | "FY18" | "FY19" | "FY20" | "FY21" | "FY22" | "FY23" | "FY24" | "FY25" | "Month" | "FY18" | "FY19" | "FY20" | "FY21" | "FY22" | "FY23" | "FY24" | "FY25" |
"Jul" | "-" | "63864" | "62851" | "9427" | "24627" | "43752" | "65779" | "88022" | "Jul" | "-" | "6543" | "6666" | "2143" | "3999" | "7310" | "9575" | "10953" |
"Aug" | "54484" | "74384" | "65352" | "8703" | "25020" | "48278" | "72171" | "91894" | "Aug" | "528" | "7772" | "7999" | "1887" | "4297" | "8775" | "11276" | "13046" |
"Sep" | "65019" | "62314" | "62974" | "8910" | "27967" | "49134" | "68506" | "92855" | "Sep" | "6147" | "7249" | "8032" | "1786" | "4733" | "9033" | "10453" | "14334" |
The total monthly ridership table corresponds to rows 1:14
and the first 9
columns:
smart = smart[1:14, :9]
smart
Sonoma-Marin Area Rail Transit (SMART) Ridership | __UNNAMED__1 | __UNNAMED__2 | __UNNAMED__3 | __UNNAMED__4 | __UNNAMED__5 | __UNNAMED__6 | __UNNAMED__7 | __UNNAMED__8 |
---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | str | str |
"Month" | "FY18" | "FY19" | "FY20" | "FY21" | "FY22" | "FY23" | "FY24" | "FY25" |
"Jul" | "-" | "63864" | "62851" | "9427" | "24627" | "43752" | "65779" | "88022" |
"Aug" | "54484" | "74384" | "65352" | "8703" | "25020" | "48278" | "72171" | "91894" |
"Sep" | "65019" | "62314" | "62974" | "8910" | "27967" | "49134" | "68506" | "92855" |
"Oct" | "57453" | "65492" | "57222" | "9851" | "26997.5" | "59322" | "70807" | "96599" |
… | … | … | … | … | … | … | … | … |
"Feb" | "54797" | "51130" | "71676" | "7412" | "26652" | "49724" | "62090" | "81731" |
"Mar" | "57312" | "58091" | "33624" | "9933" | "35291" | "53622" | "67421" | null |
"Apr" | "56631" | "60256" | "4571" | "11908" | "34258" | "58551" | "80340" | null |
"May" | "59428" | "64036" | "5308" | "13949" | "38655" | "65416" | "85448" | null |
"Jun" | "61828" | "55700" | "8386" | "20469" | "41525" | "67162" | "80434" | null |
The first row is a header, so let’s use it to set the column names and remove
it from the data frame. You can get a row as a tuple (rather than a data frame)
with the .row
method:
smart.columns = smart.row(0)
smart = smart[1:, :]
smart
Month | FY18 | FY19 | FY20 | FY21 | FY22 | FY23 | FY24 | FY25 |
---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | str | str |
"Jul" | "-" | "63864" | "62851" | "9427" | "24627" | "43752" | "65779" | "88022" |
"Aug" | "54484" | "74384" | "65352" | "8703" | "25020" | "48278" | "72171" | "91894" |
"Sep" | "65019" | "62314" | "62974" | "8910" | "27967" | "49134" | "68506" | "92855" |
"Oct" | "57453" | "65492" | "57222" | "9851" | "26997.5" | "59322" | "70807" | "96599" |
"Nov" | "56125" | "52774" | "64966" | "8145" | "26575" | "51383" | "65445" | "78132" |
… | … | … | … | … | … | … | … | … |
"Feb" | "54797" | "51130" | "71676" | "7412" | "26652" | "49724" | "62090" | "81731" |
"Mar" | "57312" | "58091" | "33624" | "9933" | "35291" | "53622" | "67421" | null |
"Apr" | "56631" | "60256" | "4571" | "11908" | "34258" | "58551" | "80340" | null |
"May" | "59428" | "64036" | "5308" | "13949" | "38655" | "65416" | "85448" | null |
"Jun" | "61828" | "55700" | "8386" | "20469" | "41525" | "67162" | "80434" | null |
The FY
columns need to be cast to floats, and FY18
uses a hyphen -
to
indicate a missing value. We can use the .replace
method to replace the
hyphen with a missing value and then cast the columns:
smart = smart.with_columns(
pl.exclude("Month").replace("-", None).cast(pl.Float64)
)
smart.head()
Month | FY18 | FY19 | FY20 | FY21 | FY22 | FY23 | FY24 | FY25 |
---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"Jul" | null | 63864.0 | 62851.0 | 9427.0 | 24627.0 | 43752.0 | 65779.0 | 88022.0 |
"Aug" | 54484.0 | 74384.0 | 65352.0 | 8703.0 | 25020.0 | 48278.0 | 72171.0 | 91894.0 |
"Sep" | 65019.0 | 62314.0 | 62974.0 | 8910.0 | 27967.0 | 49134.0 | 68506.0 | 92855.0 |
"Oct" | 57453.0 | 65492.0 | 57222.0 | 9851.0 | 26997.5 | 59322.0 | 70807.0 | 96599.0 |
"Nov" | 56125.0 | 52774.0 | 64966.0 | 8145.0 | 26575.0 | 51383.0 | 65445.0 | 78132.0 |
There’s still a lot of cleaning to do. The identifiers in this data set are the months and years, and they’re split between the row and column names. Each row contains data from several different years, so the data set is not tidy. In addition, the years are indicated in fiscal years (FY), which begin in July rather than January, so some of the years need to be adjusted.
To make the data set tidy, it needs to be reshaped so that the values in the
various fiscal year columns are all in one column. In other words, the data set
needs to be unpivoted (Section 2.1.3) on all of the FY
columns.
Listing all of the columns would be tedious, but fortunately Polars provides a
shortcut: the pl.selectors
namespace, which is conventionally imported as
cs
(for “column selectors”), provides many helpful functions for selecting
columns. You can use the cs.starts_with
function to select all columns that
start with a string. Let’s try this out in the code to unpivot the FY
columns:
import polars.selectors as cs
smart = smart.unpivot(
on = cs.starts_with("FY"),
index = "Month",
variable_name = "fiscal_year",
value_name = "count"
)
smart.head()
Month | fiscal_year | count |
---|---|---|
str | str | f64 |
"Jul" | "FY18" | null |
"Aug" | "FY18" | 54484.0 |
"Sep" | "FY18" | 65019.0 |
"Oct" | "FY18" | 57453.0 |
"Nov" | "FY18" | 56125.0 |
In order to use the months and years in the data, we need to convert them to
dates. As a first step towards this, we can replace the FY
prefix in the new
fiscal_year
column with 20
and cast the column to integers. We can use the
.str.replace
method to do the replacement:
smart = smart.with_columns(
pl.col("fiscal_year").str.replace("FY", "20").cast(pl.Int64)
)
smart.head()
Month | fiscal_year | count |
---|---|---|
str | i64 | f64 |
"Jul" | 2018 | null |
"Aug" | 2018 | 54484.0 |
"Sep" | 2018 | 65019.0 |
"Oct" | 2018 | 57453.0 |
"Nov" | 2018 | 56125.0 |
Next, we can use the .str.to_date
and .dt.month
methods to create a new
column of month numbers:
smart = smart.with_columns(
month_num = pl.col("Month").str.to_date("%b").dt.month()
)
smart.head()
Month | fiscal_year | count | month_num |
---|---|---|---|
str | i64 | f64 | i8 |
"Jul" | 2018 | null | 7 |
"Aug" | 2018 | 54484.0 | 8 |
"Sep" | 2018 | 65019.0 | 9 |
"Oct" | 2018 | 57453.0 | 10 |
"Nov" | 2018 | 56125.0 | 11 |
Now we need to transform the fiscal years in the fiscal_year
column into
calendar years. A SMART fiscal year extends from July to the following June and
is named after the calendar year at the end of the fiscal year. So from July to
December, the calendar year is the fiscal year minus 1. We can use the pl.when function to conditionally subtract 1 from the fiscal year:
smart = smart.with_columns(
cal_year =
pl.col("fiscal_year") +
pl.when(pl.col("month_num") >= 7).then(-1).otherwise(0)
)
smart.head()
Month | fiscal_year | count | month_num | cal_year |
---|---|---|---|---|
str | i64 | f64 | i8 | i64 |
"Jul" | 2018 | null | 7 | 2017 |
"Aug" | 2018 | 54484.0 | 8 | 2017 |
"Sep" | 2018 | 65019.0 | 9 | 2017 |
"Oct" | 2018 | 57453.0 | 10 | 2017 |
"Nov" | 2018 | 56125.0 | 11 | 2017 |
Finally, we can use the pl.date
function to construct dates from the
cal_year
and month_num
columns:
smart = smart.with_columns(date = pl.date("cal_year", "month_num", 1))
smart.head()
Month | fiscal_year | count | month_num | cal_year | date |
---|---|---|---|---|---|
str | i64 | f64 | i8 | i64 | date |
"Jul" | 2018 | null | 7 | 2017 | 2017-07-01 |
"Aug" | 2018 | 54484.0 | 8 | 2017 | 2017-08-01 |
"Sep" | 2018 | 65019.0 | 9 | 2017 | 2017-09-01 |
"Oct" | 2018 | 57453.0 | 10 | 2017 | 2017-10-01 |
"Nov" | 2018 | 56125.0 | 11 | 2017 | 2017-11-01 |
With the dates in the date
column and the counts in the count
column, we
have everything we need to make a plot of SMART ridership over time. We can use
the .plot.line
method to make the plot:
plot = smart.plot.line(x = "date", y = "count")
# Make the plot 600 pixels wide.
plot.properties(width = 600)
Notice the huge drop (more than 90%) in April 2020 due to the COVID-19 pandemic!
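To quantify the drop (a quick sketch), compare the April counts across years:

# April ridership by calendar year; 2020 is a small fraction of 2019.
smart.filter(pl.col("month_num") == 4).select("cal_year", "count").sort("cal_year")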
2.2. Working with Relational Data
Think about how you would organize restaurant reviews data. Each restaurant has a name, address, phone number, operating hours, and other restaurant-level details. Because the number of reviews for each restaurant varies, we could put the data in a single table with one row for each review. Then the table is at the review level, and we'd have to repeat the restaurant-level details across all reviews for each restaurant.

The review-level table is convenient if we want to analyze reviews, but not so convenient if we want to analyze restaurants. For instance, suppose we want to count how many restaurants open before 10 a.m. To avoid counting each restaurant multiple times, we have to reduce the table to one row per restaurant before we compute the count.
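To make this concrete, here's a minimal sketch with a hypothetical single-table layout (restaurant, opens_hour, and stars are made-up column names):

# Hypothetical review-level table: one row per review, with
# restaurant-level details repeated across reviews.
flat = pl.DataFrame({
    "restaurant": ["A", "A", "B", "C", "C", "C"],
    "opens_hour": [8, 8, 11, 9, 9, 9],
    "stars": [5, 4, 3, 5, 2, 4],
})

# Reduce to one row per restaurant before counting, so a restaurant
# isn't counted once per review.
flat.unique(subset = "restaurant").filter(pl.col("opens_hour") < 10).height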
The problem with putting the restaurant review data in a single table is that it consists of observations at two different levels: the restaurant level and the review level. From this perspective, an intuitive solution is to put the data in two tables: a “restaurants” table where each row is a restaurant and a “reviews” table where each row is a review. Then we can choose the most suitable table for each question we want to answer.
Each review is associated with a restaurant, and each restaurant is associated with some reviews, so the two tables are related even though they’re separate. We can keep track of the relationship by including a column for restaurant name (or a unique restaurant identifier) in both tables. This makes the data relational: multiple tables where the relationships between them are expressed through columns in common.
Note
Most database software is designed to efficiently store and query relational data, so in computing contexts, database is often synonymous with relational data.
For some questions, we’ll need to use both the restaurants table and the reviews table. For example, suppose we want to count how many restaurants open before 10 a.m. and have at least 1 five-star review. We can use the reviews table to compute the number of five-star reviews for each restaurant, combine these with the restaurants table, and then count restaurants that meet our conditions in the resulting table. Operations that combine two related tables based on columns they have in common are called joins. The two tables are conventionally called the left table and right table.
There are a few different kinds of joins. We’ll use a simplified, fictitious
version of the restaurant reviews data to demonstrate some of them. The
restaurants
table contains names and phone numbers for three restaurants:
restaurants = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice's Restaurant", "The Original Beef", "The Pie Hole"],
"phone": ["555-3213", "555-1111", "555-9983"]
})
restaurants
id | name | phone |
---|---|---|
i64 | str | str |
1 | "Alice's Restaurant" | "555-3213" |
2 | "The Original Beef" | "555-1111" |
3 | "The Pie Hole" | "555-9983" |
The reviews
table contains scores from five restaurant reviews:
reviews = pl.DataFrame({
"id": [4, 2, 1, 2, 2],
"score": [4.2, 3.5, 4.7, 4.8, 4.0],
})
reviews
id | score |
---|---|
i64 | f64 |
4 | 4.2 |
2 | 3.5 |
1 | 4.7 |
2 | 4.8 |
2 | 4.0 |
2.2.1. Inner Joins
An inner join only keeps rows from the left table if they match rows in the right table and vice-versa.
You can use the .join
method to join two data frames with Polars. The data
frame on the left side of .join
is the left table. The right table is the
first argument to .join
. The .join
method also requires an argument for the
on
parameter, which should be the name of the column to use to match the two
data frames. By default, .join
does an inner join.
Try joining the restaurants
table and the reviews
table:
restaurants.join(reviews, on = "id")
id | name | phone | score |
---|---|---|---|
i64 | str | str | f64 |
2 | "The Original Beef" | "555-1111" | 3.5 |
1 | "Alice's Restaurant" | "555-3213" | 4.7 |
2 | "The Original Beef" | "555-1111" | 4.8 |
2 | "The Original Beef" | "555-1111" | 4.0 |
The inner join keeps rows where id
is 1
or 2
, since these values appear
in both tables. It drops the row where id
is 3
in restaurants
and the row
where id
is 4
in reviews
. The resulting table has 4 rows and the columns
from both data frames.
Tip
If the column you want to use to join two data frames has a different name in
each one, set the left_on
and right_on
parameters instead of on
.
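For example (a sketch using a hypothetical column name), if the reviews table called its key restaurant_id instead of id:

# Hypothetical: rename the key to simulate mismatched column names.
reviews2 = reviews.rename({"id": "restaurant_id"})
restaurants.join(reviews2, left_on = "id", right_on = "restaurant_id")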
2.2.2. Left & Right Joins
A left join keeps all rows from the left table and only keeps rows from the right table if they match. Missing values fill any spaces where there was no match.
You can do a left join with the .join
method by setting how = "left"
. Try a
left join on the restaurant reviews data:
restaurants.join(reviews, on = "id", how = "left")
id | name | phone | score |
---|---|---|---|
i64 | str | str | f64 |
1 | "Alice's Restaurant" | "555-3213" | 4.7 |
2 | "The Original Beef" | "555-1111" | 3.5 |
2 | "The Original Beef" | "555-1111" | 4.8 |
2 | "The Original Beef" | "555-1111" | 4.0 |
3 | "The Pie Hole" | "555-9983" | null |
The left join keeps all of the rows from restaurants
, and matches rows from
reviews
when possible. There are no reviews where the id
is 3
, so score
is missing for that row.
A left join is asymmetric, so switching the order of the tables will generally produce a different result:
reviews.join(restaurants, on = "id", how = "left")
id | score | name | phone |
---|---|---|---|
i64 | f64 | str | str |
4 | 4.2 | null | null |
2 | 3.5 | "The Original Beef" | "555-1111" |
1 | 4.7 | "Alice's Restaurant" | "555-3213" |
2 | 4.8 | "The Original Beef" | "555-1111" |
2 | 4.0 | "The Original Beef" | "555-1111" |
A right join is equivalent to a left join with the order of the tables
switched. Because of this, some relational data tools don’t have a right join
command (only a left join command). You can do a right join with the .join
method by setting how = "right"
.
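As a quick sketch, the right join below should match the left join above, up to the order of the columns:

# Equivalent to reviews.join(restaurants, on = "id", how = "left"),
# up to column order.
restaurants.join(reviews, on = "id", how = "right")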
2.2.3. Full Joins
A full join keeps all rows from both tables. Missing values fill any spaces where there was no match.
You can do a full join with the .join
method by setting how = "full"
. Try a
full join on the restaurant reviews data:
restaurants.join(reviews, on = "id", how = "full")
id | name | phone | id_right | score |
---|---|---|---|---|
i64 | str | str | i64 | f64 |
null | null | null | 4 | 4.2 |
2 | "The Original Beef" | "555-1111" | 2 | 3.5 |
1 | "Alice's Restaurant" | "555-3213" | 1 | 4.7 |
2 | "The Original Beef" | "555-1111" | 2 | 4.8 |
2 | "The Original Beef" | "555-1111" | 2 | 4.0 |
3 | "The Pie Hole" | "555-9983" | null | null |
See also
There are a few more kinds of joins. The Polars User Guide has a complete list with examples.
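Returning to the motivating question (a sketch, not part of the original text), a join followed by a group-by summarizes reviews per restaurant while keeping restaurants that have no reviews:

# Number of reviews and average score per restaurant; The Pie Hole
# has no reviews, so its count is 0 and its average is null.
(
    restaurants
    .join(reviews, on = "id", how = "left")
    .group_by("id", "name")
    .agg(
        n_reviews = pl.col("score").count(),
        avg_score = pl.col("score").mean(),
    )
)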
2.2.4. Case Study: CA Crash Reporting System
The California Highway Patrol publishes data about vehicle crashes in the state as the California Crash Reporting System (CCRS). The CCRS is a relational data set with three tables per year, which describe crash events, the parties involved, and all injuries, witnesses, and passengers. Let's use the 2024 CCRS data for Sacramento and Yolo Counties to compute statistics about crashes in those counties.
Important
Click here to download the 2024 Sacramento & Yolo Crash data set (3 CSV files).
If you haven’t already, we recommend you create a directory for this workshop.
In your workshop directory, create a data/
subdirectory. Download and save
the data set in the data/
subdirectory.
Documentation for the 2024 Sacramento & Yolo Crash Data Set
The data set consists of three tables:
- 2024_sac-yolo_crashes.csv, where each row is a crash.
- 2024_sac-yolo_parties.csv, where each row is a person directly involved in the crash (not a witness or passenger).
- 2024_sac-yolo_injured-witness-passenger.csv, where each row is an injured person (including drivers), witness, or passenger.
Click here to download the documentation for the source data set.
This data set is a subset of the much larger CA Crash Reporting System data set.
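As a starting point (a sketch under assumptions: the file names follow the list above, and crash_id is a hypothetical stand-in for whatever crash identifier the tables actually share), you might load the tables like this:

# Load the three related tables.
crashes = pl.read_csv("data/2024_sac-yolo_crashes.csv")
parties = pl.read_csv("data/2024_sac-yolo_parties.csv")
injured = pl.read_csv("data/2024_sac-yolo_injured-witness-passenger.csv")

# A join on the shared crash identifier would then link parties to
# crashes; check the documentation for the real column name:
# crashes.join(parties, on = "crash_id", how = "left")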