# install.packages("tidyr")
library("tidyr")
13 Reshaping Tabular Data
This lesson focuses on how to identify untidy tabular data sets and reshape them to be tidy, in the sense described in Section 7.1.2.
After this lesson, you should be able to:
- Pivot columns in a data set to make it tidy
- Separate values in a column that contains multiple values per cell
- Convert columns to appropriate data types
13.1 Introduction
Let’s look at some examples of tidy and untidy data sets. The tidyr package provides examples, and as we’ll see later, it also provides functions to make untidy data sets tidy. As usual, we first need to load the package:
Let’s start with an example of tidy data. This data set is included in the tidyr package and records the number of tuberculosis cases across several different countries and years:
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
When you first look at a data set, think about what the observations are and what the features are. If the data set comes with documentation, it may help you figure this out. Since this data set is a tidy data set, we already know each row is an observation and each column is a feature.
Features in a data set tend to take one of two roles. Some features are identifiers that describe the observed subject. These are usually not what the researcher collecting the data is trying to find out. For example, in the tuberculosis data set, the country
and year
columns are identifiers.
Other features are measurements. These are usually the reason the researcher collected the data. For the tuberculosis data set, the cases
and population
columns are measurements.
Thinking about whether features are identifiers or measurements can be helpful when you need to use tidyr to rearrange a data set.
13.2 Columns into Rows
Tidy data rule 1 says each observation must have its own row. Here’s a table that breaks rule 1:
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
All of the numbers measure the same thing: cases. To make the data tidy, we must rotate the 1999
and 2000
column names into rows, one for each value in the columns. The new columns are year
and cases
.
This process means less columns (generally) and more rows, so the data set becomes longer.
We can use the pivot_longer
function to rotate columns into rows. We need to specify:
- Columns to rotate as
cols
. - Name(s) of new identifier column(s) as
names_to
. - Name(s) of new measurement column(s) as
values_to
.
Here’s the code:
pivot_longer(table4a, -country, names_to = "year", values_to = "cases")
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
You also can do this without tidyr:
- Subset columns to separate
1999
and2000
into two data frames. - Add a
year
column to each. - Rename the
1999
and2000
columns tocases
. - Stack the two data frames with
rbind
.
# Step 1
= table4a[-3]
df99 = table4a[-2]
df00
# Step 2
$year = "1999"
df99$year = "2000"
df00
# Step 3
names(df99)[2] = "cases"
names(df00)[2] = "cases"
# Step 4
rbind(df99, df00)
# A tibble: 6 × 3
country cases year
<chr> <dbl> <chr>
1 Afghanistan 745 1999
2 Brazil 37737 1999
3 China 212258 1999
4 Afghanistan 2666 2000
5 Brazil 80488 2000
6 China 213766 2000
13.3 Rows into Columns
Tidy data rule 2 says each feature must have its own column. Let’s look at a table that breaks rule 2:
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Here the count
column contains two different features: cases and population. To make the data tidy, we must rotate the count
values into columns, one for each type
value. New columns are cases
and population
.
This process means less rows and more columns, so the data set becomes wider.
We can use pivot_wider
to rotate rows into columns. We need to specify:
- Column names to rotate as
names_from
. - Measurements to rotate as
values_from
.
Here’s the code:
pivot_wider(table2, names_from = type, values_from = count)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
You can also do this without tidyr:
- Subset rows to separate
cases
andpopulation
values. - Remove the
type
column from each. - Rename the
count
column tocases
andpopulation
. - Merge the two subsets by matching
country
andyear
.
# Step 1
= table2[table2$type == "cases", ]
cases = table2[table2$type == "population", ]
pop
# Step 2
= cases[-3]
cases = pop[-3]
pop
# Step 3
names(cases)[3] = "cases"
names(pop)[3] = "population"
# Step 4
= cbind(cases, pop[3]) tidy
This code uses the cbind
function to merge the two subsets, but it would be better to use the merge
function. The cbind
function does not use identifier columns to check that the rows in each subset are from the same observations.
Run vignette("pivot")
for more examples of how to use tidyr.
13.4 Separating Values
Tidy data rule 3 says each value must have its own cell. Here’s a table that breaks rule 3:
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Cells in the rate
column contain two values: cases and population. These are two different features, so to make the data set tidy, we need to separate them into two different columns.
So how can we separate the rate
column? The rate column is a character vector (you can check this with str(table3)
), so we can use the string processing functions in the stringr package. In particular, we can use the str_split_fixed
function:
library("stringr")
= str_split_fixed(table3$rate, fixed("/"), 2) columns
Now we have a character matrix where the values are in separate columns. Now we need to combine these with the original data frame. There are several ways to approach this, but to be safe, let’s make a new data frame rather than overwrite the original. First we make a copy of the original:
= table3 tidy_tb
Next, we need to assign each column in the character matrix to a column in the tidy_tb
data frame. Since the columns contain numbers, we can also use the as.numeric
function to convert them to the correct data type:
$cases = as.numeric(columns[, 1])
tidy_tb$population = as.numeric(columns[, 2]) tidy_tb
Extracting values, converting to appropriate data types, and then combining everything into a single data frame is an extremely common pattern in data science.
Using stringr functions is the most general way to separate out values in a column, but the tidyr package also provides a function separate
specifically for the case we just worked through. Either package is appropriate for solving this problem.