32 Reshaping Tabular Data

This lesson focuses on how to identify untidy tabular data sets and reshape them to be tidy, in the sense described in Section 15.2.

Learning Goals

After this lesson, you should be able to:

Pivot columns in a data set to make it tidy
Separate values in a column that contains multiple values per cell
Convert columns to appropriate data types

32.1 Introduction

Let’s look at some examples of tidy and untidy data sets. The tidyr package provides examples, and as we’ll see later, it also provides functions to make untidy data sets tidy. As usual, we first need to load the package:

# install.packages("tidyr")
library("tidyr")

Let’s start with an example of tidy data. This data set is included in the tidyr package and records the number of tuberculosis cases across several different countries and years:

table1

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

When you first look at a data set, think about what the observations are and what the features are. If the data set comes with documentation, it may help you figure this out. Since this data set is a tidy data set, we already know each row is an observation and each column is a feature.

Features in a data set tend to take one of two roles. Some features are identifiers that describe the observed subject. These are usually not what the researcher collecting the data is trying to find out. For example, in the tuberculosis data set, the country and year columns are identifiers.

Other features are measurements. These are usually the reason the researcher collected the data. For the tuberculosis data set, the cases and population columns are measurements.

Thinking about whether features are identifiers or measurements can be helpful when you need to use tidyr to rearrange a data set.

32.2 Columns into Rows

Tidy data rule 1 says each observation must have its own row. Here’s a table that breaks rule 1:

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

All of the numbers measure the same thing: cases. To make the data tidy, we must rotate the 1999 and 2000 column names into rows, one for each value in the columns. The new columns are year and cases.

This process means less columns (generally) and more rows, so the data set becomes longer.

We can use the pivot_longer function to rotate columns into rows. We need to specify:

Columns to rotate as cols.
Name(s) of new identifier column(s) as names_to.
Name(s) of new measurement column(s) as values_to.

Here’s the code:

pivot_longer(table4a, -country, names_to = "year", values_to = "cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

How to Pivot Longer without tidyr

You also can do this without tidyr:

Subset columns to separate 1999 and 2000 into two data frames.
Add a year column to each.
Rename the 1999 and 2000 columns to cases.
Stack the two data frames with rbind.

# Step 1
df99 = table4a[-3]
df00 = table4a[-2]

# Step 2
df99$year = "1999"
df00$year = "2000"

# Step 3
names(df99)[2] = "cases"
names(df00)[2] = "cases"

# Step 4
rbind(df99, df00)

# A tibble: 6 × 3
  country      cases year 
  <chr>        <dbl> <chr>
1 Afghanistan    745 1999 
2 Brazil       37737 1999 
3 China       212258 1999 
4 Afghanistan   2666 2000 
5 Brazil       80488 2000 
6 China       213766 2000

32.3 Rows into Columns

Tidy data rule 2 says each feature must have its own column. Let’s look at a table that breaks rule 2:

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Here the count column contains two different features: cases and population. To make the data tidy, we must rotate the count values into columns, one for each type value. New columns are cases and population.

This process means less rows and more columns, so the data set becomes wider.

We can use pivot_wider to rotate rows into columns. We need to specify:

Column names to rotate as names_from.
Measurements to rotate as values_from.

Here’s the code:

pivot_wider(table2, names_from = type, values_from = count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

How to Pivot Wider without tidyr

You can also do this without tidyr:

Subset rows to separate cases and population values.
Remove the type column from each.
Rename the count column to cases and population.
Merge the two subsets by matching country and year.

# Step 1
cases = table2[table2$type == "cases", ]
pop = table2[table2$type == "population", ]

# Step 2
cases = cases[-3]
pop = pop[-3]

# Step 3
names(cases)[3] = "cases"
names(pop)[3] = "population"

# Step 4
tidy = cbind(cases, pop[3])

This code uses the cbind function to merge the two subsets, but it would be better to use the merge function. The cbind function does not use identifier columns to check that the rows in each subset are from the same observations.

32.4 Separating Values

Tidy data rule 3 says each value must have its own cell. Here’s a table that breaks rule 3:

table3

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Cells in the rate column contain two values: cases and population. These are two different features, so to make the data set tidy, we need to separate them into two different columns.

So how can we separate the rate column? The rate column is a character vector (you can check this with str(table3)), so we can use the string processing functions in the stringr package. In particular, we can use the str_split_fixed function:

library("stringr")

columns = str_split_fixed(table3$rate, fixed("/"), 2)

Now we have a character matrix where the values are in separate columns. Now we need to combine these with the original data frame. There are several ways to approach this, but to be safe, let’s make a new data frame rather than overwrite the original. First we make a copy of the original:

tidy_tb = table3

Next, we need to assign each column in the character matrix to a column in the tidy_tb data frame. Since the columns contain numbers, we can also use the as.numeric function to convert them to the correct data type:

tidy_tb$cases = as.numeric(columns[, 1])
tidy_tb$population = as.numeric(columns[, 2])

Extracting values, converting to appropriate data types, and then combining everything into a single data frame is an extremely common pattern in data science.

Using stringr functions is the most general way to separate out values in a column, but the tidyr package also provides a function separate specifically for the case we just worked through. Either package is appropriate for solving this problem.