# install.packages("tidyr")
library("tidyr")13 Reshaping Tabular Data
This lesson focuses on how to identify untidy tabular data sets and reshape them to be tidy, in the sense described in Section 7.1.2.
After this lesson, you should be able to:
- Pivot columns in a data set to make it tidy
- Separate values in a column that contains multiple values per cell
- Convert columns to appropriate data types
13.1 Introduction
Let’s look at some examples of tidy and untidy data sets. The tidyr package provides examples, and as we’ll see later, it also provides functions to make untidy data sets tidy. As usual, we first need to load the package:
Let’s start with an example of tidy data. This data set is included in the tidyr package and records the number of tuberculosis cases across several different countries and years:
table1# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
When you first look at a data set, think about what the observations are and what the features are. If the data set comes with documentation, it may help you figure this out. Since this data set is a tidy data set, we already know each row is an observation and each column is a feature.
Features in a data set tend to take one of two roles. Some features are identifiers that describe the observed subject. These are usually not what the researcher collecting the data is trying to find out. For example, in the tuberculosis data set, the country and year columns are identifiers.
Other features are measurements. These are usually the reason the researcher collected the data. For the tuberculosis data set, the cases and population columns are measurements.
Thinking about whether features are identifiers or measurements can be helpful when you need to use tidyr to rearrange a data set.
13.2 Columns into Rows
Tidy data rule 1 says each observation must have its own row. Here’s a table that breaks rule 1:
table4a# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
All of the numbers measure the same thing: cases. To make the data tidy, we must rotate the 1999 and 2000 column names into rows, one for each value in the columns. The new columns are year and cases.
This process means less columns (generally) and more rows, so the data set becomes longer.
We can use the pivot_longer function to rotate columns into rows. We need to specify:
- Columns to rotate as
cols. - Name(s) of new identifier column(s) as
names_to. - Name(s) of new measurement column(s) as
values_to.
Here’s the code:
pivot_longer(table4a, -country, names_to = "year", values_to = "cases")# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
You also can do this without tidyr:
- Subset columns to separate
1999and2000into two data frames. - Add a
yearcolumn to each. - Rename the
1999and2000columns tocases. - Stack the two data frames with
rbind.
# Step 1
df99 = table4a[-3]
df00 = table4a[-2]
# Step 2
df99$year = "1999"
df00$year = "2000"
# Step 3
names(df99)[2] = "cases"
names(df00)[2] = "cases"
# Step 4
rbind(df99, df00)# A tibble: 6 × 3
country cases year
<chr> <dbl> <chr>
1 Afghanistan 745 1999
2 Brazil 37737 1999
3 China 212258 1999
4 Afghanistan 2666 2000
5 Brazil 80488 2000
6 China 213766 2000
13.3 Rows into Columns
Tidy data rule 2 says each feature must have its own column. Let’s look at a table that breaks rule 2:
table2# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Here the count column contains two different features: cases and population. To make the data tidy, we must rotate the count values into columns, one for each type value. New columns are cases and population.
This process means less rows and more columns, so the data set becomes wider.
We can use pivot_wider to rotate rows into columns. We need to specify:
- Column names to rotate as
names_from. - Measurements to rotate as
values_from.
Here’s the code:
pivot_wider(table2, names_from = type, values_from = count)# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
You can also do this without tidyr:
- Subset rows to separate
casesandpopulationvalues. - Remove the
typecolumn from each. - Rename the
countcolumn tocasesandpopulation. - Merge the two subsets by matching
countryandyear.
# Step 1
cases = table2[table2$type == "cases", ]
pop = table2[table2$type == "population", ]
# Step 2
cases = cases[-3]
pop = pop[-3]
# Step 3
names(cases)[3] = "cases"
names(pop)[3] = "population"
# Step 4
tidy = cbind(cases, pop[3])This code uses the cbind function to merge the two subsets, but it would be better to use the merge function. The cbind function does not use identifier columns to check that the rows in each subset are from the same observations.
Run vignette("pivot") for more examples of how to use tidyr.
13.4 Separating Values
Tidy data rule 3 says each value must have its own cell. Here’s a table that breaks rule 3:
table3# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Cells in the rate column contain two values: cases and population. These are two different features, so to make the data set tidy, we need to separate them into two different columns.
So how can we separate the rate column? The rate column is a character vector (you can check this with str(table3)), so we can use the string processing functions in the stringr package. In particular, we can use the str_split_fixed function:
library("stringr")
columns = str_split_fixed(table3$rate, fixed("/"), 2)Now we have a character matrix where the values are in separate columns. Now we need to combine these with the original data frame. There are several ways to approach this, but to be safe, let’s make a new data frame rather than overwrite the original. First we make a copy of the original:
tidy_tb = table3Next, we need to assign each column in the character matrix to a column in the tidy_tb data frame. Since the columns contain numbers, we can also use the as.numeric function to convert them to the correct data type:
tidy_tb$cases = as.numeric(columns[, 1])
tidy_tb$population = as.numeric(columns[, 2])Extracting values, converting to appropriate data types, and then combining everything into a single data frame is an extremely common pattern in data science.
Using stringr functions is the most general way to separate out values in a column, but the tidyr package also provides a function separate specifically for the case we just worked through. Either package is appropriate for solving this problem.