3 Data splitting systems
To train and validate a model requires in-sample and out-of-sample data, but typically we have just “the data”. There are a couple of approaches to separating that data into in-sample and out-of-sample sets: a training/validation split or cross-validation. The two approaches are often used together.
3.1 Time series data
Our example data are time series, so it is good practice for validation splits to respect the time ordering of the observations. That’s because the results of validation are more realistic when they work from a known past to an unknown future.
3.2 Training/testing split
One solution is to reserve some data for validation, and use what is left for training the model. The split can be random or not - for instance, you may hold back the most recent year of data for validation, or you may randomly sample some proportion (e.g., 50%) of the observations to reserve for validation. The rsample package provides functions called validation_split() and validation_time_split() to split the data into training and testing sets. The difference between the two is that validation_split() does a random split, while validation_time_split() keeps the first part of the data for training and the latter part for testing. The same package also provides the functions training() and testing() to extract the training set and the testing set, respectively, from the split.
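For comparison, a purely random split of the same proportion could be made with validation_split(). This is only a sketch of the alternative - we do not use it in the rest of the workshop - and the set.seed() call is there so the random fold membership is reproducible.
# (sketch) a random 50/50 split, for comparison with the time split we use below
set.seed( 1234 )
covid_random_split = validation_split( covid, prop=0.5 )

# the training and testing sets are extracted the same way as for a time split
covid_random_train = training( covid_random_split$splits[[1]] )
covid_random_test = testing( covid_random_split$splits[[1]] )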
For the examples, we will train using the first 50% of the observations and validate using the last 50%.
# do a time split on the covid data
covid_split = validation_time_split( covid, prop=0.5 )
# inspect the split object
covid_split
## # Validation Set Split (0.5/0.5)
## # A tibble: 1 × 2
## splits id
## <list> <chr>
## 1 <split [259/259]> validation
# extract the training set from the split
covid_train = training( covid_split$splits[[1]] )
# inspect the training data
covid_train
## # A tibble: 259 × 67
## date DAY_OF_THE_WEEK HOSPITAL_CENSUS… INDX_UCDH_TEST_… INDX_UCDH_POS_P…
## <date> <fct> <dbl> <dbl> <dbl>
## 1 2020-06-01 1 547 328 1
## 2 2020-06-02 2 529 290 3
## 3 2020-06-03 3 529 236 3
## 4 2020-06-04 4 554 207 1
## 5 2020-06-05 5 547 317 4
## 6 2020-06-06 6 521 144 0
## 7 2020-06-07 0 519 167 1
## 8 2020-06-08 1 532 364 1
## 9 2020-06-09 2 540 290 0
## 10 2020-06-10 3 541 267 3
## # … with 249 more rows, and 62 more variables: INDX_UCDH_POSITIVITY_RATE <dbl>,
## # INDX_POS_PT_NEW_ADM_CNT <dbl>, INDX_POS_PT_NEW_ADM_PCT <dbl>,
## # INDX_POS_PT_IN_HOUSE_D6_CNT <dbl>, INDX_POS_PT_IN_HOUSE_D6_PCT <dbl>,
## # INDX_POS_PT_IN_HOUSE_D7_CNT <dbl>, INDX_POS_PT_IN_HOUSE_D7_PCT <dbl>,
## # INDX_POS_PT_ALL_ADM_CNT <dbl>, INDX_POS_PT_ALL_ADM_PCT <dbl>,
## # OUTREACH_TEST_CNT <dbl>, OUTREACH_POS_TEST_CNT <dbl>,
## # OUTREACH_POSITIVITY_RATE <dbl>, COVID_RULE_OUT_PT_CNT_M <dbl>, …
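The matching test set - the most recent 50% of the observations - can be pulled from the same split object with testing(), following the same pattern as the training() call above. We will come back to it when we validate the fitted models.
# extract the testing set from the split (the most recent half of the data)
covid_test = testing( covid_split$splits[[1]] )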
3.3 Cross-validation
Cross-validation (often abbreviated CV) is a kind of repeated training/validation split. The data are broken into several chunks (called “folds”), and one fold is held out for validation. All of the other folds are used as a training set, and the resulting model is used to predict the response over the held-out fold. The process is repeated until each fold has been held out once. The main benefit of cross-validation is that by iterating over the folds, you end up with a prediction for every data point. This can be important when a single train/test split would leave too few observations in the test set to draw reliable conclusions for validation.
Each training/validation split may be random or may take data that are grouped according to some meaningful value. For instance, time-series data may be best analyzed by holding out contiguous blocks of observations.
We will use cross-validation on the training set to help build the models, before we validate them using the test set. As before, the rsample package provides convenient functions for creating cross-validation splits that play nicely with the other parts of the tidymodels system. Here, the CV folds are not contiguous time blocks, which is a shortcoming. We do it this way because the tidymodels tools for creating and using CV folds don't provide that functionality, and writing the loops that would do the job properly is beyond the scope of this workshop.
# create ten cross-validation folds on the training set
covid_cv = vfold_cv( covid_train, v=10 )
# inspect the CV folds
covid_cv
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [233/26]> Fold01
## 2 <split [233/26]> Fold02
## 3 <split [233/26]> Fold03
## 4 <split [233/26]> Fold04
## 5 <split [233/26]> Fold05
## 6 <split [233/26]> Fold06
## 7 <split [233/26]> Fold07
## 8 <split [233/26]> Fold08
## 9 <split [233/26]> Fold09
## 10 <split [234/25]> Fold10
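To make the fold-by-fold iteration concrete, here is a minimal sketch of what the fitting functions do with these folds under the hood. It uses rsample’s analysis() and assessment() accessors and an ordinary lm() fit as a stand-in for the real models; my_formula is just a placeholder for whichever model formula is being evaluated, not one of the models we build later.
# (sketch) fit on each fold's analysis set and predict on its assessment set,
# so that every row of covid_train gets exactly one held-out prediction
cv_predictions = purrr::map_dfr(
    covid_cv$splits,
    function( split ) {
        fold_fit = lm( my_formula, data=analysis( split ) ) # train on nine folds
        held_out = assessment( split )                       # the single held-out fold
        dplyr::mutate( held_out, .pred=predict( fold_fit, newdata=held_out ) )
    }
)
Binding the held-out predictions from all ten folds together is what gives cross-validation a prediction for every observation in the training set.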
3.4 Combinations
There are times when cross-validation and training/testing validation should be used together. For instance, when cross-validation is used for exploratory analysis and model selection, the final validation should be done on data that were never part of the estimation. That is how we handle the examples in this workshop.