3 Data splitting systems
To train and validate a model requires in-sample and out-of-sample data, but typically we have just “the data”. There are a couple of approaches to separating that data into in-sample and out-of-sample sets: a training/validation split or cross-validation. The two approaches are often used together.
3.1 Time series data
Our example data are time series, so it is good practice for validation splits to respect the time ordering of the observations. That’s because the results of validation are more realistic when they work from a known past to an unknown future.
3.2 Training/testing split
One solution is to reserve some data for validation, and use what is left for training the model. The split can be random or not - for instance, you may hold back the most recent year of data for validation, or you may randomly sample some proportion (e.g., 50%) of the observations to reserve for validation. The rsample package provides functions called validation_split() and validation_time_split() to split the data into training and testing sets. The difference between the two is that validation_split() does a random split, while validation_time_split() keeps the first part of the data for training and the latter part for testing. The same package also provides the functions training() and testing() to extract the training set and the testing set, respectively, from the split.
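For comparison, a purely random split of the same proportion could be made with validation_split(). This is only a sketch of the alternative - we do not use it in the rest of the workshop - and the set.seed() call is there so the random fold membership is reproducible.
# (sketch) a random 50/50 split, for comparison with the time split we use below
set.seed( 1234 )
covid_random_split = validation_split( covid, prop=0.5 )

# the training and testing sets are extracted the same way as for a time split
covid_random_train = training( covid_random_split$splits[[1]] )
covid_random_test = testing( covid_random_split$splits[[1]] )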
For the examples, we will train using the first 50% of the observations and validate using the last 50%.
# do a time split on the covid data
covid_split = validation_time_split( covid, prop=0.5 )
# inspect the split object
covid_split
## # Validation Set Split (0.5/0.5)
## # A tibble: 1 × 2
## splits id
## <list> <chr>
## 1 <split [259/259]> validation
# extract the training set from the split
covid_train = training( covid_split$splits[[1]] )
# inspect the training data
covid_train
## # A tibble: 259 × 67
## date DAY_OF_THE_WEEK HOSPITAL_CENSUS… INDX_UCDH_TEST_… INDX_UCDH_POS_P…
## <date> <fct> <dbl> <dbl> <dbl>
## 1 2020-06-01 1 547 328 1
## 2 2020-06-02 2 529 290 3
## 3 2020-06-03 3 529 236 3
## 4 2020-06-04 4 554 207 1
## 5 2020-06-05 5 547 317 4
## 6 2020-06-06 6 521 144 0
## 7 2020-06-07 0 519 167 1
## 8 2020-06-08 1 532 364 1
## 9 2020-06-09 2 540 290 0
## 10 2020-06-10 3 541 267 3
## # … with 249 more rows, and 62 more variables: INDX_UCDH_POSITIVITY_RATE <dbl>,
## # INDX_POS_PT_NEW_ADM_CNT <dbl>, INDX_POS_PT_NEW_ADM_PCT <dbl>,
## # INDX_POS_PT_IN_HOUSE_D6_CNT <dbl>, INDX_POS_PT_IN_HOUSE_D6_PCT <dbl>,
## # INDX_POS_PT_IN_HOUSE_D7_CNT <dbl>, INDX_POS_PT_IN_HOUSE_D7_PCT <dbl>,
## # INDX_POS_PT_ALL_ADM_CNT <dbl>, INDX_POS_PT_ALL_ADM_PCT <dbl>,
## # OUTREACH_TEST_CNT <dbl>, OUTREACH_POS_TEST_CNT <dbl>,
## # OUTREACH_POSITIVITY_RATE <dbl>, COVID_RULE_OUT_PT_CNT_M <dbl>, …
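The matching test set - the most recent 50% of the observations - can be pulled from the same split object with testing(), following the same pattern as the training() call above. We will come back to it when we validate the fitted models.
# extract the testing set from the split (the most recent half of the data)
covid_test = testing( covid_split$splits[[1]] )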
3.3 Cross-validation
Cross-validation (often abbreviated CV) is a kind of repeated training/validation split. The data are broken into several chunks (called “folds”), and one fold is held out for validation. All of the other folds are used as a training set, and the resulting model is used to predict the response over the held-out fold. The process is repeated until each fold has been held out once. The main benefit of cross-validation is that by iterating over the folds, you end up with a prediction for every data point. This can be important when a single train/test split would leave too few observations in the test set to draw reliable conclusions for validation.
Each training/validation split may be random or may take data that are grouped according to some meaningful value. For instance, time-series data may be best analyzed by holding out contiguous blocks of observations.
We will use cross-validation on the training set to help build the models, before we validate them using the test set. As before, the rsample package provides convenient functions for creating cross-validation splits that play nicely with the other parts of the tidymodels system. Here, the CV folds are not contiguous time blocks, which is a shortcoming. We do it this way because the tidymodels tools for creating and using CV folds don't provide that functionality, and writing the loops that would do the job properly is beyond the scope of this workshop.
# create ten cross-validation folds on the training set
covid_cv = vfold_cv( covid_train, v=10 )
# inspect the CV folds
covid_cv
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [233/26]> Fold01
## 2 <split [233/26]> Fold02
## 3 <split [233/26]> Fold03
## 4 <split [233/26]> Fold04
## 5 <split [233/26]> Fold05
## 6 <split [233/26]> Fold06
## 7 <split [233/26]> Fold07
## 8 <split [233/26]> Fold08
## 9 <split [233/26]> Fold09
## 10 <split [234/25]> Fold10
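To make the fold-by-fold iteration concrete, here is a minimal sketch of what the fitting functions do with these folds under the hood. It uses rsample’s analysis() and assessment() accessors and an ordinary lm() fit as a stand-in for the real models; my_formula is just a placeholder for whichever model formula is being evaluated, not one of the models we build later.
# (sketch) fit on each fold's analysis set and predict on its assessment set,
# so that every row of covid_train gets exactly one held-out prediction
cv_predictions = purrr::map_dfr(
    covid_cv$splits,
    function( split ) {
        fold_fit = lm( my_formula, data=analysis( split ) ) # train on nine folds
        held_out = assessment( split )                       # the single held-out fold
        dplyr::mutate( held_out, .pred=predict( fold_fit, newdata=held_out ) )
    }
)
Binding the held-out predictions from all ten folds together is what gives cross-validation a prediction for every observation in the training set.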
3.4 Combinations
There are times when cross-validation and training/testing validation should be used together. For instance, when cross-validation is used for exploratory analysis and model selection, the final validation should be done on data that were never part of the estimation. That is how we handle the examples in this workshop.