Use my package TSCV for nested cross-validation

Recently, some reader asked me whether my time series cross-validation package TSCV can be used for nested cross-validation. I mulled it over and found the answer to be favorable. I planned to tell him this good news, but the answer quickly became lengthy. Therefore, I decided to turn the answer into a standalone post to address this question. In the following, I will explain the concept of nested cross-validation and its advantage as well as how to use TSCV or any similar packages for it. The same content is also hosted on GitHub. If you have any question, you can ask in either place (preferably in both places).

Flat cross-validation vs. nested cross-validation

To clarify the meaning of these two terms in this specific issue, let me first describe them.

Flat cross-validation

Let us use 5-Fold as an example. In a 5-Fold flat cross-validation, you split the dataset into 5 subsets. Each time, you train a model from 4 of them and test it on the remaining one. Afterwards, you average the 5 scores yielded from the 5 test subsets.

ooooo: training subset
*****: test subset

ooooo ooooo ooooo ooooo *****
ooooo ooooo ooooo ***** ooooo
ooooo ooooo ***** ooooo ooooo
ooooo ***** ooooo ooooo ooooo
***** ooooo ooooo ooooo ooooo

Reasonably, the model you trained depends on both the algorithm you use and the hyperparameter you input. Therefore, the averaged score provides a criterion to evaluate both the algorithms and hyperparameters. I will later explain whether these evaluations are accurate enough, but for now it suffices to understand the basic procedure.

Nested cross-validation (for non-time series)

In contrast to flat cross-validation, which evaluates both the algorithms and the hyperparameters in one fell swoop, nested cross-validation evaluates them in a hierarchical fashion. In the upper-level, it evaluates the algorithms; in the lower-level, it evaluates the hyperparameter within each algorihtm.

Let us still use the 5-Fold setup. First we, likewise, split the dataset into 5 subsets. Let us call it the macro split, which allows us to run each same experiment 5 times. In each run, we further split the training set into 5 sub-subsets. Let us call it the micro split. If the whole dataset has 25 samples, then the macro split sets 20 samples for training and 5 samples for testing in each run, and the micro split further splits the 20 training samples and sets 16 for training and 4 for testing.

Macro split:

12345 12345 12345 12345 *****   =>  further split to micro split -- No. 1
12345 12345 12345 ***** 12345   =>  further split to micro split -- No. 2
12345 12345 ***** 12345 12345   =>  further split to micro split -- No. 3
12345 ***** 12345 12345 12345   =>  further split to micro split -- No. 4
***** 12345 12345 12345 12345   =>  further split to micro split -- No. 5

(Indicative) micro split -- No. 1 (5 in total):

1111 2222 3333 4444 xxxx
1111 2222 3333 xxxx 5555
1111 2222 xxxx 4444 5555
1111 xxxx 3333 4444 5555
xxxx 2222 3333 4444 5555

In the upper-level macro split, we choose a target algorithm and dive into the lower-level micro split. With the target algorithm fixed, we vary the hyperparameters to get the evaluation for each hyperparameter and choose the optimal one. Then, we return to the upper level by fixing the hyperparameter as the optimal one and evaluate the target algorithm. Then, we choose another target algorithm and repeat the same procedure.

Let us call it 5x5 nested cross-validation. Of course, you can use, in general, a mxn nested cross-validation. The essence is to separate the evaluation of the algorithm from the evaluation of the hyperparameter.

Use nested cross-validation for time series.

In time series cross-validation, you need to introduce gaps, which makes the problem tricky. Luckily, we have an easy walk around. That is, the 2xn nested cross-validation is free:

2x4 nested cross-validation
---------------------------------

Macro split:

ooooo ooooo ooooo ooooo gap *****
***** gap ooooo ooooo ooooo ooooo

Micro split -- No. 1 (2 in total):

oooo oooo oooo gap ****
ooo ooo gap **** gap ooo
ooo gap **** gap ooo ooo
**** gap oooo oooo oooo

You can use my package tscv for this kind of 2xn nested cross-validation.

Why nested cross-validation?

The reason is that the algorithms with more hyperparameters have an edge in flat cross-validation. The dimension of the hyperparameters can be seen as the capacity of “bribery” of the algorithm. The more hyperparameters the algorithm owns, the more severely the algorithm compromise the test dataset. Flat cross-validation, by nature, favours those algorithms with rich hyperparameters. In contrast, nested cross-validation puts every algorithm on the same starting line. That is why nested cross-validation is preferred when comparing algorithms with significantly different dimensions of hyperparameters.

Then, does the nested cross-validation provides an accurate way to evaluate the final chosen model? No, though it help you to pick the best algorithm and its hyperparameter, the resulted model’s performance is not under measurement. To explain it, we need some advanced statistics knowledge. To avoid bloating this post, I will only mention here that model(x*) is different from model(x)|x=x*. The good news, however, is that if your algorithm does not have too many hyperparameters, the cross-validation error will not be too far away from the resulted model’s error. Therefore, an algorithm with better performance in nested cross-validation likely leads to a model with better performance in terms of generation error.

Wenjie Zheng