TSCV: A Python package for Time Series Cross-Validation
Cross-validation, a popular tool in machine learning and statistics, is crucial for model selection and hyperparameter tuning. Its standard forms assume that the data are independent and identically distributed. Time series violate this assumption, because successive data points depend on one another.
Many cross-validation packages, such as scikit-learn, rely on the independence hypothesis and are therefore of limited help for time series. To solve this problem, I developed a Python package, TSCV, which enables cross-validation for time series without requiring independence.
The intuition behind this package is that introducing gaps between the training set and the test set mitigates the temporal dependence between them. Hence, once the gaps are in place, leave-p-out, K-Fold, and so forth become valid again. Research on dependent data (see the references at the end) shows that cross-validation with gaps outperforms cross-validation without gaps.
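To put a number on this intuition, consider a toy case that is an assumption of this example and not something the package requires: a stationary AR(1) process with coefficient `phi`, whose autocorrelation at lag h is `phi ** h`. The hypothetical helpers below show how quickly the correlation between the last training sample and the first test sample shrinks as the gap grows.

```python
def ar1_autocorrelation(phi, lag):
    """Autocorrelation of a stationary AR(1) process at the given lag."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    return phi ** abs(lag)


def closest_train_test_correlation(phi, gap):
    """Correlation between the last training sample and the first test sample.

    With `gap` samples removed in between, the two samples are gap + 1 steps apart.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    return ar1_autocorrelation(phi, gap + 1)


# Illustrative coefficient only; real series need not follow an AR(1) model.
for gap in (0, 1, 3, 7):
    print(gap, closest_train_test_correlation(0.5, gap))
```

Under this assumption the dependence decays geometrically with the gap, which is why even a short gap can remove most of the leakage between training and test data.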
The best feature of this package is that it works seamlessly with scikit-learn. You can pass any class in my package (each one is a cross-validator) as the cv argument to the cross_validate function in scikit-learn. Indeed, my package is designed as an extension for scikit-learn rather than as a standalone package.
In the following, I will present the various cross-validators in my package.
- Gap leave p out
- Gap K-Fold
- Gap walk forward
- Gap train test split
At the end, I will demonstrate how to use this extension with scikit-learn seamlessly.
Gap leave p out
Ordinary leave-p-out cross-validation uses every combination of $p$ data samples as a test set and the remaining samples as the training set. The test sets need not be contiguous.
Gap leave-p-out, as its name suggests, introduces gaps between the training set and the test set. Since it is not economical to “shatter” the test sets, contiguous test sets are preferred. Also, the gaps before and after the test set need not be of equal size.
In the package, gap leave-p-out cross-validation is provided by the GapLeavePOut class.
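To make the index arithmetic concrete, here is a minimal pure-Python sketch of gap leave-p-out with contiguous test sets. The helper `gap_leave_p_out` and its parameter names are hypothetical and chosen for illustration; this is not TSCV's implementation, and skipping splits with an empty training set is a choice made for this sketch.

```python
def gap_leave_p_out(n_samples, p, gap_before=0, gap_after=0):
    """Yield (train, test) index lists with contiguous test sets of size p.

    Samples within `gap_before` positions ahead of the test set, or within
    `gap_after` positions behind it, are excluded from the training set.
    """
    if not 0 < p <= n_samples:
        raise ValueError("p must satisfy 0 < p <= n_samples")
    if gap_before < 0 or gap_after < 0:
        raise ValueError("gaps must be non-negative")
    for start in range(n_samples - p + 1):
        stop = start + p
        test = list(range(start, stop))
        train = [
            i for i in range(n_samples)
            if i < start - gap_before or i >= stop + gap_after
        ]
        if train:  # assumption of this sketch: drop splits with no training data
            yield train, test


for train, test in gap_leave_p_out(7, p=3, gap_before=1, gap_after=2):
    print("train:", train, "test:", test)
```

Note that the two gaps are controlled independently, so an asymmetric setting such as one sample before and two after the test set is possible.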
Gap K-Fold
An ordinary K-Fold splits the data into $K$ folds, then uses each fold in turn as the test set and the remaining folds as the training set. The data are preferably shuffled before being split into $K$ folds.
The gap K-Fold also splits the data into $K$ folds. The test sets are untouched, while the samples that fall within the gaps around each test set are removed from the training sets. Unlike K-Fold, gap K-Fold does not shuffle the data, since shuffling would destroy the temporal order that the gaps rely on.
In the package, gap K-Fold cross-validation is provided by the GapKFold class.
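The same idea can be sketched for K-Fold in plain Python. The helper `gap_k_fold` below is a hypothetical illustration, not TSCV's code: it cuts the unshuffled index range into consecutive folds and removes the gap samples on both sides of each test fold from the training set.

```python
def gap_k_fold(n_samples, n_splits, gap_before=0, gap_after=0):
    """Yield (train, test) index lists for an unshuffled K-Fold with gaps."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must satisfy 2 <= n_splits <= n_samples")
    if gap_before < 0 or gap_after < 0:
        raise ValueError("gaps must be non-negative")
    # Fold sizes differ by at most one; the first `extra` folds get one more sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        test = list(range(start, stop))
        train = [
            i for i in range(n_samples)
            if i < start - gap_before or i >= stop + gap_after
        ]
        yield train, test
        start = stop


for train, test in gap_k_fold(10, n_splits=5, gap_before=2, gap_after=1):
    print("train:", train, "test:", test)
```

Every sample still appears in exactly one test set, so the test coverage matches ordinary K-Fold; only the training sets shrink.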
Gap walk forward
Walk-forward is very similar to K-Fold except that it ignores the data after the test set.
Gap walk-forward works similarly: it introduces a gap between the training set and the test set, and the samples in this gap are dropped from the training set.
In the package, gap walk-forward cross-validation is provided by the GapWalkForward class.
Gap walk-forward is less efficient than gap K-Fold and gap leave-p-out in that it does not make the fullest use of the data set. It can, however, be advantageous when the time series is non-stationary.
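A rough sketch of the walk-forward logic, assuming an expanding training window and equally sized test blocks taken from the end of the series. The helper `gap_walk_forward` and its parameters are hypothetical names for this illustration and do not reflect the signature of the package's class.

```python
def gap_walk_forward(n_samples, n_splits, test_size=1, gap_size=0):
    """Yield (train, test) index lists for walk-forward validation with a gap.

    Each training set contains every sample that lies at least `gap_size`
    positions before the test block; data after the test block is ignored.
    """
    if n_splits < 1 or test_size < 1 or gap_size < 0:
        raise ValueError("n_splits and test_size must be positive, gap_size non-negative")
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap_size <= 0:
        raise ValueError("not enough samples for the requested splits and gap")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_stop = test_start - gap_size
        train = list(range(train_stop))
        test = list(range(test_start, test_start + test_size))
        yield train, test


for train, test in gap_walk_forward(10, n_splits=3, test_size=2, gap_size=1):
    print("train:", train, "test:", test)
```

The early samples are never used for testing here, which illustrates why this scheme uses the data less fully than the two schemes above.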
Gap train test split
Unlike the cross-validators above, gap train-test split is not a cross-validator but a one-line function that splits the data set into a training set and a test set while removing the gap between them.
In the package, this split is provided by the gap_train_test_split function.
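The behaviour of such a split can be sketched in a few lines of plain Python. The helper `split_with_gap` below is a hypothetical stand-in written for illustration, assuming the test set is taken from the end of the series; it is not the package's function and does not mirror its signature.

```python
def split_with_gap(samples, test_size, gap_size=0):
    """Split an ordered sequence into (train, test), dropping a gap in between."""
    n_samples = len(samples)
    if test_size < 1 or gap_size < 0:
        raise ValueError("test_size must be positive and gap_size non-negative")
    if test_size + gap_size >= n_samples:
        raise ValueError("test set and gap leave no training data")
    train_stop = n_samples - test_size - gap_size
    train = samples[:train_stop]           # everything before the gap
    test = samples[n_samples - test_size:]  # the last test_size samples
    return train, test


train, test = split_with_gap(list(range(10)), test_size=3, gap_size=2)
print("train:", train, "test:", test)
```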
Use them with scikit-learn
The best feature of this package is that you can use it with scikit-learn seamlessly. Let me show you with an example.
First let us load the data, the algorithm, and the evaluation.
Then we construct a GapKFold object and pass it as the cv argument to the scikit-learn cross-validation function.
As you can see, the classes in this package are used in exactly the same way as the cross-validators in scikit-learn.
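As a rough, self-contained illustration of this workflow, the sketch below relies on the fact that scikit-learn's cv argument also accepts an iterable of (train, test) index arrays. The generator `gap_k_fold_splits`, the synthetic data, and the choice of `Ridge` are all assumptions made for this example; with TSCV installed, a GapKFold object would take the place of the `splits` list in the same cv slot.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def gap_k_fold_splits(n_samples, n_splits, gap_before=0, gap_after=0):
    """Hypothetical stand-in for a gap K-Fold splitter: yields index arrays."""
    indices = np.arange(n_samples)
    # Consecutive, unshuffled folds whose sizes differ by at most one.
    for test in np.array_split(indices, n_splits):
        start, stop = test[0], test[-1] + 1
        keep = (indices < start - gap_before) | (indices >= stop + gap_after)
        yield indices[keep], test


# Synthetic regression data, used only to make the example runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Materialise the splits so they can be inspected and reused.
splits = list(gap_k_fold_splits(len(X), n_splits=5, gap_before=2, gap_after=2))

# Any iterable of (train, test) index arrays is a valid `cv` argument.
scores = cross_val_score(Ridge(), X, y, cv=splits)
print(scores)
```

The estimator and the scoring call are untouched; only the object passed as cv changes, which is what makes a gap-aware splitter a drop-in replacement.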
- This package is open-source and is hosted on GitHub. The user guide can be found in the README file. If you like this package, please star the repository.
- I have opened a pull request on scikit-learn. If you would like to see it merged and use it directly within scikit-learn, please comment on the pull request.
- I would like to thank Christoph Bergmeir, Prabir Burman, and Jeffrey Racine for the helpful discussion.
- Bergmeir, Christoph, and José M. Benítez. “On the use of cross-validation for time series predictor evaluation.” Information Sciences 191 (2012): 192-213.
- Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. “A note on the validity of cross-validation for evaluating autoregressive time series prediction.” Computational Statistics & Data Analysis 120 (2018): 70-83.
- Burman, Prabir, Edmond Chow, and Deborah Nolan. “A cross-validatory method for dependent data.” Biometrika 81.2 (1994): 351-358.
- Racine, Jeff. “Consistent cross-validatory model-selection for dependent data: hv-block cross-validation.” Journal of Econometrics 99.1 (2000): 39-61.
- Roberts, David R., et al. “Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure.” Ecography 40.8 (2017): 913-929.