# TSCV: A Python package for Time Series Cross-Validation

Cross-validation, a popular tool in machine learning and statistics, is crucial for model selection and hyperparameter tuning.
To use this tool, one often requires that the data are *independent and identically distributed*.
However, this hypothesis is violated by time series, where successive data points are interdependent.

Many cross-validation packages, such as scikit-learn, rely on the independence hypothesis and are therefore of little help for time series.
To solve this problem, I developed a Python package, `TSCV`, which enables cross-validation for time series without requiring independence.

The intuition behind this package is that, by introducing **gaps** between the training set and the test set, the temporal dependence can be mitigated.
Hence, once the gap is introduced, **leave-p-out**, **K-Fold**, and so forth become valid again.
Research (see the bibliography at the end) shows that cross-validation with gaps outperforms cross-validation without gaps.

The best feature of this package is that **it works seamlessly with scikit-learn**.
You can pass any class in my package (that is, any cross-validator) as the `cv` argument to the `cross_validate` function in scikit-learn.
Indeed, my package is designed as an extension of scikit-learn rather than as a standalone package.

In the following, I will present the various cross-validators in my package.

- Gap leave p out
- Gap K-Fold
- Gap walk forward
- Gap train test split

At the end, I will demonstrate how to use this extension with scikit-learn seamlessly.

## Gap leave p out

An ordinary leave-p-out cross-validation uses each combination of $p$ data samples in turn as the test set and the remaining samples as the training set. The test sets need not be contiguous.

Gap leave-p-out, as its name suggests, introduces **gaps** between the training set and the test set.
Since every fragment of a “shattered” test set would need its own gaps, which wastes data, contiguous test sets are preferred.
Also, the gaps before and after the test set need not be of equal size.

The gap leave-p-out cross-validation can be reproduced with the `GapLeavePOut` class.
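To make the splitting rule concrete, here is a rough plain-Python sketch of the idea. It is not the package's implementation: the function name `gap_leave_p_out_indices` and the parameters `gap_before` and `gap_after` are assumptions chosen for this illustration, and the real class may handle edge cases differently.

```python
def gap_leave_p_out_indices(n_samples, p, gap_before=0, gap_after=0):
    """Yield (train, test) index lists with a contiguous test block of size p.

    Hypothetical helper for illustration only, not part of any package.
    """
    for start in range(n_samples - p + 1):
        stop = start + p
        test = list(range(start, stop))
        # Keep only the samples outside the test block widened by both gaps.
        train = [
            i for i in range(n_samples)
            if i < start - gap_before or i >= stop + gap_after
        ]
        yield train, test


# Eight samples, test blocks of two, unequal gaps on the two sides.
lpo_splits = list(gap_leave_p_out_indices(8, p=2, gap_before=1, gap_after=2))
for train, test in lpo_splits:
    print("train:", train, "test:", test)
```

In the fourth split, for instance, the test block is `[3, 4]`, sample `2` falls in the gap before it, samples `5` and `6` fall in the gap after it, and the training set is `[0, 1, 7]`.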

## Gap K-Fold

An ordinary K-Fold splits the data into $K$ folds, then each time uses one fold as the test set and the remaining folds as the training set. The data are preferably shuffled before being split into $K$ folds.

The gap K-Fold also splits the data into $K$ folds. The test sets are left untouched, while the samples that fall within the gaps are removed from the training sets. Unlike K-Fold, gap K-Fold does not shuffle the data.

The gap K-Fold cross-validation can be reproduced with the `GapKFold` class.
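The fold construction can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions (the name `gap_k_fold_indices`, the parameters `gap_before` and `gap_after`, and the rule that earlier folds receive any leftover samples), not the code of the `GapKFold` class itself.

```python
def gap_k_fold_indices(n_samples, n_splits, gap_before=0, gap_after=0):
    """Yield (train, test) index lists for unshuffled K-Fold with gaps.

    Hypothetical helper for illustration only, not part of any package.
    """
    # Fold sizes differ by at most one; the first folds take the remainder.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        test = list(range(start, stop))
        # The test fold is untouched; the gaps are cut out of the training set.
        train = [
            i for i in range(n_samples)
            if i < start - gap_before or i >= stop + gap_after
        ]
        yield train, test
        start = stop


kfold_splits = list(gap_k_fold_indices(10, n_splits=5, gap_before=1, gap_after=1))
for train, test in kfold_splits:
    print("train:", train, "test:", test)
```

Every sample still appears in exactly one test fold, so the gaps cost training data only.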

## Gap walk forward

Walk-forward is very similar to K-Fold except that it ignores the data after the test set.

Gap walk-forward works similarly: it introduces a gap between the training set and the test set, and this very gap is removed from the training set.

The gap walk-forward cross-validation can be reproduced with the `GapWalkForward` class.
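As a rough sketch of the scheme, assuming an expanding training window and test blocks placed back to back at the end of the series, the indices could be generated as follows. The function name and its parameters (`test_size`, `gap_size`) are assumptions for this illustration; the actual class may place the blocks differently or offer more options.

```python
def gap_walk_forward_indices(n_samples, n_splits, test_size, gap_size=0):
    """Yield (train, test) index lists where training data precede the test block.

    Hypothetical helper for illustration only, not part of any package.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap_size <= 0:
        raise ValueError("not enough samples for the requested splits and gap")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Training stops gap_size samples before the test block;
        # everything after the test block is ignored.
        train = list(range(test_start - gap_size))
        test = list(range(test_start, test_start + test_size))
        yield train, test


wf_splits = list(gap_walk_forward_indices(10, n_splits=3, test_size=2, gap_size=1))
for train, test in wf_splits:
    print("train:", train, "test:", test)
```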

Gap walk-forward is less efficient than gap K-Fold and gap leaving p out in that it does not make the fullest use of the data set. It can be advantageous if the time series is non-stationary though.

## Gap train test split

Unlike the cross-validators above, gap train-test split is not a cross-validator but a one-line function that splits the data set into a training set and a test set while removing the gap between them.

This split can be reproduced with the `gap_train_test_split` function.
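The operation itself is simple enough to sketch by hand. The helper below is a hypothetical stand-in written for this illustration (its name `split_with_gap` and its parameters are my assumptions), not the package function; it assumes the test set is taken from the end of the series.

```python
def split_with_gap(samples, test_size, gap_size=0):
    """Return (train, test) slices of an ordered sequence, dropping the gap.

    Hypothetical helper for illustration only, not part of any package.
    """
    n = len(samples)
    train_stop = n - test_size - gap_size
    if test_size <= 0 or gap_size < 0 or train_stop <= 0:
        raise ValueError("sizes must leave a non-empty training set and test set")
    # The last test_size samples are the test set; the gap_size samples
    # just before them are discarded.
    return samples[:train_stop], samples[n - test_size:]


series = list(range(10))
train_part, test_part = split_with_gap(series, test_size=2, gap_size=2)
print(train_part, test_part)  # [0, 1, 2, 3, 4, 5] [8, 9]
```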

## Use them with scikit-learn

The best feature of this package is that you can use it with scikit-learn seamlessly. Let me show you with an example.

First let us load the data, the algorithm, and the evaluation.

Then we construct a `GapKFold` object and pass it, as the argument for `cv`, to the `cross_val_score` function.

As you can see, the classes in this package are used in exactly the same way as the classes in scikit-learn.
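What makes this interchangeability possible is, as far as I understand it, a small interface: scikit-learn accepts as `cv` any object that provides `split` and `get_n_splits` methods (or simply an iterable of train/test index pairs). The sketch below imitates that interface in plain Python, without importing scikit-learn, so that it runs on its own. The class `GapBlockSplitter` and the scoring helper are hypothetical names invented for this illustration, and the "model" is just the training mean.

```python
class GapBlockSplitter:
    """Hypothetical splitter exposing the two methods a cross-validator needs."""

    def __init__(self, n_splits, gap=0):
        self.n_splits = n_splits
        self.gap = gap

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        # Contiguous, unshuffled blocks; the same gap is cut on both sides.
        bounds = [n * k // self.n_splits for k in range(self.n_splits + 1)]
        for start, stop in zip(bounds[:-1], bounds[1:]):
            test = list(range(start, stop))
            train = [
                i for i in range(n)
                if i < start - self.gap or i >= stop + self.gap
            ]
            yield train, test


def mean_absolute_errors(y, cv):
    """Score a naive 'predict the training mean' model on every split."""
    scores = []
    for train, test in cv.split(y):
        prediction = sum(y[i] for i in train) / len(train)
        scores.append(sum(abs(y[i] - prediction) for i in test) / len(test))
    return scores


y = [float(t) for t in range(12)]
cv = GapBlockSplitter(n_splits=3, gap=1)
scores = mean_absolute_errors(y, cv)
print(cv.get_n_splits(), scores)  # 3 [6.5, 1.0, 6.5]
```

The scoring loop only ever calls `cv.split`, so any object with the same two methods could be swapped in without touching the rest of the code.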

## Resources

- This package is open-source and is hosted on GitHub. The user guide can be found in the README file. If you like this package, please star the repository.
- I have opened a pull request on scikit-learn. If you would like to see it merged and use it directly within scikit-learn, please comment on the pull request.

## Acknowledgment

- I would like to thank Christoph Bergmeir, Prabir Burman, and Jeffrey Racine for the helpful discussion.

## Bibliography

- Bergmeir, Christoph, and José M. Benítez. “On the use of cross-validation for time series predictor evaluation.” *Information Sciences* 191 (2012): 192-213.
- Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. “A note on the validity of cross-validation for evaluating autoregressive time series prediction.” *Computational Statistics & Data Analysis* 120 (2018): 70-83.
- Burman, Prabir, Edmond Chow, and Deborah Nolan. “A cross-validatory method for dependent data.” *Biometrika* 81.2 (1994): 351-358.
- Racine, Jeff. “Consistent cross-validatory model-selection for dependent data: hv-block cross-validation.” *Journal of Econometrics* 99.1 (2000): 39-61.
- Roberts, David R., et al. “Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure.” *Ecography* 40.8 (2017): 913-929.