Future of my TSCV package -- a letter to my users

Nearly two years ago, I developed tscv, a time-series cross-validation package, which has since been widely adopted by scientists and quantitative traders worldwide. Seeing ~1000 monthly downloads, I am delighted to have made a positive contribution to this world. Meanwhile, a lot has happened in the last two years, to our world as well as to me. Although I never for a second forgot my responsibility towards my users, I was, unfortunately, unable to maintain this package. As a consequence, as you may have noticed, it has been incompatible with scikit-learn since v0.24 came out two months ago. In response, I have decided to restore the compatibility and enhance tscv, and this post is the record of that commitment.

Compatibility with Scikit-Learn v0.24 and onwards

In this section, I will first explain what happened in v0.24, then present my solution for restoring compatibility and the reasoning behind it.

The incompatibility results from the scikit-learn team’s decision to make the safe_indexing function private. That is, the public name safe_indexing no longer exists in v0.24; only its private counterpart, _safe_indexing, remains.

This change implies that the scikit-learn team may alter the API of _safe_indexing in any minor version upgrade (e.g., v1.0 to v1.1). The net consequence is that third-party developers like me cannot rely on this function if we want to stay compatible across all minor versions within a major version family (e.g., v1.X).

I have communicated with scikit-learn’s core developers and learned the motive behind this decision. They told me that making safe_indexing private gives them more freedom to expand the functionality of that function (e.g., indexing the new xarray class).

I believe that this direction is a positive thing for scikit-learn users. Scikit-learn is moving toward its first major version, v1.0, which is the new name of what would have been v0.25. A private _safe_indexing allows its functionality to expand within the v1.X line. That is, users do not have to wait for v2.0 for a compatibility-breaking enhancement.

To cope with the incompatibility, I have two choices. The first is to keep a forked copy of safe_indexing. By calling this copy internally, I no longer need to worry about compatibility. The downside is that I cannot benefit from the evolution of scikit-learn, which limits the power of tscv.

The second choice is to call the new _safe_indexing instead, so as to benefit from potential new features in scikit-learn, say, xarray support. The downside is that I have to adapt my package to every compatibility-breaking change the scikit-learn team makes.

I have decided to take the second approach. It may indeed cause some trouble for us third-party developers, but this trouble is negligible compared to the benefit mentioned above. The effort to let my users enjoy the newest and most powerful features in scikit-learn is worthwhile.

I intend to make tscv compatible with every scikit-learn version from v0.22 onwards (>=v0.22), and this will happen within the v0.1.X line. As for older scikit-learn versions (<=v0.24), v0.0.5 of tscv (currently being stabilized) will stay relevant.

I have published the first release candidate of v0.0.5, and the binary can be downloaded here. The final version is expected by the end of the month. If you notice any bug, please open a ticket in my GitHub repo.

Overlapped test sets

Version 0.0.5 also enables overlapped test sets in the GapWalkForward class. From now on, you can use designs like the following:

|=======o****    |
|  =======o****  |
|    =======o****|

 = : train
 o : gap
 * : test

The level of overlap is controlled by the newly added rollback_size parameter. For instance, the above example has a rollback_size of 2.

>>> from tscv import GapWalkForward
>>> cv = GapWalkForward(n_splits=3, max_train_size=7, gap_size=1, test_size=4, rollback_size=2)
>>> for train, test in cv.split(range(16)):
...    print("train:", train, "test:", test)

train: [0 1 2 3 4 5 6] test: [ 8  9 10 11]
train: [2 3 4 5 6 7 8] test: [10 11 12 13]
train: [ 4  5  6  7  8  9 10] test: [12 13 14 15]

The rollback_size defaults to 0 and must be less than the test_size. A higher rollback_size permits more cross-validation folds.
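To make the role of rollback_size concrete, here is a small pure-Python sketch that reproduces the indices printed above. It is not the tscv implementation. The function name walk_forward_folds is hypothetical, and the rule that the last test set ends at the last sample, with each earlier fold shifted back by test_size - rollback_size, is an assumption inferred from the example output.

```python
def walk_forward_folds(n_samples, n_splits, test_size, gap_size=0,
                       max_train_size=None, rollback_size=0):
    """Yield (train, test) index lists; illustrative sketch only."""
    if not 0 <= rollback_size < test_size:
        raise ValueError("rollback_size must satisfy 0 <= rollback_size < test_size")
    # Consecutive test sets start this many samples apart.
    step = test_size - rollback_size
    # Assumption: the last test set ends exactly at the last sample.
    first_test_start = n_samples - test_size - (n_splits - 1) * step
    if first_test_start - gap_size <= 0:
        raise ValueError("too many folds for the given sample size")
    for k in range(n_splits):
        test_start = first_test_start + k * step
        train_end = test_start - gap_size  # the gap sits between train and test
        train_start = 0
        if max_train_size is not None:
            train_start = max(0, train_end - max_train_size)
        yield (list(range(train_start, train_end)),
               list(range(test_start, test_start + test_size)))


for train, test in walk_forward_folds(16, n_splits=3, test_size=4, gap_size=1,
                                      max_train_size=7, rollback_size=2):
    print("train:", train, "test:", test)
```

With rollback_size=2 and test_size=4, consecutive test sets start 2 samples apart, so each one shares its first two indices with the previous test set.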

Re-implementation of `GapWalkForward`

The GapWalkForward class uses 5 folds by default. A user who wants to make the most of the sample, say, by letting the first test set start from the earliest possible data points, has to precompute the proper value of n_splits manually. This puts a burden on users, especially when the rollback_size parameter is in use.
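The manual precomputation mentioned above amounts to a little arithmetic. The sketch below is only an illustration under stated assumptions: the helper max_n_splits and its min_train_size parameter are hypothetical (neither is part of tscv), and the folds are assumed to be anchored at the end of the sample, as the example output in the previous section suggests.

```python
def max_n_splits(n_samples, test_size, gap_size=0, min_train_size=1,
                 rollback_size=0):
    """Largest number of folds that still leaves min_train_size training samples.

    Hypothetical helper for illustration; not part of tscv.
    """
    if not 0 <= rollback_size < test_size:
        raise ValueError("rollback_size must satisfy 0 <= rollback_size < test_size")
    step = test_size - rollback_size
    # Samples left over once a single fold (train + gap + test) has been placed.
    spare = n_samples - test_size - gap_size - min_train_size
    if spare < 0:
        return 0  # not even one fold fits
    # Every additional fold shifts the first test set back by `step` samples.
    return spare // step + 1


# 16 samples, full 7-sample training windows, gap of 1, test sets of 4.
print(max_n_splits(16, test_size=4, gap_size=1, min_train_size=7, rollback_size=2))
print(max_n_splits(16, test_size=4, gap_size=1, min_train_size=7, rollback_size=0))
```

Under these assumptions, the first call yields 3 folds and the second only 2, which matches the observation that a higher rollback_size permits more folds.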

This inconvenience results from the legacy implementation of GapWalkForward, which is a subclass of the _BaseKFold abstract class.

I reckon that it is not the optimal implementation and therefore plan to re-implement it. This is not refactoring, since refactoring should not change the API. Instead, I will overhaul the entire class, which will break backward compatibility.

The overhaul will happen in v0.1.0, which hopefully will be released by the end of April. By then, my users will have the most flexible time-series cross-validation tool possible.

This feature will not be backported to the v0.0.X line, and the old behavior will be deprecated in v0.1.0. This will usually not cause any trouble. It becomes an issue only when a user upgrades to v0.1.0 but still wants to stick to the old behavior of GapWalkForward. In this case, they can switch to the native TimeSeriesSplit class of scikit-learn, which is equivalent to v0.0.4 of tscv. If they are not happy with TimeSeriesSplit and want the v0.0.5 behavior implemented, they can open a ticket in the scikit-learn repository and @me. (Edit on 9 May: The v0.1.X line will still keep GapWalkForward available for backward compatibility; it is deprecated but not removed. The new functionality is incorporated in the GapRollForward class.)

Transparency

In contrast to a particular government that hides everything from its citizens and the rest of the world, I believe that transparency is the key to making our world a better place. For this purpose, I wrote this letter to communicate the future of tscv to my users.

I hope that transparency can make my work more reliable and make me more dependable. I strive to make the best software for my users, and in return, I hope my users can support me. Your support is vital to the release of v0.1.0, which will also incorporate the continuous integration toolchain and documentation to make it more production-ready (see the v0.1.0 milestone).

You can support me via the following methods:

Be a sponsor.
I have a short paper related to time-series cross-validation but not directly targeting this software. If it does not violate your academic integrity, please consider citing it (see README.md)

Take-home messages

The v0.0.5 version will come out by the end of March. It will solve the compatibility issue and incorporate some enhancements. A pre-release is now available here.
The v0.1.0 version will come out by the end of April. ~~It will overhaul the GapWalkForward class to make it more flexible. The backward compatibility will be dropped.~~ (Edit on 9 May: it provides GapRollForward, a more flexible and powerful cross-validator.)
Please consider supporting my work.

Wenjie Zheng
