Future of my TSCV package -- a letter to my users
Nearly two years ago, I developed a time-series cross-validation package, tscv, which has since been widely adopted by scientists and quantitative traders worldwide.
Seeing ~1000 monthly downloads, I am delighted that I made some positive contributions to this world.
Meanwhile, in the last two years, a lot has happened to our world as well as to me.
Although I never for a second forgot my responsibility towards my users, I was, unfortunately, unable to maintain this package.
As a consequence, as you may have noticed, this package has been incompatible with scikit-learn version 0.24 for the past two months.
To address this issue, I have decided to restore the compatibility and enhance tscv, and this post records that commitment.
Compatibility with Scikit-Learn v0.24 and onwards
In this section, I first explain what happened in v0.24, then present my solution for restoring compatibility and the reasoning behind it.
The incompatibility results from the scikit-learn team’s decision to make the safe_indexing function private.
That is, safe_indexing was renamed to _safe_indexing in v0.24.
This change implies that the scikit-learn team can alter the API of _safe_indexing in any minor version upgrade (e.g., v1.0 to v1.1).
The net consequence is that third-party developers like me cannot rely on this function if we want to stay compatible within a major version family, that is, across all minor versions (e.g., v1.X).
I have talked with scikit-learn’s core developers and learned the motive behind this decision.
They told me that making safe_indexing private gives them more freedom to expand the functionality of that function (e.g., indexing the new xarray class).
I believe that this direction is a positive thing for scikit-learn users.
Scikit-learn is moving toward its first major version, v1.0, which is the new name for the next version, v0.25. A private _safe_indexing therefore allows the functionality to expand within the v1.X line.
That is, users do not have to wait for v2.0 for any compatibility-breaking enhancement.
To cope with the incompatibility, I have two choices. The first is to keep a forked copy of safe_indexing.
By calling the forked copy internally, I no longer need to worry about compatibility.
The downside is that I cannot benefit from the evolution of scikit-learn, which limits the power of my tscv package.
The second choice is to call the new _safe_indexing instead, so as to benefit from potential new features in scikit-learn, say, xarray support.
The downside is that I have to adjust my package for every compatibility-breaking change the scikit-learn team makes.
I have decided to take the second approach.
It may indeed cause some trouble for us third-party developers, but this trouble is negligible compared to the benefit mentioned above.
The effort to let my users enjoy the newest and most powerful features in scikit-learn is worthwhile.
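For readers maintaining their own extensions, the two choices can be combined into one import shim. The following is only a minimal sketch, not the actual tscv source: the wrapper name select_rows and the simplified pure-Python fallback are hypothetical, and the fallback handles only plain sequences indexed by integer positions.

```python
# Compatibility shim: an illustrative sketch, not tscv's actual code.
try:
    # Private helper; its API may change between scikit-learn releases.
    from sklearn.utils import _safe_indexing
except ImportError:
    try:
        # Public name exposed by older scikit-learn releases.
        from sklearn.utils import safe_indexing as _safe_indexing
    except ImportError:
        # Hypothetical last resort: a heavily simplified "forked copy"
        # that supports only plain sequences and integer positions.
        def _safe_indexing(X, indices):
            return [X[i] for i in indices]


def select_rows(X, indices):
    """Hypothetical wrapper: the only place that touches the private helper."""
    return _safe_indexing(X, indices)
```

Keeping the private call behind a single wrapper means that a future change to _safe_indexing only requires adjusting one function.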
I intend to make tscv compatible with every scikit-learn version from v0.22 onwards (>=v0.22), and this will happen within the v0.1.X line.
As for older scikit-learn versions (<=v0.24), the v0.0.5 version of tscv (currently being stabilized) will stay relevant.
I have released the first release candidate of v0.0.5, and the binary can be downloaded here.
The final version is expected to come out by the end of the month.
If you notice any bug, please open a ticket in my GitHub repo.
Overlapped test sets
Version v0.0.5 also enables the feature known as overlapped test sets in the GapWalkForward class.
From now on, you can use designs like the following:
|=======o****    |
|  =======o****  |
|    =======o****|
= : train
o : gap
* : test
The level of overlap is controlled by the newly added rollback_size parameter.
For instance, the above example has a rollback_size of 2, so consecutive test sets share 2 data points.
The rollback_size defaults to 0 and must be less than the test_size.
A higher rollback_size permits more cross-validation folds.
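To see why a higher rollback_size yields more folds, here is a small pure-Python sketch of the idea. It is not the GapWalkForward implementation: the helper overlapped_folds is hypothetical, and it assumes a fixed training size and a 16-sample series with 7 training points, a gap of 1, and test sets of 4, sizes chosen for illustration.

```python
def overlapped_folds(n_samples, train_size, gap_size, test_size, rollback_size=0):
    """Yield (train, test) index ranges for a walk-forward design.

    Hypothetical helper for illustration; not the GapWalkForward source.
    """
    if not 0 <= rollback_size < test_size:
        raise ValueError("rollback_size must be in [0, test_size)")
    # Each fold moves forward by the part of the test set that does not overlap.
    step = test_size - rollback_size
    fold_len = train_size + gap_size + test_size
    for start in range(0, n_samples - fold_len + 1, step):
        test_start = start + train_size + gap_size
        yield (range(start, start + train_size),
               range(test_start, test_start + test_size))


for train, test in overlapped_folds(16, train_size=7, gap_size=1,
                                    test_size=4, rollback_size=2):
    print(list(test))
# [8, 9, 10, 11]
# [10, 11, 12, 13]
# [12, 13, 14, 15]
```

With rollback_size=2 the step between folds is 2, which gives three folds; with rollback_size=0 the step is 4, and the same series fits only two.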
Re-implementation of GapWalkForward
The GapWalkForward class has 5 folds by default.
If a user wants to maximize the sample’s utility, say, with the first test set starting from the first data points, they will have to precompute the proper value of n_splits manually.
This puts a burden on users, especially when the rollback_size parameter is in use.
This inconvenience results from the legacy implementation of GapWalkForward, which is a subclass of the _BaseKFold abstract base class.
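The manual precomputation mentioned above might look like the sketch below. It rests on assumptions that are mine, not taken from the tscv source: the last test set ends at the last sample, each earlier test set starts test_size - rollback_size samples before the next one, and the earliest fold must keep at least min_train_size training samples before the gap. The helper name max_n_splits and the min_train_size parameter are hypothetical.

```python
def max_n_splits(n_samples, test_size, gap_size=0, rollback_size=0, min_train_size=1):
    """Largest number of folds that fits, under the layout assumed above.

    Hypothetical helper; not part of tscv or scikit-learn.
    """
    if not 0 <= rollback_size < test_size:
        raise ValueError("rollback_size must be in [0, test_size)")
    step = test_size - rollback_size
    # Samples left over once the last fold and the smallest acceptable
    # training set (plus the gap) are accounted for.
    room = n_samples - min_train_size - gap_size - test_size
    if room < 0:
        return 0
    return room // step + 1


print(max_n_splits(16, test_size=4, gap_size=1, rollback_size=2, min_train_size=7))
# 3
```

The result could then be passed as n_splits, after checking it against the behavior of the installed tscv version.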
I reckon that the legacy implementation is not optimal and am therefore planning to re-implement the class. This is not refactoring, since refactoring should not change the API. Instead, I will overhaul the entire class, which will break backward compatibility.
The overhaul will happen in v0.1.0, which will hopefully be released by the end of April.
By then, my users will have the most flexible time-series cross-validation tool possible.
This feature will not be backported to the v0.0.X line, and the old behavior will be deprecated in v0.1.0.
This will usually not cause any trouble.
It will become an issue only when a user upgrades to v0.1.0 but still wants to stick to the old behavior of GapWalkForward.
In this case, they can switch to the native TimeSeriesSplit class of scikit-learn, which is equivalent to v0.0.4 of tscv.
If they are not happy with TimeSeriesSplit and want the v0.0.5 behavior implemented, they can open a ticket in the scikit-learn repository and @me.
(Edit on 9 May: The v0.1.X line will still keep GapWalkForward available for backward-compatibility; it is deprecated but not removed. The new functionality is incorporated in the GapRollForward class.)
Transparency
In contrast to a particular government that hides everything from its citizens and the rest of the world, I believe that transparency is the key to making our world a better place.
For this purpose, I wrote this letter to communicate the future of tscv to my users.
I hope that transparency can make my work more reliable and make me more dependable.
I strive to make the best software for my users, and in return, I hope my users can support me.
Your support is vital to the release of v0.1.0, which will also incorporate the continuous integration toolchain and documentation to make it more production-ready (see the v0.1.0 milestone).
You can support me via the following methods:
- Be a sponsor.
- I have a short paper related to time-series cross-validation but not directly targeting this software. If it does not violate your academic integrity, please consider citing it (see README.md).
Take-home messages
- The v0.0.5 version will come out by the end of March. It will solve the compatibility issue and incorporate some enhancements. A pre-release is now available here.
- The v0.1.0 version will come out by the end of April. It will overhaul the GapWalkForward class to make it more flexible. The backward compatibility will be dropped. (Edit on 9 May: it provides GapRollForward, a more flexible and powerful cross-validator.)
- Please consider supporting my work.