Demystifying KerasTuner
Hyperparameter tuning is a pivotal concept in machine learning, whether you are doing deep learning or not. For deep learning, KerasTuner is a powerful and easy-to-use package that can tune any Keras model in virtually any way. Yet the intricate machinery concealed beneath its API can make users uneasy. This post will demystify its internal implementation and help you understand how KerasTuner accomplishes this feat. I hope that after reading it, you will feel confident using KerasTuner, be it for tuning a neural network or any other custom model. If you are an aspiring, hard-working data scientist, this post is a sure read.
Table of Contents
- The mysterious Hyperparameter class
- Reimplement KerasTuner, or at least partially
- A starter dish
- Package design
- A home-made dessert
- Use KerasTuner to tune Scikit-Learn models
The mysterious Hyperparameter class
If you are following the official KerasTuner guide, you will stumble upon the following code snippet.
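The snippet is not reproduced in this copy; it looks roughly like the following, modeled on the guide (details may differ from the current version of the guide):

```python
import keras_tuner
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Dense(
        units=hp.Int("units", min_value=32, max_value=512, step=32),
        activation="relu",
    ))
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```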
You will unavoidably notice this intricate line: `units=hp.Int("units", min_value=32, max_value=512, step=32)`.
Your intuition tells you that the code is assigning a hyperparameter to the number of units in the hidden layer, but you can’t help wondering how the program knows which value to use at each call.
Most intriguingly, how can a function produce different values with the exact same arguments?
The official guide does a poor job of easing your confusion by merely telling you that `hp.Int()` is just a function “returning actual values.”
You run the above code multiple times, but it returns 32 on every occasion. You can’t help but start to question the legitimacy of this package: is it trolling you?
In the following, I will reassure you that it is not trolling you. Indeed, I will teach you how it is implemented.
Reimplement KerasTuner, or at least partially
A starter dish
Let us first solve this mystery: how can a function return different values with the exact same arguments?
The catch here is that a function in the sense of programming is not the same as one in the sense of mathematics. In mathematics, a function such as $\sin(x)$ with the same $x$ will always give the same value. This property is called statelessness. In programming, however, it is possible and easy to build a stateful function. A pseudorandom number generator is one such example.
The following is a more complex example using a closure. (You can achieve the same effect with a class, but I just want to use this example to show off.)
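The original code block is missing from this copy; a minimal sketch of what it does:

```python
def build_counter():
    count = 0  # state captured by the closure

    def counter():
        nonlocal count
        count += 1
        return count

    return counter

counter = build_counter()
print(counter())  # 1
print(counter())  # 2: same (empty) arguments, different result
```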
`build_counter` here is a factory function, whose output is itself a function. Here, we use it to build the `counter` function, which simply counts.
This function returns a different value each time, even though it is given the exact same input.
Hence we know that it is indeed possible for `hp.Int` to return different values when properly set up.
To understand this setup, I will analyze the KerasTuner’s design in the next subsection.
Package design
The package revolves around four key concepts, with the first two being the most critical:
- Hyperparameters
- Tuners
- Oracles
- Hypermodels
The package defines the `HyperParameters` class, which is a container holding all hyperparameters ever instantiated. When you call `hp.Int("units", min_value=32, max_value=512, step=32)`, it checks its inventory for a hyperparameter called “units”. If the search is unsuccessful, it instantiates such a hyperparameter and assigns it a default value, in this case 32.
If this hyperparameter already exists, it will simply return its current value.
In a nutshell, this simple function achieves two effects, one being hyperparameter registration and the other being hyperparameter retrieval, depending on whether the hyperparameter in question is already registered.
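To see the register-then-retrieve behavior in isolation, here is a quick experiment (not from the original post) with a bare `HyperParameters` object:

```python
import keras_tuner

hp = keras_tuner.HyperParameters()

# First call: "units" is not in the inventory, so it is registered and
# its default value (the minimum, 32) is returned.
print(hp.Int("units", min_value=32, max_value=512, step=32))  # 32

# Second call: "units" is already registered, so its current value is returned.
print(hp.Int("units", min_value=32, max_value=512, step=32))  # 32 again
```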
Therefore, the only remaining piece is a way to alter the current value of this hyperparameter, and this is achieved by the `Tuner` class.
Before talking about the `Tuner` class, let us first analyze the `HyperModel` class and its simpler form: the `build_model` function.
In the following code, `build_model` is a factory function: it takes a hyperparameter container as input and outputs a function, in this case a Keras model, which is callable. Each time it is called, it returns a neural network with the desired number of units in the hidden layer.
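The original block did not survive extraction; the following compact stand-in is consistent with the guide snippet above (it reuses those imports, and the mutation of `hp.values` mimics what a tuner does internally, an implementation detail that may vary across KerasTuner versions):

```python
def build_model(hp):
    # Delegate the choice of the number of units to the HyperParameters container.
    units = hp.Int("units", min_value=32, max_value=512, step=32)
    model = keras.Sequential([
        layers.Dense(units=units, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

hp = keras_tuner.HyperParameters()
model_a = build_model(hp)  # hidden layer with 32 units (the current value of "units")
hp.values["units"] = 64    # roughly what a tuner does between trials (internal detail)
model_b = build_model(hp)  # the very same call now yields a hidden layer with 64 units
```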
Of course, on paper it has no idea how many units it should use. Instead, it achieves this through the visitor pattern, that is, by delegating the decision to the `HyperParameters` container class sent to it as a visitor. The `HyperParameters` class takes care of this and provides `build_model` with the right hyperparameter value through the call to `hp.Int`.
Besides this ad hoc `build_model` function factory, you can define a more formal `HyperModel` class to achieve the same effect. In the following code snippet, `BuildModel.build()` achieves the same effect as `build_model()` in the example above.
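Again the snippet is missing from this copy; a plausible reconstruction (the activation choice via `hp.Choice` is an assumption based on the next sentence):

```python
class BuildModel(keras_tuner.HyperModel):
    def build(self, hp):
        model = keras.Sequential([
            layers.Dense(
                units=hp.Int("units", min_value=32, max_value=512, step=32),
                activation=hp.Choice("activation", ["relu", "tanh"]),
            ),
            layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        return model
```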
As for `hp.Choice`, it is just another hyperparameter playing a similar role to `hp.Int`.
With the `HyperParameters` container class and the `HyperModel` class, the `Tuner` class has all the ingredients for model tuning.
All it needs to do is a loop of:
- configuring the hyperparameter value
- calling `HyperModel.build()`
- fitting the model
- evaluating the solution
Furthermore, the `Tuner` class itself does not pick the hyperparameter values directly; instead, it delegates this task to an `Oracle` class.
There are various `Oracle` classes implementing various hyperparameter-selection algorithms, such as grid search, Hyperband, and Bayesian optimization.
A home-made dessert
To test our understanding, let us reimplement a simplified version of KerasTuner.
We first build our home-made hyperparameters container class.
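The original implementation is not shown in this copy; a minimal sketch that matches the description above (the class and attribute names are my own):

```python
class MyHyperParameters:
    """A home-made, minimal stand-in for keras_tuner.HyperParameters."""

    def __init__(self):
        self.space = {}   # name -> list of candidate values
        self.values = {}  # name -> current value

    def Int(self, name, min_value, max_value, step=1):
        if name not in self.space:
            # Register the hyperparameter and give it a default value.
            self.space[name] = list(range(min_value, max_value + 1, step))
            self.values[name] = min_value
        # Retrieve the current value.
        return self.values[name]
```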
Let us unit-test it.
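For example (a sketch):

```python
hp = MyHyperParameters()
assert hp.Int("units", 32, 512, step=32) == 32  # first call registers and returns the default
hp.values["units"] = 64                         # pretend a tuner changed the current value
assert hp.Int("units", 32, 512, step=32) == 64  # second call retrieves the new value
```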
It works as intended.
Now let us define our home-made Tuner class.
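Again, the original code is missing; the following sketch implements the loop described above for the single-hyperparameter case:

```python
class MyTuner:
    """A home-made tuner that grid-searches a single hyperparameter."""

    def __init__(self, build_model):
        self.build_model = build_model
        self.hp = MyHyperParameters()
        self.results = {}

    def search(self, x_train, y_train, x_val, y_val, epochs=5):
        # A first call registers the hyperparameter and populates the search space.
        self.build_model(self.hp)
        name, candidates = next(iter(self.hp.space.items()))  # single hyperparameter only
        for value in candidates:
            self.hp.values[name] = value                           # configure the value
            model = self.build_model(self.hp)                      # build the model
            model.fit(x_train, y_train, epochs=epochs, verbose=0)  # fit
            self.results[value] = model.evaluate(x_val, y_val, verbose=0)  # evaluate
```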
The above code works only when there is a single hyperparameter. With multiple hyperparameters, navigating the search space becomes much more complicated and exceeds the scope of this post.
Let’s put it into action.
It indeed ran all models with various hyperparameter values. Our understanding is correct!
Use KerasTuner to tune Scikit-Learn models
To push your proficiency with KerasTuner to the next level, let us use it with scikit-learn. KerasTuner was originally developed to tune Keras models, and out of the box it does nothing for models from other packages. However, with some tweaking, we can make it work with models from any package, including scikit-learn. In particular, in this section we will tune ridge regression and lasso step by step.
First let us create a dataset worthy of regularization.
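The original data-generation code is not shown; here is a sketch with assumed numbers (200 samples split in half, 50 features, 5 of them informative, and `shuffle=False` so the informative features come first), so the exact scores reported below will not be reproduced verbatim:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=200, n_features=50, n_informative=5,
    noise=10.0, shuffle=False, random_state=0,
)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
```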
The problem is still over-determined ($n > d$), but the sample size $n$ is laughably small. To our advantage, the ground truth is nonetheless sparse ($p \ll d$, where $p$ is the number of informative features).
Let’s visualize the first feature, which is informative. We can observe an upward trend.
Let’s visualize a non-informative feature.
Now let’s try Ordinary Least Square on the informative features only.
The `score` function returns the $R^2$ value, which equals 0.68 on the training set and 0.63 on the validation set.
These are decent results compared to the Bayes-optimal bound of 0.8.
With a sample size of only 100, you cannot expect too much.
Now let’s try the same thing on the full feature set.
We observe that the training score greatly improves and even exceeds the Bayes-optimal bound. However, this does not go unpunished: the validation score drops to 0.287, much lower than before. This is textbook overfitting, crying out for regularization. In case you are not convinced, the following figure shows the complete results for every model size (from 1 to 50).
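The figure itself is not reproduced here; under the assumptions above, it could be generated along these lines:

```python
# Fit OLS on the first k features for k = 1, ..., 50 and record both scores.
sizes = range(1, 51)
train_scores, val_scores = [], []
for k in sizes:
    m = LinearRegression().fit(x_train[:, :k], y_train)
    train_scores.append(m.score(x_train[:, :k], y_train))
    val_scores.append(m.score(x_val[:, :k], y_val))

plt.plot(sizes, train_scores, label="training $R^2$")
plt.plot(sizes, val_scores, label="validation $R^2$")
plt.xlabel("model size (number of features)")
plt.ylabel("$R^2$")
plt.legend()
plt.show()
```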
In the following, we will use KerasTuner to tune Ridge Regression and Lasso, respectively.
The bad news is that the package's built-in Tuners accept only Keras models.
To make it work with scikit-learn models, we need to build customized Tuners.
The following code subclasses `keras_tuner.BayesianOptimization` and overrides the `run_trial()` method. In this override, we handle the scikit-learn model directly and return the evaluation metric, in this case $R^2$.
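The original class is not included in this copy; the sketch below follows KerasTuner's documented pattern of overriding `run_trial()`, but the search range for `alpha` and the trial budget are my own assumptions:

```python
import keras_tuner
from sklearn.linear_model import Ridge

class RidgeTuner(keras_tuner.BayesianOptimization):
    def run_trial(self, trial, x_train, y_train, x_val, y_val):
        hp = trial.hyperparameters
        alpha = hp.Float("alpha", min_value=1e-2, max_value=1e3, sampling="log")
        model = Ridge(alpha=alpha).fit(x_train, y_train)
        # Return the metric for this trial; the Objective below tells the
        # oracle to maximize it.
        return {"score": model.score(x_val, y_val)}

ridge_tuner = RidgeTuner(
    objective=keras_tuner.Objective("score", direction="max"),
    max_trials=30,
    overwrite=True,
    project_name="ridge_tuning",
)
ridge_tuner.search(x_train, y_train, x_val, y_val)
```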
We can define `LassoTuner` similarly.
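A corresponding sketch (again with an assumed range for `alpha`):

```python
from sklearn.linear_model import Lasso

class LassoTuner(keras_tuner.BayesianOptimization):
    def run_trial(self, trial, x_train, y_train, x_val, y_val):
        hp = trial.hyperparameters
        alpha = hp.Float("alpha", min_value=1e-3, max_value=10.0, sampling="log")
        model = Lasso(alpha=alpha).fit(x_train, y_train)
        return {"score": model.score(x_val, y_val)}

lasso_tuner = LassoTuner(
    objective=keras_tuner.Objective("score", direction="max"),
    max_trials=30,
    overwrite=True,
    project_name="lasso_tuning",
)
lasso_tuner.search(x_train, y_train, x_val, y_val)
```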
It is possible to combine both tuners into a single one, but since we want to compare the two methods here, we tune them separately.
The final result shows that the highest score for Ridge Regression is 0.48, achieved by $\alpha=35$, and the highest score for Lasso is 0.55, achieved by $\alpha=0.27$.
Note: We cannot say that Lasso is universally better than Ridge Regression, for the ground truth here is sparse, which favors Lasso.
From the above figure, we can see that both Lasso and Ridge Regression did a decent job in preventing overfitting.