Hyperparameter tuning is a pivotal concept in machine learning, be it deep learning or not. For deep learning, KerasTuner is a powerful and easy-to-use package that can tune any Keras model in any way. Yet the intricate mechanism concealed beneath its API can make users uneasy. This post will demystify its internal implementation and help you understand how KerasTuner accomplishes this feat. I hope that after reading this post, you will feel confident in using KerasTuner, be it for tuning a neural network or any other custom model. If you are an aspiring, hard-working data scientist, this post is a must-read.
Table of Contents
- The mysterious Hyperparameter class
- Reimplement KerasTuner, or at least partially
- A starter dish
- Package design
- A home-made dessert
- Use KerasTuner to tune Scikit-Learn models
The mysterious Hyperparameter class
If you are following the official KerasTuner guide, you will stumble upon the following code snippet.
You will unavoidably notice this intricate line:
units=hp.Int("units", min_value=32, max_value=512, step=32).
Your intuition tells you that the code is assigning a hyperparameter to the number of units in the hidden layer, but you can’t help but wonder how the program knows which value to use at each call.
Most intriguingly, how can a function produce different values with the exact same arguments?
The official guide does little to put you at ease, telling you only that hp.Int() is just a function “returning actual values.”
You run the above code multiple times, but it returns 32 every time. You can’t help but start to question the legitimacy of this package: is it trolling you?
In the following, I will reassure you that it is not trolling you. Better still, I will show you how it is implemented.
Reimplement KerasTuner, or at least partially
A starter dish
Let us first solve this mystery: how can a function return different values with the exact same arguments?
The caveat here is that a function in the sense of programming is not the same as one in the sense of mathematics. In mathematics, a function such as $\sin(x)$ with the same $x$ will always give the same value. This property is called statelessness. In programming, however, it is possible and easy to build a stateful function. The pseudorandom number generator is such an example.
The following is a more complex example using a closure. (You can achieve the same effect with a class, but I just want to use this example to show off.)
build_counter here is a factory function, whose output is a function. Here, we use it to build the counter function, which simply counts. This function returns a different value each time, even though it is given the same input every time. Hence we know that it is indeed possible for hp.Int to return different values when properly set up.
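The closure idea can be sketched as follows. The names build_counter and counter come from the text above; the body is only one plausible way to write it, not necessarily the original snippet.

```python
def build_counter():
    """Factory function: returns a counter that carries its own private state."""
    count = 0  # lives in the enclosing scope and is captured by the closure

    def counter():
        nonlocal count  # rebind the captured variable instead of creating a local one
        count += 1
        return count

    return counter


counter = build_counter()
print(counter())  # 1
print(counter())  # 2
print(counter())  # 3
```

Each call to build_counter creates an independent counter, so the state belongs to the returned function and not to the program as a whole.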
To understand this setup, I will analyze KerasTuner’s design in the next subsection.
Package design
The package revolves around four key concepts, with the first two being the most critical:
- Hyperparameters
- Hyper models
- Tuners
- Oracles
The package defines the HyperParameters class, which is a container class holding all hyperparameters ever instantiated. When you call hp.Int("units", min_value=32, max_value=512, step=32), it checks its inventory for a hyperparameter called “units”. If the search is unsuccessful, it instantiates such a hyperparameter and assigns it a default value, in this case 32. If this hyperparameter already exists, it simply returns its current value.
In a nutshell, this simple function achieves two effects, hyperparameter registration and hyperparameter retrieval, depending on whether the hyperparameter in question is already registered.
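The register-or-retrieve behavior can be mimicked in a few lines of plain Python. This is a rough sketch, not KerasTuner's actual code: the inventory dictionary and the int_param function are hypothetical stand-ins, and taking min_value as the default is an assumption that matches the value 32 observed earlier.

```python
# Hypothetical stand-in for the container's inventory: name -> current value.
inventory = {}


def int_param(name, min_value, max_value, step=1):
    """Register `name` on first use; on later calls return its current value."""
    if name not in inventory:
        inventory[name] = min_value  # assumption: the default is min_value
    return inventory[name]


first = int_param("units", min_value=32, max_value=512, step=32)   # registers "units"
inventory["units"] = 64                                             # an outside party changes the value
second = int_param("units", min_value=32, max_value=512, step=32)  # retrieves the new value
print(first, second)  # 32 64
```

The same call with the same arguments returns 32 and then 64, because the state lives in the inventory rather than in the arguments.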
Therefore, the remaining thing to do is to find a way to alter the current value of this hyperparameter, and this is achieved by the Tuner class.
Before talking about the Tuner class, let us first analyze the HyperModel class and its simpler form: the build_model function.
In the following code, build_model is a factory function, which takes a HyperParameters container as input and outputs a callable, in this case a Keras model. Each time it is called, it returns a neural network with the desired number of units in the hidden layer. Of course, on its own it has no idea how many units it should use. Instead, it achieves this by using the visitor pattern, that is, by delegating the decision to the HyperParameters container class sent to it as a visitor.
The HyperParameters class takes care of this and provides build_model with the right hyperparameter value through the call to hp.Int().
Besides this ad hoc build_model function factory, you can define a more formal HyperModel class to achieve the same effect. In the following code snippet, BuildModel.build() achieves the same effect as build_model() in the example above.
As for hp.Choice, it is just another kind of hyperparameter playing a similar role to hp.Int.
With the HyperParameters container class and the HyperModel class, the Tuner class has all the ingredients for model tuning.
All it needs to do is a loop of:
- configuring the hyperparameter value
- fitting the model
- evaluating the solution
The Tuner class itself does not pick the hyperparameter value directly; instead, it delegates this task to an Oracle class. There are various Oracle classes implementing different hyperparameter selection algorithms, such as grid search, Hyperband, and Bayesian optimization.
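The division of labor between the tuning loop and the oracle can be sketched in plain Python. Everything here is hypothetical and only illustrates the idea: GridOracle, RandomOracle, their next_value and update methods, and the tune function are names invented for this sketch, not KerasTuner's API, and fitting and evaluating are collapsed into a single evaluate call.

```python
import random


class GridOracle:
    """Proposes each candidate once, in order, and ignores feedback."""

    def __init__(self, candidates):
        self._iterator = iter(candidates)

    def next_value(self):
        return next(self._iterator, None)  # None signals that the grid is exhausted

    def update(self, value, score):
        pass  # an adaptive oracle (e.g. Bayesian) would use the score here


class RandomOracle:
    """Proposes a uniformly random candidate at every step."""

    def __init__(self, candidates, seed=0):
        self.candidates = list(candidates)
        self.rng = random.Random(seed)

    def next_value(self):
        return self.rng.choice(self.candidates)

    def update(self, value, score):
        pass


def tune(oracle, evaluate, max_trials):
    """Tuning loop: ask the oracle for a value, evaluate it, keep the best."""
    best_value, best_score = None, float("-inf")
    for _ in range(max_trials):
        value = oracle.next_value()  # configure the hyperparameter value
        if value is None:
            break
        score = evaluate(value)      # fit and evaluate, collapsed into one call
        oracle.update(value, score)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


candidates = range(32, 513, 32)


def toy_score(units):
    # Toy objective: best when the number of units is closest to 200.
    return -abs(units - 200)


grid_result = tune(GridOracle(candidates), toy_score, max_trials=100)
random_result = tune(RandomOracle(candidates), toy_score, max_trials=5)
print(grid_result)  # (192, -8)
```

Swapping the oracle changes the search strategy without touching the loop, which is the point of the delegation.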
A home-made dessert
To test our understanding, let us reimplement a simplified version of KerasTuner.
We first build our home-made hyperparameters container class.
Let us unit-test it.
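A minimal sketch of such a container, together with a few checks, might look like this. The class name HomeMadeHyperParameters, the space and values attributes, and the "activation" choice are hypothetical names picked for illustration; taking the first candidate as the default is an assumption, and the real KerasTuner class is considerably richer.

```python
class HomeMadeHyperParameters:
    """Container that registers hyperparameters on first use and returns their current value."""

    def __init__(self):
        self.space = {}   # name -> list of candidate values
        self.values = {}  # name -> current value

    def _register_or_retrieve(self, name, candidates):
        if name not in self.values:
            self.space[name] = candidates
            self.values[name] = candidates[0]  # assumption: default is the first candidate
        return self.values[name]

    def Int(self, name, min_value, max_value, step=1):
        candidates = list(range(min_value, max_value + 1, step))
        return self._register_or_retrieve(name, candidates)

    def Choice(self, name, values):
        return self._register_or_retrieve(name, list(values))


hp = HomeMadeHyperParameters()

# The first call registers the hyperparameter and returns its default.
assert hp.Int("units", min_value=32, max_value=512, step=32) == 32
# A second identical call retrieves the same current value.
assert hp.Int("units", min_value=32, max_value=512, step=32) == 32
# Once the current value is changed from outside, the same call returns the new value.
hp.values["units"] = 64
assert hp.Int("units", min_value=32, max_value=512, step=32) == 64
# Choice behaves the same way.
assert hp.Choice("activation", ["relu", "tanh"]) == "relu"
assert len(hp.space["units"]) == 16
print("all checks passed")
```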
It works as intended.
Now let us define our home-made Tuner class.
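One way to write such a tuner is sketched below. It is a simplified illustration under stated assumptions, not the original code: HomeMadeTuner, build_fn, score_fn and the toy model are hypothetical, the search is an exhaustive sweep, and a pared-down container is repeated so that the snippet runs on its own.

```python
class HomeMadeHyperParameters:
    """Pared-down container: register on first use, then return the current value."""

    def __init__(self):
        self.space = {}   # name -> list of candidate values
        self.values = {}  # name -> current value

    def Int(self, name, min_value, max_value, step=1):
        if name not in self.values:
            self.space[name] = list(range(min_value, max_value + 1, step))
            self.values[name] = min_value
        return self.values[name]


class HomeMadeTuner:
    """Exhaustive search over a single hyperparameter."""

    def __init__(self, build_fn, score_fn):
        self.build_fn = build_fn
        self.score_fn = score_fn
        self.results = []  # (value, score) pairs, one per trial

    def search(self):
        hp = HomeMadeHyperParameters()
        self.build_fn(hp)  # dry run so the hyperparameter registers itself
        if len(hp.space) != 1:
            raise ValueError("this sketch supports exactly one hyperparameter")
        name, candidates = next(iter(hp.space.items()))
        for value in candidates:
            hp.values[name] = value       # configure the hyperparameter value
            model = self.build_fn(hp)     # build (and, in real life, fit) the model
            self.results.append((value, self.score_fn(model)))  # evaluate
        return max(self.results, key=lambda pair: pair[1])


def build_model(hp):
    # Toy "model": a function that scales its input by the number of units.
    units = hp.Int("units", min_value=32, max_value=512, step=32)
    return lambda x: units * x


def score(model):
    # Toy score: highest when model(1) is closest to 200.
    return -abs(model(1) - 200)


tuner = HomeMadeTuner(build_model, score)
best = tuner.search()
print(best)  # (192, -8)
```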
The above code works only when there is a single hyperparameter. With multiple hyperparameters, navigating the search space becomes much more complicated and exceeds the scope of this post.
Let’s put it into action.
It indeed ran all models with various hyperparameter values. Our understanding is correct!
Use KerasTuner to tune Scikit-Learn models
To push your proficiency with KerasTuner to the next level, let us use it with scikit-learn. KerasTuner was originally developed to tune Keras models, so its standard workflow is built around them. However, with some tweaks, we can make it work with models from any package, including scikit-learn. In particular, in this section, we will tune ridge regression and lasso step by step.
First let us create a dataset worthy of regularization.
The model is still over-determined ($n > d$), but the sample size $n$ is laughably small. To our advantage, the ground truth is nonetheless sparse ($p \ll d$).
Let’s visualize the first feature, which is informative. We can observe an upward trend.
Let’s visualize a non-informative feature.
Now let’s try Ordinary Least Squares on the informative features only.
The score function returns the $R^2$ value, which equals 0.68 on the training set and 0.63 on the validation set. These are decent results compared to the Bayes optimal bound of 0.8. With a sample size of only 100, you cannot expect too much.
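For reference, the coefficient of determination is commonly defined as below. The symbols are introduced here and do not appear in the original post: $y_i$ are the observed targets on the evaluated set, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the observed targets.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

A perfect fit gives $R^2 = 1$, always predicting $\bar{y}$ gives $R^2 = 0$, and a model that does worse than that gets a negative score.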
Now let’s try the same thing on the full feature set.
We observe that the training score greatly improves and even exceeds the Bayes optimal bound. However, this does not go unpunished: the validation score equals 0.287, much lower than before. This is textbook overfitting, crying out for regularization. In case you are not convinced, the following figure shows the complete result for every model size (from 1 to 50).
In the following, we will use KerasTuner to tune Ridge Regression and Lasso, respectively.
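As a reminder, both methods add a penalty weighted by a strength $\alpha$ to the least-squares objective, and $\alpha$ is the hyperparameter to tune. The formulation below follows scikit-learn's documented objectives for Ridge and Lasso; the symbols $X$ (design matrix), $y$ (targets), and $w$ (coefficients) are notation introduced here.

$$
\text{Ridge:} \quad \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad\qquad
\text{Lasso:} \quad \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the two objectives scale the data-fit term differently, the tuned values of $\alpha$ for the two methods are not directly comparable.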
The bad news is that the package’s provided Tuners accept only Keras models.
To make it work with scikit-learn models, we need to build customized Tuners.
The following code subclasses
keras_tuner.BayesianOptimization and redefines the
In this redefinition, we handle the scikit-learn models directly and return the evaluation metric, in this case $R^2$.
We can define
It is possible to combine both tuners into an ultimate one, but we want to compare these two methods here, so we tune them separately.
The final result shows that the highest score for Ridge Regression is 0.48, achieved by $\alpha=35$, and the highest score for Lasso is 0.55, achieved by $\alpha=0.27$.
Note: We cannot say that Lasso is universally better than Ridge Regression, for the ground truth here is a sparse one, which favors Lasso.
From the above figure, we can see that both Lasso and Ridge Regression did a decent job in preventing overfitting.