Demystifying KerasTuner

Hyperparameter tuning is a pivotal concept in machine learning, be it deep learning or not. For deep learning, KerasTuner is a powerful and easy-to-use package that can tune any Keras model in any way. Yet the intricate mechanism concealed beneath its API can make users uneasy. This post will demystify its internal implementation and help you understand how KerasTuner accomplishes this feat. I hope that after reading it, you will feel confident using KerasTuner, be it for tuning a neural network or any other custom model. If you are an aspiring, hard-working data scientist, this post is a sure read.

Table of Contents

  • The mysterious HyperParameters class
  • Reimplement KerasTuner, or at least partially
    • A starter dish
    • Package design
    • A home-made dessert
  • Use KerasTuner to tune Scikit-Learn models

The mysterious HyperParameters class

If you are following the official KerasTuner guide, you will stumble upon the following code snippet.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Flatten())
    model.add(
        layers.Dense(
            # Define the hyperparameter.
            units=hp.Int("units", min_value=32, max_value=512, step=32),
            activation="relu",
        )
    )
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"],
    )
    return model

You will unavoidably notice this intricate line: units=hp.Int("units", min_value=32, max_value=512, step=32). Your intuition tells you that the code is assigning a hyperparameter to the number of units in the hidden layer, but you can’t help wondering how the program knows which value to use at each call. Most intriguingly, how can a function produce different values with the exact same arguments?

The official guide does little to ease your mind; it merely tells you that hp.Int() is a function “returning actual values.”

>>> import keras_tuner
>>> hp = keras_tuner.HyperParameters()
>>> print(hp.Int("units", min_value=32, max_value=512, step=32))
32

You run the above code multiple times, but it returns 32 on every occasion. You can’t help but start questioning the legitimacy of this package: is it trolling you?

In the following, I will reassure you that it is not. Better yet, I will show you how it is implemented.

Reimplement KerasTuner, or at least partially

A starter dish

Let us first solve this mystery: how can a function return different values with the exact same arguments?

The catch here is that a function in the programming sense is not the same as a function in the mathematical sense. In mathematics, a function such as $\sin(x)$ always gives the same value for the same $x$. This property is called statelessness. In programming, however, it is possible and easy to build a stateful function. The pseudorandom number generator is one such example.

>>> import numpy as np
>>> np.random.rand()
0.09735248922580608

>>> np.random.rand()
0.22875803951336515

The following is a more complex example using a closure. (You can achieve the same effect with a class, but I just want to use this example to show off.)

def build_counter(start=0):
    count = start
    def f(x):
        nonlocal count
        count += x
        return count

    return f

>>> counter = build_counter()
>>> counter(1)
1
>>> counter(1)
2
>>> counter(1)
3

build_counter here is a factory function, whose output is itself a function. We use it to build counter, which simply keeps a running count. This function returns a different value each time, even though it is given the exact same input on every call.

Hence we know that it is indeed possible for hp.Int to return different values when properly set up. To understand this setup, I will analyze KerasTuner’s design in the next subsection.

Package design

The package revolves around four key concepts, with the first two being the most critical:

  • Hyperparameters
  • Tuners
  • Oracles
  • Hypermodels

The package defines the HyperParameters class, a container holding all hyperparameters ever instantiated. When you call hp.Int("units", min_value=32, max_value=512, step=32), it checks its inventory for a hyperparameter called “units”. If the search is unsuccessful, it instantiates such a hyperparameter and assigns it a default value, in this case 32. If the hyperparameter already exists, it simply returns its current value. In a nutshell, this one function achieves two effects, hyperparameter registration and hyperparameter retrieval, depending on whether the hyperparameter in question is already registered. The remaining task, then, is to find a way to alter the current value of this hyperparameter, and that is the job of the Tuner class.
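
As a rough mental model (this is my sketch, not KerasTuner’s actual code; we will build a fuller version in the home-made dessert section below), hp.Int boils down to a register-or-retrieve lookup:

class MiniHP:
    """Toy stand-in for keras_tuner.HyperParameters; all names here are made up."""
    def __init__(self):
        self.space = {}    # name -> (min_value, max_value, step)
        self.values = {}   # name -> current value

    def Int(self, name, min_value, max_value, step=1):
        if name not in self.space:          # first call: register the hyperparameter
            self.space[name] = (min_value, max_value, step)
            self.values[name] = min_value   # use the smallest value as the default
        return self.values[name]            # every call: return the current value

With this toy container, MiniHP().Int("units", 32, 512, step=32) keeps returning 32 until something overwrites the corresponding entry in values, after which the exact same call returns the new value.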

Before talking about the Tuner class, let us first analyze the HyperModel class and its simpler form, the build_model function. In the following code, build_model is a factory function: it takes a HyperParameters container as input and outputs a Keras model, which is itself callable.

def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Flatten())
    model.add(
        layers.Dense(
            # Define the hyperparameter.
            units=hp.Int("units", min_value=32, max_value=512, step=32),
            activation="relu",
        )
    )
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"],
    )
    return model

Each time it is called, it returns a neural network with the desired number of units in the hidden layer. Of course, on paper it has no idea how many units it should use. It achieves this via the visitor pattern, that is, by delegating the decision to the HyperParameters container passed to it as a visitor. The HyperParameters object takes care of the bookkeeping and provides build_model with the right hyperparameter value through the call to hp.Int.

Besides this ad hoc build_model function factory, you can define a more formal HyperModel class to achieve the same effect. In the following code snippet, BuildModel.build() plays the same role as build_model() in the example above. As for hp.Choice, it is just another hyperparameter type playing a role similar to that of hp.Int.

import keras_tuner
from tensorflow import keras

class BuildModel(keras_tuner.HyperModel):
    def build(self, hp):
        model = keras.Sequential()
        model.add(keras.layers.Dense(hp.Choice('units', [8, 16, 32]), activation='relu'))
        model.add(keras.layers.Dense(1, activation='relu'))
        model.compile(loss='mse')
        return model
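
To see how the pieces fit together in practice, here is one way to run this HyperModel through a built-in tuner. The argument values are illustrative, and I assume numeric arrays x and y are already defined:

import keras_tuner

tuner = keras_tuner.RandomSearch(
    hypermodel=BuildModel(),
    objective="val_loss",   # minimize the validation loss reported by model.fit
    max_trials=5,           # try five hyperparameter configurations
    overwrite=True,         # start the search from scratch on each run
)
tuner.search(x, y, validation_split=0.2, epochs=10)
best_model = tuner.get_best_models(num_models=1)[0]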

With the HyperParameters container class and the HyperModel class, the Tuner class has all the ingredients for model tuning. All it needs to do is loop over:

  1. configuring the hyperparameter value
  2. calling HyperModel.build()
  3. fitting the model
  4. evaluating the solution

Furthermore, the Tuner class itself does not pick the hyperparameter values directly; instead, it delegates this task to an Oracle class. There are various Oracle classes implementing various hyperparameter-selection algorithms, such as grid search, Hyperband, and Bayesian optimization.
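
I will not reimplement an Oracle below, but conceptually the division of labour looks something like this toy sketch (the names are made up; real Oracles also track trials, budgets, and past results): the tuner asks the oracle for values, writes them into the HyperParameters container, builds and fits the model, then reports the score back.

import random

class MyRandomOracle:
    """Toy oracle: it only proposes values; the tuner does the building and fitting."""
    def __init__(self, space):
        self.space = space    # name -> list of candidate values
        self.history = []     # (values, score) pairs reported back by the tuner

    def propose(self):
        # Random search: pick each hyperparameter value independently at random.
        return {name: random.choice(candidates)
                for name, candidates in self.space.items()}

    def report(self, values, score):
        # Smarter oracles (e.g. Bayesian optimization) learn from this feedback.
        self.history.append((values, score))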

A home-made dessert

To test our understanding, let us reimplement a simplified version of KerasTuner.

We first build our home-made hyperparameters container class.

class MyHP:
    def __init__(self):
        self.space = dict()
        self.current = dict()

    def Choice(self, name, choices):
        if name not in self.space:  # register it
            self.space[name] = choices
            self.current[name] = choices[0]
        return self.current[name]

    # You don't need this method in Python,
    # but let us do it in a proper OOP way here.
    def pick(self, name, value):
        if name not in self.space:
            raise Exception(f"{name} is not a valid hyperparameter")
        if value not in self.space[name]:
            raise Exception(f"{name}={value} is not a valid hyperparameter value")
        self.current[name] = value

Let us unit-test it.

>>> hp = MyHP()
>>> hp.Choice('alpha', [0.1, 1., 10.])
0.1
>>> hp.pick('alpha', 1.)
>>> hp.Choice('alpha', [0.1, 1., 10.])
1.0

It works as intended.

Now let us define our home-made Tuner class.

class MyGridSearch:
    def __init__(self, build_model):
        self.build_model = build_model
        self.hp = MyHP()
        self.result = []

    def search(self, x, y):
        self.build_model(self.hp)  # build once, just to register the hyperparameters

        # The following works only when there is a single hyperparameter
        for name in self.hp.space:
            for value in self.hp.space[name]:
                self.hp.pick(name, value)
                model = self.build_model(self.hp)
                model.fit(x, y)
                self.result.append(model)  # keep the fitted model for inspection

The above code works only when there is a single hyperparameter. With multiple hyperparameters, navigating the space becomes much more complicated and exceeds this post’s scope, although a minimal sketch is given below for the curious.
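
Here is that sketch: the loop can be generalized with itertools.product. The grid helper below is my own addition, not part of the implementation above.

import itertools

def grid(space):
    """Yield one {name: value} dict per point of the Cartesian product."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))

# Inside search(), the single-hyperparameter loop would then become:
# for values in grid(self.hp.space):
#     for name, value in values.items():
#         self.hp.pick(name, value)
#     model = self.build_model(self.hp)
#     model.fit(x, y)
#     self.result.append(model)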

Let’s put it into action.

>>> import numpy as np
>>> from sklearn.linear_model import Lasso
>>> build_model = lambda hp: Lasso(alpha=hp.Choice('alpha', [0.1, 2., 10.]))
>>> tuner = MyGridSearch(build_model)
>>> tuner.search(np.random.randn(100, 10), np.random.randn(100, 1))
>>> tuner.result
[Lasso(alpha=0.1), Lasso(alpha=2.0), Lasso(alpha=10.0)]

It indeed ran all models with various hyperparameter values. Our understanding is correct!

Use KerasTuner to tune Scikit-Learn models

To push your proficiency with KerasTuner to the next level, let us use it with scikit-learn. KerasTuner was originally developed to tune Keras models, and its built-in workflow is geared toward them. However, with some tweaking, we can make it work with models from any package, including scikit-learn. In particular, in this section we will tune Ridge Regression and Lasso step by step.

First let us create a dataset worthy of regularization.

import numpy as np

n, d, p = 100, 50, 8
σ = 2.
β = np.concatenate((np.ones(p), np.zeros(d - p)))

rng_x = np.random.default_rng(0)
rng_y = np.random.default_rng(1)
x = rng_x.laplace(size=(2*n, d))
y = x @ β + σ * rng_y.laplace(size=2*n)

x_train, x_val = x[:n, :], x[n:, :]
y_train, y_val = y[:n], y[n:]

The model is still over-determined ($n > d$), but the sample size $n$ is laughably small. To our advantage, the ground truth is nonetheless sparse ($p \ll d$).

Let’s visualize the first feature, which is informative. We can observe an upward trend.

Let’s visualize a non-informative feature.

Now let’s try Ordinary Least Squares (OLS) on the informative features only.

>>> from sklearn import linear_model
>>> ols = linear_model.LinearRegression(fit_intercept=False)
>>> ols.fit(x_train[:, :p], y_train)
>>> ols.score(x_train[:, :p], y_train), ols.score(x_val[:, :p], y_val)
(0.6793139873791911, 0.625937769802074)

The score function returns the $R^2$ value, which equals 0.68 on the training set and 0.63 on the validation set. These are decent results compared to the Bayes-optimal bound of 0.8. With a sample size of only 100, you cannot expect too much.

Now let’s try the same thing on the full feature set.

>>> ols = linear_model.LinearRegression(fit_intercept=False)
>>> ols.fit(x_train, y_train)
>>> ols.score(x_train, y_train), ols.score(x_val, y_val)
(0.8229686449817113, 0.28705825932541906)

We observe that the training score improves greatly and even exceeds the Bayes-optimal bound. However, this does not go unpunished: the validation score drops to 0.287, much lower than before. This is textbook overfitting, yelling for regularization. In case you are not convinced, the following figure shows the complete result for every model size (from 1 to 50).
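
The figure itself is not reproduced here, but for reference, the numbers behind it can be generated with a loop like the following, reusing the dataset defined above (the variable names are mine):

from sklearn import linear_model

# Fit OLS on the first k features, for k = 1..d, and record both scores.
train_scores, val_scores = [], []
for k in range(1, d + 1):
    ols_k = linear_model.LinearRegression(fit_intercept=False)
    ols_k.fit(x_train[:, :k], y_train)
    train_scores.append(ols_k.score(x_train[:, :k], y_train))
    val_scores.append(ols_k.score(x_val[:, :k], y_val))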

In the following, we will use KerasTuner to tune Ridge Regression and Lasso, respectively. The bad news is that the package’s built-in Tuners accept only Keras models. To make it work with scikit-learn models, we need to build customized Tuners. The following code subclasses keras_tuner.BayesianOptimization and overrides the run_trial() method. In this override, we handle the scikit-learn model directly and return the evaluation metric, in this case $R^2$.

import keras_tuner
from sklearn import linear_model

class RidgeTuner(keras_tuner.BayesianOptimization):
    def run_trial(self, trial, x_train, y_train, x_val, y_val):
        hp = trial.hyperparameters
        model = linear_model.Ridge(
            alpha=hp.Float('alpha', 0.001, 1000., sampling='log'),
            fit_intercept=False
        )
        model.fit(x_train, y_train)
        return {"R2": model.score(x_val, y_val)}

tuner_ridge = RidgeTuner(objective=keras_tuner.Objective("R2", "max"))

tuner_ridge.search(x_train, y_train, x_val, y_val)

We can define a LassoTuner similarly. It is possible to combine both tuners into a single one, but since we want to compare the two methods here, we tune them separately.
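
For completeness, here is what a LassoTuner could look like (I reuse the same log-scaled search range as RidgeTuner above, which is my choice), together with the standard way to read off the winning hyperparameters afterwards:

class LassoTuner(keras_tuner.BayesianOptimization):
    def run_trial(self, trial, x_train, y_train, x_val, y_val):
        hp = trial.hyperparameters
        model = linear_model.Lasso(
            alpha=hp.Float('alpha', 0.001, 1000., sampling='log'),
            fit_intercept=False
        )
        model.fit(x_train, y_train)
        return {"R2": model.score(x_val, y_val)}

tuner_lasso = LassoTuner(objective=keras_tuner.Objective("R2", "max"))
tuner_lasso.search(x_train, y_train, x_val, y_val)

# Retrieve the best alpha found by each tuner.
best_alpha_ridge = tuner_ridge.get_best_hyperparameters(1)[0].get('alpha')
best_alpha_lasso = tuner_lasso.get_best_hyperparameters(1)[0].get('alpha')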

The final result shows that the highest score for Ridge Regression is 0.48, achieved by $\alpha=35$, and the highest score for Lasso is 0.55, achieved by $\alpha=0.27$.

Note: We cannot say that Lasso is universally better than Ridge Regression, for the ground truth here is a sparse one, which favors Lasso.

From the above figure, we can see that both Lasso and Ridge Regression did a decent job in preventing overfitting.

Written on August 10, 2023