Why and how Bayesian Optimization can be used for hyper-parameters tuning

Here you are: your model is finally running and producing predictions. In most cases, one more crucial step is needed: hyper-parameters tuning.

In the context of machine learning, hyper-parameters are opposed to a model’s trainable parameters, like the weights of a neural network. They correspond to more general settings of a model, like its learning rate or the batch size used for training. In deep learning, hyper-parameters are often numerous. Tuning them all together can become a real brain-teaser. In this post, I’ll review different ways to solve this issue, ranging from a brute-force approach to the paradigm of Bayesian optimization.

The easiest way to search for a good combination of hyper-parameters is to exhaustively try them all. Intuitively, you can set up a list of candidate values for every hyper-parameter. You then train a model with each combination. Finally, you can evaluate each combination on a held-out test set and compare their performances to determine the best one.

However, such a brute-force approach is computationally expensive. Let’s assume your model takes 3 hours to train and has 2 hyper-parameters. If you want to test 5 candidate values for each of them, grid search would take (5²) x 3= 75 hours to evaluate all combinations!

If your model is quick to train, has few hyper-parameters or that great amounts of computing powers are available, then grid search is an easy to implement solution. Unfortunately, this is rarely the case.

In case you do not have such amounts of time or computational resources, an alternative is random search. Instead of exhaustively testing combinations in a grid, random search consists in sampling random values in the hyper-parameter space. Although it balances the computational cost of grid search, this approach is inefficient in the way it explores candidate hyper-parameters combinations. Indeed, the algorithm doesn’t learn from the previously tested combinations.

A mutual drawback of both these approaches is that the method loses efficiency if the impacting parameters are not known beforehand. This is the case in many settings. With grid search, you will test many equivalent combinations, hence wasting computing resources and time.

Grid Search vs Random Search in the case of 2 parameters and 9 tested combinations. Black dots: tested combinations of parameters. Green curves: true objective function as a function of the given parameter. On the y-axis, the curve is almost flat meaning that this parameter has a low impact on the objective function. However, on the x-axis, a clear maximum appears, corresponding to the optimal value for this parameter. Grey dots: projection of the tested parameter combination on the green curve. Images from Polyaxon documentation

In the case of one unimportant parameter, random search is likely to test the scoring function more extensively. On Figure 1, the number of gray dots is higher for random search than for grid search, meaning that more values were tested. In grid search, resources were lost while testing the unimportant parameter. However, random search can still miss the global optimum completely, like on the second illustration.

Even though random search explores a larger number points of the different objective functions, in this situation it doesn’t detect the optimal parameter combination (image from Patrick Koch’s blog)

Intuitively, it would be more efficient to choose the next hyper-parameter combination according to past combinations performances. This is exactly the aim of Bayesian optimization.

Another option is to view hyper-parameters tuning as the optimization of a black-box function. Here, the function to optimize is the model’s final prediction score, accuracy for instance, on a held-out test set. Any global optimization framework can then be applied to minimize it. Bayesian optimization was deemed to be a good choice in different papers (see links at the end of the post). It also has the advantage of having available implementations online (see last paragraph).

Like for random search, a Bayesian optimizer samples a subset of hyper-parameters combinations. The difference with randomized grid search resides in the way each combination is chosen.

The main idea resides in constructing a probabilistic representation of the machine learning algorithm’s performance. In Bayesian optimization, the performance function is modeled as a sample from a Gaussian process (GP) over the hyper parameter value. The posterior distribution for this function allows insight in the confidence of the function’s value in the hyper-parameter space (see the figure below).

At first, a few hyper-parameters combinations are randomly sampled and tested. These values are then used to produce the first model of the objective function. Then, Bayesian Optimization comes into play.

Great illustration from Meghana Ravikumar in her blogpost on Bayesian optimization. GP: Gaussian Process. The blue confidence interval represents the confidence of the algorithm in its current representation of the objective function.

Once the probabilistic model is obtained, there are two ways to choose the next combination:

either sample near the highest observed value, where performance is likely to increase

or explore another subset of the hyper-parameter space, where information is lacking and another maximum could be hiding

This choice is known as the exploitation and exploration trade-off. It has been thoroughly studied, in the context of reinforcement learning for instance. Mathematically, this choice is represented by the acquisition function.

Example acquisition function (green curve) from Laura Cornejo Bueno. Here, the maximum value is reached in a domain with few known points i.e with high uncertainty. It is an example of exploration.

Methods like Expected Improvement (EI) or Gaussian Process Upper Confidence Bound (GP-UCB), where regret is minimized throughout the optimization process, can be used as acquisition functions. The acquisition function represents the improvement potential over the hyper-parameter space knowing the results of the previously tested combinations. Once this function is defined, the next combination can be derived as follows:

Where D are the previously tested combinations and their results

Once the chosen combination has been tested, the probabilistic model is updated. The process can then start again. Bayesian optimization takes advantage of previously tested combinations to sample the next one more efficiently.

The confidence in the objective function estimation increases with the number of tested combinations. Hence, the blue confidence interval from Figure 2 shrinks as the process continues.

Estimate of the objective function from Figure 2 after one more test. Here, exploration was chosen over exploitation.

On the downside, Bayesian optimization is costlier than random search in terms of computations in between each iteration. Indeed, the model needs to be updated at every iteration before sampling a new combination. However, by taking into account past results, it allows sampling new combinations more efficiently.

Other methods, like Trees of Parzen Estimators (TPE) also aim at taking into account past tests to choose the next combination. Instead of modelling p(y|x), TPE separately models p(x|y) and p(y) in a tree-structured way. Here, y stands for the objective metric and x for the hyper-parameters. More detailed information about TPE is included in the Hyperopt library which implements it.

In practice, Python packages are available to perform Bayesian optimization. For instance, Spearmint implements Bayesian optimization with EI as the acquisition function. Fabolas, standing for FAst Bayesian Optimization of machine learning LArge datasets, implements a variant of Bayesian Optimization. In this variant, a trade-off between computational cost and information gain is used to speed up hyper-parameter tuning. Modules like HyperBand and BOHB (Bayesian Optimization and HyperBand) perform mixes between random search and Bayesian Optimization. Their objective is to reduce Bayesian Optimization’s computational cost.

These modules can be found in the following GitHub repositories:

This list is not meant to be exhaustive: other packages exist and variants of the described methods are published every year!

When faced with the problem of hyper-parameters tuning, I discovered that many options were available. And choosing one can also get pretty complicated! The best suited one highly depends on the problem’s characteristics and the computational resources you have access to. I tried to explain as best as I could the different methods I came across. Hopefully, this post was useful for you too!

If you want to go further, here are the links to the papers used in this post:

Practical Bayesian Optimization of Machine Learning Algorithms,

The application of Bayesian methods for seeking the extremum

Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design

Are you looking for computer vision experts? Don’t hesitate to contact us!

Thanks to Fatima Klilib, Hugo Lime and Bastien Ponchon

Similar articles

Build Your Own Cloud with Kubernetes and Some Raspberry Pi

Managing several Raspberry Pi can be a lot of work. This article will teach you how Kubernetes and Docker can help.

About Convolutional Layer and Convolution Kernel

A story of Convnet in machine learning from the perspective of kernel sizes.

5 Mistakes I Made When Doing Custom Data Visualization With D3.js

Experience feedback from drawing graphs in Javascript with D3.js