Optuna Samplers

Optimizing ML models using Optuna Samplers

By Niels van der Velden in reticulate, R, Python, Machine Learning

February 7, 2022

Introduction

When building machine learning (ML) models, an important part of optimizing the model is searching for the set of parameters with which the model achieves the highest precision and accuracy. The most advanced ML models (XGBoost, LGBM, CatBoost, TabNet) have many parameters to optimize, and finding the best-performing set can take a lot of time. If, for instance, we would like to tune 6 parameters and try 10 different values for each, there are 10^6 = 1,000,000 different combinations we could try. Some models take hours to train, so it would be impossible to try them all. How can we best navigate this vast search space to find the optimal set of values?
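
To make the arithmetic concrete, here is a small sketch that counts the combinations of such a grid. The grid itself and the cost of one minute per training run are assumptions chosen only for illustration.

```python
from itertools import product

# Hypothetical grid: 6 hyperparameters with 10 candidate values each.
n_params = 6
values_per_param = 10
grid = [range(values_per_param)] * n_params

# The number of combinations is the product of the grid sizes: 10**6.
n_combinations = values_per_param ** n_params

# Walk the grid lazily to confirm the count without storing every combination.
counted = sum(1 for _ in product(*grid))

# Assumed cost of a single training run, for illustration only.
minutes_per_run = 1
total_days = n_combinations * minutes_per_run / (60 * 24)

print(n_combinations, counted)
print(f"about {total_days:.0f} days at {minutes_per_run} minute(s) per run")
```

Even at one minute per run, an exhaustive search of this grid would take close to two years, which is why a smarter way of picking the next candidate matters.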

One of the best packages (in my opinion) for tuning hyperparameters and navigating large search spaces is the Python package Optuna. After reading many tutorials on using Optuna for hyperparameter tuning, I always wondered what makes Optuna so efficient and whether there is a simple way to visualize what is happening underneath.

In this article I will try to answer this question by approximating the maximum of a function of one variable and the minimum of a function of two variables using a random and grid search, and comparing the results with those of the more advanced TPESampler and CmaEsSampler of Optuna.

Using Optuna to find the maximum of a function

In the code below we ask Optuna to suggest values for x between 0 and 1000 and to find the value of x that maximizes y for the function y = sin(-0.4*pi*x / 20) + sin(-0.5*pi*x / 200) + sin(0.5*pi*x / 100). In this example the values for x are suggested by Optuna's TPESampler. This sampler uses Bayesian optimization methods to select each x value (see link for a more detailed explanation). In total we run 100 trials, and we set the seed of the sampler to 42 to get reproducible results.

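import math

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    # Let Optuna suggest a value for x in the range [0, 1000].
    x = trial.suggest_float('x', 0, 1000)
    # The function whose maximum we are looking for.
    y = (
      math.sin(-0.4*math.pi*x / 20)
    + math.sin(-0.5*math.pi*x / 200)
    + math.sin(0.5*math.pi*x / 100)
    )
    return y

# Maximize y; the fixed seed makes the sampler reproducible.
study = optuna.create_study(
  direction="maximize",
  sampler=TPESampler(seed=42)
  )
# Run 100 trials.
study.optimize(objective, n_trials=100)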

print("TPESampler: Best y: {} for x: {} found at trial: {}"
.format(
  round(study.best_value, 2), 
  round(study.best_params["x"],2), 
  study.best_trial.number
  )
)

## TPESampler: Best y: 2.5 for x: 478.02 found at trial: 61

You can see that the TPESampler found its best value at trial 61 (trial numbers start at 0). How would these results compare to taking just 100 random values for x? We can find out by running the same code as above, but importing and using Optuna's RandomSampler instead of the TPESampler.

## RandomSampler: Best y: 2.43 for x: 472.21 found at trial: 89

With random sampling the best value was found at trial 89, and it did not reach the maximum of 2.5 that the TPESampler found. The TPESampler is clearly the winner here.
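
For a rough grid-search baseline with the same budget, one could evaluate the function at 100 evenly spaced values of x. This is only a sketch in plain Python. The helper name f and the choice of an evenly spaced grid that includes both end points are my assumptions.

```python
import math

def f(x):
    # The same function that is maximized in the Optuna examples.
    return (
        math.sin(-0.4 * math.pi * x / 20)
        + math.sin(-0.5 * math.pi * x / 200)
        + math.sin(0.5 * math.pi * x / 100)
    )

# 100 evenly spaced points on [0, 1000], matching the budget of 100 trials.
n_points = 100
grid = [i * 1000 / (n_points - 1) for i in range(n_points)]

# Pick the grid point with the highest function value.
best_x = max(grid, key=f)
best_y = f(best_x)

print("Grid search: Best y: {} for x: {}".format(round(best_y, 2), round(best_x, 2)))
```

A grid like this can only ever return one of its fixed points, so its result depends heavily on how the spacing happens to line up with the peaks of the function.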

We can plot the x values that were taken for each trial onto the function for which we try to find the maximum. What you can see using the timeline slider at the bottom of the graphs is that at the beginning the TPESampler starts with the selection of random values for x but after a while it starts to use a “best guess” approach using Bayesian optimization to select the most promising value of x to sample next.

Using Optuna to find the minimum of a multivariable function

From the example above we can already see that the TPESampler is much better than random sampling at finding the maximum of a function of one variable. How will it perform on a function of several variables? In the example below we ask Optuna to find the minimum of the two-dimensional Rastrigin function, z = 20 + x^2 - 10*cos(2*pi*x) + y^2 - 10*cos(2*pi*y), which has a global minimum of z = 0 at x = 0 and y = 0. As a comparison we run the same code using the RandomSampler.

import math

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    # Let Optuna suggest values for x and y, each in the range [-5, 5].
    x = trial.suggest_float('x', -5, 5)
    y = trial.suggest_float('y', -5, 5)

    # Two-dimensional Rastrigin function.
    z = (
      20 + x**2 - 10*math.cos(2*math.pi*x)
      + y**2 - 10*math.cos(2*math.pi*y)
      )

    return z

# Minimize z; the fixed seed makes the sampler reproducible.
study = optuna.create_study(direction="minimize", sampler=TPESampler(seed=42))
study.optimize(objective, n_trials=100)

print("TPESampler: Best z: {} for x: {} and y: {} found at trial: {}"
.format(
  round(study.best_value, 2), 
  round(study.best_params["x"],2), 
  round(study.best_params["y"],2), 
  study.best_trial.number
  )
)

## TPESampler: Best z: 2.34 for x: 0.99 and y: 1.04 found at trial: 62

## RandomSampler: Best z: 3.09 for x: 1.1 and y: 0.03 found at trial: 73

Although the TPESampler did not find the global minimum within only 100 trials, it already got down to z = 2.34 and again performed better than the RandomSampler.
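
To put z = 2.34 in context, the short check below evaluates the same Rastrigin expression at the origin and at (1, 1). The function name rastrigin is mine. The function has many local minima close to integer coordinates, and the point reported for the TPESampler (x = 0.99, y = 1.04) lies in the basin around (1, 1), where z is approximately 2.

```python
import math

def rastrigin(x, y):
    # Same expression as in the objective function above.
    return (
        20 + x**2 - 10 * math.cos(2 * math.pi * x)
        + y**2 - 10 * math.cos(2 * math.pi * y)
    )

# Global minimum at the origin.
z_origin = rastrigin(0.0, 0.0)

# Close to one of the neighboring local minima.
z_neighbor = rastrigin(1.0, 1.0)

print(round(z_origin, 2), round(z_neighbor, 2))
```

So the TPESampler settled into a local minimum next to the global one, which is a typical difficulty with this function.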

Recently Optuna added a new CmaEsSampler that uses a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (see: link) instead of Bayesian optimization. This sampler only works for problems with more than one parameter and does not support categorical variables, but if these conditions are met it promises to perform even better than the TPESampler.

## CmaEsSampler: Best z: 0.4 for x: 0.04 and y: -0.01 found at trial: 60

The CmaEsSampler found a minimum of 0.4 at trial 60, compared to 2.34 for the TPESampler, so it does indeed seem to work better than the TPESampler for this particular example.

To get a better understanding of how the different samplers operate, we can plot the x and y values that the sampler picks at each trial onto a heatmap. You can use the slider at the bottom to go through the different trials. You can see that, as in the single-variable example, the TPESampler starts with random values for x and y and then uses a “best guess” approach using Bayesian optimization. The CmaEsSampler seems to work quite differently and zooms in on the minimum after only a few trials.


Although the CmaEsSampler worked better than the TPESampler for finding the minimum in the above example it works less well if we try to find a maximum. Thus, it might be worthwhile to test out both samplers and see which one gives the best results.

Conclusion

Optuna can substantially outperform a random or grid search on many optimization problems, as the examples above illustrate. It achieves this by using algorithms such as the TPESampler or the CmaEsSampler, which try to pick the most promising parameters for each trial. The standard TPESampler will work well for most optimization problems. For optimization problems with several numerical parameters and no categorical variables, the CmaEsSampler sometimes performs better and might be worth trying.