How to Optimize Hyperparameter Search Using Bayesian Optimization and Optuna



Hyperparameter optimization is an integral part of machine learning. It aims to find the best set of hyperparameter values to achieve the best model performance.

Grid search and random search are popular hyperparameter tuning methods. They roam around the entire search space to get the best set of hyperparameters, which makes them time-consuming and inefficient for larger datasets.

Based on Bayesian logic, Bayesian optimization considers the model performance for previous hyperparameter combinations while determining the next set of hyperparameters to evaluate.

Optuna is a popular tool for Bayesian hyperparameter optimization. It provides easy-to-use algorithms, automatic algorithm selection, integrations with a wide range of ML frameworks, and support for distributed computing.

Training a machine learning model involves a set of parameters and hyperparameters. Parameters are the internal variables, such as weights and coefficients, that the model learns during the training process. Hyperparameters are the external configuration settings that govern the model training and directly impact the model’s performance. In contrast to parameters learned during training, they need to be defined before the training begins.

Hyperparameter optimization, also known as hyperparameter tuning or hyperparameter search, is the process of finding the optimal values for hyperparameters that result in the best model performance.

The optimization process starts with choosing an objective function to minimize/maximize and selecting the range of values for different hyperparameters called the search space. Then, you choose one of several tuning techniques, such as manual tuning, grid search, random search, and Bayesian optimization.

Methods like manual tuning, grid search, and random search roam the entire search space (all possible values and combinations of hyperparameters) in multiple iterations. They do not take into account the results of past iterations when selecting the next hyperparameter combination to try. The search space for these approaches grows exponentially with the number of hyperparameters to tune.

Further, these methods are time-consuming and resource-consuming, requiring training a model on a selected set of parameter values, making predictions on the validation data, and calculating the validation metrics. All this makes hyperparameter tuning a costly endeavor.

Here, Bayesian hyperparameter optimization methods come to the rescue. Based on Bayesian logic, Bayesian optimization reduces the time required to find an optimal set of parameters to improve generalization performance on the test data. Bayesian approaches consider the previous hyperparameter values and their performance while determining the next set of hyperparameters to evaluate.

Many tools in the ML space use Bayesian optimization to guide the selection of the best set of hyperparameters. Widely employed frameworks are HyperOpt, Spearmint, GPyOpt, and Optuna. For this article, we’ll focus on Optuna, a popular choice for hyperparameter optimization due to its ease of use, efficient search strategy, distributed computing support, and automatic algorithm selection.

Using Optuna and a hands-on example, you will learn about the ideas behind Bayesian hyperparameter optimization, how it works, and how to perform Bayesian optimization for any of your machine-learning models.

How does the Bayesian hyperparameter optimization strategy work?

Each step in a hyperparameter tuning process looks as follows: We select a set of hyperparameters from the search space and evaluate them by computing the objective function. In most basic approaches, the objective function’s value is computed by training a model using the selected hyperparameters, using the model to make predictions on a test data set, and evaluating its performance using a predefined metric such as accuracy.

For a small parameter range and small dataset, we can try out all possible hyperparameter combinations, as the number of calls to the objective function will be small. This popular approach is called grid search. However, for a relatively large dataset and large parameter ranges, this method is too computationally expensive and time-consuming. Hence, we should look for ways to limit the number of calls to the objective function.

A straightforward approach is to randomly select a certain number of hyperparameter combinations (say, 10 or 20) and pick the combination that yields the best value of the objective function. This approach is called random search. It limits the number of calls to the objective function to a fixed value (i.e., the search has approximately constant time complexity). The price we pay is that there is no guarantee that the obtained hyperparameter values are even close to optimal.

In contrast to grid search and random search, Bayesian hyperparameter optimization considers past evaluation results when selecting the next hyperparameter set. Since it makes an informed decision, it focuses on the areas of the search space that are more likely to lead to optimal model performance. Likewise, it tends to ignore areas in the search space that are unlikely to contribute towards performance optimization. This limits the number of calls to the objective function while ensuring that the evaluated hyperparameter combinations are increasingly more likely to produce an optimal model.

Now, let’s examine the main components of Bayesian optimization that work together to obtain the best set of hyperparameters.

Search space 

The search space is the set of possible values the parameters and variables of interest can take. For example, we might look for our apartment’s optimal room temperature by trying out values between 16 and 26 degrees Celsius (60 to 80 degrees Fahrenheit). While the parameter “room temperature” could conceivably take on higher or lower values, we’re restricting our search to this particular range.

Bayesian optimization utilizes probability distributions to guide the selection of samples within a defined search space. The user initially defines this search space and specifies the ranges or constraints for each parameter or variable, which requires knowledge of the training data and the model’s algorithm. Usually, the choice of parameter ranges is heavily influenced by the user’s assumptions and experience. When defining the search space, it’s paramount not to be too narrow: If the optimal hyperparameter combination lies outside the search space, no optimization algorithm can find it.

Objective function 

The objective function is the evaluator that takes in the values of the hyperparameters and returns a single value score that you want to minimize or maximize.

For example, the objective function could consist of the following algorithm:

  • Instantiate a model and a training process using the given combination of hyperparameter values.
  • Train the model on a training dataset.
  • Evaluate the model’s accuracy on a test data set.
  • Return the accuracy as the single value score.

In this example, we would try to bring the objective function’s value as close to 1.0 (perfect accuracy) as possible.

The fact that computing the objective function involves a full model training run and subsequent evaluation makes every evaluation costly and time-consuming. Thus, hyperparameter optimization approaches that limit the number of calls to the objective function are preferable.

Surrogate function

The surrogate function proposes the best set of hyperparameters given the current state of knowledge. It evaluates all past invocations of the objective function and suggests a parameter combination it expects to yield a better result.

The purpose of the surrogate function is to limit the number of calls we need to make to the objective function. It also goes by the name response surface, as it is a high-dimensional mapping of hyperparameters to the probability of a score on the objective function. In that sense, it is an approximation of the objective function.

Different types of surrogate functions exist, such as Gaussian Processes, Random Forest Regression, and Tree Parzen Estimator (TPE). For this article, we will be focusing on the Tree Parzen Estimator (TPE).

TPE is a probability-based model that balances exploration and exploitation by maintaining separate models for the likelihood of improvement and the probability of worsening. It is well suited for hyperparameter optimization tasks where the objective is to find the set of hyperparameters that can minimize or maximize the model performance evaluation metrics used in the objective function.

The TPE algorithm iteratively samples new hyperparameters, evaluates their performance using the objective function, updates its internal probability distributions, and continues the search until a stopping criterion is met.

In the TPE, the criterion that guides the search for the next set of hyperparameters is called an acquisition function. It can be defined as follows:

AF(x) = P(I∣x) / (P(W∣x) + ϵ)

Here, P(I∣x) represents the probability of improvement, P(W∣x) represents the probability of worsening, and ϵ is a small constant to prevent division by zero.

TPE starts with randomly sampling a small number of points from the search space to evaluate the objective function. Then, it builds and maintains two separate models for “good” (improving) and “bad” (worsening) regions of the search space.

It divides the search space into regions using a binary tree structure, where each leaf node represents a region. For each leaf node, TPE fits a probability distribution to the observed scores of the points in that region. Typically, TPE uses kernel density estimation (KDE) to model the probability distributions.

At each iteration, TPE samples a new candidate point by selecting a leaf node based on the probabilities obtained from the probability distributions of the “good” and “bad” regions. It then samples a point uniformly within the selected leaf node and evaluates it using the objective function.

After evaluating the new point, TPE updates its models by incorporating the observed score. If the score is better than the previous best score, TPE updates the model for the “good” region. Otherwise, it updates the model for the “bad” region. This process repeats until the stopping criteria are met.
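To make the density-ratio idea concrete, here is a minimal, self-contained sketch of a single TPE-style step, not Optuna’s actual implementation, using SciPy’s Gaussian kernel density estimation. The one-dimensional search space, the toy objective, and the 25% quantile split are illustrative assumptions:

```python
# Illustrative sketch of one TPE-style step (not Optuna's implementation)
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend we have already evaluated the objective at these hyperparameter values
observed_x = rng.uniform(1, 32, size=30)                              # e.g., candidate max_depth values
observed_score = -(observed_x - 10) ** 2 + rng.normal(0, 5, size=30)  # toy objective to maximize

# Split past observations into "good" and "bad" groups by a score quantile
gamma = 0.25
threshold = np.quantile(observed_score, 1 - gamma)
good_x = observed_x[observed_score >= threshold]
bad_x = observed_x[observed_score < threshold]

# Fit one density per group (kernel density estimation)
good_density = gaussian_kde(good_x)
bad_density = gaussian_kde(bad_x)

# Score candidates by the ratio of "good" to "bad" density and pick the most promising one
candidates = rng.uniform(1, 32, size=100)
acquisition = good_density(candidates) / (bad_density(candidates) + 1e-12)
next_x = candidates[np.argmax(acquisition)]
print(f"Next hyperparameter value to evaluate: {next_x:.2f}")
```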

To learn more, I recommend Shuhei Watanabe’s tutorial paper Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance.

Selection function

While the surrogate function approximates the objective, the selection function, also called the acquisition function, is responsible for actually choosing the next set of hyperparameters to evaluate. Its objective is to strike a balance between exploring regions of the parameter space with high uncertainty (exploration) and exploiting regions likely to yield better objective function values (exploitation).

There are different types of selection functions, including Expected Improvement (EI), Probability of Improvement (PI), and Upper Confidence Bound (UCB). Each of them uses a different approach to strike a balance between exploration and exploitation.

The complete Bayesian hyperparameter search process

The full process of searching the optimal hyperparameters with Bayesian optimization entails the following steps:

  1. Select a search space to draw the samples.
  2. Select a random value of each hyperparameter.
  3. Define an objective function for your specific machine learning model and dataset.
  4. Choose a surrogate function to approximate your objective function.
  5. Based on the currently known information, select an optimal set of hyperparameters in the search space. This point is chosen based on a trade-off between exploration and exploitation.
  6. Evaluate the objective function for the given set of parameters. (This involves training a model and evaluating its performance on a test set.)
  7. Update the surrogate function’s model to incorporate the new results, refining its approximation of the objective function.
  8. Repeat steps 5 to 7 until a stopping criterion (e.g., a maximum number of iterations or a threshold of the objective function’s value) is reached.

Advantages of Bayesian optimization over other hyperparameter optimization methods

We’ve seen that Bayesian optimization is superior to simpler hyperparameter optimization approaches because it takes into account past information. Let’s look at the advantages in more detail:

  • Probabilistic model: Bayesian hyperparameter optimization builds a probability-based model of the objective function, typically a TPE or Gaussian Process (GP). This makes accounting for uncertainty in the ML model predictions possible, allows guided exploration of the hyperparameter space, and enables adaptive sampling with greater understanding.
  • Resource efficiency: While optimization algorithms like random or grid search become infeasibly costly when dealing with large search spaces and huge datasets, Bayesian optimization is well-suited for scenarios where evaluating the objective function is computationally expensive. It minimizes the number of objective function evaluations needed to find an optimal solution, leading to significant savings in computational resources and time.
  • Global optimization: Bayesian optimization is well-suited for global optimization tasks where the goal is to find the global optimum rather than just a local one. Its exploration-exploitation strategy facilitates a more comprehensive search across the hyperparameter space compared to other optimization methods. However, it still does not guarantee finding a global optimum.
  • Effective in high-dimensional spaces: Bayesian optimization handles high-dimensional hyperparameter spaces better than exhaustive methods. Even with a large number of hyperparameters, its probability-based modeling enables the effective exploration and exploitation of promising regions.

Optimizing hyperparameter search using Bayesian optimization and Optuna

Optuna is an open-source hyperparameter optimization software framework that employs Bayesian hyperparameter optimization with the TPE (Tree Parzen Estimator). It is a framework-agnostic tool that allows seamless integration with various machine learning libraries such as TensorFlow, PyTorch, and scikit-learn.

Optuna iteratively suggests new sets of hyperparameters based on TPE’s acquisition function, which balances exploration of unexplored regions and exploitation of promising areas. As the optimization progresses, the probabilistic model is continuously refined with observed data points, allowing Optuna to make informed decisions about where to sample next. This process optimizes the objective function with fewer evaluations, making Optuna an excellent choice for computationally expensive objective functions.

Optuna hyperparameter tuning: The model is initially trained on the training set and then evaluated on the test set. Hyperparameter tuning is applied to find the set of hyperparameters that can achieve the best performance. Neptune tracks all the trial results for documentation and later analysis.

Optuna supports parallel and distributed optimizations, enabling efficient use of computational resources. The framework also provides visualization tools for analyzing the optimization process and facilitates integration with Jupyter Notebooks.

The Optuna workflow revolves around two terms:

  1. Trial: A single call to an objective function.
  2. Study: Hyperparameter optimization based on an objective function. A Study aims to determine the ideal set of hyperparameter values by conducting several trials. 

Now, let’s break down the process of optimizing hyperparameters with Optuna. We’ll optimize the hyperparameters of a Random Forest Classifier on the famous iris dataset.

Since hyperparameter tuning involves many trials with different sets of hyperparameters, manually keeping track of which combinations Optuna has tried and how they performed quickly becomes impractical. To make our work easier, we will use Neptune, an ML experiment tracking tool that lets us record the results of each Optuna trial.

Neptune provides visualization capabilities to understand the model performance for different hyperparameter combinations and over time. To use Neptune, you need to sign up for an account, create a project, and obtain your API token.

Note: Once you create a project, the required credentials, such as the project name and API token, will be visible on the dashboard.

To follow along, you’ll need Python 3.11 and Jupyter Notebook. You can install the dependencies either using pip or conda.

Step 1: Install and load dependencies

We’ll start by installing Optuna and scikit-learn, along with Neptune and Neptune’s Optuna plugin:
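Assuming pip and the package names optuna, scikit-learn, neptune, and neptune-optuna (the Neptune-Optuna integration), the installation could look like this:

```
pip install optuna scikit-learn neptune neptune-optuna
```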

If you don’t yet have Jupyter Notebooks available in your environment, you can install and launch it as follows:
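For example, with pip:

```
pip install notebook
jupyter notebook
```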

In a new notebook, we start by importing the dependencies:
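A minimal set of imports covering the code in the following steps might look like this; the alias npt_utils for the Neptune-Optuna integration is a naming choice, not a requirement:

```python
import numpy as np
import optuna

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

import neptune
import neptune.integrations.optuna as npt_utils
```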

Step 2: Load the dataset 

Next, we’ll load the iris dataset, which contains information about three different plant species, using scikit-learn‘s built-in dataset loader:
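Using the imports from above, loading the features and labels is a one-liner:

```python
# Load the iris features (X) and class labels (y)
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```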

For this tutorial, we will add some noise to the iris dataset, making it a bit harder for a model to master the classification problem, which will make the effects of Optuna’s hyperparameter tuning more pronounced.

We do this by adding normally distributed random numbers to the original data:
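One way to do this is sketched below; the noise level of 0.5 and the random seed are arbitrary choices for illustration:

```python
# Perturb the features with Gaussian noise to make the classification task harder
rng = np.random.default_rng(42)
X = X + rng.normal(loc=0.0, scale=0.5, size=X.shape)
print(X[:3])  # inspect the first few (now noisy) rows
```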

The printed rows should show feature values that deviate slightly from the original iris measurements.

Step 3: Select a performance measure

We’ll use the cross-validation score as a performance measure. It averages the evaluation metric (e.g., accuracy, precision, recall, F1 score) over k cross-validation folds. In more detail, the model is trained k times, each time using k-1 folds for training and the remaining fold for validation. The default metric for evaluating a scikit-learn RandomForestClassifier is accuracy, which we’ll also use here. Alternatively, you can specify a different performance metric, such as precision or recall.

Step 4: Training the random forest model and establishing a performance baseline

Before you start optimizing hyperparameters, you must have a baseline to compare the tuned model’s performance. Let’s train the random forest model on the iris data and calculate the cross-validation score to get the baseline results:
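A baseline computation could look like the following sketch; because the exact score depends on the added noise and the random seeds, your number may differ slightly from the 67% reported below:

```python
# Baseline: default RandomForestClassifier evaluated with 5-fold cross-validation
baseline_model = RandomForestClassifier(random_state=42)
baseline_score = cross_val_score(baseline_model, X, y, cv=5).mean()
print(f"Baseline cross-validation accuracy: {baseline_score:.3f}")
```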

As you can see, the model has achieved 67% accuracy on the iris dataset. Now, let’s try to improve this accuracy using the Optuna hyperparameter optimization. 

Step 5: Defining the objective function 

With the performance metric and the model training set up, we can now define the objective function. This function selects a set of hyperparameter values, trains the ML model, and returns a single-valued score (mean accuracy) you want to maximize.

As Optuna works with the concept of Trials and Studies, we need to define the objective function to accept a Trial object:
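A sketch of such an objective function is shown below. The ranges match the ones discussed next (n_estimators between 2 and 20, max_depth between 1 and 32); sampling max_depth on a log scale via suggest_float() is one common choice, not a requirement:

```python
def objective(trial):
    # Let Optuna suggest hyperparameter values within the defined search space
    n_estimators = trial.suggest_int("n_estimators", 2, 20)
    max_depth = int(trial.suggest_float("max_depth", 1, 32, log=True))

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,
    )

    # Return the mean cross-validation accuracy as the single score to maximize
    return cross_val_score(model, X, y, cv=5).mean()
```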

The trial object’s suggest_int() and suggest_float() methods dynamically suggest hyperparameter values within the ranges you define, using TPE (Tree Parzen Estimator) under the hood.

For example, the ‘n_estimators’ parameter can take a value between 2 and 20, and ‘max_depth’ a value between 1 and 32. Initially, you have to come up with these ranges yourself; together, they define the search space.

Step 6: Initialize Neptune for storing the Optuna Trials

To start using Neptune for experiment tracking, you need to initialize a new run using the init_run() method. This method will require the project name and the API token for the repository where you want to save the results in Neptune.

You can do so with the help of the following lines of code:
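A sketch with placeholder credentials (replace the project name and API token with the values from your Neptune dashboard):

```python
run = neptune.init_run(
    project="your-workspace/your-project",  # placeholder: your Neptune project name
    api_token="YOUR_NEPTUNE_API_TOKEN",     # placeholder: your Neptune API token
)
```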

Step 7: Define the Neptune callback

Since Optuna runs its trials one after another, Neptune uses a callback to record each trial before the next one begins. You can define this callback as follows:
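Using the Neptune-Optuna integration imported earlier as npt_utils, the callback is created from the active run:

```python
# The callback logs every Optuna trial to the Neptune run created above
neptune_callback = npt_utils.NeptuneCallback(run)
```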

That’s all you have to do to set up Neptune to track your experiments. To learn more about Neptune’s integration with Optuna, have a look at the documentation.

Step 8: Optimizing the objective function 

Now, all that’s left is to define a Study consisting of N trials to optimize the objective function.

Initially, the sampler randomly generates a few initial parameter combinations to evaluate the objective function. Optuna then uses a surrogate function (TPE in this case) to balance exploration (sampling from uncertain regions) and exploitation (sampling near promising configurations) to efficiently search for optimal hyperparameters.

The selection function then suggests the next hyperparameter configuration to evaluate by considering both the predicted performance and the uncertainty associated with each point in the search space. This process repeats until the pre-defined number of trials (in our case, 70) is reached. 
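Putting this together, the Study definition and the optimization call could look as follows:

```python
# Create a Study that maximizes the objective (mean cross-validation accuracy)
study = optuna.create_study(direction="maximize")

# Run 70 trials, logging each one to Neptune via the callback
study.optimize(objective, n_trials=70, callbacks=[neptune_callback])
```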

As you can see in the code above, we use the create_study() method to define a Study and the optimization direction. Then, we use the optimize() method and provide the number of trials and our objective function for hyperparameter optimization.

You might notice that we are using the callbacks argument, passing the Neptune callback object. This ensures we track each trial and its related metadata in Neptune.

Once the optimization process is complete, you can use the best_trial attribute to get the best accuracy score and the associated set of hyperparameters. You should observe an improvement of around 21% in accuracy.
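For example:

```python
best_trial = study.best_trial
print(f"Best cross-validation accuracy: {best_trial.value:.3f}")
print(f"Best hyperparameters: {best_trial.params}")
```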

If you had used a basic grid search instead of Bayesian optimization with Optuna, it would have required about 567 iterations to try out all possible hyperparameter combinations, which would have taken roughly eight times longer.

You can also check the hyperparameter combinations that Optuna has tried out and the performance it has achieved from each set of hyperparameters as follows:
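One way to inspect all trials is the Study’s trials_dataframe() method:

```python
# Each row corresponds to one trial: its score, hyperparameters, and metadata
trials_df = study.trials_dataframe()
print(trials_df.head())
```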


Once you have your best set of hyperparameters, you can stop tracking data with Neptune using the following line of code:
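That is:

```python
# Finish the Neptune run; the URL of the logged experiment is printed to the console
run.stop()
```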

This will provide you with the URL where the experiment data is saved. When you open that link (if you’re curious: here’s the one to my Neptune project), you will see different runs (based on how many times you have run Optuna). Each run will have several trials and the best set of hyperparameters. It will look something like this:


Analyzing and managing Optuna hyperparameter optimization results in Neptune: Each run corresponds to an Optuna Study, consisting of several trials. Neptune tracks the hyperparameters and evaluation results for each trial.

Best practices for Bayesian optimization with Optuna

There are several best practices to increase the effectiveness and efficiency of conducting hyperparameter optimization with Optuna.

Here’s a selection:

Understand the problem and data 

It’s essential to understand the problem you want to solve thoroughly. You should know the characteristics of your dataset and the ML model you will use. This will allow you to understand the objective function’s nature and the hyperparameters’ behavior. It will also help you choose the right metrics to minimize or maximize for optimal performance.

Define a relevant search space

You should carefully define the search space for the hyperparameters. You can start by identifying the hyperparameters relevant to the model and algorithm being optimized, such as learning rate, batch size, and number of layers for a neural network. Then, you need to specify the ranges or distributions for each parameter, for example, continuous values for the numeric hyperparameters and a set of values for the categorical variables.

Optuna supports various distributions, such as uniform, log-uniform, categorical, and integer, enabling flexibility in defining the search space. Additionally, you can utilize domain knowledge when defining the search space. Throughout, keep in mind that striking a balance between computational feasibility and coverage of the search space is crucial.
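As a hypothetical illustration (the parameter names and ranges below are made up for a generic neural network, not taken from the tutorial above), a search space mixing these distribution types might look like this:

```python
import optuna

def objective(trial):
    # Log-uniform distribution for a continuous hyperparameter
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    # Integer-valued hyperparameter
    num_layers = trial.suggest_int("num_layers", 1, 5)
    # Categorical hyperparameter
    optimizer_name = trial.suggest_categorical("optimizer", ["adam", "sgd", "rmsprop"])

    # Toy score so the sketch runs end to end; a real objective would train a model here
    return learning_rate * num_layers if optimizer_name == "adam" else learning_rate

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
```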

Set an appropriate number of trials

You must identify a reasonable number of trials based on the available computation resources and the complexity of the optimization problem. When you try too few trials, the obtained hyperparameters can be suboptimal. Too many trials will be computationally expensive and will take a long time, just like grid search and random search.

Initially, start with a small number of trials and then gradually increase the number of trials depending on how your optimization progresses. Once you have obtained the optimal parameters, you must validate the model’s performance on a separate validation set or perform cross-validation. This ensures that the chosen configuration generalizes well to new, unseen data.

Experiment with different acquisition functions

Optuna supports different acquisition functions, including Probability of Improvement, Expected Improvement, and Upper Confidence Bound. You should experiment with different functions to find the one that aligns with the characteristics of your objective function.

For example, Knowledge Gradient (KG) is effective for sparse and high-dimensional data, Upper Confidence Bound (UCB) is effective for large datasets with complex relationships, and Probability of Improvement (PI) is effective for data with high variability and noise.

Parallel and distributed optimization

You can leverage parallel and distributed optimization to speed up the overall hyperparameter optimization search, especially when your objective function is computationally expensive, and you are trying a wide range of hyperparameters. To this end, Optuna supports the parallel execution of trials.
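As a sketch, trials can run in parallel threads within one process via the n_jobs argument of optimize(), or across several worker processes that share a storage backend; the SQLite URL and study name below are placeholders:

```python
# A shared storage backend lets multiple workers contribute trials to the same Study
study = optuna.create_study(
    study_name="rf-iris-tuning",          # placeholder study name
    storage="sqlite:///optuna_study.db",  # placeholder storage URL
    direction="maximize",
    load_if_exists=True,
)

# Within a single process, n_jobs runs several trials concurrently
study.optimize(objective, n_trials=70, n_jobs=4)
```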

Conclusion

After working through our introductory tutorial, you now understand the foundations of Bayesian hyperparameter optimization and its mechanics. We’ve discussed how Bayesian optimization differs from conventional techniques such as random and grid search. Then, we’ve explored the practical application of hyperparameter optimization with Optuna and Neptune. Finally, we’ve reviewed effective strategies to optimize your hyperparameter search process. Armed with this knowledge, you’re well-prepared to apply Bayesian optimization to enhance the performance of your ML models.
