Mango: A new way to do Bayesian optimization in Python


Translator | Zhu Xianzhong

Introduction

Optimization of model hyperparameters (or model settings) is probably the most important step in training a machine learning algorithm, as it finds the best parameters that minimize the loss function of the model. This step is also essential for building generalized models that are less prone to overfitting.

The best-known techniques for optimizing model hyperparameters are exhaustive grid search and random grid search. In the first approach, the search space is defined as a grid spanning the domain of each model hyperparameter, and the optimal hyperparameters are found by training the model on every point of the grid. Although grid search is very easy to implement, it becomes computationally expensive, especially when the number of variables to optimize is large. Random grid search, on the other hand, is a faster method that can even give better results: the best hyperparameters are obtained by training the model only on a random sample of points from the grid space.

The figure above compares the two types of grid search. The nine points represent choices of parameters, and the curves on the left and on top show model accuracy as a function of each search dimension. The figure is taken from a paper by Salgado Pilario et al. published in IEEE Transactions on Industrial Electronics (68, 6171–6180, 2021).
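For reference, both search strategies are available in Scikit-learn. The sketch below shows how each is typically invoked; the estimator, parameter grid, and synthetic data are placeholders chosen for illustration, not taken from the original article.

# Minimal sketch of exhaustive vs. random grid search with Scikit-learn.
# The estimator, parameter grid, and data below are placeholders.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
grid = {"max_depth": [3, 5, 7, 9], "n_estimators": [50, 100, 200]}

# Exhaustive search: trains the model on every point of the grid.
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=3)
grid_search.fit(X, y)

# Random search: trains the model only on a random sample of grid points.
random_search = RandomizedSearchCV(RandomForestRegressor(random_state=0), grid,
                                   n_iter=5, cv=3, random_state=0)
random_search.fit(X, y)

print(grid_search.best_params_, random_search.best_params_)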

Both grid search algorithms have long been widely used by data scientists to find optimal model hyperparameters. However, these methods often find model hyperparameters where the loss function is far from the global minimum.

In 2013, however, this picture changed. That year, James Bergstra and his collaborators published a paper in which they explored Bayesian optimization techniques to find the best hyperparameters of neural networks for image classification, and they compared the results with those of a random grid search. Their conclusion was that the Bayesian method outperformed random grid search, as shown in the figure below.

Shown are the validation errors on the LFW dataset (left) and the PubFig83 dataset (right). TPE, the Tree-structured Parzen Estimator, is an algorithm used in Bayesian optimization. This figure is taken from a paper by Bergstra et al. in the Journal of Machine Learning Research (28, 115–123, 2013).

But why is Bayesian optimization better than any grid search algorithm? Because it is a guided method: it performs an intelligent search over the model hyperparameters rather than finding them by trial and error.

In this article, we will dissect Bayesian optimization in detail and explore an implementation of the algorithm through a relatively new Python package called Mango.

Bayesian optimization

Before explaining what Mango can do, we need to understand how Bayesian optimization works. Of course, if you already have a solid understanding of the algorithm, you can skip reading this section.

In summary, Bayesian optimization has 4 parts:

  • Objective function: This is the real function you want to minimize or maximize. For example, it could be the root mean squared error (RMSE) in regression problems or the log loss in classification problems. In the optimization of machine learning models, the objective function depends on the model hyperparameters. Because its shape is unknown, the objective function is also called a black-box function.
  • Search domain or search space: This corresponds to the possible values that each model hyperparameter can take. As a user, you need to specify the search space of your model. For example, the search domain for a random forest regression model might be:


param_space = {'max_depth': range(3, 10),
               'min_samples_split': range(20, 2000),
               'min_samples_leaf': range(2, 20),
               'max_features': ["sqrt", "log2", "auto"],
               'n_estimators': range(100, 500)
               }

Bayesian optimization uses a defined search space to sample the points evaluated in the objective function.
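As a toy illustration (this is not Mango's actual sampler), the snippet below draws one random configuration from the param_space dictionary defined above; this is what a single "point" of the search space looks like.

# Toy illustration: draw one random configuration from param_space.
# This is not Mango's internal sampler, just a way to see what a point
# of the search space looks like.
import random

point = {name: random.choice(values) for name, values in param_space.items()}
print(point)
# e.g. {'max_depth': 7, 'min_samples_split': 431, 'min_samples_leaf': 8,
#       'max_features': 'log2', 'n_estimators': 268}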

  • Surrogate model: Evaluating the objective function is very expensive, so in practice we only know its true value at a few points; however, we need to know its value elsewhere. This is where the surrogate model comes into play: it is a tool for modeling the objective function. A common choice of surrogate model is the Gaussian process (GP), because of its ability to provide uncertainty estimates.

At the beginning of Bayesian optimization, the surrogate model starts from a prior whose uncertainty is spread uniformly over the search space:

The figure shows the prior of the surrogate model. The shaded area represents the uncertainty, the black line its mean, and the purple line the one-dimensional objective function. This image is taken from a 2020 blog post exploring Bayesian optimization by Apoorv Agnihotri and Nipun Batra.

Every time a sample point in the search space is evaluated in the objective function, the uncertainty of the surrogate model at that point becomes zero. After many iterations, the surrogate model will resemble the objective function:

A surrogate model for a simple one-dimensional objective function
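To make the surrogate idea concrete, here is a small sketch that fits Scikit-learn's GaussianProcessRegressor to a handful of evaluations of a made-up one-dimensional objective and queries both the predicted mean and its uncertainty; the objective function and the sampled points are invented for the example.

# Sketch of a Gaussian-process surrogate for a made-up 1-D objective.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Made-up black-box function, used only for illustration.
    return np.sin(3 * x) + 0.5 * x

x_observed = np.array([[0.2], [1.0], [2.5], [4.0]])   # a few expensive evaluations
y_observed = objective(x_observed).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(x_observed, y_observed)

x_grid = np.linspace(0, 5, 200).reshape(-1, 1)
mean, std = gp.predict(x_grid, return_std=True)  # std gives the uncertainty band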

However, the goal of Bayesian optimization is not to model the objective function, but to find the best model hyperparameters in as few iterations as possible. For this, an acquisition function is used.

  • Acquisition function: This function is introduced in Bayesian optimization to guide the search. Based on the current surrogate model, it evaluates whether a point is worth evaluating in the objective function. A simple acquisition function is to sample the point at which the mean of the surrogate function is maximized.

The steps of the Bayesian optimization algorithm are as follows (a minimal code sketch of the loop is shown after the list):

Choose a surrogate model for the objective function and define its prior. Then, for i = 1, 2, ..., number of iterations:

  • Given the set of evaluations of the objective obtained so far, use Bayes' rule to obtain the posterior.
  • Use an acquisition function (which is a function of the posterior) to decide the next sample point.
  • Add the newly sampled data to the set of observations.
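Below is a bare-bones sketch of this loop for the same made-up one-dimensional objective, using Scikit-learn's Gaussian process as the surrogate. The acquisition rule here is a simple lower confidence bound (the surrogate mean minus its uncertainty), a slightly more robust variant of the mean-based rule mentioned above; real libraries such as Mango use richer acquisition functions.

# Bare-bones Bayesian optimization loop on a made-up 1-D objective.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    return np.sin(3 * x) + 0.5 * x  # stand-in for an expensive black box

rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 5, size=(3, 1))            # initial random evaluations
y_obs = objective(x_obs).ravel()
candidates = np.linspace(0, 5, 500).reshape(-1, 1)

for _ in range(15):
    # Fit the surrogate to the observations collected so far (the posterior).
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(x_obs, y_obs)
    mean, std = gp.predict(candidates, return_std=True)
    lcb = mean - 1.96 * std                       # lower-confidence-bound acquisition
    x_next = candidates[[np.argmin(lcb)]]         # next point to evaluate
    x_obs = np.vstack([x_obs, x_next])            # add the new sample to the set
    y_obs = np.append(y_obs, objective(x_next).ravel())

print("best x:", x_obs[np.argmin(y_obs)], "best value:", y_obs.min())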

The following figure shows Bayesian optimization of a simple one-dimensional function:

The figure above shows the Bayesian optimization of a one-dimensional function. Image taken from Arm Research's blog post "Scalable Hyperparameter Tuning for AutoML".

In fact, there are quite a few Python packages that use Bayesian optimization behind the scenes to find the best hyperparameters of machine learning models, for example Hyperopt, Optuna, BayesianOptimization, Scikit-optimize (skopt), GPyOpt, pyGPGO, and Mango, to name just a few.

Now, let's officially start the discussion of Mango.

Mango: Why is it so special?

In recent years, the amount of data in many industries has grown significantly. This poses a challenge for data scientists, who need their machine learning pipelines to scale. Distributed computing can address this problem.

Distributed computing refers to a group of computers that communicate with each other while performing a common task; this is different from parallel computing. In parallel computing, a task is divided into multiple subtasks, which are assigned to different processors on the same computer system.

Schematic diagram of parallel computing and distributed computing architecture.

Although there are quite a few Python libraries that use Bayesian optimization to optimize model hyperparameters, none of them support scheduling on any distributed computing framework. One of the motivations of the Mango developers was to create an optimization algorithm that would work in a distributed computing environment, while maintaining the power of Bayesian optimization.

What is the secret of Mango's architecture that makes it work so well in distributed computing environments? Mango is built with a modular design in which the optimizer and the scheduler are decoupled. This design makes it easy to scale machine learning pipelines that work with large amounts of data. However, it also poses a challenge for the optimization method, because traditional Bayesian optimization algorithms are sequential: the acquisition function provides only a single next point at which to evaluate the search.

Mango uses two methods to parallelize Bayesian optimization: a method called batch Gaussian process bandits, and k-means clustering. Batch Gaussian processes will not be covered in this blog.

Regarding the clustering method: in 2018, a group of researchers at IBM proposed using k-means clustering to scale out the Bayesian optimization process horizontally (see their paper for technical details). The method consists of clustering points sampled from the search domain that produce high values of the acquisition function (see the figure below). At the beginning, these clusters are far apart in the parameter search space; as the optimal region of the surrogate function is found, the distance between them decreases. The k-means clustering approach scales the optimization horizontally, because each cluster is used to run Bayesian optimization as a separate process. This parallelization leads to finding the optimal model hyperparameters faster.

Mango uses a clustering method to scale Bayesian optimization. The colored areas on the acquisition function are clusters built from sampled points of the search space with high acquisition values. At the beginning the clusters are separated from each other, but their distance shrinks as the surrogate function approaches the objective. (Image taken from Arm Research's blog post "Scalable Hyperparameter Tuning for AutoML".)
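The toy sketch below (not Mango's internal code) illustrates the idea on a made-up one-dimensional acquisition surface: sample candidates from the search domain, keep the ones with high acquisition values, and cluster them with k-means so that each cluster can propose a point to evaluate in parallel.

# Toy illustration of the clustering idea (not Mango's implementation):
# keep high-acquisition candidates and let each k-means cluster propose one point.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def acquisition(x):
    # Made-up acquisition surface with two promising regions.
    return np.exp(-(x - 0.3) ** 2 / 0.02) + 0.8 * np.exp(-(x - 0.8) ** 2 / 0.01)

candidates = rng.uniform(0, 1, size=1000)              # samples from the search domain
scores = acquisition(candidates)
top = candidates[scores > np.quantile(scores, 0.9)]    # high-acquisition points only

n_parallel = 2                                         # parallel Bayesian optimizations
kmeans = KMeans(n_clusters=n_parallel, n_init=10).fit(top.reshape(-1, 1))
print(kmeans.cluster_centers_.ravel())                 # one proposal per cluster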

In addition to being able to handle distributed computing frameworks, Mango is also compatible with the Scikit-learn API. This means that you can define the hyperparameter search space as a Python dictionary whose keys are the parameter names of the model, and where each entry can be defined with any of the 70+ distributions implemented in scipy.stats. All these unique features make Mango a good choice for data scientists who want to leverage data-driven solutions at scale.
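For instance, a search space mixing scipy.stats distributions with plain lists and ranges might look like the sketch below; the parameter names and ranges are purely illustrative.

# Sketch of a Mango-style search space using scipy.stats distributions.
# Continuous parameters use distributions; categorical ones use plain lists.
from scipy.stats import uniform, loguniform

param_space_example = {
    'min_impurity_decrease': uniform(0, 0.1),   # continuous values in [0, 0.1]
    'ccp_alpha': loguniform(1e-4, 1e-1),        # log-uniform over several decades
    'max_features': ["sqrt", "log2"],           # categorical choices
    'n_estimators': range(100, 500),            # integers
}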

Simple example

Next, let's show with an example how Mango works on an optimization problem. First, create a Python environment and then install Mango with the following command:


pip install arm-mango

In this example, we use the California housing dataset, which can be loaded directly from Scikit-learn (see the Scikit-learn documentation for more information):


import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
import time
from mango import Tuner

housing = fetch_california_housing()

# Create dataframes from the input data.
# Note: each value of the target corresponds to the average house value in units of $100,000.
features = pd.DataFrame(housing.data, columns=housing.feature_names)
target = pd.Series(housing.target, name=housing.target_names[0])

The dataset contains a total of 20,640 samples. Each sample includes 8 features, such as the house age and the average number of bedrooms. In addition, the California housing dataset includes the house price of each sample, in units of $100,000. The distribution of house prices is shown in the figure below:

The left panel of the figure shows the spatial distribution of house prices in the California dataset; the right panel shows a histogram of the same variable.

Note that the distribution of house prices is somewhat skewed. This means that some preprocessing of the target is needed. For example, we can transform the target distribution into a more normal shape via a log or Box-Cox transformation. This preprocessing can improve the predictive performance of the model because it reduces the variance of the target. We will perform this step during hyperparameter optimization and modeling. Now, let's split the dataset into three parts: training, validation, and test:


# Split the dataset into train, validation, and test sets.
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
x_train, x_validation, y_train, y_validation = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

At this point, we are ready to use Mango to optimize the machine learning model. First, we define the search space from which Mango draws values. In this example, we use the Extremely Randomized Trees (Extra-Trees) algorithm, an ensemble method very similar to random forests except that the split points are chosen at random instead of searching for the best split. This algorithm usually reduces variance at the cost of a slight increase in bias.

The search space of the Extremely Randomized Trees model can be defined as follows:


# Step 1: define the search space of the algorithm
# (use the range function instead of the uniform function to ensure integers are generated)
param_space = {'max_depth': range(3, 10),
               'min_samples_split': range(int(0.01 * features.shape[0]), int(0.1 * features.shape[0])),
               'min_samples_leaf': range(int(0.001 * features.shape[0]), int(0.01 * features.shape[0])),
               'max_features': ["sqrt", "log2", "auto"]
               }

After defining the parameter space, we specify the objective function. Here we use the training and validation sets created above; if instead you want to run a k-fold cross-validation strategy, you will need to implement it yourself inside the objective function (a sketch of such a variant is shown after the notes below).


# Step 2: define the objective function.
# If you want to cross-validate, define the cross-validation inside the objective;
# here we simply use the single train/validation split created above.
def objective(list_parameters):
    global x_train, y_train, x_validation, y_validation

    results = []
    for hyper_params in list_parameters:
        model = ExtraTreesRegressor(**hyper_params)
        model.fit(x_train, np.log1p(y_train))
        prediction = model.predict(x_validation)
        prediction = np.exp(prediction) - 1  # back to the real value, not the log scale
        error = np.sqrt(mean_squared_error(y_validation, prediction))
        results.append(error)
    return results

There are a few things to note about the above code:

  • The objective function aims to find the model hyperparameters that minimize the root mean squared error (RMSE).
  • In Scikit-learn, the implementation of extremely randomized trees for regression problems is called ExtraTreesRegressor.
  • Note that the house prices in the training set are log-transformed. Therefore, predictions on the validation set are transformed back to their original scale.
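If you would rather use the k-fold cross-validation mentioned earlier, one possible variant of the objective, sketched here with Scikit-learn's cross_val_score and otherwise reusing the names from the code above, could look as follows. Note that, unlike the split-based objective, this version scores the RMSE on the log-transformed target.

# Hypothetical variant of the objective using 5-fold cross-validation instead
# of the fixed train/validation split. Scoring is negated RMSE, so we flip the sign.
from sklearn.model_selection import cross_val_score

def objective_cv(list_parameters):
    results = []
    for hyper_params in list_parameters:
        model = ExtraTreesRegressor(**hyper_params)
        scores = cross_val_score(model, x_train, np.log1p(y_train),
                                 scoring='neg_root_mean_squared_error', cv=5)
        results.append(-scores.mean())   # RMSE on the log-transformed target
    return results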

The last step required to optimize the model hyperparameters is to instantiate the Tuner class, which is responsible for running Mango:


# Step 3: run the optimization through the Tuner
start_time = time.time()
tuner = Tuner(param_space, objective, dict(num_iteration=40, initial_random=10))  # initialize the Tuner
optimisation_results = tuner.minimize()
print(f'The optimisation in series takes {(time.time()-start_time)/60.} minutes.')

# Inspect the results
print('best parameters:', optimisation_results['best_params'])
print('best accuracy (RMSE):', optimisation_results['best_objective'])

# Run the model with the best hyperparameters on the test set
best_model = ExtraTreesRegressor(n_jobs=-1, **optimisation_results['best_params'])
best_model.fit(x_train, np.log1p(y_train))
y_pred = np.exp(best_model.predict(x_test)) - 1  # back to the actual value
print('rmse on test:', np.sqrt(mean_squared_error(y_test, y_pred)))

The above code took 4.2 minutes to run on a MacBook Pro with a 2.3 GHz quad-core Intel Core i7 processor.

The best hyperparameters and the best RMSE are:


best parameters: {'max_depth': 9, 'max_features': 'auto', 'min_samples_leaf': 85, 'min_samples_split': 729}

best accuracy (RMSE): 0.7418871882901833

When training the model on the training set with the best model parameters, the RMSE on the test set is:


rmse on test: 0.7395178741584788

Disclaimer: You may get different results when running this code.

Let's briefly review the Tuner class used in the code above. This class has many configuration parameters, but in this example we only used two of them:

  • num_iteration: the total number of iterations Mango uses to find the optimal value.
  • initial_random: the number of initial random samples to evaluate. Note that Mango returns all the random samples together, which is very useful when the optimization needs to run in parallel.

Note: The example in this blog uses only a small dataset. However, in many real-world applications you may deal with large data files that require a parallel implementation of Mango. In my GitHub repository you can find the complete code shown here, together with an implementation for large data files.

In conclusion, Mango is very versatile. You can use it with a wide range of machine learning and deep learning models that require parallel implementations or distributed computing environments to optimize their hyperparameters. I therefore encourage you to visit Mango's GitHub repository, where you can find the source code of many projects showing the use of Mango in different computing environments.

Summary

In this blog, we met Mango: a Python library for doing large-scale Bayesian optimization. This package will enable you to:

  • Scale the optimization of model hyperparameters, even on distributed computing frameworks.
  • Easily integrate scikit-learn models with Mango to build powerful machine learning pipelines.
  • Use any probability distribution implemented in scipy.stats to declare your search space.

All these features make Mango a unique Python library that extends your data science toolkit.

About the translator

Zhu Xianzhong, community editor, expert blogger, lecturer, and computer teacher at a university in Weifang, as well as a veteran freelance programmer.

Original title: Mango: A new way to do Bayesian optimization in Python by Carmen Adriana Martinez Barbosa

