
Hyperparameter Tuning using Optuna | Bayesian Optimization using Optuna

CampusX · 6 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Optuna’s Bayesian optimization uses results from earlier trials to choose the next hyperparameters, reducing wasted computation compared with grid search and random search.

Briefing

Hyperparameter tuning stops being a brute-force chore when Optuna replaces exhaustive search with Bayesian optimization that learns where accuracy is likely to improve. Instead of trying every combination (grid search) or a small random subset (random search), Optuna builds a probabilistic model of the relationship between hyperparameters and the objective metric, then uses that model to choose the next hyperparameter settings to evaluate—so it reaches strong results with far fewer trials.

The walkthrough starts with a concrete setup: predicting whether students will be placed, using a Random Forest classifier trained on features like CGPA and IQ. Two Random Forest hyperparameters become the tuning targets—max_depth (tree depth) and n_estimators (number of trees). The goal is to maximize classification accuracy, but the “best” values aren’t known ahead of time. That uncertainty motivates hyperparameter tuning: define a search space, train the model repeatedly for different hyperparameter combinations, and measure accuracy on validation data.
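
As a rough sketch of that setup (not the video’s exact code), a baseline Random Forest might be trained as follows; the file name and the column names (cgpa, iq, placement) are assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical placement dataset with CGPA, IQ and a binary placement label
df = pd.read_csv("placement.csv")   # assumed file name
X = df[["cgpa", "iq"]]              # assumed feature columns
y = df["placement"]                 # assumed target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Untuned baseline: n_estimators and max_depth are the values we want to tune
model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
model.fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))
```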

Grid search is presented as the baseline method: it enumerates all combinations in the search grid and trains a model for each one. The method is straightforward but quickly becomes computationally expensive as the number of hyperparameters and candidate values grows—especially in deep learning where each training run can be costly. Random search reduces compute by sampling only a limited number of combinations, but it can miss high-performing regions because it doesn’t use information from earlier trials.
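
For comparison, the two baseline strategies look roughly like this in scikit-learn; the candidate values below are illustrative, not taken from the video.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search: one model per combination per CV fold (4 x 4 = 16 combinations here)
param_grid = {"n_estimators": [50, 100, 150, 200], "max_depth": [3, 5, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)

# Random search: samples only n_iter combinations from the same space
param_dist = {"n_estimators": randint(50, 200), "max_depth": randint(3, 20)}
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=10,
    cv=5,
    random_state=42,
)
# grid.fit(X_train, y_train); rand.fit(X_train, y_train)
```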

Optuna’s core advantage is Bayesian optimization. The transcript frames it as learning a hidden mathematical relationship between hyperparameters (max_depth and n_estimators) and accuracy. As trials accumulate, Optuna treats each evaluated setting as a data point on an implicit multi-dimensional accuracy surface, then uses that information to infer promising regions. The next trial is selected intelligently using a sampler (by default, TPE—Tree-structured Parzen Estimator). Crucially, Optuna reuses past trial outcomes to guide future sampling, unlike grid or random search.

A practical code workflow is then described around five key Optuna concepts: Study (the optimization session), Trial (one hyperparameter evaluation run), Trial Parameters (the specific hyperparameter values for that run), Objective Function (the function that trains the model and returns accuracy), and Sampler (the component that decides the next hyperparameters to try). In the example, missing values in the dataset are handled by replacing zeros with NaN and imputing with the mean. The objective function defines the search ranges (n_estimators between 50 and 200; max_depth between 3 and 20), trains a Random Forest using cross-validation, and returns the mean accuracy.
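
A minimal sketch of such an objective function, assuming X_train and y_train are the preprocessed training data and using an assumed 5-fold cross-validation, might look like this:

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Trial parameters: the sampler picks one value per hyperparameter per trial
    n_estimators = trial.suggest_int("n_estimators", 50, 200)
    max_depth = trial.suggest_int("max_depth", 3, 20)

    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=42
    )
    # Cross-validated accuracy; the mean is the value Optuna tries to maximize
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    return scores.mean()
```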

After running 50 trials, Optuna reports the best accuracy and the corresponding best hyperparameters (example values given: n_estimators=115 and max_depth=18). The model is retrained with those parameters and evaluated on the test set, yielding an accuracy of around 75%, described as a solid starting point that could improve with better preprocessing.
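
A rough sketch of running the study and retraining with the best parameters, reusing the objective function and data splits from the earlier sketches (exact numbers vary from run to run):

```python
import optuna
from sklearn.ensemble import RandomForestClassifier

# Maximize because the objective returns accuracy (use "minimize" for a loss)
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print("best accuracy:", study.best_value)
print("best params:", study.best_params)  # e.g. {"n_estimators": 115, "max_depth": 18}

# Retrain on the full training set with the best parameters, then score on the test set
best_model = RandomForestClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
print("test accuracy:", best_model.score(X_test, y_test))
```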

The transcript also highlights Optuna’s flexibility: the sampler can be swapped to run random search or grid search behavior while keeping the same objective-function structure. Visualization tools are emphasized next, including optimization history (trial vs. accuracy), parallel coordinate plots (how hyperparameters relate to accuracy), contour plots (accuracy across pairs of hyperparameter values), and importance plots (which hyperparameters matter most). Finally, Optuna’s “define-by-run” capability is showcased: the algorithm choice itself can be treated as a tunable hyperparameter, enabling dynamic search spaces across SVM, Random Forest, XGBoost/gradient boosting, and Logistic Regression. This lets Optuna find not only the best hyperparameters but also the best model family, using conditional logic to switch search spaces during optimization.
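
The visualization step might look like the following sketch; each call returns a Plotly figure (plotly must be installed), and the parameter names match the study built above.

```python
import optuna.visualization as vis

vis.plot_optimization_history(study).show()   # accuracy per trial over time
vis.plot_parallel_coordinate(study).show()    # hyperparameter values vs. accuracy
vis.plot_contour(study, params=["n_estimators", "max_depth"]).show()
vis.plot_param_importances(study).show()      # which hyperparameters matter most
```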

Cornell Notes

Optuna improves hyperparameter tuning by using Bayesian optimization: it learns from previous trials to decide which hyperparameters to evaluate next, aiming to maximize an objective metric (here, accuracy). The example tunes a Random Forest using two parameters—max_depth and n_estimators—by defining a search space, training models inside an objective function, and returning cross-validated mean accuracy. Optuna’s workflow is built around a Study (overall optimization), Trials (individual evaluations), Trial Parameters (chosen hyperparameters), an Objective Function (train + score), and a Sampler (e.g., TPE) that selects the next hyperparameters based on past results. After a limited number of trials (e.g., 50), Optuna yields best hyperparameters and a strong test accuracy, and it can further visualize and interpret the optimization process.

Why do grid search and random search become inefficient as the number of hyperparameters grows?

Grid search evaluates every combination in a discrete search grid, so the number of model trainings scales multiplicatively with the number of candidate values per hyperparameter. Adding another hyperparameter (or increasing candidate counts) rapidly explodes the total combinations. Random search reduces compute by sampling only a fixed number of random combinations, but it doesn’t use information from earlier trials to steer sampling toward promising regions, so it can miss the best settings—especially when the search space is large and training is expensive.
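
A quick back-of-the-envelope illustration, with hypothetical candidate counts, shows how fast an exhaustive grid grows:

```python
import math

# Hypothetical number of candidate values per hyperparameter
candidates = {"n_estimators": 4, "max_depth": 4, "min_samples_split": 3, "max_features": 3}
print(math.prod(candidates.values()))  # 144 trainings, before multiplying by CV folds
```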

What makes Optuna’s Bayesian optimization different from grid or random search?

Optuna uses past trial results to build an internal probabilistic model of how hyperparameters relate to the objective metric. Each evaluated hyperparameter setting provides a data point on an implicit accuracy surface. Using that accumulated information, the sampler chooses the next hyperparameters that are more likely to improve accuracy, rather than exhaustively enumerating all combinations (grid) or sampling blindly (random). In the transcript, this is described as “uncovering” the shape of the accuracy graph and then targeting its maxima.

What are the five Optuna concepts needed to understand and read the tuning code?

The transcript highlights: (1) Study = the optimization session that manages trials; (2) Trial = one run that evaluates a specific hyperparameter configuration; (3) Trial Parameters = the hyperparameter values used in that trial; (4) Objective Function = the function that trains the model and returns the metric (accuracy); and (5) Sampler = the component that decides the next hyperparameters to try. The sampler is central to Bayesian optimization; by default, Optuna uses TPE (Tree-structured Parzen Estimator).
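
A toy example (not the video’s code) showing where each of the five concepts appears in the Optuna API:

```python
import optuna

def objective(trial):                      # Objective Function: evaluate and return the metric
    x = trial.suggest_float("x", -10, 10)  # Trial Parameter sampled for this Trial
    return -(x - 2) ** 2                   # value the Study tries to maximize

study = optuna.create_study(               # Study: the overall optimization session
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),  # Sampler: TPE is also the default
)
study.optimize(objective, n_trials=20)     # each call to objective is one Trial
print(study.best_params)
```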

How does the objective function work in the Random Forest example?

The objective function defines the search space for n_estimators and max_depth, then for each trial it samples one value of each hyperparameter, trains a Random Forest classifier, and evaluates it using cross-validation. It computes the mean accuracy across folds and returns that mean as the objective value. Optuna then uses those returned accuracies to guide subsequent trials.

How can Optuna switch between Bayesian optimization and random/grid-style search?

The transcript explains that the objective function can stay the same, while the sampler changes. By selecting a different sampler (e.g., a RandomSampler for random search or a GridSampler for grid search) inside the Study creation, Optuna can perform random or exhaustive grid evaluations while still using the same Study/Trial/Objective structure.
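
A sketch of swapping samplers, assuming the objective function defined earlier; the grid values passed to GridSampler are illustrative:

```python
import optuna

# Random search behavior: same objective, different sampler
random_study = optuna.create_study(
    direction="maximize", sampler=optuna.samplers.RandomSampler(seed=42)
)
random_study.optimize(objective, n_trials=50)

# Grid search behavior: GridSampler needs an explicit grid of candidate values
search_space = {"n_estimators": [50, 100, 150, 200], "max_depth": [3, 10, 20]}
grid_study = optuna.create_study(
    direction="maximize", sampler=optuna.samplers.GridSampler(search_space)
)
grid_study.optimize(objective, n_trials=12)  # one trial per combination (4 x 3 = 12)
```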

What does “define-by-run” enable that typical tuning setups don’t?

“Define-by-run” allows conditional, dynamic search spaces. In the example, the algorithm family (SVM, Random Forest, gradient boosting/XGBoost, Logistic Regression) is treated like a hyperparameter. Depending on which algorithm is selected in a trial, Optuna conditionally defines the relevant hyperparameters and search ranges for that algorithm. This lets one optimization run identify both the best model type and its best hyperparameters.
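
A condensed sketch of a define-by-run objective; the gradient-boosting/XGBoost branch is omitted for brevity, the ranges are illustrative, and X_train/y_train are assumed from the earlier setup.

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    # The algorithm family itself is a categorical hyperparameter
    algo = trial.suggest_categorical("classifier", ["svm", "random_forest", "logreg"])

    if algo == "svm":
        model = SVC(C=trial.suggest_float("svm_C", 1e-3, 1e3, log=True))
    elif algo == "random_forest":
        model = RandomForestClassifier(
            n_estimators=trial.suggest_int("n_estimators", 50, 200),
            max_depth=trial.suggest_int("max_depth", 3, 20),
        )
    else:
        model = LogisticRegression(
            C=trial.suggest_float("logreg_C", 1e-3, 1e3, log=True), max_iter=1000
        )

    return cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # includes the winning "classifier" plus its hyperparameters
```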

Review Questions

  1. In grid search, how does the total number of model trainings scale when you add another hyperparameter with multiple candidate values?
  2. In Optuna, what role does the Sampler (e.g., TPE) play in choosing the next Trial’s hyperparameters?
  3. How does the objective function’s returned metric (accuracy vs. loss) affect whether Optuna should maximize or minimize during Study creation?

Key Points

  1. Optuna’s Bayesian optimization uses results from earlier trials to choose the next hyperparameters, reducing wasted computation compared with grid search and random search.

  2. Grid search becomes impractical as hyperparameter candidate counts multiply, while random search can miss high-performing regions because it doesn’t learn from prior outcomes.

  3. Optuna’s tuning loop is built from a Study (session), Trials (evaluations), an Objective Function (train + return metric), Trial Parameters (sampled hyperparameters), and a Sampler (e.g., TPE) that drives the search.

  4. In the Random Forest example, the objective function samples n_estimators and max_depth, trains the model, evaluates with cross-validation, and returns mean accuracy for Optuna to optimize.

  5. Optuna can swap samplers to emulate random search or grid search behavior without changing the overall objective-function structure.

  6. Optuna’s visualization tools (history, parallel coordinates, contour, importance) help identify when accuracy plateaus and which hyperparameters matter most.

  7. “Define-by-run” supports dynamic/conditional search spaces, enabling joint selection of the best algorithm family and its best hyperparameters in one optimization run.

Highlights

Optuna replaces exhaustive enumeration with Bayesian optimization that learns an implicit accuracy surface and targets its likely maxima.
The tuning workflow centers on Study, Trial, Objective Function, Trial Parameters, and Sampler—making the optimization logic modular and reusable.
Visualization (optimization history, contour plots, importance plots) turns trial results into actionable insight about hotspots and diminishing returns.
Define-by-run lets the algorithm choice itself be optimized, switching hyperparameter search spaces conditionally (e.g., SVM vs. Random Forest).
