Get the right balance between underfitting and overfitting by finding optimal hyperparameters for any model


Despite being one of the last stages of creating a model, hyperparameter optimization ("HPO") can make all the difference between a good model that generalizes well and an ugly overfit one that performs great on the training data but much worse on the validation set.

This is especially true for popular tree-based models such as Random Forest, XGBoost, or CatBoost. Out of the box, the base model will usually overfit your data badly. On the other hand, manually increasing the bias by constraining hyperparameters like "max_depth" or "max_features" in RandomForest often causes significant underfitting. The search space of possible hyperparameters has so many dimensions and values that you need a convenient way to find the sweet spot.
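To make the two failure modes concrete, here is a minimal sketch on synthetic data (the dataset and scores are illustrative, not the German rentals data): a default Random Forest fits the training set almost perfectly while scoring noticeably worse on held-out data, and an aggressively constrained one underfits both.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default forest: trees grow until leaves are (nearly) pure, so the
# training fit is near-perfect while the held-out score lags (overfitting).
default_rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Heavily constrained forest: max_depth=2 raises bias a lot and
# typically underfits both splits.
shallow_rf = RandomForestRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)

print("default: train R2 =", default_rf.score(X_tr, y_tr),
      "test R2 =", default_rf.score(X_te, y_te))
print("shallow: train R2 =", shallow_rf.score(X_tr, y_tr),
      "test R2 =", shallow_rf.score(X_te, y_te))
```

Somewhere between those two extremes lies the sweet spot, and that is what HPO searches for.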

The easiest approach is grid search: a brute-force method where you retrain your model with every combination from a defined set of parameters and ranges. Its huge disadvantage is that you spend the majority of your time exploring parameter combinations that don't work well, while only a fraction of them lie close to the optimal spot.
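For reference, a grid search with scikit-learn's GridSearchCV looks like this (synthetic data and an illustrative grid; the parameter values are placeholders, not tuned choices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Every combination is retrained and scored: 3 * 2 * 2 = 12 fits
# (times the number of CV folds), however unpromising a region looks.
param_grid = {
    "max_depth": [4, 8, 12],
    "n_estimators": [50, 100],
    "max_features": [0.5, 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

The cost grows multiplicatively with every parameter you add, which is exactly the problem a smarter search strategy avoids.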

What if we could have another model that evaluates the result of each set of hyperparameters and nudges them in a direction that improves the base model's performance? Fortunately, scikit-optimize (SKOPT) does exactly that.

In this article, I will demonstrate how to start HPO with scikit-optimize, using RandomForest and XGBoost on sample data of 250k German rentals, which we will use to predict rental price. The data is a transformed version of a Kaggle dataset; together with the code used for this article, it can be found on GitHub.

SKOPT makes hyperparameter optimization much easier by essentially creating another model that tries to minimize your initial model's loss by changing its hyperparameters. We will first set up HPO for a simple RandomForestRegressor model.

To start, you need to set up three things:

```python
from skopt.space import Integer, Real, Categorical

# Note: only the upper bounds survived in the original snippet;
# the lower bounds below are illustrative placeholders.
search_space = [
    Integer(1, 12, name='max_depth'),
    Integer(10, 200, name='n_estimators'),
    Integer(1, 20, name='max_features'),
    Real(0.0, 1.0, name='min_impurity_decrease'),
    Categorical([True, False], name='bootstrap'),
]
```

The search space defines the hyperparameters you want to explore in your search, together with exploration…
