Which single machine learning algorithm, if you ask around, consistently delivers strong performance in both regression and classification?
For many practitioners the answer is XGBoost. It is arguably one of the most powerful algorithms available and is increasingly used across industries and problem domains, from customer analytics and sales prediction to fraud detection and credit approval.
It is also a winning algorithm in many machine learning competitions. In fact, of the 29 winning solutions published on Kaggle’s blog during 2015, 17 used XGBoost.
Beyond business and competitions, XGBoost has also been applied to scientific data, for example in the Higgs Boson Machine Learning Challenge built around data from the Large Hadron Collider.
Its hyperparameters are key to its performance. While XGBoost is extremely easy to get running, the hard part is tuning those hyperparameters. In this article, I will talk about some of the key hyperparameters, their roles and how to choose their values.
But before I go there, let’s talk about how XGBoost works under the hood.
XGBoost (or eXtreme Gradient Boosting) is not a standalone algorithm in the conventional sense. It is rather an open-source library that “boosts” the performance of simple models, primarily decision trees, within a gradient boosting framework: trees are added one at a time, and each new tree is fitted to correct the errors of the trees before it. Regularization is built into the training objective to limit overfitting.
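To make the idea concrete, here is a toy sketch of gradient boosting with a regularized leaf value, written in plain Python. It is not XGBoost's actual implementation: it assumes squared-error loss, one-dimensional inputs and depth-1 trees, and the helper names (`fit_stump`, `stump_output`, `boost`, `mse`) and the toy data are made up for illustration. The `eta` and `lam` arguments only loosely mirror the library's `eta` (learning rate) and `lambda` (L2 regularization) hyperparameters.

```python
# Toy gradient boosting with depth-1 trees ("stumps") on 1-D data.
# Illustrative sketch only; this is not how the XGBoost library is implemented.

def fit_stump(xs, grads, lam):
    """Pick the threshold whose two leaves best fit the current gradients.

    With squared error every hessian equals 1, so a leaf holding gradients g_i
    gets the regularized value -sum(g_i) / (count + lam).
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right leaf is never empty
        left = [g for x, g in zip(xs, grads) if x <= t]
        right = [g for x, g in zip(xs, grads) if x > t]
        # A larger score means a larger reduction of the regularized objective.
        score = sum(left) ** 2 / (len(left) + lam) + sum(right) ** 2 / (len(right) + lam)
        if best is None or score > best[0]:
            best = (score, t, -sum(left) / (len(left) + lam), -sum(right) / (len(right) + lam))
    return best[1:]  # (threshold, left_value, right_value)


def stump_output(stump, x):
    """Return the leaf value of a fitted stump for input x."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, eta=0.3, lam=1.0):
    """Fit n_rounds stumps sequentially, each on the gradients left by the previous ones."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        grads = [p - y for p, y in zip(preds, ys)]  # gradient of 0.5 * (p - y)^2
        stump = fit_stump(xs, grads, lam)
        stumps.append(stump)
        # Shrink each new tree's contribution by the learning rate eta.
        preds = [p + eta * stump_output(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Made-up toy data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9]

base, stumps, preds = boost(xs, ys)
baseline_mse = mse([base] * len(ys), ys)
final_mse = mse(preds, ys)
print("MSE of the mean prediction:", baseline_mse)
print("MSE after boosting:", final_mse)
```

A larger `lam` pulls every leaf value toward zero, and a smaller `eta` makes each tree contribute less, so both slow down how quickly the ensemble fits the training data. That trade-off is what the hyperparameters discussed later control.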
The key strengths of XGBoost are:
Flexibility: It can perform machine learning tasks such as regression, classification, ranking and other user-defined objectives.
Portability: It runs on Windows, Linux and macOS as well as on cloud platforms.
Language support: It supports multiple languages including C++, Python, R, Java, Scala and Julia.
Distributed training on cloud systems: XGBoost supports distributed training on multiple machines, including AWS, GCE, Azure, and Yarn clusters.
Other important features of XGBoost include:
- parallel processing for large datasets
- handling of missing values
- regularization to prevent overfitting
- built-in cross-validation
Below I’ll first walk through a simple 5-step implementation of XGBoost and then we can talk about the hyperparameters and how to use them to optimize performance.
1) Import libraries
For this demo we do not need much. From the sklearn library we can import modules for splitting the data into training and test sets and for computing accuracy metrics. Note that you first need to install the XGBoost library (pip install xgboost) before you can import it.
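The imports this step describes might look like the sketch below. The `missing_packages` helper and the `REQUIRED` mapping are hypothetical names, and checking with `importlib.util.find_spec` is just one assumed way of confirming that both packages are installed before importing them.

```python
import importlib.util

# Import name -> pip package name ("sklearn" is installed as "scikit-learn").
REQUIRED = {"sklearn": "scikit-learn", "xgboost": "xgboost"}


def missing_packages(import_names):
    """Return the import names that Python cannot locate (hypothetical helper)."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Nothing is imported in this branch; install the packages and rerun.
    print("Install first: pip install " + " ".join(REQUIRED[name] for name in missing))
else:
    from sklearn.model_selection import train_test_split  # train/test splitting
    from sklearn.metrics import accuracy_score            # accuracy metric
    import xgboost as xgb                                  # the XGBoost library
```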
# loading data