Introduction
Ensemble methods, such as XGBoost (Extreme Gradient Boosting), are powerful implementations of gradient-boosted decision trees. These methods combine multiple weaker estimators to form a robust predictive model. XGBoost ensembles are widely favored for their accuracy, efficiency, and strong performance with structured (tabular) data. While the popular machine learning library scikit-learn does not include a native XGBoost implementation, a separate XGBoost library provides an API compatible with scikit-learn.
To use it, import it as follows:
from xgboost import XGBClassifier
This article details seven Python techniques to effectively utilize the standalone XGBoost implementation, particularly for building more accurate predictive models.
To demonstrate these techniques, the Breast Cancer dataset from scikit-learn will be used, along with a baseline model configured with mostly default settings. It is recommended to run the following code before experimenting with the subsequent seven tricks:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# Data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Baseline model
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))
1. Tuning Learning Rate And Number Of Estimators
While not a strict rule, reducing the learning rate and simultaneously increasing the number of estimators (trees) in an XGBoost ensemble often leads to improved accuracy. A smaller learning rate enables the model to learn more gradually, with additional trees compensating for the reduced step size.
Consider the following example. Test it and compare the resulting accuracy against the initial baseline:
model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
print("Model accuracy:", accuracy_score(y_test, model.predict(X_test)))
For brevity, the final print() statement will be omitted in subsequent examples. Users can append it to any code snippet for testing.
2. Adjusting The Maximum Depth Of Trees
The max_depth argument is a critical hyperparameter derived from classic decision trees, controlling the maximum depth each tree in the ensemble can reach. Limiting tree depth might seem counterintuitive, but shallower trees often exhibit better generalization capabilities than deeper ones.
This example restricts trees to a maximum depth of 2:
model = XGBClassifier(
    max_depth=2,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
3. Reducing Overfitting By Subsampling
The subsample argument randomly samples a proportion of the training rows (e.g., 80%) before each tree in the ensemble is grown; its companion, colsample_bytree, does the same for features (columns). Both are straightforward yet effective regularization strategies that help prevent overfitting.
If not specified, each hyperparameter defaults to 1.0, meaning all training rows and features are utilized:
model = XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
It is important to note that this method is most effective for datasets of reasonable size. For smaller datasets, aggressive subsampling could potentially lead to underfitting.
4. Adding Regularization Terms
To further mitigate overfitting, complex trees can be penalized using standard regularization techniques like L1 (Lasso) and L2 (Ridge). In XGBoost, these are controlled by the reg_alpha and reg_lambda parameters, respectively.
model = XGBClassifier(
    reg_alpha=0.2,   # L1
    reg_lambda=0.5,  # L2
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
5. Using Early Stopping
Early stopping is a mechanism designed for efficiency, halting the training process when the model’s performance on a validation set ceases to improve over a specified number of rounds.
Depending on the coding environment and XGBoost library version, an upgrade may be necessary to run the implementation shown below: in recent XGBoost releases, early_stopping_rounds is set during model initialization, whereas older versions accepted it as an argument to the fit() method instead.
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42
)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
To upgrade the library, execute:
!pip uninstall -y xgboost
!pip install xgboost --upgrade
6. Performing Hyperparameter Search
For a more structured approach, hyperparameter search can assist in identifying optimal combinations of settings that maximize model performance. The following example uses grid search to explore combinations of three previously discussed key hyperparameters:
param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500]
}
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy"
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
best_model = XGBClassifier(
    **grid.best_params_,
    eval_metric="logloss",
    random_state=42
)
best_model.fit(X_train, y_train)
print("Tuned accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
7. Adjusting For Class Imbalance
This final technique is particularly valuable when dealing with datasets that exhibit significant class imbalance (the Breast Cancer dataset is relatively balanced, so minimal changes might be observed). The scale_pos_weight parameter is especially useful when class proportions are highly skewed, such as 90/10, 95/5, or 99/1.
Here is how to calculate and apply it based on the training data:
ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
model = XGBClassifier(
    scale_pos_weight=ratio,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
Wrapping Up
This article presented seven practical techniques to enhance XGBoost ensemble models using its dedicated Python library. Careful adjustment of learning rates, tree depth, sampling strategies, regularization, and class weighting, combined with systematic hyperparameter search, often distinguishes between an adequate model and a highly accurate one.

