7 XGBoost Tricks for More Accurate Predictive Models

by SkillAiNest


# Introduction

Ensemble methods such as XGBoost (Extreme Gradient Boosting) are powerful implementations of gradient-boosted decision trees that combine several weak learners into a robust prediction model. These ensembles are extremely popular due to their accuracy, efficiency, and strong performance on structured (tabular) data. While scikit-learn, a widely used machine learning library, does not provide a native implementation, there is a separate library, appropriately called xgboost, that offers an API compatible with scikit-learn.

You just need to import it as follows:

from xgboost import XGBClassifier

Below, we outline seven Python tricks that can help you get the most out of this stand-alone XGBoost implementation, especially when aiming to build more accurate predictive models.

To illustrate these tricks, we will use the breast cancer dataset freely available in scikit-learn and define a baseline model with mostly default settings. Be sure to run this code before experimenting with the seven tricks that follow:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 1. Tuning learning rate and number of estimators

Although not a universal rule, reducing the learning rate while increasing the number of estimators (trees) in the XGBoost ensemble often improves accuracy. A small learning rate lets the model learn in smaller increments, while the additional trees compensate for the reduced step size.

Here is an example. Try it yourself and compare the accuracy of the result to the initial baseline:

model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
print("Model accuracy:", accuracy_score(y_test, model.predict(X_test)))

For brevity, the final print() statement will be omitted in the remaining examples. Add it to any of the snippets below when testing them yourself.
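
If you want to see the trade-off between learning rate and number of estimators for yourself, a quick comparison loop can help. This is a minimal sketch, not part of the original example, and the specific (learning_rate, n_estimators) pairs are arbitrary choices:

# Compare a few learning_rate / n_estimators pairs (arbitrary values)
for lr, n_est in [(0.3, 100), (0.1, 500), (0.01, 5000)]:
    m = XGBClassifier(
        learning_rate=lr,
        n_estimators=n_est,
        eval_metric="logloss",
        random_state=42
    )
    m.fit(X_train, y_train)
    acc = accuracy_score(y_test, m.predict(X_test))
    print(f"learning_rate={lr}, n_estimators={n_est}: accuracy={acc:.4f}")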

# 2. Adjusting the maximum depth of trees

The max_depth argument is an important hyperparameter inherited from classic decision trees. It limits how deep each tree in the ensemble can grow. Capping tree depth may seem simplistic, but, perhaps surprisingly, shallow trees often outperform deep ones.

This example limits trees to a maximum depth of 2:

model = XGBClassifier(
    max_depth=2,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
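
To check this claim on your own data, you can sweep a few depths and compare the results. This is a minimal sketch using the baseline setup above; the depth values are arbitrary:

# Compare test accuracy across a few tree depths (arbitrary values)
for depth in (2, 3, 4, 6, 8):
    m = XGBClassifier(
        max_depth=depth,
        eval_metric="logloss",
        random_state=42
    )
    m.fit(X_train, y_train)
    acc = accuracy_score(y_test, m.predict(X_test))
    print(f"max_depth={depth}: accuracy={acc:.4f}")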

# 3. Reducing overfitting by subsampling

The subsample parameter randomly samples a proportion of the training data (e.g., 80%) before growing each tree in the ensemble. This simple technique serves as an effective regularization strategy and helps prevent overfitting.

If not specified, this hyperparameter defaults to 1.0, meaning 100% of the training examples are used. The example below also sets colsample_bytree, which similarly subsamples the features considered when building each tree:

model = XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

Note that this method is most effective for datasets of reasonable size. Aggressive subsampling can lead to underfitting if the dataset is already small.

# 4. Adding provisions for regularization

To further control overfitting, complex trees can be penalized using traditional regularization strategies such as L1 (lasso) and L2 (ridge). In XGBoost, these are controlled by the reg_alpha and reg_lambda parameters, respectively.

model = XGBClassifier(
    reg_alpha=0.2,   # L1
    reg_lambda=0.5,  # L2
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
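
There is no universally good penalty strength, so it is worth trying a few values. The sketch below (arbitrary values, not from the original article) varies the L2 penalty; the same pattern applies to reg_alpha:

# Compare test accuracy across a few L2 penalty strengths (arbitrary values)
for lam in (0.0, 0.5, 1.0, 5.0):
    m = XGBClassifier(
        reg_lambda=lam,
        eval_metric="logloss",
        random_state=42
    )
    m.fit(X_train, y_train)
    acc = accuracy_score(y_test, m.predict(X_test))
    print(f"reg_lambda={lam}: accuracy={acc:.4f}")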

# 5. Using early stopping

Early stopping is a mechanism that halts training when performance on a validation set stops improving for a specified number of rounds.

Depending on your coding environment and the version of the XGBoost library you are using, you may need to upgrade to a more recent release to run the code shown below. Also, make sure that early_stopping_rounds is specified during model initialization rather than passed to the fit() method.

model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
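
Once training stops early, recent versions of XGBoost expose the best round found on the validation set through the fitted model; the exact attributes can vary across versions, so treat this as a hedged sketch:

# Inspect where early stopping settled (available in recent XGBoost versions)
print("Best iteration:", model.best_iteration)
print("Best validation logloss:", model.best_score)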

To upgrade the library, run:

!pip uninstall -y xgboost
!pip install xgboost --upgrade
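
If you are unsure whether an upgrade is needed, you can check the installed version first:

import xgboost
print(xgboost.__version__)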

# 6. Searching the hyperparameter space

For a more systematic approach, a hyperparameter search can help identify combinations of settings that maximize model performance. Below is an example using grid search to explore combinations of three of the key hyperparameters introduced earlier.

param_grid = {
    "max_depth": (3, 4, 5),
    "learning_rate": (0.01, 0.05, 0.1),
    "n_estimators": (200, 500)
}

grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy"
)

grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

best_model = XGBClassifier(
    **grid.best_params_,
    eval_metric="logloss",
    random_state=42
)

best_model.fit(X_train, y_train)
print("Tuned accuracy:", accuracy_score(y_test, best_model.predict(X_test)))

# 7. Adjusting for class imbalance

This final trick is especially useful when working with datasets that exhibit severe class imbalance (the breast cancer dataset is relatively balanced, so don't worry if you see minimal change). The scale_pos_weight parameter is particularly helpful when the class ratio is highly skewed, such as 90/10, 95/5, or 99/1.

Here’s how to calculate and apply it based on training data:

ratio = np.sum(y_train == 0) / np.sum(y_train == 1)

model = XGBClassifier(
    scale_pos_weight=ratio,
    eval_metric="logloss",
    random_state=42
)

model.fit(X_train, y_train)
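
Because the breast cancer dataset is nearly balanced, the effect of scale_pos_weight is hard to see here. The sketch below builds a synthetic 95/5 dataset with scikit-learn's make_classification (an assumption, not part of the original article) purely to show the parameter in a genuinely imbalanced setting:

from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset (~95% negatives, ~5% positives)
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42
)
Xi_tr, Xi_te, yi_tr, yi_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

ratio = np.sum(yi_tr == 0) / np.sum(yi_tr == 1)  # roughly 19 for a 95/5 split
imb_model = XGBClassifier(
    scale_pos_weight=ratio,
    eval_metric="logloss",
    random_state=42
)
imb_model.fit(Xi_tr, yi_tr)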

# Wrapping Up

In this article, we explored seven practical tricks for improving XGBoost ensemble models using its dedicated Python library. Thoughtful configuration of learning rates, tree depth, sampling strategies, regularization, and class weights, combined with a systematic hyperparameter search, often makes the difference between a decent model and a highly accurate one.

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in applying AI in the real world.
