LightGBM for Quantile Regression

Last Updated : 06 Jun, 2025

In most machine learning problems, we try to predict the average or expected value of a target variable. For example, if we want to predict the price of a house, we usually train a model to output the average price based on features like size, location or number of rooms. But when we want to know more than just the average, quantile regression becomes useful.

Instead of predicting just one number like the average, quantile regression predicts a specific percentile. It is useful when we want to estimate conditional quantiles of a target variable. A quantile is a value below which a certain percentage of the data falls.

For Example:

  • The 50th percentile is the median: 50% of values are below it.
  • The 90th percentile is a high value: 90% of the data is below it.
  • The 10th percentile is a low value: 10% of the data is below it.
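As a quick illustration, here is a minimal sketch that computes these percentiles for a synthetic sample with NumPy (the data is made up purely for demonstration):

Python
import numpy as np

# 1,000 synthetic values, for illustration only
values = np.random.default_rng(42).normal(loc=30, scale=7, size=1000)

# 10th, 50th and 90th percentiles of the sample
p10, p50, p90 = np.quantile(values, [0.1, 0.5, 0.9])
print(p10, p50, p90)  # roughly 21, 30 and 39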

Why Use Quantile Regression?

When predicting delivery times, an average might say 30 minutes, but actual times can range from 15 to 45 minutes. Quantile regression helps by showing different outcomes: the 10th percentile could be 20 minutes (best case), the 50th percentile 30 minutes (typical) and the 90th percentile 45 minutes (worst case). This gives a much better picture of the possible range.

How Quantile Regression Works in LightGBM

LightGBM (Light Gradient Boosting Machine) is a popular machine learning library developed by Microsoft, known for being fast, efficient, accurate and easy to use. One of its key advantages is built-in support for quantile regression, which lets us predict percentiles directly without complex adjustments or custom implementations.

LightGBM lets us change the loss function used during training. By default, it uses mean squared error (MSE) for regression, but for quantile regression we use the quantile loss function.
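The quantile loss (also known as the pinball loss) weights errors asymmetrically: when the true value is above the prediction, the error is multiplied by alpha, and when it is below, by 1 − alpha. Here is a minimal NumPy sketch of the idea (not LightGBM's internal implementation):

Python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Quantile (pinball) loss, averaged over all samples."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) is weighted by alpha,
    # over-prediction (diff < 0) by (1 - alpha)
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))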

To use quantile loss in LightGBM, we need to set:

Python
objective='quantile'

We also need to set a quantile value (called alpha) between 0 and 1:

  • alpha = 0.1 → predicts the 10th percentile
  • alpha = 0.5 → predicts the 50th percentile (median)
  • alpha = 0.9 → predicts the 90th percentile
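For instance, a minimal model that targets the 90th percentile looks like this (the full worked example follows below):

Python
import lightgbm as lgb

# alpha selects the quantile; here the 90th percentile
model = lgb.LGBMRegressor(objective='quantile', alpha=0.9)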

Example: Predicting House Prices with Quantile Regression

Step 1: Installing LightGBM

First, install LightGBM from the command line (in a notebook, prefix the command with !):

Python
pip install lightgbm

Step 2: Loading Sample Data

  • pandas is used to handle data in a table-like format.
  • fetch_california_housing() loads a sample dataset with information about houses in California.
  • data.data contains the features (e.g., number of rooms, population, etc.).
  • data.target contains the house prices.
  • All this data is stored in a DataFrame called df.
Python
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

Step 3: Splitting the Data

  • We separate the features (X) and the target variable (y).
  • train_test_split() splits the data into a training set (80%) and a test set (20%).
  • The training set is used to train the model.
  • The test set is used to check how well the model performs on unseen data.
  • random_state=42 ensures the split is the same every time you run it.
Python
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Training a Quantile Model

  • We import the LightGBM module.
  • We set the shared model parameters in the params dictionary: 'objective': 'quantile' tells the model to do quantile regression, 'learning_rate': 0.1 controls how fast the model learns and 'n_estimators': 100 means the model will build 100 trees.
  • We create and train three models, one for each quantile: alpha=0.1 predicts the 10th percentile (lower bound), alpha=0.5 the 50th percentile (median) and alpha=0.9 the 90th percentile (upper bound).
  • .fit() trains each model using the training data.
Python
import lightgbm as lgb

# Define models
params = {
    'objective': 'quantile',
    'learning_rate': 0.1,
    'n_estimators': 100
}

# 10th percentile
model_10 = lgb.LGBMRegressor(**params, alpha=0.1)
model_10.fit(X_train, y_train)

# 50th percentile (median)
model_50 = lgb.LGBMRegressor(**params, alpha=0.5)
model_50.fit(X_train, y_train)

# 90th percentile
model_90 = lgb.LGBMRegressor(**params, alpha=0.9)
model_90.fit(X_train, y_train)
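
Since the three models differ only in alpha, a loop is an equally valid way to train them; here is a sketch using a dictionary keyed by quantile level:

Python
# Train one model per quantile level
models = {}
for a in [0.1, 0.5, 0.9]:
    m = lgb.LGBMRegressor(**params, alpha=a)
    m.fit(X_train, y_train)
    models[a] = m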

Step 5: Making Predictions

  • We use the .predict() method to make predictions on the test set.
  • pred_10 holds the 10th percentile predictions.
  • pred_50 holds the 50th percentile (median) predictions.
  • pred_90 holds the 90th percentile predictions.
  • These predictions give us a range of possible prices for each house.
Python
pred_10 = model_10.predict(X_test)
pred_50 = model_50.predict(X_test)
pred_90 = model_90.predict(X_test)
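
As a quick sanity check (an extra step, not strictly required), we can measure how often the true test prices fall inside the 10th-90th band, which should be close to 80%, and score the median model with scikit-learn's mean_pinball_loss:

Python
import numpy as np
from sklearn.metrics import mean_pinball_loss

# Fraction of true values inside the [10th, 90th] interval; ideally ~0.80
coverage = np.mean((y_test >= pred_10) & (y_test <= pred_90))
print(f"Interval coverage: {coverage:.2f}")

# Pinball loss of the median model (lower is better)
print(f"Median pinball loss: {mean_pinball_loss(y_test, pred_50, alpha=0.5):.4f}")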

Step 6: Plotting the Prediction Interval

  • We import matplotlib for plotting and numpy for sorting.
  • np.argsort(pred_50) gives the indices that sort the median predictions, so the plot rises smoothly from left to right.
  • plt.plot() draws a line showing the median predicted prices.
  • plt.fill_between() fills the area between the 10th and 90th percentile predictions. This shows the range of possible prices.
  • We add labels, a title, a legend and show the plot.
Python
import matplotlib.pyplot as plt
import numpy as np

# Sort for plotting
sorted_idx = np.argsort(pred_50)

plt.figure(figsize=(8, 4))
plt.plot(pred_50[sorted_idx], label='Median Prediction', color='blue')
plt.fill_between(range(len(pred_10)),
                 pred_10[sorted_idx],
                 pred_90[sorted_idx],
                 color='skyblue', alpha=0.4,
                 label='Prediction Interval (10th - 90th)')
plt.legend()
plt.title("Prediction Interval using LightGBM Quantile Regression")
plt.xlabel("Test Sample")
plt.ylabel("Target Value")
plt.grid(True)
plt.tight_layout()
plt.show()

Output:

[Figure: Prediction Interval using LightGBM Quantile Regression]

This plot shows:

  • The blue line: median prediction
  • The shaded area: the range between the 10th and 90th percentiles

This gives us a range of possible values, not just a single prediction.

Advantages of Using Quantile Regression

  1. Handles Uncertainty: Real-world predictions are uncertain. Quantile regression gives a range instead of one number, helping you understand best-case and worst-case scenarios.
  2. Useful for Risk Management: If you’re forecasting sales or demand, predicting high percentiles helps avoid stockouts. Predicting low percentiles helps avoid overstocking.
  3. More Robust to Outliers: Quantile regression is less sensitive to extreme values. It doesn’t get pulled toward outliers like mean regression does.
  4. Helps in Decision Making: In finance or healthcare, having prediction intervals helps professionals make safer and better decisions.

Limitations of Quantile Regression

While quantile regression is useful, it has some limitations:

  1. Slower Training: Training multiple models (one for each quantile) takes more time.
  2. No Guarantee of Order: The 10th percentile prediction may sometimes end up higher than the 50th percentile prediction, a problem known as quantile crossing. You can fix this with simple post-processing (see the sketch after this list).
  3. Harder to Interpret: Some people find prediction intervals harder to understand than a single number.
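
A simple post-processing fix for the quantile-crossing issue in point 2 is to sort the stacked predictions per sample so that the quantiles are non-decreasing. A minimal sketch, assuming the three arrays from Step 5:

Python
import numpy as np

# Sort each column (one column per test sample) so that
# the 10th <= 50th <= 90th percentile always holds
stacked = np.sort(np.vstack([pred_10, pred_50, pred_90]), axis=0)
pred_10_fixed, pred_50_fixed, pred_90_fixed = stacked

After this step, pred_10_fixed ≤ pred_50_fixed ≤ pred_90_fixed holds for every test sample.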
