Handling Imbalanced Data and Metrics
Imbalanced Classification: Methodologies for Data Handling and Model Evaluation
Formally, a dataset is considered to have a class imbalance when the distribution of its target
classes is highly skewed.1 In the context of binary classification, this means one class, termed
the majority class, contains a significantly larger number of instances than the other, termed
the minority class.4 While a minor disparity, such as a 45-55 split, is relatively balanced, a
dataset with a 90-10 or a more extreme 99-1 split is considered imbalanced.3 The issue can
manifest in multiclass problems as well, though it is most frequently discussed in the binary
context.3
This imbalance is not an anomaly but a natural consequence of the phenomena being
modeled in numerous practical domains.5 The minority class often represents the event of
interest—a rare but critical occurrence. Prominent examples include:
● Fraud Detection: In financial datasets, fraudulent transactions constitute a minuscule
fraction of the total volume, often less than 0.1%.3
● Medical Diagnosis: The prevalence of certain diseases, particularly rare conditions or
specific types of cancer, is inherently low within the general population.3
● Anomaly Detection: In fields like cybersecurity (intrusion detection) or industrial
manufacturing (fault detection), anomalous events are, by definition, infrequent
deviations from normal operation.2
● Text Classification: Common tasks such as spam detection or targeted sentiment
analysis often encounter datasets where one category (e.g., legitimate emails, neutral
sentiment) vastly outnumbers others.2
● Customer Churn Prediction: The number of customers who churn is typically much
smaller than the number who remain loyal.10
The challenge posed by class imbalance is not solely a function of the skewed ratio. It is often
compounded by other intrinsic data characteristics that can further complicate the learning
process, such as the degree of separability between classes, the presence of noisy or
mislabeled data, and the absolute size of the dataset, particularly the number of available
minority class examples.11
The primary reason imbalanced datasets pose a significant challenge is that most
conventional machine learning algorithms are designed with an implicit assumption of a
balanced class distribution.3 The optimization objective for these algorithms is typically to
maximize overall accuracy or, equivalently, to minimize the total classification error across all
samples.1 This design choice has profound and detrimental consequences when applied to
imbalanced data.
When faced with a dataset where one class overwhelmingly dominates, a classifier can
achieve a very high accuracy score through a trivial and uninformative strategy: always
predicting the majority class.2 Because the majority class constitutes the bulk of the data, this
"lazy" approach minimizes the overall error rate, thus satisfying the algorithm's optimization
objective.7 The model learns that ignoring the minority class is the path of least resistance to
minimizing its loss function. The contribution of errors from the small number of minority
instances becomes statistically insignificant in the calculation of the total loss, leading the
model to develop a strong predictive bias towards the majority class.1
This algorithmic bias manifests in several critical failures:
● Poor Minority Class Performance: The model fails to learn the distinguishing
characteristics of the minority class, resulting in very low sensitivity (also known as recall)
for that class.3 It essentially becomes incapable of identifying the very instances it was
designed to detect.
● Poor Generalization: A model trained on an imbalanced dataset does not generalize well
to new, unseen data, especially for minority class predictions.3 It has not been exposed to
a sufficient number of minority examples to build a robust and accurate representation of
that class.
● Loss of Important Insights: By neglecting the minority class, the model fails to uncover
the patterns and relationships that are often of the greatest practical and economic
importance.1
Visually, this bias can be seen in the model's decision boundary. An algorithm trained on
imbalanced data will often produce a boundary that is shifted heavily towards the minority
class, or in extreme cases, it may classify the entire feature space as the majority class,
rendering the model completely useless for its intended purpose.9 This reveals a fundamental
conflict: the algorithm is succeeding in its own terms (minimizing error) while completely
failing in the terms of the application's objective (detecting the rare event). The problem is not
that the algorithm is broken; it is that its objective function is profoundly misaligned with the
practical goal in an imbalanced context. This misalignment dictates that any viable solution
must either transform the data to fit the algorithm's objective (data-level methods) or
transform the algorithm's objective to fit the problem's goal (algorithm-level methods).
The failure of standard algorithms on imbalanced data is directly linked to the failure of
standard evaluation metrics. The most common metric, classification accuracy, becomes
dangerously misleading in this context. This phenomenon is known as the accuracy paradox:
a model can achieve an exceptionally high accuracy score while being completely
non-informative and practically useless.18
Consider a credit card fraud detection dataset with a 99:1 class imbalance, where 99% of
transactions are legitimate (majority class) and 1% are fraudulent (minority class). A naive
classifier that simply predicts "legitimate" for every single transaction will achieve a
classification accuracy of 99%.20 A practitioner might see this score and conclude the model
is performing exceptionally well. However, this model has zero predictive power for the actual
target of interest; it has a 0% success rate in identifying fraudulent transactions.2
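To make the paradox concrete, the following minimal Python sketch (assuming scikit-learn; the 99:1 dataset is simulated and all names are illustrative) reproduces the naive majority-class predictor and shows how accuracy and recall diverge:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Simulated 99:1 dataset: 9,900 legitimate (0) and 100 fraudulent (1) transactions.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = np.array([0] * 9_900 + [1] * 100)

# A "classifier" that always predicts the majority class.
naive = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = naive.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")   # 99.00% -- looks excellent
print(f"Recall:   {recall_score(y, y_pred):.2%}")     # 0.00%  -- catches no fraud
```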
This paradox underscores a critical principle: accuracy is a measure of overall correctness,
and in an imbalanced dataset, the overall correctness is dominated by the model's
performance on the majority class. The metric is blind to the distribution of errors between
classes. Consequently, relying on accuracy for model evaluation in imbalanced scenarios is a
common but severe mistake. It fundamentally invalidates accuracy as a primary performance
indicator and mandates the adoption of a more sophisticated suite of evaluation metrics that
can provide a class-aware and context-sensitive assessment of a model's true predictive
capabilities.10
Data-level strategies are among the most common and intuitive approaches to tackling class
imbalance. The core philosophy of these methods is to modify the training dataset itself to
create a more balanced class distribution before feeding it to a machine learning algorithm.
By rebalancing the data, these techniques aim to mitigate the inherent bias of standard
classifiers, allowing them to learn the characteristics of the minority class more effectively.
This section provides a detailed examination of the three primary categories of data-level
methods: oversampling, undersampling, and hybrid approaches. The progression of these
techniques reveals a significant evolution in thought, moving from simple, quantity-based
rebalancing to more sophisticated, quality-based methods that intelligently sculpt the feature
space to improve class separability.
Oversampling methods address class imbalance by increasing the number of instances in the
minority class. The goal is to provide the learning algorithm with a stronger signal from the
underrepresented class, thereby reducing its bias towards the majority class.23
Random Oversampling (ROS): The simplest oversampling method, ROS randomly selects minority class instances and duplicates them until the desired class balance is reached.
Advantages: The primary advantages of ROS are its simplicity and ease of implementation. Furthermore, because it does not discard any data, it ensures that no information from the original dataset is lost during the resampling process.23
Disadvantages: Because ROS creates exact copies of existing instances, it adds no new information and can cause the classifier to overfit the duplicated minority examples.
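As a minimal sketch, assuming the imbalanced-learn library (the dataset below is synthetic and the exact class counts are illustrative), ROS can be applied in a few lines:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset; `weights` controls the approximate class ratio.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
print("Before:", Counter(y))                 # roughly 99:1

ros = RandomOverSampler(random_state=0)      # duplicates minority rows at random
X_res, y_res = ros.fit_resample(X, y)
print("After: ", Counter(y_res))             # classes now equally represented
```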
The Synthetic Minority Over-sampling Technique (SMOTE) and its variants were developed to
directly address the overfitting problem associated with ROS. Instead of duplicating existing
instances, SMOTE generates new, synthetic instances of the minority class, creating a more
diverse and robust representation.23
Core SMOTE Algorithm: The SMOTE algorithm operates in the feature space and can be
broken down into the following steps 24:
1. For each instance in the minority class, identify its k nearest neighbors that also belong
to the minority class. A typical value for k is 5.
2. Randomly select one of these k neighbors.
3. Calculate the difference vector between the original instance and the selected neighbor.
4. Multiply this difference by a random number between 0 and 1.
5. Add this new vector to the feature vector of the original instance. This creates a new,
synthetic data point that lies on the line segment connecting the original instance and its
chosen neighbor.
6. This process is repeated until the desired number of synthetic minority instances have
been generated.
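The interpolation described above can be written compactly in NumPy. The sketch below is a simplified illustration of the core sampling loop under the stated steps, not a reference implementation; the function name smote_sample and its defaults are assumptions, and a library implementation such as imbalanced-learn's SMOTE is preferable in practice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from the minority-class matrix X_min (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 because each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # step 1: pick a minority instance
        j = rng.choice(idx[i][1:])            # step 2: pick one of its k minority neighbors
        gap = rng.random()                    # step 4: random factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))  # steps 3 and 5: interpolate
    return np.vstack(synthetic)
```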
Advantages: By generating novel, yet plausible, minority instances, SMOTE provides a richer
and more diverse training set for the classifier. This significantly reduces the risk of overfitting
compared to ROS and has been shown to be a highly effective baseline oversampling strategy
in numerous applications.23
Disadvantages and Critiques: Despite its advantages, the original SMOTE algorithm has
several known limitations:
● Generation of Noise: SMOTE does not consider the proximity of majority class instances
when generating synthetic samples. This can lead to the creation of new minority
instances in regions that are heavily populated by the majority class, effectively creating
noise and increasing class overlap, which can make the decision boundary even more
difficult to learn.18
● Insensitivity to Data Distribution: The algorithm generates the same number of
synthetic samples for each original minority instance, regardless of its local
neighborhood. It does not differentiate between instances that are "safe" (deep within
the minority class region) and those that are on the noisy border with the majority
class.26
● Limitations with Data Types and Dimensionality: Standard SMOTE relies on Euclidean
distance to find nearest neighbors, making it unsuitable for datasets containing
categorical or discrete variables without modification. Variants like SMOTE-NC (for mixed
data) and SMOTE-N (for categorical data) were developed to address this.17 The
algorithm's effectiveness can also degrade in high-dimensional spaces due to the "curse
of dimensionality".18
● Potential for Bias: Some researchers argue that SMOTE can be misapplied or that its
core assumption is flawed. By interpolating between existing points, it may generate
"falsified instances" that do not accurately represent the true, unknown distribution of
the minority class, potentially introducing a new form of bias.11
Advanced SMOTE Variants: To overcome the limitations of the original algorithm, a family of
more sophisticated SMOTE variants has been developed. These methods move beyond simple
interpolation and incorporate more information about the local data distribution to generate
higher-quality synthetic samples.
● Borderline-SMOTE: This variant focuses its synthetic sample generation efforts on the
most critical region: the decision boundary. It first identifies minority instances that are
"on the border" (i.e., where the majority of their nearest neighbors belong to the majority
class). It then applies the SMOTE algorithm only to these borderline instances, reinforcing
the areas where the model is most likely to be confused.17
● ADASYN (Adaptive Synthetic Sampling): ADASYN takes a similar but distinct approach
by adaptively generating more synthetic data for minority instances that are "harder to
learn." The "hardness" of an instance is determined by the proportion of majority class
instances in its local neighborhood. By generating more samples in these difficult
regions, ADASYN forces the classifier to pay greater attention to the most challenging
parts of the feature space.17
● SVM-SMOTE: This method leverages a Support Vector Machine (SVM) algorithm to
approximate the decision boundary. It then generates synthetic samples along the
direction of the support vectors, which are the instances that define the boundary. This
approach aims to create a cleaner and more robust separation between the classes by
strengthening the margin.34
● KMeans-SMOTE: This technique first applies the K-Means clustering algorithm to the
entire dataset. It then applies SMOTE within each cluster, generating synthetic samples
based on the local cluster density. This can be particularly effective if the minority class is
composed of several distinct sub-concepts or small, dense clusters, as it helps to
generate more diverse and representative samples for each sub-group.34
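All of the variants above have implementations in the imbalanced-learn library. The following sketch compares them on a synthetic dataset; parameter values are illustrative assumptions, not recommended settings.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN, SVMSMOTE, KMeansSMOTE

X, y = make_classification(n_samples=5_000, n_informative=4, weights=[0.95, 0.05], random_state=0)

samplers = {
    "SMOTE": SMOTE(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(kind="borderline-1", random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "SVM-SMOTE": SVMSMOTE(random_state=0),
    # KMeans-SMOTE may need cluster_balance_threshold tuning on some datasets.
    "KMeans-SMOTE": KMeansSMOTE(cluster_balance_threshold=0.05, random_state=0),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name:17s} {Counter(y_res)}")
```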
Random Undersampling (RUS):
Mechanism: As the name suggests, Random Undersampling involves randomly selecting and discarding instances from the majority class until a desired class balance is achieved.9
Advantages: The primary benefit of RUS is its ability to significantly reduce the size of the
training dataset. This leads to faster model training and can alleviate storage and memory
constraints.18
Disadvantages: The main and most severe risk of RUS is information loss.3 By randomly
removing majority class instances, there is a high probability of discarding data points that are
crucial for defining the decision boundary or representing important variations within the
majority class. This can lead to a model that is biased in a new way and generalizes poorly to
unseen data.
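A minimal RUS sketch, again assuming imbalanced-learn (synthetic data, illustrative counts):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

rus = RandomUnderSampler(random_state=0)   # keep all minority rows, drop majority rows at random
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))    # e.g. {0: ~9900, 1: ~100} -> equal class counts
```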
To mitigate the risk of information loss associated with RUS, more intelligent undersampling
methods have been developed. These techniques do not aim for a specific balance ratio but
instead focus on removing specific types of majority instances that are considered noisy or
unhelpful, thereby "cleaning" the feature space.
Tomek Links:
● Mechanism: A Tomek Link is defined as a pair of instances, one from the minority class
and one from the majority class, that are each other's nearest neighbors in the feature
space.40 The presence of such a pair often indicates either noise or an ambiguous region
along the class boundary. The Tomek Links undersampling algorithm identifies these
pairs and removes the majority class instance from each link.40
● Goal: The objective is not to balance the dataset but to "clean" the space between the
classes, creating a clearer and more well-defined decision boundary for the classifier to
learn.40
● Limitations: While effective for data cleaning, Tomek Links can be computationally
intensive on large datasets due to the pairwise distance calculations. Furthermore, it
often removes a relatively small number of instances, making it insufficient on its own to
address severe class imbalance.42
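A hedged sketch of Tomek Link cleaning using imbalanced-learn's TomekLinks sampler (synthetic data; the sampling_strategy choice below is an assumption):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

tl = TomekLinks(sampling_strategy="majority")   # drop only the majority member of each Tomek link
X_clean, y_clean = tl.fit_resample(X, y)
print(f"{len(y) - len(y_clean)} majority instances removed")   # typically only a handful
```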
Hybrid methods combine oversampling of the minority class with a subsequent undersampling or data-cleaning step applied to the result.
Key Examples:
● SMOTE-Tomek: This popular hybrid method first uses SMOTE to increase the
representation of the minority class and then uses Tomek Links to remove the resulting
borderline pairs. The outcome is a dataset that is not only more balanced but also has a
cleaner separation between the classes.9
● SMOTE-ENN: This method follows the same principle but uses the more aggressive
Edited Nearest Neighbors algorithm for the cleaning phase. This can result in the removal
of more noisy samples compared to Tomek Links.48
Advantages: Hybrid methods often yield superior performance compared to using either
oversampling or undersampling in isolation. They simultaneously tackle the problem of data
scarcity for the minority class and the problem of class overlap and noise at the decision
boundary.48 Empirical studies have shown that hybrid approaches can be particularly effective
for datasets with extreme levels of imbalance.50
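Both hybrid methods are available in imbalanced-learn's combine module; the sketch below is illustrative (synthetic data, default parameters):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)   # SMOTE, then Tomek-link cleaning
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)     # SMOTE, then Edited Nearest Neighbours
print("SMOTE-Tomek:", Counter(y_st))
print("SMOTE-ENN:  ", Counter(y_se))
```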
The evolution of these data-level techniques illustrates a significant shift in the field's
understanding of the imbalance problem. The initial, naive solutions like ROS and RUS focused
purely on adjusting class counts—a quantity-based approach. The limitations of this
approach, namely overfitting and information loss, led to the development of SMOTE, which
improved the quality of the minority class representation by generating new data. However,
the recognition that SMOTE could introduce its own problems, such as noise and class
overlap, spurred the creation of more advanced SMOTE variants and hybrid methods. This
progression demonstrates a maturing perspective: the ultimate goal is not merely to achieve a
50:50 class ratio, but to intelligently sculpt the geometry of the feature space to create a
clean, well-defined, and easily learnable decision boundary. The choice of a data-level
technique, therefore, becomes a diagnostic decision. If the primary issue is a simple lack of
minority signal, SMOTE may be sufficient. If the data exhibits significant class overlap and a
noisy boundary, a more sophisticated approach like Borderline-SMOTE or a hybrid method
like SMOTE-Tomek is likely to be more effective.
In contrast to data-level methods that alter the dataset, algorithm-level strategies modify the
learning algorithms themselves to make them more robust to class imbalance. These
techniques accept the original, skewed data distribution and adapt the model's training
process to pay more attention to the minority class. This approach represents a different
philosophical solution to the imbalance problem: instead of changing the data to suit the
model, it changes the model to suit the data. This section explores the three main categories
of algorithm-level strategies: cost-sensitive learning, specialized ensemble methods, and
one-class classification.
Cost-sensitive learning addresses imbalance by assigning asymmetric misclassification costs, penalizing errors on the minority class more heavily than errors on the majority class, so that the model minimizes expected cost rather than raw error rate.
Cost Matrices: The asymmetric penalties are formally defined in a cost matrix. For a binary
classification problem, this matrix specifies the cost for each of the four outcomes in a
confusion matrix. Crucially, the cost associated with a False Negative, $C(FN)$, is set to be
significantly higher than the cost of a False Positive, $C(FP)$.13 The total cost to be minimized
is then a weighted sum of errors:
$$\text{Total Cost} = C(FN) \times \text{Number of FNs} + C(FP) \times \text{Number of FPs}$$
Practical Implementation via Class Weights: In practice, most modern machine learning
libraries do not require the user to define an explicit cost matrix. Instead, they implement
cost-sensitive learning through a class_weight parameter. By assigning a higher weight to the
minority class, the user instructs the algorithm to amplify the loss contribution of any errors
made on instances of that class during training.1
A common and effective heuristic for setting these weights is to make them inversely
proportional to the class frequencies in the training data.54 For a dataset with a 99:1 ratio of
majority to minority instances, the weight for the minority class could be set to 99 and the
weight for the majority class to 1. Many libraries, such as scikit-learn, provide a 'balanced'
option for the class_weight parameter, which automatically calculates and applies these
inverse-frequency weights.56
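The sketch below shows both routes in scikit-learn: the built-in 'balanced' option and an explicit cost-inspired weight dictionary. The data and the specific 99:1 weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced training data (~99:1).
X_train, y_train = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Option 1: let scikit-learn derive inverse-frequency weights automatically.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Option 2: inspect or hand-tune the weights explicitly.
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), weights)))       # the minority class receives the far larger weight

# Custom asymmetric costs, e.g. a false negative treated as 99x as costly as a false positive.
clf_custom = LogisticRegression(class_weight={0: 1, 1: 99}, max_iter=1000).fit(X_train, y_train)
```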
Ensemble methods, which build a strong predictive model by combining the outputs of
multiple weaker models, are naturally well-suited for imbalanced classification due to their
inherent robustness and variance reduction properties.1 Their performance can be further
enhanced through specific adaptations for imbalanced data.
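Two such adaptations mentioned later in this report, EasyEnsemble and balanced forests, are available in imbalanced-learn's ensemble module. The sketch below is an assumed tooling choice and uses illustrative hyperparameters.

```python
from sklearn.datasets import make_classification
from imblearn.ensemble import EasyEnsembleClassifier, BalancedRandomForestClassifier

X_train, y_train = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# EasyEnsemble: boosted learners trained on multiple balanced bootstrap subsets of the majority class.
easy = EasyEnsembleClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Balanced random forest: each tree is grown on a class-balanced bootstrap sample.
brf = BalancedRandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
```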
In cases of extreme class imbalance, or when the minority class instances are highly
heterogeneous and do not form a coherent group, it can be effective to reframe the problem
entirely. Instead of trying to distinguish between two classes, the task becomes one of
anomaly detection.1 This approach involves training a model exclusively on the majority class
data to learn a representation of "normal." Any new instance that deviates significantly from
this learned norm is then classified as an "anomaly" or "outlier," which corresponds to the
minority class.68
Key Algorithms:
● One-Class SVM: This is a variant of the Support Vector Machine algorithm that is trained
in an unsupervised manner on data from a single class (the majority or "normal" class). It
learns a decision boundary, typically a hypersphere or hyperplane, that encloses the
majority of the training data in the feature space. Any new data point that falls outside
this learned boundary is flagged as an outlier.1 A crucial hyperparameter is nu, which
serves as an upper bound on the fraction of training examples that can be considered
errors (i.e., fall outside the boundary) and a lower bound on the fraction of instances that
will be used as support vectors.68
● Isolation Forest: This is an ensemble-based algorithm built on the principle that
anomalies are "few and different" and are therefore easier to isolate than normal data
points. The algorithm constructs a forest of "isolation trees," where each tree recursively
and randomly partitions the data until individual instances are isolated. Anomalous
points, being different, are likely to be isolated in fewer partitions, resulting in a shorter
average path length from the root of the trees. This path length is used to calculate an
anomaly score for each instance.1
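Both algorithms are available in scikit-learn. The sketch below trains each on "normal" data only and scores unseen points; the placeholder arrays X_majority and X_new, and the nu/contamination values, are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_majority = rng.normal(size=(5_000, 4))       # "normal" (majority-class) instances only
X_new = rng.normal(loc=4.0, size=(5, 4))       # unseen points far from the normal region

ocsvm = OneClassSVM(kernel="rbf", nu=0.01).fit(X_majority)   # nu bounds the training outlier fraction
iso = IsolationForest(contamination=0.01, random_state=0).fit(X_majority)

# Both predict +1 for inliers ("normal") and -1 for outliers (candidate minority instances).
print(ocsvm.predict(X_new))
print(iso.predict(X_new))
```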
The successful handling of an imbalanced dataset is only half the battle; the other half is
accurately and honestly measuring the performance of the resulting classifier. As established
previously, traditional accuracy is a deeply flawed metric in this context. A robust evaluation
framework requires moving beyond accuracy and adopting a suite of metrics that provide a
nuanced, multi-faceted, and class-aware view of a model's performance. This section
deconstructs the essential evaluation tools, starting with the foundational confusion matrix
and extending to a range of threshold-based and rank-based metrics specifically suited for
imbalanced classification.
The confusion matrix is the cornerstone of classification model evaluation. It is a simple table
that provides a comprehensive summary of a model's predictive performance by
cross-tabulating the actual class labels against the predicted class labels.20 For a binary
classification problem, it is a 2x2 matrix that contains four essential counts 73:
● True Positives (TP): The number of instances from the positive (minority) class that
were correctly predicted as positive.
● True Negatives (TN): The number of instances from the negative (majority) class that
were correctly predicted as negative.
● False Positives (FP) (Type I Error): The number of instances from the negative
(majority) class that were incorrectly predicted as positive. These are often referred to as
"false alarms."
● False Negatives (FN) (Type II Error): The number of instances from the positive
(minority) class that were incorrectly predicted as negative. These are the critical
"missed detections."
The confusion matrix is indispensable because it moves beyond a single, aggregated score
and reveals the specific types of errors a model is making. It is the fundamental source from
which all other meaningful threshold-based evaluation metrics are derived.11
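In scikit-learn the four counts can be extracted in one line; the toy labels below are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny illustrative labels: 1 = positive (minority) class, 0 = negative (majority) class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")        # TP=2  TN=6  FP=1  FN=1
```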
Threshold-based metrics are calculated from the counts in the confusion matrix and provide
single-value scores that summarize different aspects of a model's performance at a given
classification threshold (typically 0.5 for probabilistic models).
These metrics focus on the performance of the model with respect to a single class, which is
crucial for understanding how well the model is handling the minority class.
● Precision (Positive Predictive Value - PPV): Precision measures the reliability of the
positive predictions. It answers the question: "Of all the instances that the model
predicted as positive, what proportion were actually positive?".16 High precision is critical
in scenarios where the cost of a False Positive is high. For example, in a targeted
marketing campaign, high precision ensures that resources are not wasted on customers
who are not actually interested.
○ Formula: $Precision = \frac{TP}{TP + FP}$.1
● Recall (Sensitivity or True Positive Rate - TPR): Recall measures the model's ability to
find all the positive instances in the dataset. It answers the question: "Of all the actual
positive instances, what proportion did the model successfully identify?".16 High recall is
paramount in domains where the cost of a False Negative is severe. For instance, in
medical diagnosis or fraud detection, failing to identify a disease or a fraudulent
transaction can have catastrophic consequences.
○ Formula: $Recall = \frac{TP}{TP + FN}$.73
● Specificity (True Negative Rate - TNR): Specificity is the equivalent of recall for the
negative class. It measures the model's ability to correctly identify all the negative
instances. It answers the question: "Of all the actual negative instances, what proportion
did the model successfully identify?".78
○ Formula: $Specificity = \frac{TN}{TN + FP}$.75
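These single-class metrics can be computed directly with scikit-learn; the sketch below reuses the toy y_true and y_pred from the confusion-matrix example above.

```python
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred)                # TP / (TP + FP)  -> 2/3
recall = recall_score(y_true, y_pred)                      # TP / (TP + FN)  -> 2/3
# scikit-learn has no dedicated specificity function; it is simply recall of the negative class:
specificity = recall_score(y_true, y_pred, pos_label=0)    # TN / (TN + FP)  -> 6/7
print(precision, recall, specificity)
```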
In practice, there is often a trade-off between precision and recall. A model can achieve
perfect recall by classifying every instance as positive, but this would result in very low
precision. Conversely, a model can achieve high precision by being very conservative with its
positive predictions, but this would likely lower its recall. Balanced metrics aim to combine
these individual measures into a single score that provides a more holistic assessment of
performance.
● F1-Score: The F1-score is the harmonic mean of Precision and Recall. It provides a
balanced measure that is high only when both precision and recall are high. Because it is
a harmonic mean, it penalizes extreme values more than a simple arithmetic mean would.
It is one of the most widely used metrics for imbalanced classification because it is
sensitive to the model's performance on the minority class and is not inflated by a large
number of true negatives.17
○ Formula: $F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$.78
○ The more general F-beta Score allows for explicit weighting of recall over precision.
A $\beta$ value greater than 1 gives more weight to recall, while a $\beta$ value less
than 1 gives more weight to precision.82
● Geometric Mean (G-Mean): The G-Mean is the geometric mean of Sensitivity (Recall)
and Specificity. It measures the balance between the classification performance on both
the minority and majority classes. A low G-Mean score indicates poor performance in
identifying the minority class, even if the model achieves high specificity by correctly
classifying the majority class.78 It is a good indicator of a model's ability to perform well
on both classes simultaneously.
○ Formula: $\text{G-Mean} = \sqrt{Sensitivity \times Specificity}$.80
● Matthews Correlation Coefficient (MCC): The MCC is considered by many researchers
to be one of the most informative and robust single-score metrics for binary
classification, especially in imbalanced scenarios.86 It is a correlation coefficient between
the observed and predicted classifications and takes into account all four values in the
confusion matrix (TP, TN, FP, and FN). A key advantage of MCC is that it produces a high
score only if the classifier obtains good results in all four categories. Its value ranges from
-1 (perfect misclassification) to +1 (perfect classification), with 0 indicating a
performance no better than random guessing. Unlike the F1-score, it is inherently
symmetric, meaning its value does not change if the positive and negative classes are
swapped.17
○ Formula: $MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
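The balanced metrics above are available in scikit-learn and imbalanced-learn; the sketch again reuses the toy y_true and y_pred defined earlier.

```python
from sklearn.metrics import f1_score, fbeta_score, matthews_corrcoef
from imblearn.metrics import geometric_mean_score

f1 = f1_score(y_true, y_pred)                    # harmonic mean of precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)         # beta > 1 weights recall more heavily than precision
gmean = geometric_mean_score(y_true, y_pred)     # sqrt(sensitivity * specificity)
mcc = matthews_corrcoef(y_true, y_pred)          # uses all four cells of the confusion matrix
print(f1, f2, gmean, mcc)
```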
Receiver Operating Characteristic (ROC) Curve and AUC-ROC
● Construction: The ROC curve is a graphical plot that illustrates the diagnostic ability of a
binary classifier as its discrimination threshold is varied. It plots the True Positive Rate
(TPR), or Recall, on the y-axis against the False Positive Rate (FPR) on the x-axis,
where $FPR = \frac{FP}{FP + TN}$.16
● Interpretation: The area under this curve, known as AUC-ROC, provides a
single-number summary of the model's performance. An AUC of 1.0 represents a perfect
classifier that can perfectly distinguish between the classes, while an AUC of 0.5
represents a model with no discriminative ability, equivalent to random guessing.77
● Limitation in Imbalanced Contexts: Despite its popularity, the AUC-ROC can be overly
optimistic and misleading on datasets with a severe class imbalance.17 The reason is
that the FPR calculation in the denominator includes the number of True Negatives
($TN$). In a highly imbalanced dataset, $TN$ is a very large number. This means the
model can generate a substantial number of false positives without making a significant
impact on the overall FPR, leading to a deceptively high AUC score that masks poor
performance on the minority class.
Precision-Recall (PR) Curve and AUC-PR
● Construction: The Precision-Recall (PR) curve plots Precision on the y-axis against
Recall on the x-axis at various threshold settings.16
● Interpretation: A skillful model is one that can maintain a high precision even as recall
increases. The baseline for a PR curve is not a diagonal line but a horizontal line
corresponding to the fraction of positive examples in the dataset (i.e., the prevalence of
the minority class).77 The area under this curve, AUC-PR (also known as Average
Precision), summarizes the plot.
● Superiority in Imbalanced Contexts: The PR curve is widely regarded as a more
informative and appropriate evaluation tool than the ROC curve for imbalanced
classification tasks.17 The key reason is that the calculation of Precision ($TP / (TP + FP)$)
does not involve the number of True Negatives. Therefore, the PR curve is not influenced
by the large number of correctly classified majority class instances. It focuses directly on
the performance of the model on the positive (minority) class, evaluating the trade-off
between correctly identifying positive instances and the rate of false alarms, which is
often the central business problem.
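Both area summaries are one-liners in scikit-learn. In the sketch below, y_test and y_score (the positive-class probabilities from any fitted probabilistic classifier, e.g. clf.predict_proba(X_test)[:, 1]) are placeholders.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# y_test: true labels; y_score: predicted probabilities for the positive (minority) class.
auc_roc = roc_auc_score(y_test, y_score)             # can appear optimistic under severe imbalance
auc_pr = average_precision_score(y_test, y_score)    # baseline equals the minority-class prevalence
print(f"AUC-ROC={auc_roc:.3f}  AUC-PR={auc_pr:.3f}")
```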
The selection of an evaluation metric is not a mere technicality performed after a model is
built; it is a strategic decision that must be made at the outset of a project, as it defines the
very criteria for success. This choice should be driven by the specific business context and
the relative costs of different types of prediction errors. For example, in a medical screening
application, the consequence of a False Negative (missing a sick patient) is far more severe
than that of a False Positive (subjecting a healthy patient to further tests). This context
dictates that Recall is the most critical metric, and the model should be optimized to
maximize it, even if it comes at the expense of lower Precision.60 Conversely, in a system that
automatically flags emails for deletion as spam, a False Positive (deleting an important,
non-spam email) is highly undesirable, making Precision the more important metric.76
This leads to a distinction between the tactical and strategic value of different metric types.
Threshold-based metrics like the F1-score or MCC are excellent for tactical evaluation,
assessing a model's performance at a single, fixed operational decision point. However,
rank-based metrics, particularly AUC-PR, offer a more strategic assessment. A model with a
high AUC-PR is inherently more valuable and flexible because it demonstrates a strong ability
to separate classes across a wide range of trade-offs. This provides the business with a
spectrum of good operating points, allowing them to adjust the decision threshold in the
future to meet changing needs (e.g., becoming more aggressive in fraud detection during a
high-risk period) without needing to retrain the entire model. Therefore, AUC-PR is invaluable
for strategic model comparison and selection, while threshold-based metrics are essential for
evaluating performance at the point of deployment.
The choice of a technique to handle class imbalance is a critical decision with significant
trade-offs. The three major paradigms—Data-Level, Algorithm-Level, and Hybrid—offer
different philosophies for solving the problem.
● Data-Level Methods (Resampling): These techniques, such as SMOTE and RUS, focus
on transforming the training data to create a more balanced distribution. Their primary
advantage is that they are model-agnostic; a balanced dataset can be used to train any
standard classifier. However, they carry inherent risks. Oversampling methods like SMOTE
can introduce noise and potentially create artificial instances that do not reflect the true
data distribution, leading to overfitting. Undersampling methods like RUS risk significant
information loss by discarding potentially valuable majority class examples. The
computational cost of these methods, especially sophisticated oversampling techniques,
can also be considerable.
● Algorithm-Level Methods (Cost-Sensitive Learning & Specialized Algorithms):
These techniques modify the learning algorithm's objective function or internal
mechanics to make it inherently sensitive to the minority class. Cost-sensitive learning,
implemented via class weights, is a powerful and often computationally efficient
approach that forces the model to penalize errors on the minority class more heavily.
Specialized algorithms like One-Class SVMs reframe the problem entirely. The main
advantage of this approach is that it works with the original, true data distribution,
avoiding the potential artifacts of resampling. However, these methods require the
chosen algorithm to support such modifications (e.g., have a class_weight parameter)
and may be less interpretable than training a standard model on balanced data.
● Hybrid and Ensemble Methods: These approaches represent the most advanced and
often most effective strategies. Hybrid sampling methods (e.g., SMOTE-Tomek) combine
the benefits of oversampling and undersampling to both increase the minority signal and
clean the class boundary. Specialized ensemble methods (e.g., EasyEnsemble) overcome
the limitations of simple undersampling by training multiple classifiers on different
subsets of the majority class, thereby reducing information loss and improving
robustness. These methods are often more computationally expensive but tend to yield
superior performance by addressing multiple facets of the imbalance problem
simultaneously.
There is no single "best" technique for all imbalanced classification problems. The optimal
choice depends on the specific characteristics of the dataset, the computational constraints,
and the project's goals. The following workflow provides a structured approach to selecting
and applying these techniques effectively.
Effective evaluation is paramount for building trust in a model and making informed decisions.
The following best practices should be standard procedure for any imbalanced classification
project.
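One practice that ties the data-level and evaluation threads of this report together is to apply any resampling only inside the training folds of a stratified cross-validation, so that validation folds keep the true, imbalanced class distribution. A minimal sketch, assuming imbalanced-learn's pipeline and an appropriately chosen metric such as average precision (all names and parameters below are illustrative):

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE lives inside the pipeline, so it is fitted on each training fold only
# and the validation folds retain their original, imbalanced distribution.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="average_precision", cv=cv)  # X, y: the full dataset
print(scores.mean())
```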
Works cited
1. Computational Strategies for Handling Imbalanced Data in Machine Learning - ISI, accessed October 21, 2025, [Link]
2. Everything You Need to Know When Assessing Imbalance Class Problem Skills - Alooba, accessed October 21, 2025, [Link]
3. What is Imbalanced Dataset - GeeksforGeeks, accessed October 21, 2025, [Link]
4. Class Imbalance Definition - Encord, accessed October 21, 2025, [Link]
5. Class-imbalanced datasets | Machine Learning - Google for Developers, accessed October 21, 2025, [Link]
6. Handling Imbalanced Data in Classification | Keylabs, accessed October 21, 2025, [Link]
7. Tackling the Challenge of Imbalanced Datasets: A Comprehensive Guide - Medium, accessed October 21, 2025, [Link]
8. [Link], accessed October 21, 2025, [Link]
9. Class Imbalance Strategies — A Visual Guide with Code | by Travis ..., accessed October 21, 2025, [Link]
10. How to Handle Imbalanced Data? - Analytics Vidhya, accessed October 21, 2025, [Link]
11. Class Imbalance in Machine Learning - Train in Data's Blog, accessed October 21, 2025, [Link]
12. Classification of Imbalanced Datasets using One-Class SVM, k-Nearest Neighbors and CART Algorithm - The Science and Information (SAI) Organization, accessed October 21, 2025, [Link]
13. Cost-Sensitive Learning Methods for Imbalanced Data, accessed October 21, 2025, [Link]
14. A review of machine learning methods for imbalanced data challenges in chemistry - PMC, accessed October 21, 2025, [Link]
15. How to Handle Imbalanced Data for Machine Learning in Python - Semaphore CI, accessed October 21, 2025, [Link]
16. Practical ML: Addressing Class Imbalance | by Juan C Olamendy - Medium, accessed October 21, 2025, [Link]
17. Imbalanced Dataset: Strategies to Fix Skewed Class Distributions - Label Your Data, accessed October 21, 2025, [Link]
18. How to Deal With Imbalanced Classification and Regression Data - [Link], accessed October 21, 2025, [Link]
19. 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset - [Link], accessed October 21, 2025, [Link]
20. Failure of Classification Accuracy for Imbalanced Class Distributions - [Link], accessed October 21, 2025, [Link]
21. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression - PubMed Central, accessed October 21, 2025, [Link]
22. How To Handle Imbalanced Data in Classification - phData, accessed October 21, 2025, [Link]
23. Exploring Oversampling Techniques for Imbalanced Datasets - Train in Data's Blog, accessed October 21, 2025, [Link]
24. How to Handle Imbalanced Classes in Machine Learning - GeeksforGeeks, accessed October 21, 2025, [Link]
25. Sampling for imbalance data in Python | by Mabrouka Salmi - Medium, accessed October 21, 2025, [Link]
26. Challenges and limitations of synthetic minority oversampling techniques in machine learning - PMC, accessed October 21, 2025, [Link]
27. Overcoming Class Imbalance with SMOTE: How to Tackle Imbalanced Datasets in Machine Learning - Train in Data's Blog, accessed October 21, 2025, [Link]
28. ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python, accessed October 21, 2025, [Link]
29. SMOTE for Imbalanced Classification with Python - [Link], accessed October 21, 2025, [Link]
30. SMOTE oversampling for better machine learning classification - Domino Data Lab, accessed October 21, 2025, [Link]
31. Handling Imbalanced Data: 7 Innovative ... - Data Science Dojo, accessed October 21, 2025, [Link]
32. Selective oversampling approach for strongly imbalanced data - PMC, accessed October 21, 2025, [Link]
33. How does SMOTE work for dataset with only categorical variables?, accessed October 21, 2025, [Link]
34. Mastering Imbalanced Data: Comprehensive Techniques for ..., accessed October 21, 2025, [Link]
35. Comparison of OverSampling Methods (ImbalancedData) - Kaggle, accessed October 21, 2025, [Link]
36. COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM, accessed October 21, 2025, [Link]
37. A Comparative Study of SMOTE, Borderline-SMOTE, and ADASYN Oversampling Techniques using Different Classifiers | Request PDF - ResearchGate, accessed October 21, 2025, [Link]
38. Compare over-sampling samplers — Version 0.14.0 - Imbalanced Learn, accessed October 21, 2025, [Link]
39. A Comparative Study of Sampling Methods with Cross-Validation in the FedHome Framework - arXiv, accessed October 21, 2025, [Link]
40. The Role of Undersampling in Tackling Imbalanced Datasets in Machine Learning, accessed October 21, 2025, [Link]
41. 3. Under-sampling — Version 0.14.0 - Imbalanced-learn, accessed October 21, 2025, [Link]
42. Undersampling Techniques for Handling Unbalanced Datasets | CodeSignal Learn, accessed October 21, 2025, [Link]
43. Tomek Links: An Undersampling Approach | by Simardeep Kaur - Medium, accessed October 21, 2025, [Link]
44. 3. Under-sampling — Version 0.15.dev0 - Imbalanced-learn, accessed October 21, 2025, [Link]
45. Tomek links Algorithm – Undersampling to handle Imbalanced data in machine learning by Mahesh Huddar - YouTube, accessed October 21, 2025, [Link]
46. Mitigating the Effects of Class Imbalance Using SMOTE and Tomek Link Undersampling in SAS, accessed October 21, 2025, [Link]
47. (PDF) A Hybrid Sampling SVM Approach to Imbalanced Data Classification - ResearchGate, accessed October 21, 2025, [Link]
48. Hybrid and Ensemble Methods: Advanced Approaches to Address ..., accessed October 21, 2025, [Link]
49. (PDF) Comparative Analysis of Data Balancing Techniques for Machine Learning Classification on Imbalanced Student Perception Datasets - ResearchGate, accessed October 21, 2025, [Link]
50. A Comparison of Undersampling, Oversampling, and SMOTE ..., accessed October 21, 2025, [Link]
51. Cost-Sensitive Learning for Imbalanced Classification ..., accessed October 21, 2025, [Link]
52. Cost-Sensitive Learning (CSL) - Machine Learning with Imbalanced Data - YouTube, accessed October 21, 2025, [Link]
53. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics, accessed October 21, 2025, [Link]
54. How to implement cost-sensitive learning in decision trees ..., accessed October 21, 2025, [Link]
55. Impact of imbalanced features on large datasets - PMC, accessed October 21, 2025, [Link]
56. Imbalanced Classification: Cost Sensitive Algrthms - Kaggle, accessed October 21, 2025, [Link]
57. Cost-Sensitive Logistic Regression for Imbalanced Classification ..., accessed October 21, 2025, [Link]
58. An Optimized Cost-Sensitive SVM for Imbalanced Data Learning - Department of Computing Science, accessed October 21, 2025, [Link]
59. IMBENS: Ensemble Class-imbalanced Learning in Python - Zhining Liu, accessed October 21, 2025, [Link]
60. Tips for Handling Imbalanced Data in Machine Learning ..., accessed October 21, 2025, [Link]
61. Ensemble of Rotation Trees for Imbalanced Medical Datasets - PMC - PubMed Central, accessed October 21, 2025, [Link]
62. EasyEnsemble and Feature Selection for Imbalance Data Sets - ResearchGate, accessed October 21, 2025, [Link]
63. Exploratory Undersampling for Class-Imbalance Learning, accessed October 21, 2025, [Link]
64. (PDF) EASY ENSEMMBLE WITH RANDOM FOREST TO HANDLE IMBALANCED DATA IN CLASSIFICATION - ResearchGate, accessed October 21, 2025, [Link]
65. Trainable Undersampling for Class-Imbalance Learning - AAAI Publications, accessed October 21, 2025, [Link]
66. Survey of Imbalanced Data Methodologies - arXiv, accessed October 21, 2025, [Link]
67. Handling Imbalanced Data in Machine Learning: Data-level, Model-level Strategies, and Evaluation Metrics | by Dgholamian | Medium, accessed October 21, 2025, [Link]
68. One-Class Classification Algorithms for Imbalanced Datasets ..., accessed October 21, 2025, [Link]
69. One Class SVM - Louise E. Sinks, accessed October 21, 2025, [Link]
70. Understanding One-Class Support Vector Machines - GeeksforGeeks, accessed October 21, 2025, [Link]
71. What is a confusion matrix? - IBM, accessed October 21, 2025, [Link]
72. Confusion matrix - Wikipedia, accessed October 21, 2025, [Link]
73. Understanding the Confusion Matrix in Machine Learning - GeeksforGeeks, accessed October 21, 2025, [Link]
74. How to interpret a confusion matrix for a machine learning model, accessed October 21, 2025, [Link]
75. Precision and recall - Wikipedia, accessed October 21, 2025, [Link]
76. Performance Metrics: Confusion matrix, Precision, Recall, and F1 Score, accessed October 21, 2025, [Link]
77. ROC Curves and Precision-Recall Curves for Imbalanced ..., accessed October 21, 2025, [Link]
78. Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data - SAS Support, accessed October 21, 2025, [Link]
79. 7. Metrics — Version 0.14.0 - Imbalanced-learn, accessed October 21, 2025, [Link]
80. Tour of Evaluation Metrics for Imbalanced Classification ..., accessed October 21, 2025, [Link]
81. F-score - Wikipedia, accessed October 21, 2025, [Link]
82. what are the appropriate evaluation metrics used for handle imbalanced data? - Kaggle, accessed October 21, 2025, [Link]
83. Advanced Evaluation Metrics for Imbalanced Classification Models | by Rajneesh Tiwari | CueNex | Medium, accessed October 21, 2025, [Link]
84. Best techniques and metrics for Imbalanced Dataset - Kaggle, accessed October 21, 2025, [Link]
85. Which metric should we use to evaluate highly imbalanced classification model performance? | ResearchGate, accessed October 21, 2025, [Link]
86. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, accessed October 21, 2025, [Link]
87. Limitations in Evaluating Machine Learning Models for Imbalanced Binary Outcome Classification in Spine Surgery: A Systematic Review - PMC - PubMed Central, accessed October 21, 2025, [Link]
88. Low G-mean and MCC for binary classification of imbalanced data - Stack Overflow, accessed October 21, 2025, [Link]
89. ROC and precision-recall with imbalanced datasets, accessed October 21, 2025, [Link]
90. [Discussion] Metric to evaluate imbalance data. : r/MachineLearning - Reddit, accessed October 21, 2025, [Link]
91. [D] How to handle highly imbalanced dataset? : r/MachineLearning - Reddit, accessed October 21, 2025, [Link]
92. 7 Techniques to Handle Imbalanced Data - KDnuggets, accessed October 21, 2025, [Link]
93. How to Handle Unbalanced Classes: 5 Strategies - Roboflow Blog, accessed October 21, 2025, [Link]