
An Instance Level Analysis of Classification Difficulty for Unlabeled Data

Patricia S. M. Ueda1(B), Adriano Rivolli2, and Ana Carolina Lorena1

1 Instituto Tecnológico de Aeronáutica, São José dos Campos, Brazil
[email protected], [email protected]
2 Universidade Tecnológica Federal do Paraná, Cornélio Procópio, Brazil
[email protected]

Abstract. Instance hardness measures allow us to assess and understand why some observations in a dataset are difficult to classify. With this information, one may curate and cleanse the training dataset for improved data quality. However, these measures require the data to be labeled, which limits their usage in the deployment stage, when data is unlabeled. This paper investigates whether it is possible to identify observations that will be hard to classify regardless of their label. To this end, two approaches are tested. The first adapts known instance hardness measures to the unlabeled scenario. The second learns regression meta-models to estimate the instance hardness of new data observations. In experiments, both approaches were effective at identifying instances lying in borderline regions of the dataset, which pose a greater difficulty when the label is unknown.

Keywords: Machine Learning · Instance hardness measures · Unlabeled data · Deployment of models

1 Introduction

The Machine Learning (ML) literature extensively covers algorithmic developments focused on model hyperparameter tuning and related model-centric tasks. More recently, the data-centric Artificial Intelligence (AI) community has shifted the focus toward understanding the data and improving its quality, rather than developing ever more complex ML models [13].
Paving the way for such a data-centric approach is a more fine-grained analysis of the data and of classification performance. Aggregated measures applied to classification problems, such as accuracy, precision, and similar metrics, limit the understanding of the particularities of the data the algorithms are modeling. These aggregated metrics do not reveal which instances are misclassified or why. However, a more reliable usage of ML algorithms must reveal for which particular instances
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2025
A. Paes and F. A. N. Verri (Eds.): BRACIS 2024, LNAI 15412, pp. 141–155, 2025.
https://doi.org/10.1007/978-3-031-79029-4_10

a model struggles to classify correctly and why. One way to achieve such understanding is to correlate data characteristics extracted by a set of meta-features [12] with the predictive performance of multiple algorithms, in a meta-learning (MtL) approach [2].
One particular set of meta-features is the set of data complexity measures, originally proposed by Ho and Basu [5] to quantify the overall complexity of solving a classification problem given the dataset available for learning, providing a global perspective of the problem's difficulty [1]. Since these measures can fail to provide information at the instance level [8], Instance Hardness Measures (IHMs) were introduced by Smith et al. [14] to characterize the difficulty level of each individual instance of a dataset, revealing which particular instances are misclassified and why. These developments respond to the growing interest in responsible AI, which has made researchers focus on the reliability and trustworthiness of the predictions obtained by ML models.
Nonetheless, current IHMs require the instance label to be computed, which restricts their use to the analysis and curation of ML training datasets. For ML in production, where the class of an instance is unknown, adaptations are needed. This paper proposes alternative instance hardness measures for instances that do not have a label. The idea is to leverage knowledge about the hardness of the (labeled) training dataset to assess the hardness level of new unlabeled instances. This knowledge can later support reject-option strategies, whereby the ML model may abstain from predictions that are likely to be uncertain [4].
First, a set of IHMs is adapted to disregard the labels of the new instances in their computation. The second strategy generates regression meta-models to estimate the IHMs of new unlabeled observations in a meta-learning approach at the instance level. Both approaches are compared experimentally using one synthetic dataset and four datasets from the health domain, known for containing hard instances. Instances whose characteristics place them in overlapping or borderline regions of the classes are highlighted as hard to classify by both approaches. The adapted measures show a higher correlation with the original instance hardness values and prove to be an adequate alternative for estimating instance hardness in the deployment stage, driving the solutions to a more refined level and contributing toward a more trustworthy use of ML models.
The paper is organized as follows: Sect. 2 details the hardness measures to
apply to unlabeled data and how they were modified from the original measures.
Section 3 presents the materials and methods used in experiments, whose results
are presented in Sect. 4. Finally, Sect. 5 presents the conclusions of this work.

2 Instance Hardness Measures

The concept of instance hardness was introduced in the seminal work of Smith et al. [14] as an alternative for a fine-grained analysis of classification difficulty.

They define an instance as hard to classify if it is consistently misclassified by a set of classifiers with different biases. They also define a set of measures to explain possible reasons why an instance is difficult to classify, which are regarded as instance-level meta-features in the literature [7].
The base IHMs adopted in this work are presented next, along with their adaptations, which are indicated by the "adj" (adjusted) suffix. In their definitions, let D be a training dataset with n labeled instance pairs (x_i, y_i), where each x_i ∈ X is described by m input features and y_i ∈ Y is the class of the instance in the dataset. The number of classes is denoted by C. Let x be a new instance whose label is unknown.
To illustrate the concepts, consider the dataset in Fig. 1, containing two classes, red and blue. Two instances are highlighted: x1 and x2. The instance x1 lies in a borderline area between the classes and might be difficult to classify regardless of its class. The instance x2 is more aligned with the blue class: if the label registered for it in the dataset is blue, it will be easily classified; otherwise, it will have a hardness level higher than that of x1. Standard IHMs need to know these labels, so both x1 and x2 must be contained in the labeled dataset D. This work introduces adaptations to estimate the hardness level of an instance in the absence of its label, meaning x1 and x2 are not in the labeled dataset D used to estimate the hardness levels. Note that the two scenarios lead to different estimates: based on its characteristics, x2 will probably be easily classified as blue, whereas x1 will probably be considered hard to classify in both scenarios.

Fig. 1. Example of a dataset with highlighted instances: x1 is in a borderline region and can be difficult to classify regardless of its class; x2 might be easy or hard to classify depending on its registered label. (Color figure online)

2.1 Neighborhood-Based IHM

The hardness level of an instance can be obtained from its neighbourhood in the dataset. In the original IHMs, instances surrounded by elements sharing their own label can be considered easier to classify. For new data without labels, our approach finds the neighbourhood of the instance in the labeled dataset D and assigns a higher hardness level when there is a mix of different classes in this region.

k-Disagreeing Neighbors (kDN): the original kDN measure computes the percentage of the k nearest neighbors of x_i in the dataset D that have a label different from that of the instance:

    kDN(x_i, y_i) = |{x_j | x_j ∈ kNN(x_i) ∧ y_j ≠ y_i}| / k,    (1)

where kNN(x_i) represents the set of k-nearest neighbors of the instance x_i in the dataset D. An instance is considered harder to classify when the value of kDN(x_i, y_i) is higher. Values close to 1 represent an instance surrounded by examples from a class different from its own; this would be x2's case in Fig. 1 when labeled red in D. Intermediate values of kDN(x_i, y_i) are found for borderline instances. Easier instances are those surrounded by elements sharing their class label, which would correspond to x2 when it has a blue label.
In the absence of an instance's label, an alternative way to measure the mixture of classes in its neighbourhood is to compute an entropy measure, based on the proportions of the classes found among the instance's neighbours. Higher entropy values indicate that the new instance lies in a region of D near elements from different classes; this corresponds to the x1 case in Fig. 1. In contrast, x2 will be regarded as easy to predict, as it is surrounded by elements of the blue class:

    kDNadj(x) = − Σ_{i=1}^{C} p(y_j = c_i) log p(y_j = c_i), for x_j ∈ kNN(x),    (2)

where p(y_j = c_i) are the proportions of the classes among the k-nearest neighbours of x in the dataset D.
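As a concrete sketch, the adjusted kDN of Eq. 2 takes only a few lines of NumPy. The function name kdn_adj and the Euclidean metric are our assumptions for illustration; the paper's own computations rely on the PyHard package.

```python
import numpy as np

def kdn_adj(x, X_train, y_train, k=10):
    """Adjusted kDN (Eq. 2): entropy of the class proportions among the
    k nearest labeled neighbours of the unlabeled instance x.
    Hypothetical helper; Euclidean distance is assumed."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to labeled data
    neighbours = y_train[np.argsort(dists)[:k]]   # labels of the k nearest
    _, counts = np.unique(neighbours, return_counts=True)
    p = counts / k                                # class proportions
    return float(-(p * np.log(p)).sum())          # 0 => pure neighbourhood
```

An instance deep inside one class gets entropy 0, while a borderline instance with a 50/50 neighbourhood gets log 2 ≈ 0.69 in a two-class problem.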
Ratio of the Intra-class and Extra-class Distances (N2_IHM): the original measure takes the complement of the normalized ratio between the distance of x_i to the nearest example from its own class in D and the distance to the nearest instance from a different class (its nearest enemy) in D:

    N2_IHM(x_i, y_i) = 1 − 1 / (IntraInter(x_i) + 1),    (3)

where:

    IntraInter(x_i, y_i) = d(x_i, NN(x_i ∈ y_i)) / d(x_i, NE(x_i)),    (4)

where d is a distance function, NN(x_i ∈ y_i) is the nearest neighbor of x_i from its own class, and NE(x_i) is the nearest enemy of x_i (NE(x_i) = NN(x_i ∈ y_j ≠ y_i)). In this formulation, when an instance is closer to an example from another class than to one from its own class, the N2_IHM value will be larger, indicating that the instance is harder to classify. This would correspond to the case where x2 in Fig. 1 has the red label.
The alternative measure for unlabeled instances is obtained by taking the ratio between the minimum distance from x to the closest element in D, denoted as x_j in Eq. 5, and the distance from x to the closest element of another class in D, that is, a class different from that of x_j. This ratio assumes values close to 1 when the instance is almost equally distant from different classes, which is more likely for borderline instances, such as x1 in Fig. 1:

    N2adj(x) = min(d(x, x_j)) / min(d(x, x_k) | y_k ≠ y_j).    (5)
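Equation 5 can be sketched analogously. Again, the helper name and the Euclidean metric are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def n2_adj(x, X_train, y_train):
    """Adjusted N2 (Eq. 5): distance from x to its nearest labeled
    neighbour over the distance to the nearest instance of any other
    class. Hypothetical sketch; Euclidean distance is assumed."""
    dists = np.linalg.norm(X_train - x, axis=1)
    j = np.argmin(dists)                         # closest labeled instance x_j
    enemy = dists[y_train != y_train[j]].min()   # closest other-class instance
    return float(dists[j] / enemy)               # values near 1 => borderline
```

A point deep inside one class yields a ratio near 0; a point midway between two classes yields a ratio approaching 1.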

2.2 Class Likelihood IHM

This type of measure captures whether the instance is well situated in its class, considering the general patterns of that class. The class likelihood can be estimated assuming the input features are independent, which simplifies the computation.

Class Likelihood Difference (CLD): the original measure takes the complement of the difference between the likelihood that x_i belongs to its class y_i and the maximum likelihood it has for any other class. The complement is taken to standardize the interpretation of the direction of hardness, since the confidence that an instance belongs to its own class is expected to be larger than that for any other class [9]:

    CLD(x_i, y_i) = (1 − (p(x_i|y_i)p(y_i) − max_{y_j ≠ y_i}[p(x_i|y_j)p(y_j)])) / 2,    (6)

where p(y_i) is the prior of class y_i, set as 1/C for all data instances, and p(x_i|y_i) is the likelihood that x_i belongs to class y_i, which can be estimated by treating the input features as independent of each other, as in Naïve Bayes classification. For example, if x2 in Fig. 1 is labeled as blue in D, it will be easy according to this measure, as its likelihood for the blue class will be higher than for the red class.
When the class of an instance cannot be defined in advance, the hardness measure can be estimated by the difference between the two highest likelihoods over all possible classes in the dataset. As in the original measure, the complement of the difference is taken to keep the interpretation that higher values correspond to instances that are harder to classify. The values of this measure tend to be higher for borderline instances, since their likelihoods of belonging to different classes will be similar:

    CLDadj(x) = (1 − (max_{y_i}[p(x|y_i)p(y_i)] − max_{y_j ≠ y_i}[p(x|y_j)p(y_j)])) / 2.    (7)
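A minimal sketch of Eq. 7 follows, using per-feature Gaussian densities under the naive independence assumption. The helper name, the Gaussian likelihood model, and the eps smoothing are our assumptions; the paper's values come from PyHard.

```python
import numpy as np

def cld_adj(x, X_train, y_train, eps=1e-9):
    """Adjusted CLD (Eq. 7): complement of the gap between the two
    largest class scores p(x|c)p(c), using per-feature Gaussian
    likelihoods (naive independence) and uniform priors 1/C.
    Hypothetical sketch of the measure."""
    classes = np.unique(y_train)
    prior = 1.0 / len(classes)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu, sd = Xc.mean(axis=0), Xc.std(axis=0) + eps
        # product of per-feature Gaussian densities, as in Naive Bayes
        dens = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        scores.append(prior * dens.prod())
    top2 = np.sort(scores)[-2:]
    return float((1.0 - (top2[1] - top2[0])) / 2.0)  # higher => harder
```

A borderline instance, whose two class scores nearly tie, lands close to 0.5, while an instance well inside one class scores lower.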

2.3 Tree-Based IHM

Decision trees (DTs) can be used to estimate the hardness level of an instance based on the number of splits necessary to classify it: if many splits are required, the instance is harder to classify. The DT is built from the labeled dataset D. Unlabeled instances are then submitted to the built DT, and the measures are computed based on where each instance is classified.

Disjunct Class Percentage (DCP): from a decision tree pruned after being built on D, the leaf node where the instance is classified is considered the disjunct of x_i. The complement of the percentage of instances in this disjunct that share the same label as x_i gives the original DCP measure:

    DCP(x_i, y_i) = 1 − |{x_j | x_j ∈ Disjunct(x_i) ∧ y_j = y_i}| / |{x_j | x_j ∈ Disjunct(x_i)}|,    (8)

where Disjunct(x_i) represents the set of instances contained in the disjunct (leaf node) where x_i is placed. For instances that are easy according to this measure, larger percentages of examples sharing the instance's label will be found in its disjunct. For example, if x2 in Fig. 1 has the red label in D, it will probably be placed in a leaf node containing many elements of the blue class, making it harder to classify under the interpretation of this measure.
In scenarios where the instance's class is unknown, we take the entropy of the disjunct where the instance is placed as the hardness measure, similarly to what was done for kDN:

    DCPadj(x) = − Σ_{i=1}^{C} p(y_j = c_i) log p(y_j = c_i), for x_j ∈ Disjunct(x),    (9)

where the proportions of the classes are computed over the disjunct where x is placed in the DT built from the dataset D.
Tree Depth (TD): the original measure gives the depth of the leaf node that classifies x_i in a DT built from the entire labeled dataset D, normalized by the maximum depth of the tree:

    TD(x_i, y_i) = depth_DT(x_i) / max(depth_DT(x_j ∈ D)),    (10)

where depth_DT(x_i) gives the depth at which the instance x_i is placed in the DT. Instances that are harder to classify tend to be placed at deeper levels of the tree, making TD higher. There are two versions of this measure: one derives from a pruned tree (TDP) and the other from an unpruned tree (TDU).

For unlabeled instances, the procedure for hardness estimation is the same as in DCP: the DT is built from the labeled set D, and the unlabeled instance is then submitted to the built DT. The depth of the leaf node where this instance is classified is used in the equation:

    TDadj(x) = depth_DT(x) / max(depth_DT(x_j ∈ D)).    (11)
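Both tree-based adaptations can be sketched from a single fitted scikit-learn tree. The helper below is a hypothetical illustration: it uses an unpruned tree (the paper's DCP and TDP derive from a pruned one) and names of our choosing.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_hardness(x, X_train, y_train):
    """DCPadj (Eq. 9) as the entropy of the disjunct receiving x, and
    TDadj (Eq. 11) as that leaf's depth over the maximum tree depth.
    Hypothetical sketch using an unpruned scikit-learn tree."""
    dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    leaf = dt.apply(x.reshape(1, -1))[0]            # disjunct of x
    labels = y_train[dt.apply(X_train) == leaf]     # training points in it
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    dcp_adj = float(-(p * np.log(p)).sum())
    # node depths from the tree structure (children follow their parent)
    depth = np.zeros(dt.tree_.node_count, dtype=int)
    for parent in range(dt.tree_.node_count):
        for child in (dt.tree_.children_left[parent],
                      dt.tree_.children_right[parent]):
            if child != -1:
                depth[child] = depth[parent] + 1
    td_adj = float(depth[leaf] / depth.max())
    return dcp_adj, td_adj
```

On well-separated data, a new point falls into a pure leaf, so its disjunct entropy is 0.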

2.4 Using Meta-models to Estimate IHM

Meta-learning is a traditional ML task that uses data about ML itself [2]. Here, MtL is designed to predict IHM values without access to the instances' labels. This is done by using the original input features of the dataset D to learn the expected IHM values in a regression task. Regression meta-models are therefore induced to estimate the IHM values of new instances. Their training datasets comprise the original input features of D and a target corresponding to an IHM computed on D in its original formulation. There is one regression model per IHM.

The estimation of IHM values for unlabeled data with this meta-learning approach is compared to the usage of the adjusted IHM values.
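The meta-learning pipeline can be sketched end to end as follows. The kDN-style target below is a stand-in for the PyHard-computed IHMs, and all names and the random seed are our illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

# Meta-dataset: original input features of D mapped to a precomputed IHM.
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)

# Stand-in hardness target: fraction of disagreeing neighbours (kDN, k=10).
nn = NearestNeighbors(n_neighbors=11).fit(X)
_, idx = nn.kneighbors(X)                       # column 0 is the point itself
kdn = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)

# One regression meta-model per IHM; here, a single RF for the kDN target.
meta_model = RandomForestRegressor(random_state=0).fit(X, kdn)
estimate = meta_model.predict(X[:1])            # hardness of an "unlabeled" point
```

At deployment time, the meta-model only needs the input features of a new instance, never its label, which is exactly the property this section exploits.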

3 Materials and Methods


In this section, we describe the materials and methods used in experiments
performed to analyze the behaviour of IHM for unlabeled data in classification
problems.

3.1 Datasets
Five datasets are employed in the experiments. The first dataset was created
synthetically, containing three classes with some overlap. The other four datasets
are from the health domain, for which some instances are hard to classify due
to the overlap of attribute values for different classes or inconsistencies. Two
of them are from the UCI public repository [6] and have been employed in
previous related work [8,11]. The last two are related to severe COVID-19 cases
in two large hospitals from the São Paulo metropolitan area [15]. The main
characteristics of the five datasets are presented in Table 1, including the number
of instances, classes and input features.

Table 1. Summary of the datasets used in the study.

            Blobs  Diabetes  Heart  Hospital1  Hospital2
Instances   300    768       270    526        134
Classes     3      2         2      2          2
Features    2      8         13     17         19

The blobs dataset was generated synthetically using the make_blobs function from the scikit-learn library [10], which generates isotropic Gaussian blobs in space. The standard deviation of the clusters was set to 2 to create some overlap between the classes and, consequently, regions where instances are harder to classify than others. Figure 2 presents this dataset, where some overlap in the borderline regions of the classes can be noticed.
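A dataset in the spirit of blobs can be recreated in one call (the random seed is our choice; the paper does not report one):

```python
from sklearn.datasets import make_blobs

# 300 instances, three isotropic Gaussian classes in 2D, cluster_std=2.0
# to force some overlap between the classes (seed is our assumption).
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
```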
The diabetes dataset is related to the incidence of diabetes in female patients
of Pima Indian heritage who are at least 21 years old. The objective is to iden-
tify the presence of diabetes. The predictive variables record blood indices and
patient characteristics, such as number of pregnancies and age [6].

Fig. 2. Illustration of the blobs dataset.

The heart dataset registers heart disease in patients and has features collected
during the exercise test, others reflecting blood indices and personal character-
istics of the patients, such as age and gender [6].
The last two datasets, named hospital, were extracted from the raw public database provided by the FAPESP COVID-19 data sharing initiative [3]. The binary response labels a patient as severe when the hospital stay was greater than or equal to 14 days or the patient progressed to death. The features collected in these datasets are related to blood indices, age, and gender [15].

3.2 Methodology

The adjusted IHM measures proposed in this paper were applied to the datasets by treating each instance as unlabeled, one at a time, with the remaining instances labeled, resembling a leave-one-out (LOO) cross-validation scheme.

The same procedure is used to generate the meta-models that predict the IHM values, where one instance is left out as unlabeled at a time. The IHMs of the other instances are calculated using their original formulations. Next, a meta-dataset is built, mapping the original features of the instances to the computed IHM values. Regression meta-models are induced to learn this relationship and predict the expected IHM value of the left-out instance. One meta-model is induced per IHM considered. We used the Random Forest Regressor (RF) available in the scikit-learn library [10] with default hyperparameter values to generate these meta-models.
We also computed the original IHMs for the entire datasets, which use the labels of all instances. We then compare the association between the IHM values of the original measures and those of the estimated measures, where the estimates come either from the adjusted measures or from the induced meta-models. Spearman's correlation provides a non-parametric estimate of the association (monotonic relationship) between the modified measures and the original measure. It captures whether the direction of the adjusted/estimated IHM follows the values obtained from the original IHM: higher Spearman's correlation values indicate a stronger association between the estimated and original IHMs.
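To illustrate the property being measured: Spearman's correlation equals 1 whenever two score vectors rank the instances identically, even if their magnitudes differ. The vectors below are made-up hardness values.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical original vs. adjusted IHM values for five instances:
# different magnitudes, but the same ranking of instance hardness.
original = np.array([0.10, 0.30, 0.55, 0.70, 0.90])
adjusted = np.array([0.05, 0.20, 0.60, 0.80, 0.95])
rho, _ = spearmanr(original, adjusted)  # rho == 1.0: perfect monotonicity
```

This is why the comparison tolerates scale differences between the adjusted and original measures.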
We expect medium to high correlations, although some deviation is possible, since the estimates do not strive to deliver identical IHM values. Indeed, instances with noisy labels in the training datasets have characteristics that make them aligned with another class and are expected to show a lower correlation with the original IHM values. But in most cases, we expect the hardness directions to be maintained.
All code and analyses are implemented in Python. The original IHMs are computed using the PyHard package [7,9]. The code of the adjusted measures is in a public repository: https://anonymous.4open.science/r/Adj-IHM-BF75. The k value in kDN was set to 10, the default in the PyHard package.

4 Results

The results of the experiments performed are presented and discussed next.

4.1 Meta-models

First, we present the performance of the meta-models in the regression task. Table 2 presents the Mean Squared Error (MSE) obtained when predicting the IHMs using the regression meta-models. Lower values indicate better performance in predicting the original IHM values.

Table 2. MSE obtained by the RF algorithm when predicting the IHMs for the different datasets.

       blobs  diabetes  heart  hospital1  hospital2
kDN    0.219  0.214     0.232  0.190      0.209
N2     0.150  0.077     0.109  0.049      0.049
CLD    0.215  0.196     0.240  0.146      0.124
DCP    0.238  0.221     0.248  0.168      0.190
TDU    0.066  0.087     0.090  0.068      0.091
TDP    0.035  0.022     0.010  0.002      0.105

For some measures, the MSEs are lower, demonstrating a better approximation of the original IHM values. This happens mostly for the tree-depth measures. For others, the approximations are not as good (e.g., for kDN and DCP). One possible explanation is that the tree-depth measures do not depend as much on the labels of the instances as the others. The only difference between the original tree-depth measures and their estimated counterparts is the exclusion of one instance from the decision tree induction, which has little effect on the results. For the other measures, if an instance is incorrectly labeled, the original measures will flag it as very hard to classify, whereas without the label information this instance might be easily classified into another class, making it appear easy.

4.2 Correlation Analysis

Table 3 shows Spearman's correlation coefficient between the original IHMs and the measures obtained using the meta-learning approach. Values higher than 0.5 are highlighted in bold. The values for the estimated tree-depth measures are the highest, especially for the pruned version of the measure (TDP). This happens because, in the pruned version of the tree, noisy and outlier instances tend to be placed in nodes that have undergone pruning, so the label of the particular instance seems to matter less in the original IHM formulation. In contrast, the formulations of the original CLD, DCP, and kDN measures are highly influenced by the label of each instance for which they are measured. This decreases the correlations, especially in datasets with many instances whose feature values are akin to one class despite being labeled as another class in the dataset. This is the case for the hospital 1 and 2 datasets, where situations such as wrongly labeled instances or overlapping feature values are more common.

Table 3. Spearman coefficient obtained for the RF algorithm when predicting the IHMs for the different datasets.

       blobs  diabetes  heart  hospital1  hospital2
kDN    0.721  0.651     0.632  0.411      0.496
N2     0.794  0.592     0.642  0.299      0.295
CLD    0.685  0.666     0.430  0.432      0.287
DCP    0.611  0.680     0.508  0.392      0.456
TDU    0.927  0.824     0.671  0.870      0.911
TDP    0.987  0.965     0.953  0.993      0.876

Table 4. Spearman coefficient obtained for the adjusted measures compared to the original IHMs in the different datasets.

       blobs  diabetes  heart  hospital1  hospital2
kDN    0.696  0.713     0.836  0.427      0.554
N2     0.579  0.651     0.790  0.501      0.609
CLD    0.760  0.767     0.839  0.502      0.677
DCP    0.666  0.413     0.793  0.514      0.681
TDU    0.918  0.986     0.982  0.994      0.961
TDP    0.773  1.000     1.000  1.000      0.960

Table 4 presents the same analysis for the adjusted IHMs: their Spearman correlation with the original IHMs. As in Table 3, values higher than 0.5 are boldfaced, and more boldfaced correlations are observed here. Similar observations concerning the higher correlation values of the tree depth-based measures hold in Table 4 too. The correlations observed for the adjusted measures are generally higher than those observed for the measures estimated by the meta-regressors. To make the differences clearer, Fig. 3 plots the Spearman's correlations of the adjusted IHMs and of the meta-models against the original values. Blue bars represent the correlations of the adjusted measures, while orange bars denote the meta-learning approach. Only for the blobs dataset and for the DCP-diabetes combination were the correlations of the meta-models higher than those of the adjusted IHMs. The blobs dataset has its difficult instances concentrated on the borders of the classes, while the other datasets may pose other sources of difficulty that are not captured when the labels are absent, such as label noise.

Fig. 3. Spearman’s correlation applied to the adjusted IHM vs. the original IHM (blue
bars) and the predicted vs. expected values from MtL (orange bars). (Color figure
online)

Figure 4 shows the instances of the blobs dataset colored by hardness according to the original IHM (left), the adjusted IHM (center), and the meta-learning approach (right). This visualization is possible because the dataset is bi-dimensional. The harder an instance is to classify, the more intensely it is colored in red; instances that are easier to classify are filled with darker blue. The central areas of the plots contain the overlapping region between the three classes (see Fig. 2) and are therefore harder to classify. The first row corresponds to the kDN measure, the second to the TDU measure. For kDN, it is clear that the hardest instances are those on the borders of the classes. For TDU, the pattern observed in the three approaches shows that the hardness level is related to the partitions derived from the decision tree. All measures show similar behaviors; however, for the adjusted kDN measure, more central instances have higher IHM values than under the other measures. It is important to note that since the adjusted measures can vary on a scale different from the original IHM, the results presented in the plots were normalized between 0 and 1 to allow a direct comparison.

Fig. 4. Visualization of the measures kDN (top) and TDU (bottom): original IHM (left), adjusted IHM (middle), and meta-learning approach (right) for the blobs dataset. (Color figure online)

4.3 Discussion

Considering the different natures of the datasets, where blobs was artificially designed with three classes and the others are real-world health data, the Spearman's correlations in Fig. 3 show that MtL achieves results closer to the original measures than the adjusted IHM on the blobs dataset for most of the measures. Conversely, for the real datasets, the adjusted IHM is more strongly associated with the original measure for almost all measures.
The tree-depth measures had the highest correlations with the original measures for both the adjusted IHM and the MtL approach. This is mostly because the original tree-depth measures do not depend so directly on the labels of the instances. All the other measures check whether the labels in an instance's vicinity agree with its registered label, which makes them deviate more for mislabeled instances.
This can be observed in Fig. 5, where the original and estimated IHMs kDN (top) and TDU (bottom) are compared for all instances of the blobs dataset. The x-axis shows the original measures, whilst the y-axis shows the proposed counterparts. The adjusted kDN is normalized between 0 and 1 for direct comparison.

Fig. 5. The adjusted IHM vs. the original IHM and the meta-learning predictions vs. the original IHM for the blobs dataset.

For the kDN measure, one can observe that as the hardness of the instances grows, both the adjusted and the original IHM increase, reaching their peak in the middle of the scale. After that, the estimated IHM moves in the opposite direction of the original measure. This result is expected for unlabeled data: instances that are hard to classify because of their label will simply be predicted as belonging to another class, rather than standing out as outliers of a specific class.

Conversely, in the unpruned TD graphs, the results indicate that the hardness of classification is independent of whether the class is known. For both the adjusted IHM and the meta-learning approach, there is an approximately linear relationship between those measures and the original IHM. This means that, for this measure, the proposed measures for unlabeled data capture the same increase in hardness observed when the class is given. This result can be expected considering the nature of the measure.
The CLD measure, the only one based on likelihoods, tracked the original measure more closely through the adjusted IHM for all datasets. Especially for datasets with two classes, both estimates will agree in many cases, namely whenever the classes with the first and second maximum likelihoods are the same.
Overall, both the adjusted and the meta-learning IHMs were able to assess the
hardness level of unlabeled instances, with some advantage for the adjusted
measures, which showed larger correlations to the original measures in most
cases. The adjusted measures are also simpler to compute, as they do not
require inducing an ML model as the meta-learning approach does. In the
absence of labels, most measures are most effective at flagging borderline
instances as posing a higher difficulty for posterior classification.
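The agreement between a label-based IHM and its label-free counterpart can be quantified per instance with a rank correlation such as Spearman's. The arrays below are synthetic stand-ins for illustration only, not results from this work:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
ihm_original = rng.random(200)                           # computed with labels
ihm_adjusted = ihm_original + rng.normal(0, 0.05, 200)   # label-free estimate
ihm_meta = ihm_original + rng.normal(0, 0.15, 200)       # meta-model estimate

rho_adj, _ = spearmanr(ihm_original, ihm_adjusted)
rho_meta, _ = spearmanr(ihm_original, ihm_meta)
print(f"adjusted rho={rho_adj:.2f}, meta-learning rho={rho_meta:.2f}")
```

A higher rank correlation for the adjusted measure, as in this synthetic setup, corresponds to the advantage reported for it over the meta-learning estimates.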

5 Conclusion and Future Work


This research analyzed alternative ways to measure the hardness of instances
in classification problems when the label of an instance is unknown, that is,
at the deployment stage. Standard IHMs from the literature were adapted to
this scenario, and their results were compared to the alternative of training
regression meta-models to predict the IHM values. Both alternatives proved
effective, correlating with the original IHMs, which require the label of each
instance to be known. The correlations were higher for measures that rely less
on the labels; for the other measures, lower correlations are expected, since
their original formulations identify noise and outliers in the data based on
their labels. These results encourage the use of the adjusted measures during
the deployment of ML models, allowing the identification of instances that the
models might struggle to classify.
In future work, we will explore the patterns found in the comparisons between
the original and adjusted IHMs not presented in this work, as well as
alternative measures for unlabeled data not addressed in this research. We
will also expand the application of the adjusted measures and the
meta-learning approach to more datasets; in addition, tuning the meta-models
may reveal new findings about the characteristics of the instances. Another
fruitful direction is to explore the use of the adjusted measures in the
design of classification rejection options.
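As a final illustration, a rejection option driven by estimated hardness could take the following shape; the threshold and the -1 abstention code are illustrative choices rather than prescriptions from this work:

```python
import numpy as np

def predict_with_reject(model, X, hardness_scores, threshold=0.5):
    """Abstain (return -1) on instances whose estimated hardness
    exceeds the threshold; such cases can be deferred to a human
    expert or a fallback model."""
    preds = np.asarray(model.predict(X)).copy()
    preds[np.asarray(hardness_scores) > threshold] = -1
    return preds
```

Here the hardness scores could come from an adjusted measure or a meta-model, making the rejection rule usable precisely when no labels are available.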

Acknowledgements. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. The authors thank FAPESP for its support under grant 2021/06870-3.

