How to measure uncertainty in uncertainty sampling
https://doi.org/10.1007/s10994-021-06003-9
Abstract
Various strategies for active learning have been proposed in the machine learning literature. In uncertainty sampling, which is among the most popular approaches, the active learner sequentially queries the label of those instances for which its current prediction is maximally uncertain. The predictions as well as the measures used to quantify the degree of uncertainty, such as entropy, are traditionally of a probabilistic nature. Yet, alternative approaches to capturing uncertainty in machine learning, along with corresponding uncertainty measures, have been proposed in recent years. In particular, some of these measures seek to distinguish different sources and to separate different types of uncertainty, such as the reducible (epistemic) and the irreducible (aleatoric) part of the total uncertainty in a prediction. The goal of this paper is to elaborate on the usefulness of such measures for uncertainty sampling, and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.
* Eyke Hüllermeier
[email protected]
Vu-Linh Nguyen
[email protected]
Mohammad Hossein Shaker
[email protected]
1 Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
2 Heinz Nixdorf Institute and Department of Computer Science, Paderborn University, Paderborn, Germany
1 Introduction
2 Uncertainty sampling
In this section, we briefly recall the basic setting of uncertainty sampling. As usual in
active learning, we assume to be given a labelled set of training data 𝐃 and a pool of unlabeled instances 𝐔 that can be queried by the learner:
$$\mathbf{D} = \bigl\{ (x_1, y_1), \ldots, (x_N, y_N) \bigr\}, \qquad \mathbf{U} = \bigl\{ x_1, \ldots, x_J \bigr\}.$$
Instances are represented as feature vectors $x_i = (x_i^1, \ldots, x_i^d) \in \mathcal{X} = \mathbb{R}^d$. In this paper, we only consider the case of binary classification, where labels $y_i$ are taken from $\mathcal{Y} = \{0, 1\}$, leaving the more general case of multi-class classification for future work. We denote by $\mathcal{H} \subset \mathcal{Y}^{\mathcal{X}}$ the underlying hypothesis space, i.e., the class of candidate models $h : \mathcal{X} \longrightarrow \mathcal{Y}$ the learner can choose from. Often, hypotheses are parametrized by a parameter vector $\theta \in \Theta$; in this case, we equate a hypothesis $h = h_\theta \in \mathcal{H}$ with the parameter $\theta$, and the model space $\mathcal{H}$ with the parameter space $\Theta$.
In uncertainty sampling, instances are queried in a greedy fashion. Given the current
model 𝜃 that has been trained on 𝐃, each instance xj in the current pool 𝐔 is assigned a
utility score s(𝜃, xj ), and the next instance to be queried is the one with the highest score
(Lewis and Gale 1994; Settles 2009; Settles and Craven 2008; Sharma and Bilgic 2017).
The chosen instance is labelled (by an oracle or expert) and added to the training data 𝐃,
on which the model is then re-trained. The active learning process for a given budget B (i.e.,
the number of unlabelled instances to be queried) is summarized in Algorithm 1.
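Since Algorithm 1 itself is not reproduced in this excerpt, the following minimal sketch illustrates the greedy query loop it describes. The scikit-learn-style model interface and the `oracle` and `score` helpers are assumptions made for the example, not part of the original pseudo-code.

```python
import numpy as np

def uncertainty_sampling(model, X_train, y_train, X_pool, oracle, score, budget):
    """Greedy uncertainty sampling (cf. Algorithm 1): query the pool
    instance with the highest utility score, label it, and re-train."""
    model.fit(X_train, y_train)
    for _ in range(budget):
        # utility s(theta, x_j) for every instance in the current pool
        utilities = np.array([score(model, x) for x in X_pool])
        j = int(np.argmax(utilities))            # most uncertain instance
        x, y = X_pool[j], oracle(X_pool[j])      # query its label
        X_train = np.vstack([X_train, x])        # add (x, y) to D
        y_train = np.append(y_train, y)
        X_pool = np.delete(X_pool, j, axis=0)    # remove x from the pool
        model.fit(X_train, y_train)              # re-train on the new D
    return model
```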
– the entropy:
$$s(\theta, x) = -\sum_{y \in \mathcal{Y}} p_\theta(y \mid x) \log p_\theta(y \mid x), \qquad (1)$$
3 Measures of uncertainty
In this section, we present different frameworks for measuring the learner’s uncertainty
in a query instance: evidence-based uncertainty (EBU), credal uncertainty (CU), and an
approach focusing on a distinction between epistemic and aleatoric uncertainty (EAU).
While the first one has been specifically developed for the purpose of active learning, the
other two are more general approaches to uncertainty quantification in machine learning.
Yet, their potential usefulness for active learning has been pointed out as well (Antonucci
et al. 2012; Nguyen et al. 2019).
In their evidence-based uncertainty sampling approach, Sharma and Bilgic (2013, 2017)
propose to differentiate between uncertainty due to conflicting evidence and insufficient
evidence. The corresponding measures of conflicting-evidence uncertainty and insufficient-
evidence uncertainty are mainly motivated for the Naïve Bayes (NB) classifier as a learning
algorithm. In the spirit of this classifier, evidence-based uncertainty sampling first looks
at the influence of individual features $x^m$ in the feature representation $x = (x^1, \ldots, x^d)$ of instances. More specifically, given the current model $\theta$, denote by $p_\theta(x^m \mid 0)$ and $p_\theta(x^m \mid 1)$
the class-conditional probabilities on the values of the mth feature. For a given instance x,
the authors partition the set of features into those that provide evidence for the positive and
for the negative class, respectively:
$$P_\theta(x) = \left\{ x^m \;\middle|\; \frac{p_\theta(x^m \mid 1)}{p_\theta(x^m \mid 0)} > 1 \right\}, \qquad (4)$$
$$N_\theta(x) = \left\{ x^m \;\middle|\; \frac{p_\theta(x^m \mid 0)}{p_\theta(x^m \mid 1)} > 1 \right\}. \qquad (5)$$
Then, the total evidence for the positive and the negative class is determined as follows:
$$E_1(x) = \prod_{x^m \in P_\theta(x)} \frac{p_\theta(x^m \mid 1)}{p_\theta(x^m \mid 0)}, \qquad (6)$$
$$E_0(x) = \prod_{x^m \in N_\theta(x)} \frac{p_\theta(x^m \mid 0)}{p_\theta(x^m \mid 1)}. \qquad (7)$$
The authors consider a situation as conflicting evidence if both E0(x) and E1(x) are high, because in such a situation, there is strong evidence in favor of the positive as well as strong evidence in favor of the negative class. Likewise, a situation in which both evidences are low is considered as insufficient evidence. Measuring these conditions in terms of the product¹ E1(x) × E0(x), the conflicting evidence-based approach simply queries the instance with the highest conflicting evidence, while the insufficient evidence-based approach looks for the one with the highest insufficient evidence:
$$x^*_{conf} = \arg\max_{x \in \mathbf{S}} E_1(x) \times E_0(x), \qquad (8)$$
$$x^*_{insuf} = \arg\min_{x \in \mathbf{S}} E_1(x) \times E_0(x). \qquad (9)$$
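As a concrete illustration of (4)–(8), the sketch below computes the two evidence scores for a Naïve Bayes model over discrete features and applies the conflicting-evidence selection; the nested table layout `cond_prob[m][v][y]` for the class-conditional probabilities is an assumption made for the example.

```python
import numpy as np

def evidence_scores(cond_prob, x):
    """E_1(x) and E_0(x) as in (6)-(7); cond_prob[m][v][y] is assumed to
    hold p(x^m = v | y) for feature m, value v, and class y."""
    e1, e0 = 1.0, 1.0
    for m, v in enumerate(x):
        ratio = cond_prob[m][v][1] / cond_prob[m][v][0]
        if ratio > 1:        # feature m is in P(x): evidence for class 1
            e1 *= ratio
        elif ratio < 1:      # feature m is in N(x): evidence for class 0
            e0 *= 1.0 / ratio
    return e1, e0

def query_conflicting(S, cond_prob):
    """Pick from the top-t uncertain instances S the one with the highest
    conflicting evidence (8); argmin instead would give the
    insufficient-evidence variant (9)."""
    products = [np.prod(evidence_scores(cond_prob, x)) for x in S]
    return S[int(np.argmax(products))]
```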
Note that the selection is restricted to the set 𝐒 of instances x in the pool 𝐔 having the high‑
est scores s(𝜃, x) according to standard uncertainty sampling; the size of this set, t = |𝐒|, is
a parameter of the method (and hence a hyper-parameter for the active learning algorithm).
The restriction to the most uncertain cases puts evidence-based uncertainty sampling close
to standard uncertainty sampling. Instead of using conflicting-evidence and insufficient-
evidence uncertainties as selection criteria on their own, they are merely used for prioritiz‑
ing cases that appear to be uncertain in the traditional sense.
Interestingly, to motivate their approach, Sharma and Bilgic (2017) note that “regardless
of whether we want to maximize or minimize E1 (x) × E0 (x), we want to guarantee that
the underlying model is uncertain about the chosen instance”, thereby suggesting that
the evidence-based uncertainties alone do not necessarily inform about this uncertainty.
Indeed, it is true that these uncertainties are not easy to interpret (see also our discussion in
Sect. 4.1), and that their relationship to standard uncertainty measures is not fully obvious.
In particular, note that the latter also comprises the influence of the prior class probabil‑
ities, which is completely neglected by the evidence-based uncertainties (which only look
at the likelihood). This is especially relevant in the case of imbalanced class distributions.
In such cases, evidence-based uncertainty may strongly deviate from standard uncertainty,
i.e., the entropy of the posterior distribution. For instance, E0(x) and E1(x) could both be very large, and p𝜃(x | 0) ≈ p𝜃(x | 1), although p𝜃(0 | x) is very different from p𝜃(1 | x) due to unequal prior odds, and hence the entropy is small. Likewise, the entropy of the posterior can be large although both evidence-based uncertainties are small.
¹ This is the measure used in (Sharma and Bilgic 2013, 2017). The authors mention, however, that other measures could be used as well.
The evidence-based approach to uncertainty sampling has been introduced with a focus on Naïve Bayes as a base learner. In this regard, we would like to note that uncertainty sampling for this learner might be considered critical in general.
It is clear that active learning may always incorporate a bias, simply because the data is
no longer produced by sampling independently according to the true underlying distribu‑
tion. Thus, the data is no longer completely representative. While this may affect any learn‑
ing algorithm, the effect appears to be especially strong for NB, so that uncertainty sam‑
pling for NB appears to be questionable in general. In fact, a sample bias has a very direct
influence on the probabilities estimated by NB. In particular, the estimated class priors are
strongly biased toward the conditional class probabilities of those instances with a high
uncertainty, because these are sampled more often. This bias may in turn affect the classi‑
fier as a whole, and lead to suboptimal predictions.
As an illustration of the problem, let us consider a small example with only two binary
attributes x1 and x2. This example may appear unrealistic, because the instance space is
finite and actually quite small. Please note, however, that even in practice NB is typically
applied to discrete attributes with finite domains (possibly after a discretization of numeri‑
cal attributes in a pre-processing step).
Suppose the class priors to be given by p(y = 0) = 0.3 and p(y = 1) = 0.7, and the class-conditional probabilities as follows:
$$p(x_1 = 1 \mid y = 0) = 0.4, \qquad p(x_1 = 1 \mid y = 1) = 0.2,$$
$$p(x_2 = 1 \mid y = 0) = 0.8, \qquad p(x_2 = 1 \mid y = 1) = 0.4.$$
Now, consider an active learner that can sample from a large (in principle infinite) pool of
unlabeled data points (i.e., multiple copies of each of the four instances). Since the second
instance (x1 , x2 ) = (0, 1) has the highest entropy, standard uncertainty sampling will sooner
or later focus on this instance and sample it over and over again². This of course has an
influence on the estimation of priors and conditional probabilities by NB. In particular, the
estimated class priors p̂ (y = 0) and p̂ (y = 1) will converge to the conditional posteriors,
i.e., the posteriors of y given (x1 , x2 ) = (0, 1). Consequently, we will produce a bias in the
estimates, and will obtain
² We confirmed this in a simulation study.
p̂ (y = 0 | x1 = 0, x2 = 0) ≈ 0.19 , p̂ (y = 1 | x1 = 0, x2 = 0) ≈ 0.81 ,
p̂ (y = 0 | x1 = 0, x2 = 1) ≈ 0.38 , p̂ (y = 1 | x1 = 0, x2 = 1) ≈ 0.62 ,
p̂ (y = 0 | x1 = 1, x2 = 0) ≈ 0.24 , p̂ (y = 1 | x1 = 1, x2 = 0) ≈ 0.76 ,
p̂ (y = 0 | x1 = 1, x2 = 1) ≈ 0.45 , p̂ (y = 1 | x1 = 1, x2 = 1) ≈ 0.56 .
As one can see, this will even have an effect on the Bayes-optimal predictor: For
(x1 , x2 ) = (1, 1), the prediction will be ŷ = 1 instead of the actually optimal prediction
ŷ = 0. Similar effects can be found for the evidence-based approach. For example, when
applying the insufficient evidence approach, it can happen that the active learner will com‑
pletely focus on the third instance, which has the highest insufficient evidence, and then
produce the following estimates:
p̂ (y = 0 | x1 = 0, x2 = 0) ≈ 0.14 , p̂ (y = 1 | x1 = 0, x2 = 0) ≈ 0.86 ,
p̂ (y = 0 | x1 = 0, x2 = 1) ≈ 0.36 , p̂ (y = 1 | x1 = 0, x2 = 1) ≈ 0.64 ,
p̂ (y = 0 | x1 = 1, x2 = 0) ≈ 0.23 , p̂ (y = 1 | x1 = 1, x2 = 0) ≈ 0.78 ,
p̂ (y = 0 | x1 = 1, x2 = 1) ≈ 0.49 , p̂ (y = 1 | x1 = 1, x2 = 1) ≈ 0.51 .
As explained above, the approach by Sharma and Bilgic (2017) is specifically tai‑
lored for Naïve Bayes as a learning algorithm. Yet, the authors also propose vari‑
ants of their measures for logistic regression and support vector machines. For exam‑
ple, if the decision boundary obtained by fitting a logistic regression model is given by $h_\theta(x) = \theta_0 + \sum_{i=1}^d \theta_i \cdot x^i$, the evidences for the positive and the negative class are defined, respectively, as follows (Sharma and Bilgic 2017):
$$E_1(x) = \sum_{x^m \in P_\theta(x)} \theta_m \cdot x^m, \qquad E_0(x) = -\sum_{x^m \in N_\theta(x)} \theta_m \cdot x^m, \qquad (10)$$
where
$$P_\theta(x) = \bigl\{ x^m \mid \theta_m \cdot x^m > 0 \bigr\}, \qquad N_\theta(x) = \bigl\{ x^m \mid \theta_m \cdot x^m < 0 \bigr\}. \qquad (11)$$
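In code, the evidences (10)–(11) reduce to summing the positive and the negated negative per-feature contributions of a fitted weight vector; the array-based layout below is an assumption for the sketch.

```python
import numpy as np

def evidence_scores_logreg(theta, x):
    """Evidences (10)-(11) for logistic regression; `theta` holds the
    feature weights only (the bias theta_0 plays no role here, which is
    exactly the limitation discussed for Fig. 2 below)."""
    contrib = theta * x                    # per-feature contributions
    e1 = contrib[contrib > 0].sum()        # evidence for the positive class
    e0 = -contrib[contrib < 0].sum()       # evidence for the negative class
    return e1, e0
```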
Consider an instance space $\mathcal{X}$, output space $\mathcal{Y} = \{0, 1\}$, and a hypothesis space $\mathcal{H}$ consisting of probabilistic classifiers $h : \mathcal{X} \longrightarrow [0, 1]$. Assuming that each hypothesis $h = h_\theta$ is identified by a (unique) parameter vector $\theta \in \Theta$, we can equate $\mathcal{H}$ with the parameter space $\Theta$.
The credal uncertainty sampling approach simply looks for the instance x with the highest uncertainty, i.e., the least evidence for the dominance of one of the classes. In the case of binary classification with $\mathcal{Y} = \{0, 1\}$, this is expressed by the score
$$s(x) = -\max\bigl( \gamma(1, 0, x),\, \gamma(0, 1, x) \bigr). \qquad (13)$$
Such interval-valued probabilities can be produced within the framework of the Naïve
credal classifier (Antonucci et al. 2012; Antonucci and Cuzzolin 2010; De Campos et al.
1994; Zaffalon 2002). In the case of binary classification, where p𝜃 (0 | x) = 1 − p𝜃 (1 | x),
the score 𝛾(1, 0, x) can be rewritten as follows:
$$\gamma(1, 0, x) = \inf_{\theta \in C} \frac{p_\theta(1 \mid x)}{p_\theta(0 \mid x)} = \inf_{\theta \in C} \frac{p_\theta(1 \mid x)}{1 - p_\theta(1 \mid x)} = \frac{\underline{p}(1 \mid x)}{1 - \underline{p}(1 \mid x)}. \qquad (15)$$
Likewise,
$$\gamma(0, 1, x) = \inf_{\theta \in C} \frac{p_\theta(0 \mid x)}{p_\theta(1 \mid x)} = \inf_{\theta \in C} \frac{1 - p_\theta(1 \mid x)}{p_\theta(1 \mid x)} = \frac{1 - \overline{p}(1 \mid x)}{\overline{p}(1 \mid x)}. \qquad (16)$$
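As a minimal sketch, given interval-valued probabilities for the positive class that are strictly between 0 and 1, the resulting selection score is:

```python
def credal_score(p_low, p_up):
    """Score (13) from an interval [p_low, p_up] for the positive class,
    using (15)-(16); assumes 0 < p_low <= p_up < 1."""
    gamma_10 = p_low / (1.0 - p_low)   # evidence that class 1 dominates
    gamma_01 = (1.0 - p_up) / p_up     # evidence that class 0 dominates
    return -max(gamma_10, gamma_01)    # high score = little dominance
```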
A distinction between the epistemic and aleatoric uncertainty (Hora 1996) in a prediction for an instance x has been motivated by Senge et al. (2014).³ Their approach is based on the use of relative likelihoods, historically proposed by Birnbaum (1962) and then justified in other settings such as possibility theory (Walley and Moral 1999).

³ More recently, this distinction has also attracted attention in the deep learning community (Kendall and Gal 2017).
Given a set of training data $\mathbf{D} = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, the normalized likelihood of a model $h_\theta$ is defined as
$$\pi_\Theta(\theta) = \frac{L(\theta)}{L(\theta^{ml})} = \frac{L(\theta)}{\max_{\theta' \in \Theta} L(\theta')}, \qquad (18)$$
where $L(\theta) = \prod_{i=1}^N p_\theta(y_i \mid x_i)$ is the likelihood of $\theta$, and $\theta^{ml} \in \Theta$ is the maximum likelihood estimate on the training data. For a given instance x, the degrees of support (plausibility) of the two classes are defined as follows:
$$\pi(1 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, p_\theta(1 \mid x) - p_\theta(0 \mid x) \bigr],$$
$$\pi(0 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, p_\theta(0 \mid x) - p_\theta(1 \mid x) \bigr].$$
So, $\pi(1 \mid x)$ is high if and only if a highly plausible model supports the positive class much more strongly (in terms of the assigned probability mass) than the negative class (and $\pi(0 \mid x)$ can be interpreted analogously).⁴ Note that, with $f(a) = 2a - 1$, we can also write
$$\pi(1 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, f(h_\theta(x)) \bigr], \qquad (19)$$
$$\pi(0 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, f(1 - h_\theta(x)) \bigr]. \qquad (20)$$
Given the above degrees of support, the degrees of epistemic uncertainty ue and aleatoric
uncertainty ua are defined as follows:
$$u_e(x) = \min\bigl[ \pi(1 \mid x),\, \pi(0 \mid x) \bigr], \qquad (21)$$
$$u_a(x) = 1 - \max\bigl[ \pi(1 \mid x),\, \pi(0 \mid x) \bigr]. \qquad (22)$$
Thus, epistemic uncertainty refers to the case where both the positive and the negative
class appear to be plausible, while the degree of aleatoric uncertainty (22) is the degree to
which none of the classes is supported. Roughly speaking, aleatoric uncertainty is due to
influences on the data-generating process that are inherently random, whereas epistemic
uncertainty is caused by a lack of knowledge. Or, stated differently, ue and ua measure the
reducible and the irreducible part of the total uncertainty, respectively.
It is thus tempting to assume that epistemic uncertainty is more relevant for active learn‑
ing: While it makes sense to query additional class labels in regions where uncertainty can
be reduced, doing so in regions of high aleatoric uncertainty appears to be less reasonable.
This leads us to suggest the principle of epistemic uncertainty sampling, which prescribes
the selection
$$x^* = \arg\max_{x \in \mathbf{U}} u_e(x). \qquad (23)$$
⁴ Technically, we assume that, for each $x \in \mathcal{X}$, there are hypotheses $h, h' \in \mathcal{H}$ such that $h(x) \ge 0.5$ and $h'(x) \le 0.5$, which implies $\pi(1 \mid x) \ge 0$ and $\pi(0 \mid x) \ge 0$.
For comparison, we will also consider an analogous selection rule based on the aleatoric
uncertainty, i.e.,
$$x^* = \arg\max_{x \in \mathbf{U}} u_a(x). \qquad (24)$$
As already said, this approach is completely generic and can in principle be instantiated
with any hypothesis space H. The uncertainty measures (21–22) can be derived very eas‑
ily from the support degrees (19–20). The computation of the latter may become difficult,
however, as it requires the solution of an optimization problem, the properties of which
depend on the choice of H. We are going to present practical methods to determine (19–20)
for the cases of a simple Parzen window classifier and logistic regression in Sects. A.1
and A.2, respectively.
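As a rough illustration (not the authors' exact procedure, which is developed in Sects. A.1 and A.2), the quantities (19)–(22) can be approximated by brute force over a finite grid of candidate models:

```python
import numpy as np

def eau_degrees(pi_theta, p_pos):
    """Approximate (19)-(22) over a finite grid of candidate models:
    pi_theta[k] is the normalized likelihood of candidate k, p_pos[k] its
    predicted probability h_theta(x) of the positive class; the sup over
    Theta is replaced by a max over the grid."""
    f = 2.0 * p_pos - 1.0                       # f(h_theta(x)) = 2a - 1
    pi1 = np.max(np.minimum(pi_theta, f))       # support of class 1, (19)
    pi0 = np.max(np.minimum(pi_theta, -f))      # f(1 - a) = -f(a), (20)
    ue = min(pi1, pi0)                          # epistemic uncertainty (21)
    ua = 1.0 - max(pi1, pi0)                    # aleatoric uncertainty (22)
    return ue, ua
```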
[Two count tables with columns b1 and b2, one per scenario; the entries were not recovered in extraction.]
In both scenarios, the insufficient evidence would be high, because all class-conditional probabilities are equal. In EAU, however, the first scenario would largely be a case of epistemic uncertainty, due to the small number of training examples, whereas the second
would be aleatoric, because the equal posteriors⁵ are sufficiently “confirmed”. Similar
remarks apply to conflicting evidence. In the scenario
        b1        b2
a1    (1, 1)    (10, 1)
a2    (1, 10)   (1, 1)
the latter would be high for (a1 , b1 ), because p𝜃 (a1 | 1) ≫ p𝜃 (a1 | 0) and
p𝜃 (b1 | 0) ≫ p𝜃 (b1 | 1). The same holds for (a2 , b2 ), whereas the uncertainties for (a1 , b2 )
and (a2 , b1 ) would be low. Note, however, that in all these cases, exactly the same condi‑
tional probability estimates p𝜃 (xm | 1) and p𝜃 (xm | 0) are involved.
We would argue that epistemic uncertainty should directly refer to these probabilities,
because they constitute the parameter 𝜃 of the model. Thus, to reduce epistemic uncertainty
(about the right model 𝜃), one should look for those examples that will most improve the
estimation of these probabilities. Aleatoric uncertainty may occur in cases of posteriors
close to 1/2, in which the conflicting evidence may indeed be high (although, as already
mentioned, the latter ignores the class priors). Yet, we would not necessarily call such
cases a “conflict”, because the predictions are completely in agreement with the underlying
model (Naïve Bayes), which assumes class-conditional independence of attributes, i.e., an
independent combination of evidences on different attributes.
Fig. 1 Two scenarios for logistic regression: training data with positive (red crosses) and negative examples (black circles) and five query instances
Another illustration is provided in Fig. 1, now for the case of logistic regression. A first
important observation is that the uncertainties due to conflicting and insufficient evidence
are exactly the same in both scenarios in Fig. 1, the left and the right one. This is because
these uncertainties are merely derived from the single model h𝜃 learned from the training
data. Thus, like in the example for Naïve Bayes, the evidence-based approach does not
capture model uncertainty, i.e., uncertainty about the truly optimal model (which is clearly
larger on the left and smaller on the right), which EAU essentially measures in terms of
epistemic uncertainty.
⁵ The class priors are ignored here.

Fig. 2 Two scenarios for logistic regression: training data with positive (red crosses) and negative examples (black circles) and five query instances

For the first three queries, the evidence-based uncertainties are very different: The first query has a high insufficient-evidence uncertainty, the third has a high
conflicting-evidence uncertainty, and the second none of the two. According to EAU,
the uncertainties for these three cases are all high (because they are all located close
to the decision boundary) and, more importantly, of the same nature: mostly aleatoric
in the right and a mix of aleatoric and epistemic in the left scenario. For the second,
fourth and fifth query, the evidence-based uncertainties are roughly the same. Again,
this is very different from EAU, which assigns a high uncertainty to the second but
very low uncertainties to the fourth and fifth query.
Figure 2 shows two very similar scenarios, but now with a bias term. As already
said, the evidence-based approach does not account for such a bias. Moreover, since w1 ≈ 0
(the first feature does not seem to have an influence), there is essentially no negative
evidence, i.e., E0 (x) is always close to 0. Consequently, the product E0 (x) × E1 (x) will
be small, too, suggesting that conflicting-evidence uncertainty is always low and insuf‑
ficient-evidence uncertainty always high.
As shown by these examples, the additional uncertainty captured by EBU is very
different from aleatoric and epistemic uncertainty in EAU. In particular, the evidence-
based approach can be criticized for ignoring model uncertainty as well as properties
of the model class. Although the measures of evidence for the positive and negative
class, such as (10), are derived from the model, the evidences are “feature-based” in
the sense of considering the evidence provided by each feature in isolation. What is not
taken into account, however, is the way in which the model combines the features into
an overall prediction. In logistic regression, for example, the features are linearly com‑
bined into a single score, and the class probabilities are expressed as a function of this
score. For instance, a model like the one we considered in our example,
$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-\gamma(x_2 - x_1))},$$
assumes that the probability of the positive class is a function of the difference between x2
and x1. One may wonder, therefore, why one should consider a case where both x1 and x2
are large as a conflict (and, likewise, a case with both values being small as not providing
sufficient evidence for a prediction). From this point of view, the very idea of conflicting
(and, likewise, insufficient) evidence may appear somewhat questionable.
Fig. 3 From left to right: exponential rescaling of the credal uncertainty measure (17), epistemic uncertainty $u_e$ and aleatoric uncertainty $u_a$ for intervals $[\underline{p}, \overline{p}]$ with lower probability $\underline{p}$ (x-axis) and upper probability $\overline{p}$ (y-axis). Lighter colors indicate higher values
4.2 EAU versus CU
Credal uncertainty (sampling) seems to be closer to EAU, at least in terms of the underly‑
ing principle. In both approaches, model uncertainty is captured in terms of a set of plau‑
sible candidate models from the underlying hypothesis space, and this (epistemic) uncer‑
tainty about the right model is translated into uncertainty about the prediction for a given
x. In credal uncertainty sampling, the candidate set is given by the credal set C, which
corresponds to the distribution 𝜋Θ in EAU—as a difference, we thus note that the latter is a
“graded set”, to which a candidate 𝜃 belongs with a certain degree of membership (the rela‑
tive likelihood), whereas a credal set is a standard set in which a model is either included
or not. Using machine learning terminology, C plays the role of a version space (Mitchell
1977), whereas 𝜋Θ represents a kind of generalized (graded) version space (Hüllermeier
2003).
More specifically, the wider the interval $[\underline{p}(1 \mid x), \overline{p}(1 \mid x)]$ in (17), the larger the score s(x), with the maximum being obtained for the case [0, 1] of complete ignorance. This is well in agreement with the degree of epistemic uncertainty in EAU. In the limit, when $[\underline{p}(1 \mid x), \overline{p}(1 \mid x)]$ reduces to a precise probability p(1 | x), i.e., the epistemic uncertainty disappears, (17) is maximal for p(1 | x) = 1/2 and minimal for p(1 | x) close to 0 or 1. Again,
this behavior is in agreement with the conception of aleatoric uncertainty in EAU. More
generally, comparing two intervals of the same length, (17) will be larger for the one that
is closer to the middle point 1/2. Thus, it seems that the credal uncertainty score (17) com‑
bines both epistemic and aleatoric uncertainty in a single measure.
Yet, upon closer examination, its similarity to epistemic uncertainty is much higher than
the similarity to aleatoric uncertainty. Note that, for EAU, the special case of a credal set
C can be imitated with the measure 𝜋Θ (𝜃) = 1 if 𝜃 ∈ C and 𝜋Θ (𝜃) = 0 if 𝜃 ∉ C . Then, (19)
and (20) become
$$\pi(1 \mid x) = \sup_{\theta \in C} \max\bigl[\, 2 p_\theta(1 \mid x) - 1,\, 0 \,\bigr] = \max\bigl[\, 2 \overline{p}(1 \mid x) - 1,\, 0 \,\bigr],$$
$$\pi(0 \mid x) = \sup_{\theta \in C} \max\bigl[\, 2 p_\theta(0 \mid x) - 1,\, 0 \,\bigr] = \max\bigl[\, 1 - 2 \underline{p}(1 \mid x),\, 0 \,\bigr],$$
and $u_e$ and $u_a$ can be derived from these values as before. Figure 3 shows a graphical illustration of the credal uncertainty score⁶ (17) as a function of the probability bounds $\underline{p}$ and $\overline{p}$, and the same illustration is given for epistemic uncertainty $u_e$ and aleatoric uncertainty $u_a$. From the visual impression, it is clear that the credibility score closely resembles $u_e$, while behaving quite differently from $u_a$. This impression is corroborated by a simple correlation analysis, in which we ranked the intervals
$$[\underline{p}, \overline{p}] \in I_{a,b} = \left\{ \left[ \frac{a}{100}, \frac{b}{100} \right] \;\middle|\; a, b \in \{0, 1, \ldots, 100\},\ a \le b \right\},$$
i.e., a quantization of the class of all probability intervals, according to the different measures, and then computed the Kendall rank correlation. While the ranking according to (17) is strongly correlated with the ranking for $u_e$ (Kendall's tau is around 0.86), it is almost uncorrelated with $u_a$.

⁶ The score s is not well scaled, and may assume very large negative values. For better visibility, we therefore plotted the monotone transformation exp(s).
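This analysis is easy to reproduce; the sketch below assumes that (17) is the score (13) written in terms of the probability bounds (as in (15)–(16)) and uses the credal-set imitation above for $u_e$ and $u_a$.

```python
import numpy as np
from scipy.stats import kendalltau

# quantized probability intervals [a/100, b/100] with a <= b
intervals = [(a / 100, b / 100) for a in range(101) for b in range(a, 101)]

def scores(p, q):
    """exp-rescaled credal score, u_e, and u_a for the interval [p, q]."""
    eps = 1e-9                                 # guard the cases p = 1, q = 0
    s = np.exp(-max(p / (1 - p + eps), (1 - q) / (q + eps)))
    pi1, pi0 = max(2 * q - 1, 0.0), max(1 - 2 * p, 0.0)
    return s, min(pi1, pi0), 1.0 - max(pi1, pi0)

s, ue, ua = zip(*(scores(p, q) for p, q in intervals))
tau_e, _ = kendalltau(s, ue)   # strongly positive (around 0.86 in the text)
tau_a, _ = kendalltau(s, ua)   # close to zero
print(tau_e, tau_a)
```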
In summary, the credal uncertainty score appears to be quite similar to the measure
of epistemic uncertainty in EAU. As potential advantages of the latter, let us men‑
tion the following points. First, the degree of epistemic uncertainty is normalized and
bounded, and thus easier to interpret. Second, it is complemented by a degree of alea‑
toric uncertainty—the two degrees are carefully distinguished and have a clear seman‑
tics. Third, handling candidate models in a graded manner, and modulating their influ‑
ence according to their plausibility, appears to be more reasonable than creating an
artificial separation into plausible and non-plausible models (i.e., the credal set and its
complement).
5 Experiments
This section starts with a description of the experimental setting and the data sets used
in the experiments. Some technical details, e.g., regarding the choice of the model
parameters and instantiations of aleatoric and epistemic uncertainty, are deferred to
Sects. 1.1 and 1.2 in the appendix. Finally, the results of the experiments are presented
and analyzed.
We perform experiments on binary classification data sets from the UCI repository⁷, the
properties of which are summarized in Table 1. To make sure that the data is amenable
to all methods without the need for further preprocessing, we only selected data with
numerical features. Each data set is randomly split into 10% training, 80% pool, and 10%
test data. The training data is used to obtain an initial model. Then, in each iteration, the
learner is allowed to evaluate the instances from the pool and query a (mini-)batch of these
instances — according to the strategy of uncertainty sampling, the learner selects those
instances with the highest degrees of uncertainty. The chosen instances are labelled (by an
oracle or expert) and added to the training data 𝐃, on which the model is then re-trained.
The budget of the active learner is fixed to the size of the pool, and the performance of
the classifiers is monitored over the entire active learning process. The whole procedure is
repeated 1000 times and test accuracies are averaged. The following variants of uncertainty
sampling are included in the experimental studies:
5.1.1 Local learning
By local learning, we refer to a class of non-parametric models that derive predictions from
the training information in a local region of the instance space, for example the local neigh‑
borhood of a query instance (Bottou and Vapnik 1992; Cover and Hart 1967). As a simple
example, we consider the Parzen window classifier (Chapelle 2005), to which most of the
mentioned approaches can be applied in a quite straightforward way. For a given instance
x, we define the set of its neighbours as follows:
$$R(x, \epsilon) = \bigl\{ (x_i, y_i) \in \mathbf{D} \;\big|\; \|x_i - x\| \le \epsilon \bigr\},$$
where 𝜖 is the width of the Parzen window. In binary classification, a local region R(x, 𝜖)
can be associated with a constant hypothesis h𝜃 , 𝜃 ∈ Θ = [0, 1], where p𝜃 (1|x) = h𝜃 (x) ≡ 𝜃 .
With p and n the number of positive and negative instances, respectively, within a Parzen
window R(x, 𝜖), the likelihood function and the maximum likelihood estimate are, respec‑
tively, given by
$$L(\theta) = \binom{p+n}{p}\, \theta^p (1 - \theta)^n, \quad \text{and} \quad \hat{\theta} = \frac{p}{p+n}. \qquad (25)$$

Since the likelihood function is well-defined, we can determine the degrees of epistemic and aleatoric uncertainty as described in Sect. 3.3; we refer to Sect. A.1 for the technical details.

⁷ http://archive.ics.uci.edu/ml/index.php
⁸ In all the experiments, we use entropy as a measure of uncertainty to select unlabelled instances.
⁹ The interval-valued probabilities associated to each region can be determined using the numbers of positive and negative instances as described in (Antonucci et al. 2012).
How to determine the width 𝜖 of the Parzen window? This value is difficult to assess,
and an appropriate choice strongly depends on properties of the data and the dimension‑
ality of the instance space. Intuitively, it is even difficult to say in which range this value
should lie. Therefore, instead of fixing 𝜖 , we fixed an absolute number K of neighbors in the
training data, which is intuitively more meaningful and easier to interpret. A corresponding
value of 𝜖 is then determined in such a way that the average number of nearest neighbours
of instances xi in the training data 𝐃 is just K. In other words, 𝜖 is determined indirectly via
K. Furthermore, since we are not, in the first place, interested in maximizing performance,
but in analyzing the effectiveness of active learning approaches, we simply fix the neigh‑
borhood size K as the square root of the size of the data set (number of instances in the
initial training and pool set) as suggested by Lall and Sharma (1996). A practical algorithm
for determining 𝜖 given K, and the way in which we handle empty Parzen windows, are
also given in Sect. A.1.
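A simple sketch of this indirect choice of ε (the practical algorithm referred to above may differ in detail): scan candidate widths until the average neighbour count in the training data reaches K.

```python
import numpy as np
from scipy.spatial.distance import cdist

def width_for_k(X_train, K):
    """Smallest width epsilon for which the average number of neighbours
    of a training instance (excluding itself) is at least K."""
    D = cdist(X_train, X_train)            # pairwise distances
    for eps in np.unique(D):               # candidate widths, ascending
        if (D <= eps).sum(axis=1).mean() - 1.0 >= K:
            return float(eps)
    return float(D.max())

# K fixed to the square root of the data set size, as in the text:
# K = int(np.sqrt(n_train_plus_pool))
```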
In a similar way, the approach can be applied to decision tree learning (Quinlan 1986;
Safavian and Landgrebe 1991). In fact, recall that a decision tree partitions the instance space $\mathcal{X}$ into (rectangular) regions $R_1, \ldots, R_L$ (i.e., $\bigcup_{i=1}^L R_i = \mathcal{X}$ and $R_i \cap R_j = \emptyset$ for $i \neq j$) associated with corresponding leaves of the tree (each leaf node defines a region R). Again,
in the case of binary classification, we can assume each region R to be associated with a
constant hypothesis h𝜃 , 𝜃 ∈ Θ = [0, 1], where h𝜃 (x) ≡ 𝜃 is the probability of the positive
class. Therefore, degrees of epistemic and aleatoric uncertainty can be derived in the same
way as described for Parzen window.
For the Parzen window classifier and decision trees¹⁰, we fixed the batch size to 1% of
the initial pool dataset. For the approach based on credal uncertainty (CU), we determine
the lower and upper probabilities based on the number of positive and negative examples in
a region, following the procedure described in "Appendix 1 and 2" of (Antonucci and Cuz‑
zolin 2010). Note that the evidence-based approach (Sect. 3.1) is not immediately applica‑
ble to these learners, and therefore omitted from the experiments.
5.1.2 Logistic regression
In contrast to nonparametric, local learning methods such as the Parzen window classifier, logistic regression is a parametric class of linear models, and hence comes with comparatively restrictive assumptions. Recall that logistic regression assumes posterior probabilities to depend on feature vectors $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ in the following way:
$$h(x) = p(1 \mid x) = \frac{\exp\left(\theta_0 + \sum_{i=1}^d \theta_i x_i\right)}{1 + \exp\left(\theta_0 + \sum_{i=1}^d \theta_i x_i\right)}.$$
¹⁰ For an implementation in Python, see https://scikit-learn.org/stable/modules/tree.html
This means that learning the model comes down to estimating a parameter vector $\theta = (\theta_0, \ldots, \theta_d)$, which is commonly done through likelihood maximization (Menard 2002). For numerical stability, we employ L2-regularization, which comes down to maximizing the following strictly concave function (Rennie 2005):
$$l(\theta) = \log L(\theta) = \sum_{n=1}^N y_n \left( \theta_0 + \sum_{i=1}^d \theta_i x_{ni} \right) \qquad (26)$$
$$- \sum_{n=1}^N \ln\left( 1 + \exp\left( \theta_0 + \sum_{i=1}^d \theta_i x_{ni} \right) \right) - \frac{\gamma}{2} \sum_{i=1}^d \theta_i^2, \qquad (27)$$
where the regularization term 𝛾 is fixed to 1. On the basis of this likelihood function, the
degrees of epistemic and aleatoric uncertainty can again be determined as described in
Sect. 3.3; as before, technical details and a practical algorithm are deferred to Sect. 1.2 in
the appendix.
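For concreteness, the objective (26)–(27) can be coded as follows (a sketch; following the formula, the intercept θ₀ is not penalized):

```python
import numpy as np

def log_likelihood(theta, X, y, gamma=1.0):
    """Regularized log-likelihood (26)-(27); theta[0] is the intercept
    theta_0, theta[1:] are the feature weights."""
    z = theta[0] + X @ theta[1:]
    return (y @ z
            - np.sum(np.logaddexp(0.0, z))   # stable ln(1 + exp(z))
            - 0.5 * gamma * np.sum(theta[1:] ** 2))
```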
For the case of logistic regression, the evidence-based approach can be applied as well (cf. Sect. 3.1.3). Following Sharma and Bilgic (2017), we set the number of top uncertain instances to be evaluated to 5 times the batch size.
5.2 Results
As can be seen in Fig. 4, in the case of the Parzen window classifier, EU performs the best
and AU the worst. Moreover, standard uncertainty sampling (ENT) and random sampling
are in-between the two. This is in agreement with our expectations and supports our con‑
jecture that, from an active learning point of view, epistemic uncertainty is the more useful
information. Even if the improvements compared to ENT are not huge, they are still visible
and quite consistent. The performance provided by CU is competitive to the one of EU,
and again in agreement with our expectations — as discussed in Sect. 4.2, both CU and
EU have the ability of capturing model uncertainty. The results for decision tree learn‑
ing (cf. Fig. 5) are quite similar. Now, however, standard uncertainty sampling based on
entropy performs worse, and the advantage of epistemic uncertainty sampling is even more
pronounced.
In the case of logistic regression (cf. Fig. 6), the picture looks a bit different. Here, epis‑
temic, aleatoric, and standard uncertainty sampling perform more or less the same (and
all significantly better than random sampling), whereas no general pattern can be drawn
for the evidence-based uncertainty measures. As a plausible explanation, note that, in con‑
trast to the local learning methods in the first experiment, logistic regression comes with a
very strong learning bias in the form of a linearity assumption. Therefore, the epistemic (or
model) uncertainty disappears quite quickly: The linear decision boundary stabilizes rela‑
tively early in the learning process, and then, the learner is rather “certain” about its pre‑
dictions (regardless of whether this certainty is warranted or not). According to the logistic
model, the uncertain cases are those closest to the current decision boundary, some of them
with a slightly higher epistemic and others with a higher aleatoric uncertainty. In any case,
all three methods, EU, AU, and ENT, are sampling near the decision boundary. Thus, it is
hardly surprising that they show similar performance.
Fig. 4 Average accuracies (y-axis) for the Parzen window classifiers as a function of the number of instances queried from the pool (x-axis)

Fig. 5 Average accuracies (y-axis) for the decision trees as a function of the number of instances queried from the pool (x-axis)

Fig. 6 Average accuracies (y-axis) for logistic regression as a function of the number of instances queried from the pool (x-axis)

Overall, the experiments nevertheless confirm that, in the context of uncertainty sampling for active learning, epistemic uncertainty is a viable alternative to standard uncertainty measures like entropy: For local learning methods, in which epistemic uncer‑
tainty tends to be higher, epistemic uncertainty sampling improves upon standard uncer‑
tainty sampling, and for global methods with a strong learning bias, it performs at least on
a par. Credal uncertainty, which behaves similarly to epistemic uncertainty (cf. Sect. 4.2),
shows strong performance as well.
As an aside, note that the learning curves are not all monotone increasing, which might
be surprising at first sight. Actually, however, this kind of behavior is not uncommon and
may occur if a data set, in addition to useful examples, also comprises low-quality (e.g.,
noisy or otherwise misleading) instances. In this case, a strong active learning strategy
may succeed in selecting the informative, high-quality examples first, leading to a good
model with strong predictive performance. In the end, if the pool needs to be exhausted,
the active learner is “forced” to pick the low-quality examples, too, thereby causing a drop
in performance.
The results presented above suggest that epistemic uncertainty might be more advanta‑
geous for (active) learners with a low bias and less so for learners with a strong bias. To
corroborate this conjecture, we conducted an additional experiment, using decision trees
with a maximum depth limit as model classes. This allows for controlling the bias in a
seamless manner: The higher the depth limit, the less restricted the model class, the lower
the bias.
Figure 7 shows the learning curves for the depth limits {2, 3, 5, 10} on the data sets
blood and QSAR. As expected, different depth limits appear to be optimal for different
problems (the best limit for blood is 3, for QSAR 10). However, more interesting for our
purpose is the slope of the learning curves, which indeed seem to support our conjecture:
For epistemic uncertainty, the learning curves increase faster for larger and slower for
lower depth limits — for aleatoric uncertainty, it is just the other way around. To make
this even clearer, Fig. 8 plots the relative performance in comparison to standard (entropy-
based) uncertainty sampling, i.e., the performance ratio, both for epistemic and aleatoric
uncertainty. As can be seen, EU tends to be superior, because the ratio is mostly larger than
1, while the opposite holds for AU. Again more importantly, the depth limit (bias) is in per‑
fect agreement with the “order” of the curves: The higher the limit, the better EU (worse
AU) in terms of relative performance.
Fig. 7 Average accuracies (y-axis) for decision tree as a function of the number of instances queried from
the pool (x-axis)
Fig. 8 Performance ratio relative to standard (entropy-based) uncertainty sampling (y-axis) for decision trees as a function of the number of instances queried from the pool (x-axis)
While it is difficult to pre-define either a desirable size of the training data set or a targeted
performance level, the last criterion can be easily implemented by setting some predefined
uncertainty threshold and stopping the active learning process if the degree of uncertainty
falls below the threshold (Zhu et al. 2010).
The potential usefulness of epistemic uncertainty is confirmed by our results (shown in Fig. 9 for two data sets; results for the other data sets are similar and can be found in Sect. 2 in the appendix).
Fig. 9 Average accuracies and degrees of epistemic uncertainty (mean and maximum over instances in the pool, y-axis) as functions of the number of instances queried from the pool (x-axis)
6 Conclusion
This paper reconsiders the principle of uncertainty sampling in active learning from the
perspective of uncertainty modeling and quantification. More specifically, it starts from the
supposition that, when it comes to the question of which instances to select from a pool of candidates, a learner's predictive uncertainty due to “not knowing” should be more relevant than its uncertainty due to confirmed randomness.
To corroborate this conjecture, we revisited recent approaches to uncertainty quantifica‑
tion in machine learning, with a specific emphasis on methods that allow for separating
different types of uncertainty, and incorporated them in the general uncertainty sampling
procedure. Following a comparison and critical discussion of these approaches, a series
of experiments with different learning algorithms was conducted. In these experiments, a
distinction between so-called epistemic and aleatoric uncertainty proved to be especially
useful. More specifically, epistemic uncertainty sampling in the sense of uncertainty sam‑
pling based on measures of epistemic uncertainty in a prediction shows strong performance
and consistently improves on standard uncertainty sampling. These results, which we inter‑
pret as clear evidence in favor of our conjecture, are indeed quite plausible: Epistemic and
aleatoric uncertainty can be thought of, respectively, as the reducible and irreducible part
of the total uncertainty. Consequently, querying an instance with a high epistemic uncer‑
tainty may provide useful information for the learner, whereas an aleatorically uncertain
instance is unlikely to do so.
Given this affirmation, we are now encouraged to elaborate on epistemic uncertainty
sampling in more depth, and to develop it with more sophistication. In this regard, there are
various directions to be followed:
occur in dense regions. Overall, a small improvement in a dense region may thus be more
beneficial than a big improvement in a sparse region. This observation motivates a kind of
density weighting, i.e., the combination (multiplication) of an uncertainty degree with the
(estimated) density of a data point (Krempl et al. 2015).
• Last but not least, going beyond uncertainty sampling for binary classification as
considered in this paper, the idea of epistemic uncertainty sampling should also be
extended toward other learning problems, such as multi-class classification and regres‑
sion.
This section presents an instantiation of the approach outlined in Sect. 3.3 for the case of
local learning using a Parzen window classifier (Chapelle 2005), as well as logistic regres‑
sion. As already said, instantiating the approach essentially means to address the question
of how to compute the degrees of support (19–20), from which everything else can easily
be derived.
With the likelihood function and the maximum likelihood estimate defined in (25), the degrees of support for the positive and negative classes are
$$\pi(1 \mid x) = \sup_{\theta \in [0,1]} \min\left( \frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n},\; 2\theta - 1 \right), \qquad (28)$$
$$\pi(0 \mid x) = \sup_{\theta \in [0,1]} \min\left( \frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n},\; 1 - 2\theta \right). \qquad (29)$$
Solving (28) and (29) comes down to maximizing a scalar function over a bounded domain, for which standard solvers can be used. We applied Brent's method¹¹ (which is a variant of the golden section method) to find a local optimum in the interval $\theta \in [0, 1]$. From (28–29), the epistemic and aleatoric uncertainty associated with the region R can be derived according to (21) and (22), respectively. For different combinations of n and p, these uncertainty degrees can be pre-computed (cf. Fig. 10).

¹¹ For an implementation in Python, see https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.optimize.minimize_scalar.html

Fig. 10 From left to right: epistemic, aleatoric, and total uncertainty (epistemic + aleatoric) as a function of the numbers p, n ∈ {0, 1, … , 10} of positive and negative examples in a region (Parzen window) of the instance space (lighter colors indicate higher values)
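A sketch of this computation with scipy's bounded Brent solver; the function names are ours, but the optimization mirrors (28)–(29).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def support_parzen(p, n, positive=True):
    """Degree of support (28) or (29) for a region with p positive and n
    negative examples, obtained by minimizing the negated target over
    theta in [0, 1] with a bounded Brent method."""
    theta_ml = p / (p + n) if p + n > 0 else 0.5
    norm = theta_ml**p * (1 - theta_ml)**n       # maximum of the likelihood

    def neg_target(theta):
        likelihood = theta**p * (1 - theta)**n / norm
        margin = 2 * theta - 1 if positive else 1 - 2 * theta
        return -min(likelihood, margin)

    res = minimize_scalar(neg_target, bounds=(0.0, 1.0), method="bounded")
    return -res.fun

def eau_parzen(p, n):
    """Epistemic (21) and aleatoric (22) uncertainty for counts (p, n);
    these values can be tabulated once for all small p, n (cf. Fig. 10)."""
    pi1 = support_parzen(p, n, positive=True)
    pi0 = support_parzen(p, n, positive=False)
    return min(pi1, pi0), 1.0 - max(pi1, pi0)
```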
which corresponds to a Bayesian learning approach with Dirichlet priors with parameters
{t(y) | y ∈ {0, 1}}. The problem of learning the lower and upper probabilities (Antonucci
and Cuzzolin 2010) can be rewritten as
$$\underline{p}(1 \mid x) = \inf_{t(1)} \frac{p + q \cdot t(1)}{p + n + q} \quad \text{and} \quad \overline{p}(1 \mid x) = \sup_{t(1)} \frac{p + q \cdot t(1)}{p + n + q},$$
where
$$\frac{\eta}{2} \le t(y) \le 1 - \frac{\eta}{2} \;\; (y \in \{0, 1\}), \qquad \sum_{y \in \{0,1\}} t(y) = 1, \qquad \eta \in (0, 1).$$
Thus,
$$\underline{p}(1 \mid x) = \frac{p + q \cdot \frac{\eta}{2}}{p + n + q} \quad \text{and} \quad \overline{p}(1 \mid x) = \frac{p + q \cdot \left(1 - \frac{\eta}{2}\right)}{p + n + q}.$$
In the experiments, we set q = 1 and η = 0.001, and refer to (Bernard 2005) for an in-depth discussion on choosing these parameters.
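In code, these bounds are one line each; the function below uses the parameter choices just mentioned.

```python
def probability_interval(p, n, q=1.0, eta=0.001):
    """Lower and upper probability of the positive class for a region
    with p positive and n negative examples (imprecise Dirichlet model,
    with q = 1 and eta = 0.001 as in the text)."""
    lower = (p + q * eta / 2.0) / (p + n + q)
    upper = (p + q * (1.0 - eta / 2.0)) / (p + n + q)
    return lower, upper
```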
In the following, we present a practical procedure for determining the degree of support
(19) for the positive class, and then summarize the results for the negative class (which
can be determined in a similar manner). Associating each hypothesis h ∈ H with a vector
$\theta \in \mathbb{R}^{d+1}$, the degree of support (19) can be rewritten as follows:
$$\pi(1 \mid x) = \sup_{\theta \in \mathbb{R}^{d+1}} \min\bigl[ \pi(\theta),\, 2h(x) - 1 \bigr]. \qquad (30)$$
It is easy to see that the target function to be maximized in (30) is not necessarily concave.
Therefore, we propose the following approach.
Let us first note that whenever h(x) < 0.5, we have $2h(x) - 1 \le 0$ and $\min\bigl[ \pi_{\mathcal{H}}(h),\, 2h(x) - 1 \bigr] \le 0$. Thus, the optimal value of the target function (19) can only be achieved for some hypotheses h such that $h(x) \in [0.5, 1]$. For a given value $\alpha \in [0.5, 1]$, the set of hypotheses h such that $h(x) = \alpha$ corresponds to the convex set
$$\theta^\alpha = \left\{ \theta \;\middle|\; \theta_0 + \sum_{i=1}^d \theta_i x_i = \ln\left( \frac{\alpha}{1 - \alpha} \right) \right\}. \qquad (31)$$
The optimal value $\pi_\alpha^*(1 \mid x)$ that can be achieved within the region (31) can be determined as follows:
$$\pi_\alpha^*(1 \mid x) = \sup_{\theta \in \theta^\alpha} \min\bigl[ \pi(\theta),\, 2\alpha - 1 \bigr] = \min\Bigl[ \sup_{\theta \in \theta^\alpha} \pi(\theta),\, 2\alpha - 1 \Bigr]. \qquad (32)$$
Thus, to find this value, we maximize the concave log-likelihood over a convex set:
$$\theta_\alpha^* = \arg\sup_{\theta \in \theta^\alpha} l(\theta). \qquad (33)$$
As the log-likelihood function (26) is concave and has second-order derivatives, we tackle the problem with a Newton-CG algorithm (Nocedal and Wright 2006). Furthermore, the optimization problem (33) can be solved using sequential least squares programming¹² (Philip and Elizabeth 2010). Since the regions defined in (31) are parallel hyperplanes, the solution of the optimization problem (19) can then be obtained by solving the following problem:
$$\sup_{\alpha \in [0.5, 1)} \pi_\alpha^*(1 \mid x) = \sup_{\alpha \in [0.5, 1)} \min\bigl[ \pi(\theta_\alpha^*),\, 2\alpha - 1 \bigr]. \qquad (34)$$
¹² For an implementation in Python, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
Following a similar procedure, we can estimate the degree of support for the negative class
(20) as follows:
$$\sup_{\alpha \in (0, 0.5]} \pi_\alpha^*(0 \mid x) = \sup_{\alpha \in (0, 0.5]} \min\bigl[ \pi(\theta_\alpha^*),\, 1 - 2\alpha \bigr]. \qquad (35)$$
Note that limit cases 𝛼 = 1 and 𝛼 = 0 cannot be solved, since the region (31) is then not
well-defined (as ln(∞) and ln(0) do not exist). For the purpose of practical implementation,
we handle (34) by discretizing the interval over 𝛼 . That is, we optimize the target function
for a given number of values 𝛼 ∈ [0.5, 1) and consider the solution corresponding to the 𝛼
with the highest optimal value of the target function 𝜋𝛼∗ (1 | x) as the maximum estimator.
Similarly, (35) can be handled over the domain (0, 0.5]. In practice, we evaluate (34) and
(35) on uniform discretizations of cardinality 50 of [0.5, 1) and (0, 0.5], respectively. We
can further increase efficiency by avoiding computations for values of 𝛼 for which we know
that 2𝛼 − 1 and 1 − 2𝛼 are lower than the current highest support value given to class 1 and
0, respectively. See Algorithm 3 for a pseudo-code description of the whole procedure.
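The sketch below puts the pieces together for the positive class; it is a simplified rendering of the procedure (Algorithm 3 itself is not reproduced here), with SLSQP used for the constrained maximization (33) and the pruning step described above.

```python
import numpy as np
from scipy.optimize import minimize

def support_logreg(X, y, x, theta_ml, positive=True, n_alpha=50, gamma=1.0):
    """Degree of support (19) or (20) via discretization of alpha: for
    each alpha, maximize the log-likelihood (26)-(27) over the
    hyperplane (31) and keep the best min[pi(theta), 2*alpha - 1]
    (resp. 1 - 2*alpha)."""
    def neg_ll(theta):
        z = theta[0] + X @ theta[1:]
        return -(y @ z - np.sum(np.logaddexp(0.0, z))
                 - 0.5 * gamma * np.sum(theta[1:] ** 2))

    l_ml = -neg_ll(theta_ml)                     # maximum log-likelihood
    grid = (np.linspace(0.5, 1.0, n_alpha, endpoint=False) if positive
            else np.linspace(0.5, 0.0, n_alpha, endpoint=False))
    best = 0.0
    for alpha in grid:
        margin = 2 * alpha - 1 if positive else 1 - 2 * alpha
        if margin <= best:                       # pruning step from the text
            continue
        cons = {"type": "eq",                    # constraint (31): h(x) = alpha
                "fun": lambda t, a=alpha: t[0] + x @ t[1:] - np.log(a / (1 - a))}
        res = minimize(neg_ll, theta_ml, method="SLSQP", constraints=[cons])
        pi = np.exp(-res.fun - l_ml)             # relative likelihood (18)
        best = max(best, min(pi, margin))
    return best
```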
In addition to the experiments presented in Sect. 5.4, similar results for the other data sets are presented in Figs. 11, 12 and 13.
Fig. 11 Average accuracies and degrees of epistemic uncertainty (y-axis) for Parzen window classifiers as
functions of the number of instances queried from the pool (x-axis)
Fig. 12 Average accuracies and degrees of epistemic uncertainty (y-axis) for decision tree as functions of
the number of instances queried from the pool (x-axis)
Fig. 13 Average accuracies and degrees of epistemic uncertainty (y-axis) for logistic regression as functions
of the number of instances queried from the pool (x-axis)
Acknowledgements The authors gratefully acknowledge financial support by the German Research Foundation (DFG) under grant number 400845550 as well as the Federal Ministry for Education and Research (BMBF). Valuable technical support was provided by the Paderborn Center for Parallel Computing (PC2).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Antonucci, A., & Cuzzolin, F. (2010). Credal sets approximation by lower probabilities: Application to credal networks. In: Proceedings of the 13th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), Springer, pp. 716–725.
Antonucci, A., Corani, G., & Gabaglio, S. (2012). Active learning by the naive credal classifier. In: Proceedings of the Sixth European Workshop on Probabilistic Graphical Models (PGM), pp. 3–10.
Bernard, J. M. (2005). An introduction to the imprecise Dirichlet model for multinomial data. International Journal of Approximate Reasoning, 39(2–3), 123–150.
Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57(298), 269–306.
Bottou, L., & Vapnik, V. (1992). Local learning algorithms. Neural Computation, 4(6), 888–900.
Chapelle, O. (2005). Active learning for Parzen window classifier. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS), 5, 49–56.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
De Campos, L. M., Huete, J. F., & Moral, S. (1994). Probability intervals: A tool for uncertain reasoning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2(02), 167–196.
Depeweg, S., Hernandez-Lobato, J., Doshi-Velez, F., & Udluft, S. (2018). Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In: Proceedings of the 35th International Conference on Machine Learning (ICML), vol 3, pp. 1920–1934.
Freytag, A., Rodner, E., & Denzler, J. (2014). Selecting influential examples: Active learning with expected model output changes. In: Proceedings of the 13th European Conference on Computer Vision (ECCV), Springer, pp. 562–577.
Fu, Y., Zhu, X., & Li, B. (2013). A survey on instance selection for active learning. Knowledge and Information Systems, pp. 1–35.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85.
Hora, S. C. (1996). Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering and System Safety, 54(2–3), 217–223.
Hüllermeier, E. (2003). Inducing fuzzy concepts through extended version space learning. In: Bilgic T, De Baets B, Kaynak O (eds) Proc. IFSA-03, 10th International Fuzzy Systems Association World Congress, Springer, Istanbul, no. 2715 in LNAI, pp. 677–648.
Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: A tutorial introduction. Machine Learning (arXiv preprint arXiv:1910.09457).
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS).
Krempl, G., Kottke, D., & Lemaire, V. (2015). Optimised probabilistic active learning (OPAL). Machine Learning, 100(2), 449–476.
Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679–693.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In: Proceedings of the 17th Annual International SIGIR Conference on Research and Development in Information Retrieval, Springer, pp. 3–12.
Li, M., & Sethi, I. K. (2006). Confidence-based active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8), 1251–1261.
Lindenbaum, M., Markovitch, S., & Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2), 125–152.
McCallum, A., & Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In: Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 350–358.
Menard, S. (2002). Applied logistic regression analysis. Sage Publishing.
Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. In: Proceedings of the 5th International Joint Conference on Artificial Intelligence (IJCAI), pp. 305–310.
Nguyen, V.-L., Destercke, S., & Hüllermeier, E. (2019). Epistemic uncertainty sampling. In: Proceedings of the 22nd International Conference on Discovery Science (DS), Springer, pp. 72–86.
Nocedal, J., & Wright, S. (2006). Numerical optimization. New York: Springer.
Philip, E., & Elizabeth, W. (2010). Sequential quadratic programming methods. UCSD Department of Mathematics Technical Report NA-10-03.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Rennie, J. D. (2005). Regularized logistic regression is strictly convex. Technical report, MIT.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660–674.
Senge, R., Bösner, S., Dembczyński, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., & Hüllermeier, E. (2014). Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255, 16–29.
Settles, B. (2009). Active learning literature survey. Technical Report, University of Wisconsin, Madison (TR1648).
Settles, B., & Craven, M. (2008). An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the 8th Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 1070–1079.
Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In: Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT), pp. 287–294.
Shaker, M. H., & Hüllermeier, E. (2020). Aleatoric and epistemic uncertainty with random forests. In: Proceedings of the Eighteenth International Symposium on Intelligent Data Analysis (IDA), Springer, pp. 444–456.
Sharma, M., & Bilgic, M. (2013). Most-surely vs. least-surely uncertain. In: Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), IEEE, pp. 667–676.
Sharma, M., & Bilgic, M. (2017). Evidence-based uncertainty sampling for active learning. Data Mining and Knowledge Discovery, 31(1), 164–202.
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
Walley, P., & Moral, S. (1999). Upper probabilities based only on the likelihood function. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(4), 831–847.
Zaffalon, M. (2002). The naive credal classifier. Journal of Statistical Planning and Inference, 105(1), 5–21.
Zhu, J., Wang, H., Hovy, E., & Ma, M. (2010). Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing, 6(3), 1–24.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.