How to measure uncertainty in uncertainty sampling
https://doi.org/10.1007/s10994-021-06003-9
Abstract
Various strategies for active learning have been proposed in the machine learning literature. In uncertainty sampling, which is among the most popular approaches, the active learner sequentially queries the label of those instances for which its current prediction is maximally uncertain. The predictions as well as the measures used to quantify the degree of uncertainty, such as entropy, are traditionally of a probabilistic nature. Yet, alternative approaches to capturing uncertainty in machine learning, along with corresponding uncertainty measures, have been proposed in recent years. In particular, some of these measures seek to distinguish different sources and to separate different types of uncertainty, such as the reducible (epistemic) and the irreducible (aleatoric) part of the total uncertainty in a prediction. The goal of this paper is to elaborate on the usefulness of such measures for uncertainty sampling, and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.
* Eyke Hüllermeier
[email protected]
Vu-Linh Nguyen
[email protected]
Mohammad Hossein Shaker
[email protected]
1 Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
2 Heinz Nixdorf Institute and Department of Computer Science, Paderborn University, Paderborn, Germany
1 Introduction
2 Uncertainty sampling
In this section, we briefly recall the basic setting of uncertainty sampling. As usual in
active learning, we assume to be given a labelled set of training data 𝐃 and a pool of unlabeled instances 𝐔 that can be queried by the learner:
$$\mathbf{D} = \bigl\{ (x_1, y_1), \ldots, (x_N, y_N) \bigr\}, \qquad \mathbf{U} = \bigl\{ x_1, \ldots, x_J \bigr\}.$$
Instances are represented as feature vectors $x_i = (x_i^1, \ldots, x_i^d) \in \mathcal{X} = \mathbb{R}^d$. In this paper, we only consider the case of binary classification, where labels $y_i$ are taken from $\mathcal{Y} = \{0, 1\}$, leaving the more general case of multi-class classification for future work. We denote by $\mathcal{H} \subset \mathcal{Y}^{\mathcal{X}}$ the underlying hypothesis space, i.e., the class of candidate models $h : \mathcal{X} \longrightarrow \mathcal{Y}$ the learner can choose from. Often, hypotheses are parametrized by a parameter vector $\theta \in \Theta$; in this case, we equate a hypothesis $h = h_\theta \in \mathcal{H}$ with the parameter $\theta$, and the model space $\mathcal{H}$ with the parameter space $\Theta$.
In uncertainty sampling, instances are queried in a greedy fashion. Given the current
model 𝜃 that has been trained on 𝐃, each instance xj in the current pool 𝐔 is assigned a
utility score s(𝜃, xj ), and the next instance to be queried is the one with the highest score
(Lewis and Gale 1994; Settles 2009; Settles and Craven 2008; Sharma and Bilgic 2017).
The chosen instance is labelled (by an oracle or expert) and added to the training data 𝐃,
on which the model is then re-trained. The active learning process for a given budget B (i.e.,
the number of unlabelled instances to be queried) is summarized in Algorithm 1.
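Since Algorithm 1 itself is not reproduced in this excerpt, the following minimal sketch illustrates the greedy query loop it describes. The scikit-learn-style model interface and the `oracle` and `score` helpers are assumptions made for the example, not part of the original pseudo-code.

```python
import numpy as np

def uncertainty_sampling(model, X_train, y_train, X_pool, oracle, score, budget):
    """Greedy uncertainty sampling (cf. Algorithm 1): query the pool
    instance with the highest utility score, label it, and re-train."""
    model.fit(X_train, y_train)
    for _ in range(budget):
        # utility s(theta, x_j) for every instance in the current pool
        utilities = np.array([score(model, x) for x in X_pool])
        j = int(np.argmax(utilities))            # most uncertain instance
        x, y = X_pool[j], oracle(X_pool[j])      # query its label
        X_train = np.vstack([X_train, x])        # add (x, y) to D
        y_train = np.append(y_train, y)
        X_pool = np.delete(X_pool, j, axis=0)    # remove x from the pool
        model.fit(X_train, y_train)              # re-train on the new D
    return model
```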
– the entropy:
$$s(\theta, x) = -\sum_{y \in \mathcal{Y}} p_\theta(y \mid x) \log p_\theta(y \mid x), \qquad (1)$$
3 Measures of uncertainty
In this section, we present different frameworks for measuring the learner’s uncertainty
in a query instance: evidence-based uncertainty (EBU), credal uncertainty (CU), and an
approach focusing on a distinction between epistemic and aleatoric uncertainty (EAU).
While the first one has been specifically developed for the purpose of active learning, the
other two are more general approaches to uncertainty quantification in machine learning.
Yet, their potential usefulness for active learning has been pointed out as well (Antonucci
et al. 2012; Nguyen et al. 2019).
In their evidence-based uncertainty sampling approach, Sharma and Bilgic (2013, 2017)
propose to differentiate between uncertainty due to conflicting evidence and insufficient
evidence. The corresponding measures of conflicting-evidence uncertainty and insufficient-
evidence uncertainty are mainly motivated for the Naïve Bayes (NB) classifier as a learning
algorithm. In the spirit of this classifier, evidence-based uncertainty sampling first looks
at the influence of individual features $x^m$ in the feature representation $x = (x^1, \ldots, x^d)$ of instances. More specifically, given the current model $\theta$, denote by $p_\theta(x^m \mid 0)$ and $p_\theta(x^m \mid 1)$
the class-conditional probabilities on the values of the mth feature. For a given instance x,
the authors partition the set of features into those that provide evidence for the positive and
for the negative class, respectively:
$$P_\theta(x) = \left\{ x^m \;\middle|\; \frac{p_\theta(x^m \mid 1)}{p_\theta(x^m \mid 0)} > 1 \right\}, \qquad (4)$$
$$N_\theta(x) = \left\{ x^m \;\middle|\; \frac{p_\theta(x^m \mid 0)}{p_\theta(x^m \mid 1)} > 1 \right\}. \qquad (5)$$
Then, the total evidence for the positive and the negative class is determined as follows:
$$E_1(x) = \prod_{x^m \in P_\theta(x)} \frac{p_\theta(x^m \mid 1)}{p_\theta(x^m \mid 0)}, \qquad (6)$$
$$E_0(x) = \prod_{x^m \in N_\theta(x)} \frac{p_\theta(x^m \mid 0)}{p_\theta(x^m \mid 1)}. \qquad (7)$$
The authors consider a situation as conflicting evidence if both E0(x) and E1(x) are high, because in such a situation, there is strong evidence in favor of the positive as well as strong evidence in favor of the negative class. Likewise, a situation in which both evidences are low is considered as insufficient evidence. Measuring these conditions in terms of the product¹ E1(x) × E0(x), the conflicting evidence-based approach simply queries the instance with the highest conflicting evidence, while the insufficient evidence-based approach looks for the one with the highest insufficient evidence:
$$x^*_{conf} = \arg\max_{x \in \mathbf{S}} E_1(x) \times E_0(x), \qquad (8)$$
$$x^*_{insuf} = \arg\min_{x \in \mathbf{S}} E_1(x) \times E_0(x). \qquad (9)$$
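As a concrete illustration of (4)–(8), the sketch below computes the two evidence scores for a Naïve Bayes model over discrete features and applies the conflicting-evidence selection; the nested table layout `cond_prob[m][v][y]` for the class-conditional probabilities is an assumption made for the example.

```python
import numpy as np

def evidence_scores(cond_prob, x):
    """E_1(x) and E_0(x) as in (6)-(7); cond_prob[m][v][y] is assumed to
    hold p(x^m = v | y) for feature m, value v, and class y."""
    e1, e0 = 1.0, 1.0
    for m, v in enumerate(x):
        ratio = cond_prob[m][v][1] / cond_prob[m][v][0]
        if ratio > 1:        # feature m is in P(x): evidence for class 1
            e1 *= ratio
        elif ratio < 1:      # feature m is in N(x): evidence for class 0
            e0 *= 1.0 / ratio
    return e1, e0

def query_conflicting(S, cond_prob):
    """Pick from the top-t uncertain instances S the one with the highest
    conflicting evidence (8); argmin instead would give the
    insufficient-evidence variant (9)."""
    products = [np.prod(evidence_scores(cond_prob, x)) for x in S]
    return S[int(np.argmax(products))]
```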
Note that the selection is restricted to the set 𝐒 of instances x in the pool 𝐔 having the high‑
est scores s(𝜃, x) according to standard uncertainty sampling; the size of this set, t = |𝐒|, is
a parameter of the method (and hence a hyper-parameter for the active learning algorithm).
The restriction to the most uncertain cases puts evidence-based uncertainty sampling close
to standard uncertainty sampling. Instead of using conflicting-evidence and insufficient-
evidence uncertainties as selection criteria on their own, they are merely used for prioritiz‑
ing cases that appear to be uncertain in the traditional sense.
Interestingly, to motivate their approach, Sharma and Bilgic (2017) note that “regardless
of whether we want to maximize or minimize E1 (x) × E0 (x), we want to guarantee that
the underlying model is uncertain about the chosen instance”, thereby suggesting that
the evidence-based uncertainties alone do not necessarily inform about this uncertainty.
Indeed, it is true that these uncertainties are not easy to interpret (see also our discussion in
Sect. 4.1), and that their relationship to standard uncertainty measures is not fully obvious.
In particular, note that the latter also comprises the influence of the prior class probabil‑
ities, which is completely neglected by the evidence-based uncertainties (which only look
at the likelihood). This is especially relevant in the case of imbalanced class distributions.
In such cases, evidence-based uncertainty may strongly deviate from standard uncertainty,
i.e., the entropy of the posterior distribution. For instance, E0(x) and E1(x) could both be very large, and p𝜃(x | 0) ≈ p𝜃(x | 1), although p𝜃(0 | x) is very different from p𝜃(1 | x) due to unequal prior odds, and hence the entropy is small. Likewise, the entropy of the posterior can be large although both evidence-based uncertainties are small.
¹ This is the measure used in (Sharma and Bilgic 2013, 2017). The authors mention, however, that other measures could be used as well.
The evidence-based approach to uncertainty sampling has been introduced with a focus on Naïve Bayes as a base learner. In this regard, we would like to note that uncertainty sampling for this learner might be considered critical in general.
It is clear that active learning may always incorporate a bias, simply because the data is
no longer produced by sampling independently according to the true underlying distribu‑
tion. Thus, the data is no longer completely representative. While this may affect any learn‑
ing algorithm, the effect appears to be especially strong for NB, so that uncertainty sam‑
pling for NB appears to be questionable in general. In fact, a sample bias has a very direct
influence on the probabilities estimated by NB. In particular, the estimated class priors are
strongly biased toward the conditional class probabilities of those instances with a high
uncertainty, because these are sampled more often. This bias may in turn affect the classi‑
fier as a whole, and lead to suboptimal predictions.
As an illustration of the problem, let us consider a small example with only two binary
attributes x1 and x2. This example may appear unrealistic, because the instance space is
finite and actually quite small. Please note, however, that even in practice NB is typically
applied to discrete attributes with finite domains (possibly after a discretization of numeri‑
cal attributes in a pre-processing step).
Suppose the class priors to be given by p(y = 0) = 0.3 and p(y = 1) = 0.7, and the class-conditional probabilities as follows:
$$p(x_1 = 1 \mid y = 0) = 0.4, \qquad p(x_1 = 1 \mid y = 1) = 0.2,$$
$$p(x_2 = 1 \mid y = 0) = 0.8, \qquad p(x_2 = 1 \mid y = 1) = 0.4.$$
Now, consider an active learner that can sample from a large (in principle infinite) pool of
unlabeled data points (i.e., multiple copies of each of the four instances). Since the second
instance (x1 , x2 ) = (0, 1) has the highest entropy, standard uncertainty sampling will sooner
or later focus on this instance and sample it over and over again². This of course has an
influence on the estimation of priors and conditional probabilities by NB. In particular, the
estimated class priors p̂ (y = 0) and p̂ (y = 1) will converge to the conditional posteriors,
i.e., the posteriors of y given (x1 , x2 ) = (0, 1). Consequently, we will produce a bias in the
estimates, and will obtain
² We confirmed this in a simulation study.
p̂ (y = 0 | x1 = 0, x2 = 0) ≈ 0.19 , p̂ (y = 1 | x1 = 0, x2 = 0) ≈ 0.81 ,
p̂ (y = 0 | x1 = 0, x2 = 1) ≈ 0.38 , p̂ (y = 1 | x1 = 0, x2 = 1) ≈ 0.62 ,
p̂ (y = 0 | x1 = 1, x2 = 0) ≈ 0.24 , p̂ (y = 1 | x1 = 1, x2 = 0) ≈ 0.76 ,
p̂ (y = 0 | x1 = 1, x2 = 1) ≈ 0.45 , p̂ (y = 1 | x1 = 1, x2 = 1) ≈ 0.56 .
As one can see, this will even have an effect on the Bayes-optimal predictor: For
(x1 , x2 ) = (1, 1), the prediction will be ŷ = 1 instead of the actually optimal prediction
ŷ = 0. Similar effects can be found for the evidence-based approach. For example, when
applying the insufficient evidence approach, it can happen that the active learner will com‑
pletely focus on the third instance, which has the highest insufficient evidence, and then
produce the following estimates:
p̂ (y = 0 | x1 = 0, x2 = 0) ≈ 0.14 , p̂ (y = 1 | x1 = 0, x2 = 0) ≈ 0.86 ,
p̂ (y = 0 | x1 = 0, x2 = 1) ≈ 0.36 , p̂ (y = 1 | x1 = 0, x2 = 1) ≈ 0.64 ,
p̂ (y = 0 | x1 = 1, x2 = 0) ≈ 0.23 , p̂ (y = 1 | x1 = 1, x2 = 0) ≈ 0.78 ,
p̂ (y = 0 | x1 = 1, x2 = 1) ≈ 0.49 , p̂ (y = 1 | x1 = 1, x2 = 1) ≈ 0.51 .
As explained above, the approach by Sharma and Bilgic (2017) is specifically tai‑
lored for Naïve Bayes as a learning algorithm. Yet, the authors also propose vari‑
ants of their measures for logistic regression and support vector machines. For exam‑
ple, if the decision boundary obtained by fitting a logistic regression model is given by $h_\theta(x) = \theta_0 + \sum_{i=1}^d \theta_i \cdot x^i$, the evidences for the positive and the negative class are defined, respectively, as follows (Sharma and Bilgic 2017):
$$E_1(x) = \sum_{x^m \in P_\theta(x)} \theta_m \cdot x^m, \qquad E_0(x) = -\sum_{x^m \in N_\theta(x)} \theta_m \cdot x^m, \qquad (10)$$
where
$$P_\theta(x) = \bigl\{ x^m \mid \theta_m \cdot x^m > 0 \bigr\}, \qquad N_\theta(x) = \bigl\{ x^m \mid \theta_m \cdot x^m < 0 \bigr\}. \qquad (11)$$
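In code, the evidences (10)–(11) reduce to summing the positive and the negated negative per-feature contributions of a fitted weight vector; the array-based layout below is an assumption for the sketch.

```python
import numpy as np

def evidence_scores_logreg(theta, x):
    """Evidences (10)-(11) for logistic regression; `theta` holds the
    feature weights only (the bias theta_0 plays no role here, which is
    exactly the limitation discussed for Fig. 2 below)."""
    contrib = theta * x                    # per-feature contributions
    e1 = contrib[contrib > 0].sum()        # evidence for the positive class
    e0 = -contrib[contrib < 0].sum()       # evidence for the negative class
    return e1, e0
```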
Consider an instance space $\mathcal{X}$, output space $\mathcal{Y} = \{0, 1\}$, and a hypothesis space $\mathcal{H}$ consisting of probabilistic classifiers $h : \mathcal{X} \longrightarrow [0, 1]$. Assuming that each hypothesis $h = h_\theta$ is identified by a (unique) parameter vector $\theta \in \Theta$, we can equate $\mathcal{H}$ with the parameter space $\Theta$.
The credal uncertainty sampling approach simply looks for the instance x with the highest uncertainty, i.e., the least evidence for the dominance of one of the classes. In the case of binary classification with $\mathcal{Y} = \{0, 1\}$, this is expressed by the score
$$s(x) = -\max\bigl( \gamma(1, 0, x),\, \gamma(0, 1, x) \bigr). \qquad (13)$$
Such interval-valued probabilities can be produced within the framework of the Naïve
credal classifier (Antonucci et al. 2012; Antonucci and Cuzzolin 2010; De Campos et al.
1994; Zaffalon 2002). In the case of binary classification, where p𝜃 (0 | x) = 1 − p𝜃 (1 | x),
the score 𝛾(1, 0, x) can be rewritten as follows:
$$\gamma(1, 0, x) = \inf_{\theta \in C} \frac{p_\theta(1 \mid x)}{p_\theta(0 \mid x)} = \inf_{\theta \in C} \frac{p_\theta(1 \mid x)}{1 - p_\theta(1 \mid x)} = \frac{\underline{p}(1 \mid x)}{1 - \underline{p}(1 \mid x)}. \qquad (15)$$
Likewise,
$$\gamma(0, 1, x) = \inf_{\theta \in C} \frac{p_\theta(0 \mid x)}{p_\theta(1 \mid x)} = \inf_{\theta \in C} \frac{1 - p_\theta(1 \mid x)}{p_\theta(1 \mid x)} = \frac{1 - \overline{p}(1 \mid x)}{\overline{p}(1 \mid x)}. \qquad (16)$$
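As a minimal sketch, given interval-valued probabilities for the positive class that are strictly between 0 and 1, the resulting selection score is:

```python
def credal_score(p_low, p_up):
    """Score (13) from an interval [p_low, p_up] for the positive class,
    using (15)-(16); assumes 0 < p_low <= p_up < 1."""
    gamma_10 = p_low / (1.0 - p_low)   # evidence that class 1 dominates
    gamma_01 = (1.0 - p_up) / p_up     # evidence that class 0 dominates
    return -max(gamma_10, gamma_01)    # high score = little dominance
```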
A distinction between the epistemic and aleatoric uncertainty (Hora 1996) in a prediction for an instance x has been motivated by Senge et al. (2014).³ Their approach is based on the use of relative likelihoods, historically proposed by Birnbaum (1962) and then justified in other settings such as possibility theory (Walley and Moral 1999).

³ More recently, this distinction has also attracted attention in the deep learning community (Kendall and Gal 2017).
Given a set of training data $\mathbf{D} = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, the normalized likelihood of a model $h_\theta$ is defined as
$$\pi_\Theta(\theta) = \frac{L(\theta)}{L(\theta^{ml})} = \frac{L(\theta)}{\max_{\theta' \in \Theta} L(\theta')}, \qquad (18)$$
where $L(\theta) = \prod_{i=1}^N p_\theta(y_i \mid x_i)$ is the likelihood of $\theta$, and $\theta^{ml} \in \Theta$ is the maximum likelihood estimate on the training data. For a given instance x, the degrees of support (plausibility) of the two classes are defined as follows:
$$\pi(1 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, p_\theta(1 \mid x) - p_\theta(0 \mid x) \bigr],$$
$$\pi(0 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, p_\theta(0 \mid x) - p_\theta(1 \mid x) \bigr].$$
So, $\pi(1 \mid x)$ is high if and only if a highly plausible model supports the positive class much more strongly (in terms of the assigned probability mass) than the negative class (and $\pi(0 \mid x)$ can be interpreted analogously).⁴ Note that, with $f(a) = 2a - 1$, we can also write
$$\pi(1 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, f(h_\theta(x)) \bigr], \qquad (19)$$
$$\pi(0 \mid x) = \sup_{\theta \in \Theta} \min\bigl[ \pi_\Theta(\theta),\, f(1 - h_\theta(x)) \bigr]. \qquad (20)$$
Given the above degrees of support, the degrees of epistemic uncertainty ue and aleatoric
uncertainty ua are defined as follows:
$$u_e(x) = \min\bigl[ \pi(1 \mid x),\, \pi(0 \mid x) \bigr], \qquad (21)$$
$$u_a(x) = 1 - \max\bigl[ \pi(1 \mid x),\, \pi(0 \mid x) \bigr]. \qquad (22)$$
Thus, epistemic uncertainty refers to the case where both the positive and the negative
class appear to be plausible, while the degree of aleatoric uncertainty (22) is the degree to
which none of the classes is supported. Roughly speaking, aleatoric uncertainty is due to
influences on the data-generating process that are inherently random, whereas epistemic
uncertainty is caused by a lack of knowledge. Or, stated differently, ue and ua measure the
reducible and the irreducible part of the total uncertainty, respectively.
It is thus tempting to assume that epistemic uncertainty is more relevant for active learn‑
ing: While it makes sense to query additional class labels in regions where uncertainty can
be reduced, doing so in regions of high aleatoric uncertainty appears to be less reasonable.
This leads us to suggest the principle of epistemic uncertainty sampling, which prescribes
the selection
$$x^* = \arg\max_{x \in \mathbf{U}} u_e(x). \qquad (23)$$
⁴ Technically, we assume that, for each $x \in \mathcal{X}$, there are hypotheses $h, h' \in \mathcal{H}$ such that $h(x) \ge 0.5$ and $h'(x) \le 0.5$, which implies $\pi(1 \mid x) \ge 0$ and $\pi(0 \mid x) \ge 0$.
For comparison, we will also consider an analogous selection rule based on the aleatoric
uncertainty, i.e.,
$$x^* = \arg\max_{x \in \mathbf{U}} u_a(x). \qquad (24)$$
As already said, this approach is completely generic and can in principle be instantiated
with any hypothesis space H. The uncertainty measures (21–22) can be derived very eas‑
ily from the support degrees (19–20). The computation of the latter may become difficult,
however, as it requires the solution of an optimization problem, the properties of which
depend on the choice of H. We are going to present practical methods to determine (19–20)
for the cases of a simple Parzen window classifier and logistic regression in Sects. A.1
and A.2, respectively.
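As a rough illustration (not the authors' exact procedure, which is developed in Sects. A.1 and A.2), the quantities (19)–(22) can be approximated by brute force over a finite grid of candidate models:

```python
import numpy as np

def eau_degrees(pi_theta, p_pos):
    """Approximate (19)-(22) over a finite grid of candidate models:
    pi_theta[k] is the normalized likelihood of candidate k, p_pos[k] its
    predicted probability h_theta(x) of the positive class; the sup over
    Theta is replaced by a max over the grid."""
    f = 2.0 * p_pos - 1.0                       # f(h_theta(x)) = 2a - 1
    pi1 = np.max(np.minimum(pi_theta, f))       # support of class 1, (19)
    pi0 = np.max(np.minimum(pi_theta, -f))      # f(1 - a) = -f(a), (20)
    ue = min(pi1, pi0)                          # epistemic uncertainty (21)
    ua = 1.0 - max(pi1, pi0)                    # aleatoric uncertainty (22)
    return ue, ua
```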
[Two count tables with columns b1 and b2, one per scenario; the entries were not recovered in extraction.]
In both scenarios, the insufficient evidence would be high, because all class-conditional probabilities are equal. In EAU, however, the first scenario would largely be a case of epistemic uncertainty, due to the small number of training examples, whereas the second
would be aleatoric, because the equal posteriors⁵ are sufficiently “confirmed”. Similar
remarks apply to conflicting evidence. In the scenario
        b1        b2
a1    (1, 1)    (10, 1)
a2    (1, 10)   (1, 1)
the latter would be high for (a1 , b1 ), because p𝜃 (a1 | 1) ≫ p𝜃 (a1 | 0) and
p𝜃 (b1 | 0) ≫ p𝜃 (b1 | 1). The same holds for (a2 , b2 ), whereas the uncertainties for (a1 , b2 )
and (a2 , b1 ) would be low. Note, however, that in all these cases, exactly the same condi‑
tional probability estimates p𝜃 (xm | 1) and p𝜃 (xm | 0) are involved.
We would argue that epistemic uncertainty should directly refer to these probabilities,
because they constitute the parameter 𝜃 of the model. Thus, to reduce epistemic uncertainty
(about the right model 𝜃), one should look for those examples that will most improve the
estimation of these probabilities. Aleatoric uncertainty may occur in cases of posteriors
close to 1/2, in which the conflicting evidence may indeed be high (although, as already
mentioned, the latter ignores the class priors). Yet, we would not necessarily call such
cases a “conflict”, because the predictions are completely in agreement with the underlying
model (Naïve Bayes), which assumes class-conditional independence of attributes, i.e., an
independent combination of evidences on different attributes.
Fig. 1 Two scenarios for logistic regression: training data with positive (red crosses) and negative examples (black circles) and five query instances
Another illustration is provided in Fig. 1, now for the case of logistic regression. A first
important observation is that the uncertainties due to conflicting and insufficient evidence
are exactly the same in both scenarios in Fig. 1, the left and the right one. This is because
these uncertainties are merely derived from the single model h𝜃 learned from the training
data. Thus, like in the example for Naïve Bayes, the evidence-based approach does not
capture model uncertainty, i.e., uncertainty about the truly optimal model (which is clearly
larger on the left and smaller on the right), which EAU essentially measures in terms of
epistemic uncertainty.
⁵ The class priors are ignored here.

Fig. 2 Two scenarios for logistic regression: training data with positive (red crosses) and negative examples (black circles) and five query instances

For the first three queries, the evidence-based uncertainties are very different: The first query has a high insufficient-evidence uncertainty, the third has a high
conflicting-evidence uncertainty, and the second none of the two. According to EAU,
the uncertainties for these three cases are all high (because they are all located close
to the decision boundary) and, more importantly, of the same nature: mostly aleatoric
in the right and a mix of aleatoric and epistemic in the left scenario. For the second,
fourth and fifth query, the evidence-based uncertainties are roughly the same. Again,
this is very different from EAU, which assigns a high uncertainty to the second but
very low uncertainties to the fourth and fifth query.
Figure 2 shows two very similar scenarios, but now with a bias term. As already
said, the evidence-based approach does not account for such a bias. Moreover, since w1 ≈ 0
(the first feature does not seem to have an influence), there is essentially no negative
evidence, i.e., E0 (x) is always close to 0. Consequently, the product E0 (x) × E1 (x) will
be small, too, suggesting that conflicting-evidence uncertainty is always low and insuf‑
ficient-evidence uncertainty always high.
As shown by these examples, the additional uncertainty captured by EBU is very
different from aleatoric and epistemic uncertainty in EAU. In particular, the evidence-
based approach can be criticized for ignoring model uncertainty as well as properties
of the model class. Although the measures of evidence for the positive and negative
class, such as (10), are derived from the model, the evidences are “feature-based” in
the sense of considering the evidence provided by each feature in isolation. What is not
taken into account, however, is the way in which the model combines the features into
an overall prediction. In logistic regression, for example, the features are linearly com‑
bined into a single score, and the class probabilities are expressed as a function of this
score. For instance, a model like the one we considered in our example,
$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-\gamma(x_2 - x_1))},$$
assumes that the probability of the positive class is a function of the difference between x2
and x1. One may wonder, therefore, why one should consider a case where both x1 and x2
are large as a conflict (and, likewise, a case with both values being small as not providing
sufficient evidence for a prediction). From this point of view, the very idea of conflicting
(and, likewise, insufficient) evidence may appear somewhat questionable.
Fig. 3 From left to right: exponential rescaling of the credal uncertainty measure (17), epistemic uncertainty $u_e$ and aleatoric uncertainty $u_a$ for intervals $[\underline{p}, \overline{p}]$ with lower probability $\underline{p}$ (x-axis) and upper probability $\overline{p}$ (y-axis). Lighter colors indicate higher values
4.2 EAU versus CU
Credal uncertainty (sampling) seems to be closer to EAU, at least in terms of the underly‑
ing principle. In both approaches, model uncertainty is captured in terms of a set of plau‑
sible candidate models from the underlying hypothesis space, and this (epistemic) uncer‑
tainty about the right model is translated into uncertainty about the prediction for a given
x. In credal uncertainty sampling, the candidate set is given by the credal set C, which
corresponds to the distribution 𝜋Θ in EAU—as a difference, we thus note that the latter is a
“graded set”, to which a candidate 𝜃 belongs with a certain degree of membership (the rela‑
tive likelihood), whereas a credal set is a standard set in which a model is either included
or not. Using machine learning terminology, C plays the role of a version space (Mitchell
1977), whereas 𝜋Θ represents a kind of generalized (graded) version space (Hüllermeier
2003).
More specifically, the wider the interval $[\underline{p}(1 \mid x), \overline{p}(1 \mid x)]$ in (17), the larger the score s(x), with the maximum being obtained for the case [0, 1] of complete ignorance. This is well in agreement with the degree of epistemic uncertainty in EAU. In the limit, when $[\underline{p}(1 \mid x), \overline{p}(1 \mid x)]$ reduces to a precise probability p(1 | x), i.e., the epistemic uncertainty disappears, (17) is maximal for p(1 | x) = 1/2 and minimal for p(1 | x) close to 0 or 1. Again,
this behavior is in agreement with the conception of aleatoric uncertainty in EAU. More
generally, comparing two intervals of the same length, (17) will be larger for the one that
is closer to the middle point 1/2. Thus, it seems that the credal uncertainty score (17) com‑
bines both epistemic and aleatoric uncertainty in a single measure.
Yet, upon closer examination, its similarity to epistemic uncertainty is much higher than
the similarity to aleatoric uncertainty. Note that, for EAU, the special case of a credal set
C can be imitated with the measure 𝜋Θ (𝜃) = 1 if 𝜃 ∈ C and 𝜋Θ (𝜃) = 0 if 𝜃 ∉ C . Then, (19)
and (20) become
$$\pi(1 \mid x) = \sup_{\theta \in C} \max\bigl[\, 2 p_\theta(1 \mid x) - 1,\, 0 \,\bigr] = \max\bigl[\, 2 \overline{p}(1 \mid x) - 1,\, 0 \,\bigr],$$
$$\pi(0 \mid x) = \sup_{\theta \in C} \max\bigl[\, 2 p_\theta(0 \mid x) - 1,\, 0 \,\bigr] = \max\bigl[\, 1 - 2 \underline{p}(1 \mid x),\, 0 \,\bigr],$$
and $u_e$ and $u_a$ can be derived from these values as before. Figure 3 shows a graphical illustration of the credal uncertainty score⁶ (17) as a function of the probability bounds $\underline{p}$ and $\overline{p}$, and the same illustration is given for epistemic uncertainty $u_e$ and aleatoric uncertainty $u_a$. From the visual impression, it is clear that the credibility score closely resembles $u_e$, while behaving quite differently from $u_a$. This impression is corroborated by a simple correlation analysis, in which we ranked the intervals
$$[\underline{p}, \overline{p}] \in I_{a,b} = \left\{ \left[ \frac{a}{100}, \frac{b}{100} \right] \;\middle|\; a, b \in \{0, 1, \ldots, 100\},\ a \le b \right\},$$
i.e., a quantization of the class of all probability intervals, according to the different measures, and then computed the Kendall rank correlation. While the ranking according to (17) is strongly correlated with the ranking for $u_e$ (Kendall's tau is around 0.86), it is almost uncorrelated with $u_a$.

⁶ The score s is not well scaled, and may assume very large negative values. For better visibility, we therefore plotted the monotone transformation exp(s).
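This analysis is easy to reproduce; the sketch below assumes that (17) is the score (13) written in terms of the probability bounds (as in (15)–(16)) and uses the credal-set imitation above for $u_e$ and $u_a$.

```python
import numpy as np
from scipy.stats import kendalltau

# quantized probability intervals [a/100, b/100] with a <= b
intervals = [(a / 100, b / 100) for a in range(101) for b in range(a, 101)]

def scores(p, q):
    """exp-rescaled credal score, u_e, and u_a for the interval [p, q]."""
    eps = 1e-9                                 # guard the cases p = 1, q = 0
    s = np.exp(-max(p / (1 - p + eps), (1 - q) / (q + eps)))
    pi1, pi0 = max(2 * q - 1, 0.0), max(1 - 2 * p, 0.0)
    return s, min(pi1, pi0), 1.0 - max(pi1, pi0)

s, ue, ua = zip(*(scores(p, q) for p, q in intervals))
tau_e, _ = kendalltau(s, ue)   # strongly positive (around 0.86 in the text)
tau_a, _ = kendalltau(s, ua)   # close to zero
print(tau_e, tau_a)
```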
In summary, the credal uncertainty score appears to be quite similar to the measure
of epistemic uncertainty in EAU. As potential advantages of the latter, let us men‑
tion the following points. First, the degree of epistemic uncertainty is normalized and
bounded, and thus easier to interpret. Second, it is complemented by a degree of alea‑
toric uncertainty—the two degrees are carefully distinguished and have a clear seman‑
tics. Third, handling candidate models in a graded manner, and modulating their influ‑
ence according to their plausibility, appears to be more reasonable than creating an
artificial separation into plausible and non-plausible models (i.e., the credal set and its
complement).
5 Experiments
This section starts with a description of the experimental setting and the data sets used
in the experiments. Some technical details, e.g., regarding the choice of the model
parameters and instantiations of aleatoric and epistemic uncertainty, are deferred to
Sects. 1.1 and 1.2 in the appendix. Finally, the results of the experiments are presented
and analyzed.
We perform experiments on binary classification data sets from the UCI repository⁷, the
properties of which are summarized in Table 1. To make sure that the data is amenable
to all methods without the need for further preprocessing, we only selected data with
numerical features. Each data set is randomly split into 10% training, 80% pool, and 10%
test data. The training data is used to obtain an initial model. Then, in each iteration, the
learner is allowed to evaluate the instances from the pool and query a (mini-)batch of these
instances — according to the strategy of uncertainty sampling, the learner selects those
instances with the highest degrees of uncertainty. The chosen instances are labelled (by an
oracle or expert) and added to the training data 𝐃, on which the model is then re-trained.
The budget of the active learner is fixed to the size of the pool, and the performance of
the classifiers is monitored over the entire active learning process. The whole procedure is
repeated 1000 times and test accuracies are averaged. The following variants of uncertainty
sampling are included in the experimental studies:
5.1.1 Local learning
By local learning, we refer to a class of non-parametric models that derive predictions from
the training information in a local region of the instance space, for example the local neigh‑
borhood of a query instance (Bottou and Vapnik 1992; Cover and Hart 1967). As a simple
example, we consider the Parzen window classifier (Chapelle 2005), to which most of the
mentioned approaches can be applied in a quite straightforward way. For a given instance
x, we define the set of its neighbours as follows:
$$R(x, \epsilon) = \bigl\{ (x_i, y_i) \in \mathbf{D} \;\big|\; \|x_i - x\| \le \epsilon \bigr\},$$
where 𝜖 is the width of the Parzen window. In binary classification, a local region R(x, 𝜖)
can be associated with a constant hypothesis h𝜃 , 𝜃 ∈ Θ = [0, 1], where p𝜃 (1|x) = h𝜃 (x) ≡ 𝜃 .
With p and n the number of positive and negative instances, respectively, within a Parzen
window R(x, 𝜖), the likelihood function and the maximum likelihood estimate are, respec‑
tively, given by
$$L(\theta) = \binom{p+n}{p}\, \theta^p (1 - \theta)^n, \quad \text{and} \quad \hat{\theta} = \frac{p}{p+n}. \qquad (25)$$

Since the likelihood function is well-defined, we can determine the degrees of epistemic and aleatoric uncertainty as described in Sect. 3.3; we refer to Sect. A.1 for the technical details.

⁷ http://archive.ics.uci.edu/ml/index.php
⁸ In all the experiments, we use entropy as a measure of uncertainty to select unlabelled instances.
⁹ The interval-valued probabilities associated to each region can be determined using the numbers of positive and negative instances as described in (Antonucci et al. 2012).
How to determine the width 𝜖 of the Parzen window? This value is difficult to assess,
and an appropriate choice strongly depends on properties of the data and the dimension‑
ality of the instance space. Intuitively, it is even difficult to say in which range this value
should lie. Therefore, instead of fixing 𝜖 , we fixed an absolute number K of neighbors in the
training data, which is intuitively more meaningful and easier to interpret. A corresponding
value of 𝜖 is then determined in such a way that the average number of nearest neighbours
of instances xi in the training data 𝐃 is just K. In other words, 𝜖 is determined indirectly via
K. Furthermore, since we are not, in the first place, interested in maximizing performance,
but in analyzing the effectiveness of active learning approaches, we simply fix the neigh‑
borhood size K as the square root of the size of the data set (number of instances in the
initial training and pool set) as suggested by Lall and Sharma (1996). A practical algorithm
for determining 𝜖 given K, and the way in which we handle empty Parzen windows, are
also given in Sect. A.1.
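A simple sketch of this indirect choice of ε (the practical algorithm referred to above may differ in detail): scan candidate widths until the average neighbour count in the training data reaches K.

```python
import numpy as np
from scipy.spatial.distance import cdist

def width_for_k(X_train, K):
    """Smallest width epsilon for which the average number of neighbours
    of a training instance (excluding itself) is at least K."""
    D = cdist(X_train, X_train)            # pairwise distances
    for eps in np.unique(D):               # candidate widths, ascending
        if (D <= eps).sum(axis=1).mean() - 1.0 >= K:
            return float(eps)
    return float(D.max())

# K fixed to the square root of the data set size, as in the text:
# K = int(np.sqrt(n_train_plus_pool))
```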
In a similar way, the approach can be applied to decision tree learning (Quinlan 1986;
Safavian and Landgrebe 1991). In fact, recall that a decision tree partitions the instance space $\mathcal{X}$ into (rectangular) regions $R_1, \ldots, R_L$ (i.e., $\bigcup_{i=1}^L R_i = \mathcal{X}$ and $R_i \cap R_j = \emptyset$ for $i \neq j$) associated with corresponding leaves of the tree (each leaf node defines a region R). Again,
in the case of binary classification, we can assume each region R to be associated with a
constant hypothesis h𝜃 , 𝜃 ∈ Θ = [0, 1], where h𝜃 (x) ≡ 𝜃 is the probability of the positive
class. Therefore, degrees of epistemic and aleatoric uncertainty can be derived in the same
way as described for Parzen window.
For the Parzen window classifier and decision trees¹⁰, we fixed the batch size to 1% of
the initial pool dataset. For the approach based on credal uncertainty (CU), we determine
the lower and upper probabilities based on the number of positive and negative examples in
a region, following the procedure described in "Appendix 1 and 2" of (Antonucci and Cuz‑
zolin 2010). Note that the evidence-based approach (Sect. 3.1) is not immediately applica‑
ble to these learners, and therefore omitted from the experiments.
5.1.2 Logistic regression
In contrast to nonparametric, local learning methods such as the Parzen window classifier, logistic regression is a parametric class of linear models, and hence comes with comparatively restrictive assumptions. Recall that logistic regression assumes posterior probabilities to depend on feature vectors $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ in the following way:
$$h(x) = p(1 \mid x) = \frac{\exp\left(\theta_0 + \sum_{i=1}^d \theta_i x_i\right)}{1 + \exp\left(\theta_0 + \sum_{i=1}^d \theta_i x_i\right)}.$$
¹⁰ For an implementation in Python, see https://scikit-learn.org/stable/modules/tree.html
This means that learning the model comes down to estimating a parameter vector $\theta = (\theta_0, \ldots, \theta_d)$, which is commonly done through likelihood maximization (Menard 2002). For numerical stability, we employ L2-regularization, which comes down to maximizing the following strictly concave function (Rennie 2005):
$$l(\theta) = \log L(\theta) = \sum_{n=1}^N y_n \left( \theta_0 + \sum_{i=1}^d \theta_i x_{ni} \right) \qquad (26)$$
$$- \sum_{n=1}^N \ln\left( 1 + \exp\left( \theta_0 + \sum_{i=1}^d \theta_i x_{ni} \right) \right) - \frac{\gamma}{2} \sum_{i=1}^d \theta_i^2, \qquad (27)$$
where the regularization term 𝛾 is fixed to 1. On the basis of this likelihood function, the
degrees of epistemic and aleatoric uncertainty can again be determined as described in
Sect. 3.3; as before, technical details and a practical algorithm are deferred to Sect. 1.2 in
the appendix.
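For concreteness, the objective (26)–(27) can be coded as follows (a sketch; following the formula, the intercept θ₀ is not penalized):

```python
import numpy as np

def log_likelihood(theta, X, y, gamma=1.0):
    """Regularized log-likelihood (26)-(27); theta[0] is the intercept
    theta_0, theta[1:] are the feature weights."""
    z = theta[0] + X @ theta[1:]
    return (y @ z
            - np.sum(np.logaddexp(0.0, z))   # stable ln(1 + exp(z))
            - 0.5 * gamma * np.sum(theta[1:] ** 2))
```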
For the case of logistic regression, the evidence-based approach can be applied as well (cf. Sect. 3.1.3). Following Sharma and Bilgic (2017), we set the number of top uncertain instances to be evaluated to 5 times the batch size.
5.2 Results
As can be seen in Fig. 4, in the case of the Parzen window classifier, EU performs the best
and AU the worst. Moreover, standard uncertainty sampling (ENT) and random sampling
are in-between the two. This is in agreement with our expectations and supports our con‑
jecture that, from an active learning point of view, epistemic uncertainty is the more useful
information. Even if the improvements compared to ENT are not huge, they are still visible
and quite consistent. The performance provided by CU is competitive to the one of EU,
and again in agreement with our expectations — as discussed in Sect. 4.2, both CU and
EU have the ability of capturing model uncertainty. The results for decision tree learn‑
ing (cf. Fig. 5) are quite similar. Now, however, standard uncertainty sampling based on
entropy performs worse, and the advantage of epistemic uncertainty sampling is even more
pronounced.
In the case of logistic regression (cf. Fig. 6), the picture looks a bit different. Here, epis‑
temic, aleatoric, and standard uncertainty sampling perform more or less the same (and
all significantly better than random sampling), whereas no general pattern can be drawn
for the evidence-based uncertainty measures. As a plausible explanation, note that, in con‑
trast to the local learning methods in the first experiment, logistic regression comes with a
very strong learning bias in the form of a linearity assumption. Therefore, the epistemic (or
model) uncertainty disappears quite quickly: The linear decision boundary stabilizes rela‑
tively early in the learning process, and then, the learner is rather “certain” about its pre‑
dictions (regardless of whether this certainty is warranted or not). According to the logistic
model, the uncertain cases are those closest to the current decision boundary, some of them
with a slightly higher epistemic and others with a higher aleatoric uncertainty. In any case,
all three methods, EU, AU, and ENT, are sampling near the decision boundary. Thus, it is
hardly surprising that they show similar performance.
Fig. 4 Average accuracies (y-axis) for the Parzen window classifiers as a function of the number of instances queried from the pool (x-axis)

Fig. 5 Average accuracies (y-axis) for the decision trees as a function of the number of instances queried from the pool (x-axis)

Fig. 6 Average accuracies (y-axis) for logistic regression as a function of the number of instances queried from the pool (x-axis)

Overall, the experiments nevertheless confirm that, in the context of uncertainty sampling for active learning, epistemic uncertainty is a viable alternative to standard uncertainty measures like entropy: For local learning methods, in which epistemic uncer‑
tainty tends to be higher, epistemic uncertainty sampling improves upon standard uncer‑
tainty sampling, and for global methods with a strong learning bias, it performs at least on
a par. Credal uncertainty, which behaves similarly to epistemic uncertainty (cf. Sect. 4.2),
shows strong performance as well.
As an aside, note that the learning curves are not all monotone increasing, which might
be surprising at first sight. Actually, however, this kind of behavior is not uncommon and
may occur if a data set, in addition to useful examples, also comprises low-quality (e.g.,
noisy or otherwise misleading) instances. In this case, a strong active learning strategy
may succeed in selecting the informative, high-quality examples first, leading to a good
model with strong predictive performance. In the end, if the pool needs to be exhausted,
the active learner is “forced” to pick the low-quality examples, too, thereby causing a drop
in performance.
The results presented above suggest that epistemic uncertainty might be more advanta‑
geous for (active) learners with a low bias and less so for learners with a strong bias. To
corroborate this conjecture, we conducted an additional experiment, using decision trees
with a maximum depth limit as model classes. This allows for controlling the bias in a
seamless manner: The higher the depth limit, the less restricted the model class, the lower
the bias.
Figure 7 shows the learning curves for the depth limits {2, 3, 5, 10} on the data sets
blood and QSAR. As expected, different depth limits appear to be optimal for different
problems (the best limit for blood is 3, for QSAR 10). However, more interesting for our
purpose is the slope of the learning curves, which indeed seem to support our conjecture:
For epistemic uncertainty, the learning curves increase faster for larger and slower for
lower depth limits — for aleatoric uncertainty, it is just the other way around. To make
this even clearer, Fig. 8 plots the relative performance in comparison to standard (entropy-
based) uncertainty sampling, i.e., the performance ratio, both for epistemic and aleatoric
uncertainty. As can be seen, EU tends to be superior, because the ratio is mostly larger than
1, while the opposite holds for AU. Again more importantly, the depth limit (bias) is in per‑
fect agreement with the “order” of the curves: The higher the limit, the better EU (worse
AU) in terms of relative performance.
Fig. 7 Average accuracies (y-axis) for decision tree as a function of the number of instances queried from
the pool (x-axis)
Fig. 8 Performance ratio relative to standard (entropy-based) uncertainty sampling (y-axis) for decision trees as a function of the number of instances queried from the pool (x-axis)
While it is difficult to pre-define either a desirable size of the training data set or a targeted
performance level, the last criterion can be easily implemented by setting some predefined
uncertainty threshold and stopping the active learning process if the degree of uncertainty
falls below the threshold (Zhu et al. 2010).
The potential usefulness of epistemic uncertainty is confirmed by our results (shown in Fig. 9 for two data sets; results for the other data sets are similar and can be found in Sect. 2 in the appendix).
Fig. 9 Average accuracies and degrees of epistemic uncertainty (mean and maximum over instances in the pool, y-axis) as functions of the number of instances queried from the pool (x-axis)
6 Conclusion
This paper reconsiders the principle of uncertainty sampling in active learning from the
perspective of uncertainty modeling and quantification. More specifically, it starts from the
supposition that, when it comes to the question of which instances to select from a pool of candidates, a learner's predictive uncertainty due to “not knowing” should be more relevant than its uncertainty due to confirmed randomness.
To corroborate this conjecture, we revisited recent approaches to uncertainty quantifica‑
tion in machine learning, with a specific emphasis on methods that allow for separating
different types of uncertainty, and incorporated them in the general uncertainty sampling
procedure. Following a comparison and critical discussion of these approaches, a series
of experiments with different learning algorithms was conducted. In these experiments, a
distinction between so-called epistemic and aleatoric uncertainty proved to be especially
useful. More specifically, epistemic uncertainty sampling in the sense of uncertainty sam‑
pling based on measures of epistemic uncertainty in a prediction shows strong performance
and consistently improves on standard uncertainty sampling. These results, which we inter‑
pret as clear evidence in favor of our conjecture, are indeed quite plausible: Epistemic and
aleatoric uncertainty can be thought of, respectively, as the reducible and irreducible part
of the total uncertainty. Consequently, querying an instance with a high epistemic uncer‑
tainty may provide useful information for the learner, whereas an aleatorically uncertain
instance is unlikely to do so.
Given this affirmation, we are now encouraged to elaborate on epistemic uncertainty
sampling in more depth, and to develop it with more sophistication. In this regard, there are
various directions to be followed:
occur in dense regions. Overall, a small improvement in a dense region may thus be more
beneficial than a big improvement in a sparse region. This observation motivates a kind of
density weighting, i.e., the combination (multiplication) of an uncertainty degree with the
(estimated) density of a data point (Krempl et al. 2015).
• Last but not least, going beyond uncertainty sampling for binary classification as
considered in this paper, the idea of epistemic uncertainty sampling should also be
extended toward other learning problems, such as multi-class classification and regres‑
sion.
This section presents an instantiation of the approach outlined in Sect. 3.3 for the case of
local learning using a Parzen window classifier (Chapelle 2005), as well as logistic regres‑
sion. As already said, instantiating the approach essentially means to address the question
of how to compute the degrees of support (19–20), from which everything else can easily
be derived.
With the likelihood function and the maximum likelihood estimate defined in (25), the degrees of support for the positive and negative classes are
$$\pi(1 \mid x) = \sup_{\theta \in [0,1]} \min\left( \frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n},\; 2\theta - 1 \right), \qquad (28)$$
$$\pi(0 \mid x) = \sup_{\theta \in [0,1]} \min\left( \frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n},\; 1 - 2\theta \right). \qquad (29)$$
Solving (28) and (29) comes down to maximizing a scalar function over a bounded domain, for which standard solvers can be used. We applied Brent's method¹¹ (which is a variant of the golden section method) to find a local optimum in the interval $\theta \in [0, 1]$. From (28–29), the epistemic and aleatoric uncertainty associated with the region R can be derived according to (21) and (22), respectively. For different combinations of n and p, these uncertainty degrees can be pre-computed (cf. Fig. 10).

¹¹ For an implementation in Python, see https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.optimize.minimize_scalar.html

Fig. 10 From left to right: epistemic, aleatoric, and total uncertainty (epistemic + aleatoric) as a function of the numbers p, n ∈ {0, 1, … , 10} of positive and negative examples in a region (Parzen window) of the instance space (lighter colors indicate higher values)
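A sketch of this computation with scipy's bounded Brent solver; the function names are ours, but the optimization mirrors (28)–(29).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def support_parzen(p, n, positive=True):
    """Degree of support (28) or (29) for a region with p positive and n
    negative examples, obtained by minimizing the negated target over
    theta in [0, 1] with a bounded Brent method."""
    theta_ml = p / (p + n) if p + n > 0 else 0.5
    norm = theta_ml**p * (1 - theta_ml)**n       # maximum of the likelihood

    def neg_target(theta):
        likelihood = theta**p * (1 - theta)**n / norm
        margin = 2 * theta - 1 if positive else 1 - 2 * theta
        return -min(likelihood, margin)

    res = minimize_scalar(neg_target, bounds=(0.0, 1.0), method="bounded")
    return -res.fun

def eau_parzen(p, n):
    """Epistemic (21) and aleatoric (22) uncertainty for counts (p, n);
    these values can be tabulated once for all small p, n (cf. Fig. 10)."""
    pi1 = support_parzen(p, n, positive=True)
    pi0 = support_parzen(p, n, positive=False)
    return min(pi1, pi0), 1.0 - max(pi1, pi0)
```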
which corresponds to a Bayesian learning approach with Dirichlet priors with parameters
{t(y) | y ∈ {0, 1}}. The problem of learning the lower and upper probabilities (Antonucci
and Cuzzolin 2010) can be rewritten as
$$\underline{p}(1 \mid x) = \inf_{t(1)} \frac{p + q \cdot t(1)}{p + n + q} \quad \text{and} \quad \overline{p}(1 \mid x) = \sup_{t(1)} \frac{p + q \cdot t(1)}{p + n + q},$$
where
$$\frac{\eta}{2} \le t(y) \le 1 - \frac{\eta}{2} \;\; (y \in \{0, 1\}), \qquad \sum_{y \in \{0,1\}} t(y) = 1, \qquad \eta \in (0, 1).$$
Thus,
$$\underline{p}(1 \mid x) = \frac{p + q \cdot \frac{\eta}{2}}{p + n + q} \quad \text{and} \quad \overline{p}(1 \mid x) = \frac{p + q \cdot \left(1 - \frac{\eta}{2}\right)}{p + n + q}.$$
In the experiments, we set q = 1 and η = 0.001, and refer to (Bernard 2005) for an in-depth discussion on choosing these parameters.
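In code, these bounds are one line each; the function below uses the parameter choices just mentioned.

```python
def probability_interval(p, n, q=1.0, eta=0.001):
    """Lower and upper probability of the positive class for a region
    with p positive and n negative examples (imprecise Dirichlet model,
    with q = 1 and eta = 0.001 as in the text)."""
    lower = (p + q * eta / 2.0) / (p + n + q)
    upper = (p + q * (1.0 - eta / 2.0)) / (p + n + q)
    return lower, upper
```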
In the following, we present a practical procedure for determining the degree of support
(19) for the positive class, and then summarize the results for the negative class (which
can be determined in a similar manner). Associating each hypothesis h ∈ H with a vector
$\theta \in \mathbb{R}^{d+1}$, the degree of support (19) can be rewritten as follows:
$$\pi(1 \mid x) = \sup_{\theta \in \mathbb{R}^{d+1}} \min\bigl[ \pi(\theta),\, 2h(x) - 1 \bigr]. \qquad (30)$$
It is easy to see that the target function to be maximized in (30) is not necessarily concave.
Therefore, we propose the following approach.
Let us first note that whenever h(x) < 0.5, we have $2h(x) - 1 \le 0$ and $\min\bigl[ \pi_{\mathcal{H}}(h),\, 2h(x) - 1 \bigr] \le 0$. Thus, the optimal value of the target function (19) can only be achieved for some hypotheses h such that $h(x) \in [0.5, 1]$. For a given value $\alpha \in [0.5, 1]$, the set of hypotheses h such that $h(x) = \alpha$ corresponds to the convex set
$$\theta^\alpha = \left\{ \theta \;\middle|\; \theta_0 + \sum_{i=1}^d \theta_i x_i = \ln\left( \frac{\alpha}{1 - \alpha} \right) \right\}. \qquad (31)$$
The optimal value $\pi_\alpha^*(1 \mid x)$ that can be achieved within the region (31) can be determined as follows:
$$\pi_\alpha^*(1 \mid x) = \sup_{\theta \in \theta^\alpha} \min\bigl[ \pi(\theta),\, 2\alpha - 1 \bigr] = \min\Bigl[ \sup_{\theta \in \theta^\alpha} \pi(\theta),\, 2\alpha - 1 \Bigr]. \qquad (32)$$
Thus, to find this value, we maximize the concave log-likelihood over a convex set:
$$\theta_\alpha^* = \arg\sup_{\theta \in \theta^\alpha} l(\theta). \qquad (33)$$
As the log-likelihood function (26) is concave and has second-order derivatives, we tackle the problem with a Newton-CG algorithm (Nocedal and Wright 2006). Furthermore, the optimization problem (33) can be solved using sequential least squares programming¹² (Philip and Elizabeth 2010). Since the regions defined in (31) are parallel hyperplanes, the solution of the optimization problem (19) can then be obtained by solving the following problem:
$$\sup_{\alpha \in [0.5, 1)} \pi_\alpha^*(1 \mid x) = \sup_{\alpha \in [0.5, 1)} \min\bigl[ \pi(\theta_\alpha^*),\, 2\alpha - 1 \bigr]. \qquad (34)$$
¹² For an implementation in Python, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
Following a similar procedure, we can estimate the degree of support for the negative class
(20) as follows:
$$\sup_{\alpha \in (0, 0.5]} \pi_\alpha^*(0 \mid x) = \sup_{\alpha \in (0, 0.5]} \min\bigl[ \pi(\theta_\alpha^*),\, 1 - 2\alpha \bigr]. \qquad (35)$$
Note that limit cases 𝛼 = 1 and 𝛼 = 0 cannot be solved, since the region (31) is then not
well-defined (as ln(∞) and ln(0) do not exist). For the purpose of practical implementation,
we handle (34) by discretizing the interval over 𝛼 . That is, we optimize the target function
for a given number of values 𝛼 ∈ [0.5, 1) and consider the solution corresponding to the 𝛼
with the highest optimal value of the target function 𝜋𝛼∗ (1 | x) as the maximum estimator.
Similarly, (35) can be handled over the domain (0, 0.5]. In practice, we evaluate (34) and
(35) on uniform discretizations of cardinality 50 of [0.5, 1) and (0, 0.5], respectively. We
can further increase efficiency by avoiding computations for values of 𝛼 for which we know
that 2𝛼 − 1 and 1 − 2𝛼 are lower than the current highest support value given to class 1 and
0, respectively. See Algorithm 3 for a pseudo-code description of the whole procedure.
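The sketch below puts the pieces together for the positive class; it is a simplified rendering of the procedure (Algorithm 3 itself is not reproduced here), with SLSQP used for the constrained maximization (33) and the pruning step described above.

```python
import numpy as np
from scipy.optimize import minimize

def support_logreg(X, y, x, theta_ml, positive=True, n_alpha=50, gamma=1.0):
    """Degree of support (19) or (20) via discretization of alpha: for
    each alpha, maximize the log-likelihood (26)-(27) over the
    hyperplane (31) and keep the best min[pi(theta), 2*alpha - 1]
    (resp. 1 - 2*alpha)."""
    def neg_ll(theta):
        z = theta[0] + X @ theta[1:]
        return -(y @ z - np.sum(np.logaddexp(0.0, z))
                 - 0.5 * gamma * np.sum(theta[1:] ** 2))

    l_ml = -neg_ll(theta_ml)                     # maximum log-likelihood
    grid = (np.linspace(0.5, 1.0, n_alpha, endpoint=False) if positive
            else np.linspace(0.5, 0.0, n_alpha, endpoint=False))
    best = 0.0
    for alpha in grid:
        margin = 2 * alpha - 1 if positive else 1 - 2 * alpha
        if margin <= best:                       # pruning step from the text
            continue
        cons = {"type": "eq",                    # constraint (31): h(x) = alpha
                "fun": lambda t, a=alpha: t[0] + x @ t[1:] - np.log(a / (1 - a))}
        res = minimize(neg_ll, theta_ml, method="SLSQP", constraints=[cons])
        pi = np.exp(-res.fun - l_ml)             # relative likelihood (18)
        best = max(best, min(pi, margin))
    return best
```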
In addition to the experiments presented in Sect. 5.4, similar results for the other data sets are presented in Figs. 11, 12 and 13.
Fig. 11 Average accuracies and degrees of epistemic uncertainty (y-axis) for Parzen window classifiers as
functions of the number of instances queried from the pool (x-axis)
Fig. 12 Average accuracies and degrees of epistemic uncertainty (y-axis) for decision tree as functions of
the number of instances queried from the pool (x-axis)
Fig. 13 Average accuracies and degrees of epistemic uncertainty (y-axis) for logistic regression as functions
of the number of instances queried from the pool (x-axis)
Acknowledgements The authors gratefully acknowledge financial support by the German Research Foundation (DFG) under grant number 400845550 as well as the Federal Ministry for Education and Research (BMBF). Valuable technical support was provided by the Paderborn Center for Parallel Computing (PC2).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Antonucci, A., & Cuzzolin, F. (2010). Credal sets approximation by lower probabilities: Application to credal networks. In: Proceedings of the 13th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), Springer, pp. 716–725.
Antonucci, A., Corani, G., & Gabaglio, S. (2012). Active learning by the naive credal classifier. In: Proceedings of the Sixth European Workshop on Probabilistic Graphical Models (PGM), pp. 3–10.
Bernard, J. M. (2005). An introduction to the imprecise Dirichlet model for multinomial data. International Journal of Approximate Reasoning, 39(2–3), 123–150.
Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57(298), 269–306.
Bottou, L., & Vapnik, V. (1992). Local learning algorithms. Neural Computation, 4(6), 888–900.
Chapelle, O. (2005). Active learning for Parzen window classifier. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS), 5, 49–56.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
De Campos, L. M., Huete, J. F., & Moral, S. (1994). Probability intervals: A tool for uncertain reasoning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2(02), 167–196.
Depeweg, S., Hernandez-Lobato, J., Doshi-Velez, F., & Udluft, S. (2018). Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In: Proceedings of the 35th International Conference on Machine Learning (ICML), vol 3, pp. 1920–1934.
Freytag, A., Rodner, E., & Denzler, J. (2014). Selecting influential examples: Active learning with expected model output changes. In: Proceedings of the 13th European Conference on Computer Vision (ECCV), Springer, pp. 562–577.
Fu, Y., Zhu, X., & Li, B. (2013). A survey on instance selection for active learning. Knowledge and Information Systems, pp. 1–35.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85.
Hora, S. C. (1996). Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering and System Safety, 54(2–3), 217–223.
Hüllermeier, E. (2003). Inducing fuzzy concepts through extended version space learning. In: Bilgic T, De Baets B, Kaynak O (eds) Proc. IFSA-03, 10th International Fuzzy Systems Association World Congress, Springer, Istanbul, no. 2715 in LNAI, pp. 677–648.
Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: A tutorial introduction. Machine Learning (arXiv preprint arXiv:1910.09457).
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS).
Krempl, G., Kottke, D., & Lemaire, V. (2015). Optimised probabilistic active learning (OPAL). Machine Learning, 100(2), 449–476.
Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679–693.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In: Proceedings of the 17th Annual International SIGIR Conference on Research and Development in Information Retrieval, Springer, pp. 3–12.
Li, M., & Sethi, I. K. (2006). Confidence-based active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8), 1251–1261.
Lindenbaum, M., Markovitch, S., & Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2), 125–152.
McCallum, A., & Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In: Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 350–358.
Menard, S. (2002). Applied logistic regression analysis. Sage Publishing.
Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. In: Proceedings of the 5th International Joint Conference on Artificial Intelligence (IJCAI), pp. 305–310.
Nguyen, V.-L., Destercke, S., & Hüllermeier, E. (2019). Epistemic uncertainty sampling. In: Proceedings of the 22nd International Conference on Discovery Science (DS), Springer, pp. 72–86.
Nocedal, J., & Wright, S. (2006). Numerical optimization. New York: Springer.
Philip, E., & Elizabeth, W. (2010). Sequential quadratic programming methods. UCSD Department of Mathematics Technical Report NA-10-03.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Rennie, J. D. (2005). Regularized logistic regression is strictly convex. Technical report, MIT.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660–674.
Senge, R., Bösner, S., Dembczyński, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., & Hüllermeier, E. (2014). Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255, 16–29.
Settles, B. (2009). Active learning literature survey. Technical Report, University of Wisconsin, Madison (TR1648).
Settles, B., & Craven, M. (2008). An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the 8th Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 1070–1079.
Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In: Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT), pp. 287–294.
Shaker, M. H., & Hüllermeier, E. (2020). Aleatoric and epistemic uncertainty with random forests. In: Proceedings of the Eighteenth International Symposium on Intelligent Data Analysis (IDA), Springer, pp. 444–456.
Sharma, M., & Bilgic, M. (2013). Most-surely vs. least-surely uncertain. In: Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), IEEE, pp. 667–676.
Sharma, M., & Bilgic, M. (2017). Evidence-based uncertainty sampling for active learning. Data Mining and Knowledge Discovery, 31(1), 164–202.
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
Walley, P., & Moral, S. (1999). Upper probabilities based only on the likelihood function. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(4), 831–847.
Zaffalon, M. (2002). The naive credal classifier. Journal of Statistical Planning and Inference, 105(1), 5–21.
Zhu, J., Wang, H., Hovy, E., & Ma, M. (2010). Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing, 6(3), 1–24.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.