BAYESIAN ACTIVE LEARNING BY DISTRIBUTION DISAGREEMENT
Anonymous authors
Paper under double-blind review

ABSTRACT
Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL and we need more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. In this work we propose BALSA, an adaptation of the BALD algorithm, tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows (Berry and Meger, 2023b;a) to real-world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.
1 INTRODUCTION
The ever-growing need for data in machine learning research and applications has fueled a long history of Active Learning (AL) research, as AL is able to reduce the number of annotations necessary to train strong models. However, most research was done for classification problems, as it is generally easier to derive uncertainty quantification (UQ) from classification output without changing the model or training procedure. This feat is a lot less common for regression models, with few historic exceptions like Gaussian Processes. This leads to regression problems being under-researched in the AL literature. In this paper, we focus specifically on the area of regression and recent models with uncertainty quantification (UQ) built into the architecture. Recently, two main approaches to UQ for regression problems have been researched: firstly, Gaussian neural networks (GNN) (Flunkert et al., 2017; Madhusudhanan et al., 2024), which use a neural network to parametrize µ and σ parameters and build a Gaussian predictive distribution, and secondly, Normalizing Flows (Papamakarios et al., 2017; Durkan et al., 2019), which parametrize a free-form predictive distribution with invertible transformations to be able to model more complex target distributions. These predictive distributions allow the models not only to be trained via Negative Log Likelihood (NLL), but also to draw samples from the predictive distribution as well as to compute the log likelihood of any given point y. Recent works (Berry and Meger, 2023b;a) have investigated the potential of uncertainty quantification with normalizing flows by experimenting on synthetic tasks with a known ground-truth uncertainty.
Intuitively, a predictive distribution should inherently allow for good uncertainty quantification (e.g. wide Gaussians signal high uncertainty). However, we show empirically that 2 out of 3 well-known heuristics for UQ (standard deviation, least confidence and Shannon entropy) significantly underperform when used as acquisition functions for AL. We argue that this is due to the inability of these heuristics to distinguish between epistemic uncertainty (model underfitting) and aleatoric uncertainty (data noise), of which AL can only reduce the former. To circumvent this problem, Berry and Meger (2023b;a) have proposed ensembles of normalizing flows and studied their approximation via Monte-Carlo (MC) dropout. Even though Berry and Meger (2023b;a) have demonstrated good uncertainty quantification, their experiments are conducted on simplified AL use cases with synthetic data; they have not benchmarked their ideas against other SOTA AL algorithms or used real-world datasets. In this work we propose a total of 4 different extensions of the BALD algorithm for AL, which relies on MC dropout to separate the two types of uncertainty. We adapt BALD's methodology to models with predictive distributions, leveraging the distributions directly instead of relying on aggregation methods like Shannon entropy or standard deviation. Additionally, we extend well-known heuristic baselines for AL to models with predictive distributions. We report results for GNNs and Normalizing Flows on 4 different datasets and 3 different query sizes.
With a recent upswing in the area of comparability and benchmarking (Rauch et al., 2023; Ji et al., 2023; Lüth et al., 2024; Werner et al., 2024), we now have reliable evaluation protocols, which help us to provide an experimental suite that is reproducible and comparable.
Our code is available under: https://anonymous.4open.science/r/Bayesian-Active-Learning-By-Distribution-Disagreement-8682/
CONTRIBUTIONS
• Three heuristic AL baselines for models with predictive distributions and three adaptations of the BALD algorithm for this use case, creating a comprehensive benchmark for AL with models with predictive distributions
• Two novel extensions of the BALD algorithm, which leverage the predictive distributions directly instead of relying on aggregation methods, which we call Bayesian Active Learning by DiStribution DisAgreement (BALSA)
• Extensive comparison of different versions of BALD and BALSA on 4 different regression datasets and 2 model architectures
2 PROBLEM DESCRIPTION
We are experimenting on pool-based AL with regression models. Mathematically, we have the following:
Given a dataset D_train := {(x_i, y_i) | i ∈ {1, . . . , N}} with x ∈ X, y ∈ Y (similarly we have D_val and D_test), we randomly sample an initial labeled pool L^(0) ∼ D_train that we call the seed set. We suppress the labels of the remaining samples to form the initial unlabeled pool U^(0) = D_train \ L^(0). We define an acquisition function to be a function that selects a batch of samples of size τ from the unlabeled pool: a(U^(i)) := {x_b^(i)} ⊆ U^(i), b ∈ {1, . . . , τ}. We then recover the corresponding labels y_b^(i) for these samples, add them to the labeled pool L^(i+1) := L^(i) ∪ {(x_b^(i), y_b^(i))}, and set U^(i+1) := U^(i) \ {x_b^(i)}, b ∈ {1, . . . , τ}. The acquisition function is applied until a budget B is exhausted.
We measure the performance of a model ŷ : X → Y on the held-out test set D_test after each acquisition round by fitting the model ŷ^(i) on L^(i) and measuring the Negative Log Likelihood (NLL).
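This acquisition loop can be summarized in a few lines. The following is a minimal sketch under the setup above; fit_model, acquisition_fn and nll are placeholder callables and not the authors' implementation.

import numpy as np

def active_learning_loop(X_train, y_train, X_test, y_test,
                         acquisition_fn, fit_model, nll,
                         seed_size=200, query_size=50, budget=800, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_train), size=seed_size, replace=False))  # L^(0), the seed set
    unlabeled = [i for i in range(len(X_train)) if i not in set(labeled)]    # U^(0)
    history = []
    while len(labeled) - seed_size < budget:
        model = fit_model(X_train[labeled], y_train[labeled])                # refit on L^(i)
        history.append(nll(model, X_test, y_test))                           # evaluate on D_test
        scores = acquisition_fn(model, X_train[unlabeled])                   # a(U^(i))
        picked = np.argsort(scores)[-query_size:]                            # top-τ candidates
        chosen = [unlabeled[i] for i in picked]
        labeled += chosen                                                     # L^(i+1)
        unlabeled = [i for i in unlabeled if i not in set(chosen)]            # U^(i+1)
    return history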
3 BACKGROUND

UNCERTAINTY QUANTIFICATION IN REGRESSION MODELS
Uncertainty quantification (UQ) in regression models can broadly be achieved by two approaches: (i) the architecture of the regression model is set up to produce a UQ itself, or (ii) the training or inference of a model is subjected to an additional procedure that generates UQs.
Examples of (i) are Gaussian Processes and density-based models, which use an encoder to produce the parameters of a predictive distribution. The most common example is a Gaussian neural network (GNN), where the encoder produces the mean and variance parameters which create a Gaussian predictive distribution. Recently, Normalizing Flows (NF) have been proposed as an alternative to pre-defined output distributions (like Gaussians). NFs parametrize non-linear transformations that transform a Gaussian base distribution into a more expressive density and use that for prediction (Papamakarios et al., 2017).
Examples of (ii) are Monte-Carlo dropout, which uses dropout layers in combination with multiple forward passes to approximate samples from the parameter distribution of a Bayesian Neural Network, as well as Langevin Dynamics for Neural Networks and Stein Variational Gradient Descent (SVGD), which estimate the parameter distribution via an updated gradient descent algorithm. These approaches are model agnostic (apart from requiring dropout layers or gradient descent training, respectively).
Table 1: Hyperparameters of all proposed variations of our extension to BALD. While BALD (Gal et al., 2017) was proposed for classification and uses categorical entropy, BALD_H uses continuous entropy. A dropout rate of 0.05 showed the best AL performance across all datasets. A * denotes the optimal dropout rate for each dataset; optimal dropout rates per dataset lie between 0.008 and 0.05.

                 Param. Sampling   Aggregation     Dist. Function   Drop Train   Drop Eval
BALD             MC dropout        Shannon Entr.   subtraction      0.5          0.5
NFlows Out       MC dropout        −Σ log p        subtraction      0.05         0.05
BALD_σ           MC dropout        std             subtraction      0.05         0.05
BALD_H           MC dropout        Shannon Entr.   subtraction      0.05         0.05
BALSA_EMD        MC dropout        -               EMD              0.05         0.05
BALSA_EMD dual   MC dropout        -               EMD              *            0.1
BALSA_KL         MC dropout        -               KL-Div.          0.05         0.05
BALSA_KL dual    MC dropout        -               KL-Div.          *            0.1
Models from category (i) are (to the best of our knowledge) not capable of distinguishing between aleatoric uncertainty and epistemic uncertainty. However, in Active Learning, we are primarily interested in quantifying the epistemic uncertainty, as this is the only quantity that we can reduce by sampling more data points. For that reason, we chose to extend BALD, a well-known algorithm for AL that uses MC dropout. Generally, our proposed method also works for Langevin Dynamics or SVGD, but as they change the training procedure itself by adding new terms and a minimum number of epochs, they are not directly comparable to the bulk of AL algorithms. We compiled an overview of our algorithms in Table 1. Without changing the "Aggregation" or "Distance Function" columns (contents detailed in Section 5), we could replace the parameter sampling with Langevin Dynamics or SVGD. We defer studies of the resulting algorithms to future work.
4 RELATED WORK

DEEP ACTIVE LEARNING FOR REGRESSION
Most approaches to Active Learning for regression are based on geometric properties of the data, with a few notable uncertainty sampling approaches that are bound to specific model architectures. Geometric methods include Coreset (Sener and Savarese, 2017), CoreGCN (Caramalau et al., 2021) and TypiClust (Hacohen et al., 2022). All three approaches first embed any candidate point using the current model and apply their distance calculations in latent space. Coreset picks points with maximal distances to each previously sampled point. CoreGCN adds one more embedding step by training a Graph Convolutional Model on a node classification task, where each node represents an unlabeled data point; Coreset sampling is then applied in this updated embedding space from the Graph Convolutional Model. TypiClust uses KNN clustering to bin the points into |L^(i)| + τ clusters and then selects at most one point from each cluster.
Many UQ approaches for AL with regression are not agnostic to the model architecture (Jose et al., 2024; Riis et al., 2022) and cannot directly be applied to our setting with normalizing flows. One of the few exceptions is the BALD algorithm itself, as its only requirement is dropout layers in the model architecture.
CLOSEST RELATED WORK
The authors of Berry and Meger (2023b;a) already researched using normalizing flows in an ensemble and how to approximate this construct via MC dropout. They proposed two different ways of applying dropout masks to normalizing flows: either in the bijective transformations (called NFlows Out) or in a network that parametrizes the base distribution of the normalizing flow (called NFlows Base). Their methods are evaluated on synthetic uncertainty quantification tasks, as well as a synthetic AL task with random sampling and a fixed query size of τ = 10. We differ from the work of Berry and Meger (2023b;a) in the following ways:
Table 2: Characteristics of the datasets used in this work. Datasets are selected to cover a large range of size and complexity and to provide maximal intersection with other literature on AL for regression.

Name                                    #Feat   #Inst (Train)   L^(0)   B
Parkinsons (Tsanas and Little, 2009)    61      3760            200     800
Supercond. (Hamidieh, 2018)             81      13608           200     800
Sarcos (Fischer, 2022)                  21      28470           200     1200
Diamonds (Mueller, 2019)                26      34522           200     1200
(i) While Berry and Meger (2023b;a) propose to implement the uncertainty function H in BALD as −Σ log[ŷ_θ(x)], we use Shannon entropy and propose multiple additional implementations.
(ii) Berry and Meger (2023b;a) conducted their experiments solely on synthetic data from simulations and compared NFlows only against other dropout-based AL algorithms. We extend this use case to 4 real-world datasets with multiple acquisition functions and query sizes.
(iii) Finally, we opt for applying dropout masks only to the conditioning model and for sampling random dropout masks instead of using the fixed masks from Berry and Meger (2023b;a). Even though we acknowledge the potential usefulness of these approaches, none of them has yet been tested on pool-based AL on real-world data. We focus first on the most natural application of MC dropout for normalizing flows and defer the other versions to future work.
MONTE-CARLO DROPOUT FOR ACTIVE LEARNING
MC dropout for AL was first proposed by BALD (Gal et al., 2017) as a way to estimate parameter uncertainty (epistemic uncertainty). The core idea of BALD is to sample a model's parameter distribution p(θ) multiple times and measure the total (aleatoric + epistemic) uncertainty of each sample. As an approximation of aleatoric uncertainty, the authors then measure the uncertainty of the average prediction and contrast it with the uncertainty of each parameter sample to obtain the epistemic uncertainty (Eq. 1). The authors derived their algorithm for softmax classification with neural networks, but the general idea of measuring the uncertainty of k parameter samples contrasted by the uncertainty of the average prediction is applicable to regression as well.

\mathrm{BALD}(x) = \sum_{i=1}^{k} \big( H[\bar{y}(x)] - H[\hat{y}_{\theta_i}(x)] \big), \qquad \bar{y}(x) = \frac{1}{k} \sum_{j=1}^{k} \hat{y}_{\theta_j}(x)    (1)

Natural choices for the uncertainty function H for predictive distributions are the standard deviation or the Shannon entropy. The subtraction in Eq. 1 serves as a distance measure between the total uncertainty of a parameter sample and the uncertainty of the average prediction. Following that idea, if a metric ϕ exists that can measure the distance between ŷ_{θ_i} and ȳ directly, we can apply the following variant of Eq. 1:

\mathrm{BALD}(x) = \sum_{i=1}^{k} \phi\big( \hat{y}_{\theta_i}(x), \bar{y}(x) \big)    (2)

Based on Eq. 2, we propose two variants of a novel algorithm, which we call BALSA.
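For concreteness, below is a minimal sketch of the original BALD score with MC dropout for softmax classification, the setting in which Gal et al. (2017) derived it; model is assumed to be a network containing nn.Dropout layers, and all names are illustrative rather than taken from the authors' code. Note that Eq. 1 sums the k entropy gaps, which only rescales the usual mean-based formulation by k and leaves the ranking unchanged.

import torch

def mc_dropout_probs(model, x, k=25):
    """k stochastic forward passes with dropout kept active at inference time."""
    model.train()  # keeps nn.Dropout stochastic; assumes no BatchNorm layers are present
    with torch.no_grad():
        return torch.stack([model(x).softmax(dim=-1) for _ in range(k)])  # (k, batch, classes)

def bald_score(probs, eps=1e-12):
    """Mutual information between prediction and parameters; higher = more epistemic uncertainty."""
    mean_p = probs.mean(dim=0)                                        # averaged prediction ȳ(x)
    h_mean = -(mean_p * (mean_p + eps).log()).sum(dim=-1)             # H[ȳ(x)]
    h_each = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)   # mean over i of H[ŷ_θi(x)]
    return h_mean - h_each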
5 METHODOLOGY

BALSA

We define the conditional predictive distribution that a model produces after a point x has been fed to the encoder ψ_θ, which conditions the distribution, as p̂|ψ_θ(x), or p̂_θ|x for short.
To employ Eq. 2, we have to solve one main problem: how is the "average" predictive distribution p̄|x (analogous to ȳ(x)) defined? We propose two solutions:
Figure 1: Overview of our regression models. Both models use an MLP encoder to create a latent embedding z of the input before using z to parametrize a predictive distribution.
Grid Sampling   Since there exists no sound way of averaging iid samples (and their likelihoods) from arbitrary distributions to obtain p̄|x, we change the sampling method to a more rigid structure. To this end, we normalize our target values to [0, 1] during pre-processing and distribute samples on a grid with a resolution of 200. We use these constructed samples to obtain likelihoods from our model and denote the vector of likelihoods on the grid as p̂^⊣_θ|x ∈ R^200. Finally, we can average multiple likelihood vectors of this form across k parameter samples to obtain p̄|x ∈ R^200:

\bar{p}|x = \frac{1}{k} \sum_{j=1}^{k} \hat{p}^{\dashv}_{\theta_j}|x

As a vector of averaged likelihoods is no longer normalized, we would need to re-normalize the values by the area under the curve to obtain a proper distribution. However, we observed in our experiments that the re-normalized version of BALSA performs comparably to or slightly worse than the un-normalized one (we provide the respective ablation study in Sec. 7). Therefore, we focus on the un-normalized version and omit the normalization step in our formulas. The formulas including the normalization step can be found in Appendix B.
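A small sketch of this grid-sampling step is given below; log_prob(y, x) is a placeholder for the conditional log density of a flow or GNN under one MC-dropout mask, not the authors' interface.

import torch

def grid_likelihoods(log_prob, x, resolution=200):
    """Likelihood vector p̂⊣_θ|x on a fixed grid over the normalized target range [0, 1]."""
    grid = torch.linspace(0.0, 1.0, resolution)
    return log_prob(grid.unsqueeze(-1), x).exp()   # assumed to return one log density per grid point

def averaged_grid_density(log_probs, x, resolution=200):
    """Stack k likelihood vectors (one per dropout mask) and average them into p̄|x."""
    vecs = torch.stack([grid_likelihoods(lp, x, resolution) for lp in log_probs])
    return vecs, vecs.mean(dim=0)                  # shapes: (k, resolution) and (resolution,)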
Pair Comparison   To avoid the computation of p̄|x entirely, we propose to approximate Eq. 2 with pairs of parameter samples instead. Given k parameter samples, we define k − 1 pairs of predictive distributions and measure their distances:

\sum_{i=1}^{k-1} \phi\big( \hat{p}_{\theta_i}|x, \hat{p}_{\theta_{i+1}}|x \big)

Since the parameter samples θ_i are drawn iid, the sum is not influenced by sequence effects from the ordering of the k samples.
Finally, we need a distance metric ϕ to measure the difference between two arbitrary predictive distributions. We propose KL divergence and Earth Mover's Distance (EMD) and call the resulting algorithms BALSA_KL and BALSA_EMD, respectively.
BALSA_KL   KL divergence is measured on the likelihood vectors of two distributions and is proportional to the expected surprise when one distribution is used as a model to describe the other. The higher the surprise, the more different the two distributions are. Implementing both above-mentioned approaches, we obtain a grid sampling version BALSA_KL Grid and a pair comparison version BALSA_KL Pair.

\mathrm{BALSA}_{KL\,Grid}(x) = \sum_{i=1}^{k} \mathrm{KL}\big( \hat{p}^{\dashv}_{\theta_i}|x,\ \bar{p}|x \big)    (3)

\mathrm{BALSA}_{KL\,Pair}(x) = \sum_{i=1}^{k-1} \mathrm{KL}\big( \hat{p}_{\theta_i}|x,\ \hat{p}_{\theta_{i+1}}|x \big)    (4)

A mathematical analysis of the differences between the resulting BALSA_KL Grid algorithm and BALD can be found in Appendix A. We omit this analysis for BALSA_KL Pair and the following BALSA_EMD, because both use fundamentally different computations and are therefore considered different algorithms.
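A sketch of Eqs. 3 and 4 on grid likelihood vectors follows; grid_densities is assumed to have shape (k, resolution), one vector per MC-dropout sample. Unlike the paper's default (un-normalized) variant, the helper re-normalizes both vectors so that the discrete KL is well defined.

import torch

def kl_on_grid(p, q, eps=1e-12):
    """Discrete KL divergence between two likelihood vectors on the same grid."""
    p = p / p.sum()
    q = q / q.sum()
    return (p * ((p + eps) / (q + eps)).log()).sum()

def balsa_kl_grid(grid_densities):
    p_bar = grid_densities.mean(dim=0)                                 # p̄|x
    return sum(kl_on_grid(p_i, p_bar) for p_i in grid_densities)       # Eq. 3

def balsa_kl_pair(grid_densities):
    return sum(kl_on_grid(grid_densities[i], grid_densities[i + 1])    # Eq. 4
               for i in range(len(grid_densities) - 1))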
BALSA_EMD   The Earth Mover's Distance (a.k.a. Wasserstein distance) is computed over iid samples drawn from the distributions and is proportional to the cost of transforming one distribution into the other. Since EMD relies on iid samples, we cannot use p̂^⊣_{θ_i}|x in this context. We only implement the pair comparison version, simply called BALSA_EMD.

\mathrm{BALSA}_{EMD}(x) = \sum_{i=1}^{k-1} \mathrm{EMD}\big( y'_{\theta_i},\ y'_{\theta_{i+1}} \big), \qquad y'_{\theta} \sim \hat{p}_{\theta}|x    (5)
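The pairwise EMD term of Eq. 5 can be computed directly on the drawn samples, e.g. with SciPy's one-dimensional Wasserstein distance; each sampler below is a placeholder that draws n scalar samples from p̂_θi|x under one dropout mask, and is not the authors' interface.

from scipy.stats import wasserstein_distance

def balsa_emd(samplers, x, n=200):
    """Sum of EMDs between consecutive dropout samples' predictive distributions (Eq. 5)."""
    draws = [sampler(x, n) for sampler in samplers]     # y'_θi ~ p̂_θi|x, one 1-D array each
    return sum(wasserstein_distance(draws[i], draws[i + 1])
               for i in range(len(draws) - 1))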
We use Coreset (Sener and Savarese, 2017), CoreGCN (Caramalau et al., 2021) and TypiClust (Hacohen et al., 2022) as clustering-based competitors to our uncertainty-based algorithms. Additionally, we adapt 3 well-known uncertainty sampling heuristics to models with predictive distributions. Neither the clustering approaches nor the heuristics rely on MC dropout, hence we omit the index on the parameters θ.
For the heuristics, we measure (i) the standard deviation σ of samples from the predictive distribution, (ii) the log likelihood of the most probable prediction (least confident sampling) and (iii) the Shannon entropy of the predictive distribution.
We denote baseline (i) as Std = σ(y'_θ), which is computed based on 200 samples from the predictive distribution.
We denote baseline (ii) as LC = −p̂_θ|x(y*) with y* = argmax_{y'} p̂_θ|x(y'), where the most probable sample is again found by sampling 200 points.
We denote baseline (iii) as Entr = −p̂_θ|x log[p̂_θ|x]. Since we are dealing with regression problems and predictive distributions, we use continuous entropy in this work. Calculating continuous entropy entails integrating ∫ −p̂_θ|x log[p̂_θ|x] dy, which we approximate by employing our grid sampling approach, computing the entropy integrand on the resulting likelihood vector p̂^⊣|x and finding the total entropy with the trapezoidal rule:

\mathrm{Entr}(x) = \mathrm{trapz}\big( -\hat{p}^{\dashv}_{\theta}|x \log[\hat{p}^{\dashv}_{\theta}|x] \big)    (6)
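A sketch of the three heuristics under the conventions above; sample and log_prob are placeholder accessors for the predictive distribution p̂_θ|x, and the grid matches the normalized target range.

import torch

def std_baseline(sample, x, n=200):
    return sample(x, n).std()                      # (i) standard deviation of 200 samples

def lc_baseline(sample, log_prob, x, n=200):
    ys = sample(x, n)
    return -log_prob(ys, x).exp().max()            # (ii) negative likelihood of the most probable sample

def entropy_baseline(log_prob, x, resolution=200):
    grid = torch.linspace(0.0, 1.0, resolution)    # grid over the normalized target range [0, 1]
    p = log_prob(grid.unsqueeze(-1), x).exp()      # likelihood vector p̂⊣|x, assumed shape (resolution,)
    integrand = -p * torch.log(p + 1e-12)
    return torch.trapezoid(integrand, grid)        # Eq. 6: continuous entropy via the trapezoidal rule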
As all baselines (i)-(iii) are viable replacements for the function H in BALD (Eq. 1), we can construct additional baselines in a straightforward fashion by creating adaptations of BALD for models with predictive distributions.
Based on baseline (i), we construct BALD_σ. Since the standard deviation needs to be computed over iid samples from p̂_θ|x, we use pair comparisons (analogous to BALSA_EMD).

\mathrm{BALD}_{\sigma}(x) = \sum_{i=1}^{k-1} \big( \sigma[y'_{\theta_i}] - \sigma[y'_{\theta_{i+1}}] \big), \qquad y'_{\theta} \sim \hat{p}_{\theta}|x    (7)

Based on baseline (ii), we construct BALD_LC. Following BALD_σ, this baseline is also computed over pairs.

\mathrm{BALD}_{LC}(x) = \sum_{i=1}^{k-1} \big( \mathrm{LC}[\hat{p}_{\theta_i}|x] - \mathrm{LC}[\hat{p}_{\theta_{i+1}}|x] \big)    (8)
Based on baseline (iii), we construct BALD_H. To stay as close as possible to BALD, BALD_H uses p̂^⊣_θ|x to compute p̄|x and reproduces Eq. 1.

\mathrm{BALD}_{H}(x) = \sum_{i=1}^{k} \big( \mathrm{Entr}[\bar{p}|x] - \mathrm{Entr}[\hat{p}^{\dashv}_{\theta_i}|x] \big), \qquad \bar{p}|x = \frac{1}{k} \sum_{j=1}^{k} \hat{p}^{\dashv}_{\theta_j}|x    (9)
6 IMPLEMENTATION DETAILS
All experiments are run with PyTorch on Nvidia 2080, 3090 and 4090 GPUs. The total runtime for all experiments was approximately 7 days on 40-50 GPUs.
As backbone model we use a standard MLP encoder with dropout layers and ReLU activation. The encoder conditions the predictive distribution of our model either via a µ-decoder and a σ-decoder (GNN) or as a conditioning input for the normalizing flow. Our normalizing flow is an autoregressive Neural Spline Flow with rational-quadratic spline transformations (Durkan et al., 2019). For detailed descriptions of both models, please refer to Appendix C. We optimize all our hyperparameters on random subsets of size B (e.g. Parkinsons has a budget of 800). To that end, we evaluate each hyperparameter setting on 4 different random subsets and use the average validation performance as the metric for our search.
Evaluating algorithms that include MC dropout is especially tricky, as few guidelines exist on how to choose an appropriate dropout rate. Instead of forcing a (too) high dropout rate onto every algorithm, in this work we include dropout in our hyperparameter search, so it is optimized for validation performance on each dataset. This creates an optimal evaluation scheme for algorithms without MC dropout and is an important step in order to not underestimate the performance of algorithms that do not require high dropout rates. We then let each BALD or BALSA algorithm overwrite the dropout rate with a fixed value. The specific MC-dropout rate for overwriting the default setting is optimized for AL performance across all datasets on very few trials in order to find a suitable default value. Finally, we propose an alternative to overwriting the optimal dropout rate with a fixed value: we test BALSA in "dual" mode, retaining the optimal dropout rate during training and switching to a higher fixed value during evaluation phases. A fixed evaluation rate of 0.05 is chosen as the highest of our optimal dropout rates (0.008-0.05 per dataset). This is still a full order of magnitude lower than common rates of 0.5 for MC dropout in the literature (Gal et al., 2017; Kirsch et al., 2019). Please refer to Table 1 for the dropout settings of our algorithms and to Appendix C for the hyperparameters used. The results for "dual" mode can be found in the respective ablation study in Section 7.
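The "dual" mode only requires switching the dropout probability between training and MC-dropout scoring. A minimal sketch follows; the rates shown are illustrative, since the training rate is tuned per dataset.

import torch.nn as nn

def set_dropout_rate(model: nn.Module, p: float) -> None:
    """Overwrite the rate of every nn.Dropout module in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

# usage sketch: train with the tuned rate, raise it only for acquisition scoring
# set_dropout_rate(model, 0.1); model.train()   # dropout active for the k MC forward passes
# ...score the unlabeled pool...
# set_dropout_rate(model, 0.0163)               # restore e.g. the Parkinsons training rate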
Figure 2: Critical Difference Diagram for all datasets and query size 1 (lower is better). Horizontal bars indicate statistical significance according to the Wilcoxon-Holm test.
Figure 3: AL trajectories of all tested algorithms on the Diamonds dataset. Curves based on NLL (left) and MAE (right); lower is better. Trajectories are averaged over 30 restarts of each experiment.
7 EXPERIMENTS
We test our proposed algorithms from Section 5 on conditional normalizing flows and GNNs on 4 datasets (details of our datasets in Table 2) and across query sizes of τ = {1, 50, 200}. Every experiment is repeated 30 times and implemented according to the guidelines of Ji et al. (2023) and Werner et al. (2024). We compare our results mainly on query size 1, as we are mostly interested in the ability of our proposed algorithms to capture uncertainty in the model rather than in adapting to larger query sizes. Following Werner et al. (2024), we choose CD diagrams as the aggregation method for comparison. To this end, we compute a ranking of each algorithm's AUC value for each dataset and for each repetition and compare the ranks via the Wilcoxon signed-rank test. Computing ranks out of the AUC values enables us to compare results across datasets without averaging AUC values from different datasets. The AUC values are computed based on test NLL (Fig. 2) and test MAE (Fig. 4). For context, we evaluate Coreset (Sener and Savarese, 2017), CoreGCN (Caramalau et al., 2021) and TypiClust (Hacohen et al., 2022) and display the final ranking across all datasets for query size 1 in Figure 2. Additionally, we display the AL trajectories of all algorithms on the Diamonds dataset as an example in Figure 3. The remaining figures for all datasets can be found in Appendix D. In our experiments, BALSA_KL Pairs is the best AL algorithm on average, followed by BALSA_KL Grid, BALD_H and Coreset. Notably, common AL heuristics, namely the Shannon Entropy, Std and Least Confident baselines, which usually are among the most reliable methods for AL with classification, performed especially badly. These results indicate that not every kind of measure on the uncertainty quantification is useful for AL, even when the UQ is inherent to the model architecture and the measure is well-tested in other domains. Interestingly, Coreset and CoreGCN perform a lot better with GNN architectures, both gaining about 3 ranks, while TypiClust, which is also a clustering algorithm, loses ranks. To investigate and compare these algorithms further, we provide additional results in Figure 4, computing the ranks of each algorithm based on MAE instead of NLL. The two main differences are (i) NFlows Out loses drastically, scoring last on average, and (ii) Coreset is now the best performing algorithm, winning narrowly against BALSA_KL Pairs and TypiClust.
Finding the right (mix of) metrics to evaluate our models remains a challenging task, as every chosen measure inevitably introduces a bias into the evaluation. Since we have optimized our hyperparameters for validation NLL, we opt for NLL as our main metric. We have included results for our main experiments (Fig. 2) measured with the CRPS score instead of NLL in Appendix E. The CRPS score resulted in the same ranking as the likelihood did, so we opted to use the less involved score.
Additionally, we provide multiple ablation studies for our proposed BALSA algorithm:
Dual Mode: We test BALSA in "dual" mode by switching between the optimal dropout rate and a static value during training and evaluation phases, respectively. This approach poses an alternative to the problems of setting dropout rates highlighted in Section 6. Unfortunately, the results in Figure 5 are inconclusive, as across all datasets and model architectures the dual mode achieves one clear loss (BALSA_EMD dual), a marginal loss (BALSA_KL Pairs dual) and a marginal win (BALSA_KL Grid dual). We hypothesize that the switch of dropout rate between training and evaluation can in some cases degrade the model's predictions too much, as the model was not trained to cope with higher-than-optimal dropout.
Figure 4: Critical Difference Diagrams with ranks computed based on MAE instead of NLL. Same experimental parameters as Fig. 2.

Figure 5: Comparison of "dual" evaluation mode for both BALSA algorithms as well as the re-normalized version of BALSA_KL Grid. Based on NLL and τ = 1.
Re-normalization: We also tested a version of BALSA_KL Grid where we re-normalize p̄|x by its area under the curve as described in Section 5. We included BALSA_KL Grid norm in Figure 5, but observed slightly lower performance compared to un-normalized BALSA_KL Grid. For the sake of brevity and simplicity, we therefore opted to leave the normalization step out of our formulas.
Query Sizes: To gauge how well our proposed variants of BALD and BALSA adapt to larger query sizes, we test our proposed methods with τ = {50, 200} and compare the results in Figure 6. For ease of comparison, we exclude the 4 worst performing algorithms. Interestingly, when increasing the query size τ, clustering algorithms like Coreset and TypiClust lose performance more quickly than our proposed uncertainty sampling methods. This finding contradicts experiments on AL for classification, where those methods are very stable as τ increases (Ji et al., 2023; Werner et al., 2024). The uncertainty sampling methods behave as expected, gradually losing their advantage over random sampling with increasing query size, as they lack a diversity sampling component.
Figure 6: Comparison of our best performing algorithms across different query sizes. Both model architectures, based on NLL.
8 CONCLUSION
In this work, we extended the foundation of Berry and Meger (2023b;a) by applying the idea of MC-dropout normalizing flows to real-world data and pool-based AL. To that end, we adapted 3 heuristic AL baselines to models with predictive distributions, proposed 3 straightforward adaptations of BALD and created 2 novel algorithms based on the BALD algorithm. This creates a comprehensive benchmark suite for uncertainty sampling in the use case of AL with models with predictive distributions. We demonstrate strong performance across 4 datasets for normalizing flows with BALSA_KL Pairs, narrowly losing against Coreset for GNN models. For larger query sizes, we observed unexpected behavior for clustering algorithms like Coreset and TypiClust, which fell behind uncertainty-based algorithms for τ = {50, 200}, while the uncertainty-based algorithms retained their performance. This goes against common knowledge in AL, which attributes to clustering algorithms a high potential to scale to larger query sizes. This work is but a first step towards understanding the dynamics of AL for regression models with uncertainty quantification.
REPRODUCIBILITY STATEMENT
Our code is publicly available under: https://anonymous.4open.science/r/Bayesian-Active-Learning-By-Distribution-Disagreement-8682/
We did not provide pseudo-code or algorithms for our experiments, because our setup is identical to that of Werner et al. (2024). Please refer to their work for details.
The employed hyperparameters can be found in Appendix C or in the "configs" folder of the code.
REFERENCES
Lucas Berry and David Meger. Escaping the sample trap: Fast and accurate epistemic uncertainty estimation with pairwise-distance estimators. arXiv preprint arXiv:2308.13498, 2023a.

Lucas Berry and David Meger. Normalizing flow ensembles for rich aleatoric and epistemic uncertainty modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 6806-6814, 2023b.

Razvan Caramalau, Binod Bhattarai, and Tae-Kyun Kim. Sequential graph convolutional network for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9583-9592, 2021.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. Advances in Neural Information Processing Systems, 32, 2019.

Sebastian Fischer. Sarcos Data. OpenML, 2022. OpenML ID: 43873.

Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017. URL http://arxiv.org/abs/1704.04110.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International Conference on Machine Learning, pages 1183-1192. PMLR, 2017.

Guy Hacohen, Avihu Dekel, and Daphna Weinshall. Active learning on a budget: Opposite strategies suit high and low budgets. arXiv preprint arXiv:2202.02794, 2022.

Kam Hamidieh. Superconductivity Data. UCI Machine Learning Repository, 2018. DOI: https://doi.org/10.24432/C53P47.

Yilin Ji, Daniel Kaestner, Oliver Wirth, and Christian Wressnegger. Randomness is the root of all evil: More reliable evaluation of deep active learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3943-3952, 2023.

Ashna Jose, João Paulo Almeida de Mendonça, Emilie Devijver, Noël Jakse, Valérie Monbet, and Roberta Poloni. Regression tree-based active learning. Data Mining and Knowledge Discovery, 38(2):420-460, 2024.

Andreas Kirsch, Joost Van Amersfoort, and Yarin Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. Advances in Neural Information Processing Systems, 32, 2019.

Carsten Lüth, Till Bungert, Lukas Klein, and Paul Jaeger. Navigating the pitfalls of active learning evaluation: A systematic framework for meaningful performance assessment. Advances in Neural Information Processing Systems, 36, 2024.

Kiran Madhusudhanan, Shayan Jawed, and Lars Schmidt-Thieme. Hyperparameter tuning MLPs for probabilistic time series forecasting. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 264-275. Springer, 2024.

Andreas Mueller. Diamonds Data. OpenML, 2019. OpenML ID: 42225.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 30, 2017.

Lukas Rauch, Matthias Aßenmacher, Denis Huseljic, Moritz Wirth, Bernd Bischl, and Bernhard Sick. ActiveGLAE: A benchmark for deep active learning with transformers. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 55-74. Springer, 2023.

Christoffer Riis, Francisco Antunes, Frederik Hüttel, Carlos Lima Azevedo, and Francisco Pereira. Bayesian active learning with fully Bayesian Gaussian processes. Advances in Neural Information Processing Systems, 35:12141-12153, 2022.

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.

Athanasios Tsanas and Max Little. Parkinsons Telemonitoring. UCI Machine Learning Repository, 2009. DOI: https://doi.org/10.24432/C5ZS3N.

Thorben Werner, Johannes Burchert, Maximilian Stubbemann, and Lars Schmidt-Thieme. A cross-domain benchmark for active learning, 2024. URL https://openreview.net/forum?id=OOItbUUQcd.
A DIFFERENCE BETWEEN BALD AND BALSA_KL

\mathrm{BALD}(x \mid \hat{p}_{1:K}) := \sum_{k=1}^{K} \big( H(\bar{p}(y \mid x)) - H(\hat{p}_k(y \mid x)) \big), \quad \text{with } \bar{p}(y \mid x) := \frac{1}{K} \sum_{k=1}^{K} \hat{p}_k(y \mid x)

Let \mathrm{KL}(p, q) := \int_y p(y) \log \frac{p(y)}{q(y)} \, dy be the Kullback-Leibler divergence.

\mathrm{BALSA}(x \mid \hat{p}_{1:K}) := \sum_{k=1}^{K} \mathrm{KL}\big( \hat{p}_k(y \mid x), \bar{p}(y \mid x) \big)

To see the differences between BALD and BALSA more clearly, write the k-th BALSA term in a shorter form, dropping the dependency on x throughout:

\mathrm{KL}(\hat{p}_k, \bar{p}) = \int \hat{p}_k(y) \log \frac{\hat{p}_k(y)}{\bar{p}(y)} \, dy = \int \hat{p}_k(y) \log \hat{p}_k(y) \, dy - \int \hat{p}_k(y) \log \bar{p}(y) \, dy = -H(\hat{p}_k(y)) - \int \hat{p}_k(y) \log \bar{p}(y) \, dy

which is different from the k-th BALD term

\mathrm{BALD}(x \mid \hat{p}_k) = H(\bar{p}(y \mid x)) - H(\hat{p}_k(y \mid x)) = -H(\hat{p}_k(y)) - \int \bar{p}(y) \log \bar{p}(y) \, dy
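As an illustration of the difference derived above (not part of the original appendix), a quick numeric check on a discrete grid shows that the BALSA_KL term replaces the self-entropy of p̄ with the cross-entropy of p̂_k with respect to p̄, so the two terms only coincide when p̂_k = p̄.

import numpy as np

grid = np.linspace(0.0, 1.0, 200)
dy = grid[1] - grid[0]
norm = lambda p: p / (p.sum() * dy)                                   # normalize to a density on the grid

p_k = norm(np.exp(-0.5 * ((grid - 0.3) / 0.05) ** 2))                 # one dropout sample p̂_k
p_bar = norm(np.exp(-0.5 * ((grid - 0.5) / 0.10) ** 2))               # averaged density p̄

H = lambda p: -(p * np.log(p + 1e-12)).sum() * dy                     # differential entropy on the grid
cross = -(p_k * np.log(p_bar + 1e-12)).sum() * dy                     # cross-entropy of p̂_k w.r.t. p̄

balsa_term = -H(p_k) + cross                                          # KL(p̂_k, p̄)
bald_term = H(p_bar) - H(p_k)                                         # H(p̄) − H(p̂_k)
print(balsa_term, bald_term)                                          # generally two different values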
B BALSA_KL GRID WITH NORMALIZATION

Formulas for BALSA_KL Grid Norm as tested in the respective ablation in Section 7. We found this version to perform identically to the un-normalized version of BALSA_KL Grid and opted for the less involved formulation.

\mathrm{BALSA}_{KL\,Grid\,Norm}(x) = \sum_{i=1}^{k} \mathrm{KL}\Big( \hat{p}^{\dashv}_{\theta_i}|x,\ \frac{\bar{p}|x}{\mathrm{trapz}(\bar{p}|x)} \Big)

\bar{p}|x = \frac{1}{k} \sum_{j=1}^{k} \hat{p}^{\dashv}_{\theta_j}|x

\mathrm{trapz}(p^{\dashv}) = \sum_{n=1}^{|p^{\dashv}|-1} \frac{1}{2} \big( p^{\dashv}_n + p^{\dashv}_{n+1} \big)

The trapz method is a well-known method to approximate an integral. We use the PyTorch implementation of trapz.
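A one-function sketch of this re-normalization with PyTorch's trapezoid routine; p_bar and grid are assumed to be the averaged likelihood vector and its grid of normalized target values.

import torch

def renormalize(p_bar: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Divide the averaged likelihood vector by its area under the curve."""
    area = torch.trapezoid(p_bar, grid)
    return p_bar / area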
C MODEL ARCHITECTURES

We use an MLP encoder model for both architectures. In our Normalizing Flow models, the encodings are used as conditioning input for the bijective transformations (decoder). Our GNNs use a linear layer to decode µ and σ from the encodings. Our Normalizing Flow model is a masked autoregressive flow with rational-quadratic spline transformations, which has demonstrated good performance on a variety of tasks (Durkan et al., 2019).
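For illustration, a minimal sketch of the GNN variant (MLP encoder with dropout, linear µ/σ heads, Gaussian predictive distribution) is given below; layer sizes follow Table 4, but the code is not the authors' implementation.

import torch
import torch.nn as nn

class GaussianNN(nn.Module):
    def __init__(self, in_dim, hidden=(32, 64, 128), dropout=0.05):
        super().__init__()
        layers, d = [], in_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU(), nn.Dropout(dropout)]
            d = h
        self.encoder = nn.Sequential(*layers)
        self.mu_head = nn.Linear(d, 1)
        self.sigma_head = nn.Linear(d, 1)

    def forward(self, x):
        z = self.encoder(x)                                          # latent embedding z
        mu = self.mu_head(z)
        sigma = nn.functional.softplus(self.sigma_head(z)) + 1e-6    # keep σ positive
        return torch.distributions.Normal(mu, sigma)                 # predictive distribution

# training objective: NLL, i.e. -model(x).log_prob(y).mean()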
Table 3: Hyperparameters used for the Normalizing Flow models

              Parkinsons      Diamonds        Supercond.      Sarcos
Encoder       [32, 64, 128]   [32, 64, 128]   [32, 64, 128]   [32, 64, 128]
Decoder       [128, 128]      [128, 128]      [128, 128]      [128, 128]
Budget        800             1200            800             1200
Seed Set      200             200             200             200
Batch Size    64              64              64              64
Optimizer     NAdam           NAdam           NAdam           NAdam
LR            0.001           0.0004          0.0008          0.0007
Weight Dec.   0.0018          0.008           0.0003          0.0004
Dropout       0.0163          0.0194          0.0491          0.0261

Table 4: Hyperparameters used for the GNN models

              Parkinsons      Diamonds        Supercond.      Sarcos
Encoder       [32, 64, 128]   [32, 64, 128]   [32, 64, 128]   [32, 64, 128]
Decoder       linear          linear          linear          linear
Budget        800             1200            800             1200
Seed Set      200             200             200             200
Batch Size    64              64              64              64
Optimizer     NAdam           NAdam           NAdam           NAdam
LR            0.0007          0.0004          0.0003          0.0006
Weight Dec.   0.0008          0.005           0.005           0.0009
Dropout       0.0077          0.0122          0.0121          0.0074
D AL TRAJECTORIES

PARKINSONS

Normalizing Flows

[Figure: AL trajectories (test NLL and MAE vs. number of labeled data points) of all tested algorithms; Normalizing Flow models.]

Gaussian Neural Networks

[Figure: AL trajectories (test NLL and MAE vs. number of labeled data points) of all tested algorithms; GNN models.]

SUPERCONDUCTORS

Normalizing Flows

[Figure: AL trajectories (test NLL and MAE vs. number of labeled data points) of all tested algorithms; Normalizing Flow models.]

Gaussian Neural Networks

[Figure: AL trajectories (test NLL and MAE vs. number of labeled data points) of all tested algorithms; GNN models.]