Youssef Hamadi, Eric Monfroy, Frédéric Saubion (Eds.): Autonomous Search. Springer-Verlag Berlin Heidelberg (2012)
Autonomous Search
Editors

Dr. Youssef Hamadi
Microsoft Research, Cambridge, UK
LIX, École Polytechnique, Palaiseau, France
[email protected]

Prof. Frédéric Saubion
Faculté des Sciences, LERIA, Université d’Angers, Angers, France
[email protected]
We would like to thank all the people who have contributed to this book. We are
extremely grateful to the external reviewers for their detailed and constructive sug-
gestions on the successive versions of the manuscript.
List of Contributors
Youssef Hamadi
Microsoft Research, Cambridge, CB3 0FB, UK,
LIX, École Polytechnique, F-91128 Palaiseau, France,
e-mail: [email protected]
Eric Monfroy
Universidad Santa Marı́a, Valparaı́so, Chile,
e-mail: [email protected]
and LINA, Université de Nantes, France,
e-mail: [email protected]
Frédéric Saubion
LERIA, Université d’Angers, France,
e-mail: [email protected]
A. E. Eiben
Vrije Universiteit Amsterdam,
e-mail: [email protected]
S. K. Smit
Vrije Universiteit Amsterdam,
e-mail: [email protected]
Holger H. Hoos
Department of Computer Science, University of British Columbia,
2366 Main Mall, Vancouver, BC, V6T 1Z4, Canada,
e-mail: [email protected]
Derek Bridge
Cork Constraint Computation Centre, University College Cork, Ireland,
e-mail: [email protected]
Eoin O’Mahony
Cork Constraint Computation Centre, University College Cork, Ireland,
e-mail: [email protected]
Barry O’Sullivan
Cork Constraint Computation Centre, University College Cork, Ireland,
e-mail: [email protected]
Susan L. Epstein
Department of Computer Science, Hunter College and The Graduate Center
of The City University of New York, New York, New York,
e-mail: [email protected]
Smiljana Petrovic
Department of Computer Science, Iona College, New Rochelle, New York,
e-mail: [email protected]
Roberto Battiti
DISI – Dipartimento di Ingegneria e Scienza dell’Informazione,
Università di Trento, Italy,
e-mail: [email protected]
Paolo Campigotto
DISI – Dipartimento di Ingegneria e Scienza dell’Informazione,
Università di Trento, Italy,
e-mail: [email protected]
Jorge Maturana
Instituto de Informática, Universidad Austral de Chile, Valdivia, Chile,
e-mail: [email protected]
Álvaro Fialho
Microsoft Research – INRIA Joint Centre, Orsay, France,
e-mail: [email protected]
Marc Schoenauer
Project-Team TAO, INRIA Saclay – Île-de-France and LRI (UMR CNRS 8623),
Orsay, France and Microsoft Research – INRIA Joint Centre, Orsay, France,
e-mail: [email protected]
Frédéric Lardeux
LERIA, Université d’Angers, Angers, France,
e-mail: [email protected]
Michèle Sebag
Project-Team TAO, LRI (UMR CNRS 8623) and INRIA Saclay – Île-de-France,
Orsay, France and Microsoft Research – INRIA Joint Centre, Orsay, France,
e-mail: [email protected]
Thomas Stützle
IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium,
e-mail: [email protected]
Manuel López-Ibáñez
IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium,
e-mail: [email protected]
Marco Montes de Oca
IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium,
e-mail: [email protected]
Mauro Birattari
IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium,
e-mail: [email protected]
Marco Dorigo
IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium,
e-mail: [email protected]
Michael Maur
Fachbereich Rechts- und Wirtschaftswissenschaften, TU Darmstadt,
Darmstadt, Germany,
e-mail: [email protected]
Paola Pellegrini
Dipartimento di Matematica Applicata, Università Ca’ Foscari Venezia,
Venezia, Italia,
e-mail: [email protected]
Alejandro Arbelaez
Microsoft-INRIA joint lab, Orsay, France,
e-mail: [email protected]
Said Jabbour
Université Lille-Nord de France, CRIL – CNRS UMR 8188,
Lens, France,
e-mail: [email protected]
Lakhdar Sais
Université Lille-Nord de France, CRIL – CNRS UMR 8188,
Lens, France,
e-mail: [email protected]
Marek Petrik
Department of Computer Science, University of Massachusetts Amherst,
Amherst, MA, USA,
e-mail: [email protected]
Shlomo Zilberstein
Department of Computer Science, University of Massachusetts Amherst,
Amherst, MA, USA,
e-mail: [email protected]
Part I
Off-line Configuration
Chapter 2
Evolutionary Algorithm Parameters and Methods to Tune Them
A. E. Eiben and S. K. Smit
Finding appropriate parameter values for evolutionary algorithms (EA) is one of the
persisting grand challenges of the evolutionary computing (EC) field. In general, EC
researchers and practitioners all acknowledge that good parameter values are essen-
tial for good EA performance. However, very little effort is spent on studying the
effect of EA parameters on EA performance and on tuning them. In practice, param-
eter values are mostly selected by conventions (mutation rate should be low), ad hoc
choices (why not use uniform crossover?), and experimental comparisons on a lim-
ited scale (testing combinations of three different crossover rates and three different
mutation rates). Hence, there is a striking gap between the widely acknowledged im-
portance of good parameter values and the widely exhibited ignorance concerning
principled approaches to tuning EA parameters.
To this end, it is important to recall that the problem of setting EA parameters
is commonly divided into two cases, parameter tuning and parameter control [14].
In case of parameter control, the parameter values are changing during an EA run.
In this case one needs initial parameter values and suitable control strategies, which
in turn can be deterministic, adaptive, or self-adaptive. Parameter tuning is eas-
ier in the sense that the parameter values are not changing during a run, hence only
a single value per parameter is required. Nevertheless, even the problem of tuning
an EA for a given application is hard because of the large number of options and the
limited knowledge about the effect of EA parameters on EA performance.
Given this background, we can regard the primary focus of this chapter as be-
ing parameter tuning. Our main message is that the technical conditions for choos-
ing good parameter values are given and the technology is easily available. As it
happens, there exist various algorithms that can be used for tuning EA parame-
ters. Using such tuning algorithms (tuners, for short, in the sequel) implies signifi-
cant performance improvements and the computational costs are moderate. Hence,
changing the present practice and using tuning algorithms widely would lead to
improvements on a massive scale: large performance gains for a large group of re-
searchers and practitioners.
The overall aim of this chapter is to offer a thorough treatment of EA parame-
ters and algorithms to tune them. This aim can be broken down into a number of
technical objectives:
1. To discuss the notion of EA parameters and its relationship with the concepts
of EAs and EA instances.
2. To consider the most important aspects of the parameter tuning problem.
3. To give an overview of existing parameter tuning methods.
4. To elaborate on the methodological issues involved here and provide recom-
mendations for further research.
Table 2.1: Pairs of terms used in the literature to distinguish between two types of
parameters (variables)
The difference between the two types of parameters has already been noted in
evolutionary computing, but different authors use different naming conventions.
For instance, [5] uses the terms qualitative and quantitative parameters, [37] dis-
tinguishes between symbolic and numeric parameters, [10] calls them categorical
and numerical, while [23] refers to structural and behavioral parameters. [30] calls
unstructured parameters “components” and the elements of their domains “operators”;
a parameter is instantiated by a value, while a component is instantiated by
allocating an operator to it. In the context of statistics and data mining one distin-
guishes between two types of variables (rather than parameters) depending on the
presence of an ordered structure, but a universal terminology is lacking here too.
Commonly used names are nominal vs. ordinal and categorical vs. ordered vari-
ables. Table 2.1 provides a quick overview of these options in general; Table 2.2 of-
fers an EA-specific illustration with commonly used parameters in both categories.
From now on we will use the terms qualitative parameter and quantitative pa-
rameter. For both types of parameters the elements of the parameter’s domain are
called parameter values and we instantiate a parameter by allocating a value to it.
In practice, quantitative parameters are mostly numerical values, e.g., the parame-
ter crossover rate uses values from the interval [0, 1], and qualitative parameters are
often symbolic, e.g., crossover operator. However, in general, quantitative pa-
rameters and numerical parameters are not the same, because it is possible to have
an ordering on a set of symbolic values. For instance, one could impose an alpha-
betical ordering on the set of colors {blue, green, red, yellow}.
It is important to note that the number of parameters of EAs is not specified
in general. Depending on design choices one might obtain different numbers of
parameters. For instance, instantiating the qualitative parameter parent selection
by tournament implies a new quantitative parameter tournament size. However,
choosing roulette wheel does not add any parameters. This example also shows
that there can be a hierarchy among parameters. Namely, qualitative parameters may
have quantitative parameters “under them”. For an unambiguous treatment we can
call such parameters sub-parameters, always belonging to a qualitative parameter.
Distinguishing between qualitative and quantitative parameters naturally leads to
distinguishing between two levels when designing a specific EA for a given prob-
lem. In the sequel, we perceive qualitative parameters as high-level ones that define
an evolutionary algorithm, and look at quantitative parameters as low-level ones that
define a variant of this EA. Table 2.2 illustrates this.
Following this naming convention an evolutionary algorithm is a partially spec-
ified algorithm where the values to instantiate qualitative parameters are defined,
but the quantitative parameters are not. Hence, we consider two EAs to be differ-
ent if they differ in one of their qualitative parameters, e.g., use different mutation
operators. If the values for all parameters are specified, we obtain an EA instance.
This terminology enables precise formulations while it requires care in phrasing.
In the broad sense, algorithm design includes all decisions needed to specify an
algorithm (instance) for solving a given problem (instance). Throughout this chap-
ter we treat parameter tuning as a special case of algorithm design. The principal
challenge for evolutionary algorithm designers is that the design details, i.e., param-
eter values, largely influence the performance of the algorithm. An EA with good
parameter values can be orders of magnitude better than one with poorly chosen
parameter values. Hence, algorithm design in general, and EA design in particular,
is an optimization problem.
To obtain a detailed view on this issue we distinguish between three layers: ap-
plication layer, algorithm layer, and design layer; see Figure 2.2. As this figure indi-
cates, the whole scheme can be divided into two optimization problems. The lower
part of this three-tier hierarchy consists of a problem (e.g., the traveling salesman
problem) on the application layer and an EA (e.g., a genetic algorithm) on the al-
gorithm layer to find an optimal solution for the problem. Simply put, the EA is
iteratively generating candidate solutions (e.g., permutations of city names), seek-
ing one with maximal quality. The upper part of the hierarchy contains a tuning
method to find optimal parameter values for the EA on the algorithm layer. Simi-
larly to the lower part, the tuning method is iteratively generating parameter vectors
seeking one with maximal quality, where the quality of a given parameter vector is
based on the performance of the EA using its values. To avoid confusion we use dis-
tinct terms to designate the quality function of these optimization problems. In line
with the usual EC terminology we use the term fitness for the quality of candidate
solutions on the lower level, and the term utility to denote the quality of parameter
vectors. Table 2.3 provides a quick overview of the related vocabulary.
With this nomenclature we can formalize the problem to be solved by the al-
gorithm designer if we denote the qualitative parameters and their domains by
q1 , . . . , qm and Q1 , . . . , Qm , and likewise use the notations r1 , . . . , rn and R1 , . . . , Rn
for the quantitative parameters.1 The problem of parameter tuning can then be seen
as a search problem ⟨S, u⟩ in the parameter space
1 Observe that due to the possible presence of sub-parameters the number of quantitative parame-
ters n depends on the instantiations of q1 , . . . qm . This makes the notation somewhat inaccurate, but
we use it for sake of simplicity.
Fig. 2.2: The left diagram depicts the control flow through the three main layers in
the hierarchy of parameter tuning, while the right diagram shows the information
flow
Table 2.3: Vocabulary distinguishing the main entities in the context of problem
solving (lower two blocks in Figure 2.2) and algorithm design, and parameter tuning
(upper two blocks in Figure 2.2).
S = Q1 × · · · × Qm × R1 × · · · × Rn (2.1)
using a utility function u, where the utility u( p̄) of a given parameter vector p̄ ∈ S
reflects the performance of EA( p̄), i.e., the evolutionary algorithm instance using
the values of p̄, on the given problem instance(s). Solutions of the parameter tuning
problem can then be defined as parameter vectors with maximum utility. Given this
conceptual framework it is easy to distinguish between the so-called “structural”
and “parametric” tunings [9] in a formal way: structural tuning takes place in the
space S = Q1 × · · · × Qm , while parametric tuning refers to searching through S =
R1 × · · · × Rn .
Now we can define the utility landscape as an abstract landscape where the loca-
tions are the parameter vectors of an EA and the height reflects utility. It is obvious
that fitness landscapes - commonly used in EC - have a lot in common with utility
landscapes as introduced here. To be specific, in both cases we have a search space
(candidate solutions vs. parameter vectors), a quality measure (fitness vs. utility)
that is conceptualized as ‘height’, and a method to assess the quality of a point in
the search space (fitness evaluation vs. utility testing). Finally, we have a search
method (an evolutionary algorithm vs. a tuning procedure) that is seeking a point
with maximum height.
Despite the obvious analogies between the upper and the lower halves of Fig-
ure 2.2, there are some differences we want to note here. First of all, fitness values
As stated above, solutions of the parameter tuning problem are parameter vectors
with maximum utility, where utility is based on some definition of EA performance
and some objective functions or problem instances.
In general, there are two atomic performance measures for EAs: one regarding
solution quality and one regarding algorithm speed or search effort. The latter is
interpreted here in the broad sense, as referring to any sensible measure of com-
putational effort for evolutionary algorithms, e.g., the number of fitness evaluations,
CPU time, and wall clock time.2 There are different combinations of fitness and time
that can be used to define algorithm performance in one single run. For instance:
• Given a maximum running time (computational effort), algorithm performance
is defined as the best fitness at termination.
• Given a minimum fitness level, algorithm performance is defined as the running
time (computational effort) needed to reach it.
• Given a maximum running time (computational effort) and a minimum fitness
level, algorithm performance is defined through the Boolean notion of success:
a run succeeds if the given fitness is reached within the given time; otherwise it
fails.
Obviously, due to the stochastic nature of EAs, multiple runs on the same problem are
necessary to get a good estimate of performance. Aggregating the measures men-
tioned above over a number of runs we obtain the performance metrics commonly
used in evolutionary computing, (cf. [17, Chapter 14]):
• MBF (mean best fitness),
• AES (average number of evaluations to solution),
• SR (success rate),
respectively. When designing a good EA, one may tune it on either of these perfor-
mance measures, or a combination of them.
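To make these measures concrete, the following sketch shows how MBF, AES and SR could be computed from a collection of independent runs; the record fields and the convention that an unsuccessful run stores no evaluation count are our own illustrative assumptions, not prescribed by the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RunRecord:
    best_fitness: float                # best fitness found at termination of one EA run
    evals_to_solution: Optional[int]   # evaluations needed to reach the target fitness, or None if it failed

def mbf(runs: List[RunRecord]) -> float:
    """Mean best fitness over all runs."""
    return sum(r.best_fitness for r in runs) / len(runs)

def sr(runs: List[RunRecord]) -> float:
    """Success rate: fraction of runs that reached the target fitness within the time limit."""
    return sum(r.evals_to_solution is not None for r in runs) / len(runs)

def aes(runs: List[RunRecord]) -> Optional[float]:
    """Average number of evaluations to solution, taken over successful runs only."""
    ok = [r.evals_to_solution for r in runs if r.evals_to_solution is not None]
    return sum(ok) / len(ok) if ok else None
```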
Further to the given performance metrics, utility is also determined by the prob-
lem instance(s) or objective functions for which the EA is being tuned. In the sim-
plest case, we are tuning our EA on one function f . Then the utility of a parameter
vector p̄ is measured by the EA performance on f . In this case, tuning delivers
a specialist, that is, an EA that is very good at solving one problem instance, the
function f , with no claims or indications regarding its performance on other prob-
lem instances. This can be a satisfactory result if one is only interested in solving
f . However, algorithm designers in general, and evolutionary computing experts in
particular, are often interested in so-called robust EA instances, that is, in EA in-
stances that work well on many objective functions. In terms of parameters, this
requires “robust parameter settings”. To this end, test suites consisting of many
test functions are used to test and evaluate algorithms. For instance, a specific set
{ f1 , . . . , fn } is used to support claims that the given algorithm is good on a “wide
range of problems”. Tuning an EA on a set of functions delivers a generalist, that
is, an EA that is good at solving various problem instances. Obviously, a true gen-
eralist would perform well on all possible functions. However, this is impossible
by the no-free-lunch theorem [36]. Therefore, the quest for generalist EAs is prac-
tically limited to less general claims that still raise serious methodology issues, as
discussed in [15].
Technically, tuning an EA on a collection of functions { f1 , . . . , fn } means that
the utility is not a single number, but a vector of utilities corresponding to each of
the test functions. Hence, finding a good generalist is a multi-objective problem for
which each test function is one objective. The current parameter tuning algorithms
can only deal with tuning problems for which the utilities can be compared directly;
therefore a method is needed to aggregate the utility vectors into one scalar number.
A straightforward solution to this can be obtained by averaging over the given test
suite. For a precise definition we need to extend the notation of the utility function
such that it shows the given objective function f . Then u f ( p̄) reflects the perfor-
mance of the evolutionary algorithm instance EA( p̄) on f , and the utility of p̄ on
F = { f1 , . . . , fz } can be defined as the average of the utilities u f1 ( p̄), . . . , u fz ( p̄):
uF ( p̄) = (1/z) · ∑_{i=1}^{z} u_{f_i} ( p̄).
Obviously, instead of a simple arithmetic average, one could use weighted averages
or more advanced multi-objective aggregation mechanisms, for instance, based on
lexicographic orderings.
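As an illustration of this aggregation step, the sketch below averages per-function utilities into uF ( p̄). The callable that measures the performance of EA( p̄) on a single function is an assumed hook, and the optional weights correspond to the weighted-average variant mentioned above.

```python
from statistics import mean
from typing import Callable, Optional, Sequence

def suite_utility(p: Sequence[float],
                  test_suite: Sequence[Callable],
                  utility_on: Callable[[Sequence[float], Callable], float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Aggregate the per-function utilities u_f(p) into a single scalar u_F(p).

    By default this is the plain arithmetic mean over the suite; passing
    weights yields the weighted-average variant mentioned in the text.
    """
    utilities = [utility_on(p, f) for f in test_suite]
    if weights is None:
        return mean(utilities)
    return sum(w * u for w, u in zip(weights, utilities)) / sum(weights)
```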
Summarizing, the core of our terminology is as follows:
1. Solution vectors have fitness values, based on the objective function related to
the given problem instance to be solved.
2. EA instances have performance values, based on information regarding fitness
and running time on one or more problem instances (objective functions).
3. Parameter vectors have utility values, defined by the performance of the corre-
sponding EA instance and the problem instances (objective functions) used for
tuning.
Figure 2.3 illustrates the matter in a graphical form. It shows that the solutions of
a tuning problem depend on the problem(s) to be solved, the EA used, and the utility
Fig. 2.3: Generic scheme of parameter tuning showing how good parameter values
depend on four factors: the problem(s) to be solved, the EA used, the utility function,
and the tuner itself
function. Adding the tuner to the equation we obtain this picture showing how a set
of good parameter values obtained through tuning depends on four factors.
The tuners we are to review in the following are variants of this generic scheme,
which can be divided into two main categories:
1. iterative and
2. non-iterative.
All non-iterative tuners execute the GENERATE step only once, during initializa-
tion, thus creating a fixed set of vectors. Each of those vectors is then tested during
the TEST phase to find the best vector in the given set. Hence, one could say that
non-iterative tuners follow the INITIALIZE-and-TEST template. Initialization can
be done by random sampling, generating a systematic grid, or some space-filling set
of vectors.
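The INITIALIZE-and-TEST template of non-iterative tuners can be written down compactly. The sketch below uses uniform random sampling for the initialization step; a grid or other space-filling design would simply replace the sampler. All names are our own, and the utility function is assumed to perform one (stochastic) run of the EA being tuned and return its performance.

```python
import random
from typing import Callable, Dict, Tuple

def sample_vector(ranges: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Draw one parameter vector uniformly at random from the given box of ranges."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def initialize_and_test(ranges: Dict[str, Tuple[float, float]],
                        utility: Callable[[Dict[str, float]], float],
                        n_vectors: int, n_tests: int) -> Dict[str, float]:
    """Non-iterative tuning: generate a fixed set of vectors once, test each
    n_tests times, and return the vector with the best average utility."""
    candidates = [sample_vector(ranges) for _ in range(n_vectors)]    # INITIALIZE (done only once)

    def avg_utility(vec):
        return sum(utility(vec) for _ in range(n_tests)) / n_tests    # TEST (repeated stochastic EA runs)

    return max(candidates, key=avg_utility)
```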
The second category of tuners is formed by iterative methods that do not fix the
set of vectors during initialization, but start with a small initial set and create new
vectors iteratively during execution.
Given the stochastic nature of EAs, a number of tests is necessary for a reliable
estimate of utility. Following [8] and [5], we distinguish between
1. single-stage and
2. multistage procedures.
Single-stage procedures perform the same number of tests for each given vector,
while multistage procedures use a more sophisticated strategy. In general, they aug-
ment the TEST step with a SELECT step, where only promising vectors are selected
for further testing and those with low performance are deliberately ignored.
As mentioned in Section 2.2, some methods are only applicable to quantitative
parameters. Sophisticated tuners, such as SPOT [3], however, can be used for quan-
titative, qualitative or even mixed parameter spaces.
In this group we find tuners that try to allocate search effort efficiently by
cleverly generating parameter vectors. Strictly speaking, they might not always min-
imize A, the number of parameter vectors tested, but rather try to “optimize the spending”, that is, get the most out of testing A
parameter vectors. Such tuners are usually iterative methods, although there are ex-
ceptions, e.g., Latin square [25] and Taguchi orthogonal arrays [35]. The idea behind
most tuners in this category is to start with a relatively small set of vectors and iter-
atively generate new sets in a clever way, i.e., such that new vectors are likely to be
good. These tuners are only appropriate for quantitative parameters.
Finding a good set of parameter values for EAs is the kind of search problem
that EAs are good at. It is therefore possible to use an EA (on the design layer) for
optimizing the parameters of an EA (on the algorithm layer). The EA on the design
layer is called a meta-EA and in general any kind of evolutionary algorithm, e.g., ge-
netic algorithms (GAs), evolution strategies (ESs), evolutionary programming (EP),
differential evolution (DE), particle swarm optimization (PSO), and estimation of
distribution algorithms (EDAs), could be used for this purpose. However, existing
literature only reports on using GAs, ESs, and EDAs as meta-EAs. We assume that
readers of this chapter are familiar with the basics of evolutionary algorithms and
need no introduction to GAs or ESs. Therefore, here we only discuss the generic
template of meta-EA applications, which is the following. The individuals used in
the meta-EA encode a parameter vector p̄ of (quantitative) parameter values of the
baseline EA. This encoding depends on the type of EA on the design layer, and
may differ for a meta-GA, a meta-ES, or a meta-EDA. To evaluate the utility of
such a vector p̄, the baseline EA is run B times using the given parameter values
in p̄, thereby performing B tests. The utility of p̄ is determined based on these run-
s/tests, and the actual definition of EA performance. Using this utility information,
the meta-EA can perform selection, recombination, and mutation of parameter vec-
tors in the usual way. As early as 1978, Mercer and Sampson [24] elaborated on this
idea by their meta-GA (called meta-plan), but due to the large computational costs,
their research was very limited. Grefenstette [18] was the first to conduct more ex-
tensive experiments with meta-EAs and showed the effectiveness of the approach.
The use of meta-ES has been reported by Grefenstette [18] and Herdy [20] with
mixed results.
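For concreteness, the generic meta-EA template can be sketched as follows. This is not any of the published meta-GAs or meta-ESs; it is a minimal (μ+λ)-style loop over real-valued parameter vectors, with run_baseline_ea an assumed hook that performs one run of the EA being tuned and returns its performance.

```python
import random
from statistics import mean

def utility(vec, run_baseline_ea, B):
    """Utility of a parameter vector: average performance over B runs of the baseline EA."""
    return mean(run_baseline_ea(vec) for _ in range(B))

def meta_ea(run_baseline_ea, dim, mu=5, lam=10, sigma=0.1, generations=20, B=5):
    """Very small (mu + lambda)-style meta-EA over parameter vectors scaled to [0, 1]^dim.

    Parents are re-evaluated together with the offspring in each generation,
    which keeps the sketch simple at the price of some extra utility tests.
    """
    pop = [[random.random() for _ in range(dim)] for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = random.choice(pop)
            child = [min(1.0, max(0.0, x + random.gauss(0.0, sigma))) for x in parent]
            offspring.append(child)
        # (mu + lambda) survivor selection on utility (higher utility is better)
        pop = sorted(pop + offspring,
                     key=lambda v: utility(v, run_baseline_ea, B),
                     reverse=True)[:mu]
    return pop[0]
```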
The meta-GA and meta-ES approaches do not differ significantly; the GA or ES
is used as an optimizer and any preference between them is solely based on their op-
timization power and/or speed. The approach based on ideas from EDAs is different
in this respect. While EAs only try to find good points in a search space belonging
to a certain problem, EDAs try to estimate the distribution of promising values over
the search space. Consequently, they provide more information than GAs or ESs,
because the evolved distributions disclose a lot about the parameters, in particular
about the utility landscape, and can be used to get insight into the sensitivity and
relevance of the different parameters. Hence, the choice for using a meta-EDA can
be motivated by the additional information it provides w.r.t. a meta-GA or a meta-
ES. The Relevance Estimation and VAlue Calibration method (REVAC) offered by
Nannen and Eiben is based on this idea. REVAC has been introduced in [29] and
further polished in [27, 28] and [26]. In [30] and [34] it is demonstrated how REVAC
can be used to indicate the costs and benefits of tuning each parameter – something
a meta-GA or a meta-ES cannot do. In the process of searching for good parameter
values REVAC implicitly creates probability distributions (one probability distribu-
tion per parameter) in such a way that parameter values that proved to be good in
former trials have a higher probability than poor ones. Initially, all distributions rep-
resent a uniform random variable and after each new test they are updated based on
the new information. After the termination of the tuning process, i.e., the stopping
of REVAC, these distributions can be retrieved and analyzed, showing not only the
range of promising parameter values, but also disclosing information about the rel-
evance of each parameter. For a discussion of REVAC’s probability distributions,
entropy, and parameter relevance we refer you to [34].
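The per-parameter distribution idea can be illustrated with a deliberately simplified sketch (this is not the actual REVAC update rule): for each parameter, only the values belonging to the best-performing vectors tested so far are kept, and a new value is drawn between two neighbouring retained values, so that the implied marginal distribution gradually concentrates on promising regions.

```python
import random
from typing import List, Tuple

def resample_parameter(history: List[Tuple[List[float], float]],
                       index: int, keep_fraction: float = 0.5) -> float:
    """Much-simplified illustration of per-parameter distributions.

    history holds (parameter_vector, utility) pairs collected so far (assumed
    non-empty). For the parameter at position `index`, keep the values from the
    best vectors and draw a new value uniformly between two neighbouring
    retained values, so good regions are sampled more often over time.
    """
    ranked = sorted(history, key=lambda h: h[1], reverse=True)
    kept = sorted(v[index] for v, _ in ranked[:max(2, int(keep_fraction * len(ranked)))])
    if len(kept) < 2:
        return kept[0]
    i = random.randrange(len(kept) - 1)
    return random.uniform(kept[i], kept[i + 1])
```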
The methods in this group try to reduce B, i.e., the number of tests per parameter
vector. The essential dilemma here is that fewer tests yield less reliable estimates
of utility, and if the number of tests is too low, then the utilities of two parameter
vectors might not be statistically distinguishable. More tests can improve (sharpen)
the stability of the expected utility to a level such that the superiority of one vector
over another can be safely concluded, but more tests come with a price of longer
runtime. The methods in this group use the same trick to deal with this problem:
performing only a few tests per parameter vector initially and increasing this num-
ber to the minimum level that is enough to obtain statistically sound comparisons
between the given parameter vectors. Such methods are known as statistical screen-
ing, ranking, and selection. Some of these methods are designed to select a single
best solution, others are designed to screen a set of solutions by choosing a (random
size) subset of the solutions containing the best one. Several of these approaches are
based on multistage procedures such as sequential sampling [8, 11].
Maron and Moore [22] introduced the term racing for a basic ranking and selection
scheme and applied several methods, based on this scheme, to EAs. These non-
iterative methods assume that a set P = { p̄1 , . . . , p̄A } of parameter vectors is given
and a maximum number of tests, B, per parameter vector is specified. Then they
work according to the following procedure.
1. Test all p̄ ∈ P once.
2. Determine p̄best with highest (average) utility.
3. Determine P′ as the set of parameter vectors whose utility is not significantly
worse than that of p̄best .
4. P = P′ .
5. If size(P) > 1 and the number of tests for each p̄ ∈ P is smaller than B, goto 1.
The main difference between the existing racing methods can be found in step 3 as
there are several statistical tests3 that can indicate significant differences in utility:
• Analysis of Variance (ANOVA), cf. [32].
• Kruskal-Wallis (Rank Based Analysis of Variance).
• Hoeffding’s bound, cf. [22].
• Unpaired Student T-Test, cf. [22].
All these methods are multistage procedures and they can be used for both quan-
titative and qualitative parameters; see also the discussion in [9]. Recently, Bal-
aprakash et al. [1] introduced iterative racing procedures. Birattari et al. [10] present
an overview of state-of-the-art racing procedures.
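The five-step racing scheme translates almost directly into code. In the sketch below, test_once performs a single utility test of one parameter vector and significantly_worse stands for whichever of the statistical tests listed above is used in step 3; both are assumed hooks.

```python
def race(vectors, test_once, significantly_worse, max_tests_per_vector):
    """Basic (non-iterative) racing over a given set of parameter vectors.

    results[i] collects the observed utilities of vectors[i]; vectors whose
    results are significantly worse than those of the current best are dropped.
    """
    alive = list(range(len(vectors)))
    results = {i: [] for i in alive}
    while len(alive) > 1 and len(results[alive[0]]) < max_tests_per_vector:
        for i in alive:                                                     # step 1: test every remaining vector once
            results[i].append(test_once(vectors[i]))
        best = max(alive, key=lambda i: sum(results[i]) / len(results[i]))  # step 2: current best on average utility
        alive = [i for i in alive                                           # steps 3-4: keep only competitive vectors
                 if i == best or not significantly_worse(results[i], results[best])]
    return max(alive, key=lambda i: sum(results[i]) / len(results[i]))
```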
There are currently four approaches that try to combine the main ideas from both
categories. Yuan et al. [37] showed how racing [22] can be combined with meta-
EAs in order to reduce both A and B. They propose two different approaches. In both
methods, they use the standard meta-EA approach, but introduce a racing technique
into the evaluation of parameter vectors.
In the first method, the quantitative (numerical) parameters are represented by a
vector and are evolved using a meta-(1+λ ) ES. At each generation, a set of λ new
vectors is created using a Gaussian distribution centered at the current best vector.
Racing is then used to determine which vector has the highest utility. By using
racing, not all of the λ individuals have to be evaluated B times to determine the
current best vector. It is expected that many of the vectors are eliminated after just a
few tests saving a large portion of computational resources.
Their second method uses a population of parameter vectors containing the quan-
titative (numerical) parameters, which is evolved using selection, recombination,
and mutation. However, the utility of a vector of numerical parameters is evalu-
ated not only within one single EA instance, but on all possible combinations of
qualitative parameters. The utility of the vector is equal to the performance of the
best combination. Racing is then used to determine which combination is most
successful using the vector of numerical parameters. By employing racing, not every
combination has to be evaluated B times. This saves computational resources, while
the search space of both quantitative and qualitative parameters can still be explored.
Introducing the sequential parameter optimization toolbox (SPOT), Bartz-Beiel-
stein et al. [3] outlined another approach which uses a small number of parameter
vectors A and tests B by means of an iterative procedure. SPOT starts with an initial
set of parameter vectors usually laid out as a space filling design, e.g., Latin hy-
percube design (LHD). These are tested B0 times to estimate their expected utility.
Based on the results, a prediction model is fitted to represent an approximation of the
utility landscape (additionally, the often used kriging component enables estimating
the error at any point in the model). Then l new parameter vectors are generated
and their expected utility is estimated using the model, i.e., without additional runs
of the EA. This step employs predictions and error estimations within the expected
improvement heuristic to find points with good potential to improve over the cur-
rently existing ones. The most promising points are then tested B times. If none of
those points results in a better expected utility, B is increased. To establish fairness,
the number of reevaluations of the current best and new design points is the same.
As the estimates of the current vectors are sharpened, utility prediction gets more
accurate over time and “lucky samples” from the start are dropped. Note that the set
of parameter vectors employed after the initial step is usually very small, often in
the range of one to four vectors, and is held constant in size.
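A greatly simplified, model-based tuning loop in the spirit of this description might look as follows; the kriging model and the expected-improvement criterion of SPOT are replaced here by an arbitrary regression model (fit_model is an assumed hook returning a predictor with a scikit-learn-like predict interface) and by simply selecting the candidate with the best predicted utility.

```python
import random
from typing import Callable, List, Tuple

def model_based_tuning(bounds: List[Tuple[float, float]], run_ea: Callable, fit_model: Callable,
                       n_initial: int = 10, b0: int = 3, n_candidates: int = 200, iterations: int = 15):
    """Greatly simplified sketch of model-based iterative tuning.

    run_ea(x) performs one run of the EA with parameter vector x and returns its
    performance; fit_model(X, y) is assumed to return a fitted regression model.
    """
    def sample():
        return [random.uniform(lo, hi) for lo, hi in bounds]

    X = [sample() for _ in range(n_initial)]                      # initial space-filling-ish design
    y = [sum(run_ea(x) for _ in range(b0)) / b0 for x in X]       # initial utility estimates (b0 tests each)
    for _ in range(iterations):
        model = fit_model(X, y)                                   # approximate the utility landscape
        candidates = [sample() for _ in range(n_candidates)]
        best_cand = max(candidates, key=lambda c: model.predict([c])[0])   # pick the most promising candidate
        X.append(best_cand)
        y.append(sum(run_ea(best_cand) for _ in range(b0)) / b0)  # test the promising point for real
    return X[max(range(len(X)), key=lambda i: y[i])]
```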
In order to explore the search space relatively quickly, B is initialized to a small
value. If a promising area is found, the method focuses on improving the model and
the best vector using more accurate utility estimates by increasing B. Although in
Optimizing A and/or B is the most commonly used way to optimally allocate
search efforts, but in principle the number of fitness evaluations per EA run (C)
could also be optimized. Each decrease in EA run-time is multiplied by both A and
B; hence such runtime reduction methods can have a high impact on the total search
effort. However, at the time of writing this chapter, we are not aware of any methods
proposed for this purpose.
In theory, methods to reduce the number of evaluations per EA run could be based
on the idea of terminating an EA run if there is an indication that the parameter
vector used by the given EA will not reach some utility threshold, e.g., the average
utility, or the utility of another, known parameter-vector such as the current best or
current worst one. Such an indication can be statistically or theoretically grounded,
but also based on heuristics. This concept is very similar to racing, but instead of
decreasing the number of tests (B), it decreases the number of evaluations (C). In
general, such a method could deal with both qualitative and quantitative parameters.
As mentioned above, we could not find any general parameter tuning methods of
this type in the literature. However, we found one paper whose main concept would
fit under the umbrella of saving on C whenever possible. This paper of Harik et al.
is concerned with population sizes only, as all other EA parameters are fixed and
only population sizes are varied and compared [19]. The essence of the method is
to run multiple EAs in parallel, each with a different population size, compare the
running EAs after a fixed number of fitness evaluations, and terminate those that
use a population size with an expectedly poor performance (i.e., utility). The ex-
pectations regarding the utilities of different population sizes are based on a simple
heuristic observing that small populations always have a head-start, because of the
fast convergence to a (local) optimum. This fact motivates the assumption that once
a large population EA overtakes a small population EA (using the same number
of fitness evaluations), the small population EA will always remain behind. Hence,
when an EA with a small population size starts performing worse than EAs with a
large population size it can be terminated in an early stage, thus reducing search ef-
fort. The method in [19] can only use a fitness-based notion of utility, i.e., not speed
or success rate information, and the heuristic is only applicable to population sizes.
Further research is needed to develop heuristics for parameters regarding variation
and selection operators.
guess of a human user. As the authors summarize: “no matter what tuner algorithm
you use, you will likely get a much better EA than relying on your intuition and the
usual parameter setting conventions”.
Tuning evolutionary algorithms can be considered as a special case of tuning
metaheuristics. Birattari’s book, cf. [9], gives a good treatment from a machine
learning perspective, including several case studies, for instance, on tuning iterated
local search and ant colony optimization by the F-Race tuner.
This suggestion may sound trivial, but considering the current practice in evolu-
tionary computing it is far from being superfluous. As we argue in this chapter, there
are enough good tuners available, the efforts of tuning are limited, and the gains can
be substantial. A typical use case is the publication of a research paper showing that
a newly invented EA (NI-EA) is better than some carefully chosen benchmark EA
(BM-EA). In this case, tuning NI-EA can make the difference between comparable-
and-sometimes-better and convincingly-better in favor of NI-EA.
Let us note that using parameter tuners in performance comparisons leads to the
methodological question about whether the benchmark algorithm should be tuned
too. One argument for not tuning is that a benchmark may be interpreted as a fixed
and invariable algorithm that provides the same challenge to every comparison using
it. This may sound unreasonable, giving an unfair advantage to the new algorithm in
the comparison. However, the current EC research and publication practice ignores
the tuning issue completely, hiding the tuning efforts, and thus hiding the possible
unfairness of a comparison between a tuned NI-EA and an untuned BM-EA. In this
respect, using a tuner for NI-EA and reporting the tuning efforts alongside the per-
formance results is more illustrative than the contents of the majority of publications
at present. A typical claim in a paper following the usual practice sounds like
“NI-EA is better than BM-EA,”
with possible refinements concerning the number or type of test functions where
this holds. Using tuners and reporting the tuning effort would improve the present
practice by making things public and transparent. A typical claim of the new style
would sound like
“Spending effort X on tuning NI-EA, we obtain an instance that is better than (the untuned)
BM-EA.”
To this end, the tuning effort could be measured by the number of parameter vectors
tested (A), the number of utility tests executed (A × B)4 , the computing time spent
on running the tuner, and so on.
Alternatively, one could reinterpret the concept of a benchmark EA from as is
(including the values of its parameters) to as it could be (if its parameters were
tuned). This stance would imply that both NI-EA and BM-EA need to be tuned for
a fair comparison. A typical claim of this style would sound like
“The best instance of NI-EA is better than the best instance of BM-EA, where for both EAs
the best instance is the one obtained through spending effort X on tuning.”
Note that this new philosophy eliminates the rather accidental parameter values as
a factor when comparing EAs and focuses on the type of EA operators (recombina-
tion, mutation, etc.) instead. In other words, we would obtain statements about EAs
rather than EA instances.
An important caveat with respect to tuning EAs is the fact that the set of important
parameters is often less clear than it may seem. At first sight, it is easy to identify the
4 If the number of tests per parameter vector is not constant, then A × B is not the total number of
utility tests executed; but for the present argument this is not relevant.
parameters of a given EA, but a closer look at its code may disclose that the work-
ing of the algorithm depends on several constants hidden in the code. These “magic
constants” are determined by the algorithm designer/programmer, but are not made
visible as independent parameters. For instance, the conceptual description of the
well-known G-CMA-ES will indicate the population size μ and the offspring size
λ as parameters with corresponding default values, such as λ = 4 + 3 · log(n) and
μ = λ/2, including the magic constants 4, 3, and 2. These constants can be designated
as independent parameters of the given EA, thereby making them subject to tuning.
The advantage of this practice is clear; it provides additional knobs to tune, poten-
tially increasing the maximum achievable performance. If your game is to obtain a
very well performing EA for some application, then you should definitely consider
this option. For academic comparisons, it may raise questions regarding the fairness
of comparison similar to those discussed above regarding benchmarking. The main
question here is whether it is correct to compare the given EA using some good val-
ues5 for its original parameters with a new variant of it using tuned values for those
parameters and tuned values for some of its magic constants.
Finally, note that the adaptability of EAs or related algorithms to a specific prob-
lem is different. We have seen this in the documented results from [30], as explained
in Section 2.6: The default parameters of algorithm A lead to very good perfor-
mance, while those of algorithm B do not. Tuning is thus much more im-
portant for B. An attempt to measure the adaptability by means of the empirical
tuning potential (ETP) is given in [31]. However, a lower adaptability may stem
from two sources. Either the algorithm is harder to tune and requires more effort
while tuning, or the obtainable best parameter vectors are not much better than the
default values. Running tuners with increased number of tests T may pay off by
giving insight here.
Fig. 2.4: The effect of parameter tuning on comparing EAs. Left: the traditional
situation, where the reported EA performance is an “accidental” point on the scale
ranging from the worst to the best performance (as determined by the parameter
values used). Right: the improved situation, where the reported EA performance is
a near-optimal point on this scale, belonging to the tuned instance. This indicates
the full potential of the given EA, i.e., how good it can be when using the right
parameter values
ploying their EA-based optimizer module within a decision support system. Given
that the costs of this extra effort are limited, but the gains can be substantial, it
is hard to imagine a reason for not undertaking it. To support this development,
let us point to freely available software from http://sourceforge.net/projects/tuning/.
This Web site contains the Matlab codes of the tuning algorithms we have been
using over the last couple of years. Additionally,
http://sourceforge.net/projects/mobat/ provides the Meta-Heuristic Optimizer
Benchmark and Analysis Toolbox. It can be used to configure, benchmark and
analyze Evolutionary Algorithms and other Meta-Heuristic Optimizers. MOBAT is
written in Java and optimized for parallel processing. Alternatively, SPOT can be
obtained from
www.gm.fh-koeln.de/campus/personen/lehrende/thomas.bartz-beielstein/00489/.
A change of practice as foreseen here could also imply a change in methodology
in the long term, illustrated in Figure 2.4. In today’s research practice, comparisons
of EAs are typically based on ad hoc parameter values. Hence, the experiments show
a comparison between two rather arbitrary EA instances and tell nothing about the
full potential of the given EAs. Using tuner algorithms could fix this deficiency and
eliminate much of the noise introduced by ad hoc parameter values, thus support-
ing comparisons of evolutionary algorithms (i.e., configurations of the qualitative
parameters), rather than evolutionary algorithm instances.
References
[1] Balaprakash, P., Birattari, M., and Stützle, T.: Improvement strategies for the
F-race algorithm: Sampling design and iterative refinement. In: Hybrid Meta-
heuristics, pp. 108–122 (2007)
[2] Bartz-Beielstein, T., Chiarandini, M., Paquete, L., and Preuss, M., editors, Em-
pirical methods for the analysis of optimization algorithms. Springer (2010)
[3] Bartz-Beielstein, T., Parsopoulos, K.E., and Vrahatis, M. N.: Analysis of parti-
cle swarm optimization using computational statistics. In: Chalkis, editor, Pro-
ceedings of the International Conference of Numerical Analysis and Applied
Mathematics (ICNAAM 2004), pp. 34–37 (2004)
[4] Bartz-Beielstein, T.: Experimental analysis of evolution strategies: overview
and comprehensive introduction. Technical Report Reihe CI 157/03, SFB 531,
Universität Dortmund, Dortmund, Germany (2003)
[5] Bartz-Beielstein, T.: Experimental research in evolutionary computation—the
new experimentalism. Natural Computing Series. Springer, Berlin, Heidelberg,
New York (2006)
[6] Bartz-Beielstein, T., and Markon, S.: Tuning search algorithms for real-world
applications: A regression tree based approach. Technical Report of the Col-
laborative Research Centre 531 Computational Intelligence CI-172/04, Uni-
versity of Dortmund, March (2004)
[7] Bartz-Beielstein, T., and Preuss, M.: Considerations of budget allocation for
sequential parameter optimization (SPO). In: Paquete, L., et al., editors, Work-
shop on Empirical Methods for the Analysis of Algorithms, Proceedings, pp.
35–40, Reykjavik, Iceland (2006)
[8] Bechhofer, R. E., Dunnett, W. C., Goldsman, D. M., and Hartmann, M.: A
comparison of the performances of procedures for selecting the normal pop-
ulation having the largest mean when populations have a common unknown
variance. Communications in Statistics, B19:971–1006 (1990)
[9] Birattari, M.: Tuning metaheuristics. Springer, Berlin, Heidelberg, New York
(2005)
[10] Birattari, M., Yuan, Z., Balaprakash, P., and Stützle, T.: F-Race and iterated
F-Race: An overview. In: Bartz-Beielstein, T., et al. [2]
[11] Branke, J., Chick, S. E., and Schmidt, C.: New developments in ranking and
selection: An empirical comparison of the three main approaches. In: WSC
’05: Proceedings of the 37th Conference on Winter Simulation, pp. 708–717.
Winter Simulation Conference (2005)
[12] Chen, J. E., Chen, C. H., and Kelton, D. W.: Optimal computing budget allo-
cation of indifference-zone-selection procedures. Working paper, taken from
http://www.cba.uc.edu/faculty/keltonwd (2003)
[13] Clune, J., Goings, S., Punch, B., and Goodman, E.: Investigations in meta-
GAs: Panaceas or pipe dreams? In: GECCO ’05: Proceedings of the 2005
Workshops on Genetic and Evolutionary Computation, pp. 235–241. ACM
(2005)
[14] Eiben, A. E., Hinterding, R., and Michalewicz, Z.: Parameter control in
evolutionary algorithms. IEEE Transactions on Evolutionary Computation,
3(2):124–141 (1999)
[15] Eiben, A. E., and Jelasity, M.: A critical note on experimental research method-
ology in EC. In: Proceedings of the 2002 Congress on Evolutionary Compu-
tation (CEC’2002), pp. 582–587. IEEE Press, Piscataway, NJ (2002)
[16] Eiben, A. E., and Smith, J. E.: Introduction to Evolutionary Computation. Nat-
ural Computing Series. Springer (2003)
[17] Eiben, A. E., and Smith, J. E.: Introduction to evolutionary computing.
Springer, Berlin, Heidelberg, New York (2003)
[18] Grefenstette, J. J.: Optimization of control parameters for genetic algorithms.
IEEE Transactions on Systems, Man and Cybernetics, 16(1):122–
128 (1986)
[19] Harik, G. R., and Lobo, F. G.: A parameter-less genetic algorithm. In: Banzhaf,
W., et al., editors, Proceedings of the Genetic and Evolutionary Computation
Conference GECCO-99, San Francisco, CA. Morgan Kaufmann, pp. 258–265
(1999)
[20] Herdy, M.: Reproductive isolation as strategy parameter in hierarchically or-
ganized evolution strategies. In: Männer, R., and Manderick, B., editors, Pro-
ceedings of the 2nd Conference on Parallel Problem Solving from Nature, pp.
209–209. North-Holland, Amsterdam (1992)
[21] Lasarczyk, C. W. G.: Genetische Programmierung einer algorithmischen
Chemie. Ph.D. thesis, Technische Universität Dortmund (2007)
[22] Maron, O., and Moore, A.: The racing algorithm: model selection for lazy
learners. Artificial Intelligence Review, 11:193–225 (1997)
[23] Maturana, J., Lardeux, F., and Saubion, F.: Autonomous operator management
for evolutionary algorithms. Journal of Heuristics 16, 881–909 (2010)
[24] Mercer, R. E., and Sampson, J. R.: Adaptive search using a reproductive meta-
plan. Kybernetes, 7:215–228 (1978)
[25] Myers, R., and Hancock, E. R.: Empirical modelling of genetic algorithms.
Evolutionary Computation, 9(4):461–493 (2001)
[26] Nannen, V.: Evolutionary Agent-Based Policy Analysis in Dynamic Environ-
ments. Ph.D. thesis, Free University Amsterdam (2009)
[27] Nannen, V., and Eiben, A. E.: Efficient relevance estimation and value calibra-
tion of evolutionary algorithm parameters. In: IEEE Congress on Evolutionary
Computation, pp. 103–110. IEEE (2007)
[28] Nannen, V. and Eiben, A. E.: Relevance estimation and value calibration of
evolutionary algorithm parameters. In: Veloso, M., editor, IJCAI 2007, Pro-
ceedings of the 20th International Joint Conference on Artificial Intelligence,
pp. 1034–1039 (2007)
[29] Nannen, V. and Eiben, A. E.: A method for parameter calibration and relevance
estimation in evolutionary algorithms. In: Keijzer, M., editor, Proceedings of
the Genetic and Evolutionary Computation Conference (GECCO-2006), pp.
183–190. Morgan Kaufmann, San Francisco (2006)
[30] Nannen, V., Smit, S. K., and Eiben, A. E.: Costs and benefits of tuning param-
eters of evolutionary algorithms. In: Rudolph, G., Jansen, T., Lucas, S., Poloni,
C., and Beume, N., editors, PPSN, volume 5199 of Lecture Notes in Computer
Science, pp. 528–538. Springer (2008)
[31] Preuss, M.: Adaptability of algorithms for real-valued optimization. In: Gia-
cobini, M., et al., editors, Applications of Evolutionary Computing, EvoWork-
Chapter 3
Automated Algorithm Configuration and Parameter Tuning
Holger H. Hoos

3.1 Introduction
1 We note that the use of heuristic techniques does not imply that the resulting algorithms are nec-
essarily incomplete or do not have provable performance guarantees, but often results in empirical
performance far better than the bounds guaranteed by rigorous theoretical analysis.
efficacy with which a heuristic algorithm solves a given problem instance or class of
problem instances. For example, the run-time of CPLEX 12.1 – a widely used, com-
mercial solver for mixed integer programming problems – has recently been demon-
strated to vary by up to a factor of over 50 with the settings of 76 user-accessible
parameters [42].
A problem routinely encountered by designers as well as end users of parame-
terised algorithms is that of finding parameter settings (or configurations) for which
the empirical performance on a given set of problem instances is optimised. For-
mally, this algorithm configuration or parameter tuning problem can be stated as
follows:
Given
• an algorithm A with parameters p1 , . . . , pk that affect its behaviour,
• a space C of configurations (i.e., parameter settings), where each config-
uration c ∈ C specifies values for A’s parameters such that A’s behaviour
on a given problem instance is completely specified (up to possible ran-
domisation of A),
• a set of problem instances I,
• a performance metric m that measures the performance of A on instance
set I for a given configuration c,
find a configuration c∗ ∈ C that results in optimal performance of A on I ac-
cording to metric m.
In the context of this problem, the algorithm whose performance is to be optimised
is often called the target algorithm, and we use A(c) to denote target algorithm A
under a specific configuration c. The set of values any given parameter p can take
is called the domain of p. Depending on the given target algorithm, various types of
parameters may occur. Categorical parameters have a finite, unordered set of dis-
crete values; they are often used to select from a number of alternative mechanisms
or components. Using Boolean parameters, heuristic mechanisms can be activated
or deactived, while the behaviour and interaction of these mechanisms is often con-
trolled by integer- and real-valued parameters (the former of which are a special
cases of ordinal parameters, whose domains are discrete and ordered). Conditional
parameters are only active when other parameters are set to particular values; they
routinely arise in the context of mechanisms that are activated or selected using some
parameter, and whose behaviour is then controlled by other parameters (where the
latter parameters conditionally depend on the former). Sometimes, it is useful to
place additional constraints on configurations, e.g., to exclude certain combinations
of parameter values that would lead to ill-defined, incorrect or otherwise undesirable
behaviour of a given target algorithm.
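One convenient way to represent such a configuration space, including a conditional parameter that is only active for a particular value of its parent, is sketched below; the parameter names are made-up examples and do not refer to any specific target algorithm.

```python
import random

# Hypothetical configuration space: a categorical choice activates its own sub-parameter.
SPACE = {
    "restart_strategy": ["none", "luby", "geometric"],   # categorical parameter
    "use_preprocessing": [True, False],                   # Boolean parameter
    "geometric_factor": (1.1, 2.0),                       # real-valued, conditional parameter
}
CONDITIONS = {
    # geometric_factor is only active when restart_strategy == "geometric"
    "geometric_factor": ("restart_strategy", "geometric"),
}

def sample_configuration():
    """Draw a random configuration, instantiating conditional parameters only when active."""
    conf = {}
    for name, domain in SPACE.items():
        if name in CONDITIONS:
            parent, required = CONDITIONS[name]
            if conf.get(parent) != required:
                continue                                  # parameter is inactive in this configuration
        conf[name] = (random.uniform(*domain) if isinstance(domain, tuple)
                      else random.choice(domain))
    return conf
```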
Clearly, the number and types of parameters, along with the occurrence of con-
ditional parameters and constraints on configurations, determine the nature of the
configuration space C and have profound implications on the methods to be used
for searching performance-optimising configurations within that space. These meth-
ods range from well-known numerical optimisation procedures, such as the Nelder-
Mead Simplex algorithm [49, 13] or the more recent gradient-free CMA-ES algo-
rithm [25, 27, 26], to approaches based on experimental design methods (see, e.g.,
[11, 5, 1]), response-surface models (see, e.g., [44, 7]) and stochastic local search pro-
cedures (see, e.g., [36, 37]).
In general, when configuring a specific target algorithm, it is desirable to find
parameter configurations that work well on problem instances other than those in
the given instance set I. To this end, care needs to be taken in selecting the instances
in I to be representative of the kinds of instances to which the optimised target
algorithm configuration is expected to be applied. Difficulties can arise when I is
small, yet contains very different types of instances. To recognise situations in which
a configured target algorithm, A(c∗ ), fails to perform well when applied to instances
other than those used in the configuration process, it is advisable to test it on a set
of instances not contained in I; this can be done by including in I only part of the
overall set of instances available, or by means of cross-validation.
It is also advisable to investigate performance variation of A(c∗ ) over instance set
I, since, depending on the performance metric m used for the configuration of A and
differences between instances in I, the optimised configuration c∗ might represent a
trade-off between strong performance on some instances and weaker performance
on others. In particular, when using a robust statistic, such as median run-time, as
a performance metric, poor performance on large parts of a given instance set can
result. To deal effectively with target algorithm runs in which no solution was pro-
duced (in particular, time-outs encountered when optimising run-time), it is often
useful to use a performance metric based on penalised averaging, in which a fixed
penalty is assigned to any unsuccessful run of A (see also [37]).
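A penalised average of run-times can be computed as in the following sketch; the factor-of-ten penalty mirrors the commonly used PAR10 convention, but the exact penalty value is a free choice rather than something prescribed here.

```python
from typing import Optional, Sequence

def penalised_average_runtime(runtimes: Sequence[Optional[float]],
                              cutoff: float, penalty_factor: float = 10.0) -> float:
    """Average run-time in which unsuccessful runs (recorded as None, i.e. time-outs)
    are counted as penalty_factor * cutoff instead of being discarded."""
    penalised = [t if t is not None else penalty_factor * cutoff for t in runtimes]
    return sum(penalised) / len(penalised)
```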
In the existing literature, the terms algorithm configuration and parameter tuning
are often used interchangeably. We prefer to use parameter tuning in the context of
target algorithms with relatively few parameters with mostly real-valued domains,
and algorithm configuration in the context of target algorithms with many categor-
ical parameters. Following Hoos [28], we note that algorithm configuration prob-
lems arise when dealing with an algorithm schema that contains a number of instan-
tiable components (typically, subprocedures or functions), along with a discrete set
of concrete choices for each of these. While most standard numerical optimisation
methods are not applicable to these types of algorithm configuration problems, F-
Race [11, 5], Calibra [1] and ParamILS [36, 37] have been used successfully in this
context. However, so far only ParamILS has been demonstrated to be able to deal
with the vast design spaces resulting from schemata with many independently in-
stantiable components (see, e.g., [45, 66]), and promising results have been achieved
by a genetic programming procedure applied to the configuration of local search
algorithms for SAT [18, 19], as well as by a recent gender-based genetic algorithm
[2].
In the remainder of this chapter, we discuss three classes of methods for solving
algorithm configuration and parameter tuning problems. Racing procedures itera-
tively evaluate target algorithm configurations on problem instances from a given
set and use statistical hypothesis tests to eliminate candidate configurations that
are significantly outperformed by other configurations; ParamILS uses a powerful
stochastic local search (SLS) method to search within potentially vast spaces of candidate configurations; and sequential model-based optimisation methods construct and use predictive models of target algorithm performance to guide the search for performance-optimising configurations.
3.2 Racing Procedures
Given a number of candidate solvers for a given problem, the concept of racing is
based on a simple yet compelling idea: sequentially evaluate the candidates on a
series of benchmark instances and eliminate solvers as soon as they have fallen too
far behind the current leader, i.e., the candidate with the overall best performance at
a given stage of the race.
Racing procedures were originally introduced for solving model selection prob-
lems in machine learning. The first such technique, dubbed Hoeffding Races [48],
was introduced in a supervised learning scenario, where a black-box learner is eval-
uated by measuring its error on a set of test instances. The key idea is to test a
given set of models, one test instance at a time, and to discard models as soon as
they are shown to perform significantly worse than the best ones. Performance is
measured as error over all test instances evaluated so far, and models are eliminated
from the race using non-parametric bounds on the true error, determined based on
Hoeffding’s inequality (which gives an upper bound on the probability of the sum
of random variables deviating from its expected value). More precisely, a model is
discarded from the race if the lower bound on its true error (for a given confidence
level 1 − δ ) is worse than the upper bound on the error of the currently best model.
As a result, the computational effort expended in evaluating models becomes in-
creasingly focussed on promising candidates, and the best candidate models end up
getting evaluated most thoroughly.
This idea can be easily transferred to the problem of selecting an algorithm from
a set of candidates, where each candidate may correspond to a configuration of a pa-
rameterised algorithm [11]. In this context, candidate algorithms (or configurations)
are evaluated on a given set of problem instances. As in the case of model selection,
the race proceeds in steps, where in each step every candidate is evaluated on the
same instance, taken from the given instance set, and candidates that performed sig-
nificantly worse on the instances considered so far are eliminated from the race. (We
note that the evaluation of candidates in each step can, in principle, be performed
independently in parallel.)
This procedure requires that the set of candidate algorithms be finite and, since
in the initial steps of a race all candidates will need to be evaluated, of somewhat
reasonable size. Therefore, when applied to algorithm configuration or parameter
tuning scenarios with continuous parameters, racing approaches need to make use
of discretisation or sampling techniques. In the simplest case, all continuous param-
eters are discretised prior to starting the race. Alternatively, stages of sampling and
racing can be interleaved, such that the candidate configurations being considered
become increasingly concentrated around the best performing configurations.
In the following, we will first present the F-Race procedure of Birattari et al.
[11] in more detail and outline its limitations. We will then discuss variations of
F-Race that overcome those weaknesses [5, 12], and finally summarise some results
achieved by these racing procedures in various algorithm configuration scenarios.
3.2.1 F-Race
The F-Race algorithm by Birattari et al. [11] closely follows the previously dis-
cussed racing procedure. Similarly to Hoeffding races, it uses a non-parametric
test as the basis for deciding which configurations to eliminate in any given step.
However, rather than just performing pairwise comparisons with the currently best
configuration (the so-called incumbent), F-Race first uses the rank-based Friedman
test (also known as Friedman two-way analysis of variance by ranks) for ni inde-
pendent s-variate random variables, where s is the number of configurations still in
the race, and ni is the number of problem instances evaluated so far. The Friedman
test assesses whether the s configurations show no significant performance differ-
ences on the ni given instances; if this null hypothesis is rejected, i.e., if there is
evidence that some configurations perform better than others, a series of pairwise
procedure F-Race
  input target algorithm A, set of configurations C, set of problem instances I, performance metric m;
  parameters integer ni_min;
  output set of configurations C∗;
  C∗ := C; ni := 0;
  repeat
    randomly choose instance i from set I;
    run all configurations of A in C∗ on i;
    ni := ni + 1;
    if ni ≥ ni_min then
      perform rank-based Friedman test on results for configurations in C∗ on all instances in I evaluated so far;
      if test indicates significant performance differences then
        c∗ := best configuration in C∗ (according to m over instances evaluated so far);
        for all c ∈ C∗ \ {c∗} do
          perform pairwise Friedman post hoc test on c and c∗;
          if test indicates significant performance differences then
            eliminate c from C∗;
          end if;
        end for;
      end if;
    end if;
  until termination condition met;
  return C∗;
end F-Race
Fig. 3.1: Outline of F-Race for algorithm configuration (original version, according to [11]). In typical applications, ni_min is set to values between 2 and 5; further details are explained in the text. When used on its own, the procedure would typically be modified to return c∗ ∈ C∗ with the best performance (according to m) over all instances evaluated within the race.
post hoc tests between the incumbent and all other configurations is performed. All
configurations found to have performed significantly worse than the incumbent are
eliminated from the race. An outline of the F-Race procedure for algorithm config-
uration, as introduced by [11], is shown in Figure 3.1; as mentioned by [5], runs on
a fixed number of instances are performed before the Friedman test is first applied.
The procedure is typically terminated either when only one configuration remains,
or when a user-defined time budget has been exhausted.
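The following Python sketch illustrates a single such elimination step on the performance data collected so far. SciPy's Friedman test and the Wilcoxon signed-rank test are used here as convenient stand-ins for the exact rank-based test and post hoc procedure described above, and the data layout (a dictionary mapping each surviving configuration to its per-instance results) is an assumption made purely for illustration.

# Illustrative sketch of one F-Race elimination step. `results[c]` holds the
# performance values (lower is better) of configuration c on the instances
# raced so far, in the same order for every configuration (blocking).
from scipy.stats import friedmanchisquare, wilcoxon

def frace_step(results, alpha=0.05):
    survivors = list(results)
    if len(survivors) < 3:                 # the Friedman test needs >= 3 groups
        return survivors
    _, p = friedmanchisquare(*(results[c] for c in survivors))
    if p >= alpha:
        return survivors                   # no evidence of performance differences
    # incumbent: best mean performance so far (standing in for metric m)
    incumbent = min(survivors, key=lambda c: sum(results[c]) / len(results[c]))
    kept = [incumbent]
    for c in survivors:
        if c == incumbent:
            continue
        diffs = [a - b for a, b in zip(results[c], results[incumbent])]
        if all(d == 0 for d in diffs):
            kept.append(c)                 # indistinguishable from the incumbent
            continue
        # one-sided pairwise test: is c significantly worse than the incumbent?
        _, p_pair = wilcoxon(results[c], results[incumbent], alternative='greater')
        if p_pair >= alpha:
            kept.append(c)
    return kept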
The Friedman test involves ranking the performance results of each configura-
tion on a given problem instance; in the case of ties, the average of the ranks that
would have been assigned without ties is assigned to each tied value. The test then
determines whether some configurations tend to be ranked better than others when
considering the rankings for all instances considered in the race up to the given iter-
ation. Following Birattari et al. [11], we note that performing the ranking separately
for each problem instance amounts to a blocking strategy on instances. The use of
this strategy effectively reduces the impact of noise effects that may arise from the
performance variation observed over the given instance set for any configuration of
the target algorithm under consideration; this can become critical when those per-
formance variations are large, as has been observed for many algorithms for various
hard combinatorial problems (see, e.g., [21, 29]).
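The 'average ranks' treatment of ties mentioned above can be illustrated in one line using SciPy:

# average ranks with ties, as used when ranking results on a single instance
from scipy.stats import rankdata
print(rankdata([12.3, 9.8, 12.3, 15.1]))   # -> ranks 2.5, 1.0, 2.5, 4.0
# the two tied values share the average of ranks 2 and 3, i.e. 2.5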
A major limitation of this basic version of F-Race stems from the fact that in the
initial steps all given configurations have to be evaluated. This property of basic
F-Race severely limits the size of the configuration spaces to which the procedure
can be applied effectively – particularly when dealing with configuration spaces
corresponding to so-called full factorial designs, which contain all combinations of
values for a set of discrete (or discretised) parameters. Two more recent variants of
F-Race, Sampling F-Race and Iterative F-Race, have been introduced to address this
limitation [5]; both use the previously described F-Race procedure as a subroutine.
Sampling F-Race (short: RSD/F-Race) is based on the idea of using a sampling
process to determine the initial set of configurations subsequently used in a stan-
dard F-Race. In RSD/F-Race, a fixed number r of samples is determined using a so-
called Random Sampling Design, in which each configuration is drawn uniformly at
random from the given configuration space C. (In the simplest case, where no con-
ditional parameters or forbidden configurations exist, this can be done by sampling
values for each parameter independently and uniformly at random from the respec-
tive domain.) As noted by Balaprakash et al. [5], the performance of this procedure
depends substantially on r, the number of configurations sampled in relation to the
size of the given configuration space.
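The following sketch illustrates such a Random Sampling Design for the simple case described above (no conditional parameters and no forbidden configurations); the parameter names and domains are purely illustrative.

# Sketch of uniform random sampling of r configurations, as used to seed
# RSD/F-Race. Domains are given either as (low, high) ranges or as value lists.
import random

def sample_configurations(domains, r, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(r):
        config = {}
        for name, domain in domains.items():
            if isinstance(domain, tuple):          # continuous range
                low, high = domain
                config[name] = rng.uniform(low, high)
            else:                                  # finite set of values
                config[name] = rng.choice(domain)
        samples.append(config)
    return samples

# example with hypothetical parameters of some target algorithm:
# sample_configurations({"alpha": (0.1, 5.0), "restarts": [10, 100, 1000]}, r=20)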
A somewhat more effective approach for focussing a procedure based on F-Race
on promising configurations is Iterative F-Race (short: I/F-Race). The key idea be-
hind I/F-Race is the use of an iterative process, where in the first stage of each iter-
ation configurations are sampled from a probabilistic model M, while in the second
stage a standard F-Race is performed on the resulting sample, and the configurations
surviving this race are used to define or update the model M used in the following
iteration. (See Figure 3.2.)
The probabilistic model used in each iteration of I/F-Race consists of a series
of probability distributions, D1 , . . . , Ds , each of which is associated with one of s
‘promising’ parameter configurations, c1 , . . . , cs . Balaprakash et al. [5] consider only
numerical parameters and define each distribution Di to be a k-variate normal distri-
bution Ni := N (μi , Σi ) that is centred on configuration ci , i.e., μi = ci . They further
define the covariance between any two different parameters in a given Ni to be zero,
such that Ni can be factored into k independent, univariate normal distributions. To
start the process with an unbiased probabilistic model, in the first iteration of I/F-
Race a single k-variate uniform distribution is used, which is defined as the product
of the k independent uniform distributions over the ranges of each given parameter
procedure I/F-Race
  input target algorithm A, set of configurations C, set of problem instances I, performance metric m;
  output set of configurations C∗;
  initialise probabilistic model M;
  C′ := ∅; // later, C′ is the set of survivors from the previous F-Race
  repeat
    based on model M, sample set of configurations C′′ ⊆ C;
    perform F-Race on configurations in C′ ∪ C′′ to obtain set of configurations C∗;
    update probabilistic model M based on configurations in C∗;
    C′ := C∗;
  until termination condition met;
  return c∗ ∈ C∗ with best performance (according to m) over all instances evaluated;
end I/F-Race
Fig. 3.2: High-level outline of Iterated F-Race, as introduced by [5]; details are
explained in the text. The most recent version of I/F-Race slightly deviates from
this outline (see [12]).
(we note that this can be seen as a degenerate case of the normal distributions used
subsequently, in which the variance is infinite and truncation is applied).
In each iteration of I/F-Race, a certain number of configurations are sampled
from the distributions N1 , . . . , Ns . In the first iteration, this corresponds to sampling
configurations uniformly at random from the given configuration space. In subse-
quent iterations, for each configuration to be sampled, first, one of the Ni is chosen
using a rank-based probabilistic selection scheme based on the performance of the
configuration ci associated with Ni (for details, see [5]), and then a configuration
is sampled from this distribution. Values that are outside the range allowable for a
given parameter are set to the closer of the two boundaries, and settings for param-
eters with integer domains are rounded to the nearest valid value. The number a
of configurations sampled in each iteration depends on the number s of configura-
tions that survived the F-Race in the previous iteration; Balaprakash et al. [5] keep
the overall number of configurations considered in each iteration of I/F-Race con-
stant at some value r, and therefore simply replace those configurations eliminated
by F-Race with newly sampled ones (i.e., a := r − s, where in the first iteration,
s = 0).
The resulting population of a + s configurations is subjected to a standard F-
Race; this race is terminated using a complex, disjunctive termination condition
that involves a (lower) threshold on the number of surviving configurations as well
as upper bounds on the computational budget (measured in target algorithm runs)
and the number of problem instances considered (the threshold mechanism ends the race as soon as the number of survivors has fallen below k, the number of target algorithm parameters). Each of the F-Races conducted within I/F-Race uses a random permutation of the given instance set in order to avoid bias due to a particular instance ordering. The s configurations that survived
the race (where the value of s depends on the part of the termination condition that
determined the end of that race) induce the probabilistic model used in the following
iteration of I/F-Race.
To increasingly focus the sampling process towards the most promising config-
urations, the standard deviations of the component distributions of the probabilistic
models Ni are gradually decreased using a volume reduction technique. More pre-
cisely, after each iteration, the standard deviation vector σi of each distribution Ni is
scaled by a factor (1/r)^(1/k), where r is the total number of configurations entered into
the F-Race, and k is the number of given parameters; this corresponds to a reduction
of the total volume of the region bounded by μi ± σi (over all k parameters) by a fac-
tor of r. At the beginning of I/F-Race, when configurations are sampled uniformly,
the standard deviation values are (somewhat arbitrarily) set to half of the range of
the respective parameter values.
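The following Python sketch illustrates the sampling and volume-reduction steps just described, for purely numerical parameters; the parameter names, bounds and helper structures are illustrative assumptions, not details taken from the I/F-Race implementation.

# Sketch of how a new configuration is sampled around a surviving ('elite')
# configuration and how the standard deviations are subsequently shrunk.
import random

def sample_around(elite, sigma, bounds, integer_params, rng):
    config = {}
    for name, mu in elite.items():
        value = rng.gauss(mu, sigma[name])
        low, high = bounds[name]
        value = min(max(value, low), high)     # truncate to the closer boundary
        if name in integer_params:
            value = int(round(value))          # round to the nearest valid value
        config[name] = value
    return config

def shrink_sigmas(sigma, r):
    k = len(sigma)                             # number of parameters
    factor = (1.0 / r) ** (1.0 / k)            # total volume shrinks by a factor of r
    return {name: s * factor for name, s in sigma.items()}

# rng    = random.Random(1)
# elite  = {"alpha": 1.3, "tabu_len": 25}
# sigma  = {"alpha": 2.45, "tabu_len": 50}     # initially half of each parameter range
# bounds = {"alpha": (0.1, 5.0), "tabu_len": (1, 101)}
# sample_around(elite, sigma, bounds, {"tabu_len"}, rng)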
I/F-Race, as specified by Balaprakash et al. [5], assumes that all parameters
are numerical. This limitation is overcome in a later variant [12], which supports
categorical parameters by sampling their values from discrete probability distri-
butions that are updated by redistributing probability mass to values seen in good
configurations, as determined by F-Race. This version of I/F-Race, which we call
I/F-Race-10 for clarity, also differs from the one described previously in several
other aspects. Notably, the number of iterations in I/F-Race-10 is determined as
⌊2 + log2(k) + 0.5⌋, and the overall computational budget (i.e., number of target al-
gorithm runs) is distributed equally over these iterations. Furthermore, the number r
of configurations considered at iteration number t is set to ⌊b/(5 + t)⌋, where b is the
computational budget available for that iteration; this leads to fewer configurations
being considered in later iterations. The threshold on the number of survivors below
which any given F-Race is terminated is also determined as ⌊2 + log2(k) + 0.5⌋. Fi-
nally, I/F-Race-10 handles conditional parameters by only sampling values for them
when they are active, and by only updating the respective component of the model
in situations where such parameters are active in a configuration surviving one of
the subsidiary F-Races. (For further details, see [12].)
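To illustrate these budgeting rules, the following small computation works through a hypothetical scenario with k = 12 parameters and an overall budget of 3,000 target algorithm runs (both values are assumptions chosen purely for illustration):

# Worked example of the I/F-Race-10 budget rules described above.
import math

k = 12
total_budget = 3000
iterations = int(2 + math.log2(k) + 0.5)            # floor(2 + log2(12) + 0.5) = 6
budget_per_iteration = total_budget // iterations    # 500 runs per iteration
for t in range(1, iterations + 1):
    r_t = budget_per_iteration // (5 + t)            # configurations in iteration t
    print(t, r_t)                                     # 83, 71, 62, 55, 50, 45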
3.2.3 Applications
Balaprakash et al. [5] describe applications of F-Race, Sampling F-Race and Iter-
ative F-Race to three high-performance stochastic local search algorithms: MAX-
MIN Ant System for the TSP with six parameters [64], an estimation-based local
search algorithm for the probabilistic TSP (PTSP) with three parameters [6], and a
simulated annealing algorithm for vehicle routing with stochastic demands (VRP-
SD) with four parameters [53]. The empirical results from these case studies indi-
cate that both Sampling F-Race and Iterative F-Race can find good configurations in spaces that are too big to be handled effectively by F-Race, and that Iterative F-
Race tends to give better results than Sampling F-Race, especially when applied to
more difficult configuration problems. Both the PTSP and the VRP-SD algorithms
as configured by Iterative F-Race represented the state of the art in solving these
problems at the time of this study.
More applications of F-Race have recently been summarised by Birattari et al.
[12]. These include tuning the parameters of various meta-heuristic algorithms for
university timetabling problems [58], of a control system for simple robots [52], and
of a new state-of-the-art memetic algorithm for the linear ordering problem [61]. In
all of these cases, the basic F-Race algorithm was applied to target algorithms with
few parameters and rather small configuration spaces (48–144 configurations).
Yuan et al. [71] report an application of I/F-Race for tuning various heuristic
algorithms for solving a locomotive scheduling problem provided by the German
railway company, Deutsche Bahn. The target algorithms considered in this work
had up to five parameters, mostly with continuous domains. The most complex
application of I/F-Race reported by Birattari et al. [12] involves 12 parameters of
the ACOTSP software, some of which conditionally depend on the values of oth-
ers.
While these (and other) racing procedures have been demonstrated to be useful
for accomplishing a broad range of parameter tuning tasks, it is somewhat unclear
how well they perform when applied to target algorithms with many more param-
eters, and how effectively they can deal with the many categorical and conditional
parameters arising in the context of more complex computer-aided algorithm design
tasks, such as the ones considered by Hutter et al. [35], KhudaBukhsh et al. [45],
Hutter et al. [42], and Tompkins and Hoos [66].
3.3 ParamILS
At the core of the ParamILS framework for automated algorithm configuration [36,
37] lies Iterated Local Search (ILS), a well-known and versatile stochastic local
search method that has been applied with great success to a wide range of difficult
combinatorial problems (see, e.g., [47, 30]). ILS iteratively performs phases of simple
local search designed to rapidly reach or approach a locally optimal solution to
the given problem instance, interspersed with so-called perturbation phases, whose
purpose is to effectively escape from local optima. Starting from a local optimum
x, in each iteration one perturbation phase is performed, followed by a local search
phase, with the aim of reaching (or approaching) a new local optimum x′. Then, a so-called acceptance criterion is used to decide whether to continue the search process from x′ or whether to revert to the previous local optimum, x. Using this
mechanism, ILS aims to solve a given problem instance by effectively exploring the
space of its locally optimal solutions. At a lower level, ILS – like most SLS methods
– visits (i.e., moves through) a series of candidate solutions such that at any given
time there is a current candidate solution, while keeping track of the incumbent (i.e.,
the best solution encountered so far).
ParamILS uses this generic SLS method to search for high-performance config-
urations of a given algorithm as follows (see also Figure 3.3). The search process is
initialised by considering a given configuration (which would typically be the given
target algorithm’s default configuration) as well as r further configurations that are
chosen uniformly at random from the given configuration space. These r + 1 con-
figurations are evaluated in a way that is specific to the given ParamILS variant, and
the best-performing configuration is selected as the starting point for the iterated
local search process. This initialisation mechanism can be seen as a combination of
the intuitive choice of starting from a user-defined configuration (such as the tar-
get algorithm’s default settings) and a simple experimental design technique, where
the latter makes it possible to exploit situations where the former represents a poor
choice for the given set of benchmark instances. Clearly, there is a trade-off between
the effort spent on evaluating randomly sampled configurations at this point and the
effort used in the subsequent iterated local search process. Hutter et al. [39] reported
empirical results suggesting that r = 10 results in better performance than r = 0 and
r = 100 across a number of configuration scenarios. However, we suspect that more
sophisticated initialisation procedures, in particular ones based on racing or sequen-
tial model-based optimisation techniques, might result in even better performance.
The subsidiary local search procedure used in ParamILS is based on the one-
exchange neighbourhood induced by arbitrary changes in the values of a single
target algorithm parameter. ParamILS supports conditional parameters by pruning
neighbourhoods such that changes in inactive parameters are excluded from con-
sideration; it also supports exclusion of (complete or partial) configurations ex-
plicitly declared ‘forbidden’ by the user. Using the one-exchange neighbourhood,
ParamILS performs iterative first-improvement search – an obvious choice, con-
sidering the computational cost of evaluating candidate configurations. We believe
that larger neighbourhoods might prove effective in situations in which parame-
procedure ParamILS
  input target algorithm A, set of configurations C, set of problem instances I, performance metric m;
  parameters configuration c0 ∈ C, integer r, integer s, probability pr;
  output configuration c∗;
  c∗ := c0;
  for i := 1 to r do
    draw c from C uniformly at random;
    assess c against c∗ based on performance of A on instances from I according to metric m;
    if c found to perform better than c∗ then
      c∗ := c;
    end if;
  end for;
  c := c∗;
  perform subsidiary local search on c;
  while termination condition not met do
    c′ := c;
    perform s random perturbation steps on c′;
    perform subsidiary local search on c′;
    assess c′ against c based on performance of A on instances from I according to metric m;
    if c′ found to perform better than c then // acceptance criterion
      update overall incumbent c∗;
      c := c′;
    end if;
    with probability pr do
      draw c from C uniformly at random;
    end with probability;
  end while;
  return c∗;
end ParamILS
Fig. 3.3: High-level outline of ParamILS, as introduced by [36]; details are explained
in the text
ter effects are correlated, as well as in conjunction with mechanisms that recog-
nise and exploit such dependencies in parameter response. Furthermore, search
strategies other than iterative first-improvement could be considered in variants of
ParamILS that build and maintain reasonably accurate models of local parameter
responses.
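A minimal sketch of such a pruned one-exchange neighbourhood, assuming discretised parameter domains and a user-supplied predicate encoding which parameters are active in a given configuration, might look as follows:

# One-exchange neighbourhood with pruning of inactive (conditional) parameters.
# `domains` maps parameter names to their discrete value sets; `is_active` is an
# assumed helper encoding the conditionality structure of the target algorithm.
def one_exchange_neighbourhood(config, domains, is_active):
    neighbours = []
    for name, values in domains.items():
        if not is_active(name, config):       # skip parameters that are currently inactive
            continue
        for value in values:
            if value == config[name]:
                continue
            neighbour = dict(config)
            neighbour[name] = value
            neighbours.append(neighbour)
    return neighbours

# example: parameter 'noise' only matters when 'restart_strategy' == 'adaptive'
# is_active = lambda name, cfg: name != 'noise' or cfg['restart_strategy'] == 'adaptive'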
The perturbation procedure used in the ParamILS framework performs a fixed
number, s, of steps chosen uniformly at random in the same one-exchange neigh-
bourhood used during the local search phases. Computational experiments in which
various fixed values of s as well as several multiples of the number of target algo-
rithm parameters were considered suggest that relatively small perturbations (i.e.,
s = 2) are sufficient for obtaining good performance of the overall configuration
procedure [39]. Considering the use of iterative first-improvement during the local
search phases, this is not overly surprising; still, larger perturbations might be effective in certain situations.
3.3.2 BasicILS
The BasicILS(N) variant of ParamILS evaluates every configuration using the same fixed number, N, of target algorithm runs, performed on a fixed list of instances (and pseudo-random number seeds) from the given set I. N may exceed the number of given problem instances; in this case, the list of runs per-
formed is determined by a sequence of random permutations of the given set of
instances, and the random number seed used in each run is determined uniformly at
random.
This approach works well for configuration scenarios where a relatively small set
of benchmark instances is representative of all instances of interest. Furthermore,
the N target algorithm runs per configuration can be performed independently in
parallel. As for all ParamILS variants – and, indeed, for any SLS algorithm – fur-
ther parallelisation can be achieved by performing multiple runs of BasicILS(N)
in parallel. Finally, in principle, it is possible to perform multiple parallel runs of
the subsidiary local search in each iteration or to evaluate multiple neighbours of a
configuration in each search step independently in parallel.
3.3.3 FocusedILS
One drawback of BasicILS is that it tends to make substantial effort evaluating poor
configurations, especially when used to configure a given target algorithm for min-
imised run-time. The only way to reduce that effort is to choose a small number of
runs, N; however, this can (and often does) result in poor generalisation of perfor-
mance to problem instances other than those used during the configuration process.
FocusedILS addresses this problem by initially evaluating configurations using few
target algorithm runs and subsequently performing additional runs to obtain increas-
ingly precise performance estimates for promising configurations. We note that the
idea of focussing the computational effort in evaluating configurations on candi-
dates that have already shown promising performance is exactly the same as that
underlying the concept of racing. However, unlike the previously discussed racing
procedures, FocusedILS determines promising configurations heuristically rather
than using statistical tests.
The mechanism used by FocusedILS to assess configurations is based on the
following concept of domination: Let c1 and c2 be configurations for which N(c1 )
and N(c2 ) target algorithm runs have been performed, respectively. As in the case
of BasicILS, the runs performed for each configuration follow the same sequence of
instances (and pseudo random number seeds). Then c1 dominates c2 if, and only if,
N(c1 ) ≥ N(c2 ) and the performance estimate for c1 based on its first N(c2 ) runs is at
least as good as that for c2 based on all of its N(c2 ) runs. This definition incorporates
the previously discussed idea of blocking, as configurations are compared based on
their performance on a common set of instances (and pseudo random number seeds).
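Transcribed directly into code (assuming a performance metric that is to be minimised and run lists that follow the same instance and seed sequence for both configurations), the domination criterion reads:

# Domination criterion used by FocusedILS, as defined above.
def dominates(runs_c1, runs_c2):
    n1, n2 = len(runs_c1), len(runs_c2)
    if n1 < n2:
        return False
    estimate_c1 = sum(runs_c1[:n2]) / n2      # c1's estimate on its first N(c2) runs
    estimate_c2 = sum(runs_c2) / n2           # c2's estimate on all of its N(c2) runs
    return estimate_c1 <= estimate_c2

# dominates([0.3, 0.5, 0.4, 0.9], [0.4, 0.6])  -> True  (0.4 <= 0.5)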
Whenever FocusedILS decides that one configuration, c1 , performs better than
another, c2 , it ensures that c1 dominates c2 by performing additional runs on either
or both configurations. More precisely, when comparing two configurations, an ad-
ditional run is first performed for the configuration whose performance estimate is
based on fewer runs or, in the case of a tie, on both configurations. Then, as long as
neither configuration dominates the other, further runs are performed based on the same criterion.
3.3.4 Adaptive Capping
Both BasicILS and FocusedILS can be improved by limiting, under certain conditions, the time spent evaluating poorly performing configurations. The key
idea is that when comparing two configurations c1 and c2 , a situation may arise
where, regardless of the results of any further runs, c2 cannot match or exceed the
performance of c1 [37]. This is illustrated by the following example, taken from
Hutter et al. [37]: Consider the use of BasicILS(100) for minimising the expected
run-time of a given target algorithm on a set of 100 benchmark instances, where
configuration c1 has solved all 100 instances in a total of ten CPU seconds, and c2
has run for the same ten CPU seconds on the first instance without solving it.
Clearly, we can safely terminate that latter run after 10 + ε CPU seconds (for some
small time ε ), since the average run-time of c2 must exceed 0.1 CPU seconds, re-
gardless of its performance in the remaining N − 1 runs, and therefore be worse than
that of c1 .
Based on this insight, the trajectory-preserving adaptive capping mechanism of
Hutter et al. [37] limits the effort spent on evaluating configurations based on com-
paring lower bounds on the performance of one configuration c2 to upper bounds (or
exact values) on that of another configuration c1 , based on the results of given sets
of runs for c1 and c2 . We note that this corresponds to the notion of racing, where
each of the two configurations works independently through a given number of runs,
but the race is terminated as soon as the winner can be determined with certainty.
Apart from the potential for savings in running time, the use of trajectory-preserving
capping does not change the behaviour of ParamILS.
A heuristic generalisation of this capping mechanism makes it possible to achieve
even greater speedups, albeit at the price of possibly substantial changes to the
search trajectory followed by the configuration procedure. The key idea behind
this generalisation (dubbed aggressive capping) is to additionally bound the time
allowed for evaluating configurations based on the performance observed for the
current incumbent, i.e., the best-performing configuration encountered since the
beginning of the ParamILS run. The additional bound is obtained by multiplying
the performance estimate of the incumbent by a constant, bm, called the bound
multiplier. Formally, for bm = ∞, the additional bound becomes inactive (assum-
ing the performance measure is to be minimised), and the behaviour of trajectory-
preserving capping is obtained. For bm = 1, on the other hand, a very aggressive
heuristic is obtained, which limits the evaluation of any configuration to the time
spent on evaluating the current incumbent. In practice, bm = 2 appears to result
in good performance and is used as a default setting in ParamILS. Despite its
heuristic nature, this modified capping mechanism preserves the PAC property of
FocusedILS.
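The following sketch gives one possible reading of how these two bounds combine into a cap on the time allotted to a challenger configuration when minimising (mean) run-time; the function and variable names are illustrative and the treatment of partial runs is deliberately simplified.

# Simplified cap on the remaining evaluation time of a challenger configuration,
# combining the trajectory-preserving bound (derived from the configuration it is
# currently compared with) and the aggressive bound (derived from the incumbent).
def runtime_cap(comparison_total_time, incumbent_total_time,
                challenger_time_so_far, bm=2.0):
    # the challenger cannot win the comparison once its total time exceeds
    # that of the configuration it is compared with
    tp_bound = comparison_total_time - challenger_time_so_far
    # in addition, never spend more than bm times the time used for evaluating
    # the current incumbent (bm = 2 is the ParamILS default mentioned above)
    aggressive_bound = bm * incumbent_total_time - challenger_time_so_far
    return max(0.0, min(tp_bound, aggressive_bound))

For the example given earlier, runtime_cap(10.0, 10.0, 10.0) yields 0, i.e., the challenger's current run can be terminated after 10 + ε CPU seconds in total; setting bm to a very large value leaves only the trajectory-preserving bound in effect, while bm = 1 limits the evaluation of any configuration to the time spent on the incumbent, consistent with the behaviour described above.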
Although Hutter et al. [37] spelled out their adaptive capping mechanisms for
the performance objective of minimising a target algorithm’s mean run-time only,
these mechanisms generalise to other objectives in a rather straightforward way (a
discussion of capping in the context of minimising quantiles of run-time is found
in Ch. 7 of [32]). We note, however, that – especially when several target algorithm
runs are conducted in parallel – adaptive capping would be most effective in the
case of run-time minimisation. Particularly substantial savings can be achieved dur-
ing the assessment of the r + 1 configurations considered during initialisation of the
search process, as well as towards the end of each local search phase. Finally, it
should be noted that adaptive capping mechanisms can be used in the context of
configuration procedures other than ParamILS; for example, Hutter et al. [37] men-
tion substantial speedups achieved by using adaptive capping in combination with
simple random sampling (the same procedure as that used during the initialisation
of ParamILS).
3.3.5 Applications
ParamILS variants, and in particular FocusedILS, have been very successfully ap-
plied to a broad range of high-performance algorithms for several hard combina-
torial problems. An early version of FocusedILS was used by Thachuk et al. [65]
to configure a replica-exchange Monte Carlo (REMC) search procedure for the 2D
and 3D HP protein structure prediction problems; the performance objective was to
minimise the mean run-time for finding ground states for a given set of sequences
in these abstract but prominent models of protein structure, and the resulting config-
urations of the REMC procedure represented a considerable improvement over the
state-of-the-art techniques for solving these challenging problems.
FocusedILS has also been used in a series of studies leading to considerable
advances in the state-of-the-art in solving the satisfiability problem in propositional
logic, one of the most widely studied NP-hard problems in computer science. Hut-
ter et al. [35] applied this procedure to SPEAR, a complete, DPLL-type SAT solver
with 26 parameters (ten of which are categorical), which jointly give rise to a total
of 8.34 · 10¹⁷ possible configurations. The design of SPEAR was influenced con-
siderably by the availability of a powerful configuration tool such as FocusedILS,
whose application ultimately produced configurations that solved a given set of SAT-
encoded software verification problems about 100 times faster than previous state-
of-the-art solvers for these types of SAT instances and won the first prize in the
QF_BV category of the 2007 Satisfiability Modulo Theories (SMT) Competition.
KhudaBukhsh et al. [45] used FocusedILS to find performance-optimised in-
stantiations of SATenstein-LS, a highly parametric framework for stochastic local
search (SLS) algorithms for SAT. This framework was derived from components
found in a broad range of high-performance SLS-based SAT solvers; its 41 parame-
ters induce a configuration space of size 4.82 · 10¹². Using FocusedILS, performance
improvements of up to three orders of magnitude were achieved over the previous
best-performing SLS algorithms for various types of SAT instances, for several of
which SLS-based solvers are the most effective SAT algorithms overall. Several
automatically determined configurations of SATenstein-LS were used in the most
recent SATzilla solvers, which led the field in the 2009 SAT Competition, winning
prizes in five of the nine main categories [69].
Very recently, Xu et al. [70] used FocusedILS in an iterative fashion to obtain
sets of configurations of SATenstein-LS that were then used in combination with
state-of-the-art per-instance algorithm selection techniques (here: SATzilla). In each
iteration of the overall procedure, dubbed Hydra, FocusedILS was used to find con-
figurations that would best complement a given portfolio-based per-instance algo-
rithm selector. This approach resulted in a portfolio-based SAT solver that, while
derived in a fully automated fashion from a single, highly parameterised algorithm,
achieved state-of-the-art performance across a wide range of benchmark instances.
Tompkins and Hoos [66] applied FocusedILS to a new, flexible framework for
SLS-based SAT solvers called VE-Sampler (which is conceptually orthogonal to the
previously mentioned SATenstein-LS framework). VE-Sampler has a large number
of categorical and conditional parameters, which jointly give rise to more than 10⁵⁰ possible configurations.
While finding good discretisations of continuous parameter domains (i.e., ones that do not cause major losses in the performance of the configurations found by ParamILS) could be difficult, in most cases generic approaches such as even or geometric sub-
divisions of a given interval seem to give good results. Where this is not the case,
multiple runs of the configuration procedure can be used to iteratively refine the do-
mains of continuous parameters. The same approach can be used to extend domains
in cases where parameter values in an optimised configuration lie at the boundary
of their respective domains. Nevertheless, the development of ParamILS variants
that natively deal with continuous parameters and support dynamic extensions of
parameter domains remains an interesting direction for future work.
3.4 Sequential Model-Based Optimisation
procedure SMBO
  input target algorithm A, set of configurations C, set of problem instances I, performance metric m;
  output configuration c∗;
  determine initial set of configurations C0 ⊂ C;
  for all c ∈ C0, measure performance of A on I according to metric m;
  build initial model M based on performance measurements for C0;
  determine incumbent c∗ ∈ C0 for which best performance was observed or predicted;
  repeat
    based on model M, determine set of configurations C′ ⊆ C;
    for all c ∈ C′, measure performance of A on I according to metric m;
    update model M based on performance measurements for C′;
    update incumbent c∗;
  until termination condition met;
  return c∗;
end SMBO
Fig. 3.4: High-level outline of the general sequential model-based optimisation ap-
proach to automated algorithm configuration; model M is used to predict the perfor-
mance of configurations that have not (yet) been evaluated, and set C′ is typically
chosen to contain configurations expected to perform well based on those predic-
tions. Details of various algorithms following this approach are explained in the text
The efficient global optimisation (EGO) algorithm for black-box function optimi-
sation by Jones et al. [44] uses a response surface model obtained via noise-free
Gaussian process regression in combination with an expected improvement crite-
rion for selecting the next configuration to be evaluated. The noise-free Gaussian
process (GP) model utilised by EGO is also known as the DACE model, after its
prominent use in earlier work by Sacks et al. [59]. It defines for every input x a
random variable F̂(x) that characterises the uncertainty over the true response value
f (x) at point x.
The model-based optimisation process carried out by EGO starts with about 10·k
design points determined using a k-dimensional space-filling Latin hypercube de-
sign (LHD). After measuring the response values at these design points, the 2 · k + 2 pa-
rameters of a DACE model are fit to the pairs of design points and response values,
using maximum likelihood estimates (as described by [44], this can be partially done
in closed form). The resulting model is assessed by means of so-called standard-
ized cross-validated residuals, which reflect the degree to which predictions made
by the model agree with the observed response values on the design points used for
constructing the model. If the model is deemed unsatisfactory, the response values
may be transformed using a log- or inverse-transform (i.e., modified by applying the
function ln y or 1/y) and the model fitted again.
After a satisfactory initial model has been obtained, it is used in conjunction
with an expected improvement criterion to determine a new design point to be
evaluated. The expected improvement measure used in this context uses the cur-
rent DACE model M to estimate the expected improvement over the best response
value measured so far, fmin , at any given design point x, and is formally defined as
EI(x) := E[max{ fmin − F̂(x), 0}], where F̂(x) is the random variable describing the
response for a design point x according to model M. Using a closed-form expres-
sion for this measure given by Jones et al. [44] and a branch & bound search method
(which can be enhanced heuristically), the EGO algorithm then determines a design
point x′ with maximal expected improvement EI(x′). If EI(x′) is less than 1% of the response value of the current incumbent, the procedure terminates. Otherwise, the response value f(x′) is measured, the DACE model is refitted on the previous set of data extended by the pair (x′, f(x′)), and a new iteration begins, in which the updated model is used to determine the next design point using the same process that yielded x′.
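For reference, the closed-form expression for this expected improvement under a Gaussian predictive distribution (written here for minimisation, with predictive mean mu and standard deviation sigma at the candidate point) can be evaluated as in the following generic sketch; this illustrates the standard formula rather than transcribing the EGO implementation itself.

# Closed-form expected improvement for a Gaussian predictive distribution.
import math

def expected_improvement(mu, sigma, f_min):
    if sigma <= 0.0:
        return max(f_min - mu, 0.0)          # no predictive uncertainty at this point
    z = (f_min - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (f_min - mu) * Phi + sigma * phi

# expected_improvement(mu=4.0, sigma=1.0, f_min=5.0)  # ~1.08: a promising point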
Note that in every iteration of this process, the DACE model has to be fitted,
which involves a matrix inversion of cost O(n³), where n is the number of design
points used. Depending on the cost of measuring the response value for a given
design point, this may represent a substantial computational overhead. Furthermore,
the noise-free Gaussian process model used in EGO cannot directly characterise
the stochastic responses obtained when solving algorithm configuration problems
involving randomised target algorithms.
We now discuss two black-box optimisation procedures that deal with stochastic
responses, as encountered when modelling phenomena subject to observation noise
or configuring randomised algorithms.
In the modified expected improvement criterion EI′(x) used by the sequential kriging optimisation (SKO) procedure, fˆmin is the model's prediction for the current best input (as in EGO, obtained
by considering all design points used for building the model), F̂(x) is the random
variable describing the response for a design point x according to model M, σε is
the standard deviation of the measurement noise (assumed to be identical for all
inputs), and s²(x) is the variance of the response F̂(x) given by the model at point x; the second factor in this product decreases as the predictions of M become more accurate. Based on the given model M, the next design point to be evaluated, x′, is determined by maximising EI′(x) using the Nelder-Mead Simplex algorithm [49]. Next, the model is refitted, taking into account x′ and a response value sampled at x′, and a new iteration begins, in which the updated model is used to determine the next design point using the same process that yielded x′. If the maximum EI′(x) values from d + 1 successive iterations all fall below a user-defined threshold, the iterative sampling process is terminated. (This threshold can be specified as an absolute value
or as a fraction of the difference between the largest and smallest observed response
values.)
Unfortunately, SKO assumes that the variability of the response values at each
design point is characterised by a Gaussian distribution, and that the standard de-
viations of those distributions are the same across the entire design space. Both of
these assumptions are problematic in the context of configuring randomised algo-
rithms, particularly when minimising run-time (see, e.g., [30]). Furthermore, the time
required for fitting the model in each iteration of SKO is cubic in the number of
response values sampled, which can represent a substantial computational burden.
In the expected improvement criterion used by SPO (sequential parameter optimisation), fmin is the best value of the measure to be optimised observed so far, and
F̂(x) is the distribution over the predictions obtained from model M at design point
x. This EI measure has been introduced by Schonlau et al. [62] with the aim of
encouraging the exploration of design points for which the current model produces
highly uncertain predictions.
At each of the design points determined in this way, r new response values are
measured. Furthermore, additional response values are measured for the current in-
cumbent to ensure that it is evaluated based on as many samples as available for any
of the new design points. Then, the best of all the design points considered so far,
according to the given measure to be optimised, is selected as the new incumbent
(with ties broken uniformly at random). If the design point thus selected has been an
incumbent at any point earlier in the search process, r is increased; in SPO version
0.3 [8], r is doubled, while in the newer version 0.4 [7], it is merely incremented by
1, and in both cases values of r are limited to a user-specified maximum value rmax .
At this point, a new iteration of SPO begins, in which a noise-free GP is fitted on
the augmented set of data.
Based on a detailed investigation of the core components of the SPO algorithm, Hut-
ter et al. [38] introduced a variant called SPO+ that shows considerably more robust
performance on standard benchmarks than the SPO 0.3 and SPO 0.4 algorithms
described previously.
The main difference between SPO+ and the previous SPO procedures lies in
the way in which new design points are accepted as incumbents. Inspired by Fo-
cusedILS, SPO+ uses a mechanism that never chooses a new incumbent x̂ without
ensuring that at least as many responses have been sampled at x̂ as at any other de-
sign point x ≠ x̂. To achieve this, for any challenger to the current incumbent x̂, i.e.,
for any design point x that appears to represent an improvement over x̂ based on the
samples taken so far, additional response values are sampled until either x ceases to
represent an improvement, or the number of response values sampled at x reaches
that taken at x̂, with x still winning the comparison with x̂ based on the respective
samples; only in the latter case does x become the new incumbent, while in the
former case it is dismissed, and as many additional response values are sampled for
x̂ as newly measured for x .
The new response values determined for a challenger x are sampled in batches,
with the number of new samples taken doubling in each successive batch. As noted
by Hutter et al. [38], using this mechanism, rejection of challengers is done in a
rather aggressive, heuristic manner, and frequently occurs after only a single re-
sponse value has been sampled at x – long before a statistical test could conclude
that x is worse than the current incumbent.
The Time-Bounded Sequential Parameter Optimisation (TB-SPO) algorithm by
Hutter et al. [41] introduces a number of further modifications to the SMBO frame-
work underlying the previously described SPO variants. In particular, in contrast
to all SMBO procedures discussed so far, TB-SPO does not construct its initial
model based on a large set of samples determined using a Latin hypercube design,
but rather interleaves response measurements at randomly chosen points with ones
taken at points that appear to be promising based on the current model. The initial
model is based on a single sample only; when used for algorithm configuration,
where the black-box function to be optimised represents the output of a parame-
terised algorithm, the default configuration for the algorithm is used as the design
point at which this initial sample is taken. At any stage of the iterative model-based
search process that follows, response values are sampled at a series of design points
in which odd-numbered points are determined by optimising an expected improve-
ment measure (as is done in SPO+ ), while even-numbered points are sampled uni-
formly at random from the given design space. (Mechanisms that achieve a different
balance between promising and randomly chosen design points could lead to better
performance but have not been explored so far.)
The number of design points at which response values are sampled between any
two updates to the model is determined based on the time t required for constructing
a new model and the search for promising parameter settings; to be precise, after at
least two design points have been evaluated, further points are considered until the
time used for evaluating design points since the last model update exceeds a user-
defined multiple (or fraction) of the overhead t.
Finally, in order to reduce the computational overhead incurred by the model con-
struction process, TB-SPO uses an approximate version of the standard Gaussian
process models found in the other SPO variants. This so-called projected process
(PP) approximation is based on the idea of representing explicitly only a randomly
sampled subset of the given data points (here: pairs of input and response values)
when building the Gaussian process model; if this subset comprises s data points,
while the complete set has n data points, the time complexity of fitting a GP model
decreases from O(n³) to O((s + n) · s²), while the time required for predicting a response value (mean and variance of the predictive distribution at a given design point) decreases from O(n²) to O(s²) [56]. In the context of an SMBO procedure,
this will typically lead to substantial savings, since the number of data points avail-
able increases over time, and n can easily reach values of several thousand, while
effective PP approximations can be based on constant-size subsets with s no larger
than 300 [41]. (Details on other, minor differences between TB-SPO and SPO+ can
be found in [41].)
3.4.4 Applications
Owing to the nature of the response surface models used, SMBO methods are usually restricted to dealing with
real- and integer-valued target algorithm parameters (although, very recently, [43]
has introduced techniques that can handle categorical parameters).
Following an example from Bartz-Beielstein et al. [9], the SPO variants discussed
in this section have been empirically evaluated using CMA-ES [25, 26, 27] – one
of the best-performing gradient-free numerical optimisation procedures currently
known – on several standard benchmark functions from the literature on gradient-
free numerical optimisation (see, e.g., [26]). The configuration space considered in
these examples, which involve the convex Sphere function as well as the non-convex
Ackley, Griewank and Rastrigin functions, is spanned by three real-valued and one integer-valued parameter of CMA-ES; the performance measure was the solution quality achieved after a fixed number of evaluations of the respective benchmark function.
The empirical results reported by Hutter et al. [38] for CMA-ES applied to the ten-
dimensional instances of these functions indicate that SPO+ tends to perform sig-
nificantly better than SPO 0.3 and 0.4, which in turn appear to perform substantially
better than SKO. In addition, Hutter et al. [38] considered the minimisation of the
median number of search steps required by SAPS [33], a well-known stochastic lo-
cal search algorithm for SAT, to solve a single benchmark instance obtained from
encoding a widely studied quasi-group completion problem into SAT; in this case,
four continuous parameters were optimised. The results from that experiment con-
firmed that SPO+ tends to perform better than previous SPO variants and suggest
that, at least on some configuration problems with a relatively modest number of
predominantly real-valued parameters, it can also yield slightly better results than
FocusedILS when allowed the same number of target algorithm runs.
TB-SPO has been empirically compared to SPO+ and FocusedILS on relatively
simple algorithm configuration tasks involving the well-known SAT solver SAPS
[33], with four continuous parameters, running on single SAT instances. In these
experiments, TB-SPO was shown to perform significantly better than SPO+ (some-
times achieving over 250-fold speedups), and moderately better than FocusedILS
[41]. However, it is important to keep in mind that, unlike TB-SPO (and all other
SMBO procedures covered in this section), FocusedILS explicitly deals with multi-
ple problem instances, and can therefore be expected to perform substantially better
on realistic algorithm configuration tasks. Furthermore, while SMBO procedures
like TB-SPO do not require continuous algorithm parameters to be discretised, they
presently cannot deal with conditional parameters, which are routinely encountered
in the more challenging algorithm configuration tasks on which FocusedILS has
been shown to be quite effective.
In addition to the methods covered in the previous sections, there are many other
procedures described in the literature that can, at least in principle, be applied to the
algorithm configuration problem considered here.
most genetic programming approaches, CLASS closely links the specification of the
configuration space and the evolutionary algorithm used for exploring this space.
The Composer system developed by Gratch and Dejong [23] is based on an it-
erative improvement procedure not unlike that used in ParamILS; this procedure is
conceptually related to racing techniques in that it moves to a new configuration
only after gathering sufficient statistical evidence to conclude that this new configu-
ration performs significantly better than the current one. In a prominent application,
Gratch and Chien [22] used the Composer system to optimise five parameters of an
algorithm for scheduling communication between a spacecraft and a set of ground-
based antennas.
Ansótegui et al. [2] recently developed a gender-based genetic algorithm for solv-
ing algorithm configuration problems. Their GGA procedure supports categorical,
ordinal and real-valued parameters; it also allows its user to express independen-
cies between parameter effects by means of so-called variable trees – a concept that
could be of particular interest in the context of algorithm configuration problems
where such independencies are known by construction, or heuristic methods are
available for detecting (approximate) independencies. Although there is some evi-
dence that GGA can solve some moderately difficult configuration problems more
effectively than FocusedILS without capping [2], it appears to be unable to reach
the performance of FocusedILS version 2.3 with aggressive capping on the most
challenging configuration problems [40]. Unfortunately, GGA also offers less flex-
ibility than FocusedILS in terms of the performance metric to be optimised. More
algorithm configuration procedures based on evolutionary algorithms are covered in
Chapter 2 of this book.
Finally, work originating from the Ph.D. project of Hutter [32] has recently over-
come two major limitations of the sequential model-based optimisation methods
discussed in Section 3.4 of this chapter by introducing a procedure that can han-
dle categorical parameters while explicitly exploiting the fact that performance is
evaluated on a set of problem instances. There is some evidence that this procedure,
dubbed Sequential Model-based Algorithm Configuration (SMAC), can, at least on
some challenging configuration benchmarks, reach and sometimes exceed the per-
formance of FocusedILS [43], and we are convinced that, at least in cases where
the parameter response of a given target algorithm is reasonably regular and per-
formance evaluations are very costly, such advanced SMBO methods hold great
promise.
Automated algorithm configuration and parameter tuning methods have been de-
veloped and used for more than a decade, and many of the fundamental techniques
date back even further. However, it has only recently become possible to effectively
solve complex configuration problems involving target algorithms with dozens of
parameters, which are often categorical and conditional. This success is based in
part on the increased availability of computational resources, but has mostly been
enabled by methodological advances underlying recent configuration procedures.
Still, we see much room (and, indeed, need) for future work on automated algo-
rithm configuration and parameter tuning methods. We believe that in developing
such methods, the fundamental features underlying all three types of methods dis-
cussed in this chapter can play an important role, and that the best methods will
employ combinations of these. We further believe that different configuration pro-
cedures will likely be most effective for solving different types of configuration
problems (depending, in particular, on the number and type of target algorithm pa-
rameters, but also on regularities in the parameter response). Therefore, we see a
need for research aiming to determine which configurator is most effective under
which circumstances. In fact, we expect to find situations in which the sequential
or iterative application of more than one configuration procedure turns out to be
effective – for example, one could imagine applying FocusedILS to find promising
configurations in vast, discrete configuration spaces, followed by a gradient-free nu-
merical optimisation method, such as CMA-ES, for fine-tuning a small number of
real-valued parameters.
Overall, we believe that algorithm configuration techniques, such as the ones
discussed in this chapter, will play an increasingly crucial role in the develop-
ment, evaluation and use of state-of-the-art algorithms for challenging computa-
tional problems, where the challenge could arise from high computational complex-
ity (in particular, NP-hardness) or from tight resource constraints (e.g., in real-
time applications). Therefore, we see great value in the design and development of
software frameworks that support the real-world application of various algorithm
configuration and parameter tuning procedures. The High-Performance Algorithm
Lab (HAL), recently introduced by Nell et al. [50], is a software environment de-
signed to support a wide range of empirical analysis and design tasks encountered
during the development, evaluation and application of high-performance algorithms
for challenging computational problems, including algorithm configuration and pa-
rameter tuning. Environments such as HAL not only facilitate the application of
automated algorithm configuration and parameter tuning procedures, but also their
development, efficient implementation and empirical evaluation.
In closing, we note that the availability of powerful and effective algorithm
configuration and parameter tuning procedures has a number of interesting con-
sequences for the way in which high-performance algorithms are designed and used
in practice. Firstly, for developers and end users, it is now possible to automatically
optimise the performance of (highly) parameterised solvers specifically for certain
classes of problem instances, leading to potentially much improved performance in
real-world applications. Secondly, while on-line algorithm control mechanisms that
adjust parameter settings during the run of a solver (as covered, for example, in
Chapters 6, 7 and 8 of this book) can in principle lead to better performance than
the (static) algorithm configuration procedures considered in this chapter, we expect
these latter procedures to be very useful in the context of (statically) configuring the
parameters and heuristic components that determine the behaviour of the on-line
control mechanisms. Finally, during algorithm development, it is no longer neces-
sary (or even desirable) to eliminate parameters and similar degrees of freedom,
but instead, developers can focus more on developing ideas for realising certain
heuristic mechanisms or components, while the precise instantiation can be left to
automated configuration procedures [28]. We strongly believe that this last effect
will lead to a fundamentally different and substantially more effective way of de-
signing and implementing high-performance solvers for challenging computational
problems.
Acknowledgements This chapter surveys and discusses to a large extent work carried out by
my research group at UBC, primarily involving Frank Hutter, Kevin Leyton-Brown and Kevin
Murphy, as well as Thomas Stützle at Université Libre de Bruxelles, to all of whom I am deeply
grateful for their fruitful and ongoing collaboration. I gratefully acknowledge valuable comments
by Frank Hutter, Thomas Stützle and Maverick Chan on earlier drafts of this chapter, and I thank
the members of my research group for the many intellectually stimulating discussions that provide
a fertile ground for much of our work on automated algorithm configuration and other topics in
empirical algorithmics.
References
[1] Adenso-Diaz, B., Laguna, M.: Fine-tuning of algorithms using fractional ex-
perimental design and local search. Operations Research 54(1):99–114 (2006)
[2] Ansótegui, C., Sellmann, M., Tierney, K.: A gender-based genetic algorithm
for the automatic configuration of algorithms. In: Proceedings of the 15th In-
ternational Conference on Principles and Practice of Constraint Programming
(CP 2009), pp. 142–157 (2009)
[3] Applegate, D. L., Bixby, R. E., Chvátal, V., Cook, W. J.: The Traveling Sales-
man Problem: A Computational Study. Princeton University Press (2006)
[4] Audet, C., Orban, D.: Finding optimal algorithmic parameters using the mesh
adaptive direct search algorithm. SIAM Journal on Optimization 17(3):642–
664 (2006)
[5] Balaprakash, P., Birattari, M., Stützle, T.: Improvement strategies for the
F-Race algorithm: Sampling design and iterative refinement. In: Bartz-
Beielstein, T., Blesa, M., Blum, C., Naujoks, B., Roli, A., Rudolph, G., Sam-
pels, M. (eds) 4th International Workshop on Hybrid Metaheuristics, Proceed-
ings, HM 2007, Springer Verlag, Berlin, Germany, Lecture Notes in Computer
Science, vol. 4771, pp. 108–122 (2007)
[6] Balaprakash, P., Birattari, M., Stützle, T., Dorigo, M.: Estimation-based meta-
heuristics for the probabilistic traveling salesman problem. Computers & OR
37(11):1939–1951 (2010)
[7] Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation:
The New Experimentalism. Natural Computing Series, Springer Verlag,
Berlin, Germany (2006)
[8] Bartz-Beielstein, T., Lasarczyk, C., Preuß, M.: Sequential parameter optimiza-
tion. In: McKay, B., et al. (eds) Proceedings 2005 Congress on Evolutionary
[22] Gratch, J., Chien, S. A.: Adaptive problem-solving for large-scale scheduling
problems: A case study. Journal of Artificial Intelligence Research 4:365–396
(1996)
[23] Gratch, J., Dejong, G.: Composer: A probabilistic solution to the utility prob-
lem in speed-up learning. In: Rosenbloom, P., Szolovits, P. (eds) Proceedings
of the 10th National Conference on Artificial Intelligence (AAAI-92), AAAI
Press / The MIT Press, Menlo Park, CA, USA, pp. 235–240 (1992)
[24] Guerri, A., Milano, M.: Learning techniques for automatic algorithm portfo-
lio selection. In: Proceedings of the 16th European Conference on Artificial
Intelligence (ECAI 2004), pp. 475–479 (2004)
[25] Hansen, N.: The CMA evolution strategy: A comparing review. In: Lozano,
J., Larranaga, P., Inza, I., Bengoetxea, E. (eds) Towards a new evolutionary
computation. Advances on estimation of distribution algorithms, Springer, pp.
75–102 (2006)
[26] Hansen, N., Kern, S.: Evaluating the CMA evolution strategy on multimodal
test functions. In: Yao, X., et al. (eds) Parallel Problem Solving from Nature
PPSN VIII, Springer, LNCS, vol. 3242, pp. 282–291 (2004)
[27] Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evo-
lution strategies. Evolutionary Computation 9(2):159–195 (2001)
[28] Hoos, H.: Computer-aided design of high-performance algorithms. Tech. Rep.
TR-2008-16, University of British Columbia, Department of Computer Sci-
ence (2008)
[29] Hoos, H., Stützle, T.: Local search algorithms for SAT: An empirical evalua-
tion. Journal of Automated Reasoning 24(4):421–481 (2000)
[30] Hoos, H., Stützle, T.: Stochastic Local Search—Foundations and Applications.
Morgan Kaufmann Publishers, USA (2004)
[31] Huang, D., Allen, T. T., Notz, W. I., Zeng, N.: Global optimization of stochas-
tic black-box systems via sequential kriging meta-models. Journal of Global
Optimization 34(3):441–466 (2006)
[32] Hutter, F.: Automated configuration of algorithms for solving hard computa-
tional problems. Ph.D. thesis, University of British Columbia, Department of
Computer Science, Vancouver, BC, Canada (2009)
[33] Hutter, F., Tompkins, D. A., Hoos, H.: Scaling and Probabilistic Smoothing:
Efficient Dynamic Local Search for SAT. In: Principles and Practice of Con-
straint Programming – CP 2002, Springer-Verlag, LNCS, vol. 2470, pp. 233–
248 (2002)
[34] Hutter, F., Hamadi, Y., Hoos, H., Leyton-Brown, K.: Performance prediction
and automated tuning of randomized and parametric algorithms. In: Principles
and Practice of Constraint Programming – CP 2006, Springer-Verlag, LNCS,
vol. 4204, pp. 213–228 (2006)
[35] Hutter, F., Babić, D., Hoos, H., Hu, A. J.: Boosting verification by automatic
tuning of decision procedures. In: Proc. Formal Methods in Computer-Aided
Design (FMCAD’07), IEEE Computer Society Press, pp. 27–34 (2007)
[36] Hutter, F., Hoos, H., Stützle, T.: Automatic algorithm configuration based on
local search. In: Proceedings of the 22nd National Conference on Artificial
Intelligence (AAAI-07), pp. 1152–1157 (2007)
[37] Hutter, F., Hoos, H., Leyton-Brown, K., Murphy, K.: An experimental investi-
gation of model-based parameter optimisation: SPO and beyond. In: Proceed-
ings of the 11th Annual Conference on Genetic and Evolutionary Computation
(GECCO’09), ACM, pp. 271–278 (2009)
[38] Hutter, F., Hoos, H., Leyton-Brown, K., Stützle, T.: ParamILS: An automatic
algorithm configuration framework. Journal of Artificial Intelligence Research
36:267–306 (2009)
[39] Hutter, F., Hoos, H., Leyton-Brown, K., Stützle, T.: ParamILS: An automatic
algorithm configuration framework (extended version). Tech. Rep. TR-2009-
01, University of British Columbia, Department of Computer Science (2009)
[40] Hutter, F., Hoos, H., Leyton-Brown, K.: Sequential model-based optimization
for general algorithm configuration (extended version). Tech. Rep. TR-2010-
10, University of British Columbia, Department of Computer Science (2010)
[41] Hutter, F., Hoos, H., Leyton-Brown, K., Murphy, K.: Time-bounded sequential
parameter optimization. In: Proceedings of the 4th International Conference
on Learning and Intelligent Optimization (LION 4), Springer-Verlag, LNCS,
vol. 6073, pp. 281–298 (2010)
[42] Hutter, F., Hoos, H., Leyton-Brown, K.: Automated configuration of mixed
integer programming solvers. In: Proceedings of the 7th International Confer-
ence on the Integration of AI and OR Techniques in Constraint Programming
for Combinatorial Optimization Problems (CPAIOR 2010), Springer-Verlag,
LNCS, vol. 6140, pp. 186–202 (2010)
[43] Hutter, F., Hoos, H., Leyton-Brown, K.: Extending sequential model-based
optimization to general algorithm configuration. To appear in: Proceedings
of the 5th International Conference on Learning and Intelligent Optimization
(LION 5) (2011)
[44] Jones, D. R., Schonlau, M., Welch, W. J.: Efficient global optimization of
expensive black box functions. Journal of Global Optimization 13:455–492
(1998)
[45] KhudaBukhsh, A., Xu, L., Hoos, H., Leyton-Brown, K.: SATenstein: Automat-
ically building local search SAT solvers from components. In: Proceedings of
the 21st International Joint Conference on Artificial Intelligence (IJCAI-09),
pp. 517–524 (2009)
[46] Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y.:
A portfolio approach to algorithm selection. In: Rossi, F. (ed) Principles and
Practice of Constraint Programming – CP 2003, Springer Verlag, Berlin, Ger-
many, Lecture Notes in Computer Science, vol. 2833, pp. 899–903 (2003)
[47] Lourenço, H. R., Martin, O., Stützle, T.: Iterated local search. In: Glover, F.,
Kochenberger, G. (eds) Handbook of Metaheuristics, Kluwer Academic Pub-
lishers, Norwell, MA, USA, pp. 321–353 (2002)
[48] Maron, O., Moore, A. W.: Hoeffding races: Accelerating model selection
search for classification and function approximation. In: Advances in neural
information processing systems 6, Morgan Kaufmann, pp. 59–66 (1994)
[49] Nelder, J. A., Mead, R.: A simplex method for function minimization. The
Computer Journal 7(4):308–313 (1965)
[50] Nell, C. W., Fawcett, C., Hoos, H., Leyton-Brown, K.: HAL: A framework for
the automated design and analysis of high-performance algorithms. To appear
in: Proceedings of the 5th International Conference on Learning and Intelli-
gent Optimization (LION 5) (2011)
[51] Nocedal, J., Wright, S. J.: Numerical Optimization, 2nd edn. Springer-Verlag
(2006)
[52] Nouyan, S., Campo, A., Dorigo, M.: Path formation in a robot swarm: Self-
organized strategies to find your way home. Swarm Intelligence 2(1):1–23
(2008)
[53] Pellegrini, P., Birattari, M.: The relevance of tuning the parameters of meta-
heuristics. A case study: The vehicle routing problem with stochastic de-
mand. Tech. Rep. TR/IRIDIA/2006-008, IRIDIA, Université Libre de Brux-
elles, Brussels, Belgium (2006)
[54] Pop, M., Salzberg, S. L., Shumway, M.: Genome sequence assembly: Algo-
rithms and issues. Computer 35(7):47–54 (2002)
[55] Prasad, M. R., Biere, A., Gupta, A.: A survey of recent advances in SAT-based
formal verification. International Journal on Software Tools for Technology
Transfer 7(2):156–173 (2005)
[56] Rasmussen, C. E., Williams, C. K. I.: Gaussian Processes for Machine Learn-
ing. The MIT Press (2006)
[57] Rice, J.: The algorithm selection problem. Advances in Computers 15:65–118
(1976)
[58] Rossi-Doria, O., Sampels, M., Birattari, M., Chiarandini, M., Dorigo, M.,
Gambardella, L. M., Knowles, J. D., Manfrin, M., Mastrolilli, M., Paechter,
B., Paquete, L., Stützle, T.: A comparison of the performance of different
metaheuristics on the timetabling problem. In: Burke, E. K., Causmaecker,
P. D. (eds) Practice and Theory of Automated Timetabling IV, 4th Interna-
tional Conference, PATAT 2002, Selected Revised Papers, Springer, Lecture
Notes in Computer Science, vol. 2740, pp. 329–354 (2003)
[59] Sacks, J., Welch, W., Mitchell, T., Wynn, H.: Design and analysis of computer
experiments (with discussion). Statistical Science 4:409–435 (1989)
[60] Santner, T., Williams, B., Notz, W.: The Design and Analysis of Computer
Experiments. Springer Verlag, New York (2003)
[61] Schiavinotto, T., Stützle, T.: The linear ordering problem: Instances, search
space analysis and algorithms. Journal of Mathematical Modelling and Algo-
rithms 3(4):367–402 (2004)
[62] Schonlau, M., Welch, W. J., Jones, D. R.: Global versus local search in con-
strained optimization of computer models. In: Flournoy, N., Rosenberger, W.,
Wong, W. (eds) New Developments and Applications in Experimental Design,
Chapter 4
Case-Based Reasoning for Autonomous Constraint Solving
4.1 Introduction
Your next conference will take you from your home in Cork, Ireland to Victoria
on Vancouver Island, Canada. You recall that you made a similar trip two years
previously, when you travelled from Cork, through London Heathrow, New York
JFK and Vancouver International Airport, finishing with a bus connection to Whistler, British
Columbia. You can avoid planning your next trip ‘from scratch’ by reusing your
previous itinerary. You make a small adaptation to the final leg: from Vancouver
International Airport, you will take a bus and then a ferry to your hotel in Victoria.
Humans often reason from experiences in the way exemplified above. Faced with
a new problem, we recall our experiences in solving similar problems in the past,
and we modify the past solutions to fit the circumstances of the new problem.
Within Artificial Intelligence (AI), the idea that we can solve problems by recall-
ing and reusing the solutions to similar past problems, rather than reasoning ‘from
scratch’, underlies Case-Based Reasoning (CBR), which has been the target of ac-
tive research and development since the late 1980s [35, 36, 39, 7, 2]. CBR is a prob-
lem solving and learning strategy: reasoning is remembered (this is learning); and
reasoning is remembering (this is problem-solving) [37]. CBR can be useful in do-
mains where problem types recur, and where similar problems have similar solutions
[37]. Its wide range of application areas — from classification and numeric predic-
tion to configuration, design and planning — and domains — from medicine to law
to recommender systems — is testimony to its generality [23, 24, 13, 29, 62, 8].
In many cases, it has moved out of the research laboratory and into fielded sys-
tems [11].
In this chapter, we review the application of CBR to search and especially to
constraint solving. We present our review in three parts. In Section 4.2, using the
application of CBR to route planning as a case study, we explain the CBR process
model, i.e., CBR’s main subprocesses. This helps establish terminology for later
sections. Section 4.3 reviews the roles that CBR can play in search, ranging from
its role in general AI search in tasks such as planning, scheduling and game playing
to its use in constraint solving. Section 4.4 presents CP HYDRA, a recent successful
application of CBR to autonomous constraint solving. In CP HYDRA, CBR is used
to inform a portfolio approach to constraint problem solving.
There are many ways to characterize Case-Based Reasoning systems. Here, we de-
scribe them in terms of their knowledge containers and their process model.
Their knowledge containers are
• the case base;
• the vocabulary used to represent the cases in the case base;
• the similarity measure used in retrieving cases from the case base; and
• the solution transformation knowledge used to reuse the retrieved solution(s)
[61].
Their process model typically comprises some or all of the following steps (see
Figure 4.1):
• retrieval of the most similar case(s);
• reuse of the knowledge contained in the retrieved cases to propose a solution to
the new problem;
• revision of the proposed solution in the light of feedback; and
• retention of parts of the newly gained experience for use in future problem
solving [1].
The last of these steps is what makes CBR a strategy for learning as well as for
problem solving.
By way of a case study that exemplifies CBR’s knowledge containers and process
model, we present Turas, McGinty and Smyth’s route planner [45]. A user of Turas
specifies start and end locations on a city map, and Turas finds a route that connects
the two.
The Turas case base contains routes that the user has travelled in the past and
that she has found to be of sufficient quality to store for future reuse. Route quality
is subjective: some users may prefer short routes; others may prefer scenic routes,
or routes that mostly use main roads, and so on. The Turas representation of route
quality is entirely implicit: it is captured only by the kinds of cases that the user
stores in her case base.
In Turas, each case or route is represented by an undirected graph, where nodes
represent road junctions and edges represent road segments. The set of allowed junc-
tion and segment identifiers defines the Turas case vocabulary.
Graphs are only one possible case representation. Especially commonplace across
other CBR systems are case representations that use feature vectors or sets of
attribute-value pairs. In these, the features or attributes and their possible values de-
fine the system’s vocabulary [7]. CBR systems may even use feature selection meth-
ods to automatically select from among candidate features [34], or feature weighting
methods to automatically determine feature importance [76].
Retrieval. In our presentation of Turas, we will refer to the pair that comprises
the user’s start and end locations, start and end, as a problem specification. Turas
retrieves from the case base the route that is most similar to this specification. In-
formally, Turas retrieves the case that contains a pair of junctions (nodes), n and n′,
that are close to start and end, respectively. More formally, from case base CB, it retrieves
the case c that minimizes dist(start, n) + dist(end, n′) over all pairs of junctions n, n′ in c.
Turas retrieves only one case. It is not uncommon for CBR systems to retrieve
several cases. In particular, many CBR systems score cases by their similarity to the
problem specification and retrieve the k most similar cases. In this case, the retrieval
component of the system is an implementation of kNN, the k nearest neighbours
method.
As we shall see in later sections of this chapter, CBR systems that represent cases
by feature vectors or sets of attribute-value pairs often define case similarity differ-
ently from Turas, and according to a so-called local-global principle [7]. According
to this principle, the system’s similarity knowledge container has a local similarity
measure for each feature or attribute. For example, on a numeric-valued attribute,
similarity between two values may be defined as the complement of the normalized
absolute difference in the values. Local similarity measures may also be defined in
terms of taxonomic knowledge, or by enumeration of all pairs of values. This is one
of the ways in which a CBR system may incorporate domain-specific and expert
knowledge. A global similarity measure aggregates local similarities to measure the
overall similarity. Aggregation is often done by averaging, possibly weighted by the
feature weights mentioned earlier.
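To make the local-global principle concrete, the following minimal Python sketch scores feature-vector cases by a weighted average of per-feature (local) similarities and retrieves the k most similar cases; the feature names, ranges and weights in the example are illustrative assumptions rather than values from any particular system.

def local_numeric_similarity(a, b, lo, hi):
    # Complement of the normalized absolute difference on the range [lo, hi].
    return 1.0 if hi == lo else 1.0 - abs(a - b) / (hi - lo)

def global_similarity(case, query, ranges, weights):
    # Global measure: weighted average of the local similarities.
    total = sum(weights.values())
    return sum(w * local_numeric_similarity(case[f], query[f], *ranges[f])
               for f, w in weights.items()) / total

def retrieve_k_nearest(case_base, query, ranges, weights, k):
    # kNN retrieval: return the k cases most similar to the query.
    return sorted(case_base,
                  key=lambda c: global_similarity(c, query, ranges, weights),
                  reverse=True)[:k]

# Hypothetical example with two features.
ranges = {"n_variables": (0, 1000), "density": (0.0, 1.0)}
weights = {"n_variables": 2.0, "density": 1.0}
case_base = [{"n_variables": 200, "density": 0.3, "label": "A"},
             {"n_variables": 800, "density": 0.7, "label": "B"}]
print(retrieve_k_nearest(case_base, {"n_variables": 250, "density": 0.25},
                         ranges, weights, k=1))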
Reuse. A retrieval-only system would barely qualify as a CBR system. It is essential
that there be inference, which is the responsibility of the reuse step in the process
model. This can take many forms. Where CBR is used for classification, for exam-
ple, each retrieved case contains a class label, and the system may predict the class
of the new problem by taking a vote among the retrieved cases. In other applications,
the solutions of the retrieved cases must be adapted to suit the circumstances of the
new problem [35]. For example, the reuse knowledge container might contain rules,
perhaps mined from the case base or elicited from experts, for carrying out simple
substitutions to make a past solution apply to new circumstances.
In Turas, the user wants a route that goes from start to end, but the retrieved case
c goes from c.start to c.end and contains junctions c.n and c.n′ that are close to start
and end. Adaptation consists of trimming the ends of the retrieved case to produce
a new case c′ whose start is c.n and whose end is c.n′. In other words, Turas uses
the heuristic that the subsequence of the case that brings us closest to the problem
specification locations should be reused unmodified, and leading and trailing parts
of the case that take us further away from locations in the problem specification
should be discarded.
This, of course, leaves Turas with two new subproblems, which we can express
using two new problem specifications. The first subproblem is how to get from start
to c′.start; the second subproblem is how to get from c′.end to end. These subproblems
are solved by recursive invocation of Turas. If the distance between a problem spec-
ification’s start and end falls below a threshold, then Turas invokes A∗ to plan that
part of the route.
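The reuse-and-recurse heuristic just described can be sketched in Python as follows, under simplifying assumptions: cases are lists of junction identifiers, and distance and a_star_route are assumed helper functions standing in for the map geometry and the A∗ planner. The sketch illustrates the idea rather than the actual Turas implementation.

THRESHOLD = 1.0  # below this distance, fall back to A* directly

def plan_route(start, end, case_base, distance, a_star_route):
    # Base case: the gap is small enough (or no cases remain), so call A*.
    if distance(start, end) < THRESHOLD or not case_base:
        return a_star_route(start, end)

    # Retrieval (1-NN): the case whose junctions best match start and end.
    def match_cost(route):
        n = min(route, key=lambda j: distance(j, start))
        n2 = min(route, key=lambda j: distance(j, end))
        return distance(n, start) + distance(n2, end)

    case = min(case_base, key=match_cost)
    n = min(case, key=lambda j: distance(j, start))
    n2 = min(case, key=lambda j: distance(j, end))

    # Reuse: keep only the segment of the retrieved route between n and n2.
    i, j = sorted((case.index(n), case.index(n2)))
    middle = case[i:j + 1]
    if distance(middle[-1], start) < distance(middle[0], start):
        middle = middle[::-1]          # orient the kept segment towards start

    # Recursively plan the two remaining gaps, without reusing this case again.
    rest = [c for c in case_base if c is not case]
    head = plan_route(start, middle[0], rest, distance, a_star_route)
    tail = plan_route(middle[-1], end, rest, distance, a_star_route)
    return head[:-1] + middle + tail[1:]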
Revision. By the end of its retrieval and reuse steps, a CBR system has created a
solution to the new problem. This solution needs to be tested. In some applications,
we might test it by simulating its use. For example, we can execute a case-based
plan on a world model to ensure that it leads from start state to end state and that
action preconditions are satisfied at every step [28]. In other applications, we test
the solution by applying it for real. In Turas, for example, the user can drive the
recommended route. Subsequently, she might give feedback, e.g., in the form of
a rating. She might even edit the route manually to fix any weaknesses that she
experiences.
Retention. At this point, we have a new experience: the system proposed a solution
to our new problem, and we have tested this solution. The system should learn from
this experience. At its simplest, a CBR system such as Turas can create a new case
from a problem specification and a successful solution and insert the new case into
its case base, thus extending its coverage of the problem space. Some CBR systems
might allow for the retention of unsuccessful cases too, assuming that retrieval and
reuse can handle them intelligently. There is the possibility that new experiences
result in updates to the CBR system’s other knowledge containers too.
In principle, with greater coverage comes greater likelihood of retrieving cases
that are highly similar to a given problem specification, and hence a reduced need
for solution adaptation during the reuse step.
In practice, matters are more complicated. New cases may be noisy, to the detri-
ment of solution quality. A CBR system may also have to address the utility problem.
This was first discussed in the context of explanation-based learning by Minton [49].
Minton’s PRODIGY/EBL system learns search control rules to reduce search by
eliminating dead-end paths. However, as the number of such rules grows, the cost of
rule-matching grows, and may outweigh the savings in search. Similarly, in a CBR
system, as the knowledge containers, especially the case base, grow in size, retrieval
costs may come to exceed the performance benefits that accrue from greater prob-
lem space coverage [49, 17, 18]. Hence, the four-step CBR process model has been
extended to include additional steps for analysis and maintenance of the knowledge
containers [60], and there is a considerable body of techniques for knowledge con-
tainer maintenance, especially maintenance of the case base, e.g., [38, 48, 77, 15].
4.2.3 Discussion
We have given a very high-level overview of CBR, illustrated with reference to the
Turas system. The concepts that we have explained will recur throughout the rest
of this chapter, as we review other ways in which search systems can exploit case-
based experience.
CBR is compelling because it seems so natural: humans often seem to reason in
a similar way. It is compelling too because of the way it integrates problem solv-
ing and learning. It is especially applicable in ‘weak theory’ domains, where deep
underlying causal models may not be known or may be difficult and costly to elicit.
Rarely cited as an advantage, but well illustrated by Turas, CBR allows for easy
personalization of solutions. In Turas, each user has her own case base, containing
her own preferred routes, implicitly capturing a personalized notion of route quality.
The new routes that Turas builds from those in a user’s case base will reflect the
personal biases of that case base. In an extension to the work, McGinty and Smyth
present a distributed version of Turas: when a user’s own case base lacks requisite
coverage, cases may be retrieved from the case bases of similar users [46]. This
shows how CBR can be used in collaborative problem solving.
From a learning point of view, CBR is usually a form of instance-based learning
[51]. The training instances are the cases, and these are either presented up front or
incrementally as problem solving proceeds. They are simply stored in the case base
for later reuse. This is a form of lazy learning: generalization takes place at problem-
solving time and can therefore take the new problem into account by reasoning from
instances (cases) that are close to the new problem. By contrast, in eager learning,
generalization takes place on receipt of training examples, in advance of presenta-
tion of new problems, and therefore eager learning commits to its generalization of
the training examples before it sees any new problems.
In the next section, we review the roles that CBR can play in search, especially
in constraint solving.
There are many classes of search problem for which experience can help improve
the efficiency with which a solution can be found. As we have discussed, the reuse
of experience is the core principle upon which Case-Based Reasoning (CBR) is
based. Of course, CBR can be applied more generally, and is often combined with
other reasoning techniques as the application domain demands. For example, there
are many integrations of CBR with rule-based reasoning, model-based reasoning,
constraint satisfaction, information retrieval, and planning [44].
The majority of applications of CBR to search have been in game playing,
scheduling, and most of all planning. In game playing CBR has been used for plan
selection in strategy games [3], plan recognition in games [16], move selection in
adversarial games [69], and planning in imperfect information games [68]. Case-
based scheduling has also received attention [66, 67].
However, there is a vast body of work on case-based planning algorithms, some
of which are concerned with speeding up the planning process by retrieving and
adapting a plan from a case base [52]. For example, de la Rosa et al. use CBR in a
forward planning algorithm to order nodes for evaluation [63]. In their work, cases
capture typical state transitions, which can be learned from past plans; successor
nodes are recommended for evaluation if they match these cases. Not only can the
cases themselves be learned, their utility in planning can also be monitored to in-
fluence their likelihood of being reused [63]. In fact, this planning-specific research
can be viewed as an instance of the more general technique of Constructive Adap-
tation [57]. Constructive Adaptation is a form of best-first search in which cases are
used for both hypothesis generation and hypothesis ordering.
The many applications of CBR to planning include planning support in collabo-
rative workflows [19], hierarchical planning [33], hierarchical abstractions for plan-
ning [42, 65], adaptation-guided retrieval for planning [73], case adaptation through
replanning [74], and constraint-based case-based planning [78].
In this chapter we are concerned with integrations of CBR with constraint program-
ming and, in particular, with how such an integration can be used to support au-
tonomous search in constraint solving. Many search problems can be formulated as
Constraint Satisfaction Problems [64]. A Constraint Satisfaction Problem (CSP) is
defined in terms of a finite set of decision variables where each variable is associated
with a finite set of values, called the domain of the variable, and a set of constraints
that specify the legal ways of assigning values to the variables. The task in a CSP is
to find an assignment of a value to each variable such that the set of constraints is
satisfied. The problem is well known to be NP-complete. In addition, in many appli-
cation domains a natural objective function can be formulated that defines the no-
tion of an optimal solution. CSPs are solved either using systematic search, such as
backtracking interleaved with constraint propagation, or non-systematic approaches
based on local search [64].
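As a toy illustration of these definitions, the following Python sketch encodes a small binary CSP and solves it by backtracking search interleaved with a simple form of forward checking; it is a minimal didactic example rather than a production constraint solver.

def solve(domains, constraints, assignment=None):
    # domains: {var: set(values)}; constraints: {(x, y): predicate(vx, vy)}.
    assignment = assignment or {}
    if len(assignment) == len(domains):
        return assignment
    var = next(v for v in domains if v not in assignment)
    for value in sorted(domains[var]):
        assignment[var] = value
        pruned, ok = {}, True
        # Forward checking: prune the domains of unassigned neighbours.
        for (x, y), pred in constraints.items():
            if x == var and y not in assignment:
                other, allowed = y, {w for w in domains[y] if pred(value, w)}
            elif y == var and x not in assignment:
                other, allowed = x, {w for w in domains[x] if pred(w, value)}
            else:
                continue
            pruned[other] = domains[other]
            domains[other] = allowed
            if not allowed:
                ok = False
                break
        if ok:
            result = solve(domains, constraints, assignment)
            if result is not None:
                return result
        domains.update(pruned)          # undo the pruning before backtracking
        del assignment[var]
    return None

# Toy problem: three variables that must all take different values.
domains = {"A": {1, 2}, "B": {1, 2}, "C": {1, 2, 3}}
constraints = {("A", "B"): lambda a, b: a != b,
               ("A", "C"): lambda a, c: a != c,
               ("B", "C"): lambda b, c: b != c}
print(solve(domains, constraints))      # e.g., {'A': 1, 'B': 2, 'C': 3}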
A number of systems have been built that integrate CBR and CSP techniques,
e.g., in the fields of configuration and engineering design [30, 71] and forest fire
fighting [6]. However, there are a number of integrations through which either CSP
techniques are used to enhance CBR, or vice versa. For example, Neagu and Faltings
have shown how constraints can be used to support the adaptation phase of the CBR
process [54, 55]. Specifically, they show how interchangeable values for variables in
solutions can help reason about the effect of change in a solution set so that minimal
sets of choices can be found to adapt existing solutions for a new problem. This is
particularly interesting since the adaptation phase of CBR has received less attention
than other aspects, such as case retrieval.
Conversely, CBR techniques have been used to help find solutions to new sets of
constraints by reasoning about solutions to similar sets of constraints [31, 58, 32].
These approaches are similar to CSP techniques that help solve challenging CSPs
by caching search states [70]. The majority of the search effort is spent proving
the insolubility of unsatisfiable subtrees when using a systematic backtrack-style
algorithm. Therefore, by remembering search states that lead to failure, one can
improve search efficiency. One can view this as a form of CBR since one stores
interesting states, and retrieves these when they are similar to the current search.
A review of applications that integrate CBR and CSP techniques up to the late
1990s is available [41]. In the next subsection we will briefly review the literature
on the use of CBR techniques to improve the efficiency of various types of search
techniques.
CBR has been used to solve a variety of problems that are typically solved using
systematic search techniques such as timetabling [9] and route planning [27]. A
typical way of using CBR to improve systematic search is through solution reuse,
which has been successfully applied in real-time scheduling [12]. However, none of
these systems can be regarded as truly supporting autonomous search.
In constraint programming, a major challenge is to choose the appropriate search
strategy to use for a given CSP model. Autonomous approaches to addressing this
challenge are amongst the most difficult to develop. We briefly survey two specific
approaches to autonomous systematic and local search, before discussing an award-
winning approach to autonomous constraint solving in Section 4.4.
Autonomous Systematic Search. Gebruers et al. present a simple CBR system for
choosing a solution strategy [20]. In their work, the solution strategy is chosen in
advance, and not changed dynamically during the search. Each case in the case base
contains a description of a problem, along with a good solution strategy for solving
that problem. The problems are described by a vector of static, surface features. In
their experiment, they work with the Social Golfer Problem, and features include the
number of golfers, the number of weeks, and the ratio of these two. Of all the pos-
sible features, a subset is used in the CBR; this subset is chosen using the Wrapper
method [34], and cases are retrieved using Euclidean distance.
Gebruers et al. compare their CBR approach to a decision tree [59] and sev-
eral benchmark methods, including a ‘use best’ approach, which selects the same
strategy every time: the one that is a winner most often in the training set. In their
experiments, CBR predicts the best strategy about 10% more often than decision
trees and ‘use best’. Furthermore, in terms of total execution time, regardless of
whether a winning strategy is predicted or not, CBR significantly outperforms the
other strategies, largely because it predicts far fewer strategies that time out. The ap-
proach can therefore be regarded as one of the first that supports autonomous search
for solving CSPs.
There is related work that does not use CBR. Minton dynamically constructs con-
straint programs by performing an incomplete search of the space of possible pro-
grams [50]. The contrast between his work and that of Gebruers et al. is that Minton
constructs new programs whereas Gebruers et al. seek to reuse existing strategies.
Rather different again is the system reported in [10], which executes multiple strate-
gies, gathers information at runtime about their relative performance and decides
which strategies to continue with. The focus in that system is on domains where
optimisation is the primary objective, rather than constraint satisfaction.
Autonomous Local Search. One of the most common local search techniques for
solving large-scale constraint optimisation problems is tabu search [21, 22]. In tabu
search, memory structures are used to remember previously visited states that should
not be revisited, i.e., tabu states, in order to help diversify the search. Tabu states
prevent the local search algorithm from becoming trapped in a local minimum.
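A generic tabu search skeleton, written in Python for illustration, shows where this memory enters the search loop; neighbours and cost are problem-specific stubs, and the state-based tabu list with a simple aspiration test is just one common design choice.

def tabu_search(initial, neighbours, cost, tenure=10, max_iters=1000):
    current = best = initial
    tabu = []                           # short-term memory of recent states
    for _ in range(max_iters):
        # Aspiration: a tabu state may still be accepted if it beats the best.
        candidates = [s for s in neighbours(current)
                      if s not in tabu or cost(s) < cost(best)]
        if not candidates:
            break
        current = min(candidates, key=cost)
        tabu.append(current)
        if len(tabu) > tenure:          # forget the oldest tabu state
            tabu.pop(0)
        if cost(current) < cost(best):
            best = current
    return best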
Grolimund and Ganascia [25, 26] have reported a system called ALOIS which
automatically designs the search control operators in tabu search in a problem-
independent manner. ALOIS is the acronym for “Analogical Learning for Optimi-
sation to Improve Operator Selection”. The design of good tabu search techniques
for a specific problem class is a highly technical process which requires that the
search algorithm designer has a deep understanding of the problem being consid-
ered. ALOIS uses a case base that tries to relate how using a search operator in some
state yielded a specific objective function value. The principle that underlies ALOIS
is that using an operator in a similar search state will result in a similar reward.
The better the reward obtained, the more frequently a particular operator becomes
available for retrieval using the CBR system. By rewarding better performing oper-
ators in specific search states, ALOIS can learn complex tabu strategies. On hard,
(un)capacitated facility location problems ALOIS improves the performance of sev-
eral generic tabu search strategies.
None of the systems reviewed above has been applied across a large number and
variety of constraint problems. CP HYDRA is an algorithm portfolio for constraint
satisfaction that uses Case-Based Reasoning to determine how to solve an unseen
problem instance by exploiting a case base of problem-solving experiences [56]. In
areas such as satisfiability testing and integer linear programming a carefully chosen
combination of solvers can outperform the best individual solver for a given set of
problems. This approach is an extension of the concept of algorithm portfolios [43].
In an algorithm portfolio a collection of algorithms is used to solve a problem. The
strengths and weaknesses of each of the algorithms are balanced against each other.
Algorithm portfolios can be very successful due to the diverse performance of dif-
ferent algorithms on similar problems. Key in the design of algorithm portfolios is
a selection process for choosing an algorithm from the portfolio to run on a given
problem. The selection process is usually performed using a machine learning tech-
nique based on feature data extracted from the problem. CP HYDRA’s algorithm
portfolio consists of a set of constraint solvers and its selection process is a CBR
system. The system won the 2008 International CSP Solver Competition, demon-
strating that a CBR-based approach to autonomous constraint solving could be built
and deployed successfully.
4.4.2 CP HYDRA
The CP HYDRA system was developed with the goal of entering the 2008 International
CSP Competition (http://cpai.ucc.ie). It is itself unable to solve CSPs, but it has a portfolio of
constraint solvers that it can use to solve a given problem. In the version entered
into the 2008 competition, this portfolio consisted of three solvers: Mistral, Choco
and Abscon. These solvers, although similar in their design (all are finite domain
solvers with similar levels of consistency), have diverse strengths and weaknesses.
From a CBR point of view, the CP HYDRA knowledge containers are as follows:
• The case base: in CP HYDRA the case base contains all problems from previous
CSP competitions. Each case represents an individual problem. The run times
for each of the constituent solvers on this problem are stored as part of the case.
• The vocabulary used to represent a case comprises two parts. The first is a
vector of real numbers representing features of the problem, discussed below.
The second is the time taken for each of CP HYDRA’s constituent solvers to
solve the problem, or an indication that they timed out.
• The similarity measure in CP HYDRA is a simple Euclidean distance calculation
between the feature vector of the target problem and the corresponding vector of
each case.
• The solution transformation knowledge of CP HYDRA is an objective function
to maximize the chances of solving the problem within the time limit.
CP HYDRA has the following process model (Figure 4.2):
• Retrieval in CP HYDRA is achieved through a simple k-nearest neighbour (kNN)
approach. In the competition k was set to 10.
• Reuse in CP HYDRA involves building a schedule that runs solvers that appear
in the cases CP HYDRA retrieves in the previous step for periods that are com-
puted to maximize the chances of solving the problem within the time limit (30
minutes in the competition).
• In its Revise step, CP HYDRA tries to solve a new problem instance by running
solvers according to the schedule computed in its Reuse step. Either a solver
solves the instance, and the rest of the schedule need not be executed, or none
of the solvers solves the instance, i.e., the schedule is executed to conclusion
without success. In the competition there is no opportunity to use this feedback
to revise the case base.
• CP HYDRA has no retention step, since in the competition setting entrants can-
not self-modify incrementally after each problem instance.
We will now explain in detail the components of CP HYDRA.
Case Base. The case base used in CP HYDRA was generated from the problem set of
the 2007 International Constraint Solver Competition and problems released before
the 2008 competition. The total size of the case base is of the order of 3,000 cases.
Each of the cases represented one of the problems. As mentioned earlier, a problem
is represented by a vector of real numbers. This vector contains two separate sets of
features.
The first set of syntactic features is a representation of the basic properties of
the problem. They count things like the number of constraints, ratio of intensional
constraints to extensional constraints, average domain size and average constraint
arity, among others. In CP HYDRA there are 22 of these syntactic features. CP HY-
DRA computes these features by parsing the XCSP file specifying the problem and
extracting the information needed.
The second set of semantic features captures information on the structure of the
problem. These are generated by running a constraint solver on the problem for a
short time. In CP HYDRA Mistral is run for two seconds on the problem. Information
on search statistics and modelling choices is extracted, such as how many nodes
were searched in the time, how the variables are represented in the solver, and so
on. CP HYDRA uses 14 of these semantic features.
During the generation of the case base any problems that were solved by Mistral
during the extraction of the semantic features were not included in the case base.
We decided to omit these cases because they are trivially simple and, we felt, would
not contribute to the case base. All features, syntactic and semantic, that represent a
quantity are logarithmically scaled whereas the ratios are percentages. This prevents
large values from dominating the similarity computation.
Retrieval. As mentioned above, the CP HYDRA case retrieval is done using a simple
k-nearest neighbour (kNN) algorithm. The similarity between cases is calculated as
the Euclidean distance between the vectors of real numbers that represent problem
features. When the k nearest neighbours are computed, if the kth nearest case has
a similarity to the target case equal to that of the (k + 1)th case, both are taken.
In the competition version of CP HYDRA k was set to 10. This value was chosen
following some initial experimentation; larger values are also unsuitable given
the complexity of the reuse stage, which is discussed next.
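A rough Python sketch of this retrieval step is given below: quantity features are log-scaled, distances are Euclidean, and a tie at the k-th position extends the retrieved set. The data layout and feature handling are simplifying assumptions for illustration, not a transcription of the CP HYDRA code.

import math

def scale(features, is_quantity):
    # Log-scale the quantity features; leave ratio (percentage) features alone.
    return [math.log(v + 1.0) if q else v
            for v, q in zip(features, is_quantity)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(case_base, target, is_quantity, k=10):
    t = scale(target, is_quantity)
    scored = sorted(case_base,
                    key=lambda c: distance(scale(c["features"], is_quantity), t))
    if len(scored) <= k:
        return scored
    cutoff = distance(scale(scored[k - 1]["features"], is_quantity), t)
    # Keep any further cases exactly as distant as the k-th nearest case.
    return [c for c in scored
            if distance(scale(c["features"], is_quantity), t) <= cutoff]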
Reuse. CP HYDRA is designed to solve problems within a given time limit. It has a
portfolio of solvers to allow it to achieve this task. Earlier versions of CP HYDRA
picked one solver to attempt to solve the problem. This was done through a simple
majority vote over the k similar cases. This approach lacked subtlety as
all hopes of solving the problem were placed on one solver. Imagine that three cases
have been returned, two of which state that Choco has solved a similar problem in
25 minutes and one of which states that Mistral has solved a similar problem in
five minutes. A majority vote will choose Choco to be run for the full 30 minutes
that are allowed in the competition. Running Choco for 25 minutes and Mistral for
five minutes gives a much better cover of the similar cases. This simple schedule
reflects much more of the information the similar cases are providing, since all the
information returned is being used to make the best decision on how to solve the
proposed problem.
To avoid the weakness just mentioned, CP HYDRA generates a schedule of its
constituent solvers. The schedule is optimized to maximize the chances of solving
the problem within 30 minutes. Given the set of similar cases, C, CP HYDRA com-
putes a schedule such that the number of cases in C that would be solved using the
schedule is maximized.
More formally, the goal is to compute a solver schedule, that is, a function f :
S → R mapping an amount of CPU time to each solver. Consider a set of similar
cases C and a set of solvers S. For a given solver s ∈ S and a time point t ∈ [0..1800]
we define C(s, t) as the subset of C solved by s if given at least time t. The schedule
can be computed by the following constraint program:
maximize |⋃s∈S C(s, f(s))|
subject to ∑s∈S f(s) ≤ 1800.
This problem is NP-hard as it is a generalisation of the knapsack problem. For-
tunately, in practice it is simple because the number of similar cases k = |C| and the
number of solvers |S| are small.
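One simple way to solve this small optimisation problem is sketched below in Python: because an optimal schedule need only allocate each solver either nothing or exactly one of the runtimes observed in the similar cases, the sketch enumerates those candidate allocations and keeps the one that covers the most cases within the budget. The data layout (a dict per case mapping solver names to runtimes, with None for a timeout) is an assumption for illustration; CP HYDRA itself states the problem as a constraint program.

from itertools import product

BUDGET = 1800

def solved(case, allocation):
    # A case counts as covered if some solver is given at least its runtime.
    return any(case[s] is not None and case[s] <= allocation[s]
               for s in allocation)

def best_schedule(cases, solvers):
    # Candidate allocations per solver: 0 or one of its observed runtimes.
    candidates = {s: sorted({0} | {c[s] for c in cases if c[s] is not None})
                  for s in solvers}
    best, best_count = None, -1
    for times in product(*(candidates[s] for s in solvers)):
        if sum(times) > BUDGET:
            continue
        allocation = dict(zip(solvers, times))
        count = sum(solved(c, allocation) for c in cases)
        if count > best_count:
            best, best_count = allocation, count
    return best

# Toy example with two hypothetical solvers and three similar cases.
cases = [{"mistral": 300, "choco": None},
         {"mistral": None, "choco": 1500},
         {"mistral": 1200, "choco": 900}]
print(best_schedule(cases, ["mistral", "choco"]))   # {'mistral': 300, 'choco': 1500}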
In Figures 4.3 and 4.4 CP HYDRA’s scheduling is illustrated for two solvers, in
this case Mistral and Choco. Each point in the graphs represents a case, whose (x, y)
co-ordinates represent the time taken to solve the problem for Choco (x) and Mistral
(y). The x and y axes represent time.
In Figure 4.3, 25 minutes are given to Choco and five minutes to Mistral. The
cases that would not be solved by this schedule are highlighted. Figure 4.4 has a
better solution, with 15 minutes being allocated to each solver, solving one more
problem than the first solution. The optimal solution, Figure 4.5, gives 25 minutes
to Mistral and five to Choco; only three problems remain unsolved.
In practice there are some complications associated with the scheduling com-
putation. The first is that some solvers are dominated by others in a schedule.
Formally, we define a solver s1 to be dominated by another solver s2 if and
only if
∀t < 1800: C(s1, t) ⊆ C(s2, t + ε).
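This dominance test translates directly into code; the Python sketch below assumes a table of observed runtimes per solver, with None marking a timeout (the representation is illustrative).

def dominated(s1, s2, runtimes, eps=1.0):
    # s1 is dominated by s2 if every case s1 solves in time t is also
    # solved by s2 within t + eps.
    for case_id, t1 in runtimes[s1].items():
        if t1 is None:
            continue                     # s1 never solves this case
        t2 = runtimes[s2].get(case_id)
        if t2 is None or t2 > t1 + eps:
            return False
    return True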
CP HYDRA won the 2008 International CSP Solver Competition, winning four of
the five categories as well as winning overall. The results are shown in Figure 4.6.
CP HYDRA beat all its constituent solvers in every category, showing that a collec-
tion of solvers can be stronger than any of the individual solvers. The single category
Fig. 4.6: Results from all categories on the left and from the whole competition on
the right
that CP HYDRA failed to win outright was the Global constraint category. This cat-
egory was won by the solver “Sugar”. This is an argument for adding Sugar to the
algorithm portfolio as it clearly has something to contribute.
During analysis of competition data, an interesting property of the portfolio ap-
proach emerged. In a competition setting, most portfolio approaches are very suc-
cessful. Figure 4.7 shows the performance of different scheduling approaches com-
Fig. 4.7: Data on different approaches as the time limit decreases on 2007 data
puted on data from the 2007 competition. The schedules compared in the graph are
listed below. The solvers in the graph represent the constituent solvers used for the
portfolio on the tests. These schedules vary in complexity, from a simple even split
of available time to a scheduling system informed by a CBR system:
• Schedule – The CP HYDRA scheduling system.
• Split – Each of the constituent solvers receives an equal amount of time.
• Static – A schedule generated using the scheduling process described earlier.
It is apparent from these graphs (Figure 4.7) that, as the time available decreases,
CP HYDRA begins to clearly outperform the other approaches. We believe this to be
a consequence of the majority of the problems in the competition being relatively
easy. Most problems, if they are solved, are solved in a short amount of time.
Although CP HYDRA has been very successful, there remains work to be done.
CP HYDRA does not utilize many of the more sophisticated CBR techniques, such as
feature selection to remove features that contribute little to the case base, automatic
feature weighting to identify features that appear to be critical in case comparison,
and case base maintenance to remove cases that do not improve the effectiveness of
the system, among many others.
Within Artificial Intelligence the idea that we can solve problems by recalling and
reusing the solutions to similar past problems, rather than reasoning ‘from scratch’,
underlies Case-Based Reasoning. In this chapter, we reviewed the application of
CBR to search. We presented the application of CBR to route planning as a case
study, which was used to explain the case-based reasoning process model. We pro-
vided a review of the roles that CBR can play in search in general, and then focused
on its use in systematic and incomplete approaches to constraint solving. Finally, we
presented CP HYDRA, a CBR-based portfolio approach to constraint problem solv-
ing, which won the 2008 CSP Solver Competition, demonstrating that successful
autonomous search systems can be built using CBR techniques.
References
[1] Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Method-
ological Variants, and System Approaches. AI Communications 7(1), 39–59
(1994)
[2] Aha, D. W., Marling, C., Watson, I.: Case-based reasoning commentaries: In-
troduction. The Knowledge Engineering Review 3(20), 201–202 (2005)
[3] Aha, D. W., Molineaux, M., Ponsen, M. J. V.: Learning to win: Case-based
plan selection in a real-time strategy game. In: Muñoz-Avila and Ricci [53],
pp. 5–20
[4] Aha, D. W., Watson, I. (eds.): Case-Based Reasoning Research and Develop-
ment, 4th International Conference on Case-Based Reasoning, ICCBR 2001,
Vancouver, BC, Canada, July 30 - August 2, 2001, Proceedings, Lecture Notes
in Computer Science, vol. 2080. Springer (2001)
[5] Ashley, K. D., Bridge, D. G. (eds.): Case-Based Reasoning Research and De-
velopment, 5th International Conference on Case-Based Reasoning, ICCBR
[23] Goel, A., Craw, S.: Design, innovation and case-based reasoning. The Knowl-
edge Engineering Review 3(20), 271–276 (2005)
[24] Göker, M. H., Howlett, R. J., Price, J. E.: Case-based reasoning for diagnosis
applications. The Knowledge Engineering Review 3(20), 277–281 (2005)
[25] Grolimund, S., Ganascia, J. G.: Integrating case based reasoning and tabu
search for solving optimisation problems. In: Veloso and Aamodt [75], pp.
451–460
[26] Grolimund, S., Ganascia, J. G.: Driving tabu search with case-based
reasoning. European Journal of Operational Research 103(2), 326–
338 (1997). URL http://ideas.repec.org/a/eee/ejores/
v103y1997i2p326-338.html
[27] Haigh, K. Z., Veloso, M. M.: Route planning by analogy. In: Veloso and
Aamodt [75], pp. 169–180
[28] Hammond, K. J.: Explaining and Repairing Plans that Fail. Artificial Intelli-
gence 45, 173–228 (1990)
[29] Holt, A., Bichindaritz, I., Schmidt, R., Perner, P.: Medical applications in case-
based reasoning. The Knowledge Engineering Review 3(20), 289–292 (2005)
[30] Hua, K., Smith, I. F. C., Faltings, B.: Integrated case-based building design.
In: S. Wess, K. D. Althoff, M. M. Richter (eds.) EWCBR, Lecture Notes in
Computer Science, vol. 837, pp. 436–445. Springer (1993)
[31] Huang, Y.: Using case-based techniques to enhance constraint satisfaction
problem solving. Applied Artificial Intelligence 10(4), 307–328 (1996)
[32] Huang, Y., Miles, R.: A case based method for solving relatively stable dy-
namic constraint satisfaction problems. In: Veloso and Aamodt [75], pp. 481–
490
[33] Khemani, D., Prasad, P.: A memory-based hierarchical planner. In: Veloso and
Aamodt [75], pp. 501–509
[34] Kohavi, R., John, G.: Wrappers for feature subset selection. Artificial Intelli-
gence 97(1–2), 273–324 (1997)
[35] Kolodner, J.: Case-Based Reasoning. Morgan Kaufmann (1993)
[36] Leake, D. B. (ed.): Case-Based Reasoning: Experiences, Lessons, & Future Di-
rections. AAAI Press/MIT Press (1996)
[37] Leake, D. B.: CBR in Context: The Present and Future. [36], pp. 3–30
[38] Leake, D. B., Wilson, D. C.: Categorizing case-base maintenance: Dimensions
and directions. In: B. Smyth, P. Cunningham (eds.) Procs. of the 4th European
Conference on Case-Based Reasoning, pp. 196–207. Springer-Verlag (1998)
[39] Lenz, M., Bartsch-Spörl, B., Burkhard, H. D., Wess, S. (eds.): Case-Based
Reasoning Technology: From Foundations to Applications. LNAI 1400.
Springer (1998)
[40] Luger, G. F.: Artificial Intelligence: Structures and Strategies for Complex
Problem Solving. Addison-Wesley Longman Publishing Co., Inc., Boston,
MA, USA (2001)
[41] Squalli, M., Purvis, L., Freuder, E.: Survey of applications integrating con-
straint satisfaction and case-based reasoning. In: Procs. of the 1st Interna-
[59] Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann
(1993)
[60] Reinartz, T., Iglezakis, I., Roth-Berghofer, T.: Review and restore for case-
based maintenance. Computational Intelligence 17(2), 214–234 (2001)
[61] Richter, M. M.: Introduction. In: Lenz et al. [39], pp. 1–15
[62] Rissland, E. L., Ashley, K. D., Branting, L. K.: Case-based reasoning and law.
The Knowledge Engineering Review 3(20), 293–298 (2005)
[63] de la Rosa, T., Olaya, A. G., Borrajo, D.: Using cases utility for heuristic plan-
ning improvement. In: R. Weber, M. M. Richter (eds.) ICCBR, Lecture Notes
in Computer Science, vol. 4626, pp. 137–148. Springer (2007)
[64] Rossi, F., Beek, P. v., Walsh, T.: Handbook of Constraint Programming (Foun-
dations of Artificial Intelligence). Elsevier Science Inc., New York, NY, USA
(2006)
[65] Sánchez-Ruiz-Granados, A. A., González-Calero, P. A., Dı́az-Agudo, B.: Ab-
straction in knowledge-rich models for case-based planning. In: McGinty and
Wilson [47], pp. 313–327
[66] Scott, S., Osborne, H., Simpson, R.: Selecting and comparing multiple cases to
maximise result quality after adaptation in case-based adaptive scheduling. In:
E. Blanzieri, L. Portinale (eds.) EWCBR, Lecture Notes in Computer Science,
vol. 1898, pp. 517–528. Springer (2000)
[67] Scott, S., Simpson, R.: Case-bases incorporating scheduling constraint dimen-
sions - experiences in nurse rostering. In: Smyth and Cunningham [72], pp.
392–401
[68] Shih, J.: Sequential instance-based learning for planning in the context of an
imperfect information game. In: Aha and Watson [4], pp. 483–501
[69] Sinclair, D.: Using example-based reasoning for selective move generation in
two player adversarial games. In: Smyth and Cunningham [72], pp. 126–135
[70] Smith, B. M.: Caching search states in permutation problems. In: P. van Beek
(ed.) CP, Lecture Notes in Computer Science, vol. 3709, pp. 637–651. Springer
(2005)
[71] Smith, I. F. C., Lottaz, C., Faltings, B.: Spatial composition using cases: Idiom.
In: Veloso and Aamodt [75], pp. 88–97
[72] Smyth, B., Cunningham, P. (eds.): Advances in Case-Based Reasoning, 4th
European Workshop, EWCBR-98, Dublin, Ireland, September 1998, Proceed-
ings, Lecture Notes in Computer Science, vol. 1488. Springer (1998)
[73] Tonidandel, F., Rillo, M.: An accurate adaptation-guided similarity metric for
case-based planning. In: Aha and Watson [4], pp. 531–545
[74] Tonidandel, F., Rillo, M.: Case adaptation by segment replanning for case-
based planning systems. In: Muñoz-Avila and Ricci [53], pp. 579–594
[75] Veloso, M. M., Aamodt, A. (eds.): Case-Based Reasoning Research and Devel-
opment, First International Conference, ICCBR-95, Sesimbra, Portugal, Octo-
ber 23-26, 1995, Proceedings, Lecture Notes in Computer Science, vol. 1010.
Springer (1995)
[76] Wettschereck, D., Aha, D. W.: Weighting features. In: M. Veloso, A. Aamodt
(eds.) Procs. of the 1st International Conference on Case-Based Reasoning,
LNAI 1010, pp. 347–358. Springer (1995)
[77] Wilson, D. R., Martinez, T. R.: Reduction Techniques for Instance-Based
Learning. Machine Learning 38, 257–286 (2000)
[78] Zhuo, H., Yang, Q., Li, L.: Constraint-based case-based planning using
weighted max-sat. In: McGinty and Wilson [47], pp. 374–388
Chapter 5
Learning a Mixture of Search Heuristics
5.1 Introduction
Susan L. Epstein
Department of Computer Science, Hunter College and The Graduate Center of The City University
of New York, New York, New York, e-mail: [email protected]
Smiljana Petrovic
Department of Computer Science, Iona College, New Rochelle, New York, e-mail: [email protected]
and gauge its own performance. In addition, the learner must somehow infer the
correct/incorrect labels for those examples from its own performance on the partic-
ular problem where the examples arose. Furthermore, the autonomous learner must
evaluate its performance more generally: Is the program doing well? Has it learned
enough? Should it start over?
Typically, one creates an autonomous learner because there is no oracle at hand.
Such is the case in the search for solutions to constraint satisfaction problems
(CSPs). The constraint literature is rich in heuristics to solve these problems, but
the efficacy of an individual heuristic may vary dramatically with the kind of CSP it
confronts, and even with the individual problem. A combination of heuristics seems
a reasonable approach, but that combination must somehow be learned. This chap-
ter begins, then, with general background on combinations of decision makers and
machine learning, followed by specific work on combinations to guide CSP search.
Subsequent sections describe ACE, an ambitious project that learns a mixture of
heuristics to solve CSPs.
In the face of uncertainty, a prediction algorithm draws upon theory and knowledge
to forecast a correct decision. Often, however, there may be multiple reasonable pre-
dictors, and it may be difficult for the system builder to select from among them.
Dietterich gives several reasons not to make such a selection, that is, to have a ma-
chine learning algorithm employ a mixture of hypotheses [11]. On limited data,
there may be different hypotheses that appear equally accurate. In that case, al-
though one could approximate the unknown true hypothesis by the simplest one,
averaging or mixing all of them together could produce a better approximation.
Moreover, even if the target function cannot be represented by any of the individual
hypotheses, their combination could produce an acceptable representation.
An ensemble method combines a set of individual hypotheses. There is substan-
tial theoretical and empirical confirmation that, on average, an ensemble of hypotheses outperforms the best individual hypothesis, particularly if
they are represented as decision trees or neural networks [2, 31, 46]. Indeed, for suf-
ficiently accurate and diverse classifiers, the accuracy of an ensemble of classifiers
has been shown to increase with the number of hypotheses it combines [22]. One
well-known ensemble method is AdaBoost, which ultimately combines a sequence
of learned hypotheses, emphasizing examples misclassified by previously generated
hypotheses [14, 42].
More generally, a mixture of experts algorithm learns from a sequence of trials
how to combine its experts’ predictions [26]. In a supervised environment, a trial
has three steps: the mixture algorithm receives predictions from each of e experts, makes its own prediction ŷ based on them, and then receives the correct value y. The objective is to create a mixture algorithm that minimizes the loss function (the distance between ŷ and y). The performance of such an algorithm is often measured by its relative loss: the additional loss beyond that incurred on the same example
by the best individual expert. Under the worst-case assumption, mixture of experts
algorithms have been proved asymptotically close to the behavior of the best expert
[26].
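The chapter does not commit to a particular mixture algorithm at this point. As a concrete illustration only, the following Python sketch implements one classic scheme, an exponentially weighted average forecaster; the class name, the learning rate eta, and the squared-loss choice are assumptions of the sketch, not details taken from [26].

```python
import math

class ExpertMixture:
    """One trial: collect expert predictions, predict, observe the truth, reweight.

    A minimal exponentially weighted average forecaster; the algorithms
    analyzed in [26] differ in details and in their loss bounds.
    """

    def __init__(self, num_experts, eta=0.5):
        self.weights = [1.0] * num_experts  # one weight per expert
        self.eta = eta                      # learning rate (assumed value)

    def predict(self, expert_predictions):
        # Weighted average of the e experts' predictions.
        total = sum(self.weights)
        return sum(w * p for w, p in zip(self.weights, expert_predictions)) / total

    def update(self, expert_predictions, truth):
        # Each expert's weight shrinks exponentially with its own loss.
        for i, p in enumerate(expert_predictions):
            loss = (p - truth) ** 2         # squared loss (assumed)
            self.weights[i] *= math.exp(-self.eta * loss)
```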
A CSP is a set of variables, each with a domain of values, and a set of constraints ex-
pressed as relations over subsets of those variables. In a binary CSP, each constraint
is on at most two variables. A problem class is a set of CSPs with the same charac-
terization. For example, binary CSPs in model B are characterized by <n, m, d, t>,
where n is the number of variables, m the maximum domain size, d the density
(fraction of constraints out of n(n − 1)/2 possible constraints) and t the tightness
(fraction of possible value pairs that each constraint excludes) [19]. A binary CSP
can be represented as a constraint graph, where vertices correspond to the vari-
ables (labeled by their domains), and each edge represents a constraint between its
respective variables.
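For concreteness, the sketch below generates a random binary CSP in the spirit of the model B characterization <n, m, d, t> just described. The representation of a constraint as a set of forbidden value pairs, and the function name, are assumptions of this sketch.

```python
import random
from itertools import combinations

def model_b_csp(n, m, d, t, seed=None):
    """Random binary CSP in the spirit of model B with parameters <n, m, d, t>.

    n variables with domains {0, ..., m-1}; round(d * n(n-1)/2) constraints
    chosen uniformly at random; each constraint forbids round(t * m * m)
    value pairs chosen uniformly at random.
    """
    rng = random.Random(seed)
    domains = {v: set(range(m)) for v in range(n)}
    all_pairs = list(combinations(range(n), 2))
    constraints = {}
    for x, y in rng.sample(all_pairs, round(d * len(all_pairs))):
        value_pairs = [(a, b) for a in range(m) for b in range(m)]
        constraints[(x, y)] = set(rng.sample(value_pairs, round(t * m * m)))
    return domains, constraints
```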
A randomly generated problem class may also mandate a specific structure for its
CSPs. For example, each of the composed problems used here consists of a subgraph
(its central component) loosely joined to one or more additional subgraphs (its satel-
lites) [1]. Figure 5.1 illustrates a composed problem with two satellites. Geometric
CSPs also have non-random structure. A random geometric graph <n, D> has n
vertices, each represented by a random point in the unit square [25]. There is an
edge between two vertices if and only if their (Euclidean) distance is no larger than
D. A class of random geometric CSPs <n, D, d, t> is based on a set of random ge-
ometric graphs <n, D>. In <n, D, d, t>, the variables represent random points, and
constraints are on variables corresponding to points close to each other. Additional
edges ensure that the graph is connected. Density and tightness are given by the
parameters d and t, respectively. Figure 5.2 illustrates a geometric graph with 500
variables. Real-world CSPs typically display some non-random structure in their
constraint graphs.
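A minimal sketch of the underlying random geometric graph <n, D> as defined above: n random points in the unit square, with an edge whenever two points lie within Euclidean distance D. The additional edges that guarantee connectivity, and the attachment of constraints with density d and tightness t, are omitted.

```python
import math
import random

def random_geometric_graph(n, D, seed=None):
    """Vertices are n random points in the unit square; edge iff distance <= D."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    edges = [(i, j)
             for i in range(n)
             for j in range(i + 1, n)
             if math.dist(points[i], points[j]) <= D]
    return points, edges
```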
This chapter presents experiments on six CSP classes. Geo (the geometric prob-
lems <50, 10, 0.4, 0.82>) and Comp (the composed problems with central compo-
nent in model B with <22, 6, 0.6, 0.1>, linked to a single model B satellite with
<8, 6, 0.72, 0.45> by edges with density 0.115 and tightness 0.05) are classes of
structured CSPs. The others are model B classes: <50, 10, 0.38, 0.2>, which is exceptionally hard for its size (n and m); <50, 10, 0.18, 0.37>, which is the same size but somewhat easier; <20, 30, 0.444, 0.5>, whose problems have large domains; and <30, 8, 0.26, 0.34>, whose problems are easy compared to those in the other classes, but difficult for their size. Some of these problems appeared in the First International Constraint
Solver Competition at CP-2005.
An instantiation assigns a value to all (full instantiation) or some (partial instan-
tiation) of the variables. A solution to a CSP is a full instantiation that satisfies all
the constraints. A solvable CSP has at least one solution. A solver solves a CSP if
it either finds a solution or proves that it is unsolvable, that is, that it has no solu-
tion. All problems used in the experiments reported here are randomly generated, solvable binary CSPs.
Traditional (global) CSP search makes a sequence of decisions that instanti-
ates the variables in a problem one at a time with values from their respective
domains. After each value assignment, some form of inference detects values in
the domains of future variables (those not yet instantiated) that are incompati-
ble with the current instantiation. The work reported here uses the MAC-3 infer-
ence algorithm to maintain arc consistency during search [40]. MAC-3 temporar-
ily removes currently unsupportable values to calculate dynamic domains that re-
flect the current instantiation. If, after inference, every value in some future vari-
able’s domain is inconsistent (violates some constraint), a wipeout has occurred
and the current partial instantiation cannot be extended to a solution. At that point,
some retraction method is applied. Here we use chronological backtracking, which
prunes the subtree (digression) rooted at an inconsistent node (assignment of val-
ues to some subset of the variables) and withdraws the most recent value assign-
ment(s).
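To make the inference step concrete, here is a minimal sketch of the revise operation at the heart of AC-3-style propagation, which temporarily removes values that have lost all support in a neighbor's dynamic domain. The forbidden-pair constraint representation follows the generator sketch above and is an assumption, not the chapter's implementation.

```python
def consistent(constraints, x, a, y, b):
    """True if x = a together with y = b violates no binary constraint."""
    if (x, y) in constraints:
        return (a, b) not in constraints[(x, y)]
    if (y, x) in constraints:
        return (b, a) not in constraints[(y, x)]
    return True  # no constraint between x and y

def revise(domains, constraints, x, y):
    """AC-3-style revise: drop values of x unsupported by y's dynamic domain.

    Returns True if x's domain changed; an emptied domain signals a wipeout.
    """
    removed = {a for a in domains[x]
               if not any(consistent(constraints, x, a, y, b) for b in domains[y])}
    domains[x] -= removed
    return bool(removed)
```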
The efficacy of a constraint solver is gauged by its ability to solve a problem,
along with the computational resources (CPU time and search tree size in nodes)
required to do so. CSP solving is NP-complete; for every known algorithm, the worst-case cost is exponential in the number of variables n. Often, however,
a CSP can be solved with a cost much smaller than that of the worst case. A CSP
search algorithm specifies heuristics for variable (and possibly value) selection, an
inference method, and a backtracking method. It is also possible to restart search on
a problem, beginning over with no assignments and choosing a new first variable-
value pair for assignment.
In global search, there are two kinds of search decisions: select a variable or
select a value for a variable. Constraint researchers have devised a broad range of
variable-ordering and value-ordering heuristics to speed search. Each heuristic re-
lies on its own metric, a measure that the heuristic either maximizes or minimizes
when it makes a decision. Min domain and max degree are classic examples of these
heuristics. (A full list of the metrics for the heuristics used in these experiments ap-
pears in the Appendix.) A metric may rely upon dynamic and/or learned knowledge.
Each such heuristic may be seen as expressing a preference for choices based on the
scores returned by its metric. As demonstrated in Section 5.5.1, however, no single
heuristic is “best” on all CSP classes. Our research therefore seeks a combination of
heuristics.
In the experiments reported here, resources were controlled with a node limit that
imposed an upper bound on the number of assignments of a value to a variable dur-
ing search on a given problem. Unless otherwise noted, the node limit per problem
was 50,000 for <50, 10, 0.38, 0.2>; 20,000 for <20, 30, 0.444, 0.5>; 10,000 for
<50, 10, 0.18, 0.37>; 500 for <30, 8, 0.26, 0.34>; and 5,000 for Comp and Geo
problems. Performance was declared inadequate if at least ten out of 50 problems
went unsolved under a specified resource limit.
A good mixture of heuristics can outperform even the best individual heuris-
tic, as Table 5.1 demonstrates. The first line shows the best performance achieved
by any traditional single heuristic we tested. The second line illustrates the per-
formance of a random selection of heuristics, without any learning. (These ex-
periments are non-deterministic and therefore averaged over a set of ten runs.)
On one class, the sets of heuristics proved inadequate on every run, and on the
other class, five runs were inadequate and the other five dramatically underper-
formed every other approach. The third line shows that a good pair of heuristics,
one for variable ordering and the other for value ordering, can perform signifi-
cantly better than an individual heuristic. The last line of Table 5.1 demonstrates
that a customized combination of more than two heuristics, discovered with the
methods described here, can further improve performance. Of course, the use of
more than one heuristic may increase solution time, particularly on easier prob-
lems where a single heuristic may suffice. On harder problems, however, increased
decision time is justified by the ability to solve more problems. This chapter ad-
dresses work on the automatic identification of such particularly effective mix-
tures.
Table 5.1: Search tree size under individual heuristics and under mixtures of heuris-
tics on two classes of problems. Each class has its own particular combination of
more than two heuristics that performs better
Best individual heuristic tested 3,403.42 100% 10.70 17,399.06 84% 79.02
Best pair of variable-ordering and value-ordering heuristics identified 1,988.10 100% 17.73 10,889.00 96% 76.16
heuristic there. The solver applies the default heuristic initially but, after restart,
replaces the default with a randomly selected heuristic chosen from those that the
SVM preferred.
A slightly more complex approach permits a different individual heuristic to
make the decisions at each step, again based on experience solving other problems
not necessarily in the same class. A hyperheuristic decides which heuristic to apply
at each decision step during search [7]. A hyperheuristic is a set of problem states,
each labeled by a condition-action rule of the form “if the state has these properties,
then apply this heuristic.” A genetic algorithm evolves a hyperheuristic for a given
set of states [45]. To solve a CSP, the best among the evolved hyperheuristics is
chosen, and then the heuristic associated with the problem state most similar to the
current partial instantiation is applied.
Alternatively, a solver can be built for a single CSP from a set of known search
algorithms that take turns searching. For example, REBA (Reduced Exceptional
Behavior Algorithm) applies more complex algorithms only to harder problems
[5]. It begins search with a simple algorithm, and when there is no indication of
progress, switches to a more complex algorithm. If necessary, this process can con-
tinue through a prespecified sequence of complex algorithms. The complexity rank-
ing can be tailored to a given class of CSPs, and is usually based on the median cost
of solution and an algorithm’s sensitivity to exceptionally hard problems from the
class. In general the most complex of these algorithms have better worst-case perfor-
mance but a higher average cost when applied to classes with many easy problems
that could be quickly solved by simpler algorithms. Another approach that alter-
nates among solvers is CPHydra. It maintains a database of cases on not necessarily
similar CSPs, indexed both by static problem features and by modeling selections
[30]. Each case includes the time each solver available to CPHydra took to solve
the problem. CPHydra retrieves the most similar cases and uses them to generate a
schedule that interleaves the fastest solvers on those cases.
It is also possible to race algorithms against one another to solve a single
problem. An algorithm portfolio selects a subset from among its available al-
gorithms according to some schedule. Each of these algorithms is run in paral-
lel to the others (or interleaved on a single processor with the same priority),
until the fastest one solves the problem [21]. With a dynamic algorithm port-
folio, schedule selection changes the proportion of CPU time allocated to each
heuristic during search. Dynamic algorithm portfolios favor more promising algo-
rithms [16] or improve average-case running time relative to the fastest individual
solver [44]. A particularly successful example is SATzilla-07. It builds a portfolio
for each SAT (propositional satisfiability) problem instance online as it searches
[49]. Based on features of the instance and each algorithm’s past performance,
SATzilla-07 uses linear regression to build a computationally inexpensive model
of empirical hardness that predicts each algorithm’s runtime on a given SAT prob-
lem.
Other autonomous learners seek to discover their own heuristics, given inference
and backtracking mechanisms. One approach is to discover a combination of exist-
ing algorithms appropriate for a class of problems. For example, Multi-TAC learns
an ordered list of variable-ordering heuristics for a given class of CSPs [28]. Begin-
ning with an initially empty list of heuristics as the parent, Multi-TAC attaches each remaining heuristic to the parent to create a different child. The utility of a child is the number of instances
it solves within a given time limit, with total time as a tiebreaker. The child with the
highest utility becomes the new parent. This process repeats recursively until it does
not produce any child that improves upon its parent. The resulting list is consulted
in order. Whenever one heuristic cannot discriminate among the variables, those it
ranks at the top are forwarded to the next heuristic on the list. Another way to link
a class of problems to a combination of algorithms is to construct a class descrip-
tion based on the algorithms’ performance. For example, one effort constructs and
stores portfolios for quantified Boolean formulae, which generalize SAT problems [41].
Local search is an alternative paradigm to global search. Local search only con-
siders full instantiations, and moves from one full instantiation to another so as to change the value of some metric. Heuristics specify the metric and how to select among instanti-
ations that qualify. Multiple search heuristics have been integrated with value-biased
stochastic sampling [9]. On a given problem instance, multiple restarts sample the
performance of different base heuristics. Then the program applies extreme value
theory to construct solution quality distributions for each heuristic, and uses this
information to bias the selection of a heuristic on subsequent iterations. Extensive
research has also been conducted on optimization problems to learn which low-level
local search heuristics should be applied in a region of the solution space [8]. One
approach combined local search heuristics for SAT problems [29]. Each step chose
a constraint to be adjusted based on some measure of its inconsistency in the current
instantiation. Then a heuristic was selected probabilistically, based on its expected
utility value (ability to ameliorate the violation of that constraint). All utility values
were initially equal, and then positively or negatively reinforced based on the dif-
ference between the current total cost and the total cost the last time that constraint
was selected.
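The probabilistic selection step in that approach can be illustrated with a generic utility-proportional draw; this sketch is not the specific scheme of [29].

```python
import random

def pick_heuristic(utilities, rng=random):
    """Select a heuristic with probability proportional to its utility.

    utilities: {heuristic_name: non-negative expected utility}.
    """
    names = list(utilities)
    weights = [utilities[name] for name in names]
    if sum(weights) == 0:
        return rng.choice(names)  # all utilities zero: choose uniformly
    return rng.choices(names, weights=weights, k=1)[0]
```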
The building blocks for a new algorithm need not be standalone algorithms them-
selves. For example, CLASS discovers local search variable-ordering heuristics for
a class of SAT problems with a genetic algorithm on prespecified heuristic primi-
tives [15]. Its primitives describe the currently satisfied clauses and search experi-
ence. CLASS uses them to construct LISP-like s-expressions that represent versions
of standard local search variable-selection heuristics. Its initial population is a set of
randomly generated expressions. Each population is scored by its performance on a
set of problems. Then, to construct the next generation, instead of using traditional
crossover and mutation, CLASS creates ten new children from its primitives and
makes the ten lowest-scoring expressions less likely to survive. The best heuristic
found during the course of the search is returned.
Learning is necessary for CSP solutions because even well-trusted individual heuris-
tics vary dramatically in their performance on different classes. Consider, for exam-
ple, the performance of five popular variable-ordering heuristics on the three classes
in Table 5.2. The Min domain/dynamic degree heuristic is the most successful on
<20, 30, 0.444, 0.5> problems, but it is inadequate on Comp problems.
Table 5.3: Average number of nodes explored by three traditional heuristics (in ital-
ics) and their duals on Comp problems (described in Section 5.3). Note the better
performance of two of the duals here
From a successful search, the solver extracts both positive and negative training
examples. Each training example is the current instantiation and the decision made
there. Here, too, difficulties arise. The search trace is not necessarily the best way
to solve that CSP. There may have been a better (smaller search tree or less elapsed
time) way to solve it. Thus the training examples selected may not be those an
oracle would have provided. Nonetheless, decisions not subsequently retracted on
the path to the solution can be considered better than those that were retracted, so
we take them as positive examples. We also know that the roots of retracted subtrees (digressions) were errors, and therefore take them as negative examples.
CSP search performance is traditionally gauged by the size of the search tree ex-
panded (number of partial instantiations) and elapsed CPU time. Here, a solver executes a run with two phases: a learning phase during which it attempts to solve a
sequence of CSPs from a given class, and a testing phase during which it attempts to
solve a sequence of hitherto unseen problems drawn from the same class. Because
the class is of uneven difficulty, the problems in the learning phase may not be in-
dicative of the class as a whole. Thus we average the solver’s performance over an
experiment, here a set of ten runs.
Ideally, a solver should also consider its performance more broadly. It should be
aware of its general progress on the class. If it believes it can learn no more, it should
terminate learning itself and proceed to testing. And if it believes that learning is not
going well, it should elect to discard what it has learned and start over. The solver
constructed to meet these expectations is called ACE.
5.6 ACE
When ACE (the Adaptive Constraint Engine) learns to solve a class of binary CSPs,
it customizes a weighted mixture of heuristics for the class [13]. ACE is based on
FORR, an architecture for the development of expertise from multiple heuristics
[12]. ACE’s search algorithm (in Figure 5.4) alternately selects a variable and then
selects a value for it from its domain. The size of the resultant search tree depends
upon the order in which values and variables are selected.
Decision-making procedures in ACE are called Advisors. They are organized into
three tiers, and presented with the current set of choices (variables or values). Tier 1
Advisors are correct and quick, and preordered by the user. If any of them approves a
choice, it is executed. (For example, Victory recommends any value from the domain
of the final unassigned variable. Since inference has already removed inconsistent
values, any remaining value produces a solution.) Disapproval from any tier 1 Advi-
sor eliminates some subset of choices; the remaining choices are passed to the next
Advisor. (The set of choices is not permitted to go empty.) Tier 2 Advisors address
subgoals; they are outside the scope of this chapter and not used in the experiments
reported here.
The work described here focuses on the Advisors in tier 3. Each tier 3 Advisor
comments upon (produces a strength for) some of its favored choices, those whose
metric scores are among the f most favored. Because a metric can return identical
values for different choices, an Advisor usually makes many more than f comments.
(Here f = 5, unless otherwise stated.) The strength s(A, c) is the degree of support
from Advisor A for choice c. Each tier 3 Advisor’s view is based on a descriptive
metric. All tier 3 Advisors are consulted together. As in Figure 5.4, a decision in
tier 3 is made by weighted voting, where the strength s(A, c) given to choice c by
Advisor A is multiplied by the weight w(A) of Advisor A. All weights are initialized
to 0.05, and then learned for a class of problems by the processes described below.
The discount factor q(A) in (0,1] modulates the influence of Advisor A until it has
commented often enough during learning. As data is observed on A, q(A) moves
$$v = \arg\max_{c_{var} \in V} \sum_{A \in A_{var}} q(A) \cdot w(A) \cdot s(A, c_{var})$$
$$d = \arg\max_{c_{val} \in D_v} \sum_{A \in A_{val}} q(A) \cdot w(A) \cdot s(A, c_{val})$$
Fig. 5.4: Search in ACE with a weighted mixture of variable-ordering Advisors from
Avar , and value-ordering Advisors from Aval . q(A) is the discount factor. w(A) is the
weight of Advisor A. s(A, c) is the strength of Advisor A for choice c
toward 1, effectively increasing the impact of A on a given class as its learned weight
becomes more trustworthy.
Weighted voting selects the choice with the greatest sum of weighted strengths
from all Advisors. (Ties are broken randomly.) Each tier 3 Advisor’s heuristic view
is based on a descriptive metric. For each metric, there is a dual pair of Advisors, one
that favors smaller values for the metric and one that favors larger values. Typically,
only one of the pair has been reported in the literature as a heuristic. Weights are
learned from problem-solving experience.
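A minimal sketch of the weighted vote in Figure 5.4: every consulted tier 3 Advisor contributes its discounted, weighted strength for each choice it comments on, and the choice with the largest total wins, with ties broken randomly. The Advisor attribute names (q, w, comments) are assumptions for illustration.

```python
import random
from collections import defaultdict

def weighted_vote(advisors, choices):
    """Pick the choice with the largest sum of q(A) * w(A) * s(A, c).

    Each advisor is assumed to expose .q (discount factor), .w (learned weight),
    and .comments(choices) -> {choice: strength} over its favored choices.
    """
    totals = defaultdict(float)
    for a in advisors:
        for c, strength in a.comments(choices).items():
            totals[c] += a.q * a.w * strength
    if not totals:                       # no comments at all: fall back to random
        return random.choice(list(choices))
    best = max(totals.values())
    return random.choice([c for c, v in totals.items() if v == best])
```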
Learn Weights
Initialize all weights to 0.05
Until termination of the learning phase
Identify learning problem p
Search ( p, Avar , Aval )
If p is solved
then for each training instance t from p
for each Advisor A that supports t
when t is a positive training instance, increase w(A) *credit*
when t is a negative training instance, decrease w(A) *penalize*
else when full restart criteria are satisfied
initialize all weights to 0.05
Fig. 5.5: Learning weights for Advisors. The Search algorithm is defined in Fig-
ure 5.4
receives more credit than a less discriminating Advisor, and the penalty for a wrong
choice from among a few is harsher than for a wrong choice from among many.
In addition to an input set of Advisors, ACE has one benchmark for variable
ordering and another for value ordering. Each benchmark Advisor models random
advice; it makes random comments with random strengths. Although the bench-
marks’ comments never participate in decision making, the benchmarks themselves
earn weights. Each benchmark's weight serves as a filter for its associated Advisors;
an Advisor must have a learned weight higher than its benchmark’s (that is, provide
better than random advice) to be constructive.
This section describes four techniques that use both search performance and prob-
lem difficulty to adapt learning: full restart, random subsets of heuristics, consid-
eration of decision difficulty, and the nuances of preferences. Section 5.8 provides
empirical demonstrations of their efficacy.
In a large initial list of heuristics that contains minimizing and maximizing versions of many metrics, some heuristics perform poorly on a particular class of problems (class-inappropriate heuristics) while others perform well (class-appropriate heuristics).
In some cases, class-inappropriate heuristics occasionally acquire high weights on
an initial problem and then control subsequent decisions. As a result, subsequent
problems may have extremely large search trees.
Given unlimited resources, DWL will recover from class-inappropriate heuristics
with high weights, because they typically generate large search trees and large di-
gressions. In response, DWL will impose large penalties and provide small credits to
the variable-ordering Advisors that lead decisions. With their significantly reduced
weights, class-inappropriate Advisors will no longer dominate the class-appropriate
Advisors. Nonetheless, solving a hard problem without good heuristics is compu-
tationally expensive. If adequate resources are unavailable under a given node limit
and a problem goes unsolved, no weight changes occur at all.
Under full restart, however, ACE monitors the frequency and the order of un-
solved problems in the problem sequence. If it deems the current learning attempt
not promising, ACE abandons the learning process (and any learned weights) and
begins learning on new problems with freshly initialized weights [34]. Note that full
restart is different from restart on an individual problem, discussed in Section 5.3,
which diversifies the search for a solution to that problem alone [20]. We focus here
only on the impact of full restart of the entire learning process.
The node limit is a critical parameter for full restart. Because ACE abandons a
problem if it does not find a solution within the node limit, the node limit is the
criterion for unsuccessful search. Since the full restart threshold directly depends
upon the number of failures, the node limit is the performance standard for full
restart. The node limit also controls resources; lengthy searches permitted under
high node limits are expensive.
Resource limits and full restart impact the cost of learning in complex ways. With
higher node limits, weights can eventually recover without the use of full restart,
but recovery is more expensive. With lower node limits, the cost of learning (total
number of nodes across all learning problems) with full restart is slightly higher
than without it. The learner fails on all the difficult problems, and even on some
of medium difficulty, repeatedly triggering full restart until the weight profile (the
set of tier 3 weights) is good enough to solve almost all the problems. Full restart
abandons some problems and uses additional problems, which increases the cost
of learning. The difference in cost is small, however, since each problem’s cost
is subject to a relatively low node limit. As the node limit increases, full restart
produces fewer inadequate runs, but at a higher cost. It takes longer to trigger full
restart because the learned weight profile is deemed good enough and failures are
less frequent. Moreover, with a high node limit, every failure is expensive. When
full restart eventually triggers, the prospect of relatively extensive effort on further
problems is gone. Because it detects and eliminates unpromising learning runs early,
full restart avoids many costly searches and drastically reduces overall learning cost.
Experimental results and further discussion of full restart appear in Section 5.8.1.
The interaction among heuristics can also serve as a filter during learning. Given a
large and inconsistent initial set of heuristics, many class-inappropriate ones may
combine to make bad decisions, and thereby make it difficult to solve any problem
within a given node limit. Because only solved problems provide training instances
for weight learning, no learning can take place until some problem is solved. Rather
than consult all its Advisors at once, ACE can randomly select a new subset of
Advisors for each problem, consult them, make decisions based on their comments,
and update only their weights [37]. This method, learning with random subsets,
eventually uses a subset in which class-appropriate heuristics predominate and agree
on choices that solve a problem.
It is possible to preprocess individual heuristics on representatives of a prob-
lem class (possibly by racing [4]), and then eliminate from the candidate heuris-
tics those with poor individual performance. That approach, however, requires mul-
tiple solution attempts on some set of problems. Multiple individual runs would
consume more computational resources, because many different Advisors reference
(and share the values of) the same metrics in our approach. Moreover, elimination
of poorly performing heuristics after preprocessing might prevent the discovery of
and the expected number of trials until the subset has a majority of class-appropriate
Advisors is
$$\sum_{i=1}^{\infty} i \, (1 - p)^{i-1} \, p = \frac{1}{p}. \qquad (5.2)$$
When there are more class-inappropriate Advisors (a < b), a smaller set is more
likely to have a majority of class-appropriate Advisors. For example, if a = 6, b =
9, and r = 4, equation (5.1) evaluates to 0.14 and equation (5.2) to 7. For a = 6, b = 9,
and r = 10, however, the probability of a randomly selected subset with a majority
of class-appropriate heuristics is only 0.04 and the expected number of trials until
the subset has a majority of class-appropriate Advisors is 23.8.
Weight convergence is linked to subset size. Weights converge faster when sub-
sets are larger. When the random subsets are smaller, subsequent random sub-
sets are less likely to overlap with those that preceded them, and therefore less
likely to include Advisors whose weights have been revised. As a result, fail-
ures occur often, even after some class-appropriate heuristics receive high weights.
ACE monitors its learning progress, and can adapt the size of random subsets. In
this scenario, random subsets are initially small, but as learning progresses and
more Advisors participate and obtain weights, the size of the random subsets increases. This makes overlap more likely, and thereby speeds learning. Experimen-
tal results and further discussion of learning with random subsets appear in Sec-
tion 5.8.2.
Correct easy decisions are less significant for learning; it is correct difficult de-
cisions that are noteworthy. Thus it may be constructive to estimate the diffi-
culty of each decision the solver faces as if it were a fresh problem, and ad-
just Advisors’ weights accordingly. Our rationale for this is that, on easy prob-
lems, any decision leads to a solution. Credit for an easy decision effectively in-
creases the weight of Advisors that support it, but if the decision was made during
search, those Advisors probably already had high weights. ACE addresses this is-
sue with two algorithms, each dependent upon a single parameter: RSWL-κ and RSWL-d, two variants of relative support weight learning (RSWL) [35].
Constrainedness, as measured by κ , has traditionally been used to identify hard
classes of CSPs [18]. κ depends upon n, d, m, and t, as defined in Section 5.3:
$$\kappa = \frac{n-1}{2} \cdot d \cdot \log_m \frac{1}{1-t}. \qquad (5.3)$$
For every search algorithm, and for fixed n and m, hard problem classes have κ
close to 1. RSWL-κ uses equation (5.3) to measure the difficulty κP of subproblem P
at each decision point. For a given parameter k, RSWL-κ gives credit to an Advisor
only when it supports a positive training instance derived from a search state where
|κP − 1| < k. RSWL-κ penalizes an Advisor only when it supports a negative training instance derived from a search state where |κP − 1| > k. The calculation of κP on
every training instance is computationally expensive.
RSWL-d uses the number of unassigned variables at the current search node as a
rough, quick estimate of problem hardness. Decisions at the top of the search tree
are known to be more difficult [38]. For a given parameter h, no penalty is given at
all for any decision in the top h percent of the nodes in the search tree, and no credit
is given for any decision below them. Experimental results and further discussion of
learning with decision difficulty appear in Section 5.8.3.
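A minimal sketch of the two difficulty estimates and the credit/penalty gating they impose, following equation (5.3) and the descriptions above. The function and parameter names, and the use of assignment depth as the position estimate for RSWL-d, are assumptions of this sketch.

```python
import math

def kappa(n, d, m, t):
    """Constrainedness of an <n, m, d, t> (sub)problem, equation (5.3)."""
    return (n - 1) / 2.0 * d * math.log(1.0 / (1.0 - t), m)

def rswl_kappa_allows(kappa_p, positive, k):
    """RSWL-kappa: credit only near kappa = 1, penalty only far from it."""
    near_hard = abs(kappa_p - 1.0) < k
    return near_hard if positive else not near_hard

def rswl_d_allows(depth, num_vars, positive, h):
    """RSWL-d: credit only in the top h percent of the tree, penalty only below.

    Depth (number of assigned variables) stands in for tree position here;
    that proxy is an assumption of this sketch.
    """
    in_top = depth <= h / 100.0 * num_vars
    return in_top if positive else not in_top
```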
Table 5.4: The impact of different preference expression methods on a single metric
Variables: X, Y1, Y2, ..., Y10, Z
variable X is reflected in the distance between its strength and those given to the
other variables.
The Borda methods were inspired by an election method devised by Jean-Charles
de Borda in the late eighteenth century [39]. Borda methods consider the total num-
ber of available choices, the number of choices with a smaller score and the number
of choices with an equal score. Thus the strength for a choice is based on its position
relative to the other choices.
The first Borda method, Borda-w, awards a point for each win (metric score
higher than the score of some other commented choice). Examples for Borda-w
strengths are also shown in Table 5.4. The set of lowest-scoring variables (here, only
Z) always has strength 0. Because every Y variable outscored only Z, the strength
of any Y variable is 1. The highest-scoring choice X outscored 11 choices, so X’s
strength is 11.
The second Borda method, Borda-wt, awards one point for each win and one
point for each tie (score equal to the score of some other choice). It can be interpreted
as emphasizing losses. The highest-scoring set of variables (here, only X) always has
strength equal to the number of variables to be scored. For example, in Table 5.4, no
choice outscored the highest-scoring choice X, so its strength is 12, one choice (X)
outscored the Y variables, so their strengths are reduced by one point (12 – 1 = 11),
and 11 choices outscored Z, resulting in strength 1.
The difference between the two Borda methods is evident when many choices
share the same score. Borda-w considers only how many choices score lower, so that
a large subset results in a big gap in strength between that subset and the previous
(more preferred) one. Under Borda-wt, a large subset results in a big gap in strength
between that subset and the next (less preferred) one. In Table 5.4, for example, the
10 Y variables share the same score. Under Borda-w, the difference between the
strength of every Y variable and X is 10, while the difference between the strength
of any Y variable and Z is only 1. Under Borda-wt, however, the difference between
the strength of any Y and X is only 1, while the difference between the strength of
any Y and Z is 10. The Borda approach can be further emphasized by making a point
inversely proportional to the number of subsets of tied values. Experimental results
and further discussion of learning with preferences using this emphasis appear in
Section 5.8.4.
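Both Borda variants reduce to counting how many choices score strictly lower or strictly higher. The sketch below reproduces the strengths discussed for Table 5.4 (X, ten tied Y variables, and Z); the numeric scores themselves are placeholders.

```python
def borda_w(scores):
    """Borda-w: strength = number of choices with a strictly lower score."""
    vals = list(scores.values())
    return {c: sum(v < s for v in vals) for c, s in scores.items()}

def borda_wt(scores):
    """Borda-wt: wins plus ties, i.e., total choices minus the number of
    choices that strictly outscore this one."""
    vals = list(scores.values())
    return {c: len(vals) - sum(v > s for v in vals) for c, s in scores.items()}

# Table 5.4's pattern: X outscores ten tied Y variables, which all outscore Z.
scores = {"X": 3, "Z": 1}
scores.update({f"Y{i}": 2 for i in range(1, 11)})
bw, bwt = borda_w(scores), borda_wt(scores)
print(bw["X"], bw["Y1"], bw["Z"])     # 11 1 0
print(bwt["X"], bwt["Y1"], bwt["Z"])  # 12 11 1
```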
5.8 Results
The methods in the previous section are investigated here with ACE. As described
in Section 5.5.3, each run under ACE is a learning phase followed by a testing
phase. During a learning phase ACE refines its Advisor weights; the testing phase
provides a sequence of fresh problems with learning turned off. All of the following
experiments average results across ten runs, and differences cited are statistically
significant at the 95% confidence level.
In the experiments that follow, learning terminated after 30 problems, counting
from the first solved problem, or terminated if no problem in the first 30 was solved
at all. Under full restart, more learning problems can be used, but the upper bound
for the total number of problems in a learning phase was always 80. For each prob-
lem class, every testing phase used the same 50 problems. When any ten of the 50
testing problems went unsolved within the node limit, learning in that run was de-
clared inadequate and further testing was halted. In every learning phase, ACE had
access to 40 tier 3 Advisors, 28 for variable selection and 12 for value selection
(described in the Appendix). During a testing phase, ACE used only those Advisors
whose learned weights exceeded those of their respective benchmarks.
The benefits of full restart are illustrated here on <30, 8, 0.26, 0.34> problems
with DWL, where the node limit during learning is treated as a parameter and the
node limit during testing is 10,000 nodes. A run was declared successful if testing
was not halted due to repeated failures. The learning cost is the total number of
nodes during the learning phase of a run, calculated as the product of the average
number of nodes per problem and the average number of problems per run. The
restart strategy is defined by a full restart threshold (k, l), which performs a full
restart after failure on k problems out of the last l. (Here, k = 3 and l = 4.) This
seeks to avoid full restarts when multiple but sporadic failures are actually due to
uneven problem difficulty rather than to an inadequate weight profile. Problems that
went unsolved under initial weights before any learning occurred (early failures)
were not counted toward full restart. If the first 30 problems went unsolved under
the initial weights, learning was terminated and the run judged unsuccessful. The
learner’s performance here is measured by the number of successful runs (out of
ten) and the learning cost across a range of node limits.
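A minimal sketch of the (k, l) full-restart trigger just described: record the outcome of each learning problem, ignore early failures that precede the first solved problem, and restart when k of the last l counted attempts have failed. The class and method names are illustrative.

```python
from collections import deque

class FullRestartMonitor:
    """Trigger a full restart after failure on k of the last l learning problems,
    ignoring early failures (those before any problem has been solved)."""

    def __init__(self, k=3, l=4):
        self.k = k
        self.window = deque(maxlen=l)
        self.any_solved = False

    def record(self, solved):
        """Report one learning problem's outcome; returns True to signal a restart."""
        if solved:
            self.any_solved = True
        if not self.any_solved:
            return False                     # early failure: not counted
        self.window.append(solved)
        return sum(1 for s in self.window if not s) >= self.k
```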
Under every node limit tested, full restart produced more runs that were suc-
cessful, as Figure 5.7 illustrates. At lower node limits, Figure 5.8 shows that better
testing performance came with a learning cost similar to or slightly higher than the
cost without full restart. At higher node limits, the learning cost was considerably
lower with full restart. With very low node limits (200 or 300 nodes), even with
full restart DWL could not solve all the problems. During learning, many problems
went unsolved under a low node limit and therefore did not provide training in-
stances. On some (inadequate) runs, no solution was found to any of the first 30
problems, so learning was terminated without any weight changes. When the node
limit was somewhat higher (400 nodes), more problems were solved, more training
instances became available and more runs were successful. These reasonably low
node limits set a high standard for the learner; only weight profiles well tuned to the
class will solve problems within them and thereby provide good training instances.
Further increases in the node limit (500, 600, 700 and 800 nodes), however, did
not further increase the number of successful runs. Under higher node limits, prob-
lems were solved even with weight profiles that were not particularly good for the
class, and may have produced training instances that were not appropriate. Under
extremely high node limits (5,000 nodes), problems were solved even under inad-
equate weight profiles, but the weight-learning mechanism was able to recover a
good weight profile, and again the number of successful runs increased.
Similar performance was observed under RSWL and on Geo and the other model
B classes identified in Section 5.3, but not on the Comp problems. Some Comp
problems go unsolved under any node limit, while many others are solved. Because
failures there are sporadic, they do not trigger full restart. The use of full restart on
them does not improve learning, but it does not harm it either. (Data omitted.) Full
restart is therefore used throughout the remainder of these experiments.
When the full complement of Advisors was present but resource limits were strict,
even with full restart ACE sometimes failed to solve problems in some difficult
Fig. 5.9: Weights of eight variable-ordering Advisors and their common benchmark
during learning after each of 30 problems in <50, 10, 0.38, 0.2> on a single run
classes. Random subsets corrected this. Figure 5.9 illustrates weight convergence
during learning with random subsets. On this run some Advisors (e.g., Min domain/dynamic degree and Min static degree) recovered from initially inadequate weights.
The figure tracks the weights of eight variable-ordering heuristics and their com-
mon benchmark (described in Section 5.6.2) after each of 30 problems. Here the
problems were drawn from <50, 10, 0.38, 0.2>, and 30% of the variable-ordering
Advisors and 30% of the value-ordering Advisors were randomly selected for each
problem. Plateaus in weights correspond to problems where the particular heuris-
tic was not selected for the current random subset, or the problem went unsolved,
so that no learning or weight changes occurred. The first four problems were early
failures (problems that went unsolved under initial weights, before any learning
occurred). When the fifth problem was solved, some class-inappropriate Advisors
received high weights from its training instances. On the next few problems, either
heavily weighted but class-inappropriate heuristics were reselected and the prob-
lem went unsolved and no weights changed, or some class-appropriate Advisors
were selected and gained high weights. Eventually the latter began to dominate de-
cisions, so that the disagreeing class-inappropriate Advisors had their weights re-
duced. After the 21st problem, when the weight of Min static connected edges had
significantly decreased, the weights clearly separated the class-appropriate Advisors
from the class-inappropriate ones. Afterwards, as learning progressed, the weights
stabilized.
Experiments that illustrate the benefits of random subsets tested four ways to choose the Advisors for each problem (sketched in code after this list):
1. All used all the Advisors on every problem.
2. Fixed chose a fixed percentage q (30% or 70%), and then chose q percent of the variable-ordering Advisors and q percent of the value-ordering Advisors, without replacement.
3. Varying chose a random value r in [30, 70] for each problem, and then chose
r percent of the variable-ordering Advisors and r percent of the value-ordering
Advisors, without replacement.
4. Incremental initially selected q percent of the variable-ordering Advisors and q percent of the value-ordering Advisors. Then, for each subsequent problem, it increased the
sizes of the random subsets in proportion to the number of Advisors whose
weight was greater than their initially assigned weight.
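The four regimes can be sketched as follows; the subsets would be drawn separately for the variable-ordering and value-ordering Advisors, and the Incremental growth rule, which the text specifies only loosely, is an assumption here.

```python
import random

def fixed_subset(advisors, q, rng=random):
    """Fixed: choose q percent of the Advisors, without replacement."""
    return rng.sample(advisors, round(q / 100 * len(advisors)))

def varying_subset(advisors, rng=random):
    """Varying: draw a fresh percentage r in [30, 70] for each problem."""
    r = rng.uniform(30, 70)
    return rng.sample(advisors, round(r / 100 * len(advisors)))

def incremental_subset(advisors, weights, initial_weight, q, rng=random):
    """Incremental: start from q percent and grow the subset in proportion to
    the number of Advisors whose weight now exceeds its initial value.
    The exact growth rule is an assumption of this sketch."""
    improved = sum(1 for a in advisors if weights[a] > initial_weight)
    size = min(len(advisors), round(q / 100 * len(advisors)) + improved)
    return rng.sample(advisors, size)
```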
On problems in <50, 10, 0.18, 0.37>, Table 5.5 compares these approaches for
learning with random subsets of Advisors to learning with all the Advisors at once.
When all 40 Advisors were consulted, the predominance of class-inappropriate Ad-
visors sometimes prevented the solution of any problem under the given node limit,
so that some learning phases were terminated after 30 unsolved problems. In those
runs no learning occurred. With random subsets, however, adequate weights were
learned on every run, and there were fewer early failures.
Table 5.5: Early failures, successful runs, decision time and percentage of computa-
tion time during learning with random subsets of Advisors, compared to computa-
tion time with all the Advisors on problems in <50, 10, 0.18, 0.37>
Table 5.5 also demonstrates that using random subsets significantly reduces
learning time. The time to select a variable or a value is not necessarily directly
proportional to the number of selected Advisors. This is primarily because dual
pairs of Advisors share the same fundamental computational cost: calculating their
common metric. For example, the bulk of the work for Min product domain value
lies in the one-step lookahead that calculates (and stores) the products of the domain
sizes of the neighbors after each potential value assignment. Consulting only Min
product domain value and not Max product domain value will therefore not signifi-
cantly reduce computation time. Moreover, the metrics for some Advisors are based
upon metrics already calculated for others that are not their duals. The reduction in
total computation time per run also reflects any reduction in the number of learning
problems.
The robustness of learning with random subsets is demonstrated with experi-
ments documented in Table 5.6 that begin with fewer Advisors, a majority of which
are class-inappropriate. Based on weights from successful runs with all Advisors,
Advisors were first identified as class-appropriate or class-inappropriate for prob-
lems in <50, 10, 0.18, 0.37>. ACE was then provided with two different sets
Avar of variable-ordering Advisors in which class-inappropriate Advisors outnum-
bered class-appropriate ones (nine to six or nine to four). When all the provided
Advisors were consulted, the predominance of class-inappropriate Advisors effec-
tively prevented the solution of any problem under the given node limit and no
learning took place. When learning with random subsets, as the size of the ran-
dom subsets decreased, the number of successful runs increased. As a result, random subsets with fixed q = 30% are used throughout the remainder of these experiments.
We tested both linear interpolation and the Borda methods on the CSP classes iden-
tified in Section 5.3. On the unstructured model B problems, preference expression
made no significant difference. On Comp, however, across a broad range of node
limits, preference expression had an effect, as shown in Table 5.8.
Initially, most variables in the central component score similarly on most metrics,
and most variables in the satellite score similarly to one another but differently from
those in the central component. Under Borda-wt, if only a few choices score higher,
the strength of the choices from the next lower-scoring subset is close enough to in-
fluence the decision. If there are many high-scoring choices, in the enhanced version
the next lower subset will have a much lower strength, which decreases its influence.
Moreover, when many choices share the same score, they are penalized for failure to
discriminate, and their strength is lowered. When Borda-w assigns lower strengths
Table 5.7: Learning and testing performance with different weight learning methods
on <50, 10, 0.38, 0.2> problems. Bold figures indicate statistically significant re-
ductions, at the 95% confidence level, in the number of nodes and in the percentage
of solved problems compared to search under DWL
(Columns: weight-learning algorithm; learning phase: problems, unsolved problems, full restarts; testing phase: nodes, solved.)
Table 5.8: Testing performance with RSWL on Comp problems with reduced node
limits and a variety of preference expression methods
to large subsets from the central component, it makes them less attractive. That encourages search to select variables in the satellites first; this is often the right way to solve such problems.
Linear interpolation performed similarly to RSWL with ranking on Comp prob-
lems, except under the lowest node limit tested. Given only 35 nodes, RSWL with
linear preference expression was able to solve every problem during testing. The 35-
node limit imposes a very high learning standard; it allows learning only from small
searches. (A backtrack-free solution would expand exactly 30 nodes for a Comp
problem.) Only with the nuances of information provided by linear preference ex-
pression did ACE develop a weight profile that solved all the testing problems in
every run.
Our fundamental underlying assumption is that the right way to make search deci-
sions in a class of CSPs has some uniformity, that is, that it is possible to learn from
one problem how to solve another. ACE tries to solve a CSP from a class, and, if it
finds a solution, it extracts training examples from the search trace. If it fails to solve
the CSP, however, given the size of the search space, the solver will learn nothing
from the effort it expended.
Given a training example, the learning algorithms described here reinforce heuris-
tics that prove successful on a set of problems and discard those that do not. Our pro-
gram represents its learned knowledge about how to solve problems as a weighted
sum of the output from some subset of its heuristics. Thus the learner’s task is both
to choose the best heuristics and to weight them appropriately.
ACE is a successful, adaptive solver. It learns to select a weighted mixture of
heuristics for a given problem class, one that produces search trees smaller than
those from outstanding individual heuristics in the CSP literature. ACE learns from
its own search performance, based upon the accuracy, intensity, frequency and dis-
tribution of its heuristics’ preferences. ACE adapts its decision making, its rein-
forcement policy, and its heuristic selection mechanisms effectively.
Our current work extends these ideas on several fronts. Under an option called
Pusher, ACE consults the single highest-weighted tier 3 variable-ordering heuristic
below the maximum search depth at which it has experienced backtracking on other
problems in the same class [13]. Current work includes learning different weight
profiles for different stages in solving a problem, where stages are determined by
search tree depth or the constrainedness of the subproblem at the decision point.
A generalization of that approach would associate weight profile(s) with an en-
tire benchmark family of problems, and begin with the weights of the most similar
benchmark family for each new problem instance.
Rather than rely on an endless set of fresh problems, we plan to reuse un-
solved problems and implement boosting with little additional effort during learn-
ing [42]. A major focus is the automated selection of good parameter settings for
an individual class (including the node limit and full-restart parameters), given
the results in [24]. We also intend to extend our research to classes containing
both solvable and unsolvable problems, and to optimization problems. Finally,
we plan to study this approach further in light of the factor analysis evidence of
strong correlations between CSP ordering heuristics [47]. Meanwhile, ACE profi-
ciently tailors a mixture of search heuristics for each new problem class it encoun-
ters.
Appendix
Two vertices with an edge between them are neighbors. Here, the degree of an edge
is the sum of the degrees of its endpoints, and the edge degree of a variable is the
sum of the edge degrees of the edges on which it is incident. Each of the following
metrics produces two Advisors.
Metrics for variable selection were static degree, dynamic domain size, FF2
[43], dynamic degree, number of valued neighbors, ratio of dynamic domain size
to dynamic degree, ratio of dynamic domain size to degree, number of acceptable
constraint pairs, static and dynamic edge degree with preference for the higher or
lower degree endpoint, weighted degree [6], and ratio of dynamic domain size to
weighted degree.
Metrics for value selection were number of value pairs for the selected variable
that include this value, and, for each potential value assignment: minimum resulting
domain size among neighbors, number of value pairs from neighbors to their neigh-
bors, number of values among neighbors of neighbors, neighbors’ domain size, a
weighted function of neighbors’ domain size, and the product of the neighbors’ do-
main sizes.
Acknowledgements ACE is a joint project with Eugene Freuder and Richard Wallace of The Cork
Constraint Computation Centre. Dr. Wallace made important contributions to the DWL algorithm.
This work was supported in part by the National Science Foundation under IIS-0328743, IIS-
0739122, and IIS-0811437.
References
[1] Aardal, K. I., Hoesel, S. P. M. v., Koster, A. M. C. A., Mannino, C., Sas-
sano, A.: Models and solution techniques for frequency assignment problems.
A Quarterly Journal of Operations Research, 1(4), pp. 261–317 (2003)
[2] Ali, K., Pazzani, M.: Error reduction through learning multiple descriptions.
Machine Learning, 24, pp. 173–202 (1996)
[3] Arbelaez, A., Hamadi, Y., Sebag, M.: Online heuristic selection in constraint
programming. International Symposium on Combinatorial Search (SoCS-
2009), Lake Arrowhead, CA (2009)
[4] Birattari, M., Stützle, T., Paquete, L., Varrentrapp, K.: A racing algorithm for configuring metaheuristics. Genetic and Evolutionary Computation Conference (GECCO-2002), New York, NY, pp. 11–18 (2002)
[5] Borrett, J. E., Tsang, E. P. K., Walsh, N. R.: Adaptive constraint satisfaction:
The quickest first principle. 12th European Conference on AI (ECAI-1996),
Budapest, Hungary, pp. 160–164 (1996)
[6] Boussemart, F., Hemery, F., Lecoutre, C., Sais, L.: Boosting systematic search
by weighting constraints. 16th European Conference on AI (ECAI-2004), Va-
lencia, Spain, pp. 146–150 (2004)
[7] Burke, E., Hart, E., Kendall, G., Newall, J., Ross, P., Schulenburg, S.: Hyper-
heuristics: an emerging direction in modern search technology. Handbook of
Metaheuristics, Dordrecht: Kluwer Academic Publishers, pp. 457–474 (2003)
[8] Chakhlevitch, K., Cowling, P.: Hyperheuristics: Recent developments.
C. Cotta, M. Sevaux, K. Sorensen (Eds.), Adaptive and Multilevel Metaheuris-
tics, Studies in Computational Intelligence, Springer, pp. 3–29 (2008)
[9] Cicirello, V. A., Smith, S. F.: Heuristic selection for stochastic search op-
timization: Modeling solution quality by extreme value theory. Principles
and Practice of Constraint Programming (CP-2004), Toronto, Canada, LNCS
3258, pp. 197–211 (2004)
[10] D’Andrade, R. G.: Some propositions about the relations between culture and
human cognition. J. W. Stigler, R. A. Shweder, G. Herdt (Eds.), Cultural Psy-
chology: Essays on Comparative Human Development, Cambridge University Press, Cambridge, pp. 65–129 (1990)
[11] Dietterich, T. G.: Ensemble methods in machine learning. J. Kittler, F. Roli
(Eds.) First International Workshop on Multiple Classifier Systems, MCS-
2000, Cagliari, Italy, LNCS 1857, pp. 1–15 (2000)
[12] Epstein, S. L.: For the Right Reasons: The FORR architecture for learning in
a skill domain. Cognitive Science, 18, 479–511 (1994)
[13] Epstein, S. L., Freuder, E. C., Wallace, R.: Learning to support constraint
programmers. Computational Intelligence, 21(4), 337–371 (2005)
[14] Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. Thir-
teenth International Conference on Machine Learning (ICML-96), Bari, Italy,
pp. 148–156 (1996)
[15] Fukunaga, A. S.: Automated discovery of composite SAT variable-selection
heuristics. Eighteenth National Conference on Artificial Intelligence (AAAI-
2002), Edmonton, Canada, pp. 641–648 (2002)
[16] Gagliolo, M., Schmidhuber, J.: Dynamic algorithm portfolios. Ninth Interna-
tional Symposium on Artificial Intelligence and Mathematics, Special Issue of
the Annals of Mathematics and Artificial Intelligence, 47(3-4), pp. 295–328
(2006)
[17] Gebruers, C., Hnich, B., Bridge, D., Freuder, E.: Using CBR to select solution
strategies in constraint programming. Case-Based Reasoning Research and
Development. LNCS 3620, pp. 222–236 (2005)
[18] Gent, I. P., Prosser, P., Walsh, T.: The constrainedness of search. The Thir-
teenth National Conference on Artificial Intelligence (AAAI-1996), Portland,
Oregon, pp. 246–252 (1996)
[19] Gomes, C., Fernandez, C., Selman, B., Bessière, C.: Statistical regimes across
constrainedness regions. Principles and Practice of Constraint Programming
(CP-2004), Toronto, Canada, LNCS 3258, pp. 32–46 (2004)
[20] Gomes, C. P., Selman, B., Kautz, H. A.: Boosting combinatorial search
through randomization, The Fifteenth National Conference on Artificial Intel-
ligence (AAAI-1998), Madison, Wisconsin, pp. 431–437 (1998)
[21] Gomes, C. P., Selman, B.: Algorithm portfolios. Artificial Intelligence, 126(1-
2), pp. 43–62 (2001)
[22] Hansen, L., Salamon, P.: Neural network ensembles. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 12, pp. 993–1001 (1990)
[23] Hulubei, T., O’Sullivan, B.: Search heuristics and heavy-tailed behavior. Prin-
ciples and Practice of Constraint Programming (CP-2005), Barcelona, Spain,
pp. 328–342 (2005)
[24] Hutter, F., Hamadi, Y., Hoos, H. H., Leyton-Brown, K.: Performance predic-
tion and automated tuning of randomized and parametric algorithms. Princi-
ples and Practice of Constraint Programming (CP-2006), Nantes, France, pp.
213–228 (2006)
[25] Johnson, D. S., Aragon, C. R., McGeoch, L. A., Schevon, C.: Optimization by
simulated annealing: An experimental evaluation; Part I, Graph Partitioning.
Operations Research, 37(6), 865–892 (1989)
[26] Kivinen, J., Warmuth, M. K.: Averaging expert predictions. Computational
Learning Theory: 4th European Conference (EuroCOLT-1999), Nordkirchen,
Germany, pp. 153–167 (1999)
[27] Lecoutre, C., Boussemart, F., Hemery, F.: Backjump-based techniques versus
conflict directed heuristics. 16th IEEE International Conference on Tools with
Artificial Intelligence (ICTAI-2004), Boca Raton, Florida, USA, pp. 549–557
(2004)
[28] Minton, S., Allen, J. A., Wolfe, S., Philpot, A.: An overview of learning in
the Multi-TAC system. First International Joint Workshop on Artificial Intel-
ligence and Operations Research, Timberline, Oregon, USA (1995)
[29] Nareyek, A.: Choosing search heuristics by non-stationary reinforcement
learning. Metaheuristics: Computer Decision-Making, Kluwer Academic
Publishers, Norwell, MA, USA, pp. 523–544 (2004)
[30] O’Mahony, E., Hebrard, E., Holland, A., Nugent, C., O’Sullivan, B.: Us-
ing case-based reasoning in an algorithm portfolio for constraint solving.
19th Irish Conference on Artificial Intelligence and Cognitive Science (AICS-
2008), Cork, Ireland (2008)
[31] Opitz, D., Shavlik, J.: Generating accurate and diverse members of a neural-
network ensemble. Advances in Neural Information Processing Systems, 8,
pp. 535–541 (1996)
[32] Otten, L., Gronkvist, M., Dubhashi, D. P.: Randomization in constraint pro-
gramming for airline planning. Principles and Practice of Constraint Program-
ming (CP-2006), Nantes, France, pp. 406–420 (2006)
[33] Petrie, K. E., Smith, B. M.: Symmetry breaking in graceful graphs. Principles
and Practice of Constraint Programming (CP-2003), Kinsale, Ireland, LNCS
2833, pp. 930–934 (2003)
[34] Petrovic, S., Epstein, S. L.: Full restart speeds learning. 19th International
FLAIRS Conference (FLAIRS-06), Melbourne Beach, Florida (2006)
[35] Petrovic, S., Epstein, S. L.: Relative support weight learning for constraint
solving. Workshop on Learning for Search, AAAI-06, Boston, Massachusetts,
USA, pp. 115–122 (2006)
[36] Petrovic, S., Epstein, S. L.: Preferences improve learning to solve con-
straint problems. Workshop on Preference Handling for Artificial Intelligence
(AAAI-07), Vancouver, Canada, pp. 71–78 (2007)
[37] Petrovic, S., Epstein, S. L.: Random subsets support learning a mixture of
heuristics. International Journal on Artificial Intelligence Tools (IJAIT), 17(3),
pp. 501–520 (2008)
[38] Ruan, Y., Kautz, H., Horvitz, E.: The backdoor key: A path to understanding
problem hardness. Nineteenth National Conference on Artificial Intelligence
(AAAI-2004), San Jose, CA, USA, pp. 124–130 (2004)
[39] Saari, D. G.: Geometry of voting. Studies in Economic Theory, Vol. 3,
Springer (1994)
[40] Sabin, D., Freuder, E. C.: Understanding and improving the MAC algorithm.
Principles and Practice of Constraint Programming (CP-1997), Linz, Austria,
LNCS 1330, pp. 167–181 (1997)
[41] Samulowitz, H., Memisevic, R.: Learning to solve QBF. 22nd National Con-
ference on Artificial intelligence (AAAI-2007), Vancouver, Canada, pp. 255–
260 (2007)
[42] Schapire, R. E.: The strength of weak learnability. Machine Learning, 5(2):
197–227 (1990)
[43] Smith, B., Grant, S.: Trying harder to fail first. European Conference on Arti-
ficial Intelligence (ECAI-1998), pp. 249–253 (1998)
[44] Streeter, M., Golovin, D., Smith, S. F.: Combining multiple heuristics online.
22nd National Conference on Artificial Intelligence (AAAI-07), Vancouver,
Canada, pp. 1197–1203 (2007)
[45] Terashima-Marin, H., Ortiz-Bayliss, J., Ross, P., Valenzuela-Rendón, M.:
Using hyper-heuristics for the dynamic variable ordering in binary Con-
straint Satisfaction Problems, MICAI 2008, Advances in Artificial Intelli-
gence, LNCS 5317, pp. 407–417 (2008)
[46] Valentini, G., Masulli, F.: Ensembles of learning machines. Neural Nets
(WIRN-2002), Vietri sul Mare, Italy, LNCS 2486, pp. 3–22 (2002)
[47] Wallace, R. J.: Factor analytic studies of CSP heuristics. Principles and Prac-
tice of Constraint Programming (CP-2005), Barcelona, Spain, LNCS 3709,
pp. 712–726 (2005)
[48] Whitley, D.: The genitor algorithm and selection pressure: Why rank-based
allocation of reproductive trials is best. International Conference on Genetic
Algorithms (ICGA-1989), pp. 116–121 (1989)
[49] Xu, L., Hutter, F., Hoos, H., Leyton-Brown, K.: SATzilla-07: The design and
analysis of an algorithm portfolio for SAT. Principles and Practice of Con-
straint Programming (CP-2007), Providence, RI, USA, pp. 712–727 (2007)
[50] Young, H. P.: Condorcet’s theory of voting. The American Political Science
Review, 82(4), 1231–1244 (1988)
Chapter 1
An Introduction to Autonomous Search
1.1 Introduction
In the foreword of his seminal book on artificial intelligence and on problem solving
by man and machine [24], Jean-Louis Laurière pointed out that one of the main
goals of research was to understand how a system (man or machine) may learn,
analyze knowledge, transpose and generalize it in order to face real situations and
finally solve problems. Therefore, the ultimate goal of this pioneering AI scientist was
to build systems that were able to autonomously solve problems, and to acquire and
even discover new knowledge, which could be reused later.
Finding the most suitable algorithm and its correct setting for solving a given
problem is the holy grail of many computer science researchers and practitioners.
This problem was investigated many years ago [31] as the algorithm selection prob-
lem. The proposed abstract model suggested extracting features in order to char-
acterize the problem, searching for a suitable algorithm in the space of available
algorithms and then evaluating its performance with respect to a set of measures.
These considerations are still valid and can indeed be considered at least from two
complementary points of view:
• Selecting solving techniques or algorithms from a set of available techniques;
• Tuning an algorithm with respect to a given instance of a problem.
Youssef Hamadi
Microsoft Research, Cambridge, CB3 0FB, UK
LIX, École Polytechnique, F-91128 Palaiseau, France
e-mail: [email protected]
Eric Monfroy
Universidad Santa Marı́a, Valparaı́so, Chile, e-mail: [email protected] and
LINA, Université de Nantes, France, e-mail: [email protected]
Frédéric Saubion
LERIA, Université d’Angers, France, e-mail: [email protected]
To address these issues, existing works often include tools from different com-
puter science and mathematics areas, especially from machine learning and statis-
tics. Moreover, many approaches have been developed to answer the algorithm se-
lection problem in various fields, as described in the survey of K. Smith-Miles [33].
The present book focuses on the restriction of this general question to the fields of
constraint satisfaction, constraint solving, and optimization problems.
Of course, this quest is also related to the No Free Lunch theorems for
optimization [34], which argue against the existence of a universal multipurpose
solver. The challenge for more autonomous and adaptive solving algorithms is
precisely to open the door to Free Lunch theorems for large classes of problems, as
recently suggested by R. Poli and M. Graff [30].
In the context of optimization problems, as we will see later, several terminolo-
gies are used to characterize those systems that include specific techniques in order
to provide more intelligent solving facilities to the user: adaptive algorithms, Reac-
tive Search, and portfolio algorithms. We propose here to unify these approaches under
the generic term Autonomous Search, which reflects the ultimate goal of providing an
easy-to-use interface for end users, who could provide a possibly informal descrip-
tion of their problem and obtain solutions with minimum interaction and technical
knowledge. Such an ideal system is depicted in Figure 1.1.
as local search [19, 29]. Powerful off-line techniques are now available for tuning
parameters [22, 20, 27, 26].
Another important research community that focuses on closely related problems
has been established under the name Reactive Search by R. Battiti et al. [5, 3]. After
focusing on local search with the seminal works on reactive tabu search [4] and adaptive
simulated annealing [23], this community is now growing through the dedicated
conference LION [2].
It clearly appears that all these approaches share common principles and purposes
and have been developed in different but connected communities. Their foundations
rely on the fact that, as solving techniques and search heuristics become increasingly
sophisticated and problem structures increasingly intricate, the choice and the correct
setting of a solving algorithm is becoming an intractable task for most users. Therefore,
there is a rising need for an alternative problem-solving framework.
In 2007, we organized the first workshop on Autonomous Search in Providence,
RI, USA, in order to present relevant works aimed at building more intelligent
solvers for the constraint programming community [17]. We then attempted to charac-
terize the concept of Autonomous Search more precisely, building on the previously
described related works [18]. The purpose of this book is to provide a clear overview of recent
advances in autonomous tools for optimization and constraint satisfaction problems.
In order to be as exhaustive as possible, keeping the focus on constrained problem
solving, this book includes ten chapters that cover different solving techniques from
metaheuristics to tree-based search and that illustrate how these solving techniques
may benefit from intelligent tools by improving their efficiency and adaptability to
solve the problems at hand.
Before presenting the chapters, we would like to quickly review some general prin-
ciples and definitions that have guided us in the organization of this book. We first
describe the architecture of a solver, trying to be general enough to cover various
solving techniques and paradigms. From this architecture, we then review different
points of view on the ways in which autonomy-supporting features can be included
and used.
useful over a wide range of instances. Such results require a large number of exper-
imental evaluations and are generally based on empirical observations. Parameter
control is divided into three branches according to the degree of autonomy of the
strategies. Control is deterministic when parameters are changed according to a pre-
viously established schedule, adaptive when parameters are modified according to
rules that take into account the state of the search, and self-adaptive when parameters
are encoded into individuals in order to evolve conjointly with the other variables of
the problem. In [32], Eiben and Smit discuss the differences between numeric and
symbolic parameters. In [28], symbolic parameters are called components, and the
operators of an evolutionary algorithm are configurable components of the solver
that implement solving techniques.
In [3], Reactive Search is characterized by the integration of machine learning
techniques into search heuristics. A classification of the source of information that
is used by the algorithm is proposed for distinguishing between problem-dependent
information, task-dependent information, and local properties.
In their survey [8], Burke et al. propose a classification of hyper-heuristics de-
fined as “search methods or learning mechanisms for selecting or generating heuris-
tics to solve computational search problems”. As already mentioned, this classifica-
tion also distinguishes between two dimensions: the different sources of feedback
information and the nature of the heuristics’ search space. The feedback, when used,
corresponds here to the information that is learned while solving (on-line) or while
using a set of training instances (off-line). The authors identify two families of low-
level heuristics: construction heuristics (used to incrementally build a solution) and
perturbation heuristics (used to iteratively improve a starting solution). The hyper-
heuristics level can use heuristics selection methodologies that produce combina-
tions of preexisting low-level heuristics, or heuristics generation methodologies that
generate new heuristics from building blocks of low-level heuristics.
Another interesting classification is proposed in [16], in which Gagliolo et al.
address the algorithm selection problem and describe the different selection
techniques according to several points of view. Different dimensions are
identified with respect to this algorithm selection problem (e.g., the generality of
the selection process and its reactivity).
From these considerations, we may now provide a general overview of the chap-
ters of this book with respect to the different aspects of Autonomous Search that
they handle.
According to the existing taxonomies and different points of view that have been
described above, it appears quite natural to classify the chapters of this book with
respect to how they support autonomy from the user’s point of view: off-line config-
uration of a solver or on-line control of a solver. The book also investigates future
Focusing first on the tuning of the parameters of a given solver, Chapter 2 addresses
the problem of parameter tuning for evolutionary algorithms, which is clearly an
important issue for algorithms that typically involve many parameters
whose impact is difficult to forecast. The chapter advocates a methodological ap-
proach to tuning and reviews automated tuning algorithms.
In a similar and complementary context, Chapter 3 provides a review of more
automated configuration tools for solvers, focusing on local search-based tuning
algorithms that explore the space of parameters. It also reviews other parameter
tuning methods such as racing techniques.
As mentioned above, another way to address the algorithm selection problem is
to consider a portfolio of solvers and to provide an intelligent oracle that selects
among them with regard to the problem at hand. In this context, Chapter 4 proposes
an original use of learning processes in case-based reasoning applied to constraint
programming and describes CPHYDRA, a case-based portfolio constraint solver. It
illustrates how the user can take advantage of previously solved problems to solve
similar cases. A general learning and reuse model is proposed to automate this whole
process.
Turning to a more hyper-heuristics point of view, Chapter 5 describes, in the
context of a complete tree-based constraint solver, how combinations of low-level
search heuristics can be learned from examples. The choice of the most suitable
heuristics (i.e., variable/value choices) is a well-known difficult problem in con-
straint programming. In this chapter, a general scheme is proposed for designing
an autonomous learner that is able to monitor its own performance and generate
training examples to improve its quality. This chapter describes the Adaptive Con-
straint Engine, ACE, which constitutes a complete project that aims at allowing the
user to benefit from an automated configuration of the basic branching heuristics of
a complete constraint solver before running it on the instances she wants to solve
efficiently.
stance, prohibition mechanisms or noise rates are dynamically adjusted along the
search.
In a way complementary to that of Chapter 6, Chapter 7 presents how learning
mechanisms can be used in evolutionary algorithms to automatically select the suit-
able operators during the search process. This chapter also explores how operators
can be generated on the fly according to the search needs. Therefore, the proposed
techniques allow the user to handle two types of parameters: numerical parame-
ters that adjust the behaviour of the solver and the solving components themselves,
which can be modified along the search process.
In Chapter 8, the adaptive control of parameters in ant colony optimization, a
broadly used solving paradigm for constraint optimization problems, is reviewed.
This chapter presents a full empirical study of parameter setting for ACO and pro-
poses an adaptation scheme for improving performance by controlling dynamic
variations of the parameters.
In this book, our purpose is to provide a broad selection of representative works in the
scope of Autonomous Search. Different readers will therefore be interested in fo-
cusing on different areas, according to their scientific interests or main goals. To this
end, we propose hereafter some typical reading paths:
References
[1] Bader-El-Den, M., Poli, R.: Generating SAT local-search heuristics using a
GP hyper-heuristic framework. In: 8th International Conference, Evolution
Artificielle, EA 2007. Revised Selected Papers, Springer, no. 4926 in Lecture
Notes in Computer Science, pp. 37–49 (2008)
[2] Battiti, R., Brunato, M. (eds): Learning and Intelligent Optimization, Second In-
ternational Conference, LION 2007 II. Selected Papers, Lecture Notes in Com-
puter Science, vol. 5313. Springer (2008)
[3] Battiti, R., Brunato, M.: Handbook of Metaheuristics (2nd edition), Gen-
dreau, M., Potvin, J. Y. (eds), Springer, chap. Reactive search optimization:
learning while optimizing (2009)
[4] Battiti, R., Tecchiolli, G.: The reactive tabu search. INFORMS Journal on
Computing 6(2):126–140 (1994)
[5] Battiti, R., Brunato, M., Mascia, F.: Reactive Search and Intelligent Opti-
mization, Operations Research/Computer Science Interfaces, vol. 45. Springer
(2008)
[6] Burke, E., Kendall, G., Newall, J., Hart, E., Ross, P., Schulenburg, S.: Hand-
book of Meta-heuristics, Kluwer, chap. Hyper-heuristics: an emerging direc-
tion in modern search technology, pp. 457–474 (2003)
[7] Burke, E., Hyde, M., Kendall, G., Ochoa, G., Ozcan, E., Qu, R.: A sur-
vey of hyper-heuristics. Tech. Rep. NOTTCS-TR-SUB-0906241418-2747,
School of Computer Science and Information Technology,
University of Nottingham, Computer Science (2009)
[8] Burke, E., Hyde, M., Kendall, G., Ochoa, G., Ozcan, E., Woodward, J.:
Handbook of Metaheuristics (2nd edition), Gendreau, M., Potvin, J. Y. (eds),
Springer, chap. A classification of hyper-heuristics approaches (2009)
[9] Cowling, P., Soubeiga, E.: Neighborhood structures for personnel schedul-
ing: a summit meeting scheduling problem (abstract). In: Burke, E., Erben,
W. (eds), Proceedings of the 3rd International Conference on the Practice and
Theory of Automated Timetabling (2000)
[10] Cowling, P., Kendall, G., Soubeiga, E.: Hyperheuristics: a tool for rapid pro-
totyping in scheduling and optimisation. In: Applications of Evolutionary
Computing, EvoWorkshops 2002: EvoCOP, EvoIASP, EvoSTIM/EvoPLAN,
Springer, Lecture Notes in Computer Science, vol. 2279, pp. 1–10 (2002)
[11] Crowston, W., Glover, F., Thompson, G., Trawick, J.: Probabilistic and para-
metric learning combinations of local job shop scheduling rules. Tech. Rep.
ONR Research Memorandum No. 117, GSIA, Carnegie-Mellon University,
Pittsburgh, PA (1963)
[12] De Jong, K.: Parameter setting in EAs: a 30 year perspective. In: [25], pp. 1–18
(2007)
[13] Eiben, A. E., Hinterding, R., Michalewicz, Z.: Parameter control in evolution-
ary algorithms. IEEE Trans Evolutionary Computation 3(2):124–141 (1999)
[14] Eiben, A. E., Michalewicz, Z., Schoenauer, M., Smith, J. E.: Parameter control
in evolutionary algorithms. In: [25], pp. 19–46 (2007)
[15] Fisher, H., Thompson, L.: Industrial Scheduling, Prentice Hall, chap. Proba-
bilistic learning combinations of local job-shop scheduling rules (1963)
[16] Gagliolo, M., Schmidhuber, J.: Algorithm selection as a bandit problem with
unbounded losses. Tech. Rep. IDSIA-07-08 (2008)
[17] Hamadi, Y., Monfroy, E., Saubion, F.: Special issue on Autonomous
Search. Constraint Programming Letters 4, URL http://www.constraint-programming-letters.org/ (2008)
[18] Hamadi, Y., Monfroy, E., Saubion, F.: Hybrid Optimization: The Ten Years of
CPAIOR, Springer, chap. What is autonomous search? (2010)
[19] Hoos, H.: An adaptive noise mechanism for WalkSAT. In: AAAI/IAAI, pp.
655–660 (2002)
[20] Hutter, F.: Automating the configuration of algorithms for solving hard compu-
tational problems. Ph.D. thesis, Department of Computer Science, University
of British Columbia (2009)
[21] Hutter, F., Hamadi, Y., Hoos, H., Leyton-Brown, K.: Performance prediction and
automated tuning of randomized and parametric algorithms. In: Twelfth In-
ternational Conference on Principles and Practice of Constraint Programming
CP’06 (2006)
[22] Hutter, F., Hoos, H., Stützle, T.: Automatic algorithm configuration based on
local search. In: Proc. of the Twenty-Second Conference on Artificial Intelli-
gence (AAAI ’07), pp. 1152–1157 (2007)
[23] Ingber, L.: Very fast simulated re-annealing. Mathematical and Computer Mod-
elling 12(8):967–973 (1989)
[24] Laurière, J. L.: Intelligence Artificielle, Résolution de problèmes par l’Homme
et la machine. Eyrolles (1986)
[25] Lobo, F., Lima, C., Michalewicz, Z. (eds): Parameter Setting in Evolutionary
Algorithms, Studies in Computational Intelligence, vol. 54. Springer (2007)
[26] Nannen, V., Eiben, A. E.: A method for parameter calibration and relevance
estimation in evolutionary algorithms. In: Genetic and Evolutionary Compu-
tation Conference, GECCO 2006, Proceedings, ACM, pp. 183–190 (2006)
[27] Nannen, V., Eiben, A. E.: Relevance estimation and value calibration of evo-
lutionary algorithm parameters. In: IJCAI 2007, Proceedings of the 20th Inter-
national Joint Conference on Artificial Intelligence, pp. 975–980 (2007)
[28] Nannen, V., Smit, S., Eiben, A. E.: Costs and benefits of tuning parameters of
evolutionary algorithms. In: Parallel Problem Solving from Nature, PPSN X,
Springer (2008)
Chapter 6
An Investigation of Reinforcement Learning for Reactive Search Optimization

Roberto Battiti
DISI – Dipartimento di Ingegneria e Scienza dell’Informazione, Università di Trento, Italy,
e-mail: [email protected]
Paolo Campigotto
DISI – Dipartimento di Ingegneria e Scienza dell’Informazione, Università di Trento, Italy,
e-mail: [email protected]

6.1 Introduction
can be obtained in the future. Goal-directed learning from interaction with an (un-
known) environment with trial-and-error search and delayed reward is the main fea-
ture of RL.
As was suggested, for example, in [2], the issue of learning from an initially
unknown environment is shared by RSO and RL. A basic difference is that RSO
optimizes a function and the environment is provided by a fitness surface to be ex-
plored, while RL optimizes the long-term reward obtained by selecting actions at
the different states. The sequential nature of the decision problem, and therefore the
non-greedy nature of choices, is also common. For example, in Reactive Tabu Search,
the application of RSO in the context of Tabu Search, steps leading to worse configura-
tions must in some cases be performed to escape from a basin of attraction around a
local optimizer.
to see whether specific techniques of reinforcement learning can be profitably used
in RSO.
In this chapter, we discuss the application of reinforcement learning methods for
tuning the parameters of stochastic local search (SLS) algorithms. In particular,
we select the Least Squares Policy Iteration (LSPI) algorithm [22] to implement
the reinforcement learning mechanism. This investigation is done in the context
of the Maximum Satisfiability (MAX-SAT) problem. The reactive versions of the
SLS algorithms for the SAT/MAX-SAT problem, like RSAPS, Adaptive Walksat
and AdaptNovelty+, perform better than their original non-reactive versions.
While solving a single instance, the level of diversification of a reactive SLS
algorithm is adaptively increased or decreased depending on whether search
stagnation is detected.
Reactive approaches in the above methods consist of ad hoc ways to detect
search stagnation and to adapt the value of one or more parameters determining the
diversification of the algorithm.
In this investigation, a generic RL-based reactive scheme is designed, which can
be customized to replace the ad hoc method of the algorithm considered.
To test its performance, we select three SAT/MAX-SAT SLS solvers: the (adap-
tive) Walksat, the Hamming-Reactive Tabu Search (H-RTS) and the Reactive Scal-
ing and Probabilistic Smoothing (RSAPS) algorithms. Their ad hoc reactive meth-
ods are replaced with our RL-based strategy, and the results obtained over a ran-
dom MAX-3-SAT benchmark and a structured MAX-SAT benchmark are dis-
cussed.
This chapter is organized as follows. Previous approaches to reinforcement learn-
ing applied to optimization are discussed in Section 6.2, while the basics of rein-
forcement learning and dynamic programming are given in Section 6.3. In particu-
lar, the LSPI algorithm is detailed.
The considered reactive SLS algorithms for the SAT/MAX-SAT problem are
introduced in Section 6.4. Section 6.5 explains our integration of RSO with the
RL strategy. Finally, the experimental comparison of the RL-based reactive ap-
proach with the original ad hoc reactive schemes (Section 6.6) concludes the
work.
There are many intersections between optimization, dynamic programming and rein-
forcement learning. Approximate versions of DP/RL involve challenging opti-
mization tasks. Consider, for example, the maximization operations in determining
the best action when an action value function is available, the optimal choice of ap-
proximation architectures and parameters in dynamic programming, or the optimal
choice of algorithm details and parameters for a specific RL instance.
A recent paper about the interplay of optimization and machine learning (ML)
research is [8], which mainly shows how recent advances in optimization can be
profitably used in ML.
This work, however, goes in the opposite direction: which techniques of RL can
be used to improve heuristic algorithms for a standard optimization task such as min-
imizing a function? Interesting summaries of statistical machine learning methods
applied to large-scale optimization are presented in [1]. Autonomous search meth-
ods that adapt the search performance to the problem at hand are described in [15],
which defines a metric to classify problem solvers based on their computation char-
acteristics. The authors first identify a set of rules describing search processes, and
then they classify existing search paradigms from different but related areas, such
as the Satisfiability and the Constraint Satisfaction Problems.
In [10] RL is applied in the area of local search for solving max_x f(x): the rewards
from a local search method π starting from an initial configuration x are given by
the size of the improvements of the best-so-far value f_best. In detail, the value function
V^π(x) of configuration x is given by the expected best value of f seen on a trajec-
tory starting from state x and following the local search method π. The curse of
dimensionality discourages directly using x for the state description: informative fea-
tures extracted from x are used to compress the state description to a shorter vector
s(x), so that the value function becomes V^π(s(x)). The resulting algorithm, STAGE,
is a smart version of iterated local search which alternates between two phases. In
one phase, new training data for V^π(s(x)) are obtained by running local search
from the given starting configuration. Assuming that local search is memoryless,
the estimates of V^π(s(x′)) for all the points x′ along the local search trajectory can
be obtained from a single run. In the other phase, one optimizes the value function
V^π(s(x)) (instead of the original f), so that a hopefully new and promising starting
point is obtained. A suitable approximation architecture V^π(s(x); w) and a super-
vised machine learning method to train V^π(s(x); w) from the examples are needed.
The memory-based version of the RASH (Reactive Affine Shaker) optimizer intro-
duced in [11] adopts a similar strategy. An open issue is the proper definition of the
features by the researcher, which are chosen by insight in [10]. Preliminary results
are also presented about transfer, i.e., the possibility of using a value function learnt
on one task for a different task with similar characteristics.
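To make the alternation concrete, the following Python sketch outlines a STAGE-like loop under simplifying assumptions; the helper names (features, local_search, random_start) and the plain least-squares fit are illustrative placeholders, not the implementation of [10].

import numpy as np

def stage_like(f, features, local_search, random_start, n_rounds=20):
    # Sketch of a STAGE-like alternation; all helper names are placeholders.
    X, y = [], []                          # training data for V(s(x); w)
    x = random_start()
    best_x, best_f = x, f(x)
    for _ in range(n_rounds):
        # Phase 1: run ordinary local search on f from the current start x.
        traj = local_search(x)             # list of visited configurations
        values = [f(z) for z in traj]
        if max(values) > best_f:
            best_f = max(values)
            best_x = traj[int(np.argmax(values))]
        # A memoryless search lets a single run label every visited point
        # with the best value reached from that point onward (suffix max).
        suffix_best = np.maximum.accumulate(values[::-1])[::-1]
        X.extend(features(z) for z in traj)
        y.extend(suffix_best)
        # Fit the linear value function V(s(x); w) by least squares.
        A = np.column_stack([np.asarray(X, dtype=float), np.ones(len(X))])
        w, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
        # Phase 2: use V instead of f to pick a promising new starting point
        # (here, crudely, the visited point that V scores highest).
        feats = np.column_stack([np.asarray([features(z) for z in traj], float),
                                 np.ones(len(traj))])
        x = traj[int(np.argmax(feats @ w))]
    return best_x, best_f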
A second application of RL to local search is that of supplementing f with a
“scoring function” that helps in determining the appropriate search option at every
step. For example, different basic moves or entirely different neighborhoods can be
applied. RL can in principle make more systematic some of the heuristic approaches
where H_{k,d}(x_p) is the value obtained by completing the partial solution x_p with the kth
method, and the ψ are tunable coefficients depending on the parameters w, which are
learnt from many runs on different instances of the problem. While in the previ-
ously mentioned application one evaluates starting points for local search, here one
evaluates the promise of partial solutions leading to good, complete solutions. The
performance of this technique in terms of CPU time is poor: in addition to the sepa-
rate training phase, the K construction algorithms must be run for each
partial solution (for each variable-fixing step) in order to pick the
next step in an optimal manner.
A different parametric combination of costs of solutions obtained by two fixed
heuristics is considered for the stochastic programming problem of maintenance
and repair [9]. In the problem, one has to decide whether to immediately repair
breakdowns by consuming the limited amount of spare parts, or to keep spare parts
for breakdowns at a later phase.
In the context of continuous function optimization, [24] uses RL for replacing a
priori-defined adaptation rules for the step size in Evolution Strategies with a reac-
tive scheme which adapts step sizes automatically during the optimization process.
The states are characterized only by the success rate after a fixed number of mu-
tations; the three possible actions consist of increasing (by a fixed multiplicative
amount), decreasing or keeping the current step size. SARSA learning with various
reward functions is considered, including combinations of the difference between
the current function value and the one evaluated at the last reward computation
and the movement in parameter space (the distance traveled in the last phase). On-
the-fly parameter tuning, or on-line calibration of parameters for evolutionary algo-
rithms by reinforcement learning (crossover, mutation, selection operators, popula-
tion size), is suggested in [12]. The EA process is divided into episodes; the state
describes the main properties of the current population (mean fitness, or f values,
standard deviation, etc.); the reward reflects the progress between two episodes (im-
provement of the best fitness value); the action consists of determining the vector of
control parameters for the next episode.
The trade-off between exploration and exploitation is another issue shared by RL
and stochastic local search techniques. In RL an optimal policy is learnt by gen-
erating samples; in some cases samples are generated by a policy which is being
evaluated and then improved. If the initial policy is generating states only in a lim-
ited portion of the entire state space, there will be no way to learn the true optimal
policy. A paradigmatic example is the n-armed bandit problem [14], so named be-
cause of its similarity with a slot machine. In the problem, one is faced repeatedly
with a choice among different options, or actions. After each choice a numerical
reward is generated from a stationary probability distribution that depends on the
selected action. The objective is to maximize the expected total reward over some
time period. In this case the action value function depends only on the action, not on
the state (which is unique in this simplified problem). In the more general case of
many states, no theoretically sound ways exist to combine exploration and exploita-
tion in an optimal manner. Heuristic mechanisms come to the rescue, for example,
by selecting actions not in a greedy manner given an action value function, but based
decisions, where each decision either selects a variable or assigns a value to a vari-
able from its associated value set. After each assignment, propagation rules infer its
effect on the domains of the unassigned variables. Each decision is obtained by the
weighted combination of a set of prespecified search heuristics. The learning task
consists of searching for the appropriate set of weights for the heuristics. The Adaptive
Constraint Engine can learn the optimal set of weights for classes of CSP problems,
rather than for a single problem.
In this section, Markov decision processes are formally defined and the standard
dynamic programming technique is summarized in Section 6.3.2, while the Policy
Iteration technique to determine the optimal policy is defined in Section 6.3.3. In
many practical cases exact solutions must be abandoned in favor of approximation
strategies, which are the focus of Section 6.3.4.
Note that state transitions are not deterministic; nevertheless their distribution can
be controlled by the action a. The goal is to control the system in order to maximize
the expected cumulative reward.
Given an MDP (S, A, P, R, γ), we define a policy as a probability distribution
π(·|s) : A → [0, 1], where π(a|s) is the probability of choosing action a when the
system is in state s. In other words, π maps states onto probability distributions over
A. Note that we are only considering stationary policies. If a policy is deterministic,
then we shall resort to the more compact notation a = π(s).
The learning task consists of the selection of a policy that maximizes a measure of
the total reward accumulated during an infinite chain of decisions (infinite horizon).
To achieve this goal, let us define the state-action value function Q^π(s, a) of the
policy π as the expected overall future reward for applying a specified action a when
the system is in state s, under the hypothesis that the ensuing actions are taken
according to policy π. A straightforward application of the Bellman principle
leads to the following definition:

$$Q^{\pi}(s, a) = \sum_{s' \in \mathcal{S}} P(s, a, s') \Big[ R(s, a, s') + \gamma \sum_{a' \in \mathcal{A}} \pi(a'|s')\, Q^{\pi}(s', a') \Big] \qquad (6.2)$$

where the sum over S can be interpreted as an integral in the case of a continuous
state set. The interpretation is that the value of selecting action a in state s is given
by the expected value of the immediate reward plus the value of the future rewards
which one expects by following policy π from the new state. These rewards have
to be discounted by γ (they are one step in the future w.r.t. the new state) and prop-
erly weighted by transition probabilities and action-selection probabilities, given the
stochasticity in the process.
Let R denote the vector of expected rewards over the state/action pairs (s, a) ∈ S × A.
In vector form, Eq. (6.2) then reads

$$Q^{\pi} = R + \gamma\, P\, \Pi^{\pi} Q^{\pi} \qquad (6.3)$$
where R is the |S||A|-entry column vector corresponding to R(s, a), P is the
|S||A| × |S| matrix of P(s, a, s′) values having (s, a) as row index and s′ as col-
umn index, and Π^π is a |S| × |S||A| matrix whose entry (s, (s, a)) is π(a|s).
Equation (6.3) can be seen as a non-homogeneous linear problem with unknown
Q^π,

$$(I - \gamma\, P\, \Pi^{\pi})\, Q^{\pi} = R, \qquad (6.4)$$

or, alternatively, as a fixed-point problem,

$$Q^{\pi} = T^{\pi} Q^{\pi}, \qquad (6.5)$$
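As a concrete illustration of Eq. (6.4), the short NumPy fragment below evaluates a fixed deterministic policy on a toy two-state, two-action MDP by solving the linear system exactly; the transition probabilities and rewards are invented for the example.

import numpy as np

# Toy MDP with 2 states and 2 actions; all numbers are made up for illustration.
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],                   # R[s, a] expected immediate reward
              [0.0, 2.0]])
pi = np.array([0, 1])                       # deterministic policy: one action per state

# Matrices of Eq. (6.3): state/action pairs are indexed as s * nA + a.
P_mat = P.reshape(nS * nA, nS)              # |S||A| x |S|
Pi_pi = np.zeros((nS, nS * nA))             # |S| x |S||A|
for s in range(nS):
    Pi_pi[s, s * nA + pi[s]] = 1.0

# Solve (I - gamma P Pi^pi) Q = R for the state-action values, as in Eq. (6.4).
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P_mat @ Pi_pi, R.reshape(-1))
print(Q.reshape(nS, nA))                    # Q^pi(s, a) for every pair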
Policy π_{i+1} is never worse than π_i, in the sense that Q^{π_{i+1}} ≥ Q^{π_i} over all state/action
pairs.
In the following, we assume that the optimal policy π* exists in the sense that for
all states it attains the maximum of the right-hand side of Bellman's equation; see
[9] for more details.
The Policy Iteration (PI) method consists of the alternate computation shown
in Fig. 6.1: given a policy π_i, the policy evaluation procedure (also known as the
“Critic”) generates its state-action value function Q^{π_i}, or a suitable approximation.
The second step is the policy improvement procedure (the “Actor”), which computes
a new policy by applying (6.6).
The two steps are repeated until the value function does not change after iterating,
or the change between consecutive iterations is less than a given threshold.
To carry out the above discussion by means of exact methods, in particular us-
ing (6.4) as the Critic component, the system model has to be known in terms of
its transition probability P(s, a, s ) and reward R(s, a) functions. In many cases this
detailed information is not available, but we have access to the system itself or to a
simulator. In both cases, we have a black box which, given the current state and the
performed action, determines the next state and reward. In either case, and more con-
veniently with a simulator, several sample trajectories can be generated, providing
the information about the system behavior needed to learn the optimal control.
A brute-force approach would be to estimate the system model functions
P(·, ·, ·) and R(·, ·) by executing a very large series of simulations. The reinforce-
ment learning methodology bypasses the system model and directly learns the value
function.
Assume that the system simulator (the “Model” box in Fig. 6.1) generates
quadruples of the form

(s, a, r, s′)

where s is the state of the system at a given step, a is the action taken by the simu-
lator, s′ is the state in which the system falls after the application of a, and r is the
reward received. In the setting described in this chapter, the (s, a) pair is generated by
the simulator.
A viable method to obtain an approximation of the state-action value function
is to approximate it with respect to a functional linear subspace having basis Φ =
(φ_1, . . . , φ_k). The approximation Q̂^π ≈ Q^π is of the form

$$\hat{Q}^{\pi} = \Phi^{T} w^{\pi}.$$
The weight vector w^π is the solution of the linear system A w^π = b, where

$$A = \Phi^{T}(\Phi - \gamma\, P\, \Pi^{\pi} \Phi), \qquad b = \Phi^{T} R. \qquad (6.7)$$

In this case, matrix A and vector b are “learnt” as sums of rank-one elements, each
obtained from a sample tuple:

$$A = \sum_{(s,a,r,s') \in D} \Phi(s, a)\,\big(\Phi(s, a) - \gamma\, \Phi(s', \pi(s'))\big)^{T}, \qquad b = \sum_{(s,a,r,s') \in D} r\, \Phi(s, a).$$
function LSTDQ(D, k, Φ, γ, π)
    A ← 0        { k × k matrix }
    b ← 0        { k-dimensional vector }
    for each (s, a, r, s′) ∈ D
        A ← A + Φ(s, a) (Φ(s, a) − γ Φ(s′, π(s′)))^T
        b ← b + r Φ(s, a)
    w^π ← A⁻¹ b
    return w^π
The complete LSPI algorithm is given in Fig. 6.3. Note that, because of the iden-
tification between the weight vector w and the ensuing policy π , the code assumes
that the previously declared function LSTDQ() accepts its last parameter, i.e., the
policy π , in the form of a weight vector w.
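As an illustration, a compact NumPy rendering of LSTDQ and the surrounding LSPI loop could look as follows, with the policy identified with the current weight vector as described above; the feature map phi(s, a) and the finite action set are assumptions for this sketch, and it is not the authors' Matlab implementation.

import numpy as np

def lstdq(samples, k, phi, gamma, policy):
    # One LSTD-Q evaluation: returns w such that Q^pi(s, a) ~ phi(s, a) . w.
    A = np.zeros((k, k))
    b = np.zeros(k)
    for (s, a, r, s2) in samples:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s2, policy(s2)))
        b += r * f
    # Least squares stands in for A^{-1} b in case A is (near-)singular.
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lspi(samples, k, phi, gamma, actions, max_iter=20, tol=1e-6):
    # LSPI: alternate LSTD-Q policy evaluation and greedy policy improvement.
    w = np.zeros(k)
    for _ in range(max_iter):
        # The policy is represented implicitly by the current weights:
        # choose the action whose features score highest under w.
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, k, phi, gamma, policy)
        if np.linalg.norm(w_new - w) < tol:   # the ||w - w'|| stopping criterion
            return w_new
        w = w_new
    return w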
This investigation considers as starting point the following methods: the Walk-
sat/SKC algorithm [27] and its reactive version, the Adaptive Walksat [16], the
Hamming-Reactive Tabu Search (H-RTS) [5] and the Reactive Scaling and Prob-
abilistic Smoothing (RSAPS) algorithm [18].
The Walksat/SKC algorithm adopts a two-stage variable selection scheme. First,
one of the clauses violated by the current assignment is selected uniformly at ran-
dom. If the selected clause contains variables that can be flipped without violating
any other clause, one of these is randomly chosen. When such an improving step
does not exist, a random move is executed with a certain probability p: a vari-
able appearing in the selected unsatisfied clause is selected uniformly at random
and flipped. Otherwise with probability 1 − p the least worsening move is greed-
ily selected. The parameter p is often referred to as noise parameter. The Adaptive
Walksat algorithm automatically adjusts the noise parameter, increasing or decreas-
ing it when search stagnation is or is not detected. Search stagnation is measured
in terms of the time elapsed since the last reduction in the number of unsatisfied
clauses achieved.
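The two-stage selection scheme can be summarized by the sketch below, assuming clauses are given as lists of literals (signed variable indices) and a helper break_count that returns how many currently satisfied clauses a flip would break; the names are illustrative and are not taken from the UBCSAT implementation.

import random

def walksat_skc_step(assignment, unsat_clauses, break_count, p=0.5):
    # One Walksat/SKC flip decision with noise parameter p (a sketch).
    clause = random.choice(unsat_clauses)          # stage 1: pick a violated clause
    variables = [abs(lit) for lit in clause]
    # Stage 2: a "freebie" flip does not break any currently satisfied clause.
    freebies = [v for v in variables if break_count(assignment, v) == 0]
    if freebies:
        var = random.choice(freebies)
    elif random.random() < p:
        var = random.choice(variables)             # random (noisy) move
    else:                                          # least worsening (greedy) move
        var = min(variables, key=lambda v: break_count(assignment, v))
    assignment[var] = not assignment[var]          # flip the chosen variable
    return var

An adaptive variant would, on top of this, raise p after a long stretch without improvement of the best-so-far number of unsatisfied clauses and lower it again once an improvement occurs.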
1 The non-oblivious search phase is a local search procedure guided by a non-oblivious func-
tion, i.e., a function that weights in different ways the satisfied clauses according to the num-
ber of matched literals. For example, the non-oblivious function for MAX-2-SAT problems is the
weighted linear combination of the number of clauses with one and two matched literals. See [5]
for details.
is therefore changed during the run to obtain a proper balance of diversification and
bias.
The random restart executed after 10n moves guarantees that the search trajec-
tory is not confined to a localized portion of the search space.
procedure H-RTS
    repeat
        t_r ← t
        X ← random truth assignment
        T ← T_f · n
        repeat
            repeat                                   { local search }
                X ← BEST-MOVE(LS, f)
            until largest Δf = 0
            X_I ← X
            for 2(T + 1) iterations                  { reactive tabu search }
                X ← BEST-MOVE(TS, f)
            X_F ← X
            T ← REACT(T_f, X_F, X_I)
        until (t − t_r) > 10n
    until solution is acceptable or maximum number of iterations reached

Fig. 6.4: The simplified version of the H-RTS algorithm considered in this work
Furthermore, in this chapter the term offline denotes an action executed before the
search phase of the local search algorithm, while the term online describes an action
executed during the search phase. This work proposes an RSO approach for the
online tuning of the target parameter based on the LSPI algorithm, which is trained
offline.
In order to apply the LSPI method introduced in Section 6.3.4 for tuning the
target parameter, the search process of the SLS algorithm is first modelled as a
Markov decision process. Each state of the MDP is created by observing the be-
havior of the considered algorithm over an epoch of epoch_length consecutive vari-
able flips. In fact, the effect of changing the target parameter on the algorithm's
behavior can only be evaluated after a reasonable number of local moves. Therefore
the algorithm's traces are divided into epochs (E_1, E_2, . . .) composed of a suitable
number of local moves, and changes of the target parameter are allowed only be-
tween epochs. Given the subdivision of the reactive local search algorithm's trace
into a sequence of epochs (E_1, E_2, . . .), the state at the end of epoch E_i is a collec-
tion of features extracted from the algorithm's execution up to that moment, in the
form of a tuple s(E_1, . . . , E_i) ∈ R^d, where d is the number of features that form the
state.
In the case of the Reactive Tabu Search algorithm, epoch_length = 2 · T_max,
where T_max is the maximum allowed value for the prohibition parameter. Because,
in a prohibition mechanism with prohibition parameter T, the Hamming distance keeps
increasing during the first T steps and may decrease only in the subsequent steps, an
epoch is long enough to monitor the behavior of the algorithm in the case
of the largest allowed T value. A preliminary application of RL to Reactive Tabu
Search for SAT has been presented in [4].
In the case of the target parameter in the Walksat and the RSAPS algorithms,
each epoch lasts 100 and 200 variable flips, respectively, including null flips (i.e.,
search steps where no variable is flipped) in the case of the RSAPS algorithm. These
epoch lengths are the best values selected experimentally from a set of candidates
ranging from 10 to 400.
The scope of this work is the design of a reactive reinforcement-based method
that is independent of the specific SAT/MAX-SAT reactive algorithm considered.
Therefore, the features selected for the definition of the states of the MDP under-
lying the reinforcement learning approach do not depend on the target parameter
considered, allowing for generalization.
As specified above, the state of the system at the end of an epoch describes the
algorithm’s behavior during the epoch, and an “action” is the modification of the
target parameter before the algorithm enters the next epoch. The target parameter is
responsible for the diversification of the search process once stagnation is detected.
The state features therefore describe the intensification-diversification trade-off ob-
served in the epoch.
In particular, let us define the following:
• n and m are the number of variables and clauses of the input SAT instance,
respectively;
• f(x) is the score function counting the number of unsatisfied clauses in the truth
assignment x;
• x_bsf is the “best-so-far” (BSF) configuration before the current epoch;
• f_epoch is the average value of f during the current epoch;
• H_epoch is the average Hamming distance, during the current epoch, from the con-
figuration at the beginning of the current epoch.
The compact state representation chosen to describe an epoch is the following cou-
ple:

$$ s \equiv \left( \Delta f,\; \frac{H_{epoch}}{n} \right), \qquad \text{where } \Delta f = \frac{f_{epoch} - f(x_{bsf})}{m}. $$
The first component of the state is the mean change of f in the current epoch with
respect to the best value. It represents the preference for configurations with low
objective function values. The second component describes the ability to explore
new configurations in the search space by moving away from local minima. In
the literature, this behaviour is often referred to as the diversification-bias trade-
off. In order to treat uniformly SAT instances with different numbers of variables and
clauses, the first and the second state components are normalized.
The reward signal is given by the normalized change of the best value achieved
in the observed epoch with respect to the “best-so-far” value before the epoch:
(f(x_bsf) − f(x_localBest))/m.
The state of the system at the end of an epoch describes the algorithm's behavior
during the last epoch, while an action is the modification of the algorithm's param-
eters before it enters the next epoch. In particular, the action consists of setting the
target parameter from scratch at the end of the epoch. The noise and the smooth-
ing probability parameters are set to one of 20 uniformly distributed values in the
range [0.01, 0.2]. In the Tabu Search context, we consider the fractional prohibition
parameter T_f (the prohibition parameter is T = n · T_f), set to one of 25 uniformly dis-
tributed values in the range [0.01, 0.25]. Therefore the action set A = {a_i, i = 1..n}
is composed of n = 20 choices in the case of the noise and the smoothing probability
parameters and n = 25 choices in the case of the prohibition parameter. The effect
of the action a_i consists of setting the target parameter to 0.01 · i, with i ∈ [1, 20] in the
case of the noise and the smoothing probability parameters and i ∈ [1, 25] in the case
of the prohibition parameter.
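Under these definitions, the state, the reward and the decoding of an action index can be computed from simple per-epoch statistics, as in the sketch below; the statistics are assumed to be collected by the SLS algorithm itself and all names are illustrative.

def epoch_state(n, m, f_best_so_far, f_epoch_mean, hamming_epoch_mean):
    # State at the end of an epoch: (Delta_f, mean Hamming distance / n).
    delta_f = (f_epoch_mean - f_best_so_far) / m
    return (delta_f, hamming_epoch_mean / n)

def epoch_reward(m, f_best_so_far_before, f_best_in_epoch):
    # Normalized improvement of the best value achieved during the epoch.
    return (f_best_so_far_before - f_best_in_epoch) / m

def decode_action(i, parameter):
    # Map the (1-based) action index i to the new value of the target parameter.
    if parameter in ("noise", "smoothing_probability"):    # 20 actions in [0.01, 0.2]
        return 0.01 * i
    if parameter == "fractional_prohibition":              # 25 actions in [0.01, 0.25]
        return 0.01 * i
    raise ValueError(parameter)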
Once a reactive local search algorithm is modeled as a Markov decision process,
a reinforcement learning method such as LSPI can be used to control the evolution
of its parameters. To tackle large state and action spaces, the LSPI algorithm ap-
proximates the value function as a linear weighted combination of basis functions
(see Eq. 6.8). In this work, we consider the value function space spanned by the
following finite set of basis functions:
$$\Phi(s, a) = \begin{pmatrix}
I_{a=a_1}(a) \\
I_{a=a_1}(a) \cdot \Delta f \\
I_{a=a_1}(a) \cdot H_{epoch} \\
I_{a=a_1}(a) \cdot H_{epoch} \cdot \Delta f \\
I_{a=a_1}(a) \cdot (\Delta f)^2 \\
I_{a=a_1}(a) \cdot (H_{epoch})^2 \\
\vdots \\
I_{a=a_n}(a) \\
I_{a=a_n}(a) \cdot \Delta f \\
I_{a=a_n}(a) \cdot H_{epoch} \\
I_{a=a_n}(a) \cdot H_{epoch} \cdot \Delta f \\
I_{a=a_n}(a) \cdot (\Delta f)^2 \\
I_{a=a_n}(a) \cdot (H_{epoch})^2
\end{pmatrix} \qquad (6.9)$$
The set is composed of 6n elements, and I_{a=a_i}, with i = 1..n, are the indicator
functions for the actions, evaluating to 1 if the action is the indicated one and to 0
otherwise. The indicator function is used to separate the “state-action” features for
the different actions considered: the learning for an action is entirely decoupled from
the learning for the remaining actions. For example, consider the case of n = 20
actions. The vector Φ(s, a) has 120 elements, and for each of the 20 possible actions
only six of its elements are non-zero.
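A sketch of the feature construction of Eq. (6.9), together with the end-of-epoch greedy choice (the handful of dot products mentioned in Section 6.6), could look as follows; the function names are illustrative.

import numpy as np

def phi(state, action_index, n_actions):
    # Feature vector of Eq. (6.9): six features copied into the chosen action's block.
    delta_f, h_epoch = state
    block = np.array([1.0, delta_f, h_epoch,
                      h_epoch * delta_f, delta_f ** 2, h_epoch ** 2])
    features = np.zeros(6 * n_actions)
    start = 6 * action_index                 # action_index runs from 0 to n_actions - 1
    features[start:start + 6] = block        # every other block stays zero
    return features

def greedy_action(state, w, n_actions):
    # Pick the action maximizing the approximate value Q(s, a) = phi(s, a) . w.
    q_values = [phi(state, a, n_actions) @ w for a in range(n_actions)]
    return int(np.argmax(q_values))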
The adoption of LSPI for Reactive Search Optimization requires a training phase
that identifies the optimal policy for the tuning of the parameter(s) of the consid-
ered SLS algorithm. During the training phase, a given number of runs of the SLS
algorithm is executed over a subset of instances from the considered benchmark.
The initial value for the target parameter is selected uniformly at random over the
set of allowed values. Each run is divided into epochs, and the target parameter
value is modified only at the end of each epoch using a random policy. An example
(s, a, r, s′) is composed of the states s and s′ of the Markov process observed at the
end of two consecutive epochs, the action a modifying the target parameter at the
end of the first epoch and the reward r observed for the action. The collected exam-
ples are used by the LSPI algorithm to identify the optimal policy (see Fig. 6.3). In
this work, the training phase is executed off-line.
During the testing phase, the state observed at the end of the current epoch is
given as input to the LSPI algorithm. The learnt policy determines the appropriate
modification of the target parameter (see Eq. 6.8). At test time, the only differ-
ence between the RL-based and the original version of the considered SLS algorithm
is the modification of the target parameter at the end of each epoch.
methods described in Section 6.4 and interfaced them to the Matlab LSPI imple-
mentation by Lagoudakis and Parr, available at http://www.cs.duke.edu/research/AI/LSPI/
(as of December 15, 2009).
The experiments performed in this work were carried out using a 2 GHz Intel
Xeon processor computer, with 6 GB RAM. The efficient UBCSAT [29] implemen-
tation of the Walksat and (R)SAPS algorithms is considered, while for the H-RTS
algorithm the original code of its authors is executed.
While in this chapter we base our comparisons on the solution quality after a
given number of iterations, the CPU time required by RTS RL, Walksat RL and
RSAPS RL is analogous to that of the basic H-RTS, Walksat and SAPS algorithms,
respectively. The only (negligible) CPU time overhead is due to the computation of 20
floating-point 120-element vector products in order to compute Q̂(s, a) (see Eq. 6.8)
for the 20 actions at the end of each epoch.
For brevity, the applications of our reinforcement-learning-based approach to the tuning
of the noise parameter, the prohibition value and the smoothing probability param-
eter are termed “Walksat RL”, “RTS RL” and “RSAPS RL”, respectively.
We consider two kinds of instances: random MAX-3-SAT instances and in-
stances derived from relevant industrial problems.
For the training phase, we selected one instance from the considered benchmark
and performed four runs of the algorithm with different randomly chosen starting
truth assignments. We created 8,000 examples for the Walksat RL and RSAPS RL
algorithms and 4,000 examples for the RTS RL method. The instance selected for
the training is not included in the set of instances used for the testing phase.
The experiments performed are over selected MAX-3-SAT random instances de-
fined in [23]. Each instance has 2,000 variables and 8,400 clauses, thus lying near
the satisfiability threshold of 3-SAT instances.
The comparison of our approach based on reinforcement learning with the orig-
inal Walksat/SKC, RTS and RSAPS algorithms is reported in Table 6.1. Each entry in
the table contains the mean and the standard error of the mean, over ten runs, of the
best-so-far number of unsatisfied clauses at iteration 210,000.
The Walksat/SKC algorithm has been executed with the default 0.5 value for the
noise parameter. The setting for the SAPS parameters is the default one described
in [18], as empirical results in [18] show it is the most robust configuration.
Table 6.1 shows the superior performance of the RSAPS algorithm over that of the
SAPS algorithm, motivating the need for reactive parameter tuning over this bench-
mark. Over all the benchmark instances, the RSAPS algorithm exhibits a lower best-
so-far number of unsatisfied clauses at iteration 210,000 than the SAPS algorithm.
The Kolmogorov-Smirnov test for two independent samples indicates that there is
statistical evidence (with a confidence level of 95%) for the superior performance
of RSAPS over SAPS (except in the case of the instance sel_04.cnf). In the case of
the sel_04.cnf instance the Kolmogorov-Smirnov test accepts the null hypothesis,
while, e.g., in the case of instance sel_03.cnf it rejects the null hypothesis.
Over the considered benchmark, our RL-based approach improves the perfor-
mance of the underlying algorithm when considering RSAPS and H-RTS.
In the case of the smoothing probability parameter, for each benchmark instance
the RL-based algorithm shows a lower best-so-far number of unsatisfied clauses at
iteration 210,000 than the original RSAPS algorithm. For example, over the instance
sel_07.cnf, RSAPS RL on average reaches 4.4 unsatisfied clauses, while the original
RSAPS algorithm reaches 18.2 unsatisfied clauses. The
Kolmogorov-Smirnov test for two independent samples confirms this observation.
Figure 6.5 shows the evolution of the value ‖w − w′‖ during the training phase of
the LSPI algorithm in the case of RSAPS RL. The LSPI algorithm converges in
seven iterations, when the value ‖w − w′‖ becomes smaller than 10⁻⁶.
In the case of the RL-based tuning of the prohibition parameter, the RTS RL al-
gorithm outperforms H-RTS over all the benchmark instances. For example, con-
sider the 7.3 unsatisfied clauses reached on average by the RTS RL algorithm,
compared to the 16.5 clauses of the H-RTS algorithm, over the instance sel_03.cnf.
Again, the Kolmogorov-Smirnov test for two independent samples rejects the null
hypothesis at the standard significance level of 0.05 for all the instances of the con-
sidered benchmark.
However, over the MAX-3-SAT benchmark considered, no improvement is ob-
served for the RL-based reactive tuning of the noise parameter of the Walksat/SKC
algorithm. The Kolmogorov-Smirnov test does not show statistical evidence (with
a confidence level of 95%) of a performance difference between Walksat RL and
Walksat/SKC, except in the case of instances 1 and 4, where the latter algorithm
reaches on average a lower number of unsatisfied clauses than the former one. In the
case of instances 1 and 4, the Kolmogorov-Smirnov test rejects the null hypothesis.
Note that during the offline learning phase of the RL-based version of Walksat,
the LSPI algorithm does not converge even after 20 iterations over the set of ex-
amples extracted from the considered benchmark (even if 64,000 instead of 8,000
examples are generated).
Fig. 6.5: Training of the LSPI algorithm over the MAX-3-SAT random instances
considered in the case of RSAPS RL
In the case of the prohibition parameter, the Kolmogorov-Smirnov test for two in-
dependent samples indicates that the difference between the results of the RTS RL
and H-RTS approaches is not statistically significant (at the standard significance
level of 0.05) in the case of instances en_6_nd.cnf and en_6_nd_case1.cnf. How-
ever, over the instances en_6.cnf and en_6_case1.cnf, RTS RL shows superior perfor-
mance over H-RTS. E.g., in the case of instance en_6_nd.cnf the Kolmogorov-Smirnov
test accepts the null hypothesis since the p-value is 0.11, while in the case of instance
en_6.cnf it rejects the null hypothesis since the p-value is 0.031.
In the case of the smoothing probability and the noise parameters, the RL-
based approach shows poor performance even compared to the non-reactive ver-
sions of the original methods considered (the Walksat and the SAPS algorithms).
The Kolmogorov-Smirnov test does not reject the null hypothesis when comparing the
results of SAPS to RSAPS RL (except in the case of instance en_6_nd_case1.cnf,
where SAPS outperforms RSAPS RL). Furthermore, the statistical test confirms
the worse results of Walksat RL compared to the Walksat algorithm (except in the case
of instance en_6_nd_case1.cnf, where there is no statistical evidence of different per-
formance).
Note that during the offline learning phase of the RL-based versions of Walksat
and RSAPS, the LSPI algorithm does not converge even after 20 iterations over the
set of examples extracted from the considered benchmark (even if 64,000 instead of
8,000 examples are generated). Figure 6.6 shows the evolution of the value ‖w − w′‖
during the training phase of the LSPI algorithm in the case of Walksat RL. The
algorithm does not converge: from iteration 12, the L2-norm of the vector w − w′
oscillates between the values 4,150 and 5,700.
Furthermore, qualitatively different results w.r.t. Table 6.2 are not observed if a
different instance is used for training, or if the training set includes two instances
rather than one.
Table 6.2 shows the poor performance of prohibition-based reactive algorithms.
To understand possible causes for this lack of performance we analyzed the evolu-
tion of the prohibition parameter during the run. Figure 6.7 depicts the evolution of
the fractional prohibition parameter T f during a run of the H RTS algorithm over
two instances of the Engine benchmark.
The fact that the maximum prohibition value is reached suggests that the prohibi-
tion mechanism is not sufficient to diversify the search in these more structured in-
stances. The performance of H RTS is not significantly improved even when smaller
values for the maximum prohibition are considered.
The experimental results show that over the MAX-3-SAT benchmark the RL-based
approach outperforms the ad hoc reactive approaches in the case of the RSAPS
and H RTS algorithms. The values in Tables 6.1 and 6.2 are obtained by using the
Fig. 6.6: Training of the LSPI algorithm over the industrial instances considered in
the case of Walksat RL
original default settings of the algorithms' parameters. In particular, the default con-
figuration for SAPS parameters is the following:
• the scaling parameter is set to 1.3;
• the smoothing parameter is set to 0.8;
• the smoothing probability parameter is set to 0.05;
• the noise parameter is set to 0.01.
This configuration is also the original default setting of RSAPS, where the reactive
mechanism adapts the smoothing probability parameter during the search history.
The initial default value for the fractional prohibition parameter in H RTS is 0.1.
The reactive mechanism implemented in H RTS is responsible for the adaptation of
the prohibition parameter during the algorithm execution.
For many domains, a significant improvement in the performance of the SAT
solvers can be achieved by setting their parameters differently from the original
default configuration. The tuning is executed offline, and the determined parameters
are then fixed while the algorithm runs on new instances.
In order to measure the beneficial effect of the RL-based reactive mechanism ob-
served for the smoothing probability parameter and the prohibition over the MAX-
3-SAT benchmark, we compare RTS RL and RSAPS RL with the best fixed pa-
rameter tuning of SAPS and H RTS algorithms for those instances. The best fixed
Fig. 6.7: Evolution of the T f parameter during a run of the H RTS algorithm over
two selected instances of the Engine benchmark
configurations for the SAPS and H RTS parameters are obtained with ParamILS [19], an
automatic tool for the parameter tuning problem. In particular, the best fixed setting
for the SAPS parameters is the following:
• the scaling parameter is set to 1.256;
• the smoothing parameter is set to 0.5;
• the smoothing probability parameter is set to 0.1;
• the noise parameter is set to 0.05.
The best fixed value for the fractional prohibition parameter over the MAX-3-SAT
benchmark is 0.13. Note that, by fixing the prohibition parameter in H RTS, one
obtains the standard prohibition-based approach known as GSAT/tabu. Table 6.3
shows the results of the best fixed parameter setting approach.
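For reference, the default and the ParamILS-tuned settings quoted above can be collected side by side. The dictionary keys below are descriptive labels, not the solvers' actual option names.

saps_default = {"scaling": 1.3,   "smoothing": 0.8, "smoothing_prob": 0.05, "noise": 0.01}
saps_best    = {"scaling": 1.256, "smoothing": 0.5, "smoothing_prob": 0.1,  "noise": 0.05}

tf_default = 0.10   # initial fractional prohibition parameter in H RTS
tf_best    = 0.13   # best fixed T_f over the MAX-3-SAT benchmark (fixed T_f yields GSAT/tabu)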
The Kolmogorov-Smirnov test shows that there is no statistical evidence of dif-
ferent performance between SAPS RL and SAPSbest , except in the case of instance
sel 5.cnf where the latter algorithm outperforms the former one (the p-value is
0.030). Therefore, over the MAX-3-SAT benchmark SAPS RL is competitive with
any fixed setting of the smoothing probability parameter.
In the case of the prohibition parameter, a better performance of the GSAT/tabubest
algorithm compared to RTS RL is observed over three instances (sel 1.cnf, sel 9.cnf
and sel 10.cnf) where the Kolmogorov-Smirnov test for two independent samples
rejects the null hypothesis at the standard significance level of 0.05. Over the re-
maining instances, the performance of RTS RL is competitive with the perfor-
mance of GSAT/tabubest. E.g., in the case of instance sel 8.cnf the Kolmogorov-Smirnov
test accepts the null hypothesis since the p-value is 0.974, while in the case of instance
sel 1.cnf it rejects the null hypothesis since the p-value is 0.006.

Table 6.3: Best fixed tuning of the SAPS and GSAT/tabu algorithms. The values in
the table are the mean and the standard error of the mean over ten runs of the BSF
value observed at iteration 210,000
For RTS RL, Walksat RL and SAPS RL, the LSPI algorithm has been applied to the
training sample set, with (6.9) as the basis of the approximation space. Figures 6.8
and 6.9, and Figures 6.10 and 6.11, show the policies learnt when executing RTS RL
and Walksat RL over the MAX-3-SAT benchmark, respectively.
Each point in the graph represents the action determined by Eq. 6.9 for the state
s ≡ (H_epoch/n, Δf) of the Markov decision process. The x- and y-axes represent the
features of the state. In particular, the x-axis shows the first component of the MDP
state, H_epoch/n, which represents the normalized mean Hamming distance during the
current epoch from the configuration at its beginning (see Section 6.5). The y-axis
refers to Δf, the second component of the state, representing the mean change of f in
the current epoch with respect to the best value. The ranges of the x- and y-axes are
loose bounds on the maximum and minimum values for the H_epoch/n and Δf features
observed during the offline training phase. The z-axis contains the values for the action.
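A minimal sketch of how these two state features could be computed from an epoch of local-search steps is given below; the exact normalization conventions of Section 6.5 may differ, and all variable names are illustrative.

import numpy as np

def epoch_features(configs, f_values, f_best, n_vars):
    # configs[0] is the configuration at the beginning of the epoch;
    # configs[1:] are the configurations visited during the epoch.
    start = np.asarray(configs[0])
    # normalized mean Hamming distance from the configuration at the epoch start
    h_epoch = np.mean([np.sum(np.asarray(c) != start) for c in configs[1:]]) / n_vars
    # mean change of f in the current epoch with respect to the best value found so far
    delta_f = float(np.mean(np.asarray(f_values) - f_best))
    return h_epoch, delta_f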
In the case of RTS RL, the LSPI algorithm converges and the distribution of the
actions over the state space is consistent with the intuition. A high value for the
prohibition parameter is suggested in cases where the mean Hamming distance be-
tween the configurations explored in the current epoch and the configuration at the
beginning of the current epoch does not exceed a certain value, provided that the
current portion of the landscape is worse than the previously explored regions.

Fig. 6.8: Distribution of the actions in the significant portion of the state space for
the RTS RL algorithm over the Selman benchmark

This policy is consistent with intuition: a higher value of T causes a larger differentiation of visited configurations (more different variables need to be flipped), and this
is desired when the algorithm needs to escape the neighborhood of a local mini-
mum; in this case, in fact, movement is limited because the configuration is trapped
at the “bottom of a valley”. On the other hand, when the trajectory is not within
the attraction basin of a minimum, a lower value of T enables a better exploitation
of the neighborhood. Furthermore, small values for the prohibition parameter are
suggested when the current portion of landscape is better than the previously ex-
plored regions: the algorithm is currently investigating a promising region of the
search space. These observations, consistent with the intuition, confirm the positive
behavior of RTS RL over the Selman benchmark, improving the performance of the
H RT S algorithm.
For Walksat RL over the MAX-3-SAT instances, the LSPI algorithm does not
converge. The distribution of the best action obtained after 20 iterations of the LSPI
algorithm is in Figures 6.10 and 6.11. A very similar diversification level is sug-
gested when the current portion of the landscape is both better and worse than the
previously explored regions. In particular, a counterintuitive diversification action
is suggested when the Δf feature assumes negative values while high values are
observed for the H_epoch/n feature. These observations confirm the poor results of the
Walksat RL algorithm, performing worse even than the non-reactive Walksat algo-
rithm.
Fig. 6.9: Distribution of the actions in the significant portion of the state space for
the RTS RL algorithm over the Selman benchmark (contour lines)
Fig. 6.10: Distribution of the actions in the significant portion of the state space for
the Walksat RL algorithm over the Selman benchmark

Fig. 6.11: Distribution of the actions in the significant portion of the state space for
the Walksat RL algorithm over the Selman benchmark (contour lines)

6.7 Conclusion
References
[1] Baluja S., Barto A., Boese K., Boyan J., Buntine W., Carson T., Caruana R.,
Cook D., Davies S., Dean T., et al.: Statistical machine learning for large-scale
optimization. Neural Computing Surveys 3:1–58 (2000)
[2] Battiti R.: Machine learning methods for parameter tuning in heuristics. In:
5th DIMACS Challenge Workshop: Experimental Methodology Day, Rutgers
University (1996)
[3] Battiti R., Brunato M.: Reactive search: Machine learning for memory-based
heuristics. In: Gonzalez T.F. (ed.) Approximation Algorithms and Metaheuris-
tics, Taylor and Francis Books (CRC Press), Washington, DC, chap. 21, pp.
21-1–21-17 (2007)
[4] Battiti R., Campigotto P.: Reinforcement learning and reactive search: An
adaptive MAX-SAT solver. In: Proceedings of the 2008 Conference on ECAI
2008: 18th European Conference on Artificial Intelligence, IOS Press, pp.
909–910 (2008)
[5] Battiti R., Protasi M.: Reactive search, a history-sensitive heuristic for
MAX-SAT. ACM Journal of Experimental Algorithmics 2 (article 2),
http://www.jea.acm.org/ (1997)
[6] Battiti R., Tecchiolli G.: The Reactive Tabu Search. ORSA Journal on Com-
puting 6(2):126–140 (1994)
[7] Battiti R., Brunato M., Mascia F.: Reactive Search and Intelligent Optimiza-
tion, Operations research/Computer Science Interfaces, vol. 45. Springer Ver-
lag (2008)
[8] Bennett K., Parrado-Hernández E.: The interplay of optimization and machine
learning research. The Journal of Machine Learning Research 7:1265–1281
(2006)
[9] Bertsekas D., Tsitsiklis J.: Neuro-dynamic programming. Athena Scientific
(1996)
[10] Boyan J.A., Moore A.W.: Learning evaluation functions for global optimiza-
tion and Boolean satisfiability. In: Press A. (ed.) Proc. of 15th National Conf.
on Artificial Intelligence (AAAI), pp. 3–10 (1998)
[11] Brunato M., Battiti R., Pasupuleti S.: A memory-based rash optimizer. In:
Geffner A.F.R.H.H. (ed.) Proceedings of AAAI-06 Workshop on Heuristic
Search, Memory Based Heuristics and Their applications, Boston, Mass., pp.
45–51, ISBN 978-1-57735-290-7 (2006)
[12] Eiben A., Horvath M., Kowalczyk W., Schut M.: Reinforcement learning
for online control of evolutionary algorithms. In: Brueckner S.A., Hassas S.,
Jelasity M., Yamins D. (eds.) Proceedings of the 4th International Workshop
on Engineering Self-Organizing Applications (ESOA’06), Springer Verlag,
LNAI 4335, pp. 151–160 (2006)
[13] Epstein S.L., Freuder E.C., Wallace R.J.: Learning to support constraint pro-
grammers. Computational Intelligence 21(4):336–371 (2005)
[14] Fong P.W.L.: A quantitative study of hypothesis selection. In: International
Conference on Machine Learning, pp. 226–234 (1995). URL
citeseer.ist.psu.edu/fong95quantitative.html
[15] Hamadi Y., Monfroy E., Saubion F.: What is Autonomous Search? Tech. Rep.
MSR-TR-2008-80, Microsoft Research (2008)
[16] Hoos H.: An adaptive noise mechanism for WalkSAT. In: Proceedings of the
National Conference on Artificial Intelligence, AAAI Press; MIT Press, vol.
18, pp. 655–660 (1999)
[17] Hoos H., Stuetzle T.: Stochastic Local Search: Foundations and applications.
Morgan Kaufmann (2005)
[18] Hutter F., Tompkins D., Hoos H.: Scaling and probabilistic smoothing: Effi-
cient dynamic local search for SAT. In: Proc. Principles and Practice of Con-
straint Programming - CP 2002, Ithaca, NY, Sept. 2002, Springer LNCS, pp.
233–248 (2002)
[19] Hutter F., Hoos H.H., Stützle T.: Automatic algorithm configuration based on
local search. In: Proc. of the Twenty-Second Conference on Artificial Intelli-
gence (AAAI ’07), pp. 1152–1157 (2007)
[20] Lagoudakis M., Littman M.: Algorithm selection using reinforcement learn-
ing. Proceedings of the Seventeenth International Conference on Machine
Learning, pp. 511–518 (2000)
[21] Lagoudakis M., Littman M.: Learning to select branching rules in the DPLL
procedure for satisfiability. LICS 2001 Workshop on Theory and Applications
of Satisfiability Testing (SAT 2001) (2001)
[22] Lagoudakis M., Parr R.: Least-Squares Policy Iteration. Journal of Machine
Learning Research 4(6):1107–1149 (2004)
[23] Mitchell D., Selman B., Levesque H.: Hard and easy distributions of SAT prob-
lems. In: Proceedings of the Tenth National Conference on Artificial Intelli-
gence (AAAI-92), San Jose, CA, pp. 459–465 (1992)
[24] Muller S., Schraudolph N., Koumoutsakos P.: Step size adaptation in Evolution
Strategies using reinforcement learning. Proceedings of the 2002 Congress on
Evolutionary Computation, 2002 CEC’02 1, pp. 151–156 (2002)
[25] Prestwich S.: Tuning local search by average-reward reinforcement learning.
In: Proceedings of the 2nd Learning and Intelligent OptimizatioN Conference
(LION II), Trento, Italy, Dec. 10–12, 2007, Springer, Lecture Notes in Com-
puter Science (2008)
7.1 Introduction
Evolutionary Algorithms (EAs) constitute efficient solving methods for general op-
timization problems. From an operational point of view, they can be considered as
basic computational processes that select and apply variation operators over a set
of possible configurations of the problem to be solved, guided by the Darwinian
“survival of the fittest” paradigm.
In order to efficiently apply an EA to an optimization problem, there are many
choices that need to be made. Firstly, the design of the general skeleton of the al-
gorithm must include the selection of a suitable encoding for the search space at
hand, the management of the population (i.e., size setting, selection and replace-
ment processes, and so on), and the definitions of the variation operators that will be
used, namely the mutation and recombination operators. These components can be
considered as the structural parameters of the algorithms that define its operational
architecture. However, once the structure is defined, a difficult and crucial task re-
mains: how to control the general computation process? This control is usually em-
bedded in a set of behavioral parameters that can be related to the data structures or
to the computation steps, e.g., the application rate of each variation operator.
All these parameters should then be tuned, depending on the problem at hand.
In the early days of Evolutionary Computation, these numerous possible choices
were in fact considered a richness, providing very useful flexibility to EAs, which
could indeed be applied to a very wide range of scientific fields. Nowadays, the
contemporary view of EAs acknowledges that specific problems require specific
setups in order to obtain satisfactory performance [13]. In other words, when
it comes to solving a given problem, all practitioners know that parameter set-
ting is in fact the Achilles’ heel of EAs (together with their high computational
cost).
As the efficiency of Evolutionary Algorithms has already been experimentally
demonstrated on many difficult problems out of the reach of other optimization
methods, more and more scientists (researchers, engineers) are trying them out on
their specific problems. However, they very often fail to obtain interesting results
precisely because there is a lack of general methods for tuning at least some of
the involved parameters (and also because they are not, and do not want to be-
come, “Evolutionary Engineers”). Of course, the specialization of an EA to a given
problem has an impact on its performance on other problems (according to the No
Free Lunch Theorems for Optimization [44]). In fact, through parameter setting, the
main challenge is to set (or select) the right algorithm for the given problem, which
is an old problem in computer science [39]. Therefore, it is time that we Cross the
Chasm [33], bringing the benefits of EAs to the whole scientific community without
its main burden, that of parameter setting.
A current trend in EA is to focus on the definition of more autonomous solving
processes, which aim at allowing the basic user to benefit from a more efficient
and easy-to-use algorithmic framework. Parameter setting in EAs is a major issue
that has received much attention in recent years [14], and research on it is still very
active nowadays, as witnessed by a recently published book [27] and by the numerous
recent references cited in this document. This subject is not limited to EAs, being also
investigated in the operations research and constraint programming communities, where
the current solving technologies that are included in efficient solvers require huge
expertise to be fully used (see, for instance, [4]).
Parameter setting in EAs may be considered at two complementary levels as
follows [36]:
Design: In many application domains that directly pertain to standard representa-
tions, users who are not EA experts can simply use off-the-shelf EAs with classic
(and thus non-specialized) variation operators to solve their problems. However, the
same users will encounter great difficulties when faced with problems outside the
basic frameworks. Even if standard variation operators exist in the literature (such
as the uniform crossover [40]), the achievement of acceptable results depends nec-
essarily on the specialization of the algorithmic scheme, which usually requires the
7 Adaptive Operator Selection and Management in Evolutionary Algorithms 163
Fig. 7.1: General scheme of the controller, depicting its two main components,
Adaptive Operator Selection and Adaptive Operator Management
164 J. Maturana, Á. Fialho, F. Saubion, M. Schoenauer, F. Lardeux, and M. Sebag
This chapter will focus on these control mechanisms, by reviewing existing solu-
tions previously proposed by the authors, and providing examples of their possible
applications by means of case studies. Firstly, an overview of the current state
the art in parameter setting in EAs and, more specifically, in adaptive control of
operators, is presented in Section 7.2. Then, some efficient AOS combinations are
presented in Section 7.3. A case study of the application of an AOM technique, com-
bined with the AOS, is presented in Section 7.4. In both cases, experimental results
are presented, applying the adaptive EA to the solution of SAT problems. Finally,
conclusions and possible paths for further research are drawn in Section 7.5.
Slightly departing from the point of view adopted in previous surveys [14, 13], we
shall firstly distinguish here between external tuning and internal control of the
parameters.
Methods that perform external tuning consider the algorithm as a black box:
some external process is run (e.g., another optimization algorithm, or some standard
statistical analysis), to determine a good (range of) value(s) for the parameters at
hand. Most of these methods are not limited to Evolutionary Algorithms and can
be applied to any type of algorithm, although some of them have been designed
in the evolutionary framework. The most straightforward approach is the complete
factorial Design of Experiments, in which the same number of runs is executed for
each setting, and the one that achieves the best performance on average is selected
for the algorithm. Other approaches have been recently proposed, such as the SPO
[3], in which the optimal values are refined after a few runs; the Racing methods
[6, 46], in which settings are discarded as soon as they are proven worse than others; and the
recent Meta-EA approaches [8], in which some runs are actually stopped at some
point of the evolution, and some of their parameters are modified before they are
restarted.
On the other hand, internal control methods work directly on the values of the
parameters while solving the problem, i.e., on-line. Such mechanisms for
modifying parameters during an algorithm's execution were introduced early in EC his-
tory, and most of them are still being investigated nowadays. Indeed, there are at
least two strong arguments to support the idea of changing the parameters during an
Evolutionary Algorithm run:
• As evolution proceeds, more information about the landscape is known by the
algorithm, so it should be possible to take advantage of it. This applies to global
properties (for example, knowing how rugged the landscape is) and to local ones
(for example, knowing whether a solution has been improved lately or not).
• As the algorithm proceeds from a global (early) exploration of the landscape to
a more focused, exploitation-like behavior, the parameters should be adjusted to
take care of this new reality. This is quite obvious, and it has been empirically
and theoretically demonstrated that different values of parameters might be optimal
at different stages of the search process (see [13, p. 21] and references therein).
The different approaches that have been proposed to internally adapt the parame-
ters can be classified into three categories, depending on the type of information that
is used to adjust the parameters during the search (such classification, used by many
authors in the 1990s, has been made in [14] and republished in the corresponding
chapter of [13]).
Deterministic parameter control implements a set of deterministic rules, with-
out any feedback from the search. This is, in general, hard to achieve, because of
a simple reason: it relies heavily on knowing beforehand how long it will take to
achieve a given target solution with the running algorithm and, obviously, this can-
not be easily predictable. But even if it were, the way to balance exploration and
exploitation can hardly be guessed. This situation is worsened by two facts: first,
there is usually a big variance between different runs on the very same problem; and
second, these methods often require additional parameters that are used to tune the
deterministic parameter (starting with the total number of generations the algorithm
will run), and even though these parameters can be considered second-order, their
influence is nevertheless critical.
Since our knowledge about the way the search should behave is always limited,
it is sometimes possible, and advantageous, to let evolution itself tune some of the
parameters: such a Self-Adaptive parameter control adjusts parameters “for free”,
i.e., without any direct specification from the user. In other words, individuals in the
population might contain “regulatory genes” that control some of the parameters,
e.g., the mutation and recombination rates; and these regulatory genes are subject
to the same evolutionary processes as the rest of the genome [23]. For quite some
time in the 1990s, self-adaptation was considered as the royal road to success in
Evolutionary Computation. First of all, the idea that the parameters are adapted for
free is very appealing, and its parallel with the behavior of self-regulated genes is
another appealing argument. On the practical side, as self-adaptive methods require
little knowledge about the problem and, what is probably more important, about
the way the search should proceed, they sometimes are the only way to go when
nothing is actually known about the problem at hand. However, by using such an ap-
proach, the algorithm needs to explore, in parallel, the search space of the variables
of the problem and also the search space of the parameter values, which potentially
increases the complexity of the search.
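As an illustration of such "regulatory genes", the evolution-strategy-style sketch below attaches a mutation step size to each individual and mutates it together with the solution. It is a generic textbook scheme, written here as an assumption rather than a mechanism taken from the chapter.

import math
import random

def self_adaptive_mutation(individual, tau=0.1):
    # The individual carries its own step size sigma; sigma is mutated log-normally
    # first, and the mutated sigma is then used to perturb the solution itself.
    x, sigma = individual
    sigma = sigma * math.exp(tau * random.gauss(0.0, 1.0))
    x = [xi + sigma * random.gauss(0.0, 1.0) for xi in x]
    return (x, sigma)

# e.g., parent = ([0.0, 0.0, 0.0], 0.5); offspring = self_adaptive_mutation(parent)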
Then, it is clear that, whenever some decisions can be made to help the search
follow an efficient path, this should be done. Adaptive or Feedback-based methods
follow this rationale, being the most successful approaches today in on-line param-
eter control. These methods are based on the monitoring of particular properties of
the evolutionary process, and use changes in these properties as an input signal to
change the parameter values. The controllers presented in this work are included in
this category.
Adaptive Operator Selection (AOS), presented in Section 7.3, is used to au-
tonomously select which operator, among the available ones, should be applied by
the EA at a given instant, based on how they have performed in the search so far.
The AOS learns the performance of each operator based on a performance measure,
assessed by what is usually referred to as the Credit Assignment mechanism. Starting
with the initial work in AOS, from the late 1980s [12], several methods have been
proposed to do this, with different ways of transforming the impact of the operator
application into a quality measure.
Most of the methods use only the fitness improvement as a reward, i.e., the qual-
ity gain of the newborn offspring compared to a base individual, which might be (i)
its parent [26, 42, 2], (ii) the current best individual [12], (iii) or the median indi-
vidual [24] of the current population. If no improvement is achieved, usually a null
credit is assigned to the operator.
Regarding which operator should be credited after the achievement of a given
fitness improvement, the most common approach is to assign a credit to the oper-
ator that was directly responsible for the creation of the newborn offspring. Some
authors, however, propose assigning credit to the operators used to generate the
ancestors of the current individual (e.g., using some bucket brigade-like algorithm
[12, 24]), based on the claim that the existence of efficient parents is indeed as
important as the creation of improved offspring. Others, however, do not consider
ancestors at all ([26, 42]), and some even suggest that this sometimes degrades the
results [2].
The existing approaches also differ in the statistics that are considered in order to
define the numerical reward. Most of the methods calculate their rewards based just
on the most recent operator application, while others use the average of the quality
achieved over a few applications. More recently, in [43], the utilization of extreme
values over a few applications was proposed (statistical outliers), based on the idea
that highly beneficial but rare events might be better for the search than regular but
smaller improvements. The reported comparative results with other Credit Assign-
ment mechanisms show the superiority of this approach over a set of continuous
benchmark problems. Though used in a different context, the methods presented in
this chapter are also based on this idea.
The most common and straightforward way of doing Operator Selection is the so-
called Probability Matching (PM) [20, 26, 2]. In brief, each operator is selected by a
roulette-wheel-like process with a probability that is proportional to its known em-
pirical quality (as defined by the Credit Assignment mechanism, e.g., the average of
the received rewards). A user-defined α parameter might be used to introduce some
relaxation in the update of this empirical estimation. Besides, a minimal probability
(pmin ) is usually applied, so that no operator is “lost” during the process: one opera-
tor that is currently bad might become useful at some further stage of the search. If
an operator receives just null rewards (or the maximal reward) for some time, its ex-
pected reward will go to pmin (or pmax = 1 − K ∗ pmin ). However, this convergence is
very slow, and, experimentally, mildly relevant operators keep being selected, which
badly affects the performance of the algorithm [41].
Originally proposed for learning automata, Adaptive Pursuit (AP) [41] is an Op-
erator Selection technique that partly addresses this drawback by implementing a
winner-take-all strategy. Another user-defined parameter, β , is used to control the
greediness of the strategy, i.e., how fast the probability of selecting the current best
individual will converge to pmax while that of selecting the others will go to pmin .
In its original proposal for the AOS problem [41], AP was applied in a set of artifi-
cial scenarios, in which the operator qualities were changing every Δ t applications,
achieving significant improvements over the performance of PM.
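The two selection rules can be summarized with the sketch below. Parameter values and class names are illustrative, and the update formulas follow the standard descriptions of PM and AP rather than any specific implementation from the chapter.

import random

class ProbabilityMatching:
    # Selection probabilities proportional to the (relaxed) empirical qualities,
    # with a floor p_min so that no operator is ever completely discarded.
    def __init__(self, n_ops, p_min=0.05, alpha=0.3):
        self.n, self.p_min, self.alpha = n_ops, p_min, alpha
        self.q = [1.0] * n_ops
        self.p = [1.0 / n_ops] * n_ops

    def select(self):
        return random.choices(range(self.n), weights=self.p)[0]

    def update(self, op, reward):
        self.q[op] += self.alpha * (reward - self.q[op])
        total = sum(self.q) or 1.0
        self.p = [self.p_min + (1 - self.n * self.p_min) * qi / total for qi in self.q]

class AdaptivePursuit(ProbabilityMatching):
    # Winner-take-all variant: the current best operator is pushed towards p_max
    # at a rate beta, all the others towards p_min.
    def __init__(self, n_ops, p_min=0.05, alpha=0.3, beta=0.3):
        super().__init__(n_ops, p_min, alpha)
        self.beta = beta
        self.p_max = 1.0 - (n_ops - 1) * p_min

    def update(self, op, reward):
        self.q[op] += self.alpha * (reward - self.q[op])
        best = max(range(self.n), key=lambda i: self.q[i])
        for i in range(self.n):
            target = self.p_max if i == best else self.p_min
            self.p[i] += self.beta * (target - self.p[i])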
Different approaches, such as APGAIN [45], propose an Operator Selection di-
vided into two periods. During a first learning stage (which is repeated several times
during the run, so the changes can be followed), the operators are randomly selected,
and the rewards gathered are used to extract initial knowledge about them. In a later
stage, such knowledge is exploited by the technique in order to efficiently select the
operators. The main drawback in this case is that roughly a quarter of the genera-
tions are dedicated to the learning, thus doing random selection, which may strongly
affect the performance of the algorithm if disruptive operators are present in the set.
Clearly, the main objective of parameter setting in EAs is to automate some of the
choices that should be made by the user. What happens in reality is that, although
some of the parameters are efficiently defined by these autonomous controllers, the
controllers themselves also have their own parameters that need to be tuned. This
section considers this issue.
If one wants to define “manually” (or by the use of some off-line tuning tech-
nique) the parameters of an EA related to the variation operators, several choices
need to be made. In addition to defining the list of operators that will be used by
the algorithm, we need at least one other parameter to define for each operator:
its application rate. Additionally, some operators also require some extra settings,
e.g., the rate of flipping a bit in a bit-flip mutation operator, or the number of cross-
ing points that will be used by an n-point crossover operator. Thus, considering a
static definition of the operators and their respective rates, the number of original
parameters to be defined is a multiple of the number of operators that is used by the
EA.
All these parameters are automatically handled by the proposed adaptive con-
trollers while solving the problem. The list of operators is managed by the Adaptive
Operator Management, while the application rates are abstracted by the Adaptive
Operator Selection. The extra parameters are abstracted as well, by considering dif-
ferent variants of the same operator as different operators, e.g., 1-point and 2-point
crossover are treated as different operators instead of different (discrete) settings of
the same operator.
As previously mentioned, to be able to do this, the controllers also introduce
their own parameters, usually called hyper- or meta-parameters. But, firstly, there is
a fixed number of meta-parameters instead of a multiple of the number of operators;
secondly, these parameters are (supposed to be) less sensitive than the original ones;
and finally, they are adapted online according to the needs of the search, while a
static setting would always be suboptimal, as discussed in Section 7.2. For example,
concerning the Operator Selection techniques reviewed in Section 7.2.2, Probability
Matching needs the definition of the minimal probability pmin and the adaptation
rate α ; while the Adaptive Pursuit needs the definition of a third parameter, the
learning rate β . For Credit Assignment, usually at least one parameter should be
considered: the number of operator applications that are taken into account to cal-
culate an empirical estimation that the controller keeps about each operator.
Even though the number of meta-parameters is fixed and small, given the initial
lack of knowledge about the behavior of the controllers and what would be a good
setting for them, off-line parameter tuning methods can be used to facilitate such
definitions.
The cost of acquiring information about parameters is variable: whereas some
parameters can be easily recognized as preponderant and easy to tune, others could
have little effect over the search (i.e., different parameter values do not significantly
affect the results), or could require many experiments to discover a correct value.
For example, information theory could be used to characterize the relevance of the
parameters, as in [34, 35]. The proposed method, called REVAC, uses Shannon and
differential entropies to find the parameters with higher impact on the efficiency of
the algorithm while estimating the utility of the possible parameters values.
Another possibility is the utilization of Racing methods: instead of performing
m runs on each of the n instances in order to find the best configuration (i.e., a
full factorial Design of Experiments), the candidate configurations are eliminated as
soon as there is enough statistical evidence that they are worse than the current best
one. Cycles of “execution/comparison/elimination” are repeated until there is only
one configuration left, or some other stopping criterion is achieved. So, instead of
wasting computing time to estimate with precision the performance of inferior can-
didates, Racing allocates the computational resources in a better way, by focusing
on the most promising ones and consequently achieving lower variance estimates
for them. Racing typically requires 10-20% of the computing time of a complete
factorial Design of Experiments.
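A stripped-down racing loop is sketched below. It uses a paired t-test instead of F-Race's Friedman-based test, configurations are assumed to be hashable (e.g., tuples of parameter values), and run(config, instance) is an assumed user-supplied evaluation returning a cost to be minimized.

import statistics
from scipy import stats

def race(configs, instances, run, alpha=0.05, min_runs=5):
    # Evaluate all surviving configurations on each instance in turn and
    # eliminate those that are statistically worse than the current best.
    results = {c: [] for c in configs}
    alive = list(configs)
    for k, inst in enumerate(instances, 1):
        for c in alive:
            results[c].append(run(c, inst))
        if k < min_runs or len(alive) == 1:
            continue
        best = min(alive, key=lambda c: statistics.mean(results[c]))
        survivors = [best]
        for c in alive:
            if c is best:
                continue
            _, p = stats.ttest_rel(results[c], results[best])
            worse = (p < alpha) and (statistics.mean(results[c]) > statistics.mean(results[best]))
            if not worse:
                survivors.append(c)
        alive = survivors
    return alive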
In this section, we review three AOS methods from our previous work. Compass
[31] (Section 7.3.1) presents a Credit Assignment mechanism able to efficiently
measure the impact of the operators application, while Ex-DMAB [15] (Section
7.3.2) proposes an Operator Selection that quickly adapts to the dynamics of the
search. These two elements were combined into ExCoDyMAB [28], which is de-
scribed and empirically compared to the original methods, respectively, in Sections
7.3.3 and 7.3.4.
The Adaptive Operator Selection technique proposed in [31] uses a very simplis-
tic Operator Selection mechanism, Probability Matching. However, its Credit As-
signment mechanism is very efficient in measuring the impact of the operator’s ap-
plication, thus being the main contribution of this work. As is known, in case of
multimodal functions, fitness improvement is not the only important criterion for
the progress of the search; some level of diversity should also be maintained in the
population; otherwise the search will quite probably converge prematurely and get
trapped into a local optimum. Based on these assumptions, Compass provides a way
to encourage improvements for both criteria, and works as follows.
Firstly, each time an operator is applied, three measures are collected: the vari-
ations in the population diversity and in the mean fitness of the population, respec-
tively Δ D and Δ Q, and also its execution time T , as shown in Fig. 7.2(a). The
average performance of each operator w.r.t. Δ D and Δ Q over the last τ applications
is displayed in a “diversity vs. fitness” plot, represented by the points in Fig. 7.2(b).
A user-defined angle θ defines the compromise between obtaining good results and
maintaining diversity in the population, addressing the EvE dilemma. In practice,
such an angle defines the plane from which perpendicular distances to the points
are measured. Finally, these measures (δi ) are divided by the operator’s execution
time to obtain the final aggregated evaluation of each one (Fig. 7.2(c)), which is the
reward used to update the selection preferences.
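A compact sketch of this credit assignment is given below. The signed-distance computation is one plausible reading of the geometry described above; the exact convention used by Compass in [31] may differ, and all names are illustrative.

import math
from collections import deque

class CompassCredit:
    # For each operator, a sliding window of the last tau applications stores
    # (delta_D, delta_Q, exec_time). The reward is the signed perpendicular
    # distance of the averaged (delta_D, delta_Q) point to the plane defined by
    # the angle theta, divided by the mean execution time.
    def __init__(self, n_ops, theta=math.pi / 4, tau=50):
        self.theta = theta
        self.windows = [deque(maxlen=tau) for _ in range(n_ops)]

    def record(self, op, delta_d, delta_q, exec_time):
        self.windows[op].append((delta_d, delta_q, exec_time))

    def reward(self, op):
        win = self.windows[op]
        if not win:
            return 0.0
        d = sum(v[0] for v in win) / len(win)   # mean variation of population diversity
        q = sum(v[1] for v in win) / len(win)   # mean variation of mean fitness
        t = sum(v[2] for v in win) / len(win)   # mean execution time
        dist = d * math.sin(self.theta) + q * math.cos(self.theta)
        return dist / max(t, 1e-12)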
This approach was used to control a steady state evolutionary algorithm applied
to the well-known Boolean satisfiability problem (SAT) [9], automatically select-
ing from six ill-known operators. The SAT problem was chosen because it offers a
large variety of instances with different properties and search landscapes, besides
allowing the scaling of the instance difficulty. The experiments have demonstrated
that this AOS method is efficient and provides good results when compared to other
existing mechanisms, such as Adaptive Pursuit [41] and APGAIN [45], based on
the fitness improvement.

Fig. 7.2: Compass credit assignment: sliding windows of three measures are main-
tained (a). Average measures of ΔD and ΔQ are plotted and the distances of those
points from a plane with a slope of θ are measured (b). Finally, those distances are
divided by the execution time, resulting in the reward assigned to the operator (c)
Such performance was achieved mainly due to the strength of the Credit Assign-
ment mechanism proposed, which provides a robust measurement of the impact of
the operator application by simultaneously considering several criteria. The Opera-
tor Selection rule used (PM) is rather simple, known being quite conservative and
slow w.r.t. the adaptation, as already discussed in Section 7.2.2.
based on the number of times n_j such an operator has been tried. At each time step t,
the algorithm selects the arm with the best upper bound w.r.t. the following quantity:

    p̂_{j,t} + C √( log(∑_k n_{k,t}) / n_{j,t} )        (7.1)
The first term of this equation favors the best empirical arm (exploitation) while
the second term ensures that each arm is selected infinitely often (exploration).
In the original setting [1], all rewards are Boolean, and hence their empirical
means p̂i,t are in [0, 1]. As this is not the case in the context of AOS, a Scal-
ing factor C is needed in order to properly correct the balance between the two
terms.
The most important issue, however, is that the original MAB setting is static,
while the AOS scenario is dynamic, i.e., the quality of the operators is likely to vary
during the different stages of the search. Even though the exploration term in the
UCB algorithm ensures that all operators will be tried infinitely many times, in case
a change occurs (w.r.t. the operator’s performance), it might take a long time before
the new best operator catches up. To be able to efficiently follow the changes in
dynamic environments, it has been proposed [21] to combine a statistical test with
the UCB algorithm that efficiently detects changes in time series, the Page-Hinkley
(PH) test [37]. Basically, as soon as a change in the reward distribution is detected
(based on a threshold γ ), e.g., the “best” operator is probably not the best anymore,
the UCB algorithm is restarted from scratch, thus being able to quickly rediscover
the new best operator.
The UCB algorithm involves one meta-parameter, the scaling factor C, while the
PH test involves two parameters, the threshold γ for the change detection, and δ ,
which enforces the robustness of the test when dealing with slowly varying envi-
ronments. Note that, according to initial experiments in [10], δ has been kept fixed
at 0.15. The UCB algorithm, coupled with the Scaling factor and the PH change-
detection test, has been termed as Dynamic MAB (DMAB), with the complete AOS
combination being denoted by Ex-DMAB.
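The combination can be sketched as follows. The Page-Hinkley formulation below is one common variant for detecting a drop in the rewards, and the class layout and default values are illustrative, not the authors' implementation.

import math

class DMAB:
    # UCB-style selection with a scaling factor C, restarted from scratch whenever
    # a Page-Hinkley test (threshold gamma, tolerance delta) detects a change.
    def __init__(self, n_ops, C=1.0, gamma=10.0, delta=0.15):
        self.n_ops, self.C, self.gamma, self.delta = n_ops, C, gamma, delta
        self.reset()

    def reset(self):
        self.counts = [0] * self.n_ops
        self.means = [0.0] * self.n_ops
        self.ph_mean = 0.0   # running mean of all rewards
        self.ph_m = 0.0      # cumulative deviation
        self.ph_M = 0.0      # maximum of the cumulative deviation
        self.ph_t = 0

    def select(self):
        for i in range(self.n_ops):                 # try every operator once first
            if self.counts[i] == 0:
                return i
        total = sum(self.counts)
        return max(range(self.n_ops),
                   key=lambda i: self.means[i]
                   + self.C * math.sqrt(math.log(total) / self.counts[i]))

    def update(self, op, reward):
        self.counts[op] += 1
        self.means[op] += (reward - self.means[op]) / self.counts[op]
        # Page-Hinkley test on the reward sequence: a persistent drop makes
        # ph_M - ph_m grow, and crossing gamma triggers a restart.
        self.ph_t += 1
        self.ph_mean += (reward - self.ph_mean) / self.ph_t
        self.ph_m += reward - self.ph_mean + self.delta
        self.ph_M = max(self.ph_M, self.ph_m)
        if self.ph_M - self.ph_m > self.gamma:
            self.reset()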
Ex-DMAB has been used to adapt a (1+λ )-EA by efficiently choosing on-line
from four mutation operators to solve the OneMax problem [10], and has been tried
on yet another unimodal benchmark problem, the Long k-Path [16], this time ef-
fectively selecting from five mutation operators. In both cases, the optimal oper-
ator selection strategy was extracted by means of Monte Carlo simulations, and
Ex-DMAB was shown to perform statistically equivalently to it while significantly
improving over the naive (uniform selection) approach. It has also been used to
adapt a (100,100)-EA with four crossover operators and one mutation operator in
the Royal Road problem [17], also performing significantly better than the naive
approach. For the three problems, we have also used other AOS combinations as
baselines for comparison, namely Adaptive Pursuit, Probability Matching and static
MAB (UCB without restarts), coupled with Extreme or Average rewards. Ex-DMAB
was shown to be the best option.
However, as its Credit Assignment considers only the fitness improvement, Ex-
DMAB would probably be less efficient on rougher landscapes, quickly converg-
ing to local optima. In order to solve multimodal problems, diversity should also
be considered somehow. Besides being problem-dependent, another drawback of
Ex-DMAB is that the variance of the fitness improvements changes as the search ad-
vances (in the beginning of the search larger improvements are more easily achieved,
while in the fine-tuning final stage only very small improvements are possible).
Thus, there do not exist robust values for the meta-parameters C and γ that are
likely to make it perform well during the whole search. A possible solution to this
is the utilization of comparison-based rewards (that, by definition, always keep the
same variance, no matter what the problem and the search stage), e.g., a credit based
on ranks or the use of some normalized measures.
The previous sections showed that the strengths and weaknesses of both Compass
and Ex-DMAB methods are complementary: Compass measures in a holistic way the
effects of the operator applications over the population, but the operator selection
rule is rather rudimentary, while DMAB has an effective way to adapt and select the
operators, but its credit assignment mechanism is probably too simplistic. It seems
hence natural to combine the Credit Assignment of the former with the Operator
Selection mechanism of the latter.
However, even though merging both modules seems to be straightforward, some
important issues need to be further explored:
• Compass uses sliding windows of size τ in the “measurement stage”, with a
unique reward value in its output, while Ex-DMAB stores in a sliding window
the last κ outputs (rewards) of its Credit Assignment module. Should we keep
both windows, or would it be redundant? And if only one is kept, which one
should it be? From here on, these two windows will be referred to as W 1 for the
measurement window and W 2 for the reward window.
• Another issue concerning the sliding windows is that of their usage: should the
algorithm use their average (A), extreme (E), or simply instantaneous (I) value
(equivalent to using no window at all)? The Extreme was shown to be stronger
in unimodal problems, but how do such results hold in this completely different
scenario?
• The last issue concerns the other meta-parameters. Besides tuning the size
and type of W 1 and W 2, we also need to tune the values of the angle θ
in Compass, and the scaling factor C and change detection threshold γ in
DMAB. Since the idea is not to simply replace some parameters (the oper-
ator application probabilities) by other ones at a higher level of abstraction,
we need to better understand their effects. One way to find answers to this
Table 7.2: Comparative results on the 22 SAT instances: average (std dev) number
of false clauses (over 50 runs)
The performance of ExCoDyMAB (C) was compared with the baseline techniques,
the original Compass and Ex-DMAB, and also with the Uniform (naive) selection,
with 50 runs being performed on 22 SAT instances, obtained from [22] and from
the SAT Race 2006. A preliminary off-line tuning of the meta-parameters by means
of F-Race [6] was also done for the other techniques, in order to compare them at
their best (more details in [28]).
The results are presented in Table 7.2. The columns show the mean number of
false clauses after 5,000 function evaluations, averaged over 50 runs, and the stan-
dard deviation in parentheses, with the best results for each problem presented in
boldface. The last line of the table summarizes the comparison, by showing the
number of “wins - losses” of ExCoDyMAB compared to each of the baseline tech-
niques, according to a Student t-test with 95% confidence.
Note that the purpose of this work was not to build an amazing SAT solver, but
rather to experiment with a different AOS and validate ExCoDyMAB within an EA
solving a general, difficult combinatorial problem. The results of Table 7.2 show that
a basic EA using rather naive operators can indeed solve some instances. The main
interesting result is that this set of benchmarks was difficult enough to highlight
the benefit of using the proposed combination of Compass and Ex-DMAB rather
than either separately – or a naive, blind choice. The deliberate choice of several
non-specialized operators was also important for validating the control ability of
ExCoDyMAB when facing variation operators of very different characteristics and
performance. Competing for SAT races implies using highly specialized operators,
such as the ones implemented in GASAT [25], and is left for further consideration.
This study highlights the importance of both control stages of AOS, namely
credit assignment and operator selection rules. Both features, the Compass Credit As-
signment combined with the DMAB Operator Selection, contribute to the overall per-
formance of the proposed autonomous control method, explaining the efficiency
gain over each previous method used alone. Additionally, using the Extreme values
from the aggregated performance measure allows the algorithm to identify occa-
sional but highly beneficial operators, while the inclusion of population diversity
with the traditional fitness improvement measure contributes to escaping from local
optima.
One main drawback of ExCoDyMAB is the tuning of its meta-parameters. Though
the normalization of fitness improvements and diversity by Compass might result in
a possible robust setting for the scaling parameter of the MAB balance (i.e., the
value found here using Racing), further work is needed for a deeper understanding
of how to tune the meta-parameter γ that triggers the change detection test. As pre-
viously mentioned, the off-line meta-parameter tuning using the F-Race paradigm
can be done in a fraction (15%) of the time needed for a complete factorial DOE,
but this is still quite costly.
Although its good performance relies on expensive tuning procedures, ExCoDyMAB was
found to outperform the main options available to the naive EA user, namely (i)
using a fixed or deterministic strategy (including the naive, uniform selection); and
(ii) using a different AOS strategy, including the combinations previously proposed
by the authors. Furthermore, ExCoDyMAB involves a fixed and limited number of
parameters, whereas the number of operator rates increases with the number of op-
erators, as discussed in Section 7.2.3.
The design of the algorithm has great influence on the search performance. It can be
seen as a parameter setting at a higher level of abstraction: one must decide, from
several choices, the best configuration for the algorithm in order to maximize its
performance. However, in this case “structural” parameters are considered instead
of “behavioral” ones. In the case of operator control, instead of deciding on differ-
ent values for the application rate of a given operator, the decision is whether a given
operator should be considered or not by the algorithm. This can be seen as a parameter
optimization problem over a search space composed of algorithms.
This is usually done manually: the user decides, based mainly on his experience,
which operators should be included in the search algorithm; or he tries a few config-
urations, assessing their performance in order to set down the definitive algorithm.
As with the setting of the behavioral parameters, this design could also be as-
sisted by parameter setting techniques, such as Racing or REVAC (see Section 7.2).
While the utilization of off-line tuning techniques is straightforward, providing
the user a static “good” configuration, one might wonder how to do this in a dy-
namic way, while solving the problem. The motivations to do so would be the same
as those that led us to propose the AOS methods: the efficiency of the operators
changes according to the region of the search space that is currently being explored;
thus a useless operator might be excluded from the algorithmic framework, while
possibly useful ones might be included. The autonomous handling of the operators
included in the algorithm while solving the problem is what we refer to as Adap-
tive Operator Management (AOM). Besides being yet another approach of doing
parameter control, such an approach can also be described in the light of the recent
field of hyper-heuristics [7].
the placement of the population over the search space has changed. This could mean
that the information stored in the profiles of the dead operators is no longer useful;
thus there are only two states: dead and alive. Nevertheless, we prefer to consider
the three-state model, which offers wider possibilities in terms of previous knowledge
inclusion.
Such states are linked by the following three transitions:
• Birth, when an operator is included for the first time into the search.
• Death, when an operator has been considered useless for the current state of the
search, and thus removed from it.
• Revival, when a previously dismissed operator is brought back to the search.
The core of Adaptive Operator Management lies in defining the conditions under
which the transitions must take place. These conditions correspond to the different
criteria described above, which define whether and when an operator should be in-
cluded and eliminated from the search.
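The three states and transitions can be pictured with the minimal manager below. Thresholds, the revival criterion and all names are illustrative simplifications, not what Blacksmith actually implements.

class OperatorManager:
    # Operators move between three states: unborn (only the name is known),
    # alive (recent rewards are recorded) and dead (archived profile).
    def __init__(self, definitions, max_alive=6, min_applications=5):
        self.unborn = list(definitions)
        self.alive = {}            # name -> list of recent rewards
        self.dead = {}             # name -> archived rewards (profile)
        self.max_alive, self.min_apps = max_alive, min_applications
        self.birth()

    def birth(self):
        while self.unborn and len(self.alive) < self.max_alive:
            self.alive[self.unborn.pop(0)] = []

    def record(self, name, reward):
        self.alive[name].append(reward)

    def cull(self):
        # Death: remove the worst operator among those applied often enough,
        # then refill the pool by birth or, if no unborn operator is left, by revival.
        tried = {n: r for n, r in self.alive.items() if len(r) >= self.min_apps}
        if not tried:
            return
        worst = min(tried, key=lambda n: sum(tried[n]) / len(tried[n]))
        self.dead[worst] = self.alive.pop(worst)
        if self.unborn:
            self.birth()
        else:
            candidates = [n for n in self.dead if n != worst]
            if candidates:   # Revival: bring back the best previously dismissed operator
                name = max(candidates, key=lambda n: sum(self.dead[n]) / len(self.dead[n]))
                self.alive[name] = self.dead.pop(name)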
The framework presented here, Blacksmith, aims to control the design by modify-
ing the set of operators available to the algorithm during its execution. The struc-
ture of the entire controller is shown in Figure 7.4. In addition to comprising the
Adaptive Operator Selection component, Adaptive Operator Management is com-
posed of a set of operator definitions from which the operators are extracted. This
set could be simply a list of possible operators that can be included in the search,
or an entire module that mixes subcomponents to create operators in an intelligent
manner.
Blacksmith manages operator states by defining a couple of data structures that
are attached to the operator definition. The rules that define the transition from one
state to another can be summarized as follows:
1. Initially, when an operator is in the Unborn state, Blacksmith only knows its name.
2. Once operators are born (i.e., they pass from unborn to alive state) two data
structures are associated with the operator name: data, which stores recent mea-
sures of performance, and profile, which summarizes the information in data,
by calculating meaningful statistics. A fixed number of operators is kept in the
registry, and they are evaluated at regular intervals.
3. Operators that have been applied a sufficient number of times are considered
for elimination. This condition is required to ensure that the operator has low
Fig. 7.4: Scheme of the controller used in this work, depicting its two main compo-
nents, Adaptive Operator Selection and Adaptive Operator Management
1. Selection of clauses that are false in both parents: select none, in chrono-
logical order, randomly, randomly from the smallest and randomly from the
biggest.
2. Action to perform on the selected clauses: either take no action or consider
one of four ways to flip variables with different criteria.
3. Selection of clauses that are true in both parents: same as in (1).
4. Action to perform on the selected clauses: same as in (2). (The design space spanned by these choices is enumerated in the sketch below.)
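A sketch of that design space is given below. The option labels are placeholders standing for the choices listed above, and some combinations may be redundant or invalid in the actual operator set.

from itertools import product

false_selection = ["none", "chronological", "random", "random_smallest", "random_biggest"]
false_action    = ["no_action", "flip_rule_1", "flip_rule_2", "flip_rule_3", "flip_rule_4"]
true_selection  = list(false_selection)   # same options as for the false clauses
true_action     = list(false_action)

# every candidate crossover operator is one combination of the four design choices
operator_definitions = list(product(false_selection, false_action, true_selection, true_action))
print(len(operator_definitions))          # 5 * 5 * 5 * 5 = 625 raw combinations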
Fig. 7.5: Credit Assignment schemes. Compass (a), Pareto Dominance (b), Pareto
Rank (c)
After the application of an operator, the values of Δ Diversity and Δ Quality are
sent to the controller. The credit assignment module then computes the evaluation
(C, PD or PR, depending on the scheme selected), and normalizes the resulting val-
ues across all operators (normalization is achieved by dividing values by the highest
possible value). The normalized values are stored in the Credit Registry as the as-
signed rewards. A list of the last m rewards of each operator (corresponding to its
last m applications) is recorded in the registry in order to provide to the operator
selection module an updated history of the performances of each operator.
The operator selection module selects the next operator to be applied based on its
past performances, but without neglecting an exploration of the possible available
operators. As mentioned in Section 7.3.2, Ex-DMAB is inspired by the multi-armed
bandit approach, used in game theory. The strategy always chooses the operator that
maximizes expression (7.1).
However, this expression relies on the assumption that all operators are present in
the evolutionary algorithm from the beginning of the run. If an operator is inserted
during execution, its value of ni,t would be so low that the bandit-based AOS would
have to apply it many times in order to adjust its exploration term with regards to
the rest of the operators.
Since we are interested in operators that enter and exit the evolutionary algo-
rithm during the search process, we have reformulated expression (7.1) in order
to deal with a dynamic set of operators. This is mainly achieved by replacing the
measure corresponding to the number of times that an operator has been applied
with another criterion that corresponds to the number of generations elapsed since
the last application of the operator (i.e., its idle time). This allows a new operator
to immediately increase its evaluation by applying it once. The new evaluation of
performance is then defined as follows:
where io,t is the idle time of operator o at time t, NOt is the number of operators
considered by the Operator Selection at time t, and x is the number of times the
controller must wait before compulsorily applying the operator o. The behavior of
the exploration component is better understood by looking at Figure 7.6. The value
stays close to zero except when io,t is close to x × NOt . Since the values of ro,t
are normalized in [0, 1], when an operator has not been applied for a long time, its
application becomes mandatory. p is a positive parameter that adjusts the slope of
the exponential.
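The chapter's exact expression is not reproduced in the text above; the sketch below gives one plausible form that matches the description (a value near zero until the idle time i_{o,t} approaches x · NO_t, then a steep exponential growth controlled by p). It should be read as an assumption, not as the authors' formula.

import math

def mab2_score(reward, idle, n_ops, x=5, p=10.0):
    # reward : normalized empirical reward of the operator, in [0, 1]
    # idle   : generations elapsed since the operator was last applied (i_{o,t})
    # n_ops  : number of operators currently considered (NO_t)
    # The exploration term is negligible while idle << x * n_ops and reaches the
    # maximum possible reward once idle = x * n_ops, making application mandatory.
    exploration = math.exp(p * (idle - x * n_ops) / (x * n_ops))
    return reward + exploration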
Besides MAB2, referred to as M2 from here on, the following three different
Operator Selection methods were also analyzed:
• Random (R), which simply chooses uniformly at random among the operators
currently available in the EA.
Fig. 7.7: Number of false clauses of best individual so far, obtained by different
controllers and state-of-the-art crossovers when solving an industrial instance
1According to the usual taxonomy, this algorithm is a steady-state evolutionary algorithm; thus a
generation corresponds to the application of one operator.
There is a clear advantage for Credit Assignment based on Pareto Domi-
nance (PD), since it appears in the most efficient controllers, followed by C and PR.
One possible explanation is that PR considers all the operators placed on
the Pareto frontier equally (points in Figure 7.5(c) with value 0). This induces a balance
between the exploration and the exploitation tendencies, thus preventing the evolu-
tionary algorithm from leaning to one side or to the other. A similar behavior could
be observed when using Compass according to its performance measure method.
On the other hand, when using PD, the better evaluation of the operators that fol-
low the general tendency (points in Figure 7.5(b) with higher values) allows the
evolutionary algorithm to break the status quo and finally improve the quality of
the population. This “flexible balance” is the main asset of this Credit Assignment
method.
It is interesting to notice that the most exploratory Operator Selection methods
(PM and R) have produced some of the best results. It may seem surprising that
a random operator selection is able to overcome sophisticated methods that
carefully try to balance EvE at the operator selection level, such as the bandit-based
approaches. A possible hypothesis is that a mix of different crossover
operators works better than a single one due to the diversity of behavior the former
can manifest; however, the poor results obtained by R307 show that simply mixing
different operators does not produce good results.
Another, more plausible, hypothesis is that an exploratory Operator Selection
method will provide updated information to AOM, since the rate of application of
“bad” operators is higher than when an exploitative Operator Selection method is
used. In this manner, R and PM rapidly produce the five applications needed to
evaluate the unsuccessful operators and move them from alive to dead state.
Another factor to consider is that Adaptive Operator Management uses the same
criterion as Adaptive Operator Selection to choose which operator will survive.
Since AOM continually dismisses the worst operators, an important part of exploita-
tion is done by AOM; thus AOS can focus exclusively on exploration. Differently
stated, AOS performs exploration among the operators that AOM has decided to
keep by using an exploitative criterion.
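As a rough illustration of this interplay (a simplified sketch under assumed data structures,
not the authors' implementation), the following function keeps every operator alive until it
has been applied a minimum number of times and then dismisses the worst fully evaluated one:

def manage_operators(registry, min_applications=5):
    """Move clearly unsuccessful operators from 'alive' to 'dead' (simplified AOM sketch).

    registry: dict mapping operator name -> list of its recorded rewards.
    Returns the set of operators kept alive.
    """
    evaluated = {op: sum(r) / len(r) for op, r in registry.items()
                 if len(r) >= min_applications}      # only fully evaluated operators
    alive = set(registry)
    if len(evaluated) > 1:
        worst = min(evaluated, key=evaluated.get)    # exploitation is done by AOM
        alive.discard(worst)                         # dismiss the worst operator
    return alive

# Example: operator "xo3" has five poor rewards and is dismissed.
print(manage_operators({"xo1": [0.7, 0.6], "xo2": [0.5] * 5, "xo3": [0.05] * 5}))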
Table 7.3 shows the mean (standard deviation) of the performance of the algo-
rithms for each problem over 25 executions. We show here only the two most suc-
cessful controllers (PD-PM and PD-R). The best results (and those that are indistin-
guishable from them, using a Student t-test with 95% confidence) are in boldface.
Here we focus on the most successful controllers, PD-R and PD-PM.
PD-R obtained top results on 19 instances and PD-PM on 17, while CC and
CCTM were the best ones only five times. Even though the controller configura-
tions obtained the worst results on two industrial instances, we observed the best
improvements on this family of instances, especially on I5 and I6, in which the
adaptively controlled evolutionary algorithms obtained up to 260 times fewer false
clauses than the best state-of-the-art crossover. The results presented in this com-
parison show the generality of PD-PM and PD-R on different families and types
of instances. The overhead of this improvement is quite moderate since the time
dedicated to control represents less than 10% of the total execution time.
Table 7.3: Comparison of PD-PM and PD-R with state-of-the-art crossovers over
crafted, random and industrial instances. Number of false clauses and standard de-
viation
7.5 Conclusions
i.e., Pareto optimality (or a similar idea) of both mean fitness and population diver-
sity. In the way it is currently done, the same weight is given to both trends, the
exploration/diversification and the exploitation/intensification criteria. As is well known,
the needs of the search with respect to such trends vary as the search goes on; thus keeping the
same balance might lead to overall suboptimal behavior. Intuitively, one idea would
be to consider some strategy that starts by encouraging the exploration behavior, and
gradually shifts to the exploitation; or one that searches for a certain compromise
level of EvE [30, 32].
Another path of further research concerns the identification of the best operators
used during one or more runs. Even though our approach can choose appropriate
operators given a search state, this knowledge is not made explicit at
the end of the run. Such automatically extracted knowledge could be used to guide the selection
of the best operators for the next run, or to establish a mapping between operators and
the search states for which they are best suited.
Some further analysis should be done for both Adaptive Operator Selection and
Adaptive Operator Management concerning the relevance and sensitivity of their
meta-parameters, which would help us to better understand their behavior.
One could also remark that, in this architecture, bad operators have to be tried
before their poor performance can be assessed. Another architecture could be envisaged
to separate the evaluation of the operators from their effective use. For instance, in
a distributed solving scheme, some controllers could be devoted to the evaluation
of operator performance according to a given state of the search, while other controllers
could concentrate on the effective solving of the problem, the controllers exchanging
information about the current state of the search.
References
[1] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed
bandit problem. Machine Learning 47(2/3), 235–256 (2002)
[2] Barbosa, H. J. C., Sá, A. M.: On adaptive operator probabilities in real coded
genetic algorithms. In: Proc. XX Intl. Conf. of the Chilean Computer Science
Society (2000)
[3] Bartz-Beielstein, T., Lasarczyk, C., Preuss, M.: Sequential parameter opti-
mization. In: Proc. CEC’05, pp. 773–780. IEEE Press (2005)
[4] Battiti, R., Brunato, M., Mascia, F.: Reactive Search and Intelligent Opti-
mization, Operations Research/Computer Science Interfaces, vol. 45. Springer
(2008)
[5] Biere, A., Heule, M., van Maaren, H., Walsh, T. (eds.): Handbook of Satisfia-
bility, Frontiers in Artificial Intelligence and Applications, vol. 185. IOS Press
(2009)
[6] Birattari, M., Stützle, T., Paquete, L., Varrentrapp, K.: A racing algorithm for
configuring metaheuristics. In: W. B. Langdon et al. (ed.) Proc. GECCO ’02,
pp. 11–18. Morgan Kaufmann (2002)
[7] Burke, E. K., Hyde, M., Kendall, G., Ochoa, G., Ozcan, E., Woodward, J.:
Handbook of Meta-heuristics, chap. A Classification of Hyper-heuristics Ap-
proaches. Springer (2009)
[8] Clune, J., Goings, S., Punch, B., Goodman, E.: Investigations in Meta-GA:
Panaceas or Pipe Dreams? In: H. G. Beyer et al. (ed.) Proc. GECCO’05, pp.
235–241. ACM Press (2005)
[9] Cook, S. A.: The complexity of theorem-proving procedures. In: Proc. ACM
Symposium on Theory of Computing, pp. 151–158. ACM Press (1971)
[10] Da Costa, L., Fialho, A., Schoenauer, M., Sebag, M.: Adaptive operator se-
lection with dynamic multi-armed bandits. In: M. Keijzer et al. (ed.) Proc.
GECCO’08, pp. 913–920. ACM Press (2008)
[11] Da Costa, L., Schoenauer, M.: GUIDE, a Graphical User Interface for Evo-
lutionary Algorithms Design. In: J. H. Moore et al. (ed.) GECCO Workshop
on Open-Source Software for Applied Genetic and Evolutionary Computation
(SoftGEC). ACM Press (2007)
[12] Davis, L.: Adapting operator probabilities in genetic algorithms. In:
J. D. Schaffer (ed.) Proc. ICGA’89, pp. 61–69. Morgan Kaufmann (1989)
[13] Eiben, A., Michalewicz, Z., Schoenauer, M., Smith, J.: Parameter Setting
in Evolutionary Algorithms, chap. Parameter Control in Evolutionary Algo-
rithms, pp. 19–46. Vol. 54 of Lobo et al. [27] (2007)
[14] Eiben, A. E., Hinterding, R., Michalewicz, Z.: Parameter control in evolution-
ary algorithms. IEEE Trans. Evolutionary Computation 3(2), 124–141 (1999)
[15] Fialho, A., Da Costa, L., Schoenauer, M., Sebag, M.: Extreme value based
adaptive operator selection. In: G. Rudolph et al. (ed.) Proc. PPSN’08, LNCS,
vol. 5199, pp. 175–184. Springer (2008)
[16] Fialho, A., Da Costa, L., Schoenauer, M., Sebag, M.: Dynamic multi-armed
bandits and extreme value-based rewards for adaptive operator selection in
evolutionary algorithms. In: T. Stuetzle et al. (ed.) Proc. LION’09 (2009)
[17] Fialho, A., Schoenauer, M., Sebag, M.: Analysis of adaptive operator selection
techniques on the royal road and long k-path problems. In: F. Rothlauf et al.
(ed.) Proc. GECCO’09, pp. 779–786. ACM Press (2009)
[18] Fleurent, C., Ferland, J. A.: Object-oriented implementation of heuristic search
methods for graph coloring, maximum clique, and satisfiability. In: Cliques,
Coloring, and Satisfiability: Second DIMACS Implementation Challenge, DI-
MACS Series in Discrete Mathematics and Theoretical Computer Science,
vol. 26, pp. 619–652 (1996)
[19] Garey, M. R., Johnson, D. S.: Computers and Intractability: A Guide to
the Theory of NP-Completeness. W.H. Freeman & Company, San Francisco
(1979)
[20] Goldberg, D.: Probability Matching, the Magnitude of Reinforcement, and
Classifier System Bidding. Machine Learning 5(4), 407–426 (1990)
[21] Hartland, C., Gelly, S., Baskiotis, N., Teytaud, O., Sebag, M.: Multi-armed
bandit, dynamic environments and meta-bandits. In: Online Trading of Explo-
ration and Exploitation Workshop, NIPS (2006)
[22] Hoos, H., Stützle, T.: SATLIB: An Online Resource for Research on SAT, pp.
283–292. IOS Press, www.satlib.org (2000)
[23] De Jong, K. A.: Parameter Setting in Evolutionary Algorithms, chap. Parameter
Setting in EAs: a 30 Year Perspective, pp. 1–18. Vol. 54 of Lobo et al. [27]
(2007)
[24] Julstrom, B. A.: What have you done for me lately? Adapting operator proba-
bilities in a steady-state genetic algorithm. In: L. J. Eshelman (ed.) Proc.
ICGA’95, pp. 81–87. Morgan Kaufmann (1995)
[25] Lardeux, F., Saubion, F., Hao, J. K.: GASAT: A genetic local search algo-
rithm for the satisfiability problem. Evolutionary Computation 14(2), 223–253
(2006)
[26] Lobo, F., Goldberg, D.: Decision making in a hybrid genetic algorithm. In:
T. Bäck et al. (ed.) Proc. ICEC’97, pp. 121–125. IEEE Press (1997)
[27] Lobo, F., Lima, C., Michalewicz, Z. (eds.): Parameter Setting in Evolutionary
Algorithms, Studies in Computational Intelligence, vol. 54. Springer (2007)
[28] Maturana, J., Fialho, A., Saubion, F., Schoenauer, M., Sebag, M.: Extreme
compass and dynamic multi-armed bandits for adaptive operator selection. In:
Proc. CEC’09 (2009)
[29] Maturana, J., Lardeux, F., Saubion, F.: Autonomous operator management for
evolutionary algorithms. Journal of Heuristics 16, 881–909 (2010)
[30] Maturana, J., Saubion, F.: On the design of adaptive control strategies for evo-
lutionary algorithms. In: Proc. EA’07, LNCS, vol. 4926. Springer (2007)
[31] Maturana, J., Saubion, F.: A compass to guide genetic algorithms. In:
G. Rudolph et al. (ed.) Proc. PPSN’08, LNCS, vol. 5199, pp. 256–265.
Springer (2008)
[32] Maturana, J., Saubion, F.: From parameter control to search control: Param-
eter control abstraction in evolutionary algorithms. Constraint Programming
Letters 4, Special Issue on Autonomous Search, 39–65 (2008)
[33] Moore, G. A.: Crossing the Chasm: Marketing and Selling High-Tech Products
to Mainstream Customers. Collins Business Essentials (1991)
[34] Nannen, V., Eiben, A. E.: A method for parameter calibration and relevance
estimation in evolutionary algorithms. In: Proc. GECCO’06, pp. 183–190.
ACM Press (2006)
[35] Nannen, V., Eiben, A. E.: Relevance estimation and value calibration of evo-
lutionary algorithm parameters. In: Proc. IJCAI’07, pp. 975–980 (2007)
[36] Nannen, V., Smit, S. K., Eiben, A. E.: Costs and benefits of tuning parameters
of evolutionary algorithms. In: G. Rudolph et al. (ed.) Proc. PPSN’08, pp.
528–538. Springer (2008)
[37] Page, E.: Continuous inspection schemes. Biometrika 41, 100–115 (1954)
[38] Pareto, V.: Cours d’économie politique. In: Vilfredo Pareto, Oeuvres
complètes, Genève: Librairie Droz (1896)
[39] Rice, J. R.: The algorithm selection problem. Advances in Computers 15,
65–118 (1976)
[40] Sywerda, G.: Uniform crossover in genetic algorithms. In: Proc. ICGA’89, pp.
2–9. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1989)
[41] Thierens, D.: An adaptive pursuit strategy for allocating operator probabilities.
In: H. G. Beyer (ed.) Proc. GECCO’05, pp. 1539–1546. ACM Press (2005)
[42] Tuson, A., Ross, P.: Adapting operator settings in genetic algorithms. Evolu-
tionary Computation 6(2), 161–184 (1998)
[43] Whitacre, J. M., Pham, T. Q., Sarker, R. A.: Use of statistical outlier detec-
tion method in adaptive evolutionary algorithms. In: M. Cattolico (ed.) Proc.
GECCO’06, pp. 1345–1352. ACM Press (2006)
[44] Wolpert, D. H., Macready, W. G.: No free lunch theorems for optimization.
IEEE Transactions on Evolutionary Computation 1(1), 67–82 (1997)
[45] Wong, Y. Y., Lee, K. H., Leung, K. S., Ho, C. W.: A novel approach in param-
eter adaptation and diversity maintenance for genetic algorithms. Soft Com-
puting 7(8), 506–515 (2003)
[46] Yuan, B., Gallagher, M.: Statistical racing techniques for improved empirical
evaluation of evolutionary algorithms. In: X. Yao et al. (ed.) Proc. PPSN’04,
pp. 172–181. Springer (2004)
Chapter 8
Parameter Adaptation in
Ant Colony Optimization
8.1 Introduction
Thomas Stützle, Manuel López-Ibáñez, Marco Montes de Oca, Mauro Birattari, Marco Dorigo
IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium
e-mail: {stuetzle,manuel.lopez-ibanez,mmontes,mbiro,mdorigo}@ulb.ac.be
Michael Maur
Fachbereich Rechts- und Wirtschaftswissenschaften, TU Darmstadt, Darmstadt, Germany
e-mail: [email protected]
Paola Pellegrini
Dipartimento di Matematica Applicata, Università Ca’ Foscari Venezia, Venezia, Italy
e-mail: [email protected]
The earliest ACO algorithms used the traveling salesman problem (TSP) as an ex-
ample application. The TSP is typically represented by a graph G = (V, E), V being
the set of n = |V | vertices, representing the cities, and E being the set of edges that
fully connect the vertices. A distance d_ij is associated with each edge (i, j). The
objective is to find a Hamiltonian cycle of minimum total cost. The TSP is a com-
putationally hard problem, but the application of ACO algorithms to it is simple,
which explains why ACO applications to the TSP have played a central role in the
development of this algorithmic technique.
When applying ACO to the TSP, a pheromone value τ_ij is associated with each edge
(i, j) ∈ E. The pheromone value represents the attractiveness of a specific edge for
the ants, according to the experience gained at runtime: the higher the amount of
pheromone on an edge, the higher the probability that ants choose it when con-
structing solutions. Pheromone values are iteratively updated by two mechanisms:
pheromone evaporation and pheromone deposit. In addition to the pheromone trails,
the ants’ solution construction process is also biased by a heuristic value η_ij = 1/d_ij,
which represents the attractiveness of each edge (i, j) from a greedy point of view.

procedure ACOMetaheuristic
  ScheduleActivities
    ConstructSolutions
    DaemonActions    // optional
    UpdatePheromones
  end-ScheduleActivities
end-procedure

Fig. 8.1: Algorithmic outline of an ACO algorithm
The algorithmic outline of an ACO algorithm (see Figure 8.1) contains three
main procedures: ConstructSolutions, DaemonActions, and UpdatePheromones. The
main characteristics of these procedures are as follows.
• ConstructSolutions. This procedure includes the routines needed for ants to con-
struct solutions incrementally. After the selection of the starting city, one node
at a time is added to an ant’s path. An ant’s decision about where to go next is
biased by the pheromone trails τ_ij and the heuristic information η_ij. In general,
the higher the two values, the higher the probability of choosing the associated
edge. Typically, two parameters, α > 0 and β ≥ 0, are used to weigh the relative
influence of the pheromone and the heuristic values, respectively. The rule that
defines the ant’s choice is specific to each ACO variant.
• DaemonActions. This procedure comprises all problem-specific operations that
may be considered for boosting the performance of ACO algorithms. The main
example of such operations is the introduction of a local search phase. In addi-
tion, daemon actions implement centralized tasks that cannot be performed by
an individual ant. This type of procedure is optional, but daemon actions are
typically very useful for significantly improving the performance
of ACO algorithms [11].
• UpdatePheromones. This procedure updates the pheromone trail values in two
phases. First, pheromone evaporation is applied to decrease pheromone values.
The degree of decrement depends on a parameter ρ ∈ [0, 1], called evapora-
tion rate. The aim of pheromone evaporation is to avoid an unlimited increase
of pheromone values and to allow the ant colony to forget poor choices made
previously. A pheromone deposit is then applied to increase the pheromone
values that belong to good solutions the ants have generated. The amount of
pheromone deposited and the solutions considered are peculiar to each ACO
variant.
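To make the interplay of these three procedures concrete, here is a minimal, generic Python
skeleton of the loop outlined in Figure 8.1; the function arguments, the cost attribute of a
solution and the termination criterion are placeholders, and the variant-specific details
(construction rule, local search, update rule) are deliberately left abstract.

def aco_metaheuristic(construct_solution, local_search, update_pheromones,
                      m=10, max_iterations=1000):
    """Generic ACO loop: construct solutions, optionally improve them, update trails."""
    best_so_far = None
    for _ in range(max_iterations):
        # ConstructSolutions: each of the m ants builds one complete solution.
        solutions = [construct_solution() for _ in range(m)]
        # DaemonActions (optional): e.g., improve every solution by local search.
        solutions = [local_search(s) for s in solutions]
        iteration_best = min(solutions, key=lambda s: s.cost)
        if best_so_far is None or iteration_best.cost < best_so_far.cost:
            best_so_far = iteration_best
        # UpdatePheromones: evaporation followed by a variant-specific deposit.
        update_pheromones(iteration_best, best_so_far)
    return best_so_far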
ACO algorithms involve a number of parameters that need to be set appropri-
ately. Of these, we already have mentioned α and β , which are used to weigh the
relative influence of the pheromone and heuristic values in the ants’ solution con-
struction. The role of these parameters in biasing the ants’ search is intuitively sim-
ilar. Higher values of α emphasize differences in the pheromone values, and higher
values of β have the same effect on the heuristic values. The initial value of the
pheromones, τ0 , has a significant influence on the convergence speed of the algo-
rithm; however, its recommended setting depends on the particular ACO algorithm.
The evaporation rate parameter, ρ , 0 ≤ ρ ≤ 1, regulates the degree of the decrease
in pheromone trails. If ρ is low, the influence of the pheromone values will persist
longer, while high values of ρ allow fast forgetting of previously very attractive
choices and, hence, allow faster focus on new information that is incorporated into the
pheromone matrix. Another parameter is the number of ants in the colony, m. For
a given computational budget, such as maximum computation time, the number of
ants is a critical parameter for determining the trade-off between the number of it-
erations that can be made and how broad the search is at each of the iterations. In
fact, the larger the number of ants per iteration, the fewer the iterations of the ACO
algorithm.
The ACO framework can be implemented in many different ways. In the literature,
several ACO algorithms have been proposed, which differ in some choices charac-
terizing the ConstructSolutions and UpdatePheromones procedures. Three of the main
variants are described next. For a description of other variants we refer the interested
reader to Dorigo and Stützle [11].
Ant System (AS) is the first ACO algorithm proposed in the literature [12, 13]. In
AS, an ant k in node i chooses the next node j with probability given by the random
proportional rule, defined as
    p_ij = [τ_ij]^α · [η_ij]^β  /  ∑_{h∈N^k} [τ_ih]^α · [η_ih]^β ,        (8.1)
where F(k) is the amount of pheromone deposited on the edges of the solution
constructed by ant k. F(k) is equal to the reciprocal of the cost of the solution con-
structed by ant k, possibly multiplied by a constant Q. Hence, the better the solution,
the higher the amount of pheromone deposited by an ant.
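As an illustration, the following sketch implements the random proportional rule of
Equation 8.1 together with the deposit amount F(k) = Q / cost mentioned above; the dense
matrices tau and eta and the helper names are assumptions made only for this example.

import random

def choose_next_city(i, feasible, tau, eta, alpha=1.0, beta=2.0):
    """Random proportional rule (Equation 8.1): pick the next city j from city i."""
    weights = [(tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in feasible]
    return random.choices(feasible, weights=weights, k=1)[0]

def deposit_amount(solution_cost, Q=1.0):
    """F(k): the pheromone deposited by ant k is inversely proportional to its cost."""
    return Q / solution_cost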
where Δτ_ij^best is defined as

    Δτ_ij^best = F(s^best)   if edge (i, j) is part of the best solution s^best,
    Δτ_ij^best = 0           otherwise.                                        (8.5)
F(sbest ) is the reciprocal of the cost of the solution considered for the deposit. For
more details on the pheromone initialization and the use of occasional pheromone
trail reinitializations, we refer the reader to Stützle and Hoos [42].
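Although the surrounding MMAS description is only partially reproduced here, Equation 8.5 can
be illustrated as follows. The evaporation step and the enforcement of the trail limits
[τ_min, τ_max] are standard MMAS ingredients assumed for this sketch, and the
dictionary-of-dictionaries pheromone matrix is an implementation choice made only for the example.

def mmas_update(tau, best_edges, best_cost, rho=0.02, tau_min=0.01, tau_max=5.0):
    """Evaporate all trails, then deposit on the best solution only (Equation 8.5),
    keeping every value within the MMAS trail limits [tau_min, tau_max]."""
    deposit = 1.0 / best_cost                        # F(s_best)
    best_edges = set(best_edges)                     # edges (i, j) of s_best
    for i in tau:
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)                 # evaporation
            if (i, j) in best_edges:
                tau[i][j] += deposit                 # Delta tau_ij^best
            tau[i][j] = min(tau_max, max(tau_min, tau[i][j]))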
Ant Colony System (ACS) [10] differs in several ways from AS and MMAS. ACS
uses the pseudorandom proportional rule in the solution construction: with a prob-
ability q0 the next city to visit is chosen as
    j = arg max_{h∈N^k} { τ_ih · [η_ih]^β } ,        (8.6)
that is, the most attractive edge is selected greedily with fixed probability. With
probability 1 − q0 , the AS random proportional rule defined by Equation 8.1 is used.
In ACS, the parameter α is fixed to 1, and, therefore, it is often omitted.
The pheromone deposit of ACS modifies only the pheromone values of edges
from one solution. As in MMAS, this solution is either the iteration-best or the best-
so-far solution. The ACS pheromone update formula is
    τ_ij ← (1 − ρ) · τ_ij + ρ · Δτ_ij   if (i, j) is part of the best solution s^best,
    τ_ij ← τ_ij                          otherwise,                            (8.7)

with Δτ_ij = F(s^best).
A local pheromone update rule is applied during the solution construction of the
ants. Each time an ant traverses an edge (i, j), τ_ij is modified as

    τ_ij ← (1 − ξ) · τ_ij + ξ · τ_0 ,        (8.8)
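A compact sketch of the ACS-specific rules described above (pseudorandom proportional choice
of Equation 8.6, global update of Equation 8.7, local update of Equation 8.8) might look as
follows; the data layout and the default parameter values are illustrative assumptions.

import random

def acs_choose_next_city(i, feasible, tau, eta, beta=2.0, q0=0.9):
    """Pseudorandom proportional rule (Equation 8.6); alpha is fixed to 1 in ACS."""
    if random.random() < q0:                          # greedy choice with probability q0
        return max(feasible, key=lambda j: tau[i][j] * eta[i][j] ** beta)
    weights = [tau[i][j] * eta[i][j] ** beta for j in feasible]
    return random.choices(feasible, weights=weights, k=1)[0]   # Equation 8.1

def acs_local_update(tau, i, j, xi=0.1, tau0=0.001):
    """Local pheromone update (Equation 8.8), applied while constructing solutions."""
    tau[i][j] = (1.0 - xi) * tau[i][j] + xi * tau0

def acs_global_update(tau, best_edges, best_cost, rho=0.1):
    """Global pheromone update (Equation 8.7): only edges of the best solution change."""
    for (i, j) in best_edges:
        tau[i][j] = (1.0 - rho) * tau[i][j] + rho * (1.0 / best_cost)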
adapt better to the particular instance’s characteristics. When instances are relatively
heterogeneous, parameter settings that result in good performance on average across
all instances may lead to much worse results on some instances. Online parameter
adaptation may also be useful for achieving the best performance at each stage of the
search. It is often possible to identify an algorithm’s explorative and exploitative
search phases, and good parameter settings in these phases may again be very dif-
ferent. Finally, if algorithms are applied in situations that are very different from the
context in which they have been developed or tuned offline, allowing parameters to
change online may increase an algorithm’s robustness.
There are different ways of modifying parameters during the run of an algo-
rithm. Many strategies have been widely studied in the context of EAs, and Eiben et
al. [17] give a possible classification of these strategies. Perhaps the simplest way is
to define the parameter variation rule before actually running the algorithm. In such
an approach, the problem is observed from an offline perspective: Static parame-
ters are substituted with (deterministic or randomized) functions depending on the
computational time or the number of algorithm iterations. Eiben et al. [17] call such
strategies deterministic parameter control; however, we prefer the term presched-
uled parameter variation, because the adjective “deterministic” does not correctly
characterize this way of controlling parameters, given that the schedule could also
allow randomized choices. Even if prescheduled parameter variation is an online
tuning method, it does not make the offline tuning problem disappear, since the
parameters that define the schedule also need to be appropriately set.
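As a concrete (toy) illustration, a prescheduled variation rule is simply a deterministic or
randomized function mapping the iteration counter to a parameter value, fixed before the run;
the specific numbers below are invented for the example and are not recommendations.

import random

def linear_beta_schedule(iteration):
    """Deterministic preschedule: decrease beta linearly from 5 to 1 over 500 iterations."""
    return max(1.0, 5.0 - 4.0 * iteration / 500.0)

def noisy_rho_schedule(iteration):
    """Randomized preschedule: low evaporation early, a larger (noisy) value later."""
    return 0.05 if iteration < 200 else random.uniform(0.2, 0.4)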
An alternative is to use adaptive parameter settings, where the parameter modi-
fication scheme is defined as a function of some statistics on the algorithm behavior.
Various measures can be used for this online adaptation. They can be grouped de-
pending on whether they are based on absolute or relative evidence. In the first
case, the adaptation strategy monitors the occurrence of some events during the
run, for example, some fixed threshold of the distance between the solutions vis-
ited. Surpassing the threshold then triggers a set of rules for parameter variation. In
the second case, the adaptation strategy considers the relative difference between
the performance achieved with different parameter settings, and adapts the parame-
ter values to resemble the most successful ones. For parameter adaptation to work,
some additional decisions need to be made beyond the ones strictly related to the
implementation of the algorithm. In particular, the equations that describe the pa-
rameter update need to be defined a priori and it is hoped that the mechanisms used
are very robust with respect to their definitions.
A further possibility that has been the object of studies consists of having the pa-
rameters modified at run time by the algorithm itself. Specifically, dimensions that
represent parameters of exploration strategies are added to the search space of the
problem. The optimization process is then executed in this new space. Eiben et al.
[17] name this approach self-adaptation. Taking the notion of self-adaptation a step
further, we call search-based adaptation strategies that use a search algorithm differ-
ent from the underlying algorithm for parameter adaptation. This class of strategies
includes techniques such as local search and EAs for adapting the parameters of
ACO algorithms.
Table 8.1: Schematic description of the literature on adaptive ACO. Some of the arti-
cles propose general adaptation strategies that could be used for several parameters.
Here we only indicate the parameters that have been adapted experimentally
The study of the impact of various parameters on the behavior of ACO algorithms
has been an important subject since the first articles [12, 13]. We schematically sum-
marize in Table 8.1 the main approaches that have been used in the ACO literature
to adapt parameter values, following roughly the classes defined by Eiben et al. [17].
This summary shows that the most frequently chosen parameters for adaptation are
α , β , q0 (in the case of ACS), and those that control the pheromone update. We
describe these approaches in the following sections.
There is surprisingly little work on prescheduled parameter variation for ACO algo-
rithms. Merkle and Middendorf [34] consider an ACO
algorithm for the resource-constrained project scheduling problem. They pro-
pose decreasing the value of the parameter β linearly over the run of the algorithm
from an initial value of 2 to 0. In a subsequent study, Merkle et al. [35] consider the
same problem and propose modifying the parameter β and the evapora-
tion rate ρ . For β they propose a schedule as described before. For ρ , they propose
starting with a small value to increase the initial exploration of the search space, and
later setting the evaporation rate to a high value for an intensive search around the
best solutions found by the algorithm.
Meyer [36] proposes a variant of AS called α -annealing. The basic idea
is to change the value of α according to some annealing schedule.
Increasing α slowly throughout the search can keep diversity in the beginning and
gradually increase the selective pressure to cover better regions of the search space
in the later phases.
In a second group of papers, the driver of adaptation is the quality of the so-
lutions generated. Hao et al. [24] and Cai et al. [5] propose a variant of ACS for
the TSP. In their implementation, a different value of the parameter ρ is associ-
ated with each ant depending on the quality of its solution. This mechanism aims
at having high-quality solutions contribute more pheromone than low-quality ones.
Amir et al. [1] add a fuzzy logic controller module to the ACS algorithm for the
TSP for adapting the values of β and q0 . The adaptive strategy uses two perfor-
mance measures: the difference between the optimal solution (or the best known
solution) and the best one found, and the variance of the solutions visited by a
population of ants. Kovářík and Skrbek [27] describe an approach that divides the
ant colony into groups of ants using different parameter settings. They adapt the
number of ants in each of the groups depending on the improvement of the solu-
tion quality obtained by each group; however, they do not give all details of the
adaptation strategy. An analysis of the experimental results of adapting the val-
ues of β indicates that better initial performance is obtained with high values of
β while towards the end of the run low values of β are preferable. Randall and
Montgomery [40] apply an adaptation mechanism to an ACS algorithm that uses
a candidate set strategy as a speedup procedure. At each step, the component to
be added to the partial solution under construction is chosen from the ones belong-
ing to such a set. The composition of the set is modified throughout the run. Ele-
ments that give low probability values are eliminated temporarily from the search
process (they become tabu). After a number of iterations, they are added again to
the candidate set. The threshold for establishing which elements are tabu is varied
throughout the search process, depending on the quality of solutions being pro-
duced.
Various adaptive ACO strategies fall into the category of self-adaptive strategies
[17], where an algorithm tunes itself by integrating its parameters into its search
task. We first present strategies that are “purely self-adaptive” in the original mean-
ing used by Eiben et al. [17]. Later, we discuss approaches that use other search
algorithms than ACO for adapting parameters. Given that these strategies are search-
based, most of the approaches discussed in the following use as feedback the quality
of the solutions generated.
Pilat and White [38] test two ways of using an EA to adjust parameters of an ACS
algorithm; one of them performs online tuning. Their approach to online tuning
uses an EA to determine, at each ACS iteration, the parameter settings of four
ants before constructing solutions. The EA in turn uses the constructed solutions
to further evolve a set of good parameter settings. The authors choose three param-
eters for adaptation, namely, β , q0 , and ξ . Their results are somewhat inconclusive.
This approach is similar to the mechanism used in an earlier paper by White et al.
[43], where the authors evolve the parameters α and β in an ACO algorithm for a
telecommunications routing problem. As an alternative to the online tuning of ACO
parameters by an EA, Pilat and White [38] explore the use of an EA as an offline
tuning mechanism, analogously to Botee and Bonabeau [4].
Gaertner and Clark [20] propose a similar adaptive approach. In their work, ev-
ery ant is initialized with a random parameter combination, where the parameter
values are chosen from a predefined range. Over time, the entire population of ants
evolves, breeding ants with parameter combinations which find improved solutions.
In their experiments, the authors consider an algorithm based on a combination
of AS and ACS for the TSP. They test their approach on three parameters: β , ρ
and q0 .
Hao et al. [23] propose a variant of ACS in which each ant is characterized by its
own parameter setting. The usual random-proportional rule is applied for selecting
subsequent moves. After each iteration, the parameter configurations are modified
using a particle swarm optimization (PSO) approach. Three parameters are adapted
throughout the algorithm’s run, namely β , ρ and q0 . If the PSO mechanism assigns
a value outside a predefined range to a parameter, then the parameter is randomly
reinitialized. Following a similar idea, Ling and Luo [30] propose using an artificial
fish swarm algorithm for exploring the parameter space. The authors also consider a
variant of ACS and vary the three parameters α , ρ and Q, a parameter that influences
the amount of pheromone an ant deposits. The main difference between this work
and the one of Hao et al. [23] is that Ling and Luo use the same parameter setting
for all ants.
Garro et al. [22] present an algorithm that evolves parameters using an EA. An
individual in the EA represents an ant characterized by specific parameter values.
The authors study a variant of ACS for automatically determining the path a robot
should follow from its initial position to its goal position. They adapt three parame-
ters of a newly proposed state transition rule.
Anghinolfi et al. [2] adapt two parameters using a local search in the param-
eter space. They define the neighborhood of the current parameter setting as all
possible combinations of parameter settings that can be obtained by increasing or
decreasing each parameter value by a fixed amount. Therefore, in the case of two
parameters, at each iteration five parameter configurations are tested: the incumbent
one and its four resulting neighbors. The test is done by assigning each parameter
combination to one of five equal-sized groups of ants; each group then uses its pa-
rameters to generate solutions. After each iteration, the incumbent parameter setting
is changed to the one that produced the iteration-best solution. In their experiments,
the authors adapt two parameters of a variant of ACS, namely β and q0 . They ob-
serve better performance of the adaptive strategy than a tuned, fixed parameter set-
ting.
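The neighborhood-based adaptation just described can be sketched as follows; this is a
simplified reconstruction under assumed interfaces (in particular the run_group evaluation
function), not the authors' code.

def parameter_neighborhood(incumbent, steps):
    """Incumbent plus the configurations obtained by one +/- step per parameter."""
    candidates = [dict(incumbent)]
    for name, step in steps.items():
        for delta in (+step, -step):
            neighbor = dict(incumbent)
            neighbor[name] += delta
            candidates.append(neighbor)
    return candidates                     # for two parameters: 5 configurations

def adapt_parameters(incumbent, steps, run_group):
    """One adaptation step: run one group of ants per configuration and keep the
    configuration whose group produced the best (lowest-cost) solution."""
    candidates = parameter_neighborhood(incumbent, steps)
    costs = [run_group(cfg) for cfg in candidates]   # e.g., best tour cost of the group
    return candidates[costs.index(min(costs))]

# Example usage with a dummy evaluation function standing in for one ACS iteration:
best = adapt_parameters({"beta": 2.0, "q0": 0.9}, {"beta": 0.5, "q0": 0.02},
                        run_group=lambda cfg: abs(cfg["beta"] - 3.0) + abs(cfg["q0"] - 0.94))
print(best)   # -> {'beta': 2.5, 'q0': 0.9} in this toy example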
Finally, Melo et al. [33] propose a multi-colony ACS algorithm, where several
colonies of ants try to solve the same problem simultaneously. Each colony uses
different parameter settings for α , β , ρ and q0 . Apart from exchanging solutions
among the colonies, their proposal includes a mutation operator that replaces the
parameter settings of the worst colony with the value of the same parameter in the
best colony modified by a small, uniformly random value.
The review above shows that there is ongoing interest in parameter adaptation in the
ACO literature. However, we also observe that several of the contributions apply
adaptive techniques without prior in-depth understanding of the effect of individ-
ual parameters. Without such an understanding, decisions about which parameters
to adapt and how to adapt them are mostly arbitrary. In particular, we did not find
in our review any systematic study of the effect of different parameter settings on
the anytime behavior of ACO algorithms. It is our intuition that such an analysis can
inform decisions not only about which parameters may be worth varying during run-
time, but also about how to perform such variations. Moreover, the anytime behavior
of fixed parameter settings provides a baseline for evaluating the performance of pa-
rameter adaptations. In the following sections, we first provide a systematic study of
the anytime behavior of ACO algorithms, and we use the knowledge acquired from
it to design successful schemes for prescheduled parameter variation.
In this section, we examine the effect that various parameters have on the perfor-
mance of ACS and MMAS. In particular, we are interested in the development of
the best-so-far solution over time when varying one parameter at a time. Our goal
is to identify which parameter settings produce the best results at any moment dur-
ing the run of the algorithm. Clearly, a parameter setting that produces very good
solutions during the initial stages of a run but leads to much worse results later is
an interesting candidate for having its settings varied online. In other words, we
are interested in the anytime behavior [44] of specific parameter settings to clearly
identify opportunities for the adaptation of parameter values over the computation
period.
Our experimental analysis is based on the publicly available ACOTSP soft-
ware [41], which we compiled with gcc, version 3.4. Experiments are carried out
on a cluster of Intel Xeon™ E5410 quad-core processors running at 2.33 GHz with 6
MB L2 Cache and 8 GB RAM under Rocks Cluster GNU/Linux. Due to the sequen-
tial implementation of the code, only one core is used for running the executable.
We test two ACO algorithms, ACS and MMAS. Table 8.2 gives the default values
for the parameters under study. In each experiment where one parameter is varied,
the others are kept fixed at their default values. For the remaining parameters, that
is, τ0 , ξ (for ACS), the choice of iteration-best or best-so-far update, and so on, we
use the default values given by Dorigo and Stützle [11], which are also the default
values of the ACOTSP package. We test the algorithms with and without the use
of the first-improvement 2-opt local search provided by the ACOTSP package. For
the experiments, TSP instances are randomly generated using the instance generator
provided for the 8th DIMACS challenge on the TSP; in particular, points are gen-
erated uniformly at random in a square of side length 1,000,000. When using ACO
Table 8.2: Default settings of the parameters under study for MMAS and ACS with-
out local search and with 2-opt local search
Algorithm β ρ m q0
ACS 2 0.1 10 0.9
ACS + 2-opt 2 0.1 10 0.98
MMAS 2 0.02 n 0.00
MMAS + 2-opt 2 0.2 25 0.00
algorithms without local search, the tests are done on instances of size 100, 200,
400 and 800; because of the much higher performance of the ACO algorithms when
local search is used, we use larger instances of size 1,500, 3,000 and 6,000 with
local search to minimize possible floor effects. The presentation of the experimental
results is based on the development of the relative deviation of the best solution
found by an ACO algorithm from the optimal solution (or the best known solution
for the instance of size 6,000). Each of the curves of the solution quality over time,
or SQT curves [25], is the average of 25 executions of each parameter setting. Since
we only present plots, we give for each setting results on only one instance. How-
ever, the trends are the same on all instances and, hence, the plots are representative
of the general results.
In the case of ACS, we study the effect of β , which regulates the influence of the
heuristic information; m, the number of ants; ρ , the evaporation factor; and q0 , the
probability of making a deterministic choice in Equation 8.6. Here, we present SQT
curves only for the case where ants’ solutions are improved by a local search for
the instance of size 3,000. The final conclusions concerning the usefulness of the
variation of parameters at run-time were the same on the other instances and when
using ACS without local search. Figures 8.2 to 8.5 report the results on parameters
β , m, ρ , and q0 , respectively.
The main overall conclusion we obtain from these results is that very often there
is a single parameter setting that performs best during most of the available run-
time. Hence, there does not appear to be a clear benefit to varying the parameter
settings at run-time. This conclusion remains the same if ACS is run without lo-
cal search; the main difference is that the performance is more variable and more
dependent on specific parameter values. In more detail, the observations and con-
clusions that arise for the single parameters from the presented results are the fol-
lowing.
β , Figure 8.2: Medium range values of β equal to 2 or 5 produce the best re-
sults during most of the runtime. Smaller values of β are initially worse but,
after enough computation time, eventually match the results of the default
value. Much larger values (e.g., β = 10) are quickly overshadowed by smaller
ones.
m, Figure 8.3: The default value of ten ants results in very good anytime perfor-
mance. Interestingly, very small values (notably m = 1) make the algorithm
perform slightly worse during the whole runtime, whereas much larger values
(m = 100) lead to much worse results. The latter effect is probably due to too
much diversification because of the application of the local pheromone update
rule in ACS.
ρ , Figure 8.4: Surprisingly, the differences among different settings of ρ are almost
imperceptible. Without local search (not shown here), large ρ values produce
faster convergence. However, after a short time small values close to the default
(ρ = 0.1) produce progressively better results.
q0 , Figure 8.5: As suggested in the literature, good values of q0 tend to be close
to 1. In extreme cases, a value of 1 quickly leads to search stagnation, while
values smaller than 0.75 produce very slow convergence towards good solu-
tions. Similar results are obtained when local search is disabled.
Fig. 8.2: ACS with various values of β
Fig. 8.3: ACS with various numbers of ants (m)
Fig. 8.4: ACS with various values of ρ
Fig. 8.5: ACS with various values of q0
Fig. 8.6: MMAS with various fixed values of β ; left plot without local search and
right plot with local search
We now study the impact of the same parameters, β , m, ρ , and q0 , on the anytime
behavior of MMAS. For MMAS, the behavior is more interesting from a parameter
adaptation point of view. We therefore present results for the cases with and with-
out local search. Results without local search are for one instance with 400 nodes,
whereas with local search, they are for an instance with 3,000 nodes.
β , Figure 8.6: Without local search (left plot of Figure 8.6), MMAS requires rel-
atively large values of β , which produce a significant advantage, especially dur-
ing the initial part of the run, over the default setting of β = 2. While results
with the default setting eventually match the results obtained with higher set-
tings, values of β less than 2 lead to quite poor performance. With local search,
the differences are much smaller and a setting of β = 10 is quickly overshad-
owed by lower ones. This suggests that starting with a high value of β may
enhance the performance of MMAS at the beginning of the run, but a value
close to the default may produce better results for larger computation times.
m, Figure 8.7: With and without local search, the number of ants shows a clear
trade-off between early and late performance. In particular, a low number of
ants (for example, m = 5) produces the best results during the early stages of
the algorithm run. However, a higher number of ants (for example, m = 100)
obtains much better results towards the end of the run. Without local search,
the fast initial progress with few ants soon levels off and apparently leads to
search stagnation. In this case, the default setting of m = 400 appears to be
already too high, and it slows down the algorithm compared to using 100 ants
without improving the final result. With local search, the SQT curves cross for
the different parameter settings and those with few ants (m = 1 and m = 5)
result in worse final solution quality. In fact, a larger number of ants (m ≥ 25)
pays off if the algorithm is allowed to run for enough time. This result suggests
that increasing the number of ants from an initially low value may lead to better
anytime behavior of MMAS.
Fig. 8.7: MMAS with various fixed numbers of ants (m); left plot without local
search and right plot with local search
Fig. 8.8: MMAS with various fixed values of ρ ; left plot without local search and
right plot with local search
ρ , Figure 8.8: There is some degree of trade-off between large and small values of
ρ . Large values (for example, ρ = 0.6) converge faster than the default values
(ρ = 0.02 without local search, ρ = 0.2 with local search). Nevertheless, low
values of ρ are able to achieve the same performance, and, given sufficient time,
produce the best final results. This effect is most noticeable in the case without
local search. Hence, starting with a high evaporation factor and then reducing it
over time to its default value appears to be a promising strategy.
q0 , Figure 8.9: Finally, we test the use of the pseudorandom proportional rule of
ACS (Equation 8.6) in MMAS. Here, we study the effect of different values
of q0 as we previously did for ACS. In this case, a clear trade-off is observed:
high values of q0 perform best for a short runtime, whereas low values of q0
(q0 = 0 effectively reverts to the standard MMAS) generally result in better
final performance.
Summarizing the above experiments, in MMAS a strong trade-off exists for var-
ious parameters between the performance of fixed settings for short and long com-
putation times, making the behavior of MMAS very different from that of ACS.
In particular, β , m and q0 seem good candidates for the use of variable settings to
achieve good anytime performance. Therefore, in the next section, we examine a
few simple ways of varying the parameter settings of MMAS during the run.
Fig. 8.9: MMAS using the pseudorandom proportional rule with various fixed values
of q0 ; left plot without local search and right plot with local search
Given that MMAS was a clear candidate for the varying of parameters during the
computation period, we examine various schedules for changing the parameter set-
tings. In this section, we give exemplary results for prescheduled parameter varia-
tion. In particular, we show results concerning the adaptation of the parameters β ,
m, and q0 . These illustrate the types of improvement in the anytime behavior of
MMAS that may be obtained. We do not consider varying the evaporation factor, ρ ,
since we did not find schedules that significantly improve performance over a fixed,
high setting (such as ρ = 0.6).
First, we study the variation of β . We tested schedules that decrease the value of
β linearly with the iteration counter as well as schedules where a switch from a high
value to a low value occurs at a fixed iteration number. The latter type of schedule
resulted in better anytime performance and, hence, we focus on these here. The pro-
cedure of these schedules is to start with the high value of β = 20, which was shown
to yield good performance at the start of the run, and to later set it directly to a lower
value close to the default value. Figure 8.10 shows the results with local search for
three scenarios that differ in the number of iterations after which β is changed, from
20 to 3; in particular, we consider 50 (aβ 1), 100 (aβ 2) and 200 (aβ 3) iterations. The
schedule aβ 1 obtained the best SQT curve, and delaying the change of β produces
worse results. In the case without local search (not shown here), delaying the switch
from the high to the low value of β showed some improvement. Nonetheless, for
simplicity, we choose strategy aβ 1 for further comparison. Figure 8.11 compares
the use of strategy aβ 1 to the use of the default value and a large value (β = 20)
of β without (left) and with (right) local search. In both cases, the prescheduled pa-
rameter variation is able to combine the best results of both fixed settings, achieving
a better anytime performance.
Fig. 8.11: MMAS, comparison of fixed and varying parameter settings for β; left
side: case without local search; right side: case with local search. The adaptation
strategy used is aβ 1 (see caption of Figure 8.10 for details)

In the case of the number of ants, m, the strategies studied here start with a
single ant and increase the number of ants as the algorithm progresses. Figure 8.12
shows that there is progressive degradation of the quality of the results as the rate
at which ants are added increases. The best results are obtained with the lowest rate
(am 1, which adds one ant after every ten iterations) for both cases, with and without
local search (only the local search case is shown for conciseness). The comparison
between the fixed and prescheduled settings (Figure 8.13) shows a clear benefit
with the use of prescheduled variation of m, which matches the good performance
obtained for short runtimes by only one ant and for long runtimes by a large number
of ants.
For varying q0 for MMAS, we tested strategies that start with a high value of
q0 = 0.99 and decrease it until they reach a setting of q0 equal to 0. Figure 8.14
shows four strategies that decrease q0 at different rates, namely, by 0.001 every
15 iterations (aq0 1), by 0.001 every two iterations (aq0 2), by 0.005 every itera-
tion (aq0 3), and by 0.02 every iteration (aq0 4). Without local search, the strategies
that decrease q0 more slowly result in faster convergence to good solutions (not
shown here). However, with local search there is a trade-off between the slowest
and the fastest decrease of q0 , with the former being better at the start of the algo-
rithm, and the latter performing best for higher computation times. This suggests
that more sophisticated strategies may be able to further enhance the performance
of the algorithm.

Fig. 8.12: MMAS, scheduled variation of the number of ants (m). All strategies am
1 to am 5 start with one ant and iteratively increase the number of ants. In particular,
am 1 adds one ant every ten iterations, am 2 adds one ant every second iteration, am
3 adds one ant each iteration, am 4 adds two ants each iteration, and am 5 adds five
ants each iteration

Fig. 8.13: MMAS, comparison of fixed and varying parameter settings for param-
eter m; left side: case without local search; right side: case with local search. The
adaptation strategy used is am 1 (see caption of Figure 8.12 for details)

Nevertheless, the comparison of schedule aq0 2 with fixed parameter settings shows
that prescheduled parameter variation is able to match the best results of both fixed
parameter settings during the execution time of the algorithm.
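The three schedules retained above (aβ 1, am 1 and aq0 2) are simple functions of the
iteration counter; the sketch below restates them in Python. The initial and target values
follow the text, while the function names and the integer rounding are illustrative choices.

def beta_schedule(iteration):
    """a-beta-1: start with beta = 20 and switch to beta = 3 after 50 iterations."""
    return 20.0 if iteration < 50 else 3.0

def num_ants_schedule(iteration):
    """a-m-1: start with a single ant and add one ant every ten iterations."""
    return 1 + iteration // 10

def q0_schedule(iteration):
    """a-q0-2: start at q0 = 0.99 and decrease by 0.001 every two iterations, down to 0."""
    return max(0.0, 0.99 - 0.001 * (iteration // 2))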
A general observation from our study of prescheduled parameter variation is
that considerable improvements of the anytime behavior of MMAS are possible
without their substantially affecting the final performance achieved by the algo-
rithm. In some additional experiments, we verified that the same conclusion is also
true for the parameter α , which weights the influence of the pheromone trails.
In fact, for similar simple schedules as proposed previously, we could observe
strong improvements in the anytime behavior compared to performance with fixed
settings for α . Further studies need to verify whether the same observations on
the usefulness of simple prescheduled parameter variations hold for other prob-
lems.
Fig. 8.14: MMAS, scheduled variation of the parameter q0 . All strategies (aq0 1 to
aq0 4) start at q0 = 0.99 and decrease q0 to 0. In particular, aq0 1 decreases q0 by
0.001 every 15 iterations, aq0 2 by 0.001 every 2 iterations, aq0 3 by 0.005 every
iteration, and aq0 4 by 0.02 every iteration
Fig. 8.15: MMAS, comparison of fixed and varying parameter settings for param-
eter q0 ; left side: case without local search; right side: case with local search. The
adaptation strategy used is aq0 2 (see caption of Figure 8.14 for details)
In this chapter, we have reviewed approaches to parameter adaptation in the ACO literature
and studied the anytime behavior of fixed and prescheduled settings of the parameters β , m,
and q0 in MMAS. As a result, we could observe that the anytime behavior of MMAS
can be greatly improved without significant loss in the final solution quality.
Our computational study can clearly be extended in different directions. An in-
teresting extension is the study of interactions between different parameter settings.
Such a study may hint at combined variations of at least two parameters that can
further improve the anytime behavior of MMAS or even its final performance. Fi-
nally, it is certainly worthwhile studying in more detail the contribution of adaptive
strategies that take into account the internal state of the algorithm in order to adapt
to different classes of instances. For their study, it is probably preferable to consider
problems where the algorithm parameters depend more strongly on specific instance
classes than they do in the TSP.
As our main conclusion, we can state that parameter adaptation is a relatively
large field that is receiving strong attention from the research community. Many tech-
niques have already been proposed in the context of other heuristic methods, but
their adoption in ACO algorithms still opens up a number of research opportunities
with potentially significant impact.
Acknowledgements This work was supported by the META-X project, an Action de Recherche
Concertée funded by the Scientific Research Directorate of the French Community of Belgium.
Mauro Birattari, Thomas Stützle and Marco Dorigo acknowledge support from the Belgian F.R.S.-
FNRS, of which they are Research Associates and Research Director, respectively. The authors also
acknowledge support from the FRFC project “Méthodes de recherche hybrides pour la résolution
de problèmes complexes”.
References
[1] Amir C., Badr A., Farag I.: A fuzzy logic controller for ant algorithms. Com-
puting and Information Systems 11(2):26–34 (2007)
[2] Anghinolfi D., Boccalatte A., Paolucci M., Vecchiola C.: Performance evalua-
tion of an adaptive ant colony optimization applied to single machine schedul-
ing. In: Li X., et al. (eds.) Simulated Evolution and Learning, 7th Interna-
tional Conference, SEAL 2008, Lecture Notes in Computer Science, vol. 5361,
Springer, Heidelberg, Germany, pp. 411–420 (2008)
[3] Battiti R., Brunato M., Mascia F.: Reactive Search and Intelligent Optimiza-
tion, Operations Research/Computer Science Interfaces, vol. 45. Springer,
New York, NY (2008)
[4] Botee H. M., Bonabeau E.: Evolving ant colony optimization. Advances in
Complex Systems 1:149–159 (1998)
[5] Cai Z., Huang H., Qin Y., Ma X.: Ant colony optimization based on adaptive
volatility rate of pheromone trail. International Journal of Communications,
Network and System Sciences 2(8):792–796 (2009)
[6] Chusanapiputt S., Nualhong D., Jantarang S., Phoomvuthisarn S.: Selective
self-adaptive approach to ant system for solving unit commitment problem.
In: Cattolico M., et al. (eds.) GECCO 2006, ACM press, New York, NY, pp.
1729–1736 (2006)
[7] Colas S., Monmarché N., Gaucher P., Slimane M.: Artificial ants for the
optimization of virtual keyboard arrangement for disabled people. In: Mon-
marché N., et al. (eds.) Artificial Evolution - 8th International Conference,
Evolution Artificielle, EA 2007, Lecture Notes in Computer Science, vol.
4926, Springer, Heidelberg, Germany, pp. 87–99 (2008)
[8] Dorigo M.: Ant colony optimization. Scholarpedia 2(3):1461 (2007)
[9] Dorigo M., Di Caro G.: The Ant Colony Optimization meta-heuristic. In:
Corne D., Dorigo M., Glover F. (eds.) New Ideas in Optimization, McGraw
Hill, London, UK, pp. 11–32 (1999)
[10] Dorigo M., Gambardella L. M.: Ant Colony System: A cooperative learning
approach to the traveling salesman problem. IEEE Transactions on Evolution-
ary Computation 1(1):53–66 (1997)
[11] Dorigo M., Stützle T.: Ant Colony Optimization. MIT Press, Cambridge, MA
(2004)
[12] Dorigo M., Maniezzo V., Colorni A.: The Ant System: An autocatalytic opti-
mizing process. Tech. Rep. 91-016 Revised, Dipartimento di Elettronica, Po-
litecnico di Milano, Italy (1991)
[13] Dorigo M., Maniezzo V., Colorni A.: Ant System: Optimization by a colony
of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics
- Part B 26(1):29–41 (1996)
[14] Dorigo M., Di Caro G., Gambardella L. M.: Ant algorithms for discrete opti-
mization. Artificial Life 5(2):137–172 (1999)
[15] Dorigo M., Birattari M., Stützle T.: Ant colony optimization: Artificial ants
as a computational intelligence technique. IEEE Computational Intelligence
Magazine 1(4):28–39 (2006)
[16] Dorigo M., et al. (eds.): Ant Algorithms: Third International Workshop, ANTS
2002, Lecture Notes in Computer Science, vol. 2463. Springer, Heidelberg,
Germany (2002)
[17] Eiben A. E., Michalewicz Z., Schoenauer M., Smith J. E.: Parameter control
in evolutionary algorithms. In: [31], pp. 19–46 (2007)
[18] Favaretto D., Moretti E., Pellegrini P.: On the explorative behavior of MAX–
MIN Ant System. In: Stützle T., Birattari M., Hoos H. H. (eds.) Engineering
Stochastic Local Search Algorithms. Designing, Implementing and Analyzing
Effective Heuristics. SLS 2009, Lecture Notes in Computer Science, vol. 5752,
Springer, Heidelberg, Germany, pp. 115–119 (2009)
[19] Förster M., Bickel B., Hardung B., Kókai G.: Self-adaptive ant colony optimi-
sation applied to function allocation in vehicle networks. In: Thierens D., et al.
(eds.) GECCO’07: Proceedings of the 9th Annual Conference on Genetic and
Evolutionary Computation, ACM, New York, NY, pp. 1991–1998 (2007)
[20] Gaertner D., Clark K.: On optimal parameters for ant colony optimization al-
gorithms. In: Arabnia H. R., Joshua R. (eds.) Proceedings of the 2005 Inter-
national Conference on Artificial Intelligence, ICAI 2005, CSREA Press, pp.
83–89 (2005)
214 Stützle, López-Ibáñez, Pellegrini, Maur, Montes de Oca, Birattari, and Dorigo
9.1 Introduction
This section briefly introduces definitions used in the rest of the chapter.
Each constraint Ci involves a set of variables and is used to restrict the combi-
nations of values between these variables. The degree deg(Xi ) of a variable is the
number of constraints involving Xi and dom(Xi ) denotes the current domain of Xi .
Solving a CSP involves finding a solution, i.e., an assignment of values to vari-
ables such that all constraints are satisfied. If a solution exists the problem is satis-
fiable, and it is unsatisfiable otherwise. A depth-first search backtracking algorithm
can be used to tackle CSPs. At each step of the search process, an unassigned vari-
able X and a valid value v for X (v ∈ dom(X)) are selected and a constraint X = v
is added to the search process. In case of infeasibility, the search backtracks and
can undo previous decisions by adding new constraints, e.g., X ≠ v. The search thus
explores a so-called search tree, where solutions correspond to leaf nodes. In the
worst case, the search process may have to explore an exponentially large search space. Therefore,
it is necessary to combine the exploration of variables and values with a look-ahead
strategy able to narrow the domains of the variables and reduce the remaining search
space through constraint propagation. Restarting the search engine [23, 29, 31] helps
to reduce the effects of early mistakes in the search process. A restart is done when
some cutoff limit in the number of failures (backtracks) is reached (i.e., at some
point in the search tree).
In this section, we briefly review the basic ideas and principles of the last generation
of CSP heuristics. As we pointed out above, a CSP heuristic includes a variable/
value selection procedure. Most common value selection strategies can be summa-
rized as follows: min-value selects the minimum value, max-value selects the max-
imum value, mid-value selects the median value and rand-value selects a random
value from the remaining domain. On the other hand, variable selection heuristics
are usually more important and include more sophisticated algorithms.
In [11], Boussemart et al. proposed wdeg and dom-wdeg to focus the search on
difficult constraints. A weight is associated with each constraint and incremented
each time the constraint fails; the weighted degree (wdeg) of a variable is the sum
of the weights of the constraints involving it. The former heuristic selects the
variable whose weighted degree is maximal. The latter, dom-wdeg, combines the
current domain size and the weighted degree of a variable, choosing the variable
that minimizes the ratio $\frac{dom}{wdeg}$.
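As an illustration, the following sketch (not the authors' implementation; the data structures are hypothetical) shows how constraint weights might be maintained and how a variable could be chosen under wdeg or dom-wdeg:

```python
# Illustrative sketch of wdeg / dom-wdeg variable selection.
# `weights` maps each constraint to its failure count, `constraints_of` maps a
# variable to the constraints involving it, `domains` maps a variable to its domain.

def on_constraint_failure(weights, constraint):
    """Increment the weight of a constraint each time it fails."""
    weights[constraint] += 1

def wdeg(var, constraints_of, weights):
    """Weighted degree of a variable: sum of the weights of its constraints."""
    return sum(weights[c] for c in constraints_of[var])

def select_variable(unassigned, domains, constraints_of, weights, policy="dom-wdeg"):
    if policy == "wdeg":
        # wdeg: maximize the weighted degree
        return max(unassigned, key=lambda v: wdeg(v, constraints_of, weights))
    # dom-wdeg: minimize |dom(v)| / wdeg(v)
    return min(unassigned,
               key=lambda v: len(domains[v]) / max(1, wdeg(v, constraints_of, weights)))
```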
In [36], Refalo proposed the impact dynamic variable-value selection heuristic.
This strategy measures the reduction of the search space caused by a decision,
where the size of the search space is estimated by the Cartesian product of the
variable domains (i.e., $|dom(X_1)| \times \ldots \times |dom(X_n)|$); the impact of
a variable is averaged over all previous decisions in the search tree, and the
variable with the highest impact is selected.
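To make the description concrete, here is a small sketch of an impact-style selection rule. It assumes one common way of quantifying the reduction, namely one minus the ratio of the domain-size products before and after a decision; this is an illustration, not necessarily the exact formulation of [36].

```python
import math
from collections import defaultdict

impacts = defaultdict(list)   # per-variable history of observed impacts

def search_space(domains):
    """Estimate of the remaining search space: product of domain sizes."""
    return math.prod(len(d) for d in domains.values())

def record_decision(var, domains_before, domains_after):
    # impact of this decision = relative reduction of the estimated search space
    before, after = search_space(domains_before), search_space(domains_after)
    impacts[var].append(1.0 - after / before)

def select_by_impact(unassigned):
    # average the observed impacts per variable and pick the largest
    return max(unassigned,
               key=lambda v: sum(impacts[v]) / max(1, len(impacts[v])))
```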
f (x) = ∑ αi K(xi , x) + b
Fig. 9.1: Linear Support Vector Machine. The optimal hyperplane is the one maxi-
mizing the minimal distance to the examples
using the same learning algorithm core as in the linear case. In all cases, a new
instance x is classified as positive (or negative) if f (x) is positive (or negative).
The exploitation or production mode (step 1) aims at solving new problem in-
stances as quickly as possible. The exploration or learning mode (steps 2 and 3)
aims at learning a more accurate heuristics model.
Initially, the heuristics model of a continuous search system is empty, that is, it
is set to the default settings of the search system. In the proposed CS-based con-
straint programming, the default setting is a given heuristic, DEF in the following
(Section 9.4). Arguably, DEF is a reasonably good strategy on average; the chal-
lenge is to improve on DEF for the types of instances encountered in production
mode.
Fig. 9.3: dyn-CS: selecting the best heuristic at each restart point
checkpoint and uses it for the whole checkpoint window. In learning (or exploration)
mode, other combinations of heuristics are applied (Section 9.4.4) and the eventual
result (depending on whether these other heuristics improved on the current heuristics
model f(x)) leads to the building of training examples (Section 9.4.3). The augmented
training set is then used to relearn f(x).
At each checkpoint (or restart), the description of the problem instance is computed,
including its static and dynamic features.
While a few of these descriptors had already been used in SAT portfolio solvers
[27, 47], many descriptors had to be added as CSPs are more diverse than SAT
instances: SAT instances only involve Boolean variables and clauses, in contrast
with CSPs, which use variables with large domains, and a variety of constraints and
pruning rules [6, 8, 34].
Static features encode the general description of a given problem instance; they are
computed once for each instance as they are not modified during the resolution pro-
cess. The static features also allow one to discriminate between types of problems
and different instances.
• Problem definition (four features): Number of variables, constraints, variables
assigned/not assigned at the beginning of the search.
• Variable size information (six features): prod, sum, min, max,
mean and variance of all variable domain sizes.
• Variable degree information (eight features): min, max, mean and vari-
ance of all variable degrees (or variable domains/degrees).
• Constraints information (six features): The degree (or arity) of a given con-
straint c is represented by the total number of variables involved in c. Like-
wise the size of c is represented by the product of its corresponding variable
domain sizes. Taking into account this information, the following features are
computed: min, max, and mean of constraint size and degree.
• Filtering cost category (eight features): Each constraint c is associated with
a category¹. In this way, we compute the number of constraints for each cate-
gory. Intuitively, each category represents the implementation cost of the filter-
ing algorithm. Cat = {Exponential, Cubic, Quadratic, Linear expensive, Linear
cheap, Ternary, Binary, Unary}, where Linear expensive (or cheap) indicates
the complexity of a linear equation constraint and the last three categories indi-
cate the number of variables involved in the constraint. More information about
the filtering cost category can be found in [20].
Two kinds of dynamic features are used to monitor the performance of the search
effort at a given checkpoint: global statistics describe the progress of the overall
search process; local statistics check the evolution of a given strategy.
¹ Out of eight categories, detailed in
http://www.gecode.org/doc-latest/reference/classGecode_1_1PropCost.html.
• Heuristic criteria (15 features): each heuristic criterion (e.g., wdeg, dom-wdeg,
impacts) is computed for each variable; their prod, min, max, mean and
variance over all variables are used as features.
• Constraint weights (12 features): the min, max, mean and variance of
all constraint weights (i.e., constraints wdeg). Additionally, the mean for each
filtering cost category is used as a feature.
• Constraints information (three features): the min, max and mean of a con-
straint’s run-prop, where run-prop indicates the number of times the propaga-
tion engine has called the filtering algorithm of a given constraint.
• Checkpoint information (33 features): for every checkpoint_i, relevant infor-
mation from the previous checkpoint_{i−1} (when available) is included in the
feature vector. From checkpoint_{i−1} we include the total number of nodes and
maximum search depth. From the latest non-failed node, we consider the total
number of assigned variables, satisfied constraints, sum of variables wdeg (or
size and degree) and product of variable degrees (or domain, wdeg and impacts)
of non-assigned variables. Finally, using the previous 11 features the mean and
variance are computed taking into account all visited checkpoints.
The attributes listed above amount to a collection of 95 features.
Feature preprocessing is a very important step in Machine Learning [45], which can
significantly improve the prediction accuracy of the learned hypothesis. Typically,
the descriptive features detailed above are on different scales; the number of vari-
ables and/or constraints can be high, while the impact of a (variable, value) pair
lies between 0 and 1. A data normalization step is performed using the min-max
normalization formula [41]:

$$v_i' = \frac{v_i - \min_i}{\max_i - \min_i} \times (\max_{new} - \min_{new}) + \min_{new}$$

where $\min_{new} = -1$, $\max_{new} = 1$, and $\min_i$ (resp. $\max_i$) corresponds to the
minimum (resp. maximum) normalization value for the $i$th feature. In this way, feature
values are scaled down to $[-1, 1]$. Although selecting the most informative features
might improve performance, in this chapter we do not consider any feature selection
algorithm; only features that are constant over all examples are removed, as they offer
no discriminant information.
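A minimal sketch of this preprocessing step, assuming the feature vectors are plain lists of numbers, could look as follows:

```python
def min_max_normalize(vectors, new_min=-1.0, new_max=1.0):
    """Rescale every feature to [new_min, new_max] and drop constant features."""
    n_features = len(vectors[0])
    mins = [min(v[i] for v in vectors) for i in range(n_features)]
    maxs = [max(v[i] for v in vectors) for i in range(n_features)]
    kept = [i for i in range(n_features) if maxs[i] > mins[i]]  # constant features removed
    scaled = []
    for v in vectors:
        scaled.append([(v[i] - mins[i]) / (maxs[i] - mins[i])
                       * (new_max - new_min) + new_min
                       for i in kept])
    return scaled, kept
```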
The selection of the best heuristic for a given problem instance is formulated as
a binary classification problem, as follows. Let H denote the set of k candidate
heuristics, two particular elements in H being DEF (the default heuristic yielding
reasonably good results on average) and dyn-CS, the (dynamic) ML-based heuristic
model initially set to DEF.
If a problem instance cannot be solved whatever the heuristic used, i.e., it times out
during the exploration or exploitation mode, it is discarded (since the associated
training examples do not provide any relevant information).
In production mode, the hypothesis f learned from the above training examples
(their generation is detailed in the next subsection) is used as follows:

Definition 4. At each checkpoint, for each h ∈ H, the description x_h and the associated
value f(x_h) are computed.
If there exists a single h such that f(x_h) is positive, it is selected and used in the
subsequent search effort.
If there exist several heuristics with positive f(x_h), the one with maximal value is
selected².
If f(x_h) is negative for all h, the default heuristic DEF is selected.
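A sketch of this selection rule is given below; describe(h) and learned_f are hypothetical stand-ins for the feature computation and the learned hypothesis f:

```python
def select_heuristic(heuristics, default, describe, learned_f):
    """Apply Definition 4 at a checkpoint: pick the heuristic with the largest
    positive decision value, or fall back to the default one (DEF)."""
    scores = {h: learned_f(describe(h)) for h in heuristics}
    positive = {h: s for h, s in scores.items() if s > 0}
    if not positive:
        return default                          # no heuristic predicted to improve on DEF
    return max(positive, key=positive.get)      # largest margin = most confident prediction
```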
The Continuous Search paradigm uses the idle computer’s CPU cycles to ex-
plore different heuristic combinations on the last seen problem instance, and to
see whether one could have done better than the current heuristics model on this
instance. The rationale for this exploration is that improving on the last seen in-
stance (albeit meaningless from a production viewpoint since the user already got
a solution) will deliver useful indications as to how to best deal with other similar
instances. In this way, the heuristics model will be tailored to the distribution of
problem instances actually dealt with by the user.
2 The rationale for this decision is that the margin, i.e., the distance of the example from the
separating hyperplane, is interpreted as the confidence of the prediction [44].
The CS exploration proceeds by slightly perturbing the heuristics model. Let
dyn-CS_{−i,h} denote the policy defined as: use the heuristics model dyn-CS at all
checkpoints except the i-th one, where heuristic h is used instead.
It is well known that one of the heuristics often performs much better than the oth-
ers for a particular distribution of problems [15]. Accordingly, negative training
examples considerably outnumber the positive ones (it is difficult to improve on the
winning heuristics). This phenomenon, known as Imbalanced distribution, might
severely hinder the SVM algorithm [1]. Simple ways of enforcing a balanced distri-
bution in such cases, intensively examined in the literature and considered in earlier
work [4], are oversampling examples in the minority class (generating additional
positive examples by perturbing the available ones according to a Gaussian distri-
bution) and/or undersampling examples in the majority class.
Another option is to use prior knowledge to rebalance the training distribution.
Formally, instead of labeling an example positive (or negative) iff its associated
runtime is strictly less (or greater) than that of the heuristic model, we consider the
difference between the runtimes. If the difference is less than some tolerance value
dt, then the example is relabeled as positive.
The number of positive examples and hence the coverage of the learned heuris-
tics model increases with dt; in the experiments (Section 9.5), dt is set to 20%
of the time limit if time-exploitation (the time required to solve a given instance in
production mode) is greater than 20% of the time limit; otherwise dt is set to time-
exploitation.
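Under these conventions, the relabeling rule can be sketched as:

```python
def label_example(exploration_time, exploitation_time, time_limit):
    """Positive iff the exploration run is within dt of the exploitation run."""
    dt = min(0.2 * time_limit, exploitation_time)   # tolerance as described above
    return +1 if exploration_time - exploitation_time < dt else -1
```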
The presented experiments consider 496 CSP instances taken from different reposi-
tories.
• nsp: 100 nurse-scheduling instances from the MiniZinc³ repository.
• bibd: 83 Balanced Incomplete Block Design instances from the XCSP [39]
repository, translated into Gecode using Tailor [22].
• js: 130 Job Shop instances from the XCSP repository.
• geom: 100 Geometric instances from the XCSP repository.
• lfn: 83 Langford number instances, translated into Gecode using global and
channelling constraints.
The learning algorithm used in the experimental validation of the proposed ap-
proach is a Support Vector Machine with a Gaussian kernel using the libSVM
implementation with default parameters [14]. All considered CSP heuristics (Sec-
tion 9.2) are homemade implementations integrated in the Gecode 2.1.1 [20] con-
straint solver. dyn-CS was used as a heuristics model on top of the heuristics set
H = {dom-wdeg, wdeg, dom-deg, min-dom, impacts}, with min-value as the value
selection heuristic. The cutoff value used to restart the search was initially set to
1,000 and the cutoff increase policy to ×1.5; the same cutoff policy is used in all
the experimental scenarios.
Continuous Search was assessed against the best two dynamic variable order-
ing heuristics on the considered problems, namely dom-wdeg and wdeg. It must
be noted that Continuous Search, being a lifelong learning system, will depend on
the curriculum, that is, the order of the submitted instances. If the user “pedagog-
ically” starts by submitting informative instances first, the performance in the first
³ http://www.g12.cs.mu.oz.au/minizinc/download.html.
stages will be better than if untypical and awkward instances are considered first.
For the sake of fairness, the performance reported for Continuous Search on each
problem instance is the median performance over ten random orderings of the CSP
instances.
The first experimental scenario involves a timeout of five minutes. Figure 9.5 high-
lights the Continuous Search results on Langford number problems, compared to
dom-wdeg and wdeg. The x-axis gives the number of problems solved and the
y-axis presents the cumulated runtime. The (median) dyn-CS performance (gray
line) is satisfactory as it solves 12 more instances than dom-wdeg (black line)
and wdeg (light gray line). The dispersion of the dyn-CS results depending on the
instance ordering is depicted through the set of dashed lines. Indeed, traditional
portfolio approaches such as [27, 40, 47] do not present such performance varia-
tions as they assume a complete set of training examples to be available before-
hand.
Figures 9.6 to 9.9 depict the performance of dyn-CS, dom-wdeg and wdeg on all
other problem families, respectively (bibd, js, nsp, and geom). On the bibd (Fig-
ure 9.7) and js (Figure 9.8) problems, the best heuristic is dom-wdeg, solving three
more instances than dyn-CS. Note that dom-wdeg and wdeg coincide on bibd since
all decision variables are Boolean.
On nsp (Figure 9.9), dyn-CS solves nine more problems than dom-wdeg, but is
outperformed by wdeg by 11 problems. On geom (Figure 9.6), dyn-CS improves on
the other heuristics, solving respectively three more instances and 40 more instances
than dom-wdeg and wdeg.
These results suggest that dyn-CS is most often able to pick up the best heuristic
on a given problem family, and sometimes able to significantly improve on the best
of the available heuristics.
All experimental results concerning the first scenario are summarized in Table
9.1, reporting for each considered heuristic the number of instances solved (#sol),
the total computational cost for all instances (time, in hours), and the average time
per instance (avg-time, in minutes), over all problem families. These results con-
firm that dyn-CS outperforms dom-wdeg and wdeg, solving respectively 18 and 41
instances more out of 315. Furthermore, it shows that dyn-CS is slightly faster than
the other heuristics, with an average time of 2.11 minutes, against 2.39 for dom-
wdeg and 2.61 for wdeg.
The results of the second experiment, using a timeout of three minutes, are presented
in Table 9.2; as can be observed, decreasing the time limit drastically reduces the total
number of instances solved by dom-wdeg and wdeg. Therefore, selecting the right
heuristic becomes even more critical. Here dyn-CS is able to solve 24 and 45 more
instances than dom-wdeg and wdeg, respectively.
Another interesting lesson learnt from the experiments concerns the difficulty
of the underlying learning problem and the generalization error of the learnt hy-
pothesis. The generalization error in the Continuous Search framework is estimated
by 10-fold Cross-Validation [5] on the whole training set (including all training
examples generated in the exploration mode). Table 9.3 reports on the predictive
accuracy of the SVM algorithm (with the same default setting) on all problem
families, with an average accuracy of 67.8%. As could have been expected, the
predictive accuracy is correlated to the performance of Continuous Search: the
problems with best accuracy and best performance improvement are geom and
lfn.
For comparison, a predictive accuracy of 62% was reported in the context of
SATzilla [47], aimed at selecting the best heuristic in a portfolio.
A direct comparison of the predictive accuracy might however be biased. On
the one hand SATzilla errors are attributed, according to its authors, to the selec-
tion of some near-optimal heuristics; on the other hand, Continuous Search involves
several selection steps (in each checkpoint) and can thus compensate for earlier er-
rors.
Now we turn our attention to the CPU time required to complete the exploration
mode. Table 9.4 shows the total exploration time considering a timeout of five and
three minutes for each problem family; the Median column is computed taking into
account all instance orderings, and the Instance column estimates the exploration time
for a single problem instance.
Table 9.4: Total exploration time per problem family (median over all instance orderings) and estimated exploration time per instance, for timeouts of five and three minutes.

             TO = 5 Min            TO = 3 Min
Problem    Median   Instance     Median   Instance
bibd        73.6      0.9         48.3      0.6
nsp        151.1      1.5        106.8      1.1
geom        71.6      0.7         37.6      0.4
lfn        161.8      1.9        100.3      1.0
js         215.6      1.7        135.6      1.0
As can be seen, the time required to complete the exploration mode after solving
a problem instance is on average no longer than two hours. On the other hand, since
the majority of the instances for the bibd and geom problems can be quickly solved
by dyn-CS, it is not surprising that their required time is significantly less than that
of nsp, js and lfn.
Our third experimental test combines instances from different domains in order to
show how CS is able to adapt to changing problem distributions. Indeed, unlike
classical portfolio-based approaches which can only be applied if the training and
exploitation sets come from the same domain, CS can adapt to changes and provide
top performance even if the problems change.
In this context, Table 9.5 reports the results on the geom (left) and bibd (right)
problems by considering the following two scenarios. In the first scenario, we are
going to emulate a portfolio-based search which would use the wrong domain to
train. In nsp-geom‡ , CS incrementally learns while solving the 100 nsp instances,
and then solves one by one the 100 geom instances. However, when switching to
this second domain, incremental learning is switched off, and checkpoint adaptation
uses the model learnt on nsp. In the second scenario, nsp-geom† , we solve nsp, and
then geom instances one by one, but this time we keep the incremental learning on
when switching from the first domain to the second one, as if CS were not aware of
the transition.
As we can see in the first line of the table, training on the wrong domain gives
poor performance (55 instances solved in 4.1 hours). In contrast, the second line
shows that CS can recover from training on the wrong domain thanks to its incre-
mental adaptation (solving 67 instances in 3.4 hours). The right hand of the table
reports similar results for the bibd problem.
As can be observed for nsp-geom† and lfn-bibd†, CS successfully identifies the
new distribution of problems, solving respectively the same number of instances and
two fewer instances than when CS is applied to geom or bibd alone, starting from
scratch. However, the detection of the new distribution introduces an overhead in
the solving time (see the single-domain results in Table 9.1).
This section briefly reviews and discusses related work devoted to heuristic
selection within CP and SAT solvers. It is divided into three main categories:
per-instance learning, per-class learning, and other work.
problem instance for a given algorithm. The goal of SATzilla is to learn a run-time
prediction function by using a linear regression model. Notice that this portfolio
solver follows a traditional batch learning scenario in which all training examples
are provided at once in order to learn a machine learning model.
CPHydra [32], the overall best solver in the 2008 CSP competition,4 is a portfolio
approach based on case-based reasoning [37]; it maintains a database with all solved
instances (so-called cases). When a new instance I arrives a set of similar cases C
is computed, and based on C, CPHydra builds a switching policy which selects (and
executes) a set of solvers that maximizes the probability of solving I within a given
amount of time.
The approach most similar to the presented one is that of [40], which applies Ma-
chine Learning techniques to combine on-line heuristics into search tree procedures.
Unfortunately, this work requires a large number of training instances to build
a model with good generalization properties. The Continuous Search approach was
instead designed to take into account the fact that such a large and representative
training set is missing in general, and should be constructed during the lifetime of
the system.
In [12, 13], a low-knowledge approach is used to select the best algorithm in the context of
optimization problems; this work assumes a black-box optimization scenario where
the user has no information about the problem or even about the domain of the
problem, and the only known information is about the output (i.e., solution cost for
each algorithm in the portfolio). Unfortunately, this mechanism is only applicable
to optimization problems and cannot be used to solve CSPs.
So far, we have presented machine learning approaches dealing with the heuristic
selection problem. We switch our attention to the class-based methodologies dealing
with the algorithm configuration problem whose objective is to find the best param-
eter configuration for a given algorithm in order to efficiently solve a given class of
problems.
In [7], Birattari et al. presented F-Race, a racing algorithm in which candi-
dates (i.e., parameter configurations) compete with each other, poor candidates are
quickly discarded, and the search is focused on high-quality candidates. The race is
repeated until one candidate remains or a given timeout is reached.
paramILS [28] is an iterated local search procedure to select the best parameter
configuration for a given algorithm; it starts with an initial parameter configuration
and then applies a set of local search operators to focus the search on promising
candidates, avoiding the over-tuning problem, which is a side effect of considering
a fixed number of training instances. It is worth mentioning that paramILS
⁴ http://www.cril.univ-artois.fr/CPAI08/.
has been successfully used to tune algorithms from different domains, ranging from
SAT to mixed integer programming problems.
Finally, Ansótegui et al. recently proposed GGA [3], a genetic algorithm to tackle
the algorithm configuration problem; this approach uses a gender
separation strategy to reduce the number of executions of a given configuration can-
didate. An empirical comparison showed that GGA is able to find parameter config-
urations as good as the ones found by paramILS.
To conclude the related work section, we highlight some strategies that have been
developed to combine CSP heuristics during the resolution process.
The quickest first principle (QFP) [10] is a methodology for combining CSP
heuristics. QFP relies on the fact that easy instances can frequently be solved by
simple algorithms, while exceptionally difficult instances will require more com-
plex heuristics. In this context, it is necessary to predefine an execution order for
heuristics, and the switching mechanism is set according to the thrashing indica-
tor; once the thrashing value of the current strategy reaches some cutoff value, it
becomes necessary to continue the search procedure with the following heuristic
in the sequence. Although QFP is a very elegant approach, this static methodology
does not actually learn any information about solved instances.
The purpose of the Adaptive Constraint Engine (ACE) [18] is to unify the deci-
sion of several heuristics in order to guide the search process. Each heuristic votes
for a possible variable/value decision to solve a CSP. Afterwards, a global controller
selects the most appropriate variable/value pair according to previously (off-line)
learnt weights associated with each heuristic. The authors however did not present
any experimental scenario taking into account any restart strategy, although this
nowadays is an essential part of constraint solvers. In the same way, Petrovic and
Epstein in [33] showed that using a subset of powerful heuristics is more useful than
using all available ones.
Finally, Combining Multiple Heuristics Online [43] and Portfolios with dead-
lines [46] are designed to build scheduler policies to adapt the execution of black-
box solvers during the resolution process. However, in these papers the switching
mechanism is learnt or defined beforehand, while our approach relies on the use of
machine learning to change the execution of heuristics on the fly.
The main contribution of the presented approach, the Continuous Search frame-
work, aims at designing a heuristics model tailored to the user’s problem distribu-
tion, allowing him or her to get top performance from the constraint solver. The
representative instances needed to train a good heuristics model are not assumed to
be available beforehand; they are gradually built and exploited for improving the
current heuristics model by stealing the idle CPU cycles of the computing system.
Metaphorically speaking, the constraint solver uses its spare time to play against it-
self and gradually improve its strategy over time; further, this expertise is relevant to
the real-world problems considered by the user, all the more so as it directly relates
to the problem instances submitted to the system.
The experimental results suggest that Continuous Search is able to pick up the
best of a set of heuristics on a diverse set of problems, by exploiting the incoming
instances; in two out of five problems, Continuous Search swiftly builds up a mixed
strategy, significantly outperforming all baseline heuristics. On the other classes of
problems, its performance is comparable to that of the two best single heuristics.
Our experiments also showed the capacity of adaptation of CS. Moving from one
problem domain to another is possible thanks to its incremental learning capacity.
This capacity is a major improvement over classical portfolio-based approaches,
which only work when off-line training and exploitation use instances from the same
domain.
Further work will investigate the use of Active Learning [9, 17] in order to select
the most informative training examples and focus the exploration mode on the most
promising heuristics. Another point regards the feature design; better features would
be needed to get higher predictive accuracy, which governs the efficiency of the
approach.
References
[1] Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to
imbalanced datasets. In: J. F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi
(eds.) 15th European Conference on Machine Learning, Lecture Notes in Com-
puter Science, vol. 3201, pp. 39–50. Springer, Pisa, Italy (2004)
[2] Al-Shahib, A., Breitling, R., Gilbert, D. R.: Predicting protein function by ma-
chine learning on amino acid sequences – a critical evaluation. BMC Genomics
78(2) (2007)
[3] Ansótegui, C., Sellmann, M., Tierney, K.: A gender-based genetic algorithm
for the automatic configuration of algorithms. In: I. P. Gent (ed.) 15th Inter-
national Conference on Principles and Practice of Constraint Programming,
Lecture Notes in Computer Science, vol. 5732, pp. 142–157. Springer, Lisbon,
Portugal (2009)
[4] Arbelaez, A., Hamadi, Y., Sebag, M.: Online heuristic selection in constraint
programming. In: International Symposium on Combinatorial Search. Lake
Arrowhead, USA (2009)
[5] Bailey, T. L., Elkan, C.: Estimating the accuracy of learned concepts. In: IJ-
CAI, pp. 895–901 (1993)
[6] Beldiceanu, N., Carlsson, M., Demassey, S., Petit, T.: Global constraint cata-
logue: Past, present and future. Constraints 12(1), 21–62 (2007)
[7] Birattari, M., Stützle, T., Paquete, L., Varrentrapp, K.: A racing algorithm for
configuring metaheuristics. In: W. B. Langdon, E. Cantú-Paz, K. E. Mathias,
R. Roy, D. Davis, R. Poli, K. Balakrishnan, V. Honavar, G. Rudolph, J. We-
gener, L. Bull, M. A. Potter, A. C. Schultz, J. F. Miller, E. K. Burke, N. Jonoska
(eds.) Proceedings of the Genetic and Evolutionary Computation Conference,
pp. 11–18. Morgan Kaufmann, New York, USA (2002)
[8] Bordeaux, L., Hamadi, Y., Zhang, L.: Propositional satisfiability and constraint
programming: A comparative survey. ACM Comput. Surv. 38(4) (2006)
[9] Bordes, A., Ertekin, S., Weston, J., Bottou, L.: Fast kernel classifiers with on-
line and active learning. Journal of Machine Learning Research 6, 1579–1619
(2005)
[10] Borrett, J. E., Tsang, E. P. K., Walsh, N. R.: Adaptive constraint satisfaction:
The quickest first principle. In: W. Wahlster (ed.) 12th European Conference
on Artificial Intelligence, pp. 160–164. John Wiley and Sons, Chichester, Bu-
dapest, Hungary (1996)
[11] Boussemart, F., Hemery, F., Lecoutre, C., Sais, L.: Boosting systematic search
by weighting constraints. In: R. L. de Mántaras, L. Saitta (eds.) Proceedings
of the 16th European Conference on Artificial Intelligence, pp. 146–150. IOS
Press, Valencia, Spain (2004)
[12] Carchrae, T., Beck, J. C.: Low-knowledge algorithm control. In: D. L.
McGuinness, G. Ferguson (eds.) Proceedings of the Nineteenth National Con-
ference on Artificial Intelligence, Sixteenth Conference on Innovative Appli-
cations of Artificial Intelligence, pp. 49–54. AAAI Press / The MIT Press, San
Jose, California, USA (2004)
[13] Carchrae, T., Beck, J. C.: Applying machine learning to low-knowledge con-
trol of optimization algorithms. Computational Intelligence 21(4), 372–387
(2005)
[14] Chang, C. C., Lin, C. J.: LIBSVM: A library for support vector machines
(2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[15] Correira, M., Barahona, P.: On the efficiency of impact based heuristics. In:
P. J. Stuckey (ed.) 14th International Conference on Principles and Practice of
Constraint Programming, Lecture Notes in Computer Science, vol. 5202, pp.
608–612. Springer, Sydney, Australia (2008)
[16] Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines
and other kernel-based learning methods. Cambridge University Press (2000)
[17] Dasgupta, S., Hsu, D., Monteleoni, C.: A general agnostic active learning al-
gorithm. In: J. C. Platt, D. Koller, Y. Singer, S. T. Roweis (eds.) Proceedings of
the Twenty-First Annual Conference on Neural Information Processing Sys-
tems. MIT Press, Vancouver, British Columbia, Canada (2007)
[18] Epstein, S. L., Freuder, E. C., Wallace, R., Morozov, A., Samuels, B.: The
Adaptive Constraint Engine. In: P. Van Hentenryck (ed.) 8th International Con-
[31] Luby, M., Sinclair, A., Zuckerman, D.: Optimal speedup of Las Vegas algo-
rithms. In: ISTCS, pp. 128–133 (1993)
[32] O’Mahony, E., Hebrard, E., Holland, A., Nugent, C., O’Sullivan, B.: Using
case-based reasoning in an algorithm portfolio for constraint solving. In: Pro-
ceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive
Science (2008)
[33] Petrovic, S., Epstein, S. L.: Random subsets support learning a mixture of
heuristics. International Journal on Artificial Intelligence Tools 17(3), 501–
520 (2008)
[34] Prasad, M. R., Biere, A., Gupta, A.: A survey of recent advances in SAT-based
formal verification. STTT 7(2), 156–173 (2005)
[35] Pulina, L., Tacchella, A.: A multi-engine solver for quantified Boolean for-
mulas. In: C. Bessiere (ed.) 13th International Conference on Principles and
Practice of Constraint Programming, Lecture Notes in Computer Science, vol.
4741, pp. 574–589. Springer, Providence, RI, USA (2007)
[36] Refalo, P.: Impact-based search strategies for constraint programming. In: M.
Wallace (ed.) 10th International Conference on Principles and Practice of Con-
straint Programming, Lecture Notes in Computer Science, vol. 2004, pp. 557–
571. Springer, Toronto, Canada (2004)
[37] Richter, M. M., Aamodt, A.: Case-based reasoning foundations. Knowledge
Eng. Review 20(3), 203–207 (2005)
[38] Rish, I., Brodie, M., Ma, S., et al.: Adaptive diagnosis in distributed systems.
IEEE Trans. on Neural Networks 16, 1088–1109 (2005)
[39] Roussel, O., Lecoutre, C.: XML representation of constraint networks: Format
XCSP 2.1. CoRR abs/0902.2362 (2009)
[40] Samulowitz, H., Memisevic, R.: Learning to solve QBF. In: Proceedings of
the Twenty-Second AAAI Conference on Artificial Intelligence. AAAI Press,
Vancouver, British Columbia (2007)
[41] Shalabi, L. A., Shaaban, Z., Kasasbeh, B.: Data mining: A preprocessing en-
gine. Journal of Computer Science 2, 735–739 (2006)
[42] Smith-Miles, K.: Cross-disciplinary perspectives on meta-learning for algo-
rithm selection. ACM Comput. Surv. 41(1) (2008)
[43] Streeter, M., Golovin, D., Smith, S. F.: Combining multiple heuristics online.
In: Proceedings of the Twenty-Second AAAI Conference on Artificial Intel-
ligence, pp. 1197–1203. AAAI Press, Vancouver, British Columbia, Canada
(2007)
[44] Vapnik, V.: The Nature of Statistical Learning. Springer Verlag, New York,
NY, USA (1995)
[45] Witten, I. H., Frank, E.: Data Mining, Practical Machine Learning Tools and
Techniques. Elsevier (2005)
[46] Wu, H., Beek, P. V.: Portfolios with deadlines for backtracking search. Inter-
national Journal on Artificial Intelligence Tools 17, 835–856 (2008)
[47] Xu, L., Hutter, F., Hoos, H. H., Leyton-Brown, K.: The design and analysis of
an algorithm portfolio for SAT. In: C. Bessiere (ed.) 13th International Con-
ference on Principles and Practice of Constraint Programming, Lecture Notes
in Computer Science, vol. 4741, pp. 712–727. Springer, Providence, RI, USA
(2007)
[48] Xu, L., Hutter, F., Hoos, H. H., Leyton-Brown, K.: SATzilla: Portfolio-based
algorithm selection for SAT. Journal of Artificial Intelligence Research 32, 565–
606 (2008)
Chapter 10
Control-Based Clause Sharing in Parallel SAT
Solving
10.1 Introduction
The recent successes of SAT solvers in traditional hardware and software appli-
cations have extended to important new domains. Today, they represent essential
low-level reasoning components used in general theorem proving, computational
biology, AI, and so. This popularity gain is related to their breakthrough in real-
world instances involving millions of clauses and hundreds of thousands of vari-
ables. These solvers, called modern SAT solvers [22, 11], are based on a nice com-
bination of (i) clause learning [21, 22], (ii) activity-based heuristics [22], and (iii)
restart policies [14] enhanced with efficient data structures (e.g., watched literals
[22]). This architecture is now standard, and today only minor improvements have
been observed (cf. SAT competitions). Therefore, it seems difficult to bet on further
order-of-magnitude gains without a radical algorithmic breakthrough.
Fortunately, the recent generalization of multicore hardware gives parallel pro-
cessing capabilities to standard PCs. This represents a real opportunity for SAT
researchers, who can now consider parallel SAT solving as an obvious way to sub-
stantial efficiency gains.
Recent works on parallel SAT are all based on the modern SAT architecture [22],
and therefore systematically exploit clause learning as an easy way to extend the
cooperation between processing units. When a unit learns a new clause, it can share
it with all other units in order to prune their search spaces. Unfortunately, since the
number of potential conflicts is exponential, the systematic sharing of learnt clauses
is not practically feasible. The usual solution is to exchange clauses up to some predefined size
limit. This has the advantage of reducing the overhead of the cooperation while
focusing the exchange on short clauses, recognized as more powerful in terms of
pruning.
In this work, our goal is to improve the clause sharing scheme of modern par-
allel SAT solvers. Indeed, the approach based on some predefined size limit has
several flaws, the first and most apparent being that an overestimated value might
induce a very large cooperation overhead, while an underestimated one might com-
pletely inhibit the cooperation. The second flaw comes from the observation that
the size of learnt clauses tends to increase over time (see Section 10.4.1), leading
to an eventual halt in the cooperation. The third flaw is related to the internal dy-
namics of modern solvers, which tend to focus on particular subproblems thanks
to the activity/restart mechanisms. In parallel SAT, this can lead two search pro-
cesses toward completely different subproblems such that clause sharing becomes
pointless.
We propose a dynamic clause sharing policy which uses pairwise size limits to
control the exchange between any pair of processing units. Initially, high limits are
used to enforce the cooperation, and allow pairwise exchanges. On a regular basis,
each unit considers the number of foreign clauses received from other units. If this
number is below/above a predefined threshold, the pairwise limits are increased/de-
creased. This mechanism allows the system to maintain a target throughput of shared clauses. It addresses
the first and second flaws. To address the last flaw, related to the poor relevance of
the shared clauses, we extend our policy to integrate the quality of the exchanges.
Each unit evaluates the quality of the received clauses, and the control is able to
selectively increase/decrease the pairwise limits based on the underlying quality of
the recently communicated clauses, the rationale being that the information recently
received from a particular source is qualitatively linked to the information which
could be received from it in the very near future. The evolution of the pairwise lim-
its w.r.t. the throughput or quality criterion follows an AIMD (Additive-Increase-
Multiplicative-Decrease) feedback control-based algorithm [6].
The chapter is organized as follows. After the presentation of previous works in
Section 10.2, Section 10.3 gives some technical background. Our dynamic control-
based clause sharing policies are motivated and presented in Section 10.4. Section
10.5 presents an extensive experimental evaluation of the new policies, as opposed
to the standard one. Finally, we conclude by providing some interesting future paths
of research.
We present here two categories of work. The first one addresses parallel SAT
algorithms, while the second one addresses an original relevance restriction strategy
proposed in the context of sequential theorem proving.
We can distinguish between two categories of works: the ones which predate the
introduction of the modern DPLL architecture, and the ones based on this architec-
ture.
PSATO [29] is based on the SATO (SAtisfiability Testing Optimized) sequential
solver [30]. Like SATO, it uses a trie data structure to represent clauses. PSATO
uses the notion of guiding-paths to divide the search space of a problem. These paths
are represented by a set of unit clauses added to the original formula. The parallel
exploration is organized in a master/slave model. The master organizes the work by
assigning guiding-paths to workers that have no interaction with each other. The
first worker to finish stops the system. The balancing of the work is organized by
the master.
In [5], the input formula is dynamically divided into disjoint subformulas. Each
subformula is solved by a sequential SAT solver running on a particular processor.
The algorithm uses optimized data structures to modify Boolean formulas. Addi-
tionally, workload balancing algorithms are used to achieve a uniform distribution
of load among the processors.
The first parallel SAT solver based on a modern DPLL is Gradsat [7] which ex-
tends the zChaff solver with a master/slave model, and implements guiding-paths
to divide the search space of a problem. These paths are represented by a set of unit
clauses added to the original formula. Additionally, learned clauses are exchanged
between all clients if their size is below a predefined limit on the number of literals.
[4] uses an architecture similar to Gradsat. However, a client incorporates a for-
eign clause if it is not subsumed by the current guiding-path constraints. Practically,
clause sharing is implemented by mobile-agents. This approach is supposed to scale
well on computational grids.
Nagsat [12] is a parallel SAT solver which exploits the heavy-tailed distribution
of random 3-SAT instances. It implements nagging, a notion taken from the DALI
theorem prover. Nagging involves a master and a set of clients called naggers. In
Nagsat, the master runs a standard DPLL algorithm with a static variable ordering.
When a nagger becomes idle, it requests a nagpoint which corresponds to the current
state of the master. Upon receiving a nagpoint, it applies a transformation (e.g., a
change in the ordering of the remaining variables) and begins its own search on the
corresponding subproblem.
MiraXT is designed for shared memory multiprocessor systems [19]. It uses a
divide-and-conquer approach where threads share a unique clause database which
represents the original and the learnt clauses. Therefore in this system all the learnt
clauses are shared. When a new clause is learnt by a thread, it uses a lock to safely
update the common database. Read access can be done in parallel.
pMiniSat uses a standard divide-and-conquer approach based on guiding-paths
[8]. It exploits these paths to improve clause sharing. When considered with the
knowledge of the guiding-path of a particular thread, a clause can become small
and therefore highly relevant. This allows pMiniSat to extend the sharing of clauses
since a large clause can become small in another search context.
pMiniSat and MiraXT were respectively ranked second and third in the parallel
track of the 2008 SAT-Race.
In [26], the authors consider the challenge of identifying the relevant clauses of
a problem in order to improve the efficiency of a sequential automated theorem
proving process. They define relevance relative to a given set of clauses S and one
or more distinguished sets of support T . In their theorem proving application, the
set T can be played by the negation of the theorem to be proved or the query to be
answered in S.
The concept of relevance distance between two clauses of S is defined using
various metrics based on properties of the paths connecting the clauses (clauses
are connected if they contain the same literal). This concept is extended to de-
fine relevance distance between a clause and a set (or multiple sets) of sup-
port.
Informally, the relevance distance reflects how closely two clauses are related.
The relevance distance to one or more support sets is used to compute a relevance
set R, a subset of S that is unsatisfiable if and only if S is unsatisfiable. R is computed
as the set of clauses of S at distance less than n from one or more support sets: if n
is sufficiently large then R is unsatisfiable if S is unsatisfiable. If R is much smaller
than S, a refutation from R may be obtainable in much less time than a refutation
from S. R must be efficiently computable to achieve overall efficiency improvement.
The authors define and characterize different relevance metrics.
In this section, we introduce the computational features of modern SAT solvers. This
is followed by a full overview of the ManySAT parallel SAT solver. We finish the
section with the presentation of the principles behind the AIMD feedback control-
based algorithm usually applied to solve TCP congestion control problems. This
technique is exploited in our framework to deal with the combinatorial explosion
problem caused by the huge number of possible clauses to be shared.
Most of the state-of-the-art SAT solvers are based on the Davis, Putnam, Loge-
mann and Loveland procedure, commonly called DPLL [10]. DPLL is a backtrack
search procedure; at each node of the search tree, a decision literal is chosen ac-
cording to some branching heuristic. Its assignment to one of the two possible
values (true or false) is followed by an inference step that deduces and propa-
gates some forced unit literal assignments. The assigned literals (decision literal
and the propagated ones) are labeled with the same decision level, starting from
1, and increased at each decision (or branching) until a model is found or a con-
flict (or a dead end) is encountered. In the first case, the formula is declared
satisfiable, whereas in the second case, we backtrack to the last decision
level and assign the remaining value to the last decision literal. After backtrack-
ing, some variables are unassigned, and the current decision level is decreased ac-
cordingly. The formula is declared unsatisfiable when backtracking to level 0
occurs.
In addition to this basic scheme, modern SAT solvers use additional components
such as restart policy, conflict-driven clause learning and activity-based heuristics.
Let us give some details on these last two important features. First, to learn from
conflict, they maintain a central data structure, the implication graph, which records
the partial assignment that is under construction, together with its implications.
When a dead end occurs, a conflict clause (called asserting clause) is generated
by resolution following a bottom-up traversal of the implication graph. The learn-
ing process stops when a conflict clause containing only one literal from the current
decision level is generated. Such a conflict clause (or learnt clause) expresses that
the literal is implied at a previous level. Modern SAT solvers backtrack to the im-
plication level and assign that literal the value true. The activity of each variable
encountered during the resolution process is increased. The variable with greatest
activity is selected to be assigned next.
As the number of learnt clauses can grow exponentially, even in the sequential
case, modern SAT solvers regularly reduce the database of learnt clauses. In the
SAT solver MiniSat [11], such a reduction (called reduceDB) is achieved as follows.
When the size of the learnt database exceeds a given upper bound (B), it is reduced
by half. The set of deleted clauses corresponds to the least active ones. Initially, B is
set to $\frac{1}{3} \times |\mathcal{F}|$, where $|\mathcal{F}|$ is the number of clauses in the original formula $\mathcal{F}$. At
each restart, B is increased by 10%.
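A rough sketch of this reduction policy (an illustration of the description above, not MiniSat's actual code):

```python
def reduce_db(learnt, activity_of, bound):
    """When the learnt database exceeds the bound, keep only the most active half."""
    if len(learnt) <= bound:
        return learnt
    learnt = sorted(learnt, key=activity_of, reverse=True)
    return learnt[:len(learnt) // 2]

def initial_bound(num_original_clauses):
    return num_original_clauses / 3.0   # B starts at |F| / 3

def next_bound(bound):
    return bound * 1.1                  # B grows by 10% at each restart
```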
Let us mention a strong result by Pipatsrisawat and Darwiche [24], proving that
the proof system of modern SAT solvers p-simulates general resolution. This result
shows that modern SAT solvers as usually used in practice are capable of simu-
lating any resolution proof (given the right branching and restarting heuristics, of
course).
The design of ManySAT benefits from the main weaknesses of modern SAT solvers:
their sensitivity to parameter tuning and their lack of robustness. ManySAT uses
a portfolio of complementary sequential algorithms obtained through careful vari-
ations of the most important components of modern SAT solvers –two-watched-
literal, unit propagation, activity-based decision heuristics, lemma deletion strate-
gies, and clause learning. Additionally, each sequential algorithm shares clauses
with each other to improve the overall performance of the whole system. This con-
trasts with most of the parallel SAT solvers generally designed using the divide-and-
conquer paradigm (see Section 10.2.1). The first version of ManySAT parameterizes
these techniques and runs them on four different cores. In the following, we detail
the particular combinations of techniques used by the different cores. This version
won the parallel track of the 2008 SAT-Race. Interested readers will find a full pre-
sentation of this solver in [16].
This information is a good indicator of the decision errors made during search. Moreover, it
can be seen as an interesting measure of the relative hardness of the instance. This
new policy is designed in such a way that, for high (or low) fluctuation of the av-
erage size of backjumps (between the current and the previous restart), it delivers
a low (or high) cutoff value. In other words, the cutoff value of the next restart
depends on the average size of backjumps observed during the two previous con-
secutive runs. It is defined as $x_1 = 100$, $x_2 = 100$, and

$$x_{i+1} = \frac{\alpha}{y_i} \times |\cos(1 - r_i)|, \quad i \ge 2,$$

where $\alpha = 1200$, $y_i$ represents the average size of backjumps at restart $i$, and
$r_i = \frac{y_{i-1}}{y_i}$ if $y_{i-1} < y_i$ and $r_i = \frac{y_i}{y_{i-1}}$ otherwise (see Table 10.1, column 3). The
cutoff value $x_i$ is minimal when the ratio between the average size of jumps be-
tween the two previous consecutive runs is equal to 1. This dynamic restart pol-
icy was introduced in ManySAT [15]. Other dynamic policies were presented in
[27, 3, 25].
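A direct transcription of this policy (illustrative code, not the ManySAT implementation) reads:

```python
import math

def next_cutoff(y_prev, y_curr, alpha=1200):
    """Cutoff x_{i+1} computed from the average backjump sizes of the two previous runs."""
    r = y_prev / y_curr if y_prev < y_curr else y_curr / y_prev   # r lies in (0, 1]
    return (alpha / y_curr) * abs(math.cos(1.0 - r))

# The first two cutoff values are fixed: x1 = x2 = 100.
```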
Figure 10.1 presents the previous restart policies: geometric, dynamic (plotted on
a particular instance), arithmetic, and Luby with a unit of 512.
In ManySAT, the first core uses increased random noise with its VSIDS heuristic
[22] since its restart policy is the slowest one. Indeed, that core tends to intensify
the search, and slightly increasing the random noise allows us to introduce more
diversification.
Each time a variable is chosen, one needs to decide if such a variable might be as-
signed the value true (positive polarity) or false (negative polarity). Different kinds
of polarity have been defined. For example, MiniSat usually chooses negative polar-
ity, whereas RSat [23] uses progress saving. More precisely, each time a backtrack
occurs, the polarities of the assigned variables between the conflict and the back-
jumping level are saved. If one of these variables is chosen again its saved polarity
is preferred. In CDCL-based solvers, the chosen polarity might have a direct impact
on the learnt clauses and on the performance of the solver.
The polarity policy of the core 0 is defined according to the number of occur-
rences of each literal in the learnt database. Each time a learnt clause is generated,
the number of occurrences of each literal is increased by 1. Then, to maintain a more
constrained learnt database, the polarity of l is set to true when #occ(l) is greater
than #occ(¬l), and to false otherwise. For example, setting the polarity of l to true
favors the occurrence of its negation ¬l in the next learnt clauses.
This approach tends to balance the polarity of each literal in the learnt database.
By doing so, it increases the number of possible resolvents between the learnt
clauses. If the relevance of a given resolvent is defined as the number of steps needed
to derive it, then a resolvent between two learnt clauses might lead to more relevant
clauses in the database.
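A small sketch of this occurrence-based polarity policy (literals are encoded as signed integers; the names are illustrative):

```python
from collections import Counter

occ = Counter()   # occurrences of each literal in the learnt database

def on_learnt_clause(clause):
    """Update literal occurrence counts when a new clause is learnt."""
    occ.update(clause)

def polarity(var):
    """Prefer the value true when the positive literal occurs more often."""
    return occ[var] > occ[-var]
```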
Since the restart strategy in core 0 tends to intensify the search, it is important to
maintain a learnt database of better quality. However, for rapid restarts as in cores
1 and 3, progress saving is more suitable for saving the work. Core 2 applies a
complementary polarity (false by default, as in MiniSat).
10.3.2.3 Learning
Learning is another important component crucial for the efficiency of modern SAT
solvers. Most of the known solvers use similar CDCL approaches associated with
the first UIP (Unique Implication Point) scheme.
In ManySAT a new learning scheme obtained using an extension of the classi-
cal implication graph is also used [2]. This new notion considers additional arcs,
called inverse arcs. These are obtained by taking into account the satisfied clauses
of the formula, which are usually ignored in classical conflict analysis. The new
arcs present in this extended graph can show that even some decision literals admit
a reason, something ignored when using classical implication graphs. As a result,
the size of the backjumps is often increased.
This new learning scheme is integrated into the SAT solvers of cores 0 and 3.
In ManySAT, each core exchanges a learnt clause if its size is less than or equal to 8.
This decision is based on extensive tests with representative industrial instances.
The communication between the solvers of the portfolio is organized through
lockless queues which contain the lemmas that a particular core wants to ex-
change.
10.3.2.5 Summary
Table 10.1 summarizes the choices made for the different solvers of the ManySAT
portfolio. For each solver (core), we mention the restart policy, the heuristic, the
polarity, the learning scheme and the size of the shared clauses.
10.4.1 Motivation
or equal to a size limit $e^k_{j \to i}$, i.e., $\Delta^k_{j \to i} = \{c \in \Delta^k_j \mid |c| \le e^k_{j \to i}\}$. Then, the set $R^k_i$ can
be formally defined as $\bigcup_{0 \le j < n,\ j \ne i} \Delta^k_{j \to i}$.
Using a fixed throughput threshold T of shared clauses, we describe our control-
based policies, which allow each unit $u_i$ to guide the evolution of the size limit $e_{j \to i}$
using an AIMD feedback mechanism.
As illustrated in Figure 10.3, at step k a given unit u_i checks whether the throughput is exceeded or not. If |R^k_i| < T (respectively |R^k_i| > T), the size limit e^{k+1}_{j→i} is additively increased (respectively multiplicatively decreased). More formally, the upper bound e^{k+1}_{j→i} on the size of clauses that a solver j shares with the solver i between k and k+1 is changed using the following AIMD functions, where aimdT uses the throughput alone and aimdTQ additionally weights the update by the quality Q^k_{j→i} of the clauses recently received from unit j (the quality measures are defined below):

aimdT(R^k_i) {
   ∀ j | 0 ≤ j < n, j ≠ i :
      e^{k+1}_{j→i} = e^k_{j→i} + b / e^k_{j→i}                      if |R^k_i| < T
      e^{k+1}_{j→i} = e^k_{j→i} − a × e^k_{j→i}                      if |R^k_i| > T
}

aimdTQ(R^k_i) {
   ∀ j | 0 ≤ j < n, j ≠ i :
      e^{k+1}_{j→i} = e^k_{j→i} + Q^k_{j→i} × (b / e^k_{j→i})        if |R^k_i| < T
      e^{k+1}_{j→i} = e^k_{j→i} − (1 − Q^k_{j→i}) × a × e^k_{j→i}    if |R^k_i| > T
}
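These updates can be sketched as follows (a minimal illustration; the dictionary-based interface, with e indexed by the sending unit j and Q giving the quality of its recent clauses, is an assumption of this sketch, not ManySAT's implementation).

def aimd_t(e, R, T, a=0.125, b=8.0):
    # e: current size limits e_{j->i}, one entry per other unit j
    # R: number of clauses received by unit i during the last time window
    # T: throughput threshold
    for j in e:
        if R < T:
            e[j] += b / e[j]            # additive increase, slower for larger limits
        elif R > T:
            e[j] -= a * e[j]            # multiplicative decrease

def aimd_tq(e, R, T, Q, a=0.125, b=8.0):
    # Q[j]: quality (in [0, 1]) of the clauses recently received from unit j
    for j in e:
        if R < T:
            e[j] += Q[j] * (b / e[j])           # better senders increase faster
        elif R > T:
            e[j] -= (1.0 - Q[j]) * a * e[j]     # better senders decrease less

For example, with all limits initialized to 8 and the throughput below T, each limit grows by b/8 = 1 per time window under aimd_t.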
Activity-Based Quality
Our activity-based quality measure, Qa, is defined using the activity of the variables at the basis of the VSIDS heuristic [22], another important component of modern SAT solvers. The variables with the greatest activity are those involved in most of the (recent) conflicts. Indeed, when a conflict occurs, the activities of the variables whose literals appear in the clauses encountered during the generation of a learnt clause are updated. The most active variables are those related to the current part
of the search space. Consequently, our quality measure exploits these activities to
quantify the relevance of a clause learnt by unit u j to the current state of a given unit
ui . To define our quality measure, suppose that, at any time of the search process,
we have Aimax , the current maximal activity of ui ’s variables, and Ai (x), the current
activity of a given variable x.
Definition 10.1 (Activity-Based Quality). Let c be a clause sent by u_j to u_i and let LA_i(c) = {x ∈ c : A_i(x) ≥ A^max_i / 2} be the set of active literals of c with respect to unit u_i. We define P^k_{j→i} = {c ∈ Δ^k_{j→i} : |LA_i(c)| ≥ f(|c|)} to be the set of clauses received by u_i from u_j between steps k − 1 and k with at least f(|c|) active literals. We define the quality of clauses sent by u_j to u_i at a given step k as

   Qa^k_{j→i} = (|P^k_{j→i}| + 1) / (|Δ^k_{j→i}| + 1).

From the definition above, we can see that the quality of a given clause sent by u_j to u_i depends on the activity of its literals. In our experiments the lower bound f(|c|) is set to |c| / 3.
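A sketch of the activity-based quality computation follows (the clause and activity representations are assumptions of the example).

def qa(received_clauses, activity, f=lambda n: n / 3.0):
    # received_clauses: clauses (lists of signed literals) in Delta^k_{j->i}
    # activity:         current VSIDS activity of each of unit i's variables
    # f:                required number of active literals (here |c| / 3)
    if not received_clauses or not activity:
        return 1.0
    threshold = max(activity.values()) / 2.0       # A_i^max / 2
    relevant = 0
    for c in received_clauses:
        active = sum(1 for lit in c if activity.get(abs(lit), 0.0) >= threshold)
        if active >= f(len(c)):
            relevant += 1                          # c belongs to P^k_{j->i}
    return (relevant + 1) / (len(received_clauses) + 1)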
Proof-Based Quality
The best clause to share between u j and ui is the one leading to a conflict on ui . In
other words, the clause sent by u j to ui is falsified by the current assignment ρui .
Such a clause induces conflict analysis and backjumping for unit ui . The second
most preferred shared clause is clearly one that causes unit propagation on ui . Con-
tinuing the same reasoning, we claim that the most “relevant” clause contains a large
number of falsified literals w.r.t. the current assignment ρui . Consequently, a clause
with the largest number of falsified literals is clearly relevant. Interestingly enough,
as the variable ordering used in modern SAT solvers is based on the activities of the
variable, if a clause received by ui shares a large portion of assigned literals, then
it involves active literals. This means that the shared clause is closely related to the
current portion of the search space of ui .
The following definition describes our second quality measure, called proof-
based quality, as its relevance is computed with respect to the current assign-
ment.
Definition 10.2 (Proof-Based Quality). Let c be a clause sent by u_j to u_i, ρ_{u_i} the current assignment of u_i, and LP_i(c) = {x ∈ c : ρ_{u_i}(x) ≠ false} the set of unassigned or satisfied literals of c with respect to unit u_i. We define P^k_{j→i} = {c ∈ Δ^k_{j→i} : |LP_i(c)| ≤ f(|c|)} to be the set of clauses received by u_i from u_j between steps k − 1 and k with at most f(|c|) unassigned or satisfied literals. We define the quality of clauses sent by u_j to u_i at a given step k as

   Qp^k_{j→i} = (|P^k_{j→i}| + 1) / (|Δ^k_{j→i}| + 1).
From Definition 10.2, we can observe that a given clause is considered relevant if the number of its unassigned and satisfied literals is at most a given upper bound f(|c|). In our experiments, we consider f(|c|) = |c| / log(|c|). The idea is that as the clause length increases, the number of falsified literals that a given clause might contain also increases. For example, using the ratio |c| / log(|c|), unary and binary clauses are considered relevant, while a ternary clause is considered relevant if it contains one literal assigned to false. A clause of length 10, 20 or 30 is considered relevant only if it contains 6, 14 or 22 literals assigned to false, respectively.
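The proof-based quality can be sketched in the same style (again, the data representations are assumptions of the example; unary clauses are treated as always relevant since |c| / log(|c|) is undefined for |c| = 1).

import math

def qp(received_clauses, assignment):
    # received_clauses: clauses (lists of signed literals) in Delta^k_{j->i}
    # assignment:       partial assignment rho_{u_i}, mapping a variable to True/False
    def f(n):
        return n if n <= 1 else n / math.log(n)    # upper bound |c| / log(|c|)
    def unassigned_or_satisfied(lit):
        var, positive = abs(lit), lit > 0
        return var not in assignment or assignment[var] == positive
    relevant = 0
    for c in received_clauses:
        if sum(1 for lit in c if unassigned_or_satisfied(lit)) <= f(len(c)):
            relevant += 1                          # c belongs to P^k_{j->i}
    return (relevant + 1) / (len(received_clauses) + 1)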
10.5 Evaluation
10.5.1 Experiments
Our policies were implemented and tested on top of the ManySAT parallel SAT
solver (see Section 10.3.2). Our tests were done on Intel Xeon quad-core machines with 16 GB of RAM running at 2.3 GHz. We used a timeout of 1,500 seconds for
each problem. ManySAT was used with four DPLL strategies, each running on a
particular core (unit). To alleviate the effects of unpredictable thread scheduling,
each problem was solved three times and the average was taken.
Our dynamic clause sharing policies were added to ManySAT and compared to
ManySAT with its default static policy ManySAT e = 8, which exchanges clauses
up to size 8. Since each pairwise limit is read by one unit and updated by another,
our proposal can be integrated without any lock.
In all our experiments, the default parameter settings are a = 0.125 and b = 8 for aimdT, aimdTQa, and aimdTQp, associated with a time window of α = 10,000 conflicts. The throughput T is set to 5,000. Each unit u_i can then receive from the three remaining units 5,000 × 3 = 15,000 clauses in the worst case. Each pairwise limit
e j→i was initialized to 8. The results corresponding to these settings are presented
in Tables 10.2 and 10.3 and Figures 10.4 and 10.6.
Tables 10.2, 10.3 and 10.4 present the results on the 100 industrial problems of
the 2008 SAT-Race. The problem set contains families with several instances or
individual instances. We present the family/instance name and the number of in-
stances per family. For each method, we report the average CPU time (seconds)
for each family and the number of problems solved (#S) before timeout. In the
case of dynamic policies, we also provide ē, the average of the e j→i observed
during the computation. When the symbol “-” is reported, this means that the in-
stance is not solved in the allowed time limit. The last row provides for each
method the total number of problems solved, and the cumulated runtime (with
1,500 seconds added for each unsolved instance). For the dynamic policies, it
also presents the average of the ē values (without considering the unsolved in-
stances).
explain this as follows. The quality-based policy intensifies the search by favoring
the exchange of clauses related to the current exploration of each unit. This intensi-
fication leads to the resolution of more difficult problems. However, it increases the
runtime on easier instances where a more diversified search is often more beneficial.
On these last instances, a small overhead can be observed. It can be explained by
the time needed to evaluate the quality of exchanged clauses. Overall these results
are very good since our dynamic policies are able to outperform the best possible
static tuning.
Our hypothesis is that the introduction of quality measures to control the over-
all exchange might be beneficial in situations where each unit is flooded by the
huge number of received clauses. To validate the hypothesis, in Table 10.4 we consider a new parameter setting corresponding to the worst-case situation, where the amount of exchanged clauses is maximal. To simulate this situation, we first set the time window α to a small value of 1,000 conflicts in order to increase the application frequency of our control-based policies. Secondly, the throughput T is set to 1,000 × 3 = 3,000 clauses, which corresponds to the maximal number of clauses that can be received by each unit u_i during a time window of α conflicts. Finally, to allow a fast arithmetic increase, we also set the value of b to 64. Our goal behind these last settings is to quickly bring each unit towards a maximum amount of exchange, and then evaluate the effect of our quality-based measures.
We can observe that the results obtained in Table 10.4 confirm our hypothesis. Indeed, aimdTQa and aimdTQp are better than aimdT both in CPU time and in terms of the number of solved problems. Clearly, aimdTQa (activity-based quality) obtained the best performance in the worst-case situation. We can also observe that in this intense situation of exchange, aimdTQa obtained the smallest average value of ē (see the last row).
Table 10.4: Industrial problems: ManySAT aimdT, aimdTQa, and aimdTQp for α = 1,000, T = 3,000 and b = 64
We present here results on the crafted category (201 problems) of the 2007 SAT
competition. These problems are handmade and many of them are designed to
beat all the existing SAT solvers. They contain, for example, quasi-group instances,
forced random SAT instances, counting, ordering and pebbling instances, and social
golfer problems.
The scatter plot (in log scale) given in Figure 10.4 illustrates the comparative
results of the static and dynamic throughput versions of ManySAT. The x-axis (or
y-axis) corresponds to the CPU time tx (or ty) obtained by ManySAT e = 8 (or
ManySAT aimdT). Each dot with (tx,ty) coordinates corresponds to a SAT instance.
Dots below (or above) the diagonal indicate that ManySAT aimdT is faster (or
slower) than ManySAT e = 8. The results clearly exhibit that the throughput-based
policies outperform the static policy on the crafted category.
Fig. 10.4: SAT Competition 2007, crafted problems (aimd e = 8 and aimdT)
The above improvements are illustrated in Figure 10.5, which shows the time
in seconds needed to solve a given number of instances (#instances). We can ob-
serve that both aimdT (ē = 23.14 with a peak at 94) and aimdTQa (ē = 26.17
with a peak at 102) solve seven more problems than the static policy. Also, the
aimdTQp (ē = 30.44 with a peak at 150) solves six more problems than the static
policy. The proof-based quality policy (aimdTQp), even if it solves one less prob-
lem than aimdT and aimdTQa, obtains better results in terms of CPU time. As for
the previous problem category, aimdT remains faster than aimdTQa, but slower
than aimdTQp. The proof-based quality is better than the activity-based one. It
seems that since by definition these problems do not have a structure which can
be advantageously exploited by the activity-based heuristics (even in the sequential
case), the proof-based quality is clearly more convenient as it measures the rele-
vance of clauses according to their relationship to the current state of the search
process.
Fig. 10.5: SAT Competition 2007, crafted problems (aimd e = 8, aimdT, aimdTQa,
aimdTQp)
It is interesting to consider the evolution of ē, the average of the e j→i observed
during the computation. In Figure 10.2 it was shown on the 16 instances of the
AProVE07 * family that the size of learnt clauses was increasing over time.
We present in Figure 10.6 the evolution of ē with ManySAT aimdT on the same
family. The evolution is given for each unit (core0 to core3), and the average of the
units is also presented (average). We can see that our dynamic policy compensates for the disappearance of “small” clauses by incrementally raising the pairwise limits. The curves show the typical sawtooth behavior of the probe for the throughput T. This figure, together with our results on the industrial and crafted problems, shows that the pairwise limits do not reach unacceptable levels even though they are not explicitly bounded.
10.6 Conclusion
We have presented several dynamic clause sharing policies for parallel SAT solving.
They use an AIMD feedback control-based algorithm to dynamically adjust the size
of shared clauses between any pair of processing units. Our first policy maintains
an overall number of exchanged clauses (throughput) whereas our second policy
additionally exploits the relevance quality of shared clauses. These policies have
been devised as an efficient answer to the various flaws of the classical static size
limit policy. The experimental results comparing our proposed dynamic policies to
the static policy show important improvements in the state-of-the-art parallel SAT
solver ManySAT. Our proposed framework offers interesting perspectives. For ex-
ample, the design of new, relevant quality measures for clause sharing is of great
importance. It could benefit sequential solvers by improving their learnt base re-
duction strategy and, as demonstrated by this work, have an important impact on
parallel SAT solving.
In the context of Autonomous Search, our work demonstrates how control the-
ory can be integrated during search. In general, feedback mechanisms such as the
one presented here can be used to adjust the behavior of a search algorithm to the
difficulty of the problem at hand. In future work, we are planning to integrate more
feedback mechanisms to address the challenges of large-scale parallel SAT solv-
ing.
References
[1] Büning, H., Zhao, X. (eds.): Theory and Applications of Satisfiability Testing – SAT 2008, 11th International Conference, Guangzhou, China, May 12–15, 2008, Proceedings. Lecture Notes in Computer Science, vol. 4996. Springer (2008)
[2] Audemard, G., Bordeaux, L., Hamadi, Y., Jabbour, S., Sais, L.: A generalized
framework for conflict analysis. In: Proc. of SAT, pp. 21–27 (2008)
[3] Biere, A.: Adaptive restart strategies for conflict driven SAT solvers. In:
Büning and Zhao [1], pp. 28–33 (2008)
[4] Blochinger, W., Sinz, C., Küchlin, W.: Parallel propositional satisfiability
checking with distributed dynamic learning. Parallel Computing 29(7), 969–
994 (2003)
[5] Böhm, M., Speckenmeyer, E.: A fast parallel SAT-solver – efficient workload
balancing. Ann. Math. Artif. Intell. 17(3-4), 381–400 (1996)
[6] Chiu, D. M., Jain, R.: Analysis of the increase and decrease algorithms for con-
gestion avoidance in computer networks. Comp. Networks 17, 1–14 (1989)
[7] Chrabakh, W., Wolski, R.: GrADSAT: A parallel SAT solver for the grid. Tech.
rep., UCSB CS TR N. 2003-05 (2003)
[8] Chu, G., Stuckey, P.: pMiniSat: A parallelization of MiniSat 2.0. Tech. rep.,
SAT-Race 2008, solver description (2008)
[9] Crawford, J. M., Baker, A. B.: Experimental results on the application of satis-
fiability algorithms to scheduling problems. In: AAAI, pp. 1092–1097 (1994)
[10] Davis, M., Logemann, G., Loveland, D. W.: A machine program for theorem-
proving. Comm. of the ACM 5(7), 394–397 (1962)
[11] Eén, N., Sörensson, N.: An extensible SAT-solver. In: Proc. of SAT’03, pp.
502–518 (2003)
[12] Forman, S. L., Segre, A. M.: NAGSAT: A randomized, complete, parallel solver for 3-SAT. In: Proc. of SAT 2002, pp. 236–243 (2002)
[13] Gomes, C. P., Selman, B., Crato, N., Kautz, H.: Heavy-tail phenomena in sat-
isfiability and constraint satisfaction. Journal of Automated Reasoning pp. 67–
100 (2000)
[14] Gomes, C. P., Selman, B., Kautz, H.: Boosting combinatorial search through
randomization. In: Proceedings of the Fifteenth National Conference on Arti-
ficial Intelligence (AAAI’98), pp. 431–437. Madison, Wisconsin (1998)
[15] Hamadi, Y., Jabbour, S., Sais, L.: ManySAT: Solver description. Tech. Rep.
MSR-TR-2008-83, Microsoft Research (2008)
[16] Hamadi, Y., Jabbour, S., Sais, L.: ManySAT: A parallel SAT solver. Journal on
Satisfiability, Boolean Modeling and Computation 6, 245–262 (2009)
[17] Huang, J.: The effect of restarts on the efficiency of clause learning. In: IJCAI,
pp. 2318–2323 (2007)
[18] Jacobson, V.: Congestion avoidance and control. In: Proc. SIGCOMM ’88, pp.
314–329 (1988)
[19] Lewis, M., Schubert, T., Becker, B.: Multithreaded SAT solving. In: Proc. ASP
DAC’07 (2007)
[20] Luby, M., Sinclair, A., Zuckerman, D.: Optimal speedup of Las Vegas algo-
rithms. Information Processing Letters 47, 173–180 (1993)
[21] Marques-Silva, J., Sakallah, K. A.: GRASP – A New Search Algorithm
for Satisfiability. In: Proceedings of IEEE/ACM International Conference on
Computer-Aided Design, pp. 220–227 (1996)
[22] Moskewicz, M. W., Madigan, C. F., Zhao, Y., Zhang, L., Malik, S.: Chaff:
Engineering an efficient SAT solver. In: Proceedings of the 38th Design Au-
tomation Conference (DAC’01), pp. 530–535 (2001)
[23] Pipatsrisawat, K., Darwiche, A.: A lightweight component caching scheme for
satisfiability solvers. In: Proc. SAT’07, pp. 294–299 (2007)
[24] Pipatsrisawat, K., Darwiche, A.: On the power of clause-learning SAT solvers
with restarts. In: Proceedings of the 15th International Conference on the
Principles and Practice of Constraint Programming – CP 2009, pp. 654–668
(2009)
[25] Pipatsrisawat, K., Darwiche, A.: Width-based restart policies for clause-
learning satisfiability solvers. In: O. Kullmann (ed.) SAT, Lecture Notes in
Computer Science, vol. 5584, pp. 341–355. Springer (2009)
[26] Plaisted, D. A., Yahya, A. H.: A relevance restriction strategy for automated
deduction. Artif. Intell. 144(1-2), 59–93 (2003)
[27] Ryvchin, V., Strichman, O.: Local restarts. In: Büning and Zhao [1], pp. 271–
276
[28] Walsh, T.: Search in a small world. In: Proceedings of the Sixteenth Interna-
tional Joint Conference on Artificial Intelligence – IJCAI 99, pp. 1172–1177
(1999)
[29] Zhang, H., Bonacina, M. P., Hsiang, J.: PSATO: a distributed propositional
prover and its application to quasigroup problems. Journal of Symbolic Com-
putation 21, 543–560 (1996)
[30] Zhang, H., Stickel, M. E.: Implementing the Davis-Putnam algorithm by tries.
Tech. rep., AI Center, SRI International, Menlo Park (1994)
Chapter 11
Learning Feature-Based Heuristic Functions
11.1 Introduction
Planning is the process of creating a sequence of actions that achieve some desired
goals. Automated planning arguably plays a key role in both developing intelligent
systems and solving many practical industrial problems. Typical planning problems
are characterized by a structured state space, a set of possible actions, a description
of the effects of each action, and an objective measure. In this chapter, we con-
sider planning as an optimization problem, seeking plans that minimize the cost of
reaching the goals or some other performance measure.
An important challenge in planning research is developing general-purpose plan-
ning systems that rely on minimal human guidance. General-purpose planners – also
referred to as domain-independent – can be easily applied to a large class of prob-
lems, without relying on domain-specific assumptions. Such planners often take as
input problems specified using PDDL [29], a logic-based specification language in-
spired by the well-known STRIPS formulation [27, 61].
Planning plays an important role in many research areas such as robotics, rein-
forcement learning, and operations research. Examples of planning problems in the
various research fields are as follows:
Mission planning – How do we guide an autonomous spacecraft? These prob-
lems are often modeled using highly structured and discrete state and action
spaces. They have been addressed extensively by the classical planning com-
munity [30].
Inventory management – What is the best amount of goods to keep in stock to min-
imize holding costs and shortages? These problems are often modeled using
large continuous state spaces and have been widely studied within operations
research [57].
Robot control – How do we control complex robot manipulations? These stochas-
tic planning problems have a shallow state space structure often with a small
number of actions, and have been studied by the reinforcement learning com-
munity [66].
Interestingly, despite the diverse objectives in these research areas, many of the
solution methods are based on heuristic search [30]. However, different research
fields sometimes emphasize additional aspects of the problem that we do not ad-
dress here. Reinforcement learning, for example, considers model learning to be an
integral part of the planning process.
Planning problems can be solved in several different ways. One approach is to
design a domain-specific solution method. Domain-specific methods are often very
efficient but require significant effort to design. Another approach is to formulate the
planning problem as a generic optimization problem, such as SAT, and solve it using
general-purpose SAT solvers [40]. This approach requires little effort, but often does
not exploit the problem structure. Yet another approach, which is the focus of this
chapter, is to use heuristic search algorithms such as A* or Branch-and-Bound [6,
61]. Heuristic search algorithms represent a compromise between domain-specific
methods and generic optimization. The domain properties are captured by a heuristic
function which assigns an approximate utility to each state. Designing a heuristic
function is usually easier than designing a domain-specific solver. The heuristic
function can also reliably capture domain properties, and therefore heuristic search
algorithms can be more efficient than generic optimization methods.
The efficiency of heuristic search depends largely on the accuracy of the heuristic
function, which is often specifically designed for each planning problem. Design-
ing a good heuristic function, while easier than designing a complete solver, often
requires considerable effort and domain insight. Therefore, a truly autonomous plan-
ning system should not rely on handcrafted heuristic functions. To reduce the need
for constructing heuristics manually, much of the research in autonomous planning
concentrates on automating this process. The goal of automatic heuristic construc-
tion is to increase the applicability of heuristic search to new domains, not necessar-
ily to improve the available heuristics in well-known domains.
The construction of heuristic functions often relies on learning from previous
experience. While constructing a good heuristic function can be useful when solving
a single problem, it is especially beneficial when a number of related problems must
be solved. In that case, early search problems provide the necessary samples for
the learning algorithm, and the time spent on constructing a good heuristic can be
amortized over future search problems.
The focus of this chapter is on constructing heuristic functions within a plan-
ning framework. The specific context of construction within the planning system
is also crucial in designing the learning algorithms. Planning, despite being similar to generic optimization problems, such as SAT, usually encompasses a number of other components. A general framework for learning heuristics in planning systems is shown in Figure 11.1. The ovals represent inputs, the rectangles represent computational components, and the arrows represent information flow. The computational components list sample implementations and the inputs list sample information sources.
Fig. 11.1: Framework for learning heuristic functions. The numbers in parentheses are used to reference the individual components
The inputs are formally defined in Section 11.3. Intuitively, the problem descrip-
tion (1a) denotes a precise model in some description language that can be easily
manipulated. The description languages are often logic-based, such as STRIPS and
PDDL. The features (1b) represent functions that assign a set of real values to each
state. Finally, the samples (1c) represent simple sequences of states and actions that
respect the transition model.
The methods for constructing heuristics can be divided into two broad categories
based on the input they require: 1) analytic methods and 2) inductive methods [74].
Analytic methods use the formal description of the domain to derive a relaxed ver-
sion of the problem, which in turn is used to derive a heuristic. For example, an inte-
ger linear program may be relaxed to a linear program [11, 5], which is much easier
to solve. A STRIPS problem may be abstracted by simply ignoring selected liter-
als [72]. When applicable, analytic methods typically work well, but the required
description is sometimes unavailable.
Inductive methods rely on a provided set of features and transition samples of
the domain. A heuristic function is calculated as a function of these state features
based on transition samples. Inductive methods are easier to apply than analytic
ones, since they rely on fewer assumptions. Moreover, the suitable structure needed
to use abstraction is rarely present in stochastic domains [3].
Inductive heuristic learning methods can be further classified by the source of
the samples into online and offline methods. Online methods interleave execution
of a calculated plan with sample gathering. As a result, a new plan may often be
recalculated during plan execution. Offline methods use a fixed number of samples
gathered earlier, prior to plan execution. They are simpler to analyze and implement
than online methods, but may perform worse due to fewer available samples.
To illustrate the planning framework, consider two common planning techniques
that can learn heuristic functions: Real-Time Dynamic Programming (RTDP) [2]
and Least Squares Policy Iteration (LSPI) [45]. Both methods are examples of in-
ductive learning systems and do not rely on domain descriptions. The component
implementations of RTDP and LSPI are marked with “(R)” and “(L)” respectively
in Figure 11.1. The methods are described in more detail in Section 11.3.3.
RTDP is an online method that is often used to solve stochastic planning prob-
lems. While it is usually considered a type of heuristic search, it can also be seen
as an inductive method for learning heuristic functions. Because features are inde-
pendent for each state, it is possible to represent an arbitrary heuristic function as
shown in Section 11.3.2. However, RTDP updates the heuristic function after ob-
taining each new sample during execution; thus it does not generalize to unseen
states.
LSPI is an offline method that uses a fixed set of predetermined samples. It has
been developed by the reinforcement learning community, which often refers to the
heuristic function as the value function. The features may be arbitrary and therefore
could limit the representable heuristic functions. However, unlike RTDP, LSPI also
generalizes values to states that have not been visited.
Figure 11.1 represents a single planning system, but it is possible to combine
multiple learning systems. Instead of the calculated heuristic function directly being
used to create a policy, it can be used as an additional feature within another plan-
ning system. This is convenient, because it allows analytic methods to create good
features for use with inductive learning.
The remainder of the chapter presents a unified framework for learning heuris-
tic functions. In particular, it focuses on calculating heuristics – component (2)
of the general framework – but it also describes other components and their in-
teractions. Section 11.2 introduces the framework and the search algorithms used
in components (3) and (4). Section 11.3 describes the existing inductive and ana-
lytic approaches to learning heuristic functions. The chapter focuses on inductive
learning methods – the analytic methods are described because they can serve as
a good source of features. Section 11.4 describes an inductive method for con-
structing heuristic functions using linear programming. The linear programming
method is further analyzed in Section 11.5. Experimental results are presented in
Section 11.6.
This chapter concentrates on using previous experience to create or improve
heuristic functions. Previous experience can be used in a variety of ways. For ex-
ample, some planners use samples to learn control rules that speed up the com-
putation [30]. Early examples of such systems are SOAR [46] and PRODIGY [50].
Other notable approaches include learning state invariants [60], abstracting planning
rules [62], and creating new planning rules [47]. In comparison, learning a heuristic
function offers greater flexibility, but it is only relevant to planning systems that use
heuristic search.
This section formally defines the search framework and describes common search
algorithms. The section also discusses metrics of heuristic function quality that can
be used to measure the required search effort. A search problem is given by a set of states S, a set of actions A, a deterministic transition function t, a reward function r, a discount factor γ, an initial state σ, and a goal state τ.
The discount factor γ is considered only for the sake of generality; the framework
also applies to search problems without discounting (i.e., γ = 1). The assumption of
a single goal or initial state is for notational convenience only. Problems with many
goals or initial states can be easily mapped into this framework.
The solution of a search problem is a policy π : S → A ∪ {ν }, which defines the
action to take in each state. Here, ν represents an undefined action and by definition
π (τ ) = ν . The policy is chosen from the set of all valid policies Π . In a deterministic
search problem, a policy defines a path from a state s1 to the goal. The path can be
represented using the sequence
   ({s_i ∈ S, a_i ∈ A ∪ {ν}})^n_{i=1},   where a_i ∈ A, a_i = π(s_i), t(s_i, a_i) = s_{i+1}, and s_n = τ.
The optimal value of a state is the maximal possible value over any goal-terminated
sequence starting in that state. The objective is to find a policy π ∗ that defines a path
for the initial state σ with the maximal discounted cumulative return vπ ∗ (σ ).
Notice that the optimal policy only needs to define actions for the states connect-
ing the initial state σ to the goal state τ . For all other states the policy may have the
value ν . For simplicity, assume that there is always a path from σ to any state s and
that there is a path from any state s to the goal τ . It is easy to modify the techniques
for problems with dead ends and unreachable states. To guarantee the finiteness of
the returns when γ = 1, assume that there are no cycles with positive cumulative
rewards.
Heuristic methods calculate a policy for search problems from a heuristic function
and an initial state. A heuristic function h : S → R returns for each state s an es-
timate of its optimal value v∗ (s). An important property of a heuristic function is
admissibility, which guarantees optimality of some common algorithms. A heuris-
tic function h is admissible when h(s) ≥ v∗(s) for all s ∈ S. A greedy search selects in each state an action that maximizes the immediate reward plus the (discounted) heuristic value of the successor state, with ties broken arbitrarily. Greedy search does not require the initial state and defines a valid action for each state.
Many contemporary heuristic search methods are based on the A* algorithm,
a simplified version of which is depicted in Algorithm 2. It is a best-first search,
guided by a function f : S → R, defined as f(s) = g(s) + h(s). The total reward (possibly negative) achieved for reaching s from the initial state σ is denoted by g(s). States are chosen from an open set O according to their f value.
The successors of the chosen states are consequently added to the open set. A* is
guaranteed to find the optimal solution when the heuristic function is admissible.
A* is an optimally efficient search algorithm since it expands the minimal num-
ber of states needed to prove the optimality of the solution with a consistent heuristic
function. When the heuristic function is inconsistent, the efficiency of A* could be
improved using simple modifications such as PATHMAX [73].
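For concreteness, a compact sketch of best-first search guided by f(s) = g(s) + h(s) in the reward-maximization setting of this chapter is shown below; the problem interface (successors, reward, goal test) is an assumption of the example, and standard refinements such as a closed list or the handling of stale queue entries are omitted.

import heapq

def a_star(sigma, is_goal, successors, reward, h):
    # successors(s) yields (a, s2) pairs; reward(s, a) is the reward of taking a in s.
    # h is an admissible heuristic, i.e., an upper bound on the optimal value v*.
    g = {sigma: 0.0}                       # best total reward found so far
    parent = {sigma: None}
    counter = 0                            # tie-breaker for the priority queue
    open_list = [(-h(sigma), counter, sigma)]
    while open_list:
        _, _, s = heapq.heappop(open_list) # state with the largest f = g + h
        if is_goal(s):
            return g[s], _extract_path(parent, s)
        for a, s2 in successors(s):
            g2 = g[s] + reward(s, a)
            if s2 not in g or g2 > g[s2]:  # keep the path with the largest reward
                g[s2], parent[s2] = g2, s
                counter += 1
                heapq.heappush(open_list, (-(g2 + h(s2)), counter, s2))
    return None

def _extract_path(parent, s):
    path = []
    while s is not None:
        path.append(s)
        s = parent[s]
    return path[::-1]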
Despite the theoretical optimality of A*, its applicability is often limited by the
exponential growth of the open list O. A* modifications have been proposed that ad-
dress its large memory requirements. One common modification is Iterative Deep-
ening A* (IDA*), which repeatedly searches the graph in a depth-first manner with
a decreasing threshold on the value f of states [43]. Others have proposed modifi-
cations that do not guarantee optimality, such as weighted A* [55, 32, 68]. Because
most of these algorithms retain the basic characteristics of A* and use the same
heuristic function, the rest of the chapter focuses on A* as a general representative
of these heuristic search methods.
The presented framework also applies to stochastic domains, formulated as
Markov decision processes (MDPs) [58]. An MDP is a generalization of a search
problem, in which the successor state is chosen stochastically after an action. A* is
not directly applicable in stochastic domains, but some modifications have been pro-
posed. The most commonly used heuristic search algorithm for MDPs is the greedy
best-first search method described above. More sophisticated algorithms used in
stochastic domains are RTDP and LAO*, which are discussed in Section 11.3.2.
However, in some domains, even greedy search can be overly complicated [57].
This section discusses the influence of heuristic function quality on search com-
plexity. To start, it is necessary to define a measure of heuristic quality. The measure
must accurately capture the time complexity of the search process and also must be
easy to estimate and optimize. For example, the number of states that A* expands
is a very precise measure but is hard to calculate. As it turns out, a precise and sim-
ple quality measure does not exist yet, but approximations can be used with good
results.
The discussion in this section is restricted to admissible heuristic functions. This
is a limitation when a suboptimal solution is sought and admissibility is not crucial.
However, the analysis of inadmissible heuristics is hard and involves trading off so-
lution quality with time complexity. Moreover, admissible heuristic functions can be
used with suboptimal algorithms, such as weighted A* [55], to obtain approximate
solutions faster.
Early analysis shows bounds on the number of expanded nodes in terms of a
worst-case additive error ε [56, 28].
The bounds generally show that even for a small value of ε , a very large number of
nodes may be expanded unnecessarily. A more general analysis with regard to errors
weighted by the heuristic function shows similar results [53]. In addition, most of
these bounds assume a single optimal solution, which is often violated in practice.
Recent work shows tight bounds on the number of nodes expanded by A* in
problems with multiple solutions [15]. In particular, the number of expanded nodes
may be bounded in terms of
   ε = max_{s∈S} |h(s) − v∗(s)| / |v∗(s)|.
This work assumes that there is a relatively small number of optimal solutions clus-
tered together.
The existing bounds usually require a small number of goal states and states that
are close to them. These assumptions have been questioned because many bench-
mark problems do no satisfy them [36, 35]. When the assumptions do not hold,
even a good heuristic function according to the measures may lead A* to explore
exponentially many states. In light of this evidence, the utility of the existing quality
bounds is questionable.
Because a widely acceptable measure of the quality of heuristic functions does
not yet exist, we focus on two objectives that have been studied in the literature and
are relatively easy to calculate. They are
L1, the average-case error: ‖h − v∗‖_1 = ∑_{s∈S} |h(s) − v∗(s)|, and
L∞, the worst-case error: ‖h − v∗‖_∞ = max_{s∈S} |h(s) − v∗(s)|.
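For a finite state space both measures are straightforward to compute; a small illustrative snippet follows (the dictionary representation of h and v* is an assumption).

def heuristic_errors(h, v_star):
    # h, v_star: dictionaries mapping every state to its heuristic / optimal value
    diffs = [abs(h[s] - v_star[s]) for s in v_star]
    return sum(diffs), max(diffs)          # (L1 error, L-infinity error)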
This section gives an overview of existing methods for learning heuristic functions
in planning domains. Most of these methods, however, can also be applied to search.
As mentioned earlier, methods that use experience to learn control rules or other
structures are beyond the scope of this chapter [74]. First, to classify the methods
based on the inputs in Figure 11.1, we define the inputs more formally below.
Description (1a)
The domain description defines the structure of the transitions and rewards using a
well-defined language. This description allows us to easily simulate the domain and
to thoroughly analyze its properties. Domain descriptions in planning usually come
in one of three main forms [30]: 1) Set-theoretic representation such as STRIPS,
2) classical representation such as PDDL, or 3) state-variable representation such
as SAS+. Extensive description of these representations is beyond the scope of this
chapter. The details can be found in [30].
Features (1b)
The actual combination function θ̄ may be either fixed for the domain or may be
computed from other inputs.
One simple way to obtain features is state aggregation. In aggregation, the set of
all states S is partitioned into S1 . . . Sm . A feature φ j is then defined as
φ j (si ) = 1 ⇔ si ∈ S j .
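A small sketch of aggregation-based features, where each feature is the indicator of one block of the partition (the encoding of states and blocks is an assumption of the example):

def aggregation_features(partition):
    # partition: disjoint state sets S_1, ..., S_m covering the state space
    def phi(s):
        return [1.0 if s in block else 0.0 for block in partition]
    return phi

phi = aggregation_features([{0, 1}, {2, 3}, {4, 5}])
assert phi(3) == [0.0, 1.0, 0.0]           # state 3 lies in the second block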
Samples (1c)
This set also contains all shorter sample paths that start with a state later in the
sequence, which are also arbitrary goal-terminated samples.
2. Optimal goal-terminated paths:
   Σ2 = { (s^j_i, a^j_i)^{n_j}_{i=1} : s^j_{n_j} = τ }.
We assume also that Σ2 ⊆ Σ1 . As in Σ1 , all paths that start with a state later in
the sequence are included in Σ2 .
3. One-step samples:
   Σ3 = { (s^j_1, a^j_1, s^j_2) }.
We assume that Σ3 also contains all transition pairs from Σ1 and Σ2.
In general, the effort needed to gather an individual sample decreases in the order
Σ2 , Σ1 , Σ3 . However, the utility of the samples also decreases in the same way.
Samples may come either from previously solved problem instances or from ran-
domly selected state transitions. Samples that come from previously solved prob-
lems are in the form Σ1 or Σ2 . Randomly gathered samples are usually in the form
Σ3 and have relatively low utility, but are much simpler to obtain. The availabil-
ity of the various samples depends on the application. For example, it is rare to
have optimal goal-terminated paths in online learning algorithms such as RTDP.
Sampling is also more complicated in stochastic domains, such as Markov deci-
sion processes, because many samples may be required to determine the transition
probabilities.
The inputs to the learning procedure determine its applicability and theoretical
properties. Therefore, the procedures used in component (2) are classified according
to their inputs as follows:
Inductive methods rely on state features (1b) or transition samples (1c) or both,
but do not require a domain description. The methods usually cannot use the
description even if it is provided. They are further classified as follows:
Feature-based methods only require features (1b), without any samples. The
methods are typically very simple and require high-quality features in order
to produce admissible heuristic functions. The methods are described in
detail in Section 11.3.1.
Sample-based methods use samples (1c) to learn the heuristic function for
states that have been directly sampled. The methods cannot take advan-
tage of features to generalize to unseen states. The methods are described
in detail in Section 11.3.2.
Sample and feature-based methods use samples (1c) to combine provided fea-
tures (1b) to construct a heuristic value for states that have not been sam-
pled. The features may be arbitrary, but the quality of the heuristic function
highly depends on their quality. The methods are described in detail in Section 11.3.3.
Fig. 11.2: Examples of inductive methods based on the inputs they use
Analytic methods rely on a domain description (1a), such as SAS+. These methods
often cannot take advantage of features and samples, even if they are provided.
They can be further classified as follows:
Abstraction-based methods create a heuristic function without using samples
or features. Instead, they simplify the problem using abstraction and solve
it to obtain the heuristic function. The methods are described in detail in
Section 11.3.4.
Other methods: There is a broad range of analytic methods for many different
problem formulations. A notable example is the simplification of mathe-
matical programs by relaxations. For example, a linear program relaxation
of an integer linear program defines an admissible heuristic function [70].
These relaxation methods have been also previously used in planning with
success [5], but are not discussed in detail here.
The classification roughly follows Zimmerman et al. [74]. As mentioned above,
the main focus of the chapter is on inductive methods. However, the inductive
methods require features which may be hard to come by. The analytic methods,
particularly those that are abstraction-based, can be a good source of the fea-
tures.
Selected inductive heuristic learning methods are classified in Figure 11.2. The
feature- and sample-based methods, lying at the intersection of the two sets, are
the most complex and interesting to study. The intersection also contains many reinforcement learning methods, which are inherently incompatible with A*, as dis-
cussed in Section 11.3.3. Note that the classification in the figure is with respect
to the implementation of component (2) and it ignores differences in other compo-
nents.
Feature-based methods are inherently simple because the heuristic function is com-
puted statically regardless of the specific problem structure. As a result, they are
useful only when the features φi represent admissible heuristic functions. The three
major feature-based methods are as follows:
1. Minimum of heuristic functions,
2. Dovetailing,
3. Sum of heuristic functions.
Feature-based methods must use a fixed combination function θ̄ in (11.1) since there
are no other inputs available.
The most common feature-based method is to simply take the minimum value of
the features. The combination function θ̄ is defined as h(s) = θ̄(φ_1(s), . . . , φ_k(s)) = min_{i=1,...,k} φ_i(s). The main advantage of this method is that if the features φ_i are admissible heuristic
functions, then the resulting heuristic function h is also admissible. Dovetailing is
a refinement of minimization, which can reduce memory requirements by defining
each feature only for a subset of the state space [12].
The minimization method is simple but usually provides only a small improve-
ment over the provided heuristic functions. An alternative is to statically add the
available features:
   h(s) = θ̄(φ_1(s), . . . , φ_k(s)) = ∑_{i=1}^{k} φ_i(s).
There are, however, additional requirements in this case to ensure that h is admis-
sible. It is not sufficient to use arbitrary admissible heuristic functions as features.
Designing suitable features is often complicated, as has been shown for additive
pattern databases [71].
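Both static combinations are one-liners once every feature is itself a heuristic function (a sketch; as discussed above, the minimum preserves admissibility, while the sum does so only for specially constructed features).

def min_combination(features):
    # Admissible whenever every feature is an admissible heuristic.
    return lambda s: min(phi(s) for phi in features)

def sum_combination(features):
    # Admissible only for suitable features, e.g., additive pattern databases.
    return lambda s: sum(phi(s) for phi in features)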
refinements. The section also includes a comparison with a significant body of the-
oretical work on sampling algorithms in reinforcement learning, such as E3 and
R-Max.
RTDP, shown in Algorithm 3, refines the heuristic values of the individual states
by backward propagation of heuristic values, essentially using value iteration [58].
The method builds a table of visited states with values h(s). When a state s is ex-
panded with children s_1 . . . s_k, it is stored in a table along with the backed-up value max_{a∈A} { r(s, a) + γ h(t(s, a)) }.
When the function h(s) is initialized to an upper bound on v∗ , RTDP will converge
at the limit to h(s) = v∗ (s) for all visited states.
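A single RTDP trial in the deterministic setting of this chapter can be sketched as follows; the problem interface is an assumption of the example, and practical implementations bound the trial length and add the convergence tests mentioned below.

def rtdp_trial(sigma, is_goal, successors, reward, h, h0, gamma=1.0):
    # h:  table of updated values for visited states (the learnt heuristic)
    # h0: admissible initial heuristic, an upper bound on v*, used for unseen states
    value = lambda s: h.get(s, h0(s))
    s = sigma
    while not is_goal(s):
        # Backup: h(s) <- max_a { r(s, a) + gamma * value(t(s, a)) }
        a_best, s_best = max(successors(s),
                             key=lambda p: reward(s, p[0]) + gamma * value(p[1]))
        h[s] = reward(s, a_best) + gamma * value(s_best)
        s = s_best                          # greedily follow the best action
    return h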
Modifications of RTDP may significantly improve its performance. Some no-
table RTDP extensions are Labeled RTDP [8], HDP [7], Bounded RTDP [48], Fo-
cused RTDP [65], and VPI-RTDP [64]. The algorithms differ in how the next states
are chosen in “ChooseNextState” and in the convergence criterion. The procedure
“ChooseNextState” is modified from greedy to one that promotes more exploration
in the right places. Bounded RTDP, Focused RTDP, and VPI-RTDP determine the
exploration by additionally using a lower bound on the value of v∗ . Note that h
represents an upper bound on v∗ .
The LAO* algorithm, described also in Section 11.2, can be seen as a sample-
based method for learning heuristic functions. LAO* is based on policy itera-
tion [58], unlike value iteration-based RTDP.
Many similar methods exist in reinforcement learning, such as R-Max [10],
E3 [41], UCT [42], UCRL2 [1]. Like RTDP, they modify sample selection in value
iteration, but their motivation is quite different. In particular, the RL methods differ
in the following ways: 1) they assume that the model is unknown, and 2) they have
a well-defined measure of optimality, namely regret. Overall, RTDP focuses on the
optimization part while the RL methods focus on the learning part. That is, while
the RL methods aim to work almost as well as value iteration in unknown domains,
RTDP methods attempt to improve on value iteration in fully known domains.
The difficulty with the purely sample-based methods is that the heuristic function
is only calculated for sampled states. Generalization in RTDP can be achieved by
using state features to represent the function h(s). Regular RTDP can be seen as
using features φ1 . . . φn , defined as
   φ_i(s_j) = 1 if i = j, and 0 otherwise,
which are combined linearly. It is possible to extend RTDP to use arbitrary fea-
tures, which would effectively turn it into Approximate Value Iteration (AVI). AVI
is a common method in reinforcement learning but is in general unsuitable for con-
structing heuristic functions to be used with A*, as we discuss in Section 11.3.3.
Feature- and sample-based methods use features to generalize from the visited states, and samples to combine the features adaptively. These methods are the most general
and the most complex of the inductive heuristic learning methods.
As mentioned above, the heuristic function h is constructed using a combination
function θ from some predetermined set Θ. The objective is to determine the θ̄ ∈ Θ that would result in the best heuristic function, that is, the θ̄ that optimizes some measure of heuristic quality ω : R^S → R. In this chapter we concentrate on linear combinations of the given features; that is, the objective is to determine the weights x_i for all features. As described in
detail below, these methods are closely related to linear value function approxima-
tion in reinforcement learning [57, 66]. The linear feature combination in the vector
form is
h = Φ x,
where Φ is a matrix that represents the features as explained in Section 11.4. The
ordering of the states and actions is arbitrary but fixed.
The samples in feature- and sample-based methods may be incomplete. That is,
there are potentially many transitions that are not sampled. The samples are also in-
complete in classical planning when the search problem is fully known. The limited
set of samples simplifies solving the problem. This is in contrast with reinforcement
learning, in which more samples cannot be obtained. The presence of the model
opens up the possibility of choosing the “right” samples.
Few feature- and sample-based methods have been proposed within classical
planning. The symbolic RTDP method [25] enables RTDP to generalize beyond the
visited states in some domains. Yoon et al. [72] propose a method that calculates a
heuristic function from a linear combination of state features. The coefficients x are
calculated using an iterative regression-like method. They also propose constructing
state features from relaxed plans, obtained by deleting some of the preconditions in
actions. Another feature- and sample-based method, RTDP-Bel, was proposed in
the context of solving POMDPs [9]. RTDP-Bel endows RTDP with a specific set
of features which allow generalization in the infinite belief space of a POMDP. The
resulting heuristic function is not guaranteed to be admissible.
Reinforcement learning
The notion of a heuristic function in search is very similar to the notion of a value
function in reinforcement learning [66]. A value function also maps states to real
values that represent the expected sum of discounted rewards achievable from that
state. In fact, the value function is an approximation of v∗ . The objective of most
methods in reinforcement learning is to learn the value function as a combination of
provided features and a set of transition samples using approximate dynamic pro-
gramming [57]. The policy is then calculated from the value function using greedy
search.
Approximate Dynamic Programming (ADP) fits well into the category of feature-
and sample-based methods. It has also been used in some interesting planning set-
tings [17, 26]. Yet, there is significant incompatibility between ADP and classi-
cal planning applications. The policy in ADP is calculated using greedy best-first
search, while in classical planning it is calculated using heuristic search. This re-
sults in two main difficulties with using ADP for learning heuristic functions: (1)
ADP calculates functions that minimize the Bellman residual, which has little in-
fluence on the quality of the heuristic, and (2) they do not consider admissibil-
ity.
As Section 11.2.2 shows, there are difficulties with assessing the quality of a
heuristic function used with A*. It is much simpler if the heuristic function is instead
used with greedy search. Greedy search has fixed time complexity and the heuristic
influences only the quality of the policy. The quality of the policy is bounded by
the Bellman residual of h. The Bellman residual B(s) for a state s and a heuristic
function h is defined as B(s) = |h(s) − max_{a∈A} (r(s, a) + γ h(t(s, a)))|, and the loss of the greedy policy is at most
   max_{s∈S} B(s) / (1 − γ).
The bound requires a discount factor γ < 1, which is not the case in most planning
problems.
Most approximate dynamic programming methods minimize a function of the
Bellman residual to get a good greedy policy. A heuristic function with a small
Bellman residual may be used with A*, but it may be very far away from the true
heuristic function. In particular, adding a constant to the heuristic changes the Bell-
man residual by a factor of 1 − γ, which is very small when γ → 1, but significantly
increases the number of nodes expanded by A*. Figures 11.3 and 11.4 illustrate this
issue on a simple test problem. The horizontal axes represent the state space and
the vertical axes represent respectively the Bellman residual and the value of the
corresponding state. The solid line is the optimal value function v∗ . Notice that the
dotted function has a very small Bellman residual, but a large true error. On the other
hand, the dashed function has a large Bellman residual, but the error of the value is
small.
Because of the above-mentioned issues, most reinforcement learning methods
are unsuitable for calculating heuristic functions to be used with A*. One notable
exception is approximate linear programming, which is described in detail in Sec-
tion 11.4. Approximate linear programming does not minimize the Bellman residual
but bounds the error of the heuristic function.
Fig. 11.3: Bellman residual of three heuristic functions for a simple chain problem
Fig. 11.4: Three value functions for a simple chain problem
Two forms of abstraction, defined by a mapping b from the original states S to abstract states S′ with abstract transition function T′, can be distinguished. An embedding keeps the original states and only relaxes the transitions:
   S′ = S,   b(s) = s   ∀s ∈ S,
   T(s, a) = s′  ⇒  T′(s, a) = s′   ∀s ∈ S, a ∈ A.
A homomorphism maps the problem onto a smaller abstract state space:
   |S′| ≤ |S|,
   T(s, a) = s′  ⇔  T′(b(s), a) = b(s′)   ∀s ∈ S, a ∈ A.
Although abstraction can be defined for arbitrary search spaces, additional structure
is required to make it efficient. Such structure can be provided by a description using
a logic-based language.
Embedding provides two main advantages over homomorphism. First, assume
that a search problem P is abstracted using homomorphism into P′ and solved using blind search to obtain a heuristic h′. Then, the combined effort of blindly solving P′ and solving P using A* with h′ equals solving P using blind search [69].
In such settings, homomorphism provably does not provide any improvement.
Second, a heuristic function computed using homomorphism may be difficult to
store, while embedding provides a natural method for storing the heuristic func-
tion. Because the number of abstract states is typically small, the heuristic func-
tion is easy to store and reuse over multiple searches. As a result, homomor-
phism is useful only when the relaxed value may be calculated faster than blind
search.
Despite difficulties with homomorphism, it has been used successfully in general-
purpose planning by Yoon et al. [72]. While they calculate the heuristic function
using a form of blind search, the policy is calculated from the heuristic function
using a greedy search. As a result, Valtorta’s theorem does not apply. The method
leads to good results since the constructed heuristic function is sufficiently accu-
rate.
Embedding is very useful in general-purpose planning, because it is easily ap-
plied to domains represented using logic relations [39, 37]. Such languages are
discussed in Section 11.3. Embeddings in problems described by logic languages
are created by simply ignoring selected predicates [19, 21] and are called pattern
databases [13, 14, 18].
When designing a pattern database, it is crucial to properly choose the right ab-
straction. This is a challenging problem, but some progress has been made recently [16, 20, 33, 34, 35]. The existing methods use a local search in the space of potential abstractions to determine the best one. They often have good empir-
ical results in specific domains, but their behavior is not well understood and is very
hard to analyze.
Because it is easy to obtain many pattern databases from different abstractions,
it is desirable to be able to combine them into a single heuristic function. Research
on additive pattern databases has tried to develop pattern databases that can be
combined using the simple additive feature-based method [71], described in Sec-
tion 11.3.1. Additive pattern databases guarantee that the sum of the heuristic func-
tions produces an admissible heuristic function, but constructing them is nontrivial.
Therefore, combining pattern databases using feature- and sample-based methods
seems to be more promising.
This section describes in detail a method for learning heuristic functions based on
approximate linear programming (ALP) [54]. ALP, a common method in reinforce-
ment learning, uses a linear program to calculate the heuristic function. The ALP
approach can be classified as feature- and sample-based and is related to methods
described in Section 11.3.3. However, unlike most other feature- and sample-based
methods, ALP is guaranteed to produce admissible heuristic functions.
Linear programming, the basis for the ALP formulation, is a mathematical opti-
mization method in which the solution is constrained by a set of linear inequalities.
The optimal solution minimizes a linear objective function subject to these constraints. A general
form of a linear program is
   min_x  c^T x
   s.t.   Ax ≥ b        (11.2)
where c is the objective function and Ax ≥ b are the constraints. Linear programs can
express a large variety of optimization problems and can be solved efficiently [70].
The heuristic function in ALP is obtained as a linear combination of the features
φi . The set of feasible heuristic functions forms a linear vector space M , spanned
by the columns of matrix Φ , which is defined as follows:
   Φ = ⎛ φ_1(s_1)  φ_2(s_1)  ··· ⎞
       ⎜ φ_1(s_2)  φ_2(s_2)  ··· ⎟
       ⎝     ⋮          ⋮        ⎠ .
That is, each row of Φ defines the features for the corresponding state. The heuristic
function may then be expressed as
h = Φx
for some vector x. Vector x represents the weights on the features, which are com-
puted by solving the ALP problem. An important issue in the analysis is the repre-
sentability of functions using the features, defined as follows.
Definition 2 A heuristic function h is representable in M if h ∈ M ⊆ R|S | , i.e.,
there exists a z such that h = Φ z.
It is important to use a relatively small number of features in an ALP for two
main reasons. First, it makes the linear programs easy to solve. Second, it allows
us to use a small sample of all ALP constraints. This is important, since the total
number of ALP constraints is greater than the total number of states in the search
problem. Constraint sampling is explained in detail later in the section.
The remainder of the section describes multiple ALP formulations and their
trade-offs. Section 11.4.1 describes two formulations of the constraints that ensure
the admissibility of h. Section 11.4.2 introduces a new function θ that represents a
lower bound on v∗ . The difference between h and θ can be used to determine the
accuracy of the heuristic function h. The difference is used when formulating the
linear program, as described in Section 11.4.3.
This section describes two sets of linear constraints that can ensure the admissibility
of the heuristic function. The feasible set is represented by a set of linear inequali-
ties.
The two basic formulations are depicted in Figure 11.5. The first formulation,
represented by the dotted line in the figure, is to simply bound the heuristic value by
the value received in the sampled states. Formally, this is stated as
$$h(s^i_j) \ge \sum_{k=i}^{n_j} \gamma^{k-i}\, r(s^k_j, a^k_j) \quad \forall j \in \Sigma_2,\ \forall i \in 1 \ldots n_j. \tag{11.3}$$
Clearly, this formulation ensures the admissibility of the heuristic function. No-
tice that this will possibly mean multiple inequalities for each state, but only the
dominating ones need to be retained. Thus, let vi denote the largest right-hand side
for state si . The function must be restricted to the vector subspace spanned by the
columns of Φ . For notational convenience, we formulate the problem in a way that
is independent of the samples. Let h and v∗ be column vectors with each row corre-
sponding to a state. Then the inequality may be written as
h = Φ x ≥ v∗ ,
treating h as a vector. In general, only some of the inequalities are provided based on
the available samples; with all samples, v = v∗ . To simplify the notation, we denote
this feasible set as H1 , and thus h ∈ H1 .
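To illustrate how the feasible set H1 might be assembled in practice, the following sketch computes, for every state visited by a set of sampled goal-terminated paths, the dominating right-hand side of (11.3), and returns the sampled inequalities Φx ≥ b. The path and feature representations are assumptions made only for this sketch.

```python
import numpy as np

def h1_constraints(paths, features, gamma):
    """paths: list of goal-terminated paths, each a list of (state, action, reward) triples.
    Returns (Phi, b) such that the sampled constraints of (11.3) read Phi x >= b."""
    v = {}
    for path in paths:
        # Discounted reward-to-go from every position on the path.
        to_go = 0.0
        bounds = [0.0] * len(path)
        for i in reversed(range(len(path))):
            to_go = path[i][2] + gamma * to_go
            bounds[i] = to_go
        # Keep only the dominating (largest) bound per state.
        for (s, _, _), bound in zip(path, bounds):
            v[s] = max(v.get(s, -np.inf), bound)
    states = list(v)
    Phi = np.array([features(s) for s in states])
    b = np.array([v[s] for s in states])
    return Phi, b
```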
The second formulation is based on approximate linear programming, and is rep-
resented by the solid lines in Figure 11.5. In this case, the sample paths do not need
to be terminated by the goal node. However, the heuristic function is actually re-
quired to be consistent, which is a stronger condition than admissibility. That is, for
each observed sequence of two states, the (discounted) difference between their heuristic values
must be at least the reward received in the transition between them:
$$h(s^i_j) \ge \gamma\, h(s^{i+1}_j) + r(s^i_j, a^i_j) \quad \forall j \in \Sigma_2,\ \forall i \in 1 \ldots (n_j - 1), \qquad h(\tau) \ge 0. \tag{11.4}$$
In this case, we can define an action transition matrix Ta for action a. The matrix
captures whether it is possible to move from the state defined by the row to the state
defined by the column.
$$T_a(i, j) = 1 \;\Leftrightarrow\; t(s_i, a) = s_j.$$
A transition matrix T for all actions can then be created by vertically appending
these matrices as follows:
$$T = \begin{pmatrix} T_{a_1} \\ \vdots \end{pmatrix}.$$
Similarly, we define a vector ra of all the rewards for action a such that ra (i) =
r(si , a). The vector r of all the rewards for all the actions can then be created by
appending the vectors:
$$r = \begin{pmatrix} r_{a_1} \\ \vdots \end{pmatrix}.$$
The constraints in (11.4) can then be written compactly as
$$h \ge \gamma\, T_a h + r_a \quad \forall a \in \mathcal{A},$$
together with the constraint h(τ) ≥ 0. To include the basis Φ to which the heuristic
function is constrained, the problem is formulated as
$$(I - \gamma T_a)\Phi x \ge r_a \quad \forall a \in \mathcal{A}, \qquad h(\tau) \ge 0,$$
where I is the identity matrix. To simplify the notation, we denote the feasible set as
H2 , and thus h ∈ H2 .
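As a rough sketch of how the constraints defining H2 might be stacked into a single inequality system over the feature weights x, the function below enumerates transition matrices explicitly, which is practical only for small or sampled state sets; the dictionary-based interface is an assumption of this sketch.

```python
import numpy as np

def h2_constraints(T_by_action, r_by_action, Phi, gamma):
    """Stack the blocks (I - gamma * T_a) Phi x >= r_a over all actions a.
    T_by_action[a] is the 0/1 matrix T_a and r_by_action[a] the reward vector r_a."""
    n_states = Phi.shape[0]
    I = np.eye(n_states)
    A_blocks, b_blocks = [], []
    for a, T_a in T_by_action.items():
        A_blocks.append((I - gamma * T_a) @ Phi)
        b_blocks.append(r_by_action[a])
    # The constraint h(tau) >= 0 on the goal state must be added separately.
    return np.vstack(A_blocks), np.concatenate(b_blocks)
```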
The formulation in (11.4) ensures that the resulting heuristic function will be
admissible.
Proposition 1 Given a complete set of samples, the heuristic function h ∈ H2 is
admissible. That is, for all s ∈ S , h(s) ≥ v∗ (s).
Proof. By induction on the length of the path from state s to the goal with maximal
value. The base case follows from the definition. For the inductive case, let the
optimal path to the goal from state s take action a, breaking ties arbitrarily. Let
s1 = t(s, a). From the inductive hypothesis and subpath optimality, we have that
h(s1) ≥ v∗(s1). Then
$$h(s) \ge \gamma\, h(s_1) + r(s, a) \ge \gamma\, v^*(s_1) + r(s, a) = v^*(s).$$
Therefore, for a finite state space, the function h is an admissible heuristic function.
The two formulations can be combined using a parameter α ∈ [0, 1], interpolating between the constraints of (11.3) and (11.4):
$$h(s^i_j) \ge \alpha\gamma\, h(s^{i+1}_j) + \alpha\, r(s^i_j, a^i_j) + (1-\alpha)\, v^*(s^i_j) \quad \forall j \in \Sigma_2,\ \forall i \in 1 \ldots (n_j - 1). \tag{11.5}$$
Here v∗ (sij ) is the value of state sij in sequence j. When it is not available, an arbi-
trary lower bound may be used. For α = 0, this formulation is equivalent to (11.3),
and for α = 1, the formulation is equivalent to (11.4). We denote the feasible set as
H3 , and thus h ∈ H3 . The key property of this formulation is stated in the following
lemma, which is used later in the chapter to establish approximation bounds, and is
straightforward to prove.
Lemma 1 The optimal value function v∗ is a feasible solution of (11.5) for an ar-
bitrary α .
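A sketch of how the blended constraints (11.5) could be generated from sampled transitions, written directly in terms of the feature weights x; here v_lb(s) stands for v∗(s) or, when that value is unavailable, any lower bound on it, as the text allows. The transition and feature interfaces are assumptions of this sketch.

```python
import numpy as np

def h3_constraints(transitions, features, gamma, alpha, v_lb):
    """Build rows of A x >= b encoding (11.5):
    phi(s)^T x - alpha*gamma*phi(s')^T x >= alpha*r + (1 - alpha)*v_lb(s)."""
    A_rows, b_rows = [], []
    for (s, a, r, s_next) in transitions:
        row = np.asarray(features(s)) - alpha * gamma * np.asarray(features(s_next))
        A_rows.append(row)
        b_rows.append(alpha * r + (1.0 - alpha) * v_lb(s))
    return np.array(A_rows), np.array(b_rows)
```

Setting alpha to 0 or 1 in such a construction recovers sampled versions of (11.3) and (11.4), respectively.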
This section shows how to obtain a lower bound on the value of each state. This is
important because it allows us to evaluate the difference between the heuristic value
and the true value of each state. The lower bounds on the values of some selected
states are obtained from the optimal solutions.
The formulation we consider is similar to that of (11.3):
$$\theta(s^i_j) \le \sum_{k=i}^{n_j} \gamma^{k-i}\, r(s^k_j, a^k_j) \quad \forall j \in \Sigma_1,\ \forall i \in 1 \ldots n_j. \tag{11.6}$$
That is, the bounds are on the values of states that were solved optimally and on
any nodes that are on the path connecting the start state with the goal state. These
bounds can also be written in matrix notation, as in (11.3):
$$\theta = \Phi y \le v^*.$$
Given a complete set of samples, this guarantees that θ(s) ≤ v∗(s) for all s ∈ S. We denote this feasible set as G1, and thus θ ∈ G1.

Fig. 11.6: Lower bound formulations, where the dotted lines represent paths of arbitrary length. The numbers next to the edges represent the rewards received
The second formulation, analogous to (11.4), constrains θ along individual observed transitions:
$$\theta(s^{i+1}_j) \le \gamma\,\theta(s^i_j) - r(s^i_j, a^i_j), \qquad \theta(\sigma) \le v^*(\sigma), \tag{11.7}$$
where σ denotes a state whose optimal value is known.
Unlike in (11.4), these constraints alone do not guarantee that the function θ will be
a lower bound on the optimal value of the states. Figure 11.6 depicts a situation in
which these bounds are satisfied, but there is a feasible solution that is not a lower
bound. However, as in (11.4), the bounds may be formulated in matrix notation as
$$(I - \gamma T_a)\Phi y \ge r_a \quad \forall a \in \mathcal{A}, \qquad \theta(\sigma) \le v^*(\sigma).$$
We denote this feasible set as G2, and thus θ ∈ G2.
Given the above, we are ready to formulate the linear program for an admissible
heuristic function. As discussed in Section 11.2.2, two simple metrics used for judg-
ing the quality of a heuristic function are the L1 norm and L∞ norm. Linear program
formulations for each of the norms follow.
The linear program that minimizes the L1 norm of the heuristic function error is
the following:
$$\min_h \; \mathbf{1}^T h \quad \text{s.t.} \quad h \in \mathcal{H}_3. \tag{11.8}$$
The formulation corresponds exactly to approximate linear programming when α = 1. It is easy to show that the optimal solution of (11.8) minimizes $\|h - v^*\|_1$ [22].
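As a rough sketch of how (11.8) might be solved in practice over sampled constraints, the function below minimizes 1^T Φx with a generic LP solver; the matrices A and b are assumed to come from a constraint-building step such as the earlier sketches, which are not part of the chapter.

```python
import numpy as np
from scipy.optimize import linprog

def solve_l1_alp(A, b, Phi):
    """Solve min 1^T (Phi x) subject to A x >= b, i.e., a sampled version of (11.8).
    Returns the feature weights x defining the heuristic h = Phi x."""
    c = Phi.sum(axis=0)                    # 1^T Phi x = (1^T Phi) x
    n_features = Phi.shape[1]
    res = linprog(c, A_ub=-A, b_ub=-b,     # A x >= b  ->  -A x <= -b
                  bounds=[(None, None)] * n_features, method="highs")
    if not res.success:
        raise RuntimeError("ALP infeasible or unbounded: " + res.message)
    return res.x
```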
A linear program that minimizes the L∞ norm of the heuristic function error is the
following:
$$\begin{aligned} \min_{\delta,\, h}\ & \delta \\ \text{s.t.}\ & h(s) - \theta(s) \le \delta \quad \forall s \in \mathcal{S} \\ & h \in \mathcal{H}_3, \ \theta \in \mathcal{G}_1. \end{aligned} \tag{11.9}$$
It is easy to show that (11.9) minimizes $\|h - v^*\|_\infty$. This is because, from the definition, h(s) ≥ θ(s) for all s.
from the samples only, this inequality holds. Notice that the number of constraints
h(s) − θ (s) ≤ δ is too large, because one constraint is needed for each state. There-
fore, in practice these constraints as well as the remaining states will be sampled.
In particular, we use those states s for which v∗ (s) is known. While it is possible
to use G2 instead of G1 , that somewhat complicates the analysis. We summarize
below the main reasons why the formulation in (11.9) is more suitable than the one
in (11.8).
We showed above how to formulate the linear programs for optimizing the heuristic function. It is important, however, to determine whether these linear programs are feasible and whether their solutions are close to the best heuristic that can be represented using the features in basis Φ. In this section, we extend the analysis used in approximate linear programming to derive new conditions for obtaining a good heuristic function.
We are interested in bounding the maximal approximation error $\|h - v^*\|_\infty$. This bound limits the maximal error in any state, and can be used as a rough measure of the extra search effort required to find the optimal solution. Alternatively, given that $\|h - v^*\|_\infty \le \varepsilon$, the greedily constructed solution with this heuristic will have an approximation error of at most mε, where m is the number of steps required to reach the goal. This makes it possible to solve the problem without search. For simplicity, we do not address here the issues related to limited sample availability, which have been previously analyzed [22, 23, 4, 31].
The approximation bound for the solution of (11.8) with the constraints in (11.4)
comes from approximate linear programming [22]. In the following, we use 1 to
denote a vector of all ones of the appropriate size.
The potential problem with this formulation is that the bound may be very loose when (1)
the number of states is large, since it depends on the number of states |S|, or (2)
the discount factor γ is close to 1 or equals 1.
We show below how to address these problems using the alternative formula-
tion of (11.9) and taking advantage of the additional structure of the approximation
space.
Lemma 2 Assume that 1 is representable in M. Then there exists a heuristic function ĥ that is feasible in (11.5), such that
$$\|\hat{h} - v^*\|_\infty \le \frac{2}{1-\gamma\alpha} \min_x \|v^* - \Phi x\|_\infty.$$

Proof. Let h̃ be the representable function closest to v∗, with error
$$\varepsilon = \|\tilde{h} - v^*\|_\infty = \min_x \|v^* - \Phi x\|_\infty.$$
This function may not satisfy the inequalities (11.5). We show that it is possible to
construct an ĥ that satisfies them. From the assumption, we have that
$$v^* - \varepsilon\mathbf{1} \le \tilde{h} \le v^* + \varepsilon\mathbf{1}$$
and, for $a \in \mathcal{A}$,
$$0 \le T_a \mathbf{1} \le \gamma\alpha\,\mathbf{1},$$
directly from the definition of the inequalities (11.5). We use 0 to denote a zero
vector of the appropriate size. Now let ĥ = h̃ + d1 for some d. An appropriate value
of d to make ĥ feasible can be derived using Lemma 1 as follows
$$\begin{aligned}
\hat{h} = \tilde{h} + d\mathbf{1} &\ge v^* - \varepsilon\mathbf{1} + d\mathbf{1} \\
&\ge T v^* + r + (d-\varepsilon)\mathbf{1} \\
&\ge T(\tilde{h} - \varepsilon\mathbf{1}) + r + (d-\varepsilon)\mathbf{1} \\
&\ge T\tilde{h} - \varepsilon\gamma\alpha\mathbf{1} + r + (d-\varepsilon)\mathbf{1} \\
&= T\tilde{h} + r + (d-(\gamma\alpha+1)\varepsilon)\mathbf{1} \\
&= T(\hat{h} - d\mathbf{1}) + r + (d-(\gamma\alpha+1)\varepsilon)\mathbf{1} \\
&\ge T\hat{h} + r + ((1-\gamma\alpha)d - (1+\gamma\alpha)\varepsilon)\mathbf{1}.
\end{aligned}$$
Therefore, ĥ ≥ T ĥ + r if
$$((1-\gamma\alpha)d - (1+\gamma\alpha)\varepsilon)\mathbf{1} \ge 0, \qquad \text{that is,} \qquad d \ge \frac{1+\gamma\alpha}{1-\gamma\alpha}\,\varepsilon.$$
The approximation error then satisfies
$$\|\hat{h} - v^*\|_\infty \le \|\tilde{h} - v^*\|_\infty + \|d\mathbf{1}\|_\infty \le \frac{2}{1-\gamma\alpha}\,\varepsilon.$$
Using similar analysis, the following lemma can be shown.
Lemma 3 Assume 1 is representable in M . Then there exists a lower bound θ̂ that
is feasible in (11.6), such that
$$\|\hat{\theta} - v^*\|_\infty \le 2 \min_x \|v^* - \Phi x\|_\infty.$$
This lemma can be proved simply by subtracting ε 1 from the θ that is closest to v∗ .
The above lemmas lead to the following theorem with respect to the formulation in
(11.9).
Theorem 1 Assume that 1 is representable in M , and let ĥ, θ̂ , δ be an optimal
solution of (11.9). Then
$$\delta = \|\hat{h} - v^*\|_\infty \le \left(2 + \frac{2}{1-\gamma\alpha}\right) \min_x \|v^* - \Phi x\|_\infty.$$
Proof. Assume that the solution δ does not satisfy the inequality. Then, us-
ing Lemmas 3 and 2, it is possible to construct a solution ĥ, θ̂ , δ̂ . This leads to a
contradiction, because δ̂ < δ .
Therefore, by solving (11.9) instead of (11.8), the error becomes independent of the
number of states. This is a significant difference, since the approach is proposed for
problems with a very large number of states.
Even when (11.9) is solved, the approximation error depends on the factor 1/(1−
γα ). For γ = α = 1, the bound is infinite. In fact, the approximate linear program
may become infeasible in this case, unless the approximation basis Φ satisfies some
requirements. In the following, we show which requirements are necessary to ensure
that there will always be a feasible solution.
To illustrate this problem with the approximation, consider the following sim-
ple example with states S = {s1 , s2 , s3 } and a single action A = {a}. The goal is
the state τ = s3 , and thus there is no transition from this state. The transitions are
t(si , a) = si+1 , for i = 1, 2. The rewards are also r(si , a) = 1 for i = 1, 2. Now, let the
approximation basis be
$$\Phi = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}^{T}.$$
This example is depicted in Figure 11.7, in which the rectangle represents the ag-
gregated states in which the heuristic function is constant. The bounds of (11.4) in
this example are
h(s1 ) ≥ γ h(s2 ) + 1
h(s2 ) ≥ γ h(s3 ) + 1
h(s3 ) ≥ 0.
The approximation basis Φ requires that h(s1) = h(s3). Thus we get that
$$h(s_1) \ge \gamma^2 h(s_1) + \gamma + 1, \qquad \text{and hence} \qquad h(s_1) \ge \frac{1+\gamma}{1-\gamma^2}.$$
As a result, despite the fact that v∗(s1) = 2, the heuristic function is h(s2) =
(1 + γ)/(1 − γ²). This is very imprecise for high values of γ. A similar problem was
addressed in standard approximate linear programming by introducing so-called
Lyapunov vectors. We build on this idea to define conditions that enable us to use
(11.5) with high γ and α .
Definition 3 Vectors u1 , . . . , ul form a Lyapunov hierarchy if, for each i, there exists a constant βi such that
$$T_i u_i \le \beta_i u_i, \qquad T_j u_i \le 0 \quad \forall j < i.$$
The second condition requires that no states in partition j transit to a state with
positive ui .
An example of such a hierarchy would be an abstraction, depicted in Figure 11.8.
Let the state space S be partitioned into l subsets Si , with i ∈ 1 . . . l. Assume that
the transitions satisfy
$$\forall a \in \mathcal{A}:\ t(s, a) = s' \ \wedge\ s \in \mathcal{S}_i \ \wedge\ s' \in \mathcal{S}_j \ \Rightarrow\ j < i.$$
That is, there is an ordering of the partitions consistent with the transitions. Let ui
be a vector of the size of the state space, defined as
ui (s) = 1 ⇔ s ∈ Si ,
and zero otherwise. It is easy to show that these vectors satisfy the requirements
of Definition 3. When the approximation basis Φ can be shown to contain such ui ,
it is, as we show below, possible to use the formulation with γ = α = 1 with low
approximation error.
Fig. 11.8: An example of the Lyapunov hierarchy. The dotted line represents a con-
straint that needs to be removed and replaced with the dashed ones
Lemma 4 Assume that there exists a Lyapunov hierarchy u1 . . . ul such that each ui
is representable in M . Then there exists a heuristic function ĥ in Φ that is feasible
in (11.5), such that
$$\|\hat{h} - v^*\|_\infty \le \prod_{i=1}^{l} \frac{(1+\alpha\gamma)\,\max_{s\in\mathcal{S}} u_i(s)}{(1-\alpha\gamma\beta_i)\,\min_{s\in\mathcal{S}_i} u_i(s)} \; 2\min_x \|v^* - \Phi x\|_\infty.$$
Proof. The proof follows by induction on the size l of the Lyapunov hierarchy, starting from an approximation h̃ that satisfies
$$v^* \le \tilde{h} \le v^* + \varepsilon.$$
Assume that the inequalities are satisfied for all j < i, with the error ε and the property that the
current h̃ ≥ v∗. Then let ĥ = h̃ + d uᵢ for some d. Then, using Lemma 1, we have
$$\begin{aligned}
\hat{h} = \tilde{h} + d u_i &\ge v^* + d u_i \ge T_i v^* + r + d u_i \\
&\ge T_i \tilde{h} - \gamma\alpha\varepsilon\mathbf{1} + r + d u_i \\
&\ge T_i(\hat{h} - d u_i) - \gamma\alpha\varepsilon\mathbf{1} + r + d u_i \\
&\ge T_i\hat{h} + r - \alpha\beta_i\gamma\, d u_i + d u_i - \gamma\alpha\varepsilon\mathbf{1}.
\end{aligned}$$
To satisfy $\hat{h} \ge T_i \hat{h} + r_i$, set d accordingly, which gives
$$\|\hat{h} - v^*\|_\infty \le \frac{\gamma\alpha\,\max_{s\in\mathcal{S}} u_i(s)}{(1-\alpha\beta_i\gamma)\,\min_{s\in\mathcal{S}_i} u_i(s)}\;\varepsilon.$$
The bound on the approximation error of the optimal solution of (11.9) may be
then restated as follows.
Theorem 2 Assume that there exists a Lyapunov hierarchy u1 . . . ul , and for each ui
there exists zi such that ui = Φ zi . Then for the optimal solution ĥ, δ of (11.9),
$$\delta = \|\hat{h} - v^*\|_\infty \le \left(1 + \prod_{i=1}^{l} \frac{(1+\alpha\gamma)\,\max_{s\in\mathcal{S}} u_i(s)}{(1-\alpha\gamma\beta_i)\,\min_{s\in\mathcal{S}_i} u_i(s)}\right) 2\varepsilon,$$
where $\varepsilon = \min_x \|v^* - \Phi x\|_\infty$.
The proof follows from Lemma 4 similarly to Theorem 1. The theorem shows that
even when γ = 1, it is possible to guarantee the feasibility of (11.9) by including the
Lyapunov hierarchy in the basis.
A simple instance of a Lyapunov hierarchy is a set of features that depends on
the number of steps to the goal. Therefore, the basis Φ must contain a vector ui
such that ui(j) = 1 if the number of steps to reach the goal from sj is i, and 0 otherwise. This
is practical in problems in which the number of steps to the goal is known in
any state. Assuming this simplified condition, Theorem 2 may be restated as fol-
lows:
$$\|\hat{h} - v^*\|_\infty \le \left(1 + \prod_{i=1}^{l} \frac{1+\alpha\gamma}{1-\alpha\gamma\beta_i}\right) 2\min_x \|v^* - \Phi x\|_\infty.$$
This, however, indicates exponential growth in error with the size of the hierar-
chy with γ = α = 1. It is possible to construct an example in which this bound is
tight. We have not observed such behavior in the experiments, and it is likely that
finer error bounds could be established. As a result of the analysis above, if the
basis contains the Lyapunov hierarchy, the approximation error is finite even for
γ = 1.
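A small sketch of the simple Lyapunov hierarchy discussed above: one indicator feature per number-of-steps-to-the-goal layer, appended to an existing basis. The assumption that the distance to the goal is available for every sampled state is exactly the simplifying condition mentioned in the text.

```python
import numpy as np

def add_step_indicators(Phi, steps_to_goal, max_steps):
    """Append columns u_i with u_i(j) = 1 iff the j-th sampled state is exactly i
    steps from the goal; steps_to_goal[j] gives that distance."""
    n_states = Phi.shape[0]
    U = np.zeros((n_states, max_steps + 1))
    for j, d in enumerate(steps_to_goal):
        U[j, d] = 1.0
    return np.hstack([Phi, U])
```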
In some problems it is hard to construct a basis that contains a Lyapunov hierar-
chy. An alternative approach is to include only constraints that respect the Lyapunov
hierarchy present in the basis. These may include multistep constraints, as indicated
in Figure 11.8. As a result, only a subset of the constraints is added, but this may im-
prove the approximation error significantly. Another option is to define features that
depend on actions, and not only on states. This is a nontrivial extension, however,
and we leave the details to future work. Finally, when all the rewards are negative
and the basis contains only a Lyapunov hierarchy, then it can be shown that no con-
straints need to be removed.
We evaluate the approach on the sliding eight-tile puzzle problem, a classic search
problem [59]. The purpose of these experiments is to demonstrate the applicability
of the ALP approach. We used the eight tile puzzle particularly because it has been
studied extensively and can be solved relatively quickly. This allows us to evaluate
the quality of the heuristic functions we obtain in different settings. Since all the
experiments we describe took less than a few seconds, scaling to large problems
with many sample plans is very promising. Scalability mostly relies on the ability
to efficiently solve large linear programs, an area that has seen significant progress
over the years. In all instances, we use the formulation in (11.9) with different values
of α .
The heuristic construction method relies on a set of features available for the
domain. Good features are crucial for obtaining a useful heuristic function, since
they must be able to differentiate between states based on their heuristic value.
In addition, the set of features must be limited to facilitate generalization. No-
tice that although the features are crucial, they are in general much easier to se-
lect than a good admissible heuristic function. We consider the following basis
choices:
1. Manhattan distance of each tile from its goal position, including the empty tile
(a small code sketch of these features follows this list). This results in nine features
whose values range from 0 to 4. This basis does not satisfy the Lyapunov hierarchy
condition. The minimal admissible heuristic function from these features assigns
weight −1 to the feature corresponding to each tile, except the empty tile, whose
feature is assigned weight 0.
2. Abstraction based on the sum of the Manhattan distances of all pieces. For
example, feature 7 will be 1 if the sum of the Manhattan distances of the pieces
is 7, and 0 otherwise. This basis satisfies the Lyapunov hierarchy condition as
defined above, since all the rewards in the domain are negative.
3. The first feature is the Nilsson sequence score [52] and the second feature is
the total Manhattan distance minus 3. The Nilsson sequence score is obtained
by checking around the non-central square in turn, allotting 2 for every tile not
followed by its proper successor and 0 for every other tile, except that a piece
in the center scores 1. This value is not an admissible heuristic.
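For concreteness, a sketch of the first basis choice: one Manhattan-distance feature per tile, including the blank, computed from a flat tuple encoding of the board. The encoding and goal configuration are assumptions of this sketch, not necessarily those used in the experiments.

```python
def manhattan_features(board, goal=(0, 1, 2, 3, 4, 5, 6, 7, 8)):
    """board is a length-9 tuple; board[pos] is the tile at position pos (0 = blank).
    Returns nine features: the Manhattan distance of each tile from its goal position."""
    goal_pos = {tile: pos for pos, tile in enumerate(goal)}
    features = [0] * 9
    for pos, tile in enumerate(board):
        gp = goal_pos[tile]
        features[tile] = abs(pos // 3 - gp // 3) + abs(pos % 3 - gp % 3)
    return features

# Example: one move away from the goal; the blank and tile 1 are each one step off.
print(manhattan_features((1, 0, 2, 3, 4, 5, 6, 7, 8)))  # [1, 1, 0, 0, 0, 0, 0, 0, 0]
```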
First, we evaluate the approach in terms of the number of samples that are re-
quired to learn a good heuristic function. We first collect samples from a blind
search. To avoid expanding too many states, we start with initial states that are
close to the goal. This is done by starting with the goal state and performing a
sequence of 20 random actions. Typical results obtained in these experiments are
shown in Figure 11.9, all performed with α = 1. The column labeled “States”
shows the total number of node-action pairs, not necessarily unique, expanded
and used to learn the heuristic function. The samples are gathered from solving
the problem optimally for two states. The results show that relatively few nodes
are required to obtain a heuristic function that is admissible, and very close to
the optimal heuristic function with the given features. Observe that the heuristic
function was obtained with no prior knowledge of the domain and without any a
priori heuristic function. Very similar results were obtained with the second ba-
sis.
x0 x1 x2 x3 x4 x5 x6 x7 x8 States
0 -1 -1 -1 -1 0 0 -1 0 958
0 -1 -1 -1 -1 -1 0 -1 0 70264
0 -1 -1 0 -1 -1 -1 -1 -1 63
0 -1 -1 -1 -1 -1 -1 -1 -1 162
Fig. 11.9: Weights calculated for individual features using the first basis choice.
Column xi corresponds to the weight assigned to the feature associated with tile i,
where 0 is the empty tile. The top two rows are based on data from blind search, and
the bottom two on data from search based on the heuristic from the previous row
Next, we compare the two formulations for the upper bounds, (11.3) and (11.4),
with regard to their approximation error. Notice that a big advantage of the formu-
lation depicted by (11.4) is the ability to use transitions from states that are not on
a path to the goal. The data is based on 100 goal-terminated searches and 1,000 ad-
ditional randomly chosen states. The results are shown in Figure 11.10. Here, δ is
the objective value of (11.9), which is the maximal overestimation of the heuristic
function in the given samples. Similarly, δ is the maximal overestimation obtained
based on 1,000 state samples independently of the linear program. The approximate
fraction of states in which the heuristic function is admissible is denoted by p. These
results demonstrate the trade-off that α offers. A lower value of α generally leads to
a better heuristic function, but at the expense of admissibility. The bounds with re-
gard to the sampled constraints, as presented in [22], do not distinguish between the
two formulations. A deeper analysis of this will be an important part of future work.
Interestingly, the results show that the Nilsson sequence score is not admissible, but
it becomes admissible when divided by 4.
Fig. 11.10: The discovered heuristic functions as a function of α in the third basis
choice, where the xi are the weights on the corresponding features in the order in
which they are defined
The ALP approach has also been applied to Tetris, a popular benchmark problem
in reinforcement learning [67]. Because Tetris is an inherently stochastic problem,
we used the Regret algorithm [49] to solve it using a deterministic method. The
Regret algorithm first generates samples of the uncertain component of the prob-
lem and then treats it as a deterministic problem. It is crucial in Tetris to have a
basis that contains a Lyapunov hierarchy, since the rewards are positive. Since we
always solve the problem for a fixed number of steps forward, such a hierarchy can
be defined based on the number of steps remaining, as in Figure 11.8. Our initial
experiments with Tetris produced promising results. However, to outperform the
state-of-the-art methods, such as approximate linear programming [24] and cross-
entropy methods [67], we need to scale up our implementation to handle millions of
samples. This is mostly a technical challenge that can be addressed using methods
developed in [24].
11.7 Conclusion
This chapter presents a unified framework for learning heuristic functions in plan-
ning settings. Analytic methods use the structure of the domain to calculate heuristic
functions, while inductive methods rely on state features and properties. The frame-
work also encompasses most of the methods used in reinforcement learning, which
are often unsuitable for use with heuristic search algorithms.
The ALP-based approach is a feature- and sample-based method that learns ad-
missible heuristic functions. When applied naively, these techniques may lead to
formulations that have no feasible solutions. We show that by guaranteeing cer-
tain properties, it is possible to ensure that the approximation is finite and -in some
cases- accurate, with tight error bounds. This work lays the foundation for further
understanding of how sampling techniques can produce good heuristic functions for
complex planning problems.
Learning heuristic functions automatically is a long-standing challenge in artifi-
cial intelligence. Despite the existing shortcomings, the ALP approach has several
important advantages. The formulation is very general and it places only modest
assumptions on the features. It requires little domain knowledge. It works well with
samples of plans, both optimal and non-optimal ones. And, most importantly, it
makes it possible to compute guarantees on the admissibility of the learned heuris-
tic function.
Acknowledgements This work was supported by the Air Force Office of Scientific Research un-
der Grant No. FA9550-08-1-0171.
References
[1] Auer, P., Jaksch, T., Ortner, R.: Near-optimal regret bounds for reinforcement
learning. In: Advances in Neural Information Processing Systems (2009)
[2] Barto, A., Bradtke, S. J., Singh, S. P.: Learning to act using real-time dynamic
programming. Artificial Intelligence 72(1), 81–138 (1995)
[3] Beliaeva, N., Zilberstein, S.: Generating admissible heuristics by abstraction
for search in stochastic domains. In: Abstraction, Reformulation and Approx-
imation, pp. 14–29. Springer Berlin / Heidelberg (2005)
[4] Ben-Tal, A., Nemirovski, A.: Selected topics in robust optimization. Mathe-
matical Programming, Series B 112, 125–158 (2008)
[5] Benton, J., van den Briel, M., Kambhampati, S.: A hybrid linear programming
and relaxed plan heuristic for partial satisfaction planning problems. In: Inter-
national Conference on Automated Planning and Scheduling (ICAPS) (2007)
[6] Bonet, B., Geffner, H.: Planning as heuristic search. Artificial Intelligence
129(1-2), 5–33 (2001)
[7] Bonet, B., Geffner, H.: Faster heuristic search algorithms for planning under
uncertainty and full feedback. In: International Joint Conference on Artificial
Intelligence (2003)
[8] Bonet, B., Geffner, H.: Labeled RTDP: Improving the convergence of real-time
dynamic programming. In: International Conference on Autonomous Planning
(ICAPS) (2003)
[9] Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algo-
rithms. In: International Joint Conference on Artificial Intelligence (IJCAI)
(2009)
[10] Brafman, R. I., Tennenholtz, M.: R-MAX: A general polynomial time algo-
rithm for near-optimal reinforcement learning. Journal of Machine Learning
Research 3, 213–231 (2002)
[11] Bylander, T.: A linear programming heuristic for optimal planning. In: Na-
tional Conference on Artificial Intelligence, pp. 694–699 (1997)
[12] Culberson, J. C., Schaeffer, J.: Efficiently searching the 15-puzzle. Tech. rep.,
Department of Computer Science, University of Alberta (1994)
[13] Culberson, J. C., Schaeffer, J.: Searching with pattern databases. In: Advances
in Artifical Intelligence, pp. 402–416. Springer Berlin / Heidelberg (1996)
[34] Haslum, P., Botea, A., Helmert, M., Bonet, B., Koenig, S.: Domain-
independent construction of pattern database heuristics for cost-optimal plan-
ning. In: National Conference on Artificial Intelligence (2007)
[35] Helmert, M., Mattmuller, R.: Accuracy of admissible heuristic functions in
selected planning domains. In: National Conference on Artificial Intelligence
(2008)
[36] Helmert, M., Röger, G.: How good is almost perfect? In: National Conference
on Artificial Intelligence (AAAI) (2008)
[37] Holte, R. C., Grajkowski, J., Tanner, B.: Hierarchical heuristic search revisited.
In: Abstraction, Reformulation and Approximation, pp. 121–133. Springer
Berlin / Heidelberg (2005)
[38] Holte, R. C., Mkadmi, T., Zimmer, R., MacDonald, A.: Speeding up problem
solving by abstraction: a graph oriented approach. Artificial Intelligence 85,
321–361 (1996)
[39] Holte, R. C., Perez, M., Zimmer, R., MacDonald, A.: Hierarchical A*: Search-
ing abstraction hierarchies efficiently. In: National Conference on Artificial
Intelligence (AAAI), pp. 530–535 (1996)
[40] Kautz, H. A., Selman, B.: Pushing the envelope: Planning, propositional
logic, and stochastic search. In: National Conference on Artificial Intelligence
(AAAI) (1996)
[41] Kearns, M., Singh, S.: Near-polynomial reinforcement learning in polynomial
time. Machine Learning 49, 209–232 (2002)
[42] Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: European
Conference on Machine Learning (ECML) (2006)
[43] Korf, R.: Depth-first iterative deepening: An optimal admissible tree search.
Artificial Intelligence 27(1), 97–109 (1985)
[44] Korf, R. E.: Real-time heuristic search. In: National Conference on AI (AAAI)
(1988)
[45] Lagoudakis, M. G., Parr, R.: Least-squares policy iteration. Journal of Machine
Learning Research 4, 1107–1149 (2003)
[46] Laird, J. E., Rosenbloom, P. S., Newell, A.: Chunking in SOAR: The anatomy
of a general learning mechanism. Machine Learning 1, 11–46 (1986)
[47] Leckie, C., Zuckerman, I.: Inductive learning of search control rules for plan-
ning. Artificial Intelligence 101(1-2), 63–98 (1998)
[48] McMahan, H. B., Likhachev, M., Gordon, G. J.: Bounded real-time dynamic
programming: RTDP with monotone upper bounds and performance guaran-
tees. In: International Conference on Machine Learning (ICML) (2005)
[49] Mercier, L., van Hentenryck, P.: Performance analysis of online anticipatory
algorithms for large multistage stochastic integer programs. In: International
Joint Conference on AI, pp. 1979–1985 (2007)
[50] Minton, S., Knoblock, C., Kuokka, D. R., Gil, Y., Joseph, R. L., Carbonell,
J. G.: PRODIGY 2.0: The manual and tutorial. Tech. rep., Carnegie Mellon
University (1989)
[51] Munos, R.: Error bounds for approximate policy iteration. In: International
Conference on Machine Learning, pp. 560–567 (2003)
[73] Zhang, Z., Sturtevant, N. R., Holte, R., Schaeffer, J., Felner, A.: A* search
with inconsistent heuristics. In: International Joint Conference on Artificial
Intelligence (IJCAI) (2009)
[74] Zimmerman, T., Kambhampati, S.: Learning-assisted automated planning. AI
Magazine 24(2), 73–96 (2003)