
Finding Association Rules That Trade Support Optimally against Confidence

Tobias Scheffer (1, 2)
1 University of Magdeburg, FIN/IWS, PO Box 4120, 39016 Magdeburg, Germany
2 SemanticEdge, Kaiserin-Augusta-Allee 10-11, 10553 Berlin, Germany
[email protected]

Abstract. When evaluating association rules, rules that differ in both
support and confidence have to be compared; a larger support has to be
traded against a higher confidence. The solution which we propose for
this problem is to maximize the expected accuracy that the association
rule will have for future data. In a Bayesian framework, we determine
the contributions of confidence and support to the expected accuracy on
future data. We present a fast algorithm that finds the n best rules which
maximize the resulting criterion. The algorithm dynamically prunes re-
dundant rules and parts of the hypothesis space that cannot contain
better solutions than the best ones found so far. We evaluate the perfor-
mance of the algorithm (relative to the Apriori algorithm) on realistic
knowledge discovery problems.

1 Introduction
Association rules (e.g., [1,5,2]) express regularities between sets of data items
in a database. [Beer and TV magazine ⇒ chips] is an example of an association
rule and expresses that, in a particular store, all customers who buy beer and a
TV magazine are also likely to buy chips. In contrast to classifiers, association
rules do not make a prediction for all database records. When a customer does
not buy beer and a magazine, then our example rule does not conjecture that
he will not buy chips either. The number of database records for which a rule
does predict the proper value of an attribute is called the support of that rule.
Association rules may not be perfectly accurate. The fraction of database
records for which the rule conjectures a correct attribute value, relative to the
fraction of records for which it makes any prediction, is called the confidence.
Note that the confidence is the relative frequency of a correct prediction on the
data that is used for training. We expect the confidence (or accuracy) on unseen
data to lie below that on average, in particular, when the support is small.
When deciding which rules to return, association rule algorithms need to
take both confidence and support into account. Of course, we can find any num-
ber of rules with perfectly high confidence but support of only one or very few
records. On the other hand, we can construct very general rules with large sup-
port but low confidence. The Apriori algorithm [2] possesses confidence and
support thresholds and returns all rules which lie above these bounds. However,
a knowledge discovery system has to evaluate the interestingness of these rules
and provide the user with a reasonable number of interesting rules.
Which rules are interesting to the user depends on the problem that the
user wants to solve and for which the rules are supposed to be helpful. In many cases, the user
will be interested in finding items that do not only happen to co-occur in the
available data. He or she will rather be interested in finding items between which
there is a connection in the underlying reality. Items that truly correlate will
most likely also correlate in future data. In statistics, confidence intervals (which
bound the difference between relative frequencies and their probabilities) can be
used to derive guarantees that empirical observations reflect existing regularities
in the underlying reality, rather than occurring just by chance. The number of
observations plays a crucial role; when a rule has a large support, then we can be
much more certain that the observed confidence is close to the confidence that
we can expect to see in future. This is one reason why association rules with
very small support are considered less interesting.
In this paper, we propose a trade-off between confidence and support that
is optimal in the sense that it maximizes the chance of correct predictions on unseen
data. We concretize the problem setting in Sect. 2, and in Sect. 3 we present our
resulting utility criterion. In Sect. 4, we present a fast algorithm that finds the n
best association rules with respect to this criterion. We discuss the algorithm’s
mechanism for pruning regions of the hypothesis space that cannot contain so-
lutions that are better than the ones found so far, as well as the technique used
to delete redundant rules which are already implied by other rules. In Sect. 5,
we evaluate our algorithm empirically. Section 6 concludes.

2 Preliminaries
Let D be a database consisting of one table over binary attributes a1, . . . , ak,
called items. In general, D has been generated by discretizing the attributes
of a relation of an original database D′. For instance, when D′ contains an
attribute income, then D may contain binary attributes 0 ≤ income ≤ 20k,
20k < income ≤ 40k, and so on. A database record r ⊆ {a1, . . . , ak} is the set of
attributes that take value one in the corresponding row of the table D.
A database record r satisfies an item set x ⊆ {a1, . . . , ak} if x ⊆ r. The
support s(x) of an item set x is the number of records in D which satisfy x.
Often, the fraction s(x)/|D| of records in D that satisfy x is called the support of x.
But since the database D is constant, these terms are equivalent.
An association rule [x ⇒ y] with x, y ⊆ {a1, . . . , ak}, y ≠ ∅, and x ∩ y = ∅
expresses a relationship between an item set x and a nonempty item set y. The
intuitive semantics of the rule is that all records which satisfy x are predicted to
also satisfy y. The confidence of the rule with respect to the (training) database
D is ĉ([x ⇒ y]) = s(x ∪ y)/s(x), that is, the ratio of correct predictions over all records
for which a prediction is made.
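To make these definitions concrete, here is a minimal sketch (not part of the paper) that computes support and confidence over a toy transaction table; the item names, toy records, and helper functions are illustrative only.

```python
# Toy database (hypothetical): each record is the set of items with value one.
D = [
    {"beer", "tv_magazine", "chips"},
    {"beer", "tv_magazine", "chips"},
    {"beer", "chips"},
    {"tv_magazine"},
]

def support(item_set, database):
    """s(x): number of records r with x a subset of r."""
    return sum(1 for r in database if item_set <= r)

def confidence(body, head, database):
    """c_hat([body => head]) = s(body u head) / s(body)."""
    s_body = support(body, database)
    return support(body | head, database) / s_body if s_body else 0.0

print(support({"beer"}, D))                               # 3
print(confidence({"beer", "tv_magazine"}, {"chips"}, D))  # 1.0
```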
The confidence is measured with respect to the database D that is used for
training. Often, a user will assume that the resulting association rules provide
information on the process that generated the database which will be valid in
future, too. But the confidence on the training data is only an estimate of the
rules’ accuracy in the future, and since we search the space of association rules
to maximize the confidence, the estimate is optimistically biased. We define the
predictive accuracy c([x ⇒ y]) of a rule as the probability of a correct prediction
with respect to the process underlying the database.
Definition 1. Let D be a database the records r of which are generated by a
static process P , let [x ⇒ y] be an association rule. The predictive accuracy
c([x ⇒ y]) = Pr[r satisfies y | r satisfies x] is the conditional probability of y ⊆ r
given that x ⊆ r when the distribution of r is governed by P .
The confidence ĉ([x ⇒ y]) is the relative frequency corresponding to the probability c([x ⇒ y]) for
the given database D. We now pose the n most accurate association rules problem.
Definition 2. Given a database D (defined as above) and a set of database
items a1 through ak, find n rules h1, . . . , hn ∈ {[x ⇒ y] | x, y ⊆ {a1, . . . , ak}; y ≠
∅; x ∩ y = ∅} which maximize the expected predictive accuracy c([x ⇒ y]).
We formulate the problem such that the algorithm needs to return a fixed
number of best association rules rather than all rules the utility of which ex-
ceeds a given threshold. We think that this setting is more appropriate in many
situations because a threshold may not be easy to specify and a user may not be
satisfied with either an empty or an outrageously large set of rules.

3 Bayesian Frequency Correction


In this section, we analyze how confidence and support contribute to the pre-
dictive accuracy. The intuitive idea is that we “mistrust” the confidence a little.
How strongly we have to discount the confidence depends on the support – the
greater the support, the more closely does the confidence relate to the predictive
accuracy. In the Bayesian framework that we adopt, there is an exact solution as
to how much we have to discount the confidence. We call this approach Bayesian
frequency correction since the resulting formula (Equation 6) takes a confidence
and “corrects” it by returning a somewhat lower predictive accuracy.
Suppose that we have a given association rule [x ⇒ y] with observed confidence
ĉ([x ⇒ y]). We can read p(c([x ⇒ y]) | ĉ([x ⇒ y]), s(x)) as "P (predictive
accuracy of [x ⇒ y] given confidence of [x ⇒ y] and support of x)". The intuition
of our analysis is that application of Bayes’ rule implies “P (predictive accuracy
given confidence and support) = P (confidence given predictive accuracy and
support)P (predictive accuracy)/ normalization constant”. Note that the likeli-
hood P (ĉ|c, s) is simply the binomial distribution. (The target attributes of each
record that is satisfied by x can be classified correctly or erroneously; the chance
of a correct prediction is just the predictive accuracy c; this leads to a binomial
distribution.) “P (predictive accuracy)”, the prior in our equation, is the accu-
racy histogram over the space of all association rules. This histogram counts, for
every accuracy c, the fraction of rules which possess that accuracy.

In Equation 1, we decompose the expectation by integrating over all possible
values of c([x ⇒ y]). In Equation 2, we apply Bayes' rule. π(c) =
|{[x ⇒ y] | c([x ⇒ y]) = c}| / |{[x ⇒ y]}| is the accuracy histogram. It specifies the probability of drawing
an association rule with accuracy c when drawing at random under uniform
distribution from the space of association rules of length up to k.

E(c([x ⇒ y]) | ĉ([x ⇒ y]), s(x))
    = ∫ c · p(c([x ⇒ y]) = c | ĉ([x ⇒ y]), s(x)) dc                                        (1)
    = ∫ c · ( P(ĉ([x ⇒ y]) | c([x ⇒ y]) = c, s(x)) · π(c) / P(ĉ([x ⇒ y]) | s(x)) ) dc      (2)

In Equation 4, we apply Equation 2. Since, over all c, the distribution p(c([x ⇒ y]) = c | ĉ([x ⇒ y]), s(x))
has to integrate to one (Equation 3), we can treat
P(ĉ([x ⇒ y]) | s(x)) as a normalizing constant which we can determine
uniquely in Equation 5.

    ∫ p(c([x ⇒ y]) = c | ĉ([x ⇒ y]), s(x)) dc = 1                                          (3)
⇔  ∫ ( P(ĉ([x ⇒ y]) | c([x ⇒ y]) = c, s(x)) · π(c) / P(ĉ([x ⇒ y]) | s(x)) ) dc = 1         (4)
⇔  P(ĉ([x ⇒ y]) | s(x)) = ∫ P(ĉ([x ⇒ y]) | c([x ⇒ y]) = c, s(x)) · π(c) dc                 (5)

Combining Equations 2 and 5 we obtain Equation 6. In this equation, we also
state that, when the accuracy c is given, the confidence ĉ is governed by the
binomial distribution, which we write as B[c, s](ĉ). This requires us to make the
standard assumption of independent and identically distributed instances.

    E(c([x ⇒ y]) | ĉ([x ⇒ y]), s(x)) = ∫ c · B[c, s(x)](ĉ([x ⇒ y])) · π(c) dc / ∫ B[c, s(x)](ĉ([x ⇒ y])) · π(c) dc      (6)

We have now found a solution that quantifies E(c([x ⇒ y])|ĉ([x ⇒ y]), s(x)),
the exact expected predictive accuracy of an association rule [x ⇒ y] with given
confidence ĉ and body support s(x). Equation 6 thus quantifies just how strongly
the confidence of a rule has to be corrected, given the support of that rule. Note
that the solution depends on the prior π(c) which is the histogram of accuracies
of all association rules over the given items for the given database.
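As an illustration of how Equation 6 can be evaluated in practice, the following sketch approximates the two integrals by sums over a discretized prior histogram; the function name, the grid representation of π(c), and the rounding of ĉ · s(x) to an integer count are assumptions of this sketch, not part of the paper.

```python
from math import comb

def expected_accuracy(conf, s, prior):
    """Approximate E(c | c_hat = conf, s(x) = s) from Equation 6.

    prior: iterable of (c, pi_c) pairs, a discretized accuracy histogram.
    The binomial likelihood B[c, s](conf) is the probability of observing
    round(conf * s) correct predictions in s trials with true accuracy c.
    """
    correct = round(conf * s)           # c_hat * s(x) is an integer count
    numerator, denominator = 0.0, 0.0
    for c, pi_c in prior:
        likelihood = comb(s, correct) * c ** correct * (1.0 - c) ** (s - correct)
        numerator += c * likelihood * pi_c
        denominator += likelihood * pi_c
    return numerator / denominator if denominator > 0 else conf

# Flat prior over a grid of accuracy values (illustration only).
flat_prior = [(i / 100.0, 1.0 / 101) for i in range(101)]
print(expected_accuracy(1.0, 5, flat_prior))    # clearly below 1: small support
print(expected_accuracy(1.0, 500, flat_prior))  # close to 1: large support
```

With a flat prior, a rule with confidence 1 but body support 5 is corrected far below 1, while the same confidence with support 500 stays close to 1, which is exactly the trade-off the criterion is meant to capture.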
One way of treating such priors is to assume a certain standard distribution.
Under a set of assumptions on the process that generated the database, π(c)
can be shown to be governed by a certain binomial distribution [9]. However,
empirical studies (see Sect. 5 and Fig. 2a) show that the shape of the prior can
deviate strongly from this binomial distribution. Reasonably accurate estimates
can be obtained by following a Markov Chain Monte Carlo [4] approach to
estimating the prior, using the available database (see Sect. 4). For an extended
discussion of the complexity of estimating this distribution, see [9,6].

Fig. 1. Contributions of support s(x) and confidence ĉ([x ⇒ y]) to the predictive accuracy
c([x ⇒ y]) of rule [x ⇒ y] (3-D surface; axes: support, confidence, predictive accuracy)

Example Curve. Figure 1 shows how expected predictive accuracy, confidence,
and body support relate for the database that we also use for our experiments in
Sect. 5, using 10 items. The predictive accuracy grows with both confidence and
body support of the rule. When the confidence exceeds 0.5, then the predictive
accuracy is lower than the confidence, depending on the support and on the
histogram π of accuracies of association rules for this database.

4 Discovery of Association Rules


The Apriori algorithm [1] finds association rules in two steps. First, all item
sets x with support of more than the fixed threshold "minsup" are found. Then,
all item sets are split into left and right hand sides x and y (in all possible
ways) and the confidence of the rule [x ⇒ y] is calculated as s(x ∪ y)/s(x). All rules
with a confidence above the confidence threshold “minconf” are returned. Our
algorithm differs from that scheme since we do not have fixed confidence and
support thresholds. Instead, we want to find the n best rules.
In the first step, our algorithm estimates the prior π(c). Then the generation of
frequent item sets, the pruning of the hypothesis space by dynamically adjusting the
minsup threshold, the generation of association rules, and the removal of redundant
association rules interleave. The algorithm is displayed in Table 1.

Estimating π(c). We can estimate π by drawing many hypotheses at random
under uniform distribution, measuring their confidence, and recording the resulting
histogram. Algorithmically, we run a loop over the length of the rule

Table 1. Algorithm PredictiveApriori: discovery of n most predictive association rules

1. Input: n (desired number of association rules), database with items a1, . . . , ak.
2. Let τ = 1.
3. For i = 1 . . . k Do: Draw a number of association rules [x ⇒ y] with i items at
   random. Measure their confidence (provided s(x) > 0). Let πi(c) be the distribution
   of confidences.
4. For all c, Let π(c) = Σ_{i=1..k} πi(c) (k choose i) (2^i − 1) / Σ_{i=1..k} (k choose i) (2^i − 1).

5. Let X0 = {∅}; Let X1 = {{a1 }, . . . , {ak }} be all item sets with one single element.
6. For i = 1 . . . k − 1 While (i = 1 or Xi−1 ≠ ∅).
   a) If i > 1 Then determine the set of candidate item sets of length i as Xi =
      {x ∪ x′ | x, x′ ∈ Xi−1, |x ∪ x′| = i}. Generation of Xi can be optimized by
      considering only item sets x and x′ ∈ Xi−1 that differ only in the element
      with highest item index. Eliminate double occurrences of item sets in Xi.
b) Run a database pass and determine the support of the generated item sets.
Eliminate item sets with support less than τ from Xi .
c) For all x ∈ Xi Call RuleGen(x).
   d) If best has been changed, Then Increase τ to be the smallest number such
      that E(c | 1, τ) > E(c(best[n]) | ĉ(best[n]), s(best[n])) (refer to Equation 6). If τ >
      database size, Then Exit.
e) If τ has been increased in the last step, Then eliminate all item sets from Xi
which have support below τ .
7. Output best[1] . . . best[n], the list of the n best association rules.

Algorithm RuleGen(x) (generate all rules with body x)

10. Let γ be the smallest number such that E(c | γ/s(x), s(x)) >
    E(c(best[n]) | ĉ(best[n]), s(best[n])).
11. For i = 1 . . . k With ai ∉ x Do (for all items not in x)
    a) If i = 1 Then Let Y1 = {{ai} | ai ∉ x} (item sets with one element not in x).
    b) Else Let Yi = {y ∪ y′ | y, y′ ∈ Yi−1, |y ∪ y′| = i} analogous to the generation of
candidates in step 6a.
c) For all y ∈ Yi Do
i. Measure the support s(x ∪ y). If s(x ∪ y) ≤ γ, Then eliminate y from Yi
and Continue the for loop with the next y.
ii. Equation 6 gives the predictive accuracy E(c([x ⇒ y])|s(x ∪ y)/s(x), s(x)).
      iii. If the predictive accuracy is among the n best found so far (recorded
           in best), Then update best, remove rules in best that are subsumed
           by other rules, and Increase γ to be the smallest number such that
           E(c | γ/s(x), s(x)) ≥ E(c(best[n]) | ĉ(best[n]), s(best[n])).
12. If any subsumed rule has been erased in step 11(c)iii, Then recur from step 10.

and, given that length, draw a fixed number of rules. We determine the items
and the split into body and head by drawing at random (Step 3). We have now
drawn equally many rules for each size, while the uniform distribution requires
us to prefer long rules, as there are many more long rules than there are short
ones. There are (k choose i) item sets of size i over k database items, and given i items,
there are 2^i − 1 distinct association rules (each item can be located on the left or
right hand side of the rule but the right hand side must be nonempty). Hence,
Equation 7 gives the probability that exactly i items occur in a rule which is
drawn at random under uniform distribution from the space of all association
rules over k items.

    P[i items] = (k choose i) (2^i − 1) / Σ_{j=1..k} (k choose j) (2^j − 1)        (7)

In step 4 we apply a Markov Chain Monte Carlo style correction to the prior by
weighting each prior for rule length i by the probability of a rule length of i.
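A rough sketch of this prior estimation: rules of each length are drawn at random, their confidences are recorded as per-length histograms πi(c), and the histograms are combined with the length weights of Equation 7. The sampling routine, bin width, and names below are illustrative, not the paper's implementation.

```python
import random
from collections import Counter
from math import comb

def estimate_prior(items, database, samples_per_length=1000, bins=100):
    """Steps 3-4 (sketch): sample rules of every length, record per-length
    confidence histograms pi_i(c), then weight them by Equation 7.

    items: list of item names; database: list of records (sets of items).
    Returns the prior as a list of (c, probability) pairs.
    """
    k = len(items)
    length_weight = [comb(k, i) * (2 ** i - 1) for i in range(1, k + 1)]
    total_weight = sum(length_weight)
    prior = Counter()
    for i in range(1, k + 1):
        confidences = []
        for _ in range(samples_per_length):
            rule_items = random.sample(items, i)
            mask = random.randrange(1, 2 ** i)             # head must be nonempty
            head = {a for j, a in enumerate(rule_items) if (mask >> j) & 1}
            body = set(rule_items) - head
            s_body = sum(1 for r in database if body <= r)
            if s_body == 0:
                continue                                    # step 3: require s(x) > 0
            c_hat = sum(1 for r in database if (body | head) <= r) / s_body
            confidences.append(round(c_hat * bins))
        if not confidences:
            continue
        for b, count in Counter(confidences).items():
            prior[b] += (count / len(confidences)) * length_weight[i - 1] / total_weight
    return [(b / bins, p) for b, p in sorted(prior.items())]
```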

Enumerating Item Sets with Dynamic Minsup Threshold. Similarly to
the Apriori algorithm, the PredictiveApriori algorithm generates frequent item
sets, but using a dynamically increasing minsup threshold τ. Note that we start
with size zero (only the empty item set is contained in X0 ). X1 contains all
item sets with one element. Given Xi−1 , the algorithm computes Xi in step 6a
just like Apriori does. An item set can only be frequent when all its subsets are
frequent, too. We can thus generate Xi by only joining those elements of Xi−1
which differ exactly in the last element (where last refers to the highest item
index). Since all subsets of an element of Xi must be in Xi−1 , the subsets that
result from removing the last, or the last but one element must be in Xi−1 , too.
After running a database pass and measuring the support of each element of Xi ,
we can delete all those candidates that do not achieve the required support of τ .
We then call the RuleGen procedure in step 6c that generates all rules over
body x, for each x ∈ Xi . The RuleGen procedure alters our array best[1 . . . n]
which saves the best rules found so far. In step 6d, we refer to best[n], meaning
the nth best rule found so far. We now refer to Equation 6 again to determine
the least support that the body of an association rule with perfect confidence
must possess in order to exceed the predictive accuracy of the currently nth best
rule. If that required support exceeds the database size we can exit because no
such rule can exist. We delete all item sets in step 6e which lie below that new
τ . Finally, we output the n best rules in step 7.
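The candidate join of step 6a can be sketched as follows, with item sets represented as sorted tuples so that "last element" means the item with the highest index; the representation and function name are assumptions of this sketch.

```python
def generate_candidates(prev_item_sets):
    """Step 6a (sketch): join item sets of length i-1 that differ only in
    their last element, where 'last' means the highest item index.

    prev_item_sets: collection of item sets represented as sorted tuples.
    Returns candidate item sets of length i without duplicates.
    """
    prev = sorted(set(prev_item_sets))
    candidates = set()
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates

# Example: joining {a1,a2}, {a1,a3}, {a2,a3} yields the single candidate {a1,a2,a3}.
print(generate_candidates([(1, 2), (1, 3), (2, 3)]))   # {(1, 2, 3)}
```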

Generating All Rules over Given Body x. In step 10, we introduce a new
accuracy threshold γ which quantifies the confidence that a rule with support
s(x) needs in order to be among the n best ones. We then start enumerating all
possible heads y, taking into account in step 11 that body and head must be
disjoint and generating candidates in step 11(b) analogous to step 6a. In step
11(c)i we calculate the support of x ∪ y for all heads y. When a rule lies among
the best ones so far, we update best. We will not bother with rules that have
a predictive accuracy below the accuracy of best[n], so we increase γ. In step
11(c)iii, we delete rules from best which are subsumed by other rules. This may
result in the unfortunate fact that rules which we dropped from best earlier now
belong to the n best rules again. So in step 11(c)iii we have to check this and
recur from step 10 if necessary.
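The threshold γ of step 10 can be found by a simple linear scan, assuming a function expected_accuracy(confidence, support, prior) such as the sketch in Sect. 3; the name and scan strategy are illustrative.

```python
def min_head_support(s_x, accuracy_of_best_n, prior):
    """Step 10 (sketch): smallest gamma such that a rule with body support
    s_x and confidence gamma / s_x could still beat the n-th best rule."""
    for gamma in range(1, s_x + 1):
        if expected_accuracy(gamma / s_x, s_x, prior) > accuracy_of_best_n:
            return gamma
    return s_x + 1  # no head over this body can enter the top n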

Removing Redundant Rules. Consider an association rule [a ⇒ c, d]. When
this rule is satisfied by a database, then that database must also satisfy [a, b ⇒
c, d], [a ⇒ c], [a ⇒ d], and many other rules. We write [x ⇒ y] |= [x′ ⇒ y′] to
express that any database that satisfies [x ⇒ y] must also satisfy [x′ ⇒ y′]. Since
we can generate exponentially many redundant rules that can be inferred from
a more general rule, it is not desirable to present all these redundant rules to
the user. Consider the example in Table 2 which shows the five most interesting
rules generated by PredictiveApriori for the purchase database that we study
in Sect. 5. The first and second rule in the bottom part are special cases of the
third rule; the fourth and fifth rules are subsumed by the second rule of the top
part. The top part shows the best rules with redundant variants removed.

Theorem 1. We can decide whether a rule subsumes another rule by two simple
subset tests: [x ⇒ y] |= [x′ ⇒ y′] ⇔ x ⊆ x′ ∧ y ⊇ y′. Moreover, if [x ⇒ y] is
supported by a database D, and [x ⇒ y] |= [x′ ⇒ y′], then this database also
supports [x′ ⇒ y′].

Proofs of Theorems 1 and 2 are left for the full paper. Theorem 1 says that
[x ⇒ y] subsumes [x′ ⇒ y′] if and only if x is a subset of x′ (weaker precondition)
and y is a superset of y′ (y predicts more attribute values than y′). We can then
delete [x′ ⇒ y′] because Theorem 1 says that from a more general rule we can
infer that all subsumed rules must be satisfied, too. In order to assure that the n
rules provided to the user are not redundant specializations of each other,
we test for subsumption in step 11(c)iii by performing the two subset tests that
imply subsumption according to Theorem 1.
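The subsumption test of Theorem 1 amounts to two subset checks; a minimal sketch with rules represented as (body, head) pairs of Python sets (a representation assumed for illustration):

```python
def subsumes(rule_a, rule_b):
    """Theorem 1 (sketch): [x => y] |= [x' => y'] iff x ⊆ x' and y ⊇ y'.

    Rules are (body, head) pairs of Python sets.
    """
    (x, y), (x2, y2) = rule_a, rule_b
    return x <= x2 and y >= y2

# [a => c, d] subsumes both [a, b => c, d] and [a => c]:
print(subsumes(({"a"}, {"c", "d"}), ({"a", "b"}, {"c", "d"})))  # True
print(subsumes(({"a"}, {"c", "d"}), ({"a"}, {"c"})))            # True
```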

Theorem 2. The PredictiveApriori algorithm (Table 1) returns n association
rules [xi ⇒ yi] with the following properties. (i) For all returned solutions [x ⇒
y], [x′ ⇒ y′]: [x ⇒ y] ⊭ [x′ ⇒ y′]. (ii) Subject to constraint (i), the returned
rules maximize E(c([xi ⇒ yi]) | ĉ([xi ⇒ yi]), s(xi)) according to Equation 6.

Improvements. Several improvements of the Apriori algorithm have been suggested
that benefit the PredictiveApriori algorithm as well. The AprioriTid
algorithm requires far fewer database passes by storing, for each database
record, a list of the item sets of length i which this record supports. From these
lists, the support of each item set can easily be computed. In the next itera-
tion, the list of item sets of length i + 1 that each transaction supports can
be computed without accessing the database. We can expect this modification
to enhance the overall performance when the database is very large but sparse.
Further improvements can be obtained by using sampling techniques (e.g., [11]).

Table 2. (Top) five best association rules when subsumed rules are removed; (bottom)
five best rules when subsumed rules are not removed

[ ⇒ PanelID=9, ProductGroup=84 ]    E(c | ĉ = 1, s = 10000) = 1
[ Location=market 4 ⇒ PanelID=9, ProductGroup=84, Container=nonreuseable ]    E(c | ĉ = 1, s = 1410) = 1
[ Location=market 6 ⇒ PanelID=9, ProductGroup=84, Container=nonreuseable ]    E(c | ĉ = 1, s = 1193) = 1
[ Location=market 1 ⇒ PanelID=9, ProductGroup=84, Container=nonreuseable ]    E(c | ĉ = 1, s = 1025) = 1
[ Manufacturer=producer 18 ⇒ PanelID=9, ProductGroup=84, Type=0, Container=nonreuseable ]    E(c | ĉ = 1, s = 1804) = 1

[ ⇒ PanelID=9 ]    E(c | ĉ = 1, s = 10000) = 1
[ ⇒ ProductGroup=84 ]    E(c | ĉ = 1, s = 10000) = 1
[ ⇒ PanelID=9, ProductGroup=84 ]    E(c | ĉ = 1, s = 10000) = 1
[ Location=market 4 ⇒ PanelID=9 ]    E(c | ĉ = 1, s = 1410) = 1
[ Location=market 4 ⇒ ProductGroup=84 ]    E(c | ĉ = 1, s = 1410) = 1

5 Experiments

For our experiments, we used a database of 14,000 fruit juice purchase transac-
tions, and the mailing campaign data used for the KDD cup 1998. Each trans-
action of the fruit juice database is described by 29 real-valued and string-valued
attributes which specify properties of the purchased juice as well as attributes
of the customer (e.g., age and job). By binarizing the attributes and considering
only a subset of the binary attributes, we varied the number of items during the
experiments. For instance, we transformed the attribute “ContainerSize” into
five binary attributes, “ContainerSize ≤ 0.3”, “0.3 < ContainerSize ≤ 0.5”, etc.
Figure 2a shows the prior π(c) as estimated by the algorithm in step 3 for
several numbers of items. Figure 1 shows the predictive accuracy for this prior,
depending on the confidence and the body support. Table 2 (top) shows the five
best association rules found for the fruit juice problem by the PredictiveApriori
algorithm. The rules say that all transactions are performed under PanelID 9
and refer to product group 84 (fruit juice purchases). Apparently, markets 1, 4,
and 6 only sell non-reuseable bottles (in contrast to the refillable bottles sold by
most German supermarkets). Producer 18 does not sell refillable bottles either.
In order to compare the performance of Apriori and PredictiveApriori, we
need to find a uniform measure that is independent of implementation details.
For Apriori, we count how many association rules have to be compared against
the minconf threshold (this number is independent of the actual minconf thresh-
old). We can determine this number from the item sets without actually enumer-
ating all rules. For PredictiveApriori, we measure for how many rules we need
to determine the predictive accuracy by evaluating Equation 6.

Fig. 2. (a) Confidence prior π for various numbers of items (10, 20, 30 items; x-axis: confidence, y-axis: fraction of rules). (b) Number of rules that PredictiveApriori has to consider, dependent on the number n of desired solutions (20, 30, 40 items; x-axis: rules to be found, y-axis: rules considered)

Fig. 3. Time complexity of PredictiveApriori and Apriori, depending on the number of items and (in the case of Apriori) on minsup; (a) fruit juice problem, (b) KDD cup 1998. Curves shown for Apriori with minsup = 1000/10000, 500/10000, 100/10000, 50/10000, 10/10000, and for PredictiveApriori (y-axis: number of rules considered, x-axis: number of items)

The performance of Apriori depends crucially on the choice of the support
threshold "minsup". In Fig. 3, we compare the computational expenses imposed
by PredictiveApriori (10 best solutions) to the complexity of Apriori for several
different minsup thresholds and numbers of items for both the fruit juice and the
KDD cup database. The time required by Apriori grows rapidly with decreasing
minsup values. Among the 25 best solutions for the juice problem found by Pre-
dictiveApriori we can see rules with body support and confidence of 92. In order
to find such special but accurate rules, Apriori would run many times as long as
PredictiveApriori. Figure 2b shows how the complexity increases with the num-
ber of desired solutions. The increase is only sub-linear. Figure 4 shows extended
comparisons of the Apriori and PredictiveApriori performance for the fruit juice
problem. The horizontal lines show the time required by PredictiveApriori for
the given number of database items (n = 10 best solutions). The curves show
how the time required by Apriori depends on the minsup support threshold.
Apriori is faster for large thresholds since it then searches only a small fraction
of the space of association rules.

Fig. 4. Number of rules that PredictiveApriori and Apriori need to consider, depending on the number of items (in the case of Apriori also depending on minsup); panels for 20, 30, 40, 50, 60, and 65 items (x-axis: minsup, y-axis: rules considered)

6 Discussion and Related Results

We discussed the problem of trading the confidence of an association rule against
its support. When the goal is to maximize the expected accuracy on future database
records that are generated by the same underlying process, then Equation 6 gives
us the optimal trade-off between confidence and support of the rule’s body. Equa-
tion 6 results from a Bayesian analysis of the predictive accuracy; it is based on
the assumption that the database records are independent and identically dis-
tributed and requires us to estimate the confidence prior. The PredictiveApriori
algorithm does this using a MCMC approach [4].
The Bayesian frequency correction approach that eliminates the optimistic
bias of high confidences relates to an analysis of classification algorithms [8]
that yields a parameter-free regularization criterion for decision tree algorithms
[10]. The PredictiveApriori algorithm returns the n rules which maximize the
expected accuracy; the user only has to specify how many rules he or she wants
to be presented. This is perhaps a more natural parameter than minsup and
minconf, required by the Apriori algorithm.
The algorithm also checks the rules for redundancies. It has a bias towards
returning general rules and eliminating all rules which are entailed by equally
accurate, more general ones. Guided by similar ideas, the Midos algorithm [12]
performs a similarity test for hypotheses. In [13], a rule discovery algorithm
is discussed that selects from classes of redundant rules the simplest rather
than the most general ones. For example, given two equally accurate rules [a ⇒ b]
and [a ⇒ b, c], PredictiveApriori would prefer the latter, which predicts more
values, whereas [13] would prefer the shorter first one.
The favorable computational performance of the PredictiveApriori algorithm
can be credited to the dynamic pruning technique that uses an upper bound on
the accuracy of all rules over supersets of a given item set. Very large parts of
the search space can thus be excluded. A similar idea is realized in Midos [12].
Many optimizations of the Apriori algorithm have been proposed which have
helped this algorithm gain its huge practical relevance. These include the Apri-
oriTid approach for minimizing the number of database passes [2], and sampling
approaches for estimating the support of item sets [2,11]. In particular, efficient
search for frequent itemsets has been addressed intensely and successfully [7,
3,14]. Many of these improvements can, and should be, applied to the Predic-
tiveApriori algorithm as well.

References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. In ACM SIGMOD Conference on Management of Data,
pages 207–216, 1993.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast discovery
of association rules. In Advances in Knowledge Discovery and Data Mining, 1996.
3. S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket data. In Proceedings of the ACM SIGMOD
Conference on Management of Data, 1997.
4. W. Gilks, S. Richardson, and D. Spiegelhalter, editors. Markov Chain Monte Carlo
in Practice. Chapman & Hall, 1995.
5. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Find-
ing interesting rules from large sets of discovered association rules. Proc. Third
International Conference on Information and Knowledge Management, 1994.
6. J. Langford and D. McAllester. Computable shell decomposition bounds. In Pro-
ceedings of the International Conference on Computational Learning Theory, 2000.
7. D. Lin and Z. Kedem. Pincer search: a new algorithm for discovering the maximum
frequent set. In Proceedings of the International Conference on Extending Database
Technology, 1998.
8. T. Scheffer. Error Estimation and Model Selection. Infix Publisher, Sankt Augustin, 1999.
9. T. Scheffer. Average-case analysis of classification algorithms for boolean functions
and decision trees. In Proceedings of the International Conference on Algorithmic
Learning Theory, 2000.
10. T. Scheffer. Nonparametric regularization of decision trees. In Proceedings of the
European Conference on Machine Learning, 2000.
11. T. Scheffer and S. Wrobel. A sequential sampling algorithm for a general class
of utility functions. In Proceedings of the International Conference on Knowledge
Discovery and Data Mining, 2000.
12. S. Wrobel. Inductive logic programming for knowledge discovery in databases. In
Sašo Džeroski and Nada Lavrač, editors, Relational Data Mining, 2001.
13. M. Zaki. Generating non-redundant association rules. In Proceedings of the Inter-
national Conference on Knowledge Discovery and Data Mining, 2000.
14. M. Zaki and C. Hsiao. CHARM: an efficient algorithm for closed association rule
mining. Technical Report 99-10, Rensselaer Polytechnic Institute, 1999.
