Finding Association Rules That Trade Support Optimally Against Confidence
Tobias Scheffer
University of Magdeburg, FIN/IWS, PO Box 4120, 39016 Magdeburg, Germany
SemanticEdge, Kaiserin-Augusta-Allee 10-11, 10553 Berlin, Germany
[email protected]
1 Introduction
Association rules (e.g., [1,5,2]) express regularities between sets of data items
in a database. [Beer and TV magazine ⇒ chips] is an example of an association
rule and expresses that, in a particular store, all customers who buy beer and a
TV magazine are also likely to buy chips. In contrast to classifiers, association
rules do not make a prediction for all database records. When a customer does
not buy beer and a magazine, then our example rule does not conjecture that
he will not buy chips either. The number of database records for which a rule
does predict the proper value of an attribute is called the support of that rule.
Association rules may not be perfectly accurate. The fraction of database
records for which the rule conjectures a correct attribute value, relative to the
fraction of records for which it makes any prediction, is called the confidence.
Note that the confidence is the relative frequency of a correct prediction on the
data that is used for training. We expect the confidence (or accuracy) on unseen
data to lie below that on average, in particular, when the support is small.
When deciding which rules to return, association rule algorithms need to
take both confidence and support into account. Of course, we can find any
number of rules with perfect confidence but a support of only one or very few
records. On the other hand, we can construct very general rules with large sup-
port but low confidence. The Apriori algorithm [2] possesses confidence and
support thresholds and returns all rules which lie above these bounds. However,
2 Preliminaries
Let D be a database consisting of one table over binary attributes a1 , . . . , ak ,
called items. In general, D has been generated by discretizing the attributes
of a relation of an original database D′. For instance, when D′ contains an
attribute income, then D may contain binary attributes 0 ≤ income ≤ 20k,
20k < income ≤ 40k, and so on. A database record r ⊆ {a1 , . . . , ak } is the set of
attributes that take value one in a particular row of the table D.
A database record r satisfies an item set x ⊆ {a1 , . . . , ak } if x ⊆ r. The
support s(x) of an item set x is the number of records in D which satisfy x.
Often, the fraction s(x)/|D| of records in D that satisfy x is called the support
of x. But since the database D is constant, these terms are equivalent.
An association rule [x ⇒ y] with x, y ⊆ {a1 , . . . , ak }, y ≠ ∅, and x ∩ y = ∅
expresses a relationship between an item set x and a nonempty item set y. The
intuitive semantic of the rule is that all records which satisfy x are predicted to
also satisfy y. The confidence of the rule with respect to the (training) database
D is ĉ([x ⇒ y]) = s(x ∪ y)/s(x) – that is, the ratio of correct predictions over all
records for which a prediction is made.
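To make these definitions concrete, the following minimal Python sketch (not part of the original paper; the toy records and item names are illustrative) computes the support of an item set and the confidence of a rule over a database represented as a list of records, each record being the set of items that take value one.

```python
# Minimal sketch: support s(x) and confidence c^([x => y]) = s(x u y) / s(x),
# with the database represented as a list of records (sets of items with value 1).

def support(database, item_set):
    """Number of records r that satisfy the item set x, i.e. x is a subset of r."""
    return sum(1 for r in database if item_set <= r)

def confidence(database, body, head):
    """Ratio of correct predictions over all records for which a prediction is made."""
    s_body = support(database, body)
    if s_body == 0:
        return None          # the rule never makes a prediction
    return support(database, body | head) / s_body

# Illustrative toy database over the items of the introduction's example.
D = [{"beer", "tv_magazine", "chips"},
     {"beer", "tv_magazine"},
     {"beer", "chips"},
     {"chips"}]

print(support(D, {"beer", "tv_magazine"}))                 # 2
print(confidence(D, {"beer", "tv_magazine"}, {"chips"}))   # 0.5
```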
The confidence is measured with respect to the database D that is used for
training. Often, a user will assume that the resulting association rules provide
information on the process that generated the database which will be valid in
future, too. But the confidence on the training data is only an estimate of the
rules’ accuracy in the future, and since we search the space of association rules
to maximize the confidence, the estimate is optimistically biased. We define the
predictive accuracy c([x ⇒ y]) of a rule as the probability of a correct prediction
with respect to the process underlying the database.
Definition 1. Let D be a database whose records r are generated by a static
process P , and let [x ⇒ y] be an association rule. The predictive accuracy
c([x ⇒ y]) = P r[r satisfies y|r satisfies x] is the conditional probability of y ⊆ r
given that x ⊆ r when the distribution of r is governed by P .
The confidence ĉ([x ⇒ y]) is the relative frequency corresponding to the probability
c([x ⇒ y]), measured on the given database D. We now pose the n most accurate
association rules problem.
Definition 2. Given a database D (defined as above) and a set of database
items a1 through ak , find n rules h1 , . . . , hn ∈ {[x ⇒ y] | x, y ⊆ {a1 , . . . , ak }; y ≠
∅; x ∩ y = ∅} which maximize the expected predictive accuracy c([x ⇒ y]).
We formulate the problem such that the algorithm needs to return a fixed
number of best association rules rather than all rules the utility of which ex-
ceeds a given threshold. We think that this setting is more appropriate in many
situations because a threshold may not be easy to specify and a user may not be
satisfied with either an empty or an outrageously large set of rules.
We have now found a solution that quantifies E(c([x ⇒ y])|ĉ([x ⇒ y]), s(x)),
the exact expected predictive accuracy of an association rule [x ⇒ y] with given
confidence ĉ and body support s(x). Equation 6 thus quantifies just how strongly
the confidence of a rule has to be corrected, given the support of that rule. Note
that the solution depends on the prior π(c) which is the histogram of accuracies
of all association rules over the given items for the given database.
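Equation 6 itself is not reproduced in this excerpt; assuming it takes the usual Bayesian form E(c | ĉ, s) = Σ_c c · P(ĉ | c, s) π(c) / Σ_c P(ĉ | c, s) π(c) with a binomial likelihood and a discretized prior (an assumption of this illustration, not a statement of the paper's exact formula), the correction can be sketched as follows; the prior values below are made up.

```python
# Sketch of the expected predictive accuracy E(c | c^, s) for a rule with
# confidence c^ and body support s, under a discretized prior pi(c).
# Assumed form: E(c | c^, s) = sum_c c * P(c^ | c, s) * pi(c) / sum_c P(c^ | c, s) * pi(c),
# with the binomial likelihood P(c^ | c, s) = C(s, k) * c^k * (1-c)^(s-k), k = c^ * s.
from math import comb

def expected_accuracy(conf, s, prior):
    """prior: dict mapping discretized accuracy values c to their probability pi(c)."""
    k = round(conf * s)                      # observed number of correct predictions
    num = den = 0.0
    for c, p in prior.items():
        likelihood = comb(s, k) * (c ** k) * ((1 - c) ** (s - k))
        num += c * likelihood * p
        den += likelihood * p
    return num / den if den > 0 else 0.0

# Made-up prior: most rules are only moderately accurate, few are near-perfect.
prior = {0.3: 0.2, 0.5: 0.5, 0.7: 0.2, 0.95: 0.1}
print(expected_accuracy(1.0, 3, prior))    # confidence 1, tiny support: strongly corrected down
print(expected_accuracy(1.0, 80, prior))   # same confidence, large support: little correction
```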
One way of treating such priors is to assume a certain standard distribution.
Under a set of assumptions on the process that generated the database, π(c)
can be shown to be governed by a certain binomial distribution [9]. However,
empirical studies (see Sect. 5 and Fig. 2a) show that the shape of the prior can
deviate strongly from this binomial distribution. Reasonably accurate estimates
can be obtained by following a Markov Chain Monte Carlo [4] approach to
estimating the prior, using the available database (see Sect. 4). For an extended
discussion of the complexity of estimating this distribution, see [9,6].
Fig. 1. Contributions of support s(x) and confidence ĉ([x ⇒ y]) to predictive accuracy
c([x ⇒ y]) of rule [x ⇒ y]
5. Let X0 = {∅}; Let X1 = {{a1 }, . . . , {ak }} be all item sets with one single element.
6. For i = 1 . . . k − 1 While (i = 1 or Xi−1 ≠ ∅).
   a) If i > 1 Then determine the set of candidate item sets of length i as Xi =
      {x ∪ x′ | x, x′ ∈ Xi−1 , |x ∪ x′ | = i}. Generation of Xi can be optimized by
      considering only item sets x and x′ ∈ Xi−1 that differ only in the element
      with highest item index. Eliminate double occurrences of item sets in Xi .
   b) Run a database pass and determine the support of the generated item sets.
      Eliminate item sets with support less than τ from Xi .
   c) For all x ∈ Xi Call RuleGen(x).
   d) If best has been changed, Then Increase τ to be the smallest number such
      that E(c|1, τ ) > E(c(best[n]) | ĉ(best[n]), s(best[n])) (refer to Equation 6). If τ >
      database size, Then Exit.
   e) If τ has been increased in the last step, Then eliminate all item sets from Xi
      which have support below τ .
7. Output best[1] . . . best[n], the list of the n best association rules.
10. Let γ be the smallest number such that E(c | γ/s(x), s(x)) >
    E(c(best[n]) | ĉ(best[n]), s(best[n])).
11. For i = 1 . . . k With ai ∉ x Do (for all items not in x)
    a) If i = 1 Then Let Y1 = {{ai } | ai ∉ x} (item sets with one element not in x).
    b) Else Let Yi = {y ∪ y′ | y, y′ ∈ Yi−1 , |y ∪ y′ | = i} analogous to the generation of
       candidates in step 6a.
    c) For all y ∈ Yi Do
       i. Measure the support s(x ∪ y). If s(x ∪ y) ≤ γ, Then eliminate y from Yi
          and Continue the for loop with the next y.
       ii. Equation 6 gives the predictive accuracy E(c([x ⇒ y]) | s(x ∪ y)/s(x), s(x)).
       iii. If the predictive accuracy is among the n best found so far (recorded
            in best), Then update best, remove rules in best that are subsumed
            by other rules, and Increase γ to be the smallest number such that
            E(c | γ/s(x), s(x)) ≥ E(c(best[n]) | ĉ(best[n]), s(best[n])).
12. If any subsumed rule has been erased in step 11(c)iii, Then recur from step 10.
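The candidate joins in steps 6a and 11b follow the familiar Apriori pattern: two item sets of size i − 1 that differ only in their highest-indexed element are merged into one candidate of size i. A minimal sketch, representing item sets as sorted tuples of item indices (an assumption of this illustration, not the paper's data structure):

```python
# Sketch of the candidate join of steps 6a/11b: merge two (i-1)-item sets that
# differ only in their last (highest-index) item into an i-item candidate,
# eliminating duplicates. Item sets are sorted tuples of item indices here.

def generate_candidates(prev_level):
    """prev_level: item sets of size i-1 (sorted tuples); returns candidates of size i."""
    prev = sorted(set(prev_level))
    candidates = set()
    for a in range(len(prev)):
        for b in range(a + 1, len(prev)):
            x, y = prev[a], prev[b]
            if x[:-1] == y[:-1]:                      # differ only in the last element
                candidates.add(tuple(sorted(set(x) | set(y))))
    return sorted(candidates)

print(generate_candidates([(1,), (2,), (3,)]))        # [(1, 2), (1, 3), (2, 3)]
print(generate_candidates([(1, 2), (1, 3), (2, 3)]))  # [(1, 2, 3)]
```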
and, given that length, draw a fixed number of rules. We determine the items
and the split into body and head by drawing at random (Step 3). We have now
drawn equally many rules for each size while the uniform distribution requires
us to prefer long rules
as there are many more long rules than there are short
ones. There are (k choose i) item sets of size i over k database items, and given i
items, there are 2^i − 1 distinct association rules (each item can be located on the
left or right hand side of the rule but the right hand side must be nonempty).
Hence, Equation 7 gives the probability that exactly i items occur in a rule which
is drawn at random under uniform distribution from the space of all association
rules over k items.

    P[i items] = (k choose i)(2^i − 1) / Σ_{j=1}^{k} (k choose j)(2^j − 1)        (7)
In step 4 we apply a Markov Chain Monte Carlo style correction to the prior by
weighting each prior for rule length i by the probability of a rule length of i.
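The counting argument behind Equation 7 translates directly into code; the following sketch (the function name is illustrative) computes P[i items] and could be used to weight, as in step 4, the per-length priors by the probability of that rule length.

```python
# Sketch of Equation 7: probability that a rule drawn uniformly from all
# association rules over k items contains exactly i items. There are C(k, i)
# item sets of size i, and each yields 2^i - 1 rules (every item goes into the
# body or the head, and the head must be nonempty).
from math import comb

def rule_length_probability(i, k):
    numerator = comb(k, i) * (2 ** i - 1)
    denominator = sum(comb(k, j) * (2 ** j - 1) for j in range(1, k + 1))
    return numerator / denominator

k = 10
distribution = [rule_length_probability(i, k) for i in range(1, k + 1)]
print(distribution)       # most of the probability mass lies on the longer rules
print(sum(distribution))  # 1.0 (up to floating-point rounding)
```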
Generating All Rules over Given Body x. In step 10, we introduce a new
accuracy threshold γ which quantifies the confidence that a rule with support
s(x) needs in order to be among the n best ones. We then start enumerating all
possible heads y, taking into account in step 11 that body and head must be
disjoint and generating candidates in step 11(b) analogous to step 6a. In step
11(c)i we calculate the support of x ∪ y for all heads y. When a rule lies among
the best ones so far, we update best. We will not bother with rules that have
a predictive accuracy below the accuracy of best[n], so we increase γ. In step
11(c)iii, we delete rules from best which are subsumed by other rules. This may
result in the unfortunate situation that rules which we dropped from best earlier
now belong to the n best rules again. So in step 12 we have to check this and
recur from step 10 if necessary.
Theorem 1. We can decide whether a rule subsumes another rule by two simple
subset tests: [x ⇒ y] |= [x′ ⇒ y′] ⇔ x ⊆ x′ ∧ y ⊇ y′. Moreover, if [x ⇒ y] is
supported by a database D, and [x ⇒ y] |= [x′ ⇒ y′], then this database also
supports [x′ ⇒ y′].
Proofs of Theorems 1 and 2 are left for the full paper. Theorem 1 says that
[x ⇒ y] subsumes [x′ ⇒ y′] if and only if x is a subset of x′ (weaker precondition)
and y is a superset of y′ (y predicts more attribute values than y′). We can then
delete [x′ ⇒ y′] because Theorem 1 says that from a more general rule we can
infer that all subsumed rules must be satisfied, too. In order to assure that the n
rules provided to the user are not redundant specializations of each other,
we test for subsumption in step 11(c)iii by performing the two subset tests that
imply subsumption according to Theorem 1.
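The two subset tests translate directly into code. A minimal sketch, with rules represented as (body, head) pairs of Python sets; the concrete example rules are illustrative and only loosely modeled on the items of Table 2:

```python
# Sketch of the subsumption test of Theorem 1:
# [x => y] subsumes [x' => y'] iff x is a subset of x' and y is a superset of y'.

def subsumes(rule, other):
    """rule = (x, y), other = (x2, y2); bodies and heads are sets of items."""
    (x, y), (x2, y2) = rule, other
    return x <= x2 and y >= y2

general  = ({"Location=market 4"}, {"PanelID=9", "ProductGroup=84"})
specific = ({"Location=market 4", "Type=0"}, {"PanelID=9"})

print(subsumes(general, specific))   # True: weaker body, stronger head
print(subsumes(specific, general))   # False
```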
Table 2. (Top) five best association rules when subsumed rules are removed; (bottom)
five best rules when subsumed rules are not removed
[ ⇒ PanelID=9 ProductGroup=84 ]
E(c|ĉ = 1, s = 10000) = 1
[ Location=market 4 ⇒ PanelID=9, ProductGroup=84, Container=nonreuseable ]
E(c|ĉ = 1, s = 1410) = 1
[ Location=market 6 ⇒ PanelID=9, ProductGroup=84, Container=nonreuseable ]
E(c|ĉ = 1, s = 1193) = 1
[ Location=market 1 ⇒ PanelID=9, ProductGroup=84, Container=nonreuseable ]
E(c|ĉ = 1, s = 1025) = 1
[ Manufacturer=producer 18 ⇒ PanelID=9, ProductGroup=84, Type=0, Container=nonreuseable ]
E(c|ĉ = 1, s = 1804) = 1
5 Experiments
For our experiments, we used a database of 14,000 fruit juice purchase transac-
tions, and the mailing campaign data used for the KDD cup 1998. Each trans-
action of the fruit juice database is described by 29 real-valued and string-valued
attributes which specify properties of the purchased juice as well as attributes
of the customer (e.g., age and job). By binarizing the attributes and considering
only a subset of the binary attributes, we varied the number of items during the
experiments. For instance, we transformed the attribute “ContainerSize” into
five binary attributes, “ContainerSize ≤ 0.3”, “0.3 < ContainerSize ≤ 0.5”, etc.
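The discretization described above can be sketched as follows; the attribute name and the first two interval boundaries mirror the ContainerSize example, while the remaining boundaries and the helper function are illustrative and not the original experimental code.

```python
# Sketch of binarizing a real-valued attribute into interval items, as in the
# ContainerSize example: every record receives exactly one of the binary items.

def binarize(value, name, boundaries):
    """boundaries: sorted upper interval bounds, e.g. [0.3, 0.5, 0.7, 1.0]."""
    lower = None
    for upper in boundaries:
        if (lower is None or lower < value) and value <= upper:
            return f"{name} <= {upper}" if lower is None else f"{lower} < {name} <= {upper}"
        lower = upper
    return f"{name} > {boundaries[-1]}"

bounds = [0.3, 0.5, 0.7, 1.0]                       # first two bounds as in the text
print(binarize(0.25, "ContainerSize", bounds))      # 'ContainerSize <= 0.3'
print(binarize(0.4,  "ContainerSize", bounds))      # '0.3 < ContainerSize <= 0.5'
print(binarize(1.5,  "ContainerSize", bounds))      # 'ContainerSize > 1.0'
```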
Figure 2a shows the prior π(c) as estimated by the algorithm in step 3 for
several numbers of items. Figure 1 shows the predictive accuracy for this prior,
depending on the confidence and the body support. Table 2 (top) shows the five
best association rules found for the fruit juice problem by the PredictiveApriori
algorithm. The rules say that all transactions are performed under PanelID 9
and refer to product group 84 (fruit juice purchases). Apparently, markets 1, 4,
and 6 only sell non-reuseable bottles (in contrast to the refillable bottles sold by
most German supermarkets). Producer 18 does not sell refillable bottles either.
In order to compare the performance of Apriori and PredictiveApriori, we
need to find a uniform measure that is independent of implementation details.
For Apriori, we count how many association rules have to be compared against
the minconf threshold (this number is independent of the actual minconf thresh-
old). We can determine this number from the item sets without actually enumer-
ating all rules. For PredictiveApriori, we measure for how many rules we need
to determine the predictive accuracy by evaluating Equation 6.
Fig. 2. (a) Confidence prior π for various numbers of items (fraction of rules over
confidence, curves for 10, 20, and 30 items). (b) Number of rules that PredictiveApriori
has to consider, dependent on the number n of desired solutions (curves for 20, 30,
and 40 items)
Fig. 4. Number of rules that PredictiveApriori and Apriori need to consider, depending
on the number of items (in case of Apriori also depending on minsup); each panel plots
the number of rules considered against minsup
the accuracy of all rules over supersets of a given item set. Very large parts of
the search space can thus be excluded. A similar idea is realized in Midos [12].
Many optimizations of the Apriori algorithm have been proposed which have
helped this algorithm gain its huge practical relevance. These include the Apri-
oriTid approach for minimizing the number of database passes [2], and sampling
approaches for estimating the support of item sets [2,11]. In particular, efficient
search for frequent itemsets has been addressed intensely and successfully [7,
3,14]. Many of these improvements can, and should, be applied to the Predic-
tiveApriori algorithm as well.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. In ACM SIGMOD Conference on Management of Data,
pages 207–216, 1993.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast discovery
of association rules. In Advances in Knowledge Discovery and Data Mining, 1996.
3. S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket data. In Proceedings of the ACM SIGMOD
Conference on Management of Data, 1997.
4. W. Gilks, S. Richardson, and D. Spiegelhalter, editors. Markov Chain Monte Carlo
in Practice. Chapman & Hall, 1995.
5. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Find-
ing interesting rules from large sets of discovered association rules. In Proceedings
of the Third International Conference on Information and Knowledge Management, 1994.
6. J. Langford and D. McAllester. Computable shell decomposition bounds. In Pro-
ceedings of the International Conference on Computational Learning Theory, 2000.
7. D. Lin and Z. Kedem. Pincer search: a new algorithm for discovering the maximum
frequent set. In Proceedings of the International Conference on Extending Database
Technology, 1998.
8. T. Scheffer. Error Estimation and Model Selection. Infix Publisher, Sankt Au-
gustin, 1999.
9. T. Scheffer. Average-case analysis of classification algorithms for boolean functions
and decision trees. In Proceedings of the International Conference on Algorithmic
Learning Theory, 2000.
10. T. Scheffer. Nonparametric regularization of decision trees. In Proceedings of the
European Conference on Machine Learning, 2000.
11. T. Scheffer and S. Wrobel. A sequential sampling algorithm for a general class
of utility functions. In Proceedings of the International Conference on Knowledge
Discovery and Data Mining, 2000.
12. S. Wrobel. Inductive logic programming for knowledge discovery in databases. In
Sašo Džeroski and Nada Lavrač, editors, Relational Data Mining, 2001.
13. M. Zaki. Generating non-redundant association rules. In Proceedings of the Inter-
national Conference on Knowledge Discovery and Data Mining, 2000.
14. M. Zaki and C. Hsiao. Charm: an efficient algorithm for closed association rule
mining. Technical Report 99-10, Rensselaer Polytechnic Institute, 1999.