Association Rule Mining Guide
— Chapter 5 —
BIS 541
2020/2021 Spring
Chapter 5: Mining Association
Rules in Large Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
From association mining to correlation analysis
Constraint-based association mining
Sequential pattern mining
What Is Association Mining?
Applications: basket data analysis, cross-marketing, catalog
design, etc.
Examples.
Rule form: “Body → Head [support, confidence]”.
E.g., age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”)
[1%, 75%]
Association Rule: Basic Concepts
Given:
(1) a database of transactions,
(2) each transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of one set of
items with that of another set of items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done: an
interesting, actionable rule
Basic Concepts cont.
I = {i1, ..., im}: the set of all items; T: any transaction, T ⊆ I
A ⊆ T: transaction T contains the itemset A
A, B itemsets with A ⊆ T, B ⊆ T
Examine rules of the form:
A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
support s = P(A ∪ B): probability that a
transaction contains every item in A and B
confidence c = P(B|A): conditional probability that a
transaction containing A also contains B
Frequent itemsets
Frequent itemsets: itemsets with support ≥ min_support
Strong association rules: rules with
support ≥ min_support and confidence ≥ min_confidence
Mining Association Rules—An Example(1)
Min_support = 50%, Min_confidence = 50%
Min_count = 0.5 × 4 = 2

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%

{A}, {B}, {C}, {D} are 1-itemsets; {D} has support 25%
{A}, {B}, {C} are frequent 1-itemsets as, e.g.,
Count[{A}] = 3 >= 2 (minimum_count) or
Support[{A}] = 75% >= 50% (minimum_support)
{D} is not a frequent 1-itemset as
Count[{D}] = 1 < 2 (minimum_count) or
Support[{D}] = 25% < 50% (minimum_support)
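The frequent 1-itemset check above can be sketched in Python (a minimal sketch; the items, transactions, and 50% threshold are the ones in the table):

```python
# Support counting for 1-itemsets on the 4-transaction example
# (min_support = 50%, so min_count = 0.5 * 4 = 2).
from collections import Counter

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

min_support = 0.5
min_count = min_support * len(transactions)  # 2

# Count each item's occurrences across transactions.
counts = Counter(item for t in transactions for item in t)

# Frequent 1-itemsets: count >= min_count.
frequent = {item: c for item, c in counts.items() if c >= min_count}
print(sorted(frequent.items()))  # [('A', 3), ('B', 2), ('C', 2)]
```

{D} drops out because its count, 1, is below the minimum count of 2.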
Mining Association Rules—An Example(2)
The Apriori algorithm has two steps
Generation of frequent itemsets from
candidate itemsets (Step 1)
Self-join operation
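The self-join and prune steps can be sketched together (a minimal sketch; itemsets are sorted tuples and `apriori_gen` is a hypothetical helper name):

```python
from itertools import combinations

def apriori_gen(L_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets L_k.

    Self-join: two k-itemsets that agree on their first k-1 items
    are merged. Prune: drop any candidate with an infrequent k-subset.
    """
    L_k = sorted(L_k)
    Lset = set(L_k)
    k = len(L_k[0])
    candidates = []
    for i in range(len(L_k)):
        for j in range(i + 1, len(L_k)):
            a, b = L_k[i], L_k[j]
            if a[:k - 1] == b[:k - 1]:                       # join condition
                c = a + (b[-1],)
                # Apriori prune: every k-subset of c must be frequent.
                if all(s in Lset for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

L2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5)]
print(apriori_gen(L2))  # [(1, 2, 3), (1, 2, 5)]
```

With the L2 of the worked example below, the join produces six candidates but the prune keeps only [1 2 3] and [1 2 5], matching the slides.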
Example 6.1 Han
TID    list of item IDs
T100   1 2 5        D = 9 transactions
T200   2 4          minimum transaction
T300   2 3          support_count = 2
T400   1 2 4        min_sup = 2/9 = 22%
T500   1 3
T600   2 3
T700   1 3
T800   1 2 3 5
T900   1 2 3
1: milk
2: apple
3: butter
4: bread
5: orange
1st and 2nd iterations of the algorithm
C1: itemset sup_count    L1: itemset sup_count
    {1}     6                {1}     6
    {2}     7                {2}     7
    {3}     6                {3}     6
    {4}     2                {4}     2
    {5}     2                {5}     2
C2 = L1 join L1:
C2: itemset sup_count    L2: itemset sup_count
    {1 2}   4                {1 2}   4
    {1 3}   4                {1 3}   4
    {1 4}   1 x              {1 5}   2
    {1 5}   2                {2 3}   4
    {2 3}   4                {2 4}   2
    {2 4}   2                {2 5}   2
    {2 5}   2
    {3 4}   0 x
    {3 5}   1 x
    {4 5}   0 x
L2: the frequent 2-itemsets, i.e. those itemsets in C2
having minimum support (Step (1b))
3rd iteration
Self-join to get C3 (Step (2))
C3 = L2 join L2: [1 2 3], [1 2 5], [1 3 5], [2 3 4],
[2 3 5], [2 4 5]
Now Step (1a): apply the Apriori property to every itemset in C3
2-item subsets of [1 2 3]: [1 2], [1 3], [2 3]
all 2-item subsets are members of L2
keep [1 2 3] in C3
2-item subsets of [1 2 5]: [1 2], [1 5], [2 5]
all 2-item subsets are members of L2
keep [1 2 5] in C3
2-item subsets of [1 3 5]: [1 3], [1 5], [3 5]
[3 5] is not a member of L2, so it is not frequent
remove [1 3 5] from C3
3rd iteration cont.
2-item subsets of [2 3 4]: [2 3], [2 4], [3 4]
[3 4] is not a member of L2, so it is not frequent
remove [2 3 4] from C3
2-item subsets of [2 3 5]: [2 3], [2 5], [3 5]
[3 5] is not a member of L2, so it is not frequent
remove [2 3 5] from C3
2-item subsets of [2 4 5]: [2 4], [2 5], [4 5]
[4 5] is not a member of L2, so it is not frequent
remove [2 4 5] from C3
C3: [1 2 3], [1 2 5] after pruning
4th iteration
C3 → L3: check min support (Step (1b))
L3: those itemsets having minimum support
L3: itemset   sup_count
    [1 2 3]   2
    [1 2 5]   2
L3 join L3 to generate C4 (Step (2))
L3 join L3: [1 2 3 5]
pruned since its subset [2 3 5] is not frequent
C4 = ∅
the algorithm terminates
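The full level-wise loop can be sketched as follows, run on the nine transactions of Han's Example 6.1 (T700 = {1, 3} and T800 = {1, 2, 3, 5}, which the counts above imply, are taken from the textbook table):

```python
from itertools import combinations

# The nine transactions of Han's Example 6.1 (min support count = 2).
D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
     {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
min_count = 2

def frequent_itemsets(transactions, min_count):
    """Level-wise Apriori: returns {itemset (sorted tuple): support count}."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    L = []                                            # L1
    for i in items:
        sup = sum(i in t for t in transactions)
        if sup >= min_count:
            L.append((i,))
            result[(i,)] = sup
    k = 1
    while L:
        Lset = set(L)
        C = []
        for a in range(len(L)):                       # self-join (Step 2)
            for b in range(a + 1, len(L)):
                if L[a][:k - 1] == L[b][:k - 1]:
                    c = L[a] + (L[b][-1],)
                    # prune (Step 1a): all k-subsets must be frequent
                    if all(s in Lset for s in combinations(c, k)):
                        C.append(c)
        k += 1
        L = []
        for c in C:                                   # scan D (Step 1b)
            sup = sum(set(c) <= t for t in transactions)
            if sup >= min_count:
                L.append(c)
                result[c] = sup
    return result

res = frequent_itemsets(D, min_count)
print(res[(1, 2, 3)], res[(1, 2, 5)])  # 2 2
```

The run reproduces the trace above: L1 counts 6, 7, 6, 2, 2; L3 = {[1 2 3], [1 2 5]}; and no frequent 4-itemset, so the loop terminates.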
Generating Association Rules from
frequent itemsets
Strong rules satisfy both
min_support and min_confidence
confidence(A ⇒ B) = P(B|A) = sup_count(A ∪ B) / sup_count(A)
for each frequent itemset l:
generate all non-empty proper subsets of l, denoted by s;
output the rule s ⇒ (l − s) if its confidence ≥ min_confidence
Example 6.2 Han cont.
the frequent 3-itemset l = [1 2 5]: transactions containing
milk, apple, and orange are frequent
non-empty proper subsets of l are
[1 2], [1 5], [2 5], [1], [2], [5]
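Rule generation from l = [1 2 5] can be sketched with the support counts from Example 6.1 (a minimal sketch):

```python
from itertools import combinations

# Support counts taken from Example 6.1 above.
sup_count = {
    frozenset({1}): 6, frozenset({2}): 7, frozenset({5}): 2,
    frozenset({1, 2}): 4, frozenset({1, 5}): 2, frozenset({2, 5}): 2,
    frozenset({1, 2, 5}): 2,
}

l = frozenset({1, 2, 5})
rules = {}
for r in range(1, len(l)):
    for s in map(frozenset, combinations(sorted(l), r)):
        # confidence(s => l - s) = sup_count(l) / sup_count(s)
        rules[(tuple(sorted(s)), tuple(sorted(l - s)))] = sup_count[l] / sup_count[s]

for (lhs, rhs), conf in rules.items():
    print(f"{set(lhs)} => {set(rhs)}  confidence = {conf:.0%}")
```

For example, [5] ⇒ [1 2] has confidence 2/2 = 100%, while [1] ⇒ [2 5] has only 2/6 ≈ 33%; with min_confidence = 70%, only the 100% rules would survive.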
Example 6.2 cont. Detail on confidence
for two rules
A two-itemset rule example
Important
Exercise
Bottleneck of Frequent-pattern Mining
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
Is Apriori Fast Enough? — Performance
Bottlenecks
Completeness:
never breaks a long pattern of any transaction;
preserves complete information for frequent-pattern
mining
Compactness:
reduces irrelevant information—infrequent items are gone;
compression ratio can be over 100
Multiple-Level Association Rules
Items often form a hierarchy:
Food → milk, bread; milk → skim, 2%; bread → wheat, white;
brands (e.g., Fraser, Sunset) at the lowest level
Items at the lower level are
expected to have lower
support.
Rules regarding itemsets at
appropriate levels could be
quite useful.
The transaction database can be
encoded based on
dimensions and levels:
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
We can explore shared multi-level mining
A top-down, progressive deepening approach
Multi-level Association: Uniform
Support vs. Reduced Support
Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item
Uniform Support
Reduced Support
Controlled level-cross filtering by single item
min_sup_T(k+1) < LPT(k) < min_sup_T(k)
(the level passage threshold LPT(k) lies between the
min support thresholds of level k and level k+1)
Example:
High level: milk
min_sup_threshold = 5%
Low level: 2% milk, skim milk
min_sup_threshold = 3%
level_passage_threshold = 4%
Multi-level Association: Redundancy
Filtering
Multi-Level Mining: Progressive
Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level “weaker” frequent
itemsets:
2% milk (5%), wheat bread (4%)
Different min_support thresholds across multi-levels:
If adopting the same min_support across multi-levels,
then toss t if any of t’s ancestors is infrequent.
If adopting reduced min_support at lower levels,
then examine only those descendants whose ancestor’s
support is frequent/non-negligible.
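A sketch of reduced-support, top-down mining; the hierarchy and the skim/white supports are illustrative assumptions (milk 15%, bread 10%, 2% milk 5%, wheat bread 4% are the figures from the slide):

```python
# A hypothetical two-level hierarchy (names and some supports are illustrative).
hierarchy = {"2% milk": "milk", "skim milk": "milk",
             "wheat bread": "bread", "white bread": "bread"}
support = {"milk": 0.15, "bread": 0.10,
           "2% milk": 0.05, "skim milk": 0.02,
           "wheat bread": 0.04, "white bread": 0.03}

min_sup = {1: 0.05, 2: 0.03}  # reduced min_support at the lower level

# Level 1: frequent high-level items.
L1 = {i for i in ("milk", "bread") if support[i] >= min_sup[1]}

# Level 2: examine only descendants of frequent ancestors.
L2 = {i for i, parent in hierarchy.items()
      if parent in L1 and support[i] >= min_sup[2]}

print(sorted(L2))  # ['2% milk', 'wheat bread', 'white bread']
```

With a uniform 5% threshold, only 2% milk would survive at the lower level; the reduced 3% threshold also admits wheat and white bread, while skim milk (2%) is still tossed.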
Progressive Refinement of Data
Mining Quality
Interestingness Measurements
Objective measures
Two popular measurements:
support; and
confidence
Subjective measures: a rule is interesting if it is
unexpected (surprising to the user) and/or
actionable (the user can do something with it)
Criticism to Support and Confidence
Example 2:
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
X and Y: positively correlated
X and Z: negatively related
yet the support and confidence of X=>Z dominate
Rule   Support   Confidence
X=>Y   25%       50%
X=>Z   37.50%    75%
We need a measure of dependent
or correlated events:
corr(A,B) = P(A ∪ B) / (P(A) P(B))
P(B|A)/P(B) is also called the lift
of rule A => B
Other Interestingness Measures: Interest
Interest (correlation, lift) = P(A ∪ B) / (P(A) P(B))
taking both P(A) and P(B) into consideration
P(A ∪ B) = P(A) P(B), if A and B are independent events
A and B negatively correlated if the value is less than 1;
otherwise A and B positively correlated
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Itemset   Support   Interest
X,Y       25%       2
X,Z       37.50%    0.9
Y,Z       12.50%    0.57
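A quick check of the interest (lift) values in the table, computed directly from the eight transactions (note 0.9 is a rounding of ≈0.86):

```python
# The eight transactions from the slide, one column per transaction.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def lift(a, b):
    """Interest (lift): P(A and B) / (P(A) * P(B))."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)

print(round(lift(X, Y), 2))  # 2.0
print(round(lift(X, Z), 2))  # 0.86 (the table's 0.9): negative correlation
print(round(lift(Y, Z), 2))  # 0.57
```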
Example
Total transactions: 10,000
Items — C: computer game, V: video game
V: 7,500   C: 6,000   C and V: 4,000
Min_support: 0.3   min_conf: 0.50
C => V [support 40%, confidence 66%] is
strong but misleading
lift
Lift of A => B
Lift = P(A and B) / (P(A) P(B))
P(A and B) = P(B|A) P(A), so
Lift = P(B|A) / P(B)
Ratio of the probability of buying A and B together to
that of buying A and B independently
Or it can be interpreted as:
the conditional probability of buying B given A,
relative to the unconditional probability of buying B
P(V|C) = 4000/6000 = 0.67
Lift(C => V) = P(C and V) / (P(V) P(C)) = P(V|C) / P(V)
= 0.4 / (0.6 × 0.75) = 0.89 < 1: there is a negative correlation
between video games and computer games
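The same numbers in code (a minimal sketch of the computation above):

```python
# Computer games (C) vs. video games (V), 10,000 transactions.
n, n_v, n_c, n_cv = 10_000, 7_500, 6_000, 4_000

support = n_cv / n                       # 0.4: C => V is "strong" (>= 0.3)
confidence = n_cv / n_c                  # ~0.67 (>= 0.5)
lift = (n_cv / n) / ((n_c / n) * (n_v / n))

print(round(support, 2), round(confidence, 2), round(lift, 2))
# 0.4 0.67 0.89  -> lift < 1: C and V are negatively correlated
```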
Are All the Rules Found Interesting?
“Buy walnuts => buy milk [1%, 80%]” is misleading
if 85% of customers buy milk
Support and confidence are not good to represent correlations
So many interestingness measures (Tan, Kumar, Srivastava @KDD’02)
lift = P(A ∪ B) / (P(A) P(B))
             Milk    No Milk   Sum (row)
Coffee       m, c    ~m, c     c
No Coffee    m, ~c   ~m, ~c    ~c
Sum (col.)   m       ~m
all_conf = sup(X) / max_item_sup(X)
Cosine: P(A,B) / sqrt(P(A) P(B))
Similar to lift, but takes the square root of the denominator
Both cosine and all_conf are null-invariant:
not affected by null transactions
Ex: cosine(C,V) = 0.4 / sqrt(0.6 × 0.75) ≈ 0.60
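Checking cosine and all_conf for the computer/video example (the all_conf value, about 0.53, is computed here and not given on the slide):

```python
import math

# {computer game, video game}: sup = 0.4, item supports 0.6 and 0.75.
sup_cv, sup_c, sup_v = 0.4, 0.6, 0.75

cosine = sup_cv / math.sqrt(sup_c * sup_v)   # lift with square-rooted denominator
all_conf = sup_cv / max(sup_c, sup_v)        # sup(X) / max_item_sup(X)

print(round(cosine, 2), round(all_conf, 2))  # 0.6 0.53
```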
Mining Highly Correlated Patterns
Example
m: containing milk; ~m: not containing milk
c: containing coffee; ~c: not containing coffee
null transactions with respect to milk and
coffee:
transactions containing neither milk nor
coffee
Vary the number of null transactions and see that
lift and chi-square are affected by the number of null
transactions, while null-invariant measures are not
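The null-invariance point can be demonstrated numerically; the 2×2 counts below are made up for illustration:

```python
import math

def measures(n_mc, n_m_only, n_c_only, n_null):
    """Lift and cosine for milk (m) and coffee (c) given a 2x2 breakdown."""
    n = n_mc + n_m_only + n_c_only + n_null
    p_m = (n_mc + n_m_only) / n
    p_c = (n_mc + n_c_only) / n
    p_mc = n_mc / n
    return p_mc / (p_m * p_c), p_mc / math.sqrt(p_m * p_c)

# Same milk/coffee counts, growing number of null transactions:
for n_null in (100, 10_000):
    lift, cosine = measures(n_mc=100, n_m_only=100, n_c_only=100, n_null=n_null)
    print(n_null, round(lift, 2), round(cosine, 2))
# 100 1.0 0.5
# 10000 25.75 0.5
```

Lift jumps from 1.0 to 25.75 purely because null transactions were added, while the null-invariant cosine stays at 0.5.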
Constraint-based (Query-Directed) Mining
Data mining should be an interactive process:
the user directs what is to be mined
System optimization: the system explores such constraints
for efficient mining
Constraints in Data Mining
Data constraint: e.g., transactions in Dec. 2020
Dimension/level constraint:
in relevance to region, price, brand, customer category
Rule constraint: e.g., small sales (price < $10) trigger
big sales (sum > $200)
Interestingness constraint:
strong rules: min_support 3%, min_confidence
60%
Example
bread => milk
milk => butter
Strong rules, but the items are not that valuable
TV => VCD player
The support of this rule may be lower than min_support,
yet the rule involves valuable items
Aggregation functions
A constraint such as Count(l) < 10 is anti-monotone, like
minimum support: once an itemset violates it,
none of its supersets can satisfy it
Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
The Apriori Algorithm — Example
Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3
C2 = L1 join L1; scan D → C2: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}; scan D → L3: {2 3 5}:2
Naïve Algorithm: Apriori + Constraint
Same database and Apriori trace as in the example above;
the constraint is checked only after mining.
Constraint: Sum{S.price} < 5
The Constrained Apriori Algorithm: Push
an Anti-monotone Constraint Deep
Same database and trace; the anti-monotone constraint
prunes itemsets during candidate generation.
Constraint: Sum{S.price} < 5
The Constrained Apriori Algorithm: Push
Another Constraint Deep
Same database and trace; the constraint is pushed
into candidate generation.
Constraint: min{S.price} <= 1
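A sketch of pushing an anti-monotone constraint into candidate generation; the item prices here are hypothetical (the slides do not list the S.price values):

```python
def satisfies(itemset, price, bound):
    """Anti-monotone check: the sum of prices stays below the bound."""
    return sum(price[i] for i in itemset) < bound

def constrained_candidates(candidates, price, bound):
    # Because Sum{S.price} < bound is anti-monotone, a violating candidate
    # can be dropped immediately: none of its supersets can satisfy it.
    return [c for c in candidates if satisfies(c, price, bound)]

price = {1: 2, 2: 3, 3: 1, 4: 5, 5: 4}   # hypothetical prices
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
print(constrained_candidates(C2, price, 5))  # [(1, 3), (2, 3)]
```

Pruned candidates are never counted against the database, which is where the constrained algorithm saves work over the naïve "mine first, filter later" approach.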
Sequence Databases and Sequential
Pattern Analysis
What Is Sequential Pattern Mining?
Studies on Sequential Pattern Mining
Sequential pattern mining: Cases and
Parameters
Duration of a time sequence T:
Sequential pattern mining can then be confined to the
data within a specified duration
Sequential pattern mining: Cases and Parameters (2)
A Basic Property of Sequential Patterns: Apriori
GSP—A Generalized Sequential Pattern Mining Algorithm
scan the database to collect the support count for each
candidate sequence
generate candidate length-(k+1) sequences from
length-k frequent sequences; repeat until no more
frequent sequences or candidates are found
Finding Length-1 Sequential Patterns
min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Candidate supports include <e> 3, <f> 2, <g> 1, <h> 1
(<g> and <h> are below min_sup)
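Counting length-1 pattern supports over the five sequences can be sketched as follows (sequences are represented as lists of itemset elements):

```python
# Each sequence is a list of elements (itemsets).
db = {
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}
min_sup = 2

# A length-1 pattern <x> is supported by a sequence if x occurs in any element
# (each sequence contributes at most 1 to a pattern's support).
support = {}
for seq in db.values():
    for item in set().union(*seq):
        support[item] = support.get(item, 0) + 1

L1 = sorted(i for i, s in support.items() if s >= min_sup)
print(L1)                                        # ['a', 'b', 'c', 'd', 'e', 'f']
print(support["e"], support["f"], support["g"])  # 3 2 1
```

The counts match the slide: <e> 3, <f> 2, <g> 1, <h> 1, leaving six length-1 sequential patterns.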
Generating Length-2 Candidates
From L1 = {<a>, <b>, <c>, <d>, <e>, <f>}:
36 candidates of the form <xy>, x, y ∈ {a..f}
(e.g. <ca>, <cb>, ..., <df>, <ea>, ..., <ef>, <fa>, ..., <ff>)
plus 15 candidates of the form <(xy)>, x < y:
51 length-2 candidates in total
The GSP Mining Process
min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
1st scan: 8 cand., 6 length-1 seq. pat.: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all:
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all:
<abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat., cand. not in DB at all:
<abba> <(bd)bc> …
Definition: c is a contiguous subsequence of a
sequence s = <s1, s2, ..., sn> if
c is derived by dropping an item from s1 or sn, or
c is derived by dropping an item from an element si which
has at least 2 items, or
c’ is a contiguous subsequence of c and c is a
contiguous subsequence of s
Ex: s = <(1,2),(3,4),5,6>
<2,(3,4),5>, <(1,2),3,5,6>, and <(3,4),5> are contiguous
subsequences, but e.g. <(1,2),(3,4),6> is not
Candidate generation
Step 1: Join step — Lk-1 join with Lk-1 to give Ck
s1 and s2 are joined if dropping the first item of s1
and the last item of s2 gives the same sequence;
s1 is then extended by adding the last item of s2
Step 2: Prune step — remove candidates having an
infrequent contiguous (k-1)-subsequence
L3            C4 (after pruning)   L4
{(1,2),3}     {(1,2),(3,4)}        {(1,2),(3,4)}
{(1,2),4}
{1,(3,4)}
{(1,3),5}
{2,(3,4)}
{2,3,5}
{(1,2),3} joined with {2,(3,4)} gives {(1,2),(3,4)}
{(1,2),3} joined with {2,3,5} gives {(1,2),3,5}
{(1,2),3,5} is dropped since its contiguous subsequence
{1,3,5} is not in L3
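The join step on element-structured sequences can be sketched as below; `join`, `drop_first_item`, and `drop_last_item` are hypothetical helper names implementing the rule above:

```python
def drop_first_item(s):
    """Drop the first item of the first element (tuple-of-tuples sequence)."""
    head = s[0][1:]
    return ((head,) if head else ()) + s[1:]

def drop_last_item(s):
    """Drop the last item of the last element."""
    tail = s[-1][:-1]
    return s[:-1] + ((tail,) if tail else ())

def join(s1, s2):
    """GSP join: s1 and s2 join if dropping s1's first item and s2's last
    item gives the same sequence; s1 is then extended by s2's last item."""
    if drop_first_item(s1) != drop_last_item(s2):
        return None
    last = s2[-1]
    if len(last) == 1:
        return s1 + (last,)                         # new singleton element
    return s1[:-1] + (s1[-1] + (last[-1],),)        # merge into last element

s1 = ((1, 2), (3,))          # {(1,2),3}
print(join(s1, ((2,), (3, 4))))        # ((1, 2), (3, 4))   i.e. {(1,2),(3,4)}
print(join(s1, ((2,), (3,), (5,))))    # ((1, 2), (3,), (5,)) i.e. {(1,2),3,5}
```

This reproduces the two joins above; the prune step would then discard {(1,2),3,5} because its contiguous subsequence {1,3,5} is not in L3.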
Bottlenecks of GSP
A huge set of candidates could be generated:
1,000 frequent length-1 sequences generate
1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
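The candidate count follows from the arithmetic: n² ordered pairs <xy> plus n(n−1)/2 itemset-element pairs <(xy)>:

```python
n = 1000
# <xy>: x may equal y and order matters -> n * n candidates.
pairs = n * n
# <(xy)>: x and y occur together in one element, x < y -> n*(n-1)/2 candidates.
itemset_pairs = n * (n - 1) // 2

print(pairs + itemset_pairs)  # 1499500
```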