Association Rule Mining Guide

The document summarizes the Apriori algorithm for mining association rules from transactional databases. The algorithm has two main steps: (1) find all frequent itemsets that occur above a minimum support threshold, and (2) generate strong association rules from those frequent itemsets that have confidence above a minimum confidence threshold. It provides examples to illustrate how the algorithm works by iteratively generating candidate itemsets, pruning non-frequent itemsets, and producing association rules that meet the support and confidence constraints.


Data Mining

— Chapter 5 —

BIS 541
2020/2021 Spring

1
Chapter 5: Mining Association
Rules in Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 From association mining to correlation analysis
 Constraint-based association mining
 Sequential pattern mining

2
What Is Association Mining?

 Association rule mining:


 Finding frequent patterns, associations, correlations, or

causal structures among sets of items or objects in


transaction databases, relational databases, and other
information repositories.
 Applications:
 Market basket analysis, cross-marketing, catalog

design, etc.
 Examples.
 Rule form: “Body → Head [support, confidence]”.
 buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
 major(x, “MIS”) ^ takes(x, “DM”) → grade(x, “AA”) [1%, 75%]
3
Association Rule: Basic Concepts
 Given:
 (1) database of transactions,

 (2) each transaction is a list of items (purchased by a

customer in a visit)
 Find: all rules that correlate the presence of one set of
items with that of another set of items
 E.g., 98% of people who purchase tires and auto

accessories also get automotive services done


 The user specifies
 Minimum support level

 Minimum confidence level

 Rules exceeding the two thresholds are listed as

interesting
4
Basic Concepts cont.
 I = {i1, ..., im}: set of all items, T: any transaction
 A ⊆ T: T contains the itemset A
 A ⊆ T, B ⊆ T, where A, B are itemsets
 Examine rules of the form:
 A ⇒ B where
 A ∩ B = ∅
 support s: P(A ∪ B)
 frequency of transactions containing both A and B
 confidence c: P(B|A) = P(A ∪ B)/P(A)
 conditional probability that a transaction containing
A also contains B
5
Rule Measures: Support and
Confidence
(Venn diagram: customers buying diapers, customers buying beer, customers buying both)
 Find all the rules X & Y ⇒ Z with minimum confidence and support
 support, s: probability that a transaction contains {X ∪ Y ∪ Z}
 confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support 50% and minimum confidence 50%; we have
 A ⇒ C (50%, 66.6%)
 C ⇒ A (50%, 100%)
6
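A minimal sketch, in Python, of how support and confidence can be computed for the toy database above; the transaction list and the helper names support/confidence are illustrative, not from the lecture.

# Computing support and confidence for A => C and C => A
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = support(antecedent u consequent) / support(antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"A", "C"}, transactions))        # 0.5   -> 50%
print(confidence({"A"}, {"C"}, transactions))   # 0.666 -> 66.6%
print(confidence({"C"}, {"A"}, transactions))   # 1.0   -> 100%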
Frequent itemsets
 Strong association rules:
 support of rule ≥ min_support
 confidence of rule ≥ min_confidence
 k-itemset: an itemset containing k items
 occurrence frequency = count = support count
 minimum support count =
 min_sup * #transactions in the database
 frequent itemsets:
 itemsets satisfying the minimum support count
 The Apriori Algorithm has two steps:
 (1) - Find all frequent itemsets
 (2) - Generate strong association rules from the frequent
itemsets

7
Mining Association Rules—An Example(1)

Min_support = 50%
Min_confidence = 50%
Min_count = 0.5 * 4 = 2

Transaction ID   Items Bought      Itemset   Support
2000             A, B, C           {A}       75%
1000             A, C              {B}       50%
4000             A, D              {C}       50%
5000             B, E, F           {D}       25%

{A}, {B}, {C}, {D} are 1-itemsets
{A}, {B}, {C} are frequent 1-itemsets as
Count[{A}] = 3 >= 2 (minimum count) or
Support[{A}] = 75% >= 50% (minimum support)
{D} is not a frequent 1-itemset as
Count[{D}] = 1 < 2 (minimum count) or
Support[{D}] = 25% < 50% (minimum support)
8
Mining Association Rules—An Example(2)

Min_support = 50%
Min_confidence = 50%
Min_count = 0.5 * 4 = 2

Transaction ID   Items Bought      Itemset   Support
2000             A, B, C           {A,B}     25%
1000             A, C              {A,C}     50%
4000             A, D              {A,D}     25%
5000             B, E, F           {B,C}     25%

{A,B}, {A,C}, {A,D}, {B,C} are 2-itemsets
{A,C} is a frequent 2-itemset as
Count[{A,C}] = 2 >= 2 (minimum count) or
Support[{A,C}] = 50% >= 50% (minimum support)
{A,B}, {A,D}, {B,C} are not frequent 2-itemsets as, e.g.,
Count[{A,D}] = 1 < 2 (minimum count) or
Support[{A,D}] = 25% < 50% (minimum support)
9
Mining Association Rules—An Example(3)

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought      Frequent Itemset   Support
2000             A, B, C           {A}                75%
1000             A, C              {B}                50%
4000             A, D              {C}                50%
5000             B, E, F           {A,C}              50%

For rule A ⇒ C:
support = support({A ∪ C}) = 50%
confidence = support({A ∪ C}) / support({A}) = 66.6%
Strong rule, as support >= min_support and
confidence >= min_confidence
10
The Apriori Principle

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought      Frequent Itemset   Support
2000             A, B, C           {A}                75%
1000             A, C              {B}                50%
4000             A, D              {C}                50%
5000             B, E, F           {A,C}              50%

The Apriori principle:
Any subset of a frequent itemset must be frequent
{A,C} is a frequent 2-itemset
{A} and {C}, the subsets of {A,C}, must be frequent
1-itemsets
11
Apriori Algorithme has two steps

 (1)-Find the frequent itemsets: the sets of items that


have minimum support (the key step)
 A subset of a frequent itemset must also be a frequent itemset
 i.e., if {A,B} is a frequent itemset, both {A} and {B} should be frequent
itemsets
 Iteratively find frequent itemsets with cardinality from 1 to k (k-
itemsets)
 Until Lk is empty
 (2)-Use the frequent itemsets to generate association
rules.

12
Generation of frequent itemsets from
candidate itemsets (Step 1)

 C1L1  C2L2 C3  L3  C4L4…


 From Ck (candidate k-itemsets) generate Lk :Ck  Lk
 From candidate k itemsets generate frequent k itemsets
 (a)-Using the Apriori principle that:
 Eliminate itemset sk in Ck if
 At least one k-1 subset of sk is not in Lk-1
 (b)-For candidate k itemsets in Ck
 Make a database scan to eliminate those itemsets whose
support counts are below the critical min support cout
 From frequent k itemsets Lk generate candidate k+1
itemsets Ck+1 : Lk  Ck+1
 Self joining any Lk with Lk

13
Self Join operation

 Sort the items in any li ∈ Lk in some lexicographic order
 li[1] < li[2] < … < li[k-1] < li[k]
 li and lj are elements of Lk
 If li[1] = lj[1] and li[2] = lj[2] and … li[k-1] = lj[k-1]
 and li[k] < lj[k]
 The first k-1 items are the same
 Only the last items are different
 For li, lj satisfying the above condition,
 construct the itemset lk+1:
 li[1], li[2], …, li[k-1], li[k], lj[k]
 the first k-1 items are common, taken from li or lj
 the k-th item is taken from li
 the (k+1)-th item is taken from lj
14
Example of Self Join operation
 Lexicographic order: alphabetic a<b<c<d ...
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3 Step(2)
 abcd from abc and abd
 acde from acd and ace
 Pruning by Apriori principle: Step(1a)
 acde is removed because ade is not in L3
 C4={abcd}
15
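A small Python sketch (an illustration, not the lecture's own code) of the self-join and Apriori-pruning steps on the L3 example above; itemsets are represented as lexicographically sorted tuples and the name apriori_gen is assumed.

from itertools import combinations

def apriori_gen(L_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets L_k
    (each itemset is a sorted tuple)."""
    candidates = set()
    for a in L_k:
        for b in L_k:
            # join step: first k-1 items equal, last item of a < last item of b
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # prune step: every k-subset of a candidate must itself be in L_k
    return {c for c in candidates
            if all(s in L_k for s in combinations(c, len(c) - 1))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))   # {('a','b','c','d')} ; ('a','c','d','e') is pruned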
The Apriori Algorithm — Example
min support count = 2

Database D                C1 (scan D)        L1
TID   Items               itemset   sup      itemset   sup
100   1 3 4               {1}       2        {1}       2
200   2 3 5               {2}       3        {2}       3
300   1 2 3 5             {3}       3        {3}       3
400   2 5                 {4}       1        {5}       3
                          {5}       3

C2 (from L1)    C2 (scan D)         L2
itemset         itemset   sup       itemset   sup
{1 2}           {1 2}     1         {1 3}     2
{1 3}           {1 3}     2         {2 3}     2
{1 5}           {1 5}     1         {2 5}     3
{2 3}           {2 3}     2         {3 5}     2
{2 5}           {2 5}     3
{3 5}           {3 5}     2

C3              L3 (scan D)
itemset         itemset   sup
{2 3 5}         {2 3 5}   2
16
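An illustrative end-to-end sketch of the level-wise loop (not the lecture's own code); candidate generation is rebuilt inline in the apriori_gen style shown above, and the toy database D with minimum support count 2 is reproduced for the example run.

from itertools import combinations
from collections import Counter

def apriori(db, min_count):
    """Return all frequent itemsets (as sorted tuples) with their support counts."""
    # L1: frequent 1-itemsets
    counts = Counter(item for t in db for item in t)
    Lk = {(i,): c for i, c in counts.items() if c >= min_count}
    frequent = dict(Lk)
    while Lk:
        prev = set(Lk)
        # candidate generation: self-join + Apriori pruning
        Ck = {a + (b[-1],) for a in prev for b in prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
        Ck = {c for c in Ck
              if all(s in prev for s in combinations(c, len(c) - 1))}
        # one database scan to count the surviving candidates
        Lk = {}
        for c in Ck:
            cnt = sum(set(c) <= t for t in db)
            if cnt >= min_count:
                Lk[c] = cnt
        frequent.update(Lk)
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_count=2))
# frequent itemsets include (1,), (2,), (3,), (5,), (1,3), (2,3), (2,5), (3,5), (2,3,5)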
Chapter 5: Mining Association
Rules in Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 From association mining to correlation analysis
 Constraint-based association mining
 Sequential pattern mining

17
Example 6.1 Han
TID    list of item_IDs
T100   1 2 5
T200   2 4
T300   2 3
T400   1 2 4
T500   1 3
T600   2 3
T700   1 3
T800   1 2 3 5
T900   1 2 3

9 transactions, |D| = 9
minimum transaction support_count = 2
min_sup = 2/9 = 22%
min_conf = 70%

Find strong association rules: having min support count
of 2 and min confidence of 70%
18
Data Dictionary

 1: milk
 2: apple
 3: butter
 4: bread
 5: orange

19
1st and 2nd iterations of the algorithm

C1: itemset  sup_count        L1: itemset  sup_count
    1        6                    1        6
    2        7                    2        7
    3        6                    3        6
    4        2                    4        2
    5        2                    5        2

C2 = L1 join L1               L2 (frequent 2-itemsets: those itemsets in C2
itemset  sup_count            having minimum support, Step (1b)):
1 2      4                    itemset  sup_count
1 3      4                    1 2      4
1 4      1  (x)               1 3      4
1 5      2                    1 5      2
2 3      4                    2 3      4
2 4      2                    2 4      2
2 5      2                    2 5      2
3 4      0  (x)
3 5      1  (x)
4 5      0  (x)
20
3rd iteration
 Self-join to get C3, Step (2)
 C3 = L2 join L2: [1 2 3], [1 2 5], [1 3 5], [2 3 4],
 [2 3 5], [2 4 5]
 Now Step (1a): apply the Apriori principle to every itemset in C3
 2-item subsets of [1 2 3]: [1 2], [1 3], [2 3]
 all 2-item subsets are members of L2
 keep [1 2 3] in C3
 2-item subsets of [1 2 5]: [1 2], [1 5], [2 5]
 all 2-item subsets are members of L2
 keep [1 2 5] in C3
 2-item subsets of [1 3 5]: [1 3], [1 5], [3 5]
 [3 5] is not a member of L2, so it is not frequent
 remove [1 3 5] from C3
21
3rd iteration cont.
 2-item subsets of [2 3 4]: [2 3], [2 4], [3 4]
 [3 4] is not a member of L2, so it is not frequent
 remove [2 3 4] from C3
 2-item subsets of [2 3 5]: [2 3], [2 5], [3 5]
 [3 5] is not a member of L2, so it is not frequent
 remove [2 3 5] from C3
 2-item subsets of [2 4 5]: [2 4], [2 5], [4 5]
 [4 5] is not a member of L2, so it is not frequent
 remove [2 4 5] from C3
 C3: [1 2 3], [1 2 5] after pruning

22
4th iteration
 C3 → L3: check min support, Step (1b)
 L3: those itemsets having minimum support
 L3: itemset   sup_count
     1 2 3     2
     1 2 5     2
 L3 join L3 to generate C4, Step (2)
 L3 join L3: 1 2 3 5
 pruned since its subset [2 3 5] is not frequent
 C4 = ∅
 the algorithm terminates
23
Generating Association Rules from
frequent itemsets
 Strong rules
 min support and min confidence
 confidence(A ⇒ B) = P(B|A) = sup_count(A ∪ B) / sup_count(A)
 for each frequent itemset l
 generate the non-empty subsets of l, denoted by s
 For each s ⊂ l
 construct the rule: s ⇒ (l − s)
 Rules satisfying the condition
 sup_count(l)/sup_count(s) >= min_conf
 are listed as interesting
24
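A hedged Python sketch of the rule-generation step just described: for each frequent itemset l, every non-empty proper subset s yields a candidate rule s ⇒ (l − s), kept if its confidence meets min_conf. The dictionary of support counts and the helper name generate_rules are illustrative; the counts are taken from Example 6.1.

from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    """frequent_counts: {frozenset(itemset): support_count} for all frequent itemsets.
    Returns strong rules as (antecedent, consequent, confidence)."""
    rules = []
    for l, l_count in frequent_counts.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / frequent_counts[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# a few support counts from Example 6.1, e.g. sup(1,2,5)=2, sup(1,5)=2, sup(2)=7
counts = {frozenset({1}): 6, frozenset({2}): 7, frozenset({5}): 2,
          frozenset({1, 2}): 4, frozenset({1, 5}): 2, frozenset({2, 5}): 2,
          frozenset({1, 2, 5}): 2}
for ante, cons, conf in generate_rules(counts, min_conf=0.7):
    print(ante, "=>", cons, round(conf, 2))
# e.g. {1, 5} => {2} 1.0 ; {2, 5} => {1} 1.0 ; {5} => {1, 2} 1.0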
Example 6.2 Han cont.
 the frequent 3-itemset l = [1 2 5]: transactions containing
milk, apple, and orange together are frequent
 the non-empty proper subsets of l are
 [1 2], [1 5], [2 5], [1], [2], [5]
 the resulting association rules are:
 1 ∧ 2 ⇒ 5   conf: 2/4 = 50%
 1 ∧ 5 ⇒ 2   conf: 2/2 = 100%
 2 ∧ 5 ⇒ 1   conf: 2/2 = 100%
 1 ⇒ 2 ∧ 5   conf: 2/6 = 33%
 2 ⇒ 1 ∧ 5   conf: 2/7 = 29%
 5 ⇒ 1 ∧ 2   conf: 2/2 = 100%
 if min conf is 70%, the 2nd, 3rd, and last rules are strong

25
Example 6.2 cont. Detail on confidence
for two rules

 For the rule


 1 ∧ 5 ⇒ 2   conf: s(1,2,5)/s(1,5)
 conf: 2/2=100% >= 70%
 A strong rule

 For the rule


 2 ⇒ 1 ∧ 5   conf: s(1,2,5)/s(2)
 conf: 2/7=29% < 70%
 Not a strong rule

26
A two-itemset rule example

 For the rule


 1 ⇒ 2   conf: s(1,2)/s(1)
 s(1,2)= 4, s(1)= 6
 confidence: 4/6=66% < 70%
 not a strong rule

 For the rule


 2 ⇒ 1   conf: s(1,2)/s(2)
 confidence: 4/7=58% < 70%
 Not a strong rule

27
important

 All such rules for 2-itemsets should be checked as
well
Exercise

 Find all strong association rules in Example 6.2


 Check minimum confidence
 for the frequent 2-itemsets
 [1,2], [1,3], [1,5], [2,3], [2,4], [2,5]
 1 ⇒ 2, 2 ⇒ 1
 2 ⇒ 5, 5 ⇒ 2, etc.
 for the frequent 3-itemset
 [1,2,5]
 1 ∧ 2 ⇒ 5
 5 ⇒ 1 ∧ 2, etc.
29
Exercise

 a) Suppose A ⇒ B and B ⇒ C are strong rules
 Does this imply that A ⇒ C is also a strong rule?
 b) Suppose A ⇒ B and A ⇒ C are strong rules
 Does this imply that B ⇒ C is also a strong rule?
 c) Suppose A ⇒ C and B ⇒ C are strong rules
 Does this imply that A ∧ B ⇒ C is also a strong
rule?

30
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly


 Mining long patterns needs many passes of
scanning and generates lots of candidates

 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?

31
Is Apriori Fast Enough? — Performance
Bottlenecks

 The core of the Apriori algorithm:


 Use frequent (k – 1)-itemsets to generate candidate frequent k-
itemsets
 Use database scan and pattern matching to collect counts for the
candidate itemsets
 The bottleneck of Apriori: candidate generation
 Huge candidate sets:
 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
 To discover a frequent pattern of size 100, e.g., {a1, a2, …,
a100}, one needs to generate 2^100 ≈ 10^30 candidates.
 Multiple scans of database:
 Needs (n +1 ) scans, n is the length of the longest pattern
32
Mining Frequent Patterns Without
Candidate Generation

 Compress a large database into a compact, Frequent-


Pattern tree (FP-tree) structure
 highly condensed, but complete for frequent pattern
mining
 avoid costly database scans
 Develop an efficient, FP-tree-based frequent pattern
mining method
 A divide-and-conquer methodology: decompose
mining tasks into smaller ones
 Avoid candidate generation: sub-database test only!
33
Construct FP-tree from a
Transaction DB
min_support = 0.5 (support count 3)

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Order frequent items in frequency-descending order
3. Scan DB again, construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3
(Resulting FP-tree, root {}: main branch f:4 – c:3 – a:3 – m:2 – p:2, with side
branches b:1 under f, b:1 – m:1 under a, and a separate branch c:1 – b:1 – p:1;
each header-table entry links to the nodes carrying its item.)
34
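A compact, illustrative sketch of the two-scan FP-tree construction described above; the class and function names are assumed, not from the lecture.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(db, min_count):
    """Return (root, header) where header maps item -> list of tree nodes holding it."""
    freq = Counter(i for t in db for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}     # scan 1
    root, header = FPNode(None, None), {i: [] for i in freq}
    for t in db:                                                 # scan 2
        # keep only frequent items, ordered by descending frequency
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)
print({i: sum(n.count for n in header[i]) for i in header})   # f:4, c:4, a:3, b:3, m:3, p:3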
Benefits of the FP-tree Structure

 Completeness:
 never breaks a long pattern of any transaction

 preserves complete information for frequent pattern

mining
 Compactness
 reduce irrelevant information—infrequent items are gone

 frequency descending ordering: more frequent items are

more likely to be shared


 never larger than the original database (not counting
node-links and counts)


 Example: For Connect-4 DB, compression ratio could be

over 100
35
Chapter 5: Mining Association
Rules in Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 From association mining to correlation analysis
 Constraint-based association mining
 Sequential pattern mining

36
Multiple-Level Association Rules
(Concept hierarchy: food → milk, bread; milk → skim, 2% milk; bread → wheat,
white bread; brands such as Fraser and Sunset at the lowest level)
 Items often form a hierarchy.
 Items at the lower level are expected to have lower support.
 Rules regarding itemsets at appropriate levels could be quite useful.
 Transaction database can be encoded based on dimensions and levels.
 We can explore shared multi-level mining.

TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
37
A top_down, progressive deepening approach

 A top_down, progressive deepening approach:

 First find high-level strong rules:
milk ⇒ bread [20%, 60%].
 Then find their lower-level “weaker” rules:
2% milk ⇒ wheat bread [6%, 50%].

38
Multi-level Association: Uniform
Support vs. Reduced Support

 Uniform Support: the same minimum support for all levels


 + One minimum support threshold. No need to examine itemsets
containing any item whose ancestors do not have minimum
support.
 – Lower-level items do not occur as frequently. If the support
threshold is
 too high → we miss low-level associations
 too low → we generate too many high-level associations

 Reduced Support: reduced minimum support at lower levels


 There are 4 search strategies:

 Level-by-level independent
 Level-cross filtering by k-itemset
 Level-cross filtering by single item
 Controlled level-cross filtering by single item

39
Uniform Support

Multi-level mining with uniform support

Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
40
Reduced Support

Multi-level mining with reduced support

Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
41
Controlled level-cross filtering by single item

 Specify a level passage threshold for each level k
 in between the min_sup threshold of level k
 and that of level k+1
 min_sup_T(k+1) < LPT(k) < min_sup_T(k)
 Example:
 High level: milk
 min_sup_threshold = 5%
 Low level: 2% milk, skim milk
 min_sup_threshold = 3%
 level_passage_threshold = 4%
42
Multi-level Association: Redundancy
Filtering

 Some rules may be redundant due to “ancestor”


relationships between items.
 Example
 milk ⇒ wheat bread [support = 8%, confidence = 70%]
 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

 We say the first rule is an ancestor of the second rule.


 A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.

43
Multi-Level Mining: Progressive
Deepening
 A top-down, progressive deepening approach:
 First mine high-level frequent items:
milk (15%), bread (10%)
 Then mine their lower-level “weaker” frequent
itemsets:
2% milk (5%), wheat bread (4%)
 Different min_support threshold across multi-levels:
 If adopting the same min_support across multi-

levels
then toss t if any of t’s ancestors is infrequent.
 If adopting reduced min_support at lower levels
then examine only those descendants whose ancestor’s
support is frequent/non-negligible.

44
Progressive Refinement of Data
Mining Quality

 Why progressive refinement?


 Mining operator can be expensive or cheap, fine or
rough
 Trade speed with quality: step-by-step refinement.

45
Chapter 5: Mining Association
Rules in Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 From association mining to correlation analysis
 Constraint-based association mining
 Sequential pattern mining

46
Interestingness Measurements

 Objective measures
Two popular measurements:
 support; and

 confidence

 Subjective measures (Silberschatz & Tuzhilin,


KDD95)
A rule (pattern) is interesting if
 it is unexpected (surprising to the user);

and/or
 actionable (the user can do something with it)

47
Criticism to Support and Confidence

 Example 1: (Aggarwal & Yu, PODS98)


 Among 5000 students

3000 play basketball

3750 eat cereal

2000 both play basketball and eat cereal
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
because the overall percentage of students eating cereal is 75%,
which is higher than 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
48
Criticism to Support and Confidence
(Cont.)

 Example 2:
 X and Y: positively correlated
 X and Z: negatively related
 support and confidence of X => Z dominate
 We need a measure of dependent or correlated events:
corr(A,B) = P(A ∪ B) / (P(A) P(B))
 P(B|A)/P(B) is also called the lift of rule A => B

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Confidence
X=>Y    25%       50%
X=>Z    37.50%    75%
49
Other Interestingness Measures: Interest
 Interest (correlation, lift): P(A ∪ B) / (P(A) P(B))
 takes both P(A) and P(B) into consideration
 P(A^B) = P(A)*P(B), if A and B are independent events
 A and B negatively correlated if the value is less than 1;
otherwise A and B positively correlated

X   1 1 1 1 0 0 0 0        Itemset   Support   Interest
Y   1 1 0 0 0 0 0 0        X,Y       25%       2
Z   0 1 1 1 1 1 1 1        X,Z       37.50%    0.9
                           Y,Z       12.50%    0.57

50
Example
 Total transactions: 10,000
 Items: C: computer game, V: video game
 V: 7,500   C: 6,000   C and V: 4,000
 min_support: 0.3   min_conf: 0.50
 Consider the rule:
 buys(X, computer game) ⇒ buys(X, video game)
 Support = 4000/10000 = 0.4
 Confidence: P(C and V)/P(C) = 4000/6000 = 66%
 Strong, but
 the probability of buying a video game is 0.75; buying a
computer game reduces the probability of buying a video game
 from 0.75 to 0.66
 Computer games and video games are negatively correlated
51
lift
 Lift of A ⇒ B
 Lift: P(A and B)/(P(A)*P(B))
 P(A and B) = P(B|A)*P(A), hence
 Lift = P(B|A)/P(B)
 Ratio of the probability of buying A and B together to the
probability of buying A and B independently
 Or it can be interpreted as:
 the conditional probability of buying B given that A
is purchased, divided by the unconditional
probability of buying B
52
         C      not C    sum
V        4000   3500     7500
not V    2000   500      2500
sum      6000   4000     10000

Lift(C ⇒ V) = P(C and V)/(P(C)P(V)) = P(V|C)/P(V)
= 0.4/(0.6*0.75) = 0.89 < 1, so there is a negative correlation
between video games and computer games
53
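A tiny sketch of the lift computation for the computer-game/video-game example above (figures taken from the slides).

# lift(C => V) = P(C and V) / (P(C) * P(V)) for the 10,000-transaction example
total, c, v, c_and_v = 10_000, 6_000, 7_500, 4_000
p_c, p_v, p_cv = c / total, v / total, c_and_v / total

support    = p_cv                  # 0.40
confidence = p_cv / p_c            # 0.666...
lift       = p_cv / (p_c * p_v)    # 0.888... < 1 -> negative correlation
print(support, round(confidence, 3), round(lift, 3))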
Are All the Rules Found Interesting?
 “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading
 if 85% of customers buy milk
 Support and confidence are not good to represent correlations
 So many interestingness measures (Tan, Kumar, Srivastava @KDD’02):

lift = P(A ∪ B) / (P(A) P(B))
all_conf = sup(X) / max_item_sup(X)
coh = sup(X) / |universe(X)|

            Milk    No Milk   Sum (row)
Coffee      m, c    ~m, c     c
No Coffee   m, ~c   ~m, ~c    ~c
Sum(col.)   m       ~m        Σ

DB   m, c   ~m, c   m, ~c   ~m, ~c    lift   all-conf   coh    χ2
A1   1000   100     100     10,000    9.26   0.91       0.83   9055
A2   100    1000    1000    100,000   8.44   0.09       0.05   670
A3   1000   100     10000   100,000   9.18   0.09       0.09   8172
A4   1000   1000    1000    1000      1      0.5        0.33   0
54
All Confidence
 All confidence:
 all_conf(X) = sup(X) / max_i sup(Xi)
 X: (X1, X2, ..., Xk)
 For k = 2
 the rules are X1 ⇒ X2 and X2 ⇒ X1
 all_conf = sup(X1,X2) / max{sup(X1), sup(X2)}
 Here sup(X1,X2)/sup(X1) is the confidence of rule
 X1 ⇒ X2
 Similarly, sup(X1,X2)/sup(X2) is the confidence of rule
 X2 ⇒ X1
 Ex: all_conf = 0.4/max(0.6, 0.75) = 0.4/0.75 = 0.53
55
Cosine

 Cosine: P(A,B)/sqrt(P(A)*P(B))
 Similar to lift, but takes the square root of the denominator
 Both cosine and all_conf are null invariant
 Not affected by null transactions
 Ex:
 cosine = 0.4/sqrt(0.6*0.75) ≈ 0.60

56
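A short sketch computing all-confidence, cosine, and lift for the same example figures; as the slides state, the first two are null-invariant (adding transactions containing neither item leaves them unchanged).

import math

p_c, p_v, p_cv = 0.60, 0.75, 0.40    # P(computer game), P(video game), P(both)

all_conf = p_cv / max(p_c, p_v)           # 0.4 / 0.75 = 0.533
cosine   = p_cv / math.sqrt(p_c * p_v)    # 0.4 / sqrt(0.45) = 0.596
lift     = p_cv / (p_c * p_v)             # 0.889, for comparison
print(round(all_conf, 3), round(cosine, 3), round(lift, 3))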
Mining Highly Correlated Patterns

 lift and 2 are not good measures for correlations


in transactional DBs
 all-conf or cosine could be good measures
(Omiecinski @TKDE’03)
 Both all-conf and coherence have the downward
closure
sup( X )
all _ conf  DB m, c ~m, c m~c ~m~c lift all-conf coh 2
max_ item _ sup( X )
A1 1000 100 100 10,000 9.26 0.91 0.83 9055

sup( X ) A2 100 1000 1000 100,000 8.44 0.09 0.05 670


coh  A3 1000 100 10000 100,000 9.18 0.09 0.09 8172
| universe( X ) |
A4 1000 1000 1000 1000 1 0.5 0.33 0

57
Example
 m: containing milk, ~m: not containing milk
 c: containing coffee, ~c: not containing coffee
 null transactions with respect to milk and coffee:
 transactions containing neither milk nor coffee
 vary the number of null transactions and see that
 lift and chi-square are affected by the presence of null transactions
 but all_confidence and cosine
 are not affected – they are null-invariant

Dataset   mc     ~mc    m~c    ~m~c     all_conf.   cosine   lift    χ2
A1        1000   100    100    100000   0.91        0.91     83.64   83452.6
A2        1000   100    100    10000    0.91        0.91     9.36    9055.7
A3        1000   100    100    1000     0.91        0.91     1.82    1472.7
A4        1000   100    100    0        0.91        0.91     0.99    9.9
B1        1000   1000   1000   1000     0.5         0.5      1       0
C1        100    1000   1000   100000   0.09        0.09     8.44    670
C2        1000   100    10000  100000   0.09        0.29     9.18    8172.8
C3        1      1      100    10000    0.1         0.07     50      48.5
59
Chapter 5: Mining Association
Rules in Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 From association mining to correlation analysis
 Constraint-based association mining
 Sequential pattern mining

60
Constraint-based (Query-Directed) Mining

 Finding all the patterns in a database autonomously? —


unrealistic!
 The patterns could be too many but not focused!

 Data mining should be an interactive process


 User directs what to be mined using a data mining

query language (or a graphical user interface)


 Constraint-based mining
 User flexibility: provides constraints on what to be

mined
 System optimization: explores such constraints for

efficient mining—constraint-based mining

61
Constraints in Data Mining

 Knowledge type constraint:


 classification, association, etc.

 Data constraint: using SQL-like queries


 find product pairs sold together in stores in Chicago in

Dec.’2020
 Dimension/level constraint:
 in relevance to region, price, brand, customer category

 Rule (or pattern) constraint :


 small sales (price < $10) triggers big sales (sum >

$200)
 Interestingness constraint:
 strong rules: min_support ≥ 3%, min_confidence ≥
60%

62
Example

 bread ⇒ milk
 milk ⇒ butter
 Strong rules, but the items are not that valuable
 TV ⇒ VCD player
 Support of this rule may be lower than

previous rules but value of items (revenue or


profit) are much higher
 This rule may be more valuable

63
Aggregation functions

 well known aggregation functions


 are used to express constraints
 sum, average, min, max,
 others:
 range = max – min
 Example:
 sum(price) <= 100 for capturing budget
customers
 average(price) >= 50 for big customers
Anti-Monotonicity in Constraint Pushing
TDB (min_sup=2)
 Anti-monotonicity
 When an itemset S violates
the constraint, so does any of its
supersets
 sum(S.Price) ≤ v is anti-
monotone
 sum(S.Price) ≥ v is not anti-
monotone

TID   Transaction            Item   Price
10    a, b, c, d, f          a      100
20    b, c, d, f, g, h       b      2
30    a, c, d, e, f          c      60
40    c, e, f, g             d      20
                             e      45
                             f      55
                             g      50
                             h      25
65
Anti-Monotonicity in Constraint Pushing
TDB (min_sup=2)
 Example. C: range(S.profit) ≤ 15 is
anti-monotone
 Itemset ab violates C
 So does every superset of ab

TID   Transaction            Item   Profit
10    a, b, c, d, f          a      40
20    b, c, d, f, g, h       b      0
30    a, c, d, e, f          c      -20
40    c, e, f, g             d      10
                             e      -30
                             f      30
                             g      20
                             h      -10
66
Example Apriori
 The Apriori principle states that
 all non-empty subsets of a frequent itemset
must also be frequent


 Note that:
 If a given itemset does not satisfy minimum

support
 None of its supersets can

 Other examples of anti-monotone constraints:


 Min(l.price) >= 500

 Count(l) < 10

 Average(l.price) < 10 : not anti-monotone


67
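An illustrative sketch (assumed helper names, not the lecture's code) of pushing an anti-monotone constraint such as sum(S.price) <= 100 into the candidate step: any candidate that already violates the constraint is discarded before the database scan, and by anti-monotonicity none of its supersets ever needs to be generated.

# Illustrative price table (from the TDB slide) and constraint sum(S.price) <= 100
price = {"a": 100, "b": 2, "c": 60, "d": 20, "e": 45, "f": 55, "g": 50, "h": 25}

def satisfies_sum_le(itemset, limit=100):
    return sum(price[i] for i in itemset) <= limit

def prune_with_constraint(candidates):
    # Drop candidates violating the anti-monotone constraint before counting;
    # every superset of a violator would violate the constraint as well.
    return {c for c in candidates if satisfies_sum_le(c)}

C2 = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "f")}
print(prune_with_constraint(C2))
# ('a','b') and ('a','c') are dropped; ('b','c'), ('c','d'), ('e','f') survive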
Monotonicity for Constraint Pushing
TDB (min_sup=2)
 Monotonicity
 When an itemset S satisfies
the constraint, so does any of
its supersets
 sum(S.Price) ≥ v is monotone
 min(S.Price) ≤ v is monotone

TID   Transaction            Item   Price
10    a, b, c, d, f          a      100
20    b, c, d, f, g, h       b      2
30    a, c, d, e, f          c      60
40    c, e, f, g             d      20
                             e      45
                             f      55
                             g      50
                             h      25
68
Monotonicity for Constraint Pushing
TDB (min_sup=2)
 Example. C: range(S.profit) ≥ 15
 Itemset ab satisfies C
 So does every superset of ab

TID   Transaction            Item   Profit
10    a, b, c, d, f          a      40
20    b, c, d, f, g, h       b      0
30    a, c, d, e, f          c      -20
40    c, e, f, g             d      10
                             e      -30
                             f      30
                             g      20
                             h      -10
69
The Apriori Algorithm — Example
(Same run of Apriori as in the earlier example, slide 16: database D = {100: 1 3 4;
200: 2 3 5; 300: 1 2 3 5; 400: 2 5}, min support count 2, producing
L1 = {1},{2},{3},{5}; L2 = {1 3},{2 3},{2 5},{3 5}; L3 = {2 3 5}.)
70
Naïve Algorithm: Apriori + Constraint
(Same Apriori run as above; the naïve approach first mines all frequent itemsets
and only afterwards keeps those satisfying the constraint.)
Constraint: sum(S.price) < 5
71
The Constrained Apriori Algorithm: Push
an Anti-monotone Constraint Deep
(Same database; the anti-monotone constraint is checked during candidate
generation, so itemsets violating it, and therefore all their supersets, are
pruned before any further database scan.)
Constraint: sum(S.price) < 5
72
The Constrained Apriori Algorithm: Push
Another Constraint Deep
(Same database; here the constraint min(S.price) <= 1 is pushed into the
mining process.)
Constraint: min(S.price) <= 1
73
Chapter 5: Mining Association Rules in
Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 From association mining to correlation analysis
 Constraint-based association mining
 Sequential pattern mining

74
Sequence Databases and Sequential
Pattern Analysis

 Transaction databases, time-series databases vs. sequence


databases
 Frequent patterns vs. (frequent) sequential patterns
 Applications of sequential pattern mining
 Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital camera,
within 3 months.
 Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets, etc.
 Telephone calling patterns, Weblog click streams
 DNA sequences and gene structures

75
What Is Sequential Pattern Mining?

 Given a set of sequences, find the complete set


of frequent subsequences

A sequence: < (ef) (ab) (df) c b >

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
76
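A small illustrative sketch of the subsequence test used above, with each sequence written as a list of item sets (elements); the function name is assumed.

def is_subsequence(sub, seq):
    """True if `sub` is a subsequence of `seq`; both are lists of sets (elements).
    Each element of `sub` must be contained in some later element of `seq`,
    preserving order (greedy earliest match)."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                         # <a(bc)dc>
print(is_subsequence(sub, seq))   # True, as stated on the slide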
Challenges on Sequential Pattern Mining

 A huge number of possible sequential patterns are


hidden in databases
 A mining algorithm should
 find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
 be highly efficient, scalable, involving only a small
number of database scans
 be able to incorporate various kinds of user-specific
constraints

77
Studies on Sequential Pattern Mining

 Concept introduction and an initial Apriori-like algorithm


 R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95

 GSP—An Apriori-based, influential mining method (developed at IBM


Almaden)
 R. Srikant & R. Agrawal. “Mining sequential patterns:

Generalizations and performance improvements,” EDBT’96


 From sequential patterns to episodes (Apriori-like + constraints)
 H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent

episodes in event sequences,” Data Mining and Knowledge


Discovery, 1997
 Mining sequential patterns with constraints
 M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential
Pattern Mining with Regular Expression Constraints. VLDB 1999

78
Sequential pattern mining: Cases and
Parameters
 Duration of a time sequence T
 Sequential pattern mining can then be confined to the

data within a specified duration


 Ex. Subsequence corresponding to the year of 1999

 Ex. Partitioned sequences, such as every year, or every

week after stock crashes, or every two weeks before


and after a volcano eruption
 Event folding window w
 If w = T, time-insensitive frequent patterns are found

 If w = 0 (no event sequence folding), sequential

patterns are found where each event occurs at a distinct


time instant
 If 0 < w < T, sequences occurring within the same

period w are folded in the analysis


79
Example

 When the event folding window is 5 minutes,
 purchases within 5 minutes are considered to be
taken together

80
Sequential pattern mining: Cases and Parameters (2)

 Time interval, int, between events in the discovered


pattern
 int = 0: no interval gap is allowed, i.e., only strictly

consecutive sequences are found


 Ex. “Find frequent patterns occurring in consecutive weeks”
 min_int ≤ int ≤ max_int: find patterns that are
separated by at least min_int but at most max_int
 Ex. “If a person rents movie A, it is likely she will rent movie
B within 30 days” (int ≤ 30)
 int = c ≠ 0: find patterns carrying an exact interval
 Ex. “Every time when Dow Jones drops more than 5%, what
will happen exactly two days later?” (int = 2)

81
A Basic Property of Sequential Patterns: Apriori

 A basic property: Apriori (Agrawal & Srikant ’94)
 If a sequence S is not frequent
 then none of the super-sequences of S is frequent
 E.g., <hb> is infrequent → so are <hab> and <(ah)b>

Seq. ID Sequence Given support threshold


10 <(bd)cb(ac)> min_sup =2
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>

82
GSP—A Generalized Sequential Pattern Mining Algorithm

 GSP (Generalized Sequential Pattern) mining algorithm


 proposed by Agrawal and Srikant, EDBT’96
 Outline of the method
 Initially, every item in DB is a candidate of length-1
 for each level (i.e., sequences of length-k) do
 scan database to collect support count for each

candidate sequence
 generate candidate length-(k+1) sequences from

length-k frequent sequences using Apriori


 repeat until no frequent sequence or no candidate
can be found
 Major strength: Candidate pruning by Apriori

83
Finding Length-1 Sequential Patterns

 Examine GSP using an example


 Initial candidates: all singleton sequences
 <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
 Scan database once, count support for candidates

min_sup = 2

Cand   Sup        Seq. ID   Sequence
<a>    3          10        <(bd)cb(ac)>
<b>    5          20        <(bf)(ce)b(fg)>
<c>    4          30        <(ah)(bf)abf>
<d>    3          40        <(be)(ce)d>
<e>    3          50        <a(bd)bcb(ade)>
<f>    2
<g>    1
<h>    1
84
Generating Length-2 Candidates

51 length-2 candidates:

Candidates of the form <xy> (two elements), 36 in total:
     <a>    <b>    <c>    <d>    <e>    <f>
<a>  <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>  <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>  <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>  <da>   <db>   <dc>   <dd>   <de>   <df>
<e>  <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>  <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates of the form <(xy)> (one two-item element), 15 in total:
     <a>    <b>     <c>     <d>     <e>     <f>
<a>         <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                 <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                         <(cd)>  <(ce)>  <(cf)>
<d>                                 <(de)>  <(df)>
<e>                                         <(ef)>
<f>

Without the Apriori property, 8*8 + 8*7/2 = 92 candidates.
Apriori prunes 44.57% of the candidates.
85
Generating Length-3 Candidates and Finding Length-3 Patterns

 Generate Length-3 Candidates


 Self-join length-2 sequential patterns
 Based on the Apriori property

 <ab>, <aa> and <ba> are all length-2 sequential
patterns → <aba> is a length-3 candidate
 <(bd)>, <bb> and <db> are all length-2 sequential
patterns → <(bd)b> is a length-3 candidate
 46 candidates are generated
 Find Length-3 Sequential Patterns
 Scan database once more, collect support counts for
candidates
 19 out of 46 candidates pass support threshold
87
The GSP Mining Process

1st scan: 8 cand., 6 length-1 seq. pat.   <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
          <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
          <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat., cand. not in DB at all
          <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 seq. pat., cand. cannot pass sup. threshold
          <(bd)cba>

min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
88
 Definition: c is a contiguous subsequence of a
sequence s = <s1, s2, ..., sn> if
 c is derived from s by dropping an item from s1 or sn, or
 c is derived from s by dropping an item from an element si which
has at least 2 items, or
 c’ is a contiguous subsequence of c, and c is a
contiguous subsequence of s
 Ex: s = <(1,2), (3,4), 5, 6>
 <2, (3,4), 5>, <(1,2), 3, 5, 6>, <3, 5> are contiguous subsequences, but
 <(1,2), (3,4), 6>, <1, 5, 6> are not
89
Candidate generation
 Step 1: Join Step. Lk-1 is joined with Lk-1 to give Ck
 s1 and s2 are joined if dropping the first item of s1
and dropping the last item of s2 give the same sequence
 s1 is extended by adding the last item of s2
 Step 2: Prune Step. Delete candidate sequences
having a contiguous (k-1)-subsequence whose
support count is less than the min_support count

90
 L3            C4              L4
 {(1,2),3}     {(1,2),(3,4)}   {(1,2),(3,4)}
 {(1,2),4}     {(1,2),3,5}
 {1,(3,4)}
 {(1,3),5}
 {2,(3,4)}
 {2,3,5}
 {(1,2),3} joined with {2,(3,4)} gives {(1,2),(3,4)}
 {(1,2),3} joined with {2,3,5} gives {(1,2),3,5}
 {(1,2),3,5} is dropped since its contiguous 3-subsequence
 {1,3,5} is not in L3
91
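A rough illustrative sketch of the GSP join on the L3 → C4 example above; sequences are written as tuples of tuples, all helper names are assumed, and only the join of two (k-1)-sequences is shown (the k = 2 special case and the prune step are omitted).

def flatten(seq):
    """List of (item, element_index) pairs for a sequence given as a tuple of tuples."""
    return [(item, idx) for idx, el in enumerate(seq) for item in el]

def drop_first(seq):
    items = flatten(seq)[1:]
    return tuple(tuple(i for i, e in items if e == idx) for idx in sorted({e for _, e in items}))

def drop_last(seq):
    items = flatten(seq)[:-1]
    return tuple(tuple(i for i, e in items if e == idx) for idx in sorted({e for _, e in items}))

def join(s1, s2):
    """GSP join: s1 without its first item must equal s2 without its last item.
    The last item of s2 is appended to s1, merged into the final element if it
    shared an element in s2, otherwise added as a new single-item element."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = flatten(s2)[-1][0]
    if len(s2[-1]) > 1:
        return s1[:-1] + (tuple(s1[-1]) + (last_item,),)
    return s1 + ((last_item,),)

s1 = ((1, 2), (3,))    # {(1,2), 3}
s2 = ((2,), (3, 4))    # {2, (3,4)}
print(join(s1, s2))    # ((1, 2), (3, 4)), i.e. {(1,2), (3,4)} as on the slide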
Bottlenecks of GSP
 A huge set of candidates could be generated
 1,000 frequent length-1 sequences generate
 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
 Multiple scans of database in mining
 Real challenge: mining long sequential patterns
 An exponential number of short candidates
 A length-100 sequential pattern needs about 10^30
candidate sequences, since
 Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30
93
