582364 Data mining, 4 cu

Lecture 4: Finding frequent itemsets - concepts and algorithms
Spring 2010
Lecturer: Juho Rousu
Teaching assistant: Taru Itäpelto
(Slides adapted from Tan, Steinbach & Kumar)
Itemset lattice

- Itemsets that can be constructed from a set of items have a partial order with respect to the subset operator
  - i.e. a set is larger than its proper subsets
- This induces a lattice where nodes correspond to itemsets and arcs correspond to the subset relation
- The lattice is called the itemset lattice
- For d items, the size of the lattice is 2^d
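As a minimal illustration (not from the slides), the lattice nodes over a small item universe can be enumerated in Python; for d = 4 items the node count is 2^4 = 16:

    from itertools import combinations

    def itemset_lattice(items):
        """Enumerate every itemset (lattice node) over the given items."""
        lattice = []
        for k in range(len(items) + 1):
            lattice.extend(frozenset(c) for c in combinations(items, k))
        return lattice

    nodes = itemset_lattice(['a', 'b', 'c', 'd'])
    print(len(nodes))  # 16 = 2^4, counting the empty set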

Frequent itemsets on the itemset lattice

- The Apriori principle is illustrated on the itemset lattice
- The subsets of a frequent itemset are frequent
  - they span a sublattice of the original lattice (the grey area)
Frequent itemsets on the itemset lattice

- Conversely, the supersets of an infrequent itemset are infrequent
  - they also span a sublattice of the original lattice (the crossed-out nodes)
- If we know that {a,b} is infrequent, we never need to check any of its supersets
- This fact is used in support-based pruning, sketched below
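A minimal sketch of support-based pruning during candidate generation (illustrative, not the slides' code): a size-k candidate survives only if every one of its (k-1)-subsets was found frequent.

    from itertools import combinations

    def prune_candidates(candidates, frequent_prev):
        """Keep a size-k candidate only if all its (k-1)-subsets are frequent.

        candidates:    iterable of frozensets of size k
        frequent_prev: set of frozensets of size k-1 found frequent earlier
        """
        return [c for c in candidates
                if all(frozenset(s) in frequent_prev
                       for s in combinations(c, len(c) - 1))]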
Compact Representation of Frequent Itemsets

- In practice, the number of frequent itemsets produced from transaction data can be very large
  - when the database is dense, i.e. there are many items per transaction on average
  - when the number of transactions is high
  - when the minimum support level is set too low
- We will look at methods that
  - use the properties of the itemset lattice and the support function...
  - to compress the collection of frequent itemsets into a more manageable size...
  - so that all frequent itemsets can be derived from the compressed representation
Maximal Frequent Itemsets

- The minimum support threshold induces a partition of the itemset lattice into frequent and infrequent itemsets (grey nodes)
- Frequent itemsets that cannot be extended with any item without making them infrequent are called maximal frequent itemsets
- We can derive all frequent itemsets from the set of maximal frequent itemsets
  - by using the Apriori principle backwards: every subset of a maximal frequent itemset is frequent (see the sketch below)
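A minimal sketch of this derivation (illustrative; the example itemsets {A,C,E} and {A,D} are taken from the next slide):

    from itertools import combinations

    def frequent_from_maximal(maximal_sets):
        """Every nonempty subset of a maximal frequent itemset is frequent."""
        frequent = set()
        for m in maximal_sets:
            for k in range(1, len(m) + 1):
                frequent.update(frozenset(c) for c in combinations(m, k))
        return frequent

    print(sorted(map(sorted, frequent_from_maximal([frozenset('ACE'), frozenset('AD')]))))
    # 9 itemsets: ['A'], ['A','C'], ['A','C','E'], ['A','D'], ['A','E'], ['C'], ['C','E'], ['D'], ['E']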
Maximal Frequent Itemsets

- {A,C} is not maximal, as it can be extended to the frequent itemset {A,C,E}, although its supersets {A,B,C} and {A,C,D} are infrequent
- {A,D} is maximal, as all its immediate supersets {A,B,D}, {A,C,D} and {A,D,E} are infrequent
- {B,D} is not maximal, as it can be extended to the frequent itemsets {B,C,D} and {B,D,E}
Maximal frequent itemsets

- The number of maximal frequent itemsets is typically considerably smaller than the number of all frequent itemsets
- In the worst case, the number can still be exponential in the number of items:
  - e.g. consider the case where all itemsets of size d/2 are frequent and no itemset of size d/2+1 is frequent: then every one of the C(d, d/2) itemsets of size d/2 is maximal
- Efficient algorithms are therefore still needed
[Figure: itemset lattice with the border separating the frequent itemsets from the infrequent ones]
Maximal frequent itemsets

- Exact support counts of the subsets cannot be directly derived from the support of a maximal frequent itemset
  - from the Apriori principle we only know that the subsets must be frequent, but not how frequent
- We therefore still need to do support counting for the subsets of a maximal frequent itemset in order to create association rules
Closed itemsets

- An alternative approach is to try to retain some of the support information in the compacted representation
- A closed itemset is an itemset none of whose immediate supersets has the same support count
- A closed frequent itemset is a closed itemset that satisfies the minimum support threshold
- Maximal frequent itemsets are closed by definition
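A minimal sketch of the closedness test, assuming a support-count function sigma over itemsets is available (a hypothetical helper, not from the slides):

    def is_closed(itemset, items, sigma):
        """Closed: no immediate superset has the same support count.

        Since a superset can never have higher support (Apriori principle),
        'different support' here means strictly smaller support.
        itemset: frozenset of items
        items:   the full item universe
        sigma:   support-count function over itemsets (assumed available)
        """
        return all(sigma(itemset | {i}) < sigma(itemset)
                   for i in items if i not in itemset)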
Example: Closed frequent itemsets

- Assume a minimum support threshold of 40%
- {b} is frequent: σ({b}) = 3, but not closed: σ({b}) = σ({b,c}) = 3
- {b,c} is frequent: σ({b,c}) = 3, and closed: σ({a,b,c}) = 2, σ({b,c,d}) = 1, σ({b,c,e}) = 1
- {b,c,d} is not frequent: σ({b,c,d}) = 1, and not closed: σ({a,b,c,d}) = 1
Maximal vs Closed Itemsets

[Figure: itemset lattice annotated with the transaction ids supporting each itemset; some itemsets are not supported by any transactions]
Maximal vs Closed Frequent Itemsets

- Minimum support = 2
- # Closed = 9
- # Maximal = 4

[Figure: lattice marking the itemsets that are closed and maximal versus closed but not maximal]
Maximal vs Closed Itemsets

[Figure: the maximal frequent itemsets are a subset of the closed frequent itemsets, which in turn are a subset of all frequent itemsets]
Determining the support of non-closed frequent itemsets

- Consider a non-closed frequent itemset {a,d}
  - assume we have not stored its support count
- By definition, there must be at least one immediate superset that has the same support count
  - it must be that σ({a,d}) = σ(X) for some immediate superset X of {a,d}
Determining the support of non-closed frequent itemsets

- From the Apriori principle we know that no superset can have higher support than {a,d}
- Hence the support must equal the support of the most frequent immediate superset:
  σ({a,d}) = max(σ({a,b,d}), σ({a,c,d}), σ({a,d,e}))
Determining the support of non-closed frequent itemsets

- Algorithm sketch:
  1. kmax = size of the largest closed frequent itemset
  2. F_kmax = closed frequent itemsets of size kmax
  3. for k = kmax-1 downto 1 do
  4.   F_k = {f | f is an immediate subset of some f' in F_{k+1}, or f is closed with |f| = k}
  5.   for every f in F_k do
  6.     if f is not closed
  7.       f.support = max(f'.support | f' in F_{k+1}, f' is a superset of f)
  8.     endif
  9.   endfor
  10. endfor
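A runnable Python sketch of the same bottom-up pass (illustrative; it assumes the closed frequent itemsets are given as a dict from frozenset to support count):

    from itertools import combinations

    def supports_from_closed(closed):
        """Derive the support of every frequent itemset from the closed ones.

        closed: dict mapping each closed frequent itemset (frozenset) to its support
        """
        support = dict(closed)
        kmax = max(len(s) for s in closed)
        level = {s for s in closed if len(s) == kmax}            # F_kmax
        for k in range(kmax - 1, 0, -1):
            # F_k = immediate subsets of F_{k+1} plus closed itemsets of size k
            level_k = {frozenset(c) for f in level for c in combinations(f, k)}
            level_k |= {s for s in closed if len(s) == k}
            for f in level_k:
                if f not in closed:   # non-closed: take the most frequent superset
                    support[f] = max(support[g] for g in level if f < g)
            level = level_k
        return support

    closed = {frozenset('bc'): 3, frozenset('abc'): 2}           # toy input
    print(supports_from_closed(closed)[frozenset('b')])          # 3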
Characteristics of Apriori algorithm

- Breadth-first search algorithm:
  - all frequent itemsets of a given size are kept in the algorithm's processing queue
- General-to-specific search:
  - start with itemsets of large support, work towards the lower-support region
- Generate-and-test strategy:
  - generate candidates, test them by support counting
[Figure: itemset lattice organized into levels 0-5; Apriori traverses it one level at a time]
Weaknesses of Apriori

- Apriori was one of the first algorithms to successfully tackle the exponential size of the frequent itemset space
- Nevertheless, Apriori suffers from two main weaknesses:
  - high I/O overhead from the generate-and-test strategy: several passes over the database are required to find the frequent itemsets
  - performance can degrade significantly on dense databases, as a large portion of the itemset lattice becomes frequent
Alternative methods for generating frequent itemsets: Traversal of the itemset lattice

- Apriori uses general-to-specific search: start from the most highly supported itemsets, work towards the lower-support region
- This works well if the frequent itemset border is close to the top of the lattice
Alternative methods for generating frequent itemsets: Traversal of the itemset lattice

- Specific-to-general search: look first for the most specific frequent itemsets, work towards the higher-support region
- This works well if the border is close to the bottom of the lattice, e.g. for dense databases
Alternative Methods for Frequent Itemset Generation: Breadth-first vs Depth-first

- Apriori traverses the itemset lattice in a breadth-first manner
- Alternatively, the lattice can be searched in a depth-first manner: extend a single itemset until it cannot be extended any further
  - often used to find maximal frequent itemsets
  - hits the border of frequent itemsets quickly
Alternative Methods for Frequent Itemset Generation: Breadth-first vs Depth-first

- Depth-first search allows a different kind of pruning of the search space
- Example: if {b,c,d,e} is found to be maximal frequent by the search algorithm, the region of the lattice consisting of the subsets of {b,c,d,e} does not need to be traversed
  - those itemsets are known to be frequent but non-maximal (see the sketch below)
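A depth-first sketch of this pruning (illustrative; is_frequent stands in for a support-counting oracle and is an assumption, not the slides' code):

    from itertools import combinations

    def dfs_maximal(prefix, remaining, is_frequent, maximal):
        """Depth-first search for maximal frequent itemsets.

        prefix:      frozenset, the current frequent itemset being extended
        remaining:   list of items that may still be appended (fixed order)
        is_frequent: predicate on itemsets (assumed support oracle)
        maximal:     list collecting the maximal frequent itemsets found
        """
        # Pruning: if prefix plus all remaining items lies inside a known
        # maximal frequent itemset, everything in this branch is frequent
        # but non-maximal, so the whole branch can be skipped.
        if any(prefix | set(remaining) <= m for m in maximal):
            return
        extended = False
        for i, item in enumerate(remaining):
            candidate = prefix | {item}
            if is_frequent(candidate):
                extended = True
                dfs_maximal(candidate, remaining[i + 1:], is_frequent, maximal)
        if not extended and not any(prefix < m for m in maximal):
            maximal.append(prefix)

    # Toy run: all subsets of {b,c,d,e} are frequent, nothing containing a is
    freq = {frozenset(c) for k in range(1, 5) for c in combinations('bcde', k)}
    found = []
    dfs_maximal(frozenset(), list('abcde'), lambda s: s in freq, found)
    print(found)  # [frozenset({'b', 'c', 'd', 'e'})] (element order may vary)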
Alternative methods for generating frequent itemsets: Equivalence classes

- Many search algorithms can be seen to conceptually partition the itemset lattice into equivalence classes
  - the itemsets in one equivalence class are processed before moving on to the next
- There are several ways of defining the equivalence classes:
  - levels defined by itemset size (used by Apriori)
  - prefix labels: two itemsets that share a prefix of length k belong to the same class, e.g. {a,c,d} and {a,c,e} if k <= 2
  - suffix labels: two itemsets that share a suffix of length k belong to the same class (a grouping sketch follows)
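A small sketch (not from the slides) of grouping itemsets into prefix-based equivalence classes, with each itemset represented as a sorted tuple:

    from collections import defaultdict

    def prefix_classes(itemsets, k):
        """Group itemsets by their length-k prefix."""
        classes = defaultdict(list)
        for s in itemsets:
            t = tuple(sorted(s))
            classes[t[:k]].append(t)
        return dict(classes)

    print(prefix_classes([{'a', 'c', 'd'}, {'a', 'c', 'e'}, {'b', 'c'}], 2))
    # {('a', 'c'): [('a', 'c', 'd'), ('a', 'c', 'e')], ('b', 'c'): [('b', 'c')]}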
Prefix and suffix trees

- Left: a prefix tree and the equivalence classes defined by prefixes of length k=1
- Right: a suffix tree and the equivalence classes defined by suffixes of length k=1
FP-growth algorithm

- FP-growth avoids Apriori's repeated scans of the database by working on a compressed representation of the transaction database, stored in a data structure called the FP-tree
- Once the FP-tree has been constructed, FP-growth uses a recursive divide-and-conquer approach to mine the frequent itemsets
FP-tree

- The FP-tree is a compressed representation of the transaction database
- Each transaction is mapped onto a path in the tree
- Each node contains an item and a support count equal to the number of transactions whose prefix corresponds to the path from the root to that node
- Nodes with the same item label are cross-linked: this helps in finding the frequent itemsets ending with a particular item (see the construction sketch below)
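A minimal FP-tree construction sketch (illustrative; real implementations also order items by decreasing frequency and keep a header table of cross-links, which the dict below stands in for):

    class FPNode:
        def __init__(self, item, parent=None):
            self.item = item          # item label (None for the root)
            self.count = 0            # number of transactions through this node
            self.parent = parent
            self.children = {}        # item -> FPNode

    def build_fp_tree(transactions):
        """Insert each transaction as a root path, sharing common prefixes."""
        root = FPNode(None)
        header = {}                   # item -> list of nodes (the cross-links)
        for t in transactions:
            node = root
            for item in sorted(t):    # fixed item order; by frequency in practice
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header.setdefault(item, []).append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

    root, header = build_fp_tree([{'A', 'B'}, {'B', 'C', 'D'}])
    print(root.children['A'].count, root.children['B'].count)   # 1 1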
FP-tree construction

[Figure: after reading TID=1 the tree is null -> A:1 -> B:1; after reading TID=2 a second branch null -> B:1 -> C:1 -> D:1 has been added]
FP-Tree Construction

[Figure: the example transaction database and the complete FP-tree, with root null and node counts A:7, B:5, B:3, C:3, C:1, D:1, E:1, ...; a header table of pointers to same-item nodes is used to assist frequent itemset generation]
FP-Tree vs. original database

- If the transactions share a significant number of items, the FP-tree can be considerably smaller than the original database, as transactions with common items are likely to share paths
- There is a storage overhead from the links as well as from the support counts, so in the worst case the FP-tree may even be larger than the original database
Frequent itemset generation in FP-growth

- FP-growth uses a divide-and-conquer approach to find frequent itemsets
- It searches for frequent itemsets ending with item E first, then for itemsets ending with D, C, B, and A
  - i.e. it uses equivalence classes based on length-1 suffixes
- The paths corresponding to the different suffixes are extracted from the FP-tree
Frequent itemset generation in FP-growth

[Figure: the complete FP-tree, with the paths ending at each item highlighted for suffix-based processing]
Frequent itemset generation in FP-growth

- To find all frequent itemsets ending with a given last item (e.g. E), we first need to compute the support of that item
- This is given by the sum of the support counts of all nodes labeled with the item (σ(E) = 3)
  - the nodes are found by following the cross-links connecting nodes with the same item
- If the last item is frequent, FP-growth next iteratively looks for all frequent itemsets ending with a given length-2 suffix (DE, CE, BE and AE),
- and recursively with length-3 suffixes, length-4 suffixes, ... until no more frequent itemsets are found
- A conditional FP-tree is constructed for each suffix to speed up the computation (a sketch follows)
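Continuing the construction sketch above, the support of a suffix item and its prefix paths (the input for building its conditional FP-tree) can be read off the header table's cross-links (illustrative; assumes the build_fp_tree helper defined earlier):

    def item_support(header, item):
        """Support of an item = sum of the counts of all nodes labeled with it."""
        return sum(node.count for node in header.get(item, []))

    def prefix_paths(header, item):
        """Extract the paths ending at the item, with the ending node's count."""
        paths = []
        for node in header.get(item, []):
            path, cur = [], node.parent
            while cur is not None and cur.item is not None:
                path.append(cur.item)
                cur = cur.parent
            paths.append((list(reversed(path)), node.count))
        return paths

    print(item_support(header, 'D'))   # 1 in the toy two-transaction tree
    print(prefix_paths(header, 'D'))   # [(['B', 'C'], 1)]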
