UNIT-III
Association Analysis
Association analysis is useful for discovering interesting relationships hidden in large data sets. The
uncovered relationships can be represented in the form of association rules or sets of frequent
items.
For example, given a table of market basket transactions
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
a rule such as {Diapers} → {Beer} can be extracted, suggesting a strong relationship between the sale of diapers and beer. Besides market basket data, association analysis is also applicable to other application domains such as
bioinformatics, medical diagnosis, Web mining, and scientific data analysis. In the analysis of Earth
science data, for example, the association patterns may reveal interesting connections among the ocean,
land, and atmospheric processes. Such information may help Earth scientists develop a better
understanding of how the different elements of the Earth system interact with each other. Even though
the techniques presented here are generally applicable to a wider variety of data sets, for illustrative purposes the discussion here focuses mainly on market basket data. There are two key
issues that need to be addressed when applying association analysis to market basket data. First,
discovering patterns from a large transaction data set can be computationally expensive. Second, some
of the discovered patterns are potentially spurious because they may happen simply by chance.
Association Rule An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. A common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup
threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent
itemsets found in the previous step. These rules are called strong rules.
The computational requirements for frequent itemset generation are generally more expensive
than those of rule generation.
Definitions
Support Count
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|
where
• I = {i1, i2, …, id} is the set of all items
• T = {t1, t2, …, tN} is the set of all transactions
• each transaction ti contains a subset of the items in I
• X is an itemset, i.e., a subset of I
Association Rule
An association rule is an implication expression of the form X → Y, where X and Y are disjoint
itemsets (X ∩ Y = ∅).
The strength of an association rule can be measured in terms of its support and confidence. A rule
that has very low support may occur simply by chance. Confidence measures the reliability of the
inference made by a rule.
Support of an association rule X → Y
s(X → Y) = σ(X ∪ Y) / N
o σ(X ∪ Y) is the support count of the itemset X ∪ Y
o N is the number of transactions in T
Confidence of an association rule X → Y
conf(X → Y) = σ(X ∪ Y) / σ(X)
o σ(X) is the support count of X
o σ(X ∪ Y) is the support count of X ∪ Y
Interest of an association rule X → Y
I(X → Y) = P(X, Y) / (P(X) × P(Y))
o P(Y) = s(Y) is the support of Y (fraction of baskets that contain Y)
o If the interest of a rule is close to 1, then it is uninteresting:
I(X → Y) = 1 → X and Y are independent
I(X → Y) > 1 → X and Y are positively correlated
I(X → Y) < 1 → X and Y are negatively correlated
For example, given a table of market basket transactions:
TID Items
1 {Bread, Milk}
2 {Bread, Diaper, Beer, Eggs}
3 {Milk, Diaper, Beer, Coke}
4 {Bread, Milk, Diaper, Beer}
5 {Bread, Milk, Diaper, Coke}
We can conclude that
s({Milk, Diaper} → {Beer}) = 2/5 = 0.4
conf({Milk, Diaper} → {Beer}) = 2/3 ≈ 0.67
I({Milk, Diaper} → {Beer}) = (2/5) / ((3/5) × (3/5)) = 10/9 ≈ 1.11
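These numbers can be reproduced with a short Python sketch; the transaction list and helper function below are illustrative.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support_count(itemset):
    """sigma(X): number of transactions that contain every item in X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = support_count(X | Y) / N                                       # 2/5 = 0.4
confidence = support_count(X | Y) / support_count(X)                     # 2/3 ≈ 0.67
interest = support / ((support_count(X) / N) * (support_count(Y) / N))   # 10/9 ≈ 1.11
print(support, confidence, interest)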
Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. However, the cost of
frequent itemset generation is large: given d items, there are 2^d possible candidate itemsets. There are
several ways to reduce the computational complexity of frequent itemset generation:
1. Reduce the number of candidate itemsets (M): the Apriori principle.
2. Reduce the number of comparisons while counting supports: by using more advanced data structures, we
can reduce the number of comparisons for matching each candidate itemset against every transaction.
Factors Affecting Complexity:
• Choice of minimum support threshold: lowering the support threshold results in more
frequent itemsets. This may increase the number of candidates and the maximum length of frequent
itemsets.
• Dimensionality (number of items) of the data set: more space is needed to store the support
count of each item.
• Number of transactions: since Apriori makes multiple passes, the run time of the algorithm may
increase with the number of transactions.
• Average transaction width: this may increase the maximum length of frequent itemsets and the
number of hash tree traversals.
Frequent Itemset Generation Using Apriori
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent,
then all of its supersets must be infrequent, too.
Apriori Algorithm:
Generate frequent itemsets of length k (initially k=1)
Repeat until no new frequent itemsets are identified
o Generate length (k+1) candidate itemsets from length k frequent itemsets
o Prune length (k+1) candidate itemsets that contain subsets of length k that are infrequent
o Count the support of each candidate
o Eliminate length (k+1) candidates that are infrequent
k = 1
F[1] = all frequent 1-itemsets
while F[k] is not empty:
    k += 1
    # Candidate itemset generation and pruning
    C[k] = candidate itemsets generated from F[k-1]
    # Support counting
    for transaction t in T:
        C_t = subset(C[k], t)      # identify all candidates contained in t
        for candidate itemset c in C_t:
            support_count[c] += 1  # increment support count
    F[k] = {c | c in C[k] and support_count[c] >= N * minsup}
return the union of F[k] for all values of k
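Below is a minimal, runnable Python sketch of this loop. Itemsets are represented as frozensets, candidates are generated by merging frequent (k-1)-itemsets, and pruning uses the Apriori principle; the function and variable names are illustrative, not from any particular library.

from itertools import combinations

def apriori(transactions, minsup):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    N = len(transactions)
    min_count = minsup * N

    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        k += 1
        # Candidate generation: union of two frequent (k-1)-itemsets that has size k
        prev = list(frequent)
        candidates = {a | b for i, a in enumerate(prev) for b in prev[i + 1:]
                      if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Support counting
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        frequent = {c: n for c, n in cand_counts.items() if n >= min_count}
        result.update(frequent)
    return result

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(transactions, minsup=0.6))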
Frequent Itemset Generation Using ECLAT
Instead of a horizontal data layout, ECLAT uses a vertical data layout, storing a list of transaction ids (a tid-list) for
each item.
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
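A small sketch of the vertical layout, assuming transactions are keyed by TID; the support of an itemset is then the size of the intersection of the tid-lists of its items.

transactions = {1: {"Bread", "Milk"},
                2: {"Bread", "Diaper", "Beer", "Eggs"},
                3: {"Milk", "Diaper", "Beer", "Coke"},
                4: {"Bread", "Milk", "Diaper", "Beer"},
                5: {"Bread", "Milk", "Diaper", "Coke"}}

# Vertical data layout: item -> set of transaction ids containing it
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support counting by tid-list intersection
print(len(tidlists["Milk"] & tidlists["Diaper"]))                      # 3
print(len(tidlists["Milk"] & tidlists["Diaper"] & tidlists["Beer"]))   # 2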
Frequent Itemset Generation Using FP-Growth
FP-Growth uses FP-tree (Frequent Pattern Tree), a compressed representation of the database. Once an
FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent
itemsets.
Build Tree
How to construct a FP-Tree?
1. Create the root node (null)
2. Scan the database, get the frequent itemsets of length 1, and sort these 1-itemsets in decreasing
support count.
3. Read one transaction at a time and sort its items according to the order from the previous step.
4. For each transaction, insert its items into the FP-tree starting from the root node and increment the
occurrence count at every node along the path.
5. Create a new child node whenever the next item has no matching child at the current node.
6. If a new child node is created, link it to the previous node carrying the same item (the node-link chain kept in the header table).
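A minimal sketch of this construction in Python. The FPNode class, the header table, and the node-link handling are illustrative; real implementations differ in detail.

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label ("null" for the root)
        self.count = 0            # number of transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.link = None          # pointer to another node carrying the same item

def build_fp_tree(transactions, minsup_count):
    # Pass 1: support count of each item; keep only frequent items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {item: c for item, c in counts.items() if c >= minsup_count}

    root = FPNode("null", None)
    header = {}                   # item -> most recently inserted node for that item
    # Pass 2: insert every transaction, items sorted by decreasing support count.
    for t in transactions:
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                child.link = header.get(item)   # link to the previous node with this item
                header[item] = child
            node = node.children[item]
            node.count += 1
    return root, header

# Example: the three transactions {a,b}, {b,c,d}, {a,c,d,e}
root, header = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}],
                             minsup_count=1)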
Mining Tree
How to mine the frequent itemsets using the FP-Tree?
FP-Growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-
conquer strategy to split the problem into smaller subproblems.
1. Using the pointers in the header table, decompose the FP-tree into multiple subtrees,
each representing a subproblem (e.g., finding frequent itemsets ending in e).
2. For each subproblem, traverse the corresponding subtree bottom-up to obtain the
conditional pattern bases, and solve the subproblem recursively.
Benefits of using the FP-tree structure:
• No need to generate candidate itemsets
• No need to scan the database repeatedly (only two passes are needed)
Rule Generation
Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f → L – f satisfies
the minimum confidence requirement.
If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L).
For example, L = {A, B, C} yields 2^3 – 2 = 6 candidate rules.
So how to efficiently generate rules from frequent itemsets?
Confidence-Based Pruning
In general, confidence does not have an anti-monotone property. But the confidence of rules
generated from the same itemset does have an anti-monotone property. For example, for L = {A, B, C, D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
If we compare rules generated from the same frequent itemset Y, the following theorem holds
for the confidence measure:
If a rule X → Y − X does not satisfy the confidence threshold, then any
rule X′ → Y − X′ (where X′ ⊂ X) must not satisfy the confidence threshold as well.
Rule Generation in Apriori Algorithm
Candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
F = frequent k-itemsets for k >= 2
for itemset f in F:
    H[1] = 1-itemsets in f
    rules += ap_genrules(f, H[1])
return rules

def ap_genrules(f, H):
    k = size of itemset f
    m = size of itemsets in H
    if k > m + 1:
        H[m+1] = candidate (m+1)-itemset consequents generated from H
        for h in H[m+1]:
            conf = support_count[f] / support_count[f - h]
            if conf >= minconf:
                output the rule (f - h) -> h
            else:
                H[m+1] = H[m+1] - h    # prune the consequent h
        ap_genrules(f, H[m+1])
Compact Representation of Frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data set can be very large. It
is useful to identify a small representation set of itemsets from which all other frequent itemsets can be
derived.
To reduce the number of rules we can post-process them and only output:
• Maximal frequent itemsets
  o No immediate superset is frequent
  o Gives more pruning
• Closed itemsets
  o No immediate superset has the same count
  o Stores not only frequency information, but also exact counts
Maximal Frequent Itemsets
An itemset is maximal frequent if none of its immediate supersets is frequent.
For example, the frequent itemsets in the figure above can be divided into two groups:
Frequent itemsets that begin with item a, and may contain items c, d, or e.
Frequent itemsets that begin with items b, c, d, or e.
Maximal frequent itemsets provide a valuable representation for data sets that can produce
very long frequent itemsets, as there are exponentially many frequent itemsets in such data.
Despite providing a compact representation, maximal frequent itemsets do not contain the
support information of their subsets.
Closed Frequent Itemsets
• An itemset is closed if none of its immediate supersets has the same support as the itemset.
• An itemset is closed frequent if it is closed and its support is greater than or equal to minsup.
• Algorithms are available to explicitly extract closed frequent itemsets from a given data set.
With this compact representation, we can derive the support of the non-closed frequent itemsets efficiently:

C = the set of closed frequent itemsets
k_max = the maximum size of closed frequent itemsets
F[k_max] = the set of frequent itemsets of size k_max
k = k_max - 1
while k > 0:
    F[k] = frequent itemsets of size k
    for f in F[k]:
        if f not in C:
            X = the set of frequent itemsets ff such that ff is in F[k+1] and f is a subset of ff
            support[f] = max(support[ff] for ff in X)
    k = k - 1
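A sketch of this derivation in Python, assuming the closed frequent itemsets (with their supports) and the full list of frequent itemsets are already available; the names and data layout are illustrative.

def derive_supports(closed_support, all_frequent):
    """closed_support: dict mapping each closed frequent itemset (frozenset) to its support
       all_frequent:   list of all frequent itemsets (frozensets)"""
    support = dict(closed_support)
    k_max = max(len(f) for f in all_frequent)
    for k in range(k_max - 1, 0, -1):
        for f in (x for x in all_frequent if len(x) == k):
            if f not in support:                   # f is frequent but not closed
                supersets = [x for x in all_frequent
                             if len(x) == k + 1 and f < x]
                # A non-closed itemset has the same support as its
                # highest-support immediate superset.
                support[f] = max(support[x] for x in supersets)
    return support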
Multiple Minimum Support
• Using a single minimum support threshold may not be effective since many real data sets have
skewed support distribution.
• If minsup is set too high, we could miss itemsets involving interesting rare items. (e.g.,
expensive products)
• If minsup is set too low, it is computationally expensive and the number of itemsets is very
large.
But how to apply multiple minimum supports?
MS(i) = minimum support for item i
MS(A, B) = min(MS(A), MS(B))
For example, given
• MS(Milk)=5%
• MS(Coke) = 3%
• MS(Broccoli)=0.1%
• MS(Salmon)=0.5%
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) = 0.1%
• However, with multiple minimum supports, support is no longer anti-monotone. Apriori needs to be
modified so that Ck+1 (the candidate itemsets of size k+1) is generated from Lk (the set of items whose
support is ≥ MS(k), the k-th smallest minimum support) instead of Fk (the set of frequent itemsets).
A short sketch of the MS(·) rule follows.
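A tiny sketch of the MS(·) rule using the thresholds from the example above (the dictionary and function name are illustrative):

MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def itemset_minsup(itemset):
    """Minimum support threshold of an itemset under multiple minimum supports."""
    return min(MS[item] for item in itemset)

print(itemset_minsup({"Milk", "Broccoli"}))   # 0.001, i.e. 0.1%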
Alternative Methods for Generating Frequent Itemsets
The performance of the Apriori algorithm may degrade significantly for dense data sets because
of the increasing width of transactions. Several alternative methods have been developed to
overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following
is a high-level description of these methods.
Traversal of Itemset Lattice A search for frequent itemsets can be conceptually viewed as a
traversal on the itemset lattice. The search strategy employed by an algorithm
dictates how the lattice structure is traversed during the frequent itemset generation process.
Some search strategies are better than others, depending on the configuration of frequent
itemsets in the lattice. An overview of these strategies is presented next.
• General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific
search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This
general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not
too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure(a),
where the darker nodes represent infrequent itemsets.
A specific-to-general search strategy looks for more specific frequent itemsets first, before finding the
more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense
transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in
Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets.
Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of
size k − 1. However, if the candidate k-itemset is infrequent, we need to check all of its k − 1 subsets in
the next iteration. Another approach is to combine both general-to-specific and specific-to-general
search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border.
• Equivalence Classes: Another way to envision the traversal is to first partition the lattice into
disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for
frequent itemsets within a particular equivalence class first before moving to another equivalence class.
Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the
algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence
classes can also be defined according to the prefix or suffix labels of an itemset
Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first
manner, as shown in Figure(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-
itemsets, and so on, until no new frequent itemsets are generated. The itemset lattice can also be
traversed in a depth-first manner, as shown in Figure (b). The algorithm can start from, say, node a,
and count its support to determine whether it is frequent. If so, the algorithm progressively
expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It
then backtracks to another branch, say, abce, and continues the search from there. The depth-first
approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows
the frequent itemset border to be detected more quickly than using a breadth-first approach.
Representations of Transaction Data Set There are many ways to represent a transaction data set. The
choice of representation can affect the I/O costs incurred when computing the support of candidate
itemsets. Figure shows two different ways of representing market basket transactions. The representation
on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms,
including Apriori. Another possibility is to store the list of transaction identifiers (TIDlist) associated with
each item. Such a representation is known as the vertical data layout. The support for each candidate
itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we
progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-
lists might be too large to fit into main memory, thus requiring more sophisticated techniques to compress
the TID-lists. We describe another effective approach to represent the data in the next section.
FP Growth Algorithm
This algorithm is an improvement to the Apriori method. A frequent pattern is generated without the
need for candidate generation. FP growth algorithm represents the database in the form of a tree called
a frequent pattern tree or FP tree.
This tree structure will maintain the association between the itemsets. The database is fragmented
using one frequent item. This fragmented part is called “pattern fragment”. The itemsets of these
fragmented patterns are analyzed. Thus, with this method, the cost of searching for frequent itemsets is reduced
considerably.
FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed by reading the data set
one transaction at a time and mapping each transaction onto a path in the FP-tree. As different
transactions can have several items in common, their paths may overlap. The more the paths overlap
with one another, the more compression we can achieve using the FP-tree structure. If the size of the
FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly
from the structure in memory instead of making repeated passes over the data stored on disk.
The structures of the FP-tree after reading the first three transactions are also depicted in the
diagram. Each node in the tree contains the label of an item along with a counter that shows the
number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node
represented by the null symbol. The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in
Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP- tree. After reading the first
transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to
encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is
then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along
this path also has a frequency count equal to one. Although the first two transactions have an item in
common, which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first transaction. As
a result, the path for the third transaction, null → a → c → d
→ e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path,
the frequency count for node a is incremented to two, while the frequency counts for the newly created
nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths given in the FP-
tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 6.24.
The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common. In the best-case scenario,
where all the transactions have the same set of items, the FP-tree contains only a single branch of
nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of
the transactions have any items in common, the size of the FP-tree is effectively the same as the size of
the original data. However, the physical storage requirement for the FP-tree is higher because it
requires additional space to store pointers between nodes and counters for each item.
Frequent Itemset Generation in FP-Growth Algorithm
FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a
bottom-up fashion. The algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and
finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is
equivalent to a suffix-based approach. Since every transaction is mapped onto a path in the FP-tree,
we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths
containing node e. These paths can be accessed rapidly using the pointers associated with node e.
1)The first step is to scan the database to find the occurrences of the itemsets in the database. This step is
the same as the first step of Apriori. The count of 1-itemsets in the database is called support count or
frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by
null.
3) The next step is to scan the database again and examine the transactions. Examine the first transaction and
find out the items in it. The item with the maximum count is taken at the top, followed by the next item with a lower
count, and so on. It means that each branch of the tree is constructed with the transaction's items in descending
order of count.
4) The next transaction in the database is examined. Its items are ordered in descending order of count. If
any items of this transaction are already present in another branch (for example, from the 1st transaction),
then this transaction's branch shares a common prefix starting from the root.
This means that the common items are linked to new nodes for the remaining items in this transaction.
5) Also, the count of each node is incremented as it occurs in the transactions: counts along the shared prefix
are increased by 1, and newly created nodes are initialized with a count of 1.
6) The next step is to mine the constructed FP Tree. For this, the lowest node (in the item ordering) is examined
first, along with its node links. The lowest node represents a frequent pattern of length 1. From this node, traverse the
paths in the FP Tree upward. These paths are called the conditional pattern base (a code sketch follows this list).
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the lowest
node (the suffix).
7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The itemsets meeting
the threshold support are considered in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
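A sketch of step 6, reusing the illustrative build_fp_tree structure from the earlier sketch: the node-link chain of the suffix item is followed, and for each occurrence the prefix path up to the root is collected together with that node's count.

def conditional_pattern_base(header, item):
    """Return a list of (prefix_path, count) pairs for paths ending at `item`."""
    patterns = []
    node = header.get(item)
    while node is not None:
        path = []
        parent = node.parent
        while parent is not None and parent.item != "null":
            path.append(parent.item)
            parent = parent.parent
        if path:
            patterns.append((list(reversed(path)), node.count))
        node = node.link
    return patterns

# Prefix paths ending at d in the three-transaction tree built above,
# e.g. [(['a', 'c'], 1), (['b', 'c'], 1)]
print(conditional_pattern_base(header, "d"))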
Advantages of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which scans the
transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages of FP-Growth Algorithm
1. FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the FP tree may not fit in main memory.
Evaluation of Association Patterns
Association rule algorithms tend to produce too many rules. Many of them are uninteresting
or redundant.
Objective Evaluation
An objective measure is a data-driven approach for evaluating the quality of association patterns. It is
domain-independent and requires minimal input from the users. Patterns that involve a set of mutually
independent items or cover very few transactions are considered uninteresting because they may
capture spurious relationships in the data. Such patterns can be eliminated by applying an objective
interestingness measure.
An objective measure is usually computed based on a contingency table. For example, the table below is a
2-way contingency table for variables A and B.

        B       B̄
A       f11     f10     f1+
Ā       f01     f00     f0+
        f+1     f+0     N

• f11 = N × P(A, B) denotes the number of transactions that contain both A and B.
• f10 = N × P(A, B̄) denotes the number of transactions that contain A but not B.
• f1+ = N × P(A) denotes the support count for A.
• f+1 = N × P(B) denotes the support count for B.
The pitfall of confidence can be traced to the fact that the measure ignores the
support of the itemset in the rule consequent (e.g., P(B) in the above case).
Subjective Evaluation
A pattern is considered subjectively uninteresting unless it reveals unexpected information about the
data. Unlike Objective measures, which rank patterns based on statistics computed from data,
subjective measures rank patterns according to user’s interpretation.
Objective Measures of Interestingness
Interest of an association rule X → Y
I(X → Y) = P(X, Y) / (P(X) × P(Y))
o P(Y) = s(Y) is the support of Y (fraction of baskets that contain Y)
o If the interest of a rule is close to 1, then it is uninteresting:
I(X → Y) = 1 → X and Y are independent
I(X → Y) > 1 → X and Y are positively correlated
I(X → Y) < 1 → X and Y are negatively correlated
Lift of an association rule X → Y
Lift(X → Y) = P(Y | X) / P(Y)
o P(Y | X) = P(X, Y) / P(X) = f11 / f1+
o P(Y) = f+1 / N
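A small sketch that computes the interest/lift of a rule directly from the contingency-table counts f11, f10, f01, f00 defined above (the function name is illustrative):

def lift(f11, f10, f01, f00):
    N = f11 + f10 + f01 + f00
    p_xy = f11 / N                # P(X, Y)
    p_x = (f11 + f10) / N         # P(X) = f1+ / N
    p_y = (f11 + f01) / N         # P(Y) = f+1 / N
    return p_xy / (p_x * p_y)     # equals P(Y|X) / P(Y)

# {Milk, Diaper} vs {Beer} in the five-transaction example: f11=2, f10=1, f01=1, f00=1
print(lift(2, 1, 1, 1))           # 10/9 ≈ 1.11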
There are lots of measures proposed in the literature. Some measures are good for certain
applications, but not for others:
Properties of a Good Objective Measure
Inversion Property
An objective measure M is invariant under the inversion operation if its value remains the
same when exchanging the frequency counts f11 with f00 and f10 with f01.
Null Addition Property
An objective measure M is invariant under the null addition operation if it is not affected by increasing
f00, while all other frequencies in the contingency table stay the same.
Scaling Invariance Property
An objective measure M is invariant under the row/column scaling operation, i.e., its value does not change when the rows or columns of the contingency table are scaled by positive constants.