UNIT-III
Association Analysis
Association analysis is useful for discovering interesting relationships hidden in large data sets. The
uncovered relationships can be represented in the form of association rules or sets of frequent
items.
For example, given a table of market basket transactions
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
a rule such as {Diapers} → {Beer} can be extracted, suggesting a strong relationship between the sale of diapers and beer. Besides market basket data, association analysis is also applicable to other application domains such as
bioinformatics, medical diagnosis, Web mining, and scientific data analysis. In the analysis of Earth
science data, for example, the association patterns may reveal interesting connections among the ocean,
land, and atmospheric processes. Such information may help Earth scientists develop a better
understanding of how the different elements of the Earth system interact with each other. Even though
the techniques presented here are generally applicable to a wider variety of data sets, for illustrative purposes the discussion here focuses mainly on market basket data. There are two key
issues that need to be addressed when applying association analysis to market basket data. First,
discovering patterns from a large transaction data set can be computationally expensive. Second, some
of the discovered patterns are potentially spurious because they may happen simply by chance.
Association Rule An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. A common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup
threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent
itemsets found in the previous step. These rules are called strong rules.
The computational requirements for frequent itemset generation are generally more expensive
than those of rule generation.
Definitions
Support Count
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|
where
• I = {i1, i2, …, id} is the set of all items
• T = {t1, t2, …, tN} is the set of all transactions
• each transaction ti contains a subset of the items in I
• X is an itemset, i.e., a subset of I
Association Rule
An association rule is an implication expression of the form X → Y, where X and Y are disjoint
itemsets (X ∩ Y = ∅).
The strength of an association rule can be measured in terms of its support and confidence. A rule
that has very low support may occur simply by chance. Confidence measures the reliability of the
inference made by a rule.
Support of an association rule X → Y
s(X → Y) = σ(X ∪ Y) / N
o σ(X ∪ Y) is the support count of the itemset X ∪ Y
o N is the number of transactions in T
Confidence of an association rule X → Y
conf(X → Y) = σ(X ∪ Y) / σ(X)
o σ(X) is the support count of X
o σ(X ∪ Y) is the support count of X ∪ Y
Interest of an association rule X → Y
I(X → Y) = P(X, Y) / (P(X) × P(Y))
o P(Y) = s(Y) is the support of Y (fraction of baskets that contain Y)
o If the interest of a rule is close to 1, then it is uninteresting:
I(X → Y) = 1 → X and Y are independent
I(X → Y) > 1 → X and Y are positively correlated
I(X → Y) < 1 → X and Y are negatively correlated
For example, given a table of market basket transactions:
TID Items
1 {Bread, Milk}
2 {Bread, Diaper, Beer, Eggs}
3 {Milk, Diaper, Beer, Coke}
4 {Bread, Milk, Diaper, Beer}
5 {Bread, Milk, Diaper, Coke}
We can conclude that
s({Milk, Diaper} → {Beer}) = 2/5 = 0.4
conf({Milk, Diaper} → {Beer}) = 2/3 ≈ 0.67
I({Milk, Diaper} → {Beer}) = (2/5) / ((3/5) × (3/5)) = 10/9 ≈ 1.11
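These numbers can be reproduced with a short Python sketch; the transaction list and helper function below are illustrative.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support_count(itemset):
    """sigma(X): number of transactions that contain every item in X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = support_count(X | Y) / N                                       # 2/5 = 0.4
confidence = support_count(X | Y) / support_count(X)                     # 2/3 ≈ 0.67
interest = support / ((support_count(X) / N) * (support_count(Y) / N))   # 10/9 ≈ 1.11
print(support, confidence, interest)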
Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. However, the cost of
frequent itemset generation is large: given d items, there are 2^d possible candidate itemsets. There are
several ways to reduce the computational complexity of frequent itemset generation:
1. Reduce the number of candidate itemsets (M): the Apriori principle.
2. Reduce the number of comparisons while counting supports: by using more advanced data structures, we
can reduce the number of comparisons for matching each candidate itemset against every transaction.
Factors Affecting Complexity:
• Choice of minimum support threshold: lowering the support threshold results in more
frequent itemsets. This may increase the number of candidates and the maximum length of frequent
itemsets.
• Dimensionality (number of items) of the data set: more space is needed to store the support
count of each item.
• Number of transactions: since Apriori makes multiple passes, the run time of the algorithm may
increase with the number of transactions.
• Average transaction width: this may increase the maximum length of frequent itemsets and the
number of hash tree traversals.
Frequent Itemset Generation Using Apriori
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent,
then all of its supersets must be infrequent, too.
Apriori Algorithm:
Generate frequent itemsets of length k (initially k=1)
Repeat until no new frequent itemsets are identified
o Generate length (k+1) candidate itemsets from length k frequent itemsets
o Prune length (k+1) candidate itemsets that contain subsets of length k that are infrequent
o Count the support of each candidate
o Eliminate length (k+1) candidates that are infrequent
k = 1
F[1] = all frequent 1-itemsets
while F[k] is not empty:
    k += 1
    # Candidate itemset generation and pruning
    C[k] = candidate itemsets generated from F[k-1]
    # Support counting
    for transaction t in T:
        C_t = subset(C[k], t)      # identify all candidates contained in t
        for candidate itemset c in C_t:
            support_count[c] += 1  # increment support count
    F[k] = {c | c in C[k] and support_count[c] >= N * minsup}
return the union of F[k] for all values of k
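Below is a minimal, runnable Python sketch of this loop. Itemsets are represented as frozensets, candidates are generated by merging frequent (k-1)-itemsets, and pruning uses the Apriori principle; the function and variable names are illustrative, not from any particular library.

from itertools import combinations

def apriori(transactions, minsup):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    N = len(transactions)
    min_count = minsup * N

    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        k += 1
        # Candidate generation: union of two frequent (k-1)-itemsets that has size k
        prev = list(frequent)
        candidates = {a | b for i, a in enumerate(prev) for b in prev[i + 1:]
                      if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Support counting
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        frequent = {c: n for c, n in cand_counts.items() if n >= min_count}
        result.update(frequent)
    return result

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(transactions, minsup=0.6))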
Frequent Itemset Generation Using ECLAT
Instead of a horizontal data layout, ECLAT uses a vertical data layout, storing a list of transaction ids (a tid-list) for
each item.
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
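A small sketch of the vertical layout, assuming transactions are keyed by TID; the support of an itemset is then the size of the intersection of the tid-lists of its items.

transactions = {1: {"Bread", "Milk"},
                2: {"Bread", "Diaper", "Beer", "Eggs"},
                3: {"Milk", "Diaper", "Beer", "Coke"},
                4: {"Bread", "Milk", "Diaper", "Beer"},
                5: {"Bread", "Milk", "Diaper", "Coke"}}

# Vertical data layout: item -> set of transaction ids containing it
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support counting by tid-list intersection
print(len(tidlists["Milk"] & tidlists["Diaper"]))                      # 3
print(len(tidlists["Milk"] & tidlists["Diaper"] & tidlists["Beer"]))   # 2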
Frequent Itemset Generation Using FP-Growth
FP-Growth uses FP-tree (Frequent Pattern Tree), a compressed representation of the database. Once an
FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent
itemsets.
Build Tree
How to construct a FP-Tree?
1. Create the root node (null)
2. Scan the database, get the frequent itemsets of length 1, and sort these 1-itemsets in decreasing
support count.
3. Read one transaction at a time and sort its items according to the order from the previous step.
4. For each transaction, insert its items into the FP-tree starting from the root node and increment the
occurrence count at every node along the path.
5. Create a new child node whenever the next item has no matching child at the current node.
6. If a new child node is created, link it to the previous node carrying the same item (the node-link chain kept in the header table).
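A minimal sketch of this construction in Python. The FPNode class, the header table, and the node-link handling are illustrative; real implementations differ in detail.

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label ("null" for the root)
        self.count = 0            # number of transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.link = None          # pointer to another node carrying the same item

def build_fp_tree(transactions, minsup_count):
    # Pass 1: support count of each item; keep only frequent items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {item: c for item, c in counts.items() if c >= minsup_count}

    root = FPNode("null", None)
    header = {}                   # item -> most recently inserted node for that item
    # Pass 2: insert every transaction, items sorted by decreasing support count.
    for t in transactions:
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                child.link = header.get(item)   # link to the previous node with this item
                header[item] = child
            node = node.children[item]
            node.count += 1
    return root, header

# Example: the three transactions {a,b}, {b,c,d}, {a,c,d,e}
root, header = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}],
                             minsup_count=1)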
Mining Tree
How to mine the frequent itemsets using the FP-Tree?
FP-Growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-
conquer strategy to split the problem into smaller subproblems.
1. Using the pointers in the header table, decompose the FP-tree into multiple subtrees,
each representing a subproblem (e.g., finding frequent itemsets ending in e).
2. For each subproblem, traverse the corresponding subtree bottom-up to obtain the
conditional pattern bases, and solve the subproblem recursively.
Benefits of using the FP-tree structure:
• No need to generate candidate itemsets
• No need to scan the database repeatedly (only two passes are needed)
Rule Generation
Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f → L – f satisfies
the minimum confidence requirement.
If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L).
For example, L = {A, B, C} yields 2^3 – 2 = 6 candidate rules.
So how to efficiently generate rules from frequent itemsets?
Confidence-Based Pruning
In general, confidence does not have an anti-monotone property. But the confidence of rules
generated from the same itemset does have an anti-monotone property. For example, for L = {A, B, C, D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
If we compare rules generated from the same frequent itemset Y, the following theorem holds
for the confidence measure:
If a rule X → Y − X does not satisfy the confidence threshold, then any
rule X′ → Y − X′ (where X′ ⊂ X) must not satisfy the confidence threshold as well.
Rule Generation in Apriori Algorithm
Candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
F = frequent k-itemsets for k >= 2
for itemset f in F:
    H[1] = 1-itemsets in f
    rules += ap_genrules(f, H[1])
return rules

def ap_genrules(f, H):
    k = size of itemset f
    m = size of itemsets in H
    if k > m + 1:
        H[m+1] = candidate (m+1)-itemset consequents generated from H
        for h in H[m+1]:
            conf = support_count[f] / support_count[f - h]
            if conf >= minconf:
                output the rule (f - h) -> h
            else:
                H[m+1] = H[m+1] - h    # prune the consequent h
        ap_genrules(f, H[m+1])
Compact Representation of Frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data set can be very large. It
is useful to identify a small representation set of itemsets from which all other frequent itemsets can be
derived.
To reduce the number of rules we can post-process them and only output:
• Maximal frequent itemsets
  o No immediate superset is frequent
  o Gives more pruning
• Closed itemsets
  o No immediate superset has the same count
  o Stores not only frequency information, but also exact counts
Maximal Frequent Itemsets
An itemset is maximal frequent if none of its immediate supersets is frequent.
For example, the frequent itemsets in the figure above can be divided into two groups:
Frequent itemsets that begin with item a, and may contain items c, d, or e.
Frequent itemsets that begin with items b, c, d, or e.
Maximal frequent itemsets provide a valuable representation for data sets that can produce
very long frequent itemsets, as there are exponentially many frequent itemsets in such data.
Despite providing a compact representation, maximal frequent itemsets do not contain the
support information of their subsets.
Closed Frequent Itemsets
• An itemset is closed if none of its immediate supersets has the same support as the itemset.
• An itemset is closed frequent if it is closed and its support is greater than or equal to minsup.
• Algorithms are available to explicitly extract closed frequent itemsets from a given data set.
With this compact representation, we can derive the support of the non-closed frequent itemsets efficiently:

C = the set of closed frequent itemsets
k_max = the maximum size of closed frequent itemsets
F[k_max] = the set of frequent itemsets of size k_max
k = k_max - 1
while k > 0:
    F[k] = frequent itemsets of size k
    for f in F[k]:
        if f not in C:
            X = the set of frequent itemsets ff such that ff is in F[k+1] and f is a subset of ff
            support[f] = max(support[ff] for ff in X)
    k = k - 1
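A sketch of this derivation in Python, assuming the closed frequent itemsets (with their supports) and the full list of frequent itemsets are already available; the names and data layout are illustrative.

def derive_supports(closed_support, all_frequent):
    """closed_support: dict mapping each closed frequent itemset (frozenset) to its support
       all_frequent:   list of all frequent itemsets (frozensets)"""
    support = dict(closed_support)
    k_max = max(len(f) for f in all_frequent)
    for k in range(k_max - 1, 0, -1):
        for f in (x for x in all_frequent if len(x) == k):
            if f not in support:                   # f is frequent but not closed
                supersets = [x for x in all_frequent
                             if len(x) == k + 1 and f < x]
                # A non-closed itemset has the same support as its
                # highest-support immediate superset.
                support[f] = max(support[x] for x in supersets)
    return support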
Multiple Minimum Support
• Using a single minimum support threshold may not be effective since many real data sets have
skewed support distribution.
• If minsup is set too high, we could miss itemsets involving interesting rare items. (e.g.,
expensive products)
• If minsup is set too low, it is computationally expensive and the number of itemsets is very
large.
But how to apply multiple minimum supports?
MS(i) = minimum support for item i
MS(A, B) = min(MS(A), MS(B))
For example, given
• MS(Milk)=5%
• MS(Coke) = 3%
• MS(Broccoli)=0.1%
• MS(Salmon)=0.5%
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) = 0.1%
• However, with multiple minimum supports, support is no longer anti-monotone. Apriori needs to be
modified so that Ck+1 (the candidate itemsets of size k+1) is generated from Lk (the set of items whose
support is ≥ MS(k), the k-th smallest minimum support) instead of Fk (the set of frequent itemsets).
A short sketch of the MS(·) rule follows.
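A tiny sketch of the MS(·) rule using the thresholds from the example above (the dictionary and function name are illustrative):

MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def itemset_minsup(itemset):
    """Minimum support threshold of an itemset under multiple minimum supports."""
    return min(MS[item] for item in itemset)

print(itemset_minsup({"Milk", "Broccoli"}))   # 0.001, i.e. 0.1%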
Alternative Methods for Generating Frequent Itemsets
The performance of the Apriori algorithm may degrade significantly for dense data sets because
of the increasing width of transactions. Several alternative methods have been developed to
overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following
is a high-level description of these methods.
Traversal of Itemset Lattice A search for frequent itemsets can be conceptually viewed as a
traversal on the itemset lattice. The search strategy employed by an algorithm
dictates how the lattice structure is traversed during the frequent itemset generation process.
Some search strategies are better than others, depending on the configuration of frequent
itemsets in the lattice. An overview of these strategies is presented next.
• General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific
search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This
general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not
too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure(a),
where the darker nodes represent infrequent itemsets.
A specific-to-general search strategy looks for more specific frequent itemsets first, before finding the
more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense
transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in
Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets.
Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of
size k − 1. However, if the candidate k-itemset is infrequent, we need to check all of its k − 1 subsets in
the next iteration. Another approach is to combine both general-to-specific and specific-to-general
search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border.
• Equivalence Classes: Another way to envision the traversal is to first partition the lattice into
disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for
frequent itemsets within a particular equivalence class first before moving to another equivalence class.
Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the
algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence
classes can also be defined according to the prefix or suffix labels of an itemset
Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first
manner, as shown in Figure(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-
itemsets, and so on, until no new frequent itemsets are generated. The itemset lattice can also be
traversed in a depth-first manner, as shown in Figure (b). The algorithm can start from, say, node a,
and count its support to determine whether it is frequent. If so, the algorithm progressively
expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It
then backtracks to another branch, say, abce, and continues the search from there. The depth-first
approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows
the frequent itemset border to be detected more quickly than using a breadth-first approach.
Representations of Transaction Data Set There are many ways to represent a transaction data set. The
choice of representation can affect the I/O costs incurred when computing the support of candidate
itemsets. Figure shows two different ways of representing market basket transactions. The representation
on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms,
including Apriori. Another possibility is to store the list of transaction identifiers (TIDlist) associated with
each item. Such a representation is known as the vertical data layout. The support for each candidate
itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we
progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-
lists might be too large to fit into main memory, thus requiring more sophisticated techniques to compress
the TID-lists. We describe another effective approach to represent the data in the next section.
FP Growth Algorithm
This algorithm is an improvement to the Apriori method. A frequent pattern is generated without the
need for candidate generation. FP growth algorithm represents the database in the form of a tree called
a frequent pattern tree or FP tree.
This tree structure will maintain the association between the itemsets. The database is fragmented
using one frequent item. This fragmented part is called “pattern fragment”. The itemsets of these
fragmented patterns are analyzed. Thus, with this method, the cost of searching for frequent itemsets is reduced
considerably.
FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed by reading the data set
one transaction at a time and mapping each transaction onto a path in the FP-tree. As different
transactions can have several items in common, their paths may overlap. The more the paths overlap
with one another, the more compression we can achieve using the FP-tree structure. If the size of the
FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly
from the structure in memory instead of making repeated passes over the data stored on disk.
The structures of the FP-tree after reading the first three transactions are also depicted in the
diagram. Each node in the tree contains the label of an item along with a counter that shows the
number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node
represented by the null symbol. The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in
Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP- tree. After reading the first
transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to
encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is
then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along
this path also has a frequency count equal to one. Although the first two transactions have an item in
common, which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first transaction. As
a result, the path for the third transaction, null → a → c → d
→ e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path,
the frequency count for node a is incremented to two, while the frequency counts for the newly created
nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths given in the FP-
tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 6.24.
The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common. In the best-case scenario,
where all the transactions have the same set of items, the FP-tree contains only a single branch of
nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of
the transactions have any items in common, the size of the FP-tree is effectively the same as the size of
the original data. However, the physical storage requirement for the FP-tree is higher because it
requires additional space to store pointers between nodes and counters for each item.
Frequent Itemset Generation in FP-Growth Algorithm
FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a
bottom-up fashion. The algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and
finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is
equivalent to a suffix-based approach. Since every transaction is mapped onto a path in the FP-tree,
we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths
containing node e. These paths can be accessed rapidly using the pointers associated with node e.
1)The first step is to scan the database to find the occurrences of the itemsets in the database. This step is
the same as the first step of Apriori. The count of 1-itemsets in the database is called support count or
frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by
null.
3) The next step is to scan the database again and examine the transactions. Examine the first transaction and
find out the items in it. The item with the maximum count is taken at the top, followed by the next item with a lower
count, and so on. It means that each branch of the tree is constructed with the transaction's items in descending
order of count.
4) The next transaction in the database is examined. Its items are ordered in descending order of count. If
any items of this transaction are already present in another branch (for example, from the 1st transaction),
then this transaction's branch shares a common prefix starting from the root.
This means that the common items are linked to new nodes for the remaining items in this transaction.
5) Also, the count of each node is incremented as it occurs in the transactions: counts along the shared prefix
are increased by 1, and newly created nodes are initialized with a count of 1.
6) The next step is to mine the constructed FP Tree. For this, the lowest node (in the item ordering) is examined
first, along with its node links. The lowest node represents a frequent pattern of length 1. From this node, traverse the
paths in the FP Tree upward. These paths are called the conditional pattern base (a code sketch follows this list).
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the lowest
node (the suffix).
7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The itemsets meeting
the threshold support are considered in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
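A sketch of step 6, reusing the illustrative build_fp_tree structure from the earlier sketch: the node-link chain of the suffix item is followed, and for each occurrence the prefix path up to the root is collected together with that node's count.

def conditional_pattern_base(header, item):
    """Return a list of (prefix_path, count) pairs for paths ending at `item`."""
    patterns = []
    node = header.get(item)
    while node is not None:
        path = []
        parent = node.parent
        while parent is not None and parent.item != "null":
            path.append(parent.item)
            parent = parent.parent
        if path:
            patterns.append((list(reversed(path)), node.count))
        node = node.link
    return patterns

# Prefix paths ending at d in the three-transaction tree built above,
# e.g. [(['a', 'c'], 1), (['b', 'c'], 1)]
print(conditional_pattern_base(header, "d"))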
Advantages of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which scans the
transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages of FP-Growth Algorithm
1. FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the FP tree may not fit in main memory.
Evaluation of Association Patterns
Association rule algorithms tend to produce too many rules. Many of them are uninteresting
or redundant.
Objective Evaluation
An objective measure is a data-driven approach for evaluating the quality of association patterns. It is
domain-independent and requires minimal input from the users. Patterns that involve a set of mutually
independent items or cover very few transactions are considered uninteresting because they may
capture spurious relationships in the data. Such patterns can be eliminated by applying an objective
interestingness measure.
An objective measure is usually computed based on a contingency table. For example, the table below is a
2-way contingency table for variables A and B.

        B       B̄
A       f11     f10     f1+
Ā       f01     f00     f0+
        f+1     f+0     N

• f11 = N × P(A, B) denotes the number of transactions that contain both A and B.
• f10 = N × P(A, B̄) denotes the number of transactions that contain A but not B.
• f1+ = N × P(A) denotes the support count for A.
• f+1 = N × P(B) denotes the support count for B.
The pitfall of confidence can be traced to the fact that the measure ignores the
support of the itemset in the rule consequent (e.g., P(B) in the above case).
Subjective Evaluation
A pattern is considered subjectively uninteresting unless it reveals unexpected information about the
data. Unlike Objective measures, which rank patterns based on statistics computed from data,
subjective measures rank patterns according to user’s interpretation.
Objective Measures of Interestingness
Interest of an association rule X → Y
I(X → Y) = P(X, Y) / (P(X) × P(Y))
o P(Y) = s(Y) is the support of Y (fraction of baskets that contain Y)
o If the interest of a rule is close to 1, then it is uninteresting:
I(X → Y) = 1 → X and Y are independent
I(X → Y) > 1 → X and Y are positively correlated
I(X → Y) < 1 → X and Y are negatively correlated
Lift of an association rule X → Y
Lift(X → Y) = P(Y | X) / P(Y)
o P(Y | X) = P(X, Y) / P(X) = f11 / f1+
o P(Y) = f+1 / N
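A small sketch that computes the interest/lift of a rule directly from the contingency-table counts f11, f10, f01, f00 defined above (the function name is illustrative):

def lift(f11, f10, f01, f00):
    N = f11 + f10 + f01 + f00
    p_xy = f11 / N                # P(X, Y)
    p_x = (f11 + f10) / N         # P(X) = f1+ / N
    p_y = (f11 + f01) / N         # P(Y) = f+1 / N
    return p_xy / (p_x * p_y)     # equals P(Y|X) / P(Y)

# {Milk, Diaper} vs {Beer} in the five-transaction example: f11=2, f10=1, f01=1, f00=1
print(lift(2, 1, 1, 1))           # 10/9 ≈ 1.11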
There are lots of measures proposed in the literature. Some measures are good for certain
applications, but not for others:
Properties of a Good Objective Measure
Inversion Property
An objective measure M is invariant under the inversion operation if its value remains the
same when exchanging the frequency counts f11 with f00 and f10 with f01.
Null Addition Property
An objective measure M is invariant under the null addition operation if it is not affected by increasing
f00, while all other frequencies in the contingency table stay the same.
Scaling Invariance Property
An objective measure M is invariant under the row/column scaling operation, i.e., its value does not change when the rows or columns of the contingency table are scaled by positive constants.