Association Rule Mining

Steps involved in Association Rule Mining

Association Rule Mining can be described as a two-step process.

Step 1: Find all frequent itemsets.

An itemset is a set of items that occurs in a shopping basket. It can consist of any number
of products. For example, [bread, butter, eggs] is an itemset from a supermarket database.

A frequent itemset is one that occurs frequently in a database. This raises the question of
how frequency is defined, which is where the support count comes in.

The support count of an itemset is defined as the frequency of the itemset in the dataset,
i.e., the number of transactions in which it appears.
Itemsets and their respective support counts.

The support count only captures the absolute frequency of an itemset. It does not take into
account relative frequency, i.e., the frequency with respect to the number of transactions.
That relative measure is called the support of an itemset.

The support of an itemset is the frequency of the itemset with respect to the number of
transactions.
Itemsets and their respective supports.

Consider the itemset [Bread], which has 80% support. This means that in every 100
transactions, bread occurs 80 times.

Defining support as a percentage helps us set a threshold for frequency called min_support.
If we set min_support at 50%, we define a frequent itemset as one that occurs at least 50
times in 100 transactions. For instance, for the above dataset, we set min_support at 60%.

We always eliminate those itemsets whose support is less than min_support, as seen from the
greyed-out parts of the table above. The generation of frequent itemsets depends on the
algorithm used.
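The calculation below is a minimal Python sketch of these two definitions; the transaction
list and item names are made up for illustration (the table referred to above is not
reproduced here), and min_support is the 60% threshold from the example.

# Hypothetical transactions for illustration only.
transactions = [
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # Support count relative to the total number of transactions.
    return support_count(itemset, transactions) / len(transactions)

min_support = 0.6  # 60%, as in the example above

# Keep only the 1-itemsets whose support meets the threshold.
items = sorted({item for t in transactions for item in t})
frequent_1 = {item: support({item}, transactions)
              for item in items
              if support({item}, transactions) >= min_support}
print(frequent_1)  # {'bread': 0.8, 'butter': 0.8, 'milk': 0.6}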

Step 2: Generate strong association rules from the frequent itemsets.

Association rules are generated by building associations from the frequent itemsets produced
in Step 1. This uses a measure called confidence to find strong associations.

FREQUENT SET
 Refers to a set of items that frequently appear together, for example, Python and Big Data
Analytics, when computer science students frequently choose these subjects for in-depth
studies.
 A frequent itemset (FI) refers to a subset of items that appears frequently in the dataset.
Example on finding frequent itemsets – Consider the given dataset with the given
transactions.
 Let's say the minimum support count is 3.
 The relation that holds is: maximal frequent => closed => frequent.
1-frequent: {A} = 3; // not closed due to {A, C}; not maximal
{B} = 4; // not closed due to {B, D}; not maximal
{C} = 4; // not closed due to {C, D}; not maximal
{D} = 5; // closed itemset since no immediate superset has the same count;
not maximal
2-frequent: {A, B} = 2 // not frequent because support count < minimum
support count, so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed but not maximal due to {B, C, D}
{C, D} = 4 // closed but not maximal due to {B, C, D}
3-frequent: {A, B, C} = 2 // ignore; not frequent because support count <
minimum support count
{A, B, D} = 2 // ignore; not frequent because support count < minimum
support count
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
4-frequent: {A, B, C, D} = 2 // ignore; not frequent
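The original transaction table is not reproduced in the text above, so the Python sketch
below uses a hypothetical five-transaction dataset chosen to yield the same support counts;
it then labels each frequent itemset as closed and/or maximal, matching the enumeration
above.

from itertools import combinations

# Hypothetical transactions that reproduce the counts above (assumed, since the
# original table is not shown).
transactions = [
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "C", "D"},
    {"B", "D"},
    {"A", "B", "C", "D"},
]
min_count = 3
items = sorted({i for t in transactions for i in t})

def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# All frequent itemsets with their support counts.
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        if count(c) >= min_count:
            frequent[frozenset(c)] = count(c)

for s, sc in frequent.items():
    supersets = [t for t in frequent if s < t]
    is_closed = all(frequent[t] < sc for t in supersets)  # no frequent superset with the same count
    is_maximal = not supersets                            # no frequent superset at all
    print(sorted(s), sc, "closed" if is_closed else "", "maximal" if is_maximal else "")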

MAXIMAL FREQUENT ITEMSET

A maximal frequent itemset is a frequent itemset for which none of its
immediate supersets are frequent. To illustrate this concept, consider the
example given below:
The support counts are shown on the top left of each node. Assume the support
threshold is 50%, that is, each itemset must occur in 2 or more
transactions. Based on that threshold, the frequent itemsets are: a, b, c, d,
ab, ac and ad (shaded nodes).
Out of these 7 frequent itemsets, 3 are identified as maximal frequent
(having red outline):
 ab: Immediate supersets abc and abd are infrequent.
 ac: Immediate supersets abc and acd are infrequent.
 ad: Immediate supersets abd and acd are infrequent.
The remaining 4 frequent nodes (a, b, c and d) cannot be maximal frequent
because they all have at least 1 immediate superset that is frequent.

Advantage:
Maximal frequent itemsets provide a compact representation of all the
frequent itemsets for a particular dataset. In the above example, all
frequent itemsets are subsets of the maximal frequent itemsets, since we
can obtain sets a, b, c, d by enumerating subsets of ab, ac and ad (including
the maximal frequent itemsets themselves).
Disadvantage:
The support count of maximal frequent itemsets does not provide any
information about the support count of their subsets. This means that an
additional traversal of data is needed to determine support count for non-
maximal frequent itemsets, which may be undesirable in certain cases.
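As a quick illustration of this compact representation, the Python sketch below enumerates
every non-empty subset of the maximal frequent itemsets ab, ac and ad from the example above
to recover the full list of seven frequent itemsets; as noted, their support counts would
still have to be re-counted from the data.

from itertools import combinations

# Maximal frequent itemsets from the example above.
maximal = [{"a", "b"}, {"a", "c"}, {"a", "d"}]

frequent = set()
for m in maximal:
    for k in range(1, len(m) + 1):
        for subset in combinations(sorted(m), k):
            frequent.add(frozenset(subset))

print(sorted(sorted(s) for s in frequent))
# [['a'], ['a', 'b'], ['a', 'c'], ['a', 'd'], ['b'], ['c'], ['d']]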

BORDER SET
An itemset is a border set if it is not a frequent set (infrequent), but all its proper
subsets are frequent sets.

• One can see that if X is an infrequent itemset, then it must have a subset that
is a border set.
• Proof: Since X is not frequent, it is possible that X itself is a border set. In that
case, the proof is done. Let us, hence, assume that X is not a border set.
• Then there exists at least one proper subset of X of cardinality |X| - 1 that is not
frequent, say X'. If X' is a border set, then the proof is complete. Let us, hence,
assume that X' is not a border set either. We recursively construct X, X', X'', ...,
having the common property that none of them is a frequent set or a border set,
and this construction process terminates when we get a set that is a border set.
The construction must terminate in a finite number of steps, as we decrease the
size of the set by 1 in every step. In the extreme case, we land at a singleton
itemset, which must be a border set (the empty itemset is always considered to
be a frequent set).
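A minimal Python sketch of this definition is given below: an itemset is a border set when it
is itself infrequent but every proper non-empty subset of it is frequent. The support counts
are taken from the frequent-set example earlier (minimum support count 3); treating the empty
itemset as frequent is implicit, so an infrequent singleton would also qualify.

from itertools import combinations

# Support counts from the frequent-set example above (min_count = 3).
support_count = {
    frozenset("A"): 3, frozenset("B"): 4, frozenset("C"): 4, frozenset("D"): 5,
    frozenset("AB"): 2, frozenset("AC"): 3, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 4, frozenset("CD"): 4,
}
min_count = 3

def is_frequent(itemset):
    return support_count.get(frozenset(itemset), 0) >= min_count

def is_border_set(itemset):
    # Infrequent itself, but every proper non-empty subset is frequent.
    if is_frequent(itemset):
        return False
    return all(is_frequent(sub)
               for k in range(1, len(itemset))
               for sub in combinations(itemset, k))

print(is_border_set({"A", "B"}))  # True: count({A,B}) = 2 < 3, but {A} and {B} are frequent
print(is_border_set({"A", "C"}))  # False: {A,C} is itself frequent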

• Note that if we know the set of all maximal frequent sets of a given transaction database T
with respect to a given minimum support, then we can find the set of all frequent sets
without any extra scan of the database. Thus, the set of all maximal frequent sets can act as
a compact representation of the set of all frequent sets. However, if we require the frequent
sets together with their respective support values in T, then we have to make one more
database pass to derive the support values once the set of all maximal frequent sets is
known.

APRIORI ALGORITHM

What is the Apriori Algorithm?


 The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding
frequent itemsets in a dataset for Boolean association rules. The algorithm is named
Apriori because it uses prior knowledge of frequent itemset properties.
 The Apriori algorithm is used for mining frequent itemsets and the relevant association
rules.
 Generally, the Apriori algorithm operates on a database containing a huge number of
transactions, for example, the items customers buy at a Big Bazar.
 The Apriori algorithm helps customers buy their products with ease and increases the
sales performance of the particular store.

Components of Apriori algorithm


The Apriori algorithm comprises the following three components:
1. Support
2. Confidence
3. Lift

Let's take an example to understand this concept.

Suppose you have 4,000 customer transactions in a Big Bazar. You have to calculate the
Support, Confidence, and Lift for two products, say Biscuits and Chocolates, because
customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200
transactions contain both Biscuits and Chocolates. Using this data, we will find out the
support, confidence, and lift.

Support
Support refers to the default popularity of any product. You find the support by dividing the
number of transactions containing that product by the total number of transactions. Hence,
we get
Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought
chocolates. You get the confidence by dividing the number of transactions that contain both
biscuits and chocolates by the number of transactions that contain biscuits.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions
involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.

Lift
Continuing the example above, lift measures how much more likely chocolates are to be sold
when biscuits are sold, compared with how often chocolates are sold in general. The equation
for lift is given below.
Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)
where Support (Chocolates) = 600/4000 = 15 percent, so
Lift = 50/15 ≈ 3.33
It means that customers are about 3.3 times more likely to buy chocolates when they buy
biscuits than to buy chocolates in general. If the lift value is below one, the two items are
unlikely to be bought together; a lift of exactly one means the items are independent. The
larger the lift value, the stronger the combination.
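The arithmetic above can be reproduced with the short Python sketch below; the counts
4000/400/600/200 are taken from the example, and the lift divides the confidence by the
support of the consequent (Chocolates).

total = 4000        # total transactions
biscuits = 400      # transactions containing Biscuits
chocolates = 600    # transactions containing Chocolates
both = 200          # transactions containing both

support_biscuits = biscuits / total        # 0.10 -> 10 percent
support_chocolates = chocolates / total    # 0.15 -> 15 percent
confidence = both / biscuits               # 0.50 -> 50 percent
lift = confidence / support_chocolates     # ~3.33

print(f"Support(Biscuits) = {support_biscuits:.0%}")
print(f"Confidence(Biscuits -> Chocolates) = {confidence:.0%}")
print(f"Lift(Biscuits -> Chocolates) = {lift:.2f}")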

Steps for Apriori Algorithm

Below are the steps for the apriori algorithm:

Step-1: Determine the support of the itemsets in the transactional database, and select the
minimum support and minimum confidence.

Step-2: Take all the itemsets in the transactions that have a support value higher than the
minimum (selected) support value.

Step-3: Find all the rules from these itemsets that have a confidence value higher than the
threshold (minimum) confidence.

Step-4: Sort the rules in decreasing order of lift.

Flow Chart
Apriori Algorithm Working
We will understand the apriori algorithm using an example and
mathematical calculation:

Example: Suppose we have the following dataset containing various transactions. From this
dataset, we need to find the frequent itemsets and generate the association rules using the
Apriori algorithm:
Solution:
Step-1: Calculating C1 and L1:
o In the first step, we will create a table that contains support count (The
frequency of each itemset individually in the dataset) of each itemset in
the given dataset. This table is called the Candidate set or C1.

o Now, we will take out all the itemsets that have a support count greater than or equal to
the minimum support (2). This gives us the table for the frequent itemset L1.
Since all the itemsets except {E} have a support count greater than or equal to the minimum
support, the itemset {E} will be removed.
Step-2: Candidate Generation C2, and L2:
o In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the
itemsets of L1 (all 2-itemsets).
o After creating the subsets, we will again find the support count from the
main transaction table of datasets, i.e., how many times these pairs have
occurred together in the given dataset. So, we will get the below table for
C2:

o Again, we need to compare the support count of each candidate in C2 with the minimum
support count; after comparing, the itemsets with a lower support count will be eliminated
from table C2. This gives us the below table for L2.

Step-3: Candidate generation C3, and L3:


o For C3, we will repeat the same two processes, but now we will form the C3 table with
itemsets of three items together, and will calculate the support count from the dataset. It
will give the below table:

o Now we will create the L3 table. As we can see from the above C3 table,
there is only one combination of itemset that has support count equal to
the minimum support count. So, the L3 will have only one combination,
i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:


To generate the association rules, first, we will create a new table with the possible rules
from the frequent combination {A, B, C}. For each rule, we will calculate the confidence
using the formula sup(X ∪ Y)/sup(X), i.e., the support count of the whole itemset divided by
the support count of the rule's antecedent X. After calculating the confidence value for all
rules, we will exclude the rules that have a confidence lower than the minimum threshold
(50%).

Consider the below table:

Rules      Support  Confidence
A^B → C    2        sup(A^B^C)/sup(A^B) = 2/4 = 0.50 = 50%
B^C → A    2        sup(B^C^A)/sup(B^C) = 2/4 = 0.50 = 50%
A^C → B    2        sup(A^C^B)/sup(A^C) = 2/4 = 0.50 = 50%
C → A^B    2        sup(C^A^B)/sup(C) = 2/5 = 0.40 = 40%
A → B^C    2        sup(A^B^C)/sup(A) = 2/6 = 0.33 = 33.33%
B → A^C    2        sup(B^A^C)/sup(B) = 2/7 = 0.29 = 28.57%

As the given threshold (minimum confidence) is 50%, the first three rules, A^B → C,
B^C → A, and A^C → B, can be considered strong association rules for the given problem.
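The confidence column above can be reproduced with the small Python sketch below. The
support counts are read off the denominators in the table (sup(A^B^C) = 2, sup(A^B) =
sup(B^C) = sup(A^C) = 4, sup(A) = 6, sup(B) = 7, sup(C) = 5), since the transaction table
itself is not reproduced here.

# Support counts implied by the confidence denominators in the table above.
sup = {
    frozenset("ABC"): 2,
    frozenset("AB"): 4, frozenset("BC"): 4, frozenset("AC"): 4,
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
}
min_conf = 0.5

rules = [("AB", "C"), ("BC", "A"), ("AC", "B"), ("C", "AB"), ("A", "BC"), ("B", "AC")]
for lhs, rhs in rules:
    conf = sup[frozenset(lhs + rhs)] / sup[frozenset(lhs)]
    status = "strong" if conf >= min_conf else "rejected"
    print(f"{'^'.join(lhs)} -> {'^'.join(rhs)}: {conf:.1%} ({status})")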
Apriori Algorithm Pseudo Code
Join Step: Ck is generated by joining Lk-1 with itself

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-
itemset

Pseudo-code:
Ck: Candidate itemset of size k

Lk: frequent itemset of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk (join and prune);
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
end
return ∪k Lk;
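Below is a minimal runnable Python sketch of the same join-and-prune loop (the function and
variable names are my own, not from the text), demonstrated on the nine-transaction dataset
used in the worked example that follows.

from itertools import combinations

def apriori(transactions, min_support_count):
    # Return {frozenset(itemset): support count} for every frequent itemset.
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    frequent = {}
    Lk = count({frozenset([item]) for t in transactions for item in t})  # L1
    k = 1
    while Lk:
        frequent.update(Lk)
        # Join step: combine frequent k-itemsets whose union has k+1 items.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = count(candidates)
        k += 1
    return frequent

# The nine transactions (T100-T900) from the example below, min support count = 2.
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
for itemset, sc in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sc)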

Apriori Algorithm Example


TID List of Items
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

Consider a database, D, consisting of 9 transactions.

Suppose min. support count required is 2 (i.e. min_sup = 2/9 = 22 % ).

Let the minimum confidence required be 70%.


We have to first find out the frequent itemset using Apriori algorithm.

Then, Association rules will be generated using min. support & min. confidence.

Step 1: Generating 1-Itemset Frequent Pattern

Itemset   Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

The above table is L1.

In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1.

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum
support.

Step 2: Generating 2-Itemset Frequent Pattern

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.

Next, the transactions in D are scanned and the support count for each candidate itemset in
C2 is accumulated (as shown in the middle table).

The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
Note: We haven’t used Apriori Property yet.

L1 = {I1,I2,I3,I4,I5}.

Since C2 = L1 join L1, we join {I1,I2,I3,I4,I5} with {I1,I2,I3,I4,I5}.

This gives C2 = [ {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4},
{I3,I5}, {I4,I5} ].

Now we need to check each candidate itemset in C2 against the minimum support count.

Then we get L2 = [ {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ].

Similarly, we do it for L3.

Step 3: Generating 3-Itemset Frequent Pattern

L2= [ {I1,I2} {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ].

C3 = L2 join L2.

The join gives the candidate itemsets {I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4},
{I2,I3,I5} and {I2,I4,I5}.

Now, the Join step is complete and the Prune step will be used to reduce the size of C3. The
Prune step helps to avoid heavy computation due to a large Ck.

Procedure:

Step 1: Join the itemsets of L2 that share their first item. Joining the itemsets starting
with I1 gives {I1,I2,I3}, {I1,I2,I5} and {I1,I3,I5}; joining the itemsets starting with I2
gives {I2,I3,I4}, {I2,I3,I5} and {I2,I4,I5}.

Step 2: The itemsets starting with I3, I4 and I5 produce no further joins (NIL).

Step 3: Find the infrequent itemsets using the minimum support count (and the Apriori
property) and remove them.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent,
we can determine that the four latter candidates cannot possibly be frequent. How?

For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its
2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.

BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori
property. Thus we have to remove {I2, I3, I5} from C3.

Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the
Join operation during pruning.

Now, the transactions in D are scanned in order to determine L3, consisting of those
candidates 3-itemsets in C3 having minimum support.
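The pruning check described above can be written as a small helper; a minimal Python sketch
is given below, where L2 and the unpruned C3 are taken from this example.

from itertools import combinations

L2 = [{"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"}, {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"}]
C3 = [{"I1", "I2", "I3"}, {"I1", "I2", "I5"}, {"I1", "I3", "I5"},
      {"I2", "I3", "I4"}, {"I2", "I3", "I5"}, {"I2", "I4", "I5"}]

def survives_prune(candidate, previous_level):
    # Apriori property: every (k-1)-subset of a frequent k-itemset must be frequent.
    return all(set(sub) in previous_level
               for sub in combinations(candidate, len(candidate) - 1))

pruned_C3 = [c for c in C3 if survives_prune(c, L2)]
print(pruned_C3)  # only {I1, I2, I3} and {I1, I2, I5} survive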

Step 4: Generating 4-Itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the
join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {{I2, I3, I5}} is not
frequent.

Thus, C4 = φ (null), and the algorithm terminates, having found all of the frequent itemsets.
This completes our Apriori algorithm.

What’s Next?

These frequent itemsets will be used to generate strong association rules ( where
strong association rules satisfy both minimum support & minimum confidence).

Step 5: Generating Association Rules From Frequent Itemsets

Procedure:
 For each frequent itemset “l”, generate all nonempty subsets of l.
 For every non-empty subset s of l, output the rule “s -> (l-s)” if support_count(l) /
support_count(s) >= min_conf where min_conf is minimum confidence threshold.
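The procedure above can be sketched in Python as follows; support_count holds the support
counts of the frequent itemset {I1, I2, I5} and its subsets from the earlier steps, and the
function name is my own.

from itertools import combinations

def rules_from_itemset(l, support_count, min_conf):
    # Generate the strong rules s -> (l - s) from one frequent itemset l.
    l = frozenset(l)
    strong = []
    for k in range(1, len(l)):
        for s in combinations(sorted(l), k):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                strong.append((set(s), set(l - s), conf))
    return strong

# Support counts from the worked example (L1, L2 and L3 tables above).
support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}
for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I5"}, support_count, min_conf=0.7):
    print(f"{sorted(lhs)} -> {sorted(rhs)} (confidence {conf:.0%})")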

From the above example

Let the minimum confidence threshold be, say, 70%.

The resulting association rules are shown below, each listed with its confidence.

R1: I1 ^ I2 -> I5

 Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%.


 R1 is Rejected.
R2: I1 ^ I5 -> I2
 Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%.
 R2 is Selected.
R3: I2 ^ I5 -> I1
 Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%.
 R3 is Selected.
R4: I1 -> I2 ^ I5
 Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%.
 R4 is Rejected.
R5: I2 -> I1 ^ I5
 Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%.
 R5 is Rejected.
R6: I5 -> I1 ^ I2
 Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%.
 R6 is Selected.

In this way, we have found three strong association rules (R2, R3, and R6).
