Association Rule Mining
FREQUENT SET
Refers to a set of items that frequently appear together, for example, Python and Big Data Analytics, when computer science students frequently choose these two subjects for in-depth study.
A Frequent Itemset (FI) is a subset of items that appears frequently in a dataset.
Example on finding Frequent Itemsets – Consider the given dataset with the given transactions.
Let us say the minimum support count is 3.
The relation that holds is: maximal frequent => closed => frequent (every maximal frequent itemset is a closed frequent itemset, and every closed frequent itemset is a frequent itemset).
1-frequent: {A} = 3 // not closed due to {A, C}, and not maximal
{B} = 4 // not closed due to {B, D}, and not maximal
{C} = 4 // not closed due to {C, D}, and not maximal
{D} = 5 // closed itemset, since no immediate superset has the same count; not maximal
2-frequent: {A, B} = 2 // not frequent because support count < minimum support count, so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed, but not maximal due to {B, C, D}
{C, D} = 4 // closed, but not maximal due to {B, C, D}
3-frequent: {A, B, C} = 2 // ignore, not frequent because support count < minimum support count
{A, B, D} = 2 // ignore, not frequent because support count < minimum support count
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
4-frequent: {A, B, C, D} = 2 // ignore, not frequent
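The transaction table itself is not reproduced above, so the following Python sketch assumes a small five-transaction dataset that is consistent with the support counts listed; the dataset and the variable names are illustrative assumptions, not part of the original example. It enumerates every itemset, keeps the frequent ones, and flags which of them are closed and which are maximal.

from itertools import combinations

# Assumed transactions, reconstructed to be consistent with the counts above.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"B", "D"},
]
min_support = 3
items = sorted(set().union(*transactions))

# Support count of every non-empty itemset.
support = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        support[s] = sum(1 for t in transactions if s <= t)

frequent = {s: c for s, c in support.items() if c >= min_support}

# Closed: no proper superset has the same support count.
closed = {s for s in frequent
          if not any(s < t and support[t] == support[s] for t in support)}
# Maximal: no proper superset is frequent.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

for s in sorted(frequent, key=lambda x: (len(x), sorted(x))):
    tags = [name for name, group in (("closed", closed), ("maximal", maximal)) if s in group]
    print(sorted(s), frequent[s], ", ".join(tags))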
Advantage:
Maximal frequent itemsets provide a compact representation of all the frequent itemsets of a particular dataset. In the above example, every frequent itemset is a subset of one of the maximal frequent itemsets, since all of them can be obtained by enumerating the subsets of {A, C, D} and {B, C, D} (including the maximal frequent itemsets themselves).
Disadvantage:
The support count of maximal frequent itemsets does not provide any information about the support counts of their subsets. This means that an additional pass over the data is needed to determine the support counts of the non-maximal frequent itemsets, which may be undesirable in certain cases.
BORDER SET
An itemset is a border set if it is not a frequent set (infrequent), but all its proper
subsets are frequent sets.
• One can see that if X is an infrequent itemset, then it must have a subset (possibly X itself) that is a border set. Since X is not frequent, it is possible that X is itself a border set.
• In that case, the proof is done. Let us assume that X is not a border set.
• Hence, there exists at least one proper subset of cardinality |X| - 1 that is not frequent, say X'. If X' is a border set, then the proof is complete. Let us, hence, assume that X' is not a border set either. We recursively construct X, X', X'', ..., and so on, having the common property that none of these is a frequent set or a border set, and this construction terminates when we get a set which is a border set. The construction must terminate in a finite number of steps, as we are decreasing the size of the sets by 1 in every step. In the most extreme case, we may land up at a singleton itemset, which is then itself a border set, since its only proper subset, the empty itemset, is always considered to be a frequent set.
• Note that if we know the set of all maximal frequent sets of a given transaction set T with respect to a given minimum support threshold, then we can find the set of all frequent sets without any extra scan of the database. Thus, the set of all maximal frequent sets can act as a compact representation of the set of all frequent sets. However, if we require the frequent sets together with their respective support values in T, then we have to make one more database pass to derive the support values once the set of all maximal frequent sets is known.
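As a small illustration of the definition above, here is a minimal Python sketch of a border-set check; the function names and the example dataset (the same assumed five-transaction dataset used earlier) are illustrative.

from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    s = frozenset(itemset)
    return sum(1 for t in transactions if s <= set(t))

def is_border_set(itemset, transactions, min_count):
    """True if `itemset` is infrequent while all of its proper subsets are frequent.
    Checking only the subsets of size |itemset| - 1 is enough: if they are all
    frequent, every smaller subset is frequent too (downward closure)."""
    if support_count(itemset, transactions) >= min_count:
        return False                      # a frequent set is never a border set
    return all(support_count(sub, transactions) >= min_count
               for sub in combinations(itemset, len(itemset) - 1))

# Using the assumed five-transaction dataset, with minimum support count 3:
transactions = [{"A", "B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "C", "D"}, {"B", "C", "D"}, {"B", "D"}]
print(is_border_set(["A", "B"], transactions, 3))   # True: {A, B} is infrequent, {A} and {B} are frequent
print(is_border_set(["A", "C"], transactions, 3))   # False: {A, C} is itself frequent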
APRIORI ALGORITHM
Suppose you have 4,000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolates, because customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200 transactions contain both Biscuits and Chocolates. Using this data, we will find out the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought Biscuits also bought Chocolates. You find it by dividing the number of transactions that contain both Biscuits and Chocolates by the number of transactions that contain Biscuits.
Hence,
Confidence = (Transactions containing both Biscuits and Chocolates) / (Total transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of the customers who bought Biscuits also bought Chocolates.
Lift
Consider the above example; lift refers to the increase in the likelihood of selling Chocolates when Biscuits are sold, compared with how often Chocolates are sold in general. The mathematical equation of lift is given below.
Lift (Biscuits -> Chocolates) = Confidence (Biscuits -> Chocolates) / Support (Chocolates)
= 50/15 ≈ 3.33
It means that customers are about 3.33 times more likely to buy Chocolates when they buy Biscuits than they are to buy Chocolates in general. If the lift value is below one, it indicates that people are unlikely to buy both items together. The larger the value, the better the combination.
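As a quick arithmetic check of the three metrics above, here is a minimal Python sketch using the stated counts; the variable names are illustrative.

total_transactions = 4000
biscuit_count = 400        # transactions containing Biscuits
chocolate_count = 600      # transactions containing Chocolates
both_count = 200           # transactions containing both products

support_biscuits = biscuit_count / total_transactions        # 0.10 -> 10 percent
support_chocolates = chocolate_count / total_transactions    # 0.15 -> 15 percent
support_both = both_count / total_transactions               # 0.05 -> 5 percent

# Confidence of the rule Biscuits -> Chocolates
confidence = support_both / support_biscuits                 # 0.50 -> 50 percent

# Lift of the rule Biscuits -> Chocolates
lift = confidence / support_chocolates                       # about 3.33

print(f"Support(Biscuits)  = {support_biscuits:.0%}")
print(f"Confidence         = {confidence:.0%}")
print(f"Lift               = {lift:.2f}")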
Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all the itemsets with a support value higher than the minimum (selected) support value.
Step-3: Find all the rules of these subsets that have a confidence value higher than the threshold (minimum) confidence.
Flow Chart
Apriori Algorithm Working
We will understand the Apriori algorithm using an example and mathematical calculation:
Step-1: Candidate Generation C1, and L1:
o Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum Support (2). This gives us the table for the frequent itemset L1.
o Since all the itemsets except {E} have a support count greater than or equal to the minimum support, the itemset {E} will be removed.
Step-2: Candidate Generation C2, and L2:
o In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. So, we will get the below table for C2.
o The itemsets of C2 whose support count is greater than or equal to the minimum support form the frequent 2-itemset table L2.
Step-3: Candidate Generation C3, and L3:
o Now we will create the L3 table. As we can see from the above C3 table, there is only one combination of itemsets that has a support count greater than or equal to the minimum support count. So, L3 will have only one combination, i.e., {A, B, C}.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-
itemset
Pseudo-code:
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
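As a complement to the pseudo-code, here is a minimal runnable Python sketch of the same loop, assuming min_count is given as an absolute support count. It combines the join and prune steps into a single subset check rather than the classical prefix-based join. The function and variable names, and the example dataset at the end (reconstructed to match the support counts used in the worked example below), are illustrative assumptions.

from itertools import combinations

def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}   # candidate 1-itemsets C1
    frequent = {}
    k = 1
    while current:
        # One pass over the data to count the current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}   # Lk
        frequent.update(level)
        # Generate C(k+1): unions of two frequent k-itemsets, kept only if
        # every k-subset is frequent (this also performs the prune step).
        level_sets = set(level)
        current = {a | b for a in level_sets for b in level_sets
                   if len(a | b) == k + 1
                   and all(frozenset(s) in level_sets for s in combinations(a | b, k))}
        k += 1
    return frequent

# Assumed nine-transaction dataset consistent with the support counts below.
D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
     ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
     ["I1", "I2", "I3"]]
result = apriori(D, 2)
for itemset in sorted(result, key=lambda x: (len(x), sorted(x))):
    print(sorted(itemset), result[itemset])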
Then, association rules will be generated using the minimum support and minimum confidence.
Itemset   Support Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum
support.
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset in
C2 is accumulated (as shown in the middle table).
The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
Note: We haven’t used Apriori Property yet.
L1 = {I1, I2, I3, I4, I5}.
C2 = L1 Join L1 = [ {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5} ].
Now we need to check which of these itemsets meet the minimum support count.
Then we get L2 = [ {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ].
C3 = L2 Join L2, i.e.,
C3 = [ {I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5} ].
Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus, we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.
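For concreteness, here is a small Python sketch of this join-and-prune step, starting from the L2 itemsets listed above (support counts are not needed for the pruning itself); the classical join merges two 2-itemsets that agree on their first item. The variable names are illustrative.

from itertools import combinations

# L2 itemsets, kept as lexicographically ordered tuples.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
L2_sets = {frozenset(s) for s in L2}

# Join step: merge two itemsets of L2 that share their first item.
joined = sorted({a + (b[-1],) for a in L2 for b in L2
                 if a[:-1] == b[:-1] and a[-1] < b[-1]})

# Prune step: drop any candidate that has a 2-subset not present in L2.
C3 = [c for c in joined
      if all(frozenset(s) in L2_sets for s in combinations(c, 2))]

print(joined)   # the six candidates produced by the join
print(C3)       # only ('I1', 'I2', 'I3') and ('I1', 'I2', 'I5') survive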
Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
The algorithm then uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {I1, I2, I3, I5}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
Thus, C4 = ∅ (the empty set), and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
What’s Next?
These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
Procedure:
For each frequent itemset “l”, generate all nonempty subsets of l.
For every non-empty subset s of l, output the rule “s -> (l - s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
The resulting association rules are shown below, each listed with its confidence (computed from the support counts of l = {I1, I2, I5} and its subsets):
R1: I1 ^ I2 -> I5, confidence = 2/4 = 50%
R2: I1 ^ I5 -> I2, confidence = 2/2 = 100%
R3: I2 ^ I5 -> I1, confidence = 2/2 = 100%
R4: I1 -> I2 ^ I5, confidence = 2/6 = 33%
R5: I2 -> I1 ^ I5, confidence = 2/7 = 29%
R6: I5 -> I1 ^ I2, confidence = 2/2 = 100%
With a minimum confidence threshold of, say, 70%, only the rules with 100% confidence qualify. In this way, we have found three strong association rules (R2, R3, R6).
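The rule-generation procedure above can be sketched in a few lines of Python. The support counts below are the ones used in this example; the 2-itemset counts come from the assumed nine-transaction dataset, so treat them as illustrative.

from itertools import combinations

def rules_for(l, support, min_conf):
    """Yield (antecedent, consequent, confidence) for the frequent itemset l,
    keeping only rules whose confidence reaches min_conf."""
    l = frozenset(l)
    for r in range(1, len(l)):                    # sizes of nonempty proper subsets
        for s in combinations(sorted(l), r):
            antecedent = frozenset(s)
            confidence = support[l] / support[antecedent]
            if confidence >= min_conf:
                yield set(antecedent), set(l - antecedent), confidence

# Support counts for l = {I1, I2, I5} and its subsets.
support = {
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2,
}

for antecedent, consequent, conf in rules_for({"I1", "I2", "I5"}, support, 0.7):
    print(f"{sorted(antecedent)} -> {sorted(consequent)}  confidence = {conf:.0%}")
# Prints the three strong rules R2, R3 and R6.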