Association Analysis and Frequent Sequential Pattern Mining-Apriori Algorithm
• Association mining is a technique that can discover interesting relationships hidden in transaction datasets.
• This approach first finds all frequent item sets and then generates strong association rules from them.
• Apriori is the best-known association mining algorithm. It first identifies frequent individual items and
then uses a breadth-first search to extend them to larger item sets, one item at a time, until no larger frequent
item sets can be found.
• The purpose of association mining is to discover associations among items from the transactional database.
• Typically, the process of association mining begins by finding the item sets whose support is greater than the
minimum support.
• Next, the process uses these frequent item sets to generate strong rules (for example, milk => bread; a customer who
buys milk is likely to buy bread) whose confidence is greater than the minimum confidence.
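• By definition, an association rule can be expressed in the form X=>Y, where X and Y are disjoint item sets.
• We can measure the strength of an association with two measures: support and confidence.
• Support is the fraction of transactions in the dataset that contain every item of both X and Y, while confidence is
the fraction of transactions containing X that also contain Y, that is, an estimate of the conditional probability of Y
given X:
• $\mathrm{support}(X \Rightarrow Y) = P(X \cup Y)$
• $\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \dfrac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$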
• All other combinations of frequent item sets in L3 failed the minimum support test.
• These rules would now need to be evaluated, possibly subjectively by the users, for interestingness.
• Here the focus is on cases where, according to this data, a customer who buys one type of book is also likely
to buy the other type.
• Another indication is that if a customer never bought a paperback, they are not likely to buy a hardback, and
vice versa.
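The support and confidence measures defined earlier in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five transactions are made-up toy data, and the function names `support` and `confidence` are hypothetical helpers, not part of any library.

```python
# Toy transactions (hypothetical data, not taken from the slides).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    both = set(lhs) | set(rhs)
    return count(both, transactions) / count(lhs, transactions)


# Rule milk => bread on the toy data.
milk_bread_support = support({"milk", "bread"}, transactions)
milk_bread_confidence = confidence({"milk"}, {"bread"}, transactions)
print(milk_bread_support, milk_bread_confidence)
```

On this toy data, milk and bread appear together in three of the five transactions, and three of the four transactions containing milk also contain bread.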
The Apriori algorithm to find association rules within transactions:
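A minimal sketch of the level-wise search described earlier (frequent single items first, then breadth-first extension to larger item sets) might look as follows. It is an illustrative Python version, not the implementation used in the later example: the function name `apriori_frequent_itemsets` and the toy transactions are assumptions, and the threshold is applied as "support >= min_support" for convenience.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all item sets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # Count each candidate and keep those that meet the threshold
        # (assumption: ">=" rather than a strict ">").
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent individual items.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = frequent_among(singles)
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions (hypothetical data).
toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives the algorithm its name: an item set can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before any counting.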
An application to a real-world dataset
• We use the built-in Groceries dataset, which contains one month of real-world point-of-sale transaction data
from a typical grocery outlet.
• We then use the summary function to obtain the summary statistics of the Groceries dataset.
• The summary statistics show that the dataset contains 9,835 transactions, with the items grouped into 169
categories.
• In addition, the summary shows information such as the most frequent items, the itemset length distribution, and
examples of the extended item information within the dataset.
• We can then use itemFrequencyPlot to visualize the five most frequent items with support over 0.1.
• Next, we apply the Apriori algorithm to search for rules with support over 0.001 and confidence over 0.5.
• We then use the summary function to inspect detailed information on the generated rules. From the output
summary, we find the Apriori algorithm generates 5,668 rules with support over 0.001 and confidence over
0.5.
• Further, the summary gives the rule length distribution, a summary of the quality measures, and the mining
information. In the summary of the quality measures, we find descriptive statistics for three measures:
support, confidence, and lift.
• Support is the proportion of transactions containing a certain itemset.
• Confidence is the proportion of transactions containing the left-hand side of the rule that also contain its
right-hand side. Lift is the confidence of the rule divided by the support of its right-hand side; a lift above 1
means the two sides occur together more often than would be expected if they were independent.
• To explore some generated rules, we can use the inspect function to view the first six rules of the 5,668
generated rules.
• Lastly, we can sort rules by confidence and list rules with the most confidence.
• Therefore, we find that the rule {rice, sugar} => {whole milk} is the most confident rule, with support equal
to 0.001220132, confidence equal to 1, and lift equal to 3.913649.
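The rule-generation and ranking steps above (threshold on support and confidence, compute lift, sort by confidence) can be sketched in plain Python. This is an illustration on made-up toy data, not a reproduction of the Groceries run: the function `generate_rules` is a hypothetical helper, the thresholds are applied as ">=", and item sets are enumerated by brute force for brevity, whereas Apriori would prune the search.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not the Groceries dataset).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
n = len(transactions)


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)


def generate_rules(min_support, min_confidence):
    """Return (lhs, rhs, support, confidence, lift) tuples, sorted by confidence."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            both = count(itemset)
            if both / n < min_support:
                continue
            # Split the frequent item set into every non-empty lhs / rhs pair.
            for r in range(1, size):
                for lhs in combinations(itemset, r):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    conf = both / count(lhs)
                    if conf >= min_confidence:
                        # Lift: confidence relative to the baseline support of rhs.
                        lift = conf / (count(rhs) / n)
                        rules.append((lhs, rhs, both / n, conf, lift))
    # Highest-confidence rules first.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


rules = generate_rules(min_support=0.4, min_confidence=0.5)
for lhs, rhs, s, conf, lift in rules:
    print(lhs, "=>", rhs, round(s, 3), round(conf, 3), round(lift, 3))
```

The same relation between confidence and lift gives a quick consistency check on the figures reported for the top Groceries rule: with confidence 1 and lift 3.913649, the implied support of whole milk is 1 / 3.913649, or about 0.2555.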