Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

2011 PDPTA
Jongwook Woo, PhD
[email_address]
High-Performance Internet Computing Center (HiPIC)
Computer Information Systems Department
California State University, Los Angeles

Contents
Map/Reduce Brief Introduction
Market Basket Analysis
Map/Reduce Algorithm for MBA
Experimental Result
Conclusion

What is Map/Reduce
Cloud Computing: Cloudera, HortonWorks, AWS
Parallel Computing

Have you heard about Cloud Computing? First Impression
In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web.
The archive was a four-terabyte pile of images in TIFF format, which needed to be translated into more web-friendly PDF files.
This was not a particularly complicated job, but it was a large computing chore requiring a whole lot of computer processing time.
Derek Gottfrid, a software programmer at the Times who was playing around with Amazon Web Services' Elastic Compute Cloud (EC2), uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3).
In less than 24 hours, 11,000 PDFs were stored neatly in S3 and ready to be served up to visitors to the Times site.
The total cost for the computing job? $240: 10 cents per computer-hour times 100 computers times 24 hours.

What is MapReduce
Functions borrowed from functional programming languages (e.g. Lisp)
Provides a restricted parallel programming model
User implements Map() and Reduce()
Libraries (Hadoop) take care of EVERYTHING else:
  Parallelization
  Fault Tolerance
  Data Distribution
  Load Balancing
Useful for huge (petabytes or terabytes) but non-complicated data:
  New York Times case
  Log files for web companies

Map
Converts data to (key, value) pairs
map() functions run in parallel, creating different intermediate values from different input data sets

Reduce
reduce() combines those intermediate values into one or more final values for that same output key
reduce() functions also run in parallel, each working on a different output key
Bottleneck: the reduce phase can’t start until the map phase is completely finished.

Example: Sort URLs in the largest hit order
Map()
  Input: <logFilename, file text>
  Parses the file and emits <url, hit count> pairs, e.g. <http://hello.com, 1>
Reduce()
  Sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>
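The URL example can be sketched in plain Python, without Hadoop. This is only an illustration: the log format (one requested URL per line), the sample log contents, and the helper names `map_log`, `shuffle`, and `reduce_hits` are assumptions, since the slide does not specify them.

```python
from collections import defaultdict

def map_log(log_filename, file_text):
    """Map: emit a (url, 1) pair for every non-empty line of the log text."""
    # Assumption: each log line holds exactly one requested URL.
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield url, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_hits(url, counts):
    """Reduce: sum all values for the same URL."""
    return url, sum(counts)

# Hypothetical log files, used only for this sketch.
logs = {
    "log1.txt": "http://hello.com\nhttp://world.org\nhttp://hello.com\n",
    "log2.txt": "http://hello.com\nhttp://world.org\n",
}

intermediate = [pair for name, text in logs.items() for pair in map_log(name, text)]
totals = [reduce_hits(url, counts) for url, counts in shuffle(intermediate).items()]

# Final ordering by hit count, largest first.
ranked = sorted(totals, key=lambda item: item[1], reverse=True)
print(ranked)
```

In this sketch the final ordering is done by a plain `sorted()` call after the reduce step; a real job would likely need a second pass or a framework feature to sort by count.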

Market Basket Analysis (MBA)
Collects the pairs of transaction items that most frequently occur together at one or more stores
A traditional Business Intelligence analysis
Gives a much better opportunity to make a profit by controlling the order of products and marketing:
  control the stocks more intelligently
  arrange items on shelves
  promote items together
  etc.

Market Basket Analysis (MBA)
Transactions in Store A: input data
  Transaction 1: cracker, icecream, beer
  Transaction 2: chicken, pizza, coke, bread
  Transaction 3: baguette, soda, herring, cracker, beer
  Transaction 4: bourbon, coke, turkey
  Transaction 5: sardines, beer, chicken, coke
  Transaction 6: apples, peppers, avocado, steak
  Transaction 7: sardines, apples, peppers, avocado, steak
  …
Which pairs of items do people frequently buy together at Store A?

Map Algorithm
1: Read each transaction of the input file and generate the data set of items: (<V_1>, <V_2>, …, <V_n>) where <V_n>: (v_n1, v_n2, …, v_nm)
2: Sort each data set <V_n> to generate the sorted data set <U_n>: (<U_1>, <U_2>, …, <U_n>) where <U_n>: (u_n1, u_n2, …, u_nm)
3: Loop: for each item from u_n1 to u_nm of <U_n>
   3.a: generate the data set <Y_n>: (y_n1, y_n2, …, y_nl); y_nl: (u_nx, u_ny) where u_nx ≠ u_ny
   3.b: increment the occurrence of y_nl; note: (key, value) = (y_nl, number of occurrences)
4: The data set is created as input of the Reducer: (key, <value>) = (y_nl, <number of occurrences>)
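The map steps can be sketched for a single transaction as follows. This is a minimal illustration, not the author's Hadoop code: the function name `map_transaction` and the comma-separated input format are assumptions, and duplicate items within one transaction are dropped so that the two items of a pair always differ.

```python
from itertools import combinations

def map_transaction(line):
    """Emit ((item_a, item_b), 1) for every pair of distinct items in one transaction."""
    # Step 1: read the items of the transaction (assumed comma-separated).
    items = [item.strip() for item in line.split(",") if item.strip()]
    # Step 2: sort the items; set() drops repeated items (an assumption).
    sorted_items = sorted(set(items))
    # Step 3: combinations() over a sorted sequence yields each pair once,
    # already in sorted order, so the pair can serve directly as a key.
    # Step 4: each pair is emitted with an occurrence count of 1.
    return [(pair, 1) for pair in combinations(sorted_items, 2)]

print(map_transaction("cracker, icecream, beer"))
```

Emitting a count of 1 per pair, rather than keeping a running total in the mapper, leaves all of the summing to the reduce step.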

Reduce Algorithm
1: Take (y_nl, <number of occurrences>) as input data from multiple Map nodes
2: Add the values for y_nl to produce (y_nl, total number of occurrences) as output
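The reduce step amounts to a sum per key. A minimal sketch, assuming the hypothetical name `reduce_pair` and made-up input counts:

```python
def reduce_pair(pair, occurrences):
    """Add up the occurrence counts received for one item pair."""
    return pair, sum(occurrences)

# Hypothetical counts for one key, as they might arrive from three Map nodes.
result = reduce_pair(("beer", "cracker"), [1, 1, 1])
print(result)
```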

Market Basket Analysis (Cont’d)
Transactions in Store A
  Transaction 1: cracker, icecream, beer
  Transaction 2: chicken, pizza, coke, bread
  …
Distribute transaction data to Map nodes
Pairs of items restructured in each Map node
  Transaction 1: <(cracker, icecream), (cracker, beer), (beer, icecream)>
  Transaction 2: <(chicken, pizza), (chicken, coke), (chicken, bread), (coke, pizza), (bread, pizza), (coke, bread)>
  …

Market Basket Analysis (Cont’d)
Note: the items within each pair should be sorted, because the pair becomes a key
  For example, (cracker, icecream) and (icecream, cracker) should both become (cracker, icecream)
Pairs of items sorted in MBA
  Transaction 1: <(cracker, icecream), (beer, cracker), (beer, icecream)>
  Transaction 2: <(chicken, pizza), (chicken, coke), (bread, chicken), (coke, pizza), (bread, pizza), (bread, coke)>
  …

Market Basket Analysis (Cont’d)
Output of Map node
Pairs of items in (key, value) structure in each Map node
(key, value): (pair of items, number of occurrences)
  ((cracker, icecream), 1)
  ((beer, cracker), 1)
  ((beer, icecream), 1)
  ((chicken, pizza), 1)
  ((chicken, coke), 1)
  ((bread, chicken), 1)
  ((coke, pizza), 1)
  ((bread, pizza), 1)
  ((bread, coke), 1)
  …

Market Basket Analysis (Cont’d)
Data Aggregation/Combine
(key, <value>): (pair of items, list of occurrence counts)
  ((cracker, icecream), <1, 1, …, 1>)
  ((beer, cracker), <1, 1, …, 1>)
  ((beer, icecream), <1, 1, …, 1>)
  ((chicken, pizza), <1, 1, …, 1>)
  …
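The aggregation between the map and reduce phases can be pictured as grouping values by key. The sketch below assumes a hypothetical helper `aggregate` and two made-up Map-node outputs; in Hadoop this grouping is done by the framework rather than by user code.

```python
from collections import defaultdict

def aggregate(map_outputs):
    """Group the counts emitted by all Map nodes into one list per item pair."""
    grouped = defaultdict(list)
    for node_output in map_outputs:
        for pair, count in node_output:
            grouped[pair].append(count)
    return dict(grouped)

# Hypothetical outputs of two Map nodes.
node_1 = [(("cracker", "icecream"), 1), (("beer", "cracker"), 1), (("beer", "icecream"), 1)]
node_2 = [(("beer", "cracker"), 1), (("chicken", "pizza"), 1)]

grouped = aggregate([node_1, node_2])
print(grouped)
```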

Market Basket Analysis (Cont’d)
Reduce nodes
(key, value): (pair of items, total number of occurrences)
  ((cracker, icecream), 421)
  ((beer, cracker), 341)
  ((beer, icecream), 231)
  ((chicken, pizza), 111)
  …

Map/Reduce for MBA (diagram)
Input transaction data is distributed to Map 1(), Map 2(), …, Map m()
Map output: ((coke, pizza), 1), ((bear, corn), 1), …, ((ham, juice), 1), ((coke, pizza), 1), …
Data Aggregation/Combine: ((coke, pizza), <1, 1, …, 1>), ((ham, juice), <1, 1, …, 1>)
Reduce 1(), Reduce 2(), …, Reduce l() output: ((coke, pizza), 3,421), ((ham, juice), 2,346)
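The whole flow can be imitated on one machine with the seven sample transactions from Store A. This is a sketch under stated assumptions: a thread pool stands in for the Map nodes, the names `map_task` and `run_mba` and the choice of three map tasks are hypothetical, and "hering" is spelled "herring".

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

TRANSACTIONS = [
    "cracker, icecream, beer",
    "chicken, pizza, coke, bread",
    "baguette, soda, herring, cracker, beer",
    "bourbon, coke, turkey",
    "sardines, beer, chicken, coke",
    "apples, peppers, avocado, steak",
    "sardines, apples, peppers, avocado, steak",
]

def map_task(chunk):
    """Map: emit (sorted pair, 1) for every item pair in every transaction of the chunk."""
    output = []
    for line in chunk:
        items = sorted({item.strip() for item in line.split(",")})
        output.extend((pair, 1) for pair in combinations(items, 2))
    return output

def run_mba(transactions, map_tasks=3):
    # Distribute the transactions over the map tasks (round-robin split).
    chunks = [transactions[i::map_tasks] for i in range(map_tasks)]
    with ThreadPoolExecutor(max_workers=map_tasks) as pool:
        map_outputs = list(pool.map(map_task, chunks))
    # Aggregate: group the emitted counts by pair.
    grouped = defaultdict(list)
    for output in map_outputs:
        for pair, count in output:
            grouped[pair].append(count)
    # Reduce: total number of occurrences per pair.
    return {pair: sum(counts) for pair, counts in grouped.items()}

counts = run_mba(TRANSACTIONS)
top = max(counts.values())
frequent = sorted(pair for pair, n in counts.items() if n == top)
print(top, frequent)
```

The aggregation only starts once every map task has returned, which mirrors the point that the reduce phase waits for the map phase to finish.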

Experimental Result
5 transaction files for the experiment: 400 MB (6.7M transactions), 800 MB (13M transactions), 1.6 GB (26M transactions)
Run on small instances of AWS EC2; each node has:
  1.0-1.2 GHz 2007 Opteron or Xeon processor
  1.7 GB memory
  160 GB storage on a 32-bit platform
The jobs are run on 2, 5, 10, 15, and 20 nodes

Experimental Result
Execution time (sec)
Nodes   6.7M (400MB)   13M (800MB)   26M (1.6GB)
2       9,133          NA            NA
5       5,442          8,717         15,963
10      2,910          5,998         8,845
15      2,792          2,917         5,898
20      2,868          2,911         5,671
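One way to read these timings is as speedup relative to the 5-node run, the smallest cluster with a measurement for every file. The sketch below assumes the table's columns are the 400MB, 800MB, and 1.6GB files and its rows are the node counts 2 to 20; the name `speedup` is hypothetical.

```python
# Execution times in seconds as reported in the table; None marks the NA entries.
TIMES = {
    2:  {"400MB": 9133, "800MB": None, "1.6GB": None},
    5:  {"400MB": 5442, "800MB": 8717, "1.6GB": 15963},
    10: {"400MB": 2910, "800MB": 5998, "1.6GB": 8845},
    15: {"400MB": 2792, "800MB": 2917, "1.6GB": 5898},
    20: {"400MB": 2868, "800MB": 2911, "1.6GB": 5671},
}

def speedup(dataset, base_nodes, nodes):
    """Return time(base_nodes) / time(nodes), or None if either run is NA."""
    base = TIMES[base_nodes][dataset]
    current = TIMES[nodes][dataset]
    if base is None or current is None:
        return None
    return base / current

for dataset in ("400MB", "800MB", "1.6GB"):
    row = {n: round(speedup(dataset, 5, n), 2) for n in (10, 15, 20)}
    print(dataset, row)
```

For the 400MB file the ratio from 15 to 20 nodes is below 1, meaning the 20-node run was slightly slower than the 15-node run.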

Experimental Result [figure: Execution time (sec)]

Conclusion
The Market Basket Analysis algorithm on Map/Reduce is presented: a data mining analysis to find the pairs of products that most frequently occur together in baskets at a store.
The associated items can be paired with the Map/Reduce approach.
Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond the scope of this paper.
Map/Reduce has a bottleneck in distributing, aggregating, and reducing the data set among nodes.
