Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing
2011 PDPTA
Jongwook Woo, PhD
[email_address]
High-Performance Internet Computing Center (HiPIC)
Computer Information Systems Department
California State University, Los Angeles

Contents
- Map/Reduce Brief Introduction
- Market Basket Analysis
- Map/Reduce Algorithm for MBA
- Experimental Result
- Conclusion

What is Map/Reduce
(Diagram labels: Cloud Computing, Cloudera, HortonWorks, AWS, Parallel Computing)

Have you heard about Cloud Computing? First Impression
- In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web.
- The archive was a four-terabyte pile of images in TIFF format, which needed to be translated into more web-friendly PDF files.
- This was not a particularly complicated computing chore, but a large one, requiring a whole lot of computer processing time.
- A software programmer at the Times, Derek Gottfrid, had been playing around with Amazon Web Services. Using the Elastic Compute Cloud (EC2), he uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3).
- In less than 24 hours, he had 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.
- The total cost for the computing job? $240: 10 cents per computer-hour times 100 computers times 24 hours.

What is MapReduce
- Functions borrowed from functional programming languages (e.g., Lisp)
- Provides a restricted parallel programming model
  - User implements Map() and Reduce()
  - Libraries (Hadoop) take care of EVERYTHING else:
    - Parallelization
    - Fault Tolerance
    - Data Distribution
    - Load Balancing
- Useful for huge (petabytes or terabytes) but non-complicated data
  - New York Times case
  - Log files for web companies

Map
- Convert data to (key, value) pairs
- map() functions run in parallel, creating different intermediate values from different input data sets

Reduce
- reduce() combines those intermediate values into one or more final values for that same output key
- reduce() functions also run in parallel, each working on a different output key
- Bottleneck: the reduce phase can’t start until the map phase is completely finished.

Example: Sort URLs in order of hit count, largest first
- Map()
  - Input: <logFilename, file text>
  - Parses the file and emits <url, hit count> pairs, e.g. <http://hello.com, 1>
- Reduce()
  - Sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>

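The URL example can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code: the log contents, the one-URL-per-line format, and the names `LOGS`, `map_hits`, `reduce_hits`, and `rank_urls` are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical input: log file name -> file text, one requested URL per line.
LOGS = {
    "access-1.log": "http://hello.com\nhttp://world.com\nhttp://hello.com",
    "access-2.log": "http://hello.com\nhttp://world.com",
}

def map_hits(log_filename, file_text):
    """Map: parse one log file and emit a (url, 1) pair per request."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield url, 1

def reduce_hits(url, counts):
    """Reduce: sum all values that share the same key."""
    return url, sum(counts)

def rank_urls(logs):
    # Group intermediate values by key (the framework normally does this).
    grouped = defaultdict(list)
    for name, text in logs.items():
        for url, one in map_hits(name, text):
            grouped[url].append(one)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    # Largest hit count first.
    return sorted(totals, key=lambda kv: kv[1], reverse=True)

ranking = rank_urls(LOGS)
print(ranking)
```

The final sort runs on the reduced totals, which are far smaller than the raw logs, so in this sketch it is done in one place after the reduce step.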
Market Basket Analysis (MBA)
- Collect the list of pairs of items that most frequently occur together in transactions at one or more stores
- Traditional Business Intelligence analysis
- Gives a much better opportunity to make a profit by controlling the order of products and marketing:
  - control the stocks more intelligently
  - arrange items on shelves
  - promote items together, etc.

Market Basket Analysis (MBA)
Transactions in Store A (input data):
  Transaction 1: cracker, icecream, beer
  Transaction 2: chicken, pizza, coke, bread
  Transaction 3: baguette, soda, herring, cracker, beer
  Transaction 4: bourbon, coke, turkey
  Transaction 5: sardines, beer, chicken, coke
  Transaction 6: apples, peppers, avocado, steak
  Transaction 7: sardines, apples, peppers, avocado, steak
  …
What is a pair of items that people frequently buy together at Store A?

Map Algorithm
1. Read each transaction of the input file and generate the data set of items: $(V_1, V_2, \ldots, V_n)$, where $V_n = (v_{n1}, v_{n2}, \ldots, v_{nm})$
2. Sort each data set $V_n$ to generate the sorted data set $U_n$: $(U_1, U_2, \ldots, U_n)$, where $U_n = (u_{n1}, u_{n2}, \ldots, u_{nm})$
3. Loop over each item from $u_{n1}$ to $u_{nm}$ of $U_n$:
   3.a. Generate the data set $Y_n = (y_{n1}, y_{n2}, \ldots, y_{nl})$, where $y_{nl} = (u_{nx}, u_{ny})$ and $u_{nx} \neq u_{ny}$
   3.b. Increment the occurrence of $y_{nl}$; note: (key, value) = ($y_{nl}$, number of occurrences)
4. The data set is created as input of the Reducer: (key, <value>) = ($y_{nl}$, <number of occurrences>)

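A minimal Python sketch of these map steps for a single transaction follows. It assumes comma-separated item names on one line and drops duplicate items within a transaction; both are assumptions of the sketch, as is the name `map_transaction`.

```python
from itertools import combinations

def map_transaction(line):
    """Emit ((item_a, item_b), 1) for each pair of distinct items in one transaction."""
    # Steps 1-2: read the items and sort them so every pair has one canonical order.
    # Assumption: repeated items in a transaction are collapsed by the set.
    items = sorted({item.strip() for item in line.split(",") if item.strip()})
    # Step 3: every 2-item combination of the sorted items, each counted once.
    return [(pair, 1) for pair in combinations(items, 2)]

pairs = map_transaction("cracker, icecream, beer")
for key, value in pairs:
    print(key, value)
```

Because the items are sorted before pairing, a transaction with m distinct items yields m(m-1)/2 keys, and the same two items always produce the same key regardless of the order in which they were scanned.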
Reduce Algorithm
1. Take ($y_{nl}$, <number of occurrences>) as input data from multiple Map nodes
2. Add the values for $y_{nl}$ to produce ($y_{nl}$, total number of occurrences) as output

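The reduce step is small enough to sketch directly. The function name `reduce_pair` and the sample values are illustrative assumptions; the list of occurrences stands for the values collected from several Map nodes for one key.

```python
def reduce_pair(pair, occurrences):
    """Add up the occurrence values collected for one pair of items."""
    return pair, sum(occurrences)

# Values for the same key may arrive from several Map nodes.
key, total = reduce_pair(("beer", "cracker"), [1, 1, 1, 1])
print(key, total)
```

Summation is associative and commutative, so the same function also works when a node has already pre-summed part of the values, for example `[3, 1]` instead of `[1, 1, 1, 1]`.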
Market Basket Analysis (Cont’d)
Transactions in Store A:
  Transaction 1: cracker, icecream, beer
  Transaction 2: chicken, pizza, coke, bread
  …
- Distribute transaction data to Map nodes
- Pairs of items restructured in each Map node:
  Transaction 1: <(cracker, icecream), (cracker, beer), (beer, icecream)>
  Transaction 2: <(chicken, pizza), (chicken, coke), (chicken, bread), (coke, pizza), (bread, pizza), (coke, bread)>
  …

Market Basket Analysis (Cont’d)
- Note: the items within each pair should be sorted, because the pair becomes a key
  - For example, (cracker, icecream) and (icecream, cracker) should both be (cracker, icecream)
- Pairs of items sorted in MBA:
  Transaction 1: <(cracker, icecream), (beer, cracker), (beer, icecream)>
  Transaction 2: <(chicken, pizza), (chicken, coke), (bread, chicken), (coke, pizza), (bread, pizza), (bread, coke)>
  …

Market Basket Analysis (Cont’d)
Output of Map node
- Pairs of items in (key, value) structure in each Map node
- (key, value): (pair of items, number of occurrences)
  ((cracker, icecream), 1)
  ((beer, cracker), 1)
  ((beer, icecream), 1)
  ((chicken, pizza), 1)
  ((chicken, coke), 1)
  ((bread, chicken), 1)
  ((coke, pizza), 1)
  ((bread, pizza), 1)
  ((bread, coke), 1)
  …

Market Basket Analysis (Cont’d)
Data Aggregation/Combine
- (key, <value>): (pair of items, list of occurrence counts)
  ((cracker, icecream), <1, 1, …, 1>)
  ((beer, cracker), <1, 1, …, 1>)
  ((beer, icecream), <1, 1, …, 1>)
  ((chicken, pizza), <1, 1, …, 1>)
  …

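The aggregation step can be pictured as grouping the values from every Map node under their key. In Hadoop the framework performs this between the map and reduce phases; the sketch below only imitates it in memory, and `aggregate`, `node_1`, and `node_2` are hypothetical names with made-up sample output.

```python
from collections import defaultdict

def aggregate(map_outputs):
    """Group (key, value) pairs from all Map nodes into key -> list of values."""
    grouped = defaultdict(list)
    for output in map_outputs:          # one list of pairs per Map node
        for key, value in output:
            grouped[key].append(value)
    return dict(grouped)

# Hypothetical output of two Map nodes.
node_1 = [(("beer", "cracker"), 1), (("beer", "icecream"), 1)]
node_2 = [(("beer", "cracker"), 1), (("chicken", "pizza"), 1)]

grouped = aggregate([node_1, node_2])
for key, values in grouped.items():
    print(key, values)
```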
Market Basket Analysis (Cont’d)
Reduce nodes
- (key, value): (pair of items, total number of occurrences)
  ((cracker, icecream), 421)
  ((beer, cracker), 341)
  ((beer, icecream), 231)
  ((chicken, pizza), 111)
  …

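Map/Reduce for MBA
(Diagram of the data flow)
- Input transaction data is distributed to Map 1 (), Map 2 (), …, Map m ()
- Each Map node emits (key, value) pairs, e.g. ((coke, pizza), 1), ((beer, corn), 1), …, ((ham, juice), 1), ((coke, pizza), 1), …
- Data Aggregation/Combine groups the values by key: ((coke, pizza), <1, 1, …, 1>), ((ham, juice), <1, 1, …, 1>)
- Reduce 1 (), Reduce 2 (), …, Reduce l () output the totals: ((coke, pizza), 3,421), ((ham, juice), 2,346)
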
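The whole flow can be simulated end to end on the seven sample transactions from Store A. This is a sketch under stated assumptions, not the Hadoop job used in the experiments: threads stand in for Map nodes, the round-robin split stands in for data distribution, and `map_node`, `aggregate`, `reduce_node`, and `market_basket` are names invented for the illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Sample transactions from the Store A example.
TRANSACTIONS = [
    "cracker, icecream, beer",
    "chicken, pizza, coke, bread",
    "baguette, soda, herring, cracker, beer",
    "bourbon, coke, turkey",
    "sardines, beer, chicken, coke",
    "apples, peppers, avocado, steak",
    "sardines, apples, peppers, avocado, steak",
]

def map_node(transactions):
    """Map: emit ((item_a, item_b), 1) for every sorted pair in each transaction."""
    output = []
    for line in transactions:
        items = sorted({item.strip() for item in line.split(",") if item.strip()})
        output.extend((pair, 1) for pair in combinations(items, 2))
    return output

def aggregate(map_outputs):
    """Data aggregation: group the values emitted by all Map nodes by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for pair, count in output:
            grouped[pair].append(count)
    return grouped

def reduce_node(pair, counts):
    """Reduce: add up the occurrences for one pair."""
    return pair, sum(counts)

def market_basket(transactions, map_nodes=3):
    # Round-robin split stands in for distributing the input to m Map nodes.
    chunks = [transactions[i::map_nodes] for i in range(map_nodes)]
    with ThreadPoolExecutor(max_workers=map_nodes) as pool:
        map_outputs = list(pool.map(map_node, chunks))
    # The reduce phase starts only after every map task has returned.
    grouped = aggregate(map_outputs)
    return dict(reduce_node(pair, counts) for pair, counts in grouped.items())

totals = market_basket(TRANSACTIONS)
# Most frequent pairs first; ties broken alphabetically.
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
for pair, count in ranked[:3]:
    print(pair, count)
```

On these seven transactions no pair occurs more than twice; eight pairs occur twice, among them (beer, cracker) and (chicken, coke).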
Experimental Result
- 5 transaction files for the experiment, including 400 MB (6.7M transactions), 800 MB (13M transactions), and 1.6 GB (26M transactions)
- Run on small instances of AWS EC2
  - Each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor, 1.7 GB memory, and 160 GB storage on a 32-bit platform
- The jobs are executed on 2, 5, 10, 15, and 20 nodes

Experimental Result
Execution time (sec)

Nodes   6.7M (400MB)   13M (800MB)   26M (1.6GB)
2       9,133          NA            NA
5       5,442          8,717         15,963
10      2,910          5,998         8,845
15      2,792          2,917         5,898
20      2,868          2,911         5,671

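One way to read the table is as speedup relative to the 5-node run, the smallest cluster with a result for every file. The short script below computes that ratio from the reported times; the choice of baseline and the names `TIMES` and `speedup` are assumptions of this sketch.

```python
# Execution times in seconds, copied from the table; NA entries are omitted.
TIMES = {
    "6.7M (400MB)": {2: 9133, 5: 5442, 10: 2910, 15: 2792, 20: 2868},
    "13M (800MB)": {5: 8717, 10: 5998, 15: 2917, 20: 2911},
    "26M (1.6GB)": {5: 15963, 10: 8845, 15: 5898, 20: 5671},
}

def speedup(times, baseline_nodes=5):
    """Return nodes -> (baseline time / time), rounded to 2 decimals."""
    base = times[baseline_nodes]
    return {
        nodes: round(base / seconds, 2)
        for nodes, seconds in sorted(times.items())
        if nodes >= baseline_nodes
    }

for dataset, times in TIMES.items():
    print(dataset, speedup(times))
```

For the 1.6 GB file, going from 5 to 20 nodes (four times the nodes) gives a speedup of roughly 2.8, and the 400 MB file shows no further gain beyond 10 nodes.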
Experimental Result
(Chart: execution time in seconds)

Conclusion
- The Market Basket Analysis algorithm on Map/Reduce is presented: a data mining analysis to find the most frequently occurring pairs of products in baskets at a store.
- The associated items can be paired with the Map/Reduce approach.
- Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond the scope of this paper.
- Distributing, aggregating, and reducing the data set among nodes is a bottleneck.
 
