© 2014 MapR Technologies
Machine Learning with Spark
Carol McDonald
Agenda
• Classification
• Clustering
• Collaborative Filtering with Spark
• Model training
• Alternating Least Squares
• The code
Three Categories of Techniques for Machine Learning
• Classification: identifies the category for an item
• Collaborative filtering (recommendation): recommends items
• Clustering: groups similar items
What is classification?
A form of ML that:
• identifies which category an item belongs to
• uses supervised learning algorithms
• works on labeled data
Examples:
• Spam detection
  – spam/non-spam
• Credit card fraud detection
  – fraud/non-fraud
• Sentiment analysis
Building and deploying a classifier model
If it walks/swims/quacks like a duck …
Attributes, Features:
• If it walks
• If it swims
• If it quacks
“When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
Classify something based on “if” conditions.
Answer, Label:
• Duck
• Not duck
… then it must be a duck
[Figure: ducks vs. not ducks, with the labels walks, quacks, swims]
Label:
• Duck
• Not duck
Features:
• walks
• swims
• quacks
Building and deploying a classifier model
Reference: Learning Spark (O'Reilly)
Vectorizing Data
• identify interesting features (those that contribute to the model)
• assign features to dimensions
Example: vectorize an apple
Features: [size, color, weight]
Vector: [3.2, 16777184.0, 45.8]
Example: vectorize a text document (TF-IDF: Term Frequency, Inverse Document Frequency)
Dictionary: [a, advance, after, …, you, yourself, youth, zigzag]
Vector: [223, 1, 1, 0, …, 12, 10, 6, 1]
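The text-document example can be sketched in plain Scala, without Spark. This is a minimal illustration under stated assumptions: the four-word dictionary, the whitespace tokenization, and the names `VectorizeSketch` and `termFrequencies` are all made up for this sketch, and it computes raw term counts only, not the inverse-document-frequency weighting.

```scala
object VectorizeSketch {
  // Hypothetical four-word dictionary; each word owns one dimension of the vector.
  val dictionary: Vector[String] = Vector("a", "duck", "spark", "you")

  // Count how often each dictionary word occurs in the text (raw term frequency).
  def termFrequencies(text: String): Vector[Double] = {
    val words = text.toLowerCase.split("\\s+").filter(_.nonEmpty)
    dictionary.map(term => words.count(_ == term).toDouble)
  }

  def main(args: Array[String]): Unit =
    println(termFrequencies("You saw a duck and a duck saw you"))
}
```

Words that are not in the dictionary are simply dropped, which is one reason a real pipeline needs either a complete dictionary or the hashing approach.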
Build Term Frequency Feature Vectors
import org.apache.spark.mllib.feature.HashingTF

// Examples of spam
val spam = sc.textFile("spam.txt")
// Examples of non-spam
val normal = sc.textFile("normal.txt")
// Create a HashingTF instance to map email text to vectors of 10,000 features
val tf = new HashingTF(numFeatures = 10000)
// Split each email into words; each word is mapped to one feature
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))
Reference: Learning Spark (O'Reilly)
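The idea behind a hashing term-frequency transform can be sketched without Spark: instead of looking a word up in a dictionary, hash it to a bucket index and count occurrences per bucket. The sketch below is an assumption-laden illustration, not MLlib's implementation: it uses Scala's `hashCode` (the hash function Spark uses may differ by version), a dense array, and the made-up names `HashingTfSketch`, `indexOf`, and `transform`.

```scala
object HashingTfSketch {
  // Map a term to a bucket in [0, numFeatures) with a non-negative modulus.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Build a dense term-frequency vector; colliding terms share a bucket.
  def transform(words: Seq[String], numFeatures: Int): Array[Double] = {
    val vector = Array.fill(numFeatures)(0.0)
    words.foreach(w => vector(indexOf(w, numFeatures)) += 1.0)
    vector
  }

  def main(args: Array[String]): Unit = {
    val features = transform("cheap stuff cheap money".split(" ").toSeq, 16)
    println(features.mkString("[", ", ", "]"))
  }
}
```

The trade-off is that no dictionary has to be built or shipped to the workers, at the cost of occasional collisions between unrelated words; a larger `numFeatures` makes collisions rarer.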
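Build Model
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Label the feature vectors: 1 for spam (positive), 0 for normal (negative)
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // Cache, because logistic regression is iterative
// Run logistic regression using the SGD algorithm (Spark 1.x MLlib API)
val model = new LogisticRegressionWithSGD().run(trainingData)
Reference: Learning Spark (O'Reilly)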
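To make the training step less of a black box, here is a rough plain-Scala sketch of logistic regression trained by stochastic gradient descent. It is not MLlib's implementation: it updates the weights after every single example with a fixed step size, and the toy data, the two-feature layout (a bias term plus a count of the word "cheap"), and all names are assumptions made for illustration.

```scala
object LogisticSgdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Train weights with plain per-example gradient steps on the log loss.
  def train(data: Seq[(Double, Array[Double])], stepSize: Double, iterations: Int): Array[Double] = {
    val weights = Array.fill(data.head._2.length)(0.0)
    for (_ <- 1 to iterations; (label, features) <- data) {
      val error = sigmoid(dot(weights, features)) - label
      for (i <- weights.indices) weights(i) -= stepSize * error * features(i)
    }
    weights
  }

  // Predict class 1.0 when the estimated probability is at least 0.5.
  def predict(weights: Array[Double], features: Array[Double]): Double =
    if (sigmoid(dot(weights, features)) >= 0.5) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    // Toy data: features are (bias, count of the word "cheap"); label 1.0 = spam.
    val data = Seq(
      (1.0, Array(1.0, 3.0)), (1.0, Array(1.0, 2.0)),
      (0.0, Array(1.0, 0.0)), (0.0, Array(1.0, 0.0)))
    val weights = train(data, stepSize = 0.5, iterations = 200)
    println(predict(weights, Array(1.0, 3.0)))
    println(predict(weights, Array(1.0, 0.0)))
  }
}
```

Because every pass revisits the same data, caching the training set (as the slide does with `trainingData.cache()`) avoids re-reading it on each iteration.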
Model Evaluation
// Test on a positive example (spam)
val posTest = tf.transform(
  "O M G GET cheap stuff by sending money to...".split(" "))
// Test on a negative example (not spam)
val negTest = tf.transform(
  "Hi Dad, I started studying Spark the other ...".split(" "))
println("Prediction for positive: " + model.predict(posTest))
println("Prediction for negative: " + model.predict(negTest))
Clustering
• Clustering is the unsupervised learning task of grouping objects into clusters of high similarity
  – Search results grouping
  – Grouping of customers by similar habits
  – Anomaly detection
    • data traffic
  – Text categorization
What is Clustering?
Clustering = (unsupervised) task of grouping similar objects
MLlib K-means algorithm for clustering:
1. Randomly initialize the cluster centers
2. Assign each point to the closest cluster center
3. Move each cluster center to the middle (mean) of the points assigned to it
4. Repeat steps 2 and 3 until convergence
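The four K-means steps can be sketched in plain Scala on 2-D points. This is an illustrative sketch rather than MLlib's implementation: the names are made up, the initial centers are passed in explicitly (two data points instead of random ones, so the run is repeatable), and convergence is taken to mean that no center moved.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  def squaredDistance(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the closest center for a point.
  def closest(p: Point, centers: Vector[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Step 3: mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    (points.map(_._1).sum / points.size, points.map(_._2).sum / points.size)

  // Steps 2-4: repeat assignment and update until the centers stop moving.
  def run(points: Seq[Point], initial: Vector[Point], maxIterations: Int): Vector[Point] = {
    var centers = initial
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = points.groupBy(p => closest(p, centers))
      // A center that attracted no points keeps its previous position.
      val updated = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i))).toVector
      changed = updated != centers
      centers = updated
      iteration += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val points = Seq((1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0))
    // Step 1 would pick random centers; two data points are used here for repeatability.
    println(run(points, Vector(points.head, points.last), maxIterations = 20))
  }
}
```

The `maxIterations` cap is a safety net: with unlucky starting centers the loop can take many passes, and the result is only a local optimum.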
Examples of ML Algorithms
Supervised:
• Classification
  – Naïve Bayes
  – SVM
  – Random Decision Forests
• Regression
  – Linear
  – Logistic
Unsupervised:
• Clustering
  – K-means
• Dimensionality reduction
  – Principal Component Analysis
  – SVD
ML Algorithms
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Collaborative Filtering with Spark
• Recommend items
  – (filtering)
• Based on user preference data
  – (collaborative)
Train a Model to Make Predictions
[Figure: Training Data → Algorithm → Model; New Data → Model → Predictions]
• Training data: Ted and Carol like Movies B and C
• New data: Bob likes Movie B. What might he like?
• Prediction: Bob likes Movie B, so predict Movie C
Alternating Least Squares
• Approximates the sparse user-item rating matrix
  – as the product of two dense matrices, the User and Item factor matrices
  – tries to learn the hidden features of each user and item
  – the algorithm alternately fixes one factor matrix and solves for the other
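The alternating idea is easiest to see with a single hidden feature (rank 1), where each least-squares solve reduces to one division. The sketch below is a simplified illustration under stated assumptions, not MLlib's ALS: one factor per user and per item, a small regularization term `lambda`, made-up names, and toy ratings modeled on the Ted, Carol, and Bob example.

```scala
object AlsRankOneSketch {
  // Observed ratings keyed by (user, item); missing pairs are unknown, not zero.
  type Ratings = Map[(Int, Int), Double]

  // Least-squares solution for one side's factors while the other side is held fixed.
  def solve(ratings: Ratings, fixed: Array[Double], size: Int, lambda: Double, byUser: Boolean): Array[Double] =
    Array.tabulate(size) { k =>
      val observed = ratings.toSeq.collect {
        case ((u, i), r) if byUser && u == k => (r, fixed(i))
        case ((u, i), r) if !byUser && i == k => (r, fixed(u))
      }
      val numerator = observed.map { case (r, f) => r * f }.sum
      val denominator = lambda + observed.map { case (_, f) => f * f }.sum
      numerator / denominator
    }

  def train(ratings: Ratings, users: Int, items: Int, lambda: Double, iterations: Int): (Array[Double], Array[Double]) = {
    var userFactors = Array.fill(users)(1.0)
    var itemFactors = Array.fill(items)(1.0)
    for (_ <- 1 to iterations) {
      // Fix the item factors and solve for users, then fix users and solve for items.
      userFactors = solve(ratings, itemFactors, users, lambda, byUser = true)
      itemFactors = solve(ratings, userFactors, items, lambda, byUser = false)
    }
    (userFactors, itemFactors)
  }

  def main(args: Array[String]): Unit = {
    // Users 0 and 1 (Ted and Carol) rated items 0 and 1 (Movies B and C); user 2 (Bob) rated only item 0.
    val ratings: Ratings = Map((0, 0) -> 5.0, (0, 1) -> 5.0, (1, 0) -> 5.0, (1, 1) -> 5.0, (2, 0) -> 5.0)
    val (u, v) = train(ratings, users = 3, items = 2, lambda = 0.01, iterations = 20)
    println(f"Predicted rating for user 2, item 1: ${u(2) * v(1)}%.2f")
  }
}
```

The predicted rating for an unseen (user, item) pair is the product of the two learned factors; with a higher rank it becomes a dot product of factor vectors, and each solve becomes a small linear system.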
ML Cross Validation Process
[Figure: Data is split into a Training Set and a Test Set. The Training Set feeds model training/building; the Model is tested against the Test Set to produce predictions; the train/test loop repeats.]
Ratings Data
Parse Input
import org.apache.spark.mllib.recommendation.Rating

// Parse an input line of the form UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// Create an RDD of Rating objects (ratingText is the RDD[String] of input lines)
val ratingsRDD = ratingText.map(parseRating).cache()
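The parser above throws on a malformed line, which would fail the whole job. One possible variation, sketched here in plain Scala, returns an `Option` so bad lines can be skipped. The local `Rating` case class is a stand-in for MLlib's class so the sketch runs without Spark, the sample lines are made up, and ignoring extra trailing fields is an assumption of this sketch.

```scala
import scala.util.Try

object ParseRatingSketch {
  // Stand-in for MLlib's Rating so the sketch runs without Spark.
  case class Rating(user: Int, product: Int, rating: Double)

  // Return None instead of throwing when a line is malformed.
  def parseRating(line: String): Option[Rating] =
    line.split("::") match {
      case Array(user, movie, rating, _*) =>
        Try(Rating(user.toInt, movie.toInt, rating.toDouble)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical input lines: one valid, two malformed.
    val lines = Seq("1::10::5", "not a rating", "2::x::3")
    println(lines.flatMap(parseRating))
  }
}
```

With an RDD the same function could be applied with `flatMap`, so that only successfully parsed ratings reach the model.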
Build Model
[Figure: Data is split into a Training Set and a Test Set; the Training Set is used to build the model]
• Split the ratings RDD into a training data RDD (80%) and a test data RDD (20%)
• Build a user product matrix model
Create Model
import org.apache.spark.mllib.recommendation.ALS

// Randomly split the ratings RDD into a training data RDD (80%)
// and a test data RDD (20%), using seed 0L
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
// Build an ALS user product matrix model with rank = 20, iterations = 10
val model = (new ALS().setRank(20).setIterations(10)
  .run(trainingRatingsRDD))
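A seeded random split can be sketched in plain Scala to show why the two parts are only approximately 80% and 20%: each element is assigned independently by a random draw. This is an illustration of the idea with made-up names, not Spark's sampling code.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Send each element to the first part with probability `fraction`, otherwise to the second.
  def randomSplit[A](data: Seq[A], fraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw once per element, then separate by the recorded outcome.
    val tagged = data.map(x => (x, rng.nextDouble() < fraction))
    (tagged.collect { case (x, true) => x }, tagged.collect { case (x, false) => x })
  }

  def main(args: Array[String]): Unit = {
    val (training, test) = randomSplit(1 to 1000, fraction = 0.8, seed = 0L)
    println(s"training=${training.size}, test=${test.size}")
  }
}
```

Fixing the seed makes the split repeatable, so a change in the error metric between runs reflects a change in the model rather than a different test set.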
Get Predictions
// Extract the (user, product) pairs from the test ratings
val testUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => (user, product)
}
// Call model.predict with the test (user ID, movie ID) pairs
val predictionsForTestRDD = model.predict(testUserProductRDD)
[Figure: (User, Movie) test data → Model → Predicted Ratings]
Compare predictions to Tests
Join predicted ratings to test ratings in order to compare:
((user, product), test rating) join ((user, product), predicted rating)
→ ((user, product), (test rating, predicted rating))
Each element is a (Key, Value) pair keyed by (user, product).
Test Model
// Prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// Prepare test ratings for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// Join the test ratings with the predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD.join(
  predictionsKeyedByUserProductRDD)
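What the join produces can be shown with ordinary Scala collections. This sketch is only an illustration of an inner join on key: it assumes each key appears at most once on each side, and the keys and ratings are made up.

```scala
object JoinSketch {
  // Inner join of two keyed collections; assumes each key appears at most once per side.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val rightByKey = right.toMap
    left.flatMap { case (key, a) => rightByKey.get(key).map(b => (key, (a, b))) }
  }

  def main(args: Array[String]): Unit = {
    // Made-up (user, product) keys and ratings.
    val test = Seq(((1, 10), 4.0), ((1, 20), 1.0), ((2, 10), 5.0))
    val predictions = Seq(((1, 10), 3.6), ((1, 20), 4.2))
    println(join(test, predictions))
  }
}
```

Keys present on only one side, such as `(2, 10)` here, are dropped, so a test rating without a prediction does not take part in the comparison.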
Compare predictions to Tests
Find false positives, where test rating <= 1 and predicted rating >= 4:
((user, product), (test rating, predicted rating))
Key: (user, product); Value: (test rating, predicted rating)
Test Model
val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) =>
    ratingT <= 1 && ratingP >= 4
}
falsePositives.take(2)
// Array[((Int, Int), (Double, Double))] =
//   ((3842,2858),(1.0,4.106488210964762)),
//   ((6031,3194),(1.0,4.790778049100913))
Test Model: Mean Absolute Error
// Evaluate the model using the Mean Absolute Error (MAE) between
// test ratings and predicted ratings
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = testRating - predRating
    Math.abs(err)
}.mean()
// meanAbsoluteError: Double = 0.7244940545944053
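The two evaluation steps, counting false positives and computing the MAE, can be checked by hand on a few pairs. The sketch below uses plain Scala collections and made-up (test rating, predicted rating) pairs; the thresholds are the ones from the slides, and the names are hypothetical.

```scala
object EvaluationSketch {
  // False positives: the user rated the item 1 or lower, but the model predicted 4 or higher.
  def falsePositives(pairs: Seq[(Double, Double)]): Seq[(Double, Double)] =
    pairs.filter { case (actual, predicted) => actual <= 1 && predicted >= 4 }

  // Mean Absolute Error: average of |actual - predicted| over all pairs.
  def meanAbsoluteError(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (actual, predicted) => math.abs(actual - predicted) }.sum / pairs.size

  def main(args: Array[String]): Unit = {
    // Made-up (test rating, predicted rating) pairs.
    val pairs = Seq((4.0, 3.0), (1.0, 4.0), (5.0, 5.0), (3.0, 3.0))
    println(falsePositives(pairs))    // only (1.0, 4.0) meets both thresholds
    println(meanAbsoluteError(pairs)) // (1 + 3 + 0 + 0) / 4 = 1.0
  }
}
```

An MAE of 1.0 means the predictions are off by one rating point on average; the value of about 0.72 reported on the slide reads the same way.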
Get Predictions for a New User
// Ratings from a new user (user ID 0)
val newRatingsRDD = sc.parallelize(Array(Rating(0, 260, 4), Rating(0, 1, 3)))
// Add the new user's ratings to the existing ratings
val unionRatingsRDD = ratingsRDD.union(newRatingsRDD)
// Build an ALS user product matrix model on the combined ratings
val model = (new ALS().setRank(20).setIterations(10)
  .run(unionRatingsRDD))
// Get the top 5 recommendations for user ID 0
val topRecsForUser = model.recommendProducts(0, 5)
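Conceptually, a top-N recommendation scores every product for the user and keeps the highest scores. The sketch below illustrates that idea with plain Scala and made-up rank-2 factor vectors; it is an assumption about the general approach of a factorization model, not a description of how MLlib implements `recommendProducts`.

```scala
object TopNSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score every product for the user and keep the n highest-scoring ones.
  def recommend(userFactors: Array[Double], productFactors: Map[Int, Array[Double]], n: Int): Seq[(Int, Double)] =
    productFactors.toSeq
      .map { case (product, factors) => (product, dot(userFactors, factors)) }
      .sortBy { case (_, score) => -score }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up rank-2 factors for one user and three products.
    val user = Array(1.0, 0.5)
    val products = Map(1 -> Array(2.0, 0.0), 2 -> Array(1.0, 4.0), 3 -> Array(0.5, 0.5))
    println(recommend(user, products, 2))
  }
}
```

Here product 2 scores 3.0 and product 1 scores 2.0, so those two are returned in that order; a production system would usually also exclude products the user has already rated.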
Soon to Come
• Spark On Demand Training
  – https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
  – Movie Recommendations with Collaborative Filtering
  – Spark Streaming
Machine Learning Blog
• https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
Spark on MapR
• Certified Spark distribution
• Fully supported and packaged by MapR in partnership with Databricks
• YARN integration
  – Spark can then allocate resources from the cluster when needed
References
• Spark online course: learn.mapr.com
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
  – http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
Q&A
Engage with us!
@mapr, maprtech, mapr-technologies
