Machine Learning at Scale
Madhukara Phatak
• Madhukara Phatak
• Consultant in big data and functional programming (FP)
• Work with Spark,
Hadoop and ecosystem
• Training on Bigdata
• @madhukaraphatak
• http://www.madhukaraphatak.com
How many of you?
• Own a smartphone?
• Want to know when the next phone is coming to market?
• Want to know when the next version of an existing phone is coming to market?
• Want the specs and prices of a new phone?
– Months before the phone is released
• Want data from multiple sources aggregated in one place?
Rumor Engine
• A practical application of machine learning to the phone rumor problem
• Built in 3 months
– Learning machine learning
– Learning Spark
– Idea
– Implementation
– Release
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
Big data at work
• Worked for a BSS/OSS product company
• Big data is normal in Telecom
• CDRs (call detail records): around 3 TB for companies like Airtel
• Needed a solution for processing more than 6 months of data
• Started this work around 4 years ago
Hadoop Saga
• Hadoop was default choice
• Challenge in the ecosystem in India
• Hype vs Reality
• Work
– Building ML library Nectar
– Working with companies to build Hadoop expertise and solutions
– POCs
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
Machine Learning in Hadoop
• Apache Mahout was the default choice, but it is too hard to adapt to new requirements
• The Map/Reduce implementation suffered from poor speed and high complexity
• Accuracy of the results was often poor
• We set out to build our own and realized it was too much overhead to build even the simplest things
ML and Map Reduce
• M/R forgets everything once an operation is done
• Everything has to go through HDFS, which is slower because of disk overheads
• Mahout tried for a long time to make this as fast as possible, but more or less gave up
• At Zinnia, we moved on to aggregation and KPI-based solutions rather than pure ML
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
JavaScript
• Functional programming
• Closures
• Loose typing /type inference
• Prototype inheritance
• REPL (node.js) or webtools
Search for New Language
• Statically typed (Enterprise stack)
• Runs on JVM
• Ability to use Java libraries
• Functional programming
• Type inference
• REPL
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
Scala
• Statically typed
• Type inference
• Functional programming and OO built in
• Parallelism built in
• REPL
• Scalable language
Search for Functional Bigdata
• Pig attempted this on Hadoop
• Tuple Map/Reduce
• JavaScript API for Hadoop
• Why functional bigdata?
Big data platform requirement
• Immutability support
• Transformation not CRUD
• Built in laziness
• Concise API
• Type inference
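The requirements above (immutable inputs, transformations rather than in-place updates, built-in laziness) can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: the `LazyDataset` class and its methods are hypothetical names invented for this example. It only records transformations and runs them when a result is actually requested.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class LazyDataset:
    """Hypothetical, illustration-only stand-in for a lazily evaluated dataset."""

    def __init__(self, source: Iterable, steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source  # the input is never modified
        self._steps = steps    # immutable record of pending transformations

    def map(self, fn: Callable) -> "LazyDataset":
        # Returns a NEW dataset; the original is left untouched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self) -> Iterator:
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self) -> List:
        # The only place where work actually happens.
        return list(self._iterate())


calls = []


def parse(line: str) -> int:
    calls.append(line)  # side effect only to show when the work runs
    return int(line)


raw = LazyDataset(["3", "10", "25", "7"])
big = raw.map(parse).filter(lambda n: n >= 10)

work_before_collect = len(calls)  # no parsing has happened yet
result = big.collect()            # parsing and filtering run here
work_after_collect = len(calls)
```

Because `map` and `filter` return new objects, `raw` can be reused for a different pipeline, which is the "transformation, not CRUD" idea in miniature.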
Java and Hadoop (Productivity)
• No laziness
– Every Map/Reduce operation has to write its output to HDFS
• Java allows CRUD-like variable assignments, but these fail in distributed mode
• The type of every key/value pair has to be declared; there is no way to skip it
• Lots of boilerplate code for closures
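The point about variable assignment failing in distributed mode can be shown with a small simulation. This is a sketch in plain Python under a stated assumption: `run_on_worker` is a hypothetical helper that mimics a cluster by handing each task a copy of the driver's variables, which is roughly what happens when a closure is serialized and shipped to another machine.

```python
import copy
from functools import reduce

partitions = [[1, 2, 3], [4, 5], [6]]  # toy data split across three "workers"


def run_on_worker(task, partition, captured):
    # Hypothetical helper: the worker only ever sees a copy of the driver's state.
    task(partition, copy.deepcopy(captured))


# CRUD style: try to accumulate into a variable owned by the driver.
driver_state = {"total": 0}


def add_to_total(partition, state):
    for x in partition:
        state["total"] += x  # updates the worker's private copy only


for p in partitions:
    run_on_worker(add_to_total, p, driver_state)

crud_total = driver_state["total"]  # the driver never sees the updates

# Transformation style: each worker returns a value and the driver combines them.
partial_sums = [sum(p) for p in partitions]
functional_total = reduce(lambda a, b: a + b, partial_sums, 0)
```

The first approach silently loses every update, while the second gives the expected sum because nothing depends on shared mutable state.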
Apache Spark
• Apache Spark is a framework for lightning-fast cluster computing
• Built at AMPLab (UC Berkeley), now backed by Databricks
• Competitor to Hadoop M/R
• Runs on Hadoop 1.0 and on Hadoop 2.0 (YARN)
• Written in Scala
Spark and ML
• Built for iterative programs, i.e. ML
• Support for caching intermediate results
• Support for in-memory processing
• Remembers data across jobs, not just within a job
• Renewed interest in big data ML, since Spark finally makes it possible to run fast and accurately
• Mahout is moving to Spark
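Why caching matters for iterative programs can be sketched in a few lines. The example below is plain Python, not Spark: `load_from_storage` is a hypothetical stand-in for an expensive HDFS read, and the model is a deliberately trivial one-parameter fit. It compares re-reading the data on every iteration with reading it once and keeping it in memory.

```python
from typing import Callable, List

reads = {"count": 0}


def load_from_storage() -> List[float]:
    # Hypothetical stand-in for an expensive disk read; counts how often it runs.
    reads["count"] += 1
    return [2.0, 4.0, 6.0, 8.0]


def fit_mean(get_data: Callable[[], List[float]], iterations: int, lr: float = 0.1) -> float:
    # Gradient descent on the squared error between one parameter and the data.
    theta = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(theta - x for x in data) / len(data)
        theta -= lr * gradient
    return theta


# Map/Reduce style: every iteration goes back to storage.
reads["count"] = 0
theta_uncached = fit_mean(load_from_storage, iterations=50)
reads_uncached = reads["count"]

# Cached style: read once, keep the result in memory, reuse it.
reads["count"] = 0
cached = load_from_storage()
theta_cached = fit_mean(lambda: cached, iterations=50)
reads_cached = reads["count"]
```

Both runs produce the same parameter, but the first touches storage once per iteration and the second only once in total.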
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
Learning Machine Learning
• Coursera
• Examples in Octave
• Porting examples from Octave to Spark
• https://github.com/zinniasystems/spark-ml-class
• Uses
– MLLib
– JBlas
– Breeze
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
MLlib
• Standard Spark library for machine learning
• Built into Spark
• Very small code base: 1,200 lines of Scala code
• 40x to 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems
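To make one of the supported algorithms concrete, here is a minimal sketch of logistic regression trained by batch gradient descent. It is written in plain Python on made-up toy data and is not MLlib's implementation or API; the function names and the learning-rate and iteration settings are assumptions chosen for this example.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(
    points: List[Tuple[float, int]], iterations: int = 2000, lr: float = 0.5
) -> Tuple[float, float]:
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Gradients of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in points) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w: float, b: float, x: float) -> int:
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Made-up toy data: the label is 1 when the feature is large.
training = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]
w, b = train_logistic(training)
predictions = [predict(w, b, x) for x, _ in training]
```

Every iteration makes a full pass over the training set, which is exactly the access pattern that benefits from keeping the data in memory.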
Mahout vs MLLib
• Mahout has more algorithms than MLlib
• MLlib has less code than Mahout (1,200 lines of Scala vs. >20,000 lines of Java)
• Much improved performance and accuracy
• Mahout recognizes this and is moving to a Spark backend for its next release
My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLlib
• Rumor Engine
Rumor Engine
• Crawls blog data
• Currently crawls 12 blogs every day, with more to be added
• Naïve Bayes to classify
• Uses single-node Spark for prediction
• MLlib
• Has <200 lines of actual application code in Scala
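The classification step can be sketched as a small multinomial Naive Bayes text classifier. This is plain Python rather than the Scala and MLlib code the slide refers to, and everything specific in it is an assumption made for illustration: the `NaiveBayes` class, the `rumor`/`other` labels, and the training headlines are all invented.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs: List[Tuple[str, str]]) -> "NaiveBayes":
        self.class_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab = set()
        for text, label in docs:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def _log_score(self, words: List[str], label: str) -> float:
        total_docs = sum(self.class_counts.values())
        total_words = sum(self.word_counts[label].values())
        # Log prior plus the sum of smoothed log likelihoods.
        score = math.log(self.class_counts[label] / total_docs)
        for word in words:
            if word not in self.vocab:
                continue  # ignore words never seen in training
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, text: str) -> str:
        words = tokenize(text)
        return max(self.class_counts, key=lambda label: self._log_score(words, label))


# Invented training headlines, for illustration only.
training = [
    ("leaked specs of upcoming phone", "rumor"),
    ("new phone expected to launch next month", "rumor"),
    ("leaked price of next phone", "rumor"),
    ("review of the best laptop bags", "other"),
    ("how to clean your laptop keyboard", "other"),
    ("best apps for editing photos", "other"),
]

model = NaiveBayes().fit(training)
rumor_label = model.predict("leaked price of upcoming phone")
other_label = model.predict("best laptop keyboard review")
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a class.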
ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance
ML-People challenges
• Hard to find Data scientists
• Unique combination of skills – Programming
at scale and Maths.
• Mathematical reasoning and practicality of
implementation.
Thank you

