Machine Learning at Scale

Machine Learning at Scale
Madhukara Phatak

• Madhukara Phatak
• Consult in Bigdata and
FP
• Work with Spark,
Hadoop and ecosystem
• Training on Bigdata
• @madhukaraphatak
• http://www.madhukara
phatak.com

How many of You?
• Own a Smart phone?
• Want to know when next phone coming into
market?
• Next version of existing phone coming into
market?
• Specs and prices of new phone?
– Months before phone releases
• Data from multiple sources aggregated in one
place

Rumor Engine
• A practical implementation of machine
learning to solve phone rumor problem.
• Built in 3 months
– Learning machine learning
– Learning Spark
– Idea
– Implementation
– Release

My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumour Engine

Big data at work
• Worked for a BSS/OSS product company
• Big data is normal in Telecom
• CDR (call data record ) around 3TB for
companies like Airtel
• Need a solution for processing over 6 months
of data.
• Started to work around 4 years ago

Hadoop Saga
• Hadoop was default choice
• Challenge in the ecosystem in India
• Hype vs Reality
• Work
– Building ML library Nectar
– Working with companies to build hadoop
expertise and solutions
– POC’s

My Journey
• Hadoop
• Javascript
• Scala and Spark
• Courseera
• MLLib
• Rumour Engine

Machine Learning in Hadoop
• Apache Mahout was the choice but its too
hard to map it to any new requirements
• Map/Reduce implementation suffered from
speed and complexity
• Accuracy of the results are often poor
• We set out to build our own and realized it
was too much of overhead even to build
simplest things

ML and Map Reduce
• M/R forgets everything once one operation is
done
• Everything has to go through HDFS , slower
because of disk over heads
• Mahout long tried to make as fast possible ,
but they kind of gave up.
• In Zinnia , we moved on with aggregation and
KPI based solutions rather than pure ML.

JavaScript
• Functional programming
• Closures
• Loose typing /type inference
• Prototype inheritance
• REPL (node.js) or webtools

Search for New Language
• Statically typed (Enterprise stack)
• Runs on JVM
• Ability to use Java libraries
• Functional programming
• Type inference
• Repl

My Journey
• Hadoop
• Javascript
• Scala and Spark
• Coursera
• MLLib
• Rumour Engine

Scala
• Statically typed
• Type inference
• Functional programming and OO built in
• Parallelism built in
• REPL
• Scalable language

Search for Functional Bigdata
• Pig attempted on Hadoop
• Tuple Map/Reduce
• Javascript API for Hadoop
• Why functional bigdata?

Big data platform requirement
• Immutability support
• Transformation not CRUD
• Built in laziness
• Concise API
• Type inference

Java and Hadoop (Productivity)
• No Laziness
– Every Map/Reduce operation needs to write
output to HDFS
• Java allows crud like variable assignments but
fails in distributed mode
• Type of each key/value pair has to be declared
no way to skip it
• Lots of boiler plate code for closures

Apache Spark
• Apache Spark is a framework for lightening
fast cluster computing .
• Build by AmpLabs and now Databricks.
• Competitor to M/R of Hadoop
• Runs on Hadoop 1.0 and Hadoop 2.0 yarn
• Written in scala

Spark and ML
• Built for Iterative programs Aka ML
• Support for intermediate result caching
• Support for in memory processing
• Remembers across jobs not just within job
• There is suddenly interest in Bigdata ML again
with spark as its finally possible to run fast and
accurate with spark
• Mahout is moving on to Spark

Learning Machine learning
• Coursera
• Example in octave
• Porting examples from octave to Spark
• https://github.com/zinniasystems/spark-ml-
class
• Uses
– MLLib
– JBlas
– Breeze

MLLib
• Standard Spark library for Machine learning
• Built into spark
• Very small code base – 1200 line of scala code
• 40x – 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems

Mahout vs MLLib
• Mahout has more algorithms than MLLib
• MLLib has less code than MLLib (1200 lines
scala vs >20,000 lines of java code
• Much improved performance and accuracy
• Mahout recognizes it , moving to spark
backend for next release

Rumor Engine
• Crawls blog data
• As of 12 blogs everyday, more to add in future
• Naïve Bayes to classify
• Uses single node spark for prediction
• MLLib
• Has <200 lines of actual application scala
code.

ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance

ML-People challenges
• Hard to find Data scientists
• Unique combination of skills – Programming
at scale and Maths.
• Mathematical reasoning and practicality of
implementation.

Machine Learning at Scale

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Machine Learning at Scale (20)

Recently uploaded (20)

Machine Learning at Scale