Parallel auto-tuning of machine learning algorithms
Gianmario Spacagna
gm.spacagna@gmail.com

16 October 2012




AgilOne, Inc.                 (877) 769-3047
1091 N Shoreline Blvd. #250   (408) 404-0152 fax
Mountain View, CA 94043       sales@agilone.com
Motivation
• Increase revenue of cloud service providers
    → Keep the cost curve linear w.r.t. the expected
    exponential income growth.
    [Chart: Income and Cost curves]
• Technically achievable through scalability:
    • Scalability in terms of resources → Distributed parallel
      computing (Hadoop).
    • Scalability in terms of multi-tenancy → Same system
      running for several customers.
    • Scalability in terms of auto-configuration →
      Avoiding manual tuning operations.




Good Work Flow


    Good data  →  ML Algorithm  →  Good results!

                     Tuning
            (Adjusting configuration)
General Tuning diagram
         Test Data
             ↓
       Run algorithm  ←-------------+
        with conf. X                |
             ↓                      |
        Are results  --- no --→  Change
           good?              configuration X
             ↓ yes
           Tuned
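
The loop in the diagram can be sketched in a few lines. This is only an illustration: the function names (`tune`, `run_algorithm`, `is_good`, `change_configuration`) and the iteration budget are assumptions, not part of the proposal, and Python is used here purely for brevity.

```python
def tune(run_algorithm, is_good, change_configuration, initial_conf, max_iterations=100):
    """Repeat run -> check -> change until the results are good or the budget runs out."""
    conf = initial_conf
    for _ in range(max_iterations):
        results = run_algorithm(conf)
        if is_good(results):
            return conf, results          # the "Tuned" exit of the diagram
        conf = change_configuration(conf, results)
    return None                           # budget exhausted without good results


# Toy usage: "tune" k for a made-up score that peaks at k = 7.
def toy_run(conf):
    return {"score": 1.0 - abs(conf["k"] - 7) / 10.0}

def toy_is_good(results):
    return results["score"] >= 0.95

def toy_change(conf, results):
    return {"k": conf["k"] + 1}

tuned = tune(toy_run, toy_is_good, toy_change, {"k": 1})
```

The iteration budget is an added safeguard so that the loop terminates even when no configuration ever satisfies the check.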


Tuning of Machine Learning Algorithms
• We need tuning when:
    • A new algorithm or version is released.
    • We want to improve accuracy and/or performance.
    • A new customer arrives and the system must be customized for the
      new dataset and requirements.



     We need to make it smart, automatic and scalable!



Vision

Request:
•  Data set
•  Application
     (prediction, clustering, classification…)
     •  Algorithm
          (ANN, LR, K-means…)
•  Fitness metrics
     (Std. dev, Prob. of false true, clustering coeff., randomness…)
•  Goal constraints
     (x > 0.9 & 0.3 < y < 0.5)

        → Magic Box →

Response:
•  Best algorithm
•  Optimal configuration
•  Metrics evaluation
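
As an illustration of how a request and its goal constraints might be represented, here is a minimal sketch. The `TuningRequest` fields, the `satisfies_goals` helper, the file name, and the choice to encode each constraint as an exclusive (lower, upper) pair are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field

@dataclass
class TuningRequest:
    dataset: str                  # hypothetical identifier or path of the data set
    application: str              # e.g. "clustering"
    algorithms: list              # e.g. ["K-means"]
    fitness_metrics: list         # names of the metrics to evaluate
    # metric name -> (lower bound, upper bound), both exclusive
    goal_constraints: dict = field(default_factory=dict)

def satisfies_goals(metrics, goal_constraints):
    """True when every constrained metric lies strictly inside its bounds."""
    return all(
        name in metrics and low < metrics[name] < high
        for name, (low, high) in goal_constraints.items()
    )

# The slide's example goal: x > 0.9 & 0.3 < y < 0.5
request = TuningRequest(
    dataset="customers.csv",
    application="clustering",
    algorithms=["K-means"],
    fitness_metrics=["x", "y"],
    goal_constraints={"x": (0.9, math.inf), "y": (0.3, 0.5)},
)
```

A one-sided constraint such as x > 0.9 is expressed with an infinite upper bound, which keeps the check uniform for all metrics.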




Architecture Design
                   Upper Applications API

                        Initializer

                        Controller

                        Scheduler

       Executor              Executor         Executor
         ANN                    LR             K-Means
       Evaluator             Evaluator        Evaluator
     Data Sampler          Data Sampler     Data Sampler

        Local                 Hadoop        Cloud Service

Upper Applications API
Tasks:
• Interfaces the communication between the system and the
    upper applications layer.
• Parses requests and results and generates the related output
    domain object.

Possible data formats:
• JSON
• STDIN/OUT
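
A minimal sketch of the JSON variant of this layer could look as follows. The field names (`dataset`, `application`), the `TuningResponse` class, and both helper functions are hypothetical; the proposal does not define a wire format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TuningResponse:
    best_algorithm: str
    optimal_configuration: dict
    metrics_evaluation: dict

def parse_request(text):
    """Parse a JSON request and check that the (assumed) required fields are present."""
    request = json.loads(text)
    missing = {"dataset", "application"} - set(request)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return request

def render_response(response):
    """Serialize the response domain object back to JSON for the caller."""
    return json.dumps(asdict(response), sort_keys=True)
```

With STDIN/OUT as the transport, the same two functions could be wired to `sys.stdin.read()` and `print`.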




Initializer
Tasks:
• Generates the initial set of configurations.

Possible implementations:
• Random points
• Latin Hypercube
• Dataset similarity
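
To make the Latin Hypercube option concrete, here is a small sketch using only the standard library. It assumes every parameter is a continuous value with (low, high) bounds; the function name and the dict-based configuration format are illustrative choices.

```python
import random

def latin_hypercube(bounds, n_points, rng=None):
    """Return n_points configurations. Each parameter range is split into
    n_points equal strata, and every stratum is used exactly once."""
    rng = rng or random.Random()
    columns = {}
    for name, (low, high) in bounds.items():
        width = (high - low) / n_points
        # one random value inside each stratum
        values = [low + (i + rng.random()) * width for i in range(n_points)]
        rng.shuffle(values)  # decouple the strata across parameters
        columns[name] = values
    return [{name: columns[name][i] for name in bounds} for i in range(n_points)]
```

Compared with purely random points, this spreads the initial configurations evenly along each parameter axis, so no region of a single parameter is left unsampled.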




Controller
Tasks:
• Compares and generates configurations.
• Decides the convergence of the tuning.
• Adapts the data sampling request.

Possible implementations:
• Random search
• Grid search
• Stochastic Kriging
• Genetic Algorithms
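
One way a genetic-algorithm controller might generate the next set of configurations is sketched below. The specific operators (truncation selection, elitism, uniform crossover, Gaussian mutation) and all parameter values are assumptions chosen for simplicity; the proposal does not specify them.

```python
import random

def next_generation(population, fitness, rng, mutation_rate=0.1, mutation_scale=0.1):
    """Produce a new population of the same size from scored configurations.
    Each configuration is a dict of numeric parameters with identical keys;
    the population must contain at least two configurations."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]        # truncation selection
    children = [dict(ranked[0])]                         # elitism: keep the best as-is
    while len(children) < len(population):
        mother, father = rng.sample(parents, 2)
        # uniform crossover: each parameter comes from one of the two parents
        child = {k: rng.choice((mother[k], father[k])) for k in mother}
        for k in child:
            if rng.random() < mutation_rate:
                child[k] += rng.gauss(0.0, mutation_scale)  # Gaussian mutation
        children.append(child)
    return children
```

Because the best configuration is copied unchanged, the best fitness never decreases from one generation to the next, which gives the controller a simple signal for deciding convergence.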




Scheduler
Tasks:
• Checks if the requests are covered by the available services.
• Schedules and parallelizes request executions.
• Optimizes resources.
• Collects evaluated results.

Possible implementations:
• First available
• Oldest idle
• Load balanced
• Serialized (single node)
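
The scheduling and collecting tasks can be illustrated with a thread pool on a single machine. This is a local stand-in, not a Hadoop scheduler: the `schedule` function and its parameters are hypothetical, and a worker picking up the next pending request loosely corresponds to the "first available" policy.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def schedule(configurations, execute, max_workers=4):
    """Run execute(conf) for every configuration in parallel and collect
    (conf, result) pairs in completion order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(execute, conf): conf for conf in configurations}
        for future in as_completed(futures):
            results.append((futures[future], future.result()))
    return results
```

Setting `max_workers=1` gives the serialized (single node) behaviour with the same interface.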




Executor
Tasks:
• Executes the provided algorithm with the specified configuration.

Possible implementations:
• Local execution
• Hadoop cluster
• Cloud service

Sub-components:
•  Evaluator: Evaluates results according to the specified fitness
     metrics.
•  Data Sampler: Down- and up-sampling of data.
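
The two sub-components might be sketched as follows. This assumes the data fits in an in-memory list and that each fitness metric is a plain function of the raw results; the function names and the random-sampling strategy are illustrative only.

```python
import random

def evaluate(results, fitness_metrics):
    """Evaluator: apply each named fitness metric to the raw results."""
    return {name: metric(results) for name, metric in fitness_metrics.items()}

def down_sample(rows, fraction, rng):
    """Data Sampler: keep a random subset of the rows, without replacement."""
    size = max(1, round(len(rows) * fraction))
    return rng.sample(rows, size)

def up_sample(rows, target_size, rng):
    """Data Sampler: grow the data set to target_size by drawing extra rows
    with replacement."""
    extra = max(0, target_size - len(rows))
    return list(rows) + rng.choices(rows, k=extra)
```

Down-sampling lets early tuning iterations run on a fraction of the data, trading evaluation accuracy for speed.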



Tuning diagram
                        Test Data
                            ↓
  Test execution      Run algorithm  ←-------------+
  (Scheduler,          with conf. X                |
   Executor)                ↓                      |
                       Are results  --- no --→  Change            Test control
                          good?              configuration X     (Initializer,
                            ↓ yes                                 Controller)
                          Tuned


SUNS: Simple, Unclever and Not Scalable
                     STDIN/OUT

          Random Points

             Random Search – Grid Search

                      Serialized

                      Executor
                      K-Means
                          Evaluator




                          Local




SNS: Smart but Not Scalable
                   STDIN/OUT or JSON

          Latin Hypercube

           Genetic Algorithm / Stochastic Kriging

                         Serialized

                         Executor
                         K-Means
                          Evaluator




                             Local




VSNS: Very Smart but Not Scalable
                   STDIN/OUT or JSON

          Dataset Similarity

           Genetic Algorithm / Stochastic Kriging

                         Serialized

                          Executor
                          K-Means
                           Evaluator




                               Local




VSS: Very Smart and Scalable
                  STDIN/OUT or JSON

         Dataset Similarity

          Genetic Algorithm or Stochastic Kriging

                      First Available

                         Executor
                         K-Means
                          Evaluator




                         Hadoop




VSVSO: Very Smart, Very Scalable and Optimized
                  STDIN/OUT or JSON

         Dataset Similarity

          Genetic Algorithm or Stochastic Kriging

                      Load Balanced

                          Executor
                          K-Means
                    Evaluator   Data Sampler




                           Hadoop




Thesis
It is possible to build an intelligent system based on Genetic
Algorithm/Stochastic Kriging that automatically selects and tunes
machine learning algorithms, such as K-Means and LR, parallelizing
the work on a Hadoop cluster to scale in a cost-efficient manner.


Project Plan
Order of priorities:

1.  Design the entire application in Scala in a testable and expandable
     way.
2.  Implement the Genetic Algorithm or the Stochastic Kriging controller.
3.  Implement the Latin Hypercube initializer.
4.  Test with local instance algorithms (K-Means and/or LR).
5.  Develop and test at least one algorithm in MapReduce fashion using
     Hadoop.
6.  Test on a real AgilOne cluster of servers.
7.  Implement the Dataset Similarity initializer.
8.  Implement the Dataset Sampler.
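
For step 5, one K-Means iteration expressed in map/reduce style might look like the sketch below. It runs in plain Python on in-memory tuples and does not use Hadoop; the function names and the decision to leave an empty cluster's centroid unchanged are assumptions.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every point."""
    for point in points:
        yield nearest(point, centroids), point

def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # a centroid with no points keeps its position
    for index, members in groups.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def kmeans_iteration(points, centroids):
    """One full iteration: assign points, then recompute centroids."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

The map step needs only the current centroids and one point at a time, which is what makes the assignment phase straightforward to distribute across nodes.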


Questions, feedback, suggestions?




Thank you!




