Parallel Tuning of Machine Learning Algorithms, Thesis Proposal

Parallel auto-tuning of machine learning algorithms
Gianmario Spacagna
gm.spacagna@gmail.com

16 October 2012

AgilOne, Inc.
1091 N Shoreline Blvd. #250, Mountain View, CA 94043
(877) 769-3047, (408) 404-0152 fax, sales@agilone.com

Motivation
• Increase revenue of cloud service providers
  → Keep the cost curve linear w.r.t. the expected exponential income growth.
  [Chart: Income vs. Cost curves]
• Technically achievable through scalability:
  • Scalability in terms of resources → distributed parallel computing (Hadoop).
  • Scalability in terms of multi-tenancy → the same system running for several customers.
  • Scalability in terms of auto-configuration → avoiding manual tuning operations.

Good Work Flow

[Diagram] Good data → ML Algorithm → Good results!
Tuning (adjusting configuration) is applied to the ML Algorithm.

General Tuning diagram
[Flow diagram]
1.  Run the algorithm on the test data with configuration X.
2.  Are the results good?
    • no → change configuration X and go back to step 1.
    • yes → tuned.
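
The loop in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_algorithm`, `change_configuration`, the toy objective and the `good_enough` threshold are hypothetical stand-ins, not part of the proposal. The sketch also keeps the best configuration seen so far, which the diagram does not specify.

```python
import random

def run_algorithm(conf, test_data):
    """Hypothetical stand-in for 'run algorithm with conf. X': returns a score."""
    # Toy objective (assumption): the score peaks at 1.0 when conf["x"] is 0.7.
    return 1.0 - abs(conf["x"] - 0.7)

def change_configuration(conf, rng):
    """Hypothetical 'change configuration X' step: a small random perturbation."""
    x = min(1.0, max(0.0, conf["x"] + rng.uniform(-0.1, 0.1)))
    return {"x": x}

def tune(test_data, initial_conf, good_enough=0.95, max_iterations=1000, seed=0):
    rng = random.Random(seed)
    best_conf = initial_conf
    best_score = run_algorithm(best_conf, test_data)
    for _ in range(max_iterations):
        if best_score >= good_enough:      # "Are results good?" -> yes: tuned
            break
        candidate = change_configuration(best_conf, rng)  # no: change X
        score = run_algorithm(candidate, test_data)       # run again
        if score > best_score:             # keep only improvements
            best_conf, best_score = candidate, score
    return best_conf, best_score

best_conf, best_score = tune(test_data=None, initial_conf={"x": 0.0})
```

The `max_iterations` cap is an added safeguard so that the loop ends even if no configuration ever counts as good.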

Tuning of Machine Learning Algorithms
• We need tuning when:
  • A new algorithm or version is released.
  • We want to improve accuracy and/or performance.
  • A new customer arrives and the system must be customized for the new dataset and requirements.

We need to make it smart, automatic and scalable!

Vision

Request:
•  Data set
•  Application (prediction, clustering, classification…)
•  Algorithm (ANN, LR, K-means…)
•  Fitness metrics evaluation (Std. dev, Prob. of false true, clustering coeff., randomness…)
•  Goal constraints (x > 0.9 & 0.3 < y < 0.5)

→ Magic Box →

Response:
•  Best algorithm
•  Optimal configuration
•  Metrics
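
As an illustration of the request side, the goal constraints can be expressed as a predicate over the measured fitness metrics. This is a minimal sketch: the dictionary layout, the key names and the example values are assumptions made for this example, and only the constraint `x > 0.9 & 0.3 < y < 0.5` comes from the slide.

```python
def meets_goal(metrics):
    """Goal constraints from the slide: x > 0.9 and 0.3 < y < 0.5."""
    return metrics["x"] > 0.9 and 0.3 < metrics["y"] < 0.5

# Hypothetical request; every key and value below is illustrative only.
request = {
    "dataset": "customers.csv",
    "application": "clustering",
    "algorithms": ["K-means"],
    "fitness_metrics": ["x", "y"],
    "goal": meets_goal,
}

print(request["goal"]({"x": 0.95, "y": 0.4}))  # both constraints hold
print(request["goal"]({"x": 0.95, "y": 0.6}))  # y is outside (0.3, 0.5)
```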

Architecture Design
[Layered architecture diagram, top to bottom]
• Upper Applications API
• Initializer
• Controller
• Scheduler
• Executors, one per algorithm (ANN, LR, K-Means), each with its own Evaluator and Data Sampler
• Execution backends: Local, Hadoop, Cloud Service

Upper Applications API
Tasks:
• Interfaces the communication between the system and the upper applications layer.
• Parses requests and results and generates the related output domain object.

Possible data formats:
• JSON
• STDIN/OUT
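
A minimal sketch of the JSON variant of this layer, using only Python's standard `json` module. The slides do not define a schema, so the field names in `REQUIRED_FIELDS` and in the response are assumptions made for this example; the raw string could equally be read from STDIN.

```python
import json

# Hypothetical field names; the slides do not define a request schema.
REQUIRED_FIELDS = ("dataset", "application", "algorithms", "metrics", "goal")

def parse_request(raw):
    """Parse a JSON request string and check that the expected fields exist."""
    request = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in request]
    if missing:
        raise ValueError("missing request fields: " + ", ".join(missing))
    return request

def format_response(best_algorithm, configuration, metrics):
    """Serialize the tuning outcome back to JSON for the upper layer."""
    return json.dumps(
        {
            "best_algorithm": best_algorithm,
            "optimal_configuration": configuration,
            "metrics": metrics,
        },
        sort_keys=True,
    )
```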

Initializer
Tasks:
• Generates the initial set of configurations.

Possible implementations:
• Random points
• Latin Hyper Cube
• Dataset similarity
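
To make the Latin Hyper Cube option concrete, here is a small pure-Python sketch. It splits each parameter range into `n_points` equal strata and draws exactly one value per stratum, shuffling the strata independently for each parameter. The function name, the `bounds` format and the parameter names are assumptions for illustration only.

```python
import random

def latin_hypercube(n_points, bounds, seed=None):
    """Return n_points configurations; bounds maps name -> (low, high)."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        # One uniform draw inside each of the n_points equal-width strata.
        strata = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(strata)  # decouple the strata order across parameters
        columns[name] = [low + u * (high - low) for u in strata]
    return [{name: columns[name][i] for name in bounds} for i in range(n_points)]

# Hypothetical parameter ranges, for illustration only.
initial_set = latin_hypercube(5, {"alpha": (0.0, 1.0), "beta": (0.0, 1.0)}, seed=42)
```

Compared with purely random points, every stratum of every parameter is covered exactly once, so a small initial set spreads more evenly over the search space.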

Controller
Tasks:
• Compares and generates configurations.
• Decides the convergence of the tuning.
• Adapts the data sampling request.

Possible implementations:
• Random search
• Grid search
• Stochastic Kriging
• Genetic Algorithms
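
As one possible shape for the Genetic Algorithms option, the sketch below produces a new generation of configurations from an evaluated one: it keeps the better half as parents, applies uniform crossover and mutates some values at random. The function name, the `bounds` format and `mutation_rate` are assumptions, and the population must hold at least two configurations. It is not the controller proposed in the thesis, only an illustration of "compares and generates configurations".

```python
import random

def next_generation(scored, bounds, rng, mutation_rate=0.2):
    """scored: list of (configuration, fitness) pairs; higher fitness is better."""
    # Compare: rank configurations by fitness and keep the better half as parents.
    ranked = [conf for conf, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]
    parents = ranked[: max(2, len(ranked) // 2)]
    # Generate: build as many children as there were configurations.
    children = []
    while len(children) < len(scored):
        a, b = rng.sample(parents, 2)
        child = {}
        for name, (low, high) in bounds.items():
            value = rng.choice((a[name], b[name]))   # uniform crossover
            if rng.random() < mutation_rate:         # random mutation
                value = rng.uniform(low, high)
            child[name] = value
        children.append(child)
    return children
```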

Scheduler
Tasks:
• Checks if the requests are covered by the available services.
• Schedules and parallelizes request executions.
• Optimizes resources.
• Collects evaluated results.

Possible implementations:
• First available
• Oldest idle
• Load balanced
• Serialized (single node)
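
A rough local analogue of "schedules and parallelizes request executions" and "collects evaluated results" can be written with Python's standard `concurrent.futures` module. A thread pool hands each pending request to the first idle worker, which loosely resembles the "first available" policy, and `max_workers=1` gives the serialized case. The `schedule` and `execute` names are assumptions for this sketch, and a Hadoop-backed scheduler would look quite different.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def schedule(configurations, execute, max_workers=4):
    """Run execute(conf) for every configuration and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every request; idle workers pick them up as they become free.
        futures = {pool.submit(execute, conf): conf for conf in configurations}
        # Collect (configuration, result) pairs in completion order.
        for future in as_completed(futures):
            results.append((futures[future], future.result()))
    return results
```

Results arrive in completion order, not submission order, so each one is returned together with its configuration.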

Executor
Tasks:
• Executes the provided algorithm with the specified configuration.

Possible implementations:
• Local execution
• Hadoop cluster
• Cloud service

Sub components:
•  Evaluator: evaluates results according to the specified fitness metrics.
•  Data Sampler: down- and up-sampling of data.
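
The two sub components can be illustrated with the standard library alone. In this sketch, down-sampling draws a random subset without replacement, up-sampling pads the data with rows drawn with replacement, and the evaluator applies named metric functions to the results. All function names and the choice of sampling scheme are assumptions; the slides do not say how sampling or evaluation is done.

```python
import random
import statistics

def down_sample(rows, fraction, seed=None):
    """Keep a random subset of the rows (sampling without replacement)."""
    rng = random.Random(seed)
    size = max(1, int(len(rows) * fraction))
    return rng.sample(rows, size)

def up_sample(rows, target_size, seed=None):
    """Grow the data to target_size by re-drawing rows with replacement."""
    rng = random.Random(seed)
    extra = rng.choices(rows, k=max(0, target_size - len(rows)))
    return list(rows) + extra

def evaluate(results, metrics):
    """Apply each named fitness metric function to the algorithm results."""
    return {name: metric(results) for name, metric in metrics.items()}

# Example: standard deviation is one of the metrics named in the Vision slide.
scores = evaluate([1, 2, 3, 4], {"std_dev": statistics.pstdev})
```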

Tuning diagram
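[Flow diagram, annotated with the responsible components]
Test execution (Scheduler, Executor):
1.  Run the algorithm on the test data with configuration X.
Test control (Initializer, Controller):
2.  Are the results good?
    • no → change configuration X and go back to step 1.
    • yes → tuned.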

SUNS: Simple, Unclever and Not Scalable
• Upper Applications API: STDIN/OUT
• Initializer: Random Points
• Controller: Random Search or Grid Search
• Scheduler: Serialized
• Executor: K-Means, with Evaluator
• Execution: Local

SNS: Smart but Not Scalable
• Upper Applications API: STDIN/OUT or JSON
• Initializer: Latin Hyper Cube
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: Serialized
• Executor: K-Means, with Evaluator
• Execution: Local

VSNS: Very Smart but Not Scalable
• Upper Applications API: STDIN/OUT or JSON
• Initializer: Dataset Similarity
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: Serialized
• Executor: K-Means, with Evaluator
• Execution: Local

VSS: Very Smart and Scalable
• Upper Applications API: STDIN/OUT or JSON
• Initializer: Dataset Similarity
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: First Available
• Executor: K-Means, with Evaluator
• Execution: Hadoop

VSVSO: Very Smart, Very Scalable and Optimized
• Upper Applications API: STDIN/OUT or JSON
• Initializer: Dataset Similarity
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: Load Balanced
• Executor: K-Means, with Evaluator and Data Sampler
• Execution: Hadoop

Thesis
It is possible to build an intelligent system, based on Genetic Algorithms or Stochastic Kriging, that automatically selects and tunes machine learning algorithms such as K-Means and LR, parallelizing the work on a Hadoop cluster to scale in a cost-efficient manner.

Project Plan
Order of priorities:

1.  Design the entire application in Scala in a testable and extensible way.
2.  Implement the Genetic Algorithm or the Stochastic Kriging controller.
3.  Implement the Latin Hyper Cube initializer.
4.  Test with local instances of the algorithms (K-Means and/or LR).
5.  Develop and test at least one algorithm in MapReduce fashion using Hadoop.
6.  Test on a real AgilOne cluster of servers.
7.  Implement the Dataset Similarity initializer.
8.  Implement the Data Sampler.
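
To give an idea of what "in MapReduce fashion" could mean for K-Means, one iteration can be split into a map phase that assigns each point to its nearest centroid and a reduce phase that averages the points of each cluster. The sketch below runs in plain Python on a single machine, with no Hadoop involved, and is written in Python only for brevity even though the plan targets Scala. All function names are assumptions for illustration.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid with the smallest squared distance to point."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_phase(points, centroids):
    """Map: emit (cluster index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point

def reduce_phase(pairs, centroids):
    """Reduce: average the points of each cluster into a new centroid."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # clusters with no points keep their old centroid
    for index, members in groups.items():
        updated[index] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated

def kmeans_iteration(points, centroids):
    return reduce_phase(map_phase(points, centroids), centroids)

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans_iteration(points, [(0, 0), (10, 10)]))
```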

Questions, feedback, suggestions?
