USING SIGOPT TO TUNE DEEP LEARNING
MODELS WITH NERVANA CLOUD
Scott Clark
Co-founder and CEO of SigOpt
scott@sigopt.com @DrScottClark
TRIAL AND ERROR WASTES EXPERT TIME
● Deep Learning is extremely powerful
● Tuning Deep Learning systems is extremely non-intuitive
UNRESOLVED PROBLEM IN ML
https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
What is the most important unresolved problem in machine learning?
“...we still don't really know why some configurations of deep neural networks work
in some cases and not others, let alone having a more or less automatic approach
to determining the architectures and the hyperparameters.”
Xavier Amatriain, VP Engineering at Quora
(former Director of Research at Netflix)
TUNING DEEP LEARNING MODELS
[Diagram: Big Data and Expertise feed a Deep Learning System with tunable parameters; the system's metrics form the objective; the optimizer optimally suggests new parameters, which feed back into the system for better results]
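In code, this feedback loop is a suggest/evaluate/observe cycle. The sketch below uses hypothetical `optimizer` and `train_and_evaluate` objects to show the shape of the loop; it is not a specific API.

```python
# Shape of the tuning loop in the diagram (hypothetical names, not a real API).
def tune(optimizer, train_and_evaluate, budget):
    best = None
    for _ in range(budget):
        params = optimizer.suggest()         # optimally suggests new parameters
        metric = train_and_evaluate(params)  # expensive: train the system on big data
        optimizer.observe(params, metric)    # the objective metric feeds back in
        if best is None or metric > best[1]:
            best = (params, metric)          # track the best result so far
    return best
```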
COMMON APPROACH
Random Search for Hyper-Parameter Optimization, James Bergstra and Yoshua Bengio, 2012
1. Random search or grid search (sketched below)
2. Expert-defined grid search near “good” points
3. Refine domain and repeat steps - “grad student descent”
● Expert-intensive
● Computationally intensive
● Prone to finding only local optima
● Does not fully exploit useful information
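For concreteness, a minimal Python sketch of steps 1-2; `train_and_score` is a stand-in for an expensive training run, and the grid values are illustrative.

```python
import itertools
import random

def train_and_score(params):
    # Placeholder for an expensive training run returning validation accuracy.
    return 1.0 - (params["learning_rate"] - 0.05) ** 2 - (params["momentum"] - 0.9) ** 2

# Grid search: train every combination (cost multiplies with each parameter).
grid = {"learning_rate": [0.001, 0.01, 0.1], "momentum": [0.5, 0.9, 0.99]}
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best_grid = max(combos, key=train_and_score)

# Random search: the same budget of runs, sampled independently over the ranges.
samples = [
    {"learning_rate": random.uniform(0.001, 0.1), "momentum": random.uniform(0.5, 0.99)}
    for _ in range(len(combos))
]
best_random = max(samples, key=train_and_score)
print(best_grid, best_random)
```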
OPTIMAL LEARNING
“… the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.”
Prof. Warren Powell - Princeton
What is the most efficient way to collect information?
Prof. Peter Frazier - Cornell
How do we make the most money, as fast as possible?
Me - @DrScottClark
BAYESIAN GLOBAL OPTIMIZATION
● Optimize some Overall Evaluation Criterion (OEC)
○ Loss, Accuracy, Likelihood, Revenue
● Given tunable parameters
○ Hyperparameters, feature parameters
● In an efficient way
○ Sample the function as few times as possible
○ Training on big data is expensive
Details at https://sigopt.com/research
EXAMPLE: TUNING DNN CLASSIFIERS
CIFAR10 Dataset
● Photos of objects
● 10 classes
● Metric: Accuracy
○ Range [0.1, 1.0] (0.1 ≈ random-guess baseline over 10 classes)
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
USE CASE: ALL CONVOLUTIONAL
http://arxiv.org/pdf/1412.6806.pdf
● All-convolutional neural network
● Multiple convolutional and dropout layers
● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
EXAMPLE: NCLOUD/NEON
● epochs: “number of epochs to run fit” - int [1, ∞)
● learning rate: influence on current value of weights at each step - double (0, 1]
● momentum coefficient: “the coefficient of momentum” - double (0, 1]
● weight decay: parameter affecting how quickly weights decay - double (0, 1]
● depth: parameter affecting number of layers in net - int [1, 20(?)]
● gaussian scale: standard deviation of initialization normal distribution - double (0, ∞)
● momentum step change: multiplicative amount to decrease momentum - double (0, 1]
● momentum step schedule start: epoch to start decreasing momentum - int [1, ∞)
● momentum schedule width: epoch stride for decreasing momentum - int [1, ∞)
Many tunable parameters... and the optimal values are non-intuitive.
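As an illustration, a subset of these parameters could be handed to SigOpt with the experiment/suggestion/observation pattern of its Python client; the bounds and the `train_and_evaluate` trainer below are illustrative assumptions, not the exact configuration used in the talk.

```python
from sigopt import Connection

def train_and_evaluate(assignments):
    # Hypothetical: configure and train the neon model with these
    # assignments on Nervana Cloud, then return test accuracy.
    ...

conn = Connection(client_token="YOUR_API_TOKEN")  # placeholder token

experiment = conn.experiments().create(
    name="All-CNN on CIFAR10 (ncloud/neon)",
    parameters=[
        dict(name="epochs", type="int", bounds=dict(min=1, max=200)),
        dict(name="learning_rate", type="double", bounds=dict(min=1e-4, max=1.0)),
        dict(name="momentum_coef", type="double", bounds=dict(min=1e-4, max=1.0)),
        dict(name="weight_decay", type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="depth", type="int", bounds=dict(min=1, max=20)),
        dict(name="gaussian_scale", type="double", bounds=dict(min=1e-4, max=1.0)),
    ],
)

for _ in range(100):  # evaluation budget
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = train_and_evaluate(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=accuracy
    )
```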
COMPARATIVE PERFORMANCE
● Expert baseline: 0.8995 accuracy (using neon)
● SigOpt best: 0.9011 accuracy
○ 1.6% reduction in error rate
○ No expert time wasted in tuning
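For reference, the error-rate arithmetic: baseline error is 1 − 0.8995 = 0.1005, tuned error is 1 − 0.9011 = 0.0989, and (0.1005 − 0.0989) / 0.1005 ≈ 1.6% relative reduction.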
USE CASE: DEEP RESIDUAL
http://arxiv.org/pdf/1512.03385v1.pdf
● Explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions (see the equation below)
● Variable depth
● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
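The reformulation from the cited paper, in its notation: a building block learns a residual mapping $\mathcal{F}$ and adds the input back through a shortcut connection,

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x},$$

so when the identity mapping is near-optimal the block only has to push $\mathcal{F}$ toward zero.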
COMPARATIVE PERFORMANCE
● Expert baseline (standard method, from paper): 0.9339 accuracy
● SigOpt best: 0.9343 accuracy
○ Found after 17 trials
○ 0.61% reduction in error rate
○ No expert time wasted in tuning
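Computed the same way as before: baseline error 1 − 0.9339 = 0.0661, tuned error 1 − 0.9343 = 0.0657, and (0.0661 − 0.0657) / 0.0661 ≈ 0.61%.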
Questions?
scott@sigopt.com
@DrScottClark
https://sigopt.com
@SigOpt
TRY OUT SIGOPT FOR FREE
https://sigopt.com/get_started
● Quick example and intro to SigOpt
● No signup required
● Visual and code examples
https://sigopt.com/text-classifier
● Jupyter Notebook
● Use SigOpt to tune feature and model parameters
● Detailed walkthrough with code
MORE EXAMPLES
https://github.com/sigopt/sigopt-examples
Examples of using SigOpt in a variety of languages and contexts.
Tuning Machine Learning Models (with code)
A comparison of different hyperparameter optimization methods.
Using Model Tuning to Beat Vegas (with code)
Using SigOpt to tune a model for predicting basketball scores.
Learn more about the technology behind SigOpt at
https://sigopt.com/research
HOW DOES IT WORK?
1. User reports data
2. SigOpt builds statistical model
(Gaussian Process)
3. SigOpt finds the points of
highest Expected Improvement
4. SigOpt suggests best
parameters to test next
5. User tests those parameters
and reports results to SigOpt
6. Repeat
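This is not SigOpt's internal code, but steps 2-3 can be illustrated with scikit-learn and the standard closed form for Expected Improvement; the observed points below are made up.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Step 1: (parameter, metric) pairs reported so far, e.g. learning rate -> accuracy.
X = np.array([[0.01], [0.10], [0.50]])
y = np.array([0.85, 0.90, 0.87])

# Step 2: fit a Gaussian Process to the observations.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Step 3: score candidates by Expected Improvement over the best metric seen.
candidates = np.linspace(1e-3, 1.0, 500).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
best = y.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Step 4: suggest the highest-EI candidate to test next.
print("next parameters to try:", candidates[np.argmax(ei)])
```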
GPs: FUNCTIONAL VIEW
[Figure]
GPs: FITTING THE GP
[Figure: example fits showing overfit, good fit, and underfit]
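For reference, “fitting the GP” means computing the posterior mean and variance at a new point $x_*$ from observations $(X, y)$, a standard Gaussian Process regression result (not SigOpt-specific):

$$\mu(x_*) = k_*^\top (K + \sigma_n^2 I)^{-1} y, \qquad \sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*,$$

where $K_{ij} = k(x_i, x_j)$, $(k_*)_i = k(x_i, x_*)$, and $\sigma_n^2$ is the observation noise.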
