Facebook Talk at Netflix ML Platform meetup Sep 2019

Practical Solutions to Exploration Problems
Sam Daulton
Core Data Science, Facebook

Overview
1 Adaptive Experimentation
Introduction
2 Direct policy search via Bayesian optimization
Motivating Example
Gaussian Process Regression
Bayesian Optimization
3 Combining online and offline experiments
Value Model Tuning
Multi-Task Bayesian Optimization
4 Open Source Tools
Ax
BoTorch
5 Constrained Bayesian Contextual Bandits
Video Upload Transcoding Optimization
Constrained Thompson Sampling (CTS)
Reward Shaping and Hyperparameter Optimization

Adaptive Experimentation Team
• Horizontal R&D team within Facebook
• Goal: radically change the way people run experiments and develop systems:
  • Reduce the threshold for experimentation
  • Use RL to robustly solve explore/exploit problems
  • Develop tools to improve and automate decision-making under multiple and/or constrained objectives

Spectrum of Automation

Heterogeneous Connections and Devices

Homogeneous Status Quo Policy
Idea: What if we loaded different numbers of stories depending on the
connection type?

Potential Contextualized Policy
Idea: What if we loaded more posts for better connection types?

Potential Contextualized Policy - Opposite
Idea: What if we loaded fewer posts for better connection types?

Potential Contextualized Policies
Suppose that for each connection type c:
• We could fetch any number of posts $x_c \in [2, 24]$
• Then there are $22^4 = 234{,}256$ possible configurations to test!

Policies as Black-box Functions
The average treatment effect over all individuals can be expected to be
some smooth function of the policy table $x = [x_1, \dots, x_k]$:
$f(x) : \mathbb{R}^k \to \mathbb{R}$

Black-box Function View of RL
• Turns the "full RL" problem into an infinite-armed bandit problem
$\pi_{x^*} = \arg\max_{x} g(f(x))$
• Advantages:
• Does not require estimating value functions, state transition functions,
or inference about unobserved states
• Involves virtually no logging of actions, states, or intermediate rewards
• Allows for direct maximization of multiple, delayed rewards
Question: How can we make predictions about long-term outcomes from a
limited number of vector-valued policies?

Gaussian Process (GP) Priors

Gaussian Process (GP) Posteriors

GP regression gives well-calibrated posterior predictive intervals that are
easy to compute
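
The posterior intervals mentioned above follow from the standard GP regression equations. As a sketch, assuming a zero-mean GP prior with kernel $k$ and Gaussian observation noise of variance $\sigma_n^2$ (these symbols are introduced here for illustration and do not appear in the slides), the posterior at a new policy $x_*$ given observed policies $X$ and outcomes $y$ is:

```latex
\begin{aligned}
\mu(x_*) &= k_*^\top \left(K + \sigma_n^2 I\right)^{-1} y, \\
\sigma^2(x_*) &= k(x_*, x_*) - k_*^\top \left(K + \sigma_n^2 I\right)^{-1} k_*, \\
\text{where } K_{ij} &= k(x_i, x_j), \quad (k_*)_i = k(x_i, x_*).
\end{aligned}
```

An approximate 95% interval for the latent function value is then $\mu(x_*) \pm 1.96\,\sigma(x_*)$.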

Gaussian Process (GP) Regression
In practice, we find that GP surrogate models fit the data well for many
online experiments.

Other Examples with Continuous Action Spaces
• Value models governing ranking policies: e.g.
$\mathrm{rank}(Z) = x_1 P(\text{click} \mid Z) + x_2 Z_{\text{num friends}}^{x_3} + f(P(\text{spam} \mid Z)/x_4) + \dots$
• Bit-rate controllers for video and audio streaming
• Data retrieval policies for ML backends
Question: How do we use GP surrogate models to guide the
explore-exploit trade-off?

Setup

Round 1

Round 2

Round 3

q-Batch Bayesian Optimization

Response surface is maximized sequentially
• Models tell us which regions should be considered for further
assessment

Algorithm 1 BayesianOptimization
1: Run N random initial arms
2: for t = 0 to T do
3:   Fit GP model to data
4:   Use acquisition function to select candidates C
5:   Evaluate C on black-box function
6:   Add new observations to dataset
7: end for
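
The loop in Algorithm 1 can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the implementation used in the talk: it assumes a one-dimensional policy parameter on a grid in [0, 1], a zero-mean GP with a squared-exponential kernel, one candidate per round, and an upper-confidence-bound acquisition function swapped in for simplicity. All function names and parameters (`rbf`, `gp_posterior`, `beta`, `noise`, and so on) are hypothetical.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def solve_lower_transposed(L, b):
    """Solve L^T z = b by back substitution."""
    n = len(b)
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (b[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z

def gp_posterior(X, y, x_new, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y))
    k_star = [rbf(a, x_new) for a in X]
    v = solve_lower(L, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mean, math.sqrt(var)

def bayesian_optimization(f, n_init=3, n_rounds=10, beta=2.0, seed=0):
    """Toy version of Algorithm 1 with a UCB acquisition function."""
    rng = random.Random(seed)
    grid = [i / 100 for i in range(101)]            # candidate arms in [0, 1]
    X = [rng.choice(grid) for _ in range(n_init)]   # random initial arms
    y = [f(x) for x in X]
    for _ in range(n_rounds):
        def ucb(x):
            # "Fit" the GP to all data so far and score the candidate
            mean, std = gp_posterior(X, y, x)
            return mean + beta * std
        x_next = max(grid, key=ucb)                 # select the candidate
        X.append(x_next)                            # evaluate it and
        y.append(f(x_next))                         # add it to the dataset
    best = max(range(len(X)), key=lambda i: y[i])
    return X[best], y[best]

if __name__ == "__main__":
    # Hypothetical smooth response surface with its maximum at x = 0.3
    print(bayesian_optimization(lambda x: -(x - 0.3) ** 2))
```

In practice the acquisition step would use a library such as BoTorch rather than a grid scan, and the noise level and kernel hyperparameters would be fitted rather than fixed.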

Alternatives
Grid Search (Expensive - 81 arms)

Alternatives
Random Search (Cheaper - 25 arms)
• Maxima can be deduced with only a few, smartly chosen arms

Competing Objectives
• Product teams are used to running an A/B test and observing the
outcomes.
• Often, there are multiple competing objectives

If we want full automation, we need to specify more information in
advance: ideally, "the" scalarized objective

Decision Makers Have Multiple Objectives

Decision makers don’t like scalarizations: e.g.
objective = −0.8 · cpu + 1.1 · time spent

Decision makers prefer constraints:
min(cpu) subject to time spent > 0.7
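
The difference between the two formulations can be seen on a small made-up example. The sketch below uses hypothetical arm-level estimates (the arm names and numbers are invented for illustration) and shows that the scalarized objective and the constrained formulation can pick different arms.

```python
# Hypothetical arm-level estimates: (name, cpu, time_spent). Values are made up.
arms = [
    ("A", 1.00, 0.60),
    ("B", 0.90, 0.72),
    ("C", 1.20, 0.95),
    ("D", 0.70, 0.50),
]

def best_scalarized(arms, w_cpu=-0.8, w_time=1.1):
    """Pick the arm maximizing the weighted sum of the two outcomes."""
    return max(arms, key=lambda a: w_cpu * a[1] + w_time * a[2])

def best_constrained(arms, min_time_spent=0.7):
    """Pick the lowest-cpu arm among those with time_spent above the threshold."""
    feasible = [a for a in arms if a[2] > min_time_spent]
    if not feasible:
        return None  # no arm satisfies the constraint
    return min(feasible, key=lambda a: a[1])

if __name__ == "__main__":
    print("scalarized:", best_scalarized(arms)[0])
    print("constrained:", best_constrained(arms)[0])
```

With these numbers the weighted sum favors arm C, while the constrained formulation selects arm B, the cheapest arm that still meets the time-spent threshold. The constraint is also easier to reason about than the choice of weights.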

Practical Challenges
• Constrained optimization
• Observations often have high variance, leading to potentially large
measurement error
• High noise levels can degrade the performance of many common
acquisition functions including Expected Improvement
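
For reference, the standard closed form of Expected Improvement for a maximization problem is sketched below. The symbols are introduced here for illustration: $\mu(x)$ and $\sigma(x)$ are the GP posterior mean and standard deviation, $f^*$ is the best value found so far, and $\Phi$ and $\varphi$ are the standard normal CDF and PDF.

```latex
\begin{aligned}
\mathrm{EI}(x) &= \mathbb{E}\left[\max\left(f(x) - f^*, 0\right)\right]
  = \left(\mu(x) - f^*\right)\Phi(z) + \sigma(x)\,\varphi(z), \\
z &= \frac{\mu(x) - f^*}{\sigma(x)}.
\end{aligned}
```

This form treats $f^*$ as known. With noisy measurements the best value so far is itself uncertain, which is one way to see why high noise can hurt acquisition functions of this kind.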

Solution
For more details, see
• Constrained Bayesian Optimization with Noisy Experiments. Letham,
Karrer, Ottoni, & Bakshy 2019, Bayesian Analysis.

Value Model Tuning
• Ranking teams use value models, which combine multiple predictive models
and features, e.g.
$\mathrm{rank}(Z) = x_1 P(\text{click} \mid Z) + x_2 Z_{\text{num friends}}^{x_3} + f(P(\text{spam} \mid Z)/x_4) + \dots$
• Not feasible to run sufficiently powered experiments with 20+ arms,
so the team developed a simulator

Simulation Setup

Biased Simulator

Debiasing Simulations with Multi-Task Models

Multi-Task Bayesian Optimization Loop
Algorithm 2 MultiTaskBayesianOptimization
1: Run N random arms online
2: Run M random arms offline with M > N
3: for each iteration do
4:   Fit MT-GP model to all data, with each batch as separate task
5:   Use NEI to generate q candidates C (e.g. q = 30)
6:   Run C on the simulator, fit GP model again
7:   Use NEI to generate candidates to run online
8: end for
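
The slides do not state which kernel the MT-GP model uses. One common construction for multi-task GPs, given here only as an assumed illustration, is the intrinsic coregionalization model, in which the covariance between task $t$ at policy $x$ and task $t'$ at policy $x'$ factors into a task term and a policy term:

```latex
\operatorname{cov}\left(f_t(x), f_{t'}(x')\right) = B_{t t'} \, k(x, x'),
\qquad B \succeq 0 .
```

Here $B$ is a positive semi-definite matrix of inter-task covariances and $k$ is a kernel over policies (both symbols are introduced for this sketch). When the online and simulator tasks are strongly correlated under $B$, the many cheap simulator observations inform predictions for the online task, while the online observations correct the simulator's bias.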

Example of Multi-Task Bayesian Optimization
[Figure: outcomes by iteration (x-axis: Iteration, 0 to 40; y-axis: Outcome), with panels for the Objective and the Constraints.]

Paper
For more details, see
• Bayesian Optimization for Policy Search via Online-Offline
Experimentation. Letham & Bakshy 2019. Forthcoming, arXiv
1904.01049

Open Source Tools

Research to Production

Simple APIs

Adaptive Experimentation in Practice

Experiment Understanding

BoTorch

BoTorch: Building Blocks

Improving Researcher Efficiency

Problem
• System receives requests to upload videos of different source qualities
and file sizes from a variety of network connections and devices.
• To ensure high reliability, a video may be transcoded to be uploaded
at a lower quality
• For each video upload request, we have features about
• the video: file size, duration, source resolution
• the network: country, network type, download speed
• the device
Goal
• Maximize quality preserved without decreasing reliability

Video Upload Transcoding - CB Problem
• Context: features about video, network, device
• Actions: 360p, 480p, 720p, 1080p
• Outcomes: reliability y(x, a)
• Rewards: ?? some function R(x, a, y)

Approach - Bandit Algorithm
Thompson Sampling
• Works well in batch mode
• Hyper-parameter free exploration
• Always "picks the best" codec under a sampled model: each codec is picked
with probability proportional to its probability of being the best

Approach - Modeling
Bayesian Linear Model
• Bernoulli likelihood to predict reliability
• Using a neural network feature extractor
• Simple two-layer MLP (50, 4) trained via SGD
• Last layer is a stochastic variational GP with a linear kernel
• Trained via stochastic variational inference using 1000 inducing points
chosen according to a space-filling design

Thompson Sampling
Algorithm 3 ThompsonSampling
1: Input: discrete set of actions $A$, distribution over models $P_0(f)$
2: for each round $t$ do
3:   Sample model $\tilde{f}_t \sim P_t(f \mid X, y)$
4:   Select an action $a_t \leftarrow \arg\max_{a \in A} E(r_t \mid x_t, a, \tilde{f}_t)$
5:   Observe reward $r_t$
6:   Update distribution $P_{t+1}(f)$
7: end for
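
The sample, select, observe, update cycle of Algorithm 3 can be illustrated with a much simpler model than the one described in the talk. The sketch below drops the context entirely and assumes independent Beta-Bernoulli posteriors per action, so it is a stand-in for the idea rather than the Bayesian linear model used for transcoding. The function name and the simulated success rates are hypothetical.

```python
import random

def thompson_sampling(true_success, n_rounds=2000, seed=0):
    """Non-contextual Thompson sampling with Beta-Bernoulli posteriors.

    true_success maps each action to a success probability that is used
    only to simulate rewards; the algorithm itself never reads it directly.
    """
    rng = random.Random(seed)
    actions = list(true_success)
    # Beta(1, 1) prior per action, stored as [alpha, beta]
    posterior = {a: [1, 1] for a in actions}
    pulls = {a: 0 for a in actions}
    for _ in range(n_rounds):
        # Sample one plausible model: a success rate for every action
        sampled = {a: rng.betavariate(*posterior[a]) for a in actions}
        # Act greedily with respect to the sampled model
        a_t = max(actions, key=sampled.get)
        # Observe a simulated Bernoulli reward
        r_t = 1 if rng.random() < true_success[a_t] else 0
        # Conjugate posterior update for the chosen action
        posterior[a_t][0] += r_t
        posterior[a_t][1] += 1 - r_t
        pulls[a_t] += 1
    return pulls, posterior

if __name__ == "__main__":
    pulls, _ = thompson_sampling({"low": 0.1, "high": 0.9})
    print(pulls)
```

Exploration needs no tuning parameter here: actions with uncertain posteriors occasionally produce high samples and get tried, and this happens less often as evidence accumulates.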

Issues with Vanilla Thompson Sampling
• Thompson sampling does not account for the constraint
• Change in reliability must be non-negative
• Unclear how to optimally specify reward parameterization

Constrained Thompson Sampling
Algorithm 4 ConstrainedThompsonSampling
1: Input: discrete set of actions $A$, distribution over models $P_0(f)$
2: for each round $t$ do
3:   Receive context $x_t$
4:   Sample model $\tilde{f}_t \sim P_t(f \mid X, y)$
5:   for $a \in A$ do
6:     Estimate outcomes $\tilde{f}_t(x_t, a)$
7:   end for
8:   Fetch action under baseline policy $b \leftarrow \pi_b(x_t)$
9:   Filter feasible actions: $A_{\text{feas}} \leftarrow \{a \in A \mid \tilde{f}_t(x_t, a) \ge \varepsilon \cdot \tilde{f}_t(x_t, b)\}$
10:   Select an action $a_t \leftarrow \arg\max_{a \in A_{\text{feas}}} E(r_t \mid x_t, a, \tilde{f}_t)$
11:   Observe outcome $y_t$
12:   Update distribution $P_{t+1}(f)$
13: end for
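
The filter and selection steps of Algorithm 4 are sketched below for a single request. The sampled reliabilities and reward values are invented for illustration, and the function and variable names are hypothetical. The expected reward is taken to be the sampled reliability times a per-quality reward, which is one plausible reading of the setup.

```python
def constrained_select(sampled_outcome, expected_reward, baseline_action, epsilon=0.95):
    """Pick the highest-reward action whose sampled outcome is at least
    epsilon times the sampled outcome of the baseline action."""
    threshold = epsilon * sampled_outcome[baseline_action]
    feasible = [a for a, y in sampled_outcome.items() if y >= threshold]
    # With epsilon <= 1 and non-negative outcomes, the baseline action is
    # always feasible, so the feasible set is never empty.
    return max(feasible, key=lambda a: expected_reward[a])

# Hypothetical sampled reliabilities for one upload request
sampled_reliability = {"360p": 0.97, "480p": 0.96, "720p": 0.93, "1080p": 0.90}
# Hypothetical reward for a successful upload at each quality
success_reward = {"360p": 1.0, "480p": 1.1, "720p": 1.2, "1080p": 1.3}
expected_reward = {a: sampled_reliability[a] * success_reward[a]
                   for a in sampled_reliability}

if __name__ == "__main__":
    print(constrained_select(sampled_reliability, expected_reward, "480p", epsilon=0.95))
    print(constrained_select(sampled_reliability, expected_reward, "480p", epsilon=0.90))
```

With these numbers, 1080p has the highest expected reward but fails the constraint at ε = 0.95, so 720p is chosen. Relaxing ε to 0.90 makes 1080p feasible again.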

Reward Shaping Setup
Reward Shaping:
• Reward is 0 if the upload is a failure
• Reward is fixed at 1 for a 360p upload success
• Reward is monotonically increasing with quality:
$R(y = 1, a) = 1 + \sum_{a' \le a} w_{a'}$
where $w_i \in (0.0, 0.2]$
Safety Constraint: ε ∈ [0.95, 1.0]
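
A minimal sketch of the shaped reward follows. It assumes that the increments apply to qualities above 360p, so that a successful 360p upload earns exactly 1 as stated above. The slides do not spell out the index range of the sum, so this is an interpretation, and the weight values and names below are hypothetical.

```python
QUALITIES = ["360p", "480p", "720p", "1080p"]  # ordered from lowest to highest

def shaped_reward(success, action, weights):
    """Return 0 on failure, else 1 plus the increments up to the chosen quality.

    weights maps each quality above 360p to an increment in (0, 0.2].
    """
    for quality, w in weights.items():
        if not 0.0 < w <= 0.2:
            raise ValueError(f"weight for {quality} must lie in (0, 0.2]")
    if not success:
        return 0.0
    idx = QUALITIES.index(action)
    # Assumption: increments start at 480p, so 360p contributes no increment.
    return 1.0 + sum(weights[q] for q in QUALITIES[1 : idx + 1])

if __name__ == "__main__":
    w = {"480p": 0.1, "720p": 0.2, "1080p": 0.05}  # hypothetical values
    for q in QUALITIES:
        print(q, shaped_reward(True, q, w))
```

Because every increment is strictly positive, the reward increases with quality for any admissible choice of weights. The weights and ε are therefore the only free parameters left to tune.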

Reward Shaping Optimization
• Teams care about top-line outcomes:
• Reliability: mean reliability per user
• Quality preserved: mean quality (e.g., 1080p preserved, HD) per user
• Other outcomes: watch time, content production
• Difficult to evaluate these outcomes from purely offline data
Solution: Use Bayesian Optimization (via Ax) using online experiments

(a) 1080p quality preserved (b) Reliability
Figure: GP-modeled response surface of mean percent change in video quality
and reliability relative to the baseline policy. Each point represents a policy
parameterized by reward function hyperparameters and constraint parameter ε.

Thanks
• Manager: Eytan Bakshy
• Constrained TS: Sam Daulton, Shaun Singh, Drew Dimmery
• BoTorch: Max Balandat, Sam Daulton, Daniel Jiang, Brian Karrer,
Ben Letham
• Ax: Kostya Kashin, Lili Dworkin, Lena Kashtelyan, Ben Letham,
Ashwin Murthy, Shaun Singh, and Drew Dimmery
Papers
• Constrained Bayesian Optimization with Noisy Experiments. Letham
et al. 2019, Bayesian Analysis.
• Bayesian Optimization for Policy Search via Online-Offline
Experimentation. Letham & Bakshy 2019. Forthcoming, arXiv
1904.01049
