
Automatic differentiation

Matthew J Johnson ([email protected])


Deep Learning Summer School
Montreal 2017

Dougal Maclaurin, David Duvenaud, Ryan P Adams
Our awesome new world

• TensorFlow, Stan, Theano, Edward, PyTorch, MinPy


• Only need to specify forward model
• Autodiff + optimization / inference done for you


• loops? branching? recursion? closures? data structures?


• debugger?
• a second compiler/interpreter to satisfy
• a new mini-language to learn
Autograd

• github.com/hips/autograd
• differentiates native Python code

• handles most of Numpy + Scipy

• loops, branching, recursion, closures

• arrays, tuples, lists, dicts, classes, …

• derivatives of derivatives

• a one-function API


• small and easy to extend
Autograd examples

import autograd.numpy as np
import autograd.numpy.random as npr
from autograd import grad

def predict(weights, inputs):
    for W, b in weights:
        outputs = np.dot(inputs, W) + b
        inputs = np.tanh(outputs)
    return outputs

def init_params(scale, sizes):
    return [(npr.randn(m, n) * scale,
             npr.randn(n) * scale)
            for m, n in zip(sizes[:-1], sizes[1:])]

def logprob_fun(weights, inputs, targets):
    preds = predict(weights, inputs)
    return np.sum((preds - targets)**2)

gradient_fun = grad(logprob_fun)
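
To see how these pieces fit together, here is a minimal usage sketch (the layer sizes and random data are illustrative assumptions, not from the slides):

weights = init_params(scale=0.1, sizes=[784, 128, 10])   # hypothetical layer sizes
inputs = npr.randn(32, 784)                              # a batch of 32 inputs
targets = npr.randn(32, 10)

loss = logprob_fun(weights, inputs, targets)             # forward pass
grads = gradient_fun(weights, inputs, targets)           # same structure as weights:
                                                         # a list of (W, b) pairs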

Autograd examples

import autograd.numpy as np
from autograd import grad
import matplotlib.pyplot as plt

x = np.linspace(-7, 7, 200)
plt.plot(x, np.tanh(x),
         x, grad(np.tanh)(x),                               # first derivative
         x, grad(grad(np.tanh))(x),                         # second derivative
         x, grad(grad(grad(np.tanh)))(x),                   # third derivative
         x, grad(grad(grad(grad(np.tanh))))(x),             # fourth derivative
         x, grad(grad(grad(grad(grad(np.tanh)))))(x),       # fifth derivative
         x, grad(grad(grad(grad(grad(grad(np.tanh))))))(x)) # sixth derivative
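
With recent Autograd releases, where grad is restricted to scalar-output functions, the same picture can be reproduced with elementwise_grad; a minimal sketch:

import autograd.numpy as np
from autograd import elementwise_grad as egrad   # broadcasts the derivative
import matplotlib.pyplot as plt

x = np.linspace(-7, 7, 200)
plt.plot(x, np.tanh(x),
         x, egrad(np.tanh)(x),                   # first derivative
         x, egrad(egrad(np.tanh))(x),            # second derivative
         x, egrad(egrad(egrad(np.tanh)))(x))     # third derivative
plt.show()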

Hessians and HVPs

from autograd import grad, jacobian

def hessian(fun, argnum=0):
    return jacobian(jacobian(fun, argnum), argnum)

def hvp(fun):
    def grad_dot_vector(arg, vector):
        return np.dot(grad(fun)(arg), vector)
    return grad(grad_dot_vector)

\nabla^2 f(x) \cdot v = \nabla_x \left( \nabla_x f(x) \cdot v \right)
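
A quick usage sketch of the two helpers above (the function f and the inputs are illustrative choices):

import autograd.numpy as np

f = lambda x: np.sum(np.tanh(x)**2)      # illustrative scalar function
x = np.array([0.5, -1.0, 2.0])
v = np.array([1.0, 0.0, 0.0])

print(hessian(f)(x))                     # full 3x3 Hessian via nested jacobian
print(hvp(f)(x, v))                      # Hessian-vector product, no full Hessian
print(np.dot(hessian(f)(x), v))          # agrees with the line above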
But what about inference?

Stan also provides inference routines...

Black-box inference in a tweet
Tutorial goals

1. Jacobians and the chain rule


• Forward and reverse accumulation

2. Autograd’s implementation
• Fully closed tracing autodiff in Python

3. Advanced autodiff techniques


• Checkpointing, forward from reverse, differentiating optima and fixed points
F : \mathbb{R}^n \to \mathbb{R}, \qquad F : x \mapsto y, \qquad x \in \mathbb{R}^n, \quad y \in \mathbb{R}

F = D \circ C \circ B \circ A, \qquad y = F(x) = D(C(B(A(x))))

y = D(c), \quad c = C(b), \quad b = B(a), \quad a = A(x)

F'(x) = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}

F'(x) = \frac{\partial y}{\partial c} \, \frac{\partial c}{\partial b} \, \frac{\partial b}{\partial a} \, \frac{\partial a}{\partial x}

\frac{\partial y}{\partial c} = D'(c), \quad \frac{\partial c}{\partial b} = C'(b), \quad \frac{\partial b}{\partial a} = B'(a), \quad \frac{\partial a}{\partial x} = A'(x)
Forward accumulation groups the chain rule from the right:

F'(x) = \frac{\partial y}{\partial c} \left( \frac{\partial c}{\partial b} \left( \frac{\partial b}{\partial a} \frac{\partial a}{\partial x} \right) \right)

where each intermediate product is a full Jacobian matrix, e.g.

\frac{\partial b}{\partial x} = \begin{bmatrix} \frac{\partial b_1}{\partial x_1} & \cdots & \frac{\partial b_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial b_m}{\partial x_1} & \cdots & \frac{\partial b_m}{\partial x_n} \end{bmatrix}

Reverse accumulation groups the chain rule from the left:

F'(x) = \left( \left( \frac{\partial y}{\partial c} \frac{\partial c}{\partial b} \right) \frac{\partial b}{\partial a} \right) \frac{\partial a}{\partial x}

where each intermediate product is a row vector, e.g.

\frac{\partial y}{\partial b} = \begin{bmatrix} \frac{\partial y}{\partial b_1} & \cdots & \frac{\partial y}{\partial b_m} \end{bmatrix}
Multiplying by a vector v on the right keeps every intermediate a vector:

F'(x)\, v = \frac{\partial y}{\partial c} \left( \frac{\partial c}{\partial b} \left( \frac{\partial b}{\partial a} \left( \frac{\partial a}{\partial x}\, v \right) \right) \right)

Forward accumulation \leftrightarrow Jacobian-vector products

Build Jacobian one column at a time:

F'(x) = \frac{\partial y}{\partial c} \left( \frac{\partial c}{\partial b} \left( \frac{\partial b}{\partial a} \left( \frac{\partial a}{\partial x}\, \frac{\partial x}{\partial x} \right) \right) \right)
Multiplying by a covector v^\top on the left does the same for reverse:

v^\top F'(x) = \left( \left( \left( v^\top \frac{\partial y}{\partial c} \right) \frac{\partial c}{\partial b} \right) \frac{\partial b}{\partial a} \right) \frac{\partial a}{\partial x}

Reverse accumulation \leftrightarrow vector-Jacobian products

Build Jacobian one row at a time:

F'(x) = \left( \left( \left( \frac{\partial y}{\partial y}\, \frac{\partial y}{\partial c} \right) \frac{\partial c}{\partial b} \right) \frac{\partial b}{\partial a} \right) \frac{\partial a}{\partial x}
Forward and reverse accumulation

• Forward accumulation
• Jacobian-vector products

• “push-forward”

• build Jacobian matrix one column at a time

• Reverse accumulation
• vector-Jacobian products

• “pull-back”

• build Jacobian matrix one row at a time
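
To make the two accumulation orders above concrete, here is a tiny plain-NumPy sketch with explicit Jacobians for a three-stage linear composition (all shapes are illustrative assumptions):

import numpy as np

A = np.random.randn(4, 3)    # Jacobian of the first stage
B = np.random.randn(5, 4)    # Jacobian of the second stage
C = np.random.randn(2, 5)    # Jacobian of the third stage

def jvp(v):                  # forward accumulation: right-to-left, one column per pass
    return C @ (B @ (A @ v))

def vjp(u):                  # reverse accumulation: left-to-right, one row per pass
    return ((u @ C) @ B) @ A

J = C @ B @ A
J_by_columns = np.stack([jvp(e) for e in np.eye(3)], axis=1)
J_by_rows = np.stack([vjp(e) for e in np.eye(2)], axis=0)
assert np.allclose(J, J_by_columns) and np.allclose(J, J_by_rows)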


Non-chain composition

Fan-in: \quad y = F(x_1, x_2)

\frac{\partial y}{\partial x_1} = F_1'(x_1, x_2), \qquad \frac{\partial y}{\partial x_2} = F_2'(x_1, x_2)

Fan-out: \quad G(x) = \begin{bmatrix} x \\ x \end{bmatrix} = \begin{bmatrix} I \\ I \end{bmatrix} x

G'(x) = \begin{bmatrix} I \\ I \end{bmatrix}, \qquad v^\top G'(x) = \begin{bmatrix} v_1^\top & v_2^\top \end{bmatrix} \begin{bmatrix} I \\ I \end{bmatrix} = v_1^\top + v_2^\top
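
The fan-out rule is what makes gradient contributions add when a variable is reused; a quick check in Autograd (the function is an illustrative choice):

import autograd.numpy as np
from autograd import grad

f = lambda x: np.sin(x) + x**2           # x fans out into two uses
print(grad(f)(1.0))                      # cos(1) + 2 ≈ 2.5403
print(np.cos(1.0) + 2.0)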
Tutorial goals

1. Jacobians and the chain rule


• Forward and reverse accumulation

2. Autograd’s implementation
• Fully closed tracing autodiff in Python

3. Advanced autodiff techniques


• Checkpointing, forward from reverse, differentiating optima and fixed points
Autodiff implementations

1. Read and generate source code ahead-of-time


• source and target language could be Python

• or a “computation graph” language (TensorFlow)

2. Monitor function execution at runtime


Autograd’s ingredients

1. Tracing the composition of primitive functions

2. Defining a vector-Jacobian product (VJP) operator for each primitive

3. Composing VJPs backward


[Diagram: autograd.numpy.sum is a primitive that wraps numpy.sum. A traced
Node ã (value: a, function: F, parents: [x]) is unboxed to its raw value a,
numpy.sum computes b, and the result is boxed into a new Node b̃
(value: b, function: anp.sum, parents: [ã]).]
class Node(object):
    __slots__ = ['value', 'recipe', 'progenitors', 'vspace']

    def __init__(self, value, recipe, progenitors):
        self.value = value
        self.recipe = recipe
        self.progenitors = progenitors
        self.vspace = vspace(value)

class primitive(object):
    def __call__(self, *args, **kwargs):
        argvals = list(args)
        progenitors = set()
        parents = []
        for argnum, arg in enumerate(args):
            if isnode(arg):
                argvals[argnum] = arg.value
                if argnum in self.zero_vjps: continue
                parents.append((argnum, arg))
                progenitors.update(arg.progenitors & active_progenitors)

        result_value = self.fun(*argvals, **kwargs)

        return new_node(result_value, (self, args, kwargs, parents), progenitors)

def forward_pass(fun, args, kwargs, argnum=0):
    args = list(args)
    start_node = new_progenitor(args[argnum])
    args[argnum] = start_node
    active_progenitors.add(start_node)
    end_node = fun(*args, **kwargs)
    active_progenitors.remove(start_node)
    return start_node, end_node
[Diagram: forward_pass builds the trace from start_node to end_node:
x \to a = A(x) \to b = B(a) \to c = C(b) \to y = D(c)]

No control flow!
Autograd’s ingredients

1. Tracing the composition of primitive functions

2. Defining a vector-Jacobian product (VJP) operator for each primitive

3. Composing VJPs backward


Consider one step of the trace, x \to a = A(x), with \frac{\partial y}{\partial a} already available at a:

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial a} \cdot \frac{\partial a}{\partial x} = \frac{\partial y}{\partial a} \cdot A'(x)

which is exactly a vector-Jacobian product.

anp.sinh.defvjp(lambda g, ans, vs, gvs, x: g * anp.cosh(x))
anp.cosh.defvjp(lambda g, ans, vs, gvs, x: g * anp.sinh(x))
anp.tanh.defvjp(lambda g, ans, vs, gvs, x: g / anp.cosh(x)**2)

anp.cross.defvjp(lambda g, ans, vs, gvs, a, b, axisa=-1, axisb=-1, axisc=-1, axis=None:
                 anp.cross(b, g, axisb, axisc, axisa, axis), argnum=0)

def grad_sort(g, ans, vs, gvs, x, axis=-1, kind='quicksort', order=None):
    sort_perm = anp.argsort(x, axis, kind, order)
    return unpermuter(g, sort_perm)
anp.sort.defvjp(grad_sort)
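
As a sketch of how a user-defined primitive might get its own VJP with the same (g, ans, vs, gvs, x) signature used above (logsumexp and its gradient are illustrative; newer Autograd releases changed the defvjp interface):

import autograd.numpy as anp
from autograd import primitive, grad

@primitive
def logsumexp(x):
    m = anp.max(x)
    return m + anp.log(anp.sum(anp.exp(x - m)))

def logsumexp_vjp(g, ans, vs, gvs, x):
    return g * anp.exp(x - ans)          # g times softmax(x)

logsumexp.defvjp(logsumexp_vjp)

print(grad(lambda x: logsumexp(x)**2)(anp.array([1.0, 2.0, 3.0])))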
Autograd’s ingredients

1. Tracing the composition of primitive functions

2. Defining a vector-Jacobian product (VJP) operator for each primitive

3. Composing VJPs backward


[Diagram: the backward pass walks the trace from end_node to start_node,
x \to a = A(x) \to b = B(a) \to c = C(b) \to y = D(c),
starting from \partial y / \partial y = 1 and applying VJPs to obtain
\partial y / \partial c, then \partial y / \partial b, then \partial y / \partial a, and finally \partial y / \partial x.]

higher-order autodiff just works:
the backward pass can itself be traced
def backward_pass(g, end_node, start_node):
    outgrads = {end_node : (g, False)}
    assert_vspace_match(outgrads[end_node][0], end_node.vspace, None)
    for node in toposort(end_node, start_node):
        if node not in outgrads: continue
        cur_outgrad = outgrads.pop(node)
        function, args, kwargs, parents = node.recipe
        for argnum, parent in parents:
            outgrad = function.vjp(argnum, cur_outgrad[0], node,
                                   parent.vspace, node.vspace, args, kwargs)
            assert_vspace_match(outgrad, parent.vspace, function)
            outgrads[parent] = add_outgrads(parent.vspace, outgrads.get(parent),
                                            outgrad)
    return cur_outgrad[0]

def grad(fun, argnum=0):
    def gradfun(*args, **kwargs):
        args = list(args)
        args[argnum] = safe_type(args[argnum])
        vjp, ans = make_vjp(fun, argnum)(*args, **kwargs)
        return vjp(vspace(getval(ans)).ones())
    return gradfun

def make_vjp(fun, argnum=0):
    def vjp_maker(*args, **kwargs):
        start_node, end_node = forward_pass(fun, args, kwargs, argnum)
        if not isnode(end_node) or start_node not in end_node.progenitors:
            warnings.warn("Output seems independent of input.")
            def vjp(g): return start_node.vspace.zeros()
        else:
            def vjp(g): return backward_pass(g, end_node, start_node)
        return vjp, end_node
    return vjp_maker
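
A small sketch of calling make_vjp directly (the function and input are illustrative):

import autograd.numpy as np
from autograd import make_vjp

f = lambda x: np.sum(np.sin(x))
vjp, ans = make_vjp(f)(np.array([0.0, 1.0, 2.0]))
print(ans)        # the forward value, sum(sin(x))
print(vjp(1.0))   # pull back a scalar seed of 1.0: the gradient cos(x)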
Autograd’s ingredients

1. Tracing the composition of primitive functions



Node, primitive, forward_pass

2. Defining a vector-Jacobian product (VJP) operator for each primitive



defvjp

3. Composing VJPs backward



backward_pass, make_vjp, grad
Tradeoffs in forward vs reverse

• Reverse-mode requires tracing a program’s execution


• Memory cost scales like depth of program

• Checkpointing can trade off time and memory

• Forward-mode evaluates a JVP with constant memory overhead


• But requires n calls to form Jacobian of F : \mathbb{R}^n \to \mathbb{R}

• Autograd forward-mode by @j-towns: github.com/BB-UCL/autograd-forward

• Can use both together (in autograd!) for mixed-mode


Tutorial goals

1. Jacobians and the chain rule


• Forward and reverse accumulation

2. Autograd’s implementation
• Fully closed tracing autodiff in Python

3. Advanced autodiff techniques


• Checkpointing, forward from reverse, differentiating optima and fixed points
Checkpointing

[Diagram: the backward pass from the earlier trace, annotated for checkpointing:
some intermediate values are not stored during the forward pass and are
recomputed from checkpoints when their VJPs (\partial y / \partial c, \partial y / \partial b, ...) are needed,
trading extra compute for lower memory.]
def checkpoint(fun):
    """Returns a checkpointed version of `fun`, where intermediate values
    computed during the forward pass of `fun` are discarded and then recomputed
    for the backward pass. Useful to trade off time and memory."""
    def wrapped_grad(argnum, g, ans, vs, gvs, args, kwargs):
        return make_vjp(fun, argnum)(*args, **kwargs)[0](g)
    wrapped = primitive(fun)
    wrapped.vjp = wrapped_grad
    return wrapped
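
A sketch of how the checkpoint wrapper might be used (the inner block, sizes, and comparison are illustrative; it leans on the same Autograd internals as the code above):

import autograd.numpy as np
from autograd import grad

def block(x):                            # an illustrative chunk of computation
    for _ in range(10):
        x = np.tanh(x) + 0.1 * x**2
    return x

loss = lambda x: np.sum(block(x)**2)
loss_ckpt = lambda x: np.sum(checkpoint(block)(x)**2)   # block is recomputed on the backward pass

x = np.linspace(-1.0, 1.0, 1000)
print(np.allclose(grad(loss)(x), grad(loss_ckpt)(x)))   # same gradient, smaller stored trace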
[Fragment: a hypergradient example differentiating an adam optimizer, written in
Autograd, with respect to its step-size schedule via
hypergrad_fun = grad_named(adam, 'step_sizes')]

Getting forward from reverse

def make_jvp(fun, argnum=0):
    def jvp_maker(*args, **kwargs):
        vjp, y = make_vjp(fun, argnum)(*args, **kwargs)
        vjp_vjp, _ = make_vjp(vjp)(vspace(getval(y)).zeros())  # dummy vals
        return vjp_vjp  # vjp_vjp is just jvp by linearity
    return jvp_maker

import tensorflow as tf

def fwd_gradients(ys, xs, d_xs):
    v = tf.placeholder(ys.dtype, shape=ys.get_shape())  # dummy variable
    g = tf.gradients(ys, xs, grad_ys=v)
    return tf.gradients(g, v, grad_ys=d_xs)

[Diagram: x maps to y; a reverse pass pulls a covector v back to J^T v, and
applying reverse mode to that reverse pass pushes a vector u forward to J u.]
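
A quick finite-difference check of make_jvp (the function, matrix, and step size are illustrative assumptions):

import autograd.numpy as np
import autograd.numpy.random as npr

A = npr.randn(3, 2)
f = lambda x: np.tanh(np.dot(A, x))      # an illustrative vector-valued function

x, u = npr.randn(2), npr.randn(2)
jvp = make_jvp(f)(x)                     # forward mode built from two reverse passes
eps = 1e-6
print(jvp(u))                            # J(x) u
print((f(x + eps * u) - f(x)) / eps)     # should agree to ~1e-5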
Solutions, optima, and fixed points

x^*(a) = \arg\min_x f(a, x), \qquad \nabla x^*(a) = \;?

solve g(a, x) = 0 for x \quad \Rightarrow \quad g(a, x^*(a)) = 0, \qquad \nabla x^*(a) = \;?
The implicit function theorem

g(a, x^*(a)) = 0

\nabla_a g(a, x^*) + \nabla x^*(a)\, \nabla_x g(a, x^*) = 0

\nabla x^*(a) = -\nabla_a g(a, x^*)\, \nabla_x g(a, x^*)^{-1}

differentiate solutions / optima \leftrightarrow solve linearized systems

automatically generate a linear solver from the forward solver?
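
A small numerical illustration of the theorem (the particular g and the fixed-point solver below are illustrative choices, not from the slides):

import autograd.numpy as np
from autograd import grad

g = lambda a, x: x - np.tanh(a + x / 2.)     # solve g(a, x) = 0 for x

def solve(a):                                # a crude forward solver
    x = 0.
    for _ in range(100):
        x = np.tanh(a + x / 2.)
    return x

a = 0.3
x_star = solve(a)
dg_da, dg_dx = grad(g, 0)(a, x_star), grad(g, 1)(a, x_star)
print(-dg_da / dg_dx)                                 # implicit function theorem, scalar case
print((solve(a + 1e-6) - solve(a - 1e-6)) / 2e-6)     # finite-difference check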


Differentiating fixed points

x^*(a) solves x = f(a, x) for x

from autograd import primitive
from functools import partial

@primitive
def fixed_point(f, a, init, converged, max_iter):
    update = partial(f, a)
    current, prev = update(init), init
    for _ in xrange(max_iter):
        if converged(current, prev): break
        current, prev = update(current), current
    else:
        print 'fixed point iteration limit reached'
    return current
Differentiating fixed points

[Diagram: unrolling the iteration, with a feeding into every step:
x_{init} \to x_1 \to x_2 \to x_3 \to \cdots \to x_{n-2} \to x_{n-1} \to x_n,
and as n \to \infty, \; x^* = x_n = x_{n-1} = x_{n-2} = \cdots]

from autograd import primitive, make_vjp, make_tuple
from autograd.util import flatten

def grad_fixed_point(g_fp, fp, vs, gvs, f, a, init, converged, max_iter):
    vjp, _ = make_vjp(lambda args: f(*args))(make_tuple(a, fp))
    g_a_flat, unflatten = flatten(vs.zeros())
    g = g_fp   # start the backward iteration from the incoming gradient
    for _ in xrange(max_iter):
        if normsq(flatten(g)[0]) < 1e-6: break
        term, g = vjp(g)
        g_a_flat = g_a_flat + flatten(term)[0]
    else:
        print 'backward fixed point iteration limit reached'
    return unflatten(g_a_flat)

fixed_point.defvjp(grad_fixed_point, 1)
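
A sketch of calling and differentiating fixed_point (the map f, the initializer, and the tolerance are illustrative; it assumes the surrounding Autograd version matches the code above):

import autograd.numpy as np
from autograd import grad

f = lambda a, x: np.tanh(a + x / 2.)                     # a contraction in x
converged = lambda cur, prev: np.abs(cur - prev) < 1e-10

x_star = lambda a: fixed_point(f, a, 0., converged, 1000)
print(x_star(0.3))                                       # the fixed point itself
print(grad(x_star)(0.3))                                 # d x*/ d a via the custom VJP above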
Differentiating fixed points

• Inherits structure from forward iteration


• Forward is Newton ⇒ reverse requires only one step
• Forward is block coordinate descent ⇒ reverse is block Gauss-Seidel
• May be preferable to decouple forward and reverse
• Then choose any linear solver for implicit linearized system

• Can reuse dual variables from forward solver


Second-order optimization
def make_hvp(fun, argnum=0):
    """Builds a function for evaluating the Hessian-vector product at a point,
    which may be useful when evaluating many Hessian-vector products at the same
    point while caching the results of the forward pass."""
    def hvp_maker(*args, **kwargs):
        return make_vjp(grad(fun, argnum), argnum)(*args, **kwargs)[0]
    return hvp_maker

def make_ggnvp(f, g=lambda x: 1./2*np.sum(x**2, axis=-1), f_argnum=0):
    """Builds a function for evaluating generalized-Gauss-Newton-vector products
    at a point. Slightly more expensive than mixed-mode."""
    def ggnvp_maker(*args, **kwargs):
        f_vjp, f_x = make_vjp(f, f_argnum)(*args, **kwargs)
        g_hvp, grad_g_x = make_vjp(grad(g))(f_x)
        f_vjp_vjp, _ = make_vjp(f_vjp)(vspace(getval(grad_g_x)).zeros())
        def ggnvp(v): return f_vjp(g_hvp(f_vjp_vjp(v)))
        return ggnvp
    return ggnvp_maker
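
A short usage sketch of make_hvp as defined above (the quadratic f is an illustrative choice, for which the Hessian is 2I):

import autograd.numpy as np

f = lambda x: np.sum(x**2)

x = np.array([1.0, 2.0, 3.0])
hvp_at_x = make_hvp(f)(x)                    # fixes the evaluation point, reusing the trace
print(hvp_at_x(np.array([1.0, 0.0, 0.0])))   # H v = 2 v -> [2., 0., 0.]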
Thanks!
References

• Dougal Maclaurin. Modeling, Inference and Optimization with Composable
  Differentiable Procedures. Harvard Physics Ph.D. Thesis, 2016.
  URL: https://dougalmaclaurin.com/phd-thesis.pdf

• github.com/hips/autograd
