0% found this document useful (0 votes)

10 views9 pages

Bregman

The document discusses Bregman divergence, a generalization of squared Euclidean distance, and its applications in machine learning and optimization. It defines Bregman divergence, outlines its properties, and introduces the concept of mirror descent for batch optimization, which utilizes Bregman divergence for updates. The document also analyzes the convergence rate of subgradient descent with Euclidean distance, providing a structured approach to bounding and summing updates over iterations.

Uploaded by

rimniz31

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views9 pages

Bregman

Uploaded by

rimniz31

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Bregman Divergence and Mirror Descent

1 Bregman Divergence

Motivation
• Generalize squared Euclidean distance to a class of distances that all share similar properties
• Lots of applications in machine learning, clustering, exponential family

Definition 1 (Bregman divergence) Let ψ : Ω → R be a function that is: a) strictly convex, b) continuously
differentiable, c) defined on a closed convex set Ω. Then the Bregman divergence is defined as
∆ψ (x, y) = ψ(x) − ψ(y) − h∇ψ(y), x − yi , ∀ x, y ∈ Ω. (1)
That is, the difference between the value of ψ at x and the first order Taylor expansion of ψ around y evaluated
at point x.

Examples
2 2
• Euclidean distance. Let ψ(x) = 21 kxk . Then ∆ψ (x, y) = 21 kx − yk .
0 0
P n
• ψ(x)
P = xii xi log xi and Ω = x ∈ R+ : 1 x = 1 , where 1 = (1, 1, . . . , 1) . Then ∆ψ (x, y) =
i xi log yi for x, y ∈ Ω. This is called relative entropy, or Kullback–Leibler divergence between
probability distributions x and y.
2 2 2
• Lp norm. Let p ≥ 1 and p1 + 1q = 1. ψ(x) = 12 kxkq . Then ∆ψ (x, y) = 12 kxkq + 12 kykq −
D E
2 2
x, ∇ 21 kykq . Note 12 kykq is not necessarily continuously differentiable, which makes this case not
precisely consistent with our definition.

1.1 Properties of Bregman divergence

• Strict convexity in the first argument x. Trivial by the strict convexity of ψ.
• Nonnegativity: ∆ψ (x, y) ≥ 0 for all x, y. ∆ψ (x, y) = 0 if and only if x = y. Trivial by strict convexity.
• Asymmetry: in general, ∆ψ (x, y) 6= ∆ψ (y, x). Eg, KL-divergence. Symmetrization not always useful.
• Non-convexity in the second argument. Let Ω = [1, ∞), ψ(x) = − log x. Then ∆ψ (x, y) = − log x +
log y + x−y 1 2x
y . One can check its second order derivative in y is y 2 ( y − 1), which is negative when
2x < y.
• Linearity in ψ. For any a > 0, ∆ψ+aϕ (x, y) = ∆ψ (x, y) + a∆ϕ (x, y).
∂
• Gradient in x: ∂x ∆ψ (x, y) = ∇ψ(x) − ∇ψ(y). Gradient in y is trickier, and not commonly used.
• Generalized triangle inequality:
∆ψ (x, y) + ∆ψ (y, z) = ψ(x) − ψ(y) − h∇ψ(y), x − yi + ψ(y) − ψ(z) − h∇ψ(z), y − zi (2)
= ∆ψ (x, z) + hx − y, ∇ψ(z) − ∇ψ(y)i . (3)

• Special case: ψ is called strongly convex with respect to some norm with modulus σ if
σ 2
ψ(x) ≥ ψ(y) + h∇ψ(y), x − yi + kx − yk . (4)
2
Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is
2
equivalent to ψ(x)− σ2 kxk being convex. For example, the ψ(x) = i xi log xi used in KL-divergence
P
is 1-strongly convex over the simplex Ω = x ∈ Rn+ : 10 x = 1 , with respect to the L1 norm (not so

trivial). When ψ is σ strongly convex, we have
σ 2
∆ψ (x, y) ≥ kx − yk . (5)
2
σ 2
Proof: By definition ∆ψ (x, y) = ψ(x) − ψ(y) − h∇ψ(y), x − yi ≥ 2 kx − yk .

• Duality. Suppose ψ is strongly convex. Then

(∇ψ ∗ )(∇ψ(x)) = x, ∆ψ (x, y) = ∆ψ∗ (∇ψ(y), ∇ψ(x)) . (6)

Proof: (for the first equality only) Recall

ψ ∗ (y) = sup {hz, yi − ψ(z)} . (7)
z∈Ω

sup must be attainable because ψ is strongly convex and Ω is closed. x is a maximizer if and only if
y = ∇ψ(x). So
ψ ∗ (y) + ψ(x) = hx, yi ⇔ y = ∇ψ(x). (8)
Since ψ = ψ ∗∗ , so ψ ∗ (y) + ψ ∗∗ (x) = hx, yi, which means y is the maximizer in
ψ ∗∗ (x) = sup {hx, zi − ψ ∗ (z)} . (9)
z

This means x = ∇ψ ∗ (y). To summarize, (∇ψ ∗ )(∇ψ(x)) = x.

• Mean of distribution. Suppose U is a random variable over an open set S with distribution µ. Then
min EU ∼µ [∆ψ (U, x)] (10)
x∈S
R
is optimized at ū := Eµ [U ] = u∈S
uµ(u).
Proof: For any x ∈ S, we have
EU ∼µ [∆ψ (U, x)] − EU ∼µ [∆ψ (U, ū)] (11)
= Eµ [ψ(U ) − ψ(x) − (U − x)0 ∇ψ(x) − ψ(U ) + ψ(ū) + (U − ū)0 ∇ψ(ū)] (12)
0 0 0 0
= ψ(ū) − ψ(x) + x ∇ψ(x) − ū ∇ψ(ū) + Eµ [−U ∇ψ(x) + U ∇ψ(ū)] (13)
0
= ψ(ū) − ψ(x) − (ū − x) ∇ψ(x) (14)
= ∆ψ (ū, x). (15)
This must be nonnegative, and is 0 if and only if x = ū.

• Pythagorean Theorem. If x∗ is the projection of x0 onto a convex set C ⊆ Ω:

x∗ = argmin ∆ψ (x, x0 ). (16)
x∈C

Then for all y ∈ C,

∆ψ (y, x0 ) ≥ ∆ψ (y, x∗ ) + ∆ψ (x∗ , x0 ). (17)
In Euclidean case, it means the angle ∠yx∗ x0 is obtuse. More generally

Lemma 2 Suppose L is a proper convex function whose domain is an open set containing C. L is not
necessarily differentiable. Let x∗ be
x∗ = argmin {L(x) + ∆ψ (x∗ , x0 )} . (18)
x∈C

Then for any y ∈ C we have

L(y) + ∆ψ (y, x0 ) ≥ L(x∗ ) + ∆ψ (x∗ , x0 ) + ∆ψ (y, x∗ ). (19)

2
The projection in (16) is just a special case of L = 0. This property is the key to the analysis of many
optimization algorithms using Bregman divergence.
Proof: Denote J(x) = L(x) + ∆ψ (x, x0 ). Since x∗ minimizes J over C, there must exist a subgradient
d ∈ ∂J(x∗ ) such that
hd, x − x∗ i ≥ 0, ∀ x ∈ C. (20)
∗
Since ∂J(x ) = {g + ∇x=x∗ ∆ψ (x, x0 ) : g ∈ ∂L(x )} = {g + ∇ψ(x ) − ∇ψ(x0 ) : g ∈ ∂L(x∗ )}.
∗ ∗

So there must be a subgradient g ∈ L(x∗ ) such that

hg + ∇ψ(x∗ ) − ∇ψ(x0 ), x − x∗ i ≥ 0, ∀ x ∈ C. (21)
Therefore using the property of subgradient, we have for all y ∈ C that
L(y) ≥ L(x∗ ) + hg, y − x∗ i (22)
≥ L(x∗ ) + h∇ψ(x0 ) − ∇ψ(x∗ ), y − x∗ i (23)
= L(x∗ ) − h∇ψ(x0 ), x∗ − x0 i + ψ(x∗ ) − ψ(x0 ) (24)
+ h∇ψ(x0 ), y − x0 i − ψ(y) + ψ(x0 ) (25)
− h∇ψ(x∗ ), y − x∗ i + ψ(y) − ψ(x∗ ) (26)
= L(x∗ ) + ∆ψ (x∗ , x0 ) − ∆ψ (y, x0 ) + ∆ψ (y, x∗ ). (27)
Rearranging completes the proof.

2 Mirror Descent for Batch Optimization

Suppose we want to minimize a function f over a set C. Recall the subgradient descent rule
xk+ 12 = xk − ηk gk , where gk ∈ ∂f (xk ) (28)
1 2 1 2
xk+1 = argmin x − xk+ 21 = argmin kx − (xk − ηk gk )k . (29)
x∈C 2 x∈C 2
This can be interpreted as follows. First approximate f around xk by a first-order Taylor expansion
f (x) ≈ f (xk ) + hgk , x − xk i . (30)
1 2
Then penalize the displacement by 2ηk kx − xk k . So the update rule is to find a regularized minimizer of
the model

1 2
xk+1 = argmin f (xk ) + hgk , x − xk i + kx − xk k . (31)
x∈C 2ηk
It is trivial to see this is exactly equivalent to (29).
Mirror descent extension To generalize (31) beyond Euclidean distance, it is straightforward to use the
Bregman divergence as a measure of displacement:

1
xk+1 = argmin f (xk ) + hgk , x − xk i + ∆ψ (x, xk ) (32)
x∈C ηk
= argmin {ηk f (xk ) + ηk hgk , x − xk i + ∆ψ (x, xk )} . (33)
x∈C
It is again equivalent to two steps:

1
xk+ 12 = argmin f (xk ) + hgk , x − xk i + ∆ψ (x, xk ) (34)
x η k
xk+1 = argmin ∆ψ (x, xk+ 21 ). (35)
x∈C
The first order optimality condition for (34) is
1
gk + (∇ψ(xk+ 12 ) − ∇ψ(xk )) = 0 (36)
ηk
⇐⇒ ∇ψ(xk+ 21 ) = ∇ψ(xk ) − ηk gk (37)
−1 ∗
⇐⇒ xk+ 21 = (∇ψ) (∇ψ(xk ) − ηk gk ) = (∇ψ )(∇ψ(xk ) − ηk gk ). (38)
For example, in KL-divergence over simplex, the update rule becomes
xk+ 12 (i) = xk (i) exp(−ηk gk (i)). (39)

3
2.1 Rate of convergence for subgradient descent with Euclidean distance
We now analyze the rates of convergence of subgradient descent as in (31) and (33). It takes four steps.
1. Bounding on a single update
2
2 2
kxk+1 − x∗ k2 ≤ xk+ 12 − x∗ = kxk − ηk gk − x∗ k2 (≤ by the Pythagorean theorem in (17)) (40)
2 2
= kxk − x∗ k 2 − 2ηk hgk , xk − x∗ i + ηk2 kgk k2 (41)
2 2
≤ kxk − x∗ k 2 − 2ηk (f (xk ) − f (x∗ )) + ηk2 kgk k2 . (42)

2. Telescope over k = 1, . . . , T (summing them up):

T
X T
X
2 2 2
kxT +1 − x∗ k2 ≤ kx1 − x∗ k2 − 2 ηk (f (xk ) − f (x∗ )) + ηk2 kgk k2 . (43)
k=1 k=1

2 2 2
3. Bounding by kgk k2 ≤ G2 and kx1 − x∗ k2 ≤ R2 := maxx∈C kx1 − xk2 :
T
X T
X
2 ηk (f (xk ) − f (x∗ )) ≤ R2 + G2 ηk2 . (44)
k=1 k=1

4. Denote k = f (xk ) − f (x∗ ) and rearrange

PT
R2 + G2 k=1 ηk2
min k ≤ PT . (45)
k∈{1,...,T } 2 k=1 ηk
R
Denote [T ] := {1, 2, . . . , T }. By setting the step size ηk = √ ,
G k
we can achieve
PT RT
1 + k=1 k1 2 + 1 x1 dx RG log T
min k ≤ RG PT ≤ RG R T +1 ≤ √ . (46)
1 1 2 T
k∈[T ] 2 k=1 √k 4 1 √ dx
x

Remark 1 The term log T in the bound can actually be removed by using the following simple fact. Given
c > 0, b ∈ Rd+ , and D a positive definite matrix. Then

c + 12 x0 Dx
r r
2c 2c
min = , where the optimal x = D−1 b. (47)
x∈Rd
+
b0 x b0 D−1 b b0 D−1 b

One can prove it by writing out the KKT condition for the equivalent convex problem (with a perspective
function) inf x,u u1 (c + 12 x0 Dx), s.t. x ∈ Rd+ , u > 0, and b0 x = u. Now apply this result to (45) with all
R
ηk = G√ T
(k ∈ [T ]), then we get

RG
min k ≤ √ . (48)
k∈[T ] T
So to drive mink∈[T ] k below a threshold > 0, it suffices to take T steps where

R2 G2
T ≥ . (49)
2
Note the method requires that the horizon T be specified a priori, because the step size ηk needs this infor-
mation. We next give a more intricate approach which does not require a pre-specified horizon.

Remark 2 The term log T in the bound can also be removed as follows. Here we redefine R2 as the diameter
2
square maxx,y∈C kx − yk2 . Instead of telescoping over k = 1, . . . , T , let us telescope from k = T /2 to T
4
(without loss of generality, let T be an even integer):
T T
2 2 X X 2
kxT +1 − x∗ k2 ≤ xT /2 − x∗ 2
−2 ηk (f (xk ) − f (x∗ )) + ηk2 kgk k2 (50)
k=T /2 k=T /2
T
X T
X
∗ 2 2
=⇒ 2 ηk (f (xk ) − f (x )) ≤ R + G ηk2 (51)
k=T /2 k=T /2
PT
R2 + G2 k=T /2 ηk2
=⇒ min k ≤ PT (52)
k∈{T /2,...,T } 2 k=T /2 ηk
PT RT
R 1 + k=T /2 k1 1 + T /2−1 log xdx 2RG
(plug in ηk = √ ) = RG PT ≤ RG R T +1 √ ≤ √ . (53)
G k 2 k=T /2 √1k 4 T /2 xdx T
2
The trick is to exploit log T −log( T2 −1) ≈ log 2 in the numerator. In step (51), we bounded xT /2 − x∗ 2
2
by R2 , because in general we cannot bound it by kx1 − x∗ k2 . In the sequel, we will simply write
RG
min k ≤ √
k∈[T ] T
ignoring the constants.
2.2 Rate of convergence for subgradient descent with mirror descent
The rate of convergence of subgradient descent often depends on R and G, which may√depend unfortunately
on the dimension of the problem. For example, suppose C is the simplex. √ Then R ≤ 2. If each coordinate
of each gradient gi is upper bounded by M , then G can be at most M n, i.e. depends on the dimension of x.
We next see how this dependency can be removed by extending Euclidean distance to Bregman diver-
2
gence. Clearly the steps 2 to 4 above can be easily extended by replacing kxk+1 − x∗ k2 with ∆ψ (x∗ , xk+1 ).
So the only challenge left is to extend step 1. This is actually possible via Lemma 2.
We further assume ψ is σ strongly convex on C. In (33), consider ηk (f (xk ) + hgk , x − xk i) as the L in
Lemma 2. Then
ηk (f (xk ) + hgk , x∗ − xk i) + ∆ψ (x∗ , xk ) ≥ ηk (f (xk ) + hgk , xk+1 − xk i) + ∆ψ (xk+1 , xk ) (54)
+ ∆ψ (x∗ , xk+1 ). (55)
Canceling some terms can rearranging, we obtain
∆ψ (x∗ , xk+1 ) ≤ ∆ψ (x∗ , xk ) + ηk hgk , x∗ − xk+1 i − ∆ψ (xk+1 , xk ) (56)
= ∆ψ (x∗ , xk ) + ηk hgk , x∗ − xk i + ηk hgk , xk − xk+1 i − ∆ψ (xk+1 , xk ) (57)
σ 2
≤ ∆ψ (x∗ , xk ) − ηk (f (xk ) − f (x∗ )) + ηk hgk , xk − xk+1 i − kxk − xk+1 k (58)
2
σ 2
≤ ∆ψ (x∗ , xk ) − ηk (f (xk ) − f (x∗ )) + ηk kgk k∗ kxk − xk+1 k − kxk − xk+1 k (59)
2
η2 2
≤ ∆ψ (x∗ , xk ) − ηk (f (xk ) − f (x∗ )) + k kgk k∗ (60)
2σ
2
Now compare with (42), we have successfully replaced kxk+1 − x∗ k2 with ∆ψ (x∗ , xi ). Again upper
bound ∆ψ (x∗ , x1 ) by R2 and kgk k∗ by G, and we obtain
RG
min k ≤ √ . (61)
k∈[T ] σT
Note the norm on gk is the dual norm. To see the advantage of mirror descent, suppose C is the n
dimensional simplex, and we use KL-divergence for which ψ is 1 strongly convex with respect to the L1
norm. The dual norm of the L1 norm is the L∞ norm. Then we can bound ∆ψ (x∗ , x1 ) by using KL-
divergence, and it is at most log n if we set x1 = n1 1 and x∗ lies in the probability simplex. G can be upper
bounded by M , and R by log n. So with regard to the valueq of RG, mirror descent yields M log n, which is
√
smaller than that of subgradient descent by an order of O( logn n ). Note the saving of Θ( n) is from the
norm of gradient (G) by replacing the L2 norm by the L∞ norm, at a slight cost of increasing R by log n.

5
Remark 4 Note R2 is an upper bound on ∆ψ (x∗ , x1 ), rather than the real diameter maxx,y∈C ∆ψ (x, y).
This is important because for KL divergence defined on the probability simplex, the latter is actually infinity,
while maxx∈Ω ∆ψ (x, n1 1) = log n.
2.3 Possibilities for accelerated rates
When the objective function has additional properties, the rates can be significantly improved. Here we see
two examples.
Acceleration 1: f is strongly convex. We say f is strongly convex with respect to another convex function
ψ with modulus λ if
f (x) ≥ f (y) + hg, x − yi + λ∆ψ (x, y) ∀ g ∈ ∂f (y). (62)
Note we do not assume f is differentiable. Now in the step from (57) to (58), we can plug in the definition of
strong convexity:
∆ψ (x∗ , xk+1 ) = . . . + ηk hgk , x∗ − xk i + . . . (copy of (57)) (63)
≤ . . . − ηk (f (xk ) − f (x∗ ) + λ∆ψ (x∗ , xk )) + . . . (64)
≤ ... (65)
ηk2 2
≤ (1 − ληk )∆ψ (x∗ , xk ) − ηk (f (xk ) − f (x∗ )) + kgk k∗ (66)
2σ
Denote δk = ∆ψ (x∗ , xk ). Set ηk = 1
λk . Then
k−1 1 G2 1 G2
δk+1 ≤ δk − k + =⇒ kδk+1 ≤ (k − 1)δk − k + (67)
k λk 2σλ2 k 2 λ 2σλ2 k
Now telescope (sum up both sides from k = 1 to T )
T T T
1X G2 X 1 G2 1 X 1 G2 O(log T )
T δT +1 ≤ − k + =⇒ min k ≤ ≤ . (68)
λ 2σλ2 k i∈[T ] 2σλ T k 2σλ T
k=1 k=1 k=1

Acceleration 2: f has Lipschitz continuous gradient. If the gradient of f is Lipschitz continuous, there
exists L > 0 such that
k∇f (x) − ∇f (y)k∗ ≤ L kx − yk , ∀ x, y. (69)
Sometimes we just directly say f is smooth. It is also known that this is equivalent to
L 2
f (x) ≤ f (y) + h∇f (y), x − yi + kx − yk . (70)
2
We bound the hgk , x∗ − xk+1 i term in (56) as follows
hgk , x∗ − xk+1 i = hgk , x∗ − xk i + hgk , xk − xk+1 i (71)
L 2
≤ f (x∗ ) − f (xk ) + f (xk ) − f (xk+1 ) + kxk − xk+1 k (72)
2
L 2
= f (x∗ ) − f (xk+1 ) + kxk − xk+1 k . (73)
2
Plug into (56), we get

L 2 σ 2
∆ψ (x∗ , xk+1 ) ≤ ∆ψ (x∗ , xk ) + ηk f (x∗ ) − f (xk+1 ) + kxk − xk+1 k − kxk − xk+1 k . (74)
2 2
σ
Set ηk = L, we get
σ
∆ψ (x∗ , xk+1 ) ≤ ∆ψ (x∗ , xk ) − (f (xk+1 ) − f (x∗ )) . (75)
L
Telescope we get
L∆(x∗ , x1 ) LR2
min f (xk ) − f (x∗ ) ≤ ≤ . (76)
k∈{2,...,T +1} σT σT
This gives O( T1 ) convergence rate. But if we are smarter, like Nesterov, the rate can be improved to O( T12 ).
We will not go into the details but the algorithm and proof are again based on Lemma 2. This is often called
accelerated proximal gradient method.
6
2.4 Composite Objective
Suppose the objective function is h(x) = f (x)+r(x), where f is smooth and r(x) is simple, like kxk1 . If we
directly apply the above rates for optimizing h, we get O( √1T ) rate of convergence because h is not smooth.
It will be nice if we can enjoy the O( T1 ) rate as in smooth optimization. Fortunately this is possible thanks to
the simplicity of r(x), and we only need to extend the proximal operator (33) as follows:

1
xk+1 = argmin f (xk ) + hgk , x − xk i + r(x) + ∆ψ (x, xk ) (77)
x∈C ηk
= argmin {ηk f (xk ) + ηk hgk , x − xk i + ηk r(x) + ∆ψ (x, xk )} . (78)
x∈C

Here we use a first-order Taylor approximation of f around xk , but keep r(x) exact. Assuming this proximal
operator can be computed efficiently, then we can show all the above rates carry over. We here only show the
case of general f (not necessarily strongly convex or has Lipschitz continuous gradient), and leave the rest
two cases as an exercise. In fact we can again achieve O( T12 ) rate when f has Lipschitz continuous gradient.
Consider ηk (f (xk ) + hgk , x − xk i + r(x)) as the L in Lemma 2. Then
ηk (f (xk ) + hgk , x∗ − xk i + r(x∗ )) + ∆ψ (x∗ , xk ) (79)
≥ ηk (f (xk ) + hgk , xk+1 − xk i + r(xk+1 )) + ∆ψ (xk+1 , xk ) + ∆ψ (x∗ , xk+1 ). (80)
Following exactly the derivations from (56) to (60), we obtain
∆ψ (x∗ , xk+1 ) ≤ ∆ψ (x∗ , xk ) + ηk hgk , x∗ − xk+1 i + ηk (r(x∗ ) − r(xk+1 )) − ∆ψ (xk+1 , xk ) (81)
≤ ... (82)
ηk2 2
≤ ∆ψ (x∗ , xk ) − ηk (f (xk ) + r(xk+1 ) − f (x∗ ) − r(x∗ )) + kgk k∗ . (83)
2σ
This is almost the same as (60), except that we want to have r(xk ) here, not r(xk+1 ). Fortunately this is not
a problem as long as we use a slightly different way of telescoping. Denote δk = ∆ψ (x∗ , xk ) and then
1 ηk 2
f (xk ) + r(xk+1 ) − f (x∗ ) − r(x∗ ) ≤ (δk − δk+1 ) + kgk k∗ . (84)
ηk 2σ
Summing up from k = 1 to T we obtain
T T T
G2 X

X δ1 X 1 1 δT +1
r(xT +1 ) − r(x1 ) + (h(xk ) − h(x∗ )) ≤ + δk − − + ηk (85)
η1 ηk ηk−1 ηT 2σ
k=1 k=2 k=1
T ! T
2 1 X 1 1 G2 X
≤R + − + ηk (86)
η1 ηk ηk−1 2σ
k=2 k=1
2 T 2
R G X
= + ηk . (87)
ηT 2σ
k=1
R
pσ
Suppose we choose x1 = argminx r(x), which ensures r(xT +1 ) − r(x1 ) ≥ 0. Setting ηk = G k, we get
T T
!
X
∗ RG √ 1X 1 RG √
(h(xk ) − h(x )) ≤ √ T+ √ = √ O( T ). (88)
σ 2 k σ
k=1 k=1

Therefore mink∈[T ] {h(xk ) − h(x∗ )} decays at the rate of O( √RG

σT
).

3 Online and Stochastic Learning

The protocol of online learning is shown
P in Algorithm 1. The player’s goal of online learning is to minimize
the regret, the minimal possible loss k fk (x) over all possible x:
T
X T
X
Regret = fk (xk ) − min fk (x). (89)
x
k=1 k=1
Note there is no assumption made on how the rival picks fk , and it can adversarial. After obtaining fk at
iteration k, let us update the model into xk+1 by using the mirror descent rule on function fk only:

1
xk+1 = argmin fk (xk ) + hgk , x − xk i + ∆ψ (x, xk ) , where gk ∈ ∂fk (xk ). (90)
x∈C ηk

7
Algorithm 1: Protocol of online learning
1 The player initializes a model x1 .
2 for k = 1, 2, . . . do
3 The player proposes a model xk .
4 The rival picks a function fk .
5 The player suffers a loss fk (xk ).
6 The player gets access to fk and use it to update its model to xk+1 .

Then it is easy to derive the regret bound. Using fk in step (60), we have
1 ηk 2
fk (xk ) − fk (x∗ ) ≤ (∆ψ (x∗ , xk ) − ∆ψ (x∗ , xk+1 )) + kgk k∗ . (91)
ηk 2σ
Summing up from k = 1 to n and using the same process as in (85) to (88), we get
T
X RG √
(fk (xk ) − fk (x∗ )) ≤ √ O( T ). (92)
σ
k=1
√
So the regret grows in the order of O( T ).

f is strongly convex. Exactly use (66) with fk in place of f , and we can derive the O(log T ) regret bound
immediately.

f has Lipschitz continuous gradient. The result in (75) can NOT be extended to the online setting because
if we replace f by fk we will get fk (xk+1 ) − fk (x∗ ) on the right-hand side. Telescoping will not give a regret
bound. In fact, it is known that √in the online setting, having a Lipschitz continuous gradient itself cannot
reduce the regret bound from O( T ) (as in nonsmooth objective) to O(log T ).

Composite objective. In the online setting, both the player and the rival know r(x), and the rival changes
fk (x) at each iteration. The loss incurred at each iteration is hk (xk ) = fk (xk ) + r(xk ). The update rule is

1
xk+1 = argmin fk (xk ) + hgk , x − xk i + r(x) + ∆ψ (x, xk ) , where gk ∈ ∂fk (xk ). (93)
x∈C ηk
Note in this setting, (84) becomes
1 ηk 2
fk (xk ) + r(xk+1 ) − fk (x∗ ) − r(x∗ ) ≤ (δk − δk+1 ) + kgk k∗ . (94)
ηk 2σ
Although we have r(xk+1 ) here rather than r(xk ), it is fine because r does not change through iterations.
Choosing x1 = argminx r(x) and telescoping in the same way as from (85) to (88), we immediately obtain
T
X G √
(hk (xk ) − hk (x∗ )) ≤ √ O( T ). (95)
σ
k=1
√
So the regret grows at O( T ).
When fk are strongly convex, we can get O(log T ) regret for the√
composite case. But as expected, having
Lipschitz continuity of ∇fk alone cannot reduce the regret from O( T ) to O(log T ).
3.1 Stochastic optimization
Let us consider optimizing a function which takes a form of expectation
min F (x) := E [f (x; ω)], (96)
x ω∼p

where p is a distribution of ω. This subsumes a lot of machine learning models. For example, the SVM
objective is
m
1 X λ 2
F (x) = max{0, 1 − ci hai , xi} + kxk . (97)
m i=1 2

8
Algorithm 2: Protocol of online learning
1 The player initializes a model x1 .
2 for k = 1, 2, . . . do
3 The player proposes a model xk .
4 The rival randomly draws a ωk from p, which defines a function fk (x) := f (x; ωk ).
5 The player suffers a loss fk (xk ).
6 The player gets access to fk and use it to update its model to xk+1 by, e.g., mirror descent (90).

1
It can be interpreted as (96) where ω is uniformly distributed in {1, 2, . . . , m} (i.e. p(ω = i) = m ), and
λ 2
f (x; i) = max{0, 1 − ci hai , xi} + kxk . (98)
2
When m is large, it can be costly to calculate F and its subgradient. So a simple idea is to base the
updates on a single randomly chosen data point. It can be considered as a special case of online learning in
Algorithm 1, where the rival in step 4 now randomly picks fk as f (x; ωk ) with ωk being drawn independently
from p. Ideally we hope that by using the mirror descent updates, xk will gradually approach the minimizer
of F (x). Intuitively this is quite reasonable, and by using fk we can compute an unbiased estimate of F (xk )
and a subgradient of F (xk ) (because ωk are sampled iid from p). This is a particular case of stochastic
optimization, and we recap it in Algorithm 2.
In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays ωk
at iteration k. Then an online learning algorithm A is simply a deterministic mapping from an ordered set
{ω1 , . . . , ωk } to xk+1 . Denote as A(ω0 ) the initial model x1 . Then the following theorem is the key for
online to batch conversion.
Theorem 3 Suppose an online learning algorithm A has regret bound Rk after running Algorithm 1 for k
iterations. Suppose ω1 , . . . , ωT +1 are drawn iid from p. Define x̂ = A(ωj+1 , . . . , ωT ) where j is drawn
uniformly random from {0, . . . , T }. Then
RT +1
E[F (x̂)] − min F (x) ≤ , (99)
x T +1
where the expectation is with respect to the randomness of ω1 , . . . , ωT , and j.
Similarly we can have high probability bounds, which can be stated in the form like (not exactly true)
RT +1 1
F (x̂) − min F (x) ≤ log (100)
x T +1 δ
with probability 1 − δ, where the probability is with respect to the randomness of ω1 , . . . , ωT , and j.

Proof of Theorem 3.
E[F (x̂)] = E [f (x̂; ωT +1 )] = E [f (A(ωj+1 , . . . , ωT ); ωT +1 )] (101)
j,ω1 ,...,ωT +1 j,ω1 ,...,ωT +1
 
T
 1
X
= E f (A(ωj+1 , . . . , ωT ); ωT +1 ) (as j is drawn uniformly random) (102)
ω1 ,...,ωT +1 T + 1
j=0
 
T
1 X
= E  f (A(ω1 , . . . , ωT −j ); ωT +1−j ) (shift iteration index by iid of wi )
T + 1 ω1 ,...,ωT +1 j=0
(103)
"T +1 #
1 X
= E f (A(ω1 , . . . , ωs−1 ); ωs ) (change of variable s = T − j + 1) (104)
T + 1 ω1 ,...,ωT +1 s=1
" T +1
#
1 X
≤ E min f (x; ωs ) + RT +1 (apply regret bound) (105)
T + 1 ω1 ,...,ωT +1 x s=1
RT +1
≤ min E[f (x; ω] + (expectation of min is smaller than min of expectation) (106)
x ω T +1
RT +1
= min F (x) + . (107)
x T +1

DSA Notes Abdul Bari
No ratings yet
DSA Notes Abdul Bari
206 pages
Jan Van Tiel - Convex Analysis - An Introductory Text-Wiley (1984) PDF
No ratings yet
Jan Van Tiel - Convex Analysis - An Introductory Text-Wiley (1984) PDF
135 pages
Chapter 3 Unconstrained Convex Optimization
No ratings yet
Chapter 3 Unconstrained Convex Optimization
28 pages
Lecture Notes 2
No ratings yet
Lecture Notes 2
181 pages
40 Questions
100% (2)
40 Questions
10 pages
Grundlehren Der Mathematischen Wissenschaften 305: A Series of Comprehensive Studies in Mathematics
No ratings yet
Grundlehren Der Mathematischen Wissenschaften 305: A Series of Comprehensive Studies in Mathematics
431 pages
Convex Optimization For Machine Learning
No ratings yet
Convex Optimization For Machine Learning
110 pages
Mikhail Moklyachuk - Convex Optimization - Introductory Course-Wiley-ISTE (2021)
No ratings yet
Mikhail Moklyachuk - Convex Optimization - Introductory Course-Wiley-ISTE (2021)
254 pages
Week02 Convex Optimization
No ratings yet
Week02 Convex Optimization
48 pages
04 Nonlinear Systems and Optimization
No ratings yet
04 Nonlinear Systems and Optimization
74 pages
Introduction To Optimization - Jean-François Aujol
No ratings yet
Introduction To Optimization - Jean-François Aujol
51 pages
Co 463
No ratings yet
Co 463
116 pages
Lecture Notes PDF
No ratings yet
Lecture Notes PDF
143 pages
Optimization Best
No ratings yet
Optimization Best
71 pages
Convex Analysis and Optimization Solution Manual
100% (2)
Convex Analysis and Optimization Solution Manual
193 pages
Nisheeth VishnoiFall2014 ConvexOptimization PDF
No ratings yet
Nisheeth VishnoiFall2014 ConvexOptimization PDF
114 pages
Mirror Descent Slides
No ratings yet
Mirror Descent Slides
35 pages
Func 20160919
No ratings yet
Func 20160919
35 pages
ConvexSpring25 Week9
No ratings yet
ConvexSpring25 Week9
26 pages
EE364a Homework 3 Solutions: 0 N 0 1 N N 1 1 N N 0 0
No ratings yet
EE364a Homework 3 Solutions: 0 N 0 1 N N 1 1 N N 0 0
19 pages
Xu2001 Minimax
No ratings yet
Xu2001 Minimax
13 pages
Gradient
No ratings yet
Gradient
37 pages
Lecture3 ConvexSetsFuns PDF
No ratings yet
Lecture3 ConvexSetsFuns PDF
43 pages
O4MD 03 Descent Methods
No ratings yet
O4MD 03 Descent Methods
18 pages
Subgradients: Ryan Tibshirani Convex Optimization 10-725
No ratings yet
Subgradients: Ryan Tibshirani Convex Optimization 10-725
25 pages
Convexsol 1
No ratings yet
Convexsol 1
40 pages
Lec6 Constr Opt
No ratings yet
Lec6 Constr Opt
30 pages
Sparsity and Its Mathematics
No ratings yet
Sparsity and Its Mathematics
44 pages
Lecture 7 8 Other Descent Methods
No ratings yet
Lecture 7 8 Other Descent Methods
7 pages
06 SG Method
No ratings yet
06 SG Method
33 pages
Mirror 2
No ratings yet
Mirror 2
8 pages
Exercises With Solutions PDF
No ratings yet
Exercises With Solutions PDF
37 pages
Convex Optimization L2 18
No ratings yet
Convex Optimization L2 18
11 pages
Epigrafo PDF
No ratings yet
Epigrafo PDF
12 pages
03 Convex Functions
No ratings yet
03 Convex Functions
31 pages
A Double Projection Algorithm For Quasimonotone Variational Inequalities in Banach Spaces
No ratings yet
A Double Projection Algorithm For Quasimonotone Variational Inequalities in Banach Spaces
14 pages
18.657: Mathematics of Machine Learning: S R LR LK K
No ratings yet
18.657: Mathematics of Machine Learning: S R LR LK K
9 pages
Lec 11
No ratings yet
Lec 11
13 pages
Optimization Class Notes MTH-9842
No ratings yet
Optimization Class Notes MTH-9842
25 pages
Data Science - Convex Optimization and Examples PDF
No ratings yet
Data Science - Convex Optimization and Examples PDF
9 pages
A Strengthened Conjecture On The Minimax Optimal Constant Stepsize For Gradient Descent
No ratings yet
A Strengthened Conjecture On The Minimax Optimal Constant Stepsize For Gradient Descent
8 pages
Lec3 Convex Function Exercise
No ratings yet
Lec3 Convex Function Exercise
4 pages
Lecture 12
No ratings yet
Lecture 12
4 pages
Notes ch4 1
No ratings yet
Notes ch4 1
7 pages
Controle 16
No ratings yet
Controle 16
4 pages
Raghu Meka Notes
No ratings yet
Raghu Meka Notes
7 pages
Phillip Allen Short Course Notes
100% (2)
Phillip Allen Short Course Notes
633 pages
01 Convex and Concave Functions
No ratings yet
01 Convex and Concave Functions
5 pages
Bregman Divergence and Mirror Descent
No ratings yet
Bregman Divergence and Mirror Descent
8 pages
Basic Concepts: 1.1 Continuity
No ratings yet
Basic Concepts: 1.1 Continuity
7 pages
14.451 Notes: 1 Mathematical Preliminaries
No ratings yet
14.451 Notes: 1 Mathematical Preliminaries
5 pages
Optimality Conditions: Unconstrained Optimization: 1.1 Differentiable Problems
No ratings yet
Optimality Conditions: Unconstrained Optimization: 1.1 Differentiable Problems
10 pages
Lecture 10
No ratings yet
Lecture 10
4 pages
Convex Optimization Prerequisite - Topics
No ratings yet
Convex Optimization Prerequisite - Topics
6 pages
Review Question 3
No ratings yet
Review Question 3
4 pages
Analiza Convexa
No ratings yet
Analiza Convexa
4 pages
DSA3102 Midterm Chestsheet
No ratings yet
DSA3102 Midterm Chestsheet
2 pages
Conjugate Gradient Method: Com S 477/577 Nov 6, 2007
No ratings yet
Conjugate Gradient Method: Com S 477/577 Nov 6, 2007
8 pages
AE Unit2
No ratings yet
AE Unit2
189 pages
Lecture 4 - Taylor Series Expansion and Finite Difference Method
100% (1)
Lecture 4 - Taylor Series Expansion and Finite Difference Method
41 pages
Spectrofotometer Cary PDF
No ratings yet
Spectrofotometer Cary PDF
16 pages
Weg ssw900 15825368 Datasheet
No ratings yet
Weg ssw900 15825368 Datasheet
5 pages
ARPro3 Manual
No ratings yet
ARPro3 Manual
54 pages
Microsoft System Center 2012 R2 Operations Manager Step by Step Ebook
No ratings yet
Microsoft System Center 2012 R2 Operations Manager Step by Step Ebook
195 pages
Electronics For Industrial Applications: Operating Instructions RE 07602-B/01.2015
No ratings yet
Electronics For Industrial Applications: Operating Instructions RE 07602-B/01.2015
28 pages
02 ACS 6000 Hands-On Operation - RevA1
No ratings yet
02 ACS 6000 Hands-On Operation - RevA1
6 pages
English Please 11-Student S
No ratings yet
English Please 11-Student S
3 pages
Gigabyte A520m-H Rev 1.0 PDF
No ratings yet
Gigabyte A520m-H Rev 1.0 PDF
39 pages
Manual Rápido ScreenPlay Director SPDHD
No ratings yet
Manual Rápido ScreenPlay Director SPDHD
44 pages
Excessive Speed Manual
No ratings yet
Excessive Speed Manual
16 pages
Branch Current Based State Estimation
No ratings yet
Branch Current Based State Estimation
5 pages
Frequency-Response Design Method Hand-Out
No ratings yet
Frequency-Response Design Method Hand-Out
15 pages
Fundamental of Programming Course Outline
No ratings yet
Fundamental of Programming Course Outline
3 pages
SLA Part 3 - Journal Line Definition PDF
No ratings yet
SLA Part 3 - Journal Line Definition PDF
7 pages
Bernet El Al 1999
No ratings yet
Bernet El Al 1999
10 pages
My Work
No ratings yet
My Work
9 pages
Nikhil Doddad Resume
No ratings yet
Nikhil Doddad Resume
2 pages
Your Reliance Communications Bill: Summary of Current Charges Amount (RS.)
No ratings yet
Your Reliance Communications Bill: Summary of Current Charges Amount (RS.)
5 pages
Individual Project
No ratings yet
Individual Project
4 pages
Polynomials
No ratings yet
Polynomials
4 pages
Low Voltage Low Power Current Comparator Using Dtmos: Abstract
No ratings yet
Low Voltage Low Power Current Comparator Using Dtmos: Abstract
5 pages
Cenumes - Week 6
No ratings yet
Cenumes - Week 6
4 pages
In4073 QR Controller Theory (2011-2012)
No ratings yet
In4073 QR Controller Theory (2011-2012)
4 pages
Understanding Ferrite Bead
No ratings yet
Understanding Ferrite Bead
3 pages
Curriculum Vitae Pradeep Kumar Gupta: Email
No ratings yet
Curriculum Vitae Pradeep Kumar Gupta: Email
3 pages

Bregman

Uploaded by

Bregman

Uploaded by

Bregman Divergence and Mirror Descent

1.1 Properties of Bregman divergence

• Duality. Suppose ψ is strongly convex. Then

Proof: (for the first equality only) Recall

This means x = ∇ψ ∗ (y). To summarize, (∇ψ ∗ )(∇ψ(x)) = x.

• Pythagorean Theorem. If x∗ is the projection of x0 onto a convex set C ⊆ Ω:

Then for all y ∈ C,

Then for any y ∈ C we have

So there must be a subgradient g ∈ L(x∗ ) such that

2 Mirror Descent for Batch Optimization

2. Telescope over k = 1, . . . , T (summing them up):

4. Denote k = f (xk ) − f (x∗ ) and rearrange

Therefore mink∈[T ] {h(xk ) − h(x∗ )} decays at the rate of O( √RG

3 Online and Stochastic Learning

You might also like

4. Denote k = f (xk ) − f (x∗ ) and rearrange