
On the Online Frank-Wolfe Algorithms for Convex and Non-convex Optimizations

Jean Lafond∗, Hoi-To Wai†‡, Eric Moulines§

August 16, 2016

arXiv:1510.01171v2 [stat.ML] 15 Aug 2016

Abstract
In this paper, the online variants of the classical Frank-Wolfe algorithm are considered. We consider minimizing the regret with a stochastic cost. The online algorithms only require simple iterative updates and a non-adaptive step size rule, in contrast to the hybrid schemes commonly considered in the literature. Several new results are derived for convex and non-convex losses. With a strongly convex stochastic cost and when the optimal solution lies in the interior of the constraint set or the constraint set is a polytope, the regret bound and anytime optimality are shown to be O(log³T/T) and O(log²T/T), respectively, where T is the number of rounds played. These results are based on an improved analysis of the stochastic Frank-Wolfe algorithms. Moreover, the online algorithms are shown to converge even when the loss is non-convex, i.e., the algorithms find a stationary point to the time-varying/stochastic loss at a rate of O(√(1/T)). Numerical experiments on realistic data sets are presented to support our theoretical claims.

1 Introduction
Recently, the Frank-Wolfe (FW) algorithm [FW56] has become popular for high-dimensional constrained optimization. Compared to the projected gradient (PG) algorithm (see [BT09, JN12a, JN12b, NJLS09]), the FW algorithm (a.k.a. the conditional gradient method) is appealing due to its projection-free nature. The costly projection step in PG is replaced by a linear optimization in FW, which admits a closed-form solution for many problems of interest in machine learning.
This work focuses on the online variants of the FW and the FW with away steps (AW) algorithms. At each round, the proposed online FW/AW algorithms follow the same update equation as the classical FW/AW algorithms, and the step size is chosen according to a non-adaptive rule. The only modification is that we use an online-computed aggregated gradient as a surrogate for the true gradient of the expected loss that we attempt to minimize. We establish fast convergence of the algorithms under various conditions.
Fast convergence of projection-free algorithms has been studied in [LJJ13, LJJ15, GH15a, GH15b, LZ14, HL16]. However, many works have considered a 'hybrid' approach that involves solving a regularized linear optimization during the updates [GH15b, LZ14], or combining existing algorithms with FW [HL16]. In particular, the authors in [GH15b] showed a regret bound of O(log T/T) for their online projection-free algorithm, where T is the number of iterations, under an adversarial setting. This matches the optimal bound for strongly convex losses. The drawback of these algorithms lies in the extra complexity (in implementation and computation) added to the classical FW algorithm.
Our aim is to show that simple online projection-free methods can achieve convergence guarantees on par with the sophisticated algorithms mentioned above. In particular, we present a set of new results for online FW/AW algorithms under the full information setting, i.e., complete knowledge about the loss
∗ Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris, France. Email: [email protected]
† School of Electrical, Computer and Energy Engineering, Arizona State University, AZ, USA. Email: [email protected]
‡ J. Lafond and H.-T. Wai have contributed equally.
§ CMAP, Ecole Polytechnique, Palaiseau, France. Email: [email protected]

function is retrieved at each round [ADX10] (see Section 2). Our online FW algorithm is similar to the online projection-free method proposed in [HK12], while the online AW algorithm is new. For online FW algorithms, [HK12] has proven a regret of O(√(log²T/T)) for convex and smooth stochastic costs. We improve the regret bound to O(log³T/T) under two different sets of assumptions: (a) the stochastic cost is strongly convex and the optimal solutions lie in the interior of C (cf. H1, for online FW); (b) C is a polytope (cf. H2, for online AW). An improved anytime optimality bound of O(log²T/T) (compared to O(√(log²T/T)) in [HK12]) is also proven. We compare our results to the state-of-the-art in Table 1.

Reference                    | Settings                                         | Regret bound   | Anytime bound
-----------------------------|--------------------------------------------------|----------------|---------------
Garber & Hazan, 2015 [GH15b] | Hybrid algo., Lipschitz cvx. loss                | O(√(1/T))      | O(√(log T/T))
Garber & Hazan, 2015 [GH15b] | Hybrid algo., strong cvx. loss                   | O(log T/T)     | O(log T/T)
Hazan & Kale, 2012 [HK12]    | Simple algo., Lipschitz cvx. loss                | O(√(log²T/T))  | O(√(log²T/T))
Hazan & Kale, 2012 [HK12]    | Simple algo., strong cvx. loss                   | O(√(log²T/T))  | O(√(log²T/T))
This work (online FW)        | Simple algo., strong cvx. loss, interior point   | O(log³T/T)     | O(log²T/T)
This work (online AW)        | Simple algo., strong cvx. loss, polytope const.  | O(log³T/T)     | O(log²T/T)

Table 1: Convergence rate comparison. Note that the regret bound for [GH15b] is given under an adversarial loss setting, while the bounds for [HK12] and our work are based on a stochastic cost. Depending on the application (see Section 5 & Appendix I), our regret and anytime bounds can be improved to O(log²T/T) and O(log T/T), respectively.

Another interesting discovery is that the online FW/AW algorithms converge to a stationary point even when the loss is non-convex, at a rate of O(1/√T). To the best of our knowledge, this is the first convergence rate result for non-convex online optimization with projection-free methods.
To support our claims, we perform numerical experiments on online matrix completion using realistic datasets. The proposed online schemes outperform a simple projected gradient method in terms of running time. The algorithms also demonstrate excellent performance for robust binary classification.
Related Works. In addition to the references mentioned above, this work is related to the study of stochastic optimization, e.g., [GL15, NJLS09]. [GL15] describes a FW algorithm using stochastic approximation and proves that the optimality gap converges to zero almost surely; [NJLS09] analyzes the stochastic projected gradient method and proves that the convergence rate is O(log t/t) under strong convexity and the assumption that the optimal solution lies in the interior of C. The latter is similar to assumption H1 in this paper.
Lastly, most recent works on non-convex optimization are based on the stochastic projected gradient descent method [AZH16, GHJY15]. Projection-free non-convex optimization has only been addressed by a few authors [GL15, EV76]. At the time of writing, we noticed that several authors have published articles pertaining to offline, non-convex FW algorithms, e.g., [LJ16] achieves the same convergence rate as ours with an adaptive step size, [JLMZ16] considers a different assumption on the smoothness of the loss function, and [YZS14] has a slower convergence rate than ours. Nevertheless, none of the above has considered an online optimization setting with a time-varying objective like ours.
Notation. For any n ∈ ℕ, let [n] denote the set {1, · · · , n}. The inner product on an n-dimensional real Euclidean space E is denoted by ⟨·, ·⟩ and the associated Euclidean norm by ‖·‖₂. The space E is also equipped with a norm ‖·‖ and its dual norm ‖·‖∗. The diameter of the set C w.r.t. ‖·‖∗ is denoted by ρ, that is, ρ := sup_{θ,θ′∈C} ‖θ − θ′‖∗. In addition, we denote the diameter of C w.r.t. the Euclidean norm by ρ̄, i.e., ρ̄ := sup_{θ,θ′∈C} ‖θ − θ′‖₂. The i-th element of a vector x is denoted by [x]_i.

2 Problem Setup and Algorithms

We use the setting introduced in [HK12]. The online learner wants to minimize a loss function f which is the expectation of empirical loss functions f_t(θ) = f(θ; ω_t), where ω_t is drawn i.i.d. from a fixed distribution D: f(θ) := E_{ω∼D}[f(θ; ω)]. The regret of a sequence of actions {θ_t}_{t=1}^T is:

  R_T := T⁻¹ Σ_{t=1}^T f(θ_t) − min_{θ∈C} f(θ) .   (1)

Here, C is a bounded convex set included in E and f_t(·) is a continuously differentiable function.
Our proposed algorithms assume the full information setting [ADX10] such that upon playing θ_t, we receive full knowledge about the loss function θ ↦ f_t(θ). The choice of θ_{t+1} will be based on the previously observed losses {f_s(θ)}_{s=1}^t. Let γ_t ∈ (0, 1] be a sequence of decreasing step sizes (see Section 3), F_t(θ) = t⁻¹ Σ_{s=1}^t f_s(θ) the aggregated loss, and ∇F_t(θ) the gradient of F_t evaluated at θ. We study two online algorithms.

Online Frank-Wolfe (O-FW). The online FW algorithm, introduced in [HK12], is a direct generalization of the classical FW algorithm, as summarized in Algorithm 1. It differs from the classical FW algorithm only in the sense that the aggregated gradient ∇F_t(θ_t) = t⁻¹ Σ_{s=1}^t ∇f_s(θ_t) is used for the linear optimization in Step 4. See Remark 3 for the complexity of calculating the aggregated gradient.

Algorithm 1 Online Frank-Wolfe (O-FW).
1: Initialize: θ_1 ← 0
2: for t = 1, . . . do
3:   Play θ_t and receive θ ↦ f_t(θ).
4:   Solve the linear optimization:
       a_t ← arg min_{a∈C} ⟨a, ∇F_t(θ_t)⟩ .   (2)
5:   Compute θ_{t+1} ← θ_t + γ_t(a_t − θ_t).
6: end for
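To make the update concrete, here is a minimal Python sketch of the O-FW loop (our illustration, not the authors' code); the linear-minimization oracle lmo solving (2) and the per-round gradient functions grad_fns are assumptions supplied by the user.

    import numpy as np

    def online_frank_wolfe(lmo, grad_fns, theta1, T):
        """Minimal sketch of O-FW (Algorithm 1).

        lmo(g)       -- solves the linear optimization (2): argmin_{a in C} <a, g>
        grad_fns[s]  -- gradient of the loss f_{s+1} revealed at round s+1
        theta1       -- initial point
        """
        theta = theta1.copy()
        iterates = [theta.copy()]
        for t in range(1, T + 1):
            # aggregated gradient (1/t) * sum_{s=1}^{t} grad f_s(theta_t);
            # recomputed from scratch here for clarity (cf. Remark 3)
            g = sum(grad_fns[s](theta) for s in range(t)) / t
            a = lmo(g)                            # Step 4: linear optimization
            gamma = 2.0 / (t + 1)                 # non-adaptive step size rule
            theta = theta + gamma * (a - theta)   # Step 5
            iterates.append(theta.copy())
        return iterates

Each round consists of a single linear optimization and a convex averaging step; no projection onto C is ever needed.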

Online away-step Frank-Wolfe (O-AW). The online counterpart of the away-step algorithm is given in Algorithm 2. By construction, the iterate θ_t is a convex combination of extreme points of C, referred to as active atoms. We denote by A_t the set of active atoms and by α_t^a the positive weight of any active atom a ∈ A_t at time t, that is:

  θ_t = Σ_{a∈A_t} α_t^a · a with α_t^a > 0 .   (4)

At each round, two types of step might be taken. If the condition of line 5 in Algorithm 2 is satisfied, we call the iteration a "FW step", otherwise we call it an "AW step". When a FW step is taken, a new atom a_t^FW is selected (3), the current iterate θ_t is moved towards a_t^FW and the active set is updated accordingly (lines 6 and 15). The selected atom is the (extreme) point of C which is maximally correlated with the negative aggregated gradient. Note that this step is identical to a usual O-FW iteration. When an "AW step" is taken, a currently active atom a_t^AW is selected (3) and the current iterate is moved away from a_t^AW (lines 8 and 15). The atom a_t^AW is the active atom which is the most correlated with the current gradient approximation. The intuition is that taking the 'away' step prevents the algorithm from following a 'zig-zag' path when θ_t is close to the boundary of C [Wol70].
Lastly, we note that the O-AW algorithm is similar to the classical AW algorithm [Wol70]. The exception is that a fixed step size rule is adopted due to the online optimization setting.
Remark 1. As the linear optimization (3) enumerates over the active atoms A_t at round t, the O-AW algorithm is suitable when C is an atomic (or polytope) set, otherwise |A_t| may become too large.

Algorithm 2 Online away-step Frank-Wolfe (O-AW).
1: Initialize: n_0 = 0, θ_1 = 0, A_1 = ∅;
2: for t = 1, . . . do
3:   Play θ_t and receive the loss function θ ↦ f_t(θ).
4:   Solve the linear optimizations with the aggregated gradient:
       a_t^FW ← arg min_{a∈C} ⟨a, ∇F_t(θ_t)⟩ ,  a_t^AW ← arg max_{a∈A_t} ⟨a, ∇F_t(θ_t)⟩   (3)
5:   if ⟨a_t^FW − θ_t, ∇F_t(θ_t)⟩ ≤ ⟨θ_t − a_t^AW, ∇F_t(θ_t)⟩ or A_t = ∅ then
6:     FW step: d_t ← a_t^FW − θ_t, n_t ← n_{t−1} + 1, γ̂_t ← γ_{n_t} and A_{t+1} ← A_t ∪ {a_t^FW}.
7:   else
8:     d_t ← θ_t − a_t^AW, γ_max = α_t^{a_t^AW}/(1 − α_t^{a_t^AW}); cf. (4) for the definition of α_t^{a_t^AW}.
9:     if γ_max ≥ γ_{n_{t−1}} then
10:      AW step: n_t ← n_{t−1} + 1 and γ̂_t ← γ_{n_t}
11:    else
12:      Drop step: γ̂_t ← γ_max, n_t ← n_{t−1} and A_{t+1} ← A_t \ {a_t^AW}
13:    end if
14:  end if
15:  Compute θ_{t+1} ← θ_t + γ̂_t d_t.
16: end for
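The branching in lines 5–14 is easy to mistranscribe, so the following Python sketch spells out one O-AW round. It is our illustration under simplifying assumptions: atoms are numpy vectors hashed by their bytes, and the per-round update of the weights α_t^a is omitted for brevity.

    import numpy as np

    def oaw_round(theta, grad, active, n, lmo, gamma):
        """One round of O-AW (Algorithm 2, lines 4-15), as a sketch.

        active -- dict mapping atom key -> (atom, weight alpha_t^a);
                  weight maintenance across rounds is omitted here
        n      -- number of non-drop steps taken so far (n_{t-1})
        gamma  -- gamma(k) returns the step size gamma_k
        """
        a_fw = lmo(grad)                                   # FW atom, eq. (3)
        if active:
            key_aw = max(active, key=lambda k: active[k][0] @ grad)
            a_aw, alpha = active[key_aw]                   # away atom, eq. (3)
        if not active or (a_fw - theta) @ grad <= (theta - a_aw) @ grad:
            d, n = a_fw - theta, n + 1                     # FW step (line 6)
            step = gamma(n)
            active.setdefault(a_fw.tobytes(), (a_fw.copy(), 0.0))
        else:
            d = theta - a_aw                               # away direction (line 8)
            gamma_max = alpha / (1.0 - alpha)
            if gamma_max >= gamma(n):                      # AW step (line 10)
                n += 1
                step = gamma(n)
            else:                                          # drop step (line 12)
                step = gamma_max
                del active[key_aw]
        return theta + step * d, active, n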

Remark 2 (Linear Optimization.). The run-time complexity of the O-FW and O-AW algorithms depends on finding an efficient solution to the linear optimization step. In many cases, this step is extremely efficient. For example, when C is the trace-norm ball, the linear optimization amounts to finding the top singular vectors of the gradient; see [Jag13] for an overview.
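For instance, the minimizer of ⟨a, G⟩ over the trace-norm ball {θ : ‖θ‖_{σ,1} ≤ R} is the rank-one matrix −R·uv^⊤ built from the top singular vector pair (u, v) of the gradient G. The SciPy-based snippet below is our illustration of this standard fact, not code from the paper.

    import numpy as np
    from scipy.sparse.linalg import svds

    def lmo_trace_norm_ball(G, R):
        """argmin_{||A||_{sigma,1} <= R} <A, G>: a rank-one matrix.

        G -- gradient matrix (dense float ndarray or scipy sparse matrix)
        R -- radius of the trace-norm ball
        """
        # top singular triplet of G via an iterative (Lanczos-type) solver;
        # the cost grows with the number of non-zeros of G
        u, s, vt = svds(G, k=1)
        return -R * np.outer(u[:, 0], vt[0, :])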

Remark 3 (Complexity per iteration.). In addition to the linear optimization, both O-FW/O-AW algorithms
require the aggregate gradient ∇Ft (θt ) to be computed at each round, and the complexity involved grows
with the round number. In cases when the loss ft is the negated log-likelihood of an exponential family
distribution, the gradient aggregation can be replaced by an efficient ‘on-the-fly’ update, whose complexity is a
dimension-dependent constant over the iterations. As demonstrated in Section 5 and Appendix I, this
set-up covers many problems of interest, among others the online matrix completion and online LASSO.

3 Main Results
This section presents the main results for the convergence of the O-FW/O-AW algorithms. Notice that our results for convex losses are based on an improved analysis of the stochastic/inexact variants of the FW/AW algorithms (see Anytime Analysis in subsection 3.1), while the results for non-convex losses are derived from a novel observation on the duality gap for FW algorithms. Due to space constraints, only the main results are displayed. Detailed proofs can be found in the appendices.
Some constants are defined as follows. A function f is said to be µ-strongly convex if, for all θ, θ̃ ∈ E,

  f(θ) ≤ f(θ̃) + ⟨∇f(θ), θ − θ̃⟩ − (µ/2)‖θ − θ̃‖₂² .   (5)

We also say f is L-smooth if for all θ, θ̃ ∈ E we have

  f(θ̃) ≤ f(θ) + ⟨∇f(θ), θ̃ − θ⟩ + (L/2)‖θ − θ̃‖₂² .   (6)

Lastly, f is said to be G-Lipschitz if for all θ, θ̃ ∈ E,

  |f(θ) − f(θ̃)| ≤ G‖θ − θ̃‖∗ .   (7)

3.1 Convex Loss
We first analyze Algorithm 1 and Algorithm 2 when the expected loss function f is convex. In particular, our analysis will depend on the following geometric conditions on the constraint set C. Denote by ∂C the boundary set of C. For Algorithm 1, we consider
H1. There is a minimizer θ⋆ of f that lies in the interior of C, i.e., δ := inf_{s∈∂C} ‖s − θ⋆‖₂ > 0.
While H1 appears to be restrictive, for Algorithm 2, we can work with a relaxed condition:

H2. C is a polytope.
As argued in [LJJ15], H2 implies that the pyramidal width of C, δ_AW := PdirW(C), is positive; see the definition in (29) of the appendix.
Regret Analysis. Our main result is summarized as follows. For ε ∈ (0, 1),

Theorem 1. Consider O-FW (resp. O-AW). Assume H1 (resp. H2), f(θ) is µ-strongly convex, f(θ; ω) is L-smooth for all ω drawn from D and each element of ∇f_t(θ) is sub-Gaussian with parameter σ_D. Set γ_t = 2/(t + 1). With probability at least 1 − ε and for all t ≥ 1, the anytime loss bounds hold:

  (O-FW) f(θ_t) − min_{θ∈C} f(θ) ≤ [2√(3/2) (σ_grd ρ + Lρ̄²)/(2δ√µ)]² · log(t) log(nt/ε) · t⁻¹ ,
  (O-AW) f(θ_t) − min_{θ∈C} f(θ) ≤ [(5/3) (2σ_grd ρ + Lρ̄²)/(δ_AW√µ)]² · log(t) log(nt/ε) · t⁻¹ ,   (8)

where σ_grd = O(max{σ_D, ρ̄L}√n). Consequently, summing up the two sides of (8) from t = 1 to t = T gives the regret bound for both O-FW and O-AW:

  T⁻¹ Σ_{t=1}^T f(θ_t) − min_{θ∈C} f(θ) = O(log³T/T), ∀ T ≥ 1 .   (9)
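To see how the regret bound (9) follows from the anytime bound (8), note that for fixed ε and n the anytime bound behaves like log²(t)/t up to constants; summing and comparing with an integral gives the extra log factor. A worked version of this step (a sketch, with c, c′ denoting constants depending on n, ε and the constants in (8)):

    \sum_{t=2}^{T} \frac{\log(t)\,\log(nt/\epsilon)}{t}
      \;\le\; c \sum_{t=2}^{T} \frac{\log^{2}(t)}{t}
      \;\le\; c \int_{1}^{T} \frac{\log^{2}(x)}{x}\,\mathrm{d}x + c'
      \;=\; \frac{c}{3}\,\log^{3}(T) + c' .

Dividing by T yields the O(log³T/T) rate claimed in (9).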

Proof. To prove Theorem 1, we first upper bound the gradient error of ∇F_t(θ_t), i.e.,
Proposition 2. Assume that f(θ; ω) is L-smooth for all ω from D and each element of the vector ∇f_t(θ) is sub-Gaussian with parameter σ_D. With probability at least 1 − ε,

  ‖∇F_t(θ_t) − ∇f(θ_t)‖∞ = O(max{σ_D, ρ̄L} √(n log(t) log(nt/ε)/t)), ∀ t ≥ 1 .   (10)

This shows that ∇F_t(θ_t) is an inexact gradient of the stochastic objective f(θ) at θ_t. Our proof is achieved by applying Theorem 3 (see below) with the appropriate constants plugged in.
We notice that for O-FW, [HK12] has proven a regret bound of O(√(log²T/T)), which is obtained by applying a uniform approximation bound on the objective value and proving a O(1/√t) bound for the instantaneous loss F_t(θ_t) − F_t(θ_t⋆). In contrast, Theorem 1 yields an improved regret by controlling the gradient error directly using Proposition 2 and analyzing O-FW/O-AW as an FW/AW algorithm with inexact gradient in the following.
Anytime Analysis. The regret analysis is derived from the following general result for FW/AW algorithms with stochastic/inexact gradients. Let ∇̂_t f(θ_t) be an estimate of ∇f(θ_t) which satisfies:
H3. For some α ∈ (0, 1], σ ≥ 0 and positive integer K. With probability at least 1 − ε, we have

  ‖∇̂_t f(θ_t) − ∇f(θ_t)‖ ≤ σ(η_t/(K + t − 1))^α , ∀ t ≥ 1 ,   (11)

where η_t ≥ 1 is an increasing sequence such that the right hand side decreases to 0.
This is a more general setting than is required for the analysis of O-FW/O-AW, as σ, α, η_t are arbitrary. The O-FW (or O-AW) algorithm with the above inexact gradient has the following convergence rate:

Theorem 3. Consider the sequence {θ_t}_{t=1}^∞ generated by O-FW (resp. O-AW) with the aggregated gradient ∇F_t(θ_t) replaced by ∇̂_t f(θ_t) satisfying H3 with K = 2. Assume H1 (resp. H2) and that f(θ) is L-smooth and µ-strongly convex. Set γ_t = 2/(t + 1). With probability at least 1 − ε and for all t ≥ 1, we have

  (O-FW) f(θ_t) − min_{θ∈C} f(θ) ≤ [max{2(3/2)^α, 1 + 2α/(2 − α)} (σρ + Lρ̄²)/(2δ√µ)]² · (η_t/(t + 1))^{2α} ,
  (O-AW) f(θ_t) − min_{θ∈C} f(θ) ≤ 2[max{(3/2)^α, 1 + 2α/(2 − α)} (2σρ + Lρ̄²)/(δ_AW√µ)]² · (η_t/(t + 1))^{2α} .   (12)

When α = 0.5, Theorem 3 improves the previously known bound of f(θ_t) − min_{θ∈C} f(θ) = O(√(η_t/t)) in [FG13, Jag13] under strong convexity and H1 or H2. It also matches the information-theoretic lower bound for strongly convex stochastic optimization in [RR11] (up to a log factor). Moreover, for O-AW, the strong convexity requirement on f can be relaxed; see Appendix G.

3.2 Non-convex Loss

Define respectively the duality gaps for O-FW and O-AW as

  g_t^FW := ⟨∇F_t(θ_t), θ_t − a_t⟩ ,  g_t^AW := ⟨∇F_t(θ_t), a_t^AW − a_t^FW⟩ ,   (13)

where a_t is defined in line 4 of Algorithm 1 and a_t^AW, a_t^FW are defined in (3) of Algorithm 2. Using the definition of a_t, if g_t^FW = 0, then θ_t is a stationary point of the optimization problem min_{θ∈C} F_t(θ). Therefore, g_t^FW (and similarly g_t^AW) can be seen as a measure of the stationarity of the point θ_t for the online optimization problem.
We analyze the convergence of O-FW/O-AW for general Lipschitz and smooth (possibly non-convex) loss functions using the duality gaps defined above. To do so, we depart from the usual induction-based proof technique (e.g., in the previous section or [Jag13, HK12]). Instead, our method of proof amounts to relating the duality gaps to a learning rate controlled by the step size rule on γ_t. The main result can be found below:

Theorem 4. Consider O-FW and O-AW. Assume that each of the loss functions f_t is G-Lipschitz and L-smooth. Set the step size sequence as γ_t = t^{−α} with α ∈ [0.5, 1). We have

  min_{t∈[T/2+1,T]} g_t^FW ≤ (1 − α)(4Gρ + Lρ̄²/2)(1 − (2/3)^{1−α})⁻¹ · T^{−(1−α)} , ∀ T ≥ 6 ,
  min_{t∈[T/2+1,T]} g_t^AW ≤ (1 − α)(4Gρ + Lρ̄²)(1 − (4/5)^{1−α})⁻¹ · T^{−(1−α)} , ∀ T ≥ 20 .   (14)

Notice that the above result is deterministic (cf. the definition of g_t^FW, g_t^AW) and also works with non-stochastic, non-convex losses. The above guarantees an O(1/T^{1−α}) rate for O-FW/O-AW at a certain round t within the interval [T/2 + 1, T]. Unlike the regret/anytime analysis done previously, our bounds are stated with respect to the best duality gap attained within the interval from t = T/2 + 1 to t = T. This is a common artifact when analyzing the duality gap of FW [Jag13]. Furthermore, we can show that:

Proposition 5. Consider O-FW (or O-AW), assume that each f_t is G-Lipschitz and L-smooth and that each ∇f_t(θ) is sub-Gaussian with parameter σ_D. Set the step size sequence as γ_t = t^{−α} with α ∈ [0.5, 1). With probability at least 1 − ε and for T ≥ 20, there exists t ∈ [T/2 + 1, T] such that

  max_{θ∈C} ⟨∇f(θ_t), θ_t − θ⟩ = O(max{1/T^{1−α}, √(log T/T)}) .   (15)

The proposition indicates that the iterate θ_t at round t ∈ [T/2 + 1, T] is an O(max{1/T^{1−α}, √(log T/T)})-stationary point of the stochastic optimization min_{θ∈C} f(θ). Our proof relies on Theorem 4 and a uniform approximation bound for ∇F_t(θ_t).

4 Sketch of the Proof of Theorem 3
To provide some insights, we present the main ideas behind the proof of Theorem 3. To simplify the discussion we only consider O-FW, K = 1, η_t = 1 and α = 0.5 in H3. The full proof can be found in the supplementary material. Since f(·) is L-smooth and C has a diameter of ρ̄, we have

  f(θ_{t+1}) ≤ f(θ_t) + γ_t ⟨∇f(θ_t), a_t − θ_t⟩ + γ_t² Lρ̄²/2 .

If we define ε_t := ∇̂_t f(θ_t) − ∇f(θ_t) and subtract f(θ⋆) on both sides, applying Cauchy-Schwarz yields

  h_{t+1} ≤ h_t − γ_t g_t^FW + γ_t² Lρ̄²/2 + γ_t ρ‖ε_t‖ .   (16)

Observe that as h_t, g_t^FW ≥ 0, the duality gap term g_t^FW determines the convergence rate of the sequence h_t to zero.
In fact, when f is convex, one can prove g_t^FW ≥ h_t − ρ‖ε_t‖. By the assumption H3, with probability at least 1 − ε, we have

  h_{t+1} ≤ h_t − γ_t h_t + γ_t² Lρ̄²/2 + 2γ_t ρσ/√t = (1 − γ_t)h_t + O(t^{−1.5}) .

Setting γ_t = 1/t, a simple induction on the above inequality proves h_t = O(1/√t).
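To make this induction concrete, here is a worked instance of the step (a sketch; C is a single constant absorbing Lρ̄²/2 and 2ρσ, so that the recursion reads h_{t+1} ≤ (1 − 1/t)h_t + C t^{−3/2}). Claiming h_t ≤ 2C/√t, the induction step is:

    h_{t+1} \;\le\; \Big(1 - \frac{1}{t}\Big)\frac{2C}{\sqrt{t}} + \frac{C}{t^{3/2}}
            \;=\; \frac{2C}{\sqrt{t}} - \frac{C}{t^{3/2}}
            \;\le\; \frac{2C}{\sqrt{t+1}} ,

where the last step uses 2/√t − 2/√(t+1) = 2/(√t √(t+1) (√t + √(t+1))) ≤ 1/t^{3/2}.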
An important consequence of H1 is that the latter leads to a tighter lower bound on g_t^FW. As we present in Lemma 6 in Appendix B, under H1 and when f is µ-strongly convex, we can lower bound g_t^FW as

  g_t^FW ≥ max{0, δ√(µh_t) − ρ‖ε_t‖} .

Note that h_t converges to zero, so the above lower bound on g_t^FW eventually becomes tighter than the previous one, i.e., g_t^FW ≥ δ√(µh_t) − ρ‖ε_t‖ ≥ h_t − ρ‖ε_t‖. This leads to the accelerated convergence of h_t. More formally, plugging the lower bound into (16) gives

  h_{t+1} ≤ h_t − γ_t δ√(µh_t) + γ_t² Lρ̄²/2 + 2γ_t ρσ/√t .

Again, setting γ_t = 1/t, a carefully executed induction argument shows h_t = O(1/t). The same line of argument is also used to prove the convergence rate of O-AW, where H2 is required (instead of H1) to provide a similarly tight lower bound on g_t^AW.

5 Numerical Experiments
We conduct numerical experiments to demonstrate the practical performance of the online algorithms. An
additional experiment for online LASSO with O-AW can be found in the appendix.

5.1 Example: Online matrix completion (MC)

Consider the following setting: we are sequentially given observations in the form (k_t, l_t, Y_t), with (k_t, l_t) ∈ [m_1] × [m_2] and Y_t ∈ ℝ. The observations are assumed to be i.i.d. To define the loss function, the conditional distribution of Y_t w.r.t. the sampling is parametrized by an unknown matrix θ̄ ∈ ℝ^{m_1×m_2} and supposed to belong to the exponential family, i.e.,

  p_θ̄(Y_t | k_t, l_t) := m(Y_t) exp(Y_t θ̄_{k_t,l_t} − A(θ̄_{k_t,l_t})) ,   (17)

where m(·) and A(·) are the base measure and log-partition functions, respectively. A natural choice for the loss function at round t is obtained by taking the negated logarithm of the likelihood, i.e.,

  f_t(θ) := A(θ_{k_t,l_t}) − Y_t θ_{k_t,l_t} .
[Figure 1 appears here: objective value/MSE and duality gap curves for O-FW, O-PG, batch FW and active ALT; see caption.]

Figure 1: Online MC performance. (Left) synthetic with batch size B = 1000; (Middle) movielens100k with B = 80; (Right) movielens20m with B = 10000. (Top) objective value/MSE against round number; (Bottom) against execution time. The duality gap g_t^FW for O-FW is plotted in purple.

Our goal is to minimize the regret with a constraint favoring low-rank solutions, C := {θ ∈ ℝ^{m_1×m_2} : ‖θ‖_{σ,1} ≤ R}, and the associated stochastic cost is f(θ) := E_θ̄[A(θ_{k_1,l_1}) − Y_1 θ_{k_1,l_1}].
Note that the aggregated gradient ∇F_t(θ_t) = t⁻¹ Σ_{s=1}^t ∇f_s(θ_t) can be expressed as:

  [∇F_t(θ_t)]_{k,l} = t⁻¹ A′([θ_t]_{k,l}) [Σ_{s=1}^t e_{k_s}(e′_{l_s})^⊤]_{k,l} − t⁻¹ [Σ_{s=1}^t Y_s e_{k_s}(e′_{l_s})^⊤]_{k,l} , ∀ (k, l) ∈ [m_1] × [m_2] ,

with {e_k}_{k=1}^{m_1} (resp. {e′_l}_{l=1}^{m_2}) the canonical basis of ℝ^{m_1} (resp. ℝ^{m_2}). We observe that the two matrices Σ_{s=1}^t e_{k_s}(e′_{l_s})^⊤ and Σ_{s=1}^t Y_s e_{k_s}(e′_{l_s})^⊤ can be computed 'on-the-fly' as running sums. The two matrices can also be stored efficiently in memory as they are at most t-sparse. The per-iteration complexity is upper bounded by O(min{m_1 m_2, T}), where T is the total number of observations.
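A Python sketch of this on-the-fly bookkeeping for the square loss (so A′(x) = x); the class below is our illustration, with counts and Ysum playing the roles of Σ_s e_{k_s}(e′_{l_s})^⊤ and Σ_s Y_s e_{k_s}(e′_{l_s})^⊤:

    import numpy as np
    from scipy.sparse import lil_matrix

    class MCGradient:
        """Running sums for the online matrix-completion gradient.

        Stores the two at-most-t-sparse matrices so that the aggregated
        gradient of the square loss is available in O(min{m1*m2, t}) per round.
        """
        def __init__(self, m1, m2):
            self.counts = lil_matrix((m1, m2))   # sum_s e_{k_s} (e'_{l_s})^T
            self.Ysum = lil_matrix((m1, m2))     # sum_s Y_s e_{k_s} (e'_{l_s})^T
            self.t = 0

        def observe(self, k, l, y):
            self.counts[k, l] += 1.0
            self.Ysum[k, l] += y
            self.t += 1

        def grad(self, theta):
            # square loss: A'(x) = x, so the (k,l) entry of the gradient is
            # t^{-1} * (theta[k,l] * counts[k,l] - Ysum[k,l]); zero elsewhere
            C = self.counts.tocsr()
            G = C.multiply(theta) - self.Ysum.tocsr()
            return G / self.t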
We observe that for online MC, a better anytime/regret bound than the general case analyzed in Section 3 can be achieved. In particular, Appendix H shows that ‖∇F_t(θ) − ∇f(θ)‖_{σ,∞} = O(√(log t/t)). As such, the online gradient satisfies H3 with η_t = O(log t) and α = 0.5. Moreover, f(θ) is strongly convex if A″(θ) ≥ µ; for example, this holds for the square loss function. Now if H1 is also satisfied, repeating the analysis in Section 3 yields anytime and regret bounds of O(log t/t) and O(log²T/T), respectively.
We test our online MC algorithm on a small synthetically generated dataset, where θ̄ is a rank-20, 200 × 5000 matrix with Gaussian singular vectors. There are 2 × 10⁶ observations with Gaussian noise of variance 3. We also test on two datasets, movielens100k and movielens20m from [HK15], which contain 10⁵ and 2 × 10⁷ movie ratings from 943 and 138493 users on 1682 and 26744 movies, respectively. We assume Gaussian observations and the loss function f_t(·) is designed as the square loss.
Results. We compare O-FW to a simple online projected-gradient (O-PG) method. The step size for O-FW is set as γ_t = 2/(1 + t). For the movielens datasets, the parameter θ̄ is unknown; therefore we split the dataset into training (80%) and testing (20%) sets and evaluate the mean square error on the test set. The radius R of C is set as R = 1.1‖θ̄‖_{σ,1} (synthetic), R = 10000 (movielens100k) and R = 150000 (movielens20m). Note that H1 is satisfied in the synthetic case.
The results are shown in Figure 1. For the synthetic data, we observe that the stochastic objective of O-FW decreases at a rate ∼ O(1/t), as predicted by our analysis. A significant complexity reduction compared to O-PG is also observed for the synthetic and movielens100k datasets. The running time is faster than that of the batch FW with line-searched step size on movielens20m, which we suspect is caused by the simpler linear optimization (2) solved at the algorithm initialization by O-FW¹; it is also comparable to a state-of-the-art, specialized batch algorithm for MC problems in [HO14] ('active ALT') and achieves the same MSE level, even though the data are acquired in an online fashion in O-FW.
¹ This operation amounts to finding the top singular vectors of ∇F_t(θ_t), whose complexity grows linearly with the number of non-zeros in ∇F_t(θ_t).

[Figure 2 appears here: error-rate and duality-gap curves for the logistic and sigmoid losses; see caption.]

Figure 2: Binary classification performance against round number t for: (Left) synthetic data; (Middle) mnist (class '1'); (Right) rcv1.binary. (Top) with no flip; (Bottom) with 25% flip in the training labels. The duality gap g_t^FW for O-FW with sigmoid loss is plotted in purple.


5.2 Example: Robust Binary Classification with Outliers

Consider the following online learning setting: the training data is given sequentially in the form (y_t, x_t), where y_t ∈ {±1} is a binary label and x_t ∈ ℝⁿ is a feature vector. Our goal is to train a classifier θ ∈ ℝⁿ such that for an arbitrary feature vector x̂ it assigns ŷ = sign(⟨θ, x̂⟩).
The dataset may sometimes be contaminated by wrong labels. As a remedy, we design a sigmoid loss function f_t(θ) := (1 + exp(10 · y_t⟨θ, x_t⟩))⁻¹ that approximates the 0/1 loss function [SSSS11, EBG11]. Note that f_t(θ) is smooth and Lipschitz, but not convex. For C, we consider the ℓ₁ ball C_{ℓ₁} = {θ ∈ ℝⁿ : ‖θ‖₁ ≤ r} when a sparse classifier is preferred, or the trace-norm ball C_σ = {θ ∈ ℝ^{m_1×m_2} : ‖θ‖_{σ,1} ≤ R}, where n = m_1 m_2, when a low-rank classifier is preferred.
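A sketch of the two ingredients O-FW needs in the ℓ₁-ball case: the linear optimization, whose minimizer is a signed, scaled coordinate vector, and the gradient of the (non-convex) sigmoid loss. This is our illustration, not the paper's code; sigma = 10 matches the loss used above.

    import numpy as np

    def lmo_l1_ball(g, r):
        """argmin_{||a||_1 <= r} <a, g>: put all mass on the largest |g_i|."""
        i = np.argmax(np.abs(g))
        a = np.zeros_like(g)
        a[i] = -r * np.sign(g[i])
        return a

    def sigmoid_loss_grad(theta, x, y, sigma=10.0):
        """Gradient of f(theta) = 1 / (1 + exp(sigma * y * <theta, x>))."""
        z = sigma * y * (theta @ x)
        s = 1.0 / (1.0 + np.exp(z))      # the loss value itself
        # d/dtheta = -sigma * y * s * (1 - s) * x
        return -sigma * y * s * (1.0 - s) * x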
We evaluate the performance of our online classifier on synthetic and real data. For the synthetic data, the true classifier θ̄ is a rank-10, 30 × 30 Gaussian matrix. Each feature x_t is a 30 × 30 Gaussian matrix. We have 40000 (20000) tuples of data for training (testing). We also test the classifier on the mnist (classifying '1' from the rest of the digits) and rcv1.binary datasets from LIBSVM [CL11]. The feature dimensions are 784 and 47236, and there are 60000 (10000) and 20242 (677399) data tuples for training (testing), respectively. We artificially and randomly flip 0% or 25% of the labels in the training set.
Results. As a benchmark, we compare with the logistic loss function, i.e., f_t(θ) = log(1 + exp(−y_t⟨θ, x_t⟩)). We apply O-FW with a learning rate of α = 0.75 for both loss functions, i.e., γ_t = 1/t^{0.75}. For the synthetic data and mnist, the sigmoid (logistic) loss classifier is trained with a trace-norm ball constraint of R = 1 (R = 10). Each round is fed with a batch of B = 10 tuples of data. For rcv1.binary, we train the classifiers with an ℓ₁-ball constraint of r = 100 (r = 1000) for the sigmoid (logistic) loss. Each round is fed with a batch of B = 5 tuples of data.
As seen in Figure 2, the logistic loss and the sigmoid loss perform similarly when there are no flips in the labels, while the sigmoid loss demonstrates better classification performance when some of the labels are flipped. Lastly, the duality gap of O-FW applied to the non-convex loss decays gradually with t, indicating that the algorithm converges to a stationary point.

A Proof of Proposition 2
The following proof is an application of a modified version of [SSSSS09, Theorem 5]². Let us define

  ε_t(θ) = ∇F_t(θ) − ∇f(θ) = (1/t) Σ_{s=1}^t (∇f(θ; ω_s) − E_{ω∼D}[∇f(θ; ω)]) .   (18)

From [Gau05], for some sufficiently small ε̃ > 0, there exists a Euclidean ε̃-net, N(ε̃), with cardinality bounded by

  |N(ε̃)| = O(n² log(n) (ρ̄/ε̃)ⁿ) .   (19)

In particular, for any θ ∈ C there is a point p_θ ∈ N(ε̃/L) such that ‖p_θ − θ‖₂ ≤ ε̃/L. This implies:

  ‖ε_t(θ)‖∞ ≤ ‖ε_t(p_θ)‖∞ + ‖ε_t(p_θ) − ε_t(θ)‖∞ ≤ ‖ε_t(p_θ)‖∞ + ‖∇F_t(θ) − ∇F_t(p_θ)‖∞ + ‖∇f(θ) − ∇f(p_θ)‖∞
    ≤ ‖ε_t(p_θ)‖∞ + ‖∇F_t(θ) − ∇F_t(p_θ)‖₂ + ‖∇f(θ) − ∇f(p_θ)‖₂ ≤ ‖ε_t(p_θ)‖∞ + 2L‖θ − p_θ‖₂
    ≤ ‖ε_t(p_θ)‖∞ + 2ε̃ ,

where we used the L-smoothness of F_t(θ) and f(θ) for the second-to-last inequality. Applying the union bound and controlling each point p_θ ∈ N(ε̃/L) using the sub-Gaussian assumption yields:

  P(sup_{θ∈C} ‖ε_t(θ)‖∞ > s) ≤ P(∪_{p_θ∈N(ε̃/L)} {‖ε_t(p_θ)‖∞ > s − 2ε̃}) ≤ |N(ε̃/L)| · 2n exp(−t(s − 2ε̃)²/(2σ_D²))
    ≤ O(n³ log(n) (Lρ̄/ε̃)ⁿ) exp(−t(s − 2ε̃)²/(2σ_D²)) .

Setting s = 3ε̃ in the above, it can be verified that the following holds with probability at least 1 − δ:

  ‖ε_t(θ)‖∞ = O(max{Lρ̄, σ_D} √(n log(t) log(n/δ)/t)) .   (20)

Applying another union bound over t ≥ 1 (e.g., by setting δ = ε/t²) then yields the desired result.

B Proof of Theorem 3
We define h_t := f(θ_t) − min_{θ∈C} f(θ) in the following. The analysis below is done by assuming a more general step size rule γ_t = K/(K + t − 1) with some positive integer K. First of all, we notice that for both Algorithm 1 and Algorithm 2 with the step size rule γ_t = K/(K + t − 1), we have γ_1 = 1 and thus h_1 = f(a_1) − f(θ⋆) < ∞. For t ≥ 2, we have the following convergence results for FW/AW algorithms with inexact gradients.
As explained in the proof sketch, let us state the following lemma, which is borrowed from [LJJ13, LJJ15].

Lemma 6. [LJJ13, LJJ15] Assume H1 and that f is L-smooth and µ-strongly convex, then

  (max_{θ∈C} ⟨∇f(θ_t), θ_t − θ⟩)² ≥ 2µδ² h_t and Lρ̄² ≥ µδ² .   (21)

Consider Algorithm 2, assume H2 and that f is L-smooth and µ-strongly convex, then

  (max_{θ∈A_t} ⟨∇f(θ_t), θ⟩ − min_{θ∈C} ⟨∇f(θ_t), θ⟩)² ≥ 2µδ_AW² h_t and Lρ̄² ≥ µδ_AW² .   (22)

² Note that [SSSSS09, Theorem 5] implicitly assumed that ∇f(θ; ω_s) is bounded for all ω_s, which can be generalized by our assumption that ∇f(θ; ω_s) is sub-Gaussian.

The above lemma is a key result that leads to the linear convergence of the classical FW/AW algorithms with adaptive step sizes, as studied in [LJJ13, LJJ15]. Lemma 6 enables us to prove the theorems below for the FW/AW algorithms with inexact gradients and fixed step sizes, whose proofs can be found in Appendices E and F:

Theorem 7. Consider Algorithm 1 with the assumptions given in Theorem 3. The following holds with probability at least 1 − ε:

  f(θ_t) − f(θ⋆) ≤ D_1 (η_t/(t + K − 1))^{2α} , ∀ t ≥ 2 ,   (23)

where β = 1 + 2α/(K − α) and

  D_1 = max{4((K + 1)/K)^{2α}, β²} · (ρσ + KLρ̄²/2)²/(2δ²µ) .

The anytime bound for Algorithm 1 is obvious from the above theorem.

Theorem 8. Consider Algorithm 2 with the assumptions given in Theorem 3. The following holds with probability at least 1 − ε:

  f(θ_t) − f(θ⋆) ≤ D_2 (η_t/(n_{t−1} + K))^{2α} , ∀ t ≥ 2 ,   (24)

where n_t is the number of non-drop steps (see Algorithm 2) up to iteration t, β = 1 + 2α/(K − α) and

  D_2 = 2 max{((K + 1)/K)^{2α}, β²} · (2ρσ + KLρ̄²/2)²/(δ_AW²µ) .

In addition, we have the following lemma for Algorithm 2.

Lemma 9. Consider Algorithm 2. We have n_t ≥ t/2 for all t, where n_t is the number of non-drop steps taken until round t.
Proof. Except at initialization, the active set is never empty. Indeed, if there is only one active atom left, then its weight is 1. Therefore the condition of line 9 is satisfied and the atom cannot be dropped. Denote by q_t the number of iterations where an atom was dropped up to time t (line 12). As noted above, n_t + q_t = t holds. Since, to be dropped, an atom needs to be added to the active set first, q_t ≤ t/2 also holds, yielding the result.
Combining Theorem 8 and the above lemma, we get the desired anytime bound for Algorithm 2.

B.1 Proof of Lemma 6

We first prove the first part of the lemma, i.e., (21), pertaining to the O-FW algorithm. Let s̄_t ∈ ∂C be a point on the boundary of C such that it is co-linear with θ⋆ and θ_t. Moreover, we define g_t := max_{θ∈C} ⟨∇f(θ_t), θ_t − θ⟩. As θ⋆ ∈ int(C), we can write

  θ⋆ = θ_t + γ̄(s̄_t − θ_t) for some γ̄ ∈ [0, 1) .   (25)

From the µ-strong convexity of f, we have

  (µ/2)‖θ⋆ − θ_t‖₂² ≤ f(θ⋆) − f(θ_t) − ⟨∇f(θ_t), θ⋆ − θ_t⟩ = −h_t + γ̄⟨∇f(θ_t), θ_t − s̄_t⟩ ≤ −h_t + γ̄g_t ,   (26)

where the last inequality is due to the definition of g_t. Now, the left hand side of the inequality above can be bounded as

  (µ/2)‖θ⋆ − θ_t‖₂² = (µ/2)γ̄²‖s̄_t − θ_t‖₂² ≥ (µ/2)γ̄²‖s̄_t − θ⋆‖₂² ≥ (µ/2)γ̄²δ² .   (27)

Combining the two inequalities above yields

  h_t ≤ γ̄g_t − (µ/2)γ̄²δ² ≤ g_t²/(2δ²µ) ,   (28)

where the upper bound is achieved by setting γ̄ = g_t/(δ²µ). Recalling the definition of g_t concludes the proof of the first part. Lastly, we note that by combining Eq. (2), Remark 1 and Lemma 2 in [LJJ13], we have Lρ̄² ≥ µδ².
Next, we prove the second part of the lemma, i.e., (22), pertaining to the O-AW algorithm. Recall that as C is a polytope, we can write C = conv(A) where A is a finite set of atoms in ℝⁿ, i.e., C is the convex hull of A. Note that A_t ⊆ A for all t in the O-AW algorithm. Let us define the pyramidal width δ_AW of C as:

  δ_AW := inf_{K∈faces(C), θ∈K, d∈cone(C−θ)\{0}} inf_{A′∈A_θ} (1/‖d‖₂) (max_{y∈A′∪{a(K,d)}} ⟨d, y⟩ − min_{y∈A′∪{a(K,d)}} ⟨d, y⟩) ,   (29)

where A_θ := {A′ : A′ ⊆ A such that θ ∈ conv(A′) and θ is a proper convex combination of A′} and a(K, d) := arg max_{v∈K} ⟨v, d⟩. Now, define the quantity:

  γ^A(θ, θ′) := ⟨∇f(θ), θ − θ′⟩ / ⟨∇f(θ), v_f(θ) − s_f(θ)⟩ ,   (30)

where v_f(θ) := arg min_{a∈A(θ)} ⟨∇f(θ), a⟩ and s_f(θ) := arg min_{a∈A} ⟨∇f(θ), a⟩. From [LJJ15, Theorem 6], it can be verified that

  µ · δ_AW² ≤ inf_{θ∈C} inf_{θ′∈C s.t. ⟨∇f(θ),θ′−θ⟩<0} (2/γ^A(θ, θ′)²) (f(θ′) − f(θ) − ⟨∇f(θ), θ′ − θ⟩) .   (31)

In the above, we have denoted A(θ) := {v = v_{A′}(θ) : A′ ∈ A_θ} where v_{A′}(θ) := arg max_{a∈A′} ⟨∇f(θ), a⟩. We remark that A(θ_t) ⊆ A_t. Note that γ^A(θ, θ′) > 0 as long as ⟨∇f(θ), θ′ − θ⟩ < 0 is satisfied.
Assume θ_t ≠ θ⋆ and observe that we have ⟨∇f(θ_t), θ⋆ − θ_t⟩ < 0. Eq. (31) implies that

  (γ^A(θ_t, θ⋆)²/2) µδ_AW² ≤ f(θ⋆) − f(θ_t) − ⟨∇f(θ_t), θ⋆ − θ_t⟩ = −h_t + γ^A(θ_t, θ⋆)⟨∇f(θ_t), v_f(θ_t) − s_f(θ_t)⟩ ,   (32)

where the equality follows from the definition of γ^A(θ_t, θ⋆). Define ḡ_t^AW := max_{θ∈A_t} ⟨∇f(θ_t), θ⟩ − min_{θ∈C} ⟨∇f(θ_t), θ⟩ and observe that

  ⟨∇f(θ_t), s_f(θ_t)⟩ = min_{θ∈C} ⟨∇f(θ_t), θ⟩ and ⟨∇f(θ_t), v_f(θ_t)⟩ ≤ max_{θ∈A_t} ⟨∇f(θ_t), θ⟩ .   (33)

Plugging the above into (32) yields

  h_t ≤ −(γ^A(θ_t, θ⋆)²/2) µδ_AW² + γ^A(θ_t, θ⋆) ḡ_t^AW ≤ (ḡ_t^AW)²/(2δ_AW²µ) ,   (34)

where we have set γ^A(θ_t, θ⋆) = ḡ_t^AW/(δ_AW²µ) similarly to the first part of this proof. This concludes the proof of the lower bound on ḡ_t^AW. Lastly, it follows from Remark 7, Eq. (20) and Theorem 6 of [LJJ15] that µδ_AW² ≤ Lρ̄².

C Proof of Theorem 4
In the following, we denote the minimum loss action at round t as θ_t⋆ ∈ arg min_{θ∈C} F_t(θ). Notice that F_t(θ) may be non-convex.
Observe that for O-FW:

  F_t(θ_{t+1}) ≤ F_t(θ_t) + γ_t⟨∇F_t(θ_t), a_t − θ_t⟩ + (1/2)γ_t²Lρ̄² = F_t(θ_t) − γ_t g_t^FW + (1/2)γ_t²Lρ̄² ,   (35)

where the first inequality is due to the fact that f is L-smooth and C has a diameter of ρ̄. Define ∆_t := F_t(θ_t) − F_t(θ_t⋆) to be the instantaneous loss at round t (recall that θ_t⋆ ∈ arg min_{θ∈C} F_t(θ)). We have

  ∆_{t+1} = (t/(t + 1)) (F_t(θ_{t+1}) − F_t(θ⋆_{t+1})) + (1/(t + 1)) (f_{t+1}(θ_{t+1}) − f_{t+1}(θ⋆_{t+1})) .   (36)

Note that the first part of the right hand side of (36) can be upper bounded as

  F_t(θ_{t+1}) − F_t(θ⋆_{t+1}) ≤ F_t(θ_{t+1}) − F_t(θ_t⋆) ≤ ∆_t − γ_t g_t^FW + (1/2)γ_t²Lρ̄² ,   (37)

where the first inequality is due to θ⋆_{t+1} ∈ C and the optimality of θ_t⋆, and the second inequality is due to the L-smoothness of F_t. Combining (36) and (37) gives

  ∆_{t+1} ≤ (t/(t + 1)) (∆_t − γ_t g_t^FW + γ_t²Lρ̄²/2) + (1/(t + 1)) (f_{t+1}(θ_{t+1}) − f_{t+1}(θ⋆_{t+1}))
  ⟺ (t/(t + 1)) γ_t g_t^FW ≤ (t/(t + 1)) (∆_t + (1/2)γ_t²Lρ̄²) + (1/(t + 1)) (f_{t+1}(θ_{t+1}) − f_{t+1}(θ⋆_{t+1})) − ∆_{t+1} .

Using the definition of ∆_{t+1}, we note that (t + 1)⁻¹ (f_{t+1}(θ_{t+1}) − f_{t+1}(θ⋆_{t+1})) − ∆_{t+1} = −(t/(t + 1)) (F_t(θ_{t+1}) − F_t(θ⋆_{t+1})). Therefore, simplifying terms gives

  γ_t g_t^FW ≤ ∆_t − (F_t(θ_{t+1}) − F_t(θ⋆_{t+1})) + γ_t²Lρ̄²/2 .   (38)

Observe that:

  Σ_{t=T/2+1}^T [∆_t − (F_t(θ_{t+1}) − F_t(θ⋆_{t+1}))] = Σ_{t=T/2+1}^T [(F_t(θ_t) − F_t(θ_{t+1})) − (F_t(θ_t⋆) − F_t(θ⋆_{t+1}))]
    = −F_T(θ_{T+1}) + F_{T/2+1}(θ_{T/2+1}) − F_{T/2+1}(θ⋆_{T/2+1}) + F_T(θ⋆_{T+1})
      + Σ_{t=T/2+2}^T t⁻¹ [f_t(θ_t) − F_{t−1}(θ_t) − (f_t(θ_t⋆) − F_{t−1}(θ_t⋆))]
    ≤ G · (‖θ_{T+1} − θ⋆_{T+1}‖∗ + ‖θ_{T/2+1} − θ⋆_{T/2+1}‖∗ + Σ_{t=T/2+2}^T 2t⁻¹‖θ_t − θ_t⋆‖∗)
    ≤ 2ρG · (1 + Σ_{t=T/2+2}^T t⁻¹) ≤ 2ρG · (1 + log 2) ≤ 4ρG ,

where we have used the fact that F_t(θ_t) − F_{t−1}(θ_t) = t⁻¹(f_t(θ_t) − F_{t−1}(θ_t)) in the first equality and that f_t, F_t are G-Lipschitz in the second inequality. We notice that Σ_{t=T/2+1}^T γ_t² = Σ_{t=T/2+1}^T t^{−2α} ≤ log 2 ≤ 1 as α ∈ [0.5, 1]. Summing up both sides of the inequality (38) gives

  (min_{t∈[T/2+1,T]} g_t^FW) · Σ_{t=T/2+1}^T γ_t ≤ Σ_{t=T/2+1}^T γ_t g_t^FW ≤ 4ρG + Lρ̄²/2 ,   (39)

where the inequality on the left is due to γ_t, g_t^FW ≥ 0. Observe that for all T ≥ 6, Σ_{t=T/2+1}^T γ_t = Σ_{t=T/2+1}^T t^{−α} ≥ (T^{1−α}/(1 − α)) (1 − (2/3)^{1−α}) = Ω(T^{1−α}). We conclude that

  min_{t∈[T/2+1,T]} g_t^FW ≤ ((1 − α)/T^{1−α}) (4ρG + Lρ̄²/2) (1 − (2/3)^{1−α})⁻¹ = O(1/T^{1−α}) .   (40)

For the O-AW algorithm, we observe that

  F_t(θ_{t+1}) ≤ F_t(θ_t) + γ̂_t⟨∇F_t(θ_t), d_t⟩ + (1/2)γ̂_t²Lρ̄² .   (41)
Note that by construction, ⟨∇F_t(θ_t), d_t⟩ = min{⟨∇F_t(θ_t), a_t^FW − θ_t⟩, ⟨∇F_t(θ_t), θ_t − a_t^AW⟩}. Using the inequality min{a, b} ≤ (1/2)(a + b), we have

  F_t(θ_{t+1}) ≤ F_t(θ_t) + (1/2)γ̂_t⟨∇F_t(θ_t), a_t^FW − a_t^AW⟩ + (1/2)γ̂_t²Lρ̄² = F_t(θ_t) − (1/2)γ̂_t g_t^AW + (1/2)γ̂_t²Lρ̄² .   (42)

Proceeding in a similar manner to the proof for O-FW above, we get

  (1/2)γ̂_t g_t^AW ≤ ∆_t − (F_t(θ_{t+1}) − F_t(θ⋆_{t+1})) + (1/2)γ̂_t²Lρ̄² .   (43)

The only difference from (38) in the O-FW analysis is the terms that depend on the actual step size γ̂_t. Now, Lemma 9 implies that at least T/4 non-drop steps must have been taken until round T/2; therefore we have γ̂_t ≤ γ_{T/4} for all t ∈ [T/2 + 1, T], since if a non-drop step is taken, the step size decreases, while if a drop step is taken, we have γ̂_t ≤ γ_{n_{t−1}} and n_{t−1} ≥ T/4. Therefore,

  (1/2) Σ_{t=T/2+1}^T γ̂_t²Lρ̄² ≤ (T/4) (T/4)^{−2α} Lρ̄² ≤ Lρ̄² .

Summing the right hand side of (43) from t = T/2 + 1 to t = T thus yields an upper bound of 4ρG + Lρ̄². On the other hand, define T_non-drop as the subset of [T/2 + 1, T] where a non-drop step is taken. We have

  Σ_{t=T/2+1}^T γ̂_t ≥ Σ_{t∈T_non-drop} γ_{n_t} ≥ Σ_{t=3T/4+1}^T γ_t ≥ (T^{1−α}/(1 − α)) (1 − (4/5)^{1−α}) = Ω(T^{1−α}) ,

where the second inequality is due to the fact that |T_non-drop| ≥ T/4 and the last inequality holds for all T ≥ 20. Finally, summing the left hand side of (43) from t = T/2 + 1 to t = T yields

  (min_{t∈[T/2+1,T]} g_t^AW) · Σ_{t=T/2+1}^T γ̂_t ≤ Σ_{t=T/2+1}^T γ̂_t g_t^AW ≤ 4ρG + Lρ̄² .

Therefore, we conclude that min_{t∈[T/2+1,T]} g_t^AW = O(1/T^{1−α}) for the O-AW algorithm.

D Proof of Proposition 5
We first look at the O-FW algorithm. Our goal is to bound the inner product

  max_{θ∈C} ⟨∇f(θ_t), θ_t − θ⟩ ,

where t ∈ [T/2 + 1, T] is the round index that satisfies g_t^FW = O(1/T^{1−α}), which exists due to Theorem 4. For all θ ∈ C, observe that

  ⟨∇f(θ_t), θ_t − θ⟩ ≤ ⟨∇F_t(θ_t), θ_t − θ⟩ + ⟨∇f(θ_t) − ∇F_t(θ_t), θ_t − θ⟩
    ≤ g_t^FW + ρ‖∇f(θ_t) − ∇F_t(θ_t)‖ .   (44)

Following the same line of analysis as Proposition 2, with probability at least 1 − ε, it holds that

  ‖∇f(θ_t) − ∇F_t(θ_t)‖∞ = O(max{σ_D, ρ̄L} √(n log(t) log(n/ε)/t)) ,   (45)

which is obtained from (20). Note that compared to Proposition 2, we save a factor of log(t) inside the square root as the iteration instance t is fixed. Using the fact that t ≥ T/2 + 1, the following holds with probability at least 1 − ε:

  ⟨∇f(θ_t), θ_t − θ⟩ = O(max{1/T^{1−α}, √(log T/T)}) , ∀ θ ∈ C .

For the O-AW algorithm, we observe that the inequality (42) in Appendix C can be replaced by

  F_t(θ_{t+1}) ≤ F_t(θ_t) − γ̂_t⟨∇F_t(θ_t), θ_t − a_t^FW⟩ + (1/2)γ̂_t²Lρ̄² .

Furthermore, we can show that the inner product ⟨∇F_t(θ_t), θ_t − a_t^FW⟩ decays at the rate of O(1/T^{1−α}) by replacing g_t^AW in the proof in Appendix C with this inner product. Consequently, (44) holds for the θ_t generated by O-AW, i.e.,

  ⟨∇f(θ_t), θ_t − θ⟩ ≤ ⟨∇F_t(θ_t), θ_t − a_t^FW⟩ + ρ‖∇f(θ_t) − ∇F_t(θ_t)‖ .

Applying (45) yields our result.

E Proof of Theorem 7
This section establishes an O((η_t/(t + K − 1))^{2α}) bound on h_t for Algorithm 1 with inexact gradients, i.e., replacing ∇F_t(θ_t) by ∇̂_t f(θ_t) satisfying H3, under the assumption that f(θ) is L-smooth and µ-strongly convex and γ_t = K/(K + t − 1).
Define ε_t = ∇̂_t f(θ_t) − ∇f(θ_t) and g_t = max_{s∈C} ⟨θ_t − s, ∇f(θ_t)⟩, the duality gap at θ_t. Notice that (21) in Lemma 6 implies:

  g_t ≥ √(2µδ²h_t) .   (46)

Define s_t ∈ arg max_{s∈C} ⟨θ_t − s, ∇f(θ_t)⟩. We note that

  ⟨∇f(θ_t), a_t − θ_t⟩ ≤ ⟨∇̂_t f(θ_t), s_t − θ_t⟩ − ⟨ε_t, a_t − θ_t⟩ = ⟨∇f(θ_t), s_t − θ_t⟩ + ⟨ε_t, s_t − a_t⟩
    ≤ −g_t + ρ‖ε_t‖ ≤ −δ√(2µh_t) + ρ‖ε_t‖ ,   (47)

where the last inequality follows from (46). Combining the L-smoothness of f(θ) and (47) yields the following with probability at least 1 − ε and for all t ≥ 1:

  h_{t+1} ≤ √h_t (√h_t − γ_t δ√(2µ)) + γ_t ρσ (η_t/(t + K − 1))^α + (1/2)γ_t²Lρ̄² .   (48)

Let us recall the definition of D_1,

  D_1 = max{4((K + 1)/K)^{2α}, β²} (ρσ + KLρ̄²/2)²/(2δ²µ) with β = 1 + 2α/(K − α) ,   (49)

and proceed by induction. Suppose that h_t ≤ D_1 (η_t/(t + K − 1))^{2α} for some t ≥ 1. There are two cases.
Case 1: h_t − γ_t δ√(2µh_t) ≤ 0.
Then, since γ_t = K/(K + t − 1), (48) yields

  h_{t+1} ≤ ρσK (η_t)^α/(K + t − 1)^{1+α} + Lρ̄²K²/(2(K + t − 1)²) ≤ (ρσK + Lρ̄²K²/2) (η_{t+1})^{2α}/(K + t − 1)^{2α}
    ≤ (ρσK + Lρ̄²K²/2) ((K + 1)/K)^{2α} (η_{t+1}/(K + t))^{2α} ,

where we used that η_t is increasing and larger than 1. To conclude, one just needs to check that

  (ρσK + Lρ̄²K²/2) ((K + 1)/K)^{2α} ≤ D_1 .   (50)

Note that we have

  D_1 ≥ ((K + 1)/K)^{2α} (ρσ + Lρ̄²K/2) · 4(ρσ + Lρ̄²K/2)/(2µδ²) ≥ ((K + 1)/K)^{2α} (ρσ + Lρ̄²K/2) · K ,   (51)
where the last inequality is due to Lρ̄² ≥ δ²µ from Lemma 6. Hence,

  h_{t+1} ≤ D_1 (η_{t+1}/(K + t))^{2α} .

Case 2: h_t − γ_t δ√(2µh_t) > 0.
By the induction hypothesis and (48), we have

  h_{t+1} − D_1 (η_{t+1}/(K + t))^{2α}
    ≤ D_1 ((η_t/(K + t − 1))^{2α} − (η_{t+1}/(K + t))^{2α}) + ((η_t)^α K/(K + t − 1)^{1+α}) (ρσ + Lρ̄²K/2 − δ√(2µD_1))
    ≤ ((η_t)^α/(K + t − 1)^{1+α}) (2αD_1 (η_t/(t + K − 1))^α + Kρσ + K²Lρ̄²/2 − δK√(2µD_1))
    ≤ ((η_t)^α/(K + t − 1)^{1+α}) (2αD_1 (η_t/(t + K − 1))^α + (Kρσ + K²Lρ̄²/2)(1 − β)) ,   (52)

where we used the facts that (i) η_t is increasing and larger than 1, (ii) t ≥ 1 and (iii) 1/(K + t − 1)^{2α} − 1/(K + t)^{2α} ≤ 2α/(K + t − 1)^{1+2α} in the second inequality; and we have used the definition of D_1 in the last inequality. Define

  t_0 := inf{t ≥ 1 : 2αD_1 (η_t/(t + K − 1))^α + (Kρσ + K²Lρ̄²/2)(1 − β) ≤ 0} .   (53)

Since η_t/(K + t − 1) is monotonically decreasing to 0 and β > 1, t_0 exists. Clearly, for any t > t_0 the right hand side is non-positive. For t ≤ t_0, we have

  (Kρσ + K²Lρ̄²/2)(β − 1) ≤ 2αD_1 (η_t/(t + K − 1))^α ,   (54)

i.e.,

  D_0 (K − α)(β − 1) ≤ 2αD_1 (η_t/(t + K − 1))^α .   (55)

Hence, by the definition β = 1 + 2α/(K − α) and applying Theorem 10 (see Section E.1) we get:

  h_t ≤ D_0 (η_t/(t + K − 1))^α ≤ D_1 (η_t/(t + K − 1))^{2α} .

The initialization is easily verified as the first inequality holds true for all t ≥ 2.

E.1 Proof of Theorem 10

Theorem 10. Consider Algorithm 1 and assume H3 and that f(θ) is convex and L-smooth. Then, the following holds with probability at least 1 − ε:

  f(θ_t) − f(θ⋆) ≤ D_0 (η_t/(t + K − 1))^α , ∀ t ≥ 2 ,   (56)

where

  D_0 = (K²Lρ̄²/2 + ρσK)/(K − α) .   (57)

Let us define h_t = f(θ_t) − f(θ⋆); then we get

  h_{t+1} ≤ h_t + γ_t⟨∇f(θ_t), a_t − θ_t⟩ + (1/2)γ_t²Lρ̄² .   (58)

On the other hand, the following also holds:

  ⟨∇f(θ_t), a_t − θ_t⟩ = ⟨∇̂_t f(θ_t), a_t − θ_t⟩ − ⟨ε_t, a_t − θ_t⟩
    ≤ ⟨∇̂_t f(θ_t), θ⋆ − θ_t⟩ − ⟨ε_t, a_t − θ_t⟩
    = ⟨∇f(θ_t), θ⋆ − θ_t⟩ + ⟨ε_t, θ⋆ − a_t⟩
    ≤ −h_t + ρ‖ε_t‖ ,   (59)

where the second line follows from the definition of a_t and the last inequality is due to the convexity of f and the definition of the diameter. Plugging (59) into (58) and using H3 yields the following with probability at least 1 − ε and for all t ≥ 1:

  h_{t+1} ≤ (1 − γ_t)h_t + γ_t ρσ (η_t/(K + t − 1))^α + (1/2)γ_t²Lρ̄² .   (60)

We now proceed by induction to prove the bound of the theorem. Define

  D_0 = (K²Lρ̄²/2 + ρσK)/(K − α) .

The initialization is done by applying (60) with t = 1 and noting that K ≥ 1. Assume that h_t ≤ D_0 (η_t/(K + t − 1))^α for some t ≥ 1. Since γ_t = K/(t + K − 1), from (60) we get:

  h_{t+1} − D_0 (η_{t+1}/(K + t))^α   (61)
    ≤ D_0 ((η_t/(t + K − 1))^α − (η_{t+1}/(t + K))^α) + (K²Lρ̄²/2 + ρσK(η_t)^α − D_0 K(η_t)^α)/(t + K − 1)^{1+α}
    ≤ (η_t)^α (D_0/(t + K − 1)^α − D_0/(t + K)^α + (K²Lρ̄²/2 + ρσK − D_0 K)/(t + K − 1)^{1+α})
    ≤ ((η_t)^α/(t + K − 1)^{1+α}) ((α − K)D_0 + K²Lρ̄²/2 + ρσK) ≤ 0 ,

where we used the fact that η_t is increasing and larger than 1 for the second inequality and 1/(t + K − 1)^α − 1/(t + K)^α ≤ α/(t + K − 1)^{1+α} for the third inequality. The induction argument is now completed.

F Proof of Theorem 8
This section establishes an O((η_t/(n_{t−1} + K))^{2α}) bound on h_t for Algorithm 2 with inexact gradients, i.e., replacing ∇F_t(θ_t) by ∇̂_t f(θ_t) satisfying H3, under the assumption that f(θ) is L-smooth and µ-strongly convex and γ_t = K/(K + t − 1).
Outline of the proof. Here, our strategy parallels that of Appendix E. We first show that the slow convergence rate of O((η_t/(n_{t−1} + K))^α) holds for Algorithm 2 (Theorem 11). The fast convergence rate of O((η_t/(n_{t−1} + K))^{2α}) is then established using induction. We have to pay special attention to the case when a drop step is taken (line 12 of Algorithm 2). In particular, when a drop step is taken, the induction step is done by Lemma 12; otherwise, we apply arguments similar to those in Appendix E to proceed with the induction.
To begin our proof, let us define ε_t = ∇̂_t f(θ_t) − ∇f(θ_t),

  b_t^FW := arg min_{b∈C} ⟨b, ∇f(θ_t)⟩ , b_t^AW := arg max_{b∈A_t} ⟨b, ∇f(θ_t)⟩ , ḡ_t^AW := ⟨∇f(θ_t), b_t^AW − b_t^FW⟩ .

We remark that b_t^AW ≠ a_t^AW and b_t^FW ≠ a_t^FW in general, as the former are evaluated on the true gradient ∇f(θ_t).
Recall that in Algorithm 2, we choose d_t such that ⟨∇̂_t f(θ_t), d_t⟩ = min{⟨∇̂_t f(θ_t), a_t^FW − θ_t⟩, ⟨∇̂_t f(θ_t), θ_t − a_t^AW⟩}. Therefore, for t ≥ 2:

  ⟨∇̂_t f(θ_t), d_t⟩ ≤ ⟨∇̂_t f(θ_t), (a_t^FW − a_t^AW)/2⟩ ≤ ⟨∇̂_t f(θ_t), (b_t^FW − b_t^AW)/2⟩
    = ⟨∇f(θ_t), (b_t^FW − b_t^AW)/2⟩ + ⟨ε_t, (b_t^FW − b_t^AW)/2⟩ ,

where the second inequality is due to the definitions of a_t^FW and a_t^AW in (3). Hence:

  ⟨∇̂_t f(θ_t), d_t⟩ ≤ −ḡ_t^AW/2 + ⟨ε_t, (b_t^FW − b_t^AW)/2⟩ .   (62)

As f is L-smooth, the following holds:

  f(θ_{t+1}) ≤ f(θ_t) + γ̂_t⟨∇f(θ_t), d_t⟩ + (Lρ̄²/2)γ̂_t²   (63)
    = f(θ_t) + γ̂_t(⟨∇̂_t f(θ_t), d_t⟩ − ⟨ε_t, d_t⟩) + γ̂_t²(Lρ̄²/2)
    ≤ f(θ_t) − γ̂_t ḡ_t^AW/2 + γ̂_t⟨ε_t, (b_t^FW − b_t^AW)/2 − d_t⟩ + γ̂_t²(Lρ̄²/2) ,

where we used (62) for the last line. Subtracting f(θ⋆) on both sides and applying H3 yields

  h_{t+1} ≤ h_t − γ̂_t ḡ_t^AW/2 + 2γ̂_t ρσ (η_t/(K + t − 1))^α + γ̂_t²(Lρ̄²/2) ,   (64)

where we have used ‖(b_t^FW − b_t^AW)/2 − d_t‖∗ ≤ 2ρ.
We first establish the slow convergence rate of the O-AW algorithm. Define

  D_2′ = (K/(K − α)) (KLρ̄²/2 + 2ρσ) .   (65)

Theorem 11. Consider Algorithm 2. Assume H3 and that f(θ) is convex and L-smooth. Then the following holds with probability at least 1 − ε:

  h_t := f(θ_t) − f(θ⋆) ≤ D_2′ (η_t/(n_{t−1} + K))^α ,   (66)

for all t ≥ 2. Here D_2′ is given in (65).
Proof. See subsection F.1.
Let us recall the definition of D_2,

  D_2 = 2 max{((K + 1)/K)^{2α}, β²} (2ρσ + KLρ̄²/2)²/(δ_AW²µ) with β = 1 + 2α/(K − α) .

To prove Theorem 8, we proceed by induction and assume that for some t ≥ 2, h_t ≤ D_2 (η_t/(K + n_{t−1}))^{2α} holds. Notice that (22) in Lemma 6 gives:

  ḡ_t^AW ≥ √(2µδ_AW²h_t) .   (67)

Now, suppose that h_t > 0 (h_t = 0 is discussed at the end of the proof). Combining (64) and (67) gives:

  h_{t+1} ≤ h_t − γ̂_t δ_AW √(µh_t/2) + 2γ̂_t ρσ (η_t/(n_{t−1} + K))^α + γ̂_t²(Lρ̄²/2) .   (68)

We have used the fact that t − 1 ≥ n_{t−1}.
Consider two different cases. If a drop step is taken at iteration t + 1, the induction step can be done by the following:
Lemma 12. Suppose that h_t ≤ D_2 (η_t/(K + n_{t−1}))^{2α} and that a drop step is taken at iteration t + 1 (see Algorithm 2, line 12). Then

  h_{t+1} ≤ D_2 (η_{t+1}/(K + n_t))^{2α} ;   (69)

note that n_t = n_{t−1} when a drop step is taken.
Proof. See subsection F.2.
The above lemma shows that the objective value does not increase when a drop step is taken.
On the other hand, when a drop step is not taken at iteration t + 1, then from Algorithm 2 we have γ̂_t = γ_{n_t} = K/(K + n_t − 1) and n_t = n_{t−1} + 1. We consider the following two cases:
Case 1: h_t − γ̂_t δ_AW √(µh_t/2) ≤ 0.
Then, since γ̂_t = K/(K + n_t − 1) and n_t ≤ t, (68) yields

  h_{t+1} ≤ 2ρσK (η_t)^α/(K + n_t − 1)^{1+α} + Lρ̄²K²/(2(K + n_t − 1)²)   (70)
    ≤ (2ρσK + Lρ̄²K²/2) (η_{t+1})^{2α}/(K + n_t − 1)^{2α}
    ≤ (2ρσK + Lρ̄²K²/2) ((K + 1)/K)^{2α} (η_{t+1}/(K + n_t))^{2α} ,

where we used that η_t is increasing and larger than 1. To conclude, one just needs to check that

  (2ρσK + Lρ̄²K²/2) ((K + 1)/K)^{2α} ≤ D_2 .   (71)

Note that we have

  D_2 ≥ ((K + 1)/K)^{2α} (2ρσ + Lρ̄²K/2) · 2(2ρσ + Lρ̄²K/2)/(µδ_AW²) ≥ ((K + 1)/K)^{2α} (2ρσ + Lρ̄²K/2) · K ,

where the last inequality is due to Lρ̄² ≥ δ_AW²µ from Lemma 6. Hence,

  h_{t+1} ≤ D_2 (η_{t+1}/(K + n_t))^{2α} .

Case 2: Assume h_t − γ̂_t δ_AW √(µh_t/2) > 0.
By induction and (68), we have

  h_{t+1} − D_2 (η_{t+1}/(K + n_t))^{2α}
    ≤ D_2 ((η_t/(K + n_t − 1))^{2α} − (η_{t+1}/(K + n_t))^{2α}) + ((η_t)^α K/(n_t + K − 1)^{1+α}) (2ρσ + Lρ̄²K/2 − δ_AW √(µD_2/2))
    ≤ ((η_t)^α/(K + n_t − 1)^{1+α}) (2αD_2 (η_t/(n_t + K − 1))^α + 2Kρσ + K²Lρ̄²/2 − δ_AW K √(µD_2/2))
    ≤ ((η_t)^α/(K + n_t − 1)^{1+α}) (2αD_2 (η_t/(n_t + K − 1))^α + (2Kρσ + K²Lρ̄²/2)(1 − β)) ,   (72)

where we used the facts that (i) η_t is increasing and larger than 1, (ii) t ≥ 1 and (iii) 1/(K + t − 1)^{2α} − 1/(K + t)^{2α} ≤ 2α/(K + t − 1)^{1+2α} in the second inequality; and we have used the definition of D_2 in the last inequality. Define

  t_0 := inf{t ≥ 1 : 2αD_2 (η_t/(n_t + K − 1))^α + K(2ρσ + KLρ̄²/2)(1 − β) ≤ 0} .   (73)
Since η_t/(K + n_t − 1) decreases to 0 (see H3 and Lemma 9), t_0 exists. Clearly, for any t > t_0 the right hand side is non-positive. For t ≤ t_0, we have

  K(2ρσ + KLρ̄²/2)(β − 1) ≤ 2αD_2 (η_t/(n_t + K − 1))^α ,   (74)

implying

  D_2′ (K − α)(β − 1) ≤ 2αD_2 (η_t/(n_t + K − 1))^α .   (75)

Since β = 1 + 2α/(K − α), the left hand side of (75) equals 2αD_2′ and we conclude that D_2′ ≤ D_2 (η_t/(n_t + K − 1))^α. Applying Theorem 11 we get:

  h_t ≤ D_2′ (η_t/(n_t + K − 1))^α ≤ D_2 (η_t/(n_t + K − 1))^{2α} .

The induction step is completed by observing that n_t − 1 = n_{t−1}. The initialization is easily verified for t = 2. If h_t = 0, then Lemma 6 yields ḡ_t^AW = 0 and the induction is treated as in Case 1.

F.1 Proof of Theorem 11

We proceed by induction and assume for some t > 0 that h_t ≤ D_2′ (η_t/(n_{t−1} + K))^α holds. First of all, observe that from the L-smoothness of f(θ),

  h_{t+1} ≤ h_t + γ̂_t⟨∇f(θ_t), d_t⟩ + (1/2)γ̂_t²Lρ̄² .   (76)

Moreover, we have:

  ⟨∇f(θ_t), d_t⟩ = ⟨∇̂_t f(θ_t), d_t⟩ − ⟨ε_t, d_t⟩ ≤ ⟨∇̂_t f(θ_t), a_t^FW − θ_t⟩ − ⟨ε_t, d_t⟩
    ≤ ⟨∇̂_t f(θ_t), θ⋆ − θ_t⟩ − ⟨ε_t, d_t⟩ = ⟨∇f(θ_t), θ⋆ − θ_t⟩ + ⟨ε_t, θ⋆ − θ_t − d_t⟩
    ≤ −h_t + 2ρ‖ε_t‖ ,   (77)

where we used the condition of line 5 (Algorithm 2) in the first inequality and the fact that ‖θ⋆ − θ_t − d_t‖∗ ≤ 2ρ in the last inequality. This gives

  h_{t+1} ≤ (1 − γ̂_t)h_t + 2γ̂_t ρσ (η_t/(K + n_{t−1}))^α + (1/2)γ̂_t²Lρ̄² ,   (78)

where we have used H3 and the fact that n_{t−1} ≤ t − 1.
Consider the two cases: if a drop step (line 12) is taken at iteration t + 1, the following result, which is analogous to Lemma 12, gives the induction.

Lemma 13. Suppose that h_t ≤ D_2′ (η_t/(K + n_{t−1}))^α for α ∈ (0, 1], and that a drop step is taken at time t + 1 (see Algorithm 2, line 12). Then

  h_{t+1} ≤ D_2′ (η_{t+1}/(K + n_t))^α .   (79)

Proof. See subsection F.3.
On the other hand, if a drop step is not taken, notice that we will have γ̂_t = γ_{n_t} = K/(K + n_t − 1) and n_t = n_{t−1} + 1. Consequently, the same induction argument as in subsection E.1 (replacing t by n_t and considering h_{t+1} − D_2′(η_{t+1}/(K + n_t))^α) shows:

  h_{t+1} ≤ D_2′ (η_{t+1}/(K + n_t))^α .   (80)

The initialization of the induction is easily checked for t = 2.
F.2 Proof of Lemma 12
Since iteration t + 1 is a drop step, we have by construction (Algorithm 2 line 12)
K
γ̂t = γmax ≤ and nt = nt−1 .
K + nt
From (68) and the assumption in the lemma, we consider two cases. If √ht − γ̂t √(µδAW²/2) ≤ 0, then we have

    ht+1 − D2 (ηt+1/(K + nt))^{2α}
      ≤ 2γ̂t ρσ (ηt/(nt−1 + K))^α + (1/2) Lρ̄² γ̂t² − D2 (ηt+1/(nt + K))^{2α}
      ≤ ( 2ρσ + (1/2) Lρ̄² K/(nt + K) ) · K (ηt+1)^α/(nt + K)^{1+α} − D2 (ηt+1/(nt + K))^{2α}
      ≤ ( 2ρσK + K²Lρ̄²/2 − D2 ) (ηt+1/(nt + K))^{2α}.     (81)

The second inequality is due to nt = nt−1 and γ̂t = γmax ≤ K/(K + nt). The last inequality is due to 2α ≤ min{2, 1 + α} for all α ∈ (0, 1] and the fact that ηt is an increasing sequence with ηt ≥ 1. It can be verified that the right hand side is non-positive using the definition of D2.
On the other hand, if √ht − γ̂t √(µδAW²/2) > 0, we have from (68)

    ht+1 − D2 (ηt+1/(nt + K))^{2α}
      ≤ √ht ( √ht − γ̂t √(µδAW²/2) ) + (1/2) Lρ̄² γ̂t² + 2γ̂t ρσ (ηt/(nt−1 + K))^α − D2 (ηt+1/(nt + K))^{2α}
      ≤ (1/2) Lρ̄² γ̂t² + 2γ̂t ρσ (ηt/(nt−1 + K))^α − γ̂t √(D2 µδAW²/2) (ηt/(nt−1 + K))^α
      = γ̂t [ (1/2) Lρ̄² γ̂t + ( 2ρσ − √(D2 µδAW²/2) ) (ηt/(nt + K))^α ]
      ≤ γ̂t [ KLρ̄²/(2(nt + K)) + ( 2ρσ − √(D2 µδAW²/2) ) (ηt/(nt + K))^α ]
      ≤ γ̂t (ηt/(nt + K))^α ( KLρ̄²/2 + 2ρσ − √(D2 µδAW²/2) ).

The last inequality is due to α ≤ 1. Similarly, by the definition of D2, we observe that the RHS in the above inequality is non-positive.

F.3 Proof of Lemma 13


Using (78) gives the following chain:

    ht+1 − D2′ (ηt+1/(K + nt))^α
      ≤ (1 − γ̂t) ht + 2γ̂t ρσ (ηt/(K + nt))^α + (1/2) Lρ̄² γ̂t² − D2′ (ηt+1/(K + nt))^α
      ≤ (1 − γ̂t) D2′ (ηt/(K + nt))^α + 2γ̂t ρσ (ηt/(K + nt))^α + (1/2) Lρ̄² γ̂t² − D2′ (ηt+1/(K + nt))^α
      ≤ γ̂t (−D2′ + 2ρσ) (ηt/(K + nt))^α + γ̂t² Lρ̄²/2     (82)
      ≤ γ̂t ( −D2′ + 2ρσ + (1/2) KLρ̄² ) (ηt/(K + nt))^α ≤ 0.

In the above, the second inequality is due to 1 − γ̂t ≥ 0 and the induction hypothesis; the third inequality is due to ηt being increasing; and the last inequality is due to γ̂t < K/(K + nt). The proof is completed.

G Fast convergence of O-AW without strong convexity
The proof is based on a generalization of Lemma 6; the key inequality (85) below is borrowed from Theorem 11 in [LJJ15].
We focus on the anytime/regret bounds studied in Section 3.1. In particular, the relaxed conditions for a regret bound of O(log³ T/T) and an anytime bound of O(log² t/t) are that (i) C is a polytope and (ii) the loss function can be written as

    f(θ) = g(Aθ) + ⟨b, θ⟩,     (83)

where g is µg-strongly convex. For a general matrix A, f(θ) may not be strongly convex.
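For instance (a standard example, added here for illustration): the square loss f(θ) = (1/2)‖Aθ − y‖₂² is of the form (83) with g(u) = (1/2)‖u − y‖₂², which is 1-strongly convex, and b = 0; whenever A has a non-trivial null space, f itself fails to be strongly convex, yet the relaxed conditions above still apply.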
Define C to be the matrix whose rows contain the linear inequalities defining the constraint set C. Let ch be the Hoffman constant [LJJ15] for the matrix [A; b⊤; C], let G = maxθ∈C ‖∇g(Aθ)‖ be the maximal norm of the gradient of g over AC, and let ρA be the diameter of AC. We define the generalized strong convexity constant:

    µ̃ := 1 / ( 2ch² ( ‖b‖M + 3GρA + (2/µg)(G² + 1) ) ).     (84)

Under H2 and assuming that ht > 0 holds, applying the inequality (43) from [LJJ15] yields

    ḡtAW ≥ δAW √(2µ̃ ht).     (85)

Subsequently, the O(log² T/T) anytime bound and O(log³ T/T) regret bound in Theorem 1 can be obtained by repeating the proof in Appendix F with (85).

H Improved gradient error bound for online MC


Our goal is to show that with high probability,

    ‖∇Ft(θ) − ∇f(θ)‖σ,∞ = O(√(log t/t)),  for all t sufficiently large.     (86)

To facilitate our proof, let us state the following conditions on the observation noise statistics:

A1. The noise variance is finite, that is, there exists a constant σ̄ > 0 such that for all ϑ ∈ R, 0 ≤ A′′(ϑ) ≤ σ̄²; and the noise is sub-exponential, i.e., there exists a constant λ ≥ 1 such that for all (k, l) ∈ [m1] × [m2]:

    ∫ exp( λ⁻¹ |y − A′(θ̄k,l)| ) pθ̄(y|k, l) dy ≤ e,     (87)

where pθ̄(·) is defined as pθ̄(y|k, l) := m(y) exp( y θ̄k,l − A(θ̄k,l) ) and e is the base of the natural logarithm.
A2. There exists a finite constant κ > 0 such that for all θ ∈ C, k ∈ [m1], l ∈ [m2],

    κ ≥ max{ √( Σ_{l=1}^{m2} A′(θk,l)² ), √( Σ_{k=1}^{m1} A′(θk,l)² ) }.     (88)

Notice that κ = O(√(max{m1, m2})).
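As a concrete illustration (ours, not part of the original assumptions): for Gaussian observations, pθ̄(y|k, l) = m(y) exp( y θ̄k,l − σ²θ̄k,l²/2 ), i.e., the log-partition is A(ϑ) = σ²ϑ²/2. Then A′′(ϑ) = σ², so the first part of A1 holds with σ̄ = σ; moreover y − A′(θ̄k,l) is a zero-mean Gaussian with variance σ², so (87) holds for any λ of order max{1, σ}; and since A′(θk,l) = σ²θk,l is bounded over the compact set C, say by σ²ϑmax, A2 holds with κ ≤ σ²ϑmax √(max{m1, m2}).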
We remark that A1 and A2 are satisfied by all the exponential family distributions. We also need the
following proposition.
Proposition 14. Consider a finite sequence of independent random matrices (Zs)_{1≤s≤t} ∈ R^{m1×m2} satisfying E[Zs] = 0. For some U > 0, assume

    inf{ λ > 0 : E[exp(‖Zs‖σ,∞/λ)] ≤ e } ≤ U  for all s ∈ [t],     (89)

and there exists σZ such that

    σZ² ≥ max{ ‖ (1/t) Σ_{s=1}^{t} E[Zs Zs⊤] ‖σ,∞ , ‖ (1/t) Σ_{s=1}^{t} E[Zs⊤ Zs] ‖σ,∞ }.     (90)

Then for any ν > 0, with probability at least 1 − e^{−ν},

    ‖ (1/t) Σ_{s=1}^{t} Zs ‖σ,∞ ≤ cU max{ σZ √((ν + log d)/t) , U log(U/σZ) (ν + log d)/t },     (91)

with cU a constant that is increasing with U.

Proof. This result is proved in Theorem 4 in [Kol13] for symmetric matrices. Here we state a slightly different result because σZ² is an upper bound of the variance and not the variance itself; however, this does not alter the proof and the result remains valid. The concentration is extended to rectangular matrices by dilation, see Proposition 11 in [Klo14] for details.
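The √((ν + log d)/t) scaling in (91) is easy to probe numerically. Below is a minimal sanity check we add here for illustration (Python/NumPy); the random sign model, the dimensions and the choice d = m1 + m2 (the dimension arising in the dilation argument) are our own assumptions, not part of the original analysis:

    import numpy as np

    # Track ||(1/t) sum_s Z_s||_{sigma,inf} (the spectral norm) for i.i.d.
    # zero-mean random sign matrices, and compare it against the
    # sqrt(log(d)/t) scaling suggested by Proposition 14.
    rng = np.random.default_rng(0)
    m1, m2 = 50, 40
    d = m1 + m2
    for t in [10, 100, 1000, 10000]:
        Z_bar = np.mean(
            [rng.choice([-1.0, 1.0], size=(m1, m2)) for _ in range(t)], axis=0
        )
        op_norm = np.linalg.norm(Z_bar, ord=2)  # largest singular value
        print(f"t = {t:6d}  ||Z_bar|| = {op_norm:.4f}  sqrt(log(d)/t) = {np.sqrt(np.log(d) / t):.4f}")

The empirical operator norm should track the theoretical scaling up to a constant factor.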
Our result is stated as follows.

Proposition 15. Assume A1, A2 and that the sampling distribution is uniform. Define the approximation error εt(θ) := ∇Ft(θ) − ∇f(θ). With probability at least 1 − ∆, for any t ≥ T := (λ/σ̄)² log²(λ/σ̄) log(d + 2d/∆) and any θ ∈ CR:

    ‖εt(θ)‖σ,∞ = O( cλ (κ + σ̄) √( log(d(1 + t²/∆)) / (t(m1 ∧ m2)) ) ),

with ‖·‖σ,∞ the operator norm and cλ a constant which depends only on λ. The constants λ, σ̄ and κ are defined in A1 and A2.
Proof. For a fixed θ, by the triangle inequality,

    ‖εt(θ)‖σ,∞ ≤ ‖ (1/t) Σ_{s=1}^{t} ( Ys eks els⊤ − E[Ys eks els⊤] ) ‖σ,∞ + ‖ (1/t) Σ_{s=1}^{t} ( A′(θks,ls) eks els⊤ − E[A′(θks,ls) eks els⊤] ) ‖σ,∞.
Define Zs := Ys eks els⊤ − E[Ys eks els⊤]. Then

    ‖E[Zs Zs⊤]‖σ,∞ ≤ ‖E[Ys² eks els⊤ els eks⊤]‖σ,∞
      = ‖ (1/(m1 m2)) diag( ( Σ_{l=1}^{m2} E[Ys²|k, l] )_{k=1}^{m1} ) ‖σ,∞
      = (1/(m1 m2)) max_{k∈[m1]} Σ_{l=1}^{m2} ( A′′(θ̄k,l) + (A′(θ̄k,l))² )
      ≤ σ̄²/(m1 ∧ m2) + κ²/(m1 m2) ≤ (σ̄² + κ²)/(m1 ∧ m2),

where we used the fact that the distribution belongs to the exponential family for the second equality.
Similarly, one shows that ‖E[Zs⊤ Zs]‖σ,∞ satisfies the same upper bound. Hence, by Proposition 14 and A1, with probability at least 1 − e^{−ν} it holds that

    ‖ (1/t) Σ_{s=1}^{t} Zs ‖σ,∞ ≤ cλ √( (σ̄² + κ²)(ν + log d) / (t(m1 ∧ m2)) ),     (92)

for t larger than the threshold given in the proposition statement. For the second term, define Pt := (1/t) Σ_{s=1}^{t} eks els⊤ − (m1 m2)⁻¹ 1 1⊤; we get

    ‖ (1/t) Σ_{s=1}^{t} ( A′(θks,ls) eks els⊤ − E[A′(θks,ls) eks els⊤] ) ‖σ,∞ = ‖ Pt ∘ (A′(θk,l))k,l ‖σ,∞ ≤ κ ‖Pt‖σ,∞,     (93)

where ∘ denotes the Hadamard product and we have used Theorem 5.5.3 in [HJ94] for the last inequality.
0
Define Zs0 := eks els> − (m1 m2 )−1 11> . Since by definition, λ ≥ 1, one can again apply Proposition 14 for
U = λ and get with probability at least 1 − e−ν ,
s
ν + log(d)
kPt kσ,∞ ≤ cλ . (94)
t(m1 ∧ m2 )

Hence, by a union bound argument we find that, with probability at least 1 − 2e^{−ν},

    ‖εt‖σ,∞ ≤ cλ (2κ + σ̄) √( (ν + log d) / (t(m1 ∧ m2)) ).     (95)

Taking ν = log(1 + 2t²/∆) and applying a union bound argument yields the result.

I Additional results: Online LASSO


Consider the setting where we are sequentially given i.i.d. observations (Yt, At) such that Yt ∈ R^m is the response, At ∈ R^{m×n} is the random design and

    Yt = At θ̄ + wt,     (96)

where the noise vectors wt are i.i.d., [wt]i is independent of [wt]j for i ≠ j, and [wt]i is zero-mean and sub-Gaussian with parameter σw. We suppose that the unknown parameter θ̄ is sparse. Attempting to learn θ̄, a natural choice for the loss function at round t is the square loss, i.e.,

    ft(θ) = (1/2) ‖Yt − At θ‖₂²,     (97)

and the associated stochastic cost is f(θ) := (1/2) Eθ̄[‖Yt − At θ‖₂²]. As θ̄ is sparse, the constraint set is designed to be the ℓ1 ball, i.e., C = {θ ∈ R^n : ‖θ‖₁ ≤ r}, where r > 0 is a regularization constant. Note that C is a polytope.
The aggregated gradient can be expressed as

    ∇Ft(θt) = ( (1/t) Σ_{s=1}^{t} As⊤ As ) θt − (1/t) Σ_{s=1}^{t} As⊤ Ys.     (98)

Similar to the case of online matrix completion, the terms Σ_{s=1}^{t} As⊤As and Σ_{s=1}^{t} As⊤Ys can be computed 'on-the-fly' as running sums. Applying O-FW (Algorithm 1) or O-AW (Algorithm 2) with the above aggregated gradient yields an online LASSO algorithm with a constant (dimension-dependent) complexity per iteration. Notice that as C is an ℓ1 ball constraint, the linear optimization in Line 4 of Algorithm 1 or (3) in Algorithm 2 can be evaluated simply as at = −r · sign([∇Ft(θt)]i) · ei, where i = arg max_{j∈[n]} |[∇Ft(θt)]j|.
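To make the update concrete, the following is a minimal sketch (Python/NumPy) of O-FW specialized to this online LASSO setting; the function name, the stream interface and the constant K in the step size are illustrative assumptions on our part, not the authors' code:

    import numpy as np

    def o_fw_online_lasso(stream, n, r, K=2, T=1000):
        # stream yields observation pairs (Y_t, A_t); n is the dimension
        # of theta; r is the radius of the l1-ball constraint set C.
        theta = np.zeros(n)
        S_AA = np.zeros((n, n))   # running sum of A_s^T A_s
        S_AY = np.zeros(n)        # running sum of A_s^T Y_s
        for t, (Y, A) in enumerate(stream, start=1):
            S_AA += A.T @ A
            S_AY += A.T @ Y
            grad = (S_AA @ theta - S_AY) / t      # aggregated gradient (98)
            i = int(np.argmax(np.abs(grad)))      # linear optimization over
            a = np.zeros(n)                       # the l1 ball, as above
            a[i] = -r * np.sign(grad[i])
            gamma = K / (K + t - 1)               # non-adaptive step size
            theta = (1.0 - gamma) * theta + gamma * a
            if t >= T:
                break
        return theta

A synthetic stream in the spirit of subsection I.1 below can be generated by repeatedly emitting pairs (Aθ̄ + σw · noise, A) for a fixed Gaussian matrix A.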
Similar to the case of online MC, we derive the following O(√(log t/t)) bound for the gradient error:
Proposition 16. Assume that ‖At⊤At − E[A⊤A]‖max ≤ B1 and ‖At‖max ≤ B2 almost surely, with ‖·‖max being the matrix max norm. Define c := maxθ∈C ‖θ − θ̄‖₁. With probability at least 1 − ∆(1 + 1/n)(π²/6), the following holds for all θ ∈ C and all t ≥ 1:

    ‖∇Ft(θ) − ∇f(θ)‖∞ ≤ ( cB1 + √(mB2) σw ) √( 2(log(2n²t²) − log ∆)/t ),     (99)

where ‖·‖∞ is the infinity norm, i.e., the dual norm of ‖·‖₁.

We observe that H3 is satisfied with ηt asymptotically equivalent to 4 log(t) and α = 0.5. Furthermore, the stochastic cost f is L-smooth (i.e., its gradient is L-Lipschitz) if LI ⪰ E[A⊤A]; it is µ-strongly convex if E[A⊤A] ⪰ µI for some µ > 0; and H2 is satisfied as C is a polytope. The analysis from the previous section applies, i.e., O-FW/O-AW has a regret bound of O(log² T/T) and an anytime bound of O(log t/t).
Proof. Notice that the gradient vector is given by:

    ∇f(θ) = E[A⊤(Aθ − Y)] = E[A⊤A]θ − E[A⊤Y].     (100)

We can bound the gradient estimation error as:

    ‖∇Ft(θ) − ∇f(θ)‖∞ ≤ ‖ (1/t) Σ_{s=1}^{t} As⊤ ws ‖∞ + ‖ ( (1/t) Σ_{s=1}^{t} As⊤As − E[A⊤A] ) (θ − θ̄) ‖∞.     (101)

To bound the second term in (101), we define Zs := As⊤As − E[A⊤A]. Observe that

    ‖ (1/t) Σ_{s=1}^{t} Zs (θ − θ̄) ‖∞ = max_{i∈[n]} | (1/t) Σ_{s=1}^{t} ⟨zs,i, θ − θ̄⟩ |,     (102)

where zs,i denotes the ith row vector in Zs. Furthermore, by Hölder's inequality,

    | (1/t) Σ_{s=1}^{t} ⟨zs,i, θ − θ̄⟩ | ≤ ‖θ − θ̄‖₁ ‖ (1/t) Σ_{s=1}^{t} zs,i ‖∞.     (103)

Since zs,i is a zero-mean, independent random vector with elements bounded in [−B1, B1], applying the union bound and Hoeffding's inequality gives:

    P( ‖ (1/t) Σ_{s=1}^{t} zs,i ‖∞ ≥ x for some i ) ≤ 2n² e^{−x²t/(2B1²)}.     (104)

Setting x = B1 √(2(log(2n²t²) − log ∆)/t) makes the right hand side equal to ∆/t². Hence, with probability at least 1 − ∆/t², we have

    ‖ (1/t) Σ_{s=1}^{t} Zs (θ − θ̄) ‖∞ ≤ cB1 √( 2(log(2n²t²) − log ∆)/t ).     (105)

To bound the first term in (101), note that the ith element of the vector As⊤ws is zero-mean. Furthermore, it can be verified that

    E[ exp( λ Σ_{j=1}^{m} As,i,j ws,j ) ] ≤ e^{λ² m σw² B2 / 2},     (106)

for all λ ∈ R, where As,i,j is the (i, j)th element of As and ws,j is the jth element of ws. In other words, the ith element of As⊤ws is sub-Gaussian with parameter m · σw² B2. It then follows from Hoeffding's inequality that

    P( ‖ (1/t) Σ_{s=1}^{t} As⊤ ws ‖∞ ≥ x ) ≤ 2n e^{−x²t/(2mB2σw²)}.     (107)

Setting x = σw √( 2mB2(log(2n²t²) − log ∆)/t ) makes the right hand side equal to ∆/(nt²). Combining (104), (107) and using a union bound argument (over all t ≥ 1) yields the desired result.

Figure 3: Online LASSO with synthetic data. Convergence of the primal optimality for online LASSO with (Left) r = 1.1‖θ̄‖₁ > ‖θ⋆‖₁; (Right) r = 0.15‖θ̄‖₁ = ‖θ⋆‖₁. [Plots omitted; both panels show primal optimality against iteration number for O-FW, O-AW and s-PG, with reference rates O(1/t^{1.03}) (left) and O(1/t^{0.86}) (right).]

Figure 4: Online LASSO with single-pixel imaging data R64.mat. (Left) Convergence of the objective value [plot omitted; curves for O-FW and O-AW with reference rate O(1/t^{1.06})]. (Middle) Reconstructed image after 500 iterations of O-FW; (Right) O-AW. [Images omitted.]

I.1 Numerical Result


We present numerical results on both synthetic data and realistic data.
Synthetic Data. We set At = A fixed for all t, with dimension 80 × 300; the parameter θ̄ ∈ R³⁰⁰ is a vector with 10% sparsity and independent N(0, 1) elements. We also set σw = 10. The matrix A is generated as a random Gaussian matrix with independent N(0, 1) elements. For benchmarking purposes, we compare the O-FW/O-AW performance with a stochastic projected gradient (sPG) method [RVV14] with a fixed step size 1/L.

Figure 3 plots the primal optimality ht := f(θt) − f(θ⋆) against the round number t. The left figure corresponds to the scenario under H1, as θ⋆ belongs to the interior of C. The simulation result corroborates our analysis, which indicates a fast convergence rate of O(1/t). In the right figure, we observe that although H1 is not satisfied, the O-FW algorithm still maintains a convergence rate of ∼ O(1/t), and O-AW slightly outperforms O-FW. Examining the necessity of H1 for achieving a fast convergence rate for O-FW is left for future investigation. Lastly, the primal convergence rate of sPG is similar to O-FW. However, the per-iteration complexity of sPG is O(n log n), while it is O(n) for O-FW.
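To illustrate this complexity gap, here is a sketch (our own illustration, not the authors' code) of the two constraint-handling primitives: the O(n) FW linear step over the ℓ1 ball versus the standard O(n log n) sort-based Euclidean projection that a projected-gradient update requires:

    import numpy as np

    def fw_lmo_l1(grad, r):
        # O(n): vertex of the l1 ball minimizing <grad, a>.
        i = int(np.argmax(np.abs(grad)))
        a = np.zeros_like(grad)
        a[i] = -r * np.sign(grad[i])
        return a

    def project_l1(v, r):
        # O(n log n): sort-based Euclidean projection onto {x: ||x||_1 <= r}.
        if np.abs(v).sum() <= r:
            return v.copy()
        u = np.sort(np.abs(v))[::-1]          # magnitudes, descending
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, v.size + 1) > css - r)[0][-1]
        tau = (css[rho] - r) / (rho + 1.0)    # soft-threshold level
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

Both routines return a point of C; the FW step avoids the sort entirely, which is the source of the O(n) versus O(n log n) gap noted above.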
Realistic Data. We consider learning a sparse image θ from the dataset R64.mat available from [DDT+08]. The dataset consists of T = 4319 one-bit measurements of a 64 × 64 greyscale image of 'R'. The squared loss function is chosen such that ft(θ) = (yt − at⊤θ)², where at ∈ R^n is a binary measurement vector and n = 4096 is the dimension of the vectorized image. For the O-FW/O-AW algorithms, we (i) use batch processing by drawing a batch of B = 5 new observations per round and (ii) introduce an inner loop by repeating the O-FW/O-AW iterations, i.e., Lines 4-5 of Algorithm 1 or Lines 4-15 of Algorithm 2, 50 times within each round.

As the optimal solution θ⋆ is unavailable for this problem, Figure 4 compares the primal objective value FT(θt) against the iteration number, together with the reconstructed images after tf = 500 iterations of the tested algorithms. The figure shows that the objective values of these algorithms all converge at a rate of ∼ O(1/t).

References
[ADX10] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point
bandit feedback. In COLT, 2010.
[AZH16] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, 2016.
[BT09] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM J. Imaging Sci., 2(1):183–202, 2009.
[CL11] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. on Intelligent
Sys. and Tech., 2:27:1–27:27, 2011.
[DDT+ 08] Marco Duarte, Mark Davenport, Dharmpal Takhar, Jason Laska, Ting Sun, Kevin Kelly, and Richard
Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91,
Mar 2008.
[EBG11] S. Ertekin, L. Bottou, and C. Lee Giles. Nonconvex online support vector machines. IEEE Trans. on
Pattern Analysis and Machine Intelligence, 33(2), Feb 2011.
[EV76] Yu. M. Ermol’ev and P. I. Verchenko. A linearization method in limiting extremal problems. Cybernetics,
12(2):240–245, 1976.
[FG13] Robert M. Freund and Paul Grigas. New analysis and results for the Frank-Wolfe method. CoRR,
abs/1307.0873v2, 2013.
[FW56] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logis. Quart., 1956.
[Gau05] Jean-Louis Verger-Gaugry. Covering a ball with smaller equal balls in Rⁿ. Discrete and Computational Geometry, 33:143–155, 2005.
[GH15a] D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. ICML, 2015.
[GH15b] D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online
and stochastic optimization. CoRR, abs/1301.4666, August 2015.
[GHJY15] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points — online stochastic gradient for tensor
decomposition. In COLT, 2015.
[GL15] S. Ghosh and H. Lam. Computing worst-case input models in stochastic simulation. CoRR, abs/1507.05609,
July 2015.
[HJ94] R. A. Horn and C. R. Johnson. Topics in matrix analysis. Cambridge University Press, Cambridge, 1994.
Corrected reprint of the 1991 original.
[HK12] E. Hazan and S. Kale. Projection-free online learning. ICML, 2012.
[HK15] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM TiiS, Jan 2015.
[HL16] E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In ICML, 2016.
[HO14] C.-J. Hsieh and P. A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, 2014.
[Jag13] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. ICML, 2013.
[JLMZ16] Bo Jiang, Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. Structured nonconvex and nonsmooth optimization:
Algorithms and iteration complexity analysis. CoRR, May 2016.
[JN12a] A. B. Juditsky and A. S. Nemirovski. First-Order Methods for Nonsmooth Convex Large-Scale Optimization,
I: General Purpose Methods. 2012.
[JN12b] A. B. Juditsky and A. S. Nemirovski. First-Order Methods for Nonsmooth Convex Large-Scale Optimization,
II: Utilizing Problem’s Structure. 2012.
[Klo14] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 02 2014.
[Kol13] V. Koltchinskii. A remark on low rank matrix recovery and noncommutative Bernstein type inequalities, volume 9 of Collections, pages 213–226. Institute of Mathematical Statistics, 2013.
[LJ16] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. CoRR, July 2016.
[LJJ13] S. Lacoste-Julien and M. Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms.
NIPS, 2013.

[LJJ15] S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In
NIPS. 2015.
[LZ14] G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. Tech. Report, 2014.
[NJLS09] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to
stochastic programming. SIAM J. Optim., 2009.
[RR11] M. Raginsky and A. Rakhlin. Information-based complexity, feedback and dynamics in convex programming.
IEEE Trans. Inf. Theory, 57(10):7036–7056, October 2011.
[RVV14] L. Rosasco, S. Villa, and Bang Cong Vu. Convergence of Stochastic Proximal Gradient Algorithm. CoRR,
abs/1403.5074v3, 2014.
[SSSS11] S. Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM J. Comput., 40(6):1623–1646, 2011.
[SSSSS09] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. COLT, 2009.
[Wol70] P. Wolfe. Convergence theory in nonlinear programming. Integer and Nonlinear Program., 1970.
[YZS14] Yaoliang Yu, Xinhua Zhang, and Dale Schuurmans. Generalized conditional gradient for sparse estimation.
CoRR, Oct 2014.
