Gradient Methods for Online DR-Submodular Maximization with Stochastic Long-Term Constraints
Abstract
1 Introduction
1. In the semi-bandit feedback setting, we propose the first stochastic gradient ascent based algorithm
for stochastic online DR-submodular maximization with stochastic long-term constraints. Our
proposed algorithm achieves O(√T) 1/2-regret and O(T^{3/4}) constraint violation with high
probability. In contrast, all previous works [10, 29, 31] consider first-order full-information
feedback and require unbiased gradient estimates at √T locations (not just at the action x_t) in every
round; their per-round query complexity is therefore √T, while ours is just 1.
2. In the first-order full-information setting, where unbiased gradient estimates at any point can be
observed, we propose the first stochastic gradient ascent based algorithm for stochastic online DR-
submodular maximization with stochastic long-term constraints. We utilize the recently developed
technique in [42] called the non-oblivious function. Our proposed algorithm achieves O(√T)
(1 − 1/e)-regret and O(T^{3/4}) constraint violation with high probability. Again, compared to
previous works [10, 29, 31], our query complexity is significantly lower.
Regarding the approximation ratios: we note that in the offline setting, 1 − 1/e is known to be the optimal
approximation ratio for optimizing monotone DR-submodular functions over a general convex set
when queries may be made anywhere in the convex hull of K ∪ {0} (K is the constraint set). However,
when the oracle calls are restricted to K, an approximation ratio of 1/2 is the best known to be
achievable [28]. Thus, full-information feedback can achieve (1 − 1/e)-regret while the semi-bandit
feedback achieves 1/2-regret.
Table 1: We include related works from online DR-submodular optimization with constant or
stochastic long-term constraint functions. (Works handling adversarial long-term constraints require
a different definition of regret.) All methods require a gradient oracle for feedback, and 'Noise' lists
whether the gradient is exact or there is stochastic noise. '# Grad.' is the number of gradient evaluations
required per round. 'Con. Viol.' is the bound on the constraint violation. † [31] considered the constraint
set being convex, while all other works consider linear constraints. In the '# Grad.' column, 2√T means
this work needs √T gradients on both f and g. ‡ While all actions will be feasible, some gradient
queries will be in the convex hull of K ∪ {0}.
2 Related Works
The primary related works are summarized in Table 1. We briefly discuss notable contributions here;
for additional related works, see Appendix F.
Online DR-submodular Maximization with Long-Term Constraints We do not compare results
with adversarial constraints due to the additional assumptions needed in this setting. See Appendix F for
a detailed discussion. In the context of stochastic constraints, Raut et al. [29] conducted the initial
study of the problem. They successfully attained O(√T) regret and constraint violation with high
probability, as well as O(T^{3/4}) regret and O(√T) constraint violation in expectation. Building upon
this work, Sadeghi et al. [31] further improved the results to achieve O(√T) regret and constraint
violation, both in expectation and with high probability. Additionally, Feng et al. [10] extended these
findings to incorporate weakly DR-submodular utility, achieving analogous results.
Online Convex Optimization with Long-Term Constraints Several results for OCO with deterministic
long-term constraints can be found in [14, 20, 37, 38, 40]. Existing literature has established that
a regret of O(√T) and a cumulative constraint violation of O(T^{1/4}) can be achieved without the
Slater condition. Conversely, assuming the Slater condition allows for achieving a regret of O(√T)
and a cumulative constraint violation of O(1). In cases where the considered constraint is assumed to
be stochastic, Yu et al. [39] achieved an O(√T) bound on both regret and constraint violation under
the Slater condition. Furthermore, Wei et al. [35] achieved the same regret and constraint violation
bounds under a strictly weaker assumption than the Slater condition.
3 Preliminaries
3.1 Notations
Vectors are shown by lowercase bold letters, such as x ∈ Rd . We denote by ∥ · ∥ the ℓ2 (Euclidean)
norm. We use [T ] to denote the set {1, 2, . . . , T }. The inner product of two vectors x, y ∈ Rd
is denoted by either ⟨x, y⟩ or x⊤ y. For u ∈ R, we define [u]+ := max{u, 0}. For two vectors
x, y ∈ Rd , x ⪯ y implies that xi ≤ yi , ∀i ∈ [d]. For a convex set X , we denote the projection of y
onto set X as ΠX (y) = arg minx∈X ∥x − y∥.
Here we list some function properties that will appear in our assumptions.
Monotonicity A function f is monotone if f (x) ≤ f (y) for all x ⪯ y.
Lipschitz continuous A function f is Lipschitz continuous with parameter β if for any x, y ∈ X,
we have |f(x) − f(y)| ≤ β∥x − y∥.
4 Problem Statement
Consider the following offline optimization problem:
    max_{x∈K} f(x)   subject to   g(x) ≤ 0,    (1)
where g(x) = ⟨p, x⟩ − b for some non-negative constant b. We study an analogous online setup
as follows: At each round t ∈ [T ], the algorithm chooses an action xt ∈ K, where K ⊂ Rd+ is a
fixed, known set. We consider both the utility and the constraints being stochastic, where we assume
at each time step, the utility function ft is sampled i.i.d. from a distribution Df with mean f , i.e.,
Eft ∼Df [ft (·)] = f (·), while the cost vector pt is i.i.d. sampled from another distribution Dp . After
an action is selected by the learner, a random reward ft (xt ) is obtained while using ⟨pt , xt ⟩ of its
fixed total allotted budget BT , and pt is observed. In the semi-bandit setting, an unbiased gradient
estimator for that action, ∇̃f_t(x_t), is also revealed. In the first-order full-information setting, an
unbiased gradient estimator at any point can be observed. In this paper, we consider both settings,
while all other works in the literature on DR-submodular maximization with long-term constraints,
such as [10, 29, 31], consider full-information feedback.
To make sure the long-term constraint is not vacuous, we consider B_T = bT for a constant b such
that min_{x∈K} ⟨p, x⟩ ≤ b < max_{x∈K} ⟨p, x⟩. In this case, there always exists a solution x that satisfies the
constraint (having zero constraint violation), and it is not the case that every sequence of actions
(in particular, the most expensive w.r.t. p) is feasible.
We make the following assumptions to proceed with our analysis:
Assumption 1. The constraint set K is convex and compact, with diameter d = sup_{x,y∈K} ∥x − y∥
and radius r = sup_{x∈K} ∥x∥. Since X is compact, we denote its diameter as d̄ = sup_{x,y∈X} ∥x − y∥
and radius as r̄ = sup_{x∈X} ∥x∥, respectively.
Assumption 2. The expected utility function f (·) is monotone DR-submodular and βf -Lipschitz.
Assumption 3. The distribution Dp for the cost vectors has bounded support βp B ∩ Rd+ with mean
p ⪰ 0, where B is the unit ball of Euclidean norm.
Assumption 4. The gradient oracle is unbiased, E[∇f(x) − ∇̃f_t(x) | x] = 0, and has bounded
variance, E[∥∇f(x) − ∇̃f_t(x)∥² | x] ≤ σ². In the semi-bandit setting, we assume G =
max_t sup_x ∥∇̃f_t(x)∥ is finite. In the first-order full-information setting, denoting the unbiased
estimator of the gradient of the non-oblivious function obtained at round t by ∇̃F_t(x), we assume
G_F = max_t sup_x ∥∇̃F_t(x)∥ is finite.
Unlike Frank-Wolfe type algorithms in other papers [10, 29, 31], we do not assume bounded smoothness
of the gradients, i.e., ∥∇f(x) − ∇f(y)∥ ≤ β∥x − y∥. Moreover, we do not assume f(0) = 0.
Our overall goal is to maximize the total obtained reward while satisfying the budget constraint
asymptotically (i.e., Σ_{t=1}^T ⟨p, x_t⟩ − B_T being sub-linear in T).
Note that our proposed algorithm can handle multiple linear constraints as well, and similar regret
and constraint violation bounds can be derived. In the case where there are m constraints g_i(·), i ∈ [m],
we can define g(x) := max_{i∈[m]} g_i(x), and it can be shown that g preserves the same properties
as those of the individual g_i's (sub-differentiability, bounded (sub-)gradients, and bounded values; see
Proposition 6 in [20] for proofs).
For simplicity of presentation, we denote β = max{β_f, β_p}. Since K is compact, from monotonicity
of f we have that F1 := max_{x∈K} |f(x)| is bounded. Since f is β_f-Lipschitz, we have that
F2 := max_{x,y∈K} |f(x) − f(y)| ≤ β_f d is bounded. Since K is compact and D_p has bounded
support, C := max_{p′∼D_p} max_{x∈K} |⟨p′, x⟩ − B_T/T| is bounded.
To measure the effectiveness of our proposed algorithm, we use the notions of regret and total
constraint violation to quantify the overall utility and the total resource consumption, respectively.
Regret is typically defined as the difference between the total reward accumulated by the algorithm
and the best fixed action in hindsight. Note that even in the offline setting, maximizing a monotone
DR-submodular function subject to a convex constraint can only be done approximately in polynomial
time unless RP = NP [5]. Thus, we instead use the notion of α-regret of an algorithm.
Definition 1. The α-regret of an online algorithm with outputs {x_t}_{t=1}^T is defined as

    R_T := α max_{x∈K*} Σ_{t=1}^T f_t(x) − Σ_{t=1}^T f_t(x_t),    (2)

where K* is the restricted search space of solutions that satisfy the long-term constraint for T steps (i.e.,
can be played T times), K* = {x ∈ K : Σ_{t=1}^T g(x) ≤ 0}, which is equivalent to satisfying the
per-round constraint: K* = {x ∈ K : g(x) ≤ 0}.
Since we are mainly interested in stochastic utility functions, i.e., f_t ∼ D_f, we aim to minimize the
expected α-regret:

    E[R_T] = αT max_{x∈K*} f(x) − Σ_{t=1}^T f(x_t).
Denote x* = arg max_{x∈K*} f(x). Note that since p_t is drawn i.i.d. from the distribution D_p with
mean p for all t ∈ [T], the best benchmark action is with respect to the "true" underlying p of the constraint
function as opposed to p_t. It is possible that the best fixed action has a constraint violation with some
noisy p_t's.
We next define the total constraint violation.
Definition 2. The total constraint violation of an online algorithm with outputs {x_t}_{t=1}^T is defined as

    C_T := Σ_{t=1}^T g(x_t) = Σ_{t=1}^T ⟨p, x_t⟩ − B_T.

Again, in the stochastic constraint setting, the total constraint violation is defined with respect to the
mean p.
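To make the two performance measures concrete, the short sketch below (our illustration only; mean_f, mean_p, xs, x_star, and B_T are hypothetical placeholders, not notation from the paper) evaluates the expected α-regret and the total constraint violation of a played trajectory.

```python
import numpy as np

def alpha_regret_and_violation(mean_f, mean_p, xs, x_star, B_T, alpha=0.5):
    """Evaluate the expected alpha-regret and the total constraint violation C_T.

    mean_f : callable, the mean utility f(.)
    mean_p : (d,) array, the mean cost vector p
    xs     : (T, d) array, the actions x_1, ..., x_T played by the algorithm
    x_star : (d,) array, the best feasible fixed action in K*
    B_T    : float, the total budget (B_T = b * T)
    """
    T = xs.shape[0]
    regret = alpha * T * mean_f(x_star) - sum(mean_f(x) for x in xs)
    spend = xs @ mean_p              # <p, x_t> for each round t
    C_T = spend.sum() - B_T          # total consumption minus total budget
    return regret, C_T
```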
In this work, we consider the following regularized Lagrangian function L(x, λ) given by

    L(x, λ) := f(x) − λ g(x) + (δη/2) λ².    (3)

It is important to observe that the expression in (3) deviates from the conventional Lagrangian due
to the inclusion of the term (δη/2)λ², where both δ and η are parameters that will later be chosen to
optimize the theoretical guarantees. The main purpose of this modification is to control the value of λ
and prevent it from growing too large. Although we could achieve the same goal by restricting λ to a
bounded domain, using the quadratic regularizer is more convenient for our analysis.
One issue is that p (which appears in g(x)) is unknown to the online algorithm. Therefore, we alternatively
use an empirical estimate p̂_t = (1/t) Σ_{s=1}^t p_s instead of p in the Lagrangian function. Moreover, in
order to achieve the high-probability bound, we adjust our Lagrangian function in (3) as follows:

    L_t(x, λ) = f_t(x) − λ g̃_t(x) + (δη/2) λ²,    (4)

where g̃_t(x) = ⟨p̂_t, x⟩ − B_T/T − γ_t and γ_t = √( 2C² log(2T/ε) / t ). For the purpose of analysis, we further
define ĝ_t(x) := ⟨p̂_t, x⟩ − B_T/T.
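As an illustration of these quantities (a minimal sketch with our own function names, assuming NumPy), the running estimate p̂_t, the confidence width γ_t, and the adjusted constraint g̃_t can be maintained online as follows.

```python
import numpy as np

def update_p_hat(p_hat_prev, p_t, t):
    """Running mean of the observed cost vectors: p_hat_t = (1/t) * sum_{s<=t} p_s."""
    return p_hat_prev + (p_t - p_hat_prev) / t

def confidence_width(t, C, T, eps):
    """gamma_t = sqrt(2 * C^2 * log(2T/eps) / t)."""
    return np.sqrt(2.0 * C**2 * np.log(2.0 * T / eps) / t)

def g_tilde(x, p_hat, B_T, T, gamma_t):
    """g~_t(x) = <p_hat_t, x> - B_T/T - gamma_t (g^_t is the same expression without gamma_t)."""
    return p_hat @ x - B_T / T - gamma_t
```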
For the purpose of analysis, we do not directly use Equation (4) in our primal update. Let L̂_t be
defined by its gradient, ∇̃_x L̂_t(x_t, λ_t) = ∇̃f_t(x_t) − 2λ_t ∇g̃_t(x_t). The primal updates are formulated
as follows:

    x_{t+1} = Π_K(x_t + η ∇̃_x L̂_t(x_t, λ_t)).

Note that, compared to Equation (4), the Lagrangian function used for updating has a coefficient of 2
in front of the second term.
Our proposed algorithm is shown in Algorithm 1. The algorithm proceeds as follows: it takes a
convex constraint set K and a time horizon T as inputs. Initially, the algorithm selects an initial point
x_1 ∈ K and sets λ_1 = 0. At each time step t ∈ [T], the algorithm takes an action x_t, acquires a
reward f_t(x_t), and observes the cost vector p_t as well as an unbiased gradient estimate ∇̃f_t(x_t).
Subsequently, unbiased gradient estimates of the updating Lagrangian function with respect to x and
to λ are computed using the empirical estimate of p. Using these calculated gradients, updates to x
and λ are made using (7) and (8), respectively.
Algorithm 1 OLSGA (Semi-bandit Feedback)
1: Input: Convex set K, time horizon T
2: Initialize x_1 ∈ K, λ_1 = 0.
3: for t ∈ [T] do
4:   Play x_t; obtain f_t(x_t), ∇̃f_t(x_t), and p_t
5:   Compute p̂_t = (1/t) Σ_{s=1}^t p_s
6:   Compute
         ∇̃_x L̂_t(x_t, λ_t) = ∇̃f_t(x_t) − 2λ_t ∇g̃_t(x_t)    (5)
         ∇_λ L_t(x_t, λ_t) = −g̃_t(x_t) + δηλ_t    (6)
7:   Update x_t and λ_t:
         x_{t+1} = Π_K(x_t + η ∇̃_x L̂_t(x_t, λ_t))    (7)
         λ_{t+1} = Π_{[0,+∞)}(λ_t − η ∇_λ L_t(x_t, λ_t))    (8)
8: end for
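A minimal Python sketch of this loop follows (assuming NumPy, a projection oracle project_K onto K, and an environment object env returning the semi-bandit feedback; these names and the interface are our own placeholders, not part of the paper).

```python
import numpy as np

def olsga_semi_bandit(project_K, env, x1, d_dim, T, eta, delta, C, eps, B_T):
    """One-pass primal-dual stochastic gradient ascent (sketch of Algorithm 1)."""
    x, lam = x1.copy(), 0.0
    p_hat = np.zeros(d_dim)
    for t in range(1, T + 1):
        f_val, grad_f, p_t = env.play(x)            # semi-bandit feedback at x_t
        p_hat += (p_t - p_hat) / t                  # p_hat_t = (1/t) * sum_s p_s
        gamma_t = np.sqrt(2 * C**2 * np.log(2 * T / eps) / t)
        g_tilde = p_hat @ x - B_T / T - gamma_t     # adjusted constraint value
        grad_g = p_hat                              # gradient of g~_t w.r.t. x
        grad_x = grad_f - 2 * lam * grad_g          # eq. (5)
        grad_lam = -g_tilde + delta * eta * lam     # eq. (6)
        x = project_K(x + eta * grad_x)             # eq. (7): projected ascent step
        lam = max(0.0, lam - eta * grad_lam)        # eq. (8): projection onto [0, inf)
    return x
```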
Remark 1. A notable difference between our algorithm and all prior works addressing online DR-
submodular maximization with long-term constraints (e.g., [29, 31]) is that our algorithm can handle
search spaces that do not necessarily include 0. This distinction bears importance, particularly when
considering scenarios where we can only query values within the constraint set. In such cases, 1/2
has been conjectured to be the optimal approximation ratio ([28], Section B in the Appendix). We refer to
Appendix H for motivating examples where searching over K ∪ {0} is not applicable.
Now, we establish the regret and constraint violation achievable by our proposed Algorithm 1. Before
delving into the main theorem, we first present three lemmas, which are adapted from [29] and are
essential for achieving high-probability bounds. Given the slight difference in the definition of p̂, we
provide the proofs in Appendices A to C, respectively. First, Lemma 3 demonstrates that with high
probability, the empirical estimate p̂_t is close to its mean p.
Lemma 3. The following holds with probability at least 1 − ε:

    Σ_{t=1}^T ∥p̂_t − p∥ ≤ Qβ √( T log(2nT/ε) ),

where Q > 0 is some universal constant.
Next, Lemma 4 establishes that with high probability, the ĝ_t(·) computed using p̂_t and the g(·) computed
using p are close.

Lemma 4. Let x ∈ K be fixed. For a fixed t ∈ [T] and γ_t := √( (2/t) C² log(2T/ε) ), |ĝ_t(x) − g(x)| ≤ γ_t
holds with probability at least 1 − ε/T.
Finally, Lemma 5 provides an upper bound for the total constraint violation.
Lemma 5. Let {γ_t}_{t=1}^T be defined as in Lemma 4. Then the following holds:

    C_T ≤ Σ_{t=1}^T g̃_t(x_t) + r Σ_{t=1}^T ∥p̂_t − p∥ + Σ_{t=1}^T γ_t.    (9)
Armed with these results, we can now establish regret and constraint violation bounds for Algorithm 1.
Theorem 1. Let Assumptions 1–4 be satisfied. Let U = max{G, C}. Choose η = d/(U√T)
and δ = 8β². Let x_1, . . . , x_T be the sequence of solutions obtained by Algorithm 1. When T is
sufficiently large, i.e., T ≥ 64d²β²/U², we have the following 1/2-regret and constraint violation bounds
with probability at least 1 − ε:

    E[R_T] = O(√T)  and  C_T = O(T^{3/4}).
The complete theorem statement and the complete proof are in Appendix D.
Partial Proof: From the update of x_t, we have that for any x ∈ K,

    ∥x_{t+1} − x∥² = ∥Π_K(x_t + η ∇̃_x L̂_t(x_t, λ_t)) − x∥²
                  ≤ ∥x_t + η ∇̃_x L̂_t(x_t, λ_t) − x∥²
                  ≤ ∥x_t − x∥² + η²∥∇̃_x L̂_t(x_t, λ_t)∥² − 2η(x − x_t)^⊤ ∇̃_x L̂_t(x_t, λ_t).    (10)

Rearranging, and using Assumption 4 and Assumption 3, we have

    (x − x_t)^⊤ ∇̃_x L̂_t(x_t, λ_t) ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + ηG² + 4ηβ²λ_t².    (11)

Applying similar steps to the λ updates, we establish

    (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t) ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − C²η − 2ηγ_t² − 2δ²η³λ_t².    (12)

From monotonicity and DR-submodularity of E[f_t(x)], we have

    E[L_t(x, λ_t) − 2L_t(x_t, λ_t)]
    = E[ E[L_t(x, λ_t) − 2L_t(x_t, λ_t) | x_t] ]
    ≤ E[(x − x_t)^⊤ ∇_x E[L̂_t(x_t, λ_t) | x_t]] + λ_t g̃_t(x) − (δη/2)λ_t²    (Lemma 1)
    ≤ 1/(2η) E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + G²η + 4ηβ²λ_t² + λ_t g̃_t(x) − (δη/2)λ_t²,    (13)

where (13) follows from (11). Similarly, from convexity of the function L_t(x, λ) w.r.t. λ, we have

    L_t(x_t, λ) − L_t(x_t, λ_t) ≥ (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − C²η − 2ηγ_t² − 2δ²η³λ_t²,    (14)

where (14) follows from (12). Subtracting two times (14) from (13), and summing t over 1 through T, we get

    Σ_{t=1}^T E[L_t(x, λ_t) − 2L_t(x_t, λ)]
    ≤ d²/(2η) + λ²/η + G²ηT + 4ηβ² Σ_{t=1}^T λ_t² + 2C²ηT + 4η Σ_{t=1}^T γ_t² + 4δ²η³ Σ_{t=1}^T λ_t² + Σ_{t=1}^T λ_t g̃_t(x) − (δη/2) Σ_{t=1}^T λ_t².    (15)

Expanding the left hand side of (15) and rearranging, we deduce

    Σ_{t=1}^T [f(x) − 2f(x_t)] + 2λ Σ_{t=1}^T g̃_t(x_t) − (δηT + 1/η) λ²
    ≤ 2 Σ_{t=1}^T λ_t g̃_t(x) + η(4β² + 4δ²η² − δ) Σ_{t=1}^T λ_t² + d²/(2η) + G²ηT + 2C²ηT + 4η Σ_{t=1}^T γ_t².    (16)

To ensure that the equation 4β² + 4δ²η² − δ = 0 has real roots, we require T ≥ 64d²β²/U². Setting
δ = 8β² ensures that 4β² + 4δ²η² − δ ≤ 0. Set x = x*; from Lemma 4, with probability at least
1 − ε/T, g̃_t(x*) = ĝ_t(x*) − γ_t ≤ g(x*) holds. Since x* satisfies the long-term constraint, we have
g(x*) ≤ 0. Thus, we can drop the first two terms on the RHS of (16), and by the union bound we get,
with probability at least 1 − ε,

    Σ_{t=1}^T [f(x*) − 2f(x_t)] + 2λ Σ_{t=1}^T g̃_t(x_t) − (δηT + 1/η) λ² ≤ d²/(2η) + G²ηT + 2C²ηT + 4η Σ_{t=1}^T γ_t².    (17)

Maximizing the LHS of (17) with respect to λ over the range [0, +∞), we get a solution of
λ = [Σ_{t=1}^T g̃_t(x_t)]_+ / (δηT + 1/η):

    Σ_{t=1}^T [f(x*) − 2f(x_t)] + [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT + 1/η) ≤ d²/(2η) + G²ηT + 2C²ηT + 4η Σ_{t=1}^T γ_t².    (18)

Plugging in U = max{G, C} and η = d/(U√T), we have with probability at least 1 − ε,

    Σ_{t=1}^T [f(x*) − 2f(x_t)] + [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT + 1/η) ≤ (7dU/2)√T + 8dU log(2T/ε),    (19)

where we used the fact that Σ_{t=1}^T 1/√t ≤ 2√T. This gives us our result on objective regret:

    Σ_{t=1}^T [ (1/2) f(x*) − f(x_t) ] = O(√T).    (20)

The detailed proof, including the constraint violation steps, is provided in Appendix D.
where the non-oblivious function F is defined by its gradient: ∇F(x) = ∫_0^1 e^{z−1} ∇f(z · x) dz.
As we discussed in Section 3, the non-oblivious function F plays an important role in obtaining
the optimal approximation ratio 1 − 1/e. However, calculating the gradient of the non-oblivious
function F(x) can be challenging, especially when only unbiased estimates of the gradients are
available. To overcome this, [42] presents a computational approach for obtaining an unbiased
estimate of the gradient of F(x) through sampling (Lines 6 and 7). The following lemma indicates
that (1 − 1/e) ∇̃f_t(z ∗ x) is an unbiased estimator of ∇F(x) with bounded variance.

Lemma 6. If z is sampled from the random variable Z as in line 6 of Algorithm 2, E[∇̃f_t(x) | x] = ∇f(x), and
E[∥∇̃f_t(x) − ∇f(x)∥² | x] ≤ σ², then we have

(i) E[(1 − 1/e) ∇̃f_t(z ∗ x) | x] = ∇F(x);

(ii) E[∥(1 − 1/e) ∇̃f_t(z ∗ x) − ∇F(x)∥² | x] ≤ σ_1², where σ_1² = 2(1 − 1/e)²σ² + 2β²r̄²(1 − 1/e)²/3.
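As a sketch of how this estimator could be computed (our illustration; grad_f_stoch is a hypothetical handle to the stochastic gradient oracle ∇̃f_t, and the sampler uses a closed-form inverse of the CDF stated in line 6 of Algorithm 2):

```python
import numpy as np

def sample_z(rng):
    """Sample z with CDF P(Z <= z) = (e^{z-1} - e^{-1}) / (1 - e^{-1}) on [0, 1]."""
    v = rng.uniform()
    return 1.0 + np.log(v * (1.0 - np.exp(-1.0)) + np.exp(-1.0))

def non_oblivious_grad_estimate(grad_f_stoch, x, rng):
    """Return (1 - 1/e) * grad_f_stoch(z * x), an unbiased estimate of grad F(x) (Lemma 6)."""
    z = sample_z(rng)
    return (1.0 - np.exp(-1.0)) * grad_f_stoch(z * x)
```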
With the unbiased estimator of the gradient of the non-oblivious function, we show the following
regret and constraint violation guarantee for our Algorithm 2 in Appendix E:
Theorem 2. Let Assumptions 1–4 be satisfied. Let U = max{G_F, C}. Choose η = d/(U√T) and
δ = 4β². Let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 2. When T is sufficiently
large, i.e., T ≥ 16d²β²/U², we have the following (1 − 1/e)-regret and constraint violation bounds with
probability at least 1 − ε:

    E[R_T] = O(√T)  and  C_T = O(T^{3/4}).
7 Conclusions
In this paper, we address the problem of stochastic DR-submodular maximization with stochastic
long-term constraints over a general convex set. We introduce the first algorithm for this setting,
attaining O(√T) regret and O(T^{3/4}) constraint violation bounds. Notably, our algorithm operates in
both the semi-bandit feedback and first-order full-information settings, requiring only 1 gradient query
per round, while all previous works operate in the full-information setting with √T gradient queries
per round. Extension of the results here to upper-linearizable functions in [27] is an open direction.
8 Acknowledgement
This work was supported in part by the National Science Foundation under grants CCF-2149588 and
CCF-2149617. We acknowledge Yiyang (Roy) Lu for helpful feedback.
References
[1] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. Proceed-
ings of the fifteenth ACM conference on Economics and computation, 2014.
[2] F. R. Bach. Submodular functions: from discrete to continuous domains. Mathematical
Programming, 175:419 – 459, 2015.
[3] A. Badanidiyuru, R. D. Kleinberg, and A. Slivkins. Bandits with knapsacks. 2013 IEEE 54th
Annual Symposium on Foundations of Computer Science, pages 207–216, 2013.
[4] S. Balseiro and Y. Gur. Learning in repeated auctions with budgets: Regret minimization and
equilibrium. Management Science, 65:3952–3968, 09 2019. doi: 10.1287/mnsc.2018.3174.
[5] A. A. Bian, B. Mirzasoleiman, J. Buhmann, and A. Krause. Guaranteed Non-convex Optimiza-
tion: Submodular Maximization over Continuous Domains. In AISTATS, volume 54, pages
111–120. PMLR, 20–22 Apr 2017.
[6] Y. Bian, J. M. Buhmann, and A. Krause. Continuous submodular function maximization. ArXiv,
abs/2006.13474, 2020.
[7] L. Chen, C. Harshaw, H. Hassani, and A. Karbasi. Projection-free online optimization with
stochastic gradient: From convexity to submodularity. In J. Dy and A. Krause, editors, Proceed-
ings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of
Machine Learning Research, pages 814–823. PMLR, 10–15 Jul 2018.
[8] L. Chen, C. Harshaw, H. Hassani, and A. Karbasi. Projection-free online optimization with
stochastic gradient: From convexity to submodularity. In J. Dy and A. Krause, editors, Proceed-
ings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of
Machine Learning Research, pages 814–823. PMLR, 10–15 Jul 2018.
[9] L. Chen, H. Hassani, and A. Karbasi. Online continuous submodular maximization. In
A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Confer-
ence on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning
Research, pages 1896–1905. PMLR, 09–11 Apr 2018.
[10] J. Feng, R. Yang, Y. Zhang, and Z. Zhang. Online weakly dr-submodular optimization with
stochastic long-term constraints. In D.-Z. Du, D. Du, C. Wu, and D. Xu, editors, Theory and
Applications of Models of Computation, pages 32–42, Cham, 2022. Springer International
Publishing. ISBN 978-3-031-20350-3.
[11] C. Guestrin, A. Krause, and A. P. Singh. Near-optimal sensor placements in gaussian processes.
Proceedings of the 22nd international conference on Machine learning, 2005.
[12] H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization.
In Neural Information Processing Systems, 2017.
[13] E. Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization.
Now, Boston, 2016. ISBN 978-1-68083-170-2. doi: 10.1561/2400000013.
[14] R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization
with long-term constraints. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The
33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine
Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR.
[15] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration
inequalities for random vectors with subgaussian norm, 2019.
[16] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social
network. In Knowledge Discovery and Data Mining, 2003.
[17] A. Krause and D. Golovin. Submodular function maximization. In Tractability, 2014.
[18] N. Liakopoulos, A. Destounis, G. Paschos, T. Spyropoulos, and P. Mertikopoulos. Cautious
regret minimization: Online optimization with long-term budget constraints. In K. Chaudhuri
and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine
Learning, volume 97 of Proceedings of Machine Learning Research, pages 3944–3952. PMLR,
09–15 Jun 2019.
[19] T. Lin, J. Li, and W. Chen. Stochastic online greedy learning with semi-bandit feedbacks. In
Proceedings of the 29th International Conference on Neural Information Processing Systems,
pages 352–360, 2015.
[20] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: Online convex optimization
with long term constraints. Journal of Machine Learning Research, 13(81):2503–2528, 2012.
[21] S. Mannor, J. N. Tsitsiklis, and J. Y. Yu. Online learning with sample path constraints. Journal
of Machine Learning Research, 10:569–590, 2009.
[22] B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization:
Identifying representative elements in massive data. In C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing
Systems, volume 26. Curran Associates, Inc., 2013.
[23] B. Mirzasoleiman, A. Badanidiyuru, and A. Karbasi. Fast constrained submodular maximization:
Personalized data summarization. In International Conference on Machine Learning, 2016.
[24] R. Niazadeh, N. Golrezaei, J. R. Wang, F. Susan, and A. Badanidiyuru. Online learning via
offline greedy algorithms: Applications in market design and optimization. In Proceedings of
the 22nd ACM Conference on Economics and Computation, pages 737–738, 2021.
[25] G. Nie, M. Agarwal, A. K. Umrawal, V. Aggarwal, and C. J. Quinn. An explore-then-commit
algorithm for submodular maximization under full-bandit feedback. In Uncertainty in Artificial
Intelligence, pages 1541–1551. PMLR, 2022.
[26] G. Nie, Y. Y. Nadew, Y. Zhu, V. Aggarwal, and C. J. Quinn. A framework for adapting offline
algorithms to solve combinatorial multi-armed bandit problems with bandit feedback. In
A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings
of the 40th International Conference on Machine Learning, volume 202 of Proceedings of
Machine Learning Research, pages 26166–26198. PMLR, 23–29 Jul 2023.
[27] M. Pedramfar and V. Aggarwal. From linear to linearizable optimization: A novel framework
with applications to stationary and non-stationary dr-submodular optimization. In Thirty-eighth
Conference on Neural Information Processing Systems, 2024.
[28] M. Pedramfar, C. J. Quinn, and V. Aggarwal. A unified approach for maximizing continuous
DR-submodular functions. In Thirty-seventh Conference on Neural Information Processing
Systems, 2023.
[29] P. S. Raut, O. Sadeghi, and M. Fazel. Online dr-submodular maximization: Minimizing regret
and constraint violation. In AAAI Conference on Artificial Intelligence, 2021.
[30] O. Sadeghi and M. Fazel. Online continuous dr-submodular maximization with long-term
budget constraints. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third
International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of
Machine Learning Research, pages 4410–4419. PMLR, 26–28 Aug 2020.
[31] O. Sadeghi, P. Raut, and M. Fazel. A single recipe for online submodular maximization with
adversarial or stochastic constraints. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,
and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages
14712–14723. Curran Associates, Inc., 2020.
[32] T. Soma, N. Kakimura, K. Inaba, and K. ichi Kawarabayashi. Optimal budget allocation:
Theoretical guarantee and efficient algorithm. In International Conference on Machine Learning,
2014.
[33] M. J. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In
Neural Information Processing Systems, 2008.
[34] J. Vondrák. Submodularity in Combinatorial Optimization. Phd thesis, Charles University,
2007.
[35] X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints.
Proceedings of the ACM on Measurement and Analysis of Computing Systems, 4:1 – 36, 2019.
[36] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem.
Combinatorica, 2:385–393, 1982.
[37] X. Yi, X. Li, T. Yang, L. Xie, T. Chai, and K. Johansson. Regret and cumulative constraint
violation analysis for online convex optimization with long term constraints. In M. Meila and
T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
volume 139 of Proceedings of Machine Learning Research, pages 11998–12008. PMLR, 18–24
Jul 2021.
[38] H. Yu and M. J. Neely. A low complexity algorithm with O(√T) regret and O(1) constraint
violations for online convex optimization with long term constraints. Journal of Machine
Learning Research, 21(1):1–24, 2020.
[39] H. Yu, M. J. Neely, and X. Wei. Online convex optimization with stochastic constraints. In
Neural Information Processing Systems, 2017.
[40] J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[41] M. Zhang, L. Chen, H. Hassani, and A. Karbasi. Online continuous submodular maximization:
From full-information to bandit feedback. In Advances in Neural Information Processing
Systems, volume 32. Curran Associates, Inc., 2019.
[42] Q. Zhang, Z. Deng, Z. Chen, H. Hu, and Y. Yang. Stochastic continuous submodular maximiza-
tion: Boosting via non-oblivious function. In International Conference on Machine Learning,
2022.
[43] M. A. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In
International Conference on Machine Learning, 2003.
A Proof of Lemma 3
Proof. From Assumption 3, we have ∥p_t∥ ≤ β. We can apply Lemma 1 of [15] with sub-Gaussian
parameter ρ = cβ for some universal constant c > 0, and the result follows immediately.

Combining E[p_t − p] = 0 and Lemma 7, we can apply Corollary 7 of [15] to the random vectors
{p_s − p}_{s=1}^t and obtain, with probability at least 1 − ε/T,

    ∥Σ_{s=1}^t (p_s − p)∥ ≤ c′ √( Σ_{s=1}^t ρ² log(2dT/ε) ) = c′ρ √( t log(2dT/ε) ),    (22)

where c′ > 0 is some universal constant. Combining (21) and (22) and applying the union bound, we have
with probability at least 1 − ε,

    Σ_{t=1}^T ∥p̂_t − p∥ = Σ_{t=1}^T (1/t) ∥Σ_{s=1}^t (p_s − p)∥
    ≤ Σ_{t=1}^T c′ρ √( log(2dT/ε)/t )
    ≤ 2c′ρ √( T log(2dT/ε) ),    (23)

where the last inequality follows from Σ_{t=1}^T 1/√t ≤ 2√T. We get the desired result by taking
Q = 2cc′.
B Proof of Lemma 4
Proof. Recall from the definition that g̃_t(x) = ⟨p̂_t, x⟩ − B_T/T − γ_t and ĝ_t(x) := ⟨p̂_t, x⟩ − B_T/T.
Note that E[ĝ_t(x)] = E[⟨p̂_t, x⟩ − B_T/T] = ⟨p, x⟩ − B_T/T = g(x). Recall that we have defined
C := max_{p′∼D_p} max_{x∈K} |⟨p′, x⟩ − B_T/T|. Thus, ĝ_t(x) is bounded within the interval [−C, C].
Applying Hoeffding's inequality to ĝ_t(x), we get

    P{ |ĝ_t(x) − g(x)| > γ_t } ≤ 2 exp( −tγ_t² / (2C²) ).

Substituting the value of γ_t on the right-hand side, we get that P{ |ĝ_t(x) − g(x)| > γ_t } ≤ ε/T.
C Proof of Lemma 5
We restate the lemma as follows:

Lemma 10. Let {γ_t}_{t=1}^T be defined as in Lemma 4. Then the following holds:

    C_T ≤ Σ_{t=1}^T g̃_t(x_t) + r Σ_{t=1}^T ∥p̂_t − p∥ + Σ_{t=1}^T γ_t.    (24)

Proof. Bounding Σ_{t=1}^T g̃_t(x_t) from below, we obtain:

    Σ_{t=1}^T g̃_t(x_t) = Σ_{t=1}^T g(x_t) + Σ_{t=1}^T (ĝ_t(x_t) − g(x_t)) − Σ_{t=1}^T γ_t
    ≥ Σ_{t=1}^T g(x_t) − Σ_{t=1}^T |ĝ_t(x_t) − g(x_t)| − Σ_{t=1}^T γ_t
    = C_T − Σ_{t=1}^T |⟨p̂_t − p, x_t⟩| − Σ_{t=1}^T γ_t
    ≥ C_T − Σ_{t=1}^T ∥p̂_t − p∥ ∥x_t∥ − Σ_{t=1}^T γ_t
    ≥ C_T − r Σ_{t=1}^T ∥p̂_t − p∥ − Σ_{t=1}^T γ_t.
D Proof of Theorem 1
We restate our theorem as follows:

Theorem 3. Let Assumptions 1–4 be satisfied. Let U = max{G, C}. Choose η = d/(U√T)
and δ = 8β². Let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 1. When T is
sufficiently large, i.e., T ≥ 64d²β²/U², we have the following 1/2-regret and constraint violation bounds
with probability at least 1 − ε:

    E[R_T] ≤ Σ_{t=1}^T [ (1/2) f(x*) − f(x_t) ] ≤ (7dU/4)√T + 8dU log(2T/ε) = O(T^{1/2})

and

    C_T ≤ √( [ (7dU/4)√T + 4dU log(2T/ε) + (F1 + F2)T ] · (8β²d/U + U/d)√T )
          + rQσ √( T log(2nT/ε) ) + 2 √( 2TC² log(2T/ε) )    (25)
        = O(T^{3/4}).
Proof. From the update of x_t, we have that for any x ∈ K,

    ∥x_{t+1} − x∥² = ∥Π_K(x_t + η ∇̃_x L̂_t(x_t, λ_t)) − x∥²
                  ≤ ∥x_t − x∥² + η²∥∇̃_x L̂_t(x_t, λ_t)∥² − 2η(x − x_t)^⊤ ∇̃_x L̂_t(x_t, λ_t).    (26)

Rearranging,

    (x − x_t)^⊤ ∇̃_x L̂_t(x_t, λ_t)
    ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + (η/2)∥∇̃_x L̂_t(x_t, λ_t)∥²
    = 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + (η/2)∥∇̃f_t(x_t) − 2λ_t ∇g̃_t(x_t)∥²    (from (5))
    ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + η∥∇̃f_t(x_t)∥² + 4ηλ_t²∥∇g̃_t(x_t)∥²    (∥a + b∥² ≤ 2∥a∥² + 2∥b∥²)
    ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + ηG² + 4ηβ²λ_t²,    (27)

where (27) follows from Assumption 4 and Assumption 3. Taking expectation with respect to f_t, we have

    E[L_t(x, λ_t) − 2L_t(x_t, λ_t)] ≤ 1/(2η) E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + G²η + 4ηβ²λ_t² + λ_t g̃_t(x) − (δη/2)λ_t²,    (28)

where (28) follows from (27). From the update of λ_t (8), we have

    ∥λ_{t+1} − λ∥² = ∥Π_{[0,+∞)}(λ_t − η ∇_λ L_t(x_t, λ_t)) − λ∥²
                  ≤ ∥λ_t − λ∥² + η²∥∇_λ L_t(x_t, λ_t)∥² − 2η(λ_t − λ) ∇_λ L_t(x_t, λ_t).    (29)

Rearranging,

    (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − (η/2)∥∇_λ L_t(x_t, λ_t)∥²
    = −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − (η/2)∥−g̃_t(x_t) + δηλ_t∥²    (from (6))
    = −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − (η/2)∥−ĝ_t(x_t) + γ_t + δηλ_t∥²    (by def. of ĝ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − η∥ĝ_t(x_t)∥² − 2ηγ_t² − 2δ²η³λ_t²    (apply ∥a + b∥² ≤ 2∥a∥² + 2∥b∥² twice)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − C²η − 2ηγ_t² − 2δ²η³λ_t²,    (30)

where the last inequality follows from the definitions C := max_{p′∼D_p} max_{x∈K} |⟨p′, x⟩ − B_T/T| and
ĝ_t(x) := ⟨p̂_t, x⟩ − B_T/T. From convexity of the function L_t(x, λ) w.r.t. λ, we have

    L_t(x_t, λ) − L_t(x_t, λ_t) ≥ (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − C²η − 2ηγ_t² − 2δ²η³λ_t²,    (31)

where (31) follows from (30). Subtracting two times (31) from (28), we get

    E[L_t(x, λ_t) − 2L_t(x_t, λ)]
    ≤ 1/(2η) E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + 1/η (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²)
      + G²η + 4ηβ²λ_t² + 2C²η + 4ηγ_t² + 4δ²η³λ_t² + λ_t g̃_t(x) − (δη/2)λ_t².    (32)

Summing (32) for t ∈ [T], we have

    Σ_{t=1}^T E[L_t(x, λ_t) − 2L_t(x_t, λ)]
    ≤ 1/(2η) Σ_{t=1}^T E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + 1/η Σ_{t=1}^T (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²)
      + G²ηT + 4ηβ² Σ_{t=1}^T λ_t² + 2C²ηT + 4η Σ_{t=1}^T γ_t² + 4δ²η³ Σ_{t=1}^T λ_t² + Σ_{t=1}^T λ_t g̃_t(x) − (δη/2) Σ_{t=1}^T λ_t²
    ≤ 1/(2η) E[∥x_1 − x∥² − ∥x_{T+1} − x∥²] + 1/η (∥λ_1 − λ∥² − ∥λ_{T+1} − λ∥²)
      + G²ηT + 4ηβ² Σ_{t=1}^T λ_t² + 2C²ηT + 4η Σ_{t=1}^T γ_t² + 4δ²η³ Σ_{t=1}^T λ_t² + Σ_{t=1}^T λ_t g̃_t(x) − (δη/2) Σ_{t=1}^T λ_t²    (telescoping series)
    ≤ 1/(2η) E[∥x_1 − x_{T+1}∥²] + 1/η (∥λ_1 − λ∥² − ∥λ_{T+1} − λ∥²)
      + G²ηT + 4ηβ² Σ_{t=1}^T λ_t² + 2C²ηT + 4η Σ_{t=1}^T γ_t² + 4δ²η³ Σ_{t=1}^T λ_t² + Σ_{t=1}^T λ_t g̃_t(x) − (δη/2) Σ_{t=1}^T λ_t²    (triangle inequality)
    ≤ d²/(2η) + λ²/η + G²ηT + 4ηβ² Σ_{t=1}^T λ_t² + 2C²ηT + 4η Σ_{t=1}^T γ_t² + 4δ²η³ Σ_{t=1}^T λ_t² + Σ_{t=1}^T λ_t g̃_t(x) − (δη/2) Σ_{t=1}^T λ_t²,    (33)

where (33) uses Assumption 1 with the definition of the diameter d, λ_1 = 0, and −∥·∥² ≤ 0. Expanding the left
hand side of (33), we deduce

    Σ_{t=1}^T [f(x) − 2f(x_t)] + Σ_{t=1}^T [2λ g̃_t(x_t) − λ_t g̃_t(x)] + Σ_{t=1}^T [(δη/2)λ_t² − δηλ²]
    ≤ d²/(2η) + λ²/η + G²ηT + 4ηβ² Σ_{t=1}^T λ_t² + 2C²ηT + 4η Σ_{t=1}^T γ_t² + 4δ²η³ Σ_{t=1}^T λ_t² + Σ_{t=1}^T λ_t g̃_t(x) − (δη/2) Σ_{t=1}^T λ_t².    (34)

Rearranging, we have

    Σ_{t=1}^T [f(x) − 2f(x_t)] + 2λ Σ_{t=1}^T g̃_t(x_t) − (δηT + 1/η)λ²
    ≤ 2 Σ_{t=1}^T λ_t g̃_t(x) + η(4β² + 4δ²η² − δ) Σ_{t=1}^T λ_t² + d²/(2η) + G²ηT + 2C²ηT + 4η Σ_{t=1}^T γ_t².    (35)

Set x = x*. From Lemma 4, with probability at least 1 − ε/T, g̃_t(x*) = ĝ_t(x*) − γ_t ≤ g(x*) holds.
Since x* satisfies the long-term constraint, we have g(x*) ≤ 0. By the union bound, we get with
probability at least 1 − ε that, for the first term on the RHS of (35),

    2 Σ_{t=1}^T λ_t g̃_t(x*) ≤ 2 g(x*) Σ_{t=1}^T λ_t ≤ 0.

Now we choose δ such that 4β² + 4δ²η² − δ ≤ 0 (so that the second term on the RHS of (35) will
be negative and can be dropped with an upper bound). This is quadratic in δ. It is easy
to verify that the quadratic has real roots when η ≤ 1/(8β). As we choose η = d/(U√T), this
gives the condition that T needs to be sufficiently large, i.e., T ≥ 64d²β²/U². We can simply choose
δ = 8β².

Applying both of these inequalities to (35) and by the union bound, we get with probability at least 1 − ε,

    Σ_{t=1}^T [f(x*) − 2f(x_t)] + 2λ Σ_{t=1}^T g̃_t(x_t) − (δηT + 1/η)λ²
    ≤ d²/(2η) + G²ηT + 2C²ηT + 4η Σ_{t=1}^T γ_t².    (36)

Maximizing the LHS of (36) with respect to λ over the range [0, +∞), we get a solution of
λ = [Σ_{t=1}^T g̃_t(x_t)]_+ / (δηT + 1/η). Plugging this into (36) gives us

    Σ_{t=1}^T [f(x*) − 2f(x_t)] + [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT + 1/η) ≤ d²/(2η) + G²ηT + 2C²ηT + 4η Σ_{t=1}^T γ_t².    (37)

Let U = max{G, C}. Choosing η = d/(U√T), we have with probability at least 1 − ε,

    Σ_{t=1}^T [f(x*) − 2f(x_t)] + [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT + 1/η)
    ≤ (d max{G, C}/2)√T + (G²d/max{G, C})√T + (2C²d/max{G, C})√T + (8 max{G, C} d log(2T/ε)/√T) Σ_{t=1}^T 1/√t
    ≤ (d max{G, C}/2)√T + max{G, C} d √T + 2 max{G, C} d √T + (8Ud log(2T/ε)/√T) Σ_{t=1}^T 1/√t    (38)
    ≤ (7dU/2)√T + (8Ud log(2T/ε)/√T) Σ_{t=1}^T 1/√t    (39)
    ≤ (7dU/2)√T + 16Ud log(2T/ε),    (40)

where Equation (40) uses Σ_{t=1}^T 1/√t ≤ 2√T. This gives us

    Σ_{t=1}^T [ (1/2) f(x*) − f(x_t) ] ≤ (7dU/4)√T + 8dU log(2T/ε) = O(T^{1/2}).    (41)

Next, we establish our constraint violation bound. Since F1 := max_{x∈K} |f(x)| and F2 :=
max_{x,y∈K} |f(x) − f(y)|, we have

    |f(x*) − 2f(x_t)| ≤ |f(x*) − f(x_t)| + |f(x_t)| ≤ F1 + F2,    (42)

thus Σ_{t=1}^T [f(x*) − 2f(x_t)] ≥ −(F1 + F2)T. Plugging back into (40), we have

    [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT + 1/η) ≤ (7dU/4)√T + 8dU log(2T/ε) + (F1 + F2)T.    (43)

Rearranging and plugging in the value of η, we have

    [Σ_{t=1}^T g̃_t(x_t)]_+ ≤ √( [ (7dU/4)√T + 8dU log(2T/ε) + (F1 + F2)T ] · (8β²d/U + U/d)√T ).    (44)
Algorithm 2 OLSGA with First Order Full Information
1: Input: Convex set K, time horizon T
2: Initialize x_1 ∈ K, λ_1 = 0.
3: for t ∈ [T] do
4:   Play x_t; obtain f_t(x_t), the gradient oracle ∇̃f_t(·), and p_t
5:   Compute p̂_t = (1/t) Σ_{s=1}^t p_s
6:   Sample z_t from Z, where P(Z ≤ z) = ∫_0^z e^{u−1}/(1 − e^{−1}) du
7:   Compute the non-oblivious gradient estimate ∇̃F_t(x_t) = (1 − 1/e) ∇̃f_t(z_t ∗ x_t)
8:   Compute
         ∇̃_x L̂_t(x_t, λ_t) = ∇̃F_t(x_t) − λ_t ∇g̃_t(x_t)    (45)
         ∇_λ L_t(x_t, λ_t) = −g̃_t(x_t) + δηλ_t    (46)
9:   Update x_t and λ_t:
         x_{t+1} = Π_K(x_t + η ∇̃_x L̂_t(x_t, λ_t))    (47)
         λ_{t+1} = Π_{[0,+∞)}(λ_t − η ∇_λ L_t(x_t, λ_t))    (48)
10: end for
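For implementation, the sampling in line 6 admits a simple inverse-transform form; a short calculation on our part, using the CDF stated in line 6, gives

    P(Z ≤ z) = ∫_0^z e^{u−1}/(1 − e^{−1}) du = (e^{z−1} − e^{−1})/(1 − e^{−1}),   z ∈ [0, 1],

so if v is drawn uniformly from [0, 1], then z = 1 + log( v(1 − e^{−1}) + e^{−1} ) has the required distribution.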
E Proof of Theorem 2
The complete statement of the theorem, with explicit constants, is that with probability at least 1 − ε,

    E[R_T] ≤ Σ_{t=1}^T [ (1 − 1/e) f(x*) − f(x_t) ] ≤ (5dU/2)√T + 8dU log(2T/ε) = O(T^{1/2})

and

    C_T ≤ √( [ (5dU/2)√T + 8dU log(2T/ε) + (F1/e + F2)T ] · (4β²d/U + U/d)√T )
          + rQσ √( T log(2nT/ε) ) + 2 √( 2TC² log(2T/ε) )
        = O(T^{3/4}).

Proof. From the update (47) of x_t, we have that for any x ∈ K,

    ∥x_{t+1} − x∥² = ∥Π_K(x_t + η ∇̃_x L̂_t(x_t, λ_t)) − x∥²
                  ≤ ∥x_t − x∥² + η²∥∇̃_x L̂_t(x_t, λ_t)∥² − 2η(x − x_t)^⊤ ∇̃_x L̂_t(x_t, λ_t).    (49)

Rearranging,

    (x − x_t)^⊤ ∇̃_x L̂_t(x_t, λ_t)
    ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + (η/2)∥∇̃_x L̂_t(x_t, λ_t)∥²
    = 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + (η/2)∥∇̃F_t(x_t) − λ_t ∇g̃_t(x_t)∥²    (from (45))
    ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + η∥∇̃F_t(x_t)∥² + ηλ_t²∥∇g̃_t(x_t)∥²    (∥a + b∥² ≤ 2∥a∥² + 2∥b∥²)
    ≤ 1/(2η) (∥x_t − x∥² − ∥x_{t+1} − x∥²) + ηG_F² + ηβ²λ_t²,    (50)

where (50) follows from Assumption 4 and Assumption 3. When λ is fixed, we have (taking
expectation over f_t)

    E[(1 − 1/e) L_t(x, λ_t) − L_t(x_t, λ_t)]
    ≤ 1/(2η) E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + G_F²η + ηβ²λ_t² + (1/e) λ_t g̃_t(x) − (δη/(2e)) λ_t²,    (51)

where (51) follows from (50). From the update (48) of λ_t, we have

    ∥λ_{t+1} − λ∥² = ∥Π_{[0,+∞)}(λ_t − η ∇_λ L_t(x_t, λ_t)) − λ∥²
                  ≤ ∥λ_t − λ∥² + η²∥∇_λ L_t(x_t, λ_t)∥² − 2η(λ_t − λ) ∇_λ L_t(x_t, λ_t).    (52)

Rearranging,

    (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − (η/2)∥∇_λ L_t(x_t, λ_t)∥²
    = −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − (η/2)∥−g̃_t(x_t) + δηλ_t∥²    (from (46))
    = −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − (η/2)∥−ĝ_t(x_t) + γ_t + δηλ_t∥²    (by def. of ĝ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − η∥ĝ_t(x_t)∥² − 2ηγ_t² − 2δ²η³λ_t²    (apply ∥a + b∥² ≤ 2∥a∥² + 2∥b∥² twice)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − C²η − 2ηγ_t² − 2δ²η³λ_t²,    (53)

where the last inequality follows from the definitions C := max_{p′∼D_p} max_{x∈K} |⟨p′, x⟩ − B_T/T| and
ĝ_t(x) := ⟨p̂_t, x⟩ − B_T/T. From convexity of the function L_t(x, λ) w.r.t. λ, we have

    L_t(x_t, λ) − L_t(x_t, λ_t) ≥ (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)
    ≥ −1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²) − C²η − 2ηγ_t² − 2δ²η³λ_t²,    (54)

where (54) follows from (53). Subtracting (54) from (51), we get

    E[(1 − 1/e) L_t(x, λ_t) − L_t(x_t, λ)]
    ≤ 1/(2η) E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + 1/(2η) (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²)
      + G_F²η + ηβ²λ_t² + C²η + 2ηγ_t² + 2δ²η³λ_t² + (1/e) λ_t g̃_t(x) − (δη/(2e)) λ_t².    (55)

Summing (55) for t ∈ [T], we have

    Σ_{t=1}^T E[(1 − 1/e) L_t(x, λ_t) − L_t(x_t, λ)]
    ≤ 1/(2η) Σ_{t=1}^T E[∥x_t − x∥² − ∥x_{t+1} − x∥²] + 1/(2η) Σ_{t=1}^T (∥λ_t − λ∥² − ∥λ_{t+1} − λ∥²)
      + G_F²ηT + ηβ² Σ_{t=1}^T λ_t² + C²ηT + 2η Σ_{t=1}^T γ_t² + 2δ²η³ Σ_{t=1}^T λ_t² + (1/e) Σ_{t=1}^T λ_t g̃_t(x) − (δη/(2e)) Σ_{t=1}^T λ_t²    (56)
    ≤ 1/(2η) E[∥x_1 − x∥² − ∥x_{T+1} − x∥²] + 1/(2η) (∥λ_1 − λ∥² − ∥λ_{T+1} − λ∥²)
      + G_F²ηT + ηβ² Σ_{t=1}^T λ_t² + C²ηT + 2η Σ_{t=1}^T γ_t² + 2δ²η³ Σ_{t=1}^T λ_t² + (1/e) Σ_{t=1}^T λ_t g̃_t(x) − (δη/(2e)) Σ_{t=1}^T λ_t²    (57)
    ≤ 1/(2η) E[∥x_1 − x_{T+1}∥²] + 1/(2η) (∥λ_1 − λ∥² − ∥λ_{T+1} − λ∥²)
      + G_F²ηT + ηβ² Σ_{t=1}^T λ_t² + C²ηT + 2η Σ_{t=1}^T γ_t² + 2δ²η³ Σ_{t=1}^T λ_t² + (1/e) Σ_{t=1}^T λ_t g̃_t(x) − (δη/(2e)) Σ_{t=1}^T λ_t²    (triangle inequality)
    ≤ d²/(2η) + λ²/(2η) + G_F²ηT + ηβ² Σ_{t=1}^T λ_t² + C²ηT + 2η Σ_{t=1}^T γ_t² + 2δ²η³ Σ_{t=1}^T λ_t² + (1/e) Σ_{t=1}^T λ_t g̃_t(x) − (δη/(2e)) Σ_{t=1}^T λ_t²,    (58)

where (58) uses Assumption 1 with the definition of the diameter d, λ_1 = 0, and −∥·∥² ≤ 0. Expanding the left
hand side of (58), we deduce

    Σ_{t=1}^T [(1 − 1/e) f(x) − f(x_t)] + Σ_{t=1}^T [λ g̃_t(x_t) − (1 − 1/e) λ_t g̃_t(x)] + Σ_{t=1}^T [(1 − 1/e)(δη/2)λ_t² − (δη/2)λ²]
    ≤ d²/(2η) + λ²/(2η) + G_F²ηT + ηβ² Σ_{t=1}^T λ_t² + C²ηT + 2η Σ_{t=1}^T γ_t² + 2δ²η³ Σ_{t=1}^T λ_t² + (1/e) Σ_{t=1}^T λ_t g̃_t(x) − (δη/(2e)) Σ_{t=1}^T λ_t².    (59)

Rearranging, we have

    Σ_{t=1}^T [(1 − 1/e) f(x) − f(x_t)] + λ Σ_{t=1}^T g̃_t(x_t) − (δηT/2 + 1/(2η))λ²
    ≤ Σ_{t=1}^T λ_t g̃_t(x) + η(β² + 2δ²η² − δ/2) Σ_{t=1}^T λ_t² + d²/(2η) + G_F²ηT + C²ηT + 2η Σ_{t=1}^T γ_t².    (60)

Set x = x*. From Lemma 4, with probability at least 1 − ε/T, g̃_t(x*) = ĝ_t(x*) − γ_t ≤ g(x*)
holds. Since x* satisfies the long-term constraint, we have g(x*) ≤ 0. Now we choose δ such that
β² + 2δ²η² − δ/2 ≤ 0. This is quadratic in δ. It is easy to verify that the quadratic
has real roots when η ≤ √2/(8β). As we choose η = d/(U√T), this gives the condition that T needs to
be sufficiently large, i.e., T ≥ 32d²β²/U². We can simply choose δ = 4β².

Applying both of these inequalities to (60) and by the union bound, we get with probability at least 1 − ε,

    Σ_{t=1}^T [(1 − 1/e) f(x*) − f(x_t)] + λ Σ_{t=1}^T g̃_t(x_t) − (δηT/2 + 1/(2η))λ² ≤ d²/(2η) + G_F²ηT + C²ηT + 2η Σ_{t=1}^T γ_t².    (61)

Maximizing the LHS of (61) with respect to λ over the range [0, +∞), we get a solution of
λ = [Σ_{t=1}^T g̃_t(x_t)]_+ / (δηT/2 + 1/(2η)). Plugging this into (61) gives us

    Σ_{t=1}^T [(1 − 1/e) f(x*) − f(x_t)] + [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT/2 + 1/(2η)) ≤ d²/(2η) + G_F²ηT + C²ηT + 2η Σ_{t=1}^T γ_t².    (62)

Let U = max{G_F, C}. Choosing η = d/(U√T), we have with probability at least 1 − ε,

    Σ_{t=1}^T [(1 − 1/e) f(x*) − f(x_t)] + [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT/2 + 1/(2η))
    ≤ (d max{G_F, C}/2)√T + (G_F²d/max{G_F, C})√T + (C²d/max{G_F, C})√T + (4 max{G_F, C} d log(2T/ε)/√T) Σ_{t=1}^T 1/√t
    ≤ (d max{G_F, C}/2)√T + max{G_F, C} d √T + max{G_F, C} d √T + (4 max{G_F, C} d log(2T/ε)/√T) Σ_{t=1}^T 1/√t
    ≤ (5dU/2)√T + (4dU log(2T/ε)/√T) Σ_{t=1}^T 1/√t    (63)
    ≤ (5dU/2)√T + 8dU log(2T/ε)    (64)
    = O(T^{1/2}),    (65)

where (64) uses Σ_{t=1}^T 1/√t ≤ 2√T, and dropping the second term on the LHS gives us the desired
(1 − 1/e)-regret bound.

Next, we establish our constraint violation bound. Since F1 := max_{x∈K} |f(x)| and F2 :=
max_{x,y∈K} |f(x) − f(y)|, we have

    |(1 − 1/e) f(x*) − f(x_t)| ≤ |f(x*) − f(x_t)| + (1/e)|f(x_t)| ≤ F1/e + F2,    (66)

thus Σ_{t=1}^T [(1 − 1/e) f(x*) − f(x_t)] ≥ −(F1/e + F2)T. Plugging back into (64), we have

    [Σ_{t=1}^T g̃_t(x_t)]_+² / (δηT + 1/η) ≤ (5dU/2)√T + 8dU log(2T/ε) + (F1/e + F2)T.    (67)

Rearranging and plugging in the value of η, we have

    [Σ_{t=1}^T g̃_t(x_t)]_+ ≤ √( [ (5dU/2)√T + 8dU log(2T/ε) + (F1/e + F2)T ] · (4β²d/U + U/d)√T ).    (68)
F Additional Related Works
For the discrete domain, [19] investigated the case of semi-bandit feedback, specifically in the form of marginal gains.
Additionally, recent works such as [25, 26] have delved into the full-bandit feedback setting. For
continuous domains, [9] first investigated online (stochastic) gradient ascent (OGA) with a 1/2-regret
of O(T^{1/2}). Then, inspired by the meta actions of [33], [9] also proposed a Frank-Wolfe type algorithm
with a (1 − 1/e)-regret of O(T^{1/2}) when the exact gradient is available. When only a stochastic gradient
is available, [7] proposed a variant of the Frank-Wolfe algorithm achieving a (1 − 1/e)-regret of O(T^{1/2}),
but it requires O(T^{3/2}) stochastic gradient queries in each time step. In the effort of reducing gradient
queries, [41] achieves a (1 − 1/e)-regret of O(T^{4/5}) with only one stochastic gradient evaluation each
round. Recently, [42] proposed an auxiliary function to boost the approximation ratio of the
online gradient ascent algorithms from 1/2 to 1 − 1/e.
All the problems above were initially studied in the discrete domain and extended to the continuous
domain in [6]. Furthermore, when faced with a discrete objective, one can always use the "relax
and rounding" strategy to transition from addressing a discrete problem to tackling a continuous one.
Such techniques are widely utilized within the submodular maximization community, as
exemplified by the work of [8].
NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the
paper’s contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction clearly state the paper’s contribution and scope.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims
made in the paper.
• The abstract and/or introduction should clearly state the claims made, including the
contributions made in the paper and important assumptions and limitations. A No or
NA answer to this question will not be perceived well by the reviewers.
• The claims made should match theoretical and experimental results, and reflect how
much the results can be expected to generalize to other settings.
• It is fine to include aspirational goals as motivation as long as it is clear that these goals
are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Some limitations of the paper have been discussed in the conclusion section.
Assumptions form another part of the limitations.
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that
the paper has limitations, but those are not discussed in the paper.
• The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to
violations of these assumptions (e.g., independence assumptions, noiseless settings,
model well-specification, asymptotic approximations only holding locally). The authors
should reflect on how these assumptions might be violated in practice and what the
implications would be.
• The authors should reflect on the scope of the claims made, e.g., if the approach was
only tested on a few datasets or with a few runs. In general, empirical results often
depend on implicit assumptions, which should be articulated.
• The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution
is low or images are taken in low lighting. Or a speech-to-text system might not be
used reliably to provide closed captions for online lectures because it fails to handle
technical jargon.
• The authors should discuss the computational efficiency of the proposed algorithms
and how they scale with dataset size.
• If applicable, the authors should discuss possible limitations of their approach to
address problems of privacy and fairness.
• While the authors might fear that complete honesty about limitations might be used by
reviewers as grounds for rejection, a worse outcome might be that reviewers discover
limitations that aren’t acknowledged in the paper. The authors should use their best
judgment and recognize that individual actions in favor of transparency play an impor-
tant role in developing norms that preserve the integrity of the community. Reviewers
will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and
a complete (and correct) proof?
Answer: [Yes]
Justification: We have clearly stated the required assumptions and an accompanying com-
plete proof in the appendix for each theory result.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and cross-
referenced.
• All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if
they appear in the supplemental material, the authors are encouraged to provide a short
proof sketch to provide intuition.
• Inversely, any informal proof provided in the core of the paper should be complemented
by formal proofs provided in appendix or supplemental material.
• Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main ex-
perimental results of the paper to the extent that it affects the main claims and/or conclusions
of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: Our paper is primarily of theoretical nature and does not include experiments.
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived
well by the reviewers: Making the paper reproducible is important, regardless of
whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken
to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways.
For example, if the contribution is a novel architecture, describing the architecture fully
might suffice, or if the contribution is a specific model and empirical evaluation, it may
be necessary to either make it possible for others to replicate the model with the same
dataset, or provide access to the model. In general. releasing code and data is often
one good way to accomplish this, but reproducibility can also be provided via detailed
instructions for how to replicate the results, access to a hosted model (e.g., in the case
of a large language model), releasing of a model checkpoint, or other means that are
appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submis-
sions to provide some reasonable avenue for reproducibility, which may depend on the
nature of the contribution. For example
(a) If the contribution is primarily a new algorithm, the paper should make it clear how
to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe
the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should
either be a way to access this model for reproducing the results or a way to reproduce
the model (e.g., with an open-source dataset or instructions for how to construct
the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case
authors are welcome to describe the particular way they provide for reproducibility.
In the case of closed-source models, it may be that access to the model is limited in
some way (e.g., to registered users), but it should be possible for other researchers
to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instruc-
tions to faithfully reproduce the main experimental results, as described in supplemental
material?
Answer: [NA]
Justification: Our paper is primarily of theoretical nature and does not include experiments.
Guidelines:
• The answer NA means that paper does not include experiments requiring code.
• Please see the NeurIPS code and data submission guidelines (https://nips.cc/
public/guides/CodeSubmissionPolicy) for more details.
• While we encourage the release of code and data, we understand that this might not be
possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not
including code, unless this is central to the contribution (e.g., for a new open-source
benchmark).
• The instructions should contain the exact command and environment needed to run to
reproduce the results. See the NeurIPS code and data submission guidelines (https:
//nips.cc/public/guides/CodeSubmissionPolicy) for more details.
• The authors should provide instructions on data access and preparation, including how
to access the raw data, preprocessed data, intermediate data, and generated data, etc.
• The authors should provide scripts to reproduce all experimental results for the new
proposed method and baselines. If only a subset of experiments are reproducible, they
should state which ones are omitted from the script and why.
• At submission time, to preserve anonymity, the authors should release anonymized
versions (if applicable).
• Providing as much information as possible in supplemental material (appended to the
paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyper-
parameters, how they were chosen, type of optimizer, etc.) necessary to understand the
results?
Answer: [NA]
Justification: Our paper is primarily of theoretical nature and does not include experiments.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail
that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental
material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate
information about the statistical significance of the experiments?
Answer: [NA]
Justification: Our paper is primarily of theoretical nature and does not include experiments.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confi-
dence intervals, or statistical significance tests, at least for the experiments that support
the main claims of the paper.
• The factors of variability that the error bars are capturing should be clearly stated (for
example, train/test split, initialization, random drawing of some parameter, or overall
run with given experimental conditions).
• The method for calculating the error bars should be explained (closed form formula,
call to a library function, bootstrap, etc.)
• The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error
of the mean.
• It is OK to report 1-sigma error bars, but one should state it. The authors should
preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis
of Normality of errors is not verified.
• For asymmetric distributions, the authors should be careful not to show in tables or
figures symmetric error bars that would yield results that are out of range (e.g. negative
error rates).
• If error bars are reported in tables or plots, The authors should explain in the text how
they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the com-
puter resources (type of compute workers, memory, time of execution) needed to reproduce
the experiments?
Answer: [NA]
Justification: Our paper is primarily of theoretical nature and does not include experiments.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers CPU or GPU, internal cluster,
or cloud provider, including relevant memory and storage.
• The paper should provide the amount of compute required for each of the individual
experimental runs as well as estimate the total compute.
• The paper should disclose whether the full research project required more compute
than the experiments reported in the paper (e.g., preliminary or failed experiments that
didn’t make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the
NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: Our research conforms, in every respect, to the NeurIPS Code of Ethics.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a
deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consid-
eration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative
societal impacts of the work performed?
Answer: [NA]
Justification: Our work is primarily theoretical in nature and has no immediate societal
impact.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal
impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses
(e.g., disinformation, generating fake profiles, surveillance), fairness considerations
(e.g., deployment of technologies that could make decisions that unfairly impact specific
groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied
to particular applications, let alone deployments. However, if there is a direct path to
any negative applications, the authors should point it out. For example, it is legitimate
to point out that an improvement in the quality of generative models could be used to
generate deepfakes for disinformation. On the other hand, it is not needed to point out
that a generic algorithm for optimizing neural networks could enable people to train
models that generate deepfakes faster.
• The authors should consider possible harms that could arise when the technology is
being used as intended and functioning correctly, harms that could arise when the
technology is being used as intended but gives incorrect results, and harms following
from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation
strategies (e.g., gated release of models, providing defenses in addition to attacks,
mechanisms for monitoring misuse, mechanisms to monitor how a system learns from
feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible
release of data or models that have a high risk for misuse (e.g., pretrained language models,
image generators, or scraped datasets)?
Answer: [NA]
Justification: No high-risk data or models have been used.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with
necessary safeguards to allow for controlled use of the model, for example by requiring
that users adhere to usage guidelines or restrictions to access the model or implementing
safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors
should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do
not require this, but we encourage authors to take this into account and make a best
faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in
the paper, properly credited and are the license and terms of use explicitly mentioned and
properly respected?
Answer: [NA]
Justification: No existing asset has been used in the paper.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a
URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• For scraped data from a particular source (e.g., website), the copyright and terms of
service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the
package should be provided. For popular datasets, paperswithcode.com/datasets
has curated licenses for some datasets. Their licensing guide can help determine the
license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of
the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to
the asset’s creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation
provided alongside the assets?
Answer: [NA]
Justification: No new asset is introduced in the paper.
Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their
submissions via structured templates. This includes details about training, license,
limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose
asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either
create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper
include the full text of instructions given to participants and screenshots, if applicable, as
well as details about compensation (if any)?
Answer: [NA]
Justification: No experiments with human subjects were conducted.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with
human subjects.
• Including this information in the supplemental material is fine, but if the main contribu-
tion of the paper involves human subjects, then as much detail as possible should be
included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation,
or other labor should be paid at least the minimum wage in the country of the data
collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human
Subjects
Question: Does the paper describe potential risks incurred by study participants, whether
such risks were disclosed to the subjects, and whether Institutional Review Board (IRB)
approvals (or an equivalent approval/review based on the requirements of your country or
institution) were obtained?
Answer: [NA]
Justification: We conducted no experiments with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with
human subjects.
• Depending on the country in which research is conducted, IRB approval (or equivalent)
may be required for any human subjects research. If you obtained IRB approval, you
should clearly state this in the paper.
• We recognize that the procedures for this may vary significantly between institutions
and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the
guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if
applicable), such as the institution conducting the review.