Adversarial Network Optimization under Bandit Feedback:
Maximizing Utility in Non-Stationary Multi-Hop Networks

Yan Dai, ORC & LIDS, MIT. Work done when Yan was an undergraduate student at IIIS, Tsinghua. Longbo Huang, IIIS, Tsinghua.

Abstract

Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems and has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary over time, which fails to capture the non-stationary components in many real-world scenarios. Many existing algorithms also assume knowledge of network conditions before decision-making, which rules out applications where these conditions are unpredictable.

Motivated by these issues, we consider Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of i) maximizing some unknown and time-varying utility function associated with the scheduler’s actions, where ii) the underlying network is a non-stationary multi-hop one whose conditions change arbitrarily with time, and iii) only bandit feedback (the effect of actually deployed actions) is revealed after decisions. Our proposed UMO² algorithm ensures network stability and also matches the utility maximization performance of any “mildly varying” reference policy up to a polynomially decaying gap. To our knowledge, no previous ANO algorithm handled multi-hop networks or achieved utility guarantees under bandit feedback, whereas ours does both.

Technically, our method builds upon a novel integration of online learning into Lyapunov analyses: To handle complex inter-dependencies among queues in multi-hop networks, we propose meticulous techniques to balance online learning and Lyapunov arguments. To tackle the learning obstacles due to potentially unbounded queue sizes, we design a new online linear optimization algorithm that automatically adapts to loss magnitudes. To maximize utility, we propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suits drastically varying queue lengths. Our new insights in online learning can be of independent interest.

1 Introduction

Stochastic Network Optimization (SNO) studies the fundamental problem of resource allocation in a dynamic system to fulfill incoming demands, with extensive applications in real-world problems including communication networks (Srikant and Ying, 2013), cloud computing (Maguluri et al., 2012), and supply chains (Rahdar et al., 2018). There are many classical scheduling algorithms in this field enjoying performance guarantees in terms of throughput maximization (Tsibonis et al., 2003), delay minimization (Neely, 2008), or utility maximization (Huang and Neely, 2011).

Classical SNO models often assume that the network conditions, for example, the arrival and service rates to each queue or the capacities of data links, are stationary with respect to time. However, many important network scenarios in practice face non-stationarity. For instance, in applications such as autonomous driving, parties in the communication networks can move rapidly (Ashjaei et al., 2021), causing the network conditions to vary from time to time. Even more, attacks such as Distributed Denial-of-Service (DDoS) or jamming can frequently happen in communication networks (Zou et al., 2016), where arrival rates or link conditions are altered by some malicious adversary.

Moreover, we notice that existing works, even those allowing non-stationary network conditions, assume perfect knowledge of the network conditions. For example, in the paper by Liang and Modiano (2018b), the network condition is revealed at the beginning of each round, so the outcomes associated with each action can be accurately calculated before actually deciding and deploying the scheduler’s action (see Section 1.1 for more discussions). Nevertheless, this may again not be the case in practice. In underwater wireless communication systems, for instance, the network conditions are unpredictable until the policy is actually executed (Khan et al., 2020). In the Internet of Things (IoT), device failures or sensor temperatures can change rapidly, resulting in highly unpredictable traffic and channel patterns in the network (Gaddam et al., 2020). Therefore, it is hard to estimate counterfactual outcomes of other actions (i.e., “what would have happened if we had used a different action?”) even after deploying the action and obtaining more information about the network conditions, not to mention pre-decision evaluations. In a nutshell, it is important and largely open to design network algorithms that are robust to time-dependent or adversarial conditions with post-decision feedback.

Motivated by these two challenges, this paper considers optimizing an abstract utility function associated with the scheduler’s action (the so-called utility maximization task (Neely et al., 2008)) even when the network is non-stationary (which we call Adversarial Network Optimization, or ANO for short) and the feedback model is bandit style. Specifically, in ANO, the network conditions and utility functions can vary over time in ways unknown to the scheduler. Therefore, statistics of the past reveal little about the current network condition, which breaks many traditional SNO techniques. Moreover, under the bandit feedback model, the scheduler has no information about the current network condition before making its decision. Even worse, after making a decision, it only receives feedback resulting from the chosen action – not the “counterfactual” feedback associated with other actions. More formally, suppose an action $a$ is associated with outcome $O_{t}(a)$ (for example, arrival and service rates) in round $t$. Then i) the scheduler has to decide action $a_{t}$ without having any information about $O_{t}(a)$ for any action $a$, and ii) after playing $a_{t}$, it can only observe $O_{t}(a_{t})$ but not $O_{t}(a^{\prime})$ for any $a^{\prime}\neq a_{t}$. Therefore, it is hard to evaluate the optimal action even in hindsight: based on information collected in rounds $[1,t]$, one cannot accurately calculate which action had the most gain within rounds $[1,t]$. Hence, in our model, one can neither predict the future nor fully reconstruct the history.

In addition to the challenging ANO setup, bandit feedback model, and utility maximization task, we also allow the underlying network to be multi-hop, which means jobs can be forwarded between queues. Despite these difficulties, we design the utility maximization algorithm UMO², which achieves a strong performance guarantee in non-stationary multi-hop networks under bandit feedback. It not only ensures that the network is stable over time, but also guarantees a polynomially decaying gap between our utility and that of any “mildly varying” policy (measured by the path length; see Theorem 4.5), similar to what can be achieved in perfect-knowledge SNO problems (Neely et al., 2008).

We now highlight several technical innovations in our algorithm. While our algorithm is based on the classical Lyapunov drift-plus-penalty (DPP) analysis, the adversarial network conditions break existing SNO arguments, which we tackle by designing online learning algorithms enjoying dynamic regret guarantees in adversarial environments. However, due to the multi-hop topology, learning in different queues is correlated. Thus, it is highly non-trivial to decompose the problem into several online learning tasks. To this end, we develop meticulous analysis techniques that can jointly analyze the online learning algorithms and the utility maximization effects. Moreover, the queue lengths can occasionally be large despite having a bounded expectation. This challenge is absent from the online learning literature, which usually assumes the loss magnitudes are uniformly bounded by a constant. Finally, yet another challenge is due to a combination of the multi-hop topology and the unbounded queue lengths, which makes the losses fed into online learning algorithms sometimes quite negative, a known issue for many online learning algorithms (Zheng et al., 2019, Dai et al., 2023). These two challenges make existing online learning algorithms unable to fulfill our purpose, and we propose a novel Online Linear Optimization algorithm (AdaPFOL; used for system stability) that adapts to drastically varying losses (proportional to queue lengths) and a new Bandit Convex Optimization algorithm (AdaBGD; deployed for utility maximization) whose learning rates are carefully designed to take care of the time-dependent loss magnitudes and Lipschitzness.

Table 1: Comparison of Most Related Works

Work	Network Conditions	Arrival & Service¹	Topology	Objective	Utility¹
(Neely et al., 2008)	Stochastic	Known	Multi-Hop	Utility Maximization	Known
(Neely, 2010b)	Adversarial	Known	Multi-Hop	Utility Maximization	Known
(Liang and Modiano, 2018b)	Adversarial	Known	Multi-Hop	Network Stability	—
(Liang and Modiano, 2018a)	Adversarial	Known	Multi-Hop	Utility Maximization	Known
(Yang et al., 2023)	Adversarial	Unknown	Single-Hop	Network Stability	—
(Huang et al., 2024)	Adversarial	Unknown	Single-Hop	Network Stability	—
Ours	Adversarial	Unknown	Multi-Hop	Utility Maximization	Unknown

¹ The Arrival & Service and Utility columns indicate whether the arrival and service rates, or the utility function, associated with each feasible control action are known before decision-making.

Finally, we mention some of the most related works, including (Neely, 2010b, Liang and Modiano, 2018b; a, Huang et al., 2024, Yang et al., 2023), to help locate our position in the literature. Among them, Neely (2010b), Liang and Modiano (2018b; a) assumed perfect pre-decision knowledge of network conditions, which allows direct calculation of the arrival and service rates resulting from every action. In our case, we have to learn these outcomes in an online manner. Liang and Modiano (2018b), Huang et al. (2024), Yang et al. (2023) focused on network stability, while our utility maximization task additionally requires maximizing an abstract, unknown, and time-varying utility function, thus adding difficulties in designing online learning algorithms. Neely (2010b), Liang and Modiano (2018a) considered utility maximization, but their utility functions are fixed and non-adversarial. In our case, the utility functions are both time-varying and unknown, thus another online learning sub-routine is needed. Huang et al. (2024), Yang et al. (2023) investigated single-hop networks where jobs leave the network upon being served, whereas our formulation considers the more general multi-hop networks. We refer the readers to Table 1 and Section 1.1 for more information.

Our main contributions in this paper can be summarized as follows:

• We propose a novel algorithm UMO² (Algorithm 3) for adversarial multi-hop networks under bandit feedback, which gives a rigorous utility optimization guarantee. To the best of our knowledge, no previous algorithm can handle multi-hop topology or achieve utility guarantees in adversarial networks under bandit feedback, whereas our UMO² algorithm is able to do both. Moreover, as a by-product, we also derive a simpler algorithm NSO (Algorithm 1) which ensures network stability for adversarial multi-hop networks under bandit feedback.

• To handle the multi-hop topology, which brings inter-queue correlations, and to jointly handle online learning and network optimization, we develop a unified analysis that allows the integration of online learning techniques into the classical Lyapunov drift-plus-penalty arguments. Specifically, via the design of a new OLO algorithm to stabilize the network and a novel BCO algorithm to maximize the utility, the UMO² algorithm enjoys a network stability guarantee together with a polynomially decaying gap between its utility and that of any policy that is “mildly varying” (in the sense that its path length is of order $o(\sqrt{T})$; see Theorem 4.5 for more details).

• Due to the potentially unbounded queue lengths, existing online learning algorithms are unfortunately inapplicable. We design an OLO algorithm that can handle large losses and enjoys a performance guarantee adapted to the loss magnitudes (AdaPFOL; Theorem 3.5). We also develop a new BCO method specially crafted for the drastically varying loss magnitudes and Lipschitzness (AdaBGD; Theorem 4.4). Both online learning algorithms can be of independent interest.

1.1 Related Works

We discuss the most related works here. A more comprehensive literature review is in Appendix A.

Adversarial Network Control. Adversarial networks date back to the 1990s, when Cruz (1991) gave the first adversarial-dynamics network model and its scheduling algorithm. More efforts were made to allow more general arrival rates (Borodin et al., 2001, Andrews et al., 2001), link conditions (Andrews and Zhang, 2004, Andrews et al., 2007), or both (Liang and Modiano, 2018b). We also direct the readers to the references therein for more discussions. The main focus of the aforementioned papers was usually system stability, whereas ours is utility maximization. As far as we are aware, existing results on utility maximization (Neely, 2010b, Liang and Modiano, 2018a) mostly assumed perfect knowledge of network conditions.

Feedback Models. Most previous works considered the perfect knowledge model, which assumes pre-decision knowledge of network conditions (Liang and Modiano, 2018b; a). In contrast, our paper considers the bandit feedback model, which only reveals the consequence of our action. A small number of previous works (Fu and Modiano, 2022, Yang et al., 2023, Huang et al., 2024) also assumed similar feedback models, albeit under different names. Another feedback model whose difficulty lies in between is the full-information feedback model, which requires network conditions to be revealed after the decision, so that counterfactual evaluations of all actions (i.e., not only the deployed one) are possible in hindsight. See (Neely et al., 2012) for an example.

Adversarial Networks under Bandit Feedback. Prior to our work, Huang et al. (2024), Yang et al. (2023) also studied adversarial networks under bandit feedback. However, they both assumed single-hop networks and focused on network stability. In contrast, our paper allows a general multi-hop topology and tackles the utility maximization task of optimizing an abstract, unknown, and time-varying utility function.

2 Notations and Preliminaries

We use bold letters to denote vectors, e.g., $\bm{q}_{t},\bm{\mu}_{t},\bm{\lambda}_{t}$, and denote their elements with corresponding normal letters, e.g., $q_{t,i},\mu_{t,i},\lambda_{t,i}$. For an integer $n\geq 0$, $[n]$ stands for $\{1,2,\ldots,n\}$. For a finite set $\mathcal{S}$, $\triangle(\mathcal{S})$ is the simplex over $\mathcal{S}$, i.e., $\{\bm{x}\in\mathbb{R}_{\geq 0}^{\lvert\mathcal{S}\rvert}\mid\sum_{i=1}^{\lvert\mathcal{S}\rvert}x_{i}=1\}$, where every element $\bm{x}\in\triangle(\mathcal{S})$ is a discrete probability distribution over $\mathcal{S}$. We use $\operatorname{\mathcal{O}}$ to hide all absolute constants, and use $\operatorname{\widetilde{\mathcal{O}}}$ to additionally hide all logarithmic factors. For functions $f(T)$ and $g(T)$, we say $f(T)=\operatorname{\mathcal{O}}_{T}(g(T))$ if $\limsup_{T\to\infty}\frac{f(T)}{g(T)}<\infty$ and $f(T)=o_{T}(g(T))$ if $\limsup_{T\to\infty}\frac{f(T)}{g(T)}=0$.

2.1 Adversarial Network Optimization under Bandit Feedback Formulation

We first introduce our model of adversarial network optimization with bandit feedback. Specifically, in a network with multiple servers and directional data links, we denote the set of all servers by ${\mathcal{N}}$ and that of all data links by ${\mathcal{L}}\subseteq{\mathcal{N}}\times{\mathcal{N}}$. Suppose that $\lvert{\mathcal{N}}\rvert$ and $\lvert{\mathcal{L}}\rvert$ are both finite. There are $\lvert{\mathcal{N}}\rvert$ commodities of jobs, such that jobs belonging to commodity $k\in{\mathcal{N}}$ are destined for server $k$. We denote by $Q_{n}^{(k)}$ the queue of unfinished commodity-$k$ jobs at server $n$, where $n\in{\mathcal{N}}$ and $k\in{\mathcal{N}}$. We assume the links do not interfere with each other.

The scheduling problem lasts for $T>0$ rounds. In round $t\in[T]$, the scheduler makes two decisions: i) the arrival rates of commodity-$k$ jobs into server $n$, $\forall n\in{\mathcal{N}},k\in{\mathcal{N}}$, and ii) the link rate allocations, which determine how many commodity-$k$ jobs are transmitted over data link $(n,m)$, $\forall k\in{\mathcal{N}},(n,m)\in{\mathcal{L}}$. Both decisions are made under the bandit feedback model, i.e., the scheduler decides blindly and only receives feedback resulting from its actions. Below, we describe them in detail.

Arrival Rates and Utility. In every round $t$, the scheduler decides an $\lvert{\mathcal{N}}\rvert\times\lvert{\mathcal{N}}\rvert$-dimensional arrival rate matrix $\bm{\lambda}(t)$ from some fixed action set $\Lambda\subseteq\mathbb{R}_{\geq 0}^{\lvert{\mathcal{N}}\rvert\times\lvert{\mathcal{N}}\rvert}$, and consequently, $\lambda_{n}^{(k)}(t)$ jobs of commodity $k$ will be added to queue $Q_{n}^{(k)}$. The arrival rate matrix $\bm{\lambda}(t)$ is associated with some abstract utility $g_{t}(\bm{\lambda}(t))$, where $g_{t}\colon\Lambda\to\mathbb{R}$ is concave (that is, the user’s marginal return diminishes gradually as the arrivals increase (Huang and Neely, 2011, Huang et al., 2012)), $L$-Lipschitz, and $[-G,G]$-bounded, where $L$ and $G$ are known constants. Following the adversarial network assumption, we allow the $g_{t}$’s to be time-dependent (though they have to be pre-determined, which is called the oblivious adversary model). Following the bandit feedback model, the scheduler has no information about $g_{t}$ before the decision, and after the decision it can only observe $g_{t}(\bm{\lambda}(t))$ for the chosen $\bm{\lambda}(t)$ but not the whole $g_{t}$.
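As an illustration of this feedback model, the following minimal sketch fixes all utilities in advance (oblivious adversary) and lets the scheduler query only the value at its chosen arrival rates. The class name `BanditUtilityOracle`, the helper `make_oblivious_utilities`, and the particular concave family $g_{t}(\bm{\lambda})=\sum_{i}w_{t,i}\log(1+\lambda_{i})$ are assumptions made for this example and do not come from the paper.

```python
import math
import random


def make_oblivious_utilities(num_rounds, dim, seed=0):
    """Pre-generate one hidden weight vector per round (oblivious adversary)."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(num_rounds)]


class BanditUtilityOracle:
    """Reveals only g_t(lambda(t)) for the submitted lambda(t), never g_t itself."""

    def __init__(self, weights_per_round):
        self._weights = weights_per_round  # hidden from the scheduler

    def observe(self, t, arrival_rates):
        # Hypothetical concave utility: g_t(lam) = sum_i w_{t,i} * log(1 + lam_i).
        # Rounds are 0-indexed here; arrival_rates is a flat list of non-negative rates.
        weights = self._weights[t]
        return sum(w * math.log1p(lam) for w, lam in zip(weights, arrival_rates))
```

The scheduler would call `observe` exactly once per round, so it never learns what other arrival rates would have earned in that round.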

Link Rate Allocations. The capacity of each link $(n,m)\in{\mathcal{L}}$ can be time-varying. We denote the capacity of $(n,m)$ in round $t$ as $C_{n,m}(t)$. We assume the capacities are always bounded by some finite constant $M$. Due to the bandit feedback model, the scheduler cannot access $C_{n,m}(t)$ when deciding. Nevertheless, the scheduler can still decide a link allocation plan which assigns a distribution over commodities on each link, formally denoted as $\bm{a}_{n,m}(t)\in\triangle({\mathcal{N}})$ (the $\lvert{\mathcal{N}}\rvert$-dimensional probability simplex, representing the portion of rates allocated to each commodity over the link). By sending jobs from each commodity along link $(n,m)$ according to distribution $\bm{a}_{n,m}(t)$ in a round-robin manner, approximately $a_{n,m}^{(k)}(t)C_{n,m}(t)$ jobs from queue $Q_{n}^{(k)}$ will be sent along link $(n,m)$ to queue $Q_{m}^{(k)}$. Formally, we assume that after deciding link allocation plans $\{\bm{a}_{n,m}(t)\in\triangle({\mathcal{N}})\}_{(n,m)\in{\mathcal{L}}}$, the numbers of jobs successfully sent from $Q_{n}^{(k)}$ to $Q_{m}^{(k)}$, denoted by $\mu_{n,m}^{(k)}(t)$, are independently generated such that $\operatornamewithlimits{\mathbb{E}}[\mu_{n,m}^{(k)}(t)]=C_{n,m}(t)a_{n,m}^{(k)}(t)$ and $\mu_{n,m}^{(k)}(t)\in[0,M]$. Again, we assume a bandit feedback model, which means the scheduler is able to observe $C_{n,m}(t)$ and $\bm{\mu}_{n,m}(t)$ for all $(n,m)\in{\mathcal{L}}$ only after the decision is made, at the end of round $t$.
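The transmission model only fixes the mean and the range of $\mu_{n,m}^{(k)}(t)$. One simple distribution satisfying both conditions, chosen here purely for illustration, is a scaled Bernoulli draw per commodity. The function name `sample_transmissions` and this particular distribution are assumptions of the sketch, not part of the paper's model.

```python
import random


def sample_transmissions(capacity, allocation, max_capacity, rng):
    """Draw mu^{(k)} for one link so that E[mu^{(k)}] = capacity * allocation[k].

    Assumption (ours): mu^{(k)} equals max_capacity with probability
    capacity * allocation[k] / max_capacity and 0 otherwise, independently over k.
    """
    if not 0.0 <= capacity <= max_capacity:
        raise ValueError("capacity must lie in [0, max_capacity]")
    if any(a < 0.0 for a in allocation) or abs(sum(allocation) - 1.0) > 1e-9:
        raise ValueError("allocation must be a probability distribution")
    return [
        max_capacity if rng.random() < capacity * a / max_capacity else 0.0
        for a in allocation
    ]
```

Each draw lies in $[0,M]$ and has mean $M\cdot C_{n,m}(t)a_{n,m}^{(k)}(t)/M=C_{n,m}(t)a_{n,m}^{(k)}(t)$, as the model requires. Any other distribution with the same mean and support would serve equally well.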

Putting the two components together, by denoting the length of $Q_{n}^{(k)}$ at the beginning of round $t$ to be $Q_{n}^{(k)}(t)$ , the network dynamics can then be characterized as follows:

$$Q_{n}^{(k)}(t+1)=\begin{cases}\left[Q_{n}^{(k)}(t)-\sum_{(n,m)\in{\mathcal{L}}}\mu_{n,m}^{(k)}(t)\right]_{+}+\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)+\lambda_{n}^{(k)}(t),&k\neq n\\ 0,&k=n\end{cases}, \tag{1}$$

where $\lambda_{n}^{(k)}(t)$ is the number of jobs with commodity $k$ that the scheduler adds to server $n$ , and $\mu_{n,m}^{(k)}(t)$ is the number of jobs with commodity $k$ transmitted along data link $(n,m)$ .
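The update in Equation 1 can be written out as a short function. This is a minimal sketch in which servers and commodities are indexed $0,\ldots,\lvert{\mathcal{N}}\rvert-1$. The function name `step_queues` and the dictionary-of-lists data layout are assumptions made for this example.

```python
def step_queues(queues, arrivals, transmissions, links):
    """Apply Equation 1 once.

    queues[n][k], arrivals[n][k]: backlog / new arrivals of commodity k at server n.
    transmissions[(n, m)][k]: jobs of commodity k sent over link (n, m).
    links: iterable of directed links (n, m).
    """
    num_servers = len(queues)
    outflow = [[0.0] * num_servers for _ in range(num_servers)]
    inflow = [[0.0] * num_servers for _ in range(num_servers)]
    for (n, m) in links:
        for k in range(num_servers):
            outflow[n][k] += transmissions[(n, m)][k]
            inflow[m][k] += transmissions[(n, m)][k]
    new_queues = [[0.0] * num_servers for _ in range(num_servers)]
    for n in range(num_servers):
        for k in range(num_servers):
            if k == n:
                continue  # jobs that reached their destination leave the network
            drained = max(queues[n][k] - outflow[n][k], 0.0)
            new_queues[n][k] = drained + inflow[n][k] + arrivals[n][k]
    return new_queues
```

For example, with two servers, one link `(0, 1)`, a backlog of 3 commodity-1 jobs at server 0, 2 jobs transmitted and 1 new arrival, the queue at server 0 becomes $[3-2]_{+}+0+1=2$, while the transmitted jobs leave the network at their destination.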

The objective of the scheduler is to maximize its average utility over the $T$ rounds, namely $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}g_{t}(\bm{\lambda}(t))]$. However, a scheduling algorithm is meaningless if it cannot ensure network stability, which requires that the average number of jobs remaining in the network does not diverge when the number of rounds is large enough. Formally, the network stability requirement says

$$\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]=\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]=\operatorname{\mathcal{O}}_{T}(1),\quad\text{when }T\gg 0. \tag{2}$$

The scheduler aims to maximize its average utility subject to the network stability condition, i.e.,

$$\text{Maximize }\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}g_{t}(\bm{\lambda}(t))\right]\text{ s.t. Equation 2 holds}. \tag{3}$$
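For a single simulated run, the two quantities in Equations 2 and 3 can be estimated empirically as below. This is a minimal sketch: the function names are hypothetical, and the expectations in the formal definitions would additionally require averaging over many independent runs.

```python
def average_utility(utilities):
    """(1/T) * sum_t g_t(lambda(t)) for one realized trajectory."""
    return sum(utilities) / len(utilities)


def average_backlog(queue_history):
    """(1/T) * sum_t ||Q(t)||_1, where queue_history[t][n][k] = Q_n^{(k)}(t)."""
    total = sum(sum(sum(row) for row in q) for q in queue_history)
    return total / len(queue_history)
```

Since stability is an asymptotic statement, one would inspect how `average_backlog` behaves as the horizon $T$ grows rather than its value at a single $T$.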

2.2 Technical Overview of Our Paper

To improve presentation and facilitate understanding, in Section 3 we first present the network stability algorithm NSO, which pretends that $g_{t}$ is a constant. This algorithm will serve as a key building block for the utility maximization algorithm UMO² that we introduce in Section 4.

In Figure 1, we give an overview of our main technical steps when analyzing NSO and UMO². The steps for NSO are in yellow, the ones for UMO² are in blue, and those in common are in green. In general, both analyses start from the famous Lyapunov drift(-plus-penalty) analysis (Neely, 2010a, §4), which reveals the non-negativity of a Lyapunov drift(-plus-penalty) function – see Section 3.3 and Section 4.3 for more details. We then use online learning techniques to minimize them (i.e., making them as close to zero as possible). From here, the analyses for NSO and UMO² become different.

Figure 1: Technical Overview of NSO and UMO² Frameworks

For the network stability algorithm NSO, we express the Lyapunov drift function as a function linear in the queue lengths $\bm{Q}(t)$ and the link allocation plan $\bm{a}(t)$. While this belongs to the classical Online Linear Optimization (OLO) problem in online learning (Zinkevich, 2003), we face two unique challenges due to the potentially unbounded queue lengths and the self-bounding analysis for network stability guarantees; see Section 3.1 for more discussions. These two requirements rule out existing OLO algorithms, and thus we have to design our own algorithm crafted towards the network optimization objective. Specifically, we design an OLO algorithm AdaPFOL (see Algorithm 2) that can handle occasionally large loss magnitudes and ensures a performance guarantee depending on all the losses. When plugging it into NSO, we are able to see that the link allocation plans perform well, as detailed in Theorem 3.5. Therefore, combining it with some reference policy assumption (1) and the Lyapunov drift analysis, we obtain the network stability guarantee of NSO in Theorem 3.6. A more detailed overview of NSO is in Section 3.1.

Regarding the utility maximization algorithm UMO², we have to decompose the Lyapunov drift-plus-penalty function into two parts. The first part is still linear in $\bm{Q}(t)$ and $\bm{a}(t)$, for which we can reuse the AdaPFOL algorithm. For the second part, as the utility function $g_{t}$ is an arbitrary time-varying concave function and we only receive bandit feedback (recall Table 1), OLO cannot capture it. Instead, we model this part as a Bandit Convex Optimization (BCO) problem (Flaxman et al., 2005). Unfortunately, again due to the potentially unbounded queue lengths and the self-bounding analysis, the loss functions’ magnitudes and Lipschitzness can be very large for some rounds but we still want to adapt to them – thus, existing BCO algorithms are also inapplicable. To this end, we develop a BCO algorithm AdaBGD (Algorithm 4) which allows loss functions with large magnitudes or Lipschitzness and enjoys a performance adaptive to the loss functions. When plugging AdaBGD into the UMO² framework, it can generate a good arrival rate sequence, as we analyze in Theorem 4.4. Therefore, similar to the analysis of NSO in Theorem 3.6, if we combine the AdaPFOL in Algorithm 2 and the AdaBGD in Algorithm 4, we are able to derive the utility maximization guarantee of UMO² in Theorem 4.5. Again, a more detailed overview of UMO² can be found in Section 4.1.

3 Network Stability in Adversarial Multi-Hop Networks

In this section, we take a first step towards our ultimate goal of multi-hop utility maximization, namely network stability (recall that Equation 3 requires Equation 2 as a condition). That is, this section only focuses on stabilizing the average number of jobs in the system $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]$ and does not consider utilities. The algorithm designed for this purpose (NSO in Algorithm 1) will serve as the network stability component of our utility maximization algorithm (UMO² in Algorithm 3).

One may observe that if we only want to ensure the network stability condition in Equation 2, it suffices to pick all arrival rate matrices $\bm{\lambda}(t)\equiv 0$. To avoid such trivial algorithms, we assume the arrival rates are adversarially chosen for now. That is, $\bm{\lambda}(t)\in[0,R]^{\lvert{\mathcal{N}}\rvert\times\lvert{\mathcal{N}}\rvert}$ is some arbitrary, unknown, and time-varying matrix following the oblivious adversary model. It is only revealed post-decision, at the end of round $t$. The rationale for assuming adversarial $\bm{\lambda}(t)$ is that in the UMO² algorithm for utility maximization, another algorithmic component decides $\bm{\lambda}(t)$, and the network stability component that we design in this section must adapt to such an arbitrary arrival rate matrix $\bm{\lambda}(t)$.

Algorithm 1 NSO: Network Stability via Online Linear Optimization

Input: Number of rounds $T$, set of servers ${\mathcal{N}}$ and links ${\mathcal{L}}$, maximum capacity $M$, feasible arrival rates $\Lambda$ (adversarial arrival rates given during execution – only assumed in this section). An online linear optimization algorithm AdaPFOL (Algorithm 2).

1: For each link $(n,m)\in{\mathcal{L}}$, initialize an instance of AdaPFOL with action set $\triangle({\mathcal{N}})$, denoted by $\texttt{AdaPFOL}_{n,m}$.
2: for $t=1,2,\ldots,T$ do
3:     For each link $(n,m)\in{\mathcal{L}}$, pass the maximum loss magnitude for this round, $M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}$, to $\texttt{AdaPFOL}_{n,m}$. Pick link allocation $\bm{a}_{n,m}(t)\in\triangle({\mathcal{N}})$ as the output of $\texttt{AdaPFOL}_{n,m}$.
4:     Observe arrival rates $\bm{\lambda}(t)\in\Lambda$. $\triangleright$ In the utility maximization algorithm UMO² (Algorithm 3), this step will be replaced by another algorithmic component.
5:     Observe capacities $\{C_{n,m}(t)\}_{(n,m)\in{\mathcal{L}}}$ and actual data transmissions $\{\mu_{n,m}^{(k)}(t)\}_{(n,m)\in{\mathcal{L}},k\in{\mathcal{N}}}$.
6:     Calculate queue lengths $\bm{Q}(t+1)$ from $\bm{Q}(t)$ according to Equation 1.
7:     For each link $(n,m)\in{\mathcal{L}}$, pass the loss vector $C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t))$ to $\texttt{AdaPFOL}_{n,m}$.
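To make the round structure of Algorithm 1 concrete, the sketch below runs the same loop on synthetic inputs. It is not the paper's method in two respects, both assumptions of this example: the per-link learner `ScaledHedge` is a simple exponential-weights placeholder and NOT AdaPFOL, and transmissions are set to their conditional means $C_{n,m}(t)a_{n,m}^{(k)}(t)$, with arrivals and capacities drawn uniformly at random instead of by an adversary. All names are hypothetical.

```python
import math
import random


class ScaledHedge:
    """Placeholder learner on the simplex (NOT AdaPFOL): Hedge with losses
    rescaled by the largest magnitude announced so far."""

    def __init__(self, dim, base_rate=0.5):
        self.cum = [0.0] * dim   # cumulative rescaled losses
        self.scale = 1.0         # running bound on loss magnitudes
        self.base_rate = base_rate

    def announce_magnitude(self, magnitude):
        self.scale = max(self.scale, magnitude)

    def predict(self):
        low = min(self.cum)  # shift for numerical stability
        weights = [math.exp(-self.base_rate * (c - low)) for c in self.cum]
        total = sum(weights)
        return [w / total for w in weights]

    def update(self, loss_vector):
        self.cum = [c + l / self.scale for c, l in zip(self.cum, loss_vector)]


def run_nso(num_servers, links, num_rounds, max_cap, max_arrival, seed=0):
    """Follow the round structure of Algorithm 1; return ||Q(t)||_1 per round."""
    rng = random.Random(seed)
    N = num_servers
    learners = {link: ScaledHedge(N) for link in links}
    Q = [[0.0] * N for _ in range(N)]
    backlog = []
    for _ in range(num_rounds):
        backlog.append(sum(map(sum, Q)))
        # Step 3: announce the magnitude bound, then pick a_{n,m}(t).
        alloc = {}
        for (n, m), learner in learners.items():
            diff = [Q[m][k] - Q[n][k] for k in range(N)]
            learner.announce_magnitude(max_cap * max(abs(d) for d in diff))
            alloc[(n, m)] = learner.predict()
        # Steps 4-5: arrivals and capacities are revealed only after deciding
        # (random draws stand in for the adversary; mu is set to its mean).
        lam = [[rng.uniform(0, max_arrival) if k != n else 0.0 for k in range(N)]
               for n in range(N)]
        cap = {link: rng.uniform(0, max_cap) for link in links}
        mu = {link: [cap[link] * a for a in alloc[link]] for link in links}
        # Step 7: feed the loss vector C_{n,m}(t) (Q_m(t) - Q_n(t)), built from Q(t).
        for (n, m), learner in learners.items():
            learner.update([cap[(n, m)] * (Q[m][k] - Q[n][k]) for k in range(N)])
        # Step 6: queue update of Equation 1.
        new_Q = [[0.0] * N for _ in range(N)]
        for n in range(N):
            for k in range(N):
                if k == n:
                    continue
                out = sum(mu[(a, b)][k] for (a, b) in links if a == n)
                inn = sum(mu[(a, b)][k] for (a, b) in links if b == n)
                new_Q[n][k] = max(Q[n][k] - out, 0.0) + inn + lam[n][k]
        Q = new_Q
    return backlog
```

A call such as `run_nso(3, [(0, 1), (1, 0), (1, 2), (2, 1)], 200, max_cap=4.0, max_arrival=0.5)` returns the backlog trajectory on a small bidirectional line network. The placeholder learner carries none of the guarantees stated for AdaPFOL, so the sketch illustrates the information flow only, not the stability result.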

3.1 Motivation of Our Algorithmic Framework

In Algorithm 1, we present Network Stability via Online Linear Optimization (NSO), an algorithmic framework which achieves stability in adversarial multi-hop networks under bandit feedback. One key ingredient of NSO is the plug-in Online Linear Optimization (OLO) algorithm AdaPFOL. Before going into details of the AdaPFOL algorithm, we first introduce why we need it.

The design of NSO is based on the famous Lyapunov drift analysis (Neely, 2010a, §4). Conducting standard Lyapunov analysis on the network dynamics defined in Equation 1, we are able to derive

$$\begin{aligned}&-\frac{1}{2}N^{2}\left((NM)^{2}+2(NM)^{2}+2R^{2}\right)T\\ &\quad\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)+\lambda_{n}^{(k)}(t)-\sum_{(n,m)\in{\mathcal{L}}}\mu_{n,m}^{(k)}(t)\right)\right],\end{aligned} \tag{4}$$

whose formal statement and proof can be found in Lemma 3.2.
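For orientation, the one-step calculation behind Equation 4 can be sketched as follows. This is our reconstruction of the standard quadratic Lyapunov argument, not the statement of Lemma 3.2. It assumes $N=\lvert{\mathcal{N}}\rvert$, the Lyapunov function $L(t)=\frac{1}{2}\sum_{n,k}Q_{n}^{(k)}(t)^{2}$ (our notation), empty initial queues $\bm{Q}(1)=\bm{0}$, and the elementary inequality $([a-b]_{+}+c)^{2}\leq a^{2}+b^{2}+c^{2}+2a(c-b)$ for $a,b,c\geq 0$.

```latex
% Per-queue bound with a = Q_n^{(k)}(t),
% b = \sum_{(n,m)} \mu_{n,m}^{(k)}(t) \le NM, and
% c = \sum_{(o,n)} \mu_{o,n}^{(k)}(t) + \lambda_n^{(k)}(t), so c^2 \le 2(NM)^2 + 2R^2:
\begin{align*}
Q_n^{(k)}(t+1)^2 - Q_n^{(k)}(t)^2
  &\le (NM)^2 + 2(NM)^2 + 2R^2 \\
  &\quad + 2\,Q_n^{(k)}(t)\Big(\sum_{(o,n)\in\mathcal{L}} \mu_{o,n}^{(k)}(t)
     + \lambda_n^{(k)}(t) - \sum_{(n,m)\in\mathcal{L}} \mu_{n,m}^{(k)}(t)\Big).
\end{align*}
% Sum over the N^2 queues, halve, telescope over t = 1, ..., T,
% take expectations, and use L(T+1) \ge 0 = L(1):
\begin{align*}
0 \le \mathbb{E}\big[L(T+1) - L(1)\big]
  &\le \tfrac{1}{2} N^2 \big((NM)^2 + 2(NM)^2 + 2R^2\big)\, T \\
  &\quad + \mathbb{E}\Big[\sum_{t=1}^{T} \sum_{n,k} Q_n^{(k)}(t)
     \Big(\sum_{(o,n)\in\mathcal{L}} \mu_{o,n}^{(k)}(t) + \lambda_n^{(k)}(t)
     - \sum_{(n,m)\in\mathcal{L}} \mu_{n,m}^{(k)}(t)\Big)\Big].
\end{align*}
```

Moving the constant term to the left-hand side gives the form of Equation 4. The case $k=n$ is consistent with the per-queue bound because those queues are identically zero.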

Based on this inequality, a Lyapunov-drift-based algorithm can be constructed by minimizing the RHS of Equation 4 (Neely, 2010a, §4). As the arrival rates $\lambda$ are regarded as fixed in this section, we may focus only on the terms related to $\mu$. Thus, minimizing the RHS of Equation 4 is equivalent to

$$\begin{aligned}\text{Minimizing}\quad&\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\right]\\ =\ &\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\langle\bm{Q}_{m}(t)-\bm{Q}_{n}(t),\bm{\mu}_{n,m}(t)\rangle\right]\\ =\ &\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}_{n,m}(t)\rangle\right],\end{aligned} \tag{5}$$

where the last step uses the assumption that $\operatornamewithlimits{\mathbb{E}}[\bm{\mu}_{n,m}(t)]=\operatornamewithlimits{\mathbb{E}}[C_{n,m}(t)\bm{a}_{n,m}(t)]$.
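The first step, passing from the per-queue sum in Equation 4 to the per-link sum in Equation 5, is a re-indexing: each $\mu_{n,m}^{(k)}(t)$ is an inflow at server $m$ and an outflow at server $n$. The sketch below checks this identity numerically for a single round on random data. The function names and data layout are assumptions made for this example.

```python
import random


def node_form(Q, mu, links):
    """sum_n sum_k Q[n][k] * (inflow - outflow): the mu-part of Equation 4."""
    N = len(Q)
    total = 0.0
    for n in range(N):
        for k in range(N):
            inflow = sum(mu[(o, d)][k] for (o, d) in links if d == n)
            outflow = sum(mu[(o, d)][k] for (o, d) in links if o == n)
            total += Q[n][k] * (inflow - outflow)
    return total


def link_form(Q, mu, links):
    """sum over links (n, m) of <Q_m - Q_n, mu_{n,m}>: the form in Equation 5."""
    N = len(Q)
    return sum(
        (Q[m][k] - Q[n][k]) * mu[(n, m)][k] for (n, m) in links for k in range(N)
    )
```

Both functions return the same value, up to floating-point error, for any queue lengths and transmissions, which is why the optimization can be distributed over the links.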

For illustration purposes, let us focus on a single data link $(n,m)\in{\mathcal{L}}$ . Motivated by Huang et al. (2024), we consider designing a scheduling algorithm via minimizing the following expectation:

$$\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}_{n,m}(t)\rangle\right].$$

Remark 1.

While having similarities, this objective is different from that of Huang et al. (2024) in two aspects:

First, the network topology in (Huang et al., 2024) is a single-server, single-hop one, thus it suffices to conduct the Lyapunov drift optimization on the centralized server. In contrast, due to our multi-hop topology, our optimization task Equation 5 has to be distributed onto every data link $(n,m)\in{\mathcal{L}}$ and extra efforts are needed to ensure a good overall scheduling effect.

Second, the coefficient $C_{n,m}(t)(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t))$ before $a_{n,m}^{(k)}(t)$ can be either positive or negative, whereas that of Huang et al. (2024) is always non-negative. Such a negativity also increases the difficulty as many online learning algorithms are typically bad at handling potentially negative losses, see, e.g., (Zheng et al., 2019, Dai et al., 2023).
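The sign behaviour noted in the second point can be seen in a small sketch that builds the coefficient vector for one link and evaluates the linear loss of an allocation. The function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
def link_loss_vector(capacity, Q_n, Q_m):
    """Coefficients C_{n,m}(t) * (Q_m^{(k)}(t) - Q_n^{(k)}(t)), one per commodity k."""
    return [capacity * (qm - qn) for qn, qm in zip(Q_n, Q_m)]


def linear_loss(loss_vector, allocation):
    """Loss <loss_vector, allocation> incurred by an allocation on the simplex."""
    return sum(l * a for l, a in zip(loss_vector, allocation))
```

With capacity 2, upstream backlogs `[5.0, 0.0, 1.0]` and downstream backlogs `[1.0, 3.0, 1.0]`, the coefficients are `[-8.0, 6.0, 0.0]`: sending commodity 0 relieves a longer upstream queue (negative loss), whereas sending commodity 1 pushes jobs towards a longer downstream queue (positive loss).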

Recall that $\bm{a}_{n,m}(t)\in\triangle({\mathcal{N}})$ can be any probability distribution from the simplex. Hence, one may view $\triangle({\mathcal{N}})$ as the action set in round $t$ . Moreover, for an action $\bm{a}$ from the action set $\triangle({\mathcal{N}})$ , picking it in round $t$ will incur a loss $\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}\rangle$ – which is linear in $\bm{a}$ . Thus, this problem belongs to the class of Online Linear Optimization (OLO) problems (Zinkevich, 2003, McMahan and Streeter, 2010, Duchi et al., 2011), whose formal definition will be presented later as Definition 3.3. Although our problem belongs to the OLO formulation, we face significantly different challenges due to our network optimization context:

i) In our task of minimizing $\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}_{n,m}(t)\rangle]$, the magnitude of the loss $C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t))$ can occasionally be large because $\bm{Q}_{n}(t)$ or $\bm{Q}_{m}(t)$ may be unbounded. Note that, although our system stability condition (Equation 2) requires $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]$ to be small, individual $\bm{Q}(t)$’s inside the average and expectation are still allowed to be large. However, existing algorithms in OLO mostly require the losses to be uniformly bounded by a constant (see, e.g., (McMahan and Streeter, 2010, Cutkosky, 2020)), which means extra efforts should be made to handle occasionally large losses.

ii)

Moreover, we also want our performance to depend on the geometric mean of all loss magnitudes (which in turn relates to queue lengths since the losses $C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t))$ depend on $\lVert\bm{Q}(t)\rVert$ ). On a high level, this can be understood as follows: The average queue length $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]$ can be controlled by the OLO performance via some other arguments (see Section 3.2). Thus, if we can additionally show that OLO performance is bounded by queue lengths, we can conduct a self-bounding analysis on the queue lengths which informally reads

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]\lesssim\text{Online Learning Performance}\lesssim\operatorname{\mathcal{O}}_{T}(T)+o_{T}(T^{1/4})\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}$
	$\displaystyle\Longrightarrow\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]=\operatorname{\mathcal{O}}_{T}(1),\textit{ i.e.},\text{ the system is stabilized}.$		(6)

More details regarding the self-bounding analysis can be found in Section 3.5. Nevertheless, it suffices to remember that a performance depending on all loss magnitudes is beneficial.

In Section 3.4, we introduce our novel OLO algorithm of AdaPFOL (Algorithm 2) that enjoys these two properties. Equipped with such an algorithm, we can minimize Equation 5 and achieve network stability guarantee by deploying it on every link $(n,m)\in{\mathcal{L}}$ for $T$ rounds with action set $\triangle({\mathcal{N}})$ . This idea of deploying AdaPFOL onto every link exactly gives our NSO framework in Algorithm 1.
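To make the per-link OLO instance concrete, the following minimal Python sketch computes, for a single link $(n,m)$ , the loss vector $C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t))$ , the magnitude bound $M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}$ that is available before acting, and the linear loss of an allocation. The function names and the toy numbers are illustrative assumptions and do not come from the paper.

```python
from typing import List, Sequence


def link_loss_vector(capacity: float, q_n: Sequence[float], q_m: Sequence[float]) -> List[float]:
    """Loss vector C_{n,m}(t) * (Q_m(t) - Q_n(t)), one entry per commodity k."""
    return [capacity * (qm - qn) for qn, qm in zip(q_n, q_m)]


def magnitude_bound(max_capacity: float, q_n: Sequence[float], q_m: Sequence[float]) -> float:
    """Bound M * ||Q_m(t) - Q_n(t)||_inf; it needs only the queues, not C_{n,m}(t)."""
    return max_capacity * max(abs(qm - qn) for qn, qm in zip(q_n, q_m))


def linear_loss(loss_vec: Sequence[float], allocation: Sequence[float]) -> float:
    """Inner product <loss_vec, allocation> for an allocation on the simplex."""
    return sum(g * a for g, a in zip(loss_vec, allocation))


# Toy instance (hypothetical numbers): three commodities, capacity 2 <= M = 4.
q_n = [5.0, 0.0, 2.0]
q_m = [1.0, 3.0, 2.0]
g = link_loss_vector(2.0, q_n, q_m)   # entries may be negative
G = magnitude_bound(4.0, q_n, q_m)    # known before C_{n,m}(t) is observed
print(g, G, linear_loss(g, [1.0, 0.0, 0.0]))
```

In this toy instance the most negative entry belongs to the commodity whose backlog at $n$ exceeds that at $m$ by the largest amount, so putting all mass on it gives the smallest linear loss.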

Therefore, we are able to analyze the network stability effect when the NSO framework is equipped with AdaPFOL in Algorithm 2, which we do in the rest of this section: We introduce our reference policy assumptions in Section 3.2, conduct the Lyapunov drift analysis in Section 3.3, introduce and analyze the novel AdaPFOL algorithm in Section 3.4, and present our final analysis in Section 3.5.

3.2 Reference Policy Assumption

We first make the following multi-hop piecewise stability assumption, which, informally speaking, assumes that there exists a reference policy that stabilizes the system piecewisely. It is an extension of the piecewise stability assumption (Huang et al., 2024, Assumption 1) to multi-hop cases.

Assumption 1 (Multi-Hop Piecewise Stability for Network Stability).

There exists a reference action sequence $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ (where $\mathring{\bm{a}}(t)=\{\mathring{\bm{a}}_{n,m}(t)\in\triangle({\mathcal{N}})\}_{(n,m)\in{\mathcal{L}}}$ , in analogy to the scheduler’s action sequence), such that there are some constants $C_{W}\geq 0$ , $\epsilon_{W}\geq 0$ and a partition $W_{1},W_{2},\ldots,W_{J}$ of $[T]$ (that is, a collection of non-intersecting intervals whose union is $[T]$ ) which ensure that $\sum_{j=1}^{J}(\lvert W_{j}\rvert-1)^{2}\leq C_{W}T$ and

	$\displaystyle\frac{1}{\lvert W_{j}\rvert}\sum_{t\in W_{j}}\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)\geq\epsilon_{W}+\frac{1}{\lvert W_{j}\rvert}\sum_{t\in W_{j}}\left(\lambda_{n}^{(k)}(t)+\sum_{(o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)\right),$
	$\displaystyle\quad\forall j\in[J],n\in{\mathcal{N}},k\in{\mathcal{N}},$		(7)

where $\lambda_{n}^{(k)}(t)$ is the obliviously decided arrival rates that we assume in this section.

Intuitively, Assumption 1 means that there exists some “good” action sequence $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ making the network stable, in the sense that there are multiple windows $W_{1},W_{2},\ldots,W_{J}$ such that in expectation, for each window $W_{j}$ and for each queue $Q_{n}^{(k)}$ , the average service rate it receives (whose expectation is $\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)$ in round $t$ ), is strictly more than its net arrival rate (which includes both external data flows $\lambda_{n}^{(k)}(t)$ and internal data flows that are forwarded from other queues $\sum_{(o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)$ ), by a constant gap of at least $\epsilon_{W}$ .
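The window condition can be made tangible with a small Python sketch that evaluates, for one window and one queue $Q_{n}^{(k)}$ , the gap between the average service rate and the average net arrival rate in Equation 7; the assumption asks this gap to be at least $\epsilon_{W}$ . The data layout (dictionaries keyed by links and by node/commodity pairs) and the function names are hypothetical choices made only for this illustration.

```python
from typing import Dict, List, Sequence, Tuple

Link = Tuple[int, int]


def window_slack(window: Sequence[int], n: int, k: int, links: Sequence[Link],
                 cap: List[Dict[Link, float]],
                 a_ref: List[Dict[Link, Dict[int, float]]],
                 lam: List[Dict[Tuple[int, int], float]]) -> float:
    """Average service minus average (external + forwarded) arrivals for queue (n, k)."""
    size = len(window)
    # Service: capacity on outgoing links (n, m) allocated to commodity k.
    service = sum(cap[t][l] * a_ref[t][l].get(k, 0.0)
                  for t in window for l in links if l[0] == n)
    # Internal inflow: capacity on incoming links (o, n) allocated to commodity k.
    inflow = sum(cap[t][l] * a_ref[t][l].get(k, 0.0)
                 for t in window for l in links if l[1] == n)
    external = sum(lam[t].get((n, k), 0.0) for t in window)
    return service / size - (external + inflow) / size


def window_budget_ok(windows: Sequence[Sequence[int]], c_w: float) -> bool:
    """Checks sum_j (|W_j| - 1)^2 <= C_W * T for a partition of the horizon."""
    horizon = sum(len(w) for w in windows)
    return sum((len(w) - 1) ** 2 for w in windows) <= c_w * horizon


# Toy line network 0 -> 1 -> 2 carrying commodity 2 (hypothetical numbers).
links = [(0, 1), (1, 2)]
cap = [{(0, 1): 2.0, (1, 2): 3.0}] * 2
a_ref = [{(0, 1): {2: 1.0}, (1, 2): {2: 1.0}}] * 2
lam = [{(0, 2): 1.0}] * 2
print(window_slack([0, 1], 0, 2, links, cap, a_ref, lam))
```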

Remark 2.

Such assumptions are typical in the network optimization literature. In the case when the network is stationary, Assumption 1 recovers the classical capacity region assumption in SNO (Neely, 2010a). However, extending this condition to adversarial networks is highly non-trivial. For adversarial networks, an alternative assumption is the $(W,\epsilon)$ -constrained dynamics assumption (Liang and Modiano, 2018b), which roughly says Equation 7 holds for every window of size $W$ . Assumption 1 thus allows more flexibility. Finally, Assumption 1 can be viewed as a generalization of the piecewise stability assumption (Huang et al., 2024), which was crafted for a single centralized server.

Before moving on, we shall remark that the reference action sequence $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ in Assumption 1 is unknown to the scheduler. Instead, the scheduler needs to learn its own way of stabilizing the network via observations. To characterize the ability of $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ in stabilizing the network, the following lemma controls the average queue length resulting from any scheduling policy.

Lemma 3.1 (Ability of $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ in Stabilizing the Network).

If $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ satisfies Assumption 1, then for any scheduler-generated queue lengths $\{\bm{Q}(t)\}_{t\in[T]}$ ,

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}C_{n,m}(t)\mathring{a}_{n,m}^{(% k)}(t)(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E% }}\left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(% k)}(t)\lambda_{n}^{(k)}(t)\right].$

Lemma 3.1 says the Lyapunov drift (defined later) under $\{\mathring{\bm{a}}(t)\}_{t=1}^{T}$ is always negative, which is useful when analyzing queue-based policies (Neely, 2010a, §3.1). Its proof can be found in Section B.1.

3.3 Lyapunov Drift Analysis

We carry out our analysis based on the Lyapunov drift analysis (Neely, 2010a, §4), which considers the Lyapunov function $L_{t}$ and its drift $\Delta(\bm{Q}(t))$ , defined as follows:

\text{Lyapunov function }L_{t}=\frac{1}{2}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\left(Q_{n}^{(k)}(t)\right)^{2},\quad\text{Lyapunov drift }\Delta(\bm{Q}(t))=\operatornamewithlimits{\mathbb{E}}[L_{t+1}-L_{t}\mid\bm{Q}(t)].
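As a small illustration of these two definitions, the sketch below evaluates the Lyapunov function for a queue matrix and one realized sample of $L_{t+1}-L_{t}$ ; the drift is the conditional expectation of such samples. The nested-list layout `Q[n][k]` and the function names are assumptions made for this example.

```python
from typing import Sequence


def lyapunov(Q: Sequence[Sequence[float]]) -> float:
    """L_t = 1/2 * sum over servers n and commodities k of Q[n][k]^2."""
    return 0.5 * sum(q * q for row in Q for q in row)


def realized_drift(Q_now: Sequence[Sequence[float]], Q_next: Sequence[Sequence[float]]) -> float:
    """One realized sample of L_{t+1} - L_t (not the conditional expectation itself)."""
    return lyapunov(Q_next) - lyapunov(Q_now)


# Moving one job from a long queue to an empty one lowers the Lyapunov function.
print(realized_drift([[3, 0], [0, 4]], [[2, 1], [0, 4]]))
```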

We give the following result, which follows almost directly from applying the classical Lyapunov drift analysis to the queue dynamics in Equation 1. The proof is standard and thus deferred to Section B.2.

Lemma 3.2 (Lyapunov Drift Analysis).

Under the queue dynamics of Equation 1,

	$\displaystyle 0\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \Delta(\bm{Q}(t))\right]$	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{% (n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)\left(Q_{m}^{(% k)}(t)-Q_{n}^{(k)}(t)\right)+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q% _{n}^{(k)}(t)\lambda_{n}^{(k)}(t)\right]+$
		$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T.$		(8)

As sketched in Section 3.1, our algorithm is designed to approximately minimize the RHS of Equation 8 via online learning. Besides the additive constant, the RHS contains two terms, $\operatornamewithlimits{\mathbb{E}}[\sum_{(n,m)\in{\mathcal{L}}}\langle\bm{\mu}_{n,m}(t),\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rangle]$ and $\operatornamewithlimits{\mathbb{E}}[\sum_{n\in{\mathcal{N}}}\langle\bm{Q}_{n}(t),\bm{\lambda}_{n}(t)\rangle]$ . As $\lambda_{n}^{(k)}(t)$ are obliviously chosen rather than decided by the scheduler, the second term is treated as a constant here. Therefore, it remains to minimize the following term:

\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\langle\bm{\mu}_{n,m}(t),\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rangle\right]=\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}_{n,m}(t)\rangle\right],\quad\forall(n,m)\in{\mathcal{L}}.		(9)

For each data link $(n,m)\in{\mathcal{L}}$ , Equation 9 corresponds to an Online Linear Optimization (OLO) problem with $\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}_{n,m}(t)\rangle$ being the loss for round $t$ . In the next section, we first rigorously define the OLO problem in Definition 3.3 and then present our novel algorithm that tackles the unique challenges we face in network optimization contexts – potentially large losses due to unbounded queue lengths (recall Equation 9), and adapting to all the loss magnitudes because we want to conduct a self-bounding analysis on the queue lengths (recall Equation 6).

3.4 AdaPFOL: Learning for Network Stability

In this section, we take a small detour to the OLO problem mentioned in Section 3.1. We first rigorously define the OLO problem (Zinkevich, 2003, McMahan and Streeter, 2010, Duchi et al., 2011) in Definition 3.3. Then, we present the construction of our novel OLO algorithm AdaPFOL (Algorithm 2). Finally, we prove that plugging it into the NSO framework in Algorithm 1 indeed ensures a good optimization effect on Equation 9.

Definition 3.3 (Online Linear Optimization).

Consider a $T$ -round game. Every round $t\in[T]$ , the player selects an action $\bm{x}_{t}$ from a convex set $\mathcal{X}\subseteq\mathbb{R}^{d}$ . The environment simultaneously decides a loss vector $\bm{g}_{t}\in\mathbb{R}^{d}$ such that the loss of the player for round $t$ is $\langle\bm{x}_{t},\bm{g}_{t}\rangle$ . The player will observe the whole vector of $\bm{g}_{t}$ (i.e., full-information feedback is available, instead of the more restrictive bandit feedback model). Dynamic regret minimization in OLO considers minimizing

\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},\ldots,\mathring{\bm{x}}_{T})=\sum_{t=1}^{T}\langle\bm{g}_{t},\bm{x}_{t}-\mathring{\bm{x}}_{t}\rangle,\quad\forall\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},\ldots,\mathring{\bm{x}}_{T}\in\mathcal{X}.
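The dynamic regret in Definition 3.3 is a plain sum of inner products, which the following short sketch evaluates for given loss vectors, played actions, and a comparator sequence. The function name and the toy sequences are illustrative assumptions.

```python
from typing import Sequence


def dynamic_regret(losses: Sequence[Sequence[float]],
                   actions: Sequence[Sequence[float]],
                   comparators: Sequence[Sequence[float]]) -> float:
    """Sum over rounds of <g_t, x_t - comparator_t>."""
    return sum(
        sum(g_i * (x_i - u_i) for g_i, x_i, u_i in zip(g, x, u))
        for g, x, u in zip(losses, actions, comparators)
    )


# A player who always plays the uniform action against a comparator that
# switches to the better coordinate in each round (hypothetical numbers).
losses = [[1.0, -1.0], [-2.0, 2.0]]
actions = [[0.5, 0.5], [0.5, 0.5]]
comparators = [[0.0, 1.0], [1.0, 0.0]]
print(dynamic_regret(losses, actions, comparators))
```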

Before moving on, we recall the two challenges i) and ii) in Section 3.1: Due to potentially unbounded queue lengths, AdaPFOL must withstand large and possibly negative losses. Meanwhile, as we want to conduct the self-bounding analysis in Equation 6, AdaPFOL shall additionally enjoy a performance guarantee ( $\text{D-Regret}_{T}^{\text{OLO}}$ in Definition 3.3) depending on the geometric mean of all the loss magnitudes. Thus, our ideal algorithm for Definition 3.3 must satisfy the following:

i)

it can withstand occasionally large loss magnitudes, i.e., $\sup_{t}\lVert\bm{g}_{t}\rVert_{\infty}$ can be large, and
ii)

it enjoys a performance guarantee depending on all the loss magnitudes, e.g., $\sqrt{\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}$ .

Algorithm 2 AdaPFOL: Adaptive Parameter-Free Online Learning

Input: Action set $\mathcal{X}$ . For each round $t$ , the maximum loss magnitude $G_{t}$ will be revealed at the beginning, and a loss vector $\bm{g}_{t}$ satisfying $\lVert\bm{g}_{t}\rVert_{\infty}\leq G_{t}$ will be given in the end.

1: Set $G\leftarrow 1$ . Initialize an instance $\mathcal{A}$ of the algorithm in Lemma B.4 with action set $\mathcal{X}$ .
2: for $t=1,2,\ldots$ do
3:     Observe the maximum loss magnitude $G_{t}>0$ for this round.
4:     if $G_{t}>G$ then
5:         Set $G\leftarrow 2G_{t}$ . Reset $\mathcal{A}$ as a new instance of the algorithm in Lemma B.4 with action set $\mathcal{X}$ .
6:     Output the output of $\mathcal{A}$ . Observe loss vector $\bm{g}_{t}$ (such that $\lVert\bm{g}_{t}\rVert_{\infty}\leq G_{t}$ ). Feed $G^{-1}\bm{g}_{t}$ into $\mathcal{A}$ .

To design such an algorithm, we build upon the Parameter-Free Online Learning (PFOL) algorithm by Cutkosky (2020). It ensures condition ii) by enjoying $\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},\ldots,\mathring{\bm{x}}_{T})\propto\sqrt{\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}$ (Cutkosky, 2020, Theorem 6), but fails to bear large loss magnitudes as it requires $\lVert\bm{g}_{t}\rVert_{\infty}\leq 1$ for all $t\in[T]$ .

Fortunately, as $C_{n,m}(t)\in[0,M]$ , we know $\lVert\bm{g}_{t}\rVert_{\infty}\leq M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}$ . Even better, $\bm{Q}(t)$ can be calculated at the beginning of round $t$ , before deciding $\bm{a}(t)$ . Utilizing this knowledge, we are able to design our OLO algorithm which enjoys both properties i) and ii). We call this algorithm AdaPFOL (Adaptive Parameter-Free Online Learning), whose pseudo-code is presented in Algorithm 2. AdaPFOL applies a doubling technique to the PFOL algorithm of Cutkosky (2020): it restarts PFOL whenever the observed $G_{t}$ exceeds the current scale $G$ . We can show that this only introduces a logarithmic overhead as the original PFOL algorithm also enjoys ii).
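The doubling-and-restart logic of Algorithm 2 can be sketched in a few lines of Python. The base learner used by the paper (the algorithm in Lemma B.4) is not reproduced here; the sketch substitutes a plain exponential-weights learner on the simplex, named `HedgeStandIn`, purely as a placeholder. This stand-in does not carry the parameter-free guarantee, and all class and method names are hypothetical.

```python
import math
from typing import Callable, List


class HedgeStandIn:
    """Placeholder base learner (exponential weights); NOT the Lemma B.4 algorithm."""

    def __init__(self, dim: int, lr: float = 0.5):
        self.cum = [0.0] * dim  # cumulative rescaled losses
        self.lr = lr

    def predict(self) -> List[float]:
        m = min(self.cum)  # shift for numerical stability
        w = [math.exp(-self.lr * (c - m)) for c in self.cum]
        s = sum(w)
        return [x / s for x in w]

    def update(self, g: List[float]) -> None:
        self.cum = [c + gi for c, gi in zip(self.cum, g)]


class AdaPFOLSketch:
    """Doubling wrapper: restart the base learner whenever G_t exceeds the scale G."""

    def __init__(self, make_base: Callable[[], HedgeStandIn]):
        self.make_base = make_base
        self.G = 1.0
        self.base = make_base()
        self.restarts = 0

    def act(self, G_t: float) -> List[float]:
        """Start of round t: G_t > 0 is revealed, then an action is output."""
        if G_t > self.G:
            self.G = 2.0 * G_t
            self.base = self.make_base()
            self.restarts += 1
        return self.base.predict()

    def feed(self, g_t: List[float]) -> None:
        """End of round t: the rescaled loss g_t / G has sup-norm at most 1."""
        self.base.update([gi / self.G for gi in g_t])


learner = AdaPFOLSketch(lambda: HedgeStandIn(2))
for G_t in [0.5, 3.0, 4.0, 7.0]:
    x = learner.act(G_t)
    learner.feed([G_t, -G_t])
print(learner.restarts, learner.G)
```

Because every restart at least doubles the scale, the number of restarts grows only logarithmically in the largest observed magnitude, which is the intuition behind the logarithmic overhead mentioned above.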

The AdaPFOL algorithm enjoys the following dynamic regret guarantee, satisfying both i) and ii):

Lemma 3.4 (Guarantee of AdaPFOL Algorithm).

Consider the OLO problem in Definition 3.3. Let the action set $\mathcal{X}$ have diameter $D=\sup_{\bm{x},\bm{y}\in\mathcal{X}}\lVert\bm{x}-\bm{y}\rVert_{1}$ . Suppose that $\lVert\bm{g}_{t}\rVert_{\infty}\leq G_{t}$ , where $G_{t}$ is some $\mathcal{F}_{t-1}$ -measurable random variable and $(\mathcal{F}_{t})_{t=0}^{T}$ is the natural filtration, i.e., $\mathcal{F}_{t}$ is the $\sigma$ -algebra generated by all random observations made during the first $t$ rounds. Then, AdaPFOL (Algorithm 2) ensures that for any comparator sequence $\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},\ldots,\mathring{\bm{x}}_{T}\in\mathcal{X}$ , if $\max_{t\in[T]}G_{t}\geq 1$ , then

\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},% \ldots,\mathring{\bm{x}}_{T})=\operatorname{\mathcal{O}}\left(\sqrt{D\left(D+% \sum_{t=1}^{T-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert_{1}% \right)}\sqrt{\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}\log T\log% \left(\max_{t=1}^{T}G_{t}\right)\right).

Remark 3.

Note that in general, it is impossible to guarantee $\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},% \ldots,\mathring{\bm{x}}_{T})=o_{T}(T)$ simultaneously for all $\{\mathring{\bm{x}}_{t}\in\mathcal{X}\}_{t\in[T]}$ (Zinkevich, 2003). Therefore, many dynamic regret bounds, including ours, depend on the notion of path length $P_{T}=\sum_{t=1}^{T-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert$ . Although the path length is linear in $T$ in the worst case, $\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},% \ldots,\mathring{\bm{x}}_{T})=o_{T}(T)$ can still be ensured in cases where $P_{T}=o_{T}(T)$ .

The proof of Lemma 3.4 will be presented in Section B.3. It can be seen that AdaPFOL indeed satisfies both i) and ii): It allows the loss magnitudes $\lVert\bm{g}_{t}\rVert_{\infty}$ to be large, and also enjoys a magnitude-aware dynamic regret guarantee of $\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},% \ldots,\mathring{\bm{x}}_{T})\propto\sqrt{\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert% _{\infty}^{2}}$ .

Therefore, if we deploy AdaPFOL (Algorithm 2) to decide $\bm{a}_{n,m}(t)$ on each link $(n,m)\in{\mathcal{L}}$ as we do in Algorithm 1, the RHS of Equation 8 can consequently be minimized in the sense that it is close to that induced by the reference actions $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ . Formally, we give the following theorem:

Theorem 3.5 (Deciding $\bm{a}(t)$ via AdaPFOL Algorithm).

For each link $(n,m)\in{\mathcal{L}}$ , as we did in NSO, we execute an instance of AdaPFOL (Algorithm 2) where $\mathcal{X}=\triangle({\mathcal{N}})$ , $\bm{g}_{t}=C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t))$ , and $G_{t}=M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}$ . We use its output $\bm{x}_{t}$ as $\bm{a}_{n,m}(t)$ in every round $t$ . Let $\mu_{n,m}^{(k)}(t)$ be the number of actually transmitted jobs from $Q_{n}^{(k)}(t)$ to $Q_{m}^{(k)}(t+1)$ induced by $a_{n,m}^{(k)}(t)$ .

Consider an arbitrary reference action sequence $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ satisfying Assumption 1. Let $\mathring{\mu}_{n,m}^{(k)}(t)=C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)\in[0,M]$ (as $C_{n,m}(t)\in[0,M]$ and $\mathring{a}_{n,m}^{(k)}(t)\in[0,1]$ ). Then

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}(\mu_{n,m}^{(k)}(t)-\mathring{% \mu}_{n,m}^{(k)}(t))\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)\text{,}$

where $P_{T}^{a}\triangleq\sum_{t=1}^{T-1}\sum_{(n,m)\in{\mathcal{L}}}\lVert\mathring% {\bm{a}}_{n,m}(t)-\mathring{\bm{a}}_{n,m}(t+1)\rVert_{1}$ is the path length of $\{\mathring{\bm{a}}(t)\}_{t=1}^{T}$ .

The proof follows almost directly from Lemma 3.4, so we postpone it to Section B.4. Thanks to property ii), the RHS of Theorem 3.5 depends on the $\ell_{2}$ -norm of the queue lengths $\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}}$ . As we will see shortly, this is pivotal to the self-bounding argument sketched in Equation 6.

3.5 Main Theorem for Multi-Hop Network Stability

As sketched in Section 3.1, by putting the previous conclusions together and using a so-called self-bounding property, the following guarantee for multi-hop network stability can be derived:

Theorem 3.6 (Main Theorem for Multi-Hop Network Stability).

Suppose that $\{\mathring{\bm{a}}_{n,m}(t)\in\triangle({\mathcal{N}})\}_{(n,m)\in{\mathcal{L}},t\in[T]}$ satisfies Assumption 1 and its path length satisfies the following condition (which comes from (Huang et al., 2024, Assumption 2); as discussed in Remark 3, such conditions are necessary):

P_{t}^{a}\triangleq\sum_{s=1}^{t-1}\sum_{(n,m)\in{\mathcal{L}}}\lVert\mathring{\bm{a}}_{n,m}(s)-\mathring{\bm{a}}_{n,m}(s+1)\rVert_{1}\leq C^{a}t^{1/2-\delta_{a}},\quad\forall t=1,2,\ldots,T,		(10)

where $C^{a}$ and $\delta_{a}$ are assumed to be known constants but the precise $P_{t}^{a}$ or $\{\mathring{\bm{a}}_{n,m}(t)\in\triangle({\mathcal{N}})\}_{(n,m)\in{\mathcal{L}},t\in[T]}$ both remain unknown. Then, if we execute the NSO framework in Algorithm 1 with AdaPFOL defined in Algorithm 2, the following performance guarantee is enjoyed:

\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]=\operatorname{\mathcal{O}}\left(\frac{(N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}+(N^{4}M^{2}+N^{2}R^{2})}{\epsilon_{W}}\right)+o_{T}(1).

That is, when $T\gg 0$ , we have $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]=\operatorname{\mathcal{O}}_{T}(1)$ , i.e., Equation 2 holds and the system is stable.

Remark 4.

Any reference policy $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ satisfying Equation 10 is called “mildly varying”. As mentioned in Remark 3, it is impossible to achieve non-trivial performance without restricting the reference sequence. We also compare our result with two previous results (Huang et al., 2024, Yang et al., 2023) which also ensured network stability in adversarial networks under bandit feedback: Under a path length assumption similar to but looser than Equation 10, Huang et al. (2024) stabilized single-server networks. By assuming the environment (instead of the reference policy which we do) is mildly varying, Yang et al. (2023) stabilized single-hop networks. Thus, Theorem 3.6 is the first guarantee applicable to adversarial multi-hop networks under bandit feedback.
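To illustrate what “mildly varying” means operationally, the sketch below computes the prefix path lengths $P_{1}^{a},\ldots,P_{T}^{a}$ of a reference sequence and checks the bound $P_{t}^{a}\leq C^{a}t^{1/2-\delta_{a}}$ for every $t$ . The data layout (one dictionary per round mapping each link to a probability vector) and the function names are assumptions made for this example; recall that in the paper the reference sequence is unknown to the scheduler, so such a check is only a thought experiment.

```python
from typing import Dict, List, Sequence, Tuple

Link = Tuple[int, int]


def prefix_path_lengths(ref_actions: Sequence[Dict[Link, Sequence[float]]]) -> List[float]:
    """Returns [P_1, ..., P_T], where P_t sums the l1 changes over the first t-1 steps."""
    prefix = [0.0]  # P_1 is an empty sum
    for prev, cur in zip(ref_actions, ref_actions[1:]):
        step = sum(abs(p - c) for link in prev for p, c in zip(prev[link], cur[link]))
        prefix.append(prefix[-1] + step)
    return prefix


def is_mildly_varying(ref_actions: Sequence[Dict[Link, Sequence[float]]],
                      c_a: float, delta_a: float) -> bool:
    """Checks P_t <= c_a * t^(1/2 - delta_a) for all t = 1, ..., T."""
    return all(
        p <= c_a * t ** (0.5 - delta_a)
        for t, p in enumerate(prefix_path_lengths(ref_actions), start=1)
    )


static = [{(0, 1): [1.0, 0.0]}] * 5                          # never changes
flip = [{(0, 1): [1.0, 0.0]}, {(0, 1): [0.0, 1.0]}] * 3      # flips every round
print(is_mildly_varying(static, 1.0, 0.25), is_mildly_varying(flip, 1.0, 0.25))
```

A static reference has zero path length and passes the check for any constants, whereas a reference that flips every round has path length linear in $t$ and fails it.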

A formal proof resides in Section B.5. We only highlight the self-bounding step here:

Proof Sketch of Theorem 3.6.

The first step of the proof is comparing the guarantee from Assumption 1 and that from Lyapunov drift analysis. Specifically, recall that Lemma 3.1 upper bounds the total queue length using $-\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\mathring{\bm{a}}_{n,m}(t)\rangle]$ (together with some other terms, which are constants after taking expectations) while Lemma 3.2 reveals the non-positivity of $-\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),{\bm{a}}_{n,m}(t)\rangle]$ (again, many constants omitted). Furthermore, via the guarantee of AdaPFOL (Algorithm 2) in Theorem 3.5, these two terms are actually close: they only differ by $\operatorname{\widetilde{\mathcal{O}}}_{T}\left(\sqrt{1+C^{a}T^{1/2-\delta_{a}}}\operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}}\right]\right)$ , where we used the assumption that $P_{T}^{a}\leq C^{a}T^{1/2-\delta_{a}}$ . By a property that ensures $\sum_{t=1}^{T}x_{t}^{2}\leq 4(\sum_{t=1}^{T}x_{t})^{3/2}$ when $\lvert x_{t}-x_{t+1}\rvert\leq 1$ (Lemma D.3), this $\operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}}\right]$ can be controlled by $\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}$ .

Finishing these steps, which are detailed in Lemma B.8, we are able to conclude

\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]\leq f(T)+g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right],

where $f(T)=\operatorname{\mathcal{O}}_{T}(T)$ and $g(T)=\operatorname{\mathcal{O}}_{T}(T^{1/4-\delta_{a}/2})$ are some abstract functions to simplify notations.

Therefore, this inequality is in a self-bounding form that $y\leq f+y^{3/4}g\log y$ where $y$ is the total queue lengths in $T$ rounds. As we informally stated in Equation 6, this gives our system stability guarantee. Indeed, in Lemma D.5, we show that $y\leq f+y^{3/4}g\log y$ implies $y=\operatorname{\mathcal{O}}(f)+\operatorname{\widetilde{\mathcal{O}}}(g^{4})$ . Therefore, $y=\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]=\operatorname{\mathcal{O}}_{T}(T)+\operatorname{\widetilde{\mathcal{O}}}_{T}(T^{1-2\delta_{a}})$ and thus $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]=\operatorname{\mathcal{O}}_{T}(1)$ . ∎
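The self-bounding step can also be explored numerically. The sketch below finds, by bisection, the largest $y$ compatible with $y\leq f+y^{3/4}g\log y$ and then looks at $y/T$ for $f=cT$ and $g=T^{1/4-\delta_{a}/2}$ . The constant $c$ , the value of $\delta_{a}$ , and the function names are hypothetical; the routine assumes $f\geq e^{4}$ , where $(f+g\,y^{3/4}\log y)/y$ is decreasing in $y$ so that the bisection is well defined. It illustrates the shape of the argument and is not a substitute for Lemma D.5.

```python
import math


def largest_self_bounded(f: float, g: float, iters: int = 200) -> float:
    """Largest y >= f satisfying y <= f + g * y**0.75 * log(y), assuming f >= e**4."""
    assert f >= math.exp(4) and g >= 0

    def slack(y: float) -> float:
        return f + g * y ** 0.75 * math.log(y) - y

    lo, hi = f, 2.0 * f            # slack(f) >= 0 holds since log(f) > 0
    while slack(hi) >= 0:          # grow until the inequality is violated
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):         # bisection keeps slack(lo) >= 0 > slack(hi)
        mid = 0.5 * (lo + hi)
        if slack(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return lo


def avg_queue_bound(T: float, c: float = 2.0, delta_a: float = 0.25) -> float:
    """Largest admissible y divided by T, with f = c*T and g = T^(1/4 - delta_a/2)."""
    return largest_self_bounded(c * T, T ** (0.25 - delta_a / 2)) / T


for T in (1e6, 1e9, 1e12):
    print(T, avg_queue_bound(T))
```

With these hypothetical constants the ratio shrinks as $T$ grows, although slowly because of the logarithmic factors hidden in $\operatorname{\widetilde{\mathcal{O}}}$ .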

Algorithm 3 UMO²: Utility Maximization via Online Linear Optimization and Bandit Convex Optimization

Input: Number of rounds $T$ , set of servers ${\mathcal{N}}$ and links ${\mathcal{L}}$ , maximum capacity $M$ , feasible arrival rates $\Lambda$ . Parameter $V$ . An online linear optimization algorithm AdaPFOL (Algorithm 2) and a bandit convex optimization algorithm AdaBGD (Algorithm 4).

1: For each link $(n,m)\in{\mathcal{L}}$ , initialize an instance of AdaPFOL with action set $\triangle({\mathcal{N}})$ , denoted by $\texttt{AdaPFOL}_{n,m}$ .
2: Initialize an instance of AdaBGD with action set $\Lambda$ .
3: for $t=1,2,\ldots,T$ do
4:     For each link $(n,m)\in{\mathcal{L}}$ , pass the maximum loss magnitude for this round $M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}$ to $\texttt{AdaPFOL}_{n,m}$ . Pick link allocation $\bm{a}_{n,m}(t)\in\triangle({\mathcal{N}})$ as the output of $\texttt{AdaPFOL}_{n,m}$ .
5:     Call AdaBGD with learning rates defined in Equation 11. Pick arrival rates $\bm{\lambda}(t)\in\Lambda$ as its output.

	$\displaystyle\eta_{t}$	$\displaystyle=\left(C^{\lambda}T^{1/2-\delta_{\lambda}}\middle/\begin{subarray}{c}\left(C^{\lambda}T^{1/2-\delta_{\lambda}}\right)^{7/3}\left(4r^{-3}d^{2}\right)^{28/9}\left(M+R\right)^{4/3}+\\ C^{\lambda}T^{1/2-\delta_{\lambda}}(r^{-3}d^{2}VG^{2}/L)^{4/3}+\\ \sum_{s=1}^{t}\left((\lVert\bm{q}_{s}\rVert_{\infty}+VG)^{2}(\lVert\bm{q}_{s}\rVert_{2}+VL)^{2}\right)^{1/3}\end{subarray}\right)^{3/4},$
	$\displaystyle\delta_{t}$	$\displaystyle=\left(\eta_{t}d^{2}\frac{(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}}{(\lVert\bm{Q}(t)\rVert_{2}+VL)}\right)^{1/3},\quad\alpha_{t}=\frac{\delta_{t}}{r}.$		(11)

6:     Observe capacities $\{C_{n,m}(t)\}_{(n,m)\in{\mathcal{L}}}$ and actual data transmissions $\{\mu_{n,m}^{(k)}(t)\}_{(n,m)\in{\mathcal{L}},k\in{\mathcal{N}}}$ .
7:     Calculate queue lengths $\bm{Q}(t+1)$ from $\bm{Q}(t)$ according to Equation 1.
8:     For each link $(n,m)\in{\mathcal{L}}$ , pass the loss vector $C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t))$ to $\texttt{AdaPFOL}_{n,m}$ .
9:     Observe the collected utility $g_{t}(\bm{\lambda}(t))$ . Pass the loss $\langle\bm{Q}(t),\bm{\lambda}(t)\rangle-Vg_{t}(\bm{\lambda}(t))$ to AdaBGD.

4 Utility Maximization in Adversarial Multi-Hop Networks

We now turn to the utility maximization task. In this task, in addition to the capacity allocations, the arrival rates are also decided by the scheduler with an objective of maximizing the unknown and time-varying utility function. The scheduler’s objective is to maximize the average utility it gains (Equation 3), while ensuring the average number of jobs in the network remains small (Equation 2).

This section is organized similarly to Section 3: We first explain the motivation of our algorithmic framework UMO² in Algorithm 3 and then present the assumptions together with the analysis.

4.1 Motivation of Our Algorithmic Framework

In Algorithm 3, we give the general algorithmic framework of Utility Maximization via Online Linear Optimization and Bandit Convex Optimization (UMO²) which achieves utility optimization via the plug-in of two optimization sub-routines AdaPFOL and AdaBGD. The differences between it and our system stability algorithm (NSO; Algorithm 1) are marked in blue. As we already motivated in Section 3.1 that an OLO algorithm AdaPFOL can help stabilize the system (recall Equation 5), this section focuses on motivating the other sub-routine AdaBGD by going through the design of UMO².

To handle the utility function, instead of the Lyapunov analysis in the previous section, UMO² is based on the Lyapunov drift-plus-penalty analysis (Neely, 2010a, Theorem 4.2). In Lemma C.2, we derive

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\langle C_{n,m}(t)(\bm{Q}_{m}(t)-\bm{Q}_{n}(t)),\bm{a}_{n,m}(t)\rangle\right]+$
	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\langle\bm{Q}_{n}(t),\bm{\lambda}_{n}(t)\rangle\right]-V\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\bigg{(}g_{t}(\bm{\lambda}(t))-g_{t}(\mathring{\bm{\lambda}}(t))\bigg{)}\right]\gtrsim 0,$		(12)

where $V$ is a constant that we can arbitrarily pick for analytical purposes. Intuitively, this $V$ stands for a trade-off between the stability part $\langle\bm{Q}_{n}(t),\bm{\lambda}_{n}(t)\rangle$ and the utility part $g_{t}(\bm{\lambda}(t))$ .

Again motivated by Huang et al. (2024), our goal is to minimize Equation 12. The first term in Equation 12 is exactly the OLO optimization objective from the previous section (recall Equation 5), which can be minimized by the AdaPFOL algorithm given in Algorithm 2. For the second and third terms, we would like to

\text{Minimize }\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\bigg{(}\langle\bm{Q}(t),\bm{\lambda}(t)\rangle-Vg_{t}(\bm{\lambda}(t))\bigg{)}\right].		(13)

We now also tackle Equation 13 using online learning techniques: As $g_{t}$ is defined over all possible $\bm{\lambda}(t)$ ’s and $\bm{\lambda}(t)$ can be arbitrarily chosen from the feasible action set $\Lambda$ , we regard “deciding $\bm{\lambda}(t)$ ” as a learning problem with action set $\Lambda$ (instead of making decisions on each link or server separately as in Section 3.4) where the loss of picking $\bm{\lambda}$ in round $t$ is $\ell_{t}(\bm{\lambda})\triangleq\langle\bm{Q}(t),\bm{\lambda}\rangle-Vg_{t}(\bm{\lambda})$ . This loss function is convex w.r.t. $\bm{\lambda}$ as $\langle\cdot,\cdot\rangle$ is linear and $g_{t}$ is concave. However, since only bandit feedback on $g_{t}$ is available, we can only evaluate $\ell_{t}$ at the chosen $\bm{\lambda}(t)$ rather than observe the whole function $\ell_{t}$ . Thus, this problem is not an OLO problem as Definition 3.3 requires full information feedback. Instead, it belongs to the category of Bandit Convex Optimization (BCO) (Flaxman et al., 2005), which we will define in Definition 4.2.
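A minimal Python sketch of this bandit loss is given below. It evaluates $\ell_{t}(\bm{\lambda})=\langle\bm{Q}(t),\bm{\lambda}\rangle-Vg_{t}(\bm{\lambda})$ at a single point, which is all the scheduler observes, and numerically checks midpoint convexity on a toy instance. The utility `log_utility` is a hypothetical concave stand-in for the unknown $g_{t}$ , queues and arrival rates are flattened into plain vectors for simplicity, and all names and numbers are assumptions made for this illustration.

```python
import math
from typing import Callable, Sequence


def log_utility(lam: Sequence[float]) -> float:
    """Hypothetical concave utility sum_i log(1 + lam_i), standing in for the unknown g_t."""
    return sum(math.log1p(l) for l in lam)


def bco_loss(Q: Sequence[float], lam: Sequence[float], V: float,
             utility: Callable[[Sequence[float]], float]) -> float:
    """ell_t(lam) = <Q(t), lam> - V * g_t(lam); only this scalar is revealed."""
    return sum(q * l for q, l in zip(Q, lam)) - V * utility(lam)


Q, V = [3.0, 1.0], 10.0
a, b = [0.0, 2.0], [2.0, 0.0]
mid = [(x + y) / 2 for x, y in zip(a, b)]
# Midpoint convexity: the loss at the midpoint is at most the average of the endpoints.
print(bco_loss(Q, mid, V, log_utility),
      0.5 * (bco_loss(Q, a, V, log_utility) + bco_loss(Q, b, V, log_utility)))
```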

Similar to Section 3.1, we now discuss what properties the AdaBGD sub-routine should enjoy. We face the following challenges that are unique to the network optimization context:

i)

Again, the queue lengths can potentially go unbounded, which means the loss $\ell_{t}(\bm{\lambda})$ can have large magnitudes. However, different from the OLO problem we met in Equation 5, in BCO problems we face general convex functions and thus Lipschitzness (i.e., the maximum gradient magnitude) also plays a role as it characterizes how fast $\ell_{t}$ changes with $\bm{\lambda}$ . Therefore, our AdaBGD shall not only bear large loss magnitudes but also withstand large Lipschitz constants. As we will see in Equation 11, adapting to both magnitudes and Lipschitzness is particularly difficult.
ii)

The second challenge is again due to the self-bounding analysis: Since we want to conduct self-bounding analyses on the queue lengths and also the utility gap (both similar to Equation 6), we also want AdaBGD to be adaptive to the loss functions’ magnitudes and Lipschitzness.

In Section 4.4, we introduce the details of our AdaBGD algorithm that ensures both i) and ii). Equipped with this algorithm, we can optimize Equation 13 by deploying it over the action set $\Lambda$ . Moreover, as deploying AdaPFOL (Algorithm 2) on each link $(n,m)\in{\mathcal{L}}$ minimizes $\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}% \langle\bm{Q}_{m}(t)-\bm{Q}_{n}(t),\bm{\mu}_{n,m}(t)\rangle]$ , these two algorithms can together minimize the Lyapunov drift-plus-penalty in Equation 12. Such a combination gives the UMO² framework in Algorithm 3.

In the remainder of this section, we present the analysis of UMO²: In Section 4.2, we introduce the reference sequence assumption. In Section 4.3, we present the Lyapunov drift-plus-penalty analysis. In Section 4.4, we rigorously define the BCO problem and present our AdaBGD algorithm (Algorithm 4). Finally, by combining the AdaPFOL guarantee from Section 3.4 and the AdaBGD guarantee from Section 4.4, we obtain the utility maximization guarantee in Section 4.5.

4.2 Reference Policy Assumption

The assumption we need in the multi-hop utility maximization task is similar to the one in multi-hop network stability (Assumption 1), with one important difference that our arrival rates are no longer fixed but decided by the scheduler. Hence, instead of assuming a sequence of $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ stabilizing the system with the obliviously adversarial arrival rates $\{\bm{\lambda}(t)\}_{t\in[T]}$ , we assume the existence of a reference sequence $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ such that $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ stabilizes the system with reference arrival rates $\{\mathring{\bm{\lambda}}(t)\}_{t\in[T]}$ . Formally, we make the following assumption.

Assumption 2 (Multi-Hop Piecewise Stability for Utility Maximization).

There exists a reference action sequence $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ (where $\mathring{\bm{a}}(t)=\{\mathring{\bm{a}}_{n,m}(t)\in\triangle({\mathcal{N}})\}% _{(n,m)\in{\mathcal{L}}}$ and $\mathring{\bm{\lambda}}(t)=\{\mathring{\bm{\lambda}}_{n}(t)\in\Lambda_{n}\}_{n% \in{\mathcal{N}}}$ , in analogue to the scheduler’s action sequence) such that there are constants $C_{W}\geq 0$ , $\epsilon_{W}\geq 0$ , and a partition $W_{1},W_{2},\ldots,W_{J}$ of $[T]$ , such that $\sum_{j=1}^{J}(\lvert W_{j}\rvert-1)^{2}\leq C_{W}T$ and

	$\displaystyle\frac{1}{\lvert W_{j}\rvert}\sum_{t\in W_{j}}\sum_{(n,m)\in{% \mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)\geq\epsilon_{W}+\frac{1}{% \lvert W_{j}\rvert}\sum_{t\in W_{j}}\left(\mathring{\lambda}_{n}^{(k)}(t)+\sum% _{(o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)\right),$
	$\displaystyle\quad\forall j\in[J],n\in{\mathcal{N}},k\in{\mathcal{N}}.$
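To make the partition condition $\sum_{j}(\lvert W_{j}\rvert-1)^{2}\leq C_{W}T$ concrete, the following minimal sketch computes the smallest constant $C_{W}$ compatible with a given list of window lengths. The helper name `min_partition_constant` and the example window lengths are illustrative assumptions and do not appear in the paper.

```python
from typing import Sequence


def min_partition_constant(window_lengths: Sequence[int]) -> float:
    """Smallest C_W with sum_j (|W_j| - 1)^2 <= C_W * T, where T = sum_j |W_j|."""
    if not window_lengths or any(w <= 0 for w in window_lengths):
        raise ValueError("window lengths must be a non-empty list of positive integers")
    horizon = sum(window_lengths)  # the windows partition [T], so their lengths add up to T
    return sum((w - 1) ** 2 for w in window_lengths) / horizon


# Hypothetical example: T = 1000 rounds split into 100 windows of length 10.
c_w_equal = min_partition_constant([10] * 100)   # 100 * 9^2 / 1000 = 8.1

# A single window covering all rounds forces C_W to grow linearly in T.
c_w_single = min_partition_constant([1000])      # 999^2 / 1000 = 998.001
```

Under this reading, windows of bounded length keep $C_{W}$ constant, which is the regime where the $C_{W}T$ term in Lemma 4.1 stays linear in $T$.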

Following the argument of Lemma 3.1, one can derive the following result, whose proof is in Section C.1.

Lemma 4.1 (Ability of $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ in Stabilizing the Network).

If $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ satisfies Assumption 2, then for any scheduler-generated queue lengths $\{\bm{Q}(t)\}_{t\in[T]}$ ,

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mathring{\mu}_{n,m}^{(k)}(t)(Q% _{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E}}\left[% \sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)% \mathring{\lambda}_{n}^{(k)}(t)\right].$		(14)

We remark that our scheduler cannot access the reference action sequence $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ . However, similar to the NSO algorithm, our UMO² algorithm can also learn to stabilize the system. Moreover, it learns to match the utility of any mildly varying reference policy. Specifically, i) our action sequence $\{(\bm{a}(t),\bm{\lambda}(t))\}_{t\in[T]}$ also stabilizes the system, i.e., $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]=\operatorname{\mathcal{O}}_{T}(1)$ ; and ii) its utility asymptotically matches that of any mildly varying reference policy $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ , that is, $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}g_{t}(\bm{\lambda}(t))]\xrightarrow{\text{polynomially}}\frac{1}{T}\sum_{t=1}^{T}g_{t}(\mathring{\bm{\lambda}}(t))$ .³

³ According to the oblivious adversary assumption, $g_{t}$ is pre-determined. Thus, the $g_{t}$ ’s on the LHS and RHS are the same, so there is no need to take conditional expectations w.r.t. previous actions $\{(\bm{a}(t),\bm{\lambda}(t))\}_{t\in[T]}$ or $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ .

4.3 Lyapunov Drift-plus-Penalty Analysis

In the Lyapunov drift analysis (Section 3.3), we consider the drift function $\Delta(\bm{Q}(t))\triangleq\operatornamewithlimits{\mathbb{E}}[L_{t+1}-L_{t}\mid\bm{Q}(t)]$ where $L_{t}\triangleq\frac{1}{2}\lVert\bm{Q}(t)\rVert_{2}^{2}$ is the Lyapunov function. In the Lyapunov drift-plus-penalty (DPP) analysis (Neely, 2010a, Theorem 4.2), we consider the DPP function $\Delta(\bm{Q}(t))-V\operatornamewithlimits{\mathbb{E}}[g_{t}(\bm{\lambda}(t))\mid\bm{Q}(t)]$ , where $V$ is a trade-off parameter that we are free to choose. As we will see in Theorem 4.5, when $V$ is chosen to be no larger than a suitable polynomial of $T$ , our utility is at least that of any mildly varying reference policy minus $\operatorname{\mathcal{O}}(V^{-1})$ , which implies a polynomially decaying gap between the two utilities.

Similar to the calculations in Lemma 3.1, one can derive the following inequality (see Lemma C.2):

	$\displaystyle\quad-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum% _{(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}C_{n,m}(t)(Q_{m}^{(k)}(t)-Q_{n% }^{(k)}(t))a_{n,m}^{(k)}(t)\right]$
	$\displaystyle\quad-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum% _{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\lambda_{n}^{(k)}(t)% \right]+V\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(g_{t}(\bm{% \lambda}(t))-g_{t}(\mathring{\bm{\lambda}}(t)))\right]$
	$\displaystyle\leq\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T+V% \operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(g_{t}(\bm{\lambda}(t))% -g_{t}(\mathring{\bm{\lambda}}(t)))\right].$		(15)

Similar to the previous section, we want to make the RHS of Equation 15 close to that of Equation 14. Specifically, we decompose these two RHS’s into two parts and show the following inequalities:


			(16)

The first inequality is exactly our objective in Equation 5, which can be ensured via the AdaPFOL algorithm (Algorithm 2); recall its performance guarantee in Theorem 3.5. The second inequality, on the other hand, is new to the utility maximization task. While some algorithmic ingredients can be borrowed from the Bandit Convex Optimization (BCO) problem (Flaxman et al., 2005), additional effort is needed in the network optimization context: our BCO algorithm must tolerate large loss magnitudes and Lipschitzness (due to potentially unbounded queue lengths), and its performance must adapt to the loss functions’ magnitudes and Lipschitzness as well.

4.4 AdaBGD: Learning for Utility Maximization

As mentioned in the sketch (Section 4.1), ensuring the second inequality in Equation 16 amounts to minimizing the time-varying loss function $\ell_{t}(\bm{\lambda})\triangleq\langle\bm{Q}(t),\bm{\lambda}\rangle-Vg_{t}(\bm{\lambda})$ under bandit feedback over the action set $\Lambda$ . This problem differs from the OLO problem introduced in Definition 3.3 because we do not have full-information feedback: only $g_{t}(\bm{\lambda}(t))$ , rather than the whole function $g_{t}$ , is revealed to the scheduler. Hence only $\ell_{t}(\bm{\lambda}(t))$ , the actual loss associated with our action, can be accurately calculated. We provide a formal definition of the Bandit Convex Optimization (BCO) problem (Flaxman et al., 2005; Chen and Giannakis, 2018) below.

Definition 4.2 (Bandit Convex Optimization).

Consider a $T$ -round game. In round $t=1,2,\ldots,T$ , the player picks an action $\bm{x}_{t}$ from a convex action set $\mathcal{X}\subseteq\mathbb{R}^{d}$ , and the environment simultaneously picks an arbitrary convex loss $\ell_{t}\colon\mathcal{X}\to\mathbb{R}$ . The player observes and suffers loss $\ell_{t}(\bm{x}_{t})$ . Dynamic regret minimization in BCO considers minimizing

\text{D-Regret}_{T}^{\text{BCO}}(\bm{u}_{1},\bm{u}_{2},\ldots,\bm{u}_{T})=% \operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t})-% \ell_{t}(\bm{u}_{t}))\right],\quad\forall\bm{u}_{1},\bm{u}_{2},\ldots,\bm{u}_{% T}\in\mathcal{X}.

Again, we recall the two challenges that our BCO algorithm should overcome:

i) It must handle $\ell_{t}$ ’s with large magnitude $\sup_{\bm{\lambda}\in\Lambda}\lvert\ell_{t}(\bm{\lambda})\rvert$ or Lipschitzness $\sup_{\bm{\lambda}\in\Lambda}\lVert\nabla\ell_{t}(\bm{\lambda})\rVert_{2}$ .

ii) Its performance should depend on magnitudes and Lipschitzness of all loss functions.

Our algorithm is based on the Bandit Gradient Descent (BGD) algorithm (Zhao et al., 2021, Algorithm 1), which does not satisfy i) or ii) as it requires losses to be uniformly bounded by some $C$ and Lipschitzness to be always bounded by some $L$ . In our case where $\ell_{t}(\bm{\lambda})=\langle\bm{Q}(t),\bm{\lambda}\rangle-Vg_{t}(\bm{\lambda})$ , the loss magnitude $\lVert\bm{Q}(t)\rVert_{\infty}+VG$ and the Lipschitzness $\lVert\bm{Q}(t)\rVert_{2}+VL$ are both large when $\lVert\bm{Q}(t)\rVert$ is large.

Nevertheless, based on the BGD algorithm, we design a BCO algorithm called Adaptive BGD (AdaBGD; Algorithm 4) which satisfies both i) and ii). Specifically, to ensure i), we utilize the fact that $\bm{Q}(t)$ is known before deciding $\bm{\lambda}(t)$ , which means the magnitude $C_{t}$ and the Lipschitzness $L_{t}$ of the loss function $\ell_{t}$ can be calculated before the decision. To ensure ii), instead of the doubling technique in AdaPFOL (Algorithm 2), we design an adaptive learning rate scheduling mechanism that uses a sequence of time-varying learning rates $\eta_{1}>\eta_{2}>\cdots>\eta_{T}$ rather than a single $\eta$ throughout execution. Formally, AdaBGD has the following dynamic regret guarantee:

Algorithm 4 AdaBGD: Adaptive Bandit Gradient Descent

Input: Action set $\mathcal{X}$ bounded by $[r,R]$ (i.e., $r\mathbb{B}\subseteq\mathcal{X}\subseteq R\mathbb{B}$ ), hyper-parameters $\eta_{1}>\eta_{2}>\cdots>\eta_{T}$ , $\delta_{1},\delta_{2},\ldots,\delta_{T}$ , and $\alpha_{t}\triangleq\delta_{t}/r,\forall t\in[T]$ .

1: Initialize $\bm{y}_{1}=\bm{0}$ (an internal variable of the algorithm).

2: for $t=1,2,\ldots,T$ do

3: Calculate this round’s action $\bm{x}_{t}\in\mathcal{X}$ , observe loss $\ell_{t}(\bm{x}_{t})$ , and update internal variable $\bm{y}_{t+1}$ :

	$\displaystyle\bm{x}_{t}=\bm{y}_{t}+\delta_{t}\bm{s}_{t},\quad\bm{y}_{t+1}=\text{Proj}_{(1-\alpha_{t})\mathcal{X}}\left[\bm{y}_{t}-\eta_{t}\frac{d}{\delta_{t}}\ell_{t}(\bm{x}_{t})\bm{s}_{t}\right],$		(17)

where $\bm{s}_{t}\in\mathbb{R}^{d}$ is a uniformly sampled unit vector used to estimate gradients (Flaxman et al., 2005).
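As a rough illustration of the update in Equation 17, the sketch below runs the perturb-observe-project loop in the special case where $\mathcal{X}$ is the centered Euclidean ball of radius $R$ (so that projecting onto $(1-\alpha_{t})\mathcal{X}$ is a simple rescaling). The function names, the ball-shaped action set, and the example loss and parameter sequences are all assumptions made for this illustration; the learning rates here are hand-picked and are not the schedule of Equation 18.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def norm(v: Vector) -> float:
    return math.sqrt(sum(c * c for c in v))


def project_to_ball(point: Vector, radius: float) -> Vector:
    """Euclidean projection onto the centered ball of the given radius."""
    length = norm(point)
    if length <= radius:
        return list(point)
    return [c * radius / length for c in point]


def ada_bgd(losses: Sequence[Callable[[Vector], float]], etas: Sequence[float],
            deltas: Sequence[float], dim: int, r: float, R: float,
            rng: random.Random) -> List[Vector]:
    """Run the update of Equation 17 on the centered ball of radius R (assumed X)."""
    y = [0.0] * dim                                   # internal variable y_1 = 0
    actions = []
    for loss, eta, delta in zip(losses, etas, deltas):
        alpha = delta / r                             # shrinkage alpha_t = delta_t / r
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        g_norm = norm(g)
        s = [c / g_norm for c in g]                   # uniform direction on the unit sphere
        x = [yi + delta * si for yi, si in zip(y, s)]  # action actually played
        feedback = loss(x)                            # the only observation: one scalar
        step = eta * (dim / delta) * feedback         # one-point gradient estimate along s
        y = project_to_ball([yi - step * si for yi, si in zip(y, s)],
                            (1.0 - alpha) * R)
        actions.append(x)
    return actions


# Hypothetical usage: a fixed quadratic loss on the unit ball in R^3.
rng = random.Random(0)
horizon, dim = 200, 3
target = [0.3, -0.2, 0.1]
quadratic = lambda x: sum((xi - ci) ** 2 for xi, ci in zip(x, target))
etas = [0.05 / math.sqrt(t) for t in range(1, horizon + 1)]   # decreasing eta_t
deltas = [0.5 / t ** 0.25 for t in range(1, horizon + 1)]     # non-increasing, delta_t < r
played = ada_bgd([quadratic] * horizon, etas, deltas, dim, r=1.0, R=1.0, rng=rng)
```

In this sketch, every played action stays inside the ball because the exploration radii are non-increasing: the previous projection leaves a margin of $\delta_{t-1}R/r$, which covers the next perturbation of size $\delta_{t}$.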

Lemma 4.3 (Guarantee of AdaBGD Algorithm).

Suppose that $r\mathbb{B}\subseteq\mathcal{X}\subseteq R\mathbb{B}$ , the $t$ -th loss $\ell_{t}$ is bounded by $C_{t}$ and is $L_{t}$ -Lipschitz. Suppose that $\eta_{t}$ and $\delta_{t}$ are both $\mathcal{F}_{t-1}$ -measurable (where $(\mathcal{F}_{t})_{t=0}^{T}$ is the natural filtration), $\eta_{1}>\eta_{2}>\cdots>\eta_{T}$ , and $\alpha_{t}\triangleq\delta_{t}/r<1$ a.s. for all $t\in[T]$ . Then for any fixed $\bm{u}_{1},\bm{u}_{2},\ldots,\bm{u}_{T}\in\mathcal{X}$ , the AdaBGD algorithm in Algorithm 4 enjoys the following guarantee:

	$\displaystyle\quad\text{D-Regret}_{T}^{\text{BCO}}(\bm{u}_{1},\bm{u}_{2},% \ldots,\bm{u}_{T})=\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(% \ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{u}_{t}))\right]$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\frac{7R^{2}}{4\eta_% {T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{T}\left(\frac{\eta_{t}}{2}\frac{d^{2}% }{\delta_{t}^{2}}C_{t}^{2}+3L_{t}\delta_{t}+L_{t}\alpha_{t}R\right)\right],$

where $P_{T}=\sum_{t=1}^{T-1}\lVert\bm{u}_{t}-\bm{u}_{t+1}\rVert$ is the path length of the comparator sequence $\{\bm{u}_{t}\}_{t\in[T]}$ .

Compared to the algorithm itself in Algorithm 4 and the guarantee in Lemma 4.3, our main innovation on the BCO side lies in the novel learning rate scheduling mechanism in Equation 11. Therefore, we do not prove Lemma 4.3 at this moment and postpone it to Section C.3. Instead, we prove:

Theorem 4.4 (Deciding $\bm{\lambda}(t)$ via AdaBGD Algorithm).

For the reference arrival rates $\{\mathring{\bm{\lambda}}(t)\}_{t\in[T]}$ defined in Assumption 2, suppose that its path length satisfies

P_{t}^{\lambda}\triangleq\sum_{s=1}^{t-1}\lVert\mathring{\bm{\lambda}}(s+1)-\mathring{\bm{\lambda}}(s)\rVert_{1}\leq C^{\lambda}t^{1/2-\delta_{\lambda}},\quad\forall t=1,2,\ldots,T,

where, similar to Theorem 3.6, $C^{\lambda}$ and $\delta_{\lambda}$ are assumed to be known constants but the precise $P_{t}^{\lambda}$ or $\{\mathring{\bm{\lambda}}(t)\}_{t\in[T]}$ both remain unknown. Suppose that the action set $\Lambda$ is bounded by $[r,R]$ (i.e., $r\mathbb{B}\subseteq\Lambda\subseteq R\mathbb{B}$ ). If we execute AdaBGD (Algorithm 4) over $\Lambda$ with loss functions $\ell_{t}(\bm{\lambda})=\langle\bm{Q}(t),\bm{\lambda}\rangle-Vg_{t}(\bm{\lambda})$ and parameters $\eta_{t},\delta_{t},\alpha_{t}$ defined in Equation 11 (restated below as Equation 18 to ease reading):

	$\displaystyle\eta_{t}$	$\displaystyle=\left(C^{\lambda}T^{1/2-\delta_{\lambda}}\middle/\begin{subarray% }{c}\left(C^{\lambda}T^{1/2-\delta_{\lambda}}\right)^{7/3}\left(4r^{-3}d^{2}% \right)^{28/9}\left(M+R\right)^{4/3}+\\ C^{\lambda}T^{1/2-\delta_{\lambda}}(r^{-3}d^{2}VG^{2}/L)^{4/3}+\\ \sum_{s=1}^{t}\left((\lVert\bm{q}_{s}\rVert_{\infty}+VG)^{2}(\lVert\bm{q}_{s}% \rVert_{2}+VL)^{2}\right)^{1/3}\end{subarray}\right)^{3/4},$
	$\displaystyle\delta_{t}$	$\displaystyle=\left(\eta_{t}d^{2}\frac{(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}% }{(\lVert\bm{Q}(t)\rVert_{2}+VL)}\right)^{1/3},\quad\alpha_{t}=\frac{\delta_{t% }}{r},$		(18)

then the outputs $\bm{\lambda}(1),\bm{\lambda}(2),\ldots,\bm{\lambda}(T)\in\Lambda$ of AdaBGD ensure

In words, the optimization objective in Equation 16 is ensured. The second term on the RHS is the key term, as it ensures property ii): the $\eta_{t}$ defined in Equation 11 incorporates all historical loss magnitudes and Lipschitzness. Below, we give a quick overview of the proof and explain why ii) is ensured by the $\eta_{t}$ defined in Equation 18. A formal version is included in Section C.4.
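The schedule in Equation 18 can be computed online, since $\eta_{t}$ and $\delta_{t}$ depend only on queue lengths observed up to round $t$. The following sketch is one possible transcription of Equation 18 under the assumption that $\bm{q}_{s}$ in the denominator denotes the queue vector $\bm{Q}(s)$. The function name `adabgd_schedule` and all numerical values in the example are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def adabgd_schedule(queues: Sequence[Sequence[float]], V: float, G: float, L: float,
                    r: float, R: float, M: float, d: int,
                    c_lambda: float, delta_lambda: float) -> List[Tuple[float, float, float]]:
    """Return [(eta_t, delta_t, alpha_t)] for t = 1..T, transcribing Equation 18."""
    T = len(queues)
    budget = c_lambda * T ** (0.5 - delta_lambda)          # C^lambda * T^{1/2 - delta_lambda}
    # The first two denominator terms do not depend on t.
    fixed = (budget ** (7 / 3) * (4 * r ** -3 * d ** 2) ** (28 / 9) * (M + R) ** (4 / 3)
             + budget * (r ** -3 * d ** 2 * V * G ** 2 / L) ** (4 / 3))
    running = 0.0                                          # third term: sum over s <= t
    schedule = []
    for q in queues:
        C_t = max(abs(c) for c in q) + V * G               # loss magnitude  ||Q||_inf + VG
        L_t = math.sqrt(sum(c * c for c in q)) + V * L     # Lipschitzness   ||Q||_2 + VL
        running += (C_t ** 2 * L_t ** 2) ** (1 / 3)
        eta = (budget / (fixed + running)) ** 0.75
        delta = (eta * d ** 2 * C_t ** 2 / L_t) ** (1 / 3)
        schedule.append((eta, delta, delta / r))
    return schedule


# Hypothetical example: one queue growing linearly over 100 rounds, all constants set to 1.
queues = [[float(t), 0.0] for t in range(1, 101)]
schedule = adabgd_schedule(queues, V=1.0, G=1.0, L=1.0, r=1.0, R=1.0, M=1.0,
                           d=2, c_lambda=1.0, delta_lambda=0.25)
```

Because the running sum only grows, the resulting $\eta_{t}$ is decreasing in $t$, as required by Lemma 4.3.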

Proof Sketch of Theorem 4.4.

The terms $(\lVert\bm{Q}(t)\rVert_{\infty}+VG)$ and $(\lVert\bm{Q}(t)\rVert_{2}+VL)$ are the boundedness and Lipschitzness of $\ell_{t}$ , respectively, which we denote by $C_{t}$ and $L_{t}$ to simplify notation.

First suppose that all conditions in Lemma 4.3 are satisfied by our $\{\eta_{t}\}_{t\in[T]}$ . To balance the terms inside $\sum_{t=1}^{T}(\cdot)$ in Lemma 4.3 by fixing $\eta_{t}$ and tuning $\delta_{t}$ , one should pick $\delta_{t}=(\eta_{t}d^{2}C_{t}^{2}/L_{t})^{1/3}$ , which roughly gives (hiding many constant terms; see Equation 24 in the appendix for an accurate form):

\text{D-Regret}_{T}^{\text{BCO}}(\mathring{\bm{\lambda}}(1),\mathring{\bm{% \lambda}}(2),\ldots,\mathring{\bm{\lambda}}(T))\lesssim\frac{C^{\lambda}T^{1/2% -\delta_{\lambda}}}{\eta_{T}}+\sum_{t=1}^{T}\left(\eta_{t}C_{t}^{2}L_{t}^{2}% \right)^{1/3}.

(19)

We derive in Lemma D.1 that $\sum_{t=1}^{T}\frac{x_{t}}{(\sum_{s\leq t}x_{s})^{1/4}}\lesssim(\sum_{t=1}^{T}x_{t})^{3/4}$ , which is a variant of the well-known summation lemma (Auer et al., 2002). Therefore, if we pick $\eta_{t}\approx\left(\frac{C^{\lambda}T^{1/2-\delta_{\lambda}}}{\sum_{s=1}^{t}(C_{s}^{2}L_{s}^{2})^{1/3}}\right)^{3/4}$ (i.e., only keeping the third term in the denominator of Equation 18), Equation 19 becomes $\operatorname{\mathcal{O}}((C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}\operatornamewithlimits{\mathbb{E}}[(\sum_{t=1}^{T}(C_{t}^{2}L_{t}^{2})^{1/3})^{3/4}])$ . When focusing on $T$ -related terms, this is roughly $\operatorname{\mathcal{O}}_{T}(T^{1/8-\delta_{\lambda}/4}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]^{7/8})$ , which allows a self-bounding analysis similar to Equation 6. However, such a configuration of $\{\eta_{t}\}_{t\in[T]}$ may not ensure the condition $\alpha_{t}=\delta_{t}/r=(\eta_{t}d^{2}C_{t}^{2}/L_{t})^{1/3}/r<1$ required by Lemma 4.3. The other two terms in the denominator of Equation 18 are added for this purpose. We refer the readers to Section C.4 for the detailed verification. ∎
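The summation inequality used above can be checked numerically. The sketch below compares $\sum_{t}x_{t}/(\sum_{s\leq t}x_{s})^{1/4}$ against $\frac{4}{3}(\sum_{t}x_{t})^{3/4}$ for positive sequences. The constant $4/3$ comes from an elementary concavity argument for $z\mapsto z^{3/4}$ and is our own assumption here; the constant stated in Lemma D.1 may differ. The function names are illustrative.

```python
from typing import Sequence


def adaptive_sum(xs: Sequence[float]) -> float:
    """Compute sum_t x_t / (sum_{s<=t} x_s)^{1/4} for positive x_t."""
    total, prefix = 0.0, 0.0
    for x in xs:
        prefix += x                      # prefix = x_1 + ... + x_t
        total += x / prefix ** 0.25
    return total


def summation_bound(xs: Sequence[float]) -> float:
    """Upper bound (4/3) * (sum_t x_t)^{3/4}, from concavity of z -> z^{3/4}."""
    return (4.0 / 3.0) * sum(xs) ** 0.75


# Hypothetical test sequences: constant, polynomially growing, and one early spike.
sequences = [
    [1.0] * 1000,
    [float(t) ** 2 for t in range(1, 501)],
    [1000.0] + [1.0] * 999,
]
gaps = [summation_bound(xs) - adaptive_sum(xs) for xs in sequences]
```

In the proof sketch, this inequality is applied with $x_{t}=(C_{t}^{2}L_{t}^{2})^{1/3}$, so that both terms of Equation 19 scale as $(\sum_{t}x_{t})^{3/4}$.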

4.5 Main Theorem for Multi-Hop Utility Maximization

As sketched in Section 4.1, if we use a Lyapunov drift-plus-penalty analysis, exploit the network stability assumption, use AdaPFOL (Algorithm 2) to decide link allocations $\bm{a}(t)$ , and use AdaBGD (Algorithm 4) to decide arrival rates $\bm{\lambda}(t)$ , we get the following utility maximization guarantee.

Theorem 4.5 (Main Theorem for Multi-Hop Utility Maximization).

Suppose that the feasible set of arrival-rate vectors $\Lambda$ is bounded by $[r,R]$ . Assume all (unknown) utility functions $g_{t}$ to be concave, $L$ -Lipschitz, and $[-G,G]$ -bounded. Consider a reference action sequence $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ satisfying Assumption 2, such that its path lengths satisfy

\displaystyle P_{t}^{a}\triangleq\sum_{s=1}^{t-1}\lVert\mathring{\bm{a}}(s)-\mathring{\bm{a}}(s+1)\rVert_{1}\leq C^{a}t^{1/2-\delta_{a}},\quad P_{t}^{\lambda}\triangleq\sum_{s=1}^{t-1}\lVert\mathring{\bm{\lambda}}(s)-\mathring{\bm{\lambda}}(s+1)\rVert_{1}\leq C^{\lambda}t^{1/2-\delta_{\lambda}},\quad\forall t\in[T].

Here, $M,R,r,L,G,C^{a},\delta_{a},C^{\lambda},\delta_{\lambda}$ are assumed to be known constants, whereas the specific $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ remains unknown. If we execute the UMO² framework in Algorithm 3 with the AdaPFOL sub-routine given in Algorithm 2 and the AdaBGD sub-routine given in Algorithm 4, when $T$ is large enough such that the constant $V=o_{T}(\min\{T^{2\delta_{a}/3},T^{2\delta_{\lambda}/7}\})$ , the following inequalities hold simultaneously:

That is, when $T$ is sufficiently large, our algorithm not only stabilizes the system so that $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]=\operatorname{\mathcal{O}}_{T}(1)$ , but also enjoys an average utility approaching that of the reference policy polynomially fast, i.e., $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\left(g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))\right)\right]=\operatorname{\mathcal{O}}_{T}(V^{-1})$ . Hence, the utility maximization objective in Equation 3 is ensured.

Remark 5.

The condition $V=o_{T}(\min\{T^{2\delta_{a}/3},T^{2\delta_{\lambda}/7}\})$ says $T$ cannot be too small compared to $V$ , which was not an issue in SNO as people often let $T$ approach infinity (Neely et al., 2008). Although this condition looks restrictive, we remark that $V$ can still be as large as a polynomial of $T$ , and thus $\operatorname{\mathcal{O}}_{T}(V^{-1})$ means a polynomially decaying gap between our utility and that of any mildly varying policy whose path length is small. To our knowledge, this is the first guarantee that applies to utility maximization tasks in adversarial networks under bandit feedback. Similar to the discussions in Remark 4, due to non-stationary environments and bandit feedback, it is highly non-trivial to define an “optimal reference policy” in ANO. Nevertheless, our mildly varying reference policy class contains the optimal policies for SNO settings.

The full proof of Theorem 4.5 is presented in Section C.5. We outline three key steps here.

Proof Sketch of Theorem 4.5.

As a preliminary step, we plug the algorithmic guarantees for AdaPFOL (Theorem 3.5) and for AdaBGD (Theorem 4.4) into the reference policy assumption (Equation 14) and then make use of the Lyapunov DPP guarantee (Equation 15). The detailed derivation is presented in Lemma C.7. The conclusion reads

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle\leq-\frac{V}{\epsilon_{W}}\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\bigg{(}g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{% \lambda}(t))\bigg{)}\right]+f(T)+$
		$\displaystyle\quad g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]+h(T)% \operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{% 1}\right]^{7/8},$		(20)

where $f(T)=\operatorname{\mathcal{O}}_{T}(T)$ , $g(T)=\operatorname{\mathcal{O}}_{T}(T^{1/4-\delta_{a}/2})$ , and $h(T)=\operatorname{\mathcal{O}}_{T}(T^{1/8-\delta_{\lambda}/4})$ .

Step 1 (Develop a Coarse Average Queue Length Bound).

By the boundedness of $g_{t}$ , the first term on the RHS of Equation 20 is controlled by $\frac{2V}{\epsilon_{W}}TG=\operatorname{\mathcal{O}}_{T}(VT)$ (note that, as $V=\text{poly}(T)$ , $V$ is not a constant that can be hidden in $\operatorname{\mathcal{O}}_{T}$ , so this term is actually super-linear). Similar to the property used when proving Theorem 3.6, we develop another self-bounding property stating that $y\leq f+y^{3/4}g\log y+y^{7/8}h$ implies $y=\operatorname{\mathcal{O}}(f)+\operatorname{\widetilde{\mathcal{O}}}(g^{4})+\operatorname{\mathcal{O}}(h^{8})$ (see Lemma D.6). Therefore,

\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{% 1}\right]=\operatorname{\mathcal{O}}_{T}\left(VT+T\right)+\operatorname{% \widetilde{\mathcal{O}}}_{T}\left(T^{1-2\delta_{a}}\right)+\operatorname{% \mathcal{O}}_{T}\left(T^{1-2\delta_{\lambda}}\right)=\operatorname{\mathcal{O}% }_{T}(VT),

which only gives a $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}]=\operatorname{\mathcal{O}}_{T}(V)$ bound on the average queue length. Since $V$ may grow polynomially in $T$ , this bound is too weak to establish the system stability condition in Equation 2. However, this inequality can be used to derive the polynomial convergence result on the utility, which in turn refines the queue length bound.

Step 2 (Yield Polynomial Convergence on the Utility).

Moving the difference in the average utility to the LHS in Equation 20 and plugging in the just-derived bound on average queue length, we have

	$\displaystyle\frac{V}{\epsilon_{W}T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\left(g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))\right)\right]$	$\displaystyle=\operatorname{\mathcal{O}}_{T}\left(-0+\frac{f(T)}{T}+\frac{g(T)}{T}(VT)^{3/4}+\frac{h(T)}{T}(VT)^{7/8}\right)$
		$\displaystyle=\operatorname{\mathcal{O}}_{T}\left(1+\left(\frac{T^{1-2\delta_{% a}}V^{3}}{T}\right)^{1/4}+\left(\frac{T^{1-2\delta_{\lambda}}V^{7}}{T}\right)^% {1/8}\right).$

According to the assumption that $V=o_{T}(\min\{T^{2\delta_{a}/3},T^{2\delta_{\lambda}/7}\})$ , the RHS is of order $\operatorname{\mathcal{O}}_{T}(1)$ . Thus

\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\left(g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))\right)\right]=\operatorname{\mathcal{O}}_{T}(V^{-1}),

i.e., a polynomial convergence rate on the expected average utility is derived.
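The exponent bookkeeping in Step 2 can be verified with exact rational arithmetic. Writing $V=T^{v}$, the sketch below computes the exponents of $T$ in the terms $\frac{g(T)}{T}(VT)^{3/4}$ and $\frac{h(T)}{T}(VT)^{7/8}$, using $g(T)=T^{1/4-\delta_{a}/2}$ and $h(T)=T^{1/8-\delta_{\lambda}/4}$ from Equation 20. The function names and the example values $\delta_{a}=\delta_{\lambda}=1/4$ are assumptions chosen for illustration.

```python
from fractions import Fraction
from typing import Tuple


def step2_exponents(delta_a: Fraction, delta_lam: Fraction, v: Fraction) -> Tuple[Fraction, Fraction]:
    """Exponents of T in the last two terms of the Step 2 bound when V = T^v."""
    # g(T)/T * (VT)^{3/4} with g(T) = T^{1/4 - delta_a/2}; simplifies to 3v/4 - delta_a/2.
    term_a = (Fraction(1, 4) - delta_a / 2) - 1 + Fraction(3, 4) * (v + 1)
    # h(T)/T * (VT)^{7/8} with h(T) = T^{1/8 - delta_lam/4}; simplifies to 7v/8 - delta_lam/4.
    term_lam = (Fraction(1, 8) - delta_lam / 4) - 1 + Fraction(7, 8) * (v + 1)
    return term_a, term_lam


def max_v_exponent(delta_a: Fraction, delta_lam: Fraction) -> Fraction:
    """Largest v for which both exponents are non-positive: min(2*delta_a/3, 2*delta_lam/7)."""
    return min(2 * delta_a / 3, 2 * delta_lam / 7)


# Hypothetical example with delta_a = delta_lambda = 1/4.
d_a, d_lam = Fraction(1, 4), Fraction(1, 4)
v_star = max_v_exponent(d_a, d_lam)                  # min(1/6, 1/14) = 1/14
exps_at_v_star = step2_exponents(d_a, d_lam, v_star)  # (-1/14, 0)
```

Both exponents are non-positive exactly when $v\leq\min\{2\delta_{a}/3,2\delta_{\lambda}/7\}$, which recovers the condition on $V$ in Theorem 4.5 up to the distinction between $o_{T}$ and $\operatorname{\mathcal{O}}_{T}$.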

Step 3 (Refine the Average Queue Length Bound).

Now we are ready to refine our average queue length bound. Instead of controlling the utility via the boundedness $g_{t}\in[-G,G]$ , we utilize the just-derived convergence result and obtain (again using the self-bounding property in Lemma D.6)

\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{% 1}\right]=\operatorname{\mathcal{O}}_{T}\left(V\times V^{-1}T+T\right)+% \operatorname{\widetilde{\mathcal{O}}}_{T}\left(T^{1-2\delta_{a}}\right)+% \operatorname{\mathcal{O}}_{T}\left(T^{1-2\delta_{\lambda}}\right)=% \operatorname{\mathcal{O}}_{T}(T),

that is, the average queue length $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}% (t)\rVert_{1}\right]=\operatorname{\mathcal{O}}_{T}(1)$ , which means Equation 2 holds and the system is stable. Putting Step 2 and Step 3 together gives our conclusion.

Once again, we omitted all factors except for those $\text{poly}(T)$ ones throughout this proof sketch. Please refer to Section C.5 for the complete version. ∎

5 Conclusion

We study utility maximization in Adversarial Network Optimization (ANO) under bandit feedback. We design a network stability algorithm NSO and a utility maximization algorithm UMO², both of which integrate online learning components into the Lyapunov drift framework to allow a joint analysis. When designing the online learning components of UMO², due to the potentially unbounded queue lengths in network optimization and the self-bounding analysis we want to conduct, we develop a novel OLO algorithm AdaPFOL which adapts to occasionally large losses and a BCO algorithm AdaBGD which suits large loss magnitudes and Lipschitzness via a meticulous learning rate scheduling scheme. One important future research direction is to define alternative reference policy classes that allow competing against more policies, ideally including the optimal ones.

References

Andrews and Zhang (2004) Matthew Andrews and Lisa Zhang. Scheduling over nonstationary wireless channels with finite rate sets. In IEEE INFOCOM 2004, volume 3, pages 1694–1704. IEEE, 2004.
Andrews and Zhang (2005) Matthew Andrews and Lisa Zhang. Scheduling over a time-varying user-dependent channel with applications to high-speed wireless data. Journal of the ACM (JACM), 52(5):809–834, 2005.
Andrews et al. (2001) Matthew Andrews, Baruch Awerbuch, Antonio Fernández, Tom Leighton, Zhiyong Liu, and Jon Kleinberg. Universal-stability results and performance bounds for greedy contention-resolution protocols. Journal of the ACM (JACM), 48(1):39–69, 2001.
Andrews et al. (2007) Matthew Andrews, Kyomin Jung, and Alexander Stolyar. Stability of the max-weight routing and scheduling protocol in dynamic networks and at critical loads. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 145–154, 2007.
Ashjaei et al. (2021) Mohammad Ashjaei, Lucia Lo Bello, Masoud Daneshtalab, Gaetano Patti, Sergio Saponara, and Saad Mubeen. Time-sensitive networking in automotive embedded systems: State of the art and research opportunities. Journal of systems architecture, 117:102137, 2021.
Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
Borodin et al. (2001) Allan Borodin, Jon Kleinberg, Prabhakar Raghavan, Madhu Sudan, and David P Williamson. Adversarial queuing theory. Journal of the ACM (JACM), 48(1):13–38, 2001.
Chen and Giannakis (2018) Tianyi Chen and Georgios B Giannakis. Bandit convex optimization for scalable and dynamic iot management. IEEE Internet of Things Journal, 6(1):1276–1286, 2018.
Cholvi and Echagüe (2007) Vicent Cholvi and Juan Echagüe. Stability of fifo networks under adversarial models: State of the art. Computer Networks, 51(15):4460–4474, 2007.
Choudhury et al. (2021) Tuhinangshu Choudhury, Gauri Joshi, Weina Wang, and Sanjay Shakkottai. Job dispatching policies for queueing systems with unknown service rates. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 181–190, 2021.
Cruz (1991) Rene L Cruz. A calculus for network delay. i. network elements in isolation. IEEE Transactions on information theory, 37(1):114–131, 1991.
Cutkosky (2020) Ashok Cutkosky. Parameter-free, dynamic, and strongly-adaptive online learning. In International Conference on Machine Learning, pages 2250–2259. PMLR, 2020.
Dai and Gluzman (2022) Jim G Dai and Mark Gluzman. Queueing network controls via deep reinforcement learning. Stochastic Systems, 12(1):30–67, 2022.
Dai et al. (2023) Yan Dai, Haipeng Luo, Chen-Yu Wei, and Julian Zimmert. Refined regret for adversarial mdps with linear function approximation. In International Conference on Machine Learning, pages 6726–6759. PMLR, 2023.
Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
Flaxman et al. (2005) Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
Fu and Modiano (2022) Xinzhe Fu and Eytan Modiano. Joint learning and control in stochastic queueing networks with unknown utilities. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(3):1–32, 2022.
Gaddam et al. (2020) Anuroop Gaddam, Tim Wilkin, Maia Angelova, and Jyotheesh Gaddam. Detecting sensor faults, anomalies and outliers in the internet of things: A survey on the challenges and solutions. Electronics, 9(3):511, 2020.
Harrison and Wein (1990) J Michael Harrison and Lawrence M Wein. Scheduling networks of queues: Heavy traffic analysis of a two-station closed network. Operations research, 38(6):1052–1064, 1990.
Huang et al. (2024) Jiatai Huang, Leana Golubchik, and Longbo Huang. When lyapunov drift based queue scheduling meets adversarial bandit learning. IEEE/ACM Transactions on Networking, 2024.
Huang and Neely (2011) Longbo Huang and Michael J Neely. Utility optimal scheduling in processing networks. Performance Evaluation, 68(11):1002–1021, 2011.
Huang et al. (2012) Longbo Huang, Scott Moeller, Michael J Neely, and Bhaskar Krishnamachari. Lifo-backpressure achieves near-optimal utility-delay tradeoff. IEEE/ACM Transactions On Networking, 21(3):831–844, 2012.
Khan et al. (2020) Md Rizwan Khan, Bikramaditya Das, and Bibhuti Bhusan Pati. Channel estimation strategies for underwater acoustic (uwa) communication: An overview. Journal of the Franklin Institute, 357(11):7229–7265, 2020.
Krishnasamy et al. (2018) Subhashini Krishnasamy, PT Akhil, Ari Arapostathis, Rajesh Sundaresan, and Sanjay Shakkottai. Augmenting max-weight with explicit learning for wireless scheduling with switching costs. IEEE/ACM Transactions on Networking, 26(6):2501–2514, 2018.
Krishnasamy et al. (2021) Subhashini Krishnasamy, Rajat Sen, Ramesh Johari, and Sanjay Shakkottai. Learning unknown service rates in queues: A multiarmed bandit approach. Operations research, 69(1):315–330, 2021.
Liang and Modiano (2018a) Qingkai Liang and Eytan Modiano. Network utility maximization in adversarial environments. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 594–602. IEEE, 2018a.
Liang and Modiano (2018b) Qingkai Liang and Eytan Modiano. Minimizing queue length regret under adversarial network models. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):1–32, 2018b.
Lim et al. (2013) Sungsu Lim, Kyomin Jung, and Matthew Andrews. Stability of the max-weight protocol in adversarial wireless networks. IEEE/ACM Transactions on Networking, 22(6):1859–1872, 2013.
Liu et al. (2022) Bai Liu, Qiaomin Xie, and Eytan Modiano. Rl-qn: A reinforcement learning framework for optimal control of queueing systems. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 7(1):1–35, 2022.
Maguluri et al. (2012) Siva Theja Maguluri, Rayadurgam Srikant, and Lei Ying. Stochastic models of load balancing and scheduling in cloud computing clusters. In 2012 Proceedings IEEE Infocom, pages 702–710. IEEE, 2012.
McMahan and Streeter (2010) H Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. Annual Conference on Learning Theory 2010, page 244, 2010.
Neely (2008) Michael J Neely. Order optimal delay for opportunistic scheduling in multi-user wireless uplinks and downlinks. IEEE/ACM Transactions on Networking, 16(5):1188–1199, 2008.
Neely (2009) Michael J Neely. Delay analysis for max weight opportunistic scheduling in wireless systems. IEEE Transactions on Automatic Control, 54(9):2137–2150, 2009.
Neely (2010a) Michael J Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010a.
Neely (2010b) Michael J Neely. Universal scheduling for networks with arbitrary traffic, channels, and mobility. In 49th IEEE Conference on Decision and Control (CDC), pages 1822–1829. IEEE, 2010b.
Neely et al. (2008) Michael J Neely, Eytan Modiano, and Chih-Ping Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions On Networking, 16(2):396–409, 2008.
Neely et al. (2012) Michael J Neely, Scott T Rager, and Thomas F La Porta. Max weight learning algorithms for scheduling in unknown environments. IEEE Transactions on Automatic Control, 57(5):1179–1191, 2012.
Rahdar et al. (2018) Mohammad Rahdar, Lizhi Wang, and Guiping Hu. A tri-level optimization model for inventory control with uncertain demand and lead time. International Journal of Production Economics, 195:96–105, 2018.
Sadiq and De Veciana (2009) Bilal Sadiq and Gustavo De Veciana. Throughput optimality of delay-driven maxweight scheduler for a wireless system with flow dynamics. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1097–1102. IEEE, 2009.
Srikant and Ying (2013) Rayadurgam Srikant and Lei Ying. Communication networks: an optimization, control, and stochastic networks perspective. Cambridge University Press, 2013.
Tsibonis et al. (2003) Vagelis Tsibonis, Leonidas Georgiadis, and Leandros Tassiulas. Exploiting wireless channel state information for throughput maximization. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), volume 1, pages 301–310. IEEE, 2003.
Wei et al. (2024) Honghao Wei, Xin Liu, Weina Wang, and Lei Ying. Sample efficient reinforcement learning in mixed systems through augmented samples and its applications to queueing networks. Advances in Neural Information Processing Systems, 36, 2024.
Yang et al. (2023) Zixian Yang, R Srikant, and Lei Ying. Learning while scheduling in multi-server systems with unknown statistics: MaxWeight with discounted UCB. In International Conference on Artificial Intelligence and Statistics, pages 4275–4312. PMLR, 2023.
Zhao et al. (2021) Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. Journal of Machine Learning Research, 22(125):1–45, 2021.
Zheng et al. (2019) Kai Zheng, Haipeng Luo, Ilias Diakonikolas, and Liwei Wang. Equipping experts/bandits with long-term memory. Advances in Neural Information Processing Systems, 32, 2019.
Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
Zou et al. (2016) Yulong Zou, Jia Zhu, Xianbin Wang, and Lajos Hanzo. A survey on wireless security: Technical challenges, recent advances, and future trends. Proceedings of the IEEE, 104(9):1727–1765, 2016.


Appendix A Additional Related Works

Adversarial Components in Network Optimization

The study of adversarial components in network optimization dates back to the 1990s, when Cruz (1991) gave the first network model with adversarial dynamics. This was further generalized to Adversarial Queueing Theory (Borodin et al., 2001) and Leaky Bucket (Andrews et al., 2001) models. Many follow-up works, as surveyed by Cholvi and Echagüe (2007), considered network optimization under various types of adversarial traffic injections (i.e., the arrival rates to each queue are adversarial). However, these early works only considered adversarial arrival rates and assumed stationary link conditions, which cannot capture the fact that link conditions in wireless communication networks can vary substantially over time due to congestion (Zou et al., 2016).

Noticing this shortcoming, Andrews and Zhang (2004) and Andrews and Zhang (2005) studied a single-hop network where link conditions are also adversarial. Their works were extended to multi-hop networks by Andrews et al. (2007) and Lim et al. (2013). For a more up-to-date discussion of these works, we refer readers to Liang and Modiano (2018b).

Utility Maximization in Adversarial Networks. While a growing number of results address network stability in adversarial networks, utility maximization guarantees remain uncommon. Neely (2010b) proposed the universal network utility maximization problem, which considers competing with a look-ahead policy that has perfect knowledge of the near future. Liang and Modiano (2018a) generalized the aforementioned stability requirement in another way and showed a trade-off between network stability and utility maximization. However, both papers assumed perfect knowledge of network conditions, as summarized in Table 1.

Feedback Models. Most previous works considered the perfect knowledge model, which assumes that network conditions are known before each decision (Liang and Modiano, 2018b; a). Despite its simplicity, this assumption eliminates the hardness of estimating the network topology or link conditions, which we argue is highly non-trivial due to the unpredictability in a drastically varying network like underwater communications (Khan et al., 2020) or IoT systems (Gaddam et al., 2020). The harder full-information feedback model assumes no prior knowledge at decision-making but requires the network conditions to be fully revealed after each decision; a variant of this model was proposed by Neely et al. (2012) under the name of 2-stage decision model. In this paper, we consider the hardest bandit feedback model, which rules out all counterfactual feedback associated with actions that are not actually deployed. This model, under various names, was recently proposed and considered by Fu and Modiano (2022), Yang et al. (2023), and Huang et al. (2024).

Adversarial Networks under Bandit Feedback. To the best of our knowledge, the papers closest to ours are the ones by Huang et al. (2024) and Yang et al. (2023), which also studied adversarial networks under bandit feedback. However, both papers assumed a single-hop network model, i.e., jobs immediately leave the network upon being served. This model falls short of reflecting the reality that some jobs may be forwarded within the network for many hops, for example, in the classical criss-cross network extensively studied in the SNO literature (Harrison and Wein, 1990). In contrast, our multi-hop model allows jobs to be forwarded from one server to another and is much more general.

Moreover, both papers only considered the task of network stability, i.e., the number of jobs remaining in the network (which is the sum of queue lengths) does not diverge; see Equation 2. However, only stabilizing the system may not be enough in many realistic problems, where the throughput (average number of jobs getting served) (Tsibonis et al., 2003, Sadiq and De Veciana, 2009) or delay (average waiting time for each job from entering the system to being served) (Neely, 2008; 2009) should be optimized. In our paper, we consider the utility maximization task (Huang and Neely, 2011, Huang et al., 2012) where an abstract utility function shall be optimized, allowing various network optimization objectives other than simply stabilizing the system.

Learning-Augmented Algorithms in Network Optimization. Finally, we give a brief overview of recent learning-augmented algorithms in network optimization. To tackle the lack of accurate channel information (e.g., under the feedback model), exploration approaches like $\epsilon$ -greedy (Krishnasamy et al., 2018; 2021) or Upper Confidence Bound (UCB) (Auer, 2002) (e.g., (Choudhury et al., 2021, Krishnasamy et al., 2021, Yang et al., 2023)) were widely used. More recently, using a Reinforcement Learning (RL) approach, Liu et al. (2022) proposed the RL-QN algorithm, which uses the queue lengths as the RL state and outperforms many existing SNO algorithms. Empirically, utilizing the recent advances in Deep RL (DRL), Dai and Gluzman (2022) established state-of-the-art control performance in the criss-cross network. Following this line, Wei et al. (2024) compressed the number of states and yielded improved performance in other networks as well. However, most of the aforementioned works only considered the stochastic regime. Moreover, RL approaches (especially DRL ones) typically face extremely large state spaces when modelling network optimization problems, making the algorithms computationally infeasible. In contrast, our online-learning-based approach yields an algorithm that not only handles adversarial environments and bandit feedback but is also computationally efficient.

Appendix B Omitted Proofs for Multi-Hop Network Stability Tasks

Before presenting the theorem proofs, we first give a bound on the queue length increments.

Lemma B.1 (Queue Length Increment).

For every $n\in{\mathcal{N}}$ , $k\in{\mathcal{N}}$ , and $t\in[T]$ , we have

\lvert Q_{n}^{(k)}(t+1)-Q_{n}^{(k)}(t)\rvert\leq(2NM+R).

Proof.

According to the queue length dynamics in Equation 1, we know

\lvert Q_{n}^{(k)}(t+1)-Q_{n}^{(k)}(t)\rvert\leq\left\lvert\sum_{(n,m)\in{% \mathcal{L}}}\mu_{n,m}^{(k)}(t)+\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)% +\lambda_{n}^{(k)}(t)\right\rvert\leq(2NM+R),

which uses the assumptions that $\lambda_{n}^{(k)}(t)\in[0,R]$ and $\mu_{n,m}^{(k)}(t)\in[0,M]$ , together with the fact that each sum ranges over at most $N$ links. ∎
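As a quick numerical sanity check of Lemma B.1, the following Python sketch simulates a single-commodity network under the recursion $Q(t+1)=[Q(t)-\text{out}]_{+}+\text{in}+\lambda$ , which is the form the proofs in this appendix use for Equation 1 (taken with equality here, as an assumption). The names `step` and `max_increment`, the complete link set, and the uniformly random rates are illustrative choices and not part of the paper.

```python
import random


def step(Q, mu, lam):
    """One round of the assumed dynamics for a single commodity:
    Q_n(t+1) = [Q_n(t) - sum_m mu[n][m]]_+ + sum_o mu[o][n] + lam[n]."""
    N = len(Q)
    out = [sum(mu[n]) for n in range(N)]                       # service leaving n
    inc = [sum(mu[o][n] for o in range(N)) for n in range(N)]  # service entering n
    return [max(Q[n] - out[n], 0.0) + inc[n] + lam[n] for n in range(N)]


def max_increment(N=4, M=2.0, R=3.0, T=500, seed=0):
    """Largest one-step change |Q_n(t+1) - Q_n(t)| seen over T random rounds."""
    rng = random.Random(seed)
    Q = [0.0] * N
    worst = 0.0
    for _ in range(T):
        # Hypothetical environment: every link rate in [0, M], arrivals in [0, R].
        mu = [[rng.uniform(0, M) if n != m else 0.0 for m in range(N)]
              for n in range(N)]
        lam = [rng.uniform(0, R) for _ in range(N)]
        Q_next = step(Q, mu, lam)
        worst = max(worst, max(abs(a - b) for a, b in zip(Q_next, Q)))
        Q = Q_next
    return worst
```

Under these assumptions the observed increment stays below $2NM+R$ ; the bound is loose here because outgoing and incoming service cannot both push a queue in the same direction.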

B.1 Reference Policy Assumption (Proof of Lemma 3.1)

Lemma B.2 (Restatement of Lemma 3.1; Ability of $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ in Stabilizing the Network).

If $\{\mathring{\bm{a}}(t)\}_{t\in[T]}$ satisfies Assumption 1, then for any scheduler-generated queue lengths $\{\bm{Q}(t)\}_{t\in[T]}$ ,

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}C_{n,m}(t)\mathring{a}_{n,m}^{(% k)}(t)(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E% }}\left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(% k)}(t)\lambda_{n}^{(k)}(t)\right].$

Proof.

The proof closely follows that of Huang et al. (2024, Lemma 2); we adapt it here for completeness. For each interval $W_{j}$ in Assumption 1, let $T_{0}$ be the first round in $W_{j}$ . Then

	$\displaystyle\quad\sum_{t\in W_{j}}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal% {N}}}Q_{n}^{(k)}(t)\left(\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)\mathring{a}_{n% ,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}C_{o,n}(t)% \mathring{a}_{o,n}^{(k)}(t)\right)$
	$\displaystyle=\sum_{t\in W_{j}}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}% }Q_{n}^{(k)}(T_{0})\left(\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)\mathring{a}_{n% ,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}C_{o,n}(t)% \mathring{a}_{o,n}^{(k)}(t)\right)+$
	$\displaystyle\quad\sum_{t\in W_{j}}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal% {N}}}\left(Q_{n}^{(k)}(t)-Q_{n}^{(k)}(T_{0})\right)\left(\sum_{(n,m)\in{% \mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{% (o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)\right).$

For the first term, Assumption 1 gives the following for every $n\in{\mathcal{N}}$ and $k\in{\mathcal{N}}$ :

\displaystyle\sum_{t\in W_{j}}\left(\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)% \mathring{a}_{n,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}C% _{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)\right)\geq\epsilon_{W}\lvert W_{j}\rvert.

Thus, the first summation enjoys the following lower bound:

	$\displaystyle\quad\sum_{t\in W_{j}}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal% {N}}}Q_{n}^{(k)}(T_{0})\left(\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)\mathring{a% }_{n,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}C_{o,n}(t)% \mathring{a}_{o,n}^{(k)}(t)\right)$
	$\displaystyle\geq\left(\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{% (k)}(T_{0})\right)\epsilon_{W}\lvert W_{j}\rvert=\epsilon_{W}\lvert W_{j}% \rvert\times\lVert\bm{Q}(T_{0})\rVert_{1}\geq\epsilon_{W}\sum_{t\in W_{j}}% \lVert\bm{Q}(t)\rVert_{1}-\epsilon_{W}N^{2}(2NM+R)(\lvert W_{j}\rvert-1)^{2},$

where the last step uses the fact that $\lVert\bm{Q}(t)\rVert_{1}-\lVert\bm{Q}(T_{0})\rVert_{1}\leq\lVert\bm{Q}(t)-\bm% {Q}(T_{0})\rVert_{1}$ and the queue length increment bound in Lemma B.1.

For the second summation, again utilizing the bound on $\lVert\bm{Q}(t)-\bm{Q}(T_{0})\rVert_{1}$ , we have

	$\displaystyle\quad\sum_{t\in W_{j}}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal% {N}}}\left(Q_{n}^{(k)}(t)-Q_{n}^{(k)}(T_{0})\right)\left(\sum_{(n,m)\in{% \mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{% (o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)\right)$
	$\displaystyle\geq-\sum_{t\in W_{j}}\lVert\bm{Q}(t)-\bm{Q}(T_{0})\rVert_{1}% \times\max_{t\in W_{j}}\max_{n\in{\mathcal{N}}}\max_{k\in{\mathcal{N}}}\left% \lvert\sum_{(n,m)\in{\mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)-% \lambda_{n}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^% {(k)}(t)\right\rvert$
	$\displaystyle\geq-N^{2}(2NM+R)(\lvert W_{j}\rvert-1)^{2}\times(2NM+R)=-N^{2}(2% NM+R)^{2}(\lvert W_{j}\rvert-1)^{2}.$

Therefore, recall the assumption that $\sum_{j\in[J]}(\lvert W_{j}\rvert-1)^{2}\leq C_{W}T$ , summing over $j=1,2,\ldots,J$ gives

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{% n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\left(\sum_{(n,m)\in{% \mathcal{L}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)-\lambda_{n}^{(k)}(t)-\sum_{% (o,n)\in{\mathcal{L}}}C_{o,n}(t)\mathring{a}_{o,n}^{(k)}(t)\right)\right]$
	$\displaystyle=-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{(n% ,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}C_{n,m}(t)\mathring{a}_{n,m}^{(k)}% (t)(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)% }(t)\lambda_{n}^{(k)}(t)\right],$

thus giving our conclusion. ∎
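The last equality in the proof above rewrites a queue-weighted sum over nodes as a sum over links. A minimal Python sketch of that regrouping, for a single commodity and a complete link set, is given below; the names `node_form` and `link_form` are hypothetical, and `mu[n][m]` stands for the product $C_{n,m}(t)\mathring{a}_{n,m}^{(k)}(t)$ .

```python
import random


def node_form(Q, mu):
    """sum_n Q_n * (outgoing service at n - incoming service at n)."""
    N = len(Q)
    return sum(
        Q[n] * (sum(mu[n][m] for m in range(N)) - sum(mu[o][n] for o in range(N)))
        for n in range(N)
    )


def link_form(Q, mu):
    """-sum_{(n,m)} mu_{n,m} * (Q_m - Q_n), the form used in the lemma."""
    N = len(Q)
    return -sum(mu[n][m] * (Q[m] - Q[n]) for n in range(N) for m in range(N))


def random_instance(N, rng):
    """Hypothetical random queue lengths and link rates for the check."""
    Q = [rng.uniform(0, 10) for _ in range(N)]
    mu = [[rng.uniform(0, 1) for _ in range(N)] for _ in range(N)]
    return Q, mu
```

Both functions return the same value up to floating-point error, since each link $(n,m)$ contributes $+\mu_{n,m}Q_{n}$ through node $n$ and $-\mu_{n,m}Q_{m}$ through node $m$ .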

B.2 Lyapunov Drift Analysis (Proof of Lemma 3.2)

Lemma B.3 (Restatement of Lemma 3.2; Lyapunov Drift Analysis).

Under the queue dynamics of Equation 1,

	$\displaystyle 0\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \Delta(\bm{Q}(t))\right]$	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{% (n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)\left(Q_{m}^{(% k)}(t)-Q_{n}^{(k)}(t)\right)+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q% _{n}^{(k)}(t)\lambda_{n}^{(k)}(t)\right]+$
		$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T.$

Proof.

According to Equation 1, for any round $t\in[T]$ , server $n\in{\mathcal{N}}$ , and commodity $k\in{\mathcal{N}}$ , we have

	$\displaystyle(Q_{n}^{(k)}(t+1))^{2}$	$\displaystyle\leq\left[Q_{n}^{(k)}(t)-\sum_{(n,m)\in{\mathcal{L}}}\mu_{n,m}^{(% k)}(t)\right]_{+}^{2}+\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)+% \lambda_{n}^{(k)}(t)\right)^{2}+$
		$\displaystyle\quad 2\left[Q_{n}^{(k)}(t)-\sum_{(n,m)\in{\mathcal{L}}}\mu_{n,m}% ^{(k)}(t)\right]_{+}\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)+% \lambda_{n}^{(k)}(t)\right)$
		$\displaystyle\leq(Q_{n}^{(k)}(t))^{2}-2Q_{n}^{(k)}(t)\left(\sum_{(n,m)\in{% \mathcal{L}}}\mu_{n,m}^{(k)}(t)\right)+\left(\sum_{(n,m)\in{\mathcal{L}}}\mu_{% n,m}^{(k)}(t)\right)^{2}+$
		$\displaystyle\quad\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)+\lambda% _{n}^{(k)}(t)\right)^{2}+2Q_{n}^{(k)}(t)\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_% {o,n}^{(k)}(t)+\lambda_{n}^{(k)}(t)\right)$
		$\displaystyle=(Q_{n}^{(k)}(t))^{2}-2Q_{n}^{(k)}(t)\left(\sum_{(n,m)\in{% \mathcal{L}}}\mu_{n,m}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)% -\lambda_{n}^{(k)}(t)\right)+$
		$\displaystyle\quad\left(\sum_{(n,m)\in{\mathcal{L}}}\mu_{n,m}^{(k)}(t)\right)^% {2}+\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)+\lambda_{n}^{(k)}(t)% \right)^{2}.$

Therefore, by definition of the Lyapunov function $L_{t}$ and the Lyapunov drift $\Delta(\bm{Q}(t))$ ,

	$\displaystyle\Delta(\bm{Q}(t))$	$\displaystyle\leq-\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(% t)\operatornamewithlimits{\mathbb{E}}\left[\sum_{(n,m)\in{\mathcal{L}}}\mu_{n,% m}^{(k)}(t)-\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t)-\lambda_{n}^{(k)}(t% )\middle\|\bm{Q}(t)\right]+$
		$\displaystyle\quad\frac{1}{2}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}% \operatornamewithlimits{\mathbb{E}}\left[\left(\sum_{(n,m)\in{\mathcal{L}}}\mu% _{n,m}^{(k)}(t)\right)^{2}+\left(\sum_{(o,n)\in{\mathcal{L}}}\mu_{o,n}^{(k)}(t% )+\lambda_{n}^{(k)}(t)\right)^{2}\middle\|\bm{Q}(t)\right].$

Exchanging summations and using the bounded assumptions that $\mu_{n,m}^{(k)}(t)\in[0,M]$ and $\lambda_{n}^{(k)}(t)\in[0,R]$ , we get

	$\displaystyle\Delta(\bm{Q}(t))$	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{(n,m)\in{% \mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)\left(Q_{m}^{(k)}(t)-Q_% {n}^{(k)}(t)\right)+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)% }(t)\lambda_{n}^{(k)}(t)\middle\|\bm{Q}(t)\right]+$
		$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2}).$

Taking expectation w.r.t. $\bm{Q}(t)$ and summing up from $t=1,2,\ldots,T$ , we have

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\Delta(\bm% {Q}(t))\right]$	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{% (n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)\left(Q_{m}^{(% k)}(t)-Q_{n}^{(k)}(t)\right)+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q% _{n}^{(k)}(t)\lambda_{n}^{(k)}(t)\right]+$
		$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T.$

By telescoping sums, we know

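\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\Delta(\bm{Q}(t))\right]=\operatornamewithlimits{\mathbb{E}}\left[L_{T+1}-L_{1}\right]=\operatornamewithlimits{\mathbb{E}}\left[L_{T+1}\right]\geq 0,

where the second step uses $L_{1}=0$ and the last step uses the fact that $L_{T+1}$ is a sum of squares. Therefore, our conclusion follows. ∎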
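The per-queue inequality at the start of this proof can be checked numerically. The sketch below is only an illustration: `drift_gap` is a hypothetical helper, `out` stands for $\sum_{(n,m)\in\mathcal{L}}\mu_{n,m}^{(k)}(t)$ , and `inc` stands for $\sum_{(o,n)\in\mathcal{L}}\mu_{o,n}^{(k)}(t)+\lambda_{n}^{(k)}(t)$ , with the queue update taken with equality as an assumption.

```python
import random


def drift_gap(q, out, inc):
    """Right-hand side minus left-hand side of the per-queue bound
    Q(t+1)^2 <= Q^2 - 2 Q (out - inc) + out^2 + inc^2."""
    q_next = max(q - out, 0.0) + inc          # assumed queue update
    rhs = q * q - 2.0 * q * (out - inc) + out * out + inc * inc
    return rhs - q_next * q_next


def min_gap(trials=10000, seed=0):
    """Smallest gap over random non-negative (q, out, inc) triples."""
    rng = random.Random(seed)
    return min(
        drift_gap(rng.uniform(0, 50), rng.uniform(0, 10), rng.uniform(0, 10))
        for _ in range(trials)
    )
```

The gap is non-negative in both cases of the truncation: it equals $2\cdot\text{out}\cdot\text{inc}$ when $Q\geq\text{out}$ and $(\text{out}-Q)^{2}+2Q\cdot\text{inc}$ otherwise.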

B.3 Guarantee of AdaPFOL Algorithm (Proof of Lemma 3.4)

Before analyzing our AdaPFOL algorithm, we first state the guarantee of the PFOL algorithm (Cutkosky, 2020). Roughly, for bounded losses (i.e., $\lVert\bm{g}_{t}\rVert_{\infty}\leq 1$ ), there exists an algorithm that enjoys the following parameter-free guarantee (i.e., its performance depends on the loss magnitudes).

Lemma B.4 (Guarantee of PFOL Algorithm (Cutkosky, 2020, Theorem 6)).

Consider the OLO problem in Definition 3.3. Suppose that $\mathcal{X}$ has diameter $D=\sup_{\bm{x},\bm{y}\in\mathcal{X}}\lVert\bm{x}-\bm{y}\rVert_{1}$ and all $\lVert\bm{g}_{t}\rVert_{\infty}\leq 1$ . Then there exists an algorithm, such that for any comparator sequence $\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},\ldots,\mathring{\bm{x}}_{T}\in% \mathcal{X}$ ,

\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},% \ldots,\mathring{\bm{x}}_{T})=\operatorname{\mathcal{O}}\left(\sqrt{D\left(D+% \sum_{t=1}^{T-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert_{1}% \right)}\sqrt{1+\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}\log\left(T% \sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}\right)\right).

Because the original algorithm is complicated, we do not present its pseudo-code here and instead use it as a black box. Please refer to the original paper by Cutkosky (2020) for more details. Note that, although the original analysis by Cutkosky (2020) uses the $\ell_{2}$ -norm for both comparators $\{\mathring{\bm{x}}_{t}\}_{t=1}^{T}$ and loss vectors $\{\bm{g}_{t}\}_{t=1}^{T}$ , it is straightforward to extend it to a pair of dual norms, which is the $\ell_{1}$ -norm for $\{\mathring{\bm{x}}_{t}\}_{t=1}^{T}$ and the $\ell_{\infty}$ -norm for $\{\bm{g}_{t}\}_{t=1}^{T}$ in our case.

Now, we are ready to present the guarantee for our AdaPFOL algorithm:

Lemma B.5 (Restatement of Lemma 3.4; Guarantee of AdaPFOL Algorithm).

\text{D-Regret}_{T}^{\text{OLO}}(\mathring{\bm{x}}_{1},\mathring{\bm{x}}_{2},% \ldots,\mathring{\bm{x}}_{T})=\operatorname{\mathcal{O}}\left(\sqrt{D\left(D+% \sum_{t=1}^{T-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert_{1}% \right)}\sqrt{\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}\log T\log% \left(\max_{t=1}^{T}G_{t}\right)\right).

Proof.

As $G$ changes only if $G_{t}>2\max_{s<t}G_{s}$ , it cannot change more than $\lceil\log_{2}(\max_{t}G_{t})\rceil$ times. For a fixed $G$ , suppose that it is used for rounds $t_{1},t_{1}+1,\ldots,t_{2}$ ; then we must have $G_{t}\leq G$ for all $t\in[t_{1},t_{2}]$ , as otherwise a new instance of PFOL would be launched.

Therefore, as $\lVert\bm{g}_{t}\rVert_{\infty}\leq G_{t}\leq G$ , we have $\lVert G^{-1}\bm{g}_{t}\rVert_{\infty}\leq 1$ . This allows us to apply Lemma B.4 and yield

\sum_{t=t_{1}}^{t_{2}}\langle G^{-1}\bm{g}_{t},\bm{x}_{t}-\mathring{\bm{x}}_{t}\rangle=\operatorname{\mathcal{O}}\left(\sqrt{D\left(D+\sum_{t=t_{1}}^{t_{2}-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert_{1}\right)}\sqrt{\sum_{t=t_{1}}^{t_{2}}\lVert G^{-1}\bm{g}_{t}\rVert_{\infty}^{2}}\log T\right).

Multiplying $G$ on both sides, we have

\sum_{t=t_{1}}^{t_{2}}\langle\bm{g}_{t},\bm{x}_{t}-\mathring{\bm{x}}_{t}\rangle=\operatorname{\mathcal{O}}\left(\sqrt{D\left(D+\sum_{t=t_{1}}^{t_{2}-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert_{1}\right)}\sqrt{\sum_{t=t_{1}}^{t_{2}}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}\log T\right).\tag{21}

As the intervals $[t_{1},t_{2}]$ form a partition of $[T]$ , summing Equation 21 over all of them gives

\sum_{t=1}^{T}\langle\bm{g}_{t},\bm{x}_{t}-\mathring{\bm{x}}_{t}\rangle=\operatorname{\mathcal{O}}\left(\sqrt{D\left(D+\sum_{t=1}^{T-1}\lVert\mathring{\bm{x}}_{t}-\mathring{\bm{x}}_{t+1}\rVert_{1}\right)}\sqrt{\sum_{t=1}^{T}\lVert\bm{g}_{t}\rVert_{\infty}^{2}}\log T\log\left(\max_{t=1}^{T}G_{t}\right)\right),

which utilizes the fact that at most $\operatorname{\mathcal{O}}(\lceil\log_{2}(\max_{t}G_{t})\rceil)$ distinct $[t_{1},t_{2}]$ ’s can occur. ∎
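The restart argument above can be illustrated with a small Python sketch of a doubling scheme. It is a sketch under stated assumptions, not the AdaPFOL pseudo-code: the function `doubling_epochs` and the restart rule `G = 2 * G_t` are hypothetical choices, the magnitude bounds $G_{t}$ are assumed strictly positive, and the PFOL base learner is not implemented (only the epoch bookkeeping is shown).

```python
import math


def doubling_epochs(G_seq):
    """Partition rounds into epochs that share one scale G.

    A new epoch (i.e., a fresh base-learner instance) starts whenever the
    current magnitude bound G_t exceeds the scale G in use.
    Returns a list of [G, first_round, last_round] with 0-based rounds.
    """
    epochs = []
    G = None
    for t, G_t in enumerate(G_seq):
        if G is None or G_t > G:
            G = 2.0 * G_t              # assumed restart rule: at least doubles G
            epochs.append([G, t, t])
        else:
            epochs[-1][2] = t          # current epoch keeps running
    return epochs


def scaled_ok(G_seq, epochs):
    """Check that G_t / G <= 1 in every epoch, so rescaled losses are bounded."""
    return all(G_seq[t] <= G for G, a, b in epochs for t in range(a, b + 1))
```

With this rule the scale at least doubles at every restart, so the number of epochs is at most $1+\log_{2}(\max_{t}G_{t}/G_{1})$ , which mirrors the logarithmic factor in the lemma; within each epoch the rescaled losses $G^{-1}\bm{g}_{t}$ have $\ell_{\infty}$ -norm at most one.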

B.4 Deciding $\bm{a}(t)$ via AdaPFOL Algorithm (Proof of Theorem 3.5)

Theorem B.6 (Restatement of Theorem 3.5; Deciding $\bm{a}(t)$ via AdaPFOL Algorithm).

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}(\mu_{n,m}^{(k)}(t)-\mathring{% \mu}_{n,m}^{(k)}(t))\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)\text{,}$

Proof.

According to the definitions of $\mu_{n,m}^{(k)}(t)$ and $\mathring{\mu}_{n,m}^{(k)}(t)$ together with Lemma 3.4,

	$\displaystyle\quad\sum_{t=1}^{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}(\mu_{n,m}^{(k)}(t)-\mathring{\mu}_{n,m}^{(k)}(t))\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\middle\|\bm{Q}(t)\right]$
	$\displaystyle=\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}% }}C_{n,m}(t)\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\left(a_{n,m}^{(k)}(t)-% \mathring{a}_{n,m}^{(k)}(t)\right)$
	$\displaystyle\leq\operatorname{\mathcal{O}}\Bigg{(}\sqrt{1+\sum_{t=1}^{T-1}% \sum_{(n,m)\in{\mathcal{L}}}\lVert\mathring{\bm{a}}_{n,m}(t)-\mathring{\bm{a}}% _{n,m}(t+1)\rVert_{1}}$
	$\displaystyle\qquad\sqrt{\sum_{t=1}^{T}\sum_{(n,m)\in{\mathcal{L}}}\sum_{k\in{% \mathcal{N}}}\big{(}C_{n,m}(t)(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t))\big{)}^{2}}$
	$\displaystyle\qquad\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\Bigg{)}.$

As $C_{n,m}(t)\in[0,M]$ , we can upper bound the RHS of the above inequality by

\operatorname{\mathcal{O}}\left(\sqrt{1+P_{T}^{a}}\sqrt{2M^{2}\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right),

which, after taking expectations, gives the claimed bound. ∎

B.5 Main Theorem for Multi-Hop Network Stability (Proof of Theorem 3.6)

Theorem B.7 (Restatement of Theorem 3.6; Main Theorem for Multi-Hop Network Stability).

Suppose that $\{\mathring{\bm{a}}_{n,m}(t)\in\triangle({\mathcal{N}})\}_{(n,m)\in{\mathcal{L}},t\in[T]}$ satisfies Assumption 1 and its path length satisfies

P_{t}^{a}\triangleq\sum_{s=1}^{t-1}\sum_{(n,m)\in{\mathcal{L}}}\lVert\mathring{\bm{a}}_{n,m}(s)-\mathring{\bm{a}}_{n,m}(s+1)\rVert_{1}\leq C^{a}t^{1/2-\delta_{a}},\quad\forall t=1,2,\ldots,T.

Then the time-averaged expected queue length satisfies

\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}% (t)\rVert_{1}\right]=\operatorname{\mathcal{O}}\left(\frac{(N^{2}(2NM+R)^{2}+% \epsilon_{W}N^{2}(2NM+R))C_{W}+(N^{4}M^{2}+N^{2}R^{2})}{\epsilon_{W}}\right)+o% _{T}(1).

Proof.

We defer some calculations to Lemma B.8, which combines Lemma 3.1, Lemma 3.2, and Theorem 3.5. Lemma B.8 says

\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]

\displaystyle\leq f(T)+g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}% ^{T}\lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{% E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right],

where

	$\displaystyle f(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+(N^{4}M^{2}+N^{2}R^{2})T\right),$
	$\displaystyle g(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(M(2NM+R)^{1/4}% \sqrt{1+P_{T}^{a}}\log T\right).$

In Lemma D.5, we will prove a self-bounding inequality that says, if $y\leq f+y^{3/4}g\log y$ , then $y\leq\left(f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)\right)^{4}$ . Therefore, we can apply Lemma D.5 to conclude that

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle=\operatorname{\mathcal{O}}\left(f(T)+g(T)^{4}\log^{4}\left(2(f(T% )^{1/4}+g(T))^{2}\right)\right)$
		$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+(N^{4}M^{2}+N^{2}R^{2})T\right)+$
		$\displaystyle\quad\left(\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(M(2NM% +R)^{1/4}\sqrt{1+P_{T}^{a}}\log T\right)\operatorname{\widetilde{\mathcal{O}}}% _{T}(1)\right)^{4}.$

Since

\displaystyle\left(\sqrt{1+P_{T}^{a}}\right)^{4}=\left(1+P_{T}^{a}\right)^{2}\leq\left(1+C^{a}T^{1/2-\delta_{a}}\right)^{2}=\operatorname{\mathcal{O}}_{T}(T^{1-2\delta_{a}}),

we know $\left(\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(M(2NM+R)^{1/4}\sqrt{1+P% _{T}^{a}}\log T\right)\operatorname{\widetilde{\mathcal{O}}}_{T}(1)\right)^{4}% =\operatorname{\widetilde{\mathcal{O}}}_{T}(T^{1-2\delta_{a}})=o_{T}(T)$ . The conclusion then follows. ∎

Lemma B.8 (Calculations when Proving Theorem 3.6).

Following all the assumptions in Theorem 3.6, we have

\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]

\displaystyle\leq f(T)+g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}% ^{T}\lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{% E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right],

where

	$\displaystyle f(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+(N^{4}M^{2}+N^{2}R^{2})T\right),$
	$\displaystyle g(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(M(2NM+R)^{1/4}% \sqrt{1+P_{T}^{a}}\log T\right).$

Proof.

Recall that the piecewise stability assumption (Assumption 1) implies Lemma 3.1:

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}C_{n,m}(t)\mathring{a}_{n,m}^{(% k)}(t)(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E% }}\left[\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(% k)}(t)\lambda_{n}^{(k)}(t)\right].$

From the non-negativity of Lyapunov drifts in Lemma 3.2, we know

	$\displaystyle 0$	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_{% (n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)\left(Q_{m}^{(% k)}(t)-Q_{n}^{(k)}(t)\right)+\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q% _{n}^{(k)}(t)\lambda_{n}^{(k)}(t)\right]+$
		$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T.$

Furthermore, recall the guarantee of AdaPFOL algorithm in Theorem 3.5 that

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}(\mu_{n,m}^{(k)}(t)-\mathring{% \mu}_{n,m}^{(k)}(t))\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)\text{,}$

Therefore, we are able to get

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)+$
	$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T.$

Lemma D.3 states that if $x_{1}=0$ , $x_{2},\ldots,x_{T}\geq 0$ , and $\lvert x_{t+1}-x_{t}\rvert\leq 1$ , then $\sum_{t=1}^{T}x_{t}^{2}=\operatorname{\mathcal{O}}\left((\sum_{t=1}^{T}x_{t})^{3/2}\right)$ . From Lemma B.1, any single queue $Q_{n}^{(k)}(t)$ satisfies $\lvert Q_{n}^{(k)}(t+1)-Q_{n}^{(k)}(t)\rvert\leq(2NM+R)$ . Hence, applying Lemma D.3 to $\{Q_{n}^{(k)}(t)/(2NM+R)\}_{t\in[T]}$ for every $n\in{\mathcal{N}}$ and $k\in{\mathcal{N}}$ , we have

\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}=(2NM+R)^{2}\sum_{t=1}^{T}\sum_{n% \in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\left(\frac{Q_{n}^{(k)}(t)}{2NM+R}% \right)^{2}=\operatorname{\mathcal{O}}\left(\sqrt{2NM+R}\left(\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right)^{1.5}\right).
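The growth condition used here (a non-negative sequence starting at zero whose steps are bounded by one) can be explored numerically. The Python sketch below is illustrative only: `ratio` and `reflected_walk` are hypothetical helpers, and the constant $\sqrt{2}$ used as a reference comes from the elementary bound $\max_{t}x_{t}\leq\sqrt{2\sum_{t}x_{t}}$ for such sequences, not from the statement of Lemma D.3.

```python
import math
import random


def ratio(x):
    """sum_t x_t^2 divided by (sum_t x_t)^{3/2}; zero for the all-zero sequence."""
    s1 = sum(x)
    s2 = sum(v * v for v in x)
    return s2 / s1 ** 1.5 if s1 > 0 else 0.0


def reflected_walk(T, rng):
    """Random sequence with x_1 = 0, x_t >= 0 and |x_{t+1} - x_t| <= 1."""
    x = [0.0]
    for _ in range(T - 1):
        # Truncating at zero can only shrink the step, so it stays within [-1, 1].
        x.append(max(x[-1] + rng.uniform(-1.0, 1.0), 0.0))
    return x


def worst_ratio(T=2000, trials=50, seed=1):
    """Largest ratio over random walks and the fastest-growing ramp 0, 1, 2, ..."""
    rng = random.Random(seed)
    ramp = [float(t) for t in range(T)]
    return max([ratio(ramp)] + [ratio(reflected_walk(T, rng)) for _ in range(trials)])
```

The ramp is close to the extreme case: its ratio approaches $2\sqrt{2}/3$ as $T$ grows, which stays below the reference constant $\sqrt{2}$ .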

Further noticing that

\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)% \rVert_{\infty}\leq\sum_{t=1}^{T}M\sum_{n\in{\mathcal{N}}}\lVert\bm{Q}_{n}(t)% \rVert_{1}\leq M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1},

the above inequality becomes

	$\displaystyle\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{% T}\lVert\bm{Q}(t)\rVert_{1}\right]$	$\displaystyle\leq\operatorname{\mathcal{O}}\left(M(2NM+R)^{1/4}\sqrt{1+P_{T}^{% a}}\operatornamewithlimits{\mathbb{E}}\left[\left(\sum_{t=1}^{T}\lVert\bm{Q}(t% )\rVert_{1}\right)^{3/4}\log T\log\left(M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{% 1}\right)\right]\right)+$
		$\displaystyle\quad\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{2}+\epsilon_{% W}N^{2}(2NM+R))C_{W}T+(N^{4}M^{2}+N^{2}R^{2})T\right).$

Noticing that $x\mapsto x^{3/4}\log(Mx)$ is concave when $x$ is large enough, Jensen's inequality then gives

\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{E}}\left[\left% (\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right)^{3/4}\log\left(M\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right)\right]\right)=\operatorname{\mathcal{O}}\left% (\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_% {1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]\right).

Therefore, if we define the auxiliary functions $f(T)$ and $g(T)$ as

	$\displaystyle f(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+(N^{4}M^{2}+N^{2}R^{2})T\right),$
	$\displaystyle g(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(M(2NM+R)^{1/4}% \sqrt{1+P_{T}^{a}}\log T\right),$

we are able to conclude that

\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]

\displaystyle\leq f(T)+g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}% ^{T}\lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{% E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right],

as claimed. ∎

Appendix C Omitted Proofs for Multi-Hop Utility Maximization Tasks

C.1 Reference Policy Assumption (Proof of Lemma 4.1)

Lemma C.1 (Restatement of Lemma 4.1; Ability of $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ in Stabilizing the Network).

If $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ satisfies 2, then for any scheduler-generated queue lengths $\{\bm{Q}(t)\}_{t\in[T]}$ ,

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mathring{\mu}_{n,m}^{(k)}(t)(Q% _{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E}}\left[% \sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)% \mathring{\lambda}_{n}^{(k)}(t)\right].$

Proof.

The proof of this lemma is identical to that of Lemma 3.1, except for replacing the environment-generated $\bm{\lambda}(t)$ with the $\mathring{\bm{\lambda}}(t)$ generated by the reference policy. For more details, please refer to Section B.1. ∎

C.2 Lyapunov Drift-Plus-Penalty Analysis (Proof of Lemma C.2)

Lemma C.2 (Lyapunov Drift-Plus-Penalty Analysis).

Under the queue dynamics of Equation 1,

	$\displaystyle\quad-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum% _{(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}C_{n,m}(t)(Q_{m}^{(k)}(t)-Q_{n% }^{(k)}(t))a_{n,m}^{(k)}(t)\right]$
	$\displaystyle\quad-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum% _{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\lambda_{n}^{(k)}(t)% \right]+V\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(g_{t}(\bm{% \lambda}(t))-g_{t}(\mathring{\bm{\lambda}}(t)))\right]$
	$\displaystyle\leq\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T+V% \operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(g_{t}(\bm{\lambda}(t))% -g_{t}(\mathring{\bm{\lambda}}(t)))\right].$

Proof.

The proof follows from applying Lemma 3.2 and adding

V\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(g_{t}(\bm{\lambda}(t)% )-g_{t}(\mathring{\bm{\lambda}}(t)))\right]

to both sides. ∎
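For intuition on the drift part of this bound, the sketch below checks the standard one-step quadratic Lyapunov drift inequality for a single queue. It assumes the simplified dynamics $Q(t+1)=\max(Q(t)-\mu,0)+\lambda$, which is our stand-in for Equation 1 (not reproduced here), and all names are hypothetical.

```python
def queue_step(q: float, arrival: float, service: float) -> float:
    """Assumed single-queue dynamics: serve first, then admit arrivals."""
    return max(q - service, 0.0) + arrival

def drift(q: float, arrival: float, service: float) -> float:
    """One-step drift of the Lyapunov function Q^2 / 2."""
    q_next = queue_step(q, arrival, service)
    return 0.5 * (q_next ** 2 - q ** 2)

def drift_bound(q: float, arrival: float, service: float) -> float:
    """Upper bound Q * (lambda - mu) + (lambda^2 + mu^2) / 2 on the drift."""
    return q * (arrival - service) + 0.5 * (arrival ** 2 + service ** 2)
```

The bound holds because $(\max(Q-\mu,0))^{2}\leq(Q-\mu)^{2}$ and $\lambda\max(Q-\mu,0)\leq\lambda Q$; summing such inequalities over queues and rounds is what produces the $T$-linear constant term.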

C.3 Guarantee of AdaBGD Algorithm (Proof of Lemma 4.3)

Lemma C.3 (Restatement of Lemma 4.3; Guarantee of AdaBGD Algorithm).

	$\displaystyle\quad\text{D-Regret}_{T}^{\text{BCO}}(\bm{u}_{1},\bm{u}_{2},% \ldots,\bm{u}_{T})=\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(% \ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{u}_{t}))\right]$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\frac{7R^{2}}{4\eta_% {T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{T}\left(\frac{\eta_{t}}{2}\frac{d^{2}% }{\delta_{t}^{2}}C_{t}^{2}+3L_{t}\delta_{t}+L_{t}\alpha_{t}R\right)\right],$

where $P_{T}=\sum_{t=1}^{T-1}\lVert\bm{u}_{t}-\bm{u}_{t+1}\rVert$ is the path length of the comparator sequence $\{\bm{u}_{t}\}_{t\in[T]}$ .

Proof.

Similar to the proof by Zhao et al. (2021, Theorem 1), let $\widehat{\ell}_{t}(\bm{x})=\operatornamewithlimits{\mathbb{E}}_{\bm{v}\in% \mathbb{B}}[\ell_{t}(\bm{x}+\delta\bm{v})]$ (where $\mathbb{B}=\{\bm{x}\in\mathbb{R}^{d}\mid\lVert\bm{x}\rVert\leq 1\}$ is the unit ball in $\mathbb{R}^{d}$ ) and $\bm{v}_{t}=(1-\alpha_{t})\bm{u}_{t}\in(1-\alpha_{t})\mathcal{X}$ , then

\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{u}_{t}))=\sum_{t=1}^{T}(% \widehat{\ell}_{t}(\bm{y}_{t})-\widehat{\ell}_{t}(\bm{v}_{t}))+\sum_{t=1}^{T}(% \ell_{t}(\bm{x}_{t})-\widehat{\ell}_{t}(\bm{y}_{t}))+\sum_{t=1}^{T}(\widehat{% \ell}_{t}(\bm{v}_{t})-\ell_{t}(\bm{u}_{t})).

According to the original proof, the expectations of the latter two terms are controlled by $\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}2L_{t}\delta_{t}]$ and $\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}(L_{t}\delta_{t}+L_{t}\alpha_{t}R)]$ , respectively. For the first term, according to Flaxman et al. (2005, Lemma 2.1), $\operatornamewithlimits{\mathbb{E}}_{\bm{s}_{t}}[\frac{d}{\delta_{t}}\ell_{t}(\bm{x}_{t})\bm{s}_{t}]=\nabla\widehat{\ell}_{t}(\bm{x}_{t})$ . Therefore, since $\lVert\frac{d}{\delta_{t}}\ell_{t}(\bm{x}_{t})\bm{s}_{t}\rVert\leq\frac{d}{\delta_{t}}C_{t}$ , we get from Lemma C.4 that

\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(\widehat{\ell}_{t}(\bm% {x}_{t})-\widehat{\ell}_{t}(\bm{v}_{t}))\right]\leq\operatornamewithlimits{% \mathbb{E}}\left[\frac{7R^{2}}{4\eta_{T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{% T}\frac{\eta_{t}}{2}\frac{d^{2}}{\delta_{t}^{2}}C_{t}^{2}\right],

where $\widehat{P}_{T}=\sum_{t=1}^{T-1}\lVert\bm{v}_{t}-\bm{v}_{t+1}\rVert$ , the path length of $\{\bm{v}_{t}\}_{t\in[T]}$ , satisfies $\widehat{P}_{T}\leq P_{T}$ . This gives our conclusion. ∎
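The one-point gradient estimator used in this proof can be illustrated numerically. The sketch below is restricted to $d=2$ and, instead of sampling, averages the estimator over equally spaced directions on the unit circle (an assumption made so that the check is deterministic); function names are hypothetical. For a quadratic loss, smoothing only adds a constant, so the averaged estimate should coincide with the ordinary gradient.

```python
import math

def one_point_estimate(loss, x, delta, s):
    """(d / delta) * loss(x + delta * s) * s for a unit vector s in R^2 (d = 2)."""
    d = 2
    value = loss((x[0] + delta * s[0], x[1] + delta * s[1]))
    return (d / delta * value * s[0], d / delta * value * s[1])

def averaged_estimate(loss, x, delta, num_angles=16):
    """Average the estimator over equally spaced directions on the unit circle."""
    gx = gy = 0.0
    for k in range(num_angles):
        theta = 2.0 * math.pi * k / num_angles
        s = (math.cos(theta), math.sin(theta))
        ex, ey = one_point_estimate(loss, x, delta, s)
        gx += ex / num_angles
        gy += ey / num_angles
    return (gx, gy)
```

Each single estimate has norm at most $\frac{d}{\delta}\lvert\ell(\cdot)\rvert$, which is the magnitude bound fed into Lemma C.4.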

Lemma C.4 (Guarantee of Projected SGD).

Suppose that $\mathcal{X}$ is bounded by $[r,R]$ , the $t$ -th loss function $\ell_{t}$ is bounded by $[-C_{t},C_{t}]$ and is $L$ -Lipschitz. Further suppose that a stochastic gradient $\bm{g}_{t}$ can be calculated in round $t$ such that $\operatornamewithlimits{\mathbb{E}}[\bm{g}_{t}\mid\bm{x}_{1},\ell_{1},\ldots,\bm{x}_{t},\ell_{t}]=\nabla\ell_{t}(\bm{x}_{t})$ and $\lVert\bm{g}_{t}\rVert_{2}\leq C_{t}$ . Then the iteration

\bm{x}_{t+1}=\text{Proj}_{\mathcal{X}}\left[\bm{x}_{t}-\eta_{t}\bm{g}_{t}\right]

ensures the following dynamic regret guarantee for any fixed $\bm{u}_{1},\bm{u}_{2},\ldots,\bm{u}_{T}\in\mathcal{X}$ :

\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{u}_{t}))\leq\frac{7R^{2}}{4% \eta_{T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{T}\frac{\eta_{t}}{2}C_{t}^{2},

where $P_{T}=\sum_{t=1}^{T-1}\lVert\bm{u}_{t}-\bm{u}_{t+1}\rVert$ is the path length of $\{\bm{u}_{t}\}_{t=1}^{T}$ .

Proof.

We first consider the full-feedback model where the whole $\ell_{t}$ , instead of the single-entry $\ell_{t}(\bm{x}_{t})$ , is available. Then the Gradient Descent algorithm $\bm{x}_{t+1}=\text{Proj}_{\mathcal{X}}[\bm{x}_{t}-\eta_{t}\nabla\ell_{t}(\bm{x% }_{t})]$ enjoys the following dynamic regret guarantee (which follows the proof of Zinkevich (2003, Theorem 2)):

	$\displaystyle\quad\text{D-Regret}_{T}(\bm{u}_{1},\bm{u}_{2},\ldots,\bm{u}_{T})=\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{u}_{t}))$
	$\displaystyle\leq\sum_{t=1}^{T}\left(\frac{1}{2\eta_{t}}\left(\lVert\bm{x}_{t}% -\bm{u}_{t}\rVert^{2}-\lVert\bm{x}_{t+1}-\bm{u}_{t}\rVert^{2}\right)+\frac{% \eta_{t}}{2}\lVert\nabla\ell_{t}(\bm{x}_{t})\rVert^{2}\right)$
	$\displaystyle=\sum_{t=1}^{T}\frac{\lVert\bm{x}_{t}\rVert^{2}-\lVert\bm{x}_{t+1% }\rVert^{2}}{2\eta_{t}}+\sum_{t=1}^{T}\frac{\langle\bm{x}_{t+1}-\bm{x}_{t},\bm% {u}_{t}\rangle}{\eta_{t}}+\sum_{t=1}^{T}\frac{\eta_{t}}{2}\lVert\nabla\ell_{t}% (x_{t})\rVert^{2}$
	$\displaystyle\overset{(a)}{\leq}\frac{\lVert\bm{x}_{1}\rVert^{2}-\lVert\bm{x}_% {T+1}\rVert^{2}}{2\eta_{T}}+\frac{\langle\bm{x}_{T+1},\bm{u}_{T}\rangle}{\eta_% {T}}-\frac{\langle\bm{x}_{1},\bm{u}_{1}\rangle}{\eta_{1}}+\sum_{t=2}^{T}\frac{% \langle\bm{u}_{t-1}-\bm{u}_{t},\bm{x}_{t}\rangle}{\eta_{t}}+\sum_{t=1}^{T}% \frac{\eta_{t}}{2}\lVert\nabla\ell_{t}(x_{t})\rVert^{2}$
	$\displaystyle\leq\frac{7R^{2}}{4\eta_{T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{% T}\frac{\eta_{t}}{2}\lVert\nabla\ell_{t}(x_{t})\rVert^{2},$		(22)

where (a) uses the property that $\eta_{1}\geq\eta_{2}\geq\cdots\geq\eta_{T}$ .

Moving back to the bandit-feedback model, following the proof of Zhao et al. (2021, Theorem 8), we can consider a loss function defined as $\widetilde{\ell}_{t}(\bm{x})\triangleq\ell_{t}(\bm{x})+\langle\bm{x},\bm{g}_{t}-\nabla\ell_{t}(\bm{x}_{t})\rangle$ . As $\nabla\widetilde{\ell}_{t}(\bm{x}_{t})=\bm{g}_{t}$ and $\operatornamewithlimits{\mathbb{E}}[\widetilde{\ell}_{t}(\bm{x})]=\operatornamewithlimits{\mathbb{E}}[\ell_{t}(\bm{x})]$ (which is due to the fact that $\operatornamewithlimits{\mathbb{E}}[\bm{g}_{t}]=\nabla\ell_{t}(\bm{x}_{t})$ ), applying Equation 22 gives

	$\displaystyle\quad\sum_{t=1}^{T}(\ell_{t}(\bm{x}_{t})-\ell_{t}(\bm{u}_{t}))=% \operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(\widetilde{\ell}_{t}(% \bm{x}_{t})-\widetilde{\ell}_{t}(\bm{u}_{t}))\right]$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}\left[\frac{7R^{2}}{4\eta_% {T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{T}\frac{\eta_{t}}{2}\lVert\nabla% \widetilde{\ell}_{t}(\bm{x}_{t})\rVert^{2}\right]\overset{(b)}{\leq}\frac{7R^{% 2}}{4\eta_{T}}+\frac{P_{T}R}{\eta_{T}}+\sum_{t=1}^{T}\frac{\eta_{t}}{2}C_{t}^{% 2},$

where the expectation is taken w.r.t. the randomness in the stochastic gradient $\{\bm{g}_{t}\}_{t\in[T]}$ , and (b) makes use of the fact that $\nabla\widetilde{\ell}_{t}(\bm{x}_{t})=\bm{g}_{t}$ . ∎
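To make the iteration of Lemma C.4 tangible, here is a minimal sketch of projected gradient descent on a Euclidean ball with linear losses $\ell_{t}(\bm{x})=\langle\bm{g}_{t},\bm{x}\rangle$ and deterministic gradients. The rotating test instance, the step sizes $\eta_{t}=1/\sqrt{t}$, and all function names are our assumptions; the code only evaluates the right-hand side of the lemma for comparison and is not a proof of it.

```python
import math

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def projected_gd(gradients, step_sizes, radius):
    """Run x_{t+1} = Proj[x_t - eta_t g_t] from x_1 = 0 and return x_1, ..., x_T."""
    x = [0.0] * len(gradients[0])
    iterates = []
    for g, eta in zip(gradients, step_sizes):
        iterates.append(x)
        x = project_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    return iterates

def dynamic_regret(gradients, iterates, comparators):
    """Dynamic regret for the linear losses l_t(x) = <g_t, x>."""
    return sum(
        sum(gi * (xi - ui) for gi, xi, ui in zip(g, x, u))
        for g, x, u in zip(gradients, iterates, comparators)
    )

def lemma_rhs(gradients, step_sizes, comparators, radius):
    """7 R^2 / (4 eta_T) + P_T R / eta_T + sum_t (eta_t / 2) ||g_t||^2."""
    path = sum(math.dist(u, v) for u, v in zip(comparators, comparators[1:]))
    eta_T = step_sizes[-1]
    grad_term = sum(
        0.5 * eta * sum(gi * gi for gi in g)
        for g, eta in zip(gradients, step_sizes)
    )
    return 1.75 * radius ** 2 / eta_T + path * radius / eta_T + grad_term

def rotating_instance(T):
    """Hypothetical instance: slowly rotating unit gradients, per-round best comparators."""
    grads = [(math.cos(0.05 * t), math.sin(0.05 * t)) for t in range(1, T + 1)]
    comps = [(-gx, -gy) for gx, gy in grads]
    steps = [1.0 / math.sqrt(t) for t in range(1, T + 1)]
    return grads, comps, steps
```

On this instance the comparators move along the unit circle, so the path-length term $P_{T}R/\eta_{T}$ dominates the right-hand side, mirroring the role of $P_{T}$ in the lemma.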

C.4 Deciding $\bm{\lambda}(t)$ via AdaBGD Algorithm (Proof of Theorem 4.4)

Theorem C.5 (Restatement of Theorem 4.4; Deciding $\bm{\lambda}(t)$ via AdaBGD Algorithm).

For the reference arrival rates $\{\mathring{\bm{\lambda}}(t)\}_{t\in[T]}$ defined in 2, suppose that its path length ensures

P_{t}^{\lambda}\triangleq\sum_{s=1}^{t-1}\lVert\mathring{\bm{\lambda}}(s+1)-\mathring{\bm{\lambda}}(s)\rVert_{1}\leq C^{\lambda}t^{1/2-\delta_{\lambda}},\quad\forall t=1,2,\ldots,T,

	$\displaystyle\eta_{t}$	$\displaystyle=\left(C^{\lambda}T^{1/2-\delta_{\lambda}}\middle/\begin{subarray}{c}\left(C^{\lambda}T^{1/2-\delta_{\lambda}}\right)^{7/3}\left(4r^{-3}d^{2}\right)^{28/9}\left(2NM+R\right)^{4/3}+\\ C^{\lambda}T^{1/2-\delta_{\lambda}}(r^{-3}d^{2}VG^{2}/L)^{4/3}+\\ \sum_{s=1}^{t}\left((\lVert\bm{q}_{s}\rVert_{\infty}+VG)^{2}(\lVert\bm{q}_{s}\rVert_{2}+VL)^{2}\right)^{1/3}\end{subarray}\right)^{3/4},$
	$\displaystyle\delta_{t}$	$\displaystyle=\left(\eta_{t}d^{2}\frac{(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}% }{(\lVert\bm{Q}(t)\rVert_{2}+VL)}\right)^{1/3},\quad\alpha_{t}=\frac{\delta_{t% }}{r},$		(23)

its outputs $\bm{\lambda}(1),\bm{\lambda}(2),\ldots,\bm{\lambda}(T)\in\Lambda$ ensure

Proof.

The loss function $\ell_{t}(\bm{\lambda})=\langle\bm{Q}(t),\bm{\lambda}\rangle-Vg_{t}(\bm{\lambda})$ is bounded in absolute value by $C_{t}\triangleq\lVert\bm{Q}(t)\rVert_{\infty}+VG$ and is $L_{t}\triangleq(\lVert\bm{Q}(t)\rVert_{2}+VL)$ -Lipschitz. As $\bm{Q}(t)$ is revealed after the $(t-1)$ -th round, $C_{t}$ and $L_{t}$ are both $\mathcal{F}_{t-1}$ -measurable. As sketched in the main text, we first regard $\eta_{t}$ as a constant and tune $\delta_{t}$ to minimize the summation term in Lemma 4.3.

Let $\delta_{t}=(\eta_{t}d^{2}C_{t}^{2}/L_{t})^{1/3}$ and $\alpha_{t}=\delta_{t}/r$ . Suppose that all conditions in Lemma 4.3 hold, then

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(\ell% _{t}(\bm{\lambda}_{t})-\ell_{t}(\bm{\lambda}_{t}^{\ast}))\right]\leq% \operatornamewithlimits{\mathbb{E}}\left[\frac{7R^{2}+4P_{T}R}{4\eta_{T}}+\sum% _{t=1}^{T}\left(\eta_{t}\frac{d^{2}}{\delta_{t}^{2}}C_{t}^{2}+3L_{t}\delta_{t}% +L_{t}\alpha_{t}R\right)\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{% E}}\left[\frac{R^{2}+C^{r}T^{1/2-\delta_{r}}R}{\eta_{T}}+\sum_{t=1}^{T}\left(% \eta_{t}d^{2}(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}(\lVert\bm{Q}(t)\rVert_{2}% +VL)^{2}\right)^{1/3}\frac{R}{r}\right]\right).$		(24)
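The choice $\delta_{t}=(\eta_{t}d^{2}C_{t}^{2}/L_{t})^{1/3}$ balances the two $\delta_{t}$-dependent terms of the per-round bound. A minimal sketch of this balance, with hypothetical function names, is:

```python
def tuned_delta(eta: float, d: int, C: float, L: float) -> float:
    """delta = (eta * d^2 * C^2 / L)^(1/3), the choice made in the proof."""
    return (eta * d ** 2 * C ** 2 / L) ** (1.0 / 3.0)

def per_round_terms(eta: float, d: int, C: float, L: float, delta: float):
    """Return the pair (eta * d^2 * C^2 / delta^2, L * delta)."""
    return eta * d ** 2 * C ** 2 / delta ** 2, L * delta

def balanced_value(eta: float, d: int, C: float, L: float) -> float:
    """Common value (eta * d^2 * C^2 * L^2)^(1/3) of both terms at the tuned delta."""
    return (eta * d ** 2 * C ** 2 * L ** 2) ** (1.0 / 3.0)
```

With the tuned $\delta_{t}$ both terms equal $(\eta_{t}d^{2}C_{t}^{2}L_{t}^{2})^{1/3}$, which is exactly the summand appearing in Equation 24; the term $L_{t}\alpha_{t}R=L_{t}\delta_{t}R/r$ contributes the extra factor $R/r$.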

We first only keep the last term in Equation 23, i.e., let

	$\displaystyle\eta_{t}=\left(C^{r}T^{1/2-\delta_{r}}\middle/\sum_{s=1}^{t}\left((\lVert\bm{q}_{s}\rVert_{\infty}+VG)^{2}(\lVert\bm{q}_{s}\rVert_{2}+VL)^{2}\right)^{1/3}\right)^{3/4},$		(25)

then we have $\eta_{1}>\eta_{2}>\cdots>\eta_{T}$ . For now, suppose that the other condition of Lemma 4.3, namely $\alpha_{t}<1$ , also holds. Lemma D.1 states that if $x_{1},x_{2},\ldots,x_{T}\geq 0$ , then $\left.\sum_{t=1}^{T}x_{t}\middle/(\sum_{s\leq t}x_{s})^{1/4}\right.=\operatorname{\mathcal{O}}\left(\left(\sum_{t=1}^{T}x_{t}\right)^{3/4}\right)$ . Plugging in $x_{t}=\left((\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}(\lVert\bm{Q}(t)\rVert_{2}+VL)^{2}\right)^{1/3}$ ,

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(\ell% _{t}(\bm{\lambda}_{t})-\ell_{t}(\bm{\lambda}_{t}^{\ast}))\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{% E}}\left[(C^{r}T^{1/2-\delta_{r}})^{1/4}R\left(\sum_{t=1}^{T}\left((\lVert\bm{% Q}(t)\rVert_{\infty}+VG)^{2}(\lVert\bm{Q}(t)\rVert_{2}+VL)^{2}\right)^{1/3}% \right)^{3/4}\right]\right.+$
	$\displaystyle\qquad\left.\operatornamewithlimits{\mathbb{E}}\left[\frac{R}{r}% \left(C^{r}T^{1/2-\delta_{r}}\right)^{1/4}d^{2/3}\left(\sum_{t=1}^{T}\left((% \lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}(\lVert\bm{Q}(t)\rVert_{2}+VL)^{2}\right% )^{1/3}\right)^{3/4}\right]\right)$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{% E}}\left[\left(\frac{R}{r}d^{2/3}+R\right)(C^{r}T^{1/2-\delta_{r}})^{1/4}\left% (\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)\right)^{4/3}\right)^{3/4% }\right]\right),$

where the last step utilizes $\lVert\bm{Q}(t)\rVert_{\infty}\leq\lVert\bm{Q}(t)\rVert_{2}$ .
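The summation property quoted from Lemma D.1 can be checked numerically. The sketch below uses the explicit constant $4/3$, which is our own derivation from the concavity of $s\mapsto s^{3/4}$ (namely $S_{t}^{3/4}-S_{t-1}^{3/4}\geq\tfrac{3}{4}x_{t}S_{t}^{-1/4}$ for prefix sums $S_{t}$) and is not taken from the lemma's statement; function names are hypothetical.

```python
def weighted_sum(xs):
    """sum_t x_t / (x_1 + ... + x_t)^(1/4), skipping rounds whose prefix sum is zero."""
    total, prefix = 0.0, 0.0
    for x in xs:
        prefix += x
        if prefix > 0.0:
            total += x / prefix ** 0.25
    return total

def concavity_bound(xs):
    """(4/3) * (sum_t x_t)^(3/4), obtained by telescoping the concavity inequality."""
    return (4.0 / 3.0) * sum(xs) ** 0.75
```

This is the mechanism that lets a learning rate normalized by the running sum of past loss magnitudes recover a $(\sum_{t}x_{t})^{3/4}$-type regret term.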

This almost recovers our conclusion, so it only remains to ensure $\alpha_{t}=\delta_{t}/r<1$ , which is equivalent to $\eta_{t}d^{2}C_{t}^{2}/L_{t}<r^{3}$ , i.e.,

\eta_{t}^{-1}>r^{-3}d^{2}\frac{(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}}{\lVert% \bm{Q}(t)\rVert_{2}+VL}.

Consider adding a term $X$ into the denominator of Equation 25. As $\lVert\bm{Q}(t)\rVert_{\infty}\leq\lVert\bm{Q}(t)\rVert_{2}$ , $(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}/(\lVert\bm{Q}(t)\rVert_{2}+VL)\leq 2(% \lVert\bm{Q}(t)\rVert_{\infty}^{2}/\lVert\bm{Q}(t)\rVert_{2})+2((VG)^{2}/(VL))% \leq 2\lVert\bm{Q}(t)\rVert_{\infty}+2VG^{2}/L$ . So we only need to show

\displaystyle\frac{\left(X+\sum_{s=1}^{t}\left((\lVert\bm{q}_{s}\rVert_{\infty% }+VG)^{2}(\lVert\bm{q}_{s}\rVert_{2}+VL)^{2}\right)^{1/3}\right)^{3/4}}{(C^{r}% T^{1/2-\delta_{r}})^{3/4}}>2r^{-3}d^{2}\left(\lVert\bm{Q}(t)\rVert_{\infty}+V% \frac{G^{2}}{L}\right).

We decompose $X$ into $X_{1}$ and $X_{2}$ and use them to cancel the two terms on the RHS, respectively. That is, we want

	$\displaystyle X_{1}+\sum_{s=1}^{t}\left(\lVert\bm{q}_{s}\rVert_{\infty}^{2}% \lVert\bm{q}_{s}\rVert_{2}^{2}\right)^{1/3}$	$\displaystyle>C^{r}T^{1/2-\delta_{r}}\left(r^{-3}d^{2}\lVert\bm{Q}(t)\rVert_{% \infty}\right)^{4/3},$
	$\displaystyle X_{2}+t^{3/4}(V^{4}G^{2}L^{2})^{1/4}$	$\displaystyle>C^{r}T^{1/2-\delta_{r}}\left(r^{-3}d^{2}VG^{2}/L\right)^{4/3}.$

We first craft $X_{1}$ . In Lemma D.2, we show that if $x_{1}=0$ , $x_{2},\ldots,x_{T}\geq 0$ , and $\lvert x_{t+1}-x_{t}\rvert\leq 1$ , then $\sum_{t=1}^{T}x_{t}^{4/3}\geq(x_{T}/4)^{7/3}$ . Thus, using the fact that $\lVert\bm{Q}(t)\rVert_{\infty}\leq\lVert\bm{Q}(t)\rVert_{2}$ and the queue length increment bound $\lvert\lVert\bm{Q}(t+1)\rVert_{\infty}-\lVert\bm{Q}(t)\rVert_{\infty}\rvert% \leq(2NM+R)$ (Lemma B.1), we can let $x_{t}=\lVert\bm{Q}(t)\rVert_{\infty}/(2NM+R)$ and lower bound the LHS by

X_{1}+\sum_{s=1}^{t}\left(\lVert\bm{q}_{s}\rVert_{\infty}^{2}\lVert\bm{q}_{s}% \rVert_{2}^{2}\right)^{1/3}\geq X_{1}+\frac{\left(\lVert\bm{Q}(t)\rVert_{% \infty}/4\right)^{7/3}}{(2NM+R)}\geq X_{1}^{3/7}\left(\frac{\left(\lVert\bm{Q}% (t)\rVert_{\infty}/4\right)^{7/3}}{(2NM+R)}\right)^{4/7},

where the last inequality follows from the weighted AM-GM inequality $\frac{ax+by}{a+b}\geq\sqrt[a+b]{x^{a}y^{b}}$ . Therefore, we only need to ensure $X_{1}^{3/7}\geq C^{r}T^{1/2-\delta_{r}}(4r^{-3}d^{2})^{4/3}(2NM+R)^{4/7}$ , for which it suffices to set

X_{1}=\left(C^{r}T^{1/2-\delta_{r}}\right)^{7/3}\left(4r^{-3}d^{2}\right)^{28/% 9}\left((2NM+R)\right)^{4/3}.
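The exponent bookkeeping behind this choice of $X_{1}$ is easy to get wrong, so a small numerical check may help. In the sketch below, `budget` stands for $C^{r}T^{1/2-\delta_{r}}$ and `inc` for $2NM+R$; these names, like the function names, are ours.

```python
def x1_choice(budget: float, r: float, d: int, inc: float) -> float:
    """X_1 = budget^(7/3) * (4 d^2 / r^3)^(28/9) * inc^(4/3)."""
    return budget ** (7 / 3) * (4 * d ** 2 / r ** 3) ** (28 / 9) * inc ** (4 / 3)

def lower_bound_term(q_inf: float, inc: float) -> float:
    """(q_inf / 4)^(7/3) / inc, the lower bound on the queue-dependent sum."""
    return (q_inf / 4) ** (7 / 3) / inc

def target(budget: float, r: float, d: int, q_inf: float) -> float:
    """budget * (d^2 * q_inf / r^3)^(4/3), the quantity that must be dominated."""
    return budget * (d ** 2 * q_inf / r ** 3) ** (4 / 3)

def geometric_mean_term(x1: float, y: float) -> float:
    """X_1^(3/7) * Y^(4/7), the AM-GM lower bound on X_1 + Y."""
    return x1 ** (3 / 7) * y ** (4 / 7)
```

With this $X_{1}$, the geometric-mean term matches the target exactly for every value of $\lVert\bm{Q}(t)\rVert_{\infty}$, since $(28/9)\cdot(3/7)=4/3$ and the powers of $2NM+R$ cancel.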

For $X_{2}$ , we only need to set $X_{2}=C^{r}T^{1/2-\delta_{r}}(r^{-3}d^{2}VG^{2}/L)^{4/3}$ . Plugging back $X=X_{1}+X_{2}$ , we get the learning rate scheduling defined in Equation 23. Now we verify that the two terms on the RHS of Equation 24 do not increase too much due to $X$ . For the first term,

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\frac{R^{2}+C^{r}T^% {1/2-\delta_{r}}R}{\eta_{T}}\right]$
	$\displaystyle=\operatornamewithlimits{\mathbb{E}}\left[\frac{R^{2}+C^{r}T^{1/2-\delta_{r}}R}{(C^{r}T^{1/2-\delta_{r}})^{3/4}}\left(X+\sum_{t=1}^{T}\left((\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}(\lVert\bm{Q}(t)\rVert_{2}+VL)^{2}\right)^{1/3}\right)^{3/4}\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{E}}\left[\left(C^{r}T^{1/2-\delta_{r}}R\right)\left(\frac{X}{C^{r}T^{1/2-\delta_{r}}}\right)^{3/4}+(C^{r}T^{1/2-\delta_{r}})^{1/4}R\left(\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)\right)^{4/3}\right)^{3/4}\right]\right)$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{E}}\left[\frac{R(2NM+R)}{r^{7}}d^{14/3}(C^{r}T^{1/2-\delta_{r}})^{2}+R(C^{r}T^{1/2-\delta_{r}})^{1/4}\left(\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)\right)^{4/3}\right)^{3/4}\right]\right).$

For the second term, as $\eta_{t}$ is strictly smaller than that in Equation 25, we can again apply Lemma D.1 (if $x_{1},x_{2},\ldots,x_{T}\geq 0$ , then $\left.\sum_{t=1}^{T}x_{t}\middle/(\sum_{s\leq t}x_{s})^{1/4}\right.=% \operatorname{\mathcal{O}}\left(\left(\sum_{t=1}^{T}x_{t}\right)^{3/4}\right)$ ) to conclude

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\left% (\eta_{t}d^{2}(\lVert\bm{Q}(t)\rVert_{\infty}+VG)^{2}(\lVert\bm{Q}(t)\rVert_{2% }+VL)^{2}\right)^{1/3}\frac{R}{r}\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{E}}\left[\frac{R}{r}d^{2/3}(C^{r}T^{1/2-\delta_{r}})^{1/4}\left(\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)\right)^{4/3}\right)^{3/4}\right]\right).$

Summing up the two parts gives our conclusion. ∎
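The queue-dependent schedule constructed in this proof can be sketched as follows. This is an illustrative transcription of the structure of Equation 23 under our own assumptions: `budget` stands for $C^{\lambda}T^{1/2-\delta_{\lambda}}$, `inc` for $2NM+R$ (the constant derived in the proof), and the queue norms are passed in as precomputed lists. It is not the authors' implementation of AdaBGD.

```python
def adabgd_schedule(q_inf, q_two, budget, r, d, V, G, L, inc):
    """Return lists (eta, delta, alpha) following the structure of Equation 23.

    q_inf[t], q_two[t]: infinity-norm and 2-norm of the queue vector in round t.
    """
    # Offsets X_1 and X_2 that keep alpha_t = delta_t / r below one.
    x1 = budget ** (7 / 3) * (4 * d ** 2 / r ** 3) ** (28 / 9) * inc ** (4 / 3)
    x2 = budget * (d ** 2 * V * G ** 2 / (r ** 3 * L)) ** (4 / 3)
    running = 0.0
    etas, deltas, alphas = [], [], []
    for qi, q2 in zip(q_inf, q_two):
        c_t = qi + V * G   # loss magnitude bound C_t
        l_t = q2 + V * L   # Lipschitz constant L_t
        running += (c_t ** 2 * l_t ** 2) ** (1 / 3)
        eta = (budget / (x1 + x2 + running)) ** 0.75
        delta = (eta * d ** 2 * c_t ** 2 / l_t) ** (1 / 3)
        etas.append(eta)
        deltas.append(delta)
        alphas.append(delta / r)
    return etas, deltas, alphas
```

Because the running sum only uses queue lengths up to round $t$, every $\eta_{t}$ and $\delta_{t}$ is computable before the round-$t$ decision, and the learning rates decrease as the queues grow.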

C.5 Main Theorem for Multi-Hop Utility Maximization (Proof of Theorem 4.5)

Theorem C.6 (Restatement of Theorem 4.5; Main Theorem for Multi-Hop Utility Maximization).

\displaystyle P_{t}^{a}\triangleq\sum_{s=1}^{t-1}\lVert\mathring{\bm{a}}(s)-% \mathring{\bm{a}}(s+1)\rVert_{1}\leq C^{a}t^{1/2-\delta_{a}},\leavevmode% \nobreak\ P_{t}^{\lambda}\triangleq\sum_{s=1}^{t-1}\lVert\mathring{\bm{\lambda% }}(s)-\mathring{\bm{\lambda}}(s+1)\rVert_{1}\leq C^{\lambda}t^{1/2-\delta_{% \lambda}},\quad\forall t\in[T].

Here, $M,R,r,L,G,C^{a},\delta_{a},C^{\lambda},\delta_{\lambda}$ are assumed to be known constants, whereas the specific $\{(\mathring{\bm{a}}(t),\mathring{\bm{\lambda}}(t))\}_{t\in[T]}$ remains unknown. If we execute the UMO² framework in Algorithm 3 with the AdaPFOL sub-routine given in Algorithm 2 and the AdaBGD sub-routine given in Algorithm 4, when $T$ is large enough such that $V=o_{T}(\min\{T^{2\delta_{a}/3},T^{2\delta_{\lambda}/7}\})$ , the following inequalities hold simultaneously:

Proof.

As sketched in the main text, the first step is to i) combine algorithmic guarantees for AdaPFOL (Theorem 3.5) and AdaBGD (Theorem 4.4), ii) plug in the network stability assumption Lemma 4.1, and iii) make use of the Lyapunov DPP analysis in Lemma C.2. Deferring these calculations to Lemma C.7, we can get

$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle\leq-\frac{V}{\epsilon_{W}}\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\bigg{(}g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{% \lambda}(t))\bigg{)}\right]+f(T)+$
	$\displaystyle\quad g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}% \left[M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]+$
	$\displaystyle\quad h(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{7/8}.$	(26)

where

	$\displaystyle f(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+\frac{R(2NM+R)}{r^{7}}d^{14/3}(C^{\lambda}T% ^{1/2-\delta_{\lambda}})^{2}\right.+$
		$\displaystyle\qquad\qquad\left.\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}V(L+G)T^{3/4}+\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T\right),$
	$\displaystyle g(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((2NM+R)^{1/4}M% \sqrt{1+C^{a}T^{1/2-\delta_{a}}}\log T\right),$
	$\displaystyle h(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((2NM+R)^{1/8}% \left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}% \right).$

Step 1 (Develop a Coarse Average Queue Length Bound).

Recall the assumption that $g_{t}$ is uniformly bounded by $[-G,G]$ . Therefore, the first term on the RHS of Equation 26 is bounded by $2\frac{V}{\epsilon_{W}}GT$ in absolute value. In Lemma D.6, we develop a self-bounding property that says, if $y\leq f+y^{3/4}g\log y+y^{7/8}h$ , then $y=\operatorname{\mathcal{O}}\left(f+g^{4}\log^{8}\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h^{8}\right)$ . Therefore, applying it to Equation 26, we have

$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle\leq\operatorname{\mathcal{O}}\left(2\frac{V}{\epsilon_{W}}GT+f(T% )+g(T)^{4}\log^{8}\left(2(f(T)^{1/8}+g(T)^{1/2}+h(T))^{2}\right)+h(T)^{8}\right)$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\frac{V}{\epsilon_{W}}GT+f(T)% \right)+\operatorname{\mathcal{O}}_{T}\left(T^{1-2\delta_{a}}\log^{8}\left(T^{% 1/4}+T^{1/4-\delta_{a}/2}+T^{1/4-\delta_{\lambda}/2}\right)+T^{1-2\delta_{% \lambda}}\right)$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\frac{V}{\epsilon_{W}}GT+f(T)\right)+o_{T}(T).$	(27)

As mentioned in the proof sketch of this theorem, this only gives a $\frac{1}{T}\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{1}]=\operatorname{\mathcal{O}}_{T}(V)$ bound on the average queue length, which violates the system stability condition Equation 2. However, this inequality can be used to derive the polynomial convergence result on the utility, which in turn refines the average queue length bound.
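To see what the self-bounding property means in practice, one can solve the inequality $y\leq f+y^{3/4}g\log y+y^{7/8}h$ numerically for its largest feasible $y$. The sketch below does so by bisection; it assumes $f\geq e^{4}$ and $g,h\geq 0$, so that the slack divided by $y$ is decreasing on $[f,\infty)$ and the sign change is unique. The function names are hypothetical, and the code does not verify the constants hidden in the $\operatorname{\mathcal{O}}$ of Lemma D.6.

```python
import math

def slack(y: float, f: float, g: float, h: float) -> float:
    """f + g * y^(3/4) * log(y) + h * y^(7/8) - y; non-negative iff y is feasible."""
    return f + g * y ** 0.75 * math.log(y) + h * y ** 0.875 - y

def largest_feasible(f: float, g: float, h: float, iters: int = 200) -> float:
    """Largest y satisfying the self-bounding inequality, assuming f >= e^4."""
    lo = f                      # feasible: slack(f) >= 0 whenever g, h >= 0
    hi = 2.0 * lo
    while slack(hi, f, g, h) >= 0.0:
        lo, hi = hi, 2.0 * hi   # doubling search for an infeasible point
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slack(mid, f, g, h) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo
```

Comparing the returned value with $f+g^{4}+h^{8}$ for growing $f$, $g$, $h$ gives a feel for which of the three terms in the lemma's conclusion dominates.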

Step 2 (Yield Polynomial Convergence on the Utility).

Moving the difference in the average utility in Equation 26 to the LHS, we have

	$\displaystyle\frac{V}{\epsilon_{W}}\operatornamewithlimits{\mathbb{E}}\left[% \sum_{t=1}^{T}\bigg{(}g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))% \bigg{)}\right]$	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]+f(T)+$
		$\displaystyle\quad g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}% \left[M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]+$
		$\displaystyle\quad h(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{7/8}.$

Plugging in the just-derived bound on average queue length, namely Equation 27, we have

	$\displaystyle\quad\frac{V}{\epsilon_{W}}\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\bigg{(}g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{% \lambda}(t))\bigg{)}\right]$
	$\displaystyle\leq 0+f(T)+g(T)\operatorname{\mathcal{O}}\left(\left(\frac{V}{\epsilon_{W}}GT+f(T)\right)^{3/4}\log\left(\frac{V}{\epsilon_{W}}GT+f(T)\right)\right)+h(T)\operatorname{\mathcal{O}}\left(\left(\frac{V}{\epsilon_{W}}GT+f(T)\right)^{7/8}\right)$
	$\displaystyle=f(T)+\operatorname{\mathcal{O}}_{T}\left(T^{1/4-\delta_{a}/2}(VT% )^{3/4}\log(VT)+T^{1/8-\delta_{\lambda}/4}(VT)^{7/8}\right).$

According to the assumption that $V=o_{T}(\min\{T^{2\delta_{a}/3},T^{2\delta_{\lambda}/7}\})$ , the second term on the RHS is of order $o_{T}(T)$ . Therefore, we have

	$\displaystyle\quad\frac{1}{T}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\left(g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))\right)\right]\leq\frac{\epsilon_{W}}{VT}f(T)+\frac{\epsilon_{W}}{VT}o_{T}(T)$
	$\displaystyle=(VT)^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{2}+% \epsilon_{W}N^{2}(2NM+R))C_{W}T+\frac{R(2NM+R)}{r^{7}}d^{14/3}(C^{\lambda}T^{1% /2-\delta_{\lambda}})^{2}\right.+$
	$\displaystyle\qquad\qquad\qquad\left.\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}V(L+G)T^{3/4}+\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T\right)+o_{T}(V^{-1})$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\frac{(N^{2}(2NM+R)^{2}+\epsilon% _{W}N^{2}(2NM+R))C_{W}+(N^{4}M^{2}+N^{2}R^{2})}{V}\right)+o_{T}(V^{-1}).$		(28)

The second conclusion of this theorem follows.
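The condition $V=o_{T}(\min\{T^{2\delta_{a}/3},T^{2\delta_{\lambda}/7}\})$ comes from requiring the two residual terms in Step 2 to be $o_{T}(T)$. Writing $V=T^{v}$ and ignoring logarithmic factors (both are simplifying assumptions of this sketch, as are the function names), the exponents of $T$ can be tracked exactly with rational arithmetic:

```python
from fractions import Fraction

def residual_exponents(delta_a, delta_lam, v):
    """Exponents of T in T^(1/4 - delta_a/2) (VT)^(3/4) and T^(1/8 - delta_lam/4) (VT)^(7/8).

    All arguments should be Fractions; V is modeled as T^v and logs are ignored.
    """
    first = Fraction(1, 4) - delta_a / 2 + Fraction(3, 4) * (v + 1)
    second = Fraction(1, 8) - delta_lam / 4 + Fraction(7, 8) * (v + 1)
    return first, second

def critical_v(delta_a, delta_lam):
    """Largest admissible growth exponent of V: min(2 delta_a / 3, 2 delta_lam / 7)."""
    return min(Fraction(2, 3) * delta_a, Fraction(2, 7) * delta_lam)
```

At $v=2\delta_{a}/3$ the first exponent equals exactly $1$, and at $v=2\delta_{\lambda}/7$ the second one does; any strictly smaller $v$ keeps both below $1$, which is the $o_{T}(T)$ requirement.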

Step 3 (Refine the Average Queue Length Bound).

Now we are ready to refine our average queue length bound using Equation 28. Instead of controlling the utility with the uniform boundedness assumption that $g_{t}\in[-G,G]$ , we utilize the just-derived convergence result Equation 28.

Specifically, again applying the self-bounding property in Lemma D.6 to Equation 26 but instead replacing the first term on the RHS with Equation 28, we get

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle\leq\operatorname{\mathcal{O}}\left(f(T)+o_{T}(T)+f(T)+g(T)^{4}% \log^{8}\left(2(f(T)^{1/8}+g(T)^{1/2}+h(T))^{2}\right)+h(T)^{8}\right)$
		$\displaystyle=\operatorname{\mathcal{O}}(f(T))+\operatorname{\mathcal{O}}\left% (g(T)^{4}\log^{8}\left(2(f(T)^{1/8}+g(T)^{1/2}+h(T))^{2}\right)+h(T)^{8}\right% )+o_{T}(T)$
		$\displaystyle=\operatorname{\mathcal{O}}\left(\frac{(N^{2}(2NM+R)^{2}+\epsilon% _{W}N^{2}(2NM+R))C_{W}+(N^{4}M^{2}+N^{2}R^{2})}{\epsilon_{W}}T\right)+o_{T}(T),$

which gives our first conclusion as well. ∎

Lemma C.7 (Calculations when Proving Theorem 4.5).

Under the conditions of Theorem 4.5, we have

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle\leq-\frac{V}{\epsilon_{W}}\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\bigg{(}g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{% \lambda}(t))\bigg{)}\right]+f(T)+$
		$\displaystyle\quad g(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}% \left[M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right]+$
		$\displaystyle\quad h(T)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]^{7/8}.$

where

	$\displaystyle f(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+\frac{R(2NM+R)}{r^{7}}d^{14/3}(C^{\lambda}T% ^{1/2-\delta_{\lambda}})^{2}\right.+$
		$\displaystyle\qquad\qquad\left.\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}V(L+G)T^{3/4}+\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T\right),$
	$\displaystyle g(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((2NM+R)^{1/4}M% \sqrt{1+C^{a}T^{1/2-\delta_{a}}}\log T\right),$
	$\displaystyle h(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((2NM+R)^{1/8}% \left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}% \right).$

Proof.

From the network stability assumption, we derived in Lemma 4.1 that

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mathring{\mu}_{n,m}^{(k)}(t)(Q% _{m}^{(k)}(t)-Q_{n}^{(k)}(t))\right]-\operatornamewithlimits{\mathbb{E}}\left[% \sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)% \mathring{\lambda}_{n}^{(k)}(t)\right].$

Recall the AdaPFOL guarantee in Theorem 3.5 that

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}(\mu_{n,m}^{(k)}(t)-\mathring{% \mu}_{n,m}^{(k)}(t))\left(Q_{m}^{(k)}(t)-Q_{n}^{(k)}(t)\right)\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)\text{,}$

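and the Bandit Convex Optimization guarantee in Theorem 4.4 that

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\left(\langle\bm{Q}(t),\bm{\lambda}(t)-\mathring{\bm{\lambda}}(t)\rangle-V\big(g_{t}(\bm{\lambda}(t))-g_{t}(\mathring{\bm{\lambda}}(t))\big)\right)\right]$
	$\displaystyle=\operatorname{\mathcal{O}}\left(\frac{R(2NM+R)}{r^{7}}d^{14/3}(C^{\lambda}T^{1/2-\delta_{\lambda}})^{2}+\operatornamewithlimits{\mathbb{E}}\left[\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}\left(\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)\right)^{4/3}\right)^{3/4}\right]\right),$

we therefore have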

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum_% {(n,m)\in{\mathcal{L}}}\sum_{k\in{\mathcal{N}}}\mu_{n,m}^{(k)}(t)(Q_{m}^{(k)}(% t)-Q_{n}^{(k)}(t))\right]+$
	$\displaystyle\quad\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)+$
	$\displaystyle\quad-\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\sum% _{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\lambda_{n}^{(k)}(t)% \right]-V\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\bigg{(}g_{t}(% \mathring{\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))\bigg{)}\right]+$
	$\displaystyle\quad\operatorname{\mathcal{O}}\left(\frac{R(2NM+R)}{r^{7}}d^{14/% 3}(C^{\lambda}T^{1/2-\delta_{\lambda}})^{2}\right)+$
	$\displaystyle\quad\operatorname{\mathcal{O}}\left(\operatornamewithlimits{% \mathbb{E}}\left[\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{% \lambda}})^{1/4}\left(\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)% \right)^{4/3}\right)^{3/4}\right]\right).$

Further plugging in the Lyapunov DPP calculation in Equation 15 (which controls the three $\operatornamewithlimits{\mathbb{E}}[\sum_{t=1}^{T}\cdots]$ terms outside $\operatorname{\mathcal{O}}$ on the RHS), we have

	$\displaystyle\quad\epsilon_{W}\operatornamewithlimits{\mathbb{E}}\left[\sum_{t% =1}^{T}\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)\right]-(% N^{2}(2NM+R)^{2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T$
	$\displaystyle\leq\operatorname{\mathcal{O}}\left(M\sqrt{1+P_{T}^{a}}% \operatornamewithlimits{\mathbb{E}}\left[\sqrt{\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{2}^{2}}\log T\log\left(\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M% \lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{\infty}\right)\right]\right)+$
	$\displaystyle\quad\operatorname{\mathcal{O}}\left(\frac{R(2NM+R)}{r^{7}}d^{14/% 3}(C^{\lambda}T^{1/2-\delta_{\lambda}})^{2}\right)+$
	$\displaystyle\quad\operatorname{\mathcal{O}}\left(\operatornamewithlimits{% \mathbb{E}}\left[\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{% \lambda}})^{1/4}\left(\sum_{t=1}^{T}\left(\lVert\bm{Q}(t)\rVert_{2}+V(L+G)% \right)^{4/3}\right)^{3/4}\right]\right)+$
	$\displaystyle\quad\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T+V\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}(g_{t}(\bm{\lambda}(t))-g_{t}(\mathring{\bm{\lambda}}(t)))\right].$

For notational simplicity, we can abbreviate this inequality as

	$\displaystyle\quad\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]\leq-\frac{V}{\epsilon_{W}}% \operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\bigg{(}g_{t}(\mathring% {\bm{\lambda}}(t))-g_{t}(\bm{\lambda}(t))\bigg{)}\right]+\widetilde{f}(T)+$
	$\displaystyle\quad\widetilde{g}(T)\sqrt{\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}\right]}\log\left(\max_{t=1}^% {T}\max_{(n,m)\in{\mathcal{L}}}M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{% \infty}\right)+\widetilde{h}(T)\left(\operatornamewithlimits{\mathbb{E}}\left[% \sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{4/3}\right]\right)^{3/4},$

where

	$\displaystyle\widetilde{f}(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left((N^{2}(2NM+R)^{% 2}+\epsilon_{W}N^{2}(2NM+R))C_{W}T+\frac{R(2NM+R)}{r^{7}}d^{14/3}(C^{\lambda}T% ^{1/2-\delta_{\lambda}})^{2}\right.+$
		$\displaystyle\qquad\qquad\left.\left(\frac{R}{r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}V(L+G)T^{3/4}+\frac{1}{2}N^{2}((NM)^{2}+2(NM)^{2}+2R^{2})T\right),$
	$\displaystyle\widetilde{g}(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(M\sqrt{1+C^{a}T% ^{1/2-\delta_{a}}}\log T\right),$
	$\displaystyle\widetilde{h}(T)$	$\displaystyle=\epsilon_{W}^{-1}\operatorname{\mathcal{O}}\left(\left(\frac{R}{% r}d^{2/3}+R\right)(C^{\lambda}T^{1/2-\delta_{\lambda}})^{1/4}\right).$

To handle the $\widetilde{g}(T)$-related term, we use the same argument as in Section 3.5: Lemma D.3 states that if $x_{1}=0$, $x_{2},\ldots,x_{T}\geq 0$, and $\lvert x_{t+1}-x_{t}\rvert\leq 1$, then $\sum_{t=1}^{T}x_{t}^{2}=\operatorname{\mathcal{O}}\left((\sum_{t=1}^{T}x_{t})^{3/2}\right)$. From Lemma B.1, any single queue $Q_{n}^{(k)}(t)$ satisfies $\lvert Q_{n}^{(k)}(t+1)-Q_{n}^{(k)}(t)\rvert\leq 2NM+R$. Hence, applying Lemma D.3 to $\{Q_{n}^{(k)}(t)/(2NM+R)\}_{t\in[T]}$ for every $n\in{\mathcal{N}}$ and $k\in{\mathcal{N}}$, we have

\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}=(2NM+R)^{2}\sum_{t=1}^{T}\sum_{n% \in{\mathcal{N}}}\sum_{k\in{\mathcal{N}}}\left(\frac{Q_{n}^{(k)}(t)}{2NM+R}% \right)^{2}=\operatorname{\mathcal{O}}\left(\sqrt{2NM+R}\left(\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right)^{1.5}\right).

Further noticing that

\max_{t=1}^{T}\max_{(n,m)\in{\mathcal{L}}}M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)% \rVert_{\infty}\leq\sum_{t=1}^{T}M\sum_{n\in{\mathcal{N}}}\lVert\bm{Q}_{n}(t)% \rVert_{1}\leq M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1},

the $\widetilde{g}(T)$ -related term then becomes

	$\displaystyle\quad\widetilde{g}(T)\sqrt{\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{2}\right]}\log\left(\max_{t=1}^% {T}\max_{(n,m)\in{\mathcal{L}}}M\lVert\bm{Q}_{m}(t)-\bm{Q}_{n}(t)\rVert_{% \infty}\right)$
	$\displaystyle=\widetilde{g}(T)\operatorname{\mathcal{O}}\left((2NM+R)^{1/4}% \operatornamewithlimits{\mathbb{E}}\left[\left(\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{1}\right)^{3/4}\log\left(M\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}% \right)\right]\right).$

Noticing that $x\mapsto x^{3/4}\log(Mx)$ is concave when $x$ is large enough, Jensen's inequality then gives

\operatorname{\mathcal{O}}\left(\operatornamewithlimits{\mathbb{E}}\left[\left% (\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right)^{3/4}\log\left(M\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right)\right]\right)=\operatorname{\mathcal{O}}\left% (\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_% {1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}% \lVert\bm{Q}(t)\rVert_{1}\right]\right).
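As a quick sanity check of the concavity claim, and assuming $\log$ denotes the natural logarithm (an assumption; the base is not fixed here), differentiating $\phi(x)=x^{3/4}\log(Mx)$ twice gives $\phi^{\prime\prime}(x)=x^{-5/4}\left(\frac{1}{2}-\frac{3}{16}\log(Mx)\right)$, which is non-positive once $Mx\geq e^{8/3}$. The Python sketch below compares this closed form against a central finite difference. The names `phi`, `phi_second`, `concavity_threshold`, and `finite_difference_second` are illustrative and do not come from the paper.

```python
import math

def phi(x, m):
    """phi(x) = x^(3/4) * ln(m * x); using the natural log is an assumption."""
    return x ** 0.75 * math.log(m * x)

def phi_second(x, m):
    """Closed-form second derivative of phi."""
    return x ** -1.25 * (0.5 - (3.0 / 16.0) * math.log(m * x))

def concavity_threshold(m):
    """Point from which phi is concave: m * x = e^(8/3)."""
    return math.exp(8.0 / 3.0) / m

def finite_difference_second(x, m, h=1e-3):
    """Central finite-difference estimate of phi''(x)."""
    return (phi(x + h, m) - 2.0 * phi(x, m) + phi(x - h, m)) / (h * h)

if __name__ == "__main__":
    m = 3.0  # hypothetical value of M, chosen only for illustration
    x0 = concavity_threshold(m)
    for x in (0.5 * x0, 2.0 * x0, 10.0 * x0):
        # The sign flips from positive to negative as x crosses x0.
        print(x, phi_second(x, m), finite_difference_second(x, m))
```

Since the sum of queue lengths is only required to be "large enough", the threshold affects constants but not the order of the resulting bound.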

Moreover, we handle the $\widetilde{h}(T)$ -related term using Lemma D.4, a variant of Lemma D.3 which states that if $x_{1}=0$ , $x_{2},\ldots,x_{T}\geq 0$ , and $\lvert x_{t+1}-x_{t}\rvert\leq 1$ , then $\sum_{t=1}^{T}x_{t}^{4/3}=\operatorname{\mathcal{O}}\left((\sum_{t=1}^{T}x_{t}% )^{7/6}\right)$ . Hence, still applying it to $\{Q_{n}^{(k)}(t)/(2NM+R)\}_{t\in[T]}$ for every $n\in{\mathcal{N}}$ and $k\in{\mathcal{N}}$ ,

	$\displaystyle\sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{2}^{4/3}$	$\displaystyle=\sum_{t=1}^{T}\left(\sum_{n\in{\mathcal{N}}}\sum_{k\in{\mathcal{% N}}}Q_{n}^{(k)}(t)^{2}\right)^{2/3}\leq\sum_{t=1}^{T}\sum_{n\in{\mathcal{N}}}% \sum_{k\in{\mathcal{N}}}Q_{n}^{(k)}(t)^{4/3}$
		$\displaystyle=\operatorname{\mathcal{O}}\left(\left(2NM+R\right)^{1/6}\left(% \sum_{t=1}^{T}\lVert\bm{Q}(t)\rVert_{1}\right)^{7/6}\right).$

Therefore, we have

	$\displaystyle\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{% Q}(t)\rVert_{1}\right]$	$\displaystyle\leq-\frac{V}{\epsilon_{W}}\operatornamewithlimits{\mathbb{E}}% \left[\sum_{t=1}^{T}\bigg{(}g_{t}(\mathring{\bm{\lambda}}(t))-g_{t}(\bm{% \lambda}(t))\bigg{)}\right]+\widetilde{f}(T)+$
		$\displaystyle\quad\widetilde{g}(T)\operatorname{\mathcal{O}}\left((2NM+R)^{1/4% }\right)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{1}\right]^{3/4}\log\operatornamewithlimits{\mathbb{E}}\left[M\sum_{t=1% }^{T}\lVert\bm{Q}(t)\rVert_{1}\right]+$
		$\displaystyle\quad\widetilde{h}(T)\operatorname{\mathcal{O}}\left((2NM+R)^{1/8% }\right)\operatornamewithlimits{\mathbb{E}}\left[\sum_{t=1}^{T}\lVert\bm{Q}(t)% \rVert_{1}\right]^{7/8}.$

Setting $f(T)=\widetilde{f}(T)$ , $g(T)=\widetilde{g}(T)\operatorname{\mathcal{O}}((2NM+R)^{1/4})$ , and $h(T)=\widetilde{h}(T)\operatorname{\mathcal{O}}((2NM+R)^{1/8})$ gives our conclusion. ∎
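The two rescaling steps used in the proof above (Lemmas D.3 and D.4 applied to each queue divided by $2NM+R$) can be checked numerically. The following Python sketch simulates several non-negative queues that start at zero and move by at most `max_step` per round, where `max_step` plays the role of $2NM+R$, and then tests the two resulting inequalities with the explicit constants $4$ and $2^{1/6}$ from Lemmas D.3 and D.4. The simulator and all function names are hypothetical and serve only as an illustration; a passing check on random instances is not a proof.

```python
import random

def simulate_queues(num_queues, horizon, max_step, seed=0):
    """Non-negative queues with Q(1) = 0 and |Q(t+1) - Q(t)| <= max_step."""
    rng = random.Random(seed)
    queues = []
    for _ in range(num_queues):
        q = [0.0]
        for _ in range(horizon - 1):
            # Clipping at zero never increases the size of the change.
            q.append(max(0.0, q[-1] + rng.uniform(-max_step, max_step)))
        queues.append(q)
    return queues

def rescaled_bounds_hold(queues, max_step):
    """Check the squared-norm bound and the 4/3-power bound of the proof."""
    horizon = len(queues[0])
    l1_total = sum(sum(q) for q in queues)  # sum_t ||Q(t)||_1
    sq_norms = [sum(q[t] ** 2 for q in queues) for t in range(horizon)]
    lhs_square = sum(sq_norms)  # sum_t ||Q(t)||_2^2
    lhs_four_thirds = sum(s ** (2.0 / 3.0) for s in sq_norms)  # sum_t ||Q(t)||_2^(4/3)
    rhs_square = 4.0 * max_step ** 0.5 * l1_total ** 1.5
    rhs_four_thirds = (2.0 * max_step) ** (1.0 / 6.0) * l1_total ** (7.0 / 6.0)
    return lhs_square <= rhs_square, lhs_four_thirds <= rhs_four_thirds

if __name__ == "__main__":
    qs = simulate_queues(num_queues=4, horizon=500, max_step=7.0, seed=1)
    print(rescaled_bounds_hold(qs, max_step=7.0))
```

Summing the per-queue bounds into a bound on $\sum_{t}\lVert\bm{Q}(t)\rVert_{1}$ relies on $\sum_{i}s_{i}^{p}\leq(\sum_{i}s_{i})^{p}$ for $s_{i}\geq 0$ and $p\geq 1$, which is why the same constants carry over to the whole queue vector.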

Appendix D Auxiliary Lemmas

The first lemma extends the famous summation lemma $\sum_{t=1}^{T}\frac{x_{t}}{\sqrt{\sum_{s=1}^{t}x_{s}}}=\operatorname{\mathcal{% O}}\left(\sqrt{\sum_{t=1}^{T}x_{t}}\right)$ (Auer et al., 2002).

Lemma D.1.

For non-negative real numbers $x_{1},x_{2},\ldots,x_{T}\in\mathbb{R}$ , we have

\sum_{t=1}^{T}\frac{x_{t}}{\left(\sum_{s=1}^{t}x_{s}\right)^{1/4}}\leq 2\left(% \sum_{t=1}^{T}x_{t}\right)^{3/4}.

Proof.

We prove the claim by induction on $T$. The case $T=1$ is obvious. Suppose that the conclusion holds for $T-1$, and consider an arbitrary $x_{T}\geq 0$:

	$\displaystyle\quad\sum_{t=1}^{T}\frac{x_{t}}{\left(\sum_{s=1}^{t}x_{s}\right)^% {1/4}}$
	$\displaystyle=\sum_{t=1}^{T-1}\frac{x_{t}}{\left(\sum_{s=1}^{t}x_{s}\right)^{1% /4}}+\frac{x_{T}}{\left(\sum_{t=1}^{T}x_{t}\right)^{1/4}}$
	$\displaystyle\leq 2\left(\sum_{t=1}^{T-1}x_{t}\right)^{3/4}+\frac{x_{T}}{\left% (\sum_{t=1}^{T}x_{t}\right)^{1/4}},$

so it suffices to prove $\left.x_{T}\middle/\left(\sum_{t=1}^{T}x_{t}\right)^{1/4}\right.\leq 2\left(% \sum_{t=1}^{T}x_{t}\right)^{3/4}-2\left(\sum_{t=1}^{T-1}x_{t}\right)^{3/4}$ . Notice that

	$\displaystyle\quad\left(\left(\sum_{t=1}^{T}x_{t}\right)^{3/4}-\left(\sum_{t=1% }^{T-1}x_{t}\right)^{3/4}\right)\left(\left(\sum_{t=1}^{T}x_{t}\right)^{1/4}+% \left(\sum_{t=1}^{T-1}x_{t}\right)^{1/4}\right)$
	$\displaystyle=\left(\sum_{t=1}^{T}x_{t}\right)+\left(\sum_{t=1}^{T}x_{t}\right% )^{3/4}\left(\sum_{t=1}^{T-1}x_{t}\right)^{1/4}-\left(\sum_{t=1}^{T}x_{t}% \right)^{1/4}\left(\sum_{t=1}^{T-1}x_{t}\right)^{3/4}-\left(\sum_{t=1}^{T-1}x_% {t}\right)$
	$\displaystyle\geq x_{T},$

where the last inequality uses $\sum_{t=1}^{T}x_{t}\geq\sum_{t=1}^{T-1}x_{t}$ (follows from $x_{T}\geq 0$ ). Hence, due to the fact that

\frac{x_{T}}{2\left(\sum_{t=1}^{T}x_{t}\right)^{1/4}}\leq\frac{x_{T}}{\left(\sum_{t=1}^{T}x_{t}\right)^{1/4}+\left(\sum_{t=1}^{T-1}x_{t}\right)^{1/4}},

the conclusion holds for $T$ as well. ∎
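Lemma D.1 is easy to test numerically. The Python sketch below evaluates both sides of the inequality on arbitrary non-negative sequences; the function names `weighted_sum` and `lemma_d1_bound` are illustrative, and terms whose prefix sum is zero (which forces $x_{t}=0$) are treated as zero, a convention assumed here rather than stated in the lemma.

```python
import random

def weighted_sum(xs):
    """Left-hand side: sum_t x_t / (x_1 + ... + x_t)^(1/4)."""
    total, prefix = 0.0, 0.0
    for x in xs:
        prefix += x
        if prefix > 0.0:  # a zero prefix forces x_t = 0; the term is taken as 0
            total += x / prefix ** 0.25
    return total

def lemma_d1_bound(xs):
    """Right-hand side: 2 * (x_1 + ... + x_T)^(3/4)."""
    return 2.0 * sum(xs) ** 0.75

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.expovariate(1.0) for _ in range(1000)]
    print(weighted_sum(xs), lemma_d1_bound(xs))
```

The gap between the two printed values is expected: comparing the sum with $\int s^{-1/4}\,\mathrm{d}s$ suggests the sharper constant $4/3$, so the constant $2$ in the lemma leaves slack.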

Lemma D.2.

Suppose that $x_{1}=0$ , $x_{2},x_{3},\ldots,x_{T}\geq 0$ , and $\lvert x_{t}-x_{t-1}\rvert\leq 1$ , $\forall t=2,3,\ldots,T$ , then

\sum_{t=1}^{T}x_{t}^{4/3}\geq 4^{-7/3}x_{T}^{7/3}.

Proof.

As adjacent $x_{t}$ ’s differ by no more than $1$ , $\lfloor x_{T}\rfloor<T$ and $x_{T-t}\geq x_{T}-t$ . Therefore,

\sum_{t=1}^{T}x_{t}^{4/3}\geq\sum_{t=0}^{\lfloor x_{T}\rfloor}x_{T-t}^{4/3}% \geq\sum_{t=0}^{\lfloor x_{T}\rfloor}(x_{T}-t)^{4/3}\geq\sum_{t=0}^{\lfloor x_% {T}\rfloor}\left(t^{4/3}+(x_{T}-\lfloor x_{T}\rfloor)^{4/3}\right)\geq\sum_{t=% 0}^{\lfloor x_{T}\rfloor}t^{4/3},

where the third inequality reindexes the sum via $t\mapsto\lfloor x_{T}\rfloor-t$ and uses $(a+b)^{4/3}\geq a^{4/3}+b^{4/3}$ for $a,b\geq 0$. As

\sum_{i=0}^{n}i^{4/3}\geq\sum_{i=\lfloor\frac{n}{2}\rfloor}^{n}i^{4/3}\geq% \left(n-\left\lfloor\frac{n}{2}\right\rfloor\right)\left(\left\lfloor\frac{n}{% 2}\right\rfloor\right)^{4/3}\geq\left(\left\lfloor\frac{n}{2}\right\rfloor% \right)^{7/3},

we have $\sum_{t=1}^{T}x_{t}^{4/3}\geq(\lfloor\frac{x_{T}}{2}\rfloor)^{7/3}$. If $x_{T}\geq 4$, then $\lfloor\frac{x_{T}}{2}\rfloor\geq\frac{x_{T}}{2}-1\geq\frac{x_{T}}{4}$, giving the conclusion. Otherwise, $x_{T}<4$, and we have $\sum_{t=1}^{T}x_{t}^{4/3}\geq x_{T}^{4/3}\geq(\frac{x_{T}}{4})^{4/3}\geq(\frac{x_{T}}{4})^{7/3}$, where the last step uses $\frac{x_{T}}{4}<1$, so our conclusion still follows. ∎
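A small numerical experiment illustrates Lemma D.2. The sketch below, in Python, generates non-negative sequences that start at zero with increments of magnitude at most one and checks the lower bound; it also includes the ramp $x_{t}=t-1$, which reaches the largest possible final value. The helper `bounded_walk` and its `drift` parameter are hypothetical choices made only for this illustration.

```python
import random

def bounded_walk(horizon, seed=0, drift=0.3):
    """Non-negative sequence with x_1 = 0 and |x_t - x_{t-1}| <= 1."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(horizon - 1):
        step = min(1.0, max(-1.0, rng.uniform(-1.0, 1.0) + drift))
        xs.append(max(0.0, xs[-1] + step))  # clipping keeps the change within 1
    return xs

def lemma_d2_holds(xs):
    """Check sum_t x_t^(4/3) >= 4^(-7/3) * x_T^(7/3)."""
    lhs = sum(x ** (4.0 / 3.0) for x in xs)
    rhs = 4.0 ** (-7.0 / 3.0) * xs[-1] ** (7.0 / 3.0)
    return lhs >= rhs

if __name__ == "__main__":
    ramp = [float(t) for t in range(200)]  # x_t = t - 1
    print(lemma_d2_holds(ramp), lemma_d2_holds(bounded_walk(200, seed=1)))
```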

Lemma D.3 ((Huang et al., 2024, Lemma 4)).

If $x_{1}=0$ , $x_{2},x_{3},\ldots,x_{T}\geq 0$ , and $\lvert x_{t}-x_{t-1}\rvert\leq 1$ , $\forall t=2,3,\ldots,T$ , then

\sum_{t=1}^{T}x_{t}^{2}\leq 4\left(\sum_{t=1}^{T}x_{t}\right)^{3/2}.

Lemma D.4.

If $x_{1}=0$ , $x_{2},x_{3},\ldots,x_{T}\geq 0$ , and $\lvert x_{t}-x_{t-1}\rvert\leq 1$ , $\forall t=2,3,\ldots,T$ , then

\sum_{t=1}^{T}x_{t}^{4/3}\leq 2^{1/6}\left(\sum_{t=1}^{T}x_{t}\right)^{7/6}.

Proof.

Imitating the proof of Lemma D.3 (Huang et al., 2024, Lemma 4), we sort $x_{1},x_{2},\ldots,x_{T}$ as $y_{1}\leq y_{2}\leq\cdots\leq y_{T}$. According to the original proof, $y_{T}=\max_{t\in[T]}x_{t}\leq(2\sum_{t=1}^{T}x_{t})^{1/2}$. Hence,

\sum_{t=1}^{T}x_{t}^{4/3}\leq y_{T}^{1/3}\sum_{t=1}^{T}x_{t}\leq\left(2\sum_{t% =1}^{T}x_{t}\right)^{1/6}\left(\sum_{t=1}^{T}x_{t}\right)=2^{1/6}\left(\sum_{t% =1}^{T}x_{t}\right)^{7/6}.

∎
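The proof above rests on the intermediate bound $\max_{t}x_{t}\leq(2\sum_{t}x_{t})^{1/2}$. The following Python sketch checks this step together with the statements of Lemmas D.3 and D.4 on sequences with unit-bounded increments. All function names, and the random-walk generator with its `drift` parameter, are assumptions introduced for illustration only.

```python
import random

def bounded_walk(horizon, seed=0, drift=0.2):
    """Non-negative sequence with x_1 = 0 and |x_t - x_{t-1}| <= 1."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(horizon - 1):
        step = min(1.0, max(-1.0, rng.uniform(-1.0, 1.0) + drift))
        xs.append(max(0.0, xs[-1] + step))
    return xs

def max_bound_holds(xs):
    """Intermediate step: max_t x_t <= sqrt(2 * sum_t x_t)."""
    return max(xs) <= (2.0 * sum(xs)) ** 0.5 + 1e-12

def lemma_d3_holds(xs):
    """sum_t x_t^2 <= 4 * (sum_t x_t)^(3/2)."""
    return sum(x * x for x in xs) <= 4.0 * sum(xs) ** 1.5 + 1e-12

def lemma_d4_holds(xs):
    """sum_t x_t^(4/3) <= 2^(1/6) * (sum_t x_t)^(7/6)."""
    lhs = sum(x ** (4.0 / 3.0) for x in xs)
    return lhs <= 2.0 ** (1.0 / 6.0) * sum(xs) ** (7.0 / 6.0) + 1e-12

if __name__ == "__main__":
    xs = bounded_walk(1000, seed=3)
    print(max_bound_holds(xs), lemma_d3_holds(xs), lemma_d4_holds(xs))
```

The ramp $x_{t}=t-1$ is a useful stress case because it makes the maximum as large as the increment constraint allows.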

The following two lemmas are similar to Lemma 5 of Huang et al. (2024).

Lemma D.5.

If $y\leq f+y^{3/4}g\log y$ and $f,g\geq 1$ , then

y^{1/4}\leq f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right).

Proof.

Let $i(z)=z^{4}-z^{3}g\log(z^{4})-f$ . Notice that

	$\displaystyle\quad\left(f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)\right)^{4}$
	$\displaystyle\geq f+4\left(f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)\right)^{% 3}g\log\left(2(f^{1/4}+g)^{2}\right).$

As $f,g\geq 1$, we have $2(f^{1/4}+g)^{2}\geq f^{1/4}+g\log(2(f^{1/4}+g)^{2})$. Applying this relationship to the second log on the RHS of the inequality above, we obtain

i\left(f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)\right)\geq 0.

On the other hand, from the conditions, we know $i(y^{1/4})\leq 0$. Hence, if we can prove that $i(z)$ is monotone (at least for a range of $z$), we can conclude that $y^{1/4}\leq f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)$, thus giving our conclusion. To establish the monotonicity of $i(z)$, we calculate its derivative as

i^{\prime}(z)=4z^{3}-3z^{2}g\log(z^{4})-4z^{2}g.

Thus, if we only consider the case where $z\geq 0$ , $i^{\prime}(z)\geq 0$ holds when $z\geq 3g\log z+g$ . Denoting the larger root of $z=3g\log z+g$ as $z_{0}$ (in case it has no root, let $z_{0}=1$ ), we know $i(z)$ is increasing in $[z_{0},+\infty)$ . Further observing that $f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)$ indeed satisfies $z\geq 3g\log z+g$ , we know $f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right)\geq z_{0}$ .

On the other hand, from the assumption that $y\leq f+y^{3/4}g\log y$ , we know $i(y^{1/4})\leq 0\leq i(f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right))$ . Our conclusion follows from discussing the relationship between $y^{1/4}$ and $z_{0}$ : If $y^{1/4}\leq z_{0}$ , then we immediately have

y^{1/4}\leq z_{0}\leq f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right).

Otherwise, by the monotonicity of $i(z)$ for $z\geq z_{0}$, we still have

y^{1/4}\leq f^{1/4}+g\log\left(2(f^{1/4}+g)^{2}\right),

as claimed. ∎

Lemma D.6.

If $y\leq f+y^{3/4}g\log y+y^{7/8}h$ and $f,g,h\geq 1$ , then

y^{1/8}\leq f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h.

Proof.

Let $i(z)=z^{8}-f-z^{6}g\log(z^{8})-z^{7}h$ . Notice that

	$\displaystyle\quad\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h\right)^{8}$
	$\displaystyle=\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h% \right)^{7}\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)% \right)+$
	$\displaystyle\quad\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h\right)^{7}h$
	$\displaystyle\geq\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right% )+h\right)^{6}f^{1/4}+$
	$\displaystyle\quad\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h\right)^{6}g\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+$
	$\displaystyle\quad\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h\right)^{7}h$
	$\displaystyle\geq f+\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h\right)^{6}g\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+$
	$\displaystyle\quad\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h\right)^{7}h.$

As $f,g,h\geq 1$ , we have $2(f^{1/8}+g^{1/2}+h)^{2}\geq f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}% \right)+h$ . Applying it to the $(\cdots)^{6}g\log(2(f^{1/8}+g^{1/2}+h)^{2})$ term on the RHS, we have

i\left(f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h\right)\geq 0.

On the other hand, from the conditions we know $i(y^{1/8})\leq 0$. Hence, we again want to prove the monotonicity of $i(z)$, which gives the conclusion that $y^{1/8}\leq f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h$.

Calculating the derivative of $i(z)$ , we have

i^{\prime}(z)=8z^{7}-6z^{5}g\log(z^{8})-8z^{5}g-7z^{6}h.

Thus, if we only consider the case where $z\geq 1$, $i^{\prime}(z)\geq 0$ holds when $8z^{2}-6g\log(z^{8})-8g-7zh\geq 0$. As $8z^{2}$ is convex and $6g\log(z^{8})+8g+7zh$ is concave, the two curves intersect at most twice. Hence, again denoting the larger root of $8z^{2}-6g\log(z^{8})-8g-7zh=0$ as $z_{0}$ (in case there is no root, let $z_{0}=1$), $i(z)$ is increasing on $[z_{0},+\infty)$.

As $f,g,h\geq 1$ , $f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h\geq z_{0}$ always holds. The conclusion follows by discussing the relationship of $y^{1/8}$ and $z_{0}$ : If $y^{1/8}\leq z_{0}$ , we directly have

y^{1/8}\leq z_{0}\leq f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)% +h.

Otherwise, we can still conclude

y^{1/8}\leq f^{1/8}+g^{1/2}\log\left(2(f^{1/8}+g^{1/2}+h)^{2}\right)+h

from the monotonicity of $i(z)$ when $z\geq z_{0}$. ∎
