Adversarial Network Optimization under Bandit Feedback:
Maximizing Utility in Non-Stationary Multi-Hop Networks
Abstract
Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems and has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary over time, which fails to capture the non-stationary components in many real-world scenarios. Many existing algorithms also assume knowledge of network conditions before decision-making, which rules out applications where conditions are unpredictable.
Motivated by these issues, we consider Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of i) maximizing some unknown and time-varying utility function associated with the scheduler’s actions, where ii) the underlying network is a non-stationary multi-hop one whose conditions change arbitrarily with time, and iii) only bandit feedback (the effect of actually deployed actions) is revealed after decisions. Our proposed UMO2 algorithm ensures network stability and also matches the utility maximization performance of any “mildly varying” reference policy up to a polynomially decaying gap. To our knowledge, no previous ANO algorithm handled multi-hop networks or achieved utility guarantees under bandit feedback, whereas ours can do both.
Technically, our method builds upon a novel integration of online learning into Lyapunov analyses: To handle complex inter-dependencies among queues in multi-hop networks, we propose meticulous techniques to balance online learning and Lyapunov arguments. To tackle the learning obstacles due to potentially unbounded queue sizes, we design a new online linear optimization algorithm that automatically adapts to loss magnitudes. To maximize utility, we propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suits drastically varying queue lengths. Our new insights in online learning can be of independent interest.
1 Introduction
Stochastic Network Optimization (SNO) studies the fundamental problem of resource allocation in a dynamic system to fulfill incoming demands, with extensive applications in real-world problems including communication networks (Srikant and Ying, 2013), cloud computing (Maguluri et al., 2012), and supply chains (Rahdar et al., 2018). There are many classical scheduling algorithms in this field enjoying performance guarantees in terms of throughput maximization (Tsibonis et al., 2003), delay minimization (Neely, 2008), or utility maximization (Huang and Neely, 2011).
Classical SNO models often assume that the network conditions, for example, the arrival and service rates to each queue or the capacities of data links, are stationary with respect to time. However, many important network scenarios in practice face non-stationarity. For instance, in applications such as autonomous driving, parties in the communication networks can move rapidly (Ashjaei et al., 2021), causing the network conditions to vary from time to time. Even more, attacks such as Distributed Denial-of-Service (DDoS) or jamming can frequently happen in communication networks (Zou et al., 2016), where arrival rates or link conditions are altered by some malicious adversary.
Moreover, we notice that existing works, even those allowing non-stationary network conditions, assume perfect knowledge about the network conditions. For example, in the paper by Liang and Modiano (2018b), the network condition is revealed at the beginning of each round, so the outcomes associated with each action can be accurately calculated before actually deciding and deploying the scheduler’s action (see Section 1.1 for more discussion). Nevertheless, this may again not be the case in practice. In underwater wireless communication systems, for instance, the network conditions are unpredictable until the policy is actually executed (Khan et al., 2020). In the Internet of Things (IoT), device failures or sensor temperatures can change rapidly, resulting in highly unpredictable traffic and channel patterns in the network (Gaddam et al., 2020). Therefore, it is hard to estimate counterfactual outcomes of other actions (i.e., “what would happen if we used a different action?”) even after deploying the action and obtaining more information about the network conditions, not to mention pre-decision evaluations. In a nutshell, it is important and largely open to design network algorithms that are robust to time-dependent or adversarial conditions with post-decision feedback.
Motivated by these two challenges, this paper considers optimizing an abstract utility function associated with the scheduler’s action (the so-called utility maximization task (Neely et al., 2008)) even when the network is non-stationary (which we call Adversarial Network Optimization, or ANO for short) and the feedback model is bandit-style. Specifically, in ANO, the network conditions and utility functions can vary from time to time without the scheduler’s knowledge. Therefore, statistics of the past reveal little about the current network condition, which breaks many traditional SNO techniques. Moreover, under the bandit feedback model, the scheduler has no information about the current network condition before making a decision. Even worse, after making decisions, it only receives feedback resulting from the chosen action – not the “counterfactual” feedback associated with other actions. More formally, suppose every action is associated with an outcome (for example, arrival and service rates) in each round. Then i) the scheduler has to decide its action without any information about the outcomes of that round, and ii) after playing the action, it can only observe the outcome of the chosen action but not those of the other actions. Therefore, it is hard to evaluate the optimal action even in hindsight: Based on the information collected in past rounds, one cannot accurately calculate which action had the most gain within those rounds. Hence, in our model, one not only cannot predict the future, but also cannot fully interpolate the history.
In addition to the challenging ANO setup, bandit feedback model, and utility maximization task, we also allow the underlying network to be multi-hop, which means jobs can be forwarded between queues. Despite these difficulties, we succeed in designing the utility maximization algorithm UMO2, which achieves a strong performance guarantee in non-stationary multi-hop networks under bandit feedback. It not only ensures that the network is stable over time, but also guarantees a polynomially decaying gap between our utility and that of any “mildly varying” policy (measured by the path length; see Theorem 4.5), similar to what can be achieved in perfect-knowledge SNO problems (Neely et al., 2008).
We now highlight several technical innovations in our algorithm. While our algorithm is based on the classical Lyapunov drift-plus-penalty (DPP) analysis, the adversarial network conditions break existing SNO arguments, which we tackle by designing online learning algorithms enjoying dynamic regret guarantees in adversarial environments. However, due to the multi-hop topology, learning in different queues is correlated. Thus, it is highly non-trivial to decompose the problem into several online learning tasks. To this end, we develop meticulous analysis techniques that can jointly analyze the online learning algorithms and the utility maximization effects. Moreover, the queue lengths can occasionally be large despite having a bounded expectation. Such a unique challenge is missing in the online learning literature, which usually assumes the loss magnitudes are uniformly bounded by a constant. Finally, yet another challenge is due to a combination of the multi-hop topology and the unbounded queue lengths, which makes the losses fed into online learning algorithms sometimes quite negative, a known issue for many online learning algorithms (Zheng et al., 2019, Dai et al., 2023). These two challenges make existing online learning algorithms unable to fulfill our purpose, and we propose a novel Online Linear Optimization algorithm (AdaPFOL; used for system stability) that adapts to drastically varying losses (proportional to queue lengths) and a new Bandit Convex Optimization algorithm (AdaBGD; deployed for utility maximization) whose learning rates are carefully designed to take care of the time-dependent loss magnitudes and Lipschitzness.
Table 1. The Arrival & Service and Utility columns indicate whether the arrival and service rates, or the utility function, associated with each feasible control action are known before decision-making.
Work | Network Conditions | Arrival & Service | Topology | Objective | Utility |
(Neely et al., 2008) | Stochastic | Known | Multi-Hop | Utility Maximization | Known |
(Neely, 2010b) | Adversarial | Known | Multi-Hop | Utility Maximization | Known |
(Liang and Modiano, 2018b) | Adversarial | Known | Multi-Hop | Network Stability | — |
(Liang and Modiano, 2018a) | Adversarial | Known | Multi-Hop | Utility Maximization | Known |
(Yang et al., 2023) | Adversarial | Unknown | Single-Hop | Network Stability | — |
(Huang et al., 2024) | Adversarial | Unknown | Single-Hop | Network Stability | — |
Ours | Adversarial | Unknown | Multi-Hop | Utility Maximization | Unknown |
Finally, we mention some of the most related works, including (Neely, 2010b, Liang and Modiano, 2018b; a, Huang et al., 2024, Yang et al., 2023), to help position our work in the literature. Among them, Neely (2010b), Liang and Modiano (2018b; a) assumed perfect pre-decision knowledge of network conditions, which allows direct calculation of the arrival and service rates resulting from every action. In our case, we have to learn these outcomes in an online manner. Liang and Modiano (2018b), Huang et al. (2024), Yang et al. (2023) focused on network stability, while our utility maximization task additionally requires maximizing an abstract, unknown, and time-varying utility function, thus adding difficulties in designing online learning algorithms. Neely (2010b), Liang and Modiano (2018a) considered utility maximization, but their utility functions are fixed and non-adversarial. In our case, the utility functions are both time-varying and unknown, thus another online learning sub-routine is needed. Huang et al. (2024), Yang et al. (2023) investigated single-hop networks where jobs leave the network upon being served, whereas our formulation considers the more general multi-hop networks. We refer the readers to Table 1 and Section 1.1 for more information.
Our main contributions in this paper can be summarized as follows:
• We propose a novel algorithm UMO2 (Algorithm 3) for adversarial multi-hop networks under bandit feedback, which gives a rigorous utility optimization guarantee. To the best of our knowledge, no previous algorithm can handle multi-hop topology or achieve utility guarantees in adversarial networks under bandit feedback, whereas our UMO2 algorithm is able to do both. Moreover, as a by-product, we also derive a simpler algorithm NSO (Algorithm 1) which ensures network stability for adversarial multi-hop networks under bandit feedback.
• To handle the multi-hop topology, which brings inter-queue correlations, and to jointly handle online learning and network optimization, we develop a unified analysis that allows the integration of online learning techniques into the classical Lyapunov drift-plus-penalty arguments. Specifically, via the design of a new OLO algorithm to stabilize the network and a novel BCO algorithm to maximize the utility, the UMO2 algorithm enjoys a network stability guarantee together with a polynomially decaying gap between its utility and that of any policy that is “mildly varying” (in the sense that its path length is of order ; see Theorem 4.5 for more details).
• Due to the potentially unbounded queue lengths, existing online learning algorithms are unfortunately inapplicable. We design an OLO algorithm that can handle large losses and enjoys a performance guarantee adapted to the loss magnitudes (AdaPFOL; Theorem 3.5). We also develop a new BCO method specially crafted for the drastically varying loss magnitudes and Lipschitzness (AdaBGD; Theorem 4.4). Both online learning algorithms can be of independent interest.
1.1 Related Works
We discuss the most related works here. A more comprehensive literature review is in Appendix A.
Adversarial Network Control. Adversarial networks date back to the 1990s, when Cruz (1991) gave the first adversarial dynamics network model and its scheduling algorithm. More efforts were made to allow more general arrival rates (Borodin et al., 2001, Andrews et al., 2001), link conditions (Andrews and Zhang, 2004, Andrews et al., 2007), or both (Liang and Modiano, 2018b). We also direct the readers to the references therein for more discussion. The main focus of the aforementioned papers was usually system stability, whereas ours is utility maximization. As far as we are aware, existing results on utility maximization (Neely, 2010b, Liang and Modiano, 2018a) mostly assumed perfect knowledge of network conditions.
Feedback Models. Most previous works considered the perfect-knowledge model, which assumes pre-decision knowledge of network conditions (Liang and Modiano, 2018b; a). In contrast, our paper considers the bandit feedback model, which only reveals the consequence of our action. A small number of previous works (Fu and Modiano, 2022, Yang et al., 2023, Huang et al., 2024) also assumed similar feedback models, albeit under different names. Another feedback model, whose difficulty lies in between, is the full-information feedback model, which requires network conditions to be revealed after the decision and thus allows counterfactual evaluations of all actions (i.e., not only the deployed one) in hindsight. See (Neely et al., 2012) for an example.
Adversarial Networks under Bandit Feedback. Prior to our work, Huang et al. (2024), Yang et al. (2023) also studied adversarial networks under bandit feedback. However, they both assumed single-hop networks and focused on network stability. In contrast, our paper allows a general multi-hop topology and tackles the utility maximization task of optimizing an abstract, unknown, and time-varying utility function.
2 Notations and Preliminaries
We use bold letters to denote vectors, e.g., , and denote their elements with corresponding normal letters, e.g., . For an integer , stands for . For a finite set , is the simplex over , i.e., , where every element is a discrete probability distribution over . We use to hide all absolute constants, and use to additionally hide all logarithmic factors. For functions and , we say if and if .
2.1 Adversarial Network Optimization under Bandit Feedback Formulation
We first introduce our adversarial network optimization with bandit feedback model. Specifically, in a network with multiple servers and directional data links, we denote the set of all servers by and that of all data links by . Suppose that and are both finite. There are commodities of jobs such that those jobs belonging to commodity are destined for server . We denote as the queue of unfinished commodity- jobs at server , where and . We assume the links do not interfere with each other.
The scheduling problem lasts for rounds. In round , the scheduler makes two decisions: i) arrival rates of commodity- jobs into server , , and ii) link rate allocations, i.e., how many commodity- jobs to transmit over data link , . Both decisions are made under the bandit feedback model, i.e., the scheduler makes decisions blindly and only receives feedback resulting from its actions. Below, we describe them in detail.
Arrival Rates and Utility. In every round , the scheduler decides an dimensional arrival rate matrix from some fixed action set , and consequently, jobs with commodity will be added to queue . The arrival rate vector is associated with some abstract utility where is concave (that is, the user’s marginal return diminishes gradually as the arrivals increase (Huang and Neely, 2011, Huang et al., 2012)), -Lipschitz, and -bounded, where and here are known constants. Following the adversarial network assumption, we allow ’s to be time-dependent (though they have to be pre-determined, which is called the oblivious adversary model). Following the bandit feedback model, the scheduler has no information about before the decision, and can only observe for the chosen but not the whole after decision.
Link Rate Allocations. The capacity of each link can be time-varying. We denote the capacity of in round as . We assume the capacities are always bounded by some finite constant . Due to the bandit feedback model, the scheduler cannot access when deciding. Nevertheless, the scheduler can still decide a link allocation plan which assigns a distribution over commodities on each link, formally denoted as (the -dimensional distribution simplex, representing the portion of rates allocated to each commodity over the link). By sending jobs from each commodity along link according to distribution in a round-robin manner, approximately jobs from queue will be sent along link to queue . Formally, we assume that after deciding link allocation plans , the number of jobs successfully sent from to , denoted by , is independently generated such that and . Again, we assume a bandit feedback model, which means the scheduler is able to observe and for all only after the decision is made at the end of round .
Putting the two components together, by denoting the length of at the beginning of round to be , the network dynamics can then be characterized as follows:
(1) |
where is the number of jobs with commodity that the scheduler adds to server , and is the number of jobs with commodity transmitted along data link .
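As a rough illustration of the dynamics just described, the sketch below advances a small multi-hop network by one round. The update rule used here (admitted arrivals plus inbound transfers minus outbound transfers, with each transfer capped by the jobs still waiting at the source, and jobs leaving the network on reaching their destination) is an assumed, commonly used form and is not copied from Equation 1. The names `step_queues`, `Q`, `arrivals`, `served`, and `dest` are hypothetical.

```python
import numpy as np

def step_queues(Q, arrivals, served, links, dest):
    """Advance queue lengths by one round (assumed dynamics, not Equation 1).

    Q        : (N, K) array, Q[n, k] = commodity-k jobs waiting at server n
    arrivals : (N, K) array of jobs the scheduler admits this round
    served   : dict mapping link (n, m) -> length-K array of jobs offered on it
    links    : list of directed links (n, m)
    dest     : length-K list, dest[k] = destination server of commodity k
    """
    Q_next = Q.astype(float).copy()
    remaining = Q.astype(float).copy()  # jobs still available to forward
    for (n, m) in links:
        # A link cannot move more jobs than the source queue still holds.
        sent = np.minimum(served[(n, m)], remaining[n])
        remaining[n] -= sent
        Q_next[n] -= sent
        for k, s in enumerate(sent):
            # Jobs that reach their destination leave the network.
            if dest[k] != m:
                Q_next[m, k] += s
    return Q_next + arrivals

# Line network 0 -> 1 -> 2 with a single commodity destined for server 2.
Q = np.array([[3.0], [1.0], [0.0]])
links = [(0, 1), (1, 2)]
served = {(0, 1): np.array([2.0]), (1, 2): np.array([5.0])}
arrivals = np.array([[1.0], [0.0], [0.0]])
Q_new = step_queues(Q, arrivals, served, links, dest=[2])
```

In this toy run, link (1, 2) offers five units of service but only the one job already waiting at server 1 can be forwarded, which is the kind of inter-queue coupling that makes multi-hop networks harder to analyze than single-hop ones.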
The objective of the scheduler is to maximize its average utility over the rounds, namely . However, a scheduling algorithm is meaningless if it cannot ensure network stability, which requires that the average number of jobs remaining in the network does not diverge when the number of rounds is large enough. Formally, the network stability requirement says
(2) |
The scheduler aims to maximize its average utility subject to the network stability condition, i.e.,
(3) |
2.2 Technical Overview of Our Paper
In order to improve presentation and facilitate understanding, in Section 3, we first present the network stability algorithm NSO, i.e., pretending is a constant. This algorithm will serve as a key building block for the utility maximization algorithm UMO2 that we introduce in Section 4.
In Figure 1, we give an overview of our main technical steps when analyzing NSO and UMO2. The steps for NSO are in yellow, the ones for UMO2 are in blue, and those in common are in green. In general, either analysis starts from the famous Lyapunov drift(-plus-penalty) analysis (Neely, 2010a, §4), which reveals the non-negativity of a Lyapunov drift(-plus-penalty) function – see Section 3.3 and Section 4.3 for more details. We then use online learning techniques to minimize them (i.e., making them as close to zero as possible). From here, the analyses for NSO and UMO2 become different.
For the network stability algorithm NSO, we succeeded in expressing the Lyapunov drift function as a function linear in the queue lengths and the link allocation plan . While this belongs to the classical Online Linear Optimization (OLO) problem in online learning (Zinkevich, 2003), we face two unique challenges due to the potentially unbounded queue lengths and the self-bounding analysis for network stability guarantees; see Section 3.1 for more discussion. These two requirements rule out existing OLO algorithms, and thus we have to design our own algorithm crafted towards the network optimization objective. Specifically, we designed an OLO algorithm AdaPFOL (see Algorithm 2) that can handle occasionally large loss magnitudes and ensures a performance guarantee depending on all the losses. When plugging it into NSO, we are able to see that the link allocation plans perform well, as detailed in Theorem 3.5. Therefore, combining it with a reference policy assumption (Assumption 1) and the Lyapunov drift analysis, we obtain the network stability guarantee of NSO in Theorem 3.6. A more detailed overview of NSO is in Section 3.1.
Regarding the utility maximization algorithm UMO2, we have to decompose the Lyapunov drift-plus-penalty function into two parts. The first part is still linear in and , for which we can reuse the AdaPFOL algorithm. For the second part, as the utility function is an arbitrary time-varying concave function and we only receive bandit feedback (recall Table 1), OLO cannot capture it. Instead, we model this part as a Bandit Convex Optimization (BCO) problem (Flaxman et al., 2005). Unfortunately, again due to the potentially unbounded queue lengths and the self-bounding analysis, the loss functions’ magnitudes and Lipschitzness can be very large for some rounds but we still want to adapt to them – thus, existing BCO algorithms are inapplicable as well. To this end, we develop a BCO algorithm AdaBGD (Algorithm 4) which allows loss functions with large magnitudes or Lipschitzness and enjoys a performance adaptive to the loss functions. When AdaBGD is plugged into the UMO2 framework, it can generate a good arrival rate sequence, as we analyze in Theorem 4.4. Therefore, similar to the analysis of NSO in Theorem 3.6, if we combine the AdaPFOL in Algorithm 2 and the AdaBGD in Algorithm 4 together, we are able to derive the utility maximization guarantee of UMO2 in Theorem 4.5. Again, a more detailed overview of UMO2 can be found in Section 4.1.
3 Network Stability in Adversarial Multi-Hop Networks
In this section, we take a first step towards our ultimate goal of multi-hop utility maximization, which is network stability (recall that Equation 3 requires Equation 2 as a condition). That is, this section only focuses on stabilizing the average number of tasks in the system and does not consider utilities. The algorithm designed for this purpose (NSO in Algorithm 1) will serve as the network stability component of our utility maximization algorithm (UMO2 in Algorithm 3).
One may observe that if we only want to ensure the network stability condition in Equation 2, it suffices to pick all arrival rate vectors . To avoid such trivial algorithms, we assume the arrival rates are adversarially chosen for now. That is, is some arbitrary, unknown, and time-varying vector following the oblivious adversary model. It is only revealed post-decision, at the end of round . The rationale for assuming adversarial arrival rates is that, in the UMO2 algorithm for utility maximization, another algorithmic component decides , and the network stability component that we design in this section must adapt to such an arbitrary arrival rate matrix .
3.1 Motivation of Our Algorithmic Framework
In Algorithm 1, we present Network Stability via Online Linear Optimization (NSO), an algorithmic framework which achieves stability in adversarial multi-hop networks under bandit feedback. One key ingredient of NSO is the plug-in Online Linear Optimization (OLO) algorithm AdaPFOL. Before going into details of the AdaPFOL algorithm, we first introduce why we need it.
The design of NSO is based on the famous Lyapunov drift analysis (Neely, 2010a, §4). Conducting standard Lyapunov analysis on the network dynamics defined in Equation 1, we are able to derive
(4) |
whose formal statement and proof can be found in Lemma 3.2.
Based on this inequality, a Lyapunov-drift-based algorithm can be constructed by minimizing the RHS of Equation 4 (Neely, 2010a, §4). As the arrival rates are regarded as constants in this section, we only need to focus on the terms related to . Thus, minimizing the RHS of Equation 4 is equivalent to
Minimizing | ||||
(5) |
where the last step uses the assumption that .
For illustration purposes, let us focus on a single data link . Motivated by Huang et al. (2024), we consider designing a scheduling algorithm via minimizing the following expectation:
Remark 1.
While having similarities, this objective is different from that of Huang et al. (2024) in two aspects:
First, the network topology in (Huang et al., 2024) is a single-server, single-hop one, thus it suffices to conduct the Lyapunov drift optimization on the centralized server. In contrast, due to our multi-hop topology, our optimization task Equation 5 has to be distributed onto every data link and extra efforts are needed to ensure a good overall scheduling effect.
Second, the coefficient before can be either positive or negative, whereas that of Huang et al. (2024) is always non-negative. Such a negativity also increases the difficulty as many online learning algorithms are typically bad at handling potentially negative losses, see, e.g., (Zheng et al., 2019, Dai et al., 2023).
Recall that can be any probability distribution from the simplex. Hence, one may view as the action set in round . Moreover, for an action from the action set , picking it in round will incur a loss – which is linear in . Thus, this problem belongs to the class of Online Linear Optimization (OLO) problems (Zinkevich, 2003, McMahan and Streeter, 2010, Duchi et al., 2011), whose formal definition will be presented later as Definition 3.3. Although our problem belongs to the OLO formulation, we face significantly different challenges due to our network optimization context:
i) In our task of minimizing , the magnitude of the loss can occasionally be large because or may be unbounded. Note that, although our system stability condition Equation 2 requires to be small, some ’s inside the average and expectation are still allowed to be large. However, existing algorithms in OLO mostly require the losses to be uniformly bounded by a constant (see, e.g., (McMahan and Streeter, 2010, Cutkosky, 2020)), which means extra efforts should be made to handle occasionally large losses.
ii) Moreover, we also want our performance to depend on all loss magnitudes (which in turn relate to queue lengths since the losses depend on ). On a high level, this can be understood as follows: The average queue length can be controlled by the OLO performance via some other arguments (see Section 3.2). Thus, if we can additionally show that the OLO performance is bounded by queue lengths, we can conduct a self-bounding analysis on the queue lengths which informally reads
(6)
More details regarding the self-bounding analysis can be found in Section 3.5. Nevertheless, it suffices to remember that a performance depending on all loss magnitudes is beneficial.
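To illustrate why such a dependence helps, consider a generic self-bounding argument. This is a simplified sketch with hypothetical non-negative scalars $x$, $a$, $b$ (think of $x$ as a total queue length), not the exact form of Equation 6. If a quantity is bounded by a constant plus a multiple of its own square root, then it is bounded by a constant:

```latex
x \le a + b\sqrt{x}
\;\Longrightarrow\;
x \le a + \tfrac{1}{2}x + \tfrac{1}{2}b^{2}
\;\Longrightarrow\;
x \le 2a + b^{2},
\qquad\text{using } b\sqrt{x} \le \tfrac{1}{2}x + \tfrac{1}{2}b^{2}.
```

The middle step is the AM-GM inequality. The point is that a bound which scales sublinearly in the quantity being controlled can be solved for that quantity, whereas a bound that only holds under a uniform constant cap on the losses gives no such leverage.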
In Section 3.4, we introduce our novel OLO algorithm of AdaPFOL (Algorithm 2) that enjoys these two properties. Equipped with such an algorithm, we can minimize Equation 5 and achieve network stability guarantee by deploying it on every link for rounds with action set . This idea of deploying AdaPFOL onto every link exactly gives our NSO framework in Algorithm 1.
Therefore, we are able to analyze the network stability effect when the NSO framework is equipped with AdaPFOL in Algorithm 2, which we do in the rest of this section: We introduce our reference policy assumptions in Section 3.2, conduct the Lyapunov drift analysis in Section 3.3, introduce and analyze the novel AdaPFOL algorithm in Section 3.4, and present our final analysis in Section 3.5.
3.2 Reference Policy Assumption
We first make the following multi-hop piecewise stability assumption, which, informally speaking, assumes that there exists a reference policy that stabilizes the system piecewisely. It is an extension of the piecewise stability assumption (Huang et al., 2024, Assumption 1) to multi-hop cases.
Assumption 1 (Multi-Hop Piecewise Stability for Network Stability).
There exists a reference action sequence (where , in analogue to the scheduler’s action sequence), such that there are some constants , and a partition of (that is, a collection of non-intersecting intervals whose union is ), which ensure that and
(7) |
where is the obliviously decided arrival rates that we assume in this section.
Intuitively, Assumption 1 means that there exists some “good” action sequence making the network stable, in the sense that there are multiple windows such that in expectation, for each window and for each queue , the average service rate it receives (whose expectation is in round ), is strictly more than its net arrival rate (which includes both external data flows and internal data flows that are forwarded from other queues ), by a constant gap of at least .
Remark 2.
Such assumptions are typical in the network optimization literature. In the case when the network is stationary, Assumption 1 recovers the classical capacity region assumption in SNO (Neely, 2010a). However, extending this condition to adversarial networks is highly non-trivial. For adversarial networks, an alternative assumption is the -constrained dynamics assumption (Liang and Modiano, 2018b), which roughly says Equation 7 holds for every window of size . Assumption 1 thus allows more flexibility. Finally, our Assumption 1 can be viewed as a generalization of the piecewise stability assumption (Huang et al., 2024), which was crafted for a single centralized server.
Before moving on, we shall remark that the reference action sequence in Assumption 1 is unknown to the scheduler. Instead, the scheduler needs to learn its own way of stabilizing the network via observations. To characterize the ability of in stabilizing the network, the following lemma controls the average queue length resulting from any scheduling policy.
Lemma 3.1 (Ability of in Stabilizing the Network).
If satisfies Assumption 1, then for any scheduler-generated queue lengths ,
Lemma 3.1 says the Lyapunov drift (defined later) under is always negative, which is useful when analyzing queue-based policies (Neely, 2010a, §3.1). Its proof can be found in Section B.1.
3.3 Lyapunov Drift Analysis
We carry out our analysis based on the Lyapunov drift analysis (Neely, 2010a, §4), which considers the Lyapunov function and its drift , defined as follows:
We give the following result, which follows almost directly from applying the classical Lyapunov drift analysis to the queue dynamics in Equation 1. The proof is standard and thus deferred to Section B.2.
Lemma 3.2 (Lyapunov Drift Analysis).
Under the queue dynamics of Equation 1,
(8) |
As sketched in Section 3.1, our algorithm is designed to approximately minimize the RHS of Equation 8 via online learning, which contains two non-constant terms and . As are obliviously chosen, the second term is also constant. Therefore, it remains to minimize the following term:
(9) |
For each data link , Equation 9 corresponds to an Online Linear Optimization (OLO) problem with being the loss for round . In the next section, we first rigorously define the OLO problem in Definition 3.3 and then present our novel algorithm that tackles the unique challenges we face in network optimization contexts – potentially large losses due to unbounded queue lengths (recall Equation 9), and adapting to all the loss magnitudes because we want to conduct a self-bounding analysis on the queue lengths (recall Equation 6).
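For readers less familiar with Lyapunov arguments, the following sketch computes a Lyapunov function and its one-step drift on a toy queue matrix. It assumes the standard quadratic choice (half the sum of squared queue lengths), which may differ in constants from the definition used in this paper, and it evaluates the realized drift along one trajectory rather than the conditional expectation that appears in the analysis. The helper names `lyapunov` and `drift` are hypothetical.

```python
import numpy as np

def lyapunov(Q):
    """Quadratic Lyapunov function: half the sum of squared queue lengths."""
    return 0.5 * float(np.sum(np.square(Q)))

def drift(Q_now, Q_next):
    """Realized one-step drift L(t+1) - L(t) between two queue snapshots."""
    return lyapunov(Q_next) - lyapunov(Q_now)

# Two servers, two commodities; rows are servers, columns are commodities.
Q_now = np.array([[3.0, 0.0], [1.0, 2.0]])
Q_next = np.array([[2.0, 0.0], [2.0, 1.0]])
d = drift(Q_now, Q_next)  # negative: the backlog became more balanced
```

Here the total number of jobs is unchanged between the two snapshots, yet the drift is negative because the quadratic function penalizes long queues more heavily. This is why the terms that remain to be minimized are weighted by queue lengths.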
3.4 AdaPFOL: Learning for Network Stability
In this section, we take a small detour to the OLO problem mentioned in Section 3.1. We first rigorously define the OLO problem (Zinkevich, 2003, McMahan and Streeter, 2010, Duchi et al., 2011) in Definition 3.3. Then, we present the construction of our novel OLO algorithm AdaPFOL (Algorithm 2). Finally, we prove that plugging it into the NSO framework in Algorithm 1 indeed ensures a good optimization effect for Equation 9.
Definition 3.3 (Online Linear Optimization).
Consider a -round game. Every round , the player selects an action from a convex set . The environment simultaneously decides a loss vector such that the loss of the player for round is . The player will observe the whole vector of (i.e., full-information feedback is available, instead of the more restrictive bandit feedback model). Dynamic regret minimization in OLO considers minimizing
Before moving on, we recall the two challenges i) and ii) in Section 3.1: Due to potentially unbounded queue lengths, AdaPFOL must withstand large and negative losses. Meanwhile, as we want to conduct the self-bounding analysis in Equation 6, AdaPFOL shall additionally enjoy a performance guarantee ( in Definition 3.3) depending on the geometric mean of all the loss magnitudes. Thus, our ideal algorithm for Definition 3.3 must satisfy the following:
- i) it can withstand occasionally large loss magnitudes, i.e., can be large, and
- ii) it enjoys a performance guarantee depending on all the loss magnitudes, e.g., .
To design such an algorithm, we build upon the Parameter-Free Online Learning (PFOL) algorithm by Cutkosky (2020). It ensures condition ii) by enjoying (Cutkosky, 2020, Theorem 6), but fails to bear large loss magnitudes as it requires for all .
Fortunately, as , we know . Even better, can be calculated at the beginning of round – before deciding . Utilizing this knowledge, we are able to design our OLO algorithm which enjoys both property i) and ii). We call this algorithm AdaPFOL (Adaptive Parameter-Free Online Learning), whose pseudo-code is presented in Algorithm 2. AdaPFOL applies a doubling technique to the PFOL algorithm of Cutkosky (2020), restarting it every time a large is observed. We can show that this only introduces a logarithmic overhead as the original PFOL algorithm also enjoys ii).
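The doubling-and-restart idea can be sketched as a generic wrapper around a base learner. The sketch below is not Algorithm 2: as a loudly labeled assumption, the base learner is plain one-dimensional projected online gradient descent rather than the PFOL algorithm of Cutkosky (2020), and the class names, the `magnitude_hint` argument, and the `play` helper are all hypothetical. It only illustrates how a loss-magnitude bound known before the decision triggers a restart with a doubled scale.

```python
import math


class ProjectedOGD:
    """Stand-in base learner (NOT PFOL): 1-D projected gradient descent on [0, diameter]."""

    def __init__(self, diameter, scale):
        self.diameter = diameter
        self.scale = scale  # assumed upper bound on the gradient magnitude
        self.x = 0.0
        self.t = 0

    def act(self):
        return self.x

    def update(self, grad):
        self.t += 1
        step = self.diameter / (self.scale * math.sqrt(self.t))
        self.x = min(self.diameter, max(0.0, self.x - step * grad))


class DoublingRestart:
    """Restart the base learner with a doubled scale whenever the hint exceeds it."""

    def __init__(self, diameter, initial_scale=1.0):
        self.diameter = diameter
        self.scale = initial_scale
        self.restarts = 0
        self.base = ProjectedOGD(diameter, self.scale)

    def act(self, magnitude_hint):
        # The hint is available before the decision, so we can restart in time.
        if magnitude_hint > self.scale:
            while self.scale < magnitude_hint:
                self.scale *= 2.0
            self.base = ProjectedOGD(self.diameter, self.scale)
            self.restarts += 1
        return self.base.act()

    def update(self, grad):
        self.base.update(grad)


def play(hints, gradients, diameter=1.0):
    """Run the wrapper on a sequence of (hint, gradient) pairs."""
    learner = DoublingRestart(diameter)
    actions = []
    for hint, grad in zip(hints, gradients):
        actions.append(learner.act(hint))
        learner.update(grad)
    return learner, actions
```

Because the scale at least doubles at every restart, the number of restarts in this sketch grows only logarithmically with the largest hint, which mirrors the logarithmic overhead mentioned above.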
The AdaPFOL algorithm enjoys the following dynamic regret guarantee, satisfying both i) and ii):
Lemma 3.4 (Guarantee of AdaPFOL Algorithm).
Consider the OLO problem in Definition 3.3. Let the action set have diameter . Suppose that , where is some -measurable random variable and is the natural filtration, i.e., is the -algebra generated by all random observations made during the first rounds. Then, AdaPFOL (Algorithm 2) ensures that for any comparator sequence , if , then
Remark 3.
Note that in general, it is impossible to guarantee simultaneously for all (Zinkevich, 2003). Therefore, many dynamic regret bounds, including ours, depend on the notion of path length . Although the path length is linear in in the worst case, can still be ensured in cases where .
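For concreteness, the two quantities discussed in this remark can be computed as follows. This is a minimal sketch with hypothetical function names; it assumes linear losses as in Definition 3.3 and measures the path length in the Euclidean norm, which is an assumption since the norm used in the formal statement is not reproduced here.

```python
import math


def inner(u, v):
    """Inner product of two equally long coordinate sequences."""
    return sum(a * b for a, b in zip(u, v))


def path_length(comparators):
    """Sum of Euclidean distances between consecutive comparators (assumed norm)."""
    return sum(math.dist(prev, cur) for prev, cur in zip(comparators, comparators[1:]))


def dynamic_regret(losses, actions, comparators):
    """Total linear loss of the player minus that of a time-varying comparator sequence."""
    return sum(
        inner(loss, action) - inner(loss, comparator)
        for loss, action, comparator in zip(losses, actions, comparators)
    )
```

A fixed comparator has path length zero, which recovers the usual static regret as a special case.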
The proof of Lemma 3.4 will be presented in Section B.3. It can be seen that AdaPFOL indeed satisfies both i) and ii): It allows the loss magnitudes to be large, and also enjoys a magnitude-aware dynamic regret guarantee of .
Therefore, if we deploy AdaPFOL (Algorithm 2) to decide on each link as we do in Algorithm 1, the RHS of Equation 8 can consequently be minimized in the sense that it is close to that induced by the reference actions . Formally, we give the following theorem:
Theorem 3.5 (Deciding via AdaPFOL Algorithm).
For each link , as we did in NSO, we execute an instance of AdaPFOL (Algorithm 2) where , , and . We take their outputs as for every round . Let be the number of actually transmitted jobs from to induced by .
Consider an arbitrary reference action sequence satisfying Assumption 1. Let (as and ). Then
where is the path length of .
The proof follows almost directly from Lemma 3.4, so we postpone it to Section B.4. Thanks to property ii), the RHS of Theorem 3.5 depends on the -norm of the queue lengths . As we will see shortly, this is pivotal to the self-bounding argument sketched in Equation 6.
3.5 Main Theorem for Multi-Hop Network Stability
As sketched in Section 3.1, putting previous conclusions together and using a so-called self-bounding property, we can derive the following guarantee for multi-hop network stability:
Theorem 3.6 (Main Theorem for Multi-Hop Network Stability).
Suppose that satisfies Assumption 1 and its path length satisfies (Footnote 2: This assumption on path lengths comes from (Huang et al., 2024, Assumption 2). As discussed in Remark 3, such conditions are necessary.)
(10) |
where and are assumed to be known constants but the precise or both remain unknown. Then, if we execute the NSO framework in Algorithm 1 with AdaPFOL defined in Algorithm 2, the following performance guarantee is enjoyed:
That is, when , we have , i.e., Equation 2 holds and the system is stable.
Remark 4.
Any reference policy satisfying Equation 10 is called “mildly varying”. As mentioned in Remark 3, it is impossible to achieve non-trivial performance without restricting the reference sequence. We also compare our result with two previous results (Huang et al., 2024, Yang et al., 2023) which also ensured network stability in adversarial networks under bandit feedback: Under a path length assumption similar to but looser than Equation 10, Huang et al. (2024) stabilized single-server networks. By assuming that the environment (instead of the reference policy, as we do) is mildly varying, Yang et al. (2023) stabilized single-hop networks. Thus, Theorem 3.6 is the first guarantee applicable to adversarial multi-hop networks under bandit feedback.
A formal proof resides in Section B.5. We only highlight the self-bounding step here:
Proof Sketch of Theorem 3.6.
The first step of the proof is comparing the guarantee from Assumption 1 and that from Lyapunov drift analysis. Specifically, recall that Lemma 3.1 upper bounds the total queue length using (together with some other terms, which are constants after taking expectations) while Lemma 3.2 reveals the non-positivity of (again, many constants omitted). Furthermore, via the guarantee of AdaPFOL (Algorithm 2) in Theorem 3.5, these two terms are actually close – they only differ by , where we used the assumption that . By a property that ensures when (Lemma D.3), this can be controlled by .
Finishing these steps, which are detailed in Lemma B.8, we are able to conclude
where and are some abstract functions to simplify notations.
Therefore, this inequality is in a self-bounding form that where is the total queue lengths in rounds. As we informally stated in Equation 6, this gives our system stability guarantee. Indeed, in Lemma D.5, we show that implies . Therefore, and thus . ∎
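The mechanism of a self-bounding argument can be illustrated with the simplest instance: if a non-negative quantity `x` satisfies `x <= a + b * sqrt(x)`, then solving the quadratic inequality in `sqrt(x)` gives an explicit bound on `x`. The sketch below uses this square-root form as an assumption for illustration only; the exact exponents in Lemma D.5 are not reproduced here, and the function names are hypothetical.

```python
import math


def self_bounding_cap(a, b):
    """Largest x >= 0 with x <= a + b * sqrt(x), for a, b >= 0.

    Writing y = sqrt(x), the condition reads y^2 - b*y - a <= 0,
    so y is at most the larger root of the quadratic.
    """
    root = (b + math.sqrt(b * b + 4.0 * a)) / 2.0
    return root * root


def holds(x, a, b):
    """Check whether x satisfies the self-bounding inequality."""
    return x <= a + b * math.sqrt(x)
```

In other words, an inequality that bounds a quantity by a sub-linear function of itself can be turned into an explicit bound, which is the role the self-bounding step plays in the proof sketch above.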
4 Utility Maximization in Adversarial Multi-Hop Networks
We now turn to the utility maximization task. In this task, in addition to the capacity allocations, the arrival rates are also decided by the scheduler with an objective of maximizing the unknown and time-varying utility function. The scheduler’s objective is to maximize the average utility it gains (Equation 3), while ensuring the average number of jobs in the network remains small (Equation 2).
This section is organized similarly to Section 3: We first explain the motivation of our algorithmic framework UMO2 in Algorithm 3 and then present the assumptions together with analysis.
4.1 Motivation of Our Algorithmic Framework
In Algorithm 3, we give the general algorithmic framework of Utility Maximization via Online Linear Optimization and Bandit Convex Optimization (UMO2), which achieves utility optimization by plugging in two optimization sub-routines, AdaPFOL and AdaBGD. The differences between it and our system stability algorithm (NSO; Algorithm 1) are marked in blue. As we already motivated in Section 3.1 that an OLO algorithm AdaPFOL can help stabilize the system (recall Equation 5), this section focuses on motivating the other sub-routine AdaBGD by going through the design of UMO2.
To handle the utility function, instead of the Lyapunov analysis in the previous section, UMO2 is based on the Lyapunov drift-plus-penalty analysis (Neely, 2010a, Theorem 4.2). In Lemma C.2, we derive
(12) |
where is a constant that we can arbitrarily pick for analytical purposes. Intuitively, this stands for a trade-off between the stability part and the utility part .
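The stability-utility trade-off controlled by this constant can be illustrated with the classical full-information drift-plus-penalty rule from SNO (Neely, 2010a). The sketch below is not UMO2: it assumes a single queue, a known utility `log(1 + rate)`, and a positive trade-off constant, whereas our setting has unknown, time-varying utilities and bandit feedback. The function names and the choice of utility are hypothetical.

```python
import math


def dpp_objective(rate, queue_length, tradeoff):
    """Per-round objective: tradeoff * utility(rate) - queue_length * rate."""
    return tradeoff * math.log1p(rate) - queue_length * rate


def dpp_arrival_rate(queue_length, tradeoff, max_rate):
    """Maximize the concave per-round objective over [0, max_rate].

    Setting the derivative tradeoff / (1 + rate) - queue_length to zero
    gives rate = tradeoff / queue_length - 1, which is then clipped.
    """
    if queue_length <= 0.0:
        return max_rate  # with an empty queue, the objective is increasing in rate
    unconstrained = tradeoff / queue_length - 1.0
    return min(max_rate, max(0.0, unconstrained))
```

A larger trade-off constant admits more traffic at the same queue length (favoring utility), while a longer queue throttles admissions (favoring stability).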
Again motivated by Huang et al. (2024), our goal is to minimize Equation 12. The first term in Equation 12 is exactly the OLO optimization objective from the previous section (recall Equation 5), which can be minimized by the AdaPFOL algorithm given in Algorithm 2. For the second and third terms, we would like to
(13) |
We now also tackle Equation 13 using online learning techniques: As is defined over all possible ’s and can be arbitrarily chosen from the feasible action set , we regard “deciding ” as a learning problem with action set (instead of making decisions on each link or server separately as in Section 3.4) where the loss of picking in round is . This loss function is convex w.r.t. as is linear and is concave. However, since only bandit feedback on is available, we can only calculate but not the whole . Thus, this problem is not an OLO problem as Definition 3.3 requires full information feedback. Instead, it belongs to the category of Bandit Convex Optimization (BCO) (Flaxman et al., 2005), which we will define in Definition 4.2.
Similar to Section 3.1, we now discuss what properties the AdaBGD sub-routine should enjoy. We face the following challenges, which are unique to the network optimization context:
- i) Again, the queue lengths can potentially go unbounded, which means the loss can have large magnitudes. However, different from the OLO problem we met in Equation 5, in BCO problems we face general convex functions and thus Lipschitzness (i.e., the maximum gradient magnitude) also plays a role as it characterizes how fast changes with . Therefore, our AdaBGD shall not only bear large loss magnitudes but also withstand large Lipschitzness. As we will see in Equation 11, adapting to both magnitudes and Lipschitzness is particularly difficult.
- ii) The second challenge is again due to self-bounding analysis: Since we want to conduct self-bounding analyses on the queue lengths and also the utility gap (both similar to Equation 6), we also want AdaBGD to be adaptive to the loss functions’ magnitudes and Lipschitzness.
In Section 4.4, we introduce the details of our AdaBGD algorithm that ensures both i) and ii). Equipped with this algorithm, we can optimize Equation 13 by deploying it over the action set . Moreover, as deploying AdaPFOL (Algorithm 2) on each link minimizes , these two algorithms can together minimize the Lyapunov drift-plus-penalty in Equation 12. Such a combination gives the UMO2 framework in Algorithm 3.
In the remainder of this section, we present the analysis of UMO2: In Section 4.2, we introduce the reference sequence assumption. In Section 4.3, we present the Lyapunov drift-plus-penalty analysis. In Section 4.4, we rigorously define the BCO problem and present our AdaBGD algorithm (Algorithm 4). Finally, by combining the AdaPFOL guarantee from Section 3.4 and the AdaBGD guarantee from Section 4.4, we obtain the utility maximization guarantee in Section 4.5.
4.2 Reference Policy Assumption
The assumption we need in the multi-hop utility maximization task is similar to the one in multi-hop network stability (Assumption 1), with one important difference that our arrival rates are no longer fixed but decided by the scheduler. Hence, instead of assuming a sequence of stabilizing the system with the obliviously adversarial arrival rates , we assume the existence of a reference sequence such that stabilizes the system with reference arrival rates . Formally, we make the following assumption.
Assumption 2 (Multi-Hop Piecewise Stability for Utility Maximization).
There exists a reference action sequence (where and , in analogy to the scheduler’s action sequence) such that there are constants , , and a partition of , such that and
Imitating Lemma 3.1, one can derive the following, whose proof is in Section C.1.
Lemma 4.1 (Ability of in Stabilizing the Network).
If satisfies Assumption 2, then for any scheduler-generated queue lengths ,
(14) |
Still, we remark that our scheduler cannot access the reference action sequence . However, similar to the NSO algorithm, our UMO2 algorithm can also learn to stabilize the system. Moreover, it also learns to match the utility maximization performance of any mildly varying reference policy. Specifically, i) our action sequence also stabilizes the system, i.e., ; and, ii) its utility matches any mildly varying reference policy asymptotically, that is, . (Footnote 3: According to the oblivious adversary assumption, is pre-determined. Thus, the ’s on the LHS and RHS are the same, so there is no need to take conditional expectation w.r.t. previous actions or .)
4.3 Lyapunov Drift-plus-Penalty Analysis
In the Lyapunov drift analysis (Section 3.3), we consider the drift function where is the Lyapunov function. In the Lyapunov drift-plus-penalty (DPP) analysis (Neely, 2010a, Theorem 4.2), we consider the DPP function , where is arbitrarily determined for our purpose. As we will see in Theorem 4.5, when is chosen to be no larger than a polynomial of , our utility is at least that of any mildly varying reference policy minus , thus implying a polynomially decaying gap between these two utilities.
Similar to the calculations in Lemma 3.1, one can derive the following inequality (see Lemma C.2):
(15) |
Similar to the previous section, we want to make the RHS of Equation 15 close to that of Equation 14. Specifically, we decompose these two RHS’s into two parts and show the following inequalities:
(16) |
The first inequality is exactly our objective in Equation 5, which can be ensured via the AdaPFOL algorithm in Algorithm 2 – recall its performance guarantee in Theorem 3.5. On the other hand, the second inequality is new in the utility maximization task. While some algorithmic ingredients can be borrowed from the Bandit Convex Optimization (BCO) problem (Flaxman et al., 2005), new efforts need to be made due to the network optimization context: Our BCO algorithm shall accept large loss magnitudes and Lipschitzness (due to potentially unbounded queue lengths), and its performance must be adaptive to the loss functions’ magnitudes and Lipschitzness as well.
4.4 AdaBGD: Learning for Utility Maximization
As mentioned in the sketch (Section 4.1), Equation 16 is equivalent to minimizing the time-varying loss function under bandit feedback over the action set . This problem is different from the OLO problem introduced in Definition 3.3 as we do not have full-information feedback: Indeed, we assume instead of the whole will be revealed to the scheduler, hence only , the actual loss associated with our action, can be accurately calculated. We provide a formal definition of the Bandit Convex Optimization (BCO) problem (Flaxman et al., 2005, Chen and Giannakis, 2018) below.
Definition 4.2 (Bandit Convex Optimization).
Consider a -round game. In round , the player picks an action from a convex action set , and the environment simultaneously picks an arbitrary convex loss . The player observes and suffers loss . Dynamic regret minimization in BCO considers minimizing
Again, we recall the two challenges that our BCO algorithm should overcome:
- i) It must handle ’s with large magnitude or Lipschitzness .
- ii) Its performance should depend on magnitudes and Lipschitzness of all loss functions.
Our algorithm is based on the Bandit Gradient Descent (BGD) algorithm (Zhao et al., 2021, Algorithm 1), which does not satisfy i) or ii) as it requires losses to be uniformly bounded by some and Lipschitzness to be always bounded by some . In our case where , the loss magnitude and the Lipschitzness are both large when is large.
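To recall how gradient descent can run on bandit feedback alone, the sketch below implements the classical one-point gradient estimator of Flaxman et al. (2005) and a single descent round built on it. It is a generic illustration, not Algorithm 4 or the BGD variant of Zhao et al. (2021): the action set is assumed to be a Euclidean ball centered at the origin, the step size `eta` and perturbation size `delta` are fixed inputs with `delta < radius`, and all function names are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere via normalized Gaussians."""
    while True:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in vec))
        if norm > 0.0:
            return [c / norm for c in vec]


def one_point_gradient(loss_value, direction, delta):
    """Gradient estimate from one loss value observed at center + delta * direction."""
    dim = len(direction)
    return [dim / delta * loss_value * c for c in direction]


def bgd_round(center, loss_fn, eta, delta, radius, rng):
    """One round: perturb, observe a single loss value, take a projected step."""
    direction = random_unit_vector(len(center), rng)
    played = [c + delta * d for c, d in zip(center, direction)]
    value = loss_fn(played)  # the only feedback the learner receives
    grad = one_point_gradient(value, direction, delta)
    moved = [c - eta * g for c, g in zip(center, grad)]
    # Project onto the shrunk ball so the next perturbed point stays feasible.
    limit = radius - delta
    norm = math.sqrt(sum(c * c for c in moved))
    if norm > limit:
        moved = [c * limit / norm for c in moved]
    return played, value, moved
```

The estimate scales with the observed loss value divided by `delta`, so large loss magnitudes directly inflate its size. This is one way to see why uniform bounds on the losses are convenient for such estimators.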
Nevertheless, based on the BGD algorithm, we design a BCO algorithm called Adaptive BGD (AdaBGD; Algorithm 4) which satisfies both i) and ii). Specifically, to ensure i), we utilize the fact that is known before deciding , which means the magnitude and the Lipschitzness of the loss function can be calculated before the decision. To enjoy ii), instead of the doubling technique in AdaPFOL (Algorithm 2), we now design an adaptive learning rate scheduling mechanism which involves a sequence of time-varying learning rates, namely , instead of using a single throughout execution. Formally, AdaBGD has the following dynamic regret guarantee:
(17) |
Lemma 4.3 (Guarantee of AdaBGD Algorithm).
Suppose that , the -th loss is bounded by and is -Lipschitz. Suppose that and are both -measurable (where is the natural filtration), , and a.s. for all . Then for any fixed , the AdaBGD algorithm in Algorithm 4 enjoys the following guarantee:
where is the path length of the comparator sequence .
Compared to the algorithm itself in Algorithm 4 and the guarantee in Lemma 4.3, our main innovation on the BCO side lies in the novel learning rate scheduling mechanism in Equation 11. Therefore, we do not prove Lemma 4.3 at this moment and postpone it to Section C.3. Instead, we prove:
Theorem 4.4 (Deciding via AdaBGD Algorithm).
For the reference arrival rates defined in Assumption 2, suppose that its path length ensures
where, similar to Theorem 3.6, and are assumed to be known constants but the precise or both remain unknown. Suppose that the action set is bounded by (i.e., ). If we execute AdaBGD (Algorithm 4) over with loss functions and parameters defined in Equation 11 (restated below as Equation 18 to ease reading):
(18) |
then the outputs of AdaBGD ensure
In words, it means that the optimization objective Equation 16 is ensured. The second term on the RHS is a key term as it ensures property ii) – which is due to the definition in Equation 11 which contains all historical magnitudes and Lipschitzness. Below, we give a quick overview of this proof and see why ii) can be ensured by the defined in Equation 18. A formal version is included in Section C.4.
Proof Sketch of Theorem 4.4.
The terms and are the boundedness and Lipschitzness of , respectively, which we denote by and to simplify notations.
First suppose that all conditions in Lemma 4.3 are satisfied by our . To balance the term inside in Lemma 4.3 by fixing and altering , one shall pick and roughly have (hiding many constant terms; see Equation 24 in the appendix for an accurate form):
(19) |
We derive in Lemma D.1 that , which is a variant of the famous summation lemma (Auer et al., 2002). Therefore, if we pick (i.e., only keeping the third term in the denominator of Equation 18), Equation 19 becomes . When focusing on -related terms, this is roughly , which allows the self-bounding analysis similar to Equation 6. However, such a configuration of may not ensure the condition in Lemma 4.3. The other two terms in Equation 18 are added for this purpose. We refer the readers to Section C.4 for detailed verification. ∎
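The role of a summation lemma in this argument can be checked numerically. The sketch below assumes one standard form of such a lemma, namely that for positive numbers the sum of `a_t / sqrt(a_1 + ... + a_t)` is at most `2 * sqrt(a_1 + ... + a_T)`; the exact statement of Lemma D.1 is not reproduced here. It also shows a generic learning rate that shrinks with the accumulated squared magnitudes. This keeps only one history-dependent term and is not the three-term schedule of Equation 18; the function names are hypothetical.

```python
import math


def adaptive_rates(magnitudes, scale=1.0):
    """Generic history-dependent rates: scale / sqrt(sum of squared magnitudes so far)."""
    rates = []
    running = 0.0
    for m in magnitudes:
        if m <= 0.0:
            raise ValueError("magnitudes are assumed to be strictly positive")
        running += m * m
        rates.append(scale / math.sqrt(running))
    return rates


def summation_lemma_sides(values):
    """Return (sum_t a_t / sqrt(a_1 + ... + a_t), 2 * sqrt(a_1 + ... + a_T))."""
    running = 0.0
    lhs = 0.0
    for a in values:
        if a <= 0.0:
            raise ValueError("values are assumed to be strictly positive")
        running += a
        lhs += a / math.sqrt(running)
    return lhs, 2.0 * math.sqrt(running)
```

Since each rate depends on all magnitudes seen so far, a single large magnitude lowers every later rate, and the summation bound keeps the accumulated cost of these rates proportional to the square root of the total.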
4.5 Main Theorem for Multi-Hop Utility Maximization
As sketched in Section 4.1, if we use a Lyapunov drift-plus-penalty analysis, exploit the network stability assumption, use AdaPFOL (Algorithm 2) to decide link allocations , and use AdaBGD (Algorithm 4) to decide arrival rates , we get the following utility maximization guarantee.
Theorem 4.5 (Main Theorem for Multi-Hop Utility Maximization).
Suppose that the feasible set of arrival rate vectors is bounded by . Assume all (unknown) utility functions to be concave, -Lipschitz, and -bounded. Consider a reference action sequence satisfying Assumption 2, such that their path lengths satisfy
Here, are assumed to be known constants, whereas the specific remains unknown. If we execute the UMO2 framework in Algorithm 3 with the AdaPFOL sub-routine given in Algorithm 2 and the AdaBGD sub-routine given in Algorithm 4, when is large enough such that the constant , the following inequalities hold simultaneously:
That is, when , our algorithm not only stabilizes the system so that , but also enjoys an average utility approaching that of the reference policy polynomially fast, i.e., – the utility maximization objective Equation 3 is ensured.
Remark 5.
The condition says cannot be too small compared to , which was not an issue in SNO as people often let approach infinity (Neely et al., 2008). Although this condition looks restrictive, we remark that can still be as large as a polynomial of and thus means a polynomially decaying gap between our utility and that of any mildly varying policies whose path lengths are small – which is the first guarantee that applies to utility maximization tasks in adversarial networks under bandit feedback. Similar to the discussions in Remark 4, due to non-stationary environments and bandit feedback, it is highly non-trivial to define “optimal reference policy” in ANO. Nevertheless, our mildly varying reference policy class allows optimal policies for SNO settings.
The full proof of Theorem 4.5 is presented in Section C.5. We outline three key steps here.
Proof Sketch of Theorem 4.5.
The first step of the proof is plugging the algorithmic guarantees for AdaPFOL (Theorem 3.5) and for AdaBGD (Theorem 4.4) into the reference policy assumption (Equation 14) and then making use of the Lyapunov DPP guarantee (Equation 15). We present the detailed derivation in Lemma C.7. The conclusion reads
(20) |
where , , and .
Step 1 (Develop a Coarse Average Queue Length Bound).
By the boundedness of , the first term on the RHS of Equation 20 is controlled by (note that, as , is not a constant that can be hidden in and this term is actually super-linear). Similar to the one used when proving Theorem 3.6, we develop another self-bounding property that says implies (see Lemma D.6). Therefore,
only giving a bound on the average queue length (violating the system stability condition Equation 2). However, this inequality can be used to derive the polynomial convergence result on the utility, which in turn further refines the queue length bound.
Step 2 (Yield Polynomial Convergence on the Utility).
Moving the difference in the average utility to the LHS in Equation 20 and plugging in the just-derived bound on average queue length, we have
According to the assumption that , the RHS is of order . Thus
i.e., a polynomial convergence rate on the expected average utility is derived.
Step 3 (Refine the Average Queue Length Bound).
Now we are ready to refine our average queue length bound. Instead of controlling the utility with boundedness , we utilize the just-derived convergence result and obtain (again using the self-bounding property in Lemma D.6)
that is, the average queue length , which means Equation 2 holds and the system is stable. Putting Step 2 and Step 3 together gives our conclusion.
Once again, we omitted all factors except for those ones throughout this proof sketch. Please refer to Section C.5 for the complete version. ∎
5 Conclusion
We study utility maximization in Adversarial Network Optimization (ANO) under bandit feedback. We design a network stability algorithm NSO and a utility maximization algorithm UMO2, which both integrate online learning components into the Lyapunov drift framework to allow a joint analysis. When designing the online learning components of UMO2, due to the potentially unbounded queue lengths in network optimization and the self-bounding analysis we want to conduct, we develop a novel OLO algorithm AdaPFOL which adapts to occasionally large losses and a BCO algorithm AdaBGD which suits large loss magnitudes and Lipschitzness via a meticulous learning rate scheduling scheme. One important future research direction is to define alternative reference policy classes that allow competing with more policies, even the optimal ones.
References
- Andrews and Zhang (2004) Matthew Andrews and Lisa Zhang. Scheduling over nonstationary wireless channels with finite rate sets. In IEEE INFOCOM 2004, volume 3, pages 1694–1704. IEEE, 2004.
- Andrews and Zhang (2005) Matthew Andrews and Lisa Zhang. Scheduling over a time-varying user-dependent channel with applications to high-speed wireless data. Journal of the ACM (JACM), 52(5):809–834, 2005.
- Andrews et al. (2001) Matthew Andrews, Baruch Awerbuch, Antonio Fernández, Tom Leighton, Zhiyong Liu, and Jon Kleinberg. Universal-stability results and performance bounds for greedy contention-resolution protocols. Journal of the ACM (JACM), 48(1):39–69, 2001.
- Andrews et al. (2007) Matthew Andrews, Kyomin Jung, and Alexander Stolyar. Stability of the max-weight routing and scheduling protocol in dynamic networks and at critical loads. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 145–154, 2007.
- Ashjaei et al. (2021) Mohammad Ashjaei, Lucia Lo Bello, Masoud Daneshtalab, Gaetano Patti, Sergio Saponara, and Saad Mubeen. Time-sensitive networking in automotive embedded systems: State of the art and research opportunities. Journal of systems architecture, 117:102137, 2021.
- Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- Borodin et al. (2001) Allan Borodin, Jon Kleinberg, Prabhakar Raghavan, Madhu Sudan, and David P Williamson. Adversarial queuing theory. Journal of the ACM (JACM), 48(1):13–38, 2001.
- Chen and Giannakis (2018) Tianyi Chen and Georgios B Giannakis. Bandit convex optimization for scalable and dynamic iot management. IEEE Internet of Things Journal, 6(1):1276–1286, 2018.
- Cholvi and Echagüe (2007) Vicent Cholvi and Juan Echagüe. Stability of fifo networks under adversarial models: State of the art. Computer Networks, 51(15):4460–4474, 2007.
- Choudhury et al. (2021) Tuhinangshu Choudhury, Gauri Joshi, Weina Wang, and Sanjay Shakkottai. Job dispatching policies for queueing systems with unknown service rates. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 181–190, 2021.
- Cruz (1991) Rene L Cruz. A calculus for network delay. i. network elements in isolation. IEEE Transactions on information theory, 37(1):114–131, 1991.
- Cutkosky (2020) Ashok Cutkosky. Parameter-free, dynamic, and strongly-adaptive online learning. In International Conference on Machine Learning, pages 2250–2259. PMLR, 2020.
- Dai and Gluzman (2022) Jim G Dai and Mark Gluzman. Queueing network controls via deep reinforcement learning. Stochastic Systems, 12(1):30–67, 2022.
- Dai et al. (2023) Yan Dai, Haipeng Luo, Chen-Yu Wei, and Julian Zimmert. Refined regret for adversarial mdps with linear function approximation. In International Conference on Machine Learning, pages 6726–6759. PMLR, 2023.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
- Flaxman et al. (2005) Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
- Fu and Modiano (2022) Xinzhe Fu and Eytan Modiano. Joint learning and control in stochastic queueing networks with unknown utilities. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(3):1–32, 2022.
- Gaddam et al. (2020) Anuroop Gaddam, Tim Wilkin, Maia Angelova, and Jyotheesh Gaddam. Detecting sensor faults, anomalies and outliers in the internet of things: A survey on the challenges and solutions. Electronics, 9(3):511, 2020.
- Harrison and Wein (1990) J Michael Harrison and Lawrence M Wein. Scheduling networks of queues: Heavy traffic analysis of a two-station closed network. Operations research, 38(6):1052–1064, 1990.
- Huang et al. (2024) Jiatai Huang, Leana Golubchik, and Longbo Huang. When lyapunov drift based queue scheduling meets adversarial bandit learning. IEEE/ACM Transactions on Networking, 2024.
- Huang and Neely (2011) Longbo Huang and Michael J Neely. Utility optimal scheduling in processing networks. Performance Evaluation, 68(11):1002–1021, 2011.
- Huang et al. (2012) Longbo Huang, Scott Moeller, Michael J Neely, and Bhaskar Krishnamachari. Lifo-backpressure achieves near-optimal utility-delay tradeoff. IEEE/ACM Transactions On Networking, 21(3):831–844, 2012.
- Khan et al. (2020) Md Rizwan Khan, Bikramaditya Das, and Bibhuti Bhusan Pati. Channel estimation strategies for underwater acoustic (uwa) communication: An overview. Journal of the Franklin Institute, 357(11):7229–7265, 2020.
- Krishnasamy et al. (2018) Subhashini Krishnasamy, PT Akhil, Ari Arapostathis, Rajesh Sundaresan, and Sanjay Shakkottai. Augmenting max-weight with explicit learning for wireless scheduling with switching costs. IEEE/ACM Transactions on Networking, 26(6):2501–2514, 2018.
- Krishnasamy et al. (2021) Subhashini Krishnasamy, Rajat Sen, Ramesh Johari, and Sanjay Shakkottai. Learning unknown service rates in queues: A multiarmed bandit approach. Operations research, 69(1):315–330, 2021.
- Liang and Modiano (2018a) Qingkai Liang and Eytan Modiano. Network utility maximization in adversarial environments. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 594–602. IEEE, 2018a.
- Liang and Modiano (2018b) Qingkai Liang and Eytan Modiano. Minimizing queue length regret under adversarial network models. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):1–32, 2018b.
- Lim et al. (2013) Sungsu Lim, Kyomin Jung, and Matthew Andrews. Stability of the max-weight protocol in adversarial wireless networks. IEEE/ACM Transactions on Networking, 22(6):1859–1872, 2013.
- Liu et al. (2022) Bai Liu, Qiaomin Xie, and Eytan Modiano. Rl-qn: A reinforcement learning framework for optimal control of queueing systems. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 7(1):1–35, 2022.
- Maguluri et al. (2012) Siva Theja Maguluri, Rayadurgam Srikant, and Lei Ying. Stochastic models of load balancing and scheduling in cloud computing clusters. In 2012 Proceedings IEEE Infocom, pages 702–710. IEEE, 2012.
- McMahan and Streeter (2010) H Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. Annual Conference on Learning Theory 2010, page 244, 2010.
- Neely (2008) Michael J Neely. Order optimal delay for opportunistic scheduling in multi-user wireless uplinks and downlinks. IEEE/ACM Transactions on Networking, 16(5):1188–1199, 2008.
- Neely (2009) Michael J Neely. Delay analysis for max weight opportunistic scheduling in wireless systems. IEEE Transactions on Automatic Control, 54(9):2137–2150, 2009.
- Neely (2010a) Michael J Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010a.
- Neely (2010b) Michael J Neely. Universal scheduling for networks with arbitrary traffic, channels, and mobility. In 49th IEEE Conference on Decision and Control (CDC), pages 1822–1829. IEEE, 2010b.
- Neely et al. (2008) Michael J Neely, Eytan Modiano, and Chih-Ping Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions On Networking, 16(2):396–409, 2008.
- Neely et al. (2012) Michael J Neely, Scott T Rager, and Thomas F La Porta. Max weight learning algorithms for scheduling in unknown environments. IEEE Transactions on Automatic Control, 57(5):1179–1191, 2012.
- Rahdar et al. (2018) Mohammad Rahdar, Lizhi Wang, and Guiping Hu. A tri-level optimization model for inventory control with uncertain demand and lead time. International Journal of Production Economics, 195:96–105, 2018.
- Sadiq and De Veciana (2009) Bilal Sadiq and Gustavo De Veciana. Throughput optimality of delay-driven maxweight scheduler for a wireless system with flow dynamics. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1097–1102. IEEE, 2009.
- Srikant and Ying (2013) Rayadurgam Srikant and Lei Ying. Communication networks: an optimization, control, and stochastic networks perspective. Cambridge University Press, 2013.
- Tsibonis et al. (2003) Vagelis Tsibonis, Leonidas Georgiadis, and Leandros Tassiulas. Exploiting wireless channel state information for throughput maximization. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), volume 1, pages 301–310. IEEE, 2003.
- Wei et al. (2024) Honghao Wei, Xin Liu, Weina Wang, and Lei Ying. Sample efficient reinforcement learning in mixed systems through augmented samples and its applications to queueing networks. Advances in Neural Information Processing Systems, 36, 2024.
- Yang et al. (2023) Zixian Yang, R Srikant, and Lei Ying. Learning while scheduling in multi-server systems with unknown statistics: MaxWeight with discounted UCB. In International Conference on Artificial Intelligence and Statistics, pages 4275–4312. PMLR, 2023.
- Zhao et al. (2021) Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. Journal of Machine Learning Research, 22(125):1–45, 2021.
- Zheng et al. (2019) Kai Zheng, Haipeng Luo, Ilias Diakonikolas, and Liwei Wang. Equipping experts/bandits with long-term memory. Advances in Neural Information Processing Systems, 32, 2019.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
- Zou et al. (2016) Yulong Zou, Jia Zhu, Xianbin Wang, and Lajos Hanzo. A survey on wireless security: Technical challenges, recent advances, and future trends. Proceedings of the IEEE, 104(9):1727–1765, 2016.
Appendix A Additional Related Works
Adversarial Components in Network Optimization
The study of adversarial components in network optimization dates back to the 1990s, when Cruz (1991) gave the first network model with adversarial dynamics. This was further generalized to the Adversarial Queueing Theory (Borodin et al., 2001) and Leaky Bucket (Andrews et al., 2001) models. Many follow-up works, as surveyed by Cholvi and Echagüe (2007), considered network optimization under various types of adversarial traffic injections (i.e., the arrival rates to each queue are adversarial). However, these early works only considered adversarial arrival rates and assumed that link conditions are stationary, which cannot capture the fact that wireless communication networks can have very different link conditions from time to time due to congestion (Zou et al., 2016).
Noticing this shortcoming, Andrews and Zhang (2004) and Andrews and Zhang (2005) studied a single-hop network where link conditions are also adversarial. Their works were extended to multi-hop networks by Andrews et al. (2007) and Lim et al. (2013). For a more up-to-date discussion of these works, we refer readers to Liang and Modiano (2018b).
Utility Maximization in Adversarial Networks. While more and more results address network stability in adversarial networks, utility maximization guarantees remain uncommon. Neely (2010b) proposed the universal network utility maximization problem, which considers competing with a look-ahead policy that has perfect knowledge of the near future. Liang and Modiano (2018a) generalized the aforementioned stability requirement in another way and showed a trade-off between network stability and utility maximization. However, both papers assumed perfect knowledge of network conditions, as summarized in Table 1.
Feedback Models. Most previous works considered the perfect knowledge model, which assumes that network conditions are known before each decision (Liang and Modiano, 2018b; a). Despite its simplicity, this assumption removes the difficulty of estimating the network topology or link conditions, which we argue is highly non-trivial due to the unpredictability of a drastically varying network such as underwater communications (Khan et al., 2020) or IoT systems (Gaddam et al., 2020). The harder full-information feedback model assumes no prior knowledge at decision time but requires the network conditions to be fully revealed after each decision; a variant of this model was proposed by Neely et al. (2012) under the name of 2-stage decision model. In this paper, we consider the hardest bandit feedback model, which rules out all counterfactual feedback associated with actions that are not actually deployed. This model, under various names, was recently proposed and considered by Fu and Modiano (2022), Yang et al. (2023), and Huang et al. (2024).
Adversarial Networks under Bandit Feedback. As far as we are aware, the papers closest to ours are those by Huang et al. (2024) and Yang et al. (2023), which also studied adversarial networks under bandit feedback. However, both papers assumed a single-hop network model, i.e., jobs immediately leave the network upon being served. This model falls short of reflecting the reality that some jobs may be forwarded within the network for many hops, for example, in the classical criss-cross network extensively studied in the SNO literature (Harrison and Wein, 1990). In contrast, our multi-hop model allows jobs to be forwarded from one server to another and is much more general.
Moreover, both papers only considered the task of network stability, i.e., the number of jobs remaining in the network (which is the sum of queue lengths) does not diverge; see Equation 2. However, only stabilizing the system may not be enough in many realistic problems, where the throughput (average number of jobs getting served) (Tsibonis et al., 2003, Sadiq and De Veciana, 2009) or delay (average waiting time for each job from entering the system to being served) (Neely, 2008; 2009) should be optimized. In our paper, we consider the utility maximization task (Huang and Neely, 2011, Huang et al., 2012) where an abstract utility function shall be optimized, allowing various network optimization objectives other than simply stabilizing the system.
Learning-Augmented Algorithms in Network Optimization. Finally, we give a brief overview of recent learning-augmented algorithms in network optimization. To tackle the lack of accurate channel information (e.g., under the feedback model), exploration approaches like ε-greedy (Krishnasamy et al., 2018; 2021) or Upper Confidence Bound (UCB) (Auer, 2002) (e.g., (Choudhury et al., 2021, Krishnasamy et al., 2021, Yang et al., 2023)) were widely used. More recently, using a Reinforcement Learning (RL) approach, Liu et al. (2022) proposed the RL-QN algorithm, which uses the queue lengths as the RL state and outperforms many existing SNO algorithms. Empirically, utilizing the recent advances in Deep RL (DRL), Dai and Gluzman (2022) established state-of-the-art control performance in the criss-cross network. Following this line, Wei et al. (2024) compressed the number of states and yielded improved performance in other networks as well. However, most of the aforementioned works only considered the stochastic regime. Moreover, RL approaches (especially DRL ones) typically have extremely large state spaces when modelling network optimization problems, making the algorithms computationally infeasible. In contrast, our online-learning-based approach makes the algorithm not only capable of handling adversarial environments and bandit feedback but also computationally efficient.
Appendix B Omitted Proofs for Multi-Hop Network Stability Tasks
Before presenting the theorem proofs, we first give a bound on the queue length increments.
Lemma B.1 (Queue Length Increment).
For every , , and , we have
Proof.
According to the queue length dynamics in Equation 1, we know
which utilizes the assumption that and . ∎
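To make the increment bound concrete, the following minimal sketch simulates a single queue under a generic update of the form "serve what is present, then add arrivals". The update rule, the helper names `step_queue` and `simulate`, and the bounds `max_arrival` and `max_service` are illustrative assumptions and not necessarily the exact form of Equation 1; the sketch only shows that bounded arrivals and bounded service cap the per-round change of the queue length.

```python
import random


def step_queue(q, arrived, served):
    """One round of a generic queue update (assumed form, not Equation 1 verbatim)."""
    # Only jobs that are present can be served, so the queue never goes negative.
    return max(q - served, 0) + arrived


def simulate(rounds, max_arrival, max_service, seed=0):
    """Return the queue trajectory under arbitrary bounded arrivals and service."""
    rng = random.Random(seed)
    q, trajectory = 0, [0]
    for _ in range(rounds):
        arrived = rng.randint(0, max_arrival)
        served = rng.randint(0, max_service)
        q = step_queue(q, arrived, served)
        trajectory.append(q)
    return trajectory


trajectory = simulate(rounds=1000, max_arrival=3, max_service=2)
# Largest absolute change of the queue length between consecutive rounds.
largest_jump = max(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
```

In this toy model the change per round lies between minus the service bound and plus the arrival bound, so `largest_jump` can never exceed the larger of the two.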
B.1 Reference Policy Assumption (Proof of Lemma 3.1)
Lemma B.2 (Restatement of Lemma 3.1; Ability of in Stabilizing the Network).
If satisfies 1, then for any scheduler-generated queue lengths ,
Proof.
The proof closely follows that of Huang et al. (2024, Lemma 2). We adapt their proof here for completeness. For each interval in 1, let be the first round in . Then
For the first term, we have the following for every and , according to 1:
Thus, the first summation enjoys the following lower bound:
where the last step uses the fact that and the queue length increment bound in Lemma B.1.
For the second summation, again utilizing the bound on , we have
Therefore, recall the assumption that , summing over gives
thus giving our conclusion. ∎
B.2 Lyapunov Drift Analysis (Proof of Lemma 3.2)
Lemma B.3 (Restatement of Lemma 3.2; Lyapunov Drift Analysis).
Under the queue dynamics of Equation 1,
Proof.
According to Equation 1, for any round , server , and commodity , we have
Therefore, by definition of the Lyapunov function and the Lyapunov drift ,
Exchanging summations and using the bounded assumptions that and , we get
Taking expectation w.r.t. and summing up from , we have
By telescoping sums, we know
where the last step uses the fact that is the sum of squares. Therefore, our conclusion follows. ∎
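The telescoping step can be checked numerically. The sketch below assumes a quadratic Lyapunov function with a factor of one half (the normalization is an assumption) and uses a made-up trajectory of two queues; it illustrates that the drifts sum to the difference between the final and initial Lyapunov values, which is at least minus the initial value because the function is a sum of squares.

```python
def lyapunov(queues):
    """Quadratic Lyapunov function: half the sum of squared queue lengths (assumed normalization)."""
    return 0.5 * sum(q * q for q in queues)


def drifts(trajectory):
    """Per-round drift L(Q_{t+1}) - L(Q_t) along a trajectory of queue-length vectors."""
    return [lyapunov(nxt) - lyapunov(cur) for cur, nxt in zip(trajectory, trajectory[1:])]


# A toy trajectory of two queues over four rounds (made-up numbers).
trajectory = [(0, 0), (2, 1), (1, 3), (4, 2), (3, 3)]
total_drift = sum(drifts(trajectory))
```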
B.3 Guarantee of AdaPFOL Algorithm (Proof of Lemma 3.4)
Before analyzing our AdaPFOL algorithm, we first include the guarantee of the PFOL algorithm (Cutkosky, 2020) as follows. It roughly says that for bounded losses (i.e., ), there exists an algorithm that enjoys the following parameter-free (i.e., performance depending on loss magnitudes) guarantee.
Lemma B.4 (Guarantee of PFOL Algorithm (Cutkosky, 2020, Theorem 6)).
Consider the OLO problem in Definition 3.3. Suppose that has diameter and all . Then there exists an algorithm, such that for any comparator sequence ,
Due to the complexity of the original algorithm, we do not present its pseudo-code here but instead use it as a black box. Please refer to the original paper by Cutkosky (2020) for more details. Note that, although the original analysis by Cutkosky (2020) uses -norm for both comparators and loss vectors , it is straightforward to extend to a pair of dual norms, which is -norm for and -norm for in our case.
Now, we are ready to present the guarantee for our AdaPFOL algorithm:
Lemma B.5 (Restatement of Lemma 3.4; Guarantee of AdaPFOL Algorithm).
Consider the OLO problem in Definition 3.3. Let the action set have diameter . Suppose that , where is some -measurable random variable and is the natural filtration, i.e., is the -algebra generated by all random observations made during the first rounds. Then, AdaPFOL (Algorithm 2) ensures that for any comparator sequence , if , then
Proof.
As changes only if , it cannot change more than times. For a fixed , suppose that it is used for rounds ; then we must have , as otherwise a new instance of PFOL would have been launched.
Therefore, as , we have . This allows us to apply Lemma B.4 and yield
Multiplying on both sides, we have
(21) |
As all ’s form a partition of , summing up all Equation 21 gives
which utilizes the fact that at most distinct ’s can occur. ∎
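The restart argument in this proof can be illustrated with a small sketch. It assumes a doubling rule for the scale guess and substitutes a simple projected gradient learner for the base algorithm; the class names `ScaledGradientLearner` and `DoublingWrapper`, the doubling factor, and the toy data are all hypothetical and do not reproduce Algorithm 2 or the PFOL algorithm of Cutkosky (2020). The sketch only shows why the number of launched instances grows logarithmically in the largest loss-magnitude bound.

```python
import math


class ScaledGradientLearner:
    """Stand-in base learner: projected online gradient descent on [-1, 1].

    This is NOT the PFOL algorithm; it only plays the role of a black box
    that assumes loss magnitudes bounded by `scale`.
    """

    def __init__(self, scale):
        self.scale = scale
        self.x = 0.0
        self.rounds = 0

    def predict(self):
        return self.x

    def update(self, loss_coeff):
        self.rounds += 1
        step = 1.0 / (self.scale * math.sqrt(self.rounds))
        self.x = min(1.0, max(-1.0, self.x - step * loss_coeff))


class DoublingWrapper:
    """Relaunch the base learner with a larger scale whenever the guess is exceeded."""

    def __init__(self, initial_scale=1.0):
        self.scale = initial_scale
        self.learner = ScaledGradientLearner(self.scale)
        self.restarts = 0

    def prepare(self, magnitude_bound):
        """Call before each round with a bound on the upcoming loss magnitude."""
        if magnitude_bound > self.scale:
            while self.scale < magnitude_bound:
                self.scale *= 2.0  # assumed doubling rule
            self.learner = ScaledGradientLearner(self.scale)
            self.restarts += 1

    def predict(self):
        return self.learner.predict()

    def update(self, loss_coeff):
        self.learner.update(loss_coeff)


# Loss-magnitude bounds that grow over time, mimicking growing queue lengths (made-up data).
bounds = [1 + t // 10 for t in range(200)]
wrapper = DoublingWrapper(initial_scale=1.0)
for t, bound in enumerate(bounds):
    wrapper.prepare(bound)        # the bound is known before the loss is revealed
    wrapper.predict()
    sign = 1.0 if t % 2 == 0 else -1.0
    wrapper.update(sign * bound)  # a loss coefficient with magnitude at most `bound`
```

Within each launched instance the losses respect the scale that instance was created with, which is what allows a bounded-loss guarantee to be applied instance by instance.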
B.4 Deciding via AdaPFOL Algorithm (Proof of Theorem 3.5)
Theorem B.6 (Restatement of Theorem 3.5; Deciding via AdaPFOL Algorithm).
For each link , as we did in NSO, we execute an instance of AdaPFOL (Algorithm 2) where , , and . We take their outputs as for every round . Let be the number of actually transmitted jobs from to induced by .
Consider an arbitrary reference action sequence satisfying 1. Let (as and ). Then
where is the path length of .
Proof.
According to the definitions of and together with Lemma 3.4,
As , we can upper bound the RHS of the above inequality by
where is the path length of . Taking expectations gives our conclusion. ∎
B.5 Main Theorem for Multi-Hop Network Stability (Proof of Theorem 3.6)
Theorem B.7 (Restatement of Theorem 3.6; Main Theorem for Multi-Hop Network Stability).
Suppose that satisfies 1 and its path length satisfies
where and are assumed to be known constants but the precise or both remain unknown. Then, if we execute the NSO framework in Algorithm 1 with AdaPFOL defined in Algorithm 2, the following performance guarantee is enjoyed:
That is, when , we have , i.e., Equation 2 holds and the system is stable.
Proof.
We first defer some calculations into Lemma B.8, which basically combines Lemma 3.1, Lemma 3.2, and Theorem 3.5 together. Lemma B.8 says
where
In Lemma D.5, we will prove a self-bounding inequality that says, if , then . Therefore, we can apply Lemma D.5 to conclude that
Since
we know . The conclusion then follows. ∎
Lemma B.8 (Calculations when Proving Theorem 3.6).
Proof.
From the non-negativity of Lyapunov drifts in Lemma 3.2, we know
Furthermore, recall the guarantee of AdaPFOL algorithm in Theorem 3.5 that
Therefore, we are able to get
Lemma D.3 states that if , , and , then . From Lemma B.1, any single queue satisfies . Hence, applying Lemma D.3 to for every and , we have
Further noticing that
the above inequality becomes
Noticing that is concave when is large enough, Jensen's inequality then gives
Therefore, if we define the auxiliary functions and as
we are able to conclude that
as claimed. ∎
Appendix C Omitted Proofs for Multi-Hop Utility Maximization Tasks
C.1 Reference Policy Assumption (Proof of Lemma 4.1)
Lemma C.1 (Restatement of Lemma 4.1; Ability of in Stabilizing the Network).
If satisfies 2, then for any scheduler-generated queue lengths ,
Proof.
The proof of this lemma is identical to that of Lemma 3.1, except for replacing the environment-generated with the generated by the reference policy. For more details, please refer to Section B.1. ∎
C.2 Lyapunov Drift-Plus-Penalty Analysis (Proof of Lemma C.2)
Lemma C.2 (Lyapunov Drift-Plus-Penalty Analysis).
Under the queue dynamics of Equation 1,
Proof.
C.3 Guarantee of AdaBGD Algorithm (Proof of Lemma 4.3)
Lemma C.3 (Restatement of Lemma 4.3; Guarantee of AdaBGD Algorithm).
Suppose that , the -th loss is bounded by and is -Lipschitz. Suppose that and are both -measurable (where is the natural filtration), , and a.s. for all . Then for any fixed , the AdaBGD algorithm in Algorithm 4 enjoys the following guarantee:
where is the path length of the comparator sequence .
Proof.
Similar to the proof by Zhao et al. (2021, Theorem 1), let (where is the unit ball in ) and , then
Lemma C.4 (Guarantee of Projected SGD).
Suppose that is bounded by , the -th loss function is bounded by and is -Lipschitz. Further suppose that a stochastic gradient can be calculated in round such that and . Then the iteration
ensures the following dynamic regret guarantee for any fixed :
where is the path length of .
Proof.
We first consider the full-feedback model where the whole , instead of the single-entry , is available. Then the Gradient Descent algorithm enjoys the following dynamic regret guarantee (which follows the proof of Zinkevich (2003, Theorem 2)):
(22) |
where (a) uses the property that .
Moving back to the bandit-feedback model, following the proof of Zhao et al. (2021, Theorem 8), we can consider a loss function defined as . As and (which is due to the fact that ), applying Equation 22 gives
where the expectation is taken w.r.t. the randomness in the stochastic gradient , and (b) makes use of the fact that . ∎
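The passage from full feedback to bandit feedback relies on building a stochastic gradient from a single loss value. A minimal sketch of the standard one-point estimator from the bandit convex optimization literature is given below; whether AdaBGD uses exactly this estimator, the constant step size, and all function names are assumptions made for illustration, and the queue-dependent learning rates of the paper are not modelled.

```python
import math
import random


def unit_vector(rng, dim):
    """Sample a direction uniformly from the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def one_point_gradient(loss, x, delta, rng):
    """Estimate the gradient of the delta-smoothed loss from a single loss query."""
    dim = len(x)
    u = unit_vector(rng, dim)
    query = [xi + delta * ui for xi, ui in zip(x, u)]
    value = loss(query)  # the only feedback used: loss of the point actually played
    return [dim / delta * value * ui for ui in u]


def project_to_ball(x, radius):
    """Euclidean projection onto the origin-centered ball of the given radius."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm <= radius:
        return list(x)
    return [c * radius / norm for c in x]


def bandit_gradient_descent(losses, dim, radius, delta, step, seed=0):
    """Projected descent on a shrunk ball so that every perturbed query stays feasible."""
    rng = random.Random(seed)
    x = [0.0] * dim
    iterates = []
    for loss in losses:
        iterates.append(list(x))
        g = one_point_gradient(loss, x, delta, rng)
        x = project_to_ball([xi - step * gi for xi, gi in zip(x, g)], radius - delta)
    return iterates


# Toy run: 200 rounds of a fixed quadratic loss (made-up target).
targets = [[0.3, -0.3]] * 200
losses = [lambda z, c=c: sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in targets]
iterates = bandit_gradient_descent(losses, dim=2, radius=1.0, delta=0.1, step=0.01)
```

For a linear loss the smoothed loss coincides with the loss itself, so the estimator is unbiased for the true gradient; for general losses it is unbiased only for the gradient of the smoothed surrogate.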
C.4 Deciding via AdaBGD Algorithm (Proof of Theorem 4.4)
Theorem C.5 (Restatement of Theorem 4.4; Deciding via AdaBGD Algorithm).
For the reference arrival rates defined in 2, suppose that its path length ensures
where, similar to Theorem 3.6, and are assumed to be known constants but the precise or both remain unknown. Suppose that the action set is bounded by (i.e., ). If we execute AdaBGD (Algorithm 4) over with loss functions and parameters defined as
(23) |
its outputs ensure
Proof.
For loss function , it is bounded by and is -Lipschitz. As is revealed after the -th round, and are both -measurable. As sketched in the main text, we first regard as a constant and tune to minimize the summation term in Lemma 4.3.
Let and . Suppose that all conditions in Lemma 4.3 hold, then
(24) |
We first only keep the last term in Equation 23, i.e., let
(25) |
then we have . Let’s first pretend the other condition of from Lemma 4.3 also holds at this moment. Lemma D.1 reveals that if , then . Plugging in ,
where the last step utilizes .
This almost recovers our conclusion, so it only remains to ensure , which is equivalent to , i.e.,
Consider adding a term into the denominator of Equation 25. As , . So we only need to show
We decompose into and and use them to cancel the two terms on the RHS, respectively. That is, we want
We first craft . In Lemma D.2, we show that if , , and , then . Thus, using the fact that and the queue length increment bound (Lemma B.1), we can let and lower bound the LHS by
where the inequality results from AM-GM inequality . Therefore, we only need to ensure . Setting as follows then suffices.
For , we only need to set . Plugging back , we get the learning rate scheduling defined in Equation 23. Now we verify that the two terms on the RHS of Equation 24 do not increase too much due to . For the first term,
For the second term, as is strictly smaller than that in Equation 25, we can again apply Lemma D.1 (if , then ) to conclude
Summing up two parts gives our conclusion. ∎
C.5 Main Theorem for Multi-Hop Utility Maximization (Proof of Theorem 4.5)
Theorem C.6 (Restatement of Theorem 4.5; Main Theorem for Multi-Hop Utility Maximization).
Suppose that the feasible set of arrival rates vector is bounded by . Assume all (unknown) utility functions to be concave, -Lipschitz, and -bounded. Consider a reference action sequence satisfying 2, such that their path lengths satisfy
Here, are assumed to be known constants, whereas the specific remains unknown. If we execute the UMO2 framework in Algorithm 3 with the AdaPFOL sub-routine given in Algorithm 2 and the AdaBGD sub-routine given in Algorithm 4, when is large enough such that , the following inequalities hold simultaneously:
That is, when , our algorithm not only stabilizes the system so that , but also enjoys an average utility approaching that of the reference policy polynomially fast, i.e., – the utility maximization objective Equation 3 is ensured.
Proof.
As sketched in the main text, the first step is to i) combine algorithmic guarantees for AdaPFOL (Theorem 3.5) and AdaBGD (Theorem 4.4), ii) plug in the network stability assumption Lemma 4.1, and iii) make use of the Lyapunov DPP analysis in Lemma C.2. Deferring these calculations to Lemma C.7, we can get
(26) |
where
Step 1 (Develop a Coarse Average Queue Length Bound).
Recall the assumption that is uniformly bounded by . Therefore, the first term on the RHS of Equation 26 is bounded by in absolute value. In Lemma D.6, we develop a self-bounding property that says, if , then . Therefore, applying it to Equation 26, we have
(27) |
As mentioned in the proof sketch of this theorem, this only gives a bound on the average queue length, which violates the system stability condition Equation 2. However, this inequality can be used to derive the polynomial convergence result on the utility, which in turn refines the average queue length bound.
Step 2 (Yield Polynomial Convergence on the Utility).
Moving the difference in the average utility in Equation 26 to the LHS, we have
Plugging in the just-derived bound on average queue length, namely Equation 27, we have
According to the assumption that , the second term on the RHS is of order . Therefore, we have
(28) |
The second conclusion of this theorem follows.
Step 3 (Refine the Average Queue Length Bound).
Now we are ready to refine our average queue length bound using Equation 28. Instead of controlling the utility with the uniform boundedness assumption that , we utilize the just-derived convergence result Equation 28.
Specifically, again applying the self-bounding property in Lemma D.6 to Equation 26 but instead replacing the first term on the RHS with Equation 28, we get
which gives our first conclusion as well. ∎
Lemma C.7 (Calculations when Proving Theorem 4.5).
Proof.
From the network stability assumption, we derived in Lemma 4.1 that
Recall the AdaPFOL guarantee in Theorem 3.5 that
and the Bandit Convex Optimization guarantee in Theorem 4.4 that
we therefore have
Further plugging in the Lyapunov DPP calculation in Equation 15 (which controls the three terms outside on the RHS), we have
For notational simplicity, we can abbreviate this inequality as
where
To handle the -related term, we use the same argument as in Section 3.5: Lemma D.3 states that if , , and , then . From Lemma B.1, any single queue satisfies . Hence, applying Lemma D.3 to for every and , we have
Further noticing that
the -related term then becomes
Noticing that is concave when is large enough, Jensen's inequality then gives
Moreover, we handle the -related term using Lemma D.4, a variant of Lemma D.3 which states that if , , and , then . Hence, still applying it to for every and ,
Therefore, we have
Setting , , and gives our conclusion. ∎
Appendix D Auxiliary Lemmas
The first lemma extends the famous summation lemma (Auer et al., 2002).
Lemma D.1.
For non-negative real numbers , we have
Proof.
We prove the claim by induction. The case when is obvious. Suppose that the conclusion holds for , then consider some :
so it suffices to prove . Notice that
where the last inequality uses (follows from ). Hence, due to the fact that
the conclusion holds for as well. ∎
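For orientation, the classical summation lemma that Lemma D.1 extends states that, for non-negative numbers, the sum of each term divided by the square root of its prefix sum is at most twice the square root of the total. The sketch below checks this classical form numerically on random inputs; it does not implement the extended statement of Lemma D.1, and the helper name `summation_lemma_sides` is hypothetical.

```python
import math
import random


def summation_lemma_sides(values):
    """Return (lhs, rhs) of sum_t a_t / sqrt(a_1 + ... + a_t) <= 2 * sqrt(a_1 + ... + a_T)."""
    lhs, prefix = 0.0, 0.0
    for a in values:
        prefix += a
        if prefix > 0:  # skip leading zeros to avoid dividing by zero
            lhs += a / math.sqrt(prefix)
    return lhs, 2.0 * math.sqrt(prefix)


rng = random.Random(0)
checks = [
    summation_lemma_sides([rng.uniform(0.0, 10.0) for _ in range(500)])
    for _ in range(20)
]
```

Each term satisfies the per-step inequality a_t / sqrt(S_t) <= 2 (sqrt(S_t) - sqrt(S_{t-1})), where S_t denotes the prefix sum, and summing these inequalities telescopes to the stated bound.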
Lemma D.2.
Suppose that , , and , , then
Proof.
As adjacent ’s differ by no more than , and . Therefore,
where the last step uses . As
we have . If , then , giving the conclusion. Otherwise, i.e., , then we naturally have , so our conclusion still follows. ∎
Lemma D.3 ((Huang et al., 2024, Lemma 4)).
If , , and , , then
Lemma D.4.
If , , and , , then
Proof.
The following two lemmas are similar to Lemma 5 of Huang et al. (2024).
Lemma D.5.
If and , then
Proof.
Let . Notice that
As , we have . Applying this relationship to the second log on the RHS of the inequality above, we yield
On the other hand, from the conditions, we know . Hence, if we can prove that is monotone (at least for a range of ), we can conclude that , thus giving our conclusion. To establish the monotonicity of , we calculate its derivative as
Thus, if we only consider the case where , holds when . Denoting the larger root of as (in case it has no root, let ), we know is increasing in . Further observing that indeed satisfies , we know .
On the other hand, from the assumption that , we know . Our conclusion follows from discussing the relationship between and : If , then we immediately have
Otherwise, according to the monotonicity of when , we still have
as claimed. ∎
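Self-bounding arguments of this kind turn an implicit inequality, where the unknown appears on both sides, into an explicit bound. As a simpler analogue (not the statement of Lemma D.5 or Lemma D.6, which involve logarithmic terms), the sketch below checks the elementary fact that y <= a + b * sqrt(y) with non-negative a and b implies y <= 2a + b^2. The function names are hypothetical.

```python
import math


def largest_self_bounded(a, b):
    """Largest y >= 0 with y <= a + b * sqrt(y), from the quadratic in sqrt(y)."""
    root = (b + math.sqrt(b * b + 4.0 * a)) / 2.0
    return root * root


def explicit_bound(a, b):
    """Closed-form consequence: y <= a + b * sqrt(y) implies y <= 2a + b^2."""
    return 2.0 * a + b * b


grid = [(a, b) for a in (0.0, 0.5, 3.0, 100.0) for b in (0.0, 1.0, 7.0, 50.0)]
# Slack between the explicit bound and the largest admissible y; never negative.
gaps = [explicit_bound(a, b) - largest_self_bounded(a, b) for a, b in grid]
```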
Lemma D.6.
If and , then
Proof.
Let . Notice that
As , we have . Applying it to the term on the RHS, we have
On the other hand, from the conditions we know . Hence, we again want to prove the monotonicity of , which gives the conclusion that .
Calculating the derivative of , we have
Thus, if we only consider the case where , holds when . As is convex and is concave, there are at most two intersections. Hence, again denoting the larger root of as (in case there is no root, let ), is monotonic in .
As , always holds. The conclusion follows by discussing the relationship of and : If , we directly have
Otherwise, we can still conclude
from the monotonicity of when . ∎