Asynchronous Distributed Learning
with Quantized Finite-Time Coordination

Nicola Bastianello, Apostolos I. Rikos, Karl H. Johansson

This work was partially supported by the European Union’s Horizon Research and Innovation Actions programme under grant agreement No. 101070162, and partially by Swedish Research Council Distinguished Professor Grant 2017-01078 and Knut and Alice Wallenberg Foundation Wallenberg Scholar Grant. N. Bastianello and K. H. Johansson are with the School of Electrical Engineering and Computer Science, and Digital Futures, KTH Royal Institute of Technology, Sweden, {nicolba | kallej}@kth.se. Apostolos I. Rikos is with the Artificial Intelligence Thrust of the Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. He is also affiliated with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China. E-mail: [email protected].

Abstract

In this paper we address distributed learning problems over peer-to-peer networks. In particular, we focus on the challenges of quantized communications, asynchrony, and stochastic gradients that arise in this set-up. We first discuss how to turn the presence of quantized communications into an advantage, by resorting to a finite-time, quantized coordination scheme. This scheme is combined with a distributed gradient descent method to derive the proposed algorithm. Secondly, we show how this algorithm can be adapted to allow asynchronous operations of the agents, as well as the use of stochastic gradients. Finally, we propose a variant of the algorithm which employs zooming-in quantization. We analyze the convergence of the proposed methods and compare them to state-of-the-art alternatives.

I Introduction

In recent years, multi-agent systems have become ubiquitous in a broad range of applications, e.g. robotics, power grids, traffic networks [1, 2]. A multi-agent system consists of autonomous agents with communication and computation capabilities, cooperating to accomplish a specific goal, e.g. learning, decision-making, navigation. In this paper, we will focus on developing algorithms to enable decentralized learning. In decentralized learning, the agents in the system collect and locally store data, with the goal being to collectively train a model without sharing these raw data [3]. To enable this objective, the design of distributed learning (or optimization) algorithms has been extensively studied in the past decades [2, 4]. In particular, different classes of algorithms have been proposed, with the main ones being gradient methods (e.g. DGD), gradient tracking, and dual methods (e.g. ADMM) [5]. In this paper, we will focus on gradient-based methods.

Learning over a multi-agent system, however, presents a number of practical challenges, with communication constraints being a central one. These constraints may arise due to the reliance on communication channels with limited bandwidth (e.g. wireless) [6], or the necessity to share high-dimensional models (e.g. neural networks) [7]. A common solution to reduce the communication burden is the use of quantization, which however may result in lower accuracy of the trained model. In this paper we address distributed learning with quantized communication, and aim to show how to turn quantization from a design constraint into an opportunity. The central idea is that agents employing quantized communications can reach an inexact consensus in finite time. Thus, in this paper we combine a Finite-Time, Quantized Coordination (FTQC) scheme with gradient descent, to design efficient learning algorithms that only require quantized communications. In particular, the FTQC scheme we analyze is based on the consensus ADMM [8], unlike previous alternatives [9, 10].

Besides limited communications, in this paper we address two additional challenges that arise in distributed learning: asynchrony [11] and stochastic gradients [12]. Indeed, the cooperating agents may have access to different, and limited, hardware resources. On the one hand, different resources result in the agents having different computation speeds, which makes asynchronous completion of local training steps inevitable. In this paper, therefore, we follow the literature [13, 14, 15] in designing a gradient-based learning algorithm that enables asynchronous local training. On the other hand, limited hardware resources mean that agents may find the computation of local gradients prohibitive (e.g., due to potentially lengthy computation times or memory constraints). For this reason, the agents may resort to computing inexact stochastic gradients [12, 16], by only using a subset of the available data. In the following we design an algorithm that relies on stochastic gradients.

The main contributions of the paper are as follows:

  1. We propose an asynchronous distributed learning algorithm which relies on finite-time, quantized coordination. The novel FTQC scheme we propose is based on consensus ADMM, and the algorithm allows for the use of stochastic gradients.

  2. We further propose an alternative version of the algorithm which employs zooming-in quantization, which progressively reduces the loss of accuracy due to quantized communications.

  3. We analyze the convergence of the proposed FTQC scheme, and of the complete algorithm, highlighting the effect of (i) quantization, (ii) stochastic gradients, (iii) asynchrony. We further analyze the effect of zooming-in quantization on the convergence.

  4. We conclude with numerical results comparing the proposed FTQC scheme and algorithms against state-of-the-art alternatives.

II Problem Formulation

Given the undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $N$ agents, our goal is to solve the distributed optimization problem

$$\min_{\mathbf{x}\in\mathbb{R}^{nN}} \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.}\ \mathbf{x}\in\mathcal{C}, \tag{1}$$

where $\mathbf{x}=[x_1^\top,\ldots,x_N^\top]^\top$, $f_i:\mathbb{R}^n\to\mathbb{R}$ are local costs, each privately held by one of the agents, and we define the consensus set $\mathcal{C}=\{\mathbf{x}\in\mathbb{R}^{nN}\ |\ x_i=x_j\ \forall i,j\in\mathcal{V}\}.$ In the following we are interested in the finite-sum local costs that arise in learning applications, hence we assume that

$$f_i(x)=\frac{1}{m_i}\sum_{h=1}^{m_i}\ell(x;d_i^h) \tag{2}$$

where $\ell:\mathbb{R}^n\to\mathbb{R}$ is a loss function (e.g. logistic) and $\{d_i^h\}_{h=1}^{m_i}$ are the data points stored by agent $i$ (e.g. pairs of label and feature vector).
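As an illustration of the finite-sum structure in (2), the following minimal sketch evaluates a local cost and its full gradient for the logistic loss. The data layout (a list of `(features, label)` pairs with labels in {-1, +1}) and the function names are assumptions made for this example, not notation from the paper.

```python
import numpy as np

def logistic_loss(x, d):
    """Logistic loss for one data point d = (feature vector a, label b in {-1, +1})."""
    a, b = d
    return np.logaddexp(0.0, -b * (a @ x))  # log(1 + exp(-b * a^T x)), computed stably

def local_cost(x, data):
    """f_i(x) = (1 / m_i) * sum_h loss(x; d_i^h), as in (2)."""
    return sum(logistic_loss(x, d) for d in data) / len(data)

def local_gradient(x, data):
    """Full gradient of the local cost: one pass over all m_i data points."""
    g = np.zeros_like(x, dtype=float)
    for a, b in data:
        g += -b * a / (1.0 + np.exp(b * (a @ x)))
    return g / len(data)
```

Note that the cost of `local_gradient` grows linearly with the number of local data points, which is what motivates the stochastic gradients discussed later in this section.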

The following assumptions will hold throughout the paper.

Assumption 1 (Network)

The graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is undirected and connected.

Assumption 2 (Costs)

The local cost $f_i:\mathbb{R}^n\to\mathbb{R}$ is $\underline{\lambda}$-strongly convex and $\bar{\lambda}$-smooth for each agent $i\in\mathcal{V}$.

By Assumption 1, the graph is connected, which ensures that problem (1) can be solved in a distributed fashion. Moreover, Assumption 2 implies that there is a unique solution to the problem, which we can write as

$$\mathbf{x}^*=\mathbf{1}_N\otimes x^*,\quad x^*=\operatorname*{arg\,min}_{x\in\mathbb{R}^n}\sum_{i=1}^{N}f_i(x).$$

We are now ready to discuss the objectives that will guide the algorithm design in Section III:

  1. Quantized communications: learning problems are high-dimensional, as the model being trained may have a large number of parameters $n\gg 1$ [17, 18]. However, distributed learning requires the agents to share their local models, which may cause a large communication overhead. The idea is to design an algorithm that uses quantized/compressed communications [19].

  2. Stochastic gradients: in order to train an accurate model, the local costs (2) of the learning problem are often defined over a large data-set, with $m_i\gg 1$ [17, 18]. However, computing the gradients of such costs may be excessively time-consuming. Hence, we are interested in designing an algorithm that uses less computationally expensive gradients, called stochastic gradients.

  3. Asynchrony: synchronizing all agents in the network $\mathcal{G}$ may not be feasible, especially when $N\gg 1$ [11]. The goal is to design an algorithm that allows the agents to perform computations asynchronously.
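To make the second objective concrete, a stochastic gradient can be obtained by averaging the per-sample gradients over a random mini-batch rather than over all local data points. The sketch below does this for the logistic loss; the array layout (`features` of shape `(m_i, n)`, `labels` in {-1, +1}) and the uniform sampling without replacement are assumptions of this example, since the text does not prescribe a specific sampling scheme.

```python
import numpy as np

def stochastic_gradient(x, features, labels, batch_size, rng):
    """Mini-batch estimate of the gradient of the logistic finite-sum cost (2).

    Only `batch_size` of the m_i local data points are touched per call.
    """
    m = features.shape[0]
    idx = rng.choice(m, size=batch_size, replace=False)  # sample a mini-batch
    a, b = features[idx], labels[idx]
    coeff = -b / (1.0 + np.exp(b * (a @ x)))             # per-sample scalar factors
    return (coeff[:, None] * a).mean(axis=0)
```

With `batch_size` equal to the number of local data points, the estimate coincides with the full gradient; smaller batches trade accuracy for computation time.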

III Algorithm Design

In this section we design the proposed distributed learning algorithm tailored to the objectives detailed in Section II. Conceptually, one could think of solving problem (1) by applying the projected gradient method [20] characterized by

$$\mathbf{x}_{k+1}=\operatorname{proj}_{\mathcal{C}}\left(\mathbf{x}_k-\alpha\nabla f(\mathbf{x}_k)\right),\quad k\in\mathbb{N}, \tag{3}$$

where $\nabla f(\mathbf{x})=[\nabla f_1(x_1)^\top,\ldots,\nabla f_N(x_N)^\top]^\top$ collects the local gradients, and the projection onto the consensus space is $\operatorname{proj}_{\mathcal{C}}(\mathbf{x})=\frac{1}{N}\sum_{i=1}^{N}x_i.$
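The idealized iteration (3) can be sketched in a few lines. In this example the stacked vector is stored as an array of shape `(N, n)` with one row per agent, so the projection replaces every row with the average of all rows; the function names and the use of a list of gradient callables are assumptions of this sketch. The averaging step is exactly the part that requires global information.

```python
import numpy as np

def proj_consensus(x):
    """Projection onto the consensus set: every agent's row becomes the average row."""
    return np.tile(x.mean(axis=0), (x.shape[0], 1))

def projected_gradient(grads, x0, alpha, num_iter):
    """Iteration (3): local gradient steps followed by exact averaging.

    grads[i] is a callable returning the gradient of f_i at a point in R^n.
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iter):
        y = x - alpha * np.stack([g(xi) for g, xi in zip(grads, x)])
        x = proj_consensus(y)  # needs information from all agents at once
    return x
```

For instance, with quadratic costs $f_i(x)=\frac{1}{2}\lVert x-c_i\rVert^2$ (a hypothetical test problem), the iterates converge to the average of the $c_i$.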

Clearly, the computation of $\operatorname{proj}_{\mathcal{C}}(\mathbf{x})$ cannot be performed in a distributed fashion, except with specific architectures such as federated learning [18]. The objective therefore is to propose a distributed (and approximate) implementation of the consensus projection. Different techniques have been explored to this end, foremost among which is average consensus. In particular, we can replace $\operatorname{proj}_{\mathcal{C}}(\mathbf{x})$ with one or more consensus steps, giving rise to Near-DGD [21]

$$\mathbf{x}_{k+1}=\mathbf{W}^t\left(\mathbf{x}_k-\alpha\nabla f(\mathbf{x}_k)\right),\quad k\in\mathbb{N},\ t\geq 1, \tag{4}$$

where $\mathbf{W}$ is a symmetric, doubly stochastic matrix. Alternatively, the average consensus can be replaced with dynamic average consensus, which gives rise to gradient tracking algorithms [12].
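A minimal sketch of one Near-DGD step (4) follows. The ring topology with uniform weights 1/3 is a hypothetical choice made only to obtain a symmetric, doubly stochastic matrix (it requires at least three agents); the states are stored as an `(N, n)` array, so multiplying by the `N x N` matrix mixes the agents' rows.

```python
import numpy as np

def ring_weights(N):
    """Symmetric, doubly stochastic weights on a ring of N >= 3 agents (assumed topology)."""
    W = np.zeros((N, N))
    for i in range(N):
        W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1.0 / 3.0
    return W

def near_dgd_step(x, grad, W, alpha, t):
    """One iteration of (4): local gradient step, then t rounds of average consensus."""
    y = x - alpha * grad(x)                    # grad maps (N, n) -> (N, n), row by row
    return np.linalg.matrix_power(W, t) @ y    # t consensus rounds
```

Each consensus round preserves the network-wide average, and increasing `t` brings the result closer to the exact projection at the price of more communication rounds.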

In this paper we take a different approach by using, similarly to [9, 10], a finite-time, quantized coordination (FTQC) scheme to approximate $\operatorname{proj}_{\mathcal{C}}(\mathbf{x})$. Indeed, as discussed in Section II, in learning applications we may need to use quantized communications, and the idea is to use this fact to our advantage.

III-A Finite-time, quantized coordination

The main insight guiding our design is that specific consensus schemes achieve convergence in finite time when the communications are quantized. Employing such a scheme therefore allows the agents to approximate $\operatorname{proj}_{\mathcal{C}}(\mathbf{x})$ in a finite number of iterations. The algorithm proposed in [22], for example, is specifically tailored to achieve this goal. However, we explore a different strategy by showing how the consensus ADMM [8] satisfies the requirements of an FTQC scheme.

Let $\{y_i\}_{i\in\mathcal{V}}$ be local states that need to be averaged. We can formulate this as the distributed optimization problem

$$\min_{\mathbf{x}\in\mathbb{R}^{nN}}\frac{1}{2}\sum_{i=1}^{N}\left\lVert x_i-y_i\right\rVert^2\quad\text{s.t.}\ \mathbf{x}\in\mathcal{C}, \tag{5}$$

to which we apply the distributed ADMM [8], yielding Algorithm 1. (To be precise, Algorithm 1 is derived from [8] by setting $\alpha=0.5$, and excluding the termination step, discussed in the following.)

Algorithm 1 Finite-time quantized coordination (FTQC)
1: Input: the states to be averaged $\{y_i\}_{i\in\mathcal{V}}$, $z_{ij}^0=0$ for all $i\in\mathcal{V}$, $j\in\mathcal{N}_i$, penalty $\rho>0$, quantizer $q(\cdot)$, termination threshold $\theta>0$.
2: // initialization
3: each agent $i\in\mathcal{V}$ picks $z_{ij}^0$ for all $j\in\mathcal{N}_i$
4: for $\ell=0,1,2,\ldots$ do
5:     // local update and transmission
6:     if agent $i$ is active then
7:         computes $w_i^\ell=\frac{1}{1+\rho|\mathcal{N}_i|}\left(y_i+\sum_{j\in\mathcal{N}_i}z_{ij}^\ell\right)$
8:         and transmits $t_{i\to j}=q\left(-z_{ij}^\ell+2\rho w_i^\ell\right)$ to each neighbor $j\in\mathcal{N}_i$
9:     end if
10:     // auxiliary update
11:     if agent $i$ is active and receives $t_{j\to i}$ then
12:         computes $z_{ij}^{\ell+1}=\frac{1}{2}\left(z_{ij}^\ell+t_{j\to i}\right)$
13:     end if
14:     // termination
15:     if $\left\lVert z_{ij}^{\ell+1}-z_{ij}^\ell\right\rVert\leq\theta$ for all $j\in\mathcal{N}_i$ then
16:         agent $i$ terminates
17:     end if
18: end for
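The updates of Algorithm 1 can be sketched as follows. This is a simplified, synchronous version written for illustration: all agents are assumed active at every iteration, no message is lost, and the loop stops once every agent meets the termination test simultaneously. The function names, the adjacency-list argument `neighbors` (which must be symmetric), and the default `theta = delta` are assumptions of this sketch.

```python
import numpy as np

def quantize(v, delta):
    """Uniform quantizer: round every entry to the nearest multiple of delta."""
    return delta * np.round(v / delta)

def ftqc(y, neighbors, rho=1.0, delta=1e-2, theta=None, max_iter=10_000, q=None):
    """Synchronous sketch of Algorithm 1; returns the local averages w and iterations used."""
    y = np.asarray(y, dtype=float)              # shape (N, n), one row per agent
    N, n = y.shape
    theta = delta if theta is None else theta   # threshold theta = c * delta with c = 1
    q = (lambda v: quantize(v, delta)) if q is None else q
    # one auxiliary variable z_ij per ordered pair of neighbors, initialized at zero
    z = {(i, j): np.zeros(n) for i in range(N) for j in neighbors[i]}
    w, it = y.copy(), 0
    for it in range(1, max_iter + 1):
        # local update
        w = np.stack([(y[i] + sum(z[i, j] for j in neighbors[i]))
                      / (1.0 + rho * len(neighbors[i])) for i in range(N)])
        # quantized transmissions t_{i -> j}
        t = {(i, j): q(-z[i, j] + 2.0 * rho * w[i]) for (i, j) in z}
        # auxiliary update: agent i uses the message received from j
        z_new = {(i, j): 0.5 * (z[i, j] + t[j, i]) for (i, j) in z}
        done = all(np.linalg.norm(z_new[e] - z[e]) <= theta for e in z)
        z = z_new
        if done:
            break
    return w, it
```

Passing the identity map as `q` recovers the unquantized consensus ADMM, which is convenient for checking that the fixed point is the exact average.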

The following lemma shows how Algorithm 1 can indeed serve as an FTQC scheme. The proof is reported in Appendix A.

Lemma 1 (Consensus ADMM as FTQC scheme)

Let $\{w_i^\ell\}_{\ell\in\mathbb{N}}$ be the trajectory generated by Algorithm 1 applied to average $\{y_i\}_{i\in\mathcal{V}}$, with a given (auxiliary) initial condition $\{z_{ij}^0\}_{i\in\mathcal{V},\ j\in\mathcal{N}_i}$, and penalty $\rho>0$. Assume that communications are quantized according to

$$q(x)=\Delta\left\lfloor\frac{x}{\Delta}\right\rceil,\ \Delta>0 \tag{6}$$

where $\left\lfloor\cdot\right\rceil$ rounds to the nearest integer. Then there exist $\mu\in(0,1)$ and $C>0$ such that for each $i\in\mathcal{V}$

$$\left\lVert w_i^\ell-\frac{1}{N}\sum_{i\in\mathcal{V}}y_i\right\rVert\leq C\left(\mu^\ell d(z_0)+\frac{\Delta}{2}\sqrt{n\sum_{i\in\mathcal{V}}|\mathcal{N}_i|}\,\frac{1-\mu^\ell}{1-\mu}\right)$$

with $d(z_0)$ being a function of the initial conditions. Additionally, convergence is achieved after a finite number of iterations; that is, for $\ell\geq\bar{\ell}$ we have $w_i^\ell=w_i^{\ell+1}$, with

$$\bar{\ell}\geq\left|\frac{1}{\log(\mu)}\log\left(\frac{\frac{\Delta}{2}\sqrt{n\sum_{i\in\mathcal{V}}|\mathcal{N}_i|}}{(1-\mu)d(z_0)}\right)\right|.$$

We can now use the finite-time convergence result of Lemma 1 to design the termination technique in Algorithm 1. The idea is for each agent $i\in\mathcal{V}$ to detect when the norm of the difference $z_{ij}^{\ell+1}-z_{ij}^\ell$ is below a threshold $\theta$ for all its neighbors, indicating that these values have stopped changing significantly. In practice, we can choose $\theta=c\Delta$ for some $c\geq 1$. Notice that the agents do not need to know $\bar{\ell}$ to apply the termination.

Remark 1 (Speed and accuracy trade-off)

Lemma 1 shows that the smaller the quantization level $\Delta$ is, the smaller the consensus error. On the other hand, smaller values of $\Delta$ imply that a larger number of iterations is required to reach convergence, thus presenting a trade-off between speed and accuracy.
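The trade-off can be made tangible by evaluating the two expressions of Lemma 1 for a few quantization levels. The sketch below computes the limit of the quantization term in the error bound (without the unknown constant $C$) and the right-hand side of the bound on $\bar{\ell}$. The numerical values used for $\mu$, $d(z_0)$, $n$ and $\sum_{i\in\mathcal{V}}|\mathcal{N}_i|$ in any call are hypothetical placeholders, since the lemma only guarantees their existence.

```python
import math

def error_floor(delta, n, degree_sum, mu):
    """Limit for large l of the quantization term in Lemma 1 (constant C omitted)."""
    return 0.5 * delta * math.sqrt(n * degree_sum) / (1.0 - mu)

def iteration_bound(delta, n, degree_sum, mu, d0):
    """Right-hand side of the bound on l-bar in Lemma 1."""
    ratio = 0.5 * delta * math.sqrt(n * degree_sum) / ((1.0 - mu) * d0)
    return abs(math.log(ratio) / math.log(mu))

if __name__ == "__main__":
    # hypothetical values: n = 10, sum of degrees = 20, mu = 0.9, d(z_0) = 100
    for delta in (1e-1, 1e-2, 1e-3):
        print(delta, error_floor(delta, 10, 20, 0.9),
              iteration_bound(delta, 10, 20, 0.9, 100.0))
```

In the regime where the ratio inside the logarithm is below one, shrinking `delta` lowers the error floor linearly while the iteration bound grows only logarithmically.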

Remark 2 (Why choose ADMM?)

Why choose consensus ADMM as an FTQC scheme, as opposed to the average consensus of Near-DGD, or the FTQC scheme of [22]? The answer is that ADMM has been proved to be robust to many different challenges, ranging from asynchrony and packet losses [8], to quantization and other additive errors [23]. Alternative schemes instead lack such theoretical robustness guarantees.

Remark 3 (Extensions of Algorithm 1)

Besides allowing for asynchronous activations and packet losses (cf. Remark 2), we can further modify Algorithm 1 to allow the agents to use different quantizers. Indeed, Lemma 1 would apply equally, but replacing $\Delta$ with the maximum of the local quantization levels $\Delta_i$.

III-B Algorithm

The proposed Algorithm 2 is based on the projected gradient descent (3), where the projection is approximated with the finite-time, quantized coordination scheme discussed in Section III-A above.

In particular, the agents first apply a local gradient step (the local update of Algorithm 2). They then apply Algorithm 1 to the result of the local update ($\mathbf{y}_k$), and update their local states $\mathbf{x}_k$ with the result. Notice that the algorithm allows for asynchronous operations: the local update is performed only by active agents, while inactive ones do not update their $y_{i,k}$’s. Additionally, the agents may use stochastic gradients $\hat{\nabla}f_i$ instead of the full gradients.

Algorithm 2 Proposed algorithm
1: Input: for each agent $i\in\mathcal{V}$ initialize $x_{i,0}$; choose the step-size $\alpha<2/\bar{\lambda}$.
2: for $k=0,1,\ldots$ each agent $i$ do
3:     // local update
4:     if agent $i$ active then
5:         apply the local (possibly inexact) gradient step
           $y_{i,k}=x_{i,k}-\alpha\hat{\nabla}f_i(x_{i,k})$
6:     else $y_{i,k}=y_{i,k-1}$
7:     end if
8:     // coordination
9:     apply finite-time, quantized coordination
       $\mathbf{x}_{k+1}=\text{Algorithm 1}(\mathbf{y}_k)$
       with $\mathbf{y}_k=[y_{1,k}^\top,\ldots,y_{N,k}^\top]^\top$
10: end for
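The structure of Algorithm 2 can be sketched end to end as below. Several simplifications are assumptions of this example rather than choices made in the paper: the coordination step is a compact synchronous version of Algorithm 1 with $\theta=\Delta$, agents are activated independently with probability `p_active`, inactive agents at $k=0$ reuse $y_{i,-1}=x_{i,0}$, and stochastic gradients are emulated by adding Gaussian noise to the exact gradient.

```python
import numpy as np

def ftqc(y, neighbors, rho, delta, max_iter=1000):
    """Compact synchronous version of Algorithm 1 (all agents active, no packet loss)."""
    N, n = y.shape
    z = {(i, j): np.zeros(n) for i in range(N) for j in neighbors[i]}
    w = y.copy()
    for _ in range(max_iter):
        w = np.stack([(y[i] + sum(z[i, j] for j in neighbors[i]))
                      / (1.0 + rho * len(neighbors[i])) for i in range(N)])
        t = {(i, j): delta * np.round((-z[i, j] + 2.0 * rho * w[i]) / delta)
             for (i, j) in z}
        z_new = {(i, j): 0.5 * (z[i, j] + t[j, i]) for (i, j) in z}
        done = all(np.linalg.norm(z_new[e] - z[e]) <= delta for e in z)
        z = z_new
        if done:
            break
    return w

def run_algorithm2(grads, x0, neighbors, alpha, num_iter, rho=1.0, delta=1e-2,
                   p_active=1.0, noise_std=0.0, seed=0):
    """Sketch of Algorithm 2: asynchronous (possibly inexact) gradient steps + FTQC."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    y = x.copy()                                   # assumption: y_{i,-1} = x_{i,0}
    for _ in range(num_iter):
        active = rng.random(len(x)) < p_active     # random activations (assumed model)
        for i in np.flatnonzero(active):
            g = grads[i](x[i]) + noise_std * rng.standard_normal(x[i].shape)
            y[i] = x[i] - alpha * g                # local gradient step
        x = ftqc(y, neighbors, rho, delta)         # coordination
    return x
```

With `p_active=1.0` and `noise_std=0.0` the sketch reduces to the synchronous, exact-gradient case, which is the easiest setting in which to compare the output with the known minimizer of a test problem.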

III-C Zooming-in quantization

By the discussion in Remark 1, the quantization level $\Delta$ mediates the trade-off between the speed of convergence of Algorithm 1 and its consensus error. The idea then is to exploit this trade-off to improve the performance of Algorithm 2 by changing $\Delta$ over time.

By Remark 3 we know that the agents can have uncoordinated quantizers, i.e. $q_i(x)=\Delta_i\left\lfloor x/\Delta_i\right\rceil$. Each agent then is allowed to modify its quantizer whenever necessary. Algorithm 3 reports a prototype of how this can be implemented. Specifically, each agent checks periodically if its local solution $x_{i,k}$ has stopped improving, and replaces $\Delta_i$ with $r\Delta_i$, $r\in(0,1)$, if this is the case.

An alternative algorithm with zooming-in quantization was proposed in [24]. However, in [24] the agents reduce their quantization in a synchronized fashion via voting, while in Algorithm 3 the agents can set their quantization levels independently.

Algorithm 3 Proposed algorithm (zooming-in quantization)
1:For each agent $i \in \mathcal{V}$ initialize $x_{i,0}$; choose the step-size $\alpha$. Choose the local quantization level $\Delta_i$, and let $r \in (0,1)$ and $T \geq 1$.
2:for $k = 0, 1, \ldots$ each agent $i$ do
3:     // local update
4:     if agent $i$ active
5:         apply the local (possibly inexact) gradient step
$y_{i,k} = x_{i,k} - \alpha \hat{\nabla} f_i(x_{i,k})$
6:     else $y_{i,k} = y_{i,k-1}$
7:     end if
8:     // coordination
9:     $\mathbold{x}_{k+1} = \text{Algorithm 1}(\mathbold{y}_k)$, with local quantization levels $\Delta_i$
10:     // zooming-in quantization
11:     if agent $i$ activated $T$ times & $\lVert x_{i,k+1} - x_{i,k} \rVert \leq \Delta_i$
12:         $\Delta_i \leftarrow r \Delta_i$
13:     end if
13:     end if
14:end for
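The zooming-in test in the last lines of Algorithm 3 can be sketched in a few lines of Python. This is only an illustrative reading of the prototype, not the authors' implementation: the names `ZoomState` and `zoom_in_step` are hypothetical, and the sketch assumes that the stall test is run once every $T$ activations, after which the activation counter restarts (the listing does not fix this detail).

```python
from dataclasses import dataclass
import math


@dataclass
class ZoomState:
    """Per-agent bookkeeping for the zooming-in rule (illustrative names)."""
    delta: float          # current local quantization level Delta_i
    activations: int = 0  # activations counted since the last stall test


def zoom_in_step(state, x_new, x_old, active, r=0.1, T=25):
    """Apply the zooming-in test for one agent at one iteration k.

    Assumption: the test is evaluated once every T activations, and the
    activation counter restarts whether or not the level is reduced.
    """
    if active:
        state.activations += 1
    if state.activations >= T:
        state.activations = 0
        # Stall test: the iterate moved by no more than the current level.
        if math.dist(x_new, x_old) <= state.delta:
            state.delta *= r  # Delta_i <- r * Delta_i
    return state.delta
```

Because each agent only inspects its own iterates and its own counter, no voting or synchronization between agents is needed to change the level, which is the difference from [24] noted above.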

IV Convergence Analysis

In this section we analyze the convergence of Algorithms 2 and 3 when the agents operate asynchronously and apply stochastic gradients. Before presenting our analysis we make the following assumption.

Assumption 3 (Set-up)

Each agent $i \in \mathcal{V}$ activates at iteration $k$ to perform a local gradient step with probability $p_i \in (0,1]$. In particular, active agents use a (possibly inexact) gradient $\hat{\nabla} f_i$, for which there exists $\tau \geq 0$ such that

$$\mathbb{E}\left[ \left\lVert \hat{\nabla} f_i(x) - \nabla f_i(x) \right\rVert \right] \leq \tau.$$

Under this assumption, we derive the following convergence result, proved in Appendix B.

Proposition 1 (Convergence of Algorithm 2)

Let $\{\mathbold{x}_k\}_{k \in \mathbb{N}}$ be the trajectory generated by Algorithm 2. Let Assumptions 1, 2, and 3 hold. Then for all $k > 0$ it holds that

$$\mathbb{E}\left[ \left\lVert \mathbold{x}_k - \mathbold{x}^* \right\rVert \right] \leq \sqrt{\frac{\max_i p_i}{\min_i p_i}} \left( \chi^k \left\lVert \mathbold{x}_0 - \mathbold{x}^* \right\rVert + \left( \gamma + \alpha \tau \sqrt{N} \right) \frac{1 - \chi^k}{1 - \chi} \right)$$

where $\chi = \sqrt{1 - (1 - \zeta^2) \min_i p_i} \in (0,1)$ with $\zeta = \max\{ |1 - \alpha \underline{\lambda}|, |1 - \alpha \bar{\lambda}| \}$, and $\gamma = O(\Delta)$ (as characterized in Lemma 1).

As a consequence of Proposition 1, we see that

$$\lim_{k \to \infty} \mathbb{E}\left[ \left\lVert \mathbold{x}_k - \mathbold{x}^* \right\rVert \right] \leq \sqrt{\frac{\max_i p_i}{\min_i p_i}} \ \frac{\gamma + \alpha \tau \sqrt{N}}{1 - \chi}$$

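which highlights how the different challenges impact the asymptotic error: quantization through $\gamma$, stochastic gradients through $\tau$, and asynchrony through the activation probabilities $p_i$, which enter both the leading factor and $\chi$.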
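To get a feel for how these quantities interact, the right-hand sides of Proposition 1 and of its asymptotic consequence can be evaluated numerically. The sketch below is a hedged illustration: the function names are hypothetical, `lam_lo` and `lam_hi` stand for $\underline{\lambda}$ and $\bar{\lambda}$, and `gamma` must be supplied by the user, since its exact expression comes from Lemma 1 and is only known here to be $O(\Delta)$.

```python
import math


def contraction_factor(alpha, lam_lo, lam_hi, p_min):
    """chi = sqrt(1 - (1 - zeta^2) * min_i p_i), zeta as in Proposition 1."""
    zeta = max(abs(1.0 - alpha * lam_lo), abs(1.0 - alpha * lam_hi))
    chi = math.sqrt(1.0 - (1.0 - zeta ** 2) * p_min)
    if chi >= 1.0:
        raise ValueError("step-size too large: the bound needs chi < 1")
    return chi


def error_bound(k, alpha, lam_lo, lam_hi, probs, gamma, tau, dist0):
    """Right-hand side of the Proposition 1 bound at iteration k."""
    chi = contraction_factor(alpha, lam_lo, lam_hi, min(probs))
    noise = gamma + alpha * tau * math.sqrt(len(probs))  # N = number of agents
    scale = math.sqrt(max(probs) / min(probs))
    return scale * (chi ** k * dist0 + noise * (1.0 - chi ** k) / (1.0 - chi))


def asymptotic_bound(alpha, lam_lo, lam_hi, probs, gamma, tau):
    """Limit of error_bound as k grows (the chi^k terms vanish)."""
    chi = contraction_factor(alpha, lam_lo, lam_hi, min(probs))
    noise = gamma + alpha * tau * math.sqrt(len(probs))
    return math.sqrt(max(probs) / min(probs)) * noise / (1.0 - chi)
```

For instance, lowering the smallest activation probability increases `contraction_factor` and therefore both slows the transient and inflates the asymptotic bound, while setting `gamma=0` leaves only the stochastic-gradient term.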

We can similarly characterize the asymptotic error when we employ the zooming-in quantization of Algorithm 3. The proof is reported in Appendix C.

Corollary 1 (Convergence of Algorithm 3)

Let $\{\mathbold{x}_k\}_{k \in \mathbb{N}}$ be the trajectory generated by Algorithm 3. Let Assumptions 1, 2, and 3 hold. Then it holds that

$$\lim_{k \to \infty} \mathbb{E}\left[ \left\lVert \mathbold{x}_k - \mathbold{x}^* \right\rVert \right] \leq \sqrt{\frac{\max_i p_i}{\min_i p_i}} \ \frac{\alpha \tau \sqrt{N}}{1 - \chi}.$$

Clearly, using zooming-in quantization implies that quantization will not impact the asymptotic error, and only the effects of asynchrony and stochastic gradients are present.

V Numerical Results

In this section we evaluate the performance of the proposed algorithms on a classification task, and compare it with algorithms from the literature. We consider problem (1) with local costs

$$f_i(x) = \sum_{h=1}^{m_i} \log\left( 1 + \exp(-b_i^h a_i^h x) \right) + \frac{\epsilon}{2} \left\lVert x \right\rVert^2 \tag{7}$$

defined by the local dataset $\{ d_i^h = (a_i^h, b_i^h) \in \mathbb{R}^{1 \times n} \times \{-1, 1\} \}_{h=1}^{m_i}$. In our experiments we have $N = 10$ agents with $m_i = 150$ data-points each, and the problem size is $n = 10$. The regularization weight is set to $\epsilon = 0.075$. Moreover, unless otherwise stated we use the symmetric quantizer (6). Finally, the data for the problem are randomly generated using the make_classification utility of sklearn [25], and all algorithms are implemented in tvopt [26].
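As a concrete reference for the local cost (7), the following standard-library sketch evaluates $f_i$ and its exact gradient for one agent. It is not the tvopt implementation used in the experiments: the function names are hypothetical, the data are passed as plain lists (`A` holds the rows $a_i^h$, `b` the labels $b_i^h$), and the numerically stable `softplus` and `sigmoid` helpers are an implementation choice of this sketch.

```python
import math
import random


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def sigmoid(t):
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def local_cost(x, A, b, eps=0.075):
    """f_i(x) = sum_h log(1 + exp(-b_h * a_h x)) + (eps / 2) * ||x||^2."""
    margins = [b_h * sum(a * xi for a, xi in zip(a_h, x))
               for a_h, b_h in zip(A, b)]
    return sum(softplus(-m) for m in margins) + 0.5 * eps * sum(xi * xi for xi in x)


def local_gradient(x, A, b, eps=0.075):
    """Exact gradient of local_cost with respect to x."""
    grad = [eps * xi for xi in x]  # gradient of the regularizer
    for a_h, b_h in zip(A, b):
        m = b_h * sum(a * xi for a, xi in zip(a_h, x))
        coeff = -b_h * sigmoid(-m)  # d/dx log(1 + exp(-m)) = coeff * a_h
        for j, a in enumerate(a_h):
            grad[j] += coeff * a
    return grad
```

A quick sanity check is that at $x = 0$ every logistic term equals $\log 2$, so `local_cost` returns $m_i \log 2$ regardless of the data.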

V-A Performance of Finite-Time, Quantized Coordination schemes

We start by comparing the performance of the proposed Finite-Time, Quantized Coordination scheme Algorithm 1 with the scheme proposed in [22] and employed in [9, 10]. In Table I we compare the two FTQC schemes in terms of consensus error and number of iterations, for different quantization levels. The algorithms are applied to average randomly generated vectors in $\mathbb{R}^{10}$, the size of $x$ in (7).

TABLE I: Consensus error and iteration number for different quantization levels.

| $\Delta$ | [22]: Cons. err. | [22]: Num. iter. | Algorithm 1: Cons. err. | Algorithm 1: Num. iter. |
|---|---|---|---|---|
| $10^{-8}$ | $5.31 \times 10^{-8}$ | 115 | $2.85 \times 10^{-8}$ | 105 |
| $10^{-7}$ | $5.13 \times 10^{-7}$ | 106 | $2.88 \times 10^{-7}$ | 95 |
| $10^{-6}$ | $5.32 \times 10^{-6}$ | 96 | $2.89 \times 10^{-6}$ | 86 |
| $10^{-5}$ | $5.30 \times 10^{-5}$ | 86 | $2.86 \times 10^{-5}$ | 80 |
| $10^{-4}$ | $5.31 \times 10^{-4}$ | 75 | $2.89 \times 10^{-4}$ | 66 |
| $10^{-3}$ | $5.27 \times 10^{-3}$ | 67 | $2.87 \times 10^{-3}$ | 57 |
| $10^{-2}$ | $5.36 \times 10^{-2}$ | 55 | $2.91 \times 10^{-2}$ | 48 |
| $10^{-1}$ | $5.18 \times 10^{-1}$ | 49 | $2.87 \times 10^{-1}$ | 43 |
| $1$ | $5.17$ | 38 | $2.89$ | 29 |

We can see that Algorithm 1 consistently outperforms [22] in terms of consensus error, since it reaches a smaller neighborhood of the consensus, and in terms of the number of iterations it requires.

Turning exclusively to Algorithm 1, we know that it is characterized by two parameters, the penalty $\rho$ and the quantizer $q(\cdot)$. In the following we provide results to guide the tuning of these parameters. First of all, Figure 1 reports the consensus error and number of iterations for different values of the penalty and quantization levels.

Figure 1: Consensus error and iteration number for Algorithm 1 with different penalties and quantization levels.

Interestingly, both metrics are minimized for a value of $\rho \approx 0.3$. Moreover, as predicted by Remark 1, the smaller the quantization level is, the smaller the consensus error is, to the detriment of the number of iterations needed to converge.

Finally, Table II reports the performance of the consensus scheme with different quantizers besides the symmetric (6), namely: floor, $q(x) = \Delta \lfloor x / \Delta \rfloor$; ceil, $q(x) = \Delta \lceil x / \Delta \rceil$; and sparsifier, which sets to zero all components of $x$ with absolute value below $\theta = 0.1$.

TABLE II: Performance of Algorithm 1 with different quantizers.

| Quantizer | Cons. err. | Num. iter. |
|---|---|---|
| Symmetric | $3.37 \times 10^{-3}$ | 52 |
| Floor | $8.70 \times 10^{-3}$ | 52 |
| Ceil | $8.69 \times 10^{-3}$ | 52 |
| Sparsifier | $3.79 \times 10^{-3}$ | 50 |

We notice that the floor (employed in [22, 9, 10]) and ceiling quantizers attain more than double the consensus error of the symmetric one. The sparsifier instead achieves similar performance. Future work will explore the use of the sparsifier from a theoretical perspective.
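The four quantizers compared in Table II are simple enough to write down component-wise. The sketch below is an illustration under stated assumptions rather than the code used in the experiments: the function names are hypothetical, and the symmetric quantizer breaks ties by rounding up, a convention the text does not specify.

```python
import math


def symmetric(x, delta):
    """Nearest multiple of delta (ties rounded up: an assumption of this sketch)."""
    return delta * math.floor(x / delta + 0.5)


def floor_q(x, delta):
    """Largest multiple of delta not exceeding x."""
    return delta * math.floor(x / delta)


def ceil_q(x, delta):
    """Smallest multiple of delta not below x."""
    return delta * math.ceil(x / delta)


def sparsifier(x, theta=0.1):
    """Zero out components whose absolute value is below theta."""
    return 0.0 if abs(x) < theta else x


def quantize(vec, q, **params):
    """Apply a scalar quantizer q to every component of vec."""
    return [q(v, **params) for v in vec]
```

Note that the symmetric quantizer has a worst-case component error of $\Delta/2$, whereas floor and ceil can be off by almost $\Delta$ and always err in the same direction, which is in line with the gap observed in Table II.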

V-B Comparison of gradient descent schemes

The previous section evaluated the performance of the Finite-Time, Quantized Coordination scheme Algorithm 1 which is used as a building block of Algorithm 2. In this section we discuss the performance of Algorithm 2 itself, and compare it with alternative methods.

We start by comparing Algorithm 2 to FTQC-DGD [10], Near-DGD [21], and the distributed gradient tracking (DGT) method [12]. The latter two do not employ a finite-time coordination scheme, but they are modified to use multiple rounds of communications to match the budget of FTQC-DGD and Algorithm 2. Figure 2 reports the error trajectories of all methods.

Figure 2: Comparison of different distributed optimization methods with quantized communications.

We can see that Algorithm 2 achieves a smaller asymptotic error than Near-DGD and FTQC-DGD, owing to the improved coordination performance of Algorithm 1 (cf. section V-A). Moreover, DGT appears to diverge, which is known to happen with some gradient tracking schemes perturbed by (quantization) noise [27].

Table III further compares Near-DGD, FTQC-DGD and Algorithm 2 for different quantization levels.

TABLE III: Comparison of Near-DGD [21], FTQC-DGD [10], and Algorithm 2 for different quantization levels.

| $\Delta$ | Near-DGD [21] | FTQC-DGD [10] | Algorithm 2 |
|---|---|---|---|
| $10^{-10}$ | $3.75 \times 10^{-8}$ | $4.76 \times 10^{-8}$ | $2.13 \times 10^{-8}$ |
| $10^{-8}$ | $3.00 \times 10^{-7}$ | $7.35 \times 10^{-7}$ | $1.04 \times 10^{-7}$ |
| $10^{-6}$ | $3.29 \times 10^{-5}$ | $2.84 \times 10^{-5}$ | $7.79 \times 10^{-6}$ |
| $10^{-4}$ | $3.46 \times 10^{-3}$ | $2.49 \times 10^{-2}$ | $1.12 \times 10^{-3}$ |
| $10^{-2}$ | $1.88 \times 10^{-1}$ | $2.56 \times 10^{-1}$ | $7.86 \times 10^{-2}$ |
| $1$ | $24.02$ | $20.15$ | $3.46$ |

The proposed Algorithm 2 outperforms both alternatives, again owing to the improved coordination precision.

V-C Variations of Algorithm 2

In this section we discuss the performance of Algorithm 2 in challenging scenarios, and compare it to that of Algorithm 3.

As discussed in section II, in learning problems the use of full gradients may be prohibitive, and the agents need to resort to stochastic gradients. In Figure 3 we report the asymptotic error achieved by Algorithm 2 when stochastic gradients computed on different batch sizes $B$ are used.

Figure 3: Asymptotic error of Algorithm 2 when the agents employ stochastic gradients of different batch sizes.

Clearly, the larger the batch size, the better the performance. However, due to the use of quantization, even with $B = m_i$ the algorithm can only reach a neighborhood of the optimal solution.
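One common way to build a stochastic gradient from a batch of size $B$ is sketched below. This is an assumption about the estimator, since the text does not detail how the batches are drawn or scaled: here the batch is sampled without replacement and rescaled by $m_i / B$ so that the estimate of the data-dependent sum is unbiased. The names `minibatch_gradient` and `per_sample_grad` are hypothetical, and any regularization gradient would be added by the caller.

```python
import random


def minibatch_gradient(per_sample_grad, samples, x, batch_size, rng):
    """Estimate sum_h grad l_h(x) from a batch drawn without replacement.

    per_sample_grad(x, sample) must return the gradient of one loss term
    as a list with the same length as x.
    """
    batch = rng.sample(samples, batch_size)
    scale = len(samples) / batch_size  # m_i / B keeps the estimate unbiased
    grad = [0.0] * len(x)
    for sample in batch:
        for j, g in enumerate(per_sample_grad(x, sample)):
            grad[j] += scale * g
    return grad
```

With `batch_size == len(samples)` the scale factor is one and the function returns the full gradient of the sum, which matches the $B = m_i$ case discussed above.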

Another one of the challenges discussed in section II is the asynchronous operation of the agents. In Figure 4 we report the performance of Algorithm 2 in this scenario, when agents activate to perform a gradient descent step with probability $p \in (0,1]$.

Figure 4: Error trajectory of Algorithm 2 with different agent activation probabilities.

As predicted by the theory [28], the smaller $p$ is, the fewer updates are performed, and hence the slower the convergence.
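The random activation studied here amounts to a single branch in each agent's local update: with probability $p$ the agent takes a gradient step, otherwise it reuses its previous value $y_{i,k-1}$. A minimal sketch, with hypothetical names (`local_update`, `grad_fn`) and vectors stored as plain lists, could look as follows.

```python
import random


def local_update(x_i, y_prev, grad_fn, alpha, p, rng):
    """Local-update branch for one agent at one iteration.

    Returns the new y_i and a flag telling whether the agent was active.
    """
    if rng.random() < p:  # the agent activates with probability p
        g = grad_fn(x_i)  # possibly inexact gradient at x_i
        return [xi - alpha * gi for xi, gi in zip(x_i, g)], True
    return list(y_prev), False  # inactive: keep y_{i,k-1}
```

Since `rng.random()` returns values in $[0, 1)$, choosing `p = 1.0` recovers the synchronous case in which every agent updates at every iteration.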

We conclude this section by comparing the performance of Algorithm 2 with the variation Algorithm 3 that employs zooming-in quantization ($T = 25$, $r = 0.1$). In particular, Figure 5 depicts the error trajectory of the latter against the error trajectory of the former with different quantization levels. The x-axis marks the cumulative number of communication rounds.

Figure 5: Comparison of Algorithm 2 (fixed quantization level) with Algorithm 3 (zooming-in quantization).

We can thus deduce that using zooming-in quantization can achieve very good performance (in terms of asymptotic error) with a smaller number of communication rounds.

VI Conclusions

In this paper we addressed distributed learning problems over peer-to-peer networks, with a particular focus on the challenges of quantized communications, asynchrony, and stochastic gradients that arise in this set-up. We first discussed how to turn the presence of quantized communications into an advantage, by resorting to a finite-time, quantized coordination scheme. This scheme was combined with a distributed gradient descent method to derive the proposed algorithm. Secondly, we showed how this algorithm can be adapted to allow asynchronous operations of the agents, as well as the use of stochastic gradients. Finally, we proposed a variant of the algorithm which employs zooming-in quantization. We analyzed the convergence of the proposed methods and compared them to state-of-the-art alternatives. In the numerical experiments, the proposed methods attained smaller consensus and asymptotic errors than the alternatives from the literature.

Appendix A Proof of Lemma 1

We start by observing that Algorithm 1 consists of an affine update in $\mathbold{z} = [z_{ij}]_{i \in \mathcal{V},\ j \in \mathcal{N}_i}$; in particular, for appropriate matrices and vectors we can write $\mathbold{z}^{\ell+1} = \mathbold{T} \mathbold{z}^{\ell} + \mathbold{u} + \mathbold{e}^{\ell}$, $\mathbold{w}^{\ell} = \mathbold{H} \mathbold{z}^{\ell}$. The vector $\mathbold{e}^{\ell}$ represents the noise caused by quantization, that is $e_{ij}^{\ell} = q(-z_{ji}^{\ell} + 2 \rho w_j^{\ell}) - (-z_{ji}^{\ell} + 2 \rho w_j^{\ell})$. Since Algorithm 1 is an affine operator (plus additive noise), it is $\mu$-metric subregular for a given $\mu \in (0,1)$ [29]. Therefore the assumptions of [23, Theorem 3] are verified, and we have

$$d(\mathbold{z}^{\ell}) \leq \mu^{\ell} d(\mathbold{z}^{0}) + \sum_{h=0}^{\ell-1} \mu^{\ell-h-1} \left\lVert \mathbold{e}^{h} \right\rVert \tag{8}$$

$$\left\lVert \mathbold{w}^{\ell} - \frac{1}{N} \sum_{i \in \mathcal{V}} y_i \otimes \boldsymbol{1}_N \right\rVert \leq C d(\mathbold{z}^{\ell}), \tag{9}$$

where $d(\mathbold{z})$ measures the distance of $\mathbold{z}$ from the set of fixed points $\{ \bar{\mathbold{z}} \ | \ \bar{\mathbold{z}} = \mathbold{T} \bar{\mathbold{z}} + \mathbold{u} \}$.

Now, since $\mathbold{e}^{\ell}$ represents the quantization noise, we can upper bound its norm as follows:

$$\begin{aligned} \left\lVert \mathbold{e}^{\ell} \right\rVert^2 &= \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \left\lVert q(-z_{ji}^{\ell} + 2 \rho w_j^{\ell}) - (-z_{ji}^{\ell} + 2 \rho w_j^{\ell}) \right\rVert^2 \\ &\leq \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} n (\Delta/2)^2 = n (\Delta/2)^2 \sum_{i \in \mathcal{V}} |\mathcal{N}_i| \end{aligned}$$

where the inequality holds because the quantization commits an error of at most $\Delta/2$ on each of the $n$ components. Using this bound and combining (8) with (9) then yields the first thesis.

The goal now is to show that Algorithm 1 achieves finite-time convergence. By (8), we know that $\lim_{\ell \to \infty} d(\mathbold{z}^{\ell}) = \frac{1}{1 - \mu} \frac{\Delta}{2} \sqrt{n \sum_{i \in \mathcal{V}} |\mathcal{N}_i|}$. Then to bound the time of convergence we impose that the first term on the right-hand side of (8) be smaller than $\lim_{\ell \to \infty} d(\mathbold{z}^{\ell})$. By rearranging, taking the logarithm and the absolute value, the thesis follows. $\square$
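The last step of this proof can be made concrete with a short computation. The sketch below evaluates the asymptotic level appearing in the proof and the smallest number of iterations $\ell$ for which $\mu^{\ell} d(\mathbold{z}^0)$ falls below it. It is only one reading of "rearranging and taking the logarithm": the exact expression stated in Lemma 1 is not reproduced here, the function names are hypothetical, and `degrees` is assumed to list the neighborhood sizes $|\mathcal{N}_i|$.

```python
import math


def noise_floor(mu, delta, n, degrees):
    """(1 / (1 - mu)) * (delta / 2) * sqrt(n * sum_i |N_i|)."""
    return (delta / 2.0) * math.sqrt(n * sum(degrees)) / (1.0 - mu)


def iterations_to_floor(mu, d0, delta, n, degrees):
    """Smallest l with mu**l * d0 <= noise_floor (0 if already below it)."""
    floor = noise_floor(mu, delta, n, degrees)
    if d0 <= floor:
        return 0
    # mu**l * d0 <= floor  <=>  l >= log(floor / d0) / log(mu), since log(mu) < 0.
    return math.ceil(math.log(floor / d0) / math.log(mu))
```

Because the floor scales linearly with $\Delta$, a coarser quantizer raises the floor and shortens the time needed to reach it, consistent with the trade-off between accuracy and number of iterations observed in Table I.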

Appendix B Proof of Proposition 1

Algorithm 2 was derived in section III as an inexact version of the projected gradient method, where Algorithm 1 replaces the projection onto the consensus set. Additionally, by Assumption 3, the agents apply inexact gradients during local computations. Accounting for both these sources of errors, we can characterize Algorithm 2 as

$$\mathbold{x}_{k+1} = \operatorname{proj}_{\mathcal{C}}\left( \mathbold{x}_k - \alpha \nabla f_k(\mathbold{x}_k) \right) + \mathbold{e}_k^q + \mathbold{e}_k^g, \tag{10}$$

where \mathboldekq\mathboldsuperscriptsubscript𝑒𝑘𝑞\mathbold{e}_{k}^{q}italic_e start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT is the error due to Algorithm 1 (cf. Lemma 1), and \mathboldekg\mathboldsuperscriptsubscript𝑒𝑘𝑔\mathbold{e}_{k}^{g}italic_e start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT is the error due to inexact gradients:

\mathboldekp\mathboldsuperscriptsubscript𝑒𝑘𝑝\displaystyle\mathbold{e}_{k}^{p}italic_e start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT =Algorithm 1(\mathboldyk)proj𝒞(\mathboldyk)absentAlgorithm 1\mathboldsubscript𝑦𝑘subscriptproj𝒞\mathboldsubscript𝑦𝑘\displaystyle=\text{Algorithm~{}\ref{alg:finite-time-consensus}}(\mathbold{y}_% {k})-\operatorname{proj}_{\mathcal{C}}(\mathbold{y}_{k})= Algorithm ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - roman_proj start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )
\mathboldekg\mathboldsuperscriptsubscript𝑒𝑘𝑔\displaystyle\mathbold{e}_{k}^{g}italic_e start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT =proj𝒞(\mathboldxkα^fk(\mathboldxk))proj𝒞(\mathboldxkαfk(\mathboldxk)).absentsubscriptproj𝒞\mathboldsubscript𝑥𝑘𝛼^subscript𝑓𝑘\mathboldsubscript𝑥𝑘subscriptproj𝒞\mathboldsubscript𝑥𝑘𝛼subscript𝑓𝑘\mathboldsubscript𝑥𝑘\displaystyle=\operatorname{proj}_{\mathcal{C}}(\mathbold{x}_{k}-\alpha\hat{% \nabla}f_{k}(\mathbold{x}_{k}))-\operatorname{proj}_{\mathcal{C}}(\mathbold{x}% _{k}-\alpha\nabla f_{k}(\mathbold{x}_{k})).= roman_proj start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_α over^ start_ARG ∇ end_ARG italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) - roman_proj start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_α ∇ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) .

Moreover, Assumption 3 allows the agents to activate asynchronously, each with its own probability $p_i \in (0,1]$. This means that the $i$-th coordinate of $\mathbold{x}_k$ is updated with probability $p_i$.

Finally, we notice that by the choice $\alpha < 2/\bar{\lambda}$, the projected gradient method (without errors and asynchrony) is $\zeta$-contractive with $\zeta = \max\{|1 - \alpha \underaccent{\bar}{\lambda}|, |1 - \alpha \bar{\lambda}|\}$ [20]. This implies that Algorithm 2 can be interpreted as a projected gradient method with bounded additive noise and random coordinate updates. Thus it verifies the assumptions of [28, Proposition 1], which implies

\begin{align*}
\mathbb{E}\left[\lVert \mathbold{x}_k - \mathbold{x}^* \rVert\right] &\leq \sqrt{\frac{\max_i p_i}{\min_i p_i}} \Big( \chi^k \lVert \mathbold{x}_0 - \mathbold{x}^* \rVert \\
&\quad + \sum_{h=0}^{k} \chi^{k-h} \lVert \mathbold{e}_h^p + \mathbold{e}_h^g \rVert \Big).
\end{align*}

Now, by Assumption 3 we know that $\mathbb{E}\left[\lVert \mathbold{e}_k^g \rVert\right] \leq \tau\sqrt{N}$, and by Lemma 1 we can bound $\lVert \mathbold{e}_k^p \rVert \leq \sqrt{N} C \frac{\Delta}{2} \sqrt{n \sum_{i\in\mathcal{V}} |\mathcal{N}_i|} \frac{1}{1-\mu} = O(\Delta)$, and the thesis follows. $\square$
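The interpretation used in this proof (a contractive projected gradient step perturbed by bounded errors and applied to randomly activated coordinates) can be illustrated numerically. The following is a minimal sketch under assumptions that are not part of the paper: each agent holds a scalar quadratic cost, Algorithm 1 is replaced by exact averaging followed by rounding to a grid of step `delta` (so the projection error is at most `delta / 2` per coordinate), and the gradient noise is uniform in `[-tau, tau]`. The names `run_inexact_projected_gradient` and `asymptotic_error_bound` are hypothetical, and the bound computed is the one for this toy model with all agents active, not the statement of Proposition 1.

```python
import numpy as np


def run_inexact_projected_gradient(a, b, alpha, delta, tau, p, num_iter, seed=0):
    """Simulate x_{k+1} = proj_C(x_k - alpha * grad) + e^p + e^g with random activation.

    Illustrative setup: agent i holds the scalar cost 0.5 * a[i] * (x - b[i])**2.
    """
    rng = np.random.default_rng(seed)
    n_agents = a.size
    x = np.zeros(n_agents)
    for _ in range(num_iter):
        # Inexact local gradients: exact gradient plus noise bounded by tau.
        grad_hat = a * (x - b) + rng.uniform(-tau, tau, size=n_agents)
        y = x - alpha * grad_hat
        # Exact projection onto the consensus set is the average of the entries.
        z = np.full(n_agents, y.mean())
        # Stand-in for Algorithm 1: round to a grid of step delta (error <= delta / 2).
        if delta > 0:
            z = delta * np.round(z / delta)
        # Asynchrony: agent i updates its coordinate with probability p[i].
        active = rng.random(n_agents) < p
        x = np.where(active, z, x)
    return x


def asymptotic_error_bound(a, alpha, delta, tau):
    """Bound on ||x_k - x*|| for large k when every agent updates at each step."""
    zeta = max(abs(1 - alpha * a.min()), abs(1 - alpha * a.max()))
    return np.sqrt(a.size) * (delta / 2 + alpha * tau) / (1 - zeta)


a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, -2.0, 0.5, 3.0])
alpha = 0.4  # satisfies alpha < 2 / max(a)
x_star = np.sum(a * b) / np.sum(a)  # minimizer of the sum of the local costs

x_sync = run_inexact_projected_gradient(a, b, alpha, 0.1, 0.05, np.ones(4), 200)
x_async = run_inexact_projected_gradient(a, b, alpha, 0.1, 0.05, np.full(4, 0.5), 2000)
print("synchronous error: ", np.linalg.norm(x_sync - x_star))
print("asynchronous error:", np.linalg.norm(x_async - x_star))
print("synchronous bound: ", asymptotic_error_bound(a, alpha, 0.1, 0.05))
```

In this toy model the synchronous iterates settle in a neighborhood of the optimum whose radius scales with `delta` and `tau`, which mirrors the $O(\Delta)$ term in the bound above. The asynchronous run only satisfies a bound in expectation, so its printed error is a single realization.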

Appendix C Proof of Corollary 1

Following the same derivation as Appendix B yields

\begin{align*}
\mathbb{E}\left[\lVert \mathbold{x}_k - \mathbold{x}^* \rVert\right] &\leq \sqrt{\frac{\max_i p_i}{\min_i p_i}} \Big( \chi^k \lVert \mathbold{x}_0 - \mathbold{x}^* \rVert \\
&\quad + \sum_{h=0}^{k} \chi^{k-h} \big( \lVert \mathbold{e}_h^p \rVert + \lVert \mathbold{e}_h^g \rVert \big) \Big).
\end{align*}

By Assumption 3 we know that $\mathbb{E}\left[\lVert \mathbold{e}_k^g \rVert\right] \leq \tau\sqrt{N}$. On the other hand, by the use of zooming-in quantization, and by Lemma 1, we have

$$\lVert \mathbold{e}_k^p \rVert \leq \sqrt{N} C \frac{\Delta_k}{2} \sqrt{n \sum_{i\in\mathcal{V}} |\mathcal{N}_i|} \frac{1}{1-\mu}$$

with $\Delta_k = \max_i \Delta_{i,k}$ being the largest quantization level among all agents at time $k$. We know that $\Delta_k$ is monotonically non-increasing, and that in particular it decreases at finite intervals, when all agents have stopped seeing an improvement in their local solution $x_{i,k}$ (cf. lines 7–8 in Algorithm 3). Thus $\lim_{k\to\infty} \lVert \mathbold{e}_k^p \rVert = 0$, and by [30, Lemma 3.1(a)] $\lim_{k\to\infty} \sum_{h=0}^{k} \chi^{k-h} \lVert \mathbold{e}_h^p \rVert = 0$ since $\chi \in (0,1)$, and the thesis follows. $\square$
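The last step of the argument, that the weighted sum $\sum_{h=0}^{k} \chi^{k-h} \lVert \mathbold{e}_h^p \rVert$ vanishes when the error sequence vanishes, can be checked numerically. The sketch below compares a constant error (fixed quantization) with an error that halves every 10 iterations. The halving schedule, the value `chi = 0.9`, and the function name `accumulated_error` are assumptions made for illustration only; they do not reproduce the event-triggered rule of Algorithm 3.

```python
import numpy as np


def accumulated_error(errors, chi):
    """Return s_k = sum_{h=0}^{k} chi**(k - h) * errors[h] for every k."""
    s = np.empty(len(errors))
    running = 0.0
    for k, e_k in enumerate(errors):
        # Recursive form of the weighted sum: s_k = chi * s_{k-1} + e_k.
        running = chi * running + e_k
        s[k] = running
    return s


chi = 0.9
num_iter = 301
k = np.arange(num_iter)

# Fixed quantization: constant error, so the sum saturates at e / (1 - chi).
fixed = accumulated_error(np.ones(num_iter), chi)
# Hypothetical zooming-in schedule: the error halves every 10 iterations.
zooming = accumulated_error(0.5 ** (k // 10), chi)

print("fixed quantization, s_300:     ", fixed[-1])
print("zooming-in quantization, s_300:", zooming[-1])
```

With a constant error the accumulated term approaches the nonzero limit $1/(1-\chi)$, whereas with the shrinking error it decays toward zero, which is the behavior the corollary relies on.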

References

  • [1] D. K. Molzahn, F. Dorfler, H. Sandberg, S. H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei, “A Survey of Distributed Optimization and Control Algorithms for Electric Power Systems,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941–2962, Nov. 2017.
  • [2] A. Nedić and J. Liu, “Distributed Optimization for Control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, no. 1, pp. 77–103, May 2018.
  • [3] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022.
  • [4] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, vol. 47, pp. 278–305, 2019.
  • [5] G. Notarstefano, I. Notarnicola, and A. Camisa, “Distributed Optimization for Smart Cyber-Physical Networks,” Foundations and Trends® in Systems and Control, vol. 7, no. 3, pp. 253–383, 2019.
  • [6] L. Qian, P. Yang, M. Xiao, O. A. Dobre, M. D. Renzo, J. Li, Z. Han, Q. Yi, and J. Zhao, “Distributed Learning for Wireless Communications: Methods, Applications and Challenges,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 3, pp. 326–342, Apr. 2022.
  • [7] P. Richtarik, I. Sokolov, E. Gasanov, I. Fatkhullin, Z. Li, and E. Gorbunov, “3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, Jul. 2022, pp. 18596–18648.
  • [8] N. Bastianello, R. Carli, L. Schenato, and M. Todescato, “Asynchronous Distributed Optimization Over Lossy Networks via Relaxed ADMM: Stability and Linear Convergence,” IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2620–2635, Jun. 2021.
  • [9] A. I. Rikos, W. Jiang, T. Charalambous, and K. H. Johansson, “Distributed optimization with gradient descent and quantized communication,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 5900–5906, 2023.
  • [10] N. Bastianello, A. I. Rikos, and K. H. Johansson, “Online Distributed Learning with Quantized Finite-Time Coordination,” in 2023 62nd IEEE Conference on Decision and Control (CDC).   Singapore, Singapore: IEEE, Dec. 2023, pp. 5026–5032.
  • [11] B. M. Assran, A. Aytekin, H. R. Feyzmahdavian, M. Johansson, and M. G. Rabbat, “Advances in Asynchronous Parallel and Distributed Optimization,” Proceedings of the IEEE, vol. 108, no. 11, pp. 2013–2031, Nov. 2020.
  • [12] R. Xin, S. Kar, and U. A. Khan, “Decentralized Stochastic Optimization and Machine Learning: A Unified Variance-Reduction Framework for Robust Performance and Fast Convergence,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 102–113, May 2020.
  • [13] X. Zhao and A. H. Sayed, “Asynchronous Adaptation and Learning Over Networks—Part I: Modeling and Stability Analysis,” IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 811–826, Feb. 2015.
  • [14] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Convergence of Asynchronous Distributed Gradient Methods Over Stochastic Networks,” IEEE Transactions on Automatic Control, vol. 63, no. 2, pp. 434–448, Feb. 2018.
  • [15] Y. Tian, Y. Sun, and G. Scutari, “Achieving Linear Convergence in Distributed Asynchronous Multiagent Optimization,” IEEE Transactions on Automatic Control, vol. 65, no. 12, pp. 5264–5279, Dec. 2020.
  • [16] X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson, “A Primal-Dual SGD Algorithm for Distributed Nonconvex Optimization,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 5, pp. 812–833, May 2022.
  • [17] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated Learning: Challenges, Methods, and Future Directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, May 2020.
  • [18] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated Learning: A signal processing perspective,” IEEE Signal Processing Magazine, vol. 39, no. 3, pp. 14–41, May 2022.
  • [19] Z. Zhao, Y. Mao, Y. Liu, L. Song, Y. Ouyang, X. Chen, and W. Ding, “Towards efficient communications in federated learning: A contemporary survey,” Journal of the Franklin Institute, p. S0016003222009346, Jan. 2023.
  • [20] A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Exact Worst-Case Convergence Rates of the Proximal Gradient Method for Composite Convex Minimization,” Journal of Optimization Theory and Applications, vol. 178, no. 2, pp. 455–476, Aug. 2018.
  • [21] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing Communication and Computation in Distributed Optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3141–3155, Aug. 2019.
  • [22] A. I. Rikos, C. N. Hadjicostis, and K. H. Johansson, “Non-oscillating quantized average consensus over dynamic directed topologies,” Automatica, vol. 146, 2022.
  • [23] N. Bastianello, D. Deplano, M. Franceschelli, and K. H. Johansson, “Robust Online Learning over Networks,” IEEE Transactions on Automatic Control, 2024.
  • [24] A. I. Rikos, W. Jiang, T. Charalambous, and K. H. Johansson, “Distributed Optimization via Gradient Descent with Event-Triggered Zooming Over Quantized Communication,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 6321–6327.
  • [25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [26] N. Bastianello, “tvopt: A Python Framework for Time-Varying Optimization,” in 2021 60th IEEE Conference on Decision and Control (CDC), 2021, pp. 227–232.
  • [27] M. Bin, I. Notarnicola, and T. Parisini, “Stability, Linear Convergence, and Robustness of the Wang-Elia Algorithm for Distributed Consensus Optimization,” in 2022 IEEE 61st Conference on Decision and Control (CDC).   Cancun, Mexico: IEEE, Dec. 2022, pp. 1610–1615.
  • [28] N. Bastianello, L. Madden, R. Carli, and E. Dall’Anese, “A Stochastic Operator Framework for Optimization and Learning With Sub-Weibull Errors,” IEEE Transactions on Automatic Control, 2024.
  • [29] A. Themelis and P. Patrinos, “SuperMann: A Superlinearly Convergent Algorithm for Finding Fixed Points of Nonexpansive Operators,” IEEE Transactions on Automatic Control, vol. 64, no. 12, pp. 4875–4890, Dec. 2019.
  • [30] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, “Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, Dec. 2010.