Communication Efficient Decentralized Learning Over Bipartite Graphs

In this paper, we propose a communication-efficient decentralized machine learning framework that solves a consensus optimization problem defined over a network of interconnected workers. The proposed algorithm, Censored and Quantized Generalized GADMM (CQ-GGADMM), leverages the worker grouping and decentralized learning ideas of Group Alternating Direction Method of Multipliers (GADMM), and pushes the frontier in communication efficiency by extending its applicability to generalized network topologies, while incorporating link censoring for negligible updates after quantization. We theoretically prove that CQ-GGADMM achieves a linear convergence rate when the local objective functions are strongly convex, under some mild assumptions. Numerical simulations corroborate that CQ-GGADMM exhibits higher communication efficiency in terms of the number of communication rounds and transmit energy consumption without compromising the accuracy and convergence speed, compared to the censored decentralized ADMM and the worker grouping method of GADMM.

A central parameter server collecting and distributing model parameters is not always accessible from faraway workers and is vulnerable to a single point of failure [8].
In this respect, fully decentralized learning frameworks are promising solutions in which workers directly communicate with each other without any central coordination [9]-[12]. However, in the absence of a parameter server, the convergence speed is often too slow, particularly when some workers have extremely sparse connectivity. Indeed, existing decentralized learning frameworks such as distributed gradient descent (DGD) and multi-step primal-dual (MSPD) frameworks can only achieve sub-linear convergence rates [9], [10]. Spurred by this motivation, the Group ADMM (GADMM) framework exploits the alternating direction method of multipliers to accelerate the convergence speed, and thereby achieves a linear convergence rate under a fully decentralized architecture with sparse connectivity [11]. Meanwhile, quantization-based GADMM (Q-GADMM) additionally utilizes stochastic quantization, and reduces communication payload sizes without compromising the convergence rate of GADMM [12]. Notwithstanding, both GADMM and Q-GADMM are limited to a ring network topology in which each worker is connected with at most two neighboring workers, as shown in Fig. 1(a), which questions their scalability.
To fill this void, by generalizing and extending GADMM and Q-GADMM, in this article we propose a novel decentralized learning framework, coined Censored and Quantized Generalized Group ADMM (CQ-GGADMM), as illustrated in Fig. 1(c). On the one hand, following GADMM, the workers in CQ-GGADMM are divided into head and tail groups, in which workers in the same group update their models in parallel, whereas workers in different groups update their models in an alternating way. On the other hand, following Q-GADMM, CQ-GGADMM reduces the communication payload size per link by applying a heterogeneous stochastic quantization scheme that decreases the number of bits used to represent each model parameter [12]. Furthermore, to reduce the number of communication links per round, CQ-GGADMM introduces a link censoring technique that lets a worker exchange model parameters only when its updated quantized model has changed sufficiently from the previously transmitted quantized model, i.e., skipping small quantized model updates [13]. While leveraging the aforementioned three principles, CQ-GGADMM is built upon a generalized version of GADMM (GGADMM), wherein each worker can connect with an arbitrary number of neighbors under a generic bipartite graph topology, as shown in Fig. 1(b). This is in contrast to the original GADMM, which is operational only under a ring topology [11], as shown in Fig. 1(a).
Related Works. Towards improving the communication efficiency of distributed learning, prior works have studied various techniques under centralized and decentralized network architectures.
a) Fast Convergence: The total communication cost until completing a distributed learning task can be reduced by accelerating the convergence speed. To this end, departing from conventional first-order methods such as distributed gradient descent [14], primal-dual methods have been applied under centralized [13], [15], [16] and decentralized architectures, guaranteeing the linear convergence of the proposed algorithm. By doing so, CQ-GGADMM achieves significant savings in terms of the total number of transmitted bits (and consequently in terms of the total energy) compared to [13]. Incorporating both censoring and quantization steps incurs model update errors that may propagate over communication rounds due to the lack of a central entity. To resolve this problem, we carefully determine non-increasing censoring thresholds and quantization step sizes, such that the model updates are more finely tuned as time elapses until convergence. We thereby prove the linear convergence rate of CQ-GGADMM, and show its effectiveness by simulations in terms of convergence speed, total communication cost, and transmission energy consumption.
Contributions. The major contributions of this work are summarized as follows.
• We propose CQ-GGADMM, a primal-dual decentralized learning framework utilizing censoring, quantization, and GADMM for any bipartite and connected network topology graph (Algorithm 1 in Sec. III).
• We prove that CQ-GGADMM converges to the optimal solution for convex loss functions (Theorem 1 in Sec. IV).
• We identify the network topology conditions under which CQ-GGADMM achieves a linear convergence rate when the loss functions are strongly convex (Theorem 2 in Sec. IV).
• Numerical simulations corroborate that, in linear and logistic regression tasks using synthetic and real datasets, CQ-GGADMM achieves the same convergence speed with significantly fewer communication rounds and several orders of magnitude less transmission energy, compared to C-GGADMM and censored ADMM (C-ADMM) in [13].
The remainder of this paper is organized as follows. In Section II, we describe the generalized version of GADMM (GGADMM) for a bipartite and connected graph, and formulate the decentralized learning problem. Then, we extend GGADMM to censored and quantized GGADMM (CQ-GGADMM) in Section III. In Section IV, we prove the convergence of CQ-GGADMM theoretically under some mild conditions. Finally, Section V validates the performance of CQ-GGADMM by simulations. The details of the proofs of our results are deferred to the appendices.

II. PROBLEM FORMULATION
We consider a connected network wherein a set V of N workers aims to reach consensus on the solution of a global optimization problem. The problem is solved using only the local data and information available at each worker. Moreover, communication is constrained to take place only between neighboring workers. The optimization problem is given by

(P1)  Θ* := arg min_Θ Σ_{n=1}^N f_n(Θ),

where Θ ∈ ℝ^{d×1} is the global model parameter and f_n : ℝ^d → ℝ is a local function composed of the data stored at worker n. Problem (P1) appears in many applications of machine learning, especially when the dataset is very large and the training is carried out by different workers. The connections among workers are represented as an undirected communication graph G with edge set E ⊆ V × V. The set of neighbors of worker n is defined as N_n = {m | (n, m) ∈ E}, whose cardinality is |N_n| = d_n. We start by making the following key assumption.

Assumption 1. The communication graph G is bipartite and connected.

Under Assumption 1, following the worker grouping of GADMM [11], the workers are divided into two groups: a head group H and a tail group T. Note that, unlike [11], where every worker connects with at most two neighbors under a chain network topology, each worker in CQ-GGADMM can connect with an arbitrary number of neighbors, as long as the network topology graph is bipartite and connected. Each head worker in H can only communicate with tail workers in T, and vice versa. In this case, the edge set definition can be rewritten as E = {(n, m) | n ∈ H, m ∈ T}, and problem (P1) is equivalent to the following problem

(P2)  θ* := arg min_{{θ_n}} Σ_{n=1}^N f_n(θ_n)  subject to  θ_n = θ_m, ∀(n, m) ∈ E,

where θ_n is the local copy of the common optimization variable Θ at worker n. Note that, under the formulation (P2), the objective function becomes separable across the workers, and as a consequence the problem can be solved in a distributed manner. In this case, the augmented Lagrangian of the optimization problem (P2) can be written as

L_ρ({θ_n}, {λ_{n,m}}) = Σ_{n=1}^N f_n(θ_n) + Σ_{(n,m)∈E} ⟨λ_{n,m}, θ_n − θ_m⟩ + (ρ/2) Σ_{(n,m)∈E} ‖θ_n − θ_m‖²,

where ρ > 0 is a constant penalty parameter and λ_{n,m} is the dual variable between neighboring workers n and m, ∀(n, m) ∈ E.
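As a concrete illustration of Assumption 1, the head/tail split can be obtained by 2-coloring the communication graph. The sketch below is our own (all names are illustrative, not from the paper); it checks bipartiteness and connectivity via breadth-first search and returns the two groups:

```python
# Sketch (not from the paper): verify Assumption 1 and split workers into
# head/tail groups by 2-coloring the communication graph with BFS.
from collections import deque

def head_tail_split(num_workers, edges):
    """Return (heads, tails) if the graph is bipartite and connected,
    otherwise raise ValueError. `edges` is a list of (n, m) pairs."""
    neighbors = {n: set() for n in range(num_workers)}
    for n, m in edges:
        neighbors[n].add(m)
        neighbors[m].add(n)
    color = {0: 0}                       # start BFS from worker 0
    queue = deque([0])
    while queue:
        n = queue.popleft()
        for m in neighbors[n]:
            if m not in color:
                color[m] = 1 - color[n]  # alternate groups along each edge
                queue.append(m)
            elif color[m] == color[n]:   # odd cycle -> not bipartite
                raise ValueError("graph is not bipartite")
    if len(color) != num_workers:        # BFS did not reach every worker
        raise ValueError("graph is not connected")
    heads = [n for n, c in color.items() if c == 0]
    tails = [n for n, c in color.items() if c == 1]
    return heads, tails

# Example: the 4-worker path 0-1-2-3 is bipartite and connected.
print(head_tail_split(4, [(0, 1), (1, 2), (2, 3)]))  # ([0, 2], [1, 3])
```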
At iteration k + 1, the generalized Group ADMM (GGADMM) algorithm runs as follows. (1) Every head worker n ∈ H updates its primal variable by solving

θ_n^{k+1} = arg min_{θ_n} { f_n(θ_n) + Σ_{m∈N_n} ⟨λ_{n,m}^k, θ_n − θ_m^k⟩ + (ρ/2) Σ_{m∈N_n} ‖θ_n − θ_m^k‖² },  (4)

and sends its updated model to its neighbors. (2) The primal variables of the tail workers m ∈ T are then updated as

θ_m^{k+1} = arg min_{θ_m} { f_m(θ_m) + Σ_{n∈N_m} ⟨λ_{n,m}^k, θ_n^{k+1} − θ_m⟩ + (ρ/2) Σ_{n∈N_m} ‖θ_n^{k+1} − θ_m‖² }.  (5)

(3) The dual variables are updated locally at every worker, after receiving the model updates from its neighbors, as

λ_{n,m}^{k+1} = λ_{n,m}^k + ρ(θ_n^{k+1} − θ_m^{k+1}), ∀(n, m) ∈ E.  (6)

Note that GGADMM is a generalized version of the GADMM algorithm proposed in [11], since it considers an arbitrary (bipartite) topology. Introducing α_n = Σ_{m∈N_n} λ_{n,m}, ∀n ∈ V, with the sign convention λ_{m,n} = −λ_{n,m}, we can write: (1) the update of the models of the head workers is done in parallel by solving

θ_n^{k+1} = arg min_{θ_n} { f_n(θ_n) + ⟨α_n^k, θ_n⟩ + (ρ/2) Σ_{m∈N_n} ‖θ_n − θ_m^k‖² }, ∀n ∈ H;  (7)

(2) the models of the tail workers are updated in parallel using

θ_m^{k+1} = arg min_{θ_m} { f_m(θ_m) + ⟨α_m^k, θ_m⟩ + (ρ/2) Σ_{n∈N_m} ‖θ_n^{k+1} − θ_m‖² }, ∀m ∈ T;  (8)

(3) instead of updating λ_{n,m}, each worker locally updates the new auxiliary variable α_n as

α_n^{k+1} = α_n^k + ρ Σ_{m∈N_n} (θ_n^{k+1} − θ_m^{k+1}), ∀n ∈ V.  (9)
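To make the alternating structure concrete, the following minimal sketch (our own; all names are illustrative) implements one GGADMM round of updates (7)-(9), assuming quadratic local losses f_n(θ) = ½‖X_nθ − y_n‖² as in Sec. V-A so that the primal updates have a closed form:

```python
# Sketch of one synchronous GGADMM round for quadratic local losses.
import numpy as np

def ggadmm_round(theta, alpha, X, y, heads, tails, neighbors, rho):
    """theta/alpha: dicts of d-vectors; X/y: dicts of local data."""
    d = next(iter(theta.values())).shape[0]
    def primal_update(n):
        # argmin f_n(th) + <alpha_n, th> + (rho/2) sum_m ||th - theta_m||^2,
        # solved in closed form for the quadratic f_n.
        A = X[n].T @ X[n] + rho * len(neighbors[n]) * np.eye(d)
        b = X[n].T @ y[n] - alpha[n] + rho * sum(theta[m] for m in neighbors[n])
        return np.linalg.solve(A, b)
    for n in heads:   # step (1): heads see only tails' iteration-k models
        theta[n] = primal_update(n)
    for m in tails:   # step (2): tails use the heads' fresh models
        theta[m] = primal_update(m)
    for n in theta:   # step (3): local auxiliary (dual) update, eq. (9)
        alpha[n] = alpha[n] + rho * sum(theta[n] - theta[u] for u in neighbors[n])
    return theta, alpha
```

Because the graph is bipartite, the two `for` loops over heads and tails never read a model that was modified within the same loop, so they are equivalent to the fully parallel per-group updates described above.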

III. CENSORED QUANTIZED GENERALIZED GROUP ADMM
To reduce the communication payload size, we follow a stochastic quantization scheme similar to the one described in [12]. Before transmission, each worker n quantizes the difference between its current model θ_n^k and its previously quantized model Q̂_n^{k−1}, yielding Q̂_n^k = Q_n(θ_n^k, Q̂_n^{k−1}), where Q̂_n^{k−1} is the quantized model at iteration (k − 1) and Q_n(·) is a stochastic quantization operator that depends on the quantization probability p_{n,i}^k for each dimension i ∈ {1, 2, ..., d} of the model vector, and on the number b_n^k of bits used to represent each model vector dimension.
The i-th dimensional element [Q̂_n^{k−1}]_i of the previously quantized model vector is the center of the quantization range 2R_n^k, which is equally divided into (2^{b_n^k} − 1) quantization levels, yielding the quantization step size Δ_n^k = 2R_n^k/(2^{b_n^k} − 1).
In this coordinate system, the difference between the i-th dimensional element [θ_n^k]_i of the current model vector and [Q̂_n^{k−1}]_i is first mapped to

[c_n(θ_n^k)]_i = ([θ_n^k]_i − [Q̂_n^{k−1}]_i + R_n^k) / Δ_n^k,  (10)

where R_n^k ensures the non-negativity of the quantized value. Then, [c_n(θ_n^k)]_i is mapped to

[q_n(θ_n^k)]_i = ⌈[c_n(θ_n^k)]_i⌉ with probability p_{n,i}^k, and ⌊[c_n(θ_n^k)]_i⌋ with probability 1 − p_{n,i}^k,  (11)

where ⌈·⌉ and ⌊·⌋ are the ceiling and floor functions, respectively. Next, the probability p_{n,i}^k in (11) is selected such that the expected quantization error E[e_{n,i}^k] is zero, i.e.,

p_{n,i}^k = [c_n(θ_n^k)]_i − ⌊[c_n(θ_n^k)]_i⌋.  (12)

The choice of p_{n,i}^k in (12) ensures that the quantization in (11) is unbiased and that the quantization error variance E[(e_{n,i}^k)²] is less than (Δ_n^k)². This implies E‖e_n^k‖² ≤ d(Δ_n^k)². The convergence of CQ-GGADMM requires non-increasing quantization step sizes over iterations, i.e., Δ_n^k ≤ ωΔ_n^{k−1} for all k, where ω ∈ (0, 1). To satisfy this condition, the parameter b_n^k is chosen as

b_n^k = ⌈log_2(1 + (R_n^k/(ω R_n^{k−1}))(2^{b_n^{k−1}} − 1))⌉.  (13)

Under this condition, we get Δ_n^k ≤ ω^k Δ_n^0. With the aforementioned stochastic quantization procedure, b_n^k, R_n^k, and q_n(θ_n^k) = ([q_n(θ_n^k)]_1, ..., [q_n(θ_n^k)]_d) suffice to represent Q̂_n^k, and these quantities are transmitted to the neighbors. After receiving these values, Q̂_n^k can be reconstructed as [31], [32]

Q̂_n^k = Q̂_n^{k−1} + Δ_n^k q_n(θ_n^k) − R_n^k 1.  (14)

Now, we introduce a censoring condition that reduces the communication overhead by reducing the number of workers communicating at a given iteration. Under this condition, a worker is allowed to transmit only when the difference between the current and previously transmitted values is sufficiently large. However, we apply the censoring not to the model itself but to its quantized value, i.e., if a worker is not censored, it transmits its quantized model to its neighbors. According to the communication-censoring strategy, we have θ̂_n^{k+1} = Q̂_n^{k+1} provided that ‖θ̂_n^k − Q̂_n^{k+1}‖ ≥ τ_0 ξ^{k+1}, and θ̂_n^{k+1} = θ̂_n^k otherwise. The CQ-GGADMM updates are then obtained from (7)-(9) by replacing the exchanged models with their censored quantized versions θ̂: head workers solve (15), tail workers solve (16), and each worker updates its auxiliary variable α_n via (17), the analogues of (7), (8), and (9), respectively. The full procedure is summarized in Algorithm 1.

Algorithm 1 Censored Quantized Generalized Group ADMM (CQ-GGADMM)
1: Input: N, ρ, τ_0, ξ, f_n(θ_n) for all n
2: Initialize: θ_n^0 = 0, θ̂_n^0 = 0, Q̂_n^0 = 0, α_n^0 = 0 for all n
3: for k = 0, 1, 2, ..., K do
4:   Head worker n ∈ H in parallel:
5:     reconstructs θ̂_m^k, m ∈ N_n, via (14) if m is not censored
6:     computes its primal variable θ_n^{k+1} via (15)
7:     chooses p_{n,i}^{k+1} and b_n^{k+1} via (12) and (13), respectively
8:     quantizes its primal variable via (11)
9:     if ‖θ̂_n^k − Q̂_n^{k+1}‖ ≥ τ_0 ξ^{k+1} then
10:      sends q_n(θ_n^{k+1}), R_n^{k+1}, and b_n^{k+1} to its neighboring workers N_n and sets θ̂_n^{k+1} = Q̂_n^{k+1}
11:    else
12:      sets θ̂_n^{k+1} = θ̂_n^k (censored; no transmission)
13:    end if
14:   Tail worker m ∈ T in parallel:
15:     reconstructs θ̂_n^{k+1}, n ∈ N_m, via (14) if n is not censored
16:     computes its primal variable θ_m^{k+1} via (16)
17:     chooses p_{m,i}^{k+1} and b_m^{k+1} via (12) and (13), respectively
18:     quantizes its primal variable via (11)
19:     applies the same censoring test as in lines 9-13 and transmits only if not censored
20:   Every worker n ∈ V updates α_n^{k+1} via (17)
21: end for

Given p_{n,i}^k in (12) and b_n^k in (13), the convergence of CQ-GGADMM is provided in Section IV. Note that, when full arithmetic precision uses 32 bits, every transmission payload consists of the b_n^k d bits of q_n(θ_n^k) plus the bits required to represent R_n^k (at most 32) and b_n^k. Compared to GGADMM, whose payload size is 32d bits, CQ-GGADMM can achieve a huge reduction in communication overhead, particularly for large models, i.e., large d.
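A minimal sketch of the quantizer (10)-(14) and of the censoring test is given below (our own code; the choice R_n^k = max_i |[θ_n^k − Q̂_n^{k−1}]_i| is an assumption, as are all names):

```python
# Sketch: stochastic quantization (10)-(14) and the censoring test.
import numpy as np

rng = np.random.default_rng(0)

def stochastic_quantize(theta, Q_prev, b):
    """Quantize theta - Q_prev with b bits per dimension."""
    diff = theta - Q_prev
    R = np.max(np.abs(diff)) + 1e-12       # quantization range (assumed choice)
    delta = 2.0 * R / (2 ** b - 1)         # step size, cf. Delta_n^k
    c = (diff + R) / delta                 # non-negative coordinates, (10)
    low = np.floor(c)
    p = c - low                            # unbiasedness probability, (12)
    q = low + (rng.random(c.shape) < p)    # randomized rounding, (11)
    Q_new = Q_prev + delta * q - R         # receiver-side reconstruction, (14)
    return q.astype(int), R, Q_new

def is_censored(theta_hat_prev, Q_new, tau0, xi, k):
    """Censoring test: True when the update is too small to transmit."""
    return np.linalg.norm(theta_hat_prev - Q_new) < tau0 * xi ** k
```

By construction, E[Q_new] = theta, which matches the unbiasedness enforced by the choice of p_{n,i}^k in (12).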

IV. CONVERGENCE ANALYSIS
Before stating the main results of the paper, we make the following additional assumptions.

Assumption 2. There exists an optimal solution set to (P1) which has at least one finite element.

Assumption 3. The local cost functions f_n are convex.

Assumption 4. The local cost functions f_n are strongly convex with parameter µ_n > 0, i.e., f_n(y) ≥ f_n(x) + ⟨∇f_n(x), y − x⟩ + (µ_n/2)‖y − x‖², ∀x, y ∈ ℝ^d.

Assumption 5. The local cost functions f_n have L_n-Lipschitz continuous gradients (L_n > 0), i.e., ‖∇f_n(x) − ∇f_n(y)‖ ≤ L_n‖x − y‖, ∀x, y ∈ ℝ^d.

Assumptions 1-5 are key assumptions that are often used in the context of distributed optimization [13], [15], [21]. While only Assumptions 1-3 are needed to prove the convergence of CQ-GGADMM, Assumptions 4 and 5 are further required to show the linear convergence rate. Note that Assumption 2 ensures that problem (P2) has at least one optimal solution, denoted by θ*. Under Assumption 4, the function f is strongly convex with parameter µ = min_{1≤n≤N} µ_n, and from Assumption 5, we can see that f has an L-Lipschitz continuous gradient with L = max_{1≤n≤N} L_n. To proceed with the analysis, we start by writing the optimality conditions

∇f_n(θ_n*) + α_n* = 0 and θ_n* = θ_m*, ∀(n, m) ∈ E,

where θ_n* and α_n* are the optimal values of the primal and dual variables, respectively. We define the primal residual as r_{n,m}^{k+1} = θ_n^{k+1} − θ_m^{k+1}, ∀(n, m) ∈ E, and the dual residual in terms of the successive differences of the model updates. Finally, the total error is defined as θ_n^{k+1} − θ̂_n^{k+1}. The total error can be decomposed as the sum of two errors: (i) a random error coming from the quantization process, e_n^{k+1} = θ_n^{k+1} − Q̂_n^{k+1}, and (ii) a deterministic one due to the censoring strategy, ϵ_n^{k+1} = Q̂_n^{k+1} − θ̂_n^{k+1}. According to the communication-censoring strategy, we have ‖ϵ_n^{k+1}‖ ≤ τ_0 ξ^{k+1}; the total error can then be upper bounded in expectation, using (30), by 2C_0ψ^{k+1}, where C_0 = max{τ_0, √d Δ^0} and ψ = max{ξ, ω} ∈ (0, 1). To prove the convergence of the proposed algorithm, we start by stating and proving the first lemma, where we derive upper and lower bounds on the expected value of the optimality gap.

Lemma 1. Under Assumptions 1-3, we have the following bounds on the expected value of the optimality gap.

Proof: The details of the proof are deferred to Appendix VII-B.
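Before proceeding, the total-error bound stated above can be made explicit. The chain of inequalities below is our reconstruction from the definitions of e_n^{k+1}, ϵ_n^{k+1}, C_0, and ψ (a sketch, not quoted from the paper); it combines the triangle inequality, Jensen's inequality applied to the variance bound E‖e_n^{k+1}‖² ≤ d(Δ_n^{k+1})², and the decaying threshold and step-size sequences:

```latex
\mathbb{E}\|\theta_n^{k+1}-\hat{\theta}_n^{k+1}\|
\;\le\; \underbrace{\mathbb{E}\|\theta_n^{k+1}-\hat{Q}_n^{k+1}\|}_{\text{quantization error } e_n^{k+1}}
      + \underbrace{\|\hat{Q}_n^{k+1}-\hat{\theta}_n^{k+1}\|}_{\text{censoring error } \epsilon_n^{k+1}}
\;\le\; \sqrt{d}\,\Delta_n^{k+1} + \tau_0\,\xi^{k+1}
\;\le\; \sqrt{d}\,\Delta^{0}\,\omega^{k+1} + \tau_0\,\xi^{k+1}
\;\le\; 2\,C_0\,\psi^{k+1}.
```

The last step uses C_0 = max{τ_0, √d Δ^0} and ψ = max{ξ, ω}; this geometric decay is exactly what drives the convergence results below.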
Next, we present the first theorem, which states the asymptotic convergence of the proposed algorithm: we prove the convergence to zero in the mean square sense of both the primal and dual residuals, as well as the convergence to zero in the mean sense of the optimality gap.

Theorem 1. Suppose Assumptions 1-3 hold. Then the CQ-GGADMM iterates lead to (i) the convergence of the primal residual to zero in the mean square sense as k → ∞, i.e., lim_{k→∞} E‖r_{n,m}^k‖² = 0, ∀(n, m) ∈ E; (ii) the convergence of the dual residual to zero in the mean square sense as k → ∞; and (iii) the convergence of the optimality gap to zero in the mean sense as k → ∞, i.e., lim_{k→∞} Σ_{n=1}^N E[f_n(θ_n^k) − f_n(θ_n*)] = 0.

Proof: The proof can be found in Appendix VII-C. The linear convergence of the CQ-GGADMM algorithm is presented next.
Theorem 2. Suppose that Assumptions 1, 2, 4, and 5 hold and the dual variable α is initialized such that α^0 lies in the column space of the oriented incidence matrix M_−. For sufficiently small κ and ρ, the sequence of iterates of CQ-GGADMM converges linearly at a rate (1 + δ_2)/2, where δ_2 = max{(1 + κ)^{−1}, ψ²}.

Proof: The proof is provided in Appendix VII-D, where the conditions on κ and ρ are derived. In the proof, we satisfy the initialization condition that α^0 lies in the column space of M_− by taking α^0 = 0. This ensures that α^k always stays in the column space of M_−, so we can write α^k = M_−β^k. The convergence rate derived in the proof depends on the network topology through the values of σ_max(C), σ_max(M_−), and σ̃_min(M_−), on the properties of the local objective functions (µ and L), and on the penalty parameter ρ, but also on the censoring threshold parameter ξ as well as the parameter ω used to construct the quantization step sizes.

V. NUMERICAL RESULTS
To validate our theoretical results, we numerically evaluate the performance of CQ-GGADMM compared with GGADMM, C-GGADMM, and C-ADMM [13]. Note that C-ADMM performs censoring on top of the Jacobi (parallel), decentralized version of the standard ADMM. For the tuning parameters, we choose the values leading to the best performance of each algorithm.
Model and Datasets. All simulations are conducted using synthetic and real datasets. For the synthetic data, we use the datasets generated in [21]. We consider two decentralized consensus optimization problems: (i) linear regression and (ii) logistic regression. The details of the datasets used in our experiments are summarized in Table I. For each dataset, the samples are uniformly distributed across the N workers.

TABLE I
DATASETS USED IN THE EXPERIMENTS

Dataset               Task                  Type        d    # samples
synth-linear [21]     linear regression     synthetic   50   1200
Body Fat [34]         linear regression     real        14   252
synth-logistic [21]   logistic regression   synthetic   50   1200
Derm [34]             logistic regression   real        34   358

Graph Generation. Similarly to [33], we randomly generate a network consisting of N workers with a connectivity ratio p. The ratio p is defined as the actual number of edges divided by the number of edges of a fully connected graph, i.e., N(N − 1)/2. Such a random graph is created with p × N(N − 1)/2 edges that are chosen uniformly at random, while ensuring that the generated network is connected. Smaller values of p lead to a sparser graph, while the generated graph becomes denser as p approaches 1.
Communication Energy. We assume that the total system bandwidth of 2 MHz is equally divided across the transmitting workers. Therefore, the bandwidth available to the n-th worker (B_n) at every communication round when utilizing GGADMM is (4/N) MHz, since only half of the workers transmit at each communication round. On the other hand, the bandwidth available to each worker when using C-ADMM is (2/N) MHz. The power spectral density (N_0) is 10^{−6} W/Hz, and each upload/download transmission time (τ) is 1 ms. We assume a free-space model, and each worker needs to transmit at a power level that allows transmitting the model vector in one communication round (the rate is bottlenecked by the worst link). For example, using C-ADMM, each worker needs to find the transmission power that achieves the transmission rate R = (32d/1 ms) bits/sec. Using the Shannon capacity, the corresponding transmission power can be calculated as P = N_0 B_n (2^{R/B_n} − 1) D², where D is the distance to the farthest neighbor. Hence, the consumed energy is E = Pτ.
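For concreteness, the energy model above can be coded as follows. This is a sketch under the stated parameters (2 MHz total bandwidth, N_0 = 10^{−6} W/Hz, τ = 1 ms), assuming a free-space path gain of 1/D²; the function name, defaults, and example distance are illustrative:

```python
# Sketch: per-round transmit energy from the Shannon-capacity equation
# R = B * log2(1 + P / (N0 * B * D**2)), solved for P, then E = P * tau.
def transmit_energy(bits, N=24, D=10.0, total_bw=2e6, parallel_groups=2,
                    N0=1e-6, tau=1e-3):
    B = parallel_groups * total_bw / N   # per-worker bandwidth: (4/N) MHz for
                                         # GGADMM (groups=2), (2/N) MHz for C-ADMM
    R = bits / tau                       # rate needed to finish in one slot
    P = N0 * B * (D ** 2) * (2 ** (R / B) - 1)
    return P * tau                       # energy in joules

# Full-precision payload (32d bits, d = 50) vs. a hypothetical 4-bit one:
print(transmit_energy(32 * 50), transmit_energy(4 * 50 + 32 + 8))
```

Because the required power grows exponentially in the payload-to-bandwidth ratio R/B, shrinking the payload via quantization yields the orders-of-magnitude energy savings reported below.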

A. Linear Regression
In this case, the local cost function at worker n is explicitly given by f_n(θ) = ½‖X_nθ − y_n‖², where X_n ∈ ℝ^{s×d} and y_n ∈ ℝ^{s×1} are private to each worker n ∈ V, and s denotes the size of the data at each worker. Figs. 2-(a) and 3-(a) corroborate that both C-GGADMM and CQ-GGADMM achieve the same convergence speed as GGADMM and significantly outperform C-ADMM, thanks to the alternating updates, censoring, and stochastic quantization. Note that, although C-ADMM allows workers to update their models in parallel, it requires a significantly higher number of iterations. Figs. 2-(b) and 3-(b) show that C-GGADMM achieves a 10^{−4} objective error with the minimum number of communication rounds, outperforming all other algorithms. We also note that introducing quantization on top of censoring increases the number of communication rounds. However, in terms of the total number of transmitted bits and consumed energy, CQ-GGADMM outperforms all algorithms.
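For this quadratic loss, the head update (15) admits a closed form. The derivation below is a standard least-squares computation under the update structure of Sec. III (our own illustration, not reproduced from the paper):

```latex
\theta_n^{k+1}
= \arg\min_{\theta_n}\ \tfrac{1}{2}\big\|X_n\theta_n - y_n\big\|^2
  + \big\langle \alpha_n^{k}, \theta_n \big\rangle
  + \tfrac{\rho}{2}\sum_{m\in\mathcal{N}_n}\big\|\theta_n-\hat{\theta}_m^{k}\big\|^2
= \Big(X_n^{\top}X_n + \rho\, d_n I\Big)^{-1}
  \Big(X_n^{\top}y_n - \alpha_n^{k} + \rho\sum_{m\in\mathcal{N}_n}\hat{\theta}_m^{k}\Big),
```

obtained by setting the gradient with respect to θ_n to zero; the tail update (16) has the same form, with the heads' fresh models appearing in the quadratic term.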

B. Logistic Regression
In this section, we consider the binary logistic regression problem. We assume that worker n owns a data matrix X_n = (x_{n,1}, ..., x_{n,s})^T ∈ ℝ^{s×d} along with the corresponding labels y_n = (y_{n,1}, ..., y_{n,s}) ∈ {−1, 1}^s. The local cost function of worker n is then given by f_n(θ) = (1/s) Σ_{j=1}^s log(1 + exp(−y_{n,j} x_{n,j}^T θ)) + (µ_0/2)‖θ‖², where µ_0 is the regularization parameter. As observed from Figs. 4-(a) and 5-(a), C-GGADMM requires more iterations than GGADMM to achieve the same loss, which leads to either no saving in the number of communication rounds (see Fig. 4-(b)) or only a small saving (see Fig. 5-(b)). It thus appears that, without quantization, the update of each individual worker is important at each iteration, so censoring hurts the convergence speed. However, interestingly, when introducing stochastic quantization and performing censoring on top of the quantized models, we overcome this issue and obtain significant savings in both the number of communication rounds and the communication overhead per iteration.
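Since the logistic primal update has no closed form, it can be solved inexactly. The sketch below (our own; the paper does not prescribe the inner solver, and all names are illustrative) evaluates the gradient of the regularized local objective plus the augmented terms of (15)-(16), and runs a few plain gradient steps:

```python
# Sketch: inexact primal update for the logistic local loss.
import numpy as np

def local_obj_grad(theta, Xn, yn, alpha_n, nbr_models, rho, mu0):
    """Gradient of f_n(theta) + <alpha_n, theta> + (rho/2) sum ||theta - t||^2."""
    s = Xn.shape[0]
    z = -yn * (Xn @ theta)
    sig = 1.0 / (1.0 + np.exp(-z))                    # sigmoid of the margin
    grad_f = -(Xn.T @ (yn * sig)) / s + mu0 * theta   # logistic + L2 term
    grad_aug = alpha_n + rho * sum(theta - t for t in nbr_models)
    return grad_f + grad_aug

def primal_update(theta, Xn, yn, alpha_n, nbr_models, rho, mu0,
                  lr=0.1, steps=50):
    for _ in range(steps):                            # a few gradient steps
        theta = theta - lr * local_obj_grad(theta, Xn, yn, alpha_n,
                                            nbr_models, rho, mu0)
    return theta
```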

C. Impact of the Network Graph Density
To study how the network graph density (the node degree) affects the performance of the proposed approach, we conduct an experiment using linear regression on the real dataset under different graph topologies. In particular, we consider two graphs with different densities, as shown in Fig. 6. The first graph, denoted by Graph 1, is a sparse graph (generated with p = 0.2), where each worker has few links (communicating with a small number of neighboring workers). For example, worker 12 communicates only with one neighbor (worker 8). On the other hand, the dense graph (Graph 2) is generated with a connectivity ratio p = 0.4, where each worker has at least three links (three neighbors). We can see from Fig. 6 that a denser graph leads to faster convergence for all algorithms, since each worker uses more information per iteration. However, the relative performance gap in terms of the number of communication rounds remains the same, i.e., C-GGADMM achieves the minimum number of communication rounds, followed by CQ-GGADMM, which confirms the findings of Fig. 3-(b) for more choices of network graph density.
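The graph-generation procedure of the "Graph Generation" paragraph can be sketched as follows. This is our reconstruction; the retry-until-connected rule is an assumption, and all names are illustrative:

```python
# Sketch: sample floor(p * N(N-1)/2) distinct edges uniformly at random,
# retrying until the resulting graph is connected (union-find check).
import itertools, random

def random_connected_graph(N, p, seed=0):
    rng = random.Random(seed)
    all_edges = list(itertools.combinations(range(N), 2))
    target = max(int(p * N * (N - 1) / 2), N - 1)  # need >= N-1 for connectivity
    while True:
        edges = rng.sample(all_edges, target)
        parent = list(range(N))                    # union-find over workers
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        for n, m in edges:
            parent[find(n)] = find(m)
        if len({find(n) for n in range(N)}) == 1:  # single connected component
            return edges

print(len(random_connected_graph(12, 0.4)))  # about 0.4 * 66 = 26 edges
```

For CQ-GGADMM, the sampled graph must in addition be bipartite (Assumption 1), which can be verified with the 2-coloring sketch of Sec. II.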

D. Impact of the non-IIDness
To study the impact of non-IIDness on the convergence of the different algorithms, we artificially create a non-IID partition of the Derm dataset by first sorting the data by label and then dividing them between the workers. In Fig. 7, we study the effect of non-IIDness on the performance of the different algorithms by plotting the loss as a function of the number of iterations, communication rounds, transmitted bits, and sum energy. Although it is clear that the non-IIDness of the data hurts the performance of all algorithms, CQ-GGADMM still provides significant savings in terms of communication rounds, and especially in transmitted bits and sum energy, compared to the other baselines. Since the distribution of each local dataset differs from the global distribution, each worker's local objective can be different. As a result, the local updates can drift away from the global optimum. Hence, the averaged model may also be far from the global optimum, especially when the difference between the local updates is significant, i.e., when the degree of heterogeneity between the local datasets is high. Consequently, the global model has significantly lower performance in the non-IID setting than in the IID one.
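The sort-by-label partition described above can be sketched as follows (our own code with illustrative names; the paper gives no implementation):

```python
# Sketch: non-IID partition by sorting samples by label before splitting.
import numpy as np

def non_iid_partition(X, y, num_workers):
    order = np.argsort(y)                        # group same-label samples
    shards_X = np.array_split(X[order], num_workers)
    shards_y = np.array_split(y[order], num_workers)
    return list(zip(shards_X, shards_y))         # one (X_n, y_n) per worker
```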

VI. CONCLUSIONS
In this paper, we have proposed a communication-efficient decentralized ML algorithm that extends GADMM and Q-GADMM to arbitrary bipartite and connected topologies. Moreover, the proposed algorithm leverages censoring (sparsification) to minimize the number of communication rounds per worker. Utilizing a decreasing sequence of censoring thresholds, stochastic quantization, and adjusting the quantization range at every iteration such that a linear convergence rate is achieved are the key features that make CQ-GGADMM robust to errors while preserving its convergence guarantees. Numerical results on convex linear and logistic regression tasks corroborate the advantages of CQ-GGADMM over GGADMM and C-ADMM. An interesting direction is to consider solving the distributed learning problem over a time-varying topology. Additionally, extending the derivations to the stochastic non-convex setting is of paramount importance, especially when training deep neural networks. Finally, improving model training on non-IID data is a key challenge that needs to be addressed.

A. Basic identities and inequalities
For any two vectors x, y ∈ ℝ^d and any η > 0, we have 2⟨x, y⟩ ≤ η‖x‖² + η^{−1}‖y‖² and ‖x + y‖² ≤ (1 + η)‖x‖² + (1 + η^{−1})‖y‖². For any two matrices A and B of compatible dimensions, we have ‖AB‖_F ≤ σ_max(A)‖B‖_F, where σ_max(A) denotes the maximum singular value of the matrix A.

B. Proof of Lemma 1
Using (15), the update of the head workers can be written via the first-order optimality condition of (15). Using the update of α_n^k in (17) and the definition of the dual residual, we get the corresponding relation for the head workers. Similarly, we use the update of the tail workers as in (16). Therefore, we obtain the following inequality. Summing over all workers, and using the definition of α_n^{k+1}, n ∈ V, in the right-hand side of (40), together with λ_{m,n}^{k+1} = −λ_{n,m}^{k+1} and θ_n* = θ_m*, ∀(n, m) ∈ E, we can write the upper bound. This proves (i) of Lemma 1. To prove (ii), we know from the optimality conditions that ∇f_n(θ_n*) + α_n* = 0. Thus, θ_n* minimizes the function f_n(θ_n) + ⟨α_n*, θ_n⟩, and we obtain the corresponding inequality for n ∈ H. Similarly, we obtain one for m ∈ T. Summing over all workers yields the lower bound, where we used the definition of α_n* in (a), and λ_{m,n}^{k+1} = −λ_{n,m}^{k+1} together with θ_n* = θ_m* in (b).

C. Proof of Theorem 1
Multiplying (26) by −1, adding (25), and multiplying the sum by 2, we get (47). Since λ_{n,m}^{k+1} = λ_{n,m}^k + ρ r_{n,m}^{k+1} + ρ(ϵ_m^{k+1} − ϵ_n^{k+1}), we can expand this expression. Using r_{n,m}^{k+1} = (1/ρ)(λ_{n,m}^{k+1} − λ_{n,m}*) − (1/ρ)(λ_{n,m}^k − λ_{n,m}*) + ϵ_n^{k+1} − ϵ_m^{k+1}, we examine the different terms of (47), starting from the first one, which can be rewritten with a factor −2ρ. Using the identity r_{n,m}^{k+1} = (1/ρ)(λ_{n,m}^{k+1} − λ_{n,m}^k) + ϵ_n^{k+1} − ϵ_m^{k+1}, adding both equations, and rearranging the terms, we get an intermediate identity. Using the update of α_m^{k+1}, i.e., α_m^{k+1} = α_m^k + ρ Σ_{n∈N_m} r_{m,n}^{k+1}, we can rewrite (62), where we used r_{m,n}^{k+1} = −r_{n,m}^{k+1} after summing over m ∈ T. Going back to (57), we bound the terms on the right-hand side using identity (31). Finally, we use both identities (30) and (31) to bound the −2ρ term, where {η_i}_{i=1}^5 are arbitrary positive constants to be specified later on. Using these bounds, rearranging the terms in (64), fixing the values of {η_i}, and rearranging once more, we arrive, using (24), at a recursion with γ_1 = 64ρC_0ψ^0|E| and γ_2 = 88ρC_0²|E|. Now, we define the Lyapunov function V^k. As a consequence, we can write a contraction-type inequality, where we have used the fact that 1 − ψ^k/(2ψ^0) ∈ (0, 1). To show that the resulting infinite product is also
finite, we consider its logarithm, i.e.,
where we have used that log(1 − z) = O(z) as z → 0, so the sum is finite, and we conclude that the sequence E[V^k] is upper bounded by a finite quantity that we denote as V̄. Going back to (75) and taking the sum from k = 0 to ∞ while using the upper bound on E[V^k], we get Σ_{k=0}^∞ Σ_{(n,m)∈E} E‖r_{n,m}^{k+1}‖² < ∞. From (i) and (ii) of Lemma 1, we conclude that lim_{k→∞} Σ_{n=1}^N E[f_n(θ_n^k) − f_n(θ_n*)] = 0.

D. Proof of Theorem 2
The proof of Theorem 2 follows steps similar to those of the convergence-rate proof in [13]. However, the alternating nature of our algorithm makes the updates happen in an asymmetric manner, in contrast to the symmetric updates in [13], which makes the proof more involved. For a bipartite graph, the adjacency matrix can be written as A = [0_{rr}, B; B^T, 0_{ss}], where r = |H| and s = |T| are the cardinalities of H and T, respectively, 0_{rr} and 0_{ss} are the zero matrices of order r × r and s × s, respectively, and B ∈ ℝ^{r×s} is called the bi-adjacency matrix. The adjacency matrix is a Boolean matrix whose (n, m)-th element equals one if (n, m) ∈ E and zero otherwise. Using this notation and (93), we can rewrite (91). Using ∇f(θ*) + M_−β* = 0 and A = C + C^T, and then multiplying both sides by θ^{k+1} − θ*, we get (96). The first term of the right-hand side can be rewritten as (97). Expanding the first term of (97), and replacing the terms derived in (97) and (98) by their expressions in (96), we obtain the resulting identity. Using the strong convexity of the function f, we can lower bound the left-hand side of (96). Hence, we can write (101). Now, using identities (32) and (33), we get the bounds (102)-(107). Replacing the bounds derived in (102)-(107) in (101) and introducing κ > 0, we get the next inequality. Using that E‖E^{k+1}‖_F² ≤ E‖E^k‖_F², and rearranging the terms, we can further write the bound with γ = σ_max²(M_−)/(2η_2). In order to bound the term E‖β^{k+1} − β*‖_F² on the left-hand side, we use (95). Using identity (30), we can further expand; using (34) for the first term and (30) for the second term of the right-hand side, we can then write the bound in terms of σ̃_min(M_−), the minimum non-zero singular value of M_−, since both β^{k+1} and β* belong to the column space of M_−. From Assumption 5, we also use the bound on E‖E^{k+1}‖_F². Plugging the bound obtained from (114) into (110), we get the desired inequality. To ensure that there is a decrease in the optimality gap, we need to determine for which values of ρ we have c − b²ρ − aρ² > 0 and µ − cκ > 0. In other words, we need to look for ρ such that (116) holds. To ensure that we can find ρ such that (116) is satisfied, we need to impose that the discriminant ∆ of the quadratic equation is positive. Since ∆ is a third-order polynomial in κ, finding for which values of κ > 0 we have ∆ > 0 is not straightforward. However, since ∆ → µ² > 0 as κ → 0, and knowing that ∆ is a decreasing function of κ with ∆ → −∞ as κ → ∞, we deduce that there exists κ̄ > 0 such that ∆ > 0 for 0 < κ < κ̄. In the remainder, we consider κ such that 0 < κ < κ̄. Thus, (116) holds for 0 < ρ < ρ̄, where ρ̄ is the positive root of the corresponding quadratic equation. Therefore, we can write the one-step contraction; rearranging the terms and applying the resulting inequality iteratively, and defining δ_1 = min{(1 + κ)^{−1}, ψ²} and δ_2 = max{(1 + κ)^{−1}, ψ²}, we can further write the claimed linear rate, where we have used in (a) the fact that δ_2 ≤ (1 + δ_2)/2, since κ > 0 and ψ ∈ (0, 1), and in (b) that (2δ_1)/(1 + δ_2) ∈ (0, 1).