Abstract
Decentralized machine learning has been playing an essential role in improving training efficiency. It has been applied in many real-world scenarios, such as edge computing and IoT. In practice, however, networks are dynamic, and there is a risk of information leakage during the communication process. To address this problem, we propose a decentralized parallel stochastic gradient descent algorithm with differential privacy in dynamic networks, D-(DP)2SGD. With rigorous analysis, we show that D-(DP)2SGD converges while satisfying ε-DP, achieving almost the same convergence rate as previous works without privacy concerns. To the best of our knowledge, our algorithm is the first known decentralized parallel SGD algorithm that can be implemented in dynamic networks and takes privacy preservation into consideration.
1. Introduction
Decentralized machine learning, as a modeling mechanism that allocates training tasks and compute resources to achieve a balance between training speed and accuracy, has demonstrated strong potential in various areas, especially for training large models on large datasets [1–3], such as ImageNet [4]. Typically, assume that there are n workers, each holding its own local data; the decentralized machine learning problem aims to solve an empirical risk minimization problem of the following form: where f_i is the local loss function at node i. The objective f can be rephrased as a linear combination of the local loss functions f_i. This formulation covers many popular decentralized learning models, including deep learning [5], linear regression [6], and logistic regression [7].
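To make the objective concrete, the following is a minimal numpy sketch of the decentralized empirical risk minimization objective. The least-squares local loss and the synthetic data layout are illustrative assumptions, not the paper's setting.

```python
import numpy as np

# Hypothetical local loss: least-squares on each worker's private data.
def local_loss(x, A_i, b_i):
    """f_i(x) = (1 / 2m) * ||A_i x - b_i||^2 over worker i's m local samples."""
    m = A_i.shape[0]
    r = A_i @ x - b_i
    return 0.5 * np.dot(r, r) / m

def global_objective(x, data):
    """f(x) = (1/n) * sum_i f_i(x): the decentralized ERM objective."""
    return float(np.mean([local_loss(x, A, b) for A, b in data]))

rng = np.random.default_rng(0)
n, m, d = 4, 20, 3          # workers, samples per worker, model dimension
data = [(rng.normal(size=(m, d)), rng.normal(size=m)) for _ in range(n)]
x = np.zeros(d)
val = global_objective(x, data)
```

Each worker can evaluate only its own f_i; the global objective f is never materialized at a single node in the decentralized setting.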
In recent years, decentralized machine learning has attracted much attention for deriving convergent solutions while reducing communication costs [8, 9]. Previous works mainly study decentralized collaborative learning under a static network assumption. For example, decentralized parallel stochastic gradient descent (D-PSGD) is one of the fundamental methods for solving large-scale machine learning tasks in static networks [1]. In D-PSGD, all nodes compute stochastic gradients using their local datasets and exchange the results with their neighbors iteratively. In practice, however, dynamicity is an important feature of networks, especially large-scale networks such as IoT [10] and V2V networks [11, 12], as nodes can move around and join or leave the network at any time. On the other hand, in large-scale networks, it is hard or even impossible to ensure that every node is reliable [13, 14]. Consequently, the collaborative learning process unavoidably faces the risk of information leakage. Hence, when designing decentralized machine learning algorithms, it has become essential to consider both the dynamicity of the network topology and the demand for privacy preservation. However, to the best of our knowledge, no existing work takes both factors into consideration simultaneously. In this work, we focus on this missing piece in decentralized learning.
Specifically, based on differential privacy, we present a new dynamic decentralized stochastic gradient descent algorithm (D-(DP)2SGD), which offers strong protection for the local datasets of decentralized nodes. With rigorous analysis, we show that our proposed D-(DP)2SGD algorithm satisfies ε-DP and achieves the same order of convergence rate as its non-private counterpart when the total number of iterations is large enough. Empirically, we conduct extensive experiments on the CIFAR-10 dataset for image classification tasks to evaluate the performance of our proposed algorithm.
The remainder of this paper is organized as follows. We present our survey on related work in Section 2. We then introduce our model, problem, and some useful preliminary knowledge in Section 3. Our algorithm, main results, and analysis are presented in Section 4, Section 5, and Section 6, respectively. Experimental results are illustrated in Section 7. The whole paper is concluded in Section 8.
2. Related Work
In this section, we introduce closely related work.
2.1. Decentralized Parallel Stochastic Gradient Algorithms
Most existing work on decentralized parallel stochastic gradient algorithms focuses on static networks, in both synchronous and asynchronous settings [1, 15–18]. Under the synchronous setting, Lian et al. [1] illustrated the advantage of decentralized algorithms over centralized ones and showed that the proposed D-PSGD converges at a rate of O(1/√(nT)) when T is large enough, where T is the number of iterations and n is the total number of nodes in the network. Qureshi et al. [17] proposed an algorithm called S-ADDOPT and analyzed its convergence rate.
Feyzmahdavian et al. [19] and Agarwal et al. [20] considered decentralized SGD in the asynchronous setting, allowing workers to compute gradients using stale weights. Asynchronous algorithms avoid idling any worker, which reduces communication overhead, and they are robust because they can still work well when some of the computing workers are down. Lian et al. [16] proposed an asynchronous decentralized parallel SGD algorithm for convex optimization and analyzed the convergence rate of AD-PSGD. Then, Lian et al. [21] proposed an asynchronous decentralized parallel stochastic gradient descent algorithm for nonconvex optimization, established its ergodic convergence rate, and proved that linear speedup is achievable when the number of workers is suitably bounded.
2.2. Differentially Private Decentralized Learning
Most existing work on differentially private decentralized learning focuses on static networks [22–25]. In contrast, our work combines decentralized learning and dynamic networks in a DP setting. Lu et al. [24] proposed an asynchronous federated learning scheme with differential privacy for resource sharing in vehicular networks. Cheng et al. [26] proposed a new learning algorithm, LEASGD (Leader-Follower Elastic Averaging Stochastic Gradient Descent), driven by a novel Leader-Follower topology and a differential privacy model, and provided a theoretical analysis of the convergence rate and the trade-off between performance and privacy in the private setting. Based on the research in [16], Xu et al. [2] designed an asynchronous decentralized parallel stochastic gradient descent algorithm with differential privacy (A(DP)2SGD) and established its convergence rate. In all of these works, decentralized parallel SGD with differential privacy in dynamic networks remains an open problem.
3. System Model and Problem Description
We consider a network consisting of n computational nodes (each could be a machine or a GPU). At each iterate t, the network topology is denoted by a graph G_t = (V, E_t), where V is the set of computational nodes, |V| = n, and E_t is the set of communication edges at iterate t. If there exists an edge from node i to node j at iterate t, then (i, j) ∈ E_t. In a connected network, two nodes are neighbors if they can be connected directly by an edge, i.e., the nodes can communicate with each other. The set of neighbors of node i at iterate t is denoted by N_i^t, and we define the degree d_i^t = |N_i^t|. We assume that the node set V remains unchanged, but the connections between nodes can change after every iteration. The network is assumed to be strongly connected, i.e., for any two nodes i and j, there exists a path from i to j at each iterate t. Some frequently used notations are summarized in Table 1.
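A per-iterate topology of this kind can be represented by a symmetric adjacency matrix. The sketch below samples a random undirected graph for one iterate; the ring backbone used to guarantee connectivity is our own illustrative device, not part of the paper's model.

```python
import numpy as np

def sample_topology(n, p, rng):
    """Sample the iterate-t communication graph as a symmetric 0/1 adjacency
    matrix. A ring backbone is kept so the graph stays connected, matching
    the strong-connectivity assumption (the backbone is our choice)."""
    A = np.triu((rng.random((n, n)) < p).astype(int), k=1)
    A = A + A.T                          # undirected: adjacency is symmetric
    for i in range(n):                   # ring edges guarantee connectivity
        j = (i + 1) % n
        A[i, j] = A[j, i] = 1
    return A

rng = np.random.default_rng(0)
A = sample_topology(6, 0.3, rng)
degrees = A.sum(axis=1)                  # the d_i^t exchanged by the algorithm
# Connectivity check: (I + A)^(n-1) has no zero entries iff the graph is connected.
reach = np.linalg.matrix_power(np.eye(6, dtype=int) + A, 5)
assert (reach > 0).all()
```

Resampling A at every iterate models the dynamic topology: edges change, but the node set and connectivity are preserved.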
In a decentralized network, the data is stored at the nodes, and each node i is associated with a local loss function f_i(x) = E_{ξ∼D_i} F(x; ξ), where D_i is the distribution associated with the local data at node i and ξ is a data sample drawn from D_i.
In this work, we consider the following optimization problem: min_{x∈R^d} f(x) = E_{i∼U} f_i(x), where U is the uniform distribution over the n nodes.
Similarly, we say that an algorithm gives an ε-approximation solution if (1/T) Σ_{t=1}^{T} E‖∇f(x̄_t)‖² ≤ ε, where x̄_t is the average of the local variables over all nodes at iterate t and T is the maximum number of iterations.
We next review the definition of differential privacy, which is originally proposed by Dwork [27].
Definition 1 (see [27] (Differential Privacy)). Given ε > 0, a randomized mechanism M with domain D preserves ε-differential privacy if for any pair of adjacent datasets D and D′ (two datasets D and D′ are adjacent if there exists an index j such that ξ_j ≠ ξ′_j and ξ_k = ξ′_k for all k ≠ j, i.e., they differ in exactly one element) and for all sets S ⊆ Range(M), Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S], where Range(M) is the output range of mechanism M.
Informally, differential privacy means that the distributions over the outputs of the randomized algorithm should be nearly identical for two adjacent input datasets. The constant ε measures the privacy level of the randomized mechanism M: a larger ε implies a lower privacy level. Therefore, an appropriate constant ε should be chosen to balance the accuracy and the privacy level of the mechanism M.
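Definition 1 can be illustrated with the classical randomized-response mechanism on a single bit, a standard textbook example (not a mechanism used by the paper): report the true bit with probability e^ε/(1 + e^ε), otherwise flip it. The output distributions on adjacent inputs then differ by at most a factor e^ε.

```python
import numpy as np

def randomized_response_probs(bit, eps):
    """Pr[output | input = bit] for randomized response: keep the true bit
    with probability e^eps / (1 + e^eps), otherwise flip it (eps-DP)."""
    p_true = np.exp(eps) / (1.0 + np.exp(eps))
    return {bit: p_true, 1 - bit: 1.0 - p_true}

eps = 1.0
p0 = randomized_response_probs(0, eps)   # output distribution on dataset D
p1 = randomized_response_probs(1, eps)   # output distribution on adjacent D'
for o in (0, 1):                         # the eps-DP inequality of Definition 1
    assert p0[o] <= np.exp(eps) * p1[o] + 1e-12
```

The inequality is tight for the output equal to the true bit, which is why this mechanism is exactly ε-DP and not better.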
Then, we introduce the definition of sensitivity, which plays a key role in the design of differential privacy mechanisms.
Definition 2 (see [28] (Sensitivity)). The sensitivity of a function s is defined as Δs = max_{D,D′} ‖s(D) − s(D′)‖₁, where the maximum is taken over all pairs of adjacent datasets D and D′.
The sensitivity captures the magnitude by which a single individual's data can change the mechanism's output in the worst case. Next, we introduce the Laplace mechanism.
Definition 3 (see [27] (The Laplace mechanism)). Given any function s : D → R^d, the Laplace mechanism is defined as M(D) = s(D) + (Y_1, …, Y_d), where the Y_i are i.i.d. random variables drawn from Lap(λ). The variance of the distribution Lap(λ) is 2λ², and setting λ = Δs/ε guarantees ε-differential privacy according to the property of the Laplace mechanism.
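A minimal sketch of the Laplace mechanism of Definition 3, assuming the sensitivity has already been computed (here the released quantity and its sensitivity are placeholder values):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release value + i.i.d. Lap(lambda) noise per coordinate with
    lambda = sensitivity / epsilon, which preserves epsilon-DP when
    `sensitivity` is the L1 sensitivity of the released quantity."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

rng = np.random.default_rng(0)
x = np.ones(4)                           # placeholder query output s(D)
private_x = laplace_mechanism(x, sensitivity=0.1, epsilon=1.0, rng=rng)
```

Note that the noise scale grows as ε shrinks, which is exactly the accuracy/privacy trade-off discussed above.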
Throughout the paper, we adopt the following commonly used assumptions:
(1) Lipschitzian gradient: all functions f_i have L-Lipschitzian gradients.
(2) Unbiased estimation: E_{ξ∼D_i} ∇F(x; ξ) = ∇f_i(x).
(3) Bounded variance: assume the variance of the stochastic gradient is bounded for any x, with ξ sampled from the distribution D_i and i from the uniform distribution U. This implies there exist constants σ and ζ such that E_{ξ∼D_i} ‖∇F(x; ξ) − ∇f_i(x)‖² ≤ σ² and E_{i∼U} ‖∇f_i(x) − ∇f(x)‖² ≤ ζ².
(4) Bounded subgradient: assume that the subgradient of F is G-bounded for all x and ξ, i.e., ‖∂F(x; ξ)‖ ≤ G.
4. Algorithm
The D-(DP)2SGD algorithm can be described as follows: each node i maintains its own local variable x_i^t and runs the following steps.
(i) Sample data: each node samples a training data ξ_i^t.
(ii) Compute gradient: each node computes the stochastic gradient ∇F(x_i^t; ξ_i^t) using the current local variable x_i^t and the sample ξ_i^t, where i is the node index and t is the iterate number.
(iii) Add noise: randomly generate the Laplace noise η_i^t and add it to the variable x_i^t to get the perturbed variable x̃_i^t = x_i^t + η_i^t.
(iv) Communication: send the perturbed variable x̃_i^t and the degree d_i^t to its neighbors; receive x̃_j^t and d_j^t from each neighbor j ∈ N_i^t.
(v) Determine the matrix W_t: determine W_t according to the local network topology, where the entry W_ij^t describes how much node j can affect node i at iterate t.
(vi) Weighted average: compute the weighted average of the perturbed variables received from the neighbors using W_t: Σ_j W_ij^t x̃_j^t.
(vii) Gradient update: each node updates its local variable using the weighted average and the local stochastic gradient: x_i^{t+1} = Σ_j W_ij^t x̃_j^t − γ ∇F(x_i^t; ξ_i^t).
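The seven steps above can be sketched for one synchronous iterate as follows. The Metropolis rule used to build the mixing matrix is one standard way to obtain a symmetric doubly stochastic matrix from node degrees; the paper's exact rule for W_t is not reproduced here, and all numeric parameters are illustrative.

```python
import numpy as np

def metropolis_weights(A):
    """Build a symmetric doubly stochastic mixing matrix W_t from the
    iterate's adjacency matrix (Metropolis rule -- one common choice)."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if A[i, j] and i != j:
                W[i, j] = 1.0 / (max(deg[i], deg[j]) + 1.0)
        W[i, i] = 1.0 - W[i].sum()
    return W

def ddp2sgd_step(X, grads, A, lr, sensitivity, epsilon, rng):
    """One synchronous iterate of the seven steps above, over all n nodes.
    X, grads: (n, d) arrays of stacked local variables / stochastic gradients."""
    noise = rng.laplace(scale=sensitivity / epsilon, size=X.shape)  # step (iii)
    X_tilde = X + noise              # perturbed variables shared with neighbors
    W = metropolis_weights(A)        # steps (iv)-(v)
    return W @ X_tilde - lr * grads  # steps (vi)-(vii): average, then SGD step

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
X = rng.normal(size=(4, 2))
grads = rng.normal(size=(4, 2))
X_next = ddp2sgd_step(X, grads, A, lr=0.05, sensitivity=0.01, epsilon=1.0, rng=rng)
```

Because only the perturbed variables x̃ leave a node, the raw local variables (and hence the local data) are never exposed to neighbors.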
From a global view, we define the concatenation of all local variables, perturbed variables, and Laplace noises by the matrices X_t, X̃_t, and H_t, and the concatenation of all random samples and stochastic gradients by the vectors ξ_t and ∇F(X_t; ξ_t), respectively:
Then, the t-th iterate of Algorithm 1 can be described as the following update: X_{t+1} = X̃_t W_t − γ ∇F(X_t; ξ_t), i.e., X_{t+1} = (X_t + H_t) W_t − γ ∇F(X_t; ξ_t).
5. Main Results
In this section, we present the main results, which guarantee the privacy and the convergence rate of our proposed algorithm.
Theorem 4. Let the assumptions hold. If the noise components η_i^t are i.i.d. random variables drawn from the Laplace distribution with parameter λ = Δs/ε for all i and t, then our proposed algorithm guarantees ε-differential privacy.
Theorem 5. Under the assumptions and an appropriate choice of the step size γ, we can obtain the convergence rate of Algorithm 1 as follows:
This theorem characterizes the convergence of the average x̄_t of all local optimization variables. To take a closer look at this result, we choose an appropriate step size in Theorem 5 to obtain the following corollary:
Corollary 6. Under the same assumptions as in Theorem 5, by setting the step size appropriately, we have the following convergence rate, provided the total number of iterates T is sufficiently large:
6. Result Analysis
In this section, we analyze the privacy preservation and the convergence rate of D-(DP)2SGD.
For convenience, we define the following quantities:
6.1. Privacy Analysis
Proof of Theorem 4. From the definition of sensitivity, we obtain
Note that s(D) and s(D′) lie in R^d. According to the definition of the ℓ1-norm, we have
where s_k(D) and s_k(D′) are the k-th components of s(D) and s(D′), respectively.
Consider an arbitrary output vector. Then, from the density of the Laplace distribution, we get
Then,
where the first inequality comes from the triangle inequality, and the last inequality follows from (19).
Therefore, combining the above, we can obtain
6.2. Convergence Rate Analysis
In order to obtain the result of convergence rate, we first give some lemmas.
Lemma 7. W_t is a symmetric doubly stochastic matrix.
Proof. From the definition in Equation (11), we can obtain that (1) W_ij^t = W_ji^t for all i, j; (2) Σ_j W_ij^t = 1 for all i; (3) Σ_i W_ij^t = 1 for all j. Therefore, W_t is a symmetric doubly stochastic matrix.
Lemma 8. Define Q = I − (1/n)11^T, where I is the identity matrix and 1 is the all-ones vector. Assume that there exists a ρ ∈ [0, 1) such that the magnitude of every eigenvalue of W_t other than 1 is at most ρ for all t. Then, for any x ∈ R^n, ‖W_t ⋯ W_1 Q x‖ ≤ ρ^t ‖Q x‖.
Proof. Let z_t = W_t ⋯ W_1 Q x. We prove this lemma by induction. For t = 0, ‖z_0‖ = ‖Q x‖.
We assume that the claim holds for t − 1, i.e., ‖z_{t−1}‖ ≤ ρ^{t−1} ‖Q x‖. Then, for iterate t, note that z_t = W_t z_{t−1}, and we have
According to Lemma 7, W_t is symmetric and doubly stochastic. Then, 1 is an eigenvector of W_t, and 1 is the corresponding eigenvalue. According to the spectral theorem for Hermitian matrices, we can construct a basis of R^n composed of the eigenvectors of W_t starting from 1. From Equation (25), the magnitudes of all other eigenvectors' associated eigenvalues are smaller than or equal to ρ. Note that z_{t−1} is orthogonal to 1; then we can find
By induction, we complete the proof.
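Lemmas 7 and 8 can be checked numerically. The sketch below uses a hand-picked symmetric doubly stochastic matrix (not one produced by the paper's rule) and verifies that averaging contracts the disagreement from the mean by the second-largest eigenvalue magnitude.

```python
import numpy as np

# A hand-picked symmetric doubly stochastic W on a 4-node connected graph.
W = np.array([[0.5 , 0.25, 0.25, 0.  ],
              [0.25, 0.5 , 0.25, 0.  ],
              [0.25, 0.25, 0.25, 0.25],
              [0.  , 0.  , 0.25, 0.75]])
assert np.allclose(W, W.T)
assert np.allclose(W.sum(axis=1), 1.0)       # doubly stochastic (Lemma 7)

eigvals = np.sort(np.abs(np.linalg.eigvalsh(W)))
rho = eigvals[-2]                            # second-largest magnitude, < 1 here

rng = np.random.default_rng(0)
x = rng.normal(size=4)
xbar = np.full(4, x.mean())                  # W @ x keeps the same mean
lhs = np.linalg.norm(W @ x - xbar)           # disagreement after one averaging
rhs = rho * np.linalg.norm(x - xbar)         # Lemma 8 contraction, t = 1
assert lhs <= rhs + 1e-12
```

Iterating the bound gives the geometric ρ^t factor of Lemma 8: each multiplication by a W_t shrinks the component orthogonal to the all-ones vector.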
In Lemma 9, we bound the sensitivity of our proposed algorithm.
Lemma 9. Under Assumption 4, the sensitivity of the algorithm can be bounded as Δs ≤ 2γG√d, where γ is the learning rate and d is the dimensionality of the vectors.
Proof. Let D and D′ be any two adjacent datasets at iterate t, and let x and x′ be the executions of our proposed algorithm on D and D′, respectively. Then, from our proposed algorithm, we have where the first inequality comes from the norm inequality and the last inequality comes from the triangle inequality. From Assumption 4, we have Since the pair of adjacent datasets D, D′ can be chosen arbitrarily, we can obtain The lemma follows.
From Lemma 9, we know that the learning rate γ, the dimensionality d of the vectors, the maximal bound G of the subgradient, and the privacy level ε affect the magnitude of the added random noise. Based on Lemma 9, we next bound the noise.
Lemma 10. We give the following inequalities:
Proof. According to the property of the Laplace mechanism, we obtain a bound on the first quantity as follows: The bound on the second quantity can be obtained similarly.
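The noise bounds above rest on the fact, stated with Definition 3, that a Lap(λ) variable has variance 2λ². A quick empirical check of this fact (the scale value is arbitrary):

```python
import numpy as np

# Empirical check that a Laplace(scale=b) variable has variance 2*b**2,
# the fact used to bound the expected squared norm of the added noise.
rng = np.random.default_rng(0)
b = 0.5                                   # stands in for lambda = Delta_s / eps
samples = rng.laplace(scale=b, size=500_000)
emp_var = samples.var()
assert abs(emp_var - 2 * b * b) < 0.02    # 2*b^2 = 0.5
```

Summing this per-coordinate variance over d coordinates gives the expected squared norm of a full noise vector.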
Lemma 11 (see [1]). Under assumption 1, the following inequality holds:
The proof of this lemma can be found in the full version of [1]. We further define the squared distance of the local optimization variable on node i from the averaged local optimization variable over all nodes at iterate t, i.e., E‖x̄_t − x_i^t‖². In the following, we present a bound on this quantity.
Lemma 12. Under the above definition, we can get:
Proof. According to the update rule of the local variables, we split the quantity into two terms:
where the seventh equality follows from the unbiased estimation assumption.
First, we split the first term into two parts,
To bound this term, we first bound its two components:
Moreover, the remaining component can be bounded as follows:
Then, plugging and into ,
where the last inequality comes from the fact that .
Moreover, we split the second term into two parts:
We give an upper bound as follows:
where the second-to-last inequality comes from Lemma 8 and Assumption 3.
For the remaining term, we give the following upper bound:
To bound this term, we first bound its two components, for each iterate:
where the last inequality comes from Lemmas 8 and 11.
Then, we bound the term as follows:
where can be bounded by and :
and we give a bound of :
Then, plugging the intermediate bounds back in turn, we obtain the upper bound for this part.
Then, plugging both parts back yields the upper bound of the original quantity.
Finally, we can summarize the bound as follows:
The lemma is obtained.
Based on these lemmas above, we prove Theorem 5 subsequently.
Proof of Theorem 5. We start from the following expansion:
where the last step comes from .
Then, we split the second term as follows,
We split the last term of (53) into two terms,
According to (54) and (55), (53) can be expressed as:
We can bound the second-to-last term as follows:
where the last step is true because of assumption 3.
Then, it follows from (56):
where the last step comes from .
We then bound the following term:
where the first inequality comes from .
According to Equation (37) in Lemma 12, we have the desired bound. Then, we bound its average over all nodes as follows:
Summing it from to , we can get the following result:
We can then bound the summation over all iterates:
Rearranging the terms, we obtain
Plugging the bound of into :
Then, using the above bound, we bound the remaining term,
Summing from to :
Rearranging the terms, we get the following result:
This completes the proof.
Proof of Corollary 6. Substituting the step size into the result of Theorem 5 and moving the first term of the RHS to the LHS, we can obtain that
Let , and we show and are approximately constants when (16) is satisfied.
Note that
Since , as long as we have
and will be satisfied. Then, we can obtain .
Let , then
7. Experiments
In this section, we perform extensive simulations to evaluate our proposed algorithm. In particular, we compare the convergence rate of our proposed algorithm with the best-known D-PSGD algorithm given in [1], under different privacy budgets, different numbers of nodes, and different extents of dynamicity.
In our experiments, we evaluate our proposed algorithm on image classification tasks. We train a ResNet model on the CIFAR-10 dataset and use MPI to implement the communication scheme. We run our experiments on a CPU server cluster, where each server has 32 Intel Xeon E5-2620 v4 @ 2.10 GHz cores and 128 GB of memory.
7.1. Impact of Privacy Budgets
We evaluate the impact of the privacy budget on the convergence of our proposed algorithm in a dynamic network. Here, we run our algorithm in dynamic networks with different numbers of nodes under different privacy budgets. The results are illustrated in Figure 1, where the privacy budget ε is set to three different levels. It can be seen that the smaller ε is, the slower the learning converges. This is because a smaller privacy budget means more noise is added, which affects the convergence speed of the algorithm.
7.2. Impact of Dynamicity
We compare our algorithm with the best-known D-PSGD algorithm (with privacy protection) in a static network. Since D-PSGD operates on a ring structure, we set the expected degree in the dynamic network to 2. Here, the privacy budget ε is fixed, and the number of nodes is 4 and 8. From Figure 2, we can see that our proposed algorithm reaches the same convergence rate in dynamic networks as the D-PSGD algorithm (with privacy protection) does in static networks.
8. Conclusion
We presented D-(DP)2SGD, a decentralized parallel stochastic gradient descent algorithm with privacy preservation in dynamic networks. Through theoretical analysis and extensive experiments, we showed that our proposed algorithm achieves the same convergence rate as the best-known previous work in static networks without privacy considerations. Building on this work, it would be meaningful to further devise privacy-preserving algorithms for asynchronous dynamic environments.
Data Availability
The CIFAR-10 data used to support the findings of this study are available at http://www.cs.toronto.edu/~kriz/cifar.html. The software code used to support the findings of this study is available at https://github.com/zongruisdu/D-DPDP-SGD.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is partially supported by the National Key R&D Program of China with grant no. 2019YFB2102600 and NSFC (No. 61971269).