Abstract
With the vigorous development of artificial intelligence technology, various engineering applications have been implemented one after another. The gradient descent method plays an important role in solving various optimization problems due to its simple structure, good stability, and easy implementation. However, in multinode machine learning systems, the gradients usually need to be shared, which causes privacy leakage, because attackers can infer training data from the gradient information. In this paper, to prevent gradient leakage while keeping the accuracy of the model, we propose the super stochastic gradient descent approach, which updates parameters by concealing the modulus length of each gradient vector and converting it into a unit vector. Furthermore, we analyze the security of the super stochastic gradient descent approach and demonstrate that our algorithm can defend against attacks on the gradient. Experimental results show that our approach is clearly superior to prevalent gradient descent approaches in terms of accuracy, robustness, and adaptability to large-scale batches. Interestingly, our algorithm can also resist model poisoning attacks to a certain extent.
1. Introduction
Gradient descent (GD) is a technique to minimize an objective function, parameterized by the parameters of a model, by updating the parameters in the opposite direction of the gradient of the objective function with respect to the parameters [1]. It has been widely applied in solving various optimization problems because of its simplicity and impressive generalization ability [2], but it inherently risks revealing privacy. Mathematically, the gradient is the parametric derivative of the loss function, which is explicitly calculated from the given training data and its true label. Therefore, an attacker may extract sensitive information about the original training data from captured gradients. Recently, research has shown that an attacker who captures the gradient of a training sample can successfully infer its attributes [3], label [4], class representation [5, 6], or the data input itself [4, 7–9] with high accuracy. In practical deep learning systems, the gradient of multiple samples, which can be viewed as the per-coordinate average of the single-sample gradients, is widely used to improve efficiency and performance. Is the multisample gradient safer for the privacy of training data? Unfortunately, Pan et al. [9] gave a theoretical analysis indicating that the multisample gradient still leaks samples and labels under certain circumstances. Since the work of Zhu et al. [7], a branch of research [4, 7–9] has explored brute-force but universal methods for successful data reconstruction attacks, and some meaningful empirical results have been given on CIFAR-10 and ImageNet. These works are based on the same learning-based framework. First, a batch of unknown training samples is treated as variables, and then the optimal training samples are searched for by minimizing the distance between the ground-truth gradient and the gradient calculated from the variables. The main difference among them is the choice of distance function to minimize: the L2 distance is used in [4, 7] and the cosine distance in [8]. Although Zhao et al. [4] used the properties of neural networks to recover the label of a single sample before the learning-based attack, this technique is only applicable to the single-sample gradient; it is the same as [7] in the multisample case. Pan et al. [9] gave a theoretical explanation for the information leakage of a single sample in a fully connected neural network with the ReLU activation function. Furthermore, they used the internal information between neurons to show that in some cases samples and labels leak from multisample gradients and extended the analysis to ResNet-18 [10], VGG-11 [11], DenseNet-121 [12], AlexNet [13], ShuffleNet v2-x0-5 [14], InceptionV3 [15], GoogLeNet [16], and MobileNet-V2 [17].
To solve the gradient safety problem, Bonawitz et al. [18] designed a secure aggregation protocol, which is a four-round interactive protocol. Xu et al. proposed VerifyNet [19] and VeriFL [20], which add verifiability to [18] to ensure the correctness of aggregation. Bell et al. [21, 22] introduced a secure aggregation protocol with polylogarithmic communication and computational complexity, which requires one round of interaction fewer than [18]. Fereidooni et al. [23] showed that secure aggregation can be achieved with only two rounds of communication. All of the above works use encryption algorithms to encrypt the entire data set or intermediate values during the training process. Different from them, Ma et al. [24] used secure verifiable computation delegation to privately label a public data set from the aggregation of locally trained models and then utilized the public data set to train local models. Phong et al. [25] used homomorphic encryption to encrypt the gradient before sending it. Abadi et al. [26] employed differential privacy to protect gradients. Yadav et al. [27] applied differential privacy to federated machine learning by directly adding noise to the gradient. PrivateDL [28] allows relational knowledge to be transferred effectively from sensitive data to public data in a privacy-preserving way and enables participants to jointly learn local models based on public data with noise-protected labels. However, these methods also have limitations. The main problems of secure aggregation protocols are communication overhead and computational efficiency. Differential privacy must balance privacy and utility: more noise leads to poor performance, and less noise is not enough to protect the gradient. PrivateDL [28] requires a public data set and reduces the performance of the algorithm.
Therefore, this paper proposes a new gradient descent method, super stochastic gradient descent (SSGD), to achieve neuron-level security while maintaining the accuracy of the model. Moreover, SSGD has stronger robustness. Phong et al. [25] analyzed the leakage of single-sample, single-neuron input data in a single-layer perceptron with the sigmoid activation function. Pan et al. [9] used the ReLU activation function to analyze sample data leakage from the gradients of multilayer fully connected neural networks and showed that multiple samples also reveal privacy, for example, when two neurons in the last layer are activated by only the same single sample. Essentially, the leakage is caused by attacking the single-sample gradient. SSGD converts each neuron gradient into a unit vector, which gives the gradient aggregation of neurons superrandomness. Superrandomness may significantly worsen the performance of the algorithm and make it difficult to converge, so we compose updates from multiple-sample gradients to increase stability. At the same time, the superrandomness also brings strong robustness because the attacker cannot know the true gradient. SSGD invalidates attacks on the gradient, including the attacks that search for the optimal training sample [4, 7, 8] by minimizing the distance between the ground-truth gradient and the gradient calculated from the variables, and the attack that solves a system of equations [9] to obtain the training data. Our contributions are summarized as follows.
(1) We propose a gradient descent algorithm, called super stochastic gradient descent. The main idea is to update the parameters by using unit gradient vectors. In neural networks, neuron parameters are updated by using the unit gradient vector of each neuron.
(2) We analyze theoretically that SSGD can realize neuron-level security and defend against attacks on the gradient.
(3) Experimental results show that our approach has better accuracy and robustness than prevalent gradient descent approaches and can resist model poisoning attacks to a certain extent.
The rest of this paper is organized as follows. In Section 2, we review the basic gradient descent methods and the data leakage caused by gradients. In Section 3, we describe super stochastic gradient descent and analyze the safety of our approach. The experimental results are shown in Section 4. Finally, we conclude the paper and discuss future work.
2. Preliminaries
In this section, we review some basic gradient descent algorithms [1], including batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD). The difference among them is how much data is used to calculate the gradient of the objective function. Then, we describe the information leakage caused by gradients [19].
2.1. Basic Gradient Descent Algorithms
The BGD is the ordinary form of gradient descent, which takes the entire training set into account to calculate the gradient of the cost function with respect to the parameters and then updates the parameters by
$$\theta = \theta - \eta \nabla_\theta J(\theta),$$
where $\eta$ is the learning rate and $\nabla_\theta J(\theta)$ represents the gradient of the objective function $J(\theta)$ with respect to the parameters $\theta$. The BGD uses the entire training set in each iteration. Therefore, the update proceeds in the right direction, and BGD is finally guaranteed to converge to an extreme point. On the contrary, the SGD considers a training sample $x_i$ and label $y_i$ randomly selected from the training set in each iteration to perform the parameter update by
$$\theta = \theta - \eta \nabla_\theta J(\theta; x_i, y_i).$$
The BGD and SGD are two extremes: one uses all training samples and the other uses one sample for gradient descent. Naturally, their advantages and disadvantages are very prominent. For the training speed, the SGD is very fast, and the BGD cannot be satisfactory when the size of training sample set is large. For accuracy, the SGD determines the direction of the gradient with only one sample, resulting in a solution which may not be optimal. For the convergence rate, because the SGD considers one sample in each iteration and the gradient direction changes greatly, it cannot quickly converge to the local optimal solution.
The MBGD is a compromise between BGD and SGD, which performs an update with a randomly sampled mini-batch of $N$ training samples by
$$\theta = \theta - \eta \nabla_\theta J(\theta; x_{i:i+N}, y_{i:i+N}),$$
where $N$ is the size of the mini-batch. MBGD decreases the variance of the parameter updates, so it has more stable convergence. Moreover, computing the gradient of a mini-batch is very efficient when using the highly optimized matrix operations available in advanced deep learning libraries.
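To make the three update rules concrete, the following minimal NumPy sketch (our illustration, not code from the paper; `grad_fn` is an assumed function returning the gradient of the objective with respect to the parameters) contrasts BGD, SGD, and MBGD.

```python
import numpy as np

def bgd_step(theta, grad_fn, X, y, lr):
    # Batch gradient descent: gradient over the entire training set.
    return theta - lr * grad_fn(theta, X, y)

def sgd_step(theta, grad_fn, X, y, lr, rng):
    # Stochastic gradient descent: gradient of one randomly chosen sample.
    i = rng.integers(len(X))
    return theta - lr * grad_fn(theta, X[i:i + 1], y[i:i + 1])

def mbgd_step(theta, grad_fn, X, y, lr, batch_size, rng):
    # Mini-batch gradient descent: gradient over a random mini-batch of N samples.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return theta - lr * grad_fn(theta, X[idx], y[idx])
```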
2.2. Analysis of Gradient Information Leakage
Phong et al. [25] illustrated how gradients leak the data information based on a single neuron, shown in Figure 1. Assume that $x = (x_1, \ldots, x_d)$ represents the data input with label value $y$, $W = (w_1, \ldots, w_d)$ is the weight parameter, and $b$ is the bias; the parameters are represented uniformly by $\theta = (W, b)$. $\nabla_\theta \ell$ is the gradient vector of the parameter $\theta$, $f$ is an activation function, and the loss function is $\ell(f(x, \theta), y)$, where $f(x, \theta) = f(\sum_k w_k x_k + b)$. Let $z = \sum_k w_k x_k + b$. We have
$$\frac{\partial \ell}{\partial w_k} = \frac{\partial \ell}{\partial z} \cdot x_k, \qquad \frac{\partial \ell}{\partial b} = \frac{\partial \ell}{\partial z}.$$
Therefore, we obtain $x_k = (\partial \ell / \partial w_k) / (\partial \ell / \partial b)$. By solving this system of equations, we can easily get $x$ and then $y$. Also, we know that the gradient $\nabla_\theta \ell$ is determined by $(x, y)$. Therefore, the gradient and the training data are bijective. In distributed training, $W$ and $b$ usually are the parameters that need to be updated and are known. Then, the attacker can infer $x$ from the shared gradient.
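This leakage can be checked numerically. The sketch below is our illustration (assuming a sigmoid neuron with squared loss; variable names are ours): the ratio of the weight gradient to the bias gradient recovers the input exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(5)           # private input
y = 1.0                     # its label
W = rng.standard_normal(5)  # known model parameters
b = 0.1

# Forward pass and squared loss l = (f(Wx + b) - y)^2.
z = W @ x + b
pred = sigmoid(z)
dl_dz = 2 * (pred - y) * pred * (1 - pred)

# Gradients that would be shared in distributed training.
grad_W = dl_dz * x  # dl/dw_k = (dl/dz) * x_k
grad_b = dl_dz      # dl/db  = dl/dz

# The attacker recovers x from the gradient alone.
x_recovered = grad_W / grad_b
assert np.allclose(x_recovered, x)
```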
Based on the single-sample analysis of multilayer neural networks with the ReLU activation function in [9], there is also a data leakage problem. Although the leakage of data in a multilayer neural network is not as simple and intuitive, we can still recover $x$ and $y$ by analyzing the internal relationships of the neural network and find that the gradient and the training data are still bijective.
3. Super Stochastic Gradient Descent
In this section, we propose our super stochastic gradient descent approach for preventing gradient leakage while maintaining accuracy, and then analyze the safety of our approach in detail.
3.1. Approach
It has been confirmed that gradients leak privacy [7, 25]. To solve the security problem caused by exchanging gradients in stochastic gradient descent or mini-batch gradient descent, we propose the super stochastic gradient descent approach, which protects the gradient information without losing accuracy by hiding part of the gradient information. The gradient is the first-order partial derivative of the objective function, so it is a vector with both magnitude and direction. We compute the gradient of the objective function to find the direction of fastest descent, but that direction depends only weakly on the modulus length of the gradient vector. Therefore, we hide the modulus length of the gradient vector and convert the gradient vector into a unit vector.
The superrandomness caused by aggregating multiple unit gradient vectors may lead to poor results. To guarantee that this kind of randomness is benign, we use the following approaches to reduce the uncertainty it causes.
For a single training sample $x_i$ with label $y_i$, we use the unit gradient vector to update the parameter $\theta$:
$$\theta = \theta - \eta \frac{\nabla_\theta J(\theta; x_i, y_i)}{\left\| \nabla_\theta J(\theta; x_i, y_i) \right\|}.$$
For multiple samples, the parameter is updated by
$$\theta = \theta - \frac{\eta}{m} \sum_{k=1}^{m} \frac{\nabla_\theta J(\theta; x^{(k)}, y^{(k)})}{\left\| \nabla_\theta J(\theta; x^{(k)}, y^{(k)}) \right\|},$$
where $x^{(k)}$ represents the $k$th group of $n$ samples and $y^{(k)}$ denotes their labels. The gradient of $n$ samples is considered as a basic gradient, and $m$ is the number of basic gradients. Averaging the unit vectors of the basic gradients further enhances the stability of the algorithm, so the algorithm keeps high performance despite the strong randomness. It is safe to share these unit basic gradient vectors in a distributed environment.
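Under our reading of this update rule ($n$ samples per basic gradient, $m$ basic gradients per step, and a generic gradient function `grad_fn`), a minimal sketch of one SSGD step might look as follows; it is illustrative, not the authors' implementation.

```python
import numpy as np

def ssgd_step(theta, grad_fn, X, y, lr, n, m, rng):
    # Draw m disjoint groups of n samples, compute one basic gradient per group,
    # normalize each basic gradient to a unit vector, and average the unit vectors.
    idx = rng.choice(len(X), size=(m, n), replace=False)
    update = np.zeros_like(theta)
    for k in range(m):
        g = grad_fn(theta, X[idx[k]], y[idx[k]])   # basic gradient of n samples
        update += g / (np.linalg.norm(g) + 1e-12)  # hide the modulus length
    return theta - lr * update / m
```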
A neuron is the smallest information carrier in the neural network structure. In a neural network, we choose to convert the gradient vector of each neuron's parameters into a unit vector. Therefore, the parameter of a single-layer neural network is updated by
$$W_r = W_r - \frac{\eta}{m} \sum_{k=1}^{m} \frac{G_r^{(k)}}{\left\| G_r^{(k)} \right\|},$$
where $W_r$ represents the $r$th column or $r$th row of the parameter matrix in a fully connected layer or a convolutional layer (each convolution kernel is regarded as a neuron), and $G_r^{(k)}$ is the corresponding column or row of the $k$th basic gradient matrix. In a fully connected layer, $G_r^{(k)}$ is the $r$th column of the gradient matrix, and in a convolutional layer it is the $r$th row of the gradient matrix of the convolution kernels. Therefore, each row or column of the gradient matrix is normalized to a unit vector, and we obtain an average gradient matrix from $m$ such normalized gradient matrices to update the parameters.
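For a single layer, the per-neuron normalization amounts to normalizing each column of the layer's gradient matrix before averaging. The sketch below is our illustration under that assumption.

```python
import numpy as np

def normalize_columns(G, eps=1e-12):
    # Convert each neuron's gradient (a column of G) into a unit vector.
    return G / (np.linalg.norm(G, axis=0, keepdims=True) + eps)

def ssgd_layer_update(W, basic_grads, lr):
    # basic_grads: list of m gradient matrices of W, one per group of n samples.
    G_avg = sum(normalize_columns(G) for G in basic_grads) / len(basic_grads)
    return W - lr * G_avg
```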
3.2. The Safety of SSGD
By analyzing a multilayer neural network with the ReLU activation function on a training sample, the relationship given as equation (8) is obtained in [9]. It relates the layer gradients to the input data $X$, where $l_c$ represents the $c$th dimension of the loss vector $l$, $T$ is the number of layers of the neural network, $D_i$ is the activation pattern of the $i$th layer, and $G_i$ and $W_i$ denote the gradients and parameters of the $i$th layer, respectively. In fact, the attack models on the gradient are all solutions to these equations. In a distributed training model that needs to share the gradient, the participants know the gradients $G_i$, the parameters $W_i$, and the model structure. For data reconstruction attacks, the attacker can solve equation (8) to infer $X$.
The left side of equation (8) is the $i$th layer gradient matrix
$$G_i = \left( g_1^i, g_2^i, \ldots, g_{k_i}^i \right),$$
where $k_i$ is the number of neurons in the $i$th layer and $g_r^i$ is the gradient of the $r$th neuron. The gradient matrix produced by our SSGD is
$$\widetilde{G}_i = \left( \frac{g_1^i}{\| g_1^i \|}, \frac{g_2^i}{\| g_2^i \|}, \ldots, \frac{g_{k_i}^i}{\| g_{k_i}^i \|} \right).$$
Each column of $\widetilde{G}_i$ is a unit vector, and $\| g_1^i \|$ is the modulus length of the 1st column of the $i$th layer gradient matrix, i.e., the modulus length of the gradient of the 1st neuron of the $i$th layer. Essentially, the gradient matrix of a layer is multiplied on the right by a diagonal matrix $\Lambda_i^{-1}$, where $\Lambda_i = \mathrm{diag}(\| g_1^i \|, \ldots, \| g_{k_i}^i \|)$, so the diagonal entries are the reciprocals of the modulus lengths of the neuron gradient vectors. With our SSGD, the left side of equation (8) becomes $G_i \Lambda_i^{-1}$, where $\Lambda_i$ is unknown and is not uniquely determined when the loss function is nonconvex and nonconcave. According to [29], the loss functions of multilayer neural networks are nonconvex and nonconcave. Due to the dynamicity of $\Lambda_i$, even if $G_i \Lambda_i^{-1}$, $W_i$, and $D_i$ are known, $X$ cannot be obtained.
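The equivalence between column-wise normalization and right-multiplication by the diagonal matrix $\Lambda_i^{-1}$ can be verified in a few lines of NumPy (our illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 4))    # gradient matrix of one layer (4 neurons)
norms = np.linalg.norm(G, axis=0)  # modulus length of each neuron gradient

G_unit = G / norms                 # column-wise normalization (SSGD)
G_diag = G @ np.diag(1.0 / norms)  # right-multiplication by Lambda^{-1}
assert np.allclose(G_unit, G_diag) # the two forms coincide
```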
Our method hides the correlation between the gradient and the sample, eliminates the exploitable information between neurons, and achieves neuron-level security. SSGD uses multisample training, so there is no single-sample privacy leakage problem like that in [19]. Hence, SSGD can defend against attacks on the gradient.
Since training a model requires many rounds of iterations, is it safe to use multiple rounds of iterations? We previously analyzed that the gradient and the training data are bijective given the parameter $\theta$, i.e., $g = f(\theta, X)$, where $f$ is a functional relationship. We use $\theta_i$ and $\theta_{i+1}$ to denote the training parameters of the $i$th and $(i+1)$st rounds, respectively. Then, we have $\theta_{i+1} = \theta_i - \eta g_i$, where the $i$th gradient is $g_i = f(\theta_i, X)$. Therefore, $\theta_{i+1} = \theta_i - \eta f(\theta_i, X)$, and unrolling the recursion gives $\theta_{i+1} = \theta_0 - \eta \sum_{j=0}^{i} f(\theta_j, X)$. By comparing $\theta_{i+1}$ with $\theta_i$, we can see that there is no additional information in $\theta_{i+1}$: the information of the model is only related to the training samples, the initial parameters, and the learning rate. Therefore, the iteration operation does not cause extra information leakage.
4. Experiments
Data. We use the MNIST (https://yann.lecun.com/exdb/mnist) and Fashion-MNIST (https://fashion-mnist.s3-website.eu-central-1.amazonaws.com) datasets to assess the performance of our algorithm. MNIST contains 60,000 training images and 10,000 test images, where every image is a 28 × 28 grayscale image and each pixel is stored as one byte. Fashion-MNIST [30] is composed of 28 × 28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category; the training set and test set contain 60,000 and 10,000 images, respectively.
Model. LeNet-5 [31] contains two convolutional layers, two pooling layers, and three fully connected layers. The activation function is ReLU. The input dimension is 784, and the output dimension is 10.
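A PyTorch sketch of a LeNet-5 variant matching this description is given below; the kernel sizes and channel widths are standard LeNet-5 choices and are our assumptions, since the paper does not list them.

```python
import torch.nn as nn

class LeNet5(nn.Module):
    # Two conv layers, two pooling layers, three fully connected layers, ReLU activations.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        # x: a batch of 1x28x28 images (784 input values per image).
        return self.classifier(self.features(x))
```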
Evaluation Index (Test Accuracy). We use the 60,000 training images to train the model. The reported test accuracy is the average of ten experimental runs, where each run reports the average test accuracy over 1,000 samples randomly selected from the test set. The number of iterations is 10,000. The highest test accuracy among the compared algorithms in the same experimental environment is shown in bold.
4.1. Accuracy and Efficiency
We compare SSGD with SGD, SGDm [32], and Adam [33], which are widely used gradient descent algorithms. The batch size (N = n × m) is set to 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192, where n is set to 1, 4, 8, 16, 32, 64, and 128 and m is set to 4, 8, 16, 32, and 64. When m = 1, it is the MBGD. The compared algorithms are not all given the same learning rate, because SGD and SGDm adapt poorly to large batches; each algorithm uses the learning rate that yields good experimental results.
For the MNIST data set, the momentum of SGDm is set to 0.999. For the experimental parameters of Adam, the learning rate is set to , and β1 and β2 are set to 0.9 and 0.999, respectively. For SSGD, the learning rate in this experiment is set to . For the Fashion-MNIST data set, the momentum of SGDm is set to 0.99. For the experimental parameters of Adam, the learning rate is set to , and β1 and β2 are set to 0.9 and 0.999, respectively. For SSGD, the learning rate in this experiment is set to , where j is the number of iterations.
The comparative experimental results of SGD, SGDm, Adam, and SSGD are shown in Tables 1 and 2, where the numbers in brackets in the second and third columns denote the learning rates of SGD and SGDm, respectively, and the number in brackets in the fifth column is the value of m. From Tables 1 and 2, we can see that the performance of our algorithm is better than that of SGD, SGDm, and Adam for large batches of data. In this case, SGD and SGDm need to reduce the learning rate to adapt, and Adam shows obvious overfitting on large batches, whereas SSGD consistently maintains high accuracy. On the whole, our algorithm is better and more stable than SGD, SGDm, and Adam in terms of test accuracy.
Tables 3 and 4 show the results of our SSGD approach with different batch sizes. We can see that the larger the batch size (N) is, the better the test accuracy is. When the batch size is small, the effect of n on performance is greater than that of m; the distribution of m values in Tables 1 and 2 also shows this point.
The convergence rate curves on MNIST and Fashion-MNIST are shown in Figures 2(a) and 2(b), respectively. The value on the vertical axis is the average accuracy of every 10 iterations. SSGDm is SSGD with momentum. We choose the intermediate value 256 as the batch size in the convergence experiment of Figure 2(a), where n = 16 and m = 16 for SSGD and SSGDm. In Figure 2(a), the learning rates of SGD and SGDm are and , respectively. The momentum of SGDm is set to 0.999. The learning rate of SSGDm is , where j is the number of iterations, and its momentum is 0.99. The other parameters are consistent with the above experiment on MNIST. In Figure 2(b), we choose the larger batch size 1024 in the convergence experiment, where n = 64 and m = 16 for SSGD and SSGDm. The learning rates of SGD and SGDm are and , respectively. The momentum of SGDm is set to 0.99. The other parameters are consistent with the above experiment on Fashion-MNIST. The learning rate of SSGDm is , where j is the number of iterations, and its momentum is 0.9. From Figure 2, we can see that our algorithm converges faster and more stably than SGD, SGDm, and Adam.

Figure 2: Convergence curves. (a) MNIST. (b) Fashion-MNIST.
4.2. Robustness
Robustness refers to the ability of a system to maintain a certain level of performance under parameter perturbations. To examine the robustness of our algorithm, we add random noise to the gradient. At the same time, we note that differential privacy protects gradient information by adding random noise drawn from a suitable distribution. To compare the performance of our algorithm with that of a model with differential privacy, the noise added in the robustness experiment is chosen to satisfy differential privacy. In this section, we compare the performance of traditional gradient descent algorithms and SSGD under noise. In [20], large gradients do not participate in the update, which seriously affects gradient descent performance; however, letting large gradients participate in the update causes the noise scale to be too large, which makes the algorithm perform extremely poorly or even fail to converge. Different from clipping the gradient values as in [20], we strictly define the sensitivity as the maximum value minus the minimum value in the gradient matrix. We add Laplacian noise of the same scale to the compared algorithms and set the privacy budget to ε = 4 and ε = 2 on MNIST and Fashion-MNIST, respectively.
We use SGDm and Adam as the compared algorithms. We also tested the SGD algorithm: when the noise or the batch size is large, gradient explosion occurs and SGD cannot converge on MNIST, whereas SGDm and Adam have better robustness. Because both SGDm and Adam have momentum, SSGDm is chosen as our algorithm for comparison. We tune the hyperparameters to obtain better performance for SGDm and Adam with noise. To run SGDm, Adam, and SSGDm in the same environment, the batch size is n × m, where n is set to 4, 8, 16, 32, and 64, and m is set to 4, 8, 16, 32, and 64. For each iteration, after the n gradient vectors are summed, Laplace noise with ε = 4 or ε = 2 that strictly satisfies differential privacy is added. The sensitivity is set to the maximum value minus the minimum value of the gradient matrix of the same batch. Then, SGDm, Adam, and SSGDm update their parameters, respectively. For SGDm, the momentum is 0.99; the learning rate is and on MNIST and Fashion-MNIST, respectively. For Adam, the learning rate is on MNIST and Fashion-MNIST, β1 = 0.9 and β2 = 0.999. For SSGDm, we use the average of multiple unit gradient vectors to update the parameters; therefore, the modulus length of the update vector decreases very slowly, and dynamic learning rates need to be set. The learning rate of SSGDm is set to and on MNIST and Fashion-MNIST, respectively, and the momentum is 0.9.
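The noise injection described above can be sketched as follows (our illustration; the sensitivity is taken as the maximum minus the minimum of the gradient matrix, and the scale follows the standard Laplace mechanism, sensitivity/ε):

```python
import numpy as np

def add_laplace_noise(grad, epsilon, rng):
    # Sensitivity is defined as max minus min of the gradient matrix for the batch.
    sensitivity = grad.max() - grad.min()
    scale = sensitivity / epsilon
    return grad + rng.laplace(loc=0.0, scale=scale, size=grad.shape)
```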
From Tables 5 and 6, all three algorithms follow the same general rule; that is, the larger the batch size is, the better the accuracy is. We can see that SSGDm is more robust than SGDm and Adam in test accuracy when noise of the same scale is added to the gradients. On MNIST, compared with SGDm and Adam, the average test accuracy of SSGDm is increased by 4.12% and 1.60%, respectively. On Fashion-MNIST, compared with SGDm and Adam, the average test accuracy of SSGDm is increased by 5.24% and 1.64%, respectively.
Where is the limit of the robustness of our algorithm? On MNIST, we increase the scale of the noise by setting ε to 0.2, 0.5, 1, 2, and 4. The experimental environment and parameter settings are the same as in the robustness experiment above. The batch size is set with n = 16 and m = 16. On Fashion-MNIST, we set ε to 0.5, 1, 2, and 4, with n = 64 and m = 16. From Tables 7 and 8, it is clear that our SSGDm has an obvious advantage in robustness: the larger the scale of the noise is, the more obvious the advantage of our algorithm is.
4.3. Poisoning Attack
The goal of a poisoning attack is to destroy the integrity and availability of the data. The robustness experiments show that our algorithm can, to a certain extent, resist poisoning added to the gradient. According to the previous analysis, the gradient is a kind of mapping of the training data, so our algorithm should also be effective against data poisoning attacks. The experiments in this part verify the performance of our algorithm under data poisoning attacks.
SGD is the most basic gradient descent method, so we choose SGD as a compared algorithm in this experiment. This experiment compares the performance of SGD, Adam, and SSGD on the same data set with noise. To determine the scale of the added noise, the differential privacy mechanism is still used to add noise, with the same method as in the robustness experiment. We add Laplacian noise of different scales to the 60,000 training samples. The evaluation method is the same as in the above experiments. On MNIST, the batch size is set with n = 64 and m = 4; the learning rates of SGD, Adam, and SSGD are , , and , respectively. On Fashion-MNIST, the batch size is set with n = 64 and m = 16; the learning rates of SGD, Adam, and SSGD are , , and , respectively. The other settings are the same as in the above experiments.
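A sketch of how such data poisoning can be generated is given below (our illustration; clipping the pixel range to [0, 1] is our assumption and is not specified in the paper):

```python
import numpy as np

def poison_images(images, epsilon, rng):
    # Add Laplacian noise to every training image; a smaller epsilon means stronger poisoning.
    sensitivity = images.max() - images.min()
    noisy = images + rng.laplace(scale=sensitivity / epsilon, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)  # assumed pixel range [0, 1]
```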
Figure 3 shows the images after adding noise of different scales. From Tables 9 and 10, we can see that SSGD is significantly better than SGD and Adam in test accuracy. Moreover, our algorithm maintains a higher test accuracy as the scale of the noise continuously increases. Therefore, SSGD can resist gradient poisoning attacks and parametric poisoning attacks to a certain extent.

Figure 3: Example images after adding noise of different scales (panels (a) and (b)).
4.4. Data Reconstruction Attack
Zhu et al. [7] presented an approach showing the possibility of obtaining private training data from publicly shared gradients. In their deep leakage from gradients (DLG) method, they synthesize dummy data and corresponding labels under the supervision of the shared gradients. Specifically, they start with a random initialization of pseudodata and labels, compute virtual gradients on the current shared model in the distributed setup, and then iteratively update the virtual data and labels simultaneously by minimizing the difference between the virtual gradients and the shared real gradients. iDLG [4] is an improvement of DLG. The following figures show the experimental results of DLG [7] and iDLG [4] attacking the SGD and SSGD algorithms on the MNIST and Fashion-MNIST datasets. Figures 4 and 5 show the results of DLG and iDLG attacking SGD, and Figures 6 and 7 show the results of DLG and iDLG attacking SSGD. The number of iterations is 300, and the iteration stops early if the predetermined accuracy is reached. We can see that our algorithm can defend against DLG and iDLG.
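For reference, the core optimization loop of a DLG-style attack can be sketched as follows (a simplified PyTorch illustration of the method in [7]; `loss_fn`, the shapes, and the optimizer choice are our assumptions, not the exact attack code used in our experiments):

```python
import torch

def dlg_attack(model, loss_fn, shared_grads, x_shape, num_classes, steps=300):
    # Randomly initialize dummy data and a dummy (soft) label.
    dummy_x = torch.randn(x_shape, requires_grad=True)
    dummy_y = torch.randn(1, num_classes, requires_grad=True)
    opt = torch.optim.LBFGS([dummy_x, dummy_y])

    for _ in range(steps):
        def closure():
            opt.zero_grad()
            pred = model(dummy_x)
            loss = loss_fn(pred, dummy_y.softmax(dim=-1))
            dummy_grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
            # Minimize the L2 distance between dummy gradients and the shared gradients.
            grad_dist = sum(((dg - g) ** 2).sum() for dg, g in zip(dummy_grads, shared_grads))
            grad_dist.backward()
            return grad_dist
        opt.step(closure)

    return dummy_x.detach(), dummy_y.detach()
```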

Figures 4 and 5: Results of DLG and iDLG attacks on SGD (panels (a) and (b)).
Figures 6 and 7: Results of DLG and iDLG attacks on SSGD (panels (a) and (b)).
5. Conclusions
In this paper, we propose a new gradient descent approach, called super stochastic gradient descent. SSGD enhances the randomness of gradients to protect against gradient-based attacks. Simultaneously, we use multisample aggregation to enhance stability and eliminate the uncertainty brought about by superrandomness. Our approach achieves neuron-level security and can defend against attacks on the gradient. Experimental results demonstrate that SSGD has good accuracy and strong robustness because both its stability and its randomness are enhanced. SSGD can also resist model poisoning attacks to a certain extent; however, for attacks with the same degree of poisoning, data poisoning has a greater impact on performance. In the future, we will continue to look for methods better suited to resisting data poisoning attacks.
Data Availability
All the experimental data used to support the findings of this study are included within the article.
Disclosure
An earlier version of this study’s preprint is given in the following link: https://arxiv.org/abs/2012.02076.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 61672176 and 61763003), Research Fund of Guangxi Key Lab of Multi-Source Information Mining and Security (no. 19-A-02-01), Guangxi 1000-Plan of Training Middle-Aged/Young Teachers in Higher Education Institutions, Guangxi “Bagui Scholar” Teams for Innovation and Research Project, Guangxi Talent Highland Project of Big Data Intelligence and Application, and Guangxi Collaborative Innovation Center of Multisource Information Integration and Intelligent Processing.