Abstract
In the Internet of Things (IoT), massive interconnected intelligent terminal devices constitute diverse networks. Link prediction can serve as a powerful inference attack to speculate the sensitive links in the networks, posing a security threat to entity privacy in IoT. Most antilink prediction methods reduce the prediction ability of link prediction models through link disturbance to hide sensitive links but fail to consider the impact of node attributes on link prediction. This paper proposes a sensitive link protection method based on graph embedding (SLPGE) to combat link prediction attacks. This method is aimed at compressing network topology data into an embedding matrix and lessening private information by combining Variational Graph Autoencoder (VGAE) and Adversarially Regularized Variational Graph Autoencoder (ARVGA). Based on our experiment on two datasets, SLPGE reduces the prediction accuracy of two attack models for sensitive links by up to 30.05% and 15.03% compared to the original data, and the corresponding utility sees a drop of 7.54% and 7.79% at most, which verifies the feasibility of SLPGE—achieving the tradeoff between privacy protection and data utility effectively.
1. Introduction
To build a highly automated, informative, and intelligent system, the Internet of Things (IoT) integrates numerous communication, computing, and sensing devices, ranging from smartphones to vehicles [1], which is an organic collection of intelligent terminal devices and users. In IoT, widely distributed terminal devices establish reliable wireless links through advanced wireless communication and network technology, forming distributed multidomain networks [2]. Networks are ubiquitous in the real world, such as communication networks, social networks, biological networks, and transportation networks, represented by graphs containing nodes and edges. Similarly, the networks in IoT can also be regarded as graphs with terminal devices as nodes and communication links as edges. Although attractive and convenient, IoT also brings a significant challenge, i.e., the concerns on privacy disclosure [3]. As a new paradigm of big data platform, IoT deploys smart city applications to timely monitor, analyze, and respond to volumes of physical data. The data in IoT collected in a distributed manner are strongly correlated with users’ sensitive status. However, some information platforms disclose private information inadvertently while trading the data, most likely the graphs in IoT. Furthermore, it does not rule out the possibility that malicious attackers may spy on entity privacy, analyze network traffic, and track users’ behavior by stealing the complete network graphs, which invade the entity privacy and threaten the security of the IoT system. At present, the study on privacy for IoT mainly focuses on the privacy of data, identity, and location [4], while rarely mentioning graph privacy, especially the privacy of the communication links between terminal nodes in graphs, i.e., sensitive links. Actually, the disclosure of sensitive links will bring many security threats to the IoT system. 
For example, some sensitive links involve personal privacy, such as the doctor-patient relationship in smart healthcare, one of the typical application scenarios of IoT, and the user trajectories that data requesters may expose when accessing IoT. In addition, in a man-in-the-middle (MITM) attack, hackers try to intercept private data; control devices in smart homes, smart industries, and smart healthcare; or destroy the communication links in the IoT system, resulting in privacy disclosure, device failure, and even system collapse, which seriously threaten personal privacy, business activities, and industrial operations. Hence, it is imperative to detach private information from the graphs in advance. The most straightforward way to hide sensitive links is to delete them from the graphs directly. Unfortunately, sensitive links may still be predicted from the released data through data mining techniques, even if they have been deleted [5]. As an essential task in data mining, link prediction has attracted growing interest in recent years, and more and more link prediction methods and applications have been proposed. Link prediction can predict the relationship between nodes by mapping the graph information to a continuous vector space. While widely applied in network analysis, link prediction can also be used as an inference attack to infer the sensitive links in graphs. Therefore, the data publisher should apply privacy processing to the published data to defend against link prediction attacks while retaining the necessary data utility. In recent years, the privacy disclosure caused by link prediction attacks has attracted researchers' attention, and many studies on antilink prediction have emerged.
To defend link prediction based on similarity and deep learning methods, most antilink prediction methods adopt various link disturbances, e.g., random link disturbance, heuristic link disturbance, and evolutionary link disturbance, at the expense of part of data utility [6–13]. Besides, these methods only focus on the graph structure information and fail to consider the unstructured information in graphs, such as node attributes. The node attributes may include the performance, identity, and type of devices, deepening the association strength between nodes and making the attacker’s prediction more accurate. As mentioned above, protecting sensitive links against link prediction attacks is an urgent problem to be solved. Significantly, Li et al. [14] proposed an adversarial privacy graph embedding (APGE) method to conceal users’ sensitive attributes from inference attacks, which opens up a novel idea for our work. In this paper, we intend to fill this blank by developing a graph embedding-based sensitive link protection method named SLPGE. Our basic idea is to use the graph embedding model combined with Variational Graph Autoencoder (VGAE) and Adversarially Regularized Variational Graph Autoencoder (ARVGA) to encode graph data into an embedding matrix before publishing the data. To be concrete, we utilize adversarial training assisted by two schemes to eliminate private information in the embedding matrix. Then, to balance the tradeoff between privacy and utility, we design the loss functions in SLPGE to retain the utility of graph structure and node labels. 
The main contributions of this paper are summarized below:
(i) This article focuses on the privacy protection of sensitive links in IoT and proposes a sensitive link protection method (SLPGE) to conceal sensitive links from link prediction attacks.
(ii) The results of experiments on two public datasets with node attributes validate that our SLPGE can reduce the prediction accuracy of attack models for sensitive links by 30.05% and 15.03% at most relative to the original data.
(iii) Our method achieves a tradeoff between privacy and utility. Unlike previous methods, it abandons the idea of directly applying link disturbance to the original graph to remove private information, which reduces the loss of utility.
The rest of the paper is organized as follows. The related work and preliminaries are reviewed in Sections 2 and 3, respectively. The system models and problem formulation are presented in Section 4. The details of our SLPGE are described in Section 5. The simulation and results are shown in Section 6. Moreover, we give the conclusions and future work in Section 7.
2. Related Work
The emergence of various IoT platforms not only facilitates people’s lives but also generates a huge volume of data carrying personal information. These data can be modeled as graph-structured data, and attackers can then easily expose the private information hidden in graphs via link prediction. In this section, we briefly introduce the relevant work on graph privacy protection, link prediction, and antilink prediction.
2.1. Graph Privacy Protection
The main methods of graph privacy protection include anonymization, random disturbance, and clustering. Since Sweeney [15] introduced anonymization into graph structure data, different anonymization variants for graphs have also been derived. Ying and Wu [16] disrupted the graph structure by deleting and adding edges randomly. Li et al. [17] performed spectral clustering according to the distance between nodes firstly and then anonymized subgraphs. For the graphs with node labels, Yuan et al. [18] proposed the protection method of node attribute label -anonymity to ensure that the labels of at least nodes are the same. Chester and Srivastava [19] proposed an attribute probability distribution anonymity method to make the probability distribution of the label carried by each node in the attribute sets of its neighbors as close as possible to the global label probability distribution. The random graph modification technology proposed by Hay et al. [20] is the simplest technology to prevent node reidentification and edge exposure. Mittal et al. [9] proposed a link perturbation based on the random walk (LPRW), which improved the privacy and utility of data compared with Hay’s method. In edge clustering methods, Liu et al. [21] proposed privacy protection methods for sensitive edge weights in weighted graphs, adopting Gaussian noise disturbance and greedy disturbance. Zheleva and Getoor [22] mainly considered the privacy of graphs with multiple types of edges and one type of node. Its main idea is to divide the original graph into subgraphs via spectral clustering and then modify the links in the subgraphs and add new links between the subgraphs randomly.
Low data availability and high computational complexity are the common problems of these methods, and their privacy will continue to decrease as inference attacks intensify.
2.2. Link Prediction
Link prediction is aimed at predicting missing links according to the existing entities and links and has found wide application in social, biological, and communication networks. Because it is a powerful inference attack, link prediction has also been maliciously used to spy on the privacy of entities in the networks. Among the many link prediction methods, classification models such as support vector machine (SVM) [23], multilayer perceptron (MLP) [24], and k-nearest neighbor (KNN) [25] regard link prediction as a binary classification problem, in which the connected node pairs and unconnected node pairs are regarded as positive samples and negative samples, respectively.
2.3. Antilink Prediction
At present, most antilink prediction methods for graph structure data disturb the graph structure by strategically adding some new links and deleting part of the nonsensitive links to reduce the prediction ability of various link prediction methods and achieve the privacy protection of sensitive links. Liu and Terzi [6] proposed to achieve $k$-degree anonymization through edge addition or deletion strategies. Rousseau et al. [7] proposed two approaches that preserve the coreness of a graph while anonymizing it through various edge modification operations. Fard and Wang [8] and Mittal et al. [9] proposed two structure-aware randomization perturbation methods based on local perturbation and random walk considering the structural proximity of nodes. Zhou et al. [10] regarded the links between the end nodes of a sensitive link and their common nodes as the candidate links to be deleted and expressed the attack on local similarity as an optimization problem to determine which links to delete. Chen et al. [11] proposed an iterative gradient attack (IGA) method based on integral gradient information in Graph Autoencoder (GAE). The gradients obtained by maximizing the loss of sensitive links represent the influence of other links on sensitive links. During iterations, links with the largest gradients are modified. Yu et al. [12] combated resource allocation (RA) indicator link prediction via random, heuristic, and evolutionary link disturbance. Among these three methods, random link disturbance adds and changes links without any strategy, heuristic link disturbance reduces the link prediction ranking of node pairs in the test set, and evolutionary link disturbance selects the links to be added and deleted according to the fitness function. Waniek et al. [13] chose to delete or add the most influential links to hide sensitive links by reducing or creating the closed triangles containing sensitive links.
The methods mentioned above can be used in IoT systems to avoid the leakage of sensitive links in data transactions. However, two shortcomings are present in the above methods: the first is that the utility of the graph will be lost due to link disturbance, and the second is that they lack the consideration of the impact of node attributes on link prediction.
3. Preliminaries
As a kind of non-Euclidean data, a graph is difficult to process directly with traditional data analysis methods or deep learning models such as Convolutional Neural Network (CNN) [26] and Recurrent Neural Network (RNN) [27] due to the high computational and space overhead. Graph embedding, also called network representation learning, is aimed at mapping graph data, usually a high-dimensional sparse matrix, to low-dimensional dense vectors. With these vectors, deep learning models can be applied directly and flexibly to graph analysis tasks. Graph Neural Network (GNN) represents the deep learning approach to graph embedding. By modeling the nodes and communication links in the networks, GNN can be applied to solve the privacy disclosure problem in IoT. Because GNN models are well suited to extracting features from non-Euclidean data, our SLPGE is built on several of them. In this section, the GNN models involved in SLPGE, i.e., Graph Convolutional Network (GCN), VGAE, and ARVGA, are briefly introduced. For the sake of clarity, the frequently used notations and their meanings are listed in Table 1.
3.1. Graph Convolutional Network
In 2013, Bruna et al. [28] first proposed the neural network on the graph and gave two structures based upon a hierarchical clustering of the domain and the spectrum of the graph Laplacian. As a typical GNN model, GCN [29] is a scalable approach for semisupervised learning on graph data, which uses the spectrum of the graph Laplacian to achieve convolution on graphs. After each convolution of GCN, the features of a node are the weighted sum of the previous features of the node and its neighbors, so nodes aggregate features from farther away as the layers deepen. Hence, the strength of GCN is that it naturally incorporates local graph structure and node features. Suppose the adjacency matrix $A$ represents the connection relationship between nodes; then the layer-wise propagation rule of GCN is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}\right), \tag{1}$$

where $H^{(l)}$ is the feature matrix of the $l$-th layer, $W^{(l)}$ is the trainable weight matrix, and $\sigma(\cdot)$ is an activation function. $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is the normalization of $\tilde{A}$, where $\tilde{A} = A + I_N$, $I_N$ is the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The degree of a node is the number of first-order neighbors connected to the node. Equation (1) can be abbreviated as $H^{(l+1)} = \mathrm{GCN}(H^{(l)}, A)$, where $H^{(l)}$ is the input of each layer.
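As an illustration of the layer-wise propagation rule, the following is a minimal sketch in plain Python rather than the authors' implementation. The helper names (`matmul`, `normalize_adjacency`, `gcn_layer`) and the choice of ReLU as the activation are assumptions made for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(m):
    """Element-wise ReLU (assumed activation for this sketch)."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(adj_norm, h, w, activation=relu):
    """One propagation step: activation(A_norm @ H @ W)."""
    return activation(matmul(matmul(adj_norm, h), w))

# Path graph 0 - 1 - 2 with one scalar feature per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_norm = normalize_adjacency(adj)
h_next = gcn_layer(a_norm, [[1.0], [2.0], [3.0]], [[1.0]])
```

Each output row mixes a node's own feature with those of its neighbors, weighted by the normalized adjacency entries.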
3.2. Variational Graph Autoencoders
Soon after GCN was proposed, Kipf and Welling [30] introduced VGAE to expand its capability: VGAE adopts GCN as an encoder to generate graph embeddings for different graph tasks, not limited to node classification. VGAE is an unsupervised learning framework derived from Variational Autoencoders (VAE) [31], which obtains the graph embedding through an encoder-decoder structure. VGAE consists of a two-layer GCN encoder and a simple inner-product decoder. The two-layer GCN can be defined as follows:

$$\mathrm{GCN}(H, A) = \sigma\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} H W_0\right) W_1\right), \tag{2}$$

where $\hat{A}$ is the symmetrically normalized adjacency matrix and $\mathrm{ReLU}$ is the activation function of the first layer. The activation $\sigma(\cdot)$ of the second layer is determined according to the specific task. The encoder of VGAE is aimed at learning the mean and the standard deviation of a multidimensional Gaussian distribution from which the graph embedding is sampled. The process is briefly described below:

$$\boldsymbol{\mu} = \mathrm{GCN}_{\mu}(X, A), \quad \log\boldsymbol{\sigma} = \mathrm{GCN}_{\sigma}(X, A), \quad Z = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \tag{3}$$

where $X$ replaces $H$ in Equation (2) as the node feature matrix of the first layer and $\mathrm{GCN}_{\mu}$ and $\mathrm{GCN}_{\sigma}$ share the first-layer parameters $W_0$. $Z$ is the graph embedding matrix and $\boldsymbol{\epsilon}$ is the noise sampled from the standard Gaussian distribution. The inner product is used as a decoder in VGAE, and the formula is as follows:

$$A' = \mathrm{sigmoid}\left(ZZ^{\top}\right), \tag{4}$$

where $\mathrm{sigmoid}(\cdot)$ is the sigmoid function. $A'$ is the reconstructed adjacency matrix, and $A'_{ij}$ can be regarded as the product of independent event probabilities of the $i$-th node and the $j$-th node. When $A'_{ij}$ is greater than the threshold 0.5, it means that there is a link between the $i$-th node and the $j$-th node.
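The sampling step and the inner-product decoder described above can be sketched as follows. This is a minimal illustration in plain Python that assumes the mean and log-standard-deviation matrices have already been produced by the encoder; the function names are hypothetical.

```python
import math
import random

def sample_embedding(mu, log_sigma, rng):
    """Reparameterization: Z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [[m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(m_row, l_row)]
            for m_row, l_row in zip(mu, log_sigma)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inner_product_decoder(z):
    """Reconstructed adjacency: entry (i, j) is sigmoid(z_i . z_j)."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in z] for zi in z]

def predicted_links(a_rec, threshold=0.5):
    """Node pairs (i < j) whose reconstructed probability exceeds the threshold."""
    n = len(a_rec)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if a_rec[i][j] > threshold]

# Nodes 0 and 1 point the same way in the embedding space, node 2 the opposite way.
z = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
links = predicted_links(inner_product_decoder(z))
```

Nodes whose embedding vectors have a positive inner product receive a reconstructed probability above 0.5 and are therefore predicted to be linked.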
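VGAE has two optimization objectives: one is to make $A$ and $A'$ as similar as possible; the other is to make the distribution of $Z$ as close to the standard Gaussian distribution as possible. Since binary cross-entropy (BCE) can determine the proximity between the actual output and the expected output and Kullback-Leibler (KL) divergence can measure the difference between two distributions, the loss function of VGAE composed of BCE and KL divergence can be expressed as

$$L_{VGAE} = L_{rec} + L_{KL} = -\mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right] + \mathrm{KL}\left[q(Z \mid X, A)\,\|\,p(Z)\right]. \tag{5}$$

Here, the former term minimizes the reconstruction loss through the cross-entropy function, and the latter minimizes the KL divergence. $q(Z \mid X, A)$ is the distribution obtained from the encoder, and $p(Z)$ is a Gaussian prior. $\mathrm{KL}[q(Z \mid X, A)\,\|\,p(Z)]$ is the KL divergence between $q(Z \mid X, A)$ and $p(Z)$. We expect $q(Z \mid X, A)$ to be as close to $p(Z)$ as possible.

More specifically, $L_{rec}$ in Equation (5) can be abbreviated as below:

$$L_{rec} = -\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left[W\, y_{ij}\log \hat{y}_{ij} + (1 - y_{ij})\log(1 - \hat{y}_{ij})\right], \tag{6}$$

where $y_{ij}$ represents the value, 0 or 1, of an element in $A$, $\hat{y}_{ij}$ represents the probability value of the corresponding element in $A'$, and $W$ is the ratio of the number of 0s to the number of 1s in $A$, which can be used to solve the problem of imbalance between positive and negative samples. $L_{KL}$ in Equation (5) can be abbreviated as below:

$$L_{KL} = -\frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left(1 + 2\log\sigma_{ij} - \mu_{ij}^2 - \sigma_{ij}^2\right), \tag{7}$$

where $d$ is the embedding dimension and $\mu_{ij}$ and $\sigma_{ij}$ are the elements of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.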
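The two loss terms can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code: the averaging constants (over all matrix entries for the reconstruction term, over nodes for the KL term) and the small `eps` added for numerical safety are assumptions, and the graph is assumed to contain at least one link.

```python
import math

def weighted_bce(adj, a_rec, eps=1e-12):
    """Weighted binary cross-entropy between adjacency and reconstruction.

    Positive entries are weighted by the ratio of 0s to 1s in adj to
    compensate for the imbalance between links and non-links.
    """
    n = len(adj)
    ones = sum(v for row in adj for v in row)
    pos_weight = (n * n - ones) / ones  # assumes at least one link
    total = 0.0
    for i in range(n):
        for j in range(n):
            y, p = adj[i][j], a_rec[i][j]
            total -= pos_weight * y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / (n * n)

def kl_to_standard_normal(mu, log_sigma):
    """KL divergence between N(mu, sigma^2) and N(0, I), averaged over nodes."""
    total = 0.0
    for m_row, l_row in zip(mu, log_sigma):
        for m, ls in zip(m_row, l_row):
            total += 1 + 2 * ls - m * m - math.exp(2 * ls)
    return -0.5 * total / len(mu)

def vgae_loss(adj, a_rec, mu, log_sigma):
    """Sum of the reconstruction term and the KL term."""
    return weighted_bce(adj, a_rec) + kl_to_standard_normal(mu, log_sigma)
```

A perfect reconstruction with a posterior equal to the prior drives both terms to approximately zero.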
3.3. Adversarially Regularized Variational Graph Autoencoder
To force the graph embedding learned by VGAE to fit the prior distribution better, Pan et al. [32] proposed ARVGA by combining VGAE and Generative Adversarial Network (GAN). GAN was first proposed by Goodfellow et al. [33] in 2014 as a generative model bridging supervised learning and unsupervised learning. More recently, exploiting GAN to work out elegant solutions to severe privacy and security problems has become increasingly popular in both academia and industry due to its game-theoretic optimization strategy [34]. Typically, GAN consists of a generator $G$ and a discriminator $D$, whose purpose, in a nutshell, is to make fake samples indistinguishable from genuine ones. During the iterative training, $G$ is trained to generate fake samples that convince $D$ they come from a prior data distribution, while $D$ discriminates whether an input sample comes from the prior data distribution or from the generator $G$ we built. In ARVGA, we take VGAE as $G$ and a two-layer fully connected network as $D$, where the output layer has only one dimension with a sigmoid function. The equation for training the encoder model with the discriminator can be written as follows:

$$\min_{G}\max_{D}\ \mathbb{E}_{z \sim p_z}\left[\log D(z)\right] + \mathbb{E}_{x \sim p(x)}\left[\log\left(1 - D(G(x))\right)\right]. \tag{8}$$

Here, $z$ is the real sample, $x$ is the original data, $G(x)$ is the fake sample, and $D(\cdot)$ is the probability that the sample is true. $G$ is aimed at minimizing the equation while $D$ is aimed at the opposite. Through the game between $G$ and $D$, ARVGA can enforce the graph embedding to match the prior distribution and produce a robust representation.
4. Model and Problem Formulation
In this article, our work is based on the following assumptions in the graph of IoT: The connections between devices are bidirectional. There are $M$ types of devices in the graph, and each device has its own attribute information such as internal storage, bandwidth, and hard disk. Sensitive links are the links that need to be hidden, while nonsensitive links are those which can be made public. The links whose end nodes have a larger total degree are defined as sensitive links. The nodes with larger degrees usually have more influence in the graph, so the links between these nodes are also more meaningful. Moreover, we take SVM and MLP as attack models to test the performance of our method, and part of the nonsensitive links and nonexistent links in the graph are known to the attack models.
4.1. Network Model
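We express one of the graphs of IoT as an undirected graph $G = (V, E, X)$. $V = \{v_1, v_2, \ldots, v_N\}$ is the set of terminal nodes and $|V| = N$. $E$ contains the edges $e_{ij}$ representing the communication link between $v_i$ and $v_j$, including sensitive links and nonsensitive links. $\bar{E}$ is the set of nonexistent links and $\bar{E} = \Omega \setminus E$, where $\Omega$ contains the $N(N-1)/2$ edges that can be formed by $N$ nodes. Node attributes are summarized in a feature matrix $X \in \mathbb{R}^{N \times F}$, with the $i$-th row representing the attributes of $v_i$, and $F$ is the number of attributes. $A \in \{0, 1\}^{N \times N}$ is the adjacency matrix, where $A_{ij} = 1$ if $e_{ij} \in E$; otherwise, $A_{ij} = 0$. $E_s$ is the set of sensitive links, $E_n$ is the set of nonsensitive links, and $E = E_s \cup E_n$.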
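To make the network model concrete, the sketch below builds the adjacency matrix from an edge list and splits the links into sensitive and nonsensitive sets by the total degree of their end nodes, following the assumption in Section 4 that links between high-degree nodes are sensitive. The function names and the cutoff parameter `k` are assumptions for illustration.

```python
from collections import Counter

def build_adjacency(num_nodes, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

def split_sensitive_links(edges, k):
    """Mark the k links whose end nodes have the largest total degree as sensitive."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    # Stable sort: ties keep their original order in the edge list.
    ranked = sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)
    return ranked[:k], ranked[k:]

edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
adj = build_adjacency(5, edges)
sensitive, nonsensitive = split_sensitive_links(edges, k=1)
```

In this toy graph the link between nodes 0 and 3 has the largest total end-node degree, so it is the one marked as sensitive.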
4.2. Attack Model
Both SVM and MLP have strong classification abilities for nonlinear problems with different structures.
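SVM is a classification model based on the structural risk minimization criterion in machine learning. For nonlinear classification problems, SVM adopts a nonlinear function $\phi(\cdot)$ to map the samples from the input space to a high-dimensional feature space where the samples are linearly separable and constructs an optimal classification hyperplane from labeled training data to categorize new samples. Given the training set $T = \{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$, SVM can transform the classification problem into a convex quadratic optimization problem as follows:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,\phi(x_i)^{\top}\phi(x_j) \quad \text{s.t.}\ \sum_{i=1}^{n}\alpha_i y_i = 0,\ 0 \le \alpha_i \le C, \tag{9}$$

where $\alpha_i$ is a Lagrange multiplier and $C$ is the penalty factor. Since the cost of computing $\phi(x_i)^{\top}\phi(x_j)$ increases sharply in the high-dimensional space, SVM introduces a kernel function $K(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j)$ to avoid the problem. The kernel function we choose is the Gaussian kernel:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \tag{10}$$

where $\sigma^2$ is the variance. In this case, the classification decision function is as follows:

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i, x) + b\right), \tag{11}$$

where $b$ is the bias constant.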
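The Gaussian kernel and the resulting decision function can be sketched as below. The multipliers, support vectors, and bias are supplied by hand here purely for illustration; in practice they come from solving the quadratic optimization problem, which this sketch does not do.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def svm_decision(x, support_vectors, labels, alphas, bias, sigma=1.0):
    """Sign of the kernel expansion over the support vectors plus the bias."""
    score = bias + sum(
        alpha * y * gaussian_kernel(sv, x, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1

# Hand-picked support vectors: one per class, equal multipliers, zero bias.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
near_positive = svm_decision([0.1, 0.1], support_vectors, labels, alphas, 0.0)
near_negative = svm_decision([1.9, 1.9], support_vectors, labels, alphas, 0.0)
```

A query point is assigned to the class of the support vector it lies closest to, because the kernel value decays with distance.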
MLP is a fully connected artificial neural network, consisting of an input layer, a hidden layer, and an output layer. MLP adjusts the parameters in the hidden layer units through the supervised back propagation (BP) algorithm and the gradient descent algorithm to reduce the error between the actual output and the expected output. The forward propagation mechanism of MLP is expressed as below:

$$H = XW + b, \tag{12}$$

where $X$ is the input matrix, $W$ is the weight matrix, $b$ is the bias, and $H$ is the output of the hidden layer. Thus, the decision function of MLP with only one hidden layer can be expressed as follows:

$$f(x) = \sigma\left(W_2\,\sigma\left(W_1 x + b_1\right) + b_2\right), \tag{13}$$

where $x$ is the input and $\sigma(\cdot)$ is an activation function.
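A forward pass through an MLP with one hidden layer can be sketched as follows. The choice of ReLU in the hidden layer and a sigmoid output for binary link classification is an assumption of this example, as are the hand-set weights.

```python
import math

def affine(x, weights, bias):
    """Compute W x + b, with one row of weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def mlp_predict(x, w1, b1, w2, b2):
    """Return (probability that the link exists, predicted class 0 or 1)."""
    hidden = relu(affine(x, w1, b1))
    prob = sigmoid(affine(hidden, w2, b2)[0])
    return prob, int(prob > 0.5)

# One hidden unit that fires when the first input exceeds the second.
w1, b1 = [[1.0, -1.0]], [0.0]
w2, b2 = [[2.0]], [-1.0]
prob, label = mlp_predict([2.0, 0.0], w1, b1, w2, b2)
```

Training with back propagation would adjust `w1`, `b1`, `w2`, and `b2`; only inference is shown here.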
4.3. Problem Formulation
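Given a graph $G$, our model will compress it into a graph embedding $Z \in \mathbb{R}^{N \times d}$, where the $i$-th row $z_i$ represents the vector of $v_i$. The vector of $e_{ij}$ can be expressed as the concatenation $z_{ij} = [z_i, z_j]$. Suppose a link set $E_k$ containing nonsensitive links (class 1) and nonexistent links (class 0) in $G$ has been exposed to attackers. Then, the embedding matrices of $E_k$ and $E_s$ are $Z_k$ and $Z_s$, where each row represents an edge embedding vector.

During data transactions, attackers will collect or steal $Z$ by any means to infer sensitive links through link prediction. Our goal is to achieve the balance between privacy protection and data utility. To this end, we use a “minmax” strategy to maximize the distance between the predicted label $\hat{Y}_s$ of sensitive links and its real label $Y_s$ and to minimize the distance between the predicted label $\hat{Y}_n$ of nonsensitive links and its real label $Y_n$. The mathematical description is as follows:

$$\max_{Z}\ \mathrm{dist}\left(\hat{Y}_s, Y_s\right), \quad \min_{Z}\ \mathrm{dist}\left(\hat{Y}_n, Y_n\right), \quad \text{s.t.}\ f = \mathrm{fit}(Z_k, Y_k),\ \hat{Y}_s = \mathrm{predict}(f, Z_s),\ \hat{Y}_n = \mathrm{predict}(f, Z_n), \tag{14}$$

where $\mathrm{fit}(\cdot)$ means to fit the classifier model $f$ with the training data and $\mathrm{predict}(\cdot)$ means to predict the labels of the given edge embeddings, with $Z_n$ the embedding matrix of $E_n$. $Y_k$ is the label set of $E_k$, where ones represent nonsensitive links and zeros represent nonexistent links. We expect to get a graph embedding $Z$ which can work for our purpose.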
5. Algorithm
The SLPGE framework consists of two parts. In this section, we will introduce the framework of SLPGE in Subsections 5.1 and 5.2, and the evaluation indicators are described in Subsection 5.3.
5.1. Generate the Privacy Embedding
Part 1 is to generate a privacy embedding $Z_p$. In order to put more privacy information into $Z_p$, we first change the structure of the original graph to enhance the connection strength of the end nodes of sensitive links. Before inputting the graph data into the model, we preprocess $A$ through Algorithms 1 and 2, corresponding to Figures 1 and 2. For Algorithm 1, we believe that two nodes with more common neighbors have a closer relationship. As shown in Figure 1, $e_{ij}$ is a sensitive link, and $N_i$ and $N_j$ are the neighbor sets of $v_i$ and $v_j$, respectively. If a node $v_k \in N_i$ with $v_k \notin N_j$ exists, we link $v_k$ and $v_j$ to make the relationship between $v_i$ and $v_j$ closer. For Algorithm 2, we believe that the main information in the graph will focus on sensitive links when other irrelevant nodes and links are removed. As shown in Figure 2, we only keep the sensitive links and their adjacent links to retain the information about the sensitive links to the greatest extent. The two privacy graph adjacency matrices $A_p$ obtained by Algorithms 1 and 2 are each used as the input of the encoder to output two $Z_p$'s.
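The two preprocessing ideas can be sketched as follows. This is only an interpretation of the prose description, since the algorithm listings themselves are not reproduced here: the function names, the optional `max_new_links` cap, and the rule of linking every non-shared neighbor are assumptions rather than the authors' exact procedures.

```python
def strengthen_sensitive_links(adj, sensitive_links, max_new_links=None):
    """Sketch of the first idea: give the end nodes of each sensitive link
    more common neighbors by linking each end node to the other's neighbors."""
    n = len(adj)
    out = [row[:] for row in adj]  # leave the input matrix untouched
    for i, j in sensitive_links:
        added = 0
        for a, b in ((i, j), (j, i)):
            for k in range(n):
                if max_new_links is not None and added >= max_new_links:
                    break
                # k neighbors a in the original graph but is not yet linked to b
                if k not in (a, b) and adj[a][k] == 1 and out[b][k] == 0:
                    out[b][k] = out[k][b] = 1
                    added += 1
    return out

def keep_sensitive_neighborhood(adj, sensitive_links):
    """Sketch of the second idea: keep only links that touch an end node
    of some sensitive link and drop everything else."""
    n = len(adj)
    ends = {v for link in sensitive_links for v in link}
    return [[adj[i][j] if (i in ends or j in ends) else 0 for j in range(n)]
            for i in range(n)]
```

Either function returns a modified adjacency matrix that could serve as the privacy graph fed to the encoder.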


Because VGAE is more robust and suitable for small graphs, we adopt VGAE to obtain $Z_p$ in this part, as shown in Figure 3. As discussed in Section 3, the mechanism for the encoder to generate $Z_p$ follows Equations (2) and (3). Then, we get the reconstructed adjacency matrix $A'_p$. The reconstruction loss $L_{rec}$ is the same as in Equation (5), except that $A$ and $A'$ are replaced by $A_p$ and $A'_p$.

For node classification, the encoder is followed by a softmax classifier to predict the labels of the nodes. The node classification loss function is as follows:

$$L_{nc} = -\sum_{i}\sum_{c=1}^{M} y_{ic}\log\hat{y}_{ic}, \tag{15}$$

where $y_{ic}$ represents the real label of $v_i$ in category $c$ with a value of 0 or 1, while $\hat{y}_{ic}$ is the value we predict, with $\hat{y}_{ic} \in [0, 1]$ and $\sum_{c}\hat{y}_{ic} = 1$. Therefore, the total loss of Part 1 is as follows:

$$L_{1} = L_{rec} + L_{KL} + L_{nc}. \tag{16}$$

Through the BP mechanism of $L_1$, we train the encoder to generate $Z_p$ that contains privacy information and conforms to a prior distribution.
5.2. Generate the Link Protection Graph Embedding
Part 2 generates a graph embedding $Z$ that can protect sensitive links. In order to reduce the most intuitive privacy information, we remove the sensitive links in $A$ to obtain $A_{train}$, the adjacency matrix of the training graph, as shown in Figure 4. The model in this part is designed based on ARVGA, as shown in Figure 5. The inputs of the encoder are $A_{train}$ and $X$. $Z$ output from the encoder is the input of the discriminator and the softmax classifier. Unlike Part 1, $Z$ and $Z_p$ are combined by adding or concatenating them to form the matrix used as the input of the inner-product decoder. The node classification loss function is the same as Equation (15). To distinguish the two $Z_p$'s obtained in Part 1, Section 6.2 uses separate model names for the variant whose $Z_p$ is generated with Algorithm 1 and the variant whose $Z_p$ is generated with Algorithm 2.
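The two ways of combining the embeddings before decoding can be sketched as follows. The function name and the `mode` strings are assumptions chosen to mirror the "cat" and "add" modes described in the experiment settings.

```python
def combine_embeddings(z, z_p, mode="cat"):
    """Combine two node-embedding matrices row by row.

    "cat" stacks the two vectors of each node side by side (dimension doubles);
    "add" sums them element-wise (dimension unchanged).
    """
    if len(z) != len(z_p):
        raise ValueError("both matrices need one row per node")
    if mode == "cat":
        return [row + row_p for row, row_p in zip(z, z_p)]
    if mode == "add":
        if any(len(row) != len(row_p) for row, row_p in zip(z, z_p)):
            raise ValueError("'add' needs embeddings of equal dimension")
        return [[a + b for a, b in zip(row, row_p)] for row, row_p in zip(z, z_p)]
    raise ValueError("mode must be 'cat' or 'add'")

z = [[1.0, 2.0], [0.5, 0.5]]
z_p = [[3.0, 4.0], [1.0, -1.0]]
stacked = combine_embeddings(z, z_p, mode="cat")
summed = combine_embeddings(z, z_p, mode="add")
```

With 8-dimensional inputs, "cat" yields 16-dimensional rows while "add" keeps 8 dimensions, which matches the dimensions reported in the experiment settings.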


Since the adversarial training between the encoder and the discriminator can force to match a prior distribution, the KL divergence in Equation (5) is omitted. Here, we try to use Mean Squared Error (MSE) as the reconstruction loss , and the reconstruction target is changed to :
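In the discriminator, we take $Z$ as the fake samples and samples $Z_{real}$ drawn from the Gaussian distribution as the real samples and then input them into the discriminator, i.e., a two-layer fully connected network, to get two estimated values $D(Z)$ and $D(Z_{real})$, respectively. $L_G$ and $L_D$ are the distribution losses of the generator and the discriminator, which are both calculated by BCE:

$$L_G = -\mathbb{E}\left[\log D(Z)\right], \tag{18}$$

$$L_D = -\mathbb{E}\left[\log D(Z_{real})\right] - \mathbb{E}\left[\log\left(1 - D(Z)\right)\right]. \tag{19}$$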
Therefore, the total loss of the generator can be written as follows:

$$L_{gen} = L_{rec} + L_{nc} + L_G. \tag{20}$$
Through the adversarial training, the discriminator learns how to distinguish between the real samples and the fake samples, while the generator learns to generate a better to confuse the discriminator. In general, the training process of obtaining can be summarized as Algorithm 3.
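The loss terms used in this adversarial training can be sketched as follows. This is a minimal illustration in plain Python that assumes the discriminator outputs are already available as probabilities; the function names and the `eps` term for numerical safety are assumptions.

```python
import math

def mse(target, recon):
    """Mean squared error over all matrix entries."""
    diffs = [(t - r) ** 2 for t_row, r_row in zip(target, recon)
             for t, r in zip(t_row, r_row)]
    return sum(diffs) / len(diffs)

def bce(probs, label, eps=1e-12):
    """Mean binary cross-entropy of probabilities against one constant label."""
    return -sum(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
                for p in probs) / len(probs)

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real samples scored 1 and fake samples scored 0."""
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_distribution_loss(d_fake):
    """The generator wants its embeddings to be scored as real."""
    return bce(d_fake, 1)
```

In a training loop, the discriminator and the generator would be updated alternately, each minimizing its own loss.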
5.3. Evaluation Indicators
This subsection will introduce the quantitative indicators of privacy and utility.
5.3.1. Privacy
Our chief target is to reduce the prediction accuracy of the attack models for sensitive links. We input the embedding vector of a sensitive or nonsensitive link to the attack models; if the predicted value is 1, the link is predicted to exist, and vice versa. Privacy is measured by the prediction accuracy of the attack models for sensitive links:

$$Acc_s = \frac{n_s}{N_s}, \tag{21}$$

where $n_s$ is the number of sensitive links predicted to exist and $N_s$ is the total number of sensitive links. When the security of $Z$ is stronger, $Acc_s$ is lower.
5.3.2. Utility
Utility includes three parts: the prediction accuracy of the attack models for nonsensitive links, the accuracy and recall of the reconstructed graph, and the accuracy of node classification. Taking the existing links as positive samples and the nonexistent links as negative samples, the quantitative expression of utility is as follows:

$$Acc_n = \frac{n_n}{N_n}, \tag{22}$$

where $Acc_n$ is the prediction accuracy of nonsensitive links, $n_n$ is the number of nonsensitive links predicted to exist, and $N_n$ is the total number of nonsensitive links:

$$Acc_{rec} = \frac{TP + TN}{TP + TN + FP + FN}, \quad Recall = \frac{TP}{TP + FN}, \tag{23}$$

where $Acc_{rec}$ is the ratio of existing links and nonexistent links that are reconstructed correctly and $Recall$ represents how many existing links have been reconstructed. $TP$ and $TN$ are the numbers of reconstructed positive and negative samples, and $FN$ and $FP$ are the numbers of nonreconstructed positive and negative samples:

$$Acc_{node} = \frac{n_c}{N}, \tag{24}$$

where $Acc_{node}$ is the ratio of the nodes classified correctly to the total number of nodes, $n_c$ is the number of nodes classified correctly, and $N$ is the number of nodes. Our tradeoff is protecting privacy while preserving utility, that is, reducing $Acc_s$ and keeping $Acc_n$, $Acc_{rec}$, $Recall$, and $Acc_{node}$ high.
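The privacy and utility indicators can be computed as in the sketch below. The function names are hypothetical; predictions are assumed to be 0/1 labels from an attack model, and the reconstruction scores are computed over unordered node pairs with a 0.5 threshold.

```python
def existence_rate(predictions):
    """Share of links predicted to exist (label 1).

    Applied to sensitive links this is the privacy indicator (lower is better);
    applied to nonsensitive links it is a utility indicator (higher is better).
    """
    return sum(1 for p in predictions if p == 1) / len(predictions)

def reconstruction_scores(adj, a_rec, threshold=0.5):
    """Accuracy and recall of a reconstructed adjacency matrix."""
    tp = tn = fp = fn = 0
    n = len(adj)
    for i in range(n):
        for j in range(i + 1, n):
            predicted = a_rec[i][j] > threshold
            if adj[i][j] == 1 and predicted:
                tp += 1
            elif adj[i][j] == 1:
                fn += 1
            elif predicted:
                fp += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall

def node_accuracy(true_labels, predicted_labels):
    """Share of nodes whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)
```

Calling `existence_rate` on the attack model's outputs for the sensitive links and then for the nonsensitive links gives the two quantities whose gap reflects the privacy-utility tradeoff.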
6. Simulation
In this section, we will evaluate the performance of SLPGE on two public datasets, Cora [35] and Yale [14].
6.1. Experiment Setting
(1) Datasets. Cora is a citation network composed of 7 categories of machine learning papers. Cora includes 2708 papers as $V$ and 5278 citation relationships between papers as $E$. The 1433 unique words that appear in all papers serve as the attributes of the nodes. Yale is a social network including 8578 people and 188 attributes. The class year attribute divides the nodes into 7 categories. Part of the links and labels of the datasets are used as training sets.

(2) Training. The experimental parameters are shown in Table 2. The initial features of nodes are 1433 and 188 dimensions. $Z$ and $Z_p$ are both 8-dim in Cora and 7-dim in Yale. As shown in Figure 5, we have two splicing modes of $Z$ and $Z_p$ in SLPGE: “concatenate (cat)” and “add,” where “cat” means stacking $Z$ and $Z_p$ in the horizontal direction (i.e., column order) and “add” means that the elements in $Z$ and $Z_p$ are added correspondingly. The combined embedding is 16-dim and 14-dim when using “cat” and 8-dim and 7-dim when using “add.” The embeddings of the two end nodes of a link are concatenated together as an edge embedding, so the dimension of the edge embedding is twice that of the node embedding.

Besides, we take the original graph and the training graph with sensitive links deleted as the input of VGAE to compare with SLPGE. At the same time, we use t-SNE to visualize the embedding in 2-dim to observe the node classification result, and the nodes belonging to the same label are represented by the same color. t-SNE is a nonlinear dimensionality reduction technique that maps high-dimensional features to a 2-dimensional or 3-dimensional space for visualization; implementations commonly apply PCA first to reduce the feature dimension.

(3) Attack. 100 and 200 edges with larger node degrees in the training sets are selected as the sensitive links of Cora and Yale, respectively, and the parameter used in Algorithm 1 is set to 10 and 15, respectively. We randomly select 200 nonsensitive links and 200 nonexistent links to form $E_k$, which has been exposed to the attackers. Moreover, the edge embeddings of $E_k$ will be used as the training set to train the attack models. The edge embeddings of the same number of sensitive and nonsensitive links are the input of the attack models. We train each model four times, and the attack models make 10 predictions after each training. Finally, the averages of the 40 prediction results are taken as the prediction accuracy of sensitive and nonsensitive links.
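The construction of the attacker's training set can be sketched as follows: edge embeddings are formed by concatenating the two end-node embeddings, and an equal number of nonsensitive and nonexistent links is sampled. The function names and the `per_class` parameter are assumptions made for this illustration.

```python
import random

def edge_embedding(z, edge):
    """Concatenate the embeddings of the two end nodes of a link."""
    i, j = edge
    return z[i] + z[j]

def build_attack_training_set(z, nonsensitive, nonexistent, per_class, rng):
    """Sample per_class links of each kind and label them 1 (exists) or 0."""
    positives = rng.sample(nonsensitive, per_class)
    negatives = rng.sample(nonexistent, per_class)
    features = [edge_embedding(z, e) for e in positives + negatives]
    labels = [1] * len(positives) + [0] * len(negatives)
    return features, labels

# Toy embedding with one dimension per node.
z = [[float(i)] for i in range(4)]
features, labels = build_attack_training_set(
    z, [(0, 1), (1, 2)], [(0, 3), (0, 2)], per_class=2, rng=random.Random(0)
)
```

The resulting features and labels could then be passed to any binary classifier standing in for the attack model; in the setting described above, `per_class` would be 200.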
6.2. Result Analysis
We carried out our experiments under four models: , , , and . means the input is the original graph without any modification. means the input is the training graph in which the sensitive links are deleted. Our SLPGE is divided into two types: and where comes from Algorithms 1 and 2, respectively.
Figures 6 and 7 show node classification of SVM under different models for and in a visualization method, respectively. In each subgraph, the points in the same color constitute a cluster, representing different classes. A larger distance between different clusters means higher accuracy. Corresponding numerical results are listed in Tables 3 and 4. The decline degree of five indicators of , , and compared with is shown in Tables 5 and 6, and the decline degree are calculated by , where represents and represents the others.


6.2.1. Privacy
Tables 3 and 4 compare of the four models: of , , and decreases to varying degrees, but of and decreases more. Especially for , reduces by 30.05% at most and 22.28% at least on the basis of while reduces by 14.05% at most. For , reduces by 15.03% at most and 9.46% at least on the basis of while reduces by 4.14% at most. SLPGE therefore improves privacy significantly compared with .
Although the privacy of SLPGE is 1.3 to 3.6 times higher than that of on , the protection effect of sensitive links on is not as good as that on , which results from the fact that the node attributes of are more closely related to the links. This also signifies that similar attributes make the privacy information between nodes more difficult to remove. In general, these comparisons confirm that our SLPGE performs better on sensitive link protection.
6.2.2. Utility
The loss of partial utility is the necessary cost of privacy protection. Taking as a comparison, we can see that the classification accuracy of the four variant models decreases, reflecting a partial sacrifice of data utility. , , , and of SLPGE and all decrease simultaneously, but the decline ranges of SLPGE are generally lower than those of . From Tables 5 and 6, it can be seen that of SLPGE decreases by 6.94% and 11.56% at most on the basis of for and , but the two decrease more, reaching 21.76% and 13.99%. of SLPGE decreases by 5.75% and 9.07% at most for and . The maximum decline ranges of , , and of SLPGE on the two datasets are 5.76% and 9.02%, 5.84% and 15.50%, and 11.26% and 6.79%, which are mostly lower than the decline ranges of . Tables 5 and 6 reflect the tradeoff between privacy and utility.
6.2.3. Models
The data in the four tables show that and are very close in performance on privacy and utility, which indicates that both Algorithms 1 and 2 are feasible. For the two splicing modes, the privacy and utility of mode “add” are better than those of mode “cat.” The analysis of this result is as follows.
The distributions of and both approach the standard normal distribution, and the weight of privacy information in is large and fixed. When , obtained by adding and , is made to fit the link labels of the original graph after decoding, the MSE loss function forces to reduce the weight of privacy information, so that more privacy information can be squeezed out. Therefore, the combination of MSE and mode “add” is better.
Overall, our SLPGE reduces the prediction accuracy of sensitive links to varying degrees, from which we can conclude that our model is effective. While protecting the privacy of sensitive links, some utility will be sacrificed, which may be structure information or attribute information. From the result analysis, it can be confirmed that SLPGE can retain most of the utility. In practical application, part of the structure of the model can be adjusted to meet different task requirements.
7. Conclusion
Individual privacy problems are ubiquitous when everything is interconnected. Research on link protection against link prediction in IoT is of great significance for entity privacy. Through simulation on the datasets, the feasibility of our SLPGE is preliminarily verified. However, multifaceted challenges remain in the research on link protection. Our datasets are just static graphs, in which the nodes belong to different categories at the same level, and the edges only represent citation and social relationships. In heterogeneous scenarios, nodes can be of different levels, edges between the nodes may have diverse meanings, and the weights of the edges are no longer all equal to one. The weight of an edge reflects the degree of communication between its nodes.
Furthermore, in dynamic graphs, the entry and exit of nodes affect the graph structure and the privacy information of sensitive links in real time, so the attackers can collect more information for inference attacks. The greatest challenge is that research on resisting graph disturbance and enhancing the robustness of link prediction continues to emerge, which increases the difficulty of sensitive link protection. Therefore, we will focus on sensitive link protection in weighted graphs and dynamic graphs in our follow-up research.
Appendix
In Section 5.2, we use the two modes of “concatenate (cat)” and “add” to combine . “cat” and “add” are the two splicing modes of in SLPGE: “cat” means stacking in the horizontal direction, and “add” means that the elements in are added correspondingly. Here, we intuitively show how to get and and explain their meanings. We assume that both are 3 × 2 matrices; then
In Part II, the reconstructed adjacency matrix is obtained by the inner product of , i.e., and , whose detailed calculations are shown below.
We can see that , and is a cross-multiplying term matrix. Since is fixed, , the loss function will force to constantly adjust so that is close to . Because has more cross-multiplying terms, may exert greater pressure . Based on the above analysis, we chose these two modes to get :
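Because the matrices of this appendix are not reproduced in the text, the following sketch illustrates the argument with two hypothetical 3 × 2 embedding matrices, here called `z1` and `z2` (placeholder names, not the paper's notation). It checks that the inner product of the concatenated embedding equals the sum of the two individual inner products, whereas the inner product of the added embedding contains extra cross-multiplying terms. Any nonlinearity applied by the decoder after the inner product is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_add(a, b):
    """Element-wise sum of two matrices of the same shape ("add" mode)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cat(a, b):
    """Stack two matrices horizontally ("cat" mode)."""
    return [ra + rb for ra, rb in zip(a, b)]


# Hypothetical 3 x 2 embeddings; the values are arbitrary.
z1 = [[1, 2], [3, 4], [5, 6]]
z2 = [[1, 0], [0, 1], [1, 1]]

z_cat = cat(z1, z2)      # shape 3 x 4
z_add = mat_add(z1, z2)  # shape 3 x 2

gram_cat = matmul(z_cat, transpose(z_cat))
gram_add = matmul(z_add, transpose(z_add))

# Terms shared by both modes, and the cross terms that only "add" produces.
own = mat_add(matmul(z1, transpose(z1)), matmul(z2, transpose(z2)))
cross = mat_add(matmul(z1, transpose(z2)), matmul(z2, transpose(z1)))

assert gram_cat == own
assert gram_add == mat_add(own, cross)
```

The cross terms couple the two embeddings in the reconstruction, which is consistent with the observation above that mode “add” puts more pressure on the trainable embedding than mode “cat.”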
Data Availability
Cora [35] is a citation network composed of 7 categories of machine learning papers. Cora includes 2708 papers and 5278 citation relationships between papers, and the 1433 unique words that appear in the papers serve as the node attributes. Yale is a social network including 8578 people and 188 attributes; the class year attribute divides the nodes into 7 categories. Part of the links and labels of the datasets are used as training sets. The Yale data are available from K. Li, “The data about the Facebook friendships of Yale University.” [Online]. Available: https://github.com/KaiyangLi1992/Privacy-Preserving-Social-Network-Embedding.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant 2019JBZ001, in part by the Beijing Natural Science Foundation under Grant 4202054, and in part by the National Natural Science Foundation of China under Grant 61871023 and Grant 61931001.