Abstract
Vertex attributes have a strong impact on the analysis of social networks. Since these attributes are often sensitive, effective ways are needed to protect the privacy of graphs with correlated attributes. Prior work has mostly treated the graph topological structure and the attributes separately, combining them by defining the relevancy between them. However, such methods must add noise to each part separately, which requires a large amount of noise and reduces data utility. In this paper, we introduce an approach to release graphs with correlated attributes under differential privacy based on early fusion. We combine the graph topological structure and the attributes in a single private probability model and generate a synthetic network satisfying differential privacy. Extensive experiments demonstrate that our approach meets the requirements of attributed networks and achieves high data utility.
1. Introduction
Social networks are an important topic in many fields, such as sociology, economics, and informatics. Many applications extract useful information by analyzing social networks. Social networks contain sensitive personal information, including individual information and the relationships among individuals. Since vertices usually carry sensitive attribute information, we need to protect the privacy of not only the graph topological structure but also the vertex attributes. We call a network with vertex attributes an attributed network. In this paper, we study privacy protection methods for releasing attributed networks. Network data can be released publicly after sanitization by substituting a synthetic network generated from a private network generation model.
As a common privacy protection technique, anonymization (e.g., k-anonymity [1] and l-diversity [2]) is easily susceptible to new privacy attacks. Recently, differential privacy has been proposed to address this problem. Unlike anonymization, differential privacy provides strong theoretical guarantees by adding calibrated noise, even against adversaries with prior knowledge.
The standard technique to ensure differential privacy of a social network is to "sanitize" the topological structure and neglect the vertex attributes. However, vertex attributes affect not only the vertex information but also the topological structure. For example, vertex attributes are closely related to the connections between vertices: when the graph has a high assortativity coefficient, vertices with similar attributes connect with each other with a higher probability. Adversaries who know the vertex attributes could therefore infer the topological structure effectively.
A straightforward approach to releasing an attributed network under differential privacy is to sanitize the vertex attributes and the topological structure separately and then fuse them to obtain the differentially private version of the network. We call this kind of method late fusion. In contrast, when we deal with the vertex attributes and the topological structure simultaneously, we call it early fusion. In this paper, we study attributed network releasing under differential privacy through early fusion.
We propose a probability model of attributed networks under differential privacy, called DP-ANP (differentially private attributed network probability model). DP-ANP captures both the vertex attribute information and the topological structure: it records the connection probability between every pair of vertices and the attribute value probabilities of each vertex. It allows drawing a sample network from the model's space through the joint probability distribution, which is equivalent to maximizing the probability of the model. In particular, to satisfy differential privacy, we add noise to the parameters of the probability model and generate the private synthetic network using DP-ANP.
The core of DP-ANP is to add noise to the model parameters of each vertex. However, due to the large number of vertices, DP-ANP may require a large amount of noise. Therefore, we propose an improved model, called DP-ANPHP (differentially private attributed network probability model based on hyperparameters). In DP-ANPHP, we construct the network model from hyperparameters and add noise to them instead of to the model parameters of DP-ANP. Since the number of hyperparameters in DP-ANPHP is much smaller than the number of parameters in DP-ANP, DP-ANPHP produces less noise and achieves higher data utility. In summary, we make the following contributions:
(1) We introduce a probability model of attributed networks under differential privacy based on early fusion, called DP-ANP. This model sanitizes the model parameters describing the overall information of the network instead of sanitizing the vertex attributes and topological structure separately. Compared with the late fusion method, DP-ANP reduces the computational complexity.
(2) We develop an improved model based on hyperparameters, called DP-ANPHP, which reduces the scale of the noise.
(3) Through privacy analysis, we prove that DP-ANP and DP-ANPHP satisfy differential privacy. We conduct an extensive experimental study over synthetic and real datasets. The results demonstrate that DP-ANP outperforms the late fusion method and that DP-ANPHP outperforms DP-ANP.
The rest of the paper is organized as follows. Section 2 reviews the literature on differential privacy for social networks. Section 3 presents the necessary concepts of differential privacy and attributed networks. Section 4 describes early fusion and late fusion. Section 5 describes the DP-ANP model. Section 6 describes the DP-ANPHP model. Section 7 reports the experimental results. Section 8 concludes the paper.
2. Related Work
2.1. Differentially Private Network Analysis
Dwork et al. [3] first proposed a differentially private method to answer query functions by perturbing the outcome directly. Hay et al. [4] proposed a differentially private post-processing phase to compute outputs that retain both accuracy and privacy, and used this method to estimate the degree distribution. Nissim et al. [5] introduced the concept of smooth sensitivity and protected individuals' privacy by adding a small amount of noise to released statistics such as the triangle count. Karwa et al. [6] extended this concept to calculate the k-star count of a network. Zhang et al. [7] analyzed released statistics through a ladder function and reduced the sensitivity effectively. Cheng et al. [8] presented a two-phase differentially private frequent subgraph mining algorithm named DFG: frequent subgraphs are privately identified in the first phase, and the noisy support of each identified frequent subgraph is calculated in the second phase. Ding et al. [9] published private triangle counts with the triangle count distribution and the cumulative distribution. Sun et al. [10] studied fundamental problems related to the extended local view (ELV). They formulated a decentralized private scheme named DDP, which requires each participant to consider both her own privacy and the privacy of her neighbors involved in her ELV.
2.2. Differentially Private Network Publishing
Sala et al. [11] proposed a private graph model that extracts a graph's structure into a dK-graph and generates a synthetic graph. Mir and Wright [12] used maximum likelihood estimation to privately estimate the parameters of the stochastic Kronecker graph model. Xiao et al. [13] proposed a private network publishing method that computes an estimator of the graph in the hierarchical random graph (HRG) model; they sampled possible HRG structures via Markov chain Monte Carlo (MCMC) using the exponential mechanism. Qin et al. [14] investigated techniques to ensure local differential privacy of individuals while collecting structural information and generating representative synthetic social graphs. Their method, LDPGen, incrementally clusters users based on their connections to different partitions of the whole population and adapts existing social graph generation models to construct a synthetic social graph. Chen et al. [15] presented a method for publishing private synthetic graphs that preserves the community structure of the original graph without sacrificing the ability to capture global structural properties. Wang et al. [16] presented a differential privacy method for weighted networks by constructing a private probability model.
2.3. Differential Privacy for Attributed Network
Qian et al. [17] proposed to deanonymize social graphs and infer private attributes by leveraging knowledge graphs, which carry both graph structural information and semantic information. Ji et al. [18] conducted the first attribute-based anonymity analysis for attributed networks under both preliminary and general models and proposed a new deanonymization framework that takes into account both the graph structure and the attribute information. Jorgensen et al. [19] proposed a method for publishing private attributed networks that adapts existing graph models, introduces a new one, and shows how to augment them with differential privacy. Yin et al. [20] defined a new type of attack called the attribute couplet attack and proposed a new anonymity concept called k-couplet anonymity to achieve privacy preservation under such attacks. Kiranmayi et al. [21] designed an algorithm named couplet anonymization that uses a node addition approach to reduce misleading fake relations. Meanwhile, some studies focus on the privacy of high-dimensional data; these methods are also closely related to differential privacy for attributed networks. Chen et al. [22] proposed a novel solution to preserve the joint distribution of a high-dimensional dataset. They developed a robust sampling-based framework to systematically explore the dependencies among all attributes and identified a set of marginal tables from the dependency graph to approximate the joint distribution, based on the inference foundation of the junction tree algorithm. Zhang et al. [23] presented a differentially private method for releasing high-dimensional data named PrivBayes, which injects noise into low-dimensional marginals instead of the high-dimensional dataset to circumvent the curse of dimensionality. They also introduced a novel approach that uses a surrogate function for mutual information to build the model more accurately.
Ren et al. [24] developed a local differentially private high-dimensional data publication algorithm, LoPub, which takes advantage of distribution estimation techniques; correlations among multiple attributes are identified to reduce the dimensionality of crowdsourced data.
In conclusion, only a few studies focus on differential privacy for attributed networks, and they are generally based on late fusion. In this work, we develop a differentially private attributed network publishing method based on early fusion.
3. Background
3.1. Differential Privacy
Given a graph G, differential privacy ensures that the outputs are approximately the same even if any edge is arbitrarily added to or deleted from the graph. Thus, the presence or absence of any edge has a negligible effect on the outputs. We define two graphs G1 = (V1, E1) and G2 = (V2, E2) to be neighbors if they satisfy V1 = V2 and |E1 ⊕ E2| = 1, i.e., they differ in exactly one edge. ε-Differential privacy is defined as follows.
Definition 1 (ε-differential privacy) (see [3]). A randomized algorithm A satisfies ε-differential privacy if for any two neighboring graphs G1 and G2 and for any output O ⊆ Range(A),

Pr[A(G1) ∈ O] ≤ exp(ε) · Pr[A(G2) ∈ O].

Differential privacy is based on the concept of the global sensitivity of a function f, which measures the maximum change in the output of f when any edge in the graph is changed. The global sensitivity of f is defined as Δf = max_{G1,G2} ‖f(G1) − f(G2)‖₁, where the maximum is taken over all pairs of neighboring graphs.
Differential privacy can be achieved by the Laplace mechanism and the exponential mechanism. The Laplace mechanism is mainly used for functions whose outputs are real values: differential privacy is achieved by adding properly calibrated noise, drawn randomly from the Laplace distribution, to the true answer.
Theorem 1 (Laplace mechanism) (see [3]). For any function f: G → R^d with sensitivity Δf, the algorithm

A(G) = f(G) + ⟨η1, …, ηd⟩

satisfies ε-differential privacy, where the ηi are i.i.d. Laplace variables with scale parameter Δf/ε.
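As a concrete illustration of Theorem 1, the Laplace mechanism fits in a few lines; `laplace_mechanism` and its argument names are illustrative choices, not from the paper:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release f(G) plus Laplace noise with scale sensitivity/epsilon (Theorem 1)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(true_value))
    return true_value + noise
```

Since the noise has zero mean, repeated noisy answers average out to the true value, which is why the experiments later report the median over several runs.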
The exponential mechanism is mainly used for functions whose outputs are not real numbers. The main idea is to sample the output O from the output space according to a utility function u. The global sensitivity of u is Δu = max_O max_{G1,G2} |u(G1, O) − u(G2, O)|, where the maximum is taken over neighboring graphs.
Theorem 2 (exponential mechanism) (see [25]). Given a graph G and a utility function u, the algorithm A that outputs O with probability proportional to exp(ε · u(G, O)/(2Δu)) satisfies ε-differential privacy.
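Theorem 2 can likewise be sketched over a finite candidate set; the candidate list, utility function, and the max-score subtraction for numerical stability are our own assumptions for illustration:

```python
import numpy as np

def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=None):
    """Sample a candidate with probability proportional to
    exp(epsilon * u / (2 * sensitivity)) (Theorem 2)."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility(c) for c in candidates], dtype=float)
    # subtract the max score before exponentiating for numerical stability
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```

With a large ε, the mechanism almost always returns the highest-utility candidate; with a small ε, the selection approaches uniform sampling.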
Theorem 3 (sequential composition 1) (see [26]). If each algorithm Ai provides εi-differential privacy, a sequence of algorithms A1(D), …, An(D) over the same database D provides (Σi εi)-differential privacy.
Theorem 4 (sequential composition 2) (see [27]). Let D_p be a subset sampled from D such that each data point is included independently with probability p. If algorithm A satisfies ε-differential privacy, then A(D_p) satisfies ln(1 + p(e^ε − 1))-differential privacy.
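The two composition results translate directly into budget arithmetic. The function names below are hypothetical, and ln(1 + p(e^ε − 1)) is the standard sampling-amplification bound, which we assume matches Theorem 4:

```python
import math

def composed_budget(epsilons):
    """Theorem 3: budgets of mechanisms run on the same database add up."""
    return sum(epsilons)

def subsampled_budget(epsilon, p):
    """Theorem 4: an epsilon-DP algorithm run on a p-subsample satisfies
    ln(1 + p * (e^epsilon - 1))-differential privacy."""
    return math.log(1.0 + p * (math.exp(epsilon) - 1.0))
```

Note that subsampling with p < 1 always yields a strictly smaller effective budget than running on the full data.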
3.2. Attributed Network
An attributed network G is denoted by the triple (V, E, Y), where V = {v1, …, vN} is a set of N vertices, E is a set of edges, and Y is the set of S vertex attributes. Given an integer K, the N vertices are divided into K disjoint subsets. Each edge (vi, vj) ∈ E is undirected, so G is an undirected attributed network.
4. Early Fusion and Late Fusion
We classify the methods for publishing differentially private attributed networks into two categories: early fusion and late fusion.
The main idea of early fusion is to construct a private network model that combines the vertex attributes and the topological structure into a whole and allows the two parts to interact with each other. Early fusion makes the network model private and uses it to release the private vertex attributes and topological structure. Figure 1 shows the structure of early fusion.

Late fusion must take into account that vertex attributes affect the topological structure; otherwise, adversaries may threaten privacy through the correlation between them. As a result, late fusion needs to design parameters that represent the correlation between the vertex attributes and the topological structure and to sanitize those parameters as well. However, this incurs additional privacy cost and increases the complexity of the algorithm. Figure 2 shows the structure of late fusion.

Compared with late fusion, early fusion constructs a network model that combines the vertex attributes and the topological structure into a whole that already contains their correlation, so there is no need to design separate correlation parameters. In this paper, we study the attributed network releasing method under differential privacy through early fusion.
5. Differentially Private Attributed Network Model
In this section, we introduce an attributed network probability (ANP) model and enable the generation of a synthetic attributed graph with differential privacy.
5.1. Privacy Analysis of Straightforward Approach
In practice, vertices with similar attributes connect with each other with a higher probability. For example, in a social network, people with similar interests in a certain field may all follow a well-known person in that field, so they are all connected to this vertex. To reduce complexity, we divide such vertices into the same group. Vertices in the same group are stochastically equivalent: they have similar attributes and play equivalent roles in generating the network's structure. In particular, their attributes obey the same distribution, and they connect with other vertices according to the same distribution. We assume that this kind of partition exists but is unknowable; thus, it is treated as a hidden variable that cannot be observed directly.
ANP is denoted by a triple (X, Y, Z). The adjacency matrix X is an N × N symmetric random matrix, where each x_ij is a binary random variable taking values in {0, 1}: when there is an edge between vertices v_i and v_j, x_ij = 1; otherwise, x_ij = 0. The attribute matrix Y is an N × S matrix, where y_is is a random variable denoting the value of attribute s associated with vertex v_i. The group matrix Z is an N × 1 matrix, where z_i is a random variable denoting the label of the group to which v_i belongs. ANP outputs a sample from all possible attributed graphs. ANP uses the parameters π, θ, and φ; a compact graphical representation is given in Figure 3.

The label of the group z_i is sampled from a multinomial distribution whose parameter represents the probability of a vertex belonging to each group:

z_i ~ Multinomial(π),

where π = (π_1, …, π_K) is a K-vector parameter. π_k denotes the proportion of vertices belonging to group k and satisfies π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1.
Suppose vertices in the same group have similar attributes and their attributes are samples from the same distribution. Given the group label z_i = k, the attribute y_is is sampled from a multinomial distribution whose parameter represents the probability of the attribute values:

y_is ~ Multinomial(θ_k),

where θ_k = (θ_k1, …, θ_kM) is an M-vector parameter. If vertex i belongs to group k, θ_km denotes the proportion of vertices in group k whose attribute value is m and satisfies θ_km ≥ 0 and Σ_{m=1}^{M} θ_km = 1.
Suppose vertices from group k and vertices from group l are connected with each other according to the same distribution. Given the group labels z_i = k and z_j = l, x_ij is sampled from a Bernoulli distribution whose parameter represents the probability that there is an edge between v_i and v_j:

x_ij ~ Bernoulli(φ_kl),

where φ_kl satisfies 0 ≤ φ_kl ≤ 1 and φ_kl = φ_lk.
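The generative process just described (group labels, then attributes, then edges) can be sketched as follows, assuming parameter arrays `pi` (K-vector), `theta` (K × M), and `phi` (K × K); the names and the single-attribute simplification are our own:

```python
import numpy as np

def sample_anp(N, pi, theta, phi, rng=None):
    """Draw one attributed graph from the ANP model: group labels from
    Multinomial(pi), attribute values from Multinomial(theta[k]), and
    edges from Bernoulli(phi[k, l])."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(pi)
    z = rng.choice(K, size=N, p=pi)                      # group label per vertex
    attrs = np.array([rng.choice(theta.shape[1], p=theta[z[i]]) for i in range(N)])
    X = np.zeros((N, N), dtype=int)                      # symmetric adjacency matrix
    for i in range(N):
        for j in range(i + 1, N):
            X[i, j] = X[j, i] = int(rng.random() < phi[z[i], z[j]])
    return z, attrs, X
```

Because edges depend on the group labels, which also govern the attributes, the sampled graph exhibits the attribute–structure correlation that early fusion is designed to preserve.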
We use maximum likelihood estimation (MLE) to specify the values of π, θ, and φ. The group label Z is treated as a hidden variable and cannot be observed directly, while the adjacency matrix X and the attributes Y can be observed directly and are used in the calculation. We adopt the expectation maximization (EM) algorithm to address this issue. E step: estimate the posterior distribution of the hidden variable Z. M step: update π, θ, and φ by maximizing the expected complete-data log-likelihood.
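One EM iteration for the attribute part of the model (a multinomial mixture) might look like the sketch below; the edge terms of the full likelihood are omitted for brevity, and all variable names are our own:

```python
import numpy as np

def em_step(Y, pi, theta):
    """One EM iteration for the attribute part of ANP (a multinomial mixture).
    Y[i] is the attribute value of vertex i; pi is the group distribution;
    theta[k, m] is the probability that a vertex in group k has value m."""
    # E step: responsibilities gamma[i, k] proportional to pi[k] * theta[k, Y[i]]
    gamma = pi[None, :] * theta[:, Y].T
    gamma = gamma / gamma.sum(axis=1, keepdims=True)
    # M step: re-estimate pi and theta from the responsibilities
    pi_new = gamma.mean(axis=0)
    theta_new = np.zeros_like(theta)
    for m in range(theta.shape[1]):
        theta_new[:, m] = gamma[Y == m].sum(axis=0)
    theta_new = theta_new / theta_new.sum(axis=1, keepdims=True)
    return gamma, pi_new, theta_new
```

Iterating these two steps until the parameters stabilize yields the MLE estimates that DP-ANP subsequently perturbs.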
5.2. Differentially Private ANP
Our goal is to publish a differentially private attributed network that matches the topological structure and attribute distribution of the original network as closely as possible. We identify the private ANP model and generate the synthetic network from it.
To identify the private ANP model, we add noise to the three kinds of parameters to obtain noisy parameters. Then, we use the noisy parameters to generate the sanitized network. Our differentially private attributed network publishing algorithm is shown in Algorithm 1.
[Algorithm 1]
To satisfy differential privacy, we calculate noisy parameters during each iteration of the parameter estimation. We add noise to all the parameters, and the scale of the noise depends on their sensitivity. In each iteration, we run the EM algorithm to obtain the parameters (lines 4–5).

We first add noise to the group parameter π (line 6). π_k denotes the proportion of vertices belonging to group k, and its sensitivity is the maximum change in π_k if any one vertex changes its group label. Suppose the number of vertices in group k is n_k, so that π_k = n_k/N. If one vertex from group k changes its label to another group, or one vertex from another group changes its label to k, π_k changes to (n_k − 1)/N or (n_k + 1)/N. The global sensitivity of π_k is therefore 1/N. We then sample the group labels from the multinomial distribution with the noisy parameter (line 6) and add noise to the attribute parameter θ (line 8); the parameter of each group is different.

θ_km denotes the proportion of vertices in group k that take attribute value m, and its sensitivity is the maximum change in θ_km if any one vertex in group k changes its attribute. Suppose the number of vertices in group k whose attribute equals m is q and the number of vertices in group k is n_k (line 7), so that θ_km = q/n_k. If one vertex in group k changes its attribute value from m to another value, or changes its attribute value to m, θ_km changes to (q − 1)/n_k or (q + 1)/n_k. The global sensitivity of θ_km is therefore 1/n_k.

Finally, we add noise to the connection parameter φ (line 9). Its sensitivity is the maximum change in φ_kl if any one vertex disappears or appears. Suppose the numbers of vertices in groups k and l are n_k and n_l, respectively, and there are r edges between the two groups, so that φ_kl = r/(n_k · n_l). If one vertex from group k that connects with s vertices in group l is deleted, φ_kl becomes (r − s)/((n_k − 1) · n_l). The change is maximized when this vertex connects to all the vertices in group l, i.e., s = n_l. Thus, the global sensitivity of φ_kl is bounded by 1/(n_k − 1).
After obtaining the noisy parameters π̃, θ̃, and φ̃, we sample each vertex attribute from the multinomial distribution with parameter θ̃ (line 11) and sample the edges from the Bernoulli distribution with parameter φ̃ (lines 12–14).
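A minimal sketch of the per-iteration sanitization step: add Laplace noise scaled to the parameter's sensitivity, then clip to positive values and renormalize so the result is still a probability vector. The clipping and renormalization are post-processing choices we assume here; they are not spelled out in the text, but post-processing does not consume privacy budget:

```python
import numpy as np

def sanitize_prob_vector(p, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon to each entry of a
    probability vector, then clip and renormalize (post-processing)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = p + rng.laplace(scale=sensitivity / epsilon, size=len(p))
    noisy = np.clip(noisy, 1e-12, None)   # keep entries strictly positive
    return noisy / noisy.sum()            # renormalize to a distribution
```

The same routine can sanitize π and each row θ_k by passing in the corresponding sensitivity.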
5.3. Privacy Analysis of DP-ANP
We prove that DP-ANP satisfies ε-differential privacy based on the sequential composition property.
Theorem 5. DP-ANP satisfies ε-differential privacy.
Proof. Suppose that DP-ANP makes the parameters converge after t iterations. From Algorithm 1, we know that during each iteration, sanitizing π satisfies ε1-differential privacy, sanitizing θ satisfies ε2-differential privacy, and sanitizing φ satisfies ε3-differential privacy. Based on the sequential composition property, the t iterations satisfy t(ε1 + ε2 + ε3)-differential privacy. Furthermore, generating the synthetic network does not consume any privacy budget. In conclusion, DP-ANP satisfies ε-differential privacy with ε = t(ε1 + ε2 + ε3).
6. Differentially Private Attributed Network Model Based on Hyperparameter
6.1. Differentially Private Attributed Network Model Based on Hyperparameter (DP-ANPHP)
As discussed in the previous section, we need to add noise to all the parameters π, θ, and φ to satisfy differential privacy in DP-ANP. To reduce the amount of noise and relieve its effect, we propose a differentially private attributed network publishing method based on hyperparameters, called DP-ANPHP. In DP-ANPHP, we learn the parameters by Bayesian inference and propose an efficient differential privacy method that sanitizes only part of the hyperparameters. Thus, the number of parameters we need to sanitize decreases, and we can generate the synthetic attributed network effectively. Finally, we prove that DP-ANPHP satisfies differential privacy.
However, the posterior distribution of the hidden variable is difficult to calculate when we use probabilistic inference in the Bayesian framework. Therefore, we use the variational Bayesian algorithm to address this problem. We treat the parameters π, θ, and φ as random variables and give them prior distributions: a Dirichlet prior over π with hyperparameter α, a Dirichlet prior over θ with hyperparameter β, and a Beta prior over φ with hyperparameter η. As the Dirichlet and Beta distributions are the conjugate priors of the multinomial and Bernoulli distributions, respectively, the posterior distributions of π and θ are still Dirichlet, and the posterior distribution of φ is still Beta. This provides mathematical convenience.
Let q(Z, π, θ, φ) be the variational distribution that approximates the posterior distribution p(Z, π, θ, φ | X, Y). The constant log-likelihood can be decomposed as

ln p(X, Y) = L(q) + KL(q ‖ p),

where we adopt the Kullback–Leibler (KL) divergence to measure the distance between q and p:

KL(q ‖ p) = E_q[ln q − ln p(Z, π, θ, φ | X, Y)].
Here, L(q) is the functional lower bound on ln p(X, Y):

L(q) = E_q[ln p(X, Y, Z, π, θ, φ)] − E_q[ln q].
Thus, minimizing the KL divergence is equivalent to maximizing L(q); when we maximize L(q), q gets closer to p.
Furthermore, the variational distribution can be factorized as

q(Z, π, θ, φ) = q(Z) q(π) q(θ) q(φ).

We further convert the variational distribution to hyperparameter form. Based on conjugacy, q(π) and q(θ) are Dirichlet distributions with hyperparameters α̃ and β̃, respectively, and q(φ) is a Beta distribution with hyperparameter η̃. We also introduce a new variational parameter γ to replace π as the parameter of the group label Z: the factorization requires q(Z) to no longer depend on π, so we treat γ as the hyperparameter of q(Z), which takes the form of a product of per-vertex multinomials with parameters γ_i.
To maximize L(q), we take the derivatives of L(q) with respect to the hyperparameters α̃, β̃, η̃, and γ and set them to zero. The resulting update equations express α̃, β̃, and η̃ as linear functions of the responsibilities γ, with coefficients built from Kronecker delta functions over group labels and attribute values.
To satisfy differential privacy, we need to add noise to the hyperparameters during each iteration. In practice, the update functions of α̃, β̃, and η̃ are functions of the parameter γ whose coefficients are 1 or 0, so they can be treated as sums over ranges of γ. Hence, we only need to add noise to γ during the iteration process to satisfy differential privacy. In this way, we reduce the number of parameters to which noise must be added and thus the amount of noise required by differential privacy. The global sensitivity of γ is derived in the same way as that of the group parameter in DP-ANP. The differentially private attributed network publishing algorithm based on hyperparameters is shown in Algorithm 2.
[Algorithm 2]
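The corresponding sanitization step in DP-ANPHP perturbs only the responsibility matrix γ. The projection back to row-stochastic form by clipping and renormalizing is again an assumed post-processing step, and the function name is our own:

```python
import numpy as np

def sanitize_gamma(gamma, sensitivity, epsilon, rng=None):
    """Perturb the variational hyperparameter gamma (per-vertex group
    responsibilities) with Laplace noise, then project each row back onto
    the probability simplex by clipping and renormalizing."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = gamma + rng.laplace(scale=sensitivity / epsilon, size=gamma.shape)
    noisy = np.clip(noisy, 1e-12, None)
    return noisy / noisy.sum(axis=1, keepdims=True)
```

Because only γ is perturbed and the updates of α̃, β̃, and η̃ are deterministic functions of it, the rest of the iteration is post-processing and consumes no extra budget.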
6.2. Privacy Analysis of DP-ANPHP
Based on the sequential composition property, we prove that DP-ANPHP satisfies ε-differential privacy.
Theorem 6. DP-ANPHP satisfies ε-differential privacy.
Proof. Suppose that DP-ANPHP makes the parameters converge after t iterations. From Algorithm 2, we know that during each iteration, sanitizing γ satisfies ε′-differential privacy. Based on the sequential composition property, the t iterations satisfy tε′-differential privacy. Furthermore, generating the synthetic network does not consume any privacy budget. In conclusion, DP-ANPHP satisfies ε-differential privacy with ε = tε′.
7. Experimental Results
Existing work on differentially private attributed network publishing has mainly focused on late fusion. For an effective comparison, Algorithm 3 shows a late fusion method, denoted LN, which we compare against DP-ANP and DP-ANPHP. LN adds noise to the connection probabilities directly (line 9) and perturbs the attributes using the exponential mechanism (line 12); we emphasize that these two steps are independent.
[Algorithm 3]
We evaluate the performance of DP-ANP and DP-ANPHP against LN. As the Laplace mechanism produces random noise, we measure the accuracy of the results using the median relative error over 10 runs of each mechanism.
7.1. Datasets
We evaluate the utility over one synthetic dataset and two real-life datasets. The synthetic dataset consists of 100 vertices with edges generated randomly between them. It has two attributes: one with two attribute values and the other with three. The real-life datasets are Cora and Flickr; as these two datasets contain a large number of attributes, we sample 5 discrete attributes as our experimental data. The statistics of the datasets are given in Table 1. The fill of a network is the proportion of existing edges to the number of all possible edges; it denotes the probability that an edge is present between two randomly chosen vertices.
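The fill statistic just defined can be computed directly; the function name is our own:

```python
def graph_fill(num_edges, num_vertices):
    """Fill of an undirected network: edges present divided by the number
    of all possible vertex pairs, n * (n - 1) / 2."""
    possible = num_vertices * (num_vertices - 1) // 2
    return num_edges / possible
```

For example, a triangle on 3 vertices has fill 1.0, while a 4-vertex graph with a single edge has fill 1/6.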
7.2. Evaluation under Different ε
We compare the utility of LN, DP-ANP, and DP-ANPHP using four measures: average clustering coefficient, number of triangles, normalized degree distribution, and modularity. We fix K at 4 and set the privacy budget ε to 0.25, 0.5, 0.75, and 1. We define the relative error of a measurement Q as

relative error = |Q(G) − Q̃(G)| / Q(G),

where Q̃(G) is the differentially private output.
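The relative error measure can be sketched as follows; the guard against division by zero is our addition:

```python
def relative_error(true_value, private_value, floor=1e-9):
    """Relative error |Q(G) - Q~(G)| / Q(G); `floor` guards against
    division by zero when the true measurement is 0 (our addition)."""
    return abs(true_value - private_value) / max(abs(true_value), floor)
```

A relative error of 0 means the private output matches the true statistic exactly; lower is better throughout the experiments.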
In Figures 4 and 5, we compare the relative error of the average clustering coefficient and the number of triangles. We can see that DP-ANP and DP-ANPHP outperform LN, and DP-ANPHP outperforms DP-ANP; that is, DP-ANPHP better represents the density level. Furthermore, as ε increases, the relative error of DP-ANPHP decreases sharply, while the decrease for LN and DP-ANP is smaller. The reason is that DP-ANPHP only adds noise to the hyperparameter γ, so when the privacy strength is weak, the noise affects the output only slightly; LN must add noise to all the parameters of the model and DP-ANP to a subset of them, so the noise affects their outputs more strongly. In Figure 6, we compare the relative error of the normalized degree distribution: DP-ANPHP outperforms DP-ANP, and DP-ANP outperforms LN.

[Figure 4: panels (a)–(c)]
[Figure 5: panels (a)–(c)]
[Figure 6: panels (a)–(c)]
We use modularity as the measure for cluster analysis to compare DP-ANPHP with DP-ANP. Modularity evaluates the quality of a community division: it reflects the concentration of nodes within the same group. The higher the modularity, the more tightly the nodes in the same group are connected and the more loosely nodes in different groups are connected. Modularity ranges over [−1, 1]. In Figure 7, we can see that the modularity of DP-ANPHP is higher than that of DP-ANP, mainly because DP-ANPHP updates the hyperparameters during the iteration process and thus clusters better. Moreover, as ε increases, both DP-ANP and DP-ANPHP achieve higher modularity values; that is, when the privacy strength is weaker, the community division is more accurate.

[Figure 7: panels (a)–(c)]
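Modularity as used in Figure 7 can be computed from the adjacency matrix and a group assignment. The sketch below uses the standard Newman formula, which we assume matches the paper's usage:

```python
import numpy as np

def modularity(X, labels):
    """Newman modularity of the partition `labels` on adjacency matrix X
    (undirected, no self-loops)."""
    m = X.sum() / 2.0                           # total number of edges
    deg = X.sum(axis=1)                         # vertex degrees
    Q = 0.0
    for k in np.unique(labels):
        idx = labels == k
        e_in = X[np.ix_(idx, idx)].sum() / 2.0  # edges inside group k
        deg_k = deg[idx].sum()                  # total degree of group k
        Q += e_in / m - (deg_k / (2.0 * m)) ** 2
    return Q
```

For instance, two disconnected cliques split into their natural groups give Q = 0.5, close to the theoretical maximum for two equal communities.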
7.3. Evaluation under Different K
In Figures 8–10, we show how the relative error of the average clustering coefficient, number of triangles, and normalized degree distribution varies with K, and in Figure 11 we compare the modularity. We set K to 2, 4, 6, and 8. DP-ANP and DP-ANPHP perform better when K is larger: the network is divided into more groups, and the model has more precise parameters.

[Figure 8: panels (a)–(c)]
[Figure 9: panels (a)–(c)]
[Figure 10: panels (a)–(c)]
[Figure 11: panels (a)–(c)]
8. Conclusions
In this paper, we investigated the problem of attributed network differential privacy using the idea of early fusion, which constructs a probability model combining the topological structure and the attributes and uses it to generate a private synthetic network. We proposed two early fusion methods, named DP-ANP and DP-ANPHP. DP-ANP sanitizes the model parameters during the iteration process to satisfy differential privacy; as an improvement, DP-ANPHP sanitizes the model hyperparameters during the iteration process and achieves better performance. The results of extensive experiments show that both DP-ANP and DP-ANPHP generate private synthetic attributed networks with high accuracy. In the future, we will focus on attributed network differential privacy with higher data utility.
Data Availability
The data used to support the findings of this study can be accessed from http://konect.cc/networks/.
Conflicts of Interest
The authors declare that they have no conflicts of interest.