Grid Adaptive Bucketing Algorithm Based on Differential Privacy

Li, Xiangjun; Zhao, Xuewen; Zhang, Huijuan; Han, Jideng

doi:https://doi.org/10.1155/2022/6988976

Mobile Information Systems


Research Article | Open Access

Volume 2022 | Article ID 6988976 | https://doi.org/10.1155/2022/6988976


Xiangjun Li,¹,² Xuewen Zhao,² Huijuan Zhang,² and Jideng Han³

Academic Editor: Kuo-Hui Yeh

Received 11 May 2022

Revised 18 Jul 2022

Accepted 26 Jul 2022

Published 17 Sept 2022

Abstract

With the popularity of location-based services, the tension between the availability of big location data and users' privacy has become a challenging issue. Spatial division is an effective way to compute statistics over location data. This paper proposes a grid adaptive bucketing algorithm based on differential privacy (GAB) to address the problem that existing differentially private location data division methods do not fully consider the distribution characteristics of spatial data points or the superposition of noise. The algorithm first divides the mobile app location data set into two layers of grids. It then makes an adaptive bucketing decision according to the sum-of-squares error within the divided areas, placing areas with similar distributions into the same bucket. This reduces both the uniform assumption error and the noise error caused by many blank areas. Finally, a noise allocation strategy based on the Hopkins statistic is adopted to allocate noise reasonably. Experimental results on two real large-scale location data sets, Checkin and Beijing, show that, compared with other classical space partitioning algorithms based on differential privacy, GAB performs better on common evaluation indicators such as relative error and absolute error, which means GAB obtains better range query results and privacy protection.

1. Introduction

With the rapid development of Internet technology and the popularization of sensors and mobile terminals, society has entered the information age [1, 2]. The explosive growth of data allows people to obtain valuable information through many channels. The release of big location data makes it possible to provide various location-based services (LBSs) [3, 4], so LBS has gradually become an indispensable tool in people's lives. Among the underlying techniques, the location data range query is commonly used in spatial data analysis, and large-scale data statistics help people understand regional conditions. Some current statistical applications do not need the specific location of objects at a specific time but only the number of objects in a particular area, for example when viewing pedestrian traffic at an intersection, obtaining the distribution of local taxis, or understanding the distribution of endangered species. However, because location information usually contains much private information, users risk leaking private information when they use these services. Malicious attackers can use data mining and other techniques to analyze the data submitted by users and infer specific information from it. From a user's location information, an attacker can infer sensitive information such as gender, habits, behavior characteristics, and health status, which may threaten the user's property and personal safety [5–9].

Real-life location data are represented mainly by two-dimensional point coordinates and are extremely unevenly distributed because the targets move. Therefore, most current algorithms divide spatial regions with an index structure (e.g., grids, kd-trees, and quadtrees) and then publish statistics on the data in the divided regions instead of the detailed location of each data point. The user can then only query the traffic state of targets in a specific region, which significantly reduces the risk of compromising the actual location of a user in that region. However, malicious attackers can still target the location of an individual by continually narrowing the query area, which threatens the user's location privacy.

K-anonymity [10] is an important method for protecting the privacy of data sets. It generalizes and hides the critical information in the data to ensure the security of individual identity information. However, the algorithm protects user identity without preventing the leakage of user attribute information. Before privacy protection is applied, the background knowledge of malicious attackers must be fully considered in order to formulate generalization and concealment strategies that resist background knowledge attacks [11]. Given the development of network technology, it is impossible to consider all the background knowledge an attacker may possess, so this algorithm is unsuitable for the privacy protection of spatial location data. Dwork et al. [12–14] proposed the mathematical model of differential privacy, which distorts the query results of a data set by adding random noise that follows the Laplace distribution or other distributions. Unlike the k-anonymity model, differential privacy is not affected by background knowledge: whatever the attacker's background knowledge, it does not help in inferring the target's private information. Therefore, applying the differential privacy model to spatial location data can further reduce the risk of leaking user location privacy. However, the availability of regional statistical values is directly related to the amount of noise added to the data set. When the added noise is too large, data privacy is better protected, but the availability of the statistical values is significantly reduced. Therefore, improving the availability of regional statistical values while ensuring the privacy and security of spatial location data has become a vital research issue.

To solve the abovementioned problems, this paper proposes an adaptive grid binning method that enhances the availability of regional statistics while protecting the privacy of user location information. The main contributions are as follows:

(1) This paper proposes an adaptive grid binning algorithm to balance the noise error and the uniform assumption error generated in the space division process. After the division is completed, the areas that meet the merging condition are adaptively merged according to each area's sum-of-squares error, which reduces the negative impact of excessive noise superposition.

(2) To further improve the usability of location data, this paper designs a new noise allocation scheme. When noise is added, it is allocated according to the ratio of the Hopkins statistic of each grid in the bucket, which improves the practicability of the data.

(3) Experiments on real location data sets show that the proposed adaptive grid binning method effectively balances the noise error and the uniform assumption error and improves the availability of regional statistical values.

This article is organized as follows: Section 2 reviews space division methods based on differential privacy. Section 3 outlines the basic definitions of differential privacy, analyzes the sources of error, and introduces the adaptive grid binning strategy in detail. Section 4 presents the experimental results that verify the feasibility of the proposed method. Section 5 summarizes the main work of this article and discusses future research directions.

2. Related Work

The protection of location privacy has long been studied extensively [15–17]. Differential privacy, a privacy protection model with a solid mathematical foundation, is well suited to protecting location privacy [18–20]. Space division methods based on differential privacy can be roughly divided into two categories according to whether the division depends on the location data: spatial data-independent division and spatial data-related division.

Differential privacy-based spatial data-independent partitioning refers to methods in which the partitioning of space is not directly related to the distribution of the data set itself, and noise following a particular distribution is added to each region after the partitioning is completed. Quad-heu [21] is a spatial partitioning algorithm based on the QuadTree structure. It uses a heuristic judgment strategy to adjust and merge the QuadTree from the bottom up, which reduces uniform assumption errors and improves query accuracy. Its drawback is that the depth of the tree cannot be determined well, which lowers the final query accuracy. Zhang et al. [22] proposed the PrivTree method, which answers spatial range counting queries by publishing the noisy results of leaf nodes and introduces controlled deviation noise to decide whether to partition a subtree, thereby eliminating the need for a predefined partition depth. However, the method still uses a complete QuadTree structure, which generates a sizeable uniform assumption error. Yan [23] proposed a heuristic QuadTree partitioning method and a corresponding privacy budget allocation strategy, together with an adaptive sampling mechanism based on a proportional-integral-derivative (PID) controller that reflects dynamic changes in the data. The literature [24] proposes the concept of noneliminable Laplacian noise, adding Laplacian noise to the leaf regions only, which solves the noise cancellation problem. The literature [25] designed an arithmetic privacy budget allocation strategy based on the QuadTree structure, significantly reducing the noise error caused by uniform privacy budget allocation.

In addition to tree structures, grid structures are commonly used for spatial segmentation. UG [26] divides the location region into m × m grid cells of equal size and adds differential privacy noise to each grid with the same privacy budget. However, this method does not consider the data distribution in the grid, which leads to a significant uniformity assumption error. The literature [27] builds on UG by synthesizing the noise results into a data set. Then, IH-Tree partitioning was performed on the synthetic data, and the original data set was partitioned using the generated partition keys, reducing the noise error and improving range query accuracy.

Differential privacy-based spatial data-related partitioning means that the partitioning of space is based mainly on the actual distribution of the data. After the partitioning, noise following a specific distribution is added to each region. To solve UG's problem of not considering the actual data distribution, Qardaji et al. proposed the adaptive two-layer grid division method AG [26]. Building on UG, AG adaptively determines the second-layer division granularity of each grid cell from the noisy count N of that cell in the first layer. Although AG considers the data distribution to some extent, it still uses a uniform grid structure in the first layer and therefore incurs more uniform assumption error.

Huang et al. [28] proposed the Kd-PPDP algorithm. Building on AG, Kd-PPDP measures the similarity between adjacent grids with a sum-of-squares error and adaptively merges the grid regions whose similarity meets the requirement, which reduces the superposition of noise errors. The literature [29] introduces random sampling, stating that Bernoulli random sampling satisfies differential privacy. That paper first performs Bernoulli random sampling multiple times to obtain an adequate spatial data set D. It then performs grid partitioning and applies threshold filtering to the grid layers to determine whether a grid should be merged upward or subdivided downward. Yan et al. [30] proposed an unbalanced quadtree partitioning algorithm (UBQP-gra) based on a region uniformity condition. It partitions iteratively according to the distribution density of location points and stops when the target region is empty, the region's area is too small, or the uniformity condition is satisfied, which reduces the errors caused by a large number of blank nodes. The literature [31] proposes a privacy-preserving dynamic data publishing method based on microaggregation, which introduces a dynamic update procedure to publish and update data dynamically. References [32] and [33] proposed the concept of location point density, which divides the target region into different regions based on the density of location points and then sets a different partitioning strategy for each region. However, this method divides the target region into regions of different sizes and increases the uniformity error (see Table 1).

In response to the shortcomings of the above methods, this paper builds on existing grid partitioning methods and proposes a differentially private method for processing location data. It merges similar grids according to their sum-of-squares error values, determines the distribution of data within each grid using the Hopkins statistic, and allocates noise adaptively, thereby reducing noise errors and improving query accuracy.

3. GAB Algorithm

3.1. Adaptive Bucket Strategy

3.1.1. Strategy Description

Because spatial points are highly unevenly distributed in real life, most spatial segmentation methods do not adequately consider the distribution characteristics of spatial data points or the superposition of noise. In addition, the noise error and the uniformity assumption error usually trade off against each other. When the target area is judged to be uniformly distributed and is no longer divided, the noise error in the query results is reduced, but the uniformity assumption error grows; conversely, dividing the target area too finely reduces the uniformity assumption error but introduces a large amount of noise error. Therefore, this paper proposes an adaptive bucketing strategy. It first divides the spatial data set into two layers of grids and then makes an adaptive bucketing decision based on the sum-of-squares error within the divided regions, heuristically placing regions with similar distributions into the same bucket. This reduces both the uniformity assumption error and the noise error caused by a large number of blank regions.

3.1.2. Related Concepts

The differential privacy model ensures that the published information does not change significantly depending on whether any one individual is in the data set, so the privacy of each individual is guaranteed. Differential privacy is not affected by the attacker's background knowledge: no matter how much background knowledge the attacker has, it does not help in recovering accurate information about an individual.

(1) Differential Privacy.

Definition 1. (ε-differential privacy). Let $D$ and $D'$ be any adjacent data sets that differ in only one record, let $A$ be a random algorithm, and let $\mathrm{Range}(A)$ be the value range of $A$. If the algorithm satisfies the following formula for $D$ and $D'$ and any subset $S \subseteq \mathrm{Range}(A)$, then algorithm $A$ is said to satisfy ε-differential privacy.

$$\Pr[A(D) \in S] \le e^{\varepsilon} \cdot \Pr[A(D') \in S]$$

Here the parameter ε is called the privacy budget and is used to balance the privacy and availability of the data. The smaller ε is, the higher the privacy of the data and the lower its availability, and vice versa.

Two factors affect the output of differential privacy: the privacy budget ε and the global sensitivity Δf. Sensitivity measures how much the returned query result can differ depending on whether a certain record is in the data set, and is defined as follows.

Definition 2. (global sensitivity). For any query function $f: D \to \mathbb{R}^d$ and any adjacent data sets $D$ and $D'$, the global sensitivity of $f$ is

$$\Delta f = \max_{D, D'} \left\| f(D) - f(D') \right\|_1$$

where $\mathbb{R}$ is the space of real numbers, $d$ is the query dimension, and the maximum is taken over all pairs of neighboring data sets, giving the largest possible difference between their query results.

The random noise mechanism is the main technique for achieving differential privacy protection, where the Laplacian mechanism is the most fundamental mechanism for providing privacy protection for numerical data sets, which achieves differential privacy by adding noise that obeys the Laplacian distribution to the true results.

Definition 3. (Laplace mechanism). For any function $f: D \to \mathbb{R}^d$, if the output of random algorithm $A$ satisfies the following formula, algorithm $A$ is said to satisfy ε-differential privacy.

$$A(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right)$$

Here $\mathrm{Lap}(\Delta f/\varepsilon)$ is a noise variable that follows the Laplacian distribution, and the size of the noise is proportional to Δf and inversely proportional to ε.

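
The Laplace mechanism of Definition 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sampler relies on the standard fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with scale b.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value perturbed with noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


# A counting query has sensitivity 1: adding or removing one record
# changes the count by at most 1.
rng = random.Random(42)
noisy_count = laplace_mechanism(true_value=457, sensitivity=1.0, epsilon=0.1, rng=rng)
print(noisy_count)
```

With ε = 0.1 and sensitivity 1 the noise scale is 10, so a smaller budget visibly widens the spread of the released count.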
When dealing with some complex problems, the differential privacy protection algorithm can be used multiple times to disturb data.

Definition 4. (serial composability). Given a data set $D$ and a set of random algorithms $A_1, A_2, \ldots, A_n$, where each algorithm $A_i$ satisfies $\varepsilon_i$-differential privacy and any two algorithms are independent of each other, the combined algorithm obtained by applying $A_1, A_2, \ldots, A_n$ to $D$ satisfies $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differential privacy.

Definition 5. (parallel composability). Given a data set $D$ divided into disjoint sub-datasets $D_1, D_2, \ldots, D_n$ and a set of random algorithms $A_1, A_2, \ldots, A_n$, where each algorithm $A_i$ satisfies $\varepsilon_i$-differential privacy and any two algorithms are independent of each other, the combined algorithm obtained by applying each $A_i$ to $D_i$ satisfies $\left(\max_i \varepsilon_i\right)$-differential privacy.
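
The two composition properties amount to simple budget accounting, which the following sketch makes explicit. The helper names are hypothetical and serve only to illustrate Definitions 4 and 5.

```python
def serial_budget(epsilons):
    """Total budget when every algorithm runs on the same data set."""
    return sum(epsilons)


def parallel_budget(epsilons):
    """Total budget when each algorithm runs on a disjoint sub-dataset."""
    return max(epsilons)


# Two queries over the same data, each with budget 0.5, cost 1.0 in total,
# while noisy counts over disjoint grid cells cost only the largest budget.
print(serial_budget([0.5, 0.5]))
print(parallel_budget([0.5, 0.5, 0.5]))
```

This is why adding noise to disjoint grid cells or buckets does not multiply the privacy cost by the number of cells.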

(2) Spatial Range Query. The spatial range query is the primary application of geographic information data maps. To determine the number of location points in a particular area, the most basic method is to traverse the entire location data set and count the data points within the query range. The time complexity of this method is O(n), where n is the number of location points. Since position data in real life are usually represented by two-dimensional point coordinates that are enormous in number and unevenly distributed, traversing the entire data set is not advisable.

When the spatial location data set is divided by an index structure (QuadTree, grid, etc.) into more evenly distributed areas, there is no need to traverse all the location points. It suffices to judge, for each generated area, whether it intersects the range query box, which significantly reduces time complexity and improves query efficiency. According to Figure 1, three situations can arise when judging whether a generated area V and the range query box P intersect. If V and P do not intersect, the data points in V cannot be in the query box, and area V is skipped. If area V is wholly contained in P, the number of location points in V is added to the final result. If area V is only partly inside P, the total number of location points in V is multiplied by the ratio of the intersecting area to the total area of V, and this product is added to the final result.
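
The three cases above can be sketched as a short routine. This is an illustrative sketch under the uniformity assumption, not code from the paper; the `Rect` class and `range_count` function are hypothetical names, and regions are assumed to be axis-aligned rectangles with nonzero area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self):
        # Negative extents mean an empty rectangle, so clamp them to zero.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def intersection(self, other):
        return Rect(max(self.x1, other.x1), max(self.y1, other.y1),
                    min(self.x2, other.x2), min(self.y2, other.y2))


def range_count(cells, query):
    """Estimate the number of points inside `query`.

    `cells` is a list of (Rect, count) pairs produced by the partition.
    """
    total = 0.0
    for region, count in cells:
        overlap = region.intersection(query).area()
        if overlap == 0.0:
            continue  # case 1: no intersection, skip the area
        # case 2 (fully contained) gives ratio 1; case 3 gives a partial ratio
        total += count * overlap / region.area()
    return total


cells = [(Rect(0, 0, 1, 1), 10), (Rect(1, 0, 2, 1), 20), (Rect(2, 0, 3, 1), 30)]
print(range_count(cells, Rect(0.5, 0, 2, 1)))
```

In the usage example the first cell is half covered, the second is fully covered, and the third only touches the query boundary, so it contributes nothing.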

3.1.3. Strategy Steps

Because spatial points are highly unevenly distributed in real life, most spatial division methods do not fully consider the distribution characteristics of spatial data points or the problem of noise superposition. In addition, the noise error and the uniform assumption error usually trade off against each other. When the target area is judged to be uniformly distributed and is no longer divided, the noise error in the query result is reduced, but the uniform assumption error grows; conversely, dividing the target area too finely reduces the uniform assumption error but introduces much more noise error.

Therefore, this article first performs a bucketing operation on the divided grid according to the sum-of-squares error (SSE). For two regions i and j, two quantities are compared: SSE1, the squared error between the real count values and the count values obtained by adding noise to each region directly, and SSE2, the squared error between the real count values and the noisy count value produced by merging regions i and j into one bucket and dividing the noise equally.

When SSE1 > SSE2, the noisy count value after merging is closer to the real count value, so regions i and j are combined into one bucket. As shown in Figure 2, the areas connected by a dotted line form a bucket. The first-layer grids that are not divided further are placed directly in the same bucket, whereas the second-layer grids must be judged with the comparison above.
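
The merging test can be illustrated with a small sketch. The exact formula is not reproduced here, so the code below is an assumption based on the verbal description: SSE1 sums the squared errors of the two separately noised counts, and SSE2 sums the squared errors when one noise draw is added to the merged count and the noisy total is split equally between the two regions. The function names and the explicit noise arguments are hypothetical; passing the noise draws in keeps the example deterministic.

```python
def sse(true_counts, estimates):
    """Sum of squared errors between true counts and their estimates."""
    return sum((t - e) ** 2 for t, e in zip(true_counts, estimates))


def should_merge(c_i, c_j, noise_i, noise_j, noise_merged):
    """Return True when regions i and j should share a bucket (SSE1 > SSE2)."""
    # SSE1: each region receives its own noise draw (assumed form).
    sse1 = sse((c_i, c_j), (c_i + noise_i, c_j + noise_j))
    # SSE2: the merged bucket receives one noise draw, and the noisy total
    # is divided equally between the two regions (assumed form).
    shared = (c_i + c_j + noise_merged) / 2.0
    sse2 = sse((c_i, c_j), (shared, shared))
    return sse1 > sse2


# Two empty regions: merging halves the injected noise, so they are merged.
print(should_merge(0, 0, 3.0, -4.0, 2.0))
# Very different counts: averaging would destroy the signal, so no merge.
print(should_merge(100, 0, 1.0, 1.0, 1.0))
```

Under these assumptions, blank or similar regions tend to merge, which matches the stated goal of reducing the noise contributed by many blank areas.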

3.2. Noise Allocation Strategy

3.2.1. Strategy Description

Uniform distribution of noise is the simplest and most straightforward method. However, since the distribution of location data is highly uneven in real life, the accuracy of spatial range queries varies drastically with the data distribution. Yan [25] proposed and proved that an arithmetic allocation strategy outperforms the uniform allocation strategy in query accuracy under the same partitioning conditions. Inspired by this result, this paper introduces the Hopkins statistic, which characterizes the distribution of data points within a grid and thus allows noise to be allocated adaptively according to that distribution.

3.2.2. Related Concepts

In space division, the location data set needs to be divided according to a specific structure and cannot be subdivided into single data points. Therefore, it is usually assumed that the location points in each area formed are evenly distributed. At the same time, since a proper amount of noise is added to the count value of each area to achieve privacy protection, the spatial range query result mainly contains two kinds of errors: noise error and uniformity assumption error.

Definition 6. (noise error (NoiseErr)). NoiseErr is the error generated by adding noise to the real count value. Given a certain area P, $c(P)$ and $c'(P)$ are the raw count and the noisy version of the count value in that area, respectively. NoiseErr(P) is expressed as

$$\mathrm{NoiseErr}(P) = \left| c'(P) - c(P) \right|$$

If much noise is added, the noise error becomes too large. In addition, a spatial partition structure that is too fine-grained or otherwise inappropriate generates many empty nodes, which introduces excessive noise error and causes the noise superposition problem.

Definition 7. (the uniform hypothesis error (NonErr)). NonErr is generated by the cells that intersect the query rectangle Q but are not completely contained in it. Suppose the units intersecting the query rectangle Q are $u_1, u_2, \ldots, u_k$ and the intersecting ratios are $r_1, r_2, \ldots, r_k$, respectively. Then NonErr(Q) is expressed as

$$\mathrm{NonErr}(Q) = \sum_{i=1}^{k} \left| r_i \cdot c(u_i) - c(u_i \cap Q) \right|$$

Figures 3 and 4 show two division schemes under different regional distributions. The data distribution in Figure 3 is relatively uneven, while the data point distribution in Figure 4 is very uniform. The dashed box represents the query range P, and each number represents the actual statistical value of the data points in the area. When querying the target area P, the range count obtained by scheme a in Figure 3 is (138 + Y + 130 + Y)/2 = 134 + Y, and the range count for scheme b is (0 + Y + 0 + Y + 0 + Y + 0 + Y) = 0 + 4Y. In Figure 4, the query result of scheme a is (138 + Y + 130 + Y)/2 = 134 + Y, and the query result of scheme b is (34.5 + Y + 34.5 + Y + 32.5 + Y + 32.5 + Y) = 134 + 4Y, where Y represents the noise added under the same privacy budget. The error analysis in Table 2 shows that if the actual distribution of the data set is relatively uniform, the resulting uniformity error is relatively small, and in that case the more finely the unit is divided, the more noise error is added. Conversely, when the actual data points are unevenly distributed, stopping the division too early leads to a larger uniformity assumption error.

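[Figure 3: division schemes (a) and (b) for an unevenly distributed region]

[Figure 4: division schemes (a) and (b) for a uniformly distributed region]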

3.2.3. Strategy Steps

A uniform allocation of the privacy budget cannot meet the privacy protection needs of different regions well. To address this, this paper uses the Hopkins statistic to adaptively characterize the distribution in each grid and allocate noise accordingly. The Hopkins statistic can be used to judge the randomness of data in space. First, randomly select n points $p_1, \ldots, p_n$ from the data points in the current grid. For each point $p_i$, find its nearest data point in the current grid and calculate the distance $x_i$ between them. Then, randomly generate n uniformly distributed points $q_1, \ldots, q_n$ within the current grid range. For each point $q_i$, find the nearest data point in the current grid and calculate the distance $y_i$ between them. The Hopkins statistic H can be expressed as

$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$

If the data distribution in the grid is uniform, the value of H should be close to 0.5, and when the data distribution in the grid is sparse, the value of H should be close to 0. Otherwise, the value of H should be close to 1. Therefore, this paper assigns noise according to the ratio of the Hopkins statistic H of each grid in a bucket. Given the privacy budget of the bucket, the noise of the i-th grid in the bucket is allocated as follows:

$$\mathrm{noise}_i = \frac{H_i}{\sum_{j=1}^{k} H_j} \cdot err$$

Here err is the total noise allocated in the bucket that meets the privacy budget, and k is the total number of regions contained in the current bucket. The formula shows that the denser the distribution of data points in a grid, the more private information it contains, the higher its $H_i$ value, and the greater the noise added to it. When the distribution of data points in a region is sparser, only a small amount of noise is added.
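
The Hopkins-based allocation can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, each grid is assumed to hold at least two points, distances are Euclidean, the nearest neighbor is found by brute force, and a sampled data point is excluded from its own neighbor search.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to the closest element of `others`."""
    return min(math.dist(point, o) for o in others)


def hopkins_statistic(points, bounds, n, rng):
    """H = sum(y) / (sum(x) + sum(y)) for the points of one grid cell.

    points: list of (x, y) tuples; bounds: (x1, y1, x2, y2) of the cell.
    """
    x1, y1, x2, y2 = bounds
    sample = rng.sample(points, n)
    # x_i: distance from a sampled data point to its nearest other data point
    xs = [nearest_distance(p, [q for q in points if q is not p]) for p in sample]
    # y_i: distance from a uniformly generated point to its nearest data point
    ys = [nearest_distance((rng.uniform(x1, x2), rng.uniform(y1, y2)), points)
          for _ in range(n)]
    return sum(ys) / (sum(xs) + sum(ys))


def allocate_noise(total_noise, hopkins_values):
    """Split the bucket's total noise in proportion to each grid's H value."""
    total_h = sum(hopkins_values)
    return [total_noise * h / total_h for h in hopkins_values]


rng = random.Random(1)
cell_points = [(rng.uniform(0, 0.05), rng.uniform(0, 0.05)) for _ in range(50)]
h = hopkins_statistic(cell_points, (0.0, 0.0, 1.0, 1.0), n=10, rng=rng)
print(h, allocate_noise(10.0, [h, 0.5]))
```

In the usage example the points are packed into one corner of the cell, so the uniformly generated points lie far from any data point and H approaches 1. A grid with such a dense cluster then receives a larger share of the bucket's noise than a grid with H near 0.5.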

3.3. Analysis of the Effectiveness of the Strategy

On the spatial data set D, the original grid division and the GAB division are compared to verify the effectiveness of the adaptive bucketing strategy and the noise allocation strategy.

Figure 5 shows a schematic representation of the original grid division. The dotted box indicates the query range P, and each number indicates the actual statistical value of the data points in the area. The partition granularity is set to 5, that is, each side of the current area is divided into 5 blocks, and the privacy budget ε is 0.1. The actual result within the query range is 457 and the query result is 494.1, so the absolute error of the query is |494.1 − 457| = 37.1 and the relative error is 37.1/457 ≈ 8.1%.

[Figure 5: original grid division, panels (a) and (b)]

Figure 6 shows the GAB division process, where the division granularity and privacy budget are the same as before. Figure 6(b) shows the area classification produced by adaptive bucket merging. The number in each area identifies the bucket it belongs to, and the entire area is divided into 5 buckets in total. For each bucket, Laplacian noise is added according to the ratio of the Hopkins statistic of the grids within the bucket. The result of the final division is shown in Figure 6(c). The spatial range query value is 465.2, so the absolute error of the query is |465.2 − 457| = 8.2 and the relative error is 8.2/457 ≈ 1.8%. These results show that the adaptive bucket merging strategy and the noise allocation strategy can effectively reduce the noise error in the query result, thereby improving the usability of the data.

[Figure 6: GAB division process, panels (a), (b), and (c)]

3.4. GAB Algorithm Flow

The above section explains the bucket operation and noise distribution of the algorithm, and this section gives the pseudocode of the overall algorithm in detail (Algorithm 1).

	Input: Dataset D, privacy budget ε
	Output: Noisy data set D′
(1)	N = count(D)
(2)	V_{m1×m1} = split(D, m1)
(3)	for (i = 0; i < m1 × m1; i++) do
(4)	    c′(V_i) = c(V_i) + Lap(1/(0.5ε))
(5)	    if m2 ≤ 1 then
(6)	        Cbucket(c(V_i), c′(V_i))
(7)	    else
(8)	        V_{m2×m2} = split(V_i, m2)
(9)	        for (j = 0; j < m2 × m2; j++) do
(10)	            c′(V_ij) = c(V_ij) + Lap(1/(0.5ε))
(11)	            Cbucket(c(V_ij), c′(V_ij))
(12)	        end for
(13)	    end if
(14)	end for
(15)	D′ = AddNoise(D)
(16)	return D′

There are two important steps in the algorithm: bucketing and noise allocation. The grid granularities m1 and m2 are set based on the literature [26], where m1 represents the division granularity of the first layer of grids and m2 represents the division granularity of the second layer of grids. For the Cbucket part, as described in the adaptive bucket strategy steps, the algorithm traverses all the divided grids. For each pair of grids, it judges whether the sum-of-squares error between the actual values and the noisy values obtained after the two grids are merged is less than the sum-of-squares error between the actual values and the values obtained by adding noise directly. If the condition is met, the two grids are placed in the same bucket; if not, the noise is added directly.
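
For orientation, the granularity choice can be sketched using the guidelines usually attributed to the AG method of [26]. Everything specific here is an assumption: the constants c = 10, c2 = c/2, and α = 0.5, the rounding, and the function names are taken from or modeled on that method and may differ from the settings used for GAB.

```python
import math


def first_level_granularity(n_points, epsilon, c=10.0):
    """m1 = max(10, (1/4) * ceil(sqrt(N * eps / c))), rounded up (assumed AG guideline)."""
    return max(10, math.ceil(0.25 * math.ceil(math.sqrt(n_points * epsilon / c))))


def second_level_granularity(noisy_count, epsilon, alpha=0.5, c2=5.0):
    """m2 = ceil(sqrt(N' * (1 - alpha) * eps / c2)) for one first-level cell.

    A negative noisy count is clamped to zero, which yields m2 = 0 and
    therefore no further split (the `m2 <= 1` branch of Algorithm 1).
    """
    usable = max(0.0, noisy_count)
    return math.ceil(math.sqrt(usable * (1.0 - alpha) * epsilon / c2))


m1 = first_level_granularity(1_000_000, epsilon=1.0)
m2 = second_level_granularity(500, epsilon=1.0)
print(m1, m2)
```

The point of the sketch is that m2 depends on the noisy count of each first-layer cell, so sparse cells are left undivided while dense cells are split further.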

As shown in Figure 7, GAB first extracts the latitude and longitude information from the data set and grids the data set to obtain m1² grids in the first layer. Next, it determines which grids require second-level division, merges the grids that do not meet the second-division condition into a bucket, and divides the grids that do meet it into m2² area elements. The adaptive bucketing operation is then performed on the resulting grid structure. Finally, the Hopkins statistic H_i of each grid in the bucket is calculated, and the Laplacian noise that meets the privacy budget ε is allocated. The noise is added to L independent units, where L is the final number of merged buckets. Because these units are disjoint, parallel composability applies, so GAB satisfies ε-differential privacy. In steps 1-2 of the algorithm, the data set D has |D| data points, and the traversal cost is n times. In steps 3–5, the data set D has nm² units, and the traversal cost is nm² times. Therefore, the computational complexity of the algorithm is related to the number of samples and the size of the divided grid, and the total time cost is O(n²), where n is the number of regional blocks after the division.

4. Experimental Results and Analysis

This paper designs two sets of experiments to verify the performance of the GAB algorithm. The first set compares the range query effectiveness of GAB with that of other classic space division algorithms based on differential privacy, namely Quad-heu [21], UG [26], AG [26], and Kd-PPDP [28]. The second set verifies the privacy protection provided by the GAB algorithm.

4.1. Experimental Setup

4.1.1. Dataset and Evaluation Indicators

The experimental platform is an 8-core Intel i7-10700 CPU (2.9 GHz) with 16 GB RAM running Windows 10. Python was used for the programming simulation and for plotting the experimental results.

4.1.2. Dataset

The experiment uses two real large-scale location data sets: Checkin and Beijing. The Checkin data set is obtained from the location-based social network provider Gowalla and records the check-in location information of Gowalla users from February 2009 to October 2010. The Beijing data set contains GPS trajectories of 10,357 taxis in Beijing from February 2 to February 8, 2008. The total number of location points in this data set is about 15 million, and the total distance of the trajectories reaches 9 million kilometers. The average sampling interval is about 177 seconds, and the average distance between samples is about 623 meters. The specific details and visualization results of the two data sets are shown in Figure 8.


4.1.3. Experimental Parameters

The privacy budget ε takes the values 0.1, 0.5, and 1.0, representing higher, moderate, and lower levels of differential privacy protection, respectively. Following [26], the query area sizes (Q) are shown in Table 3. Under each budget value, experiments are performed on five query ranges of different sizes. For each query range, 100 experiments are performed within the valid range of the data set, and the final result is their average.
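The query workload described here can be sketched in Python as follows. The query sizes of Table 3 are not reproduced, so the width and height arguments are placeholders, and `random_queries` is a hypothetical helper that assumes queries are placed uniformly at random inside the data range.

```python
import numpy as np

def random_queries(bounds, width, height, n_queries=100, rng=None):
    """Draw axis-aligned rectangles of a fixed size that lie inside the domain.

    bounds : (x_min, x_max, y_min, y_max) of the valid data range
    Returns an array of shape (n_queries, 4) whose rows are
    (x_low, x_high, y_low, y_high).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_min, x_max, y_min, y_max = bounds
    if width > x_max - x_min or height > y_max - y_min:
        raise ValueError("query rectangle is larger than the domain")
    # Choose the lower-left corner so that the rectangle stays in bounds.
    x_low = rng.uniform(x_min, x_max - width, size=n_queries)
    y_low = rng.uniform(y_min, y_max - height, size=n_queries)
    return np.column_stack([x_low, x_low + width, y_low, y_low + height])
```

Each of the five query sizes would be passed through such a generator once per privacy budget, and the errors of the 100 resulting queries averaged.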

4.2. Algorithm Evaluation Indicators

Because location data processed with differential privacy are perturbed by noise, adding too much noise protects the privacy of users' location data better but significantly reduces the usability of the data. To weigh data availability against protection on the above data sets, the relative error (RE) is used to measure the accuracy of the algorithm's query results, and the absolute error (AE) is used to measure the availability of the query results and the privacy protection of the algorithm.

The relative error is $\mathrm{RE}(Q) = \frac{|\tilde{Q}(D) - Q(D)|}{\max\{Q(D), \rho\}}$, where $Q$ represents the range of the query, $Q(D)$ represents the real query result in the range $Q$, $\tilde{Q}(D)$ represents the query result with noise added in the range $Q$, $|D|$ represents the total number of points in the data set, and $\rho$, a threshold set relative to $|D|$, prevents a zero denominator from affecting the experimental results. Under the premise of satisfying differential privacy, the smaller the relative error of the query result, the closer the query result is to the real result and the higher the availability of the data perturbed by the algorithm.

The absolute error is $\mathrm{AE}(Q) = |\tilde{Q}(D) - Q(D)|$, where $Q$ represents the range of the query, $Q(D)$ represents the real query result in the range $Q$, and $\tilde{Q}(D)$ represents the query result with noise added in the range $Q$. Under the premise of satisfying differential privacy, the absolute error of the query results is never zero across multiple queries, which ensures that the results returned by the spatial count query always differ from the true value and reflects the security of the algorithm.

The reported value is the average over repeated experiments, $\bar{E} = \frac{1}{N}\sum_{i=1}^{N} E_i$, where $i$ represents an experiment, $E_i$ is the error (RE or AE) of the $i$-th experiment, and $N$ represents the total number of experiments.
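A minimal Python sketch of these indicators follows, assuming the usual forms RE = |noisy − true| / max(true, ρ) and AE = |noisy − true|, with the final figure taken as the mean over experiments. The function names are hypothetical, and the value ρ = 0.001·|D| used in the usage lines is an assumption for illustration rather than a setting confirmed by the text.

```python
import numpy as np

def relative_error(true_count, noisy_count, rho):
    """RE = |noisy - true| / max(true, rho); rho guards against a zero denominator."""
    return abs(noisy_count - true_count) / max(true_count, rho)

def absolute_error(true_count, noisy_count):
    """AE = |noisy - true|."""
    return abs(noisy_count - true_count)

def average_error(errors):
    """Mean of the per-experiment errors."""
    return float(np.mean(errors))

# Usage with an assumed threshold rho = 0.001 * |D| for |D| = 10,000 points.
rho = 0.001 * 10_000
re_empty_region = relative_error(0, 3, rho)      # true count is 0, so rho is the denominator
re_dense_region = relative_error(200, 210, rho)  # true count exceeds rho
```

The threshold matters mostly for queries over blank regions, where the true count is zero or close to it and the plain ratio would be undefined or unstable.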

4.3. Experimental Results and Analysis

4.3.1. Effectiveness Analysis

(1) Comparison of RE Values Based on the Checkin Data Set. Figures 9(a)–9(c) compare the relative error values generated by different query ranges when the privacy budget is fixed. Regardless of the value of the privacy budget ε, the range query RE value first increases and then decreases as the query area grows. This is because the Checkin data set contains large blank areas. When the query area first grows, it takes in more blank areas than data-intensive areas, so the error introduced is larger. As the query area increases further, the proportion of data-intensive areas rises and the error decreases. Figures 9(d)–9(h) compare the relative error values when the privacy budget changes from 0.1 to 1 for a fixed query range. Because a larger privacy budget means less data perturbation, the relative error decreases as the privacy budget increases. Across the algorithms, UG almost always produces a larger query error, and GAB achieves a better query effect. This is mainly because UG adopts a data-independent single-layer grid division structure: an equal amount of noise is added to each grid regardless of the actual distribution of location points, which introduces more noise error and uniform assumption error. GAB reduces the impact of data sparseness through the reasonable allocation of buckets and noise.
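The uniform assumption error mentioned for grid methods arises when a range query cuts through a cell and the cell's count is split in proportion to the covered area. The following Python sketch shows that estimate for an equal-width grid; `range_query_uniform` and its argument layout are hypothetical and are not taken from UG or GAB.

```python
import numpy as np

def range_query_uniform(cell_counts, bounds, query):
    """Estimate the count inside `query` from per-cell counts.

    cell_counts : 2-D array; cell_counts[i, j] covers the i-th x-interval
                  and j-th y-interval of an equal-width grid over `bounds`
    bounds      : (x_min, x_max, y_min, y_max)
    query       : (x_low, x_high, y_low, y_high)

    Each cell contributes its count times the fraction of its area that the
    query covers, i.e. points are assumed to be uniform within a cell.
    """
    x_min, x_max, y_min, y_max = bounds
    x_low, x_high, y_low, y_high = query
    nx, ny = cell_counts.shape
    x_edges = np.linspace(x_min, x_max, nx + 1)
    y_edges = np.linspace(y_min, y_max, ny + 1)
    # Overlap length between the query and every cell interval, per axis.
    x_overlap = np.clip(np.minimum(x_edges[1:], x_high) - np.maximum(x_edges[:-1], x_low), 0.0, None)
    y_overlap = np.clip(np.minimum(y_edges[1:], y_high) - np.maximum(y_edges[:-1], y_low), 0.0, None)
    x_frac = x_overlap / np.diff(x_edges)
    y_frac = y_overlap / np.diff(y_edges)
    return float(x_frac @ cell_counts @ y_frac)
```

If the points of a partially covered cell are in fact clustered in one corner, the proportional split over- or under-counts them, which is the error that grows with sparse, unevenly distributed data.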


(2) Comparison of RE Values Based on the Beijing Data Set. Figures 10(a)–10(c) compare the relative error values generated by different query ranges when the privacy budget is fixed. As the query area increases, the RE value of the Beijing data set behaves differently from that of the Checkin data set: rather than first increasing and then decreasing, it oscillates within a range, and this range is much smaller than for the Checkin data set. This is mainly because the spatial points in the Beijing data set are relatively densely distributed with few blank areas, which introduces less uniform assumption error. Figures 10(d)–10(h) compare the relative error values when the privacy budget changes from 0.1 to 1 for a fixed query range. As the privacy budget ε increases, the RE values of all methods gradually decrease: the larger ε is, the smaller the added noise and the closer the data are to the actual values. In this process, the query accuracy of GAB is always better than that of the other four methods, and the advantage of GAB becomes more apparent as the query area increases. When the query area is q4 or q5, the RE value obtained by GAB is far smaller than that of the other four methods; in particular, when the query area is q5 and ε = 0.1, the query accuracy of GAB is eight times that of UG, six times that of AG, five times that of Kd-PPDP, and two times that of Quad-heu.
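The link between ε and the noise magnitude can be checked numerically. For Laplace noise with scale b = Δ/ε, the expected absolute value of a draw equals b, so moving from ε = 0.1 to ε = 1.0 shrinks the typical per-cell noise by a factor of ten. The Python sketch below estimates this empirically; the sensitivity Δ = 1 and the helper name `mean_abs_laplace_noise` are assumptions for illustration.

```python
import numpy as np

def mean_abs_laplace_noise(epsilon, sensitivity=1.0, n_samples=100_000, rng=None):
    """Empirical mean of |noise| for Laplace noise with scale sensitivity / epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    samples = rng.laplace(loc=0.0, scale=scale, size=n_samples)
    return float(np.mean(np.abs(samples)))

# Typical noise magnitude for the three budgets used in the experiments.
rng = np.random.default_rng(42)
magnitudes = {eps: mean_abs_laplace_noise(eps, rng=rng) for eps in (0.1, 0.5, 1.0)}
```

The estimates should land close to 10, 2, and 1, respectively, which is consistent with the RE values falling as ε grows.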


4.3.2. Privacy Protection Analysis

Figure 11 compares the AE values of the five methods using box plots to better represent the security of the algorithms. As the privacy budget ε increases, the AE values of the five methods decrease to varying degrees, because the larger ε is, the smaller the noise added to the original data. However, the average range query AE of the five methods is always above 0, and the minimum value is also greater than 0, which ensures that the result returned by the spatial count query always differs from the true value and reflects the security of the algorithm. Figures 11(b) and 11(d)–11(f) contain outliers. These arise when the noise drawn from the Laplace distribution is too large (or too small), so that the final range query AE value exceeds the upper edge (or falls below the lower edge) of the box plot and becomes an outlier. The figure also shows that the relatively uniformly distributed Beijing data set has a higher probability of outliers than the unevenly distributed Checkin data set. In the Checkin data set, the absolute error values differ widely across query ranges under the same privacy budget, which leads to a low probability of outliers. In this process, the average AE of GAB is always smaller than that of the other four algorithms and greater than 0, which shows that GAB has better range query accuracy and better privacy protection.
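How a value comes to be drawn as a box-plot outlier can be sketched in Python. The text does not state the whisker rule used for Figure 11, so the sketch assumes the common convention that the upper and lower edges sit 1.5 interquartile ranges beyond the quartiles; `boxplot_outliers` is a hypothetical helper.

```python
import numpy as np

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_edge = q1 - whisker * iqr
    upper_edge = q3 + whisker * iqr
    return values[(values < lower_edge) | (values > upper_edge)]
```

Under this rule, a set of AE values that is tightly clustered has a narrow interquartile range, so a single heavy-tailed Laplace draw is more likely to fall outside the edges than in a set whose values are already widely spread.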


5. Conclusions

To address the shortcomings of existing location data processing methods, which do not fully consider the distribution characteristics of spatial data points or the superposition of noise, this paper proposes a location data processing method based on differential privacy that builds on the existing grid division method. The method uses the Hopkins statistic to judge the data distribution in the grid and uses a bucketing strategy to reduce noise errors. Finally, the proposed method is verified on two real large-scale GPS data sets, showing that the GAB algorithm is practical and feasible and has advantages over similar algorithms.

Future areas worthy of further exploration include the following: (1) Localized differential privacy. Centralized differential privacy rests on the assumption that the third party is trusted; when the third party is not trusted, the risk of user privacy leakage increases. Localized differential privacy removes the reliance on the third party and applies protection on the user side. (2) Reducing computational complexity. The spatial data set contains many data points, and the bucketing operation needs to traverse all the grids in the second layer, so the computational complexity of GAB is relatively high.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Xiangjun Li participated in the design and coordination of the overall experiment. Xuewen Zhao conducted differential privacy research, participated in the experimental design, and drafted the manuscript. Huijuan Zhang participated in the collation of manuscripts. Jideng Han participated in design and coordination. All authors read and approved the final manuscript.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 61862042, 61762062); the Science and Technology Innovation Platform Project of Jiangxi Province (Grant no. 20181BCD40005); the Major Discipline Academic and Technical Leader Training Plan Project of Jiangxi Province (Grant no. 20172BCB22030); the Primary Research & Development Plan of Jiangxi Province (Grant nos. 20192BBE50075, 20192BEL50041, 20181ACE50033); the Jiangxi Province Natural Science Foundation of China (Grant nos. 20192BAB207019, 20192BAB207020); the Graduate Innovation Fund Project of Jiangxi Province (Grant Nos. YC2019–S100, YC2019–S048, YC2020–S028, YC2020–S092, YC2020–S083); the Practice Innovation Training Program of Jiangxi Province for College Students (Grant nos. 20190403041, 20190402125, 2020CX160); and the Jiangxi Province Educational Reform Key Project (Grant No. JXJG-2020-1-2). The authors wish to thank all the students and experts who participated in our study for their positive and valuable comments and suggestions regarding our manuscript.

References

[1] J. Wang, Y. Yang, T. Wang, R. S. Sherratt, and J. Zhang, “Big data service architecture: a survey,” Journal of Internet Technology, vol. 21, no. 2, pp. 393–405, 2020.
[2] R. Iqbal, F. Doctor, B. More, S. Mahmud, and U. Yousuf, “Big data analytics: computational intelligence techniques and application areas,” Technological Forecasting and Social Change, vol. 153, Article ID 119253, 2020.
[3] P. Martí, L. Serrano-Estrada, and A. Nolasco-Cirugeda, “Social media data: challenges, opportunities and limitations in urban studies,” Computers, Environment and Urban Systems, vol. 74, pp. 161–174, 2019.
[4] S. Bartoletti, L. Chiaraviglio, S. Fortes et al., “Location-based analytics in 5G and beyond,” IEEE Communications Magazine, vol. 59, no. 7, pp. 38–43, 2021.
[5] P. Goel, R. Patel, D. Garg, and A. Ganatra, “A review on big data: privacy and security challenges,” in Proceedings of the 3rd International Conference on Signal Processing and Communication (ICPSC), pp. 705–709, IEEE, Coimbatore, India, May 2021.
[6] Z. Sun, K. D. Strang, and F. Pambel, “Privacy and security in the big data paradigm,” Journal of Computer Information Systems, vol. 60, no. 2, pp. 146–155, 2018.
[7] L. Zou, X. Wang, and S. Yin, “A data sorting and searching scheme based on distributed asymmetric searchable encryption,” International Journal on Network Security, vol. 20, no. 3, pp. 502–508, 2018.
[8] J. Moreno, E. B. Fernandez, M. A. Serrano, and E. Fernandez-Medina, “Secure development of big data ecosystems,” IEEE Access, vol. 7, Article ID 96604, 2019.
[9] H. Jiang, J. Li, P. Zhao, F. Zeng, Z. Xiao, and A. Iyengar, “Location privacy-preserving mechanisms in location-based services: a comprehensive survey,” ACM Computing Surveys, vol. 54, no. 1, pp. 1–36, 2022.
[10] L. Sweeney, “k-ANONYMITY: a model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.
[11] F. Amiri, N. Yazdani, A. Shakery, and A. H. Chinaei, “Hierarchical anonymization algorithms against background knowledge attack in data releasing,” Knowledge-Based Systems, vol. 101, pp. 71–89, 2016.
[12] C. Dwork, “Differential privacy,” in Proceedings of the International Colloquium on Automata, Languages, and Programming, pp. 1–12, Springer, Venice, Italy, July 2006.
[13] C. Dwork, “Differential privacy: a survey of results,” in Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, pp. 1–19, Springer, Xi'an, China, April 2008.
[14] C. Dwork and J. Lei, “Differential privacy and robust statistics,” in Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pp. 371–380, Bethesda, MD, USA, May 2009.
[15] Q. Ye, M. Li, and Y. Song, “Noise encrypted based differential location privacy protection in wireless sensors network,” Chinese Journal of Sensors and Actuators, vol. 32, no. 12, pp. 1904–1910, 2019.
[16] Q. Zhang, X. Zhang, M. Wang, and X. Li, “DPLQ: location-based service privacy protection scheme based on differential privacy,” IET Information Security, vol. 15, no. 6, pp. 442–456, 2021.
[17] Y. Qiu, Y. Liu, X. Li, and J. Chen, “A novel location privacy-preserving approach based on blockchain,” Sensors, vol. 20, no. 12, p. 3519, 2020.
[18] F. Yu, Y. Yihan, and W. Xiaoping, “Differential privacy protection technology and its application in big data environment,” Journal on Communications, vol. 40, no. 10, 2019.
[19] B. Jiang, J. Li, G. Yue, and H. Song, “Differential privacy for industrial internet of things: opportunities, applications and challenges,” IEEE Internet of Things Journal, vol. 8, no. 13, Article ID 10430, 2021.
[20] N. Niknami, M. Abadi, and F. Deldar, “A fully spatial personalized differentially private mechanism to provide non-uniform privacy guarantees for spatial databases,” Information Systems, vol. 92, Article ID 101526, 2020.
[21] Y. Wu, Q. Lu, J. Cai, and X. Wang, “Differential privacy two-dimensional data partitioning publication algorithm based on quad-tree,” Journal of Huazhong University of Science and Technology (Nature Science Edition), vol. 4, no. 3, pp. 99–104, 2016.
[22] J. Zhang, X. Xiao, and X. Xie, “Privtree: a differentially private algorithm for hierarchical decompositions,” in Proceedings of the 2016 International Conference on Management of Data, pp. 155–170, San Francisco, CA, USA, June 2016.
[23] Y. Yan, L. Zhang, Q. Z. Sheng, B. Wang, X. Gao, and Y. Cong, “Dynamic release of big location data based on adaptive sampling and differential privacy,” IEEE Access, vol. 7, Article ID 164962, 2019.
[24] X. Li, Y. Wang, J. Song et al., “A low cost and un-cancelled laplace noise based differential privacy algorithm for spatial decompositions,” World Wide Web, vol. 23, no. 1, pp. 549–572, 2020.
[25] Y. Yan, X. Gao, A. Mahmood, Y. Zhang, S. Wang, and Q. Z. Sheng, “An arithmetic differential privacy budget allocation method for the partitioning and publishing of location information,” in Proceedings of the 19th International Conference on Trust, Security and Privacy in Computing and Communications (IEEE TrustCom), pp. 1395–1401, IEEE, Guangzhou, China, December 2020.
[26] W. Qardaji, W. Yang, and N. Li, “Differentially private grids for geospatial data,” in Proceedings of the 29th International Conference on Data Engineering (ICDE), pp. 757–768, IEEE, Brisbane, QLD, Australia, April 2013.
[27] S. Cai, X. Lyu, and D. Ban, “Spatial statistic data release based on differential privacy,” KSII Transactions on Internet and Information Systems (TIIS), vol. 13, no. 10, pp. 5244–5259, 2019.
[28] Y. Yan, X. Gao, A. Mahmood, T. Feng, and P. Xie, “Differential private spatial decomposition and location publishing based on unbalanced quadtree partition algorithm,” IEEE Access, vol. 8, Article ID 104775, 2020.
[29] S. Huang, T. Chen, and Q. Lu, “Differentially privacy two-dimensional dataset partitioning publication algorithm based on kd-tree,” Journal of Shandong University, vol. 45, no. 1, pp. 24–29, 2015.
[30] X. Zhang, K. Jin, and X. Meng, “Private spatial decomposition with adaptive grid,” Journal of Computer Research and Development, vol. 55, no. 6, 2018.
[31] Y. Yan, A. H. Eyeleko, A. Mahmood, J. Li, Z. Dong, and F. Xu, “Privacy preserving dynamic data release against synonymous linkage based on microaggregation,” Scientific Reports, vol. 12, no. 1, 2022.
[32] Y. Yan, X. Hao, and L. Zhang, “Hierarchical differential privacy hybrid decomposition algorithm for location big data,” Cluster Computing, vol. 22, no. S4, Article ID S9269, 2019.
[33] S. Li, Y. Geng, and Y. Li, “A differentially private hybrid decomposition algorithm based on quad-tree,” Computers & Security, vol. 109, Article ID 102384, 2021.

Copyright

Copyright © 2022 Xiangjun Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
