Abstract
In big data analytics, Jaccard similarity is a widely used building block for scalable similarity computation. It is broadly applied in Internet of Things (IoT) applications, such as credit systems, social networking, and epidemic tracking. However, with increasing privacy concerns over users' sensitive IoT data, it is highly desirable and necessary to investigate privacy-preserving Jaccard similarity computation over two users' datasets. To boost efficiency and enhance security, in this paper we propose two methods to measure Jaccard similarity over the private sets of two users with the assistance of an untrusted cloud server. Concretely, by leveraging an efficient Min-Hash algorithm on encrypted datasets, our protocols output an approximate similarity, very close to the exact value, without leaking any additional privacy to the cloud. Our first solution assumes a semihonest cloud server, and our enhanced solution introduces a consistency-check mechanism to achieve verifiability in the malicious model. Regarding efficiency, the first solution needs only about 6 minutes for billion-element sets. Furthermore, as far as we know, the consistency-check mechanism is the first to achieve effective verifiable approximate similarity computation.
1. Introduction
Jaccard similarity [1], which measures the similarity of two sets, is a fundamental and critical basis for data computation and analysis, widely used in many Internet of Things (IoT) applications, such as online shopping, social networks, copyright or plagiarism detection, human genome testing, credit evaluation, and epidemic tracking [2, 3]. For example, in social networks, we can find two users with largely common social interests by computing Jaccard similarity [4]; appropriate advertisement or video recommendations can be pushed to users in a timely manner by computing Jaccard similarity over shopping preferences or video-viewing records [5]; furthermore, we can effectively find similar webpages or images by computing Jaccard similarity [1].
However, for IoT, protecting data privacy while computing Jaccard similarity is an important problem for two data users. Several classical applications are listed as follows:
(i) Two genome research institutions would like to conduct a collaborative genome similarity study, and each institution needs to protect the privacy of its genome dataset for legal and business reasons
(ii) People want to find potential friends with common interests or symptoms via online social networks, without leaking extra sensitive private information to others
(iii) Detecting the similarity between two publications should not disclose their contents to each other, for intellectual property reasons
Furthermore, with the wide adoption of data analytics and third-party cloud storage and computing services for IoT, data mining should be performed by the third-party server without leaking any sensitive private information. Compared with other computations, such as private set intersection, private similarity computation is preferable since it discloses less information to the server. Moreover, an approximate computation can greatly boost efficiency over large-scale datasets in data mining.
As we know, Jaccard similarity between two private sets can easily be computed from existing work on private set intersection (PSI), in the two-party setting (e.g., [6–10]) or the server-aided setting (e.g., [11–15]), but more private information is then leaked inadvertently, such as the elements in the set intersection or the set sizes. To hide the set sizes, the private set intersection cardinality (PSI-CA) [6] primitive was used by Blundo et al. [16] to build a similarity computation protocol. Later, several other works focused on boosting computational efficiency [17, 18]. In particular, Dong and Loukides [17] presented an approximate private set cardinality scheme with only logarithmic complexity in a two-party setting, and Lv et al. [19] designed an unbalanced private set intersection cardinality protocol leveraging Bloom filters, which achieves low communication cost. However, these protocols are in the two-party setting and are not efficient enough for billion-scale datasets. Thus far, secure and fast billion-scale similarity computation with maximum privacy protection of the sets remains a major challenge.
To this end, we design two fast and secure billion-scale approximate similarity protocols in the server-aided setting by combining the Min-Hash technique [1], leaking no additional privacy. Furthermore, our two solutions greatly improve performance for the scenario of billion-element sets. Our contributions include the following:
(i) We leverage deterministic encryption and the Min-Hash technique to achieve extremely high efficiency in our first protocol. Our analysis shows that it discloses nothing to a semihonest cloud server except the approximate Jaccard similarity. Our first protocol is also the basis for the second solution.
(ii) To enhance security, we consider a malicious server in our second protocol, where the server can cheat during execution. Concretely, an original consistency-check mechanism is proposed to verify whether a similarity result returned by the server is a correct approximation. Theoretical analysis and experimental results show that the verification method achieves a very high detection probability, which can be further improved (e.g., over 99%) by executing multiple rounds. To our knowledge, the consistency-check mechanism is the first concrete verifiability scheme for approximate computation with such a high detection probability.
(iii) To assess performance, we conducted extensive experiments and evaluations of the two protocols. Experimental results show that our first protocol takes only about 6 minutes to compute the Jaccard similarity of two billion-element sets in parallel mode, and our second protocol needs only about 18 minutes over sets of the same scale.
2. Related Work
2.1. Private Set Similarity
Blundo et al. [16] presented a protocol to evaluate the approximate Jaccard similarity of two private sets held by Alice and Bob, where Alice learns nothing about Bob's data except the similarity. It combines Min-Hash and PSI-CA [6] to obtain a more efficient solution than the exact ones [20, 21]. After that, Buyrukbilen and Bakiras [22] suggested using Simhash to build an efficient scheme for similar-document detection. Thereafter, Dong and Loukides [17] presented an approximate private set union/intersection cardinality protocol with logarithmic complexity in a two-party setting, and recently Lv et al. [19] proposed an unbalanced private set intersection cardinality protocol leveraging Bloom filters, which achieves low communication cost. However, all of these works are set in a two-party model, which leads to more interaction and lower efficiency when computing over billion-scale datasets.
2.2. Private Set Intersection
Freedman et al. [7] first introduced private set intersection (PSI). After running PSI, Alice obtains the intersection A ∩ B (where set A belongs to Alice and set B belongs to Bob), while Bob obtains nothing. Since then, most PSI works have focused on improving efficiency, such as protocols based on oblivious polynomial evaluation [23] and on oblivious pseudorandom functions [24]. In addition, garbled circuits [25] and garbled Bloom filters [9] can also boost the efficiency of PSI.
Unfortunately, the two-party model does not scale to billion-scale datasets. Server-aided PSI [12, 13, 26–31] was therefore proposed; under this model, two clients can delegate their storage and set intersection computation to an assisting server. After execution, the server outputs the set intersection without obtaining any additional private information about the two clients. In this model, several techniques, such as garbled circuits [26] and homomorphic encryption [12], are important foundations, but PSI based on these techniques is not efficient enough. To our knowledge, the most efficient server-aided PSI to date, introduced by Kamara et al. [13], handles billion-scale sets efficiently with parallel processing.
As stated, private Jaccard similarity can be derived from the PSI output, but it cannot simply be replaced by PSI, which leaks more private information.
2.3. Verifiable Computation
Gennaro et al. [32] presented a general verifiable computation scheme, which enables an honest client to verify whether a delegated computation was conducted correctly according to the given specification in an outsourced environment. Later, Applebaum and Kushilevitz [33] presented a new general verification method for secure multiparty computation. Because of the inefficiency of these generic approaches, several works have focused on more efficient verification mechanisms for specific functions. For instance, Benabbas et al. [34] proposed efficiently verifiable protocols for high-degree polynomial computation over large-scale datasets, and works such as [35, 36] have also presented improved protocols for verifiable outsourced computation. All of the abovementioned solutions focus on verifying outsourced delegation, which differs from our server-aided model. Kamara et al. [13] introduced a specific efficient approach to prevent the aiding server from maliciously deviating from the private set intersection protocol by adding noisy elements to the original datasets. However, efficient verification protocols for approximate computations in the server-aided setting are still missing.
In this work, we aim to design highly efficient and more secure protocols for server-aided billion-scale approximate Jaccard similarity computation. Kamara et al. [13] presented a feasible protocol to compute Jaccard similarity with high efficiency, but it cannot guarantee minimal privacy disclosure, since it leaks the set sizes to the cloud and the set intersection to the client. We specifically design protocols to compute the approximate Jaccard similarity with set-size privacy and higher efficiency than [13]. In addition, a novel consistency-check mechanism is proposed in this work to verify the correctness of the approximate results returned by the server. Backed by extensive experimental evaluations on real data, our consistency-check mechanism is effective and practical.
3. Security Model
Two types of adversaries are considered in this paper: semihonest and malicious.
3.1. Semihonest Adversary
A semihonest adversary honestly runs the protocol as specified, but it curiously keeps a record of all intermediate computations and tries to infer the private data of other parties.
3.2. Malicious Model
A malicious adversary may not only try to learn the private datasets but also cheat in the protocol, for example by deviating from its steps or returning incorrect computation results. Specifically, we consider two types of malicious adversaries: a malicious client and a malicious server. A malicious client may input an arbitrary set, which is not its real input, to attempt to learn the other client's private information. A malicious server, which has no input, may deviate from the specified protocol by returning a false value in the following three cases: replying with a random similarity value from (0, 1) without any calculation (Case I), replying with a partial similarity value without performing the whole protocol (Case II), or deliberately replying with a false result even though it computed honestly during execution (Case III). In Cases I and II, an economically driven malicious server saves computing costs and makes a profit. In Case III, it maliciously returns a false result to corrupt data analysis, for example to prevent article plagiarism from being detected.
In addition, we assume that there is no collusion between the parties in our protocols; such an assumption is made in most efficient server-aided protocols (e.g., [4, 13, 37]), as in [13]. For instance, the server is assumed to be unable to collude with one of the parties to obtain extra private information about the other party.
3.3. Real and Ideal Model
We define the privacy of our protocols with a real/ideal model as in [38]. Generally speaking, in the ideal world there are a fully trusted third party (TTP) and a simulator Sim, which measure the similarity by simulating protocol Π; Sim outputs View_Ideal as the ideal-world view. In the real world, with no fully trusted party, the participants follow Π and the adversary outputs View_Real as its view. The protocol Π is secure if the views in the ideal world and the real world are computationally indistinguishable.
Our model contains three participants: a server, Alice, and Bob. Assuming no collusion, for the security parameter λ, all possible inputs X, and auxiliary random inputs r, we require
View_Real(λ, X, r) ≈_c View_Ideal(λ, X, r),
where ≈_c denotes computational indistinguishability [38].
4. Preliminaries
4.1. Jaccard Similarity
Jaccard similarity [1] is the most widely used measure of the similarity of two sets A and B, denoted J(A, B) in this paper. It is computed from the cardinalities of the intersection and the union of sets A and B as below:
J(A, B) = |A ∩ B| / |A ∪ B|.
Obviously, 0 ≤ J(A, B) ≤ 1. For example, given two sets A = {1, 2, 3} and B = {2, 3, 4}, the Jaccard similarity is easily computed with equation (2), i.e., J(A, B) = |{2, 3}| / |{1, 2, 3, 4}| = 0.5.
4.2. Min-Hash Technology
As we know, the cost of directly computing Jaccard similarity grows linearly with the size of the two sets, which is infeasible for computation over large encrypted datasets with millions or billions of elements. Min-Hash [1] is an efficient and widely used approach for scalable Jaccard similarity computation, since it significantly reduces computation time by returning an approximate result that is typically very close to the exact value.
The Min-Hash technique maps a longer set into a much shorter signature with some collision-resistant hash functions (Noncryptographic hashes, such as Rabin’s fingerprinting function [39], are enough for Min-Hash). Instead of directly computing the Jaccard similarity of two sets, we can effectively get an approximation by calculating their Min-Hash signatures.
More specifically, assume there are k hash functions in Min-Hash, denoted h_1, ..., h_k (essentially random permutations); we can compute an approximate Jaccard similarity of sets A and B as follows:
(i) First, h_min(S) is defined as the minimum value of h over the set S. Then, for sets A and B, two k-length Min-Hash signatures can easily be computed, respectively, represented as Sig(A) = (h1_min(A), ..., hk_min(A)) and Sig(B) = (h1_min(B), ..., hk_min(B)).
(ii) Then, the approximation of the Jaccard similarity is calculated as the fraction of positions where the two signatures agree.
Further description of Min-Hash algorithm can be found in [1].
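The estimation procedure above can be sketched in Python; the hash family h(x) = (a·x + b) mod p and all parameter values here are illustrative choices, not the paper's exact configuration:

```python
import random

def minhash_signature(s, hash_funcs):
    # One signature entry per hash function: the minimum hash value over the set.
    return [min(h(x) for x in s) for h in hash_funcs]

def approx_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

# k random hash functions of the form h(x) = (a*x + b) mod p, p a large prime.
random.seed(1)
P = 2**61 - 1  # Mersenne prime
k = 512
hash_funcs = [
    (lambda a, b: (lambda x: (a * x + b) % P))(random.randrange(1, P), random.randrange(P))
    for _ in range(k)
]

A = set(range(0, 3000))
B = set(range(1000, 4000))
sig_a = minhash_signature(A, hash_funcs)
sig_b = minhash_signature(B, hash_funcs)
est = approx_jaccard(sig_a, sig_b)
# exact_jaccard(A, B) is 2000/4000 = 0.5; est approximates it
```

With k = 512 hash functions, the standard deviation of the estimate at J = 0.5 is about 0.022, so the estimate typically lands within a few hundredths of the exact value.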
It is obvious that we can improve the accuracy of the approximation by increasing the number of hash functions k, but this leads to more computation and storage overhead. Therefore, in practical applications, we need to choose a suitable k to maintain a trade-off between efficiency and accuracy. As introduced in [39], the number of common items in the Min-Hash signatures can be approximated with a binomial distribution. Specifically, given the number of hashes k, the actual Jaccard similarity JS, and the estimation bias ε, we can obtain the probability that the approximate result ĴS computed by Min-Hash falls into [JS − ε, JS + ε] as
Pr[|ĴS − JS| ≤ ε] = Σ_{i = ⌈k(JS − ε)⌉}^{⌊k(JS + ε)⌋} C(k, i) · JS^i · (1 − JS)^{k − i}.
Based on the Chernoff bounds given in [40], the upper bound of the failure probability Pr[|ĴS − JS| > ε] is 2e^{−2kε²} [41].
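As a numerical sanity check, the binomial model and the Chernoff-style bound can be evaluated directly; the exact form of equation (5) is our reading of the binomial approximation, so this is a sketch under that assumption:

```python
from math import comb, ceil, floor, exp

def prob_within(k, J, eps):
    # Probability that the Min-Hash estimate falls in [J - eps, J + eps],
    # modeling the number of matching signature positions as Binomial(k, J).
    lo = max(0, ceil(k * (J - eps)))
    hi = min(k, floor(k * (J + eps)))
    return sum(comb(k, i) * J**i * (1 - J)**(k - i) for i in range(lo, hi + 1))

def fail_bound(k, eps):
    # Hoeffding/Chernoff-style upper bound on the failure probability.
    return 2 * exp(-2 * k * eps**2)

p = prob_within(512, 0.5, 0.05)
# 1 - p (the chance the estimate misses the interval) stays below fail_bound(512, 0.05)
```

For k = 512, JS = 0.5, and ε = 0.05, the binomial sum is already above 0.95, well inside the Chernoff-style bound.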
4.3. Deterministic Encryption
We use a deterministic encryption scheme DE = (Gen, Enc, Dec) as in [42], where the same plaintext always corresponds to the same ciphertext. Hence, it can be used to perform equality checking or membership testing over ciphertexts [13]. We state it generally as follows:
(i) Gen(1^λ): for security parameter λ, return secret key K
(ii) Enc(K, m): for secret key K and plaintext m, return ciphertext c
(iii) Dec(K, c): for secret key K and ciphertext c, return plaintext m
Here, Gen is a probabilistic algorithm while Enc and Dec are deterministic, enabling equality checking in our protocols. Correctness requires that for all λ, all K output by Gen(1^λ), and all m, Dec(K, Enc(K, m)) = m.
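For intuition, the equality-checking property that the protocols rely on can be mimicked with a keyed PRF; this sketch uses HMAC-SHA256 as a stand-in (it is deterministic but not invertible, so it has no Dec; the actual implementation in Section 6 uses AES):

```python
import hmac, hashlib, os

# A keyed PRF (HMAC-SHA256) stands in for DE.Enc here: it is deterministic,
# so equal plaintexts map to equal "ciphertexts", which is all the server
# needs for equality checking over ciphertexts.

def gen(security_bytes=32):
    return os.urandom(security_bytes)  # secret key K

def enc(key, msg):
    return hmac.new(key, str(msg).encode(), hashlib.sha256).digest()

K = gen()
c1 = enc(K, 42)
c2 = enc(K, 42)
c3 = enc(K, 43)
# c1 == c2 (determinism) while c1 != c3, enabling equality tests without the key
```

The design point is exactly the one the paper uses: a party holding only ciphertexts can still count equal positions, but by pseudorandomness it learns nothing else about the plaintexts.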
5. Approximate Similarity Computation Protocols
In our work, two efficient and secure protocols are proposed to compute the approximate Jaccard similarity of two sets in the server-aided setting. Protocol 1 focuses on high efficiency and is secure against a semihonest server; it is the fundamental building block of Protocol 2. For stronger security, we introduce a consistency-check mechanism in Protocol 2 to verify the correctness of the returned result even against a malicious server that cheats (i.e., outputs a false value). Obviously, there is a trade-off between security and efficiency, and Protocol 2 requires extra computation and communication overhead.
5.1. Protocol 1 under a Semihonest Model
Protocol 1 operates in the semihonest model, that is, the server is semihonest. As shown in Figure 1, we first initialize a deterministic encryption scheme and share a secret key between the two clients. Each client (Alice and Bob) first computes the Min-Hash signature of its input set A or B, respectively, and then outsources the encrypted Min-Hash signature to the server. On receiving the ciphertexts from the two clients, the server performs equality-checking operations to obtain an approximate similarity of A and B, and returns it to Alice and Bob. Protocol 1 is described in detail in Figure 1.

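Putting the pieces together, a minimal sketch of the Protocol 1 message flow might look as follows; HMAC stands in for the deterministic encryption, and all names and parameter values are illustrative assumptions rather than the paper's exact configuration:

```python
import hmac, hashlib, random

# Hypothetical end-to-end sketch of Protocol 1: both clients share a key,
# Min-Hash and deterministically "encrypt" (HMAC as a stand-in PRF) their
# signatures, and the server only compares ciphertexts position-wise.

P = 2**61 - 1
K_HASHES = 256
random.seed(7)
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(K_HASHES)]

def client_message(s, key):
    sig = [min((a * x + b) % P for x in s) for a, b in HASHES]  # Min-Hash signature
    return [hmac.new(key, str(v).encode(), hashlib.sha256).digest() for v in sig]

def server_similarity(ct_a, ct_b):
    # The server sees only pseudorandom ciphertexts; equal positions reveal
    # nothing beyond their count, which yields the approximate similarity.
    return sum(x == y for x, y in zip(ct_a, ct_b)) / len(ct_a)

shared_key = b"preshared-between-alice-and-bob!"
A = set(range(0, 600))
B = set(range(200, 800))
sim = server_similarity(client_message(A, shared_key), client_message(B, shared_key))
# exact Jaccard is 400/800 = 0.5; sim approximates it
```

Note that the number of ciphertexts sent is k, independent of the set sizes, which is the structural reason the protocol hides |A| and |B| from the server.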
Next, we present the security of Protocol 1 under the real/ideal model. Generally speaking, for a client, security follows from the fact that no extra private information is revealed to the other party except the returned approximation, even if that client provides a false input with some other intention. For a semihonest server, we define security as the server being unable to distinguish its view in the ideal model from that in the real model. Concretely, the semihonest server only receives the Min-Hash signatures encrypted under DE, and it cannot gain more private information about the clients due to the pseudorandomness of the ciphertexts. Furthermore, by the definition of Min-Hash, the number of hash functions is independent of the set sizes, so the server cannot infer any information about the two sets, such as the set sizes or the intersection size. We formalize the security of Protocol 1 in Theorem 1 and prove it as follows:
Theorem 1. Protocol 1 described in Figure 1 is secure and hides the set sizes in the following cases: (1) the server is semihonest and both clients are honest; (2) the server is honest and one of the clients is malicious.
Proof. For a semihonest server, consider an adversary Adv corrupting it in the real world; its view in the execution consists of the length k of the Min-Hash signatures and the intersection size c of the encrypted signatures. Accordingly, we construct a simulator Sim operating in the ideal world that receives k and c and simulates the view of Adv in a real-world execution between Alice and Bob. The simulator proceeds as follows:
(i) Sim first generates a k-length Min-Hash signature, denoted Sig_A′, whose elements are randomly chosen from the dataset space.
(ii) Then, it generates another k-length Min-Hash signature Sig_B′ such that |Sig_A′ ∩ Sig_B′| = c. Specifically, Sim selects c elements from Sig_A′ as one part of Sig_B′, and randomly generates the remaining elements, none equal to any element of Sig_A′, as the other part.
(iii) Then, Sim generates Enc(K′, Sig_A′) and Enc(K′, Sig_B′), where K′ is a random key generated by Sim with the same length as K.
(iv) It sends Enc(K′, Sig_A′) and Enc(K′, Sig_B′) to the server, which returns a similarity.
(v) Sim outputs whatever the server outputs.
By the pseudorandomness of our deterministic encryption DE, the server is unable to distinguish the simulated signatures from the real ones of equal length k, and since k is chosen independently of the sizes of the original and dummy sets, the input set sizes are hidden. Furthermore, the output similarity in the ideal-world execution equals the result in the real-world execution. Therefore, the server's view in the real world is computationally indistinguishable from that in the ideal world, i.e., View_Real ≈_c View_Ideal.
For the case of a malicious adversary corrupting one of the clients (here we suppose Alice is corrupted by a malicious adversary Adv) in the real world, we construct a simulator Sim in the ideal world to simulate the view of Adv in the execution between the honest server and honest Bob. Sim receives the secret key from Adv. On receiving the encrypted input from Alice, Sim recovers the set A via decryption and sends the plaintext input to the trusted third party. Then, the trusted third party computes the similarity between A and B and returns it to Sim. It is clear that, due to the random distribution of the returned similarities, Adv is unable to distinguish the real world from the ideal world. Therefore, we again have View_Real ≈_c View_Ideal.
5.2. Protocol 2 under a Malicious Model
Protocol 1, shown above, is proven secure in the semihonest model, where the server always honestly executes the protocol and outputs real results. Unfortunately, such an assumption may not hold in practice, since an economically driven server may cheat to make more profit or to save computational expenses.
Concretely, as stated in Section 3, the server may return a result randomly drawn from (0, 1) without any computation, or output a similarity with degraded accuracy based on a partial calculation over the Min-Hash signatures. Furthermore, the server may intentionally return a wrong similarity with a certain offset from the real similarity obtained by honest computation. To detect the server's cheating behavior, we propose Protocol 2 with a verification mechanism that catches a malicious server with very high probability. As far as we know, Protocol 2 is the first solution for computing an approximate similarity with effective verifiability.
Intuitively, a direct solution for verifying the server's behavior is to add dummy sets to both clients' original sets; the server then computes the dummied similarity and returns it to the clients. Finally, the clients check whether the returned similarity lies within a precomputed valid range. This naïve solution was adopted in [13] to verify the correctness of PSI against a malicious server. However, it is not adequate for verifying the approximate computation in our protocol. More specifically, since the similarity of the two original sets is unknown in advance, the dummied similarity is hard to predict, and the precomputed valid range may not cover all possible results. Furthermore, recklessly expanding or narrowing the precomputed range would lead to unpredictable false positive or false negative rates.
To overcome the limitation stated above, we use a novel consistency-check mechanism in Protocol 2 that runs a two-round similarity computation. Specifically, given the original sets A and B, Alice, Bob, and the server first run Protocol 1 to obtain an approximate Jaccard similarity ĴS1. Then, Alice and Bob generate dummied sets A′ and B′ from their original sets A and B, respectively, and compute another similarity ĴS2 by running Protocol 1 on the two inputs A′ and B′, where A′ = A ∪ D1 ∪ D3 and B′ = B ∪ D2 ∪ D3. In particular, to guarantee the correctness of the similarity computation, we require that D1, D2, and D3 be three disjoint sets secretly preshared between Alice and Bob, all disjoint from A and B. Finally, the client Alice or Bob can verify the server's behavior by checking whether ĴS1 and ĴS2 are consistent with a secret mapping, as shown in Figure 2. If the server cheats in the protocol, the two returned similarities fail the consistency check with very high probability. Note that we can use a short shared random seed with a pseudorandom generator to generate the dummy sets efficiently, reducing the overhead of directly sharing them. Protocol 2 is illustrated in detail in Figure 2. Next, we prove the privacy of Protocol 2 and then describe our consistency-check solution in detail.
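The dummied-input construction can be sketched as follows; this reflects our reading of the setup (D1, D2, D3 pairwise disjoint, regenerated from a short shared seed, and disjoint from A and B), with illustrative sizes:

```python
import random

# Sketch of the dummied inputs in Protocol 2: D3 is added to both sides and
# inflates the intersection, while D1 and D2 are added to only one side each
# and inflate only the union.

def dummy_sets(seed, d, universe_start):
    rng = random.Random(seed)  # a short shared seed regenerates the dummies
    pool = rng.sample(range(universe_start, universe_start + 100 * d), 3 * d)
    return set(pool[:d]), set(pool[d:2 * d]), set(pool[2 * d:])

A = set(range(0, 500))
B = set(range(250, 750))
D1, D2, D3 = dummy_sets(seed=99, d=100, universe_start=10**6)

A_prime = A | D1 | D3
B_prime = B | D2 | D3
# |A' ∩ B'| = |A ∩ B| + d and |A' ∪ B'| = |A ∪ B| + 3d
```

Because sampling is without replacement from a range disjoint from the clients' universe, the three dummy sets are pairwise disjoint and disjoint from A and B, which is exactly what makes the intersection grow by d and the union by 3d.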

Theorem 2. Protocol 2 in Figure 2 is secure and hides the set sizes in the following cases: (1) both clients are honest, while the server is malicious and learns nothing about the set sizes; (2) the server is honest, while one of the clients, Alice or Bob, is malicious.
Proof. Suppose a malicious adversary Adv corrupts the server in the real world; its view in the execution consists of the length k of the Min-Hash signatures and the two intersection sizes of the encrypted signatures. We construct a simulator Sim operating in the ideal world that receives the encrypted Min-Hash signatures and the two intersection sizes. Then Sim simulates the view of Adv in the real-world execution between the malicious server and the two honest clients Alice and Bob as follows:
(i) Sim first generates two k-length Min-Hash signatures in the same way as in the proof of Theorem 1, for the corresponding sets A and B.
(ii) Then, it generates a k-length signature for A′ = A ∪ D1 ∪ D3, where D1 and D3 are two disjoint sets randomly chosen from a space different from that of the original datasets. Next, Sim generates another k-length signature for B′ = B ∪ D2 ∪ D3, where D2 is generated from the same space as D1 and D3.
(iii) Sim follows the simulation in the proof of Theorem 1 and sends the encrypted signatures to the server. The server responds with two similarities ĴS1 and ĴS2.
(iv) Sim outputs whatever the server outputs and sends the two similarities to Alice and Bob. They then perform the consistency check, verifying whether ĴS1 and ĴS2 satisfy the secret mapping; if yes, they output 1, otherwise 0.
In the abovementioned simulation, the server only obtains encrypted tuples of equal length k, and since k is chosen independently, as in Protocol 1, the input set sizes are hidden. Furthermore, the returned similarities in the ideal-world execution are equal to those in the real-world execution, respectively, so the server obtains no extra information from its view of the two similarity results. Therefore, by the pseudorandomness of DE, the view of the server in the simulation is indistinguishable from its view in a real-world execution with the honest clients Alice and Bob, i.e., View_Real ≈_c View_Ideal.
Note that the only difference between the executions in the two models is the case where the server returns two similarities that fail the consistency check in the real world. However, this case means the server behaved maliciously, which is detected with a significantly high probability, as illustrated in Figures 3 and 4.
For the case of a malicious client (without loss of generality, suppose Alice is corrupted), assume an adversary Adv corrupts Alice with its inputs in the real world. We construct a simulator Sim to simulate the view of Adv in the real-world execution of the protocol between Alice, the honest server, and Bob. On receiving the inputs from Alice, Sim verifies them and aborts if they are invalid; otherwise, the simulator generates all the Min-Hash signatures of the sets A and A′, sends them to the trusted third party, and receives two similarities from it. Finally, Sim returns the two similarities to Adv.
In the abovementioned simulation, Adv only obtains the two similarities, which are the only information connected with Bob's private input. It is clear that Adv cannot distinguish the returned similarities in the real world from those in the ideal world, since they are randomly distributed within (0, 1). Therefore, the view of Adv in the real world is indistinguishable from its view in the ideal world, i.e., View_Real ≈_c View_Ideal.
As in Protocol 1, Protocol 2 also hides the size of the input sets from the server. Obviously, with the dummy sets kept secret, the server cannot derive any information about the dummy set sizes from the two returned similarities. Suppose the server knows some background information about the set sizes (e.g., their distribution), or in the worst case even learns the common dummy size d exactly; then it could compute an approximation of the secret mapping according to equation (15), and the mapping would be leaked. However, we can choose different sizes for the dummy sets to prevent such leakage (e.g., setting the dummy set sizes to random numbers chosen from a very large range). With such a setting, it is infeasible for the server to infer the secret mapping from the two returned similarities and the private dummy set sizes.
5.2.1. Consistency-Check Mechanism
Next, we present our consistency-check mechanism for verifying the returned results in Protocol 2. The intuition of this mechanism is to check whether the two returned similarities satisfy a secret mapping. Specifically, let JS1 and ĴS1 be the actual and approximate Jaccard similarity of the original sets A and B, respectively, and let JS2 and ĴS2 be the actual and approximate Jaccard similarity of the dummied sets A′ and B′, respectively. According to equation (5), with very high probability |ĴS1 − JS1| ≤ ε and |ĴS2 − JS2| ≤ ε, where ε is the estimation bias. For simplicity, suppose |A| = |B| = n and |D1| = |D2| = |D3| = d; then, since |A′ ∩ B′| = |A ∩ B| + d and |A′ ∪ B′| = |A ∪ B| + 3d, with |A ∩ B| = 2n · JS1/(1 + JS1) and |A ∪ B| = 2n/(1 + JS1), we have
JS2 = (|A ∩ B| + d) / (|A ∪ B| + 3d).
Then, we can define the secret mapping as
f(x) = (2n · x + d(1 + x)) / (2n + 3d(1 + x)),
so that JS2 = f(JS1).
According to equation (15), given ĴS1, a valid range, denoted R, can easily be computed, and every valid ĴS2 should satisfy ĴS2 ∈ R. Thus, we define an algorithm Check(ĴS1, ĴS2) to test the consistency of ĴS1 and ĴS2. If ĴS2 ∈ R, the pair (ĴS1, ĴS2) is regarded as consistent, and the server passes the consistency check with output 1. Otherwise, the server fails and the output is 0.
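A sketch of the Check algorithm, under the simplifying assumptions |A| = |B| = n and |D1| = |D2| = |D3| = d and with the mapping f as we reconstruct it above (the exact valid-range construction in the original may differ):

```python
def secret_mapping(js1, n, d):
    # Map the similarity of the original sets to the expected similarity of
    # the dummied sets, assuming |A| = |B| = n and |D1| = |D2| = |D3| = d.
    inter = 2 * n * js1 / (1 + js1)          # |A ∩ B| recovered from js1
    return (inter + d) / (2 * n - inter + 3 * d)

def check(js1, js2, n, d, eps):
    # Valid range R for js2: map js1 ± eps through f, then widen by eps again
    # to absorb the Min-Hash error of the second round (f is increasing).
    lo = secret_mapping(max(js1 - eps, 0.0), n, d) - eps
    hi = secret_mapping(min(js1 + eps, 1.0), n, d) + eps
    return lo <= js2 <= hi   # True: consistent; False: server flagged

# Honest pair: n = 500, d = 100, true intersection 250, so js1 = 250/750
# and the dummied similarity is (250 + 100)/(750 + 300) = 350/1050.
ok_honest = check(250 / 750, 350 / 1050, n=500, d=100, eps=0.03)
# A random, unrelated second similarity falls outside R and is rejected.
ok_random = check(250 / 750, 0.9, n=500, d=100, eps=0.03)
```

This illustrates why a server that fabricates the second similarity independently of the first is caught with probability roughly 1 minus the length of R.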
Next, we introduce two notions used to judge the behavior of the server (malicious or not): the false positive rate and the false negative rate:
(i) The false positive rate (FPR) is defined as the probability that the server honestly executes the protocol but fails our consistency check, and is thus erroneously considered malicious
(ii) The false negative rate (FNR) is defined as the probability that the server executes maliciously, but the consistency check indicates that it is honest
Based on the above definitions, we clearly have the following conclusions:
(i) If the server is honest, the probability that it passes the consistency check is 1 − Pfp
(ii) If the server is malicious, the probability that it fails the consistency check is 1 − Pfn
Here, Pfp denotes the false positive rate and Pfn the false negative rate. Clearly, we need to keep both Pfp and Pfn as low as possible in our consistency-check mechanism. Next, we detail the computation of the false positive and false negative rates.
Theorem 3. According to equation (5), the probability that the server honestly executes our protocol and passes the consistency check is at least
Pr[|ĴS1 − JS1| ≤ ε] · Pr[|ĴS2 − JS2| ≤ ε],
and thus the false positive rate is bounded by
Pfp ≤ 1 − Pr[|ĴS1 − JS1| ≤ ε] · Pr[|ĴS2 − JS2| ≤ ε].
Theorem 4. The false negative rate can be calculated for each of the three cases, respectively, denoted Pfn_I in Case I, Pfn_II in Case II, and Pfn_III in Case III, where Δ is the offset between the real similarity and the incorrect similarity returned by the cheating server.
Case I denotes that the server returns two arbitrary similarities chosen from (0, 1) without doing any computation over the received sets, yet the two similarities pass the consistency check. Case II denotes that the server returns two similarities generated from a partial computation over the received Min-Hash signatures; these similarities do not achieve the desired accuracy but still pass the consistency check. In Case III, the server deliberately outputs two incorrect similarities at a certain offset from the two real similarities obtained by correct computation, and these incorrect similarities pass the consistency check. The formal calculation of the false negative rate in these three cases is as follows:
(i) In Case I, suppose the two arbitrary values are s1 and s2, and let ĴS1 denote the approximate Jaccard similarity of the original sets and ĴS2 that of the dummied sets. When Alice or Bob verifies the consistency of the two returned values, as described above, he/she computes the valid range R for s2 from s1. Since s2 is randomly chosen from (0, 1) with no information about the sets, the probability that s2 ∈ R is the interval length of R. That is, the false negative rate in this case equals the length of R. We observe that the smaller R is, the lower the FNR will be; by increasing k, i.e., the number of hash functions, we can make R sufficiently small.
(ii) Similar to Case I, suppose the server computes only k′ < k Min-Hash signature positions and returns two approximate Jaccard similarities. Based on equation (5), with high probability the two returned values lie within a larger bias ε′ > ε of JS1 and JS2, where JS1 and JS2 are the Jaccard similarities of the original and dummied sets, respectively, and ε′ corresponds to k′. Then, we need the distribution of the partially computed estimates; according to equation (5), each can be approximated by a binomial distribution.
With the property of binomial distribution, for a large value of (200 is sufficient based on Central Limit Theorem (https://www.stat.yale.edu/Courses/1997-98/101/sampmn.htm)), the similarity can be approximated as a normal distribution . Therefore, we have(iii)In Case III, assume that the server returns two false similarities and , then we have that and (note that the offset can be positive or negative, we only consider it to be positive in the following discussion since the results are almost the same for the negative case.). Like in Case II, we have , and the valid verification range generated from is . Therefore, the false negative rate in this case can be easily calculated as where .
Note that the abovementioned analysis only considers the server returning two false results simultaneously; if it returns one false result and one real one, the probability of passing the verification is quite low (no more than 0.6%) according to Figure 5.
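The normal approximation in Case II can be made concrete with a small numerical sketch (an illustration only; the values of J, k, and ε below are example parameters, not the paper's settings): the MinHash estimate behaves like N(J, J(1 − J)/k), so the probability that it lands within ±ε of the true similarity follows directly from the error function.

```python
import math

def prob_within(J, k, eps):
    """P(|Jhat - J| <= eps) under the normal approximation
    Jhat ~ N(J, J(1 - J)/k) of the Binomial(k, J)/k MinHash estimator."""
    sigma = math.sqrt(J * (1 - J) / k)
    # Phi(eps/sigma) - Phi(-eps/sigma) = erf(eps / (sigma * sqrt(2)))
    return math.erf(eps / (sigma * math.sqrt(2)))

# More hash functions -> tighter concentration around the true similarity,
# hence a smaller valid verification interval for the consistency-check.
p_small = prob_within(0.5, 200, 0.05)
p_large = prob_within(0.5, 800, 0.05)
```

With k = 200 the estimate stays within ±0.05 of J = 0.5 about 84% of the time, while k = 800 pushes this above 99%, matching the observation that increasing k shrinks the interval a cheating server would have to hit.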

6. Evaluation
In this section, we evaluate the efficiency of the two protocols and the verifiability of the second protocol. More specifically, we use the Crypto++ library and AES-ECB mode with 256-bit security to realize our cryptographic operations, and run the protocols on two client machines and one server machine with 8 vCPUs running Windows Server. Furthermore, to gain more efficiency, we utilize noncryptographic random hashes for the Min-Hash technique. For example, as presented in [1], we define a hash function as h(x) = (a · x + b) mod p, where a and b are two random integers.
Based on [39], selecting a very large prime p significantly reduces the probability of hash collisions, which in turn guarantees better accuracy for our approximate similarity computation.
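A minimal sketch of this noncryptographic Min-Hash estimation in Python (the Mersenne prime P and the toy sets are illustrative choices, not the paper's implementation):

```python
import random

P = (1 << 61) - 1  # a large Mersenne prime; keeps hash collisions rare

def make_hashes(k, rng):
    """k random hash functions h(x) = (a*x + b) mod P, as in [1]."""
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash_signature(elements, hashes):
    """One signature entry per hash function: the minimum hash value."""
    return [min((a * x + b) % P for x in elements) for (a, b) in hashes]

def approx_jaccard(sig_a, sig_b):
    """The fraction of matching signature entries estimates J(A, B)."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(42)
hashes = make_hashes(200, rng)
A, B = set(range(1000)), set(range(500, 1500))  # true Jaccard = 500/1500
est = approx_jaccard(minhash_signature(A, hashes), minhash_signature(B, hashes))
```

With k = 200 hash functions the estimate concentrates near the true similarity of 1/3, in line with the bias analysis of equation (5).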
6.1. Efficiency
We evaluate the efficiency of our protocols in the pipeline mode and the parallel mode (described in [9]), respectively. Generally speaking, under the pipeline mode, the computation is performed in a single thread, while under the parallel mode, the protocols are executed in multiple threads simultaneously on multiple CPUs.
The evaluations show that the efficiency of our protocols can be significantly boosted under the parallel mode with multiple threads. According to the theoretical analysis, the complexity of our protocols is obviously linear in the length k of the Min-Hash signatures (k is also the number of hash functions in the Min-Hash technique). To balance the efficiency and accuracy of our protocols, we fix k accordingly and thus obtain the estimated bias (as defined in equation (5)) used in the following evaluations.
To thoroughly evaluate the efficiency of our two protocols, we vary the size of the input sets from 100 thousand to 1 billion elements. From Table 1, we can clearly see that the efficiency is greatly boosted when running in parallel mode. Specifically, for a set size of 1 billion, Protocol 1 runs in about 30 minutes under the pipeline mode, while the parallel mode reduces the running time to only 6 minutes for the same set size. The efficiency of Protocol 2 can be boosted in the same way.
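As a rough illustration of the parallel mode (a hypothetical structure; the paper's multi-threaded implementation is not shown), the k hash evaluations can be split across workers and the signature reassembled in the original order:

```python
from concurrent.futures import ThreadPoolExecutor

P = (1 << 61) - 1  # a large Mersenne prime to keep hash collisions rare

def minhash_chunk(elements, params):
    """Compute the MinHash signature entries for one chunk of hash functions."""
    return [min((a * x + b) % P for x in elements) for (a, b) in params]

def minhash_parallel(elements, params, workers=4):
    """Split the k hash functions into `workers` chunks evaluated concurrently."""
    chunks = [params[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: minhash_chunk(elements, c), chunks))
    # Reassemble: chunk w holds the hash functions at indices w, w+workers, ...
    sig = [None] * len(params)
    for w, part in enumerate(parts):
        for j, v in enumerate(part):
            sig[w + j * workers] = v
    return sig

params = [(3, 1), (5, 2), (7, 0), (11, 4), (2, 9)]
sig = minhash_parallel({1, 2, 3}, params, workers=2)
```

Because each signature entry depends only on its own hash function, the work partitions cleanly, which is why the running time scales down almost linearly with the number of threads.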
6.1.1. Comparison
Since our Protocol 2 provides a novel verification solution against a malicious server, it achieves stronger security at the cost of some efficiency. Therefore, we compare our Protocol 1 with the most closely related work, proposed by Kamara et al. [13], in terms of efficiency. Strictly speaking, [13] proposed a server-aided PSI protocol, but it can easily be extended to compute a Jaccard similarity via the cardinality of the PSI. Hence, we mainly compare our protocol with [13] in the following discussion.
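The extension is straightforward: given the PSI cardinality, the Jaccard similarity follows from the inclusion–exclusion identity (a generic arithmetic sketch, not the protocol of [13] itself):

```python
def jaccard_from_psi(size_a, size_b, psi_cardinality):
    """J(A, B) = |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|."""
    return psi_cardinality / (size_a + size_b - psi_cardinality)

# Two sets of 1000 elements sharing 500 elements have a union of size 1500,
# so their Jaccard similarity is 500 / 1500.
j = jaccard_from_psi(1000, 1000, 500)
```

Note that this route requires the PSI protocol to run over the full sets, which is exactly why its communication grows linearly with the set size, unlike our signature-based approximation.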
As shown in Figure 6(a), our Protocol 1 is more efficient (almost three times faster) than the basic protocol of [13] under both running modes. Furthermore, we plot the communication comparison between our protocol and KMRS's in Figure 6(b). We can clearly see that the communication overhead of our Protocol 1 is constant, while the overhead of [13] increases linearly with the set size; thus, our approximate computation has an obvious advantage in reducing communication overhead for large sets.

6.2. Verifiability
As stated in Section 5, we first obtain a series of numerical results via the theoretical analysis of the false positive rate and false negative rate. Then, we perform our consistency-check mechanism on real sets and compare the real tests with the numerical results. Generally speaking, the real test results are very close to our theoretical numerical results, and the results also show that our Protocol 2 can detect a malicious cheating server with a very high probability (considering the nature of approximate computation, it is obviously impossible to verify the result with 100% probability).
6.2.1. Numerical Results
We now theoretically analyze the false positive rate and false negative rate of our consistency-check mechanism. To trade off the efficiency and security of our protocol, we set the original set size at the million scale, and then randomly choose the dummy set size within a fixed range (note that the dummy set size can reasonably be kept private when this range is large). Here, we evaluate the false positive rate and false negative rate of Protocol 2 under several different values of the number of hashes k.
Based on the definition of the false positive rate (FPR), we plot the distribution of the FPR in Figure 5. It is clearly seen that the FPR of our Protocol 2 is quite small, even under the worst case.
For the false negative rate (FNR) of Case I mentioned in Section 4, where the server returns two arbitrary similarities chosen from (0, 1) without doing any computation over the sets, Figure 7 shows that a lower FNR can be achieved by increasing k: the FNR under a small k is greatly reduced as k grows.

Before discussing the FNR of Case II, where the aiding server partially executes the protocol and returns two similarity results with degraded accuracy, we first discuss the effect of the parameters m (the dummy set size) and J (the Jaccard similarity of the sets). Assume that the desired number of hash functions is k, and the server partially computes a similarity by evaluating only k' < k Minhash signatures. We discuss the effects of m and J under fixed settings of the remaining parameters. As shown in Figure 8, the FNR decreases as m increases. However, the false positive rate (FPR) increases with m, as shown in Figure 5. Hence, to balance the FPR and FNR, we choose a moderate value of m in the real implementation.

Then, we analyze the distribution of the FNR in Case II under three settings of k'. As shown in Figure 9, the FNR obviously increases with k', which means the server can pass the verification with a higher probability by performing more computation. When the server calculates the Min-Hash signatures with full computation, that is, k' = k, we can observe in Figure 9 that the FNR (the probability of the server successfully passing the consistency-check) is very high.

In the following part, we discuss the FNR in Case III, in which the server completes the computation but returns a false result with a certain offset Δ from the real similarity. As shown in Figure 3, we analyze the FNR under different values of Δ and k. In Figure 3(a), Δ varies from 0 to 0.5 under fixed k, and it can be clearly observed that once the offset Δ is larger than the estimated bias in equation (5), the probability of the server passing the consistency-check is extremely low; concretely, the FNR tends to zero as Δ grows beyond the estimated bias. In Figure 3(b), the FNR is almost equal to 0 for large offsets, and for smaller offsets the FNR decreases as k increases, which justifies the parameter settings used in Figure 3(a).
(1) Improvement with Multiple Rounds. It is clearly seen that the performance of our two protocols improves as the number of hash functions k decreases. Unfortunately, a smaller k, although more efficient, leads to lower accuracy and introduces a higher false negative rate, as shown in Figure 9. To balance efficiency and acceptable accuracy for Protocol 2, we present a multiple-round execution of the protocol: after running Protocol 2 multiple times, a malicious server can be detected with an extremely high probability.
As described in Figure 4(a), the false negative rate dramatically decreases with the number of executions. We observe that the false negative rate for Case I (i.e., no calculation at all) is almost zero after only two executions. After ten executions, the false negative rate is also close to zero for a server that computes only part of the signatures, while for the honest behavior (k' = k), the true negative rate remains high. Similarly, Figure 4(b) shows the false negative rate with multiple executions under another setting. Loosely speaking, by running the protocol multiple times we can drive the false negative rate for a malicious server close to zero; that is to say, the malicious server can be detected with a high detection rate. Meanwhile, the multiple executions of our protocol introduce a necessary efficiency trade-off.
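Under the simplifying assumption that the consistency-checks of different rounds are independent, the compounding effect of multiple executions can be quantified as follows (an illustrative model with example per-round rates, not the paper's measured values):

```python
def multi_round_fnr(fnr_single, rounds):
    """Probability that a cheating server passes ALL `rounds` executions,
    assuming each round's consistency-check is independent."""
    return fnr_single ** rounds

# Even a weak per-round check compounds quickly against a cheater:
two = multi_round_fnr(0.3, 2)    # 9% after two executions
ten = multi_round_fnr(0.3, 10)   # below 0.001% after ten executions
```

An honest server, by contrast, passes each round with probability near 1, so repetition drives the detection rate for cheaters up without penalizing honest behavior, at the cost of proportionally more computation.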
6.2.2. Results on Real Datasets
To demonstrate the false negative rate (FNR) over real datasets, we implement our second protocol and run 1000 tests, each over datasets of 1 million elements. As shown in Figure 10, the results of the real dataset tests are close to the numerical results mentioned above; stated differently, our numerical analysis of the false negative rate is correct. Similarly, the false positive rate (FPR) on the real dataset tests is also very close to the numerical results, so we omit the corresponding figures.

7. Conclusions
In this paper, we proposed two Jaccard similarity computation protocols over two users' private large sets with the assistance of a not fully trusted server for IoT. By leveraging the Min-Hash technique, both protocols achieve enhanced efficiency and security, since only an approximate similarity result is revealed to the server and the clients. Experiments demonstrate that our protocols are practical over scalable datasets, especially under the parallel mode. In addition, we presented a novel consistency-check mechanism to further achieve result verifiability against a malicious server, which may return a random result, a low-accuracy result, or even a false result with an offset. Our analyses show that this mechanism detects the server's malicious behavior with a very high probability, and to the best of our knowledge, it represents the first work to realize efficient verification for approximate similarity computation.
Data Availability
The experimental data in the paper are simulated to support the findings of this study. The specific content of the datasets does not affect the security and performance of the protocols.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Program for Scientific Research Foundation for Talented Scholars of Jinling Institute of Technology (JIT-B-201726), the Philosophy and Social Science Foundation of the Jiangsu Higher Education Institutions of China (2021SJA0448), and the Natural Science Foundation of Jiangsu Province (BK20210928).