Abstract
Biometric identification services have been applied to almost all aspects of life. However, how to securely and efficiently identify an individual in a huge biometric dataset is still very challenging. On the one hand, biometric data is very sensitive and should be kept secure during the process of biometric identification. On the other hand, searching for a biometric template in a large dataset can be very time-consuming, especially when privacy-preserving measures are adopted. To address this problem, we propose an efficient and privacy-preserving biometric identification scheme based on the FITing-tree, iDistance, and a symmetric homomorphic encryption (SHE) scheme with two cloud servers. With our proposed scheme, the privacy of the user’s identification request and the service provider’s dataset is guaranteed, while the computational costs of the cloud servers in searching the biometric dataset are kept at an acceptable level. Detailed security analysis shows that the privacy of both the biometric dataset and the biometric identification request is well protected during the identification service. In addition, we implement our proposed scheme and compare it to a previously reported M-tree based privacy-preserving identification scheme in terms of computational and communication costs. Experimental results demonstrate that our proposed scheme is indeed efficient in terms of computational and communication costs while identifying a biometric template in a large dataset.
1. Introduction
With the booming development of the Internet of Things (IoT), the number of smart devices, such as smart cameras and smartwatches, has grown dramatically in recent years. According to reports, there were 12 billion IoT devices in 2020, and this number will grow to more than 30 billion by 2025 [1]. The proliferation of IoT devices and the development of image processing technologies make biometric-based services increasingly easy to deploy and reliable. As a result, biometric-based services have been applied to a variety of scenarios including airport service, criminal investigation, and counterterrorism [2–5].
As a major type of biometric-based service, biometric identification aims to find whether a given biometric template exists in a precollected biometric dataset. Specifically, a biometric template is usually denoted by a vector, e.g., an $n$-dimensional vector $\mathbf{q} = (q_1, \dots, q_n)$. For a given biometric dataset $\mathcal{D}$, $\mathbf{q}$ exists in $\mathcal{D}$ if there is a biometric template $\mathbf{b} \in \mathcal{D}$ whose Euclidean distance to $\mathbf{q}$ is less than or equal to a given threshold $\tau$, i.e., $d(\mathbf{q}, \mathbf{b}) \le \tau$. Since the owner of the biometric template dataset may be limited in computing power and storage capacity, it is necessary to outsource the biometric dataset to the cloud for data management and identification request processing when the dataset becomes large. However, as biometric data is highly sensitive and the cloud server is not always trusted, measures should be adopted to prevent the cloud from extracting private information from the outsourced data. Encryption is the most intuitive solution to this issue. However, in addition to privacy, data utility should also be guaranteed, which means that the similarity between two templates should be derivable from their encrypted data.
To solve this problem, researchers have proposed many schemes [6–13] to achieve privacy-preserving biometric recognition. Unfortunately, most of these schemes [6–10] work in a basic way: they traverse the whole dataset to get the identification result, and no optimization is applied to accelerate the searching process. Consequently, a huge computing burden is placed on the cloud when it handles many identification requests simultaneously, which makes these schemes inefficient and impractical. The situation becomes worse as the dataset size grows. As a result, it is urgent to design a privacy-preserving biometric identification scheme by which both the security of the biometric template and the efficiency of the identification process can be guaranteed. Some schemes [11] employ index data structures to expedite the searching process. However, the data owner has to be on standby during the identification service, which forfeits the native advantages of cloud computing.
In this paper, we propose an efficient and privacy-preserving biometric identification scheme based on two data structures, namely, the FITing-tree [14] and iDistance [15], and a symmetric homomorphic encryption (SHE) algorithm [16,17]. With our proposed scheme, the privacy of the biometric dataset and the identification request is preserved. In addition, the computational costs of the cloud, which are used to process the identification requests, are kept at an acceptable level. Specifically, the main contributions of this paper are threefold.
(i) First, we propose a privacy-preserving biometric identification scheme with two noncolluding cloud servers. With our proposed scheme, the privacy of the biometric dataset and the identification request is protected during the online biometric identification service process. Specifically, the cloud servers cannot obtain the specific value of the identification request and the plaintext of the biometric template in the dataset.
(ii) Second, the efficiency of the proposed scheme is improved with the FITing-tree and iDistance. By introducing the FITing-tree, the computational costs of the biometric identification process are significantly lowered by reducing the number of similarity comparison operations. Besides, the size of the index is also optimized.
(iii) Third, to evaluate the performance of our proposed scheme, we implement it and conduct extensive experiments on a synthetic dataset. Both the theoretical and experimental results show that the proposed scheme is more efficient than other similar schemes in both computational and communication costs. In addition, we also test the accuracy of our proposed scheme on a real-world face dataset.
The remainder of the paper is organized as follows. In Section 2, we review some related work at first. Then, we formalize our system model and security model and identify our design goal in Section 3 and review some preliminaries, including a SHE scheme, FaceNet algorithm, iDistance data structure, and FITing-tree data structure in Section 4. After that, we present our proposed scheme in Section 5, followed by security analysis and performance evaluation in Section 6 and Section 7, respectively. Finally, we draw our conclusion in Section 8.
2. Related Work
In this section, we will briefly review some related work on privacy-preserving biometric identification.
Early privacy-preserving biometric identification schemes focus only on the privacy-preserving issue. In these schemes, biometric identification is considered to be a two-party system, where the data owner takes charge of biometric dataset management and template matching. Most of these schemes are designed based on secure computation protocols [18–20] and homomorphic encryption [9,21,22] techniques. Although privacy preservation is achieved in these schemes, the data owner is required to have powerful computing ability and large storage capacity, which can hardly be satisfied in most application scenarios and thus makes these schemes impractical.
The emergence of cloud computing presents a new and promising paradigm to handle these challenges. Some researchers leverage cloud computing techniques to release the data owner from this burden. In their schemes, the data owner outsources the encrypted biometric dataset to the cloud server, and the matching process is completed on the cloud. Yuan et al. [6] proposed the first cloud-based privacy-preserving biometric identification scheme using a matrix encryption scheme, where the biometric dataset and identification query are both encrypted and sent to the cloud server by the data owner. However, Wang et al. [7] and Zhu et al. [23] pointed out that [6] is not secure under the known-plaintext attack model [24]. In addition, [7] presented a privacy-preserving biometric identification scheme based on the similarity matrix under the same system model as [6], and the security analysis showed that [7] had a higher security level than [6]. Zhang et al. [8] proposed an efficient privacy-preserving biometric identification scheme based on the matrix and perturbed terms with lower time cost and bandwidth consumption than [6,7]. Wang et al. [10] proposed an inference-based framework for privacy-preserving similarity search in Hamming space and achieved privacy-preserving biometric identification based on it. Hu et al. [25] proposed a privacy-preserving biometric identification scheme in an outsourcing environment with two noncolluding servers based on homomorphic encryption and batched protocols. With the help of the cloud, the computing cost of the data owner during biometric matching is significantly reduced in the above schemes. However, in [6–8], the data owner has to stay online to encrypt the user’s query data and decrypt the identification result, which erodes some advantages of cloud computing and places a heavy load on the data owner if it serves too many users at the same time.
What is more, in all the cloud-based schemes above, the searching process is not optimized, which means that the searching cost of the cloud server is linear in the size of the dataset. Although the cloud server is equipped with strong computing power, it may still run into a bottleneck while simultaneously serving too many users.
To address this issue, some researchers began to focus on how to achieve sublinear searching efficiency in the biometric identification process, which significantly eases the pressure on the cloud server. Zhu et al. [11] proposed a cloud-assisted privacy-preserving biometric identification scheme. With the help of an asymmetric scalar-product preserving encryption scheme and the R-tree, sublinear search efficiency is achieved in [11]. Nevertheless, the data owner also needs to stay online in [11]. Moreover, since the R-tree is not built on the metric relation between the data objects, the cloud server needs to search the tree twice to find the closest biometric template in the dataset, which reduces the efficiency of the searching process. Yang et al. [26] proposed a privacy-preserving biometric identification scheme based on the M-tree to achieve sublinear search efficiency.
In this paper, to protect the privacy of the biometric data and reduce the time consumption of the biometric searching process, we introduce SHE and the FITing-tree to construct a privacy-preserving biometric identification scheme based on two noncolluding cloud servers. Compared with previous works, the service provider in our proposed scheme does not need to stay online during identification, and higher efficiency in both computation and communication is achieved.
3. Models and Design Goal
In this section, we formally describe our system model and security model and identify our design goals.
3.1. System Model
In our system model, we consider a cloud-based biometric identification system as shown in Figure 1, which mainly consists of three parts: the service provider, a cloud with two servers, and the client.
(i) Service provider: the service provider (SP) has collected a biometric template dataset $\mathcal{D}$ with $m$ biometric templates, i.e., $\mathcal{D} = \{\mathbf{b}_1, \dots, \mathbf{b}_m\}$, where each template $\mathbf{b}_i$ is an $n$-dimensional vector $\mathbf{b}_i = (b_{i,1}, \dots, b_{i,n})$. For simplicity and clear description, we assume that the value of each $b_{i,j}$ ($1 \le j \le n$) is a positive integer, since a nonintegral biometric template can be transformed into a positive integer vector easily. To make the best use of the dataset, the SP is willing to offer an online biometric identification service to some users. Since the SP usually has limited computing power and storage capability, it tends to outsource the biometric dataset to the cloud to relieve the burden of data management and of handling a large number of biometric identification requests. In consideration of the sensitivity of the biometric data, and the fact that the cloud is not always trusted, the biometric data should be encrypted before being outsourced to the cloud.
(ii) Cloud servers: in our system, the cloud employs two cloud servers, namely, cloud server 1 (CS1) and cloud server 2 (CS2), from two different cloud service providers, to collaboratively process the biometric identification requests. Specifically, CS1 stores the encrypted dataset and indexes and accepts identification requests. CS2 holds the secret key and helps CS1 get the identification result by decrypting some intermediate results. Note that these two cloud servers are powerful in computing and have sufficient storage space.
(iii) Client: the client in our system model can be an IoT device, which is equipped with sensors (e.g., camera, microphone, or fingerprint collector) and has moderate computation ability (e.g., to extract biometric features and encrypt some data). An application employs the client to access the biometric identification service.
To get the identification result, the client generates an identification request, submits it to the cloud servers, and processes the response from the cloud servers.

3.2. Security Model
In our system model, we consider that the SP and the client are fully trusted and will honestly follow the prearranged protocol. As for the two cloud servers, CS1 and CS2 are considered to be honest-but-curious, which means they will faithfully follow the protocols by outputting the correct identification result but will be curious about the client’s or SP’s data once certain conditions are satisfied. Meanwhile, we assume that the two cloud servers do not collude with each other. This is reasonable since the cloud servers should maintain their reputation and interests. Note that since we only focus on how to achieve efficient and privacy-preserving biometric identification in this paper, active attacks on data integrity and source authentication from external adversaries are beyond the scope of our work and will be discussed in our future work.
3.3. Design Goals
Our design goal is to propose an efficient and privacy-preserving cloud-based biometric identification scheme to address the challenges mentioned in the above system model and security model. Specifically, the following two objectives should be achieved:
(i) Privacy: since the biometric data is highly sensitive, the proposed biometric identification scheme should be privacy-preserving, which means that the security of the biometric data stored in the biometric template dataset and in the identification request should be guaranteed.
(ii) Efficiency: since high time delay is intolerable for a biometric identification system, the proposed scheme should be efficient in terms of both computational and communication costs. The inefficiency of a biometric identification system mainly stems from two factors. First, the cloud servers need to search the whole biometric template dataset at the identification stage, which is quite time-consuming when the template dataset becomes large. Second, to satisfy the privacy-preserving requirements, some additional operations are necessarily introduced, which significantly increases the computational costs of the cloud servers. Besides, since both the dataset and the identification requests need to be sent to the cloud servers, a heavy communication burden is placed on the cloud servers when they serve too many users simultaneously. Therefore, some measures should be adopted to achieve higher efficiency in computation and communication.
4. Preliminaries
In this section, we briefly review the FaceNet algorithm [27], symmetric homomorphic encryption (SHE) scheme [16], the iDistance data structure [15], and FITing-Tree data structure [14], which will serve as the building blocks of our proposed scheme.
4.1. FaceNet Algorithm
FaceNet [27] is a face recognition system that maps a face image to an embedding in a compact Euclidean space. Built on a deep convolutional network, FaceNet works in two phases, i.e., the training phase and the matching phase. In the training phase, given a face image $x$, a mapping $f$ from the face image to a compact Euclidean space is built first. Then, based on the mapping, a Euclidean embedding $f(x)$ can be calculated to represent the face image $x$. In the matching phase, two face images $x_1$ and $x_2$ are given, which are represented by two embeddings $f(x_1)$ and $f(x_2)$. To evaluate the similarity of $x_1$ and $x_2$, the Euclidean distance between the two embeddings is computed as $d = \lVert f(x_1) - f(x_2) \rVert_2$. Then, a threshold $\tau$ is used to determine whether these two faces are the same or different: the faces are judged to be the same if $d \le \tau$ and different otherwise.
4.2. Description of SHE
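The SHE [16,17] is a symmetric homomorphic encryption scheme, which mainly consists of three algorithms, namely, (i) key generation, (ii) encryption, and (iii) decryption:
(i) Key generation: given three security parameters $(k_0, k_1, k_2)$, which satisfy the constraint $k_1 \ll k_2 < k_0$, the secret key $sk$ is generated from $p$, $q$, and $L$, where $p, q$ are two large prime numbers with $|p| = |q| = k_0$, and $L$ is a random number with the bit length $|L| = k_2$. Eventually, compute $N = pq$ and set the public parameters $pp = (k_0, k_1, k_2, N)$. The message space of the SHE scheme is $\mathcal{M} = [-2^{k_1 - 1}, 2^{k_1 - 1})$.
(ii) Encryption: given a message $m \in \mathcal{M}$, it can be encrypted with the secret key as $E(m) = (rL + m)(1 + r'p) \bmod N$, where $r \in \{0,1\}^{k_2}$ and $r' \in \{0,1\}^{k_0}$ are two random numbers.
(iii) Decryption: given a ciphertext $c = E(m)$, it can be decrypted with the secret key as $m' = (c \bmod p) \bmod L$, where $m = m'$ if $m' < L/2$ and $m = m' - L$ otherwise.
The correctness of the decryption can be proven as follows: $(c \bmod p) \bmod L = \big((rL + m)(1 + r'p) \bmod N \bmod p\big) \bmod L = (rL + m) \bmod L = m \bmod L$, where the second equality holds because $p$ divides $N$ and $rL + m < p$.
SHE satisfies the following homomorphic properties:
(i) Homomorphic Addition-I: given two ciphertexts $E(m_1)$ and $E(m_2)$, we have $E(m_1) + E(m_2) \bmod N \rightarrow E(m_1 + m_2)$.
(ii) Homomorphic Multiplication-I: given two ciphertexts $E(m_1)$, $E(m_2)$, we have $E(m_1) \cdot E(m_2) \bmod N \rightarrow E(m_1 \cdot m_2)$.
(iii) Homomorphic Addition-II: given a ciphertext $E(m_1)$ and a plaintext $m_2$, we have $E(m_1) + m_2 \bmod N \rightarrow E(m_1 + m_2)$.
(iv) Homomorphic Multiplication-II: given a ciphertext $E(m_1)$ and a plaintext $m_2$, we have $E(m_1) \cdot m_2 \bmod N \rightarrow E(m_1 \cdot m_2)$.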
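The three algorithms and the four homomorphic properties can be illustrated with a small Python sketch. It assumes the SHE construction as it is commonly stated for [16,17], i.e., encryption of the form $(rL + m)(1 + r'p) \bmod N$ and decryption by reducing modulo $p$ and then modulo $L$. The function names (`keygen`, `encrypt`, `decrypt`), the dictionary layout of the keys, and the parameter sizes are illustrative assumptions rather than the authors' implementation, and the parameters are chosen for readability, not for security.

```python
import secrets

def _is_probable_prime(n, rounds=40):
    """Miller-Rabin primality test (probabilistic)."""
    if n < 2:
        return False
    for sp in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % sp == 0:
            return n == sp
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def _random_prime(bits):
    while True:
        cand = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if _is_probable_prime(cand):
            return cand

def keygen(k0=512, k1=16, k2=40):
    """Assumed layout: sk holds p and L, pp holds N and the sizes."""
    p, q = _random_prime(k0), _random_prime(k0)
    L = secrets.randbits(k2) | (1 << (k2 - 1))   # exactly k2 bits
    return {"p": p, "L": L}, {"N": p * q, "k0": k0, "k1": k1, "k2": k2}

def encrypt(m, sk, pp):
    # r >= 1 keeps r*L + m positive even for negative messages.
    r = secrets.randbelow((1 << pp["k2"]) - 1) + 1
    r_prime = secrets.randbelow((1 << pp["k0"]) - 1) + 1
    return ((r * sk["L"] + m) * (1 + r_prime * sk["p"])) % pp["N"]

def decrypt(c, sk):
    m = (c % sk["p"]) % sk["L"]
    return m if m < sk["L"] // 2 else m - sk["L"]

sk, pp = keygen()
c1, c2 = encrypt(5, sk, pp), encrypt(-7, sk, pp)
print(decrypt((c1 + c2) % pp["N"], sk))   # Addition-I
print(decrypt((c1 * c2) % pp["N"], sk))   # Multiplication-I
```

Decryption stays correct only while the hidden integer (the accumulated $rL + m$ terms) remains below $p$ and the plaintext result stays within $(-L/2, L/2)$, which is why $k_0$ is chosen much larger than $k_2$ here.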
4.3. iDistance Data Structure
The iDistance data structure is an indexing and query processing technique for $k$-nearest neighbor ($k$NN) queries on point data in multidimensional metric spaces [15]. For a given dataset, iDistance first partitions the data based on a space- or data-partitioning strategy and selects a reference point for each partition. Then, a one-dimensional index is calculated for each data point in a partition based on its distance to the reference point of this partition. Finally, a B+ tree is built on these indexes, and the $k$NN search can be performed using a one-dimensional range search. As shown in Figure 2, the detailed description of the index building process is as follows:
(i) Data partition: the dataset is divided into a set of partitions with a clustering algorithm, e.g., the K-means algorithm. Then, a reference point is assigned to each partition. Suppose that there are $t$ partitions $P_1, \dots, P_t$ whose corresponding reference points are represented by $O_1, \dots, O_t$.
(ii) Index calculation: for a data point $\mathbf{p} \in P_i$, its index can be generated based on the distance from its corresponding reference point as $I(\mathbf{p}) = i \cdot c + d(\mathbf{p}, O_i)$, where $c$ is an offset value used to avoid overlap between the iDistance ranges of different partitions. Specifically, $c$ splits the one-dimensional space into regions, and the points of each partition are mapped to different regions. For example, for the $i$th partition $P_i$, all data points in this partition are mapped to the range $[i \cdot c, i \cdot c + dist\_max_i]$, where $dist\_max_i$ is the greatest distance of all points in $P_i$ from the reference point $O_i$. $c$ must be set sufficiently large to avoid overlap between the index regions of different partitions.
(iii) Range query: a range query $(\mathbf{q}, r)$ aims to find all data points $\mathbf{p}$ in the dataset that satisfy $d(\mathbf{p}, \mathbf{q}) \le r$. For any data point $\mathbf{p}$ in the $i$th partition $P_i$, the triangle inequality of the Euclidean distance gives $d(O_i, \mathbf{q}) - d(\mathbf{p}, \mathbf{q}) \le d(O_i, \mathbf{p}) \le d(O_i, \mathbf{q}) + d(\mathbf{p}, \mathbf{q})$. With the range query requirement, we have $d(O_i, \mathbf{q}) - r \le d(O_i, \mathbf{p}) \le d(O_i, \mathbf{q}) + r$, which means that we only need to check data points whose iDistance index lies in $[i \cdot c + d(O_i, \mathbf{q}) - r,\ i \cdot c + d(O_i, \mathbf{q}) + r]$.
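The index calculation and the range query can be sketched in plaintext Python as follows. This is a minimal illustration that assumes the partition assignment and the reference points are already given (the clustering step is omitted), uses a single constant offset `c`, and replaces the B+ tree by a sorted list with binary search. The function names and the sample data are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, refs, assign, c):
    """Return the sorted (iDistance key, point id) list and the
    maximum distance observed in every partition."""
    dist_max = [0.0] * len(refs)
    entries = []
    for pid, p in enumerate(points):
        i = assign[pid]
        d = dist(p, refs[i])
        dist_max[i] = max(dist_max[i], d)
        entries.append((i * c + d, pid))
    return sorted(entries), dist_max

def range_query(points, refs, c, index, dist_max, q, r):
    keys = [k for k, _ in index]
    result = []
    for i, ref in enumerate(refs):
        dq = dist(q, ref)
        if dq - r > dist_max[i]:
            continue                      # partition cannot contain a match
        lo = i * c + max(dq - r, 0.0)
        hi = i * c + min(dq + r, dist_max[i])
        # One-dimensional range search over the sorted keys.
        for _, pid in index[bisect_left(keys, lo):bisect_right(keys, hi)]:
            if dist(points[pid], q) <= r:  # filter false positives
                result.append(pid)
    return sorted(result)

points = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10)]
refs = [(0, 0), (10, 10)]
assign = [0, 0, 0, 1, 1]
index, dist_max = build_index(points, refs, assign, c=100)
print(range_query(points, refs, 100, index, dist_max, q=(1, 1), r=1.0))
```

The triangle inequality only guarantees that no true match is missed, so the final distance check is still needed to discard points that fall in the key range but lie outside the query ball.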

4.4. FITing-Tree Data Structure
The FITing-tree [14] is a data-aware index structure that approximates an index using piece-wise linear functions. With FITing-tree, a given key can be mapped to a storage location with a bounded error.
There are two basic notions in the FITing-tree, namely, the segment and the error threshold:
(i) Segment: a segment is a line segment that maps a key to its approximate storage position. A segment $s$ can be represented by its starting point and its slope, i.e., $s = (start, slope)$ with $start = (key_{start}, pos_{start})$. For a given key $k$ lying in this segment, its predicted position can be calculated by $pos_{pred} = pos_{start} + (k - key_{start}) \times slope$. (8)
(ii) Error: the error threshold $error$ is the maximum allowed distance between the predicted position of any key inside a segment and its actual position.
The operations of the FITing-tree mainly consist of FITing-tree building and query, which are described as follows.
(i) FITing-tree building: the main goal of the building process is to use a series of disjoint linear segments to capture the trends that exist in the data while satisfying the error threshold. The dataset is first sorted in ascending order, and then a greedy streaming algorithm is used to maximize the length of each segment, as shown in Algorithm 1. After all the segments are selected, a B+ tree is built on these segments.
(ii) Query: in the query process, given a key $k$, we first find the segment in which this key is located. A key $k$ lies in a segment if the starting key of that segment is at most $k$ and the starting key of the next segment is greater than $k$. This step can be easily achieved by the B+ tree search algorithm. Then, the predicted position $pos_{pred}$ of $k$ is calculated by equation (8). After interpolating the key’s position, the true position of the key is guaranteed to be within the error threshold of the prediction. Finally, the FITing-tree locally searches the region $[pos_{pred} - error, pos_{pred} + error]$ using binary search. Figure 3 illustrates this query process.
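The building and query steps can be sketched as follows. Since the listing of Algorithm 1 is not reproduced here, the sketch uses a shrinking-cone style greedy pass, which is one common way to realize such a segmentation and is an assumption on our part. The B+ tree over the segments is replaced by binary search over the segments' starting keys, the keys are assumed to be distinct and sorted, and all names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

def _pick_slope(low, high):
    # A one-point segment never narrows the cone; any slope works.
    return 0.0 if math.isinf(high) else (low + high) / 2

def build_segments(keys, error):
    """Greedy pass over sorted, distinct keys; returns a list of
    (start_key, start_pos, slope) with prediction error <= error."""
    segments = []
    start_key, start_pos = keys[0], 0
    low, high = -math.inf, math.inf
    for pos in range(1, len(keys)):
        dx = keys[pos] - start_key
        slope = (pos - start_pos) / dx
        if low <= slope <= high:
            # Narrow the cone so every accepted point stays within error.
            high = min(high, (pos + error - start_pos) / dx)
            low = max(low, (pos - error - start_pos) / dx)
        else:
            segments.append((start_key, start_pos, _pick_slope(low, high)))
            start_key, start_pos = keys[pos], pos
            low, high = -math.inf, math.inf
    segments.append((start_key, start_pos, _pick_slope(low, high)))
    return segments

def lookup(keys, segments, error, key):
    """Return the position of key in keys, or None if it is absent."""
    starts = [s[0] for s in segments]
    j = bisect_right(starts, key) - 1
    if j < 0:
        return None
    start_key, start_pos, slope = segments[j]
    pred = start_pos + slope * (key - start_key)
    lo = max(0, math.floor(pred - error))
    hi = min(len(keys) - 1, math.ceil(pred + error))
    i = bisect_left(keys, key, lo, hi + 1)   # local binary search
    return i if i <= hi and keys[i] == key else None

keys = [1, 2, 3, 4, 5, 20, 40, 60, 61, 62, 100]
segments = build_segments(keys, error=2)
print(len(segments), lookup(keys, segments, 2, 61))
```

A larger error threshold yields fewer segments and therefore a smaller index, at the price of a wider local search window per lookup.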

5. Our Proposed Scheme
In this section, we present our privacy-preserving cloud-based biometric identification scheme, which consists of four phases, i.e., System Initialization, Index Creation and Encryption, Encrypted Identification Request Generation, and Biometric Identification. Before delving into the details, we give an overview of our proposed scheme. Specifically, in the System Initialization phase, SP first generates system parameters (including the security parameters and identification parameters) and distributes them to the client and cloud servers. Then, in the Index Creation and Encryption phase, SP builds an index based on the iDistance and FITing-tree, encrypts the index and the dataset, and outsources them to the cloud. In the Encrypted Identification Request Generation phase, the client generates an encrypted identification request based on a given biometric template. Eventually, in the Biometric Identification phase, the client sends the encrypted identification request to the cloud, and the two cloud servers work together to get the identification result and return it to the client. To describe our proposed scheme more clearly, we explain the main notations used in the following sections in Table 1.
5.1. System Initialization
In the System Initialization phase, SP sets up the system and generates keys for the client and cloud servers. Following the method described in Section 4.2, SP selects the security parameters $(k_0, k_1, k_2)$, calculates the secret key $sk$, and generates the public parameters $pp = (k_0, k_1, k_2, N)$, where $N = pq$. After that, SP encrypts 0 and 1 with $sk$, and the corresponding ciphertexts are denoted as $E(0)$ and $E(1)$, respectively. Then, SP sets the identification threshold $\tau$ for the identification system. After all parameters are generated, SP publishes $pp$ and $\tau$ and sends $E(0)$, $E(1)$ to the client and $sk$ to CS2, respectively.
5.2. Index Creation and Encryption
In this phase, SP firstly builds a searching index based on the iDistance data structure and FITing-tree data structure over the biometric dataset . Then, SP encrypts the index and dataset with the SHE scheme. Eventually, SP outsources the encrypted index and dataset to CS1.
5.2.1. Stage 1: Index Building Process
In this stage, the iDistance index for each biometric template in dataset is calculated at first. Then, a FITing-tree is built based on these indexes.(i)Data partition: SP firstly divides the biometric dataset into partitions utilizing the K-means algorithm and selects a reference point for each partition, where the reference point for the th partition is represented by .(ii)iDistance index calculation: for every template in one partition, the Euclidean distance between this biometric template and the partition’s reference point is computed at first. For example, for the th partition , for any , the Euclidean distance between them is computed as . Besides, the maximum distance of all distance calculated in the th partition is denoted as . Then, to avoid the overlap of indexes between different partitions, an offset should be added while calculating the iDistance index. Meanwhile, to keep the gap between indexes of different partitions as small as possible, the offset is selected as . Eventually, for a biometric template in the th partition, its iDistance index is calculated as(iii)FITing-tree building: after all the iDistance indexes for all templates are calculated, SP builds a FITing-tree based on these iDistance indexes following Algorithm 1 with a given threshold. When the building process of the FITing-tree is complete, a series of disjoint linear segments are generated. Each segment can be represented by its starting point and slope, where .
5.2.2. Stage 2: Index Encryption
After the FITing-tree is built, SP encrypts the indexes and the dataset with the SHE scheme. At first, SP encrypts the reference points of each partition with the SHE scheme, and the encrypted reference points are represented as . Then, SP encrypts each biometric template in dataset using the SHE scheme. For a biometric , it is encrypted as . In the end, SP outsources the encrypted dataset and the searching index, which includes the encrypted reference points , maximum distance list , and FITing-tree segments to CS1.
5.3. Encrypted Identification Request Generation
When a client wants to verify whether a biometric template $\mathbf{q} = (q_1, \dots, q_n)$ exists in the dataset, it needs to send an identification request to the cloud servers. For privacy protection, $\mathbf{q}$ should be encrypted in advance. Since the client receives the ciphertexts $E(0)$ and $E(1)$ in the System Initialization phase, the biometric template can be encrypted based on the homomorphic properties of the SHE scheme. The encrypted template is denoted as $E(\mathbf{q}) = (E(q_1), \dots, E(q_n))$, where $E(q_j) = q_j \cdot E(1) + r_j \cdot E(0) \bmod N$ and $r_j$ is a random number.
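The following sketch illustrates how a client that holds only ciphertexts of 0 and 1 could build a valid ciphertext of each template entry. It assumes that the entry is multiplied into the ciphertext of 1 and masked with a random multiple of the ciphertext of 0, which is our reading of the request generation step. The toy SHE instance uses two fixed Mersenne primes purely to keep the example short and is not secure; all names are hypothetical.

```python
import secrets

# Toy SHE instance (assumption: fixed Mersenne primes, demo only).
P, Q = 2**521 - 1, 2**607 - 1
N = P * Q
K2 = 40
L = secrets.randbits(K2) | (1 << (K2 - 1))

def she_encrypt(m):
    r = secrets.randbelow(2**K2 - 1) + 1
    r_prime = secrets.randbelow(2**256) + 1
    return ((r * L + m) * (1 + r_prime * P)) % N

def she_decrypt(c):
    m = (c % P) % L
    return m if m < L // 2 else m - L

# SP side: ciphertexts handed to the client at initialization.
E0, E1 = she_encrypt(0), she_encrypt(1)

def client_encrypt(template, e0, e1, n):
    """Encrypt each entry without the secret key:
    entry * E(1) + random * E(0) mod N."""
    out = []
    for x in template:
        r = secrets.randbelow(2**K2 - 1) + 1
        out.append((x * e1 + r * e0) % n)
    return out

request = [12, 7, 255, 0]
enc_request = client_encrypt(request, E0, E1, N)
print([she_decrypt(c) for c in enc_request] == request)
```

The random multiple of the ciphertext of 0 does not change the plaintext, but it makes two encryptions of the same entry look different.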
After the identification request is encrypted, it is sent to CS1.
5.4. Biometric Identification
On receiving a biometric identification request from the client, two cloud servers work together to verify whether it exists in dataset . Firstly, two cloud servers collaboratively calculate the distance of the identification request corresponding to each reference point. Then, based on the trained FITing-tree and iDistance data structure, a candidate result set is generated. Eventually, two cloud servers traverse the candidate result set to get the identification result.
5.4.1. Stage 1: iDistance Index Calculation
After receiving the encrypted biometric request $E(\mathbf{q})$ from the client, CS1 first uses the homomorphic properties of SHE to calculate the ciphertext of the square of the Euclidean distance between $\mathbf{q}$ and each encrypted reference point $E(O_i)$, i.e., $E(d^2(\mathbf{q}, O_i))$, where $d^2(\mathbf{q}, O_i) = \sum_{j=1}^{n} (q_j - O_{i,j})^2$.
Then, CS1 sends these ciphertexts to CS2. CS2 decrypts them with the secret key and returns the corresponding plaintexts to CS1. After getting the plaintexts from CS2, CS1 computes their positive square roots and gets the distances $d(\mathbf{q}, O_i)$. After that, CS1 checks whether $d(\mathbf{q}, O_i) - \tau \le dist\_max_i$ holds, where $\tau$ is the identification threshold and $dist\_max_i$ is the maximum distance of the $i$th partition. If it does, the $i$th partition intersects the query range and is considered a candidate partition. Otherwise, it is ignored. Eventually, the iDistance indexes of the identification request with respect to each candidate partition’s reference point are computed. For example, if the $i$th partition is a candidate partition with reference point $O_i$, the iDistance index of the identification request with respect to $O_i$ is calculated as $I_i(\mathbf{q}) = c_i + d(\mathbf{q}, O_i)$, where $c_i$ is the offset of the $i$th partition.
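The interaction of this stage can be sketched as below. The sketch assumes that homomorphic subtraction is realized by multiplying with a ciphertext of $-1$ supplied by SP, which is our own assumption for the illustration, and it reuses a toy SHE instance built from two fixed Mersenne primes (insecure, demo only). The function names, the offsets, and the sample values are hypothetical.

```python
import math
import secrets

# Toy SHE instance (assumption: fixed Mersenne primes, demo only).
P, Q = 2**521 - 1, 2**607 - 1
N = P * Q
L = secrets.randbits(40) | (1 << 39)

def enc(m):
    r = secrets.randbelow(2**40 - 1) + 1
    r_prime = secrets.randbelow(2**256) + 1
    return ((r * L + m) * (1 + r_prime * P)) % N

def dec(c):
    m = (c % P) % L
    return m if m < L // 2 else m - L

E_MINUS_1 = enc(-1)   # assumed helper ciphertext for subtraction

def cs1_enc_sq_dist(enc_q, enc_ref):
    """CS1: ciphertext of sum_j (q_j - O_j)^2, no decryption needed."""
    total = 0
    for cq, co in zip(enc_q, enc_ref):
        diff = (cq + co * E_MINUS_1) % N
        total = (total + diff * diff) % N
    return total

def cs2_decrypt(ciphertexts):
    """CS2: holds the secret key and returns plaintext squared distances."""
    return [dec(c) for c in ciphertexts]

def cs1_candidates(sq_dists, dist_max, offsets, tau):
    """CS1: keep partitions that intersect the query ball and
    compute the request's iDistance index for each of them."""
    out = {}
    for i, sq in enumerate(sq_dists):
        d = math.sqrt(sq)
        if d - tau <= dist_max[i]:
            out[i] = offsets[i] + d
    return out

q = [3, 4, 0]
refs = [[0, 0, 0], [30, 40, 0]]
enc_q = [enc(x) for x in q]
enc_refs = [[enc(x) for x in ref] for ref in refs]
sq = cs2_decrypt([cs1_enc_sq_dist(enc_q, er) for er in enc_refs])
cands = cs1_candidates(sq, dist_max=[6.0, 5.0], offsets=[0.0, 100.0], tau=2.0)
print(sq, cands)
```

In this toy run only the first partition survives the pruning test, because the second reference point is much farther from the request than that partition's maximum distance plus the threshold.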
5.4.2. Stage 2: Candidate Result Set Generation
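When the iDistance indexes of $\mathbf{q}$ are obtained, a candidate result set is selected for each candidate partition. According to the range query algorithm of the iDistance, for a given data partition with reference point $O_i$, we need to search the data points whose iDistance index lies in $[I_i(\mathbf{q}) - \tau, I_i(\mathbf{q}) + \tau]$, where $I_i(\mathbf{q})$ is the iDistance index of $\mathbf{q}$ with respect to $O_i$ and $\tau$ is the identification threshold. We denote $I_i(\mathbf{q}) - \tau$ and $I_i(\mathbf{q}) + \tau$ as the lower search bound $lb$ and upper search bound $ub$, respectively. CS1 finds the biometric templates stored in the position range $[pos_{lb} - error, pos_{ub} + error]$ and adds them to the candidate result set $C_i$, where $pos_{lb}$ is the predicted position of $lb$ calculated by the FITing-tree, $pos_{ub}$ is the predicted position of $ub$, and $error$ is the error threshold of the FITing-tree.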
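The mapping from the two search bounds to a range of storage positions can be sketched as follows. The sketch assumes that segments are stored as (starting key, starting position, slope) triples sorted by starting key and that templates are stored in ascending order of their iDistance index; the names and the sample segment are hypothetical.

```python
import math
from bisect import bisect_right

def predict(segments, key):
    """Predicted storage position of key under the piece-wise linear model."""
    starts = [s[0] for s in segments]
    j = max(bisect_right(starts, key) - 1, 0)
    start_key, start_pos, slope = segments[j]
    return start_pos + slope * (key - start_key)

def candidate_positions(segments, error, q_index, tau, n):
    """Positions to verify for one candidate partition:
    [pred(lb) - error, pred(ub) + error], clipped to the dataset."""
    lb, ub = q_index - tau, q_index + tau
    lo = max(0, math.floor(predict(segments, lb) - error))
    hi = min(n - 1, math.ceil(predict(segments, ub) + error))
    return range(lo, hi + 1)

# Stored iDistance indexes, in ascending order (sample data).
keys = [0.0, 0.9, 2.2, 3.1, 3.9, 5.2]
segments = [(0.0, 0, 1.0)]        # one segment, error bound 1 holds
positions = candidate_positions(segments, error=1, q_index=3.0, tau=1.0, n=len(keys))
print(list(positions))
```

The position range is a superset of the true matches: widening both ends by the error threshold compensates for the bounded inaccuracy of the predicted positions, and the verification stage then removes the false positives.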
5.4.3. Stage 3: Verification
When the candidate result sets are determined, the two cloud servers work together to determine whether there is a biometric template satisfying the identification requirements. Specifically, for each candidate result set, CS1 first calculates the encrypted distances between the request $\mathbf{q}$ and all the templates in this candidate result set. For a template $\mathbf{b}$ in the set, the encrypted square of the distance from $\mathbf{q}$ to $\mathbf{b}$ is calculated homomorphically as $E(d^2(\mathbf{q}, \mathbf{b}))$, where $d^2(\mathbf{q}, \mathbf{b}) = \sum_{j=1}^{n} (q_j - b_j)^2$.
Then, the encrypted distances are sent to CS2 to get the plaintexts. On receiving the plaintexts, CS1 finds the template $\mathbf{b}^*$ that is closest to $\mathbf{q}$ and verifies whether it satisfies the identification requirements by checking whether $d(\mathbf{q}, \mathbf{b}^*) \le \tau$ holds, where $\tau$ is the identification threshold. If it does, CS1 adds the identifier of $\mathbf{b}^*$ to the result set. After all the candidate result sets are checked, CS1 returns the result set to the client.
5.4.4. Correctness
We now show the correctness of our proposed scheme. If the scheme were not correct, there would exist a template $\mathbf{b}$ that satisfies $d(\mathbf{q}, \mathbf{b}) \le \tau$ but is not searched by our scheme. To prove the correctness of our scheme, we therefore only need to prove that all the biometric templates satisfying $d(\mathbf{q}, \mathbf{b}) \le \tau$ are searched. All the biometric templates in the candidate result set are verified, and the biometric templates whose positions lie in $[pos_{lb} - error, pos_{ub} + error]$ are added to the candidate result set. According to the properties of the FITing-tree, if the predicted position of $\mathbf{b}$ lies in $[pos_{lb}, pos_{ub}]$, its actual position deviates from the prediction by at most $error$, so it will be added to the candidate result set. Since $pos_{lb}$ is the predicted position of the lower search bound $lb$ calculated by the FITing-tree and $pos_{ub}$ is the predicted position of the upper search bound $ub$, and the iDistance indexes are in ascending order in the FITing-tree, if $d(\mathbf{q}, \mathbf{b}) \le \tau$, the iDistance index of $\mathbf{b}$ will lie in $[lb, ub]$. Therefore, any template that satisfies $d(\mathbf{q}, \mathbf{b}) \le \tau$ will be searched by our proposed scheme.
6. Security Analysis
In this section, we analyze the security of our proposed privacy-preserving biometric identification scheme. Since our proposed scheme is designed based on the SHE scheme, which has been proved to be CPA-secure in previous work [17], we mainly focus on the privacy-preserving goal described in Section 3.3. Specifically, neither CS1 nor CS2 can obtain the plaintext of the biometric dataset or the biometric identification request. In the following, we prove the security of our scheme from the view of the two cloud servers.
Theorem 1. CS1 cannot obtain the plaintext of the biometric dataset and the biometric identification request during the biometric identification process.
Proof. We first give the view of CS1 during the biometric identification process and then analyze why CS1 cannot obtain the plaintext of the biometric dataset or the biometric identification request from it.
(i) View 1: in the Index Creation and Encryption phase, the encrypted biometric dataset and the encrypted searching indexes, including the encrypted reference points, the maximum distance list, and the FITing-tree segments, are sent to CS1. Since the biometric dataset and the reference points are encrypted by the SHE scheme and CS1 does not have the secret key, CS1 cannot get any information about the plaintext of the biometric data directly from these encrypted data. By analyzing the maximum distance list, CS1 can only learn that each data partition contains a biometric template whose distance to the partition's reference point equals the recorded maximum distance. Since the reference points and the biometric templates are both encrypted by the SHE scheme, CS1 cannot infer their plaintext from these distances. The FITing-tree segments are built on the iDistance indexes and have no direct correspondence to the biometric dataset or the identification request; thus, CS1 cannot obtain the plaintext from the segments either.
(ii) View 2: the encrypted biometric identification request. Since the biometric identification request is also encrypted by the SHE scheme and CS1 cannot decrypt it, the plaintext of the biometric identification request is not leaked to CS1.
(iii) View 3: intermediate values during the biometric identification process. In the identification process, the distance between the biometric identification request and each reference point and the distance between the biometric identification request and each candidate template are leaked to CS1. Since the biometric identification request, the reference points, and the biometric templates in the candidate result set are all encrypted by the SHE scheme, CS1 cannot get the plaintext of the biometric dataset or the biometric identification request from these intermediate values.
Therefore, CS1 cannot obtain the plaintext of the biometric dataset and the biometric identification request during the biometric identification process.
Theorem 2. CS2 cannot obtain the plaintext of the biometric dataset and the biometric identification request during the biometric identification process.
Proof. We first give the view of CS2 during the biometric identification process and then analyze why CS2 cannot learn the plaintext of the biometric dataset or the biometric identification request from it.
(i) View 1: the distance between the biometric identification request and each reference point. While the iDistance indexes of the biometric identification request are being calculated, CS2 gets the distance between the biometric identification request and each reference point. However, CS2 never receives the encrypted reference points or the encrypted biometric identification request; thus, CS2 cannot obtain the plaintext of the biometric identification request.
(ii) View 2: the distance between the biometric identification request and each candidate template. While the templates in the candidate result set are being verified, CS2 obtains the distance between the biometric identification request and each candidate template. Since the encrypted biometric dataset and the encrypted biometric identification request are kept secret from CS2, CS2 cannot get the plaintext of the biometric dataset or the biometric identification request.
Therefore, CS2 cannot learn the plaintext of the biometric dataset and the biometric identification request during the biometric identification process.
7. Performance Evaluation
In this section, we evaluate the performance of our proposed scheme in terms of computational costs and communication overhead. Specifically, we compare our proposed scheme with MASK [26], an M-tree based privacy-preserving biometric identification scheme. We choose MASK for two reasons: (i) MASK is designed under the same system model as our proposed scheme, and (ii) MASK is more efficient than other schemes designed for the biometric identification scenario in terms of computational and communication costs.
7.1. Evaluation Environment
To measure the overall performance, we implement both schemes in Java and conduct experiments on a Windows platform with an Intel Xeon 6226R CPU @ 2.9 GHz and 256 GB RAM. The SHE scheme is used to protect the privacy of the dataset and the identification requests in both schemes. The security parameters are set as , and . A real-world dataset and a synthetic dataset are used to test the performance of the two schemes. They are prepared as follows.
(i) Real-world dataset: we choose the Labeled Faces in the Wild (LFW) dataset [28], which contains 13,233 face images collected from 5,749 individuals. In this dataset, 1,680 of the people pictured have two or more distinct photos. We first use the FaceNet algorithm to extract face features from these face images. Each extracted face feature is a 512-dimensional vector, and all the face features lie on the same hypersphere, which means that each dimension of the vector is in the range (-1, 1).
(ii) Synthetic dataset: we randomly generate a synthetic dataset of face features. Each face feature is a 512-dimensional vector, and all face features lie in the same range as the face features extracted by FaceNet. The templates in the synthetic dataset are distributed in a hypercube in which each dimension lies in (-1, 1).
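The synthetic dataset described above could be produced along the following lines. This is only a minimal sketch: the paper states that the templates are generated randomly inside the hypercube (-1, 1)^512 but does not name the distribution, so the uniform distribution, the fixed seed, and the function names `synthetic_template` and `synthetic_dataset` are assumptions made for illustration.

```python
import random

DIM = 512  # template length reported for the FaceNet features


def synthetic_template(rng, dim=DIM):
    """Draw one template from the open hypercube (-1, 1)^dim.

    Assumption: a uniform distribution; the paper only says "randomly".
    """
    template = []
    while len(template) < dim:
        value = rng.uniform(-1.0, 1.0)
        if -1.0 < value < 1.0:  # reject the endpoints to stay in the open interval
            template.append(value)
    return template


def synthetic_dataset(size, seed=0, dim=DIM):
    """Generate `size` templates; the seed is fixed only for reproducibility."""
    rng = random.Random(seed)
    return [synthetic_template(rng, dim) for _ in range(size)]


dataset = synthetic_dataset(size=1000)
```

Unlike the FaceNet features, templates drawn this way fill the hypercube rather than lying on a hypersphere, which matches the distinction the text draws between the two datasets.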
7.2. Computational Costs
In this section, we evaluate the computational costs of our proposed scheme when generating the searching index, encrypting the identification request, and answering biometric identification requests, which correspond to the computational costs in the Index Creation and Encryption phase, the Encrypted Identification Request Generation phase, and the Biometric Identification phase, respectively. Since the data in both schemes are encrypted by the SHE scheme, we denote the computational costs of encrypting and decrypting data by the SHE scheme as and , the computational costs of adding and multiplying two SHE ciphertexts as and , and the computational cost of multiplying a plaintext and an SHE ciphertext as .
7.2.1. Index Creation and Encryption
As described in Section 5, the Index Creation and Encryption phase has two stages. The computational costs in this phase depend on the dataset size, the data dimension, and the number of partitions.
(i) Index building: in this stage, the biometric dataset is first divided into partitions using the K-means algorithm, and a reference point is selected for each partition. Figure 4 plots the computational costs of partitioning the dataset. Then, the distance between each biometric template and its partition's reference point is calculated, and a FITing-tree is built on the resulting iDistance indexes. The computational costs of creating the index are shown in Figure 5.
(ii) Index encryption: when the index building process is complete, the searching index is encrypted. While encrypting the searching index, reference points are encrypted. Therefore, the computational costs of encrypting the searching index are . In MASK, the index size is related to the node capacity of the M-tree [29]. Given the node capacity , there are at most nodes in the M-tree, where . Therefore, the computational costs of encrypting the searching index in MASK are less than . Since the computational costs of encrypting the dataset are the same in the two schemes, we mainly compare the computational costs of encrypting the searching indexes. We test these costs in both schemes over the synthetic dataset, and the results are shown in Figure 6.



We can see that our proposed scheme is more efficient in generating and encrypting the searching indexes.
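The plaintext part of the index building stage can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the partitions have already been produced (for example by K-means), takes each partition's centroid as its reference point, and uses the key mapping of the original iDistance design, key = i * c + dist(p, O_i), where the spacing constant `c` and all function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_idistance(partitions, c):
    """Return (reference_points, max_distances, sorted_keys).

    partitions: list of non-empty lists of vectors (e.g. K-means clusters).
    c: spacing constant, assumed larger than any within-partition distance,
       so that the key ranges of different partitions never overlap.
    """
    refs, max_dists, keys = [], [], []
    for i, points in enumerate(partitions):
        ref = centroid(points)  # assumption: the centroid is the reference point
        dists = [euclidean(p, ref) for p in points]
        if max(dists) >= c:
            raise ValueError("c must exceed every within-partition distance")
        refs.append(ref)
        max_dists.append(max(dists))
        # One-dimensional iDistance key plus the (partition, offset) it refers to.
        keys.extend((i * c + d, (i, j)) for j, d in enumerate(dists))
    keys.sort()
    return refs, max_dists, keys


partitions = [
    [[0.0, 0.0], [0.2, 0.0], [0.1, 0.3]],
    [[0.9, 0.9], [0.7, 0.8]],
]
refs, max_dists, keys = build_idistance(partitions, c=10.0)
```

The sorted one-dimensional keys are what a FITing-tree would then be built on, while `refs` would be encrypted and `max_dists` kept as the maximum distance list.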
7.2.2. Encrypted Identification Request Generation
As described in Section 5, the client generates the encrypted identification request by encrypting the biometric template. Then, the encrypted identification request is sent to the cloud servers to get the identification result. Two plaintext-ciphertext multiplications are needed to encrypt each dimension of the identification request. Since the length of the biometric template is , the computational costs of encrypting the identification template are in our proposed scheme. In MASK, the computational costs of generating the encrypted identification request are . In this phase, the computational costs are constant once the template length and the security parameters are fixed.
7.2.3. Biometric Identification
The biometric identification phase has three stages:
(i) iDistance index calculation: in this stage, the two cloud servers work together to find out which partition the identification request belongs to. First, CS1 calculates the encrypted squared Euclidean distance between the identification request and each partition's reference point over their ciphertexts. Then, the encrypted data are sent to CS2, which recovers the plaintext. The computational costs of CS1 in this stage are , and the computational costs of CS2 are .
(ii) Candidate result set generation: in this stage, CS1 first finds out which segments and lie in by searching the B+ tree built over the FITing-tree segments. Then, CS1 calculates the predicted positions of and and determines the candidate result set. Since the predicted positions are calculated over plaintext, the computational costs in this stage consist of the searching costs and the predicted position calculation costs.
(iii) Verification: in this stage, the two cloud servers collaboratively traverse the biometric templates in the candidate result set. The computational costs of CS1 in this stage are , and the computational costs of CS2 are , where is the size of the th partition's candidate result set.
In MASK, the computational costs in this phase consist of the costs of searching the M-tree and verifying the entries in the leaf nodes. The computational costs of CS1 and CS2 in the two schemes are shown in Figures 7(a) and 7(b), respectively, and the overall running time of the biometric identification process in our proposed scheme and MASK is shown in Figure 7(c). We can see that our proposed scheme takes more time than MASK to identify a biometric template when the biometric dataset is not very large. As the size of the biometric dataset grows, however, our scheme gains the advantage in the computational cost of the identification process. Since the searching process in each data partition is independent, the search in our proposed scheme is easy to parallelize. As shown in Figure 7(c), the efficiency of our proposed scheme improves greatly when it is executed concurrently.
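The candidate result set generation stage relies on a learned, piecewise-linear index over the plaintext iDistance keys. The sketch below illustrates the idea under stated assumptions: segments are built with a greedy shrinking-cone pass in the spirit of the original FITing-tree design, `bisect` over the segment start keys stands in for the B+ tree mentioned in the text, keys are assumed strictly increasing, and the error bound `err` and all function names are hypothetical rather than taken from the paper.

```python
import bisect
import math


def build_segments(keys, err):
    """Greedy shrinking-cone segmentation over strictly increasing keys.

    Each segment is (start_key, start_pos, slope); every indexed key has
    |predicted position - true position| <= err within its segment.
    """
    segments = []
    start = 0
    low, high = 0.0, math.inf
    for pos in range(1, len(keys) + 1):
        if pos < len(keys):
            dx = keys[pos] - keys[start]
            dy = pos - start
            if low <= dy / dx <= high:
                # The point fits: narrow the cone of admissible slopes.
                high = min(high, (dy + err) / dx)
                low = max(low, (dy - err) / dx)
                continue
        # Close the current segment with a slope taken from inside the cone.
        slope = low if math.isinf(high) else (low + high) / 2
        segments.append((keys[start], start, slope))
        start, low, high = pos, 0.0, math.inf
    return segments


def predict(segments, starts, n, key):
    """Predicted position of `key`, clipped to the end of its segment."""
    i = bisect.bisect_right(starts, key) - 1  # stand-in for the B+ tree lookup
    if i < 0:
        return 0.0
    start_key, start_pos, slope = segments[i]
    end_pos = segments[i + 1][1] if i + 1 < len(segments) else n
    return min(start_pos + slope * (key - start_key), end_pos)


def range_candidates(keys, segments, err, lo, hi):
    """Return the keys in [lo, hi], scanning only the predicted window."""
    starts = [s[0] for s in segments]
    n = len(keys)
    first = max(0, math.floor(predict(segments, starts, n, lo) - err))
    last = min(n, math.ceil(predict(segments, starts, n, hi) + err) + 1)
    return [k for k in keys[first:last] if lo <= k <= hi]


ERR = 4
keys = [i * i / 50.0 for i in range(200)]  # toy, non-linear key distribution
segments = build_segments(keys, ERR)
hits = range_candidates(keys, segments, ERR, 100.3, 250.7)
```

Only a few numbers per segment are stored, which is why the index stays small as the dataset grows, while the bounded prediction error keeps the window that must be scanned short.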

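Figure 7: Computational costs of the biometric identification phase: (a) CS1; (b) CS2; (c) overall running time.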
7.3. Communication Overhead
In this section, we evaluate the communication overhead of our proposed scheme when outsourcing the searching index and the encrypted biometric template dataset, submitting the encrypted identification request, and searching the biometric templates, which correspond to the communication overhead in the Index Creation and Encryption phase, the Encrypted Identification Request Generation phase, and the Biometric Identification phase, respectively. We first analyze the communication overhead in theory and then test it over the synthetic dataset. For simplicity, we denote the bit lengths of an integer and a floating-point number as and , respectively.
7.3.1. Index Creation and Encryption
In the Index Creation and Encryption phase, the encrypted reference points, the maximum distance list, the encrypted dataset, and the FITing-tree segments are outsourced to CS1. According to the SHE scheme, the size of the ciphertext is bits. Since there are reference points and biometric templates in the dataset , where both the reference points and the biometric templates are -dimensional vectors, the size of the encrypted reference points is and the size of the encrypted dataset is . As the data in the maximum distance list are stored as floating-point numbers, their size is . A FITing-tree segment consists of a start point and a slope, and the size of each FITing-tree segment is . Suppose that the FITing-tree contains segments; then the size of the FITing-tree segments is . According to the building process of the FITing-tree, there are at least data points in a segment, which means that . In MASK, there are at most nodes in the M-tree, where . Hence, the communication overhead of MASK in this stage is less than .
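The accounting in the paragraph above can be turned into a small calculator. This is a sketch of the counting argument only: every parameter name is hypothetical, the per-segment size is left as an input because it depends on how the start point and slope are stored, and the numbers in the example call are illustrative values rather than the paper's settings.

```python
def index_upload_bits(n, k, d, ciphertext_bits, float_bits, num_segments, segment_bits):
    """Bits sent to CS1 in the Index Creation and Encryption phase.

    n: number of templates, k: number of partitions (reference points),
    d: template dimension, ciphertext_bits: size of one SHE ciphertext,
    float_bits: size of one floating-point number,
    num_segments / segment_bits: FITing-tree segment count and per-segment size.
    """
    sizes = {
        "encrypted_reference_points": k * d * ciphertext_bits,
        "encrypted_dataset": n * d * ciphertext_bits,
        "max_distance_list": k * float_bits,
        "fiting_tree_segments": num_segments * segment_bits,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


# Illustrative values only (not taken from the paper).
sizes = index_upload_bits(
    n=10_000, k=10, d=512,
    ciphertext_bits=4096, float_bits=64,
    num_segments=50, segment_bits=128,
)
```

With values of this order, the encrypted dataset dominates the total, which is consistent with the text comparing only the searching indexes of the two schemes.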
7.3.2. Encrypted Identification Request Generation
In the Encrypted Identification Request Generation phase, the identification request is encrypted and sent to cloud servers. Since the identification request is an -dimensional vector, and the ciphertext of each dimension is bits, the size of the identification request is . The communication overhead of MASK in this phase is .
7.3.3. Biometric Identification
In the Biometric Identification phase, the two cloud servers work together to get the identification result. First, the two cloud servers select the candidate partitions. Then, they generate a candidate result set for each partition. Finally, they traverse all the candidate result sets to get the identification result. In this phase, encrypted data are sent from CS1 to CS2, and plaintext data are sent from CS2 to CS1.
We test the communication costs of the different phases in both schemes over the synthetic dataset, and the experimental results are shown in Figure 8. Specifically, Figure 8(a) shows the communication costs of sending the indexes in the two schemes. The experimental results demonstrate that the communication costs of our proposed scheme in this phase are much lower than those of MASK. Figure 8(b) shows the communication costs of sending the identification request in both schemes. The communication costs in this phase are almost the same, with our proposed scheme being slightly more efficient. Figure 8(c) shows the communication costs of identifying a template in the dataset. The results show that our proposed scheme sends more data than MASK when the dataset is not very large, but it becomes increasingly efficient relative to MASK as the dataset size grows.

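Figure 8: Communication costs of the two schemes: (a) sending the indexes; (b) sending the identification request; (c) identifying a template.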
7.4. Storage Cost
In our proposed scheme and MASK, the storage consumption of the cloud servers is mainly used to store the search indexes and the encrypted dataset. Since the size of the encrypted dataset is the same in these two schemes, we mainly compare the storage cost of storing the search indexes. In our proposed scheme, the search indexes consist of the encrypted reference points, the maximum distance list, and the FITing-tree segments. In MASK, the search indexes consist of the M-tree indexes. The storage cost of storing the searching indexes of these two schemes is shown in Table 2.
7.5. Accuracy
In our proposed scheme, the combination of the iDistance and FITing-tree structures achieves accurate range queries. Transforming the biometric templates into integers may reduce the accuracy of the identification scheme, but this influence is very slight when enough decimal places are kept during the transformation. We test the accuracy of our proposed scheme in terms of the false acceptance rate (FAR), false rejection rate (FRR), and equal error rate (EER) over the LFW dataset [28]. We first test the accuracy of the original FaceNet algorithm in terms of FAR, FRR, and EER with thresholds varying from 0 to 2, and the result is shown in Figure 9(a). The EER of the original FaceNet algorithm is 0.025. Then, we test the accuracy of the FaceNet algorithm with biometric templates that have been converted to integers. In this experiment, the biometric template values are converted to integer values with 3 decimal places kept. We test the FAR and FRR with thresholds varying from 0 to 2000, and the result is shown in Figure 9(b). The EER of the FaceNet algorithm with integer templates is 0.026. We can see that the accuracy is almost the same as that of the original FaceNet algorithm.
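The integer conversion and the error rates used in this evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the conversion is taken to be scaling by 10^3 followed by rounding (the paper does not say whether it rounds or truncates), a pair is accepted when its distance is at most the threshold, the EER is approximated on a finite threshold grid, and the distance lists are toy values, not LFW results.

```python
def quantize(template, decimals=3):
    """Scale a float template to integers, keeping `decimals` decimal places.

    Assumption: rounding to the nearest integer after scaling.
    """
    scale = 10 ** decimals
    return [round(v * scale) for v in template]


def far_frr(genuine, impostor, threshold):
    """FAR: share of impostor pairs accepted; FRR: share of genuine pairs rejected."""
    far = sum(d <= threshold for d in impostor) / len(impostor)
    frr = sum(d > threshold for d in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor, thresholds):
    """Return (threshold, rate) where |FAR - FRR| is smallest on the grid."""
    def gap(t):
        far, frr = far_frr(genuine, impostor, t)
        return abs(far - frr)

    best = min(thresholds, key=gap)
    far, frr = far_frr(genuine, impostor, best)
    return best, (far + frr) / 2


# Toy distances for same-person and different-person pairs (illustrative only).
genuine = [0.3, 0.5, 0.6, 0.9, 1.2]
impostor = [0.8, 1.1, 1.3, 1.4, 1.6]
thresholds = [i / 10 for i in range(21)]  # grid from 0 to 2, as in the text
best_threshold, eer = equal_error_rate(genuine, impostor, thresholds)
```

Keeping 3 decimal places multiplies Euclidean distances by 1000, which is consistent with the threshold range moving from 0 to 2 for float templates to 0 to 2000 for integer templates.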

Figure 9: FAR, FRR, and EER of the FaceNet algorithm: (a) original templates; (b) integer templates.
After that, we evaluate the accuracy of our proposed scheme in terms of FAR, FRR, and EER in the identification scenario. We test the FAR and FRR with thresholds varying from 0 to 2000, and the result is shown in Figure 10. The EER of the identification scheme is 0.249.

8. Conclusion
In this paper, we have proposed an efficient and privacy-preserving identification scheme for identifying an individual in huge biometric datasets. Specifically, we introduced the FITing-tree to generate an index for the biometric dataset, on which an efficient identification service can be built. We then used the SHE technique to ensure the privacy of the identification requests and the biometric dataset. The security of our proposed scheme has been analyzed, and the result shows that the privacy of both the biometric dataset and the biometric identification request is preserved. To evaluate the computational and communication costs of our proposed scheme, we implemented it and tested it over a synthetic dataset. Experimental results demonstrate that our proposed scheme is efficient in terms of computational and communication costs when identifying a biometric template in a large dataset.
Data Availability
The data used to support the findings of this study are available at http://vis-www.cs.umass.edu/lfw/index.html.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (61972304 and 61932015), Natural Science Foundation of Shaanxi Province (2019ZDLGY12-02), and Technical Research Program of the Ministry of Public Security (2019JSYJA01).