Abstract
Data clustering is the unsupervised classification of data records into groups. As one of the steps in data analysis, it has been widely researched and applied in practice, for example, in pattern recognition, image processing, information retrieval, geography, and marketing. In addition, the rapid increase of data volume in recent years makes it a huge challenge for resource-constrained data owners to perform computation on their data. This has led to a trend in which users authorize the cloud to perform computation on stored data, such as keyword search, equality test, and outsourced data clustering. In outsourced data clustering, the cloud classifies users’ data into groups according to their similarities. Considering the sensitive information in outsourced data and the multiple data owners in practical applications, it is necessary to develop a privacy-preserving outsourced clustering scheme under multiple keys. Recently, Rong et al. proposed a privacy-preserving outsourced k-means clustering scheme under multiple keys. However, in their scheme, the assistant server (AS) is able to extract the ratio of two underlying data records, and the key management server (KMS) can decrypt the ciphertexts of owners’ data records, both of which break data privacy. AS can even recover all data records if it knows one of them. To solve these problems, we propose a highly secure privacy-preserving outsourced k-means clustering scheme under multiple keys in cloud computing. In our scheme, a noncolluding cloud computing service (CCS) and KMS jointly perform clustering over the encrypted data records without exposing data privacy. Specifically, we double-encrypt data records with BCP encryption, which has the additive homomorphic property, and AES encryption, where the former prevents CCS from obtaining any useful information from received ciphertexts and the latter protects data records from being decrypted by KMS.
We first define five protocols to realize different functions and then present our scheme based on these protocols. Finally, we give the security and performance analyses which show that our scheme is comparable with the existing schemes on functionality and security.
1. Introduction
Data clustering [1, 2] classifies data records into groups according to their features, attributes, or similarities. This makes it significant in many fields related to data analysis, such as pattern recognition, image processing, information retrieval, geography, and marketing. Furthermore, with the explosive growth of data in the information era, it has become a challenge for our digital devices not only to store such massive data but also to perform computation on it. Cloud computing relieves this problem by providing a platform with high storage capacity and strong computing power. Users tend to outsource their data to the cloud and authorize the cloud server to compute on it. The cloud server can therefore perform computation on the outsourced data on behalf of users, such as keyword search [3], equality test [4], and outsourced data clustering [5]. It is worth noting that, in these applications, the cloud server sends the final result to the data owner. This raises the security issue of data integrity, which has been further studied in [6–11].
With outsourced data clustering, in which the cloud classifies data into different groups according to their similarities, it is possible to efficiently detect abnormalities, segment images, and predict diseases. As a widely applied clustering method, k-means clustering [1] classifies data into k clusters based on their distances from cluster centers. However, the sensitive information of data on the cloud platform cannot be protected by simply using k-means clustering. This calls for privacy-preserving outsourced k-means clustering, where data is classified without exposing its sensitive information.
The traditional privacy-preserving k-means clustering schemes [12–15] protect data privacy by adding noise at the cost of clustering accuracy. Subsequently, some symmetric and asymmetric constructions [16–18] have been proposed to improve accuracy at the cost of computation and communication overhead. Outsourced privacy-preserving clustering schemes in the literature fall into two categories, i.e., single-key and multikey clustering: in the former, all outsourced data of the owners are encrypted under the same key, while in the latter, they are encrypted under different keys. Taking practical applications into account, it is necessary to consider privacy-preserving clustering under multiple keys.
Recently, Rong et al. proposed an outsourced k-means clustering scheme [19] under multiple keys. Nevertheless, their scheme is not secure against a semihonest assistant server (AS) or a semihonest key management server (KMS): AS can extract the ratio of messages, and KMS can even extract all data records of users with its master secret key. In addition, as long as AS obtains one of the data records, it can recover all data records. Such privacy leakage may cause a huge economic loss to the user in practice. To solve this problem, we present a highly secure outsourced k-means clustering scheme under multiple keys in cloud computing.
1.1. Our Contribution
In this paper, we propose a highly secure privacy-preserving outsourced k-means clustering scheme under multiple keys in cloud computing.
We first introduce our system model and threat models. Specifically, the system model includes four entities, i.e., data owners (DOs), query client (QC), cloud computing service (CCS), and key management server (KMS), and the threat models describe semihonest CCS and KMS. Subsequently, based on [19] and BCP homomorphic encryption, we construct five protocols to realize different functions. It is worth noting that the secure multiplication (SM) protocol is defined to achieve homomorphic multiplication using BCP encryption, which only has the additive homomorphic property. We then present a highly secure outsourced k-means clustering scheme under multiple keys in cloud computing, which achieves privacy against semihonest CCS and KMS. In particular, we use BCP encryption to prevent privacy leakage to CCS, such that a semihonest CCS cannot extract any useful information from the ciphertexts of data records. We then utilize AES encryption to protect privacy against a semihonest KMS. KMS, therefore, cannot extract any data records of data owners, even though it possesses the master secret key, which can be used to decrypt ciphertexts encrypted using BCP encryption.
1.2. Related Work
1.2.1. Privacy-Preserving k-Means Clustering
Zhang et al. [20] proposed a high-order possibilistic c-means algorithm for big data in cloud computing based on the BGV cryptosystem [21]. However, their scheme is not practical because of its low efficiency. Subsequently, Almutairi et al. [22] improved it and developed a privacy-preserving k-means clustering scheme based on homomorphic encryption but failed to protect the plaintext information in the update of cluster centers. To address this, Yuan and Tian [23] put forward a privacy-preserving clustering scheme using a novel lightweight cryptosystem based on the hardness of learning with errors (LWE) [24]. Their scheme can sum ciphertexts and compare distances using ciphertexts of multidimensional data. Nevertheless, this scheme is not fully outsourced.
1.2.2. Outsourced Single-Key Clustering
Lin [25] constructed a privacy-preserving kernel k-means clustering scheme based on linear transformation and a kernel matrix with random perturbation, but this scheme cannot realize ciphertext comparison. Based on the Paillier cryptosystem, Rao et al. [26] proposed a privacy-preserving outsourced distributed clustering protocol in the union cloud environment, which includes a new protocol to compute the Euclidean distance and evaluate the termination condition over the encrypted data. The problems of this scheme are its heavy computing load and its lack of support for datasets encrypted under multiple keys. Liu et al. [27] constructed a secure KNN multilabel data classification scheme based on the Paillier cryptosystem.
1.2.3. Multikey Clustering
Gheid and Challal [28] presented a novel privacy-preserving k-means clustering scheme with the multiparty of Clifton security [29]. Peter et al. [30] further proposed a scheme to outsource multiparty computation to the cloud under multiple keys, but it does not support ciphertext comparison. Li et al. [31] applied the BCP homomorphic encryption [32] to multiparty horizontally partitioned databases and then set up ciphertext comparison for the outsourced privacy-preserving random decision tree algorithm. Rong et al. [19] improved it by presenting an efficient privacy-preserving protocol for outsourced k-means clustering under multiple keys based on the double decryption cryptosystem [33].
1.3. Organization
The rest of this paper is organized as follows. In Section 2, we recall the definitions of k-means clustering, BCP encryption, and AES encryption. The system model and threat models are presented in Section 3. In Section 4, five basic protocols are constructed, and we present our scheme, in which the defined protocols are invoked throughout. The security proof and performance analysis are given in Section 5. Finally, we conclude this paper in Section 6.
2. Preliminaries
2.1. Notations
We summarize the notations used in this paper in Table 1.
2.2. k-Means Clustering
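k-means clustering is an iterative algorithm that allocates l data records into k disjoint clusters, each of which has a center. Let the l m-dimensional data records be $t_1, \ldots, t_l$ and the k clusters be $c_1, \ldots, c_k$, where $\mu_1, \ldots, \mu_k$ are the centers of the k clusters, respectively. The data record $t_i$ is categorized into the cluster $c_j$ if $\mu_j$ has the minimum Euclidean distance to $t_i$ among all cluster centers. In particular, the Euclidean distance between an m-dimensional data record $t_i = (t_{i,1}, \ldots, t_{i,m})$ and a cluster center $\mu_j = (\mu_{j,1}, \ldots, \mu_{j,m})$ can be expressed as
$$\lVert t_i - \mu_j \rVert = \sqrt{\sum_{h=1}^{m} \left(t_{i,h} - \mu_{j,h}\right)^2}.$$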
The detailed process of k-means clustering is depicted as Algorithm 1. The algorithm takes as input l m-dimensional data records, a predefined number of clusters k, and a predefined maximum number of iterations I. First, k cluster centers are picked, and the Euclidean distance between each data record and each center is computed. Each data record is assigned to the cluster whose center has the minimum Euclidean distance to it. After each iteration, the center of every cluster is reassigned as the average of all data records in that cluster. If the maximum number of iterations is reached or the output clusters no longer change, the algorithm terminates and outputs the k clusters.
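The iteration described above can be sketched in plain Python on unencrypted data. This is a minimal illustration rather than a transcription of Algorithm 1: the function names `sq_dist` and `kmeans` are hypothetical, and the initial centers are simply the first k records, which is an assumption made for reproducibility (the text only states that k centers are picked first).

```python
def sq_dist(record, center):
    """Squared Euclidean distance; minimizing it also minimizes the distance."""
    return sum((x - c) ** 2 for x, c in zip(record, center))


def kmeans(records, k, max_iter):
    # Assumption: take the first k records as the initial cluster centers.
    centers = [list(map(float, r)) for r in records[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every record to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(t, centers[j])) for t in records
        ]
        if new_assignment == assignment:
            break  # clusters did not change: stop early
        assignment = new_assignment
        # Reassign each center to the mean of the records in its cluster.
        for j in range(k):
            members = [t for t, a in zip(records, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


records = [(0, 0), (10, 10), (0, 1), (10, 11)]
assignment, centers = kmeans(records, k=2, max_iter=10)
print(assignment, centers)
```

Comparing squared distances avoids the square root without changing which center is nearest, which is also why the encrypted protocols later in the paper work with squared distances.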
2.3. BCP Encryption
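In this paper, we utilize the BCP encryption proposed by Bresson et al. [32], which has the additive homomorphic property and provides a double decryption mechanism. The BCP encryption consists of five algorithms as follows:
(i) Setup. Taking as input a security parameter λ, the setup algorithm picks two primes $p, q$ of the form $p = 2p' + 1$ and $q = 2q' + 1$ and computes $N = pq$, where $p', q'$ are also primes. Consider $\mathbb{G}$, the cyclic group of quadratic residues modulo $N^2$, and we have $\mathrm{ord}(\mathbb{G}) = pp'qq' = Np'q'$. It chooses $g \in \mathbb{G}$, the order of which is $pp'qq'$, and we have $g^{p'q'} \bmod N^2 = 1 + kN$, $k \in [1, N-1]$. The public parameter and the master secret key are denoted as
$$PP = (N, k, g), \qquad MK = (p', q').$$
(ii) KeyGen. Taking as input the public parameter $PP$, the key generation algorithm randomly chooses $a \in [1, \mathrm{ord}(\mathbb{G})]$ and computes $h = g^{a} \bmod N^2$. Note that h is of maximal order with high probability. It sets the output public and secret key pair as
$$pk = h, \qquad sk = a.$$
(iii) Enc. Taking as input a public key $pk$ and a message $M \in \mathbb{Z}_N$, the encryption algorithm randomly chooses $r \in \mathbb{Z}_{N^2}$ and generates the ciphertext $(A, B)$ as
$$A = g^{r} \bmod N^2, \qquad B = h^{r}(1 + MN) \bmod N^2.$$
Specifically, we denote $\mathrm{Enc}_{pk}(M)$ as the encryption of message M under the public key $pk$.
(iv) Dec. Taking as input a secret key $sk = a$ and a ciphertext $(A, B)$, the decryption algorithm outputs the message as
$$M = \frac{\left(B / A^{a} - 1\right) \bmod N^2}{N}.$$
Specifically, we denote $\mathrm{Dec}_{sk}(A, B)$ as the decryption of ciphertext $(A, B)$ under the secret key $sk$.
(v) sDec. Taking as input the master secret key $MK$ and a ciphertext $(A, B)$ encrypted under $pk = h$, the system decryption algorithm computes
$$a \bmod N = \frac{\left(h^{p'q'} - 1\right) \bmod N^2}{N} \cdot k^{-1} \bmod N, \qquad r \bmod N = \frac{\left(A^{p'q'} - 1\right) \bmod N^2}{N} \cdot k^{-1} \bmod N.$$
Let $\gamma = ar \bmod N$; thus, $\gamma$ is efficiently computable. Let π be the inverse of $p'q'$ modulo $N$. It generates the message as
$$M = \frac{\left((B / g^{\gamma})^{p'q'} - 1\right) \bmod N^2}{N} \cdot \pi \bmod N.$$
Specifically, we denote $\mathrm{sDec}_{MK}(A, B)$ as the decryption of ciphertext $(A, B)$ under the master secret key $MK$.
BCP encryption has the additive homomorphic property, which means that, with ciphertexts multiplied componentwise,
$$\mathrm{Enc}_{pk}(M_1) \cdot \mathrm{Enc}_{pk}(M_2) = \mathrm{Enc}_{pk}(M_1 + M_2).$$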
This property will be utilized in the whole system.
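To make the double decryption mechanism and the additive property concrete, the following toy sketch implements the BCP algorithms as they are usually stated in the literature, with $p'q'$ as the master exponent. It is an illustration under stated assumptions, not the authors' implementation: the primes are tiny (p = 1019, q = 1187), the generator is found by a simple search over squares, and the names `setup`, `keygen`, `enc`, `dec`, `sdec`, and `add` are hypothetical.

```python
import math
import secrets

# Toy safe primes (assumption: far too small for real use).
P_PRIME, Q_PRIME = 509, 593                  # p', q'
P, Q = 2 * P_PRIME + 1, 2 * Q_PRIME + 1      # p = 1019, q = 1187
N = P * Q
N2 = N * N
E = P_PRIME * Q_PRIME                        # p'q', derived from the master secret


def L(u):
    """Map u = 1 + xN (mod N^2) to x."""
    return (u - 1) // N


def setup():
    # Search squares x^2 mod N^2 until g^(p'q') = 1 + kN with k invertible mod N.
    x = 2
    while True:
        g = pow(x, 2, N2)
        k = L(pow(g, E, N2))
        if math.gcd(x, N) == 1 and math.gcd(k, N) == 1:
            return g, k
        x += 1


G, K = setup()


def keygen():
    a = secrets.randbelow(N * E) + 1         # a in [1, ord(G)], ord(G) = N p'q'
    return pow(G, a, N2), a                  # (pk, sk) = (h, a)


def enc(h, m):
    r = secrets.randbelow(N2 - 1) + 1
    return pow(G, r, N2), pow(h, r, N2) * (1 + m * N) % N2


def dec(a, ct):
    A, B = ct
    return L(B * pow(A, -a, N2) % N2)        # B / A^a = 1 + mN (mod N^2)


def sdec(h, ct):
    """Decrypt with the master secret instead of the user's secret key."""
    A, B = ct
    k_inv = pow(K, -1, N)
    a_mod = L(pow(h, E, N2)) * k_inv % N     # a mod N
    r_mod = L(pow(A, E, N2)) * k_inv % N     # r mod N
    gamma = a_mod * r_mod % N                # ar mod N
    d = pow(B * pow(G, -gamma, N2) % N2, E, N2)
    return L(d) * pow(E, -1, N) % N


def add(ct1, ct2):
    """Homomorphic addition: multiply ciphertexts componentwise."""
    return ct1[0] * ct2[0] % N2, ct1[1] * ct2[1] % N2


pk, sk = keygen()
c = add(enc(pk, 5), enc(pk, 7))
print(dec(sk, c), sdec(pk, c))               # both recover the sum 5 + 7
```

The modular inverses rely on Python's three-argument `pow` with a negative exponent, which requires Python 3.8 or later.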
2.4. AES Encryption
AES encryption is an efficient symmetric encryption system widely used in practice, where symmetric means that encryption and decryption require the same key. We give a simplified definition of AES as follows:
(i) AKeyGen. The sender and receiver agree on the secret key $K$ of the AES encryption system.
(ii) AEnc. The sender generates the ciphertext $C$ of message M under the secret key $K$ following the AES encryption algorithm. We denote it as $C = \mathrm{AEnc}_{K}(M)$.
(iii) ADec. The receiver decrypts the ciphertext $C$ with the secret key $K$. We denote it as $M = \mathrm{ADec}_{K}(C)$.
3. Models
3.1. System Model
As shown in Figure 1, our scheme considers four types of entities, i.e., data owner (DO), cloud computing service (CCS), key management server (KMS), and query client (QC).
(i) DO: DO has limited computing power and therefore outsources its encrypted data to the cloud. Our system involves n DOs, denoted as $DO_1, \ldots, DO_n$. For $1 \le i \le n$, each $DO_i$ holds its own set of data records, and each data record has m attributes. Data owners are assumed not to collude with the cloud servers.
(ii) QC: QC is authorized to query and receive the clustering results and is not involved in any clustering calculation.
(iii) CCS: CCS stores the datasets of multiple DOs, takes part in the clustering process, and sends the clustering results to the QC.
(iv) KMS: KMS generates system parameters and performs ciphertext transformation with the master secret key. It also participates in the clustering process.

3.2. Threat Models
In our system, we suppose that CCS and KMS are semihonest. This means that they honestly perform what the protocol requires but are curious about the messages underlying the ciphertexts they receive. Under this assumption, we define three threat models as follows, where an adversary acting in different roles in different models attempts to decrypt the ciphertexts sent from DOs and CCS.
(i) Acting as a “malicious” CCS, the adversary tries to obtain the messages under the ciphertexts sent from DOs and KMS.
(ii) Acting as a “malicious” KMS, the adversary tries to obtain the real messages under the ciphertexts sent from CCS.
(iii) Acting as a “malicious” KMS, the adversary tries to obtain the messages under the ciphertexts that are sent from DOs to CCS.
It is worth noting that CCS and KMS are assumed not to collude with each other.
4. Our Construction
Based on the scheme proposed by Rong et al. in [19], we construct a more secure clustering scheme. In our construction, we utilize BCP homomorphic encryption to protect the privacy of data owners such that adversaries cannot extract any useful information about the underlying data records, whereas AS can easily extract such information in [19]. Furthermore, AES encryption is used to double-encrypt the data records to prevent KMS from directly extracting data records from the ciphertexts sent from DO to CCS.
4.1. Protocols
We first define five underlying protocols to satisfy different requirements in the clustering process. To securely transfer the data records of DO to CCS, we define the secure ciphertext transformation (SCT) protocol. Since the BCP encryption used in our scheme only has the additive homomorphic property, we build a secure multiplication (SM) protocol to realize homomorphic multiplication. Finally, aiming to group similar data records using only their ciphertexts, we construct three protocols, namely, the secure distance measurement (SDM) protocol, the secure distance comparison (SDC) protocol, and the secure minimum distance measurement (SMDM) protocol. These protocols are invoked throughout our scheme.
4.1.1. Secure Ciphertext Transformation Protocol
Secure ciphertext transformation (SCT) protocol aims to transform the ciphertext $\mathrm{Enc}_{pk_1}(M)$ of a message M encrypted under public key $pk_1$ into a ciphertext $\mathrm{Enc}_{pk_2}(M)$ of M encrypted under public key $pk_2$ without revealing M. Suppose there are two entities in SCT protocol, i.e., Alice and Bob. Alice interacts with Bob following SCT protocol to convert $\mathrm{Enc}_{pk_1}(M)$ to $\mathrm{Enc}_{pk_2}(M)$. To prevent Bob from extracting the message M, a random number r is used to blind the message from Bob. The detailed process is listed in Algorithm 2.
Taking as input the public keys $pk_1, pk_2$ and the ciphertext $\mathrm{Enc}_{pk_1}(M)$, Alice randomly chooses $r \in \mathbb{Z}_N$ and encrypts r using $pk_1$ to $\mathrm{Enc}_{pk_1}(r)$. She then computes the encryption of $M + r$ under $pk_1$, which can be realized by $\mathrm{Enc}_{pk_1}(M) \cdot \mathrm{Enc}_{pk_1}(r)$ because of the additive homomorphic property of BCP encryption. Alice then sends the output $\mathrm{Enc}_{pk_1}(M + r)$ to Bob. Taking as input the public key $pk_2$, his master secret key, and the received $\mathrm{Enc}_{pk_1}(M + r)$, Bob decrypts this ciphertext using his master secret key following the system decryption algorithm sDec and obtains $M + r$. He then encrypts $M + r$ with $pk_2$ and sends the output $\mathrm{Enc}_{pk_2}(M + r)$ to Alice. Alice eliminates r in the ciphertext by computing $\mathrm{Enc}_{pk_2}(M + r) \cdot \mathrm{Enc}_{pk_2}(N - r)$ and obtains $\mathrm{Enc}_{pk_2}(M)$ as the final output.
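The message flow of the SCT protocol can be sketched end to end on a toy BCP instance. Everything below is an illustrative assumption rather than the paper's Algorithm 2: the tiny primes, the helper names (`sct_alice_blind`, `sct_bob_reencrypt`, `sct_alice_unblind`), and the choice to remove the blinding by adding an encryption of $N - r$.

```python
import math
import secrets

# Toy BCP instance (assumption: tiny safe primes p = 1019, q = 1187).
P1, Q1 = 509, 593                        # p', q'
N = (2 * P1 + 1) * (2 * Q1 + 1)
N2, E = N * N, P1 * Q1                   # E = p'q' comes from the master secret


def L(u):
    return (u - 1) // N                  # u = 1 + xN (mod N^2)  ->  x


def find_generator():
    x = 2
    while True:                          # g = x^2 with g^E = 1 + kN, k invertible
        g = x * x % N2
        k = L(pow(g, E, N2))
        if math.gcd(x, N) == 1 and math.gcd(k, N) == 1:
            return g, k
        x += 1


G, K = find_generator()


def keygen():
    a = secrets.randbelow(N * E) + 1
    return pow(G, a, N2), a              # (pk, sk)


def enc(pk, m):
    r = secrets.randbelow(N2 - 1) + 1
    return pow(G, r, N2), pow(pk, r, N2) * (1 + m * N) % N2


def dec(sk, ct):
    return L(ct[1] * pow(ct[0], -sk, N2) % N2)


def sdec(pk, ct):                        # master decryption (Bob only)
    k_inv = pow(K, -1, N)
    gamma = (L(pow(pk, E, N2)) * k_inv) * (L(pow(ct[0], E, N2)) * k_inv) % N
    d = pow(ct[1] * pow(G, -gamma, N2) % N2, E, N2)
    return L(d) * pow(E, -1, N) % N


def add(c1, c2):                         # Enc(m1) * Enc(m2) = Enc(m1 + m2)
    return c1[0] * c2[0] % N2, c1[1] * c2[1] % N2


def sct_alice_blind(pk1, ct):
    r = secrets.randbelow(N)
    return r, add(ct, enc(pk1, r))       # Enc_pk1(M + r)


def sct_bob_reencrypt(pk1, pk2, blinded):
    return enc(pk2, sdec(pk1, blinded))  # Bob sees only M + r mod N


def sct_alice_unblind(pk2, r, ct):
    return add(ct, enc(pk2, (N - r) % N))  # Enc_pk2(M + r - r)


pk1, sk1 = keygen()
pk2, sk2 = keygen()
r, blinded = sct_alice_blind(pk1, enc(pk1, 42))
out = sct_alice_unblind(pk2, r, sct_bob_reencrypt(pk1, pk2, blinded))
print(dec(sk2, out))                     # the message, now under pk2
```

Because r is uniform in $\mathbb{Z}_N$, the value $M + r \bmod N$ that Bob decrypts is itself uniformly distributed and reveals nothing about M.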
4.1.2. Secure Multiplication Protocol
Secure multiplication (SM) protocol is used to obtain the ciphertext of the product of two messages from the ciphertexts of the individual messages under the BCP homomorphic cryptosystem. The messages must not be exposed in this process. As in SCT protocol, we assume two entities in SM protocol, i.e., Alice and Bob. Alice attempts to obtain $\mathrm{Enc}_{pk}(M_1 M_2)$ from $\mathrm{Enc}_{pk}(M_1)$ and $\mathrm{Enc}_{pk}(M_2)$ without revealing $M_1$ and $M_2$ to Bob, who is the owner of the corresponding secret key $sk$. We define SM protocol in Algorithm 3.
Taking as the input the ciphertext and , Alice randomly chooses numbers and computes the ciphertext of by computing and respectively. This utilizes the additive homomorphic property of BCP encryption. It then sends the output to Bob. Taking as the input the corresponding secret key of , Bob decrypts the received ciphertexts with and obtains . It computes the multiplication of as and encrypts with as which is used to divide in the underlying message. Bob sends to Alice. Finally, Alice computes with using the additive homomorphic property of BCP encryption.
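One standard way to realize secure multiplication on top of an additively homomorphic scheme is additive blinding, sketched below on a toy BCP instance. This is an assumption for illustration and may differ from the blinding used in Algorithm 3. It relies on the identity $(M_1 + r_1)(M_2 + r_2) - r_2 M_1 - r_1 M_2 - r_1 r_2 = M_1 M_2$ and on the fact that raising a ciphertext to a power c multiplies the plaintext by c. The helper names and the fixed generator g = 4 are hypothetical choices.

```python
import secrets

# Toy BCP instance (assumption: tiny safe primes p = 1019, q = 1187).
N = 1019 * 1187
N2 = N * N
G = 4                                    # a square mod N^2, enough for Enc/Dec here


def keygen():
    a = secrets.randbelow(N2 - 1) + 1
    return pow(G, a, N2), a              # (pk, sk)


def enc(pk, m):
    r = secrets.randbelow(N2 - 1) + 1
    return pow(G, r, N2), pow(pk, r, N2) * (1 + m * N) % N2


def dec(sk, ct):
    u = ct[1] * pow(ct[0], -sk, N2) % N2  # equals 1 + mN (mod N^2)
    return (u - 1) // N


def add(c1, c2):                         # Enc(m1) * Enc(m2) = Enc(m1 + m2)
    return c1[0] * c2[0] % N2, c1[1] * c2[1] % N2


def scalar(ct, c):                       # Enc(m)^c = Enc(c * m)
    return pow(ct[0], c, N2), pow(ct[1], c, N2)


def sm_alice_blind(pk, c1, c2):
    r1, r2 = secrets.randbelow(N), secrets.randbelow(N)
    return (r1, r2), add(c1, enc(pk, r1)), add(c2, enc(pk, r2))


def sm_bob_multiply(pk, sk, b1, b2):
    # Bob decrypts only blinded values and returns Enc((M1 + r1)(M2 + r2)).
    return enc(pk, dec(sk, b1) * dec(sk, b2) % N)


def sm_alice_unblind(pk, c1, c2, rs, prod):
    r1, r2 = rs
    out = add(prod, scalar(c1, N - r2))              # subtract r2 * M1
    out = add(out, scalar(c2, N - r1))               # subtract r1 * M2
    return add(out, enc(pk, (-r1 * r2) % N))         # subtract r1 * r2


pk, sk = keygen()
c1, c2 = enc(pk, 6), enc(pk, 7)
rs, b1, b2 = sm_alice_blind(pk, c1, c2)
c_prod = sm_alice_unblind(pk, c1, c2, rs, sm_bob_multiply(pk, sk, b1, b2))
print(dec(sk, c_prod))                   # decrypts to the product 6 * 7
```

In this sketch Alice never sees a plaintext and Bob only sees values masked by uniformly random $r_1, r_2$; all arithmetic is modulo N, so the product must stay below N to be recovered exactly.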
4.1.3. Secure Distance Measurement Protocol
We define the secure distance measurement (SDM) protocol to measure the distance between data records and cluster centers using Euclidean distance. Assume there are n data records and k clusters. Let be the sum of data records in cluster and be the number of data records in cluster , respectively. Given a data record and a cluster center , is denoted as the scaled squared distance between and satisfying . Therefore, is denoted and computed as in the following equation:
The process is depicted as Algorithm 4.
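A plaintext illustration may help show why a scaled squared distance is convenient when only additions and multiplications are available. The sketch below assumes that the center of a cluster is the sum s of its records divided by their number n and that the scaling factor is n, so that $\lVert n\,t - s \rVert^2 = n^2 \lVert t - s/n \rVert^2$; the exact scaling used in Algorithm 4 may differ, and the function names are hypothetical.

```python
from fractions import Fraction


def sq_dist_to_center(t, s, n):
    """Exact squared distance between record t and the center s / n."""
    return sum((Fraction(x) - Fraction(sx, n)) ** 2 for x, sx in zip(t, s))


def scaled_sq_dist(t, s, n):
    """Division-free variant: ||n*t - s||^2 = n^2 * ||t - s/n||^2."""
    return sum((n * x - sx) ** 2 for x, sx in zip(t, s))


def closer_to_first(t, s_a, n_a, s_b, n_b):
    """Compare two distances with integers only, by cross-multiplication."""
    return n_b ** 2 * scaled_sq_dist(t, s_a, n_a) < n_a ** 2 * scaled_sq_dist(t, s_b, n_b)


t = (2, 3)
s_a, n_a = (3, 9), 3      # cluster a: center (1, 3)
s_b, n_b = (10, 4), 2     # cluster b: center (5, 2)
print(scaled_sq_dist(t, s_a, n_a), scaled_sq_dist(t, s_b, n_b))
print(closer_to_first(t, s_a, n_a, s_b, n_b))
```

Only integer additions and multiplications appear in `scaled_sq_dist` and `closer_to_first`, which is the kind of arithmetic that the additive homomorphic property together with the SM protocol can evaluate over ciphertexts.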
4.1.4. Secure Distance Comparison Protocol
Secure distance comparison (SDC) protocol determines the shorter of two distances output by the SDM protocol. Taking two such distances as input, Alice interacts with Bob to obtain the shorter one. As in [19], the difference between the two distances can be expressed as
Since we only need to know whether or not, it is equal to judge whether or not. This means, the comparison can be related to
Let β be the maximum size of messages. We have , which means if and . Let η be the threshold for sign judgement chosen from . To prevent Bob from obtaining distance-related information, Alice blinds the message with a random with and satisfying
We illustrate the detailed realization in Algorithm 5.
In the process, Bob cannot obtain .
4.1.5. Secure Minimum Distance Measurement Protocol
Finally, we define the secure minimum distance measurement (SMDM) protocol as Algorithm 6 to choose the shortest one among given distances.
4.2. Our Scheme
At the beginning, the four entities in the system, i.e., data owners DOs, query client QC, cloud computing service CCS, and key management server KMS, set up the system by running the algorithms Setup, KeyGen, and AKeyGen. DOs then run Enc and AEnc on their data records and upload the results to CCS separately. CCS removes the AES layer of the received ciphertexts using ADec. After receiving the clustering request from QC, CCS interacts with KMS to transform the ciphertexts encrypted under different public keys into ciphertexts encrypted under the same public key. Subsequently, CCS performs the clustering computation. Finally, CCS interacts with KMS to transfer the clustering result to QC. It is worth noting that the defined protocols are invoked throughout the process.
4.2.1. System Setup
Following the setting in the system model (see Section 3.1), we have n data owners $DO_1, \ldots, DO_n$, a cloud computing service (CCS), a key management server (KMS), and a query client (QC). Before running the protocols, the entities generate their keys as follows:
(1) Taking as input a security parameter λ, KMS runs the setup algorithm Setup of the BCP homomorphic cryptosystem and generates the public parameter and the master secret key, where the master secret key is kept secret.
(2) Each data owner $DO_i$ runs KeyGen to generate its own public/secret key pair, for $1 \le i \le n$.
(3) Each $DO_i$ agrees with CCS on a symmetric key through the Diffie–Hellman key exchange protocol or other methods, for $1 \le i \le n$.
(4) CCS runs the key generation algorithm KeyGen to generate its public/secret key pair.
(5) QC runs KeyGen to generate its own public/secret key pair.
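Step (3) can be illustrated with a textbook Diffie–Hellman exchange followed by a hash to obtain a 256-bit symmetric key. This is only a sketch under loud assumptions: the group (the Mersenne prime $2^{127} - 1$ with base 3) is a toy choice that is not secure, the SHA-256 key derivation is an illustrative choice, and the function names are hypothetical. A deployment would use a standardized group or an elliptic-curve exchange together with an authenticated channel.

```python
import hashlib
import secrets

# Toy group (assumption): the Mersenne prime 2^127 - 1 with base 3.
P = 2 ** 127 - 1
G = 3


def dh_keypair():
    x = secrets.randbelow(P - 2) + 1          # private exponent in [1, P - 2]
    return x, pow(G, x, P)                    # (private, public)


def derive_aes_key(own_private, peer_public):
    shared = pow(peer_public, own_private, P)  # g^(xy) mod P on both sides
    # Hash the shared group element down to a 32-byte AES-256 key.
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


do_priv, do_pub = dh_keypair()       # data owner side
ccs_priv, ccs_pub = dh_keypair()     # CCS side
k_do = derive_aes_key(do_priv, ccs_pub)
k_ccs = derive_aes_key(ccs_priv, do_pub)
print(k_do == k_ccs, len(k_do))
```

Both parties derive the same key without ever sending it, so KMS, which is not part of this exchange, does not learn the AES key that protects the uploaded ciphertexts.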
4.2.2. Data Uploading
Following the setting in Section 3.1, assume that each data owner $DO_i$ has a dataset containing its data records, each of which has m attributes. $DO_i$ encrypts its dataset with the BCP cryptosystem first and then with AES encryption, for $1 \le i \le n$, and finally sends the output to CCS.
(1) $DO_i$ runs the encryption algorithm Enc under its own public key on each record and obtains the BCP ciphertext of the record.
(2) To prevent privacy disclosure to KMS, data owners double-encrypt the output ciphertexts with AES encryption. Each $DO_i$ runs AEnc on its BCP ciphertexts under the symmetric key it shares with CCS and sends the results to CCS.
(3) After receiving the double-encrypted ciphertexts from $DO_i$, CCS runs the decryption algorithm ADec with the agreed symmetric key on each ciphertext to obtain the BCP ciphertext of each record.
In our setting for data uploading, each data owner sends their double-encrypted ciphertext to CCS such that the KMS cannot obtain the original message of the data owner although the KMS has the master secret key which can be used to decrypt the ciphertext encrypted under the BCP homomorphic cryptosystem.
4.2.3. Ciphertext Transformation
This phase is to transfer “multiuser” to “single-user” by re-encrypting the ciphertext encrypted under the public key of to the ciphertext encrypted under , .(1)QC sends a clustering request to CCS.(2)For a ciphertext from , CCS interacts with KMS to run the SCT protocol by setting . Finally, CCS obtains .(3)By performing the SCT protocol on all the ciphertexts received from , CCS finally obtains
Let , and denote these n ciphertexts as
For simplicity, we denote as in the following.
It is worth noting that the final ciphertexts are unknown to the KMS since they are blinded in the SCT protocol.
4.2.4. Clustering Computation
In this phase, CCS computes the clustering results with k randomly chosen cluster centers from . Let and . CCS also outputs a matrix which refers to the location in k clusters of n records, where means is allocated to j-th cluster. In addition, there is a maximum iteration time . Let .(1)For a data record , CCS runs the SMDM protocol on it and k-cluster centers with the setting . Finally, CCS obtains the output where . Let for .(2)For each data record where , CCS runs step 1 and obtains and the matrix .(3)With the matrix and data records , if , CCS updates and as for . Finally, CCS obtains new and for . Let .(4)If and the output matrix is different from that in the last iteration, CCS starts a new iteration by running steps (1), (2), and (3). Otherwise, CCS outputs the final
4.2.5. Result Retrieval
(1)CCS interacts with KMS to run the SCT protocol on with the setting . CCS obtains and sends it and to QC.(2)QC decrypts the received with its secret key by computing QC then computes the cluster centers as where .
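The bookkeeping behind the clustering and retrieval phases can be illustrated on plaintext values: per-cluster sums and counts are maintained during clustering, and the division that yields the cluster centers is deferred to QC after decryption. The sketch below is an assumption-laden illustration: it uses a list of cluster indices in place of the assignment matrix, and the function names are hypothetical.

```python
def cluster_sums_and_counts(records, assignment, k):
    """Per-cluster attribute sums and record counts for a given assignment."""
    m = len(records[0])
    sums = [[0] * m for _ in range(k)]
    counts = [0] * k
    for t, j in zip(records, assignment):
        counts[j] += 1
        for h in range(m):
            sums[j][h] += t[h]
    return sums, counts


def centers_from(sums, counts):
    """Center of each cluster as sum / count (None for an empty cluster)."""
    return [[s / n for s in row] if n else None for row, n in zip(sums, counts)]


records = [(0, 0), (10, 10), (0, 2), (10, 12)]
assignment = [0, 1, 0, 1]            # record i belongs to cluster assignment[i]
sums, counts = cluster_sums_and_counts(records, assignment, k=2)
print(sums, counts, centers_from(sums, counts))
```

Keeping sums and counts separate means that no division has to be carried out on ciphertexts; only the final recipient of the decrypted sums performs it.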
5. Security and Performance Analysis
5.1. Security Analysis
As shown in the proposed scheme (see Section 4.2), our scheme is realized by invoking the BCP homomorphic cryptosystem, AES encryption, and the defined protocols. Given that the two cryptosystems are semantically secure, we give the security proof of the defined protocols as follows. We take the security proof of the SM protocol under the “Real-vs.-Ideal” framework as an example. The security proofs of the other protocols are similar and are omitted here.
Theorem 1. SM protocol is secure.
Proof. SM protocol involves two semihonest parties, namely, Alice and Bob. Therefore, we consider the security of SM protocol against a semihonest attacker acting as Alice and against a semihonest attacker acting as Bob. In the protocol, Alice takes the two ciphertexts as input, and Bob takes as input the secret key corresponding to the public key under which they are encrypted.
(i) Security against Alice: In the SM protocol, the real-world view of the attacker includes the input ciphertexts, the random numbers it chooses, and the output ciphertext. The attacker tries to obtain useful information about the underlying messages that are encrypted under the public key. Because of the semantic security of the BCP homomorphic cryptosystem, the attacker cannot extract any information about the underlying messages except their bit length without the secret key. Therefore, we can construct a simulator in the ideal world by using ciphertexts of randomly chosen messages. It is computationally hard for the attacker to distinguish the ideal world from the real world because of the semantic security of the BCP homomorphic cryptosystem. Hence, the real-world view and the ideal-world view are computationally indistinguishable.
(ii) Security against Bob: In the protocol, the attacker takes as input the secret key and the blinded ciphertexts received from Alice. With the secret key, it can decrypt these ciphertexts and obtain the underlying blinded messages. However, since the blinding numbers are randomly chosen by Alice, the decrypted values are random numbers from the attacker's point of view. We can then construct a simulator in the ideal world by using ciphertexts of randomly chosen messages, and it is computationally hard for the attacker to distinguish the ideal world from the real world. Hence, the two views are again computationally indistinguishable.
This completes the proof of Theorem 1.
Next, we prove that our protocol is secure by taking the process of data uploading as an example.
Theorem 2. The data uploading process is secure.
Proof. In the data uploading process, data owners (DOs) double-encrypt their data records, first under their public keys using the BCP homomorphic cryptosystem and then under their symmetric keys using AES encryption. They then send the encrypted result to CCS, which holds the symmetric keys but not the BCP secret keys corresponding to the owners' public keys. Because of the semantic security of the BCP homomorphic cryptosystem, the process is secure against a semihonest CCS. Although KMS can extract the underlying messages of ciphertexts encrypted using the BCP homomorphic cryptosystem, it does not hold the symmetric keys, so it is computationally hard for a semihonest KMS to obtain any information about the data records because of the semantic security of AES encryption. Furthermore, CCS and KMS are assumed not to collude in our scheme, so the data uploading process is secure against a semihonest CCS and a semihonest KMS separately. This completes the proof of Theorem 2.
It is worth noting that the security of our construction is protected by the semantic security of the BCP homomorphic cryptosystem, AES encryption, and blinding with random numbers, which prevents the adversaries from obtaining any useful information from the received ciphertexts.
5.2. Performance Analysis
In our construction, we use the BCP homomorphic cryptosystem and AES encryption to encrypt data owners’ data records to prevent the information disclosure to KMS. Compared with the underlying scheme [19] which utilizes Youn’s homomorphic encryption scheme [33], our scheme therefore increases the computation cost between DOs and CCS along with the increased security.
In particular, each data owner additionally needs to interact with CCS to agree on a symmetric key for AES encryption in the system setup phase. Besides, since BCP encryption has the additive homomorphic property instead of the multiplicative one in Youn’s encryption scheme [33], we give a secure multiplication protocol SM instead of the secure addition protocol SA in [19]. This leads to different invocations in the other defined protocols, which results in more computation cost.
At the cost of this additional computation, our scheme achieves semantic security, such that adversaries cannot obtain any useful information about the underlying data records, whereas AS can extract such information in the SA protocol of [19]. Furthermore, in our scheme, KMS cannot extract the underlying data records of data owners, whereas KMS can do so with its master secret key in [19].
Finally, we compare our scheme with the existing outsourced k-means clustering schemes [19, 22, 23, 26, 30, 34] in Table 2 on six aspects, i.e., whether the scheme is based on symmetric or asymmetric cryptosystem, whether it supports or achieves multiple data owners and multiple keys, ciphertext comparison, security, and multidimensional data. As shown in Table 2, our scheme achieves all the listed functionalities under the asymmetric cryptosystem.
6. Conclusions
This paper proposed a highly secure privacy-preserving outsourced k-means clustering scheme on encrypted datasets under multiple keys. We utilized BCP homomorphic encryption and AES encryption to double-encrypt the data records in the database to protect them against a semihonest cloud computing service and a semihonest key management server. Furthermore, we constructed five protocols, i.e., secure ciphertext transformation (SCT), secure multiplication (SM), secure distance measurement (SDM), secure distance comparison (SDC), and secure minimum distance measurement (SMDM), as the basis of our scheme. In particular, the SM protocol is built to achieve homomorphic multiplication using BCP encryption. Finally, we proposed our scheme, which invokes the defined protocols throughout. The given security and performance analysis shows that our scheme is comparable with the existing outsourced k-means clustering schemes in security and functionality.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Key R&D Program of China under Grant no. 2017YFB0802000, the National Natural Science Foundation of China under Grant nos. 61572390 and U1736111, the National Cryptography Development Fund under Grant no. MMJJ20180111, the Plan For Scientific Innovation Talent of Henan Province under Grant no. 184100510012, the Program for Science & Technology Innovation Talents in Universities of Henan Province under Grant no. 18HASTIT022, and the Innovation Scientists and Technicians Troop Construction Projects of Henan Province.