Abstract

With the advent of the intelligent era, artificial intelligence algorithms are increasingly widely deployed, and large amounts of user data are collected on cloud servers for sharing and analysis; meanwhile, the security risk of private data breaches keeps growing. CKKS homomorphic encryption has become a research focus in cryptography because it supports homomorphic computation on floating-point numbers with acceptable efficiency. Based on CKKS homomorphic encryption, this paper implements a secure KNN classification scheme on cloud servers for Cyberspace (CKKSKNNC) that also supports batch calculation. The scheme encrypts user data samples with CKKS and then uses the Euclidean distance, Pearson correlation coefficient, and cosine similarity to compute the similarity between ciphertext data samples; finally, the classification of a sample is decided securely by the voting rule. We evaluate the scheme on the IRIS data set, a classification benchmark commonly used in machine learning. The experimental results show that the three similarity algorithms other than the Pearson correlation coefficient reach an accuracy of around 97% on IRIS, almost the same as in plaintext, which demonstrates the effectiveness of the scheme. Comparative experiments further demonstrate its efficiency.

1. Introduction

With the gradual maturity of various AI algorithms, data has become the foundation of social operation, playing an essential role in areas such as economic investment, social management, scientific and technological development, and national security. Large amounts of data are uploaded to cloud servers from personal front-ends, social networks, sensor networks, and the Internet for sharing and analysis. With the development of cloud computing, many artificial intelligence algorithms driven by big data have been developed and widely applied in various apps. Major vendors build recommendation algorithms or make pricing decisions by analyzing big data on user behavior: for example, TikTok recommends short videos, and Taobao and Meituan recommend commodities and stores to their users. DiDi (Didi Chuxing) has likewise been reported to analyze users' trip records and charge long-time users higher prices. As one of the classic algorithms for big data classification, the KNN algorithm classifies data by computing the similarity between the test set and the training set. Because of its simple structure, high efficiency, and accuracy, KNN is adopted in various scenarios, such as image recognition, traffic classification, sensor detection, and especially text classification. Commercial clouds such as Google, Microsoft, and Amazon also provide related services. With the popularity of cloud computing, large numbers of users upload local data to cloud servers to enjoy the storage and computing services provided by cloud platforms. However, the security of commercial clouds is often not fully trusted [1–3]. As shown in Figure 1, various users and cloud servers exchange data and classification results in cyberspace, and the security of these data needs to be protected.

In recent years, privacy leakage incidents, such as the malicious collection and theft of users' sensitive data by apps, have emerged one after another. In 2007, the National Security Agency (NSA) and the Federal Bureau of Investigation (FBI) launched the infamous "Prism" secret surveillance project, which directly accessed the central servers of US Internet companies to mine data and collect intelligence; nine international network giants, including Microsoft, Yahoo, Google, and Apple, participated in it. In 2018, a security risk monitoring platform detected a data leak at NASDAQ-listed HTHT, involving the sensitive private data of about 500 million citizens. In 2021, the China Internet Network Information Center issued a notice on the illegal collection and use of personal information by 105 apps, including TikTok. Under this trend, people's personal data has gradually become another commodity, and many vendors privately collect or even sell users' personal data. In traditional KNN classification schemes [4–7], the user uploads raw local data to the cloud platform for storage and computation; the platform then computes the similarity between samples and returns the result to the user. However, for highly confidential data, such as personal privacy data, commercial secrets, medical records, and national security data, the consequences of leakage or theft are unimaginable. Therefore, the core issue of this study is how to implement a privacy-preserving KNN classification scheme efficiently and accurately in the cloud environment. An effective approach is to encrypt local data with a cryptographic algorithm and compute the similarity directly on the ciphertext. However, owing to the limitations of encryption algorithms, the computational and storage overhead of this approach is extremely large compared with plaintext [8, 9]. First, the basic operations supported by general encryption schemes are limited, so iterative approximations are needed instead. Second, some encryption schemes are restricted by modulus reduction and can support only a limited number of multiplications. Finally, traditional encryption schemes can only encrypt numbers or vectors one by one, requiring substantial resources to store the ciphertext. In 2017, Cheon et al. [10] proposed CKKS, a homomorphic encryption scheme that supports approximate arithmetic over real and complex numbers. Because it enables homomorphic computation on floating-point numbers with acceptable efficiency, CKKS has become a research focus in cryptography and is widely used in machine learning and big data analysis. To protect users' privacy, ensure data security, and implement KNN classification safely and efficiently in the cloud while retaining plaintext-level computational efficiency and classification accuracy, this paper presents CKKSKNNC, a secure KNN classification scheme based on the CKKS homomorphic encryption scheme in cloud servers for Cyberspace. We compute similarity with the Euclidean distance, Pearson correlation coefficient, and cosine similarity, so that secure KNN classification operates entirely in the ciphertext domain, avoiding any leakage of user privacy data.

2. Related Work

Traditional KNN classification schemes fall into two types: one assigns the same optimal K value to every test sample [11–13], and the other assigns an individual K value to each test sample [14–17]. In recent years, many classification schemes have been designed based on the KNN algorithm. In 2016, Deng et al. [18] first divided the data set into several categories with the K-means algorithm and then classified them by KNN, realizing an efficient KNN algorithm for big data. In 2017, Zhang et al. [19] proposed a kTree method to efficiently perform KNN classification with different neighbor numbers. In 2020, Zhang et al. [20] designed two effective cost-sensitive KNN classifiers to classify unbalanced data. In 2021, Zhu et al. [21] proposed an ML-KNN ensemble scheme for classification-algorithm recommendation that exploits the diversity of different data features, and Levchenko et al. [22] implemented KNN queries on large time-series databases based on iSAX and sketches. Meanwhile, to protect users' private data, researchers have also studied extensively how to perform KNN classification on ciphertext. As early as 2009, Wong et al. [23] designed an asymmetric scalar-product-preserving encryption (ASPE) scheme that supports KNN computation on encrypted data by preserving a special type of scalar product; however, the scheme assumes that the querying user is fully trusted, which is unsuitable for practical deployment in complex network environments. In 2013, Zhu et al. [24] proposed a secure KNN computation scheme for encrypted cloud data that does not require sharing the key with the querying user, but it increases the communication overhead compared with ASPE. In 2014, Elmehdwi et al. [25] proposed a secure KNN scheme for outsourced environments based on the Paillier homomorphic encryption scheme; by exploiting the homomorphic property, it queries the data without leaking any information to the cloud server and hides users' queries and data-access patterns, but its computational cost is large. In 2015, Xia et al. [26] proposed a secure dynamic multikeyword ranked search scheme over encrypted cloud data, which achieves sublinear search time and handles document deletion and insertion flexibly with a special tree-based index structure. Samanthula et al. [27] proposed a KNN classification scheme for encrypted cloud data based on Paillier encryption and secure multiparty protocols. In 2016 and 2017, likewise based on Paillier, Kim et al. [28, 29] designed privacy-preserving KNN classification algorithms using a tree index structure and Yao's garbled circuits, respectively. However, KNN classification schemes based on the Paillier homomorphic encryption scheme are computationally inefficient, limited in the operations they support, and costly. Li et al. [30] presented two secure and effective dynamic searchable symmetric encryption (SEDSSE) schemes for medical cloud data: combining the secure KNN scheme with ABE technology, they designed a dynamic searchable symmetric encryption scheme and a key-sharing scheme that achieve both forward and backward privacy, and they proposed an enhanced scheme to solve the key-sharing problem caused by KNN-based searchable encryption.
In 2018, Wu et al. [31] used the Paillier and ElGamal encryption schemes to implement a secure KNN classification scheme on a semantically secure hybrid encrypted cloud database. Later, Liu et al. [32] proposed a privacy-preserving KNN classification scheme in a dual-cloud model based on secret sharing and additively homomorphic encryption. In 2020, Parvin et al. [33] developed a blockchain-based electronic medical record analysis system using the KNN and LDA algorithms to share medical data sets among medical experts automatically and safely. In the same year, to classify large-scale ciphertext data on distributed servers, Yang et al. [34] proposed a vector homomorphic encryption (VHE) scheme built from key-switching and noise matrices and constructed a secure distributed KNN classification algorithm (seed KNN) on top of it. Recently, Kim et al. [35] proposed an index-based KNN query processing algorithm that improves query efficiency through Yao's garbled circuits and data-packing techniques. Liu et al. [36] achieved secure KNN classification with a secure and efficient query processing (SecEQP) scheme, which encodes location information through projection functions and performs location-based query processing over encrypted geospatial data stored in the cloud.

3. Preliminaries

3.1. CKKS Homomorphic Encryption Algorithm

In 2017, Cheon et al. [10] proposed CKKS, a homomorphic encryption scheme that supports approximate arithmetic over real and complex numbers. This section analyzes the CKKS homomorphic encryption algorithm. Figure 2, drawn after Cheon's report [10], illustrates the main algorithm flow of CKKS, which is described below.

Set the security parameter $\lambda$ and choose a power-of-two integer $N$. Let $R = \mathbb{Z}[X]/(X^N + 1)$, and fix distributions $\chi_{key}$, $\chi_{err}$, and $\chi_{enc}$ over $R$ for the key, learning with errors, and encryption, respectively. Given a base integer $p$ and the number of levels $L$, set the ciphertext modulus $q_\ell = p^\ell \cdot q_0$, where $1 \le \ell \le L$ is the level of the ciphertext; then choose an integer $P$ at random and output $params = (N, \chi_{key}, \chi_{err}, \chi_{enc}, \{q_\ell\}, P)$.

(1) KeyGen($params$): randomly sample $s \leftarrow \chi_{key}$ and set the private key $sk = (1, s)$. Randomly sample $a \leftarrow R_{q_L}$ and $e \leftarrow \chi_{err}$, and set the public key $pk = (b, a) = (-as + e, a) \bmod q_L$. Randomly sample $a' \leftarrow R_{P \cdot q_L}$ and $e' \leftarrow \chi_{err}$, and set the evaluation key $evk = (b', a') = (-a's + e' + Ps^2, a') \bmod (P \cdot q_L)$.

(2) Enc$_{pk}(m)$: randomly sample $v \leftarrow \chi_{enc}$ and $e_0, e_1 \leftarrow \chi_{err}$; output the ciphertext $c = v \cdot pk + (m + e_0, e_1) \bmod q_L$.

(3) Dec$_{sk}(c)$: for a ciphertext $c = (b, a)$ of level $\ell$, compute and output the plaintext $m' = b + a \cdot s \bmod q_\ell$.

(4) Add$(c_1, c_2)$: for ciphertexts of the same level $\ell$, compute and output the addition result $c_{add} = c_1 + c_2 \bmod q_\ell$.

(5) Mult$_{evk}(c_1, c_2)$: for $c_1 = (b_1, a_1)$ and $c_2 = (b_2, a_2)$, compute $(d_0, d_1, d_2) = (b_1 b_2, a_1 b_2 + a_2 b_1, a_1 a_2) \bmod q_\ell$ and output the result of ciphertext multiplication $c_{mult} = (d_0, d_1) + \lfloor P^{-1} \cdot d_2 \cdot evk \rceil \bmod q_\ell$.

Given that the CKKS encryption scheme is homomorphic, the cloud server can compute on ciphertext exactly as it would on plaintext, which ensures both the privacy of the user and the efficiency of the encryption.
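To make the homomorphic property concrete, the following is a minimal sketch using the TenSEAL library (which this paper's implementation uses); the API shown follows recent TenSEAL releases and may differ slightly from version 0.1.4 used in the experiments. The parameters mirror those in Section 7; the data values are illustrative.

```python
import tenseal as ts

# Create a CKKS context (parameters mirror Section 7 of this paper).
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[50, 30, 30, 50],
)
context.global_scale = 2 ** 30
context.generate_galois_keys()  # needed for rotations used by dot/matmul

x = [1.5, 2.0, 3.25]
y = [0.5, 1.0, 2.0]

enc_x = ts.ckks_vector(context, x)  # Enc(x)
enc_y = ts.ckks_vector(context, y)  # Enc(y)

enc_sum = enc_x + enc_y    # homomorphic addition
enc_prod = enc_x * enc_y   # element-wise homomorphic multiplication

print(enc_sum.decrypt())   # ~[2.0, 3.0, 5.25] (CKKS results are approximate)
print(enc_prod.decrypt())  # ~[0.75, 2.0, 6.5]
```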

3.2. K-Nearest Neighbor

KNN was proposed by Cover and Hart in 1967 [37] and ranks among the simplest machine learning algorithms. Thanks to its simple structure and remarkable classification performance, it became one of the most popular algorithms in data mining and statistics, earning a place among the top ten data mining algorithms [6], and it is commonly used in classification, regression, missing-value imputation, and other tasks [38–40]. Many machine learning methods have since been developed to better determine the value of k and the distance measure used in KNN. The main idea of the KNN nearest-neighbor classification algorithm is to assign the object to be classified to the category held by the majority of samples within a certain neighborhood of that object. Its working principle is to compare the sample awaiting classification with the samples of established categories in the database, compute the similarity between them, and select the k known-category samples most similar to the sample to be classified. According to the voting rule (the minority obeys the majority), the sample to be classified is assigned the category that accounts for the largest share of those k nearest samples. Suppose that we have $n$ samples of known categories $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$, where $X_i$ represents the characteristic indices of a sample and $Y_i$ its category label. For a given sample $x$ to be classified, we select the k samples with the highest similarity in the vicinity of $x$, and these samples vote for the category of $x$ according to their own categories. The category label with the most votes becomes the category of $x$, as shown in Figure 3. The green dot represents the sample to be classified, and the blue squares and red triangles represent samples of two known categories. When k = 3, red triangles account for 2/3 of the nearest neighbors, so the green dot is classified as a red triangle sample. When k = 5, blue squares account for 3/5 of the nearest neighbors, so the green dot is classified as a blue square sample.
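To make the voting rule concrete, here is a minimal plaintext sketch in Python; the function and variable names are illustrative and not part of the paper's implementation.

```python
# A plaintext sketch of the KNN voting rule described above, for any
# similarity function where a higher score means "closer".
from collections import Counter

def knn_classify(query, samples, labels, k, similarity):
    # Rank known samples by similarity to the query sample.
    scored = sorted(zip(samples, labels),
                    key=lambda sl: similarity(query, sl[0]),
                    reverse=True)
    # Majority vote among the k most similar neighbors.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```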

The KNN method is better suited than other methods to samples whose class domains have substantial intersection or overlap. There are many ways to compute similarity in the KNN algorithm, such as the Euclidean distance, cosine similarity, Pearson correlation, Manhattan distance, and Chebyshev distance; the Euclidean distance is the most commonly used.

3.3. Ciphertext Matrix Transpose Operation

Since this scheme is implemented with the TenSEAL homomorphic encryption library, which provides ciphertext matrix multiplication (matmul) and inner-product (dot) functions but no ciphertext transpose function, this part introduces a procedure for transposing a ciphertext matrix within TenSEAL. The procedure relies on a rearrangement function provided by TenSEAL. Suppose that there is a ciphertext matrix $enc\_M$. First, a transition matrix is generated; the transposition process is shown in Figure 4.

As Figure 4 shows, $enc\_M$ is first converted by multiplication with the transition matrix; its internal elements are then rearranged through the rearrangement function; and a final multiplication yields the transpose $enc\_M^T$.
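The exact TenSEAL primitives used in Figure 4 are not reproduced here, so the following is one workable alternative given as a hedged sketch, not the paper's procedure: since the row-major flattening of $M^T$ is a fixed permutation of the flattening of $M$, a plaintext 0/1 permutation matrix applied with the real CKKSVector.matmul function performs the transposition. The helper name transpose_permutation is illustrative.

```python
import tenseal as ts

def transpose_permutation(rows, cols):
    """Plain 0/1 matrix P such that vec(M) @ P = vec(M^T) (row-major)."""
    n = rows * cols
    P = [[0.0] * n for _ in range(n)]
    for i in range(rows):
        for j in range(cols):
            # Entry M[i][j] moves from slot i*cols+j to slot j*rows+i.
            P[i * cols + j][j * rows + i] = 1.0
    return P

context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[50, 30, 30, 50])
context.global_scale = 2 ** 30
context.generate_galois_keys()

M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                 # a 2 x 3 plaintext matrix
flat = [v for row in M for v in row]  # row-major vec(M)
enc_M = ts.ckks_vector(context, flat)

enc_Mt = enc_M.matmul(transpose_permutation(2, 3))
print(enc_Mt.decrypt())  # ~[1, 4, 2, 5, 3, 6] = vec(M^T)
```

This costs one plaintext-ciphertext matrix product, i.e., a single multiplicative level, which fits comfortably within the parameter budget used in this paper.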

3.4. Symbols and Parameters

To present the algorithms in this article more intuitively, we briefly introduce the symbols used throughout, as shown in Table 1. Vectors are written in lowercase bold letters and matrices in uppercase bold letters; the prefix enc_ denotes the ciphertext form of the data.

4. System Models

4.1. Proposed Model

According to Figure 5, the CKKSKNNC protocol model designed in this paper consists of two parties: the user (USER) and the cloud service provider (CSP). The CSP provides remote storage and computing services for users and is "honest but curious". Users hold a large amount of local data and enjoy the services provided by the CSP. The division of labor is as follows:

(1) USER: generates public and private keys locally, encrypts data and uploads it to the CSP, and decrypts the ciphertext computation results.

(2) CSP: provides remote storage and computing services for USER with ample storage and computing capacity; it stores the ciphertext data uploaded by USER, computes the similarity between the encrypted sample to be classified and the other ciphertext samples, and returns the ciphertext result to USER.

First, USER generates public and private keys locally, encrypts the locally held samples of known categories, and sends them to the CSP, which receives and stores the ciphertext samples. When USER obtains a new sample to be classified, USER encrypts it locally and delivers it to the CSP. The CSP computes the similarity between the encrypted sample to be classified and the other ciphertext samples on the server and sends the ciphertext result back to USER. USER decrypts the result, selects the nearest k samples, and obtains the category label of the sample to be classified according to the voting rule.

4.2. Security Model

Since the CSP is "honest but curious" and the transmission network may also be subject to malicious attacks, we list the following security issues that may arise when users upload data to the cloud server for KNN classification:

(1) The CSP may strictly abide by the designed protocol yet infer additional information from the messages it legitimately receives during the protocol.

(2) The CSP may attempt to steal USER's public and private keys and use the stored ciphertext data samples to try to recover USER's plaintext data samples and private keys.

(3) During transmission of the user-uploaded ciphertext data and of the ciphertext results returned by the cloud server, data samples may be maliciously intercepted by attackers and used to crack the user's sensitive data.

5. System Algorithm

5.1. CKKSKNNC Framework

Assume that the user locally holds $n$ data samples of known categories $D = \{(x_i, y_i)\}_{i=1}^{n}$ to be uploaded, where $x_i = (x_{i1}, x_{i2}, \dots, x_{im})$ is the vector of $m$ characteristic indices and $y_i$ is the category label. The system protocol framework is shown in Figure 6.

According to the protocol framework, the protocol algorithm consists of two phases, the data initialization phase and the classification phase. The specific procedures are as follows:

(1) Data initialization:

(a) First, USER standardizes the characteristic indices of the local data samples by computing $x'_{ij} = (x_{ij} - \mu_j)/\sigma_j$, where $\mu_j$ is the average value of the j-th characteristic index and $\sigma_j$ is its standard deviation. The standardized data then have characteristic indices with an average value of 0 and a variance of 1 and are dimensionless.

(b) USER generates public and private keys locally, encrypts the characteristic indices and category labels of both the original data and the standardized data to obtain $enc\_D$ and $enc\_D'$, and uploads both to the CSP for storage (a code sketch of this step follows the list).

(2) Classification:

(a) After receiving a new sample to be classified, USER first standardizes its characteristic indices to obtain $x'$, encrypts it with the public key to obtain $enc\_x'$, and sends the encrypted result to the CSP as the query matrix.

(b) After receiving the query matrix, the CSP computes, in ciphertext, the similarity between the sample to be classified and the samples of known categories and returns it to USER.

(c) USER decrypts the result, selects the top k samples with the highest similarity, and obtains the category label of the sample to be classified according to the voting rule.
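The following is a sketch of the USER-side data initialization step, assuming NumPy for the column-wise standardization; the variable names are illustrative, not the paper's exact code.

```python
import numpy as np
import tenseal as ts

def standardize(X):
    """Column-wise z-score: x'_ij = (x_ij - mu_j) / sigma_j."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[50, 30, 30, 50])
context.global_scale = 2 ** 30
context.generate_galois_keys()

X = np.array([[5.1, 3.5, 1.4, 0.2],   # toy IRIS-like samples
              [6.2, 2.9, 4.3, 1.3]])
X_std = standardize(X)

# Encrypt each sample row; both the raw and the standardized data
# are uploaded to the CSP, per step (1)(b) above.
enc_X = [ts.ckks_vector(context, row.tolist()) for row in X]
enc_X_std = [ts.ckks_vector(context, row.tolist()) for row in X_std]
```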

5.2. Security Similarity Calculation

In the process of data mining and data analysis, there are many methods to measure the differences between samples. In the CKKSKNNC protocol, this paper uses the Euclidean distance, Pearson correlation coefficient, and cosine similarity to measure the similarity between samples.

5.2.1. Euclidean Distance

The Euclidean distance [41] is the most popular similarity measure and has been widely used in scenarios such as face recognition. Traditionally it is computed either directly, as the absolute distance between points in multidimensional space, or through the matrix inner product [42]. Two methods for calculating the Euclidean distance on ciphertext are introduced below. Method 1: since a ciphertext encrypted by the CKKS homomorphic encryption algorithm cannot be directly square-rooted, the square root is omitted, and the ciphertext (squared) distance between the sample to be classified $x$ and a known-category sample $c_i$ is

$enc\_d(x, c_i) = \sum_{j=1}^{m} (enc\_x_j - enc\_c_{ij})^2.$

As the distance grows smaller, the similarity of the samples becomes higher. Method 2: before uploading the data, USER additionally computes the squared norms $\|x\|^2$ and $\|c_i\|^2$, encrypts them together with the samples, and uploads everything to the CSP. Using the expansion $\|x - c_i\|^2 = \|x\|^2 - 2\langle x, c_i\rangle + \|c_i\|^2$, the CSP can then compute the ciphertext distance between two samples directly through the inner product:

$enc\_d(x, c_i) = enc(\|x\|^2) - 2\, enc(x) \cdot enc(c_i) + enc(\|c_i\|^2).$

This method yields the same result as the first method; although it increases the computational complexity, it allows the computation to be batch-processed (see the sketch below).
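The following sketch computes both ciphertext distance methods with TenSEAL; the values and names are illustrative. In method 2, USER pre-doubles the query vector so that the whole distance fits in one ciphertext inner product plus additions.

```python
import tenseal as ts

context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[50, 30, 30, 50])
context.global_scale = 2 ** 30
context.generate_galois_keys()

x = [1.0, 2.0, 3.0]   # sample to be classified
c = [2.0, 0.0, 4.0]   # known-category sample
enc_x, enc_c = ts.ckks_vector(context, x), ts.ckks_vector(context, c)

# Method 1: Enc(sum_j (x_j - c_j)^2), no square root taken.
diff = enc_x - enc_c
enc_d1 = diff.dot(diff)

# Method 2: ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2, with the squared
# norms and the doubled vector precomputed by USER before upload.
enc_xn = ts.ckks_vector(context, [sum(v * v for v in x)])  # Enc(||x||^2)
enc_cn = ts.ckks_vector(context, [sum(v * v for v in c)])  # Enc(||c||^2)
enc_2x = ts.ckks_vector(context, [2.0 * v for v in x])     # Enc(2x)
enc_d2 = enc_xn + enc_cn - enc_2x.dot(enc_c)

print(enc_d1.decrypt()[0], enc_d2.decrypt()[0])  # both ~6.0
```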

5.2.2. Pearson’s Correlation Coefficient

Since the magnitudes of the different characteristic indices of a sample strongly affect the Euclidean distance, in some applications people prefer the Pearson correlation coefficient [43], which is insensitive to magnitude, to measure the similarity between samples. In plaintext, the coefficient between samples $x$ and $c$ is

$\rho(x, c) = \frac{\sum_{j=1}^{m}(x_j - \bar{x})(c_j - \bar{c})}{\sqrt{\sum_{j=1}^{m}(x_j - \bar{x})^2}\,\sqrt{\sum_{j=1}^{m}(c_j - \bar{c})^2}}.$

Because the data samples have already been standardized before being uploaded to the CSP, the coefficient reduces to a scaled inner product of the standardized ciphertext samples, which the CSP computes directly.
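As a simplified sketch of this reduction (not the paper's exact pipeline): for vectors that each have zero mean and unit variance, Pearson's coefficient equals the inner product divided by $m$, so one ciphertext dot product suffices.

```python
import numpy as np
import tenseal as ts

context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[50, 30, 30, 50])
context.global_scale = 2 ** 30
context.generate_galois_keys()

def zscore(v):
    """Standardize a vector to zero mean and unit (population) variance."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

x, c = zscore([5.1, 3.5, 1.4, 0.2]), zscore([6.2, 2.9, 4.3, 1.3])
m = len(x)

enc_x = ts.ckks_vector(context, x.tolist())
enc_c = ts.ckks_vector(context, c.tolist())

# One ciphertext inner product, scaled by 1/m, yields Pearson's rho.
enc_rho = enc_x.dot(enc_c) * (1.0 / m)
print(enc_rho.decrypt()[0], float(np.corrcoef(x, c)[0, 1]))  # ~equal
```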

5.2.3. Cosine Similarity

Cosine similarity, like the Pearson correlation coefficient, is insensitive to the magnitude of the characteristic indices. It is often used for text similarity but must be computed on the original data. It measures similarity as the cosine of the angle between the two sample vectors in the vector space, $\cos(x, c) = \langle x, c\rangle / (\|x\|\,\|c\|)$, so it attends to the difference in direction between the vectors rather than their distance. Because a CKKS ciphertext cannot be directly square-rooted, the protocol computes the numerator $\langle x, c\rangle$ and the product of squared norms $\|x\|^2\|c\|^2$ separately on ciphertext.

In the actual implementation, CKKS ciphertexts also cannot be divided directly, so the CSP returns the two values $enc(\langle x, c\rangle)$ and $enc(\|x\|^2\|c\|^2)$ to USER, who decrypts them and performs the division (and square root) on plaintext, as sketched below.
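A sketch of the split cosine-similarity computation with TenSEAL; the CSP/USER split is indicated in the comments, and the values are illustrative.

```python
import tenseal as ts

context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[50, 30, 30, 50])
context.global_scale = 2 ** 30
context.generate_galois_keys()

x = [1.0, 2.0, 3.0]
c = [2.0, 0.0, 4.0]
enc_x, enc_c = ts.ckks_vector(context, x), ts.ckks_vector(context, c)

# CSP side: encrypted numerator <x, c> and squared denominator
# ||x||^2 * ||c||^2, both computed entirely on ciphertext.
enc_num = enc_x.dot(enc_c)
enc_den2 = enc_x.dot(enc_x) * enc_c.dot(enc_c)

# USER side: decrypt, then finish the division and square root in plaintext.
num = enc_num.decrypt()[0]
den2 = enc_den2.decrypt()[0]
print(num / den2 ** 0.5)  # cosine similarity, ~0.837
```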

5.3. Batch Calculation

Assume that the CSP stores $n$ ciphertext samples of known categories $enc\_c_1, \dots, enc\_c_n$. When USER uploads multiple ciphertext samples to be classified $enc\_x_1, \dots, enc\_x_t$, the CSP needs to compute the similarity between each sample to be classified and every sample of a known category; the encryption method is determined by the similarity algorithm used.

5.3.1. Euclidean Distance

When the CSP uses Euclidean distance method 1 to compute similarity, batch processing is not possible: USER must encrypt each sample separately, and the CSP must compute the ciphertext similarity between each sample to be classified and all samples of known categories one by one, returning the similarity matrix $enc\_D = (enc\_d_{ij})$, where $enc\_d_{ij}$ is the similarity between the i-th known-category sample and the j-th sample to be classified. When the CSP uses Euclidean distance method 2, USER can directly encrypt the plaintext matrix of the samples to be classified to obtain the ciphertext matrix $enc\_X$; the CSP then computes the similarity matrix $enc\_D$ through ciphertext matrix operations, where $enc\_d_{ij}$ is again the similarity between the i-th known-category sample and the j-th sample to be classified.

5.3.2. Pearson’s Correlation Coefficient

Similarly, the CSP computes the similarity matrix $enc\_P = (enc\_\rho_{ij})$, where $enc\_\rho_{ij}$ is the Pearson correlation coefficient between the i-th known-category sample and the j-th sample to be classified.

5.3.3. Cosine Similarity

When the CSP uses cosine similarity, it first generates a unit diagonal matrix and the auxiliary norm vectors, then computes the matrix of pairwise inner products (the numerators) and the matrix of pairwise squared-norm products (the denominators). The entry $enc\_s_{ij}$ of the similarity matrix relates the i-th known-category sample and the j-th sample to be classified; as in the single-sample case, the CSP returns both matrices to USER. A loop-based sketch of the batch step follows below.
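As an illustration of the batch step, the following loop-based sketch computes the full ciphertext distance matrix with Euclidean method 1; the packed method-2 variant described above replaces these loops with ciphertext matrix products. The function name and structure are illustrative, not the paper's code.

```python
# CSP-side batch similarity with Euclidean method 1. enc_known and
# enc_queries are assumed to be lists of ts.ckks_vector objects
# encrypted under one common context.
def batch_sq_distances(enc_known, enc_queries):
    """enc_D[i][j] = Enc(||c_i - x_j||^2) for known c_i and query x_j."""
    enc_D = []
    for enc_c in enc_known:
        row = []
        for enc_x in enc_queries:
            diff = enc_c - enc_x        # ciphertext subtraction
            row.append(diff.dot(diff))  # one ciphertext inner product
        enc_D.append(row)
    return enc_D
```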

6. Security Analysis

According to the CKKSKNNC protocol model, since the CSP is "honest but curious", USER's private key is stored only locally, and the CSP is only in charge of storing and computing on the user-uploaded ciphertext data; it can therefore obtain neither USER's keys nor any of USER's sensitive plaintext information. The security definition of the semitrusted model is as follows.

Definition (security in the semitrusted model): let $f = (f_1, f_2)$ be a functionality, where $f_1(x, y)$ and $f_2(x, y)$ are, respectively, the first and second elements of $f(x, y)$, and let $\pi$ be a two-party protocol used to compute $f$. The view of party $i \in \{1, 2\}$ implementing the protocol on input $(x, y)$ is $\mathrm{view}_i^{\pi}(x, y) = (w, r_i, m_1^i, \dots, m_t^i)$, where $w$ represents the party's input, $r_i$ represents its randomness, and $m_j^i$ represents the j-th message it receives; its output $\mathrm{output}_i^{\pi}(x, y)$ is also available. If there exist probabilistic polynomial-time algorithms $S_1$ and $S_2$ such that

$\{S_1(x, f_1(x, y))\} \overset{c}{\equiv} \{\mathrm{view}_1^{\pi}(x, y)\} \quad \text{and} \quad \{S_2(y, f_2(x, y))\} \overset{c}{\equiv} \{\mathrm{view}_2^{\pi}(x, y)\},$

where computational indistinguishability is denoted by $\overset{c}{\equiv}$, then the protocol $\pi$ is said to compute $f$ securely against semitrusted adversaries.

Theorem 1. Under the semihonest model, CKKSKNNC is provably secure: the CSP cannot obtain any useful information from the stored data set or the query matrix.

Proof. In the protocol, the data set and query matrix that the CSP can obtain are transmitted as ciphertext. The view of the CSP is $V = (c_1, c_2, \dots, c_n)$, where the $c_i$ are all ciphertexts and n is the number of ciphertext items the CSP can access. We can design a simulator $S$ that simulates this view by setting $V' = (r_1, r_2, \dots, r_n)$, where the $r_i$ are all random numbers. Since the CKKS homomorphic encryption scheme is semantically secure, $V'$ is computationally indistinguishable from $V$. Thus, under the semihonest model, CKKSKNNC is provably secure, and the CSP cannot gain any useful information from the stored data set or the query matrix.

Theorem 2. The CSP and other attackers cannot perform key recovery attacks on stored or stolen ciphertext data and computation results; hence they cannot recover the user's original data or keys.

Proof. According to the protocol algorithm, USER's sample data, category labels, and query matrix are all transmitted in the form of CKKS ciphertext, so their security reduces to the security of the CKKS homomorphic encryption scheme itself. Therefore, the CSP cannot recover the original user data or keys from the stored user data and intermediate computation results. When results are returned, the similarity computation result is transmitted to the user in ciphertext for decryption, so an attacker can recover the user's sensitive data neither from ciphertext intercepted in transit nor from data stolen from the cloud server. In addition, no matter which method is used to compute the similarity, at most three ciphertext multiplications are required; therefore, the CKKSKNNC protocol imposes no special additional requirements on the parameters of the CKKS encryption scheme.

7. Experimental Test

To evaluate the effectiveness of the proposed scheme, we conduct experiments on the Windows 10 operating system with an Intel® Core™ i7-7700HQ CPU @ 2.80 GHz and 16 GB RAM, using PyCharm 2020.1.1 ×64 and the TenSEAL-0.1.4 library to implement the CKKS encryption scheme with parameters poly_modulus_degree = 8192, coeff_mod_bit_sizes = [50, 30, 30, 50], and scale = 30 bits, and we test on the IRIS data set.

7.1. Efficiency of Similarity Calculation

In this part, we test the computational efficiency of the different similarity algorithms. We randomly selected 100 groups from the IRIS data set as known-category samples and randomly selected 10, 20, 30, 40, and 50 of the remaining groups as samples to be classified to form the test sets. We recorded the time to compute the similarity on ciphertext over 30 runs and took the average as the final experimental data. The computational efficiency of the four similarity algorithms is shown in Figure 7.

We did not record the time for transposing the ciphertext matrix, because matrix transposition is treated as preprocessing. The results show that, as the number of samples to be classified in the test set increases, the computation costs of the four similarity algorithms grow linearly. Among them, Euclidean distance method 2 is the most efficient, and cosine similarity is the least efficient because it performs more ciphertext multiplications. It is worth mentioning that Euclidean distance method 1 stores ciphertext in the CKKS_Vector data type, while the other methods use the CKKS_Tensor data type. The storage overhead on the different test sets is shown in Table 2 (unit: byte).

The results show that, as the number of samples rises, the memory occupied by a CKKS_Vector-type ciphertext data set increases linearly and exceeds that of an equally sized CKKS_Tensor-type ciphertext data set, whose memory footprint remains constant. Therefore, when dealing with big data, Euclidean distance method 1 and cosine similarity should be avoided for computing the similarity.

7.2. Accuracy of Similarity Calculation

In this part, we take a random number (between 35 and 45) of samples to be classified as the test set and measure the classification accuracy of the four similarity algorithms for k = 3, 5, 7, and 9. We record the results of 30 runs and take the average as the final experimental data. The classification accuracy of the four similarity algorithms is shown in Figure 8.

The results show that the accuracy of the Euclidean distance and cosine similarity is stable at about 97%, while the accuracy of the Pearson correlation coefficient drops to about 65%. Therefore, for KNN classification on this data, the Pearson correlation coefficient should be avoided as the similarity measure.

7.3. Comprehensive Performance Comparison

Because the mature secure KNN classification schemes deployed in practice are based on the Paillier homomorphic encryption scheme, we compare the efficiency and accuracy of computing the Euclidean distance under the Paillier scheme, the CKKS scheme, and plaintext. It should be emphasized that the CKKS scheme is implemented with the TenSEAL homomorphic encryption library, which wraps the C++-based SEAL encryption library as a dynamic library callable from Python, so the speed of the CKKS scheme depends on the efficiency of the underlying C++ SEAL library. For a fair comparison, we therefore implemented the Paillier scheme in C++ using the NTL and GMP libraries with encryption parameter |N| = 1024. First, we compare the computational efficiency of the three schemes: we randomly selected 10, 20, 30, 40, and 50 groups of the remaining samples as the test sets to be classified, and the similarity computation times are shown in Figure 9.

Clearly, the CKKS scheme is more computationally efficient than the Paillier scheme and closer to plaintext performance, and CKKS supports batch processing of data. In terms of encryption, the CKKS scheme can encrypt a data matrix directly, while the Paillier scheme can only encrypt numbers one by one and does not support floating-point operations. We then compare the accuracies of the three schemes: taking a random number (between 35 and 45) of samples to be classified as the test set, we measure the classification accuracy for k = 3, 5, 7, and 9. The results are shown in Figure 10.

It is evident that, whether the CKKS scheme or the Paillier scheme is used for secure computation, the classification accuracy is essentially the same as computing directly on plaintext. We then tested the storage costs of the three schemes on data sets with different sample sizes, as shown in Table 3 (unit: byte).

As the number of samples grows, the encrypted data set of the CKKS scheme occupies the least memory and remains unchanged, whereas the Paillier scheme and the plaintext scheme occupy more memory that increases linearly. A secure KNN classification algorithm that adopts the CKKS scheme therefore has a clear advantage when processing big data.

8. Conclusions

To protect the sensitive private data of cloud servers and users during transmission while meeting the classification accuracy and computational efficiency requirements of classification algorithms, this paper implements CKKSKNNC, a secure KNN classification scheme in the ciphertext domain for Cyberspace, based on the KNN classification scheme and the CKKS algorithm. We use the TenSEAL homomorphic encryption library to implement the CKKS scheme; adopt two Euclidean distance methods, the Pearson correlation coefficient, and cosine similarity as the similarity measures of the KNN classification algorithm; and test the computational efficiency, storage cost, and classification accuracy of the four similarity algorithms on the IRIS data set. The experiments show that Euclidean distance method 1 has the largest storage cost, cosine similarity has the lowest computational efficiency, and the Pearson correlation coefficient has the lowest classification accuracy. Which similarity algorithm to use therefore depends on the specific data.

Data Availability

The data sets that support the conclusions of this study are available from the corresponding authors upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the Foundation of State Key Laboratory of Public Big Data (No. 2019BDKFJJ008), Engineering University of PAP’s Funding for Scientific Research Innovation Team (No. KYTD201805), and Engineering University of PAP’s Funding for Key Researcher (No. KYGG202011).