Abstract

Frequency-hiding order-preserving encryption (FH-OPE) has emerged as an important tool in data security, particularly in cloud computing, because it preserves the order of plaintexts in their corresponding ciphertexts and thereby enables efficient range queries on encrypted data. Despite its strong security model, indistinguishability under frequency-analyzing ordered chosen plaintext attack (IND-FA-OCPA), our research identifies a vulnerability in its design that stems from range queries. We quantify the frequency exposure resulting from these range queries and present potential inference attacks on the FH-OPE scheme. Our findings are substantiated through experiments on real-world datasets that measure the frequency exposure resulting from range queries on FH-OPE encrypted databases. These results quantify the level of risk in practical applications of FH-OPE, reveal the potential for additional inference attacks, and show the urgency of addressing these threats. Consequently, our research highlights the need for a more comprehensive security model that accounts for the risks associated with range queries and underscores the importance of developing new range-query methods that do not expose these vulnerabilities.

1. Introduction

The significant rise in the use of cloud computing for data storage and processing necessitates new encryption methodologies to ensure data security. One such method is frequency-hiding order-preserving encryption (FH-OPE), which maintains the order of plaintexts in the ciphertexts, thereby allowing efficient range queries and sorting operations on the encrypted data. A key aspect of FH-OPE is frequency hiding, a mechanism that conceals the frequency of individual values in the encrypted data, thereby mitigating some vulnerabilities of standard OPE. Such frequency hiding is important in scenarios where data usage patterns should remain confidential, such as medical databases or financial records, where repeated values could hint at certain medical conditions or repeated transactions, respectively.

While the security model for FH-OPE, known as indistinguishability under frequency-analyzing ordered chosen plaintext attack (IND-FA-OCPA), is recognized as robust and well-structured, our work suggests that it might be incomplete. More specifically, the model was designed without considering that range queries could weaken FH-OPE’s ability to hide plaintext distributions. We demonstrate in this paper that the security of FH-OPE can be affected not only by attackers who directly perform these queries but also by attackers who simply observe their results, leading to potential frequency exposure.

We then turn to quantifying frequency exposure, determining the number of range queries needed to reveal the frequency of all plaintexts in FH-OPE and the expected number of distinct ciphertexts exposed after executing a specific number of range queries. Our approach employs the coupon collector’s problem and probabilistic analyses to illustrate these concepts. We extend our research to other potential threats posed by more complex queries, such as join queries, as well as inference attacks conducted via association rule mining. Our findings demonstrate that these methods can exploit and potentially exacerbate the identified weaknesses in the FH-OPE scheme. To support our theoretical analysis, we experimented with real-world datasets. The focus of these experiments was to measure the frequency exposure that occurs when executing range queries on an FH-OPE encrypted database. The results quantify the risk associated with practical implementations of FH-OPE and reveal the possibility of further inference attacks.

Consequently, our study underlines the need for a more comprehensive FH-OPE security model. This model must consider the risks associated with range queries and their potential to expose vulnerabilities. Our research also underscores the importance of developing new methods for conducting range queries on encrypted data that are free of the vulnerabilities identified here.

1.1. Related Work
1.1.1. OPE and FH-OPE

Over the years, numerous studies have been conducted on OPE [1–15] and FH-OPE [16–20]. The concept of OPE has garnered substantial attention due to its utility in database systems [21–23]. In recent research, a shift in focus has been noticeable, with most efforts centered on hiding frequency information and achieving ideal security for OPE and FH-OPE. The initial exploration of OPE was largely influenced by the work of Boldyreva et al. [2], which established the first formal definitions and security models for OPE and introduced the concept of indistinguishability under ordered chosen plaintext attack (IND-OCPA). This model primarily aimed to preserve the order of plaintext data, even in its encrypted form. As the field evolved, potential vulnerabilities, particularly concerning frequency exposure in traditional OPE schemes, became more apparent. Frequency-hiding order-preserving encryption was proposed in response. Kerschbaum’s influential paper [16] marked a significant advance in this direction. Kerschbaum introduced the IND-FA-OCPA security model specifically designed for FH-OPE, aiming to mitigate frequency-related vulnerabilities and enhance data security. However, subsequent research by Maffei et al. [17] highlighted certain inadequacies in Kerschbaum’s original security model. They presented a comprehensive critique of the model’s structure and identified and executed an attack on it, thereby exposing an issue in the original security proof. Moreover, they proposed an impossibility result, demonstrating that Kerschbaum’s security definition could not be achieved by any OPE scheme. Consequently, they introduced a new security definition that maintains the fundamental concept of frequency-hiding yet is practically achievable. They also demonstrated that their refined version of the security definition, which more accurately captured the concept of frequency-hiding, could indeed be realized.

1.1.2. Attacks on OPE and FH-OPE

Several papers [24–28] have presented various attack models targeting both OPE and FH-OPE. In general, deterministic OPE leaks both the order and the frequency of plaintexts. Such leakages lead to two primary types of attacks that are based on frequency: sorting attacks and frequency-revealing attacks.

(1) Sorting Attack. This attack, presented in [24], targets columns densely encrypted with OPE, such as those for age and disease severity, where each distinct plaintext is encrypted at least once. In such scenarios, there is a one-to-one correspondence between the distinct OPE ciphertexts and the distinct plaintexts. An adversary can exploit this to reconstruct the plaintexts by mapping the ordered distinct ciphertexts to their corresponding ordered distinct plaintexts. Therefore, the success of this attack depends on the adversary’s prior knowledge of the plaintext distribution. Notably, the sorting attack only succeeds in columns with high density, where every element of the plaintext space is present; otherwise, it is likely to fail.

(2) Frequency-Revealing Attacks. These attacks, presented in [24, 26], are particularly effective against datasets with a low-density plaintext space. The frequency-analysis approach in [24] is termed a cumulative attack. For columns encrypted with OPE, the adversary can infer the frequency of each ciphertext by constructing a histogram that reflects the pattern of the encrypted data. The adversary then exploits frequency and order leakages to correlate ciphertexts with plaintexts, aiming to closely match their distributions. The success of this cumulative attack hinges on the adversary having prior knowledge of the plaintext distribution from public auxiliary information. In addition, Cao et al. [28] introduced three novel ciphertext-only attacks for FH-OPE schemes [11, 16, 20]. For conducting frequency analysis, they assume that the adversary exploits leakages from the distribution of ciphertexts and the order in which they are inserted. Specifically, they reported recovering about 96% of plaintext frequencies for Kerschbaum’s FH-OPE scheme in a nonuniform ciphertext distribution environment. They also conducted a plaintext frequency attack on [11, 20] under the assumption that the attacker is aware of the ciphertext input order.

As shown in Table 1, the attacks initially presented by [24] considered only deterministic ciphertexts, focusing on the vulnerabilities inherent to OPE schemes. While effective, these attacks are limited to scenarios where the plaintext distribution is known, which is a significant constraint. On the other hand, Cao et al. [28] expanded the scope by targeting FH-OPE schemes. Despite their progress, the applicability of their methods is marked “No,” indicating that they apply only to specific schemes. Table 1 also shows that their work focused on schemes satisfying the IND-FA-OCPA security definition presented by [16]. However, Maffei et al. demonstrated its inadequacies, proving that the original IND-FA-OCPA could not be achieved by any FH-OPE scheme. In contrast, our attacks are conducted on IND-FA-OCPA* (throughout this paper, “IND-FA-OCPA*” will be denoted simply as “IND-FA-OCPA”), a revised security model proposed by [17]. Despite these ongoing advancements and varied attacks, no research has yet shown that an adversary can conduct frequency exposure analysis without the need for additional information or constraints. Our attacks, by contrast, use only information that naturally occurs in OPE-encrypted databases. Moreover, current research has largely left vulnerabilities in the FH-OPE security model unexplored. Our study aims to address this gap by highlighting issues within the FH-OPE security model, based on the effectiveness of our attacks.

1.2. Our Contributions

Our research significantly enhances the current understanding of security vulnerabilities in FH-OPE, contributing to the existing field in several ways:
(1) We revisited the IND-FA-OCPA security model and identified significant vulnerabilities associated with range queries. This discovery questions the reliability of IND-FA-OCPA in encrypted systems, indicating a need to re-evaluate its design, especially in the context of range queries.
(2) We quantified the frequency of plaintext exposure with respect to the number of range queries executed on FH-OPE. This quantification offers insight into the risks associated with standard range queries.
(3) To validate our quantification formula, we conducted experiments with range queries on two real-world datasets, measuring the actual frequency of plaintext exposure. Furthermore, our research included the execution of an inference attack on the FH-OPE scheme.

1.3. Setting and Notations

The attacks we execute are all based on a common framework, which is detailed below, accompanied by the necessary notation. Let $n$ be the number of plaintexts to be encrypted and $N$ be the number of distinct plaintexts. Let $X$ denote the entire set of plaintexts, where $|X| = n$. Within this set, several plaintexts may be repeated. We then define another set $D$, which includes only the distinct plaintexts from set $X$, where $|D| = N$.

1.3.1. Range Queries

Suppose we have a plaintext range query $q = [a, b)$, which requests all records $x$ in the range $a \le x < b$. In an FH-OPE encrypted database, the client transforms this query into an encrypted range query as follows:
(1) The client encrypts the range endpoints $a$ and $b$ with the FH-OPE encryption function. Due to FH-OPE’s frequency-hiding property, if $a$ and $b$ are duplicated in the plaintext set $X$, each of them may correspond to multiple ciphertexts.
(2) The client selects the smallest encrypted value $c_a$ from the set of ciphertexts corresponding to $a$ and the smallest encrypted value $c_b$ from the set of ciphertexts corresponding to $b$.
(3) The client makes the encrypted range query $q'$, which requests all records in the range $c_a \le c < c_b$, where $c$ is the ciphertext in the encrypted database.
(4) The client sends $q'$ to the server.
(5) The server executes $q'$ on the encrypted database and returns the result set of $q'$ to the client.
(6) The client decrypts the result set to obtain the plaintext values that satisfy the original plaintext range query $q$.
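The six steps above can be sketched in Python. This is only a minimal illustration under stated assumptions: `toy_fhope_encrypt` is a hypothetical stand-in that assigns every record a distinct, order-preserving ciphertext with random tie-breaking (it is not the scheme of [17]), the salary values are invented, and the query is treated as half-open, with the smallest ciphertext of each endpoint used as the bound.

```python
import random

def toy_fhope_encrypt(plaintexts, seed=0, gap=10):
    """Toy stand-in for FH-OPE (an assumption, not the scheme of [17]):
    each record gets a distinct ciphertext, order is preserved, and
    equal plaintexts are ordered at random."""
    rng = random.Random(seed)
    order = sorted(range(len(plaintexts)),
                   key=lambda i: (plaintexts[i], rng.random()))
    ciphertexts = [0] * len(plaintexts)
    for rank, i in enumerate(order, start=1):
        ciphertexts[i] = rank * gap
    return ciphertexts

def client_state(plaintexts, ciphertexts):
    """Client-side map: distinct plaintext -> its smallest ciphertext."""
    smallest = {}
    for x, c in zip(plaintexts, ciphertexts):
        smallest[x] = min(c, smallest.get(x, c))
    return smallest

def encrypted_range_query(smallest, a, b):
    """Steps (1)-(3): translate the plaintext bounds into ciphertext bounds."""
    return smallest[a], smallest[b]

def server_execute(ciphertexts, lo, hi):
    """Step (5): the server filters on ciphertexts only."""
    return sorted(c for c in ciphertexts if lo <= c < hi)

# Hypothetical salary column (not taken from the paper's datasets).
salaries = [10000, 12000, 12000, 17000, 10000, 20000]
stored = toy_fhope_encrypt(salaries, seed=1)
state = client_state(salaries, stored)

lo, hi = encrypted_range_query(state, 10000, 17000)
result = server_execute(stored, lo, hi)

# Step (6): the client decrypts (here via a lookup table for the toy scheme).
lookup = dict(zip(stored, salaries))
decrypted = sorted(lookup[c] for c in result)
print(decrypted)
```

Note that the server learns the size of the result set and the two ciphertext bounds, which is exactly the information the later sections reason about.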

We consider the application of FH-OPE to an annual salary database. Given salary dataset with its corresponding ciphertexts stored on the server, the server can execute queries generated by a client. For instance, we suppose a client is interested in determining the count of salaries falling within a specific range greater than or equal to 10000 but less than 17000. The client would generate a range query reflecting this interest but in the form of ciphertexts, such as , and send it to the server. Consequently, the ciphertext set, which may be utilized for a range search, has a size of . This approach ensures that the plaintext’s frequency remains concealed during the range query because the query leverages the minimum ciphertext corresponding to the lower range bound. Consequently, the server cannot ascertain the frequency of a particular salary in the data.

1.3.2. Adversarial Model

We adopt an adversarial model characterized by a persistent, passive adversary including the server acting as an honest-but-curious adversary. This adversary can continuously observe all interactions between the client and the server. Notably, our adversary model, unlike a snapshot adversary with only single instance access to the server’s memory, continuously monitors the range queries executed by the client, identifying patterns and extracting meaningful information from these activities.

1.3.3. Mathematical Notation

In our notation, we will always use $\log$ to denote the natural rather than base 2 logarithm. The $N$-th harmonic number will be represented as $H_N$, such that $H_N = \sum_{k=1}^{N} \frac{1}{k}$.

2. Preliminaries

In this section, we formally introduce the concept of OPE and explain its core security notions, specifically IND-OCPA and IND-FA-OCPA. These foundational definitions pave the way for our following discussion on the possible threats linked with OPE in the subsequent sections.

2.1. Formal Notion of OPE

A stateful OPE is a technique that keeps a record of past operations, which is essential for improving security. The unique adaptive nature of a stateful OPE provides the necessary layer of complexity for achieving IND-OCPA security.

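Definition 1 (OPE). A stateful OPE scheme consists of the following three algorithms $(\mathsf{KeyGen}, \mathsf{Enc}, \mathsf{Dec})$:
(1) $\mathsf{st} \leftarrow \mathsf{KeyGen}(\lambda)$: The key generation algorithm takes as input a security parameter $\lambda$ and initializes a state $\mathsf{st}$.
(2) $(\mathsf{st}', y) \leftarrow \mathsf{Enc}(\mathsf{st}, x)$: The encryption algorithm takes as input a plaintext $x$ and a state $\mathsf{st}$. It outputs a ciphertext $y$ and updates the state to $\mathsf{st}'$.
(3) $x \leftarrow \mathsf{Dec}(\mathsf{st}, y)$: The decryption algorithm takes as input a ciphertext $y$ and a state $\mathsf{st}$. It outputs a plaintext $x$.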

Definition 2 (order-preserving). An OPE scheme is order-preserving if it maintains the order of plaintexts in their corresponding ciphertexts. That is, for any two plaintexts $x_1$ and $x_2$ and their corresponding ciphertexts $y_1$ and $y_2$ produced by the OPE scheme, if $x_1 < x_2$, then $y_1 < y_2$.

2.2. Security Definitions

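IND-OCPA security means that no efficient (polynomial-time bounded) adversary can distinguish between the ciphertexts of two sequences of plaintexts that are equally ordered. This concept is illustrated through a simulation. The simulation between adversary $\mathcal{A}$ and simulator $\mathcal{S}$ for security parameter $\lambda$ proceeds as follows:
(1) The adversary $\mathcal{A}$ prepares two plaintext sequences $X_0 = (x_{0,1}, \ldots, x_{0,n})$ and $X_1 = (x_{1,1}, \ldots, x_{1,n})$, where $|X_0| = |X_1| = n$ and $x_{0,i} < x_{0,j} \iff x_{1,i} < x_{1,j}$ for all $1 \le i, j \le n$, and sends them to the simulator $\mathcal{S}$.
(2) The simulator $\mathcal{S}$ randomly chooses $b \in \{0, 1\}$, executes $\mathsf{st}_0 \leftarrow \mathsf{KeyGen}(\lambda)$, and runs $(\mathsf{st}_i, y_i) \leftarrow \mathsf{Enc}(\mathsf{st}_{i-1}, x_{b,i})$ for all $1 \le i \le n$. Then, the simulator sends $(y_1, \ldots, y_n)$ to the adversary $\mathcal{A}$.
(3) The adversary $\mathcal{A}$ tries to infer which sequence has been encrypted and outputs $b'$ as their guess for $b$.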

Definition 3 (IND-OCPA). An OPE scheme has IND-OCPA security if the chance of a probabilistic polynomial time (PPT) adversary correctly guessing whether a given ciphertext corresponds to a particular plaintext sequence is only negligibly better than random guessing. In other words, the probability of $b' = b$ is $\frac{1}{2} + \epsilon(\lambda)$, where $\epsilon$ is a negligible function in the security parameter $\lambda$.

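Definition 4 (randomized order). Let us consider a sequence of not necessarily distinct plaintexts $X = (x_1, \ldots, x_n)$. We define a randomized order $\Gamma = (\gamma_1, \ldots, \gamma_n)$, representing one of the possible permutations of the set $\{1, \ldots, n\}$ that maintains the sequence of $X$. This means that for every pair of indices $i$ and $j$ where $1 \le i, j \le n$, $x_i > x_j \implies \gamma_i > \gamma_j$, and

$$\gamma_i > \gamma_j \implies x_i \ge x_j. \tag{1}$$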

For instance, we consider a plaintext sequence . It could be represented by the randomized orders , , , or . In this framework, the common randomized order for and denotes the elements shared between the two randomized order sets of and . So, for and , the common randomized order could be either or . We utilize the notation to represent the order of the elements in the sequence up to . For example, if we consider the sequence , would refer to the order of the first three elements of , resulting in the sequence .
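To make the notion of a randomized order concrete, the following sketch enumerates every permutation $\Gamma$ of $\{1, \ldots, n\}$ that respects the two conditions used in the frequency-hiding literature: $x_i > x_j$ implies $\gamma_i > \gamma_j$, and $\gamma_i > \gamma_j$ implies $x_i \ge x_j$. The function names and the sample sequences are illustrative assumptions, not the paper's own example, and brute-force enumeration is only practical for very short sequences.

```python
from itertools import permutations

def randomized_orders(xs):
    """Return all randomized orders of the plaintext sequence xs (brute force)."""
    n = len(xs)
    orders = []
    for gamma in permutations(range(1, n + 1)):
        ok = all(
            (not xs[i] > xs[j] or gamma[i] > gamma[j])        # order is preserved
            and (not gamma[i] > gamma[j] or xs[i] >= xs[j])   # ties may go either way
            for i in range(n) for j in range(n)
        )
        if ok:
            orders.append(gamma)
    return orders

def common_randomized_orders(xs0, xs1):
    """Randomized orders shared by two plaintext sequences."""
    return sorted(set(randomized_orders(xs0)) & set(randomized_orders(xs1)))

# Hypothetical sequences chosen for illustration.
print(randomized_orders([1, 2, 2, 3]))
print(common_randomized_orders([1, 1, 2, 3], [1, 2, 2, 3]))
```

The number of randomized orders of a sequence equals the product of the factorials of its plaintext frequencies, since only equal plaintexts can be permuted among themselves.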

To satisfy the IND-FA-OCPA security notion and protect against frequency analysis attacks, Maffei et al. proposed an enhanced encryption method known as augmented order-preserving encryption. Therefore, we employ the augmented OPE scheme proposed by [17].

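Definition 5 ((augmented) OPE). An augmented OPE scheme consists of the following three algorithms $(\mathsf{KeyGen}, \mathsf{Enc}, \mathsf{Dec})$:
(i) $\mathsf{st} \leftarrow \mathsf{KeyGen}(\lambda)$: The key generation algorithm takes as input a security parameter $\lambda$ and initializes a state $\mathsf{st}$.
(ii) $(\mathsf{st}', y) \leftarrow \mathsf{Enc}(\mathsf{st}, x, \Gamma)$: The encryption algorithm takes as input a plaintext $x$, a state $\mathsf{st}$, and an order $\Gamma$. It outputs a ciphertext $y$ and updates the state to $\mathsf{st}'$.
(iii) $x \leftarrow \mathsf{Dec}(\mathsf{st}, y)$: The decryption algorithm takes as input a ciphertext $y$ and a state $\mathsf{st}$. It outputs a plaintext $x$.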

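In the context of FH-OPE, IND-FA-OCPA security extends the IND-OCPA definition to withstand frequency analysis attacks. This is defined using a simulation in which an adversary $\mathcal{A}$ interacts with a simulator $\mathcal{S}$. The simulation for security parameter $\lambda$ proceeds as follows:
(1) The adversary $\mathcal{A}$ prepares two plaintext sequences $X_0 = (x_{0,1}, \ldots, x_{0,n})$ and $X_1 = (x_{1,1}, \ldots, x_{1,n})$. These sequences have at least one common randomized order $\Gamma$. These are sent to the simulator $\mathcal{S}$.
(2) The simulator $\mathcal{S}$ randomly chooses $b \in \{0, 1\}$ and one of the common randomized orders $\Gamma$, executes $\mathsf{st}_0 \leftarrow \mathsf{KeyGen}(\lambda)$, and runs $(\mathsf{st}_i, y_i) \leftarrow \mathsf{Enc}(\mathsf{st}_{i-1}, x_{b,i}, \Gamma)$ for all $1 \le i \le n$, based on the selected $\Gamma$. Then, the simulator sends $(y_1, \ldots, y_n)$ to the adversary $\mathcal{A}$.
(3) The adversary $\mathcal{A}$ tries to infer which sequence has been encrypted and outputs $b'$ as their guess of $b$.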

Definition 6 (IND-FA-OCPA). An (augmented) OPE scheme has IND-FA-OCPA security if the chance of a probabilistic polynomial time (PPT) adversary correctly guessing whether a given ciphertext corresponds to a particular plaintext sequence is only negligibly better than random guessing. In other words, the probability of $b' = b$ is $\frac{1}{2} + \epsilon(\lambda)$, where $\epsilon$ is a negligible function in the security parameter $\lambda$.

3. Exposing Vulnerability in the Security Model

In this section, we expose a potential vulnerability in the IND-FA-OCPA security model. This vulnerability arises when an adversary merely observes the outcomes of range queries. Although these outcomes are typically accessible to any system user and would not be expected to grant any additional advantage, this seemingly harmless activity can still weaken the security offered by IND-FA-OCPA.

Let $X_0$ and $X_1$ be two plaintext sequences. $F_0$ and $F_1$ are defined as sequences that represent the frequency of each unique plaintext in $X_0$ and $X_1$, respectively, when the unique plaintexts are sorted in ascending order. More formally, if $p_i$ denotes the $i$-th unique plaintext in ascending order in a sequence $X_b$ and $f_b(p_i)$ denotes the frequency of $p_i$ in sequence $X_b$, then $F_b = (f_b(p_1), \ldots, f_b(p_{N_b}))$ for $b \in \{0, 1\}$. We use $N_b$ to denote $|F_b|$ and denote $N$ as the greater value between $N_0$ and $N_1$. When $F_0$ and $F_1$ have different frequencies at some index $i$, we define $f_b(p_i)$ as the frequency of $p_i$ in $X_b$ for $b \in \{0, 1\}$.

Theorem 7. Given two plaintext sequences $X_0$ and $X_1$, which share a common randomized order $\Gamma$ and have their frequency distributions $F_0$ and $F_1$, respectively, an adversary $\mathcal{A}$ that can observe the outcomes of range queries can distinguish which sequence has been selected by the simulator with a probability of at least $1 - \left(1 - \frac{1}{N}\right)^{2q}$ after $q$ queries have been processed, where the frequency distributions $F_0 \neq F_1$ and $N$ is the larger of the numbers of distinct plaintexts in $X_0$ and $X_1$.

Proof. Consider an adversary $\mathcal{A}$ that observes the outcomes of the range queries made by some entity (not necessarily $\mathcal{A}$). For each range query $(c_a, c_b)$, where $c_a$ and $c_b$ denote the lower and upper bounds of the encrypted range, respectively, there is a chance that either $c_a$ or $c_b$ matches an encrypted plaintext corresponding to $p_i$, the unique plaintext at an index $i$ where the frequency distributions $F_0$ and $F_1$ differ. If the ciphertext corresponding to $p_i$ is included in the lower and upper bounds of the range query, then the number of ciphertexts returned as a result of the range query is different for each of the sequences $X_0$ and $X_1$. The adversary can distinguish the sequence selected by the simulator by observing this difference in the number of returned ciphertexts. The probability of being able to distinguish $X_0$ from $X_1$ after observing $q$ queries can be computed as follows: The probability that a single selected bound is not the ciphertext corresponding to $p_i$ is $1 - \frac{1}{N}$, so the probability of a range query not including the ciphertext corresponding to $p_i$ in either of the two selected values in a query is given by $\left(1 - \frac{1}{N}\right)^2$. Therefore, the probability of the ciphertext not being included in any of the $2q$ selections is $\left(1 - \frac{1}{N}\right)^{2q}$. The probability of the ciphertext being included in at least one of the $2q$ selections is then given by $1 - \left(1 - \frac{1}{N}\right)^{2q}$. Hence, with $q$ queries (equivalent to $2q$ selections), $\mathcal{A}$ can distinguish between the sequences with probability $1 - \left(1 - \frac{1}{N}\right)^{2q}$.
Theorem 7 is illustrated with a practical example. Suppose we have two plaintext sequences: and . The frequency distributions of and are represented as and , respectively. Here, the frequencies of the distinct plaintext “1” are different in and , i.e., and , so we consider the index where and differ. These sequences share a common randomized order . We assume that the simulator selects the sequence for encryption, resulting in the ciphertext sequence . When an entity issues a range query that includes the ciphertext corresponding to plaintext “1” as either the upper or lower bound, the adversary , who can observe the results of such queries, will see two ciphertexts being returned. From this observation, can infer that plaintext “1” appears twice in the selected sequence. By comparing this frequency with the frequency distributions and , can correctly conclude that is the sequence encrypted by the simulator is . Thus, the adversary successfully distinguishes the chosen plaintext sequence.
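The distinguishing step can be mimicked with a small Python sketch. Everything here is an illustrative assumption: the two sequences are hypothetical ones that share the randomized order $(1, 2, 3, 4)$ but differ in the frequency of plaintext 1, `toy_encrypt` is a stand-in whose ciphertexts depend only on the randomized order, and the observed query is assumed to use the smallest ciphertexts of the two smallest distinct plaintexts as its bounds.

```python
def toy_encrypt(gamma, gap=10):
    """Toy augmented-OPE stand-in (assumption): ciphertexts depend only on
    the randomized order, so both sequences yield identical ciphertexts."""
    return [g * gap for g in gamma]

def min_ciphertexts(plaintexts, ciphertexts):
    """Map each distinct plaintext to its smallest ciphertext."""
    smallest = {}
    for x, c in zip(plaintexts, ciphertexts):
        smallest[x] = min(c, smallest.get(x, c))
    return smallest

def observed_result_size(plaintexts, gamma, lower, upper):
    """Number of ciphertexts returned for the query lower <= x < upper."""
    ciphertexts = toy_encrypt(gamma)
    smallest = min_ciphertexts(plaintexts, ciphertexts)
    lo, hi = smallest[lower], smallest[upper]
    return sum(1 for c in ciphertexts if lo <= c < hi)

def adversary_guess(result_size, x0, x1, lower):
    """Match the observed size against the frequency of `lower` in each sequence."""
    f0, f1 = x0.count(lower), x1.count(lower)
    if result_size == f0 and result_size != f1:
        return 0
    if result_size == f1 and result_size != f0:
        return 1
    return None  # the observation does not separate the two sequences

X0 = [1, 1, 2, 3]      # hypothetical: plaintext 1 appears twice
X1 = [1, 2, 2, 3]      # hypothetical: plaintext 1 appears once
GAMMA = (1, 2, 3, 4)   # a randomized order common to X0 and X1

for b, chosen in enumerate((X0, X1)):
    size = observed_result_size(chosen, GAMMA, lower=1, upper=2)
    print(b, size, adversary_guess(size, X0, X1, lower=1))
```

Although the ciphertext sequences are identical in both cases, the size of the returned result set differs, which is the signal the adversary uses.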

Remark 8. This theorem underscores a key limitation of the IND-FA-OCPA security model. While it provides security against adversaries attempting to derive information from individual plaintext-ciphertext pairs, it does not fully protect the frequency patterns of these pairs. Consequently, it does not ensure frequency hiding. An adversary, just by observing the outcomes of range queries, could potentially infer the frequency of specific plaintexts within a given range. This vulnerability highlights that the IND-FA-OCPA security model may still be susceptible to frequency analysis attacks.

4. Quantifying Frequency Exposure

We now shift focus to quantifying frequency exposure. The objective of this section is not merely to measure the degree of frequency exposure of plaintexts in an FH-OPE scheme through range queries but to highlight the urgent need for further research into novel range-query methods that could mitigate this exposure.

4.1. Demonstrating Frequency Exposure Using the Coupon Collector’s Problem

Due to its complexity, quantifying the level of frequency exposure in FH-OPE is a challenge. However, we can leverage mathematical problems, such as the coupon collector’s problem, to measure this exposure. This problem, where the goal is to collect distinct coupons through independent trials, parallels the process of conducting range queries on an encrypted database. In both scenarios, the aim is to acquire unique elements (or data) from a larger set. A mathematical model inspired by the coupon collector’s problem, which estimates the number of range queries required to expose the frequency of all plaintexts in an FH-OPE scheme, is presented below.

Theorem 9. To reveal the frequency of all plaintexts in our scenario with FH-OPE, the required number of range queries, denoted as $Q$, is given by the approximation:

$$Q \approx \frac{N \log N}{2}, \tag{2}$$

where $N$ is the number of distinct plaintexts.

Proof. In the coupon collector’s problem, the expected number of trials $T$ for collecting all $N$ distinct coupons is given by the formula:

$$E[T] = N \cdot H_N, \tag{3}$$

where $H_N$ represents the $N$-th harmonic number, roughly equal to $\log N$.
Comparing our scenario with FH-OPE to the coupon collector’s problem, we consider each of the $N$ distinct ciphertexts that can serve as a range-query bound as a unique coupon and our range queries as trials to collect these coupons. A range query, denoted as $(c_a, c_b)$, parallels the action of selecting two coupons simultaneously.
Therefore, the required number of range queries, denoted as $Q$, is approximated by $\frac{N \cdot H_N}{2} \approx \frac{N \log N}{2}$.
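As a rough numerical check of this estimate, the sketch below evaluates both the harmonic-number form $N \cdot H_N / 2$ and the logarithmic approximation $N \log N / 2$ for the distinct-value counts reported for the two datasets in Section 5.1 (1677 and 40). The function names are illustrative, and the model assumes every query bound is drawn uniformly and independently.

```python
import math

def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def queries_harmonic(n_distinct):
    """Coupon-collector estimate with two 'coupons' drawn per range query."""
    return n_distinct * harmonic(n_distinct) / 2

def queries_log(n_distinct):
    """Logarithmic approximation, using the natural logarithm."""
    return n_distinct * math.log(n_distinct) / 2

for name, n_distinct in (("Dataset 1", 1677), ("Dataset 2", 40)):
    print(name,
          round(queries_harmonic(n_distinct)),
          round(queries_log(n_distinct)))
```

Because $H_N$ exceeds $\log N$ by roughly the Euler-Mascheroni constant, the logarithmic form slightly underestimates the harmonic-number form.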

4.2. Probabilistic Analysis of Plaintext Frequency Exposure

In the preceding subsection, we examined the frequency exposure of plaintext through a method similar to the coupon collector’s problem. Now, we shift our perspective to a probabilistic analysis of the frequency exposure of plaintext, focusing on the expected number of unique plaintexts revealed after executing a specific number of range queries.

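Theorem 10. The expected number of distinct ciphertexts selected after $q$ queries (or $2q$ selections), denoted as $E[D_q]$, is given by the equation:

$$E[D_q] = N\left(1 - \left(1 - \frac{1}{N}\right)^{2q}\right), \tag{4}$$

where $N$ is the number of distinct plaintexts.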

Proof. We begin by noting that the probability of not selecting a particular ciphertext in a selection can be represented as

$$1 - \frac{1}{N}. \tag{5}$$

Then, the probability of not selecting a particular ciphertext in any of the $2q$ selections is

$$\left(1 - \frac{1}{N}\right)^{2q}. \tag{6}$$

Hence, the probability of selecting the ciphertext in at least one of the $2q$ selections is

$$1 - \left(1 - \frac{1}{N}\right)^{2q}. \tag{7}$$

Substituting this into the equation for the expected number of distinct ciphertexts selected after $q$ queries (or $2q$ selections), we get our result:

$$E[D_q] = N\left(1 - \left(1 - \frac{1}{N}\right)^{2q}\right). \tag{8}$$

Now, we consider an example where and . The expected number of distinct ciphertexts selected after range queries is . These interpretations help us probabilistically quantify the vulnerability of plaintext frequencies in FH-OPE after executing a specific number of range queries.
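The closed form $N\left(1 - (1 - 1/N)^{2q}\right)$ can be compared against a small Monte Carlo simulation. The sketch below is only an illustration: the values $N = 40$ and $q = 20$ are chosen here for demonstration and are not the paper's example, and each of the $2q$ selections is assumed to be uniform and independent.

```python
import random

def expected_distinct(n_distinct, q):
    """Closed-form expectation after q queries (2q uniform selections)."""
    return n_distinct * (1 - (1 - 1 / n_distinct) ** (2 * q))

def simulate_distinct(n_distinct, q, trials=2000, seed=1):
    """Average number of distinct ciphertexts hit over repeated random trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = {rng.randrange(n_distinct) for _ in range(2 * q)}
        total += len(seen)
    return total / trials

N_DEMO, Q_DEMO = 40, 20   # illustrative values, not taken from the paper
print(expected_distinct(N_DEMO, Q_DEMO))
print(simulate_distinct(N_DEMO, Q_DEMO))
```

Under these assumptions the simulated average should stay close to the closed form, and the agreement tightens as the number of trials grows.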

5. Experiments

To demonstrate the potential vulnerability within the context of FH-OPE in practical settings, we performed extensive range queries on real-world datasets. These experiments were conducted on a desktop computer with AMD Ryzen 7 PRO 4750G (3.60 GHz, 16 GB RAM) running Windows 11. All of the experimental code was written in Python 3.9.7.

5.1. Datasets and Preprocessing

For our experiments, we employed two real-world datasets: Dataset 1 and Dataset 2. Dataset 1, sourced from Allegheny County Employee Salaries (https://catalog.data.gov/dataset/allegheny-county-employee-salaries), contains salary information of the kind that is frequently encrypted in real-world scenarios to ensure privacy. This dataset has 6280 entries with 1677 distinct values, i.e., $n = 6280$ and $N = 1677$. Dataset 2 contains weight information of Women’s National Basketball Association (WNBA) players (https://www.kaggle.com/datasets/jinxbe/wnba-player-stats-2017). It consists of 143 data points with 40 distinct values, i.e., $n = 143$ and $N = 40$. We encrypted both datasets using the FH-OPE scheme [17] to preserve privacy, enabling us to measure the frequency of different data points without risking data leakage.

5.2. Experiment Setup

We designed our experiments to evaluate the FH-OPE scheme under two different query scenarios. The first scenario is characterized by a uniformly random range query, where all data are equally likely to be queried. In contrast, the second scenario uses a weighted range-query model that assumes a Gaussian distribution, thereby giving preferential consideration to areas likely to be queried more frequently by users.

5.3. Experiment Results
5.3.1. Uniform Random Range Query

In the uniform random range-query scenario, we aimed to verify the theoretical results obtained in previous sections, with a particular focus on those associated with equation (2) and experimented on Dataset 1. Figure 1 illustrates the plaintext frequency exposure corresponding to a varying number of range queries. is defined as the number of range queries corresponding to different proportions of equation (2). Here, corresponds to , corresponds to , corresponds to , corresponds to , and corresponds to . As can be observed from Figure 1, with a number of queries equivalent to equation (2), we effectively reveal almost all plaintext frequencies. In addition, Figure 1 also shows the fact that executing fewer queries than the size of equation (2) still leads to a considerable degree of plaintext exposure.

5.3.2. Gaussian Range Query

To generate Gaussian range queries, we utilized the properties of Dataset 2, which naturally follows a Gaussian distribution. The mean and standard deviation of Dataset 2 are approximately 79.02 and 10.96, respectively. Utilizing these parameters, we generated synthetic data through random sampling from a Gaussian distribution. This procedure yielded a set of representative data points that could be used to execute a range of queries reflecting the natural distribution of the dataset. By applying equation (4) to the Gaussian range query, we calculated the expected number of distinct ciphertexts selected as a result of the chosen .
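The two query models can be sketched as follows. This is a minimal illustration rather than the experimental code: `VALUES` is a hypothetical set of 40 distinct weights standing in for Dataset 2, each Gaussian draw (using the mean 79.02 and standard deviation 10.96 reported above) is snapped to the nearest distinct value, and a distinct value counts as exposed once it has served as a bound of some range query.

```python
import random

# Hypothetical stand-in for the 40 distinct weights of Dataset 2.
VALUES = list(range(60, 100))

def nearest(values, x):
    """Snap a real-valued sample to the closest distinct value."""
    return min(values, key=lambda v: abs(v - x))

def uniform_bound(rng):
    """Uniform model: every distinct value is equally likely to be a bound."""
    return rng.choice(VALUES)

def gaussian_bound(rng):
    """Gaussian model: bounds concentrate around the reported mean."""
    return nearest(VALUES, rng.gauss(79.02, 10.96))

def exposed_fraction(q, draw_bound, seed=0):
    """Fraction of distinct values used as a bound in at least one of q queries."""
    rng = random.Random(seed)
    exposed = set()
    for _ in range(q):
        a, b = draw_bound(rng), draw_bound(rng)
        exposed.update((min(a, b), max(a, b)))   # lower and upper bound
    return len(exposed) / len(VALUES)

for q in (10, 40, 200):
    print(q, exposed_fraction(q, uniform_bound), exposed_fraction(q, gaussian_bound))
```

With a fixed seed the first queries of a longer run coincide with those of a shorter run, so the exposed fraction in this sketch never decreases as $q$ grows.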

From Figure 2, it is clear that even with a smaller number of queries, a significant degree of plaintext exposure is possible. Also, despite the differences in query distribution between uniform random range queries and Gaussian range queries, the outcomes do not exhibit a significant disparity. Our results underline the need for improved methods of query processing in FH-OPE to minimize such data exposure and enhance its overall security.

6. Further Exploitations: Inference Attacks Using Join Queries and Association Rule Mining

The frequency-exposure vulnerability of FH-OPE and the weakness of the IND-FA-OCPA security model have been demonstrated through range queries. However, beyond range queries, more complex types of queries, such as join queries, may also expose additional vulnerabilities in FH-OPE. While individual columns encrypted using FH-OPE might appear secure in isolation, joining multiple tables can expose new relationships between data items, thereby allowing sensitive information to be deduced. To illustrate, we consider a hypothetical scenario involving a healthcare database encrypted with FH-OPE, consisting of two main tables:
(1) Patients: this table includes fields like “PatientID,” “Age Group,” and “Gender”
(2) Treatments: this table contains fields such as “PatientID,” “Lengths of Stay,” and “Diagnosis Code”

The sensitive numerical fields “Lengths of Stay” and “Diagnosis Code” are encrypted using the FH-OPE scheme. We assume an adversary has knowledge about the frequency distribution of patients’ lengths of stay from their initial range queries and targets a specific “age group,” for instance, “30–49.” The adversary could execute the following SQL join query:
SELECT Patients.Age_Group, Patients.Gender, Treatments.Lengths_of_Stay, Treatments.Diagnosis_Code
FROM Patients
JOIN Treatments ON Patients.PatientID = Treatments.PatientID
WHERE Patients.Age_Group = '30-49'

The result of this join query allows the adversary to generate frequency distributions of the encrypted “Lengths of Stay” and “Diagnosis Code” for the targeted age group. Identifying correlations between “Lengths of Stay” and “Diagnosis Codes” may enable the adversary to infer that certain diseases correspond to longer hospital stays for patients within this age group. If the adversary has prior knowledge or assumptions about disease prevalence and average hospital stays, they could potentially deduce further information from the encrypted diagnosis codes, revealing a risk to privacy even within securely encrypted fields.
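The join-based inference can be reproduced on a toy database with Python's built-in `sqlite3` module. All rows, the integer "ciphertexts", and the bucket boundaries below are hypothetical: the boundaries stand in for the minimum ciphertexts an adversary is assumed to have learned from earlier range queries, which let the adversary group distinct ciphertexts that share a plaintext.

```python
import sqlite3
from bisect import bisect_right
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patients (PatientID INTEGER PRIMARY KEY, Age_Group TEXT, Gender TEXT);
CREATE TABLE Treatments (PatientID INTEGER, Lengths_of_Stay INTEGER, Diagnosis_Code INTEGER);
""")
# Hypothetical rows; the Treatments columns hold toy FH-OPE ciphertexts.
conn.executemany("INSERT INTO Patients VALUES (?, ?, ?)",
                 [(1, "30-49", "F"), (2, "30-49", "M"),
                  (3, "50-69", "F"), (4, "30-49", "F")])
conn.executemany("INSERT INTO Treatments VALUES (?, ?, ?)",
                 [(1, 410, 9120), (2, 415, 9125), (3, 120, 3300), (4, 420, 9130)])

rows = conn.execute("""
SELECT Patients.Age_Group, Patients.Gender,
       Treatments.Lengths_of_Stay, Treatments.Diagnosis_Code
FROM Patients
JOIN Treatments ON Patients.PatientID = Treatments.PatientID
WHERE Patients.Age_Group = '30-49'
""").fetchall()

# Assumed prior leakage: minimum ciphertext of each distinct plaintext.
STAY_BOUNDS = [100, 400]
DIAG_BOUNDS = [3000, 9000]

def bucket(bounds, ciphertext):
    """Index of the distinct plaintext whose ciphertext range contains the value."""
    return bisect_right(bounds, ciphertext) - 1

# Joint frequency of (length-of-stay bucket, diagnosis bucket) in the age group.
joint = Counter((bucket(STAY_BOUNDS, s), bucket(DIAG_BOUNDS, d))
                for _, _, s, d in rows)
print(joint)
```

In this toy data every patient in the targeted age group falls into the same pair of buckets, which is the kind of correlation the adversary could combine with background knowledge.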

Moreover, adversaries can use the ordered nature of encrypted data in FH-OPE to perform advanced inference attacks, such as association rule mining [29]. This machine learning technique can reveal relations between different fields in a database that are not directly linked but share common attributes. For example, given the adversary’s knowledge about the frequency distribution of patients’ ages and lengths of stay, they could potentially discover rules like “if a patient is in the “X” age group and stays for “Y” days, they likely have a “Z” diagnosis.” These associations can be found even when the diagnosis is encrypted using FH-OPE, emphasizing the serious privacy implications. To provide a deeper understanding of these potential threats, we present further extensive experiments in which our goal was to demonstrate that association rule mining attacks on real-world datasets can reveal sensitive information within a healthcare context, even when data are encrypted using the FH-OPE scheme.

6.1. Experiments on Further Exploitations
6.1.1. Datasets and Preprocessing

We utilized a real-world dataset originating from the New York State Department of Health, specifically the “Hospital Inpatient Discharges (SPARCS De-Identified)” dataset available at https://health.data.ny.gov. This dataset includes patient information such as patient ID, age group, gender, lengths of stay, and CCS diagnosis code. The lengths of stay and CCS diagnosis code fields were encrypted using the FH-OPE scheme. Using join queries on this dataset, we focused on a subset of patient data: we obtained the patient IDs of those in the long-term inpatients category, that is, patients with lengths of stay exceeding 120 days. We assumed that an adversary had executed several range queries, thereby acquiring knowledge about the frequency distribution of lengths of stay. Figure 3 includes histograms that depict the distribution of lengths of stay and the CCS diagnosis code for patients within the 120+ days category.

Subsequently, we undertook a series of experiments on these encrypted datasets, each characterized by a different number of range queries executed on the encrypted data. Specifically, we applied four different volumes of range queries to the CCS diagnosis code, generating four distinct datasets with varying levels of exposed ciphertexts (see Table 2). The primary purpose of our experiment is to assess the impact of varying levels of exposed ciphertexts on the quality of association rule mining. Therefore, we employed association rule mining on each of these datasets, as detailed in Table 2, and contrasted the outcomes.

6.1.2. Association Rule Mining Results on Datasets

To explore the impact of varying levels of exposed ciphertexts on the quality of association rule mining, we applied the Apriori algorithm [29] in Python to each of the four datasets defined in Table 2. The dataset in which the frequency of ciphertexts is 100% exposed can essentially be considered the same as a plaintext dataset. Table 3 displays the rules yielded by applying the Apriori algorithm to the Hospital Inpatient Discharges (SPARCS De-Identified) dataset. For these results, the minimum support was set to 0.01, and the minimum confidence threshold was set to 0.6. These rules provide a baseline with which we can compare the association rule mining results for the remaining datasets.
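For readers unfamiliar with the mining step, the following is a minimal, standard-library sketch of Apriori with the support, confidence, and lift metrics. It is not the implementation used in our experiments: the function names, the five toy transactions, and the item labels (which stand in for exposed groups of ciphertexts) are assumptions made for illustration. Only the thresholds 0.01 and 0.6 are taken from the text.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep only candidates whose support reaches the threshold.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            cand
            for cand in joined
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                consequent = itemset - antecedent
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    rules.append((antecedent, consequent, support, confidence, lift))
    return rules


# Hypothetical transactions: one per patient, items are exposed group labels.
transactions = [
    {"age=30-49", "stay=long", "diag=A"},
    {"age=30-49", "stay=long", "diag=A"},
    {"age=30-49", "stay=short", "diag=B"},
    {"age=50-69", "stay=long", "diag=A"},
    {"age=50-69", "stay=short", "diag=B"},
]
frequent = apriori(transactions, min_support=0.01)
rules = association_rules(frequent, min_confidence=0.6)
for antecedent, consequent, support, confidence, lift in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence, lift)
```

Note that the miner never needs the plaintext values: it only needs to know which records fall into the same group, which is the information that exposed frequencies provide.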

Figures 4(a)–4(d) show the outcomes of association rule mining on the four datasets with the Apriori algorithm, maintaining a minimum support of 0.01 and a minimum confidence threshold of 0.05. Each figure is a scatterplot displaying the distribution of the rule metrics: support, confidence, and lift. The graphs for the three most exposed datasets depict strikingly similar distributions, indicating that the rules generated for these datasets are nearly identical. This similarity suggests that despite reducing the amount of exposed ciphertext (from 100% to 92% and further down to 72%), the quality of association rule mining did not significantly degrade. The rules mined under these conditions are, therefore, almost as insightful as those obtained from plaintext data. In contrast, the results for the least exposed dataset, where the frequency exposure is less than 50%, are clearly differentiated from those of the other datasets. Despite the dissimilar graph shape, strong rules that were found in the more exposed datasets were also identified to some extent in this dataset.

Through our experimentation, we found that even when ciphertext exposure is reduced from 100% to 49%, an adversary can still extract significant information from encrypted data. This insight is critical, as it shows that FH-OPE is vulnerable to inference attacks that allow attackers to discern patterns and gather meaningful information. Notably, the derived association rules remained informative despite the reduced data exposure. Furthermore, our findings indicate that even a limited number of range queries can allow an adversary to launch successful attacks, thus compromising the security of FH-OPE encrypted data.

7. Conclusion

This study presented a comprehensive analysis of FH-OPE, specifically its vulnerability to frequency exposure through range queries. Our findings reveal an overlooked aspect of the IND-FA-OCPA security model: it does not consider the vulnerabilities introduced by range queries. We quantified frequency exposure using principles from the coupon collector’s problem and probabilistic analyses, determining the number of range queries necessary to reveal the frequency of all plaintexts and estimating the expected number of unique ciphertexts revealed after a given number of range queries. Furthermore, our exploration goes beyond straightforward threats, considering more complex query types such as join queries and advanced attacks through association rule mining. Experimental analyses on real-world datasets have provided concrete evidence of these vulnerabilities, highlighting the risks associated with practical implementations of FH-OPE. Our results quantify the level of exposure risk and demonstrate the potential for inference attacks. Ultimately, our findings highlight that FH-OPE requires a more comprehensive security model that adequately addresses the risks posed by range queries. Further research is needed to develop new range-query methods that resist the vulnerabilities identified in this study. We hope our research contributes to a deeper understanding of the security challenges associated with FH-OPE.

Data Availability

The data that support the findings of this study are openly available in Allegheny County Employee Salaries at https://catalog.data.gov/dataset/allegheny-county-employee-salaries and Women’s National Basketball Association at https://www.kaggle.com/datasets/jinxbe/wnba-player-stats-2017.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2022R1F1A1062693).