Abstract

With the continuous development of the Internet of Things (IoTs), data security and privacy protection in the IoTs are becoming increasingly important. To address the large volume, redundancy, and heterogeneity of data in the IoTs, this paper proposes a ranked searchable encryption scheme based on an access tree. First, the scheme introduces parameters such as the word position and word span into the calculation of keyword relevance scores to build a more accurate document index. Second, the scheme builds a semantic relationship graph based on mutual information to expand the query semantics, improving precision and recall during retrieval. Third, the scheme uses an access tree to control user authority, giving data owners in the IoTs fine-grained access management over their data. Finally, a security analysis of the scheme and an efficiency comparison with existing schemes are given.

1. Introduction

With the rapid development of IoTs technology, it has been adopted in fields such as medical care, smart transportation, government work, smart homes, and environmental monitoring [1–5]. At the same time, the volume of information generated by users is growing by orders of magnitude. Cloud storage is widely used because of its low cost and good scalability, and it addresses the storage and management of the electronic data produced by the IoTs. However, frequent privacy data leakage incidents have caused severe social impacts and disrupted economic development [6–8]. Therefore, protecting user privacy and data security has become a technical bottleneck restricting the further development of IoT applications [9–13]. An effective way to prevent data privacy leakage is to encrypt data before storing it on a cloud server. This prevents unauthorized servers from accessing user data and also protects user data when the server is attacked. However, because the cloud server stores only encrypted data, which no longer retains the structure of the plaintext, the server cannot directly return the data that a user searches for. The simplest workaround is to download all encrypted data, decrypt it locally, and then search. This method does not use the computing power of the cloud server and wastes a great deal of time, bandwidth, and power, so it cannot meet the practical needs of cloud storage in the IoTs. Therefore, how to securely retrieve ciphertext data is an urgent problem to be solved in the IoTs [14].

Searchable encryption (SE) is an encryption technology that supports keyword retrieval over ciphertext while ensuring that attackers cannot learn the keywords queried by users from keyword ciphertexts or search trapdoors [15]. At present, searchable encryption technologies mainly include symmetric searchable encryption (SSE) and asymmetric searchable encryption (ASE) [16, 17]. In 2000, Song et al. [18] proposed a single-keyword searchable encryption scheme based on symmetric key encryption, which finds the ciphertext of a keyword by linearly scanning the entire ciphertext document. In 2004, Boneh et al. [19] proposed an asymmetric searchable encryption scheme, Public-Key Encryption with Keyword Search (PEKS), for mail routing scenarios. Since then, researchers have built extensively on these two schemes.

In practical applications, users are usually more concerned with finding the top K documents most relevant to multiple keywords. To meet this demand, various multikeyword ranked searchable encryption schemes have been proposed in recent years. In 2011, Cao et al. [20] used the secure KNN technique [21] to propose the first multikeyword ranked searchable encryption scheme based on vector inner products. The scheme represents each document and each query as a 0/1 vector, and the relevance score of a document is the number of positions in which both vectors have the value 1. However, this scheme does not consider the importance of different keywords in the document. Sun et al. [22] therefore extended the scheme by introducing keyword weights when constructing document vectors and query vectors and by measuring relevance with the cosine of the two vectors, which improves the accuracy of ranking.

The ranked searchable encryption schemes above all focus on exact keyword search and do not take the semantic expansion of keywords into account, so many documents that meet the query conditions are not retrieved. Yang et al. [23] proposed a fast multikeyword semantic ranked search scheme, which introduces domain-weighted scoring into document scoring and semantically expands the search keywords to improve the accuracy of the document index. In addition, the document vector is divided into blocks to filter out a large number of irrelevant documents, which improves the efficiency of the scheme. However, this scheme does not involve access control: it supports queries from only a single legitimate user and does not suit practical environments in which keywords are queried by multiple users. Sun et al. [24] proposed an attribute-based keyword search scheme, which returns only authorized documents to search users. However, the search results cannot be ranked. Li et al. [25] proposed an authorized multikeyword ranked search scheme over encrypted cloud data using attribute-based encryption and symmetric searchable encryption. This scheme provides file confidentiality, trapdoor unlinkability, and resistance to collusion attacks. Moreover, the scheme allows the same data to be queried by multiple users, but it does not consider the semantic relevance of search keywords.

Therefore, a searchable encryption scheme for the cloud storage environment of the IoTs must not only protect data privacy to achieve secret search but also keep the search efficient. At the same time, it must support search by multiple users, which is typical of the IoT cloud storage environment. In the scheme proposed in this paper, an access tree is used to set user access permissions, so that only authorized users can search cloud data and obtain the K most relevant documents. In addition, the document index is encrypted with the secure KNN method to ensure the security and correctness of index creation and trapdoor generation.

The main contributions of this paper are as follows:

(1) This paper introduces parameters such as word position and word span into the calculation of the relevance score of keywords and assigns more accurate weights to keywords at different positions in the document, thereby constructing a more accurate document index.

(2) This paper builds a semantic relationship graph based on mutual information to expand the query vector semantics, which effectively improves the precision and recall during retrieval.

(3) This paper combines multikeyword search with access control. The access tree is used to control user access rights. Only users whose attributes meet the access policy defined by the data owner can search encrypted data with multiple keywords, which gives data owners in the IoTs fine-grained access management over their data.

2. Preliminaries

2.1. Vector Space Model

The vector space model [23] represents all documents of a collection in the same vector space. Each document corresponds to a document vector whose dimension equals the size of the keyword collection. Each dimension of the vector corresponds to one keyword, and its value is the weight of that keyword in the document. The user's query is also regarded as a vector in the same space, called the query vector. The query vector has the same dimension as the document vectors, and each of its dimensions corresponds to the same keyword as in the document vectors. The relevance score of a query and a document is the inner product of the document vector and the query vector.
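The following minimal sketch illustrates the inner-product scoring described above. The keyword dictionary, the document weights, and the 0/1 query vector are hypothetical values chosen for illustration and are not taken from the paper.

```python
def relevance(doc_vec, query_vec):
    """Relevance score: inner product of a document vector and a query vector."""
    if len(doc_vec) != len(query_vec):
        raise ValueError("vectors must share the same keyword dictionary")
    return sum(d * q for d, q in zip(doc_vec, query_vec))


# Hypothetical keyword dictionary; one vector dimension per keyword.
dictionary = ["sensor", "cloud", "privacy", "traffic"]

# Hypothetical keyword weights per document (0.0 means the keyword is absent).
doc_vectors = {
    "d1": [0.5, 0.0, 0.75, 0.0],
    "d2": [0.0, 0.25, 0.5, 0.0],
    "d3": [0.25, 0.0, 0.0, 1.0],
}


def build_query(keywords):
    """Query vector in the same space: 1.0 for queried keywords, else 0.0."""
    return [1.0 if w in keywords else 0.0 for w in dictionary]


query = build_query({"sensor", "privacy"})
scores = {name: relevance(vec, query) for name, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `ranking` orders the documents by decreasing relevance, which is the plaintext counterpart of the ranked search that the scheme later performs over encrypted vectors.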

2.2. Word Span

Word span [19] refers to the distance between the first and last occurrence of a word or phrase in the document. The larger the word span, the more important the word is to the topic of the document. The word span can effectively reduce the impact of local keywords on document keyword extraction, because local keywords often become keywords in the entire document due to their high-frequency advantages, reducing the accuracy of keyword extraction. The word span formula is shown in formula (1).

Here, $first_{ij}$ is the position at which keyword $w_j$ first appears in document $d_i$, $last_{ij}$ is the position at which $w_j$ last appears in $d_i$, and $sum_i$ is the total number of keywords in $d_i$ obtained after word segmentation.
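As a rough illustration of the word span, the sketch below assumes one common form of the measure, (last position - first position + 1) divided by the total number of tokens. This exact form, the function name, and the sample tokens are assumptions made for illustration; the paper's formula (1) may differ in detail.

```python
def word_span(tokens, keyword):
    """Assumed word span: (last - first + 1) / total number of tokens."""
    positions = [i for i, t in enumerate(tokens, start=1) if t == keyword]
    if not positions:
        return 0.0  # keyword does not occur in the document
    first, last = positions[0], positions[-1]
    return (last - first + 1) / len(tokens)


# Hypothetical document after word segmentation.
tokens = ["iot", "data", "cloud", "data", "iot", "search", "cloud", "iot"]

span_iot = word_span(tokens, "iot")        # appears at positions 1 and 8
span_data = word_span(tokens, "data")      # appears at positions 2 and 4
span_search = word_span(tokens, "search")  # appears once, at position 6
```

A keyword spread across the whole document (`iot`) receives a larger span than one confined to a short passage (`data`), which matches the intuition stated above.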

2.3. Word Position

The word position [26] refers to the area of a document in which a keyword appears, which is of great value for judging the importance of the keyword. The title and abstract express the central ideas that the author has distilled from the whole article, so keywords appearing in these two areas are more important than those appearing only in the main text. This paper divides the word position into three areas: title, abstract (or first paragraph), and body. The position values of a keyword in these areas are set to 3, 2, and 1, respectively. A keyword that appears multiple times is handled as follows:

(1) If it appears multiple times in the same area, it is counted only once.

(2) If it appears in several different areas, the highest value is used.

The word position formula is shown in formula (2).
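The position rule above can be sketched as follows. The representation of a document as a mapping from area name to tokens, and the value 0 for an absent keyword, are assumptions made for this example.

```python
# Position values from the text: title = 3, abstract = 2, body = 1.
POSITION_VALUE = {"title": 3, "abstract": 2, "body": 1}


def word_position(document, keyword):
    """Return the highest position value among the areas containing keyword.

    `document` maps an area name to its list of tokens (assumed layout).
    Repeated occurrences inside one area are counted once; a keyword that
    is absent everywhere gets 0 (an assumption of this sketch).
    """
    values = [
        POSITION_VALUE[area]
        for area, tokens in document.items()
        if keyword in tokens
    ]
    return max(values, default=0)


# Hypothetical document.
document = {
    "title": ["ranked", "searchable", "encryption"],
    "abstract": ["encryption", "index", "access", "tree"],
    "body": ["index", "index", "trapdoor", "encryption"],
}
```

For this sample, `encryption` appears in all three areas and takes the title value, while `index` appears in the abstract and body and takes the abstract value.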

2.4. Relevance Score

This paper uses the TF-IDF (term frequency-inverse document frequency) method to evaluate the importance of keywords in documents. TF is the frequency with which the keyword appears in the document, and IDF is the inverse document frequency: the fewer documents contain the keyword, the greater its IDF value, indicating that the keyword has strong distinguishing ability. The TF-IDF formula is shown in formula (3), where $f_{ij}$ represents the frequency of keyword $w_j$ in document $d_i$, $N$ represents the total number of documents, and $n_j$ represents the number of documents containing keyword $w_j$.

When calculating the relevance score of keywords, the word position and word span factors of the keywords should also be considered.

Therefore, the correlation score formula used in this paper is shown in formula (4).

Here, $\alpha$, $\beta$, and $\gamma$ represent the weights of the three parameters: the TF-IDF value, the word position, and the word span.
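To make the combination concrete, the sketch below assumes a common TF-IDF form (frequency times the logarithm of total documents over documents containing the keyword) and a weighted linear combination of the three factors. Both forms, the function names, and the default weights are assumptions for illustration only; the paper's formulas (3) and (4) may be defined differently.

```python
import math


def tf_idf(freq, total_docs, docs_with_keyword):
    """Assumed TF-IDF form: freq * log(total_docs / docs_with_keyword)."""
    if freq == 0 or docs_with_keyword == 0:
        return 0.0
    return freq * math.log(total_docs / docs_with_keyword)


def relevance_score(tfidf, position, span, alpha=0.5, beta=0.3, gamma=0.2):
    """Assumed combination: weighted sum of TF-IDF, word position, word span.

    alpha, beta, gamma are hypothetical weights, not values from the paper.
    """
    return alpha * tfidf + beta * position + gamma * span


# A keyword found in 2 of 100 documents is more distinctive than one in 50.
rare = tf_idf(3, 100, 2)
common = tf_idf(3, 100, 50)

# Same TF-IDF and span, but a title keyword (3) versus a body keyword (1).
in_title = relevance_score(rare, position=3, span=0.5)
in_body = relevance_score(rare, position=1, span=0.5)
```

Under these assumptions, a rarer keyword and a more prominent position both raise the score, which is the behavior the section describes qualitatively.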

2.5. Semantic Relation Graph

Mutual information allows users to analyze the correlation between keywords. Constructing a semantic relationship graph [27] based on mutual information to expand the query semantics can effectively improve the precision and recall during retrieval. For keywords $w_i$ and $w_j$, their mutual information [28] is expressed as shown in formula (5), where $P(w_i)$ represents the probability that a document contains $w_i$, and $P(w_i, w_j)$ represents the probability that a document contains both $w_i$ and $w_j$. The mutual information is then normalized by dividing it by the maximum mutual information value over all keyword pairs. Figure 1 shows a small-scale semantic relationship graph, where each node represents a keyword and each edge weight represents the normalized mutual information value of the two related keywords.
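The construction can be sketched as below, assuming the usual pointwise form log(P(a, b) / (P(a) P(b))) with probabilities estimated as document fractions. Keeping only pairs with positive mutual information is an additional assumption of this sketch, and the sample documents are hypothetical.

```python
import math
from itertools import combinations


def semantic_graph(documents):
    """Build {(w_a, w_b): normalized MI} from per-document keyword lists."""
    n = len(documents)
    doc_sets = [set(d) for d in documents]
    vocab = sorted(set().union(*doc_sets))
    # P(w): fraction of documents that contain w.
    p = {w: sum(w in d for d in doc_sets) / n for w in vocab}
    mi = {}
    for a, b in combinations(vocab, 2):
        # P(a, b): fraction of documents that contain both keywords.
        p_ab = sum(a in d and b in d for d in doc_sets) / n
        if p_ab > 0:
            value = math.log(p_ab / (p[a] * p[b]))
            if value > 0:  # assumption: keep positively related pairs only
                mi[(a, b)] = value
    if not mi:
        return {}
    top = max(mi.values())  # normalize by the maximum MI value
    return {pair: value / top for pair, value in mi.items()}


# Hypothetical keyword sets of four documents.
documents = [
    ["car", "vehicle", "road"],
    ["car", "vehicle"],
    ["cloud", "server"],
    ["cloud", "server", "road"],
]
graph = semantic_graph(documents)
```

In this sample, `car`/`vehicle` and `cloud`/`server` always co-occur and become edges of weight 1.0, while pairs that co-occur no more often than chance are left out of the graph.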

2.6. Access Tree

The scheme in this paper uses the access tree defined by the CP-ABE [29] scheme to represent the access structure. The access tree can be flexibly and efficiently applied to access authority control, which is defined as follows.

Let $T$ denote the access tree. Each nonleaf node in $T$ represents a threshold gate. If node $x$ has $num_x$ child nodes and its threshold is $k_x$, then $0 < k_x \le num_x$. When $k_x = 1$, the node represents an OR gate; when $k_x = num_x$, it represents an AND gate. Each leaf node in $T$ represents an attribute and has threshold $k_x = 1$.

To check whether the user's authority satisfies the access tree $T$, let $r$ be the root node of $T$ and let $T_x$ be the subtree rooted at node $x$. If the attribute set $S$ satisfies the policy represented by $T_x$, denote $T_x(S) = 1$. $T_x(S)$ is calculated with the following recursive algorithm.

If $x$ is a nonleaf node, calculate $T_{x'}(S)$ for every child node $x'$ of $x$. If the number of child nodes satisfying $T_{x'}(S) = 1$ is greater than or equal to $k_x$, let $T_x(S) = 1$; otherwise, the result is NULL. If $x$ is a leaf node, let $T_x(S) = 1$ only if the attribute corresponding to the node belongs to $S$; otherwise, the result is NULL.
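The recursive check can be sketched in a few lines. The `Node` class, the helper names, and the sample policy are hypothetical; the sketch returns `False` where the text returns NULL.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """A node of an illustrative access tree (all names are hypothetical)."""
    threshold: int = 1                       # k_x; leaf nodes use 1
    children: List["Node"] = field(default_factory=list)
    attribute: Optional[str] = None          # set only on leaf nodes


def leaf(attribute: str) -> Node:
    return Node(attribute=attribute)


def gate(threshold: int, *children: Node) -> Node:
    if not 0 < threshold <= len(children):
        raise ValueError("threshold must satisfy 0 < k <= number of children")
    return Node(threshold=threshold, children=list(children))


def satisfies(node: Node, attrs: Set[str]) -> bool:
    """True when the attribute set satisfies the subtree rooted at node."""
    if node.attribute is not None:           # leaf: the attribute must be held
        return node.attribute in attrs
    satisfied = sum(satisfies(child, attrs) for child in node.children)
    return satisfied >= node.threshold       # threshold gate


# Hypothetical policy: (doctor AND cardiology) OR admin
policy = gate(1, gate(2, leaf("doctor"), leaf("cardiology")), leaf("admin"))
```

A threshold of 1 acts as an OR gate and a threshold equal to the number of children acts as an AND gate, so the same recursion covers both cases as well as general k-of-n gates.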

2.7. Bilinear Mapping

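Let $G_1$ and $G_2$ be two multiplicative cyclic groups of prime order $p$, and let $g$ be a generator of $G_1$. A map $e: G_1 \times G_1 \to G_2$ is a bilinear map [30] if it satisfies three properties:

(1) Bilinear. For all $u, v \in G_1$ and $a, b \in Z_p$, $e(u^a, v^b) = e(u, v)^{ab}$.

(2) Nondegenerate. $e(g, g) \ne 1$.

(3) Computability. There is an efficient algorithm computing $e(u, v)$ for any $u, v \in G_1$.

In this case, $e$ is said to be an efficient bilinear mapping from $G_1 \times G_1$ to $G_2$.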

3. Problem Description

3.1. System Model

The scheme involves four entities: the data owner (DO), the data user (DU), the IoT cloud server (CS), and the trust authority (TA). The system model is shown in Figure 2.

(1) Data Owner. The data owner is responsible for encrypting the original documents, building a security index, and uploading the ciphertext documents together with the security index. First, the data owner extracts the keyword collection from the original document collection and encrypts it according to the data access policy to generate the security index. The data owner then uses a symmetric key to encrypt the original document collection and generate the ciphertext document collection. Finally, the ciphertext document collection and the security index are uploaded to the cloud server and stored there.

(2) IoT Cloud Server. The IoT cloud server is mainly responsible for receiving and storing the data uploaded by the data owner and for answering the query requests of authorized users. When it receives a query request, the cloud server first checks the user's permissions. If the user is authorized, the server uses the stored security index and the trapdoor to calculate the similarity score of each document, ranks the results, and returns the top K most relevant documents to the user. Only authorized users can perform a correct search; unauthorized users cannot obtain search results.

(3) Trust Authority. The trust authority is mainly responsible for generating the system keys and for generating user private keys based on user attribute sets.

(4) Data User. The data user submits a query request to the IoT cloud server to find files of interest. The user sends his or her attribute set to the trust authority to obtain a private key and then uses the private key and the query keywords to generate a trapdoor and permission tags, which are uploaded to the cloud server. Finally, an authorized user receives the top K most relevant query results from the cloud server.

3.2. Security Requirements

This paper assumes that the trust authority is fully trusted and that the cloud server is honest but curious. The cloud server correctly executes the user's query request as the scheme requires and does not delete or modify the data uploaded by the data owner, but it may try to obtain additional information from the security index and the trapdoor. Therefore, the scheme in this paper mainly considers the following four security requirements:

(1) Confidentiality of Documents. The data owner does not want unauthorized entities (cloud servers or data users) to know the content of the documents, so the documents must be encrypted before they are sent to the cloud servers and the unauthorized entities do not have decryption keys.

(2) Anonymity of Indexes and Trapdoors. The cloud server knows the ciphertext information stored by the data owner, including ciphertext document collection, security indexes, and trapdoors, but does not know the key.

(3) Anonymous Access. Data users can access IoT data without giving their detailed identity information.

(4) Collusion Resistance. Two or more data users cannot combine their private keys to access documents that none of them is authorized to access individually.

3.3. Scheme Definition

The attribute-based multikeyword semantic ranked search scheme consists of five polynomial-time algorithms: Setup, KeyGen, Encrypt, Trapdoor, and Search.

(1) Setup. TA runs the initialization algorithm. The algorithm takes the system security parameter as input and generates the system master key, the index key, and the system public parameters.

(2) KeyGen. This is the user private key generation algorithm, which is executed by TA. The algorithm takes the system master key and the user attribute set as input and outputs the user private key.

(3) Encrypt. The data owner executes the encryption algorithm. The algorithm takes the index key, the system public parameters, the plaintext document collection, and the access tree as input and outputs the security index and the ciphertext document collection.

(4) Trapdoor. The data user runs this algorithm to generate search credentials for the keywords to be queried. The algorithm takes the query keyword set, the index key, the system public parameters, and the user private key as input and outputs the search credentials.

(5) Search. The keyword search algorithm is executed by the cloud server. The algorithm takes the security index, the search credentials, and the parameter K as input and outputs the top K documents most relevant to the query keyword set. Only users who satisfy the access control policy obtain correct results; otherwise, the search fails.

3.4. Scheme Description

(1) Setup. TA randomly selects a large prime number $p$. Let $G_1$ be a multiplicative cyclic group of order $p$ with generator $g$. TA generates a bilinear map $e$ and a hash function. In addition, TA randomly generates a segmentation vector and two invertible matrices, whose dimension is determined by the number of keywords and the number of confusion bits, and uses them to form the index key. Finally, TA selects random values and generates the system master key and the system public parameters.

(2) KeyGen. TA selects a random number for the user, randomly selects a further value for each attribute in the user's attribute set, and finally generates the user's private key. The system transfers the user's private key to the data user.

(3) Encrypt.

(a) Extract Keywords. The data owner extracts keywords from the plaintext document collection to obtain the keyword collection.

(b) Encrypt Documents. The data owner uses the symmetric key to encrypt each document and obtain the ciphertext set.

(c) Create Index. Based on the vector space model, the data owner generates a document vector for each document. If the document contains a keyword, the corresponding entry is the relevance score of that keyword in the document, calculated with formula (4). Otherwise, the entry is 0.

The data owner extends each document vector with additional confusion dimensions and sets the added entries to random numbers that follow a normal distribution. Then, the data owner splits each extended document vector into two vectors according to the segmentation vector. For one value of a bit of the segmentation vector, both split vectors copy the corresponding entry of the document vector; for the other value, the two split entries are random values whose sum equals the entry of the document vector. Finally, the data owner uses the master key to encrypt the two split vectors and obtain the partial security index.

According to the access tree $T$, a polynomial $q_x$ is selected for each node $x$ in $T$. The polynomials are generated top-down by a recursive algorithm that starts from the root node of $T$. For each node $x$, the degree $d_x$ of the polynomial is one less than the threshold $k_x$ represented by the node, that is, $d_x = k_x - 1$. First, a random value $s$ is selected for the root node $r$ and $q_r(0) = s$ is set; the other coefficients of $q_r$ are then selected at random. For any other node $x$, define the functions $parent(x)$ and $index(x)$: the former returns the parent node of $x$, and the latter returns the position of $x$ among the children of its parent. Let $q_x(0) = q_{parent(x)}(index(x))$, and randomly select the other coefficients of $q_x$. In this way, a polynomial $q_x$ is generated for every node in $T$. Finally, the security index is generated.
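The top-down assignment of polynomials can be sketched over a prime field. The modulus `P`, the nested-dict tree layout, the function names, and the use of Python's `random` module are assumptions for illustration. A real implementation would work in the group order of the pairing, use a cryptographically secure random source, and embed the leaf values in group elements rather than exposing them directly.

```python
import random

P = 2**61 - 1  # a Mersenne prime standing in for the group order (assumption)


def eval_poly(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... modulo P with Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result


def assign_shares(node, value, rng, shares):
    """Walk the tree top-down; `value` is q_x(0) for the current node."""
    if "attr" in node:                       # leaf node: record its share
        shares[node["attr"]] = value
        return
    degree = node["k"] - 1                   # degree = threshold - 1
    coeffs = [value] + [rng.randrange(P) for _ in range(degree)]
    for index, child in enumerate(node["children"], start=1):
        # q_child(0) = q_parent(index(child))
        assign_shares(child, eval_poly(coeffs, index), rng, shares)


# Hypothetical policy: (doctor AND cardiology) OR admin
tree = {"k": 1, "children": [
    {"k": 2, "children": [{"attr": "doctor"}, {"attr": "cardiology"}]},
    {"attr": "admin"},
]}

rng = random.Random(2024)   # NOT cryptographically secure; demo only
secret = rng.randrange(P)   # the root value s
shares = {}
assign_shares(tree, secret, rng, shares)
```

Because the root is an OR gate (threshold 1), its polynomial is constant and each child inherits the root value. The AND gate below it uses a degree-1 polynomial, so both leaf shares are needed to get back to the root value.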

Finally, the data owner uploads the security index and the ciphertext document collection to the cloud server.

(4) Trapdoor.

(a) Keyword Semantic Expansion. First, the data user performs semantic expansion on the query keyword set according to the semantic relationship graph to obtain the expanded keyword set.

(b) Generate the Query Vector. Based on the vector space model, the query vector is constructed. Each entry of the query vector is set according to one of three cases: the keyword is an original query keyword, the keyword is an expansion word that corresponds to one original keyword, or the keyword is an expansion word that corresponds to multiple original keywords. Finally, the query vector is extended to the same dimension as the extended document vectors, and random values are placed in the added entries.

(c) Encrypt the Query Vector. First, the data user splits the query vector into two vectors according to the segmentation vector. The rule is the opposite of the document split rule: where a document entry is copied into both split vectors, the query entry is split into two random values whose sum equals the query entry, and vice versa. Finally, the data user encrypts the two split vectors with the system master key to get the trapdoor. The data user then selects a random value and generates the search credentials.

Finally, the data user sends the search credentials to the cloud server.
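The query-vector construction in the Trapdoor step can be illustrated as follows. The weighting rule used here is an assumption: an original keyword gets weight 1, an expansion word gets the normalized mutual information of its edge, and an expansion word linked to several original keywords takes the largest such weight. The paper's exact values for the three cases are not reproduced here, and the graph and dictionary are hypothetical.

```python
def expand_query(keywords, graph, dictionary):
    """Build a query vector with semantic expansion (assumed weighting)."""
    weights = {w: 1.0 for w in keywords}     # original query keywords
    for (a, b), weight in graph.items():
        # Each edge is undirected, so check both directions.
        for original, neighbour in ((a, b), (b, a)):
            if original in keywords and neighbour not in keywords:
                # Several matching originals: keep the largest weight.
                weights[neighbour] = max(weights.get(neighbour, 0.0), weight)
    return [weights.get(w, 0.0) for w in dictionary]


# Hypothetical normalized semantic relationship graph.
graph = {
    ("car", "vehicle"): 0.8,
    ("automobile", "vehicle"): 0.6,
    ("automobile", "car"): 0.9,
}
dictionary = ["automobile", "car", "road", "vehicle"]

single = expand_query({"car"}, graph, dictionary)
double = expand_query({"car", "automobile"}, graph, dictionary)
```

With the single keyword `car`, the related words `automobile` and `vehicle` enter the query with their edge weights. With both `car` and `automobile` queried, `vehicle` is related to two original keywords and takes the larger of the two weights.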

(5) Search.

When the cloud server receives a query request from the data user, it performs the following steps.

First, the cloud server checks whether the user's attributes satisfy the access tree defined by the data owner. Let $x$ be a node in the access tree $T$. The cloud server executes the following recursive algorithm.

If node $x$ is a leaf node, let $i$ be the attribute corresponding to node $x$. If $i$ belongs to the user's attribute set, the value of the node is computed; otherwise, the value is NULL.

If node $x$ is a nonleaf node, the algorithm is first run on every child node of $x$. Let $S_x$ be a set of $k_x$ child nodes whose values are not NULL. If no such set exists, the access requirements are not met. Otherwise, the values of the child nodes in $S_x$ are combined by Lagrange interpolation to obtain the value of node $x$.

If the value at the root node can be obtained in this way, the data user is an authorized user who can perform the data query.
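The interpolation step can be illustrated in a plain prime field. In the scheme the same coefficients would be applied in the exponent of group elements, so this is only a sketch of the arithmetic. The modulus `P`, the function name, and the sample polynomial f(x) = 1234 + 5x + 7x^2 are assumptions for illustration.

```python
P = 2**61 - 1  # a Mersenne prime standing in for the group order (assumption)


def lagrange_at_zero(points):
    """Interpolate the polynomial through `points` and return its value at 0.

    `points` is a list of (x, y) pairs with distinct x values modulo P.
    """
    secret = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = (num * -xj) % P        # factor (0 - xj)
                den = (den * (xi - xj)) % P  # factor (xi - xj)
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret


# Shares of f(x) = 1234 + 5x + 7x^2 (threshold 3, degree 2) at x = 1, 2, 3, 4.
shares = [(1, 1246), (2, 1272), (3, 1312), (4, 1366)]

recovered = lagrange_at_zero([shares[0], shares[1], shares[3]])
too_few = lagrange_at_zero(shares[:2])
```

Any three of the four shares recover the constant term, while two shares interpolate a different polynomial and yield an unrelated value. This is why a node whose threshold is not met cannot contribute to the root value.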

The cloud server uses formula (9) to calculate the correlation between the security index and the query trapdoor and returns the TOP-K documents most relevant to the query keyword set to authorized users. If the user does not meet the access rights, the search fails and NULL is returned.

The formula for calculating document relevance is shown in formula (9).
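The property that makes encrypted ranking possible is that the split-and-multiply encryption of the secure KNN method preserves inner products. The sketch below demonstrates this with exact rational arithmetic. Several details are assumptions for illustration: the choice of which bit value triggers a split, the use of the matrix transposes for the index and the matrix inverses for the trapdoor, the small integer matrices, and the omission of the confusion dimensions.

```python
import random
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def invert(m):
    """Gauss-Jordan inverse over exact fractions; raises if singular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = 1 / aug[col][col]
        aug[col] = [x * scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_invertible(n, rng):
    """Draw small integer matrices until one is invertible."""
    while True:
        m = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return m, invert(m)
        except ValueError:
            continue


def split(vec, s, split_bit, rng):
    """Split vec in two: random shares where s[j] == split_bit, copies elsewhere."""
    a, b = [], []
    for x, bit in zip(vec, s):
        if bit == split_bit:
            r = Fraction(rng.randint(-9, 9))
            a.append(r)
            b.append(x - r)      # the two shares sum to the original entry
        else:
            a.append(x)
            b.append(x)
    return a, b


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(7)           # demo only, not a secure random source
s = [1, 0, 1, 0]                 # hypothetical segmentation vector
m1, m1_inv = random_invertible(4, rng)
m2, m2_inv = random_invertible(4, rng)

d = [Fraction(1, 2), 0, Fraction(3, 4), 2]   # hypothetical document vector
q = [1, 0, 1, 1]                             # hypothetical query vector

d1, d2 = split(d, s, 1, rng)     # documents split where the bit is 1 (assumed)
q1, q2 = split(q, s, 0, rng)     # queries split where the bit is 0 (opposite)
index = (mat_vec(transpose(m1), d1), mat_vec(transpose(m2), d2))
trapdoor = (mat_vec(m1_inv, q1), mat_vec(m2_inv, q2))

score = inner(index[0], trapdoor[0]) + inner(index[1], trapdoor[1])
```

The server-side `score` equals the plaintext inner product of `d` and `q`, because each matrix cancels against its inverse and the random shares sum back to the original entries. The cloud server can therefore rank documents without seeing either vector.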

If the user’s attribute set satisfies part or all of , the user can obtain according to formula (8) and calculate the document encryption key by formulas (10) and (11).

Finally, the user uses this key to decrypt the returned documents and obtain the plaintext documents.

4. Security Analysis

4.1. Confidentiality of Documents

The document is encrypted with a symmetric key before being uploaded to the cloud server, and only data users who meet the access policy defined by the data owner can search for the document and further obtain the decryption key to decrypt the obtained ciphertext document. Therefore, this solution guarantees the confidentiality of the document.

4.2. Anonymity of Indexes and Trapdoors

The cloud server can obtain the encrypted security index and the encrypted query vector. The security index of each document is produced from its document vector under the action of the segmentation vector and the two invertible matrices, which together are the encryption keys of this scheme. For a collection of documents, the cloud server can therefore write down one group of equations per document, in which the entries of the split document vectors and the entries of the two invertible matrices are all unknown (here, the number of confusion bits is taken to be 0). The number of equations in this system is smaller than the number of unknowns, so solving it is not feasible, and the cloud server cannot deduce the split document vectors or the invertible matrices.

Similarly, the query vector can be regarded as two m-dimensional vectors whose entries are unknown, and the entries of the two invertible matrices are unknown as well. The number of equations available for solving the query vector is again smaller than the number of unknowns, so neither the query vector nor the invertible matrices can be solved. Therefore, this scheme ensures the security of indexes and query vectors.

4.3. Anonymous Access

The scheme uses attributes as the minimum granularity of access control. When an access request is made, the system does not examine the user's identity. It only verifies whether the user's attributes satisfy the access structure and, on that basis, decides whether to provide the user with the data.

4.4. Collusion Resistance

Collusion resistance means that users with different attributes cannot decrypt a ciphertext by combining their private keys if none of them could decrypt it alone. In a searchable encryption scheme, it is also required that colluding users cannot search for keyword ciphertexts that they are not authorized to search. In this scheme, the system selects a random number for each attribute on the access tree. Since this number is randomly distributed, the private keys of the same attribute in different networks are different, so the secret values that can be recovered from them are different. Therefore, this scheme is collusion resistant.

4.5. Function and Security Comparison

In this section, we compare the expression ability and supported functions of the proposed scheme with some existing schemes. The summary is shown in Table 1.

5. Efficiency Analysis

This section analyzes the computational cost of the scheme in the private key generation, index generation, trapdoor generation, and search stages, compares its efficiency with that of the scheme in [31], and then evaluates the scheme through experimental simulation. The following costs are small enough to be ignored in the comparison.

(1) Index Generation Stage. To encrypt each document index, the data owner multiplies two vectors by two matrices. The cost of each multiplication grows with the square of the number of keywords after expansion. Compared with the exponentiation and bilinear pairing operations on the two groups, the time spent on matrix multiplication is negligible.

(2) Trapdoor Generation Stage. To calculate the encrypted query vector, the data user also multiplies two vectors by two matrices, and the time spent on this multiplication can likewise be ignored.

(3) Search Stage. If the user satisfies the access rights, the cloud server performs the search. The main operation is the inner product of two vectors, which is computed once for each document in the collection, so the cost grows with the product of the vector dimension and the number of documents. The time spent on the inner product operations can also be ignored.

Here, let and denote the exponential operation of groups and , respectively, denote bilinear pairing operation, denote the time of hash operation, is the number of attributes in the system, is the number of user attributes, is the number of files, and is the number of keywords.

The efficiency comparison of the scheme in literature [31] and the scheme in this paper is shown in Table 2.

To verify the effectiveness of the scheme, this paper compares the performance of the scheme in [31] with that of the proposed scheme. The experiments were run on a 64-bit Windows 10 system with an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz and 8 GB of RAM to measure the actual execution time. The number of keywords in the dictionary is set equal to the number of query keywords in the trapdoor, and the number of attributes in the system is set equal to the number of user attributes.

As shown in Figure 3, the algorithm in this scheme has a lower computational cost than the scheme in [31] in a large-scale data sharing system, which makes this scheme more practical.

Figure 4 compares the execution time of the search operation on a single subindex. The computational overhead of the search phase is mainly affected by the number of user attributes, and for both schemes it increases linearly with the number of user attributes.

6. Conclusion

For the special application scenario of the IoTs environment, this paper proposes an attribute-based multikeyword ranked search scheme. The scheme realizes both keyword search based on semantic expansion and user access control. First, the scheme accounts for the different weights of keywords at different positions and introduces parameters such as word position and word span into the calculation of keyword relevance scores to build a more accurate document index. Second, the scheme expands the query keywords according to the semantic relationship graph to find more keywords with similar meanings, thereby improving precision and recall during retrieval. Third, the scheme uses an access tree to control the access authority of data users, giving data owners fine-grained, attribute-based management of their data. Finally, the functional and security analyses and comparisons show that the scheme provides document confidentiality, index and trapdoor anonymity, anonymous access, and resistance to collusion attacks. In addition, the efficiency of the scheme is analyzed theoretically, and the results show that it has advantages over the compared scheme.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study is supported by the National Key Research and Development Program of China (2020YFB1005404), Open Foundation of State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications (SKLNST-2020-1-09), Henan Key Research Projects of Universities (21B520022, 20A580008), and Science and Technology Program of Henan Province (212102210415, 212102210100).