Abstract
Cities in the big data era hold massive urban data that can be turned into valuable information and digitally enhanced services. Sources of urban data are generally categorized into three types: official, social, and sensorial, originating from governments and enterprises, citizens' social networks, and sensor networks, respectively. These types typically differ significantly from each other but are consolidated together for smart urban services. Given the sophistication of existing consolidation approaches, we argue that a new challenge, fragment complexity, in which well-integrated data have an appropriate but fragmentary schema and are therefore difficult to query, is ignored by state-of-the-art urban data management. In contrast to a predefined and rigid schema, a fragmentary schema means that a dataset contains millions of attributes nonorthogonally distributed among tables, and the values of these attributes are even more massive. For a query, locating where these attributes are stored is the first problem encountered, and traditional value-based query optimization contributes nothing to it. To address this problem, we propose an index on massive attributes as an attribute-oriented optimization, namely, the attribute index. The attribute index is a secondary index for locating the files in which the target attributes are stored. It contains three parts: ATree for searching keys, DTree for locating keys among files, and ADLinks as a mapping table between ATree and DTree. In this paper, the index architecture, logical structure and algorithms, implementation details, creation process, integration with an existing key-value store, and urban application scenario are described. Experiments show that, in comparison with B+-Tree, LSM-Tree, and AVL-Tree, the query time of ATree is 1.1x, 1.5x, and 1.2x faster, respectively. Finally, we integrate our proposition with HBase, namely, UrbanBase, whose query performance is 1.3x faster than that of the original HBase.
1. Introduction
Urban big data are the large amount of dynamic and static data generated by subjects and objects such as urban facilities, organizations, and individuals. With the continuous development and maturity of the mobile Internet and big data technologies, the types and scale of urban data have increased significantly [1]. Using urban data for analysis to create valuable information and digitally enhanced services has also become a focus of much attention in recent years [2,3,4]. Sources of urban data are generally categorized into three types: official, social, and sensorial. Official urban data refer to data from the government and enterprises, such as basic public data on population, traffic, land, housing, and geography; public administration data on transactions, taxes and revenues, payment, and registration; and confidential microdata on personal employment, medical treatment, welfare, and education. Social urban data refer to data generated by urban residents in their daily lives, such as social media usage records and global positioning system (GPS) data generated by user activities. Sensorial data refer to sensor data on urban infrastructure and moving objects, such as historical and real-time data recorded by sensor systems for the environment, water, transportation, gas, and buildings, and pictures and video taken by surveillance cameras. The urban data constituted by these three sources are diverse and large in scale, covering all aspects of urban production and life [5].
As the official, social, and sensorial urban data are usually very different from each other, and the data types within each category are also complex, urban data are often scattered and chaotic [6]. Therefore, from the perspective of big data management, raw urban data are massive, distributed, heterogeneous, and inconsistent. According to the needs of smart city services, urban data generated from different sources need to be consolidated for storage, retrieval, and analysis [7]. In this area, the latest urban data management research has proposed many specific integration methods and architectures for urban data [8,9,10]. With sophisticated data preprocessing methods such as data cleaning and integration, the scattered and chaotic urban data are consolidated into distributed, non-relational data stores such as NoSQL databases. However, we argue that urban databases contain tiny and massive attributes that are nonorthogonally distributed among tables, and this feature brings a new challenge to query efficiency. A common database has a relatively small number of attributes, divided into a few tables. An urban database, in contrast, has a large number of distinct attributes for two reasons: first, it stores massive objects, and these official, social, and sensorial objects all have different attributes; second, semantically consistent attributes may be represented in various ways. For example, there are more than 8.5 million people, 110 universities, and an average of 500,000 taxi trips per day in New York City [11]. The amount of data generated every day is huge and involves many completely different fields, including massive distinct attributes. Since attributes are treated as the schema of values, the fragmentary schema in this paper means that a database contains millions of attributes nonorthogonally distributed among tables. Since database queries rely on the schema, the fragmentary schema brings a new complexity to urban database queries, named fragment complexity.
In computer science, "fragmentary" describes massive, tiny, and disconnected parts of something. The fragmentary schema in this paper has two characteristics: first, the schema is non-predefined and flexible, which is satisfied by NoSQL databases; second, the attributes are massive, which is satisfied by urban data. The fragmentary schema is caused by the data consolidation process for many reasons, such as geographic diversity, semantic diversity, and format diversity; all of these diversities manifest as the fragmentary schema after data consolidation. Therefore, consolidating official, social, and sensorial urban data into a NoSQL database causes the fragment complexity, in which well-integrated data have an appropriate but fragmentary schema and are difficult to query. There are many complexities in the study of database queries; for instance, query complexity concerns the features of query algorithms, and storage complexity concerns the features of data. In this paper, fragment complexity concerns the features of the schema. Known query optimizations, such as indexes on values or partitions on values, cannot solve the fragment complexity because queries on attributes must be performed first. Such complexity is ignored in state-of-the-art urban data management. An urban database with a fragmentary schema has two weaknesses:
(1) Given an attribute and a value range, there is no mechanism to support retrieving the attribute from massive attributes. An attribute index that accelerates such queries is necessary.
(2) Due to the massive attributes, no single structure can index all of them; thus, attribute indexes should be distributed and well managed. A secondary index over the local indexes is necessary.
As shown in Figure 1, this research aims to solve the fragment complexity of official, social, and sensorial urban data via an attribute-oriented index. We propose an index on massive attributes as an attribute-oriented optimization, namely, the attribute index. The attribute index is a secondary index for locating the files in which the target attributes are stored. In this paper, the index architecture, logical structure and algorithms, implementation details, creation process, integration with an existing key-value store, and urban application scenario are described. We also integrate the attribute index into HBase, named UrbanBase, for storing urban data and compare its query performance with that of the original system. Furthermore, the experimental results show that the proposed index is more efficient than B+-Tree, LSM-Tree, and AVL-Tree in terms of query performance and storage cost.

The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 introduces the index architecture as the overall description of our solution. Section 4 gives the details of the structure and algorithms of the attribute index. Section 5 discusses implementation issues such as arrays as the memory structure, system integration, and speedup analysis. In Section 6, we evaluate and compare the performance of several indexes and show the application results of the attribute index on HBase. Finally, the conclusions and future work are summarized in Section 7.
2. Related Work
In previous research on smart cities, many specific solutions have been proposed to address the storage and analysis of urban data. Cheng et al. introduced an urban data and analytics platform named CiDAP [12]. In CiDAP, CouchDB is used to store JSON-formatted data generated by sensors, and all unstructured data are saved as files in HDFS. However, complete urban data include not just a few specific formats but a wide variety; setting up a suitable storage platform for each data format and then integrating them cannot easily be applied to urban data from multiple sources. Marco et al. proposed a model called FraPPE, which uses space, time, content, and source to represent events and uses grids and frames to unify the granularity of space and time [13]. However, complete urban data include more than just spatiotemporal data, and in the process of modeling the data at a uniform granularity, some detailed information is lost. To solve the fragment complexity of urban data, a new data model needs to be built.
Much research on indexes can be found in the database literature, and many classical indexes have been proposed, such as the dense index, the sparse index, the multilevel index combining dense and sparse indexes, B-Tree and B+-Tree, and the efficient hash index. Nevertheless, as far as we know, there is no study on indexes over attributes; all are over values. Since the relationship between attributes and values is similar to the key-value model, we review research on indexes for key-value stores.
Ma et al. recently proposed a scalable in-memory key-value store that leverages an index combining a HashTable and a SkipList, supporting both range queries and point queries [14]. They build an in-memory key space indexed by the HashTable, and parallel access and request dispatch follow the key partitions. In our attribute index, a HashTable solution is infeasible because all data have been partitioned into on-disk data files before the index is built. Zhang et al. also proposed an in-memory SkipList index for key-value stores [15]. The index is on values and has two layers: the top layer is tree-structured, and the bottom layer is a SkipList. In our solution, the attribute index similarly contains the top-layer ATree and the bottom-layer DTree. Gao et al. proposed a scalable multidimensional index (MD-Index) for key-value stores [16]. It is an index on on-disk key-value pairs in which the keys are modeled as dimensions; during querying, the queried keys reduce the number of dimensions in the multidimensional model. The weakness of MD-Index is that keys have to be modeled as dimensions, and a multidimensional cube is not suitable for massive keys. Spkv is a similar study on multidimensional indexes [17]. Kejriwal proposed a partitioning approach for value indexes that maintains consistency during normal operations [18]. Our solution is inspired by index partitioning: each data node contains an attribute index. However, consistency is not a problem in our solution because an attribute index covers only the attributes on its own server, and it is a parallel index naturally partitioned according to the data placement.
As we adopt HBase as our host system, we also review studies on secondary indexing for HBase and similar systems. In the table-based indexing approach, the secondary index is a special system table, while in the colocated in-memory secondary indexing approach, the secondary index entries are stored on the same node as the corresponding base table records, and each node is responsible for maintaining its portion of the secondary index [19]. ITHBase [20] is an open-source implementation developed by modifying the HBase source code to include transaction and secondary index support; it is a table-based indexing solution. IHBase [21] is an in-memory colocated secondary indexing solution for HBase. Both build an index table based on the index keys, whose key-value pairs are derived from the original dataset. When a query is performed, the index table is accessed first, and then the matched data are located through the mappings between the index table and the indexed keys. Secondary indexes benefit query performance as long as the indexed keys are contained in the query conditions [22].
The attribute index, which is an index on massive attributes, is a secondary index built with the colocated approach. This approach is chosen for several reasons. First, we focus on in-memory indexing; that is, the index resides in memory. Second, values of the same attribute may be stored on any data node; that is, attributes are spread among nodes and naturally partitioned by them, so the colocated approach is much more convenient than the table-based approach. Besides, the common weakness of a secondary index is that either not all data can be indexed or there are too many local indexes to maintain. However, this weakness is not significant for the attribute index: the queried values can still be indexed by a traditional index while the attribute index only filters out the data files that do not contain the queried attributes, and the attribute indexes are fully distributed, with each data node maintaining its own index. A two-level indexing framework contains a local index and a global index [23]. The local index maintains the information on each node, while the global index consists of selected local indexes in order to save storage and improve query efficiency. The attribute index is a local index on attributes, and its challenge is to define a structure that is efficient in both query and storage.
3. Index Architecture
There is a variety of data types in urban data. For example, the product sales records of enterprises are structured data; sensors in air monitoring stations generate semistructured data in JSON format; pictures posted by residents on social media are unstructured data. We build a unified attribute-value data model for all types of urban data. In structured data, each column name is regarded as an attribute, and all values under that column are regarded as its value set. In semistructured data, labels are regarded as attributes, and the corresponding value of each label is regarded as the value of the attribute. In unstructured data, the description of each data item is regarded as the attribute, and the data item itself is regarded as the value. For example, for a picture uploaded by a user, the file name and file properties of the picture are regarded as attributes, and the picture itself and the property values are regarded as values. Through the abovementioned methods, we map the complex urban data into a unified attribute-value model, so that the urban data can be stored in the same NoSQL database for query and analysis. The process of building the attribute-value model is shown in Figure 2.
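As an illustration, the following minimal sketch flattens records of different original formats into attribute-value pairs; the record contents and attribute names (station_id, pm25, file_name, and so on) are hypothetical examples rather than fields of a real dataset.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the attribute-value model: every record, whatever its
// original format, is flattened into attribute -> value pairs.
public class AttributeValueModel {
    public static void main(String[] args) {
        // Semistructured sensor reading: JSON labels become attributes.
        // The attribute names here are hypothetical examples.
        Map<String, String> sensorRecord = new LinkedHashMap<>();
        sensorRecord.put("station_id", "S-042");
        sensorRecord.put("pm25", "35.2");
        sensorRecord.put("timestamp", "2020-06-01T08:00:00Z");

        // Unstructured picture: file metadata becomes attributes, and the
        // binary content itself is the value of a "content" attribute.
        Map<String, String> picture = new LinkedHashMap<>();
        picture.put("file_name", "crossing.jpg");
        picture.put("content", "<binary payload>");

        // Both records now share one uniform representation and can be
        // stored in the same attribute-value (NoSQL) database.
        System.out.println(sensorRecord);
        System.out.println(picture);
    }
}
```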

Based on the attribute-value model, we define an index on many attributes, as shown below.
Definition 1. Attribute Index. An attribute index is built on attributes rather than values; it is a tree-structured index on massive attributes, namely, on words. An attribute index maps a queried attribute to the data files in which the attribute is stored. It is a secondary index of the NoSQL database and resides in the memory of each data node with three substructures:
(1) ATree: a tree structure that maps an attribute to its identity, named AID.
(2) DTree: a reverse tree structure that generates the URL of a data file by traversing from its identity, named FID and represented as a leaf node, up to the root.
(3) ADLinks: the links from AIDs, as leaves of ATree, to FIDs, as leaves of DTree.
The application scenario of an attribute index is a database that stores urban data with an attribute-value model. Both the attributes and the values of the database are huge, and attribute values are stored in data files on multiple data nodes. If a global attribute index were built for the attributes in these files, the scale of the index would also be huge and might not fit in the memory of the master node. Moreover, a distributed index brings additional management and query cost; for example, queries on the index have to be processed across nodes. Therefore, in this paper, the attribute index is local to the data node; that is, an attribute index covers the attributes of one data node and is placed, maintained, and queried on that node.
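The following sketch shows one possible shape of the three substructures and the query path through them. The class and method names are ours, and the maps are stand-ins: the real ATree and DTree are array-based tree structures, as described in Section 5.

```java
import java.util.*;

// Illustrative container types for the three substructures of the
// attribute index; field and class names are our own.
public class AttributeIndex {
    // ATree: maps an attribute (a word) to its identity AID.
    static class ATree {
        private final Map<String, Integer> trie = new HashMap<>(); // stand-in for the real trie
        Integer lookup(String attribute) { return trie.get(attribute); }
        void insert(String attribute, int aid) { trie.put(attribute, aid); }
    }
    // DTree: maps a file identity FID back to the file's full path.
    static class DTree {
        private final List<String> paths = new ArrayList<>();
        String pathOf(int fid) { return paths.get(fid); }
        int register(String path) { paths.add(path); return paths.size() - 1; }
    }
    // ADLinks: many-to-many links from AIDs to FIDs.
    static class ADLinks {
        private final Map<Integer, Set<Integer>> links = new HashMap<>();
        void link(int aid, int fid) { links.computeIfAbsent(aid, k -> new HashSet<>()).add(fid); }
        Set<Integer> filesOf(int aid) { return links.getOrDefault(aid, Set.of()); }
    }

    // Query path: attribute -> AID -> FIDs -> file paths to scan.
    public static void main(String[] args) {
        ATree atree = new ATree(); DTree dtree = new DTree(); ADLinks ad = new ADLinks();
        int fid = dtree.register("/local/data1/c.db");
        atree.insert("temperature", 0);
        ad.link(0, fid);
        for (int f : ad.filesOf(atree.lookup("temperature")))
            System.out.println(dtree.pathOf(f)); // files containing the attribute
    }
}
```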
When a query is submitted to the database, it contains both attributes and values, for example, "select a and b from dataset where c is between min and max." In this query, a, b, and c are attributes, while min and max are values; obviously, the search scope is the set of data files containing attributes a, b, and c. As the attribute index is local to each data node, the query engine sends the query to every data node. On each node, first, the ATree is searched and the matched AIDs (for a, b, and c) are retrieved; then, the paths of the data files that contain these AIDs are retrieved by querying the DTree with the FIDs linked to the AIDs in ADLinks. Finally, the query engine scans these files and looks for records whose attribute c is between min and max. The overall architecture of the attribute index is shown in Figure 3.
The attribute index works together with the value index. First, the attribute index is a local index; it outputs the paths of the files containing the queried attributes, named the attribute scope. If a value index also outputs paths of files, they form the value scope, which contains the queried values of the indexed attribute. Then, the files to be scanned, named the scan scope, are the intersection of the attribute scope and the value scope. If the value indexes are row-level indexes built on each data file, only the indexes on the files in the attribute scope are searched. If there are no value indexes, the data files in the attribute scope are scanned.
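A minimal sketch of this scope intersection, with illustrative file paths:

```java
import java.util.*;

// Sketch: the scan scope is the intersection of the attribute scope
// (files containing the queried attributes) and the value scope (files
// whose indexed values match the predicate). Paths are illustrative.
public class ScanScope {
    public static void main(String[] args) {
        Set<String> attributeScope = new TreeSet<>(List.of("/d1/a.db", "/d1/c.db", "/d2/1.db"));
        Set<String> valueScope = new TreeSet<>(List.of("/d1/c.db", "/d2/1.db", "/d2/2.db"));
        Set<String> scanScope = new TreeSet<>(attributeScope);
        scanScope.retainAll(valueScope); // only these files need to be scanned
        System.out.println(scanScope);   // [/d1/c.db, /d2/1.db]
    }
}
```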

4. Attribute Index and Algorithms
For indexing n attributes, at least log2(n) bits of information are required to distinguish n different attributes; hence, an attribute index needs at least n log2(n) bits of space for n attributes in theory. Therefore, the index space overhead depends only on the number of attributes, not on their length. We can estimate the number of attributes in the dataset and calculate the size of the ATree. In the big data environment, if the size of the index exceeds the memory capacity, we shard the dataset and build an index on each subset, because for n = n1 + n2, n1 log2(n1) + n2 log2(n2) < (n1 + n2) log2(n1 + n2). However, this sharding method is not applicable to attribute indexes because the attributes are stored with the values and cannot be sharded; hence, attributes of all kinds may reside on each data node. Since the ATree of a node indexes millions of attributes on that node, a space-efficient index structure is required to ensure it fits the memory size. Representatives of tree-based indexes, such as B+-Tree, LSM-Tree, and AVL-Tree, have large memory overhead: assuming the average length of an attribute is k, these tree-based indexes have to store all attributes in memory, so the storage complexity is O(kn) while the query time complexity is about O(k log2(n)). Compared with the classic tree-structured indexes, the design goal of ATree is to minimize the storage cost of the index and improve the query performance.
The ATree is designed as a Trie-tree that stores the attributes. Attributes are sequences of characters, and a word normally does not contain many characters, so it is beneficial that the search complexity relates to the length of the attributes rather than to their number. Unlike a binary search tree, no node in the ATree stores the attribute associated with that node; instead, its position in the tree defines the attribute with which it is associated. All descendants of a node have a common prefix of the sequence associated with that node, and the root is associated with the empty sequence. In the tree, the time to find a sequence does not depend on the number of the tree's nodes but rather on the length of the sequence. For example, if a group of attributes is "pool," "prize," "preview," "prepare," "produce," and "progress," the tree can be built as shown in Figure 4. The time complexity of finding an attribute in a Trie-tree is dominated by the depth of the Trie, not by the number of attributes, and the depth of the Trie is dominated by the lengths of the attributes. By analysis, the time complexity and space complexity of the Trie-tree are O(k) and O(kn), respectively.
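A minimal sketch of the uncompressed Trie underlying ATree, using the example attributes above (the compressed, double-array form is described in Section 5):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal uncompressed Trie: no node stores a whole attribute; the path
// from the root spells it out, so lookup cost depends on the attribute's
// length k, not on the number of stored attributes n.
public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isAttribute; // true if a stored attribute ends here

    void insert(String attribute) {
        Trie node = this;
        for (char c : attribute.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Trie());
        node.isAttribute = true;
    }

    boolean contains(String attribute) {
        Trie node = this;
        for (char c : attribute.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isAttribute;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        for (String a : new String[] {"pool", "prize", "preview", "prepare", "produce", "progress"})
            t.insert(a);
        System.out.println(t.contains("preview")); // true
        System.out.println(t.contains("press"));   // false
    }
}
```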

The logical structure of ATree is inspired by the Trie-tree. In a Trie-tree, the root node does not contain data, and every other node contains one character. From the root node to a certain node, the characters of all traversed nodes are concatenated as the corresponding record. The children of each node contain different data and are arranged in a certain order, and all descendants of a node share a common prefix of the sequence associated with the node. If a node in a Trie-tree has only one non-leaf child node, the two nodes are compressed into one. By this method, common English prefixes and suffixes, such as "pre," "tion," and "ing," can be compressed. In addition, an attribute in the dataset may consist of multiple words, such as "object color" and "object size," where the compression further reduces the size of ATree. Compression saves space, reduces the traversal depth, and reduces the time complexity. However, when a new attribute is introduced, nodes of ATree must be updated; thus, the compression must be reversible. Figures 5(b) and 5(c) show the compression of ATree.

In ATree, on the one hand, leaf nodes, as the last nodes of attributes, are the entry points of ADLinks and map to the corresponding leaf nodes of DTree; on the other hand, nonleaf nodes with no links to ADLinks and DTree waste space as internal nodes only. To fully utilize the nodes in ATree, we introduce a wildcard: for every node except the root, a wildcard is appended to the node value, and it matches any string for which the path from the root to that node is a prefix. When retrieving an attribute in an ATree, the top-down traversal stops at the matched node whose child nodes are all unmatched. For example, when node b = {par} is a child of node a = {pre}, node b will be selected when querying "prepare" and "preparation," and node a will be selected when querying "preview" and "preach." The advantage of introducing wildcards is that the space complexity of ATree is reduced from O(kn) to O(n). At the same time, ATree can support the addition of new attributes without modifying its structure. However, this comes at the cost of not distinguishing nonexistent attributes. Fortunately, general query conditions place approximation on values rather than attributes; a query is meaningless if the queried attributes themselves are fuzzy. In addition, the attribute index ultimately tells the query engine which data files contain (though not exclusively contain) these attributes; thus, fuzzy attribute matching affects neither the final query efficiency nor the correctness. Figure 5(d) shows the wildcard of ATree.
In attribute-value storage, attributes may contain any kind of character. In addition to common digits and English letters, Greek characters, Chinese characters, Japanese hiragana, and characters of other languages may appear in an attribute. It is difficult to index attributes directly over all kinds of characters, so a method is needed to unify characters and facilitate sorting and searching, thereby establishing the attribute index. In ATree, attributes are stored and retrieved as their UTF-8 encodings in hexadecimal, and the basic storage unit is one or more encoded bytes. UTF-8 is one of the most widely used Unicode implementations on the Internet and is known for its variable-length encoding. Generally, an attribute does not contain many characters and is mainly composed of letters and digits, so the UTF-8 encoding obtained by the conversion is not too long. Compared with building the Trie-tree directly on raw characters, using UTF-8 encoding slightly lengthens the encoded keys but allows special characters to share a common prefix with digits or letters. This encoding method makes full use of the characteristics of the Trie-tree, saving storage space and reducing the space cost.
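A small sketch of this encoding step; the helper name toHex is ours:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the encoding step: an attribute is converted to its UTF-8
// bytes and rendered in hexadecimal, which is the unit stored in ATree.
public class Utf8Key {
    static String toHex(String attribute) {
        StringBuilder sb = new StringBuilder();
        for (byte b : attribute.getBytes(StandardCharsets.UTF_8))
            sb.append(String.format("%02X", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("tax")); // 746178
        System.out.println(toHex("税"));  // E7A88E: multibyte characters share the same hex alphabet
    }
}
```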
The ATree search algorithm is shown in Algorithm 1. The idea of the algorithm is to traverse the UTF-8 encoding of the attribute and compare it with the nodes in the ATree until the node whose path exactly matches the encoding is selected; otherwise, if no node matches in the end, the wildcard of the node with the longest common prefix is selected. In the algorithm, Y is the attribute to be queried, AT is the generated ATree, node is the current node in the ATree, and codeInquire is the character segment matched against the ATree at each step. First, Y is converted to its UTF-8 encoding codeY; the algorithm loops through each character of this encoding and updates codeInquire. If the current node has a child node that matches codeInquire, node becomes that child node and codeInquire is reset to search the next layer; otherwise, no node matches the current codeInquire, and codeInquire continues to grow before searching again. After the loop ends, if the exact match fails, the wildcard of the last matched node is selected as the query result.
Algorithm 1: The ATree search algorithm.
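Since the full listing of Algorithm 1 is not reproduced here, the following is a sketch of the described search over a child-map version of ATree. The node fields and the handling of partial matches follow the description above, while names such as Node and search are ours.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of Algorithm 1: walk the hex-encoded attribute through ATree,
// growing codeInquire until a child matches; if no exact match remains
// at the end, fall back to the last matched node's wildcard.
public class ATreeSearch {
    static class Node {
        Map<String, Node> children = new HashMap<>(); // keyed by (compressed) node values
        Integer aid;         // set on nodes that end an attribute
        Integer wildcardAid; // fallback identity for fuzzy matches
    }

    static Integer search(Node root, String attribute) {
        StringBuilder codeY = new StringBuilder();
        for (byte b : attribute.getBytes(StandardCharsets.UTF_8))
            codeY.append(String.format("%02X", b));
        Node node = root;
        String codeInquire = "";
        for (int i = 0; i < codeY.length(); i++) {
            codeInquire += codeY.charAt(i);
            Node child = node.children.get(codeInquire);
            if (child != null) {      // matched one (possibly compressed) node
                node = child;
                codeInquire = "";     // reset to search the next layer
            }                         // otherwise keep extending codeInquire
        }
        if (codeInquire.isEmpty() && node.aid != null)
            return node.aid;          // exact match
        return node.wildcardAid;      // fuzzy match via the wildcard
    }
}
```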
Figure 6 is an example of the proposed index. In the ATree, each node holds one or more hexadecimal digits, representing one or more bytes of the UTF-8 encoding of an attribute. Taking the attributes "tax," "task," "tag," "text," "team," and "tel." as an example, they are converted to their UTF-8 encodings, and the ATree is constructed; the construction steps are shown in Figures 6(a) and 6(b). The DTree in Figure 6(d) is a simple directory tree structure: each parent node stores one level of directory information, and leaf nodes store file names. Similar to ATree, all descendants of a node share a common prefix of the associated sequence, but the difference is that ATree is traversed from root to leaf, while DTree is traversed from leaf to root to obtain the full file path. Generally, there are many files in the storage system storing urban data but only a few distinct parent directories; completely recording the absolute path of every file would repeat their parent directories and waste storage, which DTree avoids.

Figure 6(c) shows examples of ADLinks, which are pointers from nodes of ATree to leaves of DTree. Because an attribute may be stored in multiple files and a file may hold multiple attributes, there is a many-to-many relationship between nodes of ATree and leaf nodes of DTree. Generally, the massive attributes are sparse and follow a Zipf distribution: a small number of high-frequency attributes and a large number of low-frequency attributes. Therefore, in the DTree section of ADLinks, we add a leaf node "ALL": when an attribute is a high-frequency one (for example, when more than 70% of the data files contain it), its link points directly to this node, which means that a query on the attribute returns all data files on the server. In addition, although each file contains a large number of attributes, each attribute corresponds to a small number of files, which ensures the efficiency of both storage and query.
5. Implementation
In this section, the implementation issues of the attribute index are discussed, including its data structures and creation, how it is integrated into a typical NoSQL store, and why it can improve the query performance.
5.1. Arrays for the Attribute Index
ATree is a compressed Trie-tree. The simplest implementation is a multiway tree in which each node stores the set of all its child nodes; with the huge amount of data involved, the space occupied by this structure would be particularly large. Thus, ATree is implemented with the "double-array method," which greatly reduces the space cost by using only two arrays, "base" and "check," to store the data. The base array records the states, and the check array verifies whether each string has transitioned from the same state. Assume that two characters a and b are both successors of character t, that a.code is the encoded value of character a, and that a.index is the index position of character a in the array. The values of the base and check arrays satisfy the following conditions:
base[t.index] + a.code = a.index
base[t.index] + b.code = b.index
check[a.index] = check[b.index] = t.index
Node t has successor nodes when base[t.index] > 0; otherwise, node t has no successor nodes, and base[t.index] < 0. Node a is a successor of node t when both base[t.index] + a.code = a.index and check[a.index] = t.index hold.
We select a part of the ATree in Figure 6 and mark each node; Figure 7 then describes how the relationships among nodes ①, ②, and ③ are represented in the double arrays. The query algorithm, Algorithm 1, is implemented over the double arrays: it traverses the UTF-8 code converted from the queried attribute and determines, in turn, whether each query node corresponds to a node in the ATree. The findNext function in Algorithm 1 indicates whether the next query node corresponds to a successor node in the ATree. Let the input query node be u and the previously matched node in ATree be p; then u.index = base[p.index] + u.code. The node u exists in ATree if check[u.index] equals p.index; otherwise, node u has no complete match in ATree, so p is returned, and u is extended by one more digit in the next loop iteration. Finally, after the traversal of the attribute is completed, the wildcard of the last matched node is selected as the query result if there is no exactly matched node. The wildcard is implemented as a special child node of a node, and it is used to match any value except the values of the existing child nodes.
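A sketch of the double-array transition behind findNext, under the base/check conditions above; the class layout is illustrative:

```java
// Sketch of the double-array transition used by findNext in Algorithm 1:
// from matched node p, a candidate code u.code leads to index
// base[p.index] + u.code, valid only if check confirms p as the parent.
public class DoubleArray {
    int[] base;  // base[i] > 0: node i has successors; base[i] < 0: leaf
    int[] check; // check[j] = index of the parent of node j

    // Returns the child index if the transition exists, or -1 otherwise
    // (the caller then extends the query segment by one more digit).
    int findNext(int pIndex, int uCode) {
        int uIndex = base[pIndex] + uCode;
        if (uIndex >= 0 && uIndex < check.length && check[uIndex] == pIndex)
            return uIndex; // u is a successor of p in ATree
        return -1;
    }
}
```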

We adopt two arrays, "find" and "name," to implement DTree. The find array is of integer type and stores the index of each node's parent; the name array is of string type and stores the directory name or file name of each node. We adopt the following steps to build a DTree. First, we determine the range of the database files, traverse the files under this range with breadth-first search, and obtain the traversal results. Then, we fill the resulting sequence into the name array in order from the root directory to the files; at the same time, the index of the parent of each node is filled into the find array. When a file path lookup is required, the index x of the specific file is retrieved, the value of find[x] is read, and name[find[x]] is combined with name[x] to form a partial path. This process iterates until the system root directory is reached, and the full absolute path is obtained. For the DTree shown in Figure 6(d), the two arrays shown in Table 1 are built; any full absolute path, such as "/local/data1/c.db" for index 5 in Table 1, can be retrieved.
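A sketch of this bottom-up path reconstruction; the find and name contents below are hypothetical stand-ins for Table 1, arranged so that index 5 yields the example path:

```java
// Sketch of the DTree lookup: walk the find array from a file's index up
// to the root, prepending directory names at each level.
public class DTreeLookup {
    static String pathOf(int x, int[] find, String[] name) {
        StringBuilder path = new StringBuilder();
        for (int i = x; i >= 0; i = find[i]) {
            path.insert(0, "/" + name[i]); // prepend this level's name
            if (find[i] == i) break;       // root entry points at itself (our convention)
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Hypothetical arrays reproducing the "/local/data1/c.db" example.
        String[] name = {"local", "data1", "data2", "a.db", "b.db", "c.db", "1.db"};
        int[] find   = {0,       0,       0,       1,      1,      1,      2};
        System.out.println(pathOf(5, find, name)); // /local/data1/c.db
    }
}
```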
Both ATree and DTree are implemented with arrays, so both attributes and files have array indexes; therefore, in ADLinks, their respective indexes represent attributes and files. Generally, there are many attributes and files, but the number of files corresponding to each attribute is relatively small, so a two-dimensional matrix representing the relationships would be very sparse. Therefore, ADLinks is built as a sparse matrix: one array stores the indexes of attributes, called "Attribute-index," and the other stores the indexes of files, called "File-index." An ADLinks query first retrieves the index x of the target attribute from the ATree, then retrieves every position i whose value in the Attribute-index array equals x; the values of File-index[i] are then the indexes of the corresponding files. As for the wildcard of a node, it is treated as a special child node and represented as a negative value in the Attribute-index array. For example, if the attribute "team" has the value 2 in the Attribute-index array, its wildcard value is −2.
In Figure 6(c), the index values of "text," "team," and "tel." in the ATree are set to 1, 2, and 3 in ADLinks, respectively. Then, according to Table 1, the two arrays shown in Table 2 are built; the last column of the table represents the wildcard of the attribute "team." Let the queried attribute be "tel." and the index retrieved from the ATree be 3. In the Attribute-index array of ADLinks, the values of Attribute-index[3] and Attribute-index[4] are equal to 3; thus, the values of File-index[3] and File-index[4] are the corresponding file indexes. According to Table 1, the absolute paths of the files corresponding to the attribute "tel." are "/local/data1/c.db" and "/local/data2/1.db," respectively.
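A sketch of the ADLinks lookup over the two parallel arrays; the array contents are hypothetical stand-ins for Table 2, padded so that positions 3 and 4 hold the AID 3 as in the example:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an ADLinks lookup: scan the sparse Attribute-index array for
// the queried AID and collect the file indexes at the matching positions.
public class ADLinksLookup {
    static List<Integer> filesOf(int aid, int[] attributeIndex, int[] fileIndex) {
        List<Integer> fids = new ArrayList<>();
        for (int i = 0; i < attributeIndex.length; i++)
            if (attributeIndex[i] == aid) // match positions holding this AID
                fids.add(fileIndex[i]);   // collect the linked file indexes
        return fids;
    }

    public static void main(String[] args) {
        // Hypothetical parallel arrays; position 0 is unused so that the
        // positions match the description ("tel." = 3 at positions 3 and 4).
        // The trailing -2 entry is the wildcard of "team" (AID 2).
        int[] attributeIndex = {0, 1, 2, 3, 3, -2};
        int[] fileIndex      = {0, 3, 4, 5, 6, 4};
        System.out.println(filesOf(3, attributeIndex, fileIndex)); // [5, 6]
    }
}
```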
5.2. Creation
The overall steps for building an attribute index are shown in Figure 8 and explained as follows. First, the base and check arrays of ATree, the find and name arrays of DTree, and the Attribute-index and File-index arrays of ADLinks are initialized. Second, the range of the local database files is retrieved, all the urban data files in it are traversed, and the find and name arrays of the DTree are updated during the traversal until the DTree is built completely. Third, the DTree is traversed, the index value of each file is obtained, all attributes in each file are read, and each attribute is converted to its UTF-8 encoding and added to the ATree in turn. Fourth, the index values of all attributes in the ATree and the index values of the corresponding files in the DTree are stored in the Attribute-index array and File-index array of ADLinks, respectively. Among the four steps, the ATree building is the most complex, comprising the insert, compression, and transformation steps.

In the inserting step, a multitree structure is built to hold the ATree temporarily. A node stores a list of all its child node objects, its own encoded value, and, if the node is the end of an attribute, an index list of the corresponding files in DTree. This multitree is initialized with all attributes and managed in the form of a Trie-tree, where a node stores only one unit of encoding. The algorithm, which traverses the UTF-8 encoding of an attribute and inserts it into the multitree in turn, is shown in Algorithm 2. After the traversal is completed, the file index values corresponding to the attribute are added to the node information. In Algorithm 2, Y is the attribute to be inserted, fileIndex is a list that stores the file index values corresponding to the attribute, AT is the multitree under construction, and node is the current node in the multitree.
Algorithm 2: Inserting an attribute into the multitree.
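Since the full listing of Algorithm 2 is not reproduced here, the following is a sketch of the described insertion into the temporary multitree; the Node layout and names are ours:

```java
import java.util.*;

// Sketch of Algorithm 2: insert one attribute Y into the temporary
// multitree, one encoded character per node; fileIndex lists the DTree
// indexes of the files containing Y.
public class MultiTreeInsert {
    static class Node {
        char code;                       // one character of the UTF-8 hex encoding
        Map<Character, Node> children = new HashMap<>();
        List<Integer> fileIndex;         // set only on nodes ending an attribute
        Node(char code) { this.code = code; }
    }

    static void insert(Node root, String codeY, List<Integer> fileIndex) {
        Node node = root;
        for (char c : codeY.toCharArray())                 // walk or extend the path
            node = node.children.computeIfAbsent(c, Node::new);
        if (node.fileIndex == null) node.fileIndex = new ArrayList<>();
        node.fileIndex.addAll(fileIndex);                  // attach the file list
    }
}
```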
In the compression step, the multitree is traversed with the depth-first search approach. If a node stores no file index values and has only one child node, the node and its child are compressed into one node, and the encoded values of the two nodes are spliced together as the encoded value of the compressed node.
In the transformation step, the multitree is transformed into the double-array ATree; meanwhile, ADLinks are built with the following five steps:
(1) base[0] = 1 is initialized, and all values in the check array are set to 0.
(2) Assume that a node of the multitree is p and that its group of child nodes is c1, ..., cn. A positive integer begin is found such that check[begin + c1.code] = ... = check[begin + cn.code] = 0; that is, n free slots can be found to store these child nodes.
(3) The check entries of this group of child nodes are then set to check[begin + ci.code] = p.index (i = 1, ..., n), and the base value of their parent node is set to base[p.index] = begin.
(4) For nodes that store file index values, their ATree index values and file index values are stored in the Attribute-index array and the File-index array of ADLinks, respectively.
(5) For each child node, if it has no child nodes of its own, its base value is set to be negative; otherwise, all of its child nodes are inserted by iterating from step (2).
As far as the wildcards are concerned, they are processed during index updates rather than index creation. By this approach, fuzzy matching of newly added attributes can be achieved without modifying the ATree. When a new attribute is added to the attribute index, the DTree is updated first, and the corresponding DTree index values are retrieved. Afterwards, there are three possible cases: (1) if the ATree contains the new attribute, the ATree index value of the attribute is retrieved and combined with the new DTree index values to update ADLinks; (2) if the ATree does not contain the new attribute, the attribute is attached to the wildcard of the node with the longest common prefix, and only ADLinks is updated while the ATree remains unchanged; and (3) if the wildcard matches in the ATree exceed a limit, the entire index is recreated for performance reasons.
5.3. Integration and Urban Application
We take HBase as an example to describe the integration of the attribute index with a database. HBase is a highly reliable, high-performance, column-oriented, and scalable distributed database that runs on the HDFS file system. It uses rowkeys, column families, column qualifiers, and timestamps as keys, and the data content is stored as values in key-value pairs. HBase employs a master-slave architecture in which the HMaster manages many HRegionServers where the data reside. HBase data files are distributed on different data nodes of the HDFS file system; the basic file unit is called a StoreFile, which is stored on HDFS in the HFile format.
For each HRegionServer, an attribute index is built from the local files and resides in the memory of the server. The creation process follows the description in the previous sections. All HFiles on the server are iterated. The physical structure of an HFile consists of many parts; among them, only the DataBlock, which stores the user's key-value data, is scanned. Each key-value record consists of four parts: key length, value length, key, and value. The key part is a complex structure, including the length of the rowkey, the content of the rowkey, the length of the column family, the content of the column family, the content of the column qualifier, the timestamp, and the keyType, among which the column qualifier is the element used for the attribute index. The building processes of the attribute index, including ATree, DTree, and ADLinks, are the same for all HRegionServers and are executed in parallel as Map tasks of the MapReduce paradigm.
When a query is submitted to HBase, for example, the query q is: scan "Person", {FILTER => "QualifierFilter ('name') AND SingleColumnValueExcludeFilter ('INFO', 'age', >, 'binary:28')"}.
The query q is to find the names of persons in the table Person whose age is larger than 28. The HBase client first locates all Regions where the table Person resides through a RegionServer, the hbase:meta table, and ZooKeeper. Without loss of generality, assume that Regions A1 and A2 on RegionServer A and Regions B1 and B2 on RegionServer B are matched. A Region contains an in-memory memstore and on-disk storefiles, which are HFiles stored in HDFS. Then, if there is a secondary index on the column age, which maps values of age to the rows, Region B2 and some storefiles in Regions A1, A2, and B1 are filtered out by the query condition "age > 28"; thus, the scan range is first narrowed through the index on values. Next, the attribute indexes on both RegionServer A and B are consulted before the storefiles in Regions A1, A2, and B1 are scanned: only the returned storefiles on each RegionServer are scanned, in parallel. In conclusion, the attribute index further narrows the scan range by excluding the storefiles that do not contain the attributes name or age. The query process described in this paper is shown in Figure 9.
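For reference, a client-side sketch of query q using the HBase 2.x Java API is shown below. The attribute-index pruning happens inside the RegionServers and is not visible at this level; we also use the standard SingleColumnValueFilter here instead of the shell example's SingleColumnValueExcludeFilter, which additionally strips the tested column from the results.

```java
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

// Client-side sketch of query q: scan table Person for the "name" column
// of rows whose INFO:age is greater than 28. The attribute index would
// prune storefiles on each RegionServer before these filters are applied.
public class QueryExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("Person"))) {
            Filter byQualifier = new QualifierFilter(
                CompareOperator.EQUAL, new BinaryComparator(Bytes.toBytes("name")));
            SingleColumnValueFilter byAge = new SingleColumnValueFilter(
                Bytes.toBytes("INFO"), Bytes.toBytes("age"),
                CompareOperator.GREATER, Bytes.toBytes("28"));
            byAge.setFilterIfMissing(true); // skip rows without an age column
            Scan scan = new Scan().setFilter(
                new FilterList(FilterList.Operator.MUST_PASS_ALL, byQualifier, byAge));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results)
                    System.out.println(Bytes.toString(r.getRow()));
            }
        }
    }
}
```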

The application scenario of HBase with an attribute index is also clear. Urban data from different sources usually have massive, nonorthogonally distributed attributes: there are many types of attributes, but most attributes exist only in data from certain sources. For example, the attribute transaction amount exists in store transaction records and enterprise purchase lists, while the attribute temperature exists only in some weather sensors. If the urban data generated by all sources are stored together, the set of attributes becomes huge. A good example is SmartSantander, one of the largest smart city experimental testbeds in the world, deployed in the city of Santander in the north of Spain. The SmartSantander infrastructure includes a continuously growing IoT setup spread throughout the city that currently encompasses more than 10,000 diverse IoT devices (fixed and mobile sensor nodes, Near-Field-Communication (NFC) tags, gateway devices, and citizens' smartphones) [24]. An anemometer is one kind of IoT device in SmartSantander, and the data files that contain real-time wind speed, the unique attribute of the anemometer, account for only a small part of the overall data files. If such a dataset is stored in HBase, the urban data can be stored with the attribute as the column qualifier and the value of the attribute as the value. In this case, the attribute index can significantly speed up queries.
6. Experiments
In this section, both the performance of the attribute index and its integration with HBase are evaluated and compared. In Subsection 6.1, we describe the fundamentals of the experiments, including the experimental purpose and plan.
6.1. Setup
6.1.1. Scope
We compare the core component of the attribute index, ATree, with existing indexes in a standalone environment. We also integrate ATree into HBase, namely, UrbanBase, and compare its query performance with that of the original HBase.
6.1.2. Experimental Environment
We execute our experiments on a cluster of 6 physical machines. Each node has the same configuration, i.e., an Intel Core i7, a 1 TB hard disk, 8 GB of memory, CentOS 7, and moderate I/O performance. Gigabit Ethernet is connected through a Dell PowerConnect 5548.
6.1.3. Selection of Competitors
We compare ATree with the B+-Tree, AVL-Tree, and LSM-Tree indexes, which are possible substitutes for the attribute index. In a B+-Tree, the root represents the whole range of values in the tree, and every internal node represents a subinterval [25]. The LSM-Tree, as a search tree, maintains attribute-value pairs in two or more separate structures, each optimized for its respective underlying storage medium, and data are synchronized between the structures efficiently, in batches [26]. An AVL tree is a self-balancing binary search tree in which the heights of the two child subtrees of any node differ by at most one. Since most of these indexes support disk as storage, without losing fairness, we forgo disk operations and keep all data in memory. We note that although the four indexes are designed for different purposes and application scenarios, they are all adopted here to maintain and query massive attributes, represented as short strings. Besides, UrbanBase is, of course, compared with HBase to evaluate the optimization effects.
6.1.4. Fragmentary Schema
The fragmentary schema in the experiments is represented by two means: first, the attributes are massive, random strings; second, the attributes are nonorthogonally distributed among tables, and each row of a table leaves half of its attributes empty for sparsity.
6.1.5. Experimental Data
The dataset for the index experiments contains massive random strings as attributes. When generating the data, each attribute contains 8 characters randomly selected from 95 different characters (ASCII codes 32 to 126, the keyboard characters), and the attribute set contains 50k, 100k, 200k, 400k, and 800k strings. The datasets for HBase and UrbanBase are attribute-value datasets. We prepare a word set of 10000 English words using the NLTK library in Python. We extract five subsets from the word set at different scales, i.e., 1000, 2000, 4000, 6000, and 8000 words, and use them as the candidate attributes. For generating an attribute-value pair, the attribute is selected from the candidate set, and the value is a random string whose length, together with its attribute, is 100 bytes. By this approach, we generate a 50 GB dataset that contains about 5 × 10^8 records while controlling the number of distinct attributes at the five scales. Thus, the five datasets in the HBase and UrbanBase experiments are all 50 GB, but they contain 1000, 2000, 4000, 6000, and 8000 distinct attributes, respectively.
6.1.6. Experimental Cases
For the index experiments, the test cases cover the building, query, appending, and storage of the four indexes. For the UrbanBase experiments, the query clause is the same as the example in Section 5.3 and Figure 9.
6.2. Results
For comparing ATree, B+-Tree, LSM-Tree, and AVL-Tree on the three operations of building, query, and appending, the criteria are time consumption and storage consumption. Overall, the building time of B+-Tree and AVL-Tree is on average 1.23x and 1.89x longer than that of ATree, respectively. The query time of B+-Tree, LSM-Tree, and AVL-Tree is on average 1.11x, 1.5x, and 1.2x longer than that of ATree, respectively. The storage of LSM-Tree and AVL-Tree is on average 1.21x and 1.61x larger than that of ATree, respectively. The results are shown in Figure 10.

Figure 10(a) shows the following results:
(i) When the data volume increases, AVL-Tree's building time is longer and increases faster than that of the other indexes. Building time and data volume are almost linearly correlated except for AVL-Tree.
(ii) LSM-Tree is designed for HBase and optimized for fast insertion: new data are temporarily held in memory and merged into files later. Even though in our experiments the data are not actually merged to disk but stay in memory, the creation of LSM-Tree is very efficient.
(iii) AVL-Tree incurs too many I/O operations due to rotations when building a taller tree, even though these operations are performed in memory in our experiments. That is why AVL-Tree has the worst building performance.
(iv) The building time of ATree is less than that of B+-Tree because the nodes of ATree are smaller and the ATree is also shallower than the B+-Tree.
Figure 10(b) shows the following results:
(i) The ATree has the best query performance, while the LSM-Tree has the worst. The results follow the time complexities, and the level-wise query mechanism of LSM-Tree is not optimized for attribute queries.
(ii) The query time of AVL-Tree and LSM-Tree increases quickly with the data volume, while the query time of the other two indexes is almost stable.
(iii) For AVL-Tree, queries suffer from more I/O operations because the tree is taller than B+-Tree and ATree.
(iv) Because of its simple structure, the ATree has better query performance than the B+-Tree.
(v) The appending time of the four indexes, shown in Figure 10(c), follows the same regularities as the building time.
Figure 10(d) shows the following results:
(vi) When the data volume increases, AVL-Tree's storage is larger and increases faster than that of the other indexes; for all indexes, storage and data volume are almost linearly correlated.
(vii) The B+-Tree requires less storage than the ATree, but they are very close because ATree also adopts node compression to reduce storage. However, for this comparison, the wildcards are disabled; in practice, ATree must also be associated with ADLinks and DTree, whose storage is not counted here. Intuitively, the storage for a complete attribute index would be about double.
(viii) LSM-Tree requires more storage than ATree and B+-Tree because of its compaction mechanism and its read and write amplification problems.
We compare the query performance and storage consumption of HBase and UrbanBase with the same query case and runtime environment. The experimental results are shown in Figure 11. Generally, the query time is dominated by the data volume rather than the number of distinct attributes, so the query performance of both HBase and UrbanBase is stable. Overall, UrbanBase performs queries 1.3x faster than HBase. Almost half of the data files are skipped during scanning because the attribute index identifies the files that do not contain the queried attributes. Besides, the number of distinct attributes affects the query efficiency of the attribute index in UrbanBase; however, compared with the query cost on the dataset, this effect is negligible.

The experimental results prove that, for a distributed attribute-value store with massive attributes, using an attribute index can greatly improve query efficiency. UrbanBase, as HBase integrated with the attribute index, shows that the adaptability of the attribute index is also good. In the original HBase, queries can be optimized by secondary indexes on values because some data files beyond the queried range are skipped; in our solution, data files can also be skipped by the attribute index because they do not contain the queried attributes. Furthermore, the attribute index and value indexes can work together. In addition, since the attribute index is built on each data node, it does not affect the parallelism of distributed queries. Moreover, the length of an attribute is relatively short, which maintains the efficiency of ATree, and the storage cost of ATree is carefully designed to fit the memory size. The time spent searching the attribute index is far less than the time saved by narrowing the scanning scope.
7. Conclusions
For the applications of urban big data management, this paper proposes an attribute-value model to represent all types of urban data. Based on the model, we present the design, implementation, and evaluation of the attribute index, an attribute-oriented optimization approach proposed to solve the fragment complexity of official, social, and sensorial urban data. The attribute index is designed as a secondary index not on values but on attributes. It resides in the memory of each data node, identifies the data files that do not contain the queried attributes in a parallel manner, and finally speeds up queries by skipping these files in the scanning process. We proposed the following models, algorithms, and implementations for the attribute index:
(1) The three parts of the attribute index, namely, ATree, DTree, and ADLinks, and how they work together.
(2) The logical structures and algorithms of ATree, DTree, and ADLinks, especially the compression, wildcards, multilanguage support, and complexity analysis of ATree.
(3) The double-array implementation, the creation, and the update of the attribute index.
(4) The selection of HBase as the host system, with an explanation of how to integrate the attribute index into HBase and the potential application scenario as storage for massive urban data.
The experimental results show that, in comparison with B+-Tree, LSM-Tree, and AVL-Tree, the query time of ATree is 1.1x, 1.5x, and 1.2x faster, respectively. Also, HBase integrated with the attribute index, namely, UrbanBase, achieves query performance 1.3x faster than that of the original HBase.
In the future, the attribute index could be further optimized as follows: more investigation into attribute extraction according to the attribute distribution, adjusting the ATree according to the frequency of the attributes, and integrating the attribute index into more databases.
Data Availability
Data used to support this study can be obtained from the corresponding author on request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors would like to thank Chengwen Wang, who has graduated from the Software College, Northeastern University, with a master's degree; he contributed to parts of the solution described in Section 4. The authors would also like to thank the reviewers for their feedback on the paper. This work was supported by a research grant from the National Natural Science Foundation of China (Grant no. 61662057) and the Fundamental Research Funds for the Central Universities (N182504017).