Abstract
The dependability and elasticity of NoSQL stores in critical applications still merit study. Clustering and backup technologies are commonly used to improve NoSQL availability, but these approaches do not account for the loss of availability that occurs when a NoSQL store hits a performance bottleneck. To enhance the availability of Riak TS effectively, a resource-aware mechanism is proposed. First, the data table is sampled by time to obtain the correspondence between time and data size, and the real-time resource consumption is recorded by Prometheus. Based on the sampling results, a polynomial curve fitting algorithm is used to construct a prediction curve. Then the resources required for an upcoming operation are predicted from the time interval in the SQL statement, and the operation is evaluated by comparing the prediction with the remaining resources. Using a real hydrological sensor dataset as experimental data, the effectiveness of the mechanism is evaluated in terms of sensitivity and specificity. The results show that, with the initial sampling dataset, the availability enhancement mechanism achieves an average specificity of 80.55% and an average sensitivity of 76.31%. As the training dataset grows, the specificity increases from 80.55% to 92.42% and the sensitivity increases from 76.31% to 87.90%. In addition, the availability increases from 40.33% to 89.15% in the hydrological application scenario. The experimental results show that this resource-aware mechanism can effectively prevent potential availability problems and enhance the availability of Riak TS. Moreover, as the number of users and the size of the collected data grow, the method is expected to become more accurate.
1. Introduction
As the world becomes more instrumented and connected, we are witnessing a flood of digital data generated by diverse hardware (e.g., sensors) and software, in the form of big data. However, traditional storage, represented by relational databases, has difficulty handling large-scale batch or stream data effectively. To explore the value of big data, the first challenge is how to store and manage it reasonably and dependably. The fast-evolving NoSQL stores [1] provide one solution; they are often characterized as schema-free, easy to replicate, simple in API, and eventually consistent. More than two hundred NoSQL databases exist, with very different characteristics, and many mainstream NoSQL stores have been adopted for big data applications in different fields, such as Redis [2], HBase [3], MongoDB [4], Druid [5], and Riak TS [6]. However, the dependability [7] and elasticity [8] of NoSQL stores in critical applications still merit study.
In our hydrological application system, Riak TS, a well-known enterprise-grade NoSQL time series database optimized for IoT and time series data, is adopted for storing hydrological sensor stream data. However, we found that Riak TS can crash when the dataset returned by a query is too large. For example, when hundreds of users query long-term time series data for multiple hydrological stations, the system breaks down easily. For a critical system, serious consequences follow if such crashes cannot be effectively prevented. As far as we know, little research directly examines the cause and the solution. Other NoSQL stores also offer few direct lessons, because the differences between NoSQL stores are large. Therefore, a resource-aware mechanism for enhancing the availability of Riak TS is proposed. The core idea is that the data in Riak TS is sampled to obtain the correspondence between time and data size, while the real-time resource consumption is recorded by Prometheus [9] to obtain the relationship between data size and resource consumption. This relationship is then used to build a prediction model, which completes the preparation. When a user issues a query, the data size is predicted from the time interval in the Riak TS query statement, and the resources required for the operation are calculated from it. The effectiveness of the proposed mechanism is verified on a real hydrological sensor dataset and application scenario.
The rest of this paper is organized as follows. Section 2 discusses the research work related to this paper. Section 3 introduces in detail the methodology of enhancing the availability of Riak TS using the resource-aware mechanism. In Section 4, real hydrological sensor data is used as experimental data to verify the effectiveness of the proposed mechanism. Finally, the conclusions and future prospects are given.
2. Related Work
2.1. NoSQL
In [1], the author introduced the related concepts and classification of NoSQL, i.e., “NoSQL refers to non-relational, distributed, and non-ACID-compliant data storage systems”. NoSQL can be divided into multiple categories, including key-value stores, wide-column stores, document stores, time series databases, and graph databases. Because of the slight differences in key-value data storage, the researchers further divided the key-value storage mechanism into key-document storage represented by MongoDB, key-column storage represented by HBase, and key-value storage represented by Redis. The website “nosql-database.org” gives a more detailed taxonomy and description of the more than 200 current NoSQL stores. It divides them into 15 classes: wide column store/column families, document store, key value/tuple store, graph databases, multimodel databases, object databases, XML databases, multidimensional databases, multivalue databases, event sourcing, grid and cloud database solutions, time series/streaming databases, other NoSQL-related databases, scientific and specialized DBs, and unresolved and uncategorized. Under each category, there are many specific instances of NoSQL databases.
Different NoSQL stores have different characteristics and applicability [8–12]. In [8], the researchers elucidated the design decisions of NoSQL stores with regard to four design principles of distributed database systems: the data model, the consistency model, data partitioning, and the CAP theorem. Literature [10] gave a top-down overview of the field and proposed a comparative classification model that relates functional and non-functional requirements to the techniques and algorithms employed in NoSQL databases. In [11], the authors surveyed NoSQL engines and created a concise, up-to-date comparison, identifying their most beneficial use cases from the software engineer’s point of view and the quality attributes to which each of them is best suited. In [12], the authors tested both collection speed and aggregation speed for realistic streams of sensor data, using relational databases, key-value stores, column stores, and time-series enabled database systems. Their experimental results confirmed that column stores and key-value stores perform better than relational databases, while time series databases outperform all the others. The above work mainly introduces the concepts and characteristics of NoSQL databases.
As an instance of NoSQL, Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and time series data. It can ingest, transform, store, and analyse massive amounts of time series data. Therefore, Riak TS is widely used for log analytics, edge device analytics, IoT data retrieval, and so on. However, because the types and instances of NoSQL stores are numerous and develop rapidly, the dependability of NoSQL stores in critical applications deserves study. More specifically, dependability is an integrating concept that encompasses availability, reliability, safety, integrity, and maintainability. As far as we know, there is still plenty of room for further study of the dependability of time series databases such as Riak TS. Given the above, it makes sense to pursue the enhancement of NoSQL dependability and to start with Riak TS.
2.2. Availability
Availability in a database [13] means that all or some of the nodes in the system can continue to serve upper-layer applications after a node or link failure has occurred. Different databases optimize for availability in different ways.
Currently, many technologies can help a database achieve its high-availability goals. The MySQL high-availability scheme based on DRBD (Distributed Replicated Block Device) [14] uses block-device-level synchronous replication, implemented as a kernel module, in combination with other high-availability cluster technologies. Database replication, including both logical replication and physical replication, is another common approach [15].
MongoDB supports horizontal data partitioning across nodes and uses B-Trees to index data within shards. Each shard may be replicated for high availability and failure recovery [16]. Chalkiadaki [17] supports a primary-backup scheme using a shadow QoS controller to replicate the state of the primary for high availability in Cassandra [18]. In Gorilla [19], even if a network partition or other failure leads to disconnection between different data centres, systems operating within any given data centre ought to be able to write data to local TSDB (a time series database) machines and be able to retrieve this data on demand.
Clustering is a mature high-availability solution that avoids single-host problems through host redundancy, in which multiple machines share a set of disks. The machines use floating IP addresses [20] so that the applications they serve can be switched between different machines. Riak TS adopts the cluster model to ensure high availability. Its high availability is based on the NWR model [21], where N is the number of replicas, W is the minimum number of replicas that must be written for a write to be considered successful, and R is the minimum number of replicas that must be read for a read; R and W are chosen so that W + R > N. By reading from and writing to multiple servers, Riak TS keeps services available when a machine is down or the network is broken. However, Riak TS only considers node downtime. When nodes face performance problems such as insufficient memory, a query over a large amount of data is likely to exhaust the resources of the cluster, so that the entire system cannot function properly.
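The W + R > N condition above can be illustrated with a minimal Python sketch. The function name `quorums_overlap` and the example values are illustrative assumptions, not part of Riak TS; the sketch only checks the inequality that guarantees every read quorum shares at least one replica with every write quorum.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Return True when W + R > N, i.e. any R replicas read must
    include at least one of the W replicas most recently written."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must lie between 1 and N")
    return w + r > n


# Illustrative settings only (hypothetical values).
print(quorums_overlap(3, 2, 2))  # N=3, W=2, R=2
print(quorums_overlap(3, 1, 1))  # N=3, W=1, R=1
```

With N = 3, choosing W = 2 and R = 2 satisfies the condition, whereas W = 1 and R = 1 does not, so a read could miss the latest write.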
2.3. Performance Prediction and Resource Monitor
Performance evaluation techniques are grouped into two major categories: model-based and measurement-based. Model-based approaches require in-depth understanding of the application [22]. Measurement-based techniques consider the application as a black box and thus are more straightforward and prevalent in practice [23]. They require methodological experimentation to collect a wide range of measurements on the system under study.
Karniavoura [16] has proposed a measurement-based way to predict NoSQL performance with high accuracy, but this does not guarantee high availability in NoSQL. In reality, the data distribution of some datasets is not uniform, and some datasets are highly random. Although users may have a very accurate prediction model, they cannot predict what kind of load they are going to generate, which can significantly reduce database availability. Generally speaking, users operate the database by writing SQL statements, and they cannot predict how much load a given SQL statement will place on it. Therefore, this approach can only predict how the performance will change when a specific load occurs; it cannot predict the cause of the load in advance. In this way, a decline in database availability is inevitable.
Besides, using latency as the evaluation index is vulnerable to the influence of hardware. Performance issues are essentially resource issues, so this paper adopts the resource monitoring software Prometheus [9] for evaluation. Prometheus is an open source monitoring system that stores all information as time series data and analyses, in real time, the system's operating status, execution time, number of calls, and so on, in order to find hot spots in the system and provide a basis for performance optimization.
3. The Proposed Methodology
3.1. Problem Description
In our hydrological application system, we used Riak TS for storing and querying hydrological sensor stream data. The runtime environment is deployed in a cluster of four PCs, and the hardware environment is as follows: CPU is Intel(R) Xeon(R) CPU E5645@2.40GHz dual-core 24 CPU; memory is Kingston DDR3 1333MHz 8G, 500GB SSD Flash Memory. The operating system is Ubuntu 16.04 64-bit with the Linux 3.11.0 kernel. Some experimental data are shown in Table 1.
When multiple users query long-term hydrological time series data for multiple stations, a crash occurs easily. For example, the selected time intervals are day, week, month, and year. Figure 1 shows the result of querying the data of multiple stations.

We randomly selected data for one day, one week, one month, and one year, and the corresponding data sizes were 6328 rows, 44010 rows, 206814 rows, and 11897068 rows, respectively. The execution time was also recorded in milliseconds. From Figure 1, we can see that as the data size increases, so does the execution time of Riak TS. When the time interval is one year, for which the corresponding amount of data is 11897068 rows, Riak TS crashes because of its performance bottleneck. With other NoSQL databases such as HBase or MongoDB, the same operation does not crash, although the execution is slow.
3.2. The Sampling Strategy
Riak TS is a time series database in which queries must constrain a primary key field of timestamp type. The time range can therefore serve as an intermediary for predicting the data size. To this end, the data table is sampled to obtain its macroscopic information at a small cost, recording the sampling time interval, the size of the sampled data, and the resource consumption. The sampling mechanism is as follows.
Step 1. Customize the initial sampling interval to get the initial sample dataset, which includes three properties: start time, end time, and data size.
Step 2. When a Riak TS SQL statement is entered, extract its start time $t_s$ and end time $t_e$. Compare $t_s$ and $t_e$ with the data of the sampled dataset and find the largest sub-interval of $[t_s, t_e]$ that is covered by the sampled dataset; denote its start time by $t_s'$ and its end time by $t_e'$.
Step 3. If the data size between $t_s'$ and $t_e'$ cannot be obtained directly, start with $t_s'$, look for the maximum value $t_1$ in the end time attribute, and record the data size from $t_s'$ to $t_1$. Then start with $t_1$ and repeat the previous step until the end time is greater than or equal to $t_e'$. If the end time is greater than $t_e'$, the data size is divided proportionally.
Step 4. Look for the maximum point $t_0$ in the start time attribute that is smaller than $t_s$; the residual value is obtained from the ratio of the time interval between $t_s$ and $t_s'$ to the time interval between $t_0$ and $t_s'$. The residual value at the other end is obtained in the same way.
Step 5. The two residual values and the middle value (the data size of the covered sub-interval) are added to obtain the final predicted data size.
After the sampling steps described above, the relationship between the time, data size, and resource consumption can be obtained.
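The sampling steps above can be sketched in Python under simplifying assumptions that are not stated in the paper: the samples are non-overlapping intervals, each stored as a hypothetical `(start_time, end_time, row_count)` tuple, and rows are spread uniformly within a sample. Fully covered samples then contribute the middle value, and partially covered samples at either end contribute the proportional residual values.

```python
from typing import List, Tuple

# Hypothetical record layout: (start_time, end_time, row_count).
Sample = Tuple[float, float, int]


def predict_rows(samples: List[Sample], q_start: float, q_end: float) -> float:
    """Estimate how many rows a query over [q_start, q_end] touches.

    Each sample contributes its row count scaled by the fraction of its
    time span that overlaps the query interval (uniformity assumption).
    """
    if q_end <= q_start:
        return 0.0
    total = 0.0
    for start, end, rows in samples:
        if end <= start:
            continue  # skip malformed samples
        overlap = min(end, q_end) - max(start, q_start)
        if overlap > 0:
            total += rows * overlap / (end - start)
    return total


# Three consecutive samples of 10 time units each (illustrative numbers).
samples = [(0, 10, 100), (10, 20, 300), (20, 30, 200)]
# Half of the first sample, all of the second, half of the third.
print(predict_rows(samples, 5, 25))
```

The uniformity assumption is the weakest part of this sketch: where the data is bursty, shorter sampling intervals would reduce the error of the proportional split.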
3.3. Polynomial Curve Fitting Algorithm
The least squares method [24, 25] is a mathematical optimization technique widely used in many fields, including parameter estimation, system identification, prediction, and forecasting. It finds the function that best matches the data by minimizing the sum of squared errors. Using the least squares method, unknown data can be estimated easily, such that the sum of squared errors between the estimated data and the actual data is minimized.
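Specifically, suppose that a given set of discrete data points is recorded as $\{(x_i, y_i)\}$, $i = 0, 1, \ldots, m$, and we want to find a function

$$p(x) = a_0 + a_1 x + \cdots + a_n x^n = \sum_{k=0}^{n} a_k x^k \tag{1}$$

that fits the data as closely as possible. The key principle of the least squares method for finding the fitting function is to make the sum of the squared differences smallest.

Take

$$r_i = p(x_i) - y_i, \quad i = 0, 1, \ldots, m. \tag{2}$$

In (2), $r_i$ is the residual, representing the fitting error. Usually, $n \le m$. The purpose is to determine the coefficients $a_k$ in (1) according to the least squares method. In other words, it is necessary to solve for the $a_0, a_1, \ldots, a_n$ that make the sum of the squares of the residuals least:

$$I = \sum_{i=0}^{m} r_i^2 = \sum_{i=0}^{m} \left( \sum_{k=0}^{n} a_k x_i^k - y_i \right)^2. \tag{3}$$

This minimum value problem is equivalent to finding the extremum of the multivariate function $I(a_0, a_1, \ldots, a_n)$. For $I$ to take its minimum value, the coefficients need to meet the condition

$$\frac{\partial I}{\partial a_j} = 2 \sum_{i=0}^{m} \left( \sum_{k=0}^{n} a_k x_i^k - y_i \right) x_i^j = 0, \quad j = 0, 1, \ldots, n, \tag{4}$$

that is,

$$\sum_{k=0}^{n} \left( \sum_{i=0}^{m} x_i^{j+k} \right) a_k = \sum_{i=0}^{m} x_i^j y_i, \quad j = 0, 1, \ldots, n. \tag{5}$$

Equation (5) written in matrix form is

$$\begin{pmatrix} m+1 & \sum x_i & \cdots & \sum x_i^n \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_i^n & \sum x_i^{n+1} & \cdots & \sum x_i^{2n} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^n y_i \end{pmatrix}, \tag{6}$$

where every sum runs over $i = 0, 1, \ldots, m$. Take the coefficient matrix $A$, the variable column vector $a$, and the constant term column vector $b$. Equation (6) can be simplified as

$$A a = b. \tag{7}$$

For (7) to have a unique solution, it must meet the requirement that $\det A \neq 0$. This can be proved by contradiction. Assume $\det A = 0$; then the homogeneous linear equations

$$\sum_{k=0}^{n} \left( \sum_{i=0}^{m} x_i^{j+k} \right) a_k = 0, \quad j = 0, 1, \ldots, n, \tag{8}$$

have a non-zero solution $(a_0, a_1, \ldots, a_n)$. Each equation in (8) is multiplied by $a_j$ and the results are summed over $j$. The result is

$$\sum_{j=0}^{n} a_j \sum_{k=0}^{n} \left( \sum_{i=0}^{m} x_i^{j+k} \right) a_k = \sum_{i=0}^{m} \left( \sum_{k=0}^{n} a_k x_i^k \right)^2 = \sum_{i=0}^{m} p(x_i)^2 = 0. \tag{9}$$

From (9), it can be known that $p(x_i) = 0$ for $i = 0, 1, \ldots, m$, which means that $p(x)$ has $m + 1$ zero points (the $x_i$ being distinct). However, $p(x)$ is a polynomial of degree at most $n$ with $n \le m$, so the number of its zero points cannot exceed $n$ unless all the coefficients are zero ($a_0 = a_1 = \cdots = a_n = 0$). This contradicts the non-zero solution of (8). So the assumption is incorrect, and (7) has a unique solution. Gaussian elimination with pivoting (the principal element elimination method) can be used to solve (7) and obtain the solution vector $a$.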
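As an illustration of the derivation above, the following Python sketch builds the normal equations of a polynomial least squares fit and solves them by Gaussian elimination with partial pivoting. It is a minimal sketch, not the authors' implementation; the function names `fit_polynomial` and `evaluate` and the sample data are assumptions made for the example.

```python
from typing import List, Sequence


def fit_polynomial(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Return coefficients a_0..a_degree minimising sum((p(x_i) - y_i)**2)."""
    n = degree + 1
    if len(xs) != len(ys) or len(set(xs)) < n:
        raise ValueError("need matching lengths and at least degree + 1 distinct x values")
    # Augmented normal-equation matrix: row j holds sum(x^(j+k)) for each k,
    # followed by the right-hand side sum(x^j * y).
    aug = [[sum(x ** (j + k) for x in xs) for k in range(n)]
           + [sum((x ** j) * y for x, y in zip(xs, ys))]
           for j in range(n)]
    # Forward elimination with partial (column) pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for j in range(n - 1, -1, -1):
        tail = sum(aug[j][k] * coeffs[k] for k in range(j + 1, n))
        coeffs[j] = (aug[j][n] - tail) / aug[j][j]
    return coeffs


def evaluate(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the fitted polynomial at x using Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result


# Illustrative data generated from y = 1 + 2x + 3x^2.
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, 2)
print([round(c, 6) for c in coeffs])
```

Normal equations become ill-conditioned when the inputs are large (for example raw row counts in the millions), so in practice the inputs would likely need rescaling before fitting.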
After the fitting function has been determined by the least squares method, it is used to predict the resource consumption of a subsequent query, in order to judge whether the remaining computing resources are enough for the query. From the time interval in the entered SQL statement, the corresponding data size is obtained, and from the data size the expected resource consumption is obtained. We set a threshold according to the computing resource situation; if the ratio of the resources that the operation needs to the remaining resources is greater than the threshold, the current operation is rejected. If the ratio is less than or equal to the threshold, Riak TS continues to execute the operation.
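The threshold rule can be written as a small Python function. This is a sketch under assumptions: the name `admit` is hypothetical, the default threshold of 0.9 is the value used in the experiments of Section 4.4, and treating a non-positive remaining resource as a rejection is a choice made here, not something the paper specifies.

```python
def admit(predicted_need: float, remaining: float, threshold: float = 0.9) -> bool:
    """Return True when the operation may run, False when it is rejected.

    The operation is rejected when predicted_need / remaining exceeds
    the threshold.
    """
    if remaining <= 0:
        return False  # assumption: nothing left means every operation is rejected
    return predicted_need / remaining <= threshold


# Illustrative values in arbitrary memory units.
print(admit(450.0, 1000.0))  # ratio 0.45, within the threshold
print(admit(950.0, 1000.0))  # ratio 0.95, above the threshold
```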
3.4. The Resource-Aware Mechanism
Prometheus is used to monitor the resources of the cluster in the resource-aware mechanism. By configuring the corresponding metrics, the CPU, memory, and network I/O consumption is acquired.
Before the experiment, it is necessary to make it clear which resource is most closely related to Riak TS. The monitoring results are as shown in Figures 2, 3, and 4.



The data in Figures 2, 3, and 4 is collected through real-time monitoring by Prometheus, recording the corresponding resource consumption once for each additional 100 data items. Memory usage is recorded as total memory minus cached memory minus free memory. CPU usage is calculated as 100% minus the idle CPU percentage (mode is idle). Network I/O consumption is observed by recording the total number of bits transmitted within 30 seconds.
From Figures 2, 3, and 4, we can see that memory consumption increases significantly as the data size increases. At the beginning, because the memory consumption metric is computed incrementally and the amount of data initially selected is small, memory consumption is almost 0. Fluctuations in the memory consumption of the machine itself can even lead to negative computed values.
The CPU has a relatively high usage rate at the time of the initial query. However, as the number of queries increases, it gradually stabilizes at around 10%. Because the application scenario continuously increases the amount of data and network monitoring does not record dynamically in real time, the value of network I/O is collected every 30 seconds. As the size of the data being queried increases, the query time increases accordingly. According to the chart shown in Figure 3, the network I/O does not change much as the amount of data increases.
To sum up, memory is the most important factor affecting queries, so memory consumption is the focus of the resource-aware mechanism.
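The memory and CPU formulas described above amount to simple arithmetic, sketched here in Python. The function names and the byte values are illustrative assumptions; the sketch does not query Prometheus and only shows how the recorded figures would be derived from raw readings.

```python
def used_memory_bytes(total: int, cached: int, free: int) -> int:
    """Memory in use: total memory minus cached memory minus free memory."""
    return total - cached - free


def cpu_usage_percent(idle_percent: float) -> float:
    """CPU usage: 100% minus the percentage of time spent in idle mode."""
    return 100.0 - idle_percent


GIB = 1024 ** 3
# Illustrative readings for a machine with 8 GiB of memory.
print(used_memory_bytes(8 * GIB, 2 * GIB, 1 * GIB) // GIB)  # GiB in use
print(cpu_usage_percent(90.0))
```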
First, the data table in Riak TS is sampled to select data of a certain size. Meanwhile, the resource status of the cluster is continuously monitored and recorded by Prometheus. Thus, we can establish the relationship between computing resources and data size. Then the least squares method is used to fit the collected data, and the linear regression model is obtained. A threshold is set for the computing resource: an operation is rejected if the ratio of the resources it requires to the remaining resources is greater than the threshold, and it is executed if the ratio is less than or equal to the threshold. This mechanism improves availability and prevents insufficient resources from causing a decline in global availability. The concrete steps are as follows.
Input. SQL statement S.
Output. State available or not.
Step 1. Enter in advance the start time T1 and end time T2 of the Riak TS data table, and sample the entire data table between them while recording the size of the sampled data, the sampling intervals, the resource consumption, and whether Riak TS crashes. This yields the relationship between time, data size, and resource consumption.
Step 2. The linear regression model is established to predict resource consumption through the data size.
Step 3. The time interval is extracted from the input statement S. According to the relationships obtained in Steps 1 and 2, the data size that the query is likely to return is predicted, and the resources that it is likely to consume are obtained accordingly.
Step 4. The resource consumption is compared with the remaining resource, and the state available or not is given.
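Steps 3 and 4 can be sketched end to end in Python. Everything named here is an assumption made for illustration: the SQL is assumed to bound a column called `time` with integer timestamps, the table name `hydro` is hypothetical, and the two callables stand in for the sampling-based size estimate and the fitted resource model from Steps 1 and 2. Rejecting statements without a usable time range is also a choice of this sketch.

```python
import re
from typing import Callable, Optional, Tuple

# Assumption: bounds look like "time >= 1000" on a column named "time".
TIME_BOUND = re.compile(r"\btime\s*(>=|>|<=|<)\s*(\d+)", re.IGNORECASE)


def extract_interval(sql: str) -> Optional[Tuple[int, int]]:
    """Return (lower, upper) time bounds found in the statement, or None."""
    lower = upper = None
    for op, value in TIME_BOUND.findall(sql):
        if op.startswith(">"):
            lower = int(value)
        else:
            upper = int(value)
    if lower is None or upper is None or upper < lower:
        return None
    return lower, upper


def is_available(sql: str,
                 rows_for_interval: Callable[[int, int], float],
                 memory_for_rows: Callable[[float], float],
                 remaining_memory: float,
                 threshold: float = 0.9) -> bool:
    """Decide whether the statement may run given the remaining memory."""
    interval = extract_interval(sql)
    if interval is None:
        return False  # assumption: no usable time range means reject
    rows = rows_for_interval(*interval)       # Step 3: predicted data size
    needed = memory_for_rows(rows)            # Step 3: predicted memory
    if remaining_memory <= 0:
        return False
    return needed / remaining_memory <= threshold  # Step 4: comparison


# Hypothetical models: 2 rows per time unit, 100 bytes of memory per row.
sql = "SELECT * FROM hydro WHERE time >= 1000 AND time < 2000 AND station = 'S1'"
print(is_available(sql, lambda s, e: (e - s) * 2, lambda r: r * 100.0, 1_000_000.0))
```

A regular expression is enough for this illustration, but a real deployment would more likely take the time bounds from a proper SQL parser to handle quoting, nested conditions, and date literals.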
The flowchart is shown in Figures 5 and 6:


Figure 5 corresponds to Steps 1 and 2, and Figure 6 corresponds to Steps 3 and 4. How to predict the data size from the time interval is described in Section 3.2.
4. Results and Discussion
4.1. Experimental Environment and Dataset
The runtime environment is deployed in a cluster of four PCs, and the hardware environment is as follows: CPU is Intel(R) Xeon(R) CPU E5645@2.40GHz dual-core 24 CPU; memory is Kingston DDR3 1333MHz 8G, 500GB SSD Flash Memory. The operating system is Ubuntu 16.04 64-bit with the Linux 3.11.0 kernel. The relevant software versions are as follows: Java 1.8 and Riak TS 1.5.2.
The experimental data is monitoring data collected from 2015 to 2017 at more than 70 hydrological stations located on the Chuhe river, with a data size of 18910864 rows.
4.2. Data Sampling
According to Section 3, Riak TS is a memory-consuming NoSQL database, and memory consumption grows overall as the number of queries increases. When the number of queries increases further, Riak TS is at risk of a significant decrease in availability. Therefore, the table is tested again and sampled at intervals of one month. The intervals and the data sizes obtained in the sampling process are shown in Figure 7.

Figure 7 shows that the data in the table is not evenly distributed. The data size in a month is mostly below 1,000,000 rows, but between February 2016 and April 2016 the monthly data size increases dramatically, so a query that involves this period may cause the database to become unavailable.
4.3. Availability Enhancement
The corresponding linear regression model is established from the relationship between time interval and data size and from the relationship between data size and memory consumption, which is obtained through Prometheus. A conditional query must constrain a primary key of the time type, so the time interval can be extracted from the input SQL. Therefore, the data size to be queried can be roughly inferred from the SQL statement alone. Once the data size is obtained, the linear regression model can quickly calculate the memory resources that will be consumed and determine, by comparing them with the remaining resources, whether the operation can be performed.
The validity of the availability enhancement mechanism is verified with continuous time queries, in which the time interval is changed in order to change the data size. The total time range is from January 2015 to April 2017, and the sampling interval is one month. Memory consumption before the improvement is shown in Figure 8.

It can be seen from Figure 8 that, before the availability enhancement mechanism is added, when the query reaches February 2016 and the data size reaches 2,392,843 rows, the required memory resources are much larger than the remaining memory resources. Riak TS consumes all memory resources and does not actively release them before the query result is returned. Not only does this query fail, but subsequent queries also do not work correctly. The availability enhancement mechanism is added to ensure the long-term availability of Riak TS. Restarting Riak TS after such an error is intolerable, so we want to minimize such errors by rejecting query statements that could degrade availability.
Figure 9 shows that, after the mechanism is integrated, the memory resources a query will use can be predicted from its SQL statement, so that risky queries are effectively circumvented before they are executed. Subsequent valid operations can therefore continue to execute, which increases the availability of Riak TS. Specifically, when the query time reaches February 2016, executing this SQL statement in the previous scenario leaves insufficient available memory and decreases the availability of Riak TS. With the mechanism, when execution reaches February 2016, the mechanism predicts that the current memory resources are not enough to query data of this magnitude, so it rejects the operation, which is recorded as null in the memory consumption table. Similarly, the amount of data in March 2016 is relatively large, at 2,483,206 rows, so that query is also rejected. From May 2016 on, memory resources are sufficient for the query, so Riak TS continues to execute queries and Prometheus records the memory consumption data.

4.4. Specificity and Sensitivity
In order to verify the validity and accuracy of this mechanism, this paper divides the experimental results into four categories. The first category is TP (True Positive): an actually abnormal query is judged to be abnormal. The second category is FN (False Negative): an actually abnormal query is judged to be normal. The third category is FP (False Positive): an actually normal query is judged to be abnormal. The last category is TN (True Negative): an actually normal query is judged to be normal. TP and TN are the ideal outcomes, whereas FN and FP are not desired. This paper defines sensitivity, $\text{sensitivity} = TP / (TP + FN)$, and specificity, $\text{specificity} = TN / (TN + FP)$. This paper uses 0.9 as the threshold; that is, a query can be executed when the memory it is about to consume is less than or equal to 90% of the remaining memory. The initial training data is the sampled data. Random queries that simulate user behaviour are then issued, and the continuously collected data increases the scale of the training data and is used to update the model regularly. The test scenario provides a total of 200 test data for 200 random time periods from March 2016. The results are shown in Table 2.
According to Table 2, as the number of queries increases, the average specificity increases from the initial 80.55% to 92.42% and the average sensitivity increases from 76.31% to 87.90%. This shows that the mechanism can effectively use resource awareness to improve the availability of Riak TS and that, as the number of queries increases, the resource prediction gradually improves, which reduces the number of misjudgments and improves the accuracy.
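The four categories and the two rates can be computed as in the Python sketch below, using the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names and the sample labels are illustrative assumptions and are not the paper's experimental data; `True` stands for an abnormal query.

```python
from typing import Iterable, Tuple


def confusion_counts(actual: Iterable[bool],
                     predicted: Iterable[bool]) -> Tuple[int, int, int, int]:
    """Count (TP, FN, FP, TN), where True means 'abnormal'."""
    tp = fn = fp = tn = 0
    for is_abnormal, flagged in zip(actual, predicted):
        if is_abnormal and flagged:
            tp += 1
        elif is_abnormal:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def sensitivity(tp: int, fn: int) -> float:
    """Share of actually abnormal queries that were judged abnormal."""
    if tp + fn == 0:
        raise ValueError("no abnormal cases")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of actually normal queries that were judged normal."""
    if tn + fp == 0:
        raise ValueError("no normal cases")
    return tn / (tn + fp)


# Made-up labels for eight queries.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, False, True, False, True]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn, sensitivity(tp, fn), specificity(tn, fp))
```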
4.5. Availability Enhancement
The average availability [26] can be represented as

$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}},$$

where MTTF is the mean time to failure of the system and MTTR is the mean time to recover after the system has failed. As the amount of test data increases, the availability of Riak TS in this scenario is shown in Figure 10.

In Figure 10, the value 0 represents the status before the availability enhancement mechanism is loaded, where the availability is just 40.33% in this scenario. Once the mechanism is loaded, the availability immediately increases by 33.26 percentage points, reaching 73.59%. When the amount of training data reaches 1000, the availability measured on the 200 test data is 89.15%. This experiment shows that the mechanism can enhance the availability of Riak TS in scenarios similar to hydrological data.
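For reference, the standard availability ratio MTTF / (MTTF + MTTR) can be computed as in the short Python sketch below. The function name and the sample durations are illustrative assumptions and do not reproduce the paper's measurements.

```python
def availability(mttf: float, mttr: float) -> float:
    """Average availability: MTTF / (MTTF + MTTR), both in the same time unit."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


# Illustrative durations in hours: 9 hours up for every 1 hour of recovery.
print(availability(9.0, 1.0))
```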
5. Conclusions
In this paper, an availability enhancement mechanism for Riak TS based on resource awareness is proposed, improving the real-world availability of Riak TS from a resource monitoring perspective rather than through clustering or backup. Computing resources inevitably face bottlenecks, which reduce database performance and thus database availability. The mechanism works as follows. First, the distributed cluster is monitored by Prometheus. Through time-based sampling, the initial dataset is obtained, and the distribution of the data in the current data table is preliminarily analysed. The relationship between time and data size is acquired, and the corresponding resource consumption is recorded. Then a linear regression model is obtained by applying the least squares method to the initial dataset, and subsequent resource consumption is estimated with the model. The mechanism rejects an operation if the ratio of the resources it requires to the remaining resources is greater than the threshold; if the ratio is less than or equal to the threshold, the operation is executed. The main contribution of this paper is to study an availability enhancement mechanism for systems based on Riak TS from the perspective of resource awareness, to realize dynamic adaptation to different operating environments, and to provide effective monitoring and control in real time, which further enhances the availability of Riak TS. The results show that after applying this mechanism, Riak TS can effectively circumvent upcoming operational risks, and that, as the number of queries increases, the specificity of the whole mechanism increases from the initial 80.55% to 92.42% and the sensitivity increases from 76.31% to 87.90%. In addition, the availability increases from 40.33% to 89.15% in the case of an unknown data distribution such as hydrological data. However, the mechanism proposed in this paper is still insufficient in detection accuracy, and follow-up work needs to find a more accurate model to identify abnormal SQL statements more precisely.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is partly supported by the 2018 Jiangsu Province Key Research and Development Program (Modern Agriculture) Project under Grant No. BE2018301, the 2017 Jiangsu Province Postdoctoral Research Funding Project under Grant No. 1701020C, the 2017 Six Talent Peaks Endorsement Project of Jiangsu under Grant No. XYDXX-078, and the Fundamental Research Funds for the Central Universities under Grant No. 2013B01814.