Abstract

Computers generate network traffic data when people go online, and devices generate sensor data when they communicate with each other. When events such as network intrusions or equipment failures occur, the corresponding time-series show abnormal trends, so monitoring these time-series allows anomalous events to be detected promptly, helping to ensure the security of network communication. However, existing time-series anomaly detection methods struggle with sequences that exhibit different degrees of correlation in complex scenes. In this paper, we propose three multiscale C-LSTM deep learning models to detect abnormal time-series efficiently: independent multiscale C-LSTM (IMC-LSTM), where each LSTM has a CNN with an independent scale; combined multiscale C-LSTM (CMC-LSTM), where the outputs of multiple scales of CNNs are combined as the input to a single LSTM; and shared multiscale C-LSTM (SMC-LSTM), where the outputs of multiple scales of CNNs share a single LSTM model. Comparative experiments on multiple datasets show that all three proposed models achieve excellent performance on the well-known Yahoo Webscope S5 dataset and the Numenta Anomaly Benchmark dataset, even outperforming the latest existing C-LSTM-based model.

1. Introduction

In nature, many statistics are recorded over time, such as population counts, animal numbers, and environmental temperature; the sequence formed by such statistics is called a time-series. Similarly, important time-series arise in many fields of human society, for example: network traffic [1] in the Internet field, credit card transactions [2] in the financial field, electrocardiograms [3] in the medical field, vehicle monitoring data [4] in the traffic field, and device communication data [5] in the Internet of Things (IoT) field. With the development of science and technology, the time-series data generated in these fields are ever more closely tied to people's daily life and work.

Usually, time-series data follow an intrinsic regularity within each time period. As shown in Figure 1, a time-series becomes anomalous when its inherent regularity is broken. Anomalies are data patterns [6] caused by anomalous behaviors, and these behaviors differ across domains. For example, in the Internet field, such behavior may appear as illegal hacking of a server [7]; in the medical field, it may indicate a health problem [8]; in the IoT field, it may be a device malfunction caused by an illegal attack or misconfiguration [9]. These behaviors or events destroy the regularity of the time-series, resulting in abnormal time-series. Therefore, time-series anomaly analysis and detection play an important role in keeping networks and equipment running stably, reducing personal and social property losses, and promoting social stability and development.

Since the last century, researchers have proposed many time-series anomaly detection methods, including various statistical methods and machine learning-based methods. With the rapid development of artificial intelligence, deep learning algorithms have attracted increasing attention. Because deep learning methods can learn complex features of data without making any assumptions about its underlying patterns, deep learning has become the most attractive choice for time-series analysis [10]. The main research direction of time-series anomaly detection is thus gradually shifting from traditional statistical and machine learning methods to more powerful deep learning methods.

Currently, two deep learning algorithms are most representative. One is the convolutional neural network (CNN) [11], which has made major breakthroughs in the image field. In addition to the 2D-CNN applied to 2D image data, CNNs also include the 1D-CNN [12] for 1D sequence data and the 3D-CNN [13] for 3D tensor data. The other is the recurrent neural network (RNN) [14], which performs well in natural language processing. Two widely used variants of the RNN are the long short-term memory (LSTM) [15] and the gated recurrent unit (GRU) [16]. For time-series data, CNNs and RNNs can effectively extract spatial and temporal features [17], respectively, thus revealing the complex internal laws of a time-series. Therefore, more and more researchers apply these two deep learning algorithms to time-series anomaly detection. Although the convolutional and pooling layers of a CNN can reduce some high-frequency noise in a time-series, a CNN loses some temporal information when extracting spatial information. Similarly, an RNN can effectively extract the temporal information of a time-series, but if the sequence is too long or noisy, the RNN cannot fully capture the dependencies between sequence elements.

In recent years, some researchers have combined these two algorithms and used the C-LSTM [18] model to solve anomaly detection tasks. This model can make up for the shortcomings of CNN and RNN models, so as to simultaneously extract the spatiotemporal information of time-series. It has better anomaly detection ability than other deep learning models and is currently one of the most effective models in time-series anomaly detection. However, time-series in real environments may have different patterns at different stages, and existing models with fixed convolution kernel sizes cannot extract different spatial information from sequences to handle various real scenes. Therefore, it is particularly important to build more flexible deep learning models to detect time-series anomalies.

Based on the C-LSTM model, this paper uses convolution kernels of different sizes instead of a fixed-size convolution kernel to fully extract spatial information. Meanwhile, in order to explore temporal feature extraction methods suited to time-series containing different spatial information, we propose three multiscale C-LSTM models that build different LSTM structures. Each model uses CNNs with multiple convolution kernels of different sizes to extract time-series features. The first model uses as many LSTMs as CNNs to extract temporal features from their outputs and is called independent multiscale C-LSTM (IMC-LSTM). The second model uses a single LSTM to extract temporal features from the combined output of the different CNNs and is called combined multiscale C-LSTM (CMC-LSTM). The last model sequentially extracts temporal features from the outputs of multiple CNNs using a single LSTM, that is, the CNNs share one LSTM; it is called shared multiscale C-LSTM (SMC-LSTM). To verify the effectiveness of the three proposed models, we conduct experiments on several real datasets, including Yahoo Webscope S5 and the Numenta Anomaly Benchmark. Experimental results show that the three proposed models perform well on every dataset and that their overall performance is better than that of existing C-LSTM-based models.

The main contributions of this paper include the following three points:

(1) The use of convolution kernels of various sizes instead of fixed-size convolution kernels enhances the spatial feature extraction capability of the C-LSTM model and provides a more suitable feature extraction idea for time-series data generated in the Internet and other fields.

(2) By creating three multiscale C-LSTM models for time-series anomaly detection and searching for their corresponding suitable parameters, a feasible time-series anomaly detection method is provided for problems such as Internet or IoT equipment failures.

(3) The three proposed models outperform existing C-LSTM models in detection ability on multiple datasets and run much faster than the best existing model while using fewer parameters, making them more suitable for practical applications.

The remainder of this paper is organized as follows. Section 2 categorizes and discusses various time-series anomaly detection methods. Section 3 describes the three proposed models in detail. Section 4 presents the experimental details and discusses the results. Finally, Section 5 concludes the paper.

2. Related Work

Research on the problem of time-series anomaly detection has a long history. Many anomaly detection methods exist, and Braei and Wagner [19] divided them into three categories according to model characteristics: statistical models, machine learning models, and deep learning models. In addition, Kim and Cho [20] categorized these methods into three categories according to the modeling approach: statistical modeling, spatial feature modeling, and temporal feature modeling, where statistical modeling uses machine learning models, while spatial and temporal feature modeling adopt deep learning models. With the above in mind, we subdivide existing time-series anomaly detection models into five categories according to model type and modeling approach: time-series statistical models, machine learning models, spatial deep learning models, temporal deep learning models, and spatiotemporal deep learning models.

Time-series statistical models are designed according to the statistical characteristics of time-series data and can intuitively and quickly discover anomalies. Some researchers use these models to solve the problem of time-series anomaly detection in the Internet field. Wu and Shao [21] employed autoregressive (AR) models to model abrupt changes in time-series data and detect anomalies by comparing two adjacent nonoverlapping time-series windows. Qi et al. [22] used the autoregressive moving average (ARMA) model to analyze the historical characteristics of a time-series, in which the predicted value reflects the current network situation and the degree of deviation between the predicted and actual values identifies a network anomaly. Although these models have low time complexity, they are only suitable for stationary time-series and cannot be directly used to analyze periodic or nonstationary time-series.

For nonstationary time-series, it is necessary to difference the time-series during modeling, converting it into a stationary series, and then use the ARMA model to analyze the transformed series. The ARMA model combined with differencing is called the autoregressive integrated moving average (ARIMA) model [23] and is widely used in time-series analysis. Moayedi and Masnadi-Shirazi [24] used ARIMA to simulate and predict network traffic sequences, verifying that the prediction results of normal and abnormal sequences differ. Yaacob et al. [25] compared the network traffic sequence predicted by ARIMA with the original sequence and regarded real sequences far from the predicted sequence as abnormal. Yu et al. [26] improved the existing ARIMA to model historical time-series data with a sliding window and then used a short-step exponentially weighted average method for web traffic forecasting, detecting as anomalous any traffic far from the predicted traffic. These models can be applied to nonstationary time-series that become stationary after differencing, but they adapt relatively poorly to nonstationary time-series with large fluctuations. In addition, differencing may destroy the characteristics of the original data, and the statistical characteristics of the time-series in different time periods may be lost. Therefore, it is difficult for statistical models to capture the inherent laws of time-series and to apply to more complex scenes.
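To make the prediction-deviation idea concrete, the following is a minimal sketch of ARIMA-based detection, assuming the statsmodels library is available; the order (2, 1, 2) and the 3-sigma threshold are illustrative choices, not parameters from the cited works.

```python
# Minimal sketch of ARIMA-based anomaly detection: fit the model, take
# one-step-ahead in-sample predictions, and flag points whose deviation
# from the prediction is large relative to the overall error spread.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_anomalies(series, order=(2, 1, 2), threshold=3.0):
    series = np.asarray(series, dtype=float)
    fit = ARIMA(series, order=order).fit()
    preds = np.asarray(fit.predict(start=1, end=len(series) - 1))
    errors = np.abs(series[1:] - preds)          # deviation from prediction
    return errors > threshold * errors.std()    # True marks anomalous points
```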

Machine learning models are used to solve many practical problems, which are often cast as classification, clustering, or regression problems. These models are also widely used on time-series data in various fields, but they usually require dividing a sequence into multiple subsequences of equal length. Ma and Perkins [27] transformed the time-series anomaly detection problem into a one-class classification problem and combined the outputs of one-class support vector machines (OC-SVM) in different phase spaces to improve the robustness of anomaly detection. To address denial-of-service (DoS) attacks, Ahmed and Mahmood [28, 29] clustered network traffic using x-means (a variant of k-means) and a modified co-clustering algorithm, respectively; experimental results show that these two algorithms are superior to other clustering algorithms. Anton et al. [30] extracted features from network traffic through principal component analysis (PCA) and then used support vector machine and random forest classifiers to solve the network intrusion problem; both classifiers achieved good detection results. Zhou et al. [31] proposed an anomaly detection method called KSM, which combines three models: k-nearest neighbors, symbolic aggregate approximation, and the Markov model; compared with other methods, it can accurately handle a large amount of data. Machine learning models can effectively extract time-series features and process more complex data, but during modeling each time-series sample is simply regarded as a point in a high-dimensional space whose coordinates are independent of each other. However, normal data in a time-series usually have temporal correlation with adjacent data, and earlier values can affect the trend of later values. Therefore, it is difficult for machine learning models to extract high-level features of time-series.
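As an illustration of the windowing approach these classic methods rely on, here is a sketch of one-class detection with scikit-learn's OneClassSVM; the window size and the nu parameter are illustrative, and the training series is assumed to be mostly normal.

```python
# Sketch of one-class classification on sliding windows: each window of
# length w becomes one feature vector, and windows the OC-SVM scores as
# outliers are flagged as anomalous subsequences.
import numpy as np
from sklearn.svm import OneClassSVM

def windows(series, w=60):
    series = np.asarray(series, dtype=float)
    return np.stack([series[i:i + w] for i in range(len(series) - w + 1)])

def ocsvm_anomalies(train_series, test_series, w=60, nu=0.05):
    clf = OneClassSVM(nu=nu, kernel="rbf", gamma="scale")
    clf.fit(windows(train_series, w))                  # fit on (mostly normal) windows
    return clf.predict(windows(test_series, w)) == -1  # -1 marks outlier windows
```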

Spatial deep learning models use operations such as convolution to capture relationships between adjacent data points and thus extract the spatial characteristics of the data, and some researchers have applied them to time-series anomaly detection. Wen and Keyes [32] adopted the U-Net model from image segmentation for anomaly segmentation and detection in time-series, achieving satisfactory performance. Hwang et al. [33] combined a CNN with unsupervised learning to propose an abnormal traffic detection mechanism called D-PACK, which performs preliminary detection using as few packets, and as few bytes per packet, as possible; its accuracy is close to that of full inspection while its detection speed is greatly improved. Although spatial deep learning models can extract features of adjacent data in a time-series, they do not make full use of the temporal correlation between data points, so the extraction of temporal features remains a direction worth exploring.

Temporal deep learning models use algorithms such as the RNN to extract temporal features from data. The RNN and its variants extract temporal features by memorizing previous data, which has attracted the attention of many time-series researchers. Cheng et al. [34] transformed the time-series anomaly detection problem into a classification problem and used sliding windows together with a multiscale LSTM model to extract temporal features and determine whether a sequence is abnormal. Some researchers used LSTM to predict future sequences from historical time-series data and judged whether a sequence is abnormal by the error between the predicted and real sequences [35–37]. In addition, some researchers reconstructed the time-series data with RNN-based reconstruction models and detected abnormal sequences through the error between the reconstructed and original sequences [38–40]. Although temporal deep learning models can effectively extract the temporal features of data, it is difficult for them to extract complete temporal information when the sequence is too long or the data are noisy.

Spatiotemporal deep learning models combine the spatial deep learning model and the temporal deep learning model, which can fully extract the spatiotemporal information of the data, and are more suitable for time-series anomaly detection. Currently, only a few studies have adopted these models. Kim and Cho [20] converted the time-series anomaly detection problem into a binary classification problem. They applied a C-LSTM model combining 1D-CNN and LSTM to solve the anomaly detection task of network traffic. The model reduces the frequency of time-series by CNN, and then extracts temporal features from the output sequence of CNN by LSTM. Experiments show that the model outperforms other deep learning models and machine learning models. Based on the C-LSTM model, Yin et al. [41] proposed a C-LSTM-AE model combined with an LSTM encoder-decoder to solve time-series anomaly detection for IoT. Experiments show that the anomaly detection ability of this model exceeds that of other C-LSTM deep learning models.

In conclusion, the spatiotemporal deep learning model is one of the most effective models for extracting time-series features. However, existing C-LSTM based models still have a shortcoming in extracting spatial features, that is, the size of the convolution kernel in these models is fixed. If the fixed convolution kernel is too small, the frequency and noise of the time-series cannot be effectively reduced, so that the spatial information cannot be fully preserved. If the fixed convolution kernel is too large, more temporal information of the time-series will be lost. For time-series in real scenes, the inherent laws contained in the sequences of different scenes or different stages are inconsistent. It is difficult to design a model with a fixed convolution kernel size to extract features from all stages of the time-series. To this end, we plan to build a multiscale C-LSTM model using multiple convolution kernels of different sizes to ensure more accurate detection of complex anomaly sequences in real scenes.

3. Proposed Models

In this section, we propose three multi-scale C-LSTM models for the binary classification problem of time-series anomaly detection. We first introduce the basic model C-LSTM, and then describe the construction of the three proposed multi-scale C-LSTM models in detail.

3.1. C-LSTM Model

The C-LSTM model was first proposed by Zhou et al. [18] for text classification tasks. Kim and Cho [20] first applied this model to time-series anomaly detection and achieved good results. As shown in Figure 2, the C-LSTM model is mainly composed of a CNN layer and an LSTM layer. When the C-LSTM model is used to extract time-series features, the data pass through the CNN layer and then the LSTM layer.

The CNN layer mainly consists of convolutional layers and pooling layers, which the C-LSTM model uses to extract spatial features and provide the output sequence to the next layer. The CNN convolves the data through convolutional layers and then extracts features through pooling layers. For univariate time-series data, the convolutional layers use multiple convolution kernels of the same size to convolve the sequence according to the stride. Convolution is an inherently linear operation, which makes it difficult to fit data with more complex relationships; therefore, a nonlinear activation function (such as tanh) is generally added after the convolution to enhance the nonlinear fitting ability of the convolutional layer.

In addition, pooling layers are used to extract features from the time-series after the convolution operations. To reduce the extraction of repetitive features, the pooling layers generally adopt max pooling with equal pooling width and stride (i.e., nonoverlapping pooling windows).

The convolutional and pooling layers of a CNN can fully extract the spatial information of complex data and are translation invariant; they reduce the frequency of a time-series while maintaining its original time order, which facilitates the extraction of temporal information by the subsequent LSTM layer.
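For illustration, a minimal PyTorch sketch of such a CNN stage is shown below; the channel count, kernel size, and pooling width are illustrative values, not the configuration of any specific model in this paper.

```python
# Sketch of the C-LSTM CNN stage: 1D convolution, tanh nonlinearity, and
# max pooling that shortens the sequence while preserving time order.
import torch
import torch.nn as nn

cnn_stage = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=32, kernel_size=5, padding=2),
    nn.Tanh(),                              # nonlinearity after the linear convolution
    nn.MaxPool1d(kernel_size=3, stride=3),  # equal pooling width and stride
)
x = torch.randn(8, 1, 60)   # a batch of 8 univariate windows of length 60
out = cnn_stage(x)          # shape (8, 32, 20): sequence length reduced 60 -> 20
```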

LSTM is a variant of the RNN used in the C-LSTM model to extract temporal features from the CNN output sequence. Unlike a plain RNN, LSTM can extract information from long-term sequences. In the C-LSTM model, the CNN layer outputs a high-dimensional sequence $X = (x_1, x_2, \ldots, x_L)$. The sequence is then fed into the LSTM cells of the LSTM layer one time step at a time. As shown in Figure 3, an LSTM cell consists of a forget gate $f_t$, an input gate $i_t$, an output gate $o_t$, a cell state $c_t$, and a hidden state $h_t$. Among them, the forget gate determines which information in the previous cell state can be forgotten while important memory information is retained, the input gate determines which input information can be retained in the current cell state, and the output gate determines which information in the current cell state can be stored in the hidden state. At time $t$, the output of the LSTM is computed as follows:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t * c_{t-1} + i_t * \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$h_t = o_t * \tanh(c_t),$$

where $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate, respectively, each computed with the sigmoid function $\sigma$; $c_t$ represents the value of the cell state at the current time; $h_t$ represents the output value of the hidden state; the asterisk represents the Hadamard product between vectors; and the matrices $W$, $U$ and vectors $b$ represent linear transformations of the input and the previous hidden state.

The cell state retains the data information at each moment in a cyclic manner, and the hidden state selects the information of the cell state memory to output the final result. Therefore, LSTM uses cell state and hidden state to store and select data information, which can effectively extract temporal features of time-series.
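To make the gate equations above concrete, the following sketch implements a single LSTM step directly from them (in practice, PyTorch's nn.LSTM is used); the parameter dictionaries W, U, and b are hypothetical containers for the weight matrices and bias vectors.

```python
# One LSTM step written directly from the gate equations: sigmoid gates,
# a tanh candidate state, and Hadamard products (the * operator).
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = torch.sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = torch.sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o = torch.sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = torch.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])     # candidate state
    c_t = f * c_prev + i * g        # update the cell state
    h_t = o * torch.tanh(c_t)       # expose part of the cell state as output
    return h_t, c_t
```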

The C-LSTM model combines CNN and LSTM to extract higher-level features from time-series data, charting a direction for work on time-series problems and laying the foundation for the development of spatiotemporal deep learning models.

3.2. The Proposed Multiscale C-LSTM Models

In order to extract time-series information more effectively, and considering that a time-series contains different kinds of spatial information, we replace the fixed-size convolution kernel in the CNN layer of the C-LSTM model with several convolution kernels of different sizes. At the same time, in order to explore methods suited to extracting the temporal information of time-series containing different spatial information, we construct three multiscale C-LSTM classification models with different LSTM layers for time-series anomaly detection: independent multiscale C-LSTM (IMC-LSTM), combined multiscale C-LSTM (CMC-LSTM), and shared multiscale C-LSTM (SMC-LSTM).

3.2.1. IMC-LSTM Model

The structure of the IMC-LSTM model is shown in Figure 4; it consists of three layers: CNN, LSTM, and a deep neural network (DNN).

The CNN layer performs convolution operations on the sequence data using several types of convolution kernels of different sizes. To avoid information loss and facilitate subsequent operations, the sizes of these convolution kernels are set to odd numbers: letting $c$ denote the number of kernel types, the size of the $i$-th kernel type is $2i + 1$. The number of convolution kernels of each type is set to $n$, and the stride is fixed to 1 in order to fully extract the spatial information. To keep the output sequence lengths of the different kernels consistent, both ends of the input sequence are padded with zeros before the convolution operation. Then, a tanh activation function and a max pooling layer of size $p$ and stride $p$ are applied to the $n$-dimensional sequences output by each type of convolution kernel to extract spatial features.

In the LSTM layer, $c$ LSTMs, each with a hidden layer of size $h$, are used to extract the temporal features of the sequences output by the different CNNs. The outputs of the $c$ LSTMs are then concatenated into a vector of size $c \cdot h$ and sent to the DNN layer.

In the DNN layer, the dropout algorithm randomly drops 20% of the neurons output by the LSTM layer during training to reduce overfitting. The $c \cdot h$ neurons are then mapped into a vector of $d$ neurons, and a tanh activation function is applied to improve the fitting ability. Finally, the vector of $d$ neurons is fully connected to output the binary classification result.
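A simplified PyTorch sketch of this structure is given below. The odd kernel sizes ($2i+1$), the 20% dropout, and the layer order follow the description above, while the default values of $c$, $n$, $h$, and the DNN width $d$ are illustrative; the complete implementation is in the code released in Section 4.4.

```python
# Sketch of IMC-LSTM: one CNN branch per kernel size, an independent LSTM
# per branch, and a DNN head on the concatenated LSTM outputs.
import torch
import torch.nn as nn

class IMCLSTM(nn.Module):
    def __init__(self, num_scales=3, n_filters=64, hidden=64, head=32):
        super().__init__()
        sizes = [2 * i + 1 for i in range(1, num_scales + 1)]  # odd sizes 3, 5, 7, ...
        self.cnns = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(1, n_filters, k, padding=k // 2),  # zero-pad to keep length
                nn.Tanh(),
                nn.MaxPool1d(3, 3),                          # pooling size = stride = 3
            )
            for k in sizes
        )
        self.lstms = nn.ModuleList(
            nn.LSTM(n_filters, hidden, batch_first=True) for _ in sizes
        )
        self.head = nn.Sequential(
            nn.Dropout(0.2),                       # drop 20% of the LSTM outputs
            nn.Linear(num_scales * hidden, head),
            nn.Tanh(),
            nn.Linear(head, 2),                    # binary output: normal / abnormal
        )

    def forward(self, x):                   # x: (batch, 1, window)
        feats = []
        for cnn, lstm in zip(self.cnns, self.lstms):
            seq = cnn(x).transpose(1, 2)    # (batch, steps, n_filters)
            _, (h, _) = lstm(seq)           # final hidden state of this branch
            feats.append(h[-1])
        return self.head(torch.cat(feats, dim=1))
```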

3.2.2. CMC-LSTM Model

The structure of the CMC-LSTM model is shown in Figure 5. The model also consists of CNN, LSTM, and DNN layers, of which the CNN and DNN layers are exactly the same as those in the IMC-LSTM model.

In the $c$ groups of $n$-dimensional sequences output by the CNN layer, the same time step of each sequence carries information about the same instant. Therefore, concatenating these sequences at each time step into one $c \cdot n$-dimensional sequence may help the LSTM layer better extract temporal information.

In the LSTM layer, unlike the IMC-LSTM model, a single LSTM with a hidden layer of size $h$ is used in this model, and it outputs a vector of size $h$.
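Continuing the IMC-LSTM sketch above, the CMC-LSTM variant can be sketched as follows; the only structural changes are the per-time-step concatenation and the single, wider LSTM, and the layer sizes remain illustrative.

```python
# Sketch of CMC-LSTM: the branch outputs are concatenated per time step,
# so a single LSTM sees a (num_scales * n_filters)-dimensional sequence.
import torch
import torch.nn as nn

class CMCLSTM(IMCLSTM):
    def __init__(self, num_scales=3, n_filters=64, hidden=64, head=32):
        super().__init__(num_scales, n_filters, hidden, head)
        self.lstms = None                  # the per-branch LSTMs are not used here
        self.lstm = nn.LSTM(num_scales * n_filters, hidden, batch_first=True)
        self.head[1] = nn.Linear(hidden, head)  # head now takes one hidden vector

    def forward(self, x):                  # x: (batch, 1, window)
        seqs = [cnn(x).transpose(1, 2) for cnn in self.cnns]
        combined = torch.cat(seqs, dim=2)  # (batch, steps, num_scales * n_filters)
        _, (h, _) = self.lstm(combined)
        return self.head(h[-1])
```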

3.2.3. SMC-LSTM Model

The structure of the SMC-LSTM model is presented in Figure 6; this model also includes CNN, LSTM, and DNN layers, and only the LSTM layer differs from the previous two models. In the IMC-LSTM model, three LSTMs are connected to three CNNs, respectively; in the CMC-LSTM model, a single LSTM receives the combined sequence of the CNN layer outputs. Although the first two models are effective, both require LSTM layers with more parameters, so the models take up more space.

To reduce the number of parameters in the LSTM layer, the SMC-LSTM model adopts a single LSTM with a hidden layer of size $h$. The model uses only this one LSTM, sharing its weights, to sequentially extract features from the sequences output by the different CNNs. Because its weights incorporate features from all the CNN output sequences, this model is also suitable for feature extraction from complex sequences, like the previous two models.
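Again continuing the sketch above, one possible reading of the shared-LSTM design is shown below: a single LSTM processes each branch's output in turn with the same weights, and each branch starts from a fresh state (whether state is carried across branches is an implementation choice the description does not fix).

```python
# Sketch of SMC-LSTM: one LSTM is shared across all CNN branches, so its
# weights are trained on the output sequences of every kernel scale.
import torch
import torch.nn as nn

class SMCLSTM(IMCLSTM):
    def __init__(self, num_scales=3, n_filters=64, hidden=64, head=32):
        super().__init__(num_scales, n_filters, hidden, head)
        self.lstms = None                        # replaced by one shared LSTM
        self.shared_lstm = nn.LSTM(n_filters, hidden, batch_first=True)

    def forward(self, x):                        # x: (batch, 1, window)
        feats = []
        for cnn in self.cnns:
            seq = cnn(x).transpose(1, 2)         # (batch, steps, n_filters)
            _, (h, _) = self.shared_lstm(seq)    # same weights for every branch
            feats.append(h[-1])
        return self.head(torch.cat(feats, dim=1))
```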

4. Experiments

In this section, we evaluate the proposed models on publicly available datasets and compare them with other deep learning models for time-series anomaly detection. All experiments in this section run on a machine with an Intel(R) Core(TM) i5-8250U CPU, a GeForce MX150 GPU, and 8 GB of RAM. In terms of software, we use Python 3.7.11, the deep learning framework PyTorch 1.7.1, and the numerical packages NumPy 1.21.2 and Pandas 1.3.4.

4.1. Datasets

We use the time-series datasets provided by Yahoo and Numenta: the Yahoo Webscope S5 dataset [42] and the Numenta Anomaly Benchmark (NAB) [43]. The Yahoo Webscope S5 dataset consists of four datasets: A1, A2, A3, and A4, where A1 is a real-scene dataset and the others are artificial. NAB contains seven datasets, including two artificial datasets and five real datasets. Since real datasets better reflect real scenarios, we selected a total of six real datasets for our experiments: A1 from Yahoo, and Ad, AWS, Known, Traffic, and Tweets from NAB. These datasets are labeled with normal and abnormal points, as described below.

A1. This dataset contains 67 web traffic time-series from Yahoo, most of which contain around 1,400 observations; a few series contain fewer than 1,000 observations.

Ad. This dataset contains 6 online ad click-through rate sequences, each containing approximately 1,600 observations.

AWS. This dataset contains Amazon Web Services metrics such as CPU utilization, incoming network bytes, and disk read bytes. There are 17 series, most of which contain more than 4,000 observations; a few contain more than 1,000 observations.

Known. This dataset contains 7 sequences whose anomaly causes are known, including temperature sensor data, CPU usage, and taxi passenger counts. Each series contains thousands to tens of thousands of observations.

Traffic. This dataset contains real-time traffic data from the Twin Cities metro area of Minnesota, including vehicle sensor occupancy, speed, and travel time, for a total of 7 sequences. Each series contains roughly 1,000 to 2,000 observations.

Tweets. This dataset contains 10 sequences of Twitter mention counts of publicly traded companies, each containing over 10,000 observations.

4.2. Data Preprocessing

Since the value range of each time-series is different, we normalize each dimension of the time-series before further processing. Let a time-series be $X = (x_1, x_2, \ldots, x_T)$. We calculate the mean and standard deviation of each sequence and then normalize the sequence by the following formula:

$$x'_t = \frac{x_t - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are, respectively, the mean and the standard deviation of the time-series $X$.

Considering the small proportion of anomalous data in the real datasets, it is difficult to train a model with each individual point of a sequence as a sample. Therefore, we use a sliding window to select subsequences and convert the detection of outliers into the detection of abnormal subsequences. This increases the dimensionality of the samples used for classification training while increasing the proportion of anomalous samples. We take the subsequences in the sliding window as data samples: if a subsequence contains abnormal data, the sample is marked as abnormal; otherwise, it is marked as normal. For each sequence, we set the sliding window size to $w$ and the step size to 1, so each sample is $s_i = (x_i, x_{i+1}, \ldots, x_{i+w-1})$, $i = 1, 2, \ldots, T - w + 1$. Because each dataset contains multiple sequences, for each dataset we combine the samples of all its sequences into one set that can be used directly for classification training.
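The whole preprocessing step can be sketched in a few lines of NumPy; the function below assumes per-point binary labels and is a simplified version of the released code.

```python
# Per-series z-score normalization, then sliding windows of size w with
# stride 1; a window is labeled abnormal if it contains any abnormal point.
import numpy as np

def preprocess(series, point_labels, w=60):
    series = np.asarray(series, dtype=float)
    point_labels = np.asarray(point_labels)
    x = (series - series.mean()) / series.std()   # normalize this series
    samples = np.stack([x[i:i + w] for i in range(len(x) - w + 1)])
    labels = np.array([point_labels[i:i + w].max()  # 1 if any point is abnormal
                       for i in range(len(x) - w + 1)])
    return samples, labels
```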

4.3. Evaluation Metrics

We employ a variety of traditional evaluation metrics to comprehensively evaluate the anomaly detection performance of the models. Since the proportion of abnormal samples in each dataset is small, and to highlight the anomaly detection ability of the models, we focus on abnormal samples and treat them as positive samples, with normal samples as negative samples. True positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) denote, respectively, the number of abnormal samples predicted correctly, the number of normal samples predicted as abnormal (false alarms), the number of abnormal samples predicted as normal (missed detections), and the number of normal samples predicted correctly.

Based on the above settings, we use the classification metrics accuracy, precision, recall, and F1 to evaluate the experimental results of the models.

Accuracy (A). As shown in equation (3), this metric represents the proportion of correct predictions, with values in [0, 1]. However, it performs poorly on datasets with imbalanced labels and is only used as a supplementary metric in our experiments:

$$A = \frac{TP + TN}{TP + TN + FP + FN}. \quad (3)$$

Precision (P). As shown in equation (4), this metric represents the proportion of truly abnormal data among the data predicted to be abnormal. Its value lies in [0, 1]; the higher the value, the lower the false alarm rate:

$$P = \frac{TP}{TP + FP}. \quad (4)$$

Recall (R). As shown in equation (5), this metric represents the proportion of abnormal data that is predicted to be abnormal. Its value lies in [0, 1]; the higher the value, the lower the missed detection rate:

$$R = \frac{TP}{TP + FN}. \quad (5)$$

F1. As shown in equation (6), this metric is the harmonic mean of precision and recall, with values in [0, 1]. F1 is high only when both precision and recall are high; that is, it combines the two and is more suitable for a comprehensive evaluation of model performance:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}. \quad (6)$$
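Written out from the confusion-matrix counts, the four metrics can be computed as follows (the sketch omits guards against zero denominators):

```python
# The four evaluation metrics computed directly from TP, FP, FN, and TN.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # equation (3)
    precision = tp / (tp + fp)                          # equation (4)
    recall = tp / (tp + fn)                             # equation (5)
    f1 = 2 * precision * recall / (precision + recall)  # equation (6)
    return accuracy, precision, recall, f1
```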

4.4. Results and Analysis

Before training the models, we randomly split the samples of each dataset, stratified by original sequence and anomaly rate, into training, validation, and test sets with a ratio of 6 : 2 : 2. This yields three sample sets with the same anomaly proportion, each containing data from every original series. Table 1 shows the sample size and anomaly proportion of each dataset under different sliding windows.

For every model in the experiments, we used the Adam optimizer and the same training parameters, shown in Table 2. When selecting trained models for the test set, we choose the model with the highest F1 score on the validation set. If several models have equal validation F1 scores, we break ties using the training set. The final evaluation reports the selected model's results on the test set.

Based on the above considerations, we set up multiple sets of experiments to fully illustrate the rationality and superiority of the proposed models from different perspectives, and have made the experimental code available at https://github.com/lyx199504/mc-lstm-time-series for peer review and reproduction of our experiments.

4.4.1. Effectiveness of Different-Sized Convolution Kernels

In order to test the feature extraction ability of convolution kernels of different sizes, we compared a CNN with a fixed kernel size against a multiscale CNN (MS-CNN) with kernels of different sizes. As in [20], we set the sliding window size of the time-series to 60.

The experimental results are shown in Figure 7. The MS-CNN model outperformed the CNN model in F1 score on five of the six datasets. This shows that replacing the fixed-size convolution kernel in a CNN with convolution kernels of different sizes extracts time-series information more effectively.

4.4.2. Parameter Exploration of Our Models

We conducted comparative experiments on combinations of the number of convolution kernel types $c$ and the number of kernels per type $n$ to select appropriate parameters. In this experiment, we set the sliding window size of the time-series to 60 and set the max pooling size and stride $p$ to 3 to balance training efficiency against the integrity of the spatial features. The parameters $c$ and $n$ are chosen from {2, 3, 4} and {16, 32, 48, 64}, respectively.
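The parameter sweep itself amounts to a small grid search; in the sketch below, train_and_eval is a hypothetical stand-in for the actual training loop, assumed to return the validation F1 score.

```python
# Grid search over the kernel-type count c and the kernels-per-type count n.
from itertools import product

results = {}
for c, n in product([2, 3, 4], [16, 32, 48, 64]):
    results[(c, n)] = train_and_eval(num_scales=c, n_filters=n)  # hypothetical helper
best_c, best_n = max(results, key=results.get)  # combination with the best val F1
```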

The experimental results of the different parameter combinations for the three proposed models are shown in Figure 8. According to the results of the IMC-LSTM model, its performance is better when both the number of kernel types and the number of kernels per type are large. The SMC-LSTM model performs better when the number of kernels is large. Unlike these two models, the results of the CMC-LSTM model show no obvious correlation with the parameter combination; however, its scores fluctuate less, and it outperforms the other two models overall.

When choosing the optimal parameter combination for each proposed model, models with higher F1 scores are considered first, and models with fewer parameters are preferred when F1 scores are close. The optimal parameter combinations obtained in this way are shown in Table 3.

4.4.3. Comparison of Our Models and the Baseline Model

We compared the models with their selected parameters against the baseline C-LSTM model. To ensure a fair comparison, we again set the sliding window size to 60, as in [20].

Figure 9 shows the scores of each evaluation metric for the three proposed models and C-LSTM on each dataset, while Table 4 shows the average score of each metric. These results show that, in terms of detection ability, the three proposed models score higher than the C-LSTM model on almost all metrics and datasets, and their average scores over the six datasets are also much higher than those of the C-LSTM model. Therefore, the three proposed models are superior to the C-LSTM model in detection ability, and replacing the fixed convolution kernel with multiple convolution kernels of different sizes is effective.

In addition, we compared the F1 scores of every pair of models on each dataset and list the statistics in Table 5: each cell indicates the number of datasets on which the model in the row outperforms the model in the column. We can see that the IMC-LSTM and CMC-LSTM models achieve higher F1 scores than the other models on more than half of the datasets, and the CMC-LSTM model outperforms the other models on most of the datasets. These results suggest that the structure of the CMC-LSTM model may be better suited to extracting time-series features than the other two models.

4.4.4. Model Comparison under Different Sequence Lengths

To demonstrate the effect of the sliding window size on model performance, we set sliding windows of various sizes ($w$ = 20, 40, 60, 80, and 100) and compared the proposed models with the C-LSTM model.

In this experiment, the F1 scores of the three proposed models and the C-LSTM model under different sliding windows are compared in Figure 10. The F1 scores of all four models increase as the sliding window grows. This may be because a larger sliding window increases the proportion of abnormal sequences, which benefits model training. Nevertheless, the three proposed models consistently outperform the C-LSTM model regardless of the sliding window size. This suggests that all three proposed models are better suited to the time-series anomaly detection problem than the C-LSTM model.

4.4.5. Comparison of Our Models and C-LSTM-AE

We conducted an additional comparison experiment to assess the comprehensive performance of the three proposed models against the C-LSTM-AE model, the best existing C-LSTM-based model. As in [41], we set the sliding window size to 60 and compared detection ability, training efficiency, and number of parameters.

Table 6 shows the average score of each evaluation metric over the six datasets, the training time of each model on the A1 dataset, and the number of parameters of the four models. The results show that all three proposed models are slightly better than the C-LSTM-AE model in detection ability and train faster. In terms of parameters, the three proposed models all have fewer parameters than the C-LSTM-AE model, which means lower space complexity. In summary, the proposed models have better comprehensive performance than other C-LSTM-based models, including detection performance, time complexity, and space complexity, and are more suitable for practical time-series anomaly detection.

5. Conclusion

In order to solve the problem of insufficient spatial feature extraction in time-series anomaly detection, we replaced the fixed-size convolution kernel in C-LSTM with multiscale convolution kernels of different sizes and obtained three multiscale C-LSTM models: IMC-LSTM (independent multiscale C-LSTM), CMC-LSTM (combined multiscale C-LSTM), and SMC-LSTM (shared multiscale C-LSTM). Experiments on a large number of real datasets show that the detection ability of the three proposed models is higher than that of the C-LSTM model; the F1 score of CMC-LSTM on real datasets is 2.1% higher than that of C-LSTM. Moreover, the detection ability of the three proposed models is even slightly better than that of the state-of-the-art C-LSTM-AE model, while offering faster running speed and lower storage requirements. In future work, we will further explore other multiscale combinations of time-series to build an anomaly detection framework more suitable for practical scenarios.

Data Availability

The datasets we used for our research are available in https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70 and https://github.com/numenta/NAB.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by National Key R&D Program of China (2022YFB3103000), MIIT Project Industrial Internet identification resolution system security monitoring and protection (TC220H078), Guangdong Macao Science and Technology Cooperation Project (2022A050520013 & 0059/2021/AGJ), Qing Lan Project in Jiangsu universities, XJTLU RDF-22-01-020. G. Geng is supported by the Pearl River Talents Plan.