Abstract

Time series classification is a basic and important approach for time series data mining. Nowadays, more researchers pay attention to shape similarity methods, including Shapelet-based algorithms, because they can extract discriminative subsequences from time series. However, most Shapelet-based algorithms discover Shapelets by searching candidate subsequences in training datasets, which brings two drawbacks: a high computational burden and poor generalization ability. To overcome these drawbacks, this paper proposes a novel algorithm named Shapelet Dictionary Learning with SVM-based Ensemble Classifier (SDL-SEC). SDL-SEC modifies the Shapelet algorithm in two aspects: the Shapelet discovery method and the classifier. First, Shapelet Dictionary Learning (SDL) is proposed as a novel Shapelet discovery method that generates Shapelets instead of searching for them. In this way, SDL gains the advantages of lower computational cost and higher generalization ability. Then, an SVM-based Ensemble Classifier (SEC) is developed as a novel ensemble classifier and adapted to the SDL algorithm. Different from the classic SVM, which needs precise parameter tuning and appropriate feature selection, SEC can avoid the overfitting caused by a large number of features and parameters. Compared with the baselines on 45 datasets, the proposed SDL-SEC algorithm achieves competitive classification accuracy with lower computational cost.

1. Introduction

Time series classification (TSC) is a theoretical abstraction of many engineering problems, such as fault diagnosis, speech recognition, and electroencephalogram (EEG) identification, and it has become an active research field in recent years [1]. To address the challenges of TSC, several families of methods have been proposed in the literature. An intuitive way to model time series by comparing their differences in the time domain is called the time domain similarity method. Time domain similarity methods process time series as high-dimensional points. A basic framework of a time domain similarity based TSC method is to use a distance measurement to quantify the similarity between time series and then apply the 1-NN algorithm as the classifier. Several L1 and L2 norm distance measurements have been studied extensively in [2]. To handle time axis distortion in time series, elastic distance measures are employed in TSC; dynamic time warping (DTW) has become the most popular elastic distance measurement, and several variants of DTW are compared in [2]. At the same time, time series discretization techniques, such as symbolic aggregate approximation (SAX) [3], have been proposed to reduce time series dimensionality and improve computational efficiency. Fuzzy similarity (FS) [4] has been adapted to the defect characterization problem and is capable of processing time series signals affected by uncertainty and inaccuracy. In the literature, time domain similarity methods have proved intuitive and efficient but not suitable for complex dynamic systems.

The model similarity method is another well-known way to deal with TSC problems. These methods represent time series by using statistical models, such as the Auto Regressive Moving Average (ARMA) [5], the Hidden Markov Model (HMM) [6], and Gaussian Mixture Models (GMM) [7]. By comparing the parameters of the fitted models, time series from different classes can be distinguished. In recent years, Deep Learning has become a common approach in the machine learning and artificial intelligence fields, and more and more neural network-based models have been introduced into TSC [8, 9]. Although model similarity methods achieve high accuracy, the underlying models have a high level of abstraction, so these methods are not interpretable.

Recently, more researchers have become interested in the shape similarity method. It is human instinct to distinguish objects by their shapes, and such methods imitate this intuition by using shapes to distinguish different classes of time series. Shape similarity methods discover local shape features, while other methods discover global statistical features. Therefore, these methods can obtain high accuracy with better interpretability. The base model of this article, the Shapelet algorithm, is a shape similarity method. A Shapelet is a discriminative subsequence of a time series [10]. Beyond high accuracy, Shapelets can also provide visualization results, which can point out further research directions for domain experts [11].

Most of the existing Shapelet-based algorithms tend to discover Shapelets by searching candidates in the training dataset. Such algorithms have two drawbacks. First, the search requires extensive computation. As an illustration, the complexity of the original Shapelet algorithm is O(m⁴n²), where m is the length of the time series and n is the number of instances in the dataset. Second, the searched Shapelets lack generalization: each Shapelet must be a segment of an existing instance, while the most discriminative Shapelet may never appear in the historical data. Thus, this paper proposes a novel algorithm named Shapelet Dictionary Learning with SVM-based Ensemble Classifier (SDL-SEC), which contributes the following two points:

(i) Shapelet Dictionary Learning. Inspired by Dictionary Learning (DL), the proposed algorithm generates subsequences (Shapelets) by optimizing an objective function. Different from the searching method, SDL generates Shapelets directly.

(ii) SVM-based Ensemble Classifier. According to our experience, different time series are sensitive to different features and different parameters. The classic SVM takes a lot of time to tune parameters and select feature subsets. To address this problem, we train a set of SVM models with randomly selected features and parameters and obtain the final results through majority voting.

The rest of this paper is organized as follows: Section 2 reviews the research literature related to Shapelet algorithms and Dictionary Learning; Section 3 describes the structure of the proposed algorithm SDL-SEC; Section 4 presents the implementation of SDL-SEC and compares the results with the baselines on 45 datasets; Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Shapelet Algorithm

The original Shapelet algorithm was introduced in [10] for the time series classification problem. In brief, there are three steps in the training stage of the original version, as exhibited in Figure 1. First, a sliding window is used to extract time series segments, which serve as candidate Shapelets, and the minimum distances between each candidate Shapelet and every time series in the dataset are computed. Then, an orderline is built for each candidate Shapelet by arranging these minimum distances in ascending order; from each orderline, the Information Gain (IG) and Optimal Split Point (OSP) are computed. A high IG indicates good discrimination. The third step is to choose the k best Shapelets with the highest IG. At the inference stage, a decision tree is built from the k best Shapelets' OSPs, and untagged time series are classified by this decision tree.
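To make the orderline step concrete, the following sketch (our illustration, not the original authors' code) computes the Information Gain and Optimal Split Point for one candidate's orderline; `dists` holds the minimum distances and `labels` the class labels of the corresponding series.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(dists, labels):
    """Return (OSP, IG) for one candidate Shapelet's orderline."""
    order = np.argsort(dists)                    # build the orderline
    d, y = np.asarray(dists)[order], np.asarray(labels)[order]
    base, n = entropy(y), len(y)
    best_ig, best_osp = -np.inf, None
    for i in range(1, n):                        # try every split point
        ig = base - (i / n) * entropy(y[:i]) - ((n - i) / n) * entropy(y[i:])
        if ig > best_ig:
            best_ig, best_osp = ig, (d[i - 1] + d[i]) / 2
    return best_osp, best_ig
```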

Following the principle of Shapelets, many modified Shapelet-based algorithms have been proposed in recent years. These approaches have two main directions: speeding up the runtime and improving the accuracy.

The original Shapelet algorithm described above is an exhaustive search method with high time complexity. Hence, speed-up techniques are indispensable for improving the efficiency of the Shapelet algorithm. Efficient pruning strategies have been suggested to avoid searching unproductive Shapelets [11]. Representation methods map time series from the source space to a reduced-dimensional space, which reduces the search time significantly [12]. Moreover, parallelizing the Shapelet algorithm makes GPU acceleration possible [13]. However, these speed-up techniques do not substantially improve Shapelet discovery and therefore limit the accuracy of this family of methods.

On the other hand, some Shapelet-based algorithms focus on improving accuracy. Grabocka et al. [28] suggest obtaining optimal Shapelets by optimizing a logistic loss objective function, which can further improve the classification accuracy. Instead of IG, the Kruskal–Wallis and Mood's median tests are employed as quality measurements for Shapelet selection [14]. Meanwhile, Hills et al. [14] suggest a feature transformation technique, which decouples the Shapelet algorithm from the decision tree classifier. Generally, this family of methods obtains high accuracy, but the computational cost is still high.

In summary, most Shapelet-based algorithms do not achieve a good balance between efficiency and performance. A better Shapelet discovery mechanism is needed to improve the accuracy and reduce the runtime.

2.2. Dictionary Learning

Dictionary Learning is a widely used machine learning algorithm. Its main idea is the assumption that signals can be represented by a linear combination of dictionary atoms. Strictly, the mathematical form of Dictionary Learning can be organized as

$$\min_{Z,\,\alpha}\ \|T - Z\alpha\|_2^2 + \lambda \|\alpha\|_1, \tag{1}$$

where T is the input time series signal, Z is a dictionary, α is a sparse coefficient, and each column z_k of Z is called an atom of the dictionary.

Problem (1) is nonconvex when optimizing α and Z together. However, we can split it into a two-step method: a coefficient updating step and a dictionary updating step. If Z is fixed, the α updating subproblem is convex; Orthogonal Matching Pursuit (OMP) [15], Lasso [16], and the Alternating Direction Method of Multipliers (ADMM) [17] are common methods to update α. If α is fixed, the Z updating subproblem is a constrained least squares problem; many convex optimization methods can solve it, such as the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [18], gradient descent [19], and K-SVD [20]. The two steps are performed alternately until convergence.
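As a minimal sketch of this alternation (our illustration, using NumPy and scikit-learn's Lasso for the coefficient step; the names n_atoms, lam, and n_iter are ours, not from the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(T, n_atoms=10, lam=0.1, n_iter=20, seed=0):
    """T: (n_signals, m) data. Returns dictionary Z (m, n_atoms) and codes A."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((T.shape[1], n_atoms))
    Z /= np.linalg.norm(Z, axis=0)                    # unit-norm atoms
    for _ in range(n_iter):
        # Coefficient step: with Z fixed, each alpha_i is a convex Lasso problem.
        coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        A = np.stack([coder.fit(Z, t).coef_ for t in T])
        # Dictionary step: with alpha fixed, Z is a least squares problem.
        Z = np.linalg.lstsq(A, T, rcond=None)[0].T
        norms = np.maximum(np.linalg.norm(Z, axis=0), 1e-12)
        Z /= norms                                    # renormalize atoms ...
        A *= norms                                    # ... and rescale codes
    return Z, A
```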

There are some similarities between Shapelets and Dictionaries. A Shapelet represents local shape features of a time series; thus, a linear combination of Shapelets can represent a time series partially or totally. This behaviour matches the main idea of Dictionary Learning, in which a linear combination of dictionary atoms is considered to have high representation ability for the raw data. From this perspective, the two kinds of methods can share the same underlying technique.

To this end, the Dictionary Learning technique is introduced into Shapelet discovery, which obtains a set of Shapelets by solving several convex optimization problems. This greatly reduces the amount of computation in the training phase. Different from search-based methods, an optimized Shapelet is a novel segment that need not appear in the existing data, which improves the generalization ability of Shapelets.

3. Shapelet Dictionary Learning with SVM-Based Ensemble Classifier

In this section, we introduce the structure of Shapelet Dictionary Learning with SVM-based Ensemble Classifier. SDL is a generative Shapelet discovery algorithm that combines DL and Shapelet. We train SDL in a supervised way and obtain a subdictionary for each class. Then, a transformation technique uses the subdictionaries to map time series from the time domain to the feature domain. Last, we present an SVM-based Ensemble Classifier to further improve accuracy.

3.1. Build SDL Atom

The atoms of the original DL represent global features, while Shapelets represent local features, so we must reconstruct the SDL atoms before building the overall model. As introduced in Section 2.2, the original DL atom dimension is equal to the input signal dimension. More specifically, consider an input signal T of dimension m and a DL atom of dimension p. In the original DL-based algorithms, p is inevitably set equal to m. Nevertheless, in the SDL algorithm, an atom is a representation of a Shapelet; thus, the dimension p needs to be smaller than m. To address this problem, we introduce the Shift-Invariant Dictionary Learning (SIDL) technique [21] into our work, which uses a sliding window to build the SDL atom and relaxes the constraint on p to p ≤ m.

Figure 2 explains how to generate the SDL atom. The SDL atom is denoted as Q(d, q) ∈ ℝ^m, and the Shapelet is denoted as d ∈ ℝ^p, where p ≤ m. A Shapelet may match anywhere in the time series; thus, the sliding window that represents the Shapelet slides along the SDL atom to find the optimal location. In an SDL atom, only the Shapelet part has practical significance, and the rest is assigned 0. In Figure 2, a Shapelet is matched at location q: the values in the red blocks need to be optimized, while the blue blocks are assigned 0. The matched location q is unique for each time series.

Q(d, q) is a function that describes the mapping relationship between the Shapelet and the SDL atom. Given q and d, the SDL atom can be generated as follows [21]:

$$[Q(d, q)]_j = \begin{cases} d_{j-q}, & q+1 \le j \le q+p, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
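A minimal sketch of this mapping (our illustration, consistent with the reconstruction of equation (2) above; `sdl_atom` is a hypothetical helper name):

```python
import numpy as np

def sdl_atom(d, q, m):
    """Embed Shapelet d (length p) at shift q in a length-m atom; rest is zero."""
    p = len(d)
    assert 0 <= q <= m - p, "sliding window must stay inside the series"
    atom = np.zeros(m)
    atom[q:q + p] = d                  # only the Shapelet part is nonzero
    return atom
```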

3.2. The SDL Model

To build the complete SDL model, further constraints must be imposed. The complete SDL model is

$$\min_{\{d_k\},\{\alpha_{ik}\},\{q_{ik}\}} \sum_i \Big\| T_i - \sum_{k=1}^{K} \alpha_{ik}\, Q(d_k, q_{ik}) \Big\|_2^2 + \lambda \sum_i \sum_{k=1}^{K} \alpha_{ik} \quad \text{s.t.}\ \|d_k\|_2^2 \le c,\ \ 0 \le q_{ik} \le m-p,\ \ \alpha_{ik} \ge 0, \tag{3}$$

where T_i is a time series from the input dataset; the sparse coefficients {α_ik}, the corresponding shifts {q_ik}, and the Shapelet dictionary {d_k} are the outputs of the model to be optimized; and the length of the Shapelet p, the number of Shapelets K, the weight λ, and the scale factor c are hyperparameters to be tuned. The scale constraint on d_k avoids trivial solutions; the constraint on q_ik keeps the sliding window within the boundary; and the nonnegativity constraint on the sparse coefficients prevents Shapelets from being inverted.

Equation (3) is an unsupervised learning model, while time series classification is a supervised problem. To deal with this, we learn a sub-Shapelet dictionary for each class independently.

3.3. Optimization of the SDL

Equation (3) is a nonconvex problem. Commonly, a two-step optimization strategy is suitable for such problems: the sparse coefficients and the Shapelet dictionary are updated alternately, and the subobjective of each step is convex.

3.3.1. Update Sparse Coefficients

In this step, the sparse coefficients are updated. Fixing d, equation (3) turns to

$$\min_{\{\alpha_{ik}\},\,\{q_{ik}\}} \sum_i \Big\| T_i - \sum_{k=1}^{K} \alpha_{ik}\, Q(d_k, q_{ik}) \Big\|_2^2 + \lambda \sum_i \sum_{k=1}^{K} \alpha_{ik} \quad \text{s.t.}\ 0 \le q_{ik} \le m-p,\ \alpha_{ik} \ge 0. \tag{4}$$

We optimize α and q for each T_i independently and update one atom at a time; for the kth atom, the problem turns to

$$\min_{\alpha_{ik},\, q_{ik}} \big\| R_{ik} - \alpha_{ik}\, Q(d_k, q_{ik}) \big\|_2^2 + \lambda\, \alpha_{ik} \quad \text{s.t.}\ 0 \le q_{ik} \le m-p,\ \alpha_{ik} \ge 0, \tag{5}$$

where R_ik = T_i − Σ_{j≠k} α_ij Q(d_j, q_ij) is the residual, and {α_ij}_{j≠k} and {q_ij}_{j≠k} are fixed.

We use an enumeration method to find the best q_ik: the objective value is calculated at every admissible location, and the solution is the location that minimizes equation (5):

$$q_{ik}^{*} = \arg\min_{q \in \{0,\dots,m-p\}}\ \min_{\alpha \ge 0} \big\| R_{ik} - \alpha\, Q(d_k, q) \big\|_2^2 + \lambda\,\alpha, \tag{6}$$

where R_ik is the residual defined in equation (5).

Because of the nonnegativity constraint on α, equation (5) differs from that of [21]. With the optimal q_ik*, the update solution of α_ik is

$$\alpha_{ik} = \max\!\left(0,\ \frac{\langle Q(d_k, q_{ik}^{*}),\, R_{ik} \rangle - \lambda/2}{\|d_k\|_2^2}\right), \tag{7}$$

the closed-form minimizer of equation (5) projected onto the nonnegative orthant (note that ‖Q(d_k, q)‖₂² = ‖d_k‖₂², since the atom is zero outside the window).
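A minimal sketch of this coordinate update (our illustration, following the reconstructed equations (6) and (7); the slicing exploits the fact that Q(d, q) is zero outside the window):

```python
import numpy as np

def update_alpha_q(residual, d, lam):
    """residual: R_ik (length m); d: Shapelet (length p). Returns (alpha, q)."""
    m, p = len(residual), len(d)
    d_sq = np.dot(d, d)
    best = (np.inf, 0.0, 0)                       # (objective, alpha, q)
    for q in range(m - p + 1):                    # enumerate admissible shifts
        corr = np.dot(residual[q:q + p], d)       # <Q(d, q), R_ik>
        alpha = max(0.0, (corr - lam / 2.0) / d_sq)
        r = residual.copy()
        r[q:q + p] -= alpha * d
        obj = np.dot(r, r) + lam * alpha          # squared error + l1 penalty
        if obj < best[0]:
            best = (obj, alpha, q)
    return best[1], best[2]
```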

3.3.2. Update Dictionary

In this step, we fix α and q and update the Shapelet dictionary d. Further, we fix {d_j}_{j≠k} and optimize d_k independently. Thus, problem (3) turns to

$$\min_{d_k} \sum_i \big\| R_{ik} - \alpha_{ik}\, Q(d_k, q_{ik}) \big\|_2^2 \quad \text{s.t.}\ \|d_k\|_2^2 \le c. \tag{8}$$

Equation (8) is a least squares problem with a quadratic constraint, and it can be solved via the Lagrange Multiplier method. The optimal d_k is

$$d_k = \frac{u_k}{\max\!\big(a_k,\ \|u_k\|_2/\sqrt{c}\big)}, \qquad u_k = \sum_i \alpha_{ik}\, R_{ik}^{[1+q_{ik},\, p+q_{ik}]}, \quad a_k = \sum_i \alpha_{ik}^2, \tag{9}$$

where the superscript [1 + q_ik, p + q_ik] denotes the segment from index 1 + q_ik to p + q_ik; this is the unconstrained least squares solution u_k/a_k, rescaled onto the ball ‖d_k‖₂² ≤ c when it falls outside. A proof can be found in [21].
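A minimal sketch of this dictionary step (our illustration, under the reconstruction of equation (9)): an unconstrained least squares solution followed by a projection onto the ball ‖d_k‖² ≤ c.

```python
import numpy as np

def update_atom(residuals, alphas, shifts, p, c):
    """residuals: list of length-m arrays R_ik; alphas, shifts: per instance."""
    num, den = np.zeros(p), 0.0
    for R, a, q in zip(residuals, alphas, shifts):
        num += a * R[q:q + p]                     # alpha_ik * R_ik^[1+q, p+q]
        den += a * a
    if den == 0.0:
        return num                                # atom unused in this round
    d = num / den                                 # unconstrained minimizer
    norm = np.linalg.norm(d)
    if norm ** 2 > c:                             # project onto ||d||^2 <= c
        d *= np.sqrt(c) / norm
    return d
```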

3.4. Supervised Shapelet Dictionary Learning Method for Classification Problem

In this subsection, we introduce the supervised SDL to obtain subdictionary of each class and use subdictionaries to transform time series from the time domain to the feature domain.

We train the SDL model for each of the L classes independently and obtain L subdictionaries. The complete Shapelet dictionary is S = [D1, D2, …, Dl, …, DL], where Dl denotes the lth subdictionary. See Algorithm 1 for the pseudocode of the complete supervised SDL.

Input: time series data set T, parameters p, K, λ, c, stopping threshold ε;
Output: complete Shapelet dictionary S;
(1)For l = 1, 2, …, L do
(2) Initialize sub-dictionaries Dl, sub-coefficient Al, sub-location ql
(3)Repeat
(4)  For k = 1, 2, …, K do
(5)   update qlk with (6)
(6)   update αlk with (7)
(7)  End For
(8)  For k = 1, 2, …, K do
(9)   update dlk with (9)
(10)  End For
(11)Until convergence
(12)End For
(13)S = [ D1, D2, … DL]
(14)Return S

In the original Shapelet algorithm, the decision tree is embedded in the training process; in other words, the original Shapelet algorithm cannot use classifiers other than the decision tree. In order to unbind Shapelet discovery from the decision tree, we use the Shapelet transformation [14] technique to generate features that are suitable for any classifier. Let H = L × K denote the number of Shapelets in S. The transformation technique calculates the minimum distance between T_i and each Shapelet d_j and forms the feature vector V_i = [v_i1, v_i2, …, v_iH]. V is in fact a local reconstruction error. We do not feed the sparse coefficients α to the classifier because they lack discriminative ability. By applying the transformation technique to both the training set and the testing set, we obtain Vtrain and Vtest.
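A minimal sketch of this transformation (our illustration; the helper names are ours): each feature v_ij is the minimum Euclidean distance between time series T_i and Shapelet d_j over all alignments.

```python
import numpy as np

def min_distance(t, d):
    """Smallest Euclidean distance between Shapelet d and any window of t."""
    p = len(d)
    return min(np.linalg.norm(t[q:q + p] - d) for q in range(len(t) - p + 1))

def shapelet_transform(T, shapelets):
    """T: (n, m) series; shapelets: list of H atoms. Returns (n, H) features."""
    return np.array([[min_distance(t, d) for d in shapelets] for t in T])
```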

The choice of classifier is flexible; we use the SVM-based Ensemble Classifier described below.

3.5. SVM-Based Ensemble Classifier

SVM [22, 23] is a widely used classifier with high performance and low computational cost, so we choose SVM as the base classifier. However, a single classifier is easily affected by uncontrollable factors such as noise, resulting in unstable performance. The Ensemble Learning method combines the results of weak base classifiers to form a strong classifier and improves the overall robustness of the algorithm [24]. Many ensemble SVM algorithms have been shown to be more powerful than a single SVM [25, 26].

Specifically, two problems need to be considered in the construction of ensemble methods. The first problem is the selection of features. We use redundant dictionaries, which result in a large number of Shapelets, so a sample after the transformation operation is a high-dimensional vector. Generally, researchers use ICA or other dimension reduction methods to reduce the number of features. However, different samples may be sensitive to features of different dimensions, and dimension reduction leads to a loss of information. The second problem is the selection of SVM parameters. There are many hyperparameters to be tuned in the SVM algorithm, and finding the optimal parameters is a complex problem. At the same time, as with the first problem, different samples may be sensitive to different parameters.

In order to address these two problems, we designed two strategies. First, instead of using all features, each base SVM is trained on randomly selected features. As long as the number of base SVMs is large enough, every feature will be used, and no information is lost despite skipping dimension reduction. Second, the parameters of each base SVM are generated randomly, which avoids the overfitting caused by using a single set of specific parameters. As shown in Figure 3, the steps of the SVM-based Ensemble Classifier are as follows (a code sketch is given after this list):

(1) Construct B training subsets and testing subsets. Each training subset randomly selects E features from the training feature set Vtrain, and the same E features are selected from the testing feature set Vtest to construct the corresponding testing subset. A feature is a column in the feature set. So, we get a series of training subsets Vtrain^1, …, Vtrain^B, where each column of Vtrain^b is a randomly chosen column of Vtrain and E < H. In the same way, we get testing subsets Vtest^1, …, Vtest^B.

(2) For each training subset Vtrain^b, train an SVM model SVM_b, where the kernel type ktb, the degree in the kernel function deb, the gamma in the kernel function gab, and the coef0 in the kernel function cob are all chosen randomly. Then, test Vtest^b using model SVM_b, and record the test result y_b.

(3) Deploy majority voting over y_1, …, y_B to get the final test result.
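A minimal sketch of SEC (our illustration using scikit-learn's SVC; the ensemble size B, subset size E, and the random parameter ranges are our assumptions, not values from the paper):

```python
import numpy as np
from sklearn.svm import SVC

def sec_predict(V_train, y_train, V_test, B=50, E=20, seed=0):
    rng = np.random.default_rng(seed)
    H = V_train.shape[1]
    votes = []
    for _ in range(B):
        cols = rng.choice(H, size=min(E, H), replace=False)   # random feature subset
        clf = SVC(kernel=str(rng.choice(["linear", "rbf", "poly", "sigmoid"])),
                  degree=int(rng.integers(2, 5)),             # random kernel degree
                  gamma=10.0 ** rng.uniform(-3, 1),           # random gamma
                  coef0=float(rng.uniform(0.0, 1.0)))         # random coef0
        clf.fit(V_train[:, cols], y_train)                    # train base SVM
        votes.append(clf.predict(V_test[:, cols]))
    votes = np.stack(votes)                                   # (B, n_test)
    # Majority voting across the B base classifiers.
    final = []
    for col in votes.T:
        vals, counts = np.unique(col, return_counts=True)
        final.append(vals[np.argmax(counts)])
    return np.array(final)
```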

4. Experimental Results and Analyses

4.1. Dataset and Baseline Description

The 45 datasets from the UCR Time Series Classification Repository [27] are used to verify the proposed algorithm. Table 1 summarises the details of these 45 datasets.

SDL-SEC is a modified version of the classical Shapelet algorithm; thus, we chose two typical Shapelet-based algorithms for comparison. In addition, a Deep Learning algorithm is also compared. The following three algorithms are chosen as baselines:

(i) Scalable Shapelet Discovery (SD) [11] uses an online clustering and pruning technique to avoid repeatedly measuring the classification accuracy of similar subsequences. SD has a low computational cost and handles GB-scale data in several minutes.

(ii) Learning Time-Series Shapelets (LTS) [28] learns the top-K Shapelets by optimizing a cost function that consists of classification accuracy and regularization terms. LTS reaches a strong performance among many Shapelet-based algorithms.

(iii) Fully Convolutional Network (FCN) [29] is a Deep Learning method. Deep Learning plays an important role in many domains, and FCN is a strong baseline, as its authors claimed in their paper.

We reproduced the results of LTS and FCN with the open-source code (http://www.timeseriesclassification.com/code.php, https://github.com/cauchyturing/UCR_Time_Series_Classification_Deep_Learning_Baseline). SD is well tested, and the hardware used in [11] is close in performance to ours, so we reuse those results directly. We also compare the single SVM classifier with the SVM-based Ensemble Classifier: the single SVM version is denoted as SDL-S, and the ensemble version is denoted as SDL-SEC. Classification accuracy, the ratio of correctly classified instances to total test instances, is used as the quantitative performance measure. Runtime, which includes training time and testing time, is used as the quantitative efficiency measure.

Our hardware environment is a CPU (i7-9750H) with 16 GB of RAM. We use Matlab 2019b to implement the SDL-SEC algorithm. SD and LTS are implemented in Java, and FCN is implemented in Python with TensorFlow. To be fair, we use only the CPU for computation and use the default settings for all baselines.

4.2. Hyperparameter Search

There are four hyperparameters to be tuned in the SDL-SEC model: p, K, c, and λ. The classification accuracy has different sensitivities to these hyperparameters. Figure 4 reveals the accuracy variation trends on the Gun Point dataset in four situations. In Figure 4(a), p varies in the range [0.1, 1] with a step size of 0.01 while the remaining hyperparameters are fixed. In the same way, the remaining hyperparameters are fixed while K varies in the range [1, 20] with a step size of 1 in Figure 4(b); c varies in the range [200, 400] with a step size of 10 in Figure 4(c); and λ varies in the range [0.1, 10] with a step size of 0.1 in Figure 4(d). Obviously, the classification accuracy fluctuates dramatically when p and K vary, while it stays nearly flat when c and λ vary. Thus, p and K need to be fine-tuned.

Based on the sensitivity analysis above, the hyperparameter tuning strategy is formulated, using a grid search approach. According to our experiments, redundant Shapelets can capture more unusual local shape features and improve the classification accuracy. Thereby, we search the number of Shapelets K in the range [10, 100] with a step size of 10, and the Shapelet length p in the range [0.05, 0.9] × m with a step size of 0.05 × m. The value of K × p determines the interpretability of the Shapelets: when K × p > m, the Shapelets are redundant and duplicated, and such a setting may yield higher accuracy but worse interpretability. We fix the coefficient λ to 0.01 and the coefficient c to 100.
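A minimal sketch of this search strategy (our illustration; `evaluate` is a hypothetical stand-in for one full train/validate run of SDL-SEC):

```python
import numpy as np

def grid_search(evaluate, m, lam=0.01, c=100):
    """Grid search over Shapelet count K and length p; lambda and c are fixed."""
    best_acc, best_cfg = -np.inf, None
    for K in range(10, 101, 10):                    # K in [10, 100], step 10
        for frac in np.arange(0.05, 0.91, 0.05):    # p in [0.05, 0.9] * m
            p = max(2, int(round(frac * m)))
            acc = evaluate(p=p, K=K, lam=lam, c=c)  # hypothetical evaluation
            if acc > best_acc:
                best_acc, best_cfg = acc, (p, K)
    return best_cfg, best_acc
```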

4.3. Results and Analyses

This section describes and analyses the experimental results from two aspects: classification accuracy and runtime. The limitations of the proposed algorithm are also discussed.

4.3.1. Classification Accuracy Comparison

We use two statistical indicators to compare the classification accuracies of all algorithms: total wins and average rank. The best algorithm has the most total wins and the best (smallest) average rank. An algorithm with better generalization ability obtains higher accuracy on more datasets, so these indicators also reflect generalization.

Table 2 summarises the classification accuracy indicators of the experiments. It should be noted that the accuracy of SDL-S is slightly different from that presented in [30], due to the updated parameter searching strategies. As shown in Table 2, the proposed algorithm achieves considerable classification performance. SDL-SEC achieves the best average rank of 1.91 and wins 15 times on the 45 datasets. This result is better than that of SDL-S, showing that the SVM-based Ensemble Classifier is effective. FCN obtains the most total wins, 28 times, but its average rank of 1.96 is worse than that of SDL-SEC. LTS also performs well, but both of its indicators are worse than those of SDL-SEC. The accuracy of SD is obviously lower than that of the other algorithms.

Figure 5 shows 1-vs-1 comparisons of the SDL-SEC algorithm with the other algorithms. In Figure 5, points above the red line represent higher accuracy for the baseline, and points below the red line represent higher accuracy for SDL-SEC. In comparison with the SD and LTS algorithms, the points are mostly distributed below the red line. In comparison with FCN, the points are roughly evenly distributed on both sides of the red line. It can be seen intuitively that the overall accuracy of SDL-SEC is close to FCN and higher than SD and LTS.

By analyzing Table 2 and Figure 5, we can conclude that SDL-SEC achieves high classification accuracy with good generalization ability and can adapt to most time series classification problems.

4.3.2. Runtime Comparison

The runtime of each algorithm is analysed below. Table 3 shows the runtimes of the compared algorithms. An "n/a" in Table 3 indicates that the time or memory required by the algorithm exceeds the limits of our hardware; in those cases, Table 2 uses the accuracy provided by the authors. The results of LTS, FCN, and SDL-SEC are rounded. The runtime of SD is the shortest; as claimed, SD is an efficient time series classification algorithm. Although SDL-SEC runs longer than SD, its runtime is generally acceptable. LTS and FCN run 2-4 orders of magnitude longer than SDL-SEC. Although FCN can be accelerated by a GPU, it is still a computationally expensive algorithm.

To sum up, SDL-SEC achieves a good balance between accuracy and runtime, greatly reducing the runtime while maintaining high accuracy.

4.3.3. Limitation of the Proposed Algorithm

In this part, we discuss the limitations of the proposed algorithm by analyzing two datasets with low classification accuracy. SDL performs worse on the Adiac and InlineSkate datasets, with classification accuracies of 0.790 and 0.415, respectively. Figure 6 shows the time domain curves of Adiac and InlineSkate; lines in different colors represent different instances in the dataset. In Figures 6(a) and 6(b), signals from different labels of the Adiac dataset show similar curves, and the shape features are not discriminative. In Figures 6(c) and 6(d), signals from the same label of InlineSkate show disorganized curves, and there is no common shape feature. Shapelet-based algorithms, including the proposed algorithm, lose effectiveness in these two situations: similar curves between different labels or disorganized curves within a label.

5. Conclusion and Future Work

In this paper, we present a Shapelet-based time series classification algorithm called Shapelet Dictionary Learning with SVM-based Ensemble Classifier. First, we propose a Shapelet discovery method, Shapelet Dictionary Learning, which combines Dictionary Learning and Shapelet and generates a group of Shapelets instead of searching for them. The generated Shapelets are entirely new subsequences that capture the local shape features of the time series data but do not exist in the original data. This improves the generalization ability of the Shapelets, reaches a higher accuracy in time series classification, and reduces runtime simultaneously. Furthermore, we propose an SVM-based Ensemble Classifier to perform the classification. The proposed SVM-based Ensemble Classifier trains a series of base SVM classifiers that are fed random features and parameters. This method decreases the dependence on feature and parameter selection and thus further improves the classification accuracy. Extensive experiments are performed on 45 benchmark datasets. The results show that the proposed algorithm has high accuracy, good robustness, and high efficiency.

We also identify future work from the limitations of the proposed algorithm. First, SDL-SEC discovers time domain shape features and will fail when the shape features are not obvious; to address this issue, we may exploit a feature fusion method that combines frequency domain features and other statistical features. Second, the current SDL hyperparameter selection method is a grid search, which requires a manually designed searching strategy. As an improvement direction, intelligent optimization algorithms, such as evolutionary algorithms, can be exploited to search hyperparameters automatically [31]. Third, only the Euclidean distance is considered in this work, so it is difficult to deal with real-world data containing noise and uncertainty. Fuzzy Measurement [32, 33] is an excellent tool to handle this kind of problem, which we will investigate in our future work.

Data Availability

The dataset used in this paper can be found in http://www.timeseriesclassification.com.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant no. 2018AAA0101700 and the Natural Science Foundation of China (NSFC) under Grant no. 51721092.