Abstract
This paper introduces several novel strategies for multi-step-ahead prediction of chaotic time series. Introducing the concept of "generalized z-vectors" (vectors of nonsuccessive time series observations) makes it possible to generate sets of possible prediction values for each point we are trying to predict. By examining these sets, unified predictions are calculated, which are in turn used as a basis for predicting subsequent points. The key difference between the strategy presented in this paper and its conventional counterparts is the concept of "nonpredictable" points (points which the algorithm categorizes as "incalculable" and excludes from the calculations altogether). The results obtained for benchmark and real-world time series indicate that while the number of nonpredictable points typically tends to grow exponentially with the number of steps ahead to be predicted, the average error for the predicted points remains small and nearly constant. Thus, we redefine the problem of multi-step-ahead prediction as a two-objective optimization problem: we aim to minimize both the number of nonpredictable points and the average error among the predictable ones. The resulting strategy demonstrates accurate results for both benchmark and real-world time series, with the number of predicted steps exceeding that of any other published algorithm.
1. Introduction
There are countless examples of chaotic systems in the world, as evidenced by the constantly increasing number of chaotic time series forecasting algorithms. However, while those dealing with one-step-ahead prediction demonstrate remarkable efficiency [1], their multi-step-ahead prediction counterparts are still in their infancy. This may be attributed to the exponential growth of the average prediction error with an increasing prediction horizon (the number of steps ahead to be predicted).
This exponential growth reflects Lyapunov instability, inherent to any chaotic system. According to the definition of Lyapunov instability, any initial difference between two neighboring trajectories, however small, grows exponentially over time; the exponent equals the highest Lyapunov exponent [2, 3]. Lyapunov instability leads to another concept, that of the horizon of predictability, sometimes referred to as "Lyapunov time" [4]. For a given observational error $\delta$ and a maximum admissible prediction error $\Delta$, the aforementioned exponential growth satisfies the constraint $\varepsilon(t) \approx \delta e^{\lambda t} \le \Delta$, where $\varepsilon(t)$ represents the error at the moment $t$ and $\lambda$, the highest Lyapunov exponent (which is positive for chaotic time series and negative for nonchaotic ones). The value $T_{\text{pred}} = \frac{1}{\lambda}\ln(\Delta/\delta)$ gives an estimated horizon of predictability [2, 3]. In the context of multi-step-ahead prediction, it follows that the smallest difference between true and predicted values at any intermediate position between the last observable point and the point to be predicted triggers exponential error growth in all subsequent points, regardless of the employed prediction method. It should be noted that the horizon of predictability should not be confused with the prediction horizon: the former is a theoretical upper boundary for the number of steps to be predicted (given $\delta$ and $\Delta$), while the latter is simply the number of steps to be predicted. Most of the time, the prediction horizon $h$ is significantly less than the horizon of predictability: $h \ll T_{\text{pred}}$. However, the predictive clustering approach provides the necessary tools for dealing with the exponential growth of the average prediction error [5, 6].
First, predictive clustering utilizes motifs, repeated sequences of data in a series. If a section of a time series resembles an initial part of a motif “closely enough,” it is likely that the subsequent points of the series will closely match the subsequent points of the motif, thus enabling the use of motifs for forecasting the time series at hand. Motifs can be obtained by clustering vectors of observations and calculating their centers. It should be noted that whereas most available prediction methods attempt to construct a single unified prediction model, the motifs provided by predictive clustering form a set of local prediction models.
Second, Gromov and Borisenko [6] proposed using generalized z-vector templates, vectors of nonsuccessive observations, which are spaced out according to some predefined pattern. A pattern is a vector of distances between positions of series observations. The resulting vector generalizes a conventional z-vector (successive observations), which corresponds to the pattern $(1, 1, \ldots, 1)$. Figure 1 demonstrates a z-vector constructed according to a sample pattern, whereas Figure 2 illustrates the way z-vectors are clustered into motifs and also how motifs are used for obtaining predictions. The process of generating z-vectors can be visualized as placing a "comb" with teeth spaced out according to some pattern, with all the other teeth "broken off." While moving the "comb" along the series, we can obtain a vector of observations under the teeth of the "comb" for each position $t$. Each pattern has its own set of such vectors (the sample corresponding to a given pattern). Each such sample is then clustered separately. This approach yields a set of possible values for each point to be predicted. Each set is comprised of predicted values obtained using motifs corresponding to different templates. Since the number of patterns can be arbitrarily large, so can the sizes of the sets of possible prediction values for the points we are trying to predict. Despite the fact that points in those sets are not always statistically independent, a great number of algorithms determining a unified predicted value based on these sets can be designed.


Third, the concept of nonpredictable points can be extended. Previously, a point was considered to be nonpredictable if it did not have corresponding motif(s); now it can also be considered nonpredictable if it is impossible to calculate a unified predicted value for this particular point based on its set of possible prediction values. An example of the latter would be a point whose set of possible prediction values consists of two equally sized clusters with different centers. The present work introduces a novel concept of nonpredictable points that the authors have not yet encountered in any literature on the topic. It should be noted that while in our previous paper we employed only the former type of nonpredictable points, we use both in this one.
The introduction of the concept of nonpredictable points renders the problem two-objective: on one hand, we have to minimize the total number of nonpredictable points, and on the other, the average error among the predictable ones. Discarding some points as nonpredictable has proven to have a positive effect on the results: it is better for the algorithm to skip some points than to force a prediction at each of them. This approach is actually quite natural in some areas: for example, stock market traders do not have to make trades at every single moment in time; they could simply choose the moments that the algorithm recognizes as predictable.
Furthermore, since patterns with distances greater than one exist, the fact that the point at position $t$ is nonpredictable does not imply that the points at subsequent positions will be nonpredictable as well. This is fundamentally important for multi-step-ahead prediction: if a point $h$ steps ahead needs to be predicted, the iterated strategy is used (or rather its modifications based on nonsuccessive observations). Namely, the intermediate values between positions $t$ and $t + h$ are predicted step-by-step, with nonpredictable points being identified along the way and excluded from further calculations.
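To make the iterated-with-skipping idea concrete, here is a minimal Python sketch; `predict_one` and `is_predictable` are hypothetical stand-ins for the motif-matching and predictability-checking procedures described later, not the authors' actual implementation.

```python
def iterated_partial_prediction(history, h, predict_one, is_predictable):
    """Iterated strategy in which nonpredictable points are skipped.

    `predict_one` returns the set of possible values for the next
    position, built from observed and previously predicted points
    (None marks positions that were skipped as nonpredictable).
    """
    path = list(history)
    for _ in range(h):
        candidates = predict_one(path)
        if candidates and is_predictable(candidates):
            path.append(sum(candidates) / len(candidates))  # unified value
        else:
            path.append(None)  # nonpredictable: skip, but keep going
    return path[len(history):]
```

Because patterns of nonsuccessive observations can "reach over" a skipped position, the loop keeps producing candidates for subsequent points even after a None has been recorded.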
In the context of dynamic systems [2, 3], each cluster of vectors (which by definition represent similar sections of a trajectory) corresponds to a particular area of a strange attractor. Areas with a high value of an invariant measure are associated with larger clusters (i.e., more frequently observed motifs), and vice versa. The number of clusters increases with the size of their respective training set, whereas the number of nonpredictable points and the average error among the predictable ones decrease. A large-scale simulation for the Lorenz series [7] supports this conclusion.
In conclusion, although the algorithm does not calculate a prediction for each point (thus the word "partial" in the title), it does make predictions up to the horizon of predictability (sometimes even surpassing it). The paper presents several methods of identifying nonpredictable points and corresponding strategies for multi-step-ahead prediction, as well as examples of partial predictions beyond the horizon of predictability for benchmark and real-world time series.
The rest of the paper is organized as follows:
(i) Section 2 reviews recent advances in the field.
(ii) Section 3 formally states the problem.
(iii) Section 4 outlines the employed clustering technique and ways of identifying nonpredictable points and calculating predictions.
(iv) Section 5 provides the results for predicting the Lorenz series and hourly electric grid load values in Germany.
(v) Section 6 contains conclusions.
2. Related Works
Most of the recently published papers discussing chaotic time series prediction [1, 6, 8–11] concern themselves only with the one-step-ahead prediction problem, whereas the number of papers tackling the problem of multi-step-ahead prediction (MSAP) is considerably lower.
An MSAP algorithm for chaotic time series consists of two parts, which are equally responsible for the accuracy of the final prediction. The first part is a technique used to make one- or few-steps-ahead predictions, whereas the second one “assembles” the results of the previous part into a final multi-step-ahead prediction.
Algorithms used to make a one-step-ahead prediction employ nearly all fields of data mining and machine learning, such as:
(i) Support vector regression and its modifications [9]
(ii) Multilayer perceptron [12]
(iii) Clustering algorithms in predictive clustering [13–16]
(iv) LSTM neural networks [10]
(v) Voronoi diagrams [17]
(vi) Ridge polynomial neural networks [18]
(vii) Wavelet neural networks [19, 20]
(viii) Dilated convolution networks [21]
Another factor of fundamental importance is the strategy that "assembles" one-step-ahead predictions into a multi-step-ahead one. There are two basic strategies of MSAP: the iterated (recursive) and the direct one [22, 23]. The iterated strategy implies that multi-step-ahead prediction is carried out step-by-step: new predicted values are calculated based on the predictions made in the previous steps. The direct strategy aims to obtain the result immediately, without calculating predicted values for intermediate positions. This strategy applies prediction techniques for various prediction horizons, thereby providing multiple predictions for any position to be predicted. Ben Taieb et al. [22] review different methods based on these two basic strategies. Sangiorgio and Dercole [10] apply both strategies with multilayer perceptrons and LSTM nets employed as tools to make one-step-ahead predictions.
Unfortunately, MSAP methods designed according to the aforementioned strategies suffer from the same exponential error growth, thus giving rise to a number of hybrid strategies aiming to resolve this issue. In their review, Ben Taieb et al. [8] compare the two basic and three novel strategies (DirRec, MIMO, and DIRMO). DirRec (Direct + Recursive) combines the two basic strategies and uses the direct approach to predict values, with the number of inputs being enlarged iteratively to include the values of the most recently predicted positions. When it comes to the MIMO (multiple input multiple output) strategy [24, 25], an array of values is produced for the positions between the observed values and the prediction horizon (inclusively). This reveals any correlations the time series may have and allows them to be stored within the predicted values.
DIRMO (DirRec + MIMO) strategy divides the series (up to some prediction horizon) into chunks and applies MIMO strategy to each chunk separately. The authors test these five strategies on a reasonably large sample of different time series (NN5 competition), which reflects various irregularities inherent to time series. Bao et al. [9] compare efficiency of the iterated, direct, and MIMO strategies by performing a one-step-ahead prediction using a modified support vector regression. According to the authors, the MIMO strategy compares favourably with the other two (all other factors being equal).
Chaotic time series prediction methods that rely on reservoir computing mark a different approach [26, 27]. Canaday et al. indicate that such methods demonstrate excellent results for small prediction horizons; however, their performance worsens greatly for larger horizons [28].
Multiple-task learning can be viewed as a strategy of its own for multi-step-ahead prediction. Chandra et al. [12] propose an algorithm to determine a neural network structure to solve MSA prediction problems; their approach can be considered a combination of the direct and iterated strategies. Wang et al. [29] utilise a neural model in order to combine periodic approximations for longer periods and machine learning models for the shorter ones. This could be viewed as a separate strategy when dealing with data with a marked periodicity. Ye and Dai [11] employ multitask learning for multistep prediction. Kurogi et al. [17] make use of an out-of-bag model for selecting models for multistep prediction. The authors aim at predicting chaotic series as far as possible (the largest possible prediction horizon) while retaining reasonable accuracy. The authors present their results for the Lorenz series.
The importance of being able to make accurate predictions up to an increasing number of steps was named "Run for the Horizon" by the authors of the current paper in one of their previous publications [7]. The series of works by Sangiorgio et al. [10, 30] (culminating in a monograph [25] in 2021) approach multi-step-ahead forecasting by using neural networks of both classic feed-forward and recurrent (LSTM) architectures. The latter are traditionally trained with a supervisor, thus forcing the algorithm to speed up the convergence of the optimization. The authors managed to make adequate predictions up to six Lyapunov times on the benchmark series (logistic and Hénon maps as well as two generalized Hénon maps). Even though the authors did manage to considerably delay the exponential error growth, they failed to avoid it completely; the results indicate that after six Lyapunov times the error starts to increase exponentially. It should also be noted that the predictions achieved in [31], which relied on reservoir computing with the data calculated by the integration of the Lorenz-96 model, are similar in terms of prediction intervals.
To summarize, none of the aforementioned strategies are immune to the exponential growth of an average prediction error with an increasing prediction horizon. The present paper discusses a few novel strategies with the main difference from their classical counterparts being the concept of nonpredictable points and an ability not to take into account clearly erroneous predictions at intermediate positions [6], thus weakening the exponential nature of the growth rate.
3. Problem Statement
This paper deals with an $h$-steps-ahead prediction for a chaotic time series $X = \{x_t\}_{t=1}^{N}$. We assume that all transient processes are completed and that the series itself reflects the movement of a trajectory in the neighborhood of a strange attractor. The third assumption is that the series meets the conditions of Takens' theorem (which makes the analysis of the attractor structure based on the series observations possible) [2, 3].
We divide the series into two parts: the observable part $\{x_1, \ldots, x_T\}$, used as a training set, and the test set $\{x_{T+1}, \ldots, x_N\}$. When the algorithm makes a prediction at a position $t$ of the test set, it has access to the observations $x_1, \ldots, x_{t-h}$; however, it does not have any information on the observations $x_{t-h+1}, \ldots, x_N$. Thus, the prediction is made $h$ steps ahead, where $h$ is a parameter of the algorithm.
Predictive clustering algorithms and a large number of used patterns make it possible to construct a set of possible predicted values $Z(t) = \{z_1, \ldots, z_{n_t}\}$ for each point $x_t$ to be predicted, where $n_t$ is the number of possible predicted values calculated by the algorithm and $z_i$ is the $i$-th predicted value. A set $S_k(t)$ is comprised of all sets of possible predicted values for the points involved, where $k$ indicates the type of algorithm employed (usually $k$ is equal to $1$ or $h$). For $k = 1$, $S_1(t)$ consists of the set of possible predicted values for the point $x_t$ exclusively; for $k = h$, it consists of the sets of possible predicted values for the points $x_{t-h+1}, \ldots, x_t$ (i.e., all intermediate positions as well). We denote the algorithm that constructs sets of possible predicted values as $A_k$.
The concept of nonpredictable points implies that two operators are applied to the set $S_k(t)$. The first checks whether a position $t$ is predictable:

$$\pi(S_k(t)) = \begin{cases} 1, & \text{if } t \text{ is predictable},\\ 0, & \text{otherwise}, \end{cases}$$

where $\pi(S) \in \{0, 1\}$ for any set $S$.
The second operator determines a unified predicted value for a set of possible predicted values (provided the position is predictable): $u(S_k(t)) = \hat{x}_t$, defined whenever $\pi(S_k(t)) = 1$.
Using these two operators, we can define the multi-step-ahead prediction process as a two-objective optimization problem:

$$J_1 = \sum_{t=T+1}^{N} \bigl(1 - \pi(S_k(t))\bigr) \longrightarrow \min, \qquad (3)$$

$$J_2 = \frac{\sum_{t=T+1}^{N} \pi(S_k(t))\,\lvert x_t - u(S_k(t))\rvert}{\sum_{t=T+1}^{N} \pi(S_k(t))} \longrightarrow \min. \qquad (4)$$
Both functionals sum over all observations of the test set, with the first one minimizing the total number of nonpredictable points and the second one minimizing the average error among the predictable ones. The present paper attempts to solve this two-objective problem.
3.1. Algorithm
The subsection is organized as follows:
(i) The process of generating samples from the time series according to predefined patterns
(ii) Employed clustering techniques
(iii) Generating sets of possible prediction values
(iv) Quality measures of identification of nonpredictable points
(v) Ways of identifying nonpredictable points
3.2. Training Set
The series is assumed to be normalized. Patterns (arrays of distances between nonsuccessive observations) are used to generate samples. Each pattern is an $n$-dimensional vector of integers $\Lambda = (\lambda_1, \ldots, \lambda_n)$, $1 \le \lambda_i \le d_{\max}$, where $d_{\max}$ is the maximum distance between any two consecutive observations in the pattern.
For example, let us consider a three-point pattern $\Lambda = (\lambda_1, \lambda_2, \lambda_3)$. The first vector of the corresponding training set would be $(x_1, x_{1+\lambda_1}, x_{1+\lambda_1+\lambda_2}, x_{1+\lambda_1+\lambda_2+\lambda_3})$ (the first z-vector); the next vector would then become $(x_2, x_{2+\lambda_1}, x_{2+\lambda_1+\lambda_2}, x_{2+\lambda_1+\lambda_2+\lambda_3})$, and so forth. The last vector of the training set would be $(x_{T-\lambda_1-\lambda_2-\lambda_3}, \ldots, x_T)$, with $x_T$ being the last observable value of the training set.
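As an illustration, the following Python sketch builds the training sample for a single pattern by sliding the "comb" along the series; the pattern value (2, 3, 1) is a hypothetical example, not one reported in the paper.

```python
import numpy as np

def build_zvectors(series, pattern):
    """Training sample of generalized z-vectors for one pattern.

    `pattern` holds the distances between successive teeth of the comb,
    e.g. (2, 3, 1) selects observations at positions t, t+2, t+5, t+6.
    """
    offsets = np.concatenate(([0], np.cumsum(pattern)))  # tooth positions
    span = offsets[-1]                                   # width of the comb
    return np.stack([series[t + offsets] for t in range(len(series) - span)])

series = np.sin(np.linspace(0.0, 20.0, 500))   # stand-in series
sample = build_zvectors(series, (2, 3, 1))
print(sample.shape)                            # (494, 4)
```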
In the context of predictive clustering, the conventionally used z-vectors comprising successive observations have proven to be less efficient than their counterparts comprising nonsuccessive observations [6]. We attribute this to the fact that vectors of nonsuccessive observations have a better chance of storing information about salient observations (minima, maxima, tipping points, etc.) and the correlations between them.
A set of all combinatorially possible patterns is used for generating training sets. Each training set is generated separately, according to its corresponding pattern $\Lambda$. Vectors from these training sets can be used either "as-is" (i.e., each vector is treated as a separate motif), or they can be clustered first, with the center of each of the formed clusters becoming a motif of its own.
We denote the set of motifs corresponding to the pattern $\Lambda$ as $M_\Lambda$. The obtained motifs can only be used with the pattern $\Lambda$. The set of all motifs is denoted as $M = \bigcup_{\Lambda} M_\Lambda$.
3.3. Clustering Technique
As a trajectory moves repeatedly through the same area of a strange attractor, similar recurring sequences of values associated with that particular area can be observed in the trajectory's time series. Centers of clusters of sequences associated with different areas of the attractor serve as the simplest prediction models for the respective areas of the attractor [14]. To cluster sequences of observations, we employ a modified version of the Wishart clustering technique [32, 33] developed by Lapko and Chentsov [34]. It uses graph theory concepts and a nonparametric probability density function estimator based on $k$ nearest neighbors. Some of the difficulties associated with the application of this algorithm to forecasting are discussed in [6].
A significance value of a point $x_i$ is defined as $p(x_i) = \frac{k}{n\,V_k(x_i)}$, where $V_k(x_i)$ and $r_k(x_i)$ are the volume and the radius of a minimum-size hypersphere centered at the point $x_i$ and containing at least $k$ observations ($n$ is the number of sample vectors). The method relies on a proximity graph $G = (V, E)$, the vertices of which correspond to the sample vectors, with the edge set defined as $E = \{(x_i, x_j) : d(x_i, x_j) \le r_k(x_i)\}$. $G_s$ is a subgraph of $G$ with a vertex set $V_s \subseteq V$ and an edge set comprised of all edges of $G$ whose vertices belong to $V_s$. Let $c(x_i)$ be the cluster number label of $x_i$. A cluster is said to be height-significant with respect to the height value $\mu$ if the significance values of its points differ by at least $\mu$.
Thus, the algorithm consists of the steps presented in Algorithm 1.
After performing large-scale simulations for different parameter values, we determined the best values of $k$ and $\mu$ to be 11 and 0.2, respectively.
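The following sketch shows how the significance values can be computed with off-the-shelf tools; it covers only this one ingredient of the modified Wishart procedure (the graph construction and cluster bookkeeping are omitted), and the default mirrors the $k = 11$ reported above.

```python
import numpy as np
from math import gamma, pi
from sklearn.neighbors import NearestNeighbors

def significance_values(X, k=11):
    """Density estimates p(x) = k / (n * V_k(x)) for each sample vector.

    V_k(x) is the volume of the minimal hypersphere centered at x that
    contains its k nearest neighbours (radius r_k).
    """
    n, d = X.shape
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 for the point itself
    dist, _ = nn.kneighbors(X)
    r = dist[:, -1]                                   # r_k for every point
    unit_ball = pi ** (d / 2) / gamma(d / 2 + 1)      # volume of the unit d-ball
    return k / (unit_ball * r ** d * n), r
```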
3.4. Sets of Possible Predicted Values
We consider the set of motifs $M_\Lambda$ for each pattern $\Lambda$ and construct a set of possible prediction values associated with the pattern $\Lambda$. Namely, for a position $t$ we are trying to predict, we compose a vector of time series observations corresponding to the pattern $\Lambda$: $\bar{z}(t) = (x_{t-\lambda_1-\cdots-\lambda_n}, \ldots, x_{t-\lambda_{n-1}-\lambda_n}, x_{t-\lambda_n})$ (all elements of $\bar{z}(t)$ are assumed to be either observed or predicted in the previous step(s)). The next step is to calculate the Euclidean distance $\rho_j$ between $\bar{z}(t)$ and a truncated motif $m'_j$, a vector comprised of all elements of a motif $m_j \in M_\Lambda$ except for the last one. If $\rho_j \le \varepsilon$ (where $\varepsilon$ is a parameter of the algorithm), then the last element of the motif becomes a possible predicted value at the position $t$. Thus, the set of possible predicted values at position $t$ for pattern $\Lambda$ is defined as $Z_\Lambda(t) = \{\text{last}(m_j) : m_j \in M_\Lambda,\ \rho_j \le \varepsilon\}$.
In turn, the set of possible predicted values corresponding to all patterns becomes a union of the sets $Z_\Lambda(t)$ and can be defined as $Z(t) = \bigcup_{\Lambda} Z_\Lambda(t)$. Finally, the unified prediction value is calculated as a function of $Z(t)$. Previously predicted intermediate positions are used as inputs for the new iterations of the algorithm until the prediction horizon is reached. Let us label each distinct iteration as a "prediction step." Performing the prediction step repeatedly produces an operator $\widehat{P}$, which yields a set of possible predicted values for position $t$. The size of the set is directly proportional to the number of employed patterns. It allows for an implementation of algorithms for determining the final, unified predicted value at $t$ as well as identifying nonpredictable points. These algorithms are discussed later in the paper.
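A minimal sketch of this matching step, under the notation reconstructed above (`eps` and the helper name are illustrative):

```python
import numpy as np

def possible_values(history, motifs, pattern, eps=0.1):
    """Possible predicted values for the next position under one pattern.

    `history` is an array of observed and previously predicted values up
    to position t - 1; `motifs` are cluster centers for this pattern,
    with the last element of each motif being the value it predicts.
    """
    offsets = np.concatenate(([0], np.cumsum(pattern)))
    z = history[offsets[:-1] - offsets[-1]]   # teeth of the observed part
    return [m[-1] for m in motifs if np.linalg.norm(z - m[:-1]) <= eps]
```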
In addition to the set of possible prediction values $Z(t)$, we can calculate a set of weights $W(t) = \{w_i\}$, which characterizes the comparative significance of the individual possible prediction values $z_i \in Z(t)$. The alternatives presented below can be used to determine the weights.
The first technique relies on the notion of "reliability" of the individual intermediate predictions used to calculate a possible prediction value. Each prediction step results in a greater error of the final prediction at position $t$. Thus, in order to offset this error growth, we can assign a weight to an element of a set of possible predicted values based on the number of prediction steps made in order to calculate it. We assign weights $w = 1$ to observed values (thus indicating that they are 100% reliable). Given a possible prediction value $z$ (either at the final or an intermediate point), calculated from preceding points (predicted or observed) with corresponding weights $w_1, \ldots, w_n$ using a pattern $\Lambda$, the weight of $z$ is calculated as the average of the weights of the preceding points times a step-down factor $\gamma$: $w(z) = \gamma \cdot \frac{1}{n}\sum_{j=1}^{n} w_j$.
The step-down factor ensures that predicted points receive a progressively smaller weight compared to the observed and earlier predicted points; we used a fixed $\gamma < 1$ throughout. We used the average weight of those possible predicted values that were used to calculate the unified predicted value as the weight of this unified predicted value (either at the final or intermediate points).
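A one-line sketch of the weight propagation rule; the value gamma = 0.9 is an illustrative assumption, as the paper fixes the factor empirically:

```python
def propagate_weight(preceding_weights, gamma=0.9):
    """Weight of a new prediction: the average weight of the points it
    was based on, times the step-down factor gamma < 1 (observed
    values carry weight 1)."""
    return gamma * sum(preceding_weights) / len(preceding_weights)
```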
The second technique considers $w_i$ to be inversely proportional to the distance $\rho_i$ between the observed values and the motif chosen for the prediction: $w_i = \frac{1}{\max(\rho_i, \theta)}$, where $\theta$ is a small threshold (typically equal to 0.05 for the purposes of the current simulation).
The third approach calculates $w_i$ as a product of the results of the previous two methods.
3.5. Unified Predicted Value
The method of calculating a unified predicted value (UPV) depends on whether the algorithm's parameter $k$ equals $1$ or $h$. If $k = 1$, the UPV is calculated based solely on the set of possible prediction values associated with the current position, $Z(t)$. There are several techniques that can be used to extract the UPV $u(t)$ from $Z(t)$ (a sketch of the first three follows the list):
(1) [avg] Averaging over $Z(t)$: $u(t) = \frac{1}{n_t}\sum_i z_i$.
(2) [wavg] Taking a weighted average of $Z(t)$: $u(t) = \frac{\sum_i w_i z_i}{\sum_i w_i}$.
(3) [clc] Clustering $Z(t)$ using the DBSCAN algorithm and selecting the center of the largest cluster $C_{\max}$. DBSCAN [35] demonstrates good performance for one-dimensional data and was deemed acceptable for the task. The exact value of the UPV in this case is the average of the elements of $C_{\max}$.
(4) Clustering only the elements of $Z(t)$ with weights exceeding some threshold.
(5) The two previous techniques can be applied to fuzzy sets. Normalized weights can be viewed as values of a membership function that indicates belonging to a set of possible predicted values.
(6) Clustering and picking the center of a randomly chosen cluster, with probability proportional to the sum of the weights of its elements relative to the sum of the weights of all possible prediction values (roulette wheel).
(7) [mf] Choosing the mode of $Z(t)$.
(8) [mfp] Choosing the mode $m$ of $Z(t)$ and adding a small amount of random noise to it: $u(t) = m + \xi$, where $\xi$ is a normally distributed random variable with zero mean and small variance.
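A sketch of techniques 1–3, assuming scikit-learn's DBSCAN and illustrative parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def unified_value(values, weights=None, method="avg", eps=0.05):
    """Unified predicted value from a set of possible values
    ([avg], [wavg], and [clc] from the list above)."""
    v = np.asarray(values, dtype=float)
    if method == "avg":
        return v.mean()
    if method == "wavg":
        return np.average(v, weights=np.asarray(weights, dtype=float))
    if method == "clc":
        labels = DBSCAN(eps=eps, min_samples=2).fit_predict(v.reshape(-1, 1))
        if (labels >= 0).sum() == 0:
            return None                          # no cluster: nonpredictable
        largest = np.bincount(labels[labels >= 0]).argmax()
        return v[labels == largest].mean()       # center of the largest cluster
    raise ValueError(method)
```

Returning None when DBSCAN finds no cluster anticipates the second kind of nonpredictable point discussed below.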
In the case when $k = h$, we introduce the concept of a "prediction trajectory" (a sequence of possible prediction values for every position from the last observed point up to $t$). Let us denote the $i$-th possible prediction trajectory (PPT) as $\tau_i$, with $\tau_i(p)$ being its value at the $p$-th position and $m$ being the maximum number of trajectories. Then the set of all PPTs ending at position $t$ is defined as $T(t) = \{\tau_i\}_{i=1}^{m}$, with $T_p(t) = \{\tau_i(p)\}_{i=1}^{m}$ being the set of values of its trajectories at position $p$. Naturally, the set $T_t(t)$ plays the same role for $k = h$ that the set of possible predicted values $Z(t)$ plays for $k = 1$; the resulting procedure is given in Algorithm 2.
(Algorithm 2: base prediction algorithm.)
We have employed several different ways of calculating possible prediction trajectories:
(1) [trp] Randomly perturbed trajectories, where a UPV is calculated for each intermediate position using technique 8 ([mfp]) above.
(2) Starting a new trajectory at the center of each cluster of $T_p(t)$, thus making the total number of trajectories increase exponentially with the number of steps (a garden of forking trajectories).
(3) Starting a new trajectory at the center of each cluster of $T_p(t)$ and assigning a weight to each trajectory as a product of the weights of its individual points; trajectories with weights lower than a certain threshold are filtered out at each step of the iteration and do not form new trajectories.
Regardless of the technique used, the UPV is calculated as an average of the last points of all trajectories from the set $T(t)$: $u(t) = \frac{1}{m}\sum_{i=1}^{m} \tau_i(t)$. Based on the results of the research, the first technique proved to be the most efficient one.
3.6. Quality Measures of Identification of Nonpredictable Points
Determining whether a point is predictable or not relies on either its set of possible prediction values or its set of prediction trajectories. For each prediction horizon $h$, we may consider the set of points of the test set that are nonpredictable at the final position: $V_h^{\text{fin}} = \{t : \pi(S(t)) = 0\}$. A set of nonpredictable points at intermediate positions, $V_h^{\text{int}}$, is defined analogously. The full set of all nonpredictable points from the test set is then defined as $V = V_h^{\text{fin}} \cup V_h^{\text{int}}$.
In other words, this is the set of all positions that the algorithm failed to produce a prediction value for. A point becomes nonpredictable if either of the following cases is true:
(1) Its set of possible predicted values is empty.
(2) It is impossible to calculate a unified prediction value based on its set of possible predicted values.
These sets rely heavily on the specific one-step-ahead prediction algorithm used, the value of $h$, and the technique of identifying nonpredictable points. In order to develop evaluation criteria for algorithms identifying nonpredictable points, we need to consider two extreme cases:
(1) Not identifying nonpredictable points at all (the algorithm always uses the closest available motif).
(2) Assuming that the algorithm possesses a priori information about the actual values at intermediate positions. A point is marked as nonpredictable if the difference between the predicted and actual values exceeds a threshold $\varepsilon$. It is worth noting that the algorithm does not replace its predictions with the true values: it simply excludes clearly erroneous predictions from the following prediction operations instead. In our internal team's slang, this algorithm is named "the daemon," as Socrates is known to have had his own daemon, who gave him advice in his most difficult situations.
The remaining points comprise a set of ground-truth nonpredictable points for a given algorithm. For each point that we are predicting, we can consider the prediction error $e(t) = |u(t) - x_t|$, where $u(t)$ is the unified prediction value of the set of possible prediction values $Z(t)$. Consequently, the set of positions that the algorithm failed to make a prediction for is defined as $V^{\text{fail}} = \{t : u(t)\ \text{is undefined}\}$, and the set of nonpredictable points identified with a priori information as $V^{\text{d}} = \{t : e(t) > \varepsilon\}$.
Then, the set of ground-truth nonpredictable points becomes $V^{\text{gt}} = V^{\text{fail}} \cup V^{\text{d}}$.
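A minimal sketch of the "daemon" labelling (the threshold name `eps` is illustrative):

```python
import numpy as np

def daemon_mask(predicted, actual, eps=0.05):
    """Reference detector using a priori knowledge of the true values.

    A point is marked nonpredictable when |prediction - truth| > eps;
    flagged predictions are excluded from subsequent steps, never
    replaced by the true values."""
    return np.abs(np.asarray(predicted) - np.asarray(actual)) > eps
```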
Simulations reveal that the second case is almost completely independent of the threshold value $\varepsilon$. Naturally, "real" algorithms do not possess a priori information about the actual values at intermediate positions, since they deal only with either sets of possible predicted values or sets of predicted trajectories. Nevertheless, this method of identifying nonpredictable points is useful for determining a lower boundary of the prediction error. Methods of identifying nonpredictable points should be developed in such a way as to approximate this boundary as closely as possible. In the team's internal slang, we referred to this as "approximating the daemon."
The graphs in Figure 3 exemplify the dependence of the number of nonpredictable points (Figure 3(a)) and the average error on predictable points (Figure 3(b)) on the prediction horizon h for the two extreme cases. The blue curves correspond to the first extreme case (intermediate nonpredictable points are not identified at all), and the orange ones to the second (intermediate nonpredictable points are identified with a priori information). It can be seen that in the first case the number of nonpredictable points is equal to zero while the prediction error grows exponentially with h. The second extreme algorithm demonstrates the opposite: the number of nonpredictable points grows exponentially with h, while the prediction error remains nearly constant, bounded, and rather small. Clearly, the first algorithm minimizes the functional (3) while neglecting the functional (4); the second one minimizes the functional (4) while neglecting the functional (3).
(Figure 3: (a) the number of nonpredictable points and (b) the average error on predictable points, as functions of the prediction horizon h.)
Figures 4(a) and 4(b) display the true values (blue solid) and intermediate predicted values (red dashed) of the Lorenz series for the first (Figure 4(a)) and second (Figure 4(b)) extreme cases. From the latter subfigure, we notice that the second algorithm does not make a prediction at every point, while making rather accurate predictions where it "decides" to predict. On the other hand, in the first case the predicted trajectory diverges from the true one just after the point (a green disk in Figure 4(a)) that is identified as nonpredictable by the second algorithm. Since the first algorithm must make a prediction at this point (as at any point), this immediately ruins the MSA prediction process, causing the predicted trajectory to diverge from the true one.
(Figure 4: true values (blue solid) and intermediate predicted values (red dashed) for (a) the first and (b) the second extreme case.)
The results of algorithms identifying nonpredictable points fall somewhere between these two extreme cases. In the context of multi-step-ahead prediction, the second case presents more interest, since it allows one to make a prediction up to a fairly large number of steps ahead. From this point on, all algorithms discussed attempt to follow it.
A large-scale simulation suggests that a forecast algorithm will be efficient only if its set of identified nonpredictable points $\widehat{V}$ is sufficiently close to the set of ground-truth nonpredictable points $V^{\text{gt}}$. The difference between the two constitutes an auxiliary quality measure for the techniques of identifying nonpredictable points: the cardinalities of the set differences $V^{\text{gt}} \setminus \widehat{V}$ and $\widehat{V} \setminus V^{\text{gt}}$.
Two opposite scenarios can be observed: either the difference $V^{\text{gt}} \setminus \widehat{V}$ is large whereas $\widehat{V} \setminus V^{\text{gt}}$ is small, or vice versa. The first case corresponds to an overly optimistic algorithm, whereas the second to an overly conservative one.
4. Identification of Nonpredictable Points
Before we begin discussing different ways of identifying nonpredictable points, we need to establish the concept of a "perfectly predictable" point: one whose set of possible prediction values consists of a single cluster containing the vast majority of its possible predicted values. On the other hand, sets of possible prediction values for nonpredictable points tend to either consist of multiple equally sized clusters or be nonclusterable at all. In terms of probability distributions, predictable points correlate with a unimodal symmetric distribution with a relatively small variance, whereas nonpredictable ones correlate either with a uniform distribution or with a distribution with several pronounced modes.
In order to characterize nonpredictable points, we employed the following quantities:
(1) Multimodality: the percentage of possible predicted values remote from the main mode
(2) The difference between the lower and upper percentiles
(3) The second, third, and fourth central moments, as well as the kurtosis of the distribution and its entropy
We performed a large-scale simulation on the validation set by calculating the sets of nonpredictable points and ground-truth nonpredictable points for each feature separately. A comparison of these sets makes it possible to estimate the best threshold values for each feature. It revealed that no single feature on its own separates predictable points from nonpredictable ones well enough; therefore, it becomes necessary to take multiple characteristics into account at once.
When clustering sets of possible prediction values, we take into consideration the following:
(1) The total number of clusters
(2) The size of the largest cluster compared to the size of the entire set
(3) The relative difference in the sizes of clusters
We use the following machine learning techniques to separate the aforementioned features into regions corresponding to predictable and nonpredictable points, respectively (a training sketch follows the ensemble list below):
(1) [lr] Logistic regression
(2) [svm] Support vector machine
(3) [dt] Decision tree
(4) [knn] K-nearest neighbors classifier
(5) [mlp] Multilayer perceptron with a single hidden layer of 8 neurons
In order to test this approach, a new validation set independent of both the training and the test sets needs to be introduced. We label it using the set of ground-truth nonpredictable points. Since the number of nonpredictable points exceeds the number of predictable ones for larger prediction horizons, the sample is balanced by the SMOTE (synthetic minority oversampling technique) method [36]. This method generates new elements in the neighbourhood of the minority class.
In addition to the above classifying methods, we employ their ensembles:
(1) [adb] AdaBoost: an algorithm training each subsequent classifier with greater emphasis on the elements incorrectly classified by the previous classifiers
(2) [lrs] Stacking: applying several classifiers to produce a feature space for a combined algorithm
(3) [lrv] Voting: assigning a label to each element by the classifiers' votes
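A sketch of the classification step, assuming scikit-learn and the imbalanced-learn package; the feature matrix and labels here are random placeholders standing in for the per-point features and the ground-truth labels:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 6))       # per-point features (spread, modes, clusters, ...)
y = rng.integers(0, 2, 500)    # 1 = nonpredictable (ground-truth labelling)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)   # balance the classes

clf_lr = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)      # [lr]
clf_adb = AdaBoostClassifier(n_estimators=100).fit(X_bal, y_bal)  # [adb]
```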
When analyzing a set of possible prediction trajectories using the second approach, trajectories converging at certain points indicate predictability. Thus, predictable points correspond to the regions of convergence of the possible prediction trajectories.
Several techniques for identifying nonpredictable points were developed (a sketch of the first technique follows the list):
(1) Calculating an average spread over the sets of possible prediction values. A point is classified as nonpredictable if its spread exceeds the average spread times some factor.
(2) Instead of considering the spread itself, one can consider its variation over a number of successive positions. The first point is classified as nonpredictable if the spread increases monotonically. Based on the results of a simulation, the best option here is to consider three successive points.
(3) For the trajectories relying on reduced initial information, we may compare the last value of a trajectory starting at position $t$ with a weighted average of the last values of trajectories starting at earlier positions. A point is classified as nonpredictable if the difference exceeds a threshold. The closer a trajectory starts to position $t$, the greater its weight is.
(4) Another technique designed for trajectories that rely on reduced initial information compares the average modulus of the difference between the last points of trajectories starting at a position $t$ and trajectories starting at other positions with a threshold.
(5) Similar to the previous case, we calculate a weighted average of the differences.
(6) It is also possible to apply all of the listed techniques to the set of final points of the prediction trajectories.
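A sketch of technique 1; the factor value is an assumption:

```python
import numpy as np

def nonpredictable_by_spread(value_sets, factor=2.0):
    """Flag a point when the spread of its set of possible predicted
    values exceeds the average spread times `factor` (empty sets are
    nonpredictable by definition)."""
    spreads = np.array([np.ptp(s) if len(s) else np.inf for s in value_sets])
    threshold = factor * spreads[np.isfinite(spreads)].mean()
    return spreads > threshold          # True = nonpredictable
```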
The tables in the next section include results for the extreme cases:
(i) [fp] (forced prediction): nonpredictable points are not identified at all, and the calculation of predicted values is forced at all intermediate points.
(ii) [ab] (absent): the set of nonpredictable points consists only of points for which the algorithm failed to find motifs, and sets of possible prediction values are not analyzed (the first extreme case).
(iii) [id] (ideal): sets of possible predicted values are constructed using a priori information (the second extreme case).
5. Numerical Results
Methods discussed in the previous sections are applied to the benchmark (Lorenz) and the real-world (electric grid load values) time series. Samples are generated using all combinatorially possible patterns consisting of four elements, with the distances between neighboring positions bounded by $d_{\max}$; thus, the total number of used patterns equals $d_{\max}^4$.
The Lorenz series is calculated by integrating the Lorenz system with the classical parameter values using the fourth-order Runge–Kutta method with a fixed integration step. The result is a typical chaotic series used as a conventional benchmark for testing chaotic time series forecasting methods.
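For reproducibility, a sketch of the integration; the classical parameter values (sigma = 10, rho = 28, beta = 8/3), the step dt = 0.1, and the initial condition are assumptions, since the exact values were not preserved in this text:

```python
import numpy as np

def lorenz_rk4(n_steps, dt=0.1, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz system with a fourth-order Runge-Kutta scheme
    and return the x-coordinate as the benchmark series."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

    s = np.array([1.0, 1.0, 1.0])
    series = np.empty(n_steps)
    for i in range(n_steps):
        k1 = f(s)
        k2 = f(s + dt * k1 / 2)
        k3 = f(s + dt * k2 / 2)
        k4 = f(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        series[i] = s[0]
    return series

series = lorenz_rk4(14000)[3000:]   # discard the transient (see below)
```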
In order to distinguish between chaotic and merely noisy series, we check the positions of the series on the entropy-complexity plane. The entropy and complexity values obtained for the Lorenz series fall into the chaotic region of the plane, thus indicating the series as chaotic. Furthermore, the highest Lyapunov exponent calculated for the Lorenz series using the Eckmann algorithm is strictly positive, which corresponds to the results of Malinetskiy and Potapov [3] and confirms the inherently chaotic nature of the series.
The numerical error of the fourth-order Runge–Kutta method determines the observational error $\delta$. With the maximum admissible prediction error $\Delta$ chosen accordingly, the estimated horizon of predictability $T_{\text{pred}} = \frac{1}{\lambda}\ln(\Delta/\delta)$ amounts to 75 integration steps.
For testing purposes, the first 3000 observations of the series were discarded to ensure that the trajectory moves in the neighbourhood of its respective strange attractor. The sizes of the training and testing sets are 10000 and 1000 observations, respectively.
The real-world series is a series of electric grid load values in Germany from 2014-12-31 to 2016-02-20, measured at 1-hour intervals. The entropy and complexity values of the series again fall into the chaotic region of the plane, and its highest Lyapunov exponent is strictly positive, which indicates the series' chaotic nature. The series was analyzed in the same way as the Lorenz series, with the results presented in Tables 1–4.
A large-scale simulation suggests that classifying as nonpredictable only those points which lack corresponding motifs results in an exponential increase in both the total number of nonpredictable points and the average prediction error. Based on Figure 2, it may be concluded that the choice of a specific technique used to calculate a unified prediction value has little effect on the rate of growth. However, the choice of a specific technique of identifying nonpredictable points seems to have a much larger impact (see Tables 5 and 6).
As a quick reminder, the unified predicted value can be calculated using the techniques labeled [avg], [wavg], [clc], [mf], [mfp], and [trp] above (see Tables 5 and 6). When it comes to identifying nonpredictable points, several methods from various fields of machine learning and mathematics are employed, labeled [lr], [svm], [dt], [knn], [mlp], [adb], [lrs], and [lrv] above.
The results of each combination of techniques are presented in the graphs and tables. The graphs present the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) for each technique alongside those of the two extreme cases (see Section 3.6) for comparison.
Presented below are graphs of two types. The first type displays a predicted trajectory up to a certain prediction horizon alongside the "real" trajectory; the second type shows error measures (namely, the root mean squared error, the mean absolute percentage error, and the number of nonpredictable points) for the different versions of the algorithm against different prediction horizons. Furthermore, each graph of the second type also contains plots of its respective error measure for the aforementioned two extreme cases: using a priori information and not identifying nonpredictable points at all.
A large-scale simulation reveals that in the first extreme case (only points that have no corresponding motifs are considered nonpredictable) both the number of nonpredictable points and errors among the predictable ones grow exponentially with the prediction horizon.
The second extreme case (where the aforementioned criteria are also taken into consideration when determining the predictability of a point) shows the opposite situation. Figure 5 demonstrates MAPE, RMSE, and the number of nonpredictable points as functions of the prediction horizon. The orange line corresponds to the method using a priori information; the gray one to the version of the "rapid growth" algorithm that uses the Wishart clustering algorithm for clustering sets of possible predicted values; the red one to the version that uses DBSCAN for clustering sets of possible predicted values; and the blue one to the version that averages sets of possible predicted values. It should be noted that, when comparing the results with those of others [25] that also demonstrate an ability to predict multiple Lyapunov times ahead, the almost-constant (nonexponential) error behavior intrinsic to some of the algorithm's variants allows for making predictions up to even more Lyapunov times ahead. However, this comes at the cost of an increasing number of nonpredictable points. The work by Sangiorgio et al. [25] eventually demonstrates exponential error growth.

Of all the used machine learning algorithms, logistic regression and the AdaBoost ensemble (Figure 6) stand out. Their error measures remain nearly constant with increasing prediction horizon and are comparable to those produced by the algorithm using a priori information for identifying nonpredictable points. However, they also demonstrate an exponential growth of the number of nonpredictable points with increasing prediction horizon, at a rate larger than that of the second extreme case. The method employing a multilayer perceptron, on the contrary, demonstrates a moderate rate of growth of the number of nonpredictable points along with rapid error growth.

When considering different techniques of identifying nonpredictable points, simulations reveal that the versions of the algorithm based on sets of trajectories perform better than their counterparts based on sets of possible predicted values. It can be seen from the tables that the number of nonpredictable points is close to that of the second extreme case (the one using a priori information), while the average prediction error is constrained between the two extremes. This is typical behavior for all methods of calculating the unified predicted value and identifying nonpredictable points. It should be noted that the main objective is still achieved: instead of increasing exponentially, the average prediction error remains constant and relatively small.
Figure 7 demonstrates an example of a predicted energy load series (see also Tables 3 and 4).

Figure 8 indicates that the algorithm is able to predict values up to (and sometimes exceeding) the estimated horizon of predictability of 75 steps (thus the title of the article).

In order to compare our algorithm with the results of other researchers, we recreated the simulation by Prof. Kurogi and colleagues [17]. Figure 8 displays the Lorenz series (blue), the prediction made by Prof. Kurogi (green), and the values predicted by our algorithm (dashed red line with dots). As can be discerned from the figure, while the trajectory predicted by Prof. Kurogi diverges from the actual one, the algorithm discussed in the present paper correctly identifies nonpredictable points, and the predicted points approximate the true trajectory quite accurately.
In conclusion, it should be noted that the results obtained on the real-world German energy time series seem to confirm that noise (stochasticity and nonstationarity) dramatically affects the forecasting accuracy: the prediction is reliable only for the first day ahead due to observation noise and the daily periodicity that affects many real phenomena [30, 37]. However, the current paper did not specifically concern itself with the algorithm's behavior for chaotic series with a significant noise component. We hope to make this the subject of future research.
6. Conclusion
The current paper introduces several novel strategies for multistep prediction of chaotic time series. Generalized z-vectors, comprised of nonsuccessive time series observations, allow for obtaining a set of possible values for each point for which a prediction is being made. Upon examining such a set, the point can be classified as either predictable (in which case its value is estimated) or nonpredictable.
The concept of nonpredictable points renders multistep prediction a two-objective problem, with both the number of nonpredictable points and the average prediction error for the predictable ones having to be minimized.
The results indicate that while the number of nonpredictable points grows exponentially with the number of steps (up to 80–90% of all points), the average prediction error remains constant. This allows for making predictions up to a considerable number of steps ahead (although skipping some intermediate points), sometimes even exceeding the horizon of predictability. This was tested on the Lorenz series and a real-world time series.
It was concluded that the choice of a technique of identifying nonpredictable points has a much greater impact on the accuracy of the final prediction than the choice of a technique of estimating the actual value of the point itself.
Large-scale simulation reveals that prediction methods based on sets of possible trajectories deliver more accurate results than those based on sets of possible values, allowing for making predictions beyond the horizon of predictability.
Data Availability
All the data used to support the findings of the study are included within the article.
Disclosure
A preprint of this paper is available at the following link: "https://www.researchgate.net/publication/347239842_Prediction_After_a_Horizon_of_Predictability_Non-Predictable_Points_and_Partial_Multi-Step_Prediction_for_Chaotic_Time_Series" [38].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors are deeply indebted to Mr. Joel Cumberland for proofreading and editing. The results of the paper were partially presented at the Chaotic Modeling and Simulation International Web Conference (CMSIM-2020, 22–24 October 2020). We are indebted to the organizers and participants of the conference. The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139. This research was supported through the computational resources of the HPC facilities at HSE University.