Abstract

Cyber-Physical Systems (CPS) in heavy industry combine tightly integrated physical processes, networking, and scientific computing. The CPS monitors and controls the physical production process through advanced real-time networking, in which high-precision feedback loops can be adjusted whenever the grid of cooperating computing and communication components that make up the industrial process requires it. These CPS operate autonomously yet interact with one another and with the external environment, linking the physical and the digital world. The premise is that effective modeling and development of high-reliability CPS are directly related to maximizing the production process, extroversion, and industrial competitiveness. In this paper, considering how important the operational status of CPS is for heavy industry, an innovative autonomous anomaly detection system based on unsupervised disentangled representation learning is presented. It is a temporal disentangled variational autoencoder (TDVA) which, mimicking rapid human intuition and using high- or low-dimensional reasoning, finds and models the useful information independently, regardless of the given problem. Specifically, by taking samples from the representation space of the real data distribution, separating them appropriately, and encoding them as separate disentangled dimensions, it creates new examples that the system has not yet encountered. In this way, first, it utilizes information from potentially inconsistent sources to learn suitable representations that can then be broken down into subspace subcategories for easier and simpler categorization, and second, by utilizing the latent representation of the model, it produces high-precision estimates of how similar or dissimilar the inputs are to each other, thus recognizing system anomalies with great precision and in a fully automated way.

1. Introduction

Heavy industry involves bulky products, complex equipment, and specialized facilities, such as high-tech machine tools and large-scale electromechanical infrastructure, which take part in the synthesis of chaotic processes. With the introduction of the Industrial Internet of Things (IIoT) in Industry 4.0 [1], communication between machines and humans, as well as the analysis of heterogeneous chaotic industrial processes, becomes clearer. Industry 4.0 generally focuses on continuous interconnection services, which allow the continuous and uninterrupted exchange of signals or information between interconnected systems [2]. Through direct Machine-to-Machine (M2M) communication and the integration of intelligent services, these systems are converted into CPS, whose interfaces create a common interoperable level of interaction between the physical and the digital world [3]. CPS, through the IoT and other intermediaries such as interconnected sensors, actuators, and digital-to-analog signal converters, work together to make decentralized optimal decisions while operating autonomously [4].

The security of CPS concerns both the security of the information they manage, for example, whether encryption is applied to the data they exchange, and the security of the functional controls of the CPS themselves. One of the main methods of active safety, related to the checks that can be performed to determine the operational status of the CPS, is anomaly detection [5]. Anomaly detection is the process of finding occurrences or behaviors that do not fit the expected pattern of a given process, whereas an anomaly is an observation that deviates so far from prior observations that it raises suspicion that it was generated by a different mechanism. An additional difficulty in recognizing anomalies is noise in the data; distinguishing between noise and anomalies remains a constant challenge. Abnormalities and behavioral deviations in general rarely appear as an absolute and visible fact [6]. An unintentionally occurring abnormality is usually a hard-to-distinguish, easily overlooked event, while the deliberate induction of abnormalities is a long-term and well-organized deception scheme that creates escalating system malfunctions linked with significant risks such as network attacks, equipment failures, malware, and information spying [7].

Anomaly detection is one of the biggest and most complex challenges in the management of large-scale industrial applications, as detected equipment misbehavior can be due to several related or unrelated factors. The method's success, even when the nature of the problem is new and thus unknown, can be attributed to a strategy of comparing the current situation with a model or, more broadly, a set of specifications that are thought to describe its usual operation [8]. Behavioral analysis of key CPS parameters, such as load per node, the mean number of concurrent services, mean cycle length, and network performance (latency, bandwidth, throughput, packet loss rate, and so on), is widely used to evaluate the results and identify the time span of an anomaly [9]. Other technical or heuristic types of analysis may be used in conjunction with anomaly detection to find patterns that aid in the identification of divergent behavior without raising inaccurate alarms.

On an abstract level, detecting abnormalities with artificial intelligence methods may seem a simple task that can be completed without problems; in practice, it is extremely difficult and arduous. Specifically, identifying anomalies with intelligent algorithms is directly related to the following challenges [1, 10, 11]:
(1) Clear and distinct definition of the limits that separate normal from abnormal operation. In many cases, these limits are not clear and can overlap under certain conditions, while dynamic limits can be observed that shift with other factors related to the system under consideration. In these cases, the difficulty of anomaly recognition increases exponentially, so that normal observations are considered anomalies or vice versa, and many false alarms appear in the system.
(2) Identifying cases where normal limits are exploited for malicious actions, such as fraud, which is a typical example of an anomaly. Attackers often try to adapt their actions to normal behavior, so locating anomalies becomes an extremely complex process.
(3) Alteration of behavior based on local, temporal, or quantitative evaluation criteria. For example, what is considered normal today may not be normal in the future or in another environment, which is another important source of difficulty in anomaly detection. Characteristically, most industrial systems change over time under the influence of various factors, constantly creating new states that readjust their normal operation.
(4) Universal operation across different systems. Anomaly detection approaches developed in one field, in most cases, cannot be used in a similar one, even where identical procedures compose or identify it. Even very small inhomogeneities can create ambiguities that make anomaly detection methods ineffective and essentially useless for reusing or transferring experience from one system to another.
(5) The availability of anomaly training and validation data capable of properly training detection models. In most datasets, anomalous cases are few or completely absent, resulting in severe class imbalance. This is an extremely serious problem for training anomaly detection methods: since one class (usually the normal one) dominates, algorithms end up biased toward it, and abnormalities are recognized as normal operation, with incalculable consequences.
(6) The ability to operate in real time. Identifying anomalies at the industrial level is directly related to the fact that the data exchanged between CPS are collected cumulatively, along a continuous and uninterrupted stream, which means that a successful operational overview of the industrial environment must be supported by intelligent real-time services. Real-time systems, however, assume that the correctness of their operation depends not only on the logical results of the computations they perform but also on the time at which these results are available.
In general, because CPS perform sophisticated activities within specific and strictly defined timeframes, timing is fundamental, as violating time constraints can lead to serious malfunctions with disastrous results depending on the type of application or service offered. Respectively, accuracy in observing the time constraints, which results from special programming of the CPS modules, can maximize the results of the production process.

In this sense, recognizing both the need to use CPS in heavy industry and the vulnerabilities that characterize this chaotic and heterogeneous environment, there is a need to create automated and generally autonomous intelligent systems that can adequately model the problem of anomaly recognition in the industrial environment. One of the most reliable techniques that can be used effectively on large-scale data to model anomalies, even if they are new and therefore unknown, is variational inference [12].

Variational inference is a relatively well-known and widely used modeling technique used to address unsolvable problems that arise in the context of Statistical Inference. In the literature, there are several instances of implementation of variational inference methods related to models like Variational Bayes [13, 14], Expectation-Maximization [15], Maximum A Posteriori Estimation [16, 17], Markov Chain [18, 19], Monte Carlo methods as Gibbs Sampling [20, 21], and variational autoencoders [22].

2. Related Work

Sebestyen and Hangan [5] analyzed several cases and developed many rules to facilitate the selection of the most appropriate anomaly detection solution for a given Cyber-Physical System. They claim that as Cyber-Physical Systems get more complex, manual anomaly detection methods are no longer applicable and that most anomaly detection methods try to leverage certain regularities or correlations that exist between process variables during normal operation. They offered several case studies in which the discriminants varied greatly depending on the domain, the source of the anomaly, and the system's complexity, but in most situations, the anomaly detection technique must be tolerant of certain changes produced by known causes (e.g., noise) or unknown causes (e.g., Gaussian spread of values). They concluded that, in a Cyber-Physical System, numerous anomaly detection points should be distributed across the infrastructure, and a mix of approaches can cope better with the wide range of anomaly origins and kinds.

Goh et al. [6] presented an unsupervised approach to identify cyber-threats in Cyber-Physical Systems. They discussed how they used a Recurrent Neural Network to do unsupervised learning and then used the Cumulative Sum technique to find abnormalities in a water treatment plant model. Their research was conducted using a dataset gathered from a Secure Water Treatment Testbed, and the findings revealed that their method could detect threats with low false-positive rates.

Marino et al. [8] presented a Cyber-Physical System called IREST (ICS Resilient Security Technology). Their approach utilized a machine learning model; it was validated under different cyber-physical scenarios and was developed as a comprehensive approach to finding anomalies by taking into account both cyber and physical disturbances. The studies demonstrate that their sensor can identify both cyber and physical anomalies, with the bonus of using just normal data for training and detecting previously unseen disruptions. For training the cyber and physical machine learning anomaly detection algorithms, IREST employed unsupervised learning. The findings revealed that unsupervised learning performed similarly to supervised techniques, with the combined benefit of not requiring abnormal behavior data for training and being able to discover previously unknown cyber and physical abnormalities.

Luo et al. [23] analyzed the latest Deep Learning-Based Anomaly Detection (DLAD) methods in Cyber-Physical Systems and provided a taxonomy in terms of the types of anomalies, tactics, implementation, and assessment metrics to comprehend the key features of existing techniques. This taxonomy was also used to describe and highlight new features and designs in each CPS domain. They looked into the properties of common neural models, the process of DLAD techniques, and the real-time performance of DL models. Finally, they looked at the flaws of Deep Learning approaches, as well as possible improvements to DLAD methods and future research topics.

Jacobs et al. [4] in their work examined and built models of data flows in communication networks of Cyber-Physical Systems and investigated how network calculus can be utilized to develop those models for CPSs, highlighting anomaly and intrusion detection. This provides a solid platform for researching cyber impacts in CPS by connecting the elements that an IDS may investigate for the detection of cyber intrusions with analytical models of a network. They concentrated on the electric grid and the deployment of a cyber-physical IDS to track changes in both cyber and physical systems. Thus, a rigorous and thorough method to better study and comprehend the grid’s cyber-physical interactions and behavior is obtained by modeling the grid data flows using network calculus.

Li et al. [24] developed a semisupervised variational autoencoder without a classifier that encodes the incoming data into disentangled and noninterpretable representations and then uses the group information to distribute the disentangled representation through an equality constraint. To compensate for the lack of data, they used reinforcement learning to increase the proposed VAE's feature learning ability. Thanks to its encoder and decoder networks, this system can handle both visual and text data. Extensive testing on image and text datasets validated the suggested architecture's utility.

Gregor et al. [25] developed a temporal difference variational autoencoder which learns representations including explicit ideas about states. They outlined the specifications for such a model as well as the conditions that it must meet. This approach generates states from observations by connecting time points separated by random intervals, allowing states to interact directly across larger time spans and explicitly represent the future. It also permits rolling out in state-space and in time steps bigger than the underlying temporal environment or data step size and possibly independent of them.

Posch et al. [12] presented a way of training deep neural networks in a Bayesian manner. The suggested method employed variational inference to express the a posteriori uncertainty of the network parameters per network layer and in relation to the estimated parameter expectation values. In comparison with a non-Bayesian network, this method requires only a few more parameters to be tuned. They used this method to train and test on a dataset, and the test error was cut in half. Furthermore, the trained model provides information on parameter uncertainty in each layer, which may be utilized to compute credible intervals for network predictions and to optimize the network design for a given training data set [26].

3. The Proposed Unsupervised Disentangled Representation Learning System

This paper presents and evaluates an Autonomous Cyber-Physical Anomaly Detection System that uses an unsupervised disentangled representation learning technique. It is a temporal disentangled variational autoencoder (TDVA) that aims to extract a better representation in a smaller space by discovering the distribution of the data through calculation of the Evidence Lower Bound (ELBO) [27].

The choice of the feature space that composes a problem under consideration plays a crucial role in the ability to generalize and make the right decisions. Attributes usually contain a type of information that is expressed through a representation, and solving a problem depends directly on how the information is represented. In particular, low-dimensional spaces usually give a poor representation of the data, so the patterns of different classes may lie quite close to each other. On the other hand, high-dimensional spaces place the patterns quite sparsely, depriving the model of its generalizability. In any case, a good representation is one in which the problem is more easily solved through the transformed data [28].

For example, a good representation usually satisfies a smoothness condition, so that if $f$ is the function to be learned and $x \approx y$ holds, then the corresponding $f(x) \approx f(y)$ also holds. Another element that stands out in a good representation is the existence of many descriptions organized in a hierarchical structure, starting from the most specific and ending with the most general. In other cases, a good representation contains some manifold structure, some natural clustering, or the possibility of sparse descriptions of the problem. In any case, a good representation, whether low- or high-dimensional, reflects the basic characteristics of the problem under consideration. Thus, learning an appropriate representation can reduce the dimensionality of the study space while maintaining the basic relationships between points or groups of points that exist in the original data set, greatly simplifying the process of analysis and categorization.

In general, as with human intuition, the performance of the method depends directly on the representation of the data. For this reason, the proposed system applies data transformation techniques to find optimal representations, so that it is easier and simpler to extract the useful information that identifies the problem. In particular, by using successive adjustments, intermediate representations, and feedback relations, the proposed TDVA optimally captures the mapping of the incoming data to the expected network outputs. Each element of the architecture in question transforms the input representation into either high-level characteristics, which are more generic and less affected by modifications, or low-level features, which assist in classifying the inputs. Intermediate representations are utilized as input to a corresponding level of operation, where they lead to the identification of abnormalities using nonlinear processing units [12].

A crucial innovation of the proposed TDVA is the fully automated utilization and extraction of useful information that can lead to a reliable result, regardless of the given problem. Also, taking samples from the representation space of the real data distribution, transforming them into a real coordinate space, choosing an approximation that is a function of the transformed variables, and separating them as disentangled dimensions give the system experience even with unknown data [24]. It also effectively utilizes information from potentially inconsistent sources, makes accurate estimates of the similarity of the data to be analyzed, effectively recognizes a wide range of anomalies, and can be applied to solve a broad spectrum of problems without having to find a detailed solution for each of them, which makes it computationally tractable.

4. Mathematical Method and Proof

Given a training data set $X = \{x^{(i)}\}_{i=1}^{N}$, in which each $x^{(i)} \in \mathbb{R}^{d}$, which models the problem of anomaly detection in industrial CPS, the aim is to maximize the probability of each training input $x$, according to the following equation [12, 28]:
$$p(x) = \int p(x \mid z)\, p(z)\, dz,$$
where $p(z)$ is a continuous, nondiscrete distribution and every $z \in \mathbb{R}^{J}$. Therefore, for the calculation of the continuous distribution, which takes real values, an integral over the joint distribution is obtained and not a sum.

An autoencoder [14] is a neural network that is trained to copy its input to its output. The network consists of two parts: the encoder $f$, which encodes the input $x$ into a hidden representation $z = f(x)$, and the decoder $g$, which decodes the representation, $\hat{x} = g(z)$. A sample $x \in \mathbb{R}^{d}$ is mapped by the function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{p}$ to a hidden representation $z$. Conversely, the hidden representation $z$ is mapped back to the feature space by $g: \mathbb{R}^{p} \rightarrow \mathbb{R}^{d}$ (usually $p < d$ applies) [22]. An overview of an autoencoder is shown in Figure 1.

The encoder and the decoder are trained at the same time, and their training is no different from the training of a simple neural network, as the same learning algorithms can be applied in the case of autoencoders. In their case, the target of each sample is the sample itself, that is, $t^{(i)} = x^{(i)}$. Although learning the identity function $g(f(x)) = x$ is not of particular interest in itself, by placing constraints on the network, usually related to the network architecture or the weight values and appearing as additional terms in the loss function, special structures of the data can be discovered.

A variational autoencoder (VAE) [22] is a special form of autoencoder that assumes some unknown distribution $p(z)$ on the data. The role of the encoder is to learn how to represent the hidden features of the dataset by storing them in latent variables $z$ of reduced dimension. The decoder, on the other hand, reconstructs artificial data from the latent variables. The artificial data should be similar to the original input data, but not identical, as in that case the process fails. More specifically, for a data set $X = \{x^{(i)}\}_{i=1}^{N}$ consisting of samples from an independent and identical distribution, the generative process rests on the assumption that each $x^{(i)}$ comes from its own separate latent variable $z^{(i)}$, which it does not share with any other sample $x^{(j)}$; that is, there are no global latent variables. Based on the above hypothesis, the proposed VAE aims to determine the unknown distribution. The encoder must first be computed as follows [22, 25, 29]:
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},$$
where the posterior is $p(z \mid x)$, the likelihood is $p(x \mid z)$, the prior is $p(z)$, and the normalizing constant or evidence is $p(x)$. The calculation of the evidence is done by marginalizing over the latent variables as follows:
$$p(x) = \int p(x \mid z)\, p(z)\, dz.$$

However, calculating this integral requires exponential time, because the distribution of the latent variables is continuous, so the term $p(x)$ is a complicated probability function due to the nonlinearity of the latent mappings. The problem of maximizing the term $p(x)$, through Bayes' rule, reduces to [29, 30]
$$p(x) = \frac{p(x \mid z)\, p(z)}{p(z \mid x)}.$$

Since the term $p(x)$ is intractable, the term $p(z \mid x)$ is also intractable through the Bayesian rule, in which case the variational inference method is required to calculate it. Specifically, since the term $p(z \mid x)$ is intractable, a family of distributions $q_\phi(z \mid x)$ is used to approximate the actual posterior distribution $p(z \mid x)$. Using the Kullback–Leibler (KL) divergence, it is possible to measure the discrepancy between the actual distribution of the latent variables $z$ given $x$, $p(z \mid x)$, and the approximate distribution of the latent variables $z$ given $x$, $q_\phi(z \mid x)$. The following equation applies [29–31]:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z \mid x)\big].$$

The KL divergence between the two distributions takes the following form:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big] - \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(z \mid x)\big],$$
where $D_{\mathrm{KL}}(\cdot\,\|\,\cdot)$ denotes the KL divergence between two distributions. Applying Bayes' rule to the second term, the equation becomes
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(x \mid z) - \log p(z)\big] + \log p(x).$$

Rearranging the last equation yields the variational Evidence Lower Bound (ELBO), which is a lower bound on the log-likelihood [28–30]:
$$\log p(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big).$$
The left-hand side contains the term $\log p(x)$ to be maximized, minus an error term. The error term is the KL divergence between $q_\phi(z \mid x)$ and $p(z \mid x)$, which drives the distribution $q_\phi$ to produce latent variables $z$ given the input variables $x$. The aim is to minimize the KL divergence between the two distributions, so the problem comes down to maximizing the ELBO. If the distribution $p(z \mid x)$ is approximated with high accuracy, the error term becomes small. In summary, the ELBO is derived from the following formula [24, 25, 30]:
$$\mathrm{ELBO} = \log p(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big),$$
and since the KL divergence is nonnegative,
$$\log p(x) \geq \mathrm{ELBO}.$$

Also, the ELBO is equal to
$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big).$$

The term $\mathbb{E}_{q_\phi(z \mid x)}[\log p(x \mid z)]$ is the reconstruction cost, and the term $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z))$ is the penalty or regularization term, which ensures that the explanation of the data, $q_\phi(z \mid x)$, does not deviate much from the prior over the latent variables, $p(z)$. The regularization term, or penalty, imposes a cost on the optimization function so as to make the optimal solution unique.
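For concreteness, the following is a minimal PyTorch sketch of how these two terms are typically combined into a single minimization objective (the negative ELBO); the closed-form Gaussian KL used here is derived below, and the function and variable names are illustrative rather than part of the original system.

import torch
import torch.nn.functional as F

def negative_elbo(x, x_hat, mu, logvar, beta=1.0):
    """Return -ELBO = reconstruction cost + beta * KL regularizer.

    x      : original batch, shape (B, D)
    x_hat  : decoder reconstruction, shape (B, D)
    mu     : encoder means, shape (B, J)
    logvar : encoder log-variances, shape (B, J)
    beta   : weight of the regularization (penalty) term
    """
    # Reconstruction cost: -E_q[log p(x|z)] under a Gaussian decoder
    # reduces (up to constants and scale) to the squared reconstruction error.
    recon = F.mse_loss(x_hat, x, reduction="sum")

    # Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)

    return (recon + beta * kl) / x.size(0)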

In conclusion, the family of distributions $q_\phi(z \mid x)$ is used, where $\phi$ are the parameters of the encoder to be determined by a stochastic or minibatch gradient ascent or descent algorithm; in each iteration, the cost function, the ELBO, is calculated, which is a lower bound on the term $\log p(x)$. To maximize the term in question, it is necessary to maximize the lower bound. So, using variational inference, the calculation of the term $p(z \mid x)$ becomes possible [12, 22].

Respectively, to calculate the decoder, it is necessary to calculate the term $p_\theta(x \mid z)$; using the stochastic or minibatch gradient ascent or descent algorithm, the parameters $\theta$ of the decoder must be determined. To optimize the ELBO cost function, the inference model (encoder) and the decoder (generative model) are trained at the same time, optimizing the variational ELBO using gradient backpropagation, so that [24, 32]
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big).$$

The update rules are determined through backpropagation. For the KL divergence between the distribution $q_\phi(z \mid x) = \mathcal{N}\big(z; \mu, \operatorname{diag}(\sigma^{2})\big)$ and the prior distribution $p(z) = \mathcal{N}(z; 0, I)$,
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z)\big].$$

Also,
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \int q_\phi(z \mid x)\,\big(\log q_\phi(z \mid x) - \log p(z)\big)\, dz,$$
where $J$ is the dimension of the latent variables $z$, so that $z \in \mathbb{R}^{J}$. The mean values and the variances are defined as the outputs of the encoder,
$$\mu = \mu_\phi(x) \in \mathbb{R}^{J}, \qquad \sigma^{2} = \sigma_\phi^{2}(x) \in \mathbb{R}^{J},$$
where $J$ is the number of latent variables. Finally, the KL divergence between the $q_\phi(z \mid x)$ and $p(z)$ distributions in the ELBO formula is as follows [29, 30]:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \frac{1}{2}\sum_{j=1}^{J}\big(\mu_j^{2} + \sigma_j^{2} - \log \sigma_j^{2} - 1\big),$$
and since the covariance of the latent variables $z$ is diagonal, this means that there are $J$ univariate Gaussian distributions, so that, per dimension [29, 33, 34],
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_j, \sigma_j^{2})\,\|\,\mathcal{N}(0, 1)\big) = \frac{1}{2}\big(\mu_j^{2} + \sigma_j^{2} - \log \sigma_j^{2} - 1\big).$$
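The closed-form expression can be checked numerically against a Monte Carlo estimate of the KL divergence; the following short sketch does so for an illustrative one-dimensional case (the values of $\mu$ and $\sigma$ are arbitrary examples).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.5          # illustrative variational parameters

# Closed form: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)
kl_closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Monte Carlo estimate: E_q[log q(z) - log p(z)] with z ~ N(mu, sigma^2)
z = rng.normal(mu, sigma, size=200_000)
kl_mc = np.mean(norm.logpdf(z, mu, sigma) - norm.logpdf(z, 0.0, 1.0))

print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")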

It is recalled that the KL divergence term enters the variational ELBO formula with a negative sign, so the aim is to minimize it. Therefore, the stochastic gradient descent algorithm is executed for various samples $x$ from the dataset $D$. So the complete equation to be optimized, for which the derivative must be calculated, is as follows:
$$\mathbb{E}_{x \sim D}\Big[\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\Big].$$

By moving the derivative symbol inside the expectations, only one value of $x$ needs to be sampled and only one value of $z$ from the distribution $q_\phi(z \mid x)$, and thus the derivative of the following equation can be calculated:
$$\nabla\Big[\log p_\theta(x \mid z) - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\Big].$$

Then, taking the mean value of the derivative of this function for arbitrarily many samples $x$ and $z$, the result converges to the derivative of the complete equation to be optimized.

For VAEs to work, the distribution $q_\phi(z \mid x)$ must be driven to generate encodings $z$ for $x$ that the decoder $p_\theta(x \mid z)$ can reliably decode. The forward pass of the network works properly and produces the correct average value if the output is calculated as an average over many samples of $x$ and $z$, as shown above. However, the error must be backpropagated through a layer that samples $z$ from the distribution $q_\phi(z \mid x)$, which is a stochastic operation and has no derivative. The stochastic gradient descent algorithm via backpropagation can handle stochastic inputs but cannot handle stochastic units inside the network. Given the mean value $\mu(x)$ and the standard deviation $\sigma(x)$ of the distribution $q_\phi(z \mid x)$, a sample can be drawn from the normal distribution $\mathcal{N}\big(\mu(x), \sigma^{2}(x)\big)$ by first sampling $\epsilon \sim \mathcal{N}(0, I)$ and then calculating the variable $z = \mu(x) + \sigma(x) \odot \epsilon$, which follows the normal distribution $\mathcal{N}\big(\mu(x), \sigma^{2}(x)\big)$, since every linear transformation of a Gaussian random variable is again Gaussian. The equation for which the derivative must be calculated is then [29, 30, 33]
$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[\log p_\theta\big(x \mid z = \mu(x) + \sigma(x) \odot \epsilon\big)\Big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big).$$

In the above way, it becomes possible to calculate the derivative of the expected value in the ELBO, so that backpropagation can be applied and the objective is computable. So, to maximize the ELBO, the gradient of the ELBO with respect to the variational parameters $\phi$ is required, which is [27, 35]
$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big].$$

However, to shift the gradient inside the expectation, a standard normal random variate $\epsilon \sim \mathcal{N}(0, I)$ must first be drawn and then multiplied by the variational standard deviation $\sigma$ and shifted by the variational mean $\mu$, so that [27, 36]
$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[\nabla_\phi\big(\log p_\theta(x, z) - \log q_\phi(z \mid x)\big)\big|_{z = \mu + \sigma \odot \epsilon}\Big].$$
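A minimal sketch of this reparameterization step in PyTorch is given below (names are illustrative); the sample is written as a deterministic, differentiable function of $\mu$ and $\sigma$ plus exogenous noise, so gradients can flow through it.

import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2 I) via z = mu + sigma * eps, eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * logvar)   # standard deviation
    eps = torch.randn_like(sigma)     # standard normal variate
    return mu + sigma * eps           # differentiable w.r.t. mu and logvar

# Example: gradients propagate through the sampled latent variable.
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                    # populates mu.grad and logvar.grad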

Using a combination of the Autoencoding Variational Bayes and Automatic Differentiation Variational Inference methods, it is possible to calculate the hidden variables, while the proposed system automatically transforms the hidden variables into a real coordinate space, in which it can select an approximation that is a function of the transformed variables and optimize its parameters with stochastic gradient ascent. In this way, the proposed system can be applied to solve a broad space of problems without the need to find a detailed solution for each of them.
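The following sketch illustrates the ADVI-style idea of mapping a constrained latent variable (here, a Dirichlet-distributed vector on the probability simplex) into an unconstrained real coordinate space, where a Gaussian approximation can then be fitted by stochastic gradient ascent; the choice of the stick-breaking bijection and all names are illustrative assumptions, not the paper's exact construction.

import torch
from torch.distributions import Dirichlet
from torch.distributions.transforms import StickBreakingTransform

# A Dirichlet-distributed latent variable lives on the probability simplex.
alpha = torch.tensor([2.0, 3.0, 1.5, 0.5])
theta = Dirichlet(alpha).sample()   # shape (4,), nonnegative, sums to 1

# Map it to unconstrained real coordinate space (R^3) and back.
sb = StickBreakingTransform()
zeta = sb.inv(theta)                # real-valued vector, shape (3,)
theta_back = sb(zeta)               # recovers a point on the simplex

print(theta, zeta, theta_back)
# In the ADVI setting, a Gaussian q(zeta) would then be optimized in this
# unconstrained space with stochastic gradient ascent on the ELBO.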

The transformation aims to draw boundaries in areas of low data density, considering a decision boundary with a maximum separation margin. The loss function is denoted by $L$. Then, by selecting the decision function $f$, the empirical risk can be calculated using the following function [30, 36]:
$$R_{\mathrm{emp}}(f) = \frac{1}{N}\sum_{i=1}^{N} L\big(x_i, f(x_i)\big).$$

With this transformation, a hyperplane is constructed that plays the role of the decision-making surface, so that the margin separating the categories is maximized spatially with respect to the points of each data class. When a new data point appears at the model input, its distances from the already assigned data blocks must be calculated using a partition function.

Specifically, the new data point will be included in the block to which most of the data with the shortest distance from it belong, based on the Minkowski distance, which is calculated from the following equation [37]:
$$d(x, y) = \left(\sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p}\right)^{1/p},$$
where $x_i$ is the $i$th element of $x$ and $y_i$ is the $i$th element of $y$.
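A short sketch of this distance-based assignment is given below, assuming two illustrative blocks of previously assigned latent points and $p = 2$ (both assumptions made only for the example).

import numpy as np

def minkowski(x: np.ndarray, y: np.ndarray, p: float = 2.0) -> float:
    """Minkowski distance between two equal-length vectors."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def assign_block(x_new: np.ndarray, blocks: dict, k: int = 5, p: float = 2.0) -> str:
    """Assign x_new to the block contributing most of its k nearest points."""
    labelled = [(minkowski(x_new, pt, p), name)
                for name, pts in blocks.items() for pt in pts]
    nearest = sorted(labelled, key=lambda d: d[0])[:k]
    names = [name for _, name in nearest]
    return max(set(names), key=names.count)

# Illustrative usage with two blocks of 2-D latent points.
blocks = {"normal": [np.array([0.0, 0.1]), np.array([0.2, -0.1])],
          "anomalous": [np.array([3.0, 3.2]), np.array([2.8, 3.1])]}
print(assign_block(np.array([0.1, 0.0]), blocks, k=3))   # -> "normal"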

The algorithm's implementation in terms of the model's temporal behavior follows the basic premise that newly arrived data matter more for current predictions, but proper categorization also requires past information. The right mix of the two processing stages can reduce mistakes and improve classification accuracy. The temporal memory interfaces are implemented based on the sets $M_t$ (the current prediction), $M_{t-1}$ (the older prediction), and $M$ (the union of both memories), so that [38, 39]
$$M = \operatorname{merg}(M_t, M_{t-1}) = M_t \cup M_{t-1}.$$

Defining an array of random variables $\theta$ of size $m$, where $m$ is the size of the subset of data selected in each iteration, this array corresponds to the class-distribution variable, while each random variable follows a Dirichlet distribution with parameter $\alpha$. Then each random variable of the array is transformed into a real coordinate space, while a second array of random variables $\lambda$ of dimensions $K \times d$ is defined, where $K$ is the number of classes and $d$ the feature dimension. This array corresponds to the feature-distribution variable, while each random variable follows a Dirichlet distribution with parameter $\eta$. Here, too, every random variable in the array is transformed into a real coordinate space. A new observed random variable is then defined based on the logarithmic probability function as follows [23, 29]:
$$\log p(x_b \mid \theta_b, \lambda) = \sum_{x^{(i)} \in x_b} \log \sum_{k=1}^{K} \theta_{b,k}\, p\big(x^{(i)} \mid \lambda_k\big),$$
where $x_b$ represents a case of batch data, $\theta_b$ represents the class distribution in data batch $b$, and $\lambda$ represents the distributions of the features. At this point, the encoder takes as input a data batch $x_b$ and calculates as output a pair of variational parameters $\mu_i$ and $\sigma_i$ for each transformed random variable, that is, parameters of normal distributions in the real coordinate space. By defining the mean-field approximation based on the variational parameters $\mu_i$ and $\sigma_i$ of each random variable and performing Kullback–Leibler divergence inference, the encoder objective is obtained, whose parameters will be optimized [40]:
$$\mathcal{L}(\phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big),$$
where $\phi$ represents the encoder parameters, $x$ represents the data, $\beta$ represents the weight of the normalization term, and $z$ represents the hidden variables. A general description of the proposed model is shown in Figure 2.

An abstract and general description of the algorithmic procedure followed by the proposed TDVA is presented in the following pseudocode as Algorithm 1.

 #Input
  x (batch of CPS sensor data)
 #Encoder
  (μ, σ) = Encoder_φ(x), Z = μ + σ ⊙ ε, ε ∼ N(0, I)
  #where φ are the encoder parameters and Z are the reduced dimension latent variables
 #Code (Latent Representation)
  Optimize the ELBO L(φ; x, β)
  #where φ are the encoder parameters, x the batch data, and β the optimization weight
  Calculate empirical risk R_emp(x)
  #where x is the temporal clustering parameter
  Temporal dependence M = merg(M_t, M_{t−1})
  #where merg is the temporal dependence function
 #Decoder
  x̂ = Decoder_θ(Z)
  #where θ are the decoder parameters and Z are the reduced dimension latent variables
 #Output
  x̂, Z
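For readers who prefer executable pseudocode, the following is a minimal, self-contained PyTorch sketch of the workflow of Algorithm 1: an encoder producing variational parameters, reparameterized sampling, a decoder, and a β-weighted ELBO objective, with per-sample reconstruction error and latent KL terms available as anomaly scores. Layer sizes, the optimizer, and the crude temporal merge of consecutive windows are placeholder assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class TDVASketch(nn.Module):
    """Illustrative variational autoencoder following Algorithm 1."""

    def __init__(self, in_dim: int, latent_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)       # variational mean
        self.logvar = nn.Linear(hidden, latent_dim)   # variational log-variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def loss_and_scores(model, x, beta: float = 1.0):
    x_hat, mu, logvar = model(x)
    recon = ((x_hat - x) ** 2).sum(dim=1)                            # per-sample re-error (ISp score)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1)  # per-sample KL (LSp score)
    loss = (recon + beta * kl).mean()
    return loss, recon.detach(), kl.detach()

def train(model, windows, epochs: int = 10, beta: float = 1.0):
    """Train on consecutive sensor windows; `windows` yields (batch, in_dim) tensors."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev = None
    for _ in range(epochs):
        for x_t in windows:
            # crude temporal merge of the current and the previous window
            x_in = x_t if prev is None else torch.cat([x_t, prev])
            loss, _, _ = loss_and_scores(model, x_in, beta)
            opt.zero_grad()
            loss.backward()
            opt.step()
            prev = x_t
    return model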

In conclusion, the proposed TDVA appropriately models the real data representation space, separating the features that characterize a problem into separate disentangled dimensions, so that the system can learn a complete feature independently of other nodes. Also, this process is completed without the need for prior training of the system and without the need to find an analytical solution, which makes it computationally tractable. By utilizing the latent representation of the model, this methodology creates the conditions for high-accuracy estimates of the similarity between data inputs, thus recognizing the anomalies of the system with great precision and in a fully automated way.

5. Experiment Scenario

The proposed work aims to create a realistic anomaly detection system related to the operation and use of CPS in heavy industry. The Mill Dataset of Kai Goebel (NASA Ames) and Alice Agogino (UC Berkeley) [41] was selected to model the problem. It is one of the most important datasets that very accurately simulates the operation of specialized industrial equipment and has been used in several studies, turning it into a benchmark dataset for new algorithms such as the one introduced here. The input in this set represents experiments from milling operations under various operating conditions and includes information on tool wear in normal cutting, entry cutting, and exit cutting. The sampled data come from three different types of sensors (an acoustic emission sensor, a vibration sensor, and a current sensor), which have been placed in different positions in the existing setup.

Specifically, the simulation scenario concerns the machining of metals by large-scale mechanical equipment, where a high-precision rotary cutter removes material as it moves along a workpiece (Figure 3(a)). The cutter moves forward as it rotates, while the cutting tool inserts a recess into the metal and removes it. Over time, the tool develops wear, specifically flank wear (VB), which is measured and accumulates from cut to cut. The worn part is measured by the vertical distance VB, as shown in Figure 3(b).

In general, the set includes 16 cases with a different number of executed metal-cutting repetitions. Three cutting parameters were used to create the data set, namely, the type of metal (cast iron or steel), the depth of cut (0.75 mm or 1.5 mm), and the feed rate (0.25 mm/rev or 0.5 mm/rev). Each of the 16 cases is a combination of the cutting parameters, which simulate the actual operation of the system; for example, one case describes the steel cutting simulation with a cut depth of 0.75 mm and a feed rate of 0.25 mm/rev.

Many of the cases described in the data set are accompanied by a measurement of flank wear (VB), as the cutting tool may be new, degraded, or worn. The number of runs, taken at irregular intervals, depends on the degree of wear and was determined considering a permissible wear limit. Data were collected through a high-speed data acquisition board with a maximum sampling rate of 250 kHz; each cut had 9,000 sampling points, and the total length of each sampled signal is 36 seconds [42]. A general representation of the signals, as described by the 6 sampling sensors during a cut, is shown in Figure 4.
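For reference, the raw signals might be loaded with SciPy as sketched below; the file and field names are assumptions based on the dataset's public documentation and should be verified against the downloaded mill.mat.

import numpy as np
from scipy.io import loadmat

# Assumed layout: a struct array named "mill" with one record per cut and
# one field per sensor signal; verify these names against the downloaded file.
mat = loadmat("mill.mat", squeeze_me=True)
mill = mat["mill"]

sensor_fields = ["smcAC", "smcDC", "vib_table", "vib_spindle",
                 "AE_table", "AE_spindle"]          # assumed field names

def cut_to_window(record, n_points: int = 9000) -> np.ndarray:
    """Stack the six sensor signals of one cut into a (6, n_points) array."""
    return np.stack([np.asarray(record[f]).ravel()[:n_points]
                     for f in sensor_fields])

windows = np.stack([cut_to_window(rec) for rec in mill])
wear = np.array([rec["VB"] for rec in mill])        # flank wear per cut (may be missing/NaN)
print(windows.shape, wear.shape)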

Signal processing software was used for the processing and sampling of the data, allowing the selected device to perform real-time analysis as well as the acquisition, storage, presentation, and processing of the data in recorded chronological order, so that the sampled signals could later be simulated or reproduced. The logical diagram of the operation of the measurements in the experimental part of the research simulation laboratory is presented in Figure 5.

It should be noted that several sensor signals have been preprocessed and, in most cases, the signal has been amplified to meet the threshold requirements of the acquisition equipment. The dataset is also accompanied by a detailed report on how the experiments were performed and the equipment used [42], and all other technical details about the dataset are freely available on the NASA Prognostics Center of Excellence website [43].

Synthetic data were added to the baseline, describing 30 cases of attacks in which sampling was falsified, sensor values were falsified, and false cutting commands were issued. Their design was based on the idea of crafting a suitable input in a specific way which, while not easily perceived by individual observers, leads the learning algorithm to wrong outputs. In this way, the data set is reinforced with more complex examples of anomalies, which are much closer to the normal operation of the machines, thereby challenging training approaches that are usually constructed for stationary environments in which training and test data are drawn from the same distribution and cannot be easily manipulated. In the above modeling, when the difference between two inputs is minimal, it is assumed that they are comparable. As a result, the metric for comparing the similarity of two inputs is an essential parameter of the problem, and it has an impact on the approximate solutions that are commonly employed.
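As an illustration of the kind of falsification described above, the following sketch injects simple spoofing-style perturbations (offsetting, rescaling, or freezing a short segment of one sensor channel) into clean windows; the perturbation types and magnitudes are hypothetical stand-ins, not the exact definition of the 30 synthetic attack cases.

import numpy as np

def inject_spoofing(window: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of a (channels, samples) window with one falsified segment."""
    w = window.copy()
    ch = rng.integers(w.shape[0])                    # channel to falsify
    start = rng.integers(0, w.shape[1] - 500)
    seg = slice(start, start + 500)
    attack = rng.choice(["offset", "rescale", "freeze"])
    if attack == "offset":
        w[ch, seg] += 3.0 * w[ch].std()              # bias injection
    elif attack == "rescale":
        w[ch, seg] *= 1.5                            # gain tampering
    else:
        w[ch, seg] = w[ch, start]                    # stuck-at / replayed value
    return w

rng = np.random.default_rng(1)
clean = rng.normal(size=(6, 9000))                   # placeholder for a real window
attacked = inject_spoofing(clean, rng)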

Anomaly detection is performed using both the Reconstruction Error (re-error), which is anomaly detection performed in the Input Space (ISp), and the difference in KL divergence between samples, which is anomaly detection performed in the Latent Space (LSp). For ISp, it is important to set an appropriate threshold, above which data generating a re-error are considered abnormal. The safest way to measure the re-error is the Mean Squared Error (MSE), the most basic comparison measure of how well a model approximates the correct control examples, calculated by the following formula [30, 44]:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^{2},$$
where $y_i$ is an observed value and $\hat{y}_i$ is the estimated value over the $n$ predictions. In this case study, the MSE of all six signals is calculated and the average MSE is used for convenience. Respectively, for the detection of anomalies in LSp, the KL divergence is used, which in essence reflects the relative difference in entropy between the data samples. Here, also, a threshold can capture the relative difference that indicates when a data sample is abnormal. Both thresholds were determined experimentally; the best threshold is approximately 166.4290 for ISp and 44.2963 for LSp.
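A minimal sketch of the two detection rules is given below; the per-window scores would come from the trained model, and the thresholds reuse the experimentally determined values quoted above.

import numpy as np

ISP_THRESHOLD = 166.4290     # reconstruction-error (input-space) threshold
LSP_THRESHOLD = 44.2963      # KL-divergence (latent-space) threshold

def mse_per_window(x: np.ndarray, x_hat: np.ndarray) -> np.ndarray:
    """Mean squared reconstruction error averaged over the six signals."""
    return ((x - x_hat) ** 2).mean(axis=(1, 2))      # x, x_hat: (n, 6, samples)

def flag_anomalies(re_error: np.ndarray, kl_score: np.ndarray) -> np.ndarray:
    """A window is flagged if either the ISp or the LSp rule fires."""
    return (re_error > ISP_THRESHOLD) | (kl_score > LSP_THRESHOLD)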

Then, to check the decision boundary, above which all values are considered abnormal (a possibly worn tool) and below which all values are considered normal (a healthy tool), the Receiver Operating Characteristic (ROC) metrics were used, as well as the corresponding Precision-Recall curves. The Area under the ROC Curve (AUC) reflects the true-positive rate versus the false-positive rate, while the Precision-Recall curve is a measure of the accuracy of the model and its convergence ability. The exact evaluation of the results of the proposed model is presented in detail in the diagrams of Figure 6, where, in addition to the AUC and Precision-Recall diagrams, the Threshold-Recall and re-error-Recall diagrams are also shown [30, 44].
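For reference, curves of this kind can be computed from anomaly scores and ground-truth labels with scikit-learn, as in the sketch below (the labels and scores shown are random placeholders).

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

def evaluate_scores(y_true: np.ndarray, scores: np.ndarray) -> dict:
    """Compute ROC-AUC and PR-AUC for anomaly scores (higher = more anomalous)."""
    fpr, tpr, roc_thresholds = roc_curve(y_true, scores)
    precision, recall, pr_thresholds = precision_recall_curve(y_true, scores)
    return {"roc_auc": auc(fpr, tpr),
            "pr_auc": auc(recall, precision),
            "roc": (fpr, tpr, roc_thresholds),
            "pr": (precision, recall, pr_thresholds)}

# Illustrative usage with random placeholder data.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = rng.random(200) + 0.5 * y_true
print(evaluate_scores(y_true, scores)["roc_auc"])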

The exact results achieved by the model, in comparison with corresponding competing autoencoder models, are presented in Table 1.

The illustration in Figure 7 is an effective way of visualizing the decision threshold, where the point at which samples are incorrectly classified becomes clear. What is essentially captured is the point of separation between anomalies and noise.

Also, in the illustrations of Figure 8, we see case 12, which concerns a shallow cut with a cutting depth of 0.75 mm in cast iron and a slow feed rate of 0.25 mm/rev. The KL divergence scores allow an accurate display of how the anomaly detection model behaves over time. The remarkable thing in this case is that there is no significant damage to the cutting mechanism, which does not create irregularities, and the model produces a smooth, clearly defined trend. This case is relatively easy to investigate, and the proposed TDVA achieves very high success rates on it.

The model demonstrates robustness and inherent convergence capabilities even in difficult cases where other anomaly detection models find it difficult to distinguish when a tool has anomalies under certain cutting conditions. A typical example is case 9, the results of which are shown in Figure 9. This is a deep cut with a cutting depth of 1.5 mm in steel and a fast feed rate of 0.5 mm/rev. In this case, the trend increases through the degraded area but decreases immediately when it reaches the red failed area, which creates very serious problems for the other models, as the samples at the end of the run look more like healthy samples.

In general, it should be said that the detection of abnormalities in LSp is superior to the detection of abnormalities in ISp, as the information contained in LSp is more complete and generally more expressive, so the model has more capacity to detect differences between cuts.

In summary, the proposed TDVA model, which, as shown, achieved significantly better results than the comparable ones, manages, through its proposed mode of operation and especially through its temporal component, to perceive certain cutting parameters that prove to be more useful in detecting abnormalities. This feature confirms the generalizability of the model, even in cases where certain cutting parameters have been shown to produce signals with a higher signal-to-noise ratio. The proposed model can and does develop the ability to identify the parameters that carry the most useful information.

The above fact is successfully confirmed even for the additional patterns that were included in the data set. The introduction of cases that are nonlinear combinations of the original set's patterns, which produce corresponding nonlinear combinations of new, unknown patterns, confirms that TDVA can recognize even unknown attacks occurring for the first time.

6. Discussion

Anomaly detection is an approach to industrial infrastructure security focused on data analysis to produce safety precautions. Given that no tool can accurately predict the future, especially when it comes to digital security-related events, intelligent anomaly detection systems prove to be particularly useful and reliable, as they can give a clear picture of the functionality of a system [4]. Thus, it is possible to detect a threat before it affects the general infrastructure, for example, by studying its normal operating limits. This necessity becomes more pronounced when the quantitative and qualitative difference in the possibilities of collecting and processing industrial information from CPS is realized, based on the business standard of Industry 4.0 and the IoT ecosystem. In this environment, the multifunctional use and decentralization of information by the CPS raise serious issues related to the maximization of the production process, extroversion, and industrial competition.

The idea of standardizing the autonomous anomaly detection system based on unsupervised disentangled representation learning was developed based on the application of a single, universal method that will cover all industrial requirements while considering the high importance for heavy industry of continuous monitoring of the operational status of CPS [8]. This technique, which was presented and carefully examined, combines the most up-to-date artificial intelligence technologies to perform specific procedures of completely automated anomaly detection using an adaptive, flexible, and easy-to-use framework.

A very important innovation of the proposed algorithm is that it can learn, without supervision, invariant disentangled features, that is, features that are not affected by small nuisance changes in the input, thus discovering useful information regardless of the given problem. Also, the proposed system, without supervision, splits or separates each feature into narrowly defined variables and encodes them as disentangled features. In this way, a single node, or even a single neuron, can learn a complete feature independently of other nodes.

This process is far superior to learning directly from the data, as real data from realistic real-world scenarios suffer from significant practical problems, the most serious being the presence of noise, which significantly alters the original measurement space. The methodology in question also eliminates corresponding problems related to high dimensionality, which makes raw data prohibitive for use by intelligent systems, as they are characterized by exponential complexity. Accordingly, learning good representations allows a full understanding of the nature of the data, as well as of the process that created them. This feature substantially simplifies intelligent analytic procedures by allowing users to understand how the model makes decisions, what its most essential characteristics are, and how these features interact.

The main advantages of the proposed TDVA concern the management of intractability, as it does not require the calculation of terms of exponential complexity and is therefore a computationally feasible solution. Also, in the optimization process, the parameters are updated using minibatches, which makes this algorithm very efficient compared with corresponding solutions based on sampling loops for each data point separately, such as Monte Carlo techniques. In general, the proposed method is simple to implement, yields nearly perfect results, and belongs to the family of generative modeling approaches.

Respectively, a disadvantage of the proposed methodology concerns the opacity in some areas of class separation, which is an inherent result of maximum likelihood estimation, which minimizes the divergence $D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big)$; this means that the model assigns high probabilities to data belonging to sets of a known distribution, but it can also assign large probabilities to data subsets belonging to latent problem identifiers. In this sense, the procedures for determining the similarity between data may not be fully compatible with each other. In each case, however, as demonstrated experimentally, it is possible to record what the basic components (i.e., the latent variables) of the data of a problem should be, assessing how similar or dissimilar the inputs are to each other. This means that, by receiving information about the similarity or dissimilarity between the input objects, any existing anomalies can be accurately identified, as well as the basic characteristics that identify them, without the need for prior training of the system and without the need to find an analytical solution.

7. Conclusions and Further Work

Summarizing this work, an innovative autonomous anomaly detection system based on a TDVA is proposed, analyzed, and tested. The proposed algorithm, which was tested and proved to be superior to its competitors, creates flexible disentangled representations, properly separating the distributions of the data sets, and thus recognizes, with great accuracy and in a fully automated way, the anomalies that exist in the data sets. The use of the VAE imposes a kind of structure on the Latent Space, ensuring a smooth transition between different pockets of the data space and discovering inherent differences related to anomalies, while allowing multiple notions of similarity or difference to be encoded in a simple and categorical way. This structure is absent in conventional autoencoders, as in unsupervised learning systems in general.

Given that modern industry and in particular CPS are characterized by high heterogeneity, it is important to automate the methods of functional control of these systems. The most effective modeling and development of high-reliability CPS are directly related to the continuous detection of anomalies and the identification of solutions that should be followed in order not to interrupt the industrial process. The implementation and use of the proposed autonomous anomaly detection system based on TDVA is an important effort to ensure the security of the industrial infrastructure [45, 46].

Data Availability

Data are freely available in Prognostics Center of Excellence-Data Repository https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This study was supported by the Project of Construction and Practice of High Level Athletes’ Real Time Monitoring Platform Based on Blockchain Technology (no. 202102310323) and the Project of Construction and Application of Remote Support Platform for Winter Sports Medical and Rehabilitation Based on Blockchain Technology (no. 212102310264).