Abstract

Data-driven quality monitoring is in high demand in practice since it reduces reliance on manual inspection of product quality. Conventional data-driven quality monitoring is constrained by its offline characteristic, being unable to handle the streaming nature of sensory data and nonstationary environments of machine operations. Recently, there have been pioneering works on online quality monitoring taking advantage of online learning concepts, but they remain far from realizing minimum operator intervention because they call for fully supervised labelling of data samples. This paper proposes Parsimonious Network++ (ParsNet++) as an online semisupervised learning approach capable of handling extreme label scarcity in the quality monitoring task. That is, it copes with a variety of semisupervised learning conditions including random access of ground truth and infinitely delayed access of ground truth. ParsNet++ features a one-pass learning approach to deal with streaming data while characterizing an elastic structure to overcome rapidly changing data distributions. That is, it is capable of initiating its learning structure from scratch in the absence of a predefined network structure, where its hidden nodes can be added and discarded on the fly with respect to drifting data distributions. Furthermore, it is equipped with a feature extraction layer in the form of a 1D convolutional layer extracting natural features of multivariate time-series sensor data and coping well with the many-to-one label relationship, a common problem in practical quality monitoring. Rigorous numerical evaluation has been carried out using an injection molding machine and an industrial transfer molding machine from our own projects. ParsNet++ delivers highly competitive performance even compared to fully supervised competitors.

1. Introduction

1.1. Background

Predictive maintenance has attracted increasing interest from both academia and industry because it offers optimization of a machine's life cycle, accurate planning of machine maintenance, and prevention of unnecessary downtime and product wastage [1]. In the realm of tool condition monitoring, replacing a tool too frequently not only leads to expensive maintenance costs but also interrupts the production cycle. On the other hand, blunt tools incur high energy consumption due to the application of high cutting forces or undermine the surface finish.

Accurate quality monitoring plays a vital role in reducing rejection rates by customers, leading to high customer satisfaction, and in meeting particular standards set by relevant authorities. Common practice for quality monitoring is still multistage visual inspection, deemed overly labour-intensive, error-prone, and slow. Another drawback of manual quality monitoring is found in the issue of consistency. That is, human operators are often biased and affected by uncertain factors such as experience, fatigue, boredom, etc. This rationale triggers an increasing demand for data-driven quality monitoring utilizing artificial intelligence (AI) techniques providing real-time information on product quality [2]. Compared to the traditional first-principles approach, data-driven quality monitoring cuts down the development time significantly. It relies on a dataset collected from sensors or cameras installed at the end of the production line to build a predictive model after being preprocessed via signal processing and feature extraction techniques to produce meaningful features.

1.2. Related Works

In-depth study has been devoted to developing reliable quality monitoring approaches. In [3], the tool condition of the metal-turning process is predicted using neural networks. A fuzzy neural network is utilized to predict the tool wear of the ball-nose end-milling process using vibration data [4]. In [5, 6], a fault detection approach in the rolling mills process is proposed using an all-coverage data-driven approach making it possible to integrate many sensors. The rise of deep learning with its automatic feature engineering step extracting natural features allows simplification of data-driven quality monitoring, enabling the complex feature extraction step to be bypassed. In [7], convolutional neural networks based on ResNet50 are put forward to perform quality monitoring in laser-based manufacturing processes. A stacked sparse autoencoder (SSAE) combined with a genetic algorithm to tune its parameters is proposed to determine laser welding quality [8]. Despite their success in various manufacturing applications, such approaches are offline in nature and fixed once deployed, thus being unable to adapt to rapidly changing conditions of process parameters. Their iterative training process is not memory-efficient and does not keep pace with high-speed manufacturing processes. A complete retraining process from scratch is required to handle process changes.

The online quality classification approach has been advanced in [9] where GEN-SMART-EFS is combined with the incremental partial least squares (iPLS) method for feature selection to monitor the quality of microfluidic chips. An extension of this work is presented in [10] where a forgetting strategy is implemented to handle the concept drift and a multiobjective evolutionary computation approach is used for process optimization. Another approach for prediction of tool wear in the metal-turning process is proposed in [11]. It is based on the Parsimonious Ensemble+ (pENsemble+) algorithm making use of the online active learning approach to handle the issue of label scarcity.

1.3. Our Approach

The area of online quality classification still deserves investigation because existing methods are far from being truly autonomous. They are mostly developed from the fully supervised learning principle necessitating considerable labelling efforts in streaming environments. They suffer from substantial operator dependence to fully annotate data samples for model updates, notably in high-speed production processes. In [12], a semisupervised deep learning approach is proposed for quality monitoring tasks using the stacked autoencoder approach. However, this approach is not designed for streaming environments. Another approach is proposed in [13] for online semisupervised quality monitoring using the notion of weighted principal component regression. This approach is, however, a non-deep-learning approach. Another open issue lies in the feature extraction step often being application-specific [14] and calling for intensive offline phases. Notwithstanding the fact that deep learning solutions have started to attract research interest where the concept of deep features is utilized to bypass the complex feature engineering step, they are built upon an offline training process, thus becoming outdated quickly under the nonstationary traits of manufacturing processes. Furthermore, they are developed under a fully supervised working principle incurring considerable labelling cost. Another issue lies in the existence of the many-to-one label relationship [15] where a batch of data is associated with a single and constant class label. This problem might lead to overfitting of a particular class or a loss of granularity if a batch of data is combined into a single instance. This problem is frequently found in the condition monitoring problem, in which a quality check is only performed after the whole lot is produced. In summary, there exists a strong demand for an online semisupervised deep learning algorithm for quality monitoring. Such an algorithm should be capable of learning from streaming data without retraining from scratch while bypassing a complex feature engineering phase. That is, a new concept arising due to changing environments can be quickly handled without compromising complexity while natural features are extracted on the fly.

An online semisupervised deep neural network, namely, Parsimonious Network++ (ParsNet++), is proposed to undertake real-time learning under scarcity of labelled samples for online quality monitoring in the injection molding process [16] and in the industrial transfer molding process. ParsNet++ forms a significant extension of a recently developed algorithm for semisupervised learning of high-pace data streams, Parsimonious Network (ParsNet) [17]. ParsNet++ is capable of starting its learning process from scratch with no predefined structure while its hidden nodes are automatically grown and pruned from data streams to overcome the concept drift. It handles partially labelled data streams under two settings: random access of ground truth and infinitely delayed access of ground truth. A key feature is the autoregularization method dealing with the accumulation of mistakes due to noisy pseudolabels.

The underlying innovation of ParsNet++ lies in the feature extraction layer coping with raw samples, where a 1D convolutional layer is integrated to deal with multivariate time-series data collected from sensors and the many-to-one label relationship. This property enables a complex feature engineering step to be skipped because of its aptitude in extracting natural features. The feature extraction layer is structured as stacked convolutional layers generating deep features to be fed to the fully connected layer. Furthermore, the fully connected layer is structured as a self-evolving single-hidden-layer neural network to handle process changes.

The structural learning mechanism of ParsNet++ is driven by the network significance (NS) method derived from the bias-variance decomposition. It differs from the original NS method in [18] in the presence of the autonomous clustering mechanism (ACM) estimating the probability density function. ACM addresses the obsolete probability density function when a concept drift occurs while also relaxing the strict normal distribution assumption which rarely holds in real-world cases. Unlike conventional clustering techniques, ACM features a self-evolving property making automatic generation and pruning of clusters possible. ACM distinguishes itself from the AGMM of the original ParsNet, which is often unstable in high-input-dimension cases.

The parameter learning phase is carried out under a joint optimization problem minimizing both the reconstruction loss and the discriminative loss coupled with an autoregularization mechanism. That is, the regularization process is derived from the concept of synaptic intelligence (SI) proposed to prevent the catastrophic forgetting problem [19]. It calculates the parameter importance using the accumulated gradient of network synapses. This technique is generalized here where it is used to memorize optimal network parameters induced by clean labels. The label enrichment method is carried out via the label augmentation mechanism where originally labelled samples are perturbed by injecting controlled noise while leaving their labels unchanged. By extension, the self-labelling mechanism is carried out to generate pseudolabels of unlabelled samples. A pseudolabel is inferred from the predictive outputs of ACM and the network itself if both of them are confident in their own predictions.

Autonomous quality monitoring with weak supervision is formalised under two settings: random access of ground truth and infinitely delayed access of ground truth. The former case portrays partially labelled data streams where only a fraction of data samples possess a true class label. The latter case goes one step further where labelled samples are served only during the warm-up phase, leaving the rest unlabelled. Furthermore, the quality monitoring problem consists of two scenarios, current batch prediction and one-step-ahead prediction. The current batch prediction is meant to predict the current product quality whereas the second aims to forecast the product quality for the next data stream. All of these are carried out in the prequential test-then-train protocol, the standard simulation protocol of data streams, and simulated using real-world use cases of an injection molding machine and an industrial transfer molding machine from our own projects. Our rigorous numerical study demonstrates the success of ParsNet++ for online quality classification under weak supervision where it delivers the most encouraging results even compared to fully supervised competitors.

In summary, this paper delivers four major contributions discussed in the sequel:

(1) This paper presents ParsNet++ to handle online quality classification of the injection molding process and the industrial transfer molding process under semisupervised environments. That is, the semisupervised environments are induced by both random access of ground truth and infinitely delayed access of ground truth.

(2) This paper offers an extension of ParsNet [17] where a 1D convolutional layer is introduced to address the issue of feature extraction and the many-to-one label relationship problem.

(3) The autonomous clustering mechanism (ACM) is developed as a flexible density estimation approach navigating the structural learning phase. ACM replaces the role of AGMM in the original ParsNet [17], which suffers from execution issues in high-dimensional problems. Furthermore, ACM incurs fewer parameters than AGMM.

(4) The codes of ParsNet++, raw numerical results, and the injection molding dataset are made publicly available at https://github.com/ContinualAL/ParsNetPlus to enable further study of the proposed research topic.

The remainder of this paper is structured as follows: Section 2 discusses the problem formulation; Section 3 outlines the learning policy of ParsNet++; Section 4 elaborates the injection molding machine; our numerical study is explained in Section 5; and some concluding remarks are drawn in Section 6.

2. Problem Definition

Learning from data streams is defined as a learning problem over never-ending data batches $B_1, B_2, \ldots, B_K$ where $K$ is the number of data streams and is unknown in practice. This property demands the one-scan learning scheme where a data stream is discarded once learned to keep the computational and memory complexities at a low level. A data stream comprises data samples having no labels, where $B_k = \{x_n\}_{n=1}^{N} \in \Re^{N \times u}$ denotes the input data batch while $x_n \in \Re^{u}$ denotes an input vector; $N$ and $u$ are, respectively, the batch size and the input dimension. In the realm of the fully supervised learning setting, the ground truth $C_k \in \Re^{N \times m}$, where $m$ is the output dimension, can be instantly elicited. This assumption is unrealistic notably in the context of quality classification: some delay is expected because the product quality is examined through visual inspection. The semisupervised data stream is formalised here under two settings; a minimal sketch of both labelling regimes is given below.

Random access to ground truth: this case delineates a situation where the operator labels data samples sporadically, leading to partially labelled data streams. That is, a true class label arrives in a random fashion. In other words, $B_k$ is only partially labelled with the target label.

Infinitely delayed access to ground truth: the second case is more stringent than the first where access to true class labels is only given for prerecorded samples fed in the warm-up period before the process runs, leaving the rest unlabelled. In other words, only initial labels are provided. Specifically, only the first data batch is labelled, without changing the data order.
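To make the two labelling regimes concrete, the snippet below shows one way to mask ground truth in a stream; the function name, the 50% default, and the warm-up size are illustrative choices, not part of the paper.

import numpy as np

rng = np.random.default_rng(0)

def mask_labels(y, setting="sporadic", label_fraction=0.5, warmup=100):
    """Return a boolean mask of which samples keep their ground truth.

    'sporadic' : each sample is labelled with probability `label_fraction`
                 (random access of ground truth).
    'infinite' : only the first `warmup` samples (the warm-up batch) are
                 labelled; the rest stay unlabelled forever.
    """
    n = len(y)
    if setting == "sporadic":
        return rng.random(n) < label_fraction
    elif setting == "infinite":
        mask = np.zeros(n, dtype=bool)
        mask[:warmup] = True
        return mask
    raise ValueError(setting)

# Usage: labelled = mask_labels(y, "sporadic", 0.5)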

As with conventional data streams, semisupervised data streams do not follow static and predictable data distributions; they contain concept drifts. That is, the data distributions change, resulting in a change of the joint probability distribution $P(x, y)$. This requires a model which can adapt to the concept drifts with or without the presence of true class labels. That is, a model should be capable of adapting to the concept drift even if the true class label is absent. The concept drift is induced in our experiment with the injection molding machine by varying its holding pressure and injection speed. The online quality classification problem is presented as a multiclass classification problem with three classes, namely, good, weaving, and short-forming, comprising 1008, 1074, and 870 data samples, respectively. This problem is guided by 48 input attributes recording different machine parameters.

3. Learning Policy of ParsNet++

An overview of ParsNet++'s learning policy is depicted in Algorithm 1. It starts from the learning process of ACM estimating the complex probability density function and determining the addition factor of hidden nodes, i.e., the number of clusters $Cl$. Note that ParsNet++ directly injects hidden nodes if the hidden node growing condition is satisfied. Furthermore, ACM itself is flexible to changing learning environments since it features an elastic structure making it possible for clusters to be added or pruned on the fly.

The probability density function produced by ACM is fed to the structural learning phase of ParsNet++ where the generative learning phase is carried out first to condition the network structure in the absence of true class labels. The structural learning phase involves the hidden node growing and pruning processes adapting to the virtual drift problem. That is, the structural evolution is navigated by the reconstruction error. The parameter learning phase is devised to minimize the reconstruction loss and to create an ideal discriminative representation of unlabelled samples. The network parameters are further evolved in the discriminative phase with access to true class labels once the generative phase is complete. In other words, the generative and discriminative training phases occur in a fully coupled fashion. The label enrichment mechanism is carried out afterward by executing the augmentation of labelled samples and the generation of pseudolabels. Both pseudolabels and augmented labels are learned in the discriminative fashion minimizing the predictive loss and are carried out along with the dynamic regularization method.

Network parameters are shared during the generative and discriminative learning phases forming a closed-loop configuration. That is, the network parameters of the generative learning phase are passed to the discriminative learning phase while the network parameters of the discriminative learning phase are fed back to the generative learning phase to cope with the upcoming data stream; in other words, the discriminative phase functions to refine the generative learning phase using the ground truth information. In addition to the generative phase, the structural learning phase takes place in the discriminative phase to overcome the real concept drift and utilizes the same probability density function as the generative training phase. Table 1 provides a list of notations used in the paper.

Input: partially labelled data batches $B_1, \ldots, B_K$
for data batch $k = 1, \ldots, K$ do
  Testing and update performance metrics
  if k < S then {S: initialization batch number}
   for epochs = 1:E do
    Update ACM
    Update network parameters minimizing $L_1$ via SGD {gen: generative phase}
    for all samples in the batch do
     Structural evolution
     Update network parameters minimizing $L_1$ via SGD
     Calculate $\Omega$ {$\Omega$ in (1)}
    end for
    Update network parameters minimizing $L_2$ via SGD
    for all labelled samples do {dis: discriminative phase}
     Structural evolution
     Update network parameters minimizing $L_2$ via SGD
     Calculate $\Omega$ {$\Omega$ in (1)}
    end for
   end for
  else
   Update ACM
   if exists unlabelled data then
    Generate pseudolabel via (2)
   end if
   Generate augmented labelled samples
   Update network parameters minimizing $L_1$ via SGD {gen: generative phase}
   Calculate $\Omega$ {$\Omega$ in (1)}
   for all samples in the batch do
    Structural evolution
    Update network parameters minimizing $L_1$ via SGD
   end for
   Update network parameters minimizing $L_2 + L_3 + L_4$ via SGD {dis: discriminative phase}
   for all labelled samples do
    Structural evolution
    Update network parameters minimizing $L_2 + L_3 + L_4$ via SGD
    Calculate $\Omega$ {$\Omega$ in (1)}
    Update net with R in (1) {autoregularization}
   end for
  end if
end for
3.1. Parameter Learning of ParsNet++

The parameter learning method of ParsNet++ is governed by the following loss function:

\[ \min_{\theta} L = L_1 + L_2 + L_3 + L_4 + R, \qquad R = \lambda \sum_{i} \Omega_i (\theta_i^{*} - \theta_i)^2, \tag{1} \]

where $L_1$ stands for the reconstruction loss solved in the generative phase via the convolutional denoising autoencoder (CDAE), $L_2$ denotes the predictive loss of the originally labelled samples having a much lower quantity than the batch size, and $L_3$ and $L_4$ label the predictive losses of the augmented label and the pseudolabel, respectively. The last term $R$ is the autoregularization term. The pseudolabel is induced by the self-labelling mechanism applied to unlabelled samples while the augmented label is produced by injecting small perturbations into originally labelled samples without changing their labels. Nonetheless, the self-labelling mechanism does not reflect the ground truth and possibly delivers noisy labels compromising the model's generalization. The autoregularization plays a role in avoiding this situation by preventing the important parameters from moving away from their optimal values learned from the originally labelled samples. That is, $\theta$ and $\theta^{*}$, respectively, denote the current network parameters and the optimal parameters induced by the ground truth while $\Omega$ is the indicator of parameter importance. The original label, augmented label, and pseudolabel are mixed here to enable the autoregularization to be executed seamlessly [17]. Furthermore, the structural learning phase takes place in $L_1$ and $L_2$ only because the augmented label does not reflect the true data distribution, undermining the drift adaptation mechanism, and the pseudolabel risks noisy labels misleading the estimation of bias and variance. Equation (1) is formed as an unconstrained optimization problem allowing an alternate optimization strategy via the stochastic gradient descent (SGD) method. Notwithstanding the fact that pseudolabels might be noisy, the pseudolabel generation mechanism still plays an important role in enhancing the model's generalization because it enriches the label representation; i.e., one might consider extreme label scarcity here. Moreover, the autoregularization is implemented to address the issue of noisy pseudolabels. The generative and discriminative phases are carried out alternately here. Note that the infinite delay case only relies on the augmented label and the pseudolabel.
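As a rough illustration of (1), the following PyTorch-style sketch sums the four loss terms and the SI-style penalty; the use of mean-squared error for the reconstruction loss and cross-entropy for the predictive losses is our assumption, and joint_loss and its arguments are hypothetical names.

import torch
import torch.nn.functional as F

def joint_loss(x, x_hat, logits_lab, y_lab, logits_aug, y_aug,
               logits_pseudo, y_pseudo, params, params_star, omega, lam):
    """Sum of the five terms in (1): reconstruction loss, three
    cross-entropy terms, and the SI-style autoregularization penalty."""
    l_rec = F.mse_loss(x_hat, x)                          # L1, generative phase
    l_true = F.cross_entropy(logits_lab, y_lab)           # L2, original labels
    l_aug = F.cross_entropy(logits_aug, y_aug)            # L3, augmented labels
    l_pseudo = F.cross_entropy(logits_pseudo, y_pseudo)   # L4, pseudolabels
    reg = sum((om * (p - ps) ** 2).sum()                  # R in (1)
              for p, ps, om in zip(params, params_star, omega))
    return l_rec + l_true + l_aug + l_pseudo + lam * reg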

3.1.1. Generation of Augmented Label

The issue of label scarcity is addressed by the label enrichment strategy including the generation of augmented labels. It results from the injection of small Gaussian noise into the originally labelled samples without changing their labels, also known as the consistency regularization technique. That is, small random Gaussian noise with zero mean is utilized to produce a corrupted version of the originally labelled samples, i.e., $\hat{x} = x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$ [17]. Since the augmented label is drawn from the true class label, it is not subject to the autoregularization method. Furthermore, only the augmented label and the pseudolabel are exploited in the infinite delay problem whereas the original label is not retained while the process runs. In other words, the original label is accessed in the warm-up phase without being carried to the next data streams.
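A minimal sketch of this consistency-regularization step is given below; the noise level noise_std and the number of copies are illustrative knobs, not values taken from the paper.

import numpy as np

def augment(x_lab, y_lab, noise_std=0.01, copies=1, seed=0):
    """Consistency-regularization style augmentation: perturb labelled
    samples with small zero-mean Gaussian noise, keep labels unchanged."""
    rng = np.random.default_rng(seed)
    x_aug = np.concatenate(
        [x_lab + rng.normal(0.0, noise_std, x_lab.shape) for _ in range(copies)])
    y_aug = np.tile(y_lab, copies)
    return x_aug, y_aug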

3.1.2. Generation of Pseudolabel

The label enrichment mechanism involves the generation of pseudolabels produced by the self-labelling phase of unlabelled samples. The self-labelling mechanism relies on the network prediction as well as the ACM prediction if they return high confidence as follows:

\[ \hat{y}_{pseudo} = \arg\max_{o} \hat{y}_o \quad \text{if} \quad \hat{y} > \delta_1 \wedge \hat{y}_{ACM} > \delta_2 \wedge \arg\max(\hat{y}) = \arg\max(\hat{y}_{ACM}), \tag{2} \]

where $\delta_1, \delta_2$ are two predefined thresholds set to be higher than 0.55. The ACM's output is calculated as per the output posterior probability $P(c_o|x) \propto \sum_{i=1}^{Cl} P(c_o|cl_i) p(x|cl_i) P(cl_i)$ [20] where $P(cl_i) = N_i / \sum_{j=1}^{Cl} N_j$ denotes the prior probability, $P(c_o|cl_i) = N_i^{c_o}/N_i$ stands for the class posterior probability, and $p(x|cl_i)$ labels the likelihood function. $N_i$ stands for the cardinality of the $i$th cluster while $N_i^{c_o}$ denotes the cardinality of the $o$th class in the $i$th cluster. Furthermore, the network and ACM predictions are normalized as $\hat{y} = y_1/(y_1 + y_2)$ where $y_1, y_2$ denote the highest and second highest outputs. This trait underpins the class-invariant trait being similar to the binary classification problem. As a result, $\hat{y} \approx 0.5$ indicates a low confidence level and a confused prediction. This condition implies that the predicted output falls adjacent to the decision boundary. The pseudolabel is propagated to the model's update only if the predictive outputs of ACM and the network are agreeable. Although the pseudolabel generation mechanism risks noisy pseudolabels, it is still integrated in the ParsNet++ learning mechanism because of the existence of the autoregularization making sure only clean pseudolabels are learned. On the other hand, $\delta_1, \delta_2$ control the self-labelling mechanism where higher values decrease the number of pseudolabels whereas lower values increase it.
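The dual-confidence agreement test of (2) can be sketched as follows; pseudolabel is a hypothetical name, its threshold defaults of 0.6 and 0.8 simply reuse the values discussed in the sensitivity analysis, and the inputs are assumed to be per-class probability matrices.

import numpy as np

def pseudolabel(net_probs, acm_probs, delta1=0.6, delta2=0.8):
    """Self-labelling as in (2): accept a pseudolabel only when the
    network and ACM agree and both normalized confidences y1/(y1+y2)
    exceed their thresholds (> 0.55 by construction)."""
    def confidence(p):
        top2 = np.sort(p, axis=1)[:, -2:]   # second highest, highest
        return top2[:, 1] / (top2[:, 1] + top2[:, 0])
    net_cls, acm_cls = net_probs.argmax(1), acm_probs.argmax(1)
    accept = ((confidence(net_probs) > delta1) &
              (confidence(acm_probs) > delta2) &
              (net_cls == acm_cls))
    return net_cls, accept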

3.1.3. Autoregularization Method

The autoregularization is developed to cope with noisy pseudolabels leading to an accumulation of mistakes. It prevents a model from forgetting its optimal condition resulting from learning the original labels. Specifically, it prevents important parameters from moving too far from their previous locations, which would result in performance degradation. This approach was originally proposed in the so-called synaptic intelligence (SI) technique addressing the catastrophic forgetting problem of continual learning [19]. Our main contribution here is to contextualize this approach for the semisupervised learning environment to prevent the catastrophic forgetting problem caused by noisy pseudolabels.

$L_4$ still accepts the pseudolabel by setting $\lambda$, the regularization factor, as a decreasing function of the reconstruction error $L_1$ of the generative phase so that a pseudolabel contributes only if it is clean. That is, a wrong pseudolabel distracts the direction of the network's gradient, resulting in an increase of the reconstruction error. The Z-score is applied to scale the reconstruction error to the range of $[0, 1]$. $\Omega_i$ determines the importance of network parameters derived from the accumulated network gradient as follows:

\[ \Omega_i = \frac{1}{T} \sum_{t=1}^{T} \frac{\omega_i(t)}{(\Delta\theta_i(t))^2 + \xi}, \tag{3} \]

where $\omega_i(t)$ stands for the total parameter movement accumulated during the training process and $\Delta\theta_i(t) = \theta_i(t) - \theta_i(t-1)$ denotes the parameter's movement during two consecutive time steps. $\xi$ is a predefined constant to avoid division by zero. $\Omega$ is updated only when observing the original label and the augmented label because the autoregularization functions to compensate for the noisy pseudolabel. Hence, $T$ denotes the number of original and augmented labels. It is worth mentioning that the higher the network gradient is, the more important the network parameter is. The parameter importance indicator (3) is calculated with respect to the accumulation of network loss and network gradients.
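A compact sketch of this SI-style bookkeeping is shown below; the exact normalization by the number of labelled/augmented updates follows our reading of (3) and should be treated as an assumption.

import numpy as np

class SynapticImportance:
    """Accumulate per-parameter importance as in SI [19]: gradient times
    parameter displacement, normalized by the squared total movement."""
    def __init__(self, theta, xi=1e-3):
        self.omega_acc = np.zeros_like(theta)  # running gradient*step sum
        self.theta_prev = theta.copy()
        self.theta_start = theta.copy()
        self.xi = xi                           # avoids division by zero
        self.T = 0                             # labelled/augmented updates seen

    def step(self, theta, grad):
        # Called only on original/augmented-label updates.
        self.omega_acc += -grad * (theta - self.theta_prev)
        self.theta_prev = theta.copy()
        self.T += 1

    def importance(self, theta):
        total_move = theta - self.theta_start
        return self.omega_acc / (total_move ** 2 + self.xi) / max(self.T, 1)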

3.2. Network Structure of ParsNet++

ParsNet++ is built upon the convolutional denoising autoencoder structure where the feature extraction layer utilizes stacked convolutional layers while the fully connected layer is formed as a single-hidden-layer network having a self-organizing property. It receives raw input features collected from sensors and in turn maps them to the output space. Specifically, the 1D convolutional layer is deployed to process the sensor data. Raw samples are processed by the convolutional layer as follows:

\[ h_j^{(l)} = \sigma\Big(\sum_{i} h_i^{(l-1)} * w_{ij}^{(l)} + b_j^{(l)}\Big), \tag{4} \]

where the convolutional layer is parameterized by the filter $w_{ij}^{(l)}$ denoting the $j$th filter of the $l$th convolutional layer, $h_j^{(l)}$ stands for the $j$th feature map of the $l$th layer produced by that filter, and $*$ denotes the convolution operation. A 1D filter is used here.

After stacking convolutional layers, the output of the last 1D convolutional layer is flattened to produce an input vector $x \in \Re^{u'}$ where $u'$ denotes the number of natural features extracted by the feature extraction part of ParsNet++. It is passed to a single-hidden-layer neural network functioning to classify data samples into target classes. ParsNet++ is underpinned by a closed-loop configuration between the generative and discriminative learning phases where the denoising autoencoder (DAE) [21] is implemented to extract robust input features. The DAE makes use of a noise injection mechanism avoiding the identity mapping issue while functioning as a regularization mechanism. The DAE takes the natural features and maps them into the latent space:

\[ h = r(W\tilde{x} + b), \qquad \hat{x} = r(W'h + b'), \tag{5} \]

where $W$ and $b$ are the connective weights and bias of the encoder while $W'$ and $b'$ are the connective weights and bias of the decoder. $R$ denotes the number of hidden nodes. Note that $W' = W^{T}$ is the inverse mapping of $W$ and is known as the tied-weight constraint. $\tilde{x}$ is a partially destroyed input vector where masking noise is used here. That is, a subset of the input vector is set blank. The ReLU activation function $r(\cdot)$ is used here instead of the sigmoid activation function. The discriminative phase utilizes a softmax function to produce the output class posterior probability:

\[ \hat{y} = \text{softmax}(W_s h + b_s), \tag{6} \]

where $W_s$ and $b_s$ are the connective weights and bias of the softmax layer. ParsNet++ utilizes shared network parameters between the generative and discriminative phases where the encoder parameters $\{W, b\}$ are common to both. Both phases are carried out in the closed-loop fashion where a model is first trained during the generative phase in the absence of ground truth. The discriminative phase further refines it with the presence of class labels.
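The tied-weight forward pass of (5)-(6) can be sketched in a few lines of numpy; the 10% masking rate is an assumption for illustration.

import numpy as np

def relu(z): return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dae_forward(x, W, b, b_dec, Ws, bs, mask_prob=0.1, rng=None):
    """One tied-weight denoising pass (5)-(6): mask a subset of inputs,
    encode with ReLU, decode with the transposed weights, and emit class
    posteriors through the shared-parameter softmax head."""
    rng = rng or np.random.default_rng(0)
    x_tilde = x * (rng.random(x.shape) > mask_prob)  # masking noise
    h = relu(x_tilde @ W + b)                        # encoder
    x_hat = relu(h @ W.T + b_dec)                    # decoder, W' = W^T
    y_hat = softmax(h @ Ws + bs)                     # discriminative head
    return h, x_hat, y_hat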

3.3. Growing and Pruning of Hidden Nodes

ParsNet++'s structural evolution is governed by the network significance (NS) method estimating the network bias and variance in the one-pass learning fashion. $Cl$ new hidden nodes are added if the network experiences a high bias condition whereas the hidden node pruning mechanism is triggered in the case of high variance. $Cl$ stands for the number of clusters generated using the autonomous clustering mechanism. It is worth mentioning that both mechanisms are carried out in the generative and discriminative fashions where the bias and variance are enumerated with respect to the predictive error in the discriminative phase while the reconstruction error is referred to during the generative phase. We only present the structural learning mechanism of the discriminative phase here for the sake of simplicity, but the same steps can be followed for the generative phase. The NS method can be expressed as follows:

\[ NS = \underbrace{(E[\hat{y}] - y)^2}_{Bias^2} + \underbrace{E[\hat{y}^2] - (E[\hat{y}])^2}_{Var}. \tag{7} \]

The key for solving (7) lies in the expected output $E[\hat{y}]$. ACM is applied here to estimate the complex probability function $p(x)$ and results in the following expression:

\[ E[\hat{y}] = \int s(Wx + b)\, p(x)\, dx, \qquad p(x) \approx \sum_{i=1}^{Cl} \alpha_i\, p(x|cl_i), \tag{8} \]

where $\alpha_i$ and $\mu_i$ denote the mixing coefficient and the center of the $i$th cluster, respectively. Equation (8) can be derived independently for each cluster while the overall expected output is enumerated by applying the mixing coefficient taking into account the contribution of each cluster to the overall estimation. This step leads to the following expression:

\[ E[\hat{y}] \approx \sum_{i=1}^{Cl} \alpha_i\, s(W\mu_i + b), \tag{9} \]

where $\sum_{i=1}^{Cl} \alpha_i = 1$ meets the partition of unity property. On the other hand, the term $E[\hat{y}^2]$ is derived under the same procedure, leading to $E[\hat{y}^2] \approx \sum_{i=1}^{Cl} \alpha_i\, s(W\mu_i + b)^2$.
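The mixture approximation (9) is straightforward to compute once ACM supplies the mixing coefficients and centers; the helper below is a minimal sketch with ReLU as the activation.

import numpy as np

def expected_output(alphas, centers, W, b, s=lambda z: np.maximum(z, 0.0)):
    """Mixture approximation (9): evaluate the hidden activation at each
    cluster center and weight it by the mixing coefficient; the same
    recipe with s(.)**2 approximates E[y^2] for the variance term."""
    return sum(a * s(W @ mu + b) for a, mu in zip(alphas, centers))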

The hidden unit growing condition is formulated using the statistical process control (SPC) method [22] as follows:

\[ \mu_{B}^{t} + \sigma_{B}^{t} \ge \mu_{B}^{min} + \pi \sigma_{B}^{min}, \qquad \pi = 1.3\exp(-Bias^2) + 0.7, \tag{10} \]

where $\mu_{B}^{t}, \sigma_{B}^{t}$ are the empirical mean and standard deviation of the network bias while $\mu_{B}^{min}, \sigma_{B}^{min}$ are the minimum values up to the $t$th time instant. $\mu_{B}^{min}, \sigma_{B}^{min}$ are reset once (10) is satisfied while $\mu_{B}^{t}, \sigma_{B}^{t}$ are calculated across all samples because the bias estimation is accurate only when considering all samples. Formula (10) is meant to detect the high bias condition leading to the hidden unit growing condition. Note that the SPC method in essence functions to detect anomalous points or a drifting concept. The original SPC method is, however, modified here to induce a flexible confidence level with the use of $\pi$, being equivalent to a confidence degree between 68.2% and 95.2%. It implies that the hidden unit growing process is carried out readily in the case of high bias whereas it is hindered in the case of low bias.

As with the hidden unit growing mechanism, the hidden unit pruning strategy is undertaken using the SPC method as follows:

\[ \mu_{V}^{t} + \sigma_{V}^{t} \ge \mu_{V}^{min} + 2\pi \sigma_{V}^{min}, \qquad \pi = 1.3\exp(-Var) + 0.7. \tag{11} \]

The key difference lies in the term $2\pi$ directed to avoid a direct-pruning-after-adding situation. This leads to a confidence level between 95.2% and 99.4%. That is, the hidden unit pruning condition is triggered readily in the case of high variance while pruning is prevented in the case of low variance. Once (11) is met, the hidden unit pruning condition is executed as follows:

\[ \text{Prune the } i\text{th hidden node if } E[h_i] \le \mu_{E[h]} - 0.5\sigma_{E[h]}, \tag{12} \]

where $E[h_i]$ denotes the statistical approximation (expected activation) of the $i$th hidden node. Equation (12) enables multiple hidden nodes to be discarded at once and results in rapid complexity reduction.
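The two SPC tests (10)-(11) reduce to simple threshold checks on running statistics; the sketch below assumes the dynamic factor takes the 1.3·exp(·)+0.7 form used in this family of methods.

import numpy as np

def grow_condition(mean_bias, std_bias, min_mean, min_std, bias2):
    """High-bias test (10): grow when the current bias statistics drift
    above the minimum ones by a bias-dependent factor pi."""
    pi = 1.3 * np.exp(-bias2) + 0.7            # confidence ~68.2%-95.2%
    return mean_bias + std_bias >= min_mean + pi * min_std

def prune_condition(mean_var, std_var, min_mean, min_std, var):
    """High-variance test (11): the doubled factor 2*pi discourages
    pruning a node right after it has been added."""
    pi = 1.3 * np.exp(-var) + 0.7
    return mean_var + std_var >= min_mean + 2 * pi * min_std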

3.4. Autonomous Clustering Mechanism

ParsNet++ is guided by the autonomous clustering mechanism (ACM) to generate a complex probability density function used during the hidden node growing and pruning processes. It differs from the original ParsNet [17] where the autonomous Gaussian mixture model (AGMM) is applied. The bottleneck of AGMM lies in high input dimensions where it is often unstable. ACM features an open structure where clusters are added or discarded on the fly to cope with concept drifts, and it is capable of initiating its learning process from scratch. The cluster growing process is governed by the compatibility measure examining the spatial proximity of a data point to existing clusters, i.e., whether it is within a cluster's coverage. The cluster pruning technique makes use of the cluster's utility checking the cluster's activity during its lifespan.

Suppose that $d_n = \min_{i} \|x_n - \mu_i\|$ is the distance between a data sample and its nearest cluster; the compatibility test is formulated:

\[ \mu_{d}^{t} + \sigma_{d}^{t} \ge \mu_{d}^{min} + \pi_d \sigma_{d}^{min}, \tag{13} \]

where $\pi_d = 1.3\exp(-d_n) + 0.7$. $\mu_{d}^{t}, \sigma_{d}^{t}$ stand for the mean and standard deviation of the distance calculation. As with (10) and (11), (13) is formalised by the statistical process control (SPC) method. The use of $\pi_d$ controls the cluster's growing process in such a way that the growing process is performed frequently if a sample is remote from the existing clusters, i.e., high clustering loss $d_n$. This situation portrays a fact where a data sample is uncovered by existing clusters. On the other hand, this condition is difficult to fulfil if a data sample is adjacent to existing clusters, i.e., low clustering loss $d_n$. A new cluster is constructed if (13) is satisfied. That is, the cluster center is set as the sample of interest, $\mu_{Cl+1} = x_n$, with $Cl = Cl + 1$ where $Cl$ is the number of clusters. If (13) is violated, the winning cluster is fine-tuned:

\[ \mu_{win}^{new} = \mu_{win}^{old} + \frac{x_n - \mu_{win}^{old}}{N_{win} + 1}, \qquad N_{win} = N_{win} + 1, \tag{14} \]

where $N_{win}$ denotes the winning cluster's cardinality. Note that the adaptation process is localized only to the winning cluster to avoid the cluster overlapping case and associates the data sample of interest with the winning cluster. That is, the cluster's cardinality is incremented here. Equation (14) ensures the cluster's convergence as a factor of the cluster's cardinality.
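A simplified rendition of the compatibility test (13) and the winner update (14) follows; treating the winning distance as an upper outlier of the distance history is our simplification of the SPC-style rule.

import numpy as np

def acm_step(x, centers, counts, dist_hist, k=2.0):
    """One ACM step: spawn a new cluster when the winning distance is an
    upper outlier of the distance history (a simplified form of (13));
    otherwise fine-tune the winning cluster as in (14). `k` plays the
    role of the dynamic SPC factor."""
    d = np.linalg.norm(centers - x, axis=1)
    win = int(d.argmin())
    dist_hist.append(d[win])
    mu, sd = np.mean(dist_hist), np.std(dist_hist)
    if d[win] > mu + k * sd:                   # sample uncovered: grow
        centers = np.vstack([centers, x])
        counts = np.append(counts, 1)
    else:                                      # tune the winner only
        counts[win] += 1
        centers[win] += (x - centers[win]) / counts[win]
    return centers, counts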

The cluster pruning procedure is implemented to prevent the issue of cluster explosion due to outliers. That is, outliers may be wrongly inserted as clusters by (13). It checks the cluster's significance, i.e., whether it plays a major role during its lifespan. A cluster can be pruned without loss of generalization if it contributes little during its lifespan. The cluster's contribution is examined from the average of the cluster activity as follows:

\[ Act_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \exp\left(-\frac{\|x_t - \mu_i\|^2}{2}\right), \tag{15} \]

where $\exp(-\|x_t - \mu_i\|^2/2)$ measures the spatial proximity of a data sample to the cluster of interest in the latent space while $T_i$ denotes the time period of a cluster since it is added. Furthermore, the unity variance is assumed in calculating the activity, i.e., $\sigma_i = 1$. The cluster pruning mechanism is executed as follows:

\[ \text{Prune the } i\text{th cluster if } Act_i \le \mu_{Act} - 0.5\sigma_{Act}. \tag{16} \]

The cluster pruning mechanism enables more than one cluster to be discarded at once, leading to rapid complexity reduction, and follows the half-sigma rule. Furthermore, the number of clusters is also used as the addition factor in the network growing phase (10) because the clustering mechanism explores the true data distribution. As an implementation note, a monitoring period is applied here. That is, a cluster is not removed during the monitoring period so that it can evolve its shape. On the other hand, the mixing coefficient $\alpha_i$ is formed as the relative cardinality as follows:

\[ \alpha_i = \frac{N_i \exp(-\|x - \mu_i\|^2)}{\sum_{j=1}^{Cl} N_j \exp(-\|x - \mu_j\|^2)}, \tag{17} \]

where it features the partition of unity property and takes into account both the distance information and the cluster support. A cluster should possess high influence in the network bias and variance estimation if it is adjacent to the data sample of interest and has a high population.
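The activity-based pruning (16) and the mixing coefficients (17) can be sketched as below; the optional grace mask standing in for the monitoring period is an illustrative addition.

import numpy as np

def mixing_coefficients(x, centers, counts):
    """Relative-cardinality mixing coefficients (17): weight each cluster
    by its support and its proximity to the current sample; the weights
    sum to one (partition of unity)."""
    w = counts * np.exp(-np.linalg.norm(centers - x, axis=1) ** 2)
    return w / w.sum()

def prune_clusters(activity, centers, counts, grace=None):
    """Half-sigma pruning rule (16): drop clusters whose average lifetime
    activity falls half a standard deviation below the mean; `grace`
    optionally protects newly created clusters (monitoring period)."""
    keep = activity >= activity.mean() - 0.5 * activity.std()
    if grace is not None:
        keep |= grace   # boolean mask of clusters still being monitored
    return centers[keep], counts[keep], activity[keep]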

4. Injection Molding Process

eScentz, as shown in Figure 1, is a scent-emitting USB device made by SIMTech. It is used as the testbed product at the Model Factory@SIMTech. The injection molding process is used to manufacture the black cartridge, the white cartridge holder, and a transparent part which contains the scent in the cartridge. The injection molding machine (Arburg Allrounder 470 A) is shown in Figure 2.

Focus is on the transparent part as it is critical to the functionality of the device; i.e., defects in the part can lead to leaking of the liquid scent. There are a number of different types of possible defects, but the most common ones are flow lines, marks or lines formed when two melt flow fronts meet during the filling of the injection mold, and short shots, where the mold is only partially filled with plastic melt [23]. Examples of a good part and the different types of defects are shown in Figure 3.

5. Numerical Study

This section demonstrates the advantage of ParsNet++ in assessing the quality of the transparent part manufactured by the injection molding machine. ParsNet++ is simulated in two simulation environments: random access of ground truth and infinitely delayed access of ground truth. The former describes a case where each data batch contains partially labelled data points with unknown class distribution while the latter portrays a semisupervised problem where ground truth is accessed only in the initial phase, leaving the rest unlabelled. 50% labelled samples is set as the default setting for the random access of ground truth. The infinitely delayed access of ground truth only utilizes labels of the first data batch. Furthermore, both scenarios are simulated in the prediction of the current batch as well as the future batch. The prediction of the current batch monitors the current quality of transparent parts based on the current sensor data. The prediction of the future batch relies on the current data batch to forecast the product quality of the next batch. The contribution of each learning module is studied in the ablation study section while the effect of label proportions is also elaborated. Our numerical study follows the prequential test-then-train procedure, the standard evaluation protocol of data stream mining; a minimal sketch of this protocol is given below. Moreover, the t-test is put forward to statistically validate the numerical results.
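The prequential protocol itself is a short loop; the model.predict/model.update interface below is hypothetical and merely illustrates the test-then-train ordering.

import numpy as np

def prequential(model, batches):
    """Prequential test-then-train: every incoming batch is first used to
    evaluate the current model, then used to update it."""
    accs = []
    for x, y, labelled in batches:          # labelled: boolean mask
        y_pred = model.predict(x)           # test first ...
        accs.append(np.mean(y_pred == y))
        model.update(x, np.where(labelled, y, -1))  # ... then train; -1 flags unlabelled
    return np.mean(accs)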

5.1. Baselines

The numerical results of ParsNet++ are benchmarked against recently published algorithms in the literature:

(i) Online deep learning (ODL) [24] is an online learning algorithm constructed under the vanilla neural network structure. It makes use of the hedging idea where there exists a direct connection of each hidden layer to the output layer.

(ii) Neural networks with dynamically evolved capacity (NADINE) [18] adopts a flexible network structure under the multilayer perceptron (MLP) architecture. That is, both hidden layers and nodes are dynamically grown and reduced with respect to variations of data streams.

(iii) Parsimonious network (ParsNet) [17] is perceived as the predecessor of ParsNet++. ParsNet++ distinguishes itself from ParsNet with the presence of a feature extraction layer crafted under the convolutional framework; a 1D CNN is integrated to handle raw input features. In addition, ParsNet++ is underpinned by ACM rather than AGMM to perform density estimation on the fly.

(iv) SCARGC [25] is devised for the infinite delay problem and considered a state-of-the-art algorithm in this domain. It utilizes the pool-based principle.

Since these algorithms are not designed to handle visual data of high dimension, their predictions are only guided by the sensory data. The use of image data significantly reduces their performance due to the absence of a feature extraction layer. In addition, comparison is also made against two popular deep learning algorithms, ResNet18 [26] and VGG11 [27], only using the image data, which happen to be RGB images. They do not exploit the sensor data due to the absence of a 1D CNN. All of the algorithms except ParsNet and SCARGC are fully supervised algorithms. The simplest structures of ResNet and VGG are adopted here because the small data size would otherwise lead to overfitting. All algorithms are executed under the same computational platform by using their published codes and run under the same simulation protocol as ParsNet++ to ensure fair comparison. The numerical results are taken from the average of five consecutive runs.

5.2. Network Structure and Hyperparameters

ParsNet++ utilizes a 1D CNN as a feature extractor to predict the part quality where the 1D CNN processes the raw sensory data. Extracted features from the CNN are concatenated into a long vector and fed to the fully connected layer, a single-hidden-layer neural network with the self-evolving property. The 1D CNN is developed from 3 convolutional layers underpinned by 1D filters for the injection molding dataset. For the transfer molding dataset, a two-layer 1D CNN is applied as the feature extractor.
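A sketch of such a stacked 1D convolutional extractor is given below in PyTorch; the channel sizes and kernel width are placeholders since the paper's exact configuration is not reproduced here.

import torch
import torch.nn as nn

class FeatureExtractor1D(nn.Module):
    """Stacked 1D convolutional extractor feeding the self-evolving fully
    connected layer; channel sizes are illustrative placeholders."""
    def __init__(self, in_channels, channels=(16, 32, 64), kernel_size=3):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in channels:
            layers += [nn.Conv1d(c_in, c_out, kernel_size,
                                 padding=kernel_size // 2),
                       nn.ReLU()]
            c_in = c_out
        self.conv = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, channels, time)
        return self.conv(x).flatten(1)       # flatten into natural features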

The hyperparameters of ParsNet++ are fixed throughout our simulation scenarios while the learning rate and momentum coefficient of the stochastic gradient descent (SGD) optimizer are selected as 0.01 and 0.95, respectively. Hyperparameters of other algorithms are chosen as those reported in their original papers. We chose 100 as the batch size for all algorithms. Table 2 reports the hyperparameters of the consolidated algorithms. For the injection molding dataset, the initialization batch number S and the number of epochs E, shown in Algorithm 1, are 5 and 10 in the sporadic access experiment and 1 and 15 in the infinite delay experiment. For the transfer molding dataset, S and E are both 1 for the sporadic access and infinite delay experiments, which also signifies that ParsNet++ runs in a single-pass fashion.

5.3. Numerical Results

Table 3 reports our numerical results for the current batch prediction under the setting of sporadic access of ground truth. It is evident that ParsNet++ outperforms ParsNet with a significant gap. This finding clearly supports the 1D CNN of ParsNet++ automatically extracting deep natural features and the ACM technique for estimation of the probability density function. Moreover, ParsNet++ beats NADINE and ODL, both fully supervised algorithms, with a significant margin. Note that ParsNet, ODL, and NADINE are guided by sensor data as with ParsNet++. ParsNet++ is also compared with ResNet18 and VGG11, popular deep learning approaches making use of image data. Although the two approaches are trained in the offline, fully supervised fashion, ParsNet++ exhibits superior performance. That is, ParsNet++ exceeds VGG11 and ResNet18 with a noticeable difference. This result is confirmed with the statistical test in Table 4 where the performance gap between ParsNet++ and all other algorithms is statistically significant.

Table 5 exhibits our consolidated numerical results for the next batch prediction. The same finding as in the current batch prediction is observed here where ParsNet++ beats ParsNet with a significant performance gap. This facet substantiates the advantage of the feature extraction module of ParsNet++ generating deep natural features while ACM approximates the true probability distribution better than the AGMM of ParsNet. By extension, ParsNet++ outperforms the fully supervised algorithms, NADINE and ODL, working under more favourable conditions than ParsNet++. NADINE, ODL, and ParsNet are akin to ParsNet++ in that raw sensor data are exploited as input features, but they suffer from the absence of a feature extraction layer. Our numerical results are statistically validated with the statistical test in Table 6 where ParsNet++'s performance is statistically better than its competitors.

In the realm of infinitely delayed access of ground truth, ParsNet++ delivers superior performance with a substantial improvement over ParsNet and SCARGC for both the current batch prediction and the next batch prediction, whereas its counterparts deliver markedly lower accuracy. This finding confirms the generalization power of ParsNet++ in dealing with various semisupervised learning situations. These numerical results are presented in Tables 6 and 7. Note that the true class labels are only supplied in the initial batch for the infinite delay case, a more challenging condition than the sporadic access case. This facet is confirmed by the fact that the numerical results of all algorithms worsen. Figure 4 visualizes the predictive quality of ParsNet++ where the precision, recall, and F1 metrics show a similar trend. This observation signifies the fact that ParsNet++ handles all target classes equally well. The detailed numerical results are presented in Table 8. In addition, this figure also depicts the dynamic nature of ParsNet++ in which its hidden nodes are dynamically added and pruned on the fly. It is also observed that ParsNet++ responds in a timely manner to performance decreases as a result of concept drifts. That is, new nodes are injected if the network's performance is compromised in the case of concept drift.

5.4. Ablation Study

The ablation study is carried out to validate the influence of each learning module of ParsNet++. ParsNet++ is configured into three variations: (A) ParsNet++ is set with only the parameter learning scenario using the stochastic gradient descent method in the absence of other learning modules, that is, the label augmentation mechanism, the dynamic regularization mechanism, and the structural learning mechanism are deactivated; (B) ParsNet++ is equipped with the label enrichment mechanism and the dynamic regularization mechanism but without the structural learning method; (C) the structural learning mechanism of ParsNet++ is switched on but without the pseudolabel generation step and the dynamic regularization mechanism. Our numerical results are produced under the future batch prediction, all of which are executed under the sporadic access of ground truth with the default label proportion. Table 9 exhibits our numerical results.

It is observed that the worst-performing result comes from Model (A) where all mechanisms are turned off. The label enrichment mechanism and the dynamic regularization mechanism noticeably enhance the performance, as reported by Model (B). This fact clearly demonstrates the advantage of these learning strategies in coping with the issue of label scarcity. A noticeable performance improvement is attained using the structural learning mechanism, clearly confirming the advantage of a dynamic structure over a static structure as shown by Model (C). This case portrays the importance of the drift handling mechanism when handling the problem of data streams. Note that Model (C) excludes the pseudolabel generation mechanism and the dynamic regularization approach. The numerical result increases further when combining the self-evolving structure, the label enrichment mechanism, and the dynamic regularization mechanism, as exemplified by the full ParsNet++ configuration. This configuration enables the issues of label scarcity and concept drift to be simultaneously overcome.

5.5. Effect of Label Proportions

This section examines the learning performance of ParsNet++ under different label proportions. That is, ParsNet++'s performance is evaluated under seven label proportions ranging from extremely scarce to fully labelled. The simulation protocol follows the sporadic access of ground truth in which two evaluation metrics, accuracy and F1, are applied. Table 10 reports the average numerical results across five independent runs.

Our numerical results show that ParsNet++'s performance is compromised under the smallest label proportion. Increasing the label proportion improves the learning performance, but this trend does not continue beyond a moderate label proportion. The best-performing result is achieved at an intermediate label proportion whereas performance deterioration is observed at the highest label proportions. This finding demonstrates that the increase of labelled samples does not ensure performance improvement. The performance deterioration at high label proportions results from the issue of sample redundancy due to the consistency regularization step in which small perturbations are injected into original samples without changing their labels. The consistency regularization method might lead to overfitting if the proportion of labelled samples is high. That is, it produces indistinguishable samples which slightly affect the model's generalization.

5.6. Industrial Transfer Molding Process

The industrial transfer molding process portrays a process from the semiconductor industry occurring in the encapsulation stage where a batch of integrated circuits (ICs) is packaged in a case to avoid corrosion and physical damage [15]. The quality monitoring step in this phase plays a key role because heavy penalties might result if defective products are sent to the customer. The encapsulation process makes use of an industrial transfer molding machine, very similar to the injection molding machine, where it is used to form the support of electronic components. That is, transfer molding is a process whereby the casting material is forced into the mold [15].

Each production run is undertaken in lot sizes of 1 to 424 strips where each strip comprises a number of products. The product quality is examined only after the complete lot has been finished. The goal of this problem is to provide real-time prediction of the product quality while the lot is still in production. The use of artificial intelligence (AI) is urgently required because it enables redundancy in checking such that the product's integrity is ensured. We collected production data over a period of six months. This problem is formulated as a binary classification problem and suffers from the class imbalance problem where only a small fraction of the data contains defects while the remainder belongs to the normal class. The unique property of this problem lies in the many-to-one label relationship where multiple instances are assigned a single label. That is, the quality of a product is determined not from a single product but from the whole lot. If a lot happens to have over 48 defects, the whole lot is thrown away; this case portrays the defect class.

Our numerical study follows the prequential test-then-train protocol as in the injection molding problem where the one-step-ahead prediction is simulated. That is, a model is used to predict the quality of the next lot based on the current machine parameters and process variables. Both the sporadic access of ground truth and the infinitely delayed access of ground truth are simulated here. Important parameters of the molding process include cavity pressure, ram velocity, ram position, and mold temperatures. This problem is a high-dimensional problem with 608 input features. Tables 11 and 12 report our numerical results for both the sporadic access and the infinite delay scenarios. ParsNet++ is compared with ParsNet and SCARGC. Since this problem suffers from the many-to-one label relationship where many data samples are associated with a single class label, a simple mean operation is executed as the feature extraction strategy in ParsNet and SCARGC. Note that this step is not needed for ParsNet++ because the use of the 1D CNN enables automatic feature engineering where the data points of each lot/strip are scanned using 1D filters.

It is obvious from Table 11 that ParsNet++ and ParsNet exhibit comparable performance in the context of the sporadic access protocol. This finding is explained by the class imbalance where only a small fraction of data samples belong to the positive class. On the contrary, ParsNet++ outperforms both SCARGC and ParsNet in the case of infinite delay as shown in Table 12. This observation substantiates ParsNet++'s generalization power in coping with different scenarios of semisupervised learning. Note that the infinite delay problem is more challenging than the sporadic access problem because true class labels are supplied only in the warm-up phase. This issue leads to performance degradation of ParsNet and SCARGC where the automatic feature engineering step is absent; i.e., features are extracted by applying the mean operation. Nonetheless, we acknowledge that the class imbalance problem still deserves in-depth future study, as seen from the low F1 scores of the consolidated algorithms.

5.7. Sensitivity Analysis

This subsection aims to study the effect of hyperparameters on the performance of ParsNet++. Specifically, the effect of $\delta_1$ and $\delta_2$ is analyzed while excluding other parameters. Other hyperparameters such as the momentum coefficient and the learning rate are set to the same values for all consolidated algorithms. In addition, they are default parameters of the SGD method whose effects are well understood from the literature. $\xi$ is merely a small constant to avoid division by zero. The sensitivity analysis is carried out by varying $\delta_1$ and $\delta_2$. Table 13 reports the numerical results of all combinations. $\delta_1$ and $\delta_2$ are required to be set higher than 0.55; therefore, 0.6 and 0.8 are selected for $\delta_1$ and $\delta_2$, respectively. Note that we do not apply specific hyperparameter selection in our main experiments. That is, only simple hand-tuning is applied to set the parameters. Our sensitivity analysis is undertaken in the case of sporadic access of ground truth under the next batch prediction with the default label proportion.

From Table 13, the variation of $\delta_1$ and $\delta_2$ does not lead to significant performance deterioration. That is, the difference between the worst and best results is marginal. It is worth stressing that $\delta_1$ and $\delta_2$ should be set higher than 0.55 since a normalized confidence around 0.5 reflects a confused prediction. This aspect narrows down the choice of hyperparameters by excluding unreasonable values, within which the performance variation remains small. On the other hand, $\delta_1$ and $\delta_2$ govern the pseudolabel generation where the predictions of the network and the ACM are used to generate the pseudolabel. The lowest setting of $\delta_1$ and $\delta_2$ produces the worst result because it admits too many noisy pseudolabels. The increase of $\delta_2$ improves the prediction because it reduces the prediction uncertainty of ACM. Note that the ACM's prediction relies on the class posterior probability which no longer represents the class distribution in the case of extreme label scarcity.

6. Conclusion

A semisupervised quality classification problem in data stream environments and its deep learning solution termed Parsimonious Network++ (ParsNet++) are presented in this paper. ParsNet++ features an open structure automatically generating and pruning its hidden nodes on the fly, thereby addressing concept drifts of partially labelled data streams. The parameter learning strategy is formulated as a joint optimization problem of the reconstruction loss, the predictive loss of the original label, the predictive loss of the augmented label, and the predictive loss of the pseudolabel. In addition, the regularization strategy is put forward to combat the noisy pseudolabel problem, preventing the important parameters from being perturbed by noisy pseudolabels. ParsNet++ extends ParsNet with the integration of a feature extraction layer enabling an automatic feature engineering mechanism. A 1D CNN is integrated to perform the automatic feature engineering step and to handle the many-to-one label relationship while incorporating ACM as a flexible density estimation approach. Comprehensive experiments with the injection molding machine and the industrial transfer molding machine have been carried out to experimentally validate the advantage of ParsNet++. ParsNet++ is tested in two semisupervised learning scenarios, infinitely delayed access of ground truth and random access of ground truth, with comparisons against prominent algorithms for both current batch quality monitoring and future batch quality monitoring. ParsNet++ outperforms its counterparts with a noticeable margin and delivers comparable accuracy to fully supervised learning algorithms. There are a few important issues unexplored in ParsNet++. The issue of class imbalance still deserves an in-depth study where a specific strategy is required to reduce the false positive rates of ParsNet++'s prediction. This aspect is seen in ParsNet++'s results on the industrial transfer molding problem where the F1 score is rather low. Another uncharted area lies in the issue of transferability to different machines. Its solution would make it possible for a single model to be transferred across different machines of the same type or different types with little capital expenditure.

Data Availability

Codes and data of this paper can be found in https://github.com/ContinualAL/ParsNetPlus.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Weng Weiwei and Mahardhika Pratama contributed equally to this study.

Acknowledgments

This project was financially supported by National Research Foundation, Republic of Singapore, under IAFPP in the AME domain (contract no. A19C1A0018).