Abstract

Capsule Networks have shown great promise in image recognition due to their ability to recognize the pose, texture, and deformation of objects and object parts. However, the majority of existing capsule networks are deterministic, with limited ability to express uncertainty. Many of them tend to be overconfident on out-of-distribution data, making them less trustworthy and hence reducing their suitability for practical adoption in safety-critical areas such as health and self-driving cars. In this work, we propose a capsule network based on a variational mixture of Gaussians that trains distributions of network weights, as opposed to a single set of weights, and enables the model to express its predictive uncertainty on out-of-distribution data. Training distributions of weights has the added advantage of avoiding overfitting on smaller datasets, which are common in health and other fields. Although Bayesian neural networks are known to exhibit slow training and convergence, experimental results show that the proposed model can retrieve only relevant features, converge faster, is less computationally complex, can effectively express its predictive uncertainties, and achieves performance comparable to state-of-the-art models. This is an indication that CapsNets can exhibit the transparency, credibility, reliability, and interpretability required for practical adoption.

1. Introduction

Recently, there has been an upsurge in the adoption of Deep Learning (DL) to perform complex tasks such as Visual Question Answering [1] and plant disease detection [2], among others, due to its excellent performance in terms of speed and accuracy compared to humans. Capsule Networks [3, 4], for example, have demonstrated the ability to recognize the pose, texture, and deformation of an object and its parts. They have thus been proposed for use in sensitive areas such as health [5, 6] and agriculture [7, 8], among others. Irrespective of the sensitivity of the application area, capsule networks (just like many other deep learning models) do not incorporate uncertainties in their predictions. The inability to model uncertainties leads to model over/under confidence [9]. We propose a Bayesian Capsule Network (BCN), motivated by [10, 11] and by the fact that the Bayesian framework provides the capability for modeling uncertainties in neural network predictions [12]. Bayesian Neural Networks (BNNs) estimate uncertainties by defining a distribution over the network weight parameters; the posterior weight distribution permits the BNN to capture the prediction uncertainties.

BNNs are known to have a longer convergence time during training [11] since training operates on distribution parameters rather than the single point estimates of deterministic models. However, the choice of appropriate normalization and weight initialization schemes can allow the network to converge faster. Since Bayesian models replace fixed weights with probability distributions, they are capable of training on smaller datasets without overfitting.

This work, therefore, proposes a Variational Mixture of Gaussians-based capsule network (CapsNet) that will contribute to solving problems such as those caused by the lack of large datasets in critical areas (e.g., in health). Additionally, we aim at reducing model complexity, reducing convergence time, and improving accuracy on difficult datasets that are small and imbalanced. These are difficult targets for a Bayesian model, which is typically complex and slow to converge. We also aim to leverage the ability of the BNN to model uncertainties and introduce some form of reliability in the predictions of the model on input images. The motive is to enable such models to gain the confidence of the practitioner for practical adoption in safety-critical areas such as autonomous cars and medicine. The lack of sufficient training data is a major limiting factor to the adoption of deep learning in areas such as health due to concerns related to overfitting. This work, therefore, uses Bayesian NNs to elegantly avoid this problem by training distributions over the weights, as opposed to deterministic models which train on a single set of weights. Specifically, the parameters of the weight distributions are learned by Variational Inference, which minimizes the Kullback–Leibler (KL) divergence between the variational distribution and the true posterior. This method provides a principled framework for the use of model components, leading to better monitoring of model complexity and the avoidance of its associated problems such as overfitting. In addition, regularization is natural to BNNs: the regularization parameters receive consistent treatment in the Bayesian setting, thus eliminating the need for techniques such as cross-validation [13]. Perhaps one of the main benefits of our method to health and other critical sectors is the model’s ability to avoid overconfident predictions in regions of sparse data.

Experimental results show that our proposed Variational Mixture of Gaussians Routing (VMGs-Routing) achieves a significant reduction in model complexity while achieving competitive results compared to the state-of-the-art models. Our routing algorithm improves upon similar existing routing algorithms by training and learning faster to achieve convergence within a few epochs (approximately 100 epochs). This method further reduces the infinite likelihood and zero variance problem inherent in Maximum Likelihood solutions caused by Gaussian clusters that try to take sole possession of data points (also known as polarization in Capsules).

The contributions of this paper can be summarized as follows:
(1) We propose a routing method derived from a variational mixture of Gaussians that relies on the maximization of the evidence lower bound (ELBO) to activate a capsule.
(2) We provide empirical results, comparable with previous state-of-the-art works on Bayesian and deterministic capsules, to demonstrate that our approach does not sacrifice any of the inherent strengths of capsules, such as viewpoint invariance and robustness.
(3) We show that our proposed Bayesian CapsNet is not overconfident and is reliable, as evidenced by the high uncertainty it expresses on out-of-distribution data.
(4) The proposed model is less computationally complex and performs comparably with deep Bayesian CapsNet models from the literature in terms of accuracy, uncertainty estimation, and prediction. Comparatively, our model achieves better speedup during training and testing without performance degradation.
(5) We provide extensive visualizations of layer activation maps and predictive uncertainty plots, among others, in an attempt to increase the interpretability of our model, which is presumed (as a Bayesian model) to be a complex probabilistic ‘black box’ model.

The rest of the paper is organized in the following way: Section 2 presents the related works in the literature followed by Section 3 which discusses the Bayesian methods adopted for this work. Section 4 presents the experiments and experimental results after which the paper is concluded in Section 5.

2. Related Works

Some works in the literature have relied on variational inference to propose capsules that solve varied problems. Smith et al. [14] proposed a probabilistic capsule network (CapsNet) to encode the capsule assumptions and separate the generative and inference parts from each other. They showed that their model can generalize well on out-of-distribution data, but they did not express the uncertainty of their model. Ribeiro et al. [11] proposed a Bayesian CapsNet routing algorithm based on a mixture of transforming Gaussians to address the variance collapse problem and to model the uncertainty of the pose parameters. However, experimental results on the uncertainty of the pose parameters were not provided. In this implementation, a parent capsule j is activated if there is an agreement between the votes of adjacent capsules; the agreement is measured by the entropy of the multivariate Gaussian distribution. A conditional variational CapsNet [15] was proposed to detect classes that are not known during training as a contribution to the open set recognition problem. To this end, the authors adopted the variational autoencoder approach, enabling similar features to assume the shape of a Gaussian, such that each unique feature assumed a different Gaussian. A flow-based model with a long flow structure can approximate the intractable posterior more closely than a simple family of distributions; however, as the data increase in dimensionality, this solution gives rise to large computational complexity and variance. To address this shortcoming, Hua et al. [16] utilized a dynamic routing flow with variational inference to achieve a shorter flow structure and a significant improvement in precision and accuracy. To introduce routing uncertainties in CapsNets, Ribeiro et al. [17] proposed a global view of the local iterative routing between capsules of adjacent layers, enabling them to capture the uncertainty in the assignment of parts to objects. Compared to the two works mentioned earlier, this partial Bayesian CapsNet produced results on out-of-distribution predictive entropies that were consistent with the uncertainties of the model predictions. To avoid the singularity problem caused by maximum likelihood estimation (MLE), a variational routing CapsNet [18] has been proposed to utilize the variational distribution and integrate the prior distribution for automatic determination of the class of data and avoidance of overfitting. A Bayesian capsule encoder [19] was proposed to regulate the standard deviation and mean in latent space; the authors argue that it is a better approach for the retrieval of relevant features and image reconstruction from latent space. To demonstrate that deep variational CapsNets can achieve better performance on image synthesis and analysis, Huang et al. [20] proposed a variational model in which the divergence between a capsule and a given prior distribution defines the presence of different entities in an object.

Traditionally, uncertainty is modeled with probability theory and is becoming increasingly relevant due to the adoption of deep learning (DL) models in practical and safety-critical applications such as medicine and self-driving cars. This type of modeling uses a single probability distribution to capture the required knowledge and struggles to express the two types of uncertainties in a DL model [21]. Aleatoric uncertainty arises from the element of randomness due to the variability of the outcome of events, while epistemic uncertainty measures the modeler’s inability to design the best model for the task at hand. In the literature, Bayesian networks with latent variables have been proposed [22] to measure both the predictive aleatoric and epistemic uncertainties. This approach played a significant role in the interpretability of the model, which, like other neural network models, is perceived to be a “black box.” Given the inherent advantages of CapsNets over other neural networks, our work proposes a variational mixture of Gaussians routing-based capsule network to effectively capture the predictive uncertainty on in-distribution and out-of-distribution data and to improve reliability, interpretability, and model confidence for safety-critical applications.

3. Proposed Methods

In this section, we outline a brief introduction to the concepts of Variational Inference and Gaussian mixture models on which our routing algorithm is based.

3.1. Bayesian Mixture of Gaussians

Suppose $\mathbf{x}$ assumes a Gaussian distribution; a linear combination of such Gaussians forms the basis for the formulation of a mixture of probabilistic (Gaussian) models known as a mixture of Gaussians [10]. This convex combination creates the opportunity to adjust the means, covariances, and mixing coefficients as a basis for approximating any continuous density function to arbitrary accuracy. Considering a superposition of $K$ Gaussian densities taking the form of the joint probability $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})$, the latent variable $\mathbf{z}$ can be marginalized out to give $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})$. Realizing that the mixing coefficient $\pi_k$ ($\mathbf{z}$ is a one-hot vector) is the probability of choosing one cluster out of $K$ clusters, the marginal probability can be rewritten in the form of a Gaussian Mixture Model (GMM), shown in equation (1):

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(\mathbf{x}\mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right). \qquad (1)$$

The Gaussian density $\mathcal{N}(\mathbf{x}\mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ (also called a component) in the above expression has its own mean $\boldsymbol{\mu}_k$ and covariance $\boldsymbol{\Sigma}_k$.

Since routing in capsules operates on the concept of clustering, they can naturally be modeled via a mixture of transforming Gaussians [11].
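As a concrete illustration of equation (1) only (this is not the paper’s code), the short sketch below evaluates a K-component Gaussian mixture density at a point with SciPy; the mixing weights, means, and covariances are toy values.

```python
# Minimal illustration of equation (1): p(x) = sum_k pi_k * N(x | mu_k, Sigma_k).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    """Mixture density at point x; pis sum to 1, one (mu, cov) pair per component."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

# Toy 2-D mixture with K = 2 components (illustrative values only).
pis = [0.3, 0.7]
mus = [np.zeros(2), 3.0 * np.ones(2)]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, covs))
```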

3.2. Variational Bayes

Bayesian algorithms perform inference on unknown random variables by finding a posterior probability density [23]. In situations where the posterior is intractable to compute, approximate inference using Variational Inference (VI) provides a reasonable approximation to the problem, compared to Markov Chain Monte Carlo (MCMC) methods that provide an exact solution but with slow convergence time.

Using the Bayes theorem, the posterior probability density can be computed as follows:

$$p(\mathbf{Z}\mid \mathbf{X}) = \frac{p(\mathbf{X}\mid \mathbf{Z})\,p(\mathbf{Z})}{p(\mathbf{X})},$$

where $p(\mathbf{X}) = \int p(\mathbf{X}\mid \mathbf{Z})\,p(\mathbf{Z})\,d\mathbf{Z}$ is the marginal probability (also called the evidence). This term is intractable, requiring the use of approximate solutions such as VI. VI does this by searching a family of distributions $\mathcal{Q}$ for the distribution $q(\mathbf{Z})$ that is closest to the posterior $p(\mathbf{Z}\mid \mathbf{X})$. The distance between the variational (“nice”) distribution $q(\mathbf{Z})$ and the true posterior $p(\mathbf{Z}\mid \mathbf{X})$ is measured by the Kullback–Leibler (KL) divergence

$$\mathrm{KL}\big(q(\mathbf{Z})\,\|\,p(\mathbf{Z}\mid \mathbf{X})\big) = -\int q(\mathbf{Z})\,\ln \frac{p(\mathbf{Z}\mid \mathbf{X})}{q(\mathbf{Z})}\,d\mathbf{Z}.$$

Therefore, minimization of the KL over $q$ becomes maximization of the Evidence Lower Bound (ELBO)

$$\mathcal{L}(q) = \int q(\mathbf{Z})\,\ln \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})}\,d\mathbf{Z} = \mathbb{E}_{q}\!\left[\ln p(\mathbf{X}, \mathbf{Z})\right] - \mathbb{E}_{q}\!\left[\ln q(\mathbf{Z})\right],$$

which avoids the intractability issues of the true posterior $p(\mathbf{Z}\mid \mathbf{X})$. To maximize the ELBO, the vector of hidden random variables $\mathbf{Z}$ (distributed according to the variational distribution $q(\mathbf{Z})$) is assumed to be made up of independent random variables, allowing its joint distribution to be obtained from the product of their marginal distributions, $q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i)$.

This mean-field (MF) approximation makes it possible to obtain a free-form optimization of the ELBO with respect to all the factor distributions $q_i$ by optimizing each of the factors in turn. When $q(\mathbf{Z})$ is fully described by the MF distribution, every data point described by a variational distribution has its own free parameters. The task is then to find the free parameters that maximize $\mathcal{L}(q)$.
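To make the ELBO concrete, the toy sketch below (ours, not the paper’s code) estimates $\mathcal{L}(q) = \mathbb{E}_q[\ln p(x, z)] - \mathbb{E}_q[\ln q(z)]$ by Monte Carlo for a one-dimensional conjugate Gaussian model; the model and variational parameters are illustrative.

```python
# ELBO by Monte Carlo for z ~ N(0, 1), x | z ~ N(z, 1), with q(z) = N(m, s^2).
import numpy as np
from scipy.stats import norm

def elbo_estimate(x, m, s, n_samples=20_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, size=n_samples)                            # z ~ q(z)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)   # ln p(x, z)
    log_q = norm.logpdf(z, m, s)                                    # ln q(z)
    return np.mean(log_joint - log_q)

x = 1.5
# The exact posterior here is N(x/2, 1/2); the ELBO is largest (equal to the log
# evidence) when q matches it, and drops as the KL gap to the posterior grows.
print(elbo_estimate(x, m=0.75, s=np.sqrt(0.5)))
print(elbo_estimate(x, m=-1.0, s=1.0))
```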

In this study, it is assumed that the data points, which are realizations of the random variables $X_1, \ldots, X_N$, are taken from the $D$-dimensional Euclidean space $\mathbb{R}^D$. Thus, the dataset is a vector with $\mathbb{R}^D$-valued random coordinates that are to be classified into $K$ clusters with random centroids $H_1, \ldots, H_K$ that are multinormally distributed, i.e., $H_k \sim \mathcal{N}(\mu_k, \Sigma_k)$, where, for $k = 1, \ldots, K$, $\mu_k$ is the $1 \times D$ mean vector and $\Sigma_k$ the $D \times D$ covariance matrix. In what follows, $f_k$ will be written for the density of $H_k$. Whenever the random variable $X_n$ is in the $k$th cluster, it assumes the distribution of the centroid of that cluster; thus, each data point $X_n$ is distributed according to $f_k$, for some $k = 1, \ldots, K$. In the sequel we denote by $C_n$ the cluster label of the random variable $X_n$, for $n = 1, \ldots, N$. To each data point $X_n$ corresponds a latent variable $Z_n$, which is a 1-of-$K$ binary vector with $\pi_k$ being the probability that $Z_{nk} = 1$, for $k = 1, \ldots, K$. Therefore, $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$, called the vector of mixing coefficients, is a probability vector, and $\mathbf{N} = (N_1, \ldots, N_K) = Z_1 + Z_2 + \cdots + Z_N$ is a random vector with $K$ non-negative coordinates that sum up to $N$. In fact, $\mathbf{N}$ is multinomially distributed with parameters $N$ and $\boldsymbol{\pi}$. Observe that, for any $n = 1, \ldots, N$, the probability that $Z_n = z_n$ is given by the following equation:

$$p(z_n \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{z_{nk}}.$$

Putting $\boldsymbol{\theta} = (\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$, with $\mathbf{Z} = (Z_1, \ldots, Z_N)$, $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_K)$ and $\boldsymbol{\Lambda} = (\Lambda_1, \Lambda_2, \ldots, \Lambda_K)$, the joint distribution of $\mathbf{X}$ and $\boldsymbol{\theta}$ can be written as follows:

$$p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = p(\mathbf{X}\mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\, p(\mathbf{Z}\mid \boldsymbol{\pi})\, p(\boldsymbol{\pi})\, p(\boldsymbol{\mu}\mid \boldsymbol{\Lambda})\, p(\boldsymbol{\Lambda}). \qquad (7)$$

The second equality of equation (7) uses that

$$p(\mathbf{X}\mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{N}\!\left(x_n \mid \mu_k, \Lambda_k^{-1}\right)^{z_{nk}} \quad\text{and}\quad p(\mathbf{Z}\mid \boldsymbol{\pi}) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}.$$

We assume further that, conditioning on $\mathbf{Z}$, the components of $\mathbf{X}$ are independent. Similarly, given $\boldsymbol{\pi}$, the components of $\mathbf{Z}$ are independent, and the components of $(\boldsymbol{\mu}, \boldsymbol{\Lambda})$ are independent across clusters. In addition to the above prescription, we use the plate notation (directed graph) [10, 24] to derive our priors and put the problem in a Bayesian setting. Thus, using the conjugate priors of $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Lambda}$, the above result in

$$p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi}\mid \alpha_0), \qquad p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) = p(\boldsymbol{\mu}\mid \boldsymbol{\Lambda})\,p(\boldsymbol{\Lambda}) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right)\mathcal{W}\!\left(\Lambda_k \mid W_0, \nu_0\right),$$

where $\alpha_0$, $m_0$, $\beta_0$, $W_0$, and $\nu_0$ are the hyperparameters of the Dirichlet and Gaussian–Wishart priors.

From the joint distribution in (7), we identify the posterior and variational (‘nice’) distributions as $p(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}\mid \mathbf{X})$ and $q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$, respectively, providing the ingredients for the computation of $\mathcal{L}(q)$. Accordingly, the variational distribution (VD) is factorized based on the MF approximation method to obtain $q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = q(\mathbf{Z})\,q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$. Meanwhile, from the MF approximation, it can be shown that the best distribution for maximizing the ELBO is $q_j^{*}$, satisfying $\ln q_j^{*}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\!\left[\ln p(\mathbf{X}, \boldsymbol{\theta})\right] + \text{const}$. We consequently model the joint distribution in (7) according to the aforementioned best variational distribution. Initial calculations involve the determination of $q^{*}(\mathbf{Z})$ followed by $q^{*}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$. In other words,

$$\ln q^{*}(\mathbf{Z}) = \mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}}\!\left[\ln p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\right] + \text{const}.$$

Pushing the variables not dependent on $\mathbf{Z}$ (i.e., $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\Lambda}$) into the constant, we obtain the following equation:

$$\ln q^{*}(\mathbf{Z}) = \mathbb{E}_{\boldsymbol{\pi}}\!\left[\ln p(\mathbf{Z}\mid \boldsymbol{\pi})\right] + \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\Lambda}}\!\left[\ln p(\mathbf{X}\mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\right] + \text{const}.$$

Substituting the expressions for $p(\mathbf{Z}\mid \boldsymbol{\pi})$ and $p(\mathbf{X}\mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$ (equations (9) and (10)) into the expression for $\ln q^{*}(\mathbf{Z})$ produces

$$\ln q^{*}(\mathbf{Z}) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const},$$

where

$$\ln \rho_{nk} = \mathbb{E}\!\left[\ln \pi_k\right] + \tfrac{1}{2}\mathbb{E}\!\left[\ln |\Lambda_k|\right] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\mathbb{E}_{\mu_k, \Lambda_k}\!\left[(x_n - \mu_k)^{\top}\Lambda_k (x_n - \mu_k)\right].$$

Exponentiating $\ln q^{*}(\mathbf{Z})$ and normalizing $\rho_{nk}$ so that the responsibilities sum to 1 over all the values of $k$ produces

$$q^{*}(\mathbf{Z}) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}}, \quad\text{where}\quad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}.$$

The best $q^{*}(\mathbf{Z})$, therefore, is a product of categorical distributions, one for each latent variable $Z_n$, having $r_{nk}$ for $k = 1, \ldots, K$ as parameters.

On the other hand, the best variational distribution $q^{*}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$ can be divided into two components, $q^{*}(\boldsymbol{\pi})$ and $q^{*}(\boldsymbol{\mu}, \boldsymbol{\Lambda})$. It follows from the product rule and the deductions leading to equations (15), (9), and (10) that $q^{*}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$ satisfies

$$\ln q^{*}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = \ln p(\boldsymbol{\pi}) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_{\mathbf{Z}}\!\left[\ln p(\mathbf{Z}\mid \boldsymbol{\pi})\right] + \sum_{n=1}^{N}\sum_{k=1}^{K} \mathbb{E}\!\left[z_{nk}\right] \ln \mathcal{N}\!\left(x_n \mid \mu_k, \Lambda_k^{-1}\right) + \text{const}.$$

Taking exponentials of both sides of the above expression and taking care of the normalizing term results in

$$q^{*}(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi}\mid \boldsymbol{\alpha}), \quad\text{where}\quad \alpha_k = \alpha_0 + N_k \quad\text{and}\quad N_k = \sum_{n=1}^{N} r_{nk}$$

is the effective number of points assigned to component $k$.

Upon some computations, the variational distribution for the joint $q^{*}(\mu_k, \Lambda_k)$ takes the form

$$q^{*}(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right)\mathcal{W}\!\left(\Lambda_k \mid W_k, \nu_k\right),$$

where $\mathcal{N}$ and $\mathcal{W}$ are respectively the Gaussian and Wishart densities (see equations (12) and (13)) with parameters $(m_k, \beta_k)$ and $(W_k, \nu_k)$. These parameters are given as follows:

$$\beta_k = \beta_0 + N_k, \qquad m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right), \qquad \nu_k = \nu_0 + N_k,$$
$$W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left(\bar{x}_k - m_0\right)\left(\bar{x}_k - m_0\right)^{\top},$$

with $\bar{x}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\, x_n$ and $S_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\left(x_n - \bar{x}_k\right)\left(x_n - \bar{x}_k\right)^{\top}$.

To evaluate the responsibilities $r_{nk}$, the quantities in $\ln \rho_{nk}$ are expressed as follows:

$$\mathbb{E}\!\left[\ln \pi_k\right] = \psi(\alpha_k) - \psi\!\left(\textstyle\sum_{j} \alpha_j\right), \qquad \mathbb{E}\!\left[\ln |\Lambda_k|\right] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|,$$
$$\mathbb{E}_{\mu_k, \Lambda_k}\!\left[(x_n - \mu_k)^{\top}\Lambda_k (x_n - \mu_k)\right] = \frac{D}{\beta_k} + \nu_k\left(x_n - m_k\right)^{\top} W_k \left(x_n - m_k\right),$$

where $\psi$ is the digamma function, the logarithmic derivative of the gamma function.

After the substitutions, $r_{nk}$ becomes

$$r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2}\, \exp\!\left(-\frac{D}{2\beta_k} - \frac{\nu_k}{2}\left(x_n - m_k\right)^{\top} W_k \left(x_n - m_k\right)\right),$$

where $\tilde{\pi}_k = \exp\!\left(\mathbb{E}\!\left[\ln \pi_k\right]\right)$ and $\tilde{\Lambda}_k = \exp\!\left(\mathbb{E}\!\left[\ln |\Lambda_k|\right]\right)$. There is a circular dependency between these variational parameters, requiring iterative updates that ensure the algorithm converges to an approximate posterior.
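This alternation between the responsibilities and the Dirichlet/Gaussian–Wishart parameters is the standard coordinate-ascent scheme for a variational Gaussian mixture. As a reference point on plain data (not the capsule routing code itself), scikit-learn’s BayesianGaussianMixture implements exactly these updates:

```python
# Library VI-GMM on toy data: responsibilities and Gaussian-Wishart/Dirichlet
# parameters are updated in turn until the ELBO stabilizes.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),      # toy data: two clusters
               rng.normal(5.0, 1.0, size=(200, 2))])

vgm = BayesianGaussianMixture(
    n_components=3,                                       # one deliberately redundant component
    weight_concentration_prior_type="dirichlet_distribution",
    covariance_type="full",
    max_iter=200,
    random_state=0,
)
vgm.fit(X)
print(vgm.converged_, vgm.lower_bound_)                   # ELBO (per sample) at convergence
print(np.round(vgm.weights_, 3))                          # redundant components shrink toward zero
```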

Using equation (7), the ELBO for a VMG model is obtained as follows:

$$\mathcal{L} = \sum_{\mathbf{Z}} \iiint q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\, \ln \frac{p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})}{q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})}\, d\boldsymbol{\pi}\, d\boldsymbol{\mu}\, d\boldsymbol{\Lambda}.$$

Applying the product rule, we obtain the following equation:

$$\mathcal{L} = \mathbb{E}\!\left[\ln p(\mathbf{X}\mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\right] + \mathbb{E}\!\left[\ln p(\mathbf{Z}\mid \boldsymbol{\pi})\right] + \mathbb{E}\!\left[\ln p(\boldsymbol{\pi})\right] + \mathbb{E}\!\left[\ln p(\boldsymbol{\mu}, \boldsymbol{\Lambda})\right] - \mathbb{E}\!\left[\ln q(\mathbf{Z})\right] - \mathbb{E}\!\left[\ln q(\boldsymbol{\pi})\right] - \mathbb{E}\!\left[\ln q(\boldsymbol{\mu}, \boldsymbol{\Lambda})\right].$$

Each expectation is taken with respect to the variational distribution and evaluates in closed form in terms of the Dirichlet and Gaussian–Wishart parameters obtained above; in particular, $\mathbb{E}[\ln q(\boldsymbol{\mu}, \boldsymbol{\Lambda})]$ involves $H[q(\Lambda_k)]$, the entropy of the Wishart distribution. Substituting these expressions, $\mathcal{L}$ becomes the objective function to maximize and is given by equation (32).

In this paper, we implement the maximization of equation (32) through the iterative updates of the GMM parameters mentioned earlier.

3.3. Variational Mixture of Gaussians (VMGs) Routing-Based Capsule Network

Motivated by [10, 11] and [4], and based on the discussions in Sections 3.1 and 3.2, we let $i$ and $j$, respectively, represent capsules at the lower- and higher-level layers. Let matrix $W_{ij}$ represent the similarity transform between the features of a lower-level capsule $i$ and a higher-level capsule $j$, with $v_{ij}$ as its vectorized version (i.e., $v_{ij}$ is the flattened vote vector obtained from the matrix product $M_i W_{ij}$, where $M_i$ is the pose matrix of capsule $i$). A higher-level capsule’s pose matrix is likewise flattened to obtain the capsule pose vector $\mu_j$. For ease of computation, we use the precision matrix $\Lambda_j$ instead of the covariance matrix $\Sigma_j$, and use $\boldsymbol{\lambda}_j$ to represent the diagonal entries of $\Lambda_j$. As mentioned earlier, $r_{ij}$ represents the vector form of the routing responsibilities, while $\pi$ is the mixing coefficient used for a single one-hot-vector representation indicating the choice of a cluster (capsule). On a larger scale, $\mathbf{Z}$ is a latent variable that serves as a collection of one-hot vectors with similar features, signifying the preference of each lower-level capsule feature for a corresponding higher-level capsule Gaussian cluster of features. Finally, we compute the activation probability $a_j$, representing the likelihood that cluster $j$ is activated, by computing the ELBO (equation (32)) and paying a fixed activation cost $\beta_a$ as indicated in [4]. Based on the above discussion, we derive Algorithm 1 as the routing procedure between capsules.

(1)  function VMG ROUTING($v_{ij}$, $a_i$)
(2)   Initialize weights: $r_{ij} \leftarrow 1/K$ for all lower-level capsules $i$ and higher-level capsules $j$
(3)   Initialize priors: $\alpha_0, \beta_0, m_0, W_0, \nu_0$
(4)   for $t$ routing iterations do
(5)     $N_j \leftarrow \sum_i a_i r_{ij}$,  $\bar{v}_j \leftarrow \frac{1}{N_j}\sum_i a_i r_{ij} v_{ij}$,  $S_j \leftarrow \frac{1}{N_j}\sum_i a_i r_{ij}(v_{ij}-\bar{v}_j)(v_{ij}-\bar{v}_j)^{\top}$
(6)     UPDATE BEST $q^{*}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$
(7)     UPDATE BEST $q^{*}(\mathbf{Z})$
(8)   $a_j \leftarrow \sigma\!\left(\mathcal{L}_j - \beta_a\right)$   ▹ activate capsule $j$ from its ELBO (equation (32)) less the fixed cost
(9)   return $a_j$, $\mu_j$
(1)  function UPDATE BEST $q^{*}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$
(2)   $\alpha_j \leftarrow \alpha_0 + N_j$
(3)   $\beta_j \leftarrow \beta_0 + N_j$
(4)   $\nu_j \leftarrow \nu_0 + N_j$
(5)   $m_j \leftarrow \frac{1}{\beta_j}\left(\beta_0 m_0 + N_j \bar{v}_j\right)$
(6)   $W_j^{-1} \leftarrow W_0^{-1} + N_j S_j + \frac{\beta_0 N_j}{\beta_0 + N_j}\left(\bar{v}_j - m_0\right)\left(\bar{v}_j - m_0\right)^{\top}$
(7)   $\mu_j \leftarrow m_j$   ▹ variational mean of the higher-level pose
(8)   $\Lambda_j \leftarrow \nu_j W_j$   ▹ expected precision under the Wishart factor
(9)   $\pi_j \leftarrow \alpha_j / \sum_{j'} \alpha_{j'}$   ▹ expected mixing coefficient
(10)   return $\alpha_j, \beta_j, m_j, W_j, \nu_j$
(1)  function UPDATE BEST $q^{*}(\mathbf{Z})$
(2)   $\mathbb{E}[\ln \pi_j] \leftarrow \psi(\alpha_j) - \psi\!\left(\sum_{j'} \alpha_{j'}\right)$
(3)   $\mathbb{E}[\ln |\Lambda_j|] \leftarrow \sum_{d=1}^{D}\psi\!\left(\frac{\nu_j + 1 - d}{2}\right) + D\ln 2 + \ln |W_j|$
(4)   $\ln \rho_{ij} \leftarrow \mathbb{E}[\ln \pi_j] + \frac{1}{2}\mathbb{E}[\ln |\Lambda_j|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\left(\frac{D}{\beta_j} + \nu_j\,(v_{ij}-m_j)^{\top}W_j\,(v_{ij}-m_j)\right)$
(5)   $r_{ij} \leftarrow \rho_{ij} / \sum_{j'} \rho_{ij'}$
(6)   return $r_{ij}$
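The activation step in line (8) of the main routine can be read as a sigmoid applied to the gap between a capsule’s ELBO and the fixed cost, in the spirit of the activation in [4]. A hedged sketch of that step alone follows (our illustration, not the released code; the inverse temperature lambda_ and the cost beta_a are placeholders):

```python
import torch

def capsule_activation(elbo_j: torch.Tensor, beta_a: float = 0.5,
                       lambda_: float = 1.0) -> torch.Tensor:
    """Map each higher-level capsule's ELBO to an activation probability in (0, 1)."""
    return torch.sigmoid(lambda_ * (elbo_j - beta_a))

elbo_j = torch.tensor([2.3, -0.4, 0.9])   # toy per-capsule ELBO values
print(capsule_activation(elbo_j))
```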
3.4. Uncertainty Estimation

Aleatoric and epistemic uncertainties are common with neural network models. Randomness is a property that characterizes aleatoric uncertainty [21]. For this type of uncertainty, there is sufficient variability in the outcome of events as a result of a random phenomenon. Epistemic uncertainty, on the other hand, expresses the uncertainty resulting from the designer’s lack of knowledge of the best design choices leading to the development of the best model. Both uncertainties together form the total uncertainty of the model. Several other methods exist for finding the total uncertainty of a model, but there is no consensus on which method is the best [25].

In this work, we experimentally determine the aleatoric and epistemic uncertainties of our model on some of the datasets. Since a deterministic model has no epistemic uncertainty [25], we determine only its aleatoric uncertainty on both in-distribution and out-of-distribution data. For our Bayesian model, we determine both uncertainties.

4. Experiments

The experiments in this work were carried out using the GPU version of PyTorch 1.7 on a 64-bit Windows machine with an NVIDIA GeForce GTX 1060. Each model was trained for 100 epochs using a learning rate of 0.001, 3 routing iterations, and a patience of 10,000. During training, the best model is saved to be used for inference. The code used in our implementation is a modification of the code in [11], which can be found in [26].

4.1. Loss Function

We adopted the spread loss in [4] as well as the negative log-likelihood loss as used in [11].
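For reference, a minimal PyTorch sketch of the spread loss from [4] is shown below; the margin value and the mean reduction are illustrative choices rather than the exact training settings used here.

```python
import torch

def spread_loss(activations: torch.Tensor, target: torch.Tensor,
                margin: float = 0.9) -> torch.Tensor:
    """Spread loss of [4]: penalize wrong-class activations that come within
    `margin` of the true-class activation.
    activations: (batch, num_classes); target: (batch,) class indices."""
    a_t = activations.gather(1, target.unsqueeze(1))           # true-class activation
    loss = torch.clamp(margin - (a_t - activations), min=0.0) ** 2
    loss = loss.scatter(1, target.unsqueeze(1), 0.0)           # drop the true-class term
    return loss.sum(dim=1).mean()

acts = torch.rand(4, 10)
labels = torch.randint(0, 10, (4,))
print(spread_loss(acts, labels))
```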

4.2. Model Architecture

Our model begins with a 2 × 2-filter convolutional layer to perform convolutions on a 32 × 32 × 1 input image with a stride of 2. This layer precedes three capsule layers and the ensuing VMG routing layers before the final class capsule layer which produces one capsule for each capsule class. Each capsule layer converts its respective filters into a 4 × 4 capsule pose matrix and activation. The final layer broadcasts its weight matrices to produce a capsule per class for each category in the dataset. Taking the filter and the capsule types produced by each capsule layer into consideration, the network for the model can be represented as [, , , ]. The complete architecture is shown in Figure 1.
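To make the pose-matrix bookkeeping concrete, the hedged skeleton below shows how a convolutional feature map can be regrouped into 4 × 4 pose matrices plus one activation per capsule type; the channel and capsule-type counts are placeholders, not the values used in our architecture.

```python
import torch
import torch.nn as nn

class PrimaryCaps(nn.Module):
    """Turn conv features into (pose, activation) pairs, one per capsule type."""
    def __init__(self, in_channels: int, num_caps: int = 8):
        super().__init__()
        self.num_caps = num_caps
        self.pose = nn.Conv2d(in_channels, num_caps * 16, kernel_size=1)  # 4x4 pose each
        self.act = nn.Conv2d(in_channels, num_caps, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        pose = self.pose(x).view(b, self.num_caps, 4, 4, h, w)
        a = torch.sigmoid(self.act(x))        # activation probability per capsule type
        return pose, a

feats = torch.randn(2, 32, 16, 16)            # e.g., features after a stride-2 conv on a 32x32 input
pose, a = PrimaryCaps(32)(feats)
print(pose.shape, a.shape)                    # (2, 8, 4, 4, 16, 16) and (2, 8, 16, 16)
```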

4.3. Datasets and Data Preprocessing

Three popular computer vision datasets and one health-related dataset were adopted to experimentally evaluate the methods proposed in this paper. MNIST [27] is a handwritten-digit dataset consisting of 70,000 28 × 28 grayscale images commonly partitioned into 60,000 training and 10,000 test images. Comparatively, this dataset is less complex but effective and very popular for testing the performance of computer vision algorithms. Fashion-MNIST [28] is another dataset obtained from 70,000 greyscale fashion products; its original partition into training and test sets is the same as MNIST's. This dataset is relatively complex compared to MNIST. The third and most complex dataset among the three is CIFAR-10 [29]. This dataset is very challenging to most computer vision algorithms due to the presence of background as well as background objects. Each of the aforementioned datasets is made up of ten classes and was partitioned into 55,000 training, 5,000 validation, and 10,000 test images.

The fourth dataset is a COVID-19 Radiography dataset [30–32] collected from four countries by a team of doctors. It consists of three classes of infected chest X-ray images and one class of healthy X-rays. This dataset is highly imbalanced and, for the purposes of this work, was partitioned into 16,952 training, 2,000 validation, and 4,227 test images. Even though the performance of some machine vision algorithms largely depends on extensive preprocessing to obtain highly informative image data, we did not employ any such preprocessing, despite the fact that digital images contain Gaussian noise introduced by the limitations of the acquisition sensor/camera during image capture, and that techniques exist to reduce its effect [33]. Instead, we evaluated the model on the raw images, enabling us to understand the actual extent to which the model can recognize real-life digital images (such as the COVID-19 images) without human interference.

4.4. Experimental Results

The results presented in this section are from the implementation of our model (the Variational Mixture of Gaussians Routing model, VMG-Routing), the baseline Multilane LBP-Gabor Capsule (ML) network [32], and the VB-Routing [11] {64, 8, 16, 16, #c} architecture, where #c is the number of output classes. However, our GPU device could not run the larger VB-Routing architectures; consequently, for those models, we report the results from the work in [11].

4.4.1. Model Learning and Convergence

The training and validation curves in Figure 2 show the proposed model’s ability to learn and converge faster. For less complex images such as MNIST and Fashion-MNIST, the model converges as early as epoch 30. For relatively complex and imbalanced images such as CIFAR-10 and COVID-19 Radiography, the model attains an accuracy approximately equal to the final accuracy at epoch 90. Our VMG-Routing learns faster compared to the models in [11] which only show stability beginning from epoch 150. Fast learning and convergence are desirable attributes for image recognition systems applied in critical areas such as self-driving cars where every passing minute counts and is valuable.

Table 1 reports a comparison of the error rates of the VMG-Routing capsule and the other capsule network (CapsNet) models. Even with the moderate (shallow) size of the VMG-Routing model, it performs comparatively well with the deep and multilane models. The difference in accuracy on CIFAR-10 between the proposed VMG-Routing CapsNet and the largest model is only 1.07% with our model having an added advantage of being less computationally complex.

4.4.2. Model Complexity

The VMG-Routing CapsNet produced fewer parameters compared to its counterparts in the literature, as can be seen in Table 2. This makes the VMG-Routing model less computationally complex and increases its potential for implementation on embedded and mobile devices that naturally have limited memory. In addition, model complexity poses a threat of overfitting [34] that ultimately leads to poor performance.

4.4.3. Inference

To test the models’ generalizability on unseen images, we used the trained (saved) models to perform inference on 10,000 test images each from MNIST, CIFAR-10, and Fashion-MNIST, and on 4,227 test images from the COVID-19 Radiography dataset. A comparison of the test accuracies is reported in Table 3. The average time for each model to perform inference on the sample images is also reported in Table 3. It can be observed that the VMG-Routing model produced results that compare favorably with the results of other state-of-the-art models.

We further performed inference on individual in-distribution images for both models to determine the level of confidence/certainty each model places on its prediction probabilities. Figure 3 shows that the deterministic model is overconfident in its predictions (column 3) while the VMG-Routing CapsNet exercises some caution in the confidence it imposes on its predictions (column 2).

4.4.4. Model Uncertainty

Daily scenarios involve decision-making influenced by the level of uncertainties/certainties prevailing at the time. Depending on the field under consideration, uncertainty estimation can be a critical part of the decision-making process. For instance, the reliability and efficacy of a deep learning model for medical applications such as Artificial Intelligence (AI) assisted surgery depends on the uncertainty with which it identifies the medical condition correctly. Bayesian methods have advantages over other neural networks as they provide the avenue to effectively model uncertainty [12]. The inability of machine learning applications to provide reliable uncertainty estimates is a potential limiting factor in their acceptability and widespread adoption for critical tasks.

To demonstrate the reliability of the uncertainty estimates of our VMG-Routing model, we present a comparison of experimental results from the prediction of both in-distribution (Figure 4) and out-of-distribution (Figure 5) images for the VMG-Routing model and the baseline deterministic ML-LBP capsule model.

We use the distribution of the predicted class probabilities to express the aleatoric uncertainty of the deterministic model. This uncertainty assumes a value of zero if one class gets a probability of one and all other classes obtain a probability of zero. Since deterministic CapsNets have fixed weights, they cannot express epistemic uncertainties [25] and will produce the same output when inference is carried out on the same input image $N$ times. The output of the SoftMax layer (see Figure 3) sums up to one and measures the certainty of the model in its predictions. We obtain the aleatoric uncertainty of the deterministic CapsNet from the same quantity by computing the negative log-likelihood (NLL) or the entropy of the predictions,

$$H(p) = -\sum_{c=1}^{C} p_c \log p_c,$$

where $p_c$ is the predicted probability of class $c$ and $C$ is the number of classes in the dataset under consideration.
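A minimal sketch of this entropy computation for a single deterministic prediction (the probability vectors are illustrative, not tied to any particular dataset) is:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray, eps: float = 1e-12) -> float:
    """H(p) = -sum_c p_c log p_c for one probability vector over C classes."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(probs * np.log(probs)).sum())

print(predictive_entropy(np.array([0.98, 0.01, 0.01])))   # confident: low entropy
print(predictive_entropy(np.full(10, 0.1)))               # uniform over 10 classes: maximal entropy
```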

On the other hand, our VMG-Routing CapsNet replaces the fixed weights with Gaussian distributions, giving it the ability to express both epistemic and aleatoric uncertainties in its predictions. The aleatoric uncertainty is expressed in the distribution across classes, similar to the deterministic CapsNet, except that it is based on the average prediction probabilities. Meanwhile, the epistemic uncertainty is measured by the spread of the inference probabilities across repeated runs and is zero for a zero spread. For this scenario, $N$ different multinomial conditional probability distributions, conditioned on the weight distribution, are obtained from $N$ predictions on the same input image. The mean probability

$$\bar{p}_c = \frac{1}{N}\sum_{n=1}^{N} p_c^{(n)}$$

is computed for each class $c$, and the class with the maximum mean conditional probability, $\hat{y} = \arg\max_c \bar{p}_c$, is chosen as the predicted class of the input image.

The averaging in equation (35) ensures that the epistemic uncertainty in the model is captured, making the predictive uncertainty possible to compute. In addition, the uncertainty based on the entropy and the total variance obtained from the averaging follows naturally from

$$H(\bar{p}) = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c \qquad\text{and}\qquad \sigma^2_{\text{total}} = \sum_{c=1}^{C} \operatorname{Var}_n\!\left[p_c^{(n)}\right].$$
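The following hedged sketch (the sampling summary and helper names are ours, not the released code) shows how $N$ stochastic forward passes on one image can be summarized into a predicted class, a predictive entropy over the averaged probabilities, and a total variance that reflects the spread across runs:

```python
import numpy as np

def mc_uncertainty(prob_samples: np.ndarray, eps: float = 1e-12):
    """prob_samples: (N_runs, C) class probabilities from repeated inference."""
    mean_p = prob_samples.mean(axis=0)                       # averaged prediction
    clipped = np.clip(mean_p, eps, 1.0)
    entropy = float(-(clipped * np.log(clipped)).sum())      # entropy of the mean
    total_var = float(prob_samples.var(axis=0).sum())        # spread across runs
    return int(mean_p.argmax()), entropy, total_var

# Toy example: 5 runs over 3 classes with visible disagreement between runs.
samples = np.array([[0.7, 0.2, 0.1],
                    [0.4, 0.5, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.3, 0.6, 0.1],
                    [0.5, 0.4, 0.1]])
print(mc_uncertainty(samples))
```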

Figure 5 shows the uncertainty of both models on the respective out-of-distribution images. The spread of the prediction probabilities of a given class expresses the epistemic uncertainty while the distribution across the different classes epitomizes the aleatoric uncertainty of the models [25].

Even though both models produce wrong predictions for the out-of-distribution images, the VMG-Routing CapsNet produces predictive probabilities (Figure 5, column 2) that significantly vary in the distribution and spread of the predictive runs. The VMG-Routing CapsNet, therefore, can express both uncertainties. On the contrary, the deterministic model cannot express epistemic uncertainty since performing predictive runs on the same input image produces the same probabilities (Figure 5, column 3). The ability of a model to express its uncertainty is a desirable property since it can be shown that models that produce higher uncertainties are likely to produce accurate predictions [25]. Finally, the shape of the VMG-Routing CapsNet’s predictive probability distribution has some semblance to that of the Gaussian distribution which may be attributed to the model being driven by a variational mixture of Gaussians.

4.4.5. Model’s Ability to Extract Relevant Features

To enable us to understand and tune the VMG-Routing model for further performance improvement, we investigated the ability of the layers in the model to extract the relevant features. Through experimentation via this approach, redundant layers were eliminated, resulting in a reduction in the model size/complexity, convergence time, and excessive oscillations during training. More specifically, we visualized the output (feature maps) of the layers by feeding an input image into the trained (best saved) model. The feature maps for the various layers are shown in Figure 6. It can be observed that the layers of the model can extract the most relevant features from the input images.

4.4.6. Threats to Validity

Deep Learning (DL) is capable of learning and modeling real-life scenarios when extreme care is taken, during the design and development stages, to consider all the factors that could prevent the model from achieving optimal performance. For instance, the choice of hyperparameters and their values is an important exercise that has a direct impact on the validity of the model outputs. For stochastic gradient descent (SGD)-based methods and their variants, the dataset used for training is organized into batches whose size affects the computation of the gradient. In practice, larger batch sizes reduce the quality of the model during generalization [35]. This work therefore used batch sizes of 16–32 data points for the experiments. We also avoided sorting the dataset and introduced randomization of batches to prevent the possibility that a given batch contains only identical labels. In addition, the learning rate controls the rate at which the model is modified in response to the error whenever the model weights are updated. We chose a smaller learning rate to allow the model to learn an optimal set of weights, even though this has the potential to increase training time and the risk of overfitting. Other methods for addressing this include implementing a learning rate decay function that returns an updated learning rate value dropping by half every n epochs. Furthermore, nonlinear activation functions are essential for DL to effectively model real-life scenarios, which are nonlinear. The choice of the appropriate activation function determines the speed of the computations needed to accelerate training, as well as the ability to reduce the likelihood of vanishing gradients and to improve performance [36]. To introduce nonlinearity and activate the capsule, we adopted the Sigmoid activation function, since it encourages unambiguous predictions close to 1 or 0 and returns a value between 0 and 1.
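As an example of the step-decay option mentioned above, the sketch below halves the learning rate every n epochs with PyTorch’s StepLR; the initial rate matches the 0.001 used in our runs, but the step size shown is only illustrative.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(4, 2)                           # stand-in for the CapsNet
optimizer = SGD(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve the lr every 10 epochs

for epoch in range(25):
    # ... one epoch of training over the batches would run here ...
    scheduler.step()
print(scheduler.get_last_lr())                          # [0.00025] after two halvings
```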

Another scenario that poses a threat to the validity of the Bayesian model outputs is covariate shift, where the distributions of the training and target data are different [37]. Covariate shift may also occur due to pixelate-corrupted test data, spurious correlations, and domain shift. This problem is particularly pronounced in Bayesian models that use an unconstrained covariance matrix and is worsened when there is linear dependence in the features. In this work, we employed mean-field variational inference (MFVI), which constrains the covariance matrix to be diagonal, limiting the effect of linear dependence in the features [38] and hence the impact of covariate shift.

5. Conclusion and Future Work

In this work, we proposed a capsule network based on a variational mixture of Gaussian routing to express the uncertainties associated with performing predictions on out-of-distribution data. The results show that a Bayesian capsule can be less computationally complex, converge faster, and outperform both the state-of-the-art deterministic and probabilistic models during inference. Furthermore, our work demonstrates that Bayesian capsules may have advantages over their deterministic counterparts since they have a bigger potential to exhibit transparency, credibility, reliability, and interpretability required to gain the confidence of industry players.

In the future, we intend to carry out a full investigation into Bayesian capsule interpretability in a quest to unravel the “black box” concept.

Data Availability

The data used to support the findings of this study can be accessed in the following repositories: 1. http://yann.lecun.com/exdb/mnist/ 2. https://www.cs.toronto.edu/∼kriz/cifar.html 3. https://www.kaggle.com/datasets/zalando-research/fashionmnist 4. https://www.kaggle.com/datasets/preetviradiya/covid19-radiography-dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.