Abstract
There is growing interest in developing linear/nonlinear feature fusion methods that fuse the features elicited from two different sources of information to achieve a higher recognition rate. In this regard, canonical correlation analysis (CCA), cross-modal factor analysis, and probabilistic CCA (PCCA) have been introduced to better deal with data variability and uncertainty. In our previous research, we developed the kernel version of PCCA (KPCCA) to capture both the nonlinear and probabilistic relations between the features of two different source signals. However, KPCCA is only able to estimate latent variables that are statistically correlated between the features of two independent modalities. To overcome this drawback, we propose a kernel version of the probabilistic dependent-independent CCA (PDICCA) method to capture the nonlinear relations between both dependent and independent latent variables. We have compared the proposed method to the PDICCA, CCA, KCCA, cross-modal factor analysis (CFA), and kernel CFA methods on the eNTERFACE and RML datasets for audio-visual emotion recognition and on the M2VTS dataset for audio-visual speech recognition. Empirical results on the three datasets indicate the superiority of both the PDICCA and kernel PDICCA methods over their counterparts.
1. Introduction
It is evident that data collection from a single sensor (modality) does not capture all discriminative information about an observation. For instance, when we film a person while he/she is speaking, neither the voice signal nor the video frames alone can capture all the information in the input states. Therefore, multisource data collection has attracted researchers' attention because different data streams from the same observation can compensate for each other's uncertainty, variability, and partial observability. For instance, when we listen to somebody and watch him/her simultaneously, even if we cannot hear a word well, we can guess the corresponding word by observing his/her lip motions. Thus, in this research, we want to fuse audio and visual information to achieve a higher speech recognition rate. Although a few fusion techniques have been developed to fuse the elicited correlated features of different modalities, this research proposes a novel approach for fusing both the dependent and independent probabilistic information of feature vectors extracted from two different modalities.
1.1. Background
It has been empirically shown that raw audio and video data pass through several nonlinear neural processing stages [1] and are eventually integrated with each other via a complex neural process. Bimodal data collection is frequently used in several applications such as medical image fusion [2], multimodal interaction [3], multimodal emotion detection [4], and audio-visual speech recognition [5, 6].
Information fusion can be carried out at different levels of data processing, including raw data fusion, feature fusion, model fusion, and decision fusion. Among these approaches, feature fusion [7, 8] is of concern in this study because it can consider the linear/nonlinear relations among the elicited features. It is noteworthy that feature fusion is different from feature concatenation [9]. In other words, extracting feature vectors from different sources of data and arranging them into a long feature vector is not feature fusion. Feature concatenation greatly increases the number of training parameters in several classifiers such as the Bayes classifier [10] or deep learning schemes [11, 12]. In the case of a small sample size, the covariance estimated from such a dataset is underestimated. From another perspective, conventional classifiers are unable to either capture the interaction of all features or tolerate the uncertainty and variability of features [6, 13–15].
Therefore, an efficient feature fusion method should not produce very high-dimensional feature vectors. To mimic the human fusion system, representative features from each modality are elicited and then projected into a new space (analogous to processing spikes at a higher level) such that the projected features of the two modalities have maximum correlation or minimum distance, while keeping a proper size. Therefore, by fusing the projected features (e.g., audio and video features) in the correlation space, better recognition performance can be obtained [4].
Canonical correlation analysis (CCA) [16] is a well-known technique for feature fusion that identifies the shared (dependent) information between two different sources of data. CCA is determined by optimizing two different linear projections of the features belonging to two different modalities. These features are projected into a new space (called the correlation space), in which the cross-correlation of the two projected features is maximized. These projected features are called latent variables/features. Nonetheless, CCA has some drawbacks, such as ignoring the stochastic nature of the features. Moreover, CCA is unable to capture nonlinear relations between features. Cross-modal factor analysis (CFA) [17] is another feature fusion method that, similar to CCA, projects the input features of two different modalities into a new space in such a way that the distance (Frobenius norm) between the projected features is minimized. CCA and CFA have been adopted in practical multimodal recognition systems such as face recognition [18], signal processing [19], monitoring and fault detection [20], audio-visual speaker detection [21], fusion of multimodal medical imaging [22, 23], and audio-visual synchronization [24]. To enable both CCA and CFA to capture input nonlinearities, their kernel versions, called KCCA [25] and KCFA [26], were developed. They have been applied to various data fusion applications such as specific radar emitter identification [27], audio-visual emotion recognition [26, 28], and feature selection [20, 29]. Nevertheless, during the recording of audio-video signals, a few undesired disturbing factors occur, such as slight movements of the recording camera or moving closer to and farther from the recording microphone. For instance, in a bimodal recognition system [26], the KCFA scheme is deployed to elicit the latent variables of audio and video data, but the achieved results are not convincing. This is because ignoring the variability factors during recording affects the quality of the recorded data and degrades the performance of the recognizer.
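To make the projection idea concrete, here is a minimal sketch of CCA-based feature fusion using scikit-learn; the array shapes and the choice to concatenate the two projected views before classification are our own illustrative assumptions, not the exact setup used later in this paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Illustrative data: 300 synchronized samples of audio and video features.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(300, 12))   # e.g., 12 MFCCs per frame
video_feats = rng.normal(size=(300, 40))   # e.g., 40 visual descriptors per frame

# Project both views into a shared 8-dimensional correlation space.
cca = CCA(n_components=8)
audio_proj, video_proj = cca.fit_transform(audio_feats, video_feats)

# Fused representation: the projected (maximally correlated) features,
# much lower-dimensional than naive concatenation of the raw features.
fused = np.hstack([audio_proj, video_proj])
print(fused.shape)   # (300, 16)
```

Replacing CCA with CFA-style projections, or with the probabilistic variants discussed below, changes how the projections are learned but not this overall fusion pattern.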
In practice, each set of recorded data has a degree of randomness due to several reasons, such as the movement of sources during data acquisition, power line noise, and the induction noise of other equipment. To capture the stochastic nature as well as the variability of data recorded from two modalities, Bach and Jordan [30] proposed the probabilistic CCA (PCCA) model. Although PCCA is linear, we proposed its kernel version in our previous study to capture the nonlinearity of the features elicited from two modalities [6].
Although correlated features can compensate for each other's deficiencies, independent features do not suffer from redundancy and reveal a new perspective on the input observation. Interestingly, some fusion methods estimate only the dependent (shared) features between two modalities of data, while only a few of them use both dependent and independent features. Thus, employing both dependent and independent features in the PCCA framework, which is termed PDICCA in the literature [31], allows us to move from a partial description toward a wider observability of the inputs.
Since PDICCA is a linear method and cannot capture nonlinear relations between the elicited features, the main contribution of this study is devoted to kernelizing the PDICCA method, which we call KPDICCA hereafter. The proposed method is able to capture different aspects of the inputs, such as dependencies, independencies, uncertainty, variability, and nonlinearities. Moreover, when only a limited number of samples is available, kernel methods are still able to provide convincing results because the size of the kernel matrix depends on the number of feature vectors. In contrast, obtaining convincing results from a classifier requires a large number of training samples. Hence, there is a trade-off between applying a kernel to the input features and properly training a classifier.
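As a simple numerical illustration of this point (with arbitrary dimensions of our own choosing), the Gram matrix used by a kernel method grows with the number of samples rather than with the feature dimensionality:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))   # 200 samples of very high-dimensional features

K = rbf_kernel(X)                  # Gaussian-kernel Gram matrix
print(K.shape)                     # (200, 200): independent of the 5000 feature dimensions
```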
The rest of this paper is structured as follows: In Section 2, the PDICCA method and the proposed method are introduced. Section 3 introduces the deployed datasets and their feature extraction techniques. Section 4 presents the experimental results obtained from the proposed method along with state-of-the-art methods, and their achievements are compared and discussed. Finally, the paper is concluded in Section 5.
2. Methodology
In this section, first PDICCA is briefly explained, and then the proposed method (KPDICCA) is introduced.
2.1. PDICCA
To overcome the lack of uncertainty modeling in both the CCA and KCCA models, one solution is to consider a linear projection between the observations of the sources and dependencies among the latent variables, optimized by the maximum likelihood (ML) method [30]. In addition to the Gaussian distribution assumption, the probabilistic CCA (PCCA) method is able to model the variability of data and outlines a solution for the CCA problem. PCCA has been extended [31] by incorporating a dependent latent variable z, similar to CCA, and two other independent latent variables, z_x and z_y, which are not shared and belong exclusively to the two modalities x and y, as described in Figure 1. This method is termed probabilistic dependent-independent CCA (PDICCA) and is categorized as a generative data model. PDICCA captures both the dependence and independence of the latent variables as follows:

x = f(z, z_x) + ε_x,
y = g(z, z_y) + ε_y,

where f and g are two deterministic functions that transform the dependent latent variable (z) and the independent latent variables (z_x and z_y) to the observation space, and ε_x and ε_y denote the additive noise on the x and y observations, respectively.

For simplification, they use two linear functions, f(z, z_x) = W_x z + B_x z_x and g(z, z_y) = W_y z + B_y z_y, and they consider an independent Gaussian distribution with equal variance for the noise terms, ε_i ~ N(0, σ_i² I). Furthermore, they assume a Gaussian distribution with zero mean and unit covariance for the latent variables (z, z_x, z_y ~ N(0, I)).
Therefore, we can write

x | z, z_x ~ N(W_x z + B_x z_x, σ_x² I),
y | z, z_y ~ N(W_y z + B_y z_y, σ_y² I).
To solve the above model, first the shared latent variable z and the projection matrices W_x and W_y must be estimated, while the set-specific parameters B_i and z_i should be marginalized out. Afterward, the probabilistic model is marginalized over the shared latent variable z, and then the set-specific parameters (i = x or y) can be optimized accordingly. To summarize this learning scheme, its pseudocode is illustrated as follows (see Algorithm 1).
Algorithm 1: Pseudocode of the PDICCA learning scheme.
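For readers who prefer code to notation, the following numpy sketch draws samples from the PDICCA generative model described above; all dimensions and parameter values are arbitrary illustrations of ours (in practice, W_i, B_i, and σ_i are learned, e.g., by the EM scheme summarized in Algorithm 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (illustrative only)
d_z, d_zx, d_zy = 3, 2, 2      # shared and set-specific latent dimensions
d_x, d_y = 20, 15              # observation dimensions
n = 100                        # number of samples

# Model parameters (normally learned by EM)
W_x, W_y = rng.normal(size=(d_x, d_z)), rng.normal(size=(d_y, d_z))
B_x, B_y = rng.normal(size=(d_x, d_zx)), rng.normal(size=(d_y, d_zy))
sigma_x, sigma_y = 0.1, 0.1

# Latent variables: zero-mean, unit-covariance Gaussians
z   = rng.normal(size=(n, d_z))
z_x = rng.normal(size=(n, d_zx))
z_y = rng.normal(size=(n, d_zy))

# Generative model: shared part + set-specific part + isotropic noise
x = z @ W_x.T + z_x @ B_x.T + sigma_x * rng.normal(size=(n, d_x))
y = z @ W_y.T + z_y @ B_y.T + sigma_y * rng.normal(size=(n, d_y))
```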
The CCA, CFA, and PDICCA methods are all linear approaches, which are capable of finding the linear relationship between two synchronously recorded modalities. It should be noted that these models cannot capture the nonlinear correlations between two sets of features elicited from two different modalities. Herein, a nonlinear kernel is inserted into PDICCA to overcome this drawback.
2.2. The Proposed Method
Our approach is similar to KCCA [25] and KCFA [26]. To the best of the authors' knowledge, a kernel version of PDICCA has not been proposed yet. This paper aims to propose the kernel version of PDICCA (KPDICCA) to consider the nonlinear relations among the observations (from the audio and visual modalities). To equip the PDICCA method for capturing nonlinear relations between the observations and their elicited latent variables, the kernel derivation of PDICCA is developed here by implicitly mapping the data from the original space to a higher-dimensional space and then applying the Klami and Kaski method [31] to find the dependent and independent latent variables, as shown in Figure 2. To derive the formulas, we first assume that all latent variables have normal distributions. Similar to the derivation of KCCA, we can write
φ_x(x) = W_x z + B_x z_x + ε_x,
φ_y(y) = W_y z + B_y z_y + ε_y,

where φ_x(·) and φ_y(·) denote the implicit nonlinear mappings of the two modalities into the kernel feature spaces.
To simplify the above relations, similar to KCCA, we assume that W_i and B_i (i = x or y), which are the transformation matrices that project z and z_i into the mapped x and y subspaces, can be written as linear combinations of the mapped training samples, as follows:
Considering these parameters (i = x or y), we can derive a learning method using the expectation–maximization (EM) algorithm [31] to obtain the model parameters, where σ_i² is the noise variance of the primary domain. Assuming the shared-projection parameters are fixed and marginalizing over the independent factors z_x and z_y, we construct a probabilistic model that depends only on the shared part, with the following covariances:
If we consider the covariance matrices as follows, we can obtain the updating formula for the corresponding parameter:
If K_x and K_y are invertible, we will have
Therefore, we can obtain the update formula as follows:
In the second step, we marginalize over the shared latent variable and use a similar method to provide an updating function as follows:
The noise variance is then recovered by the following equation:
The posterior expectations of the shared and set-specific latent variables given the observation x can be obtained by ML estimation. Applying ML to the probabilistic model, we obtain
Similarly, for the observation y, the corresponding expressions are obtained by replacing every subscript x with y in the above equations. If K_x and K_y are not invertible, the above equations will not provide a solution for KPDICCA. This problem can be solved using a regularization approach similar to the one presented for KCCA. In this study, we use the methods of Golub et al. [32] and Koskinen et al. [33], which both incorporate a priori knowledge on the kernel matrices K_x and K_y through a regularization parameter r, and we apply the EM algorithm to achieve the corresponding regularized relations. To obtain the parameters of the KPDICCA method, we use the expectation–maximization (EM) algorithm presented in [34]. In order to infer z, we need to marginalize out the latent independent factors z_x and z_y. Therefore, we assume that the kernel-space projection coefficients are fixed (and hence the corresponding projection matrices are fixed) and marginalize over the independent factors. Similarly, the study [34] involves an integral over the independent latent variables.
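The practical effect of this regularization can be sketched as follows. The Gaussian kernel, the value of r, and the plain ridge term K + rI are illustrative assumptions in the spirit of the KCCA-style regularization cited above, not the exact expressions of our derivation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(100, 12))
video_feats = rng.normal(size=(100, 40))

# Gram matrices of the two modalities (Gaussian kernel).
K_x = rbf_kernel(audio_feats)
K_y = rbf_kernel(video_feats)

# Gram matrices estimated from few samples can be ill-conditioned or singular,
# so a ridge-style term keeps them well conditioned before any inversion.
r = 0.4   # illustrative value of the regularization parameter
K_x_reg = K_x + r * np.eye(K_x.shape[0])
K_y_reg = K_y + r * np.eye(K_y.shape[0])

print(np.linalg.cond(K_x), "->", np.linalg.cond(K_x_reg))
```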
As the prior is Gaussian and it is multiplied by a linear term, we can integrate it out analytically, obtaining the marginal distribution.
Performing the same marginalization for the other modality leads to the generative model.
If we define Ψ_x = B_x B_x^T + σ_x² I and Ψ_y = B_y B_y^T + σ_y² I, we can write the above model as follows:
This is exactly the model proposed in [30] for interpreting CCA probabilistically. To obtain the above solution, we implicitly assume that the dimensionalities of the independent latent variables z_x and z_y are sufficiently high to produce unconstrained covariance matrices Ψ_x and Ψ_y. However, Klami and Kaski [31] propose a new algorithm that does not require this assumption and provide a more general EM algorithm for linear projections. The algorithm includes an additional step that marginalizes z out to enable estimation of the dependency matrices (W_x and W_y).
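Spelling this out under the notation used above (and with the usual zero-mean convention, which is our simplification), the marginalized generative model reads:

```latex
\begin{aligned}
z &\sim \mathcal{N}(0, I), \\
x \mid z &\sim \mathcal{N}\big(W_x z,\ \Psi_x\big), \qquad \Psi_x = B_x B_x^{\top} + \sigma_x^{2} I, \\
y \mid z &\sim \mathcal{N}\big(W_y z,\ \Psi_y\big), \qquad \Psi_y = B_y B_y^{\top} + \sigma_y^{2} I.
\end{aligned}
```

That is, the set-specific variation is absorbed into structured noise covariances; when Ψ_x and Ψ_y are left unconstrained, this coincides with Bach and Jordan's probabilistic interpretation of CCA [30].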
The EM algorithm for optimizing the extended probabilistic CCA is described in Figure 2 and repeats the two steps until convergence. This method is a linear approach and is capable of finding the linear relationship between two modalities. To extend this approach to nonlinear relations, similar to the KCCA method, by considering W_i and B_i as in (10) and (11) and substituting them into Klami's method, we can obtain the EM algorithm for updating the parameters.
By substituting equation (10) into part (1) of the EM algorithm (Figure 2), and considering that K_x and K_y are invertible, the expression simplifies accordingly. Now, by using EM to update the formula, we can obtain the following equation for updating the parameter:
In the second step, we marginalize over the shared latent variable and use a similar method to provide an updating function as follows:
The variance is then recovered using the model described in equation (9):
3. Application
In this paper, we assess the proposed method and its competitors on speech recognition and emotion recognition datasets. The M2VTS audio-visual database [35] is used for speech recognition. The M2VTS dataset includes 185 recordings from 37 subjects (12 females and 25 males). Each speaker is recorded in five shots, and within each shot, the subjects utter the digits from "0" to "9" while their audio and video signals are recorded. The sampling rate of the audio signals is 48 kHz, and the frame rate of the video is 25 Hz. Several features for characterizing speech signals have been proposed, such as cepstral coefficients [13], the discrete cosine transform (DCT), mel-frequency cepstral coefficients (MFCCs) [9], and the perceptual linear predictor (PLP) [36]. First, the background of the speech signal is removed, and then the cleaned speech signal is segmented into successive Hamming windows with 50% overlap, where the length of each window is 512 samples. From each windowed signal, 12 MFCC coefficients are extracted.
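As a rough illustration of this acoustic front end, the windowing and MFCC settings described above map to the following sketch; librosa is our own choice of implementation, the file name is a placeholder, and the background-removal step is omitted.

```python
import librosa

# Load one utterance; sr=48000 matches the dataset's audio sampling rate.
signal, sr = librosa.load("speaker01_shot1.wav", sr=48000)  # hypothetical file name

# 512-sample Hamming windows with 50% overlap (hop of 256 samples),
# 12 MFCC coefficients per frame.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr,
    n_mfcc=12, n_fft=512, hop_length=256, window="hamming",
)
print(mfcc.shape)  # (12, number_of_frames)
```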
Since our processing is simultaneous, we have to characterize the lip motion in parallel. In this regard, the lip contour should first be elicited so that its key points can be traced in successive frames. We use the Rohani et al. [13] method, in which a colored face image is first divided into lip and nonlip clusters. This segmentation is done by fitting a simple geometric lip model and applying spatial fuzzy C-means clustering in order to extract the lip contour. The geometric lip model, described by equation (25), is presented in Figure 3, with the model centered at (0, 0).

After matching the lip model to each image, a lip contour is extracted. The lip model contains six features (two key points on the upper lip and four key points on the lower lip) that need to be traced in successive frames.
The employed emotion recognition datasets are eNTERFACE and RML, both of which include six emotional states: anger, disgust, fear, happiness, sadness, and surprise [37, 38]. In eNTERFACE, 44 subjects participated; their video is recorded at 25 frames per second, and their acoustic signals are recorded at a sampling rate of 48 kHz. In the Ryerson database, eight subjects speak six different languages, generating three believable reactions to the given situations; their acoustic sampling rate is 22050 Hz, and the video frame rate is 30 fps.
For the emotion recognition datasets, similar processing stages are applied. In order to remove the speech noise, we take the wavelet transform of the signal, and by applying thresholding to the energy of the wavelet coefficients at different scales, we remove those scales whose energy values are less than an empirical threshold [39]. After reconstructing the signal, the energy of the signal in the time domain within each window is first determined [40], and then the first 12 MFCC features [26, 34] are added to the feature vector. In the emotion recognition system, facial expression features play a very important role. The challenging issue in the video processing is to precisely extract the face region. In this research, the Haar cascade technique [41] is employed to detect the face, and the image in each frame is resized to a fixed resolution. Afterward, a Gabor wavelet filter bank [42] with five scales and eight orientations is applied to each image to elicit key facial features [6, 26]. Nonetheless, Gabor feature vectors are high-dimensional; to reduce the feature size, principal component analysis (PCA) is deployed.
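A compact sketch of this visual front end is given below; the Gabor kernel size and wavelengths, the image resolution, and the number of retained principal components are illustrative assumptions of ours (the text above fixes only the five scales and eight orientations).

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def gabor_features(gray_face, scales=5, orientations=8, ksize=31):
    """Filter a grayscale face image with a 5x8 Gabor bank and return
    the concatenated rectified responses as one long feature vector."""
    feats = []
    for s in range(scales):
        lambd = 4.0 * (2 ** s)                 # wavelength grows with scale (illustrative)
        for o in range(orientations):
            theta = o * np.pi / orientations   # evenly spaced orientations
            kernel = cv2.getGaborKernel((ksize, ksize), sigma=lambd / 2.0,
                                        theta=theta, lambd=lambd, gamma=0.5, psi=0)
            response = cv2.filter2D(gray_face, cv2.CV_32F, kernel)
            feats.append(np.abs(response).ravel())
    return np.concatenate(feats)

# One Gabor feature vector per frame, then PCA to shrink the dimensionality.
frames = [np.random.rand(64, 64).astype(np.float32) for _ in range(60)]  # placeholder frames
X = np.stack([gabor_features(f) for f in frames])
X_reduced = PCA(n_components=40).fit_transform(X)   # 40 components is an arbitrary choice
print(X_reduced.shape)
```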
4. Experimental Results
In this section, the results of applying the proposed method along with PDICCA, KPCCA, KCCA, CFA, and KCFA to the described speech recognition and emotion recognition datasets are presented. As described before, the M2VTS database (https://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/) [35] is employed for audio-visual speech recognition, and the eNTERFACE (https://www.enterface.net/enterface05/docs/results/databases/project2_database.zip) [43] and Ryerson (RML) (https://www.kaggle.com/datasets/ryersonmultimedialab/ryerson-emotion-database) [44] databases are employed for audio-visual emotion recognition. The model parameters are estimated during the cross-validation phase. For the speech dataset, the number of hidden states and the number of Gaussians per state in the hidden Markov model (HMM) are set to three and one, respectively. For the emotion recognition datasets, the number of HMM hidden states and the number of Gaussians per state are six and three, respectively. The variance parameter of the Gaussian kernel is set to 14 for both applications.
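To make this classifier configuration explicit, the following minimal sketch uses the hmmlearn package (our own choice; the paper does not name an implementation) with the state and mixture counts reported above.

```python
from hmmlearn.hmm import GMMHMM

# Speech recognition: 3 hidden states, 1 Gaussian per state (one HMM per digit class).
speech_hmm = GMMHMM(n_components=3, n_mix=1, covariance_type="diag", n_iter=50)

# Emotion recognition: 6 hidden states, 3 Gaussians per state (one HMM per emotion class).
emotion_hmm = GMMHMM(n_components=6, n_mix=3, covariance_type="diag", n_iter=50)

# Typical usage pattern: fit one model per class on the fused feature sequences
# of that class with model.fit(X, lengths), then classify a test sequence by the
# model giving the highest model.score(X_test).
```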
The computational complexity of the kernel-based methods depends on the number of samples. Here, the dimension of the audio-visual features is 200. In addition, 10 subjects are selected randomly in each experiment to overcome the memory shortage. 70% of the subjects are selected for the training phase, and the rest are used as the test set. We repeat this split 10 times, and the final results are determined by averaging over the experiments. The final results are shown in Figures 4–15.
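The evaluation protocol above can be sketched as follows; the helper names and the train_and_test callback are placeholders for whichever fusion method and classifier are being evaluated.

```python
import numpy as np

def evaluate_once(rng, all_subjects, train_and_test):
    """One experiment: pick 10 random subjects, split them 70/30 by subject,
    and return the accuracy of the supplied train/test routine."""
    chosen = rng.choice(all_subjects, size=10, replace=False)  # requires >= 10 subjects
    rng.shuffle(chosen)
    train_subjects, test_subjects = chosen[:7], chosen[7:]
    return train_and_test(train_subjects, test_subjects)

def repeated_evaluation(all_subjects, train_and_test, repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    scores = [evaluate_once(rng, all_subjects, train_and_test) for _ in range(repeats)]
    return np.mean(scores), np.std(scores)
```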
Figures 4–15: audio-visual recognition accuracy, F1-score, and ROC results discussed below; each figure contains panels (a)–(d).
Figures 4 and 5 report the audio-visual emotion recognition accuracy and F1-score on the eNTERFACE and Ryerson (RML) datasets for the CCA, CFA, and PDICCA methods; the results are calculated for different dimension sizes.
Figure 6 reports the audio-visual emotion recognition ROC curves for the best accuracy results on the eNTERFACE and Ryerson (RML) datasets for the CCA, CFA, and PDICCA methods; the ROC curves are shown for each emotion class.
The experimental results for audio-visual speech recognition accuracy and F1-score using MFCC and PLP acoustic features, for feature fusion and decision fusion, are shown in Figures 7 and 8, respectively. In the proposed algorithm, dimension sizes between one and five are considered for the independent space. The results indicate that the best results at the feature level and decision level are achieved with three and five dimensions, respectively. The ROC curves for each class for the best accuracy results are reported in Figure 9. These results show that fusing the dependent and independent parts of bimodal data can improve the recognition accuracy. Nevertheless, the accuracy of the linear methods is still not acceptable for a real application; therefore, we applied the kernel versions of these methods to the same features in order to increase the accuracy.
Figures 10 and 11 depict the experimental results for the proposed KPDICCA and the state-of-the-art KCCA and KCFA on the eNTERFACE and Ryerson (RML) datasets in terms of the accuracy and F1-score metrics. In the proposed KPDICCA method, we set different dimension sizes for the independent latent variables and report the validation results. These results are reported for different values of the regularization parameter, and they demonstrate that this parameter affects the recognition performance; however, it is difficult to identify an optimal interval for it. For instance, in the decision fusion case, the best results for the eNTERFACE and RML datasets are obtained at r = 1.0 and r = 0.4, respectively.
Figure 12 reports the audio-visual emotion recognition ROC curves for the best accuracy results on the eNTERFACE and Ryerson (RML) datasets for the KPDICCA, KCCA, and KCFA methods; the ROC curves are shown for each emotion class.
The audio-visual speech recognition accuracy and F1-score of the conventional methods, together with the proposed method, on the M2VTS database using MFCC and PLP acoustic features are presented in Figures 13 and 14. In KPDICCA, dimension sizes between one and six are considered for the independent space, and the results indicate that the best result is achieved with an independent-space dimension of four.
Figure 15 depicts the audio-visual speech recognition ROC curves for the best accuracy results on the M2VTS database using MFCC and PLP features for the KPDICCA, KCCA, and KCFA methods. The ROC curves are shown for each class.
By comparing the recognition accuracy and F1-score in Figures 4-5, 7-8, 10-11, and 13-14 on the real datasets, we can see that the relation between audio and video data is nonlinear, and using a kernel can handle this nonlinearity and yield suitable accuracy for the emotion and speech recognition systems. This superiority emerges from incorporating dependent and independent variables that carry nonlinear information. It can also be interpreted that the extra information underlying the superiority of KPDICCA over KCCA is the incorporation of the independent latent variables as modality-specific features in the HMMs. In other words, when we consider only the common information between two modalities, some discriminative features belonging to each modality, which carry unique and independent information, are removed. This elimination decreases the performance of the recognition system. As we can see from the results, the performance of KPDICCA declines when the dimension of the elicited variables is increased. Moreover, it should be pointed out that in the RML dataset, due to the low number of samples, increasing the number of elicited features decreases the performance of both the KCCA and KPDICCA methods, which is caused by the curse of dimensionality.
5. Conclusion
In this paper, a novel approach for audio-visual information fusion based on probabilistic dependent-independent canonical correlation analysis (PDICCA) is proposed. Empirical results reveal that fusing the dependent and independent latent variables of bimodal inputs can increase recognition accuracy. Although a combination of nonlinear dependent latent features and set-specific (independent) features provides more discriminative information than using the dependent latent features alone, the dependent latent variables have the largest share in the final results, and the nonlinear independent features can be considered auxiliary features that slightly improve the performance of the recognition system. This superiority arises from the fact that KPDICCA captures the data variation in its covariance matrices, while KCCA and KCFA do not consider any input tolerance in their formulations. Our experimental results confirm the feasibility and efficiency of KPDICCA for multimodal data fusion applications. The method provides good results on low-dimensional inputs; for high-dimensional inputs, selecting a suitable regularization factor enables better handling when the covariance matrix is sparse, and the results on the emotion datasets confirm this claim.
In future work, temporal information can be added to the proposed model to increase its performance. To extend this study, other types of kernels can be assessed for different applications, and other types of regularization models can be employed. Furthermore, to achieve the best kernel map and the best dependent and independent latent features, deep learning approaches such as deep CCA (DCCA) and deep canonically correlated autoencoders (DCCAE) can be used.
Data Availability
The data used to support the findings of the study are available at the following links: https://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb, https://www.enterface.net/enterface05/docs/results/databases/project2_database.zip, and https://www.kaggle.com/datasets/ryersonmultimedialab/ryerson-emotion-database.
Conflicts of Interest
The authors declare that they have no conflicts of interest.