Abstract

This paper studies an average framing linear prediction coding (AFLPC) method, proposed in our previous work, for a text-independent speaker identification system. Linear prediction coding (LPC) has been used in numerous speech recognition tasks. Here, the AFLPC speaker recognition system is investigated in a noisy environment. In the feature extraction stage, the speaker-specific resonances of the vocal tract are extracted using the AFLPC technique. In the classification stage, a probabilistic neural network (PNN) and a Bayesian classifier (BC) are applied for comparison. Different wavelet transforms combined with AFLPC are compared with each other, and the proposed system is compared with other systems suggested in the literature. The experimental results in a noisy environment indicate that the PNN classifier performs better when the fusion of wavelets and AFLPC, termed WFALPCF, is used as the feature extraction technique.

1. Introduction

Automatic speech recognition (ASR) is not a new problem. It has been tackled by many researchers, described in the speech literature, and applied in several applications [1]. ASR is now considered an essential tool for many applications, such as those used by people with hearing problems, and for voice-controlled services [2]. The heart of ASR is the speech feature extraction algorithm, and one of the most common algorithms used for feature extraction is based on the Karhunen-Loève transform (KLT). KLT has been successfully applied in speaker identification [3]. KLT is considered the optimal feature extraction technique for speech recognition applications in terms of the mean square error (MSE) and energy packing. Most speaker identification systems consider such features as the mel-frequency cepstral coefficient (MFCC) [4] and the linear predictive cepstral coefficient (LPCC) [5, 6]. One of the weaknesses of MFCC is its use of the short-time Fourier transform (STFT), which assumes that the signal is stationary and has weak time-frequency resolution. On the other hand, the wavelet transform, which has good time-frequency resolution, is widely used as a feature extraction method [7–9]. The wavelet transform is a decomposition method that approximates the signal at hand using a set of wavelet coefficients. Such a decomposition can be made at different levels of abstraction (scales). The way the signal is abstracted depends on the approximation function (the mother wavelet), which gives a wide range of possibilities for different applications. To obtain a coarser approximation of a signal, the mother wavelet is dilated and then translated over the signal in time. Accordingly, a wavelet transform has two parameters: the scale and the translation parameters.
The scale parameter can be a positive real number and the translation parameter can be an arbitrary real number; however, for computational efficiency, discrete values are often used [10, 11]. Wavelet analysis has been used for feature extraction in speech recognition. In [12], the discrete wavelet transform (DWT) is used as a feature extraction method instead of discrete cosine transform. In [8], the wavelet transform is used to generate high-energy wavelet coefficients that suffer from shift variance, or subband energies are used instead of the mel filter-bank subband energies proposed in [13]. In [14], wavelet packet transform (WPT) bases are used to approximate the mel-frequency division using Daubechies orthogonal filters. The difference between WPT and DWT is that WPT decomposes both details and approximations. It was found that feature extraction using WPT gave a better recognition performance compared to the DWT [6, 15]. Nevertheless, the recognition time grows in a nonlinear fashion as a function of the number of wavelet packet bases. Therefore, the dimensionality of the problem becomes an issue. In [15], a feature extraction method based on a wavelet eigenfunction was proposed. In [9], a text-independent speaker identification system is proposed where the learning of the correlation between the wavelet transform and the expression vector is performed by the kernel canonical correlation analysis. WPT is used to perform the recursive. Defining the most relevant features is crucial in any learning problem, and speaker recognition is no exception [16, 17]. One way to define the set of relevant features [6] is to use a criterion function that can show the classification power of the individual features. Sarikaya et al. [18] proposed a wavelet packet perceptual decomposition tree that yields the wavelet packet parameters (WPP). 
The authors in [19] proposed the energy indexes of DWT or WPT for speaker identification; WPT was superior in terms of recognition rate [6]. In [20], sure entropy was calculated for the waveforms of the terminal node signals obtained from DWT [20] for speaker identification. The Bayesian classifier (BC) classifies objects based on the statistical characteristics of the classes. It has been employed in many learning applications such as computer vision [21], text classification [22], target tracking [23], health diagnosis [24], and speech recognition [6, 25]. A Bayesian classifier can utilize different features from different sources to produce a probabilistic classification model. Moreover, an optimal classifier is obtained as long as both the actual and estimated distributions agree on the most probable class. Bayesian classifiers give good performance even when strong feature dependencies are present [22, 23].

In this paper, the speaker identification system is studied in terms of its recognition rate in noisy environments. This work studies several methods for improving previously published work [6]. Our purpose is to improve the performance of the AFLPC technique in several types of noisy environments. To this end, several techniques, such as feature fusion with DWT or Shannon entropy, are investigated. The structure of this paper is as follows. We present the wavelet packet transform feature extraction method, followed by the classification techniques. Next, the results and discussion are presented, followed by the conclusions.

2. The Wavelet Packet Transform Feature Extraction Method

The wavelet packet can assist greatly in building a quality feature extraction method for speaker recognition. In the presented study, WPT is utilized for speaker feature extraction, but the resulting data have high dimensionality. Therefore, we need a better representation of the speech features. In [26], the authors used the entropy value obtained from the wavelet norm in digital modulation recognition. In the biomedical field, [27] proposed a combination of a genetic algorithm and a wavelet packet transform for pathological investigation, where the features were obtained from a group of wavelet packet coefficients. In [28], a robust speech recognition scheme for noisy environments was proposed, using the wavelet-based energy as a threshold for denoising estimation. In [19], the energy indexes of WP were proposed for a speaker identification task. For the speaker identification task, entropy was used for the waveforms obtained from DWT [6, 20]. In [29], a feature extraction method for speaker recognition based on a combination of three entropy types (sure, logarithmic energy, and norm) was proposed. In this paper, we use LPCC obtained from WP tree nodes to construct the speaker feature vector used for speaker identification. The proposed feature extraction method is summarized as follows.

(i) Silence removal and normalization: before the feature extraction stage, the speech data are processed by a silence removal algorithm and then preprocessed by normalizing the speech signals. This produces speech signals with approximately equal maximum amplitudes. Unequal signal amplitudes come from the different speaker volumes [6].

(ii) WP tree decomposition: the speech signal is decomposed into WP at level three, and then the AFLPC is obtained from each WT subsignal:
$$u_q = \left[u_q^{(1)}, u_q^{(2)}, \ldots, u_q^{(N)}\right], \tag{1}$$
where $N$ is the number of considered frames (each frame is of 20 ms duration) for the $q$th WT subsignal $u_q$. The average of the LPC coefficients calculated for the $N$ frames of $u_q$ is utilized to extract a wavelet subsignal feature vector as follows:
$$\mathrm{aflpc}_q = \frac{1}{N}\sum_{n=1}^{N}\mathrm{LPC}\left(u_q^{(n)}\right). \tag{2}$$
The feature vector of the whole given speech signal is
$$\mathrm{AFLPC} = \left[\mathrm{aflpc}_1, \mathrm{aflpc}_2, \ldots, \mathrm{aflpc}_Q\right], \tag{3}$$
where $Q$ is the number of WT subsignals.

(iii) WP tree decomposition: the speech signal is decomposed into WP at level seven, and we propose the Shannon entropy to extract features from each WP subsignal. The extracted features are added to the AFLPC features to investigate whether they improve the performance of the method proposed in [6]. To calculate the Shannon entropy, the following equation is used:
$$E(s) = -\sum_{i} s_i^2 \log\left(s_i^2\right), \tag{4}$$
where $s$ is the signal and $s_i$ are the WPT coefficients.
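The wavelet packet decomposition and the entropy feature described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Haar filter is swapped in for the db5 wavelet used in this work so that the example needs nothing beyond NumPy, the entropy is taken in the common non-normalized form $-\sum_i s_i^2 \log(s_i^2)$ with the convention $0 \log 0 = 0$, and the function names (`haar_split`, `wp_tree`, `shannon_entropy`) are hypothetical.

```python
import numpy as np

def haar_split(x):
    """One analysis step: return (approximation, detail) of a signal.

    Haar is an assumption made for brevity; the paper uses db5.
    """
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                      # pad odd-length input with its last sample
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def wp_tree(x, level):
    """Full wavelet packet tree as a list of levels; tree[0] holds the input."""
    tree = [[np.asarray(x, dtype=float)]]
    for _ in range(level):
        nxt = []
        for node in tree[-1]:
            # Unlike the DWT, the WPT splits both approximations and details.
            nxt.extend(haar_split(node))
        tree.append(nxt)
    return tree

def shannon_entropy(coeffs):
    """Non-normalized Shannon entropy of a vector of wavelet coefficients."""
    s2 = np.asarray(coeffs, dtype=float) ** 2
    s2 = s2[s2 > 0]                     # convention: 0 * log(0) = 0
    return float(-np.sum(s2 * np.log(s2)))

if __name__ == "__main__":
    signal = np.sin(0.1 * np.arange(256))
    tree = wp_tree(signal, 7)
    print("nodes in the full tree:", sum(len(level) for level in tree))
    entropies = [shannon_entropy(node) for level in tree for node in level]
    print("entropy features:", len(entropies))
```

A full tree at level seven has 1 + 2 + 4 + ... + 128 = 255 nodes when the root is counted, which agrees with the number of subsignals quoted for level seven.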

The proposed method can be explained in more detail through the following steps.

(i) The dataset's speech signals are given to the silence removal and normalization block, where they are preprocessed for the subsequent feature extraction stages.

(ii) The signal is decomposed into a WP tree that contains a number of WP subsignals depending on the selected level. For example, a speech signal may have 255 WP subsignals at level seven. The Daubechies 5 wavelet function (db5) is used.

(iii) Each subsignal is individually divided into a number of frames, and 12 LPC coefficients are calculated for each frame. The feature of each subsignal is the average of the 12-coefficient LPC vectors over its frames.

(iv) The averaged 12-coefficient LPC vectors obtained from all subsignals are placed in one vector to form the feature vector of the speech signal.

(v) Finally, the Shannon wavelet entropy is calculated for each subsignal and added to the feature vector of the speech signal.

The validity of the proposed feature extraction method in the speaker identification task is shown in Figure 1. Figure 1(a) illustrates two feature vectors of different speakers obtained by the proposed method; they are less similar. Figure 1(b) illustrates two feature vectors of the same speaker obtained from two different speech signals; here the features have nearly similar shapes.
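The framing and averaging steps can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the LPC coefficients are computed with the autocorrelation method and the Levinson-Durbin recursion (one common choice; the paper does not state which LPC estimator was used), the frame length is given in samples as a hypothetical parameter `frame_len` because the number of samples in a 20 ms frame of a decimated subsignal is not specified, and the function names are illustrative.

```python
import numpy as np

def lpc(frame, order=12):
    """Return a[1..order] of the prediction error filter 1 + sum_k a[k] z^-k.

    Autocorrelation method with the Levinson-Durbin recursion.
    The frame must contain more than `order` samples.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    if err <= 0:                        # silent frame: nothing to predict
        return a[1:]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                  # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
        if err <= 0:
            break
    return a[1:]

def aflpc(subsignal, frame_len=160, order=12):
    """Average the LPC vectors of all full frames of one subsignal."""
    x = np.asarray(subsignal, dtype=float)
    n_frames = len(x) // frame_len
    if n_frames == 0:
        raise ValueError("subsignal is shorter than one frame")
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean([lpc(f, order) for f in frames], axis=0)

def feature_vector(subsignals, frame_len=160, order=12):
    """Concatenate the averaged LPC vectors of all subsignals."""
    return np.concatenate([aflpc(s, frame_len, order) for s in subsignals])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subsignals = [rng.standard_normal(400) for _ in range(8)]
    print(feature_vector(subsignals, frame_len=100).shape)
```

With 12 coefficients per subsignal, the length of the resulting vector is 12 times the number of subsignals that are kept, so the dimensionality is controlled by the decomposition level rather than by the duration of the utterance.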

Linear prediction coding is one of the most attractive signal processing methods for speech signal analysis, particularly in speech and speaker recognition. In this work, LPCC with WP (LPCCWP) and AFLPC with WP (WPLPCF) were tested and compared for speaker recognition. The comparison was based on the recognition sensitivity (RS), which is proposed here for the first time and is defined in terms of two quantities: the correlation coefficient calculated for different signals of the same speaker and the correlation coefficient calculated for different speakers. The recognition sensitivity results calculated for 75 different signals show a big difference between the two methods, where the LPCCWPF provides the best results. The results are illustrated in Figure 2.

3. Classification

Several improvements, which include enhancements, extensions, and generalizations, have been proposed using a probabilistic neural network [30, 31]. The learning capability [32] and the classification accuracy of PNNs were the points of improvement [6]. This improvement will be reflected in the processing time [33] as well as the model complexity.

PNN is selected as the candidate classifier [6]. The ability of PNN to work in an unsupervised mode was the motivation for this selection. Also, PNN is easy to implement, and the output scores of the classes can be interpreted as the confidence (probability) of each class for the feature vector given at the input of the PNN. The training phase of PNN is not time-consuming; because training is achieved almost instantly, PNN is considered in many applications. Although many improved variants of the original probabilistic neural network exist, we consider the original PNN for classification for ease of explanation. The suggested algorithm is denoted by PNN and depends on the following structure [6]. We have used a simple PNN in this work as follows:
$$\mathrm{Net} = \mathrm{PNN}(P, T, SP),$$
where $P$ is a 180 × 494 matrix of input speaker feature vectors (patterns), each consisting of 180 average framing LPC coefficients (denoted above by AFLPC) taken from WP subsignals for net training. Information related to the mel-frequency cepstral coefficients and the PNN that we used for comparison can be found in [6, 34]. $T$ is the target class vector, which holds the speaker class of each training pattern. The spread of the radial basis functions, $SP$, was set to the typical value ($SP = 1$). If $SP$ approaches zero, the network behaves as a nearest neighbor classifier. If $SP$ becomes larger, the designed network takes into account several nearby design vectors. We create a two-layer network.
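The two-layer structure can be illustrated with a small sketch. It is written under assumptions that the text does not fix: the pattern layer uses a plain Gaussian kernel $\exp(-d^2 / (2\,SP^2))$ of the Euclidean distance $d$, the summation layer adds the activations per class, and the scores are normalized to sum to one. The function name `pnn_classify` and the toy data are hypothetical; the layout of `P` (one training pattern per column) follows the 180 × 494 matrix described above.

```python
import numpy as np

def pnn_classify(P, T, x, spread=1.0):
    """Classify one feature vector x with a basic probabilistic neural network.

    P      : (n_features, n_train) training patterns, one per column
    T      : (n_train,) class label of each training pattern
    spread : width of the radial basis functions (SP in the text)
    """
    P = np.asarray(P, dtype=float)
    T = np.asarray(T)
    x = np.asarray(x, dtype=float)
    # Pattern layer: one radial basis unit per training pattern.
    d2 = np.sum((P - x[:, None]) ** 2, axis=0)
    act = np.exp(-d2 / (2.0 * spread ** 2))
    # Summation layer: add the activations of each class.
    classes = np.unique(T)
    scores = np.array([act[T == c].sum() for c in classes])
    total = scores.sum()
    if total > 0:
        scores = scores / total        # scores can be read as class confidences
    return classes[np.argmax(scores)], scores

if __name__ == "__main__":
    P = np.array([[0.0, 0.1, 5.0, 5.1],
                  [0.0, 0.0, 5.0, 5.0]])
    T = np.array([0, 0, 1, 1])
    print(pnn_classify(P, T, np.array([0.2, 0.0])))
```

Because training amounts to storing the columns of `P`, no iterative optimization is needed. A very small `spread` lets the closest stored pattern dominate the score, while a larger one lets several nearby patterns contribute.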

Naive Bayesian Classifier. Using the statistics of the features derived from AFLPC Method I (3), we can determine the likelihood of each feature belonging to the speaker to be verified. Let $C_i$ be a speaker class, and let $X$ be data that provide information (3) about $C_i$. Then $P(C_i \mid X)$ is calculated as follows:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}.$$
The first step is to estimate $P(X \mid C_i)$, usually referred to as the likelihood function. This is achieved using training speaker feature samples. Features are collected by applying DWT and DWTF on speaker signals. Figure 3 shows the histogram of three different speakers' features. It clearly shows that using the speaker feature statistics does not give clear distinctions among speaker classes. Therefore, applying GMM [25] leads to low classification rates because speaker feature distributions, as in Figure 3, can be modeled as a single Gaussian distribution but with no discrimination among classes. Accordingly, we built a likelihood function for each feature (3) per speaker. It was found that most features can be modeled as a Gaussian distribution. For each feature, there is a probability score for each speaker class. Therefore, let $x_j$ be a feature in the feature vector (3). Then $P(x_j \mid C_i)$ is calculated as follows:
$$P(x_j \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right), \quad j = 1, \ldots, M,$$
where $\mu_{ij}$ and $\sigma_{ij}$ are the mean and standard deviation of feature $x_j$ estimated from the training samples of class $C_i$, and $M$ is the total number of features in AFLPC (3). Under the assumption of conditional independence,
$$P(X \mid C_i) = \prod_{j=1}^{M} P(x_j \mid C_i).$$
The a posteriori probability $P(C_i \mid X)$ is computed through Bayesian fusion of all features' probabilities in the speaker signal AFLPC. $P(C_i)$ is the prior probability of the speaker class $C_i$; it is assumed that all classes are equally likely. $P(X)$ is a normalization term. The maximum a posteriori probability (MAP) rule is used to estimate the speaker class that maximizes $P(C_i \mid X)$:
$$\hat{C} = \arg\max_{C_i} P(C_i \mid X).$$
Similarly, features from different approaches can be combined to produce probability scores for each speaker. The essence of this approach is a way to combine different methods in a probabilistic manner. One method produces features that are different from other methods, and features from different methods suffer from independent noise.
As a result of the combined methods, features will be more descriptive, and noise can be eliminated in the process of fusion.
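The classification rule described in this section can be sketched as a Gaussian naive Bayes classifier with equal priors. The sketch assumes one Gaussian per feature and per speaker class, works with log-probabilities to avoid numerical underflow when many features are multiplied, and adds a small variance floor; the floor value, the function names, and the toy data are assumptions made only for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a per-class, per-feature mean and variance.

    X : (n_samples, n_features) training feature vectors
    y : (n_samples,) speaker class of each training vector
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    # The variance floor (an assumption) avoids division by zero.
    var = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, mu, var

def predict_map(model, x):
    """Return the MAP class and the posterior probabilities for one vector."""
    classes, mu, var = model
    x = np.asarray(x, dtype=float)
    # Conditional independence: the log-likelihood is a sum over features.
    log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var,
                            axis=1)
    log_post = log_lik + np.log(1.0 / len(classes))   # equal priors
    log_post -= np.logaddexp.reduce(log_post)         # normalization term
    return classes[np.argmax(log_post)], np.exp(log_post)

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
    y = np.array([0, 0, 1, 1])
    model = fit_gaussian_nb(X, y)
    print(predict_map(model, np.array([0.1, 0.0])))
```

Features produced by different extraction methods can be fused in the same way by concatenating them before fitting, since each feature contributes one independent term to the sum.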

4. Results and Discussion

The experimental setup was as follows. Speech signals were recorded via a PC sound card with a maximum spectral frequency of 4000 Hz and a sampling frequency of 8000 Hz. Fifty people participated in the recordings. Each participant recorded a minimum of 20 different utterances in the Arabic language. The speakers, 28 males and 22 females, were aged from 20 to 45 years. The recordings were made under normal university office conditions. The performance of the text-independent speaker identification system was investigated via several experiments using 494 training signals and 50 classes. Even though speaker recognition performance has been maturing and improving over time, it is still inadequate in terms of accuracy [6]. In this work, we study speaker identification by the wavelet transform in its two forms, DWT and WPT, in noisy environments, which allows a comprehensive investigation. In other words, the presented study aims to build a system that recognizes speakers even in a noisy environment. The system was applied to a large number of training signals. We solved the problem by using the speaker recognition method (feature extraction and then classification). This approach was based on a combination of LPC and WT to extract the features of the speakers from normalized and silence-removed signals. The obtained feature vector is fed to the PNN or the Bayesian classifier (BC) to be classified, as seen in the flowchart presented in Figure 4.

Table 1 shows the results for the testing data of the whole database. Half of the signals from each individual speaker are used for training. Each test signal is classified by the classifier through comparison with the 494 training signals. The two classifiers are compared. The results are compared with those of DWT with average framing LPC (DWLPCF) [12], DWT and WP fusion with AFLPC (WFALPCF), and WPLPCF combined with Shannon entropy obtained from WP at level 7 (WPLPCFE) [29]. The results are tabulated in Table 1; in the case of PNN, the best results were obtained by WFALPCF and WPLPCF, with 96.57% and 96.37%, respectively. BC showed excellent performance for DWT and WP fusion with AFLPC with WP at level 2 (WFALPCF WP2), with a recognition rate reaching 96.57%. BC failed for WPLPCF with WP at level 5. This happens because the longer the feature vector, the fewer of its features are Gaussian distributed.

In Figure 5, further investigation for a better feature extraction method is illustrated. WPLPCF, WFALPCF, and WPLPCFE were tested and compared for speaker recognition. The comparison was performed based on the recognition sensitivity (RS), which is proposed in Section 2. The RS method is an objective way to evaluate the feature extraction method’s ability to give similar feature vectors for the same speaker’s signals and not similar ones for different speakers’ signals. This gives a more elaborate study about the feature extraction methods. The recognition sensitivity results calculated for 75 different signals show that there is a slight difference between the presented methods, where the WFALPCF provides best results. The results are illustrated in Figure 5.

In the following experiments, several feature extraction methods with the PNN and BC classifiers were analyzed to assess the usefulness of the proposed systems in noisy environments. The first experiment investigates the proposed method in terms of recognition rate in additive white Gaussian noise (AWGN) at SNRs of 5 and 0 dB. The results of WPLPCF, DWLPCF, WFALPCF, and WPLPCFE are tabulated in Table 2. DWT was processed at level 5 with six subsignals, while WP was processed at level 5 with a larger number of subsignals. It was found that the recognition rate (at 5 dB SNR) of DWLPCF at level 5 with BC was very good (90.73%), but BC failed when WP at level 5 was used. The WFALPCF methods were superior (75.40%) in the case of PNN. The same pattern was observed in the 0 dB SNR case.
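A noisy test signal at a prescribed SNR can be generated as in the following sketch. It assumes that the SNR is defined as the ratio of the average signal power to the average noise power over the whole utterance, and it scales the noise with its empirical power so that the requested SNR is met exactly for the given realization. The function name `add_awgn` and the test tone are hypothetical.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise to a signal at the requested SNR in dB.

    Returns the noisy signal and the noise that was added.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the scale so that p_signal / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    noise = scale * noise
    return signal + noise, noise

if __name__ == "__main__":
    fs = 8000                                    # sampling frequency in Hz
    t = np.arange(fs) / fs
    clean = np.sin(2.0 * np.pi * 440.0 * t)      # stand-in for a speech signal
    noisy, noise = add_awgn(clean, 5.0, np.random.default_rng(0))
    snr = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))
    print(round(snr, 6))
```

The same scaling applies to other noise types: only the line that draws `noise` changes, for example to samples of a recorded babble or subway noise of the same length.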

The next experiment investigates the proposed method in terms of the recognition rate in Rayleigh distributed noise (RDN), with 5 and 0 dB. WPLPCF, DWLPCF, WFALPCF, and WPLPCFE were tested by means of PNN and BC. It was found that the WFALPCF methods were superior in 5 and 0 dB (90.52% and 84.27%) in the case of PNN. The results are tabulated in Table 3.

The next experiment investigated the proposed method in terms of recognition rate in real babble noise. The results are tabulated in Table 4, where WPLPCF, DWLPCF, WFALPCF, and WPLPCFE were tested by means of PNN and BC. It was found that the WFALPCF methods were superior at 5 dB (88.10%) in the case of PNN. At 0 dB SNR, WFALPCF with PNN (74.40%) and DWLPCF with BC (82.46%) were superior.

The last experiment investigated the proposed method in terms of recognition rate in real subway noise. The results are tabulated in Table 5. It was found that the WFALPCF methods were superior at 5 dB (89.11%) in the case of PNN. At 0 dB SNR, WFALPCF with PNN (75.12%) and DWLPCF with BC (84.07%) were superior.

Finally, by averaging the recognition rates obtained in noisy environments over several groupings (see Table 6), we can conclude that the best average result was achieved by WP and DWT fusion with PNN (78.84%). Among the classifiers, the methods that used PNN gave the best average result (61.19%). Among the wavelet transforms, WP showed the best performance in noisy environments (63.13%). The BC method gave excellent results when the features were normally distributed as well as in the case of normally distributed noise.

A standard dataset is an essential tool for testing the presented algorithms. The TIMIT database is one of the most popular standard databases. This dataset contains 630 speakers, 438 males and 192 females, from eight major dialect regions of American English. For each speaker, ten signals with 8 kHz sampling frequency were labeled (https://catalog.ldc.upenn.edu/LDC93S1). The same experiments with the four types of noise were conducted on the TIMIT database, in particular for the fusion methods with PNN and BC. In total, 250 speakers were involved in the experiments. Each run used 50 speakers, and the results of the runs were averaged to obtain the final result. Table 7 tabulates the results obtained for TIMIT.

5. Conclusions

In this work, the speaker identification task based on AFLPC was investigated in a noisy environment. The advantage of AFLPC is its ability to reduce the enormous amount of speech data to a lower dimension, about 0.16% of the original size, with a relatively high computing speed. For classification, PNN and BC were studied. We demonstrated the recognition rate of this method on a speaker database of 50 individual speakers (28 male speakers and 22 female speakers). A total of 494 different signals were used for training the 50 speakers in the experiments. The experimental outcomes showed that both DWT and WP, linked with AFLPC (WFALPCF), are appropriate feature extraction techniques in noisy environments. However, in the case of PNN, DWT produced better quality than WP in terms of recognition rate with BC. The results were obtained using the same feature vector length (180 coefficients, DWT at level 5, and WP at level 2). The results of WP were enhanced by using higher levels, unlike DWT, which showed no apparent enhancement after feature vector length extension. Compared with other available methods, WFALPCF achieved a higher recognition rate.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, under Grant no. 12-135-35-RG. The authors, therefore, acknowledge with thanks DSR technical and financial support.