Abstract

Most binaural speech source localization models perform poorly in previously unseen noisy and reverberant conditions. Here, this issue is addressed with a multiscale dilated convolutional neural network (CNN). The time-related crosscorrelation function (CCF) and the energy-related interaural level differences (ILD) are preprocessed in separate branches of the dilated convolutional network. The multiscale dilated CNN can encode discriminative representations for the CCF and ILD, respectively. After encoding, the individual interaural representations are fused to map the source direction. Furthermore, in order to improve parameter adaptation, a novel semiadaptive entropy is proposed to train the network under directional constraints. Experimental results show that the proposed method can adaptively locate speech sources in simulated noisy and reverberant environments.

1. Introduction

Speech source localization (SSL) has attracted growing attention in the past decades. It is widely applied in human-robot interaction and video conferencing systems. Binaural speech source localization is a subtask of SSL that aims at estimating the direction of arrival (DOA) of a speech source using audio signals recorded by binaural microphones mounted in the artificial ears of a dummy head [1]. The pipeline of binaural speech source localization contains two steps. The first step extracts interaural cues, i.e., interaural time differences (ITD) and interaural level differences (ILD), from the received binaural signals [2–4]. With the inclusion of the dummy head, the frequency-dependent characteristics of spatial cues can be captured by the head-related transfer function (HRTF) [5, 6]. This frequency dependency motivates the use of time-frequency representations for binaural signals. A typical time-frequency representation for binaural signals is based on Gammatone filters, which are usually used to simulate the peripheral processing of the human auditory system [7–9]. The second step, DOA estimation, applies geometric analysis techniques [1] or off-line models [4, 7] to map the interaural cues to the sound source DOA. Over the years, most methods have sought to improve binaural SSL from two aspects: estimating robust interaural cues and improving the generalization of learning-based models.

Interaural time difference is the time delay corresponding to the maximum value of the crosscorrelation function of the left and right microphone signals. Interaural level difference is the logarithmic difference of the power energy between the left and right microphone signals. However, in noisy and reverberant environments, additional peaks appear in the crosscorrelation function and the target speech source loses energy. These additional peaks and the energy loss lead to unreliable interaural cue estimation. In order to refine these unreliable interaural cues, a time-delay compensation method was proposed to align ILD and ITD [10], a reverberation weighting method was proposed to suppress early and late reverberation [11], and an echo-free onset detection method was proposed to detect direct-path signals [12]. Since ITD is more robust at low frequencies (below 1.5 kHz) and ILD is more reliable at high frequencies [13], Gammatone filters are usually used to separate the low and high frequencies. Karthik and Ghosh used Gammatone filters to preprocess the binaural signals and mapped the frequency-dependent ITD to azimuths using ITD-azimuth templates [14]. May et al. modeled the ITD and ILD in sub-bands for every source direction using Gaussian mixture models (GMMs) [7]. In scenes with multiple active speech sources, the time-frequency (TF) representation of binaural signals can also distinguish noise from speech sources in different fragments. Christensen et al. investigated different TF weight estimation approaches for interaural cues [15]. Recently, deep neural networks have shown significant speech source localization performance against noise and reverberation, including time-frequency masking estimation [16] and multi-source localization [17]. Convolutional neural networks (CNN) can be used to estimate the broadband DOA of a speech source from phase components [17] and to jointly locate and classify multiple speech sources [18]. Frequency-dependent deep neural networks (DNN) and head movements can be exploited to detect multiple DOAs and resolve front-back confusions [19]. However, training such a robust and well-generalized model requires a large number of diverse acoustic conditions. Few studies have aimed to improve the adaptability of a model to previously unseen conditions. Takeda and Komatani proposed a training scheme for unsupervised adaptation of DNN parameters using self-entropy and parameter selection [20], and Wang et al. proposed a data-efficient method based on DNN and clustering to improve binaural localization performance under mismatched HRTF conditions [21], but the localization performance remains poor. To solve the off-grid problem, an off-grid BSSL method based on an off-grid wideband sparse Bayesian learning algorithm was proposed, which is only marginally better than the state-of-the-art HRTF-based BSSL methods [22]. It remains challenging to generalize learning-based models so that they adaptively locate binaural signals in previously unseen and adverse acoustic conditions.

Here, we propose a multiscale dilated CNN-based method to address these issues. The crosscorrelation function (CCF) and interaural level difference (ILD) are extracted from the binaural signals as input features. In order to preserve detailed spatial information, the CCF and ILD are preprocessed separately in dilated CNNs with specific dilation factors. Afterwards, the encoded interaural representations of CCF and ILD are fused to learn crossdomain information. The crossdomain information encoded by the multiscale dilated CNNs provides a trade-off between small and large receptive fields for the CCF and ILD features, which helps the network generalize across diverse acoustic conditions. A remaining problem in this network is how to adapt its parameters to unseen acoustic conditions. Drawing on research on unsupervised adaptation of network parameters [20], we propose a semiadaptive entropy as the objective function. Unlike self-entropy, the semiadaptive entropy includes a crossentropy part to improve localization performance. Besides, a learning factor is used to weight the attention paid to the crossentropy and the self-entropy.

In summary, our contributions are as follows:
(i) We propose a multiscale dilated CNN framework for binaural speech source localization, which effectively encodes crosscorrelation function and interaural level difference features with different dilation factors.
(ii) We propose a semiadaptive entropy for the CNN's parameter adaptation. Experimental results demonstrate that the multiscale dilated CNN trained with semiadaptive entropy achieves significant improvements over regular DNN and CNN in noisy and reverberant acoustic environments.

2. Multiscale Dilated CNN

Suppose that there is only one target speaker. The received binaural signals can be formulated by convolving the speech signal with the head-related impulse responses (HRIR) in the time domain as

$$x_i(m) = s(m) * h_i,$$

where the symbol $*$ represents the convolution operation, $i \in \{l, r\}$ represents the binaural microphone index ($l$ and $r$ refer to the left and right microphones), $m$ is the index of the time frame, $s$ denotes the speech signal, and $h_i$ denotes the head-related impulse response. In order to resemble the frequency selectivity of the human cochlea, the binaural signals are decomposed into 32 auditory channels using a fourth-order Gammatone filter bank [23]. The centre frequencies of the Gammatone filters are logarithmically equally spaced on the equivalent rectangular bandwidth scale between 80 Hz and 8 kHz. After filtering the binaural signals, the crosscorrelation function is computed between the left and right signals in each frequency sub-band independently. The CCF is further normalized by the autocorrelations of the left and right signals. The CCF is formulated as a function of the time delay $\tau$:

$$\mathrm{CCF}(b, \tau) = \frac{\phi_{lr}(b, \tau)}{\sqrt{\phi_{ll}(b, 0)\,\phi_{rr}(b, 0)}},$$

where $\phi_{lr}$ denotes the crosscorrelation between the left and right signals and $b$ is the index of the frequency sub-band. $\phi_{ll}(b, 0)$ and $\phi_{rr}(b, 0)$ denote the autocorrelations of the left and right signals at $\tau = 0$, respectively. Generally, the distance between the artificial ears of the dummy head is about 15–17 cm. According to the sound propagation speed, the arrival time difference between the two ears is within about 1.1 ms. In realistic conditions, considering the head shadowing effect, the maximum time delay is set to 2 ms. For example, the crosscorrelation function of binaural signals sampled at 16 kHz within a range of centre delays of ±2 ms forms a CCF matrix of size 32 × 65. The other interaural cue, ILD, is the logarithmic energy difference between the binaural signals, which is formulated as follows:

$$\mathrm{ILD}(b) = 10\log_{10}\frac{\sum_{n \in \mathcal{N}} x_l^2(b, n)}{\sum_{n \in \mathcal{N}} x_r^2(b, n)},$$

where $\mathcal{N}$ denotes the set of sample indexes in the frame. Since the binaural signals are framed into short, quasi-stationary segments and speech is intermittent, some frames carry no energy; these nonenergy frames are disregarded. The interaural level differences of the binaural signals over all frequency sub-bands form an ILD vector of size 32 × 1.
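For concreteness, the following Python/NumPy sketch computes the normalized CCF matrix and the ILD vector for one frame of sub-band signals. It is a non-authoritative illustration: the function name `ccf_ild_features` is ours, Gammatone filtering is assumed to happen upstream, and the small epsilon terms are added only for numerical safety.

```python
import numpy as np

def ccf_ild_features(xl, xr, fs=16000, max_delay_ms=2.0):
    """Compute the normalized CCF (32 x 65) and ILD (32,) for one frame.

    xl, xr: (32, frame_len) Gammatone sub-band signals of the left/right ear
            (the Gammatone filter bank itself is assumed to run upstream).
    """
    n_bands, frame_len = xl.shape
    max_lag = int(round(max_delay_ms * 1e-3 * fs))   # 32 samples at 16 kHz
    lags = np.arange(-max_lag, max_lag + 1)          # 65 centre delays (+-2 ms)
    ccf = np.zeros((n_bands, lags.size))
    ild = np.zeros(n_bands)
    for b in range(n_bands):
        l, r = xl[b], xr[b]
        # normalized crosscorrelation: phi_lr(tau) / sqrt(phi_ll(0) * phi_rr(0))
        norm = np.sqrt(np.dot(l, l) * np.dot(r, r)) + 1e-12
        full = np.correlate(l, r, mode="full")       # lags -(N-1)..(N-1)
        centre = frame_len - 1                       # index of zero lag
        ccf[b] = full[centre + lags] / norm
        # ILD: logarithmic energy ratio between the two ears
        ild[b] = 10.0 * np.log10((np.dot(l, l) + 1e-12) / (np.dot(r, r) + 1e-12))
    return ccf, ild
```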

2.1. Network Architecture

SSL can be regarded as a direction classification task based on a CNN. By dilating dense convolutional kernels with zeros, a dilated CNN can operate on a coarser receptive field and has shown robust performance for voice activity detection in noisy environments [24]. Therefore, dilated CNNs are adopted in our network to encode robust interaural features. The schematic diagram of the proposed multiscale dilated CNN is depicted in Figure 1. Two examples of dilated kernels with a kernel size of 3 are shown in the upper right of Figure 1. The number of zero cells between adjacent cells depends on the dilation factor (DF). Black blocks denote the parameters of the convolutional kernels that activate the corresponding input cells, while white blocks denote zeros that keep input cells inactivated. The number of zeros between two activated cells is DF − 1.
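As a minimal illustration of this zero-insertion (the helper `dilate_kernel` below is hypothetical, not from the paper), a 1-D kernel with DF − 1 zeros between adjacent taps can be constructed as follows:

```python
import numpy as np

def dilate_kernel(kernel, df):
    """Insert df - 1 zeros between adjacent taps of a 1-D kernel."""
    out = np.zeros((len(kernel) - 1) * df + 1)
    out[::df] = kernel
    return out

dilate_kernel(np.array([1.0, 2.0, 3.0]), 2)   # -> [1, 0, 2, 0, 3]
```

A kernel of size 3 with DF = 2 thus spans 5 input cells, enlarging the receptive field without adding parameters.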

In binaural speech source localization, the CCF and ILD reflect time-related and energy-related physical information, respectively. In our method, separate branches of the multiscale dilated CNN are designed to better capture independent interaural characteristics according to their physical meanings. The branch for the CCF consists of two parallel dilated CNNs, one of which stacks two dilated CNN layers with DF = 2 (i.e., dilation-2 CNN), while the other stacks two dilated CNN layers with DF = 5 (i.e., dilation-5 CNN). This multiscale dilated CNN is designed to locate the azimuths of binaural signals in the frontal hemifield within the range [−90°, 90°]. Taking 37 azimuths spaced at a step of 5° as an example, the 65 time-delay samples of the CCF are roughly twice the number of DOAs. The DOA of a signal is estimated by considering the maximum of the crosscorrelation together with the values surrounding this maximum within a kernel. In practice, adjacent DOAs within some angular distance are also considered. With this in mind, we implicitly include tolerance errors of 5° and 10° by setting the dilation factors to 2 and 5. The kernels with dilation factors 2 and 5 describe tolerances in the range [0°, 10°]. A dilation factor of 4 is not included since its coverage can be obtained by shifting a dilation-2 kernel twice. The other branch, for the ILD, consists of a single dilated CNN layer with dilation factor 2. All CNN layers employ 64 kernels, doubling the 32 frequency bands, and are activated by the rectified linear unit (ReLU) function with a dropout probability of 0.5. Max-pooling layers are added after each dilation-2 CNN layer to reduce the number of parameters but are excluded from the dilation-5 CNN to preserve details. Finally, all interaural representations are fused in a fully connected layer with 128 neurons, followed by an output layer with a Softmax activation function. The aforementioned parameters are sufficiently evaluated in the experiments.
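A minimal PyTorch sketch of this architecture is given below. It follows the layer counts, dilation factors, 64 kernels, dropout of 0.5, and 128-neuron fusion described above; everything else (kernel size 3 along both axes, the padding choices, dropout placement, `nn.LazyLinear` for inferring the fused dimension, and the log-Softmax output as a numerically stable Softmax variant) is our assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleDilatedCNN(nn.Module):
    """Two-branch multiscale dilated CNN for 37-way azimuth classification."""

    def __init__(self, n_azimuths=37):
        super().__init__()
        # CCF branch A: two stacked dilation-2 conv layers, each followed by max-pooling.
        self.ccf_d2 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, dilation=2, padding=2), nn.ReLU(), nn.Dropout(0.5),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2), nn.ReLU(), nn.Dropout(0.5),
            nn.MaxPool2d(2),
        )
        # CCF branch B: two stacked dilation-5 conv layers, no pooling (preserves detail).
        self.ccf_d5 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, dilation=5, padding=5), nn.ReLU(), nn.Dropout(0.5),
            nn.Conv2d(64, 64, kernel_size=3, dilation=5, padding=5), nn.ReLU(), nn.Dropout(0.5),
        )
        # ILD branch: a single dilation-2 conv layer over the 32 sub-bands.
        self.ild_d2 = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, dilation=2, padding=2), nn.ReLU(), nn.Dropout(0.5),
        )
        # Fusion: all representations are concatenated into a 128-neuron FC layer.
        self.fc = nn.LazyLinear(128)          # infers the fused input size on first call
        self.out = nn.Linear(128, n_azimuths)

    def forward(self, ccf, ild):
        # ccf: (batch, 1, 32, 65) normalized crosscorrelation matrix
        # ild: (batch, 1, 32)     interaural level difference vector
        a = self.ccf_d2(ccf).flatten(1)
        b = self.ccf_d5(ccf).flatten(1)
        c = self.ild_d2(ild).flatten(1)
        z = F.relu(self.fc(torch.cat([a, b, c], dim=1)))
        return F.log_softmax(self.out(z), dim=1)   # log-probabilities over azimuths
```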

2.2. Semiadaptive Entropy

As mentioned before, adjacent azimuths within some tolerance can be considered correct. Additionally, due to the intermittence of speech, weak-speech frames are inevitably dominated by noise. In this section, we propose a semiadaptive entropy to train the multiscale dilated CNN. In most regression tasks, the Kullback–Leibler divergence (KLD) is widely used to measure the similarity between two probability distributions. In this paper, the probability distributions refer to the true DOA and the estimated DOA in binaural speech source localization. The KLD can be formulated as a sum of the "truth" entropy and the soft crossentropy:

$$D_{\mathrm{KL}}(p\,\|\,q) = \sum_{\theta} p(\theta)\log p(\theta) - \sum_{\theta} p(\theta)\log q(\theta),$$

where $p(\theta)$ and $q(\theta)$ denote the probabilities of the true DOA and the estimated azimuth $\theta$, respectively. The DOA probability of a silent or noise-dominated frame is assumed to be uniformly distributed over the azimuths. With this assumption, the "truth" entropy term of the KLD is substituted by a uniform entropy. Besides, a learning factor $\lambda$ is applied to balance the crossentropy and the uniform entropy:

$$E_{\mathrm{semi}} = \overline{-(1 - \lambda)\sum_{\theta} p(\theta)\log q(\theta) - \lambda\sum_{\theta} u(\theta)\log q(\theta)},$$

where $u(\theta)$ denotes the uniform distribution over azimuths and the overline means averaging over the training samples. Under the directional constraint imposed by the crossentropy term, the network is able to fine-tune its parameters under diverse acoustic conditions. The ADADELTA [25] algorithm is used to minimize the loss function. Training is stopped if no lower error appears on the validation set within the last 3 epochs. The azimuth probability $P(\theta)$ of a received signal block consisting of contextual frames is produced by averaging the frame-level azimuth probabilities. The target DOA is estimated as $\hat{\theta} = \arg\max_{\theta} P(\theta)$.
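Assuming the semiadaptive entropy is the convex combination written above (the function names `semiadaptive_entropy` and `block_doa` below are ours, not the authors'), a sketch of the loss and the block-level DOA decision could look as follows:

```python
import torch
import torch.nn.functional as F

def semiadaptive_entropy(log_q, target, lam=0.9):
    """Convex mix of crossentropy with the true DOA and crossentropy
    with a uniform azimuth distribution (assumed form of the loss).

    log_q:  (batch, n_azimuths) log-probabilities from the network
    target: (batch,) indices of the true DOA classes
    lam:    learning factor balancing the two terms
    """
    ce = F.nll_loss(log_q, target)        # -log q(true DOA), averaged over the batch
    ue = -log_q.mean(dim=1).mean()        # H(u, q) with u(theta) = 1 / n_azimuths
    return (1.0 - lam) * ce + lam * ue

def block_doa(log_q_frames, azimuths):
    """Average frame-level probabilities over a block and pick the argmax DOA."""
    p_block = log_q_frames.exp().mean(dim=0)      # (n_azimuths,)
    return azimuths[int(torch.argmax(p_block))]
```

At `lam=0` this reduces to the ordinary crossentropy, and as `lam` approaches 1 the directional term vanishes, matching the behaviour discussed in Section 3.2.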

3. Experiments and Discussion

3.1. Experimental Setup

The proposed method is evaluated with a binaural setup in simulated acoustic conditions, covering signal-to-noise ratio (SNR), noise type, and reverberation time. The acoustic conditions are summarized in Table 1. Speech sources are positioned in the frontal plane between −90° and 90° with a step of 5°, i.e., 37 directions, and their elevations are the same as the receiver's. Based on the binaural signal formulation, the head-related impulse responses (HRIR) from the KEMAR dataset [26] are convolved with speech recordings from the TIMIT dataset [27]. To simulate noisy conditions, six kinds of common noise from the NOISEX-92 dataset [28] are properly truncated and added to each microphone signal at the same SNR. Each noise is processed as diffuse noise by summing the directional noises generated by convolving the noise with the HRIRs at 37 uncorrelated directions. To simulate reverberant conditions, an enclosure of 10 × 6 × 3 m is simulated using the Roomsim toolbox [29] based on the image method [30]. All surfaces in this room are equally reverberant. A dummy head, indexed by Subject_021 from the CIPIC dataset [31], is placed at the centre position. The source-to-sensor distance is 1.5 m. The binaural room impulse responses yielded by this reverberant setup are convolved with testing speech recordings to generate a reverberant data set. All binaural speech mixtures are sampled at 16 kHz and framed by a Hamming window of 512 samples with a shift of 256 samples. A signal block contains 20 contextual frames, equivalent to a segment of 336 ms duration. Localization performance is measured in terms of localization accuracy, which counts an estimated DOA as correct if it is within 5° of the true DOA.
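As a simple sketch (the helper name `localization_accuracy` is ours), the block-level accuracy with a 5° tolerance can be computed as:

```python
import numpy as np

def localization_accuracy(est_doas, true_doas, tol_deg=5.0):
    """Fraction of signal blocks whose estimated DOA lies within tol_deg of the truth."""
    est = np.asarray(est_doas, dtype=float)
    true = np.asarray(true_doas, dtype=float)
    return float(np.mean(np.abs(est - true) <= tol_deg))
```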

3.2. Influence of Learning Factor

The adaptability of our network is influenced by the learning factor $\lambda$, so the value of $\lambda$ needs to be evaluated to maximize the adaptability. Note that the semiadaptive entropy lacks directional information when $\lambda = 1$; hence, the maximum value of $\lambda$ is set to 0.999. The minimum value of $\lambda$ is set to 0, for which the semiadaptive entropy reduces to the crossentropy. In the experiments, our network is trained with different learning factors ranging from 0 to 0.999, and $\lambda$ is determined by evaluating the localization accuracy on the validation set under noisy conditions with −20 dB SNR. Figure 2(a) shows the localization performance for different $\lambda$; it exhibits three local maxima at different learning factors. In the ADADELTA [25] updating algorithm, the learning rate is automatically adjusted using the accumulated squared gradient:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)\, g_t^2.$$

The formulation of our semiadaptive entropy resembles the form of this accumulated gradient. The gradient of each term of the semiadaptive entropy can be calculated separately, and the accumulated gradient becomes

$$g_t = (1 - \lambda)\, g_{\mathrm{CE}} + \lambda\, g_{\mathrm{UE}},$$

where $g_{\mathrm{CE}}$ and $g_{\mathrm{UE}}$ represent the gradients of the crossentropy and the uniform entropy, respectively. Here, $\lambda$ is also a hyperparameter and serves as a momentum-like factor that controls the learning rate. Therefore, the model can fall into different local maxima or saddle points during the updating iterations. Through sufficient validation, $\lambda$ is set to 0.9, which gives the best performance and indicates relatively high adaptability of this network in noisy environments. The DOA probability of a binaural signal under the −10 dB SNR condition is depicted in Figure 2(b). The true DOA of the signal is 60°, but the network trained with $\lambda = 0$ yields a wrong DOA of 65°. The red curve shows that the probability of the wrong DOA is reduced when the network is trained with $\lambda = 0.9$. In addition, due to the effect of the uniform entropy, azimuths far away from the true DOA may have nonzero probabilities. This demonstrates that the semiadaptive entropy can effectively improve the adaptability of the network.
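For reference, a minimal training step with ADADELTA might look like the sketch below, using `torch.optim.Adadelta` (where `rho` plays the role of the decay factor in the accumulated squared gradient above) and the early-stopping patience of 3 epochs from Section 2.2; `train_loader` and `validate` are assumed helpers, not part of the paper.

```python
import torch.optim as optim

model = MultiscaleDilatedCNN()
optimizer = optim.Adadelta(model.parameters(), rho=0.9)

best_err, patience = float("inf"), 0
for _ in range(100):                               # maximum number of epochs (assumed)
    model.train()
    for ccf, ild, target in train_loader:          # train_loader is an assumed helper
        optimizer.zero_grad()
        loss = semiadaptive_entropy(model(ccf, ild), target, lam=0.9)
        loss.backward()
        optimizer.step()
    val_err = 1.0 - validate(model)                # validate() is an assumed helper
    # stop when no lower validation error appears within the last 3 epochs
    if val_err < best_err:
        best_err, patience = val_err, 0
    else:
        patience += 1
        if patience >= 3:
            break
```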

3.3. Evaluation of Binaural SSL

Our method is compared with two baseline network-based methods, i.e., the multilayer perceptron (MLP) [8] and the frequency-dependent DNN [19], and the network architecture is also evaluated in ablation studies:
(i) Regular CNN: regular CNN layers are used in our architecture instead of dilated CNN layers.
(ii) Dilation-2 CNN: the CCF and ILD are fed into separate branches of dilated CNN as in the proposed architecture, but the CCF branch only stacks two layers of dilation-2 CNN.
(iii) Dilation-5 CNN: the CCF and ILD are fed into separate branches of dilated CNN as in the proposed architecture, but the CCF branch only stacks two layers of dilation-5 CNN.
(iv) Cascaded DCNN: the dilation-2 CNN and dilation-5 CNN are cascaded in the CCF branch rather than placed in parallel.

Localization accuracies of these methods are shown in Table 2 (noisy scenes) and Table 3 (noisy and reverberant scenes). In Table 3, the symbol "-/-" means no additive noise. In noisy conditions, the MLP outperforms the frequency-dependent DNN at low SNRs because the ITD and ILD are estimated on the whole signal block rather than on short frames. Compared with the results of the DNN, the CNN-based methods improve the average accuracy by 2% to 6%. The reason is that adjacent frequency bands provide mutual information for each other instead of being treated as independent. In reverberant conditions, the dilation-5 CNN outperforms the others because remote information is as important as the mutual information across sub-bands, where the remote information includes the interaural features of the direct path and of early and late reverberation. A dilated CNN with relatively larger receptive fields can capture more remote information at a time. Due to the complementarity of the different dilated kernels, the multiscale dilated CNN trained with $\lambda = 0.9$ performs well in noisy conditions but slightly worse than the dilation-5 CNN in reverberant conditions. It makes sense that the fusion in the multiscale dilated CNN learns an automatic trade-off between small and large dilated kernels in noisy and reverberant conditions. Furthermore, we also demonstrate the importance of the semiadaptive entropy. Compared with the crossentropy, the network trained with the semiadaptive entropy improves the localization accuracy by nearly 10% in strongly noisy scenes and by 4.62% on average in reverberant scenes.

4. Conclusions

In this work, we proposed an adaptive binaural SSL method based on a multiscale dilated CNN. The separate dilated CNNs encode discriminative representations of the CCF and ILD features. By operating on the inputs in parallel, the dilation-2 CNN and dilation-5 CNN complement each other in noisy and reverberant conditions. Additionally, we derived a semiadaptive entropy from the Kullback–Leibler divergence to adaptively train the network under directional constraints. Trained with a high value of the learning factor, the multiscale dilated CNN generalizes well to previously unseen scenes. Experimental results have demonstrated the superiority of this method over the baseline methods and single-scale networks in adverse scenarios.

Data Availability

All data used in this study are publicly available, and their sources are stated in the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61673030 and U1613209) and the National Natural Science Foundation of Shenzhen (no. JCYJ20190808182209321).