Abstract

Nonnegative matrix factorization- (NMF-) based noise reduction methods can effectively improve the performance of environmental sound recognition. However, when the environmental sound overlaps heavily with the noise, spectral line loss and residual noise occur under low signal-to-noise ratio (SNR) conditions. An adaptive noise reduction algorithm is proposed in this paper. First, the noisy environmental sound is separated into estimated noise and environmental sound using NMF. Then, the estimated noise is used to calculate the sound presence probability (SPP), which is applied to decrease spectral line loss and obtain an accurate noise estimate. Subsequently, the estimated noise is combined with the noisy environmental sound to obtain the estimated environmental sound. Finally, the SPP is applied to reduce residual noise in the estimated environmental sound and reconstruct the environmental sound. The simulation results demonstrate that the proposed algorithm outperforms the traditional algorithms and NMF-based methods in terms of perceptual evaluation of speech quality (PESQ) and global SNR, with increases of X% and X%, respectively. Moreover, the proposed method can effectively improve the environmental sound recognition rate. In particular, the proposed method yields a 16.2% increase in F1-score for car horn recognition under realistic acoustic conditions.

1. Introduction

The recognition of environmental sound (ES) enables the monitoring of specific events, since ES has the potential to characterize the surrounding environment. However, one key factor that affects detection and classification performance is the diverse and unpredictable interference noise present in real-life scenarios [1]. Therefore, noise reduction (NR), as part of ES preprocessing, has important application prospects in human-computer interaction [2], animal behavior monitoring [3], anomalous sound detection for machine condition monitoring [4], and domestic risk scenarios [5]. Speech was the first ES to be studied, and representative NR methods such as spectral subtraction (SS) [6], Wiener filtering (WF) [7], minimum mean square error (MMSE) [8, 9], and short-time spectral amplitude (STSA) [10] have been proposed for it. By virtue of their high flexibility and ease of implementation, these methods have been applied to nonspeech NR in wireless acoustic sensor networks (WASNs) [11, 12]. It is important to note that, despite their various contributions, these works have limited ability to suppress nonstationary noise, especially under low-SNR conditions, since they rest on a strong stationarity assumption about the noise [13].

To cope with nonstationary noise, noise estimation is performed before the NR gain function, providing an accurate noise spectrum to evaluate the WF [14] or to estimate the a priori SNR in the MMSE [15]. In [16], the authors used improved minima-controlled recursive averaging to estimate the noise and combined it with the optimally modified log-spectral amplitude (OMLSA) estimator to obtain promising denoising performance. Although such approaches can estimate the noise spectrum continuously during ES activity, they do not respond well to rising noise levels under low-SNR conditions, which leads to underestimation of the noise spectrum and annoying residual noise.

In order to provide consistent performance in highly nonstationary noisy environments under low-SNR conditions, NMF models the input noisy ES as a weighted sum of nonnegative bases learned from clean ES and noise [17]. Since the same acoustic elements exist in both ES and noise, NMF is not effective at separating regions where the ES and noise spectral bands are heavily mixed [17]. Subsequent works attempt to optimize the separation rules [18, 19] and postprocessing [20, 21] to improve the quality of the separated ES. For instance, a weighted NMF for interpolating missing data is presented in [18], which addresses the overestimation of values in masked regions at a computational cost equivalent to standard NMF. In [20], Lee et al. introduced spectral-temporal speech presence probabilities (SPP) to reconstruct the regions of the separated speech with severe spectral leakage and suppress the residual noise components. Because these methods treat the separated ES directly as the denoised result, they ignore a key issue: the spectral line loss of the separated ES is beyond repair, which degrades the clarity of the denoised ES.

Benefiting from the high learning capacity of deep neural networks (DNNs), hybrid models combining a DNN and NMF have been proposed. In [22], a DNN is applied to initialize the activation matrix of NMF, with which the performance is slightly better than that of traditional NMF. Another approach estimates the activation matrix through a DNN and then reconstructs the ES by multiplying it with the basis matrix [23]. However, DNN-based methods require an enormous number of clean ES samples, which are difficult to obtain in advance for sounds such as gunshots and explosions. Without a sufficient training dataset, these methods may overfit.

Furthermore, NMF is considered sensitive to nonstationary noise and low-SNR conditions when the separated noise is used as the noise estimation result [22]. Based on this, sparse and low-rank NMF with Kullback-Leibler divergence has been presented for noise estimation [24]. Lai et al. [25] applied NMF for noise estimation in combination with the Wiener filter gain function to obtain enhanced speech with high quality and intelligibility under challenging conditions. However, the above methods rely on a supervised learning approach and still fail to improve the robustness of the algorithm in unseen noise environments.

Moreover, current NR algorithms for monitoring systems draw directly on speech enhancement schemes. In an automatic recognition system for porcine abnormalities [26], SS is employed to suppress the pigpen background noise to improve detection performance. In [4], researchers used the noisereduce library of Python 3.7 to remove background noise in a cattle farm and the analog white noise of the microphone. After removing the noise, the performance of cattle vocal classification improved from 91.38% to 94.18%. Such methods do not consider the characteristics of the monitored sound; for instance, Xu et al. [27] employed the improved controlled recursive averaging algorithm to estimate noise, which fails when the nonspeech ES changes more slowly than the noise.

The contributions of this paper are summarized as follows: (1) an adaptive NR algorithm using a semisupervised learning mode is proposed, which reduces the interference from non-ES segments and further improves recognition performance; (2) an SPP-based threshold determination is presented for locating the frames where ES is vocalized in the sound clip; (3) to verify the validity of the proposed algorithm in a monitoring system, experiments under both simulated and realistic acoustic conditions were conducted on nonspeech datasets. The experimental results show that the proposed algorithm achieves good NR performance.

The paper is organized as follows. Section 2 introduces the NMF-based NR technique. Section 3 presents the framework of the proposed algorithm and details the algorithm. Section 4 analyzes the experimental results and evaluates the performance of the proposed algorithm. Finally, we draw the conclusions in Section 5.

2. The NMF-Based NR Technique

NMF is a source separation technique for additive mixtures that uses a basis matrix. Since NMF is capable of interpreting the local properties of an image, it was initially proposed for face recognition [28]. Recently, NMF has been studied for blind source separation [29, 30] and NR, since sound can be converted to spectrogram form.

Consider the representation of a noisy ES in the time-frequency (T-F) domain as the sum of environmental sound and noise, Y(k, l) = S(k, l) + N(k, l), where k indexes the frequency bins and l is the time index.

To satisfy the nonnegativity constraint of NMF on the input matrix, the nonnegative real-valued matrix V of dimensions K × L, namely, the T-F amplitude spectrum matrix of the noisy ES, is taken as the decomposition matrix. In the decomposition, a nonnegative basis matrix W (K × r) and an activation matrix H (r × L) are found such that V ≈ WH. NMF thus allows the original high-dimensional matrix to be approximately decomposed into the product of low-rank matrices. The latent structure of the original matrix is captured by W, and H is its corresponding T-F gain. Therefore, the rank r is required to be much smaller than K or L, satisfying (K + L)r < KL.

To find the optimal W and H, the squared Euclidean distance is used to quantify the quality of the approximate decomposition and minimize the reconstruction error,

min_{W, H ≥ 0} ||V − WH||².

Then, the iterative multiplicative update rules for H and W are as follows [31]:

H^(t+1) = H^(t) ⊙ (WᵀV) ⊘ (WᵀWH), W^(t+1) = W^(t) ⊙ (VHᵀ) ⊘ (WHHᵀ),

where t is the number of iterations, ⊙ represents element-wise multiplication of the corresponding matrix entries, and ⊘ denotes element-wise division.
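As an illustration, the multiplicative update rules can be written in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the rank, iteration count, and the small constant eps added to the denominators for numerical stability are illustrative choices.

```python
import numpy as np

def nmf_euclidean(V, r, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative-update NMF minimizing the Euclidean reconstruction
    error ||V - WH||^2 (Lee-Seung updates)."""
    rng = np.random.default_rng(seed)
    K, L = V.shape
    W = rng.random((K, r)) + eps   # nonnegative random initialization
    H = rng.random((r, L)) + eps
    for _ in range(n_iter):
        # multiplying by a nonnegative ratio keeps W and H nonnegative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because each update multiplies the current estimate by a nonnegative ratio, nonnegativity is preserved automatically at every iteration.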

For NR tasks, W can be rewritten in the joint dictionary form of ES and noise, namely, W = [W_s, W_n]. Similarly, H = [H_s; H_n], where the subscripts s and n denote the ES and noise components.

Accordingly, the decomposition process of the noisy T-F amplitude spectrum matrix is shown in Figure 1. By matrix operation, the T-F amplitude spectrum matrices of the ES and noise can be approximated as

V_s ≈ W_s H_s, V_n ≈ W_n H_n.

The separated ES spectrum V_s can be considered the denoised output.
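Given the trained factors, the separation step amounts to matrix products. The sketch below shows the direct reconstruction described above and, for comparison, a Wiener-style soft mask that rescales the noisy spectrum; the mask variant is an illustrative alternative, not the operation the paper specifies at this point.

```python
import numpy as np

def separate(Y_mag, W_s, H_s, W_n, H_n, eps=1e-10):
    """Split a noisy magnitude spectrogram using NMF factors.

    Returns the direct ES reconstruction W_s @ H_s and a Wiener-masked
    estimate whose sum with the masked noise recovers Y_mag."""
    S_direct = W_s @ H_s                           # direct ES reconstruction
    N_direct = W_n @ H_n                           # direct noise reconstruction
    mask = S_direct / (S_direct + N_direct + eps)  # soft mask in [0, 1]
    return S_direct, mask * Y_mag
```

The soft mask guarantees that the separated components add back up to the observed spectrum, which the direct reconstruction does not.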

When NMF decomposes the noisy ES matrix, the basis matrix is activated unstably because the same sound elements overlap between the basis matrices of different sounds [24]. Thus, there is mutual spectral component leakage between the separated V_s and V_n in strong-noise regions where the spectra of the ES and noise overlap heavily. This leads to spectral line loss in the separated ES, which affects the quality and clarity of the denoised output. Therefore, a reconstruction scheme based on the separated noise is proposed for the distorted regions of the separated ES.

3. The Proposed Denoising Scheme

The block diagram of the proposed algorithm is shown in Figure 2. An unsupervised learning method is applied to the captured noise by building a noise buffer, and the noise separated by NMF is taken as the preliminary noise estimation result. In the noise processing stage, an SPP algorithm is used to adaptively suppress the ES components leaked into the separated noise. Combined with an OMLSA spectral estimator, the high-frequency structural information of the ES can be retained to further improve the ES quality. Finally, the ES output is enhanced through a residual noise suppression process. The key steps are building the noise buffer, adaptive noise processing, and residual noise suppression.

3.1. Semisupervised NMF

For a monitoring system, the target ES to be monitored is known in advance. Thus, the ES basis matrix W_s is pretrained from the ES dataset. Without presetting the type of noise in the environment, we extract the noise basis matrix W_n from the noise buffer and combine it with the pretrained W_s to achieve semisupervised NMF. The noise buffer captures noise-only segments to update the noise basis matrix online, which meets the demand of NR in unknown noise environments.

In the test stage, unsupervised NMF [28] is executed. First, the matrices to be estimated are initialized with nonnegative random values. Then, the noise basis is updated iteratively by Equation (3), and the activation matrices are updated iteratively by Equation (4). Iteration stops when the maximum number of iterations is reached or the convergence criterion of Equation (2) is satisfied, resulting in the separated ES T-F amplitude spectrum matrix V_s and the separated noise T-F amplitude spectrum matrix V_n.
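The semisupervised separation can be sketched as follows: the pretrained ES basis W_s is frozen, while the noise basis W_n and the full activation matrix are updated. This is a sketch under the Euclidean cost; the paper's iteration limit and convergence test are not reproduced, and the dimensions are illustrative.

```python
import numpy as np

def semisupervised_nmf(V, W_s, r_n, n_iter=100, eps=1e-10, seed=0):
    """Semisupervised NMF: W_s is pretrained and kept fixed; only the
    noise basis W_n and the activations H are updated."""
    rng = np.random.default_rng(seed)
    K, L = V.shape
    r_s = W_s.shape[1]
    W_n = rng.random((K, r_n)) + eps       # nonnegative random init
    H = rng.random((r_s + r_n, L)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_s, W_n])          # joint dictionary [W_s, W_n]
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # update only the noise columns of the basis
        W_n *= (V @ H[r_s:].T) / (W @ H @ H[r_s:].T + eps)
    V_s = W_s @ H[:r_s]                    # separated ES spectrum
    V_n = W_n @ H[r_s:]                    # separated noise spectrum
    return V_s, V_n
```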

3.2. Adaptive Noise Suppression

Due to the overlap of acoustic elements, semisupervised NMF serves only as a preliminary separation. The separation results are used to calculate the instantaneous SNR and the probability of ES presence in the separated noise, providing accurate prior information for the subsequent estimation of the clean ES. In this stage, SPP is used to adaptively suppress the ES components leaked into the separated noise to improve the performance of the OMLSA estimator.

First, it is assumed that the T-F amplitude spectra of the ES and noise follow complex Gaussian distributions and that H1(k, l) and H0(k, l), respectively, represent the presence and absence of ES at the T-F point (k, l). Then, the conditional probability density function of the observed signal is given in terms of the variances of the ES and noise, λs(k, l) and λn(k, l).

By applying Bayes’ rule, the conditional probability of ES presence is given by the following formula [10]:

p(k, l) = {1 + [q(k, l)/(1 − q(k, l))] (1 + ξ(k, l)) exp(−υ(k, l))}⁻¹,

where q(k, l) is the prior probability of ES absence and υ(k, l) = γ(k, l)ξ(k, l)/(1 + ξ(k, l)) is obtained from the a priori SNR ξ(k, l) and the a posteriori SNR γ(k, l).

Since the T-F points of mutual leakage between the ES and noise lie in regions with serious spectral aliasing, the instantaneous SNR is used to determine the T-F points that need to be reconstructed in the separated noise.

According to Equation (7), the T-F points in the separated noise whose conditional probability of ES presence satisfies the decision conditions are attenuated by an adaptive factor α that automatically adjusts the noise component of the separated noise according to the instantaneous SNR level. That is, α is a weighting factor reflecting the power ratio of the ES component to the noise component in the spectrally aliased regions. At higher SNRs, a larger α should be used to avoid weakening the separated noise components, which would otherwise leave more residual noise in the denoised ES. Under low-SNR conditions, a smaller α should be used to avoid distortion of the denoised ES.
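The presence-probability computation and the adaptive attenuation can be sketched as follows. The probability uses the standard Gaussian-model form associated with OMLSA; the paper's exact Equation (7), the decision threshold, and the value of α below are assumptions made for illustration only.

```python
import numpy as np

def es_presence_prob(gamma, xi, q=0.5):
    """Conditional ES-presence probability under the complex Gaussian
    model (standard OMLSA-style form; the paper's equation may differ).
    gamma: a posteriori SNR, xi: a priori SNR, q: prior ES-absence prob."""
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * np.exp(-v))

def suppress_leaked_es(V_n, p, alpha, p_thresh=0.8):
    """Attenuate T-F points of the separated noise where ES is likely
    present; alpha and p_thresh are hypothetical illustrative values."""
    return np.where(p > p_thresh, alpha * V_n, V_n)
```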

Then, the separated noise, processed by adaptively suppressing the ES component, is used as the input to the OMLSA estimator to provide noise information about the noisy ES and to accurately determine the spectral gain function [10].

Gmin is the lower threshold of the gain for the absence of ES, while GH1(k, l) is the spectral gain function that minimizes the logarithmic MMSE between the estimated ES amplitude and the true amplitude as the optimization objective in the presence of ES.

As the noisy ES has already been separated by NMF, the ES and noise variances can be given by the separated ES and processed noise spectra, respectively, when computing the a priori SNR ξ and the a posteriori SNR γ with the decision-directed approach.

Finally, by applying the OMLSA spectral gain function G(k, l) to the noisy spectrum, the T-F amplitude spectrum of the denoised ES can be obtained as Ŝ(k, l) = G(k, l)Y(k, l).
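The OMLSA-style gain combines a conditional gain with the presence probability p and the floor Gmin. A minimal sketch follows; for brevity, the Wiener gain ξ/(1 + ξ) stands in for the log-MMSE conditional gain (which requires an exponential-integral term), so the values are illustrative rather than the paper's exact gain.

```python
import numpy as np

def omlsa_gain(xi, p, G_min=0.1):
    """OMLSA-style gain: the conditional gain is weighted geometrically
    by the presence probability p and floored at G_min when ES is absent.
    The Wiener gain xi/(1+xi) is used here in place of the log-MMSE gain."""
    G_h1 = xi / (1.0 + xi)
    return (G_h1 ** p) * (G_min ** (1.0 - p))
```

Applying this gain element-wise to the noisy T-F amplitude spectrum yields the denoised spectrum; when p = 0 the output is floored at G_min, and when p = 1 the conditional gain is applied in full.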

3.3. Residual Noise Suppression

If the ES type is speech, the denoised spectrum Ŝ obtained from Equation (15) is used as the final denoised output. However, for NR in a nonspeech monitoring system, the input data is a sound segment of about five seconds obtained by endpoint detection. The presence of non-ES frames in the input data, especially those with residual noise, would affect recognition performance. In this stage, we design a T-F weighting factor to further enhance Ŝ.

The T-F weighting factor extracts the frames with the highest probability of ES while suppressing the frames dominated by interference noise. The flag matrix and the decision threshold together determine the presence of ES in each frame: the statistic calculated from the presence probability is compared with the decision threshold at each frequency point, and the flag matrix is set accordingly.

Therefore, the final output of the denoised nonspeech can be expressed as the element-wise product of the flag matrix and the denoised spectrum Ŝ.
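The frame-level gating described above can be sketched as follows; flagging frames by their mean presence probability and the threshold value of 0.5 are illustrative assumptions, not the paper's exact decision rule.

```python
import numpy as np

def gate_frames(S_hat, p, threshold=0.5):
    """Zero out frames whose mean ES-presence probability falls below a
    decision threshold, removing residual noise from non-ES frames."""
    flags = (p.mean(axis=0) > threshold).astype(float)  # one flag per frame
    return S_hat * flags[np.newaxis, :]                 # broadcast over bins
```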

4. Experimental Results

This study evaluated the denoising performance of the proposed adaptive NR method and compared it with standard NMF [28], RNMF [20], and the traditional speech enhancement methods Wiener [8], STSA [10], and OMLSA [32] on the ES datasets. In addition, we applied the proposed method to a nonspeech monitoring system and conducted both simulated and real experiments.

4.1. Datasets and Experimental Parameter Setting

The ES datasets consist of the Google dataset [33] and the TIMIT dataset [34], including car horn, scream, gunshot, and speech. The dataset was divided into training and test sets at a ratio of 7 : 3. The noise test set included the nonstationary noises babble, factory2, F16, destroyerops, pink, and white from the NOISEX-92 dataset [35], and the natural environment noises rain and wind from the ESC-50 database [36]. The SNRs ranged from −5 to 5 dB, and noise that did not overlap with the training set was added to the clean sounds of the test set to generate enough noisy ES to evaluate the performance of the proposed algorithm. All audio was resampled to 16 kHz, and the time-domain signal was converted into a T-F amplitude spectrum by the STFT, with a Hamming window length of 32 ms and a 50% frame shift. The dimensions of the ES and noise basis matrices were set empirically to 90 and 60, respectively.
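The front-end settings above (16 kHz sampling, 32 ms Hamming window, 50% frame shift) can be reproduced with a short NumPy sketch; the random signal below is a stand-in for a real recording.

```python
import numpy as np

fs = 16000
win_len = int(0.032 * fs)        # 32 ms window = 512 samples at 16 kHz
hop = win_len // 2               # 50% frame shift
window = np.hamming(win_len)

def stft_mag(x):
    """Magnitude STFT: frame, window, FFT; returns (freq_bins, frames)."""
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

x = np.random.default_rng(0).standard_normal(fs)  # 1 s of dummy audio
V = stft_mag(x)   # nonnegative T-F amplitude matrix, the NMF input
```

With a 512-sample window, the one-sided spectrum has 257 frequency bins, and one second of audio yields 61 half-overlapping frames.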

4.2. Evaluation Metrics

In the speech-based noise suppression experiments, global SNR and PESQ [37] were selected as the evaluation metrics. Global SNR is defined as the power ratio between ES and noise over all T-F regions, reflecting their relative magnitude. The improvement in global SNR gives an objective measure of the amount of noise the algorithm rejects. PESQ is considered an objective expression of subjective evaluation, which compensates for the limitations of the global SNR measurement. In the nonspeech experiments, the F-score of the sound event recognition model was used to reflect the noise suppression ability of the algorithm [38]. This index comprehensively balances the precision and recall indexes and is defined as follows:

P = TP/(TP + FP), R = TP/(TP + FN), Fβ = (1 + β²)PR/(β²P + R),

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively [39]. P is the precision rate, while R is the recall rate. In this experiment, β = 1 was used, so precision and recall were weighted equally; that is, the F1-score was selected.
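The metric can be computed directly from the confusion counts; a minimal sketch (the count values in the example are made up for illustration):

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta from confusion counts; beta = 1 weights precision
    and recall equally (the F1-score used in the experiments)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, 8 true positives with 2 false positives and 2 false negatives give precision = recall = 0.8 and hence F1 = 0.8.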

In addition, since T-F spectrogram analysis is a common method for analyzing the frequency content and level of time-varying signals, this paper also uses the T-F spectrogram as an evaluation metric in both the speech and nonspeech experiments.

4.3. Performance Comparison of Different Algorithms to Suppress Nonstationary Noise under Low-SNR Conditions

To directly reflect the ability of the proposed algorithm to suppress nonstationary noise under low-SNR conditions, noisy speech corrupted by six nonstationary noises from the NOISEX-92 database was selected for the experiments. The proposed algorithm was compared with traditional speech enhancement algorithms [7, 10, 32], standard NMF [28], and reconstructed NMF (RNMF) [20].

According to Figure 3, the proposed algorithm is superior to the compared algorithms in both the global SNR and PESQ metrics. Compared with the other algorithms under the three input SNR conditions, the proposed algorithm improved by 0.13–0.60 (0.26 on average) on the PESQ index. On the global SNR indicator, the minimum improvement was 1.44 dB, the maximum improvement was 3.78 dB, and the average improvement was 2.39 dB. As shown in Figure 3(a), when the input SNR was −5 dB or 0 dB, NMF, RNMF, and the proposed algorithm were superior to the traditional algorithms on the global SNR index. This shows the superiority of NMF-based algorithms under low-SNR conditions and their ability to better protect speech quality. At 5 dB, NMF scored lower than the conventional Wiener and STSA algorithms, which is consistent with the previous conclusion that noise spectrum leakage and severe spectral line blurring in the separated ES result in poor noise suppression. This conclusion is also reflected in Figure 3(b), where the PESQ values of NMF were lower than those of the Wiener and STSA algorithms at both 0 dB and 5 dB.

Comparing the T-F spectrograms in Figures 4(c) and 4(d), the speech enhanced by the proposed algorithm has clear spectral lines and less residual noise in the overlapping region of speech and noise. This shows that the proposed algorithm can effectively suppress the spectral components leaked during NMF separation while preserving the high-frequency information of the speech (indicated in the red-boxed region), and it thus achieved the highest global SNR and PESQ performance. Although RNMF addresses the noise spectrum leakage of NMF through reconstruction, there is still a loss of high-frequency information (indicated in the red-boxed region), because the reconstruction of RNMF is based on the separated ES. This is also why RNMF exhibits limited noise rejection performance (Figure 3). In the 0–0.7 s region of the T-F spectrograms, fewer isolated noise fragments remained after processing by the proposed algorithm, indicating that it suppresses nonstationary noise more thoroughly.

4.4. Influence of Environmental Sound Noise Reduction on the Recognition Result

Since NR is an indispensable part of recognition preprocessing, this paper selected three types of nonspeech, namely, car horns, screams, and gunshots, as the sound events to be recognized in the experiments. The noise suppression performance of the proposed algorithm can be measured by the improvement in the model's recognition ability before and after NR. Nonstationary noises, such as rain, wind, and babble, were selected as interference noises for the outdoor environment. Noisy ES under low-SNR conditions was synthesized for the experiments. The recognition model is based on the two-input convolutional neural network of our previous work [38].

Figure 5 shows the T-F spectrograms of the nonspeech used as input to the recognition model, displaying three kinds of nonspeech ES corrupted by nonstationary rain noise at 0 dB SNR. Comparing Figures 5(b) and 5(c), the proposed method can locate the segments containing ES and set the T-F weights of non-ES segments to 0 (the dark blue part). This demonstrates that the T-F weighting factor we designed is effective. Figure 5(d) shows the NR results without the T-F weighting factor, where noise residue remains in the low frequencies at 0.5 s (indicated in the black-boxed region). Setting the weights of non-ES segments to zero resolves the performance degradation caused by residual noise in pure-noise segments. Thorough suppression of residual noise is more conducive to subsequent ES recognition. In addition, comparing Figures 5(a) and 5(c), the spectral structure of the nonspeech reconstructed by the proposed method was largely intact. In the low-frequency range, where noise damage was serious, and in the high-frequency range, where nonstationary noise was the main component, there was little noise residue.

The F1-scores of the noisy ES and the denoised ES are listed in Table 1. According to the table, the ES processed by the proposed algorithm had higher F1-scores. On average, the F1-score improved by 11.1%, 11.8%, and 16.8%, respectively, under the three low-SNR conditions. The largest improvement occurred at −5 dB, indicating that the proposed algorithm is effective under low-SNR conditions and can significantly improve the performance of the recognition model.

4.5. Analysis of the Application of the Proposed Algorithm in Environmental Sound Recognition Systems

To verify the effectiveness of the proposed algorithm in improving recognition performance under real scenarios, the recognition system designed in our previous work [40, 41] was used for real-time environmental sound data acquisition and recognition. To ensure the authenticity of the experiment, car horn sounds were collected at the Guilin University of Electronic Technology. Along with a car horn used by electric vehicles, two mobile-side devices and two fixed-side devices were placed for real-time data collection. The collection scenario and device placement are shown in Figure 6. The experiment was conducted in July 2021, and the main background sounds on campus were cicadas, birds, wind, and building noise. The noise level measured with a sound level meter at the acquisition microphone ranged from 70 to 90 dB.

The duration of each piece of sound data collected in this experiment was 3-5 s, sampled at 16 kHz and saved in WAV format. Figure 7 shows an example of the captured noisy ES and the ES processed by the proposed algorithm. As shown in Figure 7(b), in the 4-8 kHz band occupied by cicadas, the spectral lines of the denoised car horn are clear. At the same time, the noise was completely muted or reduced to a small amplitude in the pure-noise segments. This demonstrates the ability of the proposed algorithm to learn the noise characteristics in a real system and successfully extract the nonspeech components from the noisy bands. By suppressing the noise components, the proposed algorithm can prevent the low recognition rate caused by interference noise in the acquired ES.

In this experiment, the F1-score used in Experiment 2 was adopted to measure the NR performance of the proposed algorithm. Figure 8 shows that in a real environment, the F1-scores of the model on the noisy and denoised car horn are 76.2% and 88.8%, respectively. The 16.2% improvement in the F1-score after NR is also consistent with the 15.5% improvement demonstrated in the simulation experiment of Experiment 2.

5. Conclusion

In this paper, an adaptive NR algorithm for use in various noisy environments was proposed. First, a noise buffer was set up to implement semisupervised NMF so that the algorithm can suppress unseen noise and improve robustness. Next, an adaptive weighting factor based on SPP was designed to suppress the ES components leaked into the separated noise, avoiding misclassification of the ES as noise and solving the distortion problem. In addition, to reduce the interference of residual noise in non-ES segments on recognition, a T-F threshold was applied to each frame of the OMLSA estimator output. The results show that the proposed algorithm outperformed the other methods in terms of average PESQ and SNR. Under realistic acoustic conditions, the proposed algorithm significantly improved the recognition performance of the monitoring system.

However, the ES types and basis matrices are in one-to-one correspondence. Therefore, in systems with strict real-time requirements, additional computation is needed to determine the corresponding basis matrix. In future work, optimization of the algorithm will be considered to improve its computational speed and its suitability for tasks where real-time performance is required.

Data Availability

Data are available at Google AudioSet: https://ieeexplore.ieee.org/document/7952261.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62201163), the Project of Guangxi Natural Science Foundation (No. 2020GXNSFAA159004), the Project of Guangxi Technology Base and Talent Special Project (No. GuiKe AD20159018), and the Fund of Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (No. CRKL200104).