Abstract
The classification of bird sounds is important in ecological monitoring. Although extracting features from multiple perspectives helps to describe the target more completely, the resulting high-dimensional feature sets raise the curse of dimensionality, so feature selection is necessary. This paper proposes a feature scoring method named MICV (Mutual Information and Coefficient of Variation), which uses the coefficient of variation and mutual information to evaluate each feature’s contribution to classification. A feature optimization method named ERMFT (Eliminating Redundancy Based on Maximum Feature Tree), which eliminates redundant features based on two neighborhoods, is then explored. The two methods are combined as the MICV-ERMFT method to select the optimal features. Experiments comparing eight different feature selection methods are conducted on two sound datasets, of birds and of cranes. Results show that the MICV-ERMFT method outperforms the other feature selection methods in classification accuracy and is less time-consuming.
1. Introduction
Birds are sensitive to changes in habitats and surroundings, and they are a good indicator of biodiversity and the ecosystem [1]. Because birds generally have a wide range of movement and cannot be observed promptly, bird sounds are one of the important ways to identify them [2].
Bird sounds are a class of environmental sounds. Well-known feature extraction methods in audio signal processing include Mel-Frequency Cepstral Coefficients (MFCC) [3] in the frequency domain and Short-Time Fourier Transform (STFT) [4] and Wavelet Transform (WT) in the time domain [5]. Furthermore, Tsau et al. [6] suggested a method that extracts features from Code Excited Linear Prediction (CELP) bit streams. Researchers extract features from multiple aspects to retrieve enough information to describe the target. However, the curse of dimensionality occurs as the number of features and samples grows; it increases the time cost of analyzing the data, weakens the models’ generalization, and reduces the effectiveness of problem solving [7]. To avoid the curse of dimensionality, selecting a subset of features from the feature pool is necessary.
The feature selection process in pattern recognition is composed of feature scoring and feature optimization. Feature scoring, the key to feature selection, finds the most distinguishable features in the classification space. Generally, feature scoring methods can be grouped into four classes: similarity-based, information-theory-based, statistics-based, and sparse-learning-based [8]. So far, researchers have proposed many different feature scoring methods [9]. For example, in unsupervised feature selection, Nonnegative Laplacian is used to estimate the feature contribution [10]. Constraint Score is applied in feature scoring in environmental sound classification [11]. The ReliefF-based feature selection algorithm is employed to select features in automatic bird species identification [12]. PCA is used as a feature reduction technology to realize bird sounds’ automatic recognition [13].
Meanwhile, feature optimization, the second phase of feature selection, selects from the score-ordered feature sequence a subset of features characterized by low redundancy and high contribution to classification. Filter, wrapper, and embedded methods are the three types of approaches used to select a feature subset, and many studies have proposed feature optimization algorithms based on them, such as the Binary Dragonfly Optimization Algorithm, PSO (Particle Swarm Optimization), and the Artificial Bee Colony algorithm. Specifically, S-shaped and V-shaped transfer functions can be used to map a continuous search space to a discrete one [14]. Mutual information can be combined with PSO to eliminate redundant features [15]. In some research, a gradient-enhanced Decision Tree [16] is used to evaluate feature contribution, and the Artificial Bee Colony algorithm is applied to optimize the features [17]. The Pearson correlation coefficient, a common metric in the literature for evaluating the correlation between features, can be followed by an Artificial Ant Colony algorithm to select high-quality features [18].
Most feature scoring methods, such as Constraint Score and the Laplacian, are based on the correlation and differences among spatial distances between features. Although these algorithms have low time complexity, they neglect the diversity of the features; in particular, the units of the features usually differ. Some algorithms calculate the mutual information between the feature samples and the label from a probabilistic and statistical perspective [15]; however, the label is generally a discrete variable, while features are continuous. In recent years, many studies have regarded feature selection as an optimization process and combined it with intelligent search methods [9, 19–22], but the resulting multiobjective optimization has high time and space complexity on large datasets, and reducing the feature dimension usually decreases the classification model’s sensitivity and generalization.
To address the issues mentioned above, this paper proposes, from an information theory perspective, a feature scoring method MICV (Mutual Information and Coefficient of Variation). MICV exploits the characteristics of mutual information and the coefficient of variation and aims to minimize the intraclass distance and maximize the interclass distance. A feature optimization method, ERMFT (Eliminating Redundancy Based on Maximum Feature Tree), is then proposed based on the minimum-spanning-tree concept. Experimental results show that the MICV-ERMFT method effectively reduces the data dimension and improves the classification model’s performance, with a significant improvement over eight feature evaluation methods on the same datasets.
2. Materials and Methods
In bird sound recognition, there exists a variety of methods to extract features and classify the sounds. For example, Human Factor Cepstral Coefficients are used to extract bird sound features, and classification and recognition are performed by the maximum likelihood method [23]. Zottesso et al. [24] suggest a method that extracts bird song features based on the spectrogram and texture descriptors and uses a dissimilarity framework for classification and recognition. In this paper, the classification of bird sounds is divided into three stages: feature extraction, feature selection, and classification recognition, with feature selection as the research focus. The proposed classification process based on MICV-ERMFT is shown in Figure 1:
Stage 1. Preprocess the bird sound audio data (remove noise and convert the channel), use MFCC and CELP to extract features from the preprocessed data, and construct the dataset DM&C (the dataset formed by merging the MFCC and CELP features).
Stage 2. Apply the MICV method to DM&C to evaluate the contribution of and score each feature. Sort the features in ascending order of score, denote the sequence as F, calculate the Pearson correlation coefficients between features, and build a maximum feature tree T. Then apply the ERMFT method to eliminate redundant features and construct a new dataset DM&C′.
Stage 3. Build a classification model on DM&C′ and analyze the classification results.

2.1. Feature Extraction
Birds make sounds in the same way as humans do [25, 26]. The frequency of human speech used for daily communication ranges from 180 Hz to 6 kHz, and the most used frequency range of bird calls is 0.5 to 6 kHz [25, 27]. Under this assumption, we process bird sounds in a way similar to that used for human speech. MFCC (Mel-Frequency Cepstral Coefficients) and CELP (Code Excited Linear Prediction) are applied to the raw bird sound data to extract features in this paper.
2.1.1. MFCC
MFCC [3] is a human-hearing-based, nonlinear feature extraction method. The process is shown in Figure 2.

Step 1. A single-frame, short-time signal $x_i(n)$ is obtained by dividing the original audio signal $x(n)$ into frames and applying a window function. Windowing reduces frequency spectrum leakage. This paper uses 20 ms frames and the Hamming window.
Step 2. To observe the distribution of $x_i(n)$ in the frequency domain, the FFT (fast Fourier transform) is used to transform the signal from the time domain to the frequency domain, yielding $X_i(k)$:
$$X_i(k)=\sum_{n=0}^{N-1}x_i(n)\,e^{-j2\pi kn/N},\quad 0\le k\le N-1.\tag{1}$$
Step 3. Calculate the energy of the spectral lines of each frame:
$$E(i,k)=\left|X_i(k)\right|^{2}.\tag{2}$$
Step 4. Calculate the energy of $E(i,k)$ through the Mel filter bank:
$$S(i,m)=\sum_{k=0}^{N-1}E(i,k)\,H_m(k),\quad 1\le m\le M,\tag{3}$$
where $i$ is the frame index, $k$ is the $k$-th spectral line in the spectrum, $H_m(k)$ is the $m$-th Mel filter, and $N$ is the analysis window length in samples.
Step 5. Take the logarithm of the Mel filter-bank energies and calculate the DCT (Discrete Cosine Transform):
$$\mathrm{mfcc}(i,n)=\sqrt{\frac{2}{M}}\sum_{m=1}^{M}\log\bigl[S(i,m)\bigr]\cos\!\left(\frac{\pi n\,(2m-1)}{2M}\right),\tag{4}$$
where $m$ indexes the Mel filters, $i$ is the frame index, and $\mathrm{mfcc}(i,n)$ is the $n$-th cepstral coefficient after the DCT.
In this paper, MFCC uses 13-dimensional static coefficients (1-dimensional log energy coefficient and 12-dimensional DCT coefficients) as extraction parameters [3, 28]. The resulting sample has 13 features.
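As an illustration of this extraction step, the following sketch computes 13 frame-level MFCCs. It assumes librosa (not used in the paper) with 20 ms Hamming-windowed frames; the FFT size, hop length, and the use of librosa's 0th coefficient in place of the log-energy term are assumptions rather than the paper's exact configuration.

```python
# Minimal MFCC-extraction sketch (assumes librosa; framing parameters are assumptions,
# not necessarily the paper's exact configuration).
import librosa
import numpy as np

def extract_mfcc(path, sr=16000, n_mfcc=13):
    """Return a (n_frames, 13) matrix of frame-level MFCCs for one recording."""
    y, sr = librosa.load(path, sr=sr, mono=True)   # 16 kHz, single channel
    win = int(0.020 * sr)                          # 20 ms analysis window
    hop = win // 2                                 # 50% overlap (assumption)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, win_length=win, hop_length=hop,
                                window="hamming")
    return mfcc.T                                  # one row per frame

# Example: frames = extract_mfcc("bird_call.wav")  # hypothetical file name
```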
2.1.2. CELP
The CELP feature extraction method is derived from LPC (Linear Predictive Coding) and is based on the G.723.1 speech compression coding standard. The LPC coefficients are extracted from the 0th to 23rd bits of the coded bit stream in each frame, forming a 10-dimensional LPC feature. Another 2-dimensional feature, the pitch lag, is extracted from the 24th to 42nd bits of the bit stream in each frame. The extraction of CELP is shown in Figure 3.
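The paper reads the LPC coefficients and pitch lag directly from the G.723.1 bit stream, which requires the codec itself. As a rough stand-in only, the sketch below computes a 10th-order LPC per frame from the waveform; this is an approximation named plainly as such and not the paper's method.

```python
# Stand-in sketch for the CELP LPC features: instead of decoding the G.723.1 bit stream,
# a 10th-order LPC is fitted directly to each 20 ms frame (an approximation, not the paper's method).
import numpy as np
import librosa

def lpc_features(y, sr=16000, order=10, frame_ms=20):
    win = int(frame_ms / 1000 * sr)
    hop = win // 2
    frames = librosa.util.frame(y, frame_length=win, hop_length=hop).T
    feats = []
    for frame in frames:
        a = librosa.lpc(frame.astype(float), order=order)  # length order+1, a[0] == 1
        feats.append(a[1:])                                 # keep the 10 LPC coefficients
    return np.asarray(feats)                                # shape (n_frames, 10)
```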

Endpoint detection is performed after the original audio file is preprocessed, and each audio file is divided into several sound segments. Each sound segment is treated as a sample in the experiment. For each frame, features are extracted using MFCC (13 dimensions) and CELP (12 dimensions). The sampling rate is 16 kHz, and the audio is single-channel. Each sample contains several frames. For each detected segment (containing many frames), the mean, median, and variance of each frame-level feature are calculated, yielding 75-dimensional data per sample. The feature extraction process is shown in Figure 4.
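A minimal sketch of this aggregation step is given below, assuming the 25 frame-level dimensions (13 MFCC + 12 CELP) of one detected segment are already stacked row-wise; the function name is illustrative.

```python
# Aggregate the frame-level features of one detected segment into a 75-dimensional sample:
# mean, median, and variance of each of the 25 frame-level dimensions (13 MFCC + 12 CELP).
import numpy as np

def segment_vector(frame_features):
    """frame_features: array of shape (n_frames, 25) for one segment."""
    mean = frame_features.mean(axis=0)
    median = np.median(frame_features, axis=0)
    var = frame_features.var(axis=0)
    return np.concatenate([mean, median, var])   # shape (75,)

# Example: segment_vector(np.random.randn(120, 25)).shape == (75,)
```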

2.2. Feature Scoring Method MICV
Features that are easy to distinguish are selected according to the principle of small intraclass distance and large interclass distance. To quantify how well a feature discriminates between classes, mutual information MIEC (Mutual Information for Interclass) is used to measure the interclass distance, and the coefficient of variation CVAC (Coefficient of Variation for Intraclass) is used to measure the intraclass distance.
The MIEC and CVAC methods are combined to calculate the classification contribution degree of each feature. The calculation equation is
$$\mathrm{MICV}(f)=\alpha\,\mathrm{MIEC}(f)+(1-\alpha)\,\mathrm{CVAC}(f).\tag{5}$$
Because the intraclass distance and the interclass distance have different weights, the coefficient $\alpha$ is introduced to adjust their relative weights.
2.2.1. MIEC
Mutual information measures the correlation or dependency between two variables. For two discrete random variables $X$ and $Y$, the mutual information is calculated as
$$I(X;Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}.\tag{6}$$
In equation (6), $p(x,y)$ is the joint probability density function of $x$ and $y$, and $p(x)$ and $p(y)$ are the marginal probability density functions of $x$ and $y$.
Generally, when mutual information is used to select features, the variables $X$ and $Y$ represent the feature vector and the label vector. In this paper, $X$ and $Y$ represent two vectors of different classes under the same feature. Given feature space $F$ and classification space $C$ with $K$ classes, the interclass mutual information of the $f$-th feature, $\mathrm{MIEC}(f)$, is calculated as
$$\mathrm{MIEC}(f)=\frac{2}{K(K-1)}\sum_{i=1}^{K}\sum_{j=i+1}^{K}I\!\left(F_f^{\,i};F_f^{\,j}\right).\tag{7}$$
In equation (7), $F_f^{\,i}$ and $F_f^{\,j}$ are the samples of the $f$-th feature in the $i$-th class and the $j$-th class, and $\mathrm{MIEC}(f)$ is the interclass mutual information of the $f$-th feature in $F$. The interclass difference of feature $f$ is greater when $\mathrm{MIEC}(f)$ is smaller, and vice versa.
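A sketch of the MIEC computation follows. It estimates the mutual information between the two per-class sample vectors with a simple 2-D histogram and averages over all class pairs, as in the reconstruction of equation (7); the estimator, the common binning, and the truncation of the two vectors to equal length are assumptions, not the paper's exact implementation.

```python
# Sketch of MIEC (equation (7)): average pairwise mutual information between the samples of
# one feature taken from different classes. Histogram estimator and truncation are assumptions.
import numpy as np
from itertools import combinations

def mutual_information(x, y, bins=16):
    n = min(len(x), len(y))                       # truncate to equal length (assumption)
    joint, _, _ = np.histogram2d(x[:n], y[:n], bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def miec(feature_values, labels):
    """feature_values: 1-D array of one feature; labels: 1-D array of class labels."""
    classes = np.unique(labels)
    pairs = combinations(classes, 2)
    mis = [mutual_information(feature_values[labels == i], feature_values[labels == j])
           for i, j in pairs]
    return float(np.mean(mis))
```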
2.2.2. CVAC
In statistics, the coefficient of variation (CV) measures the dispersion of a sample relative to its mean and is used to compare the variation between two or more samples. The expression is
$$CV=\frac{\sigma}{\mu},\tag{8}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the samples. Given feature space $F$ and classification space $C$ with $K$ classes, the intraclass coefficient of variation of feature $f$, $\mathrm{CVAC}(f)$, is calculated as
$$\mathrm{CVAC}(f)=\frac{1}{K}\sum_{c=1}^{K}CV\!\left(F_f^{\,c}\right).\tag{9}$$
In equation (9), $CV\!\left(F_f^{\,c}\right)$ represents the $CV$ of the samples of feature $f$ in class $c$. The feature $f$ has higher cohesion when $\mathrm{CVAC}(f)$ is smaller.
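The corresponding sketch for CVAC and the combined MICV score is shown below; it reuses the miec function from the preceding sketch, and both the epsilon guard against near-zero class means and the weighted combination of equation (5) are assumptions.

```python
# Sketch of CVAC (equation (9)) and the combined MICV score (equation (5)).
# The epsilon guard and the weighted-sum form of equation (5) are assumptions.
import numpy as np

def cvac(feature_values, labels, eps=1e-12):
    cvs = []
    for c in np.unique(labels):
        x = feature_values[labels == c]
        cvs.append(np.std(x) / (abs(np.mean(x)) + eps))   # CV = sigma / mu per class
    return float(np.mean(cvs))

def micv(feature_values, labels, alpha=0.1):
    """Combined score; alpha weights MIEC against CVAC (uses miec from the previous sketch)."""
    return alpha * miec(feature_values, labels) + (1 - alpha) * cvac(feature_values, labels)

# Score every feature column of a dataset X (shape n_samples x n_features):
# scores = [micv(X[:, f], y) for f in range(X.shape[1])]
```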
2.3. Feature Selection Method MICV-ERMFT
After the features are scored with the MICV method, high-quality features are selected, and ERMFT is used to eliminate redundant features from the feature sequence sorted by score. The combined MICV-ERMFT process is shown in Algorithm 1.
2.3.1. Build Maximum Feature Tree
The maximum feature tree is derived from the minimum spanning tree. For an undirected graph $G=(V,E)$, each edge $e\in E$ has a weight $w(e)$; a minimum spanning tree is a subset of edges that connects all the vertices without cycles and whose total edge weight is minimum. In a maximum feature tree, features are represented as vertices and the edge weights are given by the Pearson correlation coefficient. $\rho_{rc}$ denotes the correlation coefficient between features $f_r$ and $f_c$, which is calculated as
$$\operatorname{cov}(f_r,f_c)=\frac{1}{n}\sum_{i=1}^{n}\left(f_{r,i}-\bar{f}_r\right)\left(f_{c,i}-\bar{f}_c\right),\tag{10}$$
$$\rho_{rc}=\frac{\operatorname{cov}(f_r,f_c)}{\sigma_{f_r}\,\sigma_{f_c}}.\tag{11}$$
In equation (10), $f_{r,i}$ is the $i$-th sample of feature $f_r$, and $\bar{f}_r$ is the mean of feature $f_r$ over all samples. In equation (11), $\rho_{rc}$ is the correlation coefficient between features $f_r$ and $f_c$, and $\sigma_{f_r}$ and $\sigma_{f_c}$ are their standard deviations. Algorithm BMFT (building the maximum feature tree) uses equations (10) and (11) to compute the correlation coefficient matrix and construct the maximum feature tree. Details are described in Algorithm 2.
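A sketch of this construction is given below; it computes the Pearson correlation matrix with numpy and obtains a maximum spanning tree by negating the absolute correlations and calling scipy's minimum_spanning_tree. Using the absolute correlation as the edge weight is an assumption about the paper's weighting.

```python
# Sketch of BMFT (Algorithm 2): Pearson correlations between features as edge weights,
# maximum spanning tree obtained by negating the weights and running a minimum spanning tree.
# Using |correlation| as the edge weight is an assumption.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def build_max_feature_tree(X):
    """X: array of shape (n_samples, n_features). Returns a list of tree edges (r, c, weight)."""
    corr = np.corrcoef(X, rowvar=False)        # pairwise Pearson correlations, equations (10)-(11)
    weights = np.abs(corr)
    np.fill_diagonal(weights, 0.0)             # no self-loops
    mst = minimum_spanning_tree(-weights)      # negate weights -> maximum spanning tree
    rows, cols = mst.nonzero()
    return [(int(r), int(c), float(weights[r, c])) for r, c in zip(rows, cols)]
```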
2.3.2. Remove Redundant Features Based on Two Neighborhoods
ERFTN (Eliminating Redundant Features Based on Two Neighborhoods) removes redundant features using the concept of two neighborhoods. An example with a maximum feature tree T and a feature sequence F sorted by the MICV method is shown in Figure 5:

As shown in Figure 5, given the maximum feature tree T and the feature sequence F sorted by the MICV method in ascending order, the steps of the ERFTN algorithm are listed in Algorithm 3. The final feature subset of F is denoted F′.
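Since Algorithm 3 is not reproduced here, the sketch below shows only one plausible reading of the two-neighborhood rule: a feature is discarded when a better-ranked feature lies within two hops of it in the maximum feature tree. Both the traversal order and the elimination rule are assumptions, not Algorithm 3 itself.

```python
# One plausible sketch of ERFTN: a feature is treated as redundant if a better-ranked feature
# (earlier in the MICV-sorted sequence) lies within its two-hop neighborhood in the tree T.
# The traversal order and elimination rule are assumptions.
from collections import defaultdict

def erftn(sorted_features, tree_edges):
    """sorted_features: feature indices ordered from best to worst MICV rank.
    tree_edges: (r, c, weight) edges of the maximum feature tree."""
    adj = defaultdict(set)
    for r, c, _ in tree_edges:
        adj[r].add(c)
        adj[c].add(r)

    def two_hop(f):
        hop1 = adj[f]
        hop2 = set().union(*(adj[g] for g in hop1)) if hop1 else set()
        return (hop1 | hop2) - {f}

    kept = []
    for f in sorted_features:                     # best-ranked features are visited first
        if not any(g in two_hop(f) for g in kept):
            kept.append(f)                        # drop f if a kept (better) feature is nearby
    return kept
```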
3. Experiments and Results Analysis
3.1. Experimental Dataset
Currently, there are many websites dedicated to sharing bird sounds from around the world, such as Avibase [29] and Xeno-Canto [30], where recordings of bird sounds are collected and annotated. The recordings include various types of vocalizations (multiple calls and songs) of various individuals recorded in their natural environment. The dataset used in this paper comes from Avibase and is a collection of MP3 or WAV audio files. These audio files are unified to a 16 kHz sampling rate and a single channel. Since the audio files do not contain only bird sounds, the bird sounds are separated through voice activity detection (VAD) [25, 31], and then the MFCC and CELP features are extracted according to the process shown in Figure 4.
The experiments use two datasets: bird sounds and crane sounds. The bird sound dataset contains 433 samples from six bird species belonging to different genera, and the crane sound dataset contains 343 samples from seven species of the genus Grus. The dataset information is shown in Tables 1 and 2.
3.2. The Experiment of MICV Scoring Method
To verify the proposed method’s effectiveness, two separate experiments are conducted to test the MICV scoring method and the MICV-ERMFT feature selection method. The classifiers used in the experiments are Decision Tree (J48), SVM, BayesNet (NB), and Random Forests (RFs). The proposed feature scoring method is compared with ConstraintScore (CS) [11] and six feature scoring methods provided by Weka [32]: Correlation (Cor), GainRatio (GR), InfoGain (IG), One-R (OR), ReliefF (RF), and SymmetricalUncert (SU).
3.2.1. Classifier Performance Evaluation
Kappa, the F1 score, and the accuracy rate are used as evaluation indicators.
(1) Kappa. Cohen’s Kappa coefficient is a statistical measure of interrater (and intrarater) reliability for qualitative (categorical) items:
$$\mathrm{Kappa}=\frac{p_o-p_e}{1-p_e},\tag{12}$$
where $p_o$ is the overall classification accuracy, calculated as the number of correctly classified samples divided by the total number of samples. Based on the confusion matrix, assume the numbers of real samples in each class are $a_1,a_2,\dots,a_C$ and the numbers of predicted samples in each class are $b_1,b_2,\dots,b_C$; then $p_e$ is calculated as
$$p_e=\frac{\sum_{i=1}^{C}a_i\,b_i}{M^{2}},\tag{13}$$
where $M$ is the total number of samples.
(2) F1 Score. The F1 score measures a classification model’s accuracy while taking both precision and recall into account. As shown in equation (14), precision is the precision rate and recall is the recall rate:
$$F1=\frac{2\times\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}.\tag{14}$$
(3) Accuracy. The accuracy is calculated as
$$\mathrm{Accuracy}=\frac{n}{M}.\tag{15}$$
In equation (15), $n$ is the number of correctly classified samples and $M$ is the total number of samples.
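The three indicators can be computed as in the following sketch, which assumes scikit-learn (the paper's experiments are run with Weka classifiers); the macro averaging of the F1 score for the multiclass case is also an assumption.

```python
# Evaluation indicators of equations (12)-(15), computed with scikit-learn
# (an assumption; the paper's experiments use Weka).
from sklearn.metrics import cohen_kappa_score, f1_score, accuracy_score

def evaluate(y_true, y_pred):
    return {
        "kappa": cohen_kappa_score(y_true, y_pred),         # equations (12)-(13)
        "f1": f1_score(y_true, y_pred, average="macro"),    # equation (14), macro-averaged
        "accuracy": accuracy_score(y_true, y_pred),         # equation (15)
    }

# Example: evaluate([0, 1, 2, 1], [0, 1, 1, 1])
```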
Each dataset is divided into a 70% training set and a 30% test set. Each experiment is repeated 10 times, and the results are averaged to reduce the effect of biased splits.
3.2.2. MICV Parameter Setting
In equation (5), the coefficient $\alpha$ adjusts the weights of MIEC and CVAC. The experiments vary $\alpha$ and calculate the MICV with the J48 classifier. Table 3 lists, for each setting, the ratio of the number of selected features to the total number of features when the highest Kappa is reached; a lower ratio indicates better performance. Table 3 shows that better results are obtained when $\alpha$ is set to 0.1, 0.3, or 0.2. In the following experiments in this paper, $\alpha$ is set to 0.1.
3.2.3. Compare MIEC, CVAC, and MICV
The selected feature set has a decisive effect on the classification model, and features with higher scores normally lead to better classification performance. The experiments sort the feature sequence in ascending order according to the feature scores obtained from MIEC, CVAC, and MICV, respectively. In Figure 6, the curves of the MICV method (shown in red) rise more steadily in most cases, showing that the classification model’s performance improves as features are gradually added, especially in Figure 6(a). The CVAC and MIEC curves fluctuate noticeably in Figures 6(a) and 6(b). In summary, combining MIEC and CVAC works better than using either alone.

3.2.4. Experiment of MICV Results and Analysis
In this section, the proposed MICV method is tested on the Birds dataset and the Crane dataset. The results in Figures 7 and 8 show that, for the same number of selected features, the Kappa value of the MICV method is generally higher than that of the other methods. As the number of features increases, the Kappa value of the MICV method converges earlier and remains relatively stable compared with the other methods. Overall, MICV is more effective than the other feature evaluation methods.

Tables 4 and 5 record the best classification results (Kappa, accuracy, and F1 score) for each feature scoring sequence, together with the number of features needed to reach them. In each row, a bold value on the left side of “|” indicates that the method uses the fewest features among all methods, and a bold value on the right side indicates the highest evaluation indicator score. Table 4 shows that, on the Birds dataset, the MICV method has the highest Kappa value under all four classifiers; with the J48, NB, and RFs classifiers, it also uses the fewest features and scores highest on the evaluation indicators in most cases. As shown in Table 5, MICV performs notably well with the J48, NB, and RFs classifiers.
In summary, the MICV method selects optimal features more effectively than the other seven methods and achieves good modeling performance with a lower feature dimension.
3.3. Experiment of MICV-ERMFT Feature Selection
In the second part of the experiment, features are again evaluated using CS and the six Weka methods described above (Cor, GR, IG, OR, RF, and SU) for comparison.
3.3.1. Procedure of Experiment
The procedure is demonstrated in Figure 9. Eight different methods (MICV and the seven other methods mentioned above) are used to evaluate each feature’s classification contribution and score the features. After the features are sorted in ascending order of score, the ERMFT method is used to eliminate redundant features, producing a feature subset F′. F′ is then mapped onto the dataset, producing Dataset′. J48, SVM, BayesNet (NB), and Random Forests (RFs) are used as classifiers. Each dataset is divided into a 70% training set and a 30% test set, each experiment is repeated ten times, and the average Kappa is calculated. In addition, the DRR (Dimensionality Reduction Rate) is introduced as an evaluation indicator:
$$\mathrm{DRR}=1-\frac{n_{\mathrm{selected}}}{N_{\mathrm{all}}}.\tag{16}$$
In equation (16), $n_{\mathrm{selected}}$ is the number of selected features and $N_{\mathrm{all}}$ is the total number of features of each dataset. The larger the DRR value, the stronger the ability to reduce dimensions.
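For completeness, a one-line implementation of equation (16), with an arbitrary usage example:

```python
def drr(n_selected, n_total):
    """Dimensionality Reduction Rate of equation (16)."""
    return 1 - n_selected / n_total

# Example: drr(30, 75) -> 0.6
```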
3.3.2. Experiment of MICV-ERMFT Results and Analysis
Figure 10 shows the experimental results obtained with four classifiers using the eight feature evaluation methods combined with ERMFT. Figures 10(a) and 10(b) show the results on the Birds dataset; Figures 10(c) and 10(d) show the results on the Crane dataset. In Figure 10(a), the four groups of bars correspond to the four classifiers, and the nine bars in each group are the Kappa values of the eight methods combined with ERMFT and of the original data (ORI). The heat map in Figure 10(b) shows the number of selected features when each method reaches its reported Kappa value, and Figures 10(c) and 10(d) are organized in the same way. Figure 10(a) clearly shows that the MICV-ERMFT method has a slightly higher Kappa than the other methods, and the advantage with the J48 classifier in Figure 10(c) is more pronounced. Moreover, the Kappa of the MICV-ERMFT method is higher than that of the original data. Considering Figures 10(a) and 10(b) together, it is evident that the MICV-ERMFT method achieves good modeling performance with a small number of features compared with the other methods. Figures 10(c) and 10(d) show similar results.

In conclusion, compared with the other seven methods, the MICV-ERMFT method demonstrates good abilities in dimensionality reduction and feature interpretation.
Combining Figures 8(b) and 8(d) with Table 6, it is clear that the MICV-ERMFT method has a significant dimensionality reduction effect and improves model performance on both the Birds dataset and the Crane dataset. In Table 6, the Kappa value and DRR are very good for the J48, NB, and SVM classifiers on the Birds dataset. In particular, for the NB classifier, none of the other seven comparison methods exceeds the Kappa of ORI, while the Kappa of the MICV-ERMFT method exceeds 0.4. On the Crane dataset, MICV-ERMFT outperforms the other methods. Table 7 reports the running time of the MICV-ERMFT method and the other seven feature selection methods; MICV-ERMFT is not noticeably more time-consuming than the other methods.
In the experiments on the Birds and Crane datasets, the Kappa values obtained with different classifiers using the MICV-ERMFT method are generally superior to those of the other methods. Although other methods surpass MICV-ERMFT with some classifiers, MICV-ERMFT remains excellent for the most part and is more stable. Besides, the MICV-ERMFT method improves the Kappa value compared to the original data; although the improvement is minimal in some cases, MICV-ERMFT uses only about half as many features as the original data.
In conclusion, MICV-ERMFT has better performance in dimensionality reduction and model performance improvement.
4. Conclusion
Feature selection is an important preprocessing step in data mining and classification. In recent years, researchers have focused on feature contribution evaluation and redundancy reduction, and different optimization algorithms have been proposed to address this problem. In this paper, we measure the contribution of features to the classification from the perspective of probability. Combined with the maximum feature tree to remove the redundancy, the MICV-ERMFT method is proposed to select the optimal features and applied in the automatic recognition of bird sounds.
To verify the MICV-ERMFT method’s effectiveness in automatic bird sound recognition, two datasets are used in the experiments: data from different genera (Birds dataset) and data from the same genus (Crane dataset). The experimental results show that the Kappa indicator on the Birds dataset reaches 0.93 with a dimension reduction rate of 57%, and the Kappa on the Crane dataset reaches 0.88 with a dimension reduction rate of 53%.
This study shows that the proposed MICV-ERMFT feature selection method is effective. The bird audio selected in this paper is noise filtered, and further research should test this method’s performance using a denoising method. We will continue to explore the performance of MICV-ERMFT in the dataset with a larger number of features and instances.
Data Availability
All the data included in this study are available from the corresponding author upon request.
Disclosure
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This research was funded by the National Natural Science Foundation of China under Grants nos. 61462078, 31960142, and 31860332.