Abstract
With the rapid development of multimedia technology and computer information processing, the volume of multimedia data is growing explosively. Manual recognition of sound materials is inefficient, so a system that recognizes and classifies sound materials automatically is needed. To improve the accuracy of sound recognition, two models are established and compared: the hidden Markov model (HMM) and the back propagation neural network (BPNN). First, an HMM is established, with sound materials selected at random as test samples. The expected classification is compared with the actual classification to obtain the recognition rate of each class; the final average recognition rate is 61%. The anti-interference characteristics of the trained HMM are then tested by measuring its recognition rate in six signal-to-noise ratio (SNR) environments; the recognition rate falls markedly as the SNR decreases. Second, the BPNN model is built and 200 BPNN training experiments are conducted. The trained model with the highest average recognition rate is selected as the experimental model, and its average recognition rate is higher than 90%. The expressive ability and stability of the trained model are simulated after new samples are introduced, and its anti-interference performance is tested in different SNR environments. The test results are good; the recognition rate decreases only for some sound sources of complex types. Finally, the accuracy of the HMM in the experiment is lower than that of the BPNN. The BPNN therefore has an advantage in both recognition accuracy and recognition efficiency for the sounds studied, and the work provides a reference for the automatic recognition of sound materials.
1. Introduction
In animation sound production, dubbing requires a large number of sound materials. Although many materials are stored in sound effect libraries, some sounds have to be designed specifically for a scene or recorded as onomatopoeia using various props [1]. At present, audio creation still depends on manual recognition. Sound effects draw on a wide variety of sound resources which, after editing, classification, mixing, and other operations, are arranged very compactly and must be identified by ear. This works for small amounts of material, but with large amounts it consumes a great deal of energy and leads to hearing fatigue and, in serious cases, to memory errors and judgment deviations [2–4]. Within sound classification and recognition, speech recognition is relatively mature, and functions such as song and music recognition are widely used. By contrast, few studies address the automatic classification of animation sound effects.
Haeb-Umbach et al. introduced the algorithms used to achieve accurate long-distance (far-field) speech recognition. Although deep learning (DL) accounts for a large share of the technological breakthroughs, combining it judiciously with traditional signal processing can yield effective solutions [5]. Yu et al. proposed a peak-based framework for the environmental sound recognition (ESR) task from a more brain-like perspective. Compared with the baseline methods, their framework performed best, and it has several favorable features, including early decision-making, suitability for small datasets, and continuous dynamic processing [6]. Jin et al. prepared a sound detector based on MXene by combining DL with 2D MXenes; its sensitive response to pressure and vibration helps it achieve high recognition accuracy and resolution [7]. Li et al. studied the classification of dairy cow feeding behavior based on automatic sound recognition and found that DL can classify feeding behavior [8]. Lhoest et al. presented classifiers based on classical machine learning (ML) and a lighter convolutional neural network (CNN) model for environmental sound recognition. Their results show that classical ML classifiers can be combined to obtain accuracy similar to, and sometimes better than, that of DL models [9]. Demir et al. explored environmental sound classification based on deep features, which are extracted from the fully connected layers of a newly developed CNN model trained end to end on spectrogram images. Experiments show that the classification accuracy of the model reaches 96.23% and 86.70%, respectively [10]. Zhang et al. studied CNN and recurrent neural network units based on feature fusion for environmental sound classification (ESC) and found that the model with LMCC features as input is well suited to ESC problems and achieves good classification accuracy [11]. Catanghal proposed and discussed a framework for a sound detection system in the study room. Feature extraction is used to obtain parametric representations for analyzing sounds in a smart-home machine listening system designed for the study room. The study concludes that ML is feasible for sound detection and can be applied in innovative learning environments [12–14].
The literature shows that current research is mostly at the stage of feasibility analysis, or that its classification and recognition results are not yet strong enough; the efficiency and accuracy of automatic sound classification and recognition still need to be improved. For this reason, this work proposes applying artificial intelligence (AI) and multimedia technology to sound recognition, so that sound materials are recognized automatically and the labor cost and inefficiency of manual work are avoided. First, sound feature recognition is combined with ML, with attention to the fitting problems caused by different ML algorithms and to the small sample size in the experiment. Second, the models used in the simulation are tuned to their best performance to achieve classification and recognition. An efficient sound recognition method is thus designed with ML algorithms to facilitate the classification of sound materials.
2. Establishment of Algorithm Model
2.1. Algorithm Model Based on Hidden Markov Model
The hidden Markov model (HMM) is built on a hidden Markov chain and describes a process of state transitions. In a first-order HMM, the next state depends only on the current state of the system. The state transition probability is the probability of moving from one state to another. All transition probabilities are collected in the state transition matrix, which does not change over time. The initial probability is the probability of each state at the start of the model [15, 16]. An HMM therefore includes at least an initial state vector and a state transition matrix.
An HMM is usually represented by $\lambda = (A, B, \pi)$. A complete HMM also has two other parameters: $N$, the number of states, and $M$, the number of observation symbols. These two parameters and the three probability distributions constitute the HMM.
$N$ is the number of states in the HMM. The set of hidden states is $Q = \{q_1, q_2, \ldots, q_N\}$, and the state at time $t$ is denoted $i_t$, with $i_t \in Q$.
$M$ is the number of observation symbols. The set of observation symbols in the model is $V = \{v_1, v_2, \ldots, v_M\}$, and the observation at time $t$ is denoted $o_t$, with $o_t \in V$.
$A$ is the probability distribution of state transition, an $N \times N$ matrix composed of the hidden transition probabilities. Each entry of the hidden Markov chain's transition matrix is the probability of moving from one hidden state to another:
$$A = [a_{ij}]_{N \times N}, \quad a_{ij} = P(i_{t+1} = q_j \mid i_t = q_i) \tag{1}$$
In equation (1), $a_{ij}$ has the following characteristics:
$$a_{ij} \ge 0, \quad \sum_{j=1}^{N} a_{ij} = 1$$
$B$ is the probability distribution of the observation symbols in state $q_j$. In a given HMM, each hidden state generates the observation symbols with its own probabilities, so $B$ collects the observation probabilities obtained by fixing each hidden state in turn. It is also called the confusion matrix:
$$B = [b_j(k)]_{N \times M}, \quad b_j(k) = P(o_t = v_k \mid i_t = q_j)$$
$\pi$ is the probability distribution of the initial state:
$$\pi = (\pi_i), \quad \pi_i = P(i_1 = q_i), \quad i = 1, 2, \ldots, N$$
These five parameters are generally called the five elements of the HMM, as shown in Figure 1.
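As an illustration of the five elements, the sketch below stores the three probability distributions in a small Python container and derives the number of states and the number of observation symbols from their shapes. The class name `HMM`, the method `validate`, and the toy numbers are assumptions made for this example and are not taken from the experiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """Hypothetical container for lambda = (A, B, pi); N and M follow from the shapes."""
    A: List[List[float]]   # N x N state transition matrix
    B: List[List[float]]   # N x M observation probability matrix
    pi: List[float]        # length-N initial state distribution

    @property
    def N(self) -> int:
        """Number of hidden states."""
        return len(self.pi)

    @property
    def M(self) -> int:
        """Number of observation symbols."""
        return len(self.B[0])

    def validate(self, tol: float = 1e-9) -> bool:
        """Check the shapes and that every row is a probability distribution."""
        shapes_ok = (len(self.A) == self.N and len(self.B) == self.N
                     and all(len(row) == self.N for row in self.A)
                     and all(len(row) == self.M for row in self.B))
        rows = self.A + self.B + [self.pi]
        probs_ok = (all(p >= 0 for row in rows for p in row)
                    and all(abs(sum(row) - 1.0) <= tol for row in rows))
        return shapes_ok and probs_ok


# Toy model with 2 hidden states and 3 observation symbols (made-up numbers).
toy = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
          pi=[0.6, 0.4])
print(toy.N, toy.M, toy.validate())
```

Checking that each row sums to one catches the most common mistake when such matrices are typed in or estimated by hand.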

The HMM therefore extends the conventional Markov process with the concept of an observation distribution, and the algorithm of this model works with the probabilistic relationship between the hidden states and the observed states [17].
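Given the HMM $\lambda$, the forward probability is the probability that the partial observation sequence up to time $t$ is $o_1, o_2, \ldots, o_t$ and the state at time $t$ is $q_i$:
$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, i_t = q_i \mid \lambda)$$
The forward probability and the observation sequence probability can be obtained recursively. Starting from $\alpha_1(i) = \pi_i b_i(o_1)$, the recursion is
$$\alpha_{t+1}(i) = \left[ \sum_{j=1}^{N} \alpha_t(j) a_{ji} \right] b_i(o_{t+1}) \tag{7}$$
In equation (7), $a_{ji}$ is the transition probability.
From the definition of the forward probability,
$$\alpha_t(j) a_{ji} = P(o_1, \ldots, o_t, i_t = q_j \mid \lambda) \, P(i_{t+1} = q_i \mid i_t = q_j, \lambda)$$
By the Markov assumption of the HMM,
$$P(i_{t+1} = q_i \mid i_t = q_j, \lambda) = P(i_{t+1} = q_i \mid o_1, \ldots, o_t, i_t = q_j, \lambda)$$
so that
$$\alpha_t(j) a_{ji} = P(o_1, \ldots, o_t, i_t = q_j, i_{t+1} = q_i \mid \lambda)$$
Summing over $j$ gives
$$\sum_{j=1}^{N} \alpha_t(j) a_{ji} = P(o_1, \ldots, o_t, i_{t+1} = q_i \mid \lambda)$$
The observation probability $b_i(o_{t+1})$ in the recursion, combined with the observation independence assumption, is
$$b_i(o_{t+1}) = P(o_{t+1} \mid i_{t+1} = q_i, \lambda) = P(o_{t+1} \mid o_1, \ldots, o_t, i_{t+1} = q_i, \lambda)$$
Hence $\alpha_{t+1}(i)$ can be expressed as the probability
$$\alpha_{t+1}(i) = P(o_1, \ldots, o_t, o_{t+1}, i_{t+1} = q_i \mid \lambda)$$
To get the value of $P(O \mid \lambda)$, the forward probabilities of all states at the last step of the Markov sequence are summed:
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
Given the HMM $\lambda$ and the state $q_i$ at time $t$, the backward probability is the probability that the partial observation sequence from time $t+1$ to $T$ is $o_{t+1}, o_{t+2}, \ldots, o_T$:
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid i_t = q_i, \lambda)$$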
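The forward and backward recursions can be sketched in a few lines of plain Python. The sketch assumes discrete observation symbols encoded as integer indices, and the three-state model with its numbers is a made-up toy example, not data from the experiment. Both recursions should yield the same observation sequence probability, which serves as a consistency check.

```python
def forward(A, B, pi, obs):
    """Return alpha, where alpha[t][i] = P(o_1..o_t, state q_i at time t | model)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(N)) * B[i][o]
                      for i in range(N)])
    return alpha


def backward(A, B, pi, obs):
    """Return beta, where beta[t][i] = P(o_{t+1}..o_T | state q_i at time t, model)."""
    N = len(pi)
    beta = [[1.0] * N]                      # beta at the final time step is 1
    for o in reversed(obs[1:]):             # o_T, o_{T-1}, ..., o_2
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta


# Toy model: 3 hidden states, 2 observation symbols (made-up numbers).
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
obs = [0, 1, 0]

alpha = forward(A, B, pi, obs)
beta = backward(A, B, pi, obs)
p_forward = sum(alpha[-1])
p_backward = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(p_forward, p_backward)
```

For long sequences these raw probabilities underflow, so practical implementations rescale each step or work with logarithms.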
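According to the successive approximation algorithm, the $Q$ function is defined first. It is the expectation of the logarithm of the complete-data likelihood, taken with respect to the conditional distribution of the hidden variables given the observations and the current model parameters (the constant factor $1 / P(O \mid \bar{\lambda})$ is omitted):
$$Q(\lambda, \bar{\lambda}) = \sum_{I} \log P(O, I \mid \lambda) \, P(O, I \mid \bar{\lambda})$$
In the equation, $\bar{\lambda}$ is the current estimate of the HMM parameters, and $\lambda$ is the set of HMM parameters to be maximized. The algorithm first computes $Q$ and then maximizes it, so the $Q$ function needs to be decomposed. Since
$$P(O, I \mid \lambda) = \pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T)$$
the $Q$ function can be written as
$$Q(\lambda, \bar{\lambda}) = \sum_{I} \log \pi_{i_1} P(O, I \mid \bar{\lambda}) + \sum_{I} \left( \sum_{t=1}^{T-1} \log a_{i_t i_{t+1}} \right) P(O, I \mid \bar{\lambda}) + \sum_{I} \left( \sum_{t=1}^{T} \log b_{i_t}(o_t) \right) P(O, I \mid \bar{\lambda})$$
The parameters to be estimated appear separately in the three terms, so each term only needs to be maximized on its own.
For the transition term, maximization gives the probability of moving from hidden state $q_i$ to hidden state $q_j$ as a ratio of expectations summed over time $t$:
$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, i_t = q_i, i_{t+1} = q_j \mid \bar{\lambda})}{\sum_{t=1}^{T-1} P(O, i_t = q_i \mid \bar{\lambda})}$$
The numerator is the expected number of transitions from $q_i$ to $q_j$ over all time steps of the trellis, and the denominator is the expected number of times the model is in state $q_i$ given the observation sequence $O$.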
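One re-estimation step of this kind can be sketched as follows. The code is the standard Baum-Welch update for a single sequence of discrete observation symbols, written from the usual textbook formulas; the function name `baum_welch_step`, the toy model, and the observation sequence are assumptions for illustration and do not reproduce the training code of the experiment.

```python
def baum_welch_step(A, B, pi, obs):
    """One re-estimation step; returns (new_A, new_B, new_pi, likelihood of obs)."""
    N, M, T = len(pi), len(B[0]), len(obs)

    # Forward probabilities alpha[t][i].
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][j] * A[j][i] for j in range(N)) * B[i][obs[t]]
                      for i in range(N)])

    # Backward probabilities beta[t][i].
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]

    likelihood = sum(alpha[-1])

    # gamma[t][i]: probability of being in state i at time t given obs.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j at time t given obs.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]

    # Ratios of expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi, likelihood


# Toy model and observation sequence (made-up numbers).
A0 = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B0 = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
pi0 = [0.2, 0.4, 0.4]
seq = [0, 1, 0, 0, 1, 1, 0]

A1, B1, pi1, like0 = baum_welch_step(A0, B0, pi0, seq)
_, _, _, like1 = baum_welch_step(A1, B1, pi1, seq)
print(like0, like1)   # the likelihood does not decrease after re-estimation
```

In practice the step is repeated until the likelihood stops improving, and the statistics are accumulated over all training clips of a category rather than a single sequence.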
The essence of the Viterbi algorithm is to find, for a given observation sequence, the most probable state sequence, which amounts to maximizing $P(I \mid O, \lambda)$. The inputs are the model $\lambda = (A, B, \pi)$ and the observation sequence $O = (o_1, o_2, \ldots, o_T)$.
$\delta_t(i)$ is the maximum probability over all single paths $(i_1, i_2, \ldots, i_t)$ that are in state $q_i$ at time $t$:
$$\delta_t(i) = \max_{i_1, \ldots, i_{t-1}} P(i_t = q_i, i_{t-1}, \ldots, i_1, o_t, \ldots, o_1 \mid \lambda)$$
The variable $\delta$ is computed recursively:
$$\delta_{t+1}(i) = \max_{1 \le j \le N} \left[ \delta_t(j) a_{ji} \right] b_i(o_{t+1})$$
Specify the starting value $\delta_1(i) = \pi_i b_i(o_1)$ and then iterate. The termination result is
$$P^* = \max_{1 \le i \le N} \delta_T(i), \quad i_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$$
The Viterbi algorithm stores a back pointer for every state, $\psi_{t+1}(i) = \arg\max_{1 \le j \le N} [\delta_t(j) a_{ji}]$, which records the predecessor state on the best path into $q_i$. Following the back pointers from the final state $i_T^*$ recovers the optimal path. The partial probability in the Viterbi algorithm is calculated differently from that in the forward algorithm: it is the probability of the single most probable path reaching a certain state at time $t$, not the sum over all paths. When $t = 1$, no preceding path exists, so the partial probability is obtained by multiplying the initial probability of the state by the observation probability of the corresponding observation, as in the forward algorithm [18].
The HMM consists of two closely related processes: a hidden Markov chain over the states, and an observable process that relates the states to the observations of the model. Over time, the model can move from one state to another or remain in the same state. Training uses five-second audio clips for each category, and the training flow chart is shown in Figure 2.

The audio of each category is processed with this same two-process structure. In this simulation, one HMM is established for each classification.
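With one HMM per category, a common way to classify a clip is to score its observation sequence under every model and pick the category whose model gives the highest likelihood. The sketch below assumes this decision rule (the text does not state it explicitly), uses a scaled forward pass to avoid underflow, and works on two made-up toy models whose labels `steady` and `bursty` are hypothetical.

```python
import math


def log_likelihood(model, obs):
    """log P(obs | model) by the forward algorithm with per-step rescaling."""
    A, B, pi = model
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][obs[t]]
                     for i in range(N)]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf            # the sequence is impossible under this model
        total += math.log(scale)        # log-likelihood is the sum of the log scales
        alpha = [a / scale for a in alpha]
    return total


def classify(models, obs):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], obs))


# Two toy category models over 2 observation symbols (made-up numbers).
transitions = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "steady": (transitions, [[0.9, 0.1], [0.8, 0.2]], [0.5, 0.5]),
    "bursty": (transitions, [[0.2, 0.8], [0.1, 0.9]], [0.5, 0.5]),
}
print(classify(models, [0, 0, 0, 0]), classify(models, [1, 1, 1, 0, 1]))
```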
The Viterbi algorithm is used in logarithmic form, and the initial and transition probabilities are calculated separately.
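A minimal sketch of a log-domain Viterbi decoder is given here for orientation. It assumes discrete observation symbols encoded as integer indices and is an illustration only, not the code used in the experiment; the function name `viterbi_log` and the toy model are hypothetical. Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long sequences.

```python
import math


def viterbi_log(A, B, pi, obs):
    """Return (best state path, log-probability of that path) in the log domain."""
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    N = len(pi)
    log_A = [[log(p) for p in row] for row in A]
    log_B = [[log(p) for p in row] for row in B]
    log_pi = [log(p) for p in pi]

    # Initialisation: delta_1(i) = log pi_i + log b_i(o_1).
    delta = [log_pi[i] + log_B[i][obs[0]] for i in range(N)]
    pointers = []
    for o in obs[1:]:
        new_delta, ptr = [], []
        for i in range(N):
            # Best predecessor j for state i at this step.
            best_j = max(range(N), key=lambda j: delta[j] + log_A[j][i])
            ptr.append(best_j)
            new_delta.append(delta[best_j] + log_A[best_j][i] + log_B[i][o])
        delta = new_delta
        pointers.append(ptr)

    # Termination and backtracking along the stored pointers.
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy model: 3 hidden states, 2 observation symbols (made-up numbers).
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]

best_path, best_logp = viterbi_log(A, B, pi, [0, 1, 0])
print(best_path, math.exp(best_logp))
```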
2.2. Establishment of Model Based on BPNN Algorithm
An artificial neural network (ANN) is an adaptive network composed of simple neurons. It is inherently nonlinear and, like the human nervous system, operates through many units connected in parallel, which lets it perform both qualitative and quantitative operations. Because an ANN is composed of many neurons, the output of one neuron is the input of another. Forward propagation means that the signal enters at the input layer and passes through the operations of the neurons to the output. A neural network (NN) may contain many hidden-layer and output neurons. These artificial neurons are derived from biological neuron models: in a biological NN, a neuron that becomes “excited” releases chemicals to other neurons. The neurons are linked to one another, pass information along incoming and outgoing nerves, and finally hand it over to the central nervous system for processing. NNs in machine learning are modeled on this structure [19–21].
A neuron receives the input signals $x_1, x_2, \ldots, x_n$ at its input end. Each external input signal $x_i$ is multiplied by its link weight $w_i$, and the weighted inputs are summed to give the net input:
$$net = \sum_{i=1}^{n} w_i x_i$$
The function $f$ is a nonlinear characteristic function. It converts $net$ into the output $y$:
$$y = f(net)$$
In the equation, $f$ is the activation function. Back propagation neural networks and deep learning (DL) usually use the logarithmic S-type (log-sigmoid) function or the hyperbolic tangent function. The logarithmic S-type activation function is ($b$ is the bias value)
$$f(net) = \frac{1}{1 + e^{-(net + b)}}$$
The hyperbolic tangent S-type activation function is ($b$ is the bias value)
$$f(net) = \tanh(net + b) = \frac{e^{(net + b)} - e^{-(net + b)}}{e^{(net + b)} + e^{-(net + b)}}$$
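The two S-type activation functions can be written down directly. The sketch below uses the names `logsig` and `tansig` as hypothetical labels for the logarithmic and hyperbolic tangent forms, and also includes the derivative of the logarithmic form, which is the quantity that gradient-based weight updates rely on.

```python
import math


def logsig(net, b=0.0):
    """Logarithmic S-type activation 1 / (1 + exp(-(net + b))), range (0, 1)."""
    z = net + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                     # avoids overflow for large negative z
    return e / (1.0 + e)


def tansig(net, b=0.0):
    """Hyperbolic tangent S-type activation tanh(net + b), range (-1, 1)."""
    return math.tanh(net + b)


def logsig_derivative(net, b=0.0):
    """Derivative of logsig with respect to net: y * (1 - y)."""
    y = logsig(net, b)
    return y * (1.0 - y)


for value in (-2.0, 0.0, 2.0):
    print(value, logsig(value), tansig(value), logsig_derivative(value))
```

The two functions are closely related: the hyperbolic tangent of a value equals twice the logarithmic S-type function of double that value, minus one, so the main practical difference is the output range.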
The topology of the network is formed by the links between neurons. NN structures are divided into layered networks and interconnected networks. A layered network usually includes an input layer, a middle (hidden) layer, and an output layer [22]. Figure 3 shows the schematic diagram of the layered network.

Layered networks can be further subdivided into simple forward networks, forward networks with feedback signals, and forward networks with connections between layers. The BP network is one of the most widely used ANN models at present. It is a multilayer NN with three or more layers, each of which contains many neurons. The BPNN is a multilayer feedforward network trained by supervised learning with the error backpropagation algorithm [23]. Figure 4 shows the structure of a BPNN model with only one middle layer.

The middle layer is the feature space, and its number of nodes is the dimension of that space. In a BPNN, the neurons receive the learning pattern, and every neuron in one layer is linked to every neuron in the next layer. The activation values of the neurons are passed through the middle layer to the output layer. The neurons of the output layer then feed back the output error, following the basic principle of reducing the difference between the expected and actual output values. Because the error is fed back from the output layer through the hidden layer to each connection weight, the method is also known as the “error backpropagation algorithm.” As the connection weights are adjusted repeatedly, the error of the response to the input pattern decreases [24].
Let the number of nodes be $n$ in the input layer, $l$ in the middle layer, and $m$ in the output layer. In addition, the weights need to be set. The weight from the input layer to the middle layer is $\omega_{ij}$, the weight from the middle layer to the output layer is $\omega_{jk}$, the bias from the input layer to the middle layer is $a_j$, the bias from the middle layer to the output layer is $b_k$, and the learning rate is $\eta$. The excitation function is set as the logarithmic S-type activation function:
$$g(x) = \frac{1}{1 + e^{-x}}$$
The output of the middle layer is $H_j$:
$$H_j = g\left( \sum_{i=1}^{n} \omega_{ij} x_i + a_j \right), \quad j = 1, 2, \ldots, l$$
The output of the output layer is $O_k$:
$$O_k = \sum_{j=1}^{l} H_j \omega_{jk} + b_k, \quad k = 1, 2, \ldots, m$$
With $Y_k$ the expected output and $e_k = Y_k - O_k$, the error is calculated as
$$E = \frac{1}{2} \sum_{k=1}^{m} (Y_k - O_k)^2 = \frac{1}{2} \sum_{k=1}^{m} e_k^2$$
The goal is to make the error minimal, $\min E$. The weights from the middle layer to the output layer and from the input layer to the middle layer are updated by gradient descent. The principle of error adjustment is to reduce the error value, which means that the weight correction of each layer is proportional to the negative gradient of the error. The weight update from the middle layer to the output layer is
$$\omega_{jk} \leftarrow \omega_{jk} + \eta H_j e_k$$
It follows from the gradient with respect to the weight from the middle layer to the output layer:
$$\frac{\partial E}{\partial \omega_{jk}} = -e_k H_j$$
For the weight from the input layer to the middle layer, the gradient is
$$\frac{\partial E}{\partial \omega_{ij}} = -H_j (1 - H_j) \, x_i \sum_{k=1}^{m} \omega_{jk} e_k$$
so the updated weight from the input layer to the middle layer is
$$\omega_{ij} \leftarrow \omega_{ij} + \eta H_j (1 - H_j) \, x_i \sum_{k=1}^{m} \omega_{jk} e_k$$
In the same way, the updated bias from the input layer to the middle layer is
$$a_j \leftarrow a_j + \eta H_j (1 - H_j) \sum_{k=1}^{m} \omega_{jk} e_k$$
The updated bias from the middle layer to the output layer is
$$b_k \leftarrow b_k + \eta e_k$$
The input is propagated forward through the network to obtain the actual output, which is compared with the expected output. Iteration stops if the accuracy required by the error function is reached; otherwise the weights of each layer continue to be updated until that accuracy is reached.
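The training loop can be sketched as a small three-layer network in plain Python. The sketch assumes a logarithmic S-type middle layer, a linear output layer, and a squared-error objective updated one sample at a time; whether the experiment used exactly this configuration is not stated, so these choices, the class name `BPNN`, the initial weight range, and the toy data are assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNN:
    """Input layer of n nodes, middle layer of l nodes, output layer of m nodes."""

    def __init__(self, n, l, m, eta=0.1, seed=0):
        rng = random.Random(seed)
        self.n, self.l, self.m, self.eta = n, l, m, eta
        self.w_ij = [[rng.uniform(-0.5, 0.5) for _ in range(l)] for _ in range(n)]
        self.w_jk = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(l)]
        self.a = [0.0] * l      # biases of the middle layer
        self.b = [0.0] * m      # biases of the output layer

    def forward(self, x):
        H = [sigmoid(sum(self.w_ij[i][j] * x[i] for i in range(self.n)) + self.a[j])
             for j in range(self.l)]
        O = [sum(H[j] * self.w_jk[j][k] for j in range(self.l)) + self.b[k]
             for k in range(self.m)]
        return H, O

    def train_step(self, x, y):
        """One gradient-descent update on a single sample; returns its error."""
        H, O = self.forward(x)
        e = [y[k] - O[k] for k in range(self.m)]
        # Error propagated back to each middle node, using the pre-update weights.
        back = [H[j] * (1.0 - H[j]) * sum(self.w_jk[j][k] * e[k] for k in range(self.m))
                for j in range(self.l)]
        for j in range(self.l):
            for k in range(self.m):
                self.w_jk[j][k] += self.eta * H[j] * e[k]
            self.a[j] += self.eta * back[j]
            for i in range(self.n):
                self.w_ij[i][j] += self.eta * back[j] * x[i]
        for k in range(self.m):
            self.b[k] += self.eta * e[k]
        return 0.5 * sum(v * v for v in e)


def total_error(net, samples):
    return sum(0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, net.forward(x)[1]))
               for x, y in samples)


# Toy two-class data with one-hot targets (made-up values).
samples = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [1.0, 0.0]),
           ([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [0.0, 1.0])]
net = BPNN(2, 3, 2)
error_before = total_error(net, samples)
for _ in range(500):
    for x, y in samples:
        net.train_step(x, y)
error_after = total_error(net, samples)
print(error_before, error_after)
```

Calling `BPNN(26, 13, 15)` builds a network with the layer sizes used later in the text; only the constructor arguments change.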
The iterative algorithm must converge: the sequence of iterates $\{w^{(k)}\}$ converges to a certain minimum point $w^*$, that is,
$$\lim_{k \to \infty} \left\| w^{(k)} - w^* \right\| = 0$$
If the iterative sequence converges to $w^*$ only when its starting point is close to the minimum point, the algorithm is called locally convergent, since it is constrained by the minimum point. If any starting point produces an iterative sequence that converges to $w^*$, the algorithm is called globally convergent.
An iterative optimization algorithm knows only the iterates it has already calculated, not the optimal solution. Therefore, the decision to end the iteration must be based on the information provided by the known iterates. The termination condition is usually as shown in equation (38):
$$\left\| w^{(k+1)} - w^{(k)} \right\| < \varepsilon \tag{38}$$
Equation (39) instead determines whether the change in the error is less than a predefined value:
$$\left| E(w^{(k+1)}) - E(w^{(k)}) \right| < \varepsilon \tag{39}$$
These two conditions use the absolute error between two iterations. In some cases, the relative error is required to judge the termination:
$$\frac{\left\| w^{(k+1)} - w^{(k)} \right\|}{\left\| w^{(k)} \right\|} < \varepsilon \quad \text{or} \quad \frac{\left| E(w^{(k+1)}) - E(w^{(k)}) \right|}{\left| E(w^{(k)}) \right|} < \varepsilon$$
The norm of the gradient can also be used. When it falls within the specified range, the iteration is terminated, as shown in equation (41):
$$\left\| \nabla E(w^{(k)}) \right\| < \varepsilon \tag{41}$$
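The termination tests can be combined in one gradient-descent loop, as sketched below. The objective is a made-up quadratic with a known minimum, and the function name `gradient_descent`, the step size, and the tolerance are assumptions chosen for illustration. The function also returns which test ended the run.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def gradient_descent(E, grad, w0, eta=0.1, eps=1e-8, max_iter=10000):
    """Minimise E from w0; stop on a small gradient, a small step, or a small error change."""
    w = list(w0)
    for k in range(1, max_iter + 1):
        g = grad(w)
        if norm(g) < eps:
            return w, k, "gradient norm"
        w_next = [wi - eta * gi for wi, gi in zip(w, g)]
        if norm([a - b for a, b in zip(w_next, w)]) < eps:
            return w_next, k, "step size"
        if abs(E(w_next) - E(w)) < eps:
            return w_next, k, "error change"
        w = w_next
    return w, max_iter, "iteration limit"


# Made-up quadratic objective with its minimum at (3, -1).
def E(w):
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def grad_E(w):
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


w_star, iterations, reason = gradient_descent(E, grad_E, [0.0, 0.0])
print(w_star, iterations, reason)
```

A relative version of each test divides the same quantity by the norm of the current iterate or by the current error, which makes the tolerance independent of the scale of the problem.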
The quality of an NN design depends on the accuracy and the training time of the network. The structure of the BPNN is determined by the characteristics of the input and output data of the system: the characteristic parameters have 26 dimensions in total, and there are 15 types of sound to be classified. Some problems can be solved by a single-layer network with a nonlinear activation function. However, such problems can also be solved by an adaptive linear network, and the accuracy obtained with only a single nonlinear layer is not high, so the number of layers is increased to achieve better training accuracy and a lower error. The experiment therefore increases the number of network layers moderately, without making the network overly complex. After repeated simulation and training, the 26-13-15 structure is selected for BPNN training: the input layer has 26 neurons, the middle layer has 13 neurons, and the output layer has 15 neurons. The 15 sound effects are regrouped to verify the recognition rate of the trained NN model on easily confused sound effects. The schematic diagram of easily confused sound effects is shown in Figure 5.

3. Analysis of the Simulation Results of Sound Feature of the Model
3.1. Simulation Results of Sound Feature Parameters Based on Hidden Markov
Ten sound effects are selected for hidden Markov modeling: street crowd sound (jd), stadium crowd sound (ty), TV program sound (ds), train sound (hc1), aircraft sound (fj), stream sound (xl), wave sound (hl), ship sound (lc), truck sound (hc2), and motorcycle sound (mt). The number of HMM training samples is shown in Figure 6.

For each frame of the audio data, the 8th-order frame parameters, their first-order and second-order differences, the short-term energy, and the short-term average zero-crossing rate are extracted. A total of 8 × 3 + 2 = 26 characteristic parameters are used as observation symbols in this experiment. An HMM is established for each classification. The recognition rate of each category is shown in Figure 7.
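The time-domain part of this feature set is easy to sketch. The code below computes the short-term energy and the short-term zero-crossing rate per frame and appends them to an 8-dimensional base vector and its first-order and second-order differences, giving 26 values per frame. The base vectors here are placeholder numbers standing in for the 8th-order frame parameters, and the frame size, the simple two-point difference, and all function names are assumptions made for illustration.

```python
def split_frames(signal, size, hop):
    """Cut the signal into frames of `size` samples, advancing by `hop` samples."""
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, hop)]


def short_time_energy(frame):
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def differences(vectors):
    """Frame-to-frame difference of a sequence of vectors (first frame padded with zeros)."""
    out = [[0.0] * len(vectors[0])]
    for prev, cur in zip(vectors, vectors[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def assemble(base, frames):
    """Concatenate base, first and second differences, energy, and zero-crossing rate."""
    d1 = differences(base)
    d2 = differences(d1)
    return [base[t] + d1[t] + d2[t] + [short_time_energy(f), zero_crossing_rate(f)]
            for t, f in enumerate(frames)]


# Toy signal that changes sign at every sample; four frames of four samples each.
signal = [1.0, -1.0] * 8
frames = split_frames(signal, size=4, hop=4)
# Placeholder 8-dimensional base parameters per frame (made-up values).
base = [[float(t)] * 8 for t in range(len(frames))]
features = assemble(base, frames)
print(len(features), len(features[0]))
```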

Among the categories, stadium crowd sound and train sound are recognized relatively well, while the remaining results are unsatisfactory. The vehicle sounds in the various categories probably share confusable elements, which lowers their recognition rates. The recognition rate of ship sound is less than 50%, possibly because ship recordings share characteristics with the sound of ocean waves. To verify the anti-interference ability of the HMM, Gaussian white noise simulating a real environment was added to the samples. The signal-to-noise ratio (SNR) of the original tested sound materials was higher than 75 dB. After the Gaussian white noise was added, the SNR was 10, 20, 30, 40, 50, and 60 dB, respectively. The comparison of test recognition rates is shown in Figure 8.
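Adding Gaussian white noise at a chosen SNR amounts to scaling the noise power relative to the signal power. The sketch below assumes the usual definition of SNR in decibels as ten times the base-10 logarithm of the power ratio; the function names and the sine test signal are made up for illustration and are not the materials of the experiment.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return the signal plus zero-mean Gaussian noise at the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB actually obtained, computed from the clean and noisy signals."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Made-up test tone; the six target SNR values follow the text.
clean = [math.sin(0.01 * n) for n in range(20000)]
for target in (10, 20, 30, 40, 50, 60):
    noisy = add_white_noise(clean, target, seed=target)
    print(target, round(measured_snr_db(clean, noisy), 2))
```

The measured value differs slightly from the target because the noise is random; the gap shrinks as the clip gets longer.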

Figure 8 shows that the recognition rate of the HMM falls significantly as the SNR decreases, indicating that the anti-interference ability of this model is insufficient. In summary, the trained HMM is not well suited to describing sound materials with more complex content, nor can it meet the needs of working with onomatopoeia materials.
3.2. Simulation Results of Sound Feature Based on BPNN
The BPNN uses the 26-13-15 structure: the input layer has 26 neurons, the middle layer has 13 neurons, and the output layer has 15 neurons. A total of 15 typical sound materials are used. Five sounds are added to the ten above: children's sound (ht), meeting sound (hy), footstep sound (jb), bicycle sound (zx), and car sound (qc). The sound types can be grouped into vehicle sounds, human voices, and water sounds. The recognition accuracy of each category is shown in Figure 9.

As shown in Figure 9, except for the children’s voice, which has a low recognition rate, all categories are recognized at a rate above 80%. Human voice is the more complex type and should be subdivided further. Among the easily confused crowd sounds, the stadium crowd sound reaches a recognition rate of 100% and the street crowd sound reaches 90%. The recognition rate of the meeting sound, 95.56%, is better than that of the TV program sound. Footsteps are recognized well, and the rates for vehicles are relatively high: cars, trains, and planes reach 100%. The recognition rate of ship sound is less satisfactory at 80.36%, which may be caused by the overlap of some of its characteristics with those of waves. The recognition rate of water sounds is about 90%, which is satisfactory. In summary, the recognition rate of easily confused sound effects is relatively high, above 90% on average.
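Per-category and average recognition rates of this kind can be computed from paired lists of expected and predicted labels. The sketch below assumes that the average is the unweighted mean of the per-category rates, which the text does not specify; the function name and the label lists are made-up examples and are not results of the experiment.

```python
def recognition_rates(expected, predicted):
    """Return ({category: recognition rate}, unweighted mean of the category rates)."""
    totals, hits = {}, {}
    for e, p in zip(expected, predicted):
        totals[e] = totals.get(e, 0) + 1
        hits[e] = hits.get(e, 0) + (1 if e == p else 0)
    per_class = {c: hits[c] / totals[c] for c in totals}
    average = sum(per_class.values()) / len(per_class)
    return per_class, average


# Made-up labels: four ship clips (lc) and two wave clips (hl).
expected = ["lc", "lc", "lc", "lc", "hl", "hl"]
predicted = ["lc", "lc", "lc", "hl", "hl", "hl"]
per_class, average = recognition_rates(expected, predicted)
print(per_class, average)
```

With unbalanced categories, the unweighted mean and the overall fraction of correct clips can differ, so it is worth stating which one is reported.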
In practice, a sound may be damaged for various reasons and become unusable, in which case replaceable sound resources can be found to compensate. Figure 10 shows the scatter distribution of the errors between the actual and predicted classifications for 100 to 500 randomly selected audio clips.

This experiment has limited practical value, because the characteristics of sound materials differ even within the same classification. For example, the sound of a car engine, the sound of a door, the sound of airplanes taking off and landing, the sound of a bicycle chain and bell, and other sounds are all grouped under vehicle sound. Within the same scene, new sounds may also replace the original sound. Therefore, only the replacement of damaged materials with onomatopoeia is considered. Figure 11 shows the test recognition rate of the BPNN model in different SNR environments.

Figure 11 indicates that introducing new sound materials into the BPNN has little impact on the recognition of the original sound effects. The recognition rate is affected to some extent, and new sound resources are harder to recognize accurately when they are complex. The BPNN has good anti-interference performance when the SNR of the sound material is greater than 30 dB.
3.3. Comparison and Analysis of HMM and BPNN
The previous sections show that the recognition rate of the HMM is only about 60%, whereas the BPNN algorithm is more accurate, with an average recognition rate of 91%. Because the sound materials in those experiments were selected at random, three sets of identical sound materials were additionally run through both models to compare their recognition. Figure 12 shows the performance of the two algorithms with a single sound source, a combined sound source, and a complex sound source.

As Figure 12 indicates, the recognition accuracy of the HMM is not as high as that of the BPNN model. One possible reason is that the HMM needs to work together with supervised learning algorithms to perform well, whereas the BPNN model does not. Moreover, the HMM is relatively stable in the single sound source scenario, but its stability drops sharply once the sound source becomes slightly more complicated.
In the BPNN model, recognition of the single, combined, and complex sound sources is better than with the HMM, and stability is also better. In summary, the BPNN model has more advantages than the HMM for recognizing sound effect materials: it achieves a higher recognition rate in a shorter training time, and it has better generalization and reasoning ability and good anti-interference performance.
4. Conclusion
At present, classifying sound effects by manual listening is laborious and cumbersome, so the automatic classification of sound effect materials urgently needs to be studied to improve work efficiency. To improve the accuracy of sound recognition, two models are established to identify sound materials automatically and are compared: the HMM and the BPNN. First, the HMM is established, with sound materials selected at random as test samples. The expected classification is compared with the actual classification to obtain the recognition rate of each class, and the final average recognition rate is 61%. The anti-interference characteristics of the trained HMM are tested in six SNR environments, and its recognition rate falls significantly as the SNR decreases. Second, the BPNN model is established and 200 BPNN training experiments are carried out. The trained model with the highest average recognition rate is selected as the final model, and its average recognition rate is higher than 90%. The expressive ability and stability of the trained model are simulated after new samples are introduced, and its anti-interference performance is tested in different SNR environments. The performance test results are good; the recognition rate decreases only for complex sound types of individual sound sources. Finally, the accuracy of the HMM established in the experiment is not as high as that obtained by the BPNN. The BPNN training method therefore has more advantages. Automatic classification of sound effects with it can better meet the needs of practical applications, ease the work of audio practitioners, and provide a theoretical basis for the automatic identification and classification of audio materials in the future.
This work still has limitations and needs to be further developed and improved in combination with practical applications so that it can be put to better use.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The author declares no conflicts of interest.