Abstract

With the increasing maturity of speech synthesis technology, it is being used more and more widely in people's lives and brings more and more convenience; at the same time, the requirements placed on speech synthesis systems keep rising, so advanced technology is needed to improve and update the accent (stress) recognition system. This paper mainly introduces word stress annotation technology combined with neural network speech synthesis technology. In Chinese speech synthesis, prosodic structure prediction has a great influence on naturalness, and accurately predicting the prosodic structure, which is the purpose of this paper, has become an important problem to be solved in speech synthesis. Experimental data show that the average sample error in the network training process is |e|/85 and that the minimum training error after 500 steps is 0.00013127, so the final average sample error is |e| = 85 × 0.0013127 = 0.112 < 0.5. A deep neural network (DNN) is then used to train different parameters to obtain conversion models, and these conversion models are combined in synthesis, finally improving the quality of the synthesized speech.

1. Introduction

The Internet is changing the teaching methods of teachers. Nowadays, one teacher alone can teach thousands of students. Using the interactivity of the Internet, teachers and students can interact in a timely manner, so as to achieve a similar effect to small class teaching. At the same time, people can enjoy high-quality online courses without leaving home. As long as there is a computer, anyone can receive a university education anywhere [1].

The basic unit of English pronunciation is the phoneme; the inventory used here contains 41 phonemes. An English word contains multiple phonemes, which group into syllables, so the pronunciation of an English word is composed of multiple syllables. The process of converting the characters of English words into the phonemes of words is already relatively mature and will not be discussed here [2, 3]. The division of syllables in English words is one of the most important parts of the technical requirements for machine-synthesized speech, and it seriously affects the accuracy and naturalness of pronunciation [4, 5]. The dictionary lookup-table method has been compared with traditional methods [6]. At the same time, there are still many applications under research; these applications are very attractive and show the great potential of neural networks. Neural network technology is especially good at problems that provide a large number of samples [7], so the pronunciation problem of English words can be solved by learning from complete training samples. The problem involves phonemes and phonetic symbols, and the full learning sample covers tens of thousands of English words. The following sections use neural network technology to solve this problem [8, 9].

Geoffrey Hinton obtained some surprising results on MNIST and showed that the acoustic model of a frequently used commercial system can be significantly improved by distilling the knowledge in an ensemble of models [10, 11]. Yang Shen's program has been trained to extract backbone structure; the program is generally very robust and recovers more than 90% of the residue and side-chain χ1 rotamer torsion-angle information. In addition to reliably predicting secondary structure, deep neural networks can be applied to end-to-end speech synthesis methods, in which text features are mapped directly to the speech spectrum; this gives a better speech synthesis effect but requires a larger corpus and longer training time [12–15].

This paper first introduces the development history and current situation of speech recognition technology at home and abroad and then conducts theoretical research and analysis on each link, including speech acquisition, preprocessing, endpoint detection, feature parameter extraction, time alignment, and other aspects. Based on the theory and algorithms of the speech recognition model at each stage, a complete design scheme for a speech recognition system is presented, with MFCC selected as the speech feature parameter. This paper focuses on the selection of the recognition model: by comparing various recognition algorithms, the BP neural network is selected as the basic unit of the recognition model. Aiming at the accuracy of speech recognition and the deficiencies of the BP neural network algorithm, this article introduces neural network ensemble theory, using individual differences among networks combined with the k-means clustering method to improve the ensemble network; the improved ensemble can effectively integrate multiple BP networks at an acceptable construction cost.

2. Proposed Method

2.1. Classification and Basic Composition of Speech Recognition System
2.1.1. Classification of Speech Recognition System

Speech recognition systems are generally classified according to two criteria.

(1) Based on Recognition Objects. According to the different objects of recognition, speech recognition tasks can be roughly divided into two categories, namely, isolated word recognition and continuous speech recognition. Among them, the task of isolated word recognition is to recognize words or short commands spoken in isolation.

(2) Based on the Size of Vocabulary. Some speech recognition systems can only recognize the terms in the preset vocabulary of the system, and the terms outside the vocabulary cannot be recognized. Other systems are based on the recognition of phonemes, which often have no restrictions on the recognition of terms. In short, according to the size of the vocabulary, it can be divided into small vocabulary (less than 100), medium vocabulary (100–500), and large vocabulary (more than 500). The larger the vocabulary, the greater the confusion between terms and the lower the recognition rate of the system.

2.1.2. Basic Structure of Speech Recognition System

The speech recognition system is mainly composed of four parts: signal processing and feature extraction, acoustic model (AM), language model (LM), and decoding search part. The typical implementation scheme of the speech recognition system is shown in Figure 1.

After signal processing and feature extraction, the system enters the training stage: the acquired feature parameters are processed, the data that best reflect the voice features are extracted, and they are saved to the template library. Finally, in the recognition stage, voice parameters obtained through the same channel form the test template, which is then matched against the reference templates in the template library. The similarity is calculated according to a certain criterion (such as a distance measure), and the candidate with the highest matching score is output as the result. At the same time, some prior knowledge or fuzzy theory can be added to improve the accuracy of the recognition system.
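As a rough illustration of the matching stage just described (a sketch under simple assumptions, not the system's actual implementation), the following Python fragment compares a test feature template against the reference templates with a Euclidean distance and outputs the best-scoring label. Feature extraction (e.g., MFCC) is assumed to have been done beforehand, and the templates are assumed to be time-aligned to the same length; a real system would use dynamic time warping or an HMM instead.

```python
import numpy as np

def recognize(test_template, reference_templates):
    """Pick the reference label whose template is closest to the test template.

    test_template: (frames, dims) feature matrix of the utterance to recognize.
    reference_templates: dict mapping label -> (frames, dims) feature matrix.
    """
    best_label, best_dist = None, np.inf
    for label, ref in reference_templates.items():
        dist = np.linalg.norm(test_template - ref)  # frame-wise Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```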

2.2. Basic Principles of Speech Recognition
2.2.1. Mathematical Model of Speech Signal

(1) Acoustic Model of the Speech Signal. Speech synthesis technology selects a series of system parameters related to the vocalization process and processes the input text signal sequence through a specific synthesizer to produce speech output with high naturalness, high sound quality, and rich expressiveness, so that a computer or related system is equipped with a technology that can emit natural and fluent sound like a “human.” Speech synthesis acts as the output part of the machine in human-machine voice interaction and plays an important role, and from it the desired speech feature sequences can be obtained. In general research and applications of speech coding and speech recognition, the discrete time-domain model shown in Figure 2 is usually adopted.

Among them, the excitation model is divided into voiced excitation and unvoiced excitation. The vocal tract model can be built as three practical models based on the pronunciation of various phonemes: cascade, parallel, and hybrid. In the radiation model, the airflow formed in the articulation cavity radiates through the lips and reaches the listener's ears; the sound signal is attenuated, and the model has the characteristics of a high-pass filter. In the excitation model, the excitation source covers both unvoiced and voiced sounds: the unvoiced signal is the output of the linear system excited by a white-noise sequence, while the voiced signal is generated by a periodic pulse generator. For voiced speech, the transfer function of the system is
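The formula itself is not reproduced in the source text; in the standard discrete-time source-filter model that this description follows, the overall transfer function for voiced speech is usually written (an assumed standard form, not necessarily the paper's exact equation) as

$$H(z) = A_v\, G(z)\, V(z)\, R(z),$$

where G(z) is the glottal pulse model, V(z) the vocal tract model, R(z) the radiation model, and A_v a gain factor.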

The factor Av adjusts the amplitude or energy of the voiced signal, and z is the variable of the z-transform.

In order to give the voiced excitation signal the actual waveform of a glottal pulse, it is passed through a glottal pulse model filter, whose z-domain transfer function is
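The expression is missing from the text; a commonly used two-pole glottal pulse model, consistent with the statement below that both coefficients are close to 1, is (given here as an assumption about the intended form)

$$G(z)=\frac{1}{\left(1-c_{1}z^{-1}\right)\left(1-c_{2}z^{-1}\right)},\qquad c_{1},\,c_{2}\approx 1 .$$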

Both coefficients c1 and c2 are close to 1, so the spectrum of the voiced excitation is close to that of the glottal flow pulse. When an unvoiced sound is produced, whether a plosive or a fricative, the vocal tract is constricted and turbulence forms; therefore, the unvoiced signal can be modeled as random white noise with mean 0 and variance 1, randomly distributed in time and amplitude. In the vocal tract model, V(z) is the vocal tract transfer function, and its expression is
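The expression is likewise not reproduced; the usual all-pole vocal tract model of order p, given here as a standard form rather than the paper's exact equation, is

$$V(z)=\frac{G}{1-\sum_{i=1}^{p}a_{i}z^{-i}},$$

where the a_i are the coefficients of the all-pole (LPC) filter and G is a gain.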

R(z) in the radiation model is a first-order high-pass filter, whose expression is
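A typical first-order high-pass radiation model, consistent with the description above (again an assumed standard form), is

$$R(z)=1-r\,z^{-1},\qquad r\approx 1 .$$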

(2) Human Auditory Model. The auditory system of the human ear is a very precise and complex system. Speech is transmitted from the vocal organs through the medium to the auditory organ of the human ear, and the parameters in the acoustic model change more or less with energy loss. After a long period of research, it has been found that better results can be obtained by modeling the speech recognition system from the perspective of human hearing. On the one hand, this explores human hearing, physiology, and psychology, which makes up for the deficiency of the voice feature parameters of the vocal tract model; on the other hand, it analyzes the production of the voice signal in the frequency domain. It is found that the sensations of low and high pitch correspond closely to frequency, and that people's perception of complex sound patterns is a complex auditory process driven by both knowledge and data.

2.2.2. Speech Signal Preprocessing

(1) Speech Signal Pre-Emphasis. Pre-emphasis refers to boosting the high-frequency part of the speech signal. Because the average power spectrum of the speech signal is shaped by glottal excitation and by radiation from the mouth and nose, it is strongly attenuated above about 800 Hz. In the high-frequency part of the speech spectrum, the components are much weaker than the low-frequency components, which makes them difficult to observe. (By analogy with images: regions where the brightness or gray level changes drastically, such as edges, correspond to high-frequency components, while large, slowly varying regions, such as uniform color blocks, correspond to low-frequency components; if an image is regarded as a two-dimensional function, places with drastic changes correspond to high frequencies and vice versa.) In addition, the low-frequency end is often mixed with 50 Hz or 60 Hz interference, so it is necessary to boost the high-frequency part of the speech spectrum and remove the interference in order to facilitate subsequent speech parameter analysis, as follows:
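The pre-emphasis relation referred to above is not shown in the text; the standard first-order pre-emphasis filter, consistent with the coefficient a quoted in the next paragraph, is

$$\tilde{x}(n)=x(n)-a\,x(n-1),\qquad\text{i.e.}\qquad H(z)=1-a\,z^{-1}.$$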

Generally, a = 0.9375 is used. Computing the short-time energy of each pre-emphasized frame of speech can then eliminate DC drift, suppress random noise, and highlight the energy of the unvoiced (high-frequency) components.

(2) Framing and Windowing of the Voice Signal. In order to use digital signal processing technology to analyze speech signals that change over time, the speech signal is divided into several frames and the feature parameters of each frame are analyzed. The length of each frame is 10–30 ms, over which the signal is short-time stationary. Framing can be contiguous or overlapping; in order to smooth the transition between frames and maintain continuity, the overlapping segmentation method is generally used. There are two requirements for the window function: ① the main lobe of the window spectrum should be narrow, to obtain a steep transition band; ② the side lobes should be as small as possible compared with the main lobe, so that the energy is concentrated in the main lobe as much as possible, shoulder peaks and ringing are reduced, and the stop-band attenuation and pass-band stability are improved. The ratio of frame shift to frame length is usually 0–1/2, as shown in Figure 3.

The commonly used window functions in speech signal processing are the rectangular window and the Hamming window, whose expressions are given below (N is the frame length).

Rectangular window:

$$w(n)=\begin{cases}1, & 0 \le n \le N-1,\\ 0, & \text{otherwise.}\end{cases}$$

Hamming window:

$$w(n)=\begin{cases}0.54-0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1,\\ 0, & \text{otherwise.}\end{cases}$$
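As an illustration only (not the paper's implementation), the following Python sketch applies the pre-emphasis above, splits the signal into overlapping frames, and applies a Hamming window; the sampling rate and frame parameters are assumed values within the 10–30 ms range mentioned earlier.

```python
import numpy as np

def preemphasize(x, a=0.9375):
    """First-order pre-emphasis: y[n] = x[n] - a*x[n-1]."""
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_and_window(x, fs=16000, frame_ms=25, shift_ms=10):
    """Split signal x into overlapping frames and apply a Hamming window."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = max(0, 1 + (len(x) - frame_len) // shift)
    if n_frames == 0:
        return np.empty((0, frame_len))
    window = np.hamming(frame_len)
    return np.stack([x[i * shift:i * shift + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```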

(3) Endpoint Detection of Speech Signals. The starting and ending points of speech signals are not given in advance, and speech signals are often mixed with background noise. Therefore, in practical speech recognition applications, it is necessary to accurately find the starting and ending points of the speech and remove the background noise unrelated to the signal; this not only reduces the amount of data, computation, and processing time, but also helps to improve the recognition rate of the system. Endpoint detection is a key part of speech recognition and directly affects the performance of the whole system.

The function of the short-time average energy is to distinguish between voiced and unvoiced sounds in the speech signal. After the speech signal is divided into frames and windowed, the energy of each frame is calculated; this is called the short-time average energy. After calculating the energy value of each frame, a threshold is set that, in theory, separates the silent segments from the speech segments. The short-time average energy E(i) of frame i of the speech signal can be obtained by one of the following three algorithms, in which N is the frame length and x_i(n) is the amplitude of the speech signal at the n-th point of frame i. The simplest method is to compute the volume directly, find the parts whose volume is greater than a certain threshold, regard these parts as the required speech signal, take the intersections of this curve with the threshold as the endpoints, and treat the rest as non-speech frames.
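The three formulas are not reproduced in the text; typical textbook choices for the short-time energy of frame i (listed here as an assumption, not necessarily the paper's exact three) are

$$E(i)=\sum_{n=1}^{N}x_{i}^{2}(n),\qquad E(i)=\sum_{n=1}^{N}\left|x_{i}(n)\right|,\qquad E(i)=\log\sum_{n=1}^{N}x_{i}^{2}(n).$$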

Voiced signals are produced by vibration of the vocal cords, and the corresponding speech waveform has a high amplitude and swings between relatively large positive and negative values at a low rate. Unvoiced signals involve no vocal cord vibration and are produced by friction, impact, or plosion of the airflow in the oral cavity; their amplitude is lower and the waveform crosses zero much more frequently. Therefore, the short-time average energy works well for extracting voiced segments, but the zero-crossing rate is needed to extract unvoiced segments.

The short-time average zero-crossing rate is a time-domain method for estimating the dominant frequency of a signal. The calculation formula is shown in the following equation, where sgn[·] is the sign function and s(n) is the signal at the n-th point after processing.
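The equation itself is missing from the text; the standard short-time average zero-crossing rate of a frame of length N is

$$Z=\frac{1}{2}\sum_{n=2}^{N}\left|\operatorname{sgn}\left[s(n)\right]-\operatorname{sgn}\left[s(n-1)\right]\right|,\qquad \operatorname{sgn}[s]=\begin{cases}1,& s\ge 0,\\ -1,& s<0.\end{cases}$$

A minimal Python sketch of endpoint detection built on short-time energy and this zero-crossing rate (the thresholds are illustrative values, not the paper's) might look like:

```python
import numpy as np

def short_time_energy(frames):
    """Sum of squared amplitudes per frame; frames has shape (n_frames, N)."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Half the number of sign changes per frame."""
    signs = np.where(frames >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def speech_frame_mask(frames, energy_thresh=0.01, zcr_thresh=30):
    """Keep frames that are energetic (voiced) or have a high zero-crossing rate (unvoiced)."""
    return (short_time_energy(frames) > energy_thresh) | (zero_crossing_rate(frames) > zcr_thresh)
```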

2.3. BP Neural Network Integration

The BP network works by forward propagation of the signal and back propagation of the error. In the forward pass, the input samples are fed in from the input layer, processed layer by layer through the hidden layers, and then passed to the output layer. If the actual output of the output layer is inconsistent with the expected output (the teacher signal), the network turns to the error back-propagation stage.

BP network topology is shown in Figure 4.

Learning in the BP network is supervised: input vectors and expected responses must be provided during training. In the training process, the weights and thresholds of the network are adjusted according to the error performance of the network, so that the required function is finally achieved. During back propagation, the error signal is propagated from the output layer back to the input layer, and the weights and thresholds of each layer are adjusted along the way, so that the error decreases step by step. After repeated iterations, the error finally falls within the tolerance range.
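To make the forward- and back-propagation procedure above concrete, the following NumPy sketch shows one training step of a small BP network with sigmoid activations and a squared-error criterion; it is an illustration of the algorithm only, not the network configuration used later in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_train_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One forward + backward pass for a 1-hidden-layer BP network.

    x: input vector, t: target (teacher signal) vector.
    Returns updated weights/thresholds and the current squared error.
    """
    # Forward propagation
    h = sigmoid(W1 @ x + b1)          # hidden-layer output
    y = sigmoid(W2 @ h + b2)          # network output

    # Error back propagation (gradient of 0.5*||y - t||^2)
    delta_out = (y - t) * y * (1 - y)             # output-layer local gradient
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # hidden-layer local gradient

    # Adjust weights and thresholds along the error-reducing direction
    W2 -= lr * np.outer(delta_out, h)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x)
    b1 -= lr * delta_hid
    return W1, b1, W2, b2, 0.5 * np.sum((y - t) ** 2)
```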

3. Experiments

3.1. Experimental Data Set

Whereas neural networks deal with numerical information, the problem here is character-based, so the first task is to encode the characters as numbers. Before coding, the word length is unified: statistics show that the longest word contains 16 phonemes, so words with fewer than 16 phonemes are padded with empty phonemes. The 41 phonemes are coded with the numbers 1–41, and the empty phoneme is coded as 0. For the output coding, a syllable-partition phoneme is marked as 1 and a non-partition phoneme as 0. For example, the sample ax, b, aa, b, ax || ax-b, aa-b, ax is coded as

Input: 12 30 6 30 12 0 0 0 0 0 0 0 0 0 0 0   Output: 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0

The input codes of the other samples are obtained in the same way, for example:

5, 30, 12, 35, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

12, 30, 5, 35, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 (for ax, b, ae, k || ax-b, ae-k)

5, 30, 12, 35, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

5, 30, 12, 35, 12, 38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 (for ae, b, ax, k, ax, s || ae-b, ax-k, ax-s)

...

The remaining samples can be coded in turn, so that a table of input–output data pairs is obtained that can be processed by the neural network.

For a neural network with a 16-bit input and a 16-bit output, the data are prone to overfitting and the network structure and training process are inevitably complex, so the output is processed further by compressing the dimension of the output space: the 16-bit string of 0s and 1s is treated as two 8-bit binary numbers. For example, 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 can be divided into the two 8-bit strings 0 1 0 1 0 0 0 0 and 0 0 0 0 0 0 0 0, which correspond to the binary numbers 80 and 0. The transform from a 16-bit string into two positive integers is a one-to-one mapping and is reversible. Reversibility is very important, because the numerical outputs of the neural network must be inverse-transformed back into a 16-bit 0/1 list. According to the above, the output codes transform as follows:

0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 || 80 0
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 || 80 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 || 64 0
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 || 80 0
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 || 80 0
...

The final codes are as follows:

12, 30, 6, 30, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 || 80, 0
5, 30, 12, 35, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 || 80, 0
12, 30, 5, 35, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 || 64, 0
5, 30, 12, 35, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 || 80, 0
5, 30, 12, 35, 12, 38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 || 80, 0
...

Therefore, a neural network with 16 inputs and 2 outputs can be used to solve the problem. As long as the error between the two real-valued outputs of the network and the sample outputs is within plus or minus 0.5, the outputs can be rounded so that they are exactly equal to the sample values. The two output values of the neural network are then inversely mapped back to two 8-bit binary numbers; the positions of the 1 bits mark the syllable-partition phonemes, and the syllable identifier can be added at the corresponding input positions to realize the syllable division.
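A small Python sketch of this output coding, written here only for illustration (it is not the authors' original program), shows the mapping between the 16-bit boundary string and the two 8-bit integers, and the rounding of the network's two real-valued outputs:

```python
def bits_to_ints(bits):
    """Map a 16-bit 0/1 list to two 8-bit integers, e.g. [0,1,0,1,0,...] -> (80, 0)."""
    assert len(bits) == 16
    high = int("".join(map(str, bits[:8])), 2)
    low = int("".join(map(str, bits[8:])), 2)
    return high, low

def ints_to_bits(high, low):
    """Inverse mapping: two integers back to the 16-bit boundary string."""
    return [int(b) for b in format(high, "08b") + format(low, "08b")]

def decode_network_output(y1, y2):
    """Round the two real-valued outputs (error < 0.5) and recover the bit string."""
    return ints_to_bits(int(round(y1)), int(round(y2)))

# Example: the sample coded as 80, 0 marks syllable boundaries at positions 2 and 4.
print(bits_to_ints([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))  # (80, 0)
```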

3.2. Experimental Design

The design of neural network structures has always been a meaningful but difficult problem. This article reviews and summarizes recent research on neural network structure design. First, the four criteria that should be considered in the design are analyzed, namely, the function approximation error of the neural network, the complexity of the network structure, the generalization ability of the network, and the fault tolerance of the network. Fortunately, with the MATLAB neural network design platform, the network structure can be modified with a single statement, so it is very convenient to design the structure by trial and error [16, 17]. Training starts from a small number of samples, then 5,000 samples are fed in turn to find a suitable neural network structure, and finally 10,000 samples are used. After many simulation experiments, a neural network with the four-layer structure 16 × 30 × 20 × 2 was found. The activation functions of the layers are tanh(x), tanh(x), and sigmoid(x); that is, the two hidden layers use the hyperbolic tangent (tangent sigmoid) function and the output layer uses the logarithmic sigmoid function. The neural network structure is shown in Figure 5.
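For illustration, the 16 × 30 × 20 × 2 structure described above could be written as follows in Keras (the paper itself uses the MATLAB neural network platform, so the optimizer, loss, and target scaling shown here are assumptions):

```python
import tensorflow as tf

# Two tanh hidden layers and a sigmoid output layer: 16 -> 30 -> 20 -> 2.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="tanh", input_shape=(16,)),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(2, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

# With a sigmoid output, the two integer targets (e.g., 80 and 0) would need to be
# scaled into (0, 1), for example divided by 255, and rescaled after prediction.
# model.fit(X_train, y_train_scaled, epochs=500)
```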

4. Discussion

4.1. The MATLAB Platform

The platform data come from the Spoken Arabic Digit data set in the UCI machine learning repository, which contains a total of 8,800 utterances, of which 6,600 are used as training samples and 2,200 as test samples. Using the bootstrap method, 5,000 utterances are drawn from the training samples each time as the training set of a weak classifier, and the training is repeated 200 times [18, 19].
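A minimal sketch of the bootstrap sampling used to build the ensemble (the numbers follow the description above; the base-classifier training function is passed in as a placeholder, since the paper's BP training code is not shown):

```python
import numpy as np

def bootstrap_ensemble(X, y, train_fn, n_boot=5000, n_rounds=200, seed=0):
    """Train n_rounds weak classifiers, each on n_boot samples drawn with replacement.

    train_fn(X_sub, y_sub) should return a fitted base classifier (e.g., one BP network).
    """
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=n_boot, replace=True)
        classifiers.append(train_fn(X[idx], y[idx]))
    return classifiers
```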

The error outputs of the single BP network model, the traditional BP-AdaBoost model, and the model in this paper were compared under 2,200 test samples. The error accuracy of each BP network was set as 0.03. The number of base classifiers in the latter two integrated models was 25. The result is shown in Figure 6.

When the number of base classifiers is different, the correct recognition rate of the two models is shown in Table 1 (where the error accuracy of each base classifier is 0.008).

As shown in Table 1, when the number of base classifiers is 50, the differences among the base classifiers in the BP-AdaBoost model are no longer obvious and its recognition rate is lower than with 25 base classifiers, whereas the recognition rate of the model in this paper can continue to increase [20]. Five experiments are conducted under the three models with the error precision of each base classifier set to 0.06, 0.03, 0.01, 0.008, and 0.003, respectively. The results are shown in Figure 7, where the X-axis is the experiment number and the Y-axis is the recognition rate on the 2,200 test samples.

The model in this paper and the decision tree model were tested many times on data sets of the same size. The comparison of recognition rates is shown in Table 2, which demonstrates the effectiveness of the proposed method.

4.2. Analysis of Experimental Results

While designing the neural network structure, we also looked for appropriate training methods. The results show that as the sample size increases, the basic BP method, the BP method with a momentum term, and the adaptive method are increasingly affected [21]. The training results of the variable-learning-rate BP method, the adaptive-learning-rate-with-momentum method, and the conjugate gradient method are not ideal; although the Levenberg–Marquardt and quasi-Newton methods give good training results, their memory requirements are very large, and when the number of samples grows beyond 10,000 the computer memory is insufficient [22, 23].

4.2.1. Prediction of Speech Structure

The labeled corpus used for speech synthesis is TH_CoSS, produced by the Institute of Human-Computer Interaction and Media Integration of Tsinghua University, with a total of 5,406 sentences; 5,000 sentences (the file TH_CoSS.txt) were used for training and the remaining 406 sentences formed the test set. Prediction of the prosodic structure is mainly divided into the following three steps [24].

The first step is to segment the whole-network news data (SogouCA, about 2.1 GB in size), deal with the problems in the data files, and obtain clean full-text data. The open-source Chinese word segmentation tool jieba (run via jieba_seg.py) is used to perform word segmentation, and the segmented text corpus is merged with the segmented labeled corpus (TH_CoSS.txt) as the input data for word vector training. The prediction then proceeds as follows (a small segmentation sketch is given after this paragraph): (1) perform word segmentation and part-of-speech tagging on the target sentence; (2) use the word segmentation and part-of-speech information to form prosodic words; (3) perform syntactic segmentation and syntactic labeling on the target sentence; (4) build a prosodic structure prediction tree; (5) determine the locations of the prosodic boundaries [25].
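For illustration, a minimal jieba-based segmentation and part-of-speech tagging step, corresponding to items (1) and (2) above, might look like the following; the file names and output format are assumptions, and this is not the authors' jieba_seg.py:

```python
import jieba.posseg as pseg

def segment_line(line):
    """Return 'word/POS' tokens for one sentence of raw news text."""
    return " ".join(f"{word}/{flag}" for word, flag in pseg.cut(line.strip()))

with open("news_raw.txt", encoding="utf-8") as fin, \
     open("news_seg.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if line.strip():
            fout.write(segment_line(line) + "\n")
```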

The second step is to use word2vec for word vector training. The word vectors trained here will eventually be used to train the network model; that is to say, the words in the annotated corpus used to train the network model must be found in the word-vector vocabulary.
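A sketch of this word-vector training step using gensim's word2vec implementation (an assumed tool choice; the paper only states that word2vec is used, and the dimension of 20 follows the best result later reported in Table 3):

```python
from gensim.models import Word2Vec

# Each training "sentence" is the list of words from one segmented line.
with open("merged_seg_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# vector_size is the word-vector dimension (called `size` in gensim < 4.0).
model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("vectors.txt")
```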

In the third step, the word vectors are looked up for the training corpus, and a small program written in C marks prosodic words as 0/1 according to the word segmentation results to form the training data.

The predicted results obtained are shown in Table 3.

It can be seen from Table 3 that as the dimension of the word vector increases, the training time of the model also increases. The error rate of prosodic word prediction decreases as the vector dimension increases, reaching a minimum at 20 dimensions; beyond that, the error rate no longer decreases as the dimension grows but instead rises slightly. Therefore, when the dimension is too large, the prosodic error rate does not improve, while the training time of the model increases.

4.2.2. Speech Structure Preprocessing

After a large amount of simulation training, it is found that 35,161 sample points can be fully fitted within the required accuracy range. Figure 8 shows the average error of the samples in the training process of the neural network (|e|/85). The minimum training error after 500 steps is 0.00013127.

After the neural network is trained in MATLAB, the weight matrices W and threshold vectors b are exported and transplanted into a neural network computation module written in C. The outputs are rounded to obtain two positive integers, which are then expanded back into a 16-bit binary string of 0s and 1s, and this binary string is used to annotate the syllable boundaries to obtain the final result. Running the actual C program and comparing its results with the actual samples shows that the recognition accuracy reaches 94.7%, which indicates that the method performs very well.

5. Conclusions

Speech recognition technology, the Android platform, and neural network ensemble theory are not only hot topics in academic research, but also a direction for market application. Based on an analysis and summary of existing speech recognition models and recognition algorithms, and in view of the low accuracy of the traditional BP neural network in speech recognition, this paper combines neural network ensemble theory with speech recognition and improves the ensemble-generation part of the traditional integrated network with a separate k-means clustering method. Finally, a voice synthesis function is added on the Android 2.3 platform, and a simple speech recognition application is implemented in combination with the relevant neural network theory. An effective way to improve the accuracy of speech recognition in a local-recognition plus cloud-storage mode is presented.

At the same time, there are still many applications under research. These applications are very attractive and show the great potential of neural networks. Neural network technology is particularly good at dealing with problems for which a large number of samples are available: such problems can be solved by fully learning from the samples, and even the mapping from the characters of tens of thousands of English words to their phonemes can be analyzed and resolved well.

In this paper, a new method of using a neural network to label the syllable boundaries between phonemes in English words is proposed. This method is instructive and can be applied in the same way to the primary and secondary stress of each syllable in a word. When deep neural networks mature and are closely integrated with speech synthesis technology, much as people learn to speak, the development of speech technology will rise to a new level; it will provide an important force for human-computer interaction and voice-related fields and make important contributions to the advancement of science.

Data Availability

No data were used to support this study.

Conflicts of Interest

The author states that this article has no conflicts of interest.