Application of Intelligent Speech Synthesis Technology Assisted by Mobile Intelligent Terminal in Foreign Language Teaching

Zhang, Zhehua

doi:https://doi.org/10.1155/2022/9751094

Mathematical Problems in Engineering

Special Issue

Advanced Aspects of Computational Intelligence and Applications of Fuzzy Logic and Soft Computing

Research Article | Open Access

Volume 2022 | Article ID 9751094 | https://doi.org/10.1155/2022/9751094

Zhehua Zhang¹

Academic Editor: Naeem Jan

Received 23 Feb 2022

Revised 20 Mar 2022

Accepted 26 Mar 2022

Published 21 May 2022

Abstract

With international exchanges becoming increasingly frequent, language ability is receiving more and more attention. However, foreign language teaching in some areas currently lacks a standard external input environment, and the gap between learners’ own ability and the new requirements has not been effectively closed. Intelligent speech synthesis technology is now mature, and it has been found to offer unique advantages in language teaching. Based on this, this study examines, from a technical perspective, the feasibility of applying intelligent speech synthesis technology in foreign language teaching. On this basis, an application mode of intelligent speech synthesis technology in foreign language teaching is constructed, and its application effect is verified. The main research work of this paper is as follows. Firstly, it analyzes the feasibility of using intelligent speech synthesis technology to teach foreign languages and lists in detail the application conditions of the technology in different teaching activities and course types. Then, it constructs the application mode of intelligent speech synthesis technology in foreign language teaching. Finally, experimental research on the application mode of intelligent speech synthesis technology in listening teaching is carried out. The experimental results show that intelligent speech synthesis technology achieves good results in foreign language teaching and lays a foundation for further research on foreign language teaching.

1. Introduction

The rapid development of artificial intelligence technology has not only brought great changes to people’s way of life and work but also brought opportunities for new changes in the field of education and teaching. Intelligent voice technology, one of the core representatives of artificial intelligence technology, has gradually become the most widely used technology in the field of artificial intelligence [1]. Of China’s emerging industries that are at the international leading level, only a few are included in national science and technology development planning and receive policy support, and intelligent voice is one of them. Research on voice technology in China started late, beginning in the 1980s. Its wide adoption in society is inseparable from the research accumulated by researchers over the years and from the technical optimization and market promotion of recent years [2]. At present, intelligent voice technology is widely used in many industries such as finance, telecommunications, transportation, government, and enterprises.

As an increasingly mature new information technology, intelligent voice technology has also been widely used in the field of education and teaching [3]. The three functions of intelligent speech technology, namely speech recognition, speech synthesis, and speech evaluation, each bring their own advantages to education and teaching, improve the efficiency of classroom teaching to a certain extent, and are particularly valuable in bilingual teaching and English teaching. Intelligent speech synthesis technology is one of the most mature technologies in intelligent speech technology. The synthetic speech is standard, clear, natural, and smooth and can create a good language learning environment for language teaching [4]. Intelligent speech technology has injected a new force into the traditional classroom. At present, intelligent speech synthesis technology is widely used in areas such as voice interaction, in-depth reading, and audio education. The rapid advancement of artificial intelligence will bring more intelligent technologies into education and teaching, which is bound to bring more opportunities and possibilities for innovation and reform in this field [5]. Nowadays, international exchanges are becoming more and more frequent, and the importance of mastering a foreign language has become a consensus of the international community. Learning and mastering a foreign language is not only a wise move to meet the needs of the times but also an important path for personal growth and development. Language skills, namely, “listening, speaking, reading, and writing,” are the basis and key to the formation of comprehensive language application ability. Text to speech (TTS) is one of the important technical components of intelligent speech technology [6]. It is an artificial speech technology that enables the computer to speak like a person by converting externally input text information into fluent spoken language that people can understand. In English listening classroom instruction, teachers can choose appropriate, targeted teaching resources beyond the teaching materials, input the text to be converted, and use intelligent speech synthesis technology to convert it into sound information [7]. Figure 1 depicts the technical roadmap for voice synthesis technology.

The audio pronunciation standard synthesized by intelligent speech synthesis technology can accurately present foreign language pronunciation, give learners the feeling of pure foreign language pronunciation, create a good foreign language learning environment for learners, and solve the problems of nonstandard and insufficient speech input, as well as some teachers’ “dialect foreign language” in current teaching [8]. In addition, the multitone synthesis provided by intelligent speech synthesis technology can meet the needs of a variety of scenes and create realistic situational teaching scenes for learners, which can stimulate learners’ motivation for foreign language learning, improve learning initiative, help teachers better carry out foreign language teaching, and help improve the quality of foreign language teaching [9]. Whether the foreign language audio quality synthesized by intelligent speech synthesis technology can really reach the level of teaching demonstration sound in foreign language teaching has not been confirmed by data in previous scholars’ relevant research [10]. In addition, how to apply this technology to properly combine the functional advantages of this technology with foreign language teaching needs further research and analysis. Based on this, this study aims to use scientific research methods and means to analyze the application mode of intelligent speech synthesis technology in primary school foreign language teaching, which not only originates from the practical appeal but also has certain practical feasibility [11]. The neural speech synthesis with a transformer network was researched by Li et al. [12]. The concept of deep learning-based NLP approaches in text to speech synthesis for communication recognition was developed by Adam et al. [13].

The rest of this paper is organized as follows. Related work is presented in Section 2. Section 3 discusses intelligent speech synthesis technology. Section 4 describes the application of speech synthesis technology in foreign language teaching. Section 5 presents the experiments and results. Finally, Section 6 concludes the research work.

2. Related Work

In this part, we give an overview of the development of intelligent speech synthesis technology, review research on the application of speech synthesis technology in foreign language teaching, and comment on existing related research.

2.1. Overview of the Development of Intelligent Speech Synthesis Technology

The development of speech synthesis can be traced back to the eighteenth century and has a history of more than 200 years. However, due to the limitations of objective conditions such as technical level, no significant research results have been achieved. With the development of the times, computer technology and digital signal processing technology are becoming more and more mature, and speech synthesis technology can develop by leaps and bounds. According to different methods of speech synthesis technology, its development stage can be roughly summarized as mechanical speech synthesis stage, electronic speech synthesis stage, and computer-based speech synthesis development stage [12]. After the development transformation from complete laboratory research to large-scale market application, the speech synthesis technology at this stage belongs to the third development stage.

Mechanical speech synthesis is based on the principle of human speech pronunciation. Although this method can realize the speech synthesis of some basic phonemes, the principle of human pronunciation is very complex [13]. It is not easy to imitate and accurately record the motion trajectory of human oral pronunciation for a long time. It is difficult to establish a model for these machines, and it is difficult to carry out follow-up research. Electronic speech synthesis using formant synthesizer can make the synthesized speech more natural and realistic. However, its structure is complex [14]. At the same time, although it costs a lot of manpower to adjust and analyze complex parameters, it cannot ensure that the parameter adjustment can be completely correct, so the synthesized sound quality is often difficult to meet the requirements of practical application. The improved electronic speech synthesis not only has clear pronunciation but also can synthesize a variety of speech with different timbres. It has become the most representative speech synthesizer in the last century [15]. Early speech synthesis methods are difficult to be popularized and applied in practice due to complex structure and other reasons. In the late 1980s, the rapid development of computer technology created an opportunity for the emergence of speech synthesis based on waveform splicing, but this synthesis method has low speech quality and has limitations in the size of sound library and splicing adjustment [16]. Since the twenty-first century, with the continuous improvement of computing power, deep learning algorithms have emerged one after another. Deep neural network has been widely used in the field of speech research and has been deeply applied to statistical parameter speech synthesis. Its role in vocoder has greatly improved the synthesis efficiency of this method, so it has become the mainstream speech synthesis method at present.

With the progress of the times, scholars of speech synthesis science have made more and more in-depth research on linguistics and phonetics, and speech signal processing technology is becoming more and more mature, which provides a strong theoretical and technical support for the leapfrog development of speech synthesis technology [17]. Generally speaking, the current intelligent speech synthesis technology can basically meet people’s needs. Intelligent speech synthesis technology has been widely used in all walks of life because of its convenient operation and fast conversion. Its pronunciation level can be comparable to that of real people. According to the application status of intelligent speech synthesis technology, the functional advantages of this technology can play a great role in many fields that need speech functions [18]. The application of intelligent speech synthesis technology in the sphere of education and teaching merits further investigation. Figure 2 depicts the application level of intelligent speech synthesis technology.

2.2. Research on the Application of Speech Synthesis Technology in Foreign Language Teaching

As a mature artificial intelligence technology, intelligent speech synthesis technology has been applied in the field of education and teaching. This technology can aid in the advancement of language instruction, and it has steadily grown in popularity as a study hotspot. With the development of this technology, the naturalness and fluency have reached a high level [19].

As for the naturalness test of intelligent speech synthesis technology, the researchers compare the commonly used corpus content with the voice generated by speech synthesis technology and the voice of national first-class announcers. The evaluation results show that the naturalness index of the speech synthesis technology generation system is 4.28 by comparing the announcer’s voice with the natural person’s voice [20]. This study shows that the naturalness of speech synthesis can be comparable to that of real people to a certain extent. Many scholars at home and abroad have expressed their views on whether the English audio generated by intelligent speech synthesis technology can be applied to foreign language teaching. The researcher analyzes the feasibility of the application of intelligent speech technology in bilingual teaching through the subjective impression rating scale and points out that generally, if the MOS score is between 4.0 and 4.5, it is high-quality speech [21]. At present, the naturalness of audio synthesized by speech synthesis technology has reached 4.5 points, indicating that the audio quality synthesized by intelligent speech synthesis technology has reached a high level. Using this technology can not only provide support for teachers’ classroom teaching but also help learners carry out preclass preview and postclass review independently, and it provides pronunciation guidance for learners.
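The subjective rating described above can be illustrated with a short calculation. This is only a minimal sketch: the listener ratings are hypothetical, the function names are illustrative, and treating every score of 4.0 or more as high quality (the text cites the 4.0 to 4.5 band) is an assumption.

```python
def mean_opinion_score(ratings):
    """Return the mean of listener ratings given on the 1-5 subjective scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be non-empty and lie in [1, 5]")
    return sum(ratings) / len(ratings)


def is_high_quality(mos, threshold=4.0):
    """Assumption: any score at or above the 4.0 lower bound counts as high quality."""
    return mos >= threshold


# Hypothetical ratings from eight listeners for one synthesized utterance.
ratings = [5, 4, 4, 5, 4, 4, 5, 4]
mos = mean_opinion_score(ratings)
print(mos, is_high_quality(mos))
```

In practice, scores are averaged over many utterances and listeners before they are compared with a quality band.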

On the application effect of intelligent speech synthesis technology in English listening teaching, when referring to the function of intelligent speech system, it is pointed out that its speech synthesis function can realize the independent generation of English listening materials and greatly expand teaching resources [22]. It provides strong technical support for the development of sound discrimination practice and discourse practice in English listening teaching. In the subsequent research, the video analysis method is used to record and analyze the effectiveness of intelligent speech synthesis technology in the classroom. It has been found that this technology is very helpful for teachers to create situations for teaching. It can stimulate learners’ learning motivation and mobilize learners’ learning autonomy so as to promote the development of teaching activities. Through the research of the above scholars, it can be seen that the speech quality synthesized by intelligent speech synthesis technology has reached a high level [23]. This technology can play its functional advantages in improving the effect of foreign language teaching and learners’ foreign language ability. From the application effect of this technology, in the view of these scholars, the functional advantages of intelligent speech synthesis technology can provide support for foreign language teaching [24]. It can be seen that intelligent speech synthesis technology not only is a well-known artificial intelligence technology in our daily life but also often appears in the sphere of educating students.

2.3. Comments on Existing Related Research

Based on the above literature analysis, speech synthesis technology has reached a quite mature stage after more than 200 years of development, and its application coverage is very wide. Intelligent speech synthesis technology will increasingly meet people’s various needs, and its application in the field of education and teaching will become more and more in-depth. An analysis of the current situation of foreign language teaching in China shows that, at present, it is mainly carried out through traditional classroom teaching [25]. This situation cannot keep up with the pace of the times and social development, and there is a large gap from the overall goal of the foreign language curriculum in the compulsory schooling stage. Artificial intelligence (AI) is a reality in today’s world; intelligent speech synthesis technology can endow the computer with a human “mouth” and produce standard foreign language pronunciation, which can solve the problems existing in current foreign language teaching to a certain extent.

Examining the work of local and international academics on intelligent speech synthesis technology in foreign language teaching, it can be seen that intelligent speech synthesis technology, as a product of the era of artificial intelligence, can play its functional advantages in foreign language teaching [26]. Generally speaking, scholars believe that intelligent speech synthesis technology can create a good foreign language speech environment for learners and make them experience pure foreign language pronunciation, listening, and speaking ability [27]. At the same time, intelligent speech synthesis technology can also be used to produce any required foreign language materials, which greatly expands the resources of foreign language teaching. The function of this technology in foreign language listening teaching is conducive to the improvement of learners’ listening ability and teaching effect. Scholars currently focus on the application effect of intelligent speech synthesis technology in foreign language teaching, and there is a lack of investigation and analysis on the feasibility of foreign language speech quality generated by intelligent speech synthesis technology in listening teaching [28].

The rapid development of contemporary society urgently needs high-quality talents who can communicate in foreign languages. Listening comprehension plays an important role in the process of communication. Therefore, in foreign language teaching, we must pay attention to listening teaching [29]. Intelligent voice synthesis technology is guaranteed to play an increasingly essential role in the classroom and teaching as intelligent language technology matures.

3. Intelligent Speech Synthesis Technology

Statistical parameter speech synthesis method has attracted extensive attention in the field of speech synthesis because of its flexibility. In recent years, the deep neural network model has been applied to various research fields of machine learning, and it has achieved significant advantages over traditional methods. The application of the modeling method based on neural network in statistical parameter speech synthesis has gradually deepened, and it has become the mainstream method of speech synthesis.

3.1. Basic Principle of Intelligent Speech Synthesis Technology

This paper focuses on the back-end acoustic modeling of statistical parameter speech synthesis. Figure 3 depicts the back-end framework for statistical parameter speech synthesis, which mainly includes two stages: training and synthesis. In the training stage, the speech waveforms and the corresponding text features in the sound library are used as input. Acoustic features are first extracted from the speech waveform by the vocoder, and then acoustic modeling is carried out using the acoustic features combined with the text features. In the synthesis stage, given the text features to be synthesized, the trained acoustic model predicts the acoustic features, and the vocoder then converts the predicted acoustic features into a speech waveform. The vocoder and the acoustic model are thus the two important modules in a statistical parameter speech synthesis system.

The source filter model of speech production is used to separate the fundamental frequency and spectrum envelope of the speech short-time spectrum with harmonics during the speech waveform parameterization process. Generally, the fundamental frequency and other excitation characteristics of speech are obtained by analyzing the time-domain waveform or frequency-domain harmonics, and then the periodicity of time and frequency is removed from the amplitude spectrum obtained by short-time Fourier transformation of speech waveform to obtain the spectrum envelope of speech. Due to the high dimension of spectrum envelope, it is difficult to model directly, so it is usually necessary to reduce the dimension of spectrum envelope. The reconstruction of speech waveform from speech acoustic parameters is the opposite process. Given the excitation characteristics such as fundamental frequency and spectral envelope characteristics of speech, the STFT amplitude spectrum is reconstructed, combined with certain phase constraints. Time length modeling is another module in statistical parameter speech synthesis. Time length modeling does not need vocoder. Its basic framework is similar to acoustic modeling. Statistical model is used to model the probability distribution of corresponding time length under the condition of given text features. After more than 20 years of development, the HMM-based statistical parameter speech synthesis method has become a mature speech synthesis method. This section will introduce hidden Markov model and its theoretical basis.
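As an illustration of obtaining the fundamental frequency from the time-domain waveform, the sketch below estimates F0 by picking the autocorrelation peak. It is a minimal example and not the vocoder analysis used in this paper: the function name, the 50 to 400 Hz search range, and the synthetic test frame are all assumptions.

```python
import math


def estimate_f0(signal, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    The lag with the largest autocorrelation inside the allowed pitch
    range is taken as the pitch period.
    """
    lag_min = int(sample_rate / f_max)  # shortest period considered
    lag_max = int(sample_rate / f_min)  # longest period considered

    def autocorr(lag):
        return sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return sample_rate / best_lag


# Synthetic "voiced" frame: a 200 Hz fundamental plus a weaker second harmonic.
sample_rate = 8000
frame = [
    math.sin(2 * math.pi * 200 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
    for n in range(800)
]
f0 = estimate_f0(frame, sample_rate)
print(f0)
```

Real speech needs windowing, voicing decisions, and smoothing across frames, which this sketch leaves out.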

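A hidden Markov model (HMM) is a probabilistic model for modeling sequences, which is composed of a set of hidden state variables $q = \{q_1, q_2, \ldots, q_T\}$ and a set of observation variables $o = \{o_1, o_2, \ldots, o_T\}$. The HMM makes two assumptions.

The state variables obey a first-order Markov chain; that is, the current state is only related to the state at the previous time, as shown in formula (1):

$$P(q_t \mid q_1, \ldots, q_{t-1}) = P(q_t \mid q_{t-1}). \tag{1}$$

The probability distribution of the observed variable at a certain time is only related to the state at the current time, and it has nothing to do with the states or observed variables at other times, as shown in formula (2):

$$P(o_t \mid q_1, \ldots, q_T, o_1, \ldots, o_{t-1}, o_{t+1}, \ldots, o_T) = P(o_t \mid q_t). \tag{2}$$

Generally, in an HMM, the transition probability from state $i$ to state $j$ is recorded as $a_{ij}$; that is,

$$a_{ij} = P(q_t = j \mid q_{t-1} = i). \tag{3}$$

The probabilities $a_{ij}$ together form the state transition matrix $A$ of the HMM. Given the state $q_t = j$, the probability density of the observed variable $o_t$ is

$$b_j(o_t) = p(o_t \mid q_t = j). \tag{4}$$

It is worth noting that the observation probability distributions corresponding to all states form the set $B = \{b_j(\cdot)\}$. Together with the initial state distribution $\pi$, the model parameters of the HMM can be recorded as $\lambda = (A, B, \pi)$. For a given observation sequence $o$, formula (5) shows the HMM’s output probability:

$$P(o \mid \lambda) = \sum_{q} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t). \tag{5}$$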
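The output probability of an HMM is usually computed with the forward algorithm rather than by summing over every state sequence. The sketch below is a minimal illustration under stated assumptions: observations are discrete symbols (a stand-in for the continuous acoustic densities used in speech synthesis), and the names `pi`, `A`, `B` and the toy two-state model are illustrative.

```python
def hmm_output_probability(pi, A, B, obs):
    """Forward algorithm: probability of an observation sequence under an HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability of emitting symbol k in state j
    obs     : list of observed symbol indices
    """
    n_states = len(pi)
    # alpha[j] = P(o_1..o_t, q_t = j); initialise with the first observation.
    alpha = [pi[j] * B[j][obs[0]] for j in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Toy left-to-right model with two states and two symbols (hypothetical values).
pi = [1.0, 0.0]
A = [[0.6, 0.4],
     [0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
prob = hmm_output_probability(pi, A, B, [0, 1, 1])
print(prob)
```

The recursion costs time proportional to the sequence length times the square of the number of states, whereas the direct sum over state sequences grows exponentially with the sequence length.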

The core principle of acoustic modeling in the HMM-based statistical parameter speech synthesis approach is to use an HMM to model, probabilistically, the acoustic feature sequence of speech in a given context. The configuration of the whole system involves the selection of speech acoustic features, the selection of the modeling unit, and the configuration of the HMM. Acoustic features in the speech synthesis system include excitation features and spectrum features. In the selection of spectrum features, in order to reduce the difficulty of HMM modeling, a low-dimensional spectrum representation that removes the correlation between dimensions is generally used, such as Mel cepstrum and line spectrum pair features. Considering the short-time stationary characteristics of the speech signal and the modeling ability of the HMM, the HMM in a speech synthesis system usually models phoneme-level units, such as vowel units in Chinese. Due to the temporal nature of speech, the topology of the HMM in acoustic modeling is usually one-way from left to right, without skipping states. Figure 4 depicts the framework of a statistical parameter speech synthesis system based on HMMs. It is divided into a training stage and a synthesis stage. The training stage includes speech acoustic feature extraction and HMM training. Because the HMM uses the phoneme as the modeling unit, context-dependent triphones are commonly modeled to improve modeling accuracy.

In the system training process, the variance floor of the HMM is estimated first, then the monophone HMMs are trained to provide the initial model parameters, then the context-dependent triphone HMMs are trained, and finally, decision-tree-based model clustering is carried out. In the synthesis stage, the text is first analyzed; combined with the predicted durations, the context-dependent HMM sequence is determined according to the decision tree, then the continuous acoustic feature sequence is obtained through the maximum likelihood parameter generation algorithm, and the speech waveform is synthesized by the synthesizer. The speech produced by the HMM-based statistical parameter speech synthesis system is often over-smoothed; one reason is that the modeling ability of the HMM is limited. In recent years, deep learning has developed rapidly as a branch of machine learning. Deep learning refers to the use of a network model consisting of multiple processing layers with nonlinear transformations, namely, a neural network. Owing to the excellent modeling ability of deep neural networks (DNNs) and recurrent neural networks (RNNs), acoustic modeling methods based on DNNs and RNNs have been applied to statistical parameter speech synthesis, where they perform better than the HMM-based acoustic modeling method. They have become the mainstream method of acoustic modeling for statistical parameter speech synthesis at present. Speech synthesis systems based on DNNs and RNNs share a similar system framework, as shown in Figure 5. The input features in the figure are the features extracted from the text; that is, discrete or continuous numerical features are used to describe the text.

The training of a statistical parameter speech synthesis system based on DNNs and RNNs usually adopts an error-minimizing training criterion, and it uses the back-propagation (BP) algorithm and the stochastic gradient descent (SGD) algorithm to update the model parameters so as to make the predicted acoustic parameters as close as possible to the natural acoustic parameters. In the synthesis stage, the text features are extracted from the text to be synthesized, then the corresponding acoustic parameters are predicted by the DNN or RNN, and finally, the speech waveform is synthesized by the vocoder. At present, the modeling methods based on DNNs and RNNs are mainly applied to speech acoustic parameters, including fundamental frequency and spectrum parameters. The duration information still needs to be obtained through other systems. In addition, the input and output features of DNN and RNN models need to be aligned in time. An HMM is usually used for segmentation to obtain the alignment information.
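The training loop described above can be sketched with a very small network. This is a minimal illustration, not the system used in this paper: the squared-error criterion, the single tanh hidden layer, the function name, and the toy text features and targets are all assumptions made for the example.

```python
import math
import random


def train_acoustic_dnn(inputs, targets, hidden=4, lr=0.1, epochs=200, seed=0):
    """Fit a one-hidden-layer network with back-propagation and per-sample SGD.

    inputs  : list of text-feature vectors (hypothetical numeric encodings)
    targets : list of scalar acoustic parameters (hypothetical, normalised)
    Returns the prediction function and the mean squared error per epoch.
    """
    rng = random.Random(seed)
    dim = len(inputs[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = [0.0]

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W1, b1)]
        y = sum(w * hj for w, hj in zip(W2, h)) + b2[0]
        return h, y

    def mse():
        return sum((forward(x)[1] - t) ** 2
                   for x, t in zip(inputs, targets)) / len(inputs)

    history = [mse()]
    order = list(range(len(inputs)))
    for _ in range(epochs):
        rng.shuffle(order)  # stochastic: visit samples in random order
        for n in order:
            x, t = inputs[n], targets[n]
            h, y = forward(x)
            err = y - t  # gradient of 0.5 * (y - t)^2 with respect to y
            # Back-propagate through the output layer and the tanh units.
            grad_h = [err * W2[j] * (1.0 - h[j] ** 2) for j in range(hidden)]
            for j in range(hidden):
                W2[j] -= lr * err * h[j]
                b1[j] -= lr * grad_h[j]
                for i in range(dim):
                    W1[j][i] -= lr * grad_h[j] * x[i]
            b2[0] -= lr * err
        history.append(mse())
    return (lambda x: forward(x)[1]), history


# Hypothetical binary text features mapped to a normalised acoustic parameter.
features = [[0, 0], [0, 1], [1, 0], [1, 1]]
acoustic = [0.1, 0.5, 0.4, 0.9]
predict, history = train_acoustic_dnn(features, acoustic)
print(history[0], history[-1])
```

A practical system predicts frame-level vectors of spectrum and fundamental frequency parameters and uses far larger networks, but the update rule has the same structure.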

3.2. Improvement of Intelligent Speech Synthesis Technology

In the traditional statistical parameter speech synthesis system, the spectral representation of speech is obtained by reducing the dimension of the spectral envelope. In recent years, due to the excellent modeling ability of neural network for high-dimensional features, the feature extraction method based on neural network has been applied to the fields of image and speech. In these methods, the hidden layer of the restricted Boltzmann machine is extracted as the low-dimensional representation of the original high-dimensional features. Based on this idea, this paper proposes a speech spectrum representation extraction method using deep belief network (DBN). It is applied to HMM-based statistical parameter speech synthesis system. This section first introduces the basic theory of RBM and DBN model and then introduces the spectrum representation extraction method based on DBN and its application in HMM-based statistical parameter speech synthesis system.

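Restricted Boltzmann machines (RBMs) are a special form of Boltzmann machines. In an RBM, there are no connections among the explicit nodes or among the implicit nodes; connections exist only between the two layers. When the explicit layer is given, the nodes in the hidden layer are conditionally independent, and when the hidden layer is given, the nodes in the explicit layer are conditionally independent. As a probability model, the RBM adopts the maximum likelihood criterion for training:

$$\theta^{*} = \arg\max_{\theta} \sum_{n} \log P\big(v^{(n)}; \theta\big),$$

where $\theta$ represents the model parameters of the RBM and $v^{(n)}$ denotes the $n$-th training sample. According to the formula in the first section above, it can be deduced that the gradient of the log-likelihood of the RBM with respect to the model parameters is

$$\frac{\partial \log P(v; \theta)}{\partial \theta} = -\mathbb{E}_{P(h \mid v)}\left[\frac{\partial E(v, h; \theta)}{\partial \theta}\right] + \mathbb{E}_{P(v, h)}\left[\frac{\partial E(v, h; \theta)}{\partial \theta}\right],$$

where $E(v, h; \theta)$ is the energy function of the RBM.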

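The above formula includes two terms, which can be regarded as the difference between the expectations of the variables under the data distribution and under the model distribution. RBM training uses the stochastic gradient ascent algorithm to update the model parameters. The formula is as follows:

$$\theta \leftarrow \theta + \eta \, \frac{\partial \log P(v; \theta)}{\partial \theta},$$

where $\eta$ is the learning rate.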

The RBM uses the contrastive divergence (CD) algorithm to estimate the gradient. The CD algorithm uses Gibbs sampling. The basic steps of Gibbs sampling are as follows. Suppose there are $n$ random variables $x_1, x_2, \ldots, x_n$. Gibbs sampling consists of $n$ steps, and each step can be expressed as

$$x_i \sim p(x_i \mid x_{-i}),$$

where $x_{-i}$ represents all variables other than $x_i$. This step means that, with the other variables given, a sample of $x_i$ is drawn from its conditional probability distribution. After this step has been carried out for each of the $n$ random variables in turn, one sample is obtained. When the number of samples tends to infinity, the obtained samples converge to the joint distribution $p(x_1, x_2, \ldots, x_n)$. According to the conditional probability formula of the RBM, when the explicit nodes are given, the implicit nodes of the RBM are conditionally independent, and when the implicit nodes are given, the explicit nodes are conditionally independent. Therefore, Gibbs sampling of the RBM can be divided into two steps, as shown in Figure 6.
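The general Gibbs sampling procedure can be illustrated on the smallest possible case. The sketch below is a minimal example with two binary variables; the joint distribution, the function name, and the number of sweeps are assumptions chosen so that the conditionals are easy to verify by hand.

```python
import random


def gibbs_sample_pairs(cond_x1, cond_x2, sweeps, seed=0):
    """Gibbs sampling for two binary variables.

    cond_x1[b] : P(x1 = 1 | x2 = b)
    cond_x2[a] : P(x2 = 1 | x1 = a)
    Each sweep resamples x1 given x2, then x2 given the new x1.
    """
    rng = random.Random(seed)
    x1, x2 = 0, 0
    samples = []
    for _ in range(sweeps):
        x1 = 1 if rng.random() < cond_x1[x2] else 0
        x2 = 1 if rng.random() < cond_x2[x1] else 0
        samples.append((x1, x2))
    return samples


# Assumed joint: P(0,0) = P(1,1) = 0.4 and P(0,1) = P(1,0) = 0.1,
# which gives P(x1 = 1 | x2 = 0) = 0.2 and P(x1 = 1 | x2 = 1) = 0.8 (same for x2).
conditionals = {0: 0.2, 1: 0.8}
samples = gibbs_sample_pairs(conditionals, conditionals, sweeps=20000)
freq_11 = sum(1 for s in samples if s == (1, 1)) / len(samples)
print(freq_11)
```

With enough sweeps, the empirical frequency of each pair approaches its probability under the joint distribution, which is the convergence property stated above.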

Firstly, given the explicit layer, all nodes in the hidden layer are sampled according to the conditional probability formula to obtain the sample h. Then, given the hidden layer sample h, all nodes in the explicit layer are sampled to obtain the sample v. Completing these two steps yields a set of samples {v, h}. The sampling is repeated many times; when the number of sampling steps tends to infinity, the sample {v, h} converges to the model distribution. In the CD algorithm, the RBM performs k-step Gibbs sampling, starting from real data samples, to obtain samples that approximately conform to the model distribution, and it directly uses the sample values in place of the expectations in the gradient. Compared with updating the model directly with the log-likelihood gradient, the CD algorithm therefore makes two approximations.
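The two-step sampling and the CD update can be put together in a short sketch. It assumes a binary RBM with weight matrix `W`, explicit-layer biases `b`, and hidden-layer biases `c`; these names, the learning rate, and the use of sampled values in both gradient terms are illustrative assumptions rather than the configuration used in this paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, c, rng):
    """Sample every hidden node given the explicit layer v."""
    return [1 if rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0
            for j in range(len(c))]


def sample_visible(h, W, b, rng):
    """Sample every explicit node given the hidden layer h."""
    return [1 if rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
            for i in range(len(b))]


def cd_k_update(v0, W, b, c, rng, k=1, lr=0.1):
    """One CD-k parameter update for a binary RBM, modifying W, b, c in place."""
    h0 = sample_hidden(v0, W, c, rng)      # positive phase from the data sample
    vk, hk = v0, h0
    for _ in range(k):                     # k steps of two-stage Gibbs sampling
        vk = sample_visible(hk, W, b, rng)
        hk = sample_hidden(vk, W, c, rng)
    for i in range(len(b)):
        for j in range(len(c)):
            # Data statistic minus model-sample statistic (gradient ascent).
            W[i][j] += lr * (v0[i] * h0[j] - vk[i] * hk[j])
        b[i] += lr * (v0[i] - vk[i])
    for j in range(len(c)):
        c[j] += lr * (h0[j] - hk[j])
    return vk


# Tiny RBM with 3 explicit and 2 hidden nodes, all parameters starting at zero.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
reconstruction = cd_k_update([1, 0, 1], W, b, c, rng, k=1, lr=0.1)
print(W, reconstruction)
```

Many implementations replace the sampled hidden values with their activation probabilities when accumulating the statistics in order to reduce variance; the sketch keeps the sampled values to stay close to the description above.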

4. Application of Speech Synthesis Technology in Foreign Language Teaching

The application of technology in education and teaching will never replace the role of teachers but help teachers carry out teaching, make up for teachers’ shortcomings, and improve teaching efficiency and effect. Intelligent speech synthesis technology can give play to its technical advantages to better assist teachers in English listening teaching so that the role of teachers and the function of intelligent speech synthesis technology can complement each other. Using the speech synthesis function of intelligent speech synthesis technology, that is, the text to speech function, teachers can turn the extracurricular relevant knowledge content into audio through this function according to the teaching needs and play it to learners so as to carry out the extension and expansion of knowledge for learners.

In addition, teachers can also adjust the speed and timbre of listening materials according to the learning characteristics of learners. Furthermore, teachers can offer crucial and challenging teaching knowledge to learners in the form of audio according to teaching needs, allowing students to consolidate and strengthen course content. For some better learners, teachers can also provide them with extracurricular listening audio materials by using intelligent speech synthesis technology for them to carry out autonomous learning so as to teach students according to their aptitude and individualized teaching and gradually improve learners’ listening ability. According to the learners’ internal psychological process of listening comprehension and information processing mode, combined with the functional advantages of intelligent speech synthesis technology, this study constructs the application mode of intelligent speech synthesis technology in primary school English listening teaching. It includes three parts: learners’ listening psychological language cognitive process, the functional module of intelligent speech synthesis technology, and the basic steps of teachers’ listening teaching. Learners’ listening psychological language cognitive process can be divided into three stages: perceptual processing stage, processing and understanding stage, and application feedback stage. The basic steps for teachers to carry out listening teaching can also be summarized into three stages: preparation before listening, guidance during listening, and feedback after listening. Next, this study will take learners and teachers as the mainline to analyze the application mode of intelligent speech synthesis technology in primary school English listening teaching.

5. Experiments and Results

Experimental performance data alone cannot fully explain the effect of the experiment. To study in depth the teaching effect of the application mode of intelligent speech synthesis technology in primary school English listening teaching, we also investigated the real experiences of learners and teachers during the experiment. The questionnaire has two parts: the functional role of intelligent speech synthesis technology in English listening instruction and learners’ recognition of intelligent speech synthesis technology in English listening instruction in primary schools. The first question is “Do you think the pronunciation of speech synthesized by intelligent speech synthesis technology is natural and clear?” As can be seen from Table 1, 66.67% of learners, more than half, believe that the synthesized audio is very natural and clear, 30.77% think it is natural and clear, and only one learner rates it as average. Most learners therefore consider the synthetic speech of intelligent speech synthesis technology quite natural and clear.

The second question is “Can the speed regulation function of intelligent speech synthesis technology, as used by your English teacher, help you better identify words while listening?” Table 2 shows that 69.23% of learners believe the speed regulation function helps them identify words, 23.07% think it does not help, and 7.70% think it has no effect on them. This indirectly shows that applying intelligent speech synthesis technology in primary school English listening helps improve learners’ listening and sound discrimination ability. After the questionnaire, we followed up with the three students who reported no effect and learned that they had a weak English foundation, so their listening and discrimination ability was difficult to improve in a short time.

The third question is “Is it more convenient for you to imitate and follow along with the English teacher when intelligent speech synthesis technology is used?” Table 3 shows that 82.05% of learners feel that intelligent speech synthesis technology makes it easier for them to imitate and follow along, 15.38% do not, and one learner believes it has no effect.

The fourth question is “Is it convenient for you to carry out independent learning after class when the teacher uses intelligent speech synthesis technology to generate MP3 audio and sends it to you?” As shown in Table 4, 71.79% of learners find it convenient for self-regulated learning, 23.08% say it is not convenient, and 2 learners say it has no effect on them. Follow-up showed that these learners are usually not very active in learning, so they rarely take the initiative to carry out self-regulated learning after class.

Intelligent speech synthesis technology supports listening teaching through several functions. The sixth question investigates which function of intelligent speech synthesis technology learners prefer. During the questionnaire survey, the specific application of these three functions in listening teaching was explained to learners to ensure that they understood the options correctly. Table 5 shows that 48.72% of learners prefer “standard reading,” 28.21% prefer the “close reading” and male and female “voice color conversion” function, 17.95% prefer “speed regulation,” and only two students do not like any of the three functions.
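The reported percentages can be cross-checked with a short calculation. The sketch below assumes 39 respondents; this number is not stated in the text and is inferred only from the fact that single learners correspond to the remaining 2.56% in Tables 1 and 3. The per-option counts are likewise inferred from the percentages, not taken from the paper.

```python
# Assumption: 39 respondents, inferred from the reported percentages
# (for example, one learner = 100% - 66.67% - 30.77% = 2.56% = 1/39).
N_RESPONDENTS = 39

# Inferred response counts per table, in the order the options are
# reported in the text. These counts are not stated in the paper.
INFERRED_COUNTS = {
    "Table 1": [26, 12, 1],
    "Table 2": [27, 9, 3],
    "Table 3": [32, 6, 1],
    "Table 4": [28, 9, 2],
    "Table 5": [19, 11, 7, 2],
}


def percentages(counts):
    """Convert raw response counts to percentages rounded to two decimals."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


if __name__ == "__main__":
    for table, counts in INFERRED_COUNTS.items():
        # Every table should account for all assumed respondents.
        assert sum(counts) == N_RESPONDENTS
        print(table, percentages(counts))
```

Under this assumption the recomputed values agree with the figures reported for Tables 1, 3, 4, and 5. For Table 2 the recomputation gives 23.08% and 7.69% against the reported 23.07% and 7.70%, a difference of 0.01 that is presumably due to rounding.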

The teacher believes that intelligent speech synthesis technology is practical, simple to operate, and more efficient and convenient than traditional media. Its effect in the classroom is also better, as it stimulates learners’ enthusiasm and learning initiative.

6. Conclusion

Against the background of the times and on the basis of the relevant literature, this study summarizes the application status of intelligent speech synthesis technology, gives an overall account of the problems and needs in foreign language teaching in China, and defines the purpose and significance of the research. The study has done the following work. Firstly, it analyzes the feasibility of applying intelligent speech synthesis technology in foreign language teaching and lists in detail the application conditions of the technology in different teaching activities and class types, which provides a basis for teachers who subsequently use it to carry out listening teaching. Then, it constructs the application mode of intelligent speech synthesis technology in foreign language teaching, which provides a reference for teachers on how to use the technology. Finally, it examines the effect of the application mode on foreign language teaching; the mode can not only help learners improve their listening ability and comprehension but also help teachers correct their own pronunciation and promote their professional development.

Due to limitations of time and technology, this study still has deficiencies. First, the application mode is put forward only in theory and provides only a general reference framework for teachers who use intelligent speech synthesis technology in foreign language teaching. In addition, the experiments show that the application mode still has some flaws, which need to be addressed and corrected in follow-up work.

Data Availability

The datasets used during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The author declares that he has no conflicts of interest.

Copyright

Copyright © 2022 Zhehua Zhang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
