Abstract
In the teaching of English, there is an increasing focus on practical communication skills. As a result, the speaking test component has received growing attention from education experts. With the rapid development of modern computer and network technology, the use of computers to assess the quality of spoken English has become a prominent research topic in related fields. To achieve automatic scoring of spoken English tests, a machine learning assessment system based on linear predictive coding is proposed. First, the principle of linear predictive coding and decoding is analyzed, and the traditional linear predictive coding and decoding algorithm is improved by replacing the traditional binary excitation with hybrid excitation. Second, the overall structure of the machine learning assessment system is designed; it is divided into four modules: an acoustic model acquisition module, a speech recognition module, a standard pronunciation transcription module, and a decision module. Then, the speech recognition module is implemented with the improved linear predictive speech coding method, which acquires the feature parameters of the speech signal and generates the speech feature vector. Finally, the convolutional neural network algorithm is used to train the speech features, thereby implementing the acoustic model acquisition module. The experimental results show that the improved linear predictive speech coding method yields speech signals with higher naturalness and intelligibility. The designed machine learning evaluation system is able to accurately detect information about the quality of the learner’s pronunciation.
1. Introduction
The focus of modern English language teaching is on the development of students’ comprehensive language application skills, such as listening, speaking, and reading. Among these, speaking training and speaking assessment have received increasing attention. There are generally two types of assessment for speaking tests: automated assessment and manual assessment by experts. With the continuous development of computer technology, automated assessment of speaking tests is beginning to be used in a variety of industries [1–6]. For example, speaking assessment systems can be used during telephone interviews to automatically score the English proficiency of interviewees. In addition, online teaching scenarios in the education industry can use speaking assessment systems to automate the scoring of students’ speaking quality. Automated speaking assessment systems can give timely, objective scores based on the test taker’s performance and are not subjectively influenced by personal factors [7, 8].
As competition in business continues to intensify, there is an increasing demand for well-rounded talent. Companies require these people not only to have solid professional knowledge but also to be able to express themselves proficiently in English, so speaking skills are quite important. Unlike traditional written English teaching, oral teaching focuses on standard pronunciation. Although the forms of teaching have diversified, spoken English teaching still relies largely on manual instruction. In the traditional language teaching process, teachers provide comprehensive training in listening, reading, and writing to students through a face-to-face approach in order to develop students’ language communication skills [9–11]. Among these, the learning and training of standard spoken language is the foundation and focus of English learning. Due to the constraints of teacher resources, learning costs, and learning locations, the effect of traditional speaking learning and training is often unsatisfactory. Teachers need to spend a lot of time and effort conducting various subjective tests on students, resulting in low work efficiency, especially in large-scale speaking test scenarios.
Currently, researchers are beginning to experiment with computer-assisted pronunciation training systems to address these problems [12–14]. The core issue of computer-aided pronunciation training systems is pronunciation bias detection, i.e., pronunciation bias assessment. Pronunciation bias assessment evaluates how standard the learner’s pronunciation is and assigns a corresponding score or grade; it is the core function of a computer-aided pronunciation training system. Pronunciation bias assessment is mostly performed with confidence-based methods. The speech is first aligned and segmented against the phoneme sequence to obtain accurate phoneme boundary information. Then, the confidence of the phonemes in each speech segment is calculated, and the pronunciation bias is measured by the confidence score. Common confidence calculation methods include log-likelihood, log-likelihood ratio, log-posterior probability, and Goodness of Pronunciation (GOP) [15–17]. In addition, some methods combine confidence calculations with pronunciation features, which yield better joint score results. In order to assess pronunciation bias with high accuracy, more and more researchers are focusing on the detection of pronunciation bias at the phoneme level.
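For reference, the Goodness of Pronunciation measure cited above is commonly defined as follows; this is the standard formulation from the literature, reproduced here for clarity rather than taken from this paper:

$$\mathrm{GOP}(p) \;=\; \frac{1}{T_p}\,\log\frac{p\!\left(O^{(p)} \mid p\right)}{\max_{q \in Q}\, p\!\left(O^{(p)} \mid q\right)},$$

where $O^{(p)}$ is the acoustic segment aligned to phoneme $p$, $T_p$ is its duration in frames, and $Q$ is the phoneme set; a low GOP score indicates a likely pronunciation deviation.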
There are two main approaches to the automatic detection of pronunciation bias at the phoneme level [18]. One is automatic pronunciation bias detection based on acoustic phonetics; such methods rely on a statistical analysis of speech. The other is automatic pronunciation bias detection based on automatic speech recognition technology.
1.1. Pronunciation Bias Detection Based on Acoustic Phonetics
Pronunciation bias detection based on acoustic phonetics finds a specific combination of features by extracting structural, acoustic, and perceptual features of the speech to be tested. Pronunciation bias detection is then achieved by statistically examining these features. A similarity calculation or a classifier is usually chosen to distinguish between pronunciation bias types.
Morlett Paredes et al. [19] proposed a hybrid method based on time-domain features and phoneme boundary information for pronunciation bias detection of basic English pronunciation units, with remarkable results. This hybrid method used a multilayer perceptron as a classifier. Nakamura et al. [20] extracted several resonance peaks from different frames after pre-processing the speech to be measured. Then, a Gaussian Mixture Model (GMM) was used for classification and vowel articulation bias detection was achieved. Dashti and Razjmoo [21] defined a resonance peak that reduces ambient noise. This resonance peak is able to simulate the vocal tract shape properties. Articulatory bias detection is then performed by calculating the degree of structural distortion (Bhattacharyya distance) between the speech to be measured and the standard speech.
1.2. Pronunciation Bias Detection Based on Automatic Speech Recognition Technology
Automatic speech recognition is essentially a classification matching problem, while pronunciation bias detection is a classification regression problem, so pronunciation bias detection can be solved using speech recognition technology. Pronunciation bias detection based on automatic speech recognition is simpler than pronunciation bias detection based on acoustic phonetics. This is because automatic speech recognition can use a language model to counteract the effects of imprecise acoustics and thus output a legitimate sequence of characters. Therefore, this study chose to use automatic speech recognition to implement a spoken English assessment system. The key elements of automatic speech recognition technology include the extraction of speech feature parameters and the selection of acoustic models, both of which are also the focus of this study.
First of all, the extraction of speech feature parameters is a key step in the speech recognition process, and the selection of parameters directly affects the overall performance of the system. After the speech signal has been preprocessed, its feature parameters need to be extracted and analyzed. The most typical extraction method is the use of vocoders.
The vocoder was born in the 1920s at Bell Labs in the USA. Since then, the vocoder has undergone a period of rapid development. A large number of researchers have worked on speech coding and speech synthesis and have achieved considerable results. The basis of the vocoder is Linear Predictive Coding (LPC). In the early 1980s, the US Department of Defense published the LPC-10 standard. Liu et al. [22] used LPC to build a parametric pronunciation bias database and combined it with a Gaussian Hidden Markov Model to achieve classification detection of pronunciation bias. Hiroya and Mochida [23] used LPC to extract speech feature parameters and then used linear discriminant analysis or decision trees to train classification models for pronunciation bias detection.
Second, for pronunciation detection, the constraint provided by the language model is not helpful, as it leads to missed detection of incorrect pronunciations. Therefore, robust acoustic models are important to distinguish standard pronunciation from abnormal pronunciation. In traditional speech recognition, the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) has been the dominant acoustic model [24]. However, with the continuous development of deep learning techniques, deep learning models are increasingly being used in speech recognition tasks. A convolutional neural network (CNN) is a multilayer perceptron that incorporates convolutional computation. The CNN is one of the representative algorithms of deep learning [25] and is commonly used to analyze visual images. A CNN consists of an input layer, convolutional layers, ReLU activation layers, pooling layers, and a fully connected layer. The CNN is also known as a “translation-invariant artificial neural network.”
When applied to automatic speech recognition applications, in terms of input, CNN-based automatic speech recognition techniques are broadly divided into two types: one is to use traditional acoustic feature parameters as input, such as Mel Frequency Cepstrum Coefficient (MFCC) [26], LPC [27], and Fbank [28]. The other is to use original time-frequency spectrum as input, that is, to treat the time-frequency diagram as an image. Er et al. [29] analyzed the research on deep learning techniques in speech recognition and the key problems to be solved. Nakashika et al. [30] used recurrent neural networks for speech recognition and the recognition accuracy was high.
From a pronunciation bias detection perspective, we want to retain as much of the original information as possible in the features received at the input, because the original information is the most realistic representation of the quality of the learner’s spoken language. However, time-frequency maps can cause information loss in the frequency domain, which is detrimental to pronunciation bias detection. Therefore, the automatic speech recognition technology in this paper uses acoustic feature parameters as input. Due to the short-time stationarity of spoken English, the feature parameters of the acoustic model in pronunciation bias detection are updated less frequently, which keeps the required coding bit rate low (2.4 kb/s or below). A simple LPC vocoder can operate at coding rates of 0.8 to 2.4 kb/s, which meets this requirement [31–33]. Therefore, LPC is used for speech signal feature extraction, and the features are trained by a convolutional neural network algorithm to complete speech recognition. The aim of this study is to adopt LPC to extract acoustic feature parameters and to use a CNN as the acoustic model for pronunciation bias detection, thereby automating the detection of English pronunciation bias.
In order to achieve automatic scoring of spoken English tests, a machine learning assessment system based on linear predictive coding is proposed; it is divided into four modules: an acoustic model acquisition module, a speech recognition module, a standard pronunciation transcription module, and a decision module. The linear predictive speech coding method with improved excitation is used to obtain the feature parameters of the speech signal and generate the speech feature vector, thereby implementing the speech recognition module. Finally, the CNN model is used to train the speech features so as to implement the acoustic model acquisition module. The experimental results show that the improved LPC + CNN-based evaluation system can accurately detect pronunciation bias information.
The main innovations and contributions of this paper include the following:
(1) The accuracy of voiced/unvoiced decisions is important for spoken English assessment systems. Therefore, the traditional LPC algorithm is improved by replacing simple binary excitation with hybrid excitation. In the acoustic feature parameter extraction process, the sub-band sound intensity of the speech signal is extracted using a split-band hybrid excitation technique, in addition to the fundamental tone period required by the traditional LPC model.
(2) An English spoken pronunciation evaluation system based on improved LPC and CNN is constructed. The improved LPC algorithm is used to obtain the feature parameters of the speech signal and generate the speech feature vector, thus realizing the speech recognition module. A CNN is used to train the speech features, thus realizing the acoustic model acquisition module.
The rest of the paper is organized as follows: Section 2 reviews representative spoken pronunciation assessment systems, Section 3 presents the improved LPC algorithm, Section 4 describes the machine learning evaluation system based on ILPC + CNN, Section 5 provides the experimental results and analysis, and Section 6 concludes the paper.
2. Representative Spoken Pronunciation Assessment System
Since the 1990s, many technology companies and research institutes have conducted in-depth research in the field of pronunciation bias detection, achieved remarkable results, and launched various application systems, as shown in Table 1. These systems have been widely used in areas such as computer-aided pronunciation training, computer-aided language learning, and computer-based speaking proficiency testing. One example is the DISCO (Development and Integration of Speech technology into Courseware for language learning) project at the University of Nijmegen (Netherlands) [34]. The DISCO system automatically detects pronunciation deviations and grammatical errors in the speech to be tested and generates detailed feedback on the detected errors. The HUGO system, developed by Kyoto University in Japan for Japanese learners of English, uses a decision tree technique based on linguistics and a phonological database to check pronunciation bias.
3. Improvements to the LPC Algorithm
3.1. Principle of LPC
The most basic low-rate speech coding method is linear predictive coding. In speech signal analysis, linear prediction not only enables prediction but also provides a very good estimate of the vocal tract model parameters. Linear prediction analysis can provide a set of speech signal model parameters that accurately represent the spectral amplitude of the speech signal. The basic idea of linear predictive analysis is to use the previous p sample values to predict the current sample value. LPC can simulate the human articulatory system very well and therefore has advantages in the extraction of English speech feature parameters [35]. After waveform interception and noise filtering of the speech signal, multiple frames of the speech signal within a certain time period can be obtained by frame sampling and combined with a linear time-domain model to achieve feature parameter extraction.
Let $s(n)$ represent the speech signal. According to the LPC principle, $s(n)$ can be approximated from its previous $p$ sample points, where $a_i$ ($i = 1, 2, \ldots, p$) denote the linear prediction coefficients.

Let $\hat{s}(n)$ be the predicted speech signal; it is represented as

$$\hat{s}(n) = \sum_{i=1}^{p} a_i\, s(n-i).$$

The prediction error is calculated as

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{i=1}^{p} a_i\, s(n-i).$$

Let the partial derivatives of the total squared error $E = \sum_{n} e^{2}(n)$ with respect to each coefficient be zero, i.e., $\partial E / \partial a_i = 0$; then all coefficients can be solved and a stable speech feature signal can be obtained.
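As an illustration of this derivation, the sketch below estimates the prediction coefficients of one frame with the autocorrelation (Levinson-Durbin) method; the frame length, the order p = 10, and the synthetic test signal are illustrative assumptions rather than the paper’s exact settings.

```python
import numpy as np

def lpc_coefficients(frame, p=10):
    """Estimate p linear prediction coefficients a_i for one speech frame
    using the autocorrelation method (Levinson-Durbin recursion)."""
    # Autocorrelation values r[0..p]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Order update of the coefficient vector
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    # Return a_i such that s_hat(n) = sum_i a_i * s(n - i)
    return -a[1:], err

# Toy usage on a synthetic, slightly noisy frame (illustrative only)
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 200 * np.arange(240) / 8000) + 0.01 * rng.standard_normal(240)
coeffs, residual_energy = lpc_coefficients(frame, p=10)
```

The coefficients returned here correspond directly to the $a_i$ in the prediction equation above; the residual energy is the minimized squared error $E$.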
The basic principle of the linear predictive vocoder is that the model parameters are encoded together with the excitation parameters using linear predictive analysis in an all-pole vocal tract model, enabling the transmission of high-quality speech at low bit rates (below 2.4 kb/s). The principle of the linear predictive vocoder is shown in Figure 1. At the receiving end of the linear predictive vocoder, the prediction coefficients obtained from the linear predictive analysis can be used to synthesize the transmitted speech directly [36]. Figure 2 shows the coding principle of the LPC-10 vocoder.


First, after a low-pass analog filter, the LPC-10 vocoder performs an A/D conversion at a sampling rate of 8 kHz to obtain the digitized speech. The digitized speech is then processed in two parallel branches. (1) The excitation information is processed: after the speech has been framed, the characteristic parameters of each frame are extracted and encoded for transmission. After encoding, the fundamental tone period (pitch) and the voiced/unvoiced flag (V/UV) of each frame are obtained. The fundamental tone period is calculated using the average magnitude difference function (AMDF) method. (2) The vocal tract parameters are extracted.
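The average magnitude difference function referred to here is conventionally defined, for a frame $s(n)$ of length $N$, as

$$D(k) = \frac{1}{N}\sum_{n=0}^{N-1-k}\bigl|\,s(n+k) - s(n)\,\bigr|,$$

and the fundamental tone period is taken as the lag $k$ at which $D(k)$ reaches its minimum within the admissible pitch range; this is the standard textbook definition, included here for completeness.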
Because most of the energy of the speech signal is concentrated in the low-frequency range and the power spectrum decays with frequency, the LPC needs to preprocess the speech signal first so that the power spectrum at high frequencies is raised, thus improving the accuracy of vocal tract parameter extraction. A first-order pre-emphasis filter is used for this purpose,

$$H(z) = 1 - \mu z^{-1},$$

where $H(z)$ denotes the transfer function of the pre-processing filter and $\mu$ is a pre-emphasis coefficient close to 1.
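A minimal sketch of this pre-processing step, assuming the first-order pre-emphasis filter above (the value of the coefficient mu is illustrative):

```python
import numpy as np

def pre_emphasis(signal, mu=0.9375):
    """First-order pre-emphasis y(n) = x(n) - mu * x(n-1), i.e. H(z) = 1 - mu*z^-1.
    Boosts the high-frequency part of the spectrum before LPC analysis."""
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```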
3.2. Improvements to the Excitation Source
Conventional LPC algorithms use a simple binary excitation source (voiced/unvoiced) to excite the synthesizer. Because of its low robustness, the quality of speech synthesized from a binary excitation source is poor when the speech is very noisy. Real-life English speech often contains both voiced and unvoiced components, especially in noisy speech segments, so the result of the voiced/unvoiced decision directly affects the quality of speech recognition. Improving the excitation source is therefore important for spoken English evaluation systems. In this paper, a hybrid excitation is used instead of the traditional binary excitation, yielding an improved LPC algorithm (ILPC). In terms of parameter extraction, in addition to the fundamental tone period required by conventional LPC, the hybrid excitation technique is also used to extract the sub-band sound intensities of the speech signal.
The steps for extracting the fundamental tone period in the ILPC algorithm are as follows:
Step 1: pass the speech signal $x(n)$ through a 900 Hz low-pass filter and discard the first 20 output values to obtain $x'(n)$.
Step 2: find the maximum amplitude of the first 100 samples and the maximum amplitude of the last 100 samples of $x'(n)$, respectively, and use the smaller of the two values to set the clipping threshold $L$.
Step 3: apply center clipping and three-level clipping to $x'(n)$ to obtain $y(n)$ and $z(n)$, respectively.
Step 4: compute the correlation $R(k)$ between the signals $y(n)$ and $z(n)$, where $k$ ranges from 20 to 150 and $R(k)$ is normalized by the short-time energy.
Step 5: use the peak detector to find the maximum correlation value $R(k_{\max})$. If $R(k_{\max})$ is less than the decision threshold, the frame is considered unvoiced and the fundamental tone period is set to $P = 0$; otherwise, the frame is considered voiced and the fundamental tone period is set to $P = k_{\max}$.
A condensed implementation sketch of these steps is given below.
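In this sketch, the clipping factor, filter order, and voiced/unvoiced decision threshold are illustrative assumptions, and the 900 Hz low-pass filter is realized with scipy for brevity; it assumes a frame of at least roughly 200 samples.

```python
import numpy as np
from scipy.signal import butter, lfilter

def ilpc_pitch(frame, fs=8000, k_min=20, k_max=150):
    """Fundamental tone period estimation with center/three-level clipping
    and normalized correlation, following the five steps described above."""
    # Step 1: 900 Hz low-pass filter, discard the first 20 output samples
    b, a = butter(4, 900 / (fs / 2), btype="low")
    x = lfilter(b, a, frame)[20:]
    # Step 2: clipping threshold L from the smaller of the two local maxima
    L = 0.68 * min(np.max(np.abs(x[:100])), np.max(np.abs(x[-100:])))
    # Step 3: center clipping y(n); three-level clipping z(n) = sign of y(n)
    y = np.where(np.abs(x) > L, x - np.sign(x) * L, 0.0)
    z = np.sign(y)
    # Step 4: correlation between y and z, normalized by short-time energy
    energy = np.sum(y * y) + 1e-12
    R = np.array([np.sum(y[:len(y) - k] * z[k:])
                  for k in range(k_min, k_max + 1)]) / energy
    # Step 5: peak detection and voiced/unvoiced decision
    k_best = int(np.argmax(R)) + k_min
    if R[k_best - k_min] < 0.25:   # below threshold: unvoiced, P = 0
        return 0
    return k_best                   # voiced: pitch period in samples
```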
The process of extracting the sub-band sound intensity in the ILPC calculation is shown in Figure 3.

After passing through the bandpass filter, the speech signal is extracted to obtain the fundamental tone period. The result of passing a frame of speech signal through each of the five sub-band filters is shown in Figure 4. The sound intensity of the five sub-bands is calculated as follows: 0.2452 for the first sub-band; 0.4478 for the second sub-band; 0.1893 for the third sub-band; 0.3707 for the fourth sub-band; and 0.3874 for the fifth sub-band.

For each unvoiced or turbulent (noise-like) frame, the sound intensity of the speech signal in each sub-band is calculated separately. In forming the excitation signal, the sub-band sound intensities determine the weighting of the pulse and noise sources in each sub-band, resulting in an excitation signal covering the entire frequency band.
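A sketch of how the five sub-band intensities and the mixed excitation weighting described here could be computed is shown below; the band edges, filter design, and pulse/noise mixing rule are illustrative assumptions rather than the paper’s exact configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Illustrative sub-band edges in Hz (not necessarily the paper's exact bands)
BAND_EDGES_HZ = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def _band_sos(lo, hi, fs):
    """Low-pass filter for the first band, band-pass for the others."""
    nyq = fs / 2
    if lo == 0:
        return butter(4, hi / nyq, btype="low", output="sos")
    return butter(4, [lo / nyq, hi / nyq], btype="band", output="sos")

def subband_intensities(frame, fs=16000):
    """Normalized sound intensity (RMS) of a speech frame in each sub-band."""
    total = np.sqrt(np.mean(frame ** 2)) + 1e-12
    return np.array([
        np.sqrt(np.mean(sosfiltfilt(_band_sos(lo, hi, fs), frame) ** 2)) / total
        for lo, hi in BAND_EDGES_HZ
    ])

def mixed_excitation(pulse_train, noise, intensities, fs=16000):
    """Hybrid excitation: in each sub-band, weight the periodic pulse source
    and the noise source by the sub-band sound intensity, then sum the bands."""
    excitation = np.zeros_like(pulse_train, dtype=float)
    for (lo, hi), w in zip(BAND_EDGES_HZ, np.clip(intensities, 0.0, 1.0)):
        band = _band_sos(lo, hi, fs)
        excitation += w * sosfiltfilt(band, pulse_train) \
                      + (1.0 - w) * sosfiltfilt(band, noise)
    return excitation
```

The intent is that sub-bands with strong periodic energy contribute mostly pulse excitation, while weak or noisy sub-bands contribute mostly noise, which is the behavior the hybrid excitation is meant to capture.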
4. ILPC + CNN Based Machine Learning Evaluation System Design
4.1. General System Architecture
The automatic detection of pronunciation bias is a simulation of the human subjective detection process. Through machine learning of the manual detection results, automatic detection can even outperform human experts. The machine learning evaluation system for spoken English designed in this paper is shown in Figure 5. The system is divided into four modules: an acoustic model acquisition module, a speech recognition module, a standard pronunciation transcription module, and a decision module.

4.2. ILPC-Based Speech Recognition Module
In this paper, the ILPC algorithm is used to implement the speech recognition module so that the learner’s basic pronunciation units (phonemes), including legal and illegal pronunciation unit sequences, can be accurately identified. Automatic speech recognition aims to detect the content of the learner’s pronounced text and to output legitimate character sequences by using acoustic models that can counteract the effects of undesirable acoustics. In the acoustic feature parameter extraction process, we use a split-band hybrid excitation technique to extract the sub-band sound intensity of the speech signal in addition to the fundamental tone period, resulting in accurate voiced/unvoiced decisions.
4.3. CNN-Based Acoustic Model Acquisition Module
The main function of the acoustic model acquisition module is to train an acoustic model, which is then used in the speech recognition module. In traditional speech recognition, the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) has been the dominant acoustic model. However, with the continuous development of deep learning techniques, deep learning models are increasingly being used in speech recognition tasks. A convolutional neural network (CNN) is a multilayer perceptron that incorporates convolutional computation [37]. The CNN is one of the representative algorithms of deep learning and is commonly used to analyze visual images. Therefore, in this paper, CNNs are used to implement the acoustic model acquisition module.
The input to the CNN is the set of acoustic feature parameters obtained by the ILPC algorithm. The structure of the CNN is shown in Figure 6. Let the sample set of speech features be $X = \{x_1, x_2, \ldots, x_N\}$. First, the speech features are convolved in the $l$-th layer of the CNN [38–40]:

$$y^{(l)} = f\!\left(W^{(l)} \ast x^{(l-1)} + b^{(l)}\right),$$

where $W^{(l)}$ and $b^{(l)}$ represent the weights and biases of the features in the $l$-th layer, respectively, and $\ast$ represents the convolution operation.

Then, the convolution operation is performed on the sample features with a convolution kernel of size $k \times k$. A new feature map is obtained after the convolution operation, and a transformation (activation and pooling) operation is performed on it, subject to the corresponding constraints on the kernel size and feature-map dimensions.
After obtaining the fully connected layer of the convolutional neural network, the classifier is selected to predict the sample class.
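A minimal PyTorch sketch of such a CNN acoustic model is given below; the layer widths, the number of output classes, the input feature-map size, and the 3 × 3 kernel (following the experimental choice later in the paper) are illustrative assumptions, not the paper’s confirmed configuration.

```python
import torch
import torch.nn as nn

class CnnAcousticModel(nn.Module):
    """Convolution -> ReLU -> pooling -> fully connected classifier,
    matching the layer types listed earlier (input, conv, ReLU, pooling, FC)."""
    def __init__(self, n_classes=44, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 32 x 32 input -> two 2x2 poolings -> 32 channels of 8 x 8 maps
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)

    def forward(self, x):
        # x: batch of ILPC feature maps, shape (B, 1, 32, 32) in this sketch
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

# Toy usage: a batch of 4 feature "images" of size 32 x 32
model = CnnAcousticModel()
logits = model(torch.randn(4, 1, 32, 32))   # -> shape (4, 44)
```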
In traditional acoustic model training, the label corresponding to each frame of data needs to be known in order to train effectively. Therefore, the speech signal needs to be forcibly aligned prior to training the model. Although there are some relatively mature open-source alignment tools available, forced alignment significantly constrains the performance of speech recognition techniques. In CNN-based acoustic models, we want to leave more tasks to the neural network, such as learning how to align autonomously. Therefore, predictive alignment techniques are used to solve this problem. The loss function for predictive alignment is defined as [41–43]

$$L = -\sum_{(X, Y) \in S} \ln p(Y \mid X),$$

where $p(Y \mid X)$ denotes the probability of the output sequence $Y$ when the input sequence is $X$, and $S$ denotes the training set. It can be seen that predictive alignment can directly output the predicted probability of a sequence without the need for external post-processing. With the help of predictive alignment, a large amount of manual effort can be saved, thus increasing efficiency. In this paper, the acoustic model acquisition module is built by combining predictive alignment and CNN, as shown in Figure 7.
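The alignment-free loss described here matches, in form, the widely used Connectionist Temporal Classification (CTC) criterion; as a hedged illustration (not the paper’s confirmed implementation), the sketch below computes such a sequence loss with `torch.nn.CTCLoss`, with all dimensions chosen arbitrarily.

```python
import torch
import torch.nn as nn

# Frame-level class scores from the CNN acoustic model:
# (T time steps, B utterances, C classes including a blank symbol at index 0)
T, B, C = 120, 4, 45
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=2)

# Target phoneme label sequences; no frame-level alignment is required
targets = torch.randint(low=1, high=C, size=(B, 30), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 30, dtype=torch.long)

# L = -sum over training pairs of log p(Y | X), summed over the mini-batch
ctc = nn.CTCLoss(blank=0, reduction="sum")
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```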

5. Experimental Results and Analysis
In order to verify the performance of ILPC + CNN in the quality assessment of spoken English, various experiments were conducted using speech samples with different accents. The experimental speech data were obtained from the open-source website VoxForge (https://www.voxforge.org/zh). The parameters of the experimental dataset are shown in Table 2, with a ratio of 3 : 1 between training and test samples. The parameters of the CNN model are set as shown in Table 3. The sampling rate of the audio data is 16000 Hz and the bit depth is 16 bits. The number of texts (pronounced sentences) is 2268 and the total number of phonemes is 44359. The total number of speakers is 10, including 5 males and 5 females. First, the effect of different frame rates on ILPC performance was tested. Second, the effect of different convolutional kernel sizes on recognition performance was tested. Finally, the designed system was compared with other spoken pronunciation evaluation systems.
5.1. Effect of Different Frame Rates on ILPC Performance
In order to obtain the best frame rate setting, the speech recognition accuracy of the six datasets at different frame rates was verified, as shown in Table 4.
It can be seen that as the number of frames extracted increases, the recognition accuracy keeps improving. When the frame rate reaches 180 Hz, the ILPC algorithm shows high recognition accuracy. When the frame rate increases to 200 Hz, datasets 1, 2, 3, and 6 show a decrease in recognition accuracy, while datasets 4 and 5 show a very small, almost negligible increase. When ILPC is used to extract features from speech signals of different sample types, too high a frame rate increases the computational effort of speech recognition, while too low a frame rate loses important features of the speech signal. Therefore, the frame rate used in the subsequent ILPC experiments was 180 Hz.
5.2. Effect of Convolutional Kernel Size on Speech Recognition
To further verify the effect of convolutional kernel size on speech recognition performance, the English speech recognition accuracy under different convolutional kernel conditions was tested, as shown in Table 5.
It can be seen that the recognition accuracy of English speech decreases as the size of the convolutional kernel increases. This may be because fewer speech features are involved in the operation when the convolutional kernel is too large, resulting in a decrease in speech recognition accuracy. The comparison shows that the recognition accuracy of the CNN is higher when the convolutional kernel size is 2 ∗ 2 or 3 ∗ 3. However, 2 ∗ 2 is more time-consuming than 3 ∗ 3 in CNN operations, so to improve real-time performance, the convolutional kernel size was set to 3 ∗ 3 in subsequent experiments.
5.3. Performance Analysis of ILPC
The excitation signal of the ILPC vocoder is compared with that of the LPC vocoder, as shown in Figure 8.

It can be seen that the excitation signal obtained by ILPC is a mixed excitation signal. Each frame of speech is no longer purely unvoiced or voiced but contains a distinct periodic pulse train and a small amount of noise. As a result, the speech signal obtained by ILPC is more natural and more intelligible. Conventional LPC uses a simple binary excitation signal to process the input sequence. Compared with the conventional LPC algorithm, ILPC based on hybrid excitation gives a waveform that more closely resembles that of the original speech signal. The ILPC algorithm yields speech signals with high naturalness, and the resulting waveform is almost identical to the original waveform.
To verify the performance of the ILPC-based speech recognition module, 1000 samples were taken from each of the six datasets to form a speech hybrid dataset containing 6000 samples. The spoken pronunciation bias of this hybrid dataset was examined using LPC + CNN and ILPC + CNN respectively. The frame rate was 180 Hz and the convolutional kernel was 3 ∗ 3. The detection results are shown in Table 6 and Figure 9.

It can be seen that after ILPC feature extraction, the detection accuracy of the CNN is significantly improved. Because of its lower robustness, the speech quality of LPC is poor for very noisy speech; real-life English speech usually contains both voiced and unvoiced components, especially in transition segments and very noisy segments. When ILPC is used for feature extraction of the captured speech signal, each frame of speech is no longer treated as purely voiced or unvoiced, thus retaining as much of the original information as possible. ILPC + CNN converges at about 140 iterations, whereas LPC + CNN takes about 180 iterations to stabilize. In addition, the standard deviation of ILPC + CNN is smaller than that of LPC + CNN.
5.4. Performance Comparison of Different Spoken Language Assessment Systems
In contrast to traditional spoken pronunciation assessment methods, the training data for this experiment did not require manual annotation. Using the above speech mixture dataset containing 6000 samples, the designed system was compared with other spoken pronunciation assessment systems, and the results are shown in Table 7.
A total of 44359 phonemes (initials, finals, and tones) were obtained from the speech mixture dataset. The manual detection results showed that 6033 phonemes were mispronounced in this speech data; the SCILL system detected 10407 mispronounced phonemes, and the TBALL system detected 9894. The system designed in this paper (ILPC + CNN) detected 7856 mispronounced phonemes, which is the closest to the manual (labeled) detection result. The experimental results show that the ILPC + CNN algorithm can indeed reduce the misjudgment rate of pronunciation deviation. This indicates that the feature parameters obtained by ILPC using hybrid excitation reflect the characteristics of the original speech signal well, and therefore the decoded speech quality is better and the speech is clearer.
Finally, the experiment classified the 64 pronunciation errors into three types, namely initial errors, final errors, and tone errors. These three types of pronunciation errors were counted and the results are shown in Figure 10.

As can be seen, of the three types of pronunciation errors, tone errors are the most likely to occur, so learners of English need to pay particular attention to intonation. The next most frequent are final (rhyme) errors, which, compared with the other two types of pronunciation errors, are easier to correct. The finding that tone errors are much more frequent than the other two types is consistent with linguistic laws, which supports the reliability of the experimental results.
6. Conclusion
In this paper, a machine learning evaluation system for spoken English based on ILPC + CNN algorithm is constructed so as to automate the detection of learners’ pronunciation errors. The designed system consists of four main modules: acoustic model acquisition module, speech recognition module, standard pronunciation transcription module, and decision module. The speech recognition module uses the ILPC algorithm to obtain the feature parameters of the speech signal and generate the speech feature vector. The acoustic model acquisition module uses a CNN model to train the speech features and the input to the CNN is the acoustic feature parameters obtained by the ILPC algorithm. The experimental results show that the feature parameters obtained by ILPC using hybrid excitation reflect the characteristics of the original speech signal very well, and therefore the decoded speech quality is better and the speech is clearer. Compared with other spoken English evaluation systems, the ILPC + CNN-based machine learning evaluation system can reduce the misjudgment rate of pronunciation bias.
Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The author declares that there are no conflicts of interest to report regarding the present study.