1. INTRODUCTION
The PSC (Putonghua Shuiping Ceshi) is the only authoritative test for grading Mandarin proficiency in China. The test takes the form of an oral examination and consists of four items. The first three items are reading the given characters, words, and passages respectively, and belong to text-dependent oral evaluation. These three items have been successfully scored by machine, with machine scores very close to manual scores. The fourth item is the Topic Talk: according to a topic determined by lottery, candidates express themselves freely and improvise within a time limit of four minutes. Since the content a candidate will express cannot be known accurately in advance, the Topic Talk is a text-independent oral evaluation task. Owing to the limited existing research on text-independent oral evaluation, this item is still scored manually at present, which undoubtedly increases the manpower and workload of the PSC. The log posterior probability is recognized as the most important feature for measuring pronunciation quality in oral evaluation tasks, and it has been widely and maturely applied in the text-dependent oral evaluation items of the PSC. Ge et al. [1] proposed a posterior probability algorithm based on a phoneme-confusion expansion network, which significantly improves the computation speed of the posterior probability without changing the computational complexity of the system. Chen et al. [2] introduced knowledge of Mandarin pronunciation linguistics into the posterior probability algorithm and improved the pronunciation quality evaluation algorithm from the perspective of the phoneme scoring model, which significantly improves the correlation between manual and machine scores.
Chen et al. [3] optimized the probability space of the current computer-aided PSC system from the perspective of phonetics, which not only reduced the confusion caused by the probability space but also significantly shortened the computing time of the system. Since there is no preset reference text in the Topic Talk item of the PSC, realizing intelligent evaluation of this item requires first performing continuous speech recognition with a recognizer and taking the optimal recognition result as the reference text for the log posterior probability. In this paper, we introduce a deep neural network to construct an intelligent evaluation model for the Topic Talk item in the PSC.

2. INTELLIGENT EVALUATION MODEL OF TOPIC TALK
The total score of the Topic Talk item in the PSC is 40 points. According to the requirements of the official PSC examination syllabus, assessors make a comprehensive evaluation from four aspects: the standard of pronunciation (25 points), the standard of vocabulary and grammar (10 points), natural fluency (5 points), and an appropriate deduction of points if the effective expression time is less than 3 minutes [4]. This item assesses not only how standard and fluent the candidate's pronunciation is, but also the use of vocabulary and grammar. In the Topic Talk evaluation, since the machine cannot know in advance the text the candidate will express, the forced alignment of text with the model used in text-dependent evaluation tasks cannot be adopted directly. Therefore, a complete large-vocabulary continuous speech recognition (LVCSR) module must be added at the front end of the intelligent evaluation model of the Topic Talk item. The intelligent evaluation process of the Topic Talk in the Mandarin proficiency test is shown in Figure 1. After preprocessing and feature extraction, the speech of the candidate's Topic Talk item is input into the decoder.
Based on the acoustic model and the language model, the decoder obtains an optimal output sequence using the Viterbi algorithm. The output of the recognizer is used as the reference text for the subsequent text-independent evaluation tasks. First, the posterior probability of the obtained reference text relative to the pronunciation is calculated; then the features of the evaluation indicators, such as degree of pronunciation standard, fluency, and standard of vocabulary and grammar, are computed based on the posterior probability. Finally, the total machine score is predicted from all the obtained feature values. Text-independent pronunciation quality evaluation depends mainly on the posterior probability of the recognition result relative to the pronunciation, so improving the recognition rate of the front-end recognizer is especially important for the posterior probability estimation of the Topic Talk item. Since the acoustic model and the language model have a great impact on the recognition performance of the front-end recognizer, in this paper we optimize the LVCSR module from these two aspects to improve the front-end speech recognition rate, so as to calculate the posterior probability more accurately and better measure the pronunciation quality of candidates.

3. IMPROVED DBLSTM-HMM ACOUSTIC MODEL
Since the Hidden Markov Model (HMM) can describe the relationship between the hidden states and the feature sequence in speech, it is widely used for acoustic modeling in speech recognition tasks. The traditional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) [5] has the advantages of fast training and easy deployment, but it lacks the ability to learn deep nonlinear feature transformations.
The mainstream Deep Neural Network-Hidden Markov Model (DNN-HMM) [6] makes better use of the correlation between speech frames and can learn deep nonlinear feature transformations, but it lacks the ability to model the long-term correlations of speech. The Recurrent Neural Network-Hidden Markov Model (RNN-HMM) [7] better addresses the modeling of long-term correlations in speech, but as the number of network layers increases it is prone to vanishing or exploding gradients. The Long Short-Term Memory-Hidden Markov Model (LSTM-HMM) [8] can effectively control the vanishing and exploding gradient problems of the RNN, but like the RNN, the memory of the LSTM is unidirectional: when modeling the current moment, only historical information can be used, and future information cannot be introduced.

3.1 Network structure of DBLSTM
In order to model information in both directions simultaneously, make up for the unidirectional modeling defect of the LSTM-HMM, and further improve the recognition rate, we build a Deep Bidirectional Long Short-Term Memory (DBLSTM) model based on the LSTM, as shown in Figure 2. The input sequence x is fed into the forward LSTM layer and the backward LSTM layer at the same time. The forward hidden vector is computed iteratively from front to back through the forward layer, and the backward hidden vector is computed iteratively from back to front through the backward layer. The output sequence y = {y1, y2, ..., yM} is computed from the two LSTM layers, where M is the number of output frames. The iterative formulas of the DBLSTM network are as follows:

$$\overrightarrow{h}_t = \mathcal{H}\left(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right) \tag{1}$$
$$\overleftarrow{h}_t = \mathcal{H}\left(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right) \tag{2}$$
$$y_t = W_{\overrightarrow{h}y}\, \overrightarrow{h}_t + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y \tag{3}$$

where $\mathcal{H}$ is the LSTM cell activation function, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the forward and backward hidden vectors at time t, and the W and b terms are the corresponding weight matrices and bias vectors.

3.2 Improved DBLSTM acoustic model
As a deep neural network gets deeper, gradient vanishing occurs in the DBLSTM in both the time domain and the spatial domain [9]. We use the gated linear recurrent connections of the DBLSTM to handle the gradient vanishing problem in the time domain.
By introducing a maxout neural network into the DBLSTM to increase its depth, the gradient vanishing problem in the spatial domain of the DBLSTM can be solved, since maxout neurons produce a constant gradient. To address model overfitting during DBLSTM training, we introduce the dropout regularization algorithm. The improved DBLSTM deep hybrid acoustic model proposed in this paper is shown in Figure 3. The bottom BLSTM layers in the figure model the long-term correlations of the input speech signal. The middle select connection layer transforms the output of the BLSTM network according to equation (3) and passes it to the fully connected layers. The maxout neurons in the fully connected layers are trained with dropout regularization and finally feed the softmax output layer, which produces the results.

3.2.1 Maxout neural network.
The structure of the maxout neural network in the fully connected layers is shown in Figure 4. The activation unit group of a maxout neuron contains multiple optional activation units. The maximum value in the activation unit group is selected as the output of the maxout neuron according to equation (4):

$$y_i^l = \max_{j \in \{1, \ldots, k\}} z_{ij}^l \tag{4}$$

where $y_i^l$ represents the output of the i-th maxout neuron in the l-th layer, and k is the number of activation units in the activation unit group. $z_{ij}^l$ represents the j-th activation unit of the i-th maxout neuron in the l-th layer, which is obtained from the preceding forward LSTM propagation layer:

$$z^l = W^l h^{l-1} + b^l \tag{5}$$

where $W^l$ is the weight matrix from the neurons of the previous layer to the activation units $z^l$, and $b^l$ is the bias vector. Maxout alleviates gradient vanishing by producing a constant gradient during training. The gradient of a maxout neuron is

$$\frac{\partial y_i^l}{\partial z_{ij}^l} = \begin{cases} 1, & z_{ij}^l = \max_{j'} z_{ij'}^l \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

that is, the gradient is 1 for the activation unit with the largest value and 0 for all the others.
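As a concrete illustration of the maxout operation and its piecewise-constant gradient, here is a minimal NumPy sketch (the array shapes and values are illustrative and not taken from the paper's model):

```python
import numpy as np

def maxout(z):
    """Maxout activation: z has shape (num_neurons, k), where k is the
    number of optional activation units per neuron (cf. equation (4))."""
    return z.max(axis=1)

def maxout_grad(z):
    """Gradient of maxout w.r.t. each activation unit: 1 for the unit
    that attained the maximum, 0 for all the others."""
    grad = np.zeros_like(z)
    grad[np.arange(z.shape[0]), z.argmax(axis=1)] = 1.0
    return grad

# Example: 2 maxout neurons, each with k = 3 candidate activation units
z = np.array([[0.2, -1.0, 0.7],
              [1.5,  0.3, 0.9]])
y = maxout(z)       # -> [0.7, 1.5]
g = maxout_grad(z)  # exactly one unit per neuron carries a gradient of 1
```

Because the winning unit always passes a gradient of exactly 1, the backward signal through a stack of such layers neither shrinks nor grows, which is the property the spatial-domain argument above relies on.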
3.2.2 Dropout algorithm.
To avoid overfitting of the DBLSTM network when training samples are scarce, we introduce the dropout regularization algorithm. In each iteration of neural network training, the weight updates of some hidden layer nodes are suppressed: the weights of those nodes are not updated in the current iteration, but the nodes may be activated and updated again in the next iteration. The dropout algorithm applies different transformations in the training and testing stages [10].
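A minimal sketch of this train/test asymmetry, using the classic dropout formulation in which units are randomly zeroed during training and activations are rescaled at test time (the drop probability and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop, training):
    """Classic dropout: during training each hidden activation is kept with
    probability (1 - p_drop) and zeroed otherwise; at test time nothing is
    dropped, but activations are scaled by (1 - p_drop) so the expected
    input to the next layer matches the training stage."""
    if training:
        mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
        return h * mask
    return h * (1.0 - p_drop)

h = np.ones(8)
train_out = dropout(h, p_drop=0.5, training=True)   # some entries zeroed
test_out = dropout(h, p_drop=0.5, training=False)   # every entry scaled to 0.5
```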
4. N-GRAM+RNN INTERPOLATION LANGUAGE MODEL
The intelligent evaluation of the text-independent Topic Talk item depends greatly on the recognition rate of the front-end recognizer. If the decoding result of the front-end recognizer is wrong, the posterior probability calculated from it can hardly provide useful information for evaluating pronunciation quality. A good language model can effectively improve the decoding efficiency of the recognizer and thereby the speech recognition rate. Existing LVCSR recognizers generally use the statistical n-gram language model [12], which can only memorize the history of the previous two to three words; more distant history has no effect on the score of the current word. Obviously, this lack of historical information reduces the reliability of n-gram language model scores. A language model based on the RNN [13] can exploit longer sentence history during training, but its decoding efficiency is low. To improve the front-end speech recognition rate, we combine the advantages of the two language models by interpolating the n-gram language model and the RNN language model [14, 15]. First, we use the n-gram language model to obtain the one-pass decoding result of the decoder; then we use the RNN language model to re-estimate the scores of the N-best candidates [16] from the first pass; finally, we take the sentence with the highest re-estimated score as the new recognition result. The process is shown in Figure 5. For each of the N-best candidates produced by one-pass decoding with the n-gram language model, we keep its acoustic model score (AC Score) unchanged and apply the RNN to re-estimate its language model score (LM Score).
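The two-pass rescoring just described can be sketched as follows; the dictionary fields, score values, and default weights are illustrative placeholders, not the system's actual configuration:

```python
def rescore_nbest(nbest, lm_scale=10.0, lam=0.5, word_penalty=0.0):
    """Re-rank an N-best list: keep each hypothesis's acoustic score fixed
    and replace its language model score with an interpolation of the
    n-gram and RNN log-probability scores."""
    def total(h):
        lm = lam * h["ngram_score"] + (1.0 - lam) * h["rnn_score"]
        return h["ac_score"] + lm_scale * lm + word_penalty * h["num_words"]
    return max(nbest, key=total)

# Toy 3-best list (log-domain scores, purely illustrative)
nbest = [
    {"text": "hyp A", "ac_score": -120.0, "ngram_score": -8.0,
     "rnn_score": -9.5, "num_words": 4},
    {"text": "hyp B", "ac_score": -121.0, "ngram_score": -7.5,
     "rnn_score": -6.0, "num_words": 4},
    {"text": "hyp C", "ac_score": -119.0, "ngram_score": -9.0,
     "rnn_score": -9.0, "num_words": 4},
]
best = rescore_nbest(nbest)  # "hyp B" wins on the combined score
```

Note that a hypothesis with a slightly worse acoustic score ("hyp B") can overtake the first-pass winner once the RNN score rewards its longer-range linguistic plausibility.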
Based on the original acoustic model score and the re-estimated language model score, we obtain a new score for each candidate sentence and select the candidate with the highest score as the new speech recognition result. The interpolated language model score of the i-th candidate sentence re-estimated with the RNN language model is

$$LmScore_i = \lambda\, LmScore_i^{ngram} + (1 - \lambda)\, LmScore_i^{RNN} \tag{7}$$

and the new total score of the i-th candidate sentence is

$$Score_i = AcScore_i + LmScale \cdot LmScore_i + C \cdot W_i \tag{8}$$

where $AcScore_i$ is the acoustic model score of the i-th candidate sentence, which remains unchanged during the n-gram+RNN re-estimation of the language model score, $W_i$ is the number of words in the i-th candidate sentence, C is the word penalty, λ is the interpolation coefficient, $LmScore_i^{ngram}$ is the n-gram language model score, $LmScore_i^{RNN}$ is the RNN language model score, and LmScale is the scaling factor of the language model score during decoding. In the subsequent evaluation tasks, the reference text updated by the RNN not only reduces the recognition errors caused by the language model, but also makes the calculated posterior probability more conducive to evaluating pronunciation quality.

5. POSTERIOR PROBABILITY ESTIMATION
After the one-pass decoding result of the candidate's Topic Talk item is obtained from the front-end recognizer, the posterior probability of a decoded phoneme t is calculated as

$$p(t \mid O) = \frac{p(O \mid t)\, p(t)}{\sum_{q \in Q_t} p(O \mid q)\, p(q)} \tag{9}$$

where O = [o1, o2, ..., oN] is the acoustic observation sequence corresponding to the decoded phoneme t. We assume that the prior probabilities p(q) of all phonemes are equal. To make the calculation of the posterior probability more targeted, the denominator space Q_t of the posterior probability consists of the error-prone phonemes of the decoded phoneme t [2].
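Under the equal-prior assumption just stated, the priors cancel and the denominator-restricted log posterior can be sketched as follows, assuming the acoustic log-likelihoods ln p(O|q) are available from the acoustic model; the phones and score values below are hypothetical:

```python
import math

def log_posterior(loglik, phone, confusion_set):
    """Denominator-restricted log posterior with equal phone priors:
    ln p(t|O) = ln p(O|t) - ln sum_{q in Q_t} p(O|q),
    where Q_t is the error-prone (confusion) set of the decoded phone t
    and `loglik` maps each phone q to ln p(O|q)."""
    denom = [loglik[q] for q in confusion_set]
    m = max(denom)
    # numerically stable log-sum-exp over the confusion set
    log_sum = m + math.log(sum(math.exp(v - m) for v in denom))
    return loglik[phone] - log_sum

# Toy example: decoded phone "zh" with confusion set {zh, z, j}
loglik = {"zh": -40.0, "z": -43.0, "j": -47.0}
lp = log_posterior(loglik, "zh", ["zh", "z", "j"])  # near 0 when "zh" dominates
```

The closer the log posterior is to 0, the less probability mass the confusable phones claim, i.e. the more standard the pronunciation of that phone is judged to be.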
For the decoded phoneme t, if the optimal path obtained by Viterbi decoding is Θ = {s1, s2, ..., sN}, then ln p(O | t) can be approximated as

$$\ln p(O \mid t) \approx \sum_{i=1}^{N} \ln p(o_i \mid s_i) \tag{10}$$

where it is assumed that as long as the transition probability a_ij of the HMM is greater than zero, the jump from state i to state j can be made; that is, the likelihood score calculation can ignore a_ij. For the DBLSTM-HMM acoustic model, p(o_i | s_i) can be calculated as

$$p(o_i \mid s_i) = \frac{p(s_i \mid o_i)\, p(o_i)}{p(s_i)} \tag{11}$$

where p(s_i) is the prior probability of each HMM state estimated from the training set, p(o_i) can be regarded as a constant during decoding, and p(s_i | o_i) is the output of the neural network for state s_i after the softmax activation. The likelihood score of the DBLSTM-HMM acoustic model is therefore

$$\ln p(O \mid t) \approx \sum_{i=1}^{N} \ln \frac{p(s_i \mid o_i)}{p(s_i)} \tag{12}$$

First, we estimate the posterior probability of each phoneme in the decoded sequence according to equation (9); then we average the posterior probabilities of all phonemes in a sentence; finally, we average over all sentences in a recording, which gives the final estimate of the posterior probability of the speech.

6. EXPERIMENTAL VERIFICATIONS
6.1 Experimental dataset
6.2 Performance of acoustic model
Considering that Mandarin is a tonal language, we add 4-dimensional pitch features to the 39-dimensional MFCC input features of the acoustic model [17]. The number of input layer units of the neural network is 43 × 11 = 473 (the feature vector of the current frame is 43-dimensional, and 5 frames are spliced on each side of the current frame, for 473 dimensions in total). The DNN contains five hidden layers of 1024 nodes each, activated by the sigmoid function, with a softmax transformation at the output; during training, Stochastic Gradient Descent (SGD) is used to optimize the network parameters. The RNN has 150 nodes in each recurrent layer, and the BPTT algorithm is used to optimize the network parameters [18]. The DBLSTM network contains 6 hidden layers (2 BLSTM hidden layers, 1 select connection layer, and 3 fully connected hidden layers); each BLSTM hidden layer contains 256 forward and 256 backward LSTM memory cells, the select connection layer has 256 nodes, and each fully connected hidden layer contains 1024 nodes. The CSC-BPTT algorithm is used to optimize the network parameters. To verify the performance of the proposed acoustic model, we compare the recognition results of different acoustic models, as shown in Table 1.

Table 1. Speech recognition performance of different acoustic models.
From Table 1, it can be seen that the recognition performance of the proposed acoustic model is better than that of the models from the literature. The word error rate of the proposed acoustic model is 14.08%, which is 4.75%, 2.44%, and 0.95% lower than those of Dahl et al. (2012), Sak et al. (2015), and Sak et al. (2014), respectively. This shows that making good use of the contextual information in the LSTM network structure can improve the speech recognition rate.

6.3 Performance of language model
In the experiment, we use the SRILM toolkit to train the 3-gram model; the training text is the 480,000 sentences in the prepared language model dataset. After obtaining the 50 N-best candidates retained by one-pass decoding, the n-gram+RNN interpolation language model is used to re-estimate the score of each candidate. In the RNN, the dimension of the input word vectors is set equal to the dictionary size, the number of hidden layer nodes to 500, and the number of output classes to 100. The BPTT algorithm is used for training, the interpolation coefficient is set to 0.5, and the acoustic model score remains unchanged. To verify the performance of the proposed language model, we compare the recognition results of different language models, as shown in Table 2.

Table 2. Word recognition performance of different language models.
From Table 2, we can see that the word recognition rate of the n-gram+RNN interpolation language model is better than that of the n-gram language model. The interpolation model only retrains an RNN on top of the n-gram language model, yet the word recognition performance improves significantly, which shows that most of the corrected recognition errors stem from the lack of historical information in the n-gram language model. To verify the correlation between the posterior probability and the manual score, we apply the n-gram+RNN interpolation language model to re-estimate the N-best candidates of each sentence of the 4000 recordings in the pronunciation evaluation dataset, and compute the posterior probability of the sentence with the highest re-estimated score. The experimental results are shown in Table 3.

Table 3. Correlation between the posterior probability and the manual score for different language models.
From Table 3, we can see that the n-gram+RNN interpolation improves the correlation between the posterior probability and the manual score, but the improvement in correlation is smaller than the improvement in recognition performance. This is because the RNN language model retains longer historical information, so sentences with nonstandard pronunciation but strong internal logic can become the top candidates after the RNN re-estimates the scores, which benefits the recognition rate more than the correlation.

7. CONCLUSIONS
In order to realize the intelligent evaluation of the Topic Talk item in the PSC, we studied the text-independent oral evaluation task from the perspective of recognition and constructed an intelligent evaluation model for the Topic Talk item. Based on recorded speech data of candidates collected from PSC centers across the country, we analyzed the speech recognition performance and word recognition performance of the proposed intelligent evaluation model from the perspectives of the acoustic model and the language model. Experiments show that the proposed intelligent evaluation model can better recognize the text content of the candidate's expression, and the posterior probability calculated from the recognized text has a high correlation with the manual score. During the experiments, we found that the contextual information of the recognition results helps improve recognition performance, which also points out a direction for follow-up research.

ACKNOWLEDGMENTS
This work is supported by the Special Project of Applied Linguistics of China's Hunan Province (Grant No. XYJ2021GB09) and the Scientific Research Project of Hunan Open University (Grant No. XDK2020-C-26).

REFERENCES
Ge, F. P., Lu, L. and Yan, Y. H.,
“Experimental investigation of mandarin pronunciation quality assessment system,” in International Symposium on Computer Science and Society (ISCCS), 235–239 (2011).
Chen, C. H., “Improvement in automatic Putonghua pronunciation quality assessment algorithm,” Journal of Guizhou Normal University (Natural Sciences), 31(06), 95–99 (2013).
Chen, C. H., “Research of speech recognition network in Putonghua level test system,” Journal of Xihua University, 33(02), 17–21 (2014).
Mandarin Training and Testing Center of State Language and Writing Commission, [Implementation Outline of Mandarin Proficiency Test], Commercial Press, 461–462 (2017).
Rabiner, L. and Juang, B. H., [Fundamentals of Speech Recognition], Prentice-Hall, Inc., 353–356 (2009).
Dahl, G. E. and Dong, Y., “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech & Language Processing, 20(1), 30–42 (2012). https://doi.org/10.1109/TASL.2011.2134090
Sak, H., Senior, A., Rao, K., et al., “Fast and accurate recurrent neural network acoustic models for speech recognition,” Computer Science, 2(1), 10–15 (2015).
Sak, H., Senior, A. and Beaufays, F., “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” Computer Science, 6(1), 338–342 (2014).
Zaremba, W., Sutskever, I. and Vinyals, O., “Recurrent neural network regularization,” Computer Science, 4(1), 1–8 (2014).
Li, J., Wang, X. R. and Xu, B., “Understanding the dropout strategy and analyzing its effectiveness on LVCSR,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 7614–7618 (2013).
Glorot, X. and Bengio, Y., “Understanding the difficulty of training deep feedforward neural networks,” Journal of Machine Learning Research, 9(1), 249–256 (2010).
Manning, C. D., [Foundations of Statistical Natural Language Processing], MIT Press (1999).
Mikolov, T., [Statistical Language Models Based on Neural Networks], Doctoral Thesis, Brno University of Technology, 120–122 (2012).
Mikolov, T., Kombrink, S., Burget, L., et al., “Extensions of recurrent neural network language model,” in Proc. of 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5528–5531 (2011).
Mikolov, T., Deoras, A., Kombrink, S., et al., “Empirical evaluation and combination of advanced language modeling techniques,” in Proceedings of Interspeech, 605–608 (2011).
Young, S., Evermann, G., Gales, M., et al., [The HTK Book (for HTK version 3.4)], Cambridge University Press, 2–3 (2006).
Boersma, P., “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in Proc. of the Institute of Phonetic Sciences, 97–110 (1993).
Stolcke, A., “SRILM: an extensible language modeling toolkit,” in Proc. of Interspeech, 901–904 (2002).