Research Article
Multitask Learning with Local Attention for Tibetan Speech Recognition
Table 5
Syllable error rate (%) of three-task models on speech content recognition.
| Architecture | Model | Lhasa-Ü-Tsang SER | Lhasa-Ü-Tsang RSER | Changdu-Kham SER | Changdu-Kham RSER | Amdo Pastoral SER | Amdo Pastoral RSER | ASER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dialect-specific model | | 28.83 | | 62.56 | | 17.60 | | |
| WaveNet-CTC with dialect ID and speaker ID (baseline model) | S-D-S | 30.64 | −1.81 | 64.17 | −1.61 | 34.06 | −16.46 | −6.62 |
| | D-S-S1 | 39.64 | −10.81 | 65.10 | −2.54 | 45.15 | −27.55 | −13.63 |
| | D-S-S2 | 33.43 | −4.60 | 64.83 | −2.27 | 37.56 | −19.96 | −8.94 |
| Attention (5)-WaveNet-CTC | S-D-S | 48.69 | −19.86 | 68.31 | −5.75 | 63.22 | −45.62 | −23.74 |
| | D-S-S1 | 52.57 | −23.74 | 69.38 | −6.82 | 71.42 | −53.82 | −28.13 |
| | D-S-S2 | 49.10 | −20.27 | 79.41 | −16.85 | 61.09 | −43.49 | −26.87 |
| WaveNet-Attention (5)-CTC | S-D-S | 30.75 | −1.92 | 69.51 | −6.95 | 34.21 | −16.61 | −8.49 |
| | D-S-S1 | 33.17 | −4.34 | 69.51 | −6.95 | 38.49 | −20.89 | −10.73 |
| | D-S-S2 | 31.16 | −2.33 | 69.25 | −6.69 | 34.14 | −16.54 | −8.52 |
| WaveNet-Attention (7)-CTC | S-D-S | 30.39 | −1.56 | 70.05 | −7.49 | 32.7 | −15.1 | −8.05 |
| | D-S-S1 | 35.28 | −6.45 | 68.12 | −5.56 | 38.03 | −20.73 | −10.81 |
| | D-S-S2 | 32.58 | −3.75 | 62.74 | −0.18 | 37.16 | −19.56 | −7.83 |
| WaveNet-Attention (10)-CTC | S-D-S | 30.25 | −1.42 | 69.25 | −6.69 | 32.01 | −14.41 | −7.51 |
| | D-S-S1 | 34.06 | −5.23 | 70.05 | −7.49 | 40.10 | −22.50 | −11.74 |
| | D-S-S2 | 31.85 | −3.02 | 57.45 | 5.11 | 33.65 | −16.05 | −4.65 |
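The RSER and ASER columns are not defined alongside the table. The reported values appear consistent with RSER being the dialect-specific model's SER minus the multitask model's SER for the same dialect (so a positive value means the multitask model is better), and with ASER being the mean of the three RSERs. The following is a minimal sketch under that assumption, using the WaveNet-Attention (10)-CTC D-S-S2 row; the function and variable names are illustrative and do not come from the paper.

```python
# Dialect-specific SERs (%) taken from the "Dialect-specific model" row of Table 5.
DIALECT_SPECIFIC_SER = {
    "Lhasa-Ü-Tsang": 28.83,
    "Changdu-Kham": 62.56,
    "Amdo Pastoral": 17.60,
}


def relative_ser(model_ser, reference_ser=DIALECT_SPECIFIC_SER):
    """Per-dialect RSER, ASSUMED here to be reference SER minus model SER."""
    return {
        dialect: round(reference_ser[dialect] - model_ser[dialect], 2)
        for dialect in reference_ser
    }


def average_rser(rser):
    """ASER, ASSUMED here to be the unweighted mean of the per-dialect RSERs."""
    return round(sum(rser.values()) / len(rser), 2)


# SERs (%) of WaveNet-Attention (10)-CTC with the D-S-S2 task ordering.
model_ser = {
    "Lhasa-Ü-Tsang": 31.85,
    "Changdu-Kham": 57.45,
    "Amdo Pastoral": 33.65,
}

rser = relative_ser(model_ser)
aser = average_rser(rser)
print(rser)
print(aser)
```

Under this reading, the sketch reproduces the RSERs of −3.02, 5.11, and −16.05 and the ASER of −4.65 reported for that row, with the Changdu-Kham entry being the only positive RSER in the table.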