Research Article
Multitask Learning with Local Attention for Tibetan Speech Recognition
Table 7
Speaker ID recognition accuracy (%) of three-task models.
| Architecture | Model | Lhasa-Ü-Tsang | Changdu-Kham | Amdo pastoral |
| --- | --- | --- | --- | --- |
| SpeakerID model | - | 67.75 | 93.13 | 95.31 |
| WaveNet-CTC with dialect ID and speaker ID | S-D-S | 72.91 | 98.8 | 96.12 |
| WaveNet-CTC with dialect ID and speaker ID | D-S-S1 | 70.21 | 95.23 | 93.6 |
| WaveNet-CTC with dialect ID and speaker ID | D-S-S2 | 70.35 | 96.42 | 96.89 |
| Attention (5)-WaveNet-CTC | S-D-S | 61.08 | 83.33 | 89.53 |
| Attention (5)-WaveNet-CTC | D-S-S1 | 62.12 | 83.33 | 87.01 |
| Attention (5)-WaveNet-CTC | D-S-S2 | 61.99 | 84.52 | 90.11 |
| WaveNet-Attention (5)-CTC | S-D-S | 61.99 | 85.71 | 92.05 |
| WaveNet-Attention (5)-CTC | D-S-S1 | 62.53 | 82.14 | 91.08 |
| WaveNet-Attention (5)-CTC | D-S-S2 | 61.18 | 89.28 | 92.44 |
| WaveNet-Attention (7)-CTC | S-D-S | 60.91 | 85.71 | 91.66 |
| WaveNet-Attention (7)-CTC | D-S-S1 | 62.04 | 84.31 | 92.01 |
| WaveNet-Attention (7)-CTC | D-S-S2 | 58.49 | 86.90 | 90.69 |
| WaveNet-Attention (10)-CTC | S-D-S | 58.49 | 84.52 | 92.05 |
| WaveNet-Attention (10)-CTC | D-S-S1 | 59.43 | 83.33 | 91.27 |
| WaveNet-Attention (10)-CTC | D-S-S2 | 63.47 | 92.85 | 97.86 |