Research Article
Multitask Learning with Local Attention for Tibetan Speech Recognition
Table 6
Dialect ID recognition accuracy (%) of three-task models.
| Architecture | Model | Lhasa-Ü-Tsang | Changdu-Kham | Amdo Pastoral |
| --- | --- | --- | --- | --- |
| DialectID model | | 97.88 | 92.24 | 97.9 |
| WaveNet-CTC with dialect ID and speaker ID | D-S-S1 | 98.01 | 98.8 | 99.41 |
| | D-S-S2 | 99.73 | 96.42 | 99.61 |
| | S-D-S | 99.25 | 95.23 | 99.03 |
| Attention (5)-WaveNet-CTC | S-D-S | 100 | 76.19 | 91.27 |
| | D-S-S1 | 100 | 90.47 | 94.18 |
| | D-S-S2 | 100 | 82.14 | 93.02 |
| WaveNet-Attention (5)-CTC | S-D-S | 100 | 89.28 | 93.79 |
| | D-S-S1 | 100 | 85.71 | 93.79 |
| | D-S-S2 | 100 | 95.23 | 94.18 |
| WaveNet-Attention (7)-CTC | S-D-S | 0 | 85.71 | 91.66 |
| | D-S-S1 | 0 | 89.98 | 93.88 |
| | D-S-S2 | 0 | 89.28 | 95.34 |
| WaveNet-Attention (10)-CTC | S-D-S | 0 | 85.71 | 95.54 |
| | D-S-S1 | 0 | 94.04 | 93.99 |
| | D-S-S2 | 0 | 0 | 0 |