Research Article

Multitask Learning with Local Attention for Tibetan Speech Recognition

Table 5

Syllable error rate (%) of three-task models on speech content recognition.

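| Architecture | Model | Lhasa-Ü-Tsang SER | Lhasa-Ü-Tsang RSER | Changdu-Kham SER | Changdu-Kham RSER | Amdo Pastoral SER | Amdo Pastoral RSER | ARSER |
|---|---|---|---|---|---|---|---|---|
| Dialect-specific model | | 28.83 | | 62.56 | | 17.60 | | |
| WaveNet-CTC with dialect ID and speaker ID (baseline model) | S-D-S | 30.64 | −1.81 | 64.17 | −1.61 | 34.06 | −16.46 | −6.62 |
| WaveNet-CTC with dialect ID and speaker ID (baseline model) | D-S-S1 | 39.64 | −10.81 | 65.10 | −2.54 | 45.15 | −27.55 | −13.63 |
| WaveNet-CTC with dialect ID and speaker ID (baseline model) | D-S-S2 | 33.43 | −4.60 | 64.83 | −2.27 | 37.56 | −19.96 | −8.94 |
| Attention (5)-WaveNet-CTC | S-D-S | 48.69 | −19.86 | 68.31 | −5.75 | 63.22 | −45.62 | −23.74 |
| Attention (5)-WaveNet-CTC | D-S-S1 | 52.57 | −23.74 | 69.38 | −6.82 | 71.42 | −53.82 | −28.13 |
| Attention (5)-WaveNet-CTC | D-S-S2 | 49.10 | −20.27 | 79.41 | −16.85 | 61.09 | −43.49 | −26.87 |
| WaveNet-Attention (5)-CTC | S-D-S | 30.75 | −1.92 | 69.51 | −6.95 | 34.21 | −16.61 | −8.49 |
| WaveNet-Attention (5)-CTC | D-S-S1 | 33.17 | −4.34 | 69.51 | −6.95 | 38.49 | −20.89 | −10.73 |
| WaveNet-Attention (5)-CTC | D-S-S2 | 31.16 | −2.33 | 69.25 | −6.69 | 34.14 | −16.54 | −8.52 |
| WaveNet-Attention (7)-CTC | S-D-S | 30.39 | −1.56 | 70.05 | −7.49 | 32.7 | −15.1 | −8.05 |
| WaveNet-Attention (7)-CTC | D-S-S1 | 35.28 | −6.45 | 68.12 | −5.56 | 38.03 | −20.73 | −10.81 |
| WaveNet-Attention (7)-CTC | D-S-S2 | 32.58 | −3.75 | 62.74 | −0.18 | 37.16 | −19.56 | −7.83 |
| WaveNet-Attention (10)-CTC | S-D-S | 30.25 | −1.42 | 69.25 | −6.69 | 32.01 | −14.41 | −7.51 |
| WaveNet-Attention (10)-CTC | D-S-S1 | 34.06 | −5.23 | 70.05 | −7.49 | 40.10 | −22.50 | −11.74 |
| WaveNet-Attention (10)-CTC | D-S-S2 | 31.85 | −3.02 | 57.45 | 5.11 | 33.65 | −16.05 | −4.65 |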
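The RSER and ARSER columns of Table 5 are not defined on this page. The printed values are consistent with RSER being the dialect-specific SER minus the model's SER for each dialect, and ARSER being the mean of the three RSER values. The sketch below checks that reading for one row (WaveNet-Attention (10)-CTC, D-S-S2). The definitions are an assumption inferred from the numbers, and the function names `relative_ser` and `average_relative_ser` are hypothetical, not taken from the paper.

```python
# SER (%) of the dialect-specific models, copied from the first row of Table 5.
DIALECT_SPECIFIC_SER = {
    "Lhasa-Ü-Tsang": 28.83,
    "Changdu-Kham": 62.56,
    "Amdo Pastoral": 17.60,
}


def relative_ser(model_ser):
    """Assumed RSER: dialect-specific SER minus model SER, per dialect.

    A positive value means the multitask model beats the dialect-specific one.
    """
    return {
        dialect: round(reference - model_ser[dialect], 2)
        for dialect, reference in DIALECT_SPECIFIC_SER.items()
    }


def average_relative_ser(model_ser):
    """Assumed ARSER: mean of the per-dialect RSER values."""
    rser = relative_ser(model_ser)
    return round(sum(rser.values()) / len(rser), 2)


# SER (%) of WaveNet-Attention (10)-CTC, D-S-S2, copied from Table 5.
wavenet_attention_10_dss2 = {
    "Lhasa-Ü-Tsang": 31.85,
    "Changdu-Kham": 57.45,
    "Amdo Pastoral": 33.65,
}

rser = relative_ser(wavenet_attention_10_dss2)
arser = average_relative_ser(wavenet_attention_10_dss2)
print(rser)
print(arser)
```

Under this reading the row reproduces the printed RSER values (−3.02, 5.11, −16.05) and ARSER (−4.65). The positive Changdu-Kham entry is the only case in the table where a three-task model has a lower SER than the dialect-specific model.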