Research Article

Multitask Learning with Local Attention for Tibetan Speech Recognition

Table 2

Syllable error rate (%) of two-task models on speech content recognition.

| Architecture | Model | Lhasa-Ü-Tsang SER¹ | RSER² | Changdu-Kham SER | RSER | Amdo Pastoral SER | RSER | ARSER³ |
|---|---|---|---|---|---|---|---|---|
| Dialect-specific model | | 28.83 | | 62.56 | | 17.60 | | |
| WaveNet-CTC | | 29.55 | −0.72 | 62.83 | −0.27 | 33.52 | −15.92 | −5.63 |
| WaveNet-CTC with dialect ID or speaker ID (baseline model) | D-S⁴ | 32.84 | −4.01 | 68.58 | −6.02 | 33.00 | −15.40 | −8.48 |
| | S-D⁵ | 26.80 | 2.03 | 64.03 | −1.47 | 30.79 | −13.09 | −4.21 |
| | S-S1⁶ | 27.21 | 1.62 | 64.17 | −1.61 | 29.68 | −12.08 | −4.02 |
| | S-S2⁷ | 28.13 | 0.70 | 62.43 | 0.13 | 28.04 | −10.44 | −3.20 |
| Attention (5)-WaveNet-CTC | D-S | 52.19 | −23.36 | 65.24 | −2.68 | 50.22 | −32.62 | −19.55 |
| | S-D | 55.16 | −26.33 | 67.78 | −5.22 | 55.23 | −37.63 | −23.06 |
| | S-S1 | 77.42 | −48.59 | 85.44 | −22.88 | 82.08 | −64.48 | −45.32 |
| | S-S2 | 83.32 | −54.49 | 89.15 | −26.94 | 81.47 | −63.87 | −48.43 |
| WaveNet-Attention (5)-CTC | D-S | 21.44 | 7.39 | 60.16 | 2.40 | 20.46 | −2.86 | 2.31 |
| | S-D | 23.79 | 5.04 | 62.96 | −0.40 | 24.15 | −6.55 | −0.64 |
| | S-S1 | 34.86 | −6.03 | 63.36 | −0.80 | 40.10 | −22.50 | −9.78 |
| | S-S2 | 34.83 | −6.00 | 62.70 | −0.14 | 37.63 | −20.03 | −8.72 |

¹SER: syllable error rate.
²RSER: relative syllable error rate, measured with respect to the dialect-specific model (positive values indicate improvement).
³ARSER: average relative syllable error rate over the three dialects.
⁴D-S: model trained on transcriptions with the dialect ID at the beginning of the target label sequence, e.g., “A ཐུགས རྗེ ཆེ.”
⁵S-D: model trained on transcriptions with the dialect ID at the end of the target label sequence.
⁶S-S1: model trained on transcriptions with the speaker ID at the beginning of the target label sequence.
⁷S-S2: model trained on transcriptions with the speaker ID at the end of the target label sequence.
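The RSER and ARSER columns can be reproduced from the SER values alone. The sketch below assumes RSER is the dialect-specific model's SER minus the given model's SER (in percentage points), and ARSER is the mean RSER over the three dialects; the dictionary names and helper functions are illustrative, not from the paper.

```python
# Dialect-specific model SERs from Table 2 (the reference for all RSER values).
DIALECT_SPECIFIC_SER = {
    "Lhasa-U-Tsang": 28.83,
    "Changdu-Kham": 62.56,
    "Amdo Pastoral": 17.60,
}

def rser(dialect: str, model_ser: float) -> float:
    """Relative SER: dialect-specific SER minus the model's SER (assumed convention)."""
    return round(DIALECT_SPECIFIC_SER[dialect] - model_ser, 2)

def arser(model_sers: dict[str, float]) -> float:
    """Average relative SER over all dialects in the row (assumed convention)."""
    deltas = [DIALECT_SPECIFIC_SER[d] - s for d, s in model_sers.items()]
    return round(sum(deltas) / len(deltas), 2)

# WaveNet-Attention (5)-CTC, D-S row from Table 2:
ds_row = {"Lhasa-U-Tsang": 21.44, "Changdu-Kham": 60.16, "Amdo Pastoral": 20.46}
print(rser("Lhasa-U-Tsang", 21.44))  # 7.39
print(arser(ds_row))                 # 2.31
```

Checking the Amdo Pastoral entry of that row the same way gives 17.60 − 20.46 = −2.86, which is consistent with the reported ARSER of 2.31 = (7.39 + 2.40 − 2.86)/3.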