Multitask Learning with Local Attention for Tibetan Speech Recognition
Table 2. Syllable error rate (%) of two-task models on speech content recognition.

| Architecture | Model | SER¹ (Lhasa-Ü-Tsang) | RSER² | SER (Changdu-Kham) | RSER | SER (Amdo Pastoral) | RSER | ARSER³ |
|---|---|---|---|---|---|---|---|---|
| Dialect-specific model | — | 28.83 | — | 62.56 | — | 17.60 | — | — |
| WaveNet-CTC | — | 29.55 | −0.72 | 62.83 | −0.27 | 33.52 | −15.92 | −5.63 |
| WaveNet-CTC with dialect ID or speaker ID (baseline model) | D-S⁴ | 32.84 | −4.01 | 68.58 | −6.02 | 33.00 | −15.40 | −8.48 |
| | S-D⁵ | 26.80 | 2.03 | 64.03 | −1.47 | 30.79 | −13.09 | −4.21 |
| | S-S1⁶ | 27.21 | 1.62 | 64.17 | −1.61 | 29.68 | −12.08 | −4.02 |
| | S-S2⁷ | 28.13 | 0.70 | 62.43 | 0.13 | 28.04 | −10.44 | −3.20 |
| Attention (5)-WaveNet-CTC | D-S | 52.19 | −23.36 | 65.24 | −2.68 | 50.22 | −32.62 | −19.55 |
| | S-D | 55.16 | −26.33 | 67.78 | −5.22 | 55.23 | −37.63 | −23.06 |
| | S-S1 | 77.42 | −48.59 | 85.44 | −22.88 | 82.08 | −64.48 | −45.32 |
| | S-S2 | 83.32 | −54.49 | 89.15 | −26.94 | 81.47 | −63.87 | −48.43 |
| WaveNet-Attention (5)-CTC | D-S | 21.44 | 7.39 | 60.16 | 2.40 | 20.46 | −2.86 | 2.31 |
| | S-D | 23.79 | 5.04 | 62.96 | −0.40 | 24.15 | −6.55 | −0.64 |
| | S-S1 | 34.86 | −6.03 | 63.36 | −0.80 | 40.10 | −22.50 | −9.78 |
| | S-S2 | 34.83 | −6.00 | 62.70 | −0.14 | 37.63 | −20.03 | −8.72 |
¹SER: syllable error rate. ²RSER: relative syllable error rate. ³ARSER: average relative syllable error rate. ⁴D-S: model trained on transcriptions with the dialect ID at the beginning of the target label sequence, e.g., "A ཐུགས རྗེ ཆེ". ⁵S-D: model trained on transcriptions with the dialect ID at the end of the target label sequence. ⁶S-S1: model trained on transcriptions with the speaker ID at the beginning of the target label sequence. ⁷S-S2: model trained on transcriptions with the speaker ID at the end of the target label sequence.
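As a reading aid, the derived columns in Table 2 can be reproduced from the SER values. A minimal sketch, assuming (consistently with every row of the table) that a model's RSER for a dialect is the dialect-specific model's SER minus that model's SER, and that ARSER is the mean RSER over the three dialects:

```python
# Assumed definitions, inferred from the table's numbers:
#   RSER  = SER of the dialect-specific model - SER of the evaluated model
#   ARSER = mean of RSER over the three dialects
# Positive RSER means the multitask model beats the dialect-specific baseline.

DIALECT_SPECIFIC_SER = {
    "Lhasa-Ü-Tsang": 28.83,
    "Changdu-Kham": 62.56,
    "Amdo Pastoral": 17.60,
}

def rser(dialect: str, model_ser: float) -> float:
    """Relative syllable error rate for one dialect."""
    return DIALECT_SPECIFIC_SER[dialect] - model_ser

def arser(model_sers: dict) -> float:
    """Average relative syllable error rate over all dialects."""
    return sum(rser(d, s) for d, s in model_sers.items()) / len(model_sers)

# SER values of the WaveNet-Attention (5)-CTC D-S row from Table 2:
ds_row = {"Lhasa-Ü-Tsang": 21.44, "Changdu-Kham": 60.16, "Amdo Pastoral": 20.46}
print(round(rser("Lhasa-Ü-Tsang", ds_row["Lhasa-Ü-Tsang"]), 2))  # 7.39
print(round(arser(ds_row), 2))  # 2.31
```

The recomputed values match the table's RSER (7.39) and ARSER (2.31) for that row, which is what motivates the assumed definitions above.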