Multitask Learning with Local Attention for Tibetan Speech Recognition
Table 2. Syllable error rate (%) of two-task models on speech content recognition.

| Architecture | Model | SER¹ (Lhasa-Ü-Tsang) | RSER² | SER (Changdu-Kham) | RSER | SER (Amdo Pastoral) | RSER | ARSER³ |
|---|---|---|---|---|---|---|---|---|
| Dialect-specific model | — | 28.83 | — | 62.56 | — | 17.60 | — | — |
| WaveNet-CTC | — | 29.55 | −0.72 | 62.83 | −0.27 | 33.52 | −15.92 | −5.63 |
| WaveNet-CTC with dialect ID or speaker ID (baseline model) | D-S⁴ | 32.84 | −4.01 | 68.58 | −6.02 | 33.00 | −15.40 | −8.48 |
| | S-D⁵ | 26.80 | 2.03 | 64.03 | −1.47 | 30.79 | −13.09 | −4.21 |
| | S-S1⁶ | 27.21 | 1.62 | 64.17 | −1.61 | 29.68 | −12.08 | −4.02 |
| | S-S2⁷ | 28.13 | 0.70 | 62.43 | 0.13 | 28.04 | −10.44 | −3.20 |
| Attention (5)-WaveNet-CTC | D-S | 52.19 | −23.36 | 65.24 | −2.68 | 50.22 | −32.62 | −19.55 |
| | S-D | 55.16 | −26.33 | 67.78 | −5.22 | 55.23 | −37.63 | −23.06 |
| | S-S1 | 77.42 | −48.59 | 85.44 | −22.88 | 82.08 | −64.48 | −45.32 |
| | S-S2 | 83.32 | −54.49 | 89.15 | −26.94 | 81.47 | −63.87 | −48.43 |
| WaveNet-Attention (5)-CTC | D-S | 21.44 | 7.39 | 60.16 | 2.40 | 20.46 | −2.86 | 2.31 |
| | S-D | 23.79 | 5.04 | 62.96 | −0.40 | 24.15 | −6.55 | −0.64 |
| | S-S1 | 34.86 | −6.03 | 63.36 | −0.80 | 40.10 | −22.50 | −9.78 |
| | S-S2 | 34.83 | −6.00 | 62.70 | −0.14 | 37.63 | −20.03 | −8.72 |
¹SER: syllable error rate. ²RSER: relative syllable error rate. ³ARSER: average relative syllable error rate. ⁴D-S: model trained on transcriptions with the dialect ID at the beginning of the target label sequence, e.g., "A ཐུགས རྗེ ཆེ". ⁵S-D: model trained on transcriptions with the dialect ID at the end of the target label sequence. ⁶S-S1: model trained on transcriptions with the speaker ID at the beginning of the target label sequence. ⁷S-S2: model trained on transcriptions with the speaker ID at the end of the target label sequence.
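As a reading aid, the derived columns in Table 2 can be reproduced from the SER values. A minimal sketch, assuming (consistently with every row of the table) that a model's RSER for a dialect is the dialect-specific model's SER minus that model's SER, and that ARSER is the mean RSER over the three dialects:

```python
# Assumed definitions, inferred from the table's numbers:
#   RSER  = SER of the dialect-specific model - SER of the evaluated model
#   ARSER = mean of RSER over the three dialects
# Positive RSER means the multitask model beats the dialect-specific baseline.

DIALECT_SPECIFIC_SER = {
    "Lhasa-Ü-Tsang": 28.83,
    "Changdu-Kham": 62.56,
    "Amdo Pastoral": 17.60,
}

def rser(dialect: str, model_ser: float) -> float:
    """Relative syllable error rate for one dialect."""
    return DIALECT_SPECIFIC_SER[dialect] - model_ser

def arser(model_sers: dict) -> float:
    """Average relative syllable error rate over all dialects."""
    return sum(rser(d, s) for d, s in model_sers.items()) / len(model_sers)

# SER values of the WaveNet-Attention (5)-CTC D-S row from Table 2:
ds_row = {"Lhasa-Ü-Tsang": 21.44, "Changdu-Kham": 60.16, "Amdo Pastoral": 20.46}
print(round(rser("Lhasa-Ü-Tsang", ds_row["Lhasa-Ü-Tsang"]), 2))  # 7.39
print(round(arser(ds_row), 2))  # 2.31
```

The recomputed values match the table's RSER (7.39) and ARSER (2.31) for that row, which is what motivates the assumed definitions above.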