Complexity

Research Article

Multitask Learning with Local Attention for Tibetan Speech Recognition

Syllable error rate (%) of three-task models on speech content recognition.


Architecture	Model	Lhasa-Ü-Tsang SER	Lhasa-Ü-Tsang RSER	Changdu-Kham SER	Changdu-Kham RSER	Amdo Pastoral SER	Amdo Pastoral RSER	ASER

Dialect-specific model		28.83		62.56		17.60		
WaveNet-CTC with dialect ID and speaker ID (baseline model)	S-D-S	30.64	−1.81	64.17	−1.61	34.06	−16.46	−6.62
	D-S-S1	39.64	−10.81	65.10	−2.54	45.15	−27.55	−13.63
	D-S-S2	33.43	−4.60	64.83	−2.27	37.56	−19.96	−8.94

Attention (5)-WaveNet-CTC	S-D-S	48.69	−19.86	68.31	−5.75	63.22	−45.62	−23.74
	D-S-S1	52.57	−23.74	69.38	−6.82	71.42	−53.82	−28.13
	D-S-S2	49.10	−20.27	79.41	−16.85	61.09	−43.49	−26.87

WaveNet-Attention (5)-CTC	S-D-S	30.75	−1.92	69.51	−6.95	34.21	−16.61	−8.49
	D-S-S1	33.17	−4.34	69.51	−6.95	38.49	−20.89	−10.73
	D-S-S2	31.16	−2.33	69.25	−6.69	34.14	−16.54	−8.52

WaveNet-Attention (7)-CTC	S-D-S	30.39	−1.56	70.05	−7.49	32.70	−15.10	−8.05
	D-S-S1	35.28	−6.45	68.12	−5.56	38.03	−20.43	−10.81
	D-S-S2	32.58	−3.75	62.74	−0.18	37.16	−19.56	−7.83

WaveNet-Attention (10)-CTC	S-D-S	30.25	−1.42	69.25	−6.69	32.01	−14.41	−7.51
	D-S-S1	34.06	−5.23	70.05	−7.49	40.10	−22.50	−11.74
	D-S-S2	31.85	−3.02	57.45	5.11	33.65	−16.05	−4.65
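
The RSER and ASER columns are not defined on this page. The figures in the table are consistent with RSER being the dialect-specific model's SER minus the multitask model's SER for the same dialect (so a negative value means the multitask model is worse), and with ASER being the mean of the three RSER values. The following is a minimal Python sketch of that assumed reading, checked against the WaveNet-Attention (10)-CTC D-S-S2 row; the function and variable names are illustrative and do not come from the paper.

```python
# Assumed reading of the table (not stated in the source):
#   RSER = dialect-specific SER - multitask SER, per dialect
#   ASER = mean of the three RSER values
# All names below are illustrative.
DIALECTS = ("Lhasa-Ü-Tsang", "Changdu-Kham", "Amdo Pastoral")


def relative_ser(reference, model):
    """Per-dialect difference: reference SER minus model SER."""
    return [ref - ser for ref, ser in zip(reference, model)]


def average_relative_ser(reference, model):
    """Mean of the per-dialect differences."""
    diffs = relative_ser(reference, model)
    return sum(diffs) / len(diffs)


# SER (%) values copied from the table.
dialect_specific = [28.83, 62.56, 17.60]
attention10_d_s_s2 = [31.85, 57.45, 33.65]

rser = [round(v, 2) for v in relative_ser(dialect_specific, attention10_d_s_s2)]
aser = round(average_relative_ser(dialect_specific, attention10_d_s_s2), 2)

for name, value in zip(DIALECTS, rser):
    print(f"{name}: RSER = {value:+.2f}")
print(f"ASER = {aser:+.2f}")
```

Under this reading, the single positive entry in the table (5.11 for Changdu-Kham in this row) marks the one case where a three-task model has a lower SER than the dialect-specific model.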