Research Article

Spoken Language Identification Using Deep Learning

Table 1

Review of previous studies and their results.

| Year | Model basis | Features | Languages | Acc. | Remarks | Ref. |
|------|-------------|----------|-----------|------|---------|------|
| 2021 | PLDA, logistic regression | i-vector, x-vector | Javanese, Sundanese, Minang | 96% | PLDA and logistic classifiers are used with x-vector and i-vector feature extraction. | [26] |
| 2021 | CNN, ResNet50, RNN | MFCC | Iba, Kab, Sun, Ind, Eus, Jav, Tam, Tel, Kan, Hin, Tha, Rus, Cnh, Eng, Por, Mar | 53% | Three systems with different specifications, named Lipsia, Anlirika, and NTR, were submitted. | [27] |
| 2021 | Self-attentive pooling decoder | Not defined | En, Fr, Es, De, Ru, It | 92.50% | A self-attentive pooling layer is used for the language identification task. | [28] |
| 2021 | CNN-LSTM | MFCC | Iba, Kab, Sun, Ind, Eus, Jav, Tam, Tel, Kan, Hin, Tha, Cnh, Eng, Por, Mar, Rus | 74% | A CNN-LSTM combination is used to predict the language. | [29] |
| 2020 | TDNN, optimal transport | MFCC | Russian, Kazakh, Mandarin, Korean, Japanese, Cantonese, Vietnamese, Tibetan, Indonesian, Uyghur | Not defined | An unsupervised joint distribution adaptation neural network model is used for spoken language identification. | [17] |
| 2020 | CRNN, ResNet50, DenseNet121 | Log-Mel | Three different datasets with different languages | 89% | Different pretrained models are used with a triplet entropy loss to improve generalization. | [16] |
| 2020 | CNN | Log-Mel | Slovene, Russian, Slovak, Belarusian, Macedonian, Ukrainian, Croatian, Bulgarian, Czech, Serbian, Polish | 97.35% | Two CNN-based neural models, a baseline and a robust model, are built for spoken language identification. | [18] |
| 2020 | CapsNet | Log-Mel | Arabic, Bengali, Chinese Mandarin, English, Hindi, Turkish, Spanish, Japanese, Punjabi, Portuguese | 98.20% | A capsule network with an encoder and decoder works well for spoken language identification. | [20] |
| 2020 | CNN-LSTM | Log-Mel | Gujarati, Tamil, Telugu | 79.02% | A CNN-LSTM system with a CTC loss function at the output layer is used for spoken language identification. | [19] |
| 2020 | Context-aware model | Log-Mel | Prs, Amh, Fas, Hat, Hau, Eng, Cmn, Fra, Rus, Hin, Ukr, Spa, Pus, Urd, Yue, Bos, Vie, Hrv, Tur, Kat, Por, Kor | 97% | The context-aware model works well on language pairs and gives better accuracy. | [25] |
| 2019 | ConvNets | MFCC | Fr, It, En, Ru, Es, De | 95.40% | 2D ConvNets with attention and a GRU give good results and better accuracy. | [14] |
| 2019 | ResNet50 | MFCC | Fr, It, En, Ru, Es, De | 89% | A pretrained ResNet50 model with a cyclic learning rate is used for language identification. | [8] |
| 2018 | SVM-HMM model | Not defined | Es, Fr, En, De | 70% | HMMs are used to translate speech into vector sequences, followed by a deep neural network. | [30] |
| 2017 | Inception-v3, CRNN | MFCC | Es, De, En, Fr | 96% | A pipeline of Inception-v3-based transfer learning and a Bi-LSTM extracts convolutional and temporal attributes. | [24] |
| 2010 | Gaussian mixture model | Perceptual linear prediction | Tel, Dut, Hi, En, Ben, Fr, Es, De, Ru, It | 88.80% | Gaussian mixture models are used with the RPLP approach, which is processed using PLP and MFCC features. | [3] |
| 2009 | CNN-TDNN | MFCC | Fr, De, En | 91.20% | Log-Mel images are used as features for language identification, coupled with an SGD-trained neural network. | [2] |
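Most systems in Table 1 take MFCC or log-Mel features as input. The following is a minimal NumPy-only sketch of how both feature types are derived from a raw waveform; it is not the pipeline of any cited paper, and the parameter values (16 kHz sample rate, 512-point FFT, 10 ms hop, 40 mel bands, 13 coefficients) are illustrative defaults, not values taken from the studies above.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_and_mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    # 1. Frame the signal and apply a Hann window.
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank + log  ->  log-Mel spectrogram (frames x n_mels).
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # 4. DCT-II of the log-Mel energies, keeping the first n_mfcc
    #    coefficients  ->  MFCCs (frames x n_mfcc).
    dct_basis = np.cos(np.pi / n_mels
                       * (np.arange(n_mels) + 0.5)[:, None]
                       * np.arange(n_mfcc)[None, :])
    mfcc = log_mel @ dct_basis
    return log_mel, mfcc

# One second of a 440 Hz tone as stand-in audio.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)
log_mel, mfcc = log_mel_and_mfcc(audio, sr=sr)
print(log_mel.shape, mfcc.shape)  # (frames, 40) and (frames, 13)
```

The log-Mel spectrogram keeps a two-dimensional time-frequency "image", which is why the CNN-based systems in the table feed it directly to convolutional layers, while the final DCT step decorrelates the mel bands into the compact per-frame MFCC vectors favored by the sequence models.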