Research Article

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Table 2

TI-SV results for the hybrid DNNs compared with existing benchmarks.

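| Feature | Method | Aggregation | Loss | EER (%) |
|---|---|---|---|---|
| AC | Nagrani et al. [11] | | | 8.8 |
| AC | Nagrani et al. [11] | TAP | Softmax | 10.2 |
| AC | Kim and Park [26] | TAP | AAM-Softmax | 5.68 |
| AC | Han et al. [27] | SAP | Softmax | 5.75 |
| AC | Han et al. [27] | SAP | AM-Softmax | 4.15 |
| AC | Cai et al. [28] | SAP | A-Softmax | 4.40 |
| AC | Cai et al. [28] | LDE | A-Softmax | 4.48 |
| AC | Wang et al. [29] | MHA | CosAMS | 4.46 |
| AC | Wang et al. [29] | MRMHA | CosAMS | 4.10 |
| AC | Wang et al. [29] | MRMHA | CosAMS | 3.98 |
| AC | Wang et al. [29] | MRMHA | CosAMS | 3.96 |
| AC&PH | ours_PFI_1 | TAP | AAM-Softmax | 4.24 |
| AC&PH | ours_PFI_2 | TAP | AAM-Softmax | 4.46 |
| AC&PH | ours_PFI_1 | AWP | AAM-Softmax | **3.72** |
| AC&PH | ours_PFI_2 | AWP | AAM-Softmax | 3.84 |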

AC, acoustic; PH, phoneme; TAP, temporal average pooling; SAP, self-attention pooling; LDE, learnable dictionary encoding; MHA, multihead attention; MRMHA, multiresolution multihead attention; AWP, adaptive weight pooling. Bold marks the lowest EER in the table (3.72%).