Research Article
Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting
Figure 1
The architecture of the proposed hybrid network, which consists of two subnets: a universal background model (UBM) subnet and a phoneme feature extraction (PFE) subnet. GAP denotes global average pooling.
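The global average pooling (GAP) operation named in the caption can be illustrated with a minimal Python sketch: frame-level feature vectors are averaged over the time axis, so utterances of different lengths map to vectors of the same size. The function name, the list-of-lists representation, and the toy values are assumptions made for illustration only; the caption does not give feature dimensions or the exact position of GAP relative to the two subnets.

```python
from typing import List


def global_average_pooling(frames: List[List[float]]) -> List[float]:
    """Average frame-level feature vectors over time into one fixed-length vector.

    `frames` is a hypothetical (num_frames x feature_dim) representation,
    not the data layout used by the authors.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")
    num_frames = len(frames)
    # Mean of each feature dimension across all frames.
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


# Toy utterances of different lengths (2 and 3 frames, 2 features each).
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]

pooled_short = global_average_pooling(short_utt)
pooled_long = global_average_pooling(long_utt)
```

Both pooled vectors have length 2 regardless of the number of input frames, which is the property that lets a pooling layer turn variable-length speech into a fixed-size embedding.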