Research Article

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Figure 1

Architecture of the proposed hybrid network, which consists of two subnets: a universal background model (UBM) subnet and a phoneme feature extraction (PFE) subnet. GAP denotes global average pooling.
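As a rough illustration of the GAP operation named in the caption, the sketch below averages a frame-level feature map over its time axis to produce one fixed-length vector per utterance. The caption does not specify the pooling axis, the layer shapes, or the framework, so the time-by-channels layout, the function name `global_average_pooling`, and the toy values are all assumptions made for illustration only.

```python
from typing import List


def global_average_pooling(frames: List[List[float]]) -> List[float]:
    """Average a (time x channels) feature map over the time axis.

    Assumption: pooling runs over time, so utterances of any length
    map to a vector whose size equals the number of channels.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_channels = len(frames[0])
    if any(len(frame) != num_channels for frame in frames):
        raise ValueError("all frames must have the same number of channels")
    num_frames = len(frames)
    return [
        sum(frame[c] for frame in frames) / num_frames
        for c in range(num_channels)
    ]


# Hypothetical feature maps: 3 frames and 6 frames, 2 channels each.
short_utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
long_utterance = short_utterance + short_utterance

pooled_short = global_average_pooling(short_utterance)
pooled_long = global_average_pooling(long_utterance)
```

Both inputs yield a two-element vector despite their different lengths, which is the property that makes this kind of pooling convenient for turning variable-length speech segments into fixed-size embeddings.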