Research Article

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Table 1: The universal background model based on a CNN architecture.

Layer type            Operation            Data sizes
Input                 -                    1 × 40 × T
Convolution           conv2d, 7 × 7, 16    16 × 40 × T
Residual block × 3    -                    64 × 40 × T
Residual block × 4    -                    128 × 20 × (T/2)
Residual block × 6    -                    256 × 10 × (T/4)
Residual block × 3    -                    512 × 5 × (T/8)

T denotes the time length of the input sample.
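The data sizes in Table 1 can be traced with a small shape calculator. This is a minimal sketch, not the authors' implementation. The channel counts and block counts are read from the table. The per-stage strides, the rounding of odd lengths upward, and the names `STAGES` and `trace_shapes` are assumptions introduced here for illustration.

```python
from math import ceil

# (stage name, output channels, stride) for each row of Table 1.
# Channels come from the table; the strides are an ASSUMPTION chosen so that
# the feature axis shrinks 40 -> 20 -> 10 -> 5 as the table reports.
STAGES = [
    ("conv2d 7x7", 16, 1),
    ("residual x3", 64, 1),
    ("residual x4", 128, 2),
    ("residual x6", 256, 2),
    ("residual x3", 512, 2),
]


def trace_shapes(num_frames, num_features=40):
    """Return [(stage name, (channels, features, frames)), ...].

    Assumes size-preserving padding, so a stride-s stage maps a length n
    to ceil(n / s). The rounding rule is an assumption, not from the table.
    """
    shape = (1, num_features, num_frames)
    shapes = [("input", shape)]
    for name, channels, stride in STAGES:
        _, feats, frames = shape
        shape = (channels, ceil(feats / stride), ceil(frames / stride))
        shapes.append((name, shape))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes(num_frames=200):
        print(f"{name:12s} {shape}")
```

For T = 200 frames, the last stage yields 512 × 5 × 25, which matches the 512 × 5 × (T/8) entry in the table.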