IET Biometrics

Research Article

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Table 1: The universal background model based on a CNN architecture.

Layer type	Operation	Data sizes
Input	-	1 × 40 × T
Convolution	conv2d, 7 × 7, 16	16 × 40 × T
Residual block	× 3	64 × 40 × T
Residual block	× 4	128 × 20 × (T/2)
Residual block	× 6	256 × 10 × (T/4)
Residual block	× 3	512 × 5 × (T/8)

T denotes the time length of the input sample.
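
The data sizes in Table 1 can be checked with a small shape-bookkeeping sketch. It is not the authors' implementation: the per-stage strides, the size-preserving padding, and the names `STAGES` and `trace_shapes` are assumptions, inferred only from the fact that the last three residual stages each halve both the 40-dimensional axis and the time axis. The block counts (3, 4, 6, 3) match the layout commonly used in ResNet-34, but the table does not state the internal structure of each block, so none is modelled here.

```python
from typing import List, Tuple

# One entry per row of Table 1: (label, output channels, number of blocks, stride).
# The stride values are an assumption inferred from the "Data sizes" column;
# the table itself does not list them.
STAGES = [
    ("conv2d 7x7", 16, 1, 1),
    ("residual x3", 64, 3, 1),
    ("residual x4", 128, 4, 2),
    ("residual x6", 256, 6, 2),
    ("residual x3", 512, 3, 2),
]


def trace_shapes(feature_dim: int, frames: int) -> List[Tuple[str, Tuple[int, int, int]]]:
    """Return (label, (channels, features, time)) after each row of Table 1."""
    shapes = [("input", (1, feature_dim, frames))]
    f, t = feature_dim, frames
    for label, channels, _num_blocks, stride in STAGES:
        # Ceiling division models a strided convolution with size-preserving
        # padding (an assumption; it equals T/2, T/4, T/8 when T divides evenly).
        f = -(-f // stride)
        t = -(-t // stride)
        shapes.append((label, (channels, f, t)))
    return shapes


if __name__ == "__main__":
    # T = 200 is an arbitrary illustrative length, not a value from the paper.
    for label, shape in trace_shapes(feature_dim=40, frames=200):
        print(f"{label:12s} {shape[0]} x {shape[1]} x {shape[2]}")
```

For an input of size 1 × 40 × T, the final stage yields 512 × 5 × (T/8), so the time resolution of the feature map is one eighth of that of the input.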