AC, acoustic; PH, phoneme; TAP, temporal average pooling; SAP, self-attention pooling; LDE, learnable dictionary encoding; MHA, multihead attention; MRMHA, multiresolution multihead attention; AWP, adaptive weight pooling. Bold type is used to emphasize the experimental result (3.72%).