Research Article

ASLNet: An Encoder-Decoder Architecture for Audio Splicing Detection and Localization

Table 2

Detection results of ASLNet on four datasets under different thresholds and acoustic features.

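| Feature | Threshold | Spliced at the end: Dataset | | | | Spliced at the middle: Dataset | | | |
|---|---|---|---|---|---|---|---|---|---|
| MFCC | 0.5 | ENSet2s | 0.8742 | 0.9295 | 0.9077 | ENSet3s | 0.8949 | 0.9904 | 0.9740 |
| MFCC | 0.5 | CNSet2s | 0.9698 | 0.9903 | 0.9833 | CNSet3s | 0.9938 | 0.9979 | 0.9965 |
| MFCC | 0.6 | ENSet2s | 0.8778 | 0.9269 | 0.9075 | ENSet3s | 0.9022 | 0.9894 | 0.9745 |
| MFCC | 0.6 | CNSet2s | 0.9712 | 0.9894 | 0.9831 | CNSet3s | 0.9943 | 0.9977 | 0.9965 |
| MFCC | 0.7 | ENSet2s | 0.8818 | 0.9219 | 0.9061 | ENSet3s | 0.9108 | 0.9876 | 0.9745 |
| MFCC | 0.7 | CNSet2s | 0.9725 | 0.9884 | 0.9830 | CNSet3s | 0.9950 | 0.9972 | 0.9964 |
| MFCC | 0.8 | ENSet2s | 0.8869 | 0.9206 | 0.9073 | ENSet3s | 0.9132 | 0.9869 | 0.9742 |
| MFCC | 0.8 | CNSet2s | 0.9743 | 0.9870 | 0.9827 | CNSet3s | 0.9955 | 0.9966 | 0.9962 |
| LFCC | 0.5 | ENSet2s | 0.5359 | 0.7588 | 0.6708 | ENSet3s | 0.0012 | 1.0000 | 0.8289 |
| LFCC | 0.5 | CNSet2s | 0.9129 | 0.9575 | 0.9422 | CNSet3s | 0.9730 | 0.9768 | 0.9755 |
| LFCC | 0.6 | ENSet2s | 0.5571 | 0.7479 | 0.6726 | ENSet3s | 0.0037 | 0.9997 | 0.8291 |
| LFCC | 0.6 | CNSet2s | 0.9155 | 0.9563 | 0.9424 | CNSet3s | 0.9750 | 0.9750 | 0.9750 |
| LFCC | 0.7 | ENSet2s | 0.5818 | 0.7357 | 0.6750 | ENSet3s | 0.0183 | 0.9982 | 0.8304 |
| LFCC | 0.7 | CNSet2s | 0.9187 | 0.9554 | 0.9429 | CNSet3s | 0.9758 | 0.9738 | 0.9745 |
| LFCC | 0.8 | ENSet2s | 0.6030 | 0.7061 | 0.6654 | ENSet3s | 0.0452 | 0.9912 | 0.8291 |
| LFCC | 0.8 | CNSet2s | 0.9219 | 0.9543 | 0.9432 | CNSet3s | 0.9778 | 0.9714 | 0.9736 |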