Research Article

ASLNet: An Encoder-Decoder Architecture for Audio Splicing Detection and Localization

Figure 2

The framework of ASLNet. The encoder-decoder architecture is based on the VGG16 network and is extended by two transposed convolutions with a skip connection. The input feature of a 2-s audio clip has a fixed size of 72 × 64. The numbers above the feature maps indicate the number of channels and the height × width of each feature map.
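To make the caption's shape bookkeeping concrete, here is a minimal sketch of how a 72 × 64 input might shrink through a VGG16-style encoder and grow back through two transposed convolutions. Several details are assumptions, not taken from the figure: it assumes all five standard VGG16 blocks (3 × 3 "same" convolutions followed by a 2 × 2, stride-2 max-pool), 2 × 2 stride-2 transposed convolutions, and a skip connection from the fourth encoder block. The helper names (`conv_out`, `deconv_out`, `encoder_shapes`, `decoder_sizes`) are hypothetical. The output-size formulas are the standard ones used by PyTorch-style convolution, pooling, and transposed-convolution layers.

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis for a convolution or pooling window (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int = 2, stride: int = 2, padding: int = 0) -> int:
    """Output length along one axis for a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Channel widths of the five convolutional blocks in standard VGG16.
VGG16_BLOCK_CHANNELS = [64, 128, 256, 512, 512]


def encoder_shapes(height: int, width: int) -> List[Tuple[int, int, int]]:
    """(channels, H, W) after each VGG16 block: 3x3 'same' convs, then a 2x2 stride-2 max-pool."""
    shapes = []
    h, w = height, width
    for channels in VGG16_BLOCK_CHANNELS:
        # 3x3 convolutions with padding 1 leave H and W unchanged.
        h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
        # 2x2 max-pool with stride 2 halves each axis, rounding down.
        h, w = conv_out(h, 2, 2, 0), conv_out(w, 2, 2, 0)
        shapes.append((channels, h, w))
    return shapes


def decoder_sizes(enc: List[Tuple[int, int, int]]) -> List[Tuple[int, int]]:
    """Spatial sizes after two stride-2 transposed convs, with one skip from a matching encoder block."""
    _, h, w = enc[-1]
    h1, w1 = deconv_out(h), deconv_out(w)
    # Hypothetical skip source: the output of the fourth encoder block.
    _, hs, ws = enc[-2]
    if (h1, w1) != (hs, ws):
        raise ValueError(f"skip connection needs matching sizes, got {(h1, w1)} vs {(hs, ws)}")
    h2, w2 = deconv_out(h1), deconv_out(w1)
    return [(h1, w1), (h2, w2)]


if __name__ == "__main__":
    enc = encoder_shapes(72, 64)
    print(enc)                 # encoder (channels, H, W) per block
    print(decoder_sizes(enc))  # decoder spatial sizes after each transposed conv
```

Under these assumptions the encoder produces 36 × 32, 18 × 16, 9 × 8, 4 × 4, and 2 × 2 maps, and the decoder returns 4 × 4 and then 8 × 8. The floor in the pooling step (9 → 4) means a fully symmetric decoder could not simply restore 9 × 8 without padding adjustments. This is one reason a sketch like this places the skip connection at a scale where the sizes match exactly.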