ASLNet: An Encoder-Decoder Architecture for Audio Splicing Detection and Localization

<div>The framework of ASLNet. The encoder-decoder architecture is based on the VGG16 network and extended by two transposed convolutions with a skip connection. The input feature of a 2-s audio clip has a fixed size of 72 × 64. The numbers above the feature maps indicate the number of channels and the height × width of each feature map.</div>
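The shape bookkeeping behind such an encoder-decoder can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard VGG16 layout of five 2 × 2 max-pooling stages with stride 2, and a hypothetical transposed convolution with kernel 2 and stride 2. The caption does not state the pooling count, kernel sizes, or strides, so the helper names and default values below are assumptions.

```python
# Sketch of feature-map sizes for a VGG16-style encoder on a 72 x 64 input.
# Assumptions (not given in the caption): five 2x2 max-pool stages with
# stride 2, and a transposed convolution with kernel 2 and stride 2.

def pool_out(size, kernel=2, stride=2):
    """Output length of an unpadded max-pooling layer (floor mode)."""
    return (size - kernel) // stride + 1


def tconv_out(size, kernel, stride, padding=0):
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def encoder_shapes(height=72, width=64, n_pools=5):
    """Return the (height, width) of the input and of each pooled feature map."""
    shapes = [(height, width)]
    for _ in range(n_pools):
        h, w = shapes[-1]
        shapes.append((pool_out(h), pool_out(w)))
    return shapes


shapes = encoder_shapes()
for stage, (h, w) in enumerate(shapes):
    print(f"after {stage} pooling stage(s): {h} x {w}")

# One hypothetical upsampling step from the deepest map; a skip connection
# can only merge it with an encoder map of the same spatial size.
bottleneck_h, bottleneck_w = shapes[-1]
up = (tconv_out(bottleneck_h, 2, 2), tconv_out(bottleneck_w, 2, 2))
print("upsampled:", up, "matches encoder stage 4:", up == shapes[-2])
```

Under these assumptions the odd height at the third stage (9) is floored by the next pooling step, which is why a decoder has to choose its transposed-convolution parameters so that the upsampled map lines up with the encoder map it is merged with.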

Security and Communication Networks

Figure 2: ASLNet: An Encoder-Decoder Architecture for Audio Splicing Detection and Localization