Abstract
In the Internet age, information security is under constant threat, and the need for copyright protection and matching detection of audio and video is growing. To address this need, this paper proposes a zero-watermarking algorithm for audio and video matching based on NSCT. The algorithm uses NSCT, DCT, SVD, and Schur decomposition to extract video and audio features and synthesizes them into a zero-watermark stream, which is stored with a third-party organization for detection and identification. The detection algorithm derives the zero watermark from the audio and video under test and detects and locates tampering by comparing it with the zero watermark held by the third party. The experimental results show that the algorithm can not only detect whether the audio and video are mismatched due to tampering attacks but also locate the mismatched audio and video segments and protect the copyright.
1. Introduction
With the development of global networking, digital media can be distributed quickly and conveniently. Along with this convenience, security issues have become increasingly prominent. Digital watermarking technology can protect the copyright of audio or video to a certain extent and is an active research field in data security. At present, however, digital watermarking technology cannot detect and locate mismatches between audio and video, which is a blind spot in security protection and detection. Research on audio and video matching detection and localization is therefore urgently needed.
At present, there are very few watermarking algorithms for audio and video matching detection. Most digital watermarking algorithms are image, audio, or video watermarking algorithms, each of which targets a single type of media. Image digital watermarking mainly includes spatial domain methods [1], transform domain methods [2–6], and deep learning-based methods [7]. Transform domain methods commonly use DCT (Discrete Cosine Transform), NSCT (Nonsubsampled Contourlet Transform) [4, 5], DWT (Discrete Wavelet Transform) [6], and so on. Watermarking based on deep learning has emerged as a new direction, but it still needs to be improved in terms of watermark capacity and algorithm complexity. Video watermarking can be divided into algorithms for original video and algorithms based on the compressed domain. The former can draw on existing image watermarking algorithms [8–10]. The latter are combined with specific video encoding methods, such as the MPEG [11], H.264 [12], and H.265 [13] video watermarking algorithms. Audio watermarking algorithms mainly include time domain and transform domain algorithms; time domain algorithms include the least significant bit algorithm [14], the echo hiding algorithm [15], and the phase coding algorithm. To improve robustness, more researchers have turned to transform domain watermarking and moved the embedding position of the watermark from the time domain to the transform domain. For example, [16] proposed an audio watermarking technique based on DWT, [17] proposed one based on SVD (singular value decomposition) and the fractional Fourier transform, and [18] proposed one based on DWT and SVD.
At present, most watermarking algorithms for audio and video are designed separately, but multimedia data is composed of audio and video together, so protecting only one of them is not enough. Tamper protection, and even matching protection, is needed for both audio and video. Dittmann et al. [19] proposed the earliest cross-watermarking algorithm in 1999, which can verify the synchronization between audio and video. Although this cross-watermarking algorithm is easy to implement, its watermark cannot resist various attacks [19]. To improve the robustness of watermarking, Wang and Pan [20] proposed an audio-video cross-watermarking algorithm combined with a visual saliency model, which embeds the watermark into the DC coefficient of the DCT through quantization index modulation [20]. Esmaeilbeig and Ghaemmaghami [21] proposed an audio and video watermarking algorithm based on the compressed domain. The algorithm generates hash bits from the audio part and embeds them as watermarks in the QDCT coefficients of the video I frames [21]. The above cross-watermarking algorithms do not provide copyright protection for audio and video at the same time; they generate watermarks based on the whole multimedia stream, so they can only judge whether the audio and video match as a whole and cannot locate the tampering of small segments in the audio and video streams. Sun et al. [22] proposed a video zero-watermarking algorithm based on NSCT, DCT, DWT, and SVD. The algorithm generates a zero-watermarking frame by combining the audio watermark with the video frame feature matrix, which can be used to locate attacks on the video in addition to verifying its copyright [22].
This paper presents an audio and video matching digital watermarking algorithm based on NSCT transform. The algorithm extracts video features and audio features of each segment, respectively, and generates a zero-watermark stream through synthesis. Experiments show that this algorithm can not only detect whether the audio and video are mismatched due to tampering attacks and locate the mismatched audio and video segments but also protect the copyright.
2. General Framework of Audio and Video Matching Zero-Watermarking Algorithm
Zero watermarking differs from traditional digital watermarking in that the watermark is not actually embedded into the carrier; instead, it is obtained by extracting stable features of the carrier to construct a feature matrix and XORing that matrix with the watermark information. The algorithm in this paper not only generates a zero watermark but also realizes matching detection of audio and video; its generation and detection framework is shown in Figure 1. The generation algorithm first preprocesses the audio and video and divides them into 1 s segments, so that the audio and video segments are synchronized and correspond to each other in time. Then, stable audio features are extracted from the audio segment to construct the audio feature matrix, and the key frames and their features are extracted from the video segment to generate the encrypted video watermark. The encrypted video watermark is XORed with the audio feature matrix to obtain the zero watermark of the segment, which thus integrates the audio and video features of that segment. When the same operation has been performed on the whole audio and video, a zero-watermark stream is formed, which is saved together with the key frame numbers and other information with a third party such as the copyright center. In matching detection, zero watermarks are generated for the audio and video segments under test in the same way, and the matching of audio and video is detected by comparing them with the zero watermarks held by the copyright center. In addition to detecting audio and video matching, this zero watermark can also be used for traditional copyright recognition.
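As a rough illustration of the 1 s segmentation described above, the following sketch pairs the video frame range and the audio sample range of each segment so that they stay aligned in time. The function name `segment_pairs` and its parameters are illustrative assumptions, not part of the paper; the example values (30 s, 27 fps, 44.1 kHz) are the ones reported for the experimental video in Section 5.

```python
def segment_pairs(duration_s, fps, sample_rate, seg_len_s=1):
    """Return one (frame_range, sample_range) pair per time-aligned segment.

    Ranges are half-open (start, stop) index pairs. A trailing partial
    segment is dropped here, which is an assumption of this sketch.
    """
    pairs = []
    for s in range(int(duration_s // seg_len_s)):
        t0, t1 = s * seg_len_s, (s + 1) * seg_len_s
        frames = (int(t0 * fps), int(t1 * fps))
        samples = (int(t0 * sample_rate), int(t1 * sample_rate))
        pairs.append((frames, samples))
    return pairs


# Example with the experimental settings: 30 segments of 27 frames
# and 44100 audio samples (per channel) each.
pairs = segment_pairs(duration_s=30, fps=27, sample_rate=44100)
```

Each pair can then be processed independently, which is what allows tampering to be located at the granularity of one segment.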

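Figure 1. Generation and detection framework of the audio and video matching zero watermark, shown in two panels (a) and (b).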
3. Zero-Watermarking Generation Algorithm for Audio and Video Matching
The zero-watermark generation algorithm for audio and video matching is shown in Figure 2. The audio and video are decoded and divided into 1 s segments to obtain several short audio and video pairs, each composed of a video segment and an audio segment. Matching detection is performed on each pair so that tampering can be judged and located within a small time period. The video watermark is generated by NSCT, DCT, Schur decomposition, and other algorithms, and DWT and SVD are used to extract the audio features. The encrypted video watermark is XORed with the extracted sound feature matrix to obtain the audio and video matching zero watermark. The zero watermark is registered with and saved by the third-party copyright organization, and it is retrieved when the audio and video need to be authenticated and detected.
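The XOR synthesis step can be sketched as follows. This is a minimal illustration, assuming that the encrypted video watermark and the binarized sound feature matrix of a segment are already available as equally sized 0/1 matrices; the helper names `xor_bits` and `zero_watermark_stream` are hypothetical and the 2 × 2 matrices are toy data.

```python
def xor_bits(a, b):
    """Element-wise XOR of two equally sized binary matrices (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x ^ y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def zero_watermark_stream(video_watermarks, audio_features):
    """Build one zero watermark per segment from the per-segment matrices."""
    return [xor_bits(v, a) for v, a in zip(video_watermarks, audio_features)]


# Toy matrices standing in for one segment.
video_wm = [[1, 0], [0, 1]]      # encrypted video watermark
audio_feat = [[1, 1], [0, 0]]    # binarized sound feature matrix
zero_wm = xor_bits(video_wm, audio_feat)
recovered = xor_bits(zero_wm, audio_feat)  # XOR with the same features undoes the first XOR
```

Because XOR is its own inverse, the zero watermark depends on both inputs: a change in either the video watermark or the audio features of a segment changes the zero watermark regenerated for that segment.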

3.1. Generation of Encrypted Video Watermark
The video watermark is composed of the key frame features of the video segment. First, the key frame image is extracted by the method based on the interframe Euclidean distance, and the extracted key frame number is saved as a key; in the zero-watermark detection of audio and video matching, this key is used to find the video frame image that serves as the watermark. The key frame image is then converted from RGB space to the YCoCg color space. The Co component is decomposed by NSCT, DCT, Schur decomposition, and other methods to generate the video feature matrix, which is binarized and encrypted to obtain the encrypted video watermark. The detailed steps of generating encrypted video watermarks are shown in Figure 3.
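For reference, the color conversion step can be sketched with the commonly used floating-point RGB to YCoCg transform. The paper does not state which YCoCg variant it uses, so the coefficients below are an assumption, and the function names are illustrative.

```python
def rgb_to_ycocg(r, g, b):
    """Convert one RGB pixel to (Y, Co, Cg) with the standard YCoCg matrix."""
    y = 0.25 * r + 0.5 * g + 0.25 * b
    co = 0.5 * r - 0.5 * b
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    return y, co, cg


def co_component(image):
    """Extract the Co plane from an image given as rows of (r, g, b) tuples."""
    return [[rgb_to_ycocg(*pixel)[1] for pixel in row] for row in image]


# A one-row toy image: pure red, then white.
co_plane = co_component([[(255, 0, 0), (255, 255, 255)]])
```

In this form Co depends only on the difference between the red and blue channels, so it is zero for any gray pixel.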

3.1.1. Key Frame Extraction Algorithm Based on Euclidean Distance between Frames
This algorithm extracts key frames using the method based on the Euclidean distance between frames [23]. The main idea of this method is to calculate the Euclidean distance over consecutive frames and to select the key frames from these distances. This method is simple and easy to implement. The interframe Euclidean distance is defined by the following equations:

$$ed(k) = \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N} D_k(i,j)^{2}}, \tag{1}$$

$$D_k(i,j) = \left[f_{k+2}(i,j) - f_{k+1}(i,j)\right] - \left[f_{k+1}(i,j) - f_{k}(i,j)\right], \tag{2}$$

where $f_k(i,j)$, $f_{k+1}(i,j)$, and $f_{k+2}(i,j)$ represent the gray values of the k frame image, k + 1 frame image, and k + 2 frame image at pixel point (i, j), respectively, and k is the frame index in a video of J frames (the distance is defined for k = 1, 2, …, J − 2). $D_k(i,j)$ represents the gray difference between the k + 2 frame image and the k + 1 frame image minus the gray difference between the k + 1 frame image and the k frame image. The image size is $M \times N$.
The steps of extracting key frames based on the Euclidean distance between frames are as follows:
(1) Use equations (1) and (2) to calculate the Euclidean distance for each frame. If there are J frames, there are J − 2 Euclidean distances.
(2) Find the extreme points of the J − 2 Euclidean distances.
(3) Find the maximum and minimum values of these extreme points and calculate their mean value.
(4) Compare each extreme point with the mean value. The image corresponding to an extreme point greater than the mean value is a key frame image.
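The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: frames are plain lists of gray-value rows, the distance is the root of the summed squared second-order gray difference over three consecutive frames, an extreme point is taken to be a local maximum or minimum of the distance sequence, the "mean value" is read as the midpoint of the largest and smallest extreme values, and each selected distance is mapped to the middle frame of its triple. The last three choices are assumptions where the text leaves room for interpretation.

```python
import math


def interframe_distances(frames):
    """Return the J - 2 interframe Euclidean distances for J grayscale frames."""
    dists = []
    for k in range(len(frames) - 2):
        total = 0.0
        for r0, r1, r2 in zip(frames[k], frames[k + 1], frames[k + 2]):
            for a, b, c in zip(r0, r1, r2):
                d = (c - b) - (b - a)  # second-order gray difference
                total += d * d
        dists.append(math.sqrt(total))
    return dists


def select_key_frames(frames):
    """Return 0-based indices of key frames (middle frame of each selected triple)."""
    dist = interframe_distances(frames)
    # Interior points where the slope changes sign are local extrema.
    extrema = [k for k in range(1, len(dist) - 1)
               if (dist[k] - dist[k - 1]) * (dist[k + 1] - dist[k]) < 0]
    if not extrema:
        return []
    values = [dist[k] for k in extrema]
    threshold = (max(values) + min(values)) / 2.0
    return [k + 1 for k in extrema if dist[k] > threshold]


# Toy video: nine uniform 2 x 2 frames with one strong flash and one weak one.
levels = [0, 0, 0, 10, 0, 0, 1, 0, 0]
frames = [[[v, v], [v, v]] for v in levels]
key_frames = select_key_frames(frames)
```

In the toy video, only the strong flash is selected; the weak one produces an extreme point that stays below the threshold.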
3.1.2. NSCT Transform
NSCT is multiscale, has good anisotropy, and is translation invariant. It is composed of the NSP (Nonsubsampled Pyramid) and the NSDFB (Nonsubsampled Directional Filter Bank). The nonsubsampled pyramid first performs multiscale decomposition on the image and separates out the low-frequency part. The nonsubsampled directional filter bank then performs directional decomposition on the high-frequency part, which makes the NSCT multiscale, multidirectional, and anisotropic. The principle of the three-stage NSCT is shown in Figure 4. Its output consists of the low-frequency subband y1 and three levels of high-frequency subbands y2, y3, and y4, whose numbers of directions are 2, 4, and 8, respectively.

After NSCT, the low-frequency part concentrates most of the energy of the image and represents its contour information, while the high-frequency part contains less energy. Because the low-frequency subband produced by NSCT has a large energy value and is the same size as the original image, using it helps ensure the strength of the watermark; the low-frequency subband is therefore selected as the object from which the zero watermark is constructed.
3.1.3. DCT Transform
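DCT is an orthogonal real transform with strong energy compaction, and it is widely used in digital watermarking technology because of its strong robustness and good concealment [24]. For a two-dimensional image f(x, y) of size $M \times N$, the DCT and its inverse transform are as follows:

$$F(u,v) = c(u)\,c(v)\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x,y)\cos\frac{(2x+1)u\pi}{2M}\cos\frac{(2y+1)v\pi}{2N}, \tag{3}$$

$$f(x,y) = \sum_{u=0}^{M-1}\sum_{v=0}^{N-1} c(u)\,c(v)\,F(u,v)\cos\frac{(2x+1)u\pi}{2M}\cos\frac{(2y+1)v\pi}{2N}, \tag{4}$$

where $c(u) = \sqrt{1/M}$ for $u = 0$ and $\sqrt{2/M}$ otherwise, $c(v) = \sqrt{1/N}$ for $v = 0$ and $\sqrt{2/N}$ otherwise, $u$ and $v$ are the horizontal and vertical frequency, respectively, $(x, y)$ are the pixel coordinates, and $u, x = 0, 1, \ldots, M-1$ and $v, y = 0, 1, \ldots, N-1$.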
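The transform pair can be checked numerically with a small sketch. It assumes the orthonormal type-II DCT, with factors sqrt(1/n) for the zero frequency and sqrt(2/n) otherwise, written in separable matrix form; the function names are illustrative, and a practical implementation would use an optimized library routine instead of these pure-Python loops.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: C[u][x] = c(u) * cos((2x + 1) * u * pi / (2n))."""
    rows = []
    for u in range(n):
        c = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                     for x in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(f):
    """2-D DCT of an M x N block: F = C_M * f * C_N^T."""
    m, n = len(f), len(f[0])
    return matmul(matmul(dct_matrix(m), f), transpose(dct_matrix(n)))


def idct2(F):
    """Inverse 2-D DCT: f = C_M^T * F * C_N (C is orthogonal)."""
    m, n = len(F), len(F[0])
    return matmul(matmul(transpose(dct_matrix(m)), F), dct_matrix(n))


block = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
coeffs = dct2(block)
restored = idct2(coeffs)  # equals block up to rounding error
```

With this normalization the DC coefficient equals the sum of all samples divided by sqrt(M N), which illustrates how the transform concentrates the energy of a smooth block in the low-frequency corner.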
3.1.4. Schur Decomposition
Schur decomposition factors a square matrix $A$ into a unitary matrix $U$ and an upper triangular matrix $T$, where $U^{H}$ is the conjugate transpose of $U$ [25]. Any n-order square matrix $A$ can thus be decomposed into

$$A = U T U^{H}. \tag{5}$$
Schur decomposition is widely used in digital watermarking because of its scaling invariance and low computational complexity. When the matrix is scaled by a certain factor, only the eigenvalues change, and they change by the same factor. This scaling invariance copes well with scaling attacks and improves the robustness of the watermark. In addition, Schur decomposition can be regarded as an intermediate step of singular value decomposition: it does not need to reduce the upper triangular matrix to a diagonal matrix, so it requires less computation.
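The scaling property can be verified in one line. As a short worked step, write the Schur decomposition of a square matrix $A$ as $A = U T U^{H}$, with $U$ unitary and $T$ upper triangular, and let $\alpha$ denote a nonzero scaling factor (a symbol introduced here only for illustration). Then

$$\alpha A = \alpha\, U T U^{H} = U\,(\alpha T)\,U^{H}.$$

Since $\alpha T$ is still upper triangular and $U$ is still unitary, this is again a Schur decomposition of $\alpha A$ with the same $U$. The diagonal entries of the triangular factor, which are the eigenvalues, are simply multiplied by $\alpha$, so features derived from $U$, or from ratios and signs of the entries of $T$, are unaffected by the scaling.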
3.2. Generation of the Sound Feature Matrix
The sound feature matrix is generated from the features of the audio segment. The algorithm performs DWT and SVD on the segmented, decoded audio to obtain stable audio features. The feature matrix is formed from these features and then binarized. After that, it is XORed with the encrypted video watermark to generate the zero watermark. The detailed steps of sound feature matrix generation are shown in Figure 5.

4. Audio and Video Matching Detection Algorithm
The audio and video matching detection algorithm and the audio-video matching zero-watermark generation algorithm are inverse processes of each other, as shown in Figure 6. Using the key frame number, Zernike moment, and other information saved by the third party, the zero-watermark generation algorithm is applied to obtain the zero watermark of the audio and video under test. The similarity between this zero watermark and the zero-watermark stream saved by the third party is computed, and whether an audio and video segment has been tampered with is determined according to the similarity threshold. The Zernike moment improves resistance to rotation attacks. The detailed steps of the audio and video matching detection algorithm are shown in Figure 7.


5. Experimental Results and Analysis
The experiments are carried out in MATLAB R2018b. The watermark is encrypted by logistic chaos, and the initial value and the parameter u are used as the key. The watermark information can be decrypted correctly only when the zero-watermark algorithm, the encryption method, and its key are all known. Considering the security of the algorithm and watermark, the parameter of logistic chaotic encryption is set as . For the length of audio and video segments, the segmentation unit is set to 1 s on the basis of comprehensive analysis and experiments covering the stability of audio and video features, the speed of zero-watermark generation, the resources occupied, the accuracy of matching detection, and so on. On the one hand, this makes it possible to extract stable features of audio and video segments effectively and to build the zero watermark quickly. On the other hand, it allows tampering with small audio or video segments in the entire audio and video stream to be detected more accurately [22]. The following experiments use a video (including audio) in H.264 coding format, which is divided into 30 audio and video segments. The video frame size is 1080 × 1920, the duration is 30 seconds, and the frame rate is 27 fps; the audio stream has a sampling rate of 44.1 kHz, 16-bit quantization, and two channels.
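The logistic chaotic encryption mentioned at the start of this section can be sketched as follows. This is only an illustration under stated assumptions: the keystream is obtained by thresholding the logistic map x(n+1) = u x(n) (1 − x(n)) at 0.5 after discarding some initial iterations, and the watermark bits are XORed with it. The thresholding rule, the burn-in length, and the key values `x0 = 0.3` and `u = 3.99` are hypothetical and are not the values used in the paper.

```python
def logistic_keystream(length, x0, u, burn_in=100):
    """Generate `length` key bits from the logistic map x <- u * x * (1 - x)."""
    x = x0
    for _ in range(burn_in):          # discard the transient iterations
        x = u * x * (1.0 - x)
    bits = []
    for _ in range(length):
        x = u * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits


def logistic_xor(bits, x0, u):
    """Encrypt or decrypt a bit list by XOR with the chaotic keystream."""
    key = logistic_keystream(len(bits), x0, u)
    return [b ^ k for b, k in zip(bits, key)]


watermark = [1, 0, 1, 1, 0, 0, 1, 0]
cipher = logistic_xor(watermark, x0=0.3, u=3.99)
plain = logistic_xor(cipher, x0=0.3, u=3.99)   # the same key recovers the watermark
```

Because the same function encrypts and decrypts, recovering the watermark requires exactly the initial value and the parameter u, which is why the two values act as the key.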
In the experiment, the NC (Normalized Correlation) and BER (Bit Error Ratio) are used as the objective evaluation standard of watermark robustness. The NC experiment and analysis of the watermark image show that when the NC value is above 0.8, the correlation between the two watermark images is high [24]. Therefore, the tamper-proof threshold of audio and video is set as 0.8 in this paper for audio and video matching detection and identification; that is, when NC is greater than or equal to 0.8, audio and video are matched. When the value is less than 0.8, the audio or video is tampered with [22]. BER refers to the percentage of the extracted watermark error bits in the total bits. The PSNR (Peak Signal-to-Noise Ratio) is used as the difference measurement index of two images. The larger the value of PSNR, the better the invisibility of the watermark algorithm.
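The decision rule described above can be sketched as follows. The paper does not reproduce its NC formula, so the correlation form used here (inner product divided by the product of the norms) is one common definition and should be read as an assumption; the function names and the toy bit lists are illustrative. Watermarks are given as flattened lists of 0/1 values, one list per segment.

```python
import math


def nc(w, w_ref):
    """Normalized correlation between two flattened watermarks."""
    num = sum(a * b for a, b in zip(w, w_ref))
    den = math.sqrt(sum(a * a for a in w) * sum(b * b for b in w_ref))
    return num / den if den else 0.0


def ber(w, w_ref):
    """Bit error ratio: fraction of bits that differ from the reference."""
    errors = sum(a != b for a, b in zip(w, w_ref))
    return errors / len(w_ref)


def mismatched_segments(detected, reference, threshold=0.8):
    """Return 1-based indices of segments whose NC is below the threshold."""
    return [i for i, (w, r) in enumerate(zip(detected, reference), start=1)
            if nc(w, r) < threshold]


# Three toy segments; the second one has been replaced.
reference = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
detected = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 1]]
tampered = mismatched_segments(detected, reference)
```

Since each segment is compared independently, the list of indices returned directly gives the location of the mismatched segments in the stream.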
5.1. Audio and Video Matching and Tamper-Proof Test
For the above experimental audio and video, we replaced video frames and audio segments in different time periods and then carried out the audio and video matching detection and localization experiment. The experimental results are shown in Figure 8. In Figure 8(a), the NC values of the zero watermarks detected for segments 2, 5, 8, 11, 13, 16, 20, 23, 25, and 27 are all below the threshold of 0.8, so these segments are judged not to match; that is, these video segments have been tampered with. Likewise, in Figure 8(b), the NC values for segments 2, 6, 8, 12, 15, 18, 21, 24, 26, 28, and 30 are all below the threshold, so these audio segments are judged to have been tampered with. The experiments show that this method can detect whether the audio and video are mismatched due to tampering attacks and can locate the mismatched audio and video segments.

Figure 8. Audio and video matching detection results: (a) video tampering; (b) audio tampering.
5.2. Algorithm Robustness Testing
5.2.1. Video Robustness Testing
To verify the robustness of the algorithm, common attacks such as Gaussian noise, salt and pepper noise, cropping, scaling, rotation, Gaussian filtering, median filtering, and frame attacks, as well as combinations of several single attacks, are applied to the video. The experimental results are shown in Table 1. Overall, even when the PSNR after an attack falls below 10 dB, the NC value of the watermark remains above 0.9, indicating that the algorithm has good robustness:
(1) Noise Attack. Noise is one of the most common types of attack. Gaussian noise and salt and pepper noise attack experiments are carried out on the video, and the results are shown in Figure 9. The noise intensity ranges from 0.01 to 0.1 in steps of 0.01. The figure shows that, as the noise level increases, the signal-to-noise ratio of the video key frame image decreases, but the mean NC of the watermark remains above 0.95, which shows that the algorithm has good robustness to noise attacks.
(2) JPEG Compression Attack. The robustness of the algorithm is tested under JPEG compression with quality factors from 10 to 90 in steps of 10, and the results are shown in Figure 10. As the quality factor increases, the NC values extracted from the key frames rise steadily and their distribution becomes more concentrated. Within the experimental range, the NC values are greater than 0.96, indicating that the algorithm has good robustness to JPEG compression.
(3) Filter Attack. Image filtering is one of the most common operations in image and video processing. In this section, a Gaussian filtering attack is applied to the video. As shown in Figure 11, as the filter window size and scale increase, the NC value of the watermark decreases, but it remains greater than 0.94, which shows good robustness to Gaussian filtering.
(4) Cropping Attack. In this experiment, 1/20, 1/16, and 1/8 of the video is cropped at the upper left, lower left, upper right, and lower right, and the results are illustrated in Figure 12. Because the algorithm extracts the features of the key frames when generating the watermark, the mean NC in the experiment stays above the matching detection threshold even though cropping removes a large number of key frame features, which ensures the accuracy of the matching detection.
(5) Rotation Attack. Rotation attacks from 0° to 180° are applied to the video. Figure 13 shows that the NC value gradually decreases as the rotation angle increases, but all values are above 0.96, indicating that the algorithm resists rotation attacks well.
(6) Scaling Attack. Different scaling factors are applied to the video key frame images. As can be seen from Figure 14, the NC values are above 0.96, indicating that the algorithm has good robustness to scaling attacks.
(7) Combined Attack. In actual audio and video transmission, video often suffers more than one attack, and multiple attacks may act at the same time. Robustness under combined attacks is therefore also an important aspect of algorithm performance. Three combined attacks are tested: cropping with Gaussian filtering, rotation with JPEG compression, and H.264 with other attacks. The results are shown in Figures 15–17, respectively. In general, for the three combined attacks, the NC value of the watermark is above 0.9, which leaves a large margin over the threshold of 0.8 and indicates that the algorithm resists combined attacks well.
In Figure 15, the NC value of the watermark extracted under small-scale cropping combined with Gaussian filtering exceeds 0.96, which indicates strong robustness. The NC value is lower under large-scale cropping, and the algorithm is slightly more sensitive to cropping than to Gaussian filtering.
In Figure 16, the experimental results show that the algorithm resists the combined attack of rotation and JPEG compression well. Most of the extracted watermark NC values are about 0.94, and the algorithm is more sensitive to rotation than to JPEG compression.
In Figure 17, under the combined attack of format conversion and other attacks, the NC value of the extracted watermark is relatively high and can be used for matching detection. Further analysis shows that different video frames differ in their sensitivity to the attack, and that the ability to resist the combined attack is related to the image content.
5.2.2. Audio Robustness Test
Watermarked audio may encounter attacks during transmission. Some attacks are unintentional, such as noise; although they may not affect auditory perception, they still affect the reliability of the watermark. Other attacks are intentional, such as cropping, filtering, and compression. The algorithm needs sufficient resistance to these attacks to ensure the reliability and security of the watermark. In this paper, audio robustness experiments are carried out under noise, requantization, resampling, MP3 compression, and other attacks. The results are shown in Table 2. The data in the table show that, under the attacks tested, the watermark NC values obtained by the algorithm are all above 0.91 and most are above 0.99, so the robustness of the algorithm is good.
5.3. Comparison Experiment with Similar Algorithms
This paper carries out experiments on the similar algorithms in [26, 27] and compares them with the proposed algorithm. Reference [26] selects the audio segment according to the local time domain characteristics of the audio signal and uses DWT and SVD to construct a zero watermark for the selected audio segment. Reference [27] is a zero-watermarking method based on DWT-DCT-SVD. The algorithm in this paper differs from these two algorithms in its feature extraction and decomposition methods.
In the experiment, two different styles of audio signals, classical and pop, are selected as the original audio carriers. They are mono audio signals with a sampling frequency of 44.1 kHz and a quantization accuracy of 16 bits. References [26, 27] adopt a 32 × 32 fixed binary watermark image, and the algorithm in this paper adopts a 32 × 32 binary video watermark image generated from the video. The attack results of this algorithm and the two comparison algorithms are shown in Table 3. The table shows that the proposed algorithm has excellent robustness against Gaussian noise, requantization, resampling, and low-pass filtering attacks. Its results under these attacks are better than those of the comparison algorithms, and the advantage is more prominent under MP3 compression attack.
6. Conclusion
In this paper, a zero-watermarking algorithm for audio and video matching based on NSCT is proposed, which can detect whether the audio and video are mismatched due to tampering attacks, locate the mismatched audio and video segments, and help protect and identify digital media. The algorithm uses NSCT, DCT, SVD, and Schur decomposition to extract video and audio features and synthesizes them into a zero watermark. The experimental results show that the algorithm not only has strong robustness to common single attacks but also resists combined attacks well. Information security is a subject of constant development and change.
With the advancement of technology, the forms and types of attacks are also changing and improving. Resistance to existing and new attacks, robustness, and the speed and accuracy of localization all need continuous research and improvement.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The research was supported by the Scientific Research Project of National Language Commission (YB135-125) and Key Research and Development Project of Shandong Province (2019GGX101008 and 2016GGX105013).