Abstract

Existing multiterminal audio authentication algorithms for music do not account for abrupt changes in the music signal, which leads to poor tamper detection ability and long authentication times. By analyzing the characteristics and key technologies of wireless networks, a wireless multiterminal audio system is established. Short-term energy is used to capture abrupt changes in the music signal. The music signal is divided into note segments, and chroma features over the 12 semitones are extracted. A robust hash value is computed by nonuniform quantization. The dynamic time warping algorithm is used to align the notes, and the Hamming distance between the hash values of each pair of corresponding notes is computed to obtain an error sequence together with measures of its statistical and time distribution characteristics. Based on these measures, a fuzzy classification method computes the membership degree of the signal in each of two types of operation, which yields the authentication confidence. For music signals that fail authentication, the tampered area is detected, which completes the music multiterminal audio authentication. Experimental results show that the proposed algorithm has good tamper detection ability and effectively shortens the audio authentication time.

1. Introduction

With the maturing of multimedia compression technology and the rapid spread of the Internet, creating, storing, and transmitting digital multimedia works such as images, video, and audio has become extremely convenient [1]. Music, typified by the MP3 format, is now distributed on the Internet on a massive scale. Modern audio processing tools make it very simple to process digital audio signals, but this also means that the semantic content of audio can be tampered with at low cost [2]. For example, the semantics of an audio signal may change fundamentally after a few small segments are rearranged or removed, and such changes cannot be detected by human auditory perception alone. Audio authentication is an effective technical means of protecting the integrity and authenticity of music, voice, and other audio data [3]. It ensures that the audio data received after transmission has not been maliciously edited or tampered with by a third party, that is, that it is perceptually identical to the original audio. The technology is widely used in many fields: government departments, national security, court defense, trade secrets, news, recorded speech, music recording and distribution, the military, and so on.

For secure multimedia applications, protecting and authenticating the authenticity and integrity of audio content is increasingly important. An ideal audio content authentication system should accurately distinguish content-preserving operations from malicious tampering and accurately locate the tampered area. A large number of scholars have studied music multiterminal audio authentication algorithms and obtained some theoretical results. Reference [4] proposes a dynamic authentication watermarking algorithm for IoT signals to detect network attacks. The watermarking algorithm is based on the long short-term memory (LSTM) structure from deep learning, which enables IoT devices to extract a set of random features from the signals they generate and to embed these features into the signals dynamically. This allows the IoT gateway that collects signals from the IoT devices to verify the reliability of the signals effectively. Reference [5] proposes a physical layer authentication scheme based on angle of arrival (AOA) estimation to cross-verify reported location information. For a multiantenna roadside unit under a location deception attack, the fundamental limit of AOA estimation is derived from the Cramér-Rao bound of AOA estimation and the existence of an efficient estimator. Determining whether the received signal comes from the claimed GPS position is formulated as a two-sided hypothesis testing problem, and its solution is given by the Wald test statistic [6]. Closed-form expressions for the probability of correct decision (PD) and the probability of false alarm (PF) are given. The key to reliable AOA measurement is a highly accurate array response vector. However, the above algorithms do not consider abrupt signal changes, which results in poor tamper detection ability and long audio authentication time [7].

To solve the above problems, a music multiterminal audio authentication algorithm based on a wireless network is proposed. Using the characteristics and key technologies of wireless networks, a wireless multiterminal audio system is established. Short-term energy is used to account for abrupt changes in the music signal: the music signal is segmented, chroma features are extracted, and a robust hash value is computed. The dynamic time warping algorithm is used to align the notes, the Hamming distance between hash values is computed to obtain the error sequence, and the measurement index values are derived from it. The fuzzy classification method is applied to compute the membership degree of the signal, from which the authentication confidence is obtained. The tampered area is then detected, which completes the music multiterminal audio authentication. Experimental results show that the audio authentication time of the proposed algorithm is short and that it has good tamper detection ability [8].

The research contributions of this paper include the following:

(1) By analyzing the characteristics and key technologies of wireless networks, a wireless multiterminal audio system is established.

(2) The dynamic time warping algorithm is used to align the notes, and the Hamming distance between the hash values of each pair of corresponding notes is computed to obtain the error sequence and the measures of its statistical and time distribution features.

(3) Based on the measured values, the fuzzy classification method is used to compute the membership degrees of the signal in two different types of operation, which yields the authentication confidence. The tampered area of any music signal that fails authentication is detected, which completes the music multiterminal audio authentication.

The paper is organized as follows. Section 2 discusses the related wireless network technologies, Section 3 presents the music multiterminal audio authentication algorithm, Section 4 reports the experimental simulation, and Section 5 summarizes the paper.

2. Wireless Network Technology

2.1. Type of Wireless Network

A wireless network is a network that interconnects communication devices without wiring. Wireless network technology covers a wide range, including global voice and data networks that allow users to establish long-distance wireless connections, as well as infrared and RF technologies optimized for short-range wireless connections [9]. According to coverage, wireless networks can be divided into wireless wide area networks (WWAN), wireless local area networks (WLAN), wireless metropolitan area networks (WMAN), wireless personal area networks (WPAN), and wireless mesh networks [10].

(1) WWAN: data communication through mobile communication systems, satellites, etc., with the largest coverage. Representative technologies include 3G, 4G, and 5G, and the typical data transmission rate is above 2 Mb/s. As the 3GPP and 3GPP2 standards matured, international standardization organizations turned to the next-generation mobile communication system, which can provide a higher wireless transmission rate and a flexible, unified all-IP network platform, generally known as Beyond 3G, enhanced IMT-2000, Systems Beyond IMT-2000, or 4G.

(2) WLAN: generally used for wireless communication within a limited area, so its coverage is small. The representative technology is the IEEE 802.11 series, also called WiFi. The data transmission rate is 11~56 Mb/s or even higher.

(3) WMAN: a type of wireless network that connects several wireless LANs. It provides mobile data communication, mainly through mobile phones or vehicle-mounted devices, and can cover most areas of a city [11]. The representative technology is the IEEE 802.20 series, which mainly addresses mobile broadband wireless access (MBWA) technology and the formulation of the relevant standards. The standard places more emphasis on mobility and was developed from the broadband wireless access (BBWA) of IEEE 802.16.

(4) WPAN: at present, there are two wireless personal area network standards: the wireless personal area network (WPAN, IEEE 802.15.1, Bluetooth) and the low-rate wireless personal area network (LR-WPAN, IEEE 802.15.4, ZigBee). Bluetooth devices are generally battery powered, with a coverage radius of 10 meters. ZigBee targets ultralow-power networks, for example, devices that can run for about 10 years without a battery change, and has a coverage radius of 50 meters [12].

(5) Wireless mesh network: a multihop ad hoc network composed of mesh routers and mesh clients. The mesh routers form the backbone network and connect to the wired Internet, providing multihop wireless Internet connections for the mesh clients. A wireless mesh network provides a wider network topology by storing and forwarding messages between access points (APs), and it can extend an existing wired or wireless network. Its most distinctive characteristic is that an AP not only acts as an access point but also stores and forwards messages, playing the role of a wireless router [13].

2.2. Characteristics of Wireless Network

Compared with a wired network, the main feature of a wireless network is that it removes the limitations of wiring and transmits information wirelessly. The specific features of wireless networks are as follows:

(1) Strong mobility: a wireless network transmits network signals by radio waves. Any suitable receiving device within transmission range can connect to the corresponding network. This largely removes the constraints of space and time, which a traditional network cannot do.

(2) Strong expansibility: a wireless network can be extended by wireless signal at any time, so network expansion and configuration are easy to carry out. Users can also access information more efficiently and conveniently. A wireless network not only enlarges the area in which people can use the network but also improves the efficiency with which the network is used [14].

(3) Low cost: installing a wired network is generally more complicated. It requires a large number of network cables and cable connectors, and its later maintenance cost is very high [15]. A wireless network does not require laying a large number of network cables; it only requires installing wireless transmitting devices. This also makes later network maintenance very convenient, which greatly reduces the cost of both installation and maintenance.

2.3. Key Technologies of Wireless Network

A wireless network can sense changes in the external environment, analyze and learn from them, and adjust and configure the relevant resources within the communication network to adapt to those changes. Drawing fully on wireless cognitive network technology can resolve the conflict between the growing demand for spectrum and limited spectrum resources, alleviate the shortage of spectrum resources, and improve the efficiency of spectrum use [16].

(1) Spectrum sharing: spectrum sharing helps users maximize spectrum utilization by managing interference. Sharing schemes can be classified in several ways. By network architecture, they are divided into distributed and centralized schemes: in a centralized scheme, a central server processes the users' information, whereas in a distributed scheme, the cognitive terminals themselves compute which spectrum is idle. By spectrum allocation method, they are divided into cooperative and noncooperative schemes. During spectrum sharing, the filling-type sharing method is adopted, which uses spectrum while it is idle and keeps interference with primary users to a minimum.

(2) Spectrum sensing: spectrum sensing is one of the core technologies of wireless networks. It makes valuable spectrum available to users by discovering spectrum holes in the time domain and the frequency domain. In essence, there are three signal detection methods that can detect the primary user autonomously: cyclostationary feature detection, matched filter detection, and energy detection [17]. Energy detection performs well and is easy to operate, but it is easily restricted by objective factors, which makes the primary signal difficult to identify. Matched filter detection can detect user information effectively and quickly when the user information is known in advance, but it imposes many requirements, such as a dedicated receiver, frequency, and timing synchronization. Cyclostationary feature detection can distinguish noise energy and detect the primary signal, but its computation is complex [18].

(3) Dynamic access: dynamic spectrum access technology can be divided into the open sharing mode, the multilayer access mode, and the dynamic exclusive-use mode. In the dynamic exclusive-use mode, the primary user fully controls the spectrum and can freely choose the technology and service mode. In the open sharing mode, a variety of systems share the spectrum without interfering with each other. Compared with these two modes, the multilayer access mode is free of the impact of the user's transmit power, which not only broadens its scope of application but also further improves information capacity and throughput [19].

2.4. Wireless Multiterminal Audio System

The wireless multiterminal audio system mainly includes three modules: the control point (CP), the digital media renderer (DMR), and the digital media server (DMS). The control points are generally mobile phones, tablet computers, and other intelligent terminals. The DMR is a development board with wireless WiFi capability; this paper uses a Junzheng development board with a MIPS processor. The DMS can be either a home computer or an intelligent terminal. The connections among the modules of the wireless multiterminal audio system are shown in Figure 1.

Figure 1 shows the logical relationship among the modules of the wireless multiterminal audio system, with the wireless network carrying the data output. The control point (mobile terminal) first joins the same LAN as the wireless audio chip, and then the self-developed software is opened. The mobile terminal discovers all available devices on the LAN, and several devices can be selected to join the same group. After the group is formed, audio resources on the mobile terminal or on other servers can be selected for playback [20].

3. Music Multiterminal Audio Authentication Algorithm

3.1. Basic Framework of Music Multiterminal Audio Authentication Algorithm

A music multiterminal audio authentication algorithm is mainly divided into two stages: protection stage and authentication stage. The basic framework of the music multiterminal audio authentication algorithm is shown in Figure 2.

In the protection stage, an effective note segmentation algorithm first divides the music signal into a series of note segments of unequal length. Then, semitone chroma features, which carry rich musical semantic information, are extracted from each note segment. Finally, these feature vectors are converted into binary hash authentication codes by a nonuniform quantization method, and the codes are stored in a trusted third-party certification center for future music authentication. In the authentication stage, the music to be authenticated first goes through the same note segmentation, feature extraction, and hash computation as in the protection stage to obtain its hash sequence. Next, note alignment is carried out to reduce the effect of differences in note segmentation caused by the various distortions introduced during transmission [21]. Then, the Hamming distance between the hash sequence of the current music and the authentication code is computed note by note to obtain an error sequence. From the error sequence, three new metrics reflecting hash difference, statistical characteristics, and time distribution characteristics are computed. Based on the three metrics, the membership degrees of the signal in two different types of operation are computed with a fuzzy classification method, and the final "verification confidence" is obtained. Finally, for music signals that fail authentication, the system also detects the tampered area [22].

3.2. Music Multiterminal Audio Authentication Algorithm Protection Stage
3.2.1. Music Segmentation

This paper uses short-term energy [10] for onset detection and, at the same time, takes into account abrupt changes of the music signal in both the high- and low-frequency parts. The specific steps are as follows:

(1) The silent segments at the beginning and end of the music are removed; silent frames are detected by their short-term energy.

(2) The music is decomposed into the frequency bands listed in Table 1, which yields five subband signals.

(3) The signal in each subband is divided into frames with a length of 60 ms, with 50% overlap between adjacent frames. For the medium- and high-frequency part of the signal (subbands 2~5), the onset detection function is defined from the energy difference between frames. The frame energy is the short-term energy

\[ E_k(i) = \sum_{n=1}^{N} x_{k,i}(n)^2 \qquad (1) \]

in which $x_{k,i}(n)$ is the amplitude of the $k$-th subband signal at time $n$ within frame $i$, $E_k(i)$ is the frame energy of the $k$-th subband signal, and $N$ is the frame length. The onset detection function $F_k(i)$ of the $k$-th subband signal, formula (2), combines the energy differences between the current frame and the preceding frames. Formula (2) weights these differences exponentially: the closer a frame is to the current time, the higher its weight [23].

For the low-frequency part of the signal (subband 1), the energy change is not obvious, so the detection function $F_1(i)$ is instead defined from the change of the spectral coefficients between frames [11], using $X_1(i, m)$, the $m$-th FFT coefficient of the $i$-th frame of the first subband signal, and the Fourier transform length $M$.

(4) The detection functions of all subbands are combined by weighted summation to obtain the total onset detection function

\[ F(i) = \sum_{k=1}^{5} w_k F_k(i) \qquad (4) \]

in which $w_k$ is the weighting coefficient of subband $k$.

(5) The peak points of the total detection function are located to determine the positions of the note onsets in the music.

According to the above detection results, a music segment roughly corresponding to a note length is formed between two onsets [24].
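The segmentation steps above can be illustrated with a minimal single-band sketch. It is not the paper's implementation: it omits the five-subband decomposition, the exponential weighting, and the spectral detection function for the low band, and it assumes a plain half-wave rectified energy difference as the detection function. The names `frame_energy`, `onset_function`, and `pick_onsets` and the parameter `rel_threshold` are hypothetical.

```python
import numpy as np

def frame_energy(x, frame_len, hop):
    """Short-term energy of each frame: sum of squared samples."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def onset_function(energy):
    """Assumed detection function: positive part of the energy
    difference between adjacent frames (no exponential weighting)."""
    diff = np.diff(energy, prepend=energy[0])
    return np.maximum(diff, 0.0)

def pick_onsets(odf, rel_threshold=0.3):
    """Frame indices of local maxima above rel_threshold * max(odf)."""
    if odf.size < 3 or odf.max() <= 0:
        return []
    thr = rel_threshold * odf.max()
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > thr and odf[i] > odf[i - 1] and odf[i] >= odf[i + 1]]

# Example: 0.5 s of silence followed by a 0.5 s tone at 8 kHz,
# with 60 ms frames (480 samples) and 50% overlap (hop of 240 samples).
sr = 8000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 440 * t)])
frame_len, hop = 480, 240
odf = onset_function(frame_energy(signal, frame_len, hop))
onsets = pick_onsets(odf)
onset_times = [i * hop / sr for i in onsets]
```

In this toy signal, the single detected onset falls within one frame of the true onset at 0.5 s; the note segments are then the spans between consecutive onset times.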

3.2.2. Chroma Feature Extraction

For a music content authentication system, it is very important to select features that fully reflect the semantic information of the music. This paper uses chroma features, which are widely used in music retrieval, music structure analysis, and other fields [12]. The feature projects the whole spectral distribution of the music signal onto the 12 semitones of a complete octave, and it therefore carries rich information about scale distribution and melodic trend. Chroma is thus a 12-dimensional feature vector, computed frame by frame: the $p$-th component $c(p)$ of the chroma vector of the signal $x$ accumulates the spectral energy of those FFT coefficients $X(k)$ of $x$ whose frequency $k f_s / N$ falls in pitch class $p$ relative to the reference frequency $f_{ref}$, in which $N$ is the Fourier transform length and $f_s$ is the sampling frequency [25].

In the implementation, each note segment produced by note segmentation is first divided into fixed-length frames with a frame length of 256. Then, the chroma feature is extracted from each frame, and the mean of the chroma features over all frames in a note segment is taken as the feature of the whole note. At this point, each note segment of the music signal is represented by a 12-dimensional feature vector.
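A minimal sketch of this per-note chroma computation follows. It assumes one common form of the chroma mapping, in which each FFT bin is assigned to the nearest semitone relative to a reference frequency and its power is added to the matching pitch class. The reference frequency of 261.63 Hz (C4) and the function names are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def chroma_frame(frame, fs, f_ref=261.63):
    """12-bin chroma of one frame: FFT power folded onto pitch classes."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    chroma = np.zeros(12)
    for power, f in zip(spectrum[1:], freqs[1:]):  # skip the DC bin
        # Distance from f_ref in semitones, folded into one octave.
        pitch_class = int(round(12 * np.log2(f / f_ref))) % 12
        chroma[pitch_class] += power
    return chroma

def note_chroma(segment, fs, frame_len=256):
    """Mean chroma over the fixed-length frames of one note segment."""
    segment = np.asarray(segment, dtype=float)
    n = len(segment) // frame_len
    if n == 0:
        raise ValueError("segment shorter than one frame")
    frames = segment[:n * frame_len].reshape(n, frame_len)
    return np.mean([chroma_frame(fr, fs) for fr in frames], axis=0)

# Example: a 440 Hz tone (A4) lies 9 semitones above C4.
fs = 8000
t = np.arange(2048) / fs
tone = np.sin(2 * np.pi * 440.0 * t)
c = note_chroma(tone, fs)
```

With C as pitch class 0, the largest component of `c` for this tone is pitch class 9 (A). Averaging over frames is what makes the feature a property of the note rather than of a single frame.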

3.2.3. Robust Hash Computation

Finally, the extracted chroma features are quantized to generate a 36-bit authentication code for each note segment. The specific steps are as follows:

(1) The chroma vector is normalized. Each component $c_i$ of the feature, corresponding to the $i$-th semitone, is mapped to the component $\bar{c}_i$ of the normalized feature vector. After normalization, the value of each component lies between 0 and 1.

(2) The normalized chroma features are quantized nonuniformly by formula (7), in which $q_i$ is the $i$-th component of the quantized vector and $\lfloor \cdot \rfloor$ returns the largest integer not greater than its argument.

(3) Each $q_i$ is expressed as the corresponding 3-bit binary number, and all the binary bits are concatenated to form a 36-bit hash code.

Quantizing the music features not only reduces the storage space occupied by the authentication information but also improves robustness against various signal processing operations. The hash sequence of the original music is stored in a trusted third-party certification center for future music authentication.
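The three hashing steps can be sketched as below. The exact normalization and quantization formulas are not recoverable from the text, so this sketch assumes normalization by the largest component and a square-root companding curve before the floor operation as the nonuniform quantizer. The function name `robust_hash` and the parameter `gamma` are hypothetical.

```python
import math

def robust_hash(chroma, gamma=0.5):
    """Map a 12-dimensional chroma vector to a 36-bit string.

    Assumptions: components are normalized by the maximum, then quantized
    to 8 levels with floor(7 * value ** gamma). A gamma below 1 gives finer
    resolution to small components, which is one way to be nonuniform.
    """
    if len(chroma) != 12:
        raise ValueError("expected a 12-dimensional chroma vector")
    peak = max(chroma)
    if peak <= 0:
        return "0" * 36
    bits = []
    for value in chroma:
        norm = value / peak                         # now in [0, 1]
        level = min(7, math.floor(7 * norm ** gamma))
        bits.append(format(level, "03b"))           # 3 bits per component
    return "".join(bits)

# Example: energy mostly in pitch class 9, some in pitch class 0.
example = [0.0] * 12
example[9] = 4.0
example[0] = 1.0
code = robust_hash(example)
```

Here the dominant component maps to `111`, the component at one quarter of the peak maps to `011`, and all empty components map to `000`. Small perturbations of the chroma values usually leave the quantization levels unchanged, which is the source of the robustness.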

3.3. Music Multiterminal Audio Authentication Algorithm Authentication Stage
3.3.1. Note Alignment

In this paper, a dynamic time warping algorithm is used to obtain the most reasonable correspondence between two note sequences. The distance between two notes is defined as the Hamming distance between their hash values:

\[ d(i, j) = \sum_{k=1}^{36} h_i(k) \oplus h'_j(k) \]

in which $d(i, j)$ is the Hamming distance between the $i$-th note of the original music and the $j$-th note of the music to be authenticated, $h_i$ is the hash code of the $i$-th note of the original music, and $h'_j$ is the hash code of the $j$-th note of the music to be authenticated. After note alignment, the Hamming distance between the hash values of each pair of corresponding notes is computed, and content integrity is then judged from the sequence of these distance values.
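The alignment step can be sketched with the textbook dynamic time warping recursion using the Hamming distance as the local cost. This is a generic DTW, not necessarily the variant used in the paper; hash codes are represented as bit strings, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def dtw_align(ref, test):
    """Align two non-empty hash sequences.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching note i of ref to note j of test.
    """
    n, m = len(ref), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = hamming(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match
                                 cost[i - 1][j],      # ref note repeated
                                 cost[i][j - 1])      # test note repeated
    # Backtrack from the end along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# Example: the first note of the original was split in two by segmentation.
original = ["000", "111", "101"]
received = ["000", "000", "111", "101"]
total, path = dtw_align(original, received)
```

In this example the alignment maps both halves of the split note back to the first original note at zero cost, so a segmentation difference alone does not produce hash errors.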

3.3.2. Measurement Index

After note alignment, the Hamming distance between the hash values of each pair of corresponding notes is computed, and all the distance values form a sequence $D$. The possible tampering points in $D$ are defined as those points whose distance value exceeds a set threshold $T$. They are denoted as $P$ points, and their index values are stored in a vector $V$:

\[ V = \{\, i \mid D(i) > T \,\} \]

in which $D(i)$ is the distance value of the $i$-th note and $T$ is an adaptive threshold whose setting takes into account the values of $D$ under acceptable operations and under malicious tampering. The threshold is used to judge roughly whether the signal is seriously distorted. Based on these concepts, three measurement indicators reflecting the statistical and time distribution characteristics of the sequence are defined. They discriminate strongly between content-preserving operations and malicious tampering.

(1) Average distortion (AD): the average distortion of a signal to be authenticated is defined as the average bit error rate of all points. It reflects the degree of change in the music content. Malicious tampering usually produces a larger AD value than allowable operations.

(2) Uniformity measure (UD): the uniformity measure reflects how uniformly the errors are distributed. Let $g_j$ denote the number of notes between the $j$-th pair of adjacent $P$ points; UD is defined as the standard deviation of $g_j$. Under malicious tampering, the $P$ points are distributed more unevenly and the UD value is larger; under allowable operations, the UD value is smaller.

(3) Maximum connected area size (MC): a group of consecutive, densely spaced $P$ points in the sequence $D$ forms a connected area, and the size of a connected area is defined as the number of $P$ points it contains. MC is the number of $P$ points in the largest connected area. In general, the MC value under malicious tampering is much larger than under allowable operations. This is because under malicious tampering the errors tend to be tightly concentrated in a few local areas, whereas under allowable operations the errors are usually scattered over the entire timeline.
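The three indicators can be sketched as follows. Several details are assumptions made for illustration: the average distortion is taken over every note in the sequence with a 36-bit hash length, the standard deviation is the population form, and two suspicious points belong to the same connected area when they are at most `max_gap` notes apart. All function names are hypothetical.

```python
import statistics

def suspicious_points(distances, threshold):
    """Indices of notes whose Hamming distance exceeds the threshold."""
    return [i for i, d in enumerate(distances) if d > threshold]

def average_distortion(distances, hash_bits=36):
    """Mean bit error rate, assumed here to run over all notes."""
    return sum(distances) / (hash_bits * len(distances))

def uniformity(points):
    """Standard deviation of the spacing between adjacent suspicious points.

    Index differences are used; subtracting 1 from every gap to count the
    notes strictly in between would not change the standard deviation.
    """
    gaps = [b - a for a, b in zip(points, points[1:])]
    return statistics.pstdev(gaps) if gaps else 0.0

def max_connected(points, max_gap=1):
    """Size of the largest run of suspicious points within max_gap notes."""
    if not points:
        return 0
    best = run = 1
    for a, b in zip(points, points[1:]):
        run = run + 1 if b - a <= max_gap else 1
        best = max(best, run)
    return best

# Example error sequence with a dense burst at notes 2..4.
distances = [0, 1, 20, 22, 19, 0, 2, 0, 18, 0]
points = suspicious_points(distances, threshold=10)
ad = average_distortion(distances)
ud = uniformity(points)
mc = max_connected(points)
```

For this sequence the suspicious points are notes 2, 3, 4, and 8, and the burst at notes 2 to 4 gives a largest connected area of 3, the kind of local concentration the text associates with malicious tampering.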

3.3.3. Music Content Certification

In the fuzzy classification method, membership functions are first defined for each of the three indexes to describe the degree to which its value belongs to the two fuzzy sets "large" and "small." The membership function of the average distortion gives the membership degrees for the AD value being small and being large, and it is parameterized by the two thresholds discussed above. The membership function of the uniformity measure likewise gives the degrees of membership for the UD value being small and being large. It has two parameters: one is the mean UD value of the music signal computed over a series of content-preserving operations and malicious tampering, which marks the transition point of the curve, and the other controls the shape of the entire curve, especially how fast it changes at the transition point. In this paper, the two parameter values are 25 and 0.08. The membership function of the maximum connected region size gives the membership degrees for the MC value being small and being large, and its two parameters are also set to fixed values. Given a set of metric values $x = (AD, UD, MC)$, its membership degree in each fuzzy class is a weighted combination of the individual membership degrees, which gives the degree to which $x$ belongs to the class of small values and the degree to which it belongs to the class of large values. The weight vector reflects the importance of each index for content integrity authentication and is determined by experiments. From these membership degrees, the final authenticity measure and nonauthenticity measure of the music signal are computed, with weights that represent the contributions of the classes to authentication passing and failing, respectively. Finally, the authentication credibility measure is defined from these two measures. If the credibility measure lies above the decision point, the music content is more likely to be authentic, and its authenticity increases as the value increases. If it lies below the decision point, the music content is more likely to have been maliciously tampered with, and the possibility of tampering increases as the value decreases. If it lies exactly at the decision point, the system cannot make a decision.
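The fuzzy decision can be illustrated with a small sketch. The membership functions themselves are not recoverable from the text, so this sketch assumes a logistic curve for the "large" set, its complement for the "small" set, a decision point of zero, and the difference of the two weighted sums as the credibility measure. The values 25 and 0.08 come from the text, but treating 25 as the slope and 0.08 as the transition point is an assumption, and every other parameter value and name below is hypothetical.

```python
import math

def large_membership(x, slope, midpoint):
    """Assumed logistic membership in the fuzzy set 'large'."""
    z = max(-60.0, min(60.0, slope * (x - midpoint)))  # avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def authentication_confidence(metrics, params, weights):
    """Weighted 'small' memberships minus weighted 'large' memberships.

    metrics, params, and weights are dicts keyed by 'AD', 'UD', and 'MC'.
    A positive result leans authentic; a negative result leans tampered.
    """
    authentic = 0.0
    tampered = 0.0
    for name, value in metrics.items():
        slope, midpoint = params[name]
        large = large_membership(value, slope, midpoint)
        authentic += weights[name] * (1.0 - large)
        tampered += weights[name] * large
    return authentic - tampered

# Hypothetical parameters; only the pair (25, 0.08) appears in the text.
params = {"AD": (40.0, 0.15), "UD": (25.0, 0.08), "MC": (1.5, 4.0)}
weights = {"AD": 1 / 3, "UD": 1 / 3, "MC": 1 / 3}

mild = authentication_confidence(
    {"AD": 0.02, "UD": 0.01, "MC": 1}, params, weights)
severe = authentication_confidence(
    {"AD": 0.40, "UD": 0.50, "MC": 9}, params, weights)
```

Small metric values give a positive confidence and large values a negative one; at the transition point of a curve the two memberships are equal, so that index alone contributes nothing to the decision.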

3.3.4. Tamper Detection

For nonauthentic signals that fail to pass the music content authentication, all connected areas whose amplitude values are not less than a given threshold are marked in the sequence. These areas correspond to the tampered part of the test signal. In this paper, the connected area is defined by the concept of 8 neighborhoods; therefore, the error of tamper detection is about 4 notes in the worst case (generally 1~2 s). Through the above steps, the music multiterminal audio authentication is completed.
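A minimal sketch of this localization step is given below. It assumes that the 8-neighborhood rule means that two flagged notes at most 4 notes apart belong to the same connected area, which is consistent with the worst-case error of about 4 notes stated above but is not confirmed by the text. The function name and the `max_gap` parameter are hypothetical.

```python
def tampered_regions(distances, threshold, max_gap=4):
    """Return (start, end) note-index ranges, inclusive, of suspected tampering.

    A note is flagged when its distance is not less than the threshold;
    flagged notes at most max_gap notes apart are merged into one region.
    """
    regions = []
    for i, d in enumerate(distances):
        if d < threshold:
            continue
        if regions and i - regions[-1][1] <= max_gap:
            regions[-1][1] = i          # extend the current region
        else:
            regions.append([i, i])      # start a new region
    return [tuple(r) for r in regions]

# Example: a burst around notes 2..8 and an isolated error at note 15.
distances = [0, 1, 20, 22, 19, 0, 2, 0, 18, 0, 0, 0, 0, 0, 0, 25]
regions = tampered_regions(distances, threshold=10)
```

Note indices can be converted to time ranges with the onset times from note segmentation, so each region is reported as an interval of the test signal.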

4. Experimental Analysis

4.1. Experimental Environment and Data

To verify the effectiveness of the wireless-network-based multiterminal audio authentication algorithm for music, this paper uses the voice set of the TIMIT voice library: 1280 speech segments of 4 s each, of which 600 are in English and 680 are in Chinese. The speech parameters are as follows: a sampling rate of 16000 Hz, a bit rate of 256 kbps, a single audio channel, a sampling precision of 16 bits, WAV format, a frame length of 20 ms, and a frame shift of 10 ms. The experimental hardware platform is an Intel Core i3 2450M processor with 8.00 GB of memory and an 800 GB hard disk, and the experimental environment is Matlab R2012b under the Windows 7 operating system.

4.2. Tamper Detection and Analysis

When a piece of music fails authentication, it is often necessary to detect the tampered area. Taking the classical piece "The Blue Danube" as an example, the algorithm of reference [4], the algorithm of reference [5], and the proposed algorithm are each used to detect the tampered area. The detection results of the different algorithms are shown in Figure 3.

According to Figure 3, the algorithm of reference [5] has the poorest tamper detection ability and detects no tampered area. The algorithm of reference [4] performs somewhat better and locates one tampered area. The proposed algorithm has good tamper detection ability and effectively distinguishes admissible operations from malicious tampering. In addition, it accurately locates most of the tampered areas with the note as the smallest unit, so its tamper detection accuracy is high.

4.3. Audio Authentication Time Analysis

One hundred speech segments are randomly drawn from the speech database, and the algorithm of reference [4], the algorithm of reference [5], and the proposed algorithm are each used for audio authentication. The audio authentication times of the different algorithms are shown in Figure 4.

According to the simulation results in Figure 4, the audio authentication time of every algorithm increases as the amount of audio increases. The proposed algorithm is more sensitive to audio changes, and its authentication time is shorter than that of the algorithms in references [4] and [5]. When the audio data volume reaches 100 MB, the audio authentication time of the algorithm in reference [4] is 67 s, that of the algorithm in reference [5] is 59 s, and that of the proposed algorithm is only 32 s. The proposed algorithm therefore has a short audio authentication time.

5. Conclusion

This paper studies multiterminal audio authentication for music based on a wireless network. Based on the characteristics and key technologies of wireless networks, a wireless multiterminal audio system is established. Short-term energy is used to account for abrupt changes in the music signal; the music signal is segmented, chroma features are extracted, and a robust hash value is computed. The dynamic time warping algorithm aligns the notes, the measurement index values are computed, fuzzy classification is used to compute the membership degree of the signal and obtain the authentication confidence, and the tampered area is detected, which completes the music multiterminal audio authentication. The experimental simulation shows that the proposed algorithm effectively improves tamper detection ability and shortens the audio authentication time. However, the work still has some limitations. For example, the proposed algorithm does not consider different types of audio signals, and whether its performance differs across signal types remains open. In future research, the proposed algorithm will therefore be extended to content authentication of other types of audio signals, with the aim of obtaining an algorithm suitable for multiple signal types.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.