Abstract
Packet loss resilience and security are two major challenges in real-time audio transmission over IP networks. Because compressive sensing (CS) can recover a signal from a small set of samples and its acquisition process is inherently random, it holds great promise for addressing both problems. In this paper, we propose a secure and packet loss-resistant real-time audio transmission framework (CS-SPT) based on the principle of CS. Inspired by the interleaving technique, the proposed CS-SPT adopts an ultralow complexity scrambling matrix that improves packet loss resilience by increasing the information redundancy. Moreover, the energy of the ciphertext is homogenized using a diffusion operation. Experimental results show that, compared with existing methods, the proposed CS-SPT not only improves packet loss resilience significantly but also resists several major attacks, such as ciphertext-only attacks (COAs) and known-plaintext attacks (KPAs).
1. Introduction
With the advent of the 5G era, the development of streaming media and network technology has sharply increased the demand for real-time transmission of audio content over the Internet. Voice over Internet Protocol (VoIP) has emerged as an alternative to traditional voice communication over the Public Switched Telephone Network (PSTN) [1]. The popularization of VoIP has improved our standard of living and enabled people to converse freely, more conveniently, and at lower cost.
One of the main challenges of VoIP is that the “best-effort” delivery service of the IP network makes packet loss difficult to avoid, especially in weak network environments, so it is hard to guarantee the quality of the audio service without sacrificing its real-time characteristic. Moreover, because audio data are transmitted over the IP network, sensitive information may be revealed, which can seriously affect user privacy. Therefore, security is a key enabling technology for audio transmission [2].
Traditional solutions to the packet loss problem can be divided into two categories, namely, sender-based methods and receiver-based methods [3]. Sender-based methods generally adopt redundant transmission, including Automatic Repeat reQuest (ARQ), interleaving, and forward error correction (FEC) [3]. ARQ retransmits lost packets when packet loss occurs; however, this inevitably increases bandwidth consumption and delay. Interleaving scrambles the data to avoid the poor audio quality caused by the loss of consecutive data, whereas FEC adds redundant information to the bitstream so that transmission errors can be corrected. These methods can alleviate the audio quality degradation caused by packet loss to a certain extent. However, at high packet loss rates (PLRs), their performance drops sharply as the PLR increases [4, 5].
Classical receiver-based methods include insertion, interpolation, and regeneration. Insertion repairs a lost packet by inserting a simple fill-in, such as a silence or noise packet, whereas interpolation uses the packets surrounding a loss to produce a replacement for the lost packets [5]. These methods can improve the audio quality at low PLRs but, similarly, cannot meet practical application demands at high PLRs. The last decade has witnessed the widespread success of compressive sensing (CS) [6] in reconstructing high-dimensional sparse signals from only a small number of measurements, which enables sampling below the Nyquist rate and near-optimal signal reconstruction simultaneously. To mitigate the packet loss problem, a few solutions [7, 8] have therefore been proposed that recover the original audio based on CS by exploiting its sparsity in some domain (e.g., the wavelet domain). However, despite this prior work, CS-based methods [7, 8] are still in their infancy.
Recently, joint compression-encryption frameworks [9, 10] based on CS have attracted wide attention because CS measurements form an encrypted representation of the original data. Pudi et al. [9] proposed a secure encryption method that combines a stream cipher with compressive sensing. Unde et al. [10] further proposed a CS-based lightweight cryptosystem for Internet of Things (IoT) applications, which improves energy efficiency without sacrificing security. Although CS can guarantee a certain degree of security, it cannot provide perfect secrecy because of its linearity [10]. Wang et al. [11] proposed a chaotic encryption algorithm based on DNA coding to confuse and diffuse audio data. Although these algorithms effectively improve security, they were studied only from the perspective of security.
In this paper, we propose a secure and packet loss-resistant real-time audio transmission framework (CS-SPT) based on the principle of CS. Inspired by the interleaving technique, the proposed CS-SPT adopts an ultralow complexity scrambling matrix that improves packet loss resilience by increasing the information redundancy. Moreover, the energy of the ciphertext is homogenized using a diffusion operation. Experimental results show that, compared with existing methods, the proposed CS-SPT not only improves packet loss resilience significantly but also resists several major attacks, such as COAs and KPAs.
Notation: We briefly introduce the notation used in this paper. Scalars are denoted by lowercase or uppercase letters, vectors by boldfaced lowercase letters, and matrices by boldfaced uppercase letters. The columns of a matrix can be joined to form a vector. Indices identify the entries of a vector, the rows and columns of a matrix, and the individual matrices of a sequence. Separate symbols denote the signal after packet loss and the recovered data.
2. Background and Related Technology
2.1. Transmission Model
The VoIP protocol first performs analog-to-digital conversion on an audio signal to obtain a digital signal. The digital signal is then compressed and encoded, and each resulting speech frame is encapsulated into an IP packet and sent over the network to the receiver. On the network, data packets are sent in multicast mode. When the network is in a weak condition, information is likely to be blocked, and the audio quality deteriorates because of the long delay of VoIP packet switching. During the transmission of audio stream data, the “best-effort” delivery service of the IP network makes packet loss difficult to avoid [3]. Figure 1 depicts a packet loss scenario for real-time audio transmission. The packet loss rate (PLR) is defined as follows:
$$\mathrm{PLR} = \frac{N - N'}{N} \times 100\%, \tag{1}$$
where $N$ represents the length of the signal and $N'$ represents the signal length after packet loss.
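As a small illustration of the PLR definition, the following sketch computes the loss rate from the signal length before and after packet loss. It assumes that the PLR is the fraction of the signal length that was lost; the function name `packet_loss_rate` is a hypothetical helper introduced only for this example.

```python
def packet_loss_rate(sent_length: int, received_length: int) -> float:
    """Fraction of the signal lost in transit (assumed PLR definition)."""
    if sent_length <= 0:
        raise ValueError("sent_length must be positive")
    if not 0 <= received_length <= sent_length:
        raise ValueError("received_length must lie in [0, sent_length]")
    return (sent_length - received_length) / sent_length


# Example: 1024 samples were sent and 768 of them arrived.
plr = packet_loss_rate(1024, 768)
print(f"PLR = {plr:.0%}")
```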

The receiving end decompresses the received data packet before converting it to an audio stream via digital-to-analog conversion. The VoIP transmission process is shown in Figure 2.

2.2. Related Technology
2.2.1. Interleaving
In the interleaving stage, an interleaving matrix is used to separate originally adjacent frames, and the original order is restored at the receiving end. This method reduces the loss of large blocks of data and disperses the impact of packet loss on the audio quality. For example, without loss of generality, suppose that the source signal is divided into $n$ data packets and each packet contains $m$ frames of data. The source data can then be expressed as follows:
$$\mathbf{X} = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_n], \quad \mathbf{p}_i = [f_{i,1}, f_{i,2}, \ldots, f_{i,m}], \tag{2}$$
where $\mathbf{p}_i$ represents the $i$-th data packet and $f_{i,j}$ denotes the $j$-th frame in the $i$-th packet, with $1 \le i \le n$ and $1 \le j \le m$.
Interleaving essentially applies a permutation matrix to the source data to obtain the interleaved sequence, which is then divided into packets for transmission. The interleaved stream may lose some packets during communication; for example, if two packets are lost, the receiver gets only the remaining packets, and each lost packet is marked as Null. The receiver then applies the inverse of the interleaving transform to obtain a subset of the source data in which each null entry corresponds to a single frame of the audio signal. Consequently, packet loss in the interleaved stream causes multiple small gaps in the reconstructed stream rather than one large gap, which reduces the impact of packet loss on audio quality.
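The dispersal effect of interleaving can be sketched with a simple block interleaver. The example below is only an illustration under stated assumptions: it uses a transpose-style permutation (interleaved packet j collects frame j of every source packet), and the names `interleave` and `deinterleave` are hypothetical helpers rather than part of any cited method.

```python
from typing import List, Optional

Frame = str
Packet = List[Frame]


def interleave(packets: List[Packet]) -> List[Packet]:
    """Block interleaver: interleaved packet j collects frame j of every source packet."""
    return [list(frames) for frames in zip(*packets)]


def deinterleave(received: List[Optional[Packet]], n_packets: int) -> List[List[Optional[Frame]]]:
    """Undo the interleaving; frames carried by lost packets (None) stay None."""
    n_frames = len(received)
    restored: List[List[Optional[Frame]]] = [[None] * n_frames for _ in range(n_packets)]
    for j, packet in enumerate(received):
        if packet is None:
            continue
        for i, frame in enumerate(packet):
            restored[i][j] = frame
    return restored


# Three source packets with four frames each.
source = [[f"f{i}{j}" for j in range(1, 5)] for i in range(1, 4)]
sent = interleave(source)
# Simulate the loss of the second interleaved packet.
received = [pkt if idx != 1 else None for idx, pkt in enumerate(sent)]
restored = deinterleave(received, n_packets=len(source))
for row in restored:
    print(row)
```

Each restored source packet misses exactly one frame, instead of one source packet being lost entirely.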
2.2.2. CS-Based Method
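CS theory states that a sparse or compressible signal can be reconstructed exactly from a small number of linear, nonadaptive measurements [12]. Specifically, CS allows a signal $\mathbf{x} \in \mathbb{R}^{N}$ to be reconstructed from a few measurements $\mathbf{y} \in \mathbb{R}^{M}$ ($M < N$) when $\mathbf{x}$ can be represented as follows:
$$\mathbf{x} = \mathbf{\Psi}\mathbf{s}, \tag{6}$$
where $\mathbf{\Psi}$ is a sparsifying dictionary and $\mathbf{s}$ is a $K$-sparse signal. In the CS framework, $\mathbf{y}$ is obtained by projecting the signal $\mathbf{x}$ onto a sensing matrix $\mathbf{\Phi} \in \mathbb{R}^{M \times N}$, which can be given as follows:
$$\mathbf{y} = \mathbf{\Phi}\mathbf{x}. \tag{7}$$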
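Exact reconstruction of the signal $\mathbf{x}$ is possible if the number of measurements satisfies $M \ge cK\log(N/K)$, where $N$ is the signal length, $K$ is the sparsity level, and $c$ is some constant. The sparsity of the coefficient vector $\mathbf{s}$ allows recovery using the $\ell_1$-minimization method (basis pursuit (BP)) given below [6]:
$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{s}\|_1 \quad \text{subject to} \quad \mathbf{y} = \mathbf{\Theta}\mathbf{s}, \tag{8}$$
where $\mathbf{y}$ is the measurement vector and $\mathbf{\Theta} = \mathbf{\Phi}\mathbf{\Psi}$ is the measurement matrix formed from the sensing matrix $\mathbf{\Phi}$ and the sparsifying dictionary $\mathbf{\Psi}$. We denote by $\hat{\mathbf{s}}$ the solution of the convex optimization problem in (8), which is referred to as CS-Rec. The original signal can then be reconstructed as follows:
$$\hat{\mathbf{x}} = \mathbf{\Psi}\hat{\mathbf{s}}. \tag{9}$$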
Based on CS and interleaving, Ciaramella et al. [7] introduced a packet loss recovery method, namely, CS-I, for audio multimedia streaming. We summarize the process of CS-I in Figure 3. First, the original signal is scrambled to obtain an interleaved stream. The interleaved stream is then transmitted in a weak network environment, where packet loss occurs. The received data can be expressed as the product of a packet loss matrix, the scrambling matrix, and the original signal, where the packet loss matrix is obtained by removing some rows from the identity matrix; the received data are thus a subset of the interleaved stream. For example, for a finite-dimensional signal, we obtain the interleaved stream by applying a scrambling matrix. In a weak network, assuming that the 3rd packet and one other packet are lost, the packet loss process is shown in Figure 4. This process is equivalent to applying an underdetermined matrix to measure the interleaved signal, where the underdetermined matrix is constructed by removing from an identity matrix the rows that correspond to the lost packets. In this process, the sensing matrix is the product of the packet loss matrix and the scrambling matrix. Thus, based on CS-Rec, the lost data can be reconstructed.
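The view of packet loss as a CS measurement can be sketched in a few lines. The example below is a minimal illustration, not the CS-I implementation of [7]: it assumes a DCT sparsifying dictionary, a simple stride permutation as the scrambling matrix, and a synthetic 3-sparse signal. Instead of exact basis pursuit, it uses iterative soft thresholding (ISTA) on the $\ell_1$-regularized least-squares relaxation, because ISTA needs only the standard library. All function names and parameter values are hypothetical.

```python
import math


def dct_dictionary(n):
    """Orthonormal DCT synthesis matrix Psi (n x n), so that x = Psi s."""
    psi = []
    for t in range(n):
        row = []
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            row.append(scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n)))
        psi.append(row)
    return psi


def matvec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def objective(theta, y, s, lam=0.01):
    """0.5 * ||y - theta s||^2 + lam * ||s||_1."""
    residual = [y_i - p_i for y_i, p_i in zip(y, matvec(theta, s))]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(v) for v in s)


def ista(theta, y, lam=0.01, iterations=300):
    """Iterative soft thresholding with unit step size.

    The unit step is valid here because theta consists of rows of an
    orthonormal matrix, so its spectral norm is at most 1.
    """
    n = len(theta[0])
    s = [0.0] * n
    for _ in range(iterations):
        residual = [y_i - p_i for y_i, p_i in zip(y, matvec(theta, s))]
        # Gradient step: s + theta^T residual.
        step = [s[k] + sum(theta[i][k] * residual[i] for i in range(len(y)))
                for k in range(n)]
        # Soft thresholding promotes sparsity.
        s = [math.copysign(max(abs(v) - lam, 0.0), v) for v in step]
    return s


n = 32
psi = dct_dictionary(n)
s_true = [0.0] * n
s_true[1], s_true[3], s_true[5] = 2.0, 1.5, 1.0     # 3-sparse in the DCT domain
x = matvec(psi, s_true)

perm = [(7 * i) % n for i in range(n)]               # assumed scrambling (7 is coprime to 32)
lost_positions = {8, 9, 10, 11}                      # one lost packet of the interleaved stream
kept = [i for i in range(n) if i not in lost_positions]

y = [x[perm[i]] for i in kept]                       # received data: loss matrix * scrambling * x
theta = [psi[perm[i]] for i in kept]                 # loss matrix * scrambling * Psi

s_hat = ista(theta, y)
x_hat = matvec(psi, s_hat)
worst = max(abs(x_hat[perm[i]] - x[perm[i]]) for i in lost_positions)
print(f"largest error on the lost samples: {worst:.4f}")
```

Because the scrambling spreads the four lost positions over the original signal, the remaining 28 samples act as 28 measurements of a 3-sparse coefficient vector.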


3. Proposed Scheme
The traditional interleaving mechanism samples signals with a random scrambling matrix. This method has an inherent disadvantage: each column and each row of the scrambling matrix contains only a single 1, so the matrix only changes the positions of the original frames. Thus, at a high PLR, the interleaving framework still cannot solve the problem of losing large blocks of data. To further improve the performance of packet loss recovery, we propose a new method, namely, CS-PT, which introduces an ultralow complexity random sparse binary matrix (R-SBM) to measure the audio signals. Each column and each row of the R-SBM contain multiple ones. Projecting the audio signals onto an R-SBM not only retains the advantages of interleaving but also increases the redundancy of the measurement information at the cost of only a few additions.
3.1. CS-PT
The proposed packet loss-resilient audio transmission scheme consists of three main parts: segmented sampling of the audio, packet loss, and the CS-based reconstruction process. The CS-PT is illustrated in Figure 5.

3.1.1. Segmented Sampling for Audio
Step 1: Consider an audio stream. We sample the original signal with a fixed-length window, which splits it into many segments. Without loss of generality, suppose that the signal is split into segments whose length equals the window length. Taking each segment as one column yields the signal matrix.
Step 2: In the signal preprocessing process, we construct an ultralow complexity R-SBM in which each column contains only a small number of entries equal to 1, placed at random locations, and all other entries are zero. Each element of the R-SBM is therefore either 0 or 1.
In our experiment, the R-SBM is constructed from an identity matrix by adding to it several rounds of nonrepetitive cyclic shifts of its rows; that is, the R-SBM is the sum of the identity matrix and the matrices obtained by cyclically shifting every row of the identity matrix by the same number of positions. The number of positions shifted to the left or right in each round is recorded in a one-dimensional shift vector.
Thus, at the receiving end, the same R-SBM can be generated from the shift vector and the dimension.
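The cyclic-shift construction can be sketched as follows. This is a minimal illustration under the assumption that the R-SBM is the identity matrix plus one cyclically shifted copy per round, with positive shifts to the right and negative shifts to the left; the names `build_rsbm` and `measure` and the shift values are hypothetical.

```python
from typing import List


def build_rsbm(n: int, shifts: List[int]) -> List[List[int]]:
    """Identity matrix plus its cyclically shifted copies (entries stay 0/1)."""
    offsets = [0] + [s % n for s in shifts]
    if len(set(offsets)) != len(offsets):
        raise ValueError("shifts must be distinct and nonzero modulo n")
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for offset in offsets:
            matrix[i][(i + offset) % n] = 1
    return matrix


def measure(matrix: List[List[int]], x: List[float]) -> List[float]:
    """Project x onto the binary matrix using additions only."""
    return [sum(x[j] for j, bit in enumerate(row) if bit) for row in matrix]


shift_vector = [3, -2]                      # hypothetical: 3 to the right, 2 to the left
sender = build_rsbm(8, shift_vector)
receiver = build_rsbm(8, shift_vector)      # regenerated from the shift vector and dimension
y = measure(sender, [float(v) for v in range(8)])
print(y)
```

Only the shift vector and the dimension need to be shared, and each measurement costs a few additions because every row contains the same small number of ones.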
In [13], Bianchi et al. proved that the one-time sensing (OTS) cryptosystem is resistant to known-plaintext attacks (KPAs). The OTS cryptosystem is implemented by renewing the sensing matrix at each encryption. In our proposed framework, we use the OTS scenario to sample the signal matrix: the sampled signal of each segment is obtained by projecting that segment onto its own R-SBM, so that a different R-SBM is used for every segment.
3.1.2. Packet Loss in Network
Step 3: During network transmission, if packets are lost, the data received at the receiving end are equivalent to the product of a packet loss matrix and the sampled subsignal. The packet loss matrix can be generated by removing the corresponding rows of the identity matrix. Thus, the sensing matrix is the product of the packet loss matrix and the R-SBM.
3.1.3. Signal Reconstruction
Step 4: The subsignals can be reconstructed by CS-Rec, in which each subsignal is represented by its sparse coefficients in a sparsifying dictionary. In our experiment, the sparsifying dictionary is the DCT. By solving the optimization problem (17), we can estimate the sparse coefficients and obtain each reconstructed subsignal by multiplying the dictionary by the estimated coefficients.
Then, the approximate signal is obtained by assembling the reconstructed subsignals.
3.2. CS-SPT
In the CS-PT, the R-SBM is introduced into the OTS cryptosystem, which resists KPAs but not COAs. Therefore, to improve communication security, we further propose a secure audio transmission scheme, namely, CS-SPT, which homogenizes the energy of the ciphertext by adding a diffusion operation [14] after the signal preprocessing.
3.2.1. Segmented Sampling for Audio
Step 1 and Step 2 of CS-SPT are the same as those of the preprocessing part of CS-PT. Applied to an audio signal, they yield the sampled signals.
3.2.2. Quantization
Because each column of the audio signal matrix is sampled separately, the measurements retain the energy information of the signal. To hide this energy information, we want the energy of the entire measurement matrix to be uniformly distributed. We therefore adopt the diffusion operation used in classical ciphers.
Step 3: Before the diffusion operation, we introduce a quantization operation that maps the measurements to a fixed integer range: each measurement is scaled linearly using the minimum and maximum values of the measurement matrix and then rounded to the nearest integer, which yields the quantized measurements.
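The quantization step can be sketched as a linear min-max mapping followed by rounding. The example assumes 256 quantization levels (8-bit integers), which is an illustrative choice rather than a value taken from the experiments; `quantize` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize(values: List[float], levels: int = 256) -> Tuple[List[int], float, float]:
    """Map values linearly onto the integers 0..levels-1.

    Python's round() resolves exact ties to the nearest even integer.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values), lo, hi
    scale = (levels - 1) / (hi - lo)
    return [round((v - lo) * scale) for v in values], lo, hi


def dequantize(codes: List[int], lo: float, hi: float, levels: int = 256) -> List[float]:
    """Approximate inverse of quantize; the rounding error is not recoverable."""
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]


measurements = [-1.25, 0.0, 0.4, 2.75]
codes, lo, hi = quantize(measurements)
restored = dequantize(codes, lo, hi)
print(codes)
print(restored)
```

The restored values differ from the measurements by at most half a quantization step, which is why a scheme that includes this step is lossy.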
3.2.3. Diffusion
Step 4: In the diffusion process, we use the logistic-tent (LT) system to generate a chaotic sequence, from which the key stream is obtained. In our experiment, the LT system and its control parameter follow [15], and the system is iterated from a given initial value. The key stream is then generated from the chaotic sequence as in [15], using the floor operation, i.e., the largest integer that does not exceed its argument.
Step 5: In the diffusion process, we use the exclusive OR (XOR) operation to hide the energy information, which changes the statistical properties significantly. The later experiments show that the method can resist several attacks. In this model, each output cipher element is computed by a bitwise XOR of three quantities: the previous cipher value, the current quantized measurement, and the corresponding key stream element.
Step 6: Lastly, the ciphertext is obtained by an inverse quantization, and the resulting signal is split into segments of one window length to form the ciphertext matrix.
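The diffusion step can be sketched as a chained XOR driven by a chaotic key stream. The example is a minimal illustration under several assumptions: it uses a commonly cited form of the logistic-tent map, derives key bytes with an assumed scaling constant of 10^14, and starts the chain from an assumed initial cipher value of 0. None of these values are taken from [15] or from the experiments, and all function names are hypothetical.

```python
import math
from typing import List


def logistic_tent(x: float, r: float) -> float:
    """One iteration of the logistic-tent map (a commonly used form)."""
    if x < 0.5:
        return (r * x * (1.0 - x) + (4.0 - r) * x / 2.0) % 1.0
    return (r * x * (1.0 - x) + (4.0 - r) * (1.0 - x) / 2.0) % 1.0


def key_stream(x0: float, r: float, length: int) -> List[int]:
    """Byte key stream; the scaling constant 10**14 is an assumption."""
    stream, x = [], x0
    for _ in range(length):
        x = logistic_tent(x, r)
        stream.append(math.floor(x * 10**14) % 256)
    return stream


def diffuse(codes: List[int], keys: List[int], c0: int = 0) -> List[int]:
    """cipher[i] = codes[i] XOR keys[i] XOR cipher[i-1], starting from c0."""
    cipher, previous = [], c0
    for q, k in zip(codes, keys):
        previous = q ^ k ^ previous
        cipher.append(previous)
    return cipher


def undiffuse(cipher: List[int], keys: List[int], c0: int = 0) -> List[int]:
    """Invert diffuse; every element needs the previous cipher value."""
    codes, previous = [], c0
    for c, k in zip(cipher, keys):
        codes.append(c ^ k ^ previous)
        previous = c
    return codes


quantized = [0, 80, 105, 255, 17, 17, 17]
keys = key_stream(x0=0.3141, r=3.99, length=len(quantized))
cipher = diffuse(quantized, keys)
recovered = undiffuse(cipher, keys)
print(cipher)
print(recovered)
```

Because each cipher value is chained to its predecessor, an element whose predecessor was lost in transit cannot be inverted, which is consistent with the decoding step removing such elements.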
3.2.4. Packet Loss in Network
Step 7: During network transmission, one column of the ciphertext matrix is transmitted at a time. If packets are lost, the received ciphertext can be written as the product of a packet loss matrix and the transmitted column, where the packet loss matrix is generated by removing the corresponding rows of the identity matrix.
3.2.5. Decoding
Step 8: The decoding procedure is the reverse of the encoding procedure. First, the received ciphertext is mapped back to quantized values, which undoes the inverse quantization of Step 6. Second, the inverse diffusion is performed using the keys and the control parameter, which yields the recovered quantized measurements. Third, the measurements are obtained from the recovered quantized measurements by inverse quantization. Finally, it is worth noting that the elements at the threshold of a loss cannot be recovered because each element is related to the previous one. Therefore, these elements are also removed, which gives a threshold-removing matrix, and the new sensing matrix is formed accordingly.
3.2.6. Signal Reconstruction
Step 9: The subsignals can be reconstructed by CS-Rec, in which each subsignal is represented by its sparse coefficients in a sparsifying dictionary. In our experiment, the sparsifying dictionary is the DCT. By solving the optimization problem (30), we can estimate the sparse coefficients and obtain each reconstructed subsignal.
Then, the approximate signal is obtained by assembling the reconstructed subsignals.
4. Results and Security Analysis
In this section, we carry out two experiments to validate the performance of the proposed framework. We use the Chinese speech data set Thchs30 [16] as test signals. Thchs30 is the first free Chinese speech corpus and is commonly used in speech recognition. To measure the quality of the reconstructed speech, we use the perceptual evaluation of speech quality (PESQ) [17], which compares the auditory distance between the original audio and the reconstructed stream. The PESQ score ranges from −0.5 (bad quality) to 4.5 (excellent quality). All experiments were implemented in MATLAB R2018b on an Intel quad-core computer at 2.50 GHz with 8.0 GB of RAM.
4.1. Experimental Setup and Results
In our experiments, the discrete cosine transform (DCT) was chosen as the sparse basis. In the CS-SPT scheme, each audio vector is extracted from the original audio data at the corresponding time window. The size of the time window is 1024, and a total of 100 segments were extracted.
4.1.1. Effectiveness
In the first experiment, we verified the proposed methods in terms of packet loss resilience and recovery performance.
Packet loss always exists during network transmission. We compare the recovery performance of the proposed CS-PT and CS-SPT with that of the existing CS-I at different PLRs. The results are shown in Figure 6, and the encoding times of these schemes are given in Table 1. Although the PESQ of all methods declines as the PLR increases, it remains higher than 2.50 when PLR < 50%, so the decrypted audio is understandable. When PLR < 50%, the PESQ of both the CS-PT and the CS-SPT is higher than 3.0. Meanwhile, the reconstruction performance of the proposed schemes is higher than that of the CS-I at all PLRs. With respect to time complexity, although the CS-PT slightly increases the complexity, its performance is improved significantly. Comparing the CS-PT with the CS-SPT, the packet loss resilience of the CS-PT is actually better than that of the CS-SPT, because the CS-SPT adds a diffusion operation, which is lossy. However, the follow-up experiments verify that the security of the CS-SPT is higher than that of both the CS-I and the CS-PT.

In the second experiment, we further examine the reconstruction performance of the proposed CS-SPT as the sparsity of the R-SBM varies. The time complexity of the CS-SPT is mainly determined by the sparsity of the R-SBM, which also affects the packet loss resilience of the scheme. Figure 7 shows the reconstruction results under different sparsity levels at different PLRs (10%, 20%, 30%, and 40%). In our experiment, we randomly chose four audio files from the test set (A11_0.wav, A11_1.wav, A11_11.wav, and A11_12.wav) and computed their PESQ. Figure 7 shows that the reconstruction performance does not increase with the number of nonzero elements in the R-SBM. Therefore, the sparsity can be kept within a small range.

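Figure 7: Reconstruction results (PESQ) of the CS-SPT under different R-SBM sparsity levels at different PLRs, panels (a)-(d).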
4.2. Security Analysis
4.2.1. Key Sensitivity Analysis
Key sensitivity is a key indicator for evaluating a cryptosystem. In this section, a key sensitivity analysis in decryption was performed, and the results are illustrated in Figure 8. To evaluate the key sensitivity, we randomly chose an audio signal, A11_0.wav, for encryption and decryption using the CS-I, the proposed CS-PT, and the CS-SPT. The audio was first encoded with the original secret key, and tiny changes were then introduced into the key for decoding. Five decoding keys were tested, i.e., the original key and four modified keys. The decrypted audio with the correct key is shown in Figure 8(b), and the decrypted audios with the modified keys are shown in Figures 8(c)–8(f). It can be observed that any slight error in the key leads to wrong decoding. The results indicate that the proposed cryptosystem is highly key-sensitive.

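Figure 8: Key sensitivity analysis, panels (a)-(f). (b) Decrypted audio with the correct key; (c)-(f) decrypted audios with the modified keys.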
4.2.2. Histogram Analysis
The histogram of an audio signal describes the distribution of its sample values, and an encryption algorithm should hide the audio’s energy information as much as possible. Therefore, the flatter the histogram of the encrypted audio, the stronger the ability to resist attacks. As shown in Figure 9(a), in COAs the attacker Eve attempts to guess the plaintext from the statistics of the ciphertext [18]. In [19], Cho and Yu proved that sparse OTS matrix encryption leaks information because attackers can obtain the energy information of the plaintext from the ciphertext. The histograms of the encrypted audios are shown in Figure 10. The histogram of the audio encrypted with the CS-SPT is almost uniformly distributed, and no useful information is leaked to the adversary. Thus, we conclude that the proposed scheme addresses the energy leakage issue well and that the security of the CS-SPT is improved significantly.

Figure 9: Attack models. (a) COA; (b) KPA.

Figure 10: Histograms of the encrypted audios, panels (a)-(d).
4.2.3. Correlation Analysis
In KPAs, the attacker Eve is able to capture a certain number of plaintext-ciphertext pairs and tries to use them to decode the ciphertext and obtain the plaintext [2], as shown in Figure 9(b). In [19], Cho and Yu studied the security of a CS-based cryptosystem that uses a sparse OTS matrix.
The correlation between adjacent samples is often used as an important criterion for assessing confusion performance. The correlation coefficient of the sample pairs is calculated as follows:
$$r_{xy} = \frac{\operatorname{cov}(x, y)}{\sqrt{D(x)}\sqrt{D(y)}},$$
where $x$ and $y$ are the values of two adjacent samples in the audio, $\operatorname{cov}(x, y)$ is their covariance, and $D(\cdot)$ denotes the variance. To verify the CS-SPT, we randomly selected 20000 pairs of adjacent samples. The results are shown in Figure 11. The proposed scheme disrupts the correlation into a random pattern. Table 2 lists the correlation coefficients of the original audio and of the audio encrypted with the CS-I, the CS-PT, and the proposed CS-SPT. The correlation coefficient of the original audio is greater than 0.9, which means that adjacent sample values differ very little. After encryption, the correlation coefficient is mostly less than 0.1, and it is especially low for the proposed algorithm. This result shows that the proposed CS-SPT has a satisfactory confusion effect.
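The adjacent-sample correlation can be computed with a few lines of code. The sketch below assumes the standard Pearson correlation coefficient and, for simplicity, uses every adjacent pair rather than a random selection of 20000 pairs; `adjacent_correlation` is a hypothetical helper name, and the test signal is synthetic.

```python
import math
from typing import Sequence


def adjacent_correlation(samples: Sequence[float]) -> float:
    """Pearson correlation between each sample and its successor."""
    x, y = samples[:-1], samples[1:]
    n = len(x)
    if n < 2:
        raise ValueError("need at least three samples")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# A slowly varying tone: neighboring samples are strongly correlated.
smooth = [math.sin(2 * math.pi * t / 200) for t in range(2000)]
print(f"smooth signal: {adjacent_correlation(smooth):.4f}")
```

A value close to 1 indicates strongly correlated neighbors, as in unencrypted audio, whereas a well-diffused ciphertext should give a value close to 0.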

Figure 11: Correlation of adjacent sample pairs, panels (a)-(d).
5. Conclusion
In this paper, a secure and packet loss-resistant real-time audio transmission framework based on the principle of CS is proposed. To improve packet loss resilience, we adopted an ultralow complexity scrambling matrix. For security, resistance to KPAs was achieved by applying the OTS scenario in the segmented sampling. Moreover, we employed a diffusion operation that homogenizes the energy of the ciphertext and thus resists COAs. Compared with existing algorithms, the proposed scheme has strong key sensitivity and can resist various attacks. Simulation results demonstrate that the proposed scheme has superior reconstruction performance.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the NSFC under Grant 61973088 and in part by the NSF of Guangdong under Grant 2019A1515011371.