Abstract

The protection of database content using digital watermarking is an emerging research direction in information security. In the literature, many solutions have been proposed either for copyright protection and ownership proofing or for integrity checking and tamper localization. Nevertheless, most of them are distortion based: they introduce permanent errors into the cover data during the encoding process, which inevitably affects data quality and usability. Since such distortions are not tolerated in many applications, including banking, medical, and military data, reversible watermarking, primarily designed for multimedia content, has been extended to relational databases. In this article, we propose a novel reversible watermarking strategy based on prediction-error expansion, which not only detects and localizes malicious modifications but also recovers the original data at watermark detection. The effectiveness of the proposed method is demonstrated through theoretical analysis and detailed experiments.

1. Introduction

With the rapid evolution of computer and Internet technology, the use of databases has increased enormously. Generally, database relations contain critical information about organizations and individuals that needs to be protected from illegal copying and forgery. Piracy has become one of the most devastating threats to networking systems and electronic business: while demand for databases is growing, pirated copying is a severe threat because of the low cost of copying and the high value of the target databases. In that context, the security of published or shared data has become a great concern for data owners, as the creation of a database involves intellectual and financial effort [1].

In the last few years, digital watermarking has been proposed as a promising way of protecting digital data from illegal copying and manipulation. Digital watermarking uses information hiding to conceal a “piece of secret information” (the watermark) inside digital data for the purpose of ownership proofing or data integrity control. The soundness of such a method relies on the assumption that altering the digital data in the process of hiding the watermark does not destroy the usefulness of the data. However, the errors introduced into data during watermark embedding inevitably reduce data quality. In addition, permanent distortions are not tolerated in many applications (e.g., military, law enforcement, and medical data). Most distortion-based watermarking schemes are irreversible; that is, they do not allow the original data to be recovered from the watermarked relation [14]. Therefore, although some distortion-free schemes exist in the literature [19], there is a need to develop a new generation of watermarking techniques that can efficiently address the abovementioned issues. To that end, reversible (lossless) digital watermarking has been extended from multimedia assets to relational data to allow original data recovery at watermark detection. Reversible watermarking controls the distortions introduced by watermark embedding, and it ensures data recovery along with ownership protection or tamper proofing. Several reversible methods are available in the literature for watermarking relational databases [5]. Nevertheless, most of the proposed solutions are robust schemes, designed for copyright protection and ownership proofing rather than for integrity control.

In this paper, we propose a novel fragile reversible watermarking technique for integrity control of relational data using prediction-error expansion. The main characteristics of our proposed method can be summarized as follows:

Reversibility: distortions introduced to the underlying data are not permanent since the original data can be fully restored at watermark detection. As a result, the scheme is especially suitable for applications that require zero permanent distortion.

Imperceptibility: the watermark is encoded in such a way that it is unnoticeable and undetectable using conventional techniques.

Fragility: any modification maliciously made to the watermarked relation can be detected and localized with high probability.

Blindness: the original database relation is not required for watermark detection.

Usability: distortions caused by watermark insertion do not compromise data usefulness.

Security: both the embedding and detection stages of the watermark are governed by a secret key known only to the data owner.

2. Previous Work

Several recent surveys have provided detailed discussions of relational database watermarking [15]. Robust reversible watermarking has also been investigated in many recent works [10–13]. In this section, we mainly focus on the work done on lossless watermarking for database tamper proofing and authentication. Notice that most of the available schemes have been adapted from techniques proposed for multimedia objects [14].

In [15], Zhang et al. proposed the first well-known reversible watermarking scheme for relational databases. This method is designed to achieve lossless and exact authentication of numeric relational data using expansion on the data error histogram. The scheme is claimed to perfectly restore the original attribute data from nontampered watermarked relational databases, thus guaranteeing a “clear and exact” tampered-or-not authentication without causing any permanent distortion to the database. The disadvantage of this method is that consecutive tuples in a database relation are not necessarily correlated the way neighboring pixels in an image are, which may seriously reduce the embedding capacity. As a result, the scheme is not resilient to heavy attacks.

Coatrieux et al. [16] described a lossless watermarking technique for integrity control of categorical data in medical databases. This technique, before watermark embedding, organizes the relation tuples into secret groups using a one-way hash function. The watermark is encoded in groups based on histogram shifting modulation to form a digital signature that will be used to verify the integrity of the associated group.

In [17], the authors introduced a method that can not only detect malicious modifications but also recover the true data at the modified locations in database relations. This scheme follows the same algorithmic steps as in [18], except that the XOR operator is used as the hash function for generating the attribute watermark. This operator has a heredity property, which makes it possible to recover the true data from the watermarked database table.

Franco-Contreras et al. [19, 20] proposed a lossless watermarking scheme based on the circular interpretation of bijective modulations for integrity verification of medical relational databases. The proposed technique modulates the relative angular position of the center of mass of the circular histogram of one numeric attribute in the relation. This scheme can be used for copyright protection, integrity control, or traitor tracing.

In a recent work, Khanduja and Chakraverty [21] have introduced a fragile, blind, and reversible method for tamper detection in decision systems. The scheme, based on rough set theory, first prepares a secure signature by encoding the information on reducts, rules, and their support values and then securely embeds it into a dataset. In case of alteration in reducts, the proposed technique can recover the original value.

Recently, in [22], a reversible watermarking technique for tamper detection and original data recovery in relational databases was presented. In the proposed work, two optimization techniques, orthogonal learning particle swarm optimization (OLPSO) and firefly algorithm (FA), are used to find the optimal locations in the database for embedding the watermark. In case of tampering, the method has the ability to restore the database back to the original state.

In their work, Chang and Wu [23] presented a reversible technique for tamper detection of numeric attributes. The scheme is based on difference expansion and support vector regression (SVR). Difference expansion is a well-known lossless watermarking modulation for message embedding. It performs arithmetic operations on numeric features, expanding the differences between original and predicted values so as to add one virtual least significant bit (LSB) that is used for message embedding. However, it is weak against update attacks.

In [24], a tamper recovery fragile watermarking technique for relational databases was proposed. This scheme is group based, and the watermark is embedded and verified group-by-group using Reed–Solomon coding technique.

In [25], Gupta and Pieprzyk proposed a robust reversible watermarking technique, which extended the Agrawal and Kiernan method [26]. They used difference expansion on integers and achieved reversibility of the original data.

In [27], Farfoura and Horng proposed a blind and reversible method that improves the scheme in [25] by reducing data distortion. The proposed method is based on prediction-error expansion (originally proposed for images [28]) instead of difference expansion, and a single attribute is used to carry a watermark bit. For watermark insertion, an identification image is converted into a bit stream, which is then embedded into the fractional part of numeric attributes to represent copyright watermark information [27].

In [29], the authors extended their work [27] to handle the situation in which multiple owners claim the ownership of watermarked data. A time-stamping authentication protocol is designed to ensure ownership protection for relational databases. The motivation behind watermark embedding in the fractional portion is to reduce the distortion in the underlying data. If the numeric feature has no fractional part, no watermark is embedded; hence, the watermark capacity decreases. In addition, although this method is robust against common database manipulations, the watermark will not survive a simple integer rounding operation.

In [30], another blind and reversible technique based on PE is proposed. This technique is intended to provide proof of ownership and to overcome the drawbacks of the watermarking technique of Farfoura et al. [29]. During the watermark embedding phase, the last two digits of all numeric attributes of the selected tuples are used. Although the robustness of the technique has been analyzed against malicious attacks and benign updates, it cannot be used for tamper detection and localization.

3. Prediction-Error Expansion Overview

The embedding and extraction procedures of the conventional prediction-error expansion (PE) proposed by Thodi and Rodriguez in [28] are briefly introduced here. In PE, the prediction error between the original image pixel value and the estimated (predicted) value is used for watermark embedding. Moreover, the authors suggest combining expansion embedding with histogram shifting to ensure reversibility.

The embedding procedure of the PE approach is as follows. Let $x$ be the intensity of a pixel in a gray-scale image. The predicted value $\hat{x}$ of the pixel is computed as

$$\hat{x} = \begin{cases} \max(a, b) & \text{if } c \le \min(a, b), \\ \min(a, b) & \text{if } c \ge \max(a, b), \\ a + b - c & \text{otherwise,} \end{cases}$$

where $a$, $b$, and $c$ are the right, lower, and diagonal neighbors of the pixel $x$ (see Figure 1).

The prediction error is defined by

$$e = x - \hat{x}.$$

A watermark bit $i$ is embedded by expanding the prediction error as follows. The binary representation of the prediction error $e$ is shifted left by one bit to create a vacant LSB, into which the marking bit $i$ is inserted. As a result, the modified (expanded) prediction error is given by

$$e' = 2e + i.$$

The watermarked pixel value $x'$ can then be computed using the expanded prediction error $e'$ and the pixel estimated value $\hat{x}$:

$$x' = \hat{x} + e'.$$

At the decoder, extraction is done by first calculating the prediction error $e' = x' - \hat{x}$. The encoded bit $i$ is extracted from the LSB of $e'$. The original prediction error is then computed as

$$e = \left\lfloor \frac{e'}{2} \right\rfloor.$$

Finally, the original pixel value is restored as

$$x = \hat{x} + e.$$

For example, consider a pixel $x$ with predicted value $\hat{x}$. Then, the prediction error is given by $e = x - \hat{x}$. Let $i$ be the watermark bit to embed. The expanded prediction error and the watermarked pixel value can be computed as follows: $e' = 2e + i$ and $x' = \hat{x} + e'$. At the detection phase, the concealed bit and the original prediction error are given by $i = \mathrm{LSB}(e')$ and $e = \lfloor e'/2 \rfloor$. Lastly, the original pixel value can be restored as $x = \hat{x} + e$.
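
The embedding and extraction steps of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names pe_embed and pe_extract are hypothetical, the predicted value is passed in directly rather than computed from neighboring pixels, and the numeric values are made up for the demonstration (they are not taken from the source).

```python
from typing import Tuple


def pe_embed(x: int, x_hat: int, bit: int) -> int:
    """Embed one bit into x by expanding its prediction error."""
    e = x - x_hat            # prediction error
    e_exp = 2 * e + bit      # shift left by one bit, insert the bit as the new LSB
    return x_hat + e_exp     # watermarked value


def pe_extract(x_marked: int, x_hat: int) -> Tuple[int, int]:
    """Return (embedded bit, original value) from a watermarked value."""
    e_exp = x_marked - x_hat  # expanded prediction error
    bit = e_exp % 2           # Python's % yields 0 or 1 even when e_exp is negative
    e = e_exp // 2            # floor division undoes the expansion
    return bit, x_hat + e


# Illustrative values only.
marked = pe_embed(x=100, x_hat=98, bit=1)
print(marked)                   # 103
print(pe_extract(marked, 98))   # (1, 100)
```

The decoder needs the same predicted value as the encoder, which is why the prediction must be computable from information that embedding leaves unchanged.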

4. Proposed Method

As discussed in Section 2, in the robust method proposed by Farfoura and Horng [27], watermark embedding is not possible if the attributes selected for marking do not have a fractional part. To overcome this limitation, our scheme does not use the fractional portion. Instead, the least significant digit (LSD) of any numeric attribute available for watermarking plays the role of the original pixel, and the predicted intensity is any value that is known to the data owner at both the watermark embedding and detection stages (e.g., the tuple primary key hash value). Our reversible scheme consists of four phases: (i) preprocessing; (ii) watermark embedding; (iii) integrity verification; and (iv) data recovery. Table 1 summarizes the notations used in the rest of this paper, and the data partitioning procedure is shown in Algorithm 1.

Algorithm 1: Data partitioning.
For each tuple ti ∈ R do
  hi = Hash (Ks || ti.Pk || Ks)   //primary key hash
  j = hi mod (number of groups)   //group index
  Insert tuple ti into group Gj
End for
For each group Gj do
  Sort all tuples in Gj in increasing order of their primary key hash
End for
Return the groups (G1, G2, …)

4.1. Preprocessing

Let R (Pk, A1, …, Aγ) be the database relation to protect. A1, …, Aγ are all numeric attributes, and Pk is the primary key attribute, whose values never change or else can be recovered. In this phase, the relation R is securely partitioned into groups based on the primary key hash values of tuples. The number of groups is a secret parameter known only to the data owner, as is the secret key Ks. For hashing, a message authentication code (MAC) is computed by applying a cryptographic hash function to the primary key concatenated with the secret key, in order to prevent an unauthorized person from reproducing valid hash codes from a suspicious relation. The data partitioning procedure is that of Algorithm 1. Because secure hash functions generate uniformly distributed message digests, the grouping technique assigns approximately the same number µ of tuples to each group, namely the number η of tuples in R divided by the number of groups. Furthermore, it is difficult for an attacker to predict the tuples-to-group assignment without knowledge of the secret parameters. Note that the grouping and sorting operations do not physically change the positions of the tuples in the database. However, since tuples in a database are independent, these operations are necessary to enforce a relationship between them so that the embedded and extracted watermarks can be synchronized.
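
The partitioning step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes HMAC-SHA256 from the standard library as the keyed hash (the source writes Hash (Ks || ti.Pk || Ks) without fixing the construction), and the names pk_hash, partition, and num_groups are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def pk_hash(secret_key: bytes, primary_key) -> int:
    """Keyed hash of a primary key, returned as a non-negative integer."""
    digest = hmac.new(secret_key, str(primary_key).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def partition(primary_keys, secret_key: bytes, num_groups: int) -> dict:
    """Map each group index to its (hash, primary key) pairs, sorted by hash."""
    groups = defaultdict(list)
    for pk in primary_keys:
        h = pk_hash(secret_key, pk)
        groups[h % num_groups].append((h, pk))   # group index = hash mod number of groups
    for members in groups.values():
        members.sort()                           # virtual ordering inside each group
    return dict(groups)


# Illustrative call: 1,000 integer keys spread over 10 groups.
groups = partition(range(1, 1001), b"owner-secret", num_groups=10)
print(sorted(len(members) for members in groups.values()))
```

Because the grouping depends only on the key and the primary key values, it is independent of the physical order of the tuples, so the detector can rebuild the same groups from the watermarked relation.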

4.2. Watermark Embedding

Algorithms 2 and 3 show the watermark embedding process for a given group. This process can be summarized as follows: first, for each attribute, we calculate a hash value from which we generate the attribute watermark. The number of tuples in the group determines the size of the watermark. Next, in order to reduce the distortion caused by watermark insertion, the watermark bits are securely encoded by expanding the prediction-error of the LSD of numeric attributes. We assume that such errors will not affect data usability, as they are not permanent. The encoding process (Algorithm 3) first extracts the LSD of the attribute value and then computes its prediction error, which is expanded with the corresponding watermark bit as shown by lines 1 through 4. Afterwards, the new value of the LSD is combined with the corresponding most significant digits (MSD) to calculate the value of the watermarked attribute.

Algorithm 2: Watermark embedding in a group.
Hp = HMAC (Ks || h1 || h2 || … || hµ || Ks)   //hashing of primary key hash values
for each attribute Aj do
  Hj = HMAC (Hp || t1.Aj || t2.Aj || … || tµ.Aj || Ks)   //attribute hash
  Wj = µ MSB (Hj)   //generate watermark bits: the µ most significant bits of Hj; assume length (Hj) ≥ µ
  for each tuple ti do
    Encode_Bit (ti.Aj, Wj[i], hi)   //encode bit; see Algorithm 3
  end for
end for

Algorithm 3: Encode_Bit (ti.Aj, Wj[i], hi).
(1) lsd = GetLSD (ti.Aj)   //least significant digit of ti.Aj
(2) T = GetMSD (ti.Aj)   //most significant digits of ti.Aj
(3) e = lsd − hi   //prediction error
(4) e′ = 2∗e + Wj[i]   //expanded prediction error
(5) newval = e′ + hi   //new value of the LSD
(6) newval = To_number (T, newval)
(7) //Update the watermarked attribute value
(8) ti.Aj = newval
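
A hedged Python sketch of the two embedding algorithms is given below. It rests on several assumptions that are not in the source: HMAC-SHA256 as the keyed hash, "|" as the concatenation separator, integer attribute values, and the hypothetical names attribute_watermark, encode_bit, and pred (which stands in for the predicted value hi). Because the expanded digit can fall outside 0..9 and the source does not say how such cases are merged back into the most significant digits, the sketch returns the MSD and the expanded LSD as a pair instead of a single number.

```python
import hashlib
import hmac


def attribute_watermark(key: bytes, pk_hashes, values) -> list:
    """Return one watermark bit per tuple for a single attribute of a group."""
    # Hp: keyed hash over the primary key hashes of the group.
    hp = hmac.new(key, "|".join(map(str, pk_hashes)).encode(), hashlib.sha256).digest()
    # Hj: keyed hash over Hp followed by the attribute values of the group.
    hj = hmac.new(key, hp + "|".join(map(str, values)).encode(), hashlib.sha256).digest()
    n_bits = 8 * len(hj)
    if len(values) > n_bits:
        raise ValueError("group size exceeds the digest length")
    as_int = int.from_bytes(hj, "big")
    # Take the len(values) most significant bits of the digest.
    return [(as_int >> (n_bits - 1 - k)) & 1 for k in range(len(values))]


def encode_bit(value: int, bit: int, pred: int):
    """Expand the prediction error of the least significant digit of value."""
    msd, lsd = divmod(value, 10)   # most significant digits, least significant digit
    e = lsd - pred                 # prediction error
    e_exp = 2 * e + bit            # expanded prediction error
    return msd, e_exp + pred       # (MSD, expanded LSD), kept as a pair


print(encode_bit(127, 1, 3))       # (12, 12): the expanded digit can exceed 9
```

The watermark bits are derived from the original attribute values before any bit is encoded, which is what later allows the detector to recompute them from the recovered data.
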
4.3. Integrity Verification and Data Recovery

The protected database relation may be subject to various types of attacks aiming at maliciously changing the content of data. For an attack to be successful, the pirate needs to modify data while keeping the embedded watermark unchanged. However, since the embedded watermark is computed from data characteristics using a cryptographically secure hash function, any modification made to data should be detected and localized as tampering. Moreover, since the scheme is reversible, the original data can be restored. The detection process is fully blind; it only requires knowledge of the secret parameters, such as Ks and the number of groups. As in the embedding stage, the database relation is virtually divided into secret groups (Algorithm 1). The integrity of each group is verified independently as described in Algorithms 4 and 5. First, the concealed watermark bits are extracted from the expanded prediction errors of the least significant digits of attributes while restoring the original data. Then, the original watermark is computed from attribute hash values in the same way as in the embedding procedure. Finally, the extracted and computed watermarks are compared. If they match, the associated attribute is authentic; otherwise, it has been tampered with.

Algorithm 4: Integrity verification of a group.
Hp = HMAC (Ks || h1 || h2 || … || hµ || Ks)   //hashing of primary key hash values
for each attribute Aj do
  for each tuple ti do
    lsd = GetLSD (ti.Aj)
    T = GetMSD (ti.Aj)   //most significant digits of ti.Aj
    e′ = lsd − hi   //watermarked prediction error
    Wj∗[i] = LSB (e′)   //recover the embedded bit
    ti.Aj = RecoverData (T, lsd, e′, Wj∗[i])   //see Algorithm 5
  end for
  Hj = HMAC (Hp || t1.Aj || t2.Aj || … || tµ.Aj || Ks)   //attribute hash
  Wj = µ MSB (Hj)   //generate watermark bits; assume length (Hj) ≥ µ
  if Wj ≠ Wj∗ then   //check integrity
    attribute Aj not authentic
  end if
end for

Algorithm 5: RecoverData (T, lsd, e′, Wj∗[i]).
e = ⌊e′/2⌋   //original prediction error
origval = lsd − e − Wj∗[i]   //original value of the LSD
origval = To_number (T, origval)   //attribute original value
return origval
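
The verification and recovery steps for one attribute of one group can be sketched as follows. This is an illustration under the same assumptions as before, none of which come from the source: HMAC-SHA256 as the keyed hash, "|" as the separator, integer attribute values, watermarked values kept as (MSD, expanded LSD) pairs, and the last digit of the primary key hash as the predicted value. The names watermark_bits, embed, and verify are hypothetical, and the embedding routine is repeated only to make the example self-contained.

```python
import hashlib
import hmac


def watermark_bits(key: bytes, pk_hashes, values) -> list:
    """One bit per tuple, taken from the top of a keyed hash (group size <= 256)."""
    hp = hmac.new(key, "|".join(map(str, pk_hashes)).encode(), hashlib.sha256).digest()
    hj = hmac.new(key, hp + "|".join(map(str, values)).encode(), hashlib.sha256).digest()
    as_int = int.from_bytes(hj, "big")
    return [(as_int >> (255 - k)) & 1 for k in range(len(values))]


def embed(key: bytes, pk_hashes, values, preds) -> list:
    """Encode the attribute watermark into the least significant digits."""
    bits = watermark_bits(key, pk_hashes, values)
    marked = []
    for v, b, p in zip(values, bits, preds):
        msd, lsd = divmod(v, 10)
        marked.append((msd, 2 * (lsd - p) + b + p))   # (MSD, expanded LSD)
    return marked


def verify(key: bytes, pk_hashes, marked, preds):
    """Return (is_authentic, recovered values) for one attribute of one group."""
    extracted, recovered = [], []
    for (msd, lsd_w), p in zip(marked, preds):
        e_exp = lsd_w - p        # watermarked prediction error
        bit = e_exp % 2          # embedded bit (LSB of the expanded error)
        e = e_exp // 2           # original prediction error
        extracted.append(bit)
        recovered.append(10 * msd + lsd_w - e - bit)   # original attribute value
    expected = watermark_bits(key, pk_hashes, recovered)
    return extracted == expected, recovered


key = b"owner-secret"
hashes = list(range(1000, 1032))            # stand-ins for primary key hashes
values = [37 * k + 5 for k in range(32)]    # illustrative attribute values
preds = [h % 10 for h in hashes]            # assumed predictor: last digit of the hash
marked = embed(key, hashes, values, preds)
ok, restored = verify(key, hashes, marked, preds)
print(ok, restored == values)               # True True
```

If any stored value is altered, the recomputed watermark is derived from different recovered data and agrees with the extracted bits only by chance.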

5. Security Analysis

In this section, we analyze the security of our reversible fragile watermarking scheme in terms of failure probability in tamper detection. We consider the following attacks: attribute value alteration, tuple insertion, and tuple deletion.

5.1. Attribute Value Alteration Attack

In this kind of attack, the attacker can maliciously alter either a single value or multiple values. If a single value is modified, only one attribute in a single group is affected. Due to the use of a secure hash function in watermark computation, the modification will randomize the corresponding attribute hash and thus the generated watermark. After the attack, since each of the μ watermark bits has equal probability p = 1/2 of changing or not, the probability that our scheme fails to detect this tampering is given by

$$\left(\frac{1}{2}\right)^{\mu} = \frac{1}{2^{\mu}}.$$

It is easy to see that this failure rate is monotonically decreasing in the group size µ. The larger the group size, the lower the failure probability, and the more secure our scheme is.

Now suppose that the attacker alters multiple values. Since all values have equal probability of being modified, the attack could affect one or more groups. When all modified values are in a single group, one or more attribute watermarks could be randomized. If n attributes are affected, the probability that all watermarks in the group are still verified as correct after this attack is given by

$$\left(\frac{1}{2^{\mu}}\right)^{n} = \frac{1}{2^{n\mu}}.$$

We can see that this failure rate decreases exponentially with the group size and the number of affected attributes.

5.2. Tuple Insertion Attack

The pirate can insert one or more tuples into the protected database relation. In the case of a single insertion, this attack increases the size of the group to which the new tuple is assigned. As a result, all relevant attribute watermarks will be randomized. Since there are μγ + γ bits in the group watermark, the error rate for this modification is

$$\frac{1}{2^{\mu\gamma + \gamma}} = \frac{1}{2^{\gamma(\mu + 1)}}.$$

Now, suppose that n tuples are inserted during this attack. The attack could affect more than one group. If a single group is affected, the probability that our scheme fails to detect the tampering is given by

$$\frac{1}{2^{\gamma(\mu + n)}}.$$

We can observe that these rates are monotonically decreasing in the group size µ, the number γ of attributes, and the number n of inserted tuples. The larger these parameters are, the lower the failure rates in watermark extraction, and the better the security of the proposed scheme.

5.3. Tuple Deletion Attack

If a single tuple is deleted by the attacker, all watermarks in the relevant group will be randomized. The group then holds µ − 1 tuples, so the probability for this attack to succeed is

$$\frac{1}{2^{\gamma(\mu - 1)}}.$$

Now consider that n tuples are deleted from the watermarked relation. If all these tuples fall in a single group, the probability that all relevant watermarks are still verified as correct is given by

$$\frac{1}{2^{\gamma(\mu - n)}}.$$

We can easily see that the failure rate decreases exponentially with the group size and the number of attributes. However, it increases in case of massive deletion.
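
The failure probabilities of this section can be evaluated numerically with the short Python sketch below. It encodes one reading of the analysis, namely that each randomized watermark bit matches by chance with probability 1/2, so that k randomized bits go undetected with probability 2^(−k). The function names are hypothetical, and the bit count used for deletion, γ(µ − n), is an assumption made by analogy with the μγ + γ bits stated for a single insertion.

```python
from fractions import Fraction


def miss_probability(randomized_bits: int) -> Fraction:
    """Chance that all randomized watermark bits still match by accident."""
    return Fraction(1, 2 ** randomized_bits)


def alteration(mu: int, n_attributes: int = 1) -> Fraction:
    """n_attributes attribute watermarks of mu bits each are randomized."""
    return miss_probability(mu * n_attributes)


def insertion(mu: int, gamma: int, n: int = 1) -> Fraction:
    """n tuples inserted into one group: gamma watermarks of mu + n bits."""
    return miss_probability(gamma * (mu + n))


def deletion(mu: int, gamma: int, n: int = 1) -> Fraction:
    """n tuples deleted from one group: gamma watermarks of mu - n bits."""
    if n >= mu:
        raise ValueError("the whole group was deleted; nothing is left to verify")
    return miss_probability(gamma * (mu - n))


print(alteration(10))   # 1/1024
```

Exact fractions are used because the probabilities quickly become too small to compare reliably as floating-point numbers. For a group of 10 tuples and 10 attributes, deleting nine tuples leaves only 10 watermark bits, compared with 90 bits after a single deletion, which illustrates why massive deletion weakens detection.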

6. Experimental Results

In this section, we evaluate the main features of our scheme. We ran our algorithms on a computer with a quad-core Intel Core i3-8100 processor (3.6 GHz) and 4 GB of RAM, running SQL Server 2012 and the Microsoft Visual Studio C# IDE under Windows 7 Professional 64-bit. In our experiment, we used the forest cover type dataset for 30 × 30 meter cells obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data [31]. It consists of 581,012 tuples and 54 attributes with numeric and nonnumeric data. For watermark insertion, we used the first 10 integer attributes and added a new attribute called id to serve as the primary key.

The experiment consists of two steps listed in the following: (i) imperceptibility and usefulness analysis and (ii) tamper detection and data recovering.

6.1. Imperceptibility and Data Usefulness

Table 2 shows the impact of watermark insertion on the mean and variance of marked attributes. The values are rounded to the nearest integer. We can see that the mean remains unchanged except for attribute 4, where it is increased by 1. Though the change in the variance is higher than in the mean, the errors are still minor. Furthermore, the distortion caused is not permanent since the scheme is reversible.

6.2. Tamper Detection and Data Recovering

For efficiency reasons, we used a sample of 1,000 tuples as the experimental dataset. We tested the capability of our scheme to detect maliciously modified groups and to restore original data from nontampered groups. For each attack, we analyzed the relevant detection probability, which is the complement of the corresponding failure probability. Each test was performed 10 times, and the results were averaged.

We first observed that, in the case of a multiple tuple insertion attack, the affected groups were all detected. This is because the cryptographic hash function used (SHA-1) distributed the new tuples uniformly into the groups according to the number of groups, thus randomizing each relevant watermark.

The results for the alteration of multiple values are shown in Figure 2. We can observe that all modified groups are correctly detected, except when the number of groups is 90, where a few groups are missed. The reason is that the detection probability is lower when the number of groups is large, that is, when the groups are small. This observation confirms our theoretical results.

Figure 3 shows the detection probability in the case of multiple tuple deletion. We can see that, for a number of groups ranging from 10 to 50 and a deletion rate of up to 90%, the detection probability is equal to or greater than 0.88. However, with smaller groups, the detection probability decreases considerably as the deletion rate increases. For instance, when the number of groups is 90 and the deletion rate is 90%, the detection probability is only 0.1. The reason is that, after such an attack, it is highly likely that most of the groups have been completely deleted.

Regarding data recovery, due to the use of prediction-error expansion, we noticed that no matter the attack, the scheme was able to restore 100% of the original data from authentic groups.

Table 3 shows a comparison between our technique and two recent related works, both based on PE. To the best of our knowledge, a reversible fragile scheme based on PE has not yet been proposed in the context of relational databases. We consider different aspects, including the watermark information, the watermark encoding, the data recovery, and the granularity level of tamper detection and localization. We can see that, unlike the other methods, our scheme can detect and localize malicious modifications. In addition, our technique minimizes the distortion caused by the embedding process, whereas Farfoura et al.’s method [29] is not suitable for numeric data without a fractional part, and Chang et al. [30] use the last two digits for marking.

7. Conclusion

In this study, a new reversible fragile watermarking approach is developed for tamper detection of relational databases. To preserve data usability, the watermark is encoded by expanding the prediction-error of the LSD of numeric attributes. Through theoretical analysis and experiments, we demonstrated the effectiveness of the proposed technique to detect and localize malicious modifications from various attacks. In addition, the original data can be fully recovered from nontampered groups. In the future, we will extend our solution to nonnumeric, object-oriented, and Extensible Markup Language (XML) data.

Data Availability

The data related to the manuscript can be freely made publicly available without any constraints.

Conflicts of Interest

The authors declare that they have no conflicts of interest.