Abstract
Since leakage detection was introduced as a popular side-channel security assessment, it has been plagued by false-positives (a.k.a. type I errors). To control this error, previous solutions set detection thresholds based on an assumption-based prediction of the false-positive rate (FPR). However, this study points out that such a prediction of the FPR may be inaccurate. We notice that the prediction in EuroCrypt 2016 is much smaller than (approximately 779 times smaller than) the true FPR. The gap between prediction and truth, called underpredicted false-positives (UFP), leads to severe false-positives in leakage detection. We then examine the statistical distribution of the test statistics to analyze the cause of the UFP. Our analysis indicates that the overlap between cross-validation (CV) blocks gives rise to an assumption error in the distribution of the CV-based estimates of the $\rho$-statistic, which is the root cause of the UFP. Therefore, we tackle the UFP by eliminating the overlap between blocks. Specifically, we propose a profiling-shared validation (PSV) and utilize this validation to improve the detection of any-variate any-order leakages. Our experiments show that the PSV solves the UFP and saves more than 75% of the test time costs. In summary, this article reports a potential flaw in leakage detection and provides a complete analysis of the flaw for the first time.
1. Introduction
Side-channel attack (SCA) utilizes the physical leakages (such as execution time [1], power consumption [2], and electromagnetic radiation [3]) of a running device to retrieve some secrets (e.g., the private key) inside the device. Since it was proposed by Kocher [1], such an attack has seriously threatened the security of cryptographic modules, including smart cards [2, 3] and FPGAs [4, 5]. Thus, side-channel security assessments have been developed to evaluate the security of these modules against SCA [6].
Leakage detection is a side-channel security assessment that determines whether the leakages depend on the data (e.g., plaintext or ciphertext) accessible to an attacker [7]. In most cases, it uses a hypothesis test to identify the data-dependent leakages by comparing the test statistics with a certain threshold [8–10]. If no statistic exceeds the threshold, the module is allowed to pass the test; otherwise, the module is rejected. In contrast to the attack-based assessment [6], leakage detection exploits the dependency between leakage and data rather than the key recovery. Hence, it has advantages in computational complexity and works well in black-box environments [10]. A typical example is the test vector leakage assessment (TVLA) [8]. This assessment utilizes Welch's $t$-test to compare the $t$-statistic with a threshold of 4.5 and identifies the leakages dependent on the plaintext. At EuroCrypt 2016, Durvaux et al. found that TVLA failed to detect some plaintext-dependent leakages [10]. They then put forward a correlation-based leakage detection ($\rho$-test) to identify the leakages that are hard to detect for TVLA. The $\rho$-test takes advantage of cross-validation (CV) to obtain a well-estimated $\rho$-statistic $\hat{r}_z$ and then compares $|\hat{r}_z|$ with a threshold of 5.0 to assess any-variate any-order leakages. Based on the $\rho$-test, more assessments have emerged, such as the tests of [11, 12] (the latter with its simplification, the D-test), the KS-test [13], ANOVA [14], and the DL-based test [15].
However, as discussed in [10], some of the identified plaintext-dependent leakages are in fact not related to the plaintext. In other words, leakage detection is challenged by false-positives (a.k.a. type I errors). To address this challenge, the frequently used solutions carefully set the threshold for leakage detection according to an acceptable false-positive rate (FPR) [16, 17]. Such solutions rely on an assumed distribution of the test statistics to make a fast prediction of the FPR. For example, Durvaux et al. at EuroCrypt 2016 chose 5.0 as the threshold of the $\rho$-test based on the assumption that $\hat{r}_z$ follows the standard normal distribution $\mathcal{N}(0,1)$ when the true $\rho$-statistic is 0 [10]. This choice corresponds to a theoretical single-test value of $2(1-\Phi(5.0)) \approx 5.7 \times 10^{-7}$ (i.e., the predicted FPR of a single test) and keeps the predicted FPR (PFPR) of a many-point evaluation small [17].
Obviously, the above solutions demand that the assumption be close enough to the true distribution of the test statistics; otherwise, an assumption error (AE) in the distribution may lead to an inaccurate prediction of the FPR. As far as we know, this potential risk of AE has not been studied in previous work.
This work studies the false-positives in the $\rho$-test and focuses on the potential AE in [10]. We first propose an estimation algorithm that statistically approximates the true FPR of the $\rho$-test. After running the algorithm, we notice the underpredicted false-positives (UFP): the true FPR of the $\rho$-test is about 779 times the prediction in [10] at the threshold of 5.0. The UFP implies that the previous prediction of the FPR may be unreliable. Second, we discover an AE in the statistical distribution of $\hat{r}_z$. Due to the overlap between the training and the test blocks, there is a nonnegligible error between the assumed distribution and the true distribution of $\hat{r}_z$. This error explains well why the PFPR of the $\rho$-test deviates from the truth. Third, we present a new time-efficient validation, named profiling-shared validation (PSV), to tackle the UFP (and AE). The PSV splits the samples into nonoverlapping subsets and assigns these subsets to different blocks: the first subsets are allocated to a shared profiling block, and the remaining subsets form mutually exclusive test blocks. The experiments show that our PSV not only solves the UFP (and AE) but also reduces the time cost by more than 75%.
The rest of this study is organized as follows. In Section 2, we introduce the underpredicted false-positives. Then, the UFP and AE are analyzed in Section 3. Next, we elaborate on the PSV-based leakage detection in Section 4. Section 5 applies the PSV to high-order leakage detection, and Section 6 summarizes the whole study and draws our conclusions.
2. Underpredicted False-Positives
In [10], the FPR of the $\rho$-test was predicted based on an assumed standard normal distribution. This section shows that the prediction in [10] is not accurate. We first review the details of the $\rho$-test, then introduce our estimation algorithm and describe the underpredicted false-positives.
2.1. Correlation-Based Leakage Detection
The correlation-based leakage detection ($\rho$-test) takes advantage of a $\rho$-statistic to identify the plaintext-dependent leakage by comparing the absolute value of its estimate $\hat{r}_z$ to a detection threshold $T$. In [10], Durvaux et al. utilized $k$-fold CV to estimate the statistic at a time point $\tau$ as follows.
First, an assessor randomly splits the acquired traces into $k$ nonoverlapping subsets $\mathcal{S}_1, \dots, \mathcal{S}_k$ of approximately the same size.
Then, for the $j$th fold of CV, the assessor defines the profiling block $\mathcal{P}_j = \bigcup_{i \neq j} \mathcal{S}_i$ and the test block $\mathcal{T}_j = \mathcal{S}_j$. Thus, a CV-based model $\hat{M}_j$ is profiled from $\mathcal{P}_j$:

$$\hat{M}_j(x) = \mathop{\mathrm{mean}}_{\,l_i \in \mathcal{P}_j,\; d_i = x} l_i, \tag{1}$$

that is, the sample mean of the leakages $l_i \in \mathcal{P}_j$ whose associated data $d_i$ equals $x$.

Next, compute the Pearson correlation coefficient $\hat{r}_j$ between the model and the test leakages $l_i \in \mathcal{T}_j$:

$$\hat{r}_j = \hat{\rho}\big(\hat{M}_j(d_i),\, l_i\big), \quad l_i \in \mathcal{T}_j, \tag{2}$$

where $d_i$ is the data associated with the test leakage $l_i$. An unbiased estimate of the correlation coefficient is obtained by combining the coefficients $\hat{r}_j$:

$$\hat{r} = \frac{1}{k} \sum_{j=1}^{k} \hat{r}_j. \tag{3}$$

Finally, after Fisher's $z$-transformation and normalization, the CV-based estimate of the $\rho$-statistic is

$$\hat{r}_z = \frac{\sqrt{N-3}}{2} \ln\!\left(\frac{1+\hat{r}}{1-\hat{r}}\right), \tag{4}$$

where $N$ is the size of the trace set. Based on the assumption that $\hat{r}_z$ can be interpreted by a normal distribution $\mathcal{N}(0,1)$, the FPR of a single $\rho$-test can be predicted by

$$\mathrm{PFPR} = 2\big(1 - \Phi(T)\big), \tag{5}$$

where $\Phi$ is the cumulative function of the standard normal distribution $\mathcal{N}(0,1)$.
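To make the procedure concrete, the following minimal Python sketch implements equations (1)–(4) at a single time point. The variable names, the 8-bit data range, and the fold handling are our own illustrative assumptions, not code from [10]:

```python
import numpy as np

def rho_test_cv(traces, data, k=10):
    """CV-based estimate of the rho-statistic at one time point.

    traces : (N,) array of leakage samples at the time point under test
    data   : (N,) array of associated 8-bit data values
    """
    N = len(traces)
    folds = np.array_split(np.random.permutation(N), k)  # k nonoverlapping subsets
    r = 0.0
    for j in range(k):
        test = folds[j]                                  # test block T_j
        prof = np.concatenate([folds[i] for i in range(k) if i != j])
        # Equation (1): model = per-value sample mean over the profiling block
        model = np.full(256, traces[prof].mean())
        for x in range(256):
            sel = traces[prof][data[prof] == x]
            if sel.size:
                model[x] = sel.mean()
        # Equation (2): Pearson correlation on the test block
        r += np.corrcoef(model[data[test]], traces[test])[0, 1]
    r /= k                                               # equation (3)
    # Equation (4): Fisher's z-transformation and normalization
    return 0.5 * np.sqrt(N - 3) * np.log((1 + r) / (1 - r))
```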
2.2. Statistical Estimation of FPR
In statistical estimation, the FPR is calculated as the ratio of the number of false-positives $N_{FP}$ to the number of actual negatives $N_{neg}$:

$$\mathrm{FPR} = \frac{N_{FP}}{N_{neg}}. \tag{6}$$
However, neither $N_{FP}$ nor $N_{neg}$ is known to the evaluators in leakage detection. Therefore, we define a data-independent leakage set (DILSET) to statistically approximate the true FPR of the $\rho$-test. A DILSET is a leakage set in which each leakage $l_i$ is independent of the associated data $d_i$. Thus, the true $\rho$-statistic is 0, and all the points are negatives for the $\rho$-test. By rewriting (6), the FPR of the $\rho$-test can be estimated as

$$\mathrm{EFPR} = \frac{N_{+}}{N_{test}}, \tag{7}$$

where $N_{test}$ is the number of $\rho$-tests on the DILSET and $N_{+}$ is the number of (false) positives in the test results.
The statistical estimation is formally described in Algorithm 1. Taking a trace set and a data vector as input, the algorithm first creates a DILSET in which the true $\rho$-statistic is 0. In each loop of Algorithm 1, a generation function produces a leakage vector $\mathbf{l}$ and a random plaintext vector $\mathbf{d}$ over the Galois field $\mathrm{GF}(2^8)$, such that $\mathbf{l}$ is independent of $\mathbf{d}$. Then, the test function performs the $\rho$-test on $(\mathbf{l}, \mathbf{d})$ and appends the resulting estimate $\hat{r}_z$ to a result vector. After $N_{test}$ loops, the algorithm counts the number of estimates with $|\hat{r}_z| > T$ and outputs the ratio in (7) as the estimated FPR (EFPR) of the $\rho$-test. According to the law of large numbers, the EFPR converges to the true FPR of the $\rho$-test as $N_{test}$ increases.
Algorithm 1: Statistical estimation of the FPR of the $\rho$-test.
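A statistical estimation in the spirit of Algorithm 1 can be sketched as follows; it reuses rho_test_cv from the previous sketch, and the parameter values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def estimate_fpr(n_tests=100000, N=2000, k=10, T=5.0):
    """EFPR of the rho-test on a simulated DILSET (cf. equation (7))."""
    positives = 0
    for _ in range(n_tests):
        l = np.random.randn(N)                 # data-independent Gaussian leakages
        d = np.random.randint(0, 256, size=N)  # random bytes over GF(2^8)
        if abs(rho_test_cv(l, d, k)) > T:      # rho-test from the sketch above
            positives += 1
    # By the law of large numbers, this ratio converges to the true FPR;
    # equation (5) predicts PFPR = 2 * (1 - Phi(T)) for comparison.
    return positives / n_tests
```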
2.3. Assumption-Based Prediction vs. Statistical Estimation
In this subsection, we run Algorithm 1 on three different DILSETs to check the prediction of FPR in [10] (i.e., equation (5)).
2.3.1. Data-Independent Leakage Set
The three DILSETs used to check the prediction are constructed as follows.
DILSET_1 is a simulated DILSET where all the leakages are Gaussian noise. Accordingly, in each loop of Algorithm 1, the generation function produces a leakage vector composed of random Gaussian leakages and randomly produces 8-bit variables to form the associated data vector.
DILSET_2 and DILSET_3 are two measured DILSETs where all the leakages come from the DPA contests [18]. In each loop of Algorithm 1, the generation function randomly selects a time point from the traces and returns the leakages at that point as the leakage vector, and produces 8-bit random variables to form the data vector. The two DPA contest datasets are detailed below.
DPA contest v2 targets an unprotected FPGA implementation of the Advanced Encryption Standard (AES) and provides three databases for side-channel evaluation [19]. In this study, we use the first traces in the template base to build DILSET_2.
DPA contest v4.2 uses a rotating S-box masking to protect an AES implementation on a smart card and collects 5,000 traces in each zip file [20]. This study decompresses the first zip file, corresponding to the SHA1sum f711206b413b8d02f595d5861996ff61a1711f3d, to build DILSET_3.
2.3.2. Parameter Setting
The number of folds $k$ has an impact on the bias and variance of the CV-based estimation. A larger value of $k$ gives a less biased estimate but a higher variance of the procedure. In [21], James et al. showed that $k = 5$ and $k = 10$ make a good trade-off between the bias and the variance. We choose 5, 10, 15, and 20 as the candidates for $k$ to support the universality of our work.
The detection threshold $T$ determines the false-positives in leakage detection. Goodwill et al. suggested a threshold of 4.5 for TVLA [22]. However, as the number of points (or tests) increases, the test statistics may become so large that even a leak-free device cannot pass the test with a threshold of 4.5. Therefore, a threshold of 5.0 was recommended for longer traces in [9] and adopted as the threshold of the $\rho$-test in [10]. In our experiments, we set $T$ to the empirical values of 4.5 and 5.0, respectively.
A larger number of $\rho$-tests $N_{test}$ improves the estimation accuracy of the FPR but means a longer running time for Algorithm 1. Because $2(1-\Phi(4.5)) \approx 6.8 \times 10^{-6}$ and $2(1-\Phi(5.0)) \approx 5.7 \times 10^{-7}$, a very large number of tests is needed to observe false-positives at all; we therefore set $N_{test}$ to make a trade-off between the estimation accuracy and the time overhead.
2.3.3. Comparison Experiments
We run Algorithm 1 on the DILSETs and record the EFPRs of the $\rho$-test, as given in Table 1. Comparing the EFPRs to the PFPRs, we notice that all EFPRs are much higher than the PFPRs; for example, on DILSET_1, the EFPR is about 779 times the PFPR at $T = 5.0$. The gap between EFPR and PFPR reveals underpredicted false-positives (UFP): the FPR of the $\rho$-test was underpredicted in [10].
3. Root Cause Analysis
Root cause analysis (RCA) identifies the root cause of the UFP, which helps prevent the underprediction from recurring. This section uses hypothesis test tools to determine the root cause of the UFP. We first introduce the test tools and then analyze the UFP and the AE in [10].
3.1. Hypothesis Test Tools
The tests used for our RCA are given in Table 2.
3.1.1. One-Sample $t$-Test
The one-sample $t$-test compares the sample mean $\bar{x}$ to a prespecified value $\mu_0$ to test for a deviation from this value. The null hypothesis of the test assumes that no difference exists between the true mean $\mu$ and the comparison value $\mu_0$, while the alternative assumes that a difference exists. The test statistic of the one-sample $t$-test is denoted by the letter $t$ and is calculated as

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \tag{8}$$

where $s$ is the sample standard deviation and $n$ is the sample size. Then, the $p$ value of the $t$-test can be computed as

$$p = 2\big(1 - F_{t,\nu}(|t|)\big), \tag{9}$$

where $F_{t,\nu}$ is the cumulative function of Student's $t$-distribution and $\nu = n - 1$ is the degree of freedom. As $\nu$ increases, Student's $t$-distribution gets close to a standard normal distribution [23].
3.1.2. Chi-Squared Test
The chi-squared test ($\chi^2$-test) can be used to test whether the true variance $\sigma^2$ is equal to a specified value $\sigma_0^2$. Assuming the null hypothesis $\sigma^2 = \sigma_0^2$, the test statistic of the $\chi^2$-test is

$$\chi^2 = \frac{(n-1)\,s^2}{\sigma_0^2}. \tag{10}$$

The $p$ value of the (two-sided) null hypothesis is

$$p = 2 \min\big\{F_{\chi^2,\nu}(\chi^2),\; 1 - F_{\chi^2,\nu}(\chi^2)\big\}, \tag{11}$$

where $F_{\chi^2,\nu}$ is the cumulative function of a $\chi^2$ distribution with $\nu = n - 1$ degrees of freedom:

$$F_{\chi^2,\nu}(x) = \frac{1}{\Gamma(\nu/2)}\, \gamma\!\left(\frac{\nu}{2}, \frac{x}{2}\right), \tag{12}$$

where $\Gamma$ denotes the gamma function and $\gamma$ the lower incomplete gamma function [23].
3.1.3. Kolmogorov–Smirnov Test
The Kolmogorov–Smirnov test (KS-test) is a nonparametric test that can be used to compare a sample with a reference probability distribution. Its null hypothesis assumes that the samples come from a specified distribution $F$. In this test, an empirical cumulative function $F_n$ for $n$ independent and identically distributed ordered observations $x_i$ is defined as

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{x_i \le x}, \tag{13}$$

where $\mathbf{1}_{x_i \le x}$ is the indicator function, which returns 1 if $x_i \le x$ and 0 otherwise. The KS statistic for a given cumulative function $F$ is

$$D_n = \sup_x \big|F_n(x) - F(x)\big|, \tag{14}$$

where $\sup_x$ is the supremum of the set of distances. If the sample comes from $F$, $D_n$ converges to zero in the limit when $n$ goes to infinity. Its $p$ value, that is, the probability that the null hypothesis holds, can be calculated from the Kolmogorov distribution [24].
3.1.4. Pearson’s Correlation Test
Pearson’s correlation test is used to test whether there is a relationship between two variables.
Its null hypothesis assumes no correlation between the observed phenomena, that is, $\rho = 0$. Given an estimate $\hat{\rho}$ of the correlation coefficient, the test statistic can be estimated as

$$t = \frac{\hat{\rho}\,\sqrt{n-2}}{\sqrt{1 - \hat{\rho}^2}}. \tag{15}$$

The $p$ value of the test is determined by Student's $t$-distribution with $n - 2$ degrees of freedom [25]. Note that Pearson's correlation test is different in definition from the $\rho$-test in Section 2.1.
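For reference, a hedged SciPy-based sketch of this toolbox applied to a sample of estimates is given below; the function name and the two-sided $\chi^2$ $p$ value (per equation (11)) are our own choices, and Pearson's correlation test (equation (15)) is available as scipy.stats.pearsonr:

```python
import numpy as np
from scipy import stats

def test_normality(r_z, alpha=0.01):
    """Check a sample of estimates against the assumed N(0,1)."""
    n = len(r_z)
    # One-sample t-test, H0: mean = 0 (equations (8)-(9))
    p_t = stats.ttest_1samp(r_z, popmean=0.0).pvalue
    # Chi-squared test, H0: variance = 1 (equations (10)-(12))
    chi2_stat = (n - 1) * np.var(r_z, ddof=1) / 1.0
    cdf = stats.chi2.cdf(chi2_stat, df=n - 1)
    p_chi2 = 2 * min(cdf, 1 - cdf)
    # KS-test, H0: r_z ~ N(0,1) (equations (13)-(14))
    p_ks = stats.kstest(r_z, 'norm').pvalue
    # Pearson's correlation test between two estimate vectors x and y
    # (equation (15)) would be stats.pearsonr(x, y).
    return {name: (p, p < alpha)   # (p value, rejected at level alpha?)
            for name, p in [('t', p_t), ('chi2', p_chi2), ('ks', p_ks)]}
```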
3.2. Error in Assumed Distribution
As shown in Figure 1, the UFP implies a nonnegligible error between the true distribution of $\hat{r}_z$ and the assumed distribution $\mathcal{N}(0,1)$. Following the description in [26], we name this error the assumption error in leakage detection and use the $p$ values of the tests on $\hat{r}_z$ to quantify the AE: the smaller the $p$ value, the more strongly the null hypothesis should be rejected and the more significant the AE in the $\rho$-test. The results of the hypothesis tests are given in Table 3. At the significance level of 0.01 (an accepted threshold in [27]), the $p$ values of the $t$-tests exceed 0.01, which means the null hypothesis $\mu = 0$ is accepted. In contrast, the $p$ values of the $\chi^2$-tests and KS-tests are less than 0.01, that is, the null hypotheses $\sigma^2 = 1$ and $\hat{r}_z \sim \mathcal{N}(0,1)$ are rejected. Hence, it is confirmed that the assumed $\mathcal{N}(0,1)$ is not the true distribution of $\hat{r}_z$.

3.3. Correlation between Cross-Validation Blocks
To determine the root cause of a problem, RCA establishes an event timeline from the normal situation up to the time the problem occurs. Based on equations (1)–(4), we create the timeline from input to UFP (or AE) occurrence, as shown in Figure 2. To facilitate the RCA, an intrablock estimate $\hat{r}_{z,j}$ is obtained by performing Fisher's $z$-transform and normalization on each $\hat{r}_j$:

$$\hat{r}_{z,j} = \frac{\sqrt{n_j - 3}}{2} \ln\!\left(\frac{1 + \hat{r}_j}{1 - \hat{r}_j}\right), \tag{16}$$

where $n_j$ is the number of traces in the test block $\mathcal{T}_j$.

We first check the distribution of $\hat{r}_{z,j}$ to locate the source of the AE. Table 4 provides the $p$ values of the tests on $\hat{r}_{z,j}$. Since all the $p$ values exceed the level of 0.01, the intrablock estimates pass the tests for $\mathcal{N}(0,1)$, which excludes equations (1) and (2) from the possible sources of the AE. In [28], Bengio et al. proved that between-block correlation leads to a biased variance of the CV-based estimation (in equation (3)). Therefore, we utilize Pearson's correlation test to examine the correlation between $\hat{r}_{z,i}$ and $\hat{r}_{z,j}$ ($i \neq j$). For the case of DILSET_1, the results of the correlation tests are given in Table 5. Since the $p$ values fall below 0.01, the null hypothesis of no correlation is rejected. The correlation between CV blocks is confirmed and may be the cause of the AE in the $\rho$-test.
3.4. Overlapping Cross-Validation Blocks
According to Corollary 3 in [28], the between-block correlation in cross-validation stems from the overlap between the training and the test blocks. In this subsection, we present a nonoverlapping validation (NOV) and compare the NOV-based $\rho$-test ($\rho_{nov}$-test) with the $\rho$-test to check the impact of the overlap on the FPR. The $\rho_{nov}$-test works as follows.
First, the traces are split into $k$ nonoverlapping blocks of approximately the same size. For each block, a portion of the traces is randomly selected as the test subset $\mathcal{T}_j$, and the others are left as the profiling subset $\mathcal{P}_j$. Then, the model is profiled:

$$\hat{M}_j(x) = \mathop{\mathrm{mean}}_{\,l_i \in \mathcal{P}_j,\; d_i = x} l_i. \tag{17}$$
Finally, an NOV-based estimate of the $\rho$-statistic can be calculated by equations (18)–(20):

$$\hat{r}_j = \hat{\rho}\big(\hat{M}_j(d_i),\, l_i\big), \quad l_i \in \mathcal{T}_j, \tag{18}$$

$$\hat{r}_{z,j} = \frac{\sqrt{n_j - 3}}{2} \ln\!\left(\frac{1 + \hat{r}_j}{1 - \hat{r}_j}\right), \tag{19}$$

$$\hat{r}_z = \frac{1}{\sqrt{k}} \sum_{j=1}^{k} \hat{r}_{z,j}. \tag{20}$$
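A minimal sketch of the NOV-based estimate follows; the 50/50 split of each block into profiling and test subsets is an illustrative assumption (the exact fraction is not restated here):

```python
import numpy as np

def rho_test_nov(traces, data, k=10, test_frac=0.5):
    """NOV-based estimate of the rho-statistic at one time point."""
    blocks = np.array_split(np.random.permutation(len(traces)), k)
    z = []
    for b in blocks:                               # k nonoverlapping blocks
        cut = int(len(b) * test_frac)
        test, prof = b[:cut], b[cut:]              # disjoint test/profiling subsets
        model = np.full(256, traces[prof].mean())  # equation (17): per-block model
        for x in range(256):
            sel = traces[prof][data[prof] == x]
            if sel.size:
                model[x] = sel.mean()
        r = np.corrcoef(model[data[test]], traces[test])[0, 1]              # (18)
        z.append(0.5 * np.sqrt(len(test) - 3) * np.log((1 + r) / (1 - r)))  # (19)
    return np.sum(z) / np.sqrt(k)                  # (20): normalized mean
```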
We run Algorithm 1 on an extended DILSET_1 (ExDILSET_1) to approximate the true FPR of the $\rho_{nov}$-test. Specifically, in each loop of the algorithm, the generation function produces random leakages to form the leakage vector and randomly generates a data vector composed of 8-bit variables, so that the leakages are independent of the data. Then, the test function performs the $\rho_{nov}$-test on the pair and returns an NOV-based estimate $\hat{r}_z$. After $N_{test}$ loops, the intrablock estimates $\hat{r}_{z,j}$ and the estimates $\hat{r}_z$ are obtained. Finally, Algorithm 1 counts the number of false-positives and outputs the EFPR of the $\rho_{nov}$-test.
The results of Pearson's correlation test between $\hat{r}_{z,i}$ and $\hat{r}_{z,j}$ ($i \neq j$) are given in Table 6. At the significance level of 0.01, all the pairs pass Pearson's correlation test, which means that there is no correlation between the NOV blocks. Table 7 provides the $p$ values of the distribution tests on $\hat{r}_z$ and the EFPRs of the $\rho_{nov}$-test. On the one hand, since all the $p$ values exceed 0.01, the estimates pass the test for the distribution $\mathcal{N}(0,1)$. On the other hand, compared with the EFPRs of the $\rho$-test (Table 1), the EFPRs of the $\rho_{nov}$-test are much closer to the PFPRs. Therefore, both the AE and the UFP are solved by eliminating the between-block overlap. It is concluded that the overlap between the training and the test blocks is the root cause of the AE and UFP in the $\rho$-test.
4. Improved $\rho$-Test
Section 3 has shown that NOV is an effective solution to the UFP in the $\rho$-test but requires more samples to construct the nonoverlapping blocks. In this section, we introduce the profiling-shared validation to solve the UFP efficiently.
4.1. Profiling-Shared Validation
Profiling-shared validation shares the same profiling samples, and hence the same profiled model, across all the test blocks. In a PSV-based $\rho$-test ($\rho_{psv}$-test), the evaluator randomly splits the traces into nonoverlapping subsets of approximately the same size. Choosing the first subsets as the profiling block $\mathcal{P}$, the model at time point $\tau$ can be profiled:

$$\hat{M}(x) = \mathop{\mathrm{mean}}_{\,l_i \in \mathcal{P},\; d_i = x} l_i. \tag{21}$$
Then, for the $j$th test block $\mathcal{T}_j$, $j = 1, \dots, k$, the Pearson correlation coefficient between the model and the test leakages is computed:

$$\hat{r}_j = \hat{\rho}\big(\hat{M}(d_i),\, l_i\big), \quad l_i \in \mathcal{T}_j. \tag{22}$$
Next, after Fisher's $z$-transformation and normalization, the intrablock estimates $\hat{r}_{z,j}$ are obtained:

$$\hat{r}_{z,j} = \frac{\sqrt{n_j - 3}}{2} \ln\!\left(\frac{1 + \hat{r}_j}{1 - \hat{r}_j}\right). \tag{23}$$
Finally, the intrablock estimates are averaged and the mean is normalized to obtain a PSV-based estimate of the $\rho$-statistic:

$$\hat{r}_z = \frac{1}{\sqrt{k}} \sum_{j=1}^{k} \hat{r}_{z,j}. \tag{24}$$
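The following sketch illustrates equations (21)–(24). The split into $2k$ subsets with the first $k$ shared for profiling is our illustrative reading of the construction, not a normative choice from this section:

```python
import numpy as np

def rho_test_psv(traces, data, k=10):
    """PSV-based estimate of the rho-statistic at one time point."""
    subsets = np.array_split(np.random.permutation(len(traces)), 2 * k)
    prof = np.concatenate(subsets[:k])         # shared profiling block P
    model = np.full(256, traces[prof].mean())  # equation (21): profiled once
    for x in range(256):
        sel = traces[prof][data[prof] == x]
        if sel.size:
            model[x] = sel.mean()
    z = []
    for test in subsets[k:]:                   # k mutually exclusive test blocks
        r = np.corrcoef(model[data[test]], traces[test])[0, 1]              # (22)
        z.append(0.5 * np.sqrt(len(test) - 3) * np.log((1 + r) / (1 - r)))  # (23)
    return np.sum(z) / np.sqrt(k)              # (24): average, then normalize
```

Because the model is profiled only once, the profiling loop runs one time instead of $k$ times, which is where the time savings reported below come from.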
4.2. Effectiveness and Efficiency
4.2.1. Effectiveness
We perform the $\rho_{psv}$-test on DILSET_1, 2, and 3 and then test the estimates $\hat{r}_{z,j}$ and $\hat{r}_z$. Since each PSV test block is independent of the other blocks, the intrablock estimates should be uncorrelated. For example, in the case of DILSET_1, the results of the correlation tests between $\hat{r}_{z,i}$ and $\hat{r}_{z,j}$ ($i \neq j$) are given in Table 8. At the significance level of 0.01, the null hypothesis of no correlation holds. Table 9 provides the $p$ values of the distribution tests on $\hat{r}_z$ and the EFPRs of the $\rho_{psv}$-test. Since all the $p$ values exceed the acceptable level of 0.01, the PSV-based estimates pass the test for $\mathcal{N}(0,1)$. In addition, compared to the EFPRs of the $\rho$-test (Table 1), the EFPRs of the $\rho_{psv}$-test are much closer to the PFPRs. Hence, PSV is an effective solution to the AE and UFP.
4.2.2. Efficiency
Different from the CV, which requires $k$ profiled models for the validation, PSV shares the same model across all test blocks, which means the $\rho_{psv}$-test spends less time in the profiling phase than the $\rho$-test. We compare the execution time of the $\rho$-test and the $\rho_{psv}$-test on an HP EliteBook 735 G6 (AMD Ryzen 5 PRO 3500U CPU @ 2.1 GHz, 8 GB RAM) running Windows 10. The time overhead of 1,000,000 tests on DILSET_1 is shown in Figure 3. Compared with the time cost of 1,000,000 $\rho$-tests, which is at least 676.94 seconds, the maximum time cost of 1,000,000 $\rho_{psv}$-tests is far lower, saving about 79% of the time costs. In short, the $\rho_{psv}$-test is more time-efficient than the $\rho$-test.

4.3. Measured Experiments
We verify whether the $\rho_{psv}$-test can identify the plaintext-dependent leakages in captured power traces. In our measured experiments, the $\rho_{psv}$-test assesses the dependency between the leakages and the first 4 plaintext bytes in DPA contest v4.2. We run the $\rho_{psv}$-test on DPA contest v4.2 and plot the estimates $\hat{r}_z$ in the time domain, as shown in Figure 4. With the threshold of 5.0, each of the 4 plaintext bytes is leaked in a distinct time interval. In other words, the plaintext-dependent leakages are successfully identified by the $\rho_{psv}$-test from the captured traces.

Figure 4: $\hat{r}_z$ estimates of the $\rho_{psv}$-test on DPA contest v4.2 for the first 4 plaintext bytes ((a)–(d)).
5. Higher-Order Leakage Detection
Higher-order leakage detection evaluates the security of the protected implementation against higher-order side-channel analysis. In this section, we analyze the performance of the PSV method in detecting higher-order leakages.
5.1. Combining Function
In higher-order side-channel analysis, a combining function maps the leakages of multiple shares to a single univariate leakage. The work of Prouff et al. has demonstrated that the central product function

$$C\big(l_{\tau_1}, \dots, l_{\tau_n}\big) = \prod_{i=1}^{n} \big(l_{\tau_i} - \mathrm{E}[l_{\tau_i}]\big), \tag{25}$$

where $l_{\tau_i}$ is the $i$th leakage at time point $\tau_i$ in the trace set, is optimal for the Hamming weight leakage scenario represented by the smart card platform [29]. In our experiments, the central product function is selected to preprocess the leakages of the masking. According to the classification in [30], we analyze the performance of PSV under four configurations of the central product function, that is, four combinations of the number of variables $n$ and the order of the statistical moment $d$ (equations (26)–(29)); a sketch of the combining step is given below.
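The sketch below applies the central product function of [29] sample-wise to the shares' leakage vectors; the function name and argument layout are our own assumptions:

```python
import numpy as np

def central_product(shares):
    """Combine the leakages of n shares into a univariate leakage.

    shares : list of (N,) arrays; the i-th array holds the N samples
             observed at time point tau_i.
    """
    combined = np.ones_like(shares[0], dtype=float)
    for l in shares:
        combined *= l - l.mean()   # center each share, then multiply
    return combined

# Univariate 2nd-order case: combine the leakage with itself, i.e.,
# square the centered samples before running the rho_psv-test:
# combined = central_product([l_tau, l_tau])
```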
5.2. Effectiveness and Efficiency
5.2.1. Effectiveness
As described in Section 3, the $p$ values of the distribution tests on $\hat{r}_z$ quantify the AE and the UFP in the $\rho$-test. Thus, we test the distribution of $\hat{r}_z$ to demonstrate the effectiveness of PSV in higher-order leakage detection, where $\hat{r}_z$ is calculated from the combined DILSETs under the different configurations. For example, in the bivariate case, we compute the estimates as follows. In addition to the data vector, two sets of leakages are randomly generated such that they are independent of the data. Then, the leakages are combined by the central product function (28). Finally, we perform the $\rho$-test or $\rho_{psv}$-test on the combined leakages to obtain the CV-based or PSV-based estimates. For each configuration, Table 10 provides the results of the distribution tests on both estimates. At the significance level of 0.01, the CV-based estimate is rejected by the tests for $\sigma^2 = 1$ and $\mathcal{N}(0,1)$, while the PSV-based estimate passes. This shows that the AE and UFP in the higher-order $\rho$-test can be solved by the PSV method.
5.2.2. Efficiency
The time costs of 1,000,000 2nd-order tests are shown in Figure 5. As shown in Figure 5(a), compared with the time cost of 1,000,000 univariate 2nd-order $\rho$-tests of at least 743.85 seconds, the maximum time cost of 1,000,000 $\rho_{psv}$-tests is 161.31 seconds, saving about 78% of the time overhead of the detections. In Figure 5(b), compared with the minimum time cost of the bivariate 2nd-order $\rho$-tests of 767.80 seconds, the $\rho_{psv}$-test takes at most 187.23 seconds, saving about 76% of the time overhead. In other words, the higher-order $\rho_{psv}$-test is more time-efficient than the higher-order $\rho$-test.

Figure 5: Time costs of 1,000,000 higher-order tests: (a) univariate 2nd-order; (b) bivariate 2nd-order.
5.3. Measured Experiments
We use (26) to preprocess DPA contest v4.2 and run the $\rho_{psv}$-test on the preprocessed traces. Figure 6 shows the results of the $\rho_{psv}$-test between the traces and the first 4 plaintext bytes. With the threshold of 5.0, each of the first 4 bytes is leaked in a distinct time interval, which means that the univariate 2nd-order leakages in the measured power consumption are identified by the $\rho_{psv}$-test.

Figure 6: Results of the $\rho_{psv}$-test on the preprocessed DPA contest v4.2 traces for the first 4 plaintext bytes ((a)–(d)).
6. Conclusions
Assumption error (AE) invalidates side-channel security assessment. This study finds that the false-positive rate of leakage detection might be mispredicted due to potential errors between the assumed distribution and the true distribution of the estimated test statistics. We notice underpredicted false-positives (UFP) in [10]. This underprediction, interpreted as the AE in the statistical distribution of the estimates of the $\rho$-statistic, is caused by the overlap between the training and the test blocks in cross-validation. In addition, we propose the profiling-shared validation (PSV) to improve the detection of any-variate any-order leakages and show that the UFP and AE can be addressed by eliminating the between-block overlap. Compared with the $\rho$-test, our $\rho_{psv}$-test overcomes the UFP and takes less than 25% of the time cost. To the best of our knowledge, this article is the first empirical study of the false-positives in leakage detection. In future work, we will refine our tools, including the estimation algorithm and the distribution tests, to preevaluate other security assessments.
Data Availability
The datasets and codes used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (61972295), the Wuhan Science and Technology Project Application Foundation Frontier Special Project (2019010701011407), and the Foundation Project of Wuhan Maritime Communication Research Institute (2018J-11).