Abstract

We consider the efficient hardware implementation of Grain-128AEADv2, the second version of Grain-128AEAD and one of the finalist candidates of the NIST Lightweight Cryptography project. In order to counteract side-channel attacks, an efficient masked hardware implementation of Grain-128AEADv2 based on domain-oriented masking is also considered. In particular, a so-called pipeline-like pre-computation technique is applied to increase the throughput-area ratio of the (masked) hardware implementation of Grain-128AEADv2. The performance of the (masked) hardware implementation of Grain-128AEADv2 is evaluated on both ASIC and FPGA; for both the unmasked and the masked versions, the pipeline-like pre-computation technique achieves the highest throughput-area ratio on both platforms. The security of the masked hardware implementation of Grain-128AEADv2 is then verified with the T-Test in a simulated scenario. To the best of our knowledge, this is the first published work on the (masked) hardware implementation of Grain-128AEADv2. In light of this, this contribution may help researchers and practitioners to accurately compare the efficiency and the security of the hardware implementation of Grain-128AEADv2 with those of other lightweight cryptographic algorithms.

1. Introduction

In August 2018, the National Institute of Standards and Technology (NIST) launched the Lightweight Cryptography (LWC) project, calling for lightweight cryptographic algorithms that provide authenticated encryption with associated data (AEAD). These algorithms are expected to be suitable for resource-constrained environments in which existing cryptographic algorithms may be impractical. Compared with existing symmetric cryptographic algorithms, lightweight cryptographic algorithms that provide AEAD can achieve confidentiality and integrity simultaneously. After two rounds of review, NIST announced in March 2021 that Grain-128AEAD and nine other lightweight cryptographic algorithms were selected as finalists.

The Grain family was first proposed as a candidate in the eSTREAM project [1] launched by the European Network of Excellence for Cryptology. After three rounds of evaluation, Grain v1 [2] was chosen as one of the three stream ciphers included in the hardware-oriented Portfolio 2. Then, recognizing the emerging need for 128-bit keys, Hell et al. proposed Grain-128 [3], which supports 128-bit keys and 96-bit IVs. Grain-128a [4], which additionally supports authentication, was proposed in 2011. Based on Grain-128a, Grain-128AEAD [5] was proposed and submitted to the LWC project; compared with Grain-128a, it can additionally use associated data in the authentication. Finally, in order to strengthen security against key reconstruction from a known internal state, Grain-128AEADv2 [6] was proposed and submitted to the LWC project as a finalist.

Considering that lightweight cryptographic algorithms are expected to be used in resource-constrained environments, it is meaningful to evaluate the performance of the hardware implementation of Grain-128AEADv2. In fact, there have been several hardware implementations of the Grain family. In [7], a hardware implementation of Grain-128a was proposed, and in [8], an efficient hardware implementation of Grain-128AEAD was proposed. In order to increase the throughput-area ratio, the Galois transformation of the NFSR was adopted in these hardware implementations. However, compared with the original cipher, such a transformation leads to different output sequences and different NFSR states. Moreover, since different versions of the hardware implementation of Grain-128AEAD adopt different Galois feedback functions, their output sequences can also differ, which decreases the generality of the hardware implementation of Grain-128AEAD. In order to generate the same output sequences as the original ones, the secret key should be transformed [9], and the transformation of the secret key needs extra control logic. The situation is more complicated in the hardware implementation of Grain-128AEADv2, since the secret key bits are XORed with the inputs to both the LFSR and the NFSR in the key-reintroduction phase. Thus, extra memory can be needed to store the transformed secret key, which increases the area cost of the hardware implementation of Grain-128AEADv2. Apart from that, the increased area cost and the more complicated control logic can influence the synthesis process and decrease the throughput-area ratio of the hardware implementation of Grain-128AEADv2.

One strategy to increase the throughput-area ratio of the hardware implementation of Grain-128AEADv2 is to apply the pipeline technique, which increases the frequency of the hardware implementation by inserting registers into the critical path. However, the pipeline technique cannot be applied to FSRs directly because of their intrinsic feedback property [8]. In light of this, a pipeline-like pre-computation technique is proposed. Compared with the Galois transformation, the pipeline-like pre-computation technique has two advantages. First, it does not change the feedback functions, so the output sequence is the same as that of the original Grain-128AEADv2 and the generality of the hardware implementation is not affected. Second, since the feedback functions are not changed, the parallel level of the hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique can be up to x32, which is twice the maximum parallel level (x16) reachable with the Galois transformation [7]. As the throughput-area ratio of a hardware implementation increases with its parallel level, the pipeline-like pre-computation technique can therefore yield an efficient hardware implementation of Grain-128AEADv2.

In practice, side-channel attacks can exploit the leakage of a cryptographic implementation to recover the secret key, which poses a serious threat to its security. In light of this, a hardware implementation of Grain-128AEADv2 that is secure against side-channel attacks should be considered. There exist different types of countermeasures against side-channel attacks, such as masking [10, 11], shuffling [12], and random delay [13]. Among them, masking, as a provably secure countermeasure, is probably the best known. Therefore, the masking technique is adopted in the secure hardware implementation of Grain-128AEADv2. Since the idea of masking was proposed at CRYPTO 1999 [14], masking schemes suitable for software and hardware implementations have been proposed over the past twenty years. Compared with the software ones, the hardware ones may face security problems related to glitches. The security of a masking scheme should be analyzed in a well-defined adversary model. Ishai et al. first proposed the $d$-probing model at CRYPTO 2003 [11], where the adversary can obtain the values of $d$ internal bits with $d$ probes. However, security of the hardware implementation of a cryptographic algorithm under the $d$-probing model may not be sufficient, since the leakage related to glitches is not considered [15]. For example, the authors of [16, 17] applied side-channel attacks to masked AES hardware implementations and concluded that glitches pose a serious threat to the security of masked AES hardware implementations.

In order to take the security problems related to glitches into account, the idea of the glitch-extended probing model was first proposed at CRYPTO 2015 [18], and its formal version was given at CHES 2018 [19]. The first glitch-resistant masking scheme, Threshold Implementation (TI), was proposed by Nikova et al. at ICISC 2006. At least $td+1$ shares should be used in TI, where $t$ denotes the degree of the non-linear function and $d$ denotes the security order; therefore, the number of shares needed in TI can be large. In order to decrease the number of shares needed in TI, Reparaz et al. proposed Consolidating Masking Schemes (CMS) at CRYPTO 2015 [18] by using fresh randomness. After that, Domain-Oriented Masking (DOM) [20] and the Unified Masking Approach (UMA) [21] were proposed to further reduce the amount of fresh randomness. Among these schemes, DOM induces the least computation delay and the minimum number of extra operations, and the amount of fresh randomness required by DOM is relatively small. Therefore, DOM is well suited for securing the efficient hardware implementation of Grain-128AEADv2. In order to increase the throughput-area ratio of the masked hardware implementation of Grain-128AEADv2, the pipeline-like pre-computation technique can also be applied. However, for parallel levels of 16 and above, the masked feedback functions would need values that have not yet been shifted into the FSRs, so the pre-computation cannot be performed. Thus, only parallel levels up to x8 can be achieved in the masked hardware implementation of Grain-128AEADv2. The security of the masked hardware implementation of Grain-128AEADv2 is verified with the T-Test proposed by Goodwill et al. [22] in a simulated scenario.

Then, the performance of the (masked) hardware implementation of Grain-128AEADv2 is evaluated on both ASIC and FPGA. According to the synthesis results, the hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique obtains the highest throughput-area ratio for both the unmasked and the masked versions. In detail, the highest throughput-area ratio for the unmasked version is obtained by the x32 parallel version with the pipeline-like pre-computation technique, and the highest throughput-area ratio for the masked version is obtained by the x8 parallel version with the pipeline-like pre-computation technique, on both ASIC and FPGA. Overall, this contribution may help researchers and practitioners to accurately compare the hardware implementation efficiency of Grain-128AEADv2 with those of other lightweight cryptographic algorithms.

The rest of the paper is organized as follows. Preliminaries are presented in Section 2. Then, three versions of hardware implementation of Grain-128AEADv2 are shown in Section 3. In Section 4, the masked hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique is shown. Then, in Section 5, the performance of the (masked) hardware implementation of Grain-128AEADv2 is evaluated on both ASIC and FPGA. In Section 6, T-Test is used to evaluate the security of the masked hardware implementation of Grain-128AEADv2 in the simulated scenario. Finally, conclusions are drawn in Section 7.

2. Preliminaries

This section first presents the details of Grain-128AEADv2, then the glitch-extended probing model, and finally the DOM masking scheme.

2.1. Grain-128AEADv2

Grain-128AEADv2 is composed of two building blocks [6]. The first block is a pre-output generator, which is composed of a Linear Feedback Shift Register (LFSR), a Non-linear Feedback Shift Register (NFSR), and a pre-output function. The second block is an authenticator generator, which is composed of a shift register and an accumulator. The structure of Grain-128AEADv2 is shown in Figure 1.

The pre-output generator consists of a 128-bit LFSR, a 128-bit NFSR, and a pre-output function. It generates a stream of pseudo-random bits, which can be used for encryption and authentication. The states of the 128-bit LFSR and the 128-bit NFSR at clock cycle $t$ can be denoted as $S_t = (s_t, s_{t+1}, \ldots, s_{t+127})$ and $B_t = (b_t, b_{t+1}, \ldots, b_{t+127})$, respectively. The corresponding update functions of the LFSR and the NFSR can be expressed with $f$ and $g$:

The output of the pre-output generator is given by the pre-output function as

The authenticator generator consists of a 64-bit shift register $R$ and a 64-bit accumulator $A$. We denote the content of the shift register at instance $i$ as $R_i$ and the content of the accumulator at instance $i$ as $A_i$. The running of Grain-128AEADv2 consists of two phases: an initialization phase and a keystream generation phase. During the initialization phase, the LFSR and the NFSR are first loaded with the key bits and the IV bits. If we denote the key bits as $k_i$, $0 \le i \le 127$, and the IV bits as $IV_i$, $0 \le i \le 95$, all 128 bits of the NFSR are loaded with the key bits, i.e., $b_i = k_i$, and the first 96 LFSR bits are loaded with the IV bits, i.e., $s_i = IV_i$. The last 32 bits of the LFSR are filled with 31 ones and a zero, i.e., $s_i = 1$ for $96 \le i \le 126$ and $s_{127} = 0$. Then, Grain-128AEADv2 is clocked 320 times, feeding back the pre-output function and XORing it with the input to both the LFSR and the NFSR, i.e.,

Then, Grain-128AEADv2 is clocked 64 times, reintroducing the key and XORing it with the input to both the LFSR and the NFSR, i.e.,

Once the pre-output generator has been initialized, the authenticator generator is initialized by loading the register and the accumulator with the pre-output keystream as

While the register and the accumulator are initializing, the LFSR and the NFSR should be simultaneously updated as

Thus, when Grain-128AEADv2 has been fully initialized, the LFSR, the NFSR, the shift register, and the accumulator hold their post-initialization states. During the keystream generation phase, the pre-output is used to generate keystream bits for encryption and authentication bits for the authenticator. These two bit streams can be defined as

Then, the message can be encrypted as

The accumulator can be updated as

The shift register can be updated as
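
To make the authenticator generator concrete, the following Verilog sketch models a bit-serial version of the updates described above, assuming the usual Grain-128AEAD behaviour: the accumulator is XORed with the shift register gated by the current message bit, and the shift register then absorbs the next authentication keystream bit. The module and signal names (auth_gen, m_bit, y_auth, etc.) are illustrative and do not come from a reference design; initialization from the pre-output stream is reduced to a simple load pulse.

```verilog
// Bit-serial sketch of the authenticator generator (x1 datapath), assuming the
// Grain-128AEAD-style update A <= A ^ (m & R) followed by shifting the next
// authentication keystream bit into R. Names are illustrative; the preload values
// would come from the pre-output stream as described above.
module auth_gen (
    input  wire        clk,
    input  wire        load,      // pulse: load the initial contents of A and R
    input  wire [63:0] init_acc,  // accumulator preload
    input  wire [63:0] init_sr,   // shift-register preload
    input  wire        en,        // one message bit processed per enabled cycle
    input  wire        m_bit,     // current message bit
    input  wire        y_auth,    // next authentication keystream bit
    output reg  [63:0] acc,       // 64-bit accumulator A
    output reg  [63:0] sr         // 64-bit shift register R
);
    always @(posedge clk) begin
        if (load) begin
            acc <= init_acc;
            sr  <= init_sr;
        end else if (en) begin
            acc <= acc ^ (sr & {64{m_bit}});  // accumulate R gated by the message bit
            sr  <= {y_auth, sr[63:1]};        // shift in the next authentication bit
        end
    end
endmodule
```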

2.2. Glitch-Extended Probing Model

Usually, the security of masking schemes is evaluated under the probing model [11], where the adversary can place up to $d$ probes on intermediate variables of the implementation of a masking scheme. The probing model is formally defined in Definition 1.

Definition 1 ($d$-probing model [11]). Given a combinational logic circuit $C$, an adversary with $d$ probes can observe up to $d$ internal wires of $C$.

Glitches exist in the hardware implementation of a cryptographic algorithm. As a result, by putting a probe on the output of a combinational circuit, one may obtain information about its inputs. Consequently, the security of hardware implementations of a cryptographic algorithm should be considered under the glitch-extended probing model, which is formally defined in Definition 2.

Definition 2 (glitch-extended probing model [19]). Given a combinational logic circuit $C$, an adversary with glitch-extended probes can observe, by probing any output of $C$, all the inputs of $C$ up to the latest synchronization point.

Example 1. Given a combinational logic circuit $C$ implementing a function $y = f(x_1, x_2, \ldots, x_n)$, the inputs of $C$ can be represented as $x_i$, $1 \le i \le n$, while the output of $C$ can be represented by $y$. Under the glitch-extended probing model, one may obtain the inputs of $C$, i.e., $x_1, x_2, \ldots, x_n$, by probing $y$.

2.3. Domain-Oriented Masking (DOM)

Domain-Oriented Masking (DOM) can be used to secure the efficient hardware implementation of Grain-128AEADv2 under the glitch-extended probing model with $d(d+1)/2$ fresh random bits per non-linear gadget, where $d$ denotes the security level. In DOM, XOR and AND masking gadgets replace the original XOR and AND gates. In order to achieve first-order security, the original sensitive variables $a$ and $b$ should be divided into the shares $(a_0, a_1)$ and $(b_0, b_1)$, respectively. A masking gadget takes $(a_0, a_1)$ and $(b_0, b_1)$ as input and returns $(q_0, q_1)$ as output. Owing to the linear property of the XOR operation, the XOR masking gadget can be trivially achieved as $q_0 = a_0 \oplus b_0$ and $q_1 = a_1 \oplus b_1$.

However, the AND masking gadget is more complicated, as shown in Figure 2. The AND masking gadget performs three steps in order to map the input shares to the output shares, which can be referred to as calculation, resharing, and integration. In the calculation step, the actual multiplication is performed and the product terms $a_0b_0$, $a_0b_1$, $a_1b_0$, and $a_1b_1$ are obtained. In DOM, $a_0b_0$ and $a_1b_1$ are defined as the inner-domain terms, and $a_0b_1$ and $a_1b_0$ are defined as the cross-domain terms. Then, in the resharing step, each cross-domain term is randomized with a fresh random bit $z$ so that it becomes independent of the other terms and can be added to an arbitrary domain in the next step. In order to prevent glitches from propagating through the resharing step, a register must be inserted at the end of the resharing step. Finally, in the integration step, the inner-domain terms and the reshared cross-domain terms are added to obtain $q_0 = a_0b_0 \oplus (a_0b_1 \oplus z)$ and $q_1 = a_1b_1 \oplus (a_1b_0 \oplus z)$.
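
The following Verilog sketch shows a first-order DOM AND gadget with exactly these three steps; one fresh random bit z is consumed per gadget. Signal names are illustrative, and the inner-domain terms are registered only to keep both paths aligned in a pipelined design.

```verilog
// First-order DOM AND gadget: q = a & b on shares (a0, a1), (b0, b1) with one
// fresh random bit z. Signal names are illustrative, not taken from a reference design.
module dom_and_1st (
    input  wire clk,
    input  wire a0, a1,   // shares of operand a
    input  wire b0, b1,   // shares of operand b
    input  wire z,        // fresh random bit
    output wire q0, q1    // shares of the product q
);
    // Calculation step: inner-domain and cross-domain products.
    wire ip0 = a0 & b0;
    wire ip1 = a1 & b1;
    wire cp0 = a0 & b1;
    wire cp1 = a1 & b0;

    // Resharing step: blind the cross-domain terms with z and register them,
    // so that glitches cannot propagate across domains.
    reg ip0_r, ip1_r, cp0_r, cp1_r;
    always @(posedge clk) begin
        cp0_r <= cp0 ^ z;
        cp1_r <= cp1 ^ z;
        ip0_r <= ip0;     // registered only to keep both paths aligned
        ip1_r <= ip1;
    end

    // Integration step: recombine the terms within each domain.
    assign q0 = ip0_r ^ cp0_r;
    assign q1 = ip1_r ^ cp1_r;
endmodule
```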

3. Efficient Hardware Implementations of Grain-128AEADv2

In order to compare the hardware implementation performances of different LWC finalist candidates, the standard LWC hardware API [23, 24] is adopted in the hardware implementations of Grain-128AEADv2, which is presented in Section 3.1. Then, three versions of the hardware implementations of Grain-128AEADv2, i.e., the straightforward version, the Galois transformation version, and the pipeline-like pre-computation version, are presented.

3.1. The Standard LWC Hardware API

The standard LWC hardware API includes the minimum compliance criteria, interface, communication protocol, and timing characteristics that should be supported by hardware implementation of Grain-128AEADv2. The interface of the hardware implementation of Grain-128AEADv2 is shown in Figure 3.

According to their different functions, the I/O ports can be divided into the datapath ports and the control ports. The datapath ports consist of the data input ports and the data output ports. The data are input through the ports key and bdi_data (block data input), and the data are output through the port bdo_data (block data output). The key port is controlled by the handshake signals key_valid and key_ready, and key_update is used to indicate that the internal key should be updated. The bdi_data port is controlled by the handshake signals bdi_valid and bdi_ready, while the bdi_valid_bytes and bdi_size ports indicate the location and the size of the valid data in the bdi_data port. The bdo_data port is controlled by the handshake signals bdo_valid and bdo_ready. The widths of key, bdi_data, and bdo_data are each set to 32 bits; the widths of bdi_valid_bytes and bdi_size are set to 4 and 3 bits, respectively; and the other control ports are 1 bit wide, so that the interface is consistent with the LWC hardware API.
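
For orientation, the top-level interface implied by this port list can be sketched in Verilog as follows. Besides clock and reset, only the ports named in the text are shown; further ports defined by the LWC hardware API (e.g., segment-type and end-of-input signals) are omitted here.

```verilog
// Interface skeleton of the LWC wrapper for Grain-128AEADv2 (illustrative).
module lwc_grain128aeadv2 (
    input  wire        clk,
    input  wire        rst,
    // key input, 32-bit words with a valid/ready handshake
    input  wire [31:0] key,
    input  wire        key_valid,
    output wire        key_ready,
    input  wire        key_update,       // request to refresh the internal key
    // block data input (nonce, associated data, message), 32-bit words
    input  wire [31:0] bdi_data,
    input  wire        bdi_valid,
    output wire        bdi_ready,
    input  wire [3:0]  bdi_valid_bytes,  // location of the valid bytes in bdi_data
    input  wire [2:0]  bdi_size,         // number of valid bytes in bdi_data
    // block data output (ciphertext, tag), 32-bit words
    output wire [31:0] bdo_data,
    output wire        bdo_valid,
    input  wire        bdo_ready
);
    // Datapath (FSRs, pre-output function, authenticator) and the control FSM go here.
endmodule
```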

3.2. The Straightforward Version

In order to analyze the structural characteristics and obtain the baseline hardware implementation performance of Grain-128AEADv2, the straightforward version is implemented. The straightforward version of the hardware implementation of Grain-128AEADv2 follows the architectural design in [6]. The FSRs are in the Fibonacci configuration, and the update functions are the same as the ones shown in Section 2. The FSRs are normally defined to update one bit at each clock cycle. However, the design of the update functions makes it possible to compute up to 32 update bits in parallel, which can be used to update the FSRs by 32 rounds in one clock cycle. The throughput-area ratio increases with the parallel level of a hardware implementation. In light of this, parallel versions up to x32 are implemented to optimize the throughput-area ratio of the hardware implementation of Grain-128AEADv2. For a given parallel level, the highest throughput-area ratio of a hardware implementation is related to the highest frequency that can be achieved, while the highest frequency depends on the critical path, which corresponds to the maximal delay between two flip-flops. Similar to [8], the potential critical paths of Grain-128AEADv2 are among the following:

(i) the maximal delay from any NFSR or LFSR flip-flop to any other NFSR or LFSR flip-flop.

(ii) the maximal delay from any NFSR or LFSR flip-flop to the output via the pre-output function.

(iii) the maximal delay from any NFSR or LFSR flip-flop to any accumulator flip-flop via the pre-output function.

(iv) the maximal delay from any flip-flop in the authentication section to any accumulator flip-flop or output.

(v) the maximal delay from any flip-flop of the NFSR or LFSR to any flip-flop of the NFSR via the feedback function.

Although there are several potential critical paths, some can be excluded by analyzing the update functions of Grain-128AEADv2. The update functions of the NFSR and the LFSR in the initialization phase are more complicated than those in the generation phase because the pre-output additionally needs to be XORed into the inputs of both FSRs. Then, and can be excluded because of the existence of . According to (3), (9), and (10), can be longer than and . However, when Grain-128AEADv2 is implemented in parallel, the data needed by the update function of the accumulator are not yet available. In this case, the accumulator path may be longer, and it may become the critical path. The straightforward version of Grain-128AEADv2 is synthesized on both ASIC and FPGA, and the synthesis results can be seen in Table 1.

According to Table 1, the critical paths of the hardware implementations of Grain-128AEAD and Grain-128AEADv2 can be identical, i.e., the critical paths of the x1, x2, x4, and x8 implementations lie in the feedback logic of the FSRs, while the critical paths of the x16 and x32 implementations lie in the accumulator update logic. Then, in order to increase the frequency of the hardware implementation of Grain-128AEADv2, strategies will be applied in the following subsections to optimize both paths.

3.3. The Galois Transformation Version

In order to shorten the FSR feedback path, one can transform the Fibonacci configuration into a Galois configuration. Besides, one needs to make sure that the Fibonacci configuration and the Galois configuration are equivalent so that their sets of output sequences are identical. The state of the NFSR can be denoted as $(b_0, b_1, \ldots, b_{127})$. In the Fibonacci configuration, only $b_{127}$ is updated by the feedback function, while $b_i$ is updated by $b_{i+1}$ for $0 \le i \le 126$. Note that the feedback function is a complicated Boolean function, and it may become the critical path in the hardware implementation of Grain-128AEADv2. In the Galois configuration, by contrast, the feedback is distributed over more than one element of the NFSR. Thus, each individual update expression in the Galois configuration can be simpler than the single feedback function in the Fibonacci configuration, and the delay of the update logic of the Galois configuration may be shorter, which may improve the frequency of the hardware implementation of Grain-128AEADv2. Since the NFSR feedback function of Grain-128AEADv2 is the same as those of Grain-128a and Grain-128AEAD, the Galois transformation of Grain-128AEADv2 can be identical to those of Grain-128a and Grain-128AEAD. Details about the feedback functions after the Galois transformation of Grain-128a can be seen in [7]. According to [7], the Galois transformation cannot be applied when the parallel level of the hardware implementation of Grain-128AEADv2 is above 16.
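
To illustrate the structural difference, the toy 4-bit registers below contrast a Fibonacci configuration, where the complete feedback expression drives a single flip-flop, with a Galois configuration, where the feedback is distributed so that each flip-flop sees at most one small XOR. These toys are for illustration only: they are not Grain's feedback functions and are not claimed to be transformations of one another; an actual Galois transformation must additionally preserve the set of output sequences, as discussed above.

```verilog
// Structural contrast between a Fibonacci and a Galois shift register (4-bit toys,
// NOT Grain's feedback functions, and not transformations of one another).
module toy_fsr_fibonacci (input wire clk, input wire en, output wire out);
    reg [3:0] s = 4'b0001;
    assign out = s[0];
    // The complete feedback expression sits in front of a single flip-flop (s[3]);
    // for Grain's NFSR this expression is large and can become the critical path.
    always @(posedge clk)
        if (en) s <= {s[1] ^ s[0], s[3:1]};
endmodule

module toy_fsr_galois (input wire clk, input wire en, output wire out);
    reg [3:0] g = 4'b0001;
    assign out = g[0];
    // The feedback is distributed: the bit shifting out is XORed into several
    // positions, so each flip-flop sees at most one small XOR gate.
    always @(posedge clk)
        if (en) g <= (g >> 1) ^ (g[0] ? 4'b1100 : 4'b0000);
endmodule
```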

3.4. The Pipeline-Like Pre-Computation Version

In order to shorten the FSR feedback path, we split the feedback functions into two parts. Then, the update phase of round $t$ can be divided into two stages: (1) compute the first parts of the feedback functions and (2) compute the second parts of the feedback functions, XOR them with the first parts, and update the NFSR with the XORed values. When the pipeline technique is applied, stage 2 of round $t$ and stage 1 of round $t+1$ should be computed in the same clock cycle. However, the result of stage 2 of round $t$ is used to update the NFSR, and stage 1 of round $t+1$ can seemingly only be computed after the NFSR is updated. Thus, it seems that the pipeline technique cannot be applied. Considering that only the last bit of the NFSR is updated with the result of stage 2, while every other bit is simply shifted (bit $i$ takes the value of bit $i+1$), stage 1 of round $t+1$ can in fact be pre-computed before the NFSR is updated. The pipeline-like pre-computation technique can be implemented as follows.

Compared with the straightforward version and the Galois transformation version, the pipeline-like pre-computation technique needs one more clock cycle. We denote this one more clock cycle as . At clock cycle , stage 1 of round 0 can be computed as

Note that the FSRs are not updated at clock cycle . At clock cycle , stage 2 of round 0 can be computed as

At the same time, stage 1 of round 1 should be pre-computed as

Note that the FSRs begin to shift at clock cycle . Then, at clock cycle , stage 2 of round can be computed similarly to (13) since the NFSR has been updated, and the pre-computation of stage 1 of round can be performed with (14).
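
A minimal Verilog sketch of this two-stage idea is given below for a toy 8-bit NFSR: the first part of the next round's feedback is pre-computed from taps that are unaffected by the pending update (every bit except the newly inserted one, read one index ahead) and registered, so that in the following cycle only the second part remains in front of the feedback flip-flop. The toy feedback split is illustrative and does not correspond to Grain's actual functions or to equations (12)-(14).

```verilog
// Pipeline-like pre-computation on a toy 8-bit NFSR (x1 datapath). The feedback is
// split into part1 and part2; part1 of the NEXT round is pre-computed and registered
// while part2 of the CURRENT round is evaluated. The split below is illustrative and
// is NOT Grain's actual feedback function.
module toy_precomp_fsr (input wire clk, input wire en, output wire ks);
    reg [7:0] b = 8'h01;          // toy NFSR state; b[7] receives the feedback bit
    reg       part1_r = 1'b0;     // pre-computed first part of the next round's feedback
                                  // (a warm-up cycle computing stage 1 of round 0 is
                                  // needed before en is asserted, as noted in the text)

    // Stage 1 of round t+1, evaluated on the state of round t: since every bit except
    // b[7] is only shifted (b[i] <= b[i+1]), the taps of round t+1 are read one index
    // higher in the current state.
    wire part1_next = b[3] ^ (b[5] & b[6]);   // stands for taps {2, 4&5} of round t+1

    // Stage 2 of round t, using taps that are available in the current state.
    wire part2_now  = b[0] ^ (b[1] & b[4]);   // stands for taps {0, 1&4} of round t

    wire fb = part1_r ^ part2_now;            // complete feedback bit of round t

    always @(posedge clk)
        if (en) begin
            b       <= {fb, b[7:1]};          // shift and insert the feedback bit
            part1_r <= part1_next;            // pre-compute stage 1 of round t+1
        end

    assign ks = b[0];                         // toy output bit
endmodule
```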

The parallel version of the pipeline-like pre-computation technique can work similarly. For example, for parallel level , the first part of at stage 1 can be computed at clock cycle as

However, for the parallel level , such a split needs to be modified, since some indexes can exceed 127, which means that the values of the bits with indexes exceeding 127 cannot be obtained in the current clock cycle. In order to apply the pipeline-like pre-computation technique, the split of the feedback functions should be modified: all the terms with indexes exceeding 127 are left to the second part. Thus, the first part can be modified as

Then, the second part of and should be modified as

The second part for this parallel level can be more complicated than that for other parallel levels. However, because of the large parallel level, the effect of the longer path can be relatively small.

3.5. Pipelining the Accumulator

Since only one of the generated bits is used for authentication, the update of the accumulator with (9) can hold for the parallel level . However, for the parallel level , can be updated as [8]

In such cases, some values have not yet been shifted into the shift register. For example, for the parallel level , the update of the accumulator uses authentication bits that are still being computed with (10) in the same clock cycle and have not yet been shifted into the register. The update of the accumulator then needs to wait for this computation, which means that the accumulator path can be longer than the FSR feedback path. This is verified by the results shown in Table 1.

In order to shorten this path, Sönnerup et al. [8] inserted a pipeline step into the authentication logic. In the Galois transformation version and the pipeline-like pre-computation version, however, we insert the pipeline step in the update of the authenticator for high parallel levels, as shown in Figure 4. The advantage of this technique is that it needs less extra control logic. Therefore, it can lead to a relatively high frequency, since complicated control logic can have a negative effect on the frequency of the hardware implementation of Grain-128AEADv2. Note that the pipeline step adds one clock cycle of delay to the update of the authenticator.
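
The idea can be sketched as follows: the message bits and authentication bits produced in the current cycle are captured in pipeline registers and only consumed by the accumulator and shift register in the next cycle, so no combinational path runs from the pre-output logic of the current cycle into the accumulator flip-flops. The per-cycle fold below is a generic stand-in for the real per-parallel-level update of Figure 4, and all names are illustrative; as noted above, one clock cycle of latency is added.

```verilog
// Pipeline register in front of the accumulator for a parallel datapath of width W.
// The fold is a generic stand-in for the per-level update of Figure 4; reset and the
// initial loading of A and R are omitted for brevity.
module acc_pipeline #(parameter W = 16) (
    input  wire         clk,
    input  wire         en,
    input  wire [W-1:0] m_bits,      // message bits processed in this cycle
    input  wire [W-1:0] y_auth_now,  // authentication bits computed in this cycle
    output reg  [63:0]  acc          // 64-bit accumulator A
);
    reg [W-1:0] m_bits_r, y_auth_r;  // pipeline registers decoupling the accumulator
    reg [63:0]  sr;                  // 64-bit authentication shift register R

    // Combinational fold of LAST cycle's registered bits into A and R.
    reg [63:0] acc_next, sr_next;
    integer i;
    always @* begin
        acc_next = acc;
        sr_next  = sr;
        for (i = 0; i < W; i = i + 1) begin
            acc_next = acc_next ^ (sr_next & {64{m_bits_r[i]}});
            sr_next  = {y_auth_r[i], sr_next[63:1]};
        end
    end

    always @(posedge clk)
        if (en) begin
            m_bits_r <= m_bits;      // stage 1: capture this cycle's bits
            y_auth_r <= y_auth_now;
            acc      <= acc_next;    // stage 2: consume only last cycle's bits
            sr       <= sr_next;
        end
endmodule
```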

4. Masked Hardware Implementation of Grain-128AEADv2

Under the idea of DOM, a first-order masked hardware implementation of Grain-128AEADv2 that is secure under the glitch-extended probing model is proposed.

4.1. The Straightforward Version

In order to achieve first-order security under the glitch-extended probing model, each sensitive variable should be split into two shares. Thus, in the first-order masked hardware implementation of Grain-128AEADv2, the LFSR and the NFSR are each split into two shares, and the update and pre-output functions are replaced by their masked counterparts. Since the LFSR feedback function is linear, it can be implemented by simply applying it to each share as

The non-linear terms in the NFSR feedback function and the pre-output function should be implemented with the AND masking gadget, and the masked functions are obtained by XORing the output shares of the AND masking gadgets with the shares of the linear terms. DOM needs to insert one register stage into each AND masking gadget. Since the term with the highest degree requires two levels of AND masking gadgets, and each AND masking gadget of DOM contains one register stage, the first-order masked hardware implementation of this term needs two register stages. The computation of this term is shown in Figure 5.

The unmasked straightforward version encrypts 1 bit of the message every two clock cycles, while the masked straightforward version encrypts 1 bit of the message every six clock cycles. Therefore, the throughput-area ratio of the masked straightforward version will be much lower than that of the unmasked straightforward version.

4.2. The Pipeline-Like Pre-Computation Version

In order to increase the throughput-area ratio, the AND masking gadget can be combined with the pipeline-like pre-computation technique. Since the LFSR feedback is a linear function, its masked version can be computed in one clock cycle. When the pipeline-like pre-computation technique is applied, each AND masking gadget can also be computed in one clock cycle; to this end, the masked non-linear update logic should be divided into three stages. In order to explain the details of the first-order masked implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique, we take the masked NFSR update function as an example. Its first-order masked hardware implementation with the pipeline-like pre-computation technique can be seen in Figure 6, and the masked pre-output function can be implemented similarly.

According to Figure 6, should be divided into three stages. The first stage aims to compute terms with degree 2 such as and with the AND masking gadget. Note that there are two terms with degree 2, i.e., and , in the term . The computation of , and can be shown as

The XOR masking gadget of the linear terms in can be computed with (22). The XOR masking gadget of the linear terms in can be computed similarly.

The second stage uses an AND masking gadget to compute . Before computing , the shares of and should be XORed as , , , and so that glitches cannot propagate any further. Then, the computation of can be shown as

At the same time, the obtained shares of terms with degree one and two can be XORed to reduce the computation complexity in the third stage, which can be shown as

Then, in the third stage, is computed and the two shares of the NFSR can be updated as

The computation of with the pipeline-like pre-computation technique is shown in Figure 7. Compared with the straightforward version, the pipeline-like pre-computation technique needs two more clock cycles. We denote these two extra clock cycles as and . Then, the result of round can be obtained at clock cycle . According to Figure 7, at clock cycle , only stage 1 of round 0 is computed. At clock cycle , stage 2 of round 0 and stage 1 of round 1 are computed. Since the shares of the FSRs are not shifted at , and should be used to compute stage 1 of round 1 at clock cycle . Then, at clock cycle , stage 3 of round , stage 2 of round , and stage 1 of round are computed. Since the shares of the FSRs are not shifted at , and should be used to compute stage 2 of round 1, and and should be used to compute stage 1 of round 2. Since the shares of the FSRs are shifted at clock cycle , the bits used to compute stage 1 and stage 2 can be the same.
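
For illustration, the sketch below instantiates the dom_and_1st gadget from Section 2.3 to build a three-stage masked pipeline for a toy shared feedback t = l XOR (a AND b) XOR (c AND d AND e): stage 1 computes the degree-2 products and registers the linear part, stage 2 applies a second AND gadget to already-registered shares and compresses the partial results, and stage 3 integrates everything into the two output shares. The toy function, signal names, and randomness wiring are assumptions for illustration; Grain's real masked feedback functions contain many more terms but follow the same three-stage structure.

```verilog
// Three-stage masked pipeline for a TOY shared feedback
//   t = l ^ (a & b) ^ (c & d & e),   shared as (t0, t1),
// built from the dom_and_1st gadget sketched in Section 2.3. All names, the toy
// function, and the randomness wiring (one fresh bit per AND gadget) are illustrative.
module toy_masked_feedback (
    input  wire clk,
    input  wire a0, a1, b0, b1, c0, c1, d0, d1, e0, e1,  // shares of the FSR taps
    input  wire l0, l1,                                   // shares of the linear part
    input  wire z0, z1, z2,                               // fresh randomness, one bit per gadget
    output wire t0, t1                                    // shares of the feedback bit
);
    // Stage 1: degree-2 products through DOM AND gadgets (each contains one register);
    // the linear part and the remaining operand e are registered alongside to stay aligned.
    wire ab0, ab1, cd0, cd1;
    dom_and_1st u_ab  (.clk(clk), .a0(a0), .a1(a1), .b0(b0), .b1(b1), .z(z0), .q0(ab0), .q1(ab1));
    dom_and_1st u_cd  (.clk(clk), .a0(c0), .a1(c1), .b0(d0), .b1(d1), .z(z1), .q0(cd0), .q1(cd1));

    reg l0_r, l1_r, e0_r, e1_r;
    always @(posedge clk) begin
        l0_r <= l0;  l1_r <= l1;
        e0_r <= e0;  e1_r <= e1;
    end

    // Stage 2: the degree-3 term needs a second DOM AND gadget; its inputs are already
    // registered, so no glitches propagate into it. The degree-1/2 partial results are
    // XORed and registered to reduce the logic of stage 3.
    wire cde0, cde1;
    dom_and_1st u_cde (.clk(clk), .a0(cd0), .a1(cd1), .b0(e0_r), .b1(e1_r), .z(z2),
                       .q0(cde0), .q1(cde1));

    reg part0_r, part1_r;
    always @(posedge clk) begin
        part0_r <= l0_r ^ ab0;
        part1_r <= l1_r ^ ab1;
    end

    // Stage 3: integrate everything into the two output shares, which would then be
    // shifted into the two NFSR shares.
    assign t0 = part0_r ^ cde0;
    assign t1 = part1_r ^ cde1;
endmodule
```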

Take the computation of the linear terms of in stage 1 of round for example. At clock cycle , the computation of the linear terms of in stage 1 of round 0 should be pre-computed with (22). Since the shares of the FSRs are not shifted at and , the computation of the linear terms of in stage 1 of round 1 should be pre-computed at clock cycle as

Then, at clock cycle , the computation of the linear terms of in stage 1 of round 2 should be pre-computed as

Since the shares of the FSRs are updated at clock cycle , the computation of the linear terms of in stage 1 of round should be computed with (27).

In fact, the parallelization technique can be applied to the masked hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique to increase its throughput-area ratio. For example, at clock cycle , the computation of in stage 1 of round can be shown as

When the pipeline-like pre-computation technique is applied, the parallel level can be up to x8. For higher parallel levels, some indexes can exceed 127, which means that the values of the bits with indexes exceeding 127 cannot be obtained in the current clock cycle. Thus, such parallelization would lead to a longer critical path, which may decrease the throughput-area ratio of the masked hardware implementation of Grain-128AEADv2. Consequently, only parallel levels up to x8 are considered.

5. Performance Evaluation

In this section, the performance of the (masked) hardware implementation of Grain-128AEADv2 is evaluated on ASIC and FPGA. For the ASIC platform, the STM 65 nm process with a 1.2 V supply voltage at 25 °C is adopted, and the synthesis tool is Synopsys Design Compiler L-2016.03-SP1. The FPGA platform is the Xilinx Artix-7 family [25]; synthesis and implementation are conducted in Vivado 2020.1 [26], the standard IDE for the Xilinx Artix-7 family, with the Verilog hardware description language. In order to evaluate the performance of the (masked) hardware implementation of Grain-128AEADv2 precisely, only the primary FPGA component, the look-up table (LUT), is used, and resources such as SRL16/SRL32, BRAM, and DSP are not used in our implementations. The fresh randomness is assumed to be generated by an external Pseudo-Random Number Generator (PRNG); therefore, the overhead of generating fresh randomness is not considered in the performance evaluation.

Overall, the evaluation results of the (masked) hardware implementation of Grain-128AEADv2 are shown in Tables 2 and 3.

According to Tables 2 and 3, the following three observations can be obtained.

(i) First, among different versions of the (masked) hardware implementation of Grain-128AEADv2, the (masked) hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique can reach the highest throughput-area ratio. For the unmasked version, the highest throughput-area ratio can be obtained with the parallel level , while for the masked version, the highest throughput-area ratio can be obtained with the parallel level . In detail, for the unmasked version, the highest throughput-area ratio of the hardware implementation of Grain-128AEADv2 can be on ASIC and on FPGA, while for the masked version, that of the hardware implementation of Grain-128AEADv2 can be on ASIC and on FPGA. Overall, compared to the other two versions, the increase rate of the throughput of the pipeline-like pre-computation version can be larger than the increase rate of the consumed area of the pipeline-like pre-computation version on both ASIC and FPGA. For example, for the unmasked version, the increase rate of the highest throughput of the pipeline-like pre-computation version to the straightforward version on ASIC can be , while the increase rate of the consumed area of the pipeline-like pre-computation version to the straightforward version on ASIC can be only . Therefore, the pipeline-like pre-computation version can obtain the highest throughput-area ratio on both ASIC and FPGA.

(ii) Second, the parallel level can influence the throughput and the consumed area of the (masked) hardware implementation of Grain-128AEADv2. In fact, the throughput and the consumed area increase with the parallel level. Since the increase rate of the throughput can be larger than that of the area, the throughput-area ratio of the (masked) hardware implementation of Grain-128AEADv2 can increase with the parallel level. For example, for the unmasked version, the throughput of the straightforward version on ASIC can increase from 1.14 to 26.67 as the parallel level increases from 1 to 32, while the consumed area of the straightforward version on ASIC can increase from 5975 to 13381 as the parallel level increases from 1 to 32. Because the increase rate of the throughput (25.6 times) can be larger than the increase rate of the consumed area (1.2 times), the throughput-area ratio of the straightforward version on ASIC can increase with the parallel level. Similarly, the throughput of the straightforward version on FPGA can increase from 0.18 to 3.96 as the parallel level increases from 1 to 32, while the consumed area of the straightforward version on FPGA can increase from 158 to 502 as the parallel level increases from 1 to 32. Because the increase rate of the throughput (21 times) can be larger than the increase rate of the consumed area (2.2 times), the throughput-area ratio of the straightforward version on FPGA can also increase with the parallel level. The Galois transformation version and the pipeline-like pre-computation version can show similar trends on ASIC and FPGA.

(iii) Third, compared with the hardware implementation of Grain-128AEADv2, the masked hardware implementation of Grain-128AEADv2 can decrease the throughput and increase the consumed area. Accordingly, the throughput-area ratio of the masked hardware implementation of Grain-128AEADv2 can be lower than that of the hardware implementation of Grain-128AEADv2.
Comparatively, the decrease rate of the throughput-area ratio of the pipeline-like pre-computation version of the masked hardware implementation of Grain-128AEADv2 to the throughput-area ratio of the pipeline-like pre-computation version of the hardware implementation of Grain-128AEADv2 can be smaller than the decrease rate of the throughput-area ratio of the straightforward version of the masked hardware implementation of Grain-128AEADv2 to the throughput-area ratio of the straightforward version of the hardware implementation of Grain-128AEADv2. In detail, for the straightforward version, the highest throughput on ASIC can decrease from 26.67 to 7.96 and the largest consumed area can increase from 13381 to 44246 , while for the pipeline-like pre-computation version, the highest throughput on ASIC can decrease from 32.65 to 8.33 and the largest consumed area can increase from 15249 to 22722 . Then, for the straightforward version, the highest throughput-area ratio on ASIC can decrease from 1.99 to 0.18 , while for the pipeline-like pre-computation version, the highest throughput-area ratio on ASIC can decrease from 2.14 to 0.37 . Therefore, the highest throughput-area ratio of the straightforward version of the masked hardware implementation of Grain-128AEADv2 on ASIC can decrease about 90% compared with the highest throughput-area ratio of the straightforward version of the hardware implementation of Grain-128AEADv2 on ASIC, while the highest throughput-area ratio of the pipeline-like pre-computation version of the masked hardware implementation of Grain-128AEADv2 on ASIC can decrease about 80% compared with the highest throughput-area ratio of the pipeline-like pre-computation version of the hardware implementation of Grain-128AEADv2 on ASIC. Such trend can also be shown on FPGA. In summary, since the increase rate of the consumed area of the pipeline-like pre-computation version can be smaller than the increase rate of the consumed area of the straightforward version while the decrease rate of the throughput of two versions can be about the same, we obtain the result that the decrease rate of the throughput-area ratio of the pipeline-like pre-computation version can be smaller than that of the throughput-area ratio of the straightforward version.

6. Security Evaluation

In this section, the resistance of the masked hardware implementation of Grain-128AEADv2 against side-channel attacks is evaluated with the T-Test proposed by Goodwill et al. [22] in a simulated scenario. More specifically, the non-specific T-Test leakage detection methodology is adopted, in which two sets of power traces are used: the traces in one set, $S_0$, correspond to the encryption of randomly chosen IVs with a fixed secret key, while the traces in the other set, $S_1$, correspond to the encryption of a fixed IV with the same fixed secret key. If the number of samples contained in one power trace is denoted as $N$, the value of the T-Test at sample $j$, $1 \le j \le N$, can be computed as
$$ t_j = \frac{\mu_{0,j} - \mu_{1,j}}{\sqrt{\sigma_{0,j}^2 / n_0 + \sigma_{1,j}^2 / n_1}}, $$
where $\mu_{0,j}$ and $\mu_{1,j}$ denote the means of the power traces contained in $S_0$ and $S_1$ at sample $j$, $\sigma_{0,j}^2$ and $\sigma_{1,j}^2$ denote the corresponding variances, and $n_0$ and $n_1$ denote the numbers of power traces contained in $S_0$ and $S_1$, respectively. The null hypothesis is that $\mu_{0,j}$ and $\mu_{1,j}$ are equal; it is accepted if $t_j$ lies within the threshold of $\pm 4.5$, and it is rejected with a confidence greater than 99.999% if $|t_j|$ exceeds 4.5. In the simulated scenario, the power consumption in a power trace at sample $j$ is assumed to be composed of a signal part and a noise part. The signal part is simulated under the Hamming Distance model, while the noise part is assumed to follow a Gaussian distribution with mean 0 and a given variance.

We observe that the variance of the signal under the Hamming Distance model can be between 24 and 66. The signal-to-noise ratio of the hardware implementation is then set to 0.02. According to the evaluation results, the leakage of an unprotected hardware implementation can be detected with only 10,000 traces. The number of traces needed in the simulated scenario is smaller than that needed in a real scenario, because in the simulated scenario the signal leakage perfectly follows the Hamming Distance model, whereas in a real scenario the signal leakage cannot be characterized perfectly, so many more traces are needed. Note that 100,000 traces are used to test the leakage of the hardware implementation of Grain-128AEADv2. The security evaluation results of the (masked) hardware implementation of Grain-128AEADv2 are shown in Figure 8.

In Figure 8, the red lines represent the threshold of $\pm 4.5$. If the T-Test value exceeds this threshold, leakage of the hardware implementation of Grain-128AEADv2 is detected; otherwise, no leakage is detected. According to Figure 8, the following three observations can be obtained.

(i) First, the leakage of the unprotected hardware implementation of Grain-128AEADv2 with the straightforward technique, the Galois transformation technique, or the pipeline-like pre-computation technique can be detected with 100,000 traces, which means that all three unprotected hardware implementations of Grain-128AEADv2 are insecure against side-channel attacks.

(ii) Second, the T-Test values of the masked hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique stay within the threshold of $\pm 4.5$, which means that this hardware implementation of Grain-128AEADv2 can be secure in the face of side-channel attacks. Therefore, the security evaluation results show the effectiveness of masking against side-channel attacks.

(iii) Third, the shapes of the T-Test curves of the straightforward version and the pipeline-like pre-computation version are about the same, while that of the Galois transformation version is different. The reason is that the pipeline-like pre-computation version does not change the state of the FSRs and only adds two registers to store intermediate values; the effect of these two extra registers is negligible compared to the states of the FSRs, so the shape of its T-Test curve is about the same as that of the straightforward version. The Galois transformation version, however, changes the state of the NFSR, so the shape of its T-Test curve differs from that of the straightforward version.

7. Conclusion

In this paper, the efficient (masked) hardware implementation of Grain-128AEADv2 is considered, and the pipeline-like pre-computation technique is proposed. For the unmasked version, the hardware implementation of Grain-128AEADv2 with the straightforward technique, the Galois transformation technique, and the pipeline-like pre-computation technique is considered; for the masked version, the hardware implementation with the straightforward technique and the pipeline-like pre-computation technique is considered. The performance of the (masked) hardware implementation of Grain-128AEADv2 is evaluated on both ASIC and FPGA. According to the evaluation results, the pipeline-like pre-computation technique optimizes the throughput-area ratio for both the masked and the unmasked versions compared with the other techniques. In detail, the highest throughput-area ratio of the hardware implementation of Grain-128AEADv2 is obtained by the x32 parallel version with the pipeline-like pre-computation technique, and the highest throughput-area ratio of the masked hardware implementation of Grain-128AEADv2 is obtained by the x8 parallel version with the pipeline-like pre-computation technique, on both ASIC and FPGA. Besides, the security of the masked hardware implementation of Grain-128AEADv2 with the pipeline-like pre-computation technique against side-channel attacks is evaluated with the T-Test in a simulated scenario. Overall, this contribution may help researchers and practitioners to accurately compare the efficiency and the security of the hardware implementation of Grain-128AEADv2 with those of other lightweight cryptographic algorithms.

Data Availability

All data generated or analyzed during this study are included in this article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This study was supported by the National Key Research and Development Program of China (no. 2020YFB1805402), the Open Fund of Advanced Cryptography and System Security Key Laboratory of Sichuan Province (grant no. SKLACSS-202116), and the National Natural Science Foundation of China (grant nos. 61872359, 61936008, and 62272451).