Journal of Sensors
Volume 2021, Article ID 8860413, 13 pages
https://doi.org/10.1155/2021/8860413
Research Article
A Compact FPGA-Based Accelerator for Curve-Based
Cryptography in Wireless Sensor Networks
Received 17 April 2020; Revised 12 September 2020; Accepted 30 November 2020; Published 6 January 2021
Copyright © 2021 Miguel Morales-Sandoval et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work
is properly cited.
The main topic of this paper is low-cost public key cryptography in wireless sensor nodes. Security in embedded systems, for
example, in sensor nodes based on field programmable gate array (FPGA), demands low cost but still efficient solutions. Sensor
nodes are key elements in the Internet of Things paradigm, and their security is a crucial requirement for critical applications in
sectors such as military, health, and industry. To address these security requirements under the restrictions imposed by the
available computing resources of sensor nodes, this paper presents a low-area FPGA-prototyped hardware accelerator for scalar
multiplication, the most costly operation in elliptic curve cryptography (ECC). This cryptoengine is provided as an enabler of
robust cryptography for security services in the IoT, such as confidentiality and authentication. The compactness of the
proposed hardware design is achieved by applying a novel digit-by-digit computing approach to the finite field and
curve level algorithms, together with hardware reuse, the use of embedded memory blocks in modern FPGAs, and a simpler
control logic. Our hardware design targets elliptic curves defined over binary fields generated by trinomials, uses fewer area
resources than other FPGA approaches, and is faster than software counterparts. Our ECC hardware accelerator was validated
under a hardware/software codesign of the Diffie-Hellman key exchange protocol (ECDH) deployed in the IoT MicroZed FPGA
board. For a scalar multiplication in the sect233 curve, our design requires 1170 FPGA slices and completes the computation in
128820 clock cycles (at 135.31 MHz), with an efficiency of 0.209 kbps/slice. In the codesign, the ECDH protocol is executed in
4.1 ms, 17 times faster than a MIRACL software implementation running on the embedded processor Cortex A9 in the
MicroZed. The FPGA-based accelerator for binary ECC presented in this work uses fewer hardware
resources than the other FPGA designs in the literature.
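The efficiency figure quoted above can be reproduced with a short calculation. The sketch below assumes that throughput is defined as the field size (233 bits) divided by the time of one scalar multiplication, and that efficiency is throughput per slice; the function name `kp_metrics` is hypothetical. Under these assumptions the reported cycle count, clock frequency, and slice count give the stated 0.209 kbps/slice.

```python
def kp_metrics(field_bits, cycles, freq_mhz, slices):
    """Return (time in ms, throughput in kbps, efficiency in kbps/slice).

    Assumption: throughput = field_bits / time of one scalar multiplication.
    """
    time_s = cycles / (freq_mhz * 1e6)           # latency of one kP
    throughput_kbps = field_bits / time_s / 1e3  # bits processed per second
    return time_s * 1e3, throughput_kbps, throughput_kbps / slices


if __name__ == "__main__":
    # Values reported for the sect233 curve: 128820 cycles at 135.31 MHz, 1170 slices.
    time_ms, kbps, eff = kp_metrics(233, 128820, 135.31, 1170)
    print(f"{time_ms:.3f} ms, {kbps:.1f} kbps, {eff:.3f} kbps/slice")
```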
Figure 1: Simplified model of networked IoT devices collecting and sharing data.
threats to confidentiality and authentication in the communication path between a sender and a receiver node.
A robust approach to provide such security services in the IoT domain is public key cryptography (PKC). PKC in its different families is based on mathematical problems, and the underlying realizations involve costly arithmetic algorithms over finite fields, rings, or groups. In the literature, a vast amount of research has focused on hardware acceleration of PKC at the different levels of the arithmetic algorithms involved. The main approaches for hardware implementations of PKC have focused on speeding up the underlying group and finite field operations at the expense of a large amount of hardware resources. However, the main drawback of hardware for PKC in WSN is the long key lengths, which lead to large chip area, circuit delays, and increased power dissipation [3].
A straightforward hardware implementation of PKC-based security solutions is not viable in the resource-constrained devices typically found in IoT scenarios, such as FPGA-based sensor nodes. Lightweight cryptography (LWC) [4] has emerged as an active research line focused on designing cryptographic primitives, schemes, and protocols tailored to constrained devices such as sensor nodes in WSN or other IoT devices, for example, RFID tags [5]. For the case of PKC, elliptic curve cryptography (ECC) has been considered one of the most efficient realizations, well suited for constrained environments in the IoT [6].
Application-specific integrated circuits (ASICs) were the first targets in LWC [4, 7]. However, reconfigurable logic circuits, specifically field programmable gate arrays (FPGAs), are becoming more popular for implementing compact/low-area hardware accelerators for cryptographic algorithms, with attractive advantages for the IoT domain [8]. At the beginning, FPGAs were frequently used as devices for rapid prototyping of cryptographic algorithms, but now they are commonly used as final product platforms [9]. Furthermore, FPGAs are not only used as single parts of embedded systems but rather as system-on-chip (SoC) platforms for implementing complete applications [10]. Modern, commercial FPGA devices contain not only programmable hardware resources but also large functional blocks, such as high-speed multipliers, embedded multiport memories, and even programmable processor cores, thus enabling hardware/software codesigns where the critical parts of an algorithm, protocol, or application are accelerated with custom designs implemented in the available programmable hardware, and the rest of the application is executed by the general purpose processors. The main advantage of FPGAs is reconfigurability since, for example, a whole system could be upgraded (or partially reconfigured) [7].
Recent works propose FPGAs as the most attractive candidates for a large range of IoT applications because of their high energy efficiency and low cost, for example, for IoT machine learning [11], IoT neural networks [12], IoT vehicle monitoring systems [13], and IoT security (cryptography) [14], among other applications. Not only do research papers propose FPGAs as hardware modules for IoT scenarios, but FPGA vendors are also producing devices with specific features for IoT development [15].
Contribution: in this work, we aim at a low-area hardware engine for ECC for IoT security, suitable for being included as a building block in FPGA-based sensor nodes for IIoT or MIoT. We aim at providing one of the most compact FPGA hardware accelerators for scalar multiplication in standard binary curves, the most time-consuming operation and the core of ECC cryptographic schemes such as encryption, digital signatures, and key establishment. To achieve compactness, a novel digit-digit binary finite field multiplier is proposed and used as the basic building block of the proposed ECC accelerator. Under this approach, the operands are processed one digit at a time in an iterative way, while exploiting the parallelism at the algorithmic level and reusing hardware resources as much as possible. The sequence of field operations in the algorithm for scalar multiplication is carefully scheduled to reduce the number of field multiplier cores (two) and memory blocks (eight). While the field multipliers are implemented using standard FPGA logic, memories are taken from the ones available in modern FPGAs. Due to the digit-digit computation approach, an efficient data memory management is designed to reduce the number of memory blocks. This way, with only the eight memory blocks, the several field multiplications in a single point addition are correctly computed, and at the same
time, those same memories serve to keep the progress of the scalar multiplication computation. The novel hardware design presented in this work was validated under a hardware/software implementation of the elliptic curve Diffie-Hellman (ECDH) key exchange protocol, tailored to the MicroZed FPGA prototyping board, recommended for industrial IoT applications. Under this setting, which is very common in an FPGA IoT application, the execution of ECDH outperforms the software counterpart, implemented using the MIRACL library and running on the embedded Cortex A9 processor in the MicroZed. Our hardware architecture, compared with similar state-of-the-art approaches in terms of area, requires only up to 16% of FPGA hardware resources, thus being the most compact FPGA-based hardware architecture for computing scalar multiplications in ECC defined over binary fields. Compared to the software reference implementation, our design is 17 times faster.
The rest of this paper is organized as follows: Materials and Methods discusses the preliminaries of scalar multiplication in binary elliptic curves and the Montgomery López-Dahab algorithm for scalar multiplication. This section also describes related works and the proposed hardware design. Results and Discussion presents the experimental results and comparisons with state-of-the-art works, followed by concluding remarks in the Conclusion.

2. Materials and Methods

First, we provide the mathematical concepts and foundations that are the basis to construct the FPGA-based ECC cryptoengine. We begin with the basis of elliptic curves and groups from which the scalar multiplication is defined. Scalar multiplication is critical because the proposed hardware cryptoengine is designed precisely to speed up this costly operation, which is the core of higher-level operations for security applications such as encryption and digital signatures. Finally, the section concludes by discussing the method to compute scalar multiplications on binary elliptic curves. This algorithm is realized by the proposed FPGA-based ECC cryptoengine.

2.1. Elliptic Curves and Their Use in Cryptography. Since it was invented independently by Miller [16] and Koblitz [17], elliptic curve cryptography (ECC) has received a lot of attention in academia and industry. Elliptic curves and their properties have also enabled other types of cryptography relevant for the IoT (in wireless sensor networks), for example, identity-based encryption (IBE) [18] and attribute-based encryption [19]. With the advent of the IoT, mainly populated by intelligent objects with restricted computing capabilities and resources, ECC is becoming one of the promising approaches to provide security services in that computing paradigm [6].
An elliptic curve $E$ over a finite field $F_q$ is defined by Eq. (1).

$$E : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, \tag{1}$$

where $a_1, a_2, a_3, a_4, a_6 \in F_q$. The $(x, y)$ pairs satisfying $E$, together with a special point named the point at infinity $\mathcal{O}$, form a group $G$ with point addition as the group operation. $G$ is a cyclic group with prime order $n$ where the discrete logarithm problem is defined and on which ECC is founded.
It is well known that binary extension fields ($q = 2^m$) are very attractive for defining ECC. An element in $F_{2^m}$ is the bit vector $(a_{m-1}, a_{m-2}, \ldots, a_0)$ that in polynomial basis represents the polynomial $a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_0$ of degree $m - 1$, with $a_i \in \{0, 1\}$. Arithmetic in $F_{2^m}$ in polynomial basis is polynomial arithmetic with reduction modulo $F(x)$, an irreducible polynomial of degree $m$. The arithmetic in $F_{2^m}$ is carry free and more suitable for hardware implementations.

2.2. Scalar Multiplication in Elliptic Curves. Scalar multiplication in $E(F_q)$, denoted as $Q = kP$ with $Q, P \in G$ and $k \in [1, n - 1]$, is the main and most time-consuming operation in any ECC scheme (encryption, digital signature, key exchange, etc.). $Q$ is computed by adding $P$ to itself [20]:

$$Q = kP = \underbrace{P + P + \cdots + P}_{k\ \text{times}}.$$

The complexity of $kP$ is measured in terms of the operations in $F_q$. Given a large integer $k$ and a point $P$ in $G$, it is easy to compute $Q = kP$. On the contrary, the elliptic curve discrete logarithm problem (ECDLP) is the problem of finding the scalar $k$ given the points $P$ and $Q$ in $G$. For a sufficiently large $n$, the ECDLP becomes hard to solve. Most of the state-of-the-art works related to ECC have focused on the efficient implementation of scalar multiplication [6], which is a condition for an efficient ECC implementation.
The López-Dahab Montgomery point multiplication (PM) algorithm [21], shown in Algorithm 1, has been commonly used for the $kP$ computation because it is resistant to side-channel attacks, suitable for parallelization, and well suited to low-resource devices. In this work, we use the López-Dahab algorithm to implement, for the first time, the most compact FPGA-based hardware architecture for computing $kP$ in binary elliptic curves, $E(F_{2^m})$.
The main operations in Algorithm 1 are addition, multiplication, and squaring in $F_{2^m}$. Consider the fields recommended by NIST for practical ECC, with $m = 233$ and $m = 409$. For $m = 409$, the scalar multiplication (Section 2.2) will have a cost of 1227 field additions, 2454 field multiplications, and 2454 field squarings over $F_{2^m}$, with field multiplication being the most time-consuming operation.
The López-Dahab method for scalar multiplication in ECC is considered the most suitable method when targeting devices with low computing power [22]. The elliptic curve point is represented in projective coordinates. At the beginning, the elliptic curve point $P$ in affine coordinates $(x, y)$ is converted to its projective representation $(X, Y, Z)$. Algorithm 1 uses only the x-coordinate for point representation, so storage resources can be saved (line 5). With this setting, costly field inversions are avoided in each group (curve level) operation. Only one field inversion is required for the coordinate conversion from projective to affine at the end of the main loop (line 13). Algorithm 1 runs in constant time and is resistant to some side-channel attacks such as simple power analysis (SPA).
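The x-only Montgomery ladder described above can be modeled in software. The following is a minimal Python sketch, not the hardware design of this work: it assumes a toy field $F_{2^7}$ with $F(x) = x^7 + x + 1$ and the curve $y^2 + xy = x^3 + ax^2 + b$ with $a = b = 1$, chosen only so that the example runs quickly. The Madd and Mdouble steps use the standard López-Dahab x-only formulas; all function names are hypothetical, and the final inversion uses Fermat's little theorem for brevity instead of the inversion algorithm used in the paper. A plain affine addition is included only as a reference to check the ladder against.

```python
M = 7                  # toy field degree (assumption; the paper targets m = 233)
F_POLY = 0b10000011    # F(x) = x^7 + x + 1, an irreducible trinomial
A_COEF, B_COEF = 1, 1  # toy curve y^2 + xy = x^3 + a*x^2 + b (assumption)


def f_mul(a, b):
    """Shift-and-add product of two reduced field elements modulo F_POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # field addition is a bit-wise XOR
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached m: add (XOR) F(x) to reduce
            a ^= F_POLY
    return r


def f_inv(a):
    """Inverse computed as a^(2^m - 2); returns 0 for a = 0."""
    r = 1
    for _ in range(M - 1):
        a = f_mul(a, a)
        r = f_mul(r, a)
    return r


def m_add(X1, Z1, X2, Z2, x):
    """x-only addition of P1 and P2, where x is the affine x of P2 - P1."""
    t1, t2 = f_mul(X1, Z2), f_mul(X2, Z1)
    s = t1 ^ t2
    Z3 = f_mul(s, s)
    return f_mul(x, Z3) ^ f_mul(t1, t2), Z3


def m_double(X, Z, b):
    """x-only doubling: X' = X^4 + b*Z^4, Z' = X^2 * Z^2."""
    xx, zz = f_mul(X, X), f_mul(Z, Z)
    return f_mul(xx, xx) ^ f_mul(b, f_mul(zz, zz)), f_mul(xx, zz)


def ladder_x(k, x, b):
    """Affine x-coordinate of kP for k >= 1 (0 is returned if kP is at infinity)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    X1, Z1 = x, 1                      # P
    X2, Z2 = m_double(x, 1, b)         # 2P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = m_add(X1, Z1, X2, Z2, x)
            X2, Z2 = m_double(X2, Z2, b)
        else:
            X2, Z2 = m_add(X1, Z1, X2, Z2, x)
            X1, Z1 = m_double(X1, Z1, b)
    return f_mul(X1, f_inv(Z1))        # the single inversion, at the end


def affine_add(P, Q, a):
    """Reference affine addition; None stands for the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None                           # P + (-P)
        lam = x1 ^ f_mul(y1, f_inv(x1))           # doubling slope
        x3 = f_mul(lam, lam) ^ lam ^ a
        return x3, f_mul(x1, x1) ^ f_mul(lam ^ 1, x3)
    lam = f_mul(y1 ^ y2, f_inv(x1 ^ x2))
    x3 = f_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return x3, f_mul(lam, x1 ^ x3) ^ x3 ^ y1


def find_point(a, b):
    """Brute-force search for an affine point with x != 0 (toy sizes only)."""
    for x in range(1, 1 << M):
        xx = f_mul(x, x)
        rhs = f_mul(xx, x) ^ f_mul(a, xx) ^ b
        for y in range(1 << M):
            if f_mul(y, y) ^ f_mul(x, y) == rhs:
                return x, y
    raise ValueError("no affine point found")


if __name__ == "__main__":
    P = find_point(A_COEF, B_COEF)
    print("P =", P, " x(5P) =", ladder_x(5, P[0], B_COEF))
```

Both branches of the loop perform one `m_add` and one `m_double`, which is what makes the operation sequence independent of the key bits; the two calls share no data, so they can run in parallel.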
hardware. Most of the reported works use the bit-serial or digit-serial approach to implement hardware $F_{2^m}$ operators. However, the hardware resources required in these approaches depend directly on the operand size (field size $m$), because even when one of the operands is processed iteratively, the other one is processed in parallel.
The bit-serial approach requires a small amount of hardware resources compared to the digit-serial or full-parallel approaches, but for large operands, even the bit-serial approach requires a considerable amount of hardware resources (slices). However, some recent works have already proposed using a digit-digit approach, for example, [34, 35]. The main drawbacks of the multiplier presented in [34] are the use of shift registers to store partial results and the infeasibility of using such a design in a practical $kP$ engine; the drawback of [35] is that the digit sizes must fit the DSP multipliers embedded in the FPGA.
In order to reduce area requirements and achieve a compact design well suited for IoT applications, the approach in this work to construct a hardware $kP$ accelerator follows the digit-digit computation approach and makes use of the multipliers and memory blocks embedded in most FPGAs to save standard FPGA logic. By implementing a strategy for reusing memory blocks, which is critical for the iterative processing of the digit-digit approach, considerable area resources are saved while retaining the advantage of iteratively processing both operands of the multiplication, and not only one as in the digit-serial or bit-serial approaches. Additionally, since memory blocks are bigger than the operands, it is proposed to use part of the available memory blocks to store control signals (microprogramming), thus avoiding the logic needed to implement a state machine for control.

2.4. Novel Digit-by-Digit Elliptic Curve Point Multiplication Hardware Architecture. The proposed ECC engine, suitable for FPGA-based sensor nodes in the IoT, is constructed following a layered approach. The low level is the $F_{2^m}$ arithmetic, where field multiplication is the main operation to be optimized in terms of area resources. Next, in the high layer, the curve arithmetic uses the $F_{2^m}$ multiplier as a building block; it consists of the area-optimized realization of Algorithm 1, where the $F_{2^m}$ multiplier is used to compute each of the point additions (lines 8 and 10). At this level, the $F_{2^m}$ multiplier is also used to realize the field inversion and field squaring required in the point addition and point doubling operations. In both layers, the proposed design methodology takes advantage of the block RAMs (BRAMs) embedded in modern FPGAs to store the operands and the partial and final results, reusing the BRAMs as much as possible through careful field operation scheduling and a memory management strategy.

2.4.1. Field Arithmetic. Arithmetic in $F_{2^m}$ is done using polynomial basis. Under this representation, each element in the field is a polynomial $A(x)$ of degree at most $m - 1$ over the field $F_2$. The two $F_{2^m}$ binary operators are addition and multiplication, with reduction modulo an irreducible polynomial $F(x)$ of degree $m$. Field addition is the bit-wise XOR operation of coefficients (carry free, no reduction needed), a cheap
operation when implemented in hardware. The additive inverse in $F_{2^m}$ under polynomial basis is also easy to implement, since for any $A(x)$ in $F_{2^m}$, $A(x) + A(x) = 0$, with 0 as the neutral addition element (the all-zero polynomial).
Multiplication and multiplicative inversion (or simply inversion) in $F_{2^m}$ are more complex operations. Since Algorithm 1 only requires one $F_{2^m}$ inversion at the end of the computation, field inversion is implemented using the Itoh-Tsujii algorithm, by a series of $F_{2^m}$ multiplications. So, the field multiplier becomes the most critical operation to be carefully implemented in ECC hardware approaches and one of the critical components in our $kP$ engine.

2.4.2. $F_{2^m}$ Multiplication. In the literature, there are basically three approaches for computing field multiplication in hardware: bit-serial (the most compact design), digit-serial (for area-performance trade-offs), and full-parallel (the fastest but also the costliest solution in terms of area). The most significant element (MSE) and least significant element (LSE) algorithms (bit-serial or digit-serial) are the ones commonly used to compute multiplications over $F_{2^m}$. In this work, we propose a novel digit-digit $F_{2^m}$ multiplier algorithm well suited to be integrated into a $kP$ engine. The digit-digit computing approach aims at performing better than a bit-serial multiplier, keeps the property of allowing area-performance trade-offs to be explored when realized in hardware, and is not as expensive as a full-parallel realization. This is consistent with our design methodology to achieve a compact architecture (simpler datapath) for the $kP$ engine. Details of the digit-digit $F_{2^m}$ multiplier are presented in Section 2.4.3.
$F_{2^m}$ multiplication using the digit-digit computing approach was previously suggested in [34]. However, the multiplier design in that work is not suitable for direct application in a $kP$ engine. The authors of that work only proved the advantages of the digit-digit approach versus the well-known bit-serial and digit-serial multipliers, as a standalone module. When that multiplier is considered for realizing the $kP$ operation, several issues must be solved. Since the multiplier is part of a series of operations implied by each point addition operation in the main loop of the $kP$ computation, the main challenge for the digit-digit multiplier is the fact that the partial results at each iteration of the digit-digit multiplier and the final result (possibly operated with other values) are the input operands for the same multiplier in the next iterations. So, during the digit-digit computation, the multiplier must keep its operands in memory blocks $M_1$ and $M_2$ and progressively store the partial results in another one, $M_3$. At the end, the results in $M_3$ should be moved to $M_1$ or $M_2$ for further processing (a $kP$ operation requires several $F_{2^m}$ multiplications), introducing a delay in the $kP$ computation, unless that data movement is done during the computation. So, $M_1$ or $M_2$ must act as an input and output memory at the same time. Since a complete $kP$ operation requires several hundreds of multiplications, using the multiplier as proposed in [34] without addressing this data memory management issue is totally impractical.
As explained in the next section, the main issue in integrating a digit-digit $F_{2^m}$ multiplier in the $kP$ engine is to implement an efficient data memory management, ensuring consistency in the correct execution of both the digit-digit field multiplier and the scalar multiplication algorithm. In this work, we present the design of a novel digit-digit $F_{2^m}$ multiplier that achieves compact designs by optimizing the resources for finite fields defined by trinomials.

2.4.3. Digit-Digit $F_{2^m}$ Multiplier. Starting from the definition of elements in $F_{2^m}$ as polynomials of the form $b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \cdots + b_0$ with binary coefficients, in this section, we present how the mathematical expression that computes an $F_{2^m}$ multiplication in a digit-by-digit fashion is derived (from Eq. (2) to Eq. (9)). This expression leads to the specification of the $F_{2^m}$ multiplier that is the building block of our FPGA-based engine for scalar multiplication in ECC.
An element $B \in F_{2^m}$ of the form $b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \cdots + b_0$ can be represented as the sum of $w = \lceil m/d \rceil$ polynomials (digits), each of $d$ coefficients in $F_2$ (Eq. (2)).

$$B(x) = \sum_{i=0}^{m-1} b_i x^i = \sum_{i=0}^{w-1} B_i x^{id}, \qquad B_i = \sum_{j=0}^{d-1} b_{id+j} x^j. \tag{2}$$

So, Eq. (3) expresses the multiplication $C(x) = A(x) \times B(x) \bmod F(x)$ in a digit-serial approach.

$$\begin{aligned} C &= A \times B \bmod F(x) \\ &= A \times \left(\sum_{i=0}^{w-1} B_i x^{di}\right) \bmod F(x) \\ &= AB_0 \bmod F(x) + AB_1 x^{d} \bmod F(x) + AB_2 x^{2d} \bmod F(x) + \cdots + AB_{w-1} x^{(w-1)d} \bmod F(x). \end{aligned} \tag{3}$$

Let $P^{\langle i \rangle}(x) = AB_i$, $0 \le i \le w - 1$, be the polynomial of degree $d + m - 2$ resulting from the partial product at iteration $i$ in Eq. (3). By parsing the digits of $B$ from left to right (MSE), the computation of $C$ at iteration $i$ is determined by the recurrence in Eqs. (4) and (5):

$$C^{\langle 0 \rangle} = 0, \tag{4}$$

$$C^{\langle i+1 \rangle} = x^d \left(C^{\langle i \rangle} \bmod F(x)\right) + P^{\langle w-1-i \rangle}(x), \tag{5}$$

where the polynomial $x^d (C^{\langle i \rangle} \bmod F(x))$ has degree at most $d + m - 1$, while $P^{\langle i \rangle}$ is of degree $d + m - 2$. After $w$ iterations, the polynomial $C^{\langle w \rangle}$, of degree up to $d + m - 1$, still needs reduction. By introducing an extra iteration with $B_{-1} = 0$ and $P^{\langle -1 \rangle} = 0$, $C^{\langle w+1 \rangle} = x^d (C^{\langle w \rangle} \bmod F(x))$ is obtained. The $x^d$ factor in this last expression can be easily removed by simply discarding the least significant digit $C^{\langle w+1 \rangle}_0$.
Since $F(x)$ is a polynomial of degree $m$, $F(x) = x^m + \sum_{i=0}^{m-1} f_i x^i$. So, $x^m \bmod F(x) = \sum_{i=0}^{m-1} f_i x^i = g(x)$, a polynomial whose degree, denoted $g$, satisfies $g < m$. Thus, elements $x^{m+t}$ with $t \le m - 1 - g$ can be reduced using the equivalence $x^{m+t} \bmod F(x) = g(x)x^t$. The degree of $C^{\langle i+1 \rangle}$ from Eq. (5) (after the reduction of $C^{\langle i \rangle}$) is at most $d + m - 1$. This polynomial becomes the $C^{\langle i \rangle}$ polynomial to be reduced in the next iteration ($C^{\langle i \rangle} \bmod F(x)$). So,
at each iteration $i + 1$, it is required to reduce the $d$ terms $x^j$ of $C^{\langle i \rangle}$ with $m - 1 < j \le d + m - 1$. By using the previous assumption for polynomial reduction, with $F(x)$ being a trinomial, the reduction in Eq. (5) can be defined as in Eq. (6).

$$\begin{aligned} C^{\langle i \rangle} \bmod F(x) &= \left(\sum_{j=0}^{m-1} c_j x^j + \sum_{j=m}^{m-1+d} c_j x^j\right) \bmod F(x) \\ &= \left(\sum_{j=0}^{m-1} c_j x^j + \sum_{j=0}^{d-1} c_{m+j} x^j g(x)\right) \bmod F(x) \\ &= C_m^{\langle i \rangle}(x) + C_d^{\langle i \rangle}(x) \times g(x). \end{aligned} \tag{6}$$

This way, $C^{\langle i \rangle}$ is partitioned into two polynomials $C_m^{\langle i \rangle}(x)$ and $C_d^{\langle i \rangle}(x)$ of degree at most $m - 1$ and $d - 1$, respectively. The partial multiplication $C_d^{\langle i \rangle} \times g(x)$ will not require modular reduction if $d + g < m$. So, Eq. (5) can be rewritten as in Eq. (7).

$$C^{\langle i+1 \rangle} = x^d \left(C_m^{\langle i \rangle} + C_d^{\langle i \rangle} \times g(x)\right) + P^{\langle w-1-i \rangle}. \tag{7}$$

Under the digit-digit computation approach, the polynomials $C_m^{\langle i \rangle}$, $g(x)$, and $A$ are each represented in $w = \lceil m/d \rceil$ digits. Since the degree of $B_i$ is $d - 1$, the computation of $P^{\langle i \rangle}$ can be achieved iteratively, taking the digit $B_i$ and iterating through the digits of $A$. Taking $B_i$ as a constant, $P^{\langle i \rangle}(x) = A(x) \times B_i(x) = \sum_{j=0}^{w-1} (A_j \times B_i) x^{jd} = \sum_{j=0}^{w-1} P_j^{\langle i \rangle} x^{jd}$. With this new notation, the first term in Eq. (7) can be rewritten as in Eq. (8).

$$x^d \left(C_m^{\langle i \rangle}(x) + C_d^{\langle i \rangle}(x) \times g(x)\right) = \sum_{j=0}^{w-1} C_j^{\langle i \rangle} x^{jd+d} + C_d^{\langle i \rangle}(x) \times \sum_{j=0}^{w-1} G_j x^{jd+d} \tag{8}$$

2.4.4. Digit-Digit $F_{2^m}$ Multiplier Hardware Architecture. To achieve compactness, in this work, we propose the hardware realization of Algorithm 2 in its simplest form. The hardware architecture only requires one partial $d \times d$ multiplier and is optimized for binary fields defined by a trinomial.
NIST and other compliant standards have recommended trinomials for binary fields, for example, $F(x) = x^{409} + x^{87} + 1$ and $F(x) = x^{233} + x^{74} + 1$.
If the trinomial of degree 233 is used, $g(x) = x^{74} + 1$ is used for the reduction step. So, if the digit size $d = 74$ is used, when a digit $j$ of $g(x)$ ($G_j$) is read, only the first two digits ($G_0$ and $G_1$) have the value 1; for $j > 1$, the digit $G_j$ is always 0. In this case, the partial multiplier that computes $C_d^{\langle i \rangle}(x) \times G_j$ always computes a multiplication of the form $(C_d^{\langle i \rangle}(x) \times 1)$ or $(C_d^{\langle i \rangle}(x) \times 0)$, which can be implemented with only AND gates. In conclusion, when a trinomial of the form $x^m + x^k + 1$ is used, it is possible to define the digit size $d = k$. In this case, the partial multiplier that computes $C_d^{\langle i \rangle}(x) \times G_j$ can be implemented using only a multiplexer, as shown in Figure 3.

2.4.5. Curve Arithmetic. The hardware for elliptic curve scalar multiplication is guided by the execution of Algorithm 1, which is based on iterative calls to the point addition functions Madd and Mdouble.
Figure 4 shows the operations required at each iteration of Algorithm 1 and the underlying $F_{2^m}$ operations (denoted by circles). After each $F_{2^m}$ operation, the figure also shows the memory where the intermediate values are stored. For example, the memory X11 stores the result of the first field operation $X_1 \times Z_2$ in the point addition operation. While five $F_{2^m}$ multiplications are needed to compute a single Madd operation,
Require: A, B, F ∈ F_{2^m}    ⊳ A = Σ_{i=0}^{w−1} a_i α^{iD}    ⊳ B = Σ_{i=0}^{w−1} b_i α^{iD}
1: c_D ← 0
2: for i ← 0 to w + 1 do
3:     carry ← 0
4:     s ← 0
5:     for j ← 0 to w do
6:         P_i ← b_{digits−i} × a_{j+1}
7:         R_j ← c_j + c_D × f_{j+1}
8:         s ← P_i + (R_j ≪ d) + carry
9:         c_j ← s[d − 1 downto 0]
10:        c_D ← s ≫ d
11:    end for
12:    c_D ← s ≫ bitsLastDigit
13: end for
14: c_w ← carry
15: return c    ⊳ c = A × B mod F    ⊳ c = Σ_{i=0}^{w−1} c_i α^{iD}
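The digit-by-digit derivation can be checked with a small software model. The sketch below is an assumption-laden illustration rather than a transcription of the algorithm listing above: it splits both operands into $d$-bit digits, builds each partial product from $d \times d$ digit products, applies the most-significant-digit-first recurrence with a trinomial reduction of the form $C_m + C_d \times g(x)$, and compares the result with a plain bit-serial multiplier. It does not model the BRAM scheduling or the carry handling of the hardware, and all function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less (polynomial over GF(2)) product of two bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def digits(v, d, w):
    """Split v into w digits of d bits each, least significant digit first."""
    mask = (1 << d) - 1
    return [(v >> (i * d)) & mask for i in range(w)]


def reduce_trinomial(c, m, k):
    """Reduce c modulo x^m + x^k + 1 using C mod F = C_m + C_d * g, g = x^k + 1."""
    mask = (1 << m) - 1
    while c >> m:                        # repeat while terms of degree >= m remain
        c_d = c >> m                     # upper part C_d
        c = (c & mask) ^ (c_d << k) ^ c_d
    return c


def digit_digit_mul(a, b, m, k, d):
    """A * B mod (x^m + x^k + 1), scanning the digits of B from the top."""
    w = -(-m // d)                       # w = ceil(m / d)
    a_digits = digits(a, d, w)
    c = 0
    for b_i in reversed(digits(b, d, w)):
        p = 0                            # partial product P = A * B_i
        for j, a_j in enumerate(a_digits):
            p ^= clmul(a_j, b_i) << (j * d)   # one d x d digit product
        c = (reduce_trinomial(c, m, k) << d) ^ p
    return reduce_trinomial(c, m, k)


def bit_serial_mul(a, b, m, k):
    """Reference shift-and-add multiplier for reduced operands a, b."""
    f = (1 << m) | (1 << k) | 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= f
    return r


if __name__ == "__main__":
    m, k = 233, 74                       # trinomial x^233 + x^74 + 1
    a, b = (1 << 232) | 0xDEADBEEF, (1 << 233) - 1
    for d in (8, 16, 32):
        print(d, digit_digit_mul(a, b, m, k, d) == bit_serial_mul(a, b, m, k))
```

Changing `d` changes only the number of digit products per multiplication, which is the software counterpart of the area-performance trade-off offered by the digit size.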
[Figure: datapath; block labels BRAM-F, BRAM-A, BRAM-B, MULT, ADD, <<; signal labels $C_d^{\langle i \rangle}$, $P_j^{\langle i \rangle}$, $R_j^{\langle i \rangle}$; bus widths 1, d, and 2d.]
parameters and now acts as a writing memory. Because the $F_{2^m}$ multiplier delivers a result at each stage of the point addition while, at the same time, it processes the input digits, careful management of the memory is required to avoid latency due to data movement between the result and input parameter memories. This requirement arises because the result of the field multiplier in an earlier stage becomes the input parameter of later stages.
In the rest of the point addition computation, memories alternate their functionality following the switching strategy of read/write memories. At the end, the final result $X_3$, $Z_3$ must be in a memory that is used in the next iteration at line
algorithm, the memories are also interchanged in their functionality and properly mapped to the memories for the final results in Figure 4. An extra BRAM is required to store the scalar $k$.
The building blocks to compute $kP$ as described in Figure 4 are those for the field arithmetic operations: addition, multiplication, squaring, and inversion over $F_{2^m}$. The squaring operation is considered easier than multiplication. However, since in this work operands are stored in BRAMs, and reading/writing of operands is performed one digit at a time, it is difficult to take advantage of optimized algorithms such as the fast reduction algorithm proposed by NIST and commonly used in squaring. So, to save hardware resources, this work uses one $F_{2^m}$ multiplication core to compute squaring operations. Reusing the multiplier saves area but increases latency. Also, $F_{2^m}$ inversion is computed with the Itoh-Tsujii algorithm by means of multiplications, squarings, and additions in $F_{2^m}$.
At each iteration of Algorithm 1, the Madd and Mdouble operations can be computed in parallel since there is no data dependency. In this work, we propose to use one $F_{2^m}$ multiplier in Madd and another in Mdouble to take advantage of this parallelism. In the dataflow for each point addition, the $F_{2^m}$ multiplier is reused. In addition to the multipliers, one $F_{2^m}$ adder is also required. The same adder can be used in both the Madd and Mdouble operations since it is required at different times in each operation.
Although more than one $F_{2^m}$ multiplier could be added to speed up the $kP$ computation, that approach resulted in an extra cost of hardware resources, not only because of the area required by the $F_{2^m}$ multiplier but also because of the increased complexity in the control module and the additional multiplexers needed to manage input/output operands to the $F_{2^m}$ cores.
The entire $kP$ dataflow is managed by a control unit that drives the memory blocks for word-based reading and writing and also commands the $F_{2^m}$ cores (multipliers and adder). The control module waits until each partial multiplication/addition has finished and starts the following required operations with the correct BRAMs as input sources.

$$\underbrace{Q_A = r_A P}_{\text{Sensor A}} \quad \text{and} \quad \underbrace{Q_B = r_B P}_{\text{Sensor B}}. \tag{10}$$

Sensor A uses B's public value to compute $s_1 = r_A Q_B$, and sensor B uses A's public value to compute $s_2 = r_B Q_A$. Since $s_1$ is the same as $s_2$ ($s = s_1 = s_2$), $s$ acts as a shared secret key between sensors A and B, so a secure channel can be established to transport data between the two devices in an encrypted form (for example, using a lightweight block cipher). In addition, data can be authenticated by using the shared secret to compute an authentication tag for a message, using, for example, LightMAC. The main complexity in ECDH (as in other ECC-based cryptographic schemes) is the computation of $kP$.

3.1. Hardware/Software Codesign. Figure 5 shows the proposed hardware-software codesign for the scalar multiplier over $E(F_{2^m})$, suitable to be realized in an FPGA sensor node. The codesign was realized in the MicroZed board, and the implementation results are shown in Table 3. This is a representative final application under an IoT scenario (IIoT, MIoT) where sensor nodes are deployed using SoC technology: the $kP$ scalar multiplication is executed in FPGA technology coupled to a master general purpose processor that runs the rest of the application logic. The hardware-software codesign required 1809 slices of the FPGA embedded in the MicroZed board running at 62.5 MHz.
Table 3 also compares the time to achieve a scalar multiplication under the hardware/software codesign versus a pure software implementation. This is done to highlight the gain in performance from a hardware approach for the most time-consuming operation in ECC, as in ECDH. For this, we used the MIRACL library for the software implementation of scalar multiplication in the Cortex A9 of the Zynq, also available in the MicroZed board. In this case, we used the same implementation parameters: curve, finite field, size of the finite field, irreducible polynomial, projective coordinates, and the same Algorithm 1 for scalar multiplication.
The hardware-accelerated execution of $kP$ requires 4.13 ms to compute an elliptic curve Diffie-Hellman key
exchange versus the pure software implementation in the MicroZed with the MIRACL library that requires 70 ms. Thus, our codesign is 17 times faster than the pure software implementation while only requiring 36% of the FPGA slices

Figure 5: Hardware-software codesign for an FPGA-based sensor node enabled with scalar multiplier engine for curve-based cryptography.

Table 3: ECDH hw-sw codesign in the MicroZed board (z7010).
Size   k    Area (slices)   Freq. (MHz)   Time (ms)   MIRACL (sw) (ms)
233    32   1809            62.5          4.13        70

and square and addition over $E(F_{2^m})$ are computed fully with standard logic in only one clock cycle. Compared to our design, those results are almost ten times better in terms of efficiency. However, our design uses considerably fewer area resources. For example, for digit sizes of 8, 16, and 32, the required area is 442, 626, and 1170 slices, respectively. In [37], a hardware architecture is presented for elliptic curve scalar multiplication over $E(F_{2^m})$, implemented for the NIST-recommended binary fields $F_{2^{233}}$ and $F_{2^{283}}$. That scalar multiplier hardware architecture requires 3016 and 4625 slices for the operand sizes 233 and 283, respectively. Compared to that design, our $kP$ engine for $F_{2^{233}}$ requires 6.8 times fewer slices and achieves 2.2 times better efficiency (Mbps/slice). The scalar multiplier over $E(F_{2^m})$ presented in [38] is better in efficiency than ours, but at a considerably high cost in terms of area usage.
Table 4 shows that most of the works achieve better throughput/efficiency than our proposed hardware design. However, the main aim of this work is to save hardware resources (slices), and this is achieved by sacrificing throughput. According to the obtained results, despite the sacrifice in throughput, the proposed design achieves significantly better performance than software counterparts while using fewer resources than similar FPGA designs. The reduction in area resources is a direct result of using a digit-by-digit computing approach in the layered structure of the $kP$ engine, mainly determined by the $F_{2^m}$ multiplier and the strategy for reusing memory blocks during the iterative processing of operands.
In Figure 6, we show graphically how our design uses considerably fewer standard logic resources from the FPGA, thus leaving more logic for other tasks in the upper application layers. In that figure, FPGA resource usage is compared against the works that use FPGA implementation technol-
in the MicroZed, leaving 66% of the FPGA’s standard logic ogy, digit-serial approach, and comparable security levels.
available for other application requirements in the sensor Note from this figure that our design is scalable in terms of
node. These results show that our design retains the advan- area because a greater security level only impacts latency.
tages of a hardware implementation by improving the perfor- This property is only kept with the digit-digit computing
mance at the time that it uses less area resources. approach.
Work                 FPGA   m    Cycles   Slices  Freq. (MHz)  Thrg. (kbps)  Efficiency (kbps/slice)
Prop. (k = 8)        z7010  233  1553782  442     190.04       28.49         0.064
Prop. (k = 16)       z7010  233  408547   626     149.20       85.09         0.136
Prop. (k = 32)       z7010  233  128820   1170    135.31       244.75        0.209
Prop. (k = 8)        z7010  409  7504232  453     190.94       10.40         0.023
Prop. (k = 16)       z7010  409  1926426  653     154.44       32.78         0.050
Prop. (k = 32)       z7010  409  511493   1183    132.59       106.02        0.090
[32] (g = 16, d = 2) v5     233  8193     3939    263.15       7483.69       1.899
[32] (g = 8, d = 1)  v5     409  45513    5395    181.81       1633.82       0.030
[37]                 k7     233  679776   3016    255.66       87.63         0.029
[37]                 k7     283  1395312  4625    251.98       51.10         0.011
[38]                 v7     233  5929     2647    370.00       14540.39      5.498
[38]                 v7     409  10354    6888    316.00       12482.51      1.812
[41]                 v5     163  1396     3513    147.00       17.16         0.004
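The throughput and efficiency figures for the proposed design and for [37] in the table above appear consistent with throughput = m x frequency / cycles and efficiency = throughput / slices. This relation is inferred from the tabulated numbers rather than stated in this section, so the short sketch below should be read as an assumption-based check; the function names are hypothetical.

```python
# Assumed relations (inferred from the table, not stated in the text):
#   throughput [kbps]       = m * freq / cycles
#   efficiency [kbps/slice] = throughput / slices


def throughput_kbps(m, freq_mhz, cycles):
    """Bits of scalar processed per second for one kP, in kbps."""
    return m * freq_mhz * 1e6 / cycles / 1e3


def efficiency_kbps_per_slice(m, freq_mhz, cycles, slices):
    """Throughput normalised by the number of occupied slices."""
    return throughput_kbps(m, freq_mhz, cycles) / slices


# (label, m, cycles, slices, freq in MHz), copied from the table
ROWS = [
    ("Prop. (k = 8)", 233, 1553782, 442, 190.04),
    ("Prop. (k = 16)", 233, 408547, 626, 149.20),
    ("Prop. (k = 32)", 233, 128820, 1170, 135.31),
    ("[37]", 233, 679776, 3016, 255.66),
]

if __name__ == "__main__":
    for label, m, cycles, slices, freq in ROWS:
        thr = throughput_kbps(m, freq, cycles)
        eff = efficiency_kbps_per_slice(m, freq, cycles, slices)
        print(f"{label:15s} {thr:8.2f} kbps  {eff:.3f} kbps/slice")
```

Under the same assumption, the comparison with [37] for m = 233 follows directly: 3016 / 442 is about 6.8 (area ratio) and 0.064 / 0.029 is about 2.2 (efficiency ratio).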
[Figure 6: bar chart of FPGA slices (standard logic, axis from 0 to 8000) versus security level (field size: 233-bit and 409-bit). Series: this work, [32], [35].]
project number 281565. Also, the research was partially funded by project PN-2017-5814, Conacyt Problemas Nacionales.

References

[1] A. Mosenia and N. K. Jha, “A comprehensive study of security of internet-of-things,” IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 4, pp. 586–602, 2017.
[2] L. Z. Cai and M. F. Zuhairi, “Security challenges for open embedded systems,” in Engineering Technology and Technopreneurship (ICE2T), 2017 International Conference on, pp. 1–6, Kuala Lumpur, Malaysia, 2017.
[3] D. Schinianakis, “Alternative security options in the 5g and iot era,” IEEE Circuits and Systems Magazine, vol. 17, no. 4, pp. 6–28, 2017.
[4] T. Eisenbarth, S. Kumar, C. Paar, A. Poschmann, and L. Uhsadel, “A survey of lightweight-cryptography implementations,” IEEE Design & Test of Computers, vol. 24, no. 6, pp. 522–533, 2007.
[5] C. Manifavas, G. Hatzivasilis, K. Fysarakis, and K. Rantos, “Lightweight cryptography for embedded systems – a comparative analysis,” Data Privacy Management and Autonomous Spontaneous Security, 2014, pp. 333–349, Springer, Berlin Heidelberg, 2014.
[6] C. A. Lara-Nino, A. Diaz-Perez, and M. Morales-Sandoval, “Elliptic curve lightweight cryptography: a survey,” IEEE Access, vol. 6, pp. 72514–72550, 2018.
[7] P. Yalla and J. P. Kaps, “Lightweight cryptography for fpgas,” in 2009 International Conference on Reconfigurable Computing and FPGAs, pp. 225–230, Quintana Roo, Mexico, 2009.
[8] A. Diaz-Perez, M. Morales-Sandoval, and C. Lara-Nino, “Use of FPGAs for enabling security and privacy in the IoT: features and case studies,” in FPGA Algorithms and Applications for the Internet of Things, chapter 2, P. Sharma and R. Nair, Eds., pp. 26–45, IGI Global, 2020.
[9] G. Xu, Z. Chen, and P. Schaumont, “Energy and performance evaluation of an fpga-based soc platform with aes and present coprocessors,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, M. Berekovic, N. Dimopoulos, and S. Wong, Eds., pp. 106–115, Springer, Berlin, Heidelberg, 2008.
[10] H. Abdelkrim, S. Ben Othman, and S. Ben Saoud, “Reconfigurable soc fpga based: Overview and trends,” in 2017 International Conference on Advanced Systems and Electric Technologies, pp. 378–383, Hammamet, Tunisia, 2017.
[11] X. Zhang, A. Ramachandran, C. Zhuge et al., “Machine learning on fpgas to face the iot revolution,” in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 894–901, Irvine, CA, USA, 2017.
[12] C. Hao, X. Zhang, Y. Li et al., “Fpga/dnn co-design: an efficient design methodology for iot intelligence on the edge,” in Proceedings of the 56th Annual Design Automation Conference 2019, DAC ‘19, New York, NY, USA, 2019.
[13] S. Wang, Y. Hou, F. Gao, and X. Ji, “A novel iot access architecture for vehicle monitoring system,” in 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT), pp. 639–642, Reston, VA, USA, 2016.
[14] B. Zhou, M. Egele, and A. Joshi, “High-performance low-energy implementation of cryptographic algorithms on a programmable soc for iot devices,” in 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, 2017.
[15] Xilinx Inc, Microzed industrial iot starter kit, April 2020, http://zedboard.org/product/microzed-iiot-starter-kit.
[16] V. S. Miller, “Use of elliptic curves in cryptography,” Advances in Cryptology—CRYPTO ‘85 Proceedings, H. C. Williams, Ed., pp. 417–426, Springer, Berlin, Heidelberg, 1986.
[17] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.
[18] P. Szczechowiak and M. Collier, “Tinyibe: identity-based encryption for heterogeneous sensor networks,” in 2009 International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), pp. 319–354, Melbourne, VIC, Australia, 2009.
[19] N. Oualha and K. T. Nguyen, “Lightweight attribute-based encryption for the internet of things,” in 2016 25th International Conference on Computer Communication and Networks (ICCCN), pp. 1–6, Waikoloa, HI, USA, 2016.
[20] Z. U. A. Khan and M. Benaissa, “High-speed and low-latency ecc processor implementation over GF(2^m) on fpga,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 165–176, 2017.
[21] J. López and R. Dahab, “Fast multiplication on elliptic curves over GF(2^m) without precomputation,” Proceedings of the First International Workshop on Cryptographic Hardware and Embedded Systems, CHES’99, pp. 316–327, Springer-Verlag, London, UK, 1999.
[22] D. Karaklajic, J. Fan, J. Schmidt, and I. Verbauwhede, “Low-cost fault detection method for ecc using montgomery powering ladder,” 2011 Design, Automation Test in Europe, pp. 1–6, 2011.
[23] D. Dinu, Y. Le Corre, D. Khovratovich, L. Perrin, J. Großschädl, and A. Biryukov, “Triathlon of lightweight block ciphers for the internet of things,” Journal of Cryptographic Engineering, vol. 9, pp. 1–20, 2015.
[24] J.-L. Beuchat, T. Miyoshi, Y. Oyama, and E. Okamoto, “Multiplication over F_{p^m} on fpga: A survey,” in Reconfigurable Computing: Architectures, Tools and Applications, P. C. Diniz, E. Marques, K. Bertels, M. M. Fernandes, and J. M. P. Cardoso, Eds., pp. 214–225, Springer, Berlin, Heidelberg, 2007.
[25] G. Bertoni, J. Guajardo, S. Kumar, G. Orlando, C. Paar, and T. Wollinger, “Efficient GF(p^m) arithmetic architectures for cryptographic applications,” in Topics in Cryptology — CT-RSA 2003, M. Joye, Ed., pp. 158–175, Springer, Berlin, Heidelberg, 2003.
[26] C. Shu, S. Kwon, and K. Gaj, “Fpga accelerated tate pairing based cryptosystems over binary fields,” 2006 IEEE International Conference on Field Programmable Technology, 2006, pp. 173–180, Bangkok, Thailand, 2006.
[27] L. Song and K. K. Parhi, “Low-energy digit-serial/parallel finite field multipliers,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 19, no. 2, pp. 149–166, 1998.
[28] D. Pamula and E. Hrynkiewicz, “Area-speed efficient modular architecture for GF(2^m) multipliers dedicated for cryptographic applications,” in 2013 IEEE 16th International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS), pp. 30–35, Karlovy Vary, Czech Republic, 2013.
[29] National Institute of Standards and Technology, Digital Signature Standard (DSS), Appendix D, Recommended Elliptic Curves for Federal Government Use, 1999, https://csrc.nist.gov/csrc/media/publications/fips/186/3/archive/2009-06-25/documents/fips_186-3.pdf.
[30] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
[31] D. Pamula, Arithmetic operators on GF(2^m) for cryptographic applications: performance - power consumption - security tradeoffs, [Ph.D. thesis], Université Rennes 1, 2012, https://tel.archivesouvertes.fr/tel-00767537.
[32] G. D. Sutter, J. Deschamps, and J. L. Imana, “Efficient elliptic curve point multiplication using digit-serial binary field operations,” IEEE Transactions on Industrial Electronics, vol. 60, no. 1, pp. 217–225, 2013.
[33] A. P. Fournaris and O. Koufopavlou, “Low area elliptic curve arithmetic unit,” in 2009 IEEE International Symposium on Circuits and Systems, pp. 1397–1400, Taipei, Taiwan, 2009.
[34] M. Morales-Sandoval and A. Diaz-Perez, “Area/performance evaluation of digit-digit GF(2^k) multipliers on fpgas,” in 23rd International Conference on Field programmable Logic and Applications, Porto, Portugal, 2013.
[35] I. San and A. Nuray, “Improving the computational efficiency of modular operations for embedded systems,” Journal of Systems Architecture, vol. 60, no. 5, pp. 440–451, 2014.
[36] B. Bengherbia, M. O. Zmirli, A. Toubal, and A. Guessoum, “Fpga-based wireless sensor nodes for vibration monitoring system and fault diagnosis,” Measurement, vol. 101, pp. 81–92, 2017.
[37] M. S. Hossain, E. Saeedi, and Y. Kong, “High-speed, area-efficient, fpga-based elliptic curve cryptographic processor over nist binary fields,” in 2015 IEEE International Conference on Data Science and Data Intensive Systems, pp. 175–181, Sydney, NSW, Australia, 2015.
[38] Z. Khan and M. Benaissa, “Throughput/area-efficient ecc processor using montgomery point multiplication on fpga,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 11, pp. 1078–1082, 2015.
[39] W. Wei, L. Zhang, and C. Chang, “A modular design of elliptic-curve point multiplication for resource constrained devices,” in 2014 International Symposium on Integrated Circuits (ISIC), pp. 596–599, Singapore, Singapore, 2014.
[40] M. N. Hassan and M. Benaissa, “Low area-scalable hardware/software co-design for elliptic curve cryptography,” in 2009 3rd International Conference on New Technologies, Mobility and Security, pp. 1–5, Cairo, Egypt, 2009.
[41] S. S. Roy, C. Rebeiro, and D. Mukhopadhyay, “Theoretical modeling of elliptic curve scalar multiplier on lut-based fpgas for area and speed,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, no. 5, pp. 901–909, 2013.