Hindawi

Journal of Sensors
Volume 2021, Article ID 8860413, 13 pages
https://doi.org/10.1155/2021/8860413

Research Article
A Compact FPGA-Based Accelerator for Curve-Based
Cryptography in Wireless Sensor Networks

Miguel Morales-Sandoval,1 Luis Armando Rodriguez Flores,2 Rene Cumplido,2
Jose Juan Garcia-Hernandez,1 Claudia Feregrino,2 and Ignacio Algredo2

1 Centro de Investigacion y de Estudios Avanzados-Cinvestav Tamaulipas, Mexico
2 Instituto Nacional de Astrofisica, Optica y Electronica-INAOE, Mexico

Correspondence should be addressed to Miguel Morales-Sandoval; [email protected]

Received 17 April 2020; Revised 12 September 2020; Accepted 30 November 2020; Published 6 January 2021

Academic Editor: Iftikhar Ahmad

Copyright © 2021 Miguel Morales-Sandoval et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work
is properly cited.

The main topic of this paper is low-cost public key cryptography in wireless sensor nodes. Security in embedded systems, for
example, in sensor nodes based on field programmable gate array (FPGA), demands low cost but still efficient solutions. Sensor
nodes are key elements in the Internet of Things paradigm, and their security is a crucial requirement for critical applications in
sectors such as military, health, and industry. To address these security requirements under the restrictions imposed by the
available computing resources of sensor nodes, this paper presents a low-area FPGA-prototyped hardware accelerator for scalar
multiplication, the most costly operation in elliptic curve cryptography (ECC). This cryptoengine is provided as an enabler of
robust cryptography for security services in the IoT, such as confidentiality and authentication. The compactness of the
proposed hardware design is achieved by a novel digit-by-digit computing approach applied to the finite field and
curve level algorithms, together with hardware reuse, the use of memory blocks embedded in modern FPGAs, and simpler
control logic. Our hardware design targets elliptic curves defined over binary fields generated by trinomials, uses fewer area
resources than other FPGA approaches, and is faster than software counterparts. Our ECC hardware accelerator was validated
under a hardware/software codesign of the Diffie-Hellman key exchange protocol (ECDH) deployed in the IoT MicroZed FPGA
board. For a scalar multiplication in the sect233 curve, our design requires 1170 FPGA slices and completes the computation in
128820 clock cycles (at 135.31 MHz), with an efficiency of 0.209 kbps/slice. In the codesign, the ECDH protocol is executed in
4.1 ms, 17 times faster than a MIRACL software implementation running on the embedded processor Cortex A9 in the
MicroZed. The FPGA-based accelerator for binary ECC presented in this work is the one with the least amount of hardware
resources compared to other FPGA designs in the literature.
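
The latency and efficiency figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that throughput is counted as m = 233 bits per scalar multiplication; this interpretation is not stated explicitly in the text, but it reproduces the quoted 0.209 kbps/slice. All variable names are illustrative.

```python
# Cross-check of the abstract's figures (values taken from the abstract).
cycles = 128_820          # clock cycles per scalar multiplication
f_clk_hz = 135.31e6       # reported clock frequency
m_bits = 233              # field size of the sect233 curve
slices = 1170             # reported FPGA slices

latency_s = cycles / f_clk_hz                  # about 0.952 ms
# Assumption: throughput = field size in bits / latency.
throughput_kbps = m_bits / latency_s / 1e3     # about 244.7 kbps
efficiency = throughput_kbps / slices          # kbps per slice

print(f"latency    = {latency_s * 1e3:.3f} ms")
print(f"throughput = {throughput_kbps:.1f} kbps")
print(f"efficiency = {efficiency:.3f} kbps/slice")
```

Under that assumption, a single scalar multiplication takes a little under 1 ms, which is consistent with the 4.1 ms reported for the complete ECDH codesign.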

1. Introduction

Nowadays, the computing paradigm of the Internet of Things (IoT) is enabling a large number of applications in wireless technologies such as smart vehicles, smart buildings, health monitoring, energy management, environmental monitoring, food supply chains, and manufacturing [1].

In critical IoT applications, as in the Industrial Internet of Things (IIoT) or in healthcare (Medical Internet of Things, MIoT), embedded system devices have become an integral part [2] and easy targets of attacks, mainly because they are physically more accessible. Cyberphysical systems in these domains create new classes of risks resulting from their interaction between cyberspace and the physical world. Wireless sensor networks (WSN) are the cornerstone for realizations of IoT applications, where in some cases, the data generated, stored, or transmitted by the nodes (i.e., embedded systems) require robust security mechanisms to provide them with the security services of confidentiality, authentication, integrity, and nonrepudiation. Consider the model for a set of networked IoT devices (for example, a wireless sensor network) in Figure 1. Security risks arise since a malicious node can get unauthorized access to (sensitive) data, maliciously alter data, and impersonate legitimate nodes, thus posing threats to confidentiality and authentication in the communication path between a sender and a receiver node.

Figure 1: Simplified model of networked IoT devices collecting and sharing data. (Diagram labels: Sender, Receiver, Malicious node, and nodes C, D, E, F, G, H, X.)

A robust approach to provide such security services in the IoT domain is public key cryptography (PKC). PKC in its different families is based on mathematical problems, and the underlying realizations involve costly arithmetic algorithms over finite fields, rings, or groups. In the literature, a vast amount of research has focused on hardware acceleration of PKC at the different levels of the arithmetic algorithms involved. The main approaches for hardware implementations of PKC have focused on speeding up the underlying group and finite field operations at the expense of a high amount of hardware resources. However, the main drawback of hardware for PKC in WSN is the long key lengths, which lead to large chip area, circuit delays, and increased power dissipation [3].

A straightforward hardware implementation of PKC-based security solutions is not viable in the resource-constrained devices typically found in IoT scenarios, such as FPGA-based sensor nodes. Lightweight cryptography (LWC) [4] has emerged as an active research line focused on designing cryptographic primitives, schemes, and protocols tailored to constrained devices such as sensor nodes in WSN or other IoT devices, for example, RFID tags [5]. For the case of PKC, elliptic curve cryptography (ECC) has been considered one of the most efficient realizations, well suited for constrained environments in the IoT [6].

Application-specific integrated circuits (ASICs) were the first targets in LWC [4, 7]. However, reconfigurable logic circuits, specifically field-programmable gate arrays (FPGAs), are becoming more popular for implementing compact/low-area hardware accelerators for cryptographic algorithms, with attractive advantages for the IoT domain [8]. At the beginning, FPGAs were frequently used as devices for rapid prototyping of cryptographic algorithms, but now they are commonly used as final product platforms [9]. Furthermore, FPGAs are not only used as single parts of embedded systems but rather as system-on-chip (SoC) platforms for implementing complete applications [10]. Modern commercial FPGA devices contain not only programmable hardware resources but also large functional blocks, such as high-speed multipliers, embedded multiport memories, and even programmable processor cores, thus enabling hardware/software codesigns where the critical parts of an algorithm, protocol, or application are accelerated with custom designs implemented in the available programmable hardware, and the rest of the application is executed by the general purpose processors. The main advantage of FPGAs is reconfigurability since, for example, a whole system could be upgraded (or partially reconfigured) [7].

Recent works propose FPGAs as the most attractive candidates for a large range of IoT applications because of their high energy efficiency and low cost, for example, for IoT machine learning [11], IoT neural networks [12], IoT vehicle monitoring systems [13], and IoT security (cryptography) [14], among other applications. Not only do research papers propose FPGAs as hardware modules for IoT scenarios, but FPGA vendors are also producing devices with specific features for IoT development [15].

Contribution: in this work, we propose a low-area hardware engine for ECC in IoT security, suitable for inclusion as a building block in FPGA-based sensor nodes for IIoT or MIoT. We aim at providing one of the most compact FPGA hardware accelerators for scalar multiplication in standard binary curves, the most time-consuming operation and the core of ECC cryptographic schemes such as encryption, digital signatures, and key establishment. To achieve compactness, a novel digit-digit binary finite field multiplier is proposed and used as the basic building block of the proposed ECC accelerator. Under this approach, the operands are processed one digit at a time in an iterative way, while exploiting the parallelism at the algorithmic level and reusing hardware resources as much as possible. The sequence of field operations in the algorithm for scalar multiplication is carefully scheduled to reduce the number of field multiplier cores (two) and memory blocks (eight). While the field multipliers are implemented using standard FPGA logic, the memories are taken from the ones available in modern FPGAs. Due to the digit-digit computation approach, an efficient data memory management is designed to reduce the number of memory blocks. This way, with only the eight memory blocks, the several field multiplications in a single point addition are correctly computed, and at the same
time, those same memories serve to keep the progress of the scalar multiplication computation. The novel hardware design presented in this work was validated under a hardware/software implementation of the elliptic curve Diffie-Hellman (ECDH) key exchange protocol, tailored to the MicroZed FPGA prototyping board, recommended for IoT industrial applications. Under this setting, which is very common in an FPGA IoT application, the execution of ECDH outperforms the software counterpart, implemented using the MIRACL library and running on the embedded Cortex A9 processor in the MicroZed. Our hardware architecture, compared with state-of-the-art similar approaches in terms of area, only requires up to 16% of FPGA hardware resources, thus being the most compact FPGA-based hardware architecture for computing scalar multiplications in ECC defined over binary fields. Compared to the software reference implementation, our design is 17 times faster.

The rest of this brief is organized as follows: Materials and Methods discusses the preliminaries of scalar multiplication in binary elliptic curves and the Montgomery López-Dahab algorithm for scalar multiplication. This section also describes related works and the proposed hardware design. Results and Discussion presents the experimental results and comparisons with state-of-the-art works, followed by concluding remarks in the Conclusion.

2. Materials and Methods

In this section, we provide the mathematical concepts and foundations that are the basis to construct the FPGA-based ECC cryptoengine. First, we present the basis of elliptic curves and groups from which the scalar multiplication is defined. Scalar multiplication is critical because the proposed hardware cryptoengine is meant precisely to speed up this costly operation, which is the core of higher-level operations for security applications such as encryption and digital signatures. Finally, the section concludes by discussing the method to compute scalar multiplications on binary elliptic curves. This algorithm is realized by the proposed FPGA-based ECC cryptoengine.

2.1. Elliptic Curves and Their Use in Cryptography. Since it was invented independently by Miller [16] and Koblitz [17], elliptic curve cryptography (ECC) has received a lot of attention in academia and industry. Elliptic curves and their properties have also enabled other types of cryptography relevant for the IoT (in wireless sensor networks), for example, identity-based encryption (IBE) [18] and attribute-based encryption [19]. With the advent of the IoT, mainly populated by intelligent objects with restricted computing capabilities and resources, ECC is becoming one of the promising approaches to provide security services in that computing paradigm [6].

An elliptic curve $E$ over a finite field $\mathbb{F}_q$ is defined by Eq. (1).

$$E : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, \tag{1}$$

where $a_1, a_2, a_3, a_4, a_6 \in \mathbb{F}_q$. The $(x, y)$ pairs satisfying $E$, together with a special point named the point at infinity $\mathcal{O}$, form a group $G$ with point addition as the group operation. $G$ is a cyclic group with prime order $n$ where the discrete logarithm problem is defined and on which ECC is founded.

It is well known that binary extension fields ($q = 2^m$) are very attractive for defining ECC. An element in $\mathbb{F}_{2^m}$ is the bit vector $(a_{m-1}, a_{m-2}, \cdots, a_0)$ that in polynomial basis represents the $(m-1)$-degree polynomial $a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_0$, with $a_i$ in $\{0, 1\}$. Arithmetic in $\mathbb{F}_{2^m}$ in polynomial basis is polynomial arithmetic with reduction modulo $F(x)$, an irreducible polynomial of degree $m$. The arithmetic in $\mathbb{F}_{2^m}$ is carry free and more suitable for hardware implementations.

2.2. Scalar Multiplication in Elliptic Curves. Scalar multiplication in $E(\mathbb{F}_q)$, denoted as $Q = kP$ with $Q, P \in G$ and $k \in [1, n-1]$, is the main and most time-consuming operation in any ECC scheme (encryption, digital signature, key exchange, etc.). $Q$ is computed by adding $P$ to itself $k$ times [20]:

$$Q = kP = \underbrace{P + P + \cdots + P}_{k\ \text{times}}.$$

The complexity of $kP$ is expressed in terms of the operations in $\mathbb{F}_q$. Given a large integer $k$ and a point $P$ in $G$, it is easy to compute $Q = kP$. On the contrary, the elliptic curve discrete logarithm problem (ECDLP) is the problem of finding the scalar $k$ given the points $P$ and $Q$ in $G$. For a large enough $n$, ECDLP becomes hard to solve. Most of the state-of-the-art works related to ECC have focused on the efficient implementation of scalar multiplication [6], which is a condition for an efficient ECC implementation.

The López-Dahab Montgomery point multiplication algorithm [21], shown in Algorithm 1, has been commonly used for the $kP$ computation because it is resistant to side-channel attacks, suitable for parallelization, and friendly to low-resource devices. In this work, we use the López-Dahab algorithm to implement, for the first time, the most compact FPGA-based hardware architecture for computing $kP$ in binary elliptic curves, $E(\mathbb{F}_{2^m})$.

The main operations in Algorithm 1 are addition, multiplication, and squaring in $\mathbb{F}_{2^m}$. Consider the fields recommended by NIST for practical ECC, with $m = 233$ and $m = 409$. For $m = 409$, Algorithm 1 has a cost of 1227 field additions, 2454 field multiplications, and 2454 field squarings over $\mathbb{F}_{2^m}$, with field multiplication being the most time-consuming operation.

The López-Dahab method for scalar multiplication in ECC is considered the most suitable method when targeting devices with low computing power [22]. The elliptic curve point is represented in projective coordinates. At the beginning, the elliptic curve point $P$ in affine coordinates $(x, y)$ is converted to its projective representation $(X, Y, Z)$. Algorithm 1 uses only the x-coordinate for point representation so storage resources can be saved (line 5). With this setting, costly field inversions are avoided in each group (curve level) operation. Only one field inversion is required for the coordinate conversion from projective to affine at the end of the main loop (line 13). Algorithm 1 is time-constant and resistant to some side-channel attacks such as simple power analysis (SPA).
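
As an illustration of the ladder structure of Algorithm 1, the following Python sketch computes the x-coordinate of $kP$ using only $(X, Z)$ pairs. It is a minimal software model under stated assumptions, not the hardware design of this paper: the toy field $\mathbb{F}_{2^7}$ with $F(x) = x^7 + x + 1$, the curve coefficient `B = 1`, and all function names are chosen here for illustration; inversion uses plain Fermat exponentiation, and the recovery of the y-coordinate (Mxy) is omitted.

```python
# Toy parameters (assumptions for illustration, not taken from the paper).
M = 7                          # extension degree of the toy field F_2^7
F = (1 << 7) | (1 << 1) | 1    # F(x) = x^7 + x + 1 (irreducible trinomial)
B = 1                          # coefficient b in y^2 + xy = x^3 + a*x^2 + b

def f_mul(a, b):
    """Shift-and-add product of two field elements, reduced modulo F(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a             # field addition is a bit-wise XOR
        b >>= 1
        a <<= 1
        if a >> M:             # degree reached M: subtract (XOR) F(x)
            a ^= F
    return r

def f_sqr(a):
    return f_mul(a, a)

def f_inv(a):
    """a^(2^M - 2) by repeated squaring (Fermat); a must be nonzero."""
    r, t = 1, a
    for _ in range(M - 1):
        t = f_sqr(t)
        r = f_mul(r, t)
    return r

def madd(X1, Z1, X2, Z2, x):
    """Differential addition; x is the affine x-coordinate of P2 - P1 = P."""
    t1, t2 = f_mul(X1, Z2), f_mul(X2, Z1)
    Z3 = f_sqr(t1 ^ t2)
    X3 = f_mul(x, Z3) ^ f_mul(t1, t2)
    return X3, Z3

def mdouble(X, Z):
    """Point doubling: (X^4 + b*Z^4, X^2 * Z^2)."""
    X2, Z2 = f_sqr(X), f_sqr(Z)
    return f_sqr(X2) ^ f_mul(B, f_sqr(Z2)), f_mul(X2, Z2)

def ladder_x(k, x):
    """Affine x-coordinate of kP from x = x(P), for k >= 1 and kP != O."""
    P1 = (x, 1)                # P
    P2 = mdouble(x, 1)         # 2P = (x^4 + b, x^2)
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            P1 = madd(*P1, *P2, x)
            P2 = mdouble(*P2)
        else:
            P2 = madd(*P2, *P1, x)
            P1 = mdouble(*P1)
    X1, Z1 = P1
    return f_mul(X1, f_inv(Z1))   # single inversion at the end
```

Both branches of the loop perform one differential addition and one doubling, which is the regular structure behind the constant-time behavior discussed above. The sketch itself makes no side-channel claims, since Python integer operations are not constant time.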

Algorithm 1: Montgomery scalar multiplication [21].

Require: $k \geq 0$
Require: $P = (x, y) \in E(\mathbb{F}_{2^m})$
1: function MONTGOMERY($k$, $P$)
2:   if $k = 0$ or $x = 0$ then
3:     return $(0, 0)$
4:   end if
5:   $P_1(X_1, Z_1) \leftarrow (x, 1)$; $P_2(X_2, Z_2) \leftarrow (x^4 + b, x^2)$
6:   for $i$ from $l - 2$ downto 0 do
7:     if $k_i = 1$ then
8:       $P_1 \leftarrow$ Madd($P_1$, $P_2$); $P_2 \leftarrow$ Mdouble($P_2$)
9:     else
10:      $P_2 \leftarrow$ Madd($P_2$, $P_1$); $P_1 \leftarrow$ Mdouble($P_1$)
11:    end if
12:  end for
13:  return $Q =$ Mxy($P_1$, $P_2$, $P$)
14: end function

1: procedure MADD($P_1$, $P_2$)
2:   $Z_3 \leftarrow (X_1 Z_2 + X_2 Z_1)^2$; $X_3 \leftarrow x Z_3 + X_1 X_2 Z_1 Z_2$
3:   return $P_3(X_3, Z_3)$
4: end procedure

1: procedure MDOUBLE($P_1$)
2:   $Z_2 \leftarrow X_1^2 Z_1^2$; $X_2 \leftarrow X_1^4 + b Z_1^4$
3:   return $P_2(X_2, Z_2)$
4: end procedure

1: procedure MXY($P_1$, $P_2$, $P$)
2:   $x_q \leftarrow X_1 Z_1^{-1}$
3:   $Y_{int} \leftarrow (X_1 + x Z_1)(X_2 + x Z_2) + (x^2 + y) Z_1 Z_2$
4:   $y_q \leftarrow (x + x_q)\, Y_{int}\, (x Z_1 Z_2)^{-1} + y$
5:   return $Q(x_q, y_q)$
6: end procedure

2.3. Related Work. Since $kP$ is the core operation in ECC cryptographic schemes, it has been the main target for hardware acceleration; however, few works have approached low-area designs compared to those trying to achieve the maximum performance. For the devices used in the IoT, generally sensor nodes, lightweight realizations of cryptography are preferred in order to use the available computing and power resources in the sensor nodes efficiently [23].

The computation of $kP$ requires executing a scalar multiplication algorithm, Algorithm 1 being one of the most recommended. At each iteration, curve (group) arithmetic is executed, either point addition or point doubling, each implying several finite field operations. So, operations in groups and finite fields are critical for public key cryptography such as elliptic curve cryptography (ECC). An efficient implementation of $kP$ requires an efficient implementation of finite field operations, multiplication and inversion being the most time-consuming field operators. Field inversion can be efficiently realized through several field multiplications; consequently, the hardware field multiplier has been studied as the main core to compute $kP$.

In the case of $\mathbb{F}_{2^m}$, there are three main families of algorithms to compute a field multiplication $A(x) \times B(x) \bmod F(x)$: full-parallel, bit-serial, and digit-serial [24]. The full-parallel approach is the most costly in terms of area usage but is the fastest, while the bit-serial approach is generally the most compact but is slower. The digit-serial approach allows a trade-off between computation time and area usage.

Related works are discussed in this section based on the type of multiplier being used (bit-serial, digit-serial), the computing approach (LSE, MSE), the implementation platform (FPGA type), the finite field size, and the implementation results in terms of time and area (FPGA slices). Note that our contribution is on the multiplier being used and on the computing approach (digit-digit). This approach has not been explored, and we present for the first time an FPGA accelerator for ECC based on it.

Digit-serial and bit-serial approaches to field multiplication are iterative algorithms that process one of the operands in the multiplication from left to right (MSE) or from right to left (LSE). At each iteration, the partial results need modular reduction. Bertoni et al. [25] presented an easy way to perform modular reduction when partial results have terms of degree greater than $m - 1$ (e.g., the term of degree $m$). Beuchat et al. [24] surveyed some of the most representative $\mathbb{F}_{2^m}$ implementations using MSE and LSE algorithms (including the implementations presented in [25]).

Digit-serial implementations (with digit size $D$) require $\lceil m/D \rceil$ iterations using $(m-1)$-degree partial results [26]. However, in [27], it is proposed to use $(m + D - 1)$-degree partial results to improve computation performance at the cost of one extra iteration, requiring $m + 1$ iterations to compute a multiplication over $\mathbb{F}_{2^m}$. The digit-serial algorithm proposed in [25] requires $m + 1$ iterations and keeps $(m + D - 1)$-degree partial results to improve computation performance. Beuchat et al. [24] concluded that the MSE-first approach requires less hardware and offers higher throughput than LSE. In [28], the reduction steps are performed separately. It is stated that for a finite field generated by the irreducible polynomials $F(x)$ recommended by NIST [29], reduction can be performed by a set of XOR operations [30, 31]. Only the multiplication step is considered in [28], implemented in a digit-serial approach. A digit size $D = 16$ is proposed since, in most cases, 16-bit words give better results.

In [32], an LSE digit-serial multiplier is used; however, a digit size of one bit (bit-serial) resulted in the most compact version. In [33], a systolic hardware architecture is proposed to compute multiplication and inversion in the same hardware. Furthermore, an arithmetic unit is constructed that can perform all the $\mathbb{F}_{2^m}$ arithmetic operations required in elliptic curve cryptography. In [34], a digit-digit $\mathbb{F}_{2^m}$ multiplier on an MSE basis is presented for the first time. Operands, modulus, and partial results are partitioned in digits and processed one digit at a time. The main advantage compared to digit-serial or bit-serial implementations is that operands and partial results can be stored in BRAMs instead of shift registers, which saves standard logic (slices). However, the multiplier presented is designed and evaluated as a standalone module, which is hard to use directly in a $kP$ engine.

Table 1 summarizes the most relevant works for $\mathbb{F}_{2^m}$ multiplication in FPGA, the main algorithms used, and the area/time results. Table 2 shows some of the most representative works of hardware designs for $kP$ computation in hardware. Most of the reported works use the bit-serial or digit-serial approach to implement hardware $\mathbb{F}_{2^m}$ operators. However, the hardware resources required in these approaches depend directly on the operand size (field size $m$), because even when one of the operands is iteratively processed, the other one is processed in parallel.

Table 1: Hardware approaches for $\mathbb{F}_{2^m}$ multipliers.

Ref.         Field      Target     Algorithm          Approach        Slices        Time (ns)
[24]         F(2^233)   Spartan 3  MSE                Digit-serial    3458          58.0
[24]         F(2^233)   Spartan 3  LSE                Digit-serial    3504          62.0
[24]         F(2^409)   Spartan 3  MSE                Digit-serial    5406          153.0
[28]         F(2^233)   Virtex 6   Schoolbook method  Digit-serial    1643 (LUTs)   802.4
[32] (d=1)   F(2^233)   Virtex 5   LSE                Digit-serial    714 (LUTs)    415.0
[32] (d=16)  F(2^233)   Virtex 5   LSE                Digit-serial    2351 (LUTs)   35.0
[34]         F(2^233)   Spartan 3  MSE                Digit-digit     406           219.0
[33]         F(2^163)   Virtex II  M-I algorithm      Systolic array  1399          -

Table 2: Hardware approaches for $kP$ in FPGAs.

Ref.         Field      Target     Mult. algorithm    Approach        Slices        Time
[4]          F(2^193)   ASIC       MSE                Digit-serial    17723 GE      41.70 ms
[32]         F(2^233)   Virtex 5   MSE                Digit-serial    6487          19.89 μs
[38]         F(2^233)   Virtex 7   MSE                Digit-serial    2647          16.01 μs
[39]         F(2^163)   Spartan 3  LSE                Bit-serial      3383          2.23 ms
[40]         F(2^193)   Spartan 3  Comba wxw          Digit-serial    473           125.00 ms
[37]         F(2^233)   Kintex 7   MSE                Bit-serial      3016          2.66 ms
[41]         F(2^163)   Virtex 5   Karatsuba          Bit-parallel    3789          10.00 μs

The bit-serial approach requires a small amount of hardware resources compared to the digit-serial or full-parallel approach, but for large operands, even the bit-serial approach requires a considerable amount of hardware resources (slices). However, some recent works already proposed using a digit-digit approach, for example, [34, 35]. The main drawback of the multiplier presented in [34] is the use of shift registers to store partial results and the infeasibility of using such a design in a practical $kP$ engine; the main drawback of [35] is that the digit sizes must fit the DSP multipliers embedded in the FPGA.

In order to reduce area requirements and achieve a compact design well suited for IoT applications, the approach in this work to construct a hardware $kP$ accelerator follows the digit-digit computation approach and makes use of the multipliers and memory blocks embedded in most FPGAs to save FPGA standard logic. By implementing a strategy for reusing memory blocks, critical for the iterative processing of the digit-digit approach, considerable area resources are saved while retaining the advantage of iteratively processing both operands in the multiplication, and not only one as in the digit-serial or bit-serial approaches. Additionally, since memory blocks are bigger than the operands, it is proposed to use part of the available memory blocks to store control signals (microprogramming), thus avoiding the logic needed to implement a state machine for control.

2.4. Novel Digit-by-Digit Elliptic Curve Point Multiplication Hardware Architecture. The proposed ECC engine, suitable for FPGA-based sensor nodes in the IoT, is constructed following a layered approach. The low level is the $\mathbb{F}_{2^m}$ arithmetic, where field multiplication is the main operation to be optimized in terms of area resources. Next, using the $\mathbb{F}_{2^m}$ multiplier as a building block, the high layer is the curve arithmetic, consisting of the realization of Algorithm 1 optimized in terms of area resources, where the $\mathbb{F}_{2^m}$ multiplier is used to compute each of the point additions (lines 8 and 10). At this level, the $\mathbb{F}_{2^m}$ multiplier is also used to realize the field inversion and field squaring required in the point addition and point doubling operations. In both layers, the proposed design methodology takes advantage of the block RAMs (BRAMs) embedded in modern FPGAs to store the operands and the partial and final results, reusing the BRAMs as much as possible by means of a careful field operation scheduling and memory management strategy.

2.4.1. Field Arithmetic. Arithmetic in $\mathbb{F}_{2^m}$ is done using polynomial basis. Under this representation, each element in the field is an $(m-1)$-degree polynomial $A(x)$ over the field $\mathbb{F}_2$. The two $\mathbb{F}_{2^m}$ binary operators are addition and multiplication with reduction modulo $F(x)$, an irreducible polynomial of degree $m$. Field addition is the bit-wise XOR operation of coefficients (carry free, no reduction needed), a cheap
operation when implemented in the hardware. Additive implement an efficient data memory management, ensuring
inverse in F 2m under polynomial basis is also easy to imple- consistency in the correct execution of both the digit-digit
ment, as for any AðxÞ in F 2m , AðxÞ + AðxÞ = 0, with 0 as the field multiplier and the scalar multiplication algorithm. In
neutral addition element (all zero polynomial). this work, we present the design of a novel digit-digit F2m
Multiplication and multiplicative inverses (or simply multiplier that achieves compact designs by optimizing the
inversion) in F 2m are more complex operations. Since Algo- resources for finite fields defined by trinomials.
rithm 1 only requires one F 2m inversion at the end of the
computation, field inversion is implemented using the Itho- 2.4.3. Digit-Digit F 2m Multiplier. Parting from the definition
Tsuji algorithm, by a series of F 2m multiplications. So, the of elements in F2m , as polynomials of the form bm−1 xm−1 +
field multiplier becomes the most critical operation to be bm−2 xm−2 + ⋯b0 with binary coefficients, in this section, we
carefully implemented in ECC hardware approaches and present how the mathematical expression that computes an
one of the critical component in our kP engine. F 2m multiplication in a digit-by-digit fashion is derived (from
Eq. (2) to Eq. (9)). This expression leads to the specification
2.4.2. F 2m Multiplication. In the literature, there are basically of the F2m multiplier that is the building block of our
three computing approaches for computing field multiplica- FPGA-based engine for scalar multiplication in ECC.
tion in the hardware: bit-serial (the most compact design), An element B ∈ F 2m of the form bm−1 xm−1 + bm−2 xm−2 +
digital-serial (for area-performance trade-offs), and full- ⋯b0 can be represented as the sum of w = dm/de polyno-
parallel (the fastest but also the costlier solution in terms of mials (digits) each of d coefficients in F2 (Eq. (2)).
area). The most significant element (MSE) and least signifi-
cant element (LSE) (bit-serial or digit-serial) are the com- m−1 w−1 d−1
monly used algorithms to compute multiplications over F2m . BðxÞ = 〠 bi xi = 〠 Bi xid , Bi = 〠 bid+j x j : ð2Þ
In this work, we propose a novel digit-digit F 2m multiplier i=0 i=0 j=0
algorithm well suited to be integrated into a kP engine. The
digit-digit computing approach aims at performing better So, Eq. (3) expresses the multiplication CðxÞ = AðxÞ × B
than a bit-serial multiplier, keeps the property of allowing ðxÞ mod FðxÞ in a digit-serial approach.
exploring area-performance trade-offs when realized in
hardware, and it is not as expensive as a full parallel realiza- C = A × B mod F ðxÞ
tion. This is consistent with our design methodology to !
w−1
achieve a compact architecture (simpler datapath) for the k = A × 〠 Bi xdi mod F ðxÞ
P engine. Details of the digit-digit F 2m multiplier are i=0 ð3Þ
presented in Section 2.4.3. = AB0 mod F ðxÞ + AB1 xd mod F ðxÞ
F2m multiplication using the digit-digit computing
approach was previously suggested in [34]. However, the + AB2 x2d mod F ðxÞ+⋮ABw−1 xðw−1Þd mod F ðxÞ:
multiplier design in that work is not suitable for a direct
application in a kP engine. The authors in that work only Let P<i> ðxÞ = ABi , 0 ≤ i ≤ w − 1, and the (d + m − 2
proved the advantages of the digit-digit approach versus the )-degree polynomial resulting from the partial product at
well-known bit-serial and digit-serial multipliers, as a standa- iteration i in Eq. (3). By parsing elements of B from left-to-
lone module. However, when that multiplier is considered right (MSE), C computation at iteration i is determined by
for realizing the kP operation, several issues must be solved. recurrence in Eq. (4):
Being the multiplier part of a series of operations implied
by each point addition operation in the main loop in the kP C<0> = 0, ð4Þ
computation, the main challenge for the digit-digit multiplier
is the fact that partial results at each iteration in the digit-  
C<i+1> = xd C <i> mod F ðxÞ + P<w−1−i> ðxÞ, ð5Þ
digit multiplier and the final result (possibly operated with
other values) are the input operand for the same multiplier where polynomial xd ðC<i> mod FðxÞÞ has the most
in next iterations. So, during the digit-digit computation, degree (d + k − 1), while P<i> is of degree ðd + k − 2Þ. After
the multiplier must keep its operands in memory blocks M 1
w iterations, the polynomial C <w−1> of degree (d + k − 1)
and M 2 and progressively stores the partial results in another needs reduction. By introducing an extra iteration with B−1
one M 3 . At the end, the results in M 3 should be moved to M 1
= 0 and P<−1> = 0, C<w> = xd ðC <w−1> mod FðxÞÞ is the result.
or M 2 for further processing (a kP operation requires several
F 2m multiplications), introducing a delay in the kP computa- The xd term in this last expression can be easily reduced
tion, unless that data movement is done during the computa- modulo FðxÞ by only discarding the digit C <w> 0 .
m−1
tion. So, M 1 or M 2 must act as an input and output memory Being FðxÞ an m-degree polynomial, FðxÞ = xm + ∑i=0
at the same time. Since a complete kP operation requires f i αi . So, xm mod FðxÞ = ∑m−1
i=0 f i α = gðxÞ, a polynomial of
i

several hundreds of multiplications, using the multiplier as degree g with g < m. Thus, elements xm+t with t ≤ m − 1 − g
proposed in [34] without addressing the previous data can be reduced using equivalence xm+t mod FðxÞ = gðxÞxt .
memory management issue is totally unpractical. Degree of C <i+1> from Eq. (5) (after C <i> reduction) is at
As it is explained in the next section, the main issue to most ðd + m − 1Þ. This polynomial becomes the C<i> polyno-
integrate a digit-digit F 2m multiplier in the kP engine is to mial to be reduced in the next iteration ðC<i> mod FðxÞÞ. So,
At each iteration $i + 1$, it is required to reduce the $d$ terms $x^j$ of $C^{<i>}$ with $m - 1 < j \le d + m - 1$. Using the previous reduction rule and taking $F(x)$ to be a trinomial, the reduction in Eq. (5) can be defined as in Eq. (6).

$$
\begin{aligned}
C^{<i>} \bmod F(x) &= \left( \sum_{j=0}^{m-1} c_j x^j + \sum_{j=m}^{m-1+d} c_j x^j \right) \bmod F(x) \\
&= \left( \sum_{j=0}^{m-1} c_j x^j + \sum_{j=0}^{d-1} c_{m+j} x^j g(x) \right) \bmod F(x) \\
&= C_m^{<i>}(x) + C_d^{<i>}(x) \times g(x).
\end{aligned} \tag{6}
$$

This way, $C^{<i>}$ is partitioned into two polynomials $C_m^{<i>}(x)$ and $C_d^{<i>}(x)$ of degree at most $m - 1$ and $d - 1$, respectively. The partial multiplication $C_d^{<i>} \times g(x)$ will not require modular reduction if $d + g < m$. So, Eq. (5) can be rewritten as in Eq. (7).

$$C^{<i+1>} = x^d \left( C_m^{<i>} + C_d^{<i>} \times g(x) \right) + P^{<w-1-i>}. \tag{7}$$

Under the digit-digit computation approach, the polynomials $C_m^{<i>}$, $g(x)$, and $A$ are represented in $w = \lceil m/d \rceil$ digits. Since the degree of $B_i$ is $d - 1$, $P^{<i>}$ can be computed iteratively by taking digit $B_i$ and iterating through the digits of $A$. Taking $B_i$ as a constant, $P^{<i>}(x) = A(x) \times B_i(x) = \sum_{j=0}^{w-1} (A_j \times B_i) x^{jd} = \sum_{j=0}^{w-1} P_j^{<i>} x^{jd}$. With this new notation, the first term in Eq. (7) can be rewritten as in Eq. (8).

$$
\begin{aligned}
x^d \left( C_m^{<i>}(x) + C_d^{<i>}(x) \times g(x) \right) &= \sum_{j=0}^{w-1} C_j^{<i>} x^{jd+d} + C_d^{<i>}(x) \times \sum_{j=0}^{w-1} G_j x^{jd+d} \\
&= \sum_{j=0}^{w-1} \left( C_j^{<i>} + C_d^{<i>}(x) \times G_j \right) x^{jd+d} \\
&= \sum_{j=0}^{w-1} R_j^{<i>} x^{jd+d}.
\end{aligned} \tag{8}
$$

Once $P^{<w-1-i>}$ and $R^{<i>}$ are expressed so that they can be processed iteratively, one digit at a time, Eq. (7) can be rewritten in a notation that leads to an iterative, digit-by-digit computation of each partial product of the $F_{2^m}$ multiplication, given by Eq. (9).

$$C^{<i+1>} = \sum_{j=0}^{w-1} \left( R_j^{<i>} x^{jd+d} + P_j^{<w-1-i>} x^{jd} \right). \tag{9}$$

At each iteration, the values $P_j^{<w-1-i>}$ and $R_j^{<i>}$ can be computed in parallel. For the sake of clarity about the computations in Eq. (9), the sum of the digits $P_j^{<w-1-i>}$ and $R_j^{<i>} x^d$ can be expressed as a single variable $S_j^{<i>}$. This new variable $S_j^{<i>}$ is $(d + d + d)$ bits in size, as shown in Figure 2.

With all these considerations, the proposed algorithm for computing multiplication over $F_{2^m}$ is presented in Algorithm 2.

2.4.4. Digit-Digit $F_{2^m}$ Multiplier Hardware Architecture. To achieve compactness, in this work, we propose the realization in hardware of Algorithm 2 in its simplest form. The hardware architecture requires only one partial $d \times d$ multiplier and is optimized for binary fields defined by a trinomial. NIST and other compliant standards have recommended trinomials for binary fields, for example, $F(x) = x^{409} + x^{87} + 1$ and $F(x) = x^{233} + x^{74} + 1$.

If the 233-degree trinomial is used, $g(x) = x^{74} + 1$ is used for the reduction step. So, if the digit size is $d = 74$, when digit $G_j$ of $g(x)$ is read, only the first two digits ($j = 0$ and $j = 1$) have a value of 1; for $j > 1$, digit $G_j$ is always 0. In this case, the partial multiplier that computes $C_d^{<i>}(x) \times G_j$ always computes a multiplication of the form $(C_d^{<i>}(x) \times 1)$ or $(C_d^{<i>}(x) \times 0)$, which can be implemented with only an "and" gate. In conclusion, when a trinomial of the form $x^m + x^k + 1$ is used, it is possible to define the digit size $d = k$. In this case, the partial multiplier that computes $C_d^{<i>}(x) \times G_j$ can be implemented using only a multiplexer, as shown in Figure 3.

2.4.5. Curve Arithmetic. The hardware for elliptic curve scalar multiplication is guided by the execution of Algorithm 1, which is based on iterative calls to the point addition functions Madd and Mdouble.

Figure 4 shows the required operations at each iteration of Algorithm 1 and the underlying $F_{2^m}$ operations (denoted by circles). After each $F_{2^m}$ operation, the figure also shows the memory where the intermediate values are stored. For example, the memory X11 stores the result of the first field operation $X_1 \times Z_2$ in the point addition operation. While five $F_{2^m}$ multiplications are needed to compute a single Madd operation, six $F_{2^m}$ multiplications are required for Mdouble.

The schedule of field operations shown in Figure 4 uses only four memories to compute the complete Madd function, by reusing the memory blocks properly. For Mdouble, four memories are also enough. The memories are used alternately, as shown in the figure, to act as the repository for the input parameters to a field multiplier/adder or as the repository for the multiplication/addition result. We stress again that proper data memory management must be implemented to avoid the delays induced by moving data from the result memory to the input parameter memory in the chained $F_{2^m}$ operations.

Since Algorithm 1 uses only the X and Z coordinates of elliptic curve points in projective representation, each point $P(X, Z)$ is stored in two BRAMs, one for the X and the other for the Z coordinate. In Figure 4, the memories for the points $P_1$ and $P_2$ are represented by the variables X1, X2, Z1, Z2.

For Madd, let us consider the first multiplication $X_1 \times Z_2$, stored in X11, and the second multiplication $X_2 \times Z_1$, stored in Z11. Both multiplications can be done in parallel, with memories $X_1$, $X_2$, $Z_1$, $Z_2$ acting as reading memories and X11 and Z11 acting as the writing memories. For the third multiplication X11 × Z11, memories X11 and Z11 must switch to act as reading memories, and the result can be stored in $Z_1$, one of the memories that initially stored the input parameters.
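The multiplication counts quoted for Madd and Mdouble can be checked with a minimal Python sketch of the x-only López-Dahab formulas [21]. It assumes the standard formulas $Z_3 = (X_1 Z_2 + X_2 Z_1)^2$, $X_3 = x Z_3 + (X_1 Z_2)(X_2 Z_1)$ for addition and $X = X_1^4 + b Z_1^4$, $Z = X_1^2 Z_1^2$ for doubling, and it routes every squaring through the multiplier. The `Field` class, the integer encoding of field elements, and the function signatures are illustrative assumptions; the memory assignment of Figure 4 is not modeled.

```python
class Field:
    """GF(2^m) defined by the trinomial x^m + x^k + 1; elements are Python ints."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.muls = 0                       # number of field multiplications issued

    def mul(self, a: int, b: int) -> int:
        self.muls += 1
        p = 0
        while b:                            # carry-less shift-and-add product
            if b & 1:
                p ^= a
            a <<= 1
            b >>= 1
        mask = (1 << self.m) - 1
        while p >> self.m:                  # x^m = x^k + 1 (trinomial reduction)
            hi = p >> self.m
            p = (p & mask) ^ (hi << self.k) ^ hi
        return p


def madd(F, X1, Z1, X2, Z2, x):
    """x-only addition P3 = P1 + P2; x is the affine x-coordinate of P2 - P1."""
    t1 = F.mul(X1, Z2)                      # multiplication 1
    t2 = F.mul(X2, Z1)                      # multiplication 2
    t3 = F.mul(t1, t2)                      # multiplication 3
    s = t1 ^ t2                             # field addition
    Z3 = F.mul(s, s)                        # squaring done by the multiplier (4)
    X3 = F.mul(x, Z3) ^ t3                  # multiplication 5, then addition
    return X3, Z3


def mdouble(F, X1, Z1, b):
    """x-only doubling of P1 on the curve y^2 + xy = x^3 + a*x^2 + b."""
    xs = F.mul(X1, X1)                      # X1^2
    zs = F.mul(Z1, Z1)                      # Z1^2
    Zd = F.mul(xs, zs)                      # Z = X1^2 * Z1^2
    x4 = F.mul(xs, xs)                      # X1^4
    z4 = F.mul(zs, zs)                      # Z1^4
    Xd = x4 ^ F.mul(b, z4)                  # X = X1^4 + b * Z1^4
    return Xd, Zd


if __name__ == "__main__":
    F = Field(233, 74)
    madd(F, 3, 1, 5, 1, 7)
    print("Madd multiplications:", F.muls)      # 5
    F.muls = 0
    mdouble(F, 3, 1, 1)
    print("Mdouble multiplications:", F.muls)   # 6
```

In this sketch the operation sequence is three multiplications, one addition, two multiplications, and one addition for Madd, and six multiplications followed by one addition for Mdouble.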
Figure 2: Digit-digit computation of $F_{2^m}$ multiplication.

Require: $A, B, F \in F_{2^m}$   ⊳ $A = \sum_{i=0}^{w-1} a_i \alpha^{iD}$, $B = \sum_{i=0}^{w-1} b_i \alpha^{iD}$
1: cD ← 0
2: for i ← 0 to w + 1 do
3:     carry ← 0
4:     s ← 0
5:     for j ← 0 to w do
6:         P_i ← b_{digits−i} × a_{j+1}
7:         R_j ← c_j + cD × f_{j+1}
8:         s ← P_i + (R_j ≪ d) + carry
9:         c_j ← s[d − 1 downto 0]
10:        cD ← s ≫ d
11:    end for
12:    cD ← s ≫ bitsLastDigit
13: end for
14: c_w ← carry
15: return c   ⊳ $c = A \times B \bmod F$, $c = \sum_{i=0}^{w-1} c_i \alpha^{iD}$

Algorithm 2: Digit-digit $F_{2^m}$ multiplier algorithm.
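The recurrence behind Algorithm 2 can be mirrored in software with a minimal Python sketch. It follows Eq. (5), processing the digits of $B$ from the most significant one and forming each partial product from $d \times d$ digit products, and it assumes a trinomial $F(x) = x^m + x^k + 1$. The carry and `cD` registers of Algorithm 2 and the BRAM-based datapath are not modeled; the function names and the integer encoding are illustrative assumptions.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials held in Python ints."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p


def digit_digit_mul(A: int, B: int, m: int, k: int, d: int) -> int:
    """Compute A * B mod x^m + x^k + 1 using d-bit digits of A and B."""
    w = -(-m // d)                          # number of digits, ceil(m / d)
    mask_d, mask_m = (1 << d) - 1, (1 << m) - 1
    g = (1 << k) | 1                        # g(x) = x^m mod F(x)
    a_digits = [(A >> (j * d)) & mask_d for j in range(w)]

    def reduce(c: int) -> int:
        while c >> m:                       # C_m + C_d * g(x), as in Eq. (6)
            c = (c & mask_m) ^ clmul(c >> m, g)
        return c

    C = 0
    for i in range(w):
        Bi = (B >> ((w - 1 - i) * d)) & mask_d   # digit B_(w-1-i)
        P = 0
        for j, Aj in enumerate(a_digits):        # partial products A_j * B_i
            P ^= clmul(Aj, Bi) << (j * d)
        C = (reduce(C) << d) ^ P                 # Eq. (5)
    return reduce(C)                             # final reduction


if __name__ == "__main__":
    m, k = 233, 74
    A = (1 << 232) | 0b101
    B = (1 << 200) | 0b11
    results = {d: digit_digit_mul(A, B, m, k, d) for d in (8, 16, 32, 74)}
    assert len(set(results.values())) == 1       # digit size does not change the result
```

The result is independent of the digit size $d$, which only changes how many iterations are needed; this is the area/latency trade-off that the hardware design exploits.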

Figure 3: Hardware architecture $F_{2^m}$. (Datapath labels in the figure: BRAM-F, BRAM-A, BRAM-B, BRAM-C, MULT, ADD, the shift units << and >>, carry, and the signals $C_d^{<i>}$, $P_j^{<i>}$, $R_j^{<i>}$, $S_j^{<i>}$, with $d$-bit and $2d$-bit buses.)

Memory $Z_1$ thus now acts as a writing memory. The $F_{2^m}$ multiplier delivers a result at each stage of the point addition while, at the same time, it processes the input digits. So, careful management of the memory is required to avoid the latency of moving data between the result and input parameter memories. This requirement arises because the result of the field multiplier in an earlier stage becomes the input parameter of later stages.

In the rest of the point addition computation, memories alternate their functionality following the switching strategy of read/write memories. At the end, the final result $X_3$, $Z_3$ must be in a memory that is used in the next iteration at line 6 of Algorithm 1, so these values will reside in one of the four available memories, and the input parameters of the next iteration in the main loop of Algorithm 1 are adjusted accordingly. Memories associated with points $P_1(X_1, Z_1)$ and $P_2(X_2, Z_2)$ are overwritten with new partial results coming from the Madd and Mdouble functions.

At line 8 (or also at line 10) in the main loop of Algorithm 1, the memories storing $P_1(X_1, Z_1)$ and $P_2(X_2, Z_2)$ are read memories, and the result is finally stored in memories $P_{11}(X11, Z11)$ and $P_{22}(X22, Z22)$ (see Figure 4). In the next iteration, $P_{11}(X11, Z11)$ and $P_{22}(X22, Z22)$ become the $P_1$ and $P_2$ input parameters, and the corresponding memories $P_1(X_1, Z_1)$ and $P_2(X_2, Z_2)$ become the storage for the result of the final point addition. So, at the curve level
algorithm, the memories are also interchanged in their functionality and properly mapped to the memories for the final results in Figure 4. An extra BRAM is required to store the scalar k.

The building blocks to compute kP as described in Figure 4 are those for field arithmetic operations: addition, multiplication, squaring, and inversion over $F_{2^m}$. Squaring is considered easier than multiplication. However, since in this work operands are stored in BRAMs and are read and written one digit at a time, it is difficult to take advantage of optimized algorithms such as the fast reduction algorithm proposed by NIST that is commonly used in squaring. So, to save hardware resources, this work uses one $F_{2^m}$ multiplication core to compute squarings. Reusing the multiplier saves area but increases latency. Also, $F_{2^m}$ inversion is computed with the Itoh-Tsujii algorithm by means of multiplications, squarings, and additions in $F_{2^m}$.

At each iteration of Algorithm 1, the Madd and Mdouble operations can be computed in parallel since there is no data dependency between them. In this work, we propose to use one $F_{2^m}$ multiplier in Madd and another in Mdouble to take advantage of this parallelism. In the dataflow of each point operation, the $F_{2^m}$ multiplier is reused. In addition to the multipliers, one $F_{2^m}$ adder is also required. The same adder can be used in both the Madd and Mdouble operations since it is required at different times in each operation.

Although more than one $F_{2^m}$ multiplier per operation could be added to speed up the kP computation, that approach results in an extra cost in hardware resources, not only because of the area required by the $F_{2^m}$ multiplier but also because of the increased complexity of the control module and the additional multiplexers needed to manage input/output operands to the $F_{2^m}$ cores.

The entire kP dataflow is managed by a control unit that drives the memory blocks for word-based reading and writing and also commands the $F_{2^m}$ cores (multipliers and adder). The control module waits until each partial multiplication/addition has finished and starts the following required operations with the correct BRAMs as input sources.

Figure 4: Proposed schedule for point addition/double in $E(F_{2^m})$. For each step, the figure gives the $F_{2^m}$ operation and the memory that receives its result:

Step   Madd ($P_3 = P_1 + P_2$; inputs X1, Z1, X2, Z2)   Mdouble ($P_2 = 2P_1$; inputs X1, Z1)
1      × → X11                                            × → Z22
2      × → Z11                                            × → X22
3      × → Z1                                             × → Z2
4      + → X1                                             × → X2
5      × → Z11 ($Z_3$)                                    × → Z22
6      × (by x) → X1                                      × (by b) → X22
7      + → X11 ($X_3$)                                    + → X2 ($X_2$)

3. Results and Discussion

The proposed compact hardware ECC design was implemented over the binary fields $F_{2^{233}}$ and $F_{2^{409}}$, both defined by an irreducible trinomial. The elliptic curves used were sect233 and sect409, both recommended by NIST and other recognized organizations such as SECG. The target platform was the MicroZed FPGA board, which is recommended for IoT, with Xilinx Vivado HLx 2016.4 as the development tool.

The hardware architecture for scalar multiplication in $E(F_{2^m})$ was evaluated in a hardware-software codesign of the elliptic curve version of the Diffie-Hellman key exchange (ECDH). Consider two FPGA-based sensor nodes [36], A and B, that agree on an elliptic curve group G with generator P and order n. Then, each party selects a secret integer, for example, $r_A$ and $r_B$. Using a kP engine, each party computes its public value:

$$\underbrace{Q_A = r_A P}_{\text{Sensor } A} \quad \text{and} \quad \underbrace{Q_B = r_B P}_{\text{Sensor } B}. \tag{10}$$

Sensor A uses B's public value to compute $s_1 = r_A Q_B$, and sensor B uses A's public value to compute $s_2 = r_B Q_A$. Since $s_1$ is the same as $s_2$ ($s = s_1 = s_2 = r_A r_B P$), $s$ acts as a shared secret key between sensors A and B, so a secure channel can be established to transport data between the two devices in encrypted form (for example, using a lightweight block cipher). The shared secret can also be used to authenticate messages, using, for example, LightMAC. The main complexity in ECDH (as in other ECC-based cryptographic schemes) is the computation of kP.

3.1. Hardware/Software Codesign. Figure 5 shows the proposed hardware-software codesign for the scalar multiplier over $E(F_{2^m})$, suitable to be realized in an FPGA sensor node. The codesign was realized on the MicroZed board, and the implementation results are shown in Table 3. This is a representative final application in an IoT scenario (IIoT, MIoT) where sensor nodes are deployed using SoC technology: the kP scalar multiplication is executed in FPGA logic coupled to a master general-purpose processor that runs the rest of the application logic. The hardware-software codesign required 1809 slices of the FPGA embedded in the MicroZed board, running at 62.5 MHz.

Table 3 also compares the time to achieve a scalar multiplication under the hardware/software codesign versus a pure software implementation. This is done to highlight the gain in performance from a hardware approach for the most time-consuming operation in ECC, as in ECDH. For this, we used the MIRACL library for the software implementation of scalar multiplication on the Cortex-A9 of the Zynq, also available on the MicroZed board. In this case, we used the same implementation parameters: curve, finite field, size of the finite field, irreducible polynomial, projective coordinates, and the same Algorithm 1 for scalar multiplication.

The hardware-accelerated execution of kP computes an elliptic curve Diffie-Hellman key exchange in 4.13 ms.
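As a rough consistency check on these figures (not a measurement), the short Python sketch below assumes that each node performs two scalar multiplications per ECDH exchange, one for its public value and one for the shared secret. It takes the 128820-cycle count reported in Table 4 for $m = 233$ and digit size 32, the 62.5 MHz codesign clock, and the 4.13 ms and 70 ms times of Table 3. The two-kP cost model and the constant names are assumptions of this sketch.

```python
CYCLES_PER_KP = 128_820      # Table 4, proposed design, m = 233, digit size 32
CLOCK_HZ = 62.5e6            # codesign clock on the MicroZed (Table 3)
REPORTED_HW_MS = 4.13        # hardware/software codesign time (Table 3)
SOFTWARE_MS = 70.0           # MIRACL on the Cortex-A9 (Table 3)


def ecdh_time_ms(cycles_per_kp: int, clock_hz: float, kp_per_exchange: int = 2) -> float:
    """Time in milliseconds for the scalar multiplications of one ECDH exchange."""
    return kp_per_exchange * cycles_per_kp / clock_hz * 1e3


estimate = ecdh_time_ms(CYCLES_PER_KP, CLOCK_HZ)
speedup = SOFTWARE_MS / REPORTED_HW_MS

print(f"estimated ECDH time: {estimate:.2f} ms")    # 4.12 ms
print(f"speedup over software: {speedup:.1f}x")     # 16.9x
```

Under these assumptions the estimate lands within about 0.01 ms of the reported time, and the ratio of the two reported times is consistent with the stated speedup of about 17.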
Figure 5: Hardware-software codesign for an FPGA-based sensor node enabled with scalar multiplier engine for curve-based cryptography. (Blocks shown: processing system (PS) with the Cortex-A9 MPCore, memory controller, DDR3 memory, UART, and peripherals; programmable logic (PL) with the $E(F_{2^m})$ scalar multiplier; AXI4-Lite interface.)

Table 3: ECDH hw-sw codesign in the MicroZed board (z7010).

Size   k    Area (slices)   Freq. (MHz)   Time (ms)   MIRACL (sw) (ms)
233    32   1809            62.5          4.13        70

The pure software implementation on the MicroZed with the MIRACL library requires 70 ms for the same key exchange. Thus, our codesign is 17 times faster than the pure software implementation while requiring only 36% of the FPGA slices in the MicroZed, leaving the remaining standard logic (roughly two thirds of the FPGA) available for other application requirements in the sensor node. These results show that our design retains the advantages of a hardware implementation by improving performance while using fewer area resources.

3.2. Comparison with Other Similar FPGA Designs. Table 4 shows a comparison with state-of-the-art works on FPGA scalar multipliers in $E(F_{2^m})$. In this comparison, we use the same elliptic curves, finite fields and sizes, and the same irreducible polynomial. A fair comparison is very difficult to achieve because of the different FPGA technologies and implementation strategies used. It is not possible to compare all the works under the same criteria, since some hardware designs exploit embedded blocks such as DSPs or block RAMs (BRAMs) while others take advantage of the available slices/LUTs. However, this research is focused on lightweight implementations with the goal of using few standard logic resources. So, embedded memory blocks in the FPGAs are exploited to reduce standard reconfigurable logic (slices). The comparison in Table 4 is mainly in terms of the FPGA standard logic (slices) reported. Although efficiency and throughput are not the main aims of this research, they are used as reference metrics.

The results presented in [32] are for a digit-serial approach to multiplication and inversion over $F_{2^m}$, and squaring and addition over $F_{2^m}$ are computed fully with standard logic in only one clock cycle. Compared to our design, those results are almost ten times better in efficiency. However, our design uses considerably fewer area resources. For example, for a digit size of 8, 16, and 32, the required area is 442, 626, and 1170 slices, respectively. A hardware architecture for elliptic curve scalar multiplication over $E(F_{2^m})$, implemented for the NIST-recommended binary fields $F_{2^{233}}$ and $F_{2^{283}}$, is presented in [37]. That scalar multiplier requires 3016 and 4625 slices for the operand sizes 233 and 283, respectively. Compared to that design, our kP engine for $F_{2^{233}}$ requires 6.8 times fewer slices and has 2.2 times better efficiency (kbps/slice). The scalar multiplier over $E(F_{2^m})$ presented in [38] is better in efficiency than ours, but at a considerably higher cost in terms of area usage.

Table 4 shows that most of the works achieve better throughput/efficiency than our proposed hardware design. However, the main aim of this work is to save hardware resources (slices), and this is achieved by sacrificing throughput. According to the obtained results, despite the sacrifice in throughput, the proposed design achieves significantly better performance than software counterparts while using fewer resources than similar FPGA designs. The reduction in area resources is a direct result of using a digit-by-digit computing approach in the layered structure of the kP engine, mainly determined by the $F_{2^m}$ multiplier and the strategy for reusing memory blocks during the iterative processing of operands.

In Figure 6, we show graphically how our design uses considerably fewer standard logic resources of the FPGA, leaving more logic for other tasks in the upper application layers. In that figure, FPGA resource usage is compared against the works that use FPGA implementation technology, a digit-serial approach, and comparable security levels. Note from this figure that our design is scalable in terms of area because a greater security level only impacts latency. This property is only kept with the digit-digit computing approach.

4. Conclusion

We have detailed the design and evaluation of a compact FPGA-based ECC hardware design, well suited for Internet of Things applications, specifically for the Industrial Internet of Things (IIoT) or Internet of Medical Things (MIoT), where sensor nodes can be realized with FPGA technology. The key contributions include a novel digit-digit algorithm for multiplication over $F_{2^m}$ optimized for fields defined by trinomials and its corresponding compact hardware architecture, which is the main core for constructing a compact hardware design for computing scalar multiplications in binary elliptic curves over $F_{2^m}$ generated by trinomials, such as the ones recommended by NIST for practical use. We proposed a novel rescheduling of $F_{2^m}$ operations in the Lopez-Dahab Montgomery algorithm for elliptic curve scalar multiplication that can be computed with only two multipliers and one adder in a digit-digit fashion, thus reducing area requirements for the hardware design. For correctness, we validated our design through a hardware-software codesign.
Table 4: Comparison of scalar multiplication over $E(F_{2^m})$.

Work                   FPGA    m     Cycles    Slices   Freq. (MHz)   Throughput (kbps)   Efficiency (kbps/slice)
Prop. (k = 8)          z7010   233   1553782   442      190.04        28.49               0.064
Prop. (k = 16)         z7010   233   408547    626      149.20        85.09               0.136
Prop. (k = 32)         z7010   233   128820    1170     135.31        244.75              0.209
Prop. (k = 8)          z7010   409   7504232   453      190.94        10.40               0.023
Prop. (k = 16)         z7010   409   1926426   653      154.44        32.78               0.050
Prop. (k = 32)         z7010   409   511493    1183     132.59        106.02              0.090
[32] (g = 16, d = 2)   v5      233   8193      3939     263.15        7483.69             1.899
[32] (g = 8, d = 1)    v5      409   45513     5395     181.81        1633.82             0.030
[37]                   k7      233   679776    3016     255.66        87.63               0.029
[37]                   k7      283   1395312   4625     251.98        51.10               0.011
[38]                   v7      233   5929      2647     370.00        14540.39            5.498
[38]                   v7      409   10354     6888     316.00        12482.51            1.812
[41]                   v5      163   1396      3513     147.00        17.16               0.004
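For the rows of the proposed design, the throughput and efficiency columns appear to follow from the other columns as throughput $= m \cdot f / \text{cycles}$ and efficiency $=$ throughput / slices. This relation is inferred from the table values rather than stated in the text; the short Python sketch below recomputes three rows under that assumption, with illustrative function names.

```python
# (label, m, cycles, slices, freq_MHz, reported_kbps, reported_kbps_per_slice)
ROWS = [
    ("Prop. (k = 8)", 233, 1_553_782, 442, 190.04, 28.49, 0.064),
    ("Prop. (k = 16)", 233, 408_547, 626, 149.20, 85.09, 0.136),
    ("Prop. (k = 32)", 233, 128_820, 1170, 135.31, 244.75, 0.209),
]


def throughput_kbps(m: int, cycles: int, freq_mhz: float) -> float:
    """Bits of scalar processed per second, in kbps: m * f / cycles."""
    return m * freq_mhz * 1e6 / cycles / 1e3


def efficiency(kbps: float, slices: int) -> float:
    """Throughput per slice, in kbps/slice."""
    return kbps / slices


for label, m, cycles, slices, freq, rep_kbps, rep_eff in ROWS:
    kbps = throughput_kbps(m, cycles, freq)
    eff = efficiency(kbps, slices)
    print(f"{label}: {kbps:.2f} kbps (reported {rep_kbps}), "
          f"{eff:.3f} kbps/slice (reported {rep_eff})")
```

The recomputed values agree with the reported ones to within rounding of the clock frequency, which suggests that latency per scalar multiplication can be read directly as cycles divided by frequency.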

Figure 6: Comparison of area usage for the proposed kP engine. (Bar chart; vertical axis: FPGA slices (standard logic), 0 to 8000; horizontal axis: security level (field size), 233-bit and 409-bit; series: this work, [32], [35].)

The codesign ran on the IoT MicroZed Xilinx FPGA board, executing an instance of the elliptic curve Diffie-Hellman key exchange protocol (ECDH), a common and crucial operation in networks of secure IoT sensor nodes. To our knowledge, the proposed hardware ECC architecture requires fewer standard hardware resources (slices) in FPGAs than other works reported to date, while taking advantage of memory blocks already available in modern FPGAs. Furthermore, despite being a compact hardware architecture, it was demonstrated that a considerable acceleration of a representative curve-based cryptographic protocol is obtained compared to a pure software implementation.

Using the proposed ECC accelerator, further work is planned to evaluate the security service costs when implementing ECC-based cryptographic protocols such as digital envelopes and digital signatures in real application scenarios of IoT, IIoT, and MIoT.

Data Availability

Raw data were generated at INAOE Computer Science Department and at Cinvestav Tamaulipas. Derived data supporting the findings of this study are available from the corresponding author MMS on request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the Fondo Sectorial de Investigación para la Educación, Ciencia Básica SEP-CONACyT,
project number 281565. Also, the research was partially funded by project PN-2017-5814, Conacyt Problemas Nacionales.

References

[1] A. Mosenia and N. K. Jha, "A comprehensive study of security of internet-of-things," IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 4, pp. 586–602, 2017.
[2] L. Z. Cai and M. F. Zuhairi, "Security challenges for open embedded systems," in Engineering Technology and Technopreneurship (ICE2T), 2017 International Conference on, pp. 1–6, Kuala Lumpur, Malaysia, 2017.
[3] D. Schinianakis, "Alternative security options in the 5g and iot era," IEEE Circuits and Systems Magazine, vol. 17, no. 4, pp. 6–28, 2017.
[4] T. Eisenbarth, S. Kumar, C. Paar, A. Poschmann, and L. Uhsadel, "A survey of lightweight-cryptography implementations," IEEE Design & Test of Computers, vol. 24, no. 6, pp. 522–533, 2007.
[5] C. Manifavas, G. Hatzivasilis, K. Fysarakis, and K. Rantos, "Lightweight cryptography for embedded systems – a comparative analysis," Data Privacy Management and Autonomous Spontaneous Security, 2014, pp. 333–349, Springer, Berlin Heidelberg, 2014.
[6] C. A. Lara-Nino, A. Diaz-Perez, and M. Morales-Sandoval, "Elliptic curve lightweight cryptography: a survey," IEEE Access, vol. 6, pp. 72514–72550, 2018.
[7] P. Yalla and J. P. Kaps, "Lightweight cryptography for fpgas," in 2009 International Conference on Reconfigurable Computing and FPGAs, pp. 225–230, Quintana Roo, Mexico, 2009.
[8] A. Diaz-Perez, M. Morales-Sandoval, and C. Lara-Nino, "Use of FPGAs for enabling security and privacy in the IoT: features and case studies," in FPGA Algorithms and Applications for the Internet of Things, chapter 2, P. Sharma and R. Nair, Eds., pp. 26–45, IGI Global, 2020.
[9] G. Xu, Z. Chen, and P. Schaumont, "Energy and performance evaluation of an fpga-based soc platform with aes and present coprocessors," in Embedded Computer Systems: Architectures, Modeling, and Simulation, M. Berekovic, N. Dimopoulos, and S. Wong, Eds., pp. 106–115, Springer, Berlin, Heidelberg, 2008.
[10] H. Abdelkrim, S. Ben Othman, and S. Ben Saoud, "Reconfigurable soc fpga based: Overview and trends," in 2017 International Conference on Advanced Systems and Electric Technologies, pp. 378–383, Hammamet, Tunisia, 2017.
[11] X. Zhang, A. Ramachandran, C. Zhuge et al., "Machine learning on fpgas to face the iot revolution," in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 894–901, Irvine, CA, USA, 2017.
[12] C. Hao, X. Zhang, Y. Li et al., "Fpga/dnn co-design: an efficient design methodology for iot intelligence on the edge," in Proceedings of the 56th Annual Design Automation Conference 2019, DAC '19, New York, NY, USA, 2019.
[13] S. Wang, Y. Hou, F. Gao, and X. Ji, "A novel iot access architecture for vehicle monitoring system," in 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT), pp. 639–642, Reston, VA, USA, 2016.
[14] B. Zhou, M. Egele, and A. Joshi, "High-performance low-energy implementation of cryptographic algorithms on a programmable soc for iot devices," in 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, 2017.
[15] Xilinx Inc, Microzed industrial iot starter kit, April 2020, http://zedboard.org/product/microzed-iiot-starter-kit.
[16] V. S. Miller, "Use of elliptic curves in cryptography," Advances in Cryptology - CRYPTO '85 Proceedings, H. C. Williams, Ed., pp. 417–426, Springer, Berlin, Heidelberg, 1986.
[17] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.
[18] P. Szczechowiak and M. Collier, "Tinyibe: identity-based encryption for heterogeneous sensor networks," in 2009 International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), pp. 319–354, Melbourne, VIC, Australia, 2009.
[19] N. Oualha and K. T. Nguyen, "Lightweight attribute-based encryption for the internet of things," in 2016 25th International Conference on Computer Communication and Networks (ICCCN), pp. 1–6, Waikoloa, HI, USA, 2016.
[20] Z. U. A. Khan and M. Benaissa, "High-speed and low-latency ecc processor implementation over GF(2^m) on fpga," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 165–176, 2017.
[21] J. López and R. Dahab, "Fast multiplication on elliptic curves over GF(2^m) without precomputation," Proceedings of the First International Workshop on Cryptographic Hardware and Embedded Systems, CHES'99, pp. 316–327, Springer-Verlag, London, UK, 1999.
[22] D. Karaklajic, J. Fan, J. Schmidt, and I. Verbauwhede, "Low-cost fault detection method for ecc using montgomery powering ladder," 2011 Design, Automation Test in Europe, pp. 1–6, 2011.
[23] D. Dinu, Y. Le Corre, D. Khovratovich, L. Perrin, J. Großschädl, and A. Biryukov, "Triathlon of lightweight block ciphers for the internet of things," Journal of Cryptographic Engineering, vol. 9, pp. 1–20, 2015.
[24] J.-L. Beuchat, T. Miyoshi, Y. Oyama, and E. Okamoto, "Multiplication over F_{p^m} on fpga: A survey," in Reconfigurable Computing: Architectures, Tools and Applications, P. C. Diniz, E. Marques, K. Bertels, M. M. Fernandes, and J. M. P. Cardoso, Eds., pp. 214–225, Springer, Berlin, Heidelberg, 2007.
[25] G. Bertoni, J. Guajardo, S. Kumar, G. Orlando, C. Paar, and T. Wollinger, "Efficient GF(p^m) arithmetic architectures for cryptographic applications," in Topics in Cryptology - CT-RSA 2003, M. Joye, Ed., pp. 158–175, Springer, Berlin, Heidelberg, 2003.
[26] C. Shu, S. Kwon, and K. Gaj, "Fpga accelerated tate pairing based cryptosystems over binary fields," 2006 IEEE International Conference on Field Programmable Technology, 2006, pp. 173–180, Bangkok, Thailand, 2006.
[27] L. Song and K. K. Parhi, "Low-energy digit-serial/parallel finite field multipliers," Journal of VLSI signal processing systems for signal, image and video technology, vol. 19, no. 2, pp. 149–166, 1998.
[28] D. Pamula and E. Hrynkiewicz, "Area-speed efficient modular architecture for GF(2^m) multipliers dedicated for cryptographic applications," in 2013 IEEE 16th International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS), pp. 30–35, Karlovy Vary, Czech Republic, 2013.
[29] National Institute of Standards and Technology, Digital Signature Standard (DSS), Appendix D, Recommended Elliptic Curves for Federal Government Use, 1999, https://csrc.nist.gov/csrc/media/publications/fips/186/3/archive/2009-06-25/documents/fips_186-3.pdf.
[30] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
[31] D. Pamula, Arithmetic operators on GF(2^m) for cryptographic applications: performance - power consumption - security tradeoffs, [Ph.D. thesis], Université Rennes 1, 2012, https://tel.archivesouvertes.fr/tel-00767537.
[32] G. D. Sutter, J. Deschamps, and J. L. Imana, "Efficient elliptic curve point multiplication using digit-serial binary field operations," IEEE Transactions on Industrial Electronics, vol. 60, no. 1, pp. 217–225, 2013.
[33] A. P. Fournaris and O. Koufopavlou, "Low area elliptic curve arithmetic unit," in 2009 IEEE International Symposium on Circuits and Systems, pp. 1397–1400, Taipei, Taiwan, 2009.
[34] M. Morales-Sandoval and A. Diaz-Perez, "Area/performance evaluation of digit-digit GF(2^k) multipliers on fpgas," in 23rd International Conference on Field programmable Logic and Applications, Porto, Portugal, 2013.
[35] I. San and A. Nuray, "Improving the computational efficiency of modular operations for embedded systems," Journal of Systems Architecture, vol. 60, no. 5, pp. 440–451, 2014.
[36] B. Bengherbia, M. O. Zmirli, A. Toubal, and A. Guessoum, "Fpga-based wireless sensor nodes for vibration monitoring system and fault diagnosis," Measurement, vol. 101, pp. 81–92, 2017.
[37] M. S. Hossain, E. Saeedi, and Y. Kong, "High-speed, area-efficient, fpga-based elliptic curve cryptographic processor over nist binary fields," in 2015 IEEE International Conference on Data Science and Data Intensive Systems, pp. 175–181, Sydney, NSW, Australia, 2015.
[38] Z. Khan and M. Benaissa, "Throughput/area-efficient ecc processor using montgomery point multiplication on fpga," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 11, pp. 1078–1082, 2015.
[39] W. Wei, L. Zhang, and C. Chang, "A modular design of elliptic-curve point multiplication for resource constrained devices," in 2014 International Symposium on Integrated Circuits (ISIC), pp. 596–599, Singapore, Singapore, 2014.
[40] M. N. Hassan and M. Benaissa, "Low area-scalable hardware/software co-design for elliptic curve cryptography," in 2009 3rd International Conference on New Technologies, Mobility and Security, pp. 1–5, Cairo, Egypt, 2009.
[41] S. S. Roy, C. Rebeiro, and D. Mukhopadhyay, "Theoretical modeling of elliptic curve scalar multiplier on lut-based fpgas for area and speed," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, no. 5, pp. 901–909, 2013.
