FPGA Implementation of CORDIC Algorithms For Sine and Cosine Generator
Antonius P. Renardy∗ , Nur Ahmadi, Ashbir A. Fadila, Naufal Shidqi, Trio Adiono†
Department of Electrical Engineering, School of Electrical Engineering and Informatics
Bandung Institute of Technology, Jl. Ganesha No. 10 Bandung, 40132, Indonesia
Email: ∗ [email protected], † [email protected]
Abstract—Trigonometric-related calculations, which are widely found in a broad range of applications, can be performed by using the COordinate Rotation DIgital Computer (CORDIC) algorithm. CORDIC is often utilized in the absence of a hardware multiplier since this algorithm requires only addition, subtraction, bit shifting, and a lookup table. This paper provides an implementation of the conventional CORDIC algorithm with a pipelined architecture and of the Virtually Scaling-Free Adaptive (VSFA) CORDIC. All designs are implemented in Verilog and synthesized using Altera Quartus II with the FPGA DE2 as the target board. The pipelined CORDIC consumes 1,103 logic elements, with 33.32 ns latency and 420.17 MHz maximum frequency, while the VSFA CORDIC utilizes 2,109 logic elements, with 34.96 ns latency and 343.29 MHz maximum frequency. Both designs are used to generate sine and cosine waves between −π and π, resulting in a maximum error of $8.095 \times 2^{-13}$ for the pipelined CORDIC and $9.183 \times 2^{-13}$ for the VSFA CORDIC. Based on a performance comparison in terms of area multiplied by delay (A × T), our pipelined CORDIC is superior to the other compared designs.

Keywords—FPGA, CORDIC, VSFA, Sine and Cosine Generator.

I. INTRODUCTION

Sine and cosine are basic functions that appear in many complex functions used in a broad range of applications such as digital signal processing, wireless communication, biometrics, and robotics [1]. Several methods exist to build hardware that performs sine and cosine calculations: Lookup Table (LUT), Maclaurin series, and CORDIC.

The table lookup method utilizes blocks of memory that store the values of the function for every possible input argument. This method is relatively simple to employ since no specific calculations are required; it relies only on the values stored in the table. However, the number of table entries required rises exponentially with the number of bits used to represent the argument [2], which results in a larger area for hardware implementation.

The Maclaurin series represents a function as an infinite sum of terms built from its derivatives, i.e., a Taylor series whose derivatives are evaluated at zero. In practice, the number of terms in the series is determined by the required accuracy [3]. For an application that requires a maximum error of $2^{-8} = 3.90625 \times 10^{-3}$, the number of terms required is N = 9, which corresponds to a maximum error of $1.1309 \times 10^{-3}$. Consequently, nine exponentiations, eight additions, and nine factorial operations need to be carried out to produce the function. For hardware implementation, the values of the factorial operations can be stored in a lookup table since they are fixed regardless of the input argument of the function. However, this will also consume a larger area as the number of bits increases.

COordinate Rotation DIgital Computer (CORDIC), invented by J. E. Volder in 1959 [4], is an algorithm that can be used to perform trigonometric-related calculations. By changing some parameters, CORDIC can also compute a wide variety of elementary functions such as exponentials, logarithms, and square roots [5]. CORDIC is simple and efficient since this algorithm requires only addition, subtraction, bit shifting, and table lookup. This leads to an efficient and low-cost implementation that is generally faster than most hardware approaches. Several architectures exist to meet the requirements and constraints of different applications. The iterative architecture provides a hardware implementation of minimum size, with throughput as the tradeoff, while the parallel and pipelined CORDIC offer high-speed, high-throughput computation.

This paper provides a prototype implementation of the CORDIC algorithm with a pipelined architecture and of the Virtually Scaling-Free Adaptive (VSFA) CORDIC. Both designs are used for sine and cosine calculation.

The rest of this paper is organized as follows. Section II explains the conventional CORDIC and Virtually Scaling-Free Adaptive (VSFA) CORDIC algorithms. Section III describes the implementation of both CORDIC algorithms. Section IV shows the simulation results and performance evaluation. Finally, the conclusions are provided in Section V.

II. CORDIC ALGORITHM

A. Conventional CORDIC

The conventional CORDIC algorithm is derived from the rotation of a vector $[x_0 \; y_0]^T$ in Cartesian coordinates, which can be expressed as (1), where $[x \; y]^T$ is the final vector produced after rotation and θ is the target angle of rotation. By factoring out the cosine function, we obtain (2).

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{1}$$

$$\begin{bmatrix} x \\ y \end{bmatrix} = \cos\theta \begin{bmatrix} 1 & \tan\theta \\ -\tan\theta & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{2}$$
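The step from (1) to (2) can be checked numerically. The sketch below (plain Python floats; all function names are hypothetical) confirms that the two forms agree and shows why the factorization matters. Once tan θ is restricted to $2^{-i}$, the bracketed matrix of (2) reduces to an arithmetic right shift and an add or subtract on integer data, and the cos θ factor can be applied separately.

```python
import math


def rotate_eq1(x0, y0, theta):
    """Rotation in the form of (1)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x0 + s * y0, -s * x0 + c * y0


def rotate_eq2(x0, y0, theta):
    """Same rotation with cos(theta) factored out, as in (2)."""
    t = math.tan(theta)
    c = math.cos(theta)
    return c * (x0 + t * y0), c * (-t * x0 + y0)


def shift_add_step(x, y, i):
    """Bracketed matrix of (2) for tan(theta) = 2**-i on integer data:
    one arithmetic right shift and one add/subtract per component.
    The cos(theta) factor is NOT applied here."""
    return x + (y >> i), y - (x >> i)


if __name__ == "__main__":
    print(rotate_eq1(1.0, 0.5, 0.3), rotate_eq2(1.0, 0.5, 0.3))
    # One micro-rotation of the integer vector (8192, 0) by atan(2**-3)
    xs, ys = shift_add_step(8192, 0, 3)
    k = math.cos(math.atan(2 ** -3))      # the deferred cos factor
    print((k * xs, k * ys), rotate_eq1(8192, 0, math.atan(2 ** -3)))
```

Applying the deferred factor k to the shift-and-add result gives the same vector as (1), which is the property the iteration described next relies on.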
CORDIC employs iteration in which the angle θ is expressed as a summation of elementary rotation angles $\alpha_i$, as defined in (3), where b is the bit precision of the angle argument and $\sigma_i$ is the direction of rotation, which can only be −1 or 1. The elementary rotation angle $\alpha_i$ is restricted to the values shown in (4).

$$\theta = \sum_{i=0}^{b-1} \sigma_i \alpha_i \tag{3}$$

$$\alpha_i = \tan^{-1}\left(2^{-i}\right) \tag{4}$$

The final vector produced by the iteration can be expressed as (5) by substituting (3) into (2) with the definition in (4). The remaining angle after each iteration is given by (6).

$$\begin{bmatrix} x \\ y \end{bmatrix} = \prod_{i=0}^{b-1} \cos(\alpha_i) \begin{bmatrix} 1 & \sigma_i 2^{-i} \\ -\sigma_i 2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{5}$$

$$z_{i+1} = z_i - \sigma_i \tan^{-1}\left(2^{-i}\right) \tag{6}$$

In rotation mode, $\sigma_i$ takes the sign of $z_i$, so each iteration drives the remaining angle $z_i$ toward zero, while in vectoring mode $\sigma_i$ is chosen so that the ordinate $y_i$ is driven to zero. This is achieved by rotating the vector either clockwise or counterclockwise at each iteration. By initializing the three CORDIC parameters $(x_0, y_0, z_0)$, different outputs can be produced. For instance, by setting $x_0 = 1$, $y_0 = 0$, and $z_0 = \pi/2$, each iteration brings $x_i$ and $y_i$ closer to cos(π/2) and sin(π/2), respectively, once the constant product of $\cos(\alpha_i)$ in (5) is applied; in hardware this product is usually folded into $x_0$ instead.

B. Virtually Scaling-Free Adaptive CORDIC

Unlike the conventional CORDIC, which relies on rotation in both directions, the Virtually Scaling-Free Adaptive CORDIC rotates in only one direction, either clockwise or counterclockwise. The elementary rotation angle is chosen small enough that

$$\sin(\alpha_i) \cong \alpha_i = 2^{-i} \tag{7}$$

This poses another condition for the algorithm. Consider the expansions of the sine and cosine functions as polynomial series:

$$\sin(\alpha_i) = \alpha_i - \frac{\alpha_i^3}{3!} + \frac{\alpha_i^5}{5!} - \cdots \tag{8}$$

$$\cos(\alpha_i) = 1 - \frac{\alpha_i^2}{2!} + \frac{\alpha_i^4}{4!} - \cdots \tag{9}$$

Using the approximation $\alpha_i = 2^{-i}$, these can be rewritten as

$$\sin(\alpha_i) = 2^{-i} - \frac{2^{-3i}}{3!} + \frac{2^{-5i}}{5!} - \cdots \tag{10}$$

$$\cos(\alpha_i) = 1 - \frac{2^{-2i}}{2!} + \frac{2^{-4i}}{4!} - \cdots \tag{11}$$

To make (10) and (11) comply with the approximation in (7), only the first term of the sine expansion is kept, together with the first and second terms of the cosine expansion; all other terms must be negligible. From the largest neglected term of the two functions, it can be inferred that

$$\frac{2^{-3i}}{3!} = 2^{-(3i + \log_2 6)} = 2^{-(3i + 2.585)} \leq 2^{-b} \tag{12}$$

since right shifting by more than b − 1 bits results in zero, assuming the implementation is done on b-bit data. Equation (12) also means that

$$i \geq p = \left\lceil \frac{b - 2.585}{3} \right\rceil \tag{13}$$

with the upper limit b − 1 for the same reason. Therefore, (1) can be written as (14). It is clear that, unlike (5), no scaling coefficient appears, hence the name scaling-free. Note that (14) can be realized in the same manner as (5), differing only in an additional adder and shifter.

$$\begin{bmatrix} x \\ y \end{bmatrix} = \prod_{i=p}^{b-1} \begin{bmatrix} 1 - 2^{-(2i+1)} & 2^{-i} \\ -2^{-i} & 1 - 2^{-(2i+1)} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{14}$$

The largest angle that can be computed through this method is only ±7.16° (for p = 4, the sum of all elementary angles is below $2^{-3}$ rad), far less than that of the conventional CORDIC (99.9°) [6]. Thus, domain folding has to be employed in order to cover the full circle. The method divides each quadrant equally into four domains, each with an angular span of π/8. Consider the first quadrant, in which the target angle can lie in one of four domains: A ([0, π/8)), B ([π/8, π/4)), C ([π/4, 3π/8)), and D ([3π/8, π/2)). We can express θ in terms of another angle φ:

$$\theta = \begin{cases} \phi & \text{in domain A} \\ \pi/4 - \phi & \text{in domain B} \\ \pi/4 + \phi & \text{in domain C} \\ \pi/2 - \phi & \text{in domain D} \end{cases} \tag{15}$$

By substituting (15) into (1), the CORDIC operation on the input vector $[x_0 \; y_0]^T$ in each domain can be expressed as:

$$\begin{bmatrix} x_{fA} \\ y_{fA} \end{bmatrix} = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{16}$$

$$\begin{bmatrix} x_{fB} \\ y_{fB} \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} \cos\phi + \sin\phi & \cos\phi - \sin\phi \\ -(\cos\phi - \sin\phi) & \cos\phi + \sin\phi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{17}$$

$$\begin{bmatrix} x_{fC} \\ y_{fC} \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} \cos\phi - \sin\phi & \cos\phi + \sin\phi \\ -(\cos\phi + \sin\phi) & \cos\phi - \sin\phi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{18}$$

$$\begin{bmatrix} x_{fD} \\ y_{fD} \end{bmatrix} = \begin{bmatrix} \sin\phi & \cos\phi \\ -\cos\phi & \sin\phi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{19}$$

where $[x_{f*} \; y_{f*}]^T$ denotes the final vector from the CORDIC operations when the target angle lies in the respective domain. Thus, using the expressions of CORDIC rotation for the positive and negative angles φ and −φ,

$$\begin{bmatrix} x_+ \\ y_+ \end{bmatrix} = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{20}$$

$$\begin{bmatrix} x_- \\ y_- \end{bmatrix} = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{21}$$

Equations (16)–(19) can be expressed as:

$$\begin{aligned} x_{fA} &= x_- \\ y_{fA} &= y_- \\ \phi &= \theta \end{aligned} \tag{22}$$

$$\begin{aligned} x_{fB} &= \tfrac{1}{\sqrt{2}}\left(x_+ + y_+\right) \\ y_{fB} &= \tfrac{1}{\sqrt{2}}\left(-x_+ + y_+\right) \\ \phi &= \pi/4 - \theta \end{aligned} \tag{23}$$
$$\begin{aligned} x_{fC} &= \tfrac{1}{\sqrt{2}}\left(x_- + y_-\right) \\ y_{fC} &= \tfrac{1}{\sqrt{2}}\left(-x_- + y_-\right) \\ \phi &= \theta - \pi/4 \end{aligned} \tag{24}$$

$$\begin{aligned} x_{fD} &= y_+ \\ y_{fD} &= -x_+ \\ \phi &= \pi/2 - \theta \end{aligned} \tag{25}$$

This shows that the CORDIC operations in domains B, C, and D can be obtained from the CORDIC operation in domain A, a rotation by ±φ with φ no larger than π/8; in other words, domains B, C, and D are folded back to domain A. As a consequence of the domain-folding operation, an additional scale factor of $1/\sqrt{2}$ appears for domains B and C. Additional hardware is therefore required to implement this scale factor, which can be realized by simple shift-and-add operations. The domain-folding expressions for the remaining domains of the full circle are listed below (Cos and Sin denote the final cosine and sine outputs).

$$\begin{array}{l|cc}
\text{Angle range} & \text{Cos} & \text{Sin} \\ \hline
{[\pi/8,\ \pi/4)} & x_{fB} & y_{fB} \\
{[\pi/4,\ 3\pi/8)} & x_{fC} & y_{fC} \\
{[3\pi/8,\ \pi/2)} & x_{fD} & y_{fD} \\
{[\pi/2,\ 5\pi/8)} & y_{fA} & -x_{fA} \\
{[5\pi/8,\ 3\pi/4)} & y_{fB} & -x_{fB} \\
{[3\pi/4,\ 7\pi/8)} & y_{fC} & -x_{fC} \\
{[7\pi/8,\ \pi)} & y_{fD} & -x_{fD} \\
{[\pi,\ 9\pi/8)} & -x_{fA} & -y_{fA} \\
{[9\pi/8,\ 5\pi/4)} & -x_{fB} & -y_{fB} \\
{[5\pi/4,\ 11\pi/8)} & -x_{fC} & -y_{fC} \\
{[11\pi/8,\ 3\pi/2)} & -x_{fD} & -y_{fD} \\
{[3\pi/2,\ 13\pi/8)} & -y_{fA} & x_{fA} \\
{[13\pi/8,\ 7\pi/4)} & -y_{fB} & x_{fB} \\
{[7\pi/4,\ 15\pi/8)} & -y_{fC} & x_{fC} \\
{[15\pi/8,\ 2\pi)} & -y_{fD} & x_{fD}
\end{array}$$

B. Pipelined CORDIC

The top level module of the pipelined CORDIC is shown in Figure 1. The pipelined CORDIC has three inputs: Clk for the clock, Rst for reset (active low), and Angle for the target angle. It also has two outputs, Cos and Sin, which represent the cosine and sine of the target angle, respectively. Two main blocks are available: Quadrant Detector and CORDIC Core.

[Figure 1: top level module of pipelined CORDIC (inputs Angle, Clk, Rst; CORDIC Core followed by an Output Register; outputs Cos and Sin)]

To calculate the cosine and sine values, the input angle is first processed in the Quadrant Detector. This block has three inputs, which correspond to the input arguments of the CORDIC rotation expressions (x, y, z). The x and y ports of this block are set to 0x136E and 0x0000. The value 0x136E is approximately 0.6072 in a format where 0x2000 represents 1.0, i.e., the constant product of $\cos(\alpha_i)$ in (5); presetting x this way yields the final vector directly in the form $[\cos\theta \; \sin\theta]^T$ without additional post-processing. The output of this block is the appropriate x, y, and z arguments with respect to the quadrant of the target angle, as shown in Figure 2.

Fig. 2. Quadrant detector in pipelined CORDIC (selection conditions: (Zi < −π/2) || (Zi > π/2) and Zi > π/2)
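When checking a pipelined core like this against simulation, a bit-level software model is convenient. The sketch below is one such model under explicit assumptions, not the authors' Verilog. It assumes 13 fractional bits (so 0x2000 = 1.0 and π/2 = 0x3244, as in Table V), arctangent constants rounded to that format, arithmetic right shifts, and the counterclockwise rotation sense (so y converges to +sin θ). It also uses a hypothetical quadrant detector that pre-rotates the initial vector (0x136E, 0) by ±π/2 when |z| > π/2. Each loop pass stands for one of the 14 pipeline stages; the pipeline registers themselves are not modeled, and all names are illustrative.

```python
import math

FRAC_BITS = 13                       # assumed: 0x2000 represents 1.0
ONE = 1 << FRAC_BITS
N_STAGES = 14                        # one loop pass per pipeline stage, i = 0..13
X_INIT = 0x136E                      # about 0.6072: prod(cos(alpha_i)) of (5)
HALF_PI = round(math.pi / 2 * ONE)   # 12868 == 0x3244
# Hypothetical arctan constants, rounded to the same fixed-point format
ATAN = [round(math.atan(2.0 ** -i) * ONE) for i in range(N_STAGES)]


def quadrant_detect(z):
    """Map z in [-pi, pi] to [-pi/2, pi/2] by an exact +/-90 degree
    pre-rotation of the initial vector (X_INIT, 0)."""
    if z > HALF_PI:
        return 0, X_INIT, z - HALF_PI       # start from (0, K)
    if z < -HALF_PI:
        return 0, -X_INIT, z + HALF_PI      # start from (0, -K)
    return X_INIT, 0, z


def cordic_core(x, y, z):
    """Rotation-mode CORDIC. Python's >> on negative ints is an arithmetic
    (floor) shift, like Verilog >>> on signed values."""
    for i in range(N_STAGES):
        if z >= 0:   # remaining angle positive: rotate counterclockwise
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN[i]
        else:        # remaining angle negative: rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN[i]
    return x, y      # approximately (cos, sin) scaled by ONE


def cordic_cos_sin(angle):
    """angle: integer with FRAC_BITS fractional bits, in [-pi, pi]."""
    return cordic_core(*quadrant_detect(angle))


if __name__ == "__main__":
    for theta in (math.pi / 4, 3 * math.pi / 4, -2.0):
        a = round(theta * ONE)
        c, s = cordic_cos_sin(a)
        print(f"angle {a:+6d}: cos {c:+6d} (ideal {math.cos(a / ONE) * ONE:+9.2f}), "
              f"sin {s:+6d} (ideal {math.sin(a / ONE) * ONE:+9.2f})")
```

Because every stage uses a constant shift and a constant ATAN entry, unrolling the loop gives the stage-per-iteration structure described above. The model's outputs can then be compared against a simulation dump of the same input angles, provided the hardware uses the same rounding and rotation conventions.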
The CORDIC Core is the realization of CORDIC's difference equations in a pipelined architecture. There are 14 pipeline stages in the core, each of which consists of the structure shown in Figure 3. Each stage contains three adders/subtractors, two arithmetic right shifters, one direction block that controls the direction of rotation, and an inverse tangent constant. Since the shift amount in each stage is constant, each shifter can be implemented purely as wiring (a fixed reordering of bits with sign extension), with no logic.

[Figure 3: one pipeline stage of the CORDIC Core (inputs xi, yi, zi; two >>i shifters; a Direction Selector driven by the sign of zi; the constant arctan_i)]

The top level module has three inputs and two outputs, similar to the top level of the conventional CORDIC in the previous section. There are three main blocks: Quadrant and Domain Detection, a CORDIC Core with a region of convergence of [0, π/8), and Output Processing, as shown in Figure 4.

Fig. 4. Top level module of VSFA CORDIC (inputs Angle, Clk, Rst; the Quadrant & Domain Detector produces Xi, Yi, Phi, Quad, and Domain with widths of 16, 16, 12, 2, and 2 bits)

The Quadrant and Domain Detection block has three inputs: 16-bit wide X0 and Y0, and 17-bit wide Z0. The purpose of this block is to detect the quadrant and domain in which the target angle lies and subsequently apply domain folding to produce the modified target angle Phi. This block also generates two 2-bit wide signals, Quad and Domain, which are later used for the post-processing required by the domain-folding technique.

The CORDIC Core is also implemented in a pipelined architecture. The differences lie in the number of stages and in the circuitry of each stage, which are shown in Figure 5 and Figure 6. For the same reason as in the previous section, the shifter units can be replaced with a series of wires. The allowed values for the CORDIC iteration are i = 4, 5, 6, ..., 15, as given by (13). The iteration step i = 15 can be omitted since its shifter would leave only the sign bit. The iteration step i = 7 can use the circuitry in Figure 6 since its 2i + 1 shifter would also leave only the sign bit.

Fig. 5. Elementary rotational unit of VSFA CORDIC (i < b/2)

In order to balance the pipeline, the elementary rotational units for i = (7, 8), (9, 10), and (11, 12) are each paired to create one stage, since the number of adders is then the same as with i ≤ 7. Each elementary rotational unit has an Enable signal, based on the location of logic '1' bits in the 12-bit unsigned representation of Phi. For example, if the binary representation of Phi is 101 100 000 001, then the sequence of active elementary rotational units is i = 4, 4, 4, 4, 4, 4, 5, 13.

The Output Processing block is used to fold back the results obtained in the CORDIC Core into the original quadrant and
[Figure: (a) pipelined CORDIC verification]
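The VSFA data path described above can be prototyped in floating point before committing to fixed-point hardware. In this data path, domain folding with (15) and (22)–(25) is followed by single-direction, scaling-free micro-rotations based on (14), and the result is folded back into the original quadrant. The sketch below is such a prototype under stated assumptions. It uses the standard counterclockwise angle convention, so its fold-back signs need not match the hardware's Quad/Domain encoding. It applies the $1/\sqrt{2}$ factor of domains B and C as an ordinary multiplication. It also greedily repeats each micro-rotation $2^{-i}$, for i = 4, ..., 14, as long as it fits into the remaining angle, which is one possible reading of the adaptive Enable scheme rather than the authors' exact one. All names are hypothetical.

```python
import math

# Assumed iteration range from (13): i = 4..14 (the i = 15 step is omitted).
P, I_MAX = 4, 14


def vsfa_rotate(x, y, phi):
    """Rotate (x, y) counterclockwise by phi in [0, pi/8] using only
    scaling-free micro-rotations (sin ~ 2**-i, cos ~ 1 - 2**-(2i+1)),
    i.e. the matrix of (14) in its counterclockwise form. The direction
    never changes; a micro-rotation is repeated while it still fits."""
    remaining = phi
    for i in range(P, I_MAX + 1):
        t = 2.0 ** -i                      # hardware: right shift by i
        c = 1.0 - 2.0 ** -(2 * i + 1)      # hardware: extra shift by 2i+1
        while remaining >= t:
            x, y = c * x - t * y, t * x + c * y
            remaining -= t
    return x, y


def first_quadrant(theta):
    """Domain folding of (15) for theta in [0, pi/2); returns (cos, sin)."""
    k = 1.0 / math.sqrt(2.0)               # extra 1/sqrt(2) of domains B and C
    if theta < math.pi / 8:                                  # domain A
        return vsfa_rotate(1.0, 0.0, theta)
    if theta < math.pi / 4:                                  # domain B
        c, s = vsfa_rotate(1.0, 0.0, math.pi / 4 - theta)
        return k * (c + s), k * (c - s)
    if theta < 3 * math.pi / 8:                              # domain C
        c, s = vsfa_rotate(1.0, 0.0, theta - math.pi / 4)
        return k * (c - s), k * (c + s)
    c, s = vsfa_rotate(1.0, 0.0, math.pi / 2 - theta)        # domain D
    return s, c


def vsfa_cos_sin(theta):
    """Quadrant folding for any real theta (counterclockwise convention)."""
    q, rest = divmod(theta, math.pi / 2)
    c, s = first_quadrant(rest)
    return [(c, s), (-s, c), (-c, -s), (s, -c)][int(q) % 4]


if __name__ == "__main__":
    for theta in (0.1, 1.0, 2.5, -2.0):
        c, s = vsfa_cos_sin(theta)
        print(f"{theta:+.2f}: cos err {c - math.cos(theta):+.1e}, "
              f"sin err {s - math.sin(theta):+.1e}")
```

The remaining error comes mainly from the small angle excess of each micro-rotation (about $2^{-3i}/3!$, the same term bounded in (12)) and from a magnitude growth of about $2^{-4i}/8$ per step, which is why the rotator is only virtually scaling-free.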
TABLE V. ERROR PERFORMANCE COMPARISON BETWEEN PIPELINED AND VSFA CORDIC
(angle and output values in hexadecimal; errors in LSBs)

                            Cosine                                  Sine
Angle          Hex     Real  Pipelined  Error  VSFA  Error    Real  Pipelined  Error  VSFA  Error
π/2            3244    0000  0001       1      0000  0        2000  1FFF       1      2000  0
π/2 − π/256    31DF    0065  0061       4      0063  2        1FFF  1FFF       0      1FFD  2
π/2 − π/128    317B    00C9  00C7       2      00C6  3        1FFE  1FFB       3      1FFC  2
3π/8           25B3    0C3F  0C3B       4      0C3E  1        1D90  1D91       1      1D91  1
π/4            1922    16A1  16A1       0      16A0  1        16A1  169E       3      16A0  1
π/8            0C91    1D90  1D8F       1      1D8F  1        0C3F  0C3F       0      0C3F  0
π/128          00C9    1FFE  1FFB       3      1FFC  2        00C9  00CD       4      00C6  3
π/256          0065    1FFF  1FFF       0      1FFD  2        0065  0063       2      0063  2
0              0000    2000  1FFF       1      2000  0        0000  0001       1      0000  0
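Table V can be cross-checked with a few lines of code. The sketch below assumes that the values use 13 fractional bits, as the 0x2000 entries for cos 0 and sin(π/2) suggest. Under that assumption, the Real columns are the exact values rounded to that format, and each Error entry is the absolute difference of the hex codes in units of $2^{-13}$. Only three pipelined rows are transcribed, and the function names are hypothetical.

```python
import math

ONE = 1 << 13   # assumption: 13 fractional bits, so 0x2000 == 1.0


def to_fixed(value):
    """Round a real value to the assumed fixed-point format."""
    return round(value * ONE)


def lsb_error(reference_hex, measured_hex):
    """Absolute difference between two hex codes, in units of 2**-13."""
    return abs(int(reference_hex, 16) - int(measured_hex, 16))


# A few pipelined rows of Table V: (angle, cosine code, sine code)
ROWS = [
    (math.pi / 4, "16A1", "169E"),
    (3 * math.pi / 8, "0C3B", "1D91"),
    (math.pi / 128, "1FFB", "00CD"),
]

if __name__ == "__main__":
    for theta, cos_hex, sin_hex in ROWS:
        ref_cos = f"{to_fixed(math.cos(theta)):04X}"
        ref_sin = f"{to_fixed(math.sin(theta)):04X}"
        print(f"angle {to_fixed(theta):04X}: "
              f"cos {ref_cos} vs {cos_hex} -> {lsb_error(ref_cos, cos_hex)} LSB, "
              f"sin {ref_sin} vs {sin_hex} -> {lsb_error(ref_sin, sin_hex)} LSB")
```

With these assumptions, the script reproduces the reference codes (for example 0x1922 for π/4 and 0x0C3F for cos(3π/8)). It also reproduces the pipelined error entries of the transcribed rows: 0 and 3, 4 and 1, and 3 and 4.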
parameter setting for benchmarking, our pipelined design is synthesized using the Xilinx ISE design suite with Spartan 3 as the target device. For performance comparison, we use the area multiplied by delay (A × T). The parameter A is defined by the amount of resources required (in this case, A = Slices + LUTs), while the parameter T denotes the time required to perform the computation. Both the smallest area and the shortest delay (highest frequency) are desired. Even though [9] and [10] consume less area, both designs have a longer delay than ours. Based on the A × T parameter, as can be seen in Table IV, our pipelined design has the best performance among the compared designs.

V. CONCLUSION

Two CORDIC architectures, pipelined CORDIC and Virtually Scaling-Free Adaptive (VSFA) CORDIC, have been successfully implemented on the Altera DE2-70 FPGA development board. The functional verification is performed using ModelSim software with an accuracy of $2^{-13}$ and a maximum error of $8.095 \times 2^{-13}$ for the pipelined CORDIC and $9.183 \times 2^{-13}$ for the VSFA CORDIC. The pipelined CORDIC consumes 1,103 logic elements, with 33.32 ns latency and 420.17 MHz maximum frequency, while the VSFA CORDIC utilizes 2,109 logic elements, with 34.96 ns latency and 343.29 MHz maximum frequency. Based on the performance comparison in terms of the area-delay parameter (A × T), our pipelined CORDIC is shown to have the best performance among the compared designs, and is hence more suitable for various applications, especially those that require high-speed data transfer.

REFERENCES

[1] R. R. Teja and P. S. Reddy, "Sine/cosine generator using pipelined CORDIC processor," Proc. IACSIT International Journal of Engineering and Technology, vol. 3, no. 4, pp. 431–434, 2011.
[2] V. Kantabutra, "On hardware for computing exponential and trigonometric functions," Computers, IEEE Transactions on, vol. 45, no. 3, pp. 328–339, 1996.
[3] C. K. Cockrum, "Implementation of the CORDIC algorithm in a digital down-converter," 2008. [Online]. Available: cockrum.net/Cockrum Fall 2008 Final Paper.pdf
[4] J. E. Volder, "The CORDIC trigonometric computing technique," Electronic Computers, IRE Transactions on, no. 3, pp. 330–334, 1959.
[5] J. S. Walther, "A unified algorithm for elementary functions," in Proceedings of the May 18-20, 1971, Spring Joint Computer Conference. ACM, 1971, pp. 379–385.
[6] K. Maharatna, A. Troya, S. Banerjee, and E. Grass, "Virtually scaling-free adaptive CORDIC rotator," IEE Proceedings-Computers and Digital Techniques, vol. 151, no. 6, pp. 448–456, 2004.
[7] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, "Modified virtually scaling-free adaptive CORDIC rotator algorithm and architecture," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 15, no. 11, pp. 1463–1474, 2005.
[8] E. Garcia, R. Cumplido, and M. Arias, "Pipelined CORDIC design on FPGA for a digital sine and cosine waves generator," in Electrical and Electronics Engineering, 2006 3rd International Conference on. IEEE, 2006, pp. 1–4.
[9] L. Vachhani, K. Sridharan, and P. K. Meher, "Efficient CORDIC algorithms and architectures for low area and high throughput implementation," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 56, no. 1, pp. 61–65, 2009.
[10] S. Aggarwal and K. Khare, "Hardware efficient architecture for generating sine/cosine waves," in VLSI Design (VLSID), 2012 25th International Conference on. IEEE, 2012, pp. 57–61.