IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, VOL. 68, NO. 1, FEBRUARY 2022

Design and Implementation of Gray-Coded Bit-Plane Based Reconfigurable Motion Estimation Architecture Using Binary Content Addressable Memory for Video Encoder

Sushanta Gogoi and Rangababu Peesapati, Senior Member, IEEE

Abstract—Motion estimation (ME) is the most power-consuming module in a video encoder because of its computationally complex operations, so designing efficient ME hardware without losing coding performance is a major challenge. This paper proposes a low-bit-depth ME technique based on Gray-coded bit-planes and its hardware implementation using binary content addressable memory (BCAM). The proposed method significantly reduces the computational burden owing to its low-bit-depth representation. The novel BCAM-based ME hardware provides faster results because of its on-chip memory computation, without compromising other performance parameters. It can process 8K video at 53.71 fps at a maximum frequency of 155 MHz with a 152.78K NAND-equivalent gate count using a 90 nm technology library.

Index Terms—Motion estimation (ME), gray-coding, binary content addressable memory (BCAM), hardware architecture.

Manuscript received June 21, 2021; revised October 21, 2021 and November 24, 2021; accepted December 29, 2021. Date of publication January 3, 2022; date of current version April 8, 2022. (Corresponding author: Rangababu Peesapati.)
The authors are with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong 793003, India (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TCE.2021.3139944
1558-4127 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION
Consumer electronics (CE) devices pose major challenges for multimedia applications, such as data storage, data handling, bandwidth requirement and quality. Various optimized software and hardware video encoders and decoders (codecs) have been developed to satisfy the needs of applications such as virtual reality and video conferencing [1]. Several standards bodies, such as the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU), developed the MPEG-4 and H.265 compression standards. Compared to the preceding standards, the latest standard, H.265, compresses video into a more compact space and provides 59% better coding efficiency [2] by incorporating larger block sizes and various partitioning modes. Both H.264 and H.265 follow a block-based hybrid video coding technique, as shown in Fig. 1. A picture is partitioned into coding tree blocks (CTBs) of equal size. These CTBs are further sub-partitioned into coding blocks (CBs) of various sizes. A CB is again subdivided into prediction blocks (PBs). A PB represents a picture region that contains prediction information. Each node of the transform quad-tree structure represents a transform block (TB). In H.265, the maximum size of the basic encoding block is 64 × 64 compared to 16 × 16 in H.264, i.e., 16 times as many pixels as in H.264.
Fig. 1. Partitioning of a CTB into CB and TB.
Motion estimation (ME) is a crucial component of the video coding process, accounting for 80% of the total computational load [3]. To achieve video compression in CE devices, optimized strategies for computing ME are indispensable. CE devices often have several bottlenecks, such as limited computing power, limited bandwidth and a low power budget. Video encoder algorithms therefore need to be fine-tuned to a specific computing platform, either through software optimizations or by implementing them as special-purpose hardware architectures such as hardware accelerators or co-processors. A computationally effective architecture for video algorithms, evolved from the needs of CE devices mentioned above, is essential. The developed architectures need to be implemented on field programmable gate array (FPGA) platforms before being commercialized as application specific integrated circuit (ASIC) chips. An FPGA contains configurable logic blocks (CLBs) which can be programmed to realize a function. Owing to their large logic gate resources, FPGAs can implement complex digital computations. The standard procedure is to accelerate the algorithms of an application by designing architectures for the computationally intensive algorithms using pipelining and parallel processing, and thereafter to implement these on an FPGA. Recently, CE devices have seen huge growth in the multimedia industry. Recording devices such as action cameras and mirrorless cameras, as well as online streaming platforms, support high-resolution video such as 4K and 8K UHD. This involves processing a huge number of pixels and, as a result, increases the computational complexity. Hardware solutions that reduce the computational complexity, and hence enhance the performance of ME hardware in terms of power, area and coding efficiency, are therefore important. Content addressable memories (CAMs) are mostly used in networking systems for routing tables, where huge amounts of data need to be processed at high speed. Since ME involves extensive search operations, there is scope for implementing ME hardware using binary content addressable memory (BCAM). A BCAM is a high-speed search engine implemented in hardware that searches pre-loaded data by input data, instead of by address, in one clock cycle. This provides a fast search operation. The resulting hardware reduces the ME computation time to a great extent. So, with the motivation of reducing the computational complexity of ME hardware design for real-time CE applications, this work investigates a BCAM-based ME architecture in conjunction with a Gray-coded bit-plane based technique. The major contributions of this paper are summarized as follows.
• The proposed work presents a hardware-oriented Gray-coded bit-plane based ME technique that reduces the computational complexity of the hardware encoder.
• The developed hardware architecture is based on a BCAM engine that increases ME performance by utilizing on-chip memory operation.
The rest of the paper is organized as follows. Section II discusses related work. Bit-plane based ME and the proposed technique are presented in Section III. The proposed ME hardware architecture using BCAM is described in Section IV. Section V discusses the results and analysis, followed by the conclusion in Section VI.

II. RELATED WORKS
Several works have been published with the objective of accelerating the ME process with various hardware architectures. Zheng et al. [4] proposed a hardware-efficient block matching algorithm (BMA) for variable block size ME (VBSME). A small ME hardware unit is used effectively for different search strategies and early termination, to cover a predictive search window and to improve the speed of the search operation. Thang and Nam Dinh [5] presented a two-dimensional matrix array integer motion estimation (IME) architecture using the full search ME (FSME) algorithm for H.265. This architecture processes 4K resolution video at 30 fps with latency as low as 1219 clock cycles. Ito et al. [6] used an adaptive search range selection algorithm to reduce the computational complexity of hierarchical FSME. The algorithm modifies the search range and predicts the best point within the search area using the motion vector (MV) distribution of the neighboring blocks. Further, it minimizes the redundant search area using the MV distribution. Luo et al. [7] proposed a low-delay parallel motion estimation design based on a graphics processing unit (GPU) for fast HEVC encoder optimization. A three-layer hierarchical parallel structure is the novelty of the proposed ME. It uses a coding tree unit (CTU) layer with a novel indexing table in the prediction unit (PU) layer, which is used to realize an efficient sum of absolute differences (SAD) derivation to accelerate the ME scheme. A compact descriptor is designed to avoid redundant branches in the MV search. Similarly, Purnachand et al. [8] proposed a test zone search (TZS) based ME algorithm. An 8-point diamond pattern and an 8-point square pattern are implemented, which outperform the fast ME algorithm implemented in the H.265 reference software. The work reduces computational complexity by 53.1% on average, with negligible loss in rate distortion (RD) performance compared to the TZS of H.265. Jia et al. [9] proposed an ME architecture with a novel SAD calculation scheme that reduces computational complexity by 50%. The hardware architecture supports 8K UHD @ 30 fps at a maximum operating frequency of 290 MHz. Kim et al. [10] proposed an IME hardware architecture that reduces computational complexity by 72.25%. The proposed architecture supports real-time encoding of 8K UHD @ 30 fps at a maximum operating frequency of 500 MHz. Gogoi and Peesapati [11] proposed a hybrid search pattern ME algorithm and its hardware architecture. The architecture processes a CTB in parallel and requires 59.5 clock cycles. The architecture supports 8K UHD @ 78 fps with a maximum frequency of 162 MHz. Singh and Ahamed [12] proposed a low-power hardware architecture for a modified hexagonal grid search algorithm. The architecture uses a small memory of 8.192 kB with a power consumption of 151.76 mW. It supports 4K UHD @ 30 fps at an operating frequency of 250 MHz. In recent years, some studies on ME using low bit-depth techniques have provided better results in picture quality without consuming much hardware. A low bit-depth representation represents each pixel with fewer bits than the full 8 bits. In addition, to minimize the matching computation, exclusive-OR (EX-OR) operations on Boolean arrays are used instead of SAD computation. Several works use one-bit transform (1BT) and two-bit transform (2BT) approaches, in which frames are filtered by a multi-band-pass filter and the resulting binary frames are obtained by comparing the filtered and original frames. The 2BT approach was found to be better than 1BT because it uses the local mean and standard deviation to improve the estimation accuracy [13]. Yavuz et al. [14] developed a selective Gray-coding based ME approach and proposed a related hardware architecture that obtains a single bit-plane by choosing certain bits of the Gray-coded pixels. This approach reduces the binarization cost and increases ME performance considerably. Celebi et al. [15] proposed a low-complexity ME based on one-dimensional filtering. Two bit-planes were constructed for the matching operation, and the proposed method has low hardware complexity. Aggarwal and Khare [16] proposed a ternary content addressable memory (TCAM) based ME using one prediction unit. The search strategy reduces the don't care bits from MSB to LSB between the CB and the RB until only the MV address of the best match block for one prediction unit is obtained. Ghosh and Peesapati [17] proposed a TCAM-based hierarchical ME. The presented work accelerated the estimation process using mixed parallel and pipeline processing with TCAM. The proposed technique searches for the pixel variations between the current block and the reference blocks of a frame simultaneously, by checking the complete and partial match cases in a fixed window.
In this paper, an attempt has been made to perform the computation inside the memory by using bit-plane coding in conjunction with BCAM, instead of a separate data path such as a SAD unit.

III. BIT-PLANE BASED ME AND PROPOSED METHOD
ME is a crucial part of the video encoding process. In a video sequence, the motion of objects is detected by ME, which tries to obtain motion vectors that indicate the estimated motion. To detect a block of N × N samples in the reference frame that is similar to an N × N block of samples in the current frame, the N × N current-frame block is compared with some or all of the feasible N × N blocks in a search area. The search area is referred to as the search window. The block that produces the minimum residual is considered the best match. The sum of absolute differences (SAD) is the most widely used method to find the best-matched block. The SAD for an N × N current and reference block is given by

$$\mathrm{SAD} = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left|\,CB(i,j) - RB(i+m,\,j+n)\,\right| \tag{1}$$

where m, n give the position of the block within the given search range. Because it checks all possible reference candidates, this method increases the computational load and complexity, especially when implemented in hardware. To reduce the computational complexity, low bit-depth methods have been proposed in the literature, which select a particular number of bit-planes to represent a frame.
An N-bit pixel with 2^N gray levels at location (x, y) of a frame 't' can be represented as [18]:

$$f^{t}(x,y) = a_{N-1}2^{N-1} + a_{N-2}2^{N-2} + \cdots + a_{1}2^{1} + a_{0}2^{0} \tag{2}$$

where a_n is either 0 or 1 (0 ≤ n ≤ N − 1). The nth bit-plane of frame 't', b^t_n(x, y), consists of all the a_n bits of frame 't'. The Gray code representation of a pixel is given as follows:

$$g_{N-1} = a_{N-1}, \qquad g_{n} = a_{n} \oplus a_{n+1}, \quad 0 \le n \le N-2 \tag{3}$$

The nth order Gray-coded bit-plane g^t_n(x, y) consists of all the g_n bits of the frame. For a frame of 8-bit pixels, there are eight one-bit Gray-coded bit-planes, g0 to g7, where g0 is the least significant and g7 the most significant Gray-coded bit-plane.
ME based on Gray-coded bit-plane matching is determined by a correlation measure (C_k) between the current block (CB) located in the current frame (CF) and the reference block (RB) located in the reference frame (RF), given by the following equation,

$$C_{k}(m,n) = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} g_{k}^{c}(x,y) \oplus g_{k}^{r}(x+m,\,y+n) \tag{4}$$

where g^c_k(x, y) and g^r_k(x, y) are the kth order Gray-coded bit-planes of the CF and RF respectively. The main advantage of bit-plane matching based ME is that it requires fewer hardware resources, consumes less power and can process faster. The most significant Gray-coded bit-plane contains most of the information, but it does not provide all the texture information of a frame.
The proposed method uses the four most significant Gray-coded bit-planes, as these bit-planes provide enough information about the frame. The other bit-planes are discarded because they hardly contain motion information. Two bit-planes are constructed from the g4, g5, g6 and g7 bit-planes as shown in Fig. 2. The first bit-plane is constructed using g7 and g6 and the second bit-plane is constructed using g5 and g4. Then 2BT is applied to the two bit-planes [13], as 2BT provides more dynamic range than the conventional approach. It is given by the following equations:

$$\begin{aligned}
NMP_{1}(x,y) &= \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} G_{cm1}(i,j) \oplus G_{rm1}(i+x,\,j+y) \\
NMP_{2}(x,y) &= \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} G_{cm2}(i,j) \oplus G_{rm2}(i+x,\,j+y) \\
NMP_{3}(x,y) &= NMP_{1}(x,y)\ [\ldots]\ NMP_{2}(x,y) \\
NMP_{2BT} &= NMP_{1} + NMP_{2} + NMP_{3}
\end{aligned} \tag{5}$$

where G_cm and G_rm are the CB and the RB respectively.
Fig. 2. Proposed method of bit-plane construction from g4, g5, g6 and g7 Gray-coded bit-planes: (a) first bit-plane, (b) second bit-plane.

IV. PROPOSED HARDWARE ARCHITECTURE
In this section, a reconfigurable ME architecture based on a BCAM engine and the proposed Gray-coded bit-plane matching method is presented. The proposed work uses a motion analyzer module to categorize the motion of a video sequence. Based on the motion analyzer value, different algorithms can be applied to different video sequences. Chandran et al. [19] proposed a motion analyzer to
determine the motion of a video sequence. For the CIF format of resolution 352 × 288, two sub-frames of size 252 × 188 are extracted, on the assumption that most of the motion is confined to the center of the frame. The level of motion is calculated using the following equation:

$$I_{m} = \frac{\sum_{i=0}^{251}\sum_{j=0}^{187}\left| F1_{ij} - F2_{ij} \right|}{n} \tag{6}$$

where I_m is the level of motion, F1_ij and F2_ij are the two sub-frames of frames F1 and F2 respectively, and n is the total number of pixels in the sub-frame. The authors of [19] justified two threshold values for different types of motion sequences. The threshold values are T1 = 10 and T2 = 20, which are based on the initial 10 frames of different video sequences. If Im < T1, the sequence is a low motion sequence. For T1 < Im < T2, it is a medium motion sequence, and if Im > T2 it is a high motion sequence. Fig. 3 shows the hardware design of the motion analyzer. In our proposed design, the motion analyzer is shown only for 352 × 288 resolution videos. However, the hardware design may vary depending upon the resolution of the video.
Fig. 3. Motion analyzer unit.
In our proposed system, the motion analyzer value is used to determine the algorithm for a particular video sequence. If a video sequence contains low or medium motion, a fast ME algorithm can be applied, and for high motion sequences a full search algorithm can be applied. This approach saves time, hardware resources and power consumption. It also helps to find accurate motion vectors (MVs) and to maintain the quality of the video.

A. Content Addressable Memory
CAM-based hardware devices perform search operations faster than algorithmic search techniques. A CAM performs a parallel comparison of the input data with the stored data. It executes three operations, namely match, read and write. CAMs can be classified as BCAM and ternary CAM (TCAM). The ability to store two logical values (0 and 1) makes BCAM suitable for both full search and fast search operation. The conceptual diagram of a CAM is shown in Fig. 4. The data to be searched is loaded into the search data register. The search then proceeds through the search lines across the stored word registers. If a stored word matches, the corresponding matchline goes high and the encoder provides the matching location.
Fig. 4. Conceptual diagram of CAM [20].
ME involves an exhaustive search operation to find the best matching block. In this process, the ME module frequently accesses memory to retrieve the block of pixels. In hardware, this frequent memory access causes high power consumption as well as extra clock cycle delays. BCAM-based hardware devices perform search operations faster than algorithmic search techniques. Because the BCAM searches the input data against the stored data, the search engine does not need to access external memory for the reference block of pixels. All the accesses happen inside the main ME module itself. This saves both extra clock cycle delays and extra power consumption, resulting in faster operation and a low-power architecture. Traditionally, CAMs are implemented in ASICs. Hardwired CAM blocks do not make sense in FPGAs because FPGAs are used for different applications, so CAMs are emulated in FPGAs using the available logic resources [21], [22].
Fig. 5. 1 bit CAM cell.
For the hardware implementation of the CAM, a hierarchical CAM built from a 1-bit CAM cell has been designed. Fig. 5 shows a 1-bit CAM cell, represented by the following equation.

$$\texttt{b\_match}[i] = \sim\texttt{w\_care}[i] \;\big|\; \sim\left(\texttt{lu\_data}[i] \oplus \texttt{w\_data}[i]\right) \tag{7}$$

where w_data is the word data and lu_data is the look-up data or search data. The operation of the CAM cell depends upon the w_care signal, which is also known as the mask bit. If w_care is '0' the circuit works as a TCAM and if it is '1' the circuit works as a BCAM. In the proposed design, BCAM has been used as the main search engine. For ME, the reference data is stored through the w_data port and the current data is provided through the lu_data port. In the proposed

Fig. 6. Top level hardware architecture.
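The motion analyzer at the front of this architecture reduces to Eq. (6) followed by two threshold comparisons. A minimal Python sketch is given below. It assumes frames are lists of rows of 8-bit luma values and that the sub-frame is cut from the centre of the frame; the function names, the centring rule and the handling of the boundary cases Im = T1 and Im = T2 (which the text leaves open) are assumptions made for illustration only.

```python
T1, T2 = 10, 20  # thresholds quoted in the text


def center_crop(frame, height, width):
    """Cut a height x width sub-frame from the middle of a frame (assumed centring)."""
    top = (len(frame) - height) // 2
    left = (len(frame[0]) - width) // 2
    return [row[left:left + width] for row in frame[top:top + height]]


def motion_level(frame1, frame2, height=188, width=252):
    """Eq. (6): mean absolute difference between two co-located sub-frames."""
    sub1 = center_crop(frame1, height, width)
    sub2 = center_crop(frame2, height, width)
    total = sum(abs(a - b)
                for row1, row2 in zip(sub1, sub2)
                for a, b in zip(row1, row2))
    return total / (height * width)  # n = number of pixels in the sub-frame


def select_search(im):
    """Map the motion level to a motion class and a search strategy."""
    if im < T1:
        return "low", "fast search"
    if im < T2:  # ASSUMPTION: Im == T1 is treated as medium, Im == T2 as high
        return "medium", "fast search"
    return "high", "full search"


# Two synthetic CIF-sized frames (288 rows x 352 columns) that differ by 25 everywhere.
f1 = [[100] * 352 for _ in range(288)]
f2 = [[125] * 352 for _ in range(288)]
print(select_search(motion_level(f1, f2)))  # ('high', 'full search')
```

The returned strategy corresponds to the controller's choice between the fast search and full search paths.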

Fig. 7. Conceptual schematic of algorithm decision through motion analyzer.

Algorithm 1 Binarization Implementation in Hardware
Input: Gray-coded bit-planes g4, g5, g6, g7 for CF and RF
Output: Transformed bit-planes for CF and RF
1: for n ← 255 to 0 by 2 do
2:   gc1cf[n] ← g7_cf
3:   gc1rf[n] ← g7_rf
4:   gc1cf[n − 1] ← g6_cf
5:   gc1rf[n − 1] ← g6_rf
6:   gc2cf[n] ← g5_cf
7:   gc2rf[n] ← g5_rf
8:   gc2cf[n − 1] ← g4_cf
9:   gc2rf[n − 1] ← g4_rf
10: end for
11: X1 = gc1cf ⊕ gc1rf
12: Y1 = gc2cf ⊕ gc2rf
13: Z1 = X1 […] Y1
14: Final = X1 + Y1 + Z1
15: return Final
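Algorithm 1, together with Eqs. (3) and (5), can be sketched in software as follows. This is an illustrative Python model, not the authors' Verilog. The helper names are hypothetical and pixels are processed as a flat list. Two points are assumptions rather than facts from the source: the operator that combines X1 and Y1 in the Z1 step is not legible and is taken here to be a bitwise OR, and "Final" is read as the total number of set bits in X1, Y1 and Z1.

```python
def gray_code(pixel):
    """Eq. (3): g_(N-1) = a_(N-1), g_n = a_n XOR a_(n+1)."""
    return pixel ^ (pixel >> 1)


def bit(value, k):
    """Return bit k of value (0 = least significant)."""
    return (value >> k) & 1


def mix_planes(pixels, hi, lo):
    """Interleave two Gray-coded bit-planes into one plane (the for loop).

    For pixel p, position 2p+1 holds bit `hi` and position 2p holds bit `lo`,
    mirroring gc[n] <- g_hi and gc[n-1] <- g_lo in Algorithm 1.
    """
    plane = [0] * (2 * len(pixels))
    for p, pixel in enumerate(pixels):
        g = gray_code(pixel)
        plane[2 * p + 1] = bit(g, hi)
        plane[2 * p] = bit(g, lo)
    return plane


def mismatch_score(current, reference):
    """X1, Y1, Z1 and Final for one current block and one reference block."""
    x1 = [c ^ r for c, r in zip(mix_planes(current, 7, 6),
                                mix_planes(reference, 7, 6))]
    y1 = [c ^ r for c, r in zip(mix_planes(current, 5, 4),
                                mix_planes(reference, 5, 4))]
    z1 = [a | b for a, b in zip(x1, y1)]   # ASSUMPTION: operator missing in source
    return sum(x1) + sum(y1) + sum(z1)     # ASSUMPTION: "+" read as bit counts


block = [12, 200, 37, 255]
print(mismatch_score(block, block))        # 0: identical blocks never mismatch
print(mismatch_score([0, 0], [255, 0]))    # 2
```

Under this reading, a lower score means a closer match, so a search would keep the candidate reference block with the smallest score.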
Fig. 8. BCAM search engine unit.
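The matching behaviour of such a search engine follows from the cell equation (7): every stored word is compared with the search word in parallel, and an encoder reports which matchline went high. A minimal Python model is sketched below. It assumes words are held as integers, that a matchline is high only when all bit cells of the word match, and that the encoder returns the lowest matching address. The names cam_cell, word_match and cam_search are hypothetical, and the loop over stored words stands in for what the hardware does in parallel. Only exact (or masked) matches are modelled; how the design ranks partial matches is not covered here.

```python
def cam_cell(lu_bit, w_bit, w_care):
    """Eq. (7) on single bits: b_match = ~w_care | ~(lu_data XOR w_data)."""
    return (1 - w_care) | (1 - (lu_bit ^ w_bit))


def word_match(lu_data, w_data, w_care, width):
    """A matchline goes high only if every bit cell of the word matches."""
    return all(
        cam_cell((lu_data >> i) & 1, (w_data >> i) & 1, (w_care >> i) & 1)
        for i in range(width)
    )


def cam_search(stored_words, lu_data, width, w_care=None):
    """Compare lu_data with every stored word; return (matchlines, address)."""
    if w_care is None:
        w_care = (1 << width) - 1  # all care bits set: BCAM behaviour
    matchlines = [word_match(lu_data, w, w_care, width) for w in stored_words]
    address = matchlines.index(True) if True in matchlines else None
    return matchlines, address


stored = [0b1010, 0b0110, 0b1111]
print(cam_search(stored, 0b0110, width=4))                 # ([False, True, False], 1)
print(cam_search(stored, 0b0111, width=4))                 # ([False, False, False], None)
print(cam_search(stored, 0b0111, width=4, w_care=0b1110))  # ([False, True, False], 1)
```

The third call clears the care bit of the least significant position, which shows the ternary (don't care) behaviour that the same cell provides when w_care is '0'.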

design, eight 16 × 16 BCAM cells have been used to store eight 16 × 16 reference blocks. The reference block data are uploaded sequentially, which takes 8 clock cycles.

B. Overall System Architecture
The top-level hardware architecture of the system is shown in Fig. 6. It consists of seven modules: i) motion analyzer, ii) RF and CF data register, iii) BP register array, iv) binarization unit, v) controller, vi) BCAM engine and vii) MV calculation unit. Initially, both the CF and RF are applied to the motion analyzer. The controller decides whether a full search or a fast search algorithm needs to be applied based on the motion value from the motion analyzer, i.e., based on the motion analyzer value the hardware architecture works as a reconfigurable system for the search operation. The conceptual diagram of this operation is shown in Fig. 7. This module requires 94 clock cycles. The corresponding Gray-coded bit-planes are uploaded to the bit-plane register array from the CF and RF data register. The array consists of 8 bit-plane registers that hold the 4 Gray-coded bit-planes of the CF and RF respectively. This unit requires a total memory of 256 bytes and takes 5 clock cycles to fill all the bit-plane register arrays. The hardware implementation of the binarization unit is explained through Algorithm 1. This module consists of two sub-modules: the CF bit-plane mixer and the RF bit-plane mixer. It produces two resultant bit-planes each for the CF and RF, and takes 4 clock cycles. The resultant bit-planes are stored in four 256 × 256 registers (g76_cf, g54_cf, g76_rf, g54_rf). Then the CB and RB of size 16 × 16 are uploaded to the four 16 × 16 registers. The BCAM search engine unit is shown in Fig. 8. The architecture consists of eight 16 × 16 cells. The RBs of the g76 and g54 planes are stored in the cells sequentially through the look-up data ports lu_data_76 and lu_data_54. After 4 clock cycles, a completion flag is raised. The CB is searched through word data ports w_data_54 and w_data_76

TABLE I
PSNR COMPARISON WITH OTHER LOW-COMPLEXITY ME TECHNIQUES

TABLE II
BD-PSNR AND BD-BR CALCULATION

TABLE III
FPGA DEVICE UTILIZATION REPORT

TABLE IV
ASIC AREA COST UTILIZATION REPORT

TABLE V
ASIC POWER UTILIZATION REPORT

port. If a best match is found, the corresponding matchline goes high and the address of the best match is determined from the matchline encoder. This unit takes 5 clock cycles. The MV calculation unit takes the locations of the best-matched RB and the CB to calculate the MV. After calculating the MV, it sends a signal to the controller to accept new data. For each 16 × 16 CB, the search operation is carried out in a 32 × 32 search window. The speed of a CAM depends upon its size [20]. As the size of a CAM increases, the silicon area and the power consumption also increase. So, in the proposed design the maximum size of a CAM cell has been kept at 16 × 16. To process a 64 × 64 block using 16 × 16 blocks, a maximum of 334 clock cycles is required. Using the bit-plane technique, the number of pixels to be processed is drastically reduced. So, the bit-plane technique in combination with CAM offers efficient ME performance in terms of speed, area and power consumption.

V. RESULTS AND ANALYSIS
To estimate the performance of the proposed bit-plane based ME technique, the peak signal to noise ratio (PSNR) between CF and RF was compared against different low-bit-depth ME techniques. The program was compiled on a software platform for different video sequences with different motion parameters, and the results are shown in Table I. The maximum block size is taken as 64 × 64. From Table I it can be observed that the proposed technique shows better performance in terms of PSNR gain compared to other existing low-bit-depth ME techniques. To calculate the RD performance of the proposed technique, it was implemented inside an H.264 encoder [27]. The corresponding RD curves for the four sequences foreman, container, coastguard and flower are shown in Fig. 9. Table II shows the Bjontegaard bitrate (BD-BR) [28] and Bjontegaard PSNR (BD-PSNR). From Table II,

Fig. 9. Rate distortion curve for (a) foreman_352 × 288, (b) container_352 × 288, (c) coastguard_352 × 288, (d) flower_352 × 288 video sequences with QP 22, 24, 27 and 32.

TABLE VI
COMPARISON RESULT FOR THE PROPOSED ARCHITECTURE
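The throughput figures used in this comparison can be cross-checked from the cycle count and clock frequency quoted in the text. The short Python check below assumes an 8K frame of 8192 × 4320 pixels and counts CTUs as the plain pixel ratio, with no rounding up of partial CTU rows. Both are assumptions, chosen because they reproduce the reported 53.71 fps and roughly 1.9 Gpixels/s; the paper does not state them.

```python
CLOCK_HZ = 155e6                 # maximum operating frequency quoted in the text
CYCLES_PER_CTU = 334             # worst-case cycles for one 64 x 64 block
CTU_PIXELS = 64 * 64
FRAME_W, FRAME_H = 8192, 4320    # ASSUMPTION: frame size meant by "8K"

ctus_per_frame = FRAME_W * FRAME_H / CTU_PIXELS       # 8640.0
cycles_per_frame = ctus_per_frame * CYCLES_PER_CTU    # 2885760.0
fps = CLOCK_HZ / cycles_per_frame
pixels_per_second = CLOCK_HZ * CTU_PIXELS / CYCLES_PER_CTU

print(f"{fps:.2f} fps")                               # 53.71 fps
print(f"{pixels_per_second / 1e9:.2f} Gpixels/s")     # 1.90 Gpixels/s
```

With a 7680 × 4320 frame the same arithmetic gives a higher frame rate, so the frame size assumption matters when comparing against other designs.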

it can be observed that the proposed technique shows a negligible BD-BR increase of 0.1021% with a BD-PSNR loss of 0.893 dB compared to the traditional SAD technique in H.264.
The proposed BCAM-based ME architecture was written in Verilog HDL and the performance analysis was carried out on both FPGA and ASIC platforms. The design uses FPGA resources of 10835 look-up tables (LUTs) and 8356 flip-flops (FFs). The design was synthesized using FPGA EDA tools. Table III shows the FPGA device utilization report. The proposed design was also synthesized using 90 nm process technology on an ASIC platform. The maximum frequency of the design is 155 MHz with a power consumption of 78.017 mW. The design requires a total area of 152.78 K in terms of NAND2XL gate equivalents. Table IV and Table V show the ASIC area cost utilization report and the power utilization report respectively. The hardware design works for both H.264 and HEVC. However, to observe the coding efficiency of the proposed technique, the BD-BR and BD-PSNR analysis was carried out using an H.264 encoder. Table VI compares the proposed work with various state-of-the-art ME architectures. The comparison was carried out in terms of bit-depth, technology, maximum operating frequency, on-chip memory, search range, throughput, clock cycles per CTU, gate count, power consumption, BD-BR increase, supported PUs and supported resolution. Because the state-of-the-art works use different process technologies, the parameters of the designs are normalized to 65 nm technology for a fair comparison. The proposed design supports 8K video at 53.71 fps, which is higher compared to [10]–[13]. It uses a dedicated on-chip memory of 33 kB for on-chip memory computation. The proposed bit-plane technique shows better coding performance, with a negligible degradation of 0.1021% compared to the state-of-the-art works. The architecture proposed by Singh and Ahamed [12] has a throughput of 248.8 Mpixels/s, which is lower than the 1.9 Gpixels/s of the proposed work. Also, in terms of gate count and power consumption, the proposed design has a 10.6% lower gate count and a 51.6% reduction in power consumption compared to Singh and Ahamed [12]. In Gogoi and Peesapati [11], the architecture occupies a larger area of 2784.4 K with a power consumption of 463.4 mW, compared to 152.78 K and 78.01 mW in the proposed architecture. Also, the power consumed by the design is 48.87 mW at 100 MHz, which is lower compared to

Celebi et al. [13]. Similarly, the proposed design shows better performance in terms of gate count, power consumption and clock cycles spent per CTU compared to Kim et al. [10].

VI. CONCLUSION
In this paper, we have proposed a Gray-coded bit-plane based ME technique and its hardware implementation using BCAM for faster on-chip memory computation. The proposed technique utilizes the four most significant bit-planes of a frame and provides performance similar to the state-of-the-art low-bit-depth ME techniques. The novel BCAM-based ME hardware provides performance similar to other hardware architectures of the same category, but with faster computation speed. The hardware architecture supports 8K @ 53.71 fps at a maximum operating frequency of 155 MHz with a power consumption of 78.01 mW using 90 nm process technology.

REFERENCES
[1] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd ed. Hoboken, NJ, USA: Wiley, 2011.
[2] Y. Ye, Y. He, and X. Xiu, “Manipulating ultra-high definition video traffic,” IEEE MultiMedia, vol. 22, no. 3, pp. 73–81, Jul.–Sep. 2015.
[3] Z. Chen, J. Xu, Y. He, and J. Zheng, “Fast integer-pel and fractional-pel motion estimation for H.264/AVC,” J. Vis. Commun. Image Represent., vol. 17, no. 2, pp. 264–290, Apr. 2006.
[4] J. Zheng, C. Lu, J. Guo, D. Chen, and D. Guo, “A hardware-efficient block matching algorithm and its hardware design for variable block size motion estimation in ultra-high-definition video encoding,” ACM Trans. Design Autom. Electron. Syst., vol. 24, no. 2, p. 15, Mar. 2019.
[5] N. V. Thang and V. Nam Dinh, “High throughput and low cost memory architecture for full search integer motion estimation in HEVC,” in Proc. Int. Conf. Adv. Technol. Commun. (ATC), 2018, pp. 174–178.
[16] G. Aggarwal and R. Khare, “Tertiary content addressable memory based motion estimator,” U.S. Patent 7 873 107, Jan. 18, 2011.
[17] P. Ghosh and R. Peesapati, “Design and implementation of ternary content addressable memory (TCAM) based hierarchical motion estimation for video processing,” in Proc. Int. Symp. VLSI Design Test, Singapore, 2017, pp. 557–569.
[18] S. Erturk, “Locally refined gray-coded bit-plane matching for block motion estimation,” in Proc. 3rd Int. Symp. Image Signal Process. Anal., Rome, Italy, 2003, pp. 128–133.
[19] K. R. S. Chandran and P. V. Chandramani, “Hardware - software co-design framework for sum of absolute difference based block matching in motion estimation,” Microprocess. Microsyst., vol. 74, Apr. 2020, Art. no. 103012.
[20] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[21] Z. Ullah, “LH-CAM: Logic-based higher performance binary CAM architecture on FPGA,” IEEE Embedded Syst. Lett., vol. 9, no. 2, pp. 29–32, Jun. 2017.
[22] H. Mahmood, Z. Ullah, O. Mujahid, I. Ullah, and A. Hafeez, “Beyond the limits of typical strategies: Resources efficient FPGA-based TCAM,” IEEE Embedded Syst. Lett., vol. 11, no. 3, pp. 89–92, Sep. 2019.
[23] A. Erturk and S. Erturk, “Two-bit transform for binary block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. 938–946, Jul. 2005.
[24] A. Celebi, O. Akbulut, O. Urhan, and S. Erturk, “Truncated gray-coded bit-plane matching based motion estimation and its hardware architecture,” IEEE Trans. Consum. Electron., vol. 55, no. 3, pp. 1530–1536, Aug. 2009.
[25] S.-Y. Jou, S.-J. Chang, and T.-S. Chang, “Fast motion estimation algorithm and design for real time QFHD high efficiency video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 9, pp. 1533–1544, Sep. 2015.
[26] G. He, D. Zhou, Y. Li, Z. Chen, T. Zhang, and S. Goto, “High-throughput power-efficient VLSI architecture of fractional motion estimation for ultra-HD HEVC video encoding,” IEEE Trans. Very Large Scale Int. (VLSI) Sys., vol. 23, no. 12, pp. 3138–3142, Dec. 2015.
[27] A. Al Muhit. “H.264 Codec (Encoder/Decoder) (MATLAB).” (2009). [Online]. Available: https://sites.google.com/site/almuhit/codes (Accessed: Oct. 2, 2021).
[28] G. Bjontegaard, Calculation of Average PSNR Differences Between RD
[6] Y. Ito, T. Song, W. Shi, T. Katayama, and T. Shimamoto, “Hardware- Curves, document ITUT-T Q6/SG16 VCEG-M33, Int. Telecommun.
oriented low complexity motion estimation for HEVC,” in Proc. IEEE Union, Geneva, Switzerland, Apr. 2001.
Int. Conf. Consum. Electron. (ICCE), Las Vegas, NV, USA, 2018,
pp. 1–5.
[7] F. Luo, S. Wang, S. Wang, X. Zhang, S. Ma, and W. Gao, “GPU based
hierarchical motion estimation for high efficiency video coding,” IEEE
Trans. Multimedia, vol. 21, no. 4, pp. 851–862, Apr. 2019.
[8] N. Purnachand, L. N. Alves, and A. Navarro, “Improvements to TZ
search motion estimation algorithm for multiview video coding,” in Proc.
19th Int. Conf. Syst. Signals Image Process. (IWSSIP), Vienna, Austria, Sushanta Gogoi received the M.Tech. degree
2012, pp. 388–391. in VLSI systems from the Electronics and
[9] L. Jia, C. Tsui, O. C. Au, and K. Jia, “A low-power motion estima- Communication Engineering Department, National
tion architecture for HEVC based on a new sum of absolute difference Institute of Technology Nagaland, Dimapur, India,
computation,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, in 2016. He is currently pursuing the Ph.D.
pp. 243–255, Jan. 2018. degree with the Department of Electronics and
[10] T. S. Kim, C. E. Rhee, and H.-J. Lee, “Fast hardware-based IME Communication Engineering, National Institute of
with idle cycle and computational redundancy reduction,” IEEE Trans. Technology Meghalaya, Shillong, India. His current
Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1732–1744, Jun. 2020. research interests include high performance video
[11] S. Gogoi and R. Peesapati, “A hybrid hardware oriented motion esti- architectures like H.264 and HEVC video codecs.
mation algorithm for HEVC/H.265,” in J. Real-Time Image Process.,
vol. 18, pp. 953–966, Jan. 2021.
[12] K. Singh and S. R. Ahamed, “Low power motion estimation algo-
rithm and architecture of HEVC/H.265 for consumer applications,” IEEE
Trans. Consum. Electron, vol. 64, no. 3, pp. 267–275, Aug. 2018.
[13] A. T. Celebi, S. Yavuz, A. Celebi, and O. Urhan, “Selective gray-coded
bit-plane-based two-bit transform and its efficient hardware architecture
for low-complexity motion estimation,” IEEE Trans. Consum. Electron., Rangababu Peesapati (Senior Member, IEEE)
vol. 64, no. 3, pp. 259–266, Aug. 2018. received the Ph.D degree from the University of
[14] S. Yavuz, A. Celebi, M. Aslam, and O. Urhan, “Selective gray-coded Hyderabad, Hyderabad, in 2014. He is an Assistant
bit-plane based low-complexity motion estimation and its hardware Professor with the Department of Electronics and
architecture,” IEEE Trans. Consum. Electron., vol. 62, no. 1, pp. 76–84, Communication Engineering, National Institute of
Feb. 2016. Technology Meghalaya, Shillong, since 2014. His
[15] A. T. Celebi, S. Yavuz, A. Celebi, and O. Urban, “One-dimensional primary research interests are Design of FPGA
filtering based two-bit transform and its efficient hardware architecture based reconfigurable systems for multimedia, signal
for fast motion estimation,” IEEE Trans. Consum. Electron., vol. 63, processing, and evolutionary computing.
no. 4, pp. 377–385, Nov. 2017.
