VLSI Circuits and Embedded Systems
Very Large-Scale Integration (VLSI) creates an integrated circuit (IC) by combining thousands of transis-
tors into a single chip. While designing a circuit, reduction of power consumption is a great challenge.
VLSI designs reduce the size of circuits which eventually reduces the power consumption of the devices.
However, it increases the complexity of the digital system. Therefore, computer-aided design tools are in-
troduced into hardware design processes.
Unlike a general-purpose computer, which is engineered to manage a wide range of processing tasks, an embedded system is designed for dedicated functions. Single or multiple processing cores manage embedded systems in the form of microcon-
trollers, digital signal processors, field-programmable gate arrays, and application-specific integrated cir-
cuits. Security threats have become a significant issue, since most embedded systems are even less secure than personal computers. Many embedded-systems hacking tools are readily available on the internet; hacking of PDAs and modems is a common example of embedded-systems hacking.
This book explores the designs of VLSI circuits and embedded systems. These two vast topics are divided
into four parts. In the book’s first part, Decision Diagrams (DDs) are covered. DDs are extensively used in Computer-Aided Design (CAD) software for circuit synthesis and formal verification. The
book’s second part mainly covers the design architectures of Multiple-Valued Logic (MVL) Circuits. MVL
circuits offer several potential opportunities to improve present VLSI circuit designs. The book’s third part
deals with Programmable Logic Devices (PLD). PLDs can be programmed to incorporate a complex logic
function within a single IC for VLSI circuits and Embedded Systems. The fourth part of the book concen-
trates on the design architectures of Complex Digital Circuits of Embedded Systems. As a whole, from
this book, core researchers, academicians, and students will get the complete picture of VLSI Circuits and
Embedded Systems and their applications.
VLSI Circuits and Embedded
Systems
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowl-
edged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003269182
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To my beloved parents and also to my beloved wife, daughter & son,
who made it possible for me to write this book
Contents
Preface xxxiii
Acknowledgments xxxvii
Acronyms xxxix
Introduction xliii
Part 1 3
1.1 INTRODUCTION 5
1.2 PRELIMINARIES 6
1.2.1 Shared Multi-Terminal Binary Decision Diagrams 8
1.3 AN OPTIMIZATION ALGORITHM FOR SMTBDD(K )S 12
1.3.1 The Weight Calculation Procedure 13
1.3.2 Optimization of SMTBDD(3)s 15
1.4 SUMMARY 16
2.1 INTRODUCTION 19
2.1.1 Basic Definitions 20
2.2 BINARY DECISION DIAGRAMS FOR MULTIPLE-OUTPUT
FUNCTIONS 21
3.1 INTRODUCTION 35
3.2 DECISION DIAGRAMS 36
3.2.1 Binary Decision Diagrams 37
3.2.2 Multiple-Valued Decision Diagrams 37
3.2.2.1 Shared Multiple-Valued Decision Diagrams 37
3.3 CONSTRUCTION OF COMPACT SMDDS 39
3.3.1 Pairing of Binary Input Variables 39
3.3.1.1 The Method 39
3.3.2 Ordering of Input Variables 43
3.4 SUMMARY 44
4.1 INTRODUCTION 47
4.2 BASIC PROPERTIES 49
4.3 MULTIPLE-VALUED DECISION DIAGRAMS 49
4.3.1 Size of MDDs 49
4.4 MINIMIZATION OF MDDS 54
5.1 INTRODUCTION 61
5.2 DECISION DIAGRAMS FOR MULTIPLE-OUTPUT FUNCTIONS 62
5.2.1 Shared Binary Decision Diagrams 62
5.2.2 Shared Multiple-Valued Decision Diagrams 63
5.2.3 Shared Multi-Terminal Multiple-Valued Decision Diagrams 63
5.3 TDM REALIZATIONS 65
5.3.1 TDM Realizations Based on SBDDs 65
5.3.2 TDM Realizations Based on SMDDs 66
5.3.3 TDM Realizations Based on SMTMDDs 68
5.3.4 Comparison of TDM Realizations 68
5.4 REDUCTION OF SMTMDDS 69
5.5 UPPER BOUNDS ON THE SIZE N OF DDS 70
5.6 SUMMARY 70
6.1 INTRODUCTION 73
6.2 DEFINITIONS AND BASIC PROPERTIES 75
6.3 DECISION DIAGRAMS 76
6.3.1 2-Valued Pseudo-Kronecker Decision Diagrams 76
6.3.2 Multiple-Valued Pseudo-Kronecker Decision Diagrams 77
6.4 OPTIMIZATION OF 4-VALUED PKDDS 77
6.4.1 Pairing of 2-Valued Input Variables 77
6.4.2 Ordering of 4-Valued Variables 79
6.4.3 Selection of Expansions 81
6.5 SUMMARY 82
Part 2 85
7.1 INTRODUCTION 87
7.1.1 Realization of Multiple Valued Flip-Flops Using Pass Transistor
Logic 87
7.1.2 Implementation of MVFF with Binary Coding and Decoding
Using PTL 88
7.2 MVFF WITHOUT BINARY ENCODING OR DECODING 90
7.2.1 Properties of Pass Transistor and a Threshold Gate 90
7.2.2 Realization of Multiple-Valued Inverter Using Threshold Gates 92
7.2.3 Realization MVFF Using Multiple-Valued Pass Transistor Logic 93
7.3 SUMMARY 95
8.1 INTRODUCTION 97
8.2 BASIC DEFINITIONS AND TERMINOLOGIES 98
8.3 THE METHOD 98
8.3.1 Conversion of Binary Logic Functions into MVL Functions 99
8.3.2 Pairing of the Functions 100
8.3.3 Output Stage 101
8.3.4 Basic Circuit Structure and Operation 101
8.3.4.1 Literal Generation 104
8.4 SUMMARY 105
Part 3 145
15.3 THE MULTIPLIER CIRCUIT USING THE LUT MERGING THEOREM 186
15.4 SUMMARY 194
Part 4 323
Index 461
List of Figures
3.1 SMDD. 36
3.2 A Multiplexer-Based Network Corresponding to the SMDD in Fig. 3.1. 36
3.3 SMDD for Functions ( f0, f1 ) = (x1, x2 ). 38
10.1 (a) Feynman Gate, (b) Fredkin Gate, (c) Toffoli Gate, and (d) New Gate. 123
10.2 Basic Logic Operations Using Reversible Gates. 124
10.3 Multiple-Valued Fredkin Gate (MVFG). 124
10.4 Implementation of Min and Max Operation Using MVFGs. 129
10.5 Complementing a Ternary Variable Using MVFGs. 129
10.6 Basic Cell. 130
10.7 Systolic Array for Composition of Fuzzy Relation. 130
11.1 Block Diagram of General 2-input r -valued LUT Logic Function. 139
11.2 (a) 3-valued 2-variable MVL Truth Table and its Direct Realization Using
LUT. (b) Kleenean Coefficient considered. 139
11.3 Circuit Diagram of a Current-Mode Literal 1 A1 . 140
11.4 Literal Generator Conceptualization. 140
11.5 Literal Generator Circuit. 141
13.1 The Example PLA whose Input Decoder is Augmented using Extra Pass
Transistor. 159
13.2 (a) Response of the Bit Decoder (b) Input of the Decoder to Place 0c on
Both the Bit Lines. 159
13.3 An Example PLA Showing Product Lines Satisfying Introduced Condition. 160
16.1 Example Demonstration of the BCD Addition Algorithm for C3i = 0. 200
16.2 Example Demonstration of the BCD Addition Algorithm for C3i = 1. 201
16.3 Tree Structure Representation of the BCD Addition Method. 201
16.4 1-Digit BCD Adder Circuit. 204
16.5 Block Diagram of the N -Digit BCD Adder Circuit. 205
18.2 Full Adder Circuit Critical Path Delay Determination of Full Adder Circuit. 218
18.3 Memristor. 219
18.4 2-Input LUT Circuit Architecture. 222
18.5 5-Input LUT Circuit. 224
18.6 6-Input LUT Circuit. 225
18.7 BCD Multiplier Circuit for 1-Digit Multiplication. 227
18.8 BCD Multiplier Circuit Design Using Virtex 5/6 Slice for 1-Digit. 228
18.9 Block Diagram of the BCD Multiplier for N×M -Digit Multiplication. 229
18.10 Merging LUTs in N×M Multiplication. 230
20.1 Demonstration of the BCD Addition Algorithm Exhibited in Example 21.1. 290
20.2 Architecture of the 2-Input LUT. 292
20.3 Architecture of the 6-Input LUT. 293
20.4 The BCD Adder: (a) Block Diagram of 1-Digit BCD Adder, (b) 1-Digit
BCD Adder, (c) Block Diagram of the n-Digit BCD Adder. 296
25.1 Neural Network Prototype for AHSD Number System Addition. 398
25.2 N -Bit Adder Generalization. 401
List of Tables

8.1 Truth Tables with (a) Binary Form and (b) Multi-Valued Form 99
8.2 Basic Circuit Operation for Circuit in Fig. 8.1 102
8.3 Operation of the Circuit of Fig. 8.2 102
8.4 (a) Truth Table for Example 8.7, and Tables when (a) A = 0, (b) B = Y_2^{0,2,3} + Y_1^{1,3}Y_2^{1}, (c) C = Y_2^{2} + Y_2^{3}Y_1^{1,2,3} + Y_1^{0,1,3}Y_2^{0}, (d) D = Y_1^{2}Y_2^{0} + Y_1^{0}Y_2^{3} 103
20.1 Truth Table of 3-bit Addition with Pre-processing and Addition of 3 289
20.2 Read and Write Scheme using the Introduced Approach 294
Very Large Scale Integration (VLSI) is one of the most widely used technologies for
microchip processors, integrated circuits (ICs) and component design. It was initially intended to support thousands of transistor gates on a microchip, but modern chips now integrate several billion transistors. All of these transistors are embedded within a microchip that has shrunk over time yet still holds an enormous number of transistors. In VLSI circuits, the integration of billions of transistors improves the de-
sign methodology which also ensures higher operating speed, lower power consumption,
smaller circuit size, higher reliability and lower manufacturing cost. VLSI chips are widely
used in various branches of Engineering such as Voice and Data Communication networks,
Digital Signal Processing, Computers, Commercial Electronics, Automobiles, and Embed-
ded Systems. The relevance of VLSI in high-performance computing, telecommunications, and consumer electronics has been expanding progressively, and at a very rapid pace.
An embedded system is a microprocessor or microcontroller-based system of hard-
ware and software designed to perform dedicated functions within a larger mechanical or
electrical system. The embedded system is unlike the general-purpose computer which is
engineered to manage a wide range of processing tasks. Because an embedded system is engineered to perform certain tasks only, design engineers can optimize its size, cost, power consumption, reliability and performance. Embedded systems are typically produced at large scale and share functionalities across a variety of environments and applications.
The complexity of an embedded system varies significantly depending on the task for which
it is designed. Embedded system applications range from digital watches and microwaves to
hybrid vehicles and avionics. As much as 98 percent of all microprocessors manufactured
are used in embedded systems. Embedded Systems are convenient for mass production
which results in lower price per piece. They are highly stable, reliable, very small in size
and hence they can be carried and loaded anywhere. They are also very fast and consume
less power. In addition, they optimize the use of available resources. For these reasons, embedded systems are becoming more popular day by day.
This book mainly covers two extensive topics: VLSI circuits and embedded systems.
These two topics are further divided into four parts: Decision Diagrams, Design Archi-
tectures of Multiple-Valued Logic Circuits, Programmable Logic Devices, and Design
Architectures of Digital Circuits. The Decision Diagram part mainly covers various types
of Decision Diagrams (DDs) such as Binary Decision Diagrams (BDD), Shared Multi-
Terminal Binary Decision Diagrams (SMTBDD), complexities of different types of BDDs,
Multiple-Output Functions using BDD for Characteristic Functions, Shared Multiple-
Valued DDs for Multiple-Output Functions, Minimization techniques of Multiple-Valued
DDs, Time Division Multiplexing (TDM) Realizations of Multiple-Output Functions based
Author Bio
Dr. Hafiz Md. Hasan Babu is currently working as Dean of the Faculty of Engineering
and Technology, as well as a Professor in the Department of Computer Science and En-
gineering of the University of Dhaka, Bangladesh. He is also the former Chairman of the
same Department. From July 13, 2016 to July 12, 2020, he was the Pro-Vice-Chancellor
of National University, Bangladesh, where he worked on deputation from the Department
of Computer Science and Engineering, University of Dhaka, Bangladesh. For his excel-
lent academic and administrative capability, he also served as the Professor and Founding
Chairman of the Department of Robotics and Mechatronics Engineering, University of
Dhaka, Bangladesh. He served as a World Bank senior consultant and general manager
of the Information Technology & Management Information System Departments of Janata
Bank Limited, Bangladesh. Dr. Hasan Babu was the World Bank resident information tech-
nology expert of the Supreme Court Project Implementation Committee, Supreme Court
of Bangladesh. He was also the information technology consultant of Health Economics
Unit and Ministry of Health and Family Welfare in the project “SSK (Shasthyo Shurokhsha
Karmasuchi) and Social Health Protection Scheme” under the direct supervision and fund-
ing of German Financial Cooperation through KfW. Professor Dr. Hafiz Md. Hasan Babu
received his M.Sc. degree in Computer Science and Engineering from the Brno University
of Technology, Czech Republic, in 1992 under the Czech Government Scholarship. He ob-
tained the Japanese Government Scholarship to pursue his PhD from the Kyushu Institute of
Technology, Japan, in 2000. He also got DAAD (Deutscher Akademischer Austauschdienst)
Fellowship from the Federal Republic of Germany.
Professor Dr. Hasan Babu is a very eminent researcher. He received best paper awards at three reputed international conferences. In recognition of his valuable
contributions in the field of Computer Science and Engineering, he received the Bangladesh
Academy of Sciences Dr. M.O. Ghani Memorial Gold Medal Award for the year 2015, which
is one of the most prestigious research awards in Bangladesh. He was also awarded the
UGC (University Grants Commission of Bangladesh) Gold Medal Award-2017 for his
outstanding research contributions in computer science and engineering. He has written
more than 100 research articles published in reputed international journals (IET Computers
& Digital Techniques, IET Circuits and Systems, IEEE Transactions on Instrumentation
and Measurement, IEEE Transactions on VLSI Systems, IEEE Transactions on Computers,
Elsevier Journal of Microelectronics, Elsevier Journal of Systems Architecture, Springer
Journal of Quantum Information Processing, etc.) and has presented at international conferences.
According to Google Scholar, Prof. Hasan has already received around 1332 citations with
h-index 17 and i10-index 31. He is a regular reviewer of reputed international journals
and international conferences. He presented invited talks and chaired scientific sessions or
worked as a member of the organizing committee or international advisory board in many
international conferences held in different countries. For his excellent research record, he
has also been appointed as the associate editor of IET Computers and Digital Techniques,
published by the Institution of Engineering and Technology of the United Kingdom.
Professor Dr. Hasan Babu was appointed as a member of the prime minister’s ICT Task
Force Committee, Government of the People’s Republic of Bangladesh in recognition of his
national and international level contributions in Engineering Sciences. He is currently the
president of Bangladesh Computer Society and also the president of International Internet
Society, Bangladesh Chapter. He has been recently appointed as a part-time member of
Bangladesh Accreditation Council of the Government of People’s Republic of Bangladesh
to ensure the quality of higher education in Bangladesh.
Acknowledgments
I would like to express my sincerest gratitude and special appreciation to the researchers
and my beloved students who are working in the field of VLSI Circuits and Embedded
Systems. The contents of this book have been compiled from a wide variety of research
works which are listed at the end of each chapter of this book.
I am grateful to my parents and family members for their endless support. Most of all,
I want to thank my wife Mrs. Sitara Roshan, daughter Ms. Fariha Tasnim, and son Md.
Tahsin Hasan for their invaluable cooperation in completing this book.
Finally, I am also thankful to Dr. A S M Touhidul Hasan and Md. Solaiman Mia who
have provided their support and important time to finish this book.
Acronyms
CF Carry-free
GA Genetic Algorithm
IP Intellectual Property
KL Kernighan-Lin
NN Neural Network
SOC System-On-Chip
Introduction
hardware, such as electrical and mechanical components. The embedded system is unlike
the general-purpose computer, which is engineered to manage a wide range of processing
tasks. Because an embedded system is engineered to perform certain tasks only, design
engineers may optimize size, cost, power consumption, reliability and performance. Em-
bedded systems are typically produced on broad scales and share functionalities across a
variety of environments and applications. Embedded systems are managed by single or mul-
tiple processing cores in the form of microcontrollers or digital signal processors (DSP),
field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC),
and gate arrays. These processing components are integrated with components dedicated
to handling electric and/or mechanical interfacing.
An embedded system’s key feature is a dedication to specific functions that typically
require strong general-purpose processors. For example, router and switch systems are
embedded systems, whereas a general-purpose computer uses a proper OS for routing
functionality. However, embedded routers perform functions more efficiently than OS-
based computers for routing functionalities. Commercial embedded systems range from
digital watches and MP3 players to giant routers and switches. Complexities vary from
single processor chips to advanced units with multiple processing chips.
An embedded system is a kind of computer system mainly designed to perform tasks such as accessing, processing, storing, and controlling data in various electronics-based systems. Embedded systems are a combination of hardware and software, where the software, usually known as firmware, is embedded into the hardware. One of the most important characteristics of these systems is that they produce their output within defined time limits.
Embedded systems help make work more accurate and convenient, and they have become part and parcel of daily life. People are very familiar with the term “Smart Home” because of the deployment of smart embedded systems in the home. Nowadays almost all embedded systems are connected to the Internet, so security threats have become a major issue, because most embedded systems are even less secure than personal computers. One of the
reasons for this lack of security is the very limited hardware and software implementation
options available to embedded-system manufacturers. These manufacturers also have to compete on price with other embedded-system companies: they all try to keep the price as low as possible to maintain customer satisfaction, and at the same time they rarely conduct specific security research on their embedded products. This leads to security threats for embedded devices, because ensuring advanced security techniques for embedded systems increases the cost of the products. Customers are usually unwilling to pay more when buying an embedded device and are also not concerned about the probable security threats to their products. The lack of security analysis and the low-cost product mentality of the manufacturers give hackers exactly the environment they are looking for.
Many embedded-systems hacking tools are easily available on the internet. Hacking of PDAs and modems is a very common example of embedded-systems hacking.
This book focuses on VLSI circuits and embedded systems. These two vast topics of this
book are divided into four parts. Part-I named Decision Diagrams has six chapters in which
the methods and techniques of several decision diagrams such as shared multi-terminal bi-
nary decision diagrams (SMTBDDs), binary decision diagrams for characteristic functions
Part 1
Computers are used to solve many problems, such as scheduling a product line of cars in a factory, designing elevator maps in tall buildings, detecting diseases in DNA and, not least, helping you decide which movie to watch next. Problems become harder every day, but computer scientists keep designing faster algorithms to cope with the ever-growing amount of data. Decision Diagrams (DDs) have been used in computer science and Artificial Intelligence (AI) for decades for tasks such as logic circuit design and product configuration.
Binary Decision Diagrams (BDDs) are well known for their use in the logic area,
verification and model checking. Binary and Multi-valued Decision Diagrams (BDDs or
MDDs) are efficient data structures that represent functions or sets of tuples. An MDD,
defined over a fixed number of variables, is a layered rooted Directed Acyclic Graph (DAG).
It associates a variable with each of its layers. MDDs have an exponential compression
power and are widely used in problem solving. An MDD has a root node, and two potential
terminal nodes, the true terminal node, and the false terminal node. Each node, associated
with a variable, can have at most as many outgoing arcs as there are values in the domain of
the variable, and the arcs are labeled by these values. The label vectors of the valid path’s
arcs represent the valid tuples.
Multi-valued Decision Diagrams (MDDs) are graph structures that are becoming the state-of-the-art means for representing both binary and multiple-valued logic functions. A
number of research studies on MDDs appeared recently in the literature. These include some
of the issues concerning the implementation of an MDD package. An unrolled automaton can be seen as a non-reduced MDD. Furthermore, BDDs and MDDs are increasingly used in optimization. During the last few decades, many works have shown how to use them efficiently to model and solve several optimization problems. An advantage of MDDs is that they have a fixed number of variables and often a strong compression ratio. However, MDDs can have exponential size, and this does occur in practice.
Two-level expressions of multiple-valued logic functions and their minimizations have
been a subject of active research for many years. This problem is important because it
provides a means for optimizing the implementations of circuits that are direct translations
of two-level expressions. Thus, two-level logic representations have direct impact on macro-
cell design styles using programmable logic arrays (PLAs). The Reed-Muller canonical form
can be extended to multiple-valued logic in several ways, depending on how its operations
are generalized. Many extensions have been suggested. In these extensions, the AND and XOR operations (which are equivalent to multiplication and addition modulo 2, respectively) are generalized to multiplication and addition. Decomposition of switching functions is an
important task, since when decomposition is possible, it leads to many advantages in
network synthesis. At the same time, this is a difficult task.
CHAPTER 1

Shared Multi-Terminal Binary Decision Diagrams
Efficient representations of logic functions are very important in logic design. Various
methods exist to represent logic functions. Among them, graph-based representations such
as BDDs (binary decision diagrams) are extensively used in logic synthesis, test, and
verification. In logic simulation, the BDD-based methods offer orders-of-magnitude po-
tential speedup over traditional logic simulation methods. In real life, many practical logic
functions are multiple-output.
This chapter describes a method to represent m output functions using shared multi-
terminal binary decision diagrams (SMTBDDs). The SMTBDD(k) consists of multi-
terminal binary decision diagrams (MTBDDs), where each MTBDD represents k output
functions. An SMTBDD(k) is the generalization of shared binary decision diagrams (SB-
DDs) and MTBDDs: for k = 1, it is an SBDD, and for k = m, it is an MTBDD. The
size of a BDD is the total number of nodes. The features of SMTBDD(k)s are: (1) They
are often smaller than SBDDs or MTBDDs; (2) They evaluate k outputs simultaneously.
An algorithm is also described in this chapter for grouping output functions to reduce the
size of SMTBDD( k )s. An SMTBDDmin denotes the smaller of an SMTBDD(2) and an SMTBDD(3), i.e., whichever has fewer nodes.
1.1 INTRODUCTION
Three different approaches are considered in this chapter to represent multiple-output func-
tions using BDDs: shared binary decision diagrams (SBDDs), multi-terminal binary deci-
sion diagrams (MTBDDs), and shared multi-terminal binary decision diagrams (SMTB-
DDs). SMTBDDs are the generalization of SBDDs and MTBDDs. A general structure
of an SMTBDD( k ) with k = 3 is shown in Fig. 1.1. The evaluation of outputs using an
SMTBDD( k ) is k times faster than an SBDD since it evaluates k outputs at the same time.
For most functions, SMTBDD( k )s are smaller than the corresponding MTBDDs. In
modern LSI, the reduction of the number of pins is not so easy, even though the integra-
tion of more gates may be possible. The time division multiplexing (TDM) realizations of
multiple-output networks from SMTBDDs are useful to reduce the number of pins as well
as to reduce hardware. SMTBDD( k )s are also helpful for look-up table type FPGA design,
logic simulation, etc. SMTBDD(3)s are considered in this chapter.
1.2 PRELIMINARIES
This section shows the definitions and properties of multiple-output functions and
SMTBDD( k )s.
Property 1.2.1 Let B = {0, 1}. A multiple-output logic function f with n input variables x1, . . ., xn and m output variables y1, . . ., ym is a function f : B^n → B^m, where x = (x1, . . ., xn) ∈ B^n is an input vector and y = (y1, . . ., ym) ∈ B^m is an output vector of f.
Table 1.1: 2-Input 6-Output Function

  Input        Output
  x1 x2    f0 f1 f2 f3 f4 f5
   0  0     0  1  0  0  1  1
   0  1     0  1  0  0  1  1
   1  0     1  1  1  0  1  0
   1  1     0  1  1  1  0  1
Property 1.2.2 Let F(a) = (f0(a), f1(a), . . ., fm−1(a)) be the output vector of the m functions for an input a = (a1, a2, . . ., an) ∈ B^n. Two output vectors F(ai) and F(aj) are distinct iff F(ai) ≠ F(aj). Let r be the number of distinct output vectors in F(a) = (f0(a), f1(a), . . ., fm−1(a)).
Example 1.2 Consider the 2-input 6-output function in Table 1.1. The distinct output
vectors are (0, 1, 0, 0, 1, 1), (1, 1, 1, 0, 1, 0), and (0, 1, 1, 1, 0, 1). Therefore, the number of
distinct output vectors is three, i.e., r = 3.
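For illustration, the following short Python sketch (not part of the original text; the dictionary encoding of the truth table is only an assumed representation) counts the distinct output vectors of the function in Table 1.1, reproducing r = 3 from Example 1.2.

# Table 1.1 as a truth table: input pair (x1, x2) -> output vector (f0..f5).
TABLE_1_1 = {
    (0, 0): (0, 1, 0, 0, 1, 1),
    (0, 1): (0, 1, 0, 0, 1, 1),
    (1, 0): (1, 1, 1, 0, 1, 0),
    (1, 1): (0, 1, 1, 1, 0, 1),
}

# r is the number of distinct output vectors F(a) over all inputs a (Property 1.2.2).
r = len(set(TABLE_1_1.values()))
print(r)  # 3, as computed in Example 1.2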
Property 1.2.3 Let f0, f1, . . ., fm−1 (fi ≠ 0) be mutually disjoint, i.e., fi · fj = 0 for i ≠ j. Then, the number of distinct output vectors for F(x) = (f0(x), f1(x), . . ., fm−1(x)) is m or m + 1.
Proof 1.1 Since f0, f1, . . ., fm−1 are disjoint, in the vector F(x) = (f0(x), f1(x), . . ., fm−1(x)) at most one output fi(x) (i = 0, 1, . . ., m − 1) is one and the others are zero. Therefore, the number of distinct output vectors is at least m. On the other hand, when the all-zero output vector also occurs, the number of distinct output vectors is m + 1.
Example 1.3 Consider the 2-input m-output function in Table 1.2, where m = 3. The
distinct output vectors for the functions f0, f1, and f2 are (1, 0, 0), (0, 1, 0), and (0, 0, 1).
So, the number of distinct output vectors is m. Now, consider the 2-input 3-output function
in Table 1.3. The distinct output vectors in f0, f1 , and f2 are (1, 0, 0), (0, 1, 0), (0, 0, 1), and
(0, 0, 0). In this case, the number of distinct output vectors is m + 1.
Table 1.2: 2-Input 3-Output Function with Three Distinct Output Vectors

  Input     Output
  x1 x2    f0 f1 f2
   0  0     1  0  0
   0  1     0  1  0
   1  0     0  0  1
   1  1     0  1  0
Table 1.3: 2-Input 3-Output Function with Four Distinct Output Vectors

  Input     Output
  x1 x2    f0 f1 f2
   0  0     1  0  0
   0  1     0  1  0
   1  0     0  0  1
   1  1     0  0  0
Property 1.2.4 Let f be a function. The set of input variables on which f depends is the
support of f, and is denoted by support(f). The size of the support is the number of variables
in the support(f).
Example 1.4 Table 1.1 shows a 2-input 6-output function. An SMTBDD(3) can be con-
structed with the groupings [f0 , f1 , f2 ] and [f3 , f4 , f5 ]. support(f0 , f1 , f2 ) = {x1 , x2 }, and
support(f3 , f4 , f5 ) = {x1 , x2 }. Thus, the sizes of the supports are 2.
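The support of a group of outputs can be computed directly from the truth table; the sketch below is an illustrative helper (not the book's code, and the table encoding is an assumption): a group depends on x_j iff complementing x_j changes the group's output vector for some input, as in Property 1.2.4.

from itertools import product

TABLE_1_1 = {
    (0, 0): (0, 1, 0, 0, 1, 1), (0, 1): (0, 1, 0, 0, 1, 1),
    (1, 0): (1, 1, 1, 0, 1, 0), (1, 1): (0, 1, 1, 1, 0, 1),
}

def support(table, n, group):
    """Return the set of variable indices (1-based) the output group depends on."""
    dep = set()
    for x in product((0, 1), repeat=n):
        for j in range(n):
            y = list(x); y[j] ^= 1        # complement x_j
            if tuple(table[x][g] for g in group) != tuple(table[tuple(y)][g] for g in group):
                dep.add(j + 1)
    return dep

print(support(TABLE_1_1, 2, [0, 1, 2]))  # {1, 2}, as in Example 1.4
print(support(TABLE_1_1, 2, [3, 4, 5]))  # {1, 2}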
Property 1.2.5 The size of the BDD, denoted by size(BDD), is the total number of terminal
and non-terminal nodes. In the case of SBDDs and SMTBDD( k )s, the sizes include the
nodes for the output selection variables.
Figure 1.2: SMTBDD(2) with Groupings [f0 , f1 ], [f2 , f3 ], and [f4 , f5 ] for the Functions in
Table 1.1.
Let [ fi, fj ] be a pair of output functions, where i ≠ j. The following two techniques are used to reduce the number of nodes in the SMTBDD( k )s:
Figure 1.3: SMTBDD(3) with Groupings [ f0, f1, f2 ], and [ f3, f4, f5 ] for the Functions in
Table 1.1.
1. In general, an MTBDD for two outputs has four terminal nodes [0, 0], [0, 1], [1, 0], and [1, 1]. However, if fi fj = 0, then [1, 1] never appears as a terminal node in the MTBDD of an SMTBDD(2). Thus, this pairing of output functions tends to produce a smaller BDD, since the number of terminal nodes is at most three. Similarly, if fi'fj' = 0, fi'fj = 0, or fi fj' = 0, then [ fi, fj ] is also a candidate for a pair (a sketch of this disjointness check is given after the note below).
Note that these two techniques are also applicable to SMTBDD( k )s with k ≥ 3.
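A minimal sketch of the disjointness test behind technique 1 (illustrative only; the truth-table encoding is an assumption): two outputs are candidates for pairing if their product is the constant-0 function.

TABLE_1_1 = {
    (0, 0): (0, 1, 0, 0, 1, 1), (0, 1): (0, 1, 0, 0, 1, 1),
    (1, 0): (1, 1, 1, 0, 1, 0), (1, 1): (0, 1, 1, 1, 0, 1),
}

def disjoint(table, i, j):
    """True iff f_i * f_j = 0, i.e., the two outputs are never 1 simultaneously."""
    return all(out[i] * out[j] == 0 for out in table.values())

pairs = [(i, j) for i in range(6) for j in range(i + 1, 6) if disjoint(TABLE_1_1, i, j)]
print(pairs)  # [(0, 3), (0, 5), (3, 4)] -- candidate pairs for Table 1.1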
Property 1.2.6 Let an SMTBDD( k ) consist of two MTBDDs: MTBDD1 and MTBDD2. MTBDD1 and MTBDD2 are disjoint iff they do not share any non-terminal nodes with each other in the SMTBDD( k ).
Example 1.5 In Fig. 1.3, there are two disjoint MTBDDs for groupings [ f0, f1, f2 ], and [f3 ,
f4 , f5 ].
Property 1.2.7 Let SMTBDD1 and SMTBDD2 be SMTBDD( k )s. Let SMTBDD1 consist of MTBDD1 and MTBDD2, and let SMTBDD2 consist of MTBDD3 and MTBDD4. If all the MTBDDs are disjoint from each other, size(MTBDD1) = size(MTBDD3), and size(MTBDD2) = size(MTBDD4), then size(SMTBDD1) = size(SMTBDD2).
Proof 1.2 The number of terminal nodes in an SMTBDD( k ) is equal to the number of distinct output vectors; hence, an SMTBDD(2) and an SMTBDD(3) have the same number of terminal nodes iff the numbers of distinct output vectors in both SMTBDDs are the same.
Example 1.6 Figs. 1.2 and 1.3 show an SMTBDD(2) and an SMTBDD(3) for the functions in Table 1.1, respectively. The numbers of terminal nodes in both SMTBDDs are the same, since the numbers of distinct output vectors are also the same, i.e., 4.
Proof 1.3 No more than r^{2^n} nodes are needed; otherwise, two nodes would represent the same function and could be combined. No fewer than r^{2^n} nodes are used, because there are that many functions.
Property 1.2.10 Let r be the number of distinct output vectors of an n-input m-output function. Then, the size of the MTBDD is at most min_{k=1}^{n} {2^{n−k} − 1 + r^{2^k}}.
Proof 1.4 Consider the MTBDD in Fig. 1.4, where the upper block is a binary decision tree of (n − k) variables, and the lower block generates all the functions of k or fewer variables. The binary decision tree of (n − k) variables has 1 + 2 + 4 + · · · + 2^{n−k−1} = 2^{n−k} − 1 nodes. By Property 1.2.9, the MTBDD of k-variable functions with r distinct output vectors has r^{2^k} nodes. Thus, the size of the MTBDD is at most min_{k=1}^{n} {2^{n−k} − 1 + r^{2^k}}. Note that this upper bound on the size of the MTBDD is used in Algorithm 1.1.
Property 1.2.11 Let m1 be the total number of groups of m distinct output functions. Then,
the number of nodes for the output selection variables in the SMTBDD is m1 – 1.
Example 1.7 Consider the SMTBDD in Fig. 1.3, where the total number of groups is
two, i.e., [ f0, f1, f2 ], and [ f3, f4, f5 ]. Therefore, the number of nodes for the output selection
variables in the SMTBDD is one.
Property 1.2.12 Let m1 be the total number of groups of m distinct output functions and f : {0, 1}^n → {0, 1, . . ., r − 1}^m. Then, f can be represented by an SMTBDD with at most min_{k=1}^{n} {m1 · 2^{n−k} − 1 + r^{2^k}} nodes.
Proof 1.5 Consider the mapping f : {0, 1}^n → {0, 1, . . ., r − 1}^m, where r is the number of terminal nodes. In the SMTBDD in Fig. 1.5, the upper block selects the m1 groups of the m output functions, the middle block consists of binary decision trees of (n − k) variables, and the lower block generates all the functions of k variables by an MTBDD with r terminal nodes. By Property 1.2.11, the upper block requires (m1 − 1) nodes to select the m1 groups. By Property 1.2.9, the lower block requires r^{2^k} nodes. Now, consider the middle block: the m1 binary decision trees of (n − k) variables require m1(2^{n−k} − 1) nodes. Therefore, the total number of nodes is at most

(m1 − 1) + m1(2^{n−k} − 1) + r^{2^k} = m1 · 2^{n−k} − 1 + r^{2^k},

minimized over k, i.e., min_{k=1}^{n} {m1 · 2^{n−k} − 1 + r^{2^k}}.
1.3 AN OPTIMIZATION ALGORITHM FOR SMTBDD(k)S

Property 1.3.1 A clique of a graph is a set of vertices such that every pair is connected by an edge.
Property 1.3.2 Let G = (V, E) be a graph, where V and E denote a set of vertices and a
set of edges, respectively. A clique cover of G is a partition of V such that each set in the
partition is a clique.
Property 1.3.3 Let G = (V, E) be a graph. Then, G is a clique weighted graph iff each
subset of vertices of G has a weight and each vertex pair is connected by an edge.
Property 1.3.4 Figure 1.6 is an example of a clique weighted graph. For simplicity, only
two cliques are shown with their weights. The weights of the cliques c0 and c1 are w0 and
w1 , respectively.
Problem: Given a clique weighted graph G = (V, E), find the clique cover such that the
sum of weights of the cliques in the clique cover is minimum.
Note that the weight corresponds to the upper bound on the size of the MTBDD, and the
minimum weighted clique cover corresponds to the groupings of outputs that have small
size, though sometimes they are not minimum.
Property 1.3.5 Let F = {f0, f1, . . ., fm−1} be a set of m output functions. Let Fi (i = 1, 2, . . ., s) be subsets of F. {F1, F2, . . ., Fs} is called a partition of F if ∪_{i=1}^{s} Fi = F, Fi ∩ Fj = ∅ for i ≠ j, and Fi ≠ ∅ for every i. Henceforth, each of F1, F2, . . ., Fs is called a group of output functions. Note that each vertex in the clique weighted graph represents an output function, and each group and each partition of output functions correspond to a clique and a clique cover, respectively.
Example 1.10 Let F = {f0, f1, f2, f3, f4, f5} be the set of outputs of a 6-output function. Then, the partitions of F into groups of three are as follows:
{[ f0, f1, f2 ], [ f3, f4, f5 ]}, {[ f0, f1, f3 ], [ f2, f4, f5 ]}, {[ f0, f1, f4 ], [ f2, f3, f5 ]}, {[ f0, f1, f5 ], [ f2, f3, f4 ]}, {[ f0, f2, f3 ], [ f1, f4, f5 ]}, {[ f0, f2, f4 ], [ f1, f3, f5 ]}, {[ f0, f2, f5 ], [ f1, f3, f4 ]}, {[ f0, f3, f4 ], [ f1, f2, f5 ]}, {[ f0, f3, f5 ], [ f1, f2, f4 ]}, and {[ f0, f4, f5 ], [ f1, f2, f3 ]}.
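The 10 partitions of Example 1.10 can be enumerated mechanically; the following sketch (not from the book) fixes f0's group so that each partition is generated only once.

from itertools import combinations

outputs = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
partitions = []
for group in combinations(outputs, 3):
    if 'f0' in group:                     # f0 is always in the first group
        partitions.append((list(group), [f for f in outputs if f not in group]))

print(len(partitions))                    # 10 partitions, as in Example 1.10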
Property 1.3.6 Let F be an n-input m-output function. The dependency matrix B = (bij) for F is a 0–1 matrix with m rows and n columns, where bij = 1 iff fi depends on xj, and bij = 0 otherwise, with i = 0, 1, . . ., m − 1 and j = 1, 2, . . ., n.
Example 1.11 Consider the 4-input 6-output function

f0(x1, x2, x3, x4) = x2 x3, f1(x1, x2, x3, x4) = x1 x4 ∨ x2, f2(x1, x2, x3, x4) = x1 ∨ x3, f3(x1, x2, x3, x4) = x3, f4(x1, x2, x3, x4) = x1 ∨ x3 x4, and f5(x1, x2, x3, x4) = x4.

The dependency matrix is

        x1 x2 x3 x4
  f0     0  1  1  0
  f1     1  1  0  1
  f2     1  0  1  0
  f3     0  0  1  0
  f4     1  0  1  1
  f5     0  0  0  1
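The dependency matrix can be generated directly from the function definitions; the sketch below is an illustration (not the book's code) that reproduces B for Example 1.11.

from itertools import product

F = [
    lambda x1, x2, x3, x4: x2 & x3,            # f0
    lambda x1, x2, x3, x4: (x1 & x4) | x2,     # f1
    lambda x1, x2, x3, x4: x1 | x3,            # f2
    lambda x1, x2, x3, x4: x3,                 # f3
    lambda x1, x2, x3, x4: x1 | (x3 & x4),     # f4
    lambda x1, x2, x3, x4: x4,                 # f5
]

def depends(f, j, n=4):
    """1 iff f changes when input x_{j+1} is complemented for some assignment."""
    for x in product((0, 1), repeat=n):
        y = list(x); y[j] ^= 1
        if f(*x) != f(*y):
            return 1
    return 0

B = [[depends(f, j) for j in range(4)] for f in F]
for row in B:
    print(row)   # rows f0..f5, columns x1..x4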
Example 1.12 Consider the 6-output function in Example 1.11. The group-dependency
matrix A is given as follows:
Note that the row for [ fi , f j , fk ] in A is equal to bit-wise OR of rows for fi , f j , and fk in
the dependency matrix B.
Property 1.3.8 Let r[ fi, fj, fk ] be the number of distinct output vectors for the group of outputs [ fi, fj, fk ]. Note that 1 ≤ r[ fi, fj, fk ] ≤ 8. r[ fi, fj, fk ] is equal to the number of non-zero functions among fi fj fk, fi'fj'fk, fi'fj fk', fi'fj fk, fi fj'fk', fi fj'fk, fi fj fk', and fi'fj'fk'.
Example 1.13 Consider the 6-output function in Example 1.11. There are 20 groups of
output functions. The number of distinct output vectors r[ fi , f j , fk ] for each group [ fi , f j , fk ]
is calculated as follows:
For r[ f0, f2, f3 ]: f0 f2 f3 = x1 x2 x3 ∨ x2 x3, f0 f2 f3' = 0, f0 f2' f3 = 0, f0 f2' f3' = 0, f0' f2 f3 = x1 x2' x3 ∨ x2' x3, f0' f2 f3' = x1 x2' x3' ∨ x1 x3', f0' f2' f3 = 0, and f0' f2' f3' = x1' x2' x3' ∨ x1' x3'.
Since the number of non-zero functions is four, r[ f0, f2, f3 ] = 4. Similarly, the number
of distinct output vectors can be calculated for other groups of output functions.
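Equivalently, r[ fi, fj, fk ] is the number of distinct triples (fi(x), fj(x), fk(x)) over all inputs; the short sketch below (illustrative only, not the book's code) confirms r[ f0, f2, f3 ] = 4 for Example 1.11.

from itertools import product

f0 = lambda x1, x2, x3, x4: x2 & x3
f2 = lambda x1, x2, x3, x4: x1 | x3
f3 = lambda x1, x2, x3, x4: x3

triples = {(f0(*x), f2(*x), f3(*x)) for x in product((0, 1), repeat=4)}
print(len(triples))   # 4, so r[f0, f2, f3] = 4, as in the text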
Property 1.3.9 Let s(i, j, k) be the size of the support for a group of output functions [ fi, fj, fk ]. The weight w(i, j, k) for [ fi, fj, fk ] is min_{t=0}^{n−1} {2^{s(i,j,k)−t} − 1 + (r[ fi, fj, fk ])^{2^t}}. This will be the weight of the clique in the clique weighted graph.
Example 1.14 Consider the 6-output function in Example 1.11. w(0, 2, 3) is calculated as
follows:
From Examples 1.11 and 1.13, it is known that r[ f0, f2, f3 ] = 4 and s(0, 2, 3) = 3. w(0, 2, 3) takes its minimum when t = 0. Therefore, w(0, 2, 3) = 2^3 − 1 + 4 = 11. Similarly, the weights for the other groups can be calculated. Since w(i, j, k) is an upper bound on the size of the MTBDD for [ fi, fj, fk ], the MTBDD with the minimum weight is relatively small.
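The weight of Property 1.3.9 is likewise a one-line computation; the sketch below (not from the book) reproduces w(0, 2, 3) = 11 from Example 1.14.

def weight(s, r, n=4):
    """min over t = 0..n-1 of  2**(s - t) - 1 + r**(2**t)  (Property 1.3.9)."""
    return min(2 ** (s - t) - 1 + r ** (2 ** t) for t in range(n))

print(weight(s=3, r=4))   # 11, attained at t = 0, as in Example 1.14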
1.4 SUMMARY
In this chapter, a method is introduced to represent multiple-output functions using SMTBDDs (Shared Multi-Terminal Binary Decision Diagrams). SMTBDD( k )s are not as large as MTBDDs (Multi-Terminal Binary Decision Diagrams), and their evaluation is k times faster than with a Shared BDD (SBDD), since k outputs are evaluated simultaneously. An algorithm is presented for grouping output functions to reduce the size of SMTBDD( k )s. By choosing between an SMTBDD(2) and an SMTBDD(3), a compact representation (SMTBDDmin) is also introduced, which denotes whichever of the two has fewer nodes. Thus, SMTBDDs compactly represent many multiple-output functions and are useful for TDM (time-division multiplexing) realizations of multiple-output networks, look-up table type FPGA design and logic simulation.
A multiple-output function can also be represented by a BDD for characteristic functions
(CFs). However, in most cases, BDDs for CFs are much larger than the corresponding
SBDDs. Moreover, if all the output functions depend on all the input variables, then the
size of the BDD for CFs is greater than that of the corresponding MTBDD. By dropping the above ordering restriction, the size of the BDD can be reduced; however, in most cases, the sizes of such BDDs are still larger than those of the corresponding SBDDs. In many cases, the sizes of SMTBDDs are not as large as those of BDDs for CFs.
REFERENCES
[1] S. B. Akers, “Binary decision diagrams”, IEEE Trans. Comput., vol. C-27, no. 6, pp.
509–516, 1978.
[2] P. Ashar and S. Malik, “Fast functional simulation using branching programs”, Pro-
ceedings of IEEE. International Conference on Computer-Aided Design, pp. 408–412,
1995.
[3] C. Scholl, R. Drechsler and B. Becker, “Functional simulation using binary deci-
sion diagrams”, Proceedings of IEEE. International Conference on Computer-Aided
Design, pp. 8–12, 1997.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE. International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[6] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[7] H. M. H. Babu and T. Sasao, “A method to represent multiple-output switching
functions by using binary decision diagrams”, The Sixth Workshop on Synthesis and
System Integration of Mixed Technologies, pp. 212–217, 1996.
[8] H. M. H. Babu and T. Sasao, “Representations of multiple-output logic functions
using shared multi-terminal binary decision diagrams”, The Seventh Workshop on
Synthesis and System Integration of Mixed Technologies, pp. 25–32, 1997.
[9] H. M. H. Babu and T. Sasao, “Design of multiple-output networks using time domain
multiplexing and shared multi-terminal multiple-valued decision diagrams”, IEEE
International Symposium on Multiple Valued Logic, pp. 45–51, 1998.
[10] E. Balas and C. S. Yu, “Finding a maximum clique in an arbitrary graph”, SIAM J.
Comput. vol. 15, pp. 1054–1068, 1986.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE. International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] T. Sasao, ed., “Logic Synthesis and Optimization”, Kluwer Academic Publishers,
Boston, 1993.
[13] A. Srinivasan, T. Kam, S. Malik and R. K. Brayton, “Algorithm for discrete functions
manipulation”, Proceedings of IEEE. International Conference on Computer-Aided
Design, pp. 92–95, 1990.
[14] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[15] M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory
of NP-Completeness”, Freeman, New York, 1979.
[16] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, vol. 81, no. 12, pp. 2545–2553, 1998.
CHAPTER 2
Multiple-Output Functions
Using BDD for Characteristic
Functions
This chapter describes a method to construct smaller binary decision diagrams for charac-
teristic functions (BDDs for CFs). A BDD for CF represents an n-input m-output function.
An upper bound on the number of nodes of the BDD for the CF of n-bit adders (adr n) is derived.
As a result: (1) BDDs for CFs are usually much smaller than MTBDDs (Multi-Terminal
Binary Decision Diagrams); (2) For adrn and for some benchmark circuits, BDDs for CFs
are the smallest among the three types of BDDs; and (3) The introduced method often
produces smaller BDDs for CFs.
2.1 INTRODUCTION
Binary decision diagrams (BDDs) are compact representations of logic functions, and are
useful for logic synthesis, time-division multiplexing (TDM) realization, test, verification,
etc. Shared binary decision diagrams (SBDDs), multi-terminal binary decision diagrams
(MTBDDs), and BDDs for characteristic functions (BDDs for CFs) represent multiple-
output functions. SBDDs are compact. MTBDDs evaluate all the outputs simultaneously
but they usually blow up in memory for large benchmark circuits. BDDs for CFs use CFs of
multiple-output functions. A CF is a switching function representing the relation of inputs
and outputs. Fig. 2.1 shows the general structure of a BDD for CF.
BDDs for CFs are usually much smaller than MTBDDs. The main applications of BDDs
for CFs are logic simulation of digital circuits and implicit state enumeration of finite state
machines. In this chapter, a method is considered to construct compact BDDs for CFs. An
algorithm is presented to find a good ordering of input and output variables. An upper
bound on the number of nodes of the BDD is also derived for CF of n-bit adders (adrn).
Property 2.1.1 support( f ) is the set of input variables that the function f depends on. The
size of the support is the number of variables in the support( f ).
Example 2.1 Let f (x1, x2, x3) = x1 x2 x3 ∨ x1 x2 x3'. Then, support( f ) = {x1, x2}, since f can also be represented as f = x1 x2. Thus, the size of the support is two.
Property 2.1.2 Let fi1 and fi2 be two output functions. The size of the union of the support
for fi1 and fi2 is the number of support variables for fi1 and fi2 .
Property 2.1.3 Let f : {0, 1} n → {0, 1} m . The size of a decision diagram (DD) for a
multiple-output function f , denoted by size(DD, f ), is the total number of terminal and
non-terminal nodes in the minimal DD.
In the case of SBDDs, the size also includes the nodes for output selection variables.
Example 2.3 The size of the BDD for CF in Fig. 2.2 is 14.
Property 2.2.1 Let B = {0, 1}. Let a ∈ B^n, f (a) = ( f0(a), f1(a), . . ., fm−1(a)) ∈ B^m, and b ∈ B^m. The characteristic function (CF) of f is

F (a, b) = 1 if b = f (a), and F (a, b) = 0 otherwise.
Table 2.1: 2-Input 2-Output Function

  Input     Output
  x1 x2     f0 f1
   0  0      0  0
   0  1      0  0
   1  0      1  0
   1  1      1  1
A BDD for CF is a BDD representing the CF. In order to guarantee a fast evaluation, the
output variables can appear on any path of the BDD for CF only after all the supports
have appeared. Fig. 2.2 shows the BDD for CF of a 3-input 2-output bit-counting function
(wgt3), where x1, x2 and x3 are input variables, and f 0 and f 1 are output variables. The
BDD for CF in Fig. 2.2 shows that each path from the root to the terminal 1 corresponds to
an input-output combination. The advantages of BDDs for CFs are: (1) they can represent large multiple-output functions; and (2) they can evaluate all the outputs in O(n + m) time.
Example 2.4 Consider the 2-input 2-output function in Table 2.1. The valid input-output
combinations are (0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 1, 0) and (1, 1, 1, 1).
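A small sketch (illustrative; the table encoding is an assumption) that evaluates the characteristic function of Property 2.2.1 for Table 2.1 and lists its 1-minterms, matching Example 2.4.

from itertools import product

f = {(0, 0): (0, 0), (0, 1): (0, 0), (1, 0): (1, 0), (1, 1): (1, 1)}  # Table 2.1

def F(a, b):
    """Characteristic function: F(a, b) = 1 iff b = f(a)."""
    return 1 if b == f[a] else 0

ones = [a + b for a in f for b in product((0, 1), repeat=2) if F(a, b)]
print(ones)   # [(0,0,0,0), (0,1,0,0), (1,0,1,0), (1,1,1,1)], as in Example 2.4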
Property 2.2.4 If the output variables appear only after all the supports, then an arbitrary n-input m-output function can be evaluated by a BDD for CF in O(n + m) time.
Note that if the above restriction on the ordering of the variables is dropped, then the O(n + m) evaluation time cannot be guaranteed.
Example 2.5 Let f : {0, 1}^7 → {0, 1}^10. Then, by Property 2.2.5, size(BDD for CF, f ) ≤ 16639 is obtained. On the other hand, from Table 2.2, size(SBDD, f ) ≤ 335 is obtained.
(1) Base: For m = 1, the CF of the function f0 = x0 is realized by a BDD for CF with five nodes, as shown in Fig. 2.3.
(2) Induction: Assume that the hypothesis is true for k = m − 1 functions. That is, the CF of m − 1 functions is realized by the BDD for CF in Fig. 2.4 with 3m − 1 nodes. In Fig. 2.4, first remove the constant 0 and constant 1. Second, attach the variables xm−1 and fm−1 and the corresponding three non-terminal nodes, as well as the nodes for constant 0 and constant 1. Then, the diagram in Fig. 2.5 is obtained. It is clear that Fig. 2.5 shows the BDD for CF of m functions with 3m + 2 nodes, which has three more non-terminal nodes than Fig. 2.4. Thus, from (1) and (2), the property is satisfied.
Note that the size of the MTBDD for the functions fs is exponential, while those for the BDD for CF and the SBDD are linear, as shown in Table 2.2.
Property 2.2.7 Let adr n be a 2n-input (n + 1)-output function that computes the sum of two n-bit numbers. Then, the CF of adr n can be represented by a BDD for CF with at most 9n + 1 nodes.
Proof 2.2 Suppose that the variables of the adrn are assigned as follows:
(2) Induction: Suppose that adr n is represented by using (9n − 1) non-terminal nodes and two terminal nodes. Also, assume that the variable zn is the nearest to the constant 1 node. Let v0 and v1 be the nodes of zn whose edges e0 and e1 connect to the constant 1, respectively. This situation is shown in Fig. 2.7. In Fig. 2.7, first remove the variable zn, the edges e0 and e1, and the constants. Second, attach the variables xn, yn, zn, and zn+1 and the corresponding 9 non-terminal nodes, as well as the constant nodes. Then, the diagram shown in Fig. 2.8 is obtained.
Note that Fig. 2.8 has 9 more nodes than Fig. 2.7. It is clear that Fig. 2.8 represents the characteristic function of adr(n + 1), and has (9n − 1) + 9 + 2 = 9(n + 1) + 1 nodes. In this case, the ordering of the variables is (x0, y0, z0, x1, y1, z1, . . ., xn, yn, zn, zn+1). Thus, from (1) and (2), the property is proved.
Figure 2.8: BDD for CF of adrn (After updating the variables and constants).
Problem 1: Let u1, u2, . . ., uk be the variables. Let Order[k] = (ue1 , ue2 , . . ., uek ) be a
permutation of the k variables. Let size(BDD for CF, f ) be the total number of nodes in the
BDD for CF for a certain Order[k] of the variables. Find a variable ordering Order[k] =
(ue1 , ue2 , . . ., uek ) for a given multiple-output function f such that the size(BDD for CF, f )
is the minimum.
In general, it is very time-consuming to find the best ordering of variables of the BDD
for CF. So, a good variable ordering from the initial one is computed by using the modified
sifting algorithm. To generate a good initial ordering, the following methods are applied: i)
ordering of output variables; ii) interleaving based sampling schemes for ordering of input
variables; and iii) interleaving method for input variables and output variables.
There are six pairs of output functions. The sizes of the supports for these pairs of output
functions are: s( f0, f1 ) = s( f0, f2 ) = s( f0, f3 ) = 4, s( f1, f2 ) = s( f2, f3 ) = 3, and s( f1, f3 ) = 2.
Since s( f1, f3 ) = 2 is the smallest among all s( fi, fj ), ( f1, f3 ) is the candidate for a pair. The remaining outputs are f0 and f2, so ( f0, f2 ) is the other pair. Therefore, the partition of the output functions is obtained as follows: {( f1, f3 ), ( f0, f2 )}.
Example 2.7 Consider the functions in Example 2.6. Since {( f1, f3 ), ( f0, f2 )} is a good
partition of output functions, the ordering of outputs is ( f1, f3, f0, f2 ).
Example 2.9 Consider the functions in Example 2.6. ( f1, f3 ) and ( f0, f2 ) are two samples,
since support( f0 ) = {x1, x2, x3 }, support( f1 ) = {x3, x4 }, support( f2 ) = {x1, x3, x4 }, and
support( f3 ) = {x4 }.
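The pairing of outputs by smallest support union can be sketched as follows; the supports are taken from Example 2.9, and the greedy pairing shown here is only an illustrative reading of the procedure, not the book's exact code.

from itertools import combinations

support = {'f0': {'x1', 'x2', 'x3'}, 'f1': {'x3', 'x4'},
           'f2': {'x1', 'x3', 'x4'}, 'f3': {'x4'}}

# Size of the union of supports for every pair (Property 2.1.2).
s = {pair: len(support[pair[0]] | support[pair[1]])
     for pair in combinations(support, 2)}
print(s)            # s(f1, f3) = 2 is the smallest, so (f1, f3) is paired first

best = min(s, key=s.get)
rest = [f for f in support if f not in best]
print(best, rest)   # ('f1', 'f3') ['f0', 'f2'] -- the partition {(f1,f3), (f0,f2)}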
Figure 2.9: SBDD with the Variable Ordering for Sample ( f1, f3 ) Obtained by the Sifting
Algorithm.
Example 2.10 Consider the functions in Example 2.6. ( x4, x3, x1, x2 ) and ( x3, x4, x2, x1 ) are
the variable orderings in Figs. 2.9 and 2.10 for samples ( f1, f3 ) and ( f0, f2 ), respectively.
The sample ( f0, f2 ) has the highest priority, since the size of the SBDD for this sample is
the largest. Fig. 2.11 shows that ( x3, x4, x2, x1 ) is a good variable ordering for ( f0, f1, f2, f3 )
which is computed from the variable orderings of samples ( f1, f3 ) and ( f0, f2 ) by using an
interleaving method.
Figure 2.10: SBDD with the Variable Ordering for Sample ( f0, f2 ) Obtained by the Sifting
Algorithm.
The support for z0 is {x0, y0 }, and the supports for z1 and z2 are {x0, y0, x1, y1 }. Also,
z2, z1 , and z0 are partially symmetric with respect to {x0, y0 } and {x1, y1 }. Thus, the reason-
able ordering for the input and output variables would be (x0, y0, z0, x1, y1, z1, z2 ).
Example 2.12 Consider the 4-input 4-output function in Example 2.6. In this example,
(x3, x4, x2, x1 ) is a good ordering of the inputs, and ( f1, f3, f0, f2 ) is a good ordering of
the outputs. Since support ( f1 ) = {x3, x4 }, f1 appears after {x3, x4 }. Next, support ( f3 ) =
{x4 }, f3 appears after f1 . Finally, support( f0 ) = {x1, x2, x3 } and support ( f2 ) = {x1, x3, x4 }, f0
Figure 2.11: SBDD with the Variable Ordering for f = ( f0, f1, f2, f3 ) Obtained from the
Variable Orderings of Samples ( f1, f 3 ) and ( f0, f2 ) by using an Interleaving Method.
and f2 appear last. Thus, an initial ordering for the input and output variables is
(x3, x4, f1, f3, x2, x1, f0, f2 ).
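The interleaving of input and output variables in Example 2.12 amounts to placing each output immediately after the last of its support variables. The sketch below is one reading of that procedure (an interpretation, not the book's exact method); it reproduces the ordering (x3, x4, f1, f3, x2, x1, f0, f2).

input_order  = ['x3', 'x4', 'x2', 'x1']
output_order = ['f1', 'f3', 'f0', 'f2']
support = {'f0': {'x1', 'x2', 'x3'}, 'f1': {'x3', 'x4'},
           'f2': {'x1', 'x3', 'x4'}, 'f3': {'x4'}}

order, seen, placed = [], set(), set()
for x in input_order:
    order.append(x)
    seen.add(x)
    for f in output_order:                 # keep the chosen output order
        if f not in placed and support[f] <= seen:
            order.append(f)
            placed.add(f)

print(order)   # ['x3', 'x4', 'f1', 'f3', 'x2', 'x1', 'f0', 'f2']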
2.4 SUMMARY
In this chapter, a method is introduced to construct smaller binary decision diagrams for
characteristic functions (BDDs for CFs) to represent multiple-output functions. The sizes of
SBDDs (Shared Binary Decision Diagrams), MTBDDs, and BDDs for CFs are compared.
SBDDs evaluate outputs in O(mn) time, while MTBDDs (Multi-Terminal Binary Decision
Diagrams) and BDDs for CFs evaluate outputs in O(n) time and O(n+ m) time, respectively.
In most cases, BDDs for CFs are much smaller than MTBDDs. However, BDDs for CFs are
usually larger than the corresponding SBDDs. Three types of circuits are considered: (1) n-bit adders (adr n),
where BDDs for CFs are the smallest; (2) bit-counting circuits (wgtn), where MTBDDs
are the smallest; and (3) n-bit multipliers (mlpn), where SBDDs are the smallest. Upper
bounds on the sizes of SBDDs, MTBDDs, and BDDs for CFs of adrn are also derived.
An SBDD-based simulator can be faster for some functions. However, the simulator
based on the BDD for CF should be faster than the SBDD-based one when there are no page
faults in the physical memory and no misses in the Translation Lookaside Buffer (TLB)
during function evaluation.
Figure 2.12: Pseudocode for Interleaving based Sampling Schemes for the Ordering of
Input Variables.
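The pseudocode of Figure 2.12 is not reproduced here. As a rough stand-in (a simplified assumption, not the book's algorithm), a priority merge of the sample orderings already yields the ordering of Example 2.10: samples are visited in decreasing SBDD size, and each variable is appended the first time it is seen. The SBDD sizes used below are illustrative placeholders.

def merge_sample_orderings(samples):
    """samples: list of (sbdd_size, variable_ordering).
    Variables of higher-priority samples (larger SBDDs) are placed first."""
    merged = []
    for _, ordering in sorted(samples, key=lambda s: -s[0]):
        for v in ordering:
            if v not in merged:
                merged.append(v)
    return merged

# Example 2.10: sample (f0, f2) has the larger SBDD, so it gets priority.
samples = [(5, ['x4', 'x3', 'x1', 'x2']),    # ordering found for sample (f1, f3)
           (8, ['x3', 'x4', 'x2', 'x1'])]    # ordering found for sample (f0, f2)
print(merge_sample_orderings(samples))       # ['x3', 'x4', 'x2', 'x1']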
REFERENCES
[1] S. B. Akers, “Binary decision diagrams”, IEEE Trans. Comput., vol. C-27, no. 6, pp.
509–516, 1978.
[2] P. Ashar and S. Malik, “Fast functional simulation using branching programs”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 408–412,
1995.
[3] C. Scholl, R. Drechsler and B. Becker, “Functional simulation using binary decision di-
agrams”, Proceedings of IEEE International Conference on Computer-Aided Design,
pp. 8–12, 1997.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[6] H. Touati, H. Savoj, B. Lin, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Im-
plicit state enumeration of finite state machines using BDDs”, Proceedings of IEEE
International Conference on Computer-Aided Design, pp. 130–133, 1990.
[7] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Trans. Fundamentals, vol. E81-A, no.12, pp. 2545–
2553, 1998.
[8] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-
output functions based on shared multi-terminal multiple-valued decision diagrams”,
IEICE Trans. Inf. & Syst., vol. E82-D, no.5, pp. 925–932, 1999.
[9] H. M. H. Babu and T. Sasao, “Representations of multiple-output functions by binary
decision diagrams for characteristic functions”, Proceedings of the Eighth Workshop
on Synthesis And System Integration of Mixed Technologies, pp. 101–108, 1998.
[10] H. Fujii, G. Ootomo, and C. Hori, “Interleaving based variable ordering methods for
ordered binary decision diagrams”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 38–41, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] T. Sasao, ed., “Logic Synthesis and Optimization”, Kluwer Academic Publishers,
Boston, 1993.
[13] J. Jain, W. Adams, and M. Fujita, “Sampling schemes for computing OBDD vari-
able orderings”, Proceedings of IEEE International Conference on Computer-Aided
Design, pp. 631–638, 1998.
[14] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[15] A. Slobodová and C. Meinel, “Sample method for minimization of OBDDs”, Pro-
ceedings of the International Workshop on Logic Synthesis, pp. 311–316, 1998.
[16] H. M. H. Babu and T. Sasao, “Shared multiple-valued decision diagrams for multiple-
output functions”, Proceedings of the IEEE International Symposium on Multiple-
Valued Logic, pp. 166–172, 1999.
[17] H. M. H. Babu and T. Sasao, “Representations of multiple-output functions using binary decision diagrams for characteristic functions”, IEICE Trans. Fundamentals, vol. E82-A, no. 11, pp. 2398–2406, 1999.
CHAPTER 3
Shared Multiple-Valued Decision Diagrams for Multiple-Output Functions
3.1 INTRODUCTION
Multiple-valued decision diagrams (MDDs) are extensions of binary decision diagrams
(BDDs), and are useful in logic synthesis, time-division multiplexing (TDM) realizations,
logic simulation, FPGA design, etc. MDDs are usually smaller than the corresponding
BDDs, and require fewer memory accesses to evaluate them. A shared multiple-valued
decision diagram (SMDD) is a set of MDDs that compactly represents a multiple-output
function. SMDDs can be used in many applications such as design of multiplexer-based
networks, design of pass-transistor logic networks, etc. For example, Fig. 3.2 shows the
multiplexer-based network corresponding to the SMDD in Fig. 3.1. In these applications,
the reduction of the number of nodes in the SMDDs is important.
This chapter considers the following methods to construct compact SMDDs:
1. Pair the binary input variables to make multiple-valued variables.
2. Order the multiple-valued variables in the SMDDs.
A parameter is introduced to find good pairs of the input variables. The parameter of an
input variable denotes the influence of the variable on the size of the BDD. An extension to
the sifting algorithm is also presented that moves pairs of 4-valued input variables to speed
up the sifting, and to produce compact SMDDs. Furthermore, formulas are derived for the
sizes of SMDDs for bit-counting functions (wgt n) and incrementing functions (inc n).
Property 3.2.1 Let F = ( f0, f 1, . . ., f m−1 ). The size of a decision diagram (DD) for a
function F , denoted by size(DD, F), is the total number of nonterminal nodes excluding
the nodes for output selection variables.
Example 3.1 The size of the SMDD in Fig. 3.1 is 7. Note that g0 is the output selection
variable in the SMDD.
• Two nodes are merged into one node if they represent the same function, and
• a node is deleted if all of its outgoing edges point to the same node.
Property 3.2.2 Let R = {0, 1, . . ., r − 1} and B = {0, 1}. Then, the size of the SMDD for an N-input m-output function R^N → B^m is at most
min_{k=1}^{N} { m · (r^(N−k) − 1)/(r − 1) + 2^(r^k) − 2 }.
Property 3.2.4 Let R = {0, 1, . . ., r − 1} and B = {0, 1}. Then, all the non-constant symmetric functions R^N → B can be represented by MDDs with Σ_{i=1}^{N} [2^C(i+r−1, i) − 2] non-terminal nodes, where C(a, b) denotes the binomial coefficient.
Property 3.2.5 Let R = {0, 1, . . ., r − 1} and B = {0, 1}. Then, the size of the SMDD for an N-input m-output symmetric function R^N → B^m is at most
min_{k=1}^{N} { m · Σ_{i=0}^{k} C(i+r−1, i) + Σ_{i=1}^{N−k} [2^C(i+r−1, i) − 2] }.
From here, we assume that r = 4.
Property 3.2.6 Let wgtn be an n-input (⌊log2 n⌋ + 1)-output function that counts the number of 1's in the inputs, and represents it by a binary number, where n is the number of binary input variables, and ⌊a⌋ denotes the largest integer not greater than a. It represents a bit-counting function.
Property 3.2.7 Let incn be an n-input (n + 1)-output function that computes x + 1, where n is the number of binary input variables. It represents an incrementing function. The following conjectures give the bounds on the sizes of SMDDs for wgtn and incn:
Conjecture 3.1: size(SMDD, wgtn) = n⌊log2 n⌋ + n − 2^⌊log2 n⌋, where n > 1.
Conjecture 3.2: size(SMDD, incn) = 2n − 1, where n > 1.
Property 3.2.8 Let an SMDD represent m functions f i = x i (i = 0, 1, ..., m − 1), where x i
is a binary input variable. Then, the size of the SMDD is m.
Example 3.2 Figs. 3.3 and 3.4 show the SMDDs for the functions (f0, f1) = (x1, x2) and (f0, f1, f2) = (x1, x2, x3), respectively. Their sizes are 2 and 3, respectively.
Property 3.3.2 Let U = {u1, u2, u3, u4 } be a set of four variables. Then, {[u1, u2 ], [u3, u4 ]}
is a partition of U .
Property 3.3.3 Let a BDD represent a function f. para(xi) is the parameter of an input variable xi at height i in the BDD with the given variable ordering; it denotes the level of the BDD for xi. It is assumed that the smaller the value of para(xi), the more influential the variable xi.
Example 3.3 Figs. 3.5 and 3.6 show the BDDs for the functions f0 and f1, respectively. The values of para(xi) for both BDDs are shown in the figures. For example, x1 is the most influential variable in the BDD in Fig. 3.5, since para(x1) is the smallest among all para(xi).
Example 3.4 Consider the functions in Example 3.3. The total parameter vector for F = (f0, f1) is
T = (para(x1), para(x2), para(x3), para(x4)) = (2, 6, 3, 16).
Property 3.3.5 The weight w(i, j) for a pair of input variables xi and xj is defined by w(i, j) = para(xi) · para(xj).
Property 3.3.6 In the functions of Example 3.3, the weights are as follows: w(1, 2) = 12, w(1, 3) = 6, w(1, 4) = 32, w(2, 3) = 18, w(2, 4) = 96, and w(3, 4) = 48.
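As a small illustration of Property 3.3.5, the weights for the three possible pairings of the four inputs of Example 3.3 can be tabulated as follows (the para values are those of Example 3.4; the exhaustive enumeration is only practical for a handful of variables, and choosing among the pairings on the basis of such weights is the task of Algorithm 3.1).

    # para values of Example 3.4 for the inputs x1..x4.
    para = {"x1": 2, "x2": 6, "x3": 3, "x4": 16}

    def weight(a, b):
        # Property 3.3.5: w(i, j) = para(xi) * para(xj)
        return para[a] * para[b]

    # Enumerate the three ways of pairing four inputs and the total weight of each.
    variables = sorted(para)
    first = variables[0]
    for mate in variables[1:]:
        rest = [v for v in variables[1:] if v != mate]
        pairing = [(first, mate), (rest[0], rest[1])]
        total = sum(weight(a, b) for a, b in pairing)
        print(pairing, total)
    # -> [('x1','x2'),('x3','x4')] 60, [('x1','x3'),('x2','x4')] 102, [('x1','x4'),('x2','x3')] 50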
Example 3.5 Consider the functions in Example 3.3. There are three different ways of
pairing four inputs:
From now on, this optimized SMDD is called the SMDD with normal sifting.
Example 3.6 Let {[X 1, X 3 ], [X 2, X 4 ]} be a good partition of 4-valued input variables ob-
tained by Algorithm 3.1. Then, the initial variable ordering for the pairs is (X 1, X 3, X 2, X 4 ).
From now on, this optimized SMDD is called the SMDD with pair sifting.
Algorithm 3.3 Construction of an SMDD using the Pair Sifting of 4-Valued Variables
Let F : {0, 1}^n → {0, 1}^m.
1: Construct the MDDs for F by Algorithm 3.1.
2: Find good pairs of 4-valued variables from the MDDs using techniques similar to Algorithm 3.1, and make an initial variable ordering with the pairs.
3: Select a pair of 4-valued variables from the initial ordering, and use the sifting algorithm
to find the position of the pair such that the size of the initial SMDD is minimized.
4: Repeat Step 3 until all the pairs from the initial ordering have been checked, and choose
the smallest SMDD.
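A minimal skeleton of Steps 3 and 4 of Algorithm 3.3 is sketched below, assuming a caller-supplied smdd_size(order) routine that rebuilds the SMDD for a given ordering of the 4-valued pairs and returns its size; an efficient implementation would instead swap adjacent variables incrementally, as in the usual sifting algorithm.

    def pair_sifting(initial_order, smdd_size):
        """Sift each 4-valued pair to the position that minimizes the SMDD size.

        initial_order : list of pairs, e.g. [("x1", "x3"), ("x2", "x4")]
        smdd_size     : callback returning the SMDD size for a given order
        """
        order = list(initial_order)
        for pair in list(initial_order):            # Step 4: try every pair once
            best_order, best_size = order, smdd_size(order)
            without = [p for p in order if p != pair]
            for pos in range(len(order)):           # Step 3: try all positions
                candidate = without[:pos] + [pair] + without[pos:]
                size = smdd_size(candidate)
                if size < best_size:
                    best_order, best_size = candidate, size
            order = best_order                      # keep the smallest SMDD so far
        return order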
3.4 SUMMARY
In this chapter, a method is introduced to represent multiple-output functions using shared
multiple-valued decision diagrams (SMDDs). Some algorithms are also presented to pair
the input variables of binary decision diagrams (BDDs), and to find good orderings of
the multiple-valued variables in the SMDDs. The sizes of SMDDs are derived for general
functions and symmetric functions. SMDDs with the pair sifting are smaller than SMDDs
with the normal sifting. Algorithm 3.1 and Algorithm 3.3 can be extended to group k
input variables, where k > 2. SMDDs are useful in many applications such as design of
multiplexer-based networks and design of pass-transistor logic networks.
REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] H. M. H. Babu and T. Sasao, “Design of multiple-output networks using time domain
multiplexing and shared multi-terminal multiple-valued decision diagrams”, IEEE
International Symposium on Multiple Valued Logic, pp. 45–51, 1998.
[3] S. Minato, N. Ishiura, and S. Yajima, “Shared binary decision diagram with attributed
edges for efficient Boolean function manipulation", Proceedings of 27th ACM/IEEE
DAC, pp. 52–57, 1990.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[6] D. M. Miller, “Multiple-valued logic design tools", Proceedings of the IEEE Interna-
tional Symposium on Multiple-Valued Logic, pp. 2–11, 1993.
[7] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Trans. Fundamentals, vol. E81-A, no. 12, pp. 2545–
2553, 1998.
[8] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-
output functions based on shared multi-terminal multiple-valued decision diagrams”,
IEICE Trans. Inf. & Syst., vol. E82-D, no. 5, pp. 925–932, 1999.
CHAPTER 4
Heuristics to Minimize Multiple-Valued Decision Diagrams
4.1 INTRODUCTION
Multiple-valued decision diagrams (MDDs) are data structures for multiple-valued func-
tions. MDDs are extensions of binary decision diagrams (BDDs) and usually require fewer
nodes than the corresponding BDDs to represent the same logic functions. MDDs are useful
for logic synthesis, FPGA design, logic simulation, etc. For example, Fig. 4.2 shows the
multiplexer-based network corresponding to the MDD in Fig. 4.1. In this chapter, multi-
rooted MDDs are considered to represent multiple-output functions. From 2-valued input
2-valued output functions, MDDs are constructed to represent 4-valued input 2-valued
output functions. A shared binary decision diagram (SBDD) is used for a multiple-output
function to find good pairs of 2-valued input variables. Since the size of a decision diagram
(DD) can vary from linear to exponential due to the orderings of the input variables, finding
a good variable ordering of the input variables is very important. Dynamic variable ordering
is one of the good heuristics to order the inputs. However, in the case of a multiple-output
function, a set of output functions must be handled at the same time. So, generating a good
variable ordering that represents all the output functions compactly is essential. Sampling
based variable ordering methods and interleaving-based variable ordering methods are effective in finding good variable orderings for multiple-output functions quickly. In this chapter, both methods are combined to find good orderings of the input variables.
Property 4.2.1 Let F1 = { f0, f1, . . ., fm−1 }, R = {0, 1, . . ., r−1}, and B = {0, 1}. An r -
valued input 2-valued output function F1 is a mapping F1 : R N →B m .
Property 4.2.4 Let F = { f 0 , f 1 , . . ., f m−1 }. Then, the size of a decision diagram (DD)
for F , denoted by size(DD, F), is the total number of non-terminal nodes.
Property 4.3.1 Let R = {0, 1, . . ., r−1} and B = {0, 1}. Then, the size of the MDD for an N-input m-output function R^N → B^m is at most
min_{k=1}^{N} { m · (r^(N−k) − 1)/(r − 1) + 2^(r^k) − 2 }.
Property 4.3.2 The size of the SBDD for an n-input m-output function B^n → B^m is at most
min_{k=1}^{n} { m · (2^(n−k) − 1) + 2^(2^k) − 2 }.
Property 4.3.3 Let R = {0, 1, . . ., r−1} and B = {0, 1}. Then, all the non-constant symmetric functions R^N → B can be represented by MDDs with Σ_{i=1}^{N} [2^C(i+r−1, i) − 2] non-terminal nodes.
Proof 4.1 The number of non-constant symmetric functions f : R^N → B is 2^C(N+r−1, N) − 2, where R = {0, 1, . . ., r − 1} and B = {0, 1}.
(1) When N = 1, there are 2^r − 2 symmetric functions, and they are realized as shown in Fig. 4.3.
(2) Suppose that all the non-constant symmetric functions of (N − 1) variables are realized with Σ_{i=1}^{N−1} [2^C(i+r−1, i) − 2] non-terminal nodes. An arbitrary symmetric function of N variables is represented as
f(X1, X2, . . ., XN) = XN^0 f0 ∨ XN^1 f1 ∨ . . . ∨ XN^(r−1) f(r−1),
where fj = f(X1, X2, . . ., XN−1, j) is a symmetric function of (N − 1) variables, and j = 0, 1, . . ., r − 1. Thus, all the non-constant symmetric functions of N variables are realized as shown in Fig. 4.4. The total number of non-terminal nodes is
Σ_{i=1}^{N−1} [2^C(i+r−1, i) − 2] + (2^C(N+r−1, N) − 2) = Σ_{i=1}^{N} [2^C(i+r−1, i) − 2].
Thus, from (1) and (2), the Property 4.3.3 is proved.
Property 4.3.4 Let R = {0, 1, . . ., r−1} and B = {0, 1}. Then, the size of the MDD for an N-input m-output symmetric function R^N → B^m is at most
min_{k=1}^{N} { m · Σ_{i=0}^{k} C(i+r−1, i) + Σ_{i=1}^{N−k} [2^C(i+r−1, i) − 2] }.
Proof: Since the functions are completely symmetric, the number of distinct k-variable subfunctions generated by the r-valued complete decision tree is equal to the number of ways to select k objects from r distinct objects with repetition. The number of ways to select k objects from r distinct objects with repetition is C(k+r−1, k). So, the total number of non-terminal nodes in the r-valued decision trees of the m functions is at most m · Σ_{i=0}^{k} C(i+r−1, i). By Property 4.3.3, there are 2^C(N−k+r−1, N−k) − 2 non-constant symmetric functions of the remaining N − k variables, and they require Σ_{i=1}^{N−k} [2^C(i+r−1, i) − 2] non-terminal nodes. Therefore, the size of the MDD for an r-valued N-input 2-valued m-output symmetric function is at most
min_{k=1}^{N} { m · Σ_{i=0}^{k} C(i+r−1, i) + Σ_{i=1}^{N−k} [2^C(i+r−1, i) − 2] }.
In the case of r = 2, an SBDD is used to represent an n-input m-output symmetric function, and the following is obtained:
Property 4.3.5 The size of the SBDD for an n-input m-output symmetric function B^n → B^m is at most
min_{k=1}^{n} { m · (k+1)(k+2)/2 + 2^(n−k+2) − 2(n−k) − 4 }.
From now on, it is assumed that an MDD represents a 4-valued input 2-valued multiple-
output function, where r = 4.
Property 4.3.6 Let inc n be an n-input (n + 1)-output function that computes K + 1, where
K is a binary number consisting of n bits. It represents an incrementing circuit.
Property 4.3.7 Suppose that the 2-valued input variables of inc n are paired as X 1 =
[x 1, x 2 ], X 2 = [x 3, x 4 ], . . ., and X N = [x n−1, x n ], where n = 2N and the variable ordering
of the 4-valued inputs is (X1, X2, . . ., XN). Then, size(MDD, incn) ≤ 2n − 1 (n ≥ 2).
Proof 4.2 Mathematical induction is used on the number of 2-valued input variables.
(1) Base: For n = 2, the MDD for inc 2 is realized with three non-terminal nodes as shown
in Fig. 4.5.
(2) Induction: Assume that the hypothesis is true for k = n − 1 input variables. That is, the MDD for inc(n−1) is realized as in Fig. 4.6 with 2n − 3 non-terminal nodes. In Fig. 4.6, first remove the nodes for constant 0 and constant 1. Second, insert the input variable xn and add two non-terminal nodes, as well as new nodes for constant 0 and constant 1. Then, the diagram in Fig. 4.7 is obtained. Note that Fig. 4.7 shows the MDD for incn with 2n − 1 non-terminal nodes, which has two more non-terminal nodes than the MDD in Fig. 4.6. It is clear that the MDD in Fig. 4.7
has upper and lower parts: When n is even, x n is paired with x n−1 and two additional
non-terminal nodes are added at the level in the bottom of the upper part of the MDD. On
the other hand, when n is odd, x n remains as a 2-valued variable in the lower part of the
MDD which requires two non-terminal nodes. Note that x 1 , x 2, . . ., x n is the order of the
2-valued inputs in the pairs, and (X 1, X 2, . . ., X N ) is the variable ordering of the 4-valued
inputs in the MDD. Thus, from (1) and (2), the theorem is proved.
Example 4.2 Figs. 4.8 and 4.9 show the MDDs for inc 3 and inc 4, respectively. The sizes
of MDDs in Figs. 4.8 and 4.9 are 5 and 7, respectively.
Example 4.3 Consider the SBDD in Fig. 4.10, where s(x 1, x 2 ) = 2, s(x 1, x 3 ) = s(x 2, x 3 ) =
3, and s(x1, x4) = s(x2, x4) = s(x3, x4) = 4. [x1, x2] is a good pair of input variables, since
s(x 1, x 2 ) is the smallest among all s(x i , x j ). The remaining inputs are x 3 and x 4 . Thus,
[x 3, x 4 ] is another pair. Therefore, [x 1, x 2 ] and [x 3, x 4 ] are good pairs of 2-valued inputs.
Exact methods to find the optimum variable ordering are applicable only to functions with a small number of inputs and are useless for general purposes. Finding the optimum variable ordering is an NP-complete problem, so heuristics are used for practical problems. In
real life, many logic circuits have multiple outputs, and most CAD tools handle multiple-
output functions at the same time. Thus, finding the same variable ordering for different
output functions is important. In this subsection, a heuristic is presented to order the inputs
of multiple-output functions. A sampling technique is used to compute variable orderings
of MDDs: each sample corresponds to a group of output functions, and an MDD represents
a sample. Then, an interleaving technique is incorporated to generate a good variable
ordering for the entire MDDs from the variable orderings of the individual MDDs, and to minimize the
entire MDDs. The techniques for the introduced method are presented in Algorithm 4.2
and Algorithm 4.3. From now on, the variable ordering of an MDD for a sample is called a sample variable ordering, and the variable ordering for all the outputs obtained from the sample variable orderings is called the final variable ordering. To obtain the final variable ordering, a
variable ordering of an MDD for a sample is merged with higher priority into one with
lower priority while maintaining the good variable ordering of each MDD as much as
possible. The input variables on which a multiple-output function strongly depends are influential. The influential variables greatly affect the size of the DD, and such variables
should be placed in the higher positions in the final variable ordering.
Property 4.4.2 support ( f ) is the set of input variables that the function f depends on.
The size of the support is the number of variables in the support ( f ).
Property 4.4.3 Let f 1 and f 2 be two output functions. The size of the union of the support
for f1 and f2 is the number of support variables for { f 1 , f 2 }.
For example, let f0(x1, x2, x3, x4) = x1′x2 ∨ x1x3′ ∨ x2′x3, and f1(x1, x2, x3, x4) = x1x3 ∨ x3′x4. The size of the union of the support for f0 and f1 is 4, since x1, x2, x3 and x4 are the support variables for {f0, f1}.
Note that an input variable of a sample variable ordering is inserted into the final
ordering if the variable is not already in it.
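A deliberately simplified sketch of this merging step is given below: sample orderings are visited in priority order and each variable is appended to the final ordering only if it is not already present. A full interleaving method would insert each new variable next to its neighbours from its own sample ordering rather than at the end, so this is only an approximation of Algorithms 4.2 and 4.3.

    def merge_sample_orderings(sample_orderings):
        """Merge sample variable orderings (highest priority first) into one
        final ordering, keeping each sample's relative order as far as possible."""
        final = []
        for ordering in sample_orderings:      # higher-priority samples come first
            for variable in ordering:
                if variable not in final:      # insert only if not already present
                    final.append(variable)
        return final

    # Hypothetical sample orderings for three groups of outputs:
    samples = [["x3", "x1", "x2"], ["x1", "x4", "x2"], ["x5", "x3"]]
    print(merge_sample_orderings(samples))     # -> ['x3', 'x1', 'x2', 'x4', 'x5']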
4.5 SUMMARY
In this chapter, heuristics are introduced to minimize multiple-valued decision diagrams
(MDDs) for multiple-output functions. Upper bounds on the sizes of MDDs are presented
for various functions. MDDs usually require fewer nodes than corresponding SBDDs, and sometimes MDDs require less than half the nodes of SBDDs. The introduced method is much faster, and it produces MDDs that are smaller. In addition, the introduced method produces much smaller MDDs in a short time when several 2-valued input variables are grouped to
form multiple-valued variables.
REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] H. M. H. Babu and T. Sasao, “Minimization of multiple-valued decision diagrams
using sampling method”, Proceedings of the Ninth Workshop on Synthesis and System
Integration of Mixed Technologies, pp. 291–298, 2000.
[3] S. Minato, N. Ishiura, and S. Yajima, “Shared binary decision diagram with attributed
edges for efficient Boolean function manipulation", Proceedings of 27th ACM/IEEE
DAC, pp. 52–57, 1990.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] S. Tani, K. Hamaguchi, and S. Yajima, “The complexity of the optimal variable
ordering problems of a shared binary decision diagram”, IEICE Trans. Inf. & Syst.,
vol. E79-D, no.4, pp. 271–281, 1996.
[6] N. Ishiura, H. Sawada, and S. Yajima, “Minimization of binary decision diagrams
based on exchanges of variables”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 472–475, 1991.
[7] G. Epstein, Multiple-Valued Logic Design: An Introduction, IOP Publishing Ltd.,
London, 1993.
[8] J. Jain, W. Adams, and M. Fujita, “Sampling schemes for computing OBDD vari-
able orderings”, Proceedings of IEEE International Conference on Computer-Aided
Design, pp. 631–638, 1998.
[9] F. Somenzi, “Colorado University Decision Diagram Package (CUDD), release 2.1.2”, 1997.
[10] H. Fujii, G. Ootomo, and C. Hori, “Interleaving based variable ordering methods for
ordered binary decision diagrams”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 38–41, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
CHAPTER 5
TDM Realizations of Multiple-Output Functions Based on Shared Multi-Terminal Multiple-Valued DDs
5.1 INTRODUCTION
In modern LSIs, one of the most important issues is the “pin problem.” The reduction of
the number of pins in the LSIs is not so easy, even though the integration of more gates may
be possible. To overcome the pin problem, the time-division multiplexing (TDM) systems
are often used. In the TDM system, a single signal line represents several signals. For
example, the Intel 8088 microprocessor used an 8-bit bus to transfer 16-bit data, which made it possible to produce a large number of microcomputers quickly at a time when 16-bit peripheral LSIs were not so popular in the early 1980s. In this chapter, a method is presented
to design multiple-output networks based on shared multi-terminal multiple-valued decision
diagrams (SMTMDDs) by using TDMs. Heuristic algorithms are introduced to derive
SMTMDDs from shared binary decision diagrams (SBDDs). In the network, each non-
terminal node of a decision diagram (DD) is realized by a multiplexer (MUX).
Input Output
x1 x2 x3 x4 f0 f1 f2 f3
0 0 0 0 0 1 1 0
0 0 0 1 1 0 1 1
0 0 1 0 0 1 0 1
0 0 1 1 1 1 1 1
0 1 0 0 1 0 0 1
0 1 0 1 0 1 1 0
0 1 1 0 1 0 1 1
0 1 1 1 1 1 0 0
1 0 0 0 0 0 0 1
1 0 0 1 1 0 1 1
1 0 1 0 1 1 0 1
1 0 1 1 0 1 1 1
1 1 0 0 1 0 0 1
1 1 0 1 0 1 1 0
1 1 1 0 1 1 1 1
1 1 1 1 0 1 0 0
Input Output
X1 X2 Y1 Y2
0 0 1 2
0 1 2 3
0 2 1 1
0 3 3 3
1 0 2 1
1 1 1 2
1 2 2 3
1 3 3 0
2 0 0 1
2 1 2 3
2 2 3 1
2 3 1 3
3 0 2 1
3 1 1 2
3 2 3 3
3 3 1 0
Property 5.2.1 The sizen of a DD, denoted by sizen(DD), is the total number of non-terminal nodes excluding the nodes for output selection variables.
Example 5.1 The sizen of the SBDD, SMDD, and SMTMDD in Figs. 5.1, 5.2, and 5.3
are 19, 11, and 9, respectively. Note that g 0 , g 1 , and g 2 are the output selection variables in
the SBDD and SMDD, and g 0 is the output selection variable in the SMTMDD.
G0 = η′f0 ∨ ηf1, and
G1 = η′f2 ∨ ηf3.
This means that when η = 0, G0 and G1 represent f0 and f2, respectively. On the other hand,
when η = 1, G0 and G1 represent f 1 and f 3 , respectively. In this realization, the hardware
is needed for the functions f 0, f 1, f 2 , and f 3 , as well as the hardware for multiplexing. By
using this technique, the number of output pins is reduced by half. Note that in this
example, only two lines are necessary between the main LSI and the peripheral LSI. In
the peripheral LSI, delay latches are needed. When η = 0, the values for f 0 and f 2 are
transferred to the first and the third latches, respectively. On the other hand, when η = 1, the
values for f 1 and f 3 are transferred to the second and fourth latches, respectively. To realize
the multiple-output function, an SBDD is used. By replacing each non-terminal node of an
SBDD by a multiplexer, a network for the multiple-output function is obtained. In this case,
the amount of hardware for the network is easily estimated by the sizen of the SBDD, and
the design of the network is quite easy.
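A behavioural sketch of the two-phase TDM scheme just described is shown below in plain Python rather than hardware; f0–f3 are placeholder functions, and the latch indexing follows the description above.

    # Two-phase TDM of four outputs over two lines, following
    # G0 = eta'*f0 + eta*f1 and G1 = eta'*f2 + eta*f3.
    def tdm_outputs(eta, f0, f1, f2, f3, x):
        g0 = f1(x) if eta else f0(x)
        g1 = f3(x) if eta else f2(x)
        return g0, g1

    def receiver(latches, eta, g0, g1):
        """Peripheral side: eta selects which pair of latches is loaded."""
        if eta == 0:
            latches[0], latches[2] = g0, g1    # f0 -> first latch, f2 -> third latch
        else:
            latches[1], latches[3] = g0, g1    # f1 -> second latch, f3 -> fourth latch
        return latches

    # Placeholder single-output functions of a 2-bit input x = (x1, x2).
    f0 = lambda x: x[0] & x[1]
    f1 = lambda x: x[0] | x[1]
    f2 = lambda x: x[0] ^ x[1]
    f3 = lambda x: 1 - x[0]

    latches = [0, 0, 0, 0]
    for eta in (0, 1):                         # one full clock period: two phases
        g0, g1 = tdm_outputs(eta, f0, f1, f2, f3, (1, 0))
        latches = receiver(latches, eta, g0, g1)
    print(latches)                             # values of f0..f3 at x = (1, 0): [0, 1, 1, 0]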
In modern LSIs, the delay of the interconnections is often greater than the delay of the logic modules. Thus, the reduction of the number of logic levels is important. So, MDD-based realizations can be faster and require a smaller amount of hardware than BDD-based ones.
Example 5.2 Let n = 18, m = 20, and p = r = 4. For such a function, sizen (SBDD ) ≤
2621422, and sizen (SMTMDD ) ≤ 218702.
5.6 SUMMARY
In this chapter, time-division multiplexing (TDM) realizations of multiple-output func-
tions based on shared binary decision diagrams (SBDDs), shared multiple-valued decision
diagrams (SMDDs), and shared multi-terminal multiple-valued decision diagrams (SMT-
MDDs) are considered. In the network, each non-terminal node of a decision diagram (DD)
is realized by a multiplexer (MUX). For an n-variable function, the BDD-based realization
requires n levels, while the MDD-based realization requires n/2 levels. The TDM method
reduces the interconnections among the modules as shown in Figs. 5.4 and 5.5. In addition,
the SMTMDD-based realization reduces the number of gates by considering the pairing of
input variables and pairing of output functions. Note that the SBDD-based realizations and
the SMDD-based realizations require extra gates at the outputs (which are not included in
the tables).
The TDM method requires a clock pulse, which introduces delay in the network. However, the number of pins in the TDM realization is half that of the non-TDM realization. MDD-based realizations are more economical than SBDD-based ones when the ratios are less than 0.5. However, for n-bit adders, SMDD-based realizations require the smallest amount of hardware. It is also shown that there are cases where SBDD-based realizations are the most
economical. For arithmetic functions, MDD-based realizations tend to be more economical
than SBDD-based ones. The presented method can be extended to the case where p output
functions are grouped by using p-phase clock pulses.
REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] H. M. H. Babu and T. Sasao, “Design of multiple-output networks using time domain
multiplexing and shared multi-terminal multiple-valued decision diagrams”, IEEE
International Symposium on Multiple Valued Logic, pp. 45–51, 1998.
[3] S. Minato, N. Ishiura, and S. Yajima, "Shared binary decision diagram with attributed
edges for efficient Boolean function manipulation", Proceedings of 27th ACM/IEEE
DAC, pp. 52–57, 1990.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] D. M. Miller, "Multiple-valued logic design tools", Proceedings of the IEEE Interna-
tional Symposium on Multiple-Valued Logic, pp. 2–11, 1993.
[6] S. B. Akers, “Binary decision diagrams”, IEEE Trans. Comput., vol. C-27, no. 6, pp.
509–516, 1978.
[7] G. Epstein, Multiple-Valued Logic Design: An Introduction, IOP Publishing Ltd.,
London, 1993.
[8] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[9] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Trans. Fundamentals, vol. E81-A, no.12, pp. 2545–
2553, 1998.
[10] T. Sasao, ed., “Logic Synthesis and Optimization”, Kluwer Academic Publishers, Boston, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Proceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47, 1993.
[12] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[13] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli,
“Logic Minimization Algorithms for VLSI Synthesis”, Kluwer Academic Publishers,
Boston, 1984.
[14] T. Sasao, “Switching Theory for Logic Synthesis”, Kluwer Academic Publishers,
Boston, 1999.
[15] S. L. Hurst, “Multiple-valued logic—its status and its future”, IEEE Trans. Comput., vol. C-33, no. 12, pp. 1160–1179, 1984.
CHAPTER 6
Multiple-Output Switching Functions Using Multiple-Valued Pseudo-Kronecker DDs
6.1 INTRODUCTION
In VLSI design, one of the major concerns is the efficient and compact representation of
switching functions. Binary decision diagrams (BDDs) are probably the most successful
representation of switching functions. Multiple-valued decision diagrams (MDDs) are ex-
tensions of BDDs and are an important data structure for multiple-valued functions. MDDs
are usually smaller than the corresponding BDDs and are useful for logic synthesis, logic
simulation, test, etc. Pseudo-Kronecker decision diagrams (PKDDs) are a generalization of BDDs, and usually require fewer nodes than corresponding BDDs. A 2-valued PKDD rep-
resents a 2-valued input 2-valued multiple-output function, while a multiple-valued PKDD
(MVPKDD) represents a multiple-valued input 2-valued multiple-output function.
In 2-valued PKDDs, any of the following three expansions can be used for each non-
terminal node: (1) the Shannon expansion; (2) the positive Davio expansion; and (3) the
negative Davio expansion. PKDDs are useful for multi-level logic synthesis and LUT (look-
up table) type FPGA (field programmable gate array) design. For example: (a) Fig. 6.2 shows
a multi-level network corresponding to the 2-valued PKDD in Fig. 6.1. This network can
further be minimized by using a local transformation. It has been shown that a 2-valued
PKDD-based network requires, on the average, 21 percent fewer gates and interconnections
than the BDD-based one. (b) Fig. 6.3 shows a LUT-based network corresponding to the
PKDD in Fig. 6.1. It is suitable for an FPGA consisting of 3-input LUTs and programmable
interconnections. 4-valued PKDDs are useful for FPGAs with 6-input LUTs. Since each
non-terminal node of a PKDD is replaced by a LUT, the reduction of the number of nodes
in the PKDD is important. Three methods to construct compact MVPKDDs are:
1. Grouping of 2-valued variables to make multiple-valued variables;
2. Ordering of multiple-valued variables in the MVPKDDs; and
3. Generating a good expansion for each non-terminal node of MVPKDDs.
Sasao and Butler used 4-valued PKDDs to design LUT type FPGAs. They considered the
following to construct 4-valued PKDDs: (i) the bit-pairing algorithm for PLAs to pair the
2-valued inputs; (ii) a simulated annealing method to order the input variables; and (iii)
a cost estimation method for finding good expansions. In this chapter, a 4-valued PKDD
is constructed from the BDD. First, a shared binary decision diagram (SBDD) is used to pair the 2-valued inputs of a multiple-output function. Then, a set of good variable orderings for the 4-valued inputs is found from MDDs to produce a smaller 4-valued PKDD.
Finally, good expansions for non-terminal nodes of a 4-valued PKDD are generated from
the corresponding BDD.
f = x1′ f0 ⊕ x1 f1,   (6.1)
f = f0 ⊕ x1 f2,   (6.2)
f = x1′ f2 ⊕ f1,   (6.3)
where f0 = f(0, x2, . . ., xn), f1 = f(1, x2, . . ., xn), and f2 = f0 ⊕ f1. Equations (6.1), (6.2), and (6.3) correspond to the Shannon, positive Davio, and negative Davio expansions, respectively.
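For a function given as a truth table, the subfunctions f0, f1, and f2 used in (6.1)–(6.3) are easy to compute; the following small sketch assumes the truth table is stored as a Python list with x1 as the most significant index bit (an assumption about representation, not part of the original text).

    # f is a truth table over n binary variables, indexed by the integer
    # (x1 x2 ... xn) with x1 as the most significant bit.
    def cofactors_x1(f):
        """Return f0 = f(x1=0), f1 = f(x1=1), f2 = f0 XOR f1."""
        half = len(f) // 2
        f0, f1 = f[:half], f[half:]
        f2 = [a ^ b for a, b in zip(f0, f1)]
        return f0, f1, f2

    # Example: f(x1, x2) = x1 OR x2, truth table in the order 00, 01, 10, 11.
    f = [0, 1, 1, 1]
    f0, f1, f2 = cofactors_x1(f)
    print(f0, f1, f2)   # -> [0, 1] [1, 1] [1, 0]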
A literal X^S of a multiple-valued variable X is defined as X^S = 0 (X ∉ S) and X^S = 1 (X ∈ S). When S contains only one element, X^{i} is denoted by X^i. A product of literals X1^S1 X2^S2 . . . XN^SN is a product term that is the AND of the literals. The expression ⋁ X1^S1 X2^S2 . . . XN^SN, where the OR is taken over a set of product terms, is a sum-of-products expression (SOP).
Property 6.2.5 Let F = ( f 0, f 1, . . ., f m−1 ). Then, the size of a decision diagram (DD) of
F , denoted by size(DD, F ), is the total number of non-terminal nodes in the DD for F .
Property 6.3.4 Let inc n be an n-input (n + 1)-output function that computes K + 1, where
K is a binary number consisting of n bits. It represents an incrementing circuit.
Property 6.4.1 Let T be a set, and let Ti (i = 1, 2, 3, . . ., s) be subsets of T. (T1, T2, T3, . . ., Ts) is a partition of T if ⋃_{i=1}^{s} Ti = T, Ti ∩ Tj = ∅ (i ≠ j), and Ti ≠ ∅ for every i.
Property 6.4.2 Let F : {0, 1}^n → {0, 1}^m. The dependency matrix D = (dij) for F is a 0-1 matrix with m columns and n rows, where dij = 1 if fi depends on xj, and dij = 0 otherwise, for
i = 0, 1, . . ., m − 1 and j = 1, 2, . . ., n.
For example, the dependency matrix D of a 4-input 4-output function F = (f0, f1, f2, f3) may be as follows (rows x1–x4, columns f0–f3):
      f0  f1  f2  f3
x1     1   0   1   0
x2     1   0   0   0
x3     1   1   1   0
x4     0   1   1   1
Property 6.4.3 Let F : {0, 1}^n → {0, 1}^m, and let [xi, xj] be a pair of inputs. The pair-dependency matrix P = (pij) for F is a 0-1 matrix with m columns and n(n − 1)/2 rows.
Example 6.3 Consider the 4-output function in Example 6.2. The pair-dependency matrix
P is given as follows:
Strategy 6.1: Let F : {0, 1} n → {0, 1} m and let s(x i , x j ) be the number of outputs that
depend on at least one of x i and x j . Then, [x i , x j ] is a candidate pair of inputs if s(x i , x j )
is the smallest among all the pairs. Apply the same idea to the rest of the inputs recursively
to find good pairs of input variables.
Example 6.4 Consider the 4-input 4-output function in Example 6.2. There are six pairs
of input variables. The pair-dependency values are: s(x 1 , x 2 ) = 2, s(x 1 , x 3 ) = s(x 1 , x 4 ) =
s(x 2 , x 3 ) = 3, and s(x 2 , x 4 ) = s(x 3 , x 4 ) = 4. Since s(x 1 , x 2 ) = 2 is the smallest among all
the pairs, [x 1 , x 2 ] is a candidate pair. The remaining inputs are x 3 and x 4 . Thus, [x 3 , x 4 ] is
another candidate pair. So, the pairs of input variables are [x 1 , x 2 ] and [x 3 , x 4 ].
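A small sketch of Strategy 6.1 is given below; the dependency sets are read off the matrix D above, and the greedy recursion (always take the currently smallest s(xi, xj)) is one straightforward reading of the strategy.

    from itertools import combinations

    def s(depends_on, xi, xj):
        """Number of outputs that depend on at least one of xi and xj."""
        return len(depends_on[xi] | depends_on[xj])

    def pair_inputs(depends_on):
        """Strategy 6.1: repeatedly take the pair with the smallest s(xi, xj)."""
        remaining, pairs = set(depends_on), []
        while len(remaining) > 1:
            best = min(combinations(sorted(remaining), 2),
                       key=lambda p: s(depends_on, *p))
            pairs.append(best)
            remaining -= set(best)
        return pairs, remaining          # 'remaining' holds a leftover odd input, if any

    # Dependency sets read off the matrix D: which outputs depend on each input.
    depends_on = {"x1": {"f0", "f2"}, "x2": {"f0"},
                  "x3": {"f0", "f1", "f2"}, "x4": {"f1", "f2", "f3"}}
    print(pair_inputs(depends_on))       # -> ([('x1', 'x2'), ('x3', 'x4')], set())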
Note that when the number of 2-valued inputs is odd, one input variable remains in the
2-valued form for MDDs.
Property 6.4.4 A group is a subset of the outputs, and a partition of the outputs consists of groups. A group with a larger MDD size has higher priority.
Example 6.5 Consider the functions in Example 6.2. Let { f 1, f 3 } and { f 0, f 2 } be two
groups, where each group forms a 2-output function.
Since Algorithm 6.3 is a heuristic one, it may not obtain the optimal solution, but it is expected to find good solutions quickly.
6.5 SUMMARY
In this chapter, a method is presented to construct smaller 4-valued pseudo-Kronecker deci-
sion diagrams (4-valued PKDDs). The numbers of non-terminal nodes in 4-valued PKDDs
are compared with those of multiple-valued decision diagrams (MDDs), 2-valued PKDDs,
and shared binary decision diagrams (SBDDs), where an MDD (Multiple-Valued Decision Diagram) represents a 4-valued input 2-valued output function. 2-valued PKDDs are much smaller than the corresponding SBDDs. So, from PKDDs, networks with a smaller amount of hardware than those obtained from binary decision diagrams (BDDs) are expected. PKDDs are
useful for FPGA (Field Programmable Gate Array) design and multi-level logic synthesis.
REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] A. Srinivasan, T. Kam, S. Malik, and R. K. Brayton, “Algorithms for discrete function
manipulation", Proceedings of IEEE International Conference on Computer-Aided
Design, pp. 92–95, 1990.
[3] T. Sasao, H. Hamachi, S. Wada, and M. Matsuura, “Multi-level logic synthesis based
on pseudo-Kronecker decision diagrams and local transformation", Proceedings of
the IFIPWG 10.5 Workshop on Applications of the Reed-Muller Expansion in Circuit
Design, pp. 152–160, 1995.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] D. M. Miller, "Multiple-valued logic design tools", Proceedings of the IEEE Interna-
tional Symposium on Multiple-Valued Logic, pp. 2–11, 1993.
[6] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[7] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-
output functions based on shared multi-terminal multiple-valued decision diagrams”,
IEICE Trans. Inf. & Syst., vol. E82-D, no. 5, pp. 925–932, 1999.
[8] S. Minato, N. Ishiura, and S. Yajima, "Shared binary decision diagram with at-
tributed edges for efficient Boolean function manipulation", Proceedings of the 27th
ACM/IEEE DAC, pp. 52–57, 1990.
[9] H. M. H. Babu and T. Sasao, “Minimization of multiple-valued decision diagrams
using sampling method”, Proceedings of the Ninth Workshop on Synthesis and System
Integration of Mixed Technologies, pp. 291–298, 2000.
[10] H. M. H. Babu and T. Sasao, “Representations of multiple-output switching functions using multiple-valued pseudo-Kronecker decision diagrams”, Proceedings of the 30th IEEE International Symposium on Multiple-Valued Logic (ISMVL 2000), pp. 147–152, 2000.
Part II
An Overview About Design Architectures of
Multiple-Valued Circuits
multiple-valued circuits should be compared with the corresponding binary ones. Multiple-
valued circuits must also use a technology that is compatible with the standards of up-to-date
devices.
This part starts with multiple-valued flip-flops (MVFF) using pass transistor logic
which is given in Chapter 7. Two different design techniques are introduced in this chapter
for MVFF realized by pass transistors, which can be a promising alternative to static
CMOS for deep sub-micron design. An approach for designing multi-valued logic circuits
is introduced in Chapter 8. A systematic method for implementing a set of binary logic
functions as multi-valued logic functions is also described, and heuristic algorithms for the different stages of the design process are provided along with it. The technique described in
this chapter can be easily extended to implement higher radix circuits. Chapter 9 presents a
method in which a new Boolean variable assignment algorithm and minimization techniques
have been introduced, so that both the total computation time and the number of products
decrease. A graph is introduced in this chapter called an enhanced assignment graph (EAG)
for the efficient grouping of the Boolean variables. In order to make the best choice of
the proper base minterm, a new technique is defined to find the potential canonical cube
(PCC) covering it. In Chapter 10, the application of multi-valued Fredkin gates (MVFGs) is
shown with the implementation of fuzzy set and logic operations. Fuzzy relations and their
composition are very important in this theory as collections of fuzzy if-then rules and, fuzzy
GMP (Generalized Modus Ponens) and GMT (Generalized Modus Tollens) respectively is
mathematically equivalent to them. In this chapter, digitized fuzzy sets are described where
the membership values are discretized and represented using ternary variables and the
implementation of set operations. Finally, an advanced minimization method for multiple-
valued multiple-output functions is introduced in Chapter 11. New minimization approach
for multiple-valued functions has also been discussed where Kleenean Coefficients and
LUT are used to reduce the complexity. The shared sub functions are extracted with a
heuristic method to pair the functions.
CHAPTER 7
Multiple-Valued Flip-Flops
Using Pass Transistor Logic
This chapter presents the realization of multiple-valued flip-flops (MVFF) using Pass Tran-
sistor logic. Two different design techniques are introduced here for MVFF realized by pass
transistors, which can be a promising alternative to static CMOS for deep sub-micron de-
sign. A novel circuit is introduced which consists of multiple-valued pass transistors and which is called a “logical sum circuit”. This particular circuit is used as the elementary design component for the second approach to MVFF design. The introduced MVFF circuits can be attractive for their inherently lower power and component demands.
7.1 INTRODUCTION
The use of memory devices and circuits in digital systems is very important, as they provide a means for storing values either temporarily or permanently. Electronic latching circuits such as latches and flip-flops are examples of digital memory units. This chapter discusses the realization of multiple-valued flip-flops using pass transistor logic (PTL).
Various researchers have worked on the realization of different types of circuits using
PTL. Such circuits are very much suitable for multiple-valued logic. The realization of
multiple-valued flip-flops (MVFF) also has been studied by different authors.
In this chapter, two different approaches to MVFF realization are considered. In the first approach, multiple-valued inputs are coded into binary values for storing purposes. In the second approach, multiple-valued pass transistor logic is used to implement basic multiple-valued blocks with which an MVFF can be designed without any binary intervention. The two design approaches are described in the following sections.
The second approach realizes the MVFF directly with multiple-valued PTL, and it is shown that this approach requires a far smaller number of components than the first approach, where binary coding and decoding schemes are actually used.
7.1.2 Implementation of MVFF with Binary Coding and Decoding Using PTL
In this subsection, the realization of the circuits using pass transistors is discussed. Here, the multiple-valued inputs to the flip-flop are first encoded into binary values, and binary-valued pass transistors are used for the triggering and memorization of the flip-flop. The output is then decoded into multiple values accordingly using an encoder interface. This approach has a multiple-valued latch called the RSTU latch, which follows the truth table shown
in Table 7.1. The circuit design using binary valued pass transistor logic for the 2-valued
and 4-valued NAND gates used in the RSTU latch is shown in Fig. 7.1. In this case, the
simplest schemes are obtained with RSTU flip-flops corresponding to the binary coded
approach.
The following functions Di(x) and Gi(x) can be defined as follows:
Di (x) = 1 when x≤i ,
Di (x) = 0 when x>i ,
Gi (x) = 1 when x≥i , and
Gi (x) = 0 when x<i .
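These threshold functions can be written down directly; a minimal sketch:

    def D(i, x):
        """D_i(x) = 1 when x <= i, else 0."""
        return 1 if x <= i else 0

    def G(i, x):
        """G_i(x) = 1 when x >= i, else 0."""
        return 1 if x >= i else 0

    # Example over the 4-valued domain {0, 1, 2, 3}:
    print([D(1, x) for x in range(4)])   # -> [1, 1, 0, 0]
    print([G(2, x) for x in range(4)])   # -> [0, 0, 1, 1]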
With NAND gates and RSTU latches, the functions R = f1(C, D), S = f2(C, D), T = f3(C, D), and U = f4(C, D) correspond to Table 7.2.
R S T U Q Q0 Q1 Q2 Q3
1 0 0 0 0 0 1 1 1
0 1 0 0 1 1 0 1 1
0 0 1 0 2 1 1 0 1
0 0 0 1 3 1 1 1 0
1 1 1 1 Q Memory
If the pass transistor in Fig. 7.4 is turned on, the output is equal to the input, while if it
is turned off, the output is in a high impedance state.
The relation between Y2 and X in the inverted threshold gate is defined as follows:
X = 0 for Y2 > t, and X = 1 for Y2 < t.   (7.1)
Using the internal parameter X, the relation of the input Y1 and the output Z in a pass transistor is defined as:
Z = Y1<X> = Y1 for X = 1, and Z = Y1<X> = Φ for X = 0,   (7.2)
where Φ denotes the high-impedance state.
The pass transistors with threshold gates can be combined in series and/or parallel
connection combinations. Equation 7.2 can be regarded as the basis of the representation
of the inputs and outputs of connections. The series connection can be depicted as:
Fig. 7.5 shows the series connection. Parallel connections for common inputs can be
depicted as:
Z = y < x1 ∪ x2 ∪ . . . ∪x n > (7.4)
Figure 7.6: Parallel Connection: (a) Common Inputs (b) Different Inputs.
And finally, for common inputs in Fig. 7.7(a), the combination of parallel-series con-
nections can be depicted as
y2 = < x1 ∪ x2 >
Figure 7.7: Parallel-Series Connections: (a) Common Inputs (b) Different Inputs.
z = y3<x3> = (y1<x1> + y2<x2>)<x3> = y1<x1 x3> + y2<x2 x3>   (7.7)
Table 7.3: Multiple-Valued Inverted Outputs for the Corresponding Input Values
Input Output
0 3
1 2
2 1
3 0
X Y Q Q’
3 3 Q Q’
3 0 3 0
2 1 2 1
1 2 1 2
0 3 0 3
Such an expression can be realized by a pass transistor network with threshold gates.
As shown in Table 7.4, in all the cases, the output value is the logical sum value between
two input values. Since the logical sum of y 1 = 1, y 2 = 2 is 3, for the two inputs y 1 = 1 and
y 2 = 2 or vice versa, the output is 3 which is the logical sum of the two.
As pointed out in Fig. 7.10, the components in the blocks named LSC′ give the inverted output of the logical sum of the input values. Table 7.3 shows the inverted outputs for the corresponding input values. The truth table for the introduced circuit is given in Table 7.5. The construction and the multiple-valued inputs and outputs of the FFs are shown in Fig. 7.10.
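Interpreting the four logic levels as 2-bit codes, the behaviour described above can be sketched as follows; reading the "logical sum" as a bitwise OR of the codes is an inference from the y1 = 1, y2 = 2 example in the text, and the inversion follows Table 7.3.

    def logical_sum(y1, y2):
        """4-valued 'logical sum' of two inputs, read here as a bitwise OR of 2-bit codes."""
        return y1 | y2

    def inverted(v):
        """Multiple-valued inversion of Table 7.3: 0->3, 1->2, 2->1, 3->0."""
        return 3 - v

    print(logical_sum(1, 2))                 # -> 3
    print([inverted(v) for v in range(4)])   # -> [3, 2, 1, 0]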
7.3 SUMMARY
Pass transistor logic circuits result in substantial improvements in area and delay over conventional static CMOS. The two approaches described in this chapter can be noted as promising alternatives to the gate-level design approach using conventional CMOS or TTL logic for MVFFs (Multiple-Valued Flip-Flops). The efficient design for multiple-valued flip-flops, in terms of the number of components, is the direct processing of multiple-valued signals. In this respect, the second approach to designing the MVFF can be considered efficient and effective.
REFERENCES
[1] O. Ishizuka, “Synthesis of a Pass Transistor Network Applied to Multi-Valued Logic”, IEEE Trans., 1986.
[2] D. Etiemble and M. Israel, “On the realization of multiple-valued flip-flops”, IEEE Trans., 1980.
[3] T. Sasao, “Multiple-valued logic and optimization of programmable logic arrays”,
IEEE Trans., 1998.
[4] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi and A. Shimizu, “A
3.8-ns CMOS 16x16-b multiplier using complementary pass-transistor logic”, IEEE
JSSC, vol. 25, no. 2, 1990.
[5] A. P. Chandrakasan, S. Sheng and R. W. Brodersen, “Low Power CMOS Digital
Design”, IEEE JSSC, vol. SC-20, 1985.
[6] N. Zhuang and H. Wu, “Novel ternary JKL flip-flop”, Electronics Letters, vol. 26, no.
15, pp. 1145–1146, 1990.
[7] J. I. Accha and J. L. Huertas, “General Excitation table for a JK multi-stable”, Elec-
tronics Letter, vol. 11, pp. 624, 1975.
[8] H. M. H. Babu, M. I. Zaber, M. M. Rahman, and M. R. Islam, “Implementation of multiple-valued flip-flops using pass transistor logic”, Proceedings of the Euromicro Symposium on Digital System Design (DSD 2004), pp. 603–606, 2004.
CHAPTER 8
Voltage-Mode Pass
Transistor-Based
Multi-Valued Multiple-Output
Logic Circuits
8.1 INTRODUCTION
In recent times, the field of multi-valued logic (MVL) design and also the use of multi-
valued logic implementing binary logic have gathered much attention. Different techniques
to implement MVL are introduced. There have been mainly two different classes of such
approaches. First, there are the current-mode circuits, such as ECL or I2L, in which the function of WIRED-SUM (summation of currents in a node) is particularly used for implementing MVL functions. On the other hand, voltage-mode logic circuits make use of threshold
voltage levels/gates.
In this chapter, the design method of a circuit module of the latter type is described. The circuit design uses pass transistor networks and threshold gates to implement MVL. It also introduces a useful notation to represent a general MVL circuit with MOS transistors. But here, emphasis is given to the fact that earlier designs do not describe the implementation of the general class of logic functions, which is binary. This chapter, along with the introduced circuit design technique for multi-valued (quaternary) logic functions, also describes the implementation of binary logic functions in such circuit modules. Another important factor
in the design of MVL circuits is that the simplification of MVL functions is quite difficult to achieve. A simplification algorithm is also provided in this chapter for the introduced
circuits.
Such functions are used in sum-of-products (SOP) form. To define product terms and the SOP form, the definition of literals is given first.
Property 8.2.2 Let X be a multi-valued variable that takes one value in P = {0, 1, 2, . . ., p−
1}. Let S ⊆ P and X S is a literal of X , where X S = 1 if X ∈ S and X S = 0 if X< S.
Example 8.1 The function f(Y1, Y2) = Y1^{0,3} Y2^{0} + Y1^{1} Y2^{3} is a multi-valued function of two variables. Suppose both the variables are 4-valued. According to the definitions given above, there are four literals (e.g., Y1^{0,3}, etc.) and two products (i.e., Y1^{0,3} Y2^{0} and Y1^{1} Y2^{3}) in the function written in SOP form. From the definition, the function is a 4-valued input 2-valued output function.
In this chapter, a set of binary functions is converted to a set of four-valued input two-valued output functions; i.e., according to the definition of MVL functions, MVL functions are obtained where the domain of each Pi is restricted to the set {0, 1, 2, 3}. Two such functions are also paired when they are implemented in the circuit, so after this stage the obtained functions are four-valued input four-valued output MVL functions. The design procedure consists of the following three stages:
1. Conversion of binary logic functions into four-valued input two-valued output functions.
2. Pairing of the functions obtained in the first stage.
3. Output stage: here the functions paired in the second stage are implemented together
in one circuit. These circuits are basically composed of pass transistor networks.
The following subsections give the detailed description of these stages.
Example 8.2 The binary logic function defined in Table 8.1(a) can be converted into a
four-valued input two-valued output function as shown in Table 8.1(b) where Y 1 = (x 1, x 2 )
and Y 2 = (x 3, x 4 ).
The function f in SOP form is given below: f(Y1, Y2) = Y1^{0,3} Y2^{0} + Y1^{2,3} Y2^{1} + Y1^{1} Y2^{3}.
Table 8.1: Truth Tables with (a) Binary Form and (b) Multi-Valued Form
A heuristic algorithm (Algorithm 8.1) is presented below which selects the best pairing
of the input variables. From this algorithm a partition of the variables is obtained where
each set in the partition contains a unique pair of variables. In the algorithm the concept of
residuals is used. This is defined below:
Property 8.3.1 Suppose S is a set of variables and f is a binary logic function in SOP form. The residual of the function with respect to the set S, denoted by R_S, is the number of unique terms left in f after deleting the variables of set S.
In the algorithm, R_{i,j} is calculated for each pair of variables (i, j) over all the functions. Then, if S = {x1, x2}, R_S = 2.
Now the algorithm for pairing of binary input variables is given below:
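A minimal sketch of a residual-based pairing heuristic in the spirit described above is given below; the greedy selection of the pair with the smallest residual and the term encoding are assumptions, not necessarily the exact Algorithm 8.1.

    from itertools import combinations

    def residual(function_terms, pair):
        """Number of unique terms left after deleting the paired variables from
        every product term (Property 8.3.1), summed over all functions."""
        total = 0
        for terms in function_terms:
            reduced = {frozenset(lit for lit in term if lit[0] not in pair)
                       for term in terms}
            total += len(reduced)
        return total

    def pair_variables(variables, function_terms):
        """Greedy pairing: repeatedly take the pair with the smallest residual."""
        remaining, pairs = set(variables), []
        while len(remaining) > 1:
            best = min(combinations(sorted(remaining), 2),
                       key=lambda p: residual(function_terms, p))
            pairs.append(best)
            remaining -= set(best)
        return pairs

    # Hypothetical encoding: each product term is a frozenset of (variable, polarity)
    # literals; f below is x1'x2 + x1x2 + x3x4.
    f = [{frozenset({("x1", 0), ("x2", 1)}),
          frozenset({("x1", 1), ("x2", 1)}),
          frozenset({("x3", 1), ("x4", 1)})}]
    print(pair_variables(["x1", "x2", "x3", "x4"], f))   # -> [('x1', 'x2'), ('x3', 'x4')]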
Property 8.3.2 Support of a function: Let f be a function. The set of variables on which the function depends is called the support of f, denoted as support(f).
Example 8.6 For the functions f1 = x3′x4 + x1′x3x4 + x2x4 + x2x3, f2 = x1′x3x4 + x1x2′x3′x4, f3 = x3′x4 + x2x3 + x1x2x3x4, and f4 = x1′x2x4′ + x1x2x3x4, with the variable pairings (x1, x2) and (x3, x4), the function pairings as obtained by the algorithm are (f4, f3) and (f2, f1).
Table 8.2 shows circuit output for the different states of the pass transistor networks in
Fig. 8.1.
Table Entries: A 1 or 0 entry for a network indicates that there exists a closed path or no path, respectively, between the two terminals of the network. A d-entry indicates that the output of the circuit does not depend on the state of that network. For the output Z, the four different logic levels it can take are shown.
Basic Circuit Operation: With A = 1 and B = 1 there exists a closed path between the
output and the ground and thus the output voltage will be 0 for which Z = 0. In this case
A B C Z
1 1 d 0
0 1 0 1
0 1 1 2
d 1 d 3
state of the C network is in what can be called a don't-care state, hence the d-entry. With B = 1 and A = C = 0, there is a path from VDD to GND through transistors Q1, Q2, and Q3. The output will be the voltage-divider output of the on-resistances of the transistors. Similarly, when A = 0 and B = C = 1, the transistors Q1 and Q2 form the voltage divider. For this case and the previous one, the proper design of the transistors will result in the appropriate voltages for the output logic values 1 and 2, respectively. Lastly, when B = 0, irrespective of the states of the other networks, VDD is connected to the output and we have the desired voltage level for the logic level Z = 3. However, it is useful to add another pass
transistor network as shown in Fig. 8.2 and have the relationship between the states of the
networks and the output as shown in Table 8.3.
A B C D Z
1 d d d 0
d 1 1 d 0
0 1 0 0 1
0 1 0 1 2
0 0 d d 3
Construction of Truth Table: For the two functions to be implemented, a truth table is
constructed. The functions here are assumed to be of four variables and thus the table
constructed will be a 2-dimensional table.
This is done according to the following two steps:
1. Enter the value 1 for the table entries which correspond to a truth value of 1 for the function f2. For the other entries, enter 0.
2. For the product terms of function f1, add 2 to the corresponding entries of the table, which are either 0 or 1.
Similarly when n-variable functions are considered, the tables will be of n/2-dimensions.
In the next example the introduced design procedure is followed step by step to obtain the
circuit for two binary logic functions.
Example 8.7 Suppose f1 = x1′x3′x4 + x1′x2′x3 and f2 = x2′x4 are paired together and Z is interpreted as follows:
Z = 0 ⇒ f1 = 0, f2 = 0;  Z = 1 ⇒ f1 = 0, f2 = 1;
Z = 2 ⇒ f1 = 1, f2 = 0;  Z = 3 ⇒ f1 = 1, f2 = 1.
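The two construction steps can be mimicked directly. The sketch below builds the table of Example 8.7 by evaluating every minterm and storing Z = 2·f1 + f2; the explicit predicates and the convention that Y1 = (x1, x2) and Y2 = (x3, x4) are read as 2-bit numbers are assumptions about representation, not the book's procedure.

    from itertools import product

    # f1 = x1'x3'x4 + x1'x2'x3  and  f2 = x2'x4  (Example 8.7), as Python predicates.
    f1 = lambda x1, x2, x3, x4: ((1 - x1) & (1 - x3) & x4) | ((1 - x1) & (1 - x2) & x3)
    f2 = lambda x1, x2, x3, x4: (1 - x2) & x4

    # Step 1: enter f2 (0/1); Step 2: add 2 for every minterm of f1.
    # Rows are Y1 = (x1, x2), columns are Y2 = (x3, x4), both read as 2-bit numbers.
    table = [[0] * 4 for _ in range(4)]
    for x1, x2, x3, x4 in product((0, 1), repeat=4):
        y1, y2 = 2 * x1 + x2, 2 * x3 + x4
        table[y1][y2] = 2 * f1(x1, x2, x3, x4) + f2(x1, x2, x3, x4)

    for row in table:
        print(row)   # each entry is Z = 2*f1 + f2 for that (Y1, Y2)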
Table 8.4: (a) Truth Table for Example 8.7, and Tables when (b) A = 0, (c) B = Y2^{0,2,3} + Y1^{1,3}Y2^{1}, (d) C = Y2^{2} + Y2^{3}Y1^{1,2,3} + Y1^{0,1,3}Y2^{0}, (e) D = Y1^{2}Y2^{0} + Y1^{0}Y2^{3}
It can be noted that the expressions for the networks in Example 8.7 are obtained using
this algorithm.
In the example, the method which is given for generating the expressions for the networks considers only functions of four two-valued input variables. Though tables are used to describe the functions, the method can easily do without them. Also, the methods can be generalized for functions with a larger number of variables.
A 2-to-4 line decoder may also be used, where the inputs are the two variables x1 and x2 and the outputs are Y^{0}, Y^{1}, Y^{2}, and Y^{3}. These can then be used to generate the other literals.
8.4 SUMMARY
In this chapter, a design technique for multi-valued logic (MVL) functions is described, and it is shown with examples how multiple-output binary logic functions can be implemented using it. The technique described here can be easily extended to implement higher radix circuits. In the circuits which are designed, the output is encoded to express the outputs of two different logic functions. Thus, when this signal is going to be used, decoding is required. However, doing so actually reduces the number of output pins, and the outputs
of the functions are available simultaneously.
While the main objective of the chapter is to give a compact design method for MVL
functions, in the implementation of binary logic functions it is found that, though this
method requires a few more transistors in some cases, there are classes of functions for
which the method, in terms of number of transistors, can be quite efficient. The main
problem can be identified as the problem of the simplification of the expressions. As the
number of variables grows, the method introduced for the minimization of the expressions becomes very time consuming. Thus, it may take considerable time to find the minimized expression.
REFERENCES
[1] E. J. McCluskey, “Logic design of multivalued IIL logic circuits”, IEEE Transactions
on Computing, vol. 28, pp. 546–559, 1979.
[2] E. J. McCluskey, “Logic design of MOS ternary logic”, Proceedings of IEEE ISMVL,
pp. 1–5, 1980.
[3] S. P. Onneweer and H. G. Kerkhoff, “Current-mode CMOS high-radix circuits”,
Proceedings of IEEE ISMVL, pp. 60–68, 1986.
[4] L. K. Russell, “Multilevel NMOS Circuits”, 1980.
[5] O. Ishizuka, “Synthesis of a pass transistor network applied to multi-valued logic”,
1986.
[6] E. J. McCluskey, “Logic Design Principles”, Prentice-Hall, Englewood Cliffs, NJ,
1986.
[7] H. M. H. Babu, M. R. Islam, A. A. Ali, M. M. S. Akon, M. A. Rahaman, and M. F. Islam, “A technique for logic design of voltage-mode pass transistor based multi-valued multiple-output logic circuits”, Proceedings of the 33rd IEEE International Symposium on Multiple-Valued Logic (ISMVL 2003), pp. 111–116, 2003.
CHAPTER 9
Multiple-Valued Input
Binary-Valued Output
Functions
The success of the local covering approach to multiple-valued input two-valued output
(MVITVO) function minimization depends vastly on the proper choice of the base minterms from the ON set. This chapter presents some new techniques to improve the performance of this approach. A graph called an enhanced assignment graph (EAG) is introduced for the effi-
cient grouping of the Boolean variables. In order to make the best choice of the proper base
minterm, a new technique is defined to find the potential canonical cube (PCC) covering
it. This process succeeds in finding out the essential primes efficiently which enhances the
total computation time and produces better sum of products (SOP).
9.1 INTRODUCTION
Simplification of sum-of-products expressions is of great importance in logic synthesis. A large share of the total computation time for logic synthesis is spent on the simplification of SOPs, which is directly related to the simplification of PLAs. In this context, the use of multiple-valued logic (MVL) is gaining importance day by day. The interconnection complexity of two-valued functions, both within a chip and between chips, is reduced effectively by the adept use of MVL. These functions can be of great use in minimizing decoded PLAs and in the realm of sequential circuits and networks. Among
the heuristic methods to find out the minimum cover of the MVITVO functions MINI and
ESPRESSO-IIC are very well known. In these methods a near optimum cover of the function
F to be minimized is achieved through iterative improvement by reshaping and reducing
the initial cover of it. A slightly different approach to these heuristics is the approach of
the local covering technique, where the whole process starts from a properly chosen base
minterm. An improved version of this technique is as follows: first, a set of sub-functions of the function to be minimized is built (expansion process). Then one or more primes are
selected from the ones of each sub-function (selection process). In the end a union of all
the selected primes is done which forms a cover of the function F .
In this chapter, the computational time is held down by emphasizing the search for the "best minterm" (the minterm which has the fewest number of adjacent minterms) while at the same time enhancing the probability of detecting and selecting the essential primes during expansion. This algorithm preserves the notion of the previous ones and is made aware of the lower bound on the number of primes in the minimum cover of the given function. It works in two phases: first, an efficient grouping of the Boolean variables is found; second, viable cubes with suitable minterms are found quickly. In short, the algorithm is fast in computation and prudent in keeping the functions as minimized as possible.
A distinguished minterm is a minterm that is covered by only one prime implicant. The essential primes cover one or more distinguished minterms.
Consider F : P1 × P2 × P3 × . . . × Pn → B, where P1 = P2 = P3 = . . . = Pn = {0, 1, 2, 3}. Again, let α be a minterm of F such that α = X1{0} X2{1} X3{2}. The circular shift operation, written α →(i, j) η, implies the replacement of the value of the i-th variable in α with the j-th value that follows the value of the i-th variable in α, in Pi. The first value of Pi is assumed to follow the last one. Suppose i = 2 and j = 1; then applying the operation to α gives η = X1{0} X2{2} X3{2}. If the circular shift is applied to a canonical cube, a set of minterms adjacent to it is produced.
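The circular shift operation can be made concrete with a short Python sketch (an illustration only; the function name circular_shift and the tuple representation of a minterm are assumptions, not the book's notation):

P = (0, 1, 2, 3)

def circular_shift(alpha, i, j):
    # replace the i-th value of the minterm with the j-th value that follows it in P,
    # wrapping around so that the first value of P follows the last one
    values = list(alpha)
    pos = P.index(values[i])
    values[i] = P[(pos + j) % len(P)]
    return tuple(values)

# alpha = X1{0} X2{1} X3{2}; applying the shift with i = 2 (second variable) and j = 1
alpha = (0, 1, 2)
eta = circular_shift(alpha, i=2 - 1, j=1)   # i converted to a 0-based index
print(eta)   # (0, 2, 2), i.e., X1{0} X2{2} X3{2}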
The minimization procedure for MVITVO functions consists of the following steps:
(a) The determination of all prime implicants of the function.
(b) Finding out the essential prime implicants.
(c) From the remaining prime implicants a minimum set is chosen so that together with
the essential prime implicants they cover the function.
Let s be a subset of the input variables; here |s| = 2 is considered. Deleting all literals of the variables in s from each term of the given function F and leaving the other literals in those terms, the number of distinct terms in the resulting disjunctive form is denoted by R_s.
Example 9.1
f (x1, x2, x3, x4, x5) = x1 x2′ x3 x4 + x1 x3′ x4 x5 + x2′ x3′ x4 x5′ + x1 x2 x3′ + x2 x3 + x1 x2′ x4 x5′
Let s be the set (x1, x3). Then R(x1, x3) = 4, as there are 4 distinct terms left after deleting the literals of s.
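As a quick illustration of the R_s computation (a sketch under the assumption that each product term is written as a set of literal strings such as "x1" or "x2'"; the helper name r_s is hypothetical):

def r_s(terms, s):
    # delete all literals of the variables in s from each term and count
    # the distinct non-empty terms that remain
    reduced = set()
    for term in terms:
        kept = frozenset(lit for lit in term if lit.rstrip("'") not in s)
        if kept:
            reduced.add(kept)
    return len(reduced)

# f of Example 9.1, with x2' denoting the complemented literal of x2
f = [
    {"x1", "x2'", "x3", "x4"},
    {"x1", "x3'", "x4", "x5"},
    {"x2'", "x3'", "x4", "x5'"},
    {"x1", "x2", "x3'"},
    {"x2", "x3"},
    {"x1", "x2'", "x4", "x5'"},
]
print(r_s(f, {"x1", "x3"}))   # 4, matching R(x1, x3) = 4 above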
In the algorithm for grouping the variables, R_s is used to estimate the number of products when the switching variables in s are assigned to a MVITVO variable. An assignment graph G (Fig. 9.1) for an n-variable function
f (x1, x 2, x3, . . . , xn ) is a complete graph such that,
1. G has n nodes, one for each input variable.
2. The weight of the edge (i, j) between nodes i and j is R(xi , x j ) .
Let G = (V, E) be a connected graph with n vertices. A Hamilton path is a path that goes through each vertex exactly once. A minimum-cost Hamilton path is a Hamilton path for which the sum of the edge weights is minimum. If f is the given switching function, the uncomplemented weight of
a variable x i of f is defined to be the number of times x i appears in uncomplemented form
in different products of the ON set of f . Similarly, the complemented weight of a variable
x i of f is the number of times x i appears in complemented form in different products of the
ON set of f .
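The assignment graph and the minimum-weight Hamilton path can be sketched as follows (illustrative Python only; the brute-force enumeration is adequate for the small variable counts of these examples, and r_s is the helper from the previous sketch):

from itertools import permutations

def assignment_graph(terms, variables):
    # edge weight between xi and xj is R_{xi, xj}
    weights = {}
    for i, xi in enumerate(variables):
        for xj in variables[i + 1:]:
            weights[frozenset((xi, xj))] = r_s(terms, {xi, xj})
    return weights

def min_hamilton_path(weights, variables, start):
    # enumerate all orderings that start at `start` and keep the cheapest one
    rest = [v for v in variables if v != start]
    best_path, best_cost = None, None
    for perm in permutations(rest):
        path = (start,) + perm
        cost = sum(weights[frozenset((path[k], path[k + 1]))]
                   for k in range(len(path) - 1))
        if best_cost is None or cost < best_cost:
            best_path, best_cost = path, cost
    return best_path, best_cost

# Usage, with the terms f of Example 9.1:
# w = assignment_graph(f, ["x1", "x2", "x3", "x4", "x5"])
# print(min_hamilton_path(w, ["x1", "x2", "x3", "x4", "x5"], "x1"))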
Example 9.2
Let us consider the switching function f (x1, x2, x3, x4) = x1′x2′x3′x4′ + x1′x2′x3′x4 + x1 x2′x3 x4′ + x1′x2 x3′x4′ + x1′x2 x3′x4 + x1′x2 x3 x4 + x1 x2′x3′x4′ + x1 x2′x3 x4′ + x1 x2 x3′x4 + x1 x2 x3 x4′.
1) When the input variables are assigned as X1 = (x1, x2) and X2 = (x3, x4), the minimum sum-of-products expression is F(X1, X2) = X1{0,1} X2{0,1} ∨ X1{0,3} X2{1,2} ∨ X1{1,2} X2{0,3} (1). Three product terms are necessary in this assignment.
2) When the input variables are assigned as X1 = (x1, x3) and X2 = (x2, x4), the minimum sum-of-products expression is F(X1, X2) = X1{0,1,2} X2{0,3} ∨ X1{0,3} X2{1,2} (2). Two product terms are necessary in this assignment.
3) When the input variables are assigned as X1 = (x1, x4) and X2 = (x2, x3), the minimum sum-of-products expression is F(X1, X2) = X1{0,3} X2{1,2} ∨ X1{1,2} X2{0,3} ∨ X1{0,1} X2{0,2} (3). Three product terms are necessary in this assignment.
Therefore, when the input variables are assigned as in (2), the number of product terms is minimized.
In order to find an efficient grouping, the assignment graph G is first built, and a minimum-weight Hamilton path is found, starting from the node corresponding to x1. The variables are ordered according to the order of appearance of the vertices, and the pairs of variables are assigned to multiple-valued variables. Following the above notion, different orderings may be obtained for the function of Example 9.2, which are as follows:
1. x1, x2, x3, x4 and x1, x2, x4, x3
2. x1, x3, x2, x4 and x1, x3, x4, x2
3. x1, x4, x2, x3 and x1, x4, x3, x2
So, three different orderings are obtained corresponding to the orderings shown in Example 9.2, which still do not yield the optimal ordering (2) of the example. In order to solve this problem, an "Enhanced Assignment Graph (EAG)" (Fig. 9.2) is introduced here, the definition of which is as follows:
Example 9.3 Considering the EAG (Fig. 9.2) and by using Algorithm 9.1 it may get the
following ordering:
x1, x3, x2, x4
x1, x3, x4, x2
that the set of sub-functions will increase and essential primes are detected in the earliest
phase of the computation. In order to realize this procedure, the given expression is preferred
to be in canonical form. The expansion process in this algorithm is done by circular shifting
the cubes. In case of canonical cubes, it generates a set of minterms adjacent to the original
cube. The motivation to the new procedure lies in the fact that the minterm with the smallest
number of adjacents would reside in the canonical cubes with smallest number of adjacents.
So, if the given canonical cubes are arranged with respect to their adjacency, searching for the minterms with the smallest number of adjacents becomes less time consuming. The algorithm uses a table of indices for all the canonical cubes and rearranges it according to their number of adjacents (Fig. 9.1).
Figure 9.3: Multiple-Valued Function Example: All Cubes are in the Form of Canonical
Cubes.
The procedure first checks the Xi's (1 ≤ i ≤ m, where m is the number of different multiple values) and counts the number of distinct values (i.e., 0, 1, 2, 3) for each Xi (i ∈ [0, n − 1], where n is the number of literals) over all the canonical cubes. Then it updates the index table of the canonical cubes according to the weight of the distinct values (a weight here is the number of occurrences of that value). The same procedure is performed for each Xj (1 ≤ j ≤ n − 1, where n is the number of literals). In this way it comes to a point when the table cannot be
updated anymore and this is when it is succeeded in rearranging properly the cubes such that
it gets the minterms with the smallest adjacents just by consulting this table sequentially.
This technique is presented in Algorithm 9.1.
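A rough Python sketch of the index-table rearrangement idea follows (a simplification of Algorithm 9.1, whose exact bookkeeping is not reproduced here; the cube representation and function name are assumptions):

from collections import Counter

def rearrange_index_table(cubes):
    n = len(cubes[0])
    # weight of a value v at position i = number of times v occurs at position i over all cubes
    counters = [Counter(v for cube in cubes for v in cube[i]) for i in range(n)]
    def cube_weight(cube):
        return sum(counters[i][v] for i in range(n) for v in cube[i])
    # index table sorted so that cubes with lower total weight come first
    return sorted(range(len(cubes)), key=lambda idx: cube_weight(cubes[idx]))

# Cubes written as tuples of value sets, here the cubes of the sub-function example below:
cubes = [
    ({0}, {0}, {1}, {0, 1, 2}),   # X1{0} X2{0} X3{1} X4{0,1,2}
    ({3}, {1}, {2}, {2}),         # X1{3} X2{1} X3{2} X4{2}
    ({0}, {3}, {3}, {0, 1, 2}),   # X1{0} X2{3} X3{3} X4{0,1,2}
]
print(rearrange_index_table(cubes))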
Figure 9.4: The Different Arrangements of the Index Table in the Different Passes.
Each sub-function consists of the canonical cubes which are adjacent to the base
minterm properly chosen from the function F.
Let F = {A1, A2, A3} where A1 = X1{0} X2{0} X3{1} X4{0,1,2}, A2 = X1{3} X2{1} X3{2} X4{2}, A3 = X1{0} X2{3} X3{3} X4{0,1,2}, and let the base minterm be a = X1{0} X2{0} X3{1} X4{0}; then after performing the expansion procedure, two sub-functions P and Q of F are obtained, where P has {A1, A3} and Q has A2.
In this technique, a supercube S is first generated from the base minterm a and all the minterms of F that are adjacent to a, computed using the circular shift operation. Then, using the base minterm a and the supercube S, the canonical cubes for the sub-functions to be built are generated. The selection procedure processes the sub-functions one at a time to obtain an irredundant set.
Hence the algorithm for the local cover technique can be described as follows:
Example 9.6 Generation of a sub-function: The algorithm sub_function() builds the sub-function associated with a base minterm a = {0}×{0}×{1}×{0} and thereby obtains the multiple-valued ON-set cube P and OFF-set cube R. After running this algorithm, P and R are obtained as follows:
The routine Check_B reduces D so that D ∩ P_OFF = ∅. Here D = Supercube(a, B) ∩ P, and P_OFF is the OFF set of P. Since P_OFF is not available, a subset of it, R, built during the generation of P, is used.
Example 9.7 Deriving the primes of a sub-function: The algorithm generates the set of cubes C that can be built by computing the set S # R and deleting every cube that implies one of the other yielded cubes or does not cover the base minterm. Initially, C contains only S. The inner loop processes the cubes of C one at a time. Let Ck be the cube under processing. If Ck does not intersect Rh, it is inserted in C″; otherwise, each cube Ck #i Rh that covers the base minterm is inserted in C′. Then each cube of C′ implying a cube of C″ is deleted, and a new C is formed by the residual C′ and C″. This process is repeated for such a C in the next iteration of the outer loop. After running this algorithm, the primes obtained are as follows:
9.4 SUMMARY
A new Boolean variable assignment algorithm and minimization techniques have been introduced, so that both the total computation time and the number of products decrease. The algorithmic extension has proven to be efficient in detecting and selecting the essential prime implicants as well as furnishing the lower bound on the number of prime implicants in the first phase of the computation process. The new concepts of the "enhanced assignment graph", the "use of the Hamiltonian path" in finding the best pairs, and the technique of "cube rearrangement" have proven to be efficient in the step-by-step minimization process. Along with these, the heuristics used in the different phases of expansion and selection have improved the quality of the whole technique.
REFERENCES
[1] H. M. H. Babu, M. Zaber, R. Islam and M. Rahman, “On the minimization of Multiple
Valued Input Binary Valued Output Functions”, International Symposium on Multiple
Valued Logic (ISMVL 2004), 2004.
[2] G. Caruso, “A local Cover Technique for Minimization of Multiple-Valued Input
Binary-Valued Output Functions”, IEICE Trans. Fundamentals, vol. E79-A, 1996.
[3] T. Sasao, “Input variable assignment and output phase optimization of PLA’s”, IEEE
Trans., Comput., vol. C-33, pp. 879–894, 1984.
[4] R. K. Brayton, G. D. Hatchel, C. T. McMullen and A. Sangiovanni-Vincentelli, “Logic
Minimization Algorithms for VLSI Synthesis”, 1984.
[5] G. Caruso, “A local selection algorithm for switching minimization”, IEEE Trans.,
Comput., vol. C-33, pp. 91–97, 1984.
CHAPTER 10
Digital Fuzzy Operations Using Multi-Valued Fredkin Gates
Multi-valued Fredkin gates (MVFGs) are reversible gates, and these gates can be considered as modified versions of the better-known reversible Fredkin gate. Reversible logic gates are
circuits that have the same number of inputs and outputs and have one-to-one and onto
mapping between vectors of inputs and outputs. Thus, the vector of input states can be
always reconstructed from the vectors of output states. It has been shown that the power
is not dissipated in an arbitrary circuit when the circuit is built from reversible gates.
Moreover, multi-valued Fredkin gates have been shown to be a suitable choice as a basic
building block for binary and different alternative logics for example multi-valued logic
and threshold logic.
In this chapter the application of MVFGs is shown with the implementation of fuzzy
set and logic operations. Fuzzy relations and their composition are very important in this
theory as collections of fuzzy if-then rules, fuzzy GMP (Generalized Modus Ponens) and
GMT (Generalized Modus Tollens) respectively are mathematically equivalent to them. In
this chapter, digitized fuzzy sets are described where the membership values are discretized
and represented using ternary variables. The composition of fuzzy relations and a systolic
array structure are also described. Design with reversible gates and the highly parallel
architecture of systolic arrays make the circuits quite attractive for implementation.
10.1 INTRODUCTION
Fuzzy set theory and the corresponding logic mark quite a transition from traditional set theory in handling the concept of uncertainty. When A is a fuzzy set and x is a relevant object, the
proposition “x is a member of A” is not necessarily true or false, it may be true only to
some degree. It is most common to express the degrees of membership by numbers in the
closed interval [0, 1]. In this chapter, digital fuzzy set is considered where the membership-
value space is discretized. The standard set operations and the concept of fuzzy relations
are defined based on these digital fuzzy sets and their realizations. In this chapter, the
composition of fuzzy relations and a systolic array structure are described to compute it.
Collections of fuzzy if-then rules or fuzzy algorithms are mathematically equivalent to
fuzzy relations and the problem of inference of (evaluating them with specific values) is
mathematically equivalent to composition.
The introduced circuit is composed of Multi-Valued Fredkin Gates (MVFG) which are
reversible gates. Conservative and reversible logic gates are widely known to be compatible
with the new computing paradigms like optical and quantum computing. Reversible logic
gates are circuits that have the same number of inputs and outputs and have one-to-one and
onto mapping between vectors of inputs and outputs; thus, the vector of input states can be
always reconstructed from the vectors of output states. Irreversible functions (all gates in classical binary logic except the NOT gate are irreversible) can be converted into reversible functions easily: if the maximum number of identical output vectors is p, then ⌈log p⌉ garbage outputs (and some inputs, if necessary) must be added to make the input-output vector mapping unique. Reversible logic is applicable to quantum computing, nanotechnology, and low-power design. For power not to be dissipated in an arbitrary circuit, it is necessary that the circuit be built from reversible gates. Multi-valued reversible gates, however, have not received much attention until recently. This chapter concentrates on the multiple-valued
Fredkin gates.
10.2.1 Some Basic Reversible Gates and Classical Digital Logic Using these Gates
Among many gates, Fredkin gates together with Toffoli gates and Feynman gates are the
most often discussed gates in reversible and quantum architecture and it is suggested that
future realization efforts will concentrate mostly on these gates and their derivations. These reversible gates along with a new gate are shown in Fig. 10.1. The truth table of the two-input Feynman gate is as follows:

x  y  |  x′  y′
0  0  |  0   0
0  1  |  0   1
1  0  |  1   1
1  1  |  1   0
Figure 10.1: (a) Feynman Gate, (b) Fredkin Gate, (c) Toffoli Gate, and (d) New Gate.
In the strict reversible logic paradigm, signal fan-out is forbidden. However, most of the gates provide one of the inputs at the outputs unaltered. Using constant inputs, fan-out and other different functions can also be generated. Fig. 10.2 shows some such constructions.
In Fig. 10.2 (a) and (b), Fredkin gates are used to implement the fan-out and AND operations. In Fig. 10.2 (c), the AND and EXOR operations on two inputs are performed. It can be pointed out that for this gate the output z should be x′ ⊕ y′, which is equivalent to x ⊕ y.
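To illustrate these constructions, the following Python model treats the binary Fredkin gate as a controlled swap; the constant-input settings shown below are one common way of obtaining fan-out and AND and are offered only as a sketch of the idea in Fig. 10.2:

def fredkin(c, p, q):
    # controlled swap: the two data inputs are exchanged when the control c is 1
    return (c, p, q) if c == 0 else (c, q, p)

# Fan-out of a signal a (together with its complement), using constants 0 and 1:
a = 1
_, a_copy, a_not = fredkin(a, 0, 1)    # a_copy = a, a_not = NOT a
# AND of a and b, using a constant 0:
b = 1
_, _, a_and_b = fredkin(a, b, 0)       # third output = a AND b
print(a_copy, a_not, a_and_b)          # 1 0 1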
One observation that can be made here is that the definition of the gate does not specify the type of signals. Thus, they can be binary, multi-valued, etc. The only requirement is that the relation (<) can be defined on them. These gates can be used to implement alternative logics, such as threshold logic, array logic, etc., and these gates can be constructed using optical devices such as the photonic switches that are being developed in telecommunications.
Fuzzy sets: Zadeh introduced fuzzy sets by defining characteristic functions for fuzzy sets, called membership functions, µA(x) : X → [0, 1]. So, with fuzzy sets one may talk about the degree of membership that an element x has, denoted by µA(x), which is a number between 0 and 1. Membership functions thus may represent an individual's (subjective) notion of a vague class, for example tall people, little improvement, big benefit, etc.
If X is a universe of discourse and x is a particular element of X, then a fuzzy set A defined on X may be written as a collection of ordered pairs A = {(x, µA(x))}, x ∈ X. Alternatively, a fuzzy set may be written as

A = Σ_{xi ∈ X} µA(xi)/xi

if the universe of discourse is discrete; µA(x) in this case can be called a discrete-universe membership function. If the fuzzy set has a continuous universe of discourse, one may write

A = ∫_X µA(x)/x
Example 10.1 Suppose the membership function of a fuzzy set representing the concept
of a middle-aged person is given as
Now a possible discrete approximation A(x) : {0, 5, 10, 15, . . . , 80} → [0, 1] of the
membership function can be defined in Table 10.2.
Now suppose only 9 different levels of membership-values are defined using 2 ternary
variables as shown in Table 10.3, then it may represent the digital fuzzy set as D = 0.3/25
+ 0.7/30 + 1/35 + 1/40 + 1/45 + 0.7/50 + 0.3/55.
Table 10.2:

x                           A(x)
x ∉ {25, 30, . . . , 55}    0.00
x ∈ {25, 55}                0.33
x ∈ {30, 50}                0.67
x ∈ {35, 40, 45}            1.00

Table 10.3:

A2  A1    Value
0   0     0.0
0   1     0.15
0   2     0.3
1   0     0.4
1   1     0.5
1   2     0.6
2   0     0.7
2   1     0.85
2   2     1.0
From this example it is possible to see that, by considering more elements and a larger number of membership values, a fuzzy set may be represented more precisely.
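The 9-level digitization of Table 10.3 can be sketched in Python as follows (the level values come from the table; the quantization-to-nearest-level step and the function names encode/decode are assumptions made for illustration):

LEVELS = [0.0, 0.15, 0.3, 0.4, 0.5, 0.6, 0.7, 0.85, 1.0]

def encode(mu):
    # quantize a membership value in [0, 1] to the nearest of the 9 levels
    level = min(range(len(LEVELS)), key=lambda k: abs(LEVELS[k] - mu))
    return divmod(level, 3)            # the two ternary digits (A2, A1)

def decode(a2, a1):
    return LEVELS[3 * a2 + a1]

# The digital fuzzy set D = 0.3/25 + 0.7/30 + 1/35 + 1/40 + 1/45 + 0.7/50 + 0.3/55:
D = {25: 0.3, 30: 0.7, 35: 1.0, 40: 1.0, 45: 1.0, 50: 0.7, 55: 0.3}
encoded = {x: encode(mu) for x, mu in D.items()}
print(encoded[25], decode(*encoded[25]))   # (0, 2) 0.3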
Fuzzy Operations: There are 3 standard fuzzy set operations namely complement,
intersection and union. The concept of fuzzy relation and the composition operation are
also discussed.
Complement: Let A be a fuzzy set on X. The complement of A has the membership function Ā(x) = 1 − A(x); this value may be interpreted not only as the degree to which x belongs to Ā, the complement of A, but also as the degree to which x does not belong to A.
Intersection/t-Norm and Union/t-Conorm: The intersection or the union of two fuzzy sets A and B is specified in general by a binary operation on the unit interval, i.e., a function of the form f : [0, 1] × [0, 1] → [0, 1]. For each element x of the universe set, this function produces the intersection as (A ∩ B)(x) = i[A(x), B(x)] = A(x) ∧ B(x), and the union is expressed as (A ∪ B)(x) = u[A(x), B(x)] = A(x) ∨ B(x).
Different t-norm and t-conorm operators exist; however, the standard operations for intersection and union are the following: standard intersection i(a, b) = min(a, b) and standard union u(a, b) = max(a, b), where a, b ∈ [0, 1].
The standard operations will be used throughout the chapter. For digital fuzzy sets it is easy to compute the complement operation: it just needs the digits to be complemented. In Section 10.4, the circuit constructions of the complement (negate), min, and max operations are shown.
Fuzzy Relations: Fuzzy relations are fuzzy sets defined on Cartesian products. A binary fuzzy relation R defined on a discrete Cartesian product X × Y can be written as R = Σ µR(xi, yj)/(xi, yj), where every pair (xi, yj) ∈ X × Y.
Digital fuzzy relations will be used, that is, the µR(xi, yj)'s can take only a fixed number of values. A fuzzy relation can easily be represented in matrix form. A fuzzy relation on two sets X = {x1, x2, x3, x4} and Y = {y1, y2, y3, y4} can be represented in a 4 × 4 matrix R where Ri,j = µR(xi, yj).
Composition of Fuzzy Relations: Given two fuzzy relations, R1 on X × Y and R2 on Y × Z, a new relation denoted R1 ◦ R2 may be defined on X × Z. There are several types of composition, namely max-min, max-product, and max-average. The max-min composition formula is given below:

R1 ◦ R2 = Σ_{x×z} ⋁_y [µR1(x, y) ∧ µR2(y, z)] / (x, z)

It can be seen that the computation of the membership grades is very similar to matrix multiplication, with max (∨) being analogous to summation and min (∧) being analogous to multiplication.
Let R1 and R2 be two given fuzzy relation matrices.
The max-min composition is used here because it is by far the most common type in engineering applications.
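As an illustration of the max-min composition, the following Python sketch uses small example matrices of its own (not the book's R1 and R2) and computes the composition exactly as in the formula above:

def max_min_composition(R1, R2):
    # (R1 o R2)[i][k] = max over j of min(R1[i][j], R2[j][k])
    rows, inner, cols = len(R1), len(R2), len(R2[0])
    return [[max(min(R1[i][j], R2[j][k]) for j in range(inner))
             for k in range(cols)]
            for i in range(rows)]

# Illustrative values only:
R1 = [[0.3, 0.7],
      [1.0, 0.5]]
R2 = [[0.6, 0.2],
      [0.9, 1.0]]
print(max_min_composition(R1, R2))   # [[0.7, 0.7], [0.6, 0.5]]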
Compositions are very important for the inferencing procedures used in linguistic descriptions of systems and are particularly useful in fuzzy controllers and expert systems. Collections of fuzzy if-then rules or fuzzy algorithms are mathematically equivalent to fuzzy relations, and the problem of inference (evaluating them with specific values) is mathematically equivalent to composition.
In Section 10.4, a systolic array structure that can be used to compute composition
of fuzzy relations is shown. The cells, composed of reversible logic gates, are actually
responsible for the max-min operations.
Next, in Fig. 10.5, the implementation of the complement operation is shown; it can be performed digit-wise.
10.5 SUMMARY
This chapter introduces digitized fuzzy sets and discusses the different operations. Compositions of fuzzy relations are described. Compositions are very important for inferencing procedures used in linguistic descriptions of systems and are particularly useful in fuzzy controllers and expert systems. Collections of fuzzy if-then rules or fuzzy algorithms are mathematically equivalent to fuzzy relations, and the problem of inference (evaluating them with specific values), fuzzy GMP (Generalized Modus Ponens), and GMT (Generalized Modus Tollens) are mathematically equivalent to composition. A systolic array structure is shown for the computation of the composition of fuzzy relations. It provides a high degree of parallelism, and the use of identical cells and uniform interconnections makes it ideal for implementation.
This chapter continues with the new logic design paradigm – reversible logic. The
reversible logic finds its application in many fields such as quantum and optical computing,
low power design, nanotechnology, etc. The introduced design utilizes the multi-valued
reversible logic gates [namely the multi-valued Fredkin gate (MVFG)]. Not many circuit
design techniques have appeared in the literature concerning multi-valued reversible gates
or the implementation of fuzzy operations using them. Future research on this topic is
necessary to compare different multivalued reversible logic gates as the basic building
blocks. However Fredkin gates together with Toffoli gates and Feynman gates are the most
often discussed gates in reversible and quantum architecture and it is suggested that future
realization efforts will concentrate mostly on these gates and their derivations. As it is
possible to implement any Boolean logic function using Fredkin gates, it is also possible
using MVFGs (Multi-valued Fredkin gates) as they are modified Fredkin gates. This along
with the fact that multiple-valued Fredkin gates can be used to implement alternative logics
(for example threshold logic) makes MVFGs a rather attractive choice.
REFERENCES
[1] L. A. Zadeh, “Fuzzy Sets”, Information and Control, vol. 8, pp. 338–353, 1965.
[2] G. J. Klir and B. Yuan, “Fuzzy Sets and Fuzzy Logic”, Prentice Hall, 1995.
[3] G. J. Klir and T. A. Folger, “Fuzzy sets, Uncertainty, and Information”, Prentice Hall,
Englewood Cliffs, NJ, 1988.
[4] L. H. Tsoukalas and R. E. Uhrig, “Fuzzy and Neural Approaches in Engineering”, John Wiley & Sons Inc, 1997.
[5] M. Nielsen and I. Chuang, “Quantum Computation and Quantum Information”, Cambridge Press, 2000.
[6] R. C. Merkle, “Two types of mechanical reversible logic”, Nanotechnology, vol. 4,
pp. 114–131, 1993.
[7] C. Bennett, “Logical reversibility of computation”, IBM Journal of Research and
Development, vol. 17, pp. 525–532, 1973.
[8] M. H. A. Khan, M. A. Perkowski and P. Kerntopf, “A Multi-Output Galois Field Sum of Products Synthesis with New Quantum Cascades”, Proceedings of 33rd ISMVL, pp. 146–153, 2003.
[9] A. De Vos, B. Raa and L. Storme, “Generating the group of reversible logic gates”,
Journal of Physics A: Mathematical and General, vol. 35, pp. 7063–7078, 2002.
[10] P. Kerntopf, “Maximally efficient binary and multi-valued reversible gates”, Booklet
of 10th Intl. Workshop on Post Binary and Ultra-Large-Scale Integration Systems
(ULSI), pp. 55–58, 2001.
[11] P. Picton, “Fredkin Gates as a Basis for Comparison of Different Logic Design Solu-
tions”, IEE, 1994.
[12] P. Picton, “A Universal Architecture for Multiple-Valued Reversible Logic”, Multiple-
Valued Logic-An International Journal, vol. 5, pp. 27–37, 2000.
[13] R. Landauer, “Irreversibility and heat Generation in the Computational Process”, IBM
Journal of Research and Development, vol. 5, pp. 183–191, 1961.
[14] A. Agrawal and N. K. Jha, “Synthesis of Reversible Logic”, Proceedings of the Design,
Automation and Test in Europe Conference and Exhibition, IEEE, 2004.
[15] A. Khlopotine, M. Perkowaski and P. Kerntopf, “Reversible Synthesis by Iterative
Compositions”, Proceedings of IWLS, pp. 261–266, 2002.
[16] D. M. Miller, D. Maslov and G. W. Dueck, “A Transformation based Algorithm
for Reversible Logic Synthesis”, Proceedings of Design Automation Conference, pp.
318–323, 2003.
[17] J. W. Bruce, M. A. Thornton, “Efficient Adder Circuits based on a Conservative
Reversible Logic Gate”, IEEE Computer Society Annual Symposium on VLSI, 2002.
[18] H. M. H. Babu, M. R. Islam, “Synthesis of Full-adder Circuit using Reversible Logic”,
Proceedings of VLSID, 2004.
[19] M. H. A. Khan, “Design of Full-adder with Reversible Gates”, International Confer-
ence on Computers and Information Technology, Dhaka, pp. 512–519, 2002.
[20] E. Fredkin and T. Toffoli, “Conservative Logic”, International Journal of Theoretical
Physics, pp. 219–253, 1982.
[21] R. Feynman, “Quantum Mechanical Computers”, Optics News, vol. 11, pp.11–20,
1985.
[22] M. Perkowski, “Regular Realization of Symmetric Functions using Reversible
Logic”.
[23] Babu, Hafiz Md Hasan, Amin Ahsan Ali, and Ahsan Raja Chowdhury. “Realization
of Digital Fuzzy Operations Using Multi-Valued Fredkin Gates.” In CDES 2006, pp.
101–106. 2006.
CHAPTER 11
Multiple-Valued
Multiple-Output Logic
Expressions Using LUT
11.1 INTRODUCTION
There are many works on multiple-valued logic design with respect to the realization of MV-PLAs, gate circuits, and FPGAs. In particular, the minimization of sum-of-products expressions has received considerable attention for over 20 years. The analysis of the maximum number of implicants in a minimal sum-of-products expression is interesting because, when a PLA is used to implement a function, the cost is directly related to the number of implicants. In a PLA implementation of multiple-valued multiple-output functions, each product term is represented by a series of semiconductor devices (transistors). It is therefore also desirable to minimize the total number of devices in the PLA. Thus, a good solution will, in addition to having a minimal number of product terms, also have a small number of variables appearing in these product terms. While a smaller number of product terms reduces the PLA area, the reduced number of devices improves the speed of operation. These considerations motivate the introduced techniques, which are capable of generating a smaller number of product terms while the program takes a minimum amount of space. The earlier work on minimization was based upon the Quine-McCluskey procedure. This method gives the exact solution, but its space complexity increases rapidly with the number of input variables, so the space requirement grows quickly.
Example 11.1 A 3-variable, 3-valued, 4-output function is considered, which is given below:
To find the common sub-functions, the pairs are searched in descending order of the value of S. These are the common sub-functions among the functions in Example 11.2.
Complexity of Generating KC
Let n be the number of entries; the B+ tree algorithm needs a complexity of O(n log n). Again, as the LUT is in sorted order, the function Select_KC_from_LUT_using_binary_search() is used, which uses the binary search algorithm and requires a complexity of O(log n). So the total complexity becomes O(n log n) + O(log n) = O(n log n).
Example 11.3 The 4-output functions in Example 11.2 are used; the algorithm is applied to the leftover implicants after extracting the shared sub-functions. The minimized leftover functions are denoted by {f0′, f1′, f2′, f3′} corresponding to {f0, f1, f2, f3}, respectively.
Shared sub-functions:
The total number of implicants in the method = no. of implicants of the shared sub-functions + no. of implicants of the minimized leftover functions
=3+8
= 11.
Example 11.4 Consider a 3-valued 2-variable truth table. The truth table and its direct realization are shown in Fig. 11.2(a). The realization of the minimized function using the Kleenean coefficient is shown in Fig. 11.2(b). In the figure, only one variable is considered for simplicity.
Figure 11.1: Block Diagram of General 2-input r -valued LUT Logic Function.
Figure 11.2: (a) 3-valued 2-variable MVL Truth Table and its Direct Realization Using
LUT. (b) Kleenean Coefficient considered.
Table 11.1 shows the assigned current values for a 3-valued logic system. The maximum
current (3 µA) is assigned to logic 2.
Logic Value      0        1        2
Current Value    0.0 µA   1.5 µA   3.0 µA
Using this slicing, a 3 µA current generates a voltage drop on the order of 100 mV across the drain-source of each transistor. This does not affect the circuit performance unless the number of transistors in series exceeds 10.
The implementation method of the literal (1A1) is shown in Fig. 11.3. The input current in the figure is sourced to the circuit and compared against two source currents (0 µA and 3 µA). If the input current lies between these two limits, the output node is pulled up to Vdd; otherwise it is pulled down to ground.
The literal generator concept is shown in Fig. 11.4. The literal generator circuit is also shown in Fig. 11.5.
A current mode LUT is generally faster than a voltage mode LUT. In both cases, only
one path turns on depending on the logic values. However, a change in the logic values requires less charge movement in the current-mode design since all internal nodes have relatively low voltages (no charging and discharging is required).
11.6 SUMMARY
An improved approach to the minimization of multiple-valued multiple-output logic expressions is shown with the Quine-McCluskey method using Kleenean Coefficients (KC) and the realization using current-mode CMOS. An efficient method for multiple-valued multiple-output functions is also presented in this chapter. The number of implicants has been reduced significantly by using the introduced method. Thus, the method reduces the propagation time and also minimizes the size of the circuit.
REFERENCES
[1] E. A. Bender and J. T. Butler, “On the size of PLA’s required to realize binary and
multiple-valued logic”, IEEE Trans., Comput., vol. C-38, no. 1, pp. 82–98, 1989.
[2] G. W. Dueck and G. H. J. Van Rees, “On the maximum number of implicants needed
to cover a multiple-valued logic functions using window literals”, IEEE Proceedings
of the 20th International Symposium on Multiple-Valued Logic, pp. 144–152, 1990.
[3] Y. Hata, T. Hozumi and K. Yamato, “Minimization of multiple valued logic expres-
sions with Kleenean coefficients”, IEICE Trans. Inf. & Syst., vol. E79-D, no. 3, 1996.
[4] Y. Hata, T. Sato, K. Nakashima and K. Yamato, “A necessary and sufficient condi-
tion for multiple-valued logic functions representable by AND, OR, NOT constants,
variables and determination of their logical formulae”, IEEE Proceedings of the 19th
International Symposium on Multiple-Valued Logic, pp. 448–455, 1989.
Part 3
algorithm is described, which performs digit-wise parallel processing and provides a sig-
nificant reduction in carry propagation delay. A binary to BCD conversion algorithm for
decimal multiplication is also presented in this chapter to make the multiplication more
efficient. Then, a matrix multiplication algorithm is described that re-utilizes the intermedi-
ate product for the repeated values to reduce the effective area. A cost-efficient LUT-based
matrix multiplier circuit is also described using the compact and faster multiplier circuit
in this chapter. In Chapter 20, a low power and area efficient LUT-based BCD adder is
introduced which is constructed basically in three steps: First, a technique is introduced
for the BCD addition to obtain the correct BCD digit. Second, a new controller circuit of
LUT is presented which is designed to select and send Read/Write voltage to memory cell
for performing Read or Write operation. Finally, a compact BCD adder is designed using
the introduced LUT. In Chapter 21, a generic CPLD board design and development is pre-
sented. The designed board is generic in nature and it can be used in various system designs
as reconfigurable hardware. Finally, Chapter 22 reports the implementation of an FPGA-based Micro-PLC (Programmable Logic Controller) that can be embedded into devices, machines, and systems using a suitable interface. The idea described in that chapter is best suited for small-scale applications where a limited number of instructions is needed at reasonable cost, while also offering good performance, high speed, and a compact design approach.
CHAPTER 12
LUT-Based Matrix
Multiplication Using Neural
Networks
12.1 INTRODUCTION
Matrix multiplication is significantly used in many applications like graph theory, digi-
tal signal processing, image processing, cryptography, statistical physics and many more.
Neural networks are a desirable implementation choice due to their massive parallelism and distributed representation, along with their computation and learning ability. In this chapter, an artificial neural network is applied to matrix multiplication to significantly decrease the time complexity of the algorithm; hence, large-scale matrix multiplication is no longer difficult, thanks to the use of a neural network with its learning capacity. The main focuses of this chapter are as follows:
1. First, a new matrix multiplication algorithm using the efficient Artificial Neural
Network has been introduced with less time complexity.
2. Second, by solving the minimum coin change problem within a supervised learning method, a high accuracy level for the resultant matrix is ensured.
Let N be the set of natural numbers. The space of n × n natural matrices is denoted by N^{n×n}. A and B are two n × n matrices for multiplication.
Let A be the multiplier matrix and B be the multiplicand matrix, producing the output matrix C. Every matrix is represented as follows:
A ∈ N^{n×n} ⇔ A = (a_ij) =
[ a_11  . . .  a_1n ]
[  .            .   ]
[  .            .   ]
[ a_n1  . . .  a_nn ],   a_ij ∈ N

So, after multiplication the matrix C will be as follows:

C = A × B, where c_ij = Σ_{k=1}^{n} a_ik b_kj     (12.1)
Since supervised learning is being implemented, according to Fig. 12.1 each step is followed in the neural network blocks as follows:
(1) Problem Cases: For high-speed parallel processing, n neural networks are considered for the multiplication process. Each neural network has a column of the multiplier A and a single element from the multiplicand B as input. As every matrix is a collection of column vectors, column partitioning can be performed on the multiplier matrix A.
A ∈ N^{n×n} ⇔ A = [A1 A2 . . . Ak . . . An], Ak ∈ N^n, where Ak = [a_1 a_2 . . . a_n]^T.
So, for every i-th row of the multiplicand B, every j-th value of that row will be the input of a neuron together with the multiplier column vector Ak, where k = i. For simplicity, the multiplicand matrix B can be row-partitioned as follows:

B ∈ N^{n×n} ⇔ B = [B1; B2; . . . ; Bn], Bk ∈ N^n, where Bk = [b_1 b_2 . . . b_n].
Therefore, every column vector Ak will be the input of n neural networks, together with each j-th value of the row vector Bi for k = i, as demonstrated in Fig. 12.2. Every column vector will be pipelined to the neuron for faster execution.
(2) Known Solutions: For supervised learning, inputs with corresponding outputs need to be provided to the neural network as training data. For providing training data of an n × n matrix, a threshold value is considered, up to which sample input-output combinations are provided through a multiplier. The threshold value (θ) is defined as follows:

θ = f(n) = (n × x) / 100     (12.2)
The variable x is user-dependent, that is, it specifies which percentage (x) of the column values are to be sent as samples. Samples are generated by direct multiplication until the number of values in the column equals the threshold θ. For example, for two 20 × 20 matrices where x is defined as 40%, the threshold becomes 8 following Equation 12.2. Hence, the first 8 values of each column of the multiplier matrix are provided as sample input-output combinations. The other values are then computed through the neural network. The inputs (single values of the Ai column vector) and the outputs of the sample generation technique are stored in two arrays, namely in and out.
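A minimal sketch of this sampling step, assuming the threshold of Equation 12.2 and direct multiplication for the first θ entries of a column (the function and variable names are illustrative only):

import math

def threshold(n, x_percent):
    # Equation 12.2: theta = n * x / 100, e.g. n = 20 and x = 40% give 8
    return math.floor(n * x_percent / 100)

def generate_samples(column, b_element, theta):
    # the first theta values of the multiplier column are multiplied directly and
    # stored as training pairs in the arrays `in` and `out` (named ins/outs here)
    ins, outs = [], []
    for value in column[:theta]:
        ins.append(value)
        outs.append(value * b_element)
    return ins, outs

column = [3, 7, 2, 9, 5, 1, 8, 6, 4, 10]     # illustrative multiplier column values
ins, outs = generate_samples(column, b_element=6, theta=threshold(10, 40))
print(ins, outs)   # [3, 7, 2, 9] [18, 42, 12, 54]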
(3) Training Algorithm: When a new input arrives in the array, the absolute differences from all the previous values are calculated and inserted into a binary search tree with complexity O(log n). That is, if there are p values in the array in and a new value arrives at position in_{p+1}, then each absolute difference of in_0 through in_p with in_{p+1} is stored. The differences from out are also calculated in parallel and inserted as weights into the binary search tree with the corresponding input values. The binary search tree is chosen as the storing structure for decreasing the input set, that is, the search space of the "minimum coin change problem", which is utilized later.
Example 12.1 Consider a binary search tree with the values (20, 10, 30, 3, 11, 25, and 35). If 13 is the sum for which a subset is to be found, then, since the root node (20) is greater than the sum, the right branch of the tree can be omitted from consideration. By traversing the nodes of the left branch of the binary search tree that are less than the sum (13), the input space of the "minimum coin change problem" is minimized. Thus, a binary search tree can be used as the storing structure for the training data.
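To make the role of the minimum coin change problem concrete, the sketch below solves it by standard dynamic programming over the stored input values (repeats allowed) and then sums the corresponding output weights; this is an illustration of the idea, not the book's exact routine:

def min_coin_change(coins, target):
    # best[s] = a shortest list of stored inputs ("coins") summing to s
    best = {0: []}
    for s in range(1, target + 1):
        for c in coins:
            if c <= s and (s - c) in best:
                candidate = best[s - c] + [c]
                if s not in best or len(candidate) < len(best[s]):
                    best[s] = candidate
    return best.get(target)            # None if the target cannot be formed

def predict_output(known, new_input):
    subset = min_coin_change(sorted(known), new_input)
    if subset is None:
        return None                    # back-propagate to the sample generator
    return sum(known[c] for c in subset)

# Stored samples for one neuron: input value -> output (here, input * 6)
known = {3: 18, 7: 42, 2: 12, 9: 54}
print(predict_output(known, 13))       # e.g. 13 = 9 + 2 + 2, output 54 + 12 + 12 = 78 = 13 * 6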
(4) Neural Network: When the number of input values Ak (column vector values) is greater than the predefined threshold value, the value is to be calculated from prior knowledge, so the new input (in_p) is searched in the binary search tree with complexity O(log n). As the differences between new and previous inputs are stored each time, the tree provides a maximal amount of prior knowledge. If the new value is found in the tree, the weight of the corresponding node is provided as the final output. Otherwise, the "minimum coin change" problem is solved to obtain the subset that sums up to the input value (in_p), with or without repeated values. For each neural network, if there are m values in the binary search tree that can sum up to the new input (in_p), with or without repeats, then x_j^1 are the inputs, where j = 1, 2, . . . , m, which send signals to neuron n_1, weighted with parameters w_{1j}^1. In this specific case, w_{1j}^1 is considered to be the required number of repeats of the corresponding input x_j^1 needed to make up in_p, so the product of w_{1j}^1 and x_j^1 contributes to creating in_p. Considering that m is the number of input lines converging onto neuron 1, the summation of the calculated products, s_1 = Σ_{j=1}^{m} w_{1j}^1 x_j^1, provides in_p. Thus, the output n_out^1 of the first single perceptron n_1 is a non-linear transform of the summed input, which is defined as follows:
n_out^1 = g(s_1 − v_i)     (12.3)

where v is the formal threshold parameter. The value of v is considered to be the target sum value n_out^1. Let the total input be defined as h = Σ_{j=1}^{m} w_{1j}^1 x_j^1 − v_i. So, the activation function would be one if h = 0 and zero otherwise.
Suppose the first perceptron has inputs x_j^1, where j = 1, 2, . . . , m, and the output of the first perceptron is n_out^1. When the activation function is 1, it signifies that the neuron is firing, and the activation value influences the input of the second single perceptron: its input is x_k^2 with weight w_{2k}^2 as synaptic efficacy, taken as the weight of the corresponding nodes (which solve the "minimum coin change problem") in the binary search tree. The weights w_{2k}^2 are the weights corresponding, in the binary search tree, to the input values of the first perceptron (x_j^1), that is, the outputs or output differences calculated from outmem. So, the function is:
n_out^2 = g(Σ_{k=1}^{m} w_{2k}^2 × x_k^2)     (12.4)

n_out^2 = g^2[h_{2k}^2]
So, n_out^2 is the final result. However, there is a possible case in which the "minimum coin change problem" is not solvable with the current training data set; in that case, the neuron will back-propagate. Since the exact result is expected in matrix multiplication, the back-propagation method of a conventional neural network, based on solution prediction, cannot be followed. Instead of solution prediction, the input is back-propagated to the sample generator, where the correct output is generated through direct multiplication.
A 3 × 3 matrix multiplication block diagram is shown in Fig. 12.2. The values A11, A21, and A31 of the first column of the multiplier matrix A are pipelined into neurons number 1, 4, and 7 simultaneously. The other inputs of neurons 1, 4, and 7 are B11, B12, and B13, respectively. Similarly, the second and third columns of the multiplier are provided as inputs to the neurons (2, 5, 8) and (3, 6, 9), respectively.
Being independent, all the partial products X_{m1}, X_{m2}, and X_{m3}, where X = P, Q, R, are generated at a time for a distinct value of m = 1, 2, 3 at each level. Then, X_{m1}, X_{m2}, and X_{m3} are added. The addition operation is also parallel. Finally, the result is pipelined from the adders. The block diagram
of a neuron used in the method has been demonstrated in Fig. 12.3. Two matrices A
and B will be stored in Input Buffer. A multiplier is responsible for accomplishing the
sample generation whereas Output buffer holds the product from sample generator. Two
subtractors are used for calculating input and output differences from previous inputs and
outputs respectively. When number of values of multiplier column exceeds the defined
threshold value then it is sent to neural network block. Neural Network block consists of
adders, controller and output buffer. “Minimum coin change” problem is solved in this
block. If the new input value is already stored in the memory, the corresponding output is
provided directly as the final product. If the new input is not stored in memory, then, by solving the minimum coin change problem, the minimum set of existing inputs that makes up the new input is calculated. Then, the corresponding output values of those existing input values are sent to the adder. Finally, the adder provides the final result by adding the provided outputs. Property 12.3.1 is given to note the time complexity of the introduced approach to matrix multiplication.
Property 12.3.1 The time complexity of the proposed matrix multiplication for two n × n matrices using neural networks is O(log n · (n + n^2 + n^2 + log^2 n)), where n is the dimension of the matrix.
12.4 SUMMARY
Artificial Intelligence is a set of tools that is driving key technologies of the future world. The matrix multiplication technique is accomplished through the implementation of supervised neural networks, where the minimum coin change problem is solved using a binary search tree as the data structure to simplify the complex matrix multiplication process. The design achieves a logarithmic solution in place of solutions of polynomial degree. The design gains improvement in terms of the required number of LUTs (Look-Up Tables) and slices. These drastic improvements in LUT-based matrix multiplication will consequently influence advancements in real-life applications such as mathematical finance, image processing, machine learning, and many more.
REFERENCES
[1] C. Maureen, “Neural networks primer, part I”, AI expert vol. 2, no. 12, pp. 46–52,
1987.
[2] P. Saha, A. Banerjee, P. Bhattacharyya and A. Dandapat, “Improved matrix multiplier
design for high-speed digital signal processing applications”, Circuits, Devices and
Systems, IET, vol. 8, no. 1, pp. 27–37, 2014.
[3] W. Yongwen, J. Gao, B. Sui, C. Zhang and W. Xu, “An Analytical Model for Matrix
Multiplication on Many Threaded Vector Processors”, In Computer Engineering and
Technology, pp. 12–19, 2015.
[4] R. Soydan and S. Kasap, “Novel Reconfigurable Hardware Architecture for Polyno-
mial Matrix Multiplications”, Very Large Scale Integration (VLSI) Systems, IEEE
Trans., vol. 23, no. 3, pp. 454–465, 2015.
[5] V. V. Williams, “Multiplying matrices faster than Coppersmith-Winograd”, In Pro-
ceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pp.
887–898, 2012.
[6] J. Xiaoxiao and J. Tao, “Implementation of effective matrix multiplication on FPGA”,
In Broadband Network and Multimedia Technology, 4th IEEE International Confer-
ence on, pp. 656–658, 2011.
[7] C. Don and S. Winograd, “Matrix multiplication via arithmetic progressions”, In
Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing,
pp. 1–6, 1987.
[8] W. Guiming, Y. Dou and M. Wang, “High performance and memory efficient imple-
mentation of matrix multiplication on FPGAs”, In Field-Programmable Technology
(FPT), International Conference on, pp. 134–137, 2010.
[9] J. Ju-Wook, S. B. Choi and V. K. Prasanna, “Energy-and time-efficient matrix multi-
plication on FPGAs”, Very Large Scale Integration (VLSI) Systems, IEEE Trans. on,
vol. 13, no. 11, pp. 1305–1319, 2005.
[10] S. Hong, K. S. Park and J. H. Mun, “Design and implementation of a high-speed
matrix multiplier based on word-width decomposition”, Very Large Scale Integration
(VLSI) Systems, IEEE Trans. on, vol. 14, no. 4, pp. 380–392, 2006.
[11] M. U. Haque, Z. T. Sworna and H. M. H. Babu, “An Improved Design of a Reversible
Fault Tolerant LUT-Based FPGA”, 29th International Conference on VLSI Design,
pp. 445–450, 2016.
[12] M. Shamsujjoha, H. M. H. Babu and L. Jamal, “Design of a compact reversible
fault tolerant field programmable gate array: A novel approach in reversible logic
synthesis”, Microelectronics Journal, vol. 44, no. 6, pp. 519–537, 2013.
[13] Sworna, Zarrin Tasnim, Mubin Ul Haque, and Hafiz Md Hasan Babu. “A LUT-based
matrix multiplication using neural networks.” In 2016 IEEE International Symposium
on Circuits and Systems (ISCAS), pp. 1982–1985. IEEE, 2016.
CHAPTER 13
Easily Testable PLAs Using Pass Transistor Logic
In this chapter, an improved design of easily testable PLAs is introduced based on input decoder augmentation using pass transistor logic, along with improved conditions for product line grouping. The technique primarily increases the fault coverage of the easily testable PLA due to the augmentation of PTs (product terms) and reduces the testing time due to the grouping of the product lines. A simultaneous testing technique is applied within each group, which reduces the testing time. This approach ensures the detection of certain bridging faults. A modified testing technique is also presented in this chapter. It is shown that the new grouping technique offers improvements in all respects.
13.1 INTRODUCTION
The Programmable Logic Array (PLA) is an important building block in VLSI circuits.
PLAs have the main advantage of a regular structure. Various efficient techniques have been
introduced on designing and testing of easily testable PLAs over the past 20–25 years. The
main objective of these techniques was to reduce extra hardware, increase fault coverage
and reduce testing time. If the PLAs are grouped under these criteria and the product lines are rearranged within the groups, then the above objective is fulfilled.
The augmentation of the product line selector and the input decoder circuit is described with the help of some extra pass transistors, so that signal values can be applied on both the true and complemented bit lines corresponding to an input in a single test vector, which helps to increase the fault coverage on a large scale. In the next section, a new condition is described for grouping the product lines that results in a further reduction in the number of groups and hence in the number of test vectors.
Criterion 1: Two product lines should be grouped in such a way that when one product line is activated (logic 1), all other product lines are deactivated (logic 0).
For the test set T2, the criterion for product line grouping is as follows:
Criterion 2: Two product lines should be grouped in such a way that when a single used literal on a product line of the AND array is changed by applying any of the test vectors, the outputs of the PLA due to a product line in the group do not contradict the outputs due to the other product lines in that group.
If two or more product lines differ by one literal and one of the two conditions is not satisfied, then they cannot be tested simultaneously because of masking effects in the AND array. This results in an increased number of test vectors to test those product terms. However, it is shown that if the following two conditions are satisfied, then two or more product lines that differ in one literal can be placed in the same group irrespective of the CP device configuration in the OR array. The conditions are:
1. The PLA must be augmented using extra pass transistors, and
2. The bridging fault between the bit lines associated with the literals that differ by one must be testable by some other product lines (possibly in some other group).
The test generation technique is modified mainly for test set T2 for designing the PLA based on the technique stated above. The effect of adding an extra product line for testing bridging faults between the true and complemented bit lines has also been analyzed, and it is found that after adding the extra product line the fault coverage is higher.
Figure 13.1: The Example PLA whose Input Decoder is Augmented using Extra Pass
Transistor.
Figure 13.2: (a) Response of the Bit Decoder (b) Input of the Decoder to Place 0s on Both the Bit Lines.
Example 13.1 Let us assume that it is required to place an arbitrary value "a" on Xj1 and "b" on Xj0. To do this, a′ is applied to Xj with 1 on C, followed by b on Xj with 0 on C. When C = 0, the pass transistors are cut off. Hence, the previous value (i.e., a′) will be maintained at the input to the inverter of the Xj1 line, due to the input capacitance of the inverter. As a result, the value "a" remains on the Xj1 bit line when "b" appears on Xj0. Now let us consider that 0s are to be placed on both the bit lines Xj1 and Xj0. This is illustrated in Fig. 13.2. In the first step, 1 is placed on Xj while C = 1; in this case, Xj1 becomes 0 and Xj0 becomes 1. Then, in the second step, 0 is placed on Xj and C is made 0. As the corresponding pass transistor becomes cut off at C = 0, the value 0 remains on Xj1, and Xj0 also becomes 0, as Xj is 0 in the second step. Thus, 0s are placed on both the bit lines. Similarly, any arbitrary values can be placed on any of the bit lines.
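The behavior in Example 13.1 can be captured by a small two-step simulation (a behavioral sketch only, assuming that Xj0 follows Xj directly while the Xj1 line is driven by an inverter whose input is latched only while C = 1; the class name is hypothetical):

class AugmentedBitDecoder:
    def __init__(self):
        self.latched = 0               # value held at the inverter input of the Xj1 line

    def apply(self, xj, c):
        if c == 1:                     # pass transistor conducting: inverter input follows Xj
            self.latched = xj
        xj1 = 1 - self.latched         # inverter output on the Xj1 bit line
        xj0 = xj                       # the Xj0 bit line follows Xj directly in this model
        return xj1, xj0

# Placing 0 on both bit lines, as in Fig. 13.2:
dec = AugmentedBitDecoder()
print(dec.apply(xj=1, c=1))   # step 1: (0, 1)
print(dec.apply(xj=0, c=0))   # step 2: (0, 0), i.e., both bit lines are 0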
Example 13.2 Consider Fig. 13.3. It may be noticed that P3 and P4 are members of a group due to a difference of one literal position on the X1 lines.
Figure 13.3: An Example PLA Showing Product Lines Satisfying the Introduced Condition.
Again, the bridging fault between bit lines X1 and X1′ is detected by group S1 using input line X1. So, these two product lines satisfy the condition.
If two or more product lines satisfy the basic criteria along with the condition, then these product lines are members of a group.
Two different algorithms are implemented for generating test vectors for the test sets T1 and T2.
In the introduced design, instead of considering a single product line at a time, a partitioned group of product lines is considered. The test vectors are generated in such a way that the detection conditions for the bridging faults between the bit lines and the product lines, and between the last input line and the first output line, are fulfilled. It can easily be seen that if the test vectors in the T1 set corresponding to a product line Pi are applied in such a way that each of the bit lines which does not have any CP device with Pi becomes 1 at least once, then the presence of all the extra CP devices on Pi is tested. However, with the design for partitioning the product lines into groups and the augmentation of the input decoders using extra pass transistors, the number of groups is reduced, which in turn reduces the number of test vectors.
13.5 SUMMARY
An improved technique for designing easily testable PLAs has been presented. In this technique, the fault coverage is increased substantially by augmenting the input decoder using pass-transistor logic. In addition to the product line partitioning conditions presented, an additional condition has been presented in this chapter that reduces the extra hardware and testing time even further. The condition presented in this chapter allows two or more product lines that differ in one literal to be placed in the same group irrespective of the CP (Complex Programmable) device configuration in the OR array. However, it must be ensured that some other product lines can detect the bridging between the bit lines that contribute to the one-literal difference. As a result of this design, the number of groups is reduced, which in turn reduces the extra hardware overhead and the testing time.
REFERENCES
[1] T. Sasao, “Easily testable realizations for generalized Reed-Muller Expressions”, IEEE
Trans. Comput., vol. 46, no. 6, pp. 709–716, 1997.
[2] M. A. Mottalib and A. M. Jabir, “A Simultaneously testable PLA with high fault
coverage and reduced test set”, The Journal of IETE, vol. 43, no. 1, 1997.
[3] H. Fujiwara, “A design of programmable logic arrays with random pattern testability”,
IEEE Trans. CAD, vol. 7, pp. 5–10, 1988.
[4] M. A. Mottalib and P. Dasgupta, “Design and testing of easily testable PLA”, IEEE
Proc., pp. 357–360, 1991.
[5] M. A. Mottalib and P. Dasgupta, “A function dependent concurrent testing technique
for PLAs”, IETE, vol. 36, no. 3 & 4, pp. 299–304, 1990.
[6] S. M. Reddy and D. S. Ha, “A new approach to the design for testable PLAs”, IEEE
Trans., vol. C-36, pp. 201–211, 1987.
[7] S. Bozorgui-Nesbat and E. J. McCluskey, “Lower overhead design for testability of
programmable logic array”, IEEE Trans., vol. C-35, pp. 379–384, 1986.
[8] Md. R. Islam, H. M. H. Babu, M. A. R. Mustafa and Md. S. Shahriar, “A heuristic
approach for design of easily testable PLAs using pass transistor logic”, In 2003 Test
Symposium, pp. 90–90, IEEE Computer Society, 2003.
CHAPTER 14
Genetic Algorithm for Input Assignment
for Decoded-PLAs
Generally, a decoded-PLA i.e., a PLA (Programmable Logic Array) which has decoders
in front of an AND array, requires a smaller area than a standard PLA for realizing a
function. However, it is usually very difficult to assign input variables to decoders such that
the area of the decoded-PLA is minimal. An algorithm for assigning variables to decoders
has been known to produce good results, but the number of input variables of the decoders
was restricted to two and the area overhead of decoders, which is in fact quite significant,
was not considered. A heuristic algorithm is also developed for assigning input variables
to the decoders. In this algorithm, the number of inputs to each decoder is not restricted
to two and the area overhead incurred by using multi-input decoders is considered in the
cost function. The algorithm has shown that in many cases the area of multi-input
decoded-PLAs is smaller than that of decoded-PLAs with two-input decoders or
standard PLAs.
Decoded-PLAs, i.e., PLAs with input decoders can usually realize a function in a
smaller area than a standard PLA. The way of assigning the input variables to the decoders,
which in general may have any number of inputs, influences the size of a decoded-PLA
significantly. It should also be noticed that some functions cannot benefit from using multi-
input decoders no matter how the variables are assigned. In this chapter, an algorithm is
discussed to assign variables for multi-input decoded PLAs based on the Hamiltonian path
and dynamic programming.
The decoders presented in this section are called n-to-m decoders, where m ≤ 2^n.
Their purpose is to generate the 2^n (or fewer) binary combinations of the n input variables.
A decoder has n inputs and 2^n outputs and is also referred to as an n-to-2^n decoder.
Example 14.1 Fig. 14.1 shows a 2-to-4 decoder. Here the two-bit input is called S1S0 and
the four outputs are Q0 − Q3. This circuit “decodes" a binary number into a “one-of-four"
code. If the input is equivalent to the decimal number i , output Qi alone will be true. It is
ensured using the following expressions for the outputs Q0 − Q3:
Q0 = S1′·S0′
Q1 = S1′·S0
Q2 = S1·S0′
Q3 = S1·S0
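The 2-to-4 decoder of Example 14.1 can be modelled directly from these expressions; the short Python sketch below is illustrative only.

```python
def decoder_2to4(s1, s0):
    """Return (Q0, Q1, Q2, Q3) for a 2-to-4 decoder with input S1 S0."""
    q0 = (1 - s1) & (1 - s0)   # Q0 = S1'.S0'
    q1 = (1 - s1) & s0         # Q1 = S1'.S0
    q2 = s1 & (1 - s0)         # Q2 = S1.S0'
    q3 = s1 & s0               # Q3 = S1.S0
    return q0, q1, q2, q3

# only output Qi is true when the input equals the decimal number i
for i, (s1, s0) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    assert decoder_2to4(s1, s0)[i] == 1
```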
X0 X1 X2 F0 F1
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1
Example 14.2 Here a truth table representing a three-input, two-output function is shown
in Table 14.1. Since the two functions F0 and F1 both have the same inputs, just one decoder
can be used instead of two, as shown in Fig. 14.4.
Decoder output Q0 is unused, while Q7 is used multiple times. In general, a decoder
output can be used as many or as few times as needed.
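Reading the truth table above with X0 taken as the most significant address bit, F0 is the OR of decoder outputs Q3, Q5, Q6 and Q7, and F1 is the OR of Q1, Q2, Q4 and Q7, which is consistent with the remark that Q0 is unused and Q7 is used twice. The sketch below checks this against the table; the bit ordering is an assumption of the sketch, not stated in the text.

```python
def decoder_3to8(x0, x1, x2):
    """3-to-8 decoder; X0 is taken as the most significant address bit."""
    idx = (x0 << 2) | (x1 << 1) | x2
    return [1 if i == idx else 0 for i in range(8)]

def f0_f1(x0, x1, x2):
    q = decoder_3to8(x0, x1, x2)
    f0 = q[3] | q[5] | q[6] | q[7]   # OR gate fed by four decoder outputs
    f1 = q[1] | q[2] | q[4] | q[7]   # Q7 is shared, Q0 is unused
    return f0, f1

# truth table rows from the text: (X0, X1, X2, F0, F1)
table = [(0,0,0,0,0), (0,0,1,0,1), (0,1,0,0,1), (0,1,1,1,0),
         (1,0,0,0,1), (1,0,1,1,0), (1,1,0,1,0), (1,1,1,1,1)]
assert all(f0_f1(x0, x1, x2) == (f0, f1) for x0, x1, x2, f0, f1 in table)
```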
X0 X1 X2 X3 F0 F1 F2
0 0 0 0 0 0 0
0 0 0 1 0 0 1
0 0 1 0 0 1 0
0 0 1 1 0 1 1
0 1 0 0 0 0 1
0 1 0 1 0 1 0
0 1 1 0 0 1 1
0 1 1 1 1 0 0
1 0 0 0 0 1 0
1 0 0 1 0 1 1
1 0 1 0 1 0 0
1 0 1 1 1 0 1
1 1 0 0 0 1 1
1 1 0 1 1 0 0
1 1 1 0 1 0 1
1 1 1 1 1 1 0
Example 14.3 Consider the preceding truth table, representing a 4-input, 3-output function.
Fig. 14.6 shows the standard PLA representation of the function.
Figure 14.6: The Standard PLA Representation of the Function of Table 14.1.
Notice that there are 11 vertical lines in this implementation. In Fig. 14.7, the decoded-PLA
representation is shown, which has only 9 vertical lines.
14.2.1 Advantages
A programmable logic array or PLA presents a reduced design of a logic circuit. Moreover,
the decoder part of the PLA is programmable too. Instead of generating all possible products,
one can choose which products to generate. This can significantly reduce the fan-in (number
of inputs) of gates, as well as the total number of gates.
Figure 14.7: The Decoded-PLA Representation of the Function shown in Table 14.1.
Property 14.3.1 Let G = (V, E) be a connected graph with n vertices. A Hamiltonian path
is a path that goes through each vertex exactly once. A minimum-weight Hamiltonian path
is a Hamiltonian path for which the sum of the weights of its edges is minimum.
A decoder that has k inputs has 2^k signal lines out of the decoder, and each of these
signal lines represents one of the 2^k maxterms of its k input variables.
Property 14.3.3 Let S be a subset of the input variables. Then delete all literals of the
variables in S from each term of a disjunctive form for a given function f , but leave other
literals in that term. The number of distinct terms in the resulting disjunctive form is denoted
by RS .
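R_S can be computed mechanically from a sum-of-products description: delete the literals of the variables in S from every product term and count the distinct terms that remain. The helper below is only a sketch; the example function f is hypothetical, and terms are represented as dictionaries mapping a variable name to its polarity.

```python
def r_s(terms, S):
    """Number of distinct terms after deleting all literals of variables in S.

    terms : list of product terms, each a dict {variable: 0 or 1}
    S     : set of variable names assigned to one decoder
    """
    reduced = {frozenset((v, p) for v, p in t.items() if v not in S) for t in terms}
    return len(reduced)

# f = x1'.x2 + x1.x2 + x1.x3'  (an illustrative, hypothetical disjunctive form)
f = [{"x1": 0, "x2": 1}, {"x1": 1, "x2": 1}, {"x1": 1, "x3": 0}]
print(r_s(f, {"x1"}))   # the first two terms collapse into one -> 2
print(r_s(f, set()))    # no deletion -> 3
```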
14.4.1 GA Terminology
All genetic algorithms work on a population, or a collection of several alternative solutions
to the given problem. Each individual in the population is called a string or chromosome,
in analogy to chromosomes in natural systems. Often these individuals are coded as binary
strings, and the individual characters or symbols in the strings are referred to as genes. In
each iteration of the GA, a new generation is evolved from the existing population in an
attempt to obtain better solutions.
The population size determines the amount of information stored by the GA. The GA
population is evolved over several generations.
An evaluation function (or fitness function) is used to determine the fitness of each
candidate solution. The fitness is the opposite of what is generally known as the cost in
optimization problems. It is customary to describe genetic algorithms in terms of fitness
rather than cost. The evaluation function is usually user-defined, and problem-specific.
Individuals are selected from the population for reproduction, with the selection biased
toward more highly fit individuals. Selection is one of the key operators on GAs that ensures
survival of the fittest. The selected individuals form pairs, called parents.
Crossover is the main operator used for reproduction. It combines portions of two parents
to create two new individuals, called offspring, which inherit a combination of features of
the parents. For each pair of parents, crossover is performed with a high probability PC,
which is called the crossover probability. With probability 1-PC, crossover is not performed,
and the offspring pair is the same as the parent pair.
Mutation is an incremental change made to each member of the population, with a very
small probability. Mutation enables new features to be introduced into a population. It is
performed probabilistically such that the probability of a change in each gene is defined
as the mutation probability, P m .
In each generation of the steady-state GA, the newly produced offspring replaces the worst
member of the population. This completes the generation. In the steady-state GA, the
generation gap is minimal, since only two offspring are produced in each generation.
Duplicate checking may be beneficial because a finite population can hold more
schemata if the population members are not duplicates. Since the offspring of two identical
parents are identical to the parents, once a duplicate individual enters the population, it
tends to produce more duplicates and individuals varying by only slight mutations. Prema-
ture convergence may then result. Duplicate checking is advantageous under the following
conditions:
i. The population size is small
ii. The chromosomes are short or
iii. The evaluation time is large.
Each of the above conditions reduces the duplicate checking time in comparison to the
evaluation time. If the duplicate checking time is negligible, compared to the evaluation
time, then duplicate checking improves the efficiency of the GA.
The steady-state GA is susceptible to stagnation. Since a large majority of offspring is
inferior, the steady-state algorithm rejects them, and it keeps making more trials on the
existing population for very long periods of time without any gain. Because the population
size is small compared to the search space, this is equivalent to long periods of localized
search.
14.5.1 Selection
Various selection schemes have been used, but roulette wheel selection, stochastic uni-
versal selection and binary tournament selection will be focused here with and without
replacement. As illustrated in Fig. 14.11 (a), roulette wheel selection is a proportionate
selection scheme in which slots of a roulette wheel are sized according to the fitness of each
individual in the population. An individual is selected by spinning the roulette wheel. The
probability of selecting an individual is therefore proportional to its fitness. As illustrated
in Fig. 14.11 (b), stochastic universal selection is a less noisy version of roulette wheel
selection in which N equidistant markers are placed around the roulette wheel, where N
is the number of individuals in the population. N individuals are selected in a single spin
of the roulette wheel, and the number of copies of each individual selected is equal to the
number of markers inside the corresponding slot.
In binary tournament selection, two individuals are taken at random, and the better
individual is selected from the two. If binary tournament selection is being done without
replacement, then the two individuals are set aside for the next selection operation, and
they are not replaced into the population. Since two individuals are removed from the
population for every individual selected, and the population size remains constant from
one generation to the next, the original population is restored after the new population is
half-filled. Therefore, the best individual will be selected twice, and the worst individual
will not be selected at all. The number of copies selected of any other individual cannot
be predicted except that it is either zero, one, or two. In binary tournament selection with
replacement, the two individuals are immediately replaced into the population for the next
selection operation.
The objective of the GA is to converge to an optimal individual, and selection pressure
is the driving force which determines the rate of convergence. A high selection pressure will
cause the population to converge quickly, possibly at the expense of a suboptimal result.
Roulette wheel selection typically provides the highest selection pressure in the initial
generations, especially when a few individuals have significantly higher fitness values than
other individuals. Tournament selection provides more pressure in later generations when
the fitness values of individuals are not significantly different. Thus, roulette wheel selection
is more likely to converge to a suboptimal result if individuals have large variations in fitness
values.
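The two most common of these selection schemes can be sketched in a few lines. The implementation below is illustrative only (the tournament variant is shown with replacement); it is not the exact procedure used in the chapter, and the names and example fitness values are hypothetical.

```python
import random

def roulette_wheel_select(population, fitness):
    """Proportionate selection: slot size proportional to fitness (fitness > 0)."""
    total = sum(fitness)
    spin = random.uniform(0, total)
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if spin <= acc:
            return individual
    return population[-1]

def binary_tournament_select(population, fitness):
    """Pick two individuals at random and return the fitter one."""
    i, j = random.randrange(len(population)), random.randrange(len(population))
    return population[i] if fitness[i] >= fitness[j] else population[j]

pop = ["0101", "1100", "1111", "0001"]
fit = [2.0, 3.0, 6.0, 1.0]
print(roulette_wheel_select(pop, fit), binary_tournament_select(pop, fit))
```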
14.6 CROSSOVER
Once two chromosomes are selected, the crossover operator is used to generate two
offspring. In one- and two-point crossover, one or two chromosome positions are randomly
selected between one and (L − 1), where L is the chromosome length, and the two parents
are crossed at those points. For example, in one-point crossover, the first child is identical to
the first parent up to the crossing point and identical to the second parent after the crossing
point. An example of one-point crossover is shown in Fig. 14.12.
Crossover combines building blocks from two different solutions in various combina-
tions. Smaller good building blocks are converted into progressively larger good building
blocks over the time until it has an entire good solution. Crossover is a random process,
and the same process, results in the combination of bad building blocks to result in poor
offspring, but these are eliminated by the selection operator in the next generation.
The performance of the GA depends to a great extent on the performance of the crossover
operator used. The amount of crossover is controlled by the crossover probability, which is
defined as the ratio of the number of offspring produced in each generation to the population
size. A higher crossover probability allows exploration of more of the solution space and
reduces the chances of settling for a false optimum. A lower crossover probability enables
exploitation of existing individuals in the population that have relatively high fitness.
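A minimal sketch of one-point crossover on binary strings, performed with probability PC as described above, is given below; the function name and default value are illustrative.

```python
import random

def one_point_crossover(parent1, parent2, pc=0.9):
    """Cross two equal-length strings at a random point with probability pc."""
    if random.random() >= pc or len(parent1) < 2:
        return parent1, parent2                     # no crossover: copy the parents
    point = random.randint(1, len(parent1) - 1)     # crossing point in [1, L-1]
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

print(one_point_crossover("0000000000", "1111111111"))
```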
14.6.1 Mutation
As new individuals are generated, each character is mutated with a given probability. In a
binary-coded GA, mutation may be done by flipping a bit, while in a nonbinary-coded GA,
mutation involves randomly generating a new character in a specified position. Mutation
produces incremental random changes in the offspring generated through crossover, as
shown in Fig. 14.13. When used by itself, without any crossover, mutation is equivalent
to random search, consisting of incremental random modification of an existing solution, and
acceptance if there is improvement. However, when used in the GA, its behavior changes
radically. In the GA, mutation serves the crucial role of replacing the gene values lost from
the population during the selection process so that they can be tried in a new context, or of
providing the gene values that were present in the initial population.
For example, suppose a particular bit position, bit 10, has the same value, say 0, for all
individuals in the population. In such a case, crossover alone will not help, because it is
only an inheritance mechanism for existing gene values. That is, crossover cannot create an
individual with a value of 1 for bit 10, since it is 0 in all parents. If a value of 0 for bit 10
turns out to be suboptimal, then, without the mutation operator, the algorithm will have no
chance of finding the best solution. The mutation operator, by producing random changes, restores that chance.
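Bit-flip mutation for a binary-coded GA can be sketched as follows; each gene is flipped independently with the small probability P m (the default value below is illustrative).

```python
import random

def mutate(chromosome, pm=0.01):
    """Flip each bit of a binary string independently with probability pm."""
    return "".join(bit if random.random() >= pm else str(1 - int(bit))
                   for bit in chromosome)

print(mutate("0000000000", pm=0.3))
```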
14.6.2 Inversion
The inversion operator takes a random segment in a solution string and inverts it end for end
(Fig. 14.14). This operation is performed in a way such that it does not modify the solution
represented by the string. Instead, it only modifies the representation of the solution. Thus,
the symbols composing the string must have an interpretation independent of their position.
This can be achieved by associating an identification number with each symbol in the string
and interpreting the string with respect to these identification numbers instead of the array
indices. When a symbol is moved in the array, its identification number is moved with it,
and therefore, the interpretation of the symbol remains unchanged.
For example, Fig. 14.14 shows a chromosome. Let us assume a very simple evaluation
function such that the fitness is the binary number consisting of all bits of the chromosome,
with bit 0 being the least significant and bit 9 being the most significant. Since the bit
identification numbers are moved with the bit values during the inversion operation, bit 0,
bit 1, etc., still have the same values, although their sequence in the chromosome is different.
Hence, the fitness value remains the same. The inversion probability is the probability of
performing inversion on each individual during each generation. It controls the amount of
group formation.
Inversion changes the sequence of genes randomly, in the hope of discovering a sequence
of linked genes that are placed close to each other.
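The position-independent interpretation can be sketched by storing (identification number, value) pairs, so reversing a segment changes the gene order but not the fitness. The helper names below are illustrative and the fitness function is the simple binary-number example described above.

```python
import random

def invert(chromosome):
    """Reverse a random segment of a list of (gene_id, value) pairs."""
    i = random.randrange(len(chromosome))
    j = random.randrange(len(chromosome))
    i, j = min(i, j), max(i, j)
    return chromosome[:i] + chromosome[i:j + 1][::-1] + chromosome[j + 1:]

def fitness(chromosome):
    """Interpret the string by gene id: bit 0 least significant, bit 9 most."""
    return sum(value << gene_id for gene_id, value in chromosome)

chrom = [(gene_id, random.randint(0, 1)) for gene_id in range(10)]
assert fitness(invert(chrom)) == fitness(chrom)   # representation changes, fitness does not
```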
Example 14.4 Suppose in a particular function there are eight input variables x 0 , x 1 , . . . ,
x 7 . Let a solution provided by a chromosome be:

Index 0 1 2 3 4 5 6
Gene  2 2 1 0 1 2 0
This solution suggests using three decoders in the decoded-PLA design. It also suggests
assigning x 3 and x 6 to the same decoder (suppose D0 ), x 2 and x 4 to another decoder ( D1 )
and x 0 , x 1 and x 5 to another one ( D2 ).
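Decoding such a chromosome is simply a matter of grouping variable indices by gene value. The short sketch below reproduces the grouping of Example 14.4; the decoder labels D0, D1 and D2 follow the example, and the function name is illustrative.

```python
from collections import defaultdict

def decode(chromosome):
    """Map gene values (decoder numbers) to the variables assigned to each decoder."""
    decoders = defaultdict(list)
    for var_index, decoder_id in enumerate(chromosome):
        decoders[decoder_id].append("x%d" % var_index)
    return dict(decoders)

print(decode([2, 2, 1, 0, 1, 2, 0]))
# {2: ['x0', 'x1', 'x5'], 1: ['x2', 'x4'], 0: ['x3', 'x6']}
```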
In line 3, the first term estimates the areas of the arrays and the second term estimates
the area overhead incurred by the decoder. Here m denotes the total number of output
functions in the PLA, and RS is used as it is defined in Property 14.3.3. If the variables in S are
assigned to a decoder and the variables not in S are treated in the same way as in standard
PLAs, it can be shown that RS is an upper bound on the minimum number of product lines.
In the algorithm RS is used to estimate the number of product lines when variables in S are
assigned to a decoder.
The area overhead of decoders, D |s | is a very important factor in the cost. To estimate
a realistic value of D, many complex aspects of decoder circuit design and its layout must
be considered as follows:
1. Design of Decoder Circuits: Decoders are usually designed by a mixture of tree
decoders, NAND circuits, pass-transistor circuits and others, in order to have appro-
priate trade-offs between speed and layout area. So, there are a variety of decoder
circuits. Also, to have enough driving capability, buffers which occupy large areas
are usually needed for the output signals or the decoders (in the case of standard
PLAs, large inverters are needed).
2. Routing of Decoder Input Lines: Input lines run through decoder. The area occupied
by these lines highly depends on where, what directions, and how many groups
these lines approach the decoders. Also, depending on the technology of the circuits
realizing the decoders, the layout and line spacing could be different. If these lines
need to be permuted, extra areas for contact windows may be required.
Since these very complex factors depend on the situation, a simplified but reasonable
estimation of the area of a decoder, D|S| = 2A|S|·2^|S|, is used after examining the layout of
some real design samples, where 2^|S| is the number of signal lines out of an |S|-input decoder,
2|S| is the number of input lines which run through the decoder, and A is a coefficient to be
adjusted according to line spacing, transistor sizes, and others. Here, in the implementations,
A = 1 is used as a dummy estimation.
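The exact cost function of the algorithm (its "line 3") is not reproduced in this chapter, but its two ingredients, an array-area term based on the product-line estimate R_S and the decoder overhead D|S| = 2A|S|·2^|S|, can be combined in a sketch such as the one below. The particular array-area expression (product lines times column count) is an assumption of this sketch, not the book's formula.

```python
def decoder_overhead(S_sizes, A=1):
    """Sum of D_|S| = 2*A*|S|*2**|S| over all decoders (A = 1 as a dummy estimate)."""
    return sum(2 * A * s * 2 ** s for s in S_sizes)

def estimated_pla_area(product_lines, S_sizes, m, A=1):
    """Rough decoded-PLA area: array term plus decoder overhead.

    product_lines : estimate of the number of product lines (e.g. an R_S bound)
    S_sizes       : list with the number of inputs of each decoder
    m             : number of output functions of the PLA
    """
    columns = sum(2 ** s for s in S_sizes) + m   # AND-array inputs plus OR-array outputs
    return product_lines * columns + decoder_overhead(S_sizes, A)

# e.g. 20 product lines, decoders of 2, 2 and 3 inputs, 4 outputs
print(estimated_pla_area(20, [2, 2, 3], m=4))
```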
14.7.3 Developed GA
When problem specific information exists, it may be advantageous to consider a GA hybrid.
Genetic algorithms may be crossed with various problem-specific search techniques to form
a hybrid that exploits the global perspective of the GA and the convergence of the problem-
specific technique. Here in the method a form of modified greedy algorithm is used for the
selection procedure of the GA. Another change to the traditional GA is also made: the
worst chromosomes of the population are not simply replaced by the new offspring; instead,
the best individuals are chosen from among the parents and the offspring. This is done
to utilize the good features of even the worst chromosomes. Using this technique, the good
features of the parents, as well as of all the other chromosomes, remain in the population
through the survivors.
The GA that is adopted is presented in Algorithm 14.2.
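Algorithm 14.2 itself is not reproduced here. The sketch below only illustrates the survivor policy just described (offspring compete with the parents and the best individuals of the combined pool survive), with a simple tournament standing in for the modified greedy selection; it reuses the crossover and mutation helpers sketched earlier (or any equivalent operators), and all names are illustrative.

```python
import random

def next_generation(population, fitness_fn, crossover, mutate, pc=0.9, pm=0.01):
    """One elitist generation step: the best of (parents + offspring) survive."""
    scored = sorted(population, key=fitness_fn, reverse=True)
    offspring = []
    while len(offspring) < len(population):
        p1 = max(random.sample(scored, 2), key=fitness_fn)   # stand-in for the greedy selection
        p2 = max(random.sample(scored, 2), key=fitness_fn)
        c1, c2 = crossover(p1, p2, pc)
        offspring += [mutate(c1, pm), mutate(c2, pm)]
    pool = scored + offspring
    return sorted(pool, key=fitness_fn, reverse=True)[:len(population)]
```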
Consider the function f(w, x, y, z) = Σ(5, 7, 11, 13). The minimized SOP form is
f = xy′z + w′xz + wx′yz. But the minimum ESOP form is f = xz ⊕ wyz.
14.8 SUMMARY
The number of products in a PLA (Programmable Logic Array) is equal to the number
of different products in the expressions. So, in order to minimize the size of PLAs, it is
necessary to minimize the number of different products in the expression. AND-EXOR
PLAs with decoders require fewer products than AND-OR PLAs without decoder. By
replacing the OR array with the EXOR array in the PLA, it can have an AND-EXOR PLA
with decoders. An algorithm is also presented to assign variables for multi-input decoded
PLAs based on the Hamiltonian path and dynamic programming.
REFERENCES
[1] T. Sasao, “Input variable assignment and output phase optimization of PLA's”, IEEE
Trans. Computer, vol. C-28, no. 9, pp. 879–894, 1984.
[2] C. Kuang-Chien and S. Muroga, “Input assignment algorithm for decoded-PLAs with
multi-input decoders”, in IEEE International Conference on Computer-Aided Design
(ICCAD-88), pp. 474–477, 1988.
[3] P. Mazumder and E. M. Rudnick, “Genetic Algorithms for VLSI Design, Layout and
Test Automation”, Pearson Education, Asia, 1999.
[4] J. H. Holland, “Adaptation in Natural and Artificial Systems”, Ann Arbor, MI: Uni-
versity of Michigan Press, 1975.
[5] D. E. Goldberg, “Genetic Algorithms in Search, Optimization, and Machine Learning,
Reading, MA”, Addison-Wesley, 1989.
[6] K. C. Chen and S. Muroga, “Input Variable Assignment for Decoded-PLA’s and
Output Phase Optimization Algorithms”, to appear as a department report, Department
of Computer Science, University of Illinois at Urbana-Champaign.
[7] L. A. Glasser and D. W. Dobberpuhl, “The Design and Analysis of VLSI Circuits”,
Addison Wesley, 1985.
[8] C. R. Darwin, “On the Origin of Species by Means of Natural Selection”, London:
John Murray, 1859.
[9] K. C. Chen and S. Muroga, “Input assignment algorithm for decoded-PLAs with
multi-input decoders”, In 1988 IEEE International Conference on Computer-Aided
Design, pp. 474–475, IEEE Computer Society, 1988.
CHAPTER 15
FPGA-Based Multiplier
Using LUT Merging Theorem
FPGA (Field Programmable Gate Array) technology has become an integral part of to-
day’s modern embedded systems. All mainstream commercial FPGA devices are based
upon LUT (Look-up Table) structures. As any m-input Boolean function can be imple-
mented using m-input LUTs, it is a prime concern to reduce the number of LUTs while
implementing an FPGA-based circuit for given functions. In this chapter, a LUT merging
theorem is introduced which reduces the required number of LUTs for the implementation
of a set of functions by a factor of two. The LUT merging theorem performs selection,
partition and merging of the LUTs to reduce the area. Using the LUT merging theorem,
a (1 × 1)-digit multiplication algorithm is described, which does not require any partial
product generation, partial product reduction and addition steps. An (m×n)-digit multipli-
cation algorithm is introduced which performs digit-wise parallel processing and provides
a significant reduction in carry propagation delay.
15.1 INTRODUCTION
The implementation of FPGA (Field Programmable Gate Array) is now prevalent in ap-
plications such as signal processing, cryptography, networking, arithmetic and scientific
computing. LUT (Look-up Table) is the key cell of an FPGA. In this chapter, the introduced
LUT merging theorem is applied in multiplier to prove the performance efficiency.
Three main contributions of this chapter are as follows:
(1) LUT merging theorem is introduced by following selection, partition and merging
of LUTs in order to reduce the required number of LUTs by a factor of two for implementing
a set of functions.
(2) Using the LUT merging theorem, a single digit multiplication algorithm has been
described avoiding the conventional slow partial product generation, partial product reduc-
tion and addition stages. Efficient (m×n)-digit multiplication algorithm is explained, where
digit-wise parallel processing is performed to reduce the carry propagation delay of the
multiplier significantly.
(3) A LUT-based compact and faster (m×n)-digit multiplier circuit is described by using
the single digit LUT-based multiplier circuit.
Property 15.2.1 (LUT Merging): Let n be the total number of m-input LUTs required
in a circuit for the implementation of n functions fi(a1, a2, ..., am). If the given
functions fi are minimized to gi(a1, a2, ..., am′), then the n LUTs can be merged
into ⌈n/2⌉ LUTs to implement the n functions, where 1 ≤ m′ < m, m ≥ 3, m′ is the number
of input variables of the minimized function g, and i = 1, 2, ..., n.
If a Boolean function f(a1, a2, ..., am) requires m variables, then m must be reduced to m′,
where f(m) ≡ g(m′) and m > m′.
If the input sets Ii of the n different functions are the same, such as I1 = I2 =
I3 = · · · = In, and the corresponding output combinations (Oi) are also the same, that is, O1
= O2 = O3 = · · · = On, then I1, I2, ..., In → O1.
The following three steps are considered after fulfilling the above conditions:
(1) Selection of LUTs: Bit categorization is performed based on the input bits for the
selection of LUTs. After categorizing the input domain the target functions and target
number of input bits of the LUTs are determined based on the number of input and output
bits of the functions.
(2) Partitioning of LUTs: Partitioning of LUTs is done to find the similarities of the
input and output combinations. First, the number of input bits is reduced and checked for
similarity among functions either by rearranging the input bit positions or by implementing
smaller functions on the input bits. If there are identical input bit combinations after the
reduction of the number of input bits and the corresponding outputs of the functions are
identical for the multiple occurrence, then a partition is created with those functions.
(3) Merging of LUTs: The final step is to deal with the merging of the LUTs of each
partition. Since an m-input LUT is a dual-output circuit, the n LUTs will be reduced
to ⌈n/2⌉, where m ≥ 3. A small software sketch of this selection, partition and merging flow is given below.
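The three steps can be mimicked at a high level in software. The sketch below reduces each function to its true support, partitions functions whose combined support fits into one m-input dual-output LUT, and pairs them. The fracturable-LUT constraint used here (two outputs sharing at most m inputs) is an assumption of the sketch rather than a statement from the chapter, and the example functions are hypothetical.

```python
from itertools import product
from math import ceil

def support(f, n):
    """Indices of the input variables an n-input Boolean function really depends on."""
    dep = set()
    for bits in product((0, 1), repeat=n):
        for i in range(n):
            flipped = list(bits)
            flipped[i] ^= 1
            if f(*bits) != f(*flipped):
                dep.add(i)
    return dep

def merge_into_dual_output_luts(functions, n, m=6):
    """Greedily pair functions whose combined support fits into an m-input LUT."""
    items = [(f, support(f, n)) for f in functions]
    luts, used = [], [False] * len(items)
    for i, (fi, si) in enumerate(items):
        if used[i]:
            continue
        used[i] = True
        partner = next((j for j in range(i + 1, len(items))
                        if not used[j] and len(si | items[j][1]) <= m), None)
        if partner is not None:
            used[partner] = True
            luts.append((fi, items[partner][0]))
        else:
            luts.append((fi,))
    return luts

fns = [lambda a, b, c, d: a & b, lambda a, b, c, d: b ^ c,
       lambda a, b, c, d: c | d, lambda a, b, c, d: a ^ d]
luts = merge_into_dual_output_luts(fns, n=4, m=6)
print(len(luts), "LUTs for", len(fns), "functions; ceil(n/2) =", ceil(len(fns) / 2))
```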
First, selection of LUTs has been performed by bit categorization technique, where the
multiplier and multiplicand bits are arranged into groups of variables of 1-input, 2-input
(Catg3), 3-input (Catg4) and 4-input (Catg5) as shown in Table 15.1. For instance, a group
of 3-input variables contains operands with a maximum of 3 bits. Either the multiplicand or the
multiplier will be 0Dec in Catg0 and the corresponding output will be zero. In Catg1 the
multiplicand B is 1Dec and output is A. In Catg2 the multiplier A is 1Dec and output is
B. It is important to note that only four categories of input variables are considered, as the
(1 × 1)-digit multiplication is used as the basic unit to construct an (m×n)-digit multiplier.
The (1 × 1)-digit multiplication considers the four binary bits with the maximum decimal
value, 9 as input. As A × B = B × A, either A × B or B × A is considered to avoid identical
input combinations. So, the first step of the multiplication process is to find the appropriate
category. The inputs will make the category selection value (cat) of the exact category
one, keeping the other values zero, which will activate that specific category to provide
the output. The categories are determined by using the following equations, where plus (+)
represents OR operation, dot (.) represents AND operation and (’) represents complement.
The equations of corresponding categories are as follows:
In 2-bit categorization the final output will be gained using the following equations,
where (.) means AND operation and ⊕ means Ex-OR operation:
P34 = a1, if a3·b0 = 0;  a0′, otherwise.        (15.12)

P44 = a1, if a3·a0 = 0;  1, otherwise.        (15.13)
In 4-bit categorization the final output will be gained using the following equations:
The 3-bit category group is selected for the implementation of LUT merging. Product
bits P0 and P5 will be generated using the following equations:
P03 = a0; and P35 = (b1·b2)·(b0·a0·a2 + a1·a2)        (15.15)
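The expression for the most significant product bit in Equation (15.15) can be sanity-checked against brute-force multiplication of two 3-bit operands. The check below assumes, following the earlier remark that only one ordering of each commutative pair A × B is fed to a category, that the operands arrive with a ≤ b; under that convention the expression agrees with bit 5 of a·b for every pair. The check is only an illustration, not part of the original design.

```python
def p5_expr(a, b):
    """(b1.b2).(b0.a0.a2 + a1.a2) from Equation (15.15)."""
    a0, a1, a2 = a & 1, (a >> 1) & 1, (a >> 2) & 1
    b0, b1, b2 = b & 1, (b >> 1) & 1, (b >> 2) & 1
    return (b1 & b2) & ((b0 & a0 & a2) | (a1 & a2))

mismatches = [(a, b) for a in range(8) for b in range(8)
              if a <= b and p5_expr(a, b) != ((a * b) >> 5) & 1]
print("mismatches for a <= b:", mismatches)   # -> []
```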
So, 4 bits are left ranging from P1 to P4 . A LUT with 4 to 6 inputs provides the best area
and delay performance. The traditional 4-input LUT structure provides low logic density
and configuration flexibility, which reduces the utilization of interconnect resource, when
configured as relatively complex logic functions. Hence, a 6-input LUT is considered to
implement the function.
Second, partitioning of LUTs is performed by observing the similarities between the
input and output combinations of the product bits P1,4 and P2,3 . And finally, merging of
LUTs has been performed by realizing the functions of 6-input LUTs while minimizing it
as follows:
P1,4 = g(a0, a1, b0, b1, b2);        (15.16)
P2,3 = g((b1 ⊕ b2), b0, a2, a1, a0)        (15.17)
Fig. 15.2 shows the merging of LUTs and Table 15.2 exhibits the verification of LUT
merging theorem. Before LUT merging, the output of the corresponding 6-bit input variable
is shown in Table 15.2. Now, the 6-bit input combinations are converted into 5-bit input
combinations in such a way so that the total numbers of input combinations of 6-bit and
5-bit inputs are the same as well as the corresponding outputs of all the combinations for
both the 6-input and 5-input variables also remain the same. The 5-bit input combinations
with the corresponding outputs are shown in After LUT Merging Column of Table 15.2. In
Fig. 15.2, the merging of LUTs has been accomplished in three steps for the multiplication
operation. The first step is the selection of LUTs, where the selection is performed on the
basis of computation of final products P1 , P2 , P3 , and P4 . Second, the partitioning of LUTs
is performed, where two different colors are used to distinguish the distinct partition sets.
Third, the partitioned LUTs are merged into one.
The block diagram of a (1 × 1)-digit multiplier circuit is shown in Fig. 15.1. The bit
categorization circuit is constructed using Equations (15.1–15.8) as shown in Fig. 15.3. The
circuit is partitioned into categories, where each category performs the above-mentioned
output functions in Equations (15.9–15.17) efficiently using LUTs. As the bit categorization
selects the corresponding category by providing the value of that variable as 1, this value
activates that particular category. So, other categories remain deactivated and only one
category at a time is selected. Fig. 15.4 shows that as the output Catg3 bit from bit
Table 15.1: Bit Categorization of Inputs for the Implementation of LUT Merging Theorem
in Multiplication Technique
categorization circuit for corresponding input combination is zero, the total 3-bit category
circuit is deactivated and no input passes through that category circuit. The dotted line
represents off state and the straight line represents on state. On the other hand, when the input
combination is A = 0011 and B = 0011, which represents category 3, the bit categorization
provides Catg3 as 1. Hence, the 3-bit category is activated and the corresponding required
inputs are passed through the transistors and LUTs. Finally, the corresponding output of
the input combination P = 1001 is gained from that category. The internal circuit of each
category as shown in Fig. 15.5 is the LUT implementation of Equations (15.9–15.17).
Similarly, for all the other input combinations corresponding category is selected and
activated. Finally, the output from the activated category is considered as the final output.
The introduced (m×n)-digit multiplier circuit uses the (1 × 1)-digit multiplier circuit. For
an (m×n)-digit multiplier n number of processing elements are required. Each processing
element, PE i , has a single digit of the multiplicand, Bi , and the multiplier is pipelined to it. The
outputs of the multipliers are passed to a binary-to-decimal converter. The multiplier requires
an 8-bit binary-to-BCD converter in each processing element irrespective of the multiplier and
multiplicand size. The output from the converter is the input to the 8-bit BCD adder. The
least significant four bits of the BCD adder are stored as output and the rest of the bits are added
to the output of the adder in the next iteration. After m iterations, each output of
the processing elements (Pi ) is shifted i digits using shift registers. Finally, the outputs of
the processing elements are added using the (m + 1)-digit BCD adder. The datapath of the
circuit is shown in Fig. 15.6.
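The digit-wise flow of this datapath can be emulated functionally in software: each processing element multiplies its multiplicand digit by the multiplier digits one per iteration, keeps the least significant digit and carries the rest, and the shifted partial results are finally added. The sketch below mirrors that flow only at the value level; it does not model the binary-to-BCD converter or the BCD adders bit by bit, and the function names are illustrative.

```python
def pe_multiply(b_digit, multiplier_digits):
    """One processing element: b_digit times the multiplier, digits LSD first."""
    out, carry = [], 0
    for a_digit in multiplier_digits:          # one (1 x 1)-digit multiply per iteration
        p = b_digit * a_digit + carry          # single-digit product plus held carry
        out.append(p % 10)                     # least significant digit is stored
        carry = p // 10                        # remaining part is added in the next iteration
    out.append(carry)
    return out

def multiply(a, b):
    """(m x n)-digit decimal multiplication using the digit-wise PE flow."""
    a_digits = [int(d) for d in str(a)][::-1]  # multiplier digits, LSD first
    b_digits = [int(d) for d in str(b)][::-1]  # multiplicand digits, LSD first
    total = 0
    for i, b_digit in enumerate(b_digits):     # PE_i handles multiplicand digit B_i
        partial = pe_multiply(b_digit, a_digits)
        value = sum(d * 10 ** k for k, d in enumerate(partial))
        total += value * 10 ** i               # shift PE_i's result by i digits
    return total

assert multiply(9876, 5432) == 9876 * 5432
```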
Table 15.2: Simulation of LUT Merging for 3-Bit Input Combinations of a 1 × 1-Digit
FPGA-Based Multiplier Circuit
Figure 15.4: Deactivated Category Circuit as it is not the Target Category for Corresponding
Input.
Figure 15.5: Activated Category Circuit as it is the Target Category for Corresponding
Input.
15.4 SUMMARY
The (m×n)-digit multiplier performs digit-wise parallel processing in a divide and conquer
approach where (1 × 1)-digit multiplier is used. The (1 × 1)-digit multiplier avoids the
conventional partial product generation, reduction and addition steps. The multiplier uses
the Look-up Table (LUT) merging theorem to reduce the required number of LUTs by a
factor of two. The LUT merging theorem can be utilized for any set of functions for the
reduction of the required number of LUTs. The described theorem will enhance the efficient
use of LUTs in FPGA-based circuits to a great extent, which will advance the applicability
of Field Programmable Gate Arrays (FPGAs).
REFERENCES
[1] Z. T. Sworna, M. U. Haque and H. M. H. Babu, “A LUT-based matrix multiplication
using neural networks”, IEEE International Symposium on Circuits and Systems
(ISCAS), Montreal, QC, pp. 1982–1985, 2016.
[2] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep submicron FPGA
performance and density”, IEEE Trans. Very Large Scale Integration Syst., vol. 12,
no. 3, p. 288, 2004.
[3] M. Zhidong, C. Liguang, W. Yuan and L. Jinmei, “A new FPGA with 4/5-input LUT
and optimized carry chain”, Journal of Semiconductors, vol. 33, no. 7, 2012.
[4] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET (The Institution of Engineering and Technology)
Circuits, Devices and Systems, pp. 1–10, 2015.
[5] A. Vazquez, E. Antelo and P. Montuschi, “A new family of high performance parallel
decimal multipliers”, In 18th IEEE Symposium on Computer Arithmetic, pp. 195–
204, 2007.
[6] V. P. Mrio and H. C. Neto, “Parallel decimal multipliers using binary multipliers”, In
Programmable Logic Conference, Southern, pp. 73–78, 2010.
[7] H. T. Richard and R. F. Woods, “Highly efficient, limited range multipliers for LUT-
based FPGA architectures”, IEEE Trans. on Very Large Scale Integration (VLSI)
Systems, vol. 12, no. 10, pp. 1113–1118, 2004.
[8] Z. T. Sworna, M. U. Haque, H. M. H. Babu, L. Jamal and A. K. Biswas, “An Efficient
Design of an FPGA-Based Multiplier Using LUT Merging Theorem”, In 2017 IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), pp. 116–121, IEEE, 2017.
CHAPTER 16
Look-Up Table-Based Binary Coded Decimal Adder
The Binary Coded Decimal (BCD), being a more accurate and human-readable representation
with ease of conversion, is prevalent in computing and electronic communication.
In this chapter, a tree-structured parallel BCD addition algorithm is introduced with the
reduced time complexity. BCD adder is more effective with a LUT (Look-Up Table)-based
design, due to FPGA (Field Programmable Gate Array) technology’s enumerable bene-
fits and applications. A size-minimal and depth-minimal LUT-based BCD adder circuit
construction is the main focus of this chapter.
16.1 INTRODUCTION
Binary Coded Decimal (BCD) representation is advantageous due to its finite place value
representation, rounding, easy scaling by a factor of 10, simple alignment and conversion
to character form. It is highly used in embedded applications, digital communication and
financial calculations. Hence, a faster and more efficient BCD addition method is desired. In this
chapter, an N -digit addition method is introduced which omits the complex manipulation
steps, reducing area and delay of the circuit. The application of FPGA in cryptography, NP
(Non Polynomial)-Hard optimization problems, pattern matching, bioinformatics, floating
point arithmetic, and molecular dynamics is increasing radically. Due to re-configurable
capabilities, FPGA implementation of BCD addition is of concern. LUT being one of the
main components of FPGA, a LUT-based adder circuit is described.
Two main focuses are addressed in this chapter. First, a new tree-based parallel BCD
addition algorithm is presented. Second, a compact and high-speed BCD adder circuit is
presented, with an improved time complexity of O(N(log2 b) + (N − 1)), where N represents
the number of digits and b represents the number of bits in a digit.
1. Bit-wise addition of the BCD addends produces the corresponding sum and carry bits in
parallel. For the addition of the first bit, the carry from the previous digit is added
too, and the produced sum is the direct first bit of the output.
2. If the most significant carry bit is zero then, except for the first sum and the last carry bit,
add the other sum and carry bits in pairs in parallel; and if the sum is greater than or
equal to five, add three to the result to obtain the correct BCD output.
3. If the most significant carry bit is one, then update the final output values according
to (16.1) and (16.2).
Suppose A and B are the two addends of a 1-digit BCD adder, where the BCD representations
of A and B are A3 A2 A1 A0 and B3 B2 B1 B0 , respectively. The output of the adder
will be a 5-bit binary number {C out S 3 S 2 S 1 S 0 }, where C out represents the tens-digit
position and {S 3 S 2 S 1 S 0 } symbolizes the unit digit of the BCD sum. A0 and B0 are added along
with C in which is the carry from the previous digit addition. If it is the first digit addition,
the carry will be considered as zero. The produced sum bit will be the direct first bit of the
output. Other pairwise bits ( B1 , A1 ), ( B2 , A2 ), ( B3 , A3 ) will be added simultaneously. The
resultant sum and carry bits (S3^α, C2, S2^α, C1, S1^α and C0) are added pairwise, providing
the output {Cout^γ S3^γ S2^γ S1^γ}, which is corrected by the addition of three according to
the following (16.1) and (16.2):

Cout^γ S3^γ S2^γ S1^γ =
  (Cout^γ S3^γ S2^γ S1^γ),       if C3 = 0 and (Cout^γ S3^γ S2^γ S1^γ) < 5
  (Cout^γ S3^γ S2^γ S1^γ) + 3,   if C3 = 0 and (Cout^γ S3^γ S2^γ S1^γ) ≥ 5        (16.1)
  1 C0 S2^3 S1^3,                otherwise

where S1^3 = S2^3 = 0 if C0 = 1, and 1 otherwise.        (16.2)
In Table 16.1, the truth table is designed with (A3, A2, A1) and (B3, B2, B1) as input and
(Cout S3 S2 S1) as the final BCD output after the required correction. (S3^α, C2, S2^α,
C1, S1^α and C0) are added pairwise as an intermediate step, producing (F4, F3, F2, F1) by
considering the carry C0 to be always 1. A numeric 3 ((011)2) is added to the intermediary output F
if F is greater than or equal to five. A similar table considering C0 as 0 can be calculated,
which is shown in Table 16.2. The truth tables verify the functions of each output of the
LUTs of the BCD adder. The algorithm of the N-digit BCD addition method is presented in
Algorithm 16.1.
Two examples of the BCD addition method using the introduced algorithm are demonstrated
in Figs. 16.1 and 16.2, where C3^i = 0 and C3^i = 1, respectively. Each step of the examples is
mapped to the corresponding algorithm step for more clarification.
Figure 16.1: Example Demonstration of the BCD Addition Algorithm for C3i = 0.
Figure 16.2: Example Demonstration of the BCD Addition Algorithm for C3i = 1.
LUTs are used to add the outputs from the half-adders and the full adder {S3^α, . . . , S1^α, C0}, with
the correction by adding 3 if the sum is greater than or equal to five. Depending on the
value of C3 , a switching circuit is used to follow Equation (16.3). The circuit gains a huge delay
reduction due to its parallel working mechanism. By using the 1-digit BCD adder circuit,
an N -digit BCD adder circuit can be constructed easily, where the C out of one digit adder
circuit is sent to the next digit of the BCD adder circuit as a C in . Therefore, the generalized
N -digit BCD adder computes sequentially by using the previous carry, the block diagram
of which is shown in Fig. 16.5.
Cout S3 S2 S1 =
  Cout^3 S3^3 S2^3 S1^3,   if C3 = 1
  Cout^γ S3^γ S2^γ S1^γ,   otherwise        (16.3)
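The complete 1-digit procedure, including the corrections of Equations (16.1)–(16.3), can be checked exhaustively in software. The sketch below follows the algorithm as reconstructed above; the helper names are illustrative, and an exhaustive comparison against ordinary decimal addition is included at the end.

```python
def bcd_digit_add(a, b, cin):
    """Add two BCD digits plus a carry, following the parallel algorithm."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    # step 1: full adder on bit 0 (S0 is already the final output bit), half adders elsewhere
    s0 = a_bits[0] ^ b_bits[0] ^ cin
    c = [(a_bits[0] & b_bits[0]) | (cin & (a_bits[0] ^ b_bits[0]))]   # C0
    s_alpha = [None]
    for i in (1, 2, 3):
        s_alpha.append(a_bits[i] ^ b_bits[i])   # S_i^alpha
        c.append(a_bits[i] & b_bits[i])         # C_i
    # pairwise addition of (S3^a S2^a S1^a) and (C2 C1 C0) -> {Cout^g S3^g S2^g S1^g}
    v = (s_alpha[3] * 4 + s_alpha[2] * 2 + s_alpha[1]) + (c[2] * 4 + c[1] * 2 + c[0])
    if c[3] == 0:                               # Equation (16.1), first two cases
        v = v if v < 5 else v + 3
        cout, s3, s2, s1 = (v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1
    else:                                       # Equations (16.1)/(16.2), C3 = 1 case
        cout, s3 = 1, c[0]
        s2 = s1 = 0 if c[0] == 1 else 1
    return cout, s3 * 8 + s2 * 4 + s1 * 2 + s0

def bcd_add(x_digits, y_digits):
    """N-digit BCD addition, digits given LSD first; the carry ripples between digits."""
    carry, out = 0, []
    for x, y in zip(x_digits, y_digits):
        carry, digit = bcd_digit_add(x, y, carry)
        out.append(digit)
    out.append(carry)
    return out

# exhaustive check of the 1-digit adder against decimal addition
assert all(bcd_digit_add(a, b, cin) == ((a + b + cin) // 10, (a + b + cin) % 10)
           for a in range(10) for b in range(10) for cin in (0, 1))
```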
16.3 SUMMARY
Reconfigurable computing has now become a better alternative to Application-Specific
Integrated Circuits (ASICs) and fixed microprocessors. Besides, BCD (Binary Coded Dec-
imal) addition being the basic arithmetical operation, it is the main focus. The introduced
BCD adder is highly parallel which mitigates the significant carry propagation delay of
addition operation. As it is more convenient to convert from decimal to BCD than to binary,
the efficient FPGA-based BCD addition will subsequently influence the advancement in
computation and manipulation of decimal digits. Besides, Field Programmable Gate Array
(FPGA) implementation will be beneficial to be applied in bit-wise manipulation, private
key encryption and decryption acceleration, heavily pipe-lined and parallel computation of
NP-hard problems, automatic target generation and in many more applications.
REFERENCES
[1] A. K. Osama, M. A. Khaleel, Z. A. QudahJ, C. A. Papachristou, K. Mhaidat and
F. G. Wolff, “Fast binary/decimal adder/subtractor with a novel correction-free BCD
addition”, In Electronics, Circuits and Systems (ICECS), 18th IEEE International
Conference on, pp. 455–459, 2011.
[2] C. Sundaresan, C. V. S. Chaitanya, P. R. Venkateswaran, S. Bhat and J. M. Ku-
mar, “High speed BCD adder”, In Proceedings of the 2nd International Congress on
Computer Applications and Computational Science, pp. 113–118, 2012.
[3] Z. T. Sworna, M. U. Haque and H. M. H. Babu, “A LUT-based matrix multiplication
using neural networks”, IEEE International Symposium on Circuits and Systems
(ISCAS), pp. 1982–1985, 2016.
[4] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET (The Institution of Engineering and Technology)
Circuits, Devices & Systems, vol. 10, no. 3, pp. 1–10, 2015.
[5] P. Kenneth, R. Tessier and A. DeHon, “Birth and adolescence of reconfigurable
computing: A survey of the first 20 years of field-programmable custom computing
machines”, In Field-Programmable Custom Computing Machines (FCCM), IEEE
21st Annual International Symposium on, pp. 1–17, 2013.
[6] H. Liu and S. B. Ko, “High-speed parallel decimal multiplication with redundant
internal encodings”, IEEE Trans. on Computers, vol. 62, no. 5, pp. 956–968, 2013.
[7] N. Yonghai, Z. Guo, S. Shen and B. Peng, “Design of data acquisition and storage
system based on the FPGA”, Proc. Eng., vol. 29, pp. 2927–2931, 2012.
[8] G. Sutter, E. Todorovich, G. Bioul, M. Vazquez and J. P. Deschamps, “FPGA Im-
plementations of BCD Multipliers”, In International Conference on Reconfigurable
Computing and FPGAs, pp. 36–41, 2009.
[9] O. D. A. Khaleel, N. H. Tulic and K. M. Mhaidat, “FPGA implementation of binary
coded decimal digit adders and multipliers”, In 8th International Symposium on
Mechatronics and its Applications (ISMA), 2012.
[10] G. Shuli, D. A. Khalili, J. M. Langlois and N. Chabini, “Efficient Realization of BCD
Multipliers Using FPGAs”, International Journal of Reconfigurable Computing, 2017.
[11] G. Shuli, D. A. Khalili and N. Chabini, “An improved BCD adder using 6-LUT
FPGAs”, In 10th International Conference on New Circuits and Systems (NEWCAS),
pp. 13–16, 2012.
[12] G. Bioul, M. Vazquez, J. P. Deschamps and G. Sutter, “High-speed FPGA 10s Com-
plement Adders-subtractors”, Int. J. Reconfig. Comput., vol. 4, 2010.
[13] A. Vazquez and F. D. Dinechin, “Multi-operand Decimal Adder Trees for FPGAs”,
Research Report RR-7420, 2010.
[14] M. Vazquez, G. Sutter, G. Bioul and J. P. Deschamps, “Decimal Adders/Subtractors
in FPGA: Efficient 6-input LUT Implementations”, In Reconfigurable Computing and
FPGAs, 2009.
[15] M. Shambhavi and G. Verma, “Low power and area efficient implementation of BCD
Adder on FPGA”, In Signal Processing and Communication (ICSC), International
Conference on, pp. 461–465, 2013.
CHAPTER 17
This chapter focuses on the algorithm which can be very efficient for the purpose to minimize
the delays introduced in the circuit because of placement and routing. Placing and routing
operations are performed when a Field Programmable Gate Array (FPGA) device is used
for implementation. The delay introduced by logic block and the delay introduced by
interconnection can be analyzed by the use of efficient place and route algorithm. The
placement algorithms use a set of fixed modules and the netlist describing the connections
between the various modules as their input. The output of the algorithms is the best possible
position for each module based on various cost functions which further reduce the cost and
power and increases the performances.
17.1 INTRODUCTION
Most of the FPGAs available are SRAM-based. They must be configured after each
power-up; that is, they are volatile. Generally, the FPGA design flow that maps designs onto an
SRAM-based FPGA consists of three phases. The first phase uses a synthesizer, which is used
to transform a circuit model coded in a hardware description language into an RTL design.
The second phase uses a technology mapper, which transforms the RTL design into a gate-level
model composed of look-up tables (LUTs) and flip-flops (FFs) and binds them to
the FPGA's resources (producing the technology-mapped design). During the third phase,
the place and route algorithm uses the technology-mapped design to implement the circuit on the FPGA.
The routing and placing operations may require a long time for execution in case
of complex digital systems, because complex operations are required to determine and
configure the required logical blocks within the programmable logic device, to interconnect
them correctly and to verify that the performance requirements specified during the design
are ensured. The delay introduced by logic block and the delay introduced by interconnection
can be analyzed by the use of efficient place and route algorithm.
The placement algorithms use a set of fixed modules and the netlist describing the
connections between the various modules as their input. The output of the algorithms is the
best possible position for each module based on various cost functions. There can be one or
more cost functions depending on the design. The cost functions include maximum total wire
length, wire routability, congestion, performance, and I/O pad locations.
The procedure is repeated until all of the vertices are locked, even if the highest gain is
negative. The last few moves that had negative gains are then undone and the bisection is
reverted to the one with the smallest edge-cut so far in this iteration. At this point one outer iteration
of the K-L algorithm is completed and the iterative procedure is restarted. If an outer
iteration results in no reduction in the edge-cut or load imbalance, the algorithm is terminated.
The K-L algorithm is a local optimization algorithm with a capability for getting moves
with negative gain.
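One inner pass of the K-L algorithm can be sketched as follows: compute the D values, repeatedly pick the unlocked pair with the highest gain (even if negative), tentatively swap and update D, and finally keep only the prefix of swaps with the best cumulative gain. This is a generic textbook sketch with illustrative names and a hypothetical weighted graph, not the exact formulation used in the example that follows.

```python
def kl_pass(adj, A, B):
    """One Kernighan-Lin pass on partition (A, B); adj[u][v] is the edge weight."""
    A, B = set(A), set(B)
    w = lambda u, v: adj.get(u, {}).get(v, 0)
    D = {u: sum(w(u, v) for v in (B if u in A else A)) -
            sum(w(u, v) for v in (A if u in A else B) if v != u)
         for u in A | B}
    unlockedA, unlockedB, gains, swaps = set(A), set(B), [], []
    while unlockedA and unlockedB:
        g, a, b = max(((D[x] + D[y] - 2 * w(x, y), x, y)
                       for x in unlockedA for y in unlockedB), key=lambda t: t[0])
        gains.append(g)
        swaps.append((a, b))
        unlockedA.discard(a)
        unlockedB.discard(b)
        for x in unlockedA:                    # update D as if a and b were swapped
            D[x] += 2 * w(x, a) - 2 * w(x, b)
        for y in unlockedB:
            D[y] += 2 * w(y, b) - 2 * w(y, a)
    if not gains:
        return A, B
    prefix = [sum(gains[:k + 1]) for k in range(len(gains))]
    best = max(range(len(prefix)), key=lambda k: prefix[k])
    if prefix[best] > 0:                       # undo the moves past the best prefix
        for a, b in swaps[:best + 1]:
            A.remove(a); B.remove(b); A.add(b); B.add(a)
    return A, B

graph = {1: {2: 3, 3: 1}, 2: {1: 3, 4: 1}, 3: {1: 1, 4: 3}, 4: {2: 1, 3: 3}}
print(kl_pass(graph, {1, 3}, {2, 4}))   # one swap reduces the cut from 6 to 2
```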
Here the G value G17 is the largest. Hence the pair (a2 , b2 ) is (1, 7).
A1 = A1 − 1 = (2, 8) and B1 = B1 − 7 = (4, 6).
The new D values are then recomputed.
The last pair (a3 , b3 ) is (1, 7) and the corresponding gain is G17 .
Step 5: Determine the Values of X and Y.
X = a1 = 1 and Y = b1 = 7.
The new partition that will be obtained from moving X to B and Y to A is A = {2, 5, 7, 8}
and B = {1, 3, 4, 6}. The entire procedure is repeated again with this new partition as the
initial partition. Verify that the second iteration of the algorithm is also the last, and that
the best solution obtained is A = {2, 5, 7, 8} and B = {1, 3, 4, 6}.
The overall procedure is repeated with the gain having the maximum value taken, and then
the cut size is calculated. Thereafter, when the second pass is implemented, the nodes with
the maximum gain are locked. This process is repeated for all the passes and combinations.
At the end of the above process we get the minimum cut size, which in turn reduces the
wire delay and increases the performance.
17.5 SUMMARY
The K-L (Kernighan-Lin) algorithm increases the performance by reducing the wire delay.
Further work is necessary in the use of K-L-feasible cuts for optimization purposes.
Analysis of an efficient algorithm for the place and route process would be done in order to
place the components efficiently and create a proper routing path between them on FPGAs
(Field Programmable Gate Arrays). In this chapter, a new methodology is presented for
digital circuits which in turn reduces the area and increases the performance of a circuit
for the problem of hardware or software partitioning.
REFERENCES
[1] L. Sterpone and M. Violante, “A New Reliability-Oriented Place and Route Algorithm
for SRAM-Based FPGAs”, IEEE Trans. on Computers, vol. 55, no. 6, 2006.
[2] O. Martinello, F. S. Marques, R. P. Ribas and A. I. Reis, “KL-Cuts: A New Approach
for Logic Synthesis Targeting Multiple Output Blocks”, pp. 777–782.
[3] A. M. Fahim, “Low-Power High-Performance Arithmetic Circuits and Architectures”,
IEEE Journal of Solid-State Circuits, vol. 37, no. 1, pp. 90–94, 2002.
[4] S. Brown, “FPGA Architecture Research: A Survey”, IEEE Design and Test of
Computers, pp. 9–15, 1996.
[5] J. Rose, A. E. Gamal and A. Sangiovanni-Vincetelli, “Architecture of Field-
Programmable Gate Arrays”, Proc. IEEE, vol. 81, no. 7, pp. 1013–1029, 1993.
[6] V. Udar and S. Sharma, “Analysis of place and route algorithm for field programmable
gate array (FPGA)”, In 2013 IEEE Conference on Information & Communication
Technologies, pp. 116–119, IEEE, 2013.
CHAPTER 18
LUT-Based BCD Multiplier Design
The BCD (Binary Coded Decimal), being a more accurate and human-readable representation
with ease of conversion, is prevalent in computing and electronic communication.
In this chapter, an (N×M)-digit BCD multiplication algorithm is introduced which reduces
the complex steps of the conventional multiplication process. The BCD multiplier is more
effective with a LUT (Look-Up Table)-based design, due to FPGA (Field Programmable
Gate Array) technology’s enumerable benefits and applications. Hence, a compact LUT
circuit architecture with new selection, read and write operations is presented. Afterwards,
a cost-efficient N×M -digit multiplier circuit is demonstrated followed by a 1-Digit LUT-
based direct multiplier circuit.
18.1 INTRODUCTION
BCD (Binary Coded Decimal) representation is advantageous due to its finite place value
representation, rounding, easy scaling by a factor of 10, simple alignment and conversion
to character form. It is highly used in embedded applications, digital communication and
financial calculations. Hence, a faster and more efficient BCD multiplication method is desired. In
this chapter, an N×M-digit multiplication method is introduced which omits the complex
manipulation steps, reducing area, power and delay of the whole circuit. The advancement
in FPGA technology has emerged a new horizon of technology progress due to long time
availability, rapid prototyping capability, reliability and hardware parallelism. The cost of
making incremental changes to FPGA designs is negligible when compared to the large
expense of re-spinning an ASIC. The application of FPGAs in cryptography, NP-Hard
optimization problems, pattern matching, bioinformatics, floating point arithmetic and molecular
dynamics is increasing radically. Due to re-configurability, FPGA implementation of BCD
multiplication is of concern. An FPGA has three main elements – LUT, flip-flops and the
routing matrix.
There are two primary methods in traditional computing for the execution of various al-
gorithms. The first is to use an Application Specific Integrated Circuit, or ASIC, to perform
the operations in hardware. ASICs are designed specifically to perform a given computation
and they are very fast as well as efficient when executing the exact computation for which
they were designed. However, after fabrication the circuit cannot be altered. Second, mi-
croprocessors are a far more flexible solution in terms of re reusability. Processors execute
a set of instructions to perform a computation. By changing the software instructions, the
functionality of the system is altered without changing the hardware. Nevertheless, there are
some drawbacks like, performance degrades alongside with flexibility in comparison with
ASICs since the processor must read each instruction from memory, determine the meaning
of the instruction from the content of memory and only then execute the specific fetched
instruction which consequently results in a high execution overhead for each individual
operation.
Reconfigurable computing is intended to fill the void between hardware and software,
achieving potentially much higher performance than software, while maintaining a higher
level of flexibility. Reconfigurable computing has become a subject of a great deal of
research nowadays due to its versatile potential to greatly accelerate a wide variety of
applications. Its key feature is the ability to perform computations in hardware to increase
performance, while retaining much of the flexibility of a software solution. The field-
programmable gate array (FPGA) is a semiconductor device that can be programmed after
manufacturing. Therefore, it is a great mean for reconfigurable computing. Multiplication
is the fundamental operation used intensively throughout many activities. There are plentiful
types of number representations in computer organization. Binary representation of
numerical values gives more advantages for computation in computer-based systems, whereas
decimal representation offers more human friendliness. In this contrast, Binary Coded
Decimal plays a middleware role which confers an intuitive mechanism to convert to and
from human-readable decimal characters.
A robust and optimized circuit design must deal with the delay of the circuit, which is
a crucial comparison parameter for measuring performance alongside area and power
consumption. A fast logical circuit that is also accurate involves trade-offs in the design. These
aspects of modern reconfigurable computing are taken into consideration in exploring
some new features in reconfigurable computation as well as the architectural issues
of circuits involving BCD multipliers, and they motivate the design of an efficient BCD
multiplier using FPGAs.
New products in the shortest possible time have become the catchword in today's electronics
industry. Being able to test the product even before fabrication, the Field Programmable
Gate Array (FPGA) has become an extremely useful medium in digital circuit design. It
enables the designer to avoid the pitfalls of designs before synthesis. In the decades since
FPGAs were invented they have created many new opportunities. Perhaps the most exciting
is reconfigurable computing. Usually, an FPGA consists of an array of programmable logic
blocks, interconnects (routing channels) and I/O cells. Logic blocks can be configured to
implement sequential and combinational functions, which influences the speed and density of
the FPGA. As FPGAs are roughly ten times less dense than mask-programmed gate arrays, researchers
are aiming to explore new efficient configurable logic blocks such that the density gap
becomes as small as possible.
The most popular logic blocks of FPGAs are based on Look-up Tables (LUTs) and the
design from Plessey. A Look-up Table can implement any logical function defined by its
inputs. With more inputs, a LUT can implement more logic, hence fewer logic blocks are
needed. This helps in routing by asking for less area, since there are fewer connections to
LUT-Based BCD Multiplier Design 215
route between the logic blocks. With the growing popularity of decimal computer arithmetic in scientific, commercial, financial and Internet-based applications, hardware realization of decimal arithmetic algorithms is gaining importance. Hardware decimal arithmetic units now serve as an integral part of some recently commercialized general-purpose processors, where complex decimal arithmetic operations, such as multiplication, have been realized by rather slow iterative hardware algorithms. With the rapid advances in very large scale integration (VLSI) technology, semi- and fully parallel hardware decimal multiplication units are expected to evolve soon. As mentioned earlier, the dominant representation for decimal digits is the BCD encoding. The BCD-digit multiplier can serve as the key building block of a decimal multiplier, irrespective of the degree of parallelism. A BCD-digit multiplier produces a two-BCD-digit product from two input BCD digits. The aim here is to provide a design for a parallel BCD multiplier showing some advantages in BCD multiplier implementations using compact LUTs of an FPGA.
Five main focuses are addressed in this chapter:
1. A 1 × 1-digit LUT-based direct BCD multiplication method is introduced to avoid the recoding, partial product generation, partial product reduction and conversion steps of the conventional multiplication method.
2. An N×M-digit BCD multiplication algorithm is introduced, which has reduced recoding, partial product reduction and conversion steps.
3. An efficient method for the selection of memory cells and for Read and Write operations in the desired memory of the LUT is presented.
4. A new LUT architecture with more accuracy and less hardware complexity is introduced.
5. A new N×M-digit BCD multiplication circuit with a significantly reduced number of LUTs, area and delay is elucidated.
Property 18.2.1 A BCD multiplier uses BCD numbers as input and provides a BCD output, where each decimal digit is represented with a 4-bit binary code (with weights 8, 4, 2 and 1). For example, the decimal number (2549)10 is represented in BCD as (0010 0101 0100 1001)BCD. The multiplication of two BCD numbers produces partial products. After all the partial products are generated, they are added to obtain a binary result, which is then converted to the BCD output.
Example 18.1 The BCD multiplication of (29)10 = (0010 1001)BCD by (5)10 = (0101)BCD produces four partial products. The addition of the partial products results in a binary output, which then needs to be converted into BCD representation. After conversion, the BCD value found is 0001 0100 0101, which is 145 in decimal.
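The following is a minimal behavioural sketch, not the hardware datapath, illustrating the encode–multiply–convert flow of Example 18.1; the helper names are illustrative assumptions.

```python
def bcd_encode(n: int) -> str:
    """Encode a non-negative integer as BCD, 4 bits per decimal digit."""
    return " ".join(format(int(d), "04b") for d in str(n))

def bcd_decode(bits: str) -> int:
    """Decode a BCD string (groups of 4 bits) back to an integer."""
    return int("".join(str(int(g, 2)) for g in bits.split()))

def bcd_multiply(a_bits: str, b_bits: str) -> str:
    """Multiply two BCD operands: decode, multiply in binary, re-encode as BCD."""
    return bcd_encode(bcd_decode(a_bits) * bcd_decode(b_bits))

# Example 18.1: 29 x 5 = 145 -> 0001 0100 0101 in BCD
print(bcd_multiply(bcd_encode(29), bcd_encode(5)))   # 0001 0100 0101
```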
Table 18.1: Truth Table of the Function f = (A.B)|(A ⊕ B)

A | B | Output (f)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1
Property 18.2.2 A look-up table (LUT) is a memory block with a one-bit output that essentially implements a truth table, where each input combination generates a certain logic output. The input combination is referred to as an address, and the output of the LUT is the value stored in the indexed location of the selected memory cell. Since the memory cells in the LUT can be set to any values according to the corresponding truth table, an N-input LUT can implement any N-input logic function.
Example 18.2 When implementing any logic function, the truth table of that logic is mapped to the memory cells of the LUT. Suppose Equation 18.1 is implemented, where '|' represents the logical OR operation; Table 18.1 represents the truth table of the function. Fig. 18.1 shows the gate representation and the LUT representation of the logic function. The output is generated according to the corresponding input combination; for example, for the input combination 1 and 0, the output will be 1.
f = (A.B)|(A ⊕ B) (18.1)
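A brief sketch of the idea in Example 18.2, assuming a 2-input LUT is modelled as a 4-entry memory indexed by the input combination; the names below are illustrative, not from the original design.

```python
# Truth table of f = (A.B) | (A XOR B) stored as LUT contents, addressed by (A, B).
LUT_F = {
    (0, 0): 0,
    (0, 1): 1,
    (1, 0): 1,
    (1, 1): 1,
}

def lut_read(a: int, b: int) -> int:
    """Reading the LUT at address (a, b) evaluates the stored function."""
    return LUT_F[(a, b)]

assert lut_read(1, 0) == 1   # matches Table 18.1
```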
Property 18.2.3 The area of a logic circuit is the total area of the individual circuit elements. If a circuit consists of n gates and the areas of those n gates are A1, A2, . . . , An, then, by the above definition, the area (A) of the circuit is as follows:
A = \sum_{i=1}^{n} A_i    (18.2)
Example 18.3 A half adder consists of one 2-input Ex-OR gate and one 2-input AND gate. Using a CMOS open cell library, the areas of a 2-input Ex-OR gate and a 2-input AND gate are 1.6 µm² and 1.2 µm², respectively. So, the area of a half adder circuit is (1.6 + 1.2) µm² = 2.8 µm².
Property 18.2.4 The total power of a circuit can be calculated by summing the individual power consumption of each gate. To calculate the power of a single gate, the current obtained from Microwind DSCH and the voltage across the gate are multiplied (P = V × I).
If the individual powers of the gates are P1, P2, . . . , Pn, then the total power P of the circuit can be calculated using the following equation:
P = \sum_{i=1}^{n} P_i    (18.4)
Example 18.4 Suppose a half adder circuit is considered, which consists of an AND gate and an Ex-OR gate constituted of 6 and 8 transistors, respectively. Using Microwind DSCH, the threshold voltage for this circuit is found to be 0.5 V and the current passing through the transistors is 0.1 mA. Hence, the power consumed by a single transistor is (0.5 × 0.1) mW = 0.05 mW. Therefore, a half adder requires (14 × 0.05) mW = 0.7 mW of power.
Property 18.2.5 The delay of a combinational circuit is the critical path delay, which can be defined as the sum of the gate delays of the gates on the critical path. The critical path is the longest path from an input to an output that causes a low-to-high (or high-to-low) output transition. If T1, T2, T3, . . . , Tn are the gate delays of the gates G1, G2, G3, . . . , Gn on the critical path, respectively, then the delay T_Delay of the circuit is:
T_Delay = T1 + T2 + T3 + · · · + Tn    (18.5)
Example 18.5 A full adder consists of two 2-input Ex-OR gates and one 3-input AND gate. The delay of both a 2-input Ex-OR gate and a 3-input AND gate is 0.160 ns. The critical path of a full adder passes through the two Ex-OR gates. So, the delay of a full adder is (0.160 + 0.160) ns = 0.320 ns, which is illustrated in Fig. 18.2.
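A minimal sketch of the additive cost metrics in Equations 18.2, 18.4 and 18.5; the gate figures used below are taken from Examples 18.3 and 18.5, and the function names are illustrative assumptions.

```python
# Area and power are summed over all gates; delay is summed over the critical path only.

def total_area(gate_areas):            # Equation (18.2)
    return sum(gate_areas)

def total_power(gate_powers):          # Equation (18.4)
    return sum(gate_powers)

def critical_path_delay(path_delays):  # Equation (18.5)
    return sum(path_delays)

# Half adder of Example 18.3: one Ex-OR (1.6 um^2) + one AND (1.2 um^2)
print(total_area([1.6, 1.2]))                 # 2.8 (um^2)
# Full adder critical path of Example 18.5: two Ex-OR gates at 0.160 ns each
print(critical_path_delay([0.160, 0.160]))    # 0.32 (ns)
```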
Figure 18.2: Critical Path Delay Determination of the Full Adder Circuit.
Property 18.2.7 Memristance is the charge-dependent rate of change of flux with charge, which is the defining electronic property of the memristor. The convenient memristance function in Equation 18.7 is found by substituting the flux with the time integral of the voltage and the charge with the time integral of the current:
M(q(t)) = \frac{d\Phi/dt}{dq/dt} = \frac{V(t)}{I(t)}    (18.7)
Several properties and examples have been presented to clarify the ideas. First, the design of a BCD multiplier using the Look-Up Tables of an FPGA is described elaborately in the next subsection.
The logic gate-based implementations of the product functions are as follows, where dot (.) denotes the logical AND operation and (|) denotes the logical OR operation:
P0 = A0 .B0 (18.8)
(A1.A2.B1.B2),                                 if (A3 | B3) = 0
(B0.B2).(B1.B2).(B̄0.B̄1.B̄2.B̄3),                if (A3 | B3) = 1 and A3 = 1      (18.9)
(A0.A2).(A1.A2).(Ā0.Ā1.Ā2.Ā3),                 otherwise
The functions of the products {P5, . . . , P1} are implemented in LUTs. All possible input–output combinations are verified using the truth table shown in Table 18.2. The multiplier circuit uses several steps, which are as follows:
1. Recoding
2. Partial product generation
3. Partial product reduction
4. Conversion to BCD representation
1. (N×M) partial products PP are generated, where each node is the distinct product of an individual digit of the multiplier and an individual digit of the multiplicand, computed in parallel.
2. Then, (N + M) product digits P are produced, where each digit of the product is the sum of the respective digit contributions of each node N.
prevalent in the LUTs. The selection of the memory unit is performed depending on the two inputs of the LUT, as they refer to the corresponding memory addresses of the memory array, which consists of four nano-crosswires with a memristor connected at each junction. The transmission gates connected to each memristor propagate the operational voltage either to Write 1/0 into the memory or to Read the corresponding memory unit and pass the Read value to the output terminal O1 or O2 through the memristor. The LUT is shown in Fig. 18.4 and the algorithm for the construction of a 2-input LUT is given in Algorithm 18.2.
As Reset is nothing but the Write 0 operation, there is no difference between these two operations when performed on a single memory cell, which reduces hardware complexity. Besides, instead of using the conventional Wl (Word Line) and Bl (Bit Line), the direct use of the LUT inputs with inverters reduces the controller circuitry overhead, improving efficiency in area, power and delay. Op-amps are used for noise-free Read and Write voltages. The operational features of the LUT are presented in Table 18.3 for both Read and Write operations, where RP represents the Read Pulse and D0 and D1 represent Data 0 and Data 1, respectively.
For a Write operation, the Write Enable pulse voltage is high, which acts as an input bit for the transmission gate to pass the data bit (Data0/Data1). A crossbar array selects the particular memory cell Mi, where i ≤ 4. The initial memristance of Mi is first considered ROFF or RON for the Write 1 and Write 0 operations, respectively. Then, a pulse of +Vdd/−Vdd is applied through Vin until the memristance changes state. Thus, a logic 1/0 is successfully written to Mi.
For a Read operation, the Read Enable pulse voltage and the Read Pulse (+Vdd/−Vdd) are both high and are propagated to the particular memory cell Mi by using the crossbar arrays and the transmission gate. To perform the Read 0/Read 1 operation, a positive pulse of +Vdd (Read pulse) is applied through the transmission gate to the memristor and the Read value is found at the output terminal O1 or O2. For the Read 0 operation, assuming the NSP (memristance) of the memristor is zero, the Read slightly changes the NSP (memristance) of the memristor toward the value of RON. To restore the NSP to its original value, a RESTORE pulse of Vdd is applied.
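The following is a minimal behavioural sketch, not a device-level model, of the 2-input memristor LUT just described: a write drives the selected cell to logic 1/0, and a read returns the stored value (a real cell would then be restored). The class name and method names are illustrative assumptions.

```python
class MemristorLUT2:
    def __init__(self):
        # four memory cells, addressed by the LUT inputs (A, B)
        self.cells = {(a, b): 0 for a in (0, 1) for b in (0, 1)}

    def write(self, a: int, b: int, data: int) -> None:
        """Write Enable high: store Data0/Data1 in the selected cell M(a,b)."""
        self.cells[(a, b)] = data & 1

    def read(self, a: int, b: int) -> int:
        """Read Enable high: return the stored value of the selected cell."""
        return self.cells[(a, b)]

# Program the LUT with the truth table of f = (A.B) | (A XOR B), then read it back.
lut = MemristorLUT2()
for (a, b), f in {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}.items():
    lut.write(a, b, f)
assert lut.read(1, 0) == 1
```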
Similarly, larger-input LUTs can be designed. The design of the BCD multiplier requires a heterogeneous LUT architecture with 5-input and 6-input LUTs, so the architectures of the 5-input and 6-input LUTs are exhibited in Figs. 18.5 and 18.6. A 5-input and a 6-input LUT have 2^5 = 32 and 2^6 = 64 memory cells, respectively, with a common controller circuit. A 5-input LUT has a single output (O5) and a 6-input LUT has two outputs, O5 and O6. To make the design area-efficient, a 3D layered structure is used: 32 memory cells are arranged in two layers and 64 memory cells are arranged in four layers, where each layer has sixteen memory cells. For a particular layer, a row is selected by inputs A and B using the selection circuit of a 2-input LUT, and a column is selected using the two inputs C and D in the same way. As there are four layers in the 6-input LUT, a particular layer is selected using inputs E and F, while only E is used in the 5-input LUT to select one of its two layers. A conventional 6-input LUT requires 100 memristors, where 64 memristors are required for the memory cell unit and 36 memristors are required for the reference cells. On the other hand, the presented 6-input LUT requires a total of 64 memristors and no additional memristors for reference cells. A property supporting the generalization of the 2-input LUT circuit is given in Property 18.3.1.
Property 18.3.1 An n-input LUT requires at least 2^(n−2) 2-input LUTs, where n ≥ 2.
Example 18.6 For n = 5, a 5-input LUT requires 2^(5−2) = 8 2-input LUTs.
Using the 1 × 1-digit BCD multiplier and the BCD adder, an N×M-digit multiplier can be designed. Implementing the N×M-digit multiplication method of Algorithm 18.1, the circuit can be constructed with less area and delay. The block diagram of the N×M-digit multiplier is presented in Fig. 18.9. Property 18.4.1 is given below, supporting the generalized required number of LUTs.
Algorithm 18.3: Algorithm for the Construction of BCD Multiplier Circuit for 1-Digit
Multiplication
Proof 18.2 The N×M multiplication technique is accomplished in two steps. The first step deals with the generation of partial products by using the 1 × 1-digit multiplier. The 1 × 1-digit multiplier produces at most two BCD digits (X and Y), since the multiplication of the two largest BCD digits (9 and 9) produces 81, which can be represented as 9 × 9 = 81 with X = 8 and Y = 1. In general, the 1 × 1 multiplication can be represented as:
Ai × Bj = XaYb;
where i = 1, 2, 3, . . . , N;
j = 1, 2, 3, . . . , M and
a, b = 1, 2, 3, . . . , (N + M)
Figure 18.8: BCD Multiplier Circuit Design Using Virtex 5/6 Slice for 1-Digit.
P_1 = PP_{1Y_1}

P_2 = \sum_{r=1}^{M} (PP_{rX_2} + PP_{rY_2})

P_3 = \sum_{r=1}^{M} (PP_{rX_3} + PP_{rY_3} + Carry_2)
Figure 18.9: Block Diagram of the BCD Multiplier for N×M -Digit Multiplication.
For the generation of the initial BCD product digit P1, the (N×M)-digit multiplier does not require any BCD adder, since P1 can be obtained directly from the partial product PP_{1Y_1}. In the same way, the final BCD product digit P_{M+N} can be obtained using a half adder circuit. Hence, a total of (N + M − 2) N-digit BCD adders are required in the second step. An N-digit BCD adder requires K LUTs, where the K LUTs are 6-input LUTs. So, the second stage of the multiplication technique requires a total of K(N + M − 1) LUTs. Therefore, over the two stages, the multiplier circuit requires λ(N×M) + K(N + M − 1) LUTs.
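A small sketch of this LUT-count estimate; lambda_ is the LUT count of one 1 × 1-digit multiplier and k is the LUT count of one N-digit BCD adder. The value lambda_ = 10 matches the later worked example, while k = 4 is a placeholder assumption, not a figure from the text.

```python
def multiplier_lut_count(n_digits: int, m_digits: int, lambda_: int, k: int) -> int:
    """Total LUTs = lambda*(N*M) for the partial-product stage + K*(N+M-1) for the adders."""
    return lambda_ * (n_digits * m_digits) + k * (n_digits + m_digits - 1)

# e.g. a 4x3-digit multiplier with lambda = 10 LUTs per 1x1 multiplier and K = 4 per adder
print(multiplier_lut_count(4, 3, lambda_=10, k=4))   # 10*12 + 4*6 = 144
```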
For inputs with many digits there is a possibility of the same digit appearing multiple times, which can be used as an opportunity to reduce the circuit complexity by reducing the number of active LUTs. The reduced hardware complexity is proved in Property 18.4.2.
Property 18.4.2 If σ is the total number of repeating digits in the multiplicand B, then the total number of merged LUTs in an N×M multiplication is δ = f(σ, M − σ + ω), where M is the number of digits in the multiplicand B, ω is the total number of distinct digits that repeat in the multiplicand B and f(σ, M − σ + ω) is the function that calculates the total number of merged LUTs.
Proof 18.3 The N×M multiplication requires at least λ(N×M) + K(N + M − 1) LUTs, as proved in Property 18.4.1, where each of the partial products PP can be generated as follows:
PP = Ai × Bj = XaYb;
where i = 1, 2, 3, . . . , N;
j = 1, 2, 3, . . . , M and
a, b = 1, 2, 3, . . . , (N + M)
So, these products can be generated with a single multiplier having λ(N×M) LUTs. Hence, the σ repeated digits can be eliminated from the M multiplicand digits. Suppose ω is the total number of distinct repeating digits among the M digits. Since the multiplication of the multiplier A with the same σ digits can be generated from a single 1 × 1-digit multiplier circuit, a total of ω LUTs can sufficiently serve the purpose of the multiplication. Therefore, the total number of merged LUTs can be calculated from the function f(σ, M − σ + ω). The efficiency of the merged-LUT algorithm can be asserted by the pigeonhole principle, which states that, for natural numbers k and m, if n = km + 1 objects are distributed among m sets, one of the sets will contain at least k + 1 objects. BCD digits range from 0 to 9, so there are only 10 permitted digit values. Hence, if the multiplicand has more than 10 digits, at least two digits are surely repeated, and the probability of repetition increases with larger multiplicands.
Example 18.7 Suppose two BCD numbers A and B are to be multiplied, where A = 1234 and B = 223. In this approach, 10(4 × 3) = 120 LUTs are required in the multiplication stage. In the multiplicand B, the digit "2" repeats twice. Therefore, the partial product (1234 × 2) can be generated by a single multiplier circuit, saving 40 LUTs (one third of the total 120), which is demonstrated in Fig. 18.10.
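The following is a hedged sketch of the merged-LUT idea: if a multiplicand digit repeats, the 1 × 1-digit multiplier (and its LUTs) for that digit is instantiated once and its result reused. The per-multiplier count lambda_ = 10 matches Example 18.7; the function name is an illustrative assumption.

```python
def multiplication_stage_luts(multiplicand: str, n_digits: int, lambda_: int = 10) -> int:
    """LUTs needed in the partial-product stage when repeated multiplicand digits are merged."""
    distinct_digits = len(set(multiplicand))    # only distinct digits need a 1x1 multiplier
    return lambda_ * n_digits * distinct_digits

# Example 18.7: A has 4 digits, B = "223" -> 2 distinct digits instead of 3
print(multiplication_stage_luts("223", n_digits=4))   # 80 LUTs instead of 120, saving 40
```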
18.5 SUMMARY
This chapter details the designs and working procedures of LUTs (Look-Up Tables) and parallel BCD (Binary Coded Decimal) multipliers. Several lower bounds on the designed architectures have been described. LUTs being one of the main components of an FPGA (Field Programmable Gate Array), a LUT-based multiplier is presented. Since the LUT is a component of the FPGA-based BCD multiplier, an efficient, area-minimal 2-input LUT circuit is presented, where the improvement of the 2-input LUT consequently improves the larger-input LUTs. Besides, as 5- and 6-input LUTs are needed for the design of the multiplier, efficient design architectures for 5- and 6-input LUTs using the 2-input LUT circuit principle are also presented. Finally, the BCD multiplier circuit is designed using the proposed LUTs, demonstrating its cost-efficiency.
19.1 INTRODUCTION
During the last decade, the logic density, functionality and speed of FPGAs have improved considerably. Modern FPGAs are now capable of running at speeds of 500 MHz and beyond. Another important feature of FPGAs is their potential for dynamic reconfiguration, which is reprogramming part of the device at run time so that resources can be reused through time multiplexing. Hence, FPGA-based circuit design is a recent trend of research. The matrix is an extremely significant means of conveying and discussing problems that arise from real-life scenarios; it is effortless to manipulate and extract information by managing data in matrix form. Multiplication is one of the essential operations on matrices. Linear back-projection, color space conversion, 3D affine transformations, estimation of higher-order cross moments, time-frequency spectral analysis and wireless communication are commonly used matrix multiplication applications. To improve the performance of these applications, a high-performance matrix multiplier is required. So, a cost-efficient FPGA-based matrix multiplication algorithm is introduced. Besides, to make the matrix multiplication method more efficient, (1 × 1)-digit and (m×n)-digit decimal multiplication algorithms are presented. Then, compact and faster (1 × 1)-digit and (m×n)-digit decimal multiplier circuits are constructed. Finally, a matrix multiplier circuit is constructed with drastically reduced area and delay.
Traditionally, the matrix multiplication operation is realized either as software running on fast processors or on dedicated hardware (Application Specific Integrated Circuits (ASICs)). Software-based matrix multiplication is slow and can become a bottleneck in the overall system operation. A comparative analysis of the performance of CPU, GPU and FPGA is exhibited in Fig. 19.1. Hardware (Field Programmable Gate Array (FPGA)) based designs of matrix multipliers provide a significant speed-up in computation time compared to software and greater flexibility compared to ASIC-based approaches. During the last few years, research efforts toward realizing and accelerating the matrix multiplication operation using reconfigurable hardware (FPGAs) have been attempted. FPGAs offer the design flexibility of software and the speed of hardware (ASICs). Hence, an efficient FPGA-based matrix multiplier is aimed to be introduced.
Although modern FPGAs contain embedded multipliers, the look-up tables (LUTs) in the configurable logic fabric remain important for high-performance designs for several reasons:
1. Embedded multiplier operands are fixed in size and type, such as 25 × 18 two's complement, while LUT-based multiplier operands can be of any size or type.
2. The number and location of embedded multipliers are fixed, while LUT-based multipliers can be placed anywhere and their number is limited only by the size of the reconfigurable fabric.
Hence, a compact and faster LUT-based multiplication algorithm, along with its circuit, is to be introduced. Using this multiplier, an efficient matrix multiplier is to be introduced. In digital signal processing (DSP) algorithms, in many fixed transforms such as the discrete cosine transform (DCT), the multiplicand has a limited number of values. Moreover, image processing, graph theory and other real-life applications involve large matrices whose values lie in a limited range. For example, Fig. 19.2 exhibits a real-life application of a small-range matrix: though images are usually of 256 × 256 or 1024 × 1024 matrix size, the range of the values is limited. Hence, using this property, an efficient matrix multiplier is to be introduced that re-utilizes pre-calculated values rather than re-calculating for repeated values. In this way, an efficient LUT-based matrix multiplier circuit is to be presented with less area, power and delay.
1. Parallel matrix multiplication has been explored and investigated extensively over the previous two decades. There are diverse approaches to optimizing the matrix multiplication algorithm. Though a parallel algorithm is faster, it requires huge resources.
2. To reduce the resource allocation, a serial architecture can be considered, but it will increase the delay. Hence, an area-delay trade-off has to be considered.
3. Matrix multiplication is usually huge in size for real-life applications like image processing, graph theory and many more. So, large-scale matrices should be kept in mind prior to designing a matrix multiplier circuit.
4. In multiplication, the main challenge is the carry propagation delay, which needs to be reduced as much as possible to obtain faster multiplier circuits.
6. The main objective of this work is to propose an efficient FPGA-based matrix multiplier circuit. By utilizing the special features of advanced FPGAs, the computation time, hardware resource utilization and power consumption can be significantly reduced. The current trend is toward realizing a high-performance matrix multiplier on FPGA.
7. Use the advantage of repetition in large-scale real-life matrix applications with limited-range values and reduce the computational complexity with effective circuit area.
8. Reduce the carry propagation delay by digit-wise parallel processing in a divide-and-conquer approach.
9. Reduce the required number of LUTs while implementing any function using LUTs.
Figure 19.3: Partial Product Generation for Final Addition of 8 × 8-bit Multiplication.
3. Final addition.
For the multiplication of an n-bit multiplicand with an m-bit multiplier, m partial products are generated and the product formed is n + m bits long. Here, about five different types of multipliers are discussed, which are as follows:
1. Booth multiplier.
2. Combinational multiplier.
4. Array multiplier.
5. Sequential multiplier.
The conventional approach computes each element of the product as:
(AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}
To multiply two matrices, the necessary and sufficient condition is that the number of columns in matrix A equals the number of rows in matrix B. The conventional matrix multiplication algorithm is exhibited in Algorithm 19.1. The time complexity of the conventional matrix multiplication is O(n³). An example of the matrix multiplication operation is demonstrated in Fig. 19.4.
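A minimal sketch of the conventional O(n³) matrix multiplication (the behaviour Algorithm 19.1 describes), not the FPGA datapath itself; the function name is an illustrative assumption.

```python
def matmul(a, b):
    """Multiply matrix a (p x m) by matrix b (m x q): c[i][j] = sum_k a[i][k] * b[k][j]."""
    p, m, q = len(a), len(b), len(b[0])
    assert all(len(row) == m for row in a), "columns of A must equal rows of B"
    c = [[0] * q for _ in range(p)]
    for i in range(p):
        for j in range(q):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```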
Figure 19.6: Verilog Code for the Hardware Implementation of Binary to BCD Conversion.
Example 19.2 When implementing any logic function, the truth table of that logic is mapped to the memory cells of the LUT. Suppose Equation 19.1 is implemented, where '|' represents the logical OR operation; Table 19.2 represents the truth table of the function. Fig. 19.8 shows the gate representation and the LUT representation of the logic function. The output is generated according to the corresponding input combination; for example, for the input combination 1 and 0, the output will be 1.
f = (A.B)|(A ⊕ B) (19.1)
There has been significant research on improving the LUT to reduce hardware complexity and read and write times. The circuit diagram of a 2-input LUT is given in Fig. 19.9.
Table 19.2: Truth Table of the Function f = (A.B)|(A ⊕ B)

A | B | Output (f)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1
An 8-bit adder can be designed consisting of two 9-input LUTs. Each 9-input LUT has two 4-bit operands plus a 1-bit carry as inputs and a 5-bit output for a 4-bit addition. The carry is propagated to the next 9-input LUT only after the previous 4-bit addition in one LUT is done (i.e., ripple carry). Since the LUTs must be read one by one, this adder takes a long time to finish an addition, as shown in Fig. 19.10(a). By employing the concept of the carry-select adder, a much faster adder can be implemented with 8-input LUTs, because reading the next LUT does not depend on the previous carry. The details of the implementation are depicted in Fig. 19.10(b). To make an even better adder, 4-input LUTs with 6-bit outputs can be exploited, as in Fig. 19.10(c). The comparative analysis is shown in Fig. 19.11.
Figure 19.10: 8-bit Adder Using (a) Two 9-Input LUTs (b) Two 8-Input LUTs (c) Four
4-Input LUTs.
Figure 19.11: Comparison of Area and Time for 8-bit Adder Using Various LUTs.
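The following is a hedged behavioural sketch of the carry-select idea behind Fig. 19.10(b): the upper 4-bit results for carry-in 0 and carry-in 1 are both available immediately, and the real carry from the lower half only selects between them instead of gating the next LUT read. The add4 helper is a functional stand-in for a 4-bit adder LUT, not the LUT contents themselves.

```python
def add4(a: int, b: int, cin: int):
    """Behavioural stand-in for a 4-bit adder LUT: returns (4-bit sum, carry-out)."""
    s = a + b + cin
    return s & 0xF, s >> 4

def add8_carry_select(a: int, b: int) -> int:
    lo_sum, lo_carry = add4(a & 0xF, b & 0xF, 0)
    hi_if_c0 = add4(a >> 4, b >> 4, 0)      # precomputed assuming carry-in = 0
    hi_if_c1 = add4(a >> 4, b >> 4, 1)      # precomputed assuming carry-in = 1
    hi_sum, hi_carry = hi_if_c1 if lo_carry else hi_if_c0   # select with the actual carry
    return (hi_carry << 8) | (hi_sum << 4) | lo_sum

assert add8_carry_select(0xAB, 0x7D) == 0xAB + 0x7D
```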
19.2.11 Comparator
A comparator is a logic circuit that compares the magnitudes of A and B and then determines which of A > B, A < B and A = B holds. When the two numbers are 1-bit numbers, each result is a single bit (0 or 1). Such a circuit is called a 1-bit magnitude comparator, which is the basis for comparing two n-bit numbers. The truth table of the conventional 1-bit comparator is listed in Table 19.3. From this truth table, the logical expressions of the 1-bit comparator are as follows:
X = (F_{A>B}) = A.B′
Y = (F_{A=B}) = (A ⊕ B)′
Z = (F_{A<B}) = A′.B
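A small sketch checking the 1-bit comparator equations above (X: A > B, Y: A = B, Z: A < B) against the behaviour they encode; the function name is an illustrative assumption.

```python
def comparator_1bit(a: int, b: int):
    x = a & (1 - b)          # A.B'       -> A > B
    y = 1 - (a ^ b)          # (A xor B)' -> A = B
    z = (1 - a) & b          # A'.B       -> A < B
    return x, y, z

for a in (0, 1):
    for b in (0, 1):
        x, y, z = comparator_1bit(a, b)
        assert (x, y, z) == (int(a > b), int(a == b), int(a < b))
```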
The wave form of a 1-bit comparator circuit is demonstrated in Fig. 19.14 and a circuit
diagram of a 4-bit comparator circuit is exhibited in Fig. 19.15.
Shift registers are used for data storage or for the movement of data and are therefore commonly used inside calculators or computers to store data, such as two binary numbers before they are added together, or to convert data from serial to parallel or parallel to serial format. The individual data latches that make up a single shift register are all driven by a common clock (Clk) signal, making them synchronous devices. Shift register ICs are generally provided with a clear or reset connection so that they can be "SET" or "RESET" as required.
Generally, shift registers operate in one of four different modes with the basic movement
of data through a shift register being:
1. Serial-in to Parallel-out (SIPO) – the register is loaded with serial data, one bit at a
time, with the stored data being available at the output in parallel form.
2. Serial-in to Serial-out (SISO) – the data is shifted serially “IN” and “OUT” of the
register, one bit at a time in either a left or right direction under clock control.
3. Parallel-in to Serial-out (PISO) – the parallel data is loaded into the register simul-
taneously and is shifted out of the register serially one bit at a time under clock
control.
4. Parallel-in to Parallel-out (PIPO) - the parallel data is loaded simultaneously into the
register, and transferred together to their respective outputs by the same clock pulse.
The effect of data movement from left to right through a shift register can be presented graphically as in Fig. 19.16. The directional movement of the data through a shift register can be to the left (left shifting), to the right (right shifting), left-in but right-out (rotation), or both left and right within the same register, thereby making it bidirectional.
Figure 19.16: Data Movement from Left to Right through a Shift Register.
Figure 19.17: Circuit with Corresponding Literal Cost, Gate Input Cost, and Gate Input
Cost with NOT.
The output of each LUT can be registered in a flip-flop. Four such LUTs and their eight flip-flops, as well as multiplexers and arithmetic carry logic, form a slice, and two slices form a configurable logic block (CLB). Four flip-flops per slice (one per LUT) can optionally be configured as latches; in that case, the remaining four flip-flops in that slice must remain unused. Between 25–50% of all slices can also use their LUTs as distributed 64-bit RAM, as 32-bit shift registers (SRL32) or as two SRL16s. Modern synthesis tools take advantage of these highly efficient logic, arithmetic and memory features. Expert designers can also instantiate them.
The (1 × 1)-digit multiplication is considered here first, as it is used as the basic unit to construct an (m×n)-digit multiplier. The (1 × 1)-digit multiplication considers four binary bits, with a maximum decimal value of 9, as each input. As A×B = B×A, only one of A×B and B×A is considered, to avoid identical input combinations. Either the multiplicand or the multiplier will be 0dec or 1dec in the 1-bit category. If one of the inputs is 0dec, the corresponding output will be zero; if one of the inputs is 1dec, the output will be the other operand. The group of the 3-bit category is selected for the implementation of LUT merging (described in Chapter 16). A LUT with 4 to 6 inputs provides the best area and delay performance. The traditional 4-input LUT structure provides low logic density and configuration flexibility, which reduces the utilization of interconnect resources when configured as relatively complex logic functions. Hence, a 6-input LUT is considered to implement the function.
The first step is the selection of LUTs, which is performed on the basis of the computation of the final products P1, P2, P3 and P4. Second, partitioning of the LUTs is performed by observing the similarities between the input and output combinations of the product bits; two different colors are used to distinguish the distinct partition sets. For the product bits, one input can be eliminated, since eliminating the input maintains all the input combinations with their corresponding outputs, and even when there are similar input combinations the corresponding output bits are identical. The functions of the same partition are fed as inputs to a single 6-input LUT. As each function has 5 inputs after the input-reduction step, the 6-input LUT provides dual outputs. Hence, the merging of LUTs reduces the required number of LUTs from 4 to 2. The 6-bit input combinations are converted into 5-bit input combinations in such a way that the total number of input combinations of the 6-bit and 5-bit inputs is the same, and the corresponding outputs of all the combinations for both the 6-input and 5-input variables also remain the same. Finally, the partitioned LUTs are merged into one.
Algorithm 19.3 depicts the (1 × 1)-digit multiplication technique. The inputs of the algorithm are the multiplicand digit A with 4 bits (a3 a2 a1 a0) and the multiplier digit B with 4 bits (b3 b2 b1 b0). The first if condition specifies that if A or B is 0dec then the output will be zero. The second condition implies that if A equals 1dec then the output will be B. The third condition specifies that if B equals 1dec then the output will be A. The fourth condition indicates that if (A1.B1) is one, meaning the combination is of the 2-bit category, and (A3 A2 B3 B2) is zero, ensuring that the input combination is not of the 3-bit or 4-bit category, then the other output bits will be zero. The fifth condition checks whether (A2 B2) equals 1 and the combination is not of the 4-bit category, in which case it is classified as the 3-bit category. Finally, the last condition checks whether the input combination is of the 4-bit category.
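The following is a behavioural sketch of the (1 × 1)-digit multiplication interface: two BCD digits in, two BCD digits (high, low) out. A precomputed table stands in for the bit-categorization circuit of Algorithm 19.3; it is not that circuit itself, and the names are illustrative.

```python
# All 100 products of two decimal digits, split into (high_digit, low_digit).
ONE_BY_ONE = {(a, b): divmod(a * b, 10) for a in range(10) for b in range(10)}

def multiply_1x1_digit(a: int, b: int):
    """Return (high_digit, low_digit) of a*b, e.g. 9*9 -> (8, 1)."""
    return ONE_BY_ONE[(a, b)]

assert multiply_1x1_digit(9, 9) == (8, 1)
assert multiply_1x1_digit(3, 0) == (0, 0)
```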
Example 19.5 Table 19.3 provides an illustrative example of the (m×n)-digit multiplication algorithm. Suppose a (4 × 4)-digit multiplication is considered, where the multiplier A = (3654)dec and the multiplicand B = (7932)dec. Since a decimal (BCD) multiplier is considered here, the inputs are provided as A = (0011 0110 0101 0100) and B = (0111 1001 0011 0010). The algorithm is highly parallel: each multiplicand digit is processed in parallel, and the multiplier is pipelined for each multiplicand digit as shown in the Input row of Table 19.3. For each multiplicand digit 7, 9, 3, and 2, the multiplier digits 3, 6, 5, and 4 are pipelined. For each multiplier digit, a total of 4 stages are performed. At the first stage, 4 is processed in parallel with 2, 3, 9, and 7. After the (1 × 1)-digit multiplication, the corresponding 7-bit binary outputs are obtained. The binary outputs are then converted to BCD. Then, BCD addition is performed on the 8-bit BCD output and the ToBeAdded variable. The ToBeAdded variable is initialized to zero. After the 8-bit BCD addition, the 4 bits in positions (0-3) are stored as output and the other 4 bits in positions (7-4) are stored in ToBeAdded. In the next iterations the identical procedure is followed with the updated values. At the last iteration, ToBeAdded is concatenated with the output. Finally, all the output values are left-shifted by 0, 1, 2, and 3 digits, respectively. At last, BCD addition is performed on the shifted output values and the final result is obtained.
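The following is a hedged sketch of the digit-parallel flow described in Example 19.5: each multiplicand digit multiplies every multiplier digit via the 1 × 1-digit unit, a running ToBeAdded digit carries the high BCD digit forward, and the per-digit rows are finally shifted and added. Plain integers stand in for the BCD encodings used by the hardware; the function name is an illustrative assumption.

```python
def multiply_digits(multiplier: str, multiplicand: str) -> int:
    total = 0
    # one row per multiplicand digit (these rows work in parallel in hardware)
    for shift, b_digit in enumerate(reversed(multiplicand)):
        to_be_added = 0
        row_digits = []
        for a_digit in reversed(multiplier):                      # pipelined multiplier digits
            high, low = divmod(int(a_digit) * int(b_digit), 10)   # 1x1-digit multiply
            s = low + to_be_added                                  # addition with ToBeAdded
            row_digits.append(s % 10)
            to_be_added = high + s // 10
        row_digits.append(to_be_added)                             # concatenate final ToBeAdded
        row = int("".join(str(d) for d in reversed(row_digits)))
        total += row * (10 ** shift)                               # left shift by digit position, then add
    return total

assert multiply_digits("3654", "7932") == 3654 * 7932
```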
The aforementioned (m×n)-digit multiplication algorithm requires a binary-to-BCD conversion process. To improve the (m×n)-digit multiplication algorithm, a new binary-to-BCD conversion algorithm is introduced in the next subsection.
Table 19.4: The Binary Input and BCD Output of the Binary to BCD Conversion Method
for (m×n)-Digit Multiplication
Figure 19.18: K-Map Layout for Seven Variables to Optimize the Functions.
The K-map manipulation for the function P1 is shown in Fig. 19.19. After grouping the adjacent one-values, the groups of Fig. 19.20 are obtained. Hence, the P1 function can be represented as a sum of products (SOP) as shown in Equation 19.3, where A = B0, B = B1, C = B2, D = B3, E = B4, F = B5, and G = B6. Similarly, the functions (P2–P5) are formulated in Equations 19.4 to 19.7 using K-map manipulation. The K-map manipulation and grouping of the functions (P2–P5) are demonstrated in Figs. 19.21–19.28.
Table 19.4 shows that the values of P0 and B0 are the same. Hence, the function of P0 can be formulated as P0 = B0, which is shown in Equation 19.2. Observing Table 19.4, the functions of P6 and P7 can be formulated as shown in Equations 19.8 and 19.9, without using K-map manipulation. Hence, the binary-to-BCD conversion algorithm uses the LUT implementation of the optimized output functions (P7–P0). The functions are optimized to a great extent by avoiding the huge number of invalid input combinations.
P0 = B0 (19.2)
P1 = B0.B3 + B2.B5.B6 + B1.B3.B5 + B1.B2.B3 + B1.B2.B3.B5 + B1.B2.B3.B5 + B1.B2.B3.B4.B5 + B0.B1.B2.B3.B4.B5    (19.3)

P2 = B2.B3.B4 + B2.B4.B5 + B1.B2.B3 + B1.B2.B3.B5 + B2.B3.B4 + B1.B2.B6 + B0.B2.B3 + B0.B1.B2.B4.B5    (19.4)

P3 = B1.B2.B3.B5 + B2.B3.B4 + B5 + B1.B2.B3.B4 + B0.B1.B2.B3.B4.B5    (19.5)

P5 = B0.B2 + B2.B5.B6 + B1.B2.B4 + B1.B2.B3 + B1.B2.B3    (19.7)
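For reference, the following is a behavioural sketch of what the optimized LUT functions (P7–P0) compute as a whole: a 7-bit binary value (the result of one 1 × 1-digit multiplication, at most 81) converted into two BCD digits. This is functional Python, not the minimized SOP hardware, and the function name is an illustrative assumption.

```python
def binary_to_bcd_7bit(value: int) -> str:
    """Convert a 7-bit binary value (0..99 expected) to an 8-bit BCD string P7..P0."""
    assert 0 <= value <= 99
    tens, ones = divmod(value, 10)
    return format(tens, "04b") + format(ones, "04b")

print(binary_to_bcd_7bit(0b1010001))   # 81 -> '10000001' (BCD digits 8 and 1)
```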
The (1 × 1)-digit multiplication avoids the costly partial product generation, partial product reduction and addition operations. Instead, the presented algorithm applies a simple bit-categorization technique based on simple conditional logic. The final output from each category does not require any complex calculation; rather, it is based on simple AND and OR logic operations and can be generated faster. Hence, the (1 × 1)-digit multiplication is not only faster but also area-efficient due to its direct output generation instead of partial product generation and addition operations. Using this significantly efficient (1 × 1)-digit multiplication algorithm, the (m×n)-digit multiplication algorithm has been introduced.
As shown in Fig. 19.30, for (8 × 8)-digit multiplication a 32-bit BCD-to-binary conversion is first required for both the multiplier A and the multiplicand B. Moreover, a 54-bit binary-to-BCD conversion is required as the last step. As the number of bits being converted is huge, the circuit is very complex and requires much area, power and delay. With an increasing number of input bits, the number of conversion bits increases accordingly, which makes the computation and the circuit very costly. On the contrary, the introduced circuit requires only an 8-bit conversion for each digit of the multiplicand, and these conversions work in parallel. As the number of bits being converted is very small and each conversion is performed in parallel, the presented algorithm requires less area and delay for conversion. Moreover, whatever the input size is, the described algorithm always requires only 8-bit conversions; only the number of converters increases, but as they work in parallel, both the delay and the circuit complexity are reduced radically.
On the other hand, the introduced algorithm generates only eight partial product rows at the final stage. Moreover, the intermediate (1 × 1)-digit multiplications are performed without generating any partial products. The algorithm requires a BCD addition of the eight partial product rows at the final stage. Besides, the intermediate BCD adders are only 8-bit adders. So, the carry propagation delay is reduced due to the small-input adders and the parallelism. It can therefore be concluded that the multiplication algorithm is significantly improved in area and delay due to its parallelism, its reduction of intermediate partial product generation, its elimination of the long carry propagation delay of wide adders, and so on.
Property 19.3.2 If there are d distinct values and the matrix size is n×n, then the total number of repeated values is n² − d.
A \in \mathbb{N}^{n \times n} \Leftrightarrow A = (a_{ij}) = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}
So, after multiplication the matrix C will be as follows:
C = A × B; where c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}    (19.9)
For high-speed parallel processing, a maximum of n×n processing element blocks is considered for the matrix multiplication process. If there is 0% repetition in the rows of the multiplicand matrix B, then processing element re-utilization will not take place, resulting in n×n effective processing element blocks. On the contrary, matrices with identical values in the rows of B will require far fewer effective processing elements. Each processing element block uses, as its inputs, the values of a column from the multiplier A and a single value from the corresponding row of the multiplicand B. Now, column partitioning is performed on the multiplier matrix A as follows:
multiplier matrix, A as follows:
A ∈ Nn×n ⇔ A = [A1 A2 · · · Ak dots An ], Ak ∈ N n ;
a1
a2
Ak = .
.
an
For each processing element, every value of the i-th row of the multiplicand B will be used as an input together with the values of the corresponding column vector A_k of the multiplier, where k = i. Now, row partitioning of the multiplicand matrix B is performed as follows:
B \in \mathbb{N}^{n \times n} \Leftrightarrow B = [B_1 \; B_2 \; \cdots \; B_k \; \cdots \; B_n]^T, \quad B_k \in \mathbb{N}^n;
B_k = (b_1 \; b_2 \; \cdots \; b_n)
Therefore, for n×n processing elements, each column vector A_k is the input along with each j-th value of the row vector B_i for k = i, where each column vector is pipelined to the processing-element network for faster execution, as shown in Fig. 19.32. In Algorithm 19.5, the repeated values in the multiplicand matrix B are first checked using the CheckRepeatation(B[ ][ ]) function. The repetition check is performed efficiently by parallel processing. For each value of each row, if there is a repeated value matching a previous value, then the corresponding X[ ][ ] matrix is updated: the column address of the initial value is stored in the corresponding index of Z[ ][ ] and 1 is stored in the corresponding index of X[ ][ ]. If there is no repeated value, then 0 is stored in the corresponding index of X[ ][ ].
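The following is a hedged sketch of the repetition check described for Algorithm 19.5: for every entry of B, X marks whether the value already appeared earlier in the same row, and Z records the column of that first occurrence so its product can be reused. Matching within a row is assumed from the circuit description that follows; the function name mirrors CheckRepeatation but the body is illustrative.

```python
def check_repeatation(B):
    n_rows, n_cols = len(B), len(B[0])
    X = [[0] * n_cols for _ in range(n_rows)]
    Z = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        first_seen = {}                     # value -> column of its first occurrence in row i
        for j in range(n_cols):
            if B[i][j] in first_seen:
                X[i][j] = 1                 # repeated value: reuse instead of recomputing
                Z[i][j] = first_seen[B[i][j]]
            else:
                first_seen[B[i][j]] = j
    return X, Z

X, Z = check_repeatation([[5, 2, 2], [7, 9, 7]])
print(X)   # [[0, 0, 1], [0, 0, 1]]
print(Z)   # [[0, 0, 1], [0, 0, 0]]
```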
Gate Input Cost (G) is the number of inputs to the gates in the implementation corresponding exactly to the given equation or equations. It can be found from the equation(s) as the sum of all literal appearances plus the number of terms having more than one literal. The gate input cost is denoted by G if inverters are not counted, or GN if inverters are counted. For example:
F = BD + ABC + ACD; G = 11; GN = 14 (19.17)
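A small sketch of this bookkeeping under the usual convention: G counts every literal appearance plus one input per multi-literal term feeding the output OR gate, and GN additionally counts one inverter per distinct complemented literal. Terms are written as literal lists with a trailing "'" marking complementation; which three literals are complemented in F is chosen here only for illustration (GN = 14 implies three of them).

```python
def gate_input_cost(terms):
    literal_count = sum(len(t) for t in terms)
    or_inputs = len([t for t in terms if len(t) > 1])   # single-literal terms need no gate
    g = literal_count + or_inputs
    complemented = {lit for t in terms for lit in t if lit.endswith("'")}
    gn = g + len(complemented)
    return g, gn

# F = BD + ABC + ACD with three complemented literals (illustrative choice)
print(gate_input_cost([["B", "D'"], ["A'", "B", "C"], ["A", "C'", "D"]]))   # (11, 14)
```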
So, extraction is performed to find common factors and optimize the equations. Table 19.5 shows the literal cost, gate input cost, and gate input cost with NOT, before and after optimization. It exhibits reductions of 55.26%, 61.85%, and 57.54% in literal cost, gate input cost and gate input cost with NOT, respectively. For bit categorization, a circuit is constructed as shown in Figs. 19.33 and 19.34. Fig. 19.33 demonstrates that for the input combination A = 0010 and B = 0001, the circuit selects category 1, as one of the input operands is 1. Besides, Fig. 19.34 shows that for the input combination A = 0010 and B = 0101, the circuit selects category 4, as the input operand B has a maximum input size of 3 bits. Similarly, all the other possible input combinations select the corresponding category. The LUT-based implementation of the bit categorization circuit requires 10 4-input LUTs and 6 2-input LUTs.
The overall circuit of the (1 × 1)-digit multiplication is exhibited in Fig. 19.35. The block diagram of the bit categorization has already been described. As the bit categorization selects the corresponding category by setting the value of that category variable to 1, this value activates
Figure 19.33: Bit Categorization Circuit Selected Category One for Corresponding Input.
Figure 19.34: Bit Categorization Circuit Selected Category Four for Corresponding Input.
that particular category. So, the other categories remain deactivated and only one category at a time is selected. Fig. 19.36 shows that, as the output Catg3 bit from the bit categorization circuit is zero for the corresponding input combination, the whole 3-bit category circuit is deactivated and no input passes through the circuit. The dotted line represents the off state and the straight
Figure 19.36: Deactivated Category Circuit as it is not the Target Category for Correspond-
ing Input.
line represents the on state. On the other hand, when the input combination is A = 0011 and B = 0011, which represents category 3, the bit categorization provides Catg3 as 1. Hence, the 3-bit category is activated and the corresponding required inputs are passed through the transistors and LUTs. Finally, the corresponding output of the input combination, P = 1001, is obtained from that category. The internal circuit of each category, as shown in Fig. 19.35, is the LUT implementation. Similarly, for all the other input combinations the corresponding category
Figure 19.37: Activated Category Circuit as it is the Target Category for Corresponding
Input.
is selected and activated. Finally, the output from the activated category is considered as
the final output which is shown in Fig. 19.37.
The LUT-based bit categorization circuit and the (1 × 1)-digit multiplier circuit can be constructed using homogeneous 6-input LUTs. The use of 6-input LUTs with the LUT merging technique reduces the required number of LUTs to a great extent compared to a heterogeneous-input LUT implementation. Though a 6-input LUT is more complex than smaller-input LUTs, the implementation of the LUT merging theorem (described in Chapter 16) ensures the effective use of the 6-input LUT by providing dual outputs. Moreover, the available FPGA slices consist of homogeneous-input LUTs. The homogeneous 6-input LUT implementation of the bit categorization circuit is shown in Fig. 19.38. Hence, a 6-input LUT can be used to implement T1 and T2 as dual outputs. Similarly, T3 and T4 can be implemented using a 6-input LUT as dual outputs, and the functions T5 and T6 can be implemented using a dual-output 6-input LUT. The functions T5, T6 and Catg0 depend on T1 and T2. Though the Catg0 function depends on only 4 input variables, it cannot be implemented as a dual output; as the other output functions depend on Catg0, a single LUT is used to implement Catg0. Furthermore, (Catg1, T7), (Catg2, Catg4), and (Catg3, Catg5) are each implemented using a dual-output 6-input LUT. Hence, the total circuit construction requires only seven 6-input LUTs.
Figure 19.40: Binary to BCD Converter Circuit for Decimal Digit Multiplier Circuit.
For the values for which repetition does not occur, the corresponding index of the X matrix is updated with 0.
For the first three values of the first column of the multiplicand matrix B (B11, B21, B31), the corresponding values of A and B are sent to the multiplier circuit. As a value in the first column has no earlier column to repeat, multiplication must be performed. In this circuit, the efficient LUT-based (m×n)-digit multiplier circuit is used for the multiplication operation. Then, while processing the second-column values of the B matrix, the X matrix is checked for repeated values. If the value of (X12, X22, X32) is one, it activates the transistors, since the value of X is provided as the gate input of the transistors. Hence, the activated transistor passes the value of the multiplier output that can be re-utilized. For example, if X12 is 1, it activates the first transistor and passes the value from the multiplier which has the input B11, so the multiplied value is re-utilized instead of incurring a re-calculation overhead. If the value of X12 is zero, it deactivates the first transistor and activates the next two transistors, which pass the input values from A and B to the multiplier. As there is no repetition, multiplication will occur.
Similarly, while processing the third-column values of the B matrix, the X matrix is checked for repeated values. If the value of (X13, X23, X33) is one, it activates the transistors, since the value of X is provided as the gate input of the transistors. Hence, the activated transistor passes the column address of the value that has been repeated. While checking repetition, the Z matrix is updated with the column location of the value that has been repeated. So, through the transistor, the address is passed as the selection bit of a 2-to-1 multiplexer. For example, if X13 is 1, it activates the first transistor and passes the value from the Z matrix to the multiplexer as its selection bit. If the selection bit is zero, the multiplexer passes the output from the first multiplier, which has B11 as input; this represents that the first value B11 is repeated in this position. If the selection bit is one, then the multiplexer provides the output from the first switching circuit, named out, which represents that B12 is repeated. So, the multiplied value is re-utilized instead of incurring a re-calculation overhead. If the value of X13 is zero, it deactivates the first transistor and activates the next two transistors, which pass the input values from A and B to the multiplier. As there is no repetition, multiplication will occur. Finally, the outputs are added using the LUT-based adder, as shown in Fig. 19.43, to provide the resultant matrix.
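The following is a hedged software analogue of the reuse scheme above: when a multiplicand value repeats an earlier value in the same row of B, the column of partial products computed for the first occurrence is reused instead of being recomputed. It models the data flow only, not the transistor/multiplexer selection network; the function name is an illustrative assumption.

```python
def matmul_with_reuse(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):                      # row i of B pairs with column i of A
        cache = {}                          # value of B[i][j] -> column of partial products
        for j in range(n):
            if B[i][j] in cache:            # repeated value: reuse (the X/Z bookkeeping)
                column = cache[B[i][j]]
            else:
                column = [A[r][i] * B[i][j] for r in range(n)]
                cache[B[i][j]] = column
            for r in range(n):
                C[r][j] += column[r]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 5], [6, 6]]                        # repeated values within each row of B
print(matmul_with_reuse(A, B))              # [[17, 17], [39, 39]]
```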
19.4 SUMMARY
The increased interest in FPGAs (Field Programmable Gate Arrays) for real-time applications, such as wireless communications, image processing and image reconstruction, medical imaging, network security, and signal processing, justifies the effort in the design of efficient and high-performance matrix multipliers. Since matrix multiplication is computationally intensive, it deserves considerable attention. Traditionally, the matrix multiplication operation is realized either as software running on fast processors or on dedicated hardware
The binary coded decimal (BCD) system is suitable for digital communication, and it can be designed with field programmable gate array (FPGA) technology, where the look-up table (LUT) is one of the major components of the FPGA. In this chapter, a low-power and area-efficient LUT-based BCD adder is introduced, constructed in three steps. First, a technique is introduced for BCD addition to obtain the correct BCD digit. Second, a new LUT controller circuit is presented, designed to select and send the Read/Write voltage to a memory cell for performing the Read or Write operation. Finally, a compact BCD adder is designed using the introduced LUT.
20.1 INTRODUCTION
Binary coded decimal (BCD) representation provides accurate precision, avoids the infinite (non-terminating) representation errors of binary fractions, allows conversion to a character form in linear [O(n)] time, and its addition-subtraction does not require rounding. Therefore, a faster circuit for the BCD addition method is of concern. Hence, a new look-up table (LUT)-based BCD addition algorithm is introduced, which requires fewer field programmable gate array (FPGA) components and less area, power, and delay. The given algorithm is based on a pre-processing mechanism. The advancement of FPGA technology has opened a new horizon of technological progress due to long-term availability, rapid prototyping capability, reliability and hardware parallelism. The cost of making incremental changes to FPGA designs is negligible compared to the large expense of re-spinning an application-specific integrated circuit. An FPGA has three main elements: LUTs, flip-flops and the routing matrix. First, the basic 2-input LUT is targeted, as it ultimately serves the improvement of 3-input, 4-input and larger-input LUTs. Then, a 6-input LUT architecture is also shown, as the BCD adder is designed using 6-input LUTs.
Example 20.1 Suppose the decimal values of the two operands A and B are 9 and 5, respectively. For the BCD addition of these operands, their BCD representations are taken. First, the LSBs of A and B are added together with the carry (Cin) from the previous BCD digit addition; if it is the first digit addition of the BCD addends, the value of Cin is zero. The obtained sum (S0) is the first bit of the output, and the carry is added to the three MSBs of B, providing a value b, which is 011.
Table 20.1: Truth Table of 3-bit Addition with Pre-processing and Addition of 3
Afterwards, add the three MSBs of A and b. As the result is 111, which is greater than 5, 3 is added to the sum to obtain the correct result. The resultant 4-bit sum represents the first digit of the BCD value, which is 4 in this case, and the carry is the next resultant digit, which is 1. The demonstration is provided in Fig. 20.1. The algorithm of the BCD addition method with the pre-processing technique is given in Algorithm 20.1. Next, a LUT is introduced to present a LUT-based BCD adder.
Figure 20.1: Demonstration of the BCD Addition Algorithm Exhibited in Example 20.1.
Algorithm 20.1: Algorithm of the BCD Addition Method with Pre-processing Technique
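Since Algorithm 20.1 is rendered as a figure in the printed book, a minimal Python sketch of the digit-wise BCD addition with pre-processing described above is given here for reference. The function names are illustrative, and the correction test is written as a check of the raw digit value exceeding 9, which is equivalent to the MSB-sum test used in the text.

```python
def bcd_digit_add(a, b, cin=0):
    """Add two 4-bit BCD digits with the pre-processing idea described above:
    the LSBs are added first, the carry is folded into the three MSBs of B,
    and a +3 correction on the MSB sum (equivalent to +6 on the digit) is
    applied when the raw digit value exceeds 9."""
    s0 = (a & 1) ^ (b & 1) ^ cin               # LSB sum bit S0
    c0 = ((a & 1) + (b & 1) + cin) >> 1        # carry out of the LSB stage
    b_msb = (b >> 1) + c0                      # pre-processed MSBs of B (value b in Example 20.1)
    t = (a >> 1) + b_msb                       # sum of the three MSBs
    if 2 * t + s0 > 9:                         # raw digit > 9 -> needs correction
        t += 3                                 # +3 on the MSBs is +6 on the digit
    digit = ((t << 1) | s0) & 0xF              # low four bits form the BCD digit
    cout = (t >> 3) & 1                        # bit 3 of t is the decimal carry
    return digit, cout

def bcd_add(x_digits, y_digits):
    """Multi-digit BCD addition; digit lists are least-significant digit first."""
    carry, out = 0, []
    for a, b in zip(x_digits, y_digits):
        d, carry = bcd_digit_add(a, b, carry)
        out.append(d)
    if carry:
        out.append(carry)
    return out

print(bcd_add([9], [5]))   # [4, 1] -> digit 4 with a carry of 1, i.e. decimal 14
```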
Among emerging non-volatile memories such as magnetoresistive random access memory (MRAM), nano random access memory (NRAM), conductive bridging random access memory (CBRAM) and phase change random access memory (PCRAM), the memristor provides more area and power efficiency. In this chapter, the controller part of the circuit is presented in a compact way with an optimal number of gates, which includes the selection of a memristor cell along with the passing of the Read/Write voltage to the cell. The internal memristance changes with the applied voltages for the corresponding Read/Write/Reprogram operation.
The Write voltage is considered as Data and the Read voltage is considered as Read
Pulse. As only one memristor will be selected at a time, only one Write voltage (Data) is
considered. It is never possible to run Read and Write operations simultaneously. Therefore,
the Ex-OR and corresponding AND gates are used to select only one operation at a time
to avoid this ambiguity. Once one operation is selected, either the Read or the Write voltage is passed from the AND gate or the transmission gate, respectively, to the OR gate. The output
of the OR gate is connected to the left of each memristor to propagate the operational voltage
either to Write 1/0 to store in the memory or to Read the corresponding memory unit when
the memristor is selected.
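As a behavioural illustration of the controller gating just described (the Ex-OR forbids a simultaneous Read and Write, the AND gates pass the chosen voltage, and the OR gate drives the selected cell line), a small Python sketch is given below; the signal names are assumptions made only for illustration.

```python
def lut_controller(read_req, write_req, data, read_pulse):
    """Behavioural model of the controller gating: the Ex-OR blocks a
    simultaneous Read and Write, the AND terms gate the chosen voltage,
    and the final OR drives the line of the selected memristor."""
    enable = read_req ^ write_req              # exactly one operation may be active
    write_v = enable & write_req & data        # Write voltage (Data) path
    read_v = enable & read_req & read_pulse    # Read voltage (Read Pulse) path
    return write_v | read_v                    # voltage propagated to the selected cell

# simultaneous requests are blocked; a lone Write passes the Data value
print(lut_controller(1, 1, 1, 1))  # 0
print(lut_controller(0, 1, 1, 0))  # 1
```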
The selection of the memory unit is performed depending on the two inputs of the LUT, A and B, as they refer to the corresponding memory addresses of the memory cells (memristors). Considering the addresses of the memristors to be 00, 01, 10 and 11, the addresses can be represented as A′B′, A′B, AB′, and AB respectively. A transistor is activated through input B; then input A is sent from that transistor to the next transistor to activate it. Two transistors, T9 and T10, are connected to the output lines O1 and O2 respectively. The output from M00 and M01 passes through O1, and the output from M10 and M11 passes through O2. Although two memristors are connected to a single output line, only one memristor is activated at a time, so only one value passes through the line. As the output line is connected to the drain of the transistors, the passed voltage does not affect the unselected memristor, because an unactivated transistor never passes voltage from drain to source. The transistors are activated by R only when a Read operation is being performed; in that case, the output voltage of the selected memristor is passed through the corresponding transistor to the output line. During a Write operation the transistors T9 and T10 are not activated, so no value is passed to the output multiplexer (MUX). The LUT is shown in Fig. 20.2 and the algorithm for its construction is given in Algorithm 20.2. A property supporting the generalization of the 2-input LUT circuit is also given in Property 20.2.1.
Reset is omitted, as it is never possible to select all of the memory cells to reset them altogether. Since Reset is nothing but the Write 0 operation, there is no difference between these two operations when performed on a single memory cell, which removes hardware complexity. Besides, instead of using the conventional WL (word line) and BL (bit line), the direct use of the LUT input with an inverter has reduced the controller circuitry overhead.
Similarly, a 6-input LUT can be designed using 2-input LUTs, as shown in Fig. 20.3. A 6-input LUT has 2^6 = 64 memory cells with a common controller circuit and two outputs, Omux1 and Omux2. To make the design area efficient, a three-dimensional layer structure has been used. A total of 64 memory cells have been arranged in four layers, where each layer has 16 memory cells. For a particular layer, a column is selected by inputs A and B using a selection circuit consisting of four transistors, each being activated by input B (when B is 1) and passing input A as the column selection. Besides, a row selection voltage is sent using the two inputs C and D in the same way. As there are four layers, a layer selection voltage is generated using inputs E and F to select a particular layer. With every memristor there are three transistors: the first one is activated by the layer selection voltage and passes the column selection voltage to the next transistor; the second transistor, when activated, passes the row selection voltage to the next transistor; and the third transistor, when activated, finally activates the corresponding memristor. The 6-input LUT requires a total of 64 memristors, and no additional memristors are required for a reference cell.
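A behavioural sketch of the 6-input LUT addressing scheme described above is shown below. The assignment of input pairs to column, row and layer selection follows the text; the bit ordering inside each pair and the cell indexing are assumptions made only to keep the model concrete.

```python
class SixInputLUT:
    """Behavioural model of the 6-input LUT organisation described above:
    64 cells arranged as four 16-cell layers, with a column picked by A and B,
    a row picked by C and D, and a layer picked by E and F."""
    def __init__(self, truth_table):
        assert len(truth_table) == 64
        self.layers = [[[truth_table[16 * l + 4 * r + c] for c in range(4)]
                        for r in range(4)] for l in range(4)]

    def read(self, a, b, c, d, e, f):
        col = (a << 1) | b        # column selection from inputs A and B
        row = (c << 1) | d        # row selection from inputs C and D
        layer = (e << 1) | f      # layer selection from inputs E and F
        return self.layers[layer][row][col]

# program the LUT with the parity of its six inputs and read one entry back
tt = [bin(i).count("1") & 1 for i in range(64)]
lut = SixInputLUT(tt)
print(lut.read(1, 0, 1, 1, 0, 1))  # 0 (four of the six inputs are 1)
```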
Property 20.2.1 An n-input LUT requires at least 2^(n−2) 2-input LUTs, where n ≥ 2.
Proof 20.1 The above statement is proved by mathematical induction.
Basis: The basis case holds for n = 2, as 2^(2−2) = 1.
Hypothesis: Assume that the statement holds for n = k. Therefore, a k-input LUT consists of 2^(k−2) 2-input LUTs.
Induction: Now, consider n = k + 1. A (k + 1)-input LUT can be built from two k-input LUTs whose outputs are selected by the extra input, so it requires 2 × 2^(k−2) = 2^(k+1−2) = 2^(k−1) 2-input LUTs, which completes the proof.
Table 20.2: Read and Write Scheme using the Introduced Approach
Algorithm 20.3: Algorithm for the Construction of the BCD Adder Circuit
Figure 20.4: The BCD Adder: (a) Block Diagram of 1-Digit BCD Adder, (b) 1-Digit BCD
Adder, (c) Block Diagram of the n-Digit BCD Adder.
The time complexity of the addition method is mathematically proven in Property 20.3.1.
Property 20.3.1 An n-digit BCD adder requires a time complexity of at least O(5n), where n is the number of data bits.
Proof: Suppose an n-digit BCD adder does not require a time complexity of at least O(5n). The critical path of the n-digit BCD adder contains n full adders, 2n half adders, n OR gates and 4n 6-input LUTs. Apart from the arrangement of the 6-input LUTs, the design has a serial architecture with a latency of O(4n), and the 6-input LUT stage contributes the remaining O(n).
Hence, the time complexity of the BCD adder is O(5n).
This contradicts the supposition. Hence, the supposition is false and Property 20.3.1 is true.
20.4 SUMMARY
The outstanding beneficial features and advancements have made today's world the era of the FPGA (Field Programmable Gate Array). The LUT (Look-Up Table), being the most important and complex element of an FPGA, is the main concern for its improvement. The 2-input LUT shows a prominent enhancement in terms of area and power. Besides, BCD (Binary Coded Decimal) addition, being the most basic arithmetic operation, is the main focus as an application of the LUT-based FPGA. The BCD adder is constructed with optimum area, power, and delay. These improvements in FPGA-based BCD addition will consequently influence the advancement of all other arithmetic operations as well as the computation and manipulation of decimal digits, as it is more convenient to convert from decimal to BCD than to binary. Moreover, it is useful for exact decimal calculations, which are often a requirement for financial applications, accountancy, etc. The "fixed pitch" format also makes operations such as multiplying or dividing by powers of 10 easier, makes it easy to find the nth digit in a particular number, and allows such arithmetic operations to be easily chunked into multiple threads for parallel processing.
REFERENCES
[1] D. Morteza and G. Jaberipur, “Low area/power decimal addition with carry-select
correction and carry-select sum-digits”, Integr. VLSI J., vol. 47, no. 4, pp. 443–451,
2014.
[2] S. Gao, D. A. Khalili and N. Chabini, “An improved BCD adder using 6-LUT FPGAs”,
IEEE Tenth Int. New Circuits and Systems Conf., 2012.
[3] F. D. Dinechin and A. Vázquez, “Multi-operand decimal adder trees for FPGAs”,
2010.
[4] M. Vasquez, G. Sutter, G. Bioul and J. P. Deschamps, “Decimal adders/subtractors in
FPGA: efficient 6-input LUT implementations”, Int. Conf. on Reconfigurable Com-
puting and FPGAs, 2009.
[5] G. Saeid, G. Jaberipur and R. H. Asl, “Efficient ASIC and FPGA implementation of
binary-coded decimal digit multipliers”, Circuits Syst. Signal Process., vol. 33, no.
12, pp. 3883–3899, 2014.
[6] L. O. Chua, “Memristor-the missing circuit element”, IEEE Trans. Circuit Theory,
vol. 18, no. 5, pp. 507–519, 1971.
[7] D. B. Strukov, G. S. Snider and D. R. Stewart, “The missing memristor found”, Nature,
vol. 453, no. 7191, pp. 80–83, 2008.
[8] A. F. Haider, T. N. Kumar and F. Lombardi, “A memristor-based LUT for FPGAs”,
Ninth IEEE Int. Conf. on Nano/Micro Engineered and Molecular Systems (NEMS),
2014.
[9] Y. Ho, G. M. Huang and P. Li, “Dynamical properties and design analysis for non-
volatile memristor memories”, Circuits and Systems I, IEEE Trans. on, vol. 58, no. 4,
pp. 724–736, 2011.
[10] N. Z. Haron and S. Hamdioui, “On defect oriented testing for hybrid CMOS/memristor
memory”, In Test Symposium (ATS), pp. 353–358, 2011.
[11] X. X. Dong, N. P. Jouppi and Y. Xie, “Design implications of memristor-based RRAM
cross-point structures”, Proc. Des. Autom. Test Eur., pp. 1–6, 2011.
[12] X. Yuan, “Modeling, architecture, and applications for emerging memory technolo-
gies”, IEEE Comput. Des. Test., vol. 28, no. 1, pp. 44–51, 2011.
[13] K. Sohrab, G. Rosendale and M. Manning, “A 3D stackable carbon nanotube-based
nonvolatile memory (NRAM)”, IEEE Proceedings of the European Solid-State Device
Research Conf., 2010.
[14] M. Thomas, M. Salinga, M. Kund and T. Kever, “Nonvolatile memory concepts based
on resistive switching in inorganic materials”, Adv. Eng. Mater., vol. 11, no. 4, pp.
235–240, 2009.
[15] Y. C. Chen, W. Zhang and H. Li, “A look up table design with 3D bipolar RRAMs”,
ASP-DAC, pp. 73–78, 2012.
[16] G. Bioul, M. Vazquez and J. P. Deschamps, “Decimal addition in FPGA”, Fifth
Southern Conf. on Programmable Logic, pp. 101–108, 2009.
[17] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET Circuits, Devices & Systems, vol. 10, no. 3, pp.
163–172, 2016.
CHAPTER 21
Generic Complex
Programmable Logic Device
Board
The goal of the design and development of the Generic Complex Programmable Logic Device (CPLD) Board is to reduce a product's overall design and development life-cycle time. The same board may be used in various system designs, since Programmable Logic Devices are extremely versatile and reconfigurable. These devices operate at very low voltages with fast speeds and consume very little power. As the device count at the system level decreases dramatically, these characteristics make PLDs (Programmable Logic Devices) more flexible and enhance product dependability to a large extent. For in-system programming, the board features a Joint Test Action Group (JTAG) interface. This makes the board more adaptable to design modifications, upgrades, and easy migration from one standard to the next. Implementations of the A5/1 algorithm, a seven-segment display driver, a binary counter, and LED (Light Emitting Diode) control logic, among other things, are demonstrated in this chapter.
21.1 INTRODUCTION
The widespread usage of programmable logic devices (PLDs) in a wide range of applications
such as telecom infrastructure, consumer electronics, industrial and medical industries has
resulted from the necessity to adapt to changing market requirements in a constrained
time to market window. Typical board tasks in these applications include power supply
sequencing, voltage and current monitoring, bus bridging, voltage level translation, interface
management, and temperature measurement. System designers are under constant pressure
to fulfill development deadlines, so they must execute ideas with the least amount of
work and risk while preserving maximum flexibility. Designers may reduce system cost,
save space, and maintain a high level of product diversity by utilizing a programmable-
based approach instead of many discrete devices or Application Specific Standard Products
(ASSPs).
The block diagram of the CPLD board is shown in Fig. 21.1. As indicated in the block
diagram it consists of the following parts:
1. DC-DC Converters
2. JTAG Interface
3. LED Interface
4. Clock Circuit
5. CPLD
6. 7-Segment Display
7. Input/output Connectors
21.2.5 CPLD
The chosen CPLD is an ALTERA EPM570T144C5 MAX II series device in a 144-pin TQFP package. Fig. 21.4 depicts the block diagram of this CPLD.
It has 570 logic elements (LEs) that are configured according to the design specifications. This device is more than adequate for medium-sized digital designs. Fig. 21.5 depicts the structure of the logic element (LE).
To incorporate custom logic, MAX II devices have a two-dimensional row- and column-based architecture. Signal interconnects between the logic array blocks (LABs) are provided by row and column interconnects. The logic array is made up of LABs, each having ten logic elements (LEs). An LE is a small logic unit that facilitates the implementation of user logic functions. LABs are organized into rows and columns across the device. Fast, fine-grained delays between LABs are provided via the MultiTrack interconnect. In comparison to globally routed interconnect architectures, this fast routing between LEs enables small time delays for additional layers of logic. The MAX II device I/O pins are fed by I/O elements (IOEs) located around the device's perimeter, positioned at the ends of the LAB rows and columns. A bidirectional I/O buffer with numerous advanced capabilities is included in each IOE. Schmitt trigger inputs and several single-ended standards, such as 66-MHz 32-bit PCI and LVTTL, are supported through the I/O pins. A global clock network is provided by MAX II devices. The global clock network is made up of four global clock lines that run throughout the device and provide clocks for all of its resources.
The complete implementation and the developed prototype are shown in Figs. 21.10 and 21.11.
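One of the demonstrations mentioned at the start of this chapter is a seven-segment display driver. A behavioural Python sketch of such a hexadecimal decoder is given below; the active-high a–g segment encoding is a common convention and may not match the polarity of the display actually fitted to the board.

```python
# Active-high segment patterns in the order (a, b, c, d, e, f, g) for hex digits.
SEGMENTS = {
    0x0: "1111110", 0x1: "0110000", 0x2: "1101101", 0x3: "1111001",
    0x4: "0110011", 0x5: "1011011", 0x6: "1011111", 0x7: "1110000",
    0x8: "1111111", 0x9: "1111011", 0xA: "1110111", 0xB: "0011111",
    0xC: "1001110", 0xD: "0111101", 0xE: "1001111", 0xF: "1000111",
}

def seven_segment(nibble):
    """Return the a-g drive pattern for a 4-bit input value."""
    return SEGMENTS[nibble & 0xF]

print(seven_segment(0b0101))   # '1011011' -> the digit 5
```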
21.4 APPLICATIONS
As stated in the preceding sections, this board is general in nature and can be used for a variety of purposes.
21.5 SUMMARY
The design and development of a CPLD (Complex Programmable Logic Device) board has been described in this chapter. This board is well structured and proven to be general in nature, and it may be utilized as reconfigurable hardware in many system designs in the fields of communication, medical electronics, industrial electronics, VLSI (Very Large Scale Integration), embedded circuits, and so on. It may also be utilized in educational institutions as a VHDL (VHSIC Hardware Description Language)/Verilog trainer kit. The board's size can be decreased even further if all SMD (Surface Mount Device) components are employed.
REFERENCES
[1] E. Koutroulis, A. Dollas and K. Kalaitzakis, “High-frequency pulse width modulation
implementation using FPGA and CPLD ICs”, Journal of Systems Architecture, vol.
52, pp. 332–344, 2006.
[2] B. S. Kariappa and M. U. Kumari, “FPGA Based Speed Control of AC Servomotor
Using Sinusoidal PWM”, International Journal of Computer Science and Network
Security, vol. 8, no. 10, 2008.
[3] N. Hediyal, “Key Management for Updating Crypto-keys over AIR”, International
Journal of Computer Science and Network Security, vol. 11, no. 1, 2011.
[4] A. Opara and D. Kania, “Decomposition-Based Logic Synthesis for PAL-Based
CPLDs”, Int. J. Appl. Math. Comput. Sci., vol. 20, no. 2, pp. 367–384, 2010.
[5] I. Rahaman, M. Rahaman, A. L. Haque and M. Rahaman, “Fully Parameterizable
FPGA Based Crypto-Accelerator”, World Academy of Science, Engineering and Tech-
nology, 2009.
[6] P. Malav, B. Patil and R. Henry, “Compact CPLD board Designing and Implemented
for Digital Clock”, International Journal of Computer Applications, vol. 3, no. 11,
2010.
[7] N. Hediyal, “Generic Complex Programmable Logic Device (CPLD) Board”, International
Journal of Computer Science and Information Technologies (IJCSIT), vol. 2, no. 5,
pp. 2004–2007, 2011.
CHAPTER 22
FPGA-Based Programmable
Logic Controller
Programmable Logic Controllers (PLCs) are more cost effective, compact, and easier to
operate than PC (Personal Computer)-based solutions for modest and stand-alone control
of a single process. Some research has been done on the implementation of a control program in an FPGA (Field Programmable Gate Array). However, the majority of it focused on strategies for converting functional-level control programs into HDL (Hardware Description Language) logic descriptions. As a result of these approaches, PLC users need design tools to translate, integrate, and implement logic circuits in an FPGA. These tools also require training for manufacturing plant engineers. This chapter proposes a method for implementing a general PLC inside an FPGA. Once the FPGA is configured, it functions as a PLC that can be embedded into devices, machines, and systems using suitable interfaces.
22.1 INTRODUCTION
Programmable Logic Controllers (PLCs), also referred to as programmable controllers, are
used in commercial and industrial applications.
As shown in Fig. 22.1, a PLC consists of input modules, a processor, and output
modules. An input accepts a wide range of digital or analog signals from various field
devices (sensors) and converts them to logic signals that the processor can use. Based on the program instructions in memory, the processor makes decisions and executes control instructions. The output modules convert the processor's control instructions into digital or analog signals that can be used to drive a variety of field devices (actuators). The desired instructions are entered using a programming device. As a result, PLCs can be
complicated, and they’re typically constructed on 486 and Pentium processors with a large
number of analog and digital I/Os. Occasionally, an application may require the usage
of a PLC with very restricted capabilities, but the pricing does not fit the budget. Thus,
PLCs are more cost-effective, smaller, and easier to operate than PC-based systems for tiny,
stand-alone control of a single process. PLCs, which can handle from 15 to 128 I/O points,
are commonly used by control engineers.
A PLC which is built using FPGA technology produces a superior solution with the
following benefits:
I. Flexibility: PLC designs may be easily upgraded by design engineers (PLC makers).
By modifying the Hardware Description Language (HDL) and setting the same FPGA
chip for the updated PLC design, some features or instructions might be added to the
current design.
II. Accuracy: Fast design allows engineers to include very time-critical tasks like limit
and proximity sensor detection and sensor health monitoring into hardware, resulting
in more accurate solutions.
III. Short Product Development Cycle: Design time is significantly decreased because
of the use of standard HDLs and automated design tools. Engineers may also experiment with different implementations because the control code runs directly in silicon.
IV. Low Cost and Compactness: Because of the aforesaid advantages, the designer may
fulfill market demands by satisfying consumer wants and improving the product’s
performance or usefulness. In comparison to other existing options, this results in
high performance, low cost, and compact designs.
I. Data memory is used to store data temporarily such as (a) Timer or counter preset
settings, (b) Arithmetic or logic execution results, (c) Input/output status, and so on.
II. The program memory used to store encoded ladder program instructions is known
as user memory.
III. Ladder instruction decoder and ladder instruction execution block make up the control
unit. It decodes and executes the ladder instruction, as well as providing user and
data memory control signals.
IV. User interface or ladder program encoding software that accepts user-supplied ladder
program instructions, debugs them, and encodes them in a usable format.
(i) Maximum “m” contacts/function blocks in series, including one coil (columns); and
(ii) Maximum “n” rungs in parallel (rows).
Thus the ladder program structure allows a ladder program of size [m, n]. The first element of each rung is encoded as “Start of rung”, and the last element of the ladder is encoded as “End of ladder”.
(i) Program Mode: In this mode, the ladder program is loaded into the program memory
of PLC.
(ii) Run/Execution: In this mode, the PLC executes the ladder logic cyclically.
Every ladder rung cycle, “p” rung instructions (contact/function block instructions) are
decoded and performed in parallel from left to right. The output of each rung instruction is
supplied as input for the following rung instruction, and parallel connections are assessed.
Each rung instruction takes two clock cycles to execute. The state of output is updated in
the output memory region at the conclusion of each rung cycle. The revised output values
are sent to the output module at the completion of the full ladder scan. As a result, when
the “End of ladder” instruction is executed, the output status is updated and the address
counter in program memory is reset for the next ladder scan.
The supported instruction set includes arithmetic instructions (Addition, Subtraction) and others (Compare, etc.). Also, the design was restricted to a maximum of 256 rungs, each of which may contain 7 elements in series; a maximum of 4 elements can be connected in parallel.
(b) Scan Time
The time needed for a complete I/O scan and execution is a key characteristic of a PLC. It depends on the number of input and output channels as well as the length of the ladder program. The PLC execution speed is determined by the processor's clock frequency.
The scan time decreases as the frequency increases. Because each rung execution needs
“2m” clock cycles, the suggested architecture may achieve very short scan times. Thus, if
a ladder program has “n” rungs, the PLC scan will take “2mn” clock cycles. Scan time
may be lowered even further by using FPGA designs that are quicker and more efficient, resulting in a faster PLC. The demonstrated design could achieve a 2.24-microsecond scan time at a 100 MHz clock for the maximum achievable ladder logic in the design.
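The scan-time relation above can be checked with a short calculation; the sketch below simply evaluates 2·m·n clock cycles at a given clock frequency. The parameter values in the example are hypothetical, although m = 7 series elements and n = 16 rungs happen to reproduce the 2.24-microsecond figure quoted in the text.

```python
def plc_scan_time(m, n, f_clk_hz):
    """Scan-time estimate from the relation above: each rung needs 2*m clock
    cycles, so an n-rung ladder takes 2*m*n cycles at the given clock."""
    return 2 * m * n / f_clk_hz

# hypothetical configuration: 7 series elements per rung, 16 rungs, 100 MHz clock
print(plc_scan_time(7, 16, 100e6))   # 2.24e-06 s
```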
(c) Memory
Memory is required by PLCs in order to store program and temporary data. Separate
memory chips are utilized in traditional PLCs for this purpose. The suggested architecture makes use of the memory built into the same FPGA chip, eliminating the additional circuitry and latency associated with read/write operations in traditional systems. As a result, the solution is both quicker and more compact. Block memory was utilized as user memory, while distributed memory was used as data memory in the exhibited system.
22.5 SUMMARY
An implementation method for PLC (Programmable Logic Controller) design has been described in this chapter. For an example application, the concept has been built and proven on a smaller scale. However, it may be further modified to incorporate a variety of useful instructions (PID, PWM, etc.), interfaces (RS232, SPI, USB), and network protocols in order to link it to a network. The architecture is also limited to digital I/O channels, although it may be expanded to include analog I/O channels. The solution described in this chapter is best suited for small-scale applications requiring a small number of instructions at a low cost, while offering excellent performance and a compact architecture.
REFERENCES
[1] John T. Welch, Joan, “A Direct Mapping FPGA Architecture for Industrial Process
Control Applications” IEEE Proceedings International Conference on Computer De-
sign, 17–20 Sept. 2000, pp 595–598.
[2] M. A. Adamski and J. L. Monteiro, "PLD implementation of logic controllers," in
Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE’95),
vol. 2, 1995, pp. 706–711.
[3] M. Adamski and J. L. Monteiro, "From interpreted Petri net specification to re-
programmable logic controller design," in Proceedings of the IEEE International
Symposium on Industrial Electronics (ISIE 2000), vol. 1, 2000, pp. 13–19.
[4] M. Wegrzyn, M. A. Adamski, and J. L. Monteiro, "The application of reconfigurable
logic to controller design," Control Engineering Practice, vol. 6, pp. 879–887, 1998.
[5] A. Wegrzyn and M. Wegrzyn, "Petri net-based specification, analysis and synthesis of
logic controllers," in Proceedings of the IEEE International Symposium on Industrial
Electronics (ISIE 2000), vol. 1, 2000, pp. 20–26.
[6] M. Ikeshita, Y Takeda, H. Murakoshi, N. Funakubo, and I.Miyazawa, "An applica-
tion of FPGA to high-speed programmable controller-development of the conversion
program from SFC to Verilog," in Proceedings of the 7th IEEE International Con-
ference on Emerging Technologies and Factory Automation (ETFA’99), vol. 2, 1999,
pp. 1386–1390.
[7] I. Miyazawa, T. Nagao, M. Fukagawa, Y. Ito, T. Mizuya, and T. Sekiguchi, "Implemen-
tation of ladder diagram for programmable controller using FPGA," in Proceedings
of the 7th IEEE International Conference on Emerging Technologies and Factory
Automation (ETFA’99), vol. 2, 1999, pp. 1381-1385.
[8] Shuichi Ichikawa, Masanori Akinaka, Ryo Ikeda, Hiroshi Yamamoto “Converting PLC
instruction sequence into logic circuit: A preliminary study” Industrial Electronics,
2006 IEEE International Symposium, July 2006, vol. 4, pp. 2930–2935.
[9] Dick Johnson, Research on Programmable Logic Controllers done in conjunction with
Reed Research, Control Engineering December 2007
(http://www.controleng.com/article/CA6510505.html)
Part 4
A digital computer stores data in terms of digits (numbers) and proceeds in discrete steps
from one state to the next. The states of a digital computer typically involve binary digits
which may take the form of the presence or absence of magnetic markers in a storage
medium, on-off switches or relays. In digital computers, even letters, words, and whole
texts are represented digitally. Digital logic is the basis of electronic systems, such as
computers and cell phones. Digital logic is rooted in binary code, a series of zeroes and
ones each having an opposite value. This system facilitates the design of electronic circuits
that convey information, including logic gates. Digital logic gate functions include AND,
OR, and NOT. The value system translates input signals into specific output. Digital logic
facilitates computing, robotics and other electronic applications.
Digital logic design is foundational to the fields of electrical engineering and com-
puter engineering. Digital logic designers build complex electronic components that use
both electrical and computational characteristics. These characteristics may involve power,
current, logical function, protocol, and user input. Digital logic design is used to develop
hardware such as circuit boards and microchip processors. This hardware processes user
input, system protocol and other data in computers, navigational systems, cell phones, or
other high-tech systems.
The combinational circuit consists of logic gates whose outputs at any time are deter-
mined directly from the present combination of input without any regard to the previous
input. A combinational circuit performs a specific information processing operation fully
specified logically by a set of Boolean functions. A combinatorial circuit is a generalized
gate. In general such a circuit has m inputs and n outputs. Such a circuit can always be con-
structed as n separate combinatorial circuits, each with exactly one output. For that reason,
some texts only discuss combinatorial circuits with exactly one output. In reality, however,
some important sharing of intermediate signals may take place if the entire n-output circuit
is constructed at once. Such sharing can significantly reduce the number of gates required
to build the circuit. When a combinational circuit is built from some kind of specification, the goal is always to make it as good as possible. The only problem is that the definition of
"as good as possible" may vary greatly. In some applications, one simply wants to minimize
the number of gates (or the number of transistors, really).
The implication is that combinational circuits have no memory. In order to build sophisticated digital logic circuits, including computers, a more powerful model is needed: circuits whose output depends upon both the input of the circuit and its previous state. In other words, circuits that have memory are needed. For a device to serve as a memory, it must have three characteristics: (1) the device must have two stable states, (2) there must be a way to read the state of the device, and (3) there must be a way to set the state at least once.
This part starts with designing divider circuits with parallel computation of quotients
and partial remainders which is given in Chapter 23. In this chapter, a heuristic function
is presented to determine the difference between the numbers of bits in the dividend and
divisor. The introduced divider circuit generates the partial remainder and quotient bits
simultaneously in each iteration which reduces the delay of the divider circuit significantly.
In addition, the division algorithm requires only two operations, addition and subtraction. A
systematic method for minimizing a TANT circuit and the heuristic algorithms for different
stages of the technique are provided in Chapter 24. Steps and algorithms are discussed
extensively in this chapter. The introduced method constructs an optimal TANT network
for a given single-output function. Chapter 25 presents an asymmetric high-radix signed-digit (AHSD) adder that performs addition on the basis of a neural network (NN) and also shows that the AHSD number system supports carry-free (CF) addition by using an NN. A novel NN design for a CF adder based on the AHSD4 number system is also presented in this chapter. Chapter 26 describes an integrated framework for SOC test
automation. In this chapter, an efficient algorithm has been introduced to construct wrappers
that reduce testing time for cores. Rectangle packing has been used to develop an integrated
scheduling algorithm that incorporates power constraints in the test schedule. In Chapter 27,
an approach is presented to design memristor based nonvolatile 6-T static random access
memory (SRAM). In addition to the SRAM integrated circuit, test structures are included to
help characterize the process and design. Then the memristor-based resistive random access
memory (MRRAM) is addressed which is similar to that of static random access memory
(SRAM) cell. In Chapter 28, a fault-tolerant approach to reliable microprocessor design
is presented. The approach preserves system performance while keeping area overheads
and power demands low. The checker is a fairly simple state machine that can be formally
verified, scaled in performance, and reused. Finally, in Chapter 29, some applications of VLSI circuits and embedded systems are discussed.
CHAPTER 23
Parallel Computation of
Quotients and Partial
Remainders to Design
Divider Circuits
Division is considered as the slowest and most difficult operation among four basic opera-
tions in microprocessors. This chapter presents an unprecedented divider circuit by using
a new division algorithm. A heuristic function is presented to determine the difference
between the numbers of bits in the dividend and divisor. This difference is used to calculate
the quotient bits and the partial remainder independently. Thus, the introduced divider
circuit generates the partial remainder and quotient bits simultaneously in each iteration
which reduces the delay of the divider circuit significantly. Moreover, the divider circuit
requires only two operations (addition and subtraction) in each iteration. The presented
divider circuit has been constructed in four steps. First, a parallel n-bit counter circuit
has been introduced. For a 4-input operand, the bit counter circuit achieves a significant
improvement in terms of delay. Second, a selection block is designed which consumes less
hardware complexity. Third, an efficient and compact ⌈log₂ n + 1⌉-to-n-bit converter circuit
has been presented. Fourth, a new n-bit comparator circuit has been shown to reduce the
area of the comparator circuit, where n is the number of input bits. For a 4-bit comparator
circuit, the comparator circuit gains a notable improvement in terms of area-delay product.
23.1 INTRODUCTION
Re-configurable computing has explored a new dimension of computing architecture since
its inception in 1960. Re-configurable computing is a computer architecture which combines
some of the flexibility of software with the high performance of hardware by processing with
a very flexible high speed computing fabrics like field-programmable gate arrays (FPGAs).
The principal difference when compared to using ordinary microprocessors is the ability to
make substantial changes to the data path itself in addition to the control flow. On the other
hand, the main difference with custom hardware, i.e., application-specific integrated circuits
(ASICs) is the possibility to adapt the hardware during run time by “loading” a new circuit
on the re-configurable fabric. Their functionality can be upgraded and repaired during
their operational life cycle and specialized to the particular instance of a task. Sometimes,
they are the only way to achieve the required real-time performance without fabricating
custom integrated circuits. The implementation of FPGA has been widely seen in many
applications including signal processing, cryptography, processing, scientific computation
and arithmetic computing.
Among the basic operations, division, being the slowest operation on a modern microprocessor, is a prerequisite for faster mathematical and computational operations in the processor. Moreover, in comparison with addition and multiplication, division is the least frequently used as well as the most difficult operation in processors. However, the performance of a computer will degrade if the division operation is ignored. In the early 1960s, Landauer's research showed that irreversible hardware computation, regardless of its realization technique, results in energy dissipation due to information loss. Each bit of information lost dissipates k·T·ln 2 joules of energy, where k is the Boltzmann constant and T is the absolute temperature. In 1973, Bennett showed that a circuit must be made using reversible logic gates to avoid this energy dissipation.
Fault tolerance is the property that enables a system to continue operating properly in the
event of the failure of (one or more faults within) some of its components. Usually FPGA
consists of an array of configurable logic block, interconnects and I/O blocks. FPGA can be
configured as needed for each application. The same semiconductor technology advances
that have brought processors to their performance limits have turned FPGAs from simple
logic to highly capable programmable fabrics. The most popular logic blocks are the Look-Up Table (LUT) and the Plessey logic block. With more inputs, a LUT can implement more logic using fewer logic blocks, which helps reduce the routing area. A 3- to 4-input LUT size results in better performance in area and delay. So, a generalized 4-input LUT-based logic block is
considered. In this chapter, a divergent approach for a divider circuit is presented by using
a new division algorithm.
Modern applications comprise several arithmetic operations, among them addition,
multiplication, division, and square root are the frequent ones. In recent researches, empha-
sis has been placed on designing ever-faster adders and multipliers, with division and square
root receiving less attention. The typical range for addition latency is two to four cycles,
and the range for multiplication is two to eight cycles. Most emphasis has been placed on
improving the performance of addition and multiplication. As the performance gap widens
between these two operations and division, the performance and throughput of various applications slowly degrade.
Fig. 23.1 shows the average frequency of different instructions such as division, multipli-
cation, addition, subtraction and square root operations relative to the total number of
operations. This figure shows that simply in terms of dynamic frequency, division and
square root seem to be relatively unimportant instructions, with about 3% of the dynamic
instruction count due to division and only 0.33% due to square root. The most common
instructions are multiply and add: multiplication accounts for 35% of the instructions, and addition for 55%, since the multiplication operation uses addition as an integral part of its implementation.
However, in terms of latency, division can play a much larger role. By assuming a
machine model of a scalar processor, where every division operation has a latency of 20
cycles and the adder and multiplier each have a three cycle latency, a distribution of the
execution time due to the hardware was formed, shown in Fig. 23.2. Here, the division
accounts for 40% of the latency, the addition accounts for 42%, and the multiplication
accounts for the remaining 18%. It is evident that the performance of division is very much
significant to the overall system performance.
Field Programmable Gate Arrays (FPGAs) have come a long way since their inception
as illustrated in Fig. 23.3. From their humble beginnings as containers for glue and control
logic, FPGAs have evolved into highly capable software coprocessors, and as platforms
for complete, single-chip embedded systems.
Figure 23.3: FPGA Devices have evolved to Become Highly Capable Computing Platforms.
The programming of software algorithms into FPGA hardware has traditionally required
specific knowledge of hardware design methods, including the use of hardware description
languages such as VHDL or Verilog. While these methods may be productive for hard-
ware designers, they are typically not suitable for embedded systems programmers, domain
scientists and higher level software programmers. Fortunately, software-to-hardware tools
now exist that allow software programmers to describe their algorithms using more familiar
methods and standard programming languages. For example, using a C-to-FPGA compiler
tool, an application and its key algorithms can be described in standard C with the addition
of relatively simple library functions to specify inter-process communications. The crit-
ical algorithms can then be compiled automatically into HDL representations which are
subsequently synthesized into lower level hardware targeting one or more FPGA devices.
While a certain level of FPGA knowledge and in-depth hardware understanding may still be
required to optimize the application for the highest possible performance, the formulation
of the algorithm, the initial testing and the prototype hardware generation can now be left
to a software programmer. Therefore, in this work, an initiative has been taken to innovate
a novel idea to reduce the delay of the division algorithm. We believe that an improved division algorithm will substantially affect the performance of various applications. More-
over, FPGAs have been chosen as a targeted device to explore the endless opportunity of
reconfigurable computing.
While the methodology for designing efficient high performance adders and multipliers
is well understood, the design of dividers still remains a serious design challenge, often
viewed as a “black-art” among system designers. Extensive literature exists describing the
theory of division, including subtractive methods such as non-restoring and SRT division. To achieve
good system performance, some form of hardware division is required. However, at very
low divider latencies, two problems arise. The area required increases exponentially or
cycle time becomes impractical. Dividers with lower latencies do not provide significant
system performance benefits, and their areas are too large to be justified. An alternative is to
provide an additional multiplier, dedicated for division. This can be an acceptable trade off if
a large quantity of area is available and maximum performance is desired for highly parallel
division/multiplication applications, such as graphics and 3D rendering applications. The
main disadvantage with functional iteration is the lack of remainder and the corresponding
difficulty in rounding.
Very high radix algorithms are an attractive means of achieving low latency while
also providing a true remainder. The only commercial implementation of a very high
radix algorithm is the Cyrix short-reciprocal unit. This implementation makes efficient
use of a single rectangular multiply/add unit to achieve lower latency than most SRT
implementations while still providing a remainder. Further reductions in latency could be
possible by using a full-width multiplier, as in the rounding and pre-scaling algorithm.
Division algorithms can be divided into five classes: digit recurrence, functional iteration,
very high radix, table look-up, and variable latency. The basis for these classes is the obvious
differences in the hardware operations used in their implementations, such as multiplication,
subtraction, and table look-up. Many practical division algorithms are not pure forms of
a particular class, but rather are combinations of multiple classes. For example, a high
performance algorithm may use a table look-up to gain an accurate initial approximation to
the reciprocal, use a functional iteration algorithm to converge quadratically to the quotient,
and complete in variable time using a variable latency technique. Therefore, it has been
a major challenge to find an acceptable trade-off between area and delay of the divider
circuit. Moreover, the longevity of a circuit largely depends on the power dissipation.
Hence, designing an optimized and compact divider circuit is a prime concern.
The main focuses of this work are presented below:
1. Propagation delay of the divider circuit can be optimized if there is a new method to
find a next partial remainder quickly or if two major tasks, i.e., finding each quotient
bit and calculating next remainder can be done simultaneously.
2. The total delay, area and power consumption can be minimized if there is a novel
approach to reduce the number of blocks or the number of bits handled by each
iteration in the non-restoring division.
3. The on-the-fly conversion that is required to produce the correct quotient bits in non-restoring division can be omitted to improve the overall performance of the divider circuit.
The main contributions of this work are as follows:
1. A new division algorithm for the divider circuit is introduced, which generates the partial remainder and quotient bits independently.
2. A parallel bit counter has been presented with a minimum depth and hardware
complexities.
3. A compact and efficient converter and comparator circuit have been elucidated which
requires minimum area and delay.
4. A cost-efficient design of both Application Specific Integrated Circuit (ASIC)- and Look-Up Table (LUT)-based divider circuits has been elucidated, requiring an optimal amount of area and delay and an optimal number of LUTs, slices, and flip-flops.
5. An improved design of a Reversible Fault Tolerant (RFT) D-latch and master-slave flip-flop, and of a LUT-based configurable logic block (CLB) of an FPGA, is presented, targeting a lower number of gates, garbage outputs, and unit delay by using two reversible fault tolerant gates.
A shift register is usually built from a chain of data latches connected in a cascaded configuration. The number of individual data latches required to make up a single shift register device is usually determined by the number of bits to be stored, with the most common being 8 bits (one byte) wide, constructed from eight individual data latches.
Shift Registers are used for data storage or for the movement of data and are therefore
commonly used inside calculators or computers to store data such as two binary numbers
before they are added together, or to convert the data from either a serial to parallel or
parallel to serial format. The individual data latches that make up a single shift register are
all driven by a common clock (Clk) signal making them synchronous devices. Shift register
ICs are generally provided with a clear or reset connection so that they can be “SET” or
“RESET” as required. Generally, shift registers operate in one of the four different modes
with the basic movement of data through a shift register being:
1. Serial-in to Parallel-out (SIPO) – the register is loaded with serial data, one bit at a
time, with the stored data being available at the output in parallel form.
2. Serial-in to Serial-out (SISO) – the data is shifted serially “IN” and “OUT” of the
register, one bit at a time in either a left or right direction under clock control.
3. Parallel-in to Serial-out (PISO) – the parallel data is loaded into the register simul-
taneously and is shifted out of the register serially one bit at a time under clock
control.
4. Parallel-in to Parallel-out (PIPO) – the parallel data is loaded simultaneously into the
register, and transferred together to their respective outputs by the same clock pulse.
The effect of data movement from left to right through a shift register is presented
graphically as in Fig. 23.5.
The directional movement of the data through a shift register can be either to the left (left shifting), to the right (right shifting), left-in but right-out (rotation), or both left and right shifting within the same register, thereby making it bidirectional.
Figure 23.5: Data Movement from Left to Right through a Shift Register.
The effect of each clock pulse is to shift the data contents of each stage one place to
the right, and this is shown in the following table until the complete data value of 0-0-0-1
is stored in the register. This data value can now be read directly from the outputs QA to QD. Then the data has been converted from a serial data input signal to a parallel data output. The truth table in Table 23.1 and its waveforms in Fig. 23.7 show the propagation
of the logic “1” through the register from left to right.
Clock Pulse No.   QA   QB   QC   QD
0                 0    0    0    0
1                 1    0    0    0
2                 0    1    0    0
3                 0    0    1    0
4                 0    0    0    1
5                 0    0    0    0
Figure 23.7: Timing Diagram for a 4-bit Serial-in to Parallel-out Shift Register.
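The shifting behaviour summarised in Table 23.1 can be reproduced with a few lines of Python; the sketch below is a purely behavioural model of a serial-in to parallel-out register and is not tied to any particular flip-flop implementation.

```python
def sipo_shift(serial_bits, width=4):
    """Clock the serial input bits in one at a time and record the register
    contents (QA, QB, QC, QD) after every clock pulse."""
    reg = [0] * width
    history = [tuple(reg)]
    for bit in serial_bits:
        reg = [bit] + reg[:-1]        # each pulse shifts the contents one place right
        history.append(tuple(reg))
    return history

# a single logic '1' followed by zeros ripples from QA to QD, as in Table 23.1
for pulse, state in enumerate(sipo_shift([1, 0, 0, 0, 0])):
    print(pulse, state)
```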
In the serial-in to serial-out configuration, the data is shifted straight through the register and out of the other end. Since there is only one output, the DATA leaves the shift register one bit at a time in a serial pattern, hence the name Serial-in to Serial-out Shift Register or SISO.
The SISO shift register is one of the simplest of the four configurations as it has only
three connections, the serial input (SI) which determines what enters the left hand flip-flop,
the serial output (SO) which is taken from the output of the right hand flip-flop and the
sequencing clock signal (Clk). The logic circuit diagram shown in Fig. 23.8 is a generalized
serial-in serial-out shift register.
This data provides one bit output at a time on each clock cycle in a serial format. It is
important to note that with this type of data register a clock pulse is not required to parallel
load the register as it is already present, but four clock pulses are required to unload the
data.
The PIPO shift register is the simplest of the four configurations as it has only three
connections, the parallel input (PI) which determines what enters the flip-flop, the parallel
output (PO) and the sequencing clock signal (Clk). Similar to the Serial-in to Serial-out
shift register, this type of register also acts as a temporary storage device or as a time delay
device, with the amount of time delay being varied by the frequency of the clock pulses.
Inputs    Outputs
A  B      X  Y  Z
0  0      0  1  0
0  1      0  0  1
1  0      1  0  0
1  1      0  1  0
23.2.4 Comparator
A comparator is a logic circuit that compares the magnitudes of A and B and then determines the result among A > B, A < B and A = B. When the two numbers in the comparator circuit are 1-bit numbers, each result is a single bit, 0 or 1. So, the circuit is called a 1-bit magnitude comparator, which is the basis of the comparison of two n-bit numbers. The truth table of the 1-bit conventional comparator is listed in Table 23.2. From the truth table of the conventional comparator in Table 23.2, the logical expressions of the 1-bit comparator are X = A·B′, Y = (A ⊕ B)′, and Z = A′·B.
The waveform of a 1-bit comparator circuit is demonstrated in Fig. 23.11 and a circuit diagram of a 4-bit comparator circuit is exhibited in Fig. 23.12.
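A behavioural sketch of how 1-bit comparisons cascade into an n-bit magnitude comparator is given below; the MSB-first scan is one common way of combining the 1-bit results and is used here only for illustration.

```python
def compare_1bit(a, b):
    # X: a > b, Y: a = b, Z: a < b, matching the 1-bit truth table above
    return (a & (1 - b), 1 - (a ^ b), (1 - a) & b)

def compare_nbit(a_bits, b_bits):
    """Scan from the most significant bit; the first unequal bit decides."""
    for a, b in zip(a_bits, b_bits):          # bit lists are MSB first
        x, y, z = compare_1bit(a, b)
        if not y:
            return (x, 0, z)
    return (0, 1, 0)                          # all bits equal -> A = B

print(compare_nbit([1, 0, 1, 1], [1, 0, 0, 1]))   # A=1011 > B=1001 -> (1, 0, 0)
```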
Input            Output
Cin  A  B        S  C
0    0  0        0  0
1    0  0        1  0
0    1  0        1  0
1    1  0        0  1
0    0  1        1  0
1    0  1        0  1
0    1  1        0  1
1    1  1        1  1
23.2.5 Adder
The full-adder circuit adds three one-bit binary numbers (Cin, A, B) and outputs two one-bit binary numbers, a sum (S) and a carry (C); the truth table is given in Table 23.3. The full-adder is usually a component in a cascade of adders, which add 8-, 16-, 32-bit, etc. binary numbers. The carry input of a full-adder circuit comes from the carry output of the circuit "above" it in the cascade, and its carry output is fed to the full adder "below" it in the cascade.
The equation for the Sum (S) is:
S = A ⊕ B ⊕ Cin     (23.4)
and the equation for the Carry (C) is:
C = A·B + Cin·(A ⊕ B)     (23.5)
The waveform of a 1-bit adder circuit is demonstrated in Fig. 23.13 and a circuit diagram of a 4-bit adder circuit is exhibited in Fig. 23.14.
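The full-adder equations above extend directly to a ripple-carry cascade. A small behavioural Python sketch (bit lists are least-significant bit first, purely for illustration) is:

```python
def full_adder(a, b, cin):
    s = a ^ b ^ cin                    # S = A xor B xor Cin
    c = (a & b) | (cin & (a ^ b))      # carry out of this stage
    return s, c

def ripple_adder(a_bits, b_bits):
    """Cascade of full adders; each stage's carry feeds the next stage."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

print(ripple_adder([1, 0, 0, 1], [1, 1, 0, 0]))   # 9 + 3 -> ([0, 0, 1, 1], 0), i.e. 12
```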
23.2.6 Subtractor
Unlike the binary adder, which produces a SUM and a CARRY bit when two binary numbers are added together, the binary subtractor produces a DIFFERENCE, D, by using a BORROW bit, B, from the previous column. Obviously, the operation of subtraction is the opposite of addition. The truth table of the 1-bit subtractor is given in Table 23.4.
Binary Subtraction can take many forms but the rules for subtraction are the same
whichever process you use. As binary notation only has two digits, subtracting a “0” from
Input             Output
Bin  Y  X         Diff.  Bout
0    0  0         0      0
0    0  1         1      0
0    1  0         1      1
0    1  1         0      0
1    0  0         1      1
1    0  1         0      0
1    1  0         0      1
1    1  1         1      1
a “0” or a “1” leaves the result unchanged as 0 – 0 = 0 and 1 – 0 = 1. Subtracting a “1” from
a “1” results in a “0”, but subtracting a 1 from a 0 requires a borrow. In other words 0 – 1
requires a borrow.
For the DIFFERENCE (D) bit:
D = (X ⊕ Y) ⊕ Bin     (23.6)
and for the BORROW (Bout) bit:
Bout = X′·Y + Bin·(X ⊕ Y)′     (23.7)
The waveform of a 4-bit subtractor circuit is demonstrated in Fig. 23.15 and a circuit diagram of a 4-bit subtractor circuit is exhibited in Fig. 23.16.
In the binary subtractor, n 1-bit full subtractors are connected or cascaded together to subtract two parallel n-bit numbers from each other, for example two 4-bit binary numbers. As stated before, the only difference between a full adder and a full subtractor is the inversion of one of the inputs. So, by using an n-bit adder and n inverters (NOT gates), the process of subtraction becomes an addition: two's complement notation is applied to all the bits of the subtrahend and the carry input of the least significant bit is set to logic 1 (HIGH), as shown in Fig. 23.16.
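The adder-plus-inverters arrangement described above (two's complement of the subtrahend with the carry-in of the least significant bit set to 1) can be modelled in a few lines; the 4-bit width below is just an example value.

```python
def subtract_via_adder(a, b, width=4):
    """n-bit subtraction done as addition of the two's complement: the
    subtrahend is inverted and the carry-in of the LSB is set to 1."""
    mask = (1 << width) - 1
    total = a + ((~b) & mask) + 1                 # A + B' + 1
    return total & mask, (total >> width) & 1     # (difference, final carry-out)

print(subtract_via_adder(0b1001, 0b0011))   # (6, 1) -> 9 - 3 = 6, no borrow
```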
Example 23.2 When implementing any logic function, the truth table of that logic is mapped to the memory cells of the LUT. Suppose Equation 23.8 is to be implemented, where '|' represents the logical OR operation. Table 23.5 represents the truth table of the function. Fig. 23.17 shows the gate representation and the LUT representation of the logic function. The output is generated for the corresponding input combination; for example, for the input combination 1 and 0, the output will be 1.
A  B    Out
0  0    0
0  1    1
1  0    1
1  1    1
There is significant research on improving the LUT to reduce hardware complexity and read/write time. The circuit diagram of a 2-input LUT is given in Fig. 23.18.
f = (A.B)|(A ⊕ B) (23.8)
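A direct way to see the mapping in Example 23.2 is to store the truth table of Equation 23.8 as the contents of a 2-input LUT and address it with A and B; a minimal Python sketch is:

```python
# Truth table of Eq. (23.8), f = (A.B) | (A xor B), stored as LUT cell contents.
truth_table = {(a, b): (a & b) | (a ^ b) for a in (0, 1) for b in (0, 1)}

def lut_read(a, b):
    return truth_table[(a, b)]      # the inputs A, B act as the memory address

print(lut_read(1, 0))               # 1, as listed in Table 23.5
```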
Figure 23.21: Block Diagram of (a) Fredkin Gate, (b) Feynman Double Gate.
Property 23.3.1 states the required number of iterations of the introduced division approach, and it is proved below by mathematical induction.
Basis: The basis case holds when the numbers of bits in the divisor and the dividend are equal, that is, n = m and (n − m + 1) = 1.
Hypothesis: Assume that the statement holds for n = k. So, a k-bit dividend requires (k − m + 1) iterations.
Induction: Now, considering n = k + 1, a (k + 1)-bit dividend requires (k + 1 − m + 1) = (k − m + 2) iterations. Reducing the number of bits in the dividend by one produces n = k; then a k-bit dividend requires (k − m + 2 − 1) = (k − m + 1) iterations, which agrees with the hypothesis. So, the statement holds for n = k + 1, and this completes the proof.
Therefore, for n ≥ m, the introduced division algorithm requires at most (n − m + 1) iterations, where n is the number of bits in the dividend and m is the number of bits in the divisor.
Example 23.3 For n = 6 and m = 5, the algorithm performs the division operation in
(6 − 5 + 1) = 2 iterations which has been also illustrated in Fig. 23.23.
The purpose of using the heuristic function is to generate the quotient bits quickly. Let the bit difference between the divisor and the dividend be "diff", where 0 ≤ diff ≤ n − m. Assume the quotient is expressed as q. Therefore, the quotient group for the i-th iteration becomes q_diff q_(diff−1) q_(diff−2) ··· q_0. It is evident from Algorithm 23.1 that q_diff = 1 and the rest of the bits q_(diff−1) q_(diff−2) ··· q_0 will be 0. The aforementioned step is considered as the second step, i.e., calculation of the quotient bits. The third step is the
subtraction process. This subtraction process yields a partial remainder which is required
for the generation of next group of quotient bits. There is a subtle difference between the
subtraction process involved in the conventional division approaches and the introduced
division algorithm. The division algorithm uses first (m + 1)-number of bit from the n-bit
of dividend. This results in avoiding of producing a negative result (or partial remainder)
since X = m+1 j=1 X j × 2
j−1 is greater than Y = Ín Y × 2k−1 . Thus the introduced division
Í
k=1 j
algorithm skips the restoring step of the conventional division approaches. Moreover, the
division algorithm always produces the appropriate next partial remainder by appending
unused bits from the dividend to the current partial remainder. A step by step demonstration
of the division algorithm is given in Example 23.4 for verification.
Example 23.4 Suppose, the divisor is 1100110 and the dividend is 101101110101 in
binary notation.
In the first iteration, the introduced division algorithm first calculates the bit difference between the divisor and the dividend. Here, the numbers of bits are 7 and 12 for the divisor and the dividend respectively. So, diff1 = (12 − 7) = 5. Hence, (5 − 1) = 4 bits of 0 will be appended at the LSB position of the quotient. Therefore, the first quotient group q1 will
consist of 5 bits as 10000. Since the divisor is 7 bits long, it will subtract itself from the
first (7 + 1) = 8 bits of the dividend. So the first 8 bits from the dividend is 10110111 and
the subtraction process involves subtraction as (10110111 − 1100110) = 1010001. The rest
of the bits which were unused in the dividend i.e., “0101” will append with 1010001 which
produces the next partial remainder 10100010101.
In the second iteration, the division algorithm first calculates the bit difference between the divisor and the current partial remainder 10100010101. Here, the number of bits is 7 for the divisor and 11 for the current partial remainder. So, diff2 = (11 − 7) = 4. Hence, (4 − 1) = 3 bits of 0 will be appended at the LSB position of the quotient. Therefore, the second quotient group q2 will consist of 4 bits as 1000. Now, the previous q1 will be added to q2 to form the current quotient Q as (10000 + 1000) = 11000. Since the divisor is 7 bits long, it will be subtracted from the first (7 + 1) = 8 bits of the current partial remainder. So the first 8 bits of the current partial remainder are 10100010 and the subtraction is (10100010 − 1100110) = 111100. The rest of the bits which were unused in the dividend, i.e., "101", will be appended to 111100, which produces the next partial remainder 111100101.
In the third iteration, the division algorithm first calculates the bit difference between the divisor and the current partial remainder 111100101. Here, the number of bits is 7 for the divisor and 9 for the current partial remainder. So, diff3 = (9 − 7) = 2. Hence, (2 − 1) = 1 bit of 0 will be appended at the LSB position of the quotient. Therefore, the third quotient group q3 will consist of 2 bits as 10. Now, the previous Q will be added to q3 to form the current quotient Q as (11000 + 10) = 11010. Since the divisor is 7 bits long, it will be subtracted from the first (7 + 1) = 8 bits of the current partial remainder. So the first 8 bits of the current partial remainder are 11110010 and the subtraction is (11110010 − 1100110) = 10001100. The rest of the bits which were unused in the dividend, i.e., "1", will be appended to 10001100, which produces the next partial remainder 100011001.
In the fourth iteration, the division algorithm first calculates the bit difference between the divisor and the current partial remainder 100011001. Here, the number of bits is 7 for the divisor and 9 for the current partial remainder. So, diff4 = (9 − 7) = 2. Hence, (2 − 1) = 1 bit of 0 will be appended at the LSB position of the quotient. Therefore, the fourth quotient group q4 will consist of 2 bits as 10. Now, the previous Q will be added to q4 to form the current quotient Q as (11010 + 10) = 11100. Since the divisor is 7 bits long, it will be subtracted from the first (7 + 1) = 8 bits of the current partial remainder. So the first 8 bits of the current partial remainder are 10001100 and the subtraction is (10001100 − 1100110) = 100110. The rest of the bits which were unused in the dividend, i.e., "1", will be appended to 100110, which produces the next partial remainder 1001101. Since the number of bits in the partial remainder is 7 and its value is less than the divisor, the remainder is 1001101 and the quotient is 11100.
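To make the flow of Example 23.4 easier to follow, a minimal Python sketch of the bit-difference procedure is given below. It models only the arithmetic behaviour described above (a quotient group of 2^(diff−1), subtraction of the divisor from the upper (m + 1) bits, and appending of the unused bits), not the hardware realization; the function name parallel_divide is only illustrative.

```python
def parallel_divide(dividend: int, divisor: int):
    """Behavioural sketch of the bit-difference division described above."""
    assert divisor > 0
    quotient, pr = 0, dividend            # pr: partial remainder
    while pr >= divisor:
        diff = pr.bit_length() - divisor.bit_length()
        shift = max(diff - 1, 0)          # quotient group: 1 followed by (diff - 1) zeros
        quotient += 1 << shift
        pr -= divisor << shift            # subtract divisor from the upper (m + 1) bits
    return quotient, pr

# Example 23.4: dividend 101101110101 (2933), divisor 1100110 (102)
q, r = parallel_divide(0b101101110101, 0b1100110)
assert (q, r) == (0b11100, 0b1001101)     # quotient 11100, remainder 1001101
```

Running the sketch reproduces the four iterations of Example 23.4 and stays within the (n − m + 1) iteration bound of Property 23.3.1.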
a parallel bit counter circuit is presented at first. Then, a fast switching circuit is presented.
Finally, the divider circuit is shown along with necessary figures and examples.
a = Σ_{j=1}^{n} a_j × 2^{j−1}    (23.9)
Therefore, an n-bit binary operand a will produce another binary operand b composed of ⌈log₂ n + 1⌉ bits by following Equation 23.9. The verification is shown for a 3-bit binary operand a (a2 a1 a0) in Table 23.6. For example, in row 4 of Table 23.6, the binary operand a is composed as a2 = 0, a1 = 1, and a0 = 1. Therefore, the output operand b is b1 = 1 and b0 = 0. In other words, it can be said that the bit counter determines the number of bits present in any binary operand. Assume that the input operand a is stored in an n-bit register, where n is the number of bits in a. Therefore, an n-bit register is composed of n D flip-flops. If no data is present at the "Data" pin of a D flip-flop or latch, both of its output pins remain in an invalid state. The scenario has been demonstrated in Fig. 23.24.
However, if data is present at the "Data" pin of the D flip-flop or latch, either of the output pins Q or Q′ will produce "1", as shown in Fig. 23.25. Fig. 23.25 demonstrates the scenario with a schematic diagram. The aforementioned property is used to construct the parallel bit counter, since the bits which occupy the valid positions (from the most significant bit to the least significant bit) in the input operand a are counted as present whether their value is "0" or "1". For example, in row 7 of Table 23.6, a0 = 0, but the output still counts the presence of all the bits. Therefore, the output operand considers an input bit present if a bit occupies a valid position of the input operand, whether the bit value is "0" or "1". By valid position, it is meant the bit positions from the most significant bit down to the least significant bit.
Figure 23.25: Demonstration of the Presence of Input Bit in D Flip-Flop and Latches.
Now, a 2-input OR gate is used to always produce a "1" depending on the presence of the bits. The output Di, where i is the position of the respective bit, follows:
D_i = Q_i + Q_i′    (23.10)
Fig. 23.26 demonstrates the circuit realization of Equation 23.10 with a schematic diagram.
The least significant bit of the output operand b0 is defined by Equation 23.11 as follows:
Then, the second least significant bit of the output operand b1 is formulated by Equation 23.12 as follows:
The most significant bit of the output operand b2 is defined by Equation 23.13 as follows:
b2 = a3 (23.13)
The circuit realization of the 7-bit counter has been demonstrated in Fig. 23.27. In this figure, in6, in5, in4, in3, in2, in1, and in0 indicate the corresponding input bits; the red color indicates the presence of a value (either 0 or 1), whereas the white color indicates the absence of a value. In Fig. 23.27, all the input bits are present, indicated in red, and thus all the output bits are present, exactly as shown in Equations 23.11, 23.12, and 23.13.
Example 23.5 Consider a binary operand a with the value (10011)2 . So, the value (10011)2
will be passed through the circuit of Fig. 23.26 and the following outputs are obtained as
follows:
D6 = 0, D5 = 0, D4 = 1, D3 = 1, D2 = 1, D1 = 1, and D0 = 1.
Now, the least significant bit of the output operand b0, which is defined by Equation 23.11, is calculated as follows:
b0 = (1 ⊕ 1) ⊕ (1 ⊕ 1) ⊕ 1 = 1
Then, the second least significant bit of the output operand b1, which is defined by Equation 23.12, is calculated as follows:
b1 = (1 ⊕ 1) ⊕ 0 = 0
Finally, the most significant bit of the output operand b2, which is defined by Equation 23.13, is calculated as follows:
b2 = 1
The data flow has been demonstrated in Fig. 23.28. In the figure, red color indicates the
“1" value, whereas the white color indicates the “0" value.
Figure 23.28: Data Flow of the 7-bit Counter for Example 23.5.
Now, consider an n-bit input operand a which can be written as a_n a_{n−1} a_{n−2} . . . a_0. To count the number of bits in the input operand a, the output operand is a ⌈log₂ n + 1⌉-bit operand b which can be expressed as b_{⌈log₂ n+1⌉} b_{⌈log₂ n+1⌉−1} b_{⌈log₂ n+1⌉−2} . . . b_0. The least significant bit of the output operand b can be obtained by applying the exclusive-OR operation between each pair of bits of the input operand. The next bit of the output operand b can be obtained by applying the exclusive-OR operation between alternate bits of the input operand. Then the second next least significant bit of the output operand b can be obtained by applying the exclusive-OR operation between double-alternate bits of the input operand. This process continues until the input operand reaches half of its bit positions. Algorithm 23.2 shows the algorithm for the n-bit counter circuit.
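As a quick sanity check of the 7-bit case, the following Python sketch evaluates the XOR construction over the presence signals D_i (modelled here simply from the operand's bit length rather than from flip-flop outputs) and confirms that the three outputs encode the number of valid bit positions. It is only a behavioural model of the equations, not of the flip-flop/OR-gate circuitry.

```python
def bit_counter_7(a: int) -> int:
    """Behavioural sketch of the 7-bit counter: b0 is the XOR of all presence
    signals, b1 the XOR of alternate ones, and b2 a single presence signal."""
    k = a.bit_length()                          # number of valid (occupied) positions
    D = [1 if i < k else 0 for i in range(7)]   # D_i = 1 for every valid position
    b0 = D[0] ^ D[1] ^ D[2] ^ D[3] ^ D[4] ^ D[5] ^ D[6]
    b1 = D[1] ^ D[3] ^ D[5]
    b2 = D[3]
    return (b2 << 2) | (b1 << 1) | b0

# Example 23.5: a = (10011)2 occupies 5 positions, so b2 b1 b0 = 101
assert bit_counter_7(0b10011) == 5
assert all(bit_counter_7(a) == a.bit_length() for a in range(1, 128))
```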
To prove the correctness of Algorithm 23.2, let us consider n = 15, i.e., the input operand is 15-bit. Now the output operands become as follows in Equations 23.14, 23.15, 23.16, and 23.17:
b0 = (a0 ⊕a1 )⊕(a2 ⊕a3 )⊕(a4 ⊕a5 )⊕(a6 ⊕a7 )⊕(a8 ⊕a9 )⊕(a10 ⊕a11 )⊕(a12 ⊕a13 )⊕a14 (23.14)
b3 = a7 (23.17)
Fig. 23.29 shows the circuit realization of the 15-bit counter. In this figure, “in” rep-
resents the input operand whereas the “out” represents the output operand. The circuit
realization has been performed by following the above equations.
the circuit architecture of the n-bit comparator circuit. This circuit has been constructed by using the algorithm depicted in Algorithm 23.3. One of the outputs of the comparator circuit, B greater than A, has been generated by using the carry bit of the last full adder circuit, whereas the output A equal B of the comparator has been calculated by the product of the sum bits of each full adder circuit. Finally, the other output has been generated by using the negation of both previous outputs.
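A behavioural sketch of this two-path idea is given below; it assumes that the full-adder chain effectively adds B to the one's complement of A (an assumption, since the exact wiring of Fig. 23.30 is not reproduced here), so that the final carry gives "B greater than A", the AND of the sum bits gives "A equal B", and the third output is derived from the other two.

```python
def two_path_compare(a: int, b: int, n: int):
    """Sketch: B + ~A over n bits; carry-out -> B > A, all sum bits 1 -> A == B,
    and the third output is the AND of the negations of the first two."""
    mask = (1 << n) - 1
    total = b + (~a & mask)                  # ripple of n full adders, carry-in = 0
    b_gt_a = total >> n                      # carry bit of the last full adder
    a_eq_b = int((total & mask) == mask)     # product of the sum bits
    b_lt_a = int((not b_gt_a) and (not a_eq_b))
    return b_gt_a, a_eq_b, b_lt_a

assert two_path_compare(5, 9, 4) == (1, 0, 0)   # B > A
assert two_path_compare(7, 7, 4) == (0, 1, 0)   # A == B
assert two_path_compare(9, 5, 4) == (0, 0, 1)   # B < A
```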
Figure 23.30: Circuit Realization of the Identification of Two Distinct Path in Comparator
Circuit.
Row   Divisor (d6 d5 d4 d3 d2 d1 d0)   Dividend (n7 n6 n5 n4 n3 n2 n1 n0)
1     0 0 0 0 0 0 1                    1 1 0 0 0 0 0 0
2     0 0 0 0 0 1 1                    1 1 1 0 0 0 0 0
3     0 0 0 0 1 1 1                    1 1 1 1 0 0 0 0
4     0 0 0 1 1 1 1                    1 1 1 1 1 0 0 0
5     0 0 1 1 1 1 1                    1 1 1 1 1 1 0 0
6     0 1 1 1 1 1 1                    1 1 1 1 1 1 1 0
7     1 1 1 1 1 1 1                    1 1 1 1 1 1 1 1
One of the important properties of the division algorithm is that the dividend can never be divided by zero. This property enables us to derive another property, which states that the divisor must have a length of at least 1 bit. In addition, the division algorithm needs (m + 1) bits for the next step. Therefore, at least the first two bits from the MSB of the dividend will move to the next step, to the subtractor, and the rest of the bits will move to the remainder block. The data flow has been elucidated in Table 23.7.
Hence, the following equations are derived for the selection of (m + 1) bits out of n bits, where m is the number of bits in the divisor and n is the number of bits in the dividend for the selection process (here, both n and m = 7):
n7 = d0; n6 = d0; n5 = d1; n4 = d2; n3 = d3; n2 = d4; n1 = d5; n0 = d6    (23.18)
where
n_i is a dividend bit,
d_j is a divisor bit, and i ≤ j ≤ max(n, m).
Table 23.8 shows the truth table for the selection of 8 bits of the dividend against the 7 bits of the divisor. Since a divisor can never be zero, the truth table does not consider any divisor with the value of zero. Moreover, in Table 23.8, in the divisor columns "1" represents that the denominator is present on the specific data path and "0" indicates that the denominator is absent on that data path. On the other hand, in the dividend columns "1" represents that the corresponding numerator will move to the subtractor, whereas "0" indicates that the numerator will move to the remainder.
Fig. 23.33 represents the circuit realization of the 8-bit selection block. In the figure,
n7 , n6 , n5 , n4 , n3 , n2 , n1 , and n0 indicate the bit position of the dividend. On the other hand,
d6, d5, d4, d3, d2, d1, and d0 represent the bit positions of the divisor. The control gates of the
pmos and nmos transistors are enabled by the presence of the corresponding divisor bits.
The sources of all the pmos and nmos transistors are the corresponding dividend bits. The
drain of the pmos transistor is the input of the next subtractor and the drain of the nmos
transistor is the input to the remainder block.
Example 23.6 Consider the dividend (10110100)₂ and the divisor (101)₂. Therefore, n7 = 1, n6 = 0, n5 = 1, n4 = 1, n3 = 0, n2 = 1, n1 = 0, and n0 = 0. Since there are 3 bits in the divisor, d0, d1, and d2 are active and thus enable the first 4 pmos transistors; thus, s7 = 1, s6 = 0, s5 = 1, s4 = 1, and the rest of the bits s3, s2, s1, and s0 remain inactive. Moreover, the remainder bits become r3 = 0, r2 = 1, r1 = 0, and r0 = 0. Fig. 23.34 illustrates the circuit behavior of the 8-bit Selection Block. In the figure, red color indicates the "1" value, whereas the white color indicates the "0" value. Inactive regions are marked in ash color.
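Behaviourally, the selection block simply routes the upper (m + 1) dividend bits to the subtractor and the remaining bits to the remainder block; a minimal sketch (ignoring the pmos/nmos routing details) is shown below using the numbers of Example 23.6.

```python
def select_bits(dividend_bits: str, m: int):
    """Sketch of the selection block: the first (m + 1) dividend bits go to the
    subtractor; the remaining bits go to the remainder block."""
    return dividend_bits[:m + 1], dividend_bits[m + 1:]

# Example 23.6: dividend 10110100, 3-bit divisor (m = 3)
to_subtractor, to_remainder = select_bits("10110100", 3)
assert to_subtractor == "1011"      # s7 s6 s5 s4 = 1 0 1 1
assert to_remainder == "0100"       # r3 r2 r1 r0 = 0 1 0 0
```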
Figure 23.34: Demonstration of Example 23.6 on the 8-Bit Selection Block.
The selection block can be optimized further by reducing the number of nmos transistors
due to the following properties:
1. Property 1: The length of the divisor will be at least one since the operation of divided
by zero is omitted.
2. Property 2: The division algorithm in Algorithm 23.1 uses the first (m + 1)-bit from
the n-bit dividend for the selection block, where m is the number of bits in divisor
and n is the number of bits in the dividend.
Therefore, the first two remainder bits can be removed from the design for a single-bit divisor. The circuit behavior for a single-bit divisor is exhibited in Fig. 23.35. It is evident from Fig. 23.35 that the first two remainder bits always remain inactive in the presence of a single-bit divisor. Therefore, they also remain inactive when the number of bits in the divisor is increased. An improved version of the selection block is presented in Fig. 23.36. Finally, an n-bit selection block is exhibited in Fig. 23.37.
Figure 23.35: Analysis of the Circuit Behavior due to Property 1 and Property 2 of the 8-Bit
Selection Block.
Algorithm 23.4 represents the algorithm for the selection of the first (m + 1) bits of the n-bit dividend, where m is the number of bits in the divisor and n is the number of bits in the dividend. The algorithm presented in Algorithm 23.4 has been used to design the circuit diagram exhibited in Fig. 23.37.
Table 23.9: Truth Table for Conversion of ⌈log₂ n + 1⌉-bit to n-bit (Here, n = 7)
Fig. 23.40 demonstrates the circuit realization of the 3-bit to 6-bit zero converter circuit. When a respective output bit z_j is 1 (where j < 6), the control gate of the corresponding pmos transistor is activated and a constant "0" (acting as the source of the pmos transistor) is provided at the drain of the pmos transistor, which is the final output of the converter circuit.
Figure 23.40: Circuit Diagram of the 3-bit to 6-bit Zero Converter Circuit.
Example 23.7 Fig. 23.41 exhibits the circuit behavior of the circuit exhibited in Fig. 23.40. The final output z is inactive for the input of all 0's in Fig. 23.41. Fig. 23.42 exhibits the circuit behavior of the circuit exhibited in Fig. 23.40 for the input (001)₂. The final output z is again inactive for the input (001)₂ of Fig. 23.42 according to Table 23.5. Fig. 23.43 exhibits the circuit behavior of the circuit exhibited in Fig. 23.40 for the input (011)₂. The final output z is (0)₂ for the input (011)₂.
z0 = s1 + s2 (23.19)
z1 = s2 + (s0 .s1 ) (23.20)
z2 = s2 (23.21)
z3 = s2 (s0 + s1 ) (23.22)
z4 = s1 .s2 (23.23)
z5 = s0 .s1 .s2 (23.24)
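The following Python sketch evaluates Equations 23.19 to 23.24 and checks that, for an input value s between 1 and 7, exactly (s − 1) of the z outputs are activated, i.e., the converter marks (s − 1) zero positions. It is only a check of the logic equations, not of the transistor-level circuit.

```python
def zero_converter_3to6(s: int):
    """Evaluate z0..z5 of Equations 23.19-23.24 for a 3-bit input s2 s1 s0."""
    s0, s1, s2 = s & 1, (s >> 1) & 1, (s >> 2) & 1
    return [
        s1 | s2,            # z0 = s1 + s2
        s2 | (s0 & s1),     # z1 = s2 + s0.s1
        s2,                 # z2 = s2
        s2 & (s0 | s1),     # z3 = s2.(s0 + s1)
        s1 & s2,            # z4 = s1.s2
        s0 & s1 & s2,       # z5 = s0.s1.s2
    ]

# for every value s = 1..7, exactly (s - 1) zero positions are activated
assert all(sum(zero_converter_3to6(s)) == s - 1 for s in range(1, 8))
```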
Figure 23.41: Circuit Behavior of the Circuit Exhibited in Fig. 23.40 when the Input is (000)₂.
A generalized algorithm for the construction of an n-bit to (2^n − 2)-bit zero converter circuit has been presented in Algorithm 23.5. The working procedure of the algorithm is illustrated below:
1. Let us consider a 4-bit input, which means that the truth table requires 2^4 = 16 entries and the output will be 2^4 − 2 = 14 bits. So, the truth table is constructed with 16 entries. After the first two iterations, the output bits gradually become 1.
2. Now, the required output functions are produced with the help of AND and OR logic
as follows:
Figure 23.42: Circuit Behavior of the Circuit Exhibited in Fig. 23.40 for the Input (001)₂.
z0 = s1 + s2 + s3 (23.25)
z1 = s2 + s3 + (s0 .s1 ) (23.26)
z2 = s2 + s3 (23.27)
z3 = (s2 .(s0 + s1 )) + s3 (23.28)
z4 = (s1 .s2 ) + s3 (23.29)
z5 = (s0 .s1 .s2 ) + s3 (23.30)
z6 = s3 (23.31)
z7 = s3 .(s0 + s1 + s2 ) (23.32)
z8 = s3 .(s1 + s2 ) (23.33)
z9 = s3 .(s2 + (s0 .s1)) (23.34)
z10 = s2 .s3 (23.35)
z11 = (s0 + s1 ).s2 .s3 (23.36)
z12 = s1 .s2 .s3 (23.37)
z13 = s0 .s1 .s2 .s3 (23.38)
Figure 23.43: Circuit Behavior of the Circuit Exhibited in Fig. 23.40 for the Input (011)₂.
Finally, the above Equations 23.25 to 23.38 can be used to construct a 4-bit to 14-bit zero converter circuit.
The quotient selection logic uses one circuit for conversion to zero. Then the output of the
zero converter circuit is used for concatenation. The concatenation circuit is constructed by
using D flip-flops. Then, a 3-bit adder is used to calculate the necessary quotient bits.
Fig. 23.45 exhibits the circuit diagram of the n-bit divider, where n is the number of bits in the dividend. The n-bit divider circuit uses two n-bit counters to count the number of bits of the dividend and the divisor. The output of each n-bit counter is ⌈log₂ n + 1⌉ bits. The counter circuits work in parallel. Then the outputs of both bit counters move to the ⌈log₂ n + 1⌉-bit comparator and subtractor circuits. The output of the ⌈log₂ n + 1⌉-bit comparator circuit decides whether the whole divisor will be subtracted from the dividend or the partial dividend will be used for the subtraction process. The selection block provides the necessary dividend bits for the subtraction process. An m-bit subtractor is used to compute the remainder, where m is the number of bits in the divisor. This remainder is again used to provide the dividend in the next iteration. The path starting from the ⌈log₂ n + 1⌉-bit comparator to the remainder block is considered the path of remainder calculation. On the other hand, the path starting from the first ⌈log₂ n + 1⌉-bit subtractor to the quotient register is considered the path of the quotient selection logic. The quotient selection logic uses one ⌈log₂ n + 1⌉-bit to n-bit conversion-to-zero circuit for appending purposes for the next block. Then the output of the zero converter circuit is used for concatenation. The concatenation circuit is constructed by using D flip-flops. Then, an n-bit adder is used to calculate the necessary quotient bits.
Example 23.8 The execution of the circuit behavior for the dividend = (1111)₂ and divisor = (10)₂ is exhibited in Fig. 23.46. In the figure, red color indicates the "1" value whereas the white color indicates the "0" value.
Figure 23.46: Circuit Behavior of the 4-Bit Divider Circuit for Dividend = (1111)₂ and Divisor = (10)₂.
8: Now create L″ = L′ − R′, where L″ = (a4, a6, a8, a10, a12, a14). Since all the input operands are now used, the process ends, and thus it requires only three 6-input LUTs.
Similarly, to construct the 7-bit counter circuit, the presented algorithm requires only two 6-input LUTs.
One of the important properties of the n-input LUT is that it can serve as a dual-output LUT whenever the number of inputs used is less than n − 1, where 3 ≤ n ≤ 9. This property has been used in the LUT-based divider circuit. Therefore, the output function b0 from Equation 23.11 can be realized by the output variable of the LUT F0 as follows in Equation 23.43:
F0 = (a0 ⊕a1 )⊕(a2 ⊕a3 )⊕(a4 ⊕a5 ) (23.43)
The output of F0 is chained with another 6-input LUT to produce the dual outputs F1 and F2 as follows in Equation 23.44:
The most significant bit of the output operand b3, which has been derived from Equation 23.13, does not require any LUT since it is generated directly with the help of the input operand
a3 . The block diagram is shown in Fig. 23.47.
Figure 23.47: Block Diagram of the Look-Up Table (LUT)-Based 7-Bit Counter Circuit.
1. At first the input bits have to be paired as (a0, b0), (a1, b1), (a2, b2), (a3, b3).
2. The input to first 4-input LUT is (a0 , b0 , cin ) producing the following two outputs as:
3. The input to second 4-input LUT is (a1 , b1 , F 1 ) producing the following two outputs
as:
4. The input to third 4-input LUT is (a2 , b2 , F 3 ) producing the following two outputs
as:
5. The input to fourth 4-input LUT is (a3 , b3 , F 5 ) producing the following two outputs
as:
Here, F9 is the required output b less than a. Thus, the LUT-based 4-bit comparator circuit requires only six 4-input LUTs.
2: The carry of the last chained LUT provides the required output b greater than a. Then, the F2 outputs of all the LUTs are sent to a 6-input LUT producing the output a equal b.
3: Lastly, the output of the previous single 6-input LUT (a equal b), the output of the last chained 6-input LUT (b greater than a), and the unused outputs of the 6-input LUTs in step 1 are fed into a 6-input LUT which finally produces the output b less than a.
Algorithm 23.9 represents the algorithm for the LUT-based selection of the first (m + 1) bits of the n-bit dividend, where m is the number of bits in the divisor and n is the number of bits in the dividend. The algorithm presented in Algorithm 23.9 has been used to design the circuit diagram exhibited in Fig. 23.51 for the 4-bit selection circuit. The step-by-step demonstration of Algorithm 23.9 is given below:
Therefore, the circuit has been constructed and the circuit diagram has been depicted
in Fig. 23.51. The circuit exhibited in Fig. 23.51 has been demonstrated with 4-input LUT.
However, the design can also be done with 5, 6 or 7-input LUTs by using Algorithm 23.9.
Algorithm 23.10 presents the circuit design algorithm for the LUT-based converter circuit. This algorithm has been used to construct the circuit demonstrated in Fig. 23.52 for a 3-bit converter circuit. Similarly, this algorithm can be used to construct a 4-bit converter circuit as demonstrated in Fig. 23.53. In both cases, the 4-input LUT has been considered. The 3-bit LUT-based converter requires only three 4-input LUTs, whereas the 4-bit converter requires only eight 4-input LUTs. Though the design has been carried out for 4-input LUTs, it is possible to use higher-input LUTs by following Algorithm 23.10. For instance, if a 6-input LUT is used to design a 4-bit converter circuit, it requires only five 6-input LUTs. Thus, the design of the 4-bit converter circuit with 6-input LUTs achieves an improvement in terms of the required number of LUTs when the number of inputs per LUT is increased from 4 to 6. Therefore, the performance of the LUT-based converter circuit will be enhanced with an increase in the number of inputs of the LUTs.
Finally, the LUT-based selection block has been shown in Algorithm 23.9. The concatenation circuit is composed of a series of D flip-flop circuits.
The block diagram of the LUT-based divider has been shown in Fig. 23.45, which is the same as for the ASIC-based design. Instead of the ASIC-based components, the LUT-based components are used. The algorithms for the construction of the LUT-based components have been introduced in the earlier subsections.
5. Apply a 4-bit reversible fault-tolerant LUT-based subtractor where input := {(s3 s2 s1 s0)}, output := {(S2 S1 S0)}; apply a 3-bit reversible fault-tolerant LUT-based concatenation circuit where input := {(Q0 Q1 Q2), (z1 z0)}, output := {(Q2 Q1 Q0)}.
6. Apply a 3-bit reversible fault tolerant LUT-based adder where input:= the contents
of Quotient registers and output of concatenation circuit, output:= {(Q3 Q2 Q1 Q0 )}.
23.4 SUMMARY
The divider circuit has received less attention than adder and multiplier circuits due to its computational complexity and hardware difficulties in terms of area and latency. However, the performance of a processor will eventually degrade if the capability of the division operation is not improved. The non-restoring algorithm has been used extensively for the construction of radix-2 (binary) divider circuits, since it provides better area-delay performance than other methods such as the restoring, SRT, and digit-convergence algorithms, as reported for Application Specific Integrated Circuit (ASIC)-based divider circuits. Commercial designs also use the non-restoring division algorithm to construct their circuits, and it is widely accepted as a state-of-the-art algorithm. The most recent update on the conventional non-restoring algorithm reduces the delay of each iteration of the circuit from three to two, and its results showed notable improvements for long dividends. However, the area of the circuit increased considerably as the expense of the minimized delay.
This chapter presents a divider circuit using a new approach that calculates the bit difference and uses problem-specific heuristic functions to find the quotient values. The presented division algorithm, which is used to construct the divider circuit, calculates the next partial remainder and the quotient bits simultaneously. In addition, the division algorithm requires only two operations: addition and subtraction. This chapter has presented four new circuits which have been used to construct the divider circuit. These circuits are: a bit counter circuit, which counts the number of bits present in an input operand; a modified comparator circuit, which calculates whether the two input operands are equal to, greater than, or less than each other; a selection block, which selects the first (m + 1) bits out of n bits, where m is the number of bits in the divisor and n is the number of bits in the dividend; and a converter circuit, which converts the number of bits to zeros. Thus, it requires the delay of an inverter, a 2-input AND gate, and three D flip-flops. The method converts all the bits present in the input operand into "1" and then counts the number of 1's present in the input operand.
The state-of-the-art design of a comparator circuit uses three distinct paths to obtain the outputs of the comparator. One of the paths calculates whether the input operands are equal, and the other two paths calculate whether one input operand is greater or smaller than the other. The calculation of three paths requires a substantial amount of hardware resources. To minimize the area of the comparator circuit, two paths have been used in this chapter. The outputs of the two paths are inverted and conjugated with a 2-input AND gate to obtain the third output. One of the important circuits for the construction of the divider circuit is the converter circuit. The division algorithm squeezes the number of bits in the input operand from n bits to (log₂ n + 1) bits by means of the bit counter circuit. Then it appends (log₂ n − 1) zeros in the process. This mechanism is performed by the converter circuits.
The main reason behind the enhancement of the division algorithm is the use of the bit difference between the dividend and the divisor. First, it uses a heuristic function to compute the bit difference between the dividend and the divisor. Second, the method defines the number of quotient bits (the global optimum). Third, the iteration schedule is calculated, which might be non-increasing depending on how the local optimum fills the defects to reach the global optimum. Fourth, fewer partial remainders (local optima) are generated on the basis of the schedule. Finally, during the subtraction operation, X is subtracted from the first (m + 1) bits of Y, where X is the divisor, Y is the dividend, and m is the number of bits in the divisor. Thus, the division algorithm optimizes the required
number of hardware resources. Moreover, the bit counter, selection block, comparator
and converter circuits enhance the efficiency of the divider circuit in terms of area and
delay. The design can easily be implemented commercially in any platform irrespective of
Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA).
The reversible components can be used as a building block for the construction of cost-
efficient quantum computers with a minimum number of gates, transistors and unit delay.
REFERENCES
[1] K. Pocek, R. Tessier and A. DeHon, “Birth and adolescence of reconfigurable com-
puting: A survey of the first 20 years of field-programmable custom computing ma-
chines”, In Highlights of the First Twenty Years of the IEEE International Symposium
on Field-Programmable Custom Computing Machines, pp. 3–19, 2013.
[2] A. Amara, F. Amiel and T. Ea, “FPGA vs. ASIC for low power applications”, Micro-
electronics Journal, vol. 37, no. 8, pp. 669–677, 2006.
[3] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs”, IEEE Trans.
on Computer-aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp.
203–215, 2007.
[4] B. Jovanovic, R. Jevtic and C. Carreras, “Binary Division Power Models for High-
Level Power Estimation of FPGA-Based DSP Circuits”, IEEE Trans. on Industrial
Informatics, vol. 10, no. 1, pp. 393–398, 2014.
[5] S. Subha, “An Improved Non-Restoring Algorithm”, International Journal of Applied
Engineering Research, vol. 11, no. 8, pp. 5452–5454, 2016.
[6] D. M. Muoz, D. F. Sanchez, C. H. Llanos and M. A. Rincn, “Tradeoff of FPGA design
of a floating-point library for arithmetic operators”, Journal of Integrated Circuits and
Systems, vol. 5, no. 1, pp. 42–52, 2010.
[7] S. F. Oberman and M. J. Flynn, “Design issues in division and other floating-point
operations”, IEEE Trans. on Computers, vol. 46, no. 2, pp. 154–161, 1997.
[8] S. F. Obermann and M. J. Flynn, “Division algorithms and implementations”, IEEE
Trans. on Computers, vol. 46, no. 8, pp. 833–854, 1997.
[9] R. Tessier, K. Pocek and A. DeHon, “Reconfigurable computing architectures”, Pro-
ceedings of the IEEE, vol. 103, no. 3, pp. 332–354, 2015.
[10] A. D. Hon and J. Wawrzynek, “Reconfigurable computing: what, why, and implica-
tions for design automation”, In Proceedings of the 36th Annual ACM/IEEE Design
Automation Conference, pp. 610–615, 1999.
[28] R. Landauer, "Irreversibility and heat generation in the computing process", IBM J. Res. Dev., vol. 5, no. 3, pp. 183–191, 1961.
[29] M. M. A. Polash and S. Sultana, “Design of a LUT-based reversible field pro-
grammable gate array”, Journal of Computing, vol. 2, no. 10, pp. 103–108, 2010.
[30] A. S. M. Sayem and S. K. Mitra, “Efficient approach to design low power reversible
logic blocks for field programmable gate arrays”, In Computer Science and Automation
Engineering (CSAE), IEEE International Conference on, vol. 4, pp. 251–255, 2011.
CHAPTER 24
Synthesis of Boolean
Functions Using TANT
Networks
TANT is a three-level AND-NOT network with true inputs composed solely of NAND
gates. This chapter describes a systematic method for minimizing a TANT circuit and the
heuristic algorithms for different stages of the technique are provided. Algorithms in each
step of the introduced method are extensively discussed in this chapter.
24.1 INTRODUCTION
The TANT network has meaningful advantages over the PLA representation. A TANT design for a function f can never be worse than the corresponding PLA in terms of the number of gates. TANT has the un-complemented (affirmative) variables as its inputs, whereas the PLA has both un-complemented variables and their negations as inputs. Thus, if a rectangular layout realization is assumed, the PLA has one dimension of the input plane two times larger. Again, the number of prime implicants of a TANT is also smaller than that of a PLA. It also allows for better incorporation of fan-in constraints than the "standard cell" realization of PLA-type two-level logic. Thus the automated synthesis of TANT networks will play a prominent role in coping with the interconnection problem of integrated electronic devices. Three-level networks are extensively used in flash memory, which serves more as a hard drive than as RAM; it is used in devices such as digital cameras and home video devices for easy and fast information storage.
In this case, it is seen from Fig. 24.1 that it requires 6 AND-OR-NOT gates. But if it is expressed in TANT form as:
For larger circuits, the reduction in the number of gates increases compared to the AND-OR-NOT circuit. Though the power consumption of a TANT circuit is higher than that of the corresponding AND-OR-NOT circuit, the focus here is on the minimization of TANT circuits for binary logic functions.
Property 24.2.2 Let Xa and Ya′ be two Boolean expressions. The consensus relation (of order 2) of Xa and Ya′ is written as (Xa, Ya′) → X.Y.
The consensus operation produces the elimination of the variable a through the equality a + a′ = 1.
After obtaining the set of PIs, the consensus operation (if possible) is applied on the set to generate prime implicants with the same head as any PI of the set. Then the generated PI is combined with the PI having the same head.
Property 24.2.3 A tail factor Y′ will be a useful tail factor (UTF) of another tail factor X′ if Y − X is in the head factors of the term that contains X′.
Example 24.3 z′ is the tail factor of the term xyz′. The useful tail factors of z′ are (xz)′, (yz)′, and (xyz)′.
Now, the method is represented using the flow diagram of Fig. 24.2.
The first circle shows that the method finds the prime implicants using the popular Quine–McCluskey method. After that, three more steps are followed to minimize the TANT network. Though the algorithm for the minimization of TANT is very efficient, it has some drawbacks. They are:
3. Selection of the minimum number of tail factors (the last stage) follows a brute-force algorithm.
4. The technique does not work well for functions with a large number of variables.
Example 24.4 Let us consider the minterm x0x1′. Then x0(x0x1)′ will be a generalized prime implicant of x0x1′, as x0(x0x1)′ = x0(x0′ + x1′) = x0x0′ + x0x1′ = x0x1′. Let us consider another minterm x1x2′x0′. The 3 GP terms generated from it are x1(x2x1)′x0′, x1x2′(x0x1)′, and x1(x2x1)′(x0x1)′.
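A small Python sketch of this GP-term expansion is given below; it simply lets each tail factor either stay unchanged or absorb the head, which reproduces the three GP terms of Example 24.4. The literals are handled as plain strings, so this is only an illustration of the enumeration, not a full Boolean-algebra engine.

```python
from itertools import product

def gp_terms(head: str, tails: list[str]) -> set[str]:
    """Each complemented literal t' of a term either stays as t' or becomes
    (t.head)'; every combination except 'all unchanged' yields a GP term."""
    terms = set()
    for choice in product([False, True], repeat=len(tails)):
        if not any(choice):
            continue                     # the all-unchanged case is the term itself
        parts = [head]
        for tail, absorb in zip(tails, choice):
            parts.append(f"({tail}{head})'" if absorb else f"{tail}'")
        terms.add("".join(parts))
    return terms

# minterm x1 x2' x0': head x1, tails x2 and x0 (Example 24.4)
assert gp_terms("x1", ["x2", "x0"]) == {
    "x1(x2x1)'x0'", "x1x2'(x0x1)'", "x1(x2x1)'(x0x1)'"
}
```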
Property 24.3.2 A PI will be called Only Tail Factor (OTF) if the term doesn’t contain
any head factor, that is, the term contains only tail factor(s).
Example 24.5 The minterms x0′, x1′, x2′ and the term x0′(x2x1)′ will all be OTFs, because they have no head factor; they have only tail factors.
The steps of the method are shown in Fig. 24.3 using an easily understandable flow
diagram.
As seen from Fig. 24.4, some new terms, a data structure (the Boolean Tree, BT), properties, and an algorithm are presented in the introduced method. Each of the new items is described in brief.
Property 24.3.3 If the new term generated by a consensus operation, when logically added to a PI, would modify a tail factor of that PI that is part of an OTF, the logical addition will not be executed. It has been observed that the consensus operation sometimes generates unusual terms that replace some other important terms according to the definition of the consensus operation. But by replacing the old terms with new terms, there may be a tail factor that is shared with the same tail factor of another PI.
Algorithm 24.1 is for applying consensus operation over the set of PIs and Algorithm
24.2 is used to represent combine operation. Both of the algorithms are presented in the
next section.
Boolean Tree: The Boolean Tree (BT) is a new data structure to generate GP terms and UTFs efficiently. GP terms are generated twice for each PI. The BT is efficient and accurate as well. Fig. 24.3 shows the BT for the PI AB(C)′(D)′. But sometimes the BT also generates some unusual GP terms and tail factors. If there is a tail factor in a PI which is part of an OTF, the useful tail factors generated from that tail factor are unnecessary, as all the tail factors of each OTF must be generated in the very first level of the TANT circuit. To solve this, Property 24.3.4 has been introduced.
Property 24.3.4 If any tail factor in a PI is the part of any OTF, then the tail factor will
not be expanded further in the useful tail factor generation procedure.
If Property 24.3.3 is followed in the BT generation, the BT will look like Fig. 24.5.
Figure 24.5: BT for AB′(C)′(D)′ Considering Property 24.3.3, where (C)′ is a part of an
OTF.
A GP term contains a tail factor that matches the UTF of that column. The GPs–UTFs table will be shown with an example in the evaluation part. The GPs–UTFs table helps to find the optimal network in both the computer program and the hand solution.
Finding the Optimal Solution: An efficient algorithm is presented to find the minimal TANT network for a Boolean expression with the help of the GPs–UTFs table. Algorithm 24.4 will be applied in the evaluation section to a particular benchmark function. Prior to Algorithm 24.4, Lemma 24.3 is introduced, which will be helpful in the implementation of the algorithm.
Property 24.3.5 In a minimal TANT network, there will be at least n UTFs if the highest number of tail factors among all PIs is n.
24.5 SUMMARY
In this chapter, a heuristic technique is presented to minimize Three-level AND-NOT Networks with True Inputs (TANT). A TANT design for any function can never be worse than the corresponding Programmable Logic Array (PLA) in terms of the number of gates. The steps and algorithms are discussed extensively in this chapter. The introduced method constructs an optimal TANT network for a given single-output function. Reduction of the number of gates is not the only achievement of this chapter; the presented method can also reduce the complexity in terms of time.
REFERENCES
[1] E. J. McCluskey, "Minimization of Boolean Functions", Bell System Technical Journal, vol. 35, no. 5, pp. 1417–1444, 1956.
[2] P. Tison, "Generalization of consensus theory and application to the minimization of Boolean functions", IEEE Trans. Electronic Computers, vol. EC-16, pp. 446–456, 1967.
[3] K. S. Koh, “A minimization technique for TANT network”, IEEE Trans. on Computer,
pp. 105–107, 1971.
[4] M. A. Marin, “Synthesis of TANT Networks using Boolean Analyser”, The Comp.
Journal, vol. 12, no. 3, 1969.
[5] M. A. Perkowski and M. C. Jeske, “Multiple-Valued Input TANT network”, ISMVL,
pp. 334–341, 1994.
[6] H. M. H. Babu, M. R. Islam, S. M. A. Chowdhury and A. R. Chowdhury, “Synthesis
of Full Adder Circuit Using Reversible Logic”, Proceedings on 17th International
Conference on VLSI Design, 2004.
[7] H. M. H. Babu, M. R. Islam, S. M. A. Chowdhury and A. R. Chowdhury, “Reversible
Logic Synthesis for Minimization of Full-Adder Circuit”, Proceedings on DSD, pp.
50–54, 2003.
CHAPTER 25
This chapter presents an asymmetric high-radix signed-digit (AHSD) adder that performs addition on the basis of a neural network (NN) and also shows that the AHSD number system supports carry-free (CF) addition by using an NN. Besides, the NN implies a simple construction for high-speed operation. Since the signed-digit number system represents binary numbers using only one redundant digit for any radix r ≥ 2, the high-speed adder in the processor can be realized in the signed-digit system without the delay of carry propagation. A novel NN design has been constructed for a CF adder based on the AHSD4 number system, which is also presented in this chapter. Moreover, if the radix is specified as r = 2^m, where m is any positive integer, the binary-to-AHSD conversion can be done in constant time regardless of the word length. Hence, the AHSD-to-binary conversion dominates the performance of an AHSD-based arithmetic system. In order to investigate how the AHSD number system based on the NN design achieves its functions, computer simulations of the key circuits for conversion from binary to AHSD4-based arithmetic systems are made.
25.1 INTRODUCTION
Addition is the most important and frequently used arithmetic operation in computer systems. Generally, a few methods can be used to speed up the addition operation. One is to use a neural network design to convert the operands from the binary number system to a redundant number system, e.g., the signed-digit number system or the residue number system, so that the addition becomes carry-free (CF). This Neural Network (NN) design implies that fast addition can be done at the expense of conversion between the binary number system and the redundant number system. In this chapter, the focus is on exploring high-radix signed-digit (SD) numbers and the design of the asymmetric high-radix signed-digit (AHSD) number system using an NN.
The idea of AHSD is not new. Instead of presenting a new number representation, the purpose is to explore the inherent CF property of AHSD by using an NN. The CF addition in AHSD based on an NN is the basis for the high-speed addition circuits. The conversion of AHSD to and from binary will be discussed in detail. By choosing r = 2^m, where m is any positive integer, a binary number can be converted to its canonical AHSD representation in constant time. A simple algorithm is also presented for converting a binary bit pattern into pairs in AHSD for radix r. Two NN designs are described for converting AHSD numbers to binary: the first stresses high speed and the other provides hardware reusability. Since the conversion from AHSD to binary has been considered the bottleneck of AHSD-based arithmetic computation based on an NN, these NN designs greatly improve the performance of AHSD systems. For illustration, an example on AHSD4, i.e., the radix-4 AHSD number system, is discussed in detail.
Figure 25.1: Neural Network Prototype for AHSD Number System Addition.
digit in the digit set. The inherent carry-free property of AHSD will be explored, and systematic approaches will be developed for the conversion of AHSD numbers from binary ones.
An n-digit number X in AHSD(r) is represented as
X = (x_{n−1}, x_{n−2}, . . . , x_0)_r,
where x_i ∈ S_r for i = 0, 1, . . . , n − 1, and S_r = {−1, 0, 1, . . . , r − 1} is the digit set of AHSD(r). The value of X can be represented as
X = Σ_{i=0}^{n−1} x_i r^i
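As a concrete reading of this definition, the short Python sketch below evaluates an AHSD(r) digit list; the example digits are chosen arbitrarily for illustration.

```python
def ahsd_value(digits, r):
    """X = sum of x_i * r**i for digits given MSB-first, x_i in {-1, 0, ..., r-1}."""
    return sum(x * r**i for i, x in enumerate(reversed(digits)))

# (1, -1, 2) in AHSD(4): 1*16 - 1*4 + 2 = 14
assert ahsd_value([1, -1, 2], 4) == 14
```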
Algorithm 25.1 Algorithm for the Conversion from the Binary Number System to AHSD
1: Suppose the given binary number has n bits
2: if radix = 2^m, where m is any positive integer, then
3:   2^p < m < 2^{p+1}, where p = 1, 2, 3, . . .
4:   Number of zeros (0) padded in front of the binary bit pattern: 2^{p+1} − n
5:   Divide the array by m
6:   if each sub-array is of size m, done
7: else
8:   Divide each sub-array by m
9: end if
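A minimal Python sketch of the pairing step is shown below; it performs only the zero padding and the m-bit grouping implied by Algorithm 25.1 (the NN-based digit recoding itself is not modelled), and the helper name binary_to_groups is illustrative.

```python
def binary_to_groups(bits: str, m: int):
    """Pad the binary pattern with leading zeros to a multiple of m, then split
    it into m-bit groups, one group per AHSD(2**m) digit position."""
    pad = (-len(bits)) % m               # zeros padded in front of the pattern
    padded = "0" * pad + bits
    return [padded[i:i + m] for i in range(0, len(padded), m)]

# 6-bit operand, AHSD(4) (m = 2): three 2-bit groups
assert binary_to_groups("101101", 2) == ["10", "11", "01"]
```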
Proof 25.1 The recurrence relation for the conversion from the binary to the AHSD number system is:
T(n) = m T(n/m)
So, T(n) = m² T(n/m²)
.
.
T(n) = m^k T(n/m^k), where k = 1, 2, 3, . . .
Assume n = m^k:
T(n) = m^k T(m^k/m^k) = m^k T(1)
T(n) = m^k, so n = m^k and
log_m n = log_m m^k = k.
The complexity of Algorithm 25.1 to make pairs to convert binary to AHSD(r) is O(log_m n).
The final result is in binary format. The given example illustrates the addition process
without carry propagation.
Property 25.2.1 The total number of neurons for generating the interim sum and carry of a radix-n asymmetric q-bit adder design is q × 2(n − 1).
Proof 25.2 As n is the radix, each digit can hold a value of at most n − 1. So the interim sum will be in the range 0 to 2(n − 1). The total number of neurons for a 1-bit adder design will be 2(n − 1). For the design of a q-bit adder, the total number of neurons will be q × 2(n − 1).
Here an algorithm is introduced for n-bit adder from binary to AHSD4 using NN.
25.4 SUMMARY
In this chapter, a CF (carry-free) digital adder for the asymmetric high-radix signed-digit (AHSD4) number system based on Neural Networks (NN) has been presented. Additionally, if r = 2^m for any positive integer m, the interface between AHSD and the binary number system can be realized and easily implemented. An algorithm has also been introduced to make pairs at the time of conversion from binary to AHSD. Since both the binary-to-AHSD converter and the AHSD-to-binary converter of the CF adder operate in constant time, it can be concluded that the AHSD-to-binary converter dominates the performance of the entire NN-based AHSD design. The time complexity of the entire AHSD CF adder is O(log_m n).
REFERENCES
[1] S. H. Sheih and C. W. Wu, “Asymmetric high-radix signed-digit number systems
for carry-free addition”, Journal of information science and engineering, vol. 19, pp.
1015–1039, 2003.
[2] B. Parhami, “Generalized signed-digit number systems: a unifying framework for
redundant number representations”, IEEE Trans. on Computers, vol. 39, pp. 89–98,
1990.
[3] S. H. Sheih and C. W. Wu, “Carry-free adder design using asymmetric high-radix
signed-digit number system”, in Proceedings of 11th VLSI Design/CAD Symposium,
pp. 183–186, 2000.
[4] M. Sakamoto, D. Hamano and M. Morisue, “A study of a radix-2 signed-digit al
fuzzy processor using the logic oriented neural networks”, IEEE International Systems
Conference Proceedings, pp. 304–308, 1999.
[5] T. Kamio, H. Fujisaka and M. Morisue, “Back propagation algorithm for logic oriented
neural networks with quantized weights and multilevel threshold neurons”, IEICE
Trans. Fundamentals, vol. E84-A, no. 3, 2001.
[6] A. Moushizuki and T. Hanyu, “Low-power multiple-valued current-mode logic using
substrate bias control”, IEICE Trans. Electron., vol. E87-C, no. 4, pp. 582–588, 2004.
CHAPTER 26
Wrapper/TAM
Co-Optimization and
Constrained Test
Scheduling for SOCs Using
Rectangle Bin Packing
This chapter describes an integrated framework for SOC test automation. This framework
is based on a new approach for Wrapper/TAM co-optimization using rectangle packing that considers the diagonal length of the rectangles in order to emphasize both the TAM width required by a core and its corresponding testing time. In this chapter, an efficient algorithm has been
introduced to construct wrappers that reduce testing time for cores. Rectangle packing has
been used to develop an integrated scheduling algorithm that incorporates power constraints
in the test schedule. The test power consumption is important to consider since exceeding
the system’s power limit might damage the system.
26.1 INTRODUCTION
The development of microelectronic technology has led to the implementation of system-on-
chip (SOC), where a complete system consisting of several application specific integrated
circuit (ASIC), microprocessors, memories and other intellectual properties (IP) blocks,
is implemented on a single chip. The increasing complexity of SOC has created many
testing problems. The general problem of SOC test integration includes the design of TAM
architectures, optimization of the core wrappers, and test scheduling. Test wrappers form
the interface between cores and test access mechanisms (TAMs), while TAMs transport
test data between SOC pins and test wrappers. The problem of designing test wrappers and TAMs to minimize SOC testing time is addressed. While optimized wrappers reduce test
application times for the individual cores, optimized TAMs lead to more efficient test data
transport on-chip. Since wrappers influence TAM design, and vice versa, a co-optimization
strategy is needed to jointly optimize the wrappers and the TAM for an SOC.
In this chapter, an approach is presented for integrated wrapper/TAM co-optimization and test scheduling based on a general version of rectangle packing that considers the diagonal length of the rectangles to be packed. The main advantage of the approach is that it minimizes the test application time while considering the test power limitation.
The height of the bin corresponds to the total SOC TAM width, and the width to which
the bin is ultimately filled corresponds to the system testing time that is to be minimized. The
unfilled area of the bin corresponds to the idle time on TAM wires during test. Furthermore,
the distance between the left edge of each rectangle and the left edge of the bin corresponds
to the begin time of each core test. The approach emphasizes both the testing time of a core and the TAM width required to achieve that testing time by considering the diagonal length of the rectangles. The diagonal length emphasizes both testing time and TAM width, since DL = √(W² + H²), where W, H, and DL denote the width, height, and diagonal length of the rectangles, respectively. Consider three rectangles R[1] = {H = 32, W = 7.1, DL = 32.78}, R[2] = {H = 16, W = 13.8, DL = 21.13}, and R[3] = {H = 32, W = 5.4, DL = 32.45}. If only the testing time (W) were taken into account, R[2] would be packed first, followed by R[1] and R[3]. But when diagonal lengths are considered, R[1], R[3], and R[2] are packed in sequence, which gives a more efficient result.
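The ordering used above can be reproduced with a few lines of Python; the rectangle data are the (H, W) pairs quoted in the text, and the sorting key is the diagonal length DL = √(W² + H²).

```python
import math

# one rectangle per core test: height = TAM width, width = testing time
rects = {"R1": (32, 7.1), "R2": (16, 13.8), "R3": (32, 5.4)}

def diagonal(h: float, w: float) -> float:
    """DL = sqrt(W^2 + H^2) balances TAM width and testing time."""
    return math.hypot(h, w)

# packing order: descending diagonal length
order = sorted(rects, key=lambda r: diagonal(*rects[r]), reverse=True)
assert order == ["R1", "R3", "R2"]
```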
The approach also minimizes TAM width utilization by assigning TAMu wires to a core to achieve a specific testing time. For example, in the Wrapper_Design, all TAM widths from 50 up to 64 result in the same testing time of 114317 cycles and the same TAM width utilization (TAMu) of 47 for core 6 in p93791 (Table 26.1). So, to achieve the testing time of 114317 cycles, the TAMu value 47 is used in the introduced approach.
2. Find the smallest (T min ) among the testing time corresponding to MAX_TAMu of
all cores.
3. For each core C[i], divide the width T[i] of all rectangles constructed in line 1 with
T min .
4. For each core C[i], calculate the diagonal length DL[i] = √((W[i])² + (T[i])²), where W[i] denotes MAX_TAMu and T[i] denotes the corresponding reduced testing time.
5. Sort the Cores in descending order of their diagonal length calculated in line 4 and
keep in list INITIAL[NC].
6. Next_Schedule_Time = current_Time = 0;
W avail = W mx ; // TAM available; Idle_Flag = False;
// peak_tam[c] is equal to MAX_TAMu of core c; // PENDING is a queue.
7. While (INITIAL and PENDING not Empty)
{
8. If (W avail > 0 and Idle_Flag = False)
{
9. If (INITIAL is not empty)
{
c = delete(INITIAL);
If (W avail >= peak_tam[c] && no_powerConflict)
Update(c, peak_tam(c));
Else If(Possible_TAM >= 0.5 * peak_tam[c] && no_powerConflict)
Update(c, Possible_TAM);
Else
add(PENDING, c);
if(peak_tam[PENDING[front]] ≤ Wavail && no_powerConflict)
Update(PENDING[front], peak_tam[PENDING[front]]);
delete(PENDING) ;
}
10. Else //if INITIAL is empty
{
If(peak_tam[PENDING[front]] ≤ Wavail && no_powerConflict)
Update(PENDING[front], peak_tam[PENDING[front]]);
delete(PENDING)
Else
Idle_Flag = True;
}
}
11. Else //TAM available < 0 or idle
{
Calculate Next_Schedule_Time = Finish[i], such that Finish[i] > This_Time and Finish[i] is minimum;
Set This_Time = Next_Schedule_Time;
12. For every Core i, such that finish[i] = This_Time
W avail = W avail + Width[i];
13. Set Complete[i] = TRUE;
Idle_Flag = False;
}
} //end of while
return test_schedule;
}
Figure 26.2: Example of Some Rectangles for core 6 of SOC p93791 when W max = 32.
In Fig. 26.3, MAX_TAMu = 24 and W max = 32. For combinational core, MAX_TAMu
is always equal to W max . Note that, in case of TAM wire assignment to that particular
scheduling of p93791 (Fig. 26.2), TAM wires that are to be assigned to core 6 must be
selected from values 24, 16, 12, 10, 8, and 7 depending on TAM width available.
Figure 26.3: Test Scheduling for d695 using The Algorithm (T min = 1109 and TAM width
= 24) without Power Constraints.
cannot satisfy the power constraints as well as the condition W_avail ≥ peak_tam[c], where c is the core at the front of the queue PENDING.
If there are W_avail idle wires or W_avail = 0, the execution proceeds to the process of updating This_Time to Next_Schedule_Time and W_avail. W_avail is increased by the width of all cores ending at the new value of This_Time, and complete[i] is set to true for all cores whose tests have completed at This_Time.
26.5 SUMMARY
In this chapter, a Wrapper/TAM co-optimization and test scheduling technique is presented
that takes test power consumption into account when minimizing the test application time.
It is important to consider test power consumption since exceeding it might damage the
system. The technique is based on rectangle packing, which emphasizes both time and TAM (Test Access Mechanism) width by considering diagonal lengths. An integrated
framework for SOC (System-On-Chip) test automation is described in this chapter.
REFERENCES
[1] J. Aerts and E. J. Marinissen, “Scan chain design for test time reduction in core-based
ICs”, Proceedings International Test Conference, pp. 448–457, 1998.
[2] E. J. Marinissen, “A structured and scalable mechanism for test access to embedded
reusable cores”, Proceedings International Test Conference, pp. 284–293, 1998.
[3] E. J. Marinissen, S. K. Goel and M. Lousberg, “Wrapper design for embedded core
test”, Proceedings International Test Conference, pp. 911–920, 2000.
[4] E. Larsson and Z. Peng, “An integrated system-on-chip test framework”, Proceedings
of the Design Automation and Test in Europe Conference, pp. 138–144, 2002.
[5] J. Pouget, E. Larsson and Z. Peng, “SOC Test Time Minimization Under Multiple
Constraints”, Proceedings of the Asian Test Symposium, 2003.
[6] R. Chou, K. Saluja and V. Agrawal, “Scheduling Tests for VLSI Systems under Power
Constraints”, IEEE Trans. on VLSI Systems, vol. 5, no 2, pp. 175–185, 1997.
[7] V. Muresan, X. Wang and M. Vladutiu, “A Comparison of Classical Scheduling
Approaches in Power-Constrained Block-Test Scheduling”, Proceedings International
Test Conference, pp. 882–891, 2000.
[8] V. Iyengar, K. Chakrabarty and E. J. Marinissen, “Test wrapper and test access mech-
anism co-optimization for system-on-chip”, J. Electronic Testing: Theory and Appli-
cations, vol. 18, pp. 211–228, 2002.
[9] V. Iyengar, K. Chakrabarty and E. J. Marinissen, “Efficient wrapper/TAM co-
optimization for large SOCs”, Proceedings of the Design Automation and Test in
Europe (DATE) Conference, 2002.
[10] V. Iyengar and K. Chakrabarty, “Test bus sizing for system-on-a-chip”, IEEE Trans.
Computers, vol. 51, 2002.
Both volatile and nonvolatile memories are used in the computer memory system. Volatile memories such as Static RAM (SRAM) and Dynamic RAM (DRAM) are utilized as primary memory, while nonvolatile memory such as flash memory is used for secondary storage. However, new nonvolatile technologies have recently been developed that promise fast changes in the memory system environment. A memristor is a passive two-terminal device whose resistance depends on the magnitude and polarity of the voltage applied to it. Similar to memory devices, it exhibits a nonlinear relation between voltage and current. This chapter describes a method for designing a nonvolatile 6-T static random access memory (SRAM) using memristors. On an Apollo design station, an SRAM was created using a 2-micron minimum geometry with an nMOS fabrication method. Test structures were incorporated in addition to the SRAM integrated circuit to assist in characterizing the process and the design. The memristor-based resistive random access memory (MRRAM), which works similarly to an SRAM cell, is addressed in this chapter.
27.1 INTRODUCTION
In 1971, Leon Chua proposed the memristor, a fourth non-linear passive two-terminal
electrical component. It establishes a connection between electric charge and magnetic flux
over a given time interval. Hewlett Packard (HP) Labs researchers reported in 2008 that the memristor was physically realized utilizing a thin film of titanium dioxide in a nanoscale device. Essentially, a memristor is a memory resistance device. When a voltage is supplied to this element, the resistance changes, but when the voltage is withdrawn, the resistance remains constant. The nonlinear input-output properties of the memristor (M) distinguish it from the three passive elements (R, L, and C). Memristors are also used as programmable resistive loads in a differential amplifier. Memristors are a good option for future memory because of their non-volatile nature and high packing density in a crossbar array. The circuit's major features are its non-volatility and smaller size compared to a traditional 6T-SRAM. Even if the power is switched off for an extended period of time, the data is preserved in the memory. It may be significantly smaller than traditional SRAM cells, because each memory cell has only three transistors and two memristors.
A resistive RAM can switch between two or more resistance states under the application of suitable voltages. It exhibits memristive behavior and can be considered a kind of memristor. Devices can have two or more distinct resistance states, or their resistance may be continuously variable. In either case, it is critical that the change in resistance can be controlled by the device's previous history, that is, by the previous voltage applied to, or the previous current passed through, the device. Resistive RAM (RRAM) devices may be able to alleviate some of the existing constraints in microelectronics.
The complexity of any large integrated circuit design needs to be reduced by eliminating unnecessary component parts. A hierarchical approach can be used in which circuits are built from the bottom up: cells are made to represent the commonly used parts and are combined to form the final circuit. As shown in Fig. 27.1, the first phase of any hierarchical design is the creation of the basic cells. These schematics are then entered into the computer through the NETED software package. NETED is a schematic capture program that converts circuit diagrams into node lists. These node lists, in conjunction with a transistor models file, define the circuit, the interconnections, and the device characteristics of the nMOS transistors.
As long as M does not change with charge, Equation 27.4 indicates that memristance defines a linear relationship between current and voltage. A nonzero current implies a charge that varies over time. However, alternating current can reveal this linear dependence in circuit operation by generating a measurable voltage without causing net charge movement, as long as the maximum change in q does not produce a significant change in M. Furthermore, if no current is supplied, the memristor remains static: if $I(t) = 0$, then $V(t) = 0$ and $M(t)$ is constant. This is the essence of the memory effect. The power consumption characteristic resembles that of a resistor, $I^2 R$.
As long as M(q(t)) varies little, such as under alternating current, the memristor appears as a constant resistor. If M(q(t)) increases rapidly, however, current and power consumption quickly stop. M(q) is physically restricted to be positive for all values
of q (assuming the device is passive and does not become superconductive at some q). A
negative value would mean that it would perpetually supply energy when operated with
alternating current. For $R_{ON} \ll R_{OFF}$, the memristance function can be determined as follows:
$$M(q(t)) = R_{OFF}\left(1 - \frac{\mu_v R_{ON}}{D^2}\, q(t)\right) \qquad (27.6)$$
where $R_{OFF}$ represents the high resistance state, $R_{ON}$ represents the low resistance state, $\mu_v$ represents the mobility of dopants in the thin film, and $D$ is the thickness of the film.
This power characteristic is fundamentally different from that of a capacitor-based metal-oxide-semiconductor transistor: unlike a transistor, the memristor's ultimate charge state is independent of the bias voltage.
Hysteresis, also known as the "hard-switching regime," occurs when a memristor is switched across its full resistance range. A cyclic M(q) switch, on the other hand, would have each off-on event followed by an on-off event under continuous bias. In either case, such a device would operate as a memristor, although it would be less practical.
If a voltage is applied across the memristor, the following is obtained:
$$v(t) = \left( R_{ON}\frac{w(t)}{D} + R_{OFF}\left(1 - \frac{w(t)}{D}\right) \right) i(t) \qquad (27.7)$$
where $R_{ON}$ is the resistance of the completely doped memristor, $R_{OFF}$ is the resistance of the completely undoped memristor, and the state variable $w(t)$ is governed by
$$\frac{dw(t)}{dt} = \mu_v \frac{R_{ON}}{D}\, i(t) \qquad (27.8)$$
where $\mu_v$ is the average dopant mobility and $D$ is the length of the memristor. From these equations, the nonlinearity arising at the edges of the thin film can be modeled by the window function
$$f\!\left(\frac{w(t)}{D}\right) = 1 - \left(2\,\frac{w(t)}{D} - 1\right)^{2p} \qquad (27.9)$$
Fig. 27.4 shows the change in resistance of a memristor when a 3.6 V peak-to-peak square wave is applied across it. In the positive half-cycle, the resistance of the memristor changes from 20 kΩ to 100 kΩ, and the change occurs in the opposite direction when the square pulse reverses its polarity.
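The switching behavior described above can be reproduced numerically. The following is a minimal Python sketch (not the chapter's Apollo/NETED design flow) that integrates the state equation (27.8) with the window function (27.9) and evaluates the resulting memristance under a square-wave drive. The device parameters, drive frequency, and simulation length are illustrative assumptions, not values extracted from the fabricated device.

```python
import math

# Illustrative device parameters (assumptions, not values from the fabricated device)
R_ON, R_OFF = 20e3, 100e3      # low / high resistance states (ohms)
D = 10e-9                      # thin-film thickness (m)
MU_V = 1e-14                   # dopant mobility (m^2 s^-1 V^-1)
P = 1                          # window-function exponent

def window(w):
    """Joglekar window f(w/D) = 1 - (2w/D - 1)^(2p), Eq. (27.9)."""
    return 1.0 - (2.0 * w / D - 1.0) ** (2 * P)

def memristance(w):
    """Weighted combination of doped and undoped regions (cf. Eq. (27.7))."""
    return R_ON * (w / D) + R_OFF * (1.0 - w / D)

def simulate(v_of_t, t_end=2.0, steps=20000, w0=0.5 * D):
    """Forward-Euler integration of dw/dt = mu_v (R_ON/D) i(t) f(w/D), Eq. (27.8)."""
    dt = t_end / steps
    w, history = w0, []
    for k in range(steps):
        m = memristance(w)
        i = v_of_t(k * dt) / m                      # instantaneous Ohm's law
        w += MU_V * (R_ON / D) * i * window(w) * dt
        w = min(max(w, 0.0), D)                     # state stays inside the film
        history.append(m)
    return history

# 3.6 V peak-to-peak square wave (+/- 1.8 V), mirroring the Fig. 27.4 experiment
square = lambda t, f=1.0: 1.8 if math.sin(2 * math.pi * f * t) >= 0 else -1.8

R = simulate(square)
print(f"resistance swings between {min(R)/1e3:.1f} kOhm and {max(R)/1e3:.1f} kOhm")
```

Sweeping the drive frequency in the same model also reproduces the familiar pinched-hysteresis current-voltage loop that characterizes memristive devices.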
The two memristors of the cell are connected with opposite polarity during the write cycle, as presented in Fig. 27.6, and in series during the read cycle, as shown in Fig. 27.7. These connections are realized by two nMOS pass transistors T1 and T2. A third transistor T3 is used to isolate a cell from the other cells of the memory array during read and write operations. The gate input of T3 is the Comb signal, which is the logical OR of the RD (Read) and WR (Write) signals. For a write operation, RD is set to the LOW state and WR and Comb are set to the HIGH state. As a result, the circuit of Fig. 27.6 is formed.
In this case, the voltage across the memristors is $(V_D - V_{DD}/4)$. It can be either positive ($V_D = V_{DD}$) or negative ($V_D = 0\ \mathrm{V}$) depending on the data. Because the memristors' polarities are opposing, their memristances (resistances) will change in opposite directions. The circuit illustrated in Fig. 27.7 is formed by keeping RD and Comb in the HIGH state. The voltage at node D is now
$$V_D = \left(\frac{V_{DD}}{2} - \frac{V_{DD}}{4}\right)\frac{R_2}{R_1 + R_2} + \frac{V_{DD}}{4} \qquad (27.10)$$
where $R_1$ and $R_2$ are the resistances of M1 and M2, respectively. If a "1" was written during the write cycle, $R_2$ becomes significantly greater than $R_1$ and $V_D$ rises well above $V_{DD}/4$. If a "0" was written, $R_1$ becomes significantly greater than $R_2$, which keeps $V_D$ close to $V_{DD}/4$. A comparator can be used as a sense amplifier to interpret these voltages correctly as HIGH or LOW.
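As a quick numerical check of the read operation, the short Python sketch below evaluates Eq. (27.10) for both stored states. The supply voltage, resistance values, and comparator threshold are illustrative assumptions rather than measurements from the fabricated cell.

```python
VDD = 1.8  # assumed supply voltage (V)

def read_voltage(r1, r2, vdd=VDD):
    """Node voltage at D during the read cycle, Eq. (27.10)."""
    return (vdd / 2 - vdd / 4) * r2 / (r1 + r2) + vdd / 4

def sense(vd, vdd=VDD):
    """Comparator acting as a sense amplifier: threshold slightly above VDD/4."""
    return 1 if vd > vdd / 4 + 0.05 * vdd else 0

# Stored '1': R2 >> R1, so VD approaches VDD/2; stored '0': R1 >> R2, so VD stays near VDD/4
print(sense(read_voltage(r1=20e3, r2=100e3)))   # -> 1
print(sense(read_voltage(r1=100e3, r2=20e3)))   # -> 0
```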
27.6 SUMMARY
In this chapter, a memristor-based design for a Static Random Access Memory (SRAM) cell is described. Recent studies show that the write time can be considerably reduced by combining cutting-edge manufacturing processes with memristor-based Resistive RAM (RRAM). The memristor-based SRAM can be regarded as a combination of new technologies, and this memory has the potential to open a new door in the area of memory architecture.
REFERENCES
[1] L. Chua, “Memristor-the missing circuit element”, IEEE Trans. Circuit Theory, vol.
18, no. 5, pp. 507–519, 1971.
[2] D. B. Strukov, G. Snider, D. Stewart and R. Williams, “The missing memristor found”,
Nature, vol. 453, no. 7191, pp. 80–83, 2008.
[3] G. Chen, “Leon Chua’s memristor”, IEEE Circuits Syst. Mag., vol. 8, no. 2, pp. 55–56,
2008.
[4] Y. V. Pershin and M. D. Ventra, “Spin memristive systems: Spin memory effects in
semiconductor spintronics”, Phys. Rev. B, vol. 78, no. 11, pp. 113309-1–113309-4,
2008.
[5] K. Witrisal, “Memristor based stored reference receiver-the UWB solution”, Electron.
Lett., vol. 45, no. 14, pp. 713–714, 2009.
[6] S. Shin, K. Kim and S. M. Kang, “Memristor based fine resolution programmable
resistance and its applications”, Proc. IEEE Int. Conf. Commun. Circuits Syst., pp.
948–951, 2009.
[7] D. Varghese and G. Gandhi, “Memristor based high linear range differential pair”, Proc.
IEEE Int. Conf. Commun., Circuits Syst., pp. 935–938, 2009.
[8] Y. V. Pershin and M. DiVentra, “Practical approach to programmable analog circuits
with memristors”, IEEE Trans. Circuits Syst. I, vol. 57, no. 8, pp. 1857–1864, 2010.
[9] S. Shin, K. Kim and S. M. Kang, “Memristor applications for programmable analog
ICs”, IEEE Trans. Nanotechnol., vol. 10, no. 2, pp. 266–274, 2011.
[10] D. B. Strukov and S. Williams, “Exponential ionic drift: Fast switching and low
volatility of thin film memristors”, Appl. Phys. A, Mater. Sci. Process., vol. 94, no. 3,
pp. 515–519, 2009.
[11] N. Y. Joglekar and S. J. Wolf, “The elusive memristor: Properties of basic electrical
circuits”, Eur. J. Phys., vol. 30, no. 4, pp. 661–675, 2009.
[12] E. Linn, R. Rosezin, C. Kügeler and R. Waser, “Complementary resistive switches for
passive nanocrossbar memories”, Nature Mater., vol. 9, pp. 403–406, 2010.
[13] P. Junsangsri and F. Lombardi, “A memristor-based memory cell using ambipolar
operation”, Proc. IEEE 29th Int. Conf. Comput. Design, pp. 148–153, 2011.
[14] K. Eshraghian, K. R. Cho, O. Kavehei, S. K. Kang, D. Abbott and S. M. S. Kang,
“Memristor MOS content addressable memory (MCAM): Hybrid architecture for
future high performance search engines”, IEEE Trans. Very Large scale Integr. (VLSI)
Syst., vol. 19, no. 8, pp. 1407–1417, 2011.
[15] S. S. Sarwar, S. A. N. Saqueb, F. Quaiyum and A. B. M. H. Rashid, “Memristor-Based
Nonvolatile Random Access Memory: Hybrid Architecture for Low Power Compact
Memory Design”, IEEE Trans. Science and Technology, vol. 1, no. 3, pp. 29–34,
2013.
CHAPTER 28
A Fault Tolerant Approach to Microprocessor Design
28.1 INTRODUCTION
High-quality verification and testing is a vital step in the design of a successful microprocessor product. Designers must verify the correctness of large, complex systems and ensure that manufactured parts work reliably in varied (and occasionally adverse) operating conditions. If they are successful, users will trust that when the processor is put to a task it will render correct results. If they are unsuccessful, the design can falter, often with serious repercussions ranging from bad press, to financial damage, to loss of life. The challenges that must be overcome to build a reliable microprocessor design are great. There are many sources of error, each requiring careful attention during design, verification, and manufacturing. The faults that can reduce reliability are broadly classified into three categories: design faults, manufacturing faults, and operational faults.
To reduce design errors, designers employ various techniques to improve the quality of verification, including co-simulation, coverage analysis, random test generation, and model-driven test generation.
Another popular technique, formal verification, uses equivalence checking to compare a design under test with the specification of the design. The advantage of this method is that it works at a higher level of abstraction, and thus can be used to check a design without exhaustive simulation. The drawback of the approach is that the design and the instruction set architecture it implements need to be formally specified before the process can be automated. Complex modern designs have outpaced the capabilities of current verification techniques. For example, a microprocessor with 32-bit registers, 8 KB instruction and data caches, and 300 pins cannot be fully examined with simulation-based testing: the design has a test space with at least $2^{132396}$ starting states and up to $2^{300}$ transition edges emanating from each state. While formal verification has improved the detection of design faults, full formal verification is not possible for complex, dynamically scheduled microprocessor designs. To date, the approach has only been demonstrated for in-order issue pipelines or simple out-of-order pipelines with small window sizes. Complete formal verification of complex modern microprocessors with out-of-order issue, speculation, and large instruction windows is currently an intractable problem.
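The quoted exponents can be reproduced with simple arithmetic. The sketch below assumes 32 registers of 32 bits each and treats each of the 300 pins as one bit of externally visible state; these counts are assumptions made only to illustrate how quickly the test space grows.

```python
# Rough size of the simulation test space for the example microprocessor
reg_bits   = 32 * 32                 # assumed 32 registers x 32 bits
cache_bits = 2 * 8 * 1024 * 8        # 8 KB instruction cache + 8 KB data cache
pin_bits   = 300                     # one bit of externally visible state per pin

state_bits = reg_bits + cache_bits + pin_bits
print(f"starting states >= 2^{state_bits}")    # 2^132396
print("transition edges per state <= 2^300")   # one edge per combination of the 300 inputs
```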
in a CMOS layout, as a permanent fault; however, this fault can be cleared by powering
down the system.
Unlike permanent faults, intermittent faults do not appear continuously. They appear and
disappear but their manifestation is highly correlated with stressful operating conditions.
Examples of this type of fault include power supply voltage noise or timing faults due
to inadequate cooling. Data-dependent design errors also fall into this category. These
implementation errors are perhaps the most difficult to find because they require specific
directed testing to locate. Transient faults appear sporadically but cannot be easily correlated to any specific operating condition. The primary source of these faults is single-event radiation (SER) upsets. SER faults are the result of energetic particle strikes on logic, which can deposit or remove sufficient charge to temporarily turn a device ON or OFF, possibly creating a logic error. While shielding is possible, its physical construction and cost make it an infeasible solution at this time.
Concurrent testing techniques are usually required to detect operational faults, since
their appearance is not predictable. Three of the most popular methods are timers, coding
techniques and multiple execution. Timers guarantee that a processor is making forward
progress, by signaling an interrupt if the timer expires. Coding techniques use extra informa-
tion to detect faults in data. While primarily used to protect storage, coding techniques also
exist for logic. Finally, data can be checked by using a k-ary system, where extra hardware or redundant execution is used to provide a value for comparison.
Deep submicron fabrication technologies (i.e., process technologies with minimum
feature sizes below 0.25 µm) heighten the importance of operational fault detection. Finer
feature sizes result in smaller devices with less charge, increasing their exposure to noise-
related faults and SER. If designers cannot meet these new reliability challenges, they may
not be able to enjoy the cost and speed advantages of these denser technologies.
In this chapter, an on-line testing approach called dynamic verification is presented that addresses many of the reliability challenges facing future microprocessor designs. The
solution inserts an on-line checking mechanism into the retirement stage of a complex
microprocessor. The checker monitors the results of all instructions committed by the
processor. If no error is found in a computation, the checker allows the instruction result
to be committed to architected register and memory state. If any results are found to
be incorrect, the checker fixes the errant result and restarts the processor core with the
correct result. The checker processor is quite simple, lending itself to high-quality formal
verification and electrically robust designs. The addition of the checker to the pipeline causes
virtually no slowdowns to the core processor, and area and power overheads for a complete
checker design are quite modest. The simple checker provides significant resistance to design
and operational faults and provides a convenient mechanism for efficient and inexpensive
detection of manufacturing faults. Design verification is simplified because the checker
concentrates verification onto itself. Specifically, if any design errors remain in the core
processor, they will be corrected (albeit inefficiently) by the checker processor. Significant
resistance to operational faults is also provided. A low-cost, high-coverage technique is also introduced for detecting and correcting SER-related faults. The approach uses the checker processor to detect energetic particle strikes in the core processor; for the checker processor itself, a re-execute-on-error technique is developed that allows the checker to check itself. Finally, it is demonstrated how the checker can be used to implement a
low-cost hierarchical approach to manufacturing testing. Simple, low-cost BIST tests the checker module, and then the checker can be used as the tester for the core processor to detect manufacturing errors. The approach could significantly reduce the cost of late-stage testing of microprocessors while at the same time reducing the time it takes to test parts.
Thus it is possible to build an in-order checker without speculation that can match the retirement bandwidth of the core. In the event the core produces a bad prediction value (e.g., due to a design error), the checker processor will fix the errant value and flush
all internal state from the core processor, and restart it after the errant instruction. Once
restarted, the core processor will resynchronize with the correct state of the machine as it
reads register and memory values from non-speculative storage.
To eliminate the possibility of storage structural hazards, the checker processor has
its own register file and instruction and data caches. A small dedicated data cache for the
checker processor, called the L0 cache, is loaded with whatever data is touched by the core
processor; it taps off the output port of the L1 cache. This prefetching technique greatly
reduces the number of misses experienced by the checker. However, if the checker processor
misses in the L0 cache, it blocks the entire checker pipeline, and the miss is serviced by the
core L2 cache. Cache misses are rare for the checker processor even for very small caches,
because the high-quality address stream from the core processor allows it to manage these
resources very efficiently. Store Queues are also added to both the core and checker (cSTQ
and dSTQ in Fig. 28.1) to increase performance.
The resulting dynamic verification architecture should benefit from a reduced burden of
verification, as only the checker needs to be completely correct. Since the checker processor
will fix any errors in the instructions that are to be committed, the verification of the core
is reduced to simply the process of locating and fixing commonly occurring design errors
that could affect system performance. Since the complex core constitutes a major testing
problem, relaxing the burden of correctness of this part of the design can yield large
verification time savings. To maintain a high-quality checker processor, formal verification is leveraged to ensure correct function, and extensive checker processor Built-In Self-Test (BIST) is used to guarantee a successful implementation.
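The overall control flow of dynamic verification can be summarized in a few lines. The following Python sketch is a purely behavioral illustration (not the Verilog model discussed later): a possibly buggy core produces a result for each committed instruction, a simple checker re-executes the instruction, and on a mismatch the checker's value is committed while the core is flushed and restarted after the errant instruction. The two-operand integer instruction format is an assumption made for brevity.

```python
import operator

# Reference semantics used by the checker (assumed two-operand integer subset)
CHECKER_ALU = {"add": operator.add, "sub": operator.sub,
               "and": operator.and_, "or": operator.or_}

def checker_execute(instr):
    """Re-execute the instruction independently of the core (EXCHECK-style stage)."""
    op, a, b = instr
    return CHECKER_ALU[op](a, b)

def dynamic_verification(program, core_execute):
    """Commit core results only after the checker agrees; otherwise fix and restart."""
    committed, pc = [], 0
    while pc < len(program):
        instr = program[pc]
        core_result = core_execute(instr)      # value predicted by the complex core
        good_result = checker_execute(instr)   # value recomputed by the simple checker
        if core_result != good_result:
            # Errant instruction: commit the checker's value; in hardware the core
            # would be flushed and restarted after this instruction.
            core_result = good_result
        committed.append(core_result)
        pc += 1
    return committed

# A core with a deliberate design bug in 'sub' (illustrative only)
buggy_core = lambda i: i[1] + i[2] if i[0] in ("add", "sub") else CHECKER_ALU[i[0]](i[1], i[2])
prog = [("add", 2, 3), ("sub", 10, 4), ("and", 12, 10)]
print(dynamic_verification(prog, buggy_core))  # -> [5, 6, 8]
```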
Figure 28.2: Checker Processor Pipeline Structure for a Single Wide Checker Processor.
signals. EXCHECK re-executes the functional unit operation to verify core computation.
Finally, the MEMCHECK verifies any loads by accessing the checker memory hierarchy.
If each prediction from the core processor is correct, the result of the current instruction
(a register or memory value) as computed by the checker processor is allowed to retire to
non-speculative storage in the commit (CT) stage of the checker processor. In the event
any prediction information is found to be incorrect, the bad prediction is fixed, the core
processor is flushed, and the core and checker processor pipelines are restarted after the
errant instruction. Core flush and restart use the existing branch speculation recovery
mechanism contained in all modern high-performance pipelines.
Figure 28.3: Checker Processor Pipeline Structure for a Checker Processor in Check Mode.
As shown in Fig. 28.3 and Fig. 28.4, the routing MUXes can be configured to form a parallel checker pipeline or a recovery pipeline, respectively. In recovery mode the pipeline
is reconfigured into a serial pipeline. In this mode, stage computations are sent to the next
logical stage in the checker processor pipeline, rather than used to verify core predictions.
Only one instruction is allowed to enter the recovery pipeline at a time. As such, the
recovery pipeline configuration does not require bypass datapaths or complex scheduling
logic to detect hazards. Processing performance for a single instruction in recovery mode
will be quite poor, but as long as faults are infrequent there will be no perceivable impact
on program performance. Once the instruction has retired, the checker processor re-enters
normal processing mode and restarts the core processor after the errant instruction.
Fig. 28.2 also illustrates that the checking and recovery modes use the same hardware modules, thereby reducing the area cost of the checker. Each stage only requires its intermediate pipeline inputs; whether these come from the core processor prediction stream or from the previous stage of the checker processor pipeline (in recovery mode) is irrelevant to the operation of the stage. This attribute makes the control and implementation of individual stages quite simple. In recovery mode the check logic is superfluous, as the inputs will always be correct; however, no reconfiguration of the check logic is required, as it will never declare a fault during recovery. Pipeline scheduling is trivial: if any checker pipeline is blocked for any reason, all checker processor pipelines are stalled. This simplifies control of the checker processor and eliminates the need for instruction buffering or complex scheduling logic.
Figure 28.4: Checker Processor Pipeline Structure for a Checker Processor in Execute
Mode.
A Verilog model for the integer subset of the Alpha instruction set was constructed. The design was tested with small hand-coded assembly programs to verify correct checker operation. Floating-point instructions were not implemented due to time constraints; however, floating-point overheads were estimated using measurements from an unrelated physical design in the same technology. The Synopsys toolset was used in conjunction with a 0.25 µm library to produce an unpipelined, fully synthesized design that ran at 288 MHz. Both this approach and semi-custom design are considered as means of matching the speed of current microprocessors. The Synopsys toolset generates area estimates for a synthesized design given the cells used and the estimated interconnect. Boundary optimization, which exploits logical redundancy between modules, was turned off so that each module's contribution to the overall area could be assessed. The toolset produced a single-wide checker of 1.65 mm², which is fairly small in comparison to the 205 mm² area of the Alpha 21264 microprocessor: a single-wide checker amounts to only 0.8% of the area of the Alpha chip. However, as mentioned before, only a subset of the instruction set was implemented. The checker area breakdown chart shows that the EXCHECK module, which houses the functional units, contributes the most to the overall size.
The floating-point modules should be slightly larger than their integer counterparts due to the need to track exponents. CACTI projected that the I-cache and D-cache would be 0.57 mm² and 1.1408 mm², respectively. These values are for the 512-byte I-cache and 4 KB D-cache described in the previous section. Accounting for the caches, the total checker area rises to roughly 10 mm², still much smaller than a complete microprocessor in the same technology.
This reduces the likelihood of this case occurring, at the expense of slightly slower
fault recovery performance. Case B can occur in one of two ways; the first is when an
operational error causes an equivalent error to occur in both the core and the checker. This
is the product of two small probabilities. Given a good design, the likelihood of this event
is probabilistically small. However, replication of the functional units may be applied to
further reduce this probability. The other possibility is that either the comparison logic
or the control logic suffers an error. TMR can be employed on the control logic to help reduce the probability that this error occurs. In a system with a checker, the probability of a failure, shown in Equation 28.1, is always the product of at least two unlikely events. For
systems that require ultra-high reliability the checker can be replicated until the probability
is acceptable for the application. This is a low cost redundant execution approach as only
the small checker needs to be replicated.
Fig. 28.5 illustrates how TMR can be applied to the control logic of the checker to
provide better reliability in the event of an operational error in the checker control. Again, a
design analysis was done by synthesizing a Verilog model. The area estimates previously given show that the simple control logic of the checker contributes only a small portion of the overall checker size. The addition of two more control logic units and voting logic consumes only an extra 0.12 mm². TMR still has a single point of failure within the voter logic,
but the chance of a strike is reduced due to the area difference as compared to the control
unit logic. Additionally, the voter logic can be equipped with transistors that are sized large
enough to have a high tolerance of environmental conditions.
Figure 28.5: Checker Processor Pipeline Structure with TMR on the Control Logic.
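A majority voter over triplicated control logic is straightforward to express. The sketch below is a behavioral Python illustration of the TMR arrangement of Fig. 28.5, performing a bitwise 2-of-3 vote over three copies of a control word; the word width and the particular fault injected are assumptions made for illustration.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three copies of a control word."""
    return (a & b) | (b & c) | (a & c)

# One copy suffers a transient bit flip; the voted output is still correct.
golden = 0b1011_0100
upset  = golden ^ 0b0000_1000          # single-event upset flips one bit in copy b
assert tmr_vote(golden, upset, golden) == golden
print(bin(tmr_vote(golden, upset, golden)))
```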
techniques. A simple scan chain can be synthesized in latches that supply the data to the
checker. This type of testing can achieve enhanced fault coverage over the core, since the
checker circuitry is significantly simpler. BIST is another option that holds great potential
for this type of circuit.
As shown in Fig. 28.6, built-in-self-test (BIST) hardware can be added to test for checker
manufacturing errors. The BIST hardware uses a Test Generator to artificially stimulate the
checker logic and it passes values into the checker stage latches that hold the instruction for
the checker to validate. The most effective test generator will be a ROM of opcodes, inputs
and expected results. Using this data and the error signals already present in the system,
an efficient non-concurrent test can be performed. Obviously, the ROM should have both good and bad sets of data so that the check modules are fully tested. The number of tests required to cover all faults is a function of the checker logic size and structure. Memories make up another large portion of the design; marching ones and zeros is just one simple way to test a memory.
Figure 28.6: Checker Processor Pipeline Structure with TMR on Control Logic and BIST.
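The ROM-driven non-concurrent test described above amounts to replaying stored tuples of opcode, inputs, and expected result through the check modules and confirming that the error signal fires exactly when it should. The Python sketch below is a behavioral illustration only; the ROM contents are made-up vectors, deliberately including bad entries so that both outcomes of the comparator are exercised.

```python
# Test-vector ROM: (opcode, operand_a, operand_b, claimed_result, should_flag_error)
TEST_ROM = [
    ("add", 7, 5, 12, False),   # good vector: checker must not flag an error
    ("add", 7, 5, 13, True),    # bad vector: checker must flag an error
    ("and", 6, 3, 2,  False),
    ("and", 6, 3, 7,  True),
]

ALU = {"add": lambda a, b: a + b, "and": lambda a, b: a & b}

def bist_pass(rom=TEST_ROM):
    """Return True if the check logic flags exactly the vectors it is supposed to."""
    for op, a, b, claimed, expect_error in rom:
        error = ALU[op](a, b) != claimed      # comparator inside the check module
        if error != expect_error:
            return False                      # the check logic itself is faulty
    return True

print("checker BIST:", "PASS" if bist_pass() else "FAIL")
```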
Although difficult to quantify, it is likely that the manufacturing process could benefit
from the checker processor as well. In a fab where part production is limited by the
bandwidth and latency of testers, the checker processor could improve testing performance.
Once the checker is fully tested using internal BIST mechanisms, the checker itself can test
the remaining core processor circuitry. No expensive external tester is required, only power
and a simple interface with ROM to hold core testing routines and a simple I/O interface to
determine if the checker passed all core processor tests.
28.5 SUMMARY
Many reliability challenges confront modern microprocessor designs. Functional design
errors and electrical faults can impair the function of a part, rendering it useless. While
functional and electrical verification can find most of the design errors, there are many
examples of non-trivial bugs that find their way into the field. Additional faults due to
manufacturing defects and operational faults such as energetic particle strikes must also be
overcome. Concerns for reliability grow in deep submicron fabrication technologies due
to increased design complexity, additional noise-related failure mechanisms, and increased
exposure to natural radiation sources. To counter these reliability challenges, the use of
dynamic verification is presented which is a technique that adds a checker processor to
the retirement phase of a processor pipeline. If an incorrect instruction is delivered by the core processor, the checker processor will fix the errant computation and restart the core
processor using the processor’s speculation recovery mechanism.
Dynamic verification focuses the verification effort into the checker processor, whose
simple and flexible design lends itself to high-quality functional verification and a robust
implementation. A detailed analysis of a prototype checker processor design is presented.
The simple checker can easily keep up with the complex core processor because it uses pre-
computation in the core processor to clear the branch, data, and communication hazards that
could otherwise slow the simple checker pipeline. Finally, novel extensions to the baseline design are presented that improve coverage for operational faults and manufacturing fault
detection. One such approach is to leverage the fault tolerance of the core to implement self-
tuning core circuitry. By employing an adaptive clocking mechanism, it becomes possible
to overclock core circuitry, reclaiming design and environmental margins that nearly always
exist.
REFERENCES
[1] M. Williams, “Faulty Transmeta Crusoe Chips Force NEC to Recall 300 Laptops”,
The Wall Street Journal, 2000
[2] P. Bose, T. Conte and T. Austin, “Challenges in processor modeling and validation”,
IEEE Micro, pp. 2–7, 1999.
[3] R. Grinwald, “User defined coverage, a tool supported methodology for design ver-
ification”, Proceedings of the 35th ACM/IEEE Design Automation Conference, pp.
1–6, 1998.
[4] M. K. Srivas and S. P. Miller, “Formal Verification of an Avionics Microprocessor”,
SRI International Computer Science Laboratory Technical Report CSL, 1995.
[5] M. C. McFarland, “Formal Verification of Sequential Hardware: A Tutorial”, IEEE
Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol 12, no. 5.
1993.
[6] J. Sawada, “A table based approach for pipelined microprocessor verification”, Proc.
of the 9th International Conference on Computer Aided Verification, 1997.
[7] H. Al-Asaad and J. P. Hayes, “Design verification via simulation and automatic test
pattern generation”, Proceedings of the International Conference on Computer-Aided
Design, IEEE Computer Society Press, pp. 174–180, 1995.
[8] B. T. Murray and J. P. Hayes, “Testing ICs: Getting to the Core of the Problem”,
Computer, vol. 29, no.11, pp. 32–45, 1996.
[9] M. Nicolaidis, “Theory of Transparent BIST for RAMs”, IEEE Trans. Computers,
vol. 45, no. 10, pp. 1141–1156, 1996.
[10] M. Nicolaidis, “Efficient UBIST Implementation for Microprocessor Sequencing
Parts”, J. Electronic Testing: Theory and Applications, vol. 6, no. 3, pp. 295–312,
1995.
[11] S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultane-
ous Multithreading”, Proceedings 27th Annual Intl. Symp. Computer Architecture
(ISCA), 2000.
[12] E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Mi-
croprocessors”, Proceedings of the 29th Fault-Tolerant Computing Symposium, 1999.
[13] S. Chatterjee, C. Weaver and T. Austin, “Efficient Checker Processor Design”, In
Micro-33, 2000.
[14] H. Al-Asaad, J. P. Hayes and B. T. Murray, “Scalable test generators for high-speed
datapath circuits”, Journal of Electronic Testing: Theory and Applications, vol. 12,
no. 1/2, 1998.
[15] J. Gaisler, “Evaluation of a 32-bit microprocessor with built-in concurrent error de-
tection”, Proceedings of the 27th Fault-Tolerant Computing Symposium, 1997.
[16] Y. Tamir and M. Tremblay, “High-performance fault tolerant VLSI systems using
micro rollback”, IEEE Trans. on Computers, vol. 39, no. 4, pp. 548–554, 1990.
CHAPTER 29
Applications of VLSI Circuits and Embedded Systems
Our world is innovation-driven. Yet the newer generation does not always realize that when PCs originally appeared, they were so huge that some of them occupied an entire room. The reason was that they were built from large vacuum tubes; the size was enormous, yet the speed was extremely slow. Before long, designers realized that they did not need such large computers and that the size ought to be smaller. The invention of the IC (Integrated Circuit) made this happen, and not long after, VLSI (Very-Large-Scale Integration) was brought to light.
VLSI is the process of creating an IC by combining billions of transistors into a single chip. VLSI began during the 1970s, when complex semiconductor and communication technologies were being developed. The microprocessor is a VLSI device. Before VLSI technology, most ICs had a restricted set of functions they could perform. An electronic system may comprise a CPU, ROM, RAM, and other logic components; with the advancement of technology, VLSI lets IC designers include all of these in one chip.
Embedded systems are special-purpose computing systems embedded in application
environments or in other computing systems and provide specialized support. The decreas-
ing cost of processing power, combined with the decreasing cost of memory and the ability
to design low-cost systems on chip, has led to the development and deployment of embedded
computing systems in a wide range of application environments. Examples include network
adapters for computing systems and mobile phones, control systems for air conditioning,
industrial systems, and cars, and surveillance systems. Embedded systems for networking
include two types of systems required for end-to-end service provision: infrastructure (core
network) systems and end systems. The first category includes all systems required for the
core network to operate, such as switches, bridges, and routers, while the second category
includes systems visible to the end users, such as mobile phones and modems.
Faurecia, for example, has been utilizing autonomous vehicles from Mobile Industrial Robots to improve the efficiency of its logistics. The organizations have worked together to redesign Faurecia's production line layouts to permit the robots to navigate their routes using their internal maps. Workers collaborate with the robots through cell phones, tablets, or PC interfaces, informing them of their requirements with the press of a button.
Figure 29.2: Improvement of Productivity: Connected Devices Mitigate Human Error [6].
For example, Machine Metrics has been working with Fastenal, the Minnesota-based producer of fasteners and tools, to apply a smart device that monitors machine-shop operations. The product can connect to any modern CNC machine by coupling the Machine Metrics Edge to the Ethernet port of the control, while older machines can share data directly to the cloud using digital and analog I/O modules.
In the Fastenal case, the product gave insight into machine utilization in real time, by day, week, and month, to reveal opportunities for efficiency improvements. This delivered an 11% increase in machine utilization in the first three months, says Machine Metrics.
A small to medium-sized manufacturing plant may contain several hundred operator tools, in different shapes and sizes, which are used for a large number of functions. For an enormous industrial facility, that number could rise to thousands. Advances in VLSI devices mean that these tools need never be used incorrectly outside a particular set of operational boundaries. That is delivering huge improvements in productivity. In Fig. 29.2, we can see that smart machines are boosting productivity in manufacturing plants.
Airbus and Bosch have led the way in this area, with the Factory of the Future initiative utilizing connected smart tools. In aerospace plants, these processes can span several work cells and can be performed by various operators. In this way, says Airbus, there is immense potential for improving these processes by making hand tools smarter. Other manufacturing organizations have followed the same pattern. Workers at GE Aviation have, for instance, been combining WiFi-enabled torque wrenches with mixed-reality headsets to guarantee that bolts are tightened optimally. This is about improving productivity and profitability while also boosting product quality.
Manufacturing, by its very nature, requires a great deal of energy, which can represent an enormous share of operating expenses. That is why factory owners and managers are increasingly turning to smart VLSI solutions that interface sensors, actuators, controllers, and other hardware to enable the monitoring of energy use, lighting, HVAC, and fire-safety systems. This information can likewise be combined with broader datasets such as weather forecasting and financial data, for example the cost of power and other utilities. This kind of design is growing in manufacturing plants, to make buildings more intelligent, more sustainable, and more efficient.
BAE Systems has, for instance, worked with Schneider to introduce its EcoStruxure building platform at one of its manufacturing facilities in the UK. In this specific case, the EcoStruxure platform is being used to monitor HVAC in the warehouse and office areas, alongside other equipment including destratification fans, heat recovery units, and electric panel heaters. In terms of system configuration, two panels were built by systems integrator Aimteq for the offices and warehouse, containing Schneider SmartX AS-P controllers and I/O modules, as well as intuitive touchscreen tablet displays. Without the help of smart circuits and electronics, this would be impossible to achieve.
Smart vision systems can, for example, check the accuracy and readability of labels, barcodes, or QR codes. This data can then be fed back to earlier stages in the manufacturing line, permitting operators to recognize and classify the underlying cause of an issue before corrections are made. Over time, digital analytics can be used to refine and improve the production process.
This kind of smart vision technology is being used across manufacturing plants to monitor the quality of a wide range of items including electronic devices, consumer goods, and metal parts. For instance, the automotive component supplier Getrag has been using such a system to inspect gear teeth and clutch body parts, providing engineers with ongoing information on non-conforming parts and on patterns emerging from the manufacturing process. The aim is to improve product quality, decrease excessive re-work, and enhance brand image.
The advantages of electronic circuits for manufacturers do not end once products have been dispatched. In reality, transportation and logistics have become one of the primary beneficiaries of digitalization, with asset-tracking sensors able to give continuous data on asset location, ambient temperature, and movement over networks such as LoRa and Narrowband IoT. These networks stream VLSI sensor data to the cloud securely and seamlessly, depending on what is required.
Recently, a joint venture between Hoopo and Polymer Logistics delivered smart tracking of containers utilizing LoRa, which means it can locate assets without the requirement for GPS. This keeps the device's power consumption low and allows extended battery life while giving information on assets continuously.
As a review, CPUs work using prediction units, registers, and execution units; this organization is known as the architecture. Registers hold pieces of data or pointers to memory, frequently in 64-bit groups. Execution units do something with one or more registers, for example reading from and writing to memory or performing arithmetic. Many execution units can be used within the CPU, each taking a clock cycle or two to finish its function.
CPUs are adaptable enough to suit a wide variety of tasks. Performance can be scaled by changing the clock speed (in GHz), or the design can be changed to accomplish more with each clock cycle. For example, the AMD Ryzen 9 3950X is a 16-core, 32-thread CPU in AMD's latest third generation of Ryzen desktop processors. Advances in 7-nanometer VLSI technology enable AMD to make this sort of processor, which was never thought of before. An AMD Ryzen CPU is shown in Fig. 29.5.
VLSI technology has single-handedly impacted this field with all of its advancements. Without VLSI technology, we would not be using smartphones with SoCs in our day-to-day lives.
The System-on-a-Chip (SoC) is like the brain of our cell phone. Joining different parts into a single chip saves space, cost, and power consumption. SoCs interface with other parts as well, for example cameras, a display, RAM, and storage. An SoC is the mind of our cell phone that handles everything from the Android OS to detecting when a button is pressed. Some of the most important parts of an SoC are given below.
Central Processing Unit (CPU): The "brains" of the SoC; it runs the greater part of the code for the Android OS and a large portion of the applications.
Graphics Processing Unit (GPU): It handles graphics-related tasks, for example rendering an application's UI and 2D/3D gaming.
Image Processing Unit (IPU): It converts data from the phone's camera into image and video files.
Digital Signal Processor (DSP): It handles more mathematically intensive workloads than a CPU, such as decompressing music files and analyzing gyroscope sensor data.
Neural Processing Unit (NPU): It is used in high-end cell phones to accelerate artificial intelligence (AI) tasks, including voice recognition and camera processing.
Video encoder/decoder: It handles the power-efficient conversion of video files and formats.
Modems: They convert wireless signals into data that the phone understands, and include 4G LTE, 5G, WiFi, and Bluetooth modems.
In the cell phone space, Qualcomm, Samsung Semiconductor, Huawei's HiSilicon, and MediaTek are the four biggest names in the business. Odds are that our cell phone has a chip from one of these companies in it. Qualcomm is the biggest supplier of cell phone SoCs, delivering chips for most of the flagship, mid-level, and even low-end cell phones released every year. Qualcomm's SoCs fall under the Snapdragon branding. Premium chips featuring the company's best technology go under the Snapdragon 800 series, for example the most recent Snapdragon 865. Mid and upper-mid-level products are marked with Snapdragon 600 and 700 series names respectively, for example the Snapdragon 765, which sports 5G connectivity. Lower-level products are named under the 400 series.
NPUs are designed to work on large amounts of data at a time. This is different from the math and data types used by CPUs. For this reason, the AI field has gathered a lot of pace, as NPUs and other smart hardware accelerate the outcomes of AI and neural networks.
Aviation
In aviation, fuzzy logic is utilized in the following areas:
Vehicle Industry
In the vehicle industry, fuzzy logic is utilized in the following areas:
Business
In business, fuzzy logic is utilized in the following areas:
Safety Fields
In the safety and defense fields, fuzzy logic is utilized in various aspects such as:
Marine
In the marine field also, a lot of usage of fuzzy logic can be seen, for example:
• Boat steering
Clinical
In the medical field, fuzzy logic is also widely used, as follows:
them requires interfaces. So, these are all examples of embedded systems. An example of a router is given in Fig. 29.10.
Glass Industry
PLC controllers have been in use in the glass industry for a considerable length of time. They are utilized to a great extent to control the material ratio as well as to process flat glass. The technology has been progressing throughout the years, and this has created an increased demand for PLC-based control in the glass industry. The production of glass is an intricate and complex process, so the companies involved regularly use PLCs with bus technology in their control systems. Generally speaking, PLCs are applied both to simple data recording in glass production and to advanced quality and position control.
Paper Industry
In the paper industry, PLCs are utilized in various processes, including controlling the machines that produce paper products at high speed. For example, a PLC controls and monitors the production of book pages or newspapers in offset web printing.
Cement Manufacturing
Manufacturing cement involves blending various raw materials in a kiln. The quality of these raw materials and their proportions significantly affect the quality of the end product. To guarantee the use of the correct quality and quantities of raw materials, accurate data regarding such process variables is essential.
Industrial Machinery
A distributed control system comprising PLCs in client mode and configuration software is utilized in the industry's production and management processes. The PLC, specifically, controls ball milling, the coal kiln, and the shaft kiln. Other instances of PLC applications in use in different industries today include water tank quenching systems in the aviation sector, filling machine control systems in the food industry, industrial batch washer control for the textile industry, etc.
The programmable logic circuits used in industrial automation draw heavily on products from top industry manufacturers such as Allen-Bradley and Omron.
29.3 SUMMARY
VLSI (Very Large Scale Integration) circuits and embedded systems have a great impact on our lives through their applications. As a whole, these technologies are remarkable and play an indispensable role in numerous gadgets, equipment, industrial control systems, modern instrumentation, and home appliances, regardless of their nature. Since VLSI circuits and embedded systems control such a large number of devices, organizations know that it is practically impossible to operate their hardware without them. They offer automation that improves safety and efficiency for organizations. For instance, if a construction company has VLSI circuits and embedded systems on its pieces of machinery, the technology could give alerts that one of them needs prompt adjustment before it poses a safety hazard. Besides, many individuals have cutting-edge machines and smartphones in their hands, which work as an enhancement to both work and life. Human supervisors can use such devices to manage operations; if a piece of equipment cannot perform and no alert is received, it becomes a problem. The equipment can remain on one line and perform the same task over and again without making mistakes or requiring breaks, while people monitor it and take steps before a problem arises. Without VLSI circuits and embedded systems, the advancements of automation would not exist at all. The industrial transformation made possible through the IoT (Internet of Things) is making VLSI circuits and embedded systems ever more common in industry.
REFERENCES
[1] Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Very_Large_Scale_
Integration. [Accessed: 21 Sep., 2020].
[2] Tutorialspoint. [Online]. Available: https://www.tutorialspoint.com/vlsi_design
/vlsi_design_digital_system.html. [Accessed: 13 Oct., 2019].
[3] Howtogeek. [Online]. Available: https://www.howtogeek.com/394267/what-do-7nm-
and-10nm-mean-and-why-do-they-matter/. [Accessed: 12 Nov., 2020].
[4] Autonomous Robots: https://upload.wikimedia.org/wikipedia/commons/0/02/
Hannover_-_CeBit_2015_-_DT_Industrie_40_-_Roboter_008.jpg [Accessed: 5 June,
2021].
https://creativecommons.org/licenses/by-sa/3.0/deed.en, CeBIT 2015Deutsche
Telekom (Booth CeBIT)Internet of things, CC-BY-SA-3.0Pictures by Mummelgrum-
mel
[5] Industrial Robots: https://en.wikipedia.org/wiki/Smart_manufacturing#/media/
File:BMW_Leipzig_MEDIA_050719_Download_Karosseriebau_max.jpg
https://creativecommons.org/licenses/by-sa/2.0/de/deed.en
CC BY-SA 2.0 de view terms [Accessed: 15 Jun., 2021].
[6] Airbus. [Online]. Available: https://www.airbus.com/newsroom/news/en/2014/07/airbus-
moves-forward-with-its-factory-of-the-future-concept.html. [Accessed: 25 Feb.,
2020].
[7] EcoStruxure. [Online]. Available: https://www.se.com/ww/en/work/campaign
/innovation/overview.jsp. [Accessed: 30 Mar., 2020].
[8] EETimes. [Online]. Available: https://www.eetimes.com/software-sizes-digitally-
controlled-transmissions/. [Accessed: 28 Feb., 2020].
[9] Polymerlogistics. [Online]. Available: https://www.polymerlogistics.com/. [Accessed:
30 Jun., 2020].
[10] New Atlas. [Online]. Available: https://newatlas.com/health-wellbeing/audi-
exoskeleton-trial-ingolstadt/. [Accessed: 2 Jul., 2020].
[11] Engineering. [Online]. Available: https://www.engineering.com/ElectronicsDesign
/ElectronicsDesignArticles
/ArticleID/5791/Applications-Processors-The-Heart-of-the-Smartphone.aspx.
[Accessed: 13 Jul., 2020].
[12] Qualcomm. [Online]. Available: https://www.qualcomm.com/snapdragon. [Accessed:
18 Jul., 2020].
[13] AMD Ryzen 5. [Online]. Available: https://commons.wikimedia.org/wiki/File:Ryzen_5_
1600_CPU_on_a_motherboard.jpg
https://creativecommons.org/licenses/by-sa/4.0/deed.en [Accessed: 19 June, 2021].
[14] 5G Network. [Online]. Available: https://upload.wikimedia.org/wikipedia/commons/4/43/
Celluar_Antenna_with_tower_for_5G.jpg
https://creativecommons.org/licenses/by-sa/4.0/ [Accessed: 22 June, 2021].
FINAL REMARKS
Digital logic design is a foundation of the field of electrical and computer engineering. Digital logic designers build complex electronic components that use both electrical and computational characteristics. These characteristics may involve power, current, logical function, protocol, and user input. Similarly, VLSI design is used to develop hardware, such as circuit boards and microchip processors. This hardware processes user input, system protocol, and other data in computers, navigational systems, cell phones, or other high-tech systems. This book aims at designing high-performance and cost-effective VLSI circuits, which demands knowledge of all aspects of modern digital design. Since VLSI has moved from an exotic and expensive curiosity to an everyday necessity, researchers have refocused VLSI design from circuit design toward advanced logic and system design. Studying VLSI design as a system design discipline requires a book like this to consider a somewhat different set of areas than the study of circuit design does. In this book, the treatment of the different topics is balanced on the one hand by discussions of circuits and on the other hand by the architectural choices.
A Binary Decision Diagram (BDD) is a rooted, directed acyclic graph with one or two terminal nodes of out-degree zero, labeled 0 or 1, and a set of variable nodes of out-degree two. BDDs and their variants are a class of data structures that has seen successful
application in the formal verification of systems with large state spaces. Variations of BDDs
have been described in this book to support quantitative calculations of the type that are
required for verification and performance analysis of systems. For example, Multi-Terminal
BDDs (MTBDDs) are a generalization of BDDs in which there can be multiple leaf nodes,
each labeled by a distinct value. An MTBDD representation of a vector can be very
compact, assuming that the set of distinct values appearing as entries in the vector is small.
MTBDDs can also be used to represent matrices, and computations such as vector/matrix
multiplication can be performed efficiently in terms of MTBDD representations. A variety
of other variations of BDDs has been described in this book, including shared MTBDDs,
multi-valued decision diagrams (MDDs), and multi-valued Pseudo-Kronecker DDs.
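As a small illustration of the data structure, the hedged Python sketch below builds a reduced decision diagram over a fixed variable order using a unique table. Because terminal nodes may carry arbitrary values, the same code covers ordinary BDDs (terminal values 0 and 1) and MTBDDs; it is a teaching sketch, not one of the CAD packages referred to in this book.

```python
# Unique table guarantees that structurally identical nodes are shared.
_unique = {}

def node(var, low, high):
    """Reduced node: skip the test if both branches agree, share duplicate nodes."""
    if low == high:
        return low
    key = (var, low, high)
    return _unique.setdefault(key, key)

def from_function(f, n_vars, assignment=()):
    """Build a (MT)BDD for f over variables x0..x(n-1) in that fixed order."""
    if len(assignment) == n_vars:
        return f(assignment)                # terminal: any value, not just 0 or 1
    v = len(assignment)
    return node(v,
                from_function(f, n_vars, assignment + (0,)),
                from_function(f, n_vars, assignment + (1,)))

def evaluate(dd, assignment):
    """Follow the decision diagram down to a terminal value."""
    while isinstance(dd, tuple):
        var, low, high = dd
        dd = high if assignment[var] else low
    return dd

# MTBDD of a small integer-valued function: f(x0, x1, x2) = x0 + 2*x2
f = from_function(lambda a: a[0] + 2 * a[2], 3)
print(evaluate(f, (1, 0, 1)))   # -> 3
```

Because the example function does not depend on x1, the reduction rule removes that test entirely, which is exactly the kind of sharing that keeps BDD-based verification tractable for large state spaces.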
Multiple-valued circuits can implement the logic directly by using multiple-valued
signals, or the logic can be implemented indirectly with binary circuits, by using more
than one binary signal to represent a single multiple-valued signal. In recent years, there
have been major advances in integrated circuit technology which have both made feasible
and generated great interest in electronic circuits which employ more than two discrete
levels of signal. Such circuits, called multiple-valued logic circuits, offer several potential
opportunities for the improvement of modern VLSI circuit designs. The fact that some
commercial products already benefited from multiple-valued logic is believed to be a first
step towards recognition of the role of VLSI circuits in the next generation of electronic
systems. Multi-valued logic systems have attracted the attention of a number of researchers.
Some find the area fascinating for its wealth of logic structures, whose richness transcends
the familiar binary environment. Others work on potential practical applications to demon-
strate that the appeal is not only academic, but that there exists a host of opportunities for
improvement of digital systems through the utilization of higher-radix methods. Synthesis
techniques have been developed to facilitate the design of multi-valued networks along the
lines of binary switching theory. Some of the notable electronic realizations of multi-valued
elements and basic functions are described in this book. Computer applications such as
image processing require very high arithmetic processing rates, and it is therefore neces-
sary to explore potential areas of circuit design that could increase the processing rate of
an Integrated Circuit (IC).
An IC that contains large numbers of gates, flip-flops, etc. which can be configured
by the user to perform different functions is called a Programmable Logic Device (PLD).
The internal logic gates and connections of PLDs can be configured by a programming
process. PLDs contain multiple logic elements such as flip-flops as well as AND and OR
gates which can be configured by the user. The internal logic and connections may be
modified by the user during the programming process which is done using a dedicated
software application. The process of entering the information into these devices is known
as programming. Basically, users can program these devices or ICs electrically in order to
implement the Boolean functions based on the requirement. Here, the term programming
refers to hardware programming but not software programming. Today’s most prominent
PLD technology, known as FPGA (Field-Programmable Gate Array), is used in an increas-
ing number of application domains, such as the telecom industry, the automotive electronics
sector or automation technology and the recent market studies show a continuous demand
for these sophisticated microelectronic devices in the future. PLDs have enabled many
users, designers, and manufacturers to come up with incredibly innovative and phenomenal
technology which is centered on producing logic based solutions across a variety of appli-
cations. The reduced power consumption, lower cost, and integration of features that are simply not possible with most of the other alternatives all make PLDs a favored and preferred option for users from a number of different backgrounds and industries.
Digital logic circuits are often known as switching circuits, because in digital circuits the
voltage levels are assumed to be switched from one value to another value instantaneously.
These circuits are termed as logic circuits, as their operation obeys a definite set of logic
rules. The simplest forms of digital circuits are built from logic gates, the building blocks
of the digital computer. Since most of the physical variables encountered in the real world are continuous, digital circuits approximate continuous functions with strings of bits; the more bits that are used, the more accurately the continuous signal can be represented. For example, if 16 bits are used to represent a varying voltage, the signal can be assigned one of more than 65,000 different values. Digital circuits are more immune to noise than analog circuits, and digital signals can be stored and duplicated without degradation. For simple digital circuits, the conventional method of designing circuits can easily be applied, but for complex digital circuits the conventional method is not fruitfully applicable because it is time-consuming. On the contrary, genetic programming is used mostly for automatic program generation. Modern approaches for designing arithmetic circuits, and digital circuits more generally, are described in this book.
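To make the 16-bit example above concrete, the short sketch below quantizes a voltage into one of 2^16 = 65,536 levels; the 0 V to 5 V full-scale range is an assumption chosen only for illustration.

```python
BITS = 16
LEVELS = 2 ** BITS            # 65,536 distinct codes for a 16-bit representation
V_MIN, V_MAX = 0.0, 5.0       # assumed full-scale range (V)

def quantize(v):
    """Map a voltage to the nearest of the 2^16 digital codes."""
    code = round((v - V_MIN) / (V_MAX - V_MIN) * (LEVELS - 1))
    return max(0, min(LEVELS - 1, code))

print(LEVELS)            # 65536
print(quantize(3.3))     # digital code representing 3.3 V
```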
The design procedures and working mechanisms of modern VLSI circuits and embedded systems are covered in this book. Researchers in these fields are working intensively on advanced embedded systems. This book will help them by collecting everything in one place and by providing a better understanding of VLSI circuits and embedded systems. Besides theoretical knowledge, some practical applications of VLSI circuits and embedded systems are included in the book to give the real flavor of the topics. Both practicing professionals and advanced undergraduate or graduate students will benefit from this book. For a student, the most rewarding aspect of a VLSI design class is to put together previously learned basics of circuit, logic, and architecture design to understand the trade-offs between the different levels of abstraction. Professionals who either practice VLSI design or develop VLSI CAD tools can use this book to brush up on parts of the design process with which they have less frequent involvement.
Index