
VLSI Circuits and Embedded

Systems

Very Large-Scale Integration (VLSI) creates an integrated circuit (IC) by combining thousands of transistors
into a single chip. While designing a circuit, reduction of power consumption is a great challenge. VLSI
designs reduce the size of circuits, which eventually reduces the power consumption of the devices. However,
this also increases the complexity of the digital system. Therefore, computer-aided design tools are introduced
into hardware design processes.

Unlike the general-purpose computer, which is engineered to manage a wide range of processing tasks, an
embedded system is designed to perform dedicated functions. Embedded systems are managed by single or
multiple processing cores in the form of microcontrollers, digital signal processors, field-programmable gate
arrays, and application-specific integrated circuits. Security threats have become a significant issue, since
most embedded systems lack security even more than personal computers do. Many embedded systems
hacking tools are readily available on the internet; hacking of PDAs and modems is a common example of
embedded systems hacking.

This book explores the designs of VLSI circuits and embedded systems. These two vast topics are divided
into four parts. In the book’s first part, Decision Diagrams (DDs) are covered. DDs are extensively used in
Computer-Aided Design (CAD) software for circuit synthesis and formal verification. The book’s second part
mainly covers the design architectures of Multiple-Valued Logic (MVL) circuits. MVL circuits offer several
potential opportunities to improve present VLSI circuit designs. The book’s third part deals with Programmable
Logic Devices (PLDs). PLDs can be programmed to incorporate a complex logic function within a single IC
for VLSI circuits and embedded systems. The fourth part of the book concentrates on the design architectures
of complex digital circuits of embedded systems. As a whole, from this book, core researchers, academicians,
and students will get a complete picture of VLSI circuits and embedded systems and their applications.
VLSI Circuits and Embedded
Systems

Hafiz Md. Hasan Babu


First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowl-
edged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.

ISBN: 978-1-032-21608-9 (hbk)


ISBN: 978-1-032-21610-2 (pbk)
ISBN: 978-1-003-26918-2 (ebk)

DOI: 10.1201/9781003269182

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To my beloved parents and also to my beloved wife, daughter & son,
who made it possible for me to write this book
Contents

List of Figures xxi

List of Tables xxxi

Preface xxxiii

Author Bio xxxv

Acknowledgments xxxvii

Acronyms xxxix

Introduction xliii

Section I An Overview About Decision Diagrams

Part 1 3

Chapter 1  Shared Multi-Terminal Binary Decision Diagrams 5

1.1 INTRODUCTION 5
1.2 PRELIMINARIES 6
1.2.1 Shared Multi-Terminal Binary Decision Diagrams 8
1.3 AN OPTIMIZATION ALGORITHM FOR SMTBDD(K )S 12
1.3.1 The Weight Calculation Procedure 13
1.3.2 Optimization of SMTBDD(3)s 15
1.4 SUMMARY 16

Chapter 2  Multiple-Output Functions 19

2.1 INTRODUCTION 19
2.1.1 Basic Definitions 20
2.2 BINARY DECISION DIAGRAMS FOR MULTIPLE-OUTPUT
FUNCTIONS 21


2.2.1 SBDDs and MTBDDs 21


2.2.2 BDDs for CFs 21
2.2.2.1 BDDs for CFs of Multiple-Output Functions 22
2.2.3 Comparison of Various BDDs 26
2.3 CONSTRUCTION OF COMPACT BDDS FOR CFS 26
2.3.1 Formulation of the Problem 26
2.3.2 Ordering of Output Variables 27
2.3.3 Interleaving-Based Sampling Schemes for Ordering of Input
Variables 28
2.3.3.1 Generating Samples from Output Functions 28
2.3.3.2 Interleaving the Variable Orderings of Samples 29
2.3.4 Interleaving Method for Input Variables and Output Variables 29
2.3.5 Algorithm for Ordering the Variables 30
2.4 SUMMARY 31

Chapter 3  Shared Multiple-Valued DDs for Multiple-Output Functions 35

3.1 INTRODUCTION 35
3.2 DECISION DIAGRAMS 36
3.2.1 Binary Decision Diagrams 37
3.2.2 Multiple-Valued Decision Diagrams 37
3.2.2.1 Shared Multiple-Valued Decision Diagrams 37
3.3 CONSTRUCTION OF COMPACT SMDDS 39
3.3.1 Pairing of Binary Input Variables 39
3.3.1.1 The Method 39
3.3.2 Ordering of Input Variables 43
3.4 SUMMARY 44

Chapter 4  Heuristics to Minimize Multiple-Valued Decision Diagrams 47

4.1 INTRODUCTION 47
4.2 BASIC PROPERTIES 49
4.3 MULTIPLE-VALUED DECISION DIAGRAMS 49
4.3.1 Size of MDDs 49
4.4 MINIMIZATION OF MDDS 54

4.4.1 Pairing of 2-Valued Inputs 55


4.4.2 Ordering of Multiple-Valued Variables 55
4.5 SUMMARY 58

Chapter 5  TDM Realizations of Multiple-Output Functions 61

5.1 INTRODUCTION 61
5.2 DECISION DIAGRAMS FOR MULTIPLE-OUTPUT FUNCTIONS 62
5.2.1 Shared Binary Decision Diagrams 62
5.2.2 Shared Multiple-Valued Decision Diagrams 63
5.2.3 Shared Multi-Terminal Multiple-Valued Decision Diagrams 63
5.3 TDM REALIZATIONS 65
5.3.1 TDM Realizations Based on SBDDs 65
5.3.2 TDM Realizations Based on SMDDs 66
5.3.3 TDM Realizations Based on SMTMDDs 68
5.3.4 Comparison of TDM Realizations 68
5.4 REDUCTION OF SMTMDDS 69
5.5 UPPER BOUNDS ON THE SIZE N OF DDS 70
5.6 SUMMARY 70

Chapter 6  Multiple-Output Switching Functions 73

6.1 INTRODUCTION 73
6.2 DEFINITIONS AND BASIC PROPERTIES 75
6.3 DECISION DIAGRAMS 76
6.3.1 2-Valued Pseudo-Kronecker Decision Diagrams 76
6.3.2 Multiple-Valued Pseudo-Kronecker Decision Diagrams 77
6.4 OPTIMIZATION OF 4-VALUED PKDDS 77
6.4.1 Pairing of 2-Valued Input Variables 77
6.4.2 Ordering of 4-Valued Variables 79
6.4.3 Selection of Expansions 81
6.5 SUMMARY 82

Section II An Overview About Design Architectures of Multiple-Valued


Circuits

Part 2 85

Chapter 7  Multiple-Valued Flip-Flops Using Pass Transistor Logic 87

7.1 INTRODUCTION 87
7.1.1 Realization of Multiple Valued Flip-Flops Using Pass Transistor
Logic 87
7.1.2 Implementation of MVFF with Binary Coding and Decoding
Using PTL 88
7.2 MVFF WITHOUT BINARY ENCODING OR DECODING 90
7.2.1 Properties of Pass Transistor and a Threshold Gate 90
7.2.2 Realization of Multiple-Valued Inverter Using Threshold Gates 92
7.2.3 Realization MVFF Using Multiple-Valued Pass Transistor Logic 93
7.3 SUMMARY 95

Chapter 8  Voltage-Mode Pass Transistor-Based Multi-Valued


Multiple-Output Logic Circuits 97

8.1 INTRODUCTION 97
8.2 BASIC DEFINITIONS AND TERMINOLOGIES 98
8.3 THE METHOD 98
8.3.1 Conversion of Binary Logic Functions into MVL Functions 99
8.3.2 Pairing of the Functions 100
8.3.3 Output Stage 101
8.3.4 Basic Circuit Structure and Operation 101
8.3.4.1 Literal Generation 104
8.4 SUMMARY 105

Chapter 9  Multiple-Valued Input Binary-Valued Output Functions 107

9.1 INTRODUCTION 107


9.2 BASIC DEFINITIONS 108
9.3 TRANSFORMATION OF TWO-VALUED VARIABLES INTO
MULTIPLE-VALUED VARIABLES 110
9.3.1 Algorithms for Minimizing the Multiple-Valued Functions 112
9.4 SUMMARY 119

Chapter 10  Digital Fuzzy Operations Using Multi-Valued Fredkin Gates 121

10.1 INTRODUCTION 121


10.2 REVERSIBLE LOGIC 122
10.2.1 Some Basic Reversible Gates and Classical Digital Logic Using
these Gates 122
10.2.2 Multi-Valued Fredkin Gate 124
10.3 FUZZY SETS AND RELATION 124
10.4 THE CIRCUIT 128
10.4.1 Fuzzy Operations Using MVFG 128
10.4.2 Systolic Array Structure for Composition of Fuzzy Relations 130
10.5 SUMMARY 130

Chapter 11  Multiple-Valued Multiple-Output Logic Expressions Using


LUT 133

11.1 INTRODUCTION 133


11.2 BASIC DEFINITIONS AND PROPERTIES 134
11.2.1 Product Terms 134
11.2.1.1 Prime Implicants 134
11.2.2 Minimal SOPs 134
11.2.3 MVSOP Expressions Using KC 134
11.3 THE METHOD 135
11.3.1 Support Set Matrix 135
11.3.2 Pair Support Matrix 136
11.4 THE ALGORITHM FOR MINIMIZATION OF MVMOFS USING KC 137
11.5 REALIZATION OF MVMOFS USING CURRENT MODE CMOS 138
11.6 SUMMARY 141

Section III An Overview About Programmable Logic Devices

Part 3 145

Chapter 12  LUT-Based Matrix Multiplication Using Neural Networks 149

12.1 INTRODUCTION 149


12.2 BASIC DEFINITIONS 150
12.3 THE METHOD 150

12.4 SUMMARY 155

Chapter 13  Easily Testable PLAs Using Pass Transistor Logic 157

13.1 INTRODUCTION 157


13.2 PRODUCT LINE GROUPING 157
13.3 THE DESIGN 158
13.4 THE TECHNIQUE FOR PRODUCT LINE GROUPING 160
13.5 SUMMARY 161

Chapter 14  Genetic Algorithm for Input Assignment for Decoded-PLAs 163

14.1 INTRODUCTION TO DECODERS 163


14.1.1 Decoders as Product Generators 165
14.2 DECODED PLA 166
14.2.1 Advantages 168
14.3 BASIC DEFINITIONS 169
14.4 GENETIC ALGORITHM 170
14.4.1 GA Terminology 171
14.4.2 The Simple GA 172
14.4.3 The Steady-State Genetic Algorithm 172
14.5 GENETIC OPERATORS 175
14.5.1 Selection 175
14.6 CROSSOVER 176
14.6.1 Mutation 177
14.6.2 Inversion 178
14.7 GA FOR DECODED-PLAS 179
14.7.1 Problem Encoding 179
14.7.2 Fitness Function 180
14.7.3 Developed GA 181
14.7.4 Decoded AND-EXOR PLA Implementation 181
14.8 SUMMARY 182

Chapter 15  FPGA-Based Multiplier Using LUT Merging Theorem 185

15.1 INTRODUCTION 185


15.2 LUT MERGING THEOREM 186

15.3 THE MULTIPLIER CIRCUIT USING THE LUT MERGING THEOREM 186
15.4 SUMMARY 194

Chapter 16  Look-Up Table-Based Binary Coded Decimal Adder 197

16.1 INTRODUCTION 197


16.2 THE DESIGN OF LUT-BASED BCD ADDER 197
16.2.1 Parallel BCD Addition Method 198
16.2.2 Parallel BCD Adder Circuit Using LUT 202
16.3 SUMMARY 205

Chapter 17  Place and Route Algorithm for Field Programmable


Gate Array 207

17.1 INTRODUCTION 207


17.2 PLACING AND ROUTING 208
17.3 PARTITIONING ALGORITHM 208
17.4 KERNIGHAN-LIN ALGORITHM 208
17.4.1 How K-L Works 209
17.4.2 Implementation of K-L Algorithm 209
17.4.3 Steps of Algorithm 210
17.5 SUMMARY 211

Chapter 18  LUT-Based BCD Multiplier Design 213

18.1 INTRODUCTION 213


18.2 BASIC PROPERTIES 215
18.3 THE ALGORITHM 218
18.3.1 The BCD Multiplication Method 219
18.3.2 The LUT Architecture 221
18.4 LUT-BASED BCD MULTIPLIER CIRCUIT 225
18.5 SUMMARY 231

Chapter 19  LUT-Based Matrix Multiplier Circuit Using Pigeonhole


Principle 233

19.1 INTRODUCTION 233


19.2 BASIC DEFINITIONS 237

19.2.1 Binary Multiplication 237


19.2.2 Matrix Multiplication 238
19.2.3 BCD Coding 240
19.2.4 BCD Addition 240
19.2.5 Binary to BCD Conversion 240
19.2.6 Pigeonhole Principle 243
19.2.7 Field Programmable Gate Arrays 243
19.2.8 Look-Up Table 244
19.2.9 LUT-Based Adder 245
19.2.10 BCD Adder 247
19.2.11 Comparator 248
19.2.12 Shift Register 249
19.2.13 Literal Cost 251
19.2.14 Gate Input Cost 251
19.2.15 Xilinx Virtex 6 FPGA Slice 251
19.3 THE MATRIX MULTIPLIER 252
19.3.1 The Efficient Matrix Multiplication Method 252
19.3.1.1 The (1 × 1)-Digit Multiplication Algorithm 253
19.3.1.2 The (m×n)-Digit Multiplication Algorithm Using the
(1 × 1)-Digit Multiplication Algorithm 255
19.3.1.3 Binary to BCD Conversion Algorithm 258
19.3.1.4 Efficiency of the (m×n)-Digit Multiplication Algorithm 266
19.3.2 The Matrix Multiplication Algorithm 268
19.3.3 The Cost-Efficient Matrix Multiplier Circuit 272
19.3.3.1 (1 × 1)-Digit Multiplier Circuit 272
19.3.3.2 Binary to BCD Converter Circuit for the Decimal
Multiplier 279
19.3.3.3 (m×n )-Digit Multiplier Circuit 279
19.3.3.4 Matrix Multiplier Circuit 280
19.4 SUMMARY 282

Chapter 20  BCD Adder Using a LUT-Based Field Programmable


Gate Array 287

20.1 INTRODUCTION 287


20.2 BCD ADDER USING LUTS 288
20.2.1 The BCD Addition Method 288

20.2.2 The Architecture of a LUT 290


20.2.2.1 Working Mechanism of the 2-Input LUT 294
20.3 BCD ADDER CIRCUIT USING LUTS 295
20.4 SUMMARY 297

Chapter 21  Generic Complex Programmable Logic Device Board 299

21.1 INTRODUCTION 299


21.2 HARDWARE DESIGN AND DEVELOPMENT 300
21.2.1 DC-DC Converters 302
21.2.2 JTAG Interface 302
21.2.3 LED Interface 302
21.2.4 Clock Circuit 302
21.2.5 CPLD 303
21.2.6 Seven-Segment Display 305
21.2.7 Input/Output Connectors 305
21.3 INTERNAL HARDWARE DESIGN OF CPLD 305
21.3.1 A5/1 Algorithm 306
21.3.2 Seven Segment Display Driver 306
21.3.3 Binary 8-Bit Counter 307
21.4 APPLICATIONS 309
21.5 SUMMARY 309

Chapter 22  FPGA-Based Programmable Logic Controller 311

22.1 INTRODUCTION 311


22.2 FPGA TECHNOLOGY FOR PLC 312
22.3 SYSTEM DESIGN PROCEDURE FOR PLC 313
22.3.1 Ladder Program Structure 313
22.3.2 Operating Modes of PLC 314
22.3.3 Ladder Scanning 314
22.3.4 Ladder Execution 314
22.3.5 System Implementation 315
22.4 DESIGN CONSIDERATIONS 315
22.5 SUMMARY 318

Section IV An Overview About Design Architectures of Digital Circuits

Part 4 323

Chapter 23  Parallel Computation of Quotients and Partial Remainders to


Design Divider Circuits 325

23.1 INTRODUCTION 325


23.2 BASIC DEFINITIONS 331
23.2.1 Division Operation 331
23.2.2 Shift Registers 331
23.2.2.1 Serial-In to Parallel-Out Shift Register 332
23.2.2.2 Serial-In to Serial-Out Shift Register 334
23.2.2.3 Parallel-In Serial-Out Register 335
23.2.2.4 Parallel-In to Parallel-Out Shift Register 336
23.2.3 Complement Logic 336
23.2.4 Comparator 337
23.2.5 Adder 338
23.2.6 Subtractor 339
23.2.7 Look-Up Table 341
23.2.8 Counter Circuit 342
23.2.9 Reversible and Fault Tolerance Logic 344
23.3 THE METHODOLOGIES 344
23.3.1 Division Algorithm 344
23.3.1.1 Explanation of Correctness of the Division Algorithm 347
23.3.2 ASIC-Based Circuits 349
23.3.2.1 Parallel n-bit Counter Circuit 350
23.3.2.2 n-bit Comparator 355
23.3.2.3 n-bit Selection Block 359
23.3.2.4 Circuit for Conversion to Zero 364
23.3.2.5 Design of the Divider Circuit 370
23.3.3 LUT-Based Circuits 373
23.3.3.1 LUT-Based Bit Counter Circuit 375
23.3.3.2 LUT-Based Bit Comparator Circuit 376
23.3.3.3 LUT-Based Selection Circuit 379
23.3.3.4 LUT-Based Converter Circuit 380
23.3.3.5 Design of the LUT-Based Divider Circuit 382

23.3.3.6 Reversible Fault Tolerant LUT-Based Divider Circuit 383


23.4 SUMMARY 385

Chapter 24  Synthesis of Boolean Functions Using TANT Networks 389

24.1 INTRODUCTION 389


24.2 TANT MINIMIZATION 389
24.2.1 The Technique 390
24.3 THE INTRODUCED METHOD OF TANT MINIMIZATION 391
24.4 ALGORITHMS USED IN DIFFERENT STAGES 394
24.5 SUMMARY 396

Chapter 25  Asymmetric High Radix Signed Digital Adder Using Neural


Networks 397

25.0.1 Introduction 397


25.1 BASIC DEFINITIONS 398
25.1.1 Neural Network 398
25.1.2 Asymmetric Number System 398
25.1.3 Binary to Asymmetric Number System Conversion 399
25.1.4 Addition of AHSD4 Number System 400
25.2 THE DESIGN OF ADDER USING NEURAL NETWORK 400
25.3 AHSD ADDITION FOR RADIX-5 401
25.4 SUMMARY 401

Chapter 26  Wrapper/TAM Co-Optimization and Constrained Test


Scheduling 403

26.1 INTRODUCTION 403


26.2 THE WRAPPER DESIGN 404
26.3 TAM DESIGN AND TEST SCHEDULING 406
26.4 POWER CONSTRAINED TEST SCHEDULING 407
26.4.1 Data Structure 409
26.4.2 Rectangle Construction 409
26.4.3 Diagonal Length Calculation 410
26.4.4 TAM Assignment 410
26.5 SUMMARY 411

Chapter 27  Static Random Access Memory Using Memristor 413

27.1 INTRODUCTION 413


27.2 MEMRISTOR CHARACTERIZATION 415
27.3 MEMRISTOR AS A SWITCH 416
27.4 WORKING PRINCIPLE OF MEMRISTOR 417
27.5 MEMRISTOR-BASED SRAM 418
27.6 SUMMARY 420

Chapter 28  A Fault Tolerant Approach to Microprocessor Design 423

28.1 INTRODUCTION 423


28.1.1 Design Faults 423
28.1.2 Manufacturing Defects 424
28.1.3 Operational Faults 424
28.2 DYNAMIC VERIFICATION 426
28.2.1 System Architecture 426
28.2.2 Checker Processor Architecture 428
28.3 PHYSICAL DESIGN 430
28.4 DESIGN IMPROVEMENTS FOR ADDITIONAL FAULT COVERAGE 431
28.4.1 Operational Errors 431
28.4.2 Manufacturing Errors 432
28.5 SUMMARY 435

Chapter 29  Applications of VLSI Circuits and Embedded Systems 437

29.1 APPLICATIONS OF VLSI CIRCUITS 438


29.1.1 Autonomous Robots in Industrial Plants 438
29.1.2 Machines in Manufacturing 439
29.1.3 Smart Vision Tech for Quality Control 440
29.1.4 Wearables: Ensuring Security 442
29.1.5 Computing Using the CPU 442
29.1.6 System on a Chip 443
29.1.7 Cutting Edge AI Handling 444
29.1.8 VLSI in 5G Networks 445
29.1.9 Fuzzy Logic and Decision Diagrams 445
29.2 APPLICATION OF EMBEDDED SYSTEMS 448

29.2.1 Embedded System for Street Light Control 448


29.2.2 Embedded System for Industrial Temperature Control 448
29.2.3 Embedded System for Traffic Signal Control 448
29.2.4 Embedded System for Vehicle Tracking 448
29.2.5 Embedded System for War Field Spying Robot 448
29.2.6 Automated Vending Machine 449
29.2.7 Mechanical Arm Regulator 449
29.2.8 Routers and Switches 449
29.2.9 Industrial Field Programmable Gate Arrays 450
29.2.10 Industrial Programmable Logic Circuits 452
29.3 SUMMARY 453

VLSI Circuits and Embedded Systems 457

Index 461
List of Figures

1.1 General Structure of an SMTBDD( k ) with k = 3. 6


1.2 SMTBDD(2) with Groupings [f0 , f1 ], [f2 , f3 ], and [f4 , f5 ] for the Functions
in Table 1.1. 8
1.3 SMTBDD(3) with Groupings [ f0, f1, f2 ], and [ f3, f4, f5 ] for the Functions
in Table 1.1. 9
1.4 Representation of an n-input m-output Function by an MTBDD. 10
1.5 Representation of an n-input m-output Function by an SMTBDD. 11
1.6 The Clique Weighted Graph. 12
1.7 Pseudocode for Optimizing SMTBDD(3)s. 16

2.1 General Structure of a BDD for CF. 20


2.2 BDD for CF of a 3-Input 2-Output Bit-Counting Function (wgt3). 21
2.3 BDD for CF of Function f 0 = x 0 . 23
2.4 BDD for CF of (m − 1) Functions. 24
2.5 BDD for CF of m Functions. 24
2.6 BDD for CF of adr2. 25
2.7 BDD for CF of adrn. 25
2.8 BDD for CF of adrn (After updating the variables and constants). 26
2.9 SBDD with the Variable Ordering for Sample ( f1, f3 ) Obtained by the
Sifting Algorithm. 29
2.10 SBDD with the Variable Ordering for Sample ( f0, f2 ) Obtained by the
Sifting Algorithm. 30
2.11 SBDD with the Variable Ordering for f = ( f0, f1, f2, f3 ) Obtained from the
Variable Orderings of Samples ( f1, f 3 ) and ( f0, f2 ) by using an Interleaving
Method. 31
2.12 Pseudocode for Interleaving based Sampling Schemes for the Ordering of
Input Variables. 32

3.1 SMDD. 36
3.2 A Multiplexer-Based Network Corresponding to the SMDD in Fig. 3.1. 36
3.3 SMDD for Functions ( f0, f1 ) = (x1, x2 ). 38


3.4 SMDD for Functions ( f0, f1, f2 ) = (x1, x2, x3 ). 39


3.5 BDD for Functions f 0 . 40
3.6 BDD for Functions f 1 . 40
3.7 SMDD for the Partition {[x 1, x 3 ], [x 2, x 4 ]}. 42
3.8 SMDD for the Partition {[x 1, x 2 ], [x 3, x 4 ]}. 42
3.9 SMDD for the Partition {[x 1, x 4 ], [x 2, x 3 ]}. 43

4.1 Example of an MDD. 48


4.2 Multiplexer-Based Network Corresponding to the MDD in Fig. 4.1. 48
4.3 Realization of all the Symmetric Functions of a Single Variable. 50
4.4 Realization of Symmetric Functions of N Variables. 51
4.5 MDD for inc 2. 52
4.6 MDD for inc (n − 1). 53
4.7 MDD for inc n. 53
4.8 MDD for inc 3. 54
4.9 MDD for inc 4. 54
4.10 SBDD for Finding the Pairs of 2-Valued Inputs. 55

5.1 SBDD for the Function in Table 5.1. 62


5.2 SMDD for the Function in Table 5.1. 64
5.3 SMTMDD for the Function in Table 5.2. 64
5.4 TDM Realization Based on the SBDD. 66
5.5 TDM Realization Based on the SMTMDD. 67
5.6 2-MUX. 67
5.7 Literal Generator. 67
5.8 TDM Realization of a 4-Output Function Based on the SMTMDD. 69
5.9 A Method Replacing SMTBDD Nodes by MDD Nodes. 70

6.1 2-Valued PKDD for Functions (f0, f1) = (x2^0 x4 ⊕ x1^0 x2 x3^0 ⊕ x1^0 x2 x4 ⊕ x1 x3^0, x3 ⊕ x4). 74
6.2 EXOR-Based Network from the PKDD in Fig. 6.1. 75
6.3 LUT-Based Network from the PKDD in Fig. 6.1. 75
6.4 Pseudocode for Ordering of 4-Valued Variables. 80
6.5 A 4-Valued Node Consisting of Three Shannon Nodes. 81

7.1 RSTU Latch Using 4 Input NAND Gates. 88


7.2 4-Valued D Latch. 89
7.3 A Pass Transistor with a Threshold Gate. 90

7.4 Realization of Non-Inverted Threshold Gate. 90


7.5 Series Connection. 91
7.6 Parallel Connection: (a) Common Inputs (b) Different Inputs. 91
7.7 Parallel-Series Connections: (a) Common Inputs (b) Different Inputs. 92
7.8 Realization of Multiple-Valued Inverter Using Threshold Gates. 92
7.9 Logical Sum Circuit. 94
7.10 Multiple-Valued Flip-Flop using Pass Transistor Logic. 94

8.1 Basic Circuit Structure. 101


8.2 Basic Circuit Structure with 4 Logic Networks. 102
8.3 Output Circuit for Example 8.7. 104

9.1 An Assignment graph for the Functions of Example 9.1. 110


9.2 An EAG for the Functions of the Example 9.2. 111
9.3 Multiple-Valued Function Example: All Cubes are in the Form of
Canonical Cubes. 113
9.4 The Different Arrangements of the Index Table in the Different Passes. 114
9.5 Construction of Supercube. 117
9.6 Generate sub-function( ) producing P and R. 118

10.1 (a) Feynman Gate, (b) Fredkin Gate, (c) Toffoli Gate, and (d) New Gate. 123
10.2 Basic Logic Operations Using Reversible Gates. 124
10.3 Multiple-Valued Fredkin Gate (MVFG). 124
10.4 Implementation of Min and Max Operation Using MVFGs. 129
10.5 Complementing a Ternary Variable Using MVFGs. 129
10.6 Basic Cell. 130
10.7 Systolic Array for Composition of Fuzzy Relation. 130

11.1 Block Diagram of General 2-input r -valued LUT Logic Function. 139
11.2 (a) 3-valued 2-variable MVL Truth Table and its Direct Realization Using
LUT. (b) Kleenean Coefficient considered. 139
11.3 Circuit Diagram of a Current-Mode Literal 1 A1 . 140
11.4 Literal Generator Conceptualization. 140
11.5 Literal Generator Circuit. 141

12.1 Flow Diagram of Supervised Neural Networking Technique. 150


12.2 Block Diagram of the Matrix Multiplication Method. 152
12.3 Block Diagram of a Neural Network. 155

13.1 The Example PLA whose Input Decoder is Augmented using Extra Pass
Transistor. 159
13.2 (a) Response of the Bit Decoder (b) Input of the Decoder to Place 0c on
Both the Bit Lines. 159
13.3 An Example PLA Showing Product Lines Satisfying Introduced Condition. 160

14.1 A 2-to-4 Decoder. 164


14.2 A 3-to-8 Decoder Implementing f 0 . 165
14.3 A 3-to-8 Decoder Implementing f 1 . 166
14.4 A 3-to-8 Decoder Implementing f 0 and f 1 . 166
14.5 The Structure of Decoded PLA. 167
14.6 The Standard PLA Representation of the Function of Table 14.1. 168
14.7 The Decoded-PLA Representation of the Function shown in Table 14.1. 169
14.8 A Complete Graph. 170
14.9 Flowchart of the Simple Genetic Algorithm. 173
14.10 Flowchart of the Steady-State Genetic Algorithm. 174
14.11 Proportionate Selection Schemes. 176
14.12 One-Point Crossover. 177
14.13 Mutation Operator. 177
14.14 Inversion Operator. 178
14.15 Standard AND-OR and AND-EXOR PLA. 182

15.1 Block Diagram of the (1 × 1)-Digit Multiplier Circuit. 190


15.2 Implementation of LUT Merging Theorem Using 6-input LUTs. 190
15.3 The Bit-Categorization Circuit Using 6-Input LUT. 191
15.4 Deactivated Category Circuit as it is not the Target Category for
Corresponding Input. 192
15.5 Activated Category Circuit as it is the Target Category for Corresponding
Input. 193
15.6 Block Diagram of the (m×n)-Digit Multiplier Circuit. 194

16.1 Example Demonstration of the BCD Addition Algorithm for C3i = 0. 200
16.2 Example Demonstration of the BCD Addition Algorithm for C3i = 1. 201
16.3 Tree Structure Representation of the BCD Addition Method. 201
16.4 1-Digit BCD Adder Circuit. 204
16.5 Block Diagram of the N -Digit BCD Adder Circuit. 205

18.1 LUT Implementation of a Logic Function. 216



18.2 Full Adder Circuit Critical Path Delay Determination of Full Adder Circuit. 218
18.3 Memristor. 219
18.4 2-Input LUT Circuit Architecture. 222
18.5 5-Input LUT Circuit. 224
18.6 6-Input LUT Circuit. 225
18.7 BCD Multiplier Circuit for 1-Digit Multiplication. 227
18.8 BCD Multiplier Circuit Design Using Virtex 5/6 Slice for 1-Digit. 228
18.9 Block Diagram of the BCD Multiplier for N×M -Digit Multiplication. 229
18.10 Merging LUTs in N×M Multiplication. 230

19.1 Comparison Among CPU, GPU and FPGA performances. 234


19.2 Example of an Image with Limited Range Values in Matrix. 235
19.3 Partial Product Generation for Final Addition of 8 × 8-bit Multiplication. 237
19.4 Example Simulation of Matrix Multiplication. 239
19.5 Example Demonstration of Binary to BCD Conversion. 241
19.6 Verilog Code for the Hardware Implementation of Binary to BCD
Conversion. 242
19.7 Various parts of FPGA. 244
19.8 LUT Implementation of a Logic Function. 245
19.9 Circuit Diagram of a 2-input LUT. 245
19.10 8-bit Adder Using (a) Two 9-Input LUTs (b) Two 8-Input LUTs (c) Four
4-Input LUTs. 246
19.11 Comparison of Area and Time for 8-bit Adder Using Various LUTs. 246
19.12 Simulation of BCD Addition. 247
19.13 Block Diagram of N -Digit BCD Adder Circuit. 248
19.14 Timing Diagram of a 1-bit Comparator Circuit. 249
19.15 Circuit Diagram of a 4-bit Comparator Circuit. 249
19.16 Data Movement from Left to Right through a Shift Register. 250
19.17 Circuit with Corresponding Literal Cost, Gate Input Cost, and Gate Input
Cost with NOT. 252
19.18 K-map Layout for Seven Variables to Optimize the Functions. 260
19.19 K -map Manipulation for the Optimization of P1 Function. 261
19.20 Optimized Groups after K -map Manipulation for P1 Function. 261
19.21 K -map Manipulation for the Optimization of P2 Function. 262
19.22 Optimized Groups after K -map Manipulation for P2 Function. 262
19.23 K -map Manipulation for the optimization of P3 Function. 263
19.24 Optimized Groups after K -map Manipulation for P3 Function. 263

19.25 K -map Manipulation for the optimization of P4 Function. 264


19.26 Optimized Groups after K -map Manipulation for P4 Function. 264
19.27 K -map Manipulation for the optimization of P5 Function. 265
19.28 Optimized Groups after K -map Manipulation for P5 Function. 265
19.29 Conventional 4 × 4-Bit Multiplication. 267
19.30 8 × 8-Digit Multiplication. 268
19.31 Example of an image for 6 × 6 Matrix Representation. 269
19.32 Parallel Structure Representation of the Matrix Multiplication Algorithm. 270
19.33 Bit Categorization Circuit Selected Category One for Corresponding Input. 275
19.34 Bit Categorization Circuit Selected Category Four for Corresponding Input. 275
19.35 (1 × 1)-Digit Multiplier Circuit using Heterogeneous Input LUT. 276
19.36 Deactivated Category Circuit as it is not the Target Category for
Corresponding Input. 276
19.37 Activated Category Circuit as it is the Target Category for Corresponding
Input. 277
19.38 Homogeneous 6-Input LUT Implementation of the Bit-Categorization
Circuit. 278
19.39 Homogeneous 6-Input LUT Implementation of the (1 × 1)-Digit Multiplier
Circuit. 278
19.40 Binary to BCD Converter Circuit for Decimal Digit Multiplier Circuit. 279
19.41 Block Diagram of the (m×n)-Digit Multiplier Circuit. 280
19.42 Block Diagram of the Matrix Multiplier Circuit. 281
19.43 Matrix Multiplier Circuit for (3 × 3) Matrices. 282

20.1 Demonstration of the BCD Addition Algorithm Exhibited in Example 21.1. 290
20.2 Architecture of the 2-Input LUT. 292
20.3 Architecture of the 6-Input LUT. 293
20.4 The BCD Adder: (a) Block Diagram of 1-Digit BCD Adder, (b) 1-Digit
BCD Adder, (c) Block Diagram of the n-Digit BCD Adder. 296

21.1 Block Diagram of CPLD Board. 301


21.2 Prototype Board. 301
21.3 JTAG Cable. 302
21.4 JTAG Cable. 303
21.5 LE Structure. 304
21.6 Programmable Logic Design Process. 305
21.7 Structure of A5/1 GSM Algorithm. 306
21.8 Structure of BCD to Seven Segment Decoder. 307

21.9 VHDL Entities. 308


21.10 View of the Complete Prototype. 308
21.11 Seven Segment Display. 309

22.1 A Conventional PLC System. 312


22.2 Block Diagram of the System Architecture of a PLC. 314
22.3 Ladder Scanning 315
22.4 Ladder Execution Flow 316
22.5 System Architecture inside FPGA 317

23.1 Distribution of Different Instructions. 327


23.2 Distribution of Execution Time. 327
23.3 FPGA Devices have evolved to Become Highly Capable Computing
Platforms. 328
23.4 FPGAs Move to Leading-Edge Process Technologies. 329
23.5 Data Movement from Left to Right through a Shift Register. 333
23.6 4-bit Serial-in to Parallel-out Shift Register. 333
23.7 Timing Diagram for a 4-bit Serial-in to Parallel-out Shift Register. 334
23.8 4-bit Serial-in to Serial-out Shift Register. 335
23.9 4-bit Parallel-in to Serial-out Shift Register. 335
23.10 4-bit Parallel-in to Parallel-out Shift Register. 336
23.11 Timing Diagram of a 1-bit Comparator Circuit. 337
23.12 Circuit Diagram of a 4-bit Comparator Circuit. 338
23.13 Timing Diagram of a 1-bit Adder Circuit. 339
23.14 Circuit Diagram of a 4-bit Adder Circuit. 339
23.15 Timing Diagram of a 1-bit Subtractor Circuit. 340
23.16 Circuit Diagram of a 4-bit Subtractor Circuit. 341
23.17 LUT Implementation of a Logic Function. 342
23.18 Circuit Diagram of a 2-Input LUT. 343
23.19 Circuit Diagram of a 4-bit Counter. 343
23.20 Timing Diagram of a 4-bit Counter. 343
23.21 Block Diagram of (a) Fredkin Gate, (b) Feynman Double Gate. 344
23.22 Flowchart of the Division Algorithm. 347
23.23 Example Simulation of the Division Algorithm. 348
23.24 Demonstration of Invalid State in D Flip-Flop and Latches. 351
23.25 Demonstration of the Presence of Input Bit in D Flip-Flop and Latches. 351
23.26 Circuit Realization of Equation 24.10. 352

23.27 Circuit Realization of the 7-bit Counter. 353


23.28 Data Flow of the 7-bit Counter for Example 24.5. 354
23.29 Circuit Realization of the 15-Bit Counter. 356
23.30 Circuit Realization of the Identification of Two Distinct Path in Comparator
Circuit. 357
23.31 Circuit Realization of the 4-Bit Comparator. 358
23.32 Circuit Realization of the n-bit Comparator. 359
23.33 Circuit Realization of the 8-Bit Selection Block. 361
23.34 Example Demonstration of Example 24.6 of the 8-Bit Selection Block. 361
23.35 Analysis of the Circuit Behavior due to Property 1 and Property 2 of the
8-Bit Selection Block. 362
23.36 An Improved Version of the 8-Bit Selection Block. 362
23.37 Circuit Diagram of the n-bit Selection Block. 363
23.38 Circuit Diagram of Equation 24.19. 365
23.39 Example Demonstration of the Circuit Exhibited in Fig. 23.38. 366
23.40 Circuit Diagram of the 3-bit to 6-bit Zero Converter Circuit. 367
23.41 Circuit Behavior of the Circuit Exhibited in Fig. 23.40 when Input is 0002 . 368
23.42 Circuit Behavior of the Circuit Exhibited in Fig. 23.40 for the Input 0012 . 369
23.43 Circuit Behavior of the Circuit Exhibited in Fig. 23.40 for the Input 0112 . 370
23.44 Circuit Diagram of the 4-Bit Divider. 371
23.45 Block Diagram of the n-bit Divider Circuit. 373
23.46 Circuit Behavior of the 4-Bit Divider Circuit for Dividend = (1111)2 and
Divisor = 102 . 374
23.47 Block Diagram of the Look-Up Table (LUT)-Based 7-Bit Counter Circuit. 377
23.48 16-Bit LUT-Based Comparator Circuit. 377
23.49 6-Bit LUT-Based Comparator Circuit. 378
23.50 Design of LUT-Based 4-Bit Comparator Circuit. 379
23.51 Design of a LUT-Based 4-Bit Selection Circuit. 380
23.52 Design of a LUT-Based 3-Bit Converter Circuit. 381
23.53 Design of a LUT-Based 4-Bit Converter Circuit. 382

24.1 Representation of Function f 1 from Equation 25.1. 390


24.2 Flow Diagram of the Method. 391
24.3 Flow Diagram of the Introduced Method. 392
24.4 BT for the PI AB(C)0(D)0. 393
24.5 BT for AB 0(C)0(D)0 Considering Property 24.3.3, where (C)0 is a part of
an OTF. 393

25.1 Neural Network Prototype for AHSD Number System Addition. 398
25.2 N -Bit Adder Generalization. 401

26.1 Example Test Schedule using Rectangle Packing. 407


26.2 Example of Some Rectangles for core 6 of SOC p93791 when W max = 32. 409
26.3 Test Scheduling for d695 using The Algorithm (T min = 1109 and TAM
width = 24) without Power Constraints. 410

27.1 Flowchart of a Hierarchical Design. 414


27.2 Relationship among Resistor, Capacitor, Inductor, and Memristor. 416
27.3 Characterizing the Memristor. 417
27.4 Change of Resistance for a 3.6V p-p Square Wave. 418
27.5 Three Transistor and Two Memristor SRAM Cells. 419
27.6 Circuit when RD = 0, WR = 1, and Comb = 1. 420
27.7 Circuit when RD = 1, WR = 0, and Comb = 1. 420

28.1 Dynamic Verification System Architecture. 427


28.2 Checker Processor Pipeline Structure for a Single Wide Checker Processor. 428
28.3 Checker Processor Pipeline Structure for a Checker Processor in Check
Mode. 429
28.4 Checker Processor Pipeline Structure for a Checker Processor in Execute
Mode. 430
28.5 Checker Processor Pipeline Structure with TMR on the Control Logic. 433
28.6 Checker Processor Pipeline Structure with TMR on Control Logic and
BIST. 434

29.1 Example of Industrial Autonomous Robots [5]. 438


29.2 Improvement of Productivity: Connected devices mitigate human mistake
[6]. 439
29.3 Smart Vision Tech for Quality Control [28]. 441
29.4 Wearables: Ensuring Security [29]. 442
29.5 Computing Using the CPU:AMD Ryzen Processor [14]. 443
29.6 VLSI in 5G Networks: 5G Network Cell [15]. 446
29.7 Smart Drone [31]. 449
29.8 Automated Vending Machine [22]. 450
29.9 Mechanical arm [24]. 451
29.10 Networking Equipment: Switch [30]. 451
List of Tables

1.1 2-Input 6-Output Function 6


1.2 2-Input 3-Output Function with Three Distinct Output Vectors 7
1.3 2-Input 3-Output Function with Four Distinct Output Vectors 7

2.1 2-Input 2-Output Function 22


2.2 Comparison of SBDDs, MTBDDs, and BDDs for CFs 23

5.1 2-Valued 4-Input and 4-Output Function 63


5.2 4-Valued 2-Input and 2-Output Function 65

7.1 Truth Table for the RSTU Latch 89


7.2 RSTU Latch Function 89
7.3 Multiple-Valued Inverted Outputs for the Corresponding Input Values 93
7.4 Truth Table for Logical Sum Circuit 93
7.5 Truth Table for the Introduced MVFF 94

8.1 Truth Tables with (a) Binary Form and (b) Multi-Valued Form 99
8.2 Basic Circuit Operation for Circuit in Fig. 8.1 102
8.3 Operation of the Circuit of Fig. 8.2 102
8.4 (a) Truth Table for Example 8.7, and Tables when (a) A = 0, (b) B = Y2^{0,2,3} + Y1^{1,3}Y2^{1},
(c) C = Y2^{2} + Y2^{3}Y1^{1,2,3} + Y1^{0,1,3}Y2^{0}, (d) D = Y1^{2}Y2^{0} + Y1^{0}Y2^{3} 103

10.1 Feynman Gate 123


10.2 Discrete Approximation 126
10.3 Digitized Membership Values 126

11.1 Current Values Assigned to 3-Valued Logic 140

14.1 A Truth Table Representing a Three-Input Two-Output Function 165


14.2 A Truth Table Representing a Four-Input Three-Output Function 167

15.1 Bit Categorization of Inputs for the Implementation of LUT Merging


Theorem in Multiplication Technique 189


15.2 Simulation of LUT Merging for 3-Bit Input Combinations of a 1 × 1-Digit


FPGA-Based Multiplier Circuit 191

16.1 The Truth Table of 1-Digit BCD Addition with C 0 = 1 199


16.2 The Truth Table of 1-Digit BCD Addition with C 0 = 0 200

18.1 Truth Table of Function f 216


18.2 The Truth Table of 1-Digit Multiplication 220
18.3 Write and Read Operations 224

19.1 Partial Product Generation Logic for Binary Multiplication 237


19.2 Truth Table of Function ( f ) 245
19.3 Example Demonstration of the (m×n)-Digit Multiplication Algorithm 257
19.4 The Binary Input and BCD Output of the Binary to BCD Conversion
Method for (m×n)-Digit Multiplication 259
19.5 Comparative Analysis of Optimization Technique Implementation 274

20.1 Truth Table of 3-bit Addition with Pre-processing and Addition of 3 289
20.2 Read and Write Scheme using the Introduced Approach 294

23.1 Truth Table for a 4-bit SIPO Register 334


23.2 Truth Table of One-Bit Conventional Binary Comparator 337
23.3 Truth Table for 1-bit Full Adder 338
23.4 Truth Table for 1-bit Subtractor 340
23.5 Truth Table of Function ( f ) of Equation 24.8 342
23.6 Truth Table for the Verification of Equation 24.9 350
23.8 Truth Table for Selection of 8-Bit of Dividend 360
23.7 Data Flow for the Selection Block 360
23.9 Truth Table for Conversion of ⌈log2 n + 1⌉-bit to n-bit (Here, n = 7) 364
23.10 Frequency Distribution Table for a 15-Bit Input 376
23.11 Sorted Frequency Distribution Table for a 15-Bit Input 376

26.1 Result of Wrapper_Design for core 6 of p93791 405

28.1 Operational Faults in Checker Circuitry 432


Preface

Very Large Scale Integration (VLSI) is one of the most widely used technologies for
microchip processors, integrated circuits (ICs) and component design. It was initially
intended to support thousands of transistor gates on a microchip, but modern chips now
integrate several billion transistors. All of these transistors are embedded within a
microchip that has shrunk over time but still has the capacity to hold an enormous number
of transistors. In VLSI circuits, the integration of billions of transistors improves the
design methodology, which also ensures higher operating speed, lower power consumption,
smaller circuit size, higher reliability and lower manufacturing cost. VLSI chips are widely
used in various branches of Engineering such as Voice and Data Communication networks,
Digital Signal Processing, Computers, Commercial Electronics, Automobiles, and Embedded
Systems. The relevance of VLSI in performance computing, telecommunications, and
consumer electronics has been expanding progressively, and at a very rapid pace.
An embedded system is a microprocessor- or microcontroller-based system of hardware
and software designed to perform dedicated functions within a larger mechanical or
electrical system. The embedded system is unlike the general-purpose computer, which is
engineered to manage a wide range of processing tasks. Because an embedded system is
engineered to perform certain tasks only, design engineers may optimize size, cost, power
consumption, reliability and performance. Embedded systems are typically produced on
broad scales and share functionalities across a variety of environments and applications.
The complexity of an embedded system varies significantly depending on the task for which
it is designed. Embedded system applications range from digital watches and microwaves to
hybrid vehicles and avionics. As much as 98 percent of all microprocessors manufactured
are used in embedded systems. Embedded systems are convenient for mass production,
which results in a lower price per piece. They are highly stable, reliable and very small in
size, and hence they can be carried and deployed anywhere. They are also very fast and
consume little power. In addition, they optimize the use of the available resources. For these
reasons, embedded systems are becoming more popular day by day.
This book mainly covers two extensive topics: VLSI circuits and embedded systems.
These two topics are further divided into four parts: Decision Diagrams, Design Archi-
tectures of Multiple-Valued Logic Circuits, Programmable Logic Devices, and Design
Architectures of Digital Circuits. The Decision Diagram part mainly covers various types
of Decision Diagrams (DDs) such as Binary Decision Diagrams (BDD), Shared Multi-
Terminal Binary Decision Diagrams (SMTBDD), complexities of different types of BDDs,
Multiple-Output Functions using BDD for Characteristic Functions, Shared Multiple-
Valued DDs for Multiple-Output Functions, Minimization techniques of Multiple-Valued
DDs, Time Division Multiplexing (TDM) Realizations of Multiple-Output Functions based


on Shared Multi-Terminal Multiple-Valued DDs, and Multiple-Output Switching Functions
using Multiple-Valued Pseudo-Kronecker DDs.
The circuits having more than two logic levels are called multiple-valued circuits and
they have the potential of reducing area by reducing the on chip interconnection. The
Design Architectures of Multiple-Valued Logic Circuits part mainly covers the basics
of Multiple-Valued Logic (MVL), MVL Flip-Flops using pass transistor logic, voltage-
mode pass transistor-based multi-valued multiple-output logic circuits, multiple-valued
input binary-valued output functions, digital fuzzy operations using multi-valued Fredkin
gates, multiple-valued multiple-output logic expressions using look-up table (LUT) to
reduce complexity. Programmable Logic Device (PLD) is an electronic component which
is used to build reconfigurable digital circuits. Programmable Logic Devices part mainly
covers LUT-based matrix multiplication using neural networks, testable PLAs using pass
transistor logic, genetic algorithm for input assignment for decoded-PLAs, FPGA-based
multiplier using LUT merging theorem, LUT-based BCD adder and multiplier design,
LUT-based matrix multiplier circuit using pigeonhole principle, place and route algorithm
for FPGA, FPGA-based programmable logic controller and generic complex PLD board.
Design Architectures of Digital Circuits part mainly covers parallel computation of
quotients and partial remainders to design divider circuits, algorithms to minimize TANT
circuit and to construct optimal TANT network, asymmetric high radix signed digital
adder using neural networks, design of nonvolatile 6-T static random access memory and
resistive random access memory using memristor, design of fault tolerant microprocessor
and integrated framework for system on chip test automation.
Some important applications of VLSI circuits and embedded systems are also discussed
in this book. Practical realizations of VLSI circuits, such as autonomous robots in
industrial plants, VLSI in 5G networks, smart vision technology for quality control, etc.,
and real implementations of embedded systems, such as street light control, automated
vending machines, vehicle tracking, etc., are presented to give the reader a better
understanding of VLSI circuits and embedded systems.
This book will be beneficial to diverse readers, from the beginner to the expert level, in
VLSI Circuits and Embedded Systems. It can be used as a textbook for Physical Science
and Engineering students both at the undergraduate and post-graduate levels. The book
also targets faculty members and researchers in this field all over the world. Industry
professionals working to implement embedded systems will also find this book interesting.

Dhaka, Bangladesh, Hafiz Md. Hasan Babu


E-mail: [email protected]
June 2022
Author Bio

Dr. Hafiz Md. Hasan Babu is currently working as Dean of the Faculty of Engineering
and Technology, as well as a Professor in the Department of Computer Science and En-
gineering of the University of Dhaka, Bangladesh. He is also the former Chairman of the
same Department. From July 13, 2016 to July 12, 2020, he was the Pro-Vice-Chancellor
of National University, Bangladesh, where he worked on deputation from the Department
of Computer Science and Engineering, University of Dhaka, Bangladesh. For his excel-
lent academic and administrative capability, he also served as the Professor and Founding
Chairman of the Department of Robotics and Mechatronics Engineering, University of
Dhaka, Bangladesh. He served as a World Bank senior consultant and general manager
of the Information Technology & Management Information System Departments of Janata
Bank Limited, Bangladesh. Dr. Hasan Babu was the World Bank resident information tech-
nology expert of the Supreme Court Project Implementation Committee, Supreme Court
of Bangladesh. He was also the information technology consultant of Health Economics
Unit and Ministry of Health and Family Welfare in the project “SSK (Shasthyo Shurokhsha
Karmasuchi) and Social Health Protection Scheme” under the direct supervision and fund-
ing of German Financial Cooperation through KfW. Professor Dr. Hafiz Md. Hasan Babu
received his M.Sc. degree in Computer Science and Engineering from the Brno University
of Technology, Czech Republic, in 1992 under the Czech Government Scholarship. He ob-
tained the Japanese Government Scholarship to pursue his PhD from the Kyushu Institute of
Technology, Japan, in 2000. He also got DAAD (Deutscher Akademischer Austauschdienst)
Fellowship from the Federal Republic of Germany.
Professor Dr. Hasan Babu is a very eminent researcher. He was awarded the best
paper awards in three reputed international conferences. In recognition of his valuable
contributions in the field of Computer Science and Engineering, he received the Bangladesh
Academy of Sciences Dr. M.O. Ghani Memorial Gold Medal Award for the year 2015, which


is one of the most prestigious research awards in Bangladesh. He was also awarded the
UGC (University Grants Commission of Bangladesh) Gold Medal Award-2017 for his
outstanding research contributions in computer science and engineering. He has written
more than 100 research articles published in reputed international journals (IET Computers
& Digital Techniques, IET Circuits and Systems, IEEE Transactions on Instrumentation
and Measurement, IEEE Transactions on VLSI Systems, IEEE Transactions on Computers,
Elsevier Journal of Microelectronics, Elsevier Journal of Systems Architecture, Springer
Journal of Quantum Information Processing, etc.) and joined international conferences.
According to Google Scholar, Prof. Hasan has already received around 1332 citations with
h-index 17 and i10-index 31. He is a regular reviewer of reputed international journals
and international conferences. He presented invited talks and chaired scientific sessions or
worked as a member of the organizing committee or international advisory board in many
international conferences held in different countries. For his excellent research record, he
has also been appointed as the associate editor of IET Computers and Digital Techniques,
published by the Institution of Engineering and Technology of the United Kingdom.
Professor Dr. Hasan Babu was appointed as a member of the prime minister’s ICT Task
Force Committee, Government of the People’s Republic of Bangladesh in recognition of his
national and international level contributions in Engineering Sciences. He is currently the
president of Bangladesh Computer Society and also the president of International Internet
Society, Bangladesh Chapter. He has been recently appointed as a part-time member of
Bangladesh Accreditation Council of the Government of People’s Republic of Bangladesh
to ensure the quality of higher education in Bangladesh.
Acknowledgments

I would like to express my sincerest gratitude and special appreciation to the researchers
and my beloved students who are working in the field of VLSI Circuits and Embedded
Systems. The contents of this book have been compiled from a wide variety of research
works which are listed at the end of each chapter of this book.
I am grateful to my parents and family members for their endless support. Most of all,
I want to thank my wife Mrs. Sitara Roshan, daughter Ms. Fariha Tasnim, and son Md.
Tahsin Hasan for their invaluable cooperation in completing this book.
Finally, I am also thankful to Dr. A S M Touhidul Hasan and Md. Solaiman Mia who
have provided their support and important time to finish this book.

Acronyms

ANN Artificial Neural Network

ASIC Application Specific Integrated Circuit

AHSD Asymmetric High-radix Signed-digit

ASSPs Application Specific Standard Products

BCD Binary Coded Decimal

BIST Built in Self-test

CF Carry-free

CPU Central Processing Unit

CPLD Complex Programmable Logic Device

DCT Discrete Cosine Transform

DRAM Dynamic Random Access Memory

DSP Digital Signal Processor

EAG Enhanced Assignment Graph

EDA Electronic Design Automation

FPGA Field Programmable Gate Array

GA Genetic Algorithm

GPU Graphics Processing Unit

GMP Generalized Modus Ponens

GMT Generalized Modus Tollens

GSM Global System for Mobile Communications

HDL Hardware Description Language

IP Intellectual Property

IPU Image Processing Unit


KL Kernighan-Lin

MVFGs Multi-Valued Fredkin Gates

MRRAM Memristor-based Resistive Random Access Memory

MSB Most Significant Bit

MVIBVO Multiple-Valued Input Binary-Valued Output Functions

MVL Multiple-Valued Logic

LED Light Emitting Diode

LUT Look-Up Table

NN Neural Network

NOW Network of Workstations

NPU Neural Processing Unit

OTF Only Tail Factor

PCC Potential Canonical Cubes

PISO Parallel-in to Serial-out

PIPO Parallel-in to Parallel-out

PKDD Pseudo-Kronecker Decision Diagrams

PLAs Programmable Logic Arrays

PLCs Programmable Logic Circuits

PLDs Programmable Logic Devices

PTL Pass Transistor Logic

RFT Reversible Fault Tolerant

SBDD Shared Binary Decision Diagram

SER Single Event Radiation

SIPO Serial-in to Parallel-out

SISO Serial-in to Serial-out

SRAM Static Random Access Memory

SOC System-On-Chip

TAMs Test Access Mechanisms



TDM Time Division Multiplexing

VLSI Very Large Scale Integration


Introduction

The optimization of the power consumed in the digital blocks of an Integrated Circuit,
while preserving its functionality, is performed by Electronic Design Automation (EDA)
tools. There is a significant increase in the power consumption of Very Large Scale
Integration (VLSI) chips because of the increasing speed and complexity of present designs.
VLSI is a process of combining thousands of transistors into a single chip. The relevance of
VLSI in performance computing, telecommunications, and consumer electronics has been
expanding progressively at a very rapid pace. Circuit partitioning is a general approach
used to solve problems that are too large and complex to be handled at once. In partitioning,
the problem is divided into small and manageable parts recursively, until the required
complexity level is reached. In the area of VLSI, circuit complexity is rapidly multiplying
together with the reducing chip sizes; the integrated chips being produced today are highly
sophisticated.
There are many diverse problems that occur during the development phase of an IC
that can be solved by using circuit partitioning, which aims at obtaining sub-circuits
with minimum interconnections between them. Advances in semiconductor technology and
in the integration level of integrated circuits have enhanced many features, increased the
performance, improved the reliability of electronic equipment, and at the same time reduced
the cost, power consumption and system size. As the size and complexity of digital systems
have increased, more computer-aided design tools have been introduced into hardware design
processes. VLSI design automation has attracted a great deal of interest. Circuit partitioning
plays a key role in the physical design of VLSI circuits. The objective of circuit partitioning
is to divide the circuit into a number of sub-circuits with minimum interconnections between
them. In the past two decades, partitioning problems have been studied by researchers and
various heuristic algorithms have been developed. Circuit partitioning can also be achieved
by using clustering algorithms.
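To make the partitioning objective concrete, the following minimal Python sketch (the two-terminal
netlist format and the cell names are hypothetical illustrations, not taken from this book) counts the
interconnections cut by a given two-way partition, which is the quantity the heuristic partitioning
algorithms mentioned above try to minimize:

```python
# Minimal sketch of the min-cut objective in two-way circuit partitioning.
# The netlist (pairs of connected cells) and the cell names are illustrative only.

def cut_size(nets, block_a):
    """Count the nets whose two endpoints fall into different blocks."""
    cut = 0
    for u, v in nets:
        # A net is "cut" when exactly one of its endpoints lies in block A.
        if (u in block_a) != (v in block_a):
            cut += 1
    return cut

# Example: a tiny netlist of two-terminal connections between cells c1..c4.
nets = [("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c1", "c4")]
print(cut_size(nets, {"c1", "c2"}))  # -> 2 interconnections crossing the cut
```

Heuristics such as the Kernighan-Lin algorithm, covered later in this book, iteratively move or swap
cells between the blocks so that this cut size decreases.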
Reduction of power is a great challenge; hence, new techniques are designed by researchers
to reduce power dissipation, so that circuit designers can concentrate on maximizing circuit
performance along with reducing the circuit area. The concern over power consumption
came into the picture during the 1980s, when the first portable electronic systems were
developed. A decisive factor in the commercial success of such a product is a very good
battery lifetime. The increase in the number of active elements on the Integrated Circuit
area leads to a huge consumption of energy, and maintaining these power levels creates
problems of heat dissipation. Expensive heat removal systems such as heat sinks are required
to keep the devices in active states. These factors have established power as one of the main
design parameters, alongside performance and IC size.
An embedded system is a dedicated computer system designed for one or two specific
functions. This system is embedded as a part of a complete device system that includes


hardware, such as electrical and mechanical components. The embedded system is unlike
the general-purpose computer, which is engineered to manage a wide range of processing
tasks. Because an embedded system is engineered to perform certain tasks only, design
engineers may optimize size, cost, power consumption, reliability and performance. Em-
bedded systems are typically produced on broad scales and share functionalities across a
variety of environments and applications. Embedded systems are managed by single or mul-
tiple processing cores in the form of microcontrollers or digital signal processors (DSP),
field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC),
and gate arrays. These processing components are integrated with components dedicated
to handling electric and/or mechanical interfacing.
An embedded system's key feature is its dedication to specific functions that would otherwise
typically require a strong general-purpose processor. For example, routers and switches are
embedded systems, whereas a general-purpose computer would use a full operating system (OS)
to provide routing functionality. However, embedded routers perform routing functions more
efficiently than OS-based computers. Commercial embedded systems range from digital watches
and MP3 players to large routers and switches. Complexities vary from single-processor chips
to advanced units with multiple processing chips.
An embedded system is a kind of computer system mainly designed to perform several
tasks such as accessing, processing, storing, and controlling data in various electronics-
based systems. Embedded systems are a combination of hardware and software, where the
software is usually known as firmware and is embedded into the hardware. One of the most
important characteristics of these systems is that they produce their outputs within given
time limits. Embedded systems help make work more accurate and convenient. Examples of
embedded systems show that they have become part and parcel of daily life. People are very
familiar with the term “Smart Home” because of the deployment of smart embedded systems
in the home. Nowadays, almost all embedded systems are connected to the Internet. Security
threats have therefore become a major issue at present, because most embedded systems lack
security even more than personal computers. One of the reasons for this lack of security is the
very limited hardware and software implementation options available to the manufacturers of
embedded systems. In addition, they have to deal with the competitive market prices of other
embedded system manufacturers: they all have to keep the lowest possible price to maintain
customer satisfaction, and at the same time they do not conduct any specific security research
on their manufactured embedded products. This leads to security threats for embedded devices,
because ensuring advanced security techniques for embedded systems means a higher cost for
those products. Customers usually do not want to pay more when buying an embedded device,
and they are often not concerned about the probable security threats to their products. The
lack of security analysis and the low-cost market mentality of the manufacturers give hackers
exactly the environment they are looking for. Many embedded systems hacking tools are easily
available on the internet. Hacking of PDAs and modems is a very common example of embedded
systems hacking.
This book focuses on VLSI circuits and embedded systems. These two vast topics are divided into four parts. Part-I, named Decision Diagrams, has six chapters in which the methods and techniques of several decision diagrams, such as shared multi-terminal binary decision diagrams (SMTBDDs), binary decision diagrams for characteristic functions
(BDDs for CFs), shared multiple-valued decision diagrams (SMDDs), multiple-valued decision diagrams (MDDs) for multiple-output functions, etc., are described. In Part-II,
design architectures of multiple-valued circuits such as multiple-valued flip-flops (MVFF),
multi-valued Fredkin gates (MVFGs), multiple-valued multiple-output functions, etc. are
presented. Part-III considers methods to design programmable logic devices such as look-
up table (LUT)-based matrix multiplication using neural networks (NNs), programmable
logic arrays (PLAs), FPGA-based multiplier, LUT-based BCD adder circuit, a generic
CPLD board design and development, FPGA based Micro-PLC (Programmable Logic
Controller), etc. In Part-IV, methods to construct divider circuits with parallel computation
of quotients and partial remainders, TANT circuit, asymmetric high-radix signed-digital
(AHSD) adder, integrated framework for SOC test automation, SRAM using memristor,
microprocessor, etc. are explained and illustrated with appropriate figures. Finally, some
important applications of VLSI circuits and embedded systems are well discussed in the
last chapter of the book.
Part I

An Overview About Decision Diagrams

Computers are used to solve many problems, such as scheduling a product line of cars in a factory, designing elevator maps in tall buildings, detecting diseases in DNA, and even helping people decide which movie to watch next. Problems become harder every day, but computer scientists keep designing faster algorithms to cope with the ever-growing amount of data. Decision Diagrams (DDs) have been used in computer science and Artificial Intelligence (AI) for decades, for tasks such as logic circuit design and product configuration.
Binary Decision Diagrams (BDDs) are well known for their use in the logic area,
verification and model checking. Binary and Multi-valued Decision Diagrams (BDDs or
MDDs) are efficient data structures that represent functions or sets of tuples. An MDD,
defined over a fixed number of variables, is a layered rooted Directed Acyclic Graph (DAG).
It associates a variable with each of its layers. MDDs have an exponential compression
power and are widely used in problem solving. An MDD has a root node, and two potential
terminal nodes, the true terminal node, and the false terminal node. Each node, associated
with a variable, can have at most as many outgoing arcs as there are values in the domain of
the variable, and the arcs are labeled by these values. The label vectors of the valid path’s
arcs represent the valid tuples.
Multiple-valued Decision Diagrams (MDDs) are graph structures that are becoming the state-of-the-art means for representing both binary and multiple-valued logic functions. A number of research studies on MDDs have appeared recently in the literature, including some of the issues concerning the implementation of an MDD package. An unrolled automaton can be seen as a non-reduced MDD. Furthermore, BDDs and MDDs are increasingly used in optimization: over the last few decades, many works have shown how to use them efficiently to model and solve several optimization problems. An advantage of MDDs is that they have a fixed number of variables and often a strong compression ratio. However, MDDs can have exponential size, and this does occur in practice.
Two-level expressions of multiple-valued logic functions and their minimizations have
been a subject of active research for many years. This problem is important because it
provides a means for optimizing the implementations of circuits that are direct translations
of two-level expressions. Thus, two-level logic representations have direct impact on macro-
cell design styles using programmable logic arrays (PLAs). The Reed-Muller canonical form
can be extended to multiple-valued logic in several ways, depending on how its operations
are generalized. Many extensions have been suggested. In these extensions, the AND and XOR operations (which are equivalent to multiplication and addition modulo 2, respectively)
are generalized to addition and multiplication. Decomposition of switching functions is an
important task, since when decomposition is possible, it leads to many advantages in
network synthesis. At the same time, this is a difficult task.


Multi-Terminal Binary Decision Diagrams (MTBDDs) are a generalization of Binary Decision Diagrams (BDDs) derived by allowing integers or complex numbers as values of constant nodes. Therefore, MTBDDs represent integer or complex-valued functions on finite dyadic groups. Generalization to functions on arbitrary groups is straightforward. In this case, the nodes in MTBDDs are replaced by nodes with more than two outgoing edges.
This part starts with Shared Multi-Terminal Binary Decision Diagrams (SMTBDDs) in Chapter 1, where a method is introduced to represent multiple-output functions using SMTBDDs. Three types
of BDDs are also compared. A method to construct smaller binary decision diagrams for
characteristic functions (BDDs for CFs) is described in Chapter 2. An upper bound on the
number of nodes of the BDD is derived for CF of n-bit adders (adrn). The sizes of SBDDs,
MTBDDs, and BDDs for CFs are also compared in this chapter. Chapter 3 presents a method to represent multiple-output functions using shared multiple-valued decision diagrams (SMDDs). Some algorithms are also presented to pair the input
variables of binary decision diagrams (BDDs), and to find good orderings of the multiple-
valued variables in the SMDDs. The sizes of SMDDs are derived for general functions and
symmetric functions. In Chapter 4, heuristics are introduced to minimize multiple-valued decision diagrams (MDDs) for multiple-output functions. Upper bounds on the sizes of MDDs are presented for various functions. Chapter
5 considers methods to design multiple-output networks based on decision diagrams (DDs).
TDM (time-division multiplexing) systems transmit several signals on a single line. The
TDM method reduces the interconnections among the modules. Finally, in Chapter 6,
a method is introduced to construct smaller multiple-valued pseudo-Kronecker decision
diagrams (MVPKDDs). The method first generates a 4-valued input 2-valued multiple-
output function from a given 2-valued input 2-valued output function. Then, it constructs
a 4-valued decision diagram (4-valued DD) to represent the generated 4-valued input
function.
CHAPTER 1

Shared Multi-Terminal
Binary Decision Diagrams

Efficient representations of logic functions are very important in logic design. Various
methods exist to represent logic functions. Among them, graph-based representations such
as BDDs (binary decision diagrams) are extensively used in logic synthesis, test, and
verification. In logic simulation, the BDD-based methods offer orders-of-magnitude po-
tential speedup over traditional logic simulation methods. In real life, many practical logic
functions are multiple-output.
This chapter describes a method to represent m output functions using shared multi-
terminal binary decision diagrams (SMTBDDs). The SMTBDD(k) consists of multi-
terminal binary decision diagrams (MTBDDs), where each MTBDD represents k output
functions. An SMTBDD(k) is the generalization of shared binary decision diagrams (SB-
DDs) and MTBDDs: for k = 1, it is an SBDD, and for k = m, it is an MTBDD. The
size of a BDD is the total number of nodes. The features of SMTBDD(k)s are: (1) They
are often smaller than SBDDs or MTBDDs; (2) They evaluate k outputs simultaneously.
An algorithm is also described in this chapter for grouping output functions to reduce the
size of SMTBDD(k)s. An SMTBDDmin denotes whichever of an SMTBDD(2) and an SMTBDD(3) has fewer nodes.

1.1 INTRODUCTION
Three different approaches are considered in this chapter to represent multiple-output func-
tions using BDDs: shared binary decision diagrams (SBDDs), multi-terminal binary deci-
sion diagrams (MTBDDs), and shared multi-terminal binary decision diagrams (SMTB-
DDs). SMTBDDs are the generalization of SBDDs and MTBDDs. A general structure
of an SMTBDD( k ) with k = 3 is shown in Fig. 1.1. The evaluation of outputs using an
SMTBDD( k ) is k times faster than an SBDD since it evaluates k outputs at the same time.
For most functions, SMTBDD(k)s are smaller than the corresponding MTBDDs. In
modern LSI, the reduction of the number of pins is not so easy, even though the integra-
tion of more gates may be possible. The time division multiplexing (TDM) realizations of
multiple-output networks from SMTBDDs are useful to reduce the number of pins as well

as to reduce hardware. SMTBDD( k )s are also helpful for look-up table type FPGA design,
logic simulation, etc. SMTBDD(3)s are considered in this chapter.

Figure 1.1: General Structure of an SMTBDD( k ) with k = 3.

1.2 PRELIMINARIES
This section shows the definitions and properties of multiple-output functions and
SMTBDD( k )s.

Property 1.2.1 Let B = {0, 1}. A multiple-output logic function f with n input variables x1, . . ., xn and m output variables y1, . . ., ym is a function f : B^n → B^m, where x = (x1, . . ., xn) ∈ B^n is an input vector, and y = (y1, . . ., ym) ∈ B^m is an output vector of f.

Example 1.1 Table 1.1 shows a 2-input 6-output function.

Table 1.1: 2-Input 6-Output Function

Input Output
x1 x2 f0 f1 f2 f3 f4 f5
0 0 0 1 0 0 1 1
0 1 0 1 0 0 1 1
1 0 1 1 1 0 1 0
1 1 0 1 1 1 0 1

Property 1.2.2 Let F(a) = (f0(a), f1(a), . . ., fm−1(a)) be the output vector of the m functions for an input a = (a1, a2, . . ., an) ∈ B^n. Two output vectors F(ai) and F(aj) are distinct iff F(ai) ≠ F(aj). Let r be the number of distinct output vectors in F(a) = (f0(a), f1(a), . . ., fm−1(a)).

Example 1.2 Consider the 2-input 6-output function in Table 1.1. The distinct output
vectors are (0, 1, 0, 0, 1, 1), (1, 1, 1, 0, 1, 0), and (0, 1, 1, 1, 0, 1). Therefore, the number of
distinct output vectors is three, i.e., r = 3.
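As a quick illustration of this counting, the following Python sketch (an illustrative addition; the variable names are chosen here and are not from the text) computes r for the function of Table 1.1 by collecting its output vectors in a set.

    # Truth table of the 2-input 6-output function in Table 1.1.
    rows = {
        (0, 0): (0, 1, 0, 0, 1, 1),
        (0, 1): (0, 1, 0, 0, 1, 1),
        (1, 0): (1, 1, 1, 0, 1, 0),
        (1, 1): (0, 1, 1, 1, 0, 1),
    }
    r = len(set(rows.values()))  # number of distinct output vectors
    print(r)                     # prints 3, matching Example 1.2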

Property 1.2.3 Let f0, f1, . . ., fm−1 (fi ≠ 0) be mutually disjoint, i.e., fi · fj = 0 for i ≠ j. Then, the number of distinct output vectors for F(x) = (f0(x), f1(x), . . ., fm−1(x)) is m or m + 1.

Proof 1.1 Since f0, f1, . . ., fm−1 are disjoint, in the vector F(x) = (f0(x), f1(x), . . ., fm−1(x)) at most one output fi(x) (i = 0, 1, . . ., m − 1) is one and the others are zero. Since every fi is non-zero, each of the m unit output vectors occurs, so the number of distinct output vectors is at least m. On the other hand, when the all-zero output vector also occurs, the number of distinct output vectors is m + 1.

Example 1.3 Consider the 2-input m-output function in Table 1.2, where m = 3. The
distinct output vectors for the functions f0, f1, and f2 are (1, 0, 0), (0, 1, 0), and (0, 0, 1).
So, the number of distinct output vectors is m. Now, consider the 2-input 3-output function
in Table 1.3. The distinct output vectors in f0, f1 , and f2 are (1, 0, 0), (0, 1, 0), (0, 0, 1), and
(0, 0, 0). In this case, the number of distinct output vectors is m + 1.

Table 1.2: 2-Input 3-Output Function with Three Distinct Output Vectors

Input Output
x1 x2 f0 f1 f2
0 0 1 0 0
0 1 0 1 0
1 0 0 0 1
1 1 0 1 0

Table 1.3: 2-Input 3-Output Function with Four Distinct Output Vectors

Input Output
x1 x2 f0 f1 f2
0 0 1 0 0
0 1 0 1 0
1 0 0 0 1
1 1 0 0 0

Property 1.2.4 Let f be a function. The set of input variables on which f depends is the
support of f, and is denoted by support(f). The size of the support is the number of variables
in the support(f).

Example 1.4 Table 1.1 shows a 2-input 6-output function. An SMTBDD(3) can be con-
structed with the groupings [f0 , f1 , f2 ] and [f3 , f4 , f5 ]. support(f0 , f1 , f2 ) = {x1 , x2 }, and
support(f3 , f4 , f5 ) = {x1 , x2 }. Thus, the sizes of the supports are 2.

Property 1.2.5 The size of the BDD, denoted by size(BDD), is the total number of terminal
and non-terminal nodes. In the case of SBDDs and SMTBDD( k )s, the sizes include the
nodes for the output selection variables.

The size of the SMTBDD(3) in Fig. 1.3 is 9.

1.2.1 Shared Multi-Terminal Binary Decision Diagrams


Shared binary decision diagrams (SBDDs), multi-terminal binary decision diagrams (MTB-
DDs), and shared multi-terminal binary decision diagrams (SMTBDDs) represent multiple-
output functions. SMTBDDs consist of MTBDDs. An SMTBDD( k ) is the generalization
of SBDDs and MTBDDs: for k = 1, it is an SBDD, and for k = m, it is an MTBDD,
where m is the number of output functions. Figs. 1.2 and 1.3 show an SMTBDD(2) and
an SMTBDD(3) for Table 1.1, respectively. In Fig. 1.3, the SMTBDD(3) has two groups:
[f0 , f1 , f2 ] and [f3 , f4 , f5 ], and g0 is the output selection variable which selects a group of
outputs. In this chapter, “[ ]” denotes a group of output functions that consists of two or
more outputs.

Figure 1.2: SMTBDD(2) with Groupings [f0 , f1 ], [f2 , f3 ], and [f4 , f5 ] for the Functions in
Table 1.1.

Let [fi, fj] be a pair of two output functions, where i ≠ j. The following two techniques are used to reduce the number of nodes in the SMTBDD(k)s:

Figure 1.3: SMTBDD(3) with Groupings [ f0, f1, f2 ], and [ f3, f4, f5 ] for the Functions in
Table 1.1.

1. In general, an MTBDD for two outputs has four terminal nodes [0, 0], [0, 1], [1, 0], and [1, 1]. However, if fi fj = 0, then [1, 1] never appears as a terminal node in the MTBDD of an SMTBDD(2). Thus, this pairing of output functions tends to produce a smaller BDD, since the number of terminal nodes is at most three. Similarly, if fi'fj' = 0, fi'fj = 0, or fi fj' = 0, then [fi, fj] is also a candidate for a pair.

2. If support(fi) ∩ support(fj) ≠ ∅, then [fi, fj] is a candidate for a pair; otherwise, they should be represented by separate BDDs.

Note that these two techniques are also applicable to SMTBDD( k )s with k ≥ 3.
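As a minimal sketch of the first technique above (an illustrative addition; the on-sets and names are written here as Python sets and are not from the text), the disjointness condition fi·fj = 0 can be checked directly on the on-sets of the two outputs:

    def disjoint(on_i, on_j):
        """fi . fj = 0 iff the two on-sets share no input vector."""
        return not (on_i & on_j)

    # On-sets (input vectors mapped to 1) of f0 and f3 from Table 1.1.
    f0_on = {(1, 0)}
    f3_on = {(1, 1)}
    print(disjoint(f0_on, f3_on))  # True: [f0, f3] is a pairing candidate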

Property 1.2.6 Let an SMTBDD(k) consist of two MTBDDs: MTBDD1 and MTBDD2. MTBDD1 and MTBDD2 are disjoint iff they do not share any non-terminal nodes with each other in the SMTBDD(k).

Example 1.5 In Fig. 1.3, there are two disjoint MTBDDs for groupings [ f0, f1, f2 ], and [f3 ,
f4 , f5 ].

Property 1.2.7 Let SMTBDD1 and SMTBDD2 be SMTBDD(k)s. Let SMTBDD1 consist of MTBDD1 and MTBDD2, and let SMTBDD2 consist of MTBDD3 and MTBDD4. If all the MTBDDs are disjoint from each other, size(MTBDD1) = size(MTBDD3), and size(MTBDD2) = size(MTBDD4), then size(SMTBDD1) = size(SMTBDD2).

Property 1.2.8 The numbers of terminal nodes in an SMTBDD(2) and an SMTBDD(3) are the same iff the numbers of distinct output vectors are also the same.

Proof 1.2 Since the number of terminal nodes in an SMTBDD(k) is equal to the number of distinct output vectors, an SMTBDD(2) and an SMTBDD(3) have the same number of terminal nodes iff the numbers of distinct output vectors in the two SMTBDDs are also the same.

Example 1.6 Figs. 1.2 and 1.3 show an SMTBDD(2) and an SMTBDD(3) for the functions in Table 1.1, respectively. The numbers of terminal nodes in both SMTBDDs are the same, since the numbers of distinct output vectors are also the same, i.e., 4.

Property 1.2.9 All the functions {0, 1}^n → {0, 1, . . ., r − 1} can be represented by an MTBDD with r^(2^n) nodes.

Proof 1.3 No more than r^(2^n) nodes are needed; otherwise, two nodes would represent the same function and could be combined. No fewer than r^(2^n) nodes are used, because there are that many functions.

Property 1.2.10 Let r be the number of distinct output vectors of an n-input m-output function. Then, the size of the MTBDD is at most min_{1≤k≤n} {2^(n−k) − 1 + r^(2^k)}.

Proof 1.4 Consider the MTBDD in Fig. 1.4, where the upper block is a binary decision tree of (n − k) variables, and the lower block generates all the functions of k or fewer variables. The binary decision tree of (n − k) variables has 1 + 2 + 4 + · · · + 2^(n−k−1) = 2^(n−k) − 1 nodes. By Property 1.2.9, the MTBDD of the k-variable functions with r distinct output vectors has at most r^(2^k) nodes. Thus, the size of the MTBDD is at most min_{1≤k≤n} {2^(n−k) − 1 + r^(2^k)}. Note that this upper bound on the size of the MTBDD is used in Algorithm 1.1.

Figure 1.4: Representation of an n-input m-output Function by an MTBDD.
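The bound of Property 1.2.10 is easy to evaluate numerically. The following Python sketch is an illustrative addition (the function name and the example values are assumptions); the SMTBDD bound of Property 1.2.12 given below differs only by the factor m1 on the 2^(n−k) term.

    def mtbdd_upper_bound(n, r):
        """min over k = 1..n of 2^(n-k) - 1 + r^(2^k)  (Property 1.2.10)."""
        return min((2 ** (n - k) - 1) + r ** (2 ** k) for k in range(1, n + 1))

    # Example: a 4-input function with r = 4 distinct output vectors.
    print(mtbdd_upper_bound(4, 4))  # 23, attained at k = 1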

Property 1.2.11 Let m1 be the total number of groups of m distinct output functions. Then,
the number of nodes for the output selection variables in the SMTBDD is m1 – 1.

Example 1.7 Consider the SMTBDD in Fig. 1.3, where the total number of groups is
two, i.e., [ f0, f1, f2 ], and [ f3, f4, f5 ]. Therefore, the number of nodes for the output selection
variables in the SMTBDD is one.

Property 1.2.12 Let m1 be the total number of groups of the m distinct output functions, and let f : {0, 1}^n → {0, 1, . . ., r − 1}^m. Then, f can be represented by an SMTBDD with at most min_{1≤k≤n} {m1·2^(n−k) − 1 + r^(2^k)} nodes.

Proof 1.5 Consider the mapping f : {0, 1}^n → {0, 1, . . ., r − 1}^m, where r is the number of terminal nodes. In the SMTBDD in Fig. 1.5, the upper block selects one of the m1 groups of the m output functions, the middle block consists of binary decision trees of (n − k) variables, and the lower block generates all the functions of k variables by an MTBDD with r terminal nodes. By Property 1.2.11, the upper block requires (m1 − 1) nodes to select the m1 groups. By Property 1.2.9, the lower block requires at most r^(2^k) nodes. Now, consider the middle block. Each binary decision tree of (n − k) variables has 1 + 2 + 4 + · · · + 2^(n−k−1) = 2^(n−k) − 1 nodes. Since the number of binary decision trees is m1, the total number of nodes for the m1 binary decision trees is m1·(2^(n−k) − 1). Therefore, the number of nodes in the SMTBDD for the m output functions is at most
min_{1≤k≤n} {(m1 − 1) + m1·(2^(n−k) − 1) + r^(2^k)} = min_{1≤k≤n} {m1·2^(n−k) − 1 + r^(2^k)}.

Figure 1.5: Representation of an n-input m-output Function by an SMTBDD.



Property 1.2.13 Let an SMTBDD(m) represent the m functions fi = xi (i = 0, 1, . . ., m − 1), and let [f0, f1, . . ., fm−1] be the group of the m-output function. Then, the size of the SMTBDD(m) for the grouping [f0, f1, . . ., fm−1] is 2^(m+1) − 1.

1.3 AN OPTIMIZATION ALGORITHM FOR SMTBDD(K )S


In this section, an algorithm is shown for deriving small SMTBDD( k )s using clique covers.
Note that this algorithm can be used for SMTBDD( k )s with k ≥ 3. The clique covering is one
of the NP-hard problems of graph optimization. Usually, the edge or vertex weighted graph
for this problem is considered. For the optimization of SMTBDD( k )s, a clique weighted
graph is used, i.e., each group of vertices has a weight.

Property 1.3.1 A clique of a graph is a set of vertices such that every pair is connected
by an edge.

Example 1.8 Each of c0 and c1 in Fig. 1.6 is a clique.

Figure 1.6: The Clique Weighted Graph.

Property 1.3.2 Let G = (V, E) be a graph, where V and E denote a set of vertices and a
set of edges, respectively. A clique cover of G is a partition of V such that each set in the
partition is a clique.

Example 1.9 In Figure 1.6, a clique cover is formed by cliques c0 and c1 .

Property 1.3.3 Let G = (V, E) be a graph. Then, G is a clique weighted graph iff each
subset of vertices of G has a weight and each vertex pair is connected by an edge.

Property 1.3.4 Figure 1.6 is an example of a clique weighted graph. For simplicity, only
two cliques are shown with their weights. The weights of the cliques c0 and c1 are w0 and
w1 , respectively.

Problem: Given a clique weighted graph G = (V, E), find the clique cover such that the
sum of weights of the cliques in the clique cover is minimum.
Note that the weight corresponds to the upper bound on the size of the MTBDD, and the
minimum weighted clique cover corresponds to the groupings of outputs that have small
size, though sometimes they are not minimum.

1.3.1 The Weight Calculation Procedure


Each clique in the clique weighted graph has a weight. In this subsection, a method is shown
for calculating the weights of the cliques. From here, it is assumed that the size of a clique
is three, i.e., each clique has three vertices.

Property 1.3.5 Let F = {f0, f1, . . ., fm−1} be a set of m output functions, and let Fi (i = 1, 2, . . ., s) be subsets of F. {F1, F2, . . ., Fs} is called a partition of F if ∪_{i=1..s} Fi = F, Fi ∩ Fj = ∅ for i ≠ j, and Fi ≠ ∅ for every i. Henceforth, each of F1, F2, . . ., Fs is called a group of output functions. Note that each vertex in the clique weighted graph represents an output function, and each group and each partition of output functions correspond to a clique and a clique cover, respectively.

Example 1.10 Let F = {f0, f1, f2, f3, f4, f5} be a set of six output functions. Then, the partitions of F into groups of three outputs are as follows: {[f0, f1, f2], [f3, f4, f5]}, {[f0, f1, f3], [f2, f4, f5]}, {[f0, f1, f4], [f2, f3, f5]}, {[f0, f1, f5], [f2, f3, f4]}, {[f0, f2, f3], [f1, f4, f5]}, {[f0, f2, f4], [f1, f3, f5]}, {[f0, f2, f5], [f1, f3, f4]}, {[f0, f3, f4], [f1, f2, f5]}, {[f0, f3, f5], [f1, f2, f4]}, and {[f0, f4, f5], [f1, f2, f3]}.

Property 1.3.6 Let F be an n-input m-output function. The dependency matrix B = (bij) for F is a 0–1 matrix with m rows and n columns. bij = 1 iff fi depends on xj, and bij = 0 otherwise, where i = 0, 1, . . ., m − 1 and j = 1, 2, . . ., n.

Example 1.11 Consider the 4-input 6-output function:

f0(x1, x2, x3, x4) = x2 x3, f1(x1, x2, x3, x4) = x1 x4 ∨ x2, f2(x1, x2, x3, x4) = x1 ∨ x3,
f3(x1, x2, x3, x4) = x3, f4(x1, x2, x3, x4) = x1 ∨ x3 x4, and f5(x1, x2, x3, x4) = x4.
The dependency matrix B is

          x1  x2  x3  x4
    f0     0   1   1   0
    f1     1   1   0   1
    f2     1   0   1   0
    f3     0   0   1   0
    f4     1   0   1   1
    f5     0   0   0   1

Property 1.3.7 Let F be an n-input m-output function, and let [fi, fj, fk] be a group of output functions. The group-dependency matrix A = (aij) for F is a 0–1 matrix with m(m−1)(m−2)/6 rows and n columns. aij = 1 iff at least one of the outputs in the group depends on xj, and aij = 0 otherwise.

Example 1.12 Consider the 6-output function in Example 1.11. The group-dependency matrix A has one row for each of the 20 groups [fi, fj, fk]. Note that the row for [fi, fj, fk] in A is equal to the bit-wise OR of the rows for fi, fj, and fk in the dependency matrix B.
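A small Python sketch (an illustrative addition; the support sets are read off the functions of Example 1.11, and all names are chosen here) shows how the dependency matrix B is built and how a row of the group-dependency matrix A is obtained by a bit-wise OR:

    variables = ["x1", "x2", "x3", "x4"]
    supports = {                       # supports of the outputs of Example 1.11
        "f0": {"x2", "x3"}, "f1": {"x1", "x2", "x4"}, "f2": {"x1", "x3"},
        "f3": {"x3"},       "f4": {"x1", "x3", "x4"}, "f5": {"x4"},
    }

    # Dependency matrix B: b_ij = 1 iff f_i depends on x_j.
    B = {f: [1 if v in s else 0 for v in variables] for f, s in supports.items()}

    # Row of A for the group [f0, f2, f3]: bit-wise OR of the rows of B.
    group = ["f0", "f2", "f3"]
    row = [max(B[f][j] for f in group) for j in range(len(variables))]
    print(row)  # [1, 1, 1, 0]: the group depends on x1, x2, x3, so s(0, 2, 3) = 3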

Property 1.3.8 Let r[fi, fj, fk] be the number of distinct output vectors for the group of outputs [fi, fj, fk]. Note that 1 ≤ r[fi, fj, fk] ≤ 8. r[fi, fj, fk] is equal to the number of non-zero functions among fi fj fk, fi fj fk', fi fj' fk, fi fj' fk', fi' fj fk, fi' fj fk', fi' fj' fk, and fi' fj' fk'.

Example 1.13 Consider the 6-output function in Example 1.11. There are 20 groups of
output functions. The number of distinct output vectors r[ fi , f j , fk ] for each group [ fi , f j , fk ]
is calculated as follows:

For r[f0, f2, f3]: f0 f2 f3 = x1 x2 x3 ∨ x2 x3, f0 f2 f3' = 0, f0 f2' f3 = 0, f0 f2' f3' = 0, f0' f2 f3 = x1 x2' x3 ∨ x2' x3, f0' f2 f3' = x1 x2' x3' ∨ x1 x3', f0' f2' f3 = 0, and f0' f2' f3' = x1' x2' x3' ∨ x1' x3'.
Since the number of non-zero functions is four, r[f0, f2, f3] = 4. Similarly, the number of distinct output vectors can be calculated for the other groups of output functions.

Property 1.3.9 Let s(i, j, k) be the size of the support for a group of output functions [fi, fj, fk]. The weight w(i, j, k) for [fi, fj, fk] is min_{0≤t≤n−1} {2^(s(i,j,k)−t) − 1 + (r[fi, fj, fk])^(2^t)}. This will be the weight of the clique in the clique weighted graph.
This will be the weight of the clique in the clique weighted graph.

Example 1.14 Consider the 6-output function in Example 1.11. w(0, 2, 3) is calculated as
follows:

From Examples 1.12 and 1.13, it is known that s(0, 2, 3) = 3 and r[f0, f2, f3] = 4. w(0, 2, 3) takes its minimum when t = 0. Therefore, w(0, 2, 3) = 2^3 − 1 + 4 = 11. Similarly, the weights
for the other groups can be calculated. Since w(i, j, k) is an upper bound on the size of the
MTBDD for [ fi , f j , fk ], the MTBDD with the minimum weight is relatively small.
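The weight formula of Property 1.3.9 can be sketched in a few lines of Python (an illustrative addition; the function name is an assumption). For the group [f0, f2, f3] of this example it reproduces w(0, 2, 3) = 11:

    def clique_weight(s, r, n):
        """min over t = 0..n-1 of 2^(s-t) - 1 + r^(2^t)  (Property 1.3.9)."""
        return min((2 ** (s - t) - 1) + r ** (2 ** t) for t in range(0, n))

    # s(0, 2, 3) = 3, r[f0, f2, f3] = 4, and n = 4 input variables.
    print(clique_weight(3, 4, 4))  # 11, attained at t = 0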

1.3.2 Optimization of SMTBDD(3)s


To find the minimum weighted clique cover is an NP-hard problem of graph optimization.
So, a heuristic algorithm is used for finding a clique cover with small weight.

Algorithm 1.1 Optimization of SMTBDD(3)s


Input: A graph G = (V, E)
Output: A clique cover K of G whose sum of weights of the cliques is relatively
small.
1: Method: First, calculate the weights of all cliques in the graph G as shown in procedure
“Weightedclique” in Fig. 1.7. Second, use procedure “MinWeightcliquecover” in Fig.
1.7 to find the clique cover with small weight, where “W” is the list of sorted weights
of C , C is the set of cliques, and w(c) is the weight of the clique c.
2: Since Algorithm 1.1 is a greedy one, it may not obtain optimal solutions, but good solutions can be expected.
3: End

Figure 1.7: Pseudocode for Optimizing SMTBDD(3)s.
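The pseudocode of Fig. 1.7 is not reproduced here, so the following Python sketch is only an illustrative reconstruction of the greedy covering idea behind Algorithm 1.1 (all names and the toy weight function are assumptions, not the book's exact procedure): the 3-output cliques are sorted by weight, and the lightest clique whose outputs are still uncovered is taken repeatedly.

    from itertools import combinations

    def greedy_cover(outputs, weight):
        """Greedily build a clique cover from 3-output cliques, lightest first."""
        cliques = sorted(combinations(outputs, 3), key=weight)
        cover, used = [], set()
        for c in cliques:
            if used.isdisjoint(c):     # all three outputs are still uncovered
                cover.append(c)
                used.update(c)
        return cover

    # Toy weights for a 6-output function (illustration only).
    w = {c: i for i, c in enumerate(combinations(range(6), 3))}
    print(greedy_cover(range(6), lambda c: w[c]))  # [(0, 1, 2), (3, 4, 5)]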

1.4 SUMMARY
In this chapter, a method is introduced to represent multiple-output functions using SMTB-
DDs (Shared Multi-Terminal Binary Decision Diagrams). SMTBDD(k)s are not as large as MTBDDs (Multi-Terminal Binary Decision Diagrams), and their evaluation time is k times faster than that of a Shared BDD (SBDD), since k outputs are evaluated simultaneously. An algorithm is presented for grouping output functions to reduce the size of SMTBDD(k)s. A compact representation, SMTBDDmin, is also introduced: it denotes whichever of an SMTBDD(2) and an SMTBDD(3) has fewer nodes. Thus, SMTBDDs compactly represent many multiple-output functions and are useful for TDM (time-division multiplexing) realizations of multiple-output networks, look-up table type FPGA design, and logic simulation.
A multiple-output function can also be represented by a BDD for characteristic functions
(CFs). However, in most cases, BDDs for CFs are much larger than the corresponding
SBDDs. Moreover, if all the output functions depend on all the input variables, then the
size of the BDD for CF is greater than that of the corresponding MTBDD. By dropping the above ordering restriction, the size of the BDD can be reduced; however, in most cases, the sizes of such BDDs are still larger than those of the corresponding SBDDs. In many cases, the
sizes of SMTBDDs are not so large as BDDs for CFs.

REFERENCES
[1] S. B. Akers, “Binary decision diagrams”, IEEE Trans. Comput., vol. C-27, no. 6, pp.
509–516, 1978.
[2] P. Ashar and S. Malik, “Fast functional simulation using branching programs”, Pro-
ceedings of IEEE. International Conference on Computer-Aided Design, pp. 408–412,
1995.
[3] C. Scholl, R. Drechsler and B. Becker, “Functional simulation using binary deci-
sion diagrams”, Proceedings of IEEE. International Conference on Computer-Aided
Design, pp. 8–12, 1997.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE. International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[6] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[7] H. M. H. Babu and T. Sasao, “A method to represent multiple-output switching
functions by using binary decision diagrams”, The Sixth Workshop on Synthesis and
System Integration of Mixed Technologies, pp. 212–217, 1996.
[8] H. M. H. Babu and T. Sasao, “Representations of multiple-output logic functions
using shared multi-terminal binary decision diagrams”, The Seventh Workshop on
Synthesis and System Integration of Mixed Technologies, pp. 25–32, 1997.
[9] H. M. H. Babu and T. Sasao, “Design of multiple-output networks using time domain
multiplexing and shared multi-terminal multiple-valued decision diagrams”, IEEE
International Symposium on Multiple Valued Logic, pp. 45–51, 1998.
[10] E. Balas and C. S. Yu, “Finding a maximum clique in an arbitrary graph”, SIAM J.
Comput. vol. 15, pp. 1054–1068, 1986.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE. International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] T. Sasao, ed., “Logic Synthesis and Optimization”, Kluwer Academic Publishers,
Boston, 1993.
[13] A. Srinivasan, T. Kam, S. Malik and R. K. Brayton, “Algorithm for discrete functions
manipulation”, Proceedings of IEEE. International Conference on Computer-Aided
Design, pp. 92–95, 1990.
[14] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[15] M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory
of NP-Completeness”, Freeman, New York, 1979.
[16] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, vol. E81-A, no. 12, pp. 2545–2553, 1998.
CHAPTER 2

Multiple-Output Functions
Using BDD for Characteristic
Functions

This chapter describes a method to construct smaller binary decision diagrams for charac-
teristic functions (BDDs for CFs). A BDD for CF represents an n-input m-output function.
An upper bound on the number of nodes of the BDD is derived for CF of n-bit adders (adrn).
As a result: (1) BDDs for CFs are usually much smaller than MTBDDs (Multi-Terminal
Binary Decision Diagrams); (2) For adrn and for some benchmark circuits, BDDs for CFs
are the smallest among the three types of BDDs; and (3) The introduced method often
produces smaller BDDs for CFs.

2.1 INTRODUCTION
Binary decision diagrams (BDDs) are compact representations of logic functions, and are
useful for logic synthesis, time-division multiplexing (TDM) realization, test, verification,
etc. Shared binary decision diagrams (SBDDs), multi-terminal binary decision diagrams
(MTBDDs), and BDDs for characteristic functions (BDDs for CFs) represent multiple-
output Functions. SBDDs are compact. MTBDDs evaluate all the outputs simultaneously
but they usually blow up in memory for large benchmark circuits. BDDs for CFs use CFs of
multiple-output functions. A CF is a switching function representing the relation of inputs
and outputs. Fig. 2.1 shows the general structure of a BDD for CF.
BDDs for CFs are usually much smaller than MTBDDs. The main applications of BDDs
for CFs are logic simulation of digital circuits and implicit state enumeration of finite state
machines. In this chapter, a method is considered to construct compact BDDs for CFs. An
algorithm is represented to find a good ordering of input and output variables. An upper
bound on the number of nodes of the BDD is also derived for CF of n-bit adders (adrn).


Figure 2.1: General Structure of a BDD for CF.

2.1.1 Basic Definitions


This section presents some important definitions.

Property 2.1.1 support( f ) is the set of input variables that the function f depends on. The
size of the support is the number of variables in the support( f ).

Example 2.1 Let f(x1, x2, x3) = x1 x2 x3 ∨ x1 x2 x3'. Then, support(f) = {x1, x2}, since f is also represented as f = x1 x2. Thus, the size of the support is two.

Property 2.1.2 Let fi1 and fi2 be two output functions. The size of the union of the support
for fi1 and fi2 is the number of support variables for fi1 and fi2 .

Example 2.2 Consider the 4-input 2-output function:


f0(x1, x2, x3, x4) = x1' x2 ∨ x1 x3' ∨ x2' x3, and
f1(x1, x2, x3, x4) = x1 x3 ∨ x3' x4.
The size of the union of the support for f0 and f1 is 4, since x1, x2, x3, and x4 are the support
variables for f0 and f1 .

Property 2.1.3 Let f : {0, 1}^n → {0, 1}^m. The size of a decision diagram (DD) for a multiple-output function f, denoted by size(DD, f), is the total number of terminal and non-terminal nodes in the minimal DD.

In the case of SBDDs, the size also includes the nodes for output selection variables.

Example 2.3 The size of the BDD for CF in Fig. 2.2 is 14.

Figure 2.2: BDD for CF of a 3-Input 2-Output Bit-Counting Function (wgt3).

2.2 BINARY DECISION DIAGRAMS FOR MULTIPLE-OUTPUT FUNCTIONS


In this section, shared binary decision diagrams (SBDDs), multi-terminal binary decision
diagrams (MTBDDs) and binary decision diagrams for characteristic functions (BDDs for
CFs) are presented.

2.2.1 SBDDs and MTBDDs


Shared binary decision diagrams (SBDDs) and multi-terminal binary decision diagrams
(MTBDDs) represent multiple-output functions. An SBDD is a set of BDDs combined by a
tree for output selection, while an MTBDD is a BDD with many terminal nodes. MTBDDs
evaluate all the outputs simultaneously, but they are usually much larger than SBDDs.

2.2.2 BDDs for CFs


In this subsection, definition and properties of BDDs for CFs are presented.

Property 2.2.1 Let B = {0, 1}, let a ∈ B^n and f(a) = (f0(a), f1(a), . . ., fm−1(a)) ∈ B^m, and let b ∈ B^m. The characteristic function (CF) F of a multiple-output function f = (f0, f1, . . ., fm−1) is an (n + m)-variable switching function such that

F(a, b) = 1 if b = f(a), and F(a, b) = 0 otherwise.

A CF of an n-input m-output function is a switching function with n + m variables. In the CF, besides the n input variables, one binary variable is used for each output function.

Table 2.1: 2-Input 2-Output Function

Input Output
x1 x2 f0 f1
0 0 0 0
0 1 0 0
1 0 1 0
1 1 1 1

A BDD for CF is a BDD representing the CF. In order to guarantee a fast evaluation, the
output variables can appear on any path of the BDD for CF only after all the supports
have appeared. Fig. 2.2 shows the BDD for CF of a 3-input 2-output bit-counting function
(wgt3), where x1, x2 and x3 are input variables, and f 0 and f 1 are output variables. The
BDD for CF in Fig. 2.2 shows that each path from the root to the terminal 1 corresponds to
an input-output combination. The advantages of BDDs for CFs are: (1). They can represent
large multiple-output functions; and (2) they can evaluate all the outputs in O(n + m) time.

Property 2.2.2 Let F be the characteristic function of an n-input m-output function. An input-output combination for F is valid if the output vector in the combination is produced when the input vector of the combination is applied.

Property 2.2.3 Let F be a characteristic function of f : {0, 1}^n → {0, 1}^m. Then, the number of valid input-output combinations for F is 2^n.

Example 2.4 Consider the 2-input 2-output function in Table 2.1. The valid input-output
combinations are (0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 1, 0) and (1, 1, 1, 1).
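A direct way to see Properties 2.2.1 and 2.2.3 is to evaluate the CF on a stored truth table. The following Python sketch is an illustrative addition for the function of Table 2.1 (the names are assumptions):

    # Truth table of the 2-input 2-output function of Table 2.1.
    table = {
        (0, 0): (0, 0),
        (0, 1): (0, 0),
        (1, 0): (1, 0),
        (1, 1): (1, 1),
    }

    def F(a, b):
        """Characteristic function: F(a, b) = 1 iff b = f(a)."""
        return 1 if table[a] == b else 0

    print(F((1, 0), (1, 0)))  # 1: a valid input-output combination
    print(F((1, 0), (0, 1)))  # 0: an invalid combination
    # The number of valid combinations is 2^n = 4 (Property 2.2.3).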

Property 2.2.4 If the output variables appear only after all the supports, then an arbitrary
n-input m-output function can be evaluated by a BDD for CF in O(n + m) time.

Note that if the above restriction of the ordering of the variables is dropped, then it can’t
guarantee the evaluation time of O(n + m).

2.2.2.1 BDDs for CFs of Multiple-Output Functions


In this part, the sizes of BDDs for CFs of multiple-output functions are presented. Since the
CF of a multiple-output function is a special case of an (n + m)-variable switching function,
the followings properties can be obtained:

Property 2.2.5 Let f be an n-input m-output function. Then,

size(BDD for CF, f) ≤ min_{1≤k≤n+m} {2^((n+m)−k) − 1 + 2^(2^k)}.

Example 2.5 Let f : {0, 1}^7 → {0, 1}^10. Then, by Property 2.2.5, size(BDD for CF, f) ≤ 16639. On the other hand, from Table 2.2, size(SBDD, f) ≤ 335 is obtained.
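The bound of Property 2.2.5 can be checked numerically; the Python sketch below is an illustrative addition (the function name is an assumption) and reproduces the value 16639 of Example 2.5.

    def cf_bdd_upper_bound(n, m):
        """min over k = 1..n+m of 2^((n+m)-k) - 1 + 2^(2^k)  (Property 2.2.5)."""
        return min((2 ** ((n + m) - k) - 1) + 2 ** (2 ** k)
                   for k in range(1, n + m + 1))

    print(cf_bdd_upper_bound(7, 10))  # 16639, attained at k = 3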

Property 2.2.6 Let fs = (f0, f1, . . ., fm−1) represent m functions, where fi = xi (i = 0, 1, . . ., m − 1). Then, size(BDD for CF, fs) ≤ 3m + 2.

Proof 2.1 The proof is done by mathematical induction.

(1) Base: For m = 1, the CF of the function f0 = x0 is realized by a BDD for CF with five nodes, as shown in Fig. 2.3.
(2) Induction: Assume that the hypothesis is true for m − 1 functions; that is, the CF of m − 1 functions is realized by the BDD for CF in Fig. 2.4 with 3(m − 1) + 2 = 3m − 1 nodes. In Fig. 2.4, first remove the constant 0 and constant 1. Second, attach the variables xm−1 and fm−1 and the corresponding three non-terminal nodes, as well as the nodes for constant 0 and constant 1. Then, the diagram in Fig. 2.5 is obtained. It is clear that Fig. 2.5 shows the BDD for CF of m functions with 3m + 2 nodes, which has three more non-terminal nodes than Fig. 2.4. Thus,
from (1) and (2), the property is satisfied.
Note that the size of the MTBDD for the functions fs is exponential, while that for the
BDD for CF and the SBDD are linear as shown in Table 2.2.

Table 2.2: Comparison of SBDDs, MTBDDs, and BDDs for CFs

Figure 2.3: BDD for CF of Function f 0 = x 0 .

Property 2.2.7 Let adrn be a 2n-input (n + 1)-output function that computes the sum of
two n-bit numbers.

Property 2.2.8 size(BDD for CF, adrn) ≤ 9n + 1 (n ≥ 2).



Figure 2.4: BDD for CF of (m − 1) Functions.

Figure 2.5: BDD for CF of m Functions.

Proof 2.2 Suppose that the variables of the adrn are assigned as follows:

Mathematical induction is used here to prove the theorem.


(1) Base: As shown in Fig. 2.6, adr2 is represented by using 17 non-terminal nodes
and two terminal nodes. In Fig. 2.6, only 1-paths are shown, and constant 0 and 0-paths
are omitted for simplicity. Note that x 0 and y 0 are near to the root node, and z2 (the output
variable representing the most significant bit of adr2) is the nearest to the constant 1 node.

(2) Induction: Suppose that adrn is represented by using (9n − 1) non-terminal nodes and two terminal nodes. Also, assume that the variable zn is the nearest to the constant 1 node. Let v0 and v1 be the nodes of zn whose edges e0 and e1, respectively, connect to the constant 1. This situation is shown in Fig. 2.7. In Fig. 2.7, first remove the nodes of zn, the edges e0 and e1, and the constant nodes. Second, attach the variables xn, yn, zn, and zn+1 and the corresponding 9 non-terminal nodes, as well as the constant nodes. Then, the diagram shown in Fig. 2.8 is obtained.
Note that Fig. 2.8 has 9 more non-terminal nodes than Fig. 2.7. It is clear that Fig. 2.8 represents the characteristic function of adr(n + 1), and has (9n − 1) + 9 + 2 = 9(n + 1) + 1 nodes. In this case, the ordering of the variables is (x0, y0, z0, x1, y1, z1, . . ., xn, yn, zn, zn+1). Thus, from (1) and (2), the property is proved.

Figure 2.6: BDD for CF of adr2.

Figure 2.7: BDD for CF of adrn.



Figure 2.8: BDD for CF of adrn (After updating the variables and constants).

2.2.3 Comparison of Various BDDs


BDDs are useful for various applications. Sometimes different BDDs can be used for the
same application. So, it is necessary to know the properties of BDDs. The size of the BDD
is important to represent functions compactly, while the evaluation time for the BDD is
useful for logic simulation. Table 2.2 compares the sizes and the evaluation time of SBDDs,
MTBDDs, and BDDs for CFs of an n-input m-output function. In the table, r denotes
the number of distinct output vectors for the outputs, and f s = ( f 0, f 1, . . ., f m−1 ), where
f i = x i (i = 0, 1, . . ., m−1).

2.3 CONSTRUCTION OF COMPACT BDDS FOR CFS


The construction of compact BDDs for CFs is useful for efficient representations of multiple-
output functions. In this section, a method is presented to construct compact BDDs for CFs.

2.3.1 Formulation of the Problem


Property 2.3.1 A BDD for CF is minimum iff it contains the minimum number of nodes.

Problem 1: Let u1, u2, . . ., uk be the variables. Let Order[k] = (ue1 , ue2 , . . ., uek ) be a
permutation of the k variables. Let size(BDD for CF, f ) be the total number of nodes in the
BDD for CF for a certain Order[k] of the variables. Find a variable ordering Order[k] =
(ue1 , ue2 , . . ., uek ) for a given multiple-output function f such that the size(BDD for CF, f )
is the minimum.
In general, it is very time-consuming to find the best ordering of variables of the BDD
for CF. So, a good variable ordering from the initial one is computed by using the modified
sifting algorithm. To generate a good initial ordering, the following methods are applied: i)
ordering of output variables; ii) interleaving based sampling schemes for ordering of input
variables; and iii) interleaving method for input variables and output variables.

2.3.2 Ordering of Output Variables


Output functions are ordered so that the outputs with many support variables in common
are adjacent. The following strategies are used here:
Strategy 1: fi and fj are candidates for a pair of output functions if support(fi) ∩ support(fj) ≠ ∅.
Strategy 2: Let s(fi1, fi2) be the size of the union of the supports of fi1 and fi2. Then, (fi1, fi2) is a candidate for a pair of output functions if its s(fi1, fi2) is the smallest among all pairs. Apply the same idea recursively to the rest of the functions to find a good partition of the output functions.

Example 2.6 Consider the 4-input 4-output function:

f0(x1, x2, x3, x4) = x1' x2 ∨ x1 x3' ∨ x2' x3,
f1(x1, x2, x3, x4) = x3 x4',
f2(x1, x2, x3, x4) = x1 x3 ∨ x3' x4, and
f3(x1, x2, x3, x4) = x4.

Algorithm 2.1 Ordering of Output Functions


1: Find a good partition of output functions using Strategies 1 and 2.
2: Order the output functions with the pairs of outputs of a good partition.
3: Do Step 2 until all the variables from the initial ordering have been checked, and choose
the smallest BDD for CF.

There are six pairs of output functions. The sizes of the supports for these pairs of output
functions are: s( f0, f1 ) = s( f0, f2 ) = s( f0, f3 ) = 4, s( f1, f2 ) = s( f2, f3 ) = 3, and s( f1, f3 ) = 2.
Since s( f1, f3 ) = 2 is the smallest among all s( fi , f j ), ( f1, f3 ) is the candidate of a pair. The
remaining outputs are f0 and f2. Thus, (f0, f2) is the other pair. Therefore, the partition of the output functions is obtained as {(f1, f3), (f0, f2)}.

Example 2.7 Consider the functions in Example 2.6. Since {( f1, f3 ), ( f0, f2 )} is a good
partition of output functions, the ordering of outputs is ( f1, f3, f0, f2 ).
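A minimal Python sketch of Strategies 1 and 2 (an illustrative addition; the support sets are those of Example 2.6, and the function name is an assumption) pairs the outputs whose support union is smallest and then recurses on the remaining outputs:

    from itertools import combinations

    supports = {                       # supports of the outputs of Example 2.6
        "f0": {"x1", "x2", "x3"},
        "f1": {"x3", "x4"},
        "f2": {"x1", "x3", "x4"},
        "f3": {"x4"},
    }

    def pair_outputs(sup):
        sup = dict(sup)
        pairs = []
        while len(sup) > 1:
            # choose the pair with the smallest support union s(fi1, fi2)
            fi, fj = min(combinations(sup, 2),
                         key=lambda p: len(sup[p[0]] | sup[p[1]]))
            pairs.append((fi, fj))
            del sup[fi], sup[fj]
        return pairs

    print(pair_outputs(supports))  # [('f1', 'f3'), ('f0', 'f2')], as in Example 2.7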

2.3.3 Interleaving-Based Sampling Schemes for Ordering of Input Variables


The sizes of BDDs are sensitive to orderings of input variables. Dynamic reordering
methods are useful to order the input variables. However, such methods are extremely
time-consuming, and can fail to construct the BDDs for many functions. In real life, many
practical logic circuits are multiple outputs. So, it is important to find the same good variable
ordering for different output functions, since most of the BDD-based CAD tools handle
multiple-output functions at the same time. Here, a method is presented to order the inputs
of multiple-output functions. The sampling methods are considered for computing variable
orderings of SBDDs, where a sample corresponds to a group of output functions, and each
SBDD represents a sample. Then, an interleaving method is used to find a good ordering
of the input variables for output functions from the variable orderings of compact SBDDs.
The algorithm is shown in Fig. 2.12. The input variables that greatly affect the size of the
BDD are called influential. The influential variables should be placed in the higher positions of a good variable ordering.
Property 2.3.2 A sample is a multiple-output function in which outputs with common support variables are usually adjacent. These functions form a part of the total set of output functions. The size of a sample is the number of outputs in the sample.
Example 2.8 Consider the functions in Example 2.6. ( f0, f2 ) can be a sample, since
support( f0 ) = {x1, x2, x3 } and
support( f2 ) = {x1, x3, x4 }. The size of ( f0, f2 ) is 2.
Property 2.3.3 Let G and H be two samples. The support correlation between G and H
is the number of common support variables.

2.3.3.1 Generating Samples from Output Functions


In this subsection, a technique is presented to generate samples from output functions.

Algorithm 2.2 Generating Samples


1: Order the outputs using Algorithm 2.1, and make an initial sample with the ordered
outputs.
2: Check the size of the sample, and do the process of generating samples by using Step 3
only if the size of the sample is larger than the expected one, otherwise stop the process
for this sample.
3: Check the supports of the outputs of the sample. If all the outputs depend on all the
inputs, then go to Step 4, otherwise go to Step 5.
4: Randomly divide the sample into smaller samples such that the construction of the SBDD for each sample is easy to handle.
5: Divide the sample into two such that the outputs with common support variables are in the same sample, and the support correlation between the samples is small. Return to Step 2 for each sample.

Example 2.9 Consider the functions in Example 2.6. ( f1, f3 ) and ( f0, f2 ) are two samples,
since support( f0 ) = {x1, x2, x3 }, support( f1 ) = {x3, x4 }, support( f2 ) = {x1, x3, x4 }, and
support( f3 ) = {x4 }.

Figure 2.9: SBDD with the Variable Ordering for Sample ( f1, f3 ) Obtained by the Sifting
Algorithm.

2.3.3.2 Interleaving the Variable Orderings of Samples


In the previous subsection, a method has been presented to generate samples from the
output functions. Now, the compact SBDD for each sample is constructed by using the
sifting algorithm starting with the initial variable ordering, and obtain the variable ordering
for the sample from the SBDD. Then, the variables of the variable orderings from the
highest to the lowest priority of the samples are interleaved as shown in Fig. 2.12. Note that
a sample has the highest priority if the size of the SBDD for the sample is the largest.

Example 2.10 Consider the functions in Example 2.6. ( x4, x3, x1, x2 ) and ( x3, x4, x2, x1 ) are
the variable orderings in Figs. 2.9 and 2.10 for samples ( f1, f3 ) and ( f0, f2 ), respectively.
The sample ( f0, f2 ) has the highest priority, since the size of the SBDD for this sample is
the largest. Fig. 2.11 shows that ( x3, x4, x2, x1 ) is a good variable ordering for ( f0, f1, f2, f3 )
which is computed from the variable orderings of samples ( f1, f3 ) and ( f0, f2 ) by using an
interleaving method.
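The exact interleaving procedure is given in Fig. 2.12; the Python sketch below is only an illustrative reconstruction (all names are assumptions). It merges the sample orderings, visiting the samples from highest to lowest priority and keeping the first occurrence of each variable; for the two samples above it returns (x3, x4, x2, x1).

    def interleave(orderings):
        """`orderings` lists the sample variable orderings, highest priority first."""
        merged, seen = [], set()
        for position in range(max(map(len, orderings))):
            for order in orderings:
                if position < len(order) and order[position] not in seen:
                    merged.append(order[position])
                    seen.add(order[position])
        return merged

    print(interleave([["x3", "x4", "x2", "x1"],     # sample (f0, f2), highest priority
                      ["x4", "x3", "x1", "x2"]]))   # sample (f1, f3)
    # ['x3', 'x4', 'x2', 'x1']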

2.3.4 Interleaving Method for Input Variables and Output Variables


In Subsections 2.3.2 and 2.3.3, methods have been presented to find good orderings of the input variables and the output variables. In Example 2.7 and Fig. 2.11, it is shown that
( f1, f3, f0, f2 ) is a good ordering of the outputs, and (x3, x4, x2, x1 ) is a good ordering of the
inputs. In this subsection, a method is presented to find the relative position of inputs and
outputs. To find the relative position of variables, the following strategy is used:
Strategy 3: The variable for any output function is placed immediately after all of its support variables have appeared.

Example 2.11 Consider the two-bit adder (adr2) with inputs x = (x1, x0) and y = (y1, y0) and outputs z = (z2, z1, z0), where z = x + y.

Figure 2.10: SBDD with the Variable Ordering for Sample ( f0, f2 ) Obtained by the Sifting
Algorithm.

The support for z0 is {x0, y0 }, and the supports for z1 and z2 are {x0, y0, x1, y1 }. Also,
z2, z1 , and z0 are partially symmetric with respect to {x0, y0 } and {x1, y1 }. Thus, the reason-
able ordering for the input and output variables would be (x0, y0, z0, x1, y1, z1, z2 ).

Algorithm 2.3 Optimization of BDDs for CFs


Input: An n-input m-output function f.
Output: A BDD for CF of f whose number of nodes is relatively small.
1: Make an initial ordering for the variables of the BDD for CF using Algorithm 2.1, the procedures in Fig. 2.12, and Strategy 3.
2: Select a variable from the initial ordering, and use the sifting algorithm to find the
position of the variable that fits Strategy 3 to minimize the size of the BDD for CF.
3: Do Step 2 until all the variables from the initial ordering have been checked, and choose
the smallest BDD for CF.

2.3.5 Algorithm for Ordering the Variables


In this subsection, a method is presented to optimize BDDs for CFs using the modified
sifting algorithm.

Example 2.12 Consider the 4-input 4-output function in Example 2.6. In this example,
(x3, x4, x2, x1 ) is a good ordering of the inputs, and ( f1, f3, f0, f2 ) is a good ordering of
the outputs. Since support ( f1 ) = {x3, x4 }, f1 appears after {x3, x4 }. Next, support ( f3 ) =
{x4 }, f3 appears after f1 . Finally, support( f0 ) = {x1, x2, x3 } and support ( f2 ) = {x1, x3, x4 }, f0

Figure 2.11: SBDD with the Variable Ordering for f = ( f0, f1, f2, f3 ) Obtained from the
Variable Orderings of Samples ( f1, f 3 ) and ( f0, f2 ) by using an Interleaving Method.

and f2 appear last. Thus, an initial ordering for the input and output variables is
(x3, x4, f1, f3, x2, x1, f0, f2 ).
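Strategy 3 can be sketched as a single scan over the input ordering (an illustrative Python addition; all names are assumptions). Each output variable is appended as soon as all of its support variables have been placed, which reproduces the initial ordering of Example 2.12:

    supports = {"f0": {"x1", "x2", "x3"}, "f1": {"x3", "x4"},
                "f2": {"x1", "x3", "x4"}, "f3": {"x4"}}

    def merge_inputs_outputs(input_order, output_order, sup):
        ordering, seen = [], set()
        for x in input_order:
            ordering.append(x)
            seen.add(x)
            for f in output_order:                    # keep the chosen output order
                if f not in ordering and sup[f] <= seen:
                    ordering.append(f)                # place f right after its support
        return ordering

    print(merge_inputs_outputs(["x3", "x4", "x2", "x1"],
                               ["f1", "f3", "f0", "f2"], supports))
    # ['x3', 'x4', 'f1', 'f3', 'x2', 'x1', 'f0', 'f2']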

2.4 SUMMARY
In this chapter, a method is introduced to construct smaller binary decision diagrams for
characteristic functions (BDDs for CFs) to represent multiple-output functions. The sizes of
SBDDs (Shared Binary Decision Diagrams), MTBDDs, and BDDs for CFs are compared.
SBDDs evaluate outputs in O(mn) time, while MTBDDs (Multi-Terminal Binary Decision
Diagrams) and BDDs for CFs evaluate outputs in O(n) time and O(n+ m) time, respectively.
In most cases, BDDs for CFs are much smaller than MTBDDs. However, BDDs for CFs are
usually larger than the corresponding SBDDs. Three types of circuits illustrate this: (1) n-bit adders (adrn), where BDDs for CFs are the smallest; (2) bit-counting circuits (wgtn), where MTBDDs are the smallest; and (3) n-bit multipliers (mlpn), where SBDDs are the smallest. Upper
bounds on the sizes of SBDDs, MTBDDs, and BDDs for CFs of adrn are also derived.
An SBDD-based simulator can be faster for some functions. However, the simulator
based on the BDD for CF should be faster than the SBDD-based one when there are no page
faults in the physical memory and no misses in the Translation Lookaside Buffer (TLB)
during function evaluation.

Figure 2.12: Pseudocode for Interleaving based Sampling Schemes for the Ordering of
Input Variables.

REFERENCES
[1] S. B. Akers, “Binary decision diagrams”, IEEE Trans. Comput., vol. C-27, no. 6, pp.
509–516, 1978.
[2] P. Ashar and S. Malik, “Fast functional simulation using branching programs”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 408–412,
1995.
[3] C. Scholl, R. Drechsler and B. Becker, “Functional simulation using binary decision di-
agrams”, Proceedings of IEEE International Conference on Computer-Aided Design,
pp. 8–12, 1997.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[6] H. Touati, H. Savoj, B. Lin, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Im-
plicit state enumeration of finite state machines using BDDs”, Proceedings of IEEE
International Conference on Computer-Aided Design, pp. 130–133, 1990.
[7] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Trans. Fundamentals, vol. E81-A, no.12, pp. 2545–
2553, 1998.
[8] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-
output functions based on shared multi-terminal multiple-valued decision diagrams”,
IEICE Trans. Inf. & Syst., vol. E82-D, no.5, pp. 925–932, 1999.
[9] H. M. H. Babu and T. Sasao, “Representations of multiple-output functions by binary
decision diagrams for characteristic functions”, Proceedings of the Eighth Workshop
on Synthesis And System Integration of Mixed Technologies, pp. 101–108, 1998.
[10] H. Fujii, G. Ootomo, and C. Hori, “Interleaving based variable ordering methods for
ordered binary decision diagrams”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 38–41, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] T. Sasao, ed., “Logic Synthesis and Optimization”, Kluwer Academic Publishers,
Boston, 1993.
[13] J. Jain, W. Adams, and M. Fujita, “Sampling schemes for computing OBDD vari-
able orderings”, Proceedings of IEEE International Conference on Computer-Aided
Design, pp. 631–638, 1998.
[14] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[15] A. Slobodov´a and C. Meinel, “Sample method for minimization of OBDDs”, Pro-
ceedings of the International Workshop on Logic Synthesis, pp. 311–316, 1998.

[16] H. M. H. Babu and T. Sasao, “Shared multiple-valued decision diagrams for multiple-
output functions”, Proceedings of the IEEE International Symposium on Multiple-
Valued Logic, pp. 166–172, 1999.
[17] H. M. H. Babu and T. Sasao, “Representations of multiple-output functions using
binary decision diagrams for characteristic functions”, IEICE Transactions on Fundamentals
of Electronics, Communications and Computer Sciences, vol. E82-A, no. 11,
pp. 2398–2406, 1999.
CHAPTER 3

Shared Multiple-Valued DDs


for Multiple-Output
Functions

In this chapter, a method is introduced to represent multiple-output functions using shared


multiple-valued decision diagrams (SMDDs). An algorithm is shown for pairing the input
variables of binary decision diagrams (BDDs). A pair sifting technique is also presented that moves pairs of 4-valued input variables to speed up normal sifting and to produce compact
SMDDs. The size of the SMDD is the total number of non-terminal nodes excluding the
nodes for output selection variables. The sizes of SMDDs are derived for general functions
and symmetric functions.

3.1 INTRODUCTION
Multiple-valued decision diagrams (MDDs) are extensions of binary decision diagrams
(BDDs), and are useful in logic synthesis, time-division multiplexing (TDM) realizations,
logic simulation, FPGA design, etc. MDDs are usually smaller than the corresponding
BDDs, and require fewer memory accesses to evaluate them. A shared multiple-valued
decision diagram (SMDD) is a set of MDDs that compactly represents a multiple-output
function. SMDDs can be used in many applications such as design of multiplexer-based
networks, design of pass-transistor logic networks, etc. For example, Fig. 3.2 shows the
multiplexer-based network corresponding to the SMDD in Fig. 3.1. In these applications,
the reduction of the number of nodes in the SMDDs is important.
This chapter considers the following methods to construct compact SMDDs:
1. Pair the binary input variables to make multiple-valued variables.
2. Order the multiple-valued variables in the SMDDs.
A parameter is introduced to find good pairs of the input variables. The parameter of an
input variable denotes the influence of the variable on the size of the BDD. An extension to
the sifting algorithm is also presented that moves pairs of 4-valued input variables to speed
up the sifting, and to produce compact SMDDs. Furthermore, formulas are derived for the
sizes of SMDDs for bit-counting functions (wgt n) and incrementing functions (inc n).


Figure 3.1: SMDD.

Figure 3.2: A Multiplexer-Based Network Corresponding to the SMDD in Fig. 3.1.

3.2 DECISION DIAGRAMS


In this section, various decision diagrams are defined, and properties of shared multiple-
valued decision diagrams (SMDDs) are presented.

Property 3.2.1 Let F = ( f0, f 1, . . ., f m−1 ). The size of a decision diagram (DD) for a
function F , denoted by size(DD, F), is the total number of nonterminal nodes excluding
the nodes for output selection variables.

Example 3.1 The size of the SMDD in Fig. 3.1 is 7. Note that g0 is the output selection
variable in the SMDD.

3.2.1 Binary Decision Diagrams


Binary decision diagrams (BDDs) are efficient representations of logic functions. A shared
binary decision diagram (SBDD) is a set of BDDs combined by a tree for output selection,
and represents a multiple-output function.

3.2.2 Multiple-Valued Decision Diagrams


Let f : {0, 1, ..., r − 1}^N → {0, 1}. A multiple-valued decision diagram (MDD) of an r-valued N-input-variable function f(X1, X2, ..., XN) is a directed graph with a root node that has r outgoing edges labeled 0, 1, ..., r − 1 directed to nodes representing f(0, X2, ..., XN), f(1, X2, ..., XN), ..., and f(r − 1, X2, ..., XN), respectively. For each of these
nodes, there are r outgoing edges which go to nodes that have r outgoing edges, etc. A
terminal node is a node that has no outgoing edges. It is labeled by 0 or 1 which corresponds
to a binary value of the function f . A reduced ordered MDD (ROMDD) is derived from a
multiple-valued complete decision tree using the following reduction rules:

• Two nodes are merged into one node if they represent the same function, and

• A node v is deleted if all the children of v represent the same function.

From now on, an ROMDD is simply called an MDD in this chapter.
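A minimal sketch of the two reduction rules in Python is given below (an illustration only; the tuple-based node representation and the unique table are assumptions of this sketch, not part of the text).

```python
# Illustrative sketch of the two ROMDD reduction rules (not from the text).
# A node is either a terminal (0 or 1) or a canonical tuple (var, child_0, ..., child_{r-1}).

def reduce_node(var, children, unique_table):
    """Create a reduced node for `var` with the given r children."""
    # Rule 2: delete the node if all children represent the same function.
    if all(c == children[0] for c in children):
        return children[0]
    # Rule 1: merge nodes that represent the same function (hash-consing).
    key = (var, tuple(children))
    if key not in unique_table:
        unique_table[key] = key          # first occurrence becomes the shared node
    return unique_table[key]

table = {}
print(reduce_node('X1', [1, 1, 1, 1], table))        # 1: the node is deleted (rule 2)
a = reduce_node('X1', [0, 1, 1, 0], table)
b = reduce_node('X1', [0, 1, 1, 0], table)
print(a is b)                                         # True: one shared node (rule 1)
```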

3.2.2.1 Shared Multiple-Valued Decision Diagrams


Shared multiple-valued decision diagrams (SMDDs) represent multiple-valued multiple-
output functions. An SMDD is a set of MDDs combined by a tree for output selection.
Fig. 3.1 shows an example of an SMDD, where g0 is the output selection variable for functions f0 and f1. An SMDD has the following properties:
1. The equivalence of two functions can be checked easily.
2. It shares isomorphic sub-graphs of MDDs, and represents a multiple-output function compactly.
3. It has fewer logic levels than the corresponding BDDs.

Property 3.2.2 Let R = {0, 1, ..., r − 1} and B = {0, 1}. Then, the size of the SMDD for an N-input m-output function R^N → B^m is at most min_{k=1}^{N} { m·(r^{N−k} − 1)/(r − 1) + 2^{r^k} − 2 }.

Property 3.2.3 An arbitrary N-input m-output function R^N → B^m can be represented by an SMDD with O(m·r^N / N) nodes.

Property 3.2.4 Let R = {0, 1, ..., r − 1} and B = {0, 1}. Then, all the non-constant symmetric functions R^N → B can be represented by MDDs with Σ_{i=1}^{N} [2^{C(i+r−1, i)} − 2] non-terminal nodes, where C(a, b) denotes the binomial coefficient.
Property 3.2.5 Let R = {0, 1, ..., r − 1} and B = {0, 1}. Then, the size of the SMDD for an N-input m-output symmetric function R^N → B^m is at most

min_{k=1}^{N} { m · Σ_{i=0}^{k} C(i+r−1, i) + Σ_{i=1}^{N−k} [2^{C(i+r−1, i)} − 2] }.
From here, we assume that r = 4.
Property 3.2.6 Let wgt n be an n-input (⌊log_2 n⌋ + 1)-output function that counts the number of 1's in the inputs, and represents it by a binary number, where n is the number of binary input variables, and ⌊a⌋ denotes the largest integer not greater than a. It represents a bit-counting function.
Property 3.2.7 Let inc n be an n-input (n + 1)-output function that computes x + 1, where n is the number of binary input variables. It represents an incrementing function. The following conjectures on the sizes of SMDDs are obtained for wgt n and inc n:
Conjecture 3.1: size(SMDD, wgt n) = n⌊log_2 n⌋ + n − 2^{⌊log_2 n⌋}, where n > 1.
Conjecture 3.2: size(SMDD, inc n) = 2n − 1, where n > 1.
Property 3.2.8 Let an SMDD represent m functions f i = x i (i = 0, 1, ..., m − 1), where x i
is a binary input variable. Then, the size of the SMDD is m.
Example 3.2 Figs. 3.3 and 3.4 show the SMDDs for the functions (f0, f1) = (x1, x2) and (f0, f1, f2) = (x1, x2, x3), respectively. Their sizes are 2 and 3, respectively.

Figure 3.3: SMDD for Functions ( f0, f1 ) = (x1, x2 ).



Figure 3.4: SMDD for Functions ( f0, f1, f2 ) = (x1, x2, x3 ).

Note that the size of the SBDD to represent m functions f i = x i (i = 0, 1, ..., m − 1) is


also m.

3.3 CONSTRUCTION OF COMPACT SMDDS


Compact SMDDs are important to represent multiple-output functions efficiently. In this
section, heuristic algorithms are shown to optimize SMDDs. The following approaches are
considered here to reduce the sizes of SMDDs: (1) pairing of input variables; (2) ordering
of input variables by sifting the 4-valued variables; and (3) ordering of input variables by
sifting pairs of 4-valued variables.

3.3.1 Pairing of Binary Input Variables


Pairing of input variables is important to reduce the size of SMDDs. The input variables that
greatly affect the size of the decision diagram are called influential. A heuristic algorithm
is used to pair the influential input variables of BDDs for multiple-output functions.

3.3.1.1 The Method


In this subsection, a heuristic algorithm is shown to find good pairs of the input variables.

Property 3.3.1 Let U = {u1, u2, ..., uk} be a set of k variables. Let U1 ⊆ U and U2 ⊆ U. P = {U1, U2} is a partition of U iff U1 ∪ U2 = U and U1 ∩ U2 = ∅.

Property 3.3.2 Let U = {u1, u2, u3, u4 } be a set of four variables. Then, {[u1, u2 ], [u3, u4 ]}
is a partition of U .

Property 3.3.3 Let a BDD represent a function f. para(xi) is the parameter of the input variable xi at height i in the BDD with the given variable ordering; it denotes the level of xi in the BDD. It is assumed that the smaller the value of para(xi), the more influential the variable xi.
Example 3.3 Figs. 3.5 and 3.6 show the BDDs for the functions f0 and f1, respectively. The values of para(xi) for both BDDs are shown in the figures. For example, x1 is the most influential variable in the BDD in Fig. 3.5, since para(x1) is the smallest among all para(xi).

Figure 3.5: BDD for Functions f 0 .

Figure 3.6: BDD for Functions f 1 .

Property 3.3.4 Let F = (f0, f1, ..., f_{m−1}) be an n-input m-output function, and para_k(xi) be the parameter of the variable xi in the BDD for f_k. Then, T = (T1, T2, ..., Tn)^t is the total parameter vector, where Ti is calculated as follows: Ti = para(xi) = ∏_{k=0}^{m−1} para_k(xi).

Example 3.4 Consider the functions in Example 3.3. The total parameter vector for F = (f0, f1) is

T = (para(x1), para(x2), para(x3), para(x4))^t = (2, 6, 3, 16)^t.

Property 3.3.5 The weight w(i, j) for a pair of input variables xi and xj (i ≠ j) is defined by w(i, j) = para(xi) · para(xj).

Property 3.3.6 In the functions of Example 3.3, the weights are as follows: w(1, 2) = 12, w(1, 3) = 6, w(1, 4) = 32, w(2, 3) = 18, w(2, 4) = 96, and w(3, 4) = 48.

Algorithm 3.1 Pairing the Input Variables

Let F : {0, 1}^n → {0, 1}^m. Let Q be a set of pairs of n input variables, and q ∈ Q. Let W be the list of sorted weights for the pairs of Q.
1: Optimize the BDD for each output function f.
2: Calculate the total parameter vector T.
3: Calculate the weight w(i, j) for each pair of input variables.
4: Select q ∈ Q with the smallest weight w(i, j) from W, and eliminate the pairs that contain the input variables in q from Q. Update Q and W.
5: Repeat Step 4 until Q = ∅, and make a good partition of the input variables with the selected pairs.

Example 3.5 Consider the functions in Example 3.3. There are three different ways of pairing four inputs:

(1) {[x1, x2], [x3, x4]} (SMDD in Fig. 3.8),
(2) {[x1, x4], [x2, x3]} (SMDD in Fig. 3.9), and
(3) {[x1, x3], [x2, x4]} (SMDD in Fig. 3.7).
Here, (3) is a good partition of the input variables according to Algorithm 3.1, since w(1, 3) is the smallest element.
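A minimal Python sketch of Steps 3–5 of Algorithm 3.1 is shown below. It takes the total parameters of Example 3.4 as given data (Steps 1–2, which build and optimize the BDDs, are not modeled) and reproduces the partition selected in Example 3.5.

```python
from itertools import combinations

# Total parameters from Example 3.4: (para(x1), ..., para(x4)) = (2, 6, 3, 16).
para = {'x1': 2, 'x2': 6, 'x3': 3, 'x4': 16}

# Step 3: the weight of a pair is the product of its parameters (Property 3.3.5).
weights = {pair: para[pair[0]] * para[pair[1]] for pair in combinations(para, 2)}

# Steps 4-5: repeatedly select the unused pair with the smallest weight.
partition, used = [], set()
for (a, b), w in sorted(weights.items(), key=lambda item: item[1]):
    if a not in used and b not in used:
        partition.append([a, b])
        used.update([a, b])

print(weights[('x1', 'x3')])   # 6, the smallest weight, as in Property 3.3.6
print(partition)               # [['x1', 'x3'], ['x2', 'x4']], the partition of Example 3.5
```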

Figure 3.7: SMDD for the Partition {[x 1, x 3 ], [x 2, x 4 ]}.

Figure 3.8: SMDD for the Partition {[x 1, x 2 ], [x 3, x 4 ]}.



Figure 3.9: SMDD for the Partition {[x 1, x 4 ], [x 2, x 3 ]}.

3.3.2 Ordering of Input Variables


The ordering of input variables is very important to reduce the sizes of SMDDs. Sifting is
an efficient method to find a good ordering of the input variables. Normal sifting moves a
single variable at a time, while group sifting moves more than one variable at a time. Group sifting is faster than normal sifting at producing compact DDs. Pair sifting is one kind of group sifting that moves a pair of symmetric variables at a time. In this subsection, the
normal sifting and the pair sifting of 4-valued input variables are considered. In the case of
the pair sifting, good pairs of the 4-valued input variables from MDDs are found. Functions
are usually multiple outputs, and it is not so easy to find good pairs of the input variables
for all the outputs. Algorithm 3.1 is used here to find good pairs of the input variables.

Algorithm 3.2 Construction of an SMDD using the Normal Sifting Algorithm


1: Construct the SMDD by Algorithm 3.1.
2: Use the sifting algorithm for the 4-valued input variables such that the size of the
SMDD is minimized.

From now on, this optimized SMDD is called the SMDD with the normal sifting.

Example 3.6 Let {[X 1, X 3 ], [X 2, X 4 ]} be a good partition of 4-valued input variables ob-
tained by Algorithm 3.1. Then, the initial variable ordering for the pairs is (X 1, X 3, X 2, X 4 ).

From now on, the SMDD optimized by the pair sifting (Algorithm 3.3 below) is called the SMDD with the pair sifting.

Algorithm 3.3 Construction of an SMDD using the Pair Sifting of 4-Valued Variables
Let F : {0, 1} n →{0, l} m .
1: Construct the MDDs for F by Algorithm 3.1.
2: Find good pairs of 4-valued variables from the MDDs using the similar techniques to
Algorithm 3.1, and make an initial variable ordering with the pairs.
3: Select a pair of 4-valued variables from the initial ordering, and use the sifting algorithm
to find the position of the pair such that the size of the initial SMDD is minimized.
4: Repeat Step 3 until all the pairs from the initial ordering have been checked, and choose
the smallest SMDD.
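The sifting step used in Algorithms 3.2 and 3.3 can be sketched as follows. The routine smdd_size is a placeholder for a call into an actual SMDD package (an assumption of this sketch); normal sifting passes a single 4-valued variable as the block, while pair sifting passes a pair and moves it as one unit.

```python
def sift_block(order, block, smdd_size):
    """Try `block` (one variable, or a pair of variables for pair sifting) in
    every position of the ordering and keep the position giving the smallest SMDD."""
    rest = [v for v in order if v not in block]
    best_order, best_size = list(order), smdd_size(order)
    for pos in range(len(rest) + 1):
        candidate = rest[:pos] + list(block) + rest[pos:]
        size = smdd_size(candidate)
        if size < best_size:
            best_order, best_size = candidate, size
    return best_order

def pair_sifting(order, pairs, smdd_size):
    # Algorithm 3.3, Steps 3-4: sift each pair of 4-valued variables in turn.
    for pair in pairs:
        order = sift_block(order, pair, smdd_size)
    return order
```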

3.4 SUMMARY
In this chapter, a method is introduced to represent multiple-output functions using shared
multiple-valued decision diagrams (SMDDs). Some algorithms are also presented to pair
the input variables of binary decision diagrams (BDDs), and to find good orderings of
the multiple-valued variables in the SMDDs. The sizes of SMDDs are derived for general
functions and symmetric functions. SMDDs with the pair sifting are smaller than SMDDs
with the normal sifting. Algorithm 3.1 and Algorithm 3.3 can be extended to group k
input variables, where k > 2. SMDDs are useful in many applications such as design of
multiplexer-based networks and design of pass-transistor logic networks.

REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] H. M. H. Babu and T. Sasao, “Design of multiple-output networks using time domain
multiplexing and shared multi-terminal multiple-valued decision diagrams”, IEEE
International Symposium on Multiple Valued Logic, pp. 45–51, 1998.
[3] S. Minato, N. Ishiura, and S. Yajima, “Shared binary decision diagram with attributed
edges for efficient Boolean function manipulation", Proceedings of 27th ACM/IEEE
DAC, pp. 52–57, 1990.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[6] D. M. Miller, “Multiple-valued logic design tools", Proceedings of the IEEE Interna-
tional Symposium on Multiple-Valued Logic, pp. 2–11, 1993.
[7] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Trans. Fundamentals, vol. E81-A, no. 12, pp. 2545–
2553, 1998.
[8] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-
output functions based on shared multi-terminal multiple-valued decision diagrams”,
IEICE Trans. Inf. & Syst., vol. E82-D, no. 5, pp. 925–932, 1999.

[9] H. M. H. Babu and T. Sasao, “Representations of multiple-output functions by binary


decision diagrams for characteristic functions”, Proceedings of the Eighth Workshop
on Synthesis And System Integration of Mixed Technologies, pp. 101–108, 1998.
[10] H. Fujii, G. Ootomo, and C. Hori, “Interleaving based variable ordering methods for
ordered binary decision diagrams”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 38–41, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] D. M. Miller and R. Drechsler, “Implementing a multiple-valued decision diagram
package", Proceedings of the IEEE International Symposium on Multiple-Valued
Logic, pp. 2–11, 1998.
[13] G. Epstein, “Multiple-Valued Logic Design: An Introduction”, IOP Publishing Ltd., London, 1993.
[14] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[15] M. Kameyama and T. Higuchi, “Synthesis of multiple-valued logic networks based
on tree-type universal logic module”, IEEE Transactions on Computers, vol. C-26,
no. 12, pp. 1297–1302, 1977.
[16] H. M. H. Babu and T. Sasao, “Shared multiple-valued decision diagrams for multiple-
output functions”, Proceedings of the IEEE International Symposium on Multiple-
Valued Logic, pp. 166–172, 1999.
[17] A. Thayse, M. Davio, and J-P. Deschamps, “Optimization of multi-valued decision
algorithms", Proceedings of the IEEE International Symposium on Multiple-Valued
Logic, pp. 171–178, 1978.
[18] K. Yano, Y. Sasaki, K. Rikino, and K. Seki, “Top-down pass transistor logic design",
IEEE Journal of Solid-State Circuits, vol. 31, no. 6, pp. 792–803, 1996.
[19] S. Panda and F. Somenzi, “Who are the variables in your neighborhood", Proceedings
of IEEE International Conference on Computer-Aided Design, pp. 74–77, 1995.
CHAPTER 4

Heuristics to Minimize Multiple-Valued Decision Diagrams

In this chapter, a method is introduced to minimize multiple-valued decision diagrams


(MDDs) for multiple-output functions. The following are considered: (1) a heuristic for encoding the 2-valued inputs; and (2) a heuristic for ordering the multiple-valued input variables based on sampling, where each sample is a group of outputs. At first, a 4-valued input 2-valued multiple-output function is generated from the given 2-valued input 2-valued output functions. Then, an MDD is constructed for each sample and a good variable ordering is found. Finally, a variable ordering is generated from the orderings of the MDDs representing the samples, and the entire MDDs are minimized. The introduced method produces MDDs with fewer nodes than sifting. In particular, the introduced method generates much smaller MDDs in a short time when several 2-valued input variables are grouped to form multiple-valued variables.

4.1 INTRODUCTION
Multiple-valued decision diagrams (MDDs) are data structures for multiple-valued func-
tions. MDDs are extensions of binary decision diagrams (BDDs) and usually require fewer
nodes than the corresponding BDDs to represent the same logic functions. MDDs are useful
for logic synthesis, FPGA design, logic simulation, etc. For example, Fig. 4.2 shows the
multiplexer-based network corresponding to the MDD in Fig. 4.1. In this chapter, multi-
rooted MDDs are considered to represent multiple-output functions. From 2-valued input
2-valued output functions, MDDs are constructed to represent 4-valued input 2-valued
output functions. A shared binary decision diagram (SBDD) is used for a multiple-output
function to find good pairs of 2-valued input variables. Since the size of a decision diagram
(DD) can vary from linear to exponential due to the orderings of the input variables, finding
a good variable ordering of the input variables is very important. Dynamic variable ordering
is one of the good heuristics to order the inputs. However, in the case of a multiple-output
function, a set of output functions must be handled at the same time. So, generating a good
variable ordering that represents all the output functions compactly is essential. Sampling


based variable ordering methods and interleaving-based variable ordering methods are effective in finding good variable orderings for multiple-output functions quickly. In this chapter,
both methods are combined to find good orderings of input variables.

Figure 4.1: Example of an MDD.

Figure 4.2: Multiplexer-Based Network Corresponding to the MDD in Fig. 4.1.



4.2 BASIC PROPERTIES


This section presents notation and basic properties.

Property 4.2.1 Let F1 = { f0, f1, . . ., fm−1 }, R = {0, 1, . . ., r−1}, and B = {0, 1}. An r -
valued input 2-valued output function F1 is a mapping F1 : R N →B m .

Property 4.2.2 Let R = {0, 1, ..., r−1} and S ⊆ R. X^S is a literal of X, where X^S = 0 (X ∉ S) and X^S = 1 (X ∈ S).

When S contains only one element, X^{i} is denoted by X^i. A product of literals X1^S1 X2^S2 · · · XN^SN is a product term, that is, the AND of literals. The expression ⋁_{(S1, S2, ..., SN)} X1^S1 X2^S2 · · · XN^SN denotes the inclusive-OR of product terms.

Property 4.2.3 An arbitrary r-valued input 2-valued multiple-output function can be represented as F1(X1, X2, ..., XN) = X1^0 F1(0, X2, ..., XN) ∨ X1^1 F1(1, X2, ..., XN) ∨ · · · ∨ X1^{r−1} F1(r−1, X2, ..., XN). This is a multiple-valued version of Shannon's expansion with respect to X1.
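The literal X^S and the multiple-valued Shannon expansion above can be checked numerically; the sketch below uses r = 4 and an arbitrarily chosen two-variable function F1 (the function is only an illustration, not one from the text).

```python
R = range(4)                                  # r = 4

def literal(X, S):
    """X^S = 1 if X is in S, and 0 otherwise (Property 4.2.2)."""
    return 1 if X in S else 0

def F1(X1, X2):
    # An arbitrary 4-valued-input 2-valued-output function used for illustration.
    return 1 if (X1 + X2) % 3 == 0 else 0

# Property 4.2.3: F1(X1, X2) equals the OR over i of X1^{i} * F1(i, X2).
for X1 in R:
    for X2 in R:
        expanded = max(literal(X1, {i}) & F1(i, X2) for i in R)
        assert expanded == F1(X1, X2)
print("multiple-valued Shannon expansion verified for all 16 input combinations")
```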

Property 4.2.4 Let F = { f 0 , f 1 , . . ., f m−1 }. Then, the size of a decision diagram (DD)
for F , denoted by size(DD, F), is the total number of non-terminal nodes.

Example 4.1 The size of the MDD in Fig. 4.1 is 7.

4.3 MULTIPLE-VALUED DECISION DIAGRAMS


Let F1 : {0, 1, ..., r−1}^N → {0, 1}^m. A multiple-valued decision diagram (MDD) for F1(X1, X2, ..., XN) is a multi-rooted directed graph in which each root node has r outgoing edges labeled 0, 1, ..., and r−1 directed to nodes representing F1(0, X2, ..., XN), F1(1, X2, ..., XN), ..., and F1(r−1, X2, ..., XN), respectively. Each of these nodes has r outgoing edges which go to nodes that have r outgoing edges, etc. A terminal node is a node that has no outgoing edges. It is labeled by 0 or 1, which corresponds to a binary value of the function F1. A reduced ordered MDD (ROMDD) has no node where all r outgoing edges point to the same node and has no equivalent sub-graphs. From now on, an ROMDD is simply referred to as an MDD. Fig. 4.1 is an example of an MDD.

4.3.1 Size of MDDs


The size of MDDs is an important characteristic. In this subsection, some upper bounds on
the sizes of MDDs are presented for various functions.

Property 4.3.1 Let R = {0, 1, ..., r−1} and B = {0, 1}. Then, the size of the MDD for an N-input m-output function R^N → B^m is at most min_{k=1}^{N} { m·(r^{N−k} − 1)/(r − 1) + 2^{r^k} − 2 }.

In the case of r = 2, an SBDD is used to represent an n-input m-output function, and the following property is obtained:

Property 4.3.2 The size of the SBDD for an n-input m-output function B^n → B^m is at most min_{k=1}^{n} { m·(2^{n−k} − 1) + 2^{2^k} − 2 }.

Property 4.3.3 Let R = {0, 1, ..., r−1} and B = {0, 1}. Then, all the non-constant symmetric functions R^N → B can be represented by MDDs with Σ_{i=1}^{N} [2^{C(i+r−1, i)} − 2] non-terminal nodes, where C(a, b) denotes the binomial coefficient.
Proof 4.1 The number of non-constant symmetric functions f : R^N → B is 2^{C(N+r−1, N)} − 2, where R = {0, 1, ..., r−1} and B = {0, 1}.

(1) When N = 1, there are 2^r − 2 symmetric functions, and they are realized as shown in Fig. 4.3.
(2) Suppose that all the non-constant symmetric functions of (N−1) variables are realized with Σ_{i=1}^{N−1} [2^{C(i+r−1, i)} − 2] non-terminal nodes. An arbitrary symmetric function of N variables is represented as
f(X1, X2, ..., XN) = XN^0 f0 ∨ XN^1 f1 ∨ · · · ∨ XN^{r−1} f_{r−1}, where f_j = f(X1, X2, ..., X_{N−1}, j) is a symmetric function of (N−1) variables, and j = 0, 1, ..., r−1. Thus, all the non-constant symmetric functions of N variables are realized as shown in Fig. 4.4. The total number of non-terminal nodes is

Σ_{i=1}^{N−1} [2^{C(i+r−1, i)} − 2] + [2^{C(N+r−1, N)} − 2] = Σ_{i=1}^{N} [2^{C(i+r−1, i)} − 2].

Thus, from (1) and (2), Property 4.3.3 is proved.

Figure 4.3: Realization of all the Symmetric Functions of a Single Variable.



Figure 4.4: Realization of Symmetric Functions of N Variables.

Property 4.3.4 Let R = {0, 1, ..., r−1} and B = {0, 1}. Then, the size of the MDD for an N-input m-output symmetric function R^N → B^m is at most

min_{k=1}^{N} { m · Σ_{i=0}^{k} C(i+r−1, i) + Σ_{i=1}^{N−k} [2^{C(i+r−1, i)} − 2] }.
Proof: Since the functions are completely symmetric, the number of distinct k-variable functions generated by the r-valued complete decision tree is equal to the number of ways to select k objects from r distinct objects with repetition, which is C(k+r−1, k). So, the total number of non-terminal nodes in the r-valued decision trees of the m functions is m · Σ_{i=0}^{k} C(i+r−1, i). By Property 4.3.3, in the remaining N−k variables, there are 2^{C(N−k+r−1, N−k)} − 2 non-constant symmetric functions, and they require Σ_{i=1}^{N−k} [2^{C(i+r−1, i)} − 2] non-terminal nodes. Therefore, the size of the MDD for an r-valued N-input 2-valued m-output symmetric function is at most

min_{k=1}^{N} { m · Σ_{i=0}^{k} C(i+r−1, i) + Σ_{i=1}^{N−k} [2^{C(i+r−1, i)} − 2] }.
In the case of r = 2, an SBDD is used to represent an n-input m-output symmetric function, and the following is obtained:
Property 4.3.5 The size of the SBDD for an n-input m-output symmetric function B^n → B^m is at most

min_{k=1}^{n} { m·(k + 1)(k + 2)/2 + 2^{n−k+2} − 2(n − k) − 4 }.
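The r = 2 case of Property 4.3.5 follows from Property 4.3.4 by substituting r = 2 into the binomial coefficients. The short Python check below (an illustrative sketch only) confirms that the two expressions inside the min agree for small n, k, and m.

```python
from math import comb

def general_term(m, r, N, k):
    # Expression inside the min of Property 4.3.4.
    top = m * sum(comb(i + r - 1, i) for i in range(0, k + 1))
    bottom = sum(2 ** comb(i + r - 1, i) - 2 for i in range(1, N - k + 1))
    return top + bottom

def r2_term(m, n, k):
    # Expression inside the min of Property 4.3.5 (the r = 2 case).
    return m * (k + 1) * (k + 2) // 2 + 2 ** (n - k + 2) - 2 * (n - k) - 4

for n in range(2, 9):
    for k in range(1, n):
        for m in (1, 3, 5):
            assert general_term(m, 2, n, k) == r2_term(m, n, k)
print("the r = 2 specialization matches the general bound")
```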

From now on, it is assumed that an MDD represents a 4-valued input 2-valued multiple-
output function, where r = 4.

Property 4.3.6 Let inc n be an n-input (n + 1)-output function that computes K + 1, where
K is a binary number consisting of n bits. It represents an incrementing circuit.

Property 4.3.7 Suppose that the 2-valued input variables of inc n are paired as X 1 =
[x 1, x 2 ], X 2 = [x 3, x 4 ], . . ., and X N = [x n−1, x n ], where n = 2N and the variable ordering
of the 4-valued inputs is (X 1, X 2, . . ., X N ). Then, size (MDD, inc n) ≤ 2n − 1(n≥ 2).

Proof 4.2 Mathematical induction is used on the number of 2-valued input variables.

(1) Base: For n = 2, the MDD for inc 2 is realized with three non-terminal nodes as shown
in Fig. 4.5.
(2) Induction: Assume that the hypothesis is true for k = n−1 input variables. That is,
the MDD for inc(n − 1) is realized as Fig. 4.6 with 2n−3 non-terminal nodes. In Fig. 4.6,
first remove the constant 0 and constant 1. Second, insert the input variable x n and add two
non-terminal nodes, as well as nodes for constant 0 and constant 1. Then, the diagram in Fig. 4.7 is obtained. Note that Fig. 4.7 shows the MDD for inc n with 2n − 1 non-terminal nodes, which is two more non-terminal nodes than Fig. 4.6. It is clear that the MDD in Fig. 4.7
has upper and lower parts: When n is even, x n is paired with x n−1 and two additional
non-terminal nodes are added at the level in the bottom of the upper part of the MDD. On
the other hand, when n is odd, x n remains as a 2-valued variable in the lower part of the
MDD which requires two non-terminal nodes. Note that x 1 , x 2, . . ., x n is the order of the
2-valued inputs in the pairs, and (X 1, X 2, . . ., X N ) is the variable ordering of the 4-valued
inputs in the MDD. Thus, from (1) and (2), the theorem is proved.

Figure 4.5: MDD for inc 2.



Figure 4.6: MDD for inc (n − 1).

Figure 4.7: MDD for inc n.

Example 4.2 Figs. 4.8 and 4.9 show the MDDs for inc 3 and inc 4, respectively. The sizes
of MDDs in Figs. 4.8 and 4.9 are 5 and 7, respectively.

Figure 4.8: MDD for inc 3.

Figure 4.9: MDD for inc 4.

In the case of r = 2, an SBDD is used to represent an n-input (n + 1)-output inc n, and the following property is obtained:

Property 4.3.8 size(SBDD, inc n) ≤ 3n − 2.

4.4 MINIMIZATION OF MDDS


The pairing of 2-valued input variables as well as the ordering of multiple-valued variables
are important to reduce the number of nodes in MDDs. In this section, heuristics are
presented to minimize MDDs.

4.4.1 Pairing of 2-Valued Inputs


When a function has only a single output, finding good pairs of 2-valued inputs is relatively
easy. However, for a multiple-output function, finding good pairs of 2-valued inputs is not
so easy. In this subsection, a heuristic is presented to select good pairs of 2-valued inputs
from an SBDD.
Algorithm 4.1 Pairing the Input Variables
1: Let F 2 : {0, 1} n → {0, 1} m . Construct an SBDD for F 2 .
2: Let s(x i , x j ) be the number of outputs that depend on either one or both of the input
variables x i and x j . Then, [x i , x j ] is a candidate pair of inputs if s(x i , x j ) is the smallest
among all the pairs. Apply the same idea to the rest of the inputs recursively to find
good pairs of input variables. In the case of a tie, use Step 3 to find the best one among
them.
3: [x i , x j ] is a good pair of inputs in F 2 if in the SBDD, most of the incoming edges into
nodes labeled x j are from nodes labeled x i .

Example 4.3 Consider the SBDD in Fig. 4.10, where s(x1, x2) = 2, s(x1, x3) = s(x2, x3) = 3, and s(x1, x4) = s(x2, x4) = s(x3, x4) = 4. [x1, x2] is a good pair of input variables, since s(x1, x2) is the smallest among all s(xi, xj). The remaining inputs are x3 and x4. Thus, [x3, x4] is another pair. Therefore, [x1, x2] and [x3, x4] are good pairs of 2-valued inputs.
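Step 2 of Algorithm 4.1 can be sketched with explicit support sets. The supports used below are hypothetical (the SBDD of Fig. 4.10 itself is not reproduced); they are chosen only so that the values s(xi, xj) match those quoted in Example 4.3, and the tie-breaking rule of Step 3 is not modeled.

```python
from itertools import combinations

# Hypothetical output supports, chosen so that s(x_i, x_j) matches Example 4.3.
support = {
    'f0': {'x1', 'x2', 'x3', 'x4'},
    'f1': {'x1', 'x2', 'x4'},
    'f2': {'x3', 'x4'},
    'f3': {'x4'},
}
inputs = ['x1', 'x2', 'x3', 'x4']

def s(xi, xj):
    """Number of outputs that depend on x_i, x_j, or both (Algorithm 4.1, Step 2)."""
    return sum(1 for sup in support.values() if xi in sup or xj in sup)

pairs, used = [], set()
while len(used) < len(inputs):
    free = [p for p in combinations(inputs, 2) if not set(p) & used]
    best = min(free, key=lambda p: s(*p))
    pairs.append(best)
    used.update(best)

print({p: s(*p) for p in combinations(inputs, 2)})   # s(x1,x2)=2, ..., s(x3,x4)=4
print(pairs)                                          # [('x1', 'x2'), ('x3', 'x4')]
```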

Figure 4.10: SBDD for Finding the Pairs of 2-Valued Inputs.

4.4.2 Ordering of Multiple-Valued Variables


The sizes of MDDs are sensitive to orderings of input variables. Several algorithms exist
to find the exact variable ordering. However, such algorithms work only for functions with a small number of inputs and are impractical for general use. Finding the optimum variable ordering is an NP-complete problem. So, heuristics are used for practical problems. In
real life, many logic circuits have multiple outputs, and most CAD tools handle multiple-
output functions at the same time. Thus, finding the same variable ordering for different
output functions is important. In this subsection, a heuristic is presented to order the inputs
of multiple-output functions. A sampling technique is used to compute variable orderings
of MDDs: each sample corresponds to a group of output functions, and an MDD represents
a sample. Then, an interleaving technique is incorporated to generate a good variable
ordering for entire MDDs from the variable orderings of the MDDs, and minimize the
entire MDDs. The techniques for the introduced method are presented in Algorithm 4.2
and Algorithm 4.3. From now on the variable ordering of an MDD for a sample is a sample
variable ordering, and the variable ordering for all the outputs obtained from the sample
variable orderings is the final variable ordering. To obtain the final variable ordering, a
variable ordering of an MDD for a sample is merged with higher priority into one with
lower priority while maintaining the good variable ordering of each MDD as much as
possible. The input variables on which a multiple-output function strongly depends are influential. The influential variables greatly affect the size of the DD and such variables
should be placed in the higher positions in the final variable ordering.

Property 4.4.1 A sample is a multiple-output function consisting of a set of outputs. These


outputs form a part of total outputs, and the number of outputs in a sample is the size of the
sample. A sample with the larger size of the MDD has the higher priority.

Property 4.4.2 support ( f ) is the set of input variables that the function f depends on.
The size of the support is the number of variables in the support ( f ).

Property 4.4.3 Let f 1 and f 2 be two output functions. The size of the union of the support
for f1 and f2 is the number of support variables for { f 1 , f 2 }.

Example 4.4 Consider the 2-valued 4-input 2-output function:

f0(x1, x2, x3, x4) = x1^0 x2 ∨ x1 x3^0 ∨ x2^0 x3, and f1(x1, x2, x3, x4) = x1 x3 ∨ x3^0 x4.
The size of the union of the support for f 0 and f 1 is 4, since x 1 , x 2 , x 3 and x4 are the
support variables for { f 0 , f 1 }.
Note that an input variable of a sample variable ordering is inserted into the final
ordering if the variable is not already in it.

Example 4.5 Let F 3 = { f 0, f 1, f 2, f 3 } : {0, 1, 2, 3}7 →{0, 1}4 . Let { f 0, f 2 } and { f 1, f 3 }


be two samples for the function F 3 . Let order[A] = (X 0, X 1, X 2, X 3 ) and order[B] =
(Y 0,Y 1,Y 2,Y 3 ) be sample variable orderings obtained from the MDDs representing samples
{ f 0, f 2 } and { f 1, f 3 }, respectively. Let 5 and 13 be the sizes of MDDs under the sample
variable orderings, order[A] and order[B], respectively. { f 1, f 3 } has the highest priority,
since the size of the MDD for this sample is the largest. So, check order[B] is checked
first and then order[A] in order to generate the final variable ordering (order[C]). In this
example, G = {(X 0, X 1, X 2, X 3 ), (Y 0,Y 1,Y 2,Y 3 )}. To compute the final variable ordering,
the influential input variables are selected from order[A] and order[B] of G according to
Steps (a) and (b) as follows:

Algorithm 4.2 Deriving Samples


Let F 2 : {0, 1} n →{0, 1} m .
1: Generate F 3 : { 0, 1, 2, 3} N → { 0, 1 } m from F 2 by using Algorithm 4.1 and construct
an MDD for F 3 . Two output functions in F 3 are a candidate pair if the size of the union
of the support for the pair is the smallest among all the pairs. Apply the same idea to
the rest of the outputs recursively to find good pairs of outputs. In the case of a tie, go to
Step 2 to find the best one among them, else go to Step 3.
2: Let w i , w j , and wi j be the numbers of nodes in the MDD for f i , f j , and { f i , f j } ,
respectively. Let W i j = w i + w j − w i j . Then, choose the pair of outputs with the
maximum W i j .
3: Find good pairs of outputs by using Steps 1 and 2, and make a partition of outputs.
4: Order the output functions as they appeared in the pairs of the partition, and make an
initial sample with the ordered outputs.
5: Check the size of the sample, and do the process of generating samples by using Step 6
only if the size of the sample is larger than the expected one, otherwise stop the process
for this sample.
6: Check the supports of the outputs of the sample. If all the outputs depend on all the
inputs, then go to Step 7, otherwise go to Step 8.
7: Randomly divide the sample into several samples such that the construction of the MDD for each sample is easy to handle.
8: Divide the sample into two such that the outputs with common support variables are
in the same sample, and the number of common support variables between samples is
small. Return to Step 5 for each sample.

Algorithm 4.3 Minimization of MDDs


Let F 3 : {0, 1, 2, 3} N →{0, 1} m .
1: Generate samples for F 3 using Algorithm 4.2 and construct an MDD for each sample.
2: Optimize each MDD by using sifting starting with an initial variable ordering, and
obtain the size of the MDD and the sample variable ordering.
3: Arrange the sample variable orderings in descending order of the sizes of the MDDs.
4: Compute the final variable ordering from sample variable orderings by using the
following:
(a) Let v g be an input variable in the final variable ordering. Let v h be an input variable
of a sample variable ordering which is not in the final ordering and is more influential
than v g . Then, in the final variable ordering, insert v h in the higher position than v g .
(b) Let G be a set of sample variable orderings. Choose an input variable from the top
of the sample variable orderings of G and form a final ordering by maintaining the
priorities of the samples in descending order and the property of Step (a).

Continuing Example 4.5, the sample variable orderings and the resulting final variable ordering are:
order[B] = (Y0, Y1, X2, Y3)
order[A] = (X0, X1, X2, X3)
order[C] = (Y0, Y1, X0, X1, X2, X3, Y3)
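One merge rule that is consistent with Steps (a) and (b) of Algorithm 4.3 and reproduces the final ordering above is sketched below: the higher-priority ordering (order[B]) is kept, and around each variable it shares with the lower-priority ordering (here X2), the unused variables of the lower-priority ordering are inserted while preserving their relative order. This is one plausible reading of the heuristic, not the authors' exact procedure.

```python
def merge_orderings(high, low):
    """Merge a lower-priority sample ordering into a higher-priority one while
    preserving the relative order of both (a sketch of Algorithm 4.3, Step 4)."""
    final = list(high)
    for v in high:
        if v not in low:
            continue
        i = final.index(v)
        before = [u for u in low[:low.index(v)] if u not in final]
        after = [u for u in low[low.index(v) + 1:] if u not in final]
        final = final[:i] + before + [v] + after + final[i + 1:]
    return final

order_B = ['Y0', 'Y1', 'X2', 'Y3']      # higher priority (its MDD is larger)
order_A = ['X0', 'X1', 'X2', 'X3']      # lower priority
print(merge_orderings(order_B, order_A))
# ['Y0', 'Y1', 'X0', 'X1', 'X2', 'X3', 'Y3'], i.e., order[C] above
```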

4.5 SUMMARY
In this chapter, heuristics are introduced to minimize multiple-valued decision diagrams
(MDDs) for multiple-output functions. Upper bounds on the sizes of MDDs are presented
for various functions. MDDs usually require fewer nodes than corresponding SBDDs, and sometimes MDDs require less than half the nodes of SBDDs. The introduced method is much
faster, and it produces MDDs that are smaller. In addition, the introduced method produces
much smaller MDDs in a short time when several 2-valued input variables are grouped to
form multiple-valued variables.

REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] H. M. H. Babu and T. Sasao, “Minimization of multiple-valued decision diagrams
using sampling method”, Proceedings of the Ninth Workshop on Synthesis and System
Integration of Mixed Technologies, pp. 291–298, 2000.
[3] S. Minato, N. Ishiura, and S. Yajima, “Shared binary decision diagram with attributed
edges for efficient Boolean function manipulation", Proceedings of 27th ACM/IEEE
DAC, pp. 52–57, 1990.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] S. Tani, K. Hamaguchi, and S. Yajima, “The complexity of the optimal variable ordering problems of a shared binary decision diagram”, IEICE Trans. Inf. & Syst., vol. E79-D, no. 4, pp. 271–281, 1996.
[6] N. Ishiura, H. Sawada, and S. Yajima, “Minimization of binary decision diagrams
based on exchanges of variables”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 472–475, 1991.
[7] G. Epstein, Multiple-Valued Logic Design: An Introduction, IOP Publishing Ltd.,
London, 1993.
[8] J. Jain, W. Adams, and M. Fujita, “Sampling schemes for computing OBDD vari-
able orderings”, Proceedings of IEEE International Conference on Computer-Aided
Design, pp. 631–638, 1998.
[9] F. Somenzi, “Colorado University Decision Diagram package (CUDD), release 2.1.2”, 1997.
[10] H. Fujii, G. Ootomo, and C. Hori, “Interleaving based variable ordering methods for
ordered binary decision diagrams”, Proceedings of IEEE International Conference on
Computer-Aided Design, pp. 38–41, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[12] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.

[13] A. Thayse, M. Davio, and J-P. Deschamps, “Optimization of multi-valued decision


algorithms", Proceedings of the IEEE International Symposium on Multiple-Valued
Logic, pp. 171–178, 1978.
[14] H. M. H. Babu and T. Sasao, “Heuristics to minimize multiple-valued decision diagrams”, IEICE Trans. Fundamentals, vol. E83-A, no. 12, pp. 2498–2504, 2000.
CHAPTER 5

TDM Realizations of Multiple-Output Functions Based on Shared Multi-Terminal Multiple-Valued DDs

This chapter considers methods to design multiple-output networks based on decision


diagrams (DDs). TDM (time-division multiplexing) systems transmit several signals on
a single line. These methods reduce: (1) hardware; (2) logic levels; and (3) pins. In the
TDM realizations, three types of DDs are considered: shared binary decision diagrams
(SBDDs), shared multiple-valued decision diagrams (SMDDs), and shared multi-terminal
multiple-valued decision diagrams (SMTMDDs). In the network, each non-terminal node
of a DD is realized by a multiplexer (MUX). Heuristic algorithms are introduced to derive
SMTMDDs from SBDDs.

5.1 INTRODUCTION
In modern LSIs, one of the most important issues is the “pin problem.” The reduction of
the number of pins in the LSIs is not so easy, even though the integration of more gates may
be possible. To overcome the pin problem, the time-division multiplexing (TDM) systems
are often used. In the TDM system, a single signal line represents several signals. For
example, the Intel 8088 microprocessor used an 8-bit bus to represent 16-bit data, which made it possible to produce microcomputers in large quantities quickly, while 16-bit peripheral LSIs were not yet popular in the early 1980s. In this chapter, a method is presented
to design multiple-output networks based on shared multi-terminal multiple-valued decision
diagrams (SMTMDDs) by using TDMs. Heuristic algorithms are introduced to derive


SMTMDDs from shared binary decision diagrams (SBDDs). In the network, each non-
terminal node of a decision diagram (DD) is realized by a multiplexer (MUX).

5.2 DECISION DIAGRAMS FOR MULTIPLE-OUTPUT FUNCTIONS


In this section, three different decision diagrams are shown to represent multiple-output
functions.

5.2.1 Shared Binary Decision Diagrams


A shared binary decision diagram (SBDD) is a set of binary decision diagrams (BDDs)
combined by a tree for output selection. For example, Fig. 5.1 shows the SBDD for Table 5.1.
Multi-terminal binary decision diagrams (MTBDDs) are the extended BDDs with
multiple-terminal nodes, where the terminals are m-bit binary vectors for m output functions.
A shared multi-terminal binary decision diagram (SMTBDD) is a set of MTBDDs combined
by a tree for output selection.

Figure 5.1: SBDD for the Function in Table 5.1.



Table 5.1: 2-Valued 4-Input and 4-Output Function

Input Output
x1 x2 x3 x4 f0 f1 f2 f3
0 0 0 0 0 1 1 0
0 0 0 1 1 0 1 1
0 0 1 0 0 1 0 1
0 0 1 1 1 1 1 1
0 1 0 0 1 0 0 1
0 1 0 1 0 1 1 0
0 1 1 0 1 0 1 1
0 1 1 1 1 1 0 0
1 0 0 0 0 0 0 1
1 0 0 1 1 0 1 1
1 0 1 0 1 1 0 1
1 0 1 1 0 1 1 1
1 1 0 0 1 0 0 1
1 1 0 1 0 1 1 0
1 1 1 0 1 1 1 1
1 1 1 1 0 1 0 0

5.2.2 Shared Multiple-Valued Decision Diagrams


A shared multiple-valued decision diagram (SMDD) is a set of multiple-valued decision
diagrams (MDDs) combined by a tree for output selection. Fig. 5.2 shows the SMDD for
Table 5.1, where g 0, g 1 , and g 2 are the output selection variables, and X 1 and X 2 are the
pairs of binary inputs.

5.2.3 Shared Multi-Terminal Multiple-Valued Decision Diagrams


Shared multi-terminal multiple-valued decision diagrams (SMTMDDs) are another representation of multiple-valued multiple-output logic functions. An SMTMDD is a set of
multiple-valued decision diagrams (MDDs) with multiple terminal nodes combined by a
tree for output selection. The number of MDDs in the SMTMDD is equal to the number of
groups of output functions. Fig. 5.3 shows the SMTMDD for Table 5.2, where Y 1 and Y 2 are
the pairs of binary outputs, and X 1 and X 2 are the pairs of binary inputs. The advantage of
SMTMDDs is that they can evaluate several output functions simultaneously. Moreover, the
good grouping of output functions and good grouping of input variables produce compact
SMTMDDs.
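Table 5.2 is obtained from Table 5.1 by pairing the inputs as X1 = (x1, x2) and X2 = (x3, x4) and the outputs as Y1 = (f0, f1) and Y2 = (f2, f3), with each pair read as a 2-bit number. The short sketch below performs this conversion for one row (the bit order within a pair is an assumption consistent with the first rows of the two tables).

```python
def pair_bits(hi, lo):
    """Read a pair of 2-valued signals as one 4-valued signal."""
    return 2 * hi + lo

def to_4_valued(row):
    """Convert one row (x1, x2, x3, x4, f0, f1, f2, f3) of Table 5.1 into the
    corresponding row (X1, X2, Y1, Y2) of Table 5.2."""
    x1, x2, x3, x4, f0, f1, f2, f3 = row
    return (pair_bits(x1, x2), pair_bits(x3, x4),
            pair_bits(f0, f1), pair_bits(f2, f3))

# First row of Table 5.1: inputs 0000, outputs (0, 1, 1, 0).
print(to_4_valued((0, 0, 0, 0, 0, 1, 1, 0)))    # (0, 0, 1, 2), the first row of Table 5.2
```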

Figure 5.2: SMDD for the Function in Table 5.1.

Figure 5.3: SMTMDD for the Function in Table 5.2.



Table 5.2: 4-Valued 2-Input and 2-Output Function

Input Output
X1 X2 Y1 Y2
0 0 1 2
0 1 2 3
0 2 1 1
0 3 3 3
1 0 2 1
1 1 1 2
1 2 2 3
1 3 3 0
2 0 0 1
2 1 2 3
2 2 3 1
2 3 1 3
3 0 2 1
3 1 1 2
3 2 3 3
3 3 1 0

Property 5.2.1 The sizen of a DD, denoted by sizen(DD), is the total number of non-terminal nodes excluding the nodes for output selection variables.

Example 5.1 The sizens of the SBDD, the SMDD, and the SMTMDD in Figs. 5.1, 5.2, and 5.3 are 19, 11, and 9, respectively. Note that g0, g1, and g2 are the output selection variables in
the SBDD and SMDD, and g 0 is the output selection variable in the SMTMDD.

5.3 TDM REALIZATIONS


The TDM realization uses a clock pulse to reduce the number of input and output pins. On the
other hand, the non-TDM realization means a conventional combinational network without
using clock pulse. In this section, TDM realizations of multiple-output functions are shown
based on SBDDs, SMDDs, and SMTMDDs.

5.3.1 TDM Realizations Based on SBDDs


In this subsection, a method is introduced to realize multiple-output functions by using
TDM. To illustrate it, an example of the 4-input 4-output function is used which is shown
in Table 5.1. Fig. 5.4 shows a TDM realization. In the main LSI, pairs of logic functions are
multiplexed by the clock pulse η. The output signals of the main LSI denote the functions
as follows:

G0 = η' f0 ∨ η f1, and
G1 = η' f2 ∨ η f3.

This means that when η = 0, G0 and G1 represent f0 and f2, respectively. On the other hand, when η = 1, G0 and G1 represent f1 and f3, respectively. In this realization, hardware is needed for the functions f0, f1, f2, and f3, as well as for the multiplexing. By using this technique, the number of output pins is reduced to a half. Note that in this
example, only two lines are necessary between the main LSI and the peripheral LSI. In
the peripheral LSI, delay latches are needed. When η = 0, the values for f 0 and f 2 are
transferred to the first and the third latches, respectively. On the other hand, when η = 1, the
values for f 1 and f 3 are transferred to the second and fourth latches, respectively. To realize
the multiple-output function, an SBDD is used. By replacing each non-terminal node of an
SBDD by a multiplexer, a network for the multiple-output function is obtained. In this case,
the amount of hardware for the network is easily estimated by the sizen of the SBDD, and
the design of the network is quite easy.
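The multiplexing of an output pair onto a single line and its recovery by the latches in Fig. 5.4 can be simulated behaviorally in a few lines; the sketch below is only a simulation-level illustration, not a gate-level design.

```python
# Behavioral sketch of the shared output line G0 = eta'*f0 OR eta*f1 (Fig. 5.4).

def tdm_line(f0, f1, eta):
    """Value carried by the shared line G0 during clock phase eta."""
    return f1 if eta else f0

def demultiplex(g0_phase0, g0_phase1):
    """Peripheral LSI: latch f0 on eta = 0 and f1 on eta = 1."""
    return {'f0': g0_phase0, 'f1': g0_phase1}

f0, f1 = 1, 0                       # example output values for one input vector
received = demultiplex(tdm_line(f0, f1, 0), tdm_line(f0, f1, 1))
print(received)                     # {'f0': 1, 'f1': 0}: both outputs recovered from one pin
```

A second line G1 carries f2 and f3 in the same way, which is how the number of output pins is halved.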

5.3.2 TDM Realizations Based on SMDDs


In Fig. 5.4, if the multiple-output function is realized by an SMDD, the TDM realization based on SMDDs is obtained. Each non-terminal node of an SMDD is realized by the 2-MUX in Fig. 5.6. Fig. 5.7 shows the literal generator whose inputs are a pair of 2-valued variables (details will be shown in Subsection 5.3.3). In this method, the input variables are partitioned into pairs to make 4-valued variables, and the realization of 4-valued input 2-valued output functions is considered: Q^n → B^m, where Q = {00, 01, 10, 11} and B = {0, 1}. The number of nodes in an SMDD can be reduced as follows:
(i) By finding the best pairing of the input variables to make 4-valued variables.
(ii) By finding the best ordering of the 4-valued variables.

Figure 5.4: TDM Realization Based on the SBDD.



Figure 5.5: TDM Realization Based on the SMTMDD.

Figure 5.6: 2-MUX.

Figure 5.7: Literal Generator.



5.3.3 TDM Realizations Based on SMTMDDs


The TDM realization based on SMTMDDs is shown in Fig. 5.5. In this method, 4-valued
logic is used instead of 2-valued logic. Consider a 2-valued multiple-output function. First,
partition the input variables into pairs. For example, the input variables {x 1, x 2, x 3, x 4 } in
Table 5.1 are partitioned into X 1 = (x 1, x 2 ) and X 2 = (x 3, x 4 ). Second, partition the output
functions into pairs. For example, the output functions { f 0, f 1, f 2, f 3 } in Table 5.1 are
partitioned into G0 = (f0, f1) and G1 = (f2, f3). Then, a 4-valued logic function Q^2 → Q, where Q = {0, 1, 2, 3}, is obtained, as shown in Table 5.2. The output functions Y1 and Y2 in
Table 5.2 correspond to G0 and G1 , respectively. In general, a 4-valued n-input m-output
function: Q n →Q m is represented by an SMTMDD. Next, consider the hardware realization
of an SMTMDD. Each non-terminal node of an SMTMDD is realized by a 2-MUX shown
in Fig. 5.6. It is a 4-way multiplexer. Fig. 5.7 shows the literal generator whose inputs are
a pair of 2-valued variables (x 1, x 2 ), and outputs are X 0, X 1, X 2 , and X 3 that control the
2-MUX. Note that X^i = 0 if X ≠ i, and X^i = 1 if X = i.
A signal in the terminal node is represented by a pair of bits (c0, c1 ) as follows:
When η = 0, the signal represents c0 .
When η = 1, the signal represents c1 .
Thus, (c0, c1 ) = (0, 0) corresponds to a constant 0.
(c0, c1 ) = (0, 1) corresponds to a constant η.
(c0, c1 ) = (1, 0) corresponds to a constant η’.
(c0, c1 ) = (1, 1) corresponds to a constant 1.
Fig. 5.8 shows the SMTMDD-based TDM realization for the function in Table 5.2. In
the inputs, a pair of 2-valued variables X = (x 1, x 2 ) represents a 4-valued signal {00, 01,
10, 11} or {0, 1, 2, 3}. On the other hand, 0, η, η’, and 1, represent (0, 0), (0, 1), (1, 0),
and (1, 1), respectively. Note that {0, η, η’, 1} constitutes the 4-element Boolean algebra.
If {0, η, η', 1} is replaced by {0, 1, 2, 3}, then the 4-valued function in Table 5.2 is obtained. An arbitrary 4-valued function is represented by an SMTMDD. The amount of hardware for the network is estimated by the sizen of the SMTMDD. The number of nodes in an
SMTMDD can be reduced as follows:
(i) By finding the best pairing of the input variables to make 4-valued variables.
(ii) By finding the best pairing of the output functions to make 4-valued functions.
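The building blocks of this realization can be modeled behaviorally: the literal generator turns the pair (x1, x2) into a 4-valued selector, the 2-MUX selects one of its four data inputs, and a terminal (c0, c1) delivers c0 when η = 0 and c1 when η = 1. The sketch below is an illustrative model only, not a transistor-level description.

```python
def literal_generator(x1, x2):
    """Pair of 2-valued inputs -> 4-valued value X in {0, 1, 2, 3} (Fig. 5.7)."""
    return 2 * x1 + x2

def two_mux(X, d0, d1, d2, d3):
    """4-way multiplexer controlled by the 4-valued selector X (the 2-MUX, Fig. 5.6)."""
    return (d0, d1, d2, d3)[X]

def terminal(c0, c1, eta):
    """Terminal signal (c0, c1): c0 in phase eta = 0, c1 in phase eta = 1."""
    return c1 if eta else c0

# A node whose four edges lead to the constants 0, eta, eta' and 1, with X = 2.
for eta in (0, 1):
    X = literal_generator(1, 0)                     # (x1, x2) = (1, 0), so X = 2
    print(two_mux(X, terminal(0, 0, eta),           # constant 0
                     terminal(0, 1, eta),           # eta
                     terminal(1, 0, eta),           # eta'
                     terminal(1, 1, eta)))          # constant 1
# prints 1 then 0: the selected edge behaves as eta' over the two clock phases
```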

5.3.4 Comparison of TDM Realizations


In this subsection, the DD-based TDM realizations of an n-input m-output function F are compared. In the hardware realization, each non-terminal node of a BDD is realized with two MOS transistors, while each non-terminal node of an MDD is realized with four MOS transistors. So, if the cost of literal generators is ignored, the cost of a non-terminal node of an MDD is twice the cost of a non-terminal node of the BDD.
Therefore, when 2·sizen(MDD: F) < sizen(BDD: F), the MDD-based realizations
are more economical than BDD-based ones. In addition, in the case of an n-variable
function, a BDD-based realization requires n levels, while an MDD-based realization
requires only n/2 levels. In the FPGAs, the delay of interconnections between the modules is

Figure 5.8: TDM Realization of a 4-Output Function Based on the SMTMDD.

often greater than the delay of logic modules. Thus, the reduction of logic level is important.
So, MDD-based realizations can be faster and require smaller amount of hardware than
BDD-based ones.

5.4 REDUCTION OF SMTMDDS


The reduction of sizen for SMTMDDs is important to design compact logic networks. The
following methods are considered for reduction: (1) pairing of output functions; (2) pairing
of input variables; and (3) ordering of a group of input variables. The SMTBDDs are derived
from the SBDDs by pairing the output functions, and the SMTMDDs are derived from the
SMTBDDs by pairing the input variables. Since an SMTBDD consists of MTBDDs, and
each MTBDD represents a pair of output functions, the following heuristic is used to pair the outputs: pair output functions so that the upper bounds on the size of the MTBDD are
minimized. The MDD nodes for each pair of input variables are counted from SMTBDDs
as follows: Subgraphs are shown in Fig. 5.9 (a), (b), or (c), corresponding to one, two, or
three MDD nodes, respectively.
In Fig. 5.9 (a), three SMTBDD nodes are replaced by one MDD node. However, in
Fig. 5.9 (b), the SMTBDD nodes correspond to two MDD nodes. In Fig. 5.9 (c), the
SMTBDD nodes are replaced by three MDD nodes. Finally, SMTMDDs are optimized by
using sifting algorithm.

Figure 5.9: A Method Replacing SMTBDD Nodes by MDD Nodes.

5.5 UPPER BOUNDS ON THE SIZEN OF DDS


In the design of multiple-output networks, often it is necessary to estimate the number of
MUXs to realize functions. This section shows upper bounds on the number of non-terminal
nodes to represent an n-input m-output function by an SBDD and an SMTMDD. Since each
non-terminal node of a DD corresponds to a MUX, the sizen of the DD estimates the amount
of hardware.

Property 5.5.1 Consider an n-input m-output function F. Then, sizen(SBDD) ≤ min_{k=1}^{n} { m·(2^{n−k} − 1) + 2^{2^k} − 2 }.


Property 5.5.2 Consider a function F : {0, 1, ..., p−1}^N → {0, 1, ..., r−1}^m. Let m1 be the number of groups of 2-valued functions. Then, sizen(SMTMDD) ≤ min_{k=1}^{N} { m1·(p^{N−k} − 1)/(p − 1) + r^{p^k} − r }.

Example 5.2 Let n = 18, m = 20, and p = r = 4. For such a function, sizen (SBDD ) ≤
2621422, and sizen (SMTMDD ) ≤ 218702.
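The two bounds can be evaluated directly. For n = 18, m = 20, and p = r = 4 (so that N = 9 four-valued inputs and m1 = 10 output pairs), the terms at k = 1 give exactly the figures quoted in Example 5.2; the sketch below simply evaluates the formulas of Properties 5.5.1 and 5.5.2.

```python
def sbdd_bound_term(n, m, k):
    # Term inside the min of Property 5.5.1.
    return m * (2 ** (n - k) - 1) + 2 ** (2 ** k) - 2

def smtmdd_bound_term(N, m1, p, r, k):
    # Term inside the min of Property 5.5.2.
    return m1 * (p ** (N - k) - 1) // (p - 1) + r ** (p ** k) - r

print(sbdd_bound_term(18, 20, 1))              # 2621422
print(smtmdd_bound_term(9, 10, 4, 4, 1))       # 218702
```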

5.6 SUMMARY
In this chapter, time-division multiplexing (TDM) realizations of multiple-output func-
tions based on shared binary decision diagrams (SBDDs), shared multiple-valued decision
diagrams (SMDDs), and shared multi-terminal multiple-valued decision diagrams (SMT-
MDDs) are considered. In the network, each non-terminal node of a decision diagram (DD)
is realized by a multiplexer (MUX). For an n-variable function, the BDD-based realization
requires n levels, while the MDD-based realization requires n/2 levels. The TDM method
reduces the interconnections among the modules as shown in Figs. 5.4 and 5.5. In addition,
the SMTMDD-based realization reduces the number of gates by considering the pairing of
input variables and pairing of output functions. Note that the SBDD-based realizations and
the SMDD-based realizations require extra gates at the outputs (which are not included in
the tables).
The TDM method requires a clock pulse, which introduces delay in the network. However, the number of pins in the TDM realization is half that of the non-TDM realization. MDD-based
realizations are more economical than SBDD-based ones when the ratios are less than
0.5. However, for n-bit adders, SMDD-based realizations require the smallest amount of

hardware. It is also shown that there are cases where SBDD-based realizations are the most
economical. For arithmetic functions, MDD-based realizations tend to be more economical
than SBDD-based ones. The presented method can be extended to the case where p output
functions are grouped by using p-phase clock pulses.

REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] H. M. H. Babu and T. Sasao, “Design of multiple-output networks using time domain
multiplexing and shared multi-terminal multiple-valued decision diagrams”, IEEE
International Symposium on Multiple Valued Logic, pp. 45–51, 1998.
[3] S. Minato, N. Ishiura, and S. Yajima, "Shared binary decision diagram with attributed
edges for efficient Boolean function manipulation", Proceedings of 27th ACM/IEEE
DAC, pp. 52–57, 1990.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] D. M. Miller, "Multiple-valued logic design tools", Proceedings of the IEEE Interna-
tional Symposium on Multiple-Valued Logic, pp. 2–11, 1993.
[6] S. B. Akers, “Binary decision diagrams”, IEEE Trans. Comput., vol. C-27, no. 6, pp.
509–516, 1978.
[7] G. Epstein, Multiple-Valued Logic Design: An Introduction, IOP Publishing Ltd.,
London, 1993.
[8] T. Sasao and J. T. Butler, “A method to represent multiple-output switching functions
by using multi-valued decision diagrams”, Proceedings of 26th IEEE International
Symposium on Multiple-Valued Logic, pp. 248–254, 1996.
[9] H. M. H. Babu and T. Sasao, “Shared multi-terminal binary decision diagrams for
multiple-output functions”, IEICE Trans. Fundamentals, vol. E81-A, no.12, pp. 2545–
2553, 1998.
[10] T. Sasao, ed., “Logic Synthesis and Optimization”, Kluwer Academic Publishers, Boston, 1993.
[11] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Proceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47, 1993.
[11] P. C. McGeer, K. L. McMillan, A. Saldanha, A. L. Sangiovanni-Vincentelli and P.
Scaglia, “Fast discrete function evaluation using decision diagrams”, International
Workshop on Logic Synthesis, pp. 6.1–6.9, 1995.
[12] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli,
“Logic Minimization Algorithms for VLSI Synthesis”, Kluwer Academic Publishers,
Boston, 1984.
[13] T. Sasao, “Switching Theory for Logic Synthesis”, Kluwer Academic Publishers,
Boston, 1999.
[14] L.S. Hurst, “Multiple-valued logic-Its status and its future”, IEEE Trans. Comput.,
vol. C-33, no. 12, pp. 1160–1179, 1984.

[15] M. Kameyama and T. Higuchi, “Synthesis of multiple-valued logic networks based


on tree-type universal logic module”, IEEE Trans. Comput., vol. C-26, no. 12, pp.
1297–1302, 1977.
[16] D.M. Miller and N. Muranaka, “Multiple-valued decision diagrams with symmetric
variable nodes”, Proceedings of the IEEE International Symposium on Multiple-
Valued Logic, pp. 242–247, 1996.
[17] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-output functions based on shared multi-terminal multiple-valued decision diagrams”, IEICE Trans. Inf. & Syst., vol. E82-D, no. 5, pp. 925–932, 1999.
CHAPTER 6

Multiple-Output Switching Functions Using Multiple-Valued Pseudo-Kronecker DDs

In this chapter, a method is introduced to construct smaller multiple-valued pseudo-


Kronecker decision diagrams (MVPKDDs). The method first generates a 4-valued input 2-valued multiple-output function from given 2-valued input 2-valued output functions.
Then, it constructs a 4-valued decision diagram (4-valued DD) to represent the generated
4-valued input function. Finally, it selects a good expansion among 27 different expansions
for each 4-valued node of the 4-valued DD and derives a 4-valued PKDD. Heuristics are
presented to produce compact 4-valued PKDDs.

6.1 INTRODUCTION
In VLSI design, one of the major concerns is the efficient and compact representation of
switching functions. Binary decision diagrams (BDDs) are probably the most successful
representation of switching functions. Multiple-valued decision diagrams (MDDs) are ex-
tensions of BDDs and are an important data structure for multiple-valued functions. MDDs
are usually smaller than the corresponding BDDs and are useful for logic synthesis, logic
simulation, test, etc. Pseudo-Kronecker decision diagrams (PKDDs) are generalizations of
BDDs and usually require fewer nodes than the corresponding BDDs. A 2-valued PKDD rep-
resents a 2-valued input 2-valued multiple-output function, while a multiple-valued PKDD
(MVPKDD) represents a multiple-valued input 2-valued multiple-output function.
In 2-valued PKDDs, any of the following three expansions can be used for each non-
terminal node: (1) the Shannon expansion; (2) the positive Davio expansion; and (3) the
negative Davio expansion. PKDDs are useful for multi-level logic synthesis and LUT (look-
up table) type FPGA (field programmable gate array) design. For example: (a) Fig. 6.2 shows
a multi-level network corresponding to the 2-valued PKDD in Fig. 6.1. This network can
further be minimized by using a local transformation. It has been shown that a 2-valued
PKDD-based network requires, on the average, 21 percent fewer gates and interconnections
than the BDD-based one. (b) Fig. 6.3 shows a LUT-based network corresponding to the
PKDD in Fig. 6.1. It is suitable for an FPGA consisting of 3-input LUTs and programmable
interconnections. 4-valued PKDDs are useful for FPGAs with 6-input LUTs. Since each
non-terminal node of a PKDD is replaced by a LUT, the reduction of the number of nodes
in the PKDD is important. Three methods to construct compact MVPKDDs are:
1. Grouping of 2-valued variables to make multiple-valued variables;
2. Ordering of multiple-valued variables in the MVPKDDs; and
3. Generating a good expansion for each non-terminal node of MVPKDDs.
Sasao and Butler used 4-valued PKDDs to design LUT-type FPGAs. They considered the
following to construct 4-valued PKDDs: (i) the bit-pairing algorithm for PLAs to pair the
2-valued inputs; (ii) a simulated annealing method to order the input variables; and (iii)
a cost estimation method for finding good expansions. In this chapter, a 4-valued PKDD
is constructed from the BDD. First, a shared binary decision diagram (SBDD) is used
to pair the 2-valued inputs of a multiple-output function. Then, a set of good variable
orderings is found for the 4-valued inputs from MDDs to produce a smaller 4-valued PKDD.
Finally, good expansions for the non-terminal nodes of the 4-valued PKDD are generated from
the corresponding BDD.

Figure 6.1: 2-Valued PKDD for the Functions (f0, f1) = (x2^0 x4 ⊕ x1^0 x2 x3^0 ⊕ x1^0 x2 x4 ⊕ x1 x3^0, x3 ⊕ x4).

Figure 6.2: EXOR-Based Network from the PKDD in Fig. 6.1.

Figure 6.3: LUT-Based Network from the PKDD in Fig. 6.1.

6.2 DEFINITIONS AND BASIC PROPERTIES


In this section, definitions are presented and some properties of 2-valued and multiple-
valued functions are included.

Property 6.2.1 Let F = (f0, f1, ..., f_{m−1}). A multiple-valued N-input 2-valued m-output
function is F(X1, X2, ..., XN): R1 × R2 × · · · × RN → B^m, where Xi assumes a value in
Ri = {0, 1, ..., ri − 1} and B = {0, 1}.

Property 6.2.2 (Expansion Theorem): An arbitrary switching function f(x1, x2, ..., xn)
can be represented by

f = x1^0 f0 ⊕ x1 f1,    (6.1)
f = f0 ⊕ x1 f2,    (6.2)
f = x1^0 f2 ⊕ f1,    (6.3)

where f0 = f(0, x2, x3, ..., xn), f1 = f(1, x2, x3, ..., xn), and f2 = f0 ⊕ f1. Equation
6.1 is the Shannon (S) expansion, Equation 6.2 is the positive Davio (pD) expansion, and
Equation 6.3 is the negative Davio (nD) expansion.
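As a quick check of these expansions, the following Python sketch (an illustrative helper that is not part of the original text) derives f0, f1 and f2 = f0 ⊕ f1 from a given function and verifies Equations 6.1–6.3 on all input assignments.

    from itertools import product

    def expansions(f):
        """Cofactors of f with respect to x1 and their Boolean difference."""
        f0 = lambda rest: f((0,) + rest)        # f with x1 = 0
        f1 = lambda rest: f((1,) + rest)        # f with x1 = 1
        f2 = lambda rest: f0(rest) ^ f1(rest)   # f2 = f0 XOR f1
        return f0, f1, f2

    def check(f, n):
        f0, f1, f2 = expansions(f)
        for bits in product((0, 1), repeat=n):
            x1, rest = bits[0], bits[1:]
            shannon   = ((1 - x1) & f0(rest)) ^ (x1 & f1(rest))        # Eq. (6.1)
            pos_davio = f0(rest) ^ (x1 & f2(rest))                     # Eq. (6.2)
            neg_davio = f1(rest) ^ ((1 - x1) & f2(rest))               # Eq. (6.3)
            assert f(bits) == shannon == pos_davio == neg_davio
        return True

    # Hypothetical 3-variable function used only as a test case
    f = lambda b: ((1 - b[0]) & b[1]) ^ (b[0] & b[2])
    print(check(f, 3))   # True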

Property 6.2.3 Let R = {0, 1, ..., r − 1} and S ⊆ R. X^S is a literal of X, where X^S = 0
if X ∉ S, and X^S = 1 if X ∈ S. When S contains only one element {i}, X^{i} is denoted
by X^i. A product of literals X1^{S1} X2^{S2} · · · XN^{SN} is a product term, i.e., the AND of
literals. The expression ∨ X1^{S1} X2^{S2} · · · XN^{SN}, where ∨ denotes the inclusive-OR taken
over the product terms, is a sum-of-products expression (SOP).

Property 6.2.4 An arbitrary r-valued N-variable multiple-output function F(X1, X2, ...,
XN) can be uniquely represented as F = ∨_{j=0}^{r−1} X1^{j} F(j, X2, ..., XN). This is a
multiple-valued version of Shannon’s expansion.

Property 6.2.5 Let F = ( f 0, f 1, . . ., f m−1 ). Then, the size of a decision diagram (DD) of
F , denoted by size(DD, F ), is the total number of non-terminal nodes in the DD for F .

Example 6.1 The size of the PKDD in Fig. 6.1 is 7.

6.3 DECISION DIAGRAMS


In this section, two types of decision diagrams (DDs) are defined to represent multiple-
output switching functions.

6.3.1 2-Valued Pseudo-Kronecker Decision Diagrams


2-valued pseudo-Kronecker decision diagrams (2-valued PKDDs) are generalizations of
SBDDs, where each node of a PKDD can represent one of the three expansions:
the Shannon expansion, the positive Davio expansion, and the negative Davio expansion.
Fig. 6.1 shows an example of a 2-valued PKDD for the functions (f0, f1) =
(x2^0 x4 ⊕ x1^0 x2 x3^0 ⊕ x1^0 x2 x4 ⊕ x1 x3^0, x3 ⊕ x4). For a function of n variables and a given order of
the input variables, there exist at most 3^(2^n − 1) different 2-valued PKDDs. Since 2-valued
PKDDs are generalizations of SBDDs, the following properties hold:



Property 6.3.1 An arbitrary n-input m-output function can be represented by a 2-valued
PKDD with O(m · 2^n / n) nodes.

Property 6.3.2 An arbitrary n-input m-output symmetric function can be represented by
a 2-valued PKDD with O(m · n^2) nodes.

Furthermore, the following can be obtained:

Property 6.3.3 Let F = ( f 0, f 1, . . ., f m−1 ). Then, size(2-valued PKDD, F) ≤ m + 1.

Property 6.3.4 Let inc n be an n-input (n + 1)-output function that computes K + 1, where
K is a binary number consisting of n bits. It represents an incrementing circuit.

Property 6.3.5 size(2-valued PKDD, inc n) ≤2n(n ≥ 2).

6.3.2 Multiple-Valued Pseudo-Kronecker Decision Diagrams


A multiple-valued pseudo-Kronecker decision diagram (MVPKDD) is an extension of a
2-valued PKDD. An MVPKDD usually requires fewer nodes than the corresponding 2-
valued PKDD. MVPKDDs are useful for FPGA design and multi-level logic synthesis. In
the next section, heuristics are presented to reduce 4-valued PKDDs.

6.4 OPTIMIZATION OF 4-VALUED PKDDS


Construction of compact 4-valued PKDDs is important for efficient representations of
multiple-valued functions. In general, it is very time-consuming to find the best ordering of
the input variables and the best expansions for the non-terminal nodes of a 4-valued PKDD.
So, the following strategy is used to produce a smaller 4-valued PKDD quickly: (1) pairing
of 2-valued input variables to make 4-valued variables (Algorithm 6.1); (2) ordering of
4-valued variables (Algorithm 6.2); and (3) selection of a good expansion for each 4-valued
node (Algorithm 6.3). From now on, it is assumed that MDDs represent 4-valued input
2-valued output functions.

6.4.1 Pairing of 2-Valued Input Variables


When a function has only a single-output, finding good pairs of 2-valued inputs is relatively
easy. However, for a multiple-output function, finding good pairs of 2-valued inputs is not
so easy. In this subsection, a heuristic is presented to select good pairs of 2-valued inputs
from an SBDD.

Property 6.4.1 Let T be a set, and let Ti (i = 1, 2, 3, ..., s) be subsets of T. (T1, T2, T3, ..., Ts)
is a partition of T if ∪_{i=1}^{s} Ti = T, Ti ∩ Tj = ∅ for i ≠ j, and Ti ≠ ∅ for every i.

Property 6.4.2 Let F: {0, 1}^n → {0, 1}^m. The dependency matrix D = (dij) for F is a 0-1
matrix with m columns and n rows, where dij = 1 if fi depends on xj, and dij = 0 otherwise,
for i = 0, 1, ..., m − 1 and j = 1, 2, ..., n.

Example 6.2 Consider the 4-input 4-output function:

f0(x1, x2, x3, x4) = x1^0 x2 ∨ x1 x3^0 ∨ x2^0 x3,
f1(x1, x2, x3, x4) = x3 x4^0,
f2(x1, x2, x3, x4) = x1 x3 ∨ x3^0 x4, and
f3(x1, x2, x3, x4) = x4.

The dependency matrix is

           f0  f1  f2  f3
      x1    1   0   1   0
 D =  x2    1   0   0   0
      x3    1   1   1   0
      x4    0   1   1   1
Property 6.4.3 Let F: {0, 1}^n → {0, 1}^m, and let [xi, xj] be a pair of inputs. The
pair-dependency matrix P for F is a 0-1 matrix with m columns and n(n − 1)/2 rows (one row
per pair). The entry for output fk and input pair [xi, xj] is 1 if fk depends on at least one of
xi and xj, and 0 otherwise.

Example 6.3 Consider the 4-output function in Example 6.2. The pair-dependency matrix
P is given as follows (one row per input pair, one column per output):

               f0  f1  f2  f3
   [x1, x2]     1   0   1   0
   [x1, x3]     1   1   1   0
   [x1, x4]     1   1   1   1
   [x2, x3]     1   1   1   0
   [x2, x4]     1   1   1   1
   [x3, x4]     1   1   1   1

Strategy 6.1: Let F : {0, 1} n → {0, 1} m and let s(x i , x j ) be the number of outputs that
depend on at least one of x i and x j . Then, [x i , x j ] is a candidate pair of inputs if s(x i , x j )
is the smallest among all the pairs. Apply the same idea to the rest of the inputs recursively
to find good pairs of input variables.

Example 6.4 Consider the 4-input 4-output function in Example 6.2. There are six pairs
of input variables. The pair-dependency values are: s(x1, x2) = 2, s(x1, x3) = s(x2, x3) = 3,
and s(x1, x4) = s(x2, x4) = s(x3, x4) = 4. Since s(x1, x2) = 2 is the smallest among all
the pairs, [x1, x2] is a candidate pair. The remaining inputs are x3 and x4. Thus, [x3, x4] is
another candidate pair. So, the pairs of input variables are [x1, x2] and [x3, x4].

Note that a smaller SBDD is expected when s(xi , x j ) is smaller.


Strategy 6.2: Let an SBDD represent F : {0, 1} n →{0, 1} m . Then, [x i , x j ] is a good pair
of inputs in F if in the SBDD, most of the incoming edges into nodes labeled x j are from
nodes labeled x i .

Algorithm 6.1 Pairing the Input Variables


1: Let F : {0, 1} n →{0, 1} m . Construct an SBDD for F .
2: Derive the dependency matrix D.
3: Derive the pair-dependency matrix P.
4: Calculate s(x i , x j ).
5: Apply Strategy 6.1 to find the good pairs. In the case of a tie, use Strategy 6.2 to find the
best one among them.

Note that when the number of 2-valued inputs is odd, one input variable remains in the
2-valued form for MDDs.
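The steps of Algorithm 6.1 that rely on the dependency information (Steps 2–5, without the SBDD-based tie-break of Strategy 6.2) can be sketched as follows; the function and variable names are illustrative only, and the outputs' support sets are assumed to be given.

    from itertools import combinations

    def pair_inputs(supports, variables):
        """supports[k] = set of input variables on which output f_k depends."""
        def s(xi, xj):
            # Property 6.4.3 / Strategy 6.1: outputs depending on at least one of xi, xj
            return sum(1 for sup in supports if xi in sup or xj in sup)

        remaining, pairs = list(variables), []
        while len(remaining) >= 2:
            best = min(combinations(remaining, 2), key=lambda p: s(*p))
            pairs.append(best)
            remaining = [v for v in remaining if v not in best]
        return pairs   # with an odd number of inputs, one variable stays 2-valued

    # Outputs f0, f1, f2, f3 of Example 6.2
    supports = [{"x1", "x2", "x3"}, {"x3", "x4"}, {"x1", "x3", "x4"}, {"x4"}]
    print(pair_inputs(supports, ["x1", "x2", "x3", "x4"]))
    # [('x1', 'x2'), ('x3', 'x4')], as in Example 6.4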

6.4.2 Ordering of 4-Valued Variables


The sizes of MDDs are sensitive to the ordering of the input variables. Since 4-valued PKDDs are
generalizations of MDDs, good variable orderings for 4-valued PKDDs can be found from
MDDs. Dynamic reordering methods are useful for ordering the input variables. However, such
methods are extremely time-consuming and can fail to construct MDDs for many functions.
In real life, many practical logic circuits have multiple outputs, and most CAD tools
handle multiple-output functions at the same time. So, finding the same variable ordering
for different output functions is important. In this subsection, a heuristic is presented to
generate a good ordering of the 4-valued variables quickly.

Property 6.4.4 A group is a subset of the outputs, and a partition of the outputs consists of groups.
The group whose MDD has the largest size has the highest priority.

Example 6.5 Consider the functions in Example 6.2. Let { f 1, f 3 } and { f 0, f 2 } be two
groups, where each group forms a 2-output function.

Algorithm 6.2 Ordering the 4-Valued Variables


Let F : {0, 1} n →{0, 1} m .
1: Generate a 4-valued input 2-valued output function F 1 : { 0, 1, 2, 3} N → { 0, 1 } m from
F by Algorithm 6.1.
2: Divide the outputs of F 1 , and form groups such that the number of outputs in a group
is not so large.
3: Construct an MDD for each group and minimize it by interchanging the adjacent input
variables, and obtain the size and the variable ordering of the minimized MDD.
4: Generate a good ordering for the 4-valued input variables of a multiple-output function
using Procedure in Fig. 6.4.

Figure 6.4: Pseudocode for Ordering of 4-Valued Variables.

Example 6.6 Let F1 = {f0, f1, f2, f3}: {0, 1, 2, 3}^7 → {0, 1}^4. Let {f0, f2} and {f1, f3}
be two groups for the function F1. Let order[K1] = (A0, A1, A2, A3) and order[K2] = (B0,
B1, B2, B3) be variable orderings obtained from the MDDs representing the groups {f0, f2}
and {f1, f3}, respectively. Let 7 and 19 be the sizes of the MDDs under the variable orderings
order[K1] and order[K2], respectively. {f1, f3} has the highest priority, since the size of the
MDD for this group is the largest. So, order[K2] is checked first and then order[K1] in order
to generate a good variable ordering (order[K3]). To compute a good variable ordering,
the input variables are selected from order[K1] and order[K2] according to the procedure in
Fig. 6.4.
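The selection procedure of Fig. 6.4 is not reproduced here. The sketch below shows one plausible merge (an assumption, not the book's exact procedure): the ordering of the highest-priority group is scanned first, and variables of lower-priority orderings are appended only if they have not already been placed.

    def merge_orderings(orderings_by_priority):
        """orderings_by_priority: variable orderings, highest-priority group first
           (i.e., the group whose minimized MDD is largest)."""
        merged, seen = [], set()
        for ordering in orderings_by_priority:
            for var in ordering:
                if var not in seen:          # keep the highest-priority position
                    merged.append(var)
                    seen.add(var)
        return merged

    # order[K2] (group {f1, f3}, MDD size 19) is examined before order[K1] (size 7)
    print(merge_orderings([["B0", "B1", "B2", "B3"], ["A0", "A1", "A2", "A3"]]))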

6.4.3 Selection of Expansions


In this subsection, a technique is presented to generate 4-valued PKDDs quickly. To generate
smaller PKDDs, it is necessary to find a good expansion for each non-terminal node. In the
case of 2-valued PKDDs, there exist three different expansions for each node: the Shannon
expansion, the positive Davio expansion, and the negative Davio expansion. However, in the
case of 4-valued PKDDs, there exist 840 essentially different expansions for each 4-valued
node. Testing all 840 expansions for each 4-valued node is very time-consuming, and it
can be used only for small problems. So, the following strategy is used:
When an MDD is derived, the structure of the original BDD is kept. It is assumed that
each 4-valued node consists of three Shannon nodes as shown in Fig. 6.5. By replacing each
of the Shannon nodes with a positive Davio node or a negative Davio node, 27 different
expansions can be obtained, since each 2-valued node may take one of three different
expansions. This can be done rather easily, since the algorithm is essentially the same as the
one for 2-valued PKDDs. The demerit is that only 27 combinations out of the 840 combinations
are considered. However, this is a good compromise between the computation time and the
quality of the solutions. The strategy is to spend more time on finding good orderings of the
input variables than on selecting good expansions.

Figure 6.5: A 4-Valued Node Consisting of Three Shannon Nodes.
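The 27 candidate expansions of a 4-valued node come from assigning one of the three 2-valued expansions to each of its three internal Shannon nodes; a minimal sketch of this enumeration is shown below (the cost function is a hypothetical placeholder for the node count obtained after rebuilding the sub-DD):

    from itertools import product

    EXPANSIONS = ("S", "pD", "nD")   # Shannon, positive Davio, negative Davio

    def candidate_expansions():
        """All 3 x 3 x 3 = 27 assignments for the three 2-valued nodes of a 4-valued node."""
        return list(product(EXPANSIONS, repeat=3))

    def best_expansion(cost):
        """Pick the assignment with the smallest cost (e.g., resulting node count)."""
        return min(candidate_expansions(), key=cost)

    print(len(candidate_expansions()))                        # 27
    print(best_expansion(lambda assign: assign.count("S")))   # toy cost function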

Algorithm 6.3 Constructing a 4-Valued PKDD


Let F 1 : {0, 1, 2, 3} N → {0, 1} m .
1: Generate a set of good variable orderings for MDDs using Algorithm 6.2 for different
partitioning of the outputs of F 1 .
2: As initial conditions of expansions, construct three MDDs for F 1 under a variable
ordering from the set of generated orderings, where
(i) All nodes represent the Shannon expansions,
(ii) All nodes represent the positive Davio expansions, and
(iii) All nodes represent the negative Davio expansions.
3: For each of the three MDDs, do as follows: From the root node down to the leaf nodes,
select a good expansion for each 4-valued node by the method described in this subsection,
and count the number of nodes in the PKDD. Choose the 4-valued PKDD with the
expansions with the fewest nodes. When the size of the chosen PKDD is larger than any
of the three original MDDs, the MDD with the fewest nodes from the original MDDs
is selected as the PKDD.
4: Continue Steps 2 and 3 until all the variable orderings from the set of generated
orderings have been checked, and choose the smallest 4-valued PKDD.

Since Algorithm 6.3 is a heuristic, it may not obtain the optimal solution, but good solutions
can be expected quickly.

6.5 SUMMARY
In this chapter, a method is presented to construct smaller 4-valued pseudo-Kronecker deci-
sion diagrams (4-valued PKDDs). The numbers of non-terminal nodes in 4-valued PKDDs
are compared with those of multiple-valued decision diagrams (MDDs), 2-valued PKDDs,
and shared binary decision diagrams (SBDDs), where an MDD represents a 4-valued input
2-valued output function. 2-valued PKDDs are much smaller than the corresponding SBDDs.
So, networks built from PKDDs are expected to require a smaller amount of hardware than
those built from binary decision diagrams (BDDs). PKDDs are useful for FPGA (Field
Programmable Gate Array) design and multi-level logic synthesis.

REFERENCES
[1] T. Sasao and J. T. Butler, “A design method for look-up table type FPGA by
pseudo-Kronecker expansion”, Proceedings of 26th IEEE International Symposium
on Multiple-Valued Logic, pp. 97–106, 1994.
[2] A. Srinivasan, T. Kam, S. Malik, and R. K. Brayton, “Algorithms for discrete function
manipulation", Proceedings of IEEE International Conference on Computer-Aided
Design, pp. 92–95, 1990.
[3] T. Sasao, H. Hamachi, S. Wada, and M. Matsuura, “Multi-level logic synthesis based
on pseudo-Kronecker decision diagrams and local transformation", Proceedings of
the IFIPWG 10.5 Workshop on Applications of the Reed-Muller Expansion in Circuit
Design, pp. 152–160, 1995.
[4] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation”, IEEE
Trans. Comput., vol. C-35, no. 8, pp. 677–691, 1986.
[5] D. M. Miller, "Multiple-valued logic design tools", Proceedings of the IEEE Interna-
tional Symposium on Multiple-Valued Logic, pp. 2–11, 1993.
[6] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams”, Pro-
ceedings of IEEE International Conference on Computer-Aided Design, pp. 42–47,
1993.
[7] H. M. H. Babu and T. Sasao, “Time-division multiplexing realizations of multiple-
output functions based on shared multi-terminal multiple-valued decision diagrams”,
IEICE Trans. Inf. & Syst., vol. E82-D, no. 5, pp. 925–932, 1999.
[8] S. Minato, N. Ishiura, and S. Yajima, "Shared binary decision diagram with at-
tributed edges for efficient Boolean function manipulation", Proceedings of the 27th
ACM/IEEE DAC, pp. 52–57, 1990.
[9] H. M. H. Babu and T. Sasao, “Minimization of multiple-valued decision diagrams
using sampling method”, Proceedings of the Ninth Workshop on Synthesis and System
Integration of Mixed Technologies, pp. 291–298, 2000.
[10] H. M. H. Babu and T. Sasao, “Representations of multiple-output switching functions
using multiple-valued pseudo-Kronecker decision diagrams”, Proceedings of the 30th
IEEE International Symposium on Multiple-Valued Logic (ISMVL 2000), pp. 147–152,
2000.
II
An Overview About Design Architectures of
Multiple-Valued Circuits


Two-valued signals currently demarcate the field of digital-computing machinery, but,


historically, there is no lack of emphasis on radices (or bases) higher than two. Base 10 has
been important for all the obvious reasons, particularly in mechanical systems and in the
user’s view of electronic machines. Babbage’s original design used 10-valued mechanisms,
as did generations of later mechanical calculators. Modern electronic calculators present
data in base 10 for user convenience, as did early machines such as the IBM 1620. With
very few exceptions, such machines used inherently binary electronic circuitry and logic
internally, with information encoded in Binary Coded Decimal (BCD) format, but often
with computation done in (coded) decimal arithmetic. Thus, the choice of radix may be
optimized separately in either the conceptual or actual (that is, implementation) domain,
and a “best” choice could be one where both optima coincide.
Issues of concern at the conceptual level include notation, operational description, and
recognition of symmetry. At the actual level they include the use of physical space, noise
margins in signal space, and conversion with binary. Concerning the role of Multiple-Valued
Logic (MVL) and data representation in the dominantly two-valued or binary world, a
multiplicity of views exists. First, it is unlikely that binary will capitulate or in any way
wither and die. Besides its obvious entrenchment, there are very good reasons why two is
a special value. For example, every designer of binary logic knows that a logic gate, say
NAND or NOR, needs to have a fan-in of only two to be perfectly general, although not
particularly flexible. He knows as well that, while large gate fan-out is very convenient, a
fan-out of two is all that is necessary. Thus, even from a binary perspective, it is seen that
two is enough but more can be better.
Circuits having more than two logic levels, called multiple-valued circuits, have the
potential of reducing area by reducing the on-chip interconnection. Despite considerable
effort, designing a system for processing a multiple-valued signal is still a complicated
task. Multiple-valued circuits can be realized in voltage or current mode. Due to the limited
power supply, a higher radix valued system is not feasible to design using a voltage-mode
configuration. On the other hand, current-mode circuits have the capability of scaling,
copying, and inverting using the basic current mirror structure. The non-self-restoring nature
and higher static power dissipation are the major problems in multiple-valued current-mode
circuits. Self-restoring circuits need to be developed for correct detectable output.
Multiple-valued circuits are just between binary circuits ( M = 2) and analog circuits
( M = ∞). In the last decades, many analog implementations have been replaced by bi-
nary implementations (radio, TV, photography, cinema, etc.). Multiple-valued circuits are
closer to binary circuits than analog ones. They could have taken advantage from the shift
towards digital implementations. Obviously, multiple-valued circuits could be interesting
only if they provide significant advantages versus binary ones. It means that every new


multiple-valued circuit should be compared with the corresponding binary one. Multiple-
valued circuits must also use a technology that is compatible with the standards of up-to-date
devices.
This part starts with multiple-valued flip-flops (MVFF) using pass transistor logic
which is given in Chapter 7. Two different design techniques are introduced in this chapter
for MVFF realized by pass transistors, which can be a promising alternative to static
CMOS for deep sub-micron design. An approach for designing multi-valued logic circuits
is introduced in Chapter 8. A systematic method for implementing a set of binary logic
functions is also described, as multi-valued logic functions, and the heuristic algorithms for
different stages of the design process are provided along with it. The technique described in
this chapter can be easily extended to implement higher radix circuits. Chapter 9 presents a
method in which new Boolean variable assignment algorithm and minimization techniques
have been introduced, so that both the total computation time and the number of products
decrease. A graph is introduced in this chapter called an enhanced assignment graph (EAG)
for the efficient grouping of the Boolean variables. In order to make the best choice of
the proper base minterm, a new technique is defined to find the potential canonical cube
(PCC) covering it. In Chapter 10, the application of multi-valued Fredkin gates (MVFGs) is
shown with the implementation of fuzzy set and logic operations. Fuzzy relations and their
composition are very important in this theory, as collections of fuzzy if-then rules and fuzzy
GMP (Generalized Modus Ponens) and GMT (Generalized Modus Tollens), respectively, are
mathematically equivalent to them. In this chapter, digitized fuzzy sets, in which the
membership values are discretized and represented using ternary variables, are described
along with the implementation of set operations. Finally, an advanced minimization method for multiple-
valued multiple-output functions is introduced in Chapter 11. New minimization approach
for multiple-valued functions has also been discussed where Kleenean Coefficients and
LUT are used to reduce the complexity. The shared sub functions are extracted with a
heuristic method to pair the functions.
CHAPTER 7

Multiple-Valued Flip-Flops
Using Pass Transistor Logic

This chapter presents the realization of multiple-valued flip-flops (MVFF) using pass tran-
sistor logic. Two different design techniques are introduced here for MVFFs realized by pass
transistors, which can be a promising alternative to static CMOS for deep sub-micron de-
sign. A novel circuit consisting of multiple-valued pass transistors is introduced, which can
be called a “logical sum circuit”. This particular circuit is used as the elementary design
component for the second approach to MVFF design. The introduced MVFF circuits can
be attractive for their inherently lower power and component demands.

7.1 INTRODUCTION
The use of memory devices and circuits in digital systems is very important as they provide
means for storing values either temporarily or permanently. Electronic latching circuits
like latches and flip-flops are examples of digital memory units. This chapter discusses
the realization of multiple-valued flip-flops using pass transistor logic (PTL).
Various researchers have worked on the realization of different types of circuits using
PTL. Such circuits are very much suitable for multiple-valued logic. The realization of
multiple-valued flip-flops (MVFF) has also been studied by different authors.
In this chapter, two different approaches to MVFF realization are considered. In the first
approach, multiple-valued inputs are coded into binary values for storing purposes. In
the second approach, multiple-valued pass transistor logic is considered to implement basic
multiple-valued blocks with which an MVFF can be designed without any binary
intervention. The two design approaches are described in the following sections.

7.1.1 Realization of Multiple Valued Flip-Flops Using Pass Transistor Logic


There can be many different approaches for designing an MVFF. The flip-flop can be realized
by decoding the m-valued information according to a 1-out-of-m code, processing it with
2-valued components, and encoding the 2-valued components back into the m-valued
output. In the other approach, the multiple-valued flip-flop is realized without any kind of
binary coding or decoding. Some elementary design elements are introduced using multiple-
valued PTL, and it is shown that this approach requires far fewer components than
the first approach, where binary coding and decoding schemes are actually used.

7.1.2 Implementation of MVFF with Binary Coding and Decoding Using PTL
In this subsection, the realization of the circuits using pass transistors is discussed.
Here the multiple-valued inputs to the flip-flop are first encoded into binary values, and
binary-valued pass transistors are used for the triggering and memorization of the flip-flop.
The output is then decoded into multiple values accordingly using a decoder interface. This
approach has a multiple-valued latch called the RSTU latch, which follows the truth table
shown in Table 7.1. The circuit design using binary-valued pass transistor logic for the
2-valued and 4-valued NAND gates used in the RSTU latch is shown in Fig. 7.1. In this case,
the simplest schemes are obtained with RSTU flip-flops corresponding to the binary-coded
approach.
The following functions Di(x) and Gi(x) can be defined as follows:
Di(x) = 1 when x ≤ i,
Di(x) = 0 when x > i,
Gi(x) = 1 when x ≥ i, and
Gi(x) = 0 when x < i.
With NAND gates and RSTU latches, the functions R = f1(C, D), S = f2(C, D), T =
f3(C, D), U = f4(C, D) correspond to Table 7.2.
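The threshold functions Di(x) and Gi(x) and a 1-out-of-4 selection of the RSTU lines can be modeled behaviorally as below. This is only a sketch: the exact functions f1–f4 of Table 7.2 are not reproduced here, so the decode shown is merely one plausible choice consistent with Table 7.1.

    def D(i, x):                 # Di(x) = 1 when x <= i
        return 1 if x <= i else 0

    def G(i, x):                 # Gi(x) = 1 when x >= i
        return 1 if x >= i else 0

    def rstu_decode(d):
        """Hypothetical decode of a 4-valued data value d into (R, S, T, U)."""
        R = D(0, d)              # active only for d = 0
        S = G(1, d) & D(1, d)    # active only for d = 1
        T = G(2, d) & D(2, d)    # active only for d = 2
        U = G(3, d)              # active only for d = 3
        return R, S, T, U

    for d in range(4):
        print(d, rstu_decode(d))   # each value activates exactly one line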

Figure 7.1: RSTU Latch Using 4-Input NAND Gates.



Table 7.1: Truth Table for the RSTU Latch

R S T U Q Q0 Q1 Q2 Q3
1 0 0 0 0 0 1 1 1
0 1 0 0 1 1 0 1 1
0 0 1 0 2 1 1 0 1
0 0 0 1 3 1 1 1 0
1 1 1 1 Q Memory

Table 7.2: RSTU Latch Function

The corresponding D-latch is shown in Fig. 7.2.

Figure 7.2: 4-Valued D Latch.



7.2 MVFF WITHOUT BINARY ENCODING OR DECODING


In this section, the realization of MVFF using multiple valued pass transistor logic is
described. Pass transistor logic can be used for multiple-valued logic for the following
reasons:
1. The output of the circuit is approximately equal to its input level. If the input signal
has multiple levels, the output follows this level.
2. The input of the threshold gate is given as an analog value or as multiple levels. The
threshold value of the gate can be set arbitrarily, so the gate realizes a multiple-valued
logic function.

7.2.1 Properties of Pass Transistor and a Threshold Gate


A pass transistor with a threshold gate is shown in Fig. 7.3. The inverter marked by t is
used as a threshold gate. The inverter can be used as inverted threshold gates with arbitrary
threshold values. In order to realize a non-inverted threshold gate, two inverters may be
connected in series. The first one has a threshold value t , while the second has 0.5, i.e., it
is just a binary inverter.

Figure 7.3: A Pass Transistor with a Threshold Gate.

If the pass transistor in Fig. 7.4 is turned on, the output is equal to the input, while if it
is turned off, the output is in a high impedance state.

Figure 7.4: Realization of Non-Inverted Threshold Gate.

The relation between Y2 and X in the inverted threshold gate is defined as follows:

X = 0 for Y2 > t, and X = 1 for Y2 < t.    (7.1)

Using the internal parameter X, the relation of the input Y1 and the output Z in a pass
transistor is defined as:

Z = Y1<X> = Y1 for X = 1, and Z = Φ for X = 0,    (7.2)

where Φ denotes the high-impedance state.
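A small behavioral sketch of Equations 7.1 and 7.2 (illustrative; PHI is simply a marker for the Φ high-impedance state):

    PHI = "Z"   # marker for the high-impedance output of an off pass transistor

    def inverted_threshold(y2, t):
        """Equation 7.1: the threshold inverter produces the internal control X."""
        return 0 if y2 > t else 1

    def pass_transistor(y1, x):
        """Equation 7.2: Z = Y1<X>; the output follows Y1 when X = 1, else it floats."""
        return y1 if x == 1 else PHI

    # A 4-valued control signal Y2 against threshold t = 1.5 gating the input Y1 = 3
    for y2 in range(4):
        x = inverted_threshold(y2, t=1.5)
        print(y2, x, pass_transistor(3, x))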

The pass transistors with threshold gates can be combined in series and/or parallel
connection combinations. Equation 7.2 can be regarded as the basis of the representation
of the inputs and outputs of connections. The series connection can be depicted as:

Z = y<x1 · x2 · · · xn>    (7.3)

Fig. 7.5 shows the series connection. Parallel connections for common inputs can be
depicted as:

Z = y<x1 ∪ x2 ∪ · · · ∪ xn>    (7.4)

Figure 7.5: Series Connection.

The parallel connection for different inputs can be depicted as:

Z = y1<x1> + y2<x2> + · · · + yn<xn>    (7.5)

Fig. 7.6 shows both the connections.

Figure 7.6: Parallel Connection: (a) Common Inputs (b) Different Inputs.

And finally, for common inputs in Fig. 7.7(a), the combination of parallel-series connections
can be depicted as

y2 = y1<x1 ∪ x2>

Figure 7.7: Parallel-Series Connections: (a) Common Inputs (b) Different Inputs.

z = y2<x3> = (y1<x1 ∪ x2>)<x3> = y1<(x1 ∪ x2) x3>    (7.6)

And for different inputs, as shown in Fig. 7.7(b), it is as follows:

y3 = y1<x1> + y2<x2>

z = y3<x3> = (y1<x1> + y2<x2>)<x3> = y1<x1 x3> + y2<x2 x3>    (7.7)

7.2.2 Realization of Multiple-Valued Inverter Using Threshold Gates


In this subsection, an inverter circuit using multiple valued pass transistors is introduced.
Fig. 7.8 shows the circuit. Here the inverted outputs are considered for given inputs with
respect to the truth table shown in Table 7.3. Here three threshold gates have been used to
maintain appropriate inverted output values in response to input values.

Figure 7.8: Realization of Multiple-Valued Inverter Using Threshold Gates.



Table 7.3: Multiple-Valued Inverted Outputs for the Corresponding Input Values

Input Output
0 3
1 2
2 1
3 0

As shown in Table 7.3, 3, 2, 1, and 0 are considered to be the inverted outputs of the inputs
0, 1, 2, and 3, respectively. The motivation for this choice lies in binary logic: if 0, 1, 2, 3 are
written as 00, 01, 10, 11 (as in binary), the values after bitwise negation are 11, 10, 01, 00,
which are 3, 2, 1, 0 in multiple-valued form.
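Behaviorally, the inverter of Table 7.3 is just the bitwise complement of the 2-bit encoding; a one-line model (illustrative only):

    def mv_invert(x):
        """4-valued inverter of Table 7.3: complement the 2-bit encoding of x."""
        return x ^ 0b11          # 0 -> 3, 1 -> 2, 2 -> 1, 3 -> 0

    print([mv_invert(x) for x in range(4)])   # [3, 2, 1, 0]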

7.2.3 Realization of MVFF Using Multiple-Valued Pass Transistor Logic


The multiple-valued RS-latch uses an elementary circuit called the logical sum circuit (LSC),
which is constructed using multiple-valued pass transistors. The motivation to opt for this
new circuit is based on the binary representation of the input values. The output of the circuit
will be 3 for the input pairs (1, 2) or (x, 3) (x can be any value). This is because the logical OR
of 1 (binary equivalent 01) and 2 (binary equivalent 10) is 3 (binary equivalent 11), and for
2 (binary equivalent 10) and 3 (binary equivalent 11) it is 3 (binary equivalent 11). The use
of LSCs as the elementary circuit gives better results with respect to the number of components
and speed.
The building blocks of the introduced MVFF are elementary multiple-valued Logical
sum components that follow the truth table shown in Table 7.4. The circuit design for these
building blocks is shown in Fig. 7.9. The two variable quaternary logic function F(y 1 , y 2 )
shown in Table 7.4 is represented as follows:

Table 7.4: Truth Table for Logical Sum Circuit



Table 7.5: Truth Table for the Introduced MVFF

X Y Q Q’
3 3 Q Q’
3 0 3 0
2 1 2 1
1 2 1 2
0 3 0 3

Figure 7.9: Logical Sum Circuit.

Such an expression can be realized by a pass transistor network with threshold gates.
As shown in Table 7.4, in all the cases, the output value is the logical sum value between
two input values. Since the logical sum of y 1 = 1, y 2 = 2 is 3, for the two inputs y 1 = 1 and
y 2 = 2 or vice versa, the output is 3 which is the logical sum of the two.
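Behaviorally, the logical sum circuit of Table 7.4 computes the bitwise OR of the 2-bit encodings of its two quaternary inputs; a minimal model (illustrative only):

    def lsc(y1, y2):
        """Logical sum circuit: bitwise OR of the 2-bit encodings of two 4-valued inputs."""
        return y1 | y2

    # lsc(1, 2) = 3 (01 OR 10 = 11) and lsc(2, 3) = 3 (10 OR 11 = 11)
    print([[lsc(a, b) for b in range(4)] for a in range(4)])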
As pointed out in Fig. 7.10, the components in the blocks marked as inverted LSC give the
inverted output of the logical sum of the input values. Table 7.4 points out the
inverted outputs of the corresponding input values. The truth table for the introduced circuit
is given in Table 7.5. The construction and the multiple-valued inputs and outputs of the
FFs are shown in Fig. 7.10.

Figure 7.10: Multiple-Valued Flip-Flop using Pass Transistor Logic.



7.3 SUMMARY
Pass transistor logic circuits result in substantial improvements in area and delay over
conventional static CMOS. The two approaches described in this chapter can be noted as
promising alternatives to the gate-level design approach using conventional CMOS or TTL
logic for MVFFs (Multiple-Valued Flip-Flops). The efficient design for multiple-valued
flip-flops, in terms of the number of components, is the direct processing of the multiple-valued
signal. In this respect, the second approach to designing the MVFF can be considered
efficient and effective.

REFERENCES
[1] O. Ishizuka, “Synthesis of a pass transistor network applied to multi-valued logic”,
IEEE Trans., 1986.
[2] D. Etiemble and M. Israel, “On the realization of multiple-valued flip-flops”,
IEEE Trans., 1980.
[3] T. Sasao, “Multiple-valued logic and optimization of programmable logic arrays”,
IEEE Trans., 1998.
[4] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi and A. Shimizu, “A
3.8-ns CMOS 16x16-b multiplier using complementary pass-transistor logic”, IEEE
JSSC, vol. 25, no. 2, 1990.
[5] A. P. Chandrakasan, S. Sheng and R. W. Brodersen, “Low Power CMOS Digital
Design”, IEEE JSSC, vol. SC-20, 1985.
[6] N. Zhuang and H. Wu, “Novel ternary JKL flip-flop”, Electronics Letters, vol. 26, no.
15, pp. 1145–1146, 1990.
[7] J. I. Accha and J. L. Huertas, “General Excitation table for a JK multi-stable”, Elec-
tronics Letter, vol. 11, pp. 624, 1975.
[8] H. M. H. Babu, M. I. Zaber, M. M. Rahman, and M. R. Islam, “Implementation of
multiple-valued flip-flops using pass transistor logic”, Proceedings of the Euromicro
Symposium on Digital System Design (DSD 2004), pp. 603–606, 2004.
CHAPTER 8

Voltage-Mode Pass
Transistor-Based
Multi-Valued Multiple-Output
Logic Circuits

An approach for designing multi-valued logic circuits is introduced in this chapter. A
systematic method for implementing a set of binary logic functions as multi-valued logic
functions is also described, and heuristic algorithms for the different stages of the design
process are provided along with it. The introduced circuits are essentially voltage-mode
circuits with multi-valued outputs, and in the case of implementing multiple-output binary
logic functions this approach produces circuits with a reduced number of output pins. The
circuits described here are also suitable for implementation in VLSI technology since they
are composed of simple enhancement/depletion-mode MOS transistors and pass transistors.

8.1 INTRODUCTION
In recent times, the field of multi-valued logic (MVL) design and also the use of multi-
valued logic for implementing binary logic have gathered much attention. Different techniques
to implement MVL have been introduced. There have been mainly two different classes of such
approaches. First, there are the current-mode circuits, such as ECL or I2L, in which the
function of WIRED-SUM (summation of currents in a node) is particularly used for
implementing MVL functions. On the other hand, voltage-mode logic circuits make use of
threshold voltage levels/gates.
In this chapter, the design method of a circuit module of the latter type is described. The
circuit design uses pass transistor networks and threshold gates to implement MVL. It
also introduces a useful notation to represent a general MVL circuit with MOS transistors.
However, emphasis is given to the fact that earlier work does not describe the implementation of
the general class of logic functions, which is binary. In this chapter, along with the introduced
circuit design technique for multi-valued (quaternary) logic functions, the implementation
of binary logic functions in such circuit modules is also described. Another important factor


in the design of MVL circuits is that the simplification of MVL functions is quite difficult
to achieve. A simplification algorithm is also provided in this chapter for the introduced
circuits.

8.2 BASIC DEFINITIONS AND TERMINOLOGIES


The circuit module that is described here implements multi-valued input multi-valued
output functions. This can be defined as follows:

Property 8.2.1 Multi-valued input multi-valued output logic function: A mapping F: P1
× P2 × · · · × Pn → M, where Pi = {0, 1, 2, ..., pi − 1} and M = {0, 1, 2, ..., m − 1}, is a
multi-valued input multi-valued output logic function.

Such functions are used in sum-of-product (SOP) form. To define the product terms
and a SOP form first the definition of literals is given.

Property 8.2.2 Let X be a multi-valued variable that takes one value in P = {0, 1, 2, ...,
p − 1}. Let S ⊆ P; X^S is a literal of X, where X^S = 1 if X ∈ S and X^S = 0 if X ∉ S.

Property 8.2.3 Let f(X1, X2, ..., Xn) be an n-input multi-valued logic (MVL) function.
Then a product term of an MVL function is expressed as X1^{S1} X2^{S2} · · · Xn^{Sn}, and a
sum of such product terms, ∨ X1^{S1} X2^{S2} · · · Xn^{Sn}, is the SOP expression of the function.

Example 8.1 The function f(Y1, Y2) = Y1^{0,3} Y2^{0} + Y1^{1} Y2^{3} is a multi-valued function of two
variables. Suppose both variables are 4-valued. According to the definitions given above,
there are four literals (e.g., Y1^{0,3}, etc.) and two products (i.e., Y1^{0,3} Y2^{0} and Y1^{1} Y2^{3}) in the
function written in SOP form. By the definition, the function is a 4-valued input 2-valued
output function.

In this chapter, a set of binary functions is converted to a set of four-valued input two-valued
output functions, i.e., according to the definition of MVL functions, MVL functions are
obtained where each Pi is {0, 1, 2, 3} and M is restricted to {0, 1}. Two such functions are
then paired when they are implemented in the circuit, so after this stage the functions
obtained are four-valued input four-valued output MVL functions.

8.3 THE METHOD


The design details of the introduced circuit are discussed in the following subsections. For
implementing multiple-output binary functions, the design technique described in this
chapter produces circuits through the following three stages:

1. Conversion of the binary logic functions into four-valued input two-valued output functions.
2. Pairing of the functions obtained in the first stage.
3. Output stage: here the functions paired in the second stage are implemented together
in one circuit. These circuits are basically composed of pass transistor networks.
The following subsections give the detailed description of these stages.

8.3.1 Conversion of Binary Logic Functions into MVL Functions


According to the definition of MVL functions, binary logic functions can be regarded as
two-valued input two-valued output functions. These functions can be converted into
four-valued input two-valued output logic functions by pairing the input variables.
Let xi represent a two-valued variable and Yi represent a four-valued variable. Then the
function f(x1, x2, ..., x2r) can be written as f(Y1, Y2, ..., Yr), where Yi is either 0, 1, 2 or 3
when (x2i−1, x2i) = (0, 0), (0, 1), (1, 0) or (1, 1), respectively. The notation Yi = (x2i−1, x2i) is
used to denote this relationship between Yi and x2i−1 and x2i.

Example 8.2 The binary logic function defined in Table 8.1(a) can be converted into a
four-valued input two-valued output function as shown in Table 8.1(b), where Y1 = (x1, x2)
and Y2 = (x3, x4).
The function f in SOP form is given below: f(Y1, Y2) = Y1^{0,3} Y2^{0} + Y1^{2,3} Y2^{1} + Y1^{1} Y2^{3}.

Table 8.1: Truth Tables with (a) Binary Form and (b) Multi-Valued Form
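The conversion itself is mechanical: each pair of binary inputs becomes one quaternary digit. A small sketch (illustrative; the function name is not from the text):

    def to_quaternary(bits):
        """Encode binary inputs (x1, x2, ..., x2r) as Yi = (x_{2i-1}, x_{2i}) = 2*x_{2i-1} + x_{2i}."""
        assert len(bits) % 2 == 0
        return tuple(2 * bits[i] + bits[i + 1] for i in range(0, len(bits), 2))

    print(to_quaternary((0, 1, 1, 0)))   # (x1, x2, x3, x4) = (0, 1, 1, 0) -> (Y1, Y2) = (1, 2)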

A heuristic algorithm (Algorithm 8.1) is presented below which selects the best pairing
of the input variables. From this algorithm a partition of the variables is obtained where
each set in the partition contains a unique pair of variables. In the algorithm the concept of
residuals is used. This is defined below:

Property 8.3.1 Suppose S is a set of variables and f is a binary logic function in SOP
form. The residual of the function with respect to the set S, denoted by R_S, is the number of
unique terms left in f after deleting the variables of set S.

In the algorithm R {i, j } is calculated for each pair of variables (i, j) over all the functions.

Example 8.3 Let f (x 1, x 2, x 3, x 4 ) = x 1 x 2 x 4 + x 1 x 3 + x 2 x 3 .

Then if S = {x 1, x 2 }, RS = 2.
Now the algorithm for pairing of binary input variables is given below:

Algorithm 8.1 Pairing the Input Variables


Input: A set of functions.
Output: A partition of the variables such that if there are 2r variables then the partition will
contain r different pairs.
1: Maintain a table R, where R(i, j) is R {i, j } as defined above.
2: Select the partition of the variables for which sum of the R(i, j) 0 s , for all pairs (i, j) of
the partition, is the smallest. Ties are broken by the following rule: select the partition
that has the pair that contributes the least.

Example 8.4 For the functions f1 = x1 x3^0 x4^0 + x1^0 x2^0 x3 and f2 = x2 x4 below, the possible
partitions, along with the values they sum up to according to the algorithm, are:

{(x1, x2), (x3, x4)}, sum = 3 + 3 = 6.
{(x1, x3), (x2, x4)}, sum = 3 + 2 = 5.
{(x1, x4), (x2, x3)}, sum = 3 + 3 = 6.

Thus, according to the algorithm, the pairs obtained are (x1, x3) and (x2, x4).
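A sketch of the residual computation and the partition selection of Algorithm 8.1 (illustrative only; each function is a list of cubes, a cube being a dict from variable name to 0/1, and terms that vanish completely are not counted, matching the sums of Example 8.4):

    def residual(cubes, s):
        """R_S of Property 8.3.1: distinct terms left after deleting the variables in s."""
        left = {frozenset((v, val) for v, val in cube.items() if v not in s) for cube in cubes}
        left.discard(frozenset())
        return len(left)

    def best_partition(functions, variables):
        """Choose the pairing of variables whose summed residuals over all functions is smallest."""
        def partitions(vs):
            if not vs:
                yield []
                return
            first, rest = vs[0], vs[1:]
            for j, other in enumerate(rest):
                for tail in partitions(rest[:j] + rest[j + 1:]):
                    yield [(first, other)] + tail
        def cost(part):
            return sum(residual(f, set(p)) for f in functions for p in part)
        return min(partitions(list(variables)), key=cost)

    # Example 8.4: f1 = x1 x3' x4' + x1' x2' x3, f2 = x2 x4
    f1 = [{"x1": 1, "x3": 0, "x4": 0}, {"x1": 0, "x2": 0, "x3": 1}]
    f2 = [{"x2": 1, "x4": 1}]
    print(best_partition([f1, f2], ["x1", "x2", "x3", "x4"]))   # [('x1', 'x3'), ('x2', 'x4')]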

8.3.2 Pairing of the Functions


Suppose a set of functions F = {f1, f2, ..., f2r} is given. These functions are converted
into 4-valued input 2-valued output functions. To perform the function pairing, the
following simple algorithm is used, which uses the concept of the support of a function,
defined below:

Property 8.3.2 Support of a function: Let f be a function. The set of variables on which
the function depends on is called the support of f , denoted as support( f ).

Example 8.5 Let f(x1, x2, x3, x4) = x1 x2 + x3 x4 + x3 x4^0. Then, support(f) = {x1, x2, x3},
since f can also be represented as f = x1 x2 + x3.

Now the algorithm for pairing of binary outputs is presented below:

Example 8.6 For functions f1 = x30 x4 + x10 x3 x4 + x2 x4 + x2 x3, f2 = x10 x3 x4 + x1 x20 x30 x4, f3 =
x30 x4 + x2 x3 + x1 x2 x3 x4, f4 = x10 x2 x40 + x1 x2 x3 x4 and with the variable pairings (x1, x2 ) and
(x3, x4 ) the function pairings as obtained by the algorithm are ( f4, f3 ) and ( f2, f1 ).

Algorithm 8.2 Pairing the given functions


Input: A set of 2r binary logic functions and the variable pairs obtained in the previous
stage.
Output: r pairs of the given functions.
1: Sort the functions in ascending order according to the number of pairs that are
contained in the support( f ). It can be said that the pair (i, j) is in the support( f ) if
both i and j are elements of support( f ).
2: Pair the functions from the sorted order such that the i -th pair contains the (2i − 1)-th
and 2i -th functions.
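A sketch of Algorithm 8.2 (illustrative; the supports are supplied directly rather than derived from the SOPs, and tie-breaking among functions with equal counts is left to the sort's stability):

    def pair_functions(supports, variable_pairs):
        """supports: dict mapping function name -> set of support variables.
           Sort by the number of variable pairs lying entirely inside the support,
           then pair consecutive functions of the sorted order (Algorithm 8.2)."""
        def pairs_inside(sup):
            return sum(1 for (a, b) in variable_pairs if a in sup and b in sup)
        order = sorted(supports, key=lambda f: pairs_inside(supports[f]))
        return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]

    # Hypothetical supports: g1, g2 contain both variable pairs; g3, g4 do not
    supports = {
        "g1": {"x1", "x2", "x3", "x4"},
        "g2": {"x1", "x2", "x3", "x4"},
        "g3": {"x1", "x3", "x4"},
        "g4": {"x2", "x4"},
    }
    print(pair_functions(supports, [("x1", "x2"), ("x3", "x4")]))
    # [('g4', 'g3'), ('g1', 'g2')]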

8.3.3 Output Stage


In this subsection, the structure and the operation of the circuit that will be used to implement
the MVL functions are described, along with the corresponding truth tables. Then the
construction of the truth table for a pair of binary logic functions is shown, once the variable
pairings are obtained, and the minimized circuit is derived.

8.3.4 Basic Circuit Structure and Operation


The basic circuit structure that can be used to implement the functions is shown in Fig. 8.1.
For each function pair (which is chosen by the heuristic algorithm from the set of functions)
there will be one such module. The boxes labeled A, B, and C are pass transistor networks.

Figure 8.1: Basic Circuit Structure.

Table 8.2 shows circuit output for the different states of the pass transistor networks in
Fig. 8.1.
Table Entries: A 1 or 0 entry for a network indicates that there exists a closed path
or no path, respectively, between the two terminals of the network. A d-entry means that
the output of the circuit does not depend on the state of that network. For the output Z, the
four different logic levels it can take are shown.
Basic Circuit Operation: With A = 1 and B = 1 there exists a closed path between the
output and the ground and thus the output voltage will be 0 for which Z = 0. In this case

Table 8.2: Basic Circuit Operation for Circuit in Fig. 8.1

A B C Z
1 1 d 0
0 1 0 1
0 1 1 2
d 0 d 3

the state of the C network is in what can be called a don’t-care state, hence the d-entry. With B = 1
and A = C = 0 there is a path from VDD to GND through transistors Q1, Q2, and Q3. The
output will be the voltage-divider output of the on-resistances of the transistors. Similarly,
when A = 0 and B = C = 1, the transistors Q1 and Q2 form the voltage divider. For this
case and the previous one, the proper design of the transistors will result in the appropriate
voltages for the output logic values 1 and 2, respectively. Lastly, when B = 0, irrespective of
the states of the other networks, VDD is connected to the output and the desired voltage level
for the logic level Z = 3 is obtained. However, it is useful to add another pass
transistor network as shown in Fig. 8.2 and have the relationship between the states of the
networks and the output as shown in Table 8.3.

Figure 8.2: Basic Circuit Structure with 4 Logic Networks.

Table 8.3: Operation of the Circuit of Fig. 8.2

A B C D Z
1 d d d 0
d 1 1 d 0
0 1 0 0 1
0 1 0 1 2
0 0 d d 3

The operation of this circuit can be explained similarly.
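The voltage levels described for Fig. 8.1 can be summarized behaviorally (a sketch based on the prose above, not a transistor-level model):

    def output_level(a, b, c):
        """Output Z of the circuit of Fig. 8.1; a, b, c are 1 when the corresponding
           pass transistor network presents a closed path."""
        if b == 0:
            return 3                     # VDD connected directly to the output
        if a == 1:
            return 0                     # closed path from the output to ground
        return 2 if c == 1 else 1        # voltage-divider levels for A = 0, B = 1

    for a, b, c in [(1, 1, 0), (0, 1, 0), (0, 1, 1), (0, 0, 1)]:
        print((a, b, c), "->", output_level(a, b, c))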



Construction of Truth Table: For the two functions to be implemented, a truth table is
constructed. The functions here are assumed to be of four variables and thus the table
constructed will be a 2-dimensional table.
This is done according to the following two steps:
1. Enter the value 1 for the table entries which correspond to a truth value of 1 for the
function f2. For the other entries, enter 0.
2. For the product terms of function f 1 add 2 to the corresponding entries of the table
which are either 0 or 1.
Similarly when n-variable functions are considered, the tables will be of n/2-dimensions.
In the next example the introduced design procedure is followed step by step to obtain the
circuit for two binary logic functions.
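A sketch of this table construction, using the quaternary forms of the functions from the next example (illustrative only; f1 and f2 are supplied as Python predicates over Y1 and Y2):

    def build_table(f1, f2):
        """Entry = 2*f1 + f2, so each Z value in {0, 1, 2, 3} encodes both outputs."""
        return [[2 * f1(y1, y2) + f2(y1, y2) for y2 in range(4)] for y1 in range(4)]

    # f1 = Y1^{0} Y2^{1,3} + Y1^{2} Y2^{0,1} and f2 = Y2^{1} (Example 8.7)
    f1 = lambda y1, y2: int((y1 == 0 and y2 in (1, 3)) or (y1 == 2 and y2 in (0, 1)))
    f2 = lambda y1, y2: int(y2 == 1)

    for row in build_table(f1, f2):
        print(row)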

Example 8.7 Suppose f1 = x1^0 x3^0 x4 + x1^0 x2^0 x3 and f2 = x2^0 x4 are paired together and Z is
interpreted as follows:

Z = 0 ⇒ f1 = 0, f2 = 0;  Z = 1 ⇒ f1 = 0, f2 = 1;
Z = 2 ⇒ f1 = 1, f2 = 0;  Z = 3 ⇒ f1 = 1, f2 = 1.

The variable pairing results in Y1 = (x1, x3) and Y2 = (x2, x4) (Example 8.4). Then the
functions can be rewritten as f1 = Y1^{0} Y2^{1,3} + Y1^{2} Y2^{0,1} and f2 = Y2^{1}.
The truth table constructed for the functions of Example 8.7 is given in Table 8.4(a).
Using it, Tables 8.4(b)–8.4(e) and the logic expressions for the networks are constructed.
In the same way, other tables and their corresponding expressions can also be generated.
The circuit of Example 8.7 is shown in Fig. 8.3.

Table 8.4: (a) Truth Table for Example 8.7, and Tables when (b) A = 0, (c) B = Y2^{0,2,3} + Y1^{1,3} Y2^{1},
(d) C = Y2^{2} + Y2^{3} Y1^{1,2,3} + Y1^{0,1,3} Y2^{0}, (e) D = Y1^{2} Y2^{0} + Y1^{0} Y2^{3}

Figure 8.3: Output Circuit for Example 8.7.

An algorithm is given here to generate expressions with a minimum number of products
from the tables.

Algorithm 8.3 Generating Minimal Expression


Input: A 2-dimensional truth table corresponding to the pass transistor networks.
Output: A minimum product expression corresponding to the table.
1: Scan the table row-wise and generate the products that the rows represent. Merge the
products that have the same literals as a part of the product. Don’t care entries should
be considered as 0 entries unless they can be used to merge the products.
2: As done in the 1st step, scan the table column-wise and generate the products and
merge them (if possible). The same strategy that has been taken in Step 1 for don’t care
entries are to be taken.
3: Choose the smaller set of product from the two product sets obtained and output the
sum of product expression.

It can be noted that the expressions for the networks in Example 8.7 are obtained using
this algorithm.
In the example, the method given for generating the expressions for the networks considers
only functions of four two-valued input variables. Though tables are used to describe the
functions, it is easy to do without them. Also, the methods can be generalized for functions
with a larger number of variables.
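A simplified sketch of the row/column scan of Algorithm 8.3 (illustrative only; the table holds 0/1/'d' entries for one network, don't-care entries are treated as 0, and merging is limited to rows or columns with identical patterns):

    def products_by_rows(table):
        """One product per row: Y1^{row} Y2^{columns holding 1}; rows with identical
           column sets are merged into a single product."""
        groups = {}
        for y1, row in enumerate(table):
            cols = frozenset(y2 for y2, e in enumerate(row) if e == 1)
            if cols:
                groups.setdefault(cols, set()).add(y1)
        return [(frozenset(rows), cols) for cols, rows in groups.items()]

    def products_by_cols(table):
        transposed = [list(col) for col in zip(*table)]
        return [(cols, rows) for rows, cols in products_by_rows(transposed)]

    def minimal_products(table):
        """Step 3 of Algorithm 8.3: keep the smaller of the two product sets."""
        rows, cols = products_by_rows(table), products_by_cols(table)
        return rows if len(rows) <= len(cols) else cols

    # Hypothetical 4 x 4 table for one pass transistor network
    table = [[0, 1, 0, 1],
             [0, 1, 0, 1],
             [1, 1, 0, 0],
             [0, 0, 0, 0]]
    for y1_set, y2_set in minimal_products(table):
        print("Y1^" + str(sorted(y1_set)), "Y2^" + str(sorted(y2_set)))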

8.3.4.1 Literal Generation


After generating the expressions for the networks, the following circuits show how the
required literals of the form Y^S can be generated, where Y = (x1, x2) and S ⊆ {0, 1, 2, 3}.
A 2-to-4 line decoder may also be used, where the inputs are the two variables x1 and x2
and the outputs are Y^{0}, Y^{1}, Y^{2}, and Y^{3}. These can then be used to generate the
other literals.

8.4 SUMMARY
In this chapter, a design technique for multi-valued logic (MVL) functions is described, and
it is shown with examples how multiple-output binary logic functions can be implemented
using it. The technique described here can be easily extended to implement higher radix
circuits. In the circuits which are designed, the output is encoded to express the outputs
of two different logic functions. Thus, when this signal is going to be used, decoding is
required. However, doing so actually reduces the number of output pins, and the outputs
of the functions are available simultaneously.
While the main objective of the chapter is to give a compact design method for MVL
functions, in the implementation of binary logic functions it is found that, though this
method requires a few more transistors in some cases, there are classes of functions for
which the method, in terms of the number of transistors, can be quite efficient. The main
problem can be identified as the simplification of the expressions. As the number of variables
grows, the method introduced for the minimization of the expressions becomes very
time-consuming; thus, it can take considerable time to find the minimized expression.

REFERENCES
[1] E. J. McCluskey, “Logic design of multivalued IIL logic circuits”, IEEE Transactions
on Computing, vol. 28, pp. 546–559, 1979.
[2] E. J. McCluskey, “Logic design of MOS ternary logic”, Proceedings of IEEE ISMVL,
pp. 1–5, 1980.
[3] S. P. Onneweer and H. G. Kerkhoff, “Current-mode CMOS high-radix circuits”,
Proceedings of IEEE ISMVL, pp. 60–68, 1986.
[4] L. K. Russell, “Multilevel NMOS Circuits”, 1980.
[5] O. Ishizuka, “Synthesis of a pass transistor network applied to multi-valued logic”,
1986.
[6] E. J. McCluskey, “Logic Design Principles”, Prentice-Hall, Englewood Cliffs, NJ,
1986.
[7] H. M. H. Babu, M. R. Islam, A. A. Ali, M. M. S. Akon, M. A. Rahaman, and M. F.
Islam, “A technique for logic design of voltage-mode pass transistor based multi-valued
multiple-output logic circuits”, Proceedings of the 33rd International Symposium on
Multiple-Valued Logic (ISMVL 2003), pp. 111–116, 2003.
CHAPTER 9

Multiple-Valued Input
Binary-Valued Output
Functions

The success of the local covering approach to multiple-valued input two-valued output
(MVITVO) function minimization depends vastly on the proper choice of the base
minterms from the ON set. This chapter presents some new techniques to improve the
performance of this approach. A graph called an enhanced assignment graph (EAG) is
introduced for the efficient grouping of the Boolean variables. In order to make the best
choice of the proper base minterm, a new technique is defined to find the potential canonical
cube (PCC) covering it. This process succeeds in finding out the essential primes efficiently,
which improves the total computation time and produces a better sum of products (SOP).

9.1 INTRODUCTION
Simplification of sum-of-products expressions is of great importance in logic synthesis. Of
the total computation time for logic synthesis, a significant ratio is spent on the simplification
of SOPs, which is directly related to the simplification of PLAs. In this context, the necessity
for the use of multiple-valued logic (MVL) is gaining importance day by day.
The interconnection complexity of two-valued functions, both on chip and between chips,
is reduced effectively by the adept use of MVL. These functions can be of great use to
minimize decoded PLAs and in the realm of sequential circuits and networks. Among
the heuristic methods to find the minimum cover of MVITVO functions, MINI and
ESPRESSO-IIC are very well known. In these methods, a near-optimum cover of the function
F to be minimized is achieved through iterative improvement by reshaping and reducing
the initial cover. A slightly different approach to these heuristics is the
local covering technique, where the whole process starts from a properly chosen base
minterm. An improved version of this technique is as follows: First, a set of sub-functions
of the function to be minimized is built (expansion process). Then one or more primes are
selected from those of each sub-function (selection process). In the end, the union of all
the selected primes is taken, which forms a cover of the function F.


In this chapter, an attempt is made to hold down the computation time by emphasizing
finding the “best minterm” (the minterm which has the fewest adjacent minterms) and, at the
same time, enhancing the probability of detecting and selecting the essential primes while
expanding. This algorithm preserves the notion of the previous ones and is made aware of
the lower bound on the number of primes in the minimum cover of the given function. The
method works in two phases: first, finding an efficient grouping of the Boolean variables,
and second, quickly finding the viable cubes with suitable minterms. In short, the algorithm
is fast in computation and prudent in keeping the functions as minimized as possible.

9.2 BASIC DEFINITIONS


A multiple-valued input and binary-valued output function is a mapping F(X1,
X2, X3, ..., Xn): P1 × P2 × P3 × · · · × Pn → B, where Xi is a multiple-valued variable, B =
{0, 1, ∗}, and Pi = {0, 1, ..., pi − 1} (pi ≥ 2) is the set of values that this variable may assume.
The symbol ‘∗’ denotes the value 1 or 0. A product of n constants a1 × a2 × a3 × · · · × an with
ai ∈ Pi is a minterm.
The ON set of F is formed by all the minterms for which the function takes the value 1.
Similarly, the OFF set is formed by all the minterms for which the function takes the value
0. And the Don’t Care set is the set of minterms for which the function can indifferently take
the value 1 or 0. Let X be a variable which takes one of the values in P = {0, 1, ..., p − 1}.
For any subset S ⊆ P, X^S is a literal such that X^S = 1 if X ∈ S, and X^S = 0 if X ∉ S.
An example for MVITVO function can be expressed as:
F(X1, X2, X3 ) = X1{0} X2{1} X3{1} ∨ X1{1} X2{0,2} X3{1} ∨ X1{2} X2{0,3} X3{3,0} ∨ X1{3} X2{2} X3{1}
Here the products of literals are termed cubes. An expression for an MVITVO function is said to be minimum if it has the minimum number of cubes.
Let two products be
c1 = X1^{T1} X2^{T2} ... Xn^{Tn}
c2 = X1^{S1} X2^{S2} ... Xn^{Sn}
Then the distance between c1 and c2 is defined as follows:
distance(c1, c2) = the number of i's such that Si ≠ Ti.
Two minterms of F are adjacent if they differ in the value of only one variable. For example, the two minterms c1 = X1{0} X2{1} X3{1} and c2 = X1{0} X2{1} X3{2} are adjacent. A cube X1^{S1} × X2^{S2} × ... × Xn^{Sn} is called canonical if |Si| = 1 for i = 1, 2, ..., n − 1 and |Sn| is maximal, i.e., the cube covers all the minterms of F that are adjacent to one another and differ only in the value of the last variable. An example of such a cube is c3 = X1{0} X2{1} X3{0,1,2}. An expression formed exclusively by such cubes is unique. If C1 and C2 are two cubes, then C1 is said to imply C2 if every minterm covered by C1 is also covered by C2. A prime implicant p is an implicant that implies no other implicant of the function. An essential prime implicant denotes a prime implicant that covers at least one standard product term (not a Don't Care term) that cannot be covered by any other prime implicant. The essential prime implicants, therefore, must be included to get a minimum cover of the function. A distinguished minterm
is a minterm that is covered by only one prime implicant. The essential primes cover one
or more distinguished minterms.
Let F : P1 × P2 × P3 × ... × Pn → B, where P1 = P2 = P3 = ... = Pn = {0, 1, 2, 3}. Again, let α be a minterm of F such that α = X1{0} X2{1} X3{2}. The circular shift operation with parameters (i, j) replaces the value of the i-th variable in α with the j-th value that follows it in Pi; the first value of Pi is assumed to follow the last one. Suppose i = 2 and j = 1; then applying the operation to α gives η = X1{0} X2{2} X3{2}. If a circular shift is applied to a canonical cube, a set of minterms adjacent to it is produced.
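To make these operations concrete, the following Python sketch (illustrative only; the data representation and names are assumptions, not from the text) models a minterm as a tuple of values and implements the distance, adjacency, and circular shift operations defined above:

```python
# Illustrative sketch: a minterm is a tuple of values, a cube is a tuple of value sets.

def distance(c1, c2):
    """Number of variable positions in which the two cubes differ."""
    return sum(1 for s, t in zip(c1, c2) if set(s) != set(t))

def adjacent(m1, m2):
    """Two minterms are adjacent if they differ in exactly one variable."""
    return sum(1 for a, b in zip(m1, m2) if a != b) == 1

def circular_shift(minterm, i, j, domains):
    """Replace the value of variable i (0-based here) with the j-th value following it in P_i."""
    p = domains[i]
    pos = (p.index(minterm[i]) + j) % len(p)
    return minterm[:i] + (p[pos],) + minterm[i + 1:]

P = [(0, 1, 2, 3)] * 3                    # P1 = P2 = P3 = {0, 1, 2, 3}
alpha = (0, 1, 2)                         # the minterm X1{0} X2{1} X3{2}
print(circular_shift(alpha, 1, 1, P))     # (0, 2, 2): the minterm eta above
print(adjacent(alpha, (0, 1, 1)))         # True: they differ only in X3
```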
The minimization procedure for MVITVO functions consists of the following steps:
(a) The determination of all prime implicants of the function.
(b) Finding out the essential prime implicants.
(c) From the remaining prime implicants a minimum set is chosen so that together with
the essential prime implicants they cover the function.
Let s be a subset of the input variables; here |s| = 2 is considered. Deleting all literals of the variables in s from each term of the given function F, while leaving the other literals in those terms, the number of distinct terms in the resulting disjunctive form is denoted by Rs.

Example 9.1

f(x1, x2, x3, x4, x5) = x1'x2'x3x4 + x1x3'x4x5 + x2'x3'x4x5' + x1x2x3' + x2x3 + x1x2'x4x5'

Let s be the set (x1, x3). Then R(x1, x3) = 4, as there are 4 distinct terms left after deleting the literals of s.
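To make the computation of Rs concrete, the following Python sketch (illustrative only; the term representation is an assumption, not the book's) counts the distinct terms that remain after the literals of the variables in s are deleted:

```python
# Illustrative sketch: each product term is a set of literal strings such as
# "x1" (uncomplemented) or "x2'" (complemented); s is a set of variable names.

def r_s(terms, s):
    """Number of distinct terms left after deleting all literals of the variables in s."""
    reduced = {frozenset(lit for lit in term if lit.rstrip("'") not in s)
               for term in terms}
    return len(reduced)

# The function of Example 9.1:
F = [
    {"x1'", "x2'", "x3", "x4"},
    {"x1", "x3'", "x4", "x5"},
    {"x2'", "x3'", "x4", "x5'"},
    {"x1", "x2", "x3'"},
    {"x2", "x3"},
    {"x1", "x2'", "x4", "x5'"},
]
print(r_s(F, {"x1", "x3"}))   # prints 4, matching R(x1, x3) = 4
```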
In the algorithm for grouping the variables, Rs is used to estimate the number of products when the switching variables in s are assigned to a multiple-valued variable. An assignment graph G (Fig. 9.1) for an n-variable function f(x1, x2, x3, ..., xn) is a complete graph such that:
1. G has n nodes, one for each input variable.
2. The weight of the edge (i, j) between nodes i and j is R(xi, xj).
Let G = (V, E) be a connected graph with n vertices. A Hamilton path is a path that goes through each vertex exactly once. A minimum-cost Hamilton path is a Hamilton path for which the sum of the weights of the traversed edges is minimum. If f is the given switching function, the uncomplemented weight of a variable xi of f is defined to be the number of times xi appears in uncomplemented form in different products of the ON set of f. Similarly, the complemented weight of a variable xi of f is the number of times xi appears in complemented form in different products of the ON set of f.

Figure 9.1: An Assignment graph for the Functions of Example 9.1.

9.3 TRANSFORMATION OF TWO-VALUED VARIABLES INTO MULTIPLE-VALUED VARIABLES
Efficient grouping of the two-valued variables into multiple-valued variables effectively minimizes the number of product terms of the given function.

Example 9.2

Let us consider the switching function f(x1, x2, x3, x4) = x1'x2'x3'x4' + x1'x2'x3'x4 + x1'x2'x3x4' + x1'x2x3'x4' + x1'x2x3'x4 + x1'x2x3x4 + x1x2'x3'x4' + x1x2'x3x4 + x1x2x3'x4 + x1x2x3x4'.

1) When the input variables are assigned as X1 = (x1, x2) and X2 = (x3, x4), the minimum sum-of-products expression is F(X1, X2) = X1{0,1} X2{0,1} ∨ X1{0,3} X2{1,2} ∨ X1{1,2} X2{0,3} <1>. Three product terms are necessary in this assignment.
2) When the input variables are assigned as X1 = (x1, x3) and X2 = (x2, x4), the minimum sum-of-products expression is F(X1, X2) = X1{0,1,2} X2{0,3} ∨ X1{0,3} X2{1,2} <2>. Two product terms are necessary in this assignment.
3) When the input variables are assigned as X1 = (x1, x4) and X2 = (x2, x3), the minimum sum-of-products expression is F(X1, X2) = X1{0,3} X2{1,2} ∨ X1{1,2} X2{0,3} ∨ X1{0,1} X2{0,2} <3>. Three product terms are necessary in this assignment.
Therefore, when the input variables are assigned as in <2>, the number of product terms is minimized.
In order to find an efficient grouping, the assignment graph G is first built, and a minimum-weight Hamilton path is found, starting from the node corresponding to x1. The variables are ordered according to the order of appearance of the vertices, and the pairs of variables are assigned to multiple-valued variables. Following the above notion, different orderings may be obtained for the function of Example 9.2, which are as follows:

1. x1, x2, x3, x4 and x1, x2, x4, x3
2. x1, x3, x2, x4 and x1, x3, x4, x2
3. x1, x4, x2, x3 and x1, x4, x3, x2
So, three different orderings corresponding to the assignments shown in Example 9.2 may be obtained, which still does not guarantee finding the optimal ordering <2> of the example. In order to solve this problem, an "Enhanced Assignment Graph (EAG)" (Fig. 9.2) is introduced here, which is defined as follows:

Property 9.3.1 An Enhanced Assignment Graph (EAG) for an n-variable function f(x1, x2, x3, ..., xn) is a complete graph such that:
1. The EAG has n nodes, one for each input variable.
2. Each node is labeled with the uncomplemented weight of its variable.
3. The weight of the edge (i, j) between nodes i and j is R(xi, xj).

Figure 9.2: An EAG for the Functions of the Example 9.2.

Example 9.3 Considering the EAG (Fig. 9.2) and using Algorithm 9.1, the following orderings may be obtained:

Algorithm 9.1 Grouping of Variables

group_variables()
1: For each variable in the switching function, calculate the uncomplemented weight.
2: Calculate R(xi, xj) for each pair of variables (i, j).
3: Build the EAG.
4: Starting from the node corresponding to variable x1, repeat:
a) The next edge (i, j) to be added is such that i is a node already chosen and the weight of (i, j) is minimum among all edges (k, l) such that k has been chosen and l has not yet been chosen.
b) If the weights of all the edges from a particular node are the same, include the edge (i, j) where j has the uncomplemented weight nearest to the uncomplemented weight of i.
5: Pair the variables in the resulting order to assign the Boolean variables to multiple-valued variables.

x1, x3, x2, x4
x1, x3, x4, x2

From these, X1 = (x1, x3) and X2 = (x2, x4) can be assigned, which according to <2> gives the minimum number of product terms.
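The grouping step of Algorithm 9.1 can be sketched in Python as follows (a sketch only, assuming the term representation of the earlier examples and a greedy path that always extends from the most recently chosen node; the helper names are illustrative):

```python
# Illustrative sketch of the greedy EAG traversal of Algorithm 9.1.
# Terms are sets of literal strings such as "x1" or "x2'".

def uncomplemented_weight(terms, var):
    return sum(1 for t in terms if var in t)

def pair_weight(terms, vi, vj):
    """R(xi, xj): distinct terms left after deleting the literals of xi and xj."""
    return len({frozenset(l for l in t if l.rstrip("'") not in (vi, vj))
                for t in terms})

def group_variables(terms, variables):
    """Order the variables greedily and pair consecutive ones into MV variables."""
    order = [variables[0]]
    remaining = list(variables[1:])
    while remaining:
        i = order[-1]
        # step 4(a): minimum R-weight edge; step 4(b): break ties by the
        # closeness of the uncomplemented weights
        best = min(remaining,
                   key=lambda j: (pair_weight(terms, i, j),
                                  abs(uncomplemented_weight(terms, j)
                                      - uncomplemented_weight(terms, i))))
        order.append(best)
        remaining.remove(best)
    return [tuple(order[k:k + 2]) for k in range(0, len(order), 2)]
```

For the function of Example 9.2, the pairing reported in the text is X1 = (x1, x3) and X2 = (x2, x4), i.e., assignment <2>, which has the fewest product terms.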

9.3.1 Algorithms for Minimizing the Multiple-Valued Functions


In the local cover approach, the minimization process has two steps, namely expansion and selection. The expansion process creates a set of sub-functions of the function F to be minimized. Each sub-function consists of the minterms of F covered by all the primes that cover a properly chosen base minterm. In fact, the objective of this step is to expand the base minterm until an SOP is obtained in which no cube can be expanded any more. The cubes obtained in this way represent prime implicants. In the second step, namely the selection step, prime implicants are chosen from each sub-function so that the union of all the chosen primes forms an irredundant cover of F.
The success of the technique depends largely on the number of sub-functions: the larger the set of sub-functions, the closer the found cover is to the minimum. Moreover, detecting the essential primes as early as possible plays a vital part in decreasing the computational time. Both of these criteria can be met by choosing base minterms from the minterms with the smallest number of adjacent minterms. In this way, the essential primes can easily be detected from the sub-functions containing only one prime, as each essential prime is unique and base minterms with the smallest number of adjacents are distinguished minterms.
In this subsection, an improvement of the local cover technique is presented together with another procedure called "cube rearrangement", which helps to find the next best minterm to be chosen as a base minterm from the potential canonical cubes (PCC), so that the set of sub-functions increases and the essential primes are detected in the earliest phase of the computation. In order to realize this procedure, the given expression is preferred to be in canonical form. The expansion process in this algorithm is done by circularly shifting the cubes; in the case of canonical cubes, this generates a set of minterms adjacent to the original cube. The motivation for the new procedure lies in the fact that the minterm with the smallest number of adjacents resides in the canonical cube with the smallest number of adjacents. So, if the given canonical cubes are arranged with respect to their adjacency, searching for the minterms with the smallest number of adjacents becomes less time consuming. The algorithm uses a table of indices for all the canonical cubes and rearranges it according to their number of adjacents (Fig. 9.4).

Figure 9.3: Multiple-Valued Function Example: All Cubes are in the Form of Canonical
Cubes.

The procedure first examines the first variable X1 of all the canonical cubes and counts the number of occurrences of each distinct value (i.e., 0, 1, 2, 3) in it. Then it updates the index table of the canonical cubes according to the weight of the distinct values (a weight here is the number of occurrences of that value). The same procedure is performed for each of the remaining variables Xj (1 < j ≤ n, where n is the number of literals). In this way a point is reached when the table cannot be updated any more, and at this point the cubes have been properly rearranged such that the minterms with the smallest number of adjacents can be obtained just by consulting this table sequentially. This technique is presented in Algorithm 9.2.

Example 9.4 Let the following five canonical cubes of F(X1, X2, X3, X4) : P1 × P2 × P3 × P4 → B be given:

A1 = X 1{1} X2{0} X3{1} X4{1,2} ,


A2 = X 1{0} X2{1} X3{0} X4{1,2} ,
A3 = X 1{0} X2{0} X3{3} X4{1,3} ,
A4 = X 1{2} X2{2} X3{1} X4{1} ,
A5 = X 1{1} X2{0} X3{0} X4{1,2} .
If the base minterm is chosen as a = X1{1} X2{0} X3{1} X4{1} from the canonical cube A1, an adjacent cube A5 is obtained; but if A4 and a = X1{2} X2{2} X3{1} X4{1} are chosen, no adjacent is obtained. In order to find the minterms with the smallest number of adjacents, the canonical cubes have to be rearranged.

Algorithm 9.2 Rearrange canonical cube


1: Rearrange_canonical_cubes()
2: Table = { has the initial arrangement of the canonical cubes } ;
3: pass = 1, start = 1, end = |table|
4: Rearrange (pass, start, end)
{
5: a[ ] be an array, It is initialized to zero in every pass.
6: for start = 1 to end do
7: Count the number of occurrences of each distinct value of variable Xpass in the cube at index start, and put the counts in their corresponding positions in a[ ].
8: if for all elements k of a[ ], a[k] is not 0 then
9: Rearrange corresponding indices of table with respect to the count values according
to the least occurred values of a[ ].
10: end = start + K
11: rearrange (pass+1, start, end)
12: start = end
13: end if
14: end for
15: }

The rearranging procedure of the canonical cubes in the different passes of Algorithm 9.2 is shown in Fig. 9.4.

Figure 9.4: The Different Arrangements of the Index Table in the Different Passes.
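One plausible reading of Algorithm 9.2 is sketched below in Python (an assumption, not a literal transcription of the algorithm): the index table is reordered so that cubes built from variable values that occur least often across all cubes come first, since their minterms tend to have the fewest adjacents.

```python
# Illustrative sketch: cubes are tuples of value sets, e.g. the cube
# X1{1} X2{0} X3{1} X4{1,2} becomes ({1}, {0}, {1}, {1, 2}).
from collections import Counter

def rearrange(cubes):
    n = len(cubes[0])
    # count, per variable position, how often each value occurs over all cubes
    counts = [Counter(v for cube in cubes for v in cube[i]) for i in range(n)]
    def key(cube):
        # cubes containing rarely occurring values sort first
        return tuple(min(counts[i][v] for v in cube[i]) for i in range(n))
    return sorted(range(len(cubes)), key=lambda idx: key(cubes[idx]))

A = [({1}, {0}, {1}, {1, 2}),   # A1
     ({0}, {1}, {0}, {1, 2}),   # A2
     ({0}, {0}, {3}, {1, 3}),   # A3
     ({2}, {2}, {1}, {1}),      # A4
     ({1}, {0}, {0}, {1, 2})]   # A5
# Prints a permutation of the index table; the cube A4, whose minterm has no
# adjacents in Example 9.4, is moved to the front.
print(rearrange(A))
```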

Each sub-function consists of the canonical cubes which are adjacent to the base
minterm properly chosen from the function F.
Let F = { A1, A2, A3 }, where A1 = X1{0} X2{0} X3{1} X4{0,1,2}, A2 = X1{3} X2{1} X3{2} X4{2}, A3 = X1{0} X2{3} X3{3} X4{0,1,2}, and let the base minterm be a = X1{0} X2{0} X3{1} X4{0}. Then, after performing the expansion procedure, two sub-functions P and Q of F are obtained, where P has { A1, A3 } and Q has A2.
In this technique, a supercube S is first generated by computing, from the base minterm a, all the minterms of F that are adjacent to a, using the circular shift operation. Then, using the base minterm a and the supercube S, the canonical cubes for the sub-functions to be built are generated. The selection procedure processes the sub-functions one at a time to form an irredundant set of primes.
Hence the algorithm for the local cover technique can be described as follows:

Algorithm 9.3 Local cover


1: Local cover ()
2: {
3: F is the Multiple-Valued function to be minimized.
4: rearrange_canonical_cubes () /* Rearrange the indices of the canonical cubes of F in the index table (Fig. 9.4) (Algorithm 9.2). */
5: Expansion () /* creates sub-functions with primes covering the base minterm (Algorithm 9.4). */
6: Selection () /* selects primes from each sub-function and forms an irredundant cover of F (Algorithm 9.8). */
7: /* perform the union of all the chosen primes from each sub-function so that an irredundant set of primes is found. */
8: }

Algorithm 9.4 Expansion


1: Expansion ()
2: {
3: a is the base minterm.
4: repeat
5: if (there are minterms left which have not yet been included in a sub-function) then
6: Look up the index table to get the cube for the base minterm a, which has the smallest number of adjacent minterms.
7: generate_subfunction (a, F) /* Create the sub-function associated with the chosen base minterm (Algorithm 9.5). */
8: end if
9: until False
10: }

Algorithm 9.5 Generate Subfunction


1: generate_subfunction (a, F)
2: {
3: /* P = { P1, P2, ..., Pn } is a sub-function of F and the Pi, 0 < i ≤ n, are canonical cubes of P. S is a supercube of P. */
4: P = {A} /* A is the canonical cube of F covering a */
5: k = 1; R = ∅; I = {0};
6: S = generate_supercube (a, F) /* (Algorithm 9.6) */
7: while k <= |P| do
8: for i = ik + 1 to n − 1 do
9: for j = 1 to |Si| − 1 do
10: Pk → B /* produce B by circularly shifting the i-th variable of Pk by j, i.e., operation (i, j) */
11: (B', B'') = Check_B(B, a, R, F);
12: if B' ≠ ∅ then
13: P = P + {B'};
14: I = {i};
15: if B'' ≠ ∅ then
16: P = P + {B''};
17: end if
18: end if
19: k = k + 1;
20: end for
21: end for
22: end while
23: return (P, R);
24: }

Algorithm 9.6 Generate Supercube


1: generate_supercube (a, F)
2: {
3: for i = 1 to n − 1 do
4: Si = {ai}
5: for j = 1 to |Ui| − 1 do
6: /* Ui = ∪k Aik, the union of the i-th literals of the cubes Ak of F */
7: a → b /* produce b by circularly shifting the i-th variable of a by j, i.e., operation (i, j) */
8: if b ∈ F then
9: Si = Si + {bi}
10: end if
11: end for
12: end for
13: Sn = An /* where A is the canonical cube of F covering a */
14: }

Example 9.5 Construction of a supercube: If the base minterm is selected as a = {0}×{0}×{1}×{0} from the cube A1 = {0}×{0}×{1}×{0, 1, 2} of Fig. 9.3, then the supercube is obtained from Algorithm 9.6, and the iteration process is described in the table shown in Fig. 9.5.

Figure 9.5: Construction of Supercube.

Example 9.6 Generation of a sub-function: The algorithm generate_subfunction() builds the sub-function associated with the base minterm a = {0}×{0}×{1}×{0}, and thereby the multiple-valued ON-set cube set P and OFF-set cube set R are obtained. After running this algorithm, P and R are obtained as follows:

Algorithm 9.7 Check_B


1: Check_B (B, a, R, F)
2: { /* B is a canonical cube of F */
3: D = Supercube (a, B);
4: for each r ∈ R such that r ∩ D ≠ ∅ do
5: D = D #n r;
6: end for
7: Bn = Dn; B' = ∅; B'' = B;
8: if an ∈ Bn then
9: if there is a C ∈ F such that C ∩ B ≠ ∅ and an ∈ Cn then
10: B' = C ∩ B; B'' = B'' #n B'
11: end if
12: end if
13: return B', B'';
14: }

The routine Check_B reduces Dn so that D ∩ POFF = ∅, where D = Supercube(a, B) and POFF is the OFF set of P. Since POFF is not available, a subset R of it, built during the generation of P, is used.

Figure 9.6: Generate sub-function( ) producing P and R.

Algorithm 9.8 Selection


1: Selection (S, a, R)
2: { /* Select primes from each sub-function. */
3: C = {S};
4: for h = 1 to |R| do
5: C' = C'' = ∅;
6: for k = 1 to |C| do
7: if Ck ∩ rh == ∅ then
8: C'' = C'' + {Ck};
9: else
10: for i = 1 to n do
11: if (ai ∉ rih) then
12: C' = C' + {Ck #i rh}
13: end if
14: end for
15: end if
16: end for
17: Delete each cube of C' implying a cube of C'';
18: C = C' + C'';
19: end for
20: return C;
21: }

Example 9.7 Deriving the primes of a sub-function: The selection algorithm generates the set of cubes C that can be built by computing the set S # R and deleting every cube that implies one of the other yielded cubes or does not cover the base minterm. Initially, C contains only S. The inner loop processes the cubes of C one at a time. Let Ck be the cube under processing. If Ck does not intersect Rh, it is inserted in C''; otherwise, each cube Ck #i Rh that covers the base minterm is inserted in C'. Then each cube of C' implying a cube of C'' is deleted, and a new C is formed by the residual C' and C''. This process is repeated for such a C in the next iteration of the outer loop. After running this algorithm, the primes obtained are as follows:

C 1 = {0} × {0, 2} × {1, 3} × {0, 1, 2}


C 2 = {0, 1} × {0} × {1} × {0, 1, 2}
C 3 = {0, 1} × {0} × {1, 3} × {0, 1}
C 4 = {0} × {0, 2, 3} × {1, 3} × {0}
C 5 = {0, 1} × {0, 3} × {1, 3} × {0}
Espresso and MINI need the preliminary generation of the OFF set of the given function. Unfortunately, there exist functions for which such a set is exceedingly large. In some cases, it is possible to overcome such a drawback by using a reduced OFF set. The referred work is focused on the minimization of binary-valued functions with a single output. The local cover algorithm uses a subset of the OFF set of a sub-function both to build the sub-function itself and to extract primes from it. However, such a subset does not coincide with the reduced OFF set. Consider, for instance, the following example:
F_ON = a'b'cd + a'b'c'd',
F_OFF = ab' + a'b + ac' + cd',
F_DC = a'b'c'd + abcd.
The reduced OFF set associated with a'b'c'd' is a + b + c, whereas the set R yielded by generate_subfunction() by expanding a'b'c'd' is empty. In fact, the yielded sub-function holds only one essential prime of F, i.e., a'b'c', and a'b'c'd' is a distinguished minterm of that prime.

9.4 SUMMARY
A new Boolean variable assignment algorithm and new minimization techniques have been introduced, so that both the total computation time and the number of products decrease. The algorithmic extension has been proven to be efficient in detecting and selecting the essential prime implicants, as well as in furnishing the lower bound on the number of prime implicants in the first phase of the computation process. The new concepts of the "enhanced assignment graph", the use of a Hamilton path for finding the best pairs, and the technique of "cube rearrangement" have proven to be efficient in the step-by-step minimization process. Along with these, the heuristics used in the different phases of expansion and selection have improved the quality of the whole technique.

REFERENCES
[1] H. M. H. Babu, M. Zaber, R. Islam and M. Rahman, “On the minimization of Multiple
Valued Input Binary Valued Output Functions”, International Symposium on Multiple
Valued Logic (ISMVL 2004), 2004.
[2] G. Caruso, “A local Cover Technique for Minimization of Multiple-Valued Input
Binary-Valued Output Functions”, IEICE Trans., Fundamentals, vol. E79 A, 1996.
[3] T. Sasao, “Input variable assignment and output phase optimization of PLA’s”, IEEE
Trans., Comput., vol. C-33, pp. 879–894, 1984.
[4] R. K. Brayton, G. D. Hachtel, C. T. McMullen and A. Sangiovanni-Vincentelli, “Logic
Minimization Algorithms for VLSI Synthesis”, 1984.
[5] G. Caruso, “A local selection algorithm for switching minimization”, IEEE Trans.,
Comput., vol. c-33, pp. 91–97, 1984.

[6] T. Sasao, “Multiple-Valued Logic and Optimization of Programmable Logic Arrays”,


IEEE Trans., 1998.
[7] R. K. Brayton and Y. Watanabe, “Heuristic minimization of multiple-valued relations”,
Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans. on, vol. 12,
no. 10, 1993.
[8] A. A. Malik, R. K. Brayton, A. R. Newton and A. Sangiovanni-Vincentelli, “Reduced
offset for minimization of binary-valued functions”, IEEE Trans. Computer Aided
Design, vol. 10, pp. 413–426, 1991.
[9] Zaber, Moinul Islam, and Hafiz Md Hasan Babu. “An enhanced local covering ap-
proach for minimization of multiple-valued input binary-valued output functions.” In
Proceedings of the 10th WSEAS international conference on Computers, pp. 63–68.
2006.
CHAPTER 10

Digital Fuzzy Operations


Using Multi-Valued Fredkin
Gates

Multi-valued Fredkin gates (MVFGs) are reversible gates that can be considered a modified version of the better-known reversible Fredkin gate. Reversible logic gates are circuits that have the same number of inputs and outputs and a one-to-one and onto mapping between the vectors of inputs and outputs. Thus, the vector of input states can always be reconstructed from the vector of output states. It has been shown that power need not be dissipated in an arbitrary circuit when the circuit is built from reversible gates. Moreover, multi-valued Fredkin gates have been shown to be a suitable choice as a basic building block for binary logic and for different alternative logics, for example multi-valued logic and threshold logic.
In this chapter, the application of MVFGs is shown through the implementation of fuzzy set and logic operations. Fuzzy relations and their composition are very important in this theory, as collections of fuzzy if-then rules, fuzzy GMP (Generalized Modus Ponens), and GMT (Generalized Modus Tollens) are mathematically equivalent to them. In this chapter, digitized fuzzy sets are described where the membership values are discretized and represented using ternary variables. The composition of fuzzy relations and a systolic array structure to compute it are also described. The design with reversible gates and the highly parallel architecture of systolic arrays make the circuits quite attractive for implementation.

10.1 INTRODUCTION
Fuzzy set theory and the corresponding logic differ from traditional set theory in the way they handle the concept of uncertainty. When A is a fuzzy set and x is a relevant object, the proposition “x is a member of A” is not necessarily true or false; it may be true only to some degree. It is most common to express the degrees of membership by numbers in the closed interval [0, 1]. In this chapter, digital fuzzy sets are considered, where the membership-value space is discretized. The standard set operations and the concept of fuzzy relations are defined based on these digital fuzzy sets, along with their realizations. In this chapter, the
composition of fuzzy relations and a systolic array structure are described to compute it.
Collections of fuzzy if-then rules or fuzzy algorithms are mathematically equivalent to
fuzzy relations and the problem of inference of (evaluating them with specific values) is
mathematically equivalent to composition.
The introduced circuit is composed of Multi-Valued Fredkin Gates (MVFG) which are
reversible gates. Conservative and reversible logic gates are widely known to be compatible
with the new computing paradigms like optical and quantum computing. Reversible logic
gates are circuits that have the same number of inputs and outputs and have one-to-one and
onto mapping between vectors of inputs and outputs; thus, the vector of input states can be
always reconstructed from the vectors of output states. Irreversible functions (all gates in classical binary logic except the NOT gate are irreversible) can be converted into reversible functions easily: if the maximum number of identical output vectors is p, then ⌈log2 p⌉ garbage outputs (and some inputs, if necessary) must be added to make the input-output vector mapping unique. Reversible logic is applicable to quantum computing, nanotechnology, and low-power design. For power not to be dissipated in an arbitrary circuit, it is necessary that the circuit be built from reversible gates. Multi-valued reversible gates, however, have not received much attention until recent times. This chapter concentrates on the multiple-valued Fredkin gates.

10.2 REVERSIBLE LOGIC


The circuits shown in this chapter are composed of reversible gates. The following subsections introduce some of the reversible gates and the MVFG, which are used extensively. Traditional irreversible gates lead to power dissipation in a circuit regardless of its implementation; this result actually points out that, to keep Moore's law functioning, future technology should be based on reversible logic. A gate is reversible only when there is a one-to-one and onto relationship between the gate's inputs and outputs, which means that these gates must have an equal number of inputs and outputs. Only then can the inputs of a reversible gate be uniquely determined from the outputs. Thus, a reversible gate with n inputs must have n outputs, and this is denoted as an (n, n) or n×n logic gate. It can be pointed out that all classical gates, e.g., AND, OR, XOR, are irreversible. The NOT gate, however, can be considered a (1, 1) reversible gate. Irreversible functions can be converted into reversible functions easily: if the maximum number of identical output vectors is p, then ⌈log2 p⌉ garbage outputs (and some inputs, if necessary) must be added to make the input-output vector mapping unique. If the EXOR operation is considered, it can easily be seen that the operation is an irreversible one. In the Feynman gate, however, there exist two outputs, x' = x and y' = x ⊕ y, for the two inputs x and y. The truth table given in Table 10.1 shows that there exists the required unique input-output vector mapping.
Traditional logic design method differs significantly from the synthesis of reversible
functions. Efficient design of adders using reversible gates has also gained much attention.

10.2.1 Some Basic Reversible Gates and Classical Digital Logic Using these Gates
Among many gates, Fredkin gates together with Toffoli gates and Feynman gates are the most often discussed gates in reversible and quantum architecture, and it is suggested that

Table 10.1: Feynman Gate

X Y x’ y’
0 0 0 0
0 1 0 1
1 0 1 1
1 1 1 0

future realization efforts will concentrate mostly on these gates and their derivations. These
reversible gates along with a new gate are shown in Fig. 10.1.
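As a quick illustration, the Feynman gate mapping of Table 10.1 can be written in Python as follows (a sketch, not part of the original text):

```python
# The Feynman gate maps (x, y) to (x, x XOR y); applying it twice restores the inputs.
def feynman(x, y):
    return x, x ^ y

for x in (0, 1):
    for y in (0, 1):
        print((x, y), "->", feynman(x, y))   # reproduces Table 10.1
```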

Figure 10.1: (a) Feynman Gate, (b) Fredkin Gate, (c) Toffoli Gate, and (d) New Gate.

In the strict reversible logic paradigm, signal fan-out is forbidden. However, most of the gates provide one of the inputs unaltered at the outputs. Using constant inputs, fan-out and other functions can also be generated. Fig. 10.2 shows some such constructions.

Figure 10.2: Basic Logic Operations Using Reversible Gates.

In Fig. 10.2 (a) and (b), Fredkin gates are used to implement the fan-out and AND operations. In Fig. 10.2 (c), the AND and EXOR operations on two inputs are performed. It can be pointed out that for this gate the output z should be x0 ⊕ y0, which is equivalent to x ⊕ y.
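A behavioural sketch of these constructions is shown below in Python (an illustration only; the control polarity assumed here may differ from the convention drawn in Fig. 10.1(b)):

```python
# Fredkin gate sketch: the control c passes through; the data inputs pass
# straight when c = 1 and are swapped when c = 0, i.e. q = c·a + c'·b, r = c·b + c'·a.
def fredkin(c, a, b):
    return (c, a, b) if c == 1 else (c, b, a)

# AND: with the second data input tied to 0, the middle output is c AND a.
print(fredkin(1, 1, 0)[1], fredkin(1, 0, 0)[1])   # 1 0

# Fan-out: with constants a = 1, b = 0, the outputs are (c, c, NOT c).
print(fredkin(0, 1, 0))   # (0, 0, 1)
print(fredkin(1, 1, 0))   # (1, 1, 0)
```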

10.2.2 Multi-Valued Fredkin Gate


Since it is possible to implement any Boolean logic function using Fredkin gates, it is also possible to use MVFGs, as they are modified Fredkin gates. Fig. 10.3 shows the MVFG.

Figure 10.3: Multiple-Valued Fredkin Gate (MVFG).

One observation that can be made here is that the definition of the gate does not specify the type of the signals. Thus, they can be binary, multi-valued, etc. The only requirement is that the relation (<) can be defined on them. These gates can be used to implement alternative logics, such as threshold logic, array logic, etc., and they can be constructed using optical devices such as the photonic switches being developed in telecommunications.

10.3 FUZZY SETS AND RELATION


In this section the fuzzy set operations and the fuzzy relation along with their composition
operation are described.

Fuzzy sets: Zadeh introduced fuzzy sets by defining characteristic functions for fuzzy sets, called membership functions, µA(x) : X → [0, 1]. So, for fuzzy sets one may talk about the degree of membership of an element x, denoted by µA(x), which is a number between 0 and 1. Membership functions may thus represent an individual's (subjective) notion of a vague class – for example tall people, little improvement, big benefit, etc.
If X is a universe of discourse and x is a particular element of X, then a fuzzy set A defined on X may be written as a collection of ordered pairs A = {(x, µA(x))}, x ∈ X. Alternatively, a fuzzy set may be written as

A = Σ_{xi ∈ X} µA(xi)/xi

If the universe of discourse is discrete, µA(x) can be called a discrete-universal membership function. If the fuzzy set has a continuous universe of discourse, one may write

A = ∫X µA(x)/x

The membership function defined above can be called a continuous-universal-space membership function, and fuzzy sets with such membership functions may be called analog fuzzy sets. Numerical processing using digital components requires finite data with finite precision. For such purposes, digital fuzzy sets are defined.
Digital fuzzy set: If a discrete-universal membership function can take only a finite number, n ≥ 2, of distinct values, then this fuzzy set is called a digital fuzzy set.
Thus, for digital implementation, an analog fuzzy set's membership function is discretized along both the universal-space and membership-value dimensions. Assume that the universal space is quantized into 16 discrete values and the membership values can take n = 8 distinct values; then 16 × 3 = 48 bits are needed to represent the set. In this chapter, 9 distinct values represented by 2 ternary variables are considered.

Example 10.1 Suppose the membership function of a fuzzy set representing the concept
of a middle-aged person is given as

Now a possible discrete approximation A(x) : {0, 5, 10, 15, . . . , 80} → [0, 1] of the
membership function can be defined in Table 10.2.
Now suppose only 9 different levels of membership-values are defined using 2 ternary
variables as shown in Table 10.3, then it may represent the digital fuzzy set as D = 0.3/25
+ 0.7/30 + 1/35 + 1/40 + 1/45 + 0.7/50 + 0.3/55.

Table 10.2: Discrete Approximation

X                          A(x)
x ∉ {25, 30, ..., 55}      0.00
x ∈ {25, 55}               0.33
x ∈ {30, 50}               0.67
x ∈ {35, 40, 45}           1.00

Table 10.3: Digitized Membership Values

Encoding Value
A2 A1
0 0 0.0
0 1 0.15
0 2 0.3
1 0 0.4
1 1 0.5
1 2 0.6
2 0 0.7
2 1 0.85
2 2 1.0

From this example it can be seen that considering more elements and a larger number of membership values allows a fuzzy set to be represented more precisely.
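The digitization step can be illustrated with a small Python sketch (an illustration; the nearest-level rounding rule is an assumption): a membership value is mapped to the nearest of the nine levels of Table 10.3 and returned as its two ternary digits.

```python
# Nine membership levels of Table 10.3, indexed 0..8; index = 3*A2 + A1.
LEVELS = [0.0, 0.15, 0.3, 0.4, 0.5, 0.6, 0.7, 0.85, 1.0]

def encode(mu):
    """Quantize mu in [0, 1] to the nearest level and return the digits (A2, A1)."""
    idx = min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - mu))
    return idx // 3, idx % 3

print(encode(0.33))   # (0, 2) -> level 0.3
print(encode(0.67))   # (2, 0) -> level 0.7, as used for the set D above
```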
Fuzzy Operations: There are 3 standard fuzzy set operations namely complement,
intersection and union. The concept of fuzzy relation and the composition operation are
also discussed.
Complement: Let A be a fuzzy set on X; then the complement of A has the membership function Ā(x) = 1 − A(x). This value may be interpreted not only as the degree to which x belongs to Ā, the complement of A, but also as the degree to which x does not belong to A.
Intersection/t-Norm and Union/t-Conorm: The intersection or the union of two fuzzy sets A and B is specified in general by a binary operation on the unit interval, i.e., a function of the form f : [0, 1] × [0, 1] → [0, 1]. For each element x of the universe set, this function produces the intersection as (A ∩ B)(x) = i[A(x), B(x)] = A(x) ∧ B(x), and the union is expressed as (A ∪ B)(x) = u[A(x), B(x)] = A(x) ∨ B(x).
Different t-norm and t-conorm operators are available; however, the standard operations for intersection and union are the following. Standard intersection: i(a, b) = min(a, b); standard union: u(a, b) = max(a, b), where a, b ∈ [0, 1].
The standard operations will be used throughout the chapter. For digital fuzzy sets, the complement operation is easy to compute: each ternary digit simply needs to be complemented. In Section 10.4, the circuit constructions of the complement (negate), min, and max operations are shown.

Fuzzy Relations: Fuzzy relations are fuzzy sets defined on Cartesian products. A binary fuzzy relation R defined on a discrete Cartesian product X × Y can be written as R = Σ µR(xi, yj)/(xi, yj), where every pair (xi, yj) ∈ X × Y.
Digital fuzzy relations are used here, that is, the µR(xi, yj)'s can take only a fixed number of values. A fuzzy relation can easily be represented in matrix form: a fuzzy relation on two sets X = {x1, x2, x3, x4} and Y = {y1, y2, y3, y4} can be represented by a 4 × 4 matrix R where Ri,j = µR(xi, yj).
Composition of Fuzzy Relations: Given two fuzzy relations, R1 on X × Y and R2 on Y × Z, a new relation denoted R1 ◦ R2 on X × Z may be defined. There are several types of composition, namely max-min, max-product, and max-average. The max-min composition formula is given below:

R1 ◦ R2 = Σ_{(x,z) ∈ X×Z} [ ∨_y ( µR1(x, y) ∧ µR2(y, z) ) ] / (x, z)

It can be seen that the computation of the membership grades is very similar to matrix multiplication, with max (∨) being analogous to summation and min (∧) being analogous to multiplication.
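The analogy with matrix multiplication makes the computation easy to express; the following Python sketch (with hypothetical data, not from the text) computes the max-min composition of two relation matrices:

```python
# Max-min composition of fuzzy relations given as matrices of membership values.
def max_min_composition(R1, R2):
    """(R1 o R2)[i][k] = max over j of min(R1[i][j], R2[j][k])."""
    rows, inner, cols = len(R1), len(R2), len(R2[0])
    return [[max(min(R1[i][j], R2[j][k]) for j in range(inner))
             for k in range(cols)]
            for i in range(rows)]

R1 = [[0.3, 0.7], [1.0, 0.4]]            # hypothetical relation on X x Y
R2 = [[0.5, 0.9], [0.6, 0.2]]            # hypothetical relation on Y x Z
print(max_min_composition(R1, R2))       # [[0.6, 0.3], [0.5, 0.9]]
```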

Example 10.2 Max-Min Composition of Fuzzy Relations

Let R1 :

R2 :

Then according to the max-min composition


R1 ◦ R2 :

The max-min composition is used here because it is by far the most common type in engineering applications.
Compositions are very important for inferencing procedures used in linguistic descrip-
tion of systems and is particularly useful in fuzzy controllers and expert systems. Collections
of fuzzy if-then rules or fuzzy algorithms are mathematically equivalent to fuzzy relations
and the problem of inference of (evaluating them with specific values) is mathematically
equivalent to composition.
In Section 10.4, a systolic array structure that can be used to compute composition
of fuzzy relations is shown. The cells, composed of reversible logic gates, are actually
responsible for the max-min operations.

10.4 THE CIRCUIT


The introduced circuits are based on the MVFGs described above. In Subsection 10.4.1, the circuit that computes the min and max operations on digitized fuzzy membership values is described. The negation operation is also implemented. Then, in the next subsection, the systolic array structure built with the basic min-max cell is described to compute the composition of fuzzy relations.

10.4.1 Fuzzy Operations Using MVFG


In fuzzy set theory and fuzzy logic, the min and max operations are the most important and the most frequently used ones. In this chapter, the membership values are considered to be digitized and represented by 2 ternary variables; thus there are 9 distinct membership levels. The use of multi-valued variables with a higher radix is possible, but it makes the circuits use more components.
Suppose it is needed to calculate min(A, B) or max(A, B), where A = A2A1 and B = B2B1, A and B each being represented by 2 ternary variables; the variable with subscript 2 is the most significant digit. In Fig. 10.4, min(A2, B2) and max(A2, B2) are found first and then used to produce m2m1 = min(A, B) and M2M1 = max(A, B).
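Functionally, the circuit of Fig. 10.4 computes the following, sketched here in Python for clarity (the digit-pair encoding follows Table 10.3; the helper names are illustrative):

```python
# Membership values are pairs of ternary digits (A2, A1), most significant first.
def to_index(d):                  # (d2, d1) -> 3*d2 + d1
    return 3 * d[0] + d[1]

def from_index(v):
    return v // 3, v % 3

def fuzzy_min_max(a, b):
    """Return (min(A, B), max(A, B)) as ternary digit pairs."""
    lo, hi = sorted((to_index(a), to_index(b)))
    return from_index(lo), from_index(hi)

# A = (2, 1) encodes 0.85 and B = (1, 2) encodes 0.6 (Table 10.3)
print(fuzzy_min_max((2, 1), (1, 2)))   # ((1, 2), (2, 1)): min = 0.6, max = 0.85
```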

Figure 10.4: Implementation of Min and Max Operation Using MVFGs.

Next, in Fig. 10.5, the implementation of the complement operation is shown. This operation can actually be performed digit-wise.

Figure 10.5: Complementing a Ternary Variable Using MVFGs.

For example, if a membership value is represented by A2A1 where A2 = 2 and A1 = 1 [representing 0.85, see Table 10.3], the membership in the complement fuzzy set should be A2'A1' where A2' = 0 and A1' = 1 [representing 0.15]. Fig. 10.5 shows the complementation of a single ternary variable using MVFGs.

Figure 10.6: Basic Cell.

10.4.2 Systolic Array Structure for Composition of Fuzzy Relations


Systolic arrays are data-processing circuits formed by interconnecting a set of identical data-processing cells in a uniform manner. Data words flow synchronously from cell to cell, where a small step of the overall function is performed, until the results emerge from the boundary cells. This provides a high degree of parallelism, and the use of identical cells and uniform interconnections makes them ideal for implementation. Fig. 10.6 shows the basic cell structure with its inputs and outputs.
These cells are connected in a uniform manner to compute the fuzzy relation operation, as shown in Fig. 10.7. Each cell simply computes z = z' ∨ (x ∧ y), where z' is the value computed by the adjacent cell and x and y are the input membership values. The cells also propagate the inputs and the computed value to the adjacent cells, as shown in Fig. 10.7.
Fig. 10.7 shows the array that can be used to compute the composition of two relations represented by n×n matrices. It is important to realize it in such a way that the data are input in the correct sequence.

Figure 10.7: Systolic Array for Composition of Fuzzy Relation.

10.5 SUMMARY
This chapter introduces digitized fuzzy sets and discusses the different operations on them. The composition of fuzzy relations is described. Compositions are very important for the inferencing procedures used in linguistic descriptions of systems and are particularly useful in fuzzy controllers and expert systems. Collections of fuzzy if-then rules or fuzzy algorithms are mathematically equivalent to fuzzy relations, and the problem of inference (evaluating them with specific values), fuzzy GMP (Generalized Modus Ponens), and GMT (Generalized Modus Tollens) are mathematically equivalent to composition. A systolic array structure is shown for the computation of the composition of fuzzy relations. It provides a high degree of parallelism, and the use of identical cells and uniform interconnections makes it ideal for implementation.
This chapter continues with the new logic design paradigm – reversible logic. The
reversible logic finds its application in many fields such as quantum and optical computing,
low power design, nanotechnology, etc. The introduced design utilizes the multi-valued
reversible logic gates [namely the multi-valued Fredkin gate (MVFG)]. Not many circuit
design techniques have appeared in the literature concerning multi-valued reversible gates
or the implementation of fuzzy operations using them. Future research on this topic is
necessary to compare different multivalued reversible logic gates as the basic building
blocks. However Fredkin gates together with Toffoli gates and Feynman gates are the most
often discussed gates in reversible and quantum architecture and it is suggested that future
realization efforts will concentrate mostly on these gates and their derivations. As it is
possible to implement any Boolean logic function using Fredkin gates, it is also possible
using MVFGs (Multi-valued Fredkin gates) as they are modified Fredkin gates. This along
with the fact that multiple-valued Fredkin gates can be used to implement alternative logics
(for example threshold logic) makes MVFGs a rather attractive choice.

REFERENCES
[1] L. A. Zadeh, “Fuzzy Sets”, Information and Control, vol. 8, pp. 338–353, 1965.
[2] G. J. Klir and B. Yuan, “Fuzzy Sets and Fuzzy Logic”, Prentice Hall, 1995.
[3] G. J. Klir and T. A. Folger, “Fuzzy sets, Uncertainty, and Information”, Prentice Hall,
Englewood Cliffs, NJ, 1988.
[4] L. H. Tsoukalas and R. E. Uhrig, “Fuzzy and Neural Approaches in Engineering”,
John Wiley & Sons Inc, 1997.
[5] M. Nielson and I. Chuang, “Quantum Computation and Quantum Information”, Cam-
bridge Press, 2000.
[6] R. C. Merkle, “Two types of mechanical reversible logic”, Nanotechnology, vol. 4,
pp. 114–131, 1993.
[7] C. Bennett, “Logical reversibility of computation”, IBM Journal of Research and
Development, vol. 17, pp. 525–532, 1973.
[8] M. H. A. Khan, M. A. Perkowski and P. Kerntopf, “A Multi-Output Galois Field Sum
of Products Synthesis with New Quantum Cascades”, Proceedings of 33r d ISMVL,
pp. 146–153, 2003.
[9] A. De Vos., B. Raa and L. Storme, “Generating the group of reversible logic gates”,
Journal of Physics A: Mathematical and General, vol. 35, pp. 7063–7078, 2002.
[10] P. Kerntopf, “Maximally efficient binary and multi-valued reversible gates”, Booklet
of 10th Intl. Workshop on Post Binary and Ultra-Large-Scale Integration Systems
(ULSI), pp. 55–58, 2001.

[11] P. Picton, “Fredkin Gates as a Basis for Comparison of Different Logic Design Solutions”, IEE, 1994.
[12] P. Picton, “A Universal Architecture for Multiple-Valued Reversible Logic”, Multiple-
Valued Logic-An International Journal, vol. 5, pp. 27–37, 2000.
[13] R. Landauer, “Irreversibility and heat Generation in the Computational Process”, IBM
Journal of Research and Development, vol. 5, pp. 183–191, 1961.
[14] A. Agrawal and N. K. Jha, “Synthesis of Reversible Logic”, Proceedings of the Design,
Automation and Test in Europe Conference and Exhibition, IEEE, 2004.
[15] A. Khlopotine, M. Perkowaski and P. Kerntopf, “Reversible Synthesis by Iterative
Compositions”, Proceedings of IWLS, pp. 261–266, 2002.
[16] D. M. Miller, D. Maslov and G. W. Dueck, “A Transformation based Algorithm
for Reversible Logic Synthesis”, Proceedings of Design Automation Conference, pp.
318–323, 2003.
[17] J. W. Bruce, M. A. Thornton, “Efficient Adder Circuits based on a Conservative
Reversible Logic Gate”, IEEE Computer Society Annual Symposium on VLSI, 2002.
[18] H. M. H. Babu, M. R. Islam, “Synthesis of Full-adder Circuit using Reversible Logic”,
Proceedings of VLSID, 2004.
[19] M. H. A. Khan, “Design of Full-adder with Reversible Gates”, International Confer-
ence on Computers and Information Technology, Dhaka, pp. 512–519, 2002.
[20] E. Fredkin and T. Toffoli, “Conservative Logic”, International Journal of Theoretical
Physics, pp. 219–253, 1982.
[21] R. Feynman, “Quantum Mechanical Computers”, Optics News, vol. 11, pp.11–20,
1985.
[22] M. Perkowski, “Regular Realization of Symmetric Functions using Reversible Logic”.
[23] Babu, Hafiz Md Hasan, Amin Ahsan Ali, and Ahsan Raja Chowdhury. “Realization
of Digital Fuzzy Operations Using Multi-Valued Fredkin Gates.” In CDES 2006, pp.
101–106. 2006.
CHAPTER 11

Multiple-Valued
Multiple-Output Logic
Expressions Using LUT

An advanced minimization method for multiple-valued multiple-output functions is introduced in this chapter. The shared sub-functions are extracted with a heuristic method that pairs the functions. A new minimization approach for multiple-valued functions is also discussed, where Kleenean coefficients (KC) and a look-up table (LUT) are used to reduce the complexity as well. The minimization method reduces the number of implicants significantly. The realization of the minimized circuits using current-mode CMOS is also shown.

11.1 INTRODUCTION
There are many works about multiple-valued logic design with respect to the realization
of MV-PLA’s Gate Circuits and FPGA’s. Especially the minimization of sum-of-products
expression has received considerable attention for over 20 years. The analysis of the max-
imum number of implicants in a minimal sum-of-products expression is interesting when
a PLA is used to implement a function, the cost is directly related to the number of impli-
cants. In a PLA implementation of multiple-valued multiple-output functions, each product
term is represented by series of semiconductor devices (transistor). It is desirable also to
minimize the total number of devices in the PLA. Thus the good solution will in addition to
having minimal number of product terms also have a small number of variables appearing
in these product terms. While the smaller number of product terms reduces PLA area, the
reduced number of devices improves the speed of operation. These features introduce the
techniques, which is capable for generation of smaller number of product terms and pro-
gram will take a minimum space requirement. The earlier work on minimization was based
upon the Quine-McCluskey procedure. This method gives the exact solution but its space
complexity increases rapidly with the number of input variables. So the space violation
increases.


11.2 BASIC DEFINITIONS AND PROPERTIES


Here, some basic definitions are given below.
Let V = {0, 1, ..., p − 1} be a set of p-valued logic values, p ≥ 2, and let X = {x1, x2, ..., xn} be a set of n variables, where xi takes a value in V. A function f(X) is a mapping f : V^n → V; f(X) is said to be an n-variable p-valued function. In this chapter, the following multiple-valued logical expressions are discussed.
An arbitrary n-variable p-valued logic function f(X) can be represented by the following four operations:
1. MIN: f(x1, x2, ..., xn) = x1 x2 ... xn (= MIN(x1, x2, ..., xn)),
2. MAX: f(x1, x2, ..., xn) = x1 ∨ x2 ∨ ... ∨ xn (= MAX(x1, x2, ..., xn)),
3. NOT: f(x1) = x1' (= p − 1 − x1),
4. Literal: f(x1) = x1^S (= p − 1 if x1 ∈ S and = 0 otherwise, where S ⊆ V).
The set S has at least one element, and if S = V the literal may be omitted. For simplicity, xi^{{a, b, ...}} is denoted by xi^{a, b, ...}. A minterm is a product term of the form k x1^{a1} x2^{a2} ... xn^{an}, where k is a nonzero constant and x1^{a1} x2^{a2} ... xn^{an} is the position of the minterm.

11.2.1 Product Terms


An implicant of a function f(X) is a product term I(X) such that f(X) ≥ I(X), where ≥ means that for all assignments x of values to X, f(x) ≥ I(x).

11.2.1.1 Prime Implicants


A prime implicant of f(X) is an implicant I(X) such that there exists no other implicant I'(X) of f(X) with I'(X) ≥ I(X). The term sum-of-products is used to describe functions expressed as a sum of a set of product terms, where sum refers to the MAX function and product term refers to the MIN function. Any multiple-valued logic function f(X) can be expressed by the following sum-of-products expression (denoted as SOP):

f(X) = ∨_{(s1, s2, ..., sn)} f(s1, s2, ..., sn) x1^{s1} x2^{s2} ... xn^{sn}

11.2.2 Minimal SOPs


A sum-of-products expression for f (X) is minimal if there is no other expression for f (X)
with fewer implicants.

11.2.3 MVSOP Expressions Using KC


The n-variable p-valued Kleenean coefficients (KC) are defined recursively as follows:
1. The constants 1, 2, ..., p − 1 and the variables xi and xi' (i = 1, 2, ..., n) are KCs.
2. If G and H are KCs, then GH is a KC.
3. The only KCs are those given by rules (1) and (2).
Example 11.1: The four-valued Kleenean function (KF) 1x1x1'x3 is a four-valued KC, but the KFs 1x1 ∨ x3 and 3(x1x3)' are not KCs.

11.3 THE METHOD


The heuristic method works on multiple-valued multiple-output functions. A new technique is introduced here that first extracts the common sub-functions (implicants) and then minimizes the leftover implicants. To get the common sub-functions, the best pairs of functions, i.e., those that have the maximum number of common implicants, need to be found. Algorithm 11.1 depicts the extraction of the common sub-functions. In the first phase of the algorithm, the support set matrix is generated, and in the second phase the pair set matrix is generated.

11.3.1 Support Set Matrix


A support set matrix S is an m×n matrix, where S[i, j] = 1 if a function f i depends on a
variable x j , otherwise S[i, j] = 0. Here, m is the number of rows that represents the functions
and n is the number of columns, which represents variables.

Algorithm 11.1 Generating Support Set


1: j represents the variables sequentially;
2: i represents the functions sequentially;
3: initially S[] := 0;
4: for j := 0 to for all variables do
5: for i := 0 to for all implicants of a function do
6: if j th variable is present in any implicant of i th function then
7: S[i, j] := 1;
8: end if
9: i + +;
10: end for
11: j ++
12: end for
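As a concrete illustration of Algorithm 11.1, the following Python sketch (illustrative only; the data representation and names are assumptions, not from the text) builds the support set matrix from functions given as lists of implicants:

```python
# Each function is a list of implicants; each implicant maps a variable name to
# the value set of its literal, e.g. {"x1": {0, 2}, "x3": {1}}.
def support_set_matrix(functions, variables):
    """S[i][j] = 1 if function i contains any implicant with a literal of variable j."""
    return [[1 if any(variables[j] in imp for imp in f) else 0
             for j in range(len(variables))]
            for i, f in enumerate(functions)]

# Hypothetical 2-variable example with two functions:
funcs = [
    [{"x1": {0, 2}, "x2": {1}}],          # f0 depends on x1 and x2
    [{"x2": {0}}, {"x2": {2}}],           # f1 depends only on x2
]
print(support_set_matrix(funcs, ["x1", "x2"]))   # [[1, 1], [0, 1]]
```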

The support set matrix for these functions is given below:



Example 11.2 A 3-variable, 3-valued, 4-output set of functions is considered, which is given below:
11.3.2 Pair Support Matrix


A pair support matrix P is an (m(m − 1)/2) × n matrix, where P[i, j] = 1 if a function pair pi depends on a variable xj, and otherwise P[i, j] = 0. Here, m(m − 1)/2 is the number of rows, n is the number of columns, and pi is a function pair of the form [fi, fj], where i = 1, 2, 3, ..., m. S denotes the number of variables on which the paired functions depend.
Example 11.3 The pair support matrix for the functions of Example 11.2 is shown below.

Algorithm 11.2 Generating Pair Support Matrix


1: for i:=0 to i < total number of functions do
2: if ith function does not already occur in any pair then
3: for j:=0 to j < total number of functions do
4: select a pair (i, j) that depends on the highest number of variables.
5: if x[k] is present in any function of the pair (k represents the variables) then
6: corresponding cell : = 1;
7: S[i] := the total number of dependent variables.
8: end if
9: j++;
10: end for
11: end if
12: i++;
13: end for

To find the common sub-functions, the pairs are searched in descending order of the value of S. In this way, the common sub-functions among the functions of Example 11.2 are obtained.
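The pairing and extraction of the shared sub-functions can be sketched as follows in Python (illustrative only; implicants are compared as literal dictionaries, and the pair-selection heuristic is a simplified stand-in for Algorithm 11.2):

```python
from itertools import combinations

def depends_on(f):
    """Set of variables appearing in the implicants of function f."""
    return {v for imp in f for v in imp}

def common_implicants(f, g):
    """Implicants (as frozensets of (variable, values) pairs) shared by f and g."""
    norm = lambda fn: {frozenset((v, frozenset(vals)) for v, vals in imp.items())
                       for imp in fn}
    return norm(f) & norm(g)

def best_pair(functions):
    """Pick the pair of functions depending on the most variables (largest S)."""
    return max(combinations(range(len(functions)), 2),
               key=lambda p: len(depends_on(functions[p[0]])
                                 | depends_on(functions[p[1]])))

# Usage: for the chosen pair (i, j), common_implicants(functions[i], functions[j])
# gives the shared sub-function; the shared implicants are factored out and the
# leftover implicants are minimized separately.
```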

11.4 THE ALGORITHM FOR MINIMIZATION OF MVMOFS USING KC


In this subsection, an algorithm is presented to minimize the MVMOFs using KC.

Algorithm 11.3 Minimization of MVMOFs Using KC


1: Selection ()
2: Select primes from each sub-function.
3: group the product term according to the output value;
4: minterm( )
5: while for each group for each product term do
6: Compare the product with all other products;
7: if two product terms differs in one variable then
8: form a new product term
9: Append the minterm_list and group them;
10: if (any change) then
11: minterm( )
12: end if
13: extract the minterm with corresponding output;
14: end if
15: end while
16: while each initial group do
17: //n is the no. of variables and p is the no. of values
18: for k = 1 to k< = n do
19: insert_LUT_using_B_tree(n, p);
20: //select a minimum set of KC that represent the functions.
21: Select_KC_from_LUT_using_binary_search( );
22: append the prime implicant table;
23: Extract_minimized_Expr_from_impl_table( );
24: k ++
25: end for
26: end while
27: //recursive insertion
28: insert_LUT_using_B_tree(t, n, p)
29: if the tree is empty then
30: t→root = newnode;
31: root → left = root → right = NULL;
32: else if ( LessThan(newnode → value, root → value)) then
33: root → left = insert_LUT_using_B_tree(root → left, n, p)
34: else if GreaterThan(newnode→value, root→value) then
35: root→right = insert_LUT_using_B_tree(root→right, n, p)
36: end if

Complexity of Generating the KCs
Let n be the number of entries inserted into the LUT; building the LUT with the B+-tree-based insertion needs O(n log n) time. Again, as the LUT is in sorted order, the function Select_KC_from_LUT_using_binary_search() uses the binary search algorithm, which requires O(log n) time. So the total complexity becomes

f(n) = O(n log n) + O(log n)
= O(n log n).

Example 11.4 The 4-output functions of Example 11.2 are used; the algorithm is applied to the leftover implicants after extracting the shared sub-functions. The minimized leftover functions are denoted by {f0', f1', f2', f3'}, corresponding to {f0, f1, f2, f3} respectively.

Shared sub-functions:

Minimized left over functions:

The total number of implicants in the method = number of implicants in the shared sub-functions + number of implicants in the minimized leftover functions
= 3 + 8
= 11.

11.5 REALIZATION OF MVMOFS USING CURRENT MODE CMOS


Here, the realization of the minimized circuit is shown using current-mode CMOS logic. A multiple-valued LUT can be implemented using the current-mode technique, reducing the transistor count by half compared to that of a binary implementation. Two main applications of multiple-valued LUTs are multiple-valued FPGAs and intelligent memories. In this section, the realization of the minimized functions is described using current-mode CMOS logic. The multiple-valued LUT is a direct extension of the binary-valued LUT. Similar to the binary LUT, there is a one-to-one correspondence between the rows of the truth table and the rows of the LUT.
Block diagram of the general 2-input r-valued LUT is shown in Fig. 11.1.

Example 11.5 Consider a 3-valued 2-variable truth table. The truth table and its direct realization are shown in Fig. 11.2(a). The realization of the function minimized using Kleenean coefficients is shown in Fig. 11.2(b). In the figure, only one variable is considered for simplicity.

Figure 11.1: Block Diagram of General 2-input r -valued LUT Logic Function.

Figure 11.2: (a) 3-valued 2-variable MVL Truth Table and its Direct Realization Using
LUT. (b) Kleenean Coefficient considered.

Table 11.1 shows the assigned current values for a 3-valued logic system. The maximum
current (3 µA) is assigned to logic 2.

Table 11.1: Current Values Assigned to 3-Valued Logic

Logic Value      0         1         2
Current Value    0.0 µA    1.5 µA    3.0 µA

Using this slicing, a 3 µA current generates a voltage drop on the order of 100 mV across the drain-source of each transistor. This does not affect the circuit performance unless the number of transistors in series exceeds 10.
The implementation method of the literal (1 A1) is shown in Fig. 11.3. The input current in the figure is sourced to the circuit and compared against two source currents (0 µA and 3 µA). If the input current lies between these two limits, the output node is pulled up to Vdd; otherwise, it is pulled down to ground.

Figure 11.3: Circuit Diagram of a Current-Mode Literal 1 A1 .

The literal generator concept is shown in Fig. 11.4, and the literal generator circuit is shown in Fig. 11.5.

Figure 11.4: Literal Generator Conceptualization.



Figure 11.5: Literal Generator Circuit.

A current-mode LUT is generally faster than a voltage-mode LUT. In both cases, only one path turns on depending on the logic values. However, a change in the logic values requires less charge movement in the current-mode design, since all internal nodes have relatively low voltages (no charging and discharging is required).

11.6 SUMMARY
An improved approach to the minimization of multiple-valued multiple-output logic expressions
has been shown, based on the Quine-McCluskey method using Kleenean Coefficients (KC) and a
realization using current-mode CMOS. An efficient method for multiple-valued multiple-output
functions is also presented in this chapter. The number of implicants has been reduced
significantly by using the introduced method. Thus, the method reduces the propagation
time and also minimizes the size of the circuit.

REFERENCES
[1] E. A. Bender and J. T. Butler, “On the size of PLA’s required to realize binary and
multiple-valued logic”, IEEE Trans., Comput., vol. C-38, no. 1, pp. 82–98, 1989.
[2] G. W. Dueck and G. H. J. Van Rees, “On the maximum number of implicants needed
to cover a multiple-valued logic functions using window literals”, IEEE Proceedings
of the 20th International Symposium on Multiple-Valued Logic, pp. 144–152, 1990.
[3] Y. Hata, T. Hozumi and K. Yamato, “Minimization of multiple valued logic expres-
sions with Kleenean coefficients”, IEICE Trans. Inf. & Syst., vol. E79-D, no. 3, 1996.
[4] Y. Hata, T. Sato, K. Nakashima and K. Yamato, “A necessary and sufficient condi-
tion for multiple-valued logic functions representable by AND, OR, NOT constants,
variables and determination of their logical formulae”, IEEE Proceedings of the 19th
International Symposium on Multiple-Valued Logic, pp. 448–455, 1989.
[5] Y. Hata, K. Nakashima and K. Yamato, “Some fundamental properties of multiple-valued
Kleenean functions and determination of their logic formulas”, IEEE Trans. Comput., vol. 42,
no. 8, pp. 950–961, 1993.
[6] N. Takagi and M. Mukaidono, “Fundamental properties of multiple-valued Kleenean
functions”, Trans. IEICE, vol. J74-D-1, no. 12, pp. 797–804, 1991.
[7] Shahriar, Md Sumon, A. R. Mustafa, Chowdhury Farhan Ahmed, Abu Ahmed Ferdaus,
A. N. M. Zaheduzzaman, Shahed Anwar, and Hafiz MD Hasan Babu. “An advanced
minimization technique for multiple valued multiple output logic expressions using
LUT and realization using current mode CMOS.” In 8th Euromicro Conference on
Digital System Design (DSD’05), pp. 122–126. IEEE, 2005.
Part III
An Overview About Programmable Logic Devices


Programmable Logic is a logic element whose function is not restricted to a particular
function. It may be programmed at different points of the life cycle. At the earliest, it
is programmed by the semiconductor vendor (standard cell, gate array), by the designer
prior to assembly, or by the user in the circuit. Programmable Logic Devices (PLDs) are
Integrated Circuits (ICs) with a large number of gates and flip-flops that can be configured
with basic software to perform a specific logic function or to perform the logic for a complex
circuit. Unlike a logic gate, which has a fixed function, a PLD has an undefined function at
the time of manufacture. Before the PLD can be used in a circuit it must be programmed,
that is, reconfigured.
Programmable logic devices are in essence pre-built chips with certain architecture that
one can use as needed. Any logic can be built by writing code in HDL (Hardware Description
Language). It is the bread and butter of integrated circuit design. It is used for prototyping
bigger chips, testing, debugging, etc. Using HDL and a blank chip (programmable logic
device), one can metaphorically squirt any logic and test it, fix it and reprogram it as
needed. This is hugely convenient because the logic can be changed as opposed to ASIC
(Application Specific Integrated Circuits) where the chip can only do the functions it was
designed for. An example of an ASIC is the processor on a computer.
There are three fundamental types of standard PLDs: PROM, PAL, and PLA. A fourth
category consists of the higher-capacity devices, the Complex Programmable Logic Device
(CPLD) and the Field Programmable Gate Array (FPGA). A typical PLD may have hundreds
to millions of gates.
logic devices have revolutionized the way in which digital circuits are built. FPGAs and
CPLDs have become the standards for implementing digital systems. FPGAs and CPLDs
offer much higher circuit density, improved reliability, and fewer system components when
compared with traditional digital design using discrete small-scale or medium-scale in-
tegrated circuits, all of which make programmable logic devices very attractive to the
digital designer. However, these devices hide important details involved in understanding
digital fundamentals, and the resulting hardware is really more of a computer-generated
black box than it is a carefully crafted, fine-tuned design. Creativity in the design is less
visible when using FPGAs or CPLDs, and designers are not rewarded as satisfyingly for
“elegant” solutions to design problems. FPGAs and CPLDs implement solutions to digital
design problems quickly and economically, both qualities that are important in an industrial
setting.
The maximum number of gates in an FPGA is currently around 20,000,000 and doubling
every 18 months. Meanwhile, the price of these chips is dropping. What all of this means is
that the price of an individual NAND or NOR is rapidly approaching zero! And the designers
of embedded systems are taking note. Some system designers are buying processor cores and
incorporating them into system-on-a-chip designs; others are eliminating the processor and
software altogether, choosing an alternative hardware-only design. As this trend continues,
it becomes more difficult to separate hardware from software. After all, both hardware
and software designers are now describing logic in high-level terms, albeit in different
languages, and downloading the compiled result to a piece of silicon.
Many types of programmable logic are available. The current range of offerings in-
cludes everything from small devices capable of implementing only a handful of logic
equations to huge FPGAs that can hold an entire processor core (plus peripherals!). In
addition to this incredible difference in size, there are also many variations in architecture.
Advantages of using PLDs include less board space, faster operation, lower power requirements
(i.e., smaller power supplies), less costly assembly processes, higher reliability (fewer ICs
and circuit connections make troubleshooting easier), and availability of design software.
This part starts with Look-Up Table (LUT)-based matrix multiplication using Neural
Networks (NNs) which is given in Chapter 12. In this chapter, Artificial Neural Network
(ANN)-based matrix multiplication is introduced to create a completely new horizon in
matrix multiplication technique, due to having non-linear, non-parametric characteristics
of Neural Network. The matrix multiplication technique accomplishes through implemen-
tation of supervised neural networks, where minimum coin change problem is solved using
binary search tree as the data structure to simplify the complex matrix multiplication pro-
cess. An improved design of easily testable Programmable Logic Arrays (PLAs) has been
introduced based on input decoder augmentation using pass transistor logic along with
improved conditions for product line grouping is described in Chapter 13. In this design,
the fault coverage is increased substantially by augmenting the input decoder using pass-
transistors logic. Chapter 14 presents a Genetic Algorithm (GA) for input assignment for
decoded-PLAs. An algorithm for assigning variables to decoders has been known to pro-
duce good results, but the number of input variables of the decoders was restricted to two
and the area overhead of the decoders, which is in fact quite significant, was not considered. A
heuristic algorithm is also developed for assigning input variables to the decoders. Chapter
15 describes FPGA-based multiplier using LUT Merging Theorem. In this chapter, a LUT
merging theorem is introduced, which reduces the required number of LUTs for the im-
plementation of a set of functions by a factor of two. The LUT merging theorem performs
selection, partition and merging of the LUTs to reduce the area. In Chapter 16, a tree-
structured parallel BCD addition algorithm is introduced with the reduced time complexity.
A size-minimal and depth-minimal LUT-based BCD adder circuit construction is the main
focus of this chapter. Chapter 17 focuses on the algorithm which can be very efficient for
the purpose to minimize the delays introduced in the circuit because of placement and
routing. A new methodology is also presented for digital circuits which in turns reduces
the area and increases the performance of the circuit type algorithms for the problem of
hardware/software partitioning.
In Chapter 18, an (N×M)-digit BCD multiplication algorithm is introduced, which reduces
the complex steps of the conventional multiplication process. A compact LUT circuit
architecture with new selection, read and write operations is also presented in this
chapter. In Chapter 19, a LUT merging theorem is presented, which reduces the required
number of LUTs for the implementation of a set of functions by a factor of two. A (1 ×
1)-digit multiplication algorithm is introduced, which does not require any partial prod-
uct generation, partial product reduction and addition steps. An (m×n)-digit multiplication
algorithm is described, which performs digit-wise parallel processing and provides a sig-
nificant reduction in carry propagation delay. A binary to BCD conversion algorithm for
decimal multiplication is also presented in this chapter to make the multiplication more
efficient. Then, a matrix multiplication algorithm is described that re-utilizes the intermedi-
ate product for the repeated values to reduce the effective area. A cost-efficient LUT-based
matrix multiplier circuit is also described using the compact and faster multiplier circuit
in this chapter. In Chapter 20, a low power and area efficient LUT-based BCD adder is
introduced which is constructed basically in three steps: First, a technique is introduced
for the BCD addition to obtain the correct BCD digit. Second, a new controller circuit of
LUT is presented which is designed to select and send Read/Write voltage to memory cell
for performing Read or Write operation. Finally, a compact BCD adder is designed using
the introduced LUT. In Chapter 21, a generic CPLD board design and development is pre-
sented. The designed board is generic in nature and can be used in various system designs
as reconfigurable hardware. Finally, Chapter 22 reports the implementation of an FPGA-based
Micro-PLC (Programmable Logic Controller) that can be embedded into devices, machines
and systems using a suitable interface. The idea described in that chapter is best suited for
small-scale applications that need a limited number of instructions at a reasonable cost,
while offering good performance, high speed and a compact design approach.
CHAPTER 12

LUT-Based Matrix
Multiplication Using Neural
Networks

Matrix multiplication is a prime operation in linear algebra and scientific computations. In
this chapter, Artificial Neural Network-based matrix multiplication is introduced to create
a completely new horizon in matrix multiplication technique, due to having non-linear,
non-parametric characteristics of Neural Networks. Being a powerful data-driven,
self-adaptive tool, the Artificial Neural Network provides the resultant matrix multiplication
with a high degree of accuracy. Through supervised learning, the neural network completes
multiplication through addition operations instead of multiplications in the solution-prediction
stage, which evidently reduces the required number of Look-Up Tables (LUTs).

12.1 INTRODUCTION
Matrix multiplication is significantly used in many applications like graph theory, digi-
tal signal processing, image processing, cryptography, statistical physics and many more.
The neural network is a highly desirable implementation in today’s world due to its massive
parallelism and distributed representation, along with its computation and learning ability.
In this chapter, an artificial neural network is applied to matrix multiplication to significantly
decrease the time complexity of the algorithm; hence, large-scale matrix multiplication
is no longer difficult, thanks to the learning capacity of the neural network. The main
focuses of this chapter are as follows:

1. First, a new matrix multiplication algorithm using the efficient Artificial Neural
Network has been introduced with less time complexity.

2. Second, by solving the minimum coin change problem through a supervised learning method,
a high accuracy level for the resultant matrix is ensured.

DOI: 10.1201/9781003269182-15



12.2 BASIC DEFINITIONS


Neural networks are made up of many artificial neurons. Artificial Neural Network (ANN)
is a computing system consisting of a number of simple, highly interconnected processing
elements which process information by their dynamic state response to external inputs.
Among various kinds of learning processes of neural network, Supervised Learning is the
one which is used in this chapter.
Supervised learning process occurs with each cycle or “epoch” (i.e., each time the
network is presented with a new input pattern) through a forward activation flow of outputs
and the backwards error propagation of weight adjustments. When a neural network is
initially presented with a pattern, it makes a random “guess” as to what it might be. It then
sees how far its answer was from the actual one and makes an appropriate adjustment to its
connection weights. A Perceptron follows the “feed-forward” model, meaning inputs are
sent into the neuron, are processed, and result in an output. Activation Function is used to
transform the activation level of a neuron into an output signal.

12.3 THE METHOD


Fig. 12.1 demonstrates the flow diagram of a supervised learning system of neural network.

Figure 12.1: Flow Diagram of Supervised Neural Networking Technique.

Let N be the set of natural numbers. The set of n×n natural matrices is denoted by N^(n×n).
A and B are two n×n matrices to be multiplied.
Let A be the multiplier matrix and B be the multiplicand matrix, producing the output
matrix C. Every matrix is represented as follows:

A ∈ N^(n×n) ⇔ A = (a_ij) = [ a_11 . . . a_1n ; . . . ; a_n1 . . . a_nn ],   a_ij ∈ N

So, after multiplication, the matrix C will be as follows:

C = A × B, where c_ij = Σ_{k=1}^{n} a_ik b_kj        (12.1)
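For reference, Equation (12.1) corresponds to the conventional triple-loop multiplication sketched below in C; this O(n³) baseline is what the neural-network scheme of this chapter aims to avoid recomputing element by element.

#include <stdio.h>

#define N 3

/* Direct evaluation of Equation (12.1): c[i][j] = sum over k of a[i][k]*b[k][j]. */
static void matmul(int a[N][N], int b[N][N], int c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

int main(void) {
    int a[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    int b[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    int c[N][N];

    matmul(a, b, c);
    for (int i = 0; i < N; i++, puts(""))
        for (int j = 0; j < N; j++)
            printf("%4d", c[i][j]);
    return 0;
}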

Since supervised learning is being implemented so according to Fig. 12.1, each step
will be followed in neural network blocks as follows:
(1) Problem Cases: For high-speed parallel processing, n neural networks are considered
for the multiplication process. Each neural network takes a column of the multiplier A
and a single element of the multiplicand B as input. As every matrix is a collection of
column vectors, column partitioning can be performed on the multiplier matrix A.

A ∈ N^(n×n) ⇔ A = [A_1 A_2 . . . A_k . . . A_n],   A_k ∈ N^n;

A_k = [a_1 a_2 . . . a_n]^T

So, for every i-th row of the multiplicand B, every j-th value of that row will be the input of
a neuron together with the multiplier column vector A_k, where k = i. For simplicity, the
multiplicand matrix B can be row-partitioned as follows:

B ∈ N^(n×n) ⇔ B = [B_1 B_2 . . . B_n]^T,   B_k ∈ N^(1×n);

B_k = [b_1 b_2 . . . b_n]
Therefore, every column vector A_k will be the input of n neural networks, together with each
j-th value of the row vector B_i for k = i, as demonstrated in Fig. 12.2.
Every column vector will be pipelined to the neurons for faster execution.

Figure 12.2: Block Diagram of the Matrix Multiplication Method.

(2) Known Solutions: For supervised learning, inputs with their corresponding outputs need
to be provided to the neural network as training data. To provide the training data for
an n×n matrix, a threshold value is considered, up to which the sample input-output
combinations are provided through a multiplier. Suppose the threshold value (θ) is as
follows:
θ = f(n) = (n × x) / 100        (12.2)
The variable x is user-defined; it specifies what percentage of the column values is to be
sent as samples. Samples are generated by direct multiplication until the number of values
in the column equals the threshold (θ). Suppose, for two 20 × 20 matrices
where x is defined as 40%, the threshold becomes 8 following Equation 12.2. Hence, the first
8 values of each column of the multiplier matrix would be provided as sample input-output
combinations. Then, the other values would be computed through the neural network. The inputs
(single values of the A_i column vector) and the outputs of the sample generation
technique are stored in two arrays, namely in and out.
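A minimal sketch of this sampling step is given below, assuming a single multiplier column and a single multiplicand element; the threshold() helper follows Equation (12.2), and the 40% figure is just the worked example above.

#include <stdio.h>

/* Threshold of Equation (12.2): theta = (n * x) / 100, where x is the
 * user-chosen percentage of column values used as training samples. */
static int threshold(int n, int x_percent) {
    return (n * x_percent) / 100;
}

int main(void) {
    enum { N = 20 };
    int ak[N], in[N], out[N];           /* one multiplier column and the sample arrays */
    int b = 7;                          /* one multiplicand element fed to this neuron */

    for (int i = 0; i < N; i++)
        ak[i] = i + 1;                  /* illustrative column values */

    int theta = threshold(N, 40);       /* 40% of 20 column values -> 8 samples */
    for (int i = 0; i < theta; i++) {   /* direct multiplication for the samples only */
        in[i] = ak[i];
        out[i] = ak[i] * b;
    }
    printf("theta = %d, last sample: in=%d out=%d\n", theta, in[theta - 1], out[theta - 1]);
    return 0;
}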
(3) Training Algorithm: When a new input arrives in the array, the absolute differences
from all the previous values are calculated and inserted into a binary search tree with
complexity O(log n). That is, if there are p values in array in and a new value arrives at
position in_{p+1}, then each absolute difference of in_0 through in_p with in_{p+1} is stored.
The differences from out are calculated in parallel, to be inserted as weights into the
binary search tree with the corresponding input values. The binary search tree is considered as the
storing structure for decreasing the input set, that is the search space of the “minimum coin
change problem” which is utilized later.

Example 12.1 Consider a binary search tree with the values (20, 10, 30, 3, 11, 25,
and 35). If 13 is the sum for which a subset is to be found, then since the root node (20)
is greater than the sum, the right branch of the tree can be omitted from consideration.
Traversing the nodes of the left branch of the binary search tree that are less than the
sum (13), the input space of the “minimum coin change problem” is minimized. Thus, a
binary search tree can be used as the storing structure of the training data.
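The sketch below shows this bookkeeping in a minimal form: each binary-search-tree node stores an input value with its output as the node weight, and a traversal gathers only the nodes not exceeding the target sum, which is exactly the pruning of Example 12.1. The node layout, the dummy weights and the helper names are illustrative assumptions, not the exact data layout of the original design.

#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int key;                 /* stored input value (or input difference)   */
    int weight;              /* corresponding output value (or difference) */
    struct node *left, *right;
} node;

/* Standard BST insertion; error checking and freeing are omitted for brevity. */
static node *insert(node *root, int key, int weight) {
    if (!root) {
        node *n = malloc(sizeof *n);
        n->key = key; n->weight = weight; n->left = n->right = NULL;
        return n;
    }
    if (key < root->key) root->left = insert(root->left, key, weight);
    else                 root->right = insert(root->right, key, weight);
    return root;
}

/* Collect only keys <= target: a subtree whose root exceeds the target is
 * skipped entirely, shrinking the search space of the coin-change step. */
static int collect_candidates(const node *root, int target, int *keys, int count) {
    if (!root) return count;
    if (root->key > target)
        return collect_candidates(root->left, target, keys, count);
    count = collect_candidates(root->left, target, keys, count);
    keys[count++] = root->key;
    return collect_candidates(root->right, target, keys, count);
}

int main(void) {
    int vals[] = {20, 10, 30, 3, 11, 25, 35};           /* values of Example 12.1 */
    node *root = NULL;
    for (size_t i = 0; i < sizeof vals / sizeof vals[0]; i++)
        root = insert(root, vals[i], 2 * vals[i]);      /* dummy weights for the sketch */

    int cand[8];
    int m = collect_candidates(root, 13, cand, 0);      /* target sum of the example */
    for (int i = 0; i < m; i++) printf("%d ", cand[i]);
    printf("<- candidate inputs for the sum 13\n");     /* prints 3 10 11 */
    return 0;
}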

(4) Neural Network: When the number of input values Ak (column vectors value) is
greater than the predefined threshold value, then the value is to be calculated from prior
knowledge and so the new input (in p ) is searched in the binary search tree with complexity
O(log n). As the differences from new and previous inputs are stored each time, it provides
maximal amount of prior knowledge. If the new value is found in the tree, then the weight
of the corresponding node is provided as the final output. “Minimum coin change” problem
is solved to obtain the subset that sums up to the input value (in p ) with or without repeated
values. For each neural network, if there are m values in the binary search tree which can sum
up to the new input value (in_p), with or without repetition, then x_j^1, j = 1, 2, ..., m, are the
inputs which send signals to neuron n_1, weighted with parameters w_1j. In this specific case,
w_1j is considered to be the required number of repeats of the corresponding input x_j^1
needed to make in_p, so the product of w_1j and x_j^1 contributes to creating in_p. Considering
that m is the number of input lines converging onto neuron 1, the summation of the calculated
products, s_1 = Σ_{j=1}^{m} w_1j x_j^1, provides in_p. Thus, the output n_out^1 of the first
single perceptron n_1 is a non-linear transform of the summed input, which is defined as follows:
n_out^1 = g(s_1 − v_i)        (12.3)

where v_i is the formal threshold parameter; its value is considered to be the target sum
value. Let us define the total input as h = Σ_{j=1}^{m} w_1j x_j^1 − v_i. So, the activation function
would be one if h = 0 and it would be zero otherwise.
Suppose the first perceptron has inputs x_j^1, j = 1, 2, ..., m, and the output of the
first perceptron is n_out^1. When the activation function is 1, it signifies that the neuron is
firing, and the activation value influences the input of the second single perceptron: its inputs
are x_k^2 with weights w_2k^2 acting as synaptic efficacies, taken from the corresponding nodes
(which solve the “minimum coin change problem”) in the binary search tree. The weights w_2k^2
are the corresponding weights of the input values of the first perceptron (x_j^1) in the binary
search tree, that is, the outputs or output differences calculated from outmem. So, the function
is:

n_out^2 = g( Σ_{k=1}^{m} w_2k^2 × x_k^2 )        (12.4)

So, the final output of the network is:

n_out^2 = g^2 [h_2k^2]

So, n_out^2 is the final result. Though there is a possible case when “minimum coin
change problem” is not solvable with the current training data-set. So, the neuron will
back-propagate. Since, in matrix multiplication, the accurate result is expected, the back-
propagation method of conventional neural network is not possible to follow through
solution prediction. Instead of solution prediction, the input will be back-propagated to the
sample generator where the correct output will be generated through the sample generator.
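A compact sketch of this prediction step follows: a standard dynamic-programming solution of the minimum coin change problem (assumed here for illustration) recovers how many times each stored input must be repeated, and the final product is obtained by adding the corresponding stored outputs that many times, falling back to direct multiplication, i.e., the sample-generator path, when no combination exists.

#include <stdio.h>
#include <string.h>

#define MAXSUM 1024
#define UNREACH 0x3fffffff

/* Minimum coin change: fewest stored inputs (with repetition) summing to
 * "target".  choice[s] remembers which input completes the sum s. */
static int min_coin_change(const int *in, int m, int target, int *repeat) {
    int best[MAXSUM + 1], choice[MAXSUM + 1];
    for (int s = 1; s <= target; s++) { best[s] = UNREACH; choice[s] = -1; }
    best[0] = 0;
    for (int s = 1; s <= target; s++)
        for (int j = 0; j < m; j++)
            if (in[j] <= s && best[s - in[j]] + 1 < best[s]) {
                best[s] = best[s - in[j]] + 1;
                choice[s] = j;
            }
    if (best[target] >= UNREACH) return -1;          /* not solvable: back-propagate */
    memset(repeat, 0, m * sizeof repeat[0]);         /* repeat[j] plays the role of w_1j */
    for (int s = target; s > 0; s -= in[choice[s]])
        repeat[choice[s]]++;
    return best[target];
}

int main(void) {
    int in[]  = {3, 10, 11};                 /* stored inputs (BST candidates) */
    int out[] = {21, 70, 77};                /* their stored outputs, here x*7 */
    int b = 7, new_input = 13, repeat[3];

    if (min_coin_change(in, 3, new_input, repeat) < 0) {
        printf("fallback: %d\n", new_input * b);      /* sample-generator path */
    } else {
        int result = 0;                               /* weighted sum of outputs */
        for (int j = 0; j < 3; j++)
            result += repeat[j] * out[j];
        printf("predicted %d * %d = %d\n", new_input, b, result);  /* 91 */
    }
    return 0;
}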
A 3 × 3 matrix multiplication block diagram is shown in Fig. 12.2. The values A11, A21,
and A31 of the first column of the multiplier matrix A are pipelined into neurons 1, 4, and 7
simultaneously. The other inputs of neurons 1, 4, and 7 are B11, B12, and B13, respectively.
Similarly, the second and third columns of the multiplier are provided as inputs to neurons
(2, 5, 8) and (3, 6, 9), respectively.
Being independent, all the partial products X_m1, X_m2, and X_m3, where X = P, Q, R, are
generated at a time for a distinct value of m = 1, 2, 3 at each level. Then, X_m1, X_m2, and
X_m3 are added. The addition
operation is also parallel. Finally, the result is pipelined from the adders. The block diagram
of a neuron used in the method has been demonstrated in Fig. 12.3. Two matrices A
and B will be stored in Input Buffer. A multiplier is responsible for accomplishing the
sample generation whereas Output buffer holds the product from sample generator. Two
subtractors are used for calculating input and output differences from previous inputs and
outputs respectively. When number of values of multiplier column exceeds the defined
threshold value then it is sent to neural network block. Neural Network block consists of
adders, controller and output buffer. “Minimum coin change” problem is solved in this
block. If the new input value is already stored in the memory, the corresponding output is
provided directly as a final product. If the new input is not stored in memory then solving
the minimum coin change problem the minimum number of existing inputs that makes the
new input is calculated. Then, the corresponding output values of the existing input values
are sent to the adder. Finally, the adder provides the final result by adding the provided
outputs. Property 12.3.1 notes the time complexity of the introduced approach of
matrix multiplication.

Property 12.3.1 The time complexity of the proposed matrix multiplication for two n×n
matrices using neural networks is O(log n · (n + n^2 + n^2 + log^2 n)), where n is the
dimension of the matrix.

Figure 12.3: Block Diagram of a Neural Network.

12.4 SUMMARY
Artificial Intelligence is a set of tools that are driving forward the key technologies of the future world.
The matrix multiplication technique accomplishes through implementation of supervised
neural networks, where minimum coin change problem is solved using binary search tree
as the data structure to simplify the complex matrix multiplication process. The design
achieves logarithmic solution over the polynomial degree of solutions. The design gains
improvement in terms of required number of LUTs (Look-Up Tables) and slices. These
drastic improvements in LUT-based matrix multiplication will consequently influence the
advancement in real life applications such as Mathematical Finance, Image Processing,
Machine Learning and many more.

REFERENCES
[1] C. Maureen, “Neural networks primer, part I”, AI expert vol. 2, no. 12, pp. 46–52,
1987.
[2] P. Saha, A. Banerjee, P. Bhattacharyya and A. Dandapat, “Improved matrix multiplier
design for high-speed digital signal processing applications”, Circuits, Devices and
Systems, IET, vol. 8, no. 1, pp. 27–37, 2014.
[3] W. Yongwen, J. Gao, B. Sui, C. Zhang and W. Xu, “An Analytical Model for Matrix
Multiplication on Many Threaded Vector Processors”, In Computer Engineering and
Technology, pp. 12–19, 2015.
[4] R. Soydan and S. Kasap, “Novel Reconfigurable Hardware Architecture for Polyno-
mial Matrix Multiplications”, Very Large Scale Integration (VLSI) Systems, IEEE
Trans., vol. 23, no. 3, pp. 454–465, 2015.
[5] V. V. Williams, “Multiplying matrices faster than Coppersmith-Winograd”, In Pro-
ceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pp.
887–898, 2012.
[6] J. Xiaoxiao and J. Tao, “Implementation of effective matrix multiplication on FPGA”,
In Broadband Network and Multimedia Technology, 4th IEEE International Confer-
ence on, pp. 656–658, 2011.
[7] C. Don and S. Winograd, “Matrix multiplication via arithmetic progressions”, In
Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing,
pp. 1–6, 1987.
[8] W. Guiming, Y. Dou and M. Wang, “High performance and memory efficient imple-
mentation of matrix multiplication on FPGAs”, In Field-Programmable Technology
(FPT), International Conference on, pp. 134–137, 2010.
[9] J. Ju-Wook, S. B. Choi and V. K. Prasanna, “Energy-and time-efficient matrix multi-
plication on FPGAs”, Very Large Scale Integration (VLSI) Systems, IEEE Trans. on,
vol. 13, no. 11, pp. 1305–1319, 2005.
[10] S. Hong, K. S. Park and J. H. Mun, “Design and implementation of a high-speed
matrix multiplier based on word-width decomposition”, Very Large Scale Integration
(VLSI) Systems, IEEE Trans. on, vol. 14, no. 4, pp. 380–392, 2006.
[11] M. U. Haque, Z. T. Sworna and H. M. H. Babu, “An Improved Design of a Reversible
Fault Tolerant LUT-Based FPGA”, 29th International Conference on VLSI Design,
pp. 445–450, 2016.
[12] M. Shamsujjoha, H. M. H. Babu and L. Jamal, “Design of a compact reversible
fault tolerant field programmable gate array: A novel approach in reversible logic
synthesis”, Microelectronics Journal, vol. 44, no. 6, pp. 519–537, 2013.
[13] Sworna, Zarrin Tasnim, Mubin Ul Haque, and Hafiz Md Hasan Babu. “A LUT-based
matrix multiplication using neural networks.” In 2016 IEEE International Symposium
on Circuits and Systems (ISCAS), pp. 1982–1985. IEEE, 2016.
CHAPTER 13

Easily Testable PLAs Using


Pass Transistor Logic

In this chapter, an improved design of easily testable PLAs has been introduced based on
input decoder augmentation using pass transistor logic along with improved conditions for
product line grouping. The technique primarily increases the fault coverage of the easily
testable PLA due to the augmentation of the product terms (PTs) and reduces the testing time
due to the grouping of the product lines. A simultaneous testing technique has been applied
within each group, which reduces the testing time. This approach ensures the detection of
certain bridging faults. A modified testing technique has also been presented in this chapter.
It is shown that the new grouping technique improves the design in every respect.

13.1 INTRODUCTION
The Programmable Logic Array (PLA) is an important building block in VLSI circuits.
PLAs have the main advantage of a regular structure. Various efficient techniques have been
introduced on designing and testing of easily testable PLAs over the past 20–25 years. The
main objective of these techniques was to reduce extra hardware, increase fault coverage
and reduce testing time. If the product lines are grouped under the criteria and rearranged
within the groups, then the above objective is fulfilled.
The augmentation of the product line selector and the input decoder circuit is described
with the help of some extra pass transistors, so that any signal value can be applied on both
the true and complemented bit lines corresponding to an input in a single test vector, which
helps to increase the fault coverage on a large scale. In the next section, a new condition is
described for grouping the product lines that results in a further reduction in the number
of groups and hence in the number of test vectors.

13.2 PRODUCT LINE GROUPING


For grouping the product lines, a critical analysis is first made to find out the criteria which
are to be satisfied by each of the product lines in a group. Basic Criteria: To test the PLA,
two sets of test vectors T1 and T2 are generated. For the test set T1, the criterion for
product line grouping is as follows:

DOI: 10.1201/9781003269182-16


Criterion 1: Two product lines should be grouped in such a way that when one product
line is activated (logic 1), all other product lines must be deactivated (logic 0).
For the test set T2, the criterion for product line grouping is as follows:
Criterion 2: Two product lines should be grouped in such a way that when a single used
literal on a product line of the AND array is changed by applying any of the test vectors, the
outputs of the PLA due to a product line in the group should not contradict the outputs
due to the other product lines in that group.
If two or more product lines differ by one literal and one of the two conditions is
not satisfied, then they cannot be tested simultaneously because of masking effects in the
AND array. This results in an increased number of test vectors to test those product terms.
However, it is shown that if the following two conditions are satisfied, then two or more
product lines that differ in one literal can be placed in the same group irrespective of the
CP device configuration in the OR array. The conditions are:
1. The PLAs must be augmented using extra pass transistors, and
2. The bridging fault between the bit lines associated with the literal that differs by one
can be tested by some other product lines (possibly in some other group).
The test generation technique is modified mainly for test set T 2 for designing the PLA
based on the technique stated above. The effect of adding an extra product line for testing
bridging faults between the true and complemented bit lines has also been analyzed, and it is
found that after adding the extra product line the fault coverage is higher.

13.3 THE DESIGN


Though the probability of detection of different bridging faults is high, some faults
remain undetected; for example, bridging faults between the bit lines, between the input lines,
and between the last input line and the first output line are not ensured to be detected.
An easily testable PLA is introduced having a product line selector and a modified input
decoder with some pass-transistor logic. In this technique, any signal value can be placed
on either the true or the complemented bit line arbitrarily within two steps of the same vector.
In this process, it has been seen that all the bridging faults between the bit lines, between the
last input line and the first output line, and between the product lines and the bit lines can
be tested.
Fig. 13.1 shows a PLA having 4 input lines and 8 product lines. An input X j is decoded
into two bit lines, Xj1 and Xj0, corresponding to the complemented and true bit lines,
respectively. This PLA is made easily testable by adding some extra pass-transistors.
Fig. 13.1 shows that an extra pass-transistor is connected with each of the X j1 bit lines
(i.e., corresponding to the complemented bit line of X j ) in the PLA input decoder circuit.
The device capacitance of the pass-transistor enables one to store temporarily a value 1
or 0 for X j1 , i.e., the pass-transistor acts as a dynamic storage element, a common usage in
nMOS technology. A single mode control input C as shown in Fig. 13.2, controls the gate
inputs of these transistors. C is set to 1 in the normal mode of operation and it takes both
0 and 1 in the test mode. With the help of this extra circuitry, any arbitrary test pattern can
be applied to the bit lines.
For example, both the bit lines X j1 and X j0 can be assigned 00 or 11, which is not
possible in normal PLA without augmenting the input decoder circuit.

Figure 13.1: The Example PLA whose Input Decoder is Augmented using Extra Pass
Transistor.

Figure 13.2: (a) Response of the Bit Decoder (b) Input of the Decoder to Place 0s on Both
the Bit Lines.

Example 13.1 Let us assume that it is required to place an arbitrary value “a” on Xj1 and
“b” on Xj0. To do this, a′ is applied to Xj and 1 to C, followed by b applied to Xj and 0 to C.
When C = 0, the pass-transistors are cut off. Hence the previous value (i.e., a′) will be maintained
at the input of the inverter of the Xj1 line, due to the input capacitance of the inverter. As a
result, value “a” remains on the Xj1 bit line when “b” appears on Xj0. Let us consider that 0s
are to be placed on both the bit lines X j1 and X j0 . This is illustrated in Fig. 13.2. In the first
step, 1 is placed on X j when C = 1 in this case, X j1 becomes 0 and X j0 becomes 1. Then
in the second step, 0 is placed on X j and C is made 0. As the corresponding pass-transistor
becomes cut-off at C = 0, the value of 0 remains on X j1 and X j0 becomes 0 also, as X j is 0
in the second step. Thus, 0s are placed on both the bit lines. Similarly, any arbitrary values
can be placed on any of the bit lines.
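The two-step procedure of Example 13.1 can be summarised by the small behavioural model below: when C = 1 the inverter input tracks the applied value, and when C = 0 the pass transistor is cut off and the previously stored charge is retained. The structure and function names are illustrative; this is a functional view of the augmented decoder, not a transistor-level one.

#include <stdio.h>

/* Functional model of one augmented input decoder column. */
typedef struct {
    int stored;   /* charge held at the inverter input of the Xj1 line */
    int xj1;      /* complemented bit line */
    int xj0;      /* true bit line */
} bit_decoder;

/* One evaluation step with input xj and mode control C. */
static void step(bit_decoder *d, int xj, int c) {
    if (c) d->stored = xj;       /* pass transistor on: new value is stored */
    /* c == 0: pass transistor cut off, previous charge is retained         */
    d->xj1 = !d->stored;         /* complemented line follows the inverter  */
    d->xj0 = xj;                 /* true line always follows the input      */
}

int main(void) {
    bit_decoder d = {0, 0, 0};

    /* Place "a" on Xj1 and "b" on Xj0, as in Example 13.1 (here a = 1, b = 1). */
    int a = 1, b = 1;
    step(&d, !a, 1);             /* step 1: apply a' with C = 1 */
    step(&d, b, 0);              /* step 2: apply b  with C = 0 */
    printf("Xj1 = %d, Xj0 = %d\n", d.xj1, d.xj0);   /* 1 1: both lines high */
    return 0;
}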

13.4 THE TECHNIQUE FOR PRODUCT LINE GROUPING


Here an improved condition for grouping the product lines is introduced when the input
decoder of the PLAs is augmented using pass-transistor logic.
The Condition: Product lines realizing product terms belong to a group, regardless of
the CP (crosspoint) device configuration of the OR array, if the product terms differ in one
literal position and the bridging fault between the bit lines associated with the literal that
differs can be tested by some other product lines (possibly in some other group).

Example 13.2 Consider Fig. 13.3. It may be noticed that P3 and P4 are members of a
group due to a difference in one literal position (the X1 lines).

Figure 13.3: An Example PLA Showing Product Lines Satisfying Introduced Condition.

Again, the bridging fault between the bit lines X11 and X10 is detected by group S1 using
input line X1. So, these two product lines satisfy the condition.
If two or more product lines satisfy the basic criteria along with the condition, then
these product lines are members of a group.
Two different algorithms are implemented for generating test vectors for test set T 1 and
T 2.
In the introduced design, instead of considering a single product line at a time, a
partitioned group of product lines is considered. The test vectors are generated in such
a way that the detection conditions for the bridging faults between the bit lines and the
product lines and the last input line and the first output line are fulfilled. It can easily be
seen that if the test vectors in T 1 set corresponding to a product line Pi , are applied in such
a way that each of the bit lines which does not have any CP device with Pi , becomes 1 at
least once, then the presence of all the extra CP devices on Pi , is tested. However, with the

Algorithm 13.1 Product line grouping algorithm

1: int GroupNum := 1
2: int TotalGroup := 1
3: Grouping( )
4: {
5:   Initially all the product lines are non-members;
6:   while starting from the PM of the PLA where all the product terms are non-members of any group do
7:     loop
8:       Get a product line Pi from the PM by scanning from left to right.
9:       if the selected product line Pi has not been grouped based on the Basic Criteria and the Introduced Condition then
10:        Pi[Group] := GroupNum;
11:        Select a product line Pj such that (i < j <= m), where m is the rightmost product line.
12:        if the product line Pj has not been grouped then
13:          if Pj satisfies any one of the Basic Criteria or the Introduced Condition with Pi then
14:            Backtrack from Pj to the product line Pi+1;
15:          end if
16:          if Pj satisfies any one of the Criteria or the Condition with all the product lines Pk that are in the same group as Pi then
17:            Pj[Group] := GroupNum;
18:          end if
19:        else
20:          skip Pj for the later scan;
21:        end if
22:      end if
23:      Repeat for all the Pj.
24:      GroupNum++;
25:    end loop
26:    Repeat until all the product lines have been processed.
27:    TotalGroup := GroupNum + 1;
28:  end while
29: }

design for partitioning the product lines into groups and augmentation of the input decoders
using extra pass transistors, the number of groups is reduced, which in turn reduces the
number of test vectors.

13.5 SUMMARY
An improved technique for designing easily testable PLAs has been presented. In this
technique, the fault coverage is increased substantially by augmenting the input decoder
using pass-transistors logic. In addition to the product line partitioning conditions presented,
an additional condition has been presented in this chapter that reduces the extra hardware
and testing time even further. The condition presented in this chapter allows two or more
product lines that differ in one literal to be placed in the same group irrespective of the CP
(crosspoint) device configuration in the OR array. However, it must be ensured that some
other product lines can detect the bridging between the bit lines that contribute to the
one-literal difference. As a result of this design, the number of groups is reduced, which in
turn reduces the extra hardware overhead and the testing time.

REFERENCES
[1] T. Sasao, “Easily testable realizations for generalized Reed-Muller Expressions”, IEEE
Trans. Comput., vol. 46, no. 6, pp. 709–716, 1997.
[2] M. A. Mottalib and A. M. Jabir, “A Simultaneously testable PLA with high fault
coverage and reduced test set”, The Journal of IETE, vol. 43, no. 1, 1997.
[3] H. Fujiwara, “A design of programmable logic arrays with random pattern testability”,
IEEE Trans. CAD, vol. 7, pp. 5–10, 1988.
[4] M. A. Mottalib and P. Dasgupta, “Design and testing of easily testable PLA”, IEEE
Proc., pp. 357–360, 1991.
[5] M. A. Mottalib and P. Dasgupta, “A function dependent concurrent testing technique
for PLAs”, IETE, vol. 36, no. 3 & 4, pp. 299–304, 1990.
[6] S. M. Reddy and D. S. Ha, “A new approach to the design for testable PLAs”, IEEE
Trans., vol. C-36, pp. 201–211, 1987.
[7] S. Bozorgui-Nesbat and E. J. McCluskey, “Lower overhead design for testability of
programmable logic array”, IEEE Trans., vol. C-35, pp. 379–384, 1986.
[8] Islam, Md Rafiqul, Hafiz Md Hasan Babu, Mohammad Abdur Rahim Mustafa, and
Md Sumon Shahriar. “A heuristic approach for design of easily testable PLAs using
pass transistor logic.” In 2003 Test Symposium, pp. 90–90. IEEE Computer Society,
2003.
CHAPTER 14

Genetic Algorithm for Input


Assignment for
Decoded-PLAs

Generally, a decoded-PLA i.e., a PLA (Programmable Logic Array) which has decoders
in front of an AND array, requires a smaller area than a standard PLA for realizing a
function. However, it is usually very difficult to assign input variables to decoders such that
the area of the decoded-PLA is minimal. An algorithm for assigning variables to decoders
has been known to produce good results, but the number of input variables of the decoders
was restricted to two and the area overhead of the decoders, which is in fact quite significant,
was not considered. A heuristic algorithm is also developed for assigning input variables
to the decoders. In this algorithm, the number of inputs to each decoder is not restricted
to two and the area overhead incurred by using multi-input decoders is considered in the
cost function. The algorithm shows that the areas of PLAs are smaller in many cases when
using multi-input decoded-PLAs than when using decoded-PLAs with two-input decoders or
standard PLAs.
Decoded-PLAs, i.e., PLAs with input decoders can usually realize a function in a
smaller area than a standard PLA. The way of assigning the input variables to the decoders,
which in general may have any number of inputs, influences the size of a decoded-PLA
significantly. It should also be noticed that some functions cannot benefit from using multi-
input decoders no matter how the variables are assigned. In this chapter, an algorithm is
discussed to assign variables for multi-input decoded PLAs based on the Hamiltonian path
and dynamic programming.

14.1 INTRODUCTION TO DECODERS


A decoder is a combinational circuit that converts binary information from the n coded
inputs to a maximum of 2^n unique outputs. If the n-bit coded information has unused bit
combinations, the decoder may have fewer than 2^n outputs.

DOI: 10.1201/9781003269182-17



The decoders presented in this section are called n-to-m decoders, where m <= 2^n.
Their purpose is to generate the 2^n (or fewer) binary combinations of the n input variables.
A decoder with n inputs and 2^n outputs is also referred to as an n-to-2^n decoder.

Example 14.1 Fig. 14.1 shows a 2-to-4 decoder. Here the two-bit input is called S1S0 and
the four outputs are Q0 − Q3. This circuit “decodes" a binary number into a “one-of-four"
code. If the input is equivalent to the decimal number i , output Qi alone will be true. It is
ensured by the following expressions for the outputs Q0–Q3:

Q0 = S1′·S0′
Q1 = S1′·S0
Q2 = S1·S0′
Q3 = S1·S0
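A bit-level sketch of these four expressions is given below; the decode2to4 helper and the one-hot return encoding are illustrative choices, not part of any standard library.

#include <stdio.h>

/* 2-to-4 decoder: returns a 4-bit one-hot value with bit i set when the
 * input (S1 S0) encodes the decimal number i. */
static unsigned decode2to4(unsigned s1, unsigned s0) {
    unsigned q0 = !s1 & !s0;     /* Q0 = S1'.S0' */
    unsigned q1 = !s1 &  s0;     /* Q1 = S1'.S0  */
    unsigned q2 =  s1 & !s0;     /* Q2 = S1.S0'  */
    unsigned q3 =  s1 &  s0;     /* Q3 = S1.S0   */
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0;
}

int main(void) {
    for (unsigned i = 0; i < 4; i++) {
        unsigned s1 = (i >> 1) & 1, s0 = i & 1;
        unsigned q = decode2to4(s1, s0);
        printf("S1S0 = %u%u -> Q3..Q0 = %u%u%u%u\n",
               s1, s0, (q >> 3) & 1, (q >> 2) & 1, (q >> 1) & 1, q & 1);
    }
    return 0;
}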

Figure 14.1: A 2-to-4 Decoder.



Table 14.1: A Truth Table Representing a Three-Input Two-Output Function

X0 X1 X2 F0 F1
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1

14.1.1 Decoders as Product Generators


Two characteristics of decoders can be considered as follows:
1. For each input combination, exactly one output is true.
2. Each output equation contains all of the input variables.
This means that one can easily use a decoder as a product generator to implement any
sum of products expression.

Example 14.2 Here a truth table representing a three-input two-output function is shown
in Table 14.1.

The sum-of-products expressions of these functions are:

f0(x0, x1, x2) = x0′·x1·x2 + x0·x1′·x2 + x0·x1·x2′ + x0·x1·x2 = Σ m(3, 5, 6, 7)
f1(x0, x1, x2) = x0′·x1′·x2 + x0′·x1·x2′ + x0·x1′·x2′ + x0·x1·x2 = Σ m(1, 2, 4, 7)
Here, a 3-to-8 decoder implements f 0 as a sum of products shown in Fig. 14.2.

Figure 14.2: A 3-to-8 Decoder Implementing f 0 .

Fig. 14.3 shows how a 3-to-8 decoder implements f 1 as a sum of products.



Figure 14.3: A 3-to-8 Decoder Implementing f 1 .

Since the two functions f0 and f1 both have the same inputs, just one decoder can be
used instead of two, as shown in Fig. 14.4.

Figure 14.4: A 3-to-8 Decoder Implementing f 0 and f 1 .

Decoder output Q0 is unused, while Q7 is used multiple times. In general, circuit outputs
can be used as many or as few times as needed.
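The sharing of Fig. 14.4 can be mimicked in software as sketched below: a single 3-to-8 decoder produces every minterm once, and each output function simply ORs the minterms it needs (3, 5, 6, 7 for f0 and 1, 2, 4, 7 for f1). The decode3to8 helper and the bit-mask encoding are illustrative assumptions.

#include <stdio.h>

/* 3-to-8 decoder: one-hot minterm vector for inputs (x0 x1 x2). */
static unsigned decode3to8(unsigned x0, unsigned x1, unsigned x2) {
    return 1u << ((x0 << 2) | (x1 << 1) | x2);
}

int main(void) {
    const unsigned f0_minterms = (1u << 3) | (1u << 5) | (1u << 6) | (1u << 7);
    const unsigned f1_minterms = (1u << 1) | (1u << 2) | (1u << 4) | (1u << 7);

    for (unsigned i = 0; i < 8; i++) {
        unsigned x0 = (i >> 2) & 1, x1 = (i >> 1) & 1, x2 = i & 1;
        unsigned m = decode3to8(x0, x1, x2);        /* shared decoder output   */
        unsigned f0 = (m & f0_minterms) != 0;       /* OR of selected minterms */
        unsigned f1 = (m & f1_minterms) != 0;
        printf("%u%u%u -> F0=%u F1=%u\n", x0, x1, x2, f0, f1);
    }
    return 0;
}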

14.2 DECODED PLA


Although the structural regularity of PLAs offers design simplicity, PLAs generally require
a large chip area compared to random logic implementation. To overcome this drawback,
some variant structures of PLA which implement Boolean functions efficiently have been
introduced, such as a PLA with input decoders.
The PLA with input decoders shown in Fig. 14.5 has been introduced to implement
Boolean functions efficiently. By using decoders, the number of product terms can be
reduced, thus resulting in a smaller circuit area.
Let us follow an example to see how this reduction is done.

Table 14.2: A Truth Table Representing a Four-Input Three-Output Function

X0 X1 X2 X3 F0 F1 F2
0 0 0 0 0 0 0
0 0 0 1 0 0 1
0 0 1 0 0 1 0
0 0 1 1 0 1 1
0 1 0 0 0 0 1
0 1 0 1 0 1 0
0 1 1 0 0 1 1
0 1 1 1 1 0 0
1 0 0 0 0 1 0
1 0 0 1 0 1 1
1 0 1 0 1 0 0
1 0 1 1 1 0 1
1 1 0 0 0 1 1
1 1 0 1 1 0 0
1 1 1 0 1 0 1
1 1 1 1 1 1 0

Figure 14.5: The Structure of Decoded PLA.

Example 14.3 Consider the following truth table representing a 4-input 3-output function.

Here Fig. 14.6 shows the standard PLA representation of the function.

Figure 14.6: The Standard PLA Representation of the Function of Table 14.2.

Notice that there are 11 vertical lines in this implementation. In Fig. 14.7, the decoder
PLA representation is shown. In that representation, there are only 9 vertical lines.

14.2.1 Advantages
A programmable logic array or PLA presents a reduced design of a logic circuit. Moreover,
the decoder part of the PLA is programmable too. Instead of generating all possible products,
one can choose which products to generate. This can significantly reduce the fan-in (number
of inputs) of gates, as well as the total number of gates.

Figure 14.7: The Decoded-PLA Representation of the Function Shown in Table 14.2.

14.3 BASIC DEFINITIONS


In this section, some definitions are given and described with appropriate examples.

Property 14.3.1 Let G = (V, E ) be a connected graph with n vertices. A Hamiltonian Path
is a path that goes through each vertex exactly once. A minimum weight Hamiltonian Path
is the path created by traversing through the edges with the minimum weight and hence,
the sum of weights of the edges in this path is the minimum.

Property 14.3.2 Let X = {x_1, x_2, ..., x_n} be the set of given input variables of a PLA.
{X_1, X_2, ..., X_r} is called a partition of X, where ∪_{i=1}^{r} X_i = X, X_i ∩ X_j = ∅ if
i ≠ j, and X_i ≠ ∅ for every i.

Henceforth, each X_i is assigned to one decoder. A decoder that has k inputs has 2^k signal
lines out of the decoder, and each of the signal lines represents one of the 2^k maxterms.

Property 14.3.3 Let S be a subset of the input variables. Then delete all literals of the
variables in S from each term of a disjunctive form for a given function f , but leave other
literals in that term. The number of distinct terms in the resulting disjunctive form is denoted
by RS .

Property 14.3.4 An assignment graph G (Fig. 14.8) for an n-variable function
f(x_1, x_2, ..., x_n) is a complete graph such that

Figure 14.8: A Complete Graph.

1. G has n nodes, one for each input variable.


2. The weight of the edge (i, j) between nodes i and j is R_{x_i, x_j}.
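A small sketch of how such edge weights might be computed is given below: each product term is held as a cube over {0, 1, -} (with '-' marking an absent literal), the literals of the chosen pair of variables are erased, and the remaining distinct cubes are counted, which is the quantity R_S of Property 14.3.3 for S = {x_i, x_j}. The cube encoding, the example terms and the function names are assumptions made for illustration.

#include <stdio.h>
#include <string.h>

#define NVARS 4
#define NTERMS 5

/* Count distinct cubes after deleting the literals of variables i and j,
 * i.e., the quantity R_{xi,xj} used as the edge weight of the assignment graph. */
static int edge_weight(char terms[NTERMS][NVARS + 1], int i, int j) {
    char reduced[NTERMS][NVARS + 1];
    int distinct = 0;
    for (int t = 0; t < NTERMS; t++) {
        char cube[NVARS + 1];
        strcpy(cube, terms[t]);
        cube[i] = '-';                     /* delete the literal of x_i */
        cube[j] = '-';                     /* delete the literal of x_j */
        int seen = 0;
        for (int k = 0; k < distinct; k++)
            if (strcmp(reduced[k], cube) == 0) { seen = 1; break; }
        if (!seen) strcpy(reduced[distinct++], cube);
    }
    return distinct;
}

int main(void) {
    /* Five illustrative product terms over x1..x4: '1' true literal,
     * '0' complemented literal, '-' variable not used in the term. */
    char terms[NTERMS][NVARS + 1] = {"01-1", "011-", "1-00", "1100", "0-11"};

    for (int i = 0; i < NVARS; i++)
        for (int j = i + 1; j < NVARS; j++)
            printf("R{x%d,x%d} = %d\n", i + 1, j + 1, edge_weight(terms, i, j));
    return 0;
}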

14.4 GENETIC ALGORITHM


The Genetic Algorithm (GA) was invented by Professor John Holland at the University of
Michigan in 1975, and subsequently it has been made widely popular by Professor David
Goldberg at the University of Illinois. The original GA and its many variants, collectively
known as genetic algorithm, are computational procedures that mimic the natural process of
evolution. The theories of evolution and natural selection were first introduced by Darwin to
explain his observation of plants and animals in the natural world. Darwin observed that, as
variations are introduced into a population with each new generation, the less-fit individuals
tend to die off in the competition for food, and this survival of the fittest principle leads
to improvements in the species. The concept of natural selection was used to explain how
species have been able to adapt to changing environment and how, consequently, species
that are very similar in adaptability, may have evolved.
Much has been learned about genetics since the time of Charles Darwin. All informa-
tion required for the creation of appearance and behavioral features of a living organism
are contained in its chromosomes. Reproduction generally involves two parents, and the
chromosomes of the offspring are generated from portions of chromosomes taken from the
parents. In this way, the offspring inherit a combination of characteristics from their parents.
GAs attempt to use a similar method of inheritance to solve various problems, such as those
involving adaptive systems. The objective of the GA is then to find an optimal solution to
a problem. Of course, since GAs are heuristic procedures, they are not guaranteed to find
the optimum, but experience has shown that they are able to find very good solutions for a
wide range of problems.

GAs work by evolving a population of individuals over a number of generations. A
fitness value is assigned to each individual in the population, where the fitness computation
depends on the application. For each generation, individuals are selected from the popu-
lation for reproduction, the individuals are crossed to generate new individuals, and the
new individuals are mutated with some low mutation probability. The new individuals may
completely replace the old individuals in the population, with distinct generations evolved.
Alternatively, the new individuals may be combined with the old individuals in the popula-
tion. In this case, it may want to reduce the population in order to maintain a constant size,
e.g., by selecting the best individuals from the population. The choice of which approach
is used may depend on the application. Since selection is biased toward more highly fit
individuals, the average fitness of the population tends to improve from one generation to
the next. The fitness of the best individual is also expected to improve over time, and the
best individual may be chosen as a solution after several generations.
GAs use two basic processes from evolution: inheritance, or the passing of features
from one generation to the next, and competition, or survival of the fittest, which results in
weeding out the bad features from individuals in the population.
The main advantages of GAs are:
i. They are adaptive, and learn from experience
ii. They have intrinsic parallelism
iii. They are efficient for complex problems and,
iv. They are easy to parallelize, even on a loosely coupled Network Of Workstations
(popularly known as NOW), without much communication overhead.
The two basic GA approaches found so far are the simple GA, also called the total re-
placement algorithm, and the steady-state algorithm, which is characterized by overlapping
populations.

14.4.1 GA Terminology
All genetic algorithms work on a population, or a collection of several alternative solutions
to the given problem. Each individual in the population is called a string or chromosome,
in analogy to chromosomes in natural systems. Often these individuals are coded as binary
strings, and the individual characters or symbols in the strings are referred to as genes. In
each iteration of the GA, a new generation is evolved from the existing population in an
attempt to obtain better solutions.
The population size determines the amount of information stored by the GA. The GA
population is evolved over several generations.
An evaluation function (or fitness function) is used to determine the fitness of each
candidate solution. The fitness is the opposite of what is generally known as the cost in
optimization problems. It is customary to describe genetic algorithms in terms of fitness
rather than cost. The evaluation function is usually user-defined, and problem-specific.
Individuals are selected from the population for reproduction, with the selection biased
toward more highly fit individuals. Selection is one of the key operators on GAs that ensures
survival of the fittest. The selected individuals from pairs are called parents.

Crossover is the main operator used for reproduction. It combines portions of two parents
to create two new individuals, called offspring, which inherit a combination of features of
the parents. For each pair of parents, crossover is performed with a high probability PC,
which is called the crossover probability. With probability 1-PC, crossover is not performed,
and the offspring pair is the same as the parent pair.
Mutation is an incremental change made to each member of the population, with a very
small probability. Mutation enables new features to be introduced into a population. It is
performed probabilistically such that the probability of a change into each gene is defined
as the mutation probability, P m .

14.4.2 The Simple GA


The simple GA (Fig. 14.9) also referred to as the total replacement algorithm is composed
of populations of strings, or chromosomes, and three evolutionary operators: selection,
crossover, and mutation. The chromosomes may be binary-coded, or they may contain
characters from a larger alphabet. Each chromosome is an encoding of solution to the
problem at hand and each individual has an associated fitness which depends on the
application.
The initial population is typically generated randomly, but it may also be supplied by
the user. A highly fit population is evolved through several generations by selecting two
individuals, crossing the two individuals to generate two new individuals, and mutating
characters in the new individuals with a given mutation probability. Selection is done
probabilistically but is biased toward more highly fit individuals, and the population is es-
sentially maintained as an unordered set. Distinct generations are evolved, and the processes
of selection, crossover, and mutation are repeated until all entries in a new generation are
filled. Then the old generation may be discarded. New generations are evolved until some
stopping criterion is met. The GA may be limited to a fixed number of generations, or it
may be terminated when all individuals in the population converge to the same string or
no improvements in fitness values are found after a given number of generations. Since
selection is biased toward more highly fit individuals, the fitness of the overall population
is expected to increase in successive generations. However, the best individual may appear
in any generation.

14.4.3 The Steady-State Genetic Algorithm


In a GA having overlapping generations, only a fraction of the individuals is replaced in each
generation. The steady-state algorithm is illustrated in Fig. 14.10. In each generation, two
different individuals are selected as parents, based on their fitness. Crossover is performed
with a high probability, PC, to form offspring. The offsprings are mutated with a low
probability, PM.
A duplicate check may follow, in which the offsprings are rejected without any evalu-
ation if they are duplicates of some chromosomes already in the population. The offspring
that survive the duplicate check are evaluated and are introduced into the population only if
they are better than the current worst member of the population, in which case the offspring

Figure 14.9: Flowchart of the Simple Genetic Algorithm.



Figure 14.10: Flowchart of the Steady-State Genetic Algorithm.



replaces the worst member. This completes the generation. In the steady-state GA, the
generation gap is minimal, since only two offsprings are produced in each generation.
Duplicate checking may be beneficial because a finite population can hold more
schemata if the population members are not duplicates. Since the offsprings of two identical
parents are identical to the parents, once a duplicate individual enters the population, it
tends to produce more duplicates and individuals varying by only slight mutations. Prema-
ture convergence may then result. Duplicate checking is advantageous under the following
conditions:
i. The population size is small
ii. The chromosomes are short or
iii. The evaluation time is large.
Each of the above conditions reduces the duplicate checking time in comparison to the
evaluation time. If the duplicate checking time is negligible, compared to the evaluation
time, then duplicate checking improves the efficiency of the GA.
The steady-state GA is susceptible to stagnation. Since a large majority of offspring is
inferior, the steady-state algorithm rejects them, and it keeps making more trials on the
existing population for very long periods of time without any gain. Because the population
size is small compared to the search space, this is equivalent to long periods of localized
search.

14.5 GENETIC OPERATORS


The genetic operators and their significance can now be explained. The description will be
in terms of a traditional GA without any problem-specific modifications. The operators that
will be discussed include selection, crossover, mutation and inversion.

14.5.1 Selection
Various selection schemes have been used, but the focus here is on roulette wheel selection,
stochastic universal selection and binary tournament selection, with and without replacement.
As illustrated in Fig. 14.11 (a), roulette wheel selection is a proportionate selection scheme
in which slots of a roulette wheel are sized according to the fitness of each individual in the
population. An individual is selected by spinning the roulette wheel. The probability of
selecting an individual is therefore proportional to its fitness. As illustrated in Fig. 14.11 (b),
stochastic universal selection is a less noisy version of roulette wheel selection in which N
equidistant markers are placed around the roulette wheel, where N is the number of
individuals in the population. N individuals are selected in a single spin of the roulette wheel,
and the number of copies of each individual selected is equal to the number of markers inside
the corresponding slot.

Figure 14.11: Proportionate Selection Schemes.

In binary tournament selection, two individuals are taken at random, and the better
individual is selected from the two. If binary tournament selection is being done without
replacement, then the two individuals are set aside for the next selection operation, and
they are not replaced into the population. Since two individuals are removed from the
population for every individual selected, and the population size remains constant from
one generation to the next, the original population is restored after the new population is
half-filled. Therefore, the best individual will be selected twice, and the worst individual
will not be selected at all. The number of copies selected of any other individual cannot
be predicted except that it is either zero, one, or two. In binary tournament selection with
replacement, the two individuals are immediately replaced into the population for the next
selection operation.
The objective of the GA is to converge to an optimal individual, and selection pressure
is the driving force which determines the rate of convergence. A high selection pressure will
cause the population to converge quickly, possibly at the expense of a suboptimal result.
Roulette wheel selection typically provides the highest selection pressure in the initial
generations, especially when a few individuals have significantly higher fitness values than
other individuals. Tournament selection provides more pressure in later generations when
the fitness values of individuals are not significantly different. Thus, roulette wheel selection
is more likely to converge to a suboptimal result if individuals have large variations in fitness
values.
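To make these schemes concrete, the following minimal Python sketch (an illustration added here, not part of the original text; function names are assumptions) implements roulette wheel selection and binary tournament selection with replacement over a list of fitness values.

import random

def roulette_wheel_select(fitnesses):
    """Pick one index with probability proportional to its fitness (slot size)."""
    total = sum(fitnesses)
    spin = random.uniform(0, total)          # position of the marker on the wheel
    running = 0.0
    for i, f in enumerate(fitnesses):
        running += f
        if spin <= running:
            return i
    return len(fitnesses) - 1

def binary_tournament_select(fitnesses):
    """Pick two individuals at random (with replacement) and return the better one."""
    a, b = random.randrange(len(fitnesses)), random.randrange(len(fitnesses))
    return a if fitnesses[a] >= fitnesses[b] else b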

14.6 CROSSOVER
Once two chromosomes are selected, the crossover operator is used to generate two
offspring. In one- and two-point crossover, one or two chromosome positions are randomly
selected between 1 and (L − 1), where L is the chromosome length, and the two parents
are crossed at those points. For example, in one-point crossover, the first child is identical to
the first parent up to the crossing point and identical to the second parent after the crossing
point. An example of one-point crossover is shown in Fig. 14.12.

Figure 14.12: One-Point Crossover.

Crossover combines building blocks from two different solutions in various combinations.
Smaller good building blocks are converted into progressively larger good building
blocks over time until an entire good solution is obtained. Crossover is a random process,
and the same process can also combine bad building blocks and produce poor
offspring, but these are eliminated by the selection operator in the next generation.
The performance of the GA depends to a great extent on the performance of the crossover
operator used. The amount of crossover is controlled by the crossover probability, which is
defined as the ratio of the number of offspring produced in each generation to the population
size. A higher crossover probability allows exploration of more of the solution space and
reduces the chances of setting for a false optimum. A lower crossover probability enables
exploitation of existing individuals in the population that have relatively high fitness.
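A minimal sketch of one-point crossover, assuming chromosomes are represented as Python lists (the function name is illustrative):

import random

def one_point_crossover(parent1, parent2):
    """Cross two equal-length chromosomes at a random point between 1 and L-1."""
    L = len(parent1)
    point = random.randint(1, L - 1)          # crossing point
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2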

14.6.1 Mutation
As new individuals are generated, each character is mutated with a given probability. In a
binary-coded GA, mutation may be done by flipping a bit, while in a nonbinary-coded GA,
mutation involves randomly generating a new character in a specified position. Mutation
produces incremental random changes in the offspring generated through crossover, as
shown in Fig. 14.13. When used by itself, without any crossover, mutation is equivalent
to random search, consisting of incremental random modification of an existing solution, and
acceptance if there is an improvement. However, when used in the GA, its behavior changes
radically. In the GA, mutation serves the crucial role of replacing the gene values lost from
the population during the selection process so that they can be tried in a new context, or of
providing the gene values that were present in the initial population.

Figure 14.13: Mutation Operator.

For example, suppose a particular bit position, bit 10, has the same value, say 0, for all
individuals in the population. In such a case, crossover alone will not help, because it is
only an inheritance mechanism for existing gene values. That is, crossover cannot create an
individual with a value of 1 for bit 10, since it is 0 in all parents. If a value of 0 for bit 10
turns out to be suboptimal, then, without the mutation operator, the algorithm will have no
chance of finding the best solution. The mutation operator, by producing random changes,
provides a small probability that a 1 will be reintroduced in bit 10 of some chromosome.


If this results in an improvement in fitness, then the selection algorithm will multiply
this chromosome, and the crossover operator will distribute the 1 to other offspring. Thus,
mutation makes the entire search space reachable, despite a finite population size. Although
the crossover operator is the most efficient search mechanism, by itself it does not guarantee
the reachability of the entire search space with a finite population size. Mutation fills in this
gap.
The mutation probability PM is defined as the probability of mutating each gene. It
controls the rate at which new gene values are introduced into the population. If it is too
low, many gene values that would have been useful are never tried out. If it is too high, too
much random perturbation will occur, and the offspring will lose their resemblance to the
parents. The ability of the algorithm to learn from the history of the search will therefore
be lost.
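For a nonbinary-coded GA, such as the decoder-assignment encoding used later in this chapter, per-gene mutation can be sketched as follows (an illustration only; the alphabet and the default probability are assumptions):

import random

def mutate(chromosome, alphabet, pm=0.01):
    """Replace each gene by a randomly generated symbol with probability pm."""
    return [random.choice(alphabet) if random.random() < pm else gene
            for gene in chromosome]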

14.6.2 Inversion
The inversion operator takes a random segment in a solution string and inverts it end for end
(Fig. 14.14). This operation is performed in a way such that it does not modify the solution
represented by the string. Instead, it only modifies the representation of the solution. Thus,
the symbols composing the string must have an interpretation independent of their position.
This can be achieved by associating an identification number with each symbol in the string
and interpreting the string with respect to these identification numbers instead of the array
indices. When a symbol is moved in the array, its identification number is moved with it,
and therefore, the interpretation of the symbol remains unchanged.

Figure 14.14: Inversion Operator.

For example, Fig. 14.14 shows a chromosome. Let us assume a very simple evaluation
function such that the fitness is the binary number consisting of all bits of the chromosome,
with bit 0 being the least significant and bit 9 the most significant. Since the bit
identification numbers are moved with the bit values during the inversion operation, bit 0,
bit 1, etc., still have the same values, although their sequence in the chromosome is different.
Hence, the fitness value remains the same. The inversion probability is the probability of
performing inversion on each individual during each generation. It controls the amount of
group formation.
Inversion changes the sequence of genes randomly, in the hope of discovering a sequence
in which linked genes are placed close to each other.
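The position-independent interpretation can be modelled by storing (identification number, value) pairs, so that reversing a random segment changes only the representation and not the solution. A hedged sketch (names are illustrative):

import random

def invert_segment(chromosome):
    """chromosome is a list of (gene_id, value) pairs; reverse a random segment in place."""
    i, j = sorted(random.sample(range(len(chromosome) + 1), 2))
    chromosome[i:j] = reversed(chromosome[i:j])
    return chromosome

def fitness_by_id(chromosome):
    """Interpret genes by their identification numbers, not their positions."""
    # e.g., treat the value of gene_id k as bit k of a binary number
    return sum(value << gene_id for gene_id, value in chromosome)

Because fitness_by_id depends only on the identification numbers, reversing any segment leaves the fitness unchanged, which mirrors the bit-0 to bit-9 example above.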

14.7 GA FOR DECODED-PLAS


The input variable assignment for the decoders of decoded-PLAs is described in Section
15.3. Generally, a decoded-PLA, i.e., a PLA which has decoders in front of an AND array,
requires a smaller area than a standard PLA for realizing a function. It has been shown that
in many cases the areas of multi-input decoded-PLAs are smaller than those of decoded-PLAs
with two-input decoders or standard PLAs.
It is usually very difficult to assign input variables to decoders such that the area of
the decoded-PLA is minimal. There are some good heuristic algorithms to find variable
orderings. But in many cases, they are able to find only some sub-optimal solutions. This
is because they traverse only a portion of a large problem space. In this respect genetic
algorithms can help in an efficient way.

14.7.1 Problem Encoding


In order to further reduce the area of a PLA, decoders can be used with more than two
inputs each (multi-input decoders). Doing so raises the following problems.
One problem is that the assignment of variables to the decoders becomes more complex:
not only which variables should be connected to each decoder but also the number of
inputs of each decoder must be determined.
Another problem is that the area occupied by multi-input decoders, which is larger than
that of two-input decoders, must be taken into account.
Despite these problems, here an algorithm is developed for input assignment so that the
total area of a decoded-PLA (henceforth, “total area” means the sum of the AND and OR
array areas of a PLA, and the area overhead of decoders or inverters) can often be reduced
by using multi-input decoders. Here the genetic algorithm is used because genetic
algorithms have the following advantages:
(i) a big problem space can be searched,
(ii) the size of this search space can be moderated by parameters,
(iii) a variety of solutions can be produced, and
(iv) a near-optimal solution can be obtained.
Here vectors are used as chromosomes. The elements of the vectors represent the
input variables. Each value of the vector, i.e., each gene of a chromosome, represents
the decoder to which the corresponding variable should be assigned. As usual in GAs,
each vector provides a solution.

Example 14.4 Suppose in a particular function there are eight input variables x0, x1, ..., x7. Let a solution provided by a chromosome be:

Index 0 1 2 3 4 5 6
Gene  2 2 1 0 1 2 0

This solution suggests using three decoders in the decoded-PLA design. It also suggests
assigning x3 and x6 to the same decoder (suppose D0), x2 and x4 to another decoder (D1)
and x0, x1 and x5 to another one (D2).
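The decoding implied by Example 14.4 can be sketched in Python as follows (a minimal illustration; the function name and the dictionary representation are assumptions, not taken from the text):

def decode_assignment(chromosome):
    """Group input-variable indices by the decoder number stored in each gene."""
    decoders = {}
    for var_index, decoder_id in enumerate(chromosome):
        decoders.setdefault(decoder_id, []).append(var_index)
    return decoders

# The chromosome of Example 14.4: variable x_i is assigned to decoder D_gene[i]
print(decode_assignment([2, 2, 1, 0, 1, 2, 0]))
# {2: [0, 1, 5], 1: [2, 4], 0: [3, 6]}  ->  D0 = {x3, x6}, D1 = {x2, x4}, D2 = {x0, x1, x5}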

14.7.2 Fitness Function


As discussed in Section 15.4, the fitness function is problem-specific. In the present
case the fitness function represents the area of the decoded-PLA design that is
suggested by a chromosome. So here a higher fitness value actually denotes a less fit
individual.
Here the algorithm for finding the fitness of an individual chromosome is presented in
Algorithm 14.1.

Algorithm 14.1 Fitness (vector chromosome)


1: fit_val = 0
2: S = a new set of equal gene values formed from the chromosome; if no such set can
be formed, go to step 5
3: P_S = (2^|S| + m)R_S + D_|S|
4: Add P_S to fit_val and go to step 2
5: return fit_val

In line 3, the first term estimates the areas of the arrays and the second term estimates
the area overhead incurred by the decoder. Here m denotes the total number of output
functions in the PLA, and R_S is used as it is defined in Definition 15.3. If the variables in S are
assigned to a decoder and the variables not in S are treated in the same way as in standard
PLAs, it can be shown that R_S is an upper bound on the minimum number of product lines.
In the algorithm, R_S is used to estimate the number of product lines when the variables in S are
assigned to a decoder.
The area overhead of the decoders, D_|S|, is a very important factor in the cost. To estimate
a realistic value of D_|S|, many complex aspects of decoder circuit design and its layout must
be considered as follows:
1. Design of Decoder Circuits: Decoders are usually designed by a mixture of tree
decoders, NAND circuits, pass-transistor circuits and others, in order to have appro-
priate trade-offs between speed and layout area. So, there are a variety of decoder
circuits. Also, to have enough driving capability, buffers which occupy large areas
are usually needed for the output signals or the decoders (in the case of standard
PLAs, large inverters are needed).
2. Routing of Decoder Input Lines: Input lines run through the decoders. The area occupied
by these lines depends highly on where, in what directions, and in how many groups
these lines approach the decoders. Also, depending on the technology of the circuits
realizing the decoders, the layout and line spacing could be different. If these lines
need to be permuted, extra areas for contact windows may be required.
Since these very complex factors depend on the situation, a simplified but reasonable
estimation of the area of a decoder, D_|S| = 2A|S| · 2^|S|, is used after examining the layout of
some real design samples, where 2^|S| is the number of signal lines out of a |S|-input decoder,
2|S| is the number of input lines which run through the decoder, and A is a coefficient to be
adjusted according to line spacing, transistor sizes, and others. Here, in the implementations,
A = 1 is used as a dummy estimation.
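Putting the cost model together, a hedged Python sketch of Algorithm 14.1 is given below. The product-line estimate R_S is treated as a user-supplied function, since its Definition 15.3 is not reproduced here, and A = 1 is used as in the text; all names are illustrative.

def decoder_area(s_size, A=1):
    """Area overhead of a |S|-input decoder: D_|S| = 2*A*|S|*2^|S| (dummy coefficient A)."""
    return 2 * A * s_size * (2 ** s_size)

def fitness(chromosome, m, estimate_RS):
    """Estimated total PLA area for one chromosome; a lower value means a fitter individual.

    m            -- number of output functions of the PLA
    estimate_RS  -- callable returning the product-line upper bound R_S for a variable set S
    """
    fit_val = 0
    for decoder_id in set(chromosome):                        # each set S of equal gene values
        S = [i for i, d in enumerate(chromosome) if d == decoder_id]
        P_S = (2 ** len(S) + m) * estimate_RS(S) + decoder_area(len(S))
        fit_val += P_S                                        # (2^|S| + m)*R_S + D_|S|
    return fit_val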

14.7.3 Developed GA
When problem-specific information exists, it may be advantageous to consider a GA hybrid.
Genetic algorithms may be crossed with various problem-specific search techniques to form
a hybrid that exploits the global perspective of the GA and the convergence of the problem-specific
technique. Here, a form of modified greedy algorithm is used in the selection
procedure of the GA. Another change to the traditional GA is also made: the worst
chromosomes of the population are not simply replaced with the new offspring; instead,
the best chromosomes are chosen from among the parents and the offspring. This is done to
utilize the good features of even the worst chromosomes. Using this technique, the good
features of the parents as well as of all other chromosomes remain in the population through
the survivors.
The GA that is adopted is presented in Algorithm 14.2.

Algorithm 14.2 Input_variable_assignment ()


1: Generate the initial population.
2: Evaluate each individual.
3: Select two individuals without repetition and then select the two best individuals among
them without repetition.
4: With a high probability Pc, perform crossover on the pair to generate two offspring. If
crossover is not performed, then the parents are copied unchanged to the offspring.
5: Mutate the offspring with a small probability, Pm.
6: If an offspring is a duplicate of any other individual already in the population, then reject it.
7: Evaluate the offspring.
8: Select two chromosomes from the offspring and the parents and replace the parents
with them.
9: Use the inversion procedure.
10: If the number of repetitions without any change in the population has been met, then return
the best chromosome; else go to step 3.
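A compact Python rendering of the overall flow of Algorithm 14.2 is sketched below. It reuses the one_point_crossover and mutate helpers sketched earlier in this chapter, omits the inversion step for brevity, and treats a smaller evaluated area as better; the population size, probabilities and stopping limit are assumed values, so this is an illustration rather than the exact published procedure.

import random

def input_variable_assignment(num_vars, num_decoders, evaluate,
                              pop_size=20, pc=0.9, pm=0.05, max_stall=50):
    """Evolve decoder assignments; evaluate() returns the estimated area (smaller is better)."""
    population = [[random.randrange(num_decoders) for _ in range(num_vars)]
                  for _ in range(pop_size)]                               # step 1
    cost = lambda c: evaluate(c)                                          # steps 2 and 7
    stall = 0
    while stall < max_stall:                                              # step 10
        i1, i2 = random.sample(range(pop_size), 2)                        # step 3
        p1, p2 = population[i1], population[i2]
        if random.random() < pc:                                          # step 4
            o1, o2 = one_point_crossover(p1, p2)
        else:
            o1, o2 = p1[:], p2[:]
        o1 = mutate(o1, range(num_decoders), pm)                          # step 5
        o2 = mutate(o2, range(num_decoders), pm)
        seen = {tuple(c) for c in population}
        offspring = [o for o in (o1, o2) if tuple(o) not in seen]         # step 6
        candidates = sorted([p1, p2] + offspring, key=cost)               # step 8
        changed = {tuple(c) for c in candidates[:2]} != {tuple(p1), tuple(p2)}
        population[i1], population[i2] = candidates[0], candidates[1]
        stall = 0 if changed else stall + 1
    return min(population, key=cost)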

14.7.4 Decoded AND-EXOR PLA Implementation


An AND-EXOR PLA is represented by a set of EXCLUSIVE-OR sum-of-product expres-
sions (ESOPs) instead of OR sum-of-product expressions (SOP) used in standard AND-OR
PLA. Fig. 14.15 shows the distinction between the standard AND-OR PLA and AND-EXOR
PLA.

Figure 14.15: Standard AND-OR and AND-EXOR PLA.

Consider the function f (w, x, y, z) = Σ(5, 7, 11, 13). The minimized SOP form is
f = xy'z + w'xz + wx'yz. But here the minimum ESOP form is f = xz ⊕ wyz.

14.8 SUMMARY
The number of products in a PLA (Programmable Logic Array) is equal to the number
of different products in the expressions. So, in order to minimize the size of PLAs, it is
necessary to minimize the number of different products in the expression. AND-EXOR
PLAs with decoders require fewer products than AND-OR PLAs without decoders. By
replacing the OR array with an EXOR array in the PLA, an AND-EXOR PLA with decoders
can be obtained. An algorithm is also presented to assign variables for multi-input decoded
PLAs based on the Hamiltonian path and dynamic programming.

REFERENCES
[1] T. Sasao, “Input variable assignment and output phase optimization of PLA’s”, IEEE
Trans. Computer, vol. C-28, no. 9, pp. 879–894, 1984.
[2] C. Kuang-Chien and S. Muroga, “Input assignment algorithm for decoded-PLAs with
multi-input decoders”, in IEEE International Conference on Computer-Aided Design
(ICCAD-88), pp. 474–477, 1988.
[3] P. Mazumder and E. M. Rudnick, “Genetic Algorithms for VLSI Design, Layout and
Test Automation”, Pearson Education, Asia, 1999.
[4] J. H. Holland, “Adaptation in Natural and Artificial Systems”, Ann Arbor, MI: Uni-
versity of Michigan Press, 1975.
[5] D. E. Goldberg, “Genetic Algorithms in Search, Optimization, and Machine Learning,
Reading, MA”, Addison-Wesley, 1989.
[6] K. C. Chen and S. Muroga, “Input Variable Assignment for Decoded-PLA’s and
Output Phase Optimization Algorithms”, to appear as a department report, Department
of Computer Science, University of Illinois at Urbana-Champaign.
[7] L. A. Glasser and D. W. Dobberpuhl, “The Design and Analysis of VLSI Circuits”,
Addison Wesley, 1985.

[8] C. R. Darwin, “On the Origin of Species by Means of Natural Selection”, London:
John Murray, 1859.
[9] K. C. Chen and S. Muroga, “Input assignment algorithm for decoded-PLAs with
multi-input decoders”, in IEEE International Conference on Computer-Aided Design
(ICCAD-88), pp. 474–475, 1988.
CHAPTER 15

FPGA-Based Multiplier
Using LUT Merging Theorem

FPGA (Field Programmable Gate Array) technology has become an integral part of to-
day’s modern embedded systems. All mainstream commercial FPGA devices are based
upon LUT (Look-up Table) structures. As any m-input Boolean function can be imple-
mented using m-input LUTs, it is a prime concern to reduce the number of LUTs while
implementing an FPGA-based circuit for given functions. In this chapter, a LUT merging
theorem is introduced which reduces the required number of LUTs for the implementation
of a set of functions by a factor of two. The LUT merging theorem performs selection,
partition and merging of the LUTs to reduce the area. Using the LUT merging theorem,
a (1 × 1)-digit multiplication algorithm is described, which does not require any partial
product generation, partial product reduction and addition steps. An (m×n)-digit multipli-
cation algorithm is introduced which performs digit-wise parallel processing and provides
a significant reduction in carry propagation delay.

15.1 INTRODUCTION
The implementation of FPGA (Field Programmable Gate Array) is now prevalent in ap-
plications such as signal processing, cryptography, networking, arithmetic and scientific
computing. LUT (Look-up Table) is the key cell of an FPGA. In this chapter, the introduced
LUT merging theorem is applied in multiplier to prove the performance efficiency.
Three main contributions of this chapter are as follows:
(1) LUT merging theorem is introduced by following selection, partition and merging
of LUTs in order to reduce the required number of LUTs by a factor of two for implementing
a set of functions.
(2) Using the LUT merging theorem, a single digit multiplication algorithm has been
described avoiding the conventional slow partial product generation, partial product reduc-
tion and addition stages. Efficient (m×n)-digit multiplication algorithm is explained, where
digit-wise parallel processing is performed to reduce the carry propagation delay of the
multiplier significantly.

DOI: 10.1201/9781003269182-18



(3) A LUT-based compact and faster (m×n)-digit multiplier circuit is described by using
the single digit LUT-based multiplier circuit.

15.2 LUT MERGING THEOREM


LUT is an important component for the design of an FPGA. A LUT with m inputs corresponds
to a (2^m × 1)-bit memory, which can realize any logic function with m inputs by
programming the truth table of the logic function directly into the memory. It is a complex task
to build a compact circuit by minimizing the number of LUTs for implementing any given
Boolean function. In this section, the LUT merging property (Property 15.2.1) is introduced
to reduce the number of LUTs in the design of the FPGA for given functions.

Property 15.2.1 (LUT Merging): Let n be the total number of m-input LUTs required
in a circuit for the implementation of n functions f_i(a1, a2, ..., a_m). If the given
functions f_i are minimized to g_i(a1, a2, ..., a_m′), then the n LUTs can be merged
into ⌈n/2⌉ LUTs to implement the n functions, where 1 ≤ m′ < m, m ≥ 3, m′ is the number
of input variables of the minimized function g_i, and i = 1, 2, ..., n.

Property 15.2.2: To perform the merging of n m-input LUTs, the following two conditions
are required to be fulfilled:

(1) If a Boolean function f(a1, a2, ..., a_m) requires m variables, then m must be reduced to m′,
where f(m) ≡ g(m′) and m > m′.
(2) If the input sets I_i of i different functions are the same, such that I1 = I2 =
I3 = · · · = In, and the corresponding output combinations O_i are also the same, that is, O1
= O2 = O3 = · · · = On, then I1, I2, ..., In → O1.
The following three steps are considered after fulfilling the above conditions:
(1) Selection of LUTs: Bit categorization is performed based on the input bits for the
selection of LUTs. After categorizing the input domain the target functions and target
number of input bits of the LUTs are determined based on the number of input and output
bits of the functions.
(2) Partitioning of LUTs: Partitioning of LUTs is done to find the similarities of the
input and output combinations. First, the number of input bits is reduced and checked for
similarity among functions either by rearranging the input bit positions or by implementing
smaller functions on the input bits. If there are identical input bit combinations after the
reduction of the number of input bits and the corresponding outputs of the functions are
identical for the multiple occurrence, then a partition is created with those functions.
(3) Merging of LUTs: The final step is to deal with the merging of the LUTs of each
partition. Since an m-input LUT is a dual-output circuit, n LUTs will be reduced
to ⌈n/2⌉, where m ≥ 3.
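The first condition above (reducing m to m′) can be checked mechanically from a truth table by testing which inputs the function really depends on. The following hedged Python sketch, assuming a truth table given as a list of 2^m output bits, illustrates that pre-condition check and the ⌈n/2⌉ bound; it is not the full merging procedure, and all names are illustrative.

from math import ceil

def support(truth_table, m):
    """Return the indices of inputs that the function really depends on."""
    deps = []
    for i in range(m):
        for row in range(len(truth_table)):
            if truth_table[row] != truth_table[row ^ (1 << i)]:   # flipping input i changes f
                deps.append(i)
                break
    return deps

def merged_lut_count(functions, m):
    """Upper bound from Property 15.2.1: n functions need ceil(n/2) dual-output m-input LUTs
    once each function has been reduced below m inputs."""
    n = len(functions)
    assert all(len(support(tt, m)) < m for tt in functions), "functions must reduce to m' < m"
    return ceil(n / 2)

# Example: f(a2, a1, a0) = a1 AND a0 ignores a2, so its support has size 2 < 3
f = [1 if (row & 0b011) == 0b011 else 0 for row in range(8)]
print(support(f, 3))                  # [0, 1]
print(merged_lut_count([f, f], 3))    # two such functions -> 1 dual-output LUT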

15.3 THE MULTIPLIER CIRCUIT USING THE LUT MERGING THEOREM


In this chapter, an efficient (m×n)-digit multiplier has been introduced. As an example of
LUT merging theorem, a (1 × 1)-digit multiplication is considered here. In a (1 × 1)-digit
multiplication, the maximum decimal input digit (dmax) = (a3, a2, a1, a0) or (b3, b2, b1, b0) is
9, and the product of two (dmax) yields a maximum of 7 bits as (P6 P5 P4 P3 P2 P1 P0).

T1 = (A3 + A2 + A1); T2 = (B3 + B2 + B1); (15.1)
T3 = (A2 + B2); T4 = (A3 + B3); (15.2)
T5 = (T2'.B0); T6 = (T1'.A0); (15.3)
T7 = (Catg0 + T5 + T6)'; (15.4)
Catg0 = (T1 + A0)' + (T2 + B0)'; (15.5)
Catg1 = Catg0'.T5; Catg2 = Catg0'.T6; (15.6)
Catg3 = (A1.B1).(T3 + T4)'; Catg4 = T3.T4'.T7; (15.7)
Catg5 = T4.T7; (15.8)

First, the selection of LUTs has been performed by the bit categorization technique, where the
multiplier and multiplicand bits are arranged into groups of variables of 1-input, 2-input
(Catg3), 3-input (Catg4) and 4-input (Catg5) as shown in Table 15.1. For instance, a group
of 3-input variables contains operands with a maximum of 3 bits. Either the multiplicand or the
multiplier is 0Dec in Catg0 and the corresponding output is zero. In Catg1 the
multiplicand B is 1Dec and the output is A. In Catg2 the multiplier A is 1Dec and the output is
B. It is important to note that only four categories of input variables are considered, as the
(1 × 1)-digit multiplication is used as the basic unit to construct an (m×n)-digit multiplier.
The (1 × 1)-digit multiplication considers the four binary bits with the maximum decimal
value, 9, as input. As A × B = B × A, either A × B or B × A is considered to avoid identical
input combinations. So, the first step of the multiplication process is to find the appropriate
category. The inputs will make the category selection value (cat) of the exact category equal to
one, keeping the other values zero, which will activate that specific category to provide
the output. The categories are determined by Equations (15.1)–(15.8), where plus (+)
represents the OR operation, dot (.) represents the AND operation and (') represents the complement.
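The intent of the bit categorization in Table 15.1 can be modelled directly on the digit values. The following hedged Python sketch classifies an input pair by the category descriptions given above rather than by the gate-level equations, so the function name and category labels are illustrative:

def category(a, b):
    """Classify a (1 x 1)-digit BCD input pair (0 <= a, b <= 9) as in Table 15.1."""
    if a == 0 or b == 0:
        return "Catg0"        # product is zero
    if b == 1:
        return "Catg1"        # output is A
    if a == 1:
        return "Catg2"        # output is B
    if a <= 3 and b <= 3:
        return "Catg3"        # 2-input (2-bit) category
    if a <= 7 and b <= 7:
        return "Catg4"        # 3-input (3-bit) category
    return "Catg5"            # 4-input (4-bit) category

print(category(3, 3), category(9, 7), category(5, 6))   # Catg3 Catg5 Catg4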
In the 2-bit categorization the final output is obtained using the following equations, where
(.) means the AND operation and ⊕ means the Ex-OR operation:

P0^2 = a0.b0; and P1^2 = a0.b0.a1; (15.9)

P2^2 = b0 ⊕ b1; P3^2 = a0.b0; (15.10)

In the 4-bit categorization the final output is obtained using the following equations:

P0^4 = a0.b0; P1^4 = b0.a2; (15.11)

P3^4 = a1, if a3.b0 = 0; ā0, otherwise (15.12)

P4^4 = a1, if a3.a0 = 0; 1, otherwise (15.13)

P5^4 = a2; and P6^4 = a3 (15.14)

The group of the 3-bit category is selected for the implementation of LUT merging. Product
bits P0 and P5 are generated using the following equations:

P0^3 = a0; and P5^3 = (b1.b2).(b0.a0.a2 + a1.a2) (15.15)

So, 4 bits are left, ranging from P1 to P4. A LUT with 4 to 6 inputs provides the best area
and delay performance. The traditional 4-input LUT structure provides low logic density
and configuration flexibility, which reduces the utilization of interconnect resources when
configured as relatively complex logic functions. Hence, a 6-input LUT is considered to
implement the functions.
Second, partitioning of LUTs is performed by observing the similarities between the
input and output combinations of the product bits P1,4 and P2,3. Finally, merging of
LUTs is performed by realizing the functions of the 6-input LUTs while minimizing them
as follows:
P1,4 = g(a0, a1, b0, b1, b2); (15.16)
P2,3 = g((b1 ⊕ b2), b0, a2, a1, a0) (15.17)

Fig. 15.2 shows the merging of LUTs and Table 15.2 exhibits the verification of LUT
merging theorem. Before LUT merging, the output of the corresponding 6-bit input variable
is shown in Table 15.2. Now, the 6-bit input combinations are converted into 5-bit input
combinations in such a way so that the total numbers of input combinations of 6-bit and
5-bit inputs are the same as well as the corresponding outputs of all the combinations for
both the 6-input and 5-input variables also remain the same. The 5-bit input combinations
with the corresponding outputs are shown in the After LUT Merging column of Table 15.2. In
Fig. 15.2, the merging of LUTs has been accomplished in three steps for the multiplication
operation. The first step is the selection of LUTs, where the selection is performed on the
basis of computation of final products P1 , P2 , P3 , and P4 . Second, the partitioning of LUTs
is performed, where two different colors are used to distinguish the distinct partition sets.
Third, the partitioned LUTs are merged into one.
The block diagram of a (1 × 1)-digit multiplier circuit is shown in Fig. 15.1. The bit
categorization circuit is constructed using Equations (15.1)–(15.8) as shown in Fig. 15.3. The
circuit is partitioned into categories, where each category performs the above-mentioned
output functions in Equations (15.9)–(15.17) efficiently using LUTs. As the bit categorization
selects the corresponding category by providing the value of that variable as 1, this value
activates that particular category. So, other categories remain deactivated and only one
category at a time is selected. Fig. 15.4 shows that as the output Catg3 bit from bit
Table 15.1: Bit Categorization of Inputs for the Implementation of LUT Merging Theorem
in Multiplication Technique

categorization circuit for the corresponding input combination is zero, the total 3-bit category
circuit is deactivated and no input passes through that category circuit. The dotted line
represents the off state and the straight line represents the on state. On the other hand, when the input
combination is A = 0011 and B = 0011, which represents category 3, the bit categorization
provides Catg3 as 1. Hence, the 3-bit category is activated and the corresponding required
inputs are passed through the transistors and LUTs. Finally, the corresponding output of
the input combination, P = 1001, is obtained from that category. The internal circuit of each
category, as shown in Fig. 15.5, is the LUT implementation of Equations (15.9)–(15.17).
Similarly, for all the other input combinations the corresponding category is selected and
activated. Finally, the output from the activated category is considered as the final output.
The introduced (m×n)-digit multiplier circuit uses the (1 × 1)-digit multiplier circuit. For
an (m×n)-digit multiplier, n processing elements are required. Each processing
element PE_i has a single digit of the multiplicand, B_i, and the multiplier is pipelined to it. The
outputs of the multipliers are passed to a binary-to-decimal converter. The multiplier requires
an 8-bit binary-to-BCD converter in each processing element irrespective of the multiplier and
multiplicand size. The output from the converter is the input to the 8-bit BCD adder. The
least significant four bits of the BCD adder are stored as output and the rest of the bits are added
with the output of the adder in the next iteration. After m iterations, each output of
the processing elements (P_i) is shifted i digits using shift registers. Finally, the outputs of
the processing elements are added using the (m + 1)-digit BCD adder. The datapath of the
circuit is shown in Fig. 15.6.
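The dataflow just described amounts to forming digit-wise partial products and accumulating shifted results. A simplified behavioural model in Python is sketched below; it models only the arithmetic, not the pipelining, the binary-to-BCD converters or the BCD adders of the actual circuit, and the names are illustrative.

def multiply_digits(a_digits, b_digits):
    """Behavioural model of the (m x n)-digit scheme: a_digits/b_digits are lists of
    decimal digits, least significant first; each PE_i multiplies one multiplicand digit
    by every multiplier digit and its result is shifted by i digits before the final add."""
    total = 0
    for i, b_digit in enumerate(b_digits):            # one processing element per digit of B
        pe_sum = 0
        for j, a_digit in enumerate(a_digits):        # multiplier digits pipelined into PE_i
            pe_sum += a_digit * b_digit * 10 ** j     # (1 x 1)-digit product, accumulated
        total += pe_sum * 10 ** i                     # shift PE_i output by i digits
    return total

print(multiply_digits([4, 2, 1], [6, 5]))   # 124 * 56 = 6944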

Figure 15.1: Block Diagram of the (1 × 1)-Digit Multiplier Circuit.

Figure 15.2: Implementation of LUT Merging Theorem Using 6-input LUTs.



Table 15.2: Simulation of LUT Merging for 3-Bit Input Combinations of a 1 × 1-Digit
FPGA-Based Multiplier Circuit

Figure 15.3: The Bit-Categorization Circuit Using 6-Input LUT.



Figure 15.4: Deactivated Category Circuit as it is not the Target Category for Corresponding
Input.

Figure 15.5: Activated Category Circuit as it is the Target Category for Corresponding
Input.

Figure 15.6: Block Diagram of the (m×n)-Digit Multiplier Circuit.

15.4 SUMMARY
The (m×n)-digit multiplier performs digit-wise parallel processing in a divide and conquer
approach where (1 × 1)-digit multiplier is used. The (1 × 1)-digit multiplier avoids the
conventional partial product generation, reduction and addition steps. The multiplier uses
the Look-up Table (LUT) merging theorem to reduce the required number of LUTs by a
factor of two. The LUT merging theorem can be utilized for any set of functions for the
reduction of required number of LUTs. The described theorem will enhance the efficient
use of LUTs in FPGA-based circuits to a great extent, which will forward the advancement
and applicability of Field Programmable Gate Arrays (FPGAs).

REFERENCES
[1] Z. T. Sworna, M. U. Haque and H. M. H. Babu, “A LUT-based matrix multiplication
using neural networks”, IEEE International Symposium on Circuits and Systems
(ISCAS), Montreal, QC, pp. 1982–1985, 2016.

[2] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep submicron FPGA
performance and density”, IEEE Trans. Very Large Scale Integration Syst., vol. 12,
no. 3, p. 288, 2004.
[3] M. Zhidong, C. Liguang, W. Yuan and L. Jinmei, “A new FPGA with 4/5-input LUT
and optimized carry chain”, Journal of Semiconductors, vol. 33, no. 7, 2012.
[4] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET (The Institution of Engineering and Technology)
Circuits, Devices and Systems, pp. 1–10, 2015.
[5] A. Vazquez, E. Antelo and P. Montuschi, “A new family of high performance parallel
decimal multipliers”, In 18th IEEE Symposium on Computer Arithmetic, pp. 195–
204, 2007.
[6] V. P. Mrio and H. C. Neto, “Parallel decimal multipliers using binary multipliers”, In
Programmable Logic Conference, Southern, pp. 73–78, 2010.
[7] H. T. Richard and R. F. Woods, “Highly efficient, limited range multipliers for LUT-
based FPGA architectures”, IEEE Trans. on Very Large Scale Integration (VLSI)
Systems, vol. 12, no. 10, pp. 1113–1118, 2004.
[8] Z. T. Sworna, M. U. Haque, H. M. H. Babu, L. Jamal and A. K. Biswas, “An Efficient
Design of an FPGA-Based Multiplier Using LUT Merging Theorem”, In IEEE Computer
Society Annual Symposium on VLSI (ISVLSI), pp. 116–121, 2017.
CHAPTER 16

Look-Up Table-Based Binary


Coded Decimal Adder

The Binary Coded Decimal (BCD), being a more accurate and human-readable representation
with ease of conversion, is prevalent in computing and electronic communication.
In this chapter, a tree-structured parallel BCD addition algorithm is introduced with the
reduced time complexity. BCD adder is more effective with a LUT (Look-Up Table)-based
design, due to FPGA (Field Programmable Gate Array) technology’s enumerable bene-
fits and applications. A size-minimal and depth-minimal LUT-based BCD adder circuit
construction is the main focus of this chapter.

16.1 INTRODUCTION
Binary Coded Decimal (BCD) representation is advantageous due to its finite place value
representation, rounding, easy scaling by a factor of 10, simple alignment and conversion
to character form. It is highly used in embedded applications, digital communication and
financial calculations. Hence, a faster and more efficient BCD addition method is desired. In this
chapter, an N-digit addition method is introduced which omits the complex manipulation
steps, reducing the area and delay of the circuit. The application of FPGAs in cryptography,
NP-Hard optimization problems, pattern matching, bioinformatics, floating-point
arithmetic, and molecular dynamics is increasing radically. Due to re-configurable
capabilities, FPGA implementation of BCD addition is of concern. LUT being one of the
main components of FPGA, a LUT-based adder circuit is described.
Two main focuses are addressed in this chapter. First, a new tree-based parallel BCD
addition algorithm is presented. Second, a compact and high-speed BCD adder circuit is
presented, with an improved time complexity of O(N·log2 b + (N − 1)), where N represents the
number of digits and b represents the number of bits in a digit.

16.2 THE DESIGN OF LUT-BASED BCD ADDER


In this section, first a parallel BCD addition algorithm is described. Then, a new LUT-based
BCD adder circuit is constructed. Essential figures and properties are elucidated to clarify
the introduced ideas.

DOI: 10.1201/9781003269182-19



16.2.1 Parallel BCD Addition Method


The carry propagation is the main cause of delay in a BCD adder circuit, which gives the BCD
adder a serial architecture.
As the reduction of delay is one of the most important factors for the efficiency of
the circuit, the carry propagation mechanism needs to be removed for faster BCD addition.
In this chapter, a highly parallel BCD addition method is described with a tree-structured
representation and a significant reduction of delay. The BCD addition method mainly
has the following steps:

1. Bit-wise addition of the BCD addends produces the corresponding sum and carry bits in
parallel. For the addition of the first bit, the carry from the previous digit is added
too, and the produced sum is the direct first bit of the output.

2. If the most significant carry bit is zero then, except for the first sum and last carry bit,
add the other sum and carry bits in pairs in parallel; and if the sum is greater than or
equal to five, add three to the result to obtain the correct BCD output.

3. If the most significant carry bit is one, then update the final output values according
to (16.1) and (16.2).

Suppose A and B are the two addends of a 1-digit BCD adder, where the BCD representations
of A and B are A3 A2 A1 A0 and B3 B2 B1 B0, respectively. The output of the adder
will be a 5-bit binary number {Cout S3 S2 S1 S0}, where Cout represents the position of the
tens digit and {S3 S2 S1 S0} symbolizes the unit digit of the BCD sum. A0 and B0 are added along
with Cin, which is the carry from the previous digit addition. If it is the first digit addition,
the carry is considered as zero. The produced sum bit will be the direct first bit of the
output. The other pairwise bits (B1, A1), (B2, A2), (B3, A3) are added simultaneously. The
resultant sum and carry bits (S3^α, C2, S2^α, C1, S1^α, and C0) are added pairwise, providing the
output {Cout^γ S3^γ S2^γ S1^γ}, which is corrected by the addition of three according to the following (16.1)
and (16.2):

Cout^γ S3^γ S2^γ S1^γ =
    (Cout^γ S3^γ S2^γ S1^γ),      if C3 = 0 and (Cout^γ S3^γ S2^γ S1^γ) < 5
    (Cout^γ S3^γ S2^γ S1^γ) + 3,  if C3 = 0 and (Cout^γ S3^γ S2^γ S1^γ) ≥ 5
    1 C0 S2^3 S1^3,               otherwise                                   (16.1)

where S1^3 = S2^3 = 0 if C0 = 1, and 1 otherwise                              (16.2)

In Table 16.1, the truth table is designed with (A3, A2, A1) and (B3, B2, B1) as inputs and
(Cout S3 S2 S1) as the final BCD output after the required correction. (S3^α, C2, S2^α,
C1, S1^α and C0) are added pairwise as an intermediate step, producing (F4, F3, F2, F1),
considering the carry C0 always 1. A numeric 3 ((011)2) is added to the intermediary output F
if F is greater than or equal to five. A similar table considering C0 as 0 can be calculated,
which is shown in Table 16.2. The truth tables verify the functions of each output of the
LUTs of the BCD adder. The algorithm of the N-digit BCD addition method is presented in
Algorithm 16.1.
Two examples of the BCD addition method using the introduced algorithm are demonstrated
in Figs. 16.1 and 16.2, where C3^i = 0 and C3^i = 1, respectively. Each step of the examples is
mapped to the corresponding algorithm step for more clarification.
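A behavioural Python model of the described 1-digit addition (and its chaining into an N-digit adder) is sketched below for clarity. It follows the two-level method above; the variable V models the pairwise sum {Cout^γ S3^γ S2^γ S1^γ}, and all names are illustrative rather than taken from the circuit.

def bcd_digit_add(A, B, Cin):
    """Add two BCD digits (0-9) plus a carry-in, following the two-level parallel method."""
    a = [(A >> i) & 1 for i in range(4)]
    b = [(B >> i) & 1 for i in range(4)]
    # Level 1: bit-wise addition (a full adder for bit 0, half adders for bits 1-3)
    S0 = a[0] ^ b[0] ^ Cin
    C0 = (a[0] & b[0]) | (Cin & (a[0] ^ b[0]))
    S1a, S2a, S3a = a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]
    C1, C2, C3 = a[1] & b[1], a[2] & b[2], a[3] & b[3]
    if C3 == 1:
        # both addends are 8 or 9: digit = {C0, S2^3, S1^3, S0} with S1^3 = S2^3 = (0 if C0 else 1)
        s = 0 if C0 else 1
        return 1, (C0 << 3) | (s << 2) | (s << 1) | S0
    # Level 2: add the aligned sum/carry pairs, then correct by +3 when the result is >= 5
    V = ((S3a << 2) | (S2a << 1) | S1a) + ((C2 << 2) | (C1 << 1) | C0)
    if V >= 5:
        V += 3
    return (V >> 3) & 1, ((V & 0b111) << 1) | S0

def bcd_add(a_digits, b_digits):
    """Chain 1-digit adders into an N-digit BCD adder (digit lists are least significant first)."""
    carry, out = 0, []
    for A, B in zip(a_digits, b_digits):
        carry, digit = bcd_digit_add(A, B, carry)
        out.append(digit)
    out.append(carry)
    return out

print(bcd_add([9, 9], [9, 9]))   # 99 + 99 = 198 -> digits [8, 9, 1]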

Table 16.1: The Truth Table of 1-Digit BCD Addition with C 0 = 1



Table 16.2: The Truth Table of 1-Digit BCD Addition with C 0 = 0

Figure 16.1: Example Demonstration of the BCD Addition Algorithm for C3i = 0.

Figure 16.2: Example Demonstration of the BCD Addition Algorithm for C3i = 1.

The BCD addition method, being parallel, can be represented as a tree structure, which
is shown in Fig. 16.3. There are basically two operational levels of the tree. Starting from
the inputs, in Level 1, the bit-wise addition is performed and the intermediary resultants are
obtained. Then, in Level 2, the addition and correction are performed, providing the correct
BCD output. Hence, the time complexity of the algorithm is logarithmic in the
operational depth of the tree. Property 16.1 is given to prove the time complexity of the
method.
Property 16.1: The BCD addition algorithm has a time complexity of O(N·log2 b + (N − 1)),
where N is the number of BCD digits and b is the number of bits in a digit.

Figure 16.3: Tree Structure Representation of the BCD Addition Method.



Algorithm 16.1: N -digit Parallel BCD Addition

16.2.2 Parallel BCD Adder Circuit Using LUT


A LUT-based parallel BCD adder circuit is designed by using the BCD addition algorithm
and LUT architecture. An algorithm for the construction of the BCD adder circuit is
presented in Algorithm 16.2. According to the algorithm, the circuit is depicted in Fig.
16.4. For the addition of the least significant bit with carry from the previous digit addition,
a full adder is used. Three half-adders are used for individual bit-wise addition operation
of the most significant three bits. Depending on the value of C3, Equations (16.1) and (16.2) are followed
in the circuit architecture by using the transistors and LUTs, where four 6-input
LUTs are used to add the outputs from the half-adders and the full adder {S3^α, ..., S1^α, C0}, with
the correction by adding 3 if the sum is greater than or equal to five. Depending on the
value of C3, a switching circuit is used to follow Equation (16.3). The circuit gains a huge delay
reduction due to its parallel working mechanism. By using the 1-digit BCD adder circuit,
an N-digit BCD adder circuit can be constructed easily, where the Cout of one digit adder
circuit is sent to the next digit of the BCD adder circuit as Cin. Therefore, the generalized
N-digit BCD adder computes sequentially by using the previous carry; the block diagram
is shown in Fig. 16.5.
Cout S3 S2 S1 = Cout^3 S3^3 S2^3 S1^3, if C3 = 1;  Cout^γ S3^γ S2^γ S1^γ, otherwise    (16.3)

Algorithm 16.2: Construction of a 1-Digit BCD Adder Circuit



Figure 16.4: 1-Digit BCD Adder Circuit.



Figure 16.5: Block Diagram of the N -Digit BCD Adder Circuit.

16.3 SUMMARY
Reconfigurable computing has now become a better alternative to Application-Specific
Integrated Circuits (ASICs) and fixed microprocessors. Besides, BCD (Binary Coded Decimal)
addition, being a basic arithmetic operation, is the main focus. The introduced
BCD adder is highly parallel, which mitigates the significant carry propagation delay of the
addition operation. As it is more convenient to convert from decimal to BCD than to binary,
the efficient FPGA-based BCD addition will subsequently influence the advancement in
computation and manipulation of decimal digits. Besides, the Field Programmable Gate Array
(FPGA) implementation will be beneficial in bit-wise manipulation, private
key encryption and decryption acceleration, heavily pipelined and parallel computation of
NP-hard problems, automatic target generation and many more applications.

REFERENCES
[1] A. K. Osama, M. A. Khaleel, Z. A. QudahJ, C. A. Papachristou, K. Mhaidat and
F. G. Wolff, “Fast binary/decimal adder/subtractor with a novel correction-free BCD
addition”, In Electronics, Circuits and Systems (ICECS), 18th IEEE International
Conference on, pp. 455–459, 2011.
[2] C. Sundaresan, C. V. S. Chaitanya, P. R. Venkateswaran, S. Bhat and J. M. Ku-
mar, “High speed BCD adder”, In Proceedings of the 2nd International Congress on
Computer Applications and Computational Science, pp. 113–118, 2012.
[3] Z. T. Sworna, M. U. Haque and H. M. H. Babu, “A LUT-based matrix multiplication
using neural networks”, IEEE International Symposium on Circuits and Systems
(ISCAS), pp. 1982–1985, 2016.
[4] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET (The Institution of Engineering and Technology)
Circuits, Devices & Systems, vol. 10, no. 3, pp. 1–10, 2015.
[5] P. Kenneth, R. Tessier and A. DeHon, “Birth and adolescence of reconfigurable
computing: A survey of the first 20 years of field-programmable custom computing
machines”, In Field-Programmable Custom Computing Machines (FCCM), IEEE
21st Annual International Symposium on, pp. 1–17, 2013.

[6] H. Liu and S. B. Ko, “High-speed parallel decimal multiplication with redundant
internal encodings”, IEEE Trans. on Computers, vol. 62, no. 5, pp. 956–968, 2013.
[7] N. Yonghai, Z. Guo, S. Shen and B. Peng, “Design of data acquisition and storage
system based on the FPGA”, Proc. Eng., vol. 29, pp. 2927–2931, 2012.
[8] G. Sutter, E. Todorovich, G. Bioul, M. Vazquez and J. P. Deschamps, “FPGA Im-
plementations of BCD Multipliers”, In International Conference on Reconfigurable
Computing and FPGAs, pp. 36–41, 2009.
[9] O. D. A. Khaleel, N. H. Tulic and K. M. Mhaidat, “FPGA implementation of binary
coded decimal digit adders and multipliers”, In 8th International Symposium on
Mechatronics and its Applications (ISMA), 2012.
[10] G. Shuli, D. A. Khalili, J. M. Langlois and N. Chabini, “Efficient Realization of BCD
Multipliers Using FPGAs”, International Journal of Reconfigurable Computing, 2017.
[11] G. Shuli, D. A. Khalili and N. Chabini, “An improved BCD adder using 6-LUT
FPGAs”, In 10th International Conference on New Circuits and Systems (NEWCAS),
pp. 13–16, 2012.
[12] G. Bioul, M. Vazquez, J. P. Deschamps and G. Sutter, “High-speed FPGA 10s Com-
plement Adders-subtractors”, Int. J. Reconfig. Comput., vol. 4, 2010.
[13] A. Vazquez and F. D. Dinechin, “Multi-operand Decimal Adder Trees for FPGAs”,
Research Report RR-7420, 2010.
[14] M. Vazquez, G. Sutter, G. Bioul and J. P. Deschamps, “Decimal Adders/Subtractors
in FPGA: Efficient 6-input LUT Implementations”, In Reconfigurable Computing and
FPGAs, 2009.
[15] M. Shambhavi and G. Verma, “Low power and area efficient implementation of BCD
Adder on FPGA”, In Signal Processing and Communication (ICSC), International
Conference on, pp. 461–465, 2013.
CHAPTER 17

Place and Route Algorithm


for Field Programmable
Gate Array

This chapter focuses on the algorithm which can be very efficient for the purpose to minimize
the delays introduced in the circuit because of placement and routing. Placing and routing
operations are performed when a Field Programmable Gate Array (FPGA) device is used
for implementation. The delay introduced by logic block and the delay introduced by
interconnection can be analyzed by the use of efficient place and route algorithm. The
placement algorithms use a set of fixed modules and the netlist describing the connections
between the various modules as their input. The output of the algorithms is the best possible
position for each module based on various cost functions which further reduce the cost and
power and increases the performances.

17.1 INTRODUCTION
Most of the available FPGAs are SRAM-based. They are volatile, so it is required to configure
them after each power-up. Generally, the FPGA design flow that maps designs onto an
SRAM-based FPGA consists of three phases. The first phase uses a synthesizer, which
transforms a circuit model coded in a hardware description language into an RTL design.
The second phase uses a technology mapper, which transforms the RTL design into a gate-level
model composed of look-up tables (LUTs) and flip-flops (FFs) and binds them to
the FPGA’s resources (producing the technology-mapped design). During the third phase,
the place and route algorithm uses the technology-mapped design to implement it on the FPGA.
The routing and placing operations may require a long execution time in the case
of complex digital systems, because complex operations are required to determine and
configure the required logic blocks within the programmable logic device, to interconnect
them correctly and to verify that the performance requirements specified during the design
are ensured. The delay introduced by the logic blocks and the delay introduced by the interconnections
can be analyzed by the use of an efficient place and route algorithm.

DOI: 10.1201/9781003269182-20



The placement algorithms use a set of fixed modules and the netlist describing the
connections between the various modules as their input. The output of the algorithms is the
best possible position for each module based on various cost functions. There can be one or
more cost functions depending on the design. The cost functions include maximum total wire
length, wire routability, congestion, performance, and I/O pad locations.

17.2 PLACING AND ROUTING


These operations are performed when an FPGA device is used for implementation.
Placing is the process of selecting the particular modules or logic blocks of the
programmable logic device which will be used for implementing the various functions
of the digital system. Routing consists in interconnecting these logic blocks using the
available routing resources of the device.

17.3 PARTITIONING ALGORITHM


The basic purpose of partitioning is to simplify the overall design process. The circuit is
decomposed into several sub-circuits to make the design process manageable. Algorithm
17.1 and Algorithm 17.2 list the names of different partitioning algorithms.

Algorithm 17.1 Partitioning Algorithms


1: Iterative partitioning algorithms
2: Spectral based partitioning algorithms
3: Net partitioning vs. module partitioning
4: Multi-way partitioning
5: Multi-level partitioning

Algorithm 17.2 Iterative Partitioning Algorithms


1: Greedy iterative improvement method
2: Kernighan-Lin
3: Fiduccia-Mattheyses
4: Krishnamurthy
5: Simulated Annealing
6: Kirkpatrick-Gelatt-Vecchi

17.4 KERNIGHAN-LIN ALGORITHM


The K-L (Kernighan-Lin) algorithm, first suggested in 1970, was used for bisecting graphs
in VLSI layout. It is an iterative algorithm: starting from a load-balanced initial bisection,
it first calculates, for each vertex, the gain in the reduction of edge-cut
that may result if that vertex is moved from one partition of the graph to the other. In every
inner iteration it moves the unlocked vertex having the highest gain from the partition with
more vertices to the partition with fewer vertices. Then the vertex
is locked and the gains are updated.

The procedure is repeated until all of the vertices are locked, even if the highest gain may
be negative. The last few moves that had negative gains are then undone and the bisection is
reverted to the one with the smallest edge-cut so far in this iteration. At this point one outer iteration
of the K-L algorithm is completed and the iterative procedure is restarted again. If an outer
iteration results in no reduction in the edge-cut or load imbalance, the algorithm
is terminated.
The K-L algorithm is a local optimization algorithm with a capability for making moves
with negative gain.

17.4.1 How K-L Works


Let G(V, E) be a graph, where V is the set of nodes and E is the set of edges. The
algorithm attempts to find a partition of V into two disjoint subsets A and B of equal (or
unequal) size such that the sum T of the weights of the edges between nodes in A and B is
minimized.
Let I_a be the internal cost of a, that is, the sum of the costs of edges between a and
other nodes in A, and let E_a be the external cost of a, that is, the sum of the costs of edges
between a and nodes in B. Furthermore, let D_a = E_a − I_a be the difference between
the external and internal costs of a. If a and b are interchanged, then the reduction in cost
is:
Told − Tnew = Da + Db − 2Ca,b

where C_a,b is the cost of the possible edge between a and b.

The algorithm attempts to find an optimal series of interchange operations
between elements of A and B which maximizes Told − Tnew and then executes the operations,
producing a partition of the graph into A and B.
All possible bisections are tried and the best one is chosen. If there are 2n vertices, then
the number of possibilities is (2n)!/(2(n!)²). For 4 vertices (A, B, C, D), there are three
possibilities (a sketch of one K-L pass follows the list below):
1. X = (A, B) and Y = (C, D)
2. X = (A, C) and Y = (B, D)
3. X = (A, D) and Y = (B, C)
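A hedged Python sketch of a single K-L pass on a small weighted graph is given below (the adjacency is assumed to be a symmetric dict of dicts, and all names are illustrative). It computes the D-values, repeatedly picks the unlocked pair with the largest gain, locks it, updates the D-values, and finally applies the prefix of swaps with the best cumulative gain:

def kl_pass(adj, A, B):
    """One Kernighan-Lin pass; adj is a symmetric dict of dicts of edge weights."""
    A, B = set(A), set(B)
    w = lambda u, v: adj.get(u, {}).get(v, 0)
    # D_v = external cost - internal cost for every vertex
    D = {v: sum(w(v, u) for u in (B if v in A else A)) -
            sum(w(v, u) for u in (A if v in A else B) if u != v)
         for v in A | B}
    gains, pairs = [], []
    ua, ub = set(A), set(B)                          # unlocked vertices on each side
    while ua and ub:
        g, a, b = max(((D[x] + D[y] - 2 * w(x, y), x, y) for x in ua for y in ub),
                      key=lambda t: t[0])
        gains.append(g)
        pairs.append((a, b))
        ua.discard(a)
        ub.discard(b)                                # lock the chosen pair
        for x in ua | ub:                            # update D as if a and b were swapped
            D[x] += 2 * (w(x, a) - w(x, b)) * (1 if x in A else -1)
    # apply the prefix of swaps whose cumulative gain is the largest (and positive)
    best_k, best_sum, run = 0, 0, 0
    for k, g in enumerate(gains, 1):
        run += g
        if run > best_sum:
            best_k, best_sum = k, run
    for a, b in pairs[:best_k]:
        A.remove(a); B.add(a); B.remove(b); A.add(b)
    return A, B

adj = {1: {2: 1, 3: 2}, 2: {1: 1, 4: 1}, 3: {1: 2, 4: 1}, 4: {2: 1, 3: 1}}
print(kl_pass(adj, {1, 2}, {3, 4}))   # e.g. ({2, 4}, {1, 3}); the cut cost drops from 3 to 2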

17.4.2 Implementation of K-L Algorithm


Now the above application is converted to eight nodes, as mentioned above in the design
aspects, according to the K-L algorithm. The nodes are:
Node 1: AND gate having inputs A1, B1
Node 2: XOR gate having S3 as output
Node 3: AND gate having S2 as output
Node 4: XOR gate
Node 5: AND gate having inputs A1, B0
Node 6: AND gate having inputs A0, B1
Node 7: AND gate
Node 8: AND gate having inputs A0, B0

17.4.3 Steps of Algorithm


Step 1: Initialization
Let the initial partition be a random division of the vertices into the partitions A = {1, 2, 5, 8}
and B = {3, 4, 6, 7}. Here let A1 = A = {1, 2, 5, 8} and B1 = B = {3, 4, 6, 7}.
Step 2: Compute D-Values

Step 3: Compute Gains

The largest value of G is G53 = 3, so the pair (a1, b1) = (5, 3) is chosen, and A1 = A1 − {5} = (1, 2, 8) and B1 =
B1 − {3} = (4, 6, 7) are considered.
New A1 = (1, 2, 8) and B1 = (4, 6, 7). Both A1 and B1 are not empty, so the D values are updated
in the next step and the procedure is repeated from Step 3.
Step 4: Update D-Values of Nodes Connected to (5, 3)
The vertices connected to (5, 3) are vertex (1) in set A1 and vertices (4, 7) in set B1.
The new D-values for the vertices of A1 and B1 are given by

Here the G value G17 is the largest. Hence the pair (a2, b2) is (1, 7).
A1 = A1 − {1} = (2, 8) and B1 = B1 − {7} = (4, 6).
The new D values are

The last pair (a3, b3) is (1, 7) and the corresponding gain is G17.
Step 5: Determine the Values of X and Y
X = a2 = 1 and Y = b2 = 7.
The new partition obtained from moving X to B and Y to A is A = {2, 5, 7, 8}
and B = {1, 3, 4, 6}. The entire procedure is repeated again with this new partition as the
initial partition. Verify that the second iteration of the algorithm is also the last, and that
the best solution obtained is A = {2, 5, 7, 8} and B = {1, 3, 4, 6}.
The overall procedure is repeated, with the gain having the maximum value taken, and then the
cut size is calculated. Thereafter, when the second pass is implemented, the nodes with the
maximum gain are locked. This process is repeated for all the passes and combinations.
At the end of the above process we get the minimum cut size, which in turn reduces the
wire delay and increases the performance.

17.5 SUMMARY
The K-L (Kernighan-Lin) algorithm increases the performance by reducing the wire delay.
Further work is necessary on the use of K-L-feasible cuts for optimization purposes.
Analysis of an efficient place and route algorithm would be done in order to
place the components efficiently and create a proper routing path between them on FPGAs
(Field Programmable Gate Arrays). In this chapter, a new methodology is presented for
digital circuits which in turn reduces the area and increases the performance for the
problem of hardware or software partitioning.

REFERENCES
[1] L. Sterpone and M. Violante, “A New Reliability-Oriented Place and Route Algorithm
for SRAM-Based FPGAs”, IEEE Trans. on Computers, vol. 55, no. 6, 2006.
[2] O. Martinello, F. S. Marques, R. P. Ribas and A. I. Reis, “KL-Cuts: A New Approach
for Logic Synthesis Targeting Multiple Output Blocks”, pp. 777–782.
[3] A. M. Fahim, “Low-Power High-Performance Arithmetic Circuits and Architectures”,
IEEE Journal of Solid-State Circuits, vol. 37, no. 1, pp. 90–94, 2002.
[4] S. Brown, “FPGA Architecture Research: A Survey”, IEEE Design and Test of
Computers, pp. 9–15, 1996.
[5] J. Rose, A. E. Gamal and A. Sangiovanni-Vincentelli, “Architecture of Field-
Programmable Gate Arrays”, Proc. IEEE, vol. 81, no. 7, pp. 1013–1029, 1993.
[6] V. Udar and S. Sharma, “Analysis of place and route algorithm for field programmable
gate array (FPGA)”, In IEEE Conference on Information & Communication
Technologies, pp. 116–119, 2013.
CHAPTER 18

LUT-Based BCD Multiplier Design

BCD (Binary Coded Decimal), being an accurate and human-readable representation that is easy to convert, prevails in computing and electronic communication. In this chapter, an (N×M)-digit BCD multiplication algorithm is introduced that reduces the complex steps of the conventional multiplication process. A BCD multiplier is more effective with a LUT (Look-Up Table)-based design, owing to the numerous benefits and applications of FPGA (Field Programmable Gate Array) technology. Hence, a compact LUT circuit architecture with new selection, read and write operations is presented. Afterwards, a cost-efficient N×M-digit multiplier circuit is demonstrated along with a 1-digit LUT-based direct multiplier circuit.

18.1 INTRODUCTION
BCD (Binary Coded Decimal) representation is advantageous due to its finite place-value representation, rounding, easy scaling by a factor of 10, and simple alignment and conversion to character form. It is widely used in embedded applications, digital communication and financial calculations. Hence, a faster and more efficient BCD multiplication method is desired. In this chapter, an N×M-digit multiplication method is introduced which omits the complex manipulation steps, reducing the area, power and delay of the whole circuit. The advancement of FPGA technology has opened a new horizon of technological progress due to long-term availability, rapid prototyping capability, reliability and hardware parallelism. The cost of making incremental changes to FPGA designs is negligible compared to the large expense of re-spinning an ASIC. The application of FPGAs in cryptography, NP-hard optimization problems, pattern matching, bioinformatics, floating-point arithmetic and molecular dynamics is increasing radically. Due to its re-configurability, FPGA implementation of BCD multiplication is of particular interest. An FPGA has three main elements: LUTs, flip-flops and the routing matrix.
There are two primary methods in traditional computing for the execution of various algorithms. The first is to use an Application Specific Integrated Circuit (ASIC) to perform the operations in hardware. ASICs are designed specifically to perform a given computation, and they are very fast and efficient when executing the exact computation for which they were designed; however, after fabrication the circuit cannot be altered. Second, microprocessors are a far more flexible solution in terms of reusability. Processors execute a set of instructions to perform a computation, so by changing the software instructions, the functionality of the system is altered without changing the hardware. Nevertheless, performance degrades along with this flexibility in comparison with ASICs, since the processor must read each instruction from memory, decode it and only then execute it, which results in a high execution overhead for each individual operation.
Reconfigurable computing is intended to fill the void between hardware and software, achieving potentially much higher performance than software while maintaining a higher level of flexibility. Reconfigurable computing has become the subject of a great deal of research nowadays due to its versatile potential to greatly accelerate a wide variety of applications. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. The Field Programmable Gate Array (FPGA) is a semiconductor device that can be programmed after manufacturing; therefore, it is a great means for reconfigurable computing. Multiplication is a fundamental operation used intensively throughout computing activities. There are plentiful number representations in computer organization. Binary representation of numerical values gives more advantages for computation in computer-based systems, whereas decimal representation offers more human friendliness. In this contrast, Binary Coded Decimal plays a middleware role which provides an intuitive mechanism to convert to and from human-readable decimal characters.
A robust and optimized circuit design must deal with the delay of the circuit, which is a crucial comparison parameter for measuring performance alongside area and power consumption. A faster logical circuit with accuracy involves a trade-off in the design. These endless fields of modern reconfigurable computing are taken into consideration, with the aim of exploring new features in reconfigurable computation as well as the architectural issues of circuits involving BCD multipliers, motivating the design of an efficient BCD multiplier using FPGAs.
Bringing new products to market in the shortest possible time has become the catchword of today's electronics industry. Being able to test a product even before fabrication, the Field Programmable Gate Array (FPGA) has become an extremely useful medium in digital circuit design; it enables the designer to avoid design pitfalls before synthesis. In the decade since FPGAs were invented they have created many new opportunities, perhaps the most exciting of which is reconfigurable computing. Usually, an FPGA consists of an array of programmable logic blocks, interconnects (routing channels) and I/O cells. Logic blocks can be configured to implement sequential and combinational functions, which influence the speed and density of the FPGA. As FPGAs are ten times less dense than mask-programmed gate arrays, researchers aim to explore new, efficient configurable logic blocks so that the density gap becomes as small as possible.
The most popular logic blocks of FPGAs are based on Look-up Tables (LUTs) and the design from Plessey. A Look-up Table can implement any logical function defined by its inputs. With more inputs, a LUT can implement more logic, hence fewer logic blocks are needed. This also helps routing by requiring less area, since there are fewer connections to route between the logic blocks. With the growing popularity of decimal computer arithmetic in scientific, commercial, financial and Internet-based applications, hardware realization of decimal arithmetic algorithms is gaining more importance. Hardware decimal arithmetic units now serve as an integral part of some recently commercialized general purpose processors, where complex decimal arithmetic operations, such as multiplication, have been realized by rather slow iterative hardware algorithms. With the rapid advances in very large scale integration (VLSI) technology, semi- and fully-parallel hardware decimal multiplication units are expected to evolve soon. As mentioned earlier, the dominant representation for decimal digits is the BCD encoding. The BCD-digit multiplier can serve as the key building block of a decimal multiplier, irrespective of the degree of parallelism. A BCD-digit multiplier produces a two-BCD-digit product from two input BCD digits. The aim here is to provide a design for a parallel BCD multiplier showing some advantages in BCD multiplier implementations using compact LUTs of FPGA.
Five main focuses are addressed in this chapter:
1. A 1 × 1-digit LUT-based direct BCD multiplication method is introduced to avoid the recoding, partial product generation, partial product reduction and conversion steps of the conventional multiplication method.
2. An N×M-digit BCD multiplication algorithm is introduced, which has reduced recoding, partial product reduction and conversion steps.
3. An efficient method of selection of memory cells, and of Read and Write operations on the desired memory of the LUT, is presented.
4. A new LUT architecture with more accuracy and less hardware complexity is introduced.
5. A new N×M-digit BCD multiplication circuit with a significantly reduced number of LUTs, area and delay is elucidated.

18.2 BASIC PROPERTIES


In this section, basic properties and ideas related to BCD multiplication methodology along
with LUT are presented with illustrative figures and examples. Besides, the comparison
parameters like area, power and delay are formally defined along with the memory unit of
LUT which is a memristor.

Property 18.2.1 BCD multiplier uses BCD numbers as input and provides BCD output,
where each decimal digit is represented with a 4-bit binary code (with weights 8, 4, 2
and 1). For example, the decimal number (2549)10 is represented in BCD as (0010 0101
0100 1001)BCD . The multiplication of two BCD numbers produces partial products. After
having all the partial products they are added to get the binary output. Then, the binary
output is converted to BCD output.

Example 18.1 BCD multiplication of (29)10 = (0010 1001)BCD by (5)10 = (0101)BCD produces four partial products. The addition of the partial products results in a binary output which needs to be converted into BCD representation. After conversion, the BCD value found is 0001 0100 0101, which is 145 in decimal.

Table 18.1: Truth Table of Function f

A B Output
(f)
0 0 0
0 1 1
1 0 1
1 1 1

Property 18.2.2 A look-up table (LUT) is a memory block with a one-bit output that
essentially implements a truth table where each input combination generates a certain logic
output. The input combination is referred to as an address. The output of the LUT is the
value stored in the indexed location of the selected memory cell. Since the memory cells in
the LUT can be set to anything according to the corresponding truth table, an N-input LUT
can implement any logic function.

Example 18.2 When implementing any logic function, a truth table of that logic is mapped to the memory cells of the LUT. Suppose Equation 18.1 is implemented, where ‘|’ represents the logical OR operation; Table 18.1 represents the truth table of the function. Fig. 18.1 shows the gate representation and the LUT representation of the logic function. The output is generated for the corresponding input combination; for example, for the input combination 1 and 0, the output is 1.

f = (A.B)|(A ⊕ B) (18.1)

Figure 18.1: LUT Implementation of a Logic Function.
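As a behavioural illustration (not the memristor-level circuit presented later in this chapter), the truth table of Equation 18.1 can be modelled as a small memory array addressed by the two inputs; the Python names below are illustrative.

    # The four memory cells hold the truth table of f = (A.B) | (A XOR B);
    # the input combination forms the address of the cell that is read.
    lut_cells = [0, 1, 1, 1]            # cells for (A, B) = 00, 01, 10, 11

    def lut_read(A, B):
        return lut_cells[(A << 1) | B]  # address = A concatenated with B

    for A in (0, 1):
        for B in (0, 1):
            assert lut_read(A, B) == ((A & B) | (A ^ B))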

Property 18.2.3 The area of a logic circuit is the total area of the individual circuit elements. If a circuit consists of n gates and the areas of those n gates are A1, A2, . . . , An, then by the above definition the area (A) of that circuit is as follows:

A = Σ_{i=1}^{n} Ai, where i = 1, 2, 3, . . . , n        (18.2)

Example 18.3 A half adder consists of one 2-input Ex-OR gate and one 2-input AND gate. Using a CMOS open cell library, the areas of a 2-input Ex-OR gate and a 2-input AND gate are 1.6 µm² and 1.2 µm², respectively. So, the area of a half adder circuit is (1.6 + 1.2) µm² = 2.8 µm².

Property 18.2.4 The total power of a circuit can be calculated by summing the individual power consumption of each gate. To calculate the power of a single gate, the current obtained from Microwind DSCH and the voltage across the gate are used with the following formula:

P = V × I; where P = Power, V = Voltage and I = Current; (18.3)

If the individual powers of the gates are P1, P2, . . . , Pn, then the total power P of the circuit can be calculated using the following equation:

P = Σ_{i=1}^{n} Pi, where i = 1, 2, 3, . . . , n        (18.4)

Example 18.4 Suppose a half adder circuit is considered which consists of an AND gate and an Ex-OR gate that are constituted of 6 and 8 transistors, respectively. Using Microwind DSCH, the threshold voltage for this circuit is found to be 0.5 V and the current passing through the transistors is 0.1 mA. Hence, the power consumed by a single transistor is (0.5 × 0.1) mW = 0.05 mW. Therefore, a half adder requires (14 × 0.05) mW = 0.7 mW of power.

Property 18.2.5 Delay of a combinational circuit is the critical path delay, which can be
defined as the summation of gate delay of each gate in that critical path. Critical path is the
longest path from an input to an output which causes low input to high output or vice versa.
If T 1 , T 2 , T 3 , . . . , T n are the gate delays of the gates G1 , G2 , G3 , . . . , G n on the critical
path, respectively then delay T Delay of the circuit is:

T Delay = T 1 + T 2 + T 3 + · · · + T n (18.5)

Example 18.5 A full adder consists of two 2-input Ex-OR gates and one 3-input AND gate. The delays of a 2-input Ex-OR gate and a 3-input AND gate are both 0.160 ns. The critical path of a full adder consists of the two Ex-OR gates. So, the delay of a full adder is (0.160 + 0.160) ns = 0.320 ns, which is illustrated in Fig. 18.2.

Figure 18.2: Critical Path Delay Determination of a Full Adder Circuit.

Property 18.2.6 Memristor (Memory Resistor) is a non-linear and non-volatile memory


device which has every right to be as basic as the three classical circuit elements, namely,
the resistor, inductor and capacitor. It is basically a two-terminal device whose resistance
material is titanium dioxide (TiO2 ). When a voltage is applied across the platinum elec-
trodes, oxygen atoms in the material diffuse left or right, depending on the polarity of the
voltage, which makes the material thinner or thicker, thus causing a change in resistance.
When the voltage is turned off, the resistance remains as it did just before it was turned off.
In Fig. 18.3, the cross section of a memristor cell is shown.

The memristor being a non-linear device, it has a non-linear functional relationship between the magnetic flux linkage φm(t) and the amount of electric charge q(t) that has flowed:

f(φm(t), q(t)) = 0        (18.6)

Property 18.2.7 Memristance is the charge-dependent rate of change of flux with charge, which is the electronic property of the memristor. The memristance function is found from Equation 18.6 by substituting the flux by the time integral of the voltage, and the charge by the time integral of the current:

M(q(t)) = (dφm/dt) / (dq/dt) = V(t) / I(t)        (18.7)

18.3 THE ALGORITHM


In this section, the BCD multiplication algorithm and a LUT architecture are described. Then, a new LUT-based BCD multiplier circuit is constructed. Essential figures and lemmas are presented to clarify the ideas. First, the design of a BCD multiplier using the Look-Up Table of an FPGA is described elaborately in the next subsection.

Figure 18.3: Memristor.

18.3.1 The BCD Multiplication Method


Since a BCD digit ranges over 0–9, the multiplication of two BCD digits has 100 possible input combinations with corresponding outputs. For 1 × 1-digit BCD multiplication, the output product can be at most 2 digits (8 bits), say {P7, . . . , P0}. For each bit of the output, a function is generated; some of these can be efficiently implemented using logic gates while others require the use of LUTs. The equations of the logic-gate-based function implementations are as follows, where dot (.) means logical AND and (|) means logical OR.

P0 = A0 .B0 (18.8)

P6 = (A1.A2.B1.B2), if (A3|B3) = 0;
P6 = (B0.B2).(B1.B2).(B̄0.B̄1.B̄2.B̄3), if (A3|B3) = 1 and A3 = 1;        (18.9)
P6 = (A0.A2).(A1.A2).(Ā0.Ā1.Ā2.Ā3), otherwise.

P7 = (A0 .A3 .B0 .B3 ) (18.10)

The functions of the product bits {P5, . . . , P1} are implemented in LUTs. All the possible input–output combinations are verified using the truth table shown in Table 18.2. The conventional multiplier circuit uses several steps, which are as follows:
1. Recoding
2. Partial product generation
3. Partial product reduction
4. Conversion to BCD representation

Table 18.2: The Truth Table of 1-Digit Multiplication

For N×M-digit multiplication there are basically two steps:

1. (N×M) partial products PP are generated, where each node is the distinct product of an individual digit of the multiplier and an individual digit of the multiplicand, computed in parallel.

2. Then, the (N + M)-digit product P is formed, where each digit of the product is the sum of the respective digit representations of the nodes.

Suppose A is the N-digit multiplier and B is the M-digit multiplicand, where A = AN . . . A2A1 and B = BM . . . B2B1. The output is the (N + M)-digit product P, where P = PN+M . . . P2P1. The algorithm for N×M-digit multiplication is elucidated in Algorithm 18.1. The intermediary partial products are represented as PPi(xi+j, yi+j−1), where x is the tens digit and y is the units digit. The partial products are generated by multiplying individual multiplier and multiplicand digits, and are as follows:

PP1(x2, y1) = (B1 × A1),
PP1(x3, y2) = (B1 × A2),
. . .
PP1(xN+1, yN) = (B1 × AN),
PP2(x3, y2) = (B2 × A1),
. . .
PPM(xM+N, yM+N−1) = (BM × AN).

Algorithm 18.1: Algorithm for N×M -digit BCD Multiplication
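The body of Algorithm 18.1 is not reproduced here; the following Python sketch illustrates the two steps described above, i.e., digit-wise generation of the tens/units partial-product digits followed by column-wise accumulation into the (N + M)-digit product. The function and variable names are illustrative.

    # Step 1: every pair of digits gives a two-digit partial product (x, y).
    # Step 2: the x and y digits are accumulated column-wise into the product.
    def bcd_multiply(a_digits, b_digits):
        """Digits are given least significant first, e.g. 1234 -> [4, 3, 2, 1]."""
        n, m = len(a_digits), len(b_digits)
        columns = [0] * (n + m)
        for j, b in enumerate(b_digits):
            for i, a in enumerate(a_digits):
                p = a * b                      # 1 x 1-digit product, at most 81
                columns[i + j] += p % 10       # units digit y
                columns[i + j + 1] += p // 10  # tens digit x
        carry, product = 0, []
        for c in columns:                      # resolve the column sums
            c += carry
            product.append(c % 10)
            carry = c // 10
        return product                         # least significant digit first

    print(bcd_multiply([4, 3, 2, 1], [3, 2, 2]))   # 1234 x 223 -> [2, 8, 1, 5, 7, 2, 0]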

18.3.2 The LUT Architecture


A LUT consists of two basic parts: (i) the controller circuit; and (ii) the memory unit. The controller circuit sends the corresponding Read or Write voltage to accomplish the Read/Write operation. A memristor is used as the memory unit since, besides being non-volatile, it provides better area and power efficiency than other non-volatile memory units. Read and Write operations can never run simultaneously, so Ex-OR and OR gates are used to select only one operation at a time, avoiding an ambiguity that is still prevalent in existing LUTs. The selection of the memory cell is performed depending on the two inputs of the LUT, which refer to the corresponding memory addresses in the memory array; the array consists of four nanocross wires with a memristor connected at each junction. The transmission gates connected to each memristor propagate the operational voltage either to Write 1/0 into the memory or to Read the corresponding memory cell and disseminate the Read value to the output terminal O1 or O2 through the memristor. The LUT is shown in Fig. 18.4 and the algorithm for the construction of a 2-input LUT is given in Algorithm 18.2.

Figure 18.4: 2-Input LUT Circuit Architecture.

As Reset is nothing but the Write 0 operation, there is no difference between these two operations when performed on a single memory cell, which removes hardware complexity. Besides, instead of using the conventional Wl (Word Line) and Bl (Bit Line), the direct use of the LUT inputs with inverters reduces the controller circuitry overhead, yielding efficiency in area, power and delay. Op-amps are used for noise-free Read and Write voltages. The operational features of the LUT are presented in Table 18.3 for both Read and Write operations, where RP represents the Read Pulse and D0 and D1 represent Data 0 and Data 1, respectively.
For a Write operation, the Write Enable pulse voltage is high, which acts as an input bit for the transmission gate to pass the data bit (Data0/Data1). A crossbar array selects the particular memory cell Mi, where i <= 4. The initial memristance of Mi is first considered ROFF for a Write 1 operation and RON for a Write 0 operation. Then, a pulse of +Vdd/−Vdd is applied through Vin until the memristance changes state. Thus, a logic 1/0 is successfully written to Mi.
For a Read operation, the Read Enable pulse voltage and the Read Pulse (+Vdd/−Vdd) are both high and are propagated to the particular memory cell Mi by using the crossbar arrays and the transmission gate. To perform the Read 0/Read 1 operation, a positive pulse of +Vdd (the Read pulse) is applied through the transmission gate to the memristor and the Read value is found at the output terminal O1 or O2. For the Read 0 operation, assuming the NSP (memristance) of the memristor is zero, the Read slightly changes the NSP of the memristor toward the value of RON. To restore the NSP to its original value, a Restore pulse of Vdd is applied.

Algorithm 18.2: Algorithm for the Construction of 2-Input LUT Circuit
Similarly, larger-input LUTs can be designed. The design of the BCD multiplier requires a heterogeneous LUT architecture with 5-input and 6-input LUTs, whose architectures are exhibited in Figs. 18.5 and 18.6. A 5-input LUT has 2^5 = 32 memory cells and a 6-input LUT has 2^6 = 64 memory cells, each with a common controller circuit. A 5-input LUT has a single output (O5) and a 6-input LUT has two outputs, O5 and O6. To make the design area efficient, a 3D layer structure is used: the 32 memory cells are arranged in two layers and the 64 memory cells in four layers, where each layer has sixteen memory cells. For a particular layer, a row is selected by inputs A and B using the selection circuit of a 2-input LUT, and a column is selected in the same way using the two inputs C and D. As there are four layers in the 6-input LUT, a particular layer is selected using inputs E and F, while only E is used in the 5-input LUT to select one of its two layers. A conventional 6-input LUT requires 100 memristors, where 64 memristors are required for the memory cell unit and 36 memristors for the reference cells. On the other hand, the presented 6-input LUT requires a total of 64 memristors and no additional memristors for reference cells. A property supporting the generalization of the 2-input LUT circuit is given in Property 18.3.1.

Table 18.3: Write and Read Operations

Figure 18.5: 5-Input LUT Circuit.

Property 18.3.1 An n-input LUT requires at least 2^(n−2) 2-input LUTs, where n >= 2.

Proof 18.1 The statement is proved by mathematical induction.

Basis: The basis case holds for n = 2, as 2^(2−2) = 1.

Hypothesis: Assume that the statement holds for n = k. So, a k-input LUT consists of 2^(k−2) 2-input LUTs.

Induction: Now consider n = k + 1. A (k + 1)-input LUT requires 2^(k+1−2) = 2^(k−1) 2-input LUTs. Reducing the number of inputs by one gives n = k, and a k-input LUT then requires 2^(k−1−1) = 2^(k−2) 2-input LUTs, which agrees with the hypothesis. So the statement holds for n = k + 1. Therefore, for n >= 2, an n-input LUT consists of 2^(n−2) 2-input LUTs, which completes the proof.

Example 18.6 For n = 5, a 5-input LUT requires 2^(5−2) = 8 2-input LUTs.

Figure 18.6: 6-Input LUT Circuit.

18.4 LUT-BASED BCD MULTIPLIER CIRCUIT


The 1 × 1-digit BCD multiplier circuit is designed using heterogeneous-input LUTs. Equations 18.8, 18.9 and 18.10 are implemented using logical AND gates with different numbers of inputs. If A3 or B3 is one then the block of 5-input LUTs is activated, otherwise the block of 6-input LUTs is activated. The 5-input LUTs give the BCD output for all the possible combinations involving (8)10 and (9)10, either as multiplier, as multiplicand or as both, whereas the 6-input LUTs provide the BCD output for the other possible combinations. Finally, the output is obtained through 2-to-1 MUX gates using Equation 18.11. The construction method of the 1 × 1-digit BCD multiplier is presented in Algorithm 18.3 and the circuit is shown in Fig. 18.7. The design of the multiplier circuit using a Virtex 5/6 slice is shown in Fig. 18.8.
O4O3O2O1O0 = B4B3B2B1B0A0 if A3 = 1, and A4A3A2A1A0B0 otherwise.        (18.11)

Using the 1 × 1-digit BCD multiplier and a BCD adder, an N×M-digit multiplier can be designed. By implementing the N×M-digit multiplication method of Algorithm 18.1, the circuit can be constructed with less area and delay. The block diagram of the N×M-digit multiplier is presented in Fig. 18.9, and Property 18.4.1 establishes the generalized required number of LUTs.

Algorithm 18.3: Algorithm for the Construction of BCD Multiplier Circuit for 1-Digit
Multiplication

Property 18.4.1 An N×M-digit multiplier requires at least λ(N×M) + K(N + M − 2) LUTs.

Proof 18.2 The N×M multiplication technique is accomplished in two steps. The first step deals with the generation of partial products using 1 × 1-digit multipliers. A 1 × 1-digit multiplier produces at most two BCD digits (X and Y), since the multiplication of the two highest BCD digits (9 and 9) produces 81, i.e., 9 × 9 = 81 where X = 8 and Y = 1. In general, the 1 × 1 multiplication can be represented as:

Ai × Bj = XaYb;
where i = 1, 2, 3, . . . , N;
j = 1, 2, 3, . . . , M and
a, b = 1, 2, 3, . . . , (N + M)

Figure 18.7: BCD Multiplier Circuit for 1-Digit Multiplication.

Since each of the multiplicand digits B1, B2, B3, . . . , BM is multiplied by all the multiplier digits A1, A2, A3, . . . , AN, a total of N×M 1 × 1-digit multiplier circuits is required, producing (N×M) partial products PPi, where
PPi = PPmX(m+n) PPmY(m+n−1), with i = 1, 2, 3, . . . , (N×M).
Each 1 × 1-digit multiplier circuit is constructed with λ LUTs in a heterogeneous architecture, where
λ = α + β, with
α = the number of 6-input LUTs and
β = the number of 5-input LUTs.
Therefore, the first stage requires a total of λ(N×M) LUTs. In the second stage of the multiplication technique, the final product Pq of the multiplication Ai × Bj is derived as follows:

Figure 18.8: BCD Multiplier Circuit Design Using Virtex 5/6 Slice for 1-Digit.

P1 = PP1Y1
P2 = Σ_{r=1}^{M} (PPrX2 + PPrY2)
P3 = Σ_{r=1}^{M} (PPrX3 + PPrY3 + Carry2)

Figure 18.9: Block Diagram of the BCD Multiplier for N×M -Digit Multiplication.

For the generation of the initial BCD product digit P1, the (N×M)-digit multiplier does not require any BCD adder, since P1 is obtained directly from the partial product PP1Y1. In the same way, the final BCD product digit PM+N can be obtained using a half adder circuit. Hence, a total of (N + M − 2) N-digit BCD adders is required in the second step. An N-digit BCD adder requires K LUTs of the α type (6-input LUTs). So, the second stage of the multiplication technique requires a total of K(N + M − 2) LUTs. Therefore, over the two stages, the multiplier circuit requires λ(N×M) + K(N + M − 2) LUTs.
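A small helper makes the bound explicit; λ (LUTs per 1 × 1-digit multiplier) and K (LUTs per N-digit BCD adder) are left as parameters, since their exact values depend on the chosen 5/6-input LUT decomposition. The value λ = 10 used below matches the count used in Example 18.7, while K = 4 is only an assumed figure.

    # Lower bound of Property 18.4.1: lam*(N*M) LUTs for the partial-product
    # stage plus K*(N+M-2) LUTs for the BCD-adder stage.
    def lut_lower_bound(N, M, lam, K):
        return lam * (N * M) + K * (N + M - 2)

    # e.g. a 4 x 3-digit multiplier with lam = 10 and an assumed K = 4:
    print(lut_lower_bound(4, 3, 10, 4))   # -> 140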
For large-digit input combinations there is a possibility of the same input digit occurring multiple times, which can be used as an opportunity to reduce the circuit complexity by reducing the active number of LUTs. The reduced hardware complexity is proved in Property 18.4.2.

Property 18.4.2 If σ is the total number of repeating digits in the multiplicand B, then the total number of merged LUTs in N×M multiplication is δ = f(σ, M − σ + ω), where M is the number of digits in the multiplicand B, ω is the total number of different digits that repeat in the multiplicand B, and f(σ, M − σ + ω) is the function that calculates the total number of merged LUTs.

Proof 18.3 The N×M multiplication requires at least λ(N×M) + K(N + M − 2) LUTs, as proved in Property 18.4.1, where each of the partial products PP is generated as follows:

PPi = PPMX(M+N) PPMY(M+N−1),
where i = 1, 2, 3, . . . , (N×M) and
N = the number of digits in the multiplier A.
Suppose σj, σ(j+1), . . . , σM are the repeated identical digits in the multiplicand B. They produce the same partial products, since the product of each of σj, σ(j+1), . . . , σM with A1, A2, A3, . . . , AN always gives the same result, which can be represented as follows:
Ai × Bj = XaYb

where i = 1, 2, 3, . . . , N;
j = 1, 2, 3, . . . , M and
a, b = 1, 2, 3, . . . , (N + M).
So, these products can be generated with a single multiplier having λ LUTs. Hence, σ digits can be eliminated from the M multiplicand digits. Suppose ω is the total number of different repeating digits in M. Since the multiplication of the multiplier A with each group of identical σ digits can be generated from a single 1 × 1-digit multiplier circuit, ω such 1 × 1-digit multiplier blocks can sufficiently serve the purpose of the multiplication. Therefore, the total number of merged LUTs can be calculated from the function f(σ, M − σ + ω). The efficiency of the merged-LUT approach can be asserted by the pigeonhole principle, which states that, for natural numbers k and m, if n = km + 1 objects are distributed among m sets, one of the sets will contain at least k + 1 objects. BCD digits range over 0 to 9, so the total number of permitted digits is 10. Hence, if the multiplicand has more than 10 digits, at least k + 1 digits are surely repeated, and the probability of repetition increases with larger multiplicands.
Example 18.7: Suppose two BCD numbers A and B are to be multiplied, where A = 1234 and B = 223. In this approach, 10 × (4 × 3) = 120 LUTs are required in the multiplication stage. In the multiplicand B, the digit “2” is repeated twice. Therefore, the partial product (1234 × 2) can be generated in a single multiplier circuit, reducing the requirement by 40 LUTs (to two-thirds of the 120), which is demonstrated in Fig. 18.10.

Figure 18.10: Merging LUTs in N×M Multiplication.
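The merging idea can be sketched in Python by caching the row of 1 × 1-digit products computed for each distinct multiplicand digit, so a repeated digit such as the “2” of B = 223 costs no additional multiplier block in the model; the names used are illustrative, not the book's circuit.

    # Partial-product rows are cached per distinct multiplicand digit and re-used.
    def bcd_multiply_merged(a_digits, b_digits):
        n, m = len(a_digits), len(b_digits)
        rows = {}                                  # multiplicand digit -> product row
        columns = [0] * (n + m)
        for j, b in enumerate(b_digits):
            if b not in rows:                      # compute each distinct row only once
                rows[b] = [a * b for a in a_digits]
            for i, p in enumerate(rows[b]):
                columns[i + j] += p % 10
                columns[i + j + 1] += p // 10
        carry, product = 0, []
        for c in columns:
            c += carry
            product.append(c % 10); carry = c // 10
        return product, len(rows)                  # digits (LSB first), distinct rows used

    digits, distinct = bcd_multiply_merged([4, 3, 2, 1], [3, 2, 2])
    print(digits, distinct)   # -> [2, 8, 1, 5, 7, 2, 0] 2  (only 2 of the 3 rows computed)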



18.5 SUMMARY
This chapter details the designs and working procedures of LUTs (Look-Up Tables) and parallel BCD (Binary Coded Decimal) multipliers. Several lower bounds on the designed architectures have been described. The LUT being one of the main components of an FPGA (Field Programmable Gate Array), a LUT-based multiplier is presented. For the FPGA-based BCD multiplier, an efficient area-minimal 2-input LUT circuit is presented, where the improvement of the 2-input LUT consequently improves the larger-input LUTs. Besides, as 5- and 6-input LUTs are needed for the design of the multiplier, efficient design architectures for 5- and 6-input LUTs using the 2-input LUT circuit principle are also presented. Finally, the BCD multiplier circuit is designed using the presented LUTs, demonstrating its cost-efficiency.

REFERENCES
[1] O. D. A. Khaleel, N. H. Tulic and K. M. Mhaidat, “FPGA implementation of binary
coded decimal digit adders and multipliers”, In Mechatronics and its Applications
(ISMA), 8th International Symposium on, pp. 1–5, 2012.
[2] H. A. F. Almurib, T. N. Kumar and F. Lombardi, “A memristor-based LUT for
FPGAs”, In Nano/Micro Engineered and Molecular Systems (NEMS), 9th IEEE
International Conference on, pp. 448–453, 2014.
[3] H. M. H. Babu, N. Saleheen, L. Jamal, S. M. Sarwar and Tsutomu Sasao, “Approach
to design a compact reversible low power binary comparator”, Computers & Digital
Techniques, IET, vol. 8, no. 3, pp. 129–139, 2014.
[4] Y. C. Chen, W. Zhang and H. Li, “A look up table design with 3d bipolar RRAMs”, In
Design Automation Conference (ASP-DAC), 17th Asia and South Pacific, pp. 73-78,
2012.
[5] L. O. Chua, “Memristor–the missing circuit element”, Circuit Theory, IEEE Trans.
on, vol. 18, no. 5, pp. 507–519, 1971.
[6] N. Z. Haron and S. Hamdioui, “On defect oriented testing for hybrid CMOS/memristor
memory”, In Test Symposium (ATS), 20th Asian, pp. 353–358, 2011.
[7] Y. Ho, G. M. Huang and P. Li, “Dynamical properties and design analysis for non-
volatile memristor memories”, IEEE Trans. on Circuits and Systems I, vol. 58, no. 4,
pp. 724–736, 2011.
[8] T. N. Kumar, H. A. F. Almurib and F. Lombardi, “A novel design of a memristor-
based look-up table (LUT) for FPGA”, In Circuits and Systems (APCCAS), IEEE
Asia Pacific Conference on, pp. 703–706, 2014.
[9] C. E. M. Guardia, “Implementation of a fully pipelined BCD multiplier in FPGA”, In
Programmable Logic (SPL), VIII Southern Conference on, pp. 1–6, 2012.
[10] H. C. Neto and M. P. Vestias, “Decimal multiplier on FPGA using embedded binary
multipliers”, In Field Programmable Logic and Applications, International Conference
on, pp. 197–202, 2008.
[11] K. Pocek, R. Tessier and A. DeHon, “Birth and adolescence of reconfigurable com-
puting: A survey of the first 20 years of field-programmable custom computing ma-
chines”, In Highlights of the First Twenty Years of the IEEE International Symposium
on Field-Programmable Custom Computing Machines, pp. 3–19, 2013.

[12] G. Sutter, E. Todorovich, G. Bioul, M. Vazquez and J. P. Deschamps, “FPGA imple-


mentations of BCD multipliers”, In Reconfigurable Computing and FPGAs, Interna-
tional Conference on, pp. 36–41, 2009.
[13] A. Vazquez and F. Dinechin, “Efficient implementation of parallel BCD multipli-
cation in LUT-6 FPGAs”, In Field-Programmable Technology (FPT), International
Conference on, pp. 126–133, 2010.
[14] R. Williams, “How we found the missing memristor”, Spectrum, IEEE, vol. 45, no.
12, pp. 28–35, 2008.
[15] C. Xu, X. Dong, N. P. Jouppi and Y. Xie, “Design implications of memristor-based
RRAM cross-point structures”, In Design, Automation & Test in Europe Conference
& Exhibition (DATE), pp. 1–6, 2011.
CHAPTER 19

LUT-Based Matrix Multiplier Circuit Using Pigeonhole Principle

Matrix multiplication is a computationally intensive and fundamental matrix operation used in scientific computation, signal processing, image processing, graphics and robotic applications. The advancement of Field Programmable Gate Arrays (FPGAs) in recent years, allowing multimillion gates on a single chip, has enabled the implementation of computation-intensive algorithms like matrix multiplication in an efficient and cost-effective way. As multiplication is the slowest operation and hinders the performance of matrix multiplication, an efficient FPGA-based multiplication algorithm is introduced. Besides, the LUT (Look-Up Table) is the key component of the FPGA and can implement any function. A LUT merging theorem is presented, which reduces the required number of LUTs for the implementation of a set of functions by a factor of two. A (1 × 1)-digit multiplication algorithm is introduced which does not require any partial product generation, partial product reduction or addition steps. An (m×n)-digit multiplication algorithm is described which performs digit-wise parallel processing and provides a significant reduction in carry propagation delay. A binary to BCD conversion algorithm for decimal multiplication is also presented to make the multiplication more efficient. Then, a matrix multiplication algorithm is described that re-utilizes the intermediate products for repeated values to reduce the effective area. A cost-efficient LUT-based matrix multiplier circuit is also described using the compact and faster multiplier circuit. Due to the parallel processing structure and the re-utilization of intermediate products for repeated values, the effective area and consequently the power consumption are reduced drastically.

19.1 INTRODUCTION
During the last decade, the logic density, functionality and speed of FPGAs have improved considerably. Modern FPGAs are now capable of running at speeds of 500 MHz and beyond. Another important feature of FPGAs is their potential for dynamic reconfiguration, which is reprogramming part of the device at run time so that resources can be reused through
time multiplexing. Hence, FPGA-based circuit design is a recent research trend. The matrix is an extremely significant means of conveying and discussing problems which arise from real-life scenarios; it is effortless to manipulate and obtain more information by managing data in matrix form. Multiplication is one of the essential operations on matrices. Linear back-projection, color space conversion, 3D affine transformations, estimation of higher-order cross moments, time–frequency spectral analysis and wireless communication are commonly used matrix multiplication applications. To improve the performance of these applications, a high-performance matrix multiplier is required. So, a cost-efficient FPGA-based matrix multiplication algorithm is introduced. Besides, to make the matrix multiplication method more efficient, (1 × 1)-digit and (m×n)-digit decimal multiplication algorithms are presented. Then, compact and faster (1 × 1)-digit and (m×n)-digit decimal multiplier circuits are constructed. Finally, a matrix multiplier circuit is constructed with drastically reduced area and delay.
Traditionally, matrix multiplication operation is either realized as software running on
fast processors or on dedicated hardware (Application Specific Integrated Circuits (ASICs)).
Software based matrix multiplication is slow and can become a bottleneck in the overall sys-
tem operation. The comparative analysis among the performance of CPU, GPU and FPGA
has been exhibited in Fig. 19.1. However, hardware (Field Programmable Gate Array
(FPGA)) based design of matrix multiplier provides a significant speed-up in computa-
tion time and flexibility as compared to software and ASIC based approaches respectively.
During the last few years, research efforts towards realizing and accelerating the matrix multiplication operation using reconfigurable hardware (FPGAs) have been attempted. FPGAs offer the design flexibility of software and the speed of hardware (ASICs). Hence, this chapter aims to introduce an efficient FPGA-based matrix multiplier.

Figure 19.1: Comparison Among CPU, GPU and FPGA performances.

Multiplication is the dominant operation in matrix multiplication in terms of required


resources, delay and power consumption. Most contemporary FPGAs have embedded hard
multipliers distributed throughout the fabric. Even so, soft multipliers using look-up tables
(LUTs) in the configurable logic fabric remain important for high performance designs for
several reasons as follows:

1. Embedded multiplier operands are fixed in size and type, such as 25 × 18 two’s
complement, while LUT-based multiplier operands can be any size or type.

2. The number and location of embedded multipliers are fixed, while LUT-based mul-
tipliers can be placed anywhere and the number is limited only by the size of the
reconfigurable fabric.

3. Embedded multipliers cannot be modified.

4. LUT-based multipliers can be used in conjunction with embedded multipliers to form


larger multipliers.

Hence, a compact and faster LUT-based multiplication algorithm, along with its circuit, is introduced, and an efficient matrix multiplier is built using it. In digital signal processing (DSP) algorithms, in many fixed transforms such as the discrete cosine transform (DCT), the multiplicand has a limited number of values. Moreover, image processing, graph theory and other real-life applications multiply large matrices whose values lie in a limited range. For example, Fig. 19.2 exhibits a real-life application of a small-range matrix: although images are usually of 256 × 256 or 1024 × 1024 matrix size, the range of the values is limited. Using this property, the matrix multiplier re-utilizes pre-calculated values rather than re-calculating them for repeated values. Hence, an efficient LUT-based matrix multiplier circuit is presented with less area, power and delay.

Figure 19.2: Example of an Image with Limited Range Values in Matrix.

The prime challenges are addressed as follows:



1. Parallel matrix multiplication has been explored and investigated extensively in the
previous two decades. There are diverse approaches to optimize the matrix multipli-
cation algorithm. Though parallel algorithm is faster, it requires huge resources.

2. To reduce the resource allocation, a serial architecture can be considered, but it will
increase the delay. Hence, an area-delay trade-off is to be considered.

3. Matrix multiplication is usually of huge in size for real life applications like image
processing, graph theory and many more. So, large scale matrices should be kept in
mind prior to designing a matrix multiplier circuit.

4. In multiplication, the main challenge is the carry propagation delay, which needs to be reduced as much as possible to obtain faster multiplier circuits.

5. The reduction of required number of LUTs while implementing a function is an


unavoidable challenge indeed.

6. The main objective of this work is to propose an efficient FPGA-based matrix mul-
tiplier circuit. By utilizing the special features of advanced FPGAs, the computation
time, hardware resource utilization, and power consumption can be significantly re-
duced. The current trend is towards realizing a high-performance matrix multiplier
on FPGA.

7. Use the advantage of repetition in the large scale real life applications of matrices
with limited range values and reduce the computational complexity with effective
circuit area.

8. Reduce the carry propagation delay by digit-wise parallel processing in a divide and
conquer approach.

9. Reduce the required number of LUTs while implementing any function using LUTs.

Eight main contributions addressed in this chapter are as follows:


1. A (1 × 1)-digit multiplication algorithm has been introduced avoiding the conven-
tional slow partial product generation, partial product reduction and addition stages.
2. An efficient binary to BCD conversion algorithm is presented for decimal digit
multiplication.
3. Efficient (m×n) digit multiplication algorithm is introduced, where digit-wise parallel
processing is performed reducing the carry propagation delay significantly.
4. An efficient matrix multiplication algorithm is introduced using the variation of
the pigeonhole principle. To make the algorithm efficient for the repeated values, the
intermediate product outputs are re-utilized instead of re-calculation.
5. An area and delay efficient (1 × 1) digit LUT-based multiplier circuit is presented
using the parallel bit categorization circuit.
6. A compact binary to BCD converter circuit is described for decimal digit multipli-
cation.
7. Using the (1 × 1) digit LUT-based multiplier circuit, a LUT-based compact and faster
(m×n) digit multiplier circuit is presented.

Table 19.1: Partial Product Generation Logic for Binary Multiplication

  ×   1   0
  1   1   0
  0   0   0

8. A cost-efficient LUT-based matrix multiplier circuit is introduced using the multiplier


circuit. Due to the parallel processing structure and the implementation of re-utilization of
the intermediate product for the repeated values, the effective area and power consumption
are reduced.

19.2 BASIC DEFINITIONS


In this section, basic definitions and ideas related to multiplication, matrix multiplication,
Binary Coded Decimal (BCD) addition, binary to BCD conversion, pigeonhole principle,
literal cost, gate input cost, gate input cost with not, FPGA, LUT, comparator, adder, etc.
are illustrated.

19.2.1 Binary Multiplication


As in decimal system, the multiplication of binary numbers is carried out by multiplying
the multiplicand by one bit of the multiplier at a time and the result of the partial product
for each bit is placed in such a manner that the LSB is under the corresponding multiplier
bit. Finally, the partial products are added to get the complete product. The logic of partial
product generation is exhibited in Table 19.1 and partial product positioning after shifting
is shown in Fig. 19.3.

Figure 19.3: Partial Product Generation for Final Addition of 8 × 8-bit Multiplication.

Multiplication process has three main steps:


1. Partial product generation.
2. Partial product reduction.

3. Final addition.
For the multiplication of an n-bit multiplicand by an m-bit multiplier, m partial products are generated and the product formed is n + m bits long. About five different types of multipliers are discussed, which are as follows:

1. Booth multiplier.

2. Combinational multiplier.

3. Wallace tree multiplier.

4. Array multiplier.

5. Sequential multiplier.

An efficient multiplier should have following characteristics:


Accuracy: A good multiplier should give correct result.
Speed: Multiplier should perform operation at high speed.
Area: A multiplier should occupy less number of slices and LUTs.
Power: Multiplier should consume less power.
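A minimal shift-and-add sketch in Python ties the three main steps of binary multiplication (partial product generation, reduction and final addition) together for unsigned operands; it is illustrative only and not tied to any particular multiplier architecture listed above.

    # Partial product generation (AND + shift), then reduction and final addition.
    def binary_multiply(multiplicand, multiplier, m_bits):
        partial_products = []
        for i in range(m_bits):
            bit = (multiplier >> i) & 1
            partial_products.append((multiplicand if bit else 0) << i)
        return sum(partial_products)       # reduction and final addition in one step

    print(binary_multiply(0b1011, 0b0110, 4))   # 11 x 6 = 66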

19.2.2 Matrix Multiplication


If A = [aij] is an n×m matrix and B = [bij] is an m×p matrix, the product AB is an n×p matrix. Considering AB = C = [cij],

[A11 · · · A1m]   [B11 · · · B1p]   [C11 · · · C1p]
[ .         . ] × [ .         . ] = [ .         . ]
[An1 · · · Anm]   [Bm1 · · · Bmp]   [Cn1 · · · Cnp]

where cij = ai1 b1j + ai2 b2j + . . . + aim bmj. In the conventional matrix multiplication approach,

(AB)ij = Σ_{k=1}^{m} Aik Bkj.

To multiply two matrices, the necessary and sufficient condition is that the number of columns in matrix A equals the number of rows in matrix B. The conventional matrix multiplication algorithm is exhibited in Algorithm 19.1, and its time complexity is O(n³). An example demonstration of the matrix multiplication operation is given in Fig. 19.4.

Algorithm 19.1: Conventional n×n Matrix Multiplication Algorithm
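As the body of Algorithm 19.1 is not reproduced here, the conventional O(n³) procedure can be sketched in Python as a triple loop over rows, columns and the inner dimension; the names are illustrative.

    # Conventional matrix multiplication: C[i][j] = sum_k A[i][k] * B[k][j].
    def matmul(A, B):
        n, m, p = len(A), len(B), len(B[0])
        C = [[0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                for k in range(m):
                    C[i][j] += A[i][k] * B[k][j]
        return C

    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]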

Figure 19.4: Example Simulation of Matrix Multiplication.



19.2.3 BCD Coding


There are many different binary codes used in digital and electronic circuits, each with its
own specific use. As we naturally live in a decimal (base-10) world, some way is needed of converting decimal numbers into the binary (base-2) environment that computers and digital electronic devices understand, and binary coded decimal allows us to do that.
An n-bit binary code is a group of n bits that assumes up to 2^n distinct combinations of 1s and 0s. The advantage of the Binary Coded Decimal system is that each decimal digit is represented by a group of 4 binary digits or bits, in much the same way as hexadecimal. So for the 10 decimal digits (0 to 9) a 4-bit binary code is needed. But binary coded decimal is not the same as hexadecimal: whereas a 4-bit hexadecimal number is valid up to (F)16, representing binary (1111)2 (decimal 15), binary coded decimal numbers stop at 9, binary (1001)2. This means that although 16 numbers (2^4) can be represented using four binary digits, in the BCD numbering system the six binary code combinations 1010 (decimal 10), 1011 (decimal 11), 1100 (decimal 12), 1101 (decimal 13), 1110 (decimal 14) and 1111 (decimal 15) are classed as forbidden numbers and cannot be used. The main advantage of binary coded decimal is that it allows easy conversion between decimal (base-10) and binary (base-2) form. However, the disadvantage is that BCD code is wasteful, as the states between 1010 (decimal 10) and 1111 (decimal 15) are not used. Nevertheless, binary coded decimal has many important applications, especially with digital displays.
In the BCD numbering system, a decimal number is separated into four bits for each
decimal digit within the number. Each decimal digit is represented by its weighted binary
value performing a direct translation of the number. So a 4-bit group represents each
displayed decimal digit, from 0000 for a zero to 1001 for a nine. So, for example, (357)10 (three hundred and fifty-seven) in decimal would be represented in Binary Coded Decimal as:
(357)10 = 0011 0101 0111 (BCD)

19.2.4 BCD Addition


BCD adder uses BCD numbers as input and output. Since a 4-bit binary code has 16
different binary combinations, the addition of two BCD digits may produce incorrect result
that exceeds the largest BCD digit (9)10 = (1001)BCD . In such cases, the result must be
corrected by adding (6)10 = (0110)BCD to guarantee that the result is a BCD digit. The
resultant decimal carry output generated by the correction process is added to the next
higher digit of the BCD adders.
Example 19.1 Addition of (5)10 = (0101)BCD to (8)10 = (1000)BCD results in the non-BCD value (13)10 = (1101)2. The result is corrected by adding (0110)BCD to (1101)2, which gives (0011)BCD = (3)10. Thus the output, (3)10 with a carry of 1, is the correct decimal sum of (5)10 and (8)10.
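The single-digit correction step can be sketched in Python as follows; the helper name is illustrative and the code models the rule, not the adder circuit of Fig. 19.13.

    # Add two BCD digits; if the binary sum exceeds 9, add 6 (0110) to skip the
    # six invalid 4-bit codes, producing a valid BCD digit and a decimal carry.
    def bcd_digit_add(a, b, carry_in=0):
        s = a + b + carry_in
        if s > 9:
            s += 6
        return s & 0xF, (s >> 4) & 1       # (BCD digit, carry out)

    print(bcd_digit_add(0b0101, 0b1000))   # 5 + 8 -> (0b0011, 1), i.e. digit 3, carry 1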

19.2.5 Binary to BCD Conversion


Binary to BCD conversion represents the conversion of a binary number into separate
binary numbers representing digits of the decimal number. The basic steps are as follows:
1. If any column (100’s, 10’s, 1’s, etc.) is 5 or greater, add 3 to that column.

2. Shift all numbers to the left 1 position.


3. If 8 shifts have been performed, it’s done! Evaluate each column for the BCD values.
4. Go to Step 1.
An example demonstration has been shown in Fig. 19.5 and the Verilog code for the
conversion in hardware is shown in Fig. 19.6.
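The shift-and-add-3 steps above can also be modelled compactly in Python (a software sketch of the same idea as the Verilog of Fig. 19.6, with illustrative names and an 8-bit input):

    # "Double dabble": before each left shift, add 3 to any BCD column >= 5.
    def binary_to_bcd(value, bits=8):
        bcd = 0
        for i in range(bits - 1, -1, -1):
            for shift in (0, 4, 8):                  # ones, tens, hundreds columns
                if (bcd >> shift) & 0xF >= 5:
                    bcd += 3 << shift
            bcd = (bcd << 1) | ((value >> i) & 1)    # shift in the next binary bit
        return bcd                                   # packed BCD digits

    print(hex(binary_to_bcd(255)))   # -> 0x255, i.e. BCD digits 2, 5, 5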

Figure 19.5: Example Demonstration of Binary to BCD Conversion.



Figure 19.6: Verilog Code for the Hardware Implementation of Binary to BCD Conversion.

19.2.6 Pigeonhole Principle


The pigeonhole principle is one of the simplest but most useful ideas in mathematics. A
basic version says that if ( N + 1) pigeons occupy N holes, then some hole must have at least
2 pigeons. Thus if 5 pigeons occupy 4 holes, then there must be some hole with at least 2
pigeons. It is easy to see why: otherwise, if each hole has at most 1 pigeon then the total
number of pigeons couldn’t be more than 4.
The pigeonhole principle has many generalizations. For instance: If one has N pigeons
in K holes, and ( N/K ) is not an integer, then some holes must have strictly more than ( N/K )
pigeons. So 16 pigeons occupying 5 holes means some holes have at least 4 pigeons.

19.2.7 Field Programmable Gate Arrays


Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based around
a matrix of configurable logic blocks (CLBs) connected via programmable interconnects.
FPGAs can be reprogrammed to desired application or functionality requirements after
manufacturing. This feature distinguishes FPGAs from Application Specific Integrated
Circuits (ASICs) which are custom manufactured for specific design tasks. Although one-
time programmable (OTP) FPGAs are available, the dominant types are SRAM based
which can be reprogrammed as the design evolves.
Due to their programmable nature, FPGAs are an ideal fit for many different markets. For
example, Xilinx provides comprehensive solutions consisting of FPGA devices, advanced
software, and configurable, ready-to-use IP cores for markets and applications. FPGAs are
generally more flexible and cost-effective than ASICs. Every FPGA chip is made up of
a finite number of predefined resources with programmable interconnects to implement
a reconfigurable digital circuit and I/O blocks to allow the circuit to access the outside
world. FPGA resource specifications often include the number of configurable logic blocks,
number of fixed function logic blocks such as multipliers, and size of memory resources
like embedded block RAM. Of the many FPGA chip parts, these are typically the most
important when selecting and comparing FPGAs for a particular application.
The configurable logic blocks (CLBs) are the basic logic unit of an FPGA. The different
parts of FPGA are shown in Fig. 19.7. Sometimes referred to as slices or logic cells, CLBs
are made up of two basic components: flip-flops and lookup tables (LUTs). Various FPGA
families differ in the way flip-flops and LUTs are packaged together, so it is important to
understand flip-flops and LUTs.

Figure 19.7: Various parts of FPGA.

19.2.8 Look-Up Table


A look-up table (LUT) is a memory block with a one-bit output that essentially implements
a truth table where each input combination generates a certain logic output. The input
combination is referred to as an address. The output of the LUT is the value stored in the
indexed location of the selected memory cell. Since the memory cells in the LUT can be
set to anything according to the corresponding truth table, an N-input LUT can implement
any logic function.

Example 19.2 When implementing any logic function, a truth table of that logic is mapped to the memory cells of the LUT. Suppose Equation 19.1 is implemented, where ‘|’ represents the logical OR operation; Table 19.2 represents the truth table of the function. Fig. 19.8 shows the gate representation and the LUT representation of the logic function. The output is generated for the corresponding input combination; for example, for the input combination 1 and 0, the output is 1.

f = (A.B)|(A ⊕ B)        (19.1)

There is significant research on improving the LUT to reduce hardware complexity and read/write time. The circuit diagram of a 2-input LUT is given in Fig. 19.9.

Table 19.2: Truth Table of Function ( f )

A B Output
(f)
0 0 0
0 1 1
1 0 1
1 1 1

Figure 19.8: LUT Implementation of a Logic Function.

Figure 19.9: Circuit Diagram of a 2-input LUT.

19.2.9 LUT-Based Adder


For the design of a LUT-based adder, instead of using one large LUT, implementations of an 8-bit adder are shown using a number of small multi-bit-output LUTs. An 8-bit adder can be designed from two 9-input LUTs, where each 9-input LUT has two 4-bit inputs plus one 1-bit carry input and 5-bit outputs for a 4-bit addition. The carry is propagated to the next 9-input LUT only after the previous 4-bit addition in one LUT is done (i.e., ripple carry). Since the LUTs must be read one by one, this adder takes a long time to finish an addition, as shown in Fig. 19.10(a). By employing the concept of the carry select adder, a much faster adder can be implemented with 8-input LUTs, because reading the next LUT does not depend on the previous carry. The details of the implementation are depicted in Fig. 19.10(b). To make a better adder, a 4-input LUT with 6-bit outputs can be exploited (Fig. 19.10(c)). The comparative analysis is shown in Fig. 19.11.

Figure 19.10: 8-bit Adder Using (a) Two 9-Input LUTs (b) Two 8-Input LUTs (c) Four
4-Input LUTs.

Figure 19.11: Comparison of Area and Time for 8-bit Adder Using Various LUTs.

19.2.10 BCD Adder


Suppose the decimal values of the two variables A and B are 9 and 5, respectively. For the BCD addition of these operands, their BCD representations are taken. First, the LSBs of A and B are added with the carry (Cin) from the previous BCD digit addition; if this is the first digit addition, the value of Cin is zero. The obtained sum (S0) is the first bit of the output, and the carry is added to the three most significant bits of B, providing a value b, which is 011.
Afterwards, the three most significant bits of A and b are added. As the result is 111, which is greater than four, 3 is added to the sum to obtain the correct result. The resultant four-bit sum represents the first digit of the BCD value, which is 4 in this case, and the carry is the next resultant digit, which is 1. The simulation procedure is shown in Fig. 19.12. The algorithm of the BCD addition method with the pre-processing technique is given in Algorithm 19.2, and the block diagram of the circuit is demonstrated in Fig. 19.13.

Figure 19.12: Simulation of BCD Addition.



Algorithm 19.2: Algorithm for N -digit BCD Addition

Figure 19.13: Block Diagram of N -Digit BCD Adder Circuit.

19.2.11 Comparator
A comparator is a logic circuit that first compares the magnitudes of A and B and then determines which of A > B, A < B and A = B holds. When the two numbers are 1-bit numbers, each result is a single bit, 0 or 1; such a circuit is called a 1-bit magnitude comparator and is the basis for comparing two n-bit numbers. The truth table of the conventional 1-bit comparator is listed in Table 19.3. From this truth table, the logical expressions of the 1-bit comparator are as follows:

X = (FA>B) = A.B′
Y = (FA=B) = (A ⊕ B)′
Z = (FA<B) = A′.B
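These expressions can be checked exhaustively with a few lines of Python (illustrative only):

    # Truth-table check of the 1-bit comparator expressions X, Y, Z.
    for A in (0, 1):
        for B in (0, 1):
            X = A & (1 - B)        # A > B
            Y = 1 - (A ^ B)        # A = B
            Z = (1 - A) & B        # A < B
            print(A, B, X, Y, Z)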

The wave form of a 1-bit comparator circuit is demonstrated in Fig. 19.14 and a circuit
diagram of a 4-bit comparator circuit is exhibited in Fig. 19.15.

Figure 19.14: Timing Diagram of a 1-bit Comparator Circuit.

Figure 19.15: Circuit Diagram of a 4-bit Comparator Circuit.

19.2.12 Shift Register


The Shift Register is a type of sequential logic circuit that can be used for the storage or the
transfer of data in the form of binary numbers. This sequential device loads the data present
on its inputs and then moves or “shifts” it to its output once every clock cycle, hence the
name Shift Register. A shift register basically consists of several single bit “D-Type Data
Latches”, one for each data bit, either a logic “0” or a “1”, connected together in a serial
type daisy-chain arrangement so that the output from one data latch becomes the input of
the next latch and so on. Data bits may be fed in or out of a shift register serially, that is one
after the other from either the left or the right direction, or all together at the same time in
a parallel configuration.
The number of individual data latches required to make up a single Shift Register device
is usually determined by the number of bits to be stored with the most common being 8-bits
(one byte) wide constructed from eight individual data latches. Shift Registers are used for

data storage or for the movement of data and are therefore commonly used inside calculators
or computers to store data such as two binary numbers before they are added together, or to
convert the data from either a serial to parallel or parallel to serial format. The individual
data latches that make up a single shift register are all driven by a common clock (Clk)
signal making them synchronous devices. Shift register ICs are generally provided with a
clear or reset connection so that they can be “SET” or “RESET” as required.
Generally, shift registers operate in one of four different modes with the basic movement
of data through a shift register being:

1. Serial-in to Parallel-out (SIPO) – the register is loaded with serial data, one bit at a
time, with the stored data being available at the output in parallel form.

2. Serial-in to Serial-out (SISO) – the data is shifted serially “IN” and “OUT” of the
register, one bit at a time in either a left or right direction under clock control.

3. Parallel-in to Serial-out (PISO) – the parallel data is loaded into the register simul-
taneously and is shifted out of the register serially one bit at a time under clock
control.

4. Parallel-in to Parallel-out (PIPO) - the parallel data is loaded simultaneously into the
register, and transferred together to their respective outputs by the same clock pulse.

The effect of data movement from left to right through a shift register can be presented
graphically as in Fig. 19.16. The directional movement of the data through a shift register can
be either to the left, (left shifting) to the right, (right shifting) left-in but right-out, (rotation)
or both left and right shifting within the same register thereby making it bidirectional.
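A minimal software model of a right-shifting register may help fix the terminology; the class name and the 4-bit width are illustrative only:

class ShiftRegister:
    def __init__(self, width=4):
        self.bits = [0] * width          # one D-type latch per bit

    def clock(self, serial_in):
        """One clock pulse: shift right, the serial input enters on the left."""
        serial_out = self.bits[-1]
        self.bits = [serial_in] + self.bits[:-1]
        return serial_out                # SISO output

    def parallel_out(self):
        return list(self.bits)           # SIPO output

sr = ShiftRegister()
for bit in (1, 0, 1, 1):                 # feed 1011 serially, MSB first
    sr.clock(bit)
assert sr.parallel_out() == [1, 1, 0, 1]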

Figure 19.16: Data Movement from Left to Right through a Shift Register.

19.2.13 Literal Cost


A literal is a variable or its complement. The literal cost is the number of literal appearances in a Boolean expression corresponding to the logic circuit diagram.

Example 19.3 F = BD + AB′C + AC′D′; L = 8.

19.2.14 Gate Input Cost


Gate input cost is the number of inputs to the gates in the implementation corresponding
exactly to the given equation or equations.
G - inverters not counted
GN - inverters counted
For SOP (Sum of Products) and POS (Product of Sums) equations, it can be found from
the equation(s) by finding the sum of:
1. All literal appearances
2. The number of terms excluding terms consisting only of a single literal, (G) and
3. Optionally, the number of distinct complemented single literals (GN )

Example 19.4 F = BD + AB′C + AC′D′;
G = 11; GN = 14
F = BD + AB′C + AB′D′ + ABC′;
G = 15; GN = 19
A circuit diagram with corresponding L , G and GN is exhibited in Fig. 19.17, where L
(literal count) counts the AND inputs and the single literal OR input. G (gate input count)
adds the remaining OR gate inputs. GN (gate input count with NOTs) adds the inverter
inputs.
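The two cost measures can also be checked mechanically. The following hedged Python helper counts L, G, and GN for a sum-of-products expression written with ' for complement and single-letter variables (an assumed input format, used only for illustration):

def sop_costs(expr):
    terms = [t.strip() for t in expr.split('+')]
    literals, complemented = [], set()
    for term in terms:
        i = 0
        while i < len(term):
            var = term[i]
            if i + 1 < len(term) and term[i + 1] == "'":
                literals.append(var + "'")       # complemented literal
                complemented.add(var)            # distinct inverted variable
                i += 2
            else:
                literals.append(var)
                i += 1
    L = len(literals)
    # G adds one OR-gate input per term, excluding single-literal terms.
    G = L + sum(1 for t in terms if len([c for c in t if c.isalpha()]) > 1)
    GN = G + len(complemented)                   # one inverter per distinct complement
    return L, G, GN

# Matches Example 19.3/19.4: F = BD + AB'C + AC'D'
assert sop_costs("BD + AB'C + AC'D'") == (8, 11, 14)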

19.2.15 Xilinx Virtex 6 FPGA Slice


Virtex-6 FPGAs contain many built-in system-level blocks. These features allow logic
designers to build the highest levels of performance and functionality into their FPGA based
systems. Built on a 40 nm state-of-the-art copper process technology, Virtex-6 FPGAs are
a programmable alternative to custom ASIC technology. Virtex-6 FPGAs offer the best
solution for addressing the needs of high-performance logic designers, high-performance
DSP designers, and high-performance embedded systems designers with unprecedented
logic, DSP, connectivity, and soft microprocessor capabilities.
The look-up tables (LUTs) in Virtex-6 FPGAs can be configured as either one 6-input
LUT (64-bit ROMs) with one output, or as two 5-input LUTs (32-bit ROMs) with separate
outputs but common addresses or logic inputs. Each LUT output can optionally be registered

Figure 19.17: Circuit with Corresponding Literal Cost, Gate Input Cost, and Gate Input
Cost with NOT.

in a flip-flop. Four such LUTs and their eight flip-flops as well as multiplexers and arithmetic
carry logic form a slice, and two slices form a configurable logic block (CLB). Four flip-
flops per slice (one per LUT) can optionally be configured as latches. In that case, the
remaining four flip-flops in that slice must remain unused. Between 25–50% of all slices can
also use their LUTs as distributed 64-bit RAM or as 32-bit shift registers (SRL32) or as two
SRL16s. Modern synthesis tools take advantage of these highly efficient logic, arithmetic,
and memory features. Expert designers can also instantiate them.
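A simplified behavioral model of this dual-output idea is sketched below; it is not the actual Xilinx LUT6_2 primitive, just a 64-entry table read either as one 6-input function or as two 5-input functions with shared inputs:

def lut6(table64, a6):
    """6-input mode: table64 is a 64-entry truth table, a6 is a 6-bit address."""
    return table64[a6 & 0x3F]

def lut5x2(table64, a5):
    """Dual 5-input mode: both 32-entry halves are read with the same 5-bit address."""
    return table64[a5 & 0x1F], table64[0x20 | (a5 & 0x1F)]

# Pack f = parity of 5 bits (low half) and g = AND of 5 bits (high half).
table = [bin(x).count('1') & 1 for x in range(32)] + [int(x == 31) for x in range(32)]

assert lut6(table, 0b100000) == lut5x2(table, 0)[1]     # same stored bit, two views
assert all(lut5x2(table, x) == (bin(x).count('1') & 1, int(x == 31)) for x in range(32))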

19.3 THE MATRIX MULTIPLIER


In this section, a new matrix multiplication algorithm is introduced. Besides, to make
the matrix multiplication method more efficient, (1 × 1)-digit and (m×n)-digit decimal
multiplication algorithms are presented. Moreover, for the efficiency of the (m×n)-digit
decimal multiplication, a cost-effective binary to BCD conversion algorithm is described.
Then, (1 × 1)-digit and (m×n)-digit decimal multiplier circuits along with the new binary
to BCD converter circuit are constructed. Finally, a matrix multiplier circuit is constructed.
Essential figures, tables and lemmas are presented to clarify the ideas.

19.3.1 The Efficient Matrix Multiplication Method


Matrix multiplication is a computationally-intensive and fundamental matrix operation in
many algorithms used in scientific computations. It serves as the basic building block for
digital signal processing, image processing, graphics and robotic applications. To improve
the performance of these applications, a high performance matrix multiplier is required.

Unfortunately, matrix multiplication is a slow operation, mostly because of its huge computation cost. So, to speed up the matrix multiplication method, a faster multiplier is a must. The design and implementation approaches of multipliers contribute substantially to the area, speed, and power consumption of computation-intensive matrix multiplier systems. Mostly, the delay of the multipliers dominates the critical path of the matrix multiplier system, so low-power and high-speed multiplier circuits are in high demand. Hence, a cost-efficient multiplication algorithm is introduced in this chapter. The use of this multiplier makes the matrix multiplier more area-, power-, and delay-efficient. Therefore, the efficient multiplication algorithm is presented first, and then the matrix multiplication algorithm is elaborated in the next subsections.
High speed and efficient multipliers are required in day-to-day complex computational
circuits like digital signal processing, cryptography algorithms and high speed processors.
An efficient FPGA (Field Programmable Gate Array) based multiplier circuit is introduced
due to the re-programmability and rapid prototyping characteristics of FPGA. The advan-
tages of FPGA that made it more lucrative than ASIC (Application Specific Integrated
Circuit) are as follows:
1. Faster time-to-market.
2. No upfront non-recurring expenses (NRE).
3. Simpler design cycle (Due to software that handles much of the routing, placement,
and timing).
4. More predictable project cycle.
5. Field re-programmability (A new bit stream can be uploaded remotely).
FPGA technology has become an integral part of today's modern embedded systems.
All mainstream commercial FPGA devices are based upon LUT (Look-up Table) structures.
Any m-input Boolean function can be implemented using m-input LUTs. It is a prime
concern to reduce the number of LUTs while implementing an FPGA-based circuit for any
given function.

19.3.1.1 The (1 × 1)-Digit Multiplication Algorithm


A (1 × 1)-digit multiplication algorithm is presented here based on the implementation of the LUT merging theorem (discussed in Chapter 16). In a (1 × 1)-digit multiplication, the maximum decimal input digit (dmax) = (a3, a2, a1, a0) or (b3, b2, b1, b0) is 9, and the product of two such digits yields a maximum of 7 bits (P6 P5 P4 P3 P2 P1 P0).
First, the selection of LUTs is performed by a bit categorization technique, where the multiplier and multiplicand bits are arranged into groups of variables of the 1-bit, 2-bit, 3-bit, and 4-bit input categories. A group of 3-input variables contains operands with a maximum of 3 bits. Similarly, a 4-input group of variables contains operands with a maximum of 4 bits. It is important to note that only four categories of input variables are considered here, as the (1 × 1)-digit multiplication is used as the basic unit to construct an (m×n)-digit multiplier. The (1 × 1)-digit multiplication considers four binary bits with the maximum decimal value 9 as input. As A×B = B×A, either A×B or B×A is considered to avoid identical input combinations. Either the multiplicand or the multiplier is 0dec or 1dec in the 1-bit category. If one of the inputs is 0dec, the corresponding output is zero. If one of the inputs is 1dec, the output is the other operand. The group of the 3-bit category is selected for the implementation of LUT merging (described in Chapter 16). A LUT with 4 to 6 inputs provides the best area and delay performance. The traditional 4-input LUT structure provides low logic density and configuration flexibility, which reduces the utilization of interconnect resources when relatively complex logic functions are configured. Hence, a 6-input LUT is considered to implement the function.
Second, partitioning of LUTs is performed by observing the similarities between the input and output combinations of the product bits. For the product bits, an input can be eliminated, since eliminating that input preserves all the input combinations with their corresponding outputs, and even where input combinations coincide, the corresponding output bits are identical. The functions of the same partition are fed as input to a single 6-input LUT. As each function has 5 inputs after the input-reduction step, the 6-input LUT provides dual outputs. Hence, the merging of LUTs reduces the required number of LUTs from 4 to 2. The 6-bit input combinations are converted into 5-bit input combinations in such a way that the total number of input combinations of the 6-bit and 5-bit inputs is the same, and the corresponding outputs of all the combinations for both the 6-input and 5-input variables also remain the same. The first step is the selection of LUTs, where the selection is performed on the basis of the computation of the final products P1, P2, P3, and P4. Third, the partitioning of LUTs is performed, where two different colors are used to distinguish the distinct partition sets. Fourth, the partitioned LUTs are merged into one.
Algorithm 19.3 depicts the (1 × 1)-digit multiplication technique. The inputs of the algorithm are the multiplicand digit A with 4 bits (a3 a2 a1 a0) and the multiplier digit B with 4 bits (b3 b2 b1 b0). The first condition specifies that if A or B is 0dec, the output is zero. The second condition implies that if A equals 1dec, the output is B. The third condition specifies that if B equals 1dec, the output is A. The fourth condition indicates that if (A1.B1) is one, which means the operands are of the 2-bit category, and (A3 A2 B3 B2) is zero, which ensures that the input combination is not of the 3-bit or 4-bit category, then the remaining output bits are zero. The fifth condition checks whether (A2 B2) equals 1 and the combination is not of the 4-bit category, in which case it is classified as the 3-bit category. Finally, the last condition checks whether the input combination is of the 4-bit category.
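A compact software rendering of this categorization flow is given below; the dictionaries merely stand in for the per-category LUT contents of Algorithm 19.3, and the function name mul_1x1 is illustrative:

CAT3 = {(a, b): a * b for a in range(8) for b in range(8)}    # 3-bit category table
CAT4 = {(a, b): a * b for a in range(10) for b in range(10)}  # 4-bit category table

def mul_1x1(a, b):
    assert 0 <= a <= 9 and 0 <= b <= 9
    if a == 0 or b == 0:              # 1-bit category: a zero operand
        return 0
    if a == 1:                        # 1-bit category: output is the other operand
        return b
    if b == 1:
        return a
    if a < 4 and b < 4:               # 2-bit category (both operands fit in 2 bits)
        return a * b
    if a < 8 and b < 8:               # 3-bit category
        return CAT3[(a, b)]
    return CAT4[(a, b)]               # 4-bit category (an operand uses bit 3)

assert all(mul_1x1(a, b) == a * b for a in range(10) for b in range(10))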

Algorithm 19.3: Algorithm for (1 × 1)-Digit Multiplication

19.3.1.2 The (m×n)-Digit Multiplication Algorithm Using the (1 × 1)-Digit Multiplication Algorithm

Using the efficient (1 × 1)-digit multiplication algorithm, an (m×n)-digit multiplication algorithm is introduced. Suppose A = (Am−1, Am−2, . . . , A2, A1, A0) is the multiplier of m digits and B = (Bn−1, Bn−2, . . . , B2, B1, B0) is the multiplicand of n digits. The multiplication algorithm is provided in Algorithm 19.4, which follows the steps below:

1. Each multiplicand digit (Bi) is processed in parallel, while the multiplier digits (Am−1 – A0) are pipelined to be processed with each multiplicand digit (Bi).

2. For the n multiplicand digits (Bi), each multiplicand digit is multiplied with the pipelined multiplier digits using the (1 × 1)-digit multiplication algorithm through the SingleDigitMul(Aj, Bi) function call. The output of the multiplier is at most 7 bits (Bin6−0), since for a single-digit multiplication the maximum input and corresponding output are 1001(9) × 1001(9) = 1010001(81).

3. The binary output from the (1 × 1)-digit multiplier is then converted to BCD (Binary Coded Decimal), named BCD7−0. As the maximum output is two digits, the corresponding BCD output is 8 bits.

4. The BCD value is added to the ToBeAdded variable, which is initialized to zero.

5. After the addition, the first four bits (3–0) of Res are stored in the Out variable, and the rest of the bits of Res are stored in the ToBeAdded variable.

6. Steps (2–5) are performed for each Aj until j = m − 1.

7. When j = m − 1, the last updated Res7−4 bits are concatenated with the previous Out.

8. Each final Out is left shifted i × 4 bits for each value of i.

9. Finally, all the left-shifted n Out values are added to provide the final result.

Algorithm 19.4: Algorithm for (m×n)-Digit Multiplication



Table 19.3: Example Demonstration of the (m×n)-Digit Multiplication Algorithm

Example 19.5 Table 19.3 provides an illustrative example of the (m×n)-digit multiplication algorithm. Suppose a (4 × 4)-digit multiplication is considered, where the multiplier is A = (3654)dec and the multiplicand is B = (7932)dec. Since a decimal multiplier is considered here, the inputs are provided as A = (0011 0110 0101 0100) and B = (0111 1001 0011 0010). The algorithm is highly parallel: each multiplicand digit is processed in parallel, and the multiplier is pipelined for each multiplicand digit, as shown in the Input row of Table 19.3. For each multiplicand digit 7, 9, 3, and 2, the multiplier digits 3, 6, 5, and 4 are pipelined. For each multiplier digit, a total of 4 stages are performed. At the first stage, 4 is processed in parallel with 2, 3, 9, and 7. After the (1 × 1)-digit multiplication, the corresponding 7-bit binary outputs are obtained. The binary outputs are then converted to BCD. Then, BCD addition is performed on the 8-bit BCD output and the ToBeAdded variable, which is initialized to zero. After the 8-bit BCD addition, the 4 bits in positions (0–3) are stored as output and the other 4 bits in positions (7–4) are stored in ToBeAdded. In the next iterations the same procedure is followed with the updated values. At the last iteration, ToBeAdded is concatenated with Out. Finally, all the output values are left shifted by 0, 1, 2, and 3 digits, respectively. At last, BCD addition is performed on the shifted output values and the final result is obtained.
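The same flow can be traced in software. The sketch below runs the processing elements serially (the hardware runs them in parallel), uses illustrative helper names, and reproduces the result of the (3654) × (7932) example:

def bin_to_bcd_2digit(p):          # p is the 7-bit product of two decimal digits
    return divmod(p, 10)           # (tens, units) = the 8-bit BCD value

def mul_mxn(a_digits, b_digits):   # digits given least significant first
    partial_rows = []
    for i, b in enumerate(b_digits):          # one processing element per B digit
        out_digits, to_be_added = [], 0
        for a in a_digits:                    # multiplier digits are pipelined
            tens, units = bin_to_bcd_2digit(a * b)
            s = units + to_be_added
            out_digits.append(s % 10)
            to_be_added = tens + s // 10
        out_digits.append(to_be_added)        # last carry digit concatenated with Out
        value = int(''.join(str(d) for d in reversed(out_digits)))
        partial_rows.append(value * 10 ** i)  # left shift by i decimal digits
    return sum(partial_rows)                  # final BCD addition stage

# A = 3654, B = 7932 (digits least significant first)
assert mul_mxn([4, 5, 6, 3], [2, 3, 9, 7]) == 3654 * 7932 == 28983528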
The aforementioned (m×n)-digit multiplication algorithm requires a binary to BCD conversion process. To improve the (m×n)-digit multiplication algorithm, a new binary to BCD conversion algorithm is introduced in the next subsection.

19.3.1.3 Binary to BCD Conversion Algorithm


The (m×n)-digit decimal multiplication algorithm requires converting the binary output of the (1 × 1)-digit multiplication to a BCD representation. The algorithm requires n binary to BCD converters, and the conversion in each converter occurs m times. The output of the (1 × 1)-digit multiplication is a maximum of 7 bits (B6, B5, B4, B3, B2, B1, B0). Converting the 7 binary bits (B6 – B0) to BCD produces a maximum 8-bit output (P7, P6, P5, P4, P3, P2, P1, P0). So, a 2-digit (8-bit) binary to BCD conversion algorithm is required. Hence, a binary to BCD conversion algorithm is presented for decimal multiplication.
The 7-bit output of the (1 × 1)-digit multiplication produces only 36 valid output combinations, whereas a 7-bit function has 2^7 = 128 possible combinations. So, the remaining combinations (128 − 36 = 92) are invalid. The valid input and output combinations of the target binary to BCD conversion algorithm for multiplication are demonstrated in Table 19.4. In decimal, the invalid binary input combinations for the multiplication are 11, 13, 17, 19, 22, 23, 26, 29, 31, 33, 34, 37, 38, 39, 41, 43, 44, 46, 47, (50–53), 55, (57–62), (65–71), (73–80), and (82–127). So, about 72% of the input combinations are invalid and can be utilized to optimize the functions. The advantage of this huge number of invalid combinations is exploited by placing don't cares in the K-map manipulation. Hence, more optimized functions are generated, as the invalid input combinations are avoided.
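A quick enumeration makes the don't-care argument concrete; the snippet below lists the products that can actually appear at the multiplier output and their two-digit BCD encodings (illustrative code, with zero included among the reachable values):

reachable = sorted({a * b for a in range(10) for b in range(10)})
assert max(reachable) == 81 and all(p < 2 ** 7 for p in reachable)

# Every other 7-bit code never occurs and can be treated as a don't care,
# which is what allows the heavily optimized P7..P0 functions.
dont_cares = [p for p in range(2 ** 7) if p not in set(reachable)]
assert len(reachable) + len(dont_cares) == 128

bcd_map = {p: (p // 10, p % 10) for p in reachable}   # (tens digit, units digit)
assert bcd_map[81] == (8, 1) and bcd_map[72] == (7, 2)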
The binary to BCD conversion algorithm produces an 8-bit output. While constructing the function of each output bit, the invalid input combinations can be treated as don't care conditions. As there are 7 bits (B6 – B0) in the input, each output function (P7 – P0) considers the 7 input variables (B6 – B0). Hence, a 7-variable K-map is used to generate the output functions. The 7-variable K-map layout is shown in Fig. 19.18. After taking the valid inputs from Table 19.4, the values of (B6 – B0) are inserted into the K-map layout, where A = B0, B = B1, C = B2, D = B3, E = B4, F = B5, and G = B6.
For instance, for the P1 function, 1 is inserted in the corresponding address of the K-map when the value of P1 for that input value is 1, and 0 is inserted otherwise. Moreover, for the invalid addresses, a don't care (×) condition is placed in the corresponding address of the K-map.

Table 19.4: The Binary Input and BCD Output of the Binary to BCD Conversion Method
for (m×n)-Digit Multiplication

Figure 19.18: K-map Layout for Seven Variables to Optimize the Functions.

The K-map manipulation for the function P1 is shown in Fig. 19.19. After grouping the adjacent one-values, the groups of Fig. 19.20 are obtained. Hence, the P1 function can be represented as a sum of products (SOP) as shown in Equation 19.3, where A = B0, B = B1, C = B2, D = B3, E = B4, F = B5, and G = B6. Similarly, the functions (P2 – P5) are formulated in Equations 19.4 to 19.7 using K-map manipulation. The K-map manipulation and grouping of the functions (P2 – P5) are demonstrated in Figs. 19.21–19.28.

Figure 19.19: K -map Manipulation for the Optimization of P1 Function.

Figure 19.20: Optimized Groups after K -map Manipulation for P1 Function.



Figure 19.21: K -map Manipulation for the Optimization of P2 Function.

Figure 19.22: Optimized Groups after K -map Manipulation for P2 Function.



Figure 19.23: K -map Manipulation for the optimization of P3 Function.

Figure 19.24: Optimized Groups after K -map Manipulation for P3 Function.



Figure 19.25: K -map Manipulation for the optimization of P4 Function.

Figure 19.26: Optimized Groups after K -map Manipulation for P4 Function.



Figure 19.27: K -map Manipulation for the optimization of P5 Function.

Figure 19.28: Optimized Groups after K -map Manipulation for P5 Function.

Table 19.4 shows that the values of P0 and B0 are the same. Hence, the function of P0 can be formulated as P0 = B0, which is shown in Equation 19.2. Observing Table 19.4, the functions of P6 and P7 can be formulated directly (the function of P6 is shown in Equation 19.8) without using K-map manipulation. Hence, the binary to BCD conversion algorithm uses the LUT implementation of the optimized output functions (P7 – P0). The functions are optimized to a great extent by avoiding the huge number of invalid input combinations.

P0 = B0                                                                  (19.2)

P1 = B0.B3 + B2.B5.B6 + B1.B3.B5 + B1.B2.B3 + B1.B2.B3.B5
     + B1.B2.B3.B5 + B1.B2.B3.B4.B5 + B0.B1.B2.B3.B4.B5                  (19.3)

P2 = B2.B3.B4 + B2.B4.B5 + B1.B2.B3 + B1.B2.B3.B5 + B2.B3.B4
     + B1.B2.B6 + B0.B2.B3 + B0.B1.B2.B4.B5                              (19.4)

P3 = B1.B2.B3.B5 + B2.B3.B4 + B5 + B1.B2.B3.B4 + B0.B1.B2.B3.B4.B5       (19.5)

P4 = B0.B3 + B2.B5.B6 + B1.B2.B3 + B1.B2.B3.B5
     + B1.B2.B3.B4 + B1.B2.B3.B4 + B0.B1.B2.B3.B4                        (19.6)

P5 = B0.B2 + B2.B5.B6 + B1.B2.B4 + B1.B2.B3 + B1.B2.B3                   (19.7)

P6 = (B3 + B4).B5 + (B5.B6.B0)                                           (19.8)

19.3.1.4 Efficiency of the (m×n)-Digit Multiplication Algorithm


The conventional binary multiplication considers three steps as follows:
1. Partial product generation
2. Partial product reduction
3. Final addition
For the multiplication of a 4-bit multiplicand with a 4-bit multiplier, 4 partial products are generated and the product formed is 7 bits long. As shown in Fig. 19.29, for (1 × 1)-digit multiplication there are 4 partial product generation stages. The partial products are added in three stages, where each addition operation is 4 bits wide. Addition is slow due to its carry propagation. So, the conventional algorithm incurs the extra delay of three sequential 4-bit additions.
On the contrary, in the introduced method the addition operation is completely omitted, which avoids the long carry propagation delay. The technique requires only two steps, as follows:
1. Select the input bit category
2. Output from corresponding category

Figure 19.29: Conventional 4 × 4-Bit Multiplication.

The (1 × 1)-digit multiplication avoids the costly partial product generation, partial product reduction, and addition operations. Instead, the presented algorithm applies a simple bit categorization technique based on simple conditional logic, and the final output from each category does not require any complex calculation; rather, it is based on simple AND/OR logic operations and can be generated faster. Hence, the (1 × 1)-digit multiplication is not only faster but also area-efficient due to its direct output generation instead of partial product generation and addition operations. Using this significantly efficient (1 × 1)-digit multiplication algorithm, the (m×n)-digit multiplication algorithm has been introduced.
As shown in Fig. 19.30, for an (8 × 8)-digit multiplication, a 32-bit BCD to binary conversion is first required for both the multiplier A and the multiplicand B. Moreover, a 54-bit binary to BCD conversion is required as the last step. As the number of bits being converted is large, the circuit is very complex and requires considerable area, power, and delay. With an increasing number of input bits, the conversion width increases accordingly, which makes the computation and the circuit very costly. On the contrary, the introduced circuit requires only an 8-bit conversion for each digit of the multiplicand, and these conversions work in parallel. As the number of bits being converted is very small and each conversion is performed in parallel, the presented algorithm requires less area and delay for conversion. Moreover, whatever the input size is, the described algorithm always requires only 8-bit conversions; only the number of converters increases, but as they work in parallel, both the delay and the circuit complexity are reduced radically.
On the other hand, the introduced algorithm generates only 8 partial product rows at the final stage. Moreover, the intermediate (1 × 1)-digit multiplications are performed without generating any partial products. The algorithm requires a BCD addition of the 8 partial product rows at the final stage, and the intermediate BCD adders are only 8-bit adders. So, the carry propagation delay is reduced due to the small-input adders and the parallelism. It can therefore be concluded that the multiplication algorithm is significantly improved in area and delay due to its parallelism, the reduction of intermediate partial product generation, the elimination of the long carry propagation delay of wide-input adders, and so on.

Figure 19.30: 8 × 8-Digit Multiplication.

19.3.2 The Matrix Multiplication Algorithm


Matrix multiplication is a dominant operation in real-life applications like image processing, graph theory, cryptography, and many more, so it is a prime concern to have an efficient matrix multiplier circuit. Matrix multiplication is a core operation in image processing. Matrices are used to transform RGB (Red, Green, Blue) colors, to scale RGB colors, to control hue, saturation, and contrast, and to rotate and scale images. The most important advantage of using matrices is that any number of color transformations can be composed using standard matrix multiplication. For example, a 6 × 6 matrix representation of an image is shown in Fig. 19.31. This real-life example exhibits that the data range of the matrix values is (0–7) for this 6 × 6 matrix, which consequently has 28 repeated values. The target is to take advantage of this huge number of repeated values in any matrix. To formulate the idea, the pigeonhole principle is stated as follows:
Pigeonhole Principle: If N items are put into M containers, with N > M, then at least one container must contain more than one item.
Property 19.3.1 and Property 19.3.2 for matrices can be deduced from the pigeonhole principle as follows:

Property 19.3.1 If N is the number of distinct values to be placed in M positions, where M > N, then at least one position contains a repeated value.

Property 19.3.2 If there are d distinct values and the matrix size is n×n, then the total number of repeated values is n² − d.
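A tiny check of Property 19.3.2, using a made-up 6 × 6 matrix with values restricted to 0–7 (not the actual pixel values behind Fig. 19.31):

def repeated_count(matrix):
    n2 = sum(len(row) for row in matrix)              # total positions (n x n)
    d = len({v for row in matrix for v in row})       # distinct values
    return n2 - d

image_like = [[0, 1, 2, 1, 0, 3],
              [4, 5, 5, 4, 2, 1],
              [6, 7, 7, 6, 3, 0],
              [0, 1, 2, 3, 4, 5],
              [6, 7, 0, 1, 2, 3],
              [4, 5, 6, 7, 0, 1]]   # values limited to the range 0..7

assert repeated_count(image_like) == 6 * 6 - 8 == 28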

Figure 19.31: Example of an image for 6 × 6 Matrix Representation.

An efficient matrix multiplication algorithm is introduced in this subsection. Let N denote the set of natural numbers. The vector space of n-by-n natural matrices is denoted by N^(n×n). Suppose A and B are two n×n matrices to be multiplied. Let A be the multiplier matrix and B be the multiplicand matrix, producing the output matrix C. Every matrix is presented as follows:
A ∈ N^(n×n) ⇔ A = (aij) =
    | a11  . . .  a1n |
    |  .           .  |
    |  .           .  |
    | an1  . . .  ann |
So, after multiplication the matrix C will be as follows:

C = A × B;  where  cij = Σ(k = 1 to n) aik.bkj                            (19.9)

For high-speed and parallel processing, a maximum of n×n processing element blocks are considered for the matrix multiplication process. If there is 0% repetition in the rows of the multiplicand matrix B, then processing element re-utilization does not take place, resulting in n×n effective processing element blocks. On the contrary, matrices with identical values in the rows of the B matrix require far fewer effective processing elements. Each processing element block uses as its inputs the values of a column from the multiplier A and a single value from the corresponding row of the multiplicand B. Now, column partitioning is performed on the multiplier matrix A as follows:
A ∈ N^(n×n) ⇔ A = [A1  A2  · · ·  Ak  · · ·  An],  Ak ∈ N^n;

         | a1 |
         | a2 |
    Ak = |  . |
         |  . |
         | an |

Figure 19.32: Parallel Structure Representation of the Matrix Multiplication Algorithm.

For each processing element, every value of the i th row of the multiplicand B will be
used as an input with the values of a corresponding column vector Ak of the multiplier,
where k = i . Now, row partitioning of the multiplicand matrix B is performed as follows:
                  | B1 |
                  | B2 |
                  |  . |
B ∈ N^(n×n) ⇔ B = | Bk |,  Bk ∈ N^n
                  |  . |
                  | Bn |

Bk = [ b1  b2  · · ·  bn ]


Therefore, for n×n processing element, each column vector Ak will be the input with
each j th values of the row vector Bi for k = i , where each column vector is pipelined to
the neural network for faster execution as shown in Fig. 19.32. In Algorithm 19.5, first the
repeated values are checked in multiplicand matrix B using CheckRepeatation(B[ ][ ])
function. The repetition check is performed efficiently by performing parallel processing.
For each value of each row, if there is a repeated value with a previous value then the
corresponding X[ ][ ] matrix is updated. If repetition is found, then the column address of
the initial value is stored in corresponding index of [ ][ ] and 1 is stored in corresponding
index of X[ ][ ]. If there is no repeated value, then 0 is stored in corresponding index of
X[ ][ ].

Algorithm 19.5: n×n Matrix Multiplication Algorithm

After the repetition check is completed, the matrix multiplication algorithm proceeds. Each value of the multiplicand matrix B is processed in parallel and the multiplier matrix A is pipelined, so only one row of the multiplier matrix A is processed at a time. For the current row of A, the value of X[ ][ ] is checked: if the value is 0, the multiplication operation is performed and the result is stored in the corresponding index of matrix Y[ ][ ]; if the value of X[ ][ ] is 1, which means it is a repeated value, the previously calculated value stored in Y[ ][ ] is re-utilized. Though the Y[ ][ ] matrix is used for algorithmic purposes, in the circuit construction there is no requirement to store this matrix.
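A software sketch of this reuse scheme is given below; the X, Z, and Y names follow the text, the helper names are illustrative, and the parallelism of the hardware is not modeled:

def check_repetition(B):
    n = len(B)
    X = [[0] * n for _ in range(n)]   # 1 if B[k][j] repeats an earlier value in its row
    Z = [[0] * n for _ in range(n)]   # column of the first occurrence of that value
    for k in range(n):
        for j in range(n):
            for prev in range(j):
                if B[k][j] == B[k][prev]:
                    X[k][j], Z[k][j] = 1, prev
                    break
    return X, Z

def matmul_reuse(A, B):
    n = len(A)
    X, Z = check_repetition(B)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        Y = [[0] * n for _ in range(n)]          # cached products for this row of A
        for k in range(n):
            for j in range(n):
                # Re-use the earlier product when B[k][j] is a repeated value.
                Y[k][j] = Y[k][Z[k][j]] if X[k][j] else A[i][k] * B[k][j]
                C[i][j] += Y[k][j]
    return C

A = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
B = [[2, 8, 2], [4, 4, 9], [6, 10, 10]]
expected = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
assert matmul_reuse(A, B) == expected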
Example 19.6 Suppose two 3 × 3 matrices are considered for demonstration, where the multiplier matrix is A and the multiplicand matrix is B. First, repetition is checked and the intermediate matrices X and Z are calculated as shown below. Then, in step 1, for i = j = 0, the values of k are 0, 1, and 2, respectively. The updated output for each value of k = 0, 1, and 2 is marked using red, green, and blue, respectively. For each value, the updated intermediate matrix Y[ ][ ] and output matrix C[ ][ ] are provided. If any repetition is found, the repeated value is marked using yellow. The first five steps are shown, and the other 4 steps are performed similarly to obtain the final output matrix.
1 4 7 2 8 2 
A = 2 5 8 and B = 4 4 9 
   
3 6 9 6 10 10
 
   

After applying the CheckRepeatation(B[ ][ ]) function to the matrix B[ ][ ]:

    | 0  0  1 |            | 0  0  0 |
X = | 0  1  0 |   and  Z = | 0  0  0 |
    | 0  0  1 |            | 0  0  1 |
Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

19.3.3 The Cost-Efficient Matrix Multiplier Circuit


The introduced matrix multiplier circuit requires a multiplier circuit as an integral component. First, a (1 × 1)-digit LUT-based multiplier circuit is presented, and then an (m×n)-digit multiplier circuit is described. Finally, the matrix multiplier circuit is presented.

19.3.3.1 (1 × 1)-Digit Multiplier Circuit


FPGAs are inherently suited for very high-speed parallel multiply and accumulate functions.
The algorithm for (1 × 1)-digit multiplication offers efficient use of FPGA resources like LUTs. Here, a LUT-based multiplier circuit is constructed which avoids the conventional partial product generation, reduction, and addition steps. As shown in Algorithm 19.3, there are mainly 4 categories based on the number of input bits, namely the 1-bit, 2-bit, 3-bit, and 4-bit categories. The 1-bit category can be further divided into three sub-cases: either the multiplier or the multiplicand is 0dec, the multiplier is 1dec, or the multiplicand is 1dec. So, the first step of the multiplication process is to find the appropriate category. The inputs set the category selection value (Catg) of the matching category to one while keeping the others zero, which activates that specific category to provide the output. The categories are determined by using the following equations, where plus (+) represents the OR operation, dot (.) represents the AND operation, and prime (′) represents the complement.
The equations of corresponding categories are as follows:
Category 0, (Catg0) = (Either multiplier or multiplicand is 0dec)

Catg0 = (A3 + A2 + A1 + A0) + (B3 + B2 + B1 + B0)′                        (19.10)

Category 1, (Catg1) = (multiplicand is 1dec)

Catg1 = (A3 + A2 + A1 + A0) + (B3 + B2 + B1 + B0).(B3 + B2 + B1)′.B0      (19.11)

Category 2, (Catg2) = (multiplier is 1dec)

Catg2 = (A3 + A2 + A1 + A0) + (B3 + B2 + B1 + B0).(A3 + A2 + A1)′.A0      (19.12)

Category 3, (Catg3) = (maximum number of input bits is 2)

Catg3 = (A1.B1).(A3 + A2 + B3 + B2)′                                      (19.13)


Category 4, (Catg4) = (maximum number of input bits is 3)

Catg4 = (A2 + B2).(A3 + B3)′.[(A3 + A2 + A1 + A0) + (B3 + B2 + B1 + B0)′
        + (B3 + B2 + B1)′.B0 + (A3 + A2 + A1)′.A0]′                       (19.14)

Category 5, (Catg5) = (maximum number of input bits is 4)

Catg5 = (A3 + B3).[(A3 + A2 + A1 + A0) + (B3 + B2 + B1 + B0)′
        + (B3 + B2 + B1)′.B0 + (A3 + A2 + A1)′.A0]′                       (19.15)

In these equations, optimization is performed to simplify the circuit construction process by extraction. The optimization of the equations is carried out at multiple levels using transformations, and some temporary variables are introduced to reduce the required literal cost (L), gate input cost (G), and gate input cost with NOT (GN). A literal is a variable or its complement. The literal cost (L) is the number of literal appearances in a Boolean expression corresponding to the logic circuit diagram. As an example:

F = BD + AB′C + AC′D′;  L = 8                                             (19.16)

Gate Input Cost (G) is the number of inputs to the gates in the implementation corre-
sponding exactly to the given equation or equations. It can be found from the equation(s)
by finding the sum of:

1. All literal appearances.

2. The number of terms excluding single literal terms.

3. Optionally, the number of distinct complemented single literals.

The gate input cost is denoted by G if inverters are not counted, or GN if inverters are counted. For example:
F = BD + AB′C + AC′D′;  G = 11;  GN = 14                                  (19.17)

So, extraction is performed to find common factors and optimize the equations. Table 19.5 shows the literal cost, gate input cost, and gate input cost with NOT, before and after optimization. It exhibits reductions of 55.26%, 61.85%, and 57.54% in the literal cost, gate input cost, and gate input cost with NOT, respectively. For bit categorization, a circuit is constructed as shown in Figs. 19.33 and 19.34. Fig. 19.33 demonstrates that for the input combination A = 0010 and B = 0001, the circuit selects category 1, as one of the input operands is 1. Besides, Fig. 19.34 shows that for the input combination A = 0010 and B = 0101, the circuit selects category 4, as the input operand B has a maximum input size of 3 bits. Similarly, all the other possible input combinations select the corresponding category. The LUT-based implementation of the bit categorization circuit requires 10 4-input LUTs and 6 2-input LUTs.

Table 19.5: Comparative Analysis of Optimization Technique Implementation

The overall circuit of the (1 × 1)-digit multiplication is exhibited in Fig. 19.35. The block diagram of the bit categorization has already been described. As the bit categorization selects the corresponding category by setting the value of that category variable to 1, this value activates that particular category. The other categories remain deactivated, so only one category at a time is selected. Fig. 19.36 shows that, as the output Catg3 bit from the bit categorization circuit for the corresponding input combination is zero, the entire 3-bit category circuit is deactivated and no input passes through the circuit. The dotted line represents the off state and the straight

Figure 19.35: (1 × 1)-Digit Multiplier Circuit using Heterogeneous Input LUT.

Figure 19.36: Deactivated Category Circuit as it is not the Target Category for Correspond-
ing Input.

line represents the on state. On the other hand, when the input combination is A = 0011 and B = 0011, which represents category 3, the bit categorization provides Catg3 as 1. Hence, the 3-bit category is activated and the corresponding required inputs are passed through the transistors and LUTs. Finally, the corresponding output of the input combination, P = 1001, is obtained from that category. The internal circuit of each category, as shown in Fig. 19.35, is the LUT implementation. Similarly, for all the other input combinations, the corresponding category

Figure 19.37: Activated Category Circuit as it is the Target Category for Corresponding
Input.

is selected and activated. Finally, the output from the activated category is considered as
the final output which is shown in Fig. 19.37.
The LUT-based bit categorization circuit and the (1 × 1)-digit multiplier circuit can be constructed using homogeneous 6-input LUTs. The use of 6-input LUTs with the LUT merging technique reduces the required number of LUTs to a great extent compared to the heterogeneous-input LUT implementation. Though a 6-input LUT is more complex than smaller-input LUTs, the implementation of the LUT merging theorem (described in Chapter 16) ensures the effective use of the 6-input LUT by providing dual outputs. Moreover, the available FPGA slices consist of homogeneous-input LUTs. The homogeneous 6-input LUT implementation of the bit categorization circuit is shown in Fig. 19.38. A 6-input LUT can be used to implement T1 and T2 as dual outputs. Similarly, T3 and T4 can be implemented using a 6-input LUT as dual outputs, and the functions T5 and T6 can be implemented using a dual-output 6-input LUT. The functions T5, T6, and Catg0 depend on T1 and T2. Though the Catg0 function depends on only 4 input variables, it cannot be implemented as a dual output; as the other output functions depend on Catg0, a single LUT is used to implement Catg0. Furthermore, (Catg1, T7), (Catg2, Catg4), and (Catg3, Catg5) are each implemented using a 6-input LUT as dual outputs. Hence, the total circuit construction requires only seven 6-input LUTs.

Figure 19.38: Homogeneous 6-Input LUT Implementation of the Bit-Categorization Circuit.

The homogeneous 6-input circuit of the (1 × 1)-digit multiplier circuit is shown in Fig. 19.39. The least significant bit of the output, P0, is obtained using the inputs a0 and b0 for every category; hence, P0 is generated independently, outside the category circuits. The 4-bit category requires two 6-input LUTs, where P14 and P24 are generated from a 6-input LUT as dual outputs. Similarly, P34 and P44 are generated from a 6-input LUT as dual outputs. The output functions P54 and P64 can be generated directly from the input bits a2 and a3, respectively, so no LUT is used to implement P54 and P64. The circuit construction of the 3-bit category is the same as in the heterogeneous circuit, which requires three 6-input LUTs. The 2-bit category is constructed using just one 6-input LUT: P12 and P22 are implemented using a 6-input LUT as dual outputs. The functions P32 and P0 are identical, so P32 can be generated by re-using the output bit P0. Finally, the output value of the functions P42 to P62 is a constant zero. Hence, the homogeneous (1 × 1)-digit multiplier circuit along with the bit categorization circuit requires 14 6-input LUTs.

Figure 19.39: Homogeneous 6-Input LUT Implementation of the (1 × 1)-Digit Multiplier Circuit.

Figure 19.40: Binary to BCD Converter Circuit for Decimal Digit Multiplier Circuit.

19.3.3.2 Binary to BCD Converter Circuit for the Decimal Multiplier


The binary to BCD converter circuit is constructed by implementing the optimized output functions (P7 – P0) of Equations 19.2–19.8. Considering the inputs as (B6 – B0) and the outputs as (P7 – P0), the first output bit is identical to the input bit B0, so it can be generated directly. The function P1 depends on 7 input bits; to implement it using 6-input LUTs, two 6-input LUTs are required. The two LUTs share the input set (B5 – B0), where the first LUT is configured for B6 = 0 and the second LUT is configured for B6 = 1. The outputs of the two LUTs are given as inputs to a 2-to-1 multiplexer, where B6 is provided as the selection bit. If the value of B6 is 0, the output of the first LUT is passed to the output of the multiplexer, and for B6 = 1 the output of the second LUT is propagated through the multiplexer. Similarly, the functions P2, P4, and P5 are each generated using two 6-input LUTs and a multiplexer. The function P3 requires 5 inputs, so it can be implemented using a single 6-input LUT. Then, P6 and P7 are generated from a 6-input LUT as dual outputs. The binary to BCD converter circuit is shown in Fig. 19.40.
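The B6-based split is a Shannon expansion, which the following sketch illustrates with an arbitrary stand-in 7-input function (parity); the LUT contents and names here are assumptions for illustration, not the actual P1 tables:

def target(b):                       # any 7-input function f(B6..B0); b is 7 bits
    return bin(b).count('1') & 1     # parity, used only as a stand-in example

# Build the two 64-entry LUTs: one cofactor for B6 = 0, one for B6 = 1.
lut_b6_0 = [target(a) for a in range(64)]          # addresses B5..B0 with B6 = 0
lut_b6_1 = [target(a | 0x40) for a in range(64)]   # same addresses with B6 = 1

def mux2(sel, d0, d1):
    return d1 if sel else d0

def f_from_luts(b):
    a = b & 0x3F                      # B5..B0 drive both LUTs in parallel
    return mux2((b >> 6) & 1, lut_b6_0[a], lut_b6_1[a])

assert all(f_from_luts(b) == target(b) for b in range(128))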

19.3.3.3 (m×n)-Digit Multiplier Circuit


The (m×n)-digit multiplier circuit uses the (1 × 1)-digit multiplier circuit. For an (m×n)-digit multiplier, n processing elements are required. Each processing element PEi holds a single digit of the multiplicand, Bi, and the multiplier is pipelined to it. As shown in Fig. 19.35, the (1 × 1)-digit multipliers of each processing element provide the corresponding binary output. The outputs of the multipliers are passed to the binary to BCD converter. The multiplier requires an 8-bit binary to BCD converter in each processing element, and each converter requires only 14 LUTs. The output from the converter is the input to the 8-bit BCD adder. The first four bits of the BCD adder output are stored as output, and the rest of the bits are added with the output of the adder in the next iteration. After m iterations, each output of the processing elements (Pi) is shifted by i digits using shift registers. Finally, the outputs of the processing elements are added using the (m + 1)-digit BCD adder. The datapath of the circuit is shown in Fig. 19.41.

Figure 19.41: Block Diagram of the (m×n)-Digit Multiplier Circuit.

19.3.3.4 Matrix Multiplier Circuit


The introduced matrix multiplication algorithm is implemented to construct an FPGA-based matrix multiplier circuit. In Fig. 19.42, the block diagram of the introduced matrix multiplier circuit is shown along with the data channels, where the values (A11, A21, A31) of the multiplier matrix are considered as the inputs of processing elements 1, 4, and 7. Similarly, (A12, A22, A32) of the multiplier matrix are considered as the inputs of processing elements 2, 5, and 8, and (A13, A23, A33) of the multiplier matrix are considered as the inputs of processing elements 3, 6, and 9.
The circuit of the matrix multiplier for 3 × 3 matrix multiplication is exhibited in Fig. 19.43. First, for each row of the multiplicand matrix B, repetition of the values is checked. To check the repetition among the 3 values of a row of the matrix B, two comparators, one AND gate, and a switching circuit are required. As parallel processing is considered, the two comparators work in parallel to find the repetition efficiently with much less delay. If repetition is found, then the corresponding index of the X matrix is updated with 1 and the Z matrix is updated with the column index of the value it duplicates.

Figure 19.42: Block Diagram of the Matrix Multiplier Circuit.

For the values with no repetition, the corresponding index of the X matrix is updated with 0.
For the first three values of the first column of the multiplicand matrix B (B11, B21, B31), the corresponding values of A and B are sent to the multiplier circuit. As it is not possible to have a repeated value in the first column of the matrix, the multiplication must be performed. In this circuit, the efficient LUT-based (m×n)-digit multiplier circuit is used for the multiplication operation. Then, while processing the second-column values of the B matrix, the X matrix is checked to determine whether there is a repeated value or not. If the value of (X12, X22, X32) is one, it activates the corresponding transistor, as the value of X is provided as the gate input of the transistors. Hence, the activated transistor passes the value from the multiplier which has the same output, so that it can be re-utilized. For example, if X12 is 1, then it activates the first transistor and passes the value from the multiplier which has the input B11. So, the multiplied value is re-utilized, avoiding the re-calculation overhead. If the value of X12 is zero, it deactivates the first transistor and activates the next two transistors, which pass

Figure 19.43: Matrix Multiplier Circuit for (3 × 3) Matrices.

the input value from A and B to the multiplier. As there is no repetition, multiplication will
occur.
Similarly, while processing the third-column values of the B matrix, the X matrix is checked to determine whether there is a repeated value or not. If the value of (X13, X23, X33) is one, it activates the corresponding transistor, as the value of X is provided as the gate input of the transistors. Hence, the activated transistor passes the column address of the value which has been repeated. While checking repetition, the Z matrix is updated with the column location of the value which has been repeated. So, through the transistor, the address is passed as the selection bit of a 2-to-1 multiplexer. For example, if X13 is 1, then it activates the first transistor and passes the value from the Z matrix to the multiplexer as its selection bit. If the selection bit is zero, the multiplexer passes the output from the first multiplier, which has B11 as input; this represents that the first value, B11, is repeated in this position. If the selection bit is one, then the multiplexer provides the output from the first switching circuit, named out, which represents that B12 is repeated. So, the multiplied value is re-utilized, avoiding the re-calculation overhead. If the value of X13 is zero, it deactivates the first transistor and activates the next two transistors, which pass the input values from A and B to the multiplier. As there is no repetition, multiplication will occur. Finally, the outputs are added using the LUT-based adder as shown in Fig. 19.43 to provide the resultant matrix.

19.4 SUMMARY
The increased interest in FPGAs (Field Programmable Gate Arrays) for real-time applications, such as wireless communications, image processing and image reconstruction, medical imaging, network security, and signal processing, justifies the effort in the design of an efficient and high-performance matrix multiplier. Since matrix multiplication is computationally intensive, it demands considerable attention. Traditionally, the matrix multiplication operation is realized either as software running on fast processors or on dedicated hardware (Application Specific Integrated Circuits (ASICs)). Software-based matrix multiplication is slow and can become a bottleneck in the overall system operation. The matrix multiplication operation is often performed by parallel processing systems which distribute computations over several processors to achieve significant speed-up gains. There are many realizations of matrix multiplication. However, hardware such as an FPGA-based matrix multiplier provides a significant speed-up in computation time and flexibility compared to software and ASIC-based approaches, respectively. The advancement of FPGAs in recent years, allowing multi-million gates on a single chip, together with sophisticated Electronic Design Automation (EDA) tools, has allowed the implementation of complex and computation-intensive algorithms in an efficient and cost-effective way. FPGAs offer the design flexibility of software and the speed of dedicated hardware (ASICs). In the contemporary techniques, the number of multiplication operations for multiplying two N×N matrices is N³. For designing an efficient matrix multiplier, a Look-up Table (LUT)-based multiplier circuit is introduced. The commercial FPGA slices contain embedded multipliers. The FPGA embedded multiplier is fixed in operand size and type; for example, the current Xilinx Virtex 6 FPGA slice provides a 25 × 18 two's complement multiplier circuit. Besides, the embedded multiplier cannot be modified. So, a new LUT-based multiplier circuit is introduced, where the operands can be of any size and type.
The (m×n)-digit multiplier performs digit-wise parallel processing in a divide-and-conquer approach where the (1 × 1)-digit multiplier is used. Due to the digit-wise parallel processing, the carry propagation delay reduces significantly. The (1 × 1)-digit multiplier avoids the conventional partial product generation, reduction, and addition steps. Besides, a binary to BCD (Binary Coded Decimal) converter circuit for single-digit multiplication is presented to optimize the resource utilization of the circuit. For the design of an efficient matrix multiplier, a variation of the pigeonhole principle is used. Due to the limited range of the matrix values in real-life applications, the re-utilization of intermediate pre-calculated values saves computation and reduces the effective circuit area. Due to the reconfigurable feature of FPGAs, the multiplier and matrix multiplier circuits are easily implementable as commercial products. Since the matrix multiplier is a key primitive in many real-life applications, the use of a high-performance matrix multiplier will thus ameliorate the performance of these computations.

REFERENCES
[1] L. Singhal and E. Bozorgzadeh, “Special section on field programmable logic and
applications-multi-layer floorplanning for reconfigurable designs”, IET Computers &
Digital Techniques, vol. 1, no. 4, pp. 276–294, 2007.
[2] T. J. Todman, G. A. Constantinides, S. J. Wilton, O. Mencer, W. Luk and P. Y. Cheung,
“Reconfigurable computing: architectures and design methods”, IEE Proceedings-
Computers and Digital Techniques, vol. 152, no. 2, pp. 193–207, 2005.
[3] B. Almashary, S. M. Qasim, S. A. Alshebeili and W. A. Al-Masry, “Realization of
linear back-projection algorithm for capacitance tomography using FPGA”, In 4th
World Congress Industrial Process Tomography, pp. 5–8, 2005.
[4] F. Bensaali and A. Amira, “Accelerating colour space conversion on reconfigurable
hardware”, Image and Vision Computing, vol. 23, no. 11, pp. 935–942, 2005.
284  VLSI Circuits and Embedded Systems

[5] Z. T. Li, T. J. Wu, C. L. Lin and L. H. Ma, “Field programmable gate array based par-
allel strapdown algorithm design for strapdown inertial navigation systems”, Sensors,
vol. 11, no. 8, pp. 7993–8017, 2011.
[6] S. M. Qasim, S. A. Alshebeili, A. A. Khan and S. A. Abbasi, “Realization of algorithm
for the computation of third-order cross moments using FPGA”, In Signal Processing
and Its Applications, 9th International Symposium on, pp. 1–4, 2007.
[7] A. A. Shoshan and M. A. Oqeely, “A high performance architecture for computing
the time-frequency spectrum”, Circuits, Systems and Signal Processing, vol. 19, no.
5, pp. 437–450, 2000.
[8] E. Cavus and B. Daneshrad, “A very low-complexity space-time block decoder
(STBD) ASIC for wireless systems”, IEEE Trans. on Circuits and Systems I, vol.
53, no. 1, pp. 60–69, 2006.
[9] A. Yurdakul and G. Dundar, “Multiplier-less realization of linear DSP transforms by
using common two-term expressions”, Journal of VLSI Signal Processing Systems
for Signal, Image and Video Technology, vol. 22, no. 3, pp. 163–172, 1999.
[10] A. A. Fayed and M. A. Bayoumi, “A merged multiplier-accumulator for high speed sig-
nal processing applications”, In Acoustics, Speech, and Signal Processing (ICASSP),
IEEE International Conference on, vol. 3, p. 3212, 2002.
[11] Y. Iguchi, T. Sasao and M. Matsuura, “Design methods for binary to decimal con-
verters using arithmetic decompositions”, Journal of Multiple Valued Logic and Soft
Computing, vol. 13, no. 4/6, p. 503, 2007.
[12] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET Circuits, Devices & Systems, vol. 10, no. 3, pp.
163–172, 2016.
[13] Y. Moon and D. K. Jeong, “An efficient charge recovery logic circuit”, IEEE Journal
of Solid-State Circuits, vol. 31, no. 4, pp. 514–522, 1996.
[14] P. K. Meher, “Lut optimization for memory-based computation”, IEEE Trans. on
Circuits and Systems II, vol. 57, no. 4, pp. 285–289, 2010.
[15] H. C. Neto and M. P. Vestias, “Decimal multiplier on FPGA using embedded binary
multipliers”, In 2008 International Conference on Field Programmable Logic and
Applications, pp. 197–202, 2008.
[16] M. P. Vestias and H. C. Neto, “Parallel decimal multipliers using binary multipliers”,
In Programmable Logic Conference (SPL), pp. 73–78, 2010.
[17] R. H. Turner and R. F. Woods, “Highly efficient, limited range multipliers for LUT-
based FPGA architectures”, IEEE Trans. on Very Large Scale Integration (VLSI)
Systems, vol. 12, no. 10, pp. 1113–1118, 2004.
[18] A. Vazquez, E. Antelo and P. Montuschi, “A new family of high performance parallel
decimal multipliers”, In 18th IEEE Symposium on Computer Arithmetic, pp. 195–204,
2007.
[19] K. Hasanov, J. N. Quintin and A. Lastovetsky, “Hierarchical approach to optimization
of parallel matrix multiplication on large-scale platforms”, The Journal of Supercom-
puting, vol. 71, no. 11, pp. 3991–4014, 2015.
LUT-Based Matrix Multiplier Circuit Using Pigeonhole Principle  285

[20] J. W. Jang, S. B. Choi and V. K. Prasanna, “Energy-and time-efficient matrix multi-


plication on FPGAs”, IEEE Trans. on Very Large Scale Integration Systems, vol. 13,
no. 11, pp. 1305–1319, 2005.
[21] S. Hong, K. S. Park and J. H. Mun, “Design and implementation of a high-speed
matrix multiplier based on word-width decomposition”, IEEE Trans. on Very Large
Scale Integration Systems, vol. 14, no. 4, pp. 380–392, 2006.
[22] Y. Dou, S. Vassiliadis, G. K. Kuzmanov and G. N. Gaydadjiev, “64-bit floating-point
FPGA matrix multiplication”, In Proceedings of the ACM/SIGDA 13th International
Symposium on Field-programmable Gate Arrays, pp. 86–95, 2005.
[23] G. Wu, Y. Dou and M. Wang, “High performance and memory efficient imple-
mentation of matrix multiplication on FPGAs”, In Field-Programmable Technology
International Conference on, pp. 134–137, 2010.
[24] P. Saha, A. Banerjee, P. Bhattacharyya and A. Dandapat, “Improved matrix multiplier
design for high-speed digital signal processing applications”, IET Circuits, Devices
& Systems, vol. 8, no. 1, pp. 27–37, 2014.
[25] Y. Wang, J. Gao, B. Sui, C. Zhang and W. Xu, “An analytical model for matrix
multiplication on many threaded vector processors”, In CCF National Conference on
Computer Engineering and Technology, pp. 12–19, 2014.
[26] X. Huang and V. Y. Pan, “Fast rectangular matrix multiplication and applications”,
Journal of Complexity, vol. 14, no. 2, pp. 257–299, 1998.
[27] S. M. Qasim, S. A. Abbasi and B. A. Almashary, “Hardware realization of matrix mul-
tiplication using field programmable gate array”, MASAUM Journal of Computing,
vol. 1, no. 1, pp. 21–25, 2009.
[28] T. C. Lee, M. White and M. Gubody, “Matrix multiplication on FPGA based platform”,
In Proceedings of the World Congress on Engineering and Computer Science, vol. 1,
2013.
[29] S. H. Lederman, E. M. Jacobson, J. R. Johnson, A. Tsao and T. Turnbull, “Imple-
mentation of Strassen’s algorithm for matrix multiplication”, In Supercomputing,
Proceedings of the ACM/IEEE Conference on, pp. 32, 1996.
[30] X. V. Luu, T. T. Hoang, T. T. Bui and A. V. Dinh-Duc, “A high-speed unsigned 32-
bit multiplier based on booth-encoder and Wallace-tree modifications”, International
Conference on Advanced Technologies for Communications, pp. 739–744, 2014.
[31] K. Pocek, R. Tessier and A. DeHon, “Birth and adolescence of reconfigurable com-
puting: A survey of the first 20 years of field-programmable custom computing ma-
chines”, IEEE International Symposium on Field-Programmable Custom Computing
Machines, pp. 3–19, 2013.
[32] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-submicron FPGA
performance and density”, IEEE Trans. on Very Large Scale Integration Systems, vol.
12, no. 3, pp. 288–298, 2004.
[33] M. Zhidon, C. Liguang, W. Yuan and L. Jinmei, “A new FPGA with 4/5-input LUT
and optimized carry chain”, Journal of Semiconductors, vol. 33, no. 7, 2012.
[34] N. Honarmand, M. R. Javaheri, N. S. Mokhtari and A. A. Kusha, “Power efficient
sequential multiplication using pre-computation”, IEEE International Symposium on
Circuits and Systems, 2006.
[35] A. Ehliar, “Optimizing Xilinx designs through primitive instantiation”, In Proceedings
of the 7th FPGA world Conference, pp. 20–27, 2010.
[36] Z. T. Sworna, M. U. Haque, H. M. H. Babu, L. Jamal and A. K. Biswas, “An efficient
design of an FPGA-based multiplier using LUT merging theorem”, In 2017 IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), pp. 116–121, 2017.
CHAPTER 20

BCD Adder Using a LUT-Based Field Programmable Gate Array

The binary coded decimal (BCD) system is suitable for digital communication, which can be
designed by field programmable gate array (FPGA) technology, where look-up table (LUT)
is one of the major components of FPGA. In this chapter, a low power and area efficient
LUT-based BCD adder is introduced which is constructed basically in three steps: First, a
technique is introduced for the BCD addition to obtain the correct BCD digit. Second, a
new controller circuit of LUT is presented which is designed to select and send Read/Write
voltage to memory cell for performing Read or Write operation. Finally, a compact BCD
adder is designed using the introduced LUT.

20.1 INTRODUCTION
Binary coded decimal (BCD) representation provides accurate precision, avoids infinite
error representation, conversion to a character form can be done in linear [O(n)] time
and addition–subtraction does not require rounding. Therefore, a faster circuit for the BCD
addition method is of concern. Hence, a look-up table (LUT)-based new BCD addition
algorithm is introduced, which requires a smaller number of field programmable gate array
(FPGA) components and less area, power, and delay. The given algorithm is based on the pre-
processing mechanism. The advancement in FPGA technology has emerged as a new
horizon of technology progress due to long-time availability, rapid prototyping capability,
reliability and hardware parallelism. The cost of making incremental changes to FPGA
designs is negligible when compared to the large expense of re-spinning an application-
specific integrated circuit. An FPGA has three main elements: LUT, flip-flops and the
routing matrix. First, the basic 2-input LUT is targeted as it ultimately serves the betterment
of 3-input, 4-input and further larger input LUTs. Then, a 6-input LUT architecture is also
shown as the BCD adder is designed using 6-input LUT.


Three main focuses are addressed in this chapter:


(i) A BCD addition algorithm is introduced with the optimum time complexity.
(ii) A compact controller circuit of the LUT has been introduced with the minimum
area and power.
(iii) A new architecture for the LUT-based BCD adder is presented with the improvement
of area, power and delay.
Basic definitions and ideas related to the BCD addition method and the LUT are presented in
this chapter with illustrative figures and examples. Besides, the comparison parameters such
as area, power and delay, along with the memory unit of the LUT (the memristor), are also
formally defined in this chapter.

20.2 BCD ADDER USING LUTS


In this section, a BCD addition algorithm and a LUT architecture are presented. Then, a
new BCD adder circuit is constructed. Essential figures and properties are also presented
to clarify the ideas.

20.2.1 The BCD Addition Method


The main problem in BCD addition is the need for correction if the result exceeds the
permitted BCD range (decimal digit 9). The correction adds the binary number (0110)2 to
the result. This correction logic incurs a high delay penalty and an extra level of circuitry.
Therefore, the contribution of this chapter is to design an area-efficient and high-speed BCD
adder that can be employed in different decimal applications.
Let A and B be the two addends of a 1-digit BCD adder, where binary representations
of A and B are A3 A2 A1 A0 and B3 B2 B1 B0 , respectively. The adder’s output will be a 5-bit
binary number C out S 3 S 2 S 1 S 0 , where C out represents the position of tens digit and S 3
S 2 S 1 S 0 symbolizes unit digit of BCD sum. Instead of post-processing, the approach deals
with pre-processing. The least significant bit (LSB) of both addends is added first which
produces the LSB of sum output, i.e., S 0 and an intermediate carry, C 1 . Then, the carry
C 1 is added with B3 B2 B1 to produce b3 b2 b1 . Finally, A3 A2 A1 and b3 b2 b1 are fed to the
FPGA’s LUT as inputs.
In Table 20.1, the truth table is designed with the three MSBs of each LUT input, A and
B. The addition of the carry C 1 with B3 B2 B1 , resulting in b3 b2 b1 , is regarded as pre-processing.
A numeric 3 [(0011)2 ] is added if the sum of A and b is ≥ 5. Mathematically,

[Cout S3 S2 S1 S0 ] = (A + b + 3) if A + b ≥ 5, where b = B3 B2 B1 + C1    (20.1)

Example 20.1 Suppose, the decimal values of the two variables A and B are 9 and 5,
respectively. For the BCD addition of these operands, the BCD representation of them is
taken. First, add the LSB of A and B with the carry (C in ) from the previous BCD digit
addition. If it is the first digit addition of BCD addends, then the value of C in is zero. The
obtained sum (S 0 ) is the first bit of the output and the carry is added to the three MSBs of
B providing a value of b which is 011.

Table 20.1: Truth Table of 3-bit Addition with Pre-processing and Addition of 3

Afterwards, the three MSBs of A and b are added. As the result is 111, which is ≥ 5, 3 is added
to the sum to obtain the correct result. The resultant 4-bit sum represents the unit digit of the
BCD result, which is 4 in this case, and the carry is the tens digit, which is 1. The
demonstration is provided in Fig. 20.1. The algorithm of the BCD
addition method with pre-processing technique is given in Algorithm 20.1. Next, a LUT is
introduced to present a LUT-based BCD adder.
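As a reference for the method of Algorithm 20.1, the following Python sketch models the 1-digit BCD addition with pre-processing described above. It is an illustrative behavioral model only; the function and variable names are choices made for this sketch, not taken from the original design.

```python
def bcd_digit_add(a, b, cin=0):
    """Model of the 1-digit BCD addition with pre-processing (Eq. (20.1)).

    a, b : BCD digits (0-9), cin : incoming carry (0 or 1).
    Returns (cout, s) where s is the BCD sum digit (0-9).
    """
    assert 0 <= a <= 9 and 0 <= b <= 9 and cin in (0, 1)
    # Step 1: add the LSBs of A and B together with Cin (full-adder stage).
    s0 = (a & 1) ^ (b & 1) ^ cin
    c1 = ((a & 1) + (b & 1) + cin) >> 1
    # Step 2 (pre-processing): add the intermediate carry C1 to B3B2B1.
    a_hi = a >> 1              # three MSBs of A
    b_hi = (b >> 1) + c1       # b = B3B2B1 + C1
    # Step 3: add the high parts; add 3 when the sum reaches the BCD limit.
    hi = a_hi + b_hi
    if hi >= 5:
        hi += 3
    # Recombine: Cout is the MSB of the 5-bit result, S3..S1 come from hi, S0 from step 1.
    result = (hi << 1) | s0
    return result >> 4, result & 0xF


# Example 20.1: 9 + 5 = 14 -> sum digit 4, carry 1.
print(bcd_digit_add(9, 5))    # expected (1, 4)
```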

Figure 20.1: Demonstration of the BCD Addition Algorithm Exhibited in Example 20.1.

Algorithm 20.1: Algorithm of the BCD Addition Method with Pre-processing Technique

20.2.2 The Architecture of a LUT


A LUT consists of two basic parts: (i) a controller circuit and (ii) a memory unit. A memristor
is used as the memory unit because of its non-volatility. Besides being non-volatile, the
memristor provides better area and power efficiency than other memories such as SRAM,
dynamic random access memory (DRAM), ferroelectric random access memory (FeRAM),
magneto-resistive random access memory (MRAM), nano random access memory (NRAM),
conductive bridging random access memory (CBRAM) and phase change random access
memory (PCRAM). In this chapter, the controller part of the circuit is presented in a compact
way with an optimal number of gates, which includes the selection of a memristor cell along
with passing the Read/Write voltage to the cell. The internal memristance changes with the
applied voltage for the corresponding Read/Write/Reprogram operation.
The Write voltage is considered as Data and the Read voltage is considered as Read
Pulse. As only one memristor will be selected at a time, only one Write voltage (Data) is
considered. It is never possible to run Read and Write operations simultaneously. Therefore,
the Ex-OR and corresponding AND gates are used to select only one operation at a time
to avoid this ambiguity. Once one operation is selected then either the Read or the Write
voltage is passed from the AND or transmission gate respectively to the OR gate. The output
of the OR gate is connected to the left of each memristor to propagate the operational voltage
either to Write 1/0 to store in the memory or to Read the corresponding memory unit when
the memristor is selected.
The selection of the memory unit is performed depending on the two inputs of the
LUT, A and B, as they refer to the corresponding memory addresses of the memory
cells (memristors). Considering the addresses of the memristors to be 00, 01, 10 and 11,
the addresses can be represented as A′B′, A′B, AB′ and AB, respectively. A transistor is
activated through input B, and input A is then sent from that transistor to the next transistor
to activate it. Two transistors T 9 and T 10 are connected to the output lines O 1 and
O 2 , respectively. The output from M 00 and M 01 passes through O 1 , and the output from M 10
and M 11 passes through O 2 . Although two memristors are connected to a single output line,
only one memristor is activated at a time, so only one value passes through the line. As
the output line is connected to the drain of the transistors, the passed voltage does not
affect the unselected memristor, since an unactivated transistor never passes voltage from drain to
source. The transistors T 9 and T 10 are activated by R only when a Read operation is being performed;
the output voltage of the selected memristor is then passed through the corresponding
transistor to the output line. During a Write operation the transistors T 9 and T 10 are not
activated, so no value is passed to the output multiplexer (MUX). The LUT is shown
in Fig. 20.2 and the algorithm for the construction is given in Algorithm 20.2. A property is
also given in Property 20.2.1 supporting the generalization of a 2-input LUT circuit.
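A simple behavioral model of the 2-input LUT, cell selection plus mutually exclusive Read and Write operations, is sketched below in Python. It abstracts the four memristor cells as an array indexed by the address A′B′, A′B, AB′, AB; the class and method names are illustrative assumptions, not part of the design in Fig. 20.2.

```python
class TwoInputLUT:
    """Behavioral sketch of the 2-input LUT: four memory cells, one controller."""

    def __init__(self):
        self.cells = [0, 0, 0, 0]          # M00, M01, M10, M11 (memristor states)

    @staticmethod
    def _address(a, b):
        # Address decoding: A'B' -> 0, A'B -> 1, AB' -> 2, AB -> 3.
        return (a << 1) | b

    def write(self, a, b, data):
        # Write Enable high: the Write voltage (Data) reaches only the selected cell.
        self.cells[self._address(a, b)] = data & 1

    def read(self, a, b):
        # Read Enable high: the selected cell's value reaches O1 or O2 and the
        # output MUX (controlled by A) forwards it to the output terminal.
        return self.cells[self._address(a, b)]


# Program the LUT as a 2-input AND gate and read it back.
lut = TwoInputLUT()
for a in (0, 1):
    for b in (0, 1):
        lut.write(a, b, a & b)
print([lut.read(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 0, 0, 1]
```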

Figure 20.2: Architecture of the 2-Input LUT.

Reset is omitted as it is never possible to select all of the memory cells to reset them
altogether. Since Reset is nothing but the Write 0 operation, there is no difference between
these two operations when performed on single memory cell which removes hardware
complexity. Besides, instead of using the conventional Wl (word line) and Bl (bit line), the
direct use of the LUT input with inverter has reduced the controller circuitry overhead.
Similarly, a 6-input LUT can be designed using 2-input LUTs, as shown in Fig.
20.3. A 6-input LUT has 2^6 = 64 memory cells with a common controller circuit and two
outputs O mux1 and O mux2 . To make the design area efficient, a three-dimensional layer
structure has been used. A total of 64 memory cells are arranged in four layers, where each
layer has 16 memory cells. For a particular layer, a column is selected by inputs A and B using a
selection circuit consisting of four transistors, each activated by input B (when B is 1)
and passing input A as the column selection. Besides, a row selection voltage is sent using the
two inputs C and D in the same way. As there are four layers, a layer selection voltage for
selecting a particular layer is generated using inputs E and F . With every memristor there
are three transistors: the first is activated by the layer selection voltage and sends the column
selection voltage to the next transistor; the second, when activated, sends the row selection
voltage to the third transistor; and the third, when activated, finally activates the
corresponding memristor. The 6-input LUT requires a total of 64 memristors,
and no additional memristors are required for a reference cell.
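The layered selection just described can be summarized as an address decomposition: inputs A and B pick a column, C and D a row, and E and F a layer, so each of the 64 cells has a unique (layer, row, column) index. The short Python sketch below illustrates this decomposition; it is only an interpretive model of the wiring in Fig. 20.3, and the helper name is hypothetical.

```python
def memristor_index(a, b, c, d, e, f):
    """Map the six LUT inputs to a (layer, row, column) cell index in 0..63."""
    column = (a << 1) | b        # selected via the A/B column-selection circuit
    row = (c << 1) | d           # selected via the C/D row-selection voltage
    layer = (e << 1) | f         # selected via the E/F layer-selection voltage
    return layer * 16 + row * 4 + column


# Every input combination selects a distinct one of the 64 memristors.
indices = {memristor_index(*[(i >> k) & 1 for k in range(6)]) for i in range(64)}
print(len(indices))   # 64
```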
Property 20.2.1 An n-input LUT requires at least 2^(n−2) 2-input LUTs, where n ≥ 2.
Proof 20.1 The above statement is proved by mathematical induction.
Basis: The basis case holds for n = 2, as 2^(2−2) = 1.
Hypothesis: Assume that the statement holds for n = k. Therefore, a k-input LUT
consists of 2^(k−2) 2-input LUTs.
Induction: Now consider n = k + 1. Therefore, a (k + 1)-input LUT requires
2^((k+1)−2) = 2^(k−1) 2-input LUTs. Reducing the number of inputs by one gives n = k,
and a k-input LUT then requires 2^((k−1)−1) = 2^(k−2) 2-input LUTs, which agrees with the
hypothesis. Therefore, the statement holds for n = k + 1.
Hence, for n ≥ 2, an n-input LUT consists of 2^(n−2) 2-input LUTs.

Algorithm 20.2: Algorithm for the Construction of the 2-input LUT

Figure 20.3: Architecture of the 6-Input LUT.



20.2.2.1 Working Mechanism of the 2-Input LUT


In this subsection, the two types of operations, Read and Write, are described.
Write operation: For a Write operation, the Write Enable pulse voltage is high, which is
passed to the transmission gate to send the Write voltage through the gate. The two inputs
A and B select the particular memory cell M i j , where i = {0, 1} and j = {0, 1}. The initial
memristance of M i j is considered R OFF and R ON for the Write 1 and Write 0 operations,
respectively. Then, a pulse of +Vdd /−Vdd is applied through Vin until the memristance
changes state. Thus, a logic 1/0 is successfully written to M i j . Suppose 1 is to be written to
memristor M 11 : B = 1 activates transistor T 1 and sends A = 1 from T 1 to transistor T 2 , which
selects M 11 , and the Write voltage (+Vdd ) is applied to the memristor until the memristance
changes accordingly, consequently writing 1 to M 11 . Since the reset operation has been
omitted, a memristor is re-programmed simply by performing the opposite Write operation
on it: to re-program a cell that currently holds 1, a Write 0 operation is performed on that
memristor, and to re-program a cell that currently holds 0, a Write 1 operation is performed
on it.
Read operation: For a Read operation, the Read Enable voltage and the Read Pulse (+Vdd )
are both high and are propagated to the particular memory cell by using inputs A and
B and the corresponding transistors. To perform the Read 0/Read 1 operation, a positive pulse of
+Vdd (Read Pulse) is applied to the memristor and the Read value is found at the output
terminal O 1 (for the memristors M 00 and M 01 ) or O 2 (for the memristors M 10 and M 11 ).
Suppose 1 is to be read from memristor M 11 : B = 1 activates transistor T 1
and sends A = 1 from T 1 to transistor T 2 , which selects M 11 . As the Read Pulse is
applied to the memristor, its value is propagated through transistor T 2 to transistor T 10 ,
which is activated by the R value, so the output voltage is transmitted from the memristor
through T 10 to output line O 2 and finally to the MUX gate.
Input A being 1 selects the value of O 2 at the output terminal. For a Read 0 operation,
assuming the NSP (memristance) of the memristor is zero, the read slightly changes the NSP
of the memristor towards the value of R ON . To restore the NSP to its original value, a
RESTORE pulse of −Vdd is applied.

Table 20.2: Read and Write Scheme using the Introduced Approach

20.3 BCD ADDER CIRCUIT USING LUTS


A LUT-based BCD adder is designed using the BCD addition algorithm and LUT. An
algorithm for the construction of the BCD adder circuit is presented in Algorithm 20.3.
According to the algorithm, the block diagram and the circuit are depicted in Fig. 20.4. For
the addition of the least significant 1 bit, a full adder; and for pre-processing, two half adders
and an OR gate are used. The 6-input LUT is used to add the three MSBs of the operands
and to apply the correction by adding 3. Therefore, the addition circuit is improved by
removing extra circuitry overhead. Using the 1-digit BCD adder circuit described above, an
n-digit BCD adder circuit can easily be created, where the C out of each 1-digit adder circuit is
sent to the adder of the next BCD digit as its C in . So, the generalized n-digit BCD adder
computes sequentially using the previous carry, as shown in Fig. 20.4.
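To illustrate the n-digit generalization, the following Python sketch ripples the carry through a chain of 1-digit adders. The per-digit adder here is a plain arithmetic stand-in; the LUT-based pre-processing sketch given in Section 20.2.1 produces the same per-digit result. The function names are illustrative, not taken from Algorithm 20.3.

```python
def bcd_digit_add(a, b, cin=0):
    """Reference 1-digit BCD adder (arithmetic model of the LUT-based digit adder)."""
    total = a + b + cin
    return total // 10, total % 10


def bcd_add(x_digits, y_digits):
    """n-digit BCD addition: each Cout is fed to the next digit's adder as Cin.

    Operands are lists of BCD digits, least significant digit first, assumed to be
    padded to the same length.
    """
    carry, out = 0, []
    for x, y in zip(x_digits, y_digits):
        carry, s = bcd_digit_add(x, y, carry)
        out.append(s)
    if carry:
        out.append(carry)      # final carry becomes a new most significant digit
    return out


# 479 + 863 = 1342 (digits stored least significant first).
print(bcd_add([9, 7, 4], [3, 6, 8]))   # [2, 4, 3, 1]
```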

Algorithm 20.3: Algorithm for the Construction of the BCD Adder Circuit

Figure 20.4: The BCD Adder: (a) Block Diagram of 1-Digit BCD Adder, (b) 1-Digit BCD
Adder, (c) Block Diagram of the n-Digit BCD Adder.

The time complexity of the addition method is mathematically proven in Property 20.3.1.

Property 20.3.1 An n-digit BCD adder requires at least O(5n) time complexity, where
n is the number of BCD digits.

Proof 20.2 The above statement is proved by the method of contradiction.

Suppose an n-digit BCD adder does not require at least O(5n) time complexity.
The critical path of the n-digit BCD adder contains n full adders, 2n half adders, n
OR gates and 4n 6-input LUTs. Except for the arrangement of the 6-input LUTs, the design
has a serial architecture with a latency of O(4n).
Hence, the time complexity of the BCD adder is O(5n).
This contradicts the supposition. Hence, the supposition is false and Property 20.3.1 is
true.

20.4 SUMMARY
Outstanding features and continual advancement have made today's world the era of the
FPGA (Field Programmable Gate Array). The LUT (Look-Up Table), being the most important
and complex element of an FPGA, is the main target for its improvement. The introduced 2-input
LUT provides a prominent enhancement in terms of area and power. Besides, BCD (Binary Coded
Decimal) addition, being the most basic arithmetic operation, is the main focus as an
application of the LUT-based FPGA. The BCD adder is constructed with optimum area,
power, and delay. These improvements in FPGA-based BCD addition will consequently
influence the advancement of all other arithmetic operations as well as the computation and
manipulation of decimal digits, as it is more convenient to convert from decimal to BCD than
to binary. Moreover, BCD is useful for exact decimal calculations, which are often a requirement
for financial applications, accountancy, etc. It also makes operations such as multiplying or dividing
by powers of 10 easier; its “fixed pitch” format makes it easy to find the nth digit in
a particular number; and such arithmetic operations can easily be chunked into multiple
threads, for example for parallel processing.

REFERENCES
[1] D. Morteza and G. Jaberipur, “Low area/power decimal addition with carry-select
correction and carry-select sum-digits”, Integr. VLSI J., vol. 47, no. 4, pp. 443–451,
2014.
[2] S. Gao, D. A. Khalili and N. Chabini, “An improved BCD adder using 6-LUT FPGAs”,
IEEE Tenth Int. New Circuits and Systems Conf., 2012.
[3] F. D. Dinechin and A. Vázquez, “Multi-operand decimal adder trees for FPGAs”,
2010.
[4] M. Vasquez, G. Sutter, G. Bioul and J. P. Deschamps, “Decimal adders/subtractors in
FPGA: efficient 6-input LUT implementations”, Int. Conf. on Reconfigurable Com-
puting and FPGAs, 2009.
[5] G. Saeid, G. Jaberipur and R. H. Asl, “Efficient ASIC and FPGA implementation of
binary-coded decimal digit multipliers”, Circuits Syst. Signal Process., vol. 33, no.
12, pp. 3883–3899, 2014.
[6] L. O. Chua, “Memristor-the missing circuit element”, IEEE Trans. Circuit Theory,
vol. 18, no. 5, pp. 507–519, 1971.
[7] D. B. Strukov, G. S. Snider and D. R. Stewart, “The missing memristor found”, Nature,
vol. 453, no. 7191, pp. 80–83, 2008.
[8] A. F. Haider, T. N. Kumar and F. Lombardi, “A memristor-based LUT for FPGAs”,
Ninth IEEE Int. Conf. on Nano/Micro Engineered and Molecular Systems (NEMS),
2014.
[9] Y. Ho, G. M. Huang and P. Li, “Dynamical properties and design analysis for non-
volatile memristor memories”, Circuits and Systems I, IEEE Trans. on, vol. 58, no. 4,
pp. 724–736, 2011.
[10] N. Z. Haron and S. Hamdioui, “On defect oriented testing for hybrid CMOS/memristor
memory”, In Test Symposium (ATS), pp. 353–358, 2011.
[11] X. X. Dong, N. P. Jouppi and Y. Xie, “Design implications of memristor-based RRAM
cross-point structures”, Proc. Des. Autom. Test Eur., pp. 1–6, 2011.
[12] X. Yuan, “Modeling, architecture, and applications for emerging memory technolo-
gies”, IEEE Comput. Des. Test., vol. 28, no. 1, pp. 44–51, 2011.
[13] K. Sohrab, G. Rosendale and M. Manning, “A 3D stackable carbon nanotube-based
nonvolatile memory (NRAM)”, IEEE Proceedings of the European Solid-State Device
Research Conf., 2010.
[14] M. Thomas, M. Salinga, M. Kund and T. Kever, “Nonvolatile memory concepts based
on resistive switching in inorganic materials”, Adv. Eng. Mater., vol. 11, no. 4, pp.
235–240, 2009.
[15] Y. C. Chen, W. Zhang and H. Li, “A look up table design with 3D bipolar RRAMs”,
ASP-DAC, pp. 73–78, 2012.
[16] G. Bioul, M. Vazquez and J. P. Deschamps, “Decimal addition in FPGA”, Fifth
Southern Conf. on Programmable Logic, pp. 101–108, 2009.
[17] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low-power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET Circuits, Devices & Systems, vol. 10, no. 3,
pp. 163–172, 2016.
CHAPTER 21

Generic Complex Programmable Logic Device Board

The goal of the design and the development of Generic Complex Programmable Logic
Device (CPLD) Board is to reduce the product’s overall design and development life cycle
time. The same board may be used in various system designs since Programmable Logic
Devices are extremely versatile and changeable. These devices operate at very low voltages
with fast speeds, and consume very little power. As the device count at the system level
decreases dramatically, these characteristics make PLDs (Programmable Logic Devices)
more flexible and enhance product dependability to a larger extent. For in-system
programming, the board provides a Joint Test Action Group (JTAG) interface on board.
This makes the board more adaptable to design modifications, upgrades, and easy migration
from one standard to the next. The implementation of the A5/1 algorithm, a
seven-segment display driver, a binary counter, and LED (Light Emitting Diode) control
logic, among other things, is demonstrated in this chapter.

21.1 INTRODUCTION
The widespread usage of programmable logic devices (PLDs) in a wide range of applications
such as telecom infrastructure, consumer electronics, industrial and medical industries has
resulted from the necessity to adapt to changing market requirements in a constrained
time to market window. Typical board tasks in these applications include power supply
sequencing, voltage and current monitoring, bus bridging, voltage level translation, interface
management, and temperature measurement. System designers are under constant pressure
to fulfill development deadlines, so they must execute ideas with the least amount of
work and risk while preserving maximum flexibility. Designers may reduce system cost,
save space, and maintain a high level of product diversity by utilizing a programmable-
based approach instead of many discrete devices or Application Specific Standard Products
(ASSPs).


PLDs (Programmable Logic Devices) are an important part of embedded industrial
designs. In industrial designs, PLDs have progressed beyond basic glue logic to the usage
of FPGAs as a coprocessor. In applications including communications, motor control, I/O
modules, and image processing, this approach allows for I/O extension while offloading
the core microcontroller (MCU) or digital signal processor (DSP) device. The PWM ap-
proach has been demonstrated and proven to be widely utilized in most industrial power
controllers. The use of FPGA/CPLD ICs to regulate power converters has made high fre-
quency PWM (Pulse Width Modulation) generator design more adaptable and simple to
build. The resultant PWM frequency is determined by the target FPGA or CPLD device
speed as well as the required duty cycle resolution.
PLD/FPGA-based digital controllers are significantly superior to DSP-based digital
controllers in terms of dynamic performance and control capabilities. PLDs have gained
popularity in various market areas, including portable devices, by enabling product designs
that minimize power consumption, offering new packaging choices, lower unit costs,
and a faster design cycle. The design and development of a Complex Programmable Logic
Device (CPLD) board for numerous purposes are described in this chapter.

21.2 HARDWARE DESIGN AND DEVELOPMENT


The typical programmable logic device's density has increased dramatically in recent
years. As chip density grew, PLD makers evolved their products into bigger (logically, but not
necessarily physically) components termed Complex Programmable Logic Devices (CPLDs).
A CPLD's greater size allows designers to use more logic equations or
create a more complex design. These chips are large enough to replace dozens of components
from the 7400 series. To show the board's general nature, the current design employs an
ALTERA MAX II series CPLD, EPM570T144C5, in a TQFP144 package. The following are
its characteristics:

1. On Board DC-DC converters

2. On Board JTAG Interface

3. Compact (4” × 3”)

4. Low Power consumption

5. Visual indication for power, programming

6. On board clock circuitry

7. Seven segment display control interface

8. LED interface

9. Reverse polarity protection

The block diagram of the CPLD board is shown in Fig. 21.1. As indicated in the block
diagram it consists of the following parts:

1. DC-DC Converters
2. JTAG Interface
3. LED Interface
4. Clock Circuit
5. CPLD
6. 7-Segment Display
7. Input/output Connectors

Figure 21.1: Block Diagram of CPLD Board.

The developed prototype is shown in Fig. 21.2.

Figure 21.2: Prototype Board.



21.2.1 DC-DC Converters


The LM317 and LTC1963-3.3 voltage regulators are utilized to generate 5 V and 3.3 V for
the board's functioning. Both are present on the board. The clock, 7-segment
display, LED interface, and CPLD are all powered by these voltage regulators. Each device
has a 1.5 A output.

21.2.2 JTAG Interface


The JTAG interface is implemented using a standard SN74HC245 device with a few passive
components on the board to prevent cable-length issues while programming the board
through the PC parallel port. Typically, such cables are only a few millimeters long; this
constraint is overcome in the design of this board. The programming action is indicated via
an LED indicator. Fig. 21.3 depicts the JTAG interface cable.

Figure 21.3: JTAG Cable.

21.2.3 LED Interface


These LED connections are provided to visually indicate the presence of power as well as
other activities.

21.2.4 Clock Circuit


A low-cost LM555 timer IC with passive components is used to create the clock circuit.
A clock of any desired frequency may be produced by changing the values of the resistors
and capacitors.
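For reference, the output frequency of a 555 timer in the standard astable configuration depends on the two timing resistors and the capacitor through the well-known approximation f ≈ 1.44 / ((R1 + 2·R2)·C). The small Python sketch below applies this formula; the component values in the example are chosen purely for illustration and are not taken from the board schematic.

```python
def lm555_astable_frequency(r1_ohms, r2_ohms, c_farads):
    """Approximate output frequency (Hz) of an LM555 wired in astable mode."""
    return 1.44 / ((r1_ohms + 2.0 * r2_ohms) * c_farads)


# Illustrative values: R1 = 1 kOhm, R2 = 10 kOhm, C = 100 nF -> roughly 686 Hz.
print(round(lm555_astable_frequency(1e3, 10e3, 100e-9)))
```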

21.2.5 CPLD
The chosen CPLD is an ALTERA MAX II series EPM570T144C5 device in a 144-pin
TQFP package. Fig. 21.4 depicts the block diagram of this CPLD.

Figure 21.4: Block Diagram of the CPLD.

It has 570 logic elements (LEs) that are configured according to the design specifications.
This device is more than adequate for medium-sized digital designs. Fig. 21.5
depicts the structure of the logic element (LE).

Figure 21.5: LE Structure.

To incorporate custom logic, MAX II devices have a two-dimensional row- and column-
based design. Signal interconnects between the logic array blocks (LABs) are provided by
row and column interconnects. The logic array is made up of LABs, each having ten logic
elements (LEs). An LE is a tiny logical unit that facilitates the implementation of user logic
functions. LABs are organized into rows and columns across the device.
Fast, fine-grained routing between LABs is provided via the MultiTrack interconnect. In
comparison to globally routed interconnect architectures, the fast routing between LEs
introduces minimal delay for additional layers of logic. The MAX II device I/O pins are fed by
I/O elements (IOEs) located around the device's perimeter, at the ends of
the LAB rows and columns. A bidirectional I/O buffer with numerous sophisticated capabilities
is included in each IOE. Schmitt trigger inputs and several single-ended standards, such as
66-MHz 32-bit PCI and LVTTL, are supported through the I/O pins. A global clock network
is provided by MAX II devices. The global clock network is made up of four global clock
lines that run throughout the device and provide clocks for all of its resources.

21.2.6 Seven-Segment Display


When a common-anode seven-segment display is connected, the control circuitry for the
display uses transistors to sink the current. In addition, it permits the designer to connect
many segments to a single data bus, minimizing the number of components on the board.

21.2.7 Input/Output Connectors


These are provided to interface the board to the external world.

21.3 INTERNAL HARDWARE DESIGN OF CPLD


The internal hardware of the CPLD is designed using Altera's QUARTUS-II Electronic
Design Automation (EDA) tool. A description of the hardware's structure and behavior is
written in a high-level hardware description language (typically VHDL or Verilog), and that
code is then compiled and downloaded prior to execution. Of course, schematic capture is an
alternative for design entry, but as designs have become more sophisticated and language-based
tools have matured, it has become less common. Fig. 21.6 depicts the whole hardware
development process for programmable logic.

Figure 21.6: Programmable Logic Design Process.



21.3.1 A5/1 Algorithm


The Global System for Mobile Communications (GSM) is a second-generation (2G) mobile
phone system that has made mobile communications available to the general public. In
many developed nations, the number of mobile phone customers outnumbers the subscribers
of the traditional telephone network. GSM and the security architecture that underpins it
were created in the 1980s. Because it is extensively used and has become pervasive in most
nations across the world, its reach is still expanding. The stream cipher A5/1 is used to
protect over-the-air communication in GSM. This method gives a fair level of protection
against attacks. It contains three linear feedback shift registers (LFSRs). The lengths of the
registers are 19, 22, and 23. The output is the XOR of the three LFSRs. The clocking of A5/1
is irregular. Each register is clocked based on its own middle bit, XORed with the inverse
threshold function of the middle bits of all three registers; at least two of the LFSRs are
clocked in each round. Fig. 21.7 depicts the structure of A5/1.

Figure 21.7: Structure of A5/1 GSM Algorithm.
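The following Python sketch is an illustrative software model of A5/1's majority-rule (stop/go) clocking. The register lengths (19, 22, 23) follow the chapter; the feedback-tap and clocking-bit positions are the commonly published ones and are assumptions, not taken from the text, and key/frame loading is omitted.

```python
REGS = [
    # (length, feedback tap positions, clocking bit position)
    (19, (13, 16, 17, 18), 8),
    (22, (20, 21), 10),
    (23, (7, 20, 21, 22), 10),
]


def a51_step(state):
    """Clock the three LFSRs by the majority rule and return one keystream bit."""
    clk_bits = [(r >> c) & 1 for r, (_, _, c) in zip(state, REGS)]
    majority = 1 if sum(clk_bits) >= 2 else 0
    out = 0
    for i, (length, taps, c) in enumerate(REGS):
        r = state[i]
        if ((r >> c) & 1) == majority:      # only registers agreeing with the majority move
            fb = 0
            for t in taps:
                fb ^= (r >> t) & 1
            state[i] = ((r << 1) | fb) & ((1 << length) - 1)
        out ^= (state[i] >> (length - 1)) & 1   # keystream bit = XOR of the three MSBs
    return out


state = [0x4AB7F, 0x2E53D1, 0x5B8E2C]        # arbitrary nonzero register fills
print([a51_step(state) for _ in range(8)])   # first eight keystream bits
```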

21.3.2 Seven Segment Display Driver


The structure of the BCD to seven segment decoder is shown in Fig. 21.8. This is used to
display the encrypted message along with the plain text generated using the counter.

Figure 21.8: Structure of BCD to Seven Segment Decoder.

21.3.3 Binary 8-Bit Counter


An 8-bit binary counter is used to generate varied clock rates in order to generate pseudo-
random bit sequences (PRBS) and to divide the input clock frequency to the desired value.
All of this is done in the high-level language VHDL, while the top level is done with
schematic entry. Fig. 21.9 depicts the VHDL entities for the various blocks implemented in the CPLD.
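As an illustration of how a free-running binary counter divides the input clock, the following Python sketch samples a single counter bit on every input clock edge: bit k toggles every 2^k cycles, i.e. it is the input clock divided by 2^(k+1). This is a behavioral model only; the actual VHDL entity names in Fig. 21.9 are not reproduced here.

```python
def divided_clock(num_input_cycles, k):
    """Return the k-th bit of an 8-bit free-running counter for each input clock edge."""
    counter, samples = 0, []
    for _ in range(num_input_cycles):
        counter = (counter + 1) & 0xFF       # 8-bit binary counter
        samples.append((counter >> k) & 1)   # bit k = input clock divided by 2**(k+1)
    return samples


# Bit 2 of the counter gives the input frequency divided by 8.
print(divided_clock(16, 2))   # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```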

Figure 21.9: VHDL Entities.

The complete implementation and the developed prototype are shown in Figs. 21.10 and
21.11.

Figure 21.10: View of the Complete Prototype.



Figure 21.11: Seven Segment Display.

21.4 APPLICATIONS
As stated in the preceding sections, this board is general in nature and can be used for a
variety of purposes including:

1. VHDL/Verilog Language Trainer Kit

2. It may be utilized as an add-on module in system designs.

3. Design and development of ASICs

4. Designs for Embedded Systems

5. Implementation and verification of encryption algorithms

6. Implementation of a voice compression technique

7. Hardware that can be reconfigured in a variety of embedded, instrumentation, medical, and communication systems.

21.5 SUMMARY
The design and development of a CPLD (Complex Programmable Logic Device) board have
been described in this chapter. This board is well structured and proven to be general in
nature, and it may be utilized as reconfigurable hardware in many system designs in the
fields of communication, medical electronics, industrial electronics, VLSI (Very Large
Scale Integration), embedded circuits, and so on. It may also be utilized in educational
institutions as a VHDL (VHSIC Hardware Description Language)/Verilog trainer kit. If all
SMD (Surface Mount Device) components are employed, the board's size can be decreased
even further.

REFERENCES
[1] E. Koutroulis, A. Dollas and K. Kalaitzakis, “High-frequency pulse width modulation
implementation using FPGA and CPLD ICs”, Journal of Systems Architecture, vol.
52, pp. 332–344, 2006.
[2] B. S. Kariappa and M. U. Kumari, “FPGA Based Speed Control of AC Servomotor
Using Sinusoidal PWM”, International Journal of Computer Science and Network
Security, vol. 8, no. 10, 2008.
[3] N. Hediyal, “Key Management for Updating Crypto-keys over AIR”, International
Journal of Computer Science and Network Security, vol. 11, no. 1, 2011.
[4] A. Opara and D. Kania, “Decomposition-Based Logic Synthesis for PAL-Based
CPLDs”, Int. J. Appl. Math. Comput. Sci., vol. 20, no. 2, pp. 367–384, 2010.
[5] I. Rahaman, M. Rahaman, A. L. Haque and M. Rahaman, “Fully Parameterizable
FPGA Based Crypto-Accelerator”, World Academy of Science, Engineering and Tech-
nology, 2009.
[6] P. Malav, B. Patil and R. Henry, “Compact CPLD board Designing and Implemented
for Digital Clock”, International Journal of Computer Applications, vol. 3, no. 11,
2010.
[7] N. Hediyal, “Generic Complex Programmable Logic Device (CPLD) Board”, Interna-
tional Journal of Computer Science and Information Technologies (IJCSIT), vol. 2,
no. 5, pp. 2004–2007, 2011.
CHAPTER 22

FPGA-Based Programmable Logic Controller

Programmable Logic Controllers (PLCs) are more cost effective, compact, and easier to
operate than PC (Personal Computer)-based solutions for modest and stand-alone control
of a single process. Some research has been done on the implementation of a control
program in an FPGA (Field Programmable Gate Array). However, the majority of it
focused on strategies for converting functional-level control programs into HDL (Hardware
Description Language) logic descriptions. As a result of these approaches, PLC users need
design tools to translate, integrate, and implement logic circuits in an FPGA. These tools
also require training for manufacturing plant engineers. This chapter proposes a method for
implementing a general PLC inside an FPGA. Once the FPGA is configured, it functions as
a PLC that can be embedded into devices, machines, and systems using suitable interfaces.

22.1 INTRODUCTION
Programmable Logic Controllers (PLCs), also referred to as programmable controllers, are
used in commercial and industrial applications.
As shown in Fig. 22.1, a PLC consists of input modules, a processor, and output
modules. An input accepts a wide range of digital or analog signals from various field
devices (sensors) and converts them to a logic signal that the processor may use. Based
on program instructions in memory, the processor makes decisions and executes control
instructions. The output modules convert the processor's control instructions into digital or
analog signals that can be used to control a variety of field devices (actuators). The
desired instructions are entered using a programming device. As a result, PLCs can be
complicated, and they are typically built on 486 and Pentium processors with a large
number of analog and digital I/Os. Occasionally, an application may require the usage
of a PLC with very restricted capabilities, but the pricing does not fit the budget. Thus,
PLCs are more cost-effective, smaller, and easier to operate than PC-based systems for tiny,
stand-alone control of a single process. PLCs, which can handle from 15 to 128 I/O points,
are commonly used by control engineers.


Figure 22.1: A Conventional PLC System.

22.2 FPGA TECHNOLOGY FOR PLC


For industrial process control applications, industries have been hesitant to adopt new
technologies such as field programmable gate arrays (FPGAs). The main reason for this is
that manufacturing plant engineers are generally untrained in digital logic design. As a result,
an FPGA architecture that directly implements relay ladder logic has been suggested. However,
rather than developing a completely new FPGA architecture and tool set, a generic PLC may be
designed utilizing FPGAs that are currently on the market. Various publications have
explored a design technique for translating “interpreted Petri net specifications” into hardware
description languages. A methodology for converting rule-based descriptions into logic
descriptions has also been given, such as a logic synthesis tool that converts SFC (Sequential
Function Chart) descriptions into Verilog-HDL, and a design framework that unifies control
logic and peripheral operations on an FPGA chip, with a converter that translates a PLC
instruction sequence into a logic description. These approaches, however, focused
on converting functional-level control programs into HDL logic descriptions.
With these techniques, PLC users will need design tools to translate, integrate, and construct
the logic circuit in the FPGA.
This chapter proposes a method for implementing a generic PLC within an FPGA. Once
the FPGA has been set up, it may be used as a PLC with the appropriate interface, providing
sufficient performance and flexibility. Through specialized ladder program software, ladder
program input will be supplied to PLC (FPGA configured as PLC) in program mode.
In run mode, it will begin executing the ladder program. Debugging of programs will
be built into the software. As a result, manufacturing plant engineers do not require any
additional training because this configured FPGA functions as a traditional PLC with the
added benefits of FPGA technology.

A PLC which is built using FPGA technology produces a superior solution with the
following benefits:

I. Flexibility: PLC designs may be easily upgraded by design engineers (PLC makers).
By modifying the Hardware Description Language (HDL) description and reconfiguring
the same FPGA chip for the updated PLC design, features or instructions can be added to
the current design.

II. Accuracy: Fast design allows engineers to include very time-critical tasks like limit
and proximity sensor detection and sensor health monitoring into hardware, resulting
in more accurate solutions.

III. Short Product Development Cycle: Design time is significantly decreased because
of the usage of standard HDLs and automated design tools. Engineers may also
experiment with different implementations because the control code runs directly in
silicon.

IV. Low Cost and Compactness: Because of the aforesaid advantages, the designer may
fulfill market demands by satisfying consumer wants and improving the product’s
performance or usefulness. In comparison to other existing options, this results in
high performance, low cost, and compact designs.

22.3 SYSTEM DESIGN PROCEDURE FOR PLC


Automation system design for a typical manufacturing industry using PLC is shown in
Fig. 22.2, and is discussed briefly in the following sections:
The system design inside the FPGA comprises:

I. Data memory is used to store data temporarily such as (a) Timer or counter preset
settings, (b) Arithmetic or logic execution results, (c) Input/output status, and so on.

II. The program memory used to store encoded ladder program instructions is known
as user memory.

III. Ladder instruction decoder and ladder instruction execution block make up the control
unit. It decodes and executes the ladder instruction, as well as providing user and
data memory control signals.

IV. User interface or ladder program encoding software that accepts user-supplied ladder
program instructions, debugs them, and encodes them in a usable format.

22.3.1 Ladder Program Structure


The application program is developed as a set of ladder rungs (rows). A ladder program allows
a maximum of “n” rungs. In turn, each ladder rung structure allows programming as follows:

(i) Maximum “m” contacts/function blocks in series including one coil (columns); and

(ii) Maximum “ p” contacts /function blocks in parallel.



Figure 22.2: Block Diagram of the System Architecture of a PLC.

Thus a ladder program structure allows a ladder program of size [m, n]. First element of
each rung is encoded as “Start of rung”, and the last element of ladder is encoded as “End
of ladder”.

22.3.2 Operating Modes of PLC


There are two modes of operating a PLC which are as follows:

(i) Program Mode: In this mode, the ladder program is loaded into the program memory
of PLC.

(ii) Run/Execution: In this mode, the PLC executes the ladder logic cyclically.

22.3.3 Ladder Scanning


The PLC executes its instructions in a cyclical manner. It reads the state of the inputs,
performs the logic, and modifies the status of the outputs. This is called a PLC scan. The
ladder program is continually scanned from left to right and top to bottom. The ladder scan
cycle, which is made up of “n” rung scan cycles, scans the whole ladder program. Every
“p”-rung cycle scans successive rungs from left to right; this is necessary for assessing the
parallel connections that exist between them. The parameters of the demonstrated design
were p = 4, m = 7, and n = 32.

22.3.4 Ladder Execution


Execution of “Start of rung” instruction activates the instruction decoder and execution
block.

Figure 22.3: Ladder Scanning

Every ladder rung cycle, “ p” rung instructions (contact/function block instructions) are
decoded and performed in parallel from left to right. The output of each rung instruction is
supplied as input for the following rung instruction, and parallel connections are assessed.
Each rung instruction takes two clock cycles to execute. The state of output is updated in
the output memory region at the conclusion of each rung cycle. The revised output values
are sent to the output module at the completion of the full ladder scan. As a result, when
the “End of ladder” instruction is executed, the output status is updated and the address
counter in program memory is reset for the next ladder scan.
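Conceptually, evaluating a rung amounts to ANDing the series contacts within each branch and ORing the parallel branches before driving the coil. The Python sketch below models this evaluation for one rung; the data layout (a list of branches, each a list of contact states) is an illustrative abstraction, not the encoded instruction format used by the design.

```python
def evaluate_rung(branches):
    """Evaluate one ladder rung.

    branches : list of parallel branches; each branch is a list of contact states
               (True = contact closed / passing power).
    Returns the coil state: power flows if any branch has all of its contacts closed.
    """
    return any(all(contacts) for contacts in branches)


# Rung with two parallel branches: (X1 AND X2) OR (X3).
print(evaluate_rung([[True, False], [True]]))    # True  -> coil energized
print(evaluate_rung([[True, False], [False]]))   # False -> coil de-energized
```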

22.3.5 System Implementation


Program memory is where the encoded ladder program is saved. In Run mode, it provides
data to the ladder execution block. The control unit generates the logic for controlling the
functioning of both the ladder program memory and the ladder execution block. In program
mode, it sends a write enable signal and a memory address to the ladder program memory.
In Run mode, it activates the ladder execution block and generates a memory address to
read its contents.
The ladder execution block provides all of the logic for running a ladder program. It has
“p” identical instruction execution blocks that include the logic of all of the functions. The
“p” blocks all retrieve data from program memory and execute their instructions in parallel.
Each block's bit output is delivered to a block that analyzes the connections between the four
rungs and provides an output that allows the next instruction to be executed.

22.4 DESIGN CONSIDERATIONS


(a) Size
Size of the design increases with the number of instructions/functions that the PLC can
handle and with the number of rungs that can be connected in parallel (p). Hence, the
designer can use an instruction set containing only the optimum instructions. The
demonstration covered instructions such as Delay timer (ON, OFF), Counter (Up, Down),
Logical (AND, OR, NOT), Arithmetic (Addition, Subtraction) and Others (Compare, etc.).
Also, the design was restricted to a maximum of 256 rungs, each of which may contain 7
elements in series. A maximum of 4 elements can be connected in parallel.

Figure 22.4: Ladder Execution Flow

Figure 22.5: System Architecture inside FPGA
(b) Scan Time
The time needed for a complete I/O scan and execution is a key characteristic of a PLC. It
depends on the number of input and output channels as well as on the length of the
ladder program. PLC execution speed is determined by the processor's clock frequency;
the scan time decreases as the frequency increases. Because each rung execution needs
“2m” clock cycles, the suggested architecture may achieve very short scan times. Thus, if
a ladder program has “n” rungs, the PLC scan will take “2mn” clock cycles. The scan time
may be lowered even further by using FPGA devices that are quicker and more efficient,
resulting in a faster PLC. The demonstrated design could achieve a 2.24-microsecond scan
time at a 100 MHz clock for the maximum ladder logic achievable in the design.
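The quoted scan-time relationship can be checked with a short calculation: a ladder of n rungs, each needing 2m clock cycles, takes 2mn cycles, and dividing by the clock frequency gives the scan time. The Python sketch below evaluates this with illustrative parameter values; the specific numbers are examples, not the exact configuration behind the 2.24-microsecond figure.

```python
def plc_scan_time_us(m, n, f_clk_hz):
    """Scan time in microseconds for n rungs of m elements at clock f_clk_hz,
    using the stated cost of 2*m clock cycles per rung (2*m*n cycles per scan)."""
    cycles = 2 * m * n
    return cycles / f_clk_hz * 1e6


# Example: m = 7 elements per rung, n = 32 rungs, 100 MHz clock.
print(plc_scan_time_us(7, 32, 100e6))   # about 4.48 microseconds per full ladder scan
```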
(c) Memory
Memory is required by PLCs in order to store program and temporary data. Separate
memory chips are utilized in traditional PLCs for this purpose. The suggested architecture
takes advantage of the memory built into the same FPGA chip, eliminating the additional
circuitry and latency associated with read/write operations in traditional systems. As a
result, the solution is both quicker and more compact. Block memory was utilized as user
memory while distributed memory was used as data memory in the exhibited system.

22.5 SUMMARY
An implementation approach to PLC (Programmable Logic Controller) design has been
described in this chapter. For an example application, the concept has been built and
proven on a smaller scale. However, it may be further modified to incorporate a variety of
useful instructions (PID, PWM, etc.), interfaces (RS232, SPI, USB), and network protocols
in order to link it to a network. The architecture is also limited to digital I/O channels,
although it may be expanded to include analog I/O channels. The solution described in this
chapter is best suited for small-scale applications requiring a small number of instructions
at a low cost, while offering excellent performance and a compact architecture.

REFERENCES
[1] John T. Welch, Joan, “A Direct Mapping FPGA Architecture for Industrial Process
Control Applications” IEEE Proceedings International Conference on Computer De-
sign, 17–20 Sept. 2000, pp 595–598.
[2] M. A. Adamski and J. L. Monteiro, "PLD implementation of logic controllers," in
Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE’95),
vol. 2, 1995, pp. 706–711.
[3] M. Adamski and J. L. Monteiro, "From interpreted Petri net specification to re-
programmable logic controller design," in Proceedings of the IEEE International
Symposium on Industrial Electronics (ISIE 2000), vol. 1, 2000, pp. 13–19.
[4] M. Wegrzyn, M. A. Adamski, and J. L. Monteiro, "The application of reconfigurable
logic to controller design," Control Engineering Practice, vol. 6, pp. 879–887, 1998.
[5] A. Wegrzyn and M. Wegrzyn, "Petri net-based specification, analysis and synthesis of
logic controllers," in Proceedings of the IEEE International Symposium on Industrial
Electronics (ISIE 2000), vol. 1, 2000, pp. 20–26.
[6] M. Ikeshita, Y Takeda, H. Murakoshi, N. Funakubo, and I.Miyazawa, "An applica-
tion of FPGA to high-speed programmable controller-development of the conversion
program from SFC to Verilog," in Proceedings of the 7th IEEE International Con-
ference on Emerging Technologies and Factory Automation (ETFA’99), vol. 2, 1999,
pp. 1386–1390.
[7] I. Miyazawa, T. Nagao, M. Fukagawa, Y. Ito, T. Mizuya, and T. Sekiguchi, "Implemen-
tation of ladder diagram for programmable controller using FPGA," in Proceedings
of the 7th IEEE International Conference on Emerging Technologies and Factory
Automation (ETFA’99), vol. 2, 1999, pp. 1381-1385.
[8] Shuichi Ichikawa, Masanori Akinaka, Ryo Ikeda, Hiroshi Yamamoto “Converting PLC
instruction sequence into logic circuit: A preliminary study” Industrial Electronics,
2006 IEEE International Symposium, July 2006, vol. 4, pp. 2930–2935.
[9] Dick Johnson, Research on Programmable Logic Controllers done in conjunction with
Reed Research, Control Engineering December 2007
(http://www.controleng.com/article/CA6510505.html)
[10] Gary Dunning, Introduction to Programmable Logic Controllers, ISBN: 0-7668-1768-7,
Thomson Delmar Learning, 2000.
[11] C. D. Johnson, Process control Instrumentation, ISBN: 0-1306-0248-5 Prentice Hall,
2002.
[12] John Wakerly, Digital Design, Principals and Practices, ISBN: 0-13-082599-9
Prentice-Hall, 2000.
[13] Douglus Perry, VHDL Programming By Example, ISBN 0070494363, McGraw-Hill
June 1998.
[14] D. Du, X. Xu and K. Yamazaki, “A study on the generation of silicon-based hard-
ware PLC by means of the direct conversion of the ladder diagram to circuit design
language”.
[15] D. Gawali and V. K. Sharma, “FPGA based Micro PLC design approach”, International
conference on advances in computing, control, and telecommunication technologies,
2009.
IV
An Overview About Design Architectures of Digital Circuits

Part 4

A digital computer stores data in terms of digits (numbers) and proceeds in discrete steps
from one state to the next. The states of a digital computer typically involve binary digits
which may take the form of the presence or absence of magnetic markers in a storage
medium, on-off switches or relays. In digital computers, even letters, words, and whole
texts are represented digitally. Digital logic is the basis of electronic systems, such as
computers and cell phones. Digital logic is rooted in binary code, a series of zeroes and
ones each having an opposite value. This system facilitates the design of electronic circuits
that convey information, including logic gates. Digital logic gate functions include AND,
OR, and NOT. The value system translates input signals into specific output. Digital logic
facilitates computing, robotics and other electronic applications.
Digital logic design is foundational to the fields of electrical engineering and com-
puter engineering. Digital logic designers build complex electronic components that use
both electrical and computational characteristics. These characteristics may involve power,
current, logical function, protocol, and user input. Digital logic design is used to develop
hardware such as circuit boards and microchip processors. This hardware processes user
input, system protocol and other data in computers, navigational systems, cell phones, or
other high-tech systems.
The combinational circuit consists of logic gates whose outputs at any time are deter-
mined directly from the present combination of input without any regard to the previous
input. A combinational circuit performs a specific information processing operation fully
specified logically by a set of Boolean functions. A combinatorial circuit is a generalized
gate. In general such a circuit has m inputs and n outputs. Such a circuit can always be con-
structed as n separate combinatorial circuits, each with exactly one output. For that reason,
some texts only discuss combinatorial circuits with exactly one output. In reality, however,
some important sharing of intermediate signals may take place if the entire n-output circuit
is constructed at once. Such sharing can significantly reduce the number of gates required
to build the circuit. When a combinational circuit is built from some kind of specification,
the aim is always to make it as good as possible. The only problem is that the definition of
"as good as possible" may vary greatly. In some applications, one simply wants to minimize
the number of gates (or the number of transistors, really).
The implication is that combinational circuits have no memory. In order to build
sophisticated digital logic circuits, including computers, a more powerful model is needed:
circuits whose output depends upon both the current input of the circuit and its previous
state. In other words, circuits that have memory are needed. For a device to serve as
a memory, it must have three characteristics: (1) the device must have two stable states;
(2) there must be a way to read the state of the device; and (3) there must be a way to set
the state at least once.


This part starts with designing divider circuits with parallel computation of quotients
and partial remainders which is given in Chapter 23. In this chapter, a heuristic function
is presented to determine the difference between the numbers of bits in the dividend and
divisor. The introduced divider circuit generates the partial remainder and quotient bits
simultaneously in each iteration which reduces the delay of the divider circuit significantly.
In addition, the division algorithm requires only two operations, addition and subtraction. A
systematic method for minimizing a TANT circuit and the heuristic algorithms for different
stages of the technique are provided in Chapter 24. Steps and algorithms are discussed
extensively in this chapter. The introduced method constructs an optimal TANT network
for a given single output function. Chapter 25 presents an asymmetric high-radix signed-
digital (AHSD) adder that performs addition on the basis of neural network (NN) and also
shows that the AHSD number system supports carry-free (CF) addition by using NN. A
Novel NN design has been constructed for CF adder based on the AHSD4 number system
is also presented in this chapter. Chapter 26 describes an integrated framework for SOC test
automation. In this chapter, an efficient algorithm has been introduced to construct wrappers
that reduce testing time for cores. Rectangle packing has been used to develop an integrated
scheduling algorithm that incorporates power constraints in the test schedule. In Chapter 27,
an approach is presented to design memristor based nonvolatile 6-T static random access
memory (SRAM). In addition to the SRAM integrated circuit, test structures are included to
help characterize the process and design. Then the memristor-based resistive random access
memory (MRRAM) is addressed which is similar to that of static random access memory
(SRAM) cell. In Chapter 28, a fault-tolerant approach to reliable microprocessor design
is presented. The approach preserves system performance while keeping area overheads
and power demands low. The checker is a fairly simple state machine that can be formally
verified, scaled in performance, and reused. Finally, in Chapter 29, some applications of VLSI circuits and embedded systems are discussed.
CHAPTER 23

Parallel Computation of
Quotients and Partial
Remainders to Design
Divider Circuits

Division is considered as the slowest and most difficult operation among four basic opera-
tions in microprocessors. This chapter presents an unprecedented divider circuit by using
a new division algorithm. A heuristic function is presented to determine the difference
between the numbers of bits in the dividend and divisor. This difference is used to calculate
the quotient bits and the partial remainder independently. Thus, the introduced divider
circuit generates the partial remainder and quotient bits simultaneously in each iteration
which reduces the delay of the divider circuit significantly. Moreover, the divider circuit
requires only two operations (addition and subtraction) in each iteration. The presented
divider circuit has been constructed in four steps. First, a parallel n-bit counter circuit
has been introduced. For a 4-input operand, the bit counter circuit achieves a significant
improvement in terms of delay. Second, a selection block is designed which has lower hardware complexity. Third, an efficient and compact ⌈log2 n + 1⌉-to-n-bit converter circuit
has been presented. Fourth, a new n-bit comparator circuit has been shown to reduce the
area of the comparator circuit, where n is the number of input bits. For a 4-bit comparator
circuit, the comparator circuit gains a notable improvement in terms of area-delay product.

23.1 INTRODUCTION
Re-configurable computing has explored a new dimension of computing architecture since
its inception in 1960. Re-configurable computing is a computer architecture which combines some of the flexibility of software with the high performance of hardware by processing with very flexible high-speed computing fabrics such as field-programmable gate arrays (FPGAs).
The principal difference when compared to using ordinary microprocessors is the ability to
make substantial changes to the data path itself in addition to the control flow. On the other
hand, the main difference with custom hardware, i.e., application-specific integrated circuits




(ASICs) is the possibility to adapt the hardware during run time by “loading” a new circuit
on the re-configurable fabric. Their functionality can be upgraded and repaired during
their operational life cycle and specialized to the particular instance of a task. Sometimes,
they are the only way to achieve the required real-time performance without fabricating
custom integrated circuits. The implementation of FPGA has been widely seen in many
applications including signal processing, cryptography, scientific computation, and arithmetic computing.
Among the basic operations, division, being the slowest operation on a modern microprocessor, is a prerequisite for faster mathematical and computational operations in a processor. Moreover, in comparison with addition and multiplication, division is the least frequently used as well as the most difficult operation in processors. However, the performance of a computer will degrade if the division operation is ignored. In the early 1960s, Landauer's research showed that irreversible hardware computation, regardless of its realization technique, results in energy dissipation due to information loss. Each lost bit of information dissipates kT ln 2 joules of energy, where k is the Boltzmann constant and T is the absolute temperature. In 1973, Bennett showed
that a circuit must be made using reversible logic gates to avoid this huge energy dissipation.
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (one or more faults within) some of its components. Usually, an FPGA consists of an array of configurable logic blocks, interconnects and I/O blocks. An FPGA can be configured as needed for each application. The same semiconductor technology advances that have brought processors to their performance limits have turned FPGAs from simple logic into highly capable programmable fabrics. The most popular logic blocks are the Look-Up Table (LUT) and the Plessey logic block. With more inputs, a LUT can implement more logic using fewer logic blocks, which in turn reduces routing area. A 3- to 4-input LUT results in better performance in terms of area and delay, so a generalized 4-input LUT-based logic block is considered. In this chapter, a divergent approach for a divider circuit is presented by using a new division algorithm.
Modern applications comprise several arithmetic operations, among which addition, multiplication, division, and square root are the frequent ones. In recent research, emphasis has been placed on designing ever-faster adders and multipliers, with division and square root receiving less attention. The typical range for addition latency is two to four cycles, and the range for multiplication is two to eight cycles. Most emphasis has been placed on improving the performance of addition and multiplication. As the performance gap between these two operations and division widens, the performance and throughput of various applications have slowly degraded.
Fig. 23.1 shows the average frequency of different instructions such as division, multiplication, addition, subtraction and square root operations relative to the total number of operations. The figure shows that simply in terms of dynamic frequency, division and square root seem to be relatively unimportant instructions, with about 3% of the dynamic instruction count due to division and only 0.33% due to square root. The most common instructions are the multiply and the add: multiplication accounts for 35% of the instructions, and the adder is used for 55% of the instructions, since the multiplication operation uses addition as part of its internal computation.

Figure 23.1: Distribution of Different Instructions.

However, in terms of latency, division can play a much larger role. By assuming a
machine model of a scalar processor, where every division operation has a latency of 20
cycles and the adder and multiplier each have a three cycle latency, a distribution of the
execution time due to the hardware was formed, shown in Fig. 23.2. Here, the division
accounts for 40% of the latency, the addition accounts for 42%, and the multiplication
accounts for the remaining 18%. It is evident that the performance of division is highly significant to the overall system performance.

Figure 23.2: Distribution of Execution Time.

Field Programmable Gate Arrays (FPGAs) have come a long way since their inception
as illustrated in Fig. 23.3. From their humble beginnings as containers for glue and control logic, FPGAs have evolved into highly capable software coprocessors and platforms for complete, single-chip embedded systems. It has long been recognized that many of

the computing challenges in embedded and high-performance computing can be addressed using parallel processing techniques. The use of dual or quad-core processors, multiple
computer “blades”, or clustered PCs has become commonplace in many different application
domains. FPGAs are now being deployed alongside traditional processors in these systems,
creating what might be called a hybrid multiprocessing approach to computing.

Figure 23.3: FPGA Devices have evolved to Become Highly Capable Computing Platforms.

When FPGAs are added to a multiprocessing environment, opportunities exist for improving both application-level and instruction-level parallelism. Using FPGAs, it is
possible to create structures that can greatly accelerate individual operations such as a
simple multiply-accumulate or more complex sequences of integer or floating-point
operations, or that implement higher-level control structures such as loops. Code within the
inner-most loops of an algorithm can be further accelerated through the use of instruction
scheduling, instruction pipelining and other techniques. At a somewhat higher level, these
parallel structures can themselves be replicated to create further degrees of parallelism, up
to the limits of the target device’s capacity as also shown in Fig. 23.4.

Figure 23.4: FPGAs Move to Leading-Edge Process Technologies.

The programming of software algorithms into FPGA hardware has traditionally required
specific knowledge of hardware design methods, including the use of hardware description
languages such as VHDL or Verilog. While these methods may be productive for hard-
ware designers, they are typically not suitable for embedded systems programmers, domain
scientists and higher level software programmers. Fortunately, software-to-hardware tools
now exist that allow software programmers to describe their algorithms using more familiar
methods and standard programming languages. For example, using a C-to-FPGA compiler
tool, an application and its key algorithms can be described in standard C with the addition
of relatively simple library functions to specify inter-process communications. The crit-
ical algorithms can then be compiled automatically into HDL representations which are
subsequently synthesized into lower level hardware targeting one or more FPGA devices.
While a certain level of FPGA knowledge and in-depth hardware understanding may still be
required to optimize the application for the highest possible performance, the formulation
of the algorithm, the initial testing and the prototype hardware generation can now be left
to a software programmer. Therefore, in this work, an initiative has been taken to introduce a novel idea to reduce the delay of the division algorithm. An improved division algorithm will substantially affect the performance of various applications. More-
over, FPGAs have been chosen as a targeted device to explore the endless opportunity of
reconfigurable computing.
While the methodology for designing efficient high performance adders and multipliers
is well understood, the design of dividers still remains a serious design challenge, often viewed as a "black art" among system designers. Extensive literature exists describing the theory of division, including subtractive methods such as non-restoring SRT division. To achieve
good system performance, some form of hardware division is required. However, at very
low divider latencies, two problems arise. The area required increases exponentially or
cycle time becomes impractical. Dividers with lower latencies do not provide significant
system performance benefits, and their areas are too large to be justified. An alternative is to

provide an additional multiplier dedicated to division. This can be an acceptable trade-off if
a large quantity of area is available and maximum performance is desired for highly parallel
division/multiplication applications, such as graphics and 3D rendering applications. The
main disadvantage with functional iteration is the lack of remainder and the corresponding
difficulty in rounding.
Very high radix algorithms are an attractive means of achieving low latency while
also providing a true remainder. The only commercial implementation of a very high
radix algorithm is the Cyrix short-reciprocal unit. This implementation makes efficient
use of a single rectangular multiply/add unit to achieve lower latency than most SRT
implementations while still providing a remainder. Further reductions in latency could be
possible by using a full-width multiplier, as in the rounding and pre-scaling algorithm.
Division algorithms can be divided into five classes: digit recurrence, functional iteration,
very high radix, table look-up, and variable latency. The basis for these classes is the obvious
differences in the hardware operations used in their implementations, such as multiplication,
subtraction, and table look-up. Many practical division algorithms are not pure forms of
a particular class, but rather are combinations of multiple classes. For example, a high
performance algorithm may use a table look-up to gain an accurate initial approximation to
the reciprocal, use a functional iteration algorithm to converge quadratically to the quotient,
and complete in variable time using a variable latency technique. Therefore, it has been
a major challenge to find an acceptable trade-off between area and delay of the divider
circuit. Moreover, the longevity of a circuit largely depends on the power dissipation.
Hence, designing an optimized and compact divider circuit is a prime concern.
The main focuses of this work are presented below:

1. Propagation delay of the divider circuit can be optimized if there is a new method to
find the next partial remainder quickly or if the two major tasks, i.e., finding each quotient bit and calculating the next remainder, can be done simultaneously.

2. The total delay, area and power consumption can be minimized if there is a novel
approach to reduce the number of blocks or the number of bits handled by each
iteration in the non-restoring division.

3. The on-the-fly conversion that is required to produce the correct quotient bits in non-restoring division can be omitted to improve the overall performance of the divider circuit.

Five main contributions are addressed in this chapter as follows:

1. A new division algorithm for divider circuit is introduced which generates the partial
remainder and quotient bits independently.

2. A parallel bit counter has been presented with a minimum depth and hardware
complexities.

3. Compact and efficient converter and comparator circuits have been elucidated which require minimum area and delay.

4. Cost-efficient designs of both Application-Specific Integrated Circuit (ASIC)-based and Look-Up Table (LUT)-based divider circuits have been elucidated, requiring an optimal amount of area and optimal numbers of LUTs, slices, flip-flops and delay.

5. An improved design of a Reversible Fault Tolerant (RFT) D-latch, master-slave flip-flop, and LUT-based configurable logic block (CLB) of an FPGA is presented, targeting a lower number of gates, garbage outputs and unit delay by using two reversible fault tolerant gates.

23.2 BASIC DEFINITIONS


In this section, formal definitions and ideas related to division methodologies along with
LUT are presented with illustrative figures and examples.

23.2.1 Division Operation


Division is one of the four basic operations of arithmetic; the others are addition, subtraction, and multiplication. The division of two natural numbers is the process of calculating the number of times one number is contained within another. In mathematics, the word "division" means the operation which is the opposite of multiplication. The symbol for division can be a slash (/), a fraction bar, or the division sign (÷).

Example 23.1 For example, 10 ÷ 5 or 40/20 gives the answer 2. The first number is the dividend (10 or 40), and the second number is the divisor (5 or 20). The result (or answer) is the quotient. For whole numbers, any left-over amount is called the "remainder"; for example, 14 ÷ 4 gives 3 with a remainder of 2.

A division algorithm is an algorithm which, given two integers N (numerator or dividend) and D (denominator or divisor), computes their quotient and/or remainder. Division
algorithms fall into two main categories: slow division and fast division. Slow division
algorithms produce one digit of the final quotient per iteration. Examples of slow division
include restoring, non-performing restoring, non-restoring, and SRT division. Fast division
methods start with a close approximation to the final quotient and produce twice as many
digits of the final quotient on each iteration. Newton-Raphson and Goldschmidt fall into
this category.
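For reference, the following minimal Python sketch (an illustrative model only, not the divider introduced in this chapter) shows the classical restoring method of the slow-division class, producing one quotient bit per iteration.

def restoring_divide(n, d, width):
    # Classical restoring division: one quotient bit per iteration.
    # n: dividend, d: divisor (d > 0), width: number of dividend bits.
    r, q = 0, 0
    for i in range(width - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        r -= d                          # trial subtraction
        if r < 0:
            r += d                      # restore on a negative result
            q = q << 1                  # quotient bit 0
        else:
            q = (q << 1) | 1            # quotient bit 1
    return q, r

print(restoring_divide(2933, 102, 12))  # (28, 77)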

23.2.2 Shift Registers


The Shift Register is a type of sequential logic circuit that can be used for the storage or the
transfer of data in the form of binary numbers. This sequential device loads the data present
on its inputs and then moves or “shifts” it to its output once every clock cycle, hence the
name Shift Register.
A shift register basically consists of several single bit “D-Type Data Latches”, one for
each data bit, either a logic “0” or a “1”, connected together in a serial type daisy-chain
arrangement so that the output from one data latch becomes the input of the next latch
and so on. Data bits may be fed in or out of a shift register serially, that is one after the
other from either the left or the right direction, or all together at the same time in a parallel

configuration. The number of individual data latches required to make up a single Shift
Register device is usually determined by the number of bits to be stored with the most
common being 8-bits (one byte) wide constructed from eight individual data latches.
Shift Registers are used for data storage or for the movement of data and are therefore
commonly used inside calculators or computers to store data such as two binary numbers
before they are added together, or to convert the data from either a serial to parallel or
parallel to serial format. The individual data latches that make up a single shift register are
all driven by a common clock (Clk) signal making them synchronous devices. Shift register
ICs are generally provided with a clear or reset connection so that they can be “SET” or
“RESET” as required. Generally, shift registers operate in one of the four different modes
with the basic movement of data through a shift register being:

1. Serial-in to Parallel-out (SIPO) – the register is loaded with serial data, one bit at a
time, with the stored data being available at the output in parallel form.

2. Serial-in to Serial-out (SISO) – the data is shifted serially “IN” and “OUT” of the
register, one bit at a time in either a left or right direction under clock control.

3. Parallel-in to Serial-out (PISO) – the parallel data is loaded into the register simul-
taneously and is shifted out of the register serially one bit at a time under clock
control.

4. Parallel-in to Parallel-out (PIPO) – the parallel data is loaded simultaneously into the
register, and transferred together to their respective outputs by the same clock pulse.

The effect of data movement from left to right through a shift register is presented
graphically as in Fig. 23.5.
The directional movement of the data through a shift register can be to the left (left shifting), to the right (right shifting), left-in but right-out (rotation), or both left and right shifting within the same register, thereby making it bidirectional.

23.2.2.1 Serial-In to Parallel-Out Shift Register


The operation of a SIPO is as follows: consider that all the flip-flops (FFA to FFD) have just been RESET (CLEAR input) and that all the outputs QA to QD are at logic level "0", i.e., there is no parallel data output, as shown in Fig. 23.6.
If a logic "1" is connected to the DATA input pin of FFA, then on the first clock pulse the output of FFA, and therefore the resulting QA, will be set HIGH to logic "1", with all the other outputs still remaining LOW at logic "0". Assume now that the DATA input pin of FFA has returned LOW again to logic "0", giving us one data pulse or 0-1-0. The second clock pulse will change the output of FFA to logic "0" and the output of FFB and QB HIGH to logic "1", as its input D has the logic "1" level on it from QA. The logic "1" has now moved or been shifted one place along the register to the right and is now at QB. When the third clock pulse arrives, this logic "1" value moves to the output of FFC (QC) and so on until the arrival of the fifth clock pulse, which sets all the outputs QA to QD back again to logic level "0" because the input to FFA has remained constant at logic level "0".

Figure 23.5: Data Movement from Left to Right through a Shift Register.

Figure 23.6: 4-bit Serial-in to Parallel-out Shift Register.

The effect of each clock pulse is to shift the data contents of each stage one place to
the right, and this is shown in the following table until the complete data value of 0-0-0-1
is stored in the register. This data value can now be read directly from the outputs of QA to QD. Then the data has been converted from a serial data input signal to a parallel data
output. The truth table in Table 23.1 and its wave forms in Fig. 23.7 show the propagation
of the logic “1” through the register from left to right.

Table 23.1: Truth Table for a 4-bit SIPO Register

Clock Pulse No  QA  QB  QC  QD
0 0 0 0 0
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
5 0 0 0 0
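The shifting action of Table 23.1 can be modelled in a few lines of Python; the sketch below is illustrative only, feeding the single data pulse 1-0-0-0-0 into a 4-stage register.

def sipo_shift(serial_in, stages=4):
    # Each clock pulse shifts the register contents one place to the right
    # and loads the serial input into the first stage (QA).
    regs = [0] * stages                    # QA, QB, QC, QD, all cleared
    history = [tuple(regs)]
    for bit in serial_in:
        regs = [bit] + regs[:-1]
        history.append(tuple(regs))
    return history

for pulse, state in enumerate(sipo_shift([1, 0, 0, 0, 0])):
    print(pulse, state)                    # reproduces clock pulses 0 to 5 of Table 23.1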

Figure 23.7: Timing Diagram for a 4-bit Serial-in to Parallel-out Shift Register.

23.2.2.2 Serial-In to Serial-Out Shift Register


This shift register is very similar to the SIPO above, except that whereas before the data was read directly in parallel form from the outputs QA to QD, this time the data is allowed to flow

straight through the register and out of the other end. Since there is only one output, the
DATA leaves the shift register one bit at a time in a serial pattern, hence the name Serial-in
to Serial-Out Shift Register or SISO.
The SISO shift register is one of the simplest of the four configurations as it has only
three connections, the serial input (SI) which determines what enters the left hand flip-flop,
the serial output (SO) which is taken from the output of the right hand flip-flop and the
sequencing clock signal (Clk). The logic circuit diagram shown in Fig. 23.8 is a generalized
serial-in serial-out shift register.

Figure 23.8: 4-bit Serial-in to Serial-out Shift Register.

23.2.2.3 Parallel-In Serial-Out Register


The Parallel-in to Serial-out shift register which is shown in Fig. 23.9 acts in the opposite
way to the serial-in to parallel-out one above. The data is loaded into the register in a parallel
format in which all the data bits enter their inputs simultaneously, to the parallel input pins PA to PD of the register. The data is then read out sequentially in the normal shift-right mode from the register at Q, representing the data present at PA to PD.

Figure 23.9: 4-bit Parallel-in to Serial-out Shift Register.

This data provides one bit output at a time on each clock cycle in a serial format. It is
important to note that with this type of data register a clock pulse is not required to parallel
load the register as it is already present, but four clock pulses are required to unload the
data.

23.2.2.4 Parallel-In to Parallel-Out Shift Register


The final mode of operation is the Parallel-in to Parallel-out Shift Register. This type of
shift register also acts as a temporary storage device or as a time delay device similar to the
SISO configuration above. The data is presented in a parallel format to the parallel input pins PA to PD and then transferred together directly to their respective output pins QA to QD by the same clock pulse. Then one clock pulse loads and unloads the register. This
arrangement for parallel loading and unloading is shown in Fig. 23.10.

Figure 23.10: 4-bit Parallel-in to Parallel-out Shift Register.

The PIPO shift register is the simplest of the four configurations as it has only three
connections, the parallel input (PI) which determines what enters the flip-flop, the parallel
output (PO) and the sequencing clock signal (Clk). Similar to the Serial-in to Serial-out
shift register, this type of register also acts as a temporary storage device or as a time delay
device, with the amount of time delay being varied by the frequency of the clock pulses.

23.2.3 Complement Logic


In logic, negation, also called logical complement, is an operation that takes a proposition
p to another proposition “not p”, written ¬p, which is interpreted intuitively as being true
when p is false and false when p is true. Generally, an inverter is used for complement
logic.

Table 23.2: Truth Table of One-Bit Conventional Binary Comparator

Inputs Outputs
A B X Y Z
0 0 0 1 0
0 1 0 0 1
1 0 1 0 0
1 1 0 1 0

23.2.4 Comparator
A comparator is a logic circuit that first compares the magnitudes of A and B and then determines which of A > B, A < B, and A = B holds. When the two numbers in the comparator circuit are 1-bit numbers, each result will be a single bit, 0 or 1. So, the circuit is called a 1-bit magnitude comparator, which is the basis for comparing two n-bit numbers. The truth table of the 1-bit conventional comparator is listed in Table 23.2. From the truth table of the conventional comparator in Table 23.2, the logical expressions of the 1-bit comparator are as follows:

X = (FA>B) = A.B′ (23.1)

Y = (FA=B) = (A ⊕ B)′ (23.2)

Z = (FA<B) = A′.B (23.3)

The wave form of a 1-bit comparator circuit is demonstrated in Fig. 23.11 and a circuit
diagram of a 4-bit comparator circuit is exhibited in Fig. 23.12.
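To make Equations 23.1 to 23.3 concrete, the following minimal Python sketch (illustrative only) evaluates the three comparator outputs for a pair of 1-bit inputs and reproduces Table 23.2.

def comparator_1bit(a, b):
    x = a & (1 - b)        # X = A.B'       (A > B), Equation 23.1
    y = 1 - (a ^ b)        # Y = (A xor B)' (A = B), Equation 23.2
    z = (1 - a) & b        # Z = A'.B       (A < B), Equation 23.3
    return x, y, z

for a in (0, 1):
    for b in (0, 1):
        print(a, b, comparator_1bit(a, b))   # matches Table 23.2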

Figure 23.11: Timing Diagram of a 1-bit Comparator Circuit.



Table 23.3: Truth Table for 1-bit Full Adder

Input Output
Cin A B S C
0 0 0 0 0
1 0 0 1 0
0 1 0 1 0
1 1 0 0 1
0 0 1 1 0
1 0 1 0 1
0 1 1 0 1
1 1 1 1 1

Figure 23.12: Circuit Diagram of a 4-bit Comparator Circuit.

23.2.5 Adder
The full-adder circuit adds three one-bit binary numbers (Cin, A, B) and outputs two one-bit binary numbers, a sum (S) and a carry (C); the truth table is given in Table 23.3. The
full-adder is usually a component in a cascade of adders, which add 8, 16, 32, etc. binary
numbers. The carry input for the full-adder circuit is from the carry output from the circuit
“above” itself in the cascade. The carry output from the full adder is fed to another full
adder “below” itself in the cascade.
The equation for Sum (S ) is:

S = A ⊕ B ⊕ Cin (23.4)

The equation for Carry (C ) is:

C = A.B + (A ⊕ B).Cin (23.5)



The wave form of a 1-bit adder circuit is demonstrated in Fig. 23.13 and a circuit
diagram of a 4-bit adder circuit is exhibited in Fig. 23.14.
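A quick software check of Equations 23.4 and 23.5 is given below as a minimal Python sketch; iterating over all input combinations reproduces Table 23.3.

def full_adder(a, b, cin):
    s = a ^ b ^ cin                    # Sum, Equation 23.4
    c = (a & b) | ((a ^ b) & cin)      # Carry, Equation 23.5
    return s, c

for cin in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            print(cin, a, b, full_adder(a, b, cin))   # matches Table 23.3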

Figure 23.13: Timing Diagram of a 1-bit Adder Circuit.

Figure 23.14: Circuit Diagram of a 4-bit Adder Circuit.

23.2.6 Subtractor
Unlike the Binary Adder which produces a SUM and a CARRY bit when two binary
numbers are added together, the binary subtractor produces a DIFFERENCE, D by using a
BORROW bit, B from the previous column. Then obviously the operation of subtraction is
the opposite to that of addition. The truth table of 1-bit subtractor is given in Table 23.4.
Binary Subtraction can take many forms but the rules for subtraction are the same
whichever process you use. As binary notation only has two digits, subtracting a “0” from

Table 23.4: Truth Table for 1-bit Subtractor

Input Output
Bin Y X Diff. Bout
0 0 0 0 0
0 0 1 1 0
0 1 0 1 1
0 1 1 0 0
1 0 0 1 1
1 0 1 0 0
1 1 0 0 1
1 1 1 1 1

a “0” or a “1” leaves the result unchanged as 0 – 0 = 0 and 1 – 0 = 1. Subtracting a “1” from
a “1” results in a “0”, but subtracting a 1 from a 0 requires a borrow. In other words 0 – 1
requires a borrow.
For the DIFFERENCE (D) bit:

D = (X ⊕ Y) ⊕ Bin (23.6)

For the BORROW OUT (Bout) bit:

Bout = X′.Y + (X ⊕ Y)′.Bin (23.7)

The wave form of a 1-bit subtractor circuit is demonstrated in Fig. 23.15 and a circuit diagram of a 4-bit subtractor circuit is exhibited in Fig. 23.16.

Figure 23.15: Timing Diagram of a 1-bit Subtractor Circuit.



Figure 23.16: Circuit Diagram of a 4-bit Subtractor Circuit.

To build an n-bit binary subtractor, n 1-bit full subtractors are connected or cascaded together to subtract two parallel n-bit numbers from each other, for example two 4-bit binary numbers. As noted before, the only difference between a full adder and a full subtractor is the inversion of one of the inputs. So, by using an n-bit adder and n inverters (NOT gates), the process of subtraction becomes an addition: two's complement notation is applied to the subtrahend by inverting all of its bits and setting the carry input of the least significant bit to logic 1 (HIGH), as shown in Fig. 23.16.
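The following minimal Python sketch illustrates this idea for 4-bit operands: the subtrahend is inverted bit by bit and the carry input is forced to 1, so an ordinary adder produces X − Y (the final carry-out is discarded). The function name and operand width are illustrative choices.

def subtract_via_adder(x, y, width=4):
    # Two's complement subtraction: X - Y = X + (bitwise NOT of Y) + 1.
    mask = (1 << width) - 1
    y_inverted = ~y & mask        # the n inverters on the subtrahend
    total = x + y_inverted + 1    # n-bit adder with the LSB carry-in set to 1
    return total & mask           # discard the final carry-out

print(subtract_via_adder(0b1011, 0b0100))   # 11 - 4 = 7, i.e., 0b0111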

23.2.7 Look-Up Table


A look-up table (LUT) is a memory block with a one-bit output that essentially implements
a truth table where each input combination generates a certain logic output. The input
combination is referred to as an address. The output of the LUT is the value stored in the
indexed location of the selected memory cell. Since the memory cells in the LUT can be
set to anything according to the corresponding truth table, an N-input LUT can implement
any logic function.

Example 23.2 When implementing any logic function, a truth table of that logic is mapped to the memory cells of the LUT. Suppose we are implementing Equation 23.8, where '|' represents the logical OR operation. Table 23.5 represents the truth table of the function. Fig.
23.17 shows the gate representation and the LUT representation of the logic function. Output
is generated with the corresponding input combination, such as for input combination 1 and
0, the output will be 1. There has been significant research on improving the LUT

Table 23.5: Truth Table of Function (f) of Equation 23.8

A B Out
0 0 0
0 1 1
1 0 1
1 1 1

to reduce its hardware complexity and read and write times. The circuit diagram of a 2-input
LUT is given in Fig. 23.18.

f = (A.B)|(A ⊕ B) (23.8)
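The LUT of Example 23.2 can be modelled as a small memory indexed by the input combination; the following minimal Python sketch (illustrative only) stores the output column of Table 23.5 and returns the stored value for any address.

def make_lut(truth_outputs):
    # A LUT is simply a memory: the input combination forms the address and
    # the addressed cell holds the output value.
    memory = list(truth_outputs)
    def lut(a, b):
        address = (a << 1) | b
        return memory[address]
    return lut

f = make_lut([0, 1, 1, 1])   # output column of Table 23.5 (Equation 23.8)
print(f(1, 0))               # 1, as stated in Example 23.2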

23.2.8 Counter Circuit


In digital logic and computing, a counter is a device which stores (and sometimes displays)
the number of times a particular event or process has occurred, often in relationship to a
clock signal. The most common type is a sequential digital logic circuit with an input line
called the clock and multiple output lines. The values on the output lines represent a number
in the binary or BCD number system. Each pulse applied to the clock input increments or
decrements the number in the counter.
A counter circuit is usually constructed of a number of flip-flops connected in cascade.
Counters are a very widely used component in digital circuits, and are manufactured as
separate integrated circuits and also incorporated as parts of larger integrated circuits. Figs.
23.19 and 23.20 represent a 4-bit counter using flip-flops and the corresponding timing
diagram respectively.

Figure 23.17: LUT Implementation of a Logic Function.



Figure 23.18: Circuit Diagram of a 2-Input LUT.

Figure 23.19: Circuit Diagram of a 4-bit Counter.

Figure 23.20: Timing Diagram of a 4-bit Counter.



23.2.9 Reversible and Fault Tolerance Logic


In this subsection, the basic definitions of reversible gate, garbage output, unit delay and
fault tolerant gate are briefly described.
A reversible gate is an n-input, n-output (denoted by n × n) circuit that produces a unique output pattern for each possible input pattern, i.e., In ↔ On, where an unused output is known as a garbage output.
Fault tolerant gates are reversible gates which maintain parity between the input and output vectors, i.e., I1 ⊕ I2 ⊕ · · · ⊕ In = O1 ⊕ O2 ⊕ · · · ⊕ On. In this chapter, fault tolerant gates are used to preserve the parity of the circuits, so that faults can be detected when they occur.
For example, Fig. 23.21 shows the block diagram of two fault tolerant reversible gates
named Fredkin (FRG) and Feynman double gate (F2G). Unit delay represents the critical delay of the circuit and relies on the following two assumptions. First, each gate performs its computation in unit time, which means that every gate takes the same amount of time for internal logic operations. Second, all the inputs are known to the circuit before the
computation begins.

Figure 23.21: Block Diagram of (a) Fredkin Gate, (b) Feynman Double Gate.
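As an illustration of these definitions, the following minimal Python sketch models the two gates of Fig. 23.21 using their standard definitions (Fredkin: a controlled swap; Feynman double gate: (A, B, C) → (A, A ⊕ B, A ⊕ C)), which are assumptions drawn from the usual descriptions of these gates, and checks over all input patterns that each mapping is reversible and preserves the input/output parity.

from itertools import product

def fredkin(a, b, c):
    # Fredkin gate (FRG): a controlled swap, with a as the control line.
    return (a, c if a else b, b if a else c)

def feynman_double(a, b, c):
    # Feynman double gate (F2G): (A, B, C) -> (A, A xor B, A xor C).
    return (a, a ^ b, a ^ c)

for gate in (fredkin, feynman_double):
    outputs = {gate(*bits) for bits in product((0, 1), repeat=3)}
    assert len(outputs) == 8                          # bijective, hence reversible
    for bits in product((0, 1), repeat=3):
        assert sum(bits) % 2 == sum(gate(*bits)) % 2  # input/output parity preserved
    print(gate.__name__, "is reversible and parity preserving")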

23.3 THE METHODOLOGIES


In this section, first, an approach to a division algorithm for a divider circuit is presented.
Second, an ASIC (Application Specific Integrated Circuit)-based divider circuit is con-
structed. Third, a Field Programmable Gate Array (FPGA)-based divider circuit is pre-
sented. Finally, a reversible fault tolerant Look-Up Table (LUT)-based divider circuit is
shown.

23.3.1 Division Algorithm


Suppose X is the m-bit divisor and Y is the n-bit dividend. The targeted quotient is Q and the remainder is R. To perform the division operation, a heuristic function is required to find the global and local optimal values. In the division operation, the global optimal value is considered to be the closest safe value to Y, and the local optimal value is the closest safe value to the updated Y, denoted Y′. Hence, a single heuristic function can be used for both the global and the local optimal value.

If a heuristic function is based on A = n − m, then multiplying an (n − m)-bit number by an m-bit number will create a value of at most n bits, since (n − m) + m = n. The maximum binary value of n − m bits is a sequence of n − m ones, 11 · · · 1. Multiplying this maximum (n − m)-bit value by X may create a number larger than Y, which would produce a negative value after subtraction. To avoid that complication, this chapter considers a possible optimal value consisting of a one at the MSB (Most Significant Bit) followed by n − m − 1 zeros, i.e., 100 · · · 0.
Considering Y > X, initially the minimum value of A can be zero, which means the quotient would be 1 and the remainder would be Y − X. Otherwise, the algorithm considers a new optimal value B which, as mentioned earlier, is 100 · · · 0. Now, considering emin as 1 initially, a loop is executed with a terminating condition variable T, where T is set when the updated Y (Y′) is less than or equal to X, that is, when the remainder is less than the divisor. At that point, the remainder and the accumulated quotient Q are provided as output.
The condition for a step being accepted is whether B is less than Y′ or not. Then the divisor X is subtracted from the (m + 1) MSB (Most Significant Bit) bits of Y, and the result is stored in the diff variable. This is a noticeable improvement: in other division mechanisms an n-bit subtraction is required at each step, whereas in this algorithm a subtraction of (m + 1) bits is sufficient, reducing the mathematical complexity and delay since subtraction is a sequential procedure. Then Y′ is updated by appending the (m + 2)th to 0th LSB bits of Y after diff, making Y′ equal to diff followed by the (m + 2)th to 0th bits of Y. The value of n is updated with the length of the updated Y′. This step is accepted with probability 1, and the value of the resultant quotient is updated by adding B to the current value of Q. This process continues until T is less than emin. Finally, the remainder R is updated with the value of Y′, and the quotient and remainder are given as output.
The introduced algorithm has been illustrated in Algorithm 23.1 and a flowchart of
the division approach has been shown in Fig. 23.22. Moreover, an illustrated example is
demonstrated in Fig. 23.23 which accomplishes the division of a binary number (101110)2
(dividend) by (10111)2 (divisor) in 2 iterations (8 steps). The division method does not
require any multiplexing for selection of quotient bit due to the application of the intro-
duced algorithm which subsequently reduces the number of operations, requiring only two
operations such as addition and subtraction.

Algorithm 23.1: Algorithm for Division Operation

Property 23.3.1 establishes the number of iterations required by the introduced division approach.

Property 23.3.1 The division algorithm requires at most (n − m + 1) iterations, where n is the number of bits in the dividend, m is the number of bits in the divisor, and n ≥ m.

Proof 23.1 The statement is proved by mathematical induction.

Basis: The base case holds when the numbers of bits in the divisor and the dividend are equal, that is, n = m and (n − m + 1) = 1.
Hypothesis: Assume that the statement holds for n = k. So, a k-bit dividend requires (k − m + 1) iterations.
Induction: Now, considering n = k + 1, a (k + 1)-bit dividend requires (k + 1 − m + 1) = (k − m + 2) iterations. Now, reduce the number of bits in the dividend by one to produce n = k. Then, a k-bit dividend requires (k − m + 2 − 1) = (k − m + 1) iterations, which agrees with the hypothesis. So, the statement holds for n = k + 1, and this completes the proof.
Therefore, for n ≥ m, the introduced division algorithm requires at most (n − m + 1) iterations, where n is the number of bits in the dividend and m is the number of bits in the divisor.

Example 23.3 For n = 6 and m = 5, the algorithm performs the division operation in
(6 − 5 + 1) = 2 iterations, which is also illustrated in Fig. 23.23.

Figure 23.22: Flowchart of the Division Algorithm.

23.3.1.1 Explanation of Correctness of the Division Algorithm


This subsection briefly explains the correctness of the division algorithm. Consider the divisor to be an m-bit operand (X) which can be expressed as Xm Xm−1 Xm−2 · · · X0. Similarly, the dividend is an n-bit operand (Y) which can be expressed as Yn Yn−1 Yn−2 · · · Y0. The first step of the algorithm is the calculation of the bit difference between the divisor and dividend operands by using a heuristic function.

Figure 23.23: Example Simulation of the Division Algorithm.

The purpose of using the heuristic function is to generate the quotient bits quickly. Let the bit difference between the divisor and the dividend be "diff", where 0 ≤ diff ≤ n − m. Assume the quotient is expressed as q. Therefore, the quotient group for the i-th iteration becomes qdiff qdiff−1 qdiff−2 · · · q0. It is evident from Algorithm 23.1 that the value of qdiff is 1 and the rest of the bits qdiff−1 qdiff−2 · · · q0 will be 0. The aforementioned step is considered as the second step, i.e., the calculation of the quotient bits. The third step is the subtraction process. This subtraction process yields a partial remainder which is required for the generation of the next group of quotient bits. There is a subtle difference between the subtraction process involved in the conventional division approaches and the introduced division algorithm. The division algorithm uses the first (m + 1) bits of the n-bit dividend. This avoids producing a negative result (or partial remainder), because the (m + 1)-bit portion taken from the MSB side of the dividend Y = Σj=1..n Yj × 2^(j−1) has a 1 in its most significant position and is therefore never smaller than the m-bit divisor X = Σj=1..m Xj × 2^(j−1). Thus the introduced division algorithm skips the restoring step of the conventional division approaches. Moreover, the division algorithm always produces the appropriate next partial remainder by appending the unused bits from the dividend to the current partial remainder. A step-by-step demonstration of the division algorithm is given in Example 23.4 for verification.

Example 23.4 Suppose, the divisor is 1100110 and the dividend is 101101110101 in
binary notation.

In the first iteration, at first the introduced division algorithm calculates the bit difference between the divisor and the dividend. Here, the numbers of bits are 7 and 12 for the divisor and dividend respectively. So, diff1 = (12 − 7) = 5. Hence, (5 − 1) = 4 bits of 0 will be
appended at the LSB position of the quotient. Therefore, the first quotient group q1 will

consist of 5 bits as 10000. Since the divisor is 7 bits long, it will subtract itself from the
first (7 + 1) = 8 bits of the dividend. So the first 8 bits from the dividend is 10110111 and
the subtraction process involves subtraction as (10110111 − 1100110) = 1010001. The rest
of the bits which were unused in the dividend i.e., “0101” will append with 1010001 which
produces the next partial remainder 10100010101.
In the second iteration, at first the division algorithm calculates the bit difference between the divisor and the current partial remainder 10100010101. Here, the numbers of bits are 7 and 11 for the divisor and the current partial remainder respectively. So, diff2 = (11 − 7) = 4.
Hence, (4 − 1) = 3 bits of 0 will be appended at the LSB position of the quotient. Therefore,
the second quotient group q2 will consist of 4 bits as 1000. Now, the previous q1 will
be added with q2 to form the current quotient Q as (10000 + 1000) = 11000. Since the
divisor is 7 bits long, it will subtract itself from the first (7 + 1) = 8 bits of the current
partial remainder. So the first 8 bits from the current partial remainder is 10100010 and
the subtraction process involves subtraction as (10100010 − 1100110) = 111100. The rest
of the bits which were unused in the dividend i.e., “101” will append with 111100 which
produces the next partial remainder 111100101.
In the third iteration, at first the division algorithm calculates the bit difference between the divisor and the current partial remainder 111100101. Here, the numbers of bits are 7 and 9 for the divisor and the current partial remainder respectively. So, diff3 = (9 − 7) = 2. Hence,
(2 − 1) = 1 bit of 0 will be appended at the LSB position of the quotient. Therefore, the
third quotient group q3 will consist of 2 bits as 10. Now, the previous Q will be added
with q3 to form the current quotient Q as (11000 + 10) = 11010. Since the divisor is 7 bits
long, it will subtract itself from the first (7 + 1) = 8 bits of the current partial remainder. So
the first 8 bits from the current partial remainder is 11110010 and the subtraction process
involves subtraction as (11110010 − 1100110) = 10001100. The rest of the bit which was
unused in the dividend i.e., “1” will append with 10001100 which produces the next partial
remainder 100011001.
In the fourth iteration, at first the division algorithm calculates the bit difference between the divisor and the current partial remainder 100011001. Here, the numbers of bits are 7 and 9 for the divisor and the current partial remainder respectively. So, diff4 = (9 − 7) = 2.
Hence, (2 – 1) = 1 bit of 0 will be appended at the LSB position of the quotient. Therefore,
the forth quotient group q4 will consist of 2 bits as 10. Now, the previous Q will be added
with q4 to form the current quotient Q as (11010 + 10) = 11100. Since the divisor is 7 bits
long, it will subtract itself from the first (7 + 1) = 8 bits of the current partial remainder. So
the first 8 bits from the current partial remainder is 10001100 and the subtraction process
involves subtraction as (10001100 – 1100110) = 100110. The rest of the bit which was
unused in the dividend i.e., “1” will append with 100110 which produces the next partial
remainder 1001101. Since, the number of bits in the partial remainder is 7 and it is less
than the divisor, the remainder is 1001101 and the quotient is 11100.
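To make the flow of Algorithm 23.1 and Example 23.4 easier to trace, a minimal Python sketch of the bit-difference division method is given below. It operates on binary strings; the function name and the leading-zero handling are illustrative assumptions rather than part of the hardware description. Running it on the operands of Example 23.4 reproduces the quotient 11100 and the remainder 1001101.

def bit_difference_divide(dividend_bits, divisor_bits):
    # Software model of the division method of this chapter (illustrative only).
    m = len(divisor_bits)
    x = int(divisor_bits, 2)
    y = dividend_bits
    quotient = 0
    while len(y) > m or int(y, 2) >= x:
        diff = len(y) - m                   # bit-length difference (heuristic)
        if diff == 0:                       # equal lengths: one final subtraction
            quotient += 1
            y = format(int(y, 2) - x, "b")
            break
        quotient += 1 << (diff - 1)         # quotient group: 1 followed by diff-1 zeros
        head, tail = y[:m + 1], y[m + 1:]   # subtract the divisor from the first m+1 bits
        partial = format(int(head, 2) - x, "b")
        y = (partial + tail).lstrip("0") or "0"   # append unused bits, re-count significant bits
    return format(quotient, "b"), y

print(bit_difference_divide("101101110101", "1100110"))   # ('11100', '1001101')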

23.3.2 ASIC-Based Circuits


In this subsection, an n-by-m-bit divider circuit is constructed, where n is the number of bits in the dividend and m is the number of bits in the divisor. To construct the divider circuit,

Table 23.6: Truth Table for the Verification of Equation 23.9

Row a2 a1 a0 b1 b0
1 0 0 0 0 0
2 0 0 1 0 1
3 0 1 0 1 0
4 0 1 1 1 0
5 1 0 0 1 1
6 1 0 1 1 1
7 1 1 0 1 1
8 1 1 1 1 1

a parallel bit counter circuit is presented at first. Then, a fast switching circuit is presented.
Finally, the divider circuit is shown along with necessary figures and examples.

23.3.2.1 Parallel n-bit Counter Circuit


One of the important aspects of the division algorithm in Algorithm 23.1 is counting the number of bits in the dividend and the divisor. In this subsection, a fast and compact bit counter circuit is presented.
Let a be a binary operand; it can be expressed in the positional numbering system as follows:

a = Σj=1..n aj × 2^(j−1) (23.9)

Therefore, a binary operand a consisting of n bits will produce another binary operand b composed of ⌈log2 n + 1⌉ bits by following Equation 23.9. The verification is shown for the binary operand a (for n = 3, with a expressed as (a2 a1 a0)) in Table 23.6.
For example, in row 4 of Table 23.6, the binary operand a is composed as a2 = 0, a1 = 1 and a0 = 1. Therefore, the output operand b is b1 = 1 and b0 = 0. In other words, the bit counter determines the number of bits present in any binary operand. Assume that the input operand a is stored in an n-bit register, where n is the number of bits in a. An n-bit register is composed of n D flip-flops. If no data is present at the "Data" pin of a D flip-flop or latch, both of its output pins remain in an invalid state. The scenario has been demonstrated in Fig. 23.24.

Figure 23.24: Demonstration of Invalid State in D Flip-Flop and Latches.

However, if data is present at the "Data" pin of the D flip-flop or latch, one of the output pins Q or Q′ will produce a "1", as shown in Fig. 23.25. Fig. 23.25 demonstrates the scenario with a schematic diagram. The aforementioned property is used to construct the parallel bit counter, since the bits which occupy valid positions (from the most significant bit down to the least significant bit) in the input operand a are counted as present whether their value is "0" or "1". For example, in row 7 of Table 23.6, a0 = 0, but the output counts all the bits as present. Therefore, the output operand considers an input bit as present if the bit occupies a valid position of the input operand, whether the bit value is "0" or "1". A valid position means any bit position from the most significant bit position down to the least significant bit position.

Figure 23.25: Demonstration of the Presence of Input Bit in D Flip-Flop and Latches.

Now, a 2-input OR gate is used to always produce a "1" result depending on the presence of a bit. The output operand Di, where i is the position of the respective bit, follows the formula:

Di = Qi + Qi′ (23.10)

Figure 23.26: Circuit Realization of Equation 23.10.

Fig. 23.26 demonstrates the circuit realization of Equation 23.10 with a schematic diagram. The least significant bit of the output operand b0 is defined by Equation 23.11 as follows:

b0 = (a0 ⊕a1 )⊕(a2 ⊕a3 )⊕(a4 ⊕a5 )⊕a6 (23.11)

Then, the second least significant bit of the output operand b1 is formulated by Equation 23.12 as follows:

b1 = (a1 ⊕a3 )⊕a5 (23.12)

The most significant bit of the output operand b2 is defined by Equation 23.13 as follows:

b2 = a3 (23.13)

The circuit realization of the 7-bit counter has been demonstrated in Fig. 23.27. In this figure, in6, in5, in4, in3, in2, in1, and in0 indicate the corresponding input bits; the red color indicates the presence of a value (either 0 or 1), whereas the white color indicates the absence of a value. In Fig. 23.27, all the input bits are present, indicated by the red color, and thus all the output bits are present, exactly as given by Equations 23.11, 23.12, and 23.13.

Figure 23.27: Circuit Realization of the 7-bit Counter.

Example 23.5 Consider a binary operand a with the value (10011)2 . So, the value (10011)2
will be passed through the circuit of Fig. 23.26 and the following outputs are obtained as
follows:

D6 = 0, D5 = 0, D4 = 1, D3 = 1, D2 = 1, D1 = 1, and D0 = 1.
Now, the least significant bit of the output operand b0, which is defined by Equation 23.11, is calculated as follows:
b0 = (1 ⊕ 1) ⊕ (1 ⊕ 1) ⊕ (1 ⊕ 0) ⊕ 0 = 1
Then, the second least significant bit of the output operand b1, which is defined by Equation 23.12, is calculated as follows:
b1 = (1 ⊕ 1) ⊕ 0 = 0
Finally, the most significant bit of the output operand b2, which is defined by Equation 23.13, is calculated as follows:
b2 = 1

The data flow has been demonstrated in Fig. 23.28. In the figure, red color indicates the
“1" value, whereas the white color indicates the “0" value.

Figure 23.28: Data Flow of the 7-bit Counter for Example 23.5.

Now, consider an n-bit input operand a which can be written as an an−1 an−2 . . . a0. To count the number of bits in the input operand a, the output operand is a ⌈log2 n + 1⌉-bit operand b which can be expressed as b⌈log2 n+1⌉ b⌈log2 n+1⌉−1 b⌈log2 n+1⌉−2 . . . b0. The least significant bit of the output operand b is obtained by applying the exclusive-OR operation across all the bits of the input operand. The next bit of the output operand b is obtained by applying the exclusive-OR operation across alternate bits of the input operand. The second next least significant bit of the output operand b is obtained by applying the exclusive-OR operation across every fourth bit of the input operand. This process continues until the stride reaches half the number of input bits. Algorithm 23.2 shows the algorithm for the n-bit counter circuit.

Algorithm 23.2: Algorithm for n-Bit Counter Circuit

To prove the correctness of Algorithm 23.2, let us consider n = 15, i.e., the input operand is 15 bits. Now the output bits are given by Equations 23.14, 23.15, 23.16, and 23.17:

b0 = (a0⊕a1)⊕(a2⊕a3)⊕(a4⊕a5)⊕(a6⊕a7)⊕(a8⊕a9)⊕(a10⊕a11)⊕(a12⊕a13)⊕a14 (23.14)

b1 = (a1⊕a3)⊕(a5⊕a7)⊕(a9⊕a11)⊕a13 (23.15)

b2 = (a3⊕a7)⊕a11 (23.16)

b3 = a7 (23.17)

Fig. 23.29 shows the circuit realization of the 15-bit counter. In this figure, “in” rep-
resents the input operand whereas the “out” represents the output operand. The circuit
realization has been performed by following the above equations.
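In software terms, the counter behaves as a thermometer-to-binary conversion: the presence bits D0 to Dn−1 are combined by exclusive-OR with strides of 1, 2, 4, and so on, following the pattern of Equations 23.11 to 23.17. The following minimal Python sketch (illustrative only) models that behaviour for an arbitrary width.

def presence_bits(value, width):
    # D_i = 1 for every valid bit position of the operand, as in Fig. 23.25.
    significant = value.bit_length()
    return [1 if i < significant else 0 for i in range(width)]

def bit_counter(value, width):
    # XOR taps at indices 2^j - 1, 2*2^j - 1, ... produce output bit b_j.
    d = presence_bits(value, width)
    result = []
    j = 0
    while (1 << j) <= width:
        stride = 1 << j
        bit = 0
        for tap in d[stride - 1::stride]:
            bit ^= tap
        result.append(bit)
        j += 1
    return result                              # least significant bit first: [b0, b1, ...]

print(bit_counter(0b10011, 7))   # [1, 0, 1], i.e., 5 significant bits, as in Example 23.5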

23.3.2.2 n-bit Comparator


This subsection discusses a modification of contemporary comparators. The outputs of the comparator circuit are: whether the two operands are equal to each other, whether one operand is greater than the other, and whether that operand is less than the other. Instead of using three different logic paths, two design paths can be used: one to calculate the equality of the operands and one for either the greater-than or the less-than comparison, as shown in Fig. 23.30. Then the negations of both outputs can be combined by a 2-input AND operation to generate the remaining output. Fig. 23.31 shows the circuit architecture of the modified 4-bit comparator designed by using the algorithm represented in Algorithm 23.3. Fig. 23.32 represents

Figure 23.29: Circuit Realization of the 15-Bit Counter.

the circuit architecture of the n-bit comparator circuit. This circuit has been constructed by using the algorithm depicted in Algorithm 23.3. One of the outputs of the comparator circuit, B greater than A, is generated by using the carry bit of the last full adder, whereas the output A equal to B is calculated as the product of the sum bits of each full adder. Finally, the remaining output is generated by using the negation of both previous outputs.
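A minimal Python sketch of this adder-based comparison is given below. It assumes that the full-adder chain computes B plus the bitwise complement of A with no carry-in, so that the final carry indicates B > A and an all-ones sum indicates A = B; this interpretation of Fig. 23.32 is an assumption, and the function name is illustrative.

def compare_via_adder(a, b, width=4):
    # Assumed model of the n-bit comparator: the adder chain computes B + (NOT A).
    mask = (1 << width) - 1
    total = b + (~a & mask)
    b_greater_a = total >> width                       # carry out of the last full adder
    a_equal_b = 1 if (total & mask) == mask else 0     # product of all the sum bits
    a_greater_b = (1 - b_greater_a) & (1 - a_equal_b)  # AND of the two negated outputs
    return a_greater_b, a_equal_b, b_greater_a

print(compare_via_adder(0b1010, 0b0111))   # A = 10, B = 7 -> (1, 0, 0)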

Figure 23.30: Circuit Realization of the Identification of Two Distinct Paths in the Comparator Circuit.

Figure 23.31: Circuit Realization of the 4-Bit Comparator.

Algorithm 23.3: Algorithm for n-bit Comparator Circuit



Figure 23.32: Circuit Realization of the n-bit Comparator.

23.3.2.3 n-bit Selection Block


Generally, a 2-to-1 Multiplexer is used for the selection of bits. A 2-to-1 Multiplexer is
composed of two 2-input AND gates, one 2-input OR gate and one inverter. However, this
cost can be optimized further. In this subsection a new selection block is presented for the
divider circuit.
The division algorithm in Algorithm 23.1 uses an (m + 1)-bit subtractor instead of an n-bit subtractor to reduce the cost of the subtraction process, where n is the number of bits in the dividend and m is the number of bits in the divisor. But this introduces another problem: choosing the first (most significant) (m + 1) bits out of the n bits of the dividend. For example, if the n = 7-bit dividend is (1011011)2 and the m = 3-bit divisor is (101)2, it is required to choose (1011)2 from the dividend (1011011)2 since the number of bits in the divisor is 3. The scenario has been exhibited in Table 23.7.

Table 23.8: Truth Table for Selection of 8-Bit of Dividend

Row Divisor Dividend
d6 d5 d4 d3 d2 d1 d0 n7 n6 n5 n4 n3 n2 n1 n0
1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
3 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0
4 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0
5 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0
6 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0
7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Table 23.7: Data Flow for the Selection Block

One of the important properties of the division algorithm is that the dividend can never be divided by zero. This property enables us to derive another property, which states that the divisor must have a length of at least 1 bit. In addition, the division algorithm needs (m + 1) bits for the next step. Therefore, at least the first two bits from the MSB of the dividend will move to the next step, i.e., to the subtractor, and the rest of the bits will move to the remainder block. The data flow has been elucidated in Table 23.7.
Hence, the following equations are derived for the selection of (m + 1) bits out of n bits, where m is the number of bits in the divisor and n is the number of bits in the dividend for the selection process (here, both n and m = 7):

n7 = d0; n6 = d0; n5 = d1; n4 = d2; n3 = d3; n2 = d4; n1 = d5; n0 = d6 (23.18)

where ni is the dividend, dj is the divisor, and i <= j <= maximum(n, m).
Table 23.8 shows the truth table for the selection of 8 bits of the dividend for a divisor of up to 7 bits. Since a divisor can never be zero, the truth table does not consider a divisor value of zero. Moreover, in the divisor columns of Table 23.8, "1" represents that the denominator bit is present on the specific data path and "0" indicates that it is absent. On the other hand, in the dividend columns, "1" represents that the corresponding numerator bit will move to the subtractor whereas "0" indicates that it will move to the remainder.
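A minimal Python sketch of the selection behaviour is given below (illustrative only): given an m-bit divisor, the first m + 1 bits of the dividend are routed to the subtractor and the remaining bits to the remainder block, as in Table 23.8 and Example 23.6.

def select_bits(dividend_bits, divisor_bits):
    # Route the first m+1 dividend bits to the subtractor and the rest to the
    # remainder block (a software model of the selection block, not the transistors).
    m = len(divisor_bits)
    return dividend_bits[:m + 1], dividend_bits[m + 1:]

to_subtractor, to_remainder = select_bits("10110100", "101")
print(to_subtractor, to_remainder)   # 1011 0100, matching Example 23.6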

Fig. 23.33 represents the circuit realization of the 8-bit selection block. In the figure,
n7, n6, n5, n4, n3, n2, n1, and n0 indicate the bit positions of the dividend. On the other hand, d6, d5, d4, d3, d2, d1, and d0 represent the bit positions of the divisor. The control gates of the
pmos and nmos transistors are enabled by the presence of the corresponding divisor bits.
The sources of all the pmos and nmos transistors are the corresponding dividend bits. The
drain of the pmos transistor is the input of the next subtractor and the drain of the nmos
transistor is the input to the remainder block.

Figure 23.33: Circuit Realization of the 8-Bit Selection Block.

Example 23.6 Consider that the dividend is (10110100)2 and the divisor is (101)2 . Therefore,
n7 = 1, n6 = 0, n5 = 1, n4 = 1, n3 = 0, n2 = 1, n1 = 0, and n0 = 0. Since there are 3 bits in
the divisor, d0 , d1 , and d2 are active and thus enable the first 4 pmos transistors; hence,
s7 = 1, s6 = 0, s5 = 1, s4 = 1, and the rest of the bits s3 , s2 , s1 , and s0 become inactive.
Moreover, the remainder bits become r3 = 0, r2 = 1, r1 = 0, and r0 = 0. Fig. 23.34 illustrates
the circuit behavior of the 8-bit selection block. In the figure, red indicates the “1”
value, whereas white indicates the “0” value. Inactive regions are marked in ash (gray)
color.

Figure 23.34: Example Demonstration of Example 23.6 of the 8-Bit Selection Block.

The selection block can be optimized further by reducing the number of nmos transistors
due to the following properties:

1. Property 1: The length of the divisor will be at least one since the operation of division
by zero is omitted.

2. Property 2: The division algorithm in Algorithm 23.1 uses the first (m + 1) bits from
the n-bit dividend for the selection block, where m is the number of bits in the divisor
and n is the number of bits in the dividend.

Therefore, the first two remainder bits can be removed from the design for a single-bit
divisor. The circuit behavior for a single-bit divisor is exhibited in Fig. 23.35. It is evident
from Fig. 23.35 that the first two remainder bits always remain inactive in the presence
of a single-bit divisor. Therefore, they will also remain inactive when the number of bits
in the divisor is increased. An improved version of the selection block is presented in
Fig. 23.36. Finally, an n-bit selection block is exhibited in Fig. 23.37.

Figure 23.35: Analysis of the Circuit Behavior due to Property 1 and Property 2 of the 8-Bit
Selection Block.

Figure 23.36: An Improved Version of the 8-Bit Selection Block.



Figure 23.37: Circuit Diagram of the n-bit Selection Block.

Algorithm 23.4 represents the algorithm for the selection of the first (m + 1) bits of the n-bit
dividend, where m is the number of bits in the divisor and n is the number of bits in
the dividend. Algorithm 23.4 has been used to design the circuit diagram
exhibited in Fig. 23.37.

Algorithm 23.4: Algorithm for (m + 1)-Bit Selection Circuit.



Table 23.9: Truth Table for Conversion of ⌈log2 n + 1⌉-bit to n-bit (Here, n = 7)

Row Input Output


S2 S1 S0 z5 z4 z3 z2 z1 z0
1 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0 1
4 0 1 1 0 0 0 0 1 1
5 1 0 0 0 0 0 1 1 1
6 1 0 1 0 0 1 1 1 1
7 1 1 0 0 1 1 1 1 1
8 1 1 1 1 1 1 1 1 1

23.3.2.4 Circuit for Conversion to Zero


This subsection describes one of the important circuits required to construct the
divider circuit. The algorithm for the division operation uses a string of “0” values for concatena-
tion purposes to form a new operand which is used for the quotient selection logic. The length
of the string is determined by using a value which is obtained from the previous subtractor.
For example, if the binary value (110)2 is obtained from the subtractor, then the string will be
“00000”, where the length of the zero string is (101)2 , i.e., 5 in decimal.
Table 23.9 shows the truth table for the conversion of ⌈log2 n + 1⌉-bit to n-bit. In this
table n has been chosen as 7. The values from the subtractor are symbolized as si and the
length of the string is symbolized as z j , where i < 3 and j < 6. Equations 23.19, 23.20,
23.21, 23.22, 23.23, and 23.24 have been derived for the length of the required string.
Fig. 23.38 demonstrates the circuit realization of all the equations from Equations
23.19 – 23.24. The effectiveness of the circuit is depicted in Fig. 23.39 for the input
(101)2 . The corresponding output is (1111)2 , which means that the length of the string would
be 4.

Figure 23.38: Circuit Diagram of Equations 23.19–23.24.



Figure 23.39: Example Demonstration of the Circuit Exhibited in Fig. 23.38.

Fig. 23.40 demonstrates the circuit realization of the 3-bit to 6-bit zero converter circuit.
When a respective output bit z j is 1 (where j < 6), the control gate of the corresponding
pmos transistor is activated, and a constant “0” (which acts as the source of the pmos transistor) is
provided at the drain of the pmos transistor, which is the final output of the converter circuit.

Figure 23.40: Circuit Diagram of the 3-bit to 6-bit Zero Converter Circuit.

Example 23.7 Fig. 23.41 exhibits the behavior of the circuit shown in Fig. 23.40.
The final output z is inactive for the all-0 input in Fig. 23.41. Fig. 23.42 exhibits the
behavior of the circuit of Fig. 23.40 for the input (001)2 . The final output z is
again inactive for the input (001)2 of Fig. 23.42, according to Table 23.9. Fig. 23.43 exhibits
the behavior of the circuit of Fig. 23.40 for the input (011)2 . The final output
z is (0)2 for the input (011)2 .

z0 = s1 + s2 (23.19)
z1 = s2 + (s0 .s1 ) (23.20)
z2 = s2 (23.21)
z3 = s2 (s0 + s1 ) (23.22)
z4 = s1 .s2 (23.23)
z5 = s0 .s1 .s2 (23.24)
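The following Python sketch is a quick cross-check added here (not part of the original text): it evaluates Equations 23.19–23.24 for every 3-bit subtractor value and confirms that the number of active z outputs equals the value minus one, which is exactly the pattern of Table 23.9.

def zero_length_bits(s2: int, s1: int, s0: int):
    """Evaluate Equations 23.19-23.24 for one 3-bit input (s2 s1 s0)."""
    z0 = s1 | s2
    z1 = s2 | (s0 & s1)
    z2 = s2
    z3 = s2 & (s0 | s1)
    z4 = s1 & s2
    z5 = s0 & s1 & s2
    return [z0, z1, z2, z3, z4, z5]

for value in range(1, 8):                      # the value 0 never reaches this circuit
    s2, s1, s0 = (value >> 2) & 1, (value >> 1) & 1, value & 1
    z = zero_length_bits(s2, s1, s0)
    assert sum(z) == value - 1                 # length of the appended zero string
print("Equations 23.19-23.24 agree with Table 23.9")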

Figure 23.41: Circuit Behavior of the Circuit Exhibited in Fig. 23.40 when Input is 0002 .

A generalized algorithm for the construction of an n-bit to (2^n – 2)-bit converter circuit has been presented
in Algorithm 23.5. An illustration of the working procedure of the above algorithm is given
below:
1. Let us consider that the input is 4 bits wide, which means that the truth table requires 2^4 = 16 entries and
the output will be 2^4 – 2 = 14 bits wide. So, the truth table is constructed with 16 entries. After
the first two iterations, the output bits gradually become 1.
2. Now, the required output functions are produced with the help of AND and OR logic,
as follows (a short sanity check of these functions is given after Equation 23.38):

Figure 23.42: Circuit Behavior of the Circuit Exhibited in Fig. 23.40 for the Input 0012 .

z0 = s1 + s2 + s3 (23.25)
z1 = s2 + s3 + (s0 .s1 ) (23.26)
z2 = s2 + s3 (23.27)
z3 = (s2 .(s0 + s1 )) + s3 (23.28)
z4 = (s1 .s2 ) + s3 (23.29)
z5 = (s0 .s1 .s2 ) + s3 (23.30)
z6 = s3 (23.31)
z7 = s3 .(s0 + s1 + s2 ) (23.32)
z8 = s3 .(s1 + s2 ) (23.33)
z9 = s3 .(s2 + (s0 .s1 )) (23.34)
z10 = s2 .s3 (23.35)
z11 = (s0 + s1 ).s2 .s3 (23.36)
z12 = s1 .s2 .s3 (23.37)
z13 = s0 .s1 .s2 .s3 (23.38)

Figure 23.43: Circuit Behavior of the Circuit Exhibited in Fig. 23.40 for the Input 0112 .

Finally, the above Equations 23.25 to 23.38 can be used to construct a 4-bit to 14-bit
zero converter circuit.
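As a brief sanity check (an added sketch, not the book's code), the Python fragment below transcribes the fourteen output functions above and verifies that an input value v activates exactly v − 1 of the z outputs, which is the behavior required of the 4-bit to 14-bit zero converter.

def converter_4bit(s3, s2, s1, s0):
    """Equations 23.25-23.38: 4-bit input to 14 z outputs."""
    return [
        s1 | s2 | s3,                    # z0
        s2 | s3 | (s0 & s1),             # z1
        s2 | s3,                         # z2
        (s2 & (s0 | s1)) | s3,           # z3
        (s1 & s2) | s3,                  # z4
        (s0 & s1 & s2) | s3,             # z5
        s3,                              # z6
        s3 & (s0 | s1 | s2),             # z7
        s3 & (s1 | s2),                  # z8
        s3 & (s2 | (s0 & s1)),           # z9
        s2 & s3,                         # z10
        (s0 | s1) & s2 & s3,             # z11
        s1 & s2 & s3,                    # z12
        s0 & s1 & s2 & s3,               # z13
    ]

for v in range(1, 16):
    bits = [(v >> k) & 1 for k in (3, 2, 1, 0)]
    assert sum(converter_4bit(*bits)) == v - 1
print("The 4-bit converter equations yield v - 1 ones for every nonzero v")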

23.3.2.5 Design of the Divider Circuit


The circuit realization of the divider circuit is explained in this subsection. Fig. 23.44
exhibits the circuit diagram of the 4-bit divider constructed by following Algorithm 23.6. The 4-bit
divider circuit uses two 4-bit counters to count the number of bits of the dividend and the
divisor. The two counter circuits work in parallel. Then the outputs of both
bit counters move to the 3-bit comparator and subtractor circuits. The output of the 3-bit
comparator circuit decides whether the whole divisor will be subtracted from the dividend
or a partial dividend will be used for the subtraction process. The selection block provides
the necessary dividend bits for the subtraction process. A 3-bit subtractor is used to compute
the remainder. This remainder is used again to provide the dividend in the next iteration.
The path starting from the 3-bit comparator to the remainder block is considered as the
path of remainder calculation. On the other hand, the path starting from the first 3-bit
subtractor to the quotient registers is considered as the path of the quotient selection logic.

Algorithm 23.5: Algorithm for n-bit Conversion Circuit

Figure 23.44: Circuit Diagram of the 4-Bit Divider.



Algorithm 23.6: Algorithm for n-bit Divider Circuit

The quotient selection logic uses one circuit for conversion to zero. The output of the
zero converter circuit is then used for concatenation. The concatenation circuit is constructed by
using D flip-flops. Then, a 3-bit adder is used to calculate the necessary quotient bits.
Fig. 23.45 exhibits the circuit diagram of the n-bit divider, where n is the number of
bits in the dividend. The n-bit divider circuit uses two n-bit counters to count the number
of bits of the dividend and the divisor. The output of each n-bit counter is ⌈log2 n + 1⌉ bits wide.
The counter circuits work in parallel. Then the outputs of both bit counters
move to the ⌈log2 n + 1⌉-bit comparator and subtractor circuits. The output of the ⌈log2 n + 1⌉-bit
comparator circuit decides whether the whole divisor will be subtracted from
the dividend or a partial dividend will be used for the subtraction process. The selection
block provides the necessary dividend bits for the subtraction process. An m-bit subtractor is
used to compute the remainder, where m is the number of bits in the divisor. This remainder
is used again to provide the dividend in the next iteration. The path starting from the ⌈log2 n + 1⌉-bit
comparator to the remainder block is considered as the path of remainder calculation.
On the other hand, the path starting from the first ⌈log2 n + 1⌉-bit subtractor to the quotient
register is considered as the path of the quotient selection logic. The quotient selection logic
uses one ⌈log2 n + 1⌉-bit to n-bit conversion-to-zero circuit for appending purposes of the

Figure 23.45: Block Diagram of the n-bit Divider Circuit.

next block. Then the output of the zero converter circuit is used for concatenation. The
concatenation circuit is constructed by using D flip-flops. Then, an n-bit adder is used to
calculate the necessary quotient bits.
Example 23.8 The execution of the circuit behavior for the dividend = (1111)2 and divisor
= (10)2 is exhibited in Fig. 23.46. In the figure, red indicates the “1” value, whereas
white indicates the “0” value.

23.3.3 LUT-Based Circuits


This subsection illustrates the design methodologies of the Look-Up Table (LUT)-based
divider circuit. The basic definitions and the working procedure of an n-input LUT have
been described in Section 23.2. The designs of the full adder, subtractor, comparator and
converter circuits are shown by using LUTs. Moreover, the working procedure of each of the
LUT-based components is described briefly.
Several types of LUTs exist among commercial FPGA designs, such as 2-input,
3-input, 4-input, 5-input, 6-input, 7-input and 8-input LUTs, and the most recent type
of LUT is the 9-input LUT. Researchers have found that the 6-input LUT performs well for
the construction of circuits, offering area-delay prominence over LUTs with other input counts. A
heterogeneous structure of the LUT architecture is also possible for the construction of the
circuits. In this regard, the 6-input LUT architecture is considered here for the construction of the
LUT-based divider circuit.

Figure 23.46: Circuit Behavior of the 4-Bit Divider Circuit for Dividend = (1111)2 and
Divisor = (10)2 .

23.3.3.1 LUT-Based Bit Counter Circuit


The LUT-based design of the bit counter circuit is described here. At first, an algorithm is
presented, and then a demonstration of the algorithm is given
to construct the circuit.
Algorithm 23.7 represents the circuit algorithm for the LUT-based counter circuit.
Suppose n = 15, which indicates that a 15-bit LUT-based counter
circuit is required. The step-by-step demonstration of Algorithm 23.7 is as follows:

Algorithm 23.7 LUT-based counter circuit


Input: A set of functions.
Output: A partition of the variables such that if there are 2r variables then the partition will
contain r different pairs.
1: From Equations 23.14, 23.15, 23.16, and 23.17, first count the frequency of
occurrence of each input operand ai , where 0 <= i <= 14; Table 23.10 shows
the frequency of each input variable.
2: Now, sort Table 23.10 in descending order, as demonstrated in
Table 23.11, which is expressed as L = (a3 , a7 , a11 , a1 , a5 , a9 , a13 , a0 , a2 , a4 , a6 ,
a8 , a10 , a12 , a14 ).
3: Take the first 5 input variables from L to form R = (a3 , a7 , a11 , a1 , a5 ).
4: Then the following two equations are formed, which can be implemented in a single
6-input LUT, as shown in Equations 23.39 and 23.40:

F0 = (a3 ⊕a7 ⊕a11 ⊕a1 ⊕a5 ) (23.39)


F1 = (a3 ⊕a7 ⊕a11 ) (23.40)

5: Now create L′ = L − R, where L′ = (a9 , a13 , a0 , a2 , a4 , a6 , a8 , a10 , a12 , a14 ).


6: Take F0 and the first 4 input variables from L′ to form R′ = (F0 , a9 , a13 , a0 , a2 ).
7: Then the following two equations are formed, which can be implemented in a single
6-input LUT, as shown in Equations 23.41 and 23.42:

F2 = (F0 ⊕a9 ⊕a13 ) (23.41)


F3 = (F0 ⊕a9 ⊕a13 ⊕a0 ⊕a2 ) (23.42)

8: Now create L′′ = L′ − R′ , where L′′ = (a4 , a6 , a8 , a10 , a12 , a14 ). Since all the remaining
input operands in L′′ can be placed in a single 6-input LUT, the process ends, and thus it requires only three 6-input LUTs.
Similarly, to construct the 7-bit counter circuit, the presented algorithm requires only
two 6-input LUTs.

One of the important properties of the n-input LUT is that it can serve as a dual-output LUT
whenever the number of inputs used is at most n − 1, where 3 <= n <= 9. This property has been used in
the LUT-based divider circuit. Therefore, the output function b0 from Equation 23.11 can
be realized by the output variable F0 of the LUT as follows in Equation 23.43:
F0 = (a0 ⊕a1 )⊕(a2 ⊕a3 )⊕(a4 ⊕a5 ) (23.43)

Table 23.10: Frequency Distribution Table for a 15-Bit Input

input a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14


frequency 1 2 1 3 1 2 1 3 1 3 1 3 1 2 1

Table 23.11: Sorted Frequency Distribution Table for a 15-Bit Input

input a3 a7 a11 a1 a5 a9 a13 a0 a2 a4 a6 a8 a10 a12 a14


frequency 3 3 3 2 2 2 2 1 1 1 1 1 1 1 1

The output F0 is chained with another 6-input LUT to produce the dual outputs F1
and F2 as follows in Equation 23.44:

F1 = F0 ⊕a6 ; F2 = (a1 ⊕a3 )⊕(a5 ) (23.44)

The most significant bit of the output operand b3 , which has been derived from Equation
23.13, does not require any LUT since it is generated directly with the help of the input operand
a3 . The block diagram is shown in Fig. 23.47.
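A small added illustration (not from the original text): the least significant bit of a population count is simply the XOR (parity) of all the input bits, which is why F0 and F1 above are realized as chained XORs. The sketch below checks this for the 7-bit case; the higher-order count bits come from the earlier equations of this section and are not reproduced here.

from itertools import product

# The LSB of the population count equals the parity (XOR) of all bits,
# which is what the chained XOR LUT outputs F0 and F1 compute.
for bits in product((0, 1), repeat=7):
    parity = 0
    for b in bits:
        parity ^= b                   # F1 = a0 ^ a1 ^ ... ^ a6, built as chained XORs
    assert parity == (sum(bits) & 1)  # b0: least significant bit of the bit count
print("The parity of the inputs equals the LSB of the 7-bit count")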

23.3.3.2 LUT-Based Bit Comparator Circuit


The LUT-based design of the comparator circuit is described here. At first, an algorithm is
presented, and then a demonstration of the algorithm is given
to construct the circuit.
Algorithm 23.8 presents the algorithm for designing an n-bit LUT-based comparator
circuit. Let us consider n = 16, i.e., the input is 16 bits long. The step-by-step demonstration
of Algorithm 23.8 is described below.
Fig. 23.48 shows the circuit diagram of the 16-bit LUT-based comparator circuit.
Similarly, a 6-bit LUT-based comparator circuit is shown in Fig. 23.49. Algorithm 23.8
can be implemented with LUTs of any input size. To verify this, an example of
a 4-input LUT architecture can be taken to design a 4-bit LUT-based comparator circuit,
which is shown in Fig. 23.50. The step-by-step demonstration is given below:

Algorithm 23.8: Algorithm for LUT-Based n-Bit Counter Circuit



Figure 23.47: Block Diagram of the Look-Up Table (LUT)-Based 7-Bit Counter Circuit.

Figure 23.48: 16-Bit LUT-Based Comparator Circuit.



Figure 23.49: 6-Bit LUT-Based Comparator Circuit.

1. At first, the input bits have to be paired as P = {(a0 , b0 ), (a1 , b1 ), (a2 , b2 ), (a3 , b3 )}.
2. The input to the first 4-input LUT is (a0 , b0 , carry0 ), producing the following two outputs:

F0 = a0 ⊕b0 ⊕carry0 ; F1 = a0 .b0 + b0 .carry0 + a0 .carry0 (23.45)

3. The input to the second 4-input LUT is (a1 , b1 , F1 ), producing the following two outputs:

F2 = a1 ⊕b1 ⊕F1 ; F3 = a1 .b1 + b1 .F1 + a1 .F1 (23.46)

4. The input to the third 4-input LUT is (a2 , b2 , F3 ), producing the following two outputs:

F4 = a2 ⊕b2 ⊕F3 ; F5 = a2 .b2 + b2 .F3 + a2 .F3 (23.47)

5. The input to the fourth 4-input LUT is (a3 , b3 , F5 ), producing the following two outputs:

F6 = a3 ⊕b3 ⊕F5 ; F7 = a3 .b3 + b3 .F5 + a3 .F5 (23.48)



Figure 23.50: Design of LUT-Based 4-Bit Comparator Circuit.

Here, F7 is the required output b greater than a.


6. The input to the fifth 4-input LUT is (F0 , F2 , F4 , F6 ), producing the following single
output:
F8 = F0 ⊕F2 ⊕F4 ⊕F6 (23.49)

Here, F8 is the required output b equal to a.


7. The input to the sixth 4-input LUT is (F7 , F8 ), producing the following output:

F9 = F7′ .F8′ (23.50)

Here, F9 is the required output b less than a. Thus, a LUT-based 4-bit comparator circuit
requires only six 4-input LUTs.
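The last two steps use a general area-saving trick that reappears in the chapter summary: only two comparator outputs are computed directly, and the third is obtained by inverting and ANDing them, as in Equation 23.50. The following Python sketch (an illustrative addition, not the book's code) checks exhaustively for 4-bit operands that deriving “less than” in this way is always correct.

from itertools import product

def compare(a: int, b: int):
    """Two computed outputs plus one derived output, as in Equation 23.50."""
    greater = int(b > a)                         # path 1: b greater than a
    equal = int(b == a)                          # path 2: b equal to a
    less = int((not greater) and (not equal))    # derived: F9 = F7'.F8'
    return greater, equal, less

for a, b in product(range(16), repeat=2):        # all 4-bit operand pairs
    g, e, l = compare(a, b)
    assert g + e + l == 1                        # exactly one relation holds
    assert l == int(b < a)
print("Deriving 'less than' from the other two outputs is always correct")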

23.3.3.3 LUT-Based Selection Circuit


The LUT-based design of the selection circuit is described here. At first, an algorithm is
discussed, and then a demonstration of the algorithm is given
to construct the circuit.

Algorithm 23.9 LUT-based comparator


1: There are 8 pairs formed for the 16-bit input operands, which can be expressed as P =
{(a0 , b0 ), (a1 , b1 ), . . . , (a15 , b15 )}. Therefore, eight 6-input LUTs are needed, where
the first two pairs from P and one carry from the previous LUT constitute the input set for each
LUT. Since there are 5 inputs, each of the 6-input LUTs serves as a dual-output function.
Each LUT will provide the following output functions F1 and F2 :

F1 = carryi = ai .bi + bi .carryi−1 + ai .carryi−1 (23.51)

2: The carry of the last chained LUT provides the required output b greater than a. Then,
the F2 outputs of all the LUTs are sent to a 6-input LUT producing the output a equal b.
3: Lastly, the output of the previous single 6-input LUT (a equal b), the output of the last
chained 6-input LUT (b greater than a) and the unused outputs of the 6-input LUTs in
step 1 are fed into a 6-input LUT which finally produces the output b less than a.

Algorithm 23.9 represents the algorithm for the LUT-based selection of the first (m + 1) bits
of the n-bit dividend, where m is the number of bits in the divisor and n is the
number of bits in the dividend. Algorithm 23.9 has been used
to design the circuit diagram exhibited in Fig. 23.51 for the 4-bit selection circuit. The step-by-
step demonstration of Algorithm 23.9 is given below.
Therefore, the circuit has been constructed and the circuit diagram has been depicted
in Fig. 23.51. The circuit exhibited in Fig. 23.51 has been demonstrated with 4-input LUTs.
However, the design can also be done with 5-, 6- or 7-input LUTs by using Algorithm 23.9.

23.3.3.4 LUT-Based Converter Circuit


The LUT-based design of the converter circuit is described here. At first, an algorithm is
discussed, and then a demonstration of the algorithm is given
to construct the circuit.

Figure 23.51: Design of a LUT-Based 4-Bit Selection Circuit.



Algorithm 23.10: Algorithm for LUT-based n-Bit Comparator Circuit

Figure 23.52: Design of a LUT-Based 3-Bit Converter Circuit.

Algorithm 23.10 presents the circuit design algorithm for the LUT-based converter
circuit. This algorithm has been used to construct the circuit demonstrated in Fig. 23.52 for
a 3-bit converter circuit. Similarly, this algorithm can be used to construct a 4-bit converter
circuit, as demonstrated in Fig. 23.53. In both cases, the 4-input LUT has been considered.
The 3-bit LUT-based converter requires only three 4-input LUTs, whereas the 4-bit converter
requires only eight 4-input LUTs. Though the design has been carried out for 4-input LUTs, it
is possible to use larger-input LUTs by following Algorithm 23.10. For instance, if a 6-input
LUT is used to design a 4-bit converter circuit, it will require only five 6-input LUTs.
Thus, the design of the 4-bit converter circuit with 6-input LUTs achieves an improvement in
terms of the required number of LUTs when the number of inputs to the LUT is increased from

Figure 23.53: Design of a LUT-Based 4-Bit Converter Circuit.

4-input to 6-input. Therefore, the performance of the LUT-based converter circuit will be
enhanced with the increase in the number of inputs of the LUTs.

23.3.3.5 Design of the LUT-Based Divider Circuit


The LUT-based divider circuit is discussed here. In the previous subsections, all the required
components, such as the counter, comparator, selection and converter
circuits, are described. After accumulating all the components by following Algorithm 23.6,
a divider circuit is designed. The main difference is the set of LUT-based circuits required
for the LUT-based divider circuit. The design algorithm for the LUT-based counter circuit has
been demonstrated in Algorithm 23.7, the LUT-based comparator circuit has been exhibited in
Algorithm 23.8, the LUT-based converter circuit has been presented in Algorithm 23.10 and,

finally, the LUT-based selection block has been shown in Algorithm 23.9. The concatenation
circuit is composed of a series of D flip-flop circuits.
The block diagram of the LUT-based divider has been shown in Fig. 23.45, which is
the same as for the ASIC-based design. Instead of the ASIC-based components, it is required
to use the LUT-based components. The algorithms for the construction of the LUT-based
components have been introduced in the earlier subsections mentioned above.

23.3.3.6 Reversible Fault Tolerant LUT-Based Divider Circuit


The Reversible Fault Tolerant LUT-Based (RFTL) divider circuit is presented here. The
design procedure described in Algorithm 23.6 has been followed. For the reversible and
fault tolerant construction of the circuit, the design procedure described in Algorithm 23.11
has been followed. The step-by-step demonstration of the construction of the RFTL divider
is given below:
1. Initially, let X be a 4-bit divisor (X3 X2 X1 X0 ) and Y be a 4-bit dividend (Y3 Y2 Y1
Y0 ). The outputs are the 4-bit quotient Q (Q3 Q2 Q1 Q0 ) and the remainder R (R3 R2 R1 R0 ).
2. Apply a 4-bit reversible fault tolerant LUT-based counter where input:= {(X3 X2 X1
X0 )}, output:= {CX2 CX1 CX0 }; apply a 4-bit reversible fault tolerant LUT-based counter
where input:= {(Y3 Y2 Y1 Y0 )}, output:= {CY2 CY1 CY0 }.
3. Apply a 3-bit reversible fault tolerant LUT-based comparator where input:= {(CX2 CX1
CX0 ), (CY2 CY1 CY0 )}, output:= {CX is greater than CY, or CX is less than CY, or CX
is equal to CY }; apply a 3-bit reversible fault tolerant LUT-based subtractor where input:=
{(CX2 CX1 CX0 ), (CY2 CY1 CY0 )}, output:= {(s2 s1 s0 )}.
4. Apply a 4-bit reversible fault tolerant LUT-based selection block where input:= {(X3
X2 X1 X0 ), (Y3 Y2 Y1 Y0 ), (CX equal to CY )}, output:= {remainder (r3 r2 r1 r0 ), (s3 s2 s1
s0 )}; apply a j-bit reversible fault tolerant LUT-based zero converter where input:= {(s2 s1
s0 )}, output:= {2 bits of zero (z1 z0 )}.

Algorithm 23.11 4-bit selection circuit


1: First of all, let us consider that the input width is 4, which indicates that both the dividend
and the divisor are 4 bits long. The loop iterates 4 times, with i as the loop
control variable. Let the dividend be expressed as n3 n2 n1 n0 and the divisor as d3 d2 d1
d0 .
2: Initially, the value of i is 0. So, the condition i < 2 will be true and the first LUT will
have the input as n0 and d 0 which will produce the required output s0 and r 0 .
3: The value of both i and j will be incremented by 1, thus the value of i and j are 1. So,
the condition i < 2 will be true and the second LUT will have the input as n1 and d 0
which will produce the required output s1 and r 1 .
4: The value of both i and j will be incremented by 1, thus the value of i and j are 2. So,
the condition i < 2 will be false and the third LUT will have the input as n2 and d 1
which will produce the required output s2 and r 2 .
5: The value of both i and j will be incremented by 1, thus the value of i and j are 3. So,
the condition i < 2 will be false and the fourth LUT will have the input as n3 and d 2
which will produce the required output s3 and r 3 .

Algorithm 23.12: Algorithm for LUT-Based (m + 1)-Bit Selection Circuit

Algorithm 23.13: Algorithm for n-bit LUT-Based Conversion Circuit



5. Apply a 4-bit reversible fault tolerant LUT-based subtractor where input:= {(s3 s2 s1
s0 )}, output:= {(S2 S1 S0 )}; apply a 3-bit reversible fault tolerant LUT-based concatenation
circuit where input:= {(Q0 Q1 Q2 ), (z1 z0 )}, output:= {(Q2 Q1 Q0 )}.
6. Apply a 3-bit reversible fault tolerant LUT-based adder where input:= the contents
of the quotient registers and the output of the concatenation circuit, output:= {(Q3 Q2 Q1 Q0 )}.

23.4 SUMMARY
The divider circuit has received less attention in comparison with adder and multiplier circuits
due to its computational complexity and hardware difficulties in terms of area and latency.
However, the performance of a processor will eventually degrade if the capability of the
division operation is not improved. The non-restoring algorithm has been used extensively for
the construction of radix-2 (binary) divider circuits since it provides more area-delay promi-
nence than other algorithmic methods such as the restoring, SRT and digit convergence algorithms,
as reported for Application Specific Integrated Circuit (ASIC)-based divider circuits. Nev-
ertheless, commercial designs also use the non-restoring division algorithm to construct
their circuits, and it is widely accepted as a state-of-the-art algorithm. The most recent update on
the conventional non-restoring algorithm reduces the delay of each iteration of the circuit
from three to two and shows promising results for long dividends.
However, the area of the circuit has been increased considerably at the expense of the minimized
delay.
This chapter presents a divider circuit using a new approach of calculating the bit difference
by using problem-specific heuristic functions to find the quotient values. The presented
division algorithm, which is used to construct the divider circuit, calculates the next partial
remainder and the quotient bits simultaneously. In addition, the division algorithm requires
only two operations: addition and subtraction. This chapter has presented four new circuits
which have been used to construct the divider circuit. These circuits are a bit counter circuit,
which counts the number of bits present in an input operand; a modified comparator circuit,
which calculates whether the two given input operands are equal to each other or greater or less
than each other; a selection block, which selects the first (m + 1) bits out of n bits, where m
is the number of bits in the divisor and n is the number of bits in the dividend; and, lastly, a converter
circuit, which converts the number of bits to a string of zeros. The bit counter requires only the delay of an inverter,
a 2-input AND gate and three D flip-flops: the method converts all the input bits present in
the input operand into “1” and then counts the number of 1’s present in the input operand.
The state-of-the-art design of a comparator circuit uses three distinct paths to obtain
the outputs of the comparator. One of the paths calculates whether the input operands
are equal, and the other two paths calculate whether the input operands are greater or smaller
than each other. This calculation over three paths requires a substantial amount of hardware
resources. To minimize the area of the comparator circuit, two paths have been used in this
chapter. The outputs of the two paths are inverted and conjugated with a 2-input AND gate
to obtain the third output. One of the important circuits for the construction of the divider
circuit is the converter circuit. The division algorithm squeezes the number of bits in the
input operand from n bits to ⌈log2 n + 1⌉ bits by means of the bit counter circuit. Then it appends
log2 n − 1 zeros in the process. This mechanism is performed by the converter
circuits.

The main reason behind the enhancement of the division algorithm is the use of the bit
difference between the dividend and the divisor. First, it uses a heuristic function to compute
the number of bit differences between the dividend and the divisor. Second, the method defines
the number of quotient bits (the global optimum). Third, the iteration schedule is calculated,
which might be non-increasing depending on how the local optimum fills the gaps to
reach the global optimum. Fourth, fewer partial remainders (local optima) are
generated on the basis of the schedule. Finally, during the subtraction operation, X is subtracted
from the first (m + 1) bits of Y , where X is the divisor, Y is the dividend and m is the
number of bits in the divisor. Thus, the division algorithm optimizes the required
number of hardware resources. Moreover, the bit counter, selection block, comparator
and converter circuits enhance the efficiency of the divider circuit in terms of area and
delay. The design can easily be implemented commercially on any platform, whether an
Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The reversible components can be used as a building block for the construction of cost-
efficient quantum computers with a minimum number of gates, transistors and unit delay.

REFERENCES
[1] K. Pocek, R. Tessier and A. DeHon, “Birth and adolescence of reconfigurable com-
puting: A survey of the first 20 years of field-programmable custom computing ma-
chines”, In Highlights of the First Twenty Years of the IEEE International Symposium
on Field-Programmable Custom Computing Machines, pp. 3–19, 2013.
[2] A. Amara, F. Amiel and T. Ea, “FPGA vs. ASIC for low power applications”, Micro-
electronics Journal, vol. 37, no. 8, pp. 669–677, 2006.
[3] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs”, IEEE Trans.
on Computer-aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp.
203–215, 2007.
[4] B. Jovanovic, R. Jevtic and C. Carreras, “Binary Division Power Models for High-
Level Power Estimation of FPGA-Based DSP Circuits”, IEEE Trans. on Industrial
Informatics, vol. 10, no. 1, pp. 393–398, 2014.
[5] S. Subha, “An Improved Non-Restoring Algorithm”, International Journal of Applied
Engineering Research, vol. 11, no. 8, pp. 5452–5454, 2016.
[6] D. M. Muñoz, D. F. Sanchez, C. H. Llanos and M. A. Rincón, “Tradeoff of FPGA design
of a floating-point library for arithmetic operators”, Journal of Integrated Circuits and
Systems, vol. 5, no. 1, pp. 42–52, 2010.
[7] S. F. Oberman and M. J. Flynn, “Design issues in division and other floating-point
operations”, IEEE Trans. on Computers, vol. 46, no. 2, pp. 154–161, 1997.
[8] S. F. Obermann and M. J. Flynn, “Division algorithms and implementations”, IEEE
Trans. on Computers, vol. 46, no. 8, pp. 833–854, 1997.
[9] R. Tessier, K. Pocek and A. DeHon, “Reconfigurable computing architectures”, Pro-
ceedings of the IEEE, vol. 103, no. 3, pp. 332–354, 2015.
[10] A. D. Hon and J. Wawrzynek, “Reconfigurable computing: what, why, and implica-
tions for design automation”, In Proceedings of the 36th Annual ACM/IEEE Design
Automation Conference, pp. 610–615, 1999.

[11] K. Jun and E. E. Swartzlander, “Modified non-restoring division algorithm with


improved delay profile and error correction”, In 2012 Conference Record of the Forty
Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp.
1460–1464, 2012.
[12] R. Senapati, B. K. Bhoi and M. Pradhan, “Novel binary divider architecture for high
speed VLSI applications”, In Information & Communication Technologies (ICT),
IEEE Conference on, pp. 675–679, 2013.
[13] G. Sutter and J. P. Deschamps, “High speed fixed point dividers for FPGAs”, Inter-
national Conference on Field Programmable Logic and Applications, pp. 448–452,
2009.
[14] R. Jevtic, B. Jovanovic and C. Carreras, “Power estimation of dividers implemented
in FPGAs”, In Proceedings of the 21st Edition of the Great Lakes Symposium on
Great Lakes Symposium on VLSI, pp. 313–318, 2011.
[15] M. U. Haque, Z. T. Sworna and H. M. H. Babu, “An Improved Design of a Reversible
Fault Tolerant LUT-Based FPGA”, In 29th International Conference on VLSI Design,
pp. 445–450, 2016.
[16] R. P. Brent, “The parallel evaluation of general arithmetic expressions”, Journal of the
ACM, vol. 21, no. 2, pp. 201–206, 1974.
[17] Y. Moon and D. K. Jeong, “An efficient charge recovery logic circuit”, IEEE Journal
of Solid-State Circuits, vol. 31, no. 4, pp. 514–522, 1996.
[18] Z. T. Sworna, M. U. Haque, N. Tara, H. M. H. Babu and A. K. Biswas, “Low power
and area efficient binary coded decimal adder design using a look up table-based
field programmable gate array”, IET Circuits, Devices & Systems, vol. 10, no. 3, pp.
163–172, 2016.
[19] C. V. Freiman, “Statistical analysis of certain binary division algorithms”, Proceedings
of the IRE, vol. 49, no. 1, pp. 91–103, 1961.
[20] J. E. Robertson, “A new class of digital division methods”, IRE Trans. on Electronic
Computers, pp. 218–222, 1958.
[21] K. D. Tocher, “Techniques of multiplication and division for automatic binary com-
puters”, The Quarterly Journal of Mechanics and Applied Mathematics, vol. 11, no.
3, pp. 364–384, 1958.
[22] D. E. Atkins, “Higher-radix division using estimates of the divisor and partial remain-
ders”, IEEE Trans. on Computers, vol. 100, no. 10, pp. 925–934, 1968.
[23] K. G. Tan, “The theory and implementations of high-radix division”, In IEEE Sym-
posium on Computer Arithmetic, pp. 154–163, 1978.
[24] J. Ebergen and N. Jamadagni, “Radix-2 Division Algorithms with an Over-Redundant
Digit Set”, IEEE Trans. on Computers, vol. 64, no. 9, pp. 2652–2663, 2015.
[25] M. R. Meher, C. C. Jong and C. H. Chang, “A high bit rate serial-serial multiplier
with on-the-fly accumulation by asynchronous counters”, IEEE Trans. on Very Large
Scale Integration (VLSI) Systems, vol. 19, no. 10, pp. 1733–1745, 2011.
[26] Z. T. Sworna, M. U. Haque and H. M. H. Babu, “A LUT-based matrix multiplica-
tion using neural networks”, In Circuits and Systems (ISCAS), IEEE International
Symposium on, pp. 1982–1985, 2016.
[27] N. R. Strader and R. Thomas, “A canonical bit-sequential multiplier”, IEEE Trans. on
Computers, vol. 31, no. 8, pp. 791–795, 1982.

[28] R. Landauer, “Irreversibility and heat generation in the computing process”, IBM J. Res.
Dev., vol.5, no. 3, pp. 183–191, 1961.
[29] M. M. A. Polash and S. Sultana, “Design of a LUT-based reversible field pro-
grammable gate array”, Journal of Computing, vol. 2, no. 10, pp. 103–108, 2010.
[30] A. S. M. Sayem and S. K. Mitra, “Efficient approach to design low power reversible
logic blocks for field programmable gate arrays”, In Computer Science and Automation
Engineering (CSAE), IEEE International Conference on, vol. 4, pp. 251–255, 2011.
CHAPTER 24

Synthesis of Boolean
Functions Using TANT
Networks

TANT is a three-level AND-NOT network with true inputs composed solely of NAND
gates. This chapter describes a systematic method for minimizing a TANT circuit and the
heuristic algorithms for different stages of the technique are provided. Algorithms in each
step of the introduced method are extensively discussed in this chapter.

24.1 INTRODUCTION
The TANT network has meaningful advantages over the PLA representation. A TANT
design for a function f can never be worse than the corresponding PLA in terms of the number of
gates. TANT has only the un-complemented (affirmative) variables as its inputs, whereas the PLA
has both the un-complemented variables and their negations as inputs. Thus, if a rectangular
layout realization is assumed, the PLA has one dimension of the input plane two times larger.
Again, the number of prime implicants of a TANT is also smaller than that of a PLA. TANT also
allows for better incorporation of fan-in constraints than the “standard cell” realization
of PLA-type two-level logic. Thus the automated synthesis of TANT networks will
play a prominent role in coping with the interconnection problem of integrated electronic
devices. Three-level networks are extensively used in flash memory, which is used more as
storage (like a hard drive) than as RAM; such memory is used in devices such as digital cameras
and home video systems for easy and fast information storage.

24.2 TANT MINIMIZATION


It has been proven that three-level logic is enough for “nearly all” Boolean functions
in the sense that, by increasing the number of levels from two to three, the number of gates can be
substantially reduced. Let us consider a simple example for the function f1 written in
AND-OR-NOT form as:

DOI: 10.1201/9781003269182-28



Figure 24.1: Representation of Function f 1 from Equation 24.1.

f1 = x0 x1′ + x1 x2′ x0′ (24.1)

In this case, it is seen from Fig. 24.1 that it requires 6 AND-OR-NOT gates. But it can be
expressed in TANT form as:

f1 = ((x0 (x0 x1 )′)′ .(x1 x2′ (x0 x1 )′)′)′ (24.2)

For larger circuits, the reduction in the number of gates increases as compared to the
AND-OR-NOT circuit. Though the power consumption of a TANT circuit is higher than that of the
corresponding AND-OR-NOT circuit, the focus here is on the minimization of the TANT
circuit for binary logic functions.
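As an added check (not part of the original text), the short Python sketch below verifies exhaustively that the NAND-only TANT expression of Equation 24.2 realizes the same function as the AND-OR-NOT expression of Equation 24.1.

from itertools import product

def f_sop(x0, x1, x2):
    """Equation 24.1: f1 = x0.x1' + x1.x2'.x0' (AND-OR-NOT form)."""
    return (x0 and not x1) or (x1 and not x2 and not x0)

def f_tant(x0, x1, x2):
    """Equation 24.2: the same function built only from negated products (NAND gates)."""
    t = not (x0 and x1)                       # shared tail factor (x0 x1)'
    g1 = not (x0 and t)                       # first NAND gate
    g2 = not (x1 and (not x2) and t)          # second NAND gate
    return not (g1 and g2)                    # output NAND gate

assert all(f_sop(*v) == f_tant(*v) for v in product((False, True), repeat=3))
print("Equations 24.1 and 24.2 agree on all 8 input combinations")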

24.2.1 The Technique


In this subsection, a method of TANT minimization is described. Before that, some necessary
definitions required for the design are discussed.

Property 24.2.1 Let H T1′ T2′ . . . Tn′ be a Boolean expression. Then the term H is


referred to as the Head of the expression if H is an un-complemented variable, a product of
un-complemented variables, or the Boolean constant 1. On the other hand, the expression Ti′ will
be called a Tail factor of the expression if each Ti′ is a complemented variable or the complement
of a product of un-complemented variables.

Example 24.1 In the expression x0 x1 x2′ (x3 x4 )′, the head is x0 x1 , and x2′ and (x3 x4 )′ are


the tail factors.

Property 24.2.2 Let Xa and Ya′ be two Boolean expressions. The consensus
relation (of order 2) of Xa and Ya′ is written as (Xa, Ya′) → X.Y .

The consensus operation eliminates the variable “a” by using the identity
a + a′ = 1.

Example 24.2 The consensus relation (wx′ y′, yx′ z′) → wx′ z′ is valid but (xw′ y′, yw′ x′)


→ w′ is not valid.
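The following Python sketch is an added illustration of why the first relation of Example 24.2 is valid and the second is not: a consensus term must be implied by the OR of the two parent terms, which can be checked exhaustively.

from itertools import product

def implied(term, parents, variables):
    """True if 'term' never evaluates to 1 while all 'parents' evaluate to 0."""
    for values in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, values))
        if term(env) and not any(p(env) for p in parents):
            return False
    return True

V = ("w", "x", "y", "z")
p1 = lambda e: e["w"] and not e["x"] and not e["y"]          # w x' y'
p2 = lambda e: e["y"] and not e["x"] and not e["z"]          # y x' z'
c1 = lambda e: e["w"] and not e["x"] and not e["z"]          # w x' z'
print(implied(c1, (p1, p2), V))                              # True: valid consensus

q1 = lambda e: e["x"] and not e["w"] and not e["y"]          # x w' y'
q2 = lambda e: e["y"] and not e["w"] and not e["x"]          # y w' x'
c2 = lambda e: not e["w"]                                    # w'
print(implied(c2, (q1, q2), V))                              # False: not a valid consensus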

After obtaining the set of PIs, the consensus operation (if possible) is applied on the set to
generate prime implicants with the same head as some PI of the set. Then the generated PI is
combined with the PI having the same head.

Property 24.2.3 A tail factor Y′ will be a useful tail factor (UTF) of another tail factor X′
if Y − X is in the head factors of the term that contains X′ .

Example 24.3 z′ is the tail factor of the term xyz′ . The useful tail factors of z′ are
(xz)′ , (yz)′ , and (xyz)′ .

Now, the method is represented using the flow diagram of Fig. 24.2.

Figure 24.2: Flow Diagram of the Method.

The first circle shows that the method finds the prime implicants using the popular
Quine–McCluskey method. After that, three more steps are followed to minimize the TANT
network. Though this algorithm for the minimization of TANT is efficient, it has
some drawbacks. They are:

1. The technique is suitable only for hand solutions.

2. Generation of UTFs does not follow any well-defined algorithm.

3. Selection of the minimum number of tail factors (the last stage) follows a brute-force algo-
rithm.

4. The technique does not work well for functions with a large number of variables.

24.3 THE INTRODUCED METHOD OF TANT MINIMIZATION


The shortcomings of the concerned method are discussed in the previous section. In this
section, a heuristic method is introduced for the minimization of TANT circuits that has
efficient algorithms in almost every step. Some definitions are given prior to the description
of the introduced method.

Property 24.3.1 A term Y will be a generalized prime implicant (GP) of X if, logically,


X = Y . More than one GP can be generated from a minterm. GP term generation is the
procedure to generate all possible GPs for a given PI.

Example 24.4 Let us consider the minterm x0 x1′ . Then x0 (x0 x1 )′ will be a generalized prime
implicant of x0 x1′ , as x0 (x0 x1 )′ = x0 (x0′ + x1′ ) = x0 x0′ + x0 x1′ = x0 x1′ . Let us consider
another minterm x1 x2′ x0′ . The 3 GP terms generated from it are x1 (x2 x1 )′ x0′ , x1 x2′
(x0 x1 )′ , and x1 (x2 x1 )′ (x0 x1 )′ .
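An added exhaustive check (not from the original text) confirms that each GP term listed in Example 24.4 is logically equal to the minterm x1 x2′ x0′ from which it was generated.

from itertools import product

minterm = lambda x0, x1, x2: x1 and not x2 and not x0            # x1 x2' x0'
gp_terms = [
    lambda x0, x1, x2: x1 and not (x2 and x1) and not x0,        # x1 (x2 x1)' x0'
    lambda x0, x1, x2: x1 and not x2 and not (x0 and x1),        # x1 x2' (x0 x1)'
    lambda x0, x1, x2: x1 and not (x2 and x1) and not (x0 and x1),  # x1 (x2 x1)' (x0 x1)'
]

for gp in gp_terms:
    assert all(bool(gp(*v)) == bool(minterm(*v))
               for v in product((0, 1), repeat=3))
print("All three GP terms are logically equal to the minterm x1 x2' x0'")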

Property 24.3.2 A PI will be called an Only Tail Factor (OTF) term if it does not contain
any head factor, that is, the term contains only tail factor(s).

Example 24.5 The terms x0′ , x1′ , x2′ , and x0′ (x2 x1 )′ will all be OTFs, because they
have no head factor; they have only tail factors.

The steps of the method are shown in Fig. 24.3 using an easily understandable flow
diagram.

Figure 24.3: Flow Diagram of the Introduced Method.

As seen from Fig. 24.3, some new terms, a data structure (the Boolean Tree, BT), properties
and algorithms are presented in the introduced method. Each of the new terms is described
in brief.

Property 24.3.3 If a new term generated by a consensus operation is logically added to a
PI and modifies a tail factor of that PI which is part of an OTF, the logical

Figure 24.4: BT for the PI AB(C)′(D)′.

addition will not be executed. It has been observed that the consensus operation sometimes
generates unusual terms that replace some other important terms according to the definition
of the consensus operation. But by replacing the old terms with new terms, a tail
factor that is shared with the same tail factor of another PI may be affected.

Algorithm 24.1 is for applying the consensus operation over the set of PIs, and Algorithm
24.2 is used to represent the combine operation. Both of the algorithms are presented in the
next section.
Boolean Tree: The Boolean Tree (BT) is a new data structure to generate GP terms and UTFs
efficiently. GP terms are generated twice for each PI. The BT is efficient and accurate as well.
Fig. 24.4 shows the BT for the PI AB(C)′(D)′. But sometimes the BT also generates some
unusual GP terms and tail factors. If in a PI there is a tail factor which is a part of an OTF,
the useful tail factors generated from that tail factor are unnecessary, as all the tail factors
of each OTF must be generated in the very first level of the TANT circuit. To solve this,
Property 24.3.4 has been introduced.

Property 24.3.4 If any tail factor in a PI is the part of any OTF, then the tail factor will
not be expanded further in the useful tail factor generation procedure.

If Property 24.3.4 is followed in the BT generation, the BT will look like Fig. 24.5.

Figure 24.5: BT for AB′(C)′(D)′ Considering Property 24.3.4, where (C)′ is a part of an
OTF.

Property 24.3.4 is applied on GP term generation, where the algorithm to generate GP terms is


presented in the next section (Algorithms 24.3 and 24.4).
GPs-UTFs Table: In the GPs-UTFs table, GPs are put into the rows and UTFs (generated
from the BT) are put into the columns. A cross (×) has been placed into a cell if the row’s

GP term contains a tail factor that matches the UTF of that column. The GPs-UTFs table
will be shown with an example in the evaluation part. The GPs-UTFs table helps to find the
optimal network in both a computer program and a hand solution.
Finding the Optimal Solution: An efficient algorithm is presented to find the minimal
TANT network for a Boolean expression with the help of the GPs-UTFs table. Algorithm
24.5 will be applied in the evaluation section for a particular benchmark function. Prior to
Algorithm 24.5, Property 24.3.5 is introduced, which will be helpful in the implementation of the
algorithm.
Property 24.3.5 In a minimal TANT network, there will be at least n UTFs if the highest
number of tail factors among all PIs is n.

24.4 ALGORITHMS USED IN DIFFERENT STAGES


In this section, some algorithms are presented that are used in the heuristic method for TANT
minimization.
Algorithm 24.1: Consensus Operation

Algorithm 24.2: Combined Operation



Algorithm 24.3: GP Generate

Algorithm 24.4: Generate_GP_For_A_PI

Algorithm 24.5: Construct_optimal_TANT_Network



24.5 SUMMARY
In this chapter, a heuristic technique is presented to minimize Three-level AND-NOT
Networks with True Inputs (TANT networks). A TANT design for any function can never
be worse than the corresponding Programmable Logic Array (PLA) in terms of the number
of gates. The steps and algorithms are discussed extensively in this chapter. The introduced
method constructs an optimal TANT network for a given single-output function. Reduction
of the number of gates is not the only achievement of this chapter; the presented method
can also reduce the complexity in terms of time.

REFERENCES
[1] E. J. McClusky, “Minimization of Boolean Functions”, Bell System Technical Journal,
vol. 35, no.5, pp. 1417–1444, 1956.
[2] P. Tison, “Generalization of consensus theory and application to the minimization of
Boolean functions”, IEEE Trans. on Electronic Computers, vol. EC-16, pp. 446–456,
1967.
[3] K. S. Koh, “A minimization technique for TANT network”, IEEE Trans. on Computer,
pp. 105–107, 1971.
[4] M. A. Marin, “Synthesis of TANT Networks using Boolean Analyser”, The Comp.
Journal, vol. 12, no. 3, 1969.
[5] M. A. Perkowski and M. C. Jeske, “Multiple-Valued Input TANT network”, ISMVL,
pp. 334–341, 1994.
[6] H. M. H. Babu, M. R. Islam, S. M. A. Chowdhury and A. R. Chowdhury, “Synthesis
of Full Adder Circuit Using Reversible Logic”, Proceedings on 17th International
Conference on VLSI Design, 2004.
[7] H. M. H. Babu, M. R. Islam, S. M. A. Chowdhury and A. R. Chowdhury, “Reversible
Logic Synthesis for Minimization of Full-Adder Circuit”, Proceedings on DSD, pp.
50–54, 2003.
CHAPTER 25

Asymmetric High Radix


Signed Digital Adder Using
Neural Networks

This chapter presents an asymmetric high-radix signed-digit (AHSD) adder that performs
addition on the basis of a neural network (NN) and also shows that the AHSD number system
supports carry-free (CF) addition by using an NN. Besides, the NN implies a simple construction
for high-speed operation. Since the signed-digit number system represents binary numbers
using only one redundant digit for any radix r ≥ 2, the high-speed adder in the processor
can be realized in the signed-digit system without the delay of carry propagation. A novel
NN design has been constructed for a CF adder based on the AHSD4 number system, which is
also presented in this chapter. Moreover, if the radix is specified as r = 2^m , where m is any
positive integer, the binary-to-AHSD conversion can be done in constant time regardless of
the word length. Hence, the AHSD-to-binary conversion dominates the performance of an
AHSD-based arithmetic system. In order to investigate how the AHSD number system based
on the NN design achieves its functions, computer simulations of the key circuits for conversion
from binary to AHSD4-based arithmetic systems are made.

25.0.1 Introduction
Addition is the most important and frequently used arithmetic operation in computer sys-
tems. Generally, a few methods can be used to speed up the addition operation. One is
to use a neural network design to convert the operands from the binary number system to
a redundant number system, e.g., the signed-digit number system or the residue number
system, so that the addition becomes carry-free (CF). This Neural Network (NN) design
implies that fast addition can be done at the expense of conversion between the binary number
system and the redundant number system. In this chapter, the focus is on exploring high-
radix signed-digit (SD) numbers and the design of the asymmetric high-radix signed-digit
(AHSD) number system using an NN.
The idea of AHSD is not new. Instead of presenting a new number representation, the
purpose is to explore the inherent CF property of AHSD by using an NN. The CF addition
in AHSD based on the NN is the basis for the high-speed addition circuits. The conversion of
DOI: 10.1201/9781003269182-29



AHSD to and from binary will be discussed in detail. By choosing r = 2^m , where m is any
positive integer, a binary number can be converted to its canonical AHSD representation in
constant time. One simple algorithm is also presented for grouping the bits of a binary bit pattern
into pairs in AHSD for radix r . Two NN designs are given for converting AHSD numbers to binary: the
first stresses high speed and the other provides hardware reusability. Since the conversion
from AHSD to binary has been considered the bottleneck of AHSD-based arithmetic
computation based on NNs, these NN designs greatly improve the performance of AHSD
systems. For illustration, the example of AHSD4, i.e., the radix-4 AHSD number system,
is discussed in detail.

25.1 BASIC DEFINITIONS


In this section, a few basic definitions are discussed.

25.1.1 Neural Network


A neural network is a powerful data-modeling tool that is able to capture and represent
complex input/output relationships. The motivation for the development of neural network
technology stemmed from the desire to develop an artificial system that could perform
“intelligent” tasks similar to those performed by the human brain. Neural networks resemble
the human brain in the following two ways:
1. A neural network acquires knowledge through learning.
2. A neural network’s knowledge is stored within inter-neuron connection strengths
known as synaptic weights.

Figure 25.1: Neural Network Prototype for AHSD Number System Addition.

In Fig. 25.1, the simple prototype for NN is shown.

25.1.2 Asymmetric Number System


The radix-r asymmetric high-radix signed-digit (AHSD) number system, denoted AHSD(r) ,
is a positional weighted number system with the digit set Sr = {–1, 0, . . . , r − 1}, where
r > 1. The AHSD number system is a minimally redundant system with only one redundant

digit in the digit set. The inherent carry-free property will be explored in AHSD and develop
systematic approaches for conversion of AHSD numbers from binary ones.
An n-digit number X in AHSD(r) is represented as
X = (x n−1 , x n−2 , . . . , x 0 )r ,
where x i ∈ Sr for i = 0, 1, . . . , n – 1, and Sr = {–1, 0, 1, . . . , r – 1} is the digit set of
AHSD(r) . The value of X can be represented as

X = Σ_{i=0}^{n−1} x i · r^i .

Clearly, the range of X is [(1 – r^n )/(r – 1), r^n – 1].

25.1.3 Binary to Asymmetric Number System Conversion


Since the binary number system is the most widely used, the conversion between the AHSD
and binary number systems has to be considered. Although the radix r may be any positive
integer, simple binary-to-AHSD conversion can be achieved if r = 2^m for any positive
integer m. The reason for such simple conversion will be explained later. Let us assume
r = 2^m in what follows, unless otherwise specified. Note that there may be more than one
AHSD(r) number representation for a binary number. For instance, the binary number (0,
1, 1, 0, 0)2 can be converted to two different AHSD4 numbers, i.e., (1, –1, 0)4 and (0,
3, 0)4 . Hence, the binary-to-AHSD(r) conversion, being a one-to-many mapping, may be
performed by several different methods. Here, the aim is to find an efficient and systematic
conversion method that takes advantage of the carry-free property. A general algorithm
is followed below to form groups of bits to convert binary to AHSD(r) .
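The following Python sketch is an added illustration of the positional value formula and of the one-to-many nature of the conversion: both AHSD4 representations mentioned above evaluate to the same value as the binary number (0, 1, 1, 0, 0)2.

def ahsd_value(digits, r=4):
    """Value of an AHSD(r) digit vector given MSB-first, digits in {-1, 0, ..., r-1}."""
    value = 0
    for d in digits:
        assert -1 <= d <= r - 1          # AHSD digit set S_r
        value = value * r + d
    return value

print(ahsd_value([1, -1, 0]))            # 12
print(ahsd_value([0, 3, 0]))             # 12
print(int("01100", 2))                   # 12, the binary number (0,1,1,0,0)2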

Algorithm 25.1 Algorithm for the Conversion of AHSD from Binary Number System
1: Suppose the given binary number has n bits
2: if radix = 2^m , where m is any positive integer, then
3: find p such that 2^p < n ≤ 2^(p+1) , where p = 1, 2, 3, . . .
4: pad (2^(p+1) – n) zeros in front of the binary bit pattern
5: divide the padded array into sub-arrays of m bits
6: if each sub-array is of length m, stop
7: else
8: divide each sub-array further by m
9: end if

Proof 25.1 The recurrence relation for the conversion from the binary to the AHSD number system
is:

T(n) = c,            if n = m  (where c is a constant and m > 2)
T(n) = m · T(n/m),   if n > m

Expanding the recurrence:
T(n) = m · T(n/m)
     = m^2 · T(n/m^2)
     ...
     = m^k · T(n/m^k),  where k = 1, 2, 3, ...
Assume n = m^k. Then
T(n) = m^k · T(m^k/m^k) = m^k · T(1),
and log_m n = log_m m^k = k.
Complexity of Algorithm 25.1 to make pairs to convert binary to AHSD(r): O(log_m n).
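A minimal Python sketch of the pairing idea for r = 4 (an illustrative addition, not the book's code): the bit pattern is padded to an even length and split into 2-bit groups, and each group's value, which already lies in the AHSD4 digit set, becomes one radix-4 digit.

def binary_to_ahsd4(bits: str):
    """Group a binary bit string (MSB first) into radix-4 digits, MSB digit first."""
    if len(bits) % 2:                       # pad to a multiple of m = 2 bits
        bits = "0" + bits
    digits = [int(bits[i:i + 2], 2) for i in range(0, len(bits), 2)]
    return digits                           # every digit is in {0, 1, 2, 3}, a subset of S4

print(binary_to_ahsd4("01100"))             # [0, 3, 0] for the example in the text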

25.1.4 Addition of AHSD4 Number System


Here the addition process is shown for the AHSD number system. This addition process
applies to adder designs from 1 bit to n bits without considering the carry propagation and its
delay.

Example 25.1 Here two 4-bit AHSD4 numbers are added.

The final result is in binary format. The given example illustrates the addition process
without carry propagation.

25.2 THE DESIGN OF ADDER USING NEURAL NETWORK


The neurons work on the basis of a feed-forward network with parallel processing. This
technique can be viewed as performing the addition in the adder in parallel. The concept of parallel
addition using a neural network can be shown as the block diagram in Fig. 25.2.
Fig. 25.2 shows the atomic view of the adder using the Neural Network. If it is
generalized further, then the figure will be just like Fig. 25.3. The N-bit adder generalization is
shown in Fig. 25.3.

Figure 25.2: N -Bit Adder Generalization.

Property 25.2.1 The total number of neurons for generating the interim sum and carry of a
radix-n asymmetric q-bit adder design is q × 2(n − 1).

Proof 25.2 As n is the radix, each digit can hold a value of up to n − 1. So the interim sum
will be in the range of 0 to 2(n − 1). The total number of neurons for a 1-bit adder design will
be 2(n − 1). For the design of a q-bit adder, the total number of neurons will be q × 2(n − 1).

Here an algorithm is introduced for an n-bit adder from binary to AHSD4 using an NN.

Algorithm 25.2 n-Bit Adder from Binary to AHSD4 using NN


1: Create 4^n input vectors (2n elements each) that represent all possible input combina-
tions. For example, for 1-bit addition, there will be 4 input vectors (2 elements each):
{0, 0}, {0, 1}, {1, 0}, {1, 1}.
2: Create 4^n output vectors (ceiling of [n/2 + 1] elements each) that represent the corresponding
target combinations. For example, for 1-bit addition, there will be 4 output vectors (2
elements each): {0, 0}, {0, 1}, {0, 1}, {0, 2}.
3: Create a feed-forward back propagation neural network.
4: Train the neural network for the input and output vectors of Step 1 and Step 2.
5: Simulate the neural network with the input vectors of Step 1, getting the sum in AHSD4
as the target.
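The sketch below is an added illustration of Steps 1 and 2 under stated assumptions about the data layout: every pair of n-bit operands forms one input vector of 2n bits, and the target is the true sum expressed as ⌈n/2⌉ + 1 radix-4 digits (all of which lie in the AHSD4 digit set). The resulting vectors could then be passed to any feed-forward back-propagation trainer, as in Steps 3–5.

from itertools import product
from math import ceil

def training_vectors(n: int):
    """Input/target pairs for the n-bit binary-to-AHSD4 adder of Algorithm 25.2."""
    width = ceil(n / 2) + 1                              # number of radix-4 target digits
    samples = []
    for a, b in product(range(2 ** n), repeat=2):        # 4^n operand combinations
        in_bits = [(a >> i) & 1 for i in reversed(range(n))] + \
                  [(b >> i) & 1 for i in reversed(range(n))]
        s = a + b
        target = [(s >> (2 * k)) & 3 for k in reversed(range(width))]  # radix-4 digits
        samples.append((in_bits, target))
    return samples

# The 1-bit case reproduces the vectors listed in Steps 1 and 2:
for in_bits, target in training_vectors(1):
    print(in_bits, target)    # e.g. [1, 1] -> [0, 2]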

25.3 AHSD ADDITION FOR RADIX-5


The asymmetric high-radix number system with radix 5 is not well suited for
addition. The conversion from binary to AHSD would first require grouping the bits into 3-bit binary
groups, which convey values from 0 to 7 in decimal, whereas radix 5 only needs the values 0 to 4.
So the values 5, 6 and 7 would remain undetermined. Hence the addition process is not
possible either. Thus radix-4 AHSD is best suited for addition that is carry-free and fast.

25.4 SUMMARY
In this chapter, one CF (carry-free) digital adder for the asymmetric high-radix signed-
digit (AHSD4) number system based on Neural Networks (NNs) has been presented. Ad-
ditionally, if r = 2^m for any positive integer m, the interface between AHSD and the binary
number system can be realized and easily implemented. To make pairs at the
time of conversion from binary to AHSD, an algorithm has also been introduced. Since
both the binary-to-AHSD converter and the AHSD-to-binary converter CF adder operate in constant
402  VLSI Circuits and Embedded Systems

time, it can conclude that the AHSD-to-binary converter dominates the performance of the
entire AHSD based on NN design system. The time complexity of the entire AHSD CF
adder is O (log m n).

CHAPTER 26

Wrapper/TAM Co-Optimization and Constrained Test Scheduling for SOCs Using Rectangle Bin Packing

This chapter describes an integrated framework for SOC test automation. The framework
is based on a new approach to Wrapper/TAM co-optimization using rectangle packing that
considers the diagonal length of the rectangles, so as to emphasize both the TAM width
required by a core and its corresponding testing time. In this chapter, an efficient algorithm
is introduced to construct wrappers that reduce the testing time for cores. Rectangle packing
has been used to develop an integrated scheduling algorithm that incorporates power
constraints into the test schedule. Test power consumption is important to consider, since
exceeding the system's power limit might damage the system.

26.1 INTRODUCTION
The development of microelectronic technology has led to the implementation of the system-on-
chip (SOC), where a complete system, consisting of several application-specific integrated
circuits (ASICs), microprocessors, memories and other intellectual property (IP) blocks,
is implemented on a single chip. The increasing complexity of SOCs has created many
testing problems. The general problem of SOC test integration includes the design of TAM
architectures, optimization of the core wrappers, and test scheduling. Test wrappers form
the interface between cores and test access mechanisms (TAMs), while TAMs transport
test data between SOC pins and test wrappers. The problem of designing test wrappers and
TAMs to minimize the SOC testing time is addressed. While optimized wrappers reduce test
application times for the individual cores, optimized TAMs lead to more efficient test data
transport on-chip. Since wrappers influence TAM design, and vice versa, a co-optimization
strategy is needed to jointly optimize the wrappers and the TAM for an SOC.
In this chapter, an approach to integrated wrapper/TAM co-optimization and test scheduling
is presented, based on a generalized version of rectangle packing that considers the diagonal
lengths of the rectangles to be packed. The main advantage of the approach is that it minimizes
the test application time while respecting the test power limitation.

26.2 THE WRAPPER DESIGN


The purpose of the wrapper design algorithm (Algorithm 26.1) is to construct a set of
wrapper chains at each core. A wrapper chain includes a set of the scanned elements
(scan-chains, wrapper input cells and wrapper output cells). The test time at a core is given
by:
Tcore = p × [1 + max{si, so}] + min{si, so}
where p is the number of test vectors to apply to the core and si (so) denotes the
number of scan cycles required to load (unload) a test vector (test response). So, to reduce
test time, the longest wrapper chain (internal or external or both) should be minimized, i.e.
max{si, so}. Recent research on wrapper design has stressed the need for balanced wrapper
scan chains to minimize the longest wrapper chain. Balanced wrapper scan chains are those
that are as equal in length to each other as possible.
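As a quick illustration of the test time expression above (the numbers are invented for illustration and do not come from Table 26.1):

def core_test_time(p, si, so):
    # Tcore = p * (1 + max(si, so)) + min(si, so)
    return p * (1 + max(si, so)) + min(si, so)

# Hypothetical core: 100 test vectors, scan-in length 64, scan-out length 60.
print(core_test_time(100, 64, 60))   # 100 * 65 + 60 = 6560 cycles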
The Wrapper_Design algorithm tries to minimize core testing time as well as the TAM
width required for the test wrapper. The objectives are achieved by balancing the lengths
of the wrapper scan chains and imposing an upper bound on the total number of scanned
elements. The heuristic can be divided in two main parts; the first one for combinational
cores and the second one for sequential cores. For combinational cores, there are two
possibilities. If I + O (where I is the number of functional inputs and O the number of
functional outputs) is below or equal to the TAM bandwidth limit, W max , then nothing is
done and the number of connections to the TAM is I + O . If I + O is above W max , then
some of the cells on the I/Os are chained together in order to reduce the number of needed
connections to the TAM.
For sequential cores, at first an upper bound is specified (Upper_Bound). The internal
scan chains are then sorted in descending order. After that, each internal scan chain is
successively assigned to the wrapper scan chain, whose length after this assignment is
closest to, but not exceeding, the upper bound. In the algorithm, a new wrapper scan
chain is created only when it is not possible to fit an internal scan chain into an existing
wrapper scan chain without exceeding the upper bound. Finally, functional inputs and
outputs are added to balance the wrapper scan chains. Results of wrapper design algorithm
are given in Table 26.1.

Algorithm 26.1 Procedure Wrapper_Design (int W max , Core C)


1: { // Wmax = TAM width; #SC = total number of scan chains in core C;
   // Total_Scan_Element = total I/O + Σ C.Scan_Chain_Length[i], 1 ≤ i ≤ #SC
2: if C.#SC = 0 then   // combinational core
3:   if Total_Scan_Element <= Wmax then
4:     Assign one bit on every I/O wrapper cell;
5:   else
6:     Design Wmax wrapper scan chains;
7:   end if
8: else   // sequential core
9:   Mid_Lines = Wmax / 2;
10:  Upper_Bound = Total_Scan_Element / Mid_Lines;
11:  Sort the internal scan chains in descending order of their length;
12:  for each scan chain SC do
13:    for each wrapper scan chain W already created do
14:      if Length(W) + Length(SC) <= Upper_Bound then
15:        Assign the scan chain to this wrapper scan chain W;
16:      else
17:        Create a new wrapper scan chain Wnew;
18:        Assign the scan chain to this wrapper scan chain Wnew;
19:        Add functional I/O to balance the wrapper chains;
20:      end if
21:    end for
22:  end for
23: end if
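To make the sequential-core branch of this heuristic concrete, the following Python sketch implements the assignment rule described above: sort the internal scan chains in descending order, then place each one into the existing wrapper chain whose length after the assignment is closest to, but does not exceed, the upper bound. The function and variable names are invented here, and the final balancing with functional I/O cells is omitted.

def build_wrapper_chains(scan_chain_lengths, total_io, w_max):
    # Upper bound on a wrapper scan chain, as in Algorithm 26.1.
    total_scan_elements = total_io + sum(scan_chain_lengths)
    upper_bound = total_scan_elements / (w_max / 2)

    wrapper_chains = []                        # current length of each wrapper chain
    for sc in sorted(scan_chain_lengths, reverse=True):
        # Choose the existing chain whose length after assignment is closest to,
        # but not exceeding, the upper bound.
        best = None
        for idx, length in enumerate(wrapper_chains):
            if length + sc <= upper_bound and (best is None or length > wrapper_chains[best]):
                best = idx
        if best is None:
            wrapper_chains.append(sc)          # open a new wrapper scan chain
        else:
            wrapper_chains[best] += sc
    return wrapper_chains

# Hypothetical core: six internal scan chains, 20 functional I/Os, Wmax = 8.
print(build_wrapper_chains([32, 28, 24, 16, 12, 8], 20, 8))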

Table 26.1: Result of Wrapper_Design for core 6 of p93791



26.3 TAM DESIGN AND TEST SCHEDULING


The general integrated wrapper/TAM co-optimization and test scheduling problem that is
addressed in this chapter is as follows. The total SOC TAM width and the test set parameters
are given for each core. The set of parameters for each core includes the number of primary
I/Os, test patterns, scan chains and scan chain lengths. The goal is to determine the TAM
width and a wrapper design for each core, and a test schedule that minimizes the testing
time for the SOC such that the following constraints are satisfied:
1. The total number of TAM wires utilized at any moment does not exceed Wmax ;
2. The maximum power dissipation value is not exceeded.
This problem is formulated as a progression of two problems of increasing complexity.
These two problems are as follows:
Problem 1: wrapper/TAM co-optimization and test scheduling
Problem 2: wrapper/TAM co-optimization and test scheduling with power constraints.
In this section, Problem 1 is addressed, and it is shown how wrapper/TAM co-optimization
can be integrated with test scheduling. The next section shows how this problem is
generalized to include power constraints (Problem 2).
Problem 1: Given a set of parameters for each core, determine the TAM width to be assigned
and a wrapper design for each core, and schedule the tests for the SOC in such a way that
the total testing time and TAM width utilization are minimized, while the total number of
TAM wires utilized at any moment does not exceed the total TAM width.
Consider a SOC having N cores and let Ri be the set of rectangles for core i, 1 ≤ i
≤ N. The generalized version of the rectangle packing problem, Problem-RP 1, is as follows: select
a rectangle R from Ri for each set Ri, 1 ≤ i ≤ N, and pack the selected rectangles in a
bin of fixed height and unbounded width such that no two rectangles overlap and the width
to which the bin is filled is minimized. No selected rectangle is allowed to be split
vertically in the rectangle packing. Problem-RP 1 can be shown to be NP-hard. A special
case of Problem-RP 1 arises when the cardinality of each set Ri, 1 ≤ i ≤ N, equals one and no
rectangles are allowed to be split.
Problem 1 is solved using the generalized version of rectangle packing (two-dimensional
packing), Problem-RP 1. The Wrapper_Design algorithm is used to obtain the different test
times for each core for varying values of TAM width. A set of rectangles for a core can
now be constructed, such that the height of each rectangle corresponds to a different TAM
width and the width of the rectangle represents the core test application time for this value
of TAM width. Problem-RP 1 relates to problem 1 as follows; see Fig. 26.1. The height of
the rectangle selected for a core corresponds to the TAM width assigned to the core, while
the rectangle width corresponds to its testing time.

Figure 26.1: Example Test Schedule using Rectangle Packing.

The height of the bin corresponds to the total SOC TAM width, and the width to which
the bin is ultimately filled corresponds to the system testing time that is to be minimized. The
unfilled area of the bin corresponds to the idle time on TAM wires during test. Furthermore,
the distance between the left edge of each rectangle and the left edge of the bin corresponds
to the begin time of each core test. The approach emphasizes on both testing time of a
core and the TAM width required achieving that testing time by considering the diagonal
length of rectangles.
√ Diagonal length emphasizes on both testing time and TAM width
since DL = W 2 + H 2 where W, H, DL denotes width, height and diagonal length of the
rectangles respectively. Consider three rectangles R[1] = {H = 32, W = 7.1, DL = 32.78},
R[2] = {H = 16, W = 13.8, DL = 21.13}, R[3] = {H = 32, W = 5.4, DL = 32.45}. Here if
testing time(W ) is taken into account, then it should pack R[2] first, followed by R[1] and
R[3]. But when diagonal lengths are considered, it packs R[1], R[3], R[2] in sequence, and
get the result that is extremely efficient.
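The ordering quoted above can be reproduced with a few lines of Python, using the rectangle values given in the text:

from math import hypot

# (name, height = TAM width, width = testing time)
rects = [("R1", 32, 7.1), ("R2", 16, 13.8), ("R3", 32, 5.4)]

# Diagonal length DL = sqrt(W^2 + H^2); pack in descending order of DL.
for name, h, w in sorted(rects, key=lambda r: hypot(r[1], r[2]), reverse=True):
    print(name, round(hypot(h, w), 2))
# Prints R1 32.78, R3 32.45, R2 21.13 -- the packing order described above.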
The approach also minimizes TAM width utilization by assigning TAMu wires to a core
to achieve a specific testing time. For example, in Wrapper_Design, all TAM widths
from 50 up to 64 result in the same testing time of 114317 cycles and the same TAM width
utilization (TAMu) of 47 for core 6 of p93791 (Table 26.1). So, to achieve a testing time of
114317 cycles, the TAMu value of 47 is used in the introduced approach.

26.4 POWER CONSTRAINED TEST SCHEDULING


This section describes Problem 2 (Integrated TAM design and power constrained test
scheduling) in details and then formulates problem Problem-RP 2, a generalized version of
Problem-RP 1 that is equivalent to Problem 2.
Problem 2: Solve Problem 1, such that the maximum power dissipation value Pmax
is not exceeded. Power constraints must be incorporated in the schedule to ensure that the
power rating of the SOC is not exceeded during test.
Problem 2 can be expressed in terms of rectangle packing as follows: consider a SOC
having N cores, and:
1. Let Ri be the set of rectangles for core i , 1 ≤ i ≤ N
2. Let tests for core i have a power dissipation of Pi .
Problem-RP 2 solves Problem-RP 1 while ensuring that, at any moment of time, the sum of the
Pi values for the selected rectangles does not exceed the maximum specified value Pmax.

Algorithm 26.2 Algorithm Test_Scheduling (W max , Core C[1...NC])


1: { For each core C[i], construct a set of rectangles taking TAMu as the rectangle height and
   its corresponding testing time as the rectangle width, such that TAMu ≤ Wmax.
2: Find the smallest value (Tmin) among the testing times corresponding to MAX_TAMu of
   all cores.
3: For each core C[i], divide the width T[i] of all rectangles constructed in line 1 by Tmin.
4: For each core C[i], calculate the diagonal length DL[i] = √((W[i])² + (T[i])²), where
   W[i] denotes MAX_TAMu and T[i] denotes the corresponding reduced testing time.
5: Sort the cores in descending order of the diagonal length calculated in line 4 and keep
   them in the list INITIAL[NC].
6: Next_Schedule_Time = This_Time = 0;
   Wavail = Wmax; // TAM available; Idle_Flag = False;
   // peak_tam[c] is equal to MAX_TAMu of core c; // PENDING is a queue.
7: While (INITIAL and PENDING not empty)
   {
8:   If (Wavail > 0 and Idle_Flag = False)
     {
9:     If (INITIAL is not empty)
       {
         c = delete(INITIAL);
         If (Wavail >= peak_tam[c] && no_powerConflict)
           Update(c, peak_tam[c]);
         Else If (Possible_TAM >= 0.5 * peak_tam[c] && no_powerConflict)
           Update(c, Possible_TAM);
         Else
           add(PENDING, c);
         If (peak_tam[PENDING[front]] ≤ Wavail && no_powerConflict)
         {
           Update(PENDING[front], peak_tam[PENDING[front]]);
           delete(PENDING);
         }
       }
10:    Else // if INITIAL is empty
       {
         If (peak_tam[PENDING[front]] ≤ Wavail && no_powerConflict)
         {
           Update(PENDING[front], peak_tam[PENDING[front]]);
           delete(PENDING);
         }
         Else
           Idle_Flag = True;
       }
     }
11:  Else // Wavail = 0 or idle
     {
       Calculate Next_Schedule_Time = Finish[i], such that Finish[i] > This_Time and
       Finish[i] is minimum;
       Set This_Time = Next_Schedule_Time;
12:    For every core i such that Finish[i] = This_Time
         Wavail = Wavail + Width[i];
13:      Set Complete[i] = TRUE;
       Idle_Flag = False;
     }
   } // end of while
   return test_schedule;
}

26.4.1 Data Structure


The data structure in which the TAM width and testing time values for the cores of the SOC
are stored is presented in Algorithm 26.3. This data structure is updated with the begin
times, end times, and assigned TAM widths for each core as the test schedule is developed.

Algorithm 26.3 Data structure test_schedule


1: width[i] //TAM width assigned to core i
2: finish[i] //end time of core i
3: scheduled[i] //boolean indicates core i is scheduled
4: start[i] //begin time of core i
5: complete[i] //boolean indicates test for core i has finished
6: peak_tam[i] //equals to MAX_TAMu of core i
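If the scheduler were prototyped in software, this record could be expressed directly. The Python sketch below mirrors the fields of Algorithm 26.3; grouping them into one object per core is an assumption of this illustration.

from dataclasses import dataclass

@dataclass
class CoreScheduleEntry:
    width: int = 0           # TAM width assigned to the core
    finish: int = 0          # end time of the core's test
    scheduled: bool = False  # core has been scheduled
    start: int = 0           # begin time of the core's test
    complete: bool = False   # test for the core has finished
    peak_tam: int = 0        # equals MAX_TAMu of the core

# One entry per core of the SOC.
test_schedule = [CoreScheduleEntry(peak_tam=24), CoreScheduleEntry(peak_tam=16)]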

26.4.2 Rectangle Construction


In the test scheduling algorithm (Algorithm 26.2), after getting the result of Wrap-
per_Design, a set of rectangles is constructed for each core by taking TAMu as the rectangle
height and its corresponding testing time as rectangle width such that TAMu ≤ Wmax (Fig.
26.2). MAX_TAMu is the largest among the TAMu values satisfying the above constraint.

Figure 26.2: Example of Some Rectangles for core 6 of SOC p93791 when W max = 32.

In Fig. 26.2, MAX_TAMu = 24 and Wmax = 32. For a combinational core, MAX_TAMu
is always equal to Wmax. Note that, in the case of TAM wire assignment for this particular
scheduling of p93791 (Fig. 26.2), the TAM wires to be assigned to core 6 must be
selected from the values 24, 16, 12, 10, 8, and 7, depending on the TAM width available.

Figure 26.3: Test Scheduling for d695 using The Algorithm (T min = 1109 and TAM width
= 24) without Power Constraints.

26.4.3 Diagonal Length Calculation


The smallest testing time (Tmin) corresponding to MAX_TAMu over all cores is found.
For each core, the widths (testing times) of all constructed rectangles are divided by Tmin.
Then, for each core, the diagonal length of the rectangle is calculated, where the rectangle
height W[i] = MAX_TAMu and the rectangle width T[i] is the reduced testing time corresponding
to MAX_TAMu. The cores are sorted in descending order of diagonal length.

26.4.4 TAM Assignment


While executing the main while loop, if there are Wavail TAM wires available for assignment
and the list INITIAL is not empty, a core c is selected from the list in sorted order. If the TAM
available at that moment, Wavail, is greater than or equal to peak_tam[c] and there is no
power conflict, the tests of that core are scheduled and c is assigned a number of TAM wires
equal to peak_tam[c]. Note that peak_tam[c] is equal to MAX_TAMu of core c. If Wavail is less
than peak_tam[c] and the power constraint is satisfied, the algorithm tries to find a TAMu value
such that TAMu ≤ Wavail and TAMu is greater than half of peak_tam[c]. If it fails to assign TAM
wires to c satisfying these conditions, it adds core c to the queue PENDING. It then deletes
a core p from the queue PENDING for scheduling only if Wavail is greater than or equal to
peak_tam[p] and there is no power conflict.
If the list INITIAL is empty, the algorithm deletes the core c at the front of the queue PENDING
only if Wavail ≥ peak_tam[c] and the power constraint is satisfied. Otherwise, it waits until
sufficient TAM wires become available and the power constraint is satisfied. If Wavail > 0
and INITIAL is empty, these Wavail wires are declared idle and Idle_Flag is set if Wavail
cannot satisfy the power constraint as well as the condition Wavail ≥ peak_tam[c], where c is
the core at the front of the queue PENDING.
If there are Wavail idle wires or Wavail = 0, execution proceeds to the stage where This_Time
is updated to Next_Schedule_Time and Wavail is updated. Wavail is increased by the widths
of all cores ending at the new value of This_Time, and complete[i] is set to true for
all cores whose tests have completed at This_Time.

26.5 SUMMARY
In this chapter, a Wrapper/TAM co-optimization and test scheduling technique is presented
that takes test power consumption into account when minimizing the test application time.
It is important to consider test power consumption, since exceeding the power limit might damage the
system. The technique is based on rectangle packing, which emphasizes both testing time and
TAM (Test Access Mechanism) width by considering diagonal lengths. An integrated
framework for SOC (System-On-Chip) test automation is described in this chapter.

CHAPTER 27

Static Random Access Memory Using Memristor

Both volatile and nonvolatile memories are used in the computer memory system. Volatile
memories such as Static RAM (SRAM) and Dynamic RAM (DRAM) are utilized as primary
memory, while nonvolatile memory such as flash memory is used for storage. However,
new nonvolatile technologies have recently been developed that promise rapid changes in the
memory system landscape. A memristor is a passive two-terminal device whose resistance
depends on the magnitude and polarity of the voltage applied to it. Similar
to memory devices, it exhibits a nonlinear relationship between voltage and current. This
chapter describes a method for designing a nonvolatile 6-T static random access memory
(SRAM) using memristors. On an Apollo design station, an SRAM was created using a
2-micron minimum-geometry nMOS fabrication method. Test structures were incorporated
in addition to the SRAM integrated circuit to help characterize the process and
design. The memristor-based resistive random access memory (MRRAM), which works
similarly to an SRAM cell, is addressed in this chapter.

27.1 INTRODUCTION
In 1971, Leon Chua proposed the memristor, a fourth non-linear passive two-terminal
electrical component. It establishes a connection between electric charge and magnetic flux
over a given time interval. Hewlett Packard (HP) Labs researchers reported in 2008 that
the memristor had been physically realized using a nanoscale thin-film titanium dioxide
device. Essentially, a memristor is a memory-resistance device. When a voltage is supplied
to this element, the resistance changes, but when the voltage is withdrawn, the resistance
remains constant. The nonlinear input-output properties of the memristor (M) distinguish
it from the three classical passive elements (R, L, and C). Memristors are also utilized as
programmable resistive loads in differential amplifiers. Memristors are a good option for
future memory because of their non-volatile nature and high packing density in a crossbar
array. The circuit's major features are its non-volatility and smaller size when compared
to a traditional 6T-SRAM. Even if the power is switched off for an extended period of time,
the data is
preserved in the memory. It may be significantly smaller than traditional SRAM cells,
because each memory cell has three transistors and two memristors only.
The resistive RAM can switch between two or more resistance states under the application
of suitable voltages. It exhibits memristive activity and can be considered a kind of
memristor. Devices can have two or more distinct resistance states, or their resistance may
vary continuously. In either case, what matters is that the change in resistance is governed
by the device's previous history, that is, the previous voltage applied to or
the previous current flowing through the device. Resistive RAM (RRAM) devices may be
able to alleviate some of the existing constraints in microelectronics.
The design complexity of any large integrated circuit needs to be reduced by eliminating
unnecessary component parts. A hierarchical approach can be used in which circuits are
built from the bottom up. Cells are made to represent the commonly used parts and are combined
to form the final circuit. As shown in Fig. 27.1, the first phase of any hierarchical design is
the creation of the basic cells. These schematics are then entered into the computer through
the NETED software package. The NETED software is a schematic capture program which
converts circuit diagrams into node lists. These node lists, in conjunction with a transistor
Models file, define the circuit, interconnections, and the device characteristics of the nMOS
transistors.

Figure 27.1: Flowchart of a Hierarchical Design.



27.2 MEMRISTOR CHARACTERIZATION


The memristor was defined in terms of a non-linear functional relationship between the
magnetic flux Φm (t) and the amount of electric charge that has flowed q(t), as

f (Φm (t), q(t)) = 0 (27.1)

The variable Φm ("magnetic flux”) is derived from an inductor’s circuit characteristic.


It is not a magnetic field in this case. Its physical significance is explained further down.
The integral of voltage over time is represented by the symbol Φm. Because the derivative
of one with respect to the other depends on the value of one or the other in the relationship
between Φ and q, each memristor is characterized by its memristance function, which describes
the charge-dependent rate of change of flux with charge:

M(q) = dΦm / dq    (27.2)

Substituting the flux as the time integral of the voltage, and the charge as the time integral
of the current, the more convenient form is

M(q(t)) = (dΦm/dt) / (dq/dt) = V(t) / I(t)    (27.3)
To relate the memristor to the resistor, capacitor, and inductor, it is helpful to isolate the
term M(q), which characterizes the device, and write it as a differential equation. Fig. 27.2
covers all meaningful relationships of I , Q, Φm , and V . No device can relate dI to dq, or
dΦm to dV , because I is the derivative of Q and Φm is the integral of V . It can be inferred
from this that memristance is a charge-dependent resistance. If M(q(t)) is a constant, then
Ohm's Law, R(t) = V(t)/I(t), is obtained. If M(q(t)) is nontrivial, however, the equation is not
equivalent because q(t) and M(q(t)) can vary with time. Solving for voltage as a function
of time produces the following equation:

V(t) = M(q(t))I(t) (27.4)

As long as M does not change with charge, Equation 27.4 indicates that memristance
defines a linear connection between current and voltage. A charge that varies over time is
implied by a nonzero current. However, alternating current can define the linear dependency
in circuit functioning by generating a quantifiable voltage without causing net charge
movement as long as the greatest change in q does not produce a significant change in M .
Furthermore, if no current is supplied, the memristor remains static: if I(t) = 0, then
V(t) = 0 and M(t) is constant. This is the essence of the memory effect. The power
consumption characteristic resembles that of a resistor, I²R:

P(t) = I(t)V(t) = I²(t)M(q(t))    (27.5)

As long as M(q(t)) varies little, such as under alternating current, the memristor will
appear as a constant resistor. If M(q(t)) increases rapidly, however, current and power
consumption will quickly stop. M(q) is physically restricted to be positive for all values
of q (assuming the device is passive and does not become superconductive at some q). A
negative value would mean that it would perpetually supply energy when operated with
alternating current. For RON << ROFF, the memristance function can be determined as
follows:

M(q(t)) = ROFF · (1 − (µv RON / D²) q(t))    (27.6)

where ROFF represents the high-resistance state, RON represents the low-resistance
state, µv represents the mobility of dopants in the thin film, and D represents the thickness
of the film.

Figure 27.2: Relationship among Resistor, Capacitor, Inductor, and Memristor.

27.3 MEMRISTOR AS A SWITCH


The applied current or voltage produces a significant change in resistance in some memris-
tors. By examining the amount of time and energy required to produce a desired change in
resistance, such devices can be classified as switches. This is based on the assumption that
the applied voltage is constant. When the energy dissipation during a single switching event
is calculated, it is discovered that for a memristor to transition from one state Ron to another
Roff in time Ton to Toff, the charge must change by ∆Q = Qon − Qoff.

Eswitch = V² ∫[Toff → Ton] dt / M(q(t))
        = V² ∫[Qoff → Qon] dq / (I(q) M(q))
        = V² ∫[Qoff → Qon] dq / V(q)
        = V ∆Q

Substituting V = I(q)M(q), and then ∫ dq/V = ∆Q/V for constant V, produces
the final expression. This power characteristic is fundamentally different from that of a
capacitor-based metal oxide semiconductor transistor. Unlike a transistor, the memristor’s
ultimate charge state is independent of the bias voltage.
Hysteresis, also known as the "hard-switching regime," occurs when a kind of memristor
is switched across its full resistance range. A cyclic M( q) switch, on the other hand, would
have each off-on event followed by an on-off event under continuous bias. Under any
situation, such a device would operate as a memristor, although it would be less practical.

27.4 WORKING PRINCIPLE OF MEMRISTOR


An analogous time-dependent resistor, whose value at time t is directly proportional to the
amount of charge q that has traveled through it, may be used to describe the memristor.
A 50nm thin layer of titanium dioxide is sandwiched between two 5nm thick electrodes,
one titanium and the other platinum, in the HP device memristor. The titanium dioxide
film contains two layers at first, one of which has a small depletion of oxygen atoms.
Because the oxygen vacancies function as charge carriers, the resistance of the depleted
layer is significantly lower than that of the non-depleted layer. When an electric field is
applied, the oxygen vacancies drift, shifting the boundary between the high-resistance
and low-resistance layers. As a result, the overall resistance of the film is determined by the
charge that has passed through it in a given direction, and this change may be reversed by
changing the current direction.
The HP device is classified as a nanoionic device since it exhibits rapid ion conduction
at the nanoscale. The cell functions like a memory element since the resistance change is
non-volatile. Fig. 27.3 depicts a memristor’s doped and undoped regions.

Figure 27.3: Characterizing the Memristor.



If a voltage is applied across the memristor, the following results are obtained:

v(t) = M(t)i(t) (27.7)


 
M(t) = RON (w(t)/D) + ROFF (1 − w(t)/D)

where RON is the resistance of the completely doped memristor, ROFF is the resistance of
the completely undoped memristor, and w(t) is given by

dw(t)/dt = µv (RON/D) i(t)    (27.8)

Here µv is the average dopant mobility and D is the length of the memristor. From these
equations, the considered nonlinearity produced at the edges of the thin film can be
obtained as

f(w(t)/D) = 1 − (2 w(t)/D − 1)^(2p)    (27.9)
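A small numerical sketch of this linear ion-drift model is shown below. It integrates Equation 27.8 with a forward-Euler step under a constant current drive and evaluates the memristance of the doped and undoped regions in series; the parameter values and the clipping of w(t) to the interval [0, D] are illustrative assumptions, not values taken from this chapter.

# Illustrative (assumed) parameters for the linear ion-drift memristor model.
R_ON, R_OFF = 100.0, 16e3      # low/high resistance states (ohms)
D = 10e-9                      # film thickness (m)
MU_V = 1e-14                   # average dopant mobility (m^2 s^-1 V^-1)

w = 0.5 * D                    # initial position of the doped/undoped boundary
memristance = lambda w: R_ON * (w / D) + R_OFF * (1 - w / D)
print(round(memristance(w)))   # initial memristance, about 8050 ohms

# Apply a constant 1 mA current for 20 ms (forward-Euler integration of Eq. 27.8).
i, dt = 1e-3, 1e-5
for _ in range(2000):
    w += MU_V * (R_ON / D) * i * dt
    w = min(max(w, 0.0), D)    # keep the boundary inside the film

print(round(memristance(w)))   # memristance after the pulse, about 4870 ohms

Driving the same current in the opposite direction moves the boundary back and restores the higher memristance, which is the switching behavior exploited in the memory cell of the next section.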

Fig. 27.4 shows the change in resistance of a memristor when a 3.6 V p-p square wave
is applied across it. In a positive cycle, the resistance of the memristor changes from 20 KΩ
to 100 KΩ, and this change occurs in the opposite way when the square pulse reverses its
orientation.

Figure 27.4: Change of Resistance for a 3.6V p-p Square Wave.

27.5 MEMRISTOR-BASED SRAM


Figure 27.5: Three Transistor and Two Memristor SRAM Cells.

Fig. 27.5 depicts the SRAM cell's electrical architecture. Two memristors are used as the
memory element. The configuration is such that they are connected in parallel but with opposite
polarity during the write cycle, as presented in Fig. 27.6, and in series during the read cycle
as shown in Fig. 27.7. These connections are realized by two NMOS pass transistors
T1 and T2. A third transistor, T3, is used to isolate a cell from the other cells of the memory
array during read and write operations. The gate input of T3 is the Comb signal, which is
the logical OR of the RD (Read) and WR (Write) signals. For a write operation, RD is set
to the LOW state and WR and Comb are set to the HIGH state. As a result, the circuit of Fig. 27.6 is
formed.
In this case, the voltage across the memristors is (V D – V DD /4). It can be either positive
(V D = V DD ) or negative (V D = 0V) depending on the data. Because the memristors’
polarities are opposing, memristances (or resistances) will alter in the opposite direction.
The circuit illustrated in Fig. 27.7 is formed by keeping RD and Comb in the HIGH state.
At D, the voltage is now:
 
VD = (VDD/2 − VDD/4) × R2/(R1 + R2) + VDD/4    (27.10)

where R1 and R2 are the resistances of M1 and M2, respectively. If a "1" was written
during the write cycle, R2 becomes significantly greater than R1, and then VD is greater than
VDD/4. If a "0" was written, R1 becomes significantly greater than R2, which makes VD
close to VDD/4. A comparator can be used as a sense amplifier to interpret these
voltages correctly as HIGH or LOW.

Figure 27.6: Circuit when RD = 0, WR = 1, and Comb = 1.

Figure 27.7: Circuit when RD = 1, WR = 0, and Comb = 1.
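As a quick numeric check of Equation 27.10 (the supply voltage and the resistance values below are assumptions chosen only for illustration, loosely following the 20 kΩ to 100 kΩ range quoted for Fig. 27.4):

def read_voltage(vdd, r1, r2):
    # Equation 27.10: VD = (VDD/2 - VDD/4) * R2/(R1+R2) + VDD/4
    return (vdd / 2 - vdd / 4) * r2 / (r1 + r2) + vdd / 4

VDD = 1.8  # assumed supply voltage (V)
print(read_voltage(VDD, r1=20e3, r2=100e3))   # stored "1": about 0.83 V, well above VDD/4 = 0.45 V
print(read_voltage(VDD, r1=100e3, r2=20e3))   # stored "0": about 0.53 V, close to VDD/4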

27.6 SUMMARY
In this chapter, a memristor-based design for a Static Random Access Memory (SRAM)
cell is described. Recent studies show that the write time may be considerably decreased by
combining the cutting-edge manufacturing processes and memristor-based Resistive RAM
(RRAM). The memristor-based SRAM may be regarded as a combination of new technologies,
and this memory has the potential to open a new door in the area of memory architecture.

CHAPTER 28

A Fault Tolerant Approach to Microprocessor Design

In this chapter, a fault-tolerant approach to reliable microprocessor design is presented.


The approach, based on the use of an on-line checker component in the processor pipeline,
provides significant resistance to core processor design errors and operational faults such
as supply voltage noise and energetic particle strikes. The approach preserves system
performance while keeping area overheads and power demands low. The checker is a
fairly simple state machine that can be formally verified, scaled in performance, and reused.
Additional improvements to the checker component are described which allow for improved
detection of design, fabrication and operational faults.

28.1 INTRODUCTION
High-quality verification and testing is a vital step in the design of a successful micropro-
cessor product. Designers must verify the correctness of large complex systems and ensure
that manufactured parts work reliably in varied (and occasionally adverse) operating con-
ditions. If successful, users will trust that when the processor is put to a task it will render
correct results. If unsuccessful, the design can falter, often resulting in serious repercussions
ranging from bad press, to financial damage, to loss of life. The challenges that must be
overcome to build a reliable microprocessor design are great. There are many sources of
errors, each requiring careful attention during design, verification, and manufacturing. The
faults are broadly classified that can reduce reliability into three categories: design faults,
manufacturing faults, and operational faults.

28.1.1 Design Faults


Design faults are the result of human error, either in the design or specification of a
system component that renders the part unable to correctly respond to certain inputs. The
typical approach used to detect these bugs is simulation-based verification. A model of the
processor being designed executes a series of tests and compares the model’s results to
expected results. Unfortunately, design errors sometimes slip through this testing process
due to the immense size of the test space. To minimize the probability of undetected

errors, designers employ various techniques to improve the quality of verification including
co-simulation, coverage analysis, random test generation, and model-driven test generation.
Another popular technique, formal verification, uses equality checking to compare a
design under test with the specification of the design. The advantage of this method is that
it works at a higher level of abstraction, and thus can be used to check a design without
exhaustive simulation. The drawback to this approach is that the design and the instruction
set architecture it implements need to be formally specified before the process can be
automated. Complex modern designs have outpaced the capabilities of current verification
techniques. For example, a microprocessor with 32-bit registers, 8k-byte instruction and data
caches, and 300 pins cannot be fully examined with simulation-based testing. The design
has a test space with at least 2^132396 starting states and up to 2^300 transition edges emanating
from each state. While formal verification has improved detection of design faults, full
formal verification is not possible for complex dynamically scheduled microprocessor
designs. To date, the approach has only been demonstrated for in-order issue pipelines
or simple out-of-order pipelines with small window sizes. Complete formal verification of
complex modern microprocessors with out-of-order issue, speculation, and large instruction
windows is currently an intractable problem.
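As a rough check of the quoted test-space figure (assuming a 32-entry, 32-bit register file, an assumption since the register count is not stated here): 2 × 8 KB × 8 bits/byte + 32 × 32 register bits + 300 pin bits = 131072 + 1024 + 300 = 132396 visible state bits, which gives the 2^132396 starting states, while the 300 input pins allow up to 2^300 input combinations, and hence transition edges, from each state.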

28.1.2 Manufacturing Defects


Manufacturing defects arise from a range of processing problems that manifest during
fabrication. For example, step coverage problems that occur during the metallization process
may cause open circuits, or improper doping in the channel of CMOS transistors may cause
a change in the threshold voltage and timing of a device. Non-concurrent testing techniques,
which place the part into a special testing mode, are the primary vehicle for diagnosing
these types of errors. Testing of the system is accomplished by adding special test hardware.
Scan testing adds a MUX to the inputs of flip-flops that allow for reading and writing of
latches during test mode. This method provides direct checking of flip-flop operation and
indirect checking of combination logic connected to the scan latch. Using the scan chain, test
vectors are loaded into the flip-flops and then combinational logic is exercised to determine
if the implementation is faulty. Built in self-test (BIST) adds specialized test generation
hardware to reduce the time it takes to load latches with test vectors. BIST test generators
typically employ modified linear shift feedback register (LSFR) or ROMs to generate key
test vectors that can quickly test for internal logic defects such as single-stuck line faults. A
more global approach is taken which uses onboard current monitoring to detect if there are
any short-circuits. During testing, power supply currents are monitored while the system is
exercised; any abnormally high current spikes are indicative of short-circuit defects.

28.1.3 Operational Faults


Operational faults are characterized as sensitivity of the chip to environmental conditions.
It is useful to subdivide these types of errors into categories based on their frequency:
permanent, intermittent, and transient faults. Permanent faults occur consistently because
the chip has experienced an internal failure. Electro-metal migration and hot electrons are
two examples of permanent faults that can render a design irrevocably damaged. Latch-up,
which is caused by a unity gain in the bipolar transistor structures present in a CMOS layout,
is also classified as a permanent fault; however, this fault can be cleared by powering
down the system.
Unlike permanent faults, intermittent faults do not appear continuously. They appear and
disappear but their manifestation is highly correlated with stressful operating conditions.
Examples of this type of fault include power supply voltage noise or timing faults due
to inadequate cooling. Data-dependent design errors also fall into this category. These
implementation errors are perhaps the most difficult to find because they require specific
directed testing to locate. Transient faults appear sporadically but cannot be easily correlated
to any specific operating condition. The primary source of these faults are single event
radiation (SER) upsets. SER faults are the result of energized particle strikes on logic
which can deposit or remove sufficient charge to temporarily turn the device ON or OFF,
possibly creating a logic error. While shielding is possible, its physical construction and
cost make it an unfeasible solution at this time.
Concurrent testing techniques are usually required to detect operational faults, since
their appearance is not predictable. Three of the most popular methods are timers, coding
techniques and multiple execution. Timers guarantee that a processor is making forward
progress, by signaling an interrupt if the timer expires. Coding techniques use extra informa-
tion to detect faults in data. While primarily used to protect storage, coding techniques also
exist for logic. Finally, data can be checked by using a k-ary system where extra hardware
or redundant execution is used to providing a value for comparison.
Deep submicron fabrication technologies (i.e., process technologies with minimum
feature sizes below 0.25 µm) heighten the importance of operational fault detection. Finer
feature sizes result in smaller devices with less charge, increasing their exposure to noise-
related faults and SER. If designers cannot meet these new reliability challenges, they may
not be able to enjoy the cost and speed advantages of these denser technologies.
In this chapter, an on-line testing approach, called dynamic verification is presented
that addresses many of the reliability challenges facing future microprocessor designs. The
solution inserts an on-line checking mechanism into the retirement stage of a complex
microprocessor. The checker monitors the results of all instructions committed by the
processor. If no error is found in a computation, the checker allows the instruction result
to be committed to architected register and memory state. If any results are found to
be incorrect, the checker fixes the errant result and restarts the processor core with the
correct result. The checker processor is quite simple, lending itself to high-quality formal
verification and electrically robust designs. The addition of the checker to the pipeline causes
virtually no slowdowns to the core processor, and area and power overheads for a complete
checker design are quite modest. The simple checker provides significant resistance to design
and operational faults and provides a convenient mechanism for efficient and inexpensive
detection of manufacturing faults. Design verification is simplified because the checker
concentrates verification onto itself. Specifically, if any design errors remain in the core
processor, they will be corrected (albeit inefficiently) by the checker processor. Significant
resistance to operational faults is also provided. A low-cost and high-coverage technique
is also introduced for detecting and correcting SER-related faults. The approach uses
the checker processor to detect energetic particle strikes in the core processor; for the
checker processor that has developed a re-execute-on-error technique that allows the checker
to check itself. Finally, it demonstrates how the checker can be used to implement a
low-cost hierarchical approach to manufacture testing. Simple low cost BIST tests the
checker module, then the checker can be used as the tester to test the core processor for
manufacturing errors. The approach could significantly reduce the cost of late-stage testing
of microprocessors while at the same time reducing the time it takes to test parts.

28.2 DYNAMIC VERIFICATION


Recently, dynamic verification has been introduced to reduce the burden of verification
in complex microprocessor designs. Dynamic verification is an on-line instruction checking
technique that stems from the simple observation that speculative execution is fault tolerant.
Consider for example, a branch predictor that contains a design error, e.g., the predictor
array is indexed with the most significant bits of the PC (instead of the least significant PC
bits).

28.2.1 System Architecture


The resulting design, even though the branch predictor contained a design error, would
operate correctly. The only effect on the system would be significantly reduced branch
predictor accuracy (many more branch mispredictions) and accordingly reduced system
performance. From the point of view of a correctly designed branch predictor check mech-
anism, a bad prediction from a broken predictor is indistinguishable from a bad prediction
from a correct predictor design. Moreover, predictors are not only tolerant of permanent
errors (e.g., design errors), but also manufacturing defects and operational errors (e.g.,
noise-related faults or natural radiation particle strikes).
Considering this observation, the burden of verification in a complex design can be
decreased by simply increasing the degree of speculation. Dynamic verification does this by
pushing speculation into all aspects of core program execution, making the architecture fully
speculative. In a fully speculative architecture, all processor communication, computation,
control and forward progress is speculative. Accordingly, any permanent (e.g., design error,
defect, or failure) and transient (e.g., noise-related) faults in this speculation do not impact
correctness of the program. Fig. 28.1 illustrates the approach.
To implement dynamic verification, a microprocessor is constructed using two heteroge-
neous internal processors that execute the same program. The core processor is responsible
for pre-executing the program to create the prediction stream. The prediction stream consists
of all executed instructions (delivered in program order) with their input values and any
memory addresses referenced. In a baseline design, the core processor is identical in every
way to the traditional complex microprocessor core up to (but not including) the retirement
stage. In this baseline design, the complex core processor is “predicting” values because it
may contain latent bugs that could render these values incorrect.
The checker processor follows the core processor, verifying the activities of the core
processor by re-executing all program computation in its wake. The high-quality stream of
predictions from the core processor serves to simplify the design of the checker processor
and speed its processing. Pre-execution of the program on the complex core processor
eliminates all the processing hazards (e.g., branch mispredictions, cache misses, and data
dependencies) that slow simple processors and necessitate complex microarchitectures.

Figure 28.1: Dynamic Verification System Architecture.

Thus it is possible to build an in-order checker without speculation that can match the
retirement bandwidth of the core. In the event the core produces a bad prediction value
(e.g., due to a design error), the checker processor will fix the errant value and flush
all internal state from the core processor, and restart it after the errant instruction. Once
restarted, the core processor will resynchronize with the correct state of the machine as it
reads register and memory values from non-speculative storage.
To eliminate the possibility of storage structural hazards, the checker processor has
its own register file and instruction and data caches. A small dedicated data cache for the
checker processor, called the L0 cache, is loaded with whatever data is touched by the core
processor; it taps off the output port of the L1 cache. This prefetching technique greatly
reduces the number of misses experienced by the checker. However, if the checker processor
misses in the L0 cache, it blocks the entire checker pipeline, and the miss is serviced by the
core L2 cache. Cache misses are rare for the checker processor even for very small caches,
because the high-quality address stream from the core processor allows it to manage these
resources very efficiently. Store Queues are also added to both the core and checker (cSTQ
and dSTQ in Fig. 28.1) to increase performance.
The resulting dynamic verification architecture should benefit from a reduced burden of
verification, as only the checker needs to be completely correct. Since the checker processor
will fix any errors in the instructions that are to be committed, the verification of the core
is reduced to simply the process of locating and fixing commonly occurring design errors
that could affect system performance. Since the complex core constitutes a major testing
problem, relaxing the burden of correctness of this part of the design can yield large
verification time savings. To maintain a high quality checker processor, formal verification
is leveraged to ensure correct function and extensive checker processor Built-In-Self-Test
(BIST) to guarantee a successful implementation.

Figure 28.2: Checker Processor Pipeline Structure for a Single Wide Checker Processor.

28.2.2 Checker Processor Architecture


For dynamic verification to be viable, the checker processor must be simple and fast. It
must be simple enough to reduce the overall design verification burden, and fast enough to
not slow the core processor. A single-issue two-stage checker processor is illustrated in Fig.
28.2. The design presented assumes a single-wide checker, but scaling to wider or deeper
designs is a fairly straightforward task (discussed later). In the normal operation (as shown
in Fig. 28.3), the core processor sends instructions (with predictions) at retirement to the
checker pipeline. These predictions include the next PC, instruction, instruction inputs, and
addresses referenced (for loads and stores). The checker processor ensures the correctness
of each component of this transfer by using four parallel stages, each of which verifies a
separate component of the prediction stream. The IFCHECK unit verifies the instruction
fetch by accessing the instruction memory with the checker program counter. IDCHECK
verifies that the instruction was decoded correctly by checking the input registers and control
signals. EXCHECK re-executes the functional unit operation to verify core computation.
Finally, the MEMCHECK verifies any loads by accessing the checker memory hierarchy.
If each prediction from the core processor is correct, the result of the current instruction
(a register or memory value) as computed by the checker processor is allowed to retire to
non-speculative storage in the commit (CT) stage of the checker processor. In the event
any prediction information is found to be incorrect, the bad prediction is fixed, the core
processor is flushed, and the core and checker processor pipelines are restarted after the
errant instruction. Core flush and restart use the existing branch speculation recovery
mechanism contained in all modern high-performance pipelines.

Figure 28.3: Checker Processor Pipeline Structure for a Checker Processor in Check Mode.

As shown in Fig. 28.3 and Fig. 28.4, the routing MUXes can be configured to form a
parallel checker pipeline or a recovery pipeline respectively. In recovery mode the pipeline
is reconfigured into a serial pipeline. In this mode, stage computations are sent to the next
logical stage in the checker processor pipeline, rather than used to verify core predictions.
Only one instruction is allowed to enter the recovery pipeline at a time. As such, the
recovery pipeline configuration does not require bypass datapaths or complex scheduling
logic to detect hazards. Processing performance for a single instruction in recovery mode
will be quite poor, but as long as faults are infrequent there will be no perceivable impact
on program performance. Once the instruction has retired, the checker processor re-enters
normal processing mode and restarts the core processor after the errant instruction.
Fig. 28.2 also illustrates that the checking and recovery modes use the same hardware
modules, thereby reducing the area cost of the checker. Each stage only requires intermediate
pipeline inputs - whether these are from the core processor prediction stream or the previous
stage of the checker processor pipeline (in recovery mode) is irrelevant to the operation
of the stage. This attribute serves to make the control and implementation of individual
stages quite simple. In recovery mode, the check logic is superfluous as the inputs will
always be correct; however, no reconfiguration of the check logic is required, as it will never
declare a fault during recovery. Pipeline scheduling is trivial: if any checker pipeline is
blocked for any reason, all checker processor pipelines are stalled. This simplifies control
of the checker processor and eliminates the need for instruction buffering or complex
non-blocking storage interfaces.

Figure 28.4: Checker Processor Pipeline Structure for a Checker Processor in Execute Mode.

Since there are no dependencies between instructions in


normal processing, checker processor pipeline stalls will only occur during a cache miss
or structural (resource) hazard. This lack of dependencies makes it possible to construct
the checker control as a three stage (Idle, Check and Execute) Moore state machine. The
pipeline sits in the Idle state until the arrival of a retired core processor instruction. The
pipeline then enters normal Check mode until all instructions are exhausted or an instruction
declares a fault. If a fault is declared the pipeline enters the Execute state, reconfiguring the
pipeline to single serial instruction processing. Once the faulty instruction has completed,
the pipeline returns to Idle or Check mode, depending on the availability of instructions
from the core.
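
The three-state control described above can be sketched in software form. The following C fragment is only an illustrative model of the Idle/Check/Execute policy; the input signal names (retired_instr_available, fault_declared, faulty_instr_completed) are assumptions for the sketch, not signals from the actual design.

#include <stdbool.h>

/* Checker control modeled as a three-state Moore machine (Idle, Check, Execute). */
typedef enum { IDLE, CHECK, EXECUTE } checker_state_t;

checker_state_t next_state(checker_state_t s,
                           bool retired_instr_available,
                           bool fault_declared,
                           bool faulty_instr_completed)
{
    switch (s) {
    case IDLE:      /* wait for a retired core instruction */
        return retired_instr_available ? CHECK : IDLE;
    case CHECK:     /* verify core predictions in parallel */
        if (fault_declared)
            return EXECUTE;
        return retired_instr_available ? CHECK : IDLE;
    case EXECUTE:   /* serial recovery of the single faulty instruction */
        if (!faulty_instr_completed)
            return EXECUTE;
        return retired_instr_available ? CHECK : IDLE;
    }
    return IDLE;
}
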
Certain faults, especially those affecting core processor control circuitry, can lock up
the core processor or put it into a deadlock or livelock state where no instructions attempt
to retire. For example, if an energetic particle strike changes an input tag in a reservation
station to the result tag of the same instruction, the processor core scheduler will deadlock.
To detect these faults, a watchdog timer (WT) is added. After each instruction commits,
the watchdog timer is reset to the maximum latency for any single instruction to complete.
If the timer expires, the processor core is no longer making forward progress and the core
is restarted. At that time, pipeline control transitions to recovery mode where the checker
processor is able to complete execution of the current instruction before restarting the core
processor after the stalled instruction.
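
The watchdog behavior can be sketched as a small software model; the constant and function names below are assumptions made for illustration, not part of the actual hardware.

/* Watchdog timer model: reset on every commit, signal recovery on expiry. */
#define MAX_INSTR_LATENCY 64u        /* assumed worst-case cycles per instruction */

static unsigned watchdog = MAX_INSTR_LATENCY;

void on_core_commit(void)            /* called when an instruction retires */
{
    watchdog = MAX_INSTR_LATENCY;
}

int on_clock_cycle(void)             /* called once per cycle; returns 1 to restart the core */
{
    if (watchdog == 0)
        return 1;                    /* no forward progress: enter recovery mode */
    watchdog--;
    return 0;
}
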

28.3 PHYSICAL DESIGN


When considering any type of testing technique it is important to quantify its costs in
addition to its benefits. This section will analyze the performance, area and power costs of
a prototype checker processor. To assess the performance impacts of a checker processor, a
Verilog model for the integer subset of the Alpha instruction set was constructed. The design
was tested with small hand coded assembly programs to verify correct checker operation.
Floating point instructions were not implemented due to time constraints; however, the
floating point overheads were estimated using measurements from an unrelated physical design
in the same technology. The Synopsys toolset was used in conjunction with a 0.25 µm library
to produce an unpipelined, fully synthesized design that ran at 288 MHz. This approach, as
well as semi-custom design, is considered here as a means of matching the speed of current
microprocessors. The Synopsys toolset will generate area estimates for a synthesized design
given the cells used and the estimated interconnect. Boundary optimization, which exploits
logical redundancy between modules, was turned off so that an assessment of each module’s
contribution to the overall area could be determined. A 1.65 mm² single-wide checker size
was produced by the toolset, which is fairly small in comparison to the 205 mm² size of
the Alpha 21264 microprocessor. A single-wide checker only amounts to 0.8% of the area
of the Alpha chip. However, as mentioned before only a subset of the instruction set was
implemented. The checker area breakdown chart shows that the excheck module, which
houses the functional units, contributes the most to the overall size.
The floating point modules should be slightly larger than their integer counterparts due
to the need to track the exponents. Cacti projected that the I-cache and D-cache would be
0.57 mm² and 1.1408 mm², respectively. These values are for the 512-byte I-cache and 4K-byte
D-cache described in the previous section. Accounting for caches, the total checker
area rises to roughly 10 mm², still much smaller than a complete microprocessor in the
same technology.

28.4 DESIGN IMPROVEMENTS FOR ADDITIONAL FAULT COVERAGE


The design presented thus far is targeted at design faults. While the checker inherently
aids in the detection of operational and manufacturing faults due to its checking capability
and simplification of critical circuitry, this has not been the primary focus thus far. This
section explores how dynamic verification can further aid in the detection of operational and
manufacturing faults.

28.4.1 Operational Errors


Several mechanisms reduce the probability of operational errors affecting functionality,
especially those errors that can arise from SER particle strikes. If the checker is properly
designed, the system is guaranteed to function correctly given any error in the core. This
guarantee can be safely made because the checker will detect and correct any operational
faults in the core. Ensuring correct operation is more challenging when considering the
possibility of operational errors in the checker circuits. Ideally, the checker would be
designed such that its lifetime and radiation tolerance are higher than those of the core.
Table 28.1 enumerates the possible fault scenarios that may occur during the checking process. Case A represents
the false positive situation where the checker detected a fault that did not really occur. To
solve Case A, additional control logic is added that causes the checker to re-check core
values when an error is detected before recovery mode is entered.
Table 28.1: Operational Faults in Checker Circuitry

This reduces the likelihood of this case occurring, at the expense of slightly slower
fault recovery performance. Case B can occur in one of two ways; the first is when an
operational error causes an equivalent error to occur in both the core and the checker. This
is the product of two small probabilities. Given a good design, the likelihood of this event
is probabilistically small. However, replication of the functional units may be applied to
further reduce this probability. The other possibility is that either the comparison logic
or the control logic suffers an error. TMR can be employed on the control logic to help
reduce the probability that this error occurs. In a system with a checker, the probability of
a failure, shown in Equation (28.1), is always the product of at least two unlikely events. For
systems that require ultra-high reliability the checker can be replicated until the probability
is acceptable for the application. This is a low cost redundant execution approach as only
the small checker needs to be replicated.

P_failure = P_design_error_core × P_masking_strike_in_checker
                 + P_design_error_core × P_strike_checker_control                (28.1)
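
As a purely numerical illustration of one term of Equation (28.1), the short C program below multiplies two made-up probabilities and shows how replicating the checker shrinks the product further; the values are not measured data.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p_design_error_core = 1e-4;   /* assumed value, illustration only */
    double p_strike_in_checker = 1e-6;   /* assumed value, illustration only */

    /* A failure needs both unlikely events together (one term of Eq. 28.1). */
    double p_failure = p_design_error_core * p_strike_in_checker;

    /* Replicating the checker n times raises the second factor to the nth power. */
    int n = 2;
    double p_failure_replicated = p_design_error_core * pow(p_strike_in_checker, n);

    printf("one checker: %.3e, %d checkers: %.3e\n", p_failure, n, p_failure_replicated);
    return 0;
}
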

Fig. 28.5 illustrates how TMR can be applied to the control logic of the checker to
provide better reliability in the event of an operational error in the checker control. Again, a
design analysis was done by synthesizing a Verilog model. The area estimates previously
given show that the simple control logic of the checker only contributes a small portion to
the overall checker size. The addition of two more control logic units and voting logic only
consumes an extra 0.12 mm². TMR still has a single point of failure within the voter logic,
but the chance of a strike is reduced due to the area difference as compared to the control
unit logic. Additionally, the voter logic can be equipped with transistors that are sized large
enough to have a high tolerance of environmental conditions.
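
The voting step of the TMR arrangement can be expressed as a simple bitwise 2-of-3 majority function; this is an illustrative sketch, not the synthesized voter of the prototype.

#include <stdint.h>

/* Bitwise 2-of-3 majority vote over three copies of a control word.
 * A single corrupted copy is outvoted by the two good copies.        */
static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}
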

28.4.2 Manufacturing Errors


The checker processor can improve the success rate of manufacturing test, because the
checker design lends itself to these tests, and only the checker must be completely free of
defects for correct program operation. The structure of the checker, where values are latched
in and compared to hardware, enhances the checker's suitability for non-concurrent testing
techniques. A simple scan chain can be synthesized in the latches that supply the data to the
checker. This type of testing can achieve enhanced fault coverage over the core, since the
checker circuitry is significantly simpler. BIST is another option that holds great potential
for this type of circuit.

Figure 28.5: Checker Processor Pipeline Structure with TMR on the Control Logic.
As shown in Fig. 28.6, built-in-self-test (BIST) hardware can be added to test for checker
manufacturing errors. The BIST hardware uses a Test Generator to artificially stimulate the
checker logic and it passes values into the checker stage latches that hold the instruction for
the checker to validate. The most effective test generator will be a ROM of opcodes, inputs
and expected results. Using this data and the error signals already present in the system,
an efficient non-concurrent test can be performed. Obviously, the ROM should contain both
good and bad sets of data so that the check modules are fully tested. The number of tests
required to test for all faults is a function of the checker logic size and structure. Memories
make up another large portion of the design; marching ones and zeros are one simple
way to test a memory.

Figure 28.6: Checker Processor Pipeline Structure with TMR on Control Logic and BIST.
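
The marching-ones-and-zeros idea mentioned above can be sketched as a simple software routine over an assumed word-addressable memory array; real memory BIST engines implement richer march algorithms in hardware.

#include <stdint.h>
#include <stddef.h>

/* Simplified march test: write and read back all-zeros, then all-ones.
 * Returns the index of the first failing word, or -1 if the array passes. */
long march_test(volatile uint32_t *mem, size_t words)
{
    const uint32_t patterns[2] = { 0x00000000u, 0xFFFFFFFFu };

    for (int p = 0; p < 2; p++) {
        for (size_t i = 0; i < words; i++)      /* marching write     */
            mem[i] = patterns[p];
        for (size_t i = 0; i < words; i++)      /* marching read-back */
            if (mem[i] != patterns[p])
                return (long)i;
    }
    return -1;
}
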
Although difficult to quantify, it is likely that the manufacturing process could benefit
from the checker processor as well. In a fab where part production is limited by the
bandwidth and latency of testers, the checker processor could improve testing performance.
Once the checker is fully tested using internal BIST mechanisms, the checker itself can test
the remaining core processor circuitry. No expensive external tester is required, only power
and a simple interface with ROM to hold core testing routines and a simple I/O interface to
determine if the checker passed all core processor tests.
28.5 SUMMARY
Many reliability challenges confront modern microprocessor designs. Functional design
errors and electrical faults can impair the function of a part, rendering it useless. While
functional and electrical verification can find most of the design errors, there are many
examples of non-trivial bugs that find their way into the field. Additional faults due to
manufacturing defects and operation faults such as energetic particle strikes must also be
overcome. Concerns for reliability grow in deep submicron fabrication technologies due
to increased design complexity, additional noise-related failure mechanisms, and increased
exposure to natural radiation sources. To counter these reliability challenges, the use of
dynamic verification is presented which is a technique that adds a checker processor to
the retirement phase of a processor pipeline. If an incorrect instruction is delivered by the
core processor, the checker processor will fix the errant computation and restart the core
processor using the processor’s speculation recovery mechanism.
Dynamic verification focuses the verification effort into the checker processor, whose
simple and flexible design lends itself to high-quality functional verification and a robust
implementation. A detailed analysis of a prototype checker processor design is presented.
The simple checker can easily keep up with the complex core processor because it uses pre-
computation in the core processor to clear the branch, data, and communication hazards that
could otherwise slow the simple checker pipeline. Finally, novel extensions to the baseline
design are presented that improve coverage for operational faults and manufacturing fault
detection. One such approach is to leverage the fault tolerance of the core to implement self-
tuning core circuitry. By employing an adaptive clocking mechanism, it becomes possible
to overclock core circuitry, reclaiming design and environmental margins that nearly always
exist.

REFERENCES
[1] M. Williams, “Faulty Transmeta Crusoe Chips Force NEC to Recall 300 Laptops”,
The Wall Street Journal, 2000
[2] P. Bose, T. Conte and T. Austin, “Challenges in processor modeling and validation”,
IEEE Micro, pp. 2–7, 1999.
[3] R. Grinwald, “User defined coverage, a tool supported methodology for design ver-
ification”, Proceedings of the 35th ACM/IEEE Design Automation Conference, pp.
1–6, 1998.
[4] M. K. Srivas and S. P. Miller, “Formal Verification of an Avionics Microprocessor”,
SRI International Computer Science Laboratory Technical Report CSL, 1995.
[5] M. C. McFarland, “Formal Verification of Sequential Hardware: A Tutorial”, IEEE
Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol 12, no. 5.
1993.
[6] J. Sawada, “A table based approach for pipelined microprocessor verification”, Proc.
of the 9th International Conference on Computer Aided Verification, 1997.
[7] H. A. Asaad and J. P. Hayes, “Design verification via simulation and automatic test
pattern generation”, Proceedings of the International Conference on Computer-Aided
Design, IEEE Computer Society Press, pp. 174–180, 1995.
[8] B. T. Murray and J. P. Hayes, “Testing ICs: Getting to the Core of the Problem”,
Computer, vol. 29, no.11, pp. 32–45, 1996.
[9] M. Nicolaidis, “Theory of Transparent BIST for RAMs”, IEEE Trans. Computers,
vol. 45, no. 10, pp. 1141–1156, 1996.
[10] M. Nicolaidis, “Efficient UBIST Implementation for Microprocessor Sequencing
Parts”, J. Electronic Testing: Theory and Applications, vol. 6, no. 3, pp. 295–312,
1995.
[11] S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultane-
ous Multithreading”, Proceedings 27th Annual Intl. Symp. Computer Architecture
(ISCA), 2000.
[12] E. Rotenberg, “AR-SMT: A Micro architectural Approach to Fault Tolerance in Mi-
croprocessors”, Proceedings of the 29th Fault-Tolerant Computing Symposium, 1999.
[13] S. Chatterjee, C. Weaver and T. Austin, “Efficient Checker Processor Design”, In
Micro-33, 2000.
[14] H. A. Assad, J. P. Hayes and B. T. Murray, “Scalable test generators for high-speed
datapath circuits”, Journal of Electronic Testing: Theory and Applications, vol. 12,
no. 1/2, 1998.
[15] J. Gaisler, “Evaluation of a 32-bit microprocessor with built-in concurrent error de-
tection”, Proceedings of the 27th Fault-Tolerant Computing Symposium, 1997.
[16] Y. Tamir and M. Tremblay, “High-performance fault tolerant VLSI systems using
micro rollback”, IEEE Trans. on Computers, vol. 39, no. 4, pp. 548–554, 1990.
CHAPTER 29

Applications of VLSI Circuits


and Embedded Systems

Our world is innovation-driven. Yet, the new generation may not realize that when computers
originally appeared, they were so huge that some of them occupied an entire room. The
explanation for this was that they were built from large vacuum tubes. The size was large,
yet the speed was very modest. Before long, designers understood that they did not need
such large computers and that the size ought to be reduced. The invention of the Integrated
Circuit (IC) made this possible, and not long after, Very Large-Scale Integration (VLSI) was
brought to light.
VLSI is the process of creating an IC by combining billions of transistors into
a single chip. VLSI started during the 1970s when complex semiconductor and
communication technologies were being developed. The microprocessor is a VLSI device.
Before VLSI technology, most ICs had a restricted set of functions they
could perform. An electronic circuit may comprise a CPU, ROM, RAM, and other logic
components. VLSI lets IC creators include all of these in one chip with the advancement of
technology.
Embedded systems are special-purpose computing systems embedded in application
environments or in other computing systems and provide specialized support. The decreas-
ing cost of processing power, combined with the decreasing cost of memory and the ability
to design low-cost systems on chip, has led to the development and deployment of embedded
computing systems in a wide range of application environments. Examples include network
adapters for computing systems and mobile phones, control systems for air conditioning,
industrial systems, and cars, and surveillance systems. Embedded systems for networking
include two types of systems required for end-to-end service provision: infrastructure (core
network) systems and end systems. The first category includes all systems required for the
core network to operate, such as switches, bridges, and routers, while the second category
includes systems visible to the end users, such as mobile phones and modems.

Figure 29.1: Example of Industrial Autonomous Robots [5].

29.1 APPLICATIONS OF VLSI CIRCUITS


VLSI circuits are used everywhere; real applications include microprocessors in a personal
computer or workstation, chips in a graphic card, digital camera or camcorder, chips in a
cell phone or a portable computing device, and embedded processors in an automobile, etc.
In this chapter, some of the applications of VLSI circuits are discussed.

29.1.1 Autonomous Robots in Industrial Plants


The advancements of robotics are well in progress, with multitudes of autonomous
robot vehicles automatically finding their way onto production line floors as a method for
improving the speed and precision of routine tasks. Fueled by the advancements of VLSI circuits,
these free-moving robots can be assigned to more complex tasks if needed, empowering
them to perform computerized assignments in a controllable and predictable way. This
gives them the potential to improve performance inside manufacturing plants. The complex
circuits inside these robots are getting smaller day by day due to the improvements in
the latest VLSI techniques. Fig. 29.1 shows an example of autonomous robots.
Generally, computerized guided vehicles and transports have been introduced as methods
for moving materials and parts around manufacturing plants. Nevertheless, the greater
part of these robots has relied upon pre-set courses that offer no deviation. But with each
generation of electronic upgrades, more of these robots are observed doing challenging
tasks. For example, with innovations in processor technology, sensors, 3D cameras, and
networking, smart robots are equipped with the hardware for exploring their
path securely around industrial facility floors. Some forward-looking makers are ahead
of the curve in embracing such frameworks. In Italy, for instance, car producer Faurecia is
utilizing autonomous vehicles from Mobile Industrial Robots to improve the efficiency of
its logistics. The organizations have cooperated to redesign Faurecia's production line
layouts to permit robots to explore their courses utilizing their internal maps. Workers
collaborate with the robots through cell phones, tablets, or PC interfaces, informing them
of their requirements with the press of a button.

29.1.2 Machines in Manufacturing


No producer desires to put resources into their plant, just to see it underused and
not delivering its output. That is the reason smart VLSI designs have risen as a mainstream
and powerful method for observing machine use, sending important performance
information to administrators through dashboards to tell them what apparatus is working
most successfully in contrast with other hardware. These platforms can act as a key driver
in improving the manufacturing plant, helping to identify bottlenecking machines that are
failing to meet expectations.

Figure 29.2: Improvement of Productivity: Connected devices mitigate human mistake [6].

For example, Machine Metrics has been working with Fastenal, the Minnesota-based
producer of fasteners and tools, to apply a smart device that monitors processing plant
operation. The product can connect with any cutting-edge CNC machine by coupling the
Machine Metrics Edge to the Ethernet port of the control, while older machines
can share information directly to the cloud utilizing the digital and analog I/O
modules.
In the Fastenal case, the product gave knowledge into machine use continuously each
day, week, and month to reveal chances to make effective upgrades. This conveyed an 11%
expansion in machine usage in the initial three months, says Machine Metrics.
A little to medium-sized assembling plant may contain several administrator devices,
in different shapes and sizes, which are utilized for a large number of capacities. For an
enormous industrial facility, that number could ascend to thousands. The advancements
of VLSI devices mean that these devices would never be incorrectly utilized outside of a
particular arrangement of operational boundaries. That is conveying huge enhancements in
productivity. In Fig. 29.2, we can see that smart machines are boosting our productivity in
manufacturing plants.
Airbus and Bosch have driven the path around there, with the Factory of the Future
activity utilizing associated smart devices. These procedures, in aviation plants, can happen
more than a few work cells and can be performed by various administrators. In this way,
says Airbus, there is immense potential for improving these procedures by making hand
tools smarter. Other manufacturing organizations have followed the same pattern. Workers
at GE Aviation have, for instance, been combining WiFi-enabled torque wrenches with
mixed reality headsets to guarantee that bolts are tightened optimally. This is about improving
productivity and profitability, and moreover boosting product quality.
Assembling, by its very nature, requires a ton of energy. This can represent an enormous
level of working expenses. That is the reason processing plant proprietors and directors
are progressively going to smart VLSI solutions that provide interface sensors, actuators,
regulators, and other hardware that empowers the checking of energy use, lighting, HVAC,
and fire security frameworks. This information can likewise be joined with data from more
extensive datasets like climate anticipating and money related data, for example, the cost of
power and different utilities. This kind of design is growing in assembling plants, to make
structures more intelligent, more reasonable, and more proficient.
BAE Systems has, for instance, worked with Schneider to introduce its EcoStruxure
building at one of its manufacturing offices in the UK. In this specific case, the EcoStruxure
stage is being utilized to screen HVAC in the stockroom and office regions, alongside other
equipment including destratification fans, heat recovery units, and electric panel heaters. As
far as system setup, two panels were built by systems integrator Aimteq for the
offices and distribution center, containing Schneider SmartX AS-P controllers and I/O
modules, as well as intuitive touchscreen tablet displays. Without the help of smart circuits and
electronics, this would be impossible to achieve.

29.1.3 Smart Vision Tech for Quality Control


Quicker and more adaptable manufacturing lines may be the way to satisfy client’s need,
yet there can likewise be a negative effect on quality control if checking isn’t adequate.
Nowadays as plants look to mechanization to replace manual examination, innovation is
being utilized to guarantee there is no deviation from quality margins. The human eye is
gradually being replaced by VLSI-enabled high-pixel camera vision systems in combination
with other devices, for example acoustic sensors, alongside image
processing. There is an example of smart vision camera equipment in Fig. 29.3. These can
be utilized to recognize imperfections in, for example, size, shape, or finish, and to check the
precision and readability of labels, barcodes, or QR codes. This data can then be
fed back to earlier stages in the manufacturing line, permitting administrators
to recognize and classify the underlying cause of the issue before modifications are made.
Over time, this computerized perception can be used to refine and improve the production
process.

Figure 29.3: Smart Vision Tech for Quality Control [28].

This kind of smart vision technology is being utilized across assembling plants to screen
the nature of a wide scope of items including electronic gadgets, purchaser merchandise, and
metal parts. For instance, the automotive component supplier Getrag has been utilizing such a system
to inspect gear teeth and clutch body parts, providing engineers with ongoing information
on non-conforming parts and patterns emerging from the manufacturing process. The point is to
improve product quality, decrease excessive re-work, and enhance brand image.
The advantages of electronic circuits for makers do not end once items have arrived at
dispatch. In reality, transportation and logistics have become one of the essential
beneficiaries of digitalization, with asset-tracking sensors ready to give continuous data
on asset location, the surrounding temperature, and movement, using technologies such as LoRa and
Narrowband IoT. These systems stream VLSI sensor information to the cloud securely and
seamlessly, depending upon what is required.
Recently, a joint venture between Hoopo and Polymer Logistics delivered smart
tracking of containers utilizing LoRa, which means it can find assets without
the requirement for GPS. This keeps up the gadget's low-power utilization and allows
extended battery life while giving information on assets continuously.
29.1.4 Wearables: Ensuring Security


Innovation in Wearable technology may be firmly connected with smart fitness gadgets. For
example, wearables are progressively being utilized to guarantee individual security, with
body-worn sensors being utilized to screen biological conditions and to give understanding
into indispensable signs, for example, temperature, heartbeat, and breath rate. A smart
fitness band is represented in Fig. 29.4.
By embedding hardware with sensors or smart devices, they become nodes that gather
and send information to a certain network. Particularly in manufacturing plants where
employees perform actions alone or handle potentially unsafe substances, this serves
as a method for bringing down an organization's compliance and administrative expenses. In
the meantime, wearables are being utilized inside manufacturing for ergonomic reasons, to
decrease the toll that physical activities take on workers' bodies.
The German carmaker Audi is utilizing exoskeleton ergonomic guides in its facility to
offer help for laborers when they are lifting and conveying huge materials. The exoskeletons
additionally permit laborers to accept a sitting position when required.
These capabilities have been shown to diminish the strain on the back by 20 to 30
percent and to promote a more beneficial posture over the long haul. Progressively, such
gadgets are being VLSI-circuit enabled, permitting specialists to use more precise
information for ergonomic purposes.

Figure 29.4: Wearables: Ensuring Security [29].

29.1.5 Computing Using the CPU


You may be acquainted with the term processor as this is regularly utilized with CPU
in any discussion. A CPU is the most regularly utilized type of processor. It’s intended
to be profoundly adaptable and appropriate for a wide scope of undertakings. All things
considered, the CPU runs the OS and applications.
As a review, CPUs work utilizing prediction units, registers, and execution units; this
arrangement is known as the architecture. Registers hold pieces of information or pointers to memory,
frequently in 64-bit data formats. Execution units accomplish something with at
least one register, for example, reading from and writing to memory or performing
arithmetic. Numerous execution units can be utilized in parallel within the CPU, each taking a
clock cycle or two to finish its operation.

Figure 29.5: Computing Using the CPU: AMD Ryzen Processor [14].

CPUs are adaptable enough to suit a wide variety of tasks. Performance can be scaled
by changing the clock speed (in GHz), or the design can be changed to accomplish more with
each clock cycle. For example, the AMD Ryzen 9 3950X presents a 16-core, 32-thread CPU
in the latest 3rd generation of Ryzen processors for PCs. The advancements of 7-nanometer
VLSI technology enable AMD to make this sort of processor, which was never thought
of before. An AMD Ryzen CPU is shown in Fig. 29.5.

29.1.6 System on a Chip


SoC represents System-on-a-Chip. As the name suggests, an SoC is a complete pro-
cessing unit contained in a single package. It is not a single processor but a bundle. An
SoC contains numerous processing parts, memory, modems, and other basic pieces fab-
ricated together in a single chip that is attached to the circuit board. VLSI technology
single-handedly impacted this field with all of its advancements. Without VLSI technology
we would not be using smartphones with SoCs in our day to day life.
The System-on-a-Chip is like the brain of our cell phone. Joining different parts into
a single chip saves on space, cost, and power consumption. SoCs interface with
different parts as well, for example, cameras, a display, RAM, memory, etc. An SoC is the
mind of our cell phone that handles everything from the Android OS to identifying when
we press a button. Some of the most important parts of an SoC are given below.
Central Processing Unit (CPU): The "minds" of the SoC run the greater part of the code
for the Android OS and a large portion of the applications.
Graphics Processing Unit (GPU): It handles graphics-related tasks, for example, rendering
an application's UI and 2D/3D gaming.
Image Processing Unit (IPU): It converts information from the telephone's camera into
picture and video files.
Digital Signal Processor (DSP): It handles more numerically intensive workloads
than a CPU, such as decompressing music files and analyzing gyroscope
sensor information.
Neural Processing Unit (NPU): It is used in high-end cell phones to accelerate AI
tasks. These include voice recognition and camera processing.
Video encoder/decoder: It handles the power-efficient conversion of video files and
formats.
Modems: They convert wireless signals into information our telephone understands. These include 4G
LTE, 5G, WiFi, and Bluetooth modems.
In the cell phone space, Qualcomm, Samsung Semiconductor, Huawei’s HiSilicon, and
MediaTek are the four greatest names in the business. Odds are that our cell phone has a
chip from one of these organizations in it. Qualcomm is the biggest supplier of cell phone
SoCs, delivering chips for most of the flagship, mid-level, and even low-end cell phones released
every year. Qualcomm's SoCs fall under the Snapdragon branding. Premium chips featuring
the company's best technology fall under the Snapdragon 800 series, for
example, the most recent Snapdragon 865. Mid and upper-mid-level items are marked with
Snapdragon 600 and 700 series names respectively, for example, the Snapdragon 765,
which sports 5G connectivity. Lower-level items are named under the 400 series.

29.1.7 Cutting Edge AI Handling


Terms like neural processing unit (NPU) and AI processor are regularly used; however,
they generally mean something very similar inside a modern PC or smartphone. An
NPU is a system that is explicitly used for the calculations of neural networks and artificial
intelligence. Even 20 years ago this type of system was not practical. But with advanced
VLSI semiconductor technology, we can now use this type of system to implement AI
operations with more precision and flexibility. Even smartphones have NPUs nowadays.
NPUs are processors explicitly designed to run neural networks and AI tasks more
rapidly and efficiently than CPUs. NPUs use their local memory caches as well,
to accelerate execution without utilizing slower RAM.
Neural networks regularly require operations that take multiple pieces of information to create
only a single output. They can work on data sizes from 16 bits down to 8 and even 4 bits
at a time. This is different from the arithmetic and data types utilized by
CPUs. For this reason, the AI field has gathered a lot of pace, as NPUs
and other smart hardware are accelerating the results of AI and neural networks.
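
The low-precision arithmetic mentioned above can be illustrated with an 8-bit dot product accumulated in a wider integer, which is the basic multiply-accumulate an NPU repeats in large parallel arrays; the sketch is generic and does not describe any particular NPU.

#include <stdint.h>
#include <stddef.h>

/* 8-bit quantized dot product with a 32-bit accumulator. */
int32_t dot_q8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
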

29.1.8 VLSI in 5G Networks


Various new technologies like automated vehicles, smart factories, streaming video, and
cloud-based applications have placed more emphasis on higher bandwidth and shrinking
latency. To meet these evolving needs, 5G promises a 100X speed boost compared to 4G
LTE, along with latencies that are an order of magnitude lower. Besides, 5G specifications
call for the new network to connect one million devices per square kilometer, more than
100 times as many as before.
Meeting these higher performance levels requires big changes, including a new fre-
quency band and a changed radio access network (RAN) architecture. On the heels of
building out 4G LTE, carriers must now deploy an entirely new transport technology with
greater complexity and significantly more hardware and software components. The rollout
itself will take place on a massive scale and carriers need solutions that are not just fast
and efficient to deploy, but also economical to buy and operate. These components also
need to be reliable and minimize power consumption. Here again, the VLSI technology is
showing its effect on building the future of networking. The 5G network is almost entirely
different from 4G LTE, beginning with a frequency band. 5G picks up where 4G leaves off,
spanning the spectrum from 6 GHz to 300 GHz. Higher frequencies support significantly
smaller cell sizes, enabling 5G cells to provide highly localized coverage in locations such
as neighborhoods, manufacturing plants, or even within houses and other structures. For
example, a small 5G tower is shown in Fig. 29.6. These small stations use state-of-
the-art networking modems, receivers, and transmitting circuits which are built on the latest
advancements in VLSI networking elements.

29.1.9 Fuzzy Logic and Decision Diagrams


Fuzzy logic is an important paradigm in our day to day life. We use a lot of mathematical
reasoning and complex solutions which are generated from fuzzy logic and decision diagrams.
VLSI circuits and logic gates make the implementation of various fuzzy algorithms in our day
to day life much easier than before. The newer forms of VLSI circuits implement various
decision diagram-based solutions faster than ever before. Many decisions of electronic
devices are now streamlined because of the latest IC technology. Some of the important
areas where VLSI advancements impacted the fuzzy logic world are discussed below.
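
As a small, hypothetical illustration of how a fuzzy rule base can be evaluated in software (the membership breakpoints and rule outputs below are invented for the example), consider a controller that maps a temperature reading to a fan speed:

#include <stdio.h>

/* Triangular membership function: degree of membership in [0, 1]. */
static double tri(double x, double lo, double mid, double hi)
{
    if (x <= lo || x >= hi) return 0.0;
    return (x < mid) ? (x - lo) / (mid - lo) : (hi - x) / (hi - mid);
}

int main(void)
{
    double t = 31.0;                          /* temperature reading, assumed   */
    double warm = tri(t, 20.0, 30.0, 40.0);   /* membership in the "warm" set   */
    double hot  = tri(t, 30.0, 45.0, 60.0);   /* membership in the "hot" set    */

    /* Weighted (centroid-style) defuzzification over two rules:
     * IF warm THEN fan = 40%;  IF hot THEN fan = 90%.            */
    double fan = (warm * 40.0 + hot * 90.0) / (warm + hot + 1e-9);
    printf("fan speed = %.1f%%\n", fan);
    return 0;
}
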

Aviation
In aviation, fuzzy logic is utilized in the following areas:

• Altitude control of spacecraft and aircraft

• Satellite altitude control

• Flow and mixture regulation in aircraft systems, etc.


Figure 29.6: VLSI in 5G Networks: 5G Network Cell [15].

Vehicle Industry
In a car, fuzzy logic is utilized in the following areas:

• Fuzzy logic-based system for speed control

• Moving strategy for programmed & autonomous vehicle

• Traffic signal management.


Business
In business, fuzzy logic is utilized in the following areas:

• Decision-making support systems

• Workforce performance assessment in a huge organization

• Monitor the efficiency of various tools and machines

Safety Fields
In the defense field, fuzzy logic is utilized in various aspects like:

• Underwater target recognition

• Automatic target recognition from infrared images

• Power protection systems

• Control of a hypervelocity interceptor

Marine
In the marine field also, we can see a lot of usage, for example:

• Autopilot for ships

• Optimal route determination

• Control of autonomous underwater vehicles

• Ship steering

Clinical
In the medical field, we can also see a lot of uses of fuzzy logic, as follows:

• Clinical diagnostic support systems

• Analysis of human behavior

• Criminal investigation and prevention based on fuzzy logic reasoning

• Control of arterial pressure during anesthesia

• Multivariable control of anesthesia

• Modeling of neuropathological findings in Alzheimer's patients


29.2 APPLICATIONS OF EMBEDDED SYSTEMS


An embedded system is an electronic or computer system that is designed to control and
access the information in electronic devices. It may include a single-chip microcontroller,
for example an ARM Cortex, as well as FPGAs, microprocessors, ASICs, and DSPs. In the
current situation, the utilization of embedded systems is boundless, but the software
which is programmed into the microcontroller is equipped for solving only limited problems. An
embedded system can perform multiple tasks and is likewise fit for interfacing with
different systems and gadgets. Applications of embedded systems are found in regions
like space, transportation, communication, mechanical systems, home appliances, and so
on. An embedded system has a huge domain. The uses of an embedded system incorporate
home machines, office automation, security, media transmission, instrumentation,
entertainment, aviation, banking, and automobiles, etc.

29.2.1 Embedded System for Street Light Control


The fundamental aim of this undertaking is to detect the movement of vehicles on
highways and to turn on the road lights ahead of them, and afterward to turn off the road lights
as the vehicle goes past the road lights in order to save energy.
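
A minimal sketch of this control idea is shown below, assuming hypothetical helper functions vehicle_detected() and set_lamp() provided by the board support code; it is illustrative only.

#include <stdbool.h>

/* Hypothetical hardware hooks (not a real API):
 *   vehicle_detected(i) - sensor placed ahead of lamp i
 *   set_lamp(i, on)     - drive the relay for lamp i      */
bool vehicle_detected(int lamp);
void set_lamp(int lamp, bool on);

#define LAMPS      10
#define HOLD_TICKS 50        /* keep a lamp on briefly after the vehicle passes */

void street_light_task(void)     /* called periodically from the main loop */
{
    static int hold[LAMPS];

    for (int i = 0; i < LAMPS; i++) {
        if (vehicle_detected(i))
            hold[i] = HOLD_TICKS;     /* switch on ahead of the vehicle   */
        else if (hold[i] > 0)
            hold[i]--;                /* count down after it has passed   */
        set_lamp(i, hold[i] > 0);     /* off again once the hold expires  */
    }
}
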

29.2.2 Embedded System for Industrial Temperature Control


The industrial temperature controller is used to control the temperature of any device
in any industrial application as per its need. It keeps the temperature within the scope of a certain
limit. The core of the circuit is the microcontroller, which forms an embedded system.
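
A minimal on/off (hysteresis) version of such a controller might look like the sketch below; read_temperature() and heater_enable() are assumed, board-specific helpers, and the set points are illustrative.

#include <stdbool.h>

/* Assumed hardware-access helpers (hypothetical, board specific). */
double read_temperature(void);
void   heater_enable(bool on);

#define SETPOINT_C   80.0    /* desired temperature, illustrative */
#define HYSTERESIS_C  2.0    /* dead band to avoid relay chatter  */

void temperature_control_task(void)   /* called periodically */
{
    static bool heating = false;
    double t = read_temperature();

    if (t < SETPOINT_C - HYSTERESIS_C)
        heating = true;              /* too cold: turn the heater on  */
    else if (t > SETPOINT_C + HYSTERESIS_C)
        heating = false;             /* too hot: turn the heater off  */

    heater_enable(heating);
}
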

29.2.3 Embedded System for Traffic Signal Control


The fundamental objective of this venture is to design a density-based traffic light sys-
tem. At each intersection, the signal timing changes automatically according to the traffic
volume at that intersection. Gridlock is a significant issue in numerous urban communities
around the world and causes problems for commuters and travelers.
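
One simple way to express density-based timing is to scale each approach's green time with its vehicle count, as in the illustrative sketch below; count_vehicles() is an assumed sensor-reading helper.

/* Assumed sensor helper: vehicles queued on a given approach (hypothetical). */
int count_vehicles(int approach);

#define MIN_GREEN_S 10
#define MAX_GREEN_S 60

/* Green time grows with queue length, clamped to sensible limits. */
int green_time_for(int approach)
{
    int n = count_vehicles(approach);
    int g = MIN_GREEN_S + 2 * n;        /* 2 s of extra green per queued vehicle */

    if (g > MAX_GREEN_S)
        g = MAX_GREEN_S;
    return g;
}
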

29.2.4 Embedded System for Vehicle Tracking


The principal motivation behind this task is to locate the exact position of a vehicle by
utilizing a GPS modem and to reduce vehicle theft. The GSM modem sends an SMS
to a predefined mobile number, which stores the information in it. An LCD display is utilized to
show the location data in terms of latitude and longitude values.
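
A rough sketch of the reporting step is given below, assuming hypothetical uart_send_line() and gps_read() helpers and a text-mode GSM modem driven with standard AT commands; the phone number is a placeholder.

#include <stdio.h>

/* Hypothetical helpers provided by the platform code. */
void uart_send_line(const char *line);      /* send one line to the GSM modem */
void gps_read(double *lat, double *lon);    /* last fix from the GPS modem    */

void report_position(void)
{
    double lat, lon;
    char msg[64];

    gps_read(&lat, &lon);
    snprintf(msg, sizeof msg, "LAT:%.6f LON:%.6f", lat, lon);

    uart_send_line("AT+CMGF=1");                  /* select text-mode SMS      */
    uart_send_line("AT+CMGS=\"+0000000000\"");    /* placeholder phone number  */
    uart_send_line(msg);                          /* message body              */
    uart_send_line("\x1A");                       /* Ctrl+Z terminates the SMS */
}
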

29.2.5 Embedded System for War Field Spying Robot


The use of robotic vehicles utilizing RF technology for remote operation, fitted
with a wireless camera for spying, is a common feature for many modern armies in the world.
The robot with a camera can remotely send constant video with night vision capabilities. This
kind of robot can be useful for spying purposes in war fields. Fig. 29.7 is an example of a
spying drone used for defense purposes.
Figure 29.7: Smart Drone [31].

29.2.6 Automated Vending Machine


The system requires a microcontroller to control the total process. A display shows different
messages for the client. A mechanism accepts the coins or notes and examines them
to determine their value. In Fig. 29.8, a vending machine is shown as an example. A
decision is then taken to dispense a product and return the extra cash, if any is available.
There are millions of vending machines which use this type of small-scale embedded
system.

Figure 29.8: Automated Vending Machine [22].
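
The dispense-and-return-change decision can be sketched as below; the amounts are in the smallest currency unit and the coin denominations are illustrative.

#include <stdio.h>

/* Return change for (credit - price) using a greedy split over
 * illustrative coin denominations, largest first.               */
void vend_and_return_change(int credit, int price)
{
    static const int coins[] = { 100, 50, 25, 10, 5, 1 };
    int due = credit - price;

    if (due < 0) {
        printf("insufficient credit\n");
        return;
    }
    printf("dispense product\n");
    for (unsigned i = 0; i < sizeof coins / sizeof coins[0]; i++) {
        int n = due / coins[i];
        due  -= n * coins[i];
        if (n > 0)
            printf("return %d coin(s) of %d\n", n, coins[i]);
    }
}
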

29.2.7 Mechanical Arm Regulator


A robotic arm controller is a system that is utilized to perform movements
like those of a human arm and carry out a specific activity, mainly pick and drop. There are numerous
applications that use similar jointed mechanisms, such as mechanical arms, robotic arms, little toys, PC
peripherals, television remotes, and power windows in vehicles. A robot's arm is represented
in Fig. 29.9 as an example. Moreover, electronic instruments, for example, temperature
controllers and so forth, are likewise instances of embedded system controllers.

Figure 29.9: Mechanical arm [24].

29.2.8 Routers and Switches


A network system consists of a lot of switches and routers such as those seen in a
small/medium/large-scale office. It requires proper protocols and security to be imple-
mented correctly. The networking is implemented using embedded systems. Different
ports are required to interface the various PCs in a multi-PC framework, and controlling
them requires interfaces. So, these are all examples of embedded systems. An example of
such networking equipment is shown in Fig. 29.10.

Figure 29.10: Networking Equipment: Switch [30].

29.2.9 Industrial Field Programmable Gate Arrays


A Field Programmable Gate Array (FPGA) is an integrated circuit that can be configured by
a client for a specific purpose even after the manufacturing of the circuit is done. In a way,
it helps to tailor a device for a particular application. The interconnects can promptly be
reprogrammed, permitting an FPGA to accommodate changes to a design or even support another
application during the lifetime of the part.
Numerous applications depend on the parallel execution of identical operations;
the capacity to configure the FPGA's CLBs into hundreds or thousands of identical
processing elements has applications in image processing, artificial intelligence (AI), server farm
equipment, smart driving vehicles, etc.
A significant number of these application areas are changing rapidly as require-
ments advance and new protocols and standards are adopted. FPGAs empower makers
to implement systems that can be updated when needed. A genuine case of FPGA use
is rapid search: Microsoft is utilizing FPGAs in its server farms to run Bing search algo-
rithms. The FPGA can change to support new algorithms as they are created. When requirements
change, the design can be repurposed to run simulation or modeling routines in an
HPC application. This adaptability is difficult to accomplish with a normal
ASIC circuit.
Since FPGAs work in a parallel manner, they have a much higher speed and consequently
can be utilized to take care of complex calculation issues; along with the re-programmability
capacity, this makes FPGAs both powerful and adaptable machines.
For example, Xilinx FPGAs are being offered by Amazon Web Services to accelerate compute-
intensive applications. The utilization of FPGAs in cars and vehicles permits applications, for
example, ADAS, LiDAR, and autonomous driving, with a decrease in resource utilization
and improvement in efficiency. In industry, the FPGA chips are opening up new door-
ways for automation and secure innovation which can help entrepreneurs and processing
plants to counter security dangers and empower the development of automation in the
working environment.
FPGAs are being utilized broadly over a wide scope of different fields, for example, in
clinical equipment, PC hardware, and radio gadgets. They are additionally being applied in
many parts of bioinformatics, voice recognition technology, security, and data transfer
systems, wired or wireless, and in various scientific, clinical, and consumer gadgets. Since
FPGAs have a huge internal memory alongside an extraordinary number of multipliers, they
are being utilized for signal processing and digital signal processing applications.
They are being utilized for picture and video processing also.
FPGAs are utilized with the end goal of hardware and computation acceleration, just as
for accelerating artificial neural networks or AI applications, as is being done by Microsoft in
their Project Catapult. These are perfect for use in low-volume, vertical application produc-
tion, as they might end up being more affordable than an ASIC.
The utilization of FPGAs is getting progressively more frequent when contrasted
with microcontrollers for the Internet of Things (IoT), as it permits quick processing of
numerous calculation points in a parallel and synchronous manner.
They are perfect for use in high-end control applications where a strict level
of control is required.

29.2.10 Industrial Programmable Logic Circuits


Programmable Logic Circuits (PLCs) are utilized in different applications in enterprises,
for example, the steel business, car industry, chemical substance industry, and the energy
industry. The scope of PLCs increases significantly with the development of all
the different fields where they are applied; for example, in the travel industry, PLCs have been utilized
to monitor the safety controls and to operate lifts and elevators.

Glass Industry
PLC controllers have been used in the glass business for a considerable length of
time. They are utilized to a great extent to control the material proportion as well as to process
flat glass. The innovation has been progressing throughout the years, and this has made an
expanded interest in the PLC control mode for use in the glass business. The creation of
glass is an intricate and complex procedure, so the organizations involved regularly use PLCs
with bus technology in their control mode. Generally speaking, the PLC is applied in
both simple data recording in glass production and advanced quality and position
control.

Paper Industry
In the paper business, PLCs are utilized in different procedures. These incorporate control-
ling the machines that produce paper items at high speeds. For example, a PLC controls and
monitors the production of book pages or newspapers in offset web printing.

Concrete Assembling
Assembling concrete involves blending different raw materials in a kiln. The quality
of these raw materials and their proportions significantly affects the quality of the end output. To
guarantee the utilization of the correct quality and amounts of raw materials, the accuracy
of information regarding such process factors is really important.

Industrial Machinery
A distributed control system involving PLCs in client mode and configuration software is
utilized in the industry's production and management processes. The PLC specifically controls
ball milling, the coal kiln, and the shaft kiln. Other instances of PLC programming
applications that are being used in different businesses today include water tank quench-
ing systems in the aviation sector, filling machine control systems in the food
business, industrial batch washer control for the textile industry, etc.
The programmable logic controllers in industrial automation incorporate a tremendous impact
from different top industry makers, for example, Allen-Bradley and Omron.

29.3 SUMMARY
VLSI (Very Large Scale Integration) circuits and embedded systems have a lot of impact
on our lives through their applications. As a whole, it is realized that these technologies are
remarkable and do an indispensable job in numerous gadgets, types of equipment,
mechanical control systems, modern instrumentation, and home machines, regardless of
their nature. Since VLSI circuits and embedded systems control such a large number of
gadgets, an organization knows that it is impossible to operate without them or to use any
hardware without them. They offer automation that builds safety and efficiency for
organizations. For instance, if a construction organization has VLSI circuits and embedded
systems on its pieces of machinery, the technology could give alerts that one of them needs
a brief adjustment before it represents a safety hazard. Besides, numerous individuals have
cutting-edge machines and smartphones in their hands, and these work as an enhancement to
work and life. Human specialists can use different gadgets to manage equipment, and if
the equipment cannot perform, possibly raising alerts, it creates a problem. The equipment
can remain on one line and carry out a similar assignment over and again without making
mistakes or requiring breaks, as people can monitor it and take steps before a problem
arises. Without VLSI circuits and embedded systems, the advancements of automation
wouldn't exist by any means. The modern transformation made possible through the
IoT (Internet of Things) is making VLSI Circuits and Embedded systems more normal in
the industry.

REFERENCES
[1] Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Very_Large_Scale_
Integration. [Accessed: 21 Sep., 2020].
[2] Tutorialspoint. [Online]. Available: https://www.tutorialspoint.com/vlsi_design
/vlsi_design_digital_system.html. [Accessed: 13 Oct., 2019].
[3] Howtogeek. [Online]. Available: https://www.howtogeek.com/394267/what-do-7nm-
and-10nm-mean-and-why-do-they-matter/. [Accessed: 12 Nov., 2020].
[4] Autonomous Robots: https://upload.wikimedia.org/wikipedia/commons/0/02/
Hannover_-_CeBit_2015_-_DT_Industrie_40_-_Roboter_008.jpg [Accessed: 5 June,
2021].
https://creativecommons.org/licenses/by-sa/3.0/deed.en, CeBIT 2015Deutsche
Telekom (Booth CeBIT)Internet of things, CC-BY-SA-3.0Pictures by Mummelgrum-
mel
[5] Industrial Robots: https://en.wikipedia.org/wiki/Smart_manufacturing#/media/
File:BMW_Leipzig_MEDIA_050719_Download_Karosseriebau_max.jpg
https://creativecommons.org/licenses/by-sa/2.0/de/deed.en
CC BY-SA 2.0 de view terms [Accessed: 15 Jun., 2021].
[6] Airbus. [Online]. Available: https://www.airbus.com/newsroom/news/en/2014/07/airbus-
moves-forward-with-its-factory-of-the-future-concept.html. [Accessed: 25 Feb.,
2020].
[7] EcoStruxure. [Online]. Available: https://www.se.com/ww/en/work/campaign
/innovation/overview.jsp. [Accessed: 30 Mar., 2020].
[8] EETimes. [Online]. Available: https://www.eetimes.com/software-sizes-digitally-
controlled-transmissions/. [Accessed: 28 Feb., 2020].
[9] Polymerlogistics. [Online]. Available: https://www.polymerlogistics.com/. [Accessed:
30 Jun., 2020].
[10] New Atlas. [Online]. Available: https://newatlas.com/health-wellbeing/audi-
exoskeleton-trial-ingolstadt/. [Accessed: 2 Jul., 2020].
[11] Engineering. [Online]. Available: https://www.engineering.com/ElectronicsDesign
/ElectronicsDesignArticles
/ArticleID/5791/Applications-Processors-The-Heart-of-the-Smartphone.aspx.
[Accessed: 13 Jul., 2020].
[12] Qualcomm. [Online]. Available: https://www.qualcomm.com/snapdragon. [Accessed:
18 Jul., 2020].
[13] AMD Ryzen 5. [Online]. Available: https://commons.wikimedia.org/wiki/File:Ryzen_5_
1600_CPU_on_a_motherboard.jpg
https://creativecommons.org/licenses/by-sa/4.0/deed.en [Accessed: 19 June, 2021].
[14] 5G Network. [Online]. Available: https://upload.wikimedia.org/wikipedia/commons/4/43/
Celluar_Antenna_with_tower_for_5G.jpg
https://creativecommons.org/licenses/by-sa/4.0/ [Accessed: 22 June, 2021].
[15] Wikichip. [Online]. Available: https://en.wikichip.org/wiki/neural_processor. [Accessed: 26 Jul., 2020].
[16] Program-plc. [Online]. Available: https://program-plc.blogspot.com//08/application-
of-2010plc-in-glass-industry.html. [Accessed: 29 Jul., 2020].
[17] Study. [Online]. Available: https://study.com/academy/lesson/programmable-
logic-controllers-plc-in-industrial-networks-definition-applications-examples.html.
[Accessed: 2 Aug., 2020].
[18] Gbctechtraining. [Online]. Available: https://www.gbctechtraining.com/blog/PLC-
Applications-in-our-Everyday-Lives. [Accessed: 5 Aug., 2020].
[19] Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Embedded_system.
[Accessed: 9 Aug., 2020].
[20] The Engineering Projects. [Online]. Available: https://www.theengineeringprojects.
com/2016/11/examples-of-embedded-systems.html. [Accessed: 12 Aug., 2020].
[21] Vending Machine. [Online]. Available: https://upload.wikimedia.org/
wikipedia/commons/5/51/Otsuka_vending_machines_in_Kita-ku%2C_Osaka.jpg
https://creativecommons.org/licenses/by/4.0/ [Accessed: 15 June, 2021].
[22] Elprocus. [Online]. Available: https://www.elprocus.com/embedded-systems-real-
time-applications/. [Accessed: 17 Aug., 2020].
[23] Robot Arm. [Online]. Available: https://upload.wikimedia.org/wikipedia/commons/
a/a4/CNX_UPhysics_11_02_RobotArm.png
https://creativecommons.org/licenses/by/4.0/ [Accessed: 15 June, 2021].
[24] Electronics Hub. [Online]. Available: https://www.electronicshub.org/embedded-
system-real-time-applications/. [Accessed: 17 Aug., 2020].
[25] Market Research. [Online]. Available: https://blog.marketresearch.com/embedded-
systems-and-the-internet-of-things-iot: :text=Embedded_systems. [Accessed: 19
Aug., 2020].
[26] DJI. [Online]. Available: https://www.dji.com. [Accessed: 21 Aug., 2020].
[27] Sony. [Online].https://www.wallpaperflare.com/black-camera-accessory-lot-buttons-
circuits-close-up-components-wallpaper-avxef [Accessed: 15 June, 2020].
[28] Fitbit. [Online]. https://upload.wikimedia.org/wikipedia/commons/7/78/
FitBitIonicWorkOutMode092917.jpg
https://commons.wikimedia.org/wiki/Category:CC-BY-4.0 [Accessed: 15 June,
2021].
[29] Network Switch. [Online]. Available: https://upload.wikimedia.org/wikipedia/
commons/e/e5/Network_switches.jpg
https://creativecommons.org/licenses/by-sa/3.0/ [Accessed: 25 June, 2021].
[30] SMD Smart Drone. [Online]. Available: https://upload.wikimedia.org/wikipedia/
commons/b/b8/2015_Dron_DJI_Phantom_3_Advanced.JPG
https://creativecommons.org/licenses/by-sa/4.0/deed.en, [Accessed: 25 June, 2021].

FINAL REMARKS
Digital logic design is a foundation to the field of electrical and computer engineering.
Digital logic designers build complex electronic components that use both electrical and
computational characteristics. These characteristics may involve power, current, logical
function, protocol and user input. Similarly, VLSI design is used to developing hardware,
such as circuit boards and microchip processors. This hardware processes user input,
system protocol and other data in computers, navigational systems, cell phones or other
high-tech systems. This book aims at designing high-performance and cost-effective VLSI
circuits which demand knowledge of all aspects of modern digital design. Since VLSI has
moved from an exotic and expensive curiosity to an everyday necessity, researchers have
refocused VLSI design from circuit design toward advanced logic and system
design. Studying VLSI design as a system design discipline requires a book like this
to consider a somewhat different set of areas than does the study of circuit design. In this
book, design of different topics is balanced on the one hand by discussions of circuits and
on the other hand by the architectural choices.
A Binary Decision Diagram (BDD) is a rooted, directed acyclic graph with one
or two terminal nodes of out-degree zero, labeled 0 or 1, and a set of variable nodes of out-
degree two. BDDs and their variants are a class of data structures that has seen successful
application in the formal verification of systems with large state spaces. Variations of BDDs
have been described in this book to support quantitative calculations of the type that are
required for verification and performance analysis of systems. For example, Multi-Terminal
BDDs (MTBDDs) are a generalization of BDDs in which there can be multiple leaf nodes,
each labeled by a distinct value. An MTBDD representation of a vector can be very
compact, assuming that the set of distinct values appearing as entries in the vector is small.
MTBDDs can also be used to represent matrices, and computations such as vector/matrix
multiplication can be performed efficiently in terms of MTBDD representations. A variety
of other variations of BDDs has been described in this book, including shared MTBDDs,
multi-valued decision diagrams (MDDs), and multi-valued Pseudo-Kronecker DDs.
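To make the data structure concrete, the following minimal Python sketch (not drawn from any particular CAD tool; the class and function names are illustrative assumptions) models a BDD for f = x0 AND x1 and an MTBDD whose leaves carry the two distinct values of a four-entry vector:

    # Minimal sketch of BDD/MTBDD-style nodes (illustrative only).
    class Node:
        """A decision-diagram node: terminals carry a value, internal
        nodes test one variable and branch to a low (0) and high (1) child."""
        def __init__(self, var=None, low=None, high=None, value=None):
            self.var, self.low, self.high, self.value = var, low, high, value

        def is_terminal(self):
            return self.var is None

    def evaluate(node, assignment):
        """Follow one path from the root to a terminal.
        For a BDD the result is 0/1; for an MTBDD it may be any leaf value."""
        while not node.is_terminal():
            node = node.high if assignment[node.var] else node.low
        return node.value

    # f(x0, x1) = x0 AND x1 as a BDD (the terminals 0 and 1 are shared).
    zero, one = Node(value=0), Node(value=1)
    f = Node(var="x0", low=zero, high=Node(var="x1", low=zero, high=one))

    # An MTBDD representing the vector (5, 5, 5, 9) indexed by (x0, x1):
    # only two distinct leaf values occur, so only two leaves are needed.
    five, nine = Node(value=5), Node(value=9)
    v = Node(var="x0", low=five, high=Node(var="x1", low=five, high=nine))

    print(evaluate(f, {"x0": 1, "x1": 1}))  # 1
    print(evaluate(v, {"x0": 1, "x1": 0}))  # 5

Because the vector (5, 5, 5, 9) contains only two distinct values, the MTBDD needs only two leaf nodes, which is the compactness referred to above.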
Multiple-valued circuits can implement logic directly by using multiple-valued signals, or the logic can be implemented indirectly with binary circuits by using more than one binary signal to represent a single multiple-valued signal. In recent years, major advances in integrated circuit technology have both made feasible and generated great interest in electronic circuits that employ more than two discrete levels of signal. Such circuits, called multiple-valued logic circuits, offer several potential
opportunities for the improvement of modern VLSI circuit designs. The fact that some commercial products have already benefited from multiple-valued logic is believed to be a first step towards recognition of the role of VLSI circuits in the next generation of electronic
systems. Multi-valued logic systems have attracted the attention of a number of researchers.
Some find the area fascinating for its wealth of logic structures, whose richness transcends
the familiar binary environment. Others work on potential practical applications to demon-
strate that the appeal is not only academic, but that there exists a host of opportunities for
improvement of digital systems through the utilization of higher-radix methods. Synthesis
techniques have been developed to facilitate the design of multi-valued networks along the
lines of binary switching theory. Some of the notable electronic realizations of multi-valued
elements and basic functions are described in this book. Computer applications such as
image processing require very high arithmetic processing rates, and it is therefore neces-
sary to explore potential areas of circuit design that could increase the processing rate of
an Integrated Circuit (IC).
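As a small illustration (a hedged sketch rather than a circuit-level realization from this book, with function names invented for the example), the Python fragment below shows the indirect binary encoding of a radix-4 signal with two binary signals, together with the MIN/MAX operators that many multiple-valued algebras use in place of AND/OR:

    # Illustrative sketch: representing one radix-4 (quaternary) signal
    # with two binary signals, plus MIN/MAX, which play the role of
    # AND/OR in many multiple-valued algebras.
    def quaternary_to_bits(d):
        """Encode one 4-valued digit (0..3) as two binary signals."""
        return (d >> 1) & 1, d & 1          # (msb, lsb)

    def bits_to_quaternary(msb, lsb):
        return (msb << 1) | lsb

    def mvl_min(a, b):   # multiple-valued "AND"
        return min(a, b)

    def mvl_max(a, b):   # multiple-valued "OR"
        return max(a, b)

    assert bits_to_quaternary(*quaternary_to_bits(3)) == 3
    print(mvl_min(2, 1), mvl_max(2, 1))     # 1 2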
An IC that contains a large number of gates, flip-flops, and other elements that can be configured by the user to perform different functions is called a Programmable Logic Device (PLD). The internal logic elements of a PLD, such as flip-flops and AND and OR gates, together with the connections among them, are configured through a programming process carried out with a dedicated software application; the process of entering this configuration information into the device is known as programming. Users program these ICs electrically in order to implement the required Boolean functions. Here, the term programming refers to hardware configuration rather than software programming. Today's most prominent PLD technology, the Field-Programmable Gate Array (FPGA), is used in an increasing number of application domains, such as the telecom industry, the automotive electronics sector, and automation technology, and recent market studies show a continuing demand for these sophisticated microelectronic devices. PLDs have enabled many users, designers, and manufacturers to develop innovative, logic-based solutions across a wide variety of applications. Reduced power consumption, lower cost, and the integration of features that are simply not possible with most of the alternatives make PLDs a favored and preferred option for users from many different backgrounds and industries.
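The behavioral sketch below is illustrative only: representing the programmed connections as Python tuples is an assumption made for this example, not the electrical programming mechanism. It shows how a PLA-style AND/OR plane realizes a sum-of-products function once its product terms have been "programmed":

    # Illustrative sketch of a PLA-style programmable logic device:
    # the "programming" is the set of product terms wired into the
    # AND plane and summed in the OR plane.  In each term, 1 = true
    # literal, 0 = complemented literal, None = input not connected.
    def pla_output(inputs, product_terms):
        """inputs: tuple of bits; product_terms: list of literal patterns."""
        def term_true(pattern):
            return all(lit is None or bit == lit
                       for bit, lit in zip(inputs, pattern))
        return int(any(term_true(p) for p in product_terms))

    # Program the device for f(a, b, c) = a*b + (NOT a)*c (sum of products).
    terms = [(1, 1, None),   # a AND b
             (0, None, 1)]   # NOT a AND c
    print(pla_output((1, 1, 0), terms))   # 1  (a*b)
    print(pla_output((0, 0, 1), terms))   # 1  ((NOT a)*c)
    print(pla_output((1, 0, 0), terms))   # 0

In an actual PLD the same information is stored in fuses, antifuses, or configuration memory cells rather than in a data structure, but the logical effect is the one modeled here.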
Digital logic circuits are often known as switching circuits because, in digital circuits, the voltage levels are assumed to switch from one value to another instantaneously. They are termed logic circuits because their operation obeys a definite set of logic rules. The simplest digital circuits are built from logic gates, the building blocks of the digital computer. Since most of the physical variables encountered in the real world are continuous, digital circuits approximate continuous quantities with strings of bits; the more bits that are used, the more accurately the continuous signal can be represented. For example, if 16 bits are used to represent a varying voltage, the signal can be assigned one of more than 65,000 different values. Digital circuits are more immune to noise than analog circuits, and digital signals can be stored and duplicated without degradation. For simple digital circuits, the conventional method of designing circuits can easily be applied, but for complex digital
circuits the conventional design method is not fruitfully applicable because it is too time-consuming. Genetic programming, by contrast, is used mostly for automatic program generation. Modern approaches for designing arithmetic circuits, and digital circuits more generally, are described in this book.
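The bit-width arithmetic mentioned above (16 bits giving more than 65,000 levels) can be checked with a short sketch; the 0-5 V range and the helper name quantize are assumptions made purely for illustration:

    # Illustrative sketch: quantizing a continuous voltage to n bits.
    # With n = 16 there are 2**16 = 65,536 representable levels.
    def quantize(voltage, v_min=0.0, v_max=5.0, bits=16):
        """Map a voltage in [v_min, v_max] to the nearest n-bit code."""
        levels = 2 ** bits
        step = (v_max - v_min) / (levels - 1)
        code = round((voltage - v_min) / step)
        return max(0, min(levels - 1, code))

    print(2 ** 16)                 # 65536 distinct values
    print(quantize(2.5))           # code near mid-scale (32768)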
The design procedures and working mechanisms of modern VLSI circuits and embedded systems are covered in this book. Researchers in these fields who work with advanced embedded systems will find the material gathered in one place, supporting a better understanding of VLSI circuits and embedded systems. Besides theoretical knowledge, practical applications of VLSI circuits and embedded systems are included to give a real flavor of the topics. Both practicing professionals and advanced undergraduate or graduate students will benefit from this book. For a student, the most rewarding aspect of a VLSI design class is putting together previously learned basics of circuit, logic, and architecture design to understand the trade-offs between the different levels of abstraction. Professionals who either practice VLSI design or develop VLSI CAD tools can use this book to brush up on parts of the design process with which they have less frequent involvement.
Index

adaptable manufacturing, 440
Addition, 397
Airbus, 440
arithmetic operations, 326
array logic, 124
Artificial Neural Network, 150
automated arm, 449
average frequency, 326

back-propagation, 154
BCD addition, 197
BCD output, 201
Binary Adder, 339
binary coded decimal, 287
Binary Decision Diagram, 457
binary decision diagrams, 62
binary search tree, 152
bit counter, 375
bit-wise, 201
Boolean functions, 389
Boolean Tree, 393
Built in self-test, 424

canonical cubes, 114
carry-free, 399
Certain faults, 430
checker processor, 426
chromosomes, 170, 179
circuit architecture, 355
circuit design, 214
circular shift, 114
clique, 15
clock pulse, 65
combinational circuit, 323
comparator, 248, 337, 355, 370
comparator circuit, 325
computational operation, 326
concatenation, 364
conceptualization, 140
Concurrent testing, 425
configurable logic blocks, 243
configuration flexibility, 188
congestions, 208
connected graph, 109
consensus operation, 390
control flow, 325
Controller circuit, 221
controller circuit, 292
counter, 342
counter circuit, 342
CPU, 442
crossover, 172, 176
cryptography, 213, 253
cycle latency, 327

Davio expansion, 73
decimal arithmetic, 215
decision diagrams, 21, 58, 63, 76
decoded-PLA, 163
decoder, 163
Deep submicron, 425
Design faults, 423
Diagonal length, 407
digit multiplier, 236
digital components, 125
digital computer, 323
Digital logic, 457
Digital logic circuits, 458
digitalization, 441
divider circuit, 325
division algorithm, 331, 360
Division algorithms, 330
DSP, 444
Dynamic reordering, 79
dynamic verification, 425, 426

electronic industry, 214
embedded system, 448
Embedded systems, 437
embedded systems, 145
endeavor, 441
error propagation, 150
evolution, 170

fan-out, 123
Fault tolerance, 326
fault-tolerant, 423
feed-forward, 400
Feynman gates, 122
Field Programmable Gate Arrays, 450
financial calculations, 213
first sum, 198
fitness, 171
fitness function, 180
flip-flops, 87
formal verification, 424
Fredkin gates, 122
Fuzzy logic, 445
fuzzy logic, 128
fuzzy relations, 121, 128
fuzzy sets, 125

Gate input cost, 251
genetic algorithms, 171
genetic operators, 175
genetics since, 170
GPU, 444
graph optimization, 12
group siftings, 43
guided vehicles, 438

hardware parallelism, 287
heuristic, 47
heuristic algorithms, 39
heuristic function, 325
heuristic technique, 396
high radix, 401
higher reliability, 146

inheritance, 171
instruction pipelining, 328
inter-process communications, 329
internal logic, 344
inversion, 178
IPU, 444
isomorphic, 37

Kernighan-Lin, 211

latch, 88
Literal cost, 251
logic circuit, 168
logic density, 188
logic functions, 19, 97
logic simulation, 5, 19, 35
logic synthesis, 5, 107
logical complement, 336
Look-Up Table, 373
low-power, 441

Machine Metrics, 439
Manufacturing defects, 424
matrix, 135
Matrix multiplication, 233
matrix multiplication algorithm, 280
matrix operation, 252
medical imaging, 282
memory cells, 223, 291
Memory unit, 221
memory unit, 290
Memristance, 218
microelectronic technology, 403
microprocessor, 426
microprocessors, memories, 403
minterm, 107
Modems, 444
multi-valued logic, 121
multiple-output, 55, 79
Multiple-valued circuits, 457
multiple-valued logic, 87, 107, 133
multiplexing, 66
multiplicand, 238
multiplication, 326
multiplication process, 213
multiplier, 187, 194
multiplier circuit, 272, 277
Mutation, 172
mutation, 172

nanocross wires, 222
negative Davio expansion, 73
negative gains, 209
netlist, 207
network, 47
network security, 282
neural network, 398
Neural systems, 444
node, 62
non-linear, 149, 218
NP-complete, 56
NPU, 444
number system, 399

offsprings, 172
Operational faults, 424
output functions, 28

pair sifting, 35
pairing, 68
parallel checker, 429
parallel processing, 233, 328
partial product, 185, 236
partitioning, 208
pass-transistor, 158
pattern matching, 197
perceptron, 153
Permanent faults, 424
pin problem, 61
platinum electrodes, 218
population, 172
potential canonical cube, 107
potential canonical cubes, 112
power consumption, 236, 330
prime implicant, 134
probabilistic condition, 345
processing element, 279
Programmable Logic Array, 163
Programmable Logic Circuits, 452
Propagation delay, 330
propagation delay, 205, 236

quotient, 345

random modification, 177
Read operation, 294
Reconfigurable computing, 214
recovery modes, 429
rectangle height, 410
rectangle packing, 407
rectangle width, 409
reprogramming, 233
reversible Fredkin gate, 121
reversible gates, 344
reversible logic gates, 122
robot vehicles, 438
routing operations, 207
Ryzen, 443

scheduling technique, 411
scientific computing, 185
selection, 172
self-sufficient vehicles, 439
sequential cores, 404
Serial-Out Shift Register, 335
set theory, 121
Shannon expansion, 73
Shift Register, 249, 331
sifting algorithm, 27
signed-digit, 397
significant bits, 247
single event radiation, 425
single-chip, 327
smart robots, 438
smart vision, 441
subtractor circuit, 340
sum-of-products, 133
supervised learning, 151
switching functions, 73
systolic array structure, 131

TAM, 406
TANT network, 389
terminal nodes, 10
test vectors, 160
threshold gate, 90
threshold gates, 97
threshold logic, 121, 124
Toffoli gates, 122
transistors, 361
truth tables, 101

Wearable, 442
Wireless Communication, 234
wrapper, 404
Write operation, 294