Attaining High Performance Communications: A Vertical Approach
Edited by
Ada Gavrilovska
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-1-4200-9313-1 (Ebook-PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my family.
Contents
List of Figures xiv
List of Tables xviii
Preface xxi
Acknowledgments xxvii
About the Editor xxix
List of Contributors xxxi
1 High Performance Interconnects for Massively Parallel Systems 1
Scott Pakin
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Application Sensitivity to Communication Performance 3
1.2.3 Measurements on Massively Parallel Systems . . . . . 4
1.3 Network Topology . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 The “Dead Topology Society” . . . . . . . . . . . . . . 8
1.3.2 Hierarchical Networks . . . . . . . . . . . . . . . . . . 12
1.3.3 Hybrid Networks . . . . . . . . . . . . . . . . . . . . . 13
1.3.4 Novel Topologies . . . . . . . . . . . . . . . . . . . . . 15
1.4 Network Features . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Programming Models . . . . . . . . . . . . . . . . . . 18
1.5 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Commodity High Performance Interconnects 25
Dhabaleswar K. Panda, Pavan Balaji, Sayantan Sur, and Matthew
Jon Koop
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Overview of Past Commodity Interconnects, Features and
Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 InfiniBand Architecture . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 IB Communication Model . . . . . . . . . . . . . . . . 28
2.3.2 Overview of InfiniBand Features . . . . . . . . . . . . 32
2.3.3 InfiniBand Protection and Security Features . . . . . . 39
2.3.4 InfiniBand Management and Services . . . . . . . . . . 40
2.4 Existing InfiniBand Adapters and Switches . . . . . . . . . . 43
2.4.1 Channel Adapters . . . . . . . . . . . . . . . . . . . . 43
2.4.2 Switches . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.3 Wide Area Networks (WAN) and Routers . . . . . . . 45
2.5 Existing InfiniBand Software Stacks . . . . . . . . . . . . . . 45
2.5.1 Low-Level Interfaces . . . . . . . . . . . . . . . . . . . 45
2.5.2 High-Level Interfaces . . . . . . . . . . . . . . . . . . . 46
2.5.3 Verbs Capabilities . . . . . . . . . . . . . . . . . . . . 46
2.6 Designing High-End Systems with InfiniBand: Case Studies 47
2.6.1 Case Study: Message Passing Interface . . . . . . . . . 47
2.6.2 Case Study: Parallel File Systems . . . . . . . . . . . 55
2.6.3 Case Study: Enterprise Data Centers . . . . . . . . . . 57
2.7 Current and Future Trends of InfiniBand . . . . . . . . . . . 60
3 Ethernet vs. EtherNOT 61
Wu-chun Feng and Pavan Balaji
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.1 Defining Ethernet vs. EtherNOT . . . . . . . . . . . . 64
3.2.2 Forecast . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.1 Ethernet Background . . . . . . . . . . . . . . . . . . 65
3.3.2 EtherNOT Background . . . . . . . . . . . . . . . . . 66
3.4 Ethernet vs. EtherNOT? . . . . . . . . . . . . . . . . . . . . 67
3.4.1 Hardware and Software Convergence . . . . . . . . . . 67
3.4.2 Overall Performance Convergence . . . . . . . . . . . . 78
3.5 Commercial Perspective . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 82
4 System Impact of Integrated Interconnects 85
Sudhakar Yalamanchili and Jeffrey Young
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Technology Trends . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Integrated Interconnects . . . . . . . . . . . . . . . . . . . . 90
4.3.1 HyperTransport (HT) . . . . . . . . . . . . . . . . . . 92
4.3.2 QuickPath Interconnect (QPI) . . . . . . . . . . . . . 96
4.3.3 PCI Express (PCIe) . . . . . . . . . . . . . . . . . . . 98
4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.4 Case Study: Implementation of Global Address Spaces . . . . 101
4.4.1 A Dynamic Partitioned Global Address Space Model
(DPGAS) . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.2 The Implementation Path . . . . . . . . . . . . . . . . 105
4.4.3 Bridge Implementation . . . . . . . . . . . . . . . . . . 106
4.4.4 Projected Impact of DPGAS . . . . . . . . . . . . . . 108
4.5 Future Trends and Expectations . . . . . . . . . . . . . . . . 109
5 Network Interfaces for High Performance Computing 113
Keith Underwood, Ron Brightwell, and Scott Hemmert
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Network Interface Design Issues . . . . . . . . . . . . . . . . 113
5.2.1 Offload vs. Onload . . . . . . . . . . . . . . . . . . . . 114
5.2.2 Short vs. Long Message Handling . . . . . . . . . . . 115
5.2.3 Interactions between Host and NIC . . . . . . . . . . . 118
5.2.4 Collectives . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Current Approaches to Network Interface Design Issues . . . 124
5.3.1 Quadrics QsNet . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2 Myrinet . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.3 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.4 Seastar . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.5 PathScale InfiniPath and Qlogic TrueScale . . . . . . 127
5.3.6 BlueGene/L and BlueGene/P . . . . . . . . . . . . . . 127
5.4 Research Directions . . . . . . . . . . . . . . . . . . . . . . . 128
5.4.1 Offload of Message Processing . . . . . . . . . . . . . . 128
5.4.2 Offloading Collective Operations . . . . . . . . . . . . 140
5.4.3 Cache Injection . . . . . . . . . . . . . . . . . . . . . . 147
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6 Network Programming Interfaces for High Performance
Computing 149
Ron Brightwell and Keith Underwood
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 The Evolution of HPC Network Programming Interfaces . . 150
6.3 Low-Level Network Programming Interfaces . . . . . . . . . . 151
6.3.1 InfiniBand Verbs . . . . . . . . . . . . . . . . . . . . . . 151
6.3.2 Deep Computing Messaging Fabric . . . . . . . . . . . 153
6.3.3 Portals . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3.4 Myrinet Express (MX) . . . . . . . . . . . . . . . . . . 153
6.3.5 Tagged Ports (Tports) . . . . . . . . . . . . . . . . . . 153
6.3.6 LAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3.7 Sockets . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.4 Distinguishing Characteristics . . . . . . . . . . . . . . . . . 154
6.4.1 Endpoint Addressing . . . . . . . . . . . . . . . . . . . 155
6.4.2 Independent Processes . . . . . . . . . . . . . . . . . . 155
6.4.3 Connections . . . . . . . . . . . . . . . . . . . . . . . . 156
6.4.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.4.5 Operating System Interaction . . . . . . . . . . . . . . 156
6.4.6 Data Movement Semantics . . . . . . . . . . . . . . . 157
6.4.7 Data Transfer Completion . . . . . . . . . . . . . . . . 158
6.4.8 Portability . . . . . . . . . . . . . . . . . . . . . . . . 160
6.5 Supporting MPI . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.5.1 Copy Blocks . . . . . . . . . . . . . . . . . . . . . . . 160
6.5.2 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.5.3 Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.5.4 Unexpected Messages . . . . . . . . . . . . . . . . . . . 161
6.6 Supporting SHMEM and Partitioned Global Address Space
(PGAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.6.1 Fence and Quiet . . . . . . . . . . . . . . . . . . . . . 162
6.6.2 Synchronization and Atomics . . . . . . . . . . . . . . 162
6.6.3 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.6.4 Scalable Addressing . . . . . . . . . . . . . . . . . . . 163
6.7 Portals 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.7.1 Small Message Rate . . . . . . . . . . . . . . . . . . . 164
6.7.2 PGAS Optimizations . . . . . . . . . . . . . . . . . . . 166
6.7.3 Hardware Friendliness . . . . . . . . . . . . . . . . . . 166
6.7.4 New Functionality . . . . . . . . . . . . . . . . . . . . 167
7 High Performance IP-Based Transports 169
Ada Gavrilovska
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.2 Transmission Control Protocol — TCP . . . . . . . . . . . . 170
7.2.1 TCP Origins and Future . . . . . . . . . . . . . . . . . 170
7.2.2 TCP in High Speed Networks . . . . . . . . . . . . . . 172
7.2.3 TCP Variants . . . . . . . . . . . . . . . . . . . . . . . 173
7.3 TCP Performance Tuning . . . . . . . . . . . . . . . . . . . . 178
7.3.1 Improving Bandwidth Utilization . . . . . . . . . . . . 178
7.3.2 Reducing Host Loads . . . . . . . . . . . . . . . . . . 179
7.4 UDP-Based Transport Protocols . . . . . . . . . . . . . . . . . 181
7.5 SCTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 183
8 Remote Direct Memory Access and iWARP 185
Dennis Dalessandro and Pete Wyckoff
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.2 RDMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.2.1 High-Level Overview of RDMA . . . . . . . . . . . . . 187
8.2.2 Architectural Motivations . . . . . . . . . . . . . . . . 189
8.2.3 Fundamental Aspects of RDMA . . . . . . . . . . . . 192
8.2.4 RDMA Historical Foundations . . . . . . . . . . . . . 193
8.2.5 Programming Interface . . . . . . . . . . . . . . . . . . 194
8.2.6 Operating System Interactions . . . . . . . . . . . . . 196
8.3 iWARP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.3.1 High-Level Overview of iWARP . . . . . . . . . . . . . 200
8.3.2 iWARP Device History . . . . . . . . . . . . . . . . . . 201
8.3.3 iWARP Standardization . . . . . . . . . . . . . . . . . 202
8.3.4 Trade-Offs of Using TCP . . . . . . . . . . . . . . . . 204
8.3.5 Software-Based iWARP . . . . . . . . . . . . . . . . . 205
8.3.6 Differences between IB and iWARP . . . . . . . . . . 206
8.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 207
9 Accelerating Communication Services on Multi-Core Platforms 209
Ada Gavrilovska
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.2 The “Simple” Onload Approach . . . . . . . . . . . . . . . . 210
9.2.1 Limitations of the “Simple” Onload . . . . . . . . . . 212
9.3 Partitioned Communication Stacks . . . . . . . . . . . . . . . 213
9.3.1 API Considerations . . . . . . . . . . . . . . . . . . . 216
9.4 Specialized Network Multi-Cores . . . . . . . . . . . . . . . . 217
9.4.1 The (Original) Case for Network Processors . . . . . . 217
9.4.2 Network Processors Features . . . . . . . . . . . . . . 219
9.4.3 Application Diversity . . . . . . . . . . . . . . . . . . 222
9.5 Toward Heterogeneous Multi-Cores . . . . . . . . . . . . . . 223
9.5.1 Impact on Systems Software . . . . . . . . . . . . . . . 226
9.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 228
10 Virtualized I/O 229
Ada Gavrilovska, Adit Ranadive, Dulloor Rao, and Karsten Schwan
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.1.1 Virtualization Overview . . . . . . . . . . . . . . . . . 230
10.1.2 Challenges with I/O Virtualization . . . . . . . . . . . 232
10.1.3 I/O Virtualization Approaches . . . . . . . . . . . . . 233
10.2 Split Device Driver Model . . . . . . . . . . . . . . . . . . . 234
10.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.2.2 Performance Optimization Opportunities . . . . . . . 236
10.3 Direct Device Access Model . . . . . . . . . . . . . . . . . . 240
10.3.1 Multi-Queue Devices . . . . . . . . . . . . . . . . . . . . 241
10.3.2 Device-Level Packet Classification . . . . . . . . . . . 243
10.3.3 Signaling . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.3.4 IOMMU . . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.4 Opportunities and Trade-Offs . . . . . . . . . . . . . . . . . 245
10.4.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.4.2 Migration . . . . . . . . . . . . . . . . . . . . . . . . . 246
10.4.3 Higher-Level Interfaces . . . . . . . . . . . . . . . . . . 246
10.4.4 Monitoring and Management . . . . . . . . . . . . . . 247
10.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 249
11 The Message Passing Interface (MPI) 251
Jeff Squyres
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
11.1.1 Chapter Scope . . . . . . . . . . . . . . . . . . . . . . . 251
11.1.2 MPI Implementations . . . . . . . . . . . . . . . . . . 252
11.1.3 MPI Standard Evolution . . . . . . . . . . . . . . . . . 254
11.1.4 Chapter Overview . . . . . . . . . . . . . . . . . . . . 255
11.2 MPI’s Layer in the Network Stack . . . . . . . . . . . . . . . 255
11.2.1 OSI Network Stack . . . . . . . . . . . . . . . . . . . . 256
11.2.2 Networks That Provide MPI-Like Interfaces . . . . . . 256
11.2.3 Networks That Provide Non-MPI-Like Interfaces . . . 257
11.2.4 Resource Management . . . . . . . . . . . . . . . . . . 257
11.3 Threading and MPI . . . . . . . . . . . . . . . . . . . . . . . 260
11.3.1 Implementation Complexity . . . . . . . . . . . . . . . 260
11.3.2 Application Simplicity . . . . . . . . . . . . . . . . . . . 261
11.3.3 Performance Implications . . . . . . . . . . . . . . . . 262
11.4 Point-to-Point Communications . . . . . . . . . . . . . . . . 262
11.4.1 Communication/Computation Overlap . . . . . . . . . 262
11.4.2 Pre-Posting Receive Buffers . . . . . . . . . . . . . . . 263
11.4.3 Persistent Requests . . . . . . . . . . . . . . . . . . . . 265
11.4.4 Common Mistakes . . . . . . . . . . . . . . . . . . . . 267
11.5 Collective Operations . . . . . . . . . . . . . . . . . . . . . . 272
11.5.1 Synchronization . . . . . . . . . . . . . . . . . . . . . 272
11.6 Implementation Strategies . . . . . . . . . . . . . . . . . . . 273
11.6.1 Lazy Connection Setup . . . . . . . . . . . . . . . . . 273
11.6.2 Registered Memory . . . . . . . . . . . . . . . . . . . . 274
11.6.3 Message Passing Progress . . . . . . . . . . . . . . . . 278
11.6.4 Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 279
12 High Performance Event Communication 281
Greg Eisenhauer, Matthew Wolf, Hasan Abbasi, and Karsten Schwan
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12.2 Design Points . . . . . . . . . . . . . . . . . . . . . . . . . . 283
12.2.1 Lessons from Previous Designs . . . . . . . . . . . . . 285
12.2.2 Next Generation Event Delivery . . . . . . . . . . . . 286
12.3 The EVPath Architecture . . . . . . . . . . . . . . . . . . . . 287
12.3.1 Taxonomy of Stone Types . . . . . . . . . . . . . . . . 289
12.3.2 Data Type Handling . . . . . . . . . . . . . . . . . . . 289
12.3.3 Mobile Functions and the Cod Language . . . . . . . . 290
12.3.4 Meeting Next Generation Goals . . . . . . . . . . . . . 293
12.4 Performance Microbenchmarks . . . . . . . . . . . . . . . . . 294
12.4.1 Local Data Handling . . . . . . . . . . . . . . . . . . . 295
12.4.2 Network Operation . . . . . . . . . . . . . . . . . . . . 296
12.5 Usage Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 297
12.5.1 Implementing a Full Publish/Subscribe System . . . . 298
12.5.2 IFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
12.5.3 I/OGraph . . . . . . . . . . . . . . . . . . . . . . . . . 302
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13 The Case of the Fast Financial Feed 305
Virat Agarwal, Lin Duan, Lurng-Kuo Liu, Michaele Perrone,
Fabrizio Petrini, Davide Pasetto, and David Bader
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
13.2 Market Data Processing Systems . . . . . . . . . . . . . . . . 306
13.2.1 The Ticker Plant . . . . . . . . . . . . . . . . . . . . . 306
13.3 Performance Requirements . . . . . . . . . . . . . . . . . . . 308
13.3.1 Skyrocketing Data Rates . . . . . . . . . . . . . . . . 308
13.3.2 Low Latency Trading . . . . . . . . . . . . . . . . . . 308
13.3.3 High Performance Computing in the Data Center . . . 310
13.4 The OPRA Case Study . . . . . . . . . . . . . . . . . . . . . . 311
13.4.1 OPRA Data Encoding . . . . . . . . . . . . . . . . . . 312
13.4.2 Decoder Reference Implementation . . . . . . . . . . . 314
13.4.3 A Streamlined Bottom-Up Implementation . . . . . . 315
13.4.4 High-Level Protocol Processing with DotStar . . . . . 316
13.4.5 Experimental Results . . . . . . . . . . . . . . . . . . . 321
13.4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 324
13.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 327
14 Data-Movement Approaches for HPC Storage Systems 329
Ron A. Oldfield, Todd Kordenbrock, and Patrick Widener
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
14.2 Lustre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.2.1 Lustre Networking (LNET) . . . . . . . . . . . . . . . 332
14.2.2 Optimizations for Large-Scale I/O . . . . . . . . . . . 333
14.3 Panasas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
14.3.1 PanFS Architecture . . . . . . . . . . . . . . . . . . . 335
14.3.2 Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . 336
14.4 Parallel Virtual File System 2 (PVFS2) . . . . . . . . . . . . 337
14.4.1 BMI Design . . . . . . . . . . . . . . . . . . . . . . . . 338
14.4.2 BMI Simplifies the Client . . . . . . . . . . . . . . . . 339
14.4.3 BMI Efficiency/Performance . . . . . . . . . . . . . . 340
14.4.4 BMI Scalability . . . . . . . . . . . . . . . . . . . . . . . 341
14.4.5 BMI Portability . . . . . . . . . . . . . . . . . . . . . . 341
14.4.6 Experimental Results . . . . . . . . . . . . . . . . . . 342
14.5 Lightweight File Systems . . . . . . . . . . . . . . . . . . . . 345
14.5.1 Design of the LWFS RPC Mechanism . . . . . . . . . 345
14.5.2 LWFS RPC Implementation . . . . . . . . . . . . . . 346
14.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . 348
14.6 Other MPP File Systems . . . . . . . . . . . . . . . . . . . . 349
14.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 350
14.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 351
15 Network Simulation 353
George Riley
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
15.2 Discrete Event Simulation . . . . . . . . . . . . . . . . . . . 353
15.3 Maintaining the Event List . . . . . . . . . . . . . . . . . . . 354
15.4 Modeling Routers, Links, and End Systems . . . . . . . . . . 355
15.5 Modeling Network Packets . . . . . . . . . . . . . . . . . . . 358
15.6 Modeling the Network Applications . . . . . . . . . . . . . . 359
15.7 Visualizing the Simulation . . . . . . . . . . . . . . . . . . . 360
15.8 Distributed Simulation . . . . . . . . . . . . . . . . . . . . . 362
15.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
References 367
Index 407
List of Figures
1.1 Comparative communication performance of Purple, Red Storm,
and Blue Gene/L . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Network topologies used in the ten fastest supercomputers and
the single most parallel supercomputer . . . . . . . . . . . . . 9
1.3 Illustrations of various network topologies . . . . . . . . . . . 10
1.4 The network hierarchy of the Roadrunner supercomputer . . 14
1.5 Kautz graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Typical InfiniBand cluster . . . . . . . . . . . . . . . . . . . . 29
2.2 Consumer queuing model . . . . . . . . . . . . . . . . . . . . 30
2.3 Virtual lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Example of unreliable multicast operation . . . . . . . . . . . 34
2.5 IB transport services . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Layered design of MVAPICH/MVAPICH2 over IB. . . . . . . 48
2.7 Two-sided point-to-point performance over IB on a range of
adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.8 Application-level evaluation of MPI over InfiniBand design components . . . . 53
2.9 Lustre performance over InfiniBand. . . . . . . . . . . . . . . 56
2.10 CPU utilization in Lustre with IPoIB and native (verb-level)
protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.11 SDP architecture . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.12 Performance comparison of the Apache Web server: SDP vs.
IPoIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.13 Active polling to achieve strong cache coherency. . . . . . . . 59
2.14 Active polling performance . . . . . . . . . . . . . . . . . . . 60
3.1 Profile of network interconnects on the TOP500 . . . . . . . . 62
3.2 Hand-drawn Ethernet diagram by Robert Metcalfe. . . . . . . 66
3.3 TCP offload engines. . . . . . . . . . . . . . . . . . . . . . . . 70
3.4 iWARP protocol stack. . . . . . . . . . . . . . . . . . . . . . . . 71
3.5 Network congestion. . . . . . . . . . . . . . . . . . . . . . . . 72
3.6 VLAN-based multipath communication. . . . . . . . . . . . . 74
3.7 Myrinet MX-over-Ethernet. . . . . . . . . . . . . . . . . . . . 76
3.8 Mellanox ConnectX. . . . . . . . . . . . . . . . . . . . . . . . 77
3.9 Ethernet vs. EtherNOT: One-way latency. . . . . . . . . . . . 78
3.10 Ethernet vs. EtherNOT: Unidirectional bandwidth. . . . . . . 79
3.11 Ethernet vs. EtherNOT: Virtual Microscope application. . . . 80
3.12 Ethernet vs. EtherNOT: MPI-tile-IO application. . . . . . . . . 81
4.1 Network latency scaling trends. . . . . . . . . . . . . . . . . . 87
4.2 Memory cost and memory power trends for a commodity server
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Approximate floorplans for quad-core processors from AMD
and Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4 Breakdown of a HyperTransport link. . . . . . . . . . . . . . 92
4.5 Organization of buffers/VCs on a HT link . . . . . . . . . . . 93
4.6 HT read request packet format. . . . . . . . . . . . . . . . . . 95
4.7 Structure of the QPI protocol stack and link. . . . . . . . . . 96
4.8 QPI’s low-latency source snoop protocol. . . . . . . . . . . . . 97
4.9 An example of the PCIe complex. . . . . . . . . . . . . . . . . 99
4.10 Structure of the PCI Express packets and protocol processing 100
4.11 The Dynamic Partitioned Global Address Space model . . . . 102
4.12 Memory bridge with Opteron memory subsystem. . . . . . . . 105
4.13 Memory bridge stages. . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Comparing programmed I/O and DMA transactions to the
network interface. . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Comparing messages using an eager protocol to messages using
a rendezvous protocol. . . . . . . . . . . . . . . . . . . . . . . 116
5.3 Ping-pong bandwidth for Quadrics Elan4 and 4X SDR InfiniBand . . . . 117
5.4 Cells in associative list processing units . . . . . . . . . . . . 130
5.5 Performance advantages of an ALPU . . . . . . . . . . . . . . . 131
5.6 NIC architecture enhanced with a list management unit. . . . 133
5.7 Performance of the list management unit . . . . . . . . . . . . 135
5.8 Microcoded match unit . . . . . . . . . . . . . . . . . . . . . 136
5.9 Match unit’s wide instruction word. . . . . . . . . . . . . . . 137
5.10 Match unit performance . . . . . . . . . . . . . . . . . . . . . 140
5.11 NIC-based atomic unit . . . . . . . . . . . . . . . . . . . . . . . 141
5.12 Comparing the performance with and without cache. . . . . . 144
5.13 Assessing the impact of size and associativity on atomic unit
cache performance. . . . . . . . . . . . . . . . . . . . . . . . . 145
5.14 Pseudo code for an Allreduce using triggered operations . . . 146
7.1 Congestion management functions in popular TCP variants. . 177
8.1 Comparison of traditional TCP network stack and RDMA . . 187
8.2 Effect of overlapping computation and communication. . . . . 190
8.3 TCP and RDMA communication architecture. . . . . . . . . . 191
8.4 iWARP protocols stack. . . . . . . . . . . . . . . . . . . . . . 202
9.1 Simple protocol onload approach on multi-core platforms. . . . 211
9.2 Deploying communication stacks on dedicated cores. . . . . . 214
9.3 Deployment alternatives for network processors. . . . . . . . 222
9.4 Heterogeneous multi-core platform. . . . . . . . . . . . . . . 224
10.1 Split device driver model. . . . . . . . . . . . . . . . . . . . . 234
10.2 Direct device access model. . . . . . . . . . . . . . . . . . . . . 241
10.3 RDMA write bandwidth divided among VMs. . . . . . . . . 243
11.1 Simplistic receive processing in MPI . . . . . . . . . . . . . . 264
11.2 Serialized MPI communication . . . . . . . . . . . . . . . . . 269
11.3 Communication/computation overlap . . . . . . . . . . . . . 270
11.4 Memory copy vs. OpenFabric memory registration times . . 275
12.1 Channel-based event delivery system . . . . . . . . . . . . . 282
12.2 Complex event processing delivery system . . . . . . . . . . . 283
12.3 Event delivery system built using EVPath . . . . . . . . . . 288
12.4 Basic stone types . . . . . . . . . . . . . . . . . . . . . . . . 290
12.5 Sample EVPath data structure declaration . . . . . . . . . . . 291
12.6 Specialization filter for stock trades ranges . . . . . . . . . . 292
12.7 Specialization filter for array averaging . . . . . . . . . . . . 293
12.8 Local stone transfer time for linear and tree-structured paths 295
12.9 EVPath throughput for various data sizes . . . . . . . . . . . 297
12.10 Using event channels for communication . . . . . . . . . . . 298
12.11 ECho event channel implementation using EVPath stones . . 299
12.12 Derived event channel implementation using EVPath stones 300
12.13 CPU overhead as a function of filter rejection ratio . . . . . . 301
13.1 High-level overview of a ticker plant . . . . . . . . . . . . . . 307
13.2 OPRA market peak data rates . . . . . . . . . . . . . . . . . 309
13.3 OPRA FAST encoded packet format . . . . . . . . . . . . . 314
13.4 OPRA reference decoder . . . . . . . . . . . . . . . . . . . . 315
13.5 Bottom-up reference decoder block diagram . . . . . . . . . . 316
13.6 Presence and field map bit manipulation . . . . . . . . . . . 317
13.7 A graphical representation of the DotStar compiler steps. . . 318
13.8 DotStar runtime . . . . . . . . . . . . . . . . . . . . . . . . . 319
13.9 DotStar source code . . . . . . . . . . . . . . . . . . . . . . . 320
13.10 OPRA message distribution . . . . . . . . . . . . . . . . . . 322
13.11 Performance comparison on several hardware platforms . . . 323
14.1 Partitioned architecture . . . . . . . . . . . . . . . . . . . . . 330
14.2 Lustre software stack. . . . . . . . . . . . . . . . . . . . . . . 332
14.3 Server-directed DMA handshake in Lustre . . . . . . . . . . 333
14.4 Panasas DirectFlow architecture . . . . . . . . . . . . . . . . 334
14.5 Parallel NFS architecture . . . . . . . . . . . . . . . . . . . . 337
14.6 PVFS2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
14.7 Round trip latency . . . . . . . . . . . . . . . . . . . . . . . 342
14.8 Point-to-point bandwidth for 120MB transfer . . . . . . . . . 342
14.9 Aggregate read pattern, 10MB per client/server pair . . . . . 343
14.10 LWFS RPC protocol . . . . . . . . . . . . . . . . . . . . . . 346
14.11 The 16-byte data structure used for each of the experiments. 347
14.12 Comparison of LWFS RPC to various other mechanisms. . . 349
15.1 Moderately complex network topology. . . . . . . . . . . . . 358
15.2 Network simulation animation. . . . . . . . . . . . . . . . . . 360
15.3 Network simulation visualization. . . . . . . . . . . . . . . . . 361
15.4 The space-parallel method for distributed network simulation. 363
List of Tables
1.1 Network characteristics of Purple, Red Storm, and Blue Gene/L 4
1.2 Properties of a ring vs. a fully connected network of n nodes 7
1.3 Hypercube conference: length of proceedings . . . . . . . . . . 11
4.1 Latency results for HToE bridge. . . . . . . . . . . . . . . . . 107
4.2 Latency numbers used for evaluation of performance penalties. 107
5.1 Breakdown of the assembly code. . . . . . . . . . . . . . . . . 139
12.1 Comparing split stone and filter stone execution times . . . . 296
13.1 OPRA message categories with description. . . . . . . . . . . 313
13.2 Hardware platforms used during experimental evaluation. . . 324
13.3 Detailed performance analysis (Intel Xeon Q6600, E5472 and
AMD Opteron 2352) . . . . . . . . . . . . . . . . . . . . . . . 325
13.4 Detailed performance analysis (SUN UltraSparc T2 and IBM
Power6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
13.5 DotStar latency on different platforms. . . . . . . . . . . . . . 327
14.1 Compute and I/O nodes for production DOE MPP systems
used since the early 1990s. . . . . . . . . . . . . . . . . . . . . . 331
Preface
As technology pushes the Petascale barrier, and the next generation Exascale
Computing requirements are being defined, it is clear that this type of
computational capacity can be achieved only by accompanying advancement in
communication technology. Next generation computing platforms will consist
of hundreds of thousands of interconnected computational resources. Therefore,
careful consideration must be placed on the choice of the communication
fabrics connecting everything, from cores and accelerators on individual platform
nodes, to the internode fabric for coupling nodes in large-scale data
centers and computational complexes, to wide area communication channels
connecting geographically disparate compute resources to remote data sources
and application end-users.
The traditional scientific HPC community has long pushed the envelope
on the capabilities delivered by existing communication fabrics. The computing
complex at Oak Ridge National Laboratory has an impressive 332 Gigabytes/sec
I/O bandwidth and a 786 Terabyte/sec global interconnection
bandwidth [305], used by extreme science applications from “subatomic to
galactic scales” domains [305], and supporting innovation in renewable energy
sources, climate modeling, and medicine. Riding on a curve of a thousandfold
increase in computational capabilities over the last five years alone, it is expected
that the growth in both compute resources, as well as the underlying
communication infrastructure and its capabilities, will continue to climb at
mind-blowing rates.
Other classes of HPC applications have equally impressive communication
needs beyond the main computing complex, as they require data from remote
sources — observatories, databases, remote instruments such as particle
accelerators, or large-scale data-intensive collaborations across many globally
distributed sites. The computational grids necessary for these applications
must be enabled by communication technology capable of moving terabytes of
data in a timely manner, while also supporting the interactive nature of the
remote collaborations.
More unusual, however, are the emerging extreme scale communication needs
outside of the traditional HPC domain. First, the confluence of the multi-
core nature of emerging hardware resources, coupled with the renaissance of
virtualization technology, is creating exceptional consolidation possibilities, and
giving rise to compute clouds of virtualized multi-core resources. At the same
time, the management complexity and operating costs associated with today’s
IT hardware and software stacks are pushing a broad range of enterprise class
applications into such clouds. As clouds spill outside of data center boundaries
and move towards becoming Exascale platforms across globally distributed
facilities [191,320], their communication needs — for computations distributed
across inter- and intra-cloud resources, for management, provisioning and QoS,
for workload migration, and for end-user interactions — become far richer.
Next, even enterprise applications which, due to their critical nature, are not
likely candidates for prime cloud deployment, are witnessing a “skyrocketing”
increase in communication needs. For instance, financial market analyses are
forecasting as many as 128 billion electronic trades a day just by 2010 [422], and
similar trends toward increased data rates and lower latency requirements are
expected in other market segments, from ticketing services, to parcel delivery
and distribution, to inventory management and forecasting.
Finally, the diversity in commodity end-user applications, from 3D distributed
games, to content sharing and rich multimedia-based applications, to
telecommunication and telepresence types of services, supported on everything
from high-end workstations to cellphones and embedded devices, is further
exacerbating the need for high performance communication services.
Objectives
No silver bullet will solve all of the communications-related challenges
faced by the emerging types of applications, platforms, and usage scenarios
described above. In order to address these requirements we need an ecosystem
of solutions along a stack of technology layers: (i) efficient interconnection
hardware; (ii) scalable, robust, end-to-end protocols; and (iii) system services
and tools specifically targeting emerging multi-core environments.
Toward this end, this book is a unique effort to discuss technological advances
and challenges related to high performance communications, by addressing each
layer in the vertical stack — the low-level architectural features of the hardware
interconnect and interconnection devices; the selection of communication
protocols; the implementation of protocol stacks and other operating system
features, including on modern, homogeneous or heterogeneous multi-core
platforms; the middleware software and libraries used in applications with high-
performance communication requirements; and the higher-level application
services such as file systems, monitoring services, and simulation tools. The
rationale behind this approach is that no single solution, applied at one
particular layer, can help applications address all performance-related issues
with their communication services. Instead, a coordinated effort is needed to
eliminate bottlenecks and address optimization opportunities at each layer —
from the architectural features of the hardware, through the protocols and their
implementation in OS kernels, to the manner in which application services
and middleware are using the underlying platforms.
This book is an edited collection of chapters on these topics. The choice to
organize this title as a collection of chapters contributed by different individuals
is a natural one — the vertical scope of the work calls for contributions from
researchers with many different areas of expertise. Each of the chapters is
organized in a manner which includes historic perspective, discussion of state-
of-the-art technology solutions and current trends, summary of major research
efforts in the community, and, where appropriate, greater level of detail on a
particular effort that the chapter contributor is mostly affiliated with.
The topics covered in each chapter are important and complex, and deserve
a separate book to cover adequately and in-depth all technical challenges
which surround them. Many such books exist [71, 128, 181, 280, 408, etc.].
What is unique about this title, however, is that it is a more comprehensive text for a
broader audience, spanning a community with interests across the entire stack.
Furthermore, each chapter abounds in references to past and current technical
papers and projects, which can guide the reader to sources of additional
information.
Finally, it is worth pointing out that even this type of text, which touches on
many different types of technologies and many layers across the stack, is by no
means complete. Many more topics remain only briefly mentioned throughout
the book, without having a full chapter dedicated to them. These include
topics related to physical-layer technologies, routing protocols and router architectures,
advanced communications protocols for multimedia communications,
modeling and analysis techniques, compiler and programming language-related
issues, and others.
Target audience
The relevance of this book is multifold. First, it is targeted at academics
for instruction in advanced courses on high-performance I/O and communication,
offered as part of graduate or undergraduate curricula in Computer
Science or Computer Engineering. Similar courses have been taught for several
years at a number of universities, including Georgia Tech (High-Performance
Communications, taught by this book’s editor), The Ohio State University
(Network-Based Computing, by Prof. Dhabaleswar K. Panda), and Auburn
University (Special Topics in High Performance Computing, by Prof. Weikuan Yu);
in addition, significant portions of the material are covered in more traditional courses
on High Performance Computing in many Computer Science or Engineering
programs worldwide.
In addition, this book can serve as relevant reference and instructional material
for students and researchers in other science and engineering areas, in
academia or at the National Labs, who are working on problems with significant
communication requirements, and on high performance platforms such
as supercomputers, high-end clusters or large-scale wide-area grids. Many of
these scientists may not have formal computer science/computer engineering
education, and this title aims to be a single text which can help steer them to
more easily identify performance improvement opportunities and find solutions
best suited for the applications they are developing and the platforms they
are using.
Finally, the text provides researchers who are specifically addressing problems
and developing technologies at a particular layer with a single reference that
surveys the state-of-the-art at other layers. By offering a concise overview of
related efforts from “above” or “below,” this book can be helpful in identifying
the best ways to position one’s work, and in ensuring that other elements of
the stack are appropriately leveraged.
Organization
The book is organized in 15 chapters, which roughly follow a vertical stack
from bottom to top.
• Chapter 1 discusses design alternatives for interconnection networks in
massively parallel systems, including examples from current Cray, Blue
Gene, as well as cluster-based supercomputers from the Top500 list.
• Chapter 2 provides in-depth discussion of the InfiniBand interconnection
technology, its hardware elements, protocols and features, and
includes a number of case studies from the HPC and enterprise domains,
which illustrate its suitability for a range of applications.
• Chapter 3 contrasts the traditional high-performance interconnection
solutions to the current capabilities offered by Ethernet-based interconnects,
and demonstrates several convergence trends among Ethernet and
EtherNOT technologies.
• Chapter 4 describes the key board- and chip-level interconnects such
as PCI Express, HyperTransport, and QPI; the capabilities they offer for
tighter system integration; and a case study for a service enabled by the
availability of low-latency integrated interconnects — global partitioned
address spaces.
• Chapter 5 discusses a number of considerations regarding the hardware
and software architecture of network interface devices (NICs), and
contrasts the design points present in NICs in a number of existing
interconnection fabrics.
• Chapter 6 complements this discussion by focusing on the characteristics
of the APIs natively supported by the different NIC platforms.
• Chapter 7 focuses on IP-based transport protocols and provides a historic
perspective on the different variants and performance optimization
opportunities for TCP transports, as well as a brief discussion of other
IP-based transport protocols.
• Chapter 8 describes in greater detail Remote Direct Memory Access
(RDMA) as an approach to network communication, along with iWARP,
an RDMA-based solution based on TCP transports.
• Chapter 9 more explicitly discusses opportunities to accelerate the
execution of communication services on multi-core platforms, including
general purpose homogeneous multi-cores, specialized network processing
accelerators, as well as emerging heterogeneous many-core platforms,
comprising both general purpose and accelerator resources.
• Chapter 10 targets the mechanisms used to ensure high-performance
communication services in virtualized platforms by discussing VMM-level
techniques as well as device-level features which help attain near-native
performance.
• Chapter 11 provides some historical perspective on MPI, the de facto
communication standard in high-performance computing, and outlines
some of the challenges in creating software and hardware implementations
of the MPI standard. These challenges are directly related to MPI
applications and their understanding can impact the programmers’ ability
to write better MPI-based codes.
• Chapter 12 looks at event-based middleware services as a means to address
the high-performance needs of many classes of HPC and enterprise
applications. It also gives an overview of the EVPath high-performance
middleware stack and demonstrates its utility in several different contexts.
• Chapter 13 first describes an important class of applications with
high performance communication requirements — electronic trading
platforms used in the financial industry. Next, it provides implementation
detail and performance analysis for the authors’ approach to leverage
the capabilities of general purpose multi-core platforms and to attain
impressive performance levels for one of the key components of the
trading engine.
• Chapter 14 describes the data-movement approaches used by a selection
of commercial, open-source, and research-based storage systems used
by massively parallel platforms, including Lustre, Panasas, PVFS2, and
LWFS.
• Chapter 15 discusses the overall approach to creating the simulation
tools needed to design, evaluate and experiment with a range of parameters
in the interconnection technology space, so as to help identify design
points with adequate performance behaviors.
Acknowledgments
This book would have never been possible without the contributions from the
individual chapter authors. My immense gratitude goes out to all of them for
their unyielding enthusiasm, expertise, and time.
Next, I would like to thank my editor Alan Apt, for approaching and
encouraging me to write this book, and the extended production team at
Taylor & Francis, for all their help with the preparation of the manuscript.
I am especially grateful to my mentor and colleague, Karsten Schwan, for
supporting my teaching of the High Performance Communications course at
Georgia Tech, which was the basis for this book. He and my close collaborators,
Greg Eisenhauer and Matthew Wolf, were particularly accommodating during
the most intense periods of manuscript preparation in providing me with the
necessary time to complete the work.
Finally, I would like to thank my family, for all their love, care, and unconditional
support. I dedicate this book to them.
Ada Gavrilovska
Atlanta, 2009
About the Editor
Ada Gavrilovska is a Research Scientist in the College of Computing at
Georgia Tech, and at the Center for Experimental Research in Computer
Systems (CERCS). Her research interests include conducting experimental
systems research, specifically addressing high-performance applications on
distributed heterogeneous platforms, and focusing on topics that range from
operating and distributed systems, to virtualization, to programmable network
devices and communication accelerators, to active and adaptive middleware
and I/O. Most recently, she has been involved with several projects focused
on development of efficient virtualization solutions for platforms ranging from
heterogeneous multi-core systems to large-scale cloud environments.
In addition to research, Dr. Gavrilovska has a strong commitment to teaching.
At Georgia Tech she teaches courses on advanced operating systems and high-
performance communications topics, and is deeply involved in a larger effort
aimed at upgrading the Computer Science and Engineering curriculum with
multicore-related content.
Dr. Gavrilovska has a B.S. in Electrical and Computer Engineering from
the Faculty of Electrical Engineering, University Sts. Cyril and Methodius, in
Skopje, Macedonia, and M.S. and Ph.D. degrees in Computer Science from
Georgia Tech, both completed under the guidance of Dr. Karsten Schwan.
Her work has been supported by the National Science Foundation, the U.S.
Department of Energy, and through numerous collaborations with industry,
including Intel, IBM, HP, Cisco Systems, Netronome Systems, and others.
List of Contributors
Hasan Abbasi
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Virat Agarwal
IBM TJ Watson Research Center
Yorktown Heights, New York
David Bader
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Pavan Balaji
Argonne National Laboratory
Argonne, Illinois
Ron Brightwell
Sandia National Laboratories
Albuquerque, New Mexico
Dennis Dalessandro
Ohio Supercomputer Center
Columbus, Ohio
Lin Duan
IBM TJ Watson Research Center
Yorktown Heights, New York
Greg Eisenhauer
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Wu-chun Feng
Departments of Computer Science
and Electrical & Computer
Engineering
Virginia Tech
Blacksburg, Virginia
Ada Gavrilovska
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Scott Hemmert
Sandia National Laboratories
Albuquerque, New Mexico
Matthew Jon Koop
Department of Computer Science
and Engineering
The Ohio State University
Columbus, Ohio
Todd Kordenbrock
Hewlett-Packard
Albuquerque, New Mexico
Lurng-Kuo Liu
IBM TJ Watson Research Center
Yorktown Heights, New York
Ron A. Oldfield
Sandia National Laboratories
Albuquerque, New Mexico
Scott Pakin
Los Alamos National Laboratory
Los Alamos, New Mexico
Dhabaleswar K. Panda
Department of Computer Science
and Engineering
The Ohio State University
Columbus, Ohio
Davide Pasetto
IBM Computational Science Center
Dublin, Ireland
Michaele Perrone
IBM TJ Watson Research Center
Yorktown Heights, New York
Fabrizio Petrini
IBM TJ Watson Research Center
Yorktown Heights, New York
Adit Ranadive
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Dulloor Rao
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
George Riley
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Karsten Schwan
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Jeff Squyres
Cisco Systems
Louisville, Kentucky
Sayantan Sur
IBM TJ Watson Research Center
Hawthorne, New York
Keith Underwood
Intel Corporation
Albuquerque, New Mexico
Patrick Widener
University of New Mexico
Albuquerque, New Mexico
Matthew Wolf
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Pete Wyckoff
Ohio Supercomputer Center
Columbus, Ohio
Sudhakar Yalamanchili
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Jeffrey Young
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Chapter 1
High Performance Interconnects for
Massively Parallel Systems
Scott Pakin
Los Alamos National Laboratory
1.1 Introduction
If you were intent on building the world’s fastest supercomputer, how would
you design the interconnection network? As with any architectural endeavor,
there are a suite of trade-offs and design decisions that need to be considered:
Maximize performance Ideally, every application should run fast, but
would it be acceptable for a smaller set of “important” applications
or application classes to run fast? If so, is the overall performance of
those applications dominated by communication performance? Do those
applications use a known communication pattern that the network can
optimize?
Minimize cost Cheaper is better, but how much communication performance
are you willing to sacrifice to cut costs? Can you exploit existing hardware
components, or do you have to do a full-custom design?
Maximize scalability Some networks are fast and cost-efficient at small
scale but quickly become expensive as the number of nodes increases.
How can you ensure that you will not need to rethink the network design
from scratch when you want to increase the system's node count? Can
the network grow incrementally? (Networks that require a power-of-two
number of nodes quickly become prohibitively expensive, for example.)
Minimize complexity How hard is it to reason about application performance?
To observe good performance, do parallel algorithms need to be
constructed specifically for your network? Do an application’s processes
need to be mapped to nodes in some non-straightforward manner to keep
the network from becoming a bottleneck? As for some more mundane
complexity issues, how tricky is it for technicians to lay the network
cables correctly, and how much rewiring is needed when the network size
increases?
Maximize robustness The more components (switches, cables, etc.) compose
the network, the more likely it is that one of them will fail. Would
you utilize a naturally fault-tolerant topology at the expense of added
complexity or replicate the entire network at the expense of added cost?
Minimize power consumption Current estimates are that a sustained
megawatt of power costs between US$200,000–1,200,000 per year [144].
How much power are you willing to let the network consume? How
much performance or robustness are you willing to sacrifice to reduce
the network’s power consumption?
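A quick back-of-the-envelope check, assuming only that electricity costs
somewhere between roughly US$0.023 and US$0.14 per kilowatt-hour, shows
where estimates of this magnitude come from:

\[
1\,\mathrm{MW} \times 8760\,\mathrm{h/yr} = 8.76\times10^{6}\,\mathrm{kWh/yr},
\qquad
8.76\times10^{6}\,\mathrm{kWh/yr} \times \mathrm{US}\$0.023\text{--}0.14/\mathrm{kWh} \approx \mathrm{US}\$200{,}000\text{--}1{,}200{,}000.
\]

The wide range quoted above thus largely reflects how much electricity prices
vary from site to site.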
There are no easy answers to those questions, and the challenges increase in
difficulty with larger and larger networks: Network performance is more critical;
costs may grow disproportionately to the rest of the system; incremental
growth needs to be carefully considered; complexity is more likely to frustrate
application developers; fault tolerance is both more important and more
expensive; and power consumption becomes a serious concern. In the remainder
of this chapter we examine a few aspects of the network-design balancing act.
Section 1.2 quantifies the measured network performance of a few massively
parallel supercomputers. Section 1.3 discusses network topology, a key design
decision and one that impacts all of performance, cost, scalability, complexity,
robustness, and power consumption. In Section 1.4 we turn our attention
to some of the “bells and whistles” that a network designer might consider
including when trying to balance various design constraints. We briefly discuss
some future directions in network design in Section 1.5 and summarize the
chapter in Section 1.6.
Terminology note In much of the system-area network literature, the term
switch implies an indirect network while the term router implies a direct
network. In the wide-area network literature, in contrast, the term router
typically implies a switch plus additional logic that operates on packet contents.
For simplicity, in this chapter we have chosen to consistently use the term
switch when describing a network’s switching hardware.
1.2 Performance
One way to discuss the performance of an interconnection network is in
terms of various mathematical characteristics of the topology including — as a
function at least of the number of nodes — the network diameter (worst-case
distance in switch hops between two nodes), average-case communication dis-
tance, bisection width (the minimum number of links that need to be removed
to partition the network into two equal halves), switch count, and network
capacity (maximum number of links that can carry data simultaneously).
However, there are many nuances of real-world networks that are not cap-
tured by simple mathematical expressions. The manner in which applications,
messaging layers, network interfaces, and network switches interact can be
complex and unintuitive. For example, Arber and Pakin demonstrated that the
address in memory at which a message buffer is placed can have a significant
impact on communication performance and that different systems favor differ-
ent buffer alignments [17]. Because of the discrepancy between the expected
performance of an idealized network subsystem and the measured performance
of a physical one, the focus of this section is on actual measurements of parallel
supercomputers containing thousands to tens of thousands of nodes.
1.2.1 Metrics
The most common metric used in measuring network performance is half of
the round-trip communication time (½RTT), often measured in microseconds.
The measurement procedure is as follows: Process A reads the time and sends
a message to process B; upon receipt of process A’s message, process B sends
an equal-sized message back to process A; upon receipt of process B’s message,
process A reads the time again, divides the elapsed time by two, and reports
the result as ½RTT. The purpose of using round-trip communication is that
it does not require the endpoints to have precisely synchronized clocks. In
many papers, ½RTT is referred to as latency and the message size divided by
½RTT as bandwidth, although these definitions are far from universal. In the
LogP parallel-system model [100], for instance, latency refers solely to “wire
time” for a zero-byte message and is distinguished from overhead, the time
that a CPU is running communication code instead of application code. In
this section, we use the less precise but more common definitions of latency
and bandwidth in terms of ½RTT for an arbitrary-size message. Furthermore,
all performance measurements in this section are based on tests written using
MPI [408], currently the de facto standard messaging layer.
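The measurement procedure above maps directly onto a short MPI program. The following is a minimal sketch (the message size, iteration count, and the decision to average over a fixed number of repetitions are illustrative choices, not values taken from any study cited in this chapter):

/*
 * Minimal ping-pong sketch of the 1/2 RTT measurement described above.
 * Run with at least two MPI ranks; ranks other than 0 and 1 sit idle.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const int iters = 1000;       /* repetitions to average out timer noise */
    const int msg_bytes = 8;      /* vary this to trace out a bandwidth curve */
    char *buf = NULL;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(msg_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            /* Process A: send, then wait for the echo. */
            MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* Process B: echo an equal-sized message back. */
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double half_rtt_us = (t1 - t0) / iters / 2.0 * 1e6;
        printf("1/2 RTT for %d-byte messages: %.2f us\n", msg_bytes, half_rtt_us);
        printf("bandwidth (size / 1/2 RTT): %.2f MB/s\n",
               msg_bytes / (half_rtt_us * 1e-6) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Production benchmarks add warm-up iterations and sweep the message size, but the core exchange is exactly the one described above.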
1.2.2 Application Sensitivity to Communication Performance
An underlying premise of this entire book is that high-performance ap-
plications require high-performance communication. Can we quantify that
statement in the context of massively parallel systems? In a 2006 paper,
Kerbyson used detailed, application-specific, analytical performance models to
examine the impact of varying the network latency, bandwidth, and number
of CPUs per node [224]. Kerbyson’s study investigated two large applications
(SAGE [450] and Partisn [24]) and one application kernel (Sweep3D [233])
TABLE 1.1: Network characteristics of Purple, Red Storm, and Blue Gene/L [187]

Metric                         Purple          Red Storm               Blue Gene/L
CPU cores                      12,288          10,368                  65,536
Cores per node                 8               1                       2
Nodes                          1,536           10,368                  32,768
NICs per node                  2               1                       1
Topology (cf. §1.3)            16-ary 3-tree   27×16×24 mesh in x,y;   32×32×32 torus
                                               torus in z              in x,y,z
Peak link bandwidth (MB/s)     2,048           3,891                   175
Achieved bandwidth (MB/s)      2,913           1,664                   154
Achieved min. latency (µs)     4.4             6.9                     2.8
at three different network sizes. Of the three applications, Partisn is the
most sensitive to communication performance. At 1024 processors, Partisn’s
performance can be improved by 7.4% by reducing latency from 4 µs to 1.5 µs,
11.8% by increasing bandwidth from 0.9 GB/s to 1.6 GB/s, or 16.4% by de-
creasing the number of CPUs per node from 4 to 2, effectively halving the
contention for the network interface controller (NIC). Overall, the performance
difference between the worst set of network parameters studied (4 µs latency,
0.9 GB/s bandwidth, and 4 CPUs/node) and the best (1.5 µs latency, 1.6 GB/s
bandwidth, and 2 CPUs/node) is 68% for Partisn, 24% for Sweep3D, and
15% for SAGE, indicating that network performance is in fact important to
parallel-application performance.
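A first-order way to see why both parameters matter (a minimal sketch, not the detailed application-specific models used in Kerbyson's study) is the linear cost model T(m) = α + m/β for an m-byte message, with latency α in µs and bandwidth β in bytes per µs (0.9 GB/s ≈ 0.9×10³ B/µs). Plugging in the studied extremes:

\[
T(1\,\text{KB}) \approx 4 + \frac{10^3}{0.9\times 10^3} \approx 5.1\ \mu\text{s}
\quad\text{vs.}\quad
1.5 + \frac{10^3}{1.6\times 10^3} \approx 2.1\ \mu\text{s},
\]
\[
T(1\,\text{MB}) \approx 4 + \frac{10^6}{0.9\times 10^3} \approx 1115\ \mu\text{s}
\quad\text{vs.}\quad
1.5 + \frac{10^6}{1.6\times 10^3} \approx 626\ \mu\text{s}.
\]

Small messages are dominated by latency and large messages by bandwidth, so an application's message-size mix largely determines which network parameter it is sensitive to.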
Of course, these results are sensitive to the particular applications, input
parameters, and the selected variations in network characteristics. Neverthe-
less, other researchers have also found communication performance to be an
important contributor to overall application performance. (Martin et al.’s
analysis showing the significance of overhead [277] is an oft-cited study, for
example.) The point is that high-performance communication is needed for
high-performance applications.
1.2.3 Measurements on Massively Parallel Systems
In 2005, Hoisie et al. [187] examined the performance of three of the most
powerful supercomputers of the day: IBM’s Purple system [248] at Lawrence
Livermore National Laboratory, Cray/Sandia’s Red Storm system [58] at
Sandia National Laboratories, and IBM’s Blue Gene/L system [2] at Lawrence
Livermore National Laboratory. Table 1.1 summarizes the node and network
characteristics of each of these three systems at the time at which Hoisie et al.
ran their measurements. (Since the data was acquired all three systems have
had substantial hardware and/or software upgrades.)
FIGURE 1.1: Comparative communication performance of Purple, Red Storm,
and Blue Gene/L.
Purple is a traditional cluster, with NICs located on the I/O bus and
connecting to a network fabric. Although eight CPUs share the two NICs in a
node, the messaging software can stripe data across the two NICs, resulting in a
potential doubling of bandwidth. (This explains how the measured bandwidth
in Table 1.1 can exceed the peak link bandwidth.) Red Storm and Blue Gene/L
are massively parallel processors (MPPs), with the nodes and network more
tightly integrated.
Figure 1.1 reproduces some of the more interesting measurements from
Hoisie et al.’s paper. Figure 1.1(a) shows the ½RTT of a 0-byte message sent
from node 0 to each of nodes 1–1023 in turn. As the figure shows, Purple’s
latency is largely independent of distance (although plateaus representing
each of the three switch levels are in fact visible); Blue Gene/L’s latency
clearly matches the Manhattan distance between the two nodes, with peaks
and valleys in the graph representing the use of the torus links in each 32-node
row and 32×32-node plane; however, Red Storm’s latency appears erratic.
This is because Red Storm assigns ranks in the computation based on a node’s
physical location on the machine room floor — aisle (0–3), cabinet within an
aisle (0–26), “cage” (collection of processor boards) within a cabinet (0–2),
board within a cage (0–7), and socket within a board (0–3) — not, as expected,
by the node’s x (0–26), y (0–15), and z (0–23) coordinates on the mesh.
While the former mapping may be convenient for pinpointing faulty hardware,
the latter is more natural for application programmers. Unexpected node
mappings are one of the subtleties that differentiate measurements of network
performance from expectations based on simulations, models, or theoretical
characteristics.
Figure 1.1(b) depicts the impact of link contention on achievable bandwidth.
At contention level 0, two nodes are measuring bandwidth as described in
Section 1.2.1. At contention level 1, two other nodes lying between the first pair
exchange messages while the first pair is measuring bandwidth; at contention
level 2, another pair of coincident nodes consumes network bandwidth, and so
on. On Red Storm and Blue Gene/L, the tests are performed across a single
dimension of the mesh/torus to force all of the messages to overlap. Each
curve in Figure 1.1(b) tells a different story. The Blue Gene/L curve contains
plateaus representing messages traveling in alternating directions across the
torus. Purple’s bandwidth degrades rapidly up to the node size (8 CPUs) then
levels off, indicating that contention for the NIC is a greater problem than
contention within the rest of the network. Relative to the other two systems,
Red Storm observes comparatively little performance degradation. This is
because Red Storm’s link speed is high relative to the speed at which a single
NIC can inject messages into the network. Consequently, it takes a relatively
large number of concurrent messages to saturate a network link. In fact, Red
Storm’s overspecified network links largely reduce the impact of the unusual
node mappings seen in Figure 1.1(a).
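The contention experiment can be sketched in the same style. The fragment below assumes that consecutive MPI ranks land on consecutive nodes along a single mesh/torus dimension and that the job uses an even number of ranks (both assumptions of this sketch, not guarantees of any system). Pair i consists of ranks i and P-1-i, so every inner pair's traffic lies between the endpoints of the outermost, measuring pair, and the contention level is P/2 - 1:

/*
 * Sketch of the link-contention test: nested pairs exchange messages
 * concurrently while the outermost pair (ranks 0 and P-1) reports its
 * achieved per-direction bandwidth.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MiB messages (illustrative) */
#define ITERS     100

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sbuf = malloc(MSG_BYTES);
    char *rbuf = malloc(MSG_BYTES);
    int peer = nprocs - 1 - rank;   /* nested pairing: 0<->P-1, 1<->P-2, ... */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                     rbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("contention level %d: %.1f MB/s per direction\n",
               nprocs / 2 - 1,
               (double)MSG_BYTES * ITERS / (t1 - t0) / 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

As the Red Storm results above illustrate, the rank-to-node mapping assumption is exactly the kind of detail that can quietly change what such a test measures.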
Which is more important to a massively parallel architecture: faster proces-
sors or a faster network? The tradeoff is that faster processors raise the peak
performance of the system, but a faster network enables more applications
to see larger fractions of peak performance. The correct choice depends on
the specifics of the system and is analogous to the choice of eating a small
piece of a large pie or a large piece of a small pie. On an infinitely fast
network (relative to computation speed), all parallelism that can be exposed —
even, say, in an arithmetic operation as trivial as (a + b) · (c + d) — leads
to improved performance. On a network with finite speed but that is still
fairly fast relative to computation speed, trivial operations are best performed
sequentially although small blocks of work can still benefit from running in
parallel. On a network that is much slower than what the processors can
feed (e.g., a wide-area computational Grid [151]), only the coarsest-grained
applications are likely to run faster in parallel than sequentially on a single
machine. As Hoisie et al. quantify [187], Blue Gene/L’s peak performance is
18 times that of the older ASCI Q system [269] but its relatively slow network
limits it to running SAGE [450] only 1.6 times as fast as ASCI Q. In contrast,
Red Storm’s peak performance is only 2 times ASCI Q’s but its relatively fast
network enables it to run SAGE 2.45 times as fast as ASCI Q. Overspecifying
the network was in fact an important aspect of Red Storm’s design [58].
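One back-of-the-envelope way to read those numbers (a normalization for illustration, not a metric used in the cited study) is to divide each machine's SAGE speedup over ASCI Q by its peak-performance ratio:

\[
\text{Blue Gene/L: } \frac{1.6}{18} \approx 0.09,
\qquad
\text{Red Storm: } \frac{2.45}{2} \approx 1.2 .
\]

By this measure, Red Storm converts raw peak into delivered application speedup roughly an order of magnitude more effectively on this workload.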
1.3 Network Topology
One of the fundamental design decisions when architecting an interconnection
network for a massively parallel system is the choice of network topology. As
mentioned briefly at the start of Section 1.2, there are a wide variety of metrics
one can use to compare topologies. Selecting a topology also involves a large
TABLE 1.2: Properties of a ring vs. a fully connected network of n nodes

Metric          Ring        Fully connected
Diameter        n/2         1
Avg. dist.      n/4         1
Degree          2           n − 1
Bisection       2           (n/2)²
Switches        n           n
Minimal paths   1           1
Contention      blocking    nonblocking
number of tradeoffs that can impact performance, cost, scalability, complexity,
robustness, and power consumption:
Minimize diameter Reducing the hop count (number of switch crossings)
should help reduce the overall latency. Given a choice, however, should
you minimize the worst-case hop count (the diameter) or the average
hop count?
Minimize degree How many incoming/outgoing links are there per node?
Low-radix networks devote more bandwidth to each link, improving
performance for communication patterns that map nicely to the topol-
ogy (e.g., nearest-neighbor 3-space communication on a 3-D mesh).
High-radix networks reduce average latency — even for communication
patterns that are a poor match to the topology — and the cost of the
network in terms of the number of switches needed to interconnect a
given number of nodes. A second degree-related question to consider is
whether the number of links per node is constant or whether it increases
with network size.
Maximize bisection width The minimum number of links that must be
removed to split the network into two disjoint pieces with equal node
count is called the bisection width. (A related term, bisection bandwidth,
refers to the aggregate data rate between the two halves of the network.)
A large bisection width may improve fault tolerance, routability, and
global throughput. However, it may also greatly increase the cost of the
network if it requires substantially more switches.
Minimize switch count How many switches are needed to interconnect
n nodes? O(n) implies that the network can cost-effectively scale up to
large numbers of nodes; O(n²) topologies are impractical for all but the
smallest networks.
Maximize routability How many minimal paths are there from a source
node to a destination node? Is the topology naturally deadlock-free
or does it require extra routing sophistication in the network to main-
tain deadlock freedom? The challenge is to ensure that concurrently
transmitted messages cannot mutually block each other’s progress while
simultaneously allowing the use of as many paths as possible between a
source and a destination [126].
Minimize contention What communication patterns (mappings from a set
of source nodes to a — possibly overlapping — set of destination nodes)
can proceed with no two paths sharing a link? (Sharing implies reduced
bandwidth.) As Jajszczyk explains in his 2003 survey paper, a network
can be categorized as nonblocking in the strict sense, nonblocking in
the wide sense, repackable, rearrangeable, or blocking, based on the level
of effort (roughly speaking) needed to avoid contention for arbitrary
mappings of sources to destinations [215]. But how important is it to
support arbitrary mappings — vs. just a few common communication
patterns — in a contention-free manner? Consider also that the use
of packet switching and adaptive routing can reduce the impact of
contention on performance.
Consider two topologies that represent the extremes of connectivity: a ring
of n nodes (in which nodes are arranged in a circle, and each node connects only
to its “left” and “right” neighbors) and a fully connected network (in which
each of the n nodes connects directly to each of the other n−1 nodes). Table 1.2
summarizes these two topologies in terms of the preceding characteristics. As
that table shows, neither topology outperforms the other in all cases (as is
typical for any two given topologies), and neither topology provides many
minimal paths, although the fully connected network provides a large number
of non-minimal paths. In terms of scalability, the ring’s primary shortcomings
are its diameter and bisection width, and the fully connected network’s primary
shortcoming is its degree. Consequently, in practice, neither topology is found
in massively parallel systems.
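Those closed-form entries are easy to verify computationally. The sketch below (plain C, with an illustrative node count of N = 16 and one switch assumed per node) builds both topologies as adjacency matrices, runs a breadth-first search from every node, and reports the degree, diameter, and average distance; the ring comes out near 2, n/2, and n/4, and the fully connected network comes out as n−1, 1, and 1:

/*
 * Sketch that checks the distance-related rows of Table 1.2 by brute force:
 * build a ring and a fully connected network on N nodes, then compute the
 * diameter and average inter-node distance with breadth-first search.
 */
#include <stdio.h>
#include <string.h>

#define N 16                      /* illustrative node count */

static int adj[N][N];             /* adjacency matrix */

static void metrics(const char *name)
{
    int dist[N], queue[N];
    int diameter = 0, degree = 0;
    double total = 0.0;

    for (int v = 0; v < N; v++)   /* degree of node 0 (same for every node here) */
        degree += adj[0][v];

    for (int src = 0; src < N; src++) {
        /* Unweighted single-source shortest paths via BFS. */
        int head = 0, tail = 0;
        for (int i = 0; i < N; i++) dist[i] = -1;
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < N; v++)
                if (adj[u][v] && dist[v] < 0) {
                    dist[v] = dist[u] + 1;
                    queue[tail++] = v;
                }
        }
        for (int v = 0; v < N; v++) {
            if (dist[v] > diameter) diameter = dist[v];
            total += dist[v];
        }
    }
    printf("%-16s degree=%-3d diameter=%-3d avg. distance=%.2f\n",
           name, degree, diameter, total / (N * (N - 1)));
}

int main(void)
{
    /* Ring: node i connects to its two neighbors. */
    memset(adj, 0, sizeof adj);
    for (int i = 0; i < N; i++)
        adj[i][(i + 1) % N] = adj[(i + 1) % N][i] = 1;
    metrics("ring");

    /* Fully connected: every node connects to every other node. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            adj[i][j] = (i != j);
    metrics("fully connected");
    return 0;
}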
1.3.1 The “Dead Topology Society”
What topologies are found in massively parallel systems? Ever since parallel
computing first started to gain popularity (arguably in the early 1970s), the
parallel-computing literature has been rife with publications presenting a vir-
tually endless menagerie of network topologies, including — but certainly not
limited to — 2-D and 3-D meshes and tori [111], banyan networks [167], Beneš
networks [39], binary trees [188], butterfly networks [251], Cayley graphs [8],
chordal rings [18], Clos networks [94], crossbars [342], cube-connected cy-
cles [347], de Bruijn networks [377], delta networks [330], distributed-loop
networks [40], dragonflies [227], extended hypercubes [238], fat trees [253],
FIGURE 1.2: Network topologies used in the ten fastest supercomputers and the
single most parallel supercomputer, June and November, 1993–2008. Legend:
M = mesh or torus (35.8%); F = fat tree (34.8%); H = hierarchical (13.2%);
X = crossbar (11.3%); C = hypercube (1.0%); O = other (0.6%); no network (3.2%).
flattened butterflies [228], flip networks [32], fractahedrons [189], generalized
hypercube-connected cycles [176], honeycomb networks [415], hyperbuses [45],
hypercubes [45], indirect binary n-cube arrays [333], Kautz graphs [134], KR-
Benes networks [220], Moebius graphs [256], omega networks [249], pancake
graphs [8], perfect-shuffle networks [416], pyramid computers [295], recursive
circulants [329], recursive cubes of rings [421], shuffle-exchange networks [417],
star graphs [7], X-torus networks [174], and X-Tree structures [119]. Note that
many of those topologies are general classes of topologies that include other
topologies on the list or are specific topologies that are isomorphic to other
listed topologies.
Some of the entries in the preceding list have been used in numerous parallel-
computer implementations; others never made it past a single publication.
Figure 1.2 presents the topologies used over the past fifteen years both in
the ten fastest supercomputers and in the single most parallel computer
(i.e., the supercomputer with the largest number of processors, regardless of
its performance).
The data from Figure 1.2 were taken from the semiannual Top500 list
(http://www.top500.org/), and therefore represent achieved performance on
the HPLinpack benchmark [123].¹ Figure 1.3 illustrates the four topologies
that are the most prevalent in Figure 1.2: a crossbar, a hypercube, a fat tree,
and a mesh/torus.
FIGURE 1.3: Illustrations of various network topologies: (a) an 8-port crossbar;
(b) a hypercube (specifically, a 2-ary 4-cube); (c) a fat tree (specifically, a
2-ary 3-tree), where the shaded addition is commonly used in practice; (d) a 3-D
mesh, where the shaded addition makes it a 3-D torus (3-ary 3-cube); a single
plane is a 2-D mesh/torus, and a single line is a 1-D mesh/torus (a.k.a. a ring).
One feature of Figure 1.2 that is immediately clear is that the most parallel
system on each Top500 list since 1994 has consistently been either a 3-D mesh
or a 3-D torus. (The data are a bit misleading, however, as only four different
systems are represented: the CM-200 [435], Paragon [41], ASCI Red [279],
and Blue Gene/L [2].) Another easy-to-spot feature is that two topologies —
meshes/tori and fat trees — dominate the top 10. More interesting, though,
are the trends in those two topologies over time:
• In June 1993, half of the top 10 supercomputers were fat trees. Only
one mesh/torus made the top 10 list.
¹HPLinpack measures the time to solve a system of linear equations using LU
decomposition with partial pivoting. Although the large, dense matrices that
HPLinpack manipulates are rarely encountered in actual scientific applications,
the decade and a half of historical HPLinpack performance data covering
thousands of system installations is handy for analyzing trends in
supercomputer performance.
TABLE 1.3: Hypercube conference: length of proceedings

Year  Conference title                                                        Pages
1986  First Conference on Hypercube Multiprocessors                             286
1987  Second Conference on Hypercube Multiprocessors                            761
1988  Third Conference on Hypercube Concurrent Computers and Applications      2682
1989  Fourth Conference on Hypercubes, Concurrent Computers, and Applications  1361
• From November 1996 through November 1999 there were no fat trees in
the top 10. In fact, in June 1998 every one of the top 10 was either a
mesh or a torus. (The entry marked as “hierarchical” was in fact a 3-D
mesh of crossbars.)
• From November 2002 through November 2003 there were no meshes or
tori in the top 10. Furthermore, the November 2002 list contained only
a single non-fat-tree.
• In November 2006, 40% of the top 10 were meshes/tori and 50% were
fat trees.
Meshes/tori and fat trees clearly have characteristics that are well suited to
high-performance systems. However, the fact that the top 10 is alternately
dominated by one or the other indicates that changes in technology favor
different topologies at different times. The lesson is that the selection of
a topology for a massively parallel system must be based on what current
technology makes fast, inexpensive, power-conscious, etc.
Although hypercubes did not often make the top 10, they do make for
an interesting case study in a topology’s boom and bust. Starting in 1986,
there were enough hypercube systems, users, and researchers to warrant an
entire conference devoted solely to hypercubes. Table 1.3 estimates interest
in hypercubes — rather informally — by presenting the number of pages
in this conference’s proceedings. (Note though that the conference changed
names almost every single year.) From 1986 to 1988, the page count increased
almost tenfold! However, the page count halved the next year even as the
scope broadened to include non-hypercubes, and, in 1990, the conference, then
known as the Fifth Distributed Memory Computing Conference, no longer
focused on hypercubes.
How did hypercubes go from being the darling of massively parallel system
design to topology non grata? There are three parts to the answer. First,
topologies that match an expected usage model are often favored over those
that don’t. For example, Sandia National Laboratories have long favored 3-D
meshes because many of their applications process 3-D volumes. In contrast,
few scientific methods are a natural match to hypercubes. The second reason
that hypercubes lost popularity is that the topology requires a doubling of
the processor count (more for non-binary hypercubes) for each increment in
size. This limitation starts to become impractical at larger system sizes. For
example, in 2007, Lawrence Livermore National Laboratory upgraded their
Blue Gene/L [2] system (a 3-D torus) from 131,072 to 212,992 processors; it
was too costly to go all the way to 262,144 processors, as would be required
by a hypercube topology.
The third, and possibly most telling, reason that hypercubes virtually
disappeared is technology limitations. The initial popularity of the
hypercube topology was due to how well it fares with many of the metrics
listed at the start of Section 1.3. However, Dally’s PhD dissertation [110] and
subsequent journal publication [111] highlighted a key fallacy of those metrics:
They fail to consider wire (or pin) limitations. Given a fixed number of wires
into and out of a switch, an n-dimensional binary hypercube (a.k.a. a 2-ary
n-cube) divides these wires — and therefore the link bandwidth — by n; the
larger the processor count, the less bandwidth is available in each direction.
In contrast, a 2-D torus (a.k.a. a k-ary 2-cube), for example, provides 1/4 of
the total switch bandwidth in each direction regardless of the processor count.
Another point that Dally makes is that high-dimension networks in general
require long wires — and therefore high latencies — when embedded in a
lower-dimension space, such as a plane in a typical VLSI implementation.
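In concrete terms (a simplified restatement of Dally's argument, with W denoting a switch's fixed wire budget), a 2-ary n-cube of P = 2^n nodes has n links per switch, so each link receives W/n wires, whereas a 2-D torus always has four links per switch:

\[
B_{\text{link}}^{\text{hypercube}} = \frac{W}{\log_2 P},
\qquad
B_{\text{link}}^{\text{2-D torus}} = \frac{W}{4};
\qquad
\text{at } P = 65{,}536:\ \frac{W}{16}\ \text{vs.}\ \frac{W}{4}.
\]

At that scale each hypercube link carries only a quarter of the bandwidth of a torus link built from the same pin budget.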
In summary, topologies with the best mathematical properties are not
necessarily the best topologies to implement. When selecting a topology for
a massively parallel system, one must consider the real-world features and
limitations of the current technology. A topology that works well one year
may be suboptimal when facing the subsequent year’s technology.
1.3.2 Hierarchical Networks
A hierarchical network uses an “X of Y ” topology — a base topology in
which every node is replaced with a network of a (usually) different topology.
As Figure 1.2 in the previous section indicates, many of the ten fastest su-
percomputers over time have employed hierarchical networks. For example,
Hitachi’s SR2201 [153] and the University of Tsukuba’s CP-PACS [49] (#1,
respectively, on the June and November 1996 Top500 lists) were both 8×17×16
meshes of crossbars (a.k.a. 3-D hypercrossbars). Los Alamos National Labo-
ratory’s ASCI Blue Mountain [268] (#4 on the November 1998 Top500 list)
used a deeper topology hierarchy: a HIPPI [433] 3-D torus of “hierarchical fat
bristled hypercubes” — the SGI Origin 2000’s network of eight crossbars, each
of which connected to a different node within a set of eight 3-D hypercubes and
where each node contained two processors on a bus [246]. The topologies do not
need to be different: NASA’s Columbia system [68] (#2 on the November 2004
Top500 list) is an InfiniBand [207,217] 12-ary fat tree of NUMAlink 4-ary fat
trees. (NUMAlink is the SGI Altix 3700’s internal network [129].)
Other documents randomly have
different content
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.
Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.
Section 3. Information about the Project
Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status
by the Internal Revenue Service. The Foundation’s EIN or
federal tax identification number is 64-6221541. Contributions
to the Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.
The Foundation’s business office is located at 809 North 1500
West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact
Section 4. Information about Donations to
the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.
The Foundation is committed to complying with the laws
regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or determine
the status of compliance for any particular state visit
www.gutenberg.org/donate.
While we cannot and do not solicit contributions from states
where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.
International donations are gratefully accepted, but we cannot
make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.
Please check the Project Gutenberg web pages for current
donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.
Section 5. General Information About
Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.
Project Gutenberg™ eBooks are often created from several
printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
Welcome to Our Bookstore - The Ultimate Destination for Book Lovers
Are you passionate about books and eager to explore new worlds of
knowledge? At our website, we offer a vast collection of books that
cater to every interest and age group. From classic literature to
specialized publications, self-help books, and children’s stories, we
have it all! Each book is a gateway to new adventures, helping you
expand your knowledge and nourish your soul
Experience Convenient and Enjoyable Book Shopping Our website is more
than just an online bookstore—it’s a bridge connecting readers to the
timeless values of culture and wisdom. With a sleek and user-friendly
interface and a smart search system, you can find your favorite books
quickly and easily. Enjoy special promotions, fast home delivery, and
a seamless shopping experience that saves you time and enhances your
love for reading.
Let us accompany you on the journey of exploring knowledge and
personal growth!
ebookgate.com

Attaining High Performance Communications A Vertical Approach 1st Edition Ada Gavrilovska

  • 1.
    Attaining High PerformanceCommunications A Vertical Approach 1st Edition Ada Gavrilovska download https://ebookgate.com/product/attaining-high-performance- communications-a-vertical-approach-1st-edition-ada-gavrilovska/ Get Instant Ebook Downloads – Browse at https://ebookgate.com
  • 2.
    Get Your DigitalFiles Instantly: PDF, ePub, MOBI and More Quick Digital Downloads: PDF, ePub, MOBI and Other Formats High Performance Elastomer Materials An Engineering Approach 1st Edition Ryszard Koz■owski https://ebookgate.com/product/high-performance-elastomer- materials-an-engineering-approach-1st-edition-ryszard-kozlowski/ Working in Teams Moving From High Potential to High Performance 1st Edition Brian A. Griffith https://ebookgate.com/product/working-in-teams-moving-from-high- potential-to-high-performance-1st-edition-brian-a-griffith/ Digital Communications A Discrete Time Approach 1st Edition Michael Rice https://ebookgate.com/product/digital-communications-a-discrete- time-approach-1st-edition-michael-rice/ Leadership Revolution Creating a High Performance Organisation 1st Edition Christo Nel https://ebookgate.com/product/leadership-revolution-creating-a- high-performance-organisation-1st-edition-christo-nel/
  • 3.
    Scala High PerformanceProgramming 1st Edition Theron https://ebookgate.com/product/scala-high-performance- programming-1st-edition-theron/ Julia High Performance Programming Ivo Balbaert https://ebookgate.com/product/julia-high-performance-programming- ivo-balbaert/ High Performance Loudspeakers 6th Edition Martin Colloms https://ebookgate.com/product/high-performance-loudspeakers-6th- edition-martin-colloms/ Performance Optimization of Digital Communications Systems 1st Edition Vladimir Mitlin https://ebookgate.com/product/performance-optimization-of- digital-communications-systems-1st-edition-vladimir-mitlin/ High Performance Parallel I O 1st Edition I Foster https://ebookgate.com/product/high-performance-parallel-i-o-1st- edition-i-foster/
  • 5.
    AttAining HigH PerformAnce communicAtions A VerticAlApproAch C3088_FM.indd 1 8/18/09 10:05:36 AM
  • 6.
  • 7.
    AttAining HigH PerformAnce communicAtions A VerticAlApproAch edited by AdA gAvrilovskA C3088_FM.indd 3 8/18/09 10:05:37 AM
  • 8.
    Chapman & Hall/CRC Taylor& Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2010 by Taylor and Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number-13: 978-1-4200-9313-1 (Ebook-PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit- ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
  • 9.
  • 11.
    Contents List of Figuresxiv List of Tables xviii Preface xxi Acknowledgments xxvii About the Editor xxix List of Contributors xxxi 1 High Performance Interconnects for Massively Parallel Sys- tems 1 Scott Pakin 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.2 Application Sensitivity to Communication Performance 3 1.2.3 Measurements on Massively Parallel Systems . . . . . 4 1.3 Network Topology . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.1 The “Dead Topology Society” . . . . . . . . . . . . . . 8 1.3.2 Hierarchical Networks . . . . . . . . . . . . . . . . . . 12 1.3.3 Hybrid Networks . . . . . . . . . . . . . . . . . . . . . 13 1.3.4 Novel Topologies . . . . . . . . . . . . . . . . . . . . . 15 1.4 Network Features . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.1 Programming Models . . . . . . . . . . . . . . . . . . 18 1.5 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 20 1.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 22 2 Commodity High Performance Interconnects 25 Dhabaleswar K. Panda, Pavan Balaji, Sayantan Sur, and Matthew Jon Koop 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Overview of Past Commodity Interconnects, Features and Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 vii
  • 12.
    viii 2.3 InfiniBand Architecture. . . . . . . . . . . . . . . . . . . . . 28 2.3.1 IB Communication Model . . . . . . . . . . . . . . . . 28 2.3.2 Overview of InfiniBand Features . . . . . . . . . . . . 32 2.3.3 InfiniBand Protection and Security Features . . . . . . 39 2.3.4 InfiniBand Management and Services . . . . . . . . . . 40 2.4 Existing InfiniBand Adapters and Switches . . . . . . . . . . 43 2.4.1 Channel Adapters . . . . . . . . . . . . . . . . . . . . 43 2.4.2 Switches . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4.3 Wide Area Networks (WAN) and Routers . . . . . . . 45 2.5 Existing InfiniBand Software Stacks . . . . . . . . . . . . . . 45 2.5.1 Low-Level Interfaces . . . . . . . . . . . . . . . . . . . 45 2.5.2 High-Level Interfaces . . . . . . . . . . . . . . . . . . . 46 2.5.3 Verbs Capabilities . . . . . . . . . . . . . . . . . . . . 46 2.6 Designing High-End Systems with InfiniBand: Case Studies 47 2.6.1 Case Study: Message Passing Interface . . . . . . . . . 47 2.6.2 Case Study: Parallel File Systems . . . . . . . . . . . 55 2.6.3 Case Study: Enterprise Data Centers . . . . . . . . . . 57 2.7 Current and Future Trends of InfiniBand . . . . . . . . . . . 60 3 Ethernet vs. EtherNOT 61 Wu-chun Feng and Pavan Balaji 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.2.1 Defining Ethernet vs. EtherNOT . . . . . . . . . . . . 64 3.2.2 Forecast . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3.1 Ethernet Background . . . . . . . . . . . . . . . . . . 65 3.3.2 EtherNOT Background . . . . . . . . . . . . . . . . . 66 3.4 Ethernet vs. EtherNOT? . . . . . . . . . . . . . . . . . . . . 67 3.4.1 Hardware and Software Convergence . . . . . . . . . . 67 3.4.2 Overall Performance Convergence . . . . . . . . . . . . 78 3.5 Commercial Perspective . . . . . . . . . . . . . . . . . . . . . . 81 3.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 82 4 System Impact of Integrated Interconnects 85 Sudhakar Yalamanchili and Jeffrey Young 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.2 Technology Trends . . . . . . . . . . . . . . . . . . . . . . . . 86 4.3 Integrated Interconnects . . . . . . . . . . . . . . . . . . . . 90 4.3.1 HyperTransport (HT) . . . . . . . . . . . . . . . . . . 92 4.3.2 QuickPath Interconnect (QPI) . . . . . . . . . . . . . 96 4.3.3 PCI Express (PCIe) . . . . . . . . . . . . . . . . . . . 98 4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.4 Case Study: Implementation of Global Address Spaces . . . . 101
  • 13.
    ix 4.4.1 A DynamicPartitioned Global Address Space Model (DPGAS) . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.4.2 The Implementation Path . . . . . . . . . . . . . . . . 105 4.4.3 Bridge Implementation . . . . . . . . . . . . . . . . . . 106 4.4.4 Projected Impact of DPGAS . . . . . . . . . . . . . . 108 4.5 Future Trends and Expectations . . . . . . . . . . . . . . . . 109 5 Network Interfaces for High Performance Computing 113 Keith Underwood, Ron Brightwell, and Scott Hemmert 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.2 Network Interface Design Issues . . . . . . . . . . . . . . . . 113 5.2.1 Offload vs. Onload . . . . . . . . . . . . . . . . . . . . 114 5.2.2 Short vs. Long Message Handling . . . . . . . . . . . 115 5.2.3 Interactions between Host and NIC . . . . . . . . . . . 118 5.2.4 Collectives . . . . . . . . . . . . . . . . . . . . . . . . 123 5.3 Current Approaches to Network Interface Design Issues . . . 124 5.3.1 Quadrics QsNet . . . . . . . . . . . . . . . . . . . . . . 124 5.3.2 Myrinet . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.3.3 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.3.4 Seastar . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.3.5 PathScale InfiniPath and Qlogic TrueScale . . . . . . 127 5.3.6 BlueGene/L and BlueGene/P . . . . . . . . . . . . . . 127 5.4 Research Directions . . . . . . . . . . . . . . . . . . . . . . . 128 5.4.1 Offload of Message Processing . . . . . . . . . . . . . . 128 5.4.2 Offloading Collective Operations . . . . . . . . . . . . 140 5.4.3 Cache Injection . . . . . . . . . . . . . . . . . . . . . . 147 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6 Network Programming Interfaces for High Performance Computing 149 Ron Brightwell and Keith Underwood 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.2 The Evolution of HPC Network Programming Interfaces . . 150 6.3 Low-Level Network Programming Interfaces . . . . . . . . . . 151 6.3.1 InfiniBand Verbs . . . . . . . . . . . . . . . . . . . . . . 151 6.3.2 Deep Computing Messaging Fabric . . . . . . . . . . . 153 6.3.3 Portals . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 6.3.4 Myrinet Express (MX) . . . . . . . . . . . . . . . . . . 153 6.3.5 Tagged Ports (Tports) . . . . . . . . . . . . . . . . . . 153 6.3.6 LAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6.3.7 Sockets . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6.4 Distinguishing Characteristics . . . . . . . . . . . . . . . . . 154 6.4.1 Endpoint Addressing . . . . . . . . . . . . . . . . . . . 155 6.4.2 Independent Processes . . . . . . . . . . . . . . . . . . 155
  • 14.
    x 6.4.3 Connections .. . . . . . . . . . . . . . . . . . . . . . . 156 6.4.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 156 6.4.5 Operating System Interaction . . . . . . . . . . . . . . 156 6.4.6 Data Movement Semantics . . . . . . . . . . . . . . . 157 6.4.7 Data Transfer Completion . . . . . . . . . . . . . . . . 158 6.4.8 Portability . . . . . . . . . . . . . . . . . . . . . . . . 160 6.5 Supporting MPI . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.5.1 Copy Blocks . . . . . . . . . . . . . . . . . . . . . . . 160 6.5.2 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.5.3 Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 6.5.4 Unexpected Messages . . . . . . . . . . . . . . . . . . . 161 6.6 Supporting SHMEM and Partitioned Global Address Space (PGAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 6.6.1 Fence and Quiet . . . . . . . . . . . . . . . . . . . . . 162 6.6.2 Synchronization and Atomics . . . . . . . . . . . . . . 162 6.6.3 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.6.4 Scalable Addressing . . . . . . . . . . . . . . . . . . . 163 6.7 Portals 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.7.1 Small Message Rate . . . . . . . . . . . . . . . . . . . 164 6.7.2 PGAS Optimizations . . . . . . . . . . . . . . . . . . . 166 6.7.3 Hardware Friendliness . . . . . . . . . . . . . . . . . . 166 6.7.4 New Functionality . . . . . . . . . . . . . . . . . . . . 167 7 High Performance IP-Based Transports 169 Ada Gavrilovska 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.2 Transmission Control Protocol — TCP . . . . . . . . . . . . 170 7.2.1 TCP Origins and Future . . . . . . . . . . . . . . . . . 170 7.2.2 TCP in High Speed Networks . . . . . . . . . . . . . . 172 7.2.3 TCP Variants . . . . . . . . . . . . . . . . . . . . . . . 173 7.3 TCP Performance Tuning . . . . . . . . . . . . . . . . . . . . 178 7.3.1 Improving Bandwidth Utilization . . . . . . . . . . . . 178 7.3.2 Reducing Host Loads . . . . . . . . . . . . . . . . . . 179 7.4 UDP-Based Transport Protocols . . . . . . . . . . . . . . . . . 181 7.5 SCTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 7.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 183 8 Remote Direct Memory Access and iWARP 185 Dennis Dalessandro and Pete Wyckoff 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 8.2 RDMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 8.2.1 High-Level Overview of RDMA . . . . . . . . . . . . . 187 8.2.2 Architectural Motivations . . . . . . . . . . . . . . . . 189 8.2.3 Fundamental Aspects of RDMA . . . . . . . . . . . . 192
  • 15.
    xi 8.2.4 RDMA HistoricalFoundations . . . . . . . . . . . . . 193 8.2.5 Programming Interface . . . . . . . . . . . . . . . . . . 194 8.2.6 Operating System Interactions . . . . . . . . . . . . . 196 8.3 iWARP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 8.3.1 High-Level Overview of iWARP . . . . . . . . . . . . . 200 8.3.2 iWARP Device History . . . . . . . . . . . . . . . . . . 201 8.3.3 iWARP Standardization . . . . . . . . . . . . . . . . . 202 8.3.4 Trade-Offs of Using TCP . . . . . . . . . . . . . . . . 204 8.3.5 Software-Based iWARP . . . . . . . . . . . . . . . . . 205 8.3.6 Differences between IB and iWARP . . . . . . . . . . 206 8.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 207 9 Accelerating Communication Services on Multi-Core Plat- forms 209 Ada Gavrilovska 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 9.2 The “Simple” Onload Approach . . . . . . . . . . . . . . . . 210 9.2.1 Limitations of the “Simple” Onload . . . . . . . . . . 212 9.3 Partitioned Communication Stacks . . . . . . . . . . . . . . . 213 9.3.1 API Considerations . . . . . . . . . . . . . . . . . . . 216 9.4 Specialized Network Multi-Cores . . . . . . . . . . . . . . . . 217 9.4.1 The (Original) Case for Network Processors . . . . . . 217 9.4.2 Network Processors Features . . . . . . . . . . . . . . 219 9.4.3 Application Diversity . . . . . . . . . . . . . . . . . . 222 9.5 Toward Heterogeneous Multi-Cores . . . . . . . . . . . . . . 223 9.5.1 Impact on Systems Software . . . . . . . . . . . . . . . 226 9.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 228 10 Virtualized I/O 229 Ada Gavrilovska, Adit Ranadive, Dulloor Rao, and Karsten Schwan 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 10.1.1 Virtualization Overview . . . . . . . . . . . . . . . . . 230 10.1.2 Challenges with I/O Virtualization . . . . . . . . . . . 232 10.1.3 I/O Virtualization Approaches . . . . . . . . . . . . . 233 10.2 Split Device Driver Model . . . . . . . . . . . . . . . . . . . 234 10.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 234 10.2.2 Performance Optimization Opportunities . . . . . . . 236 10.3 Direct Device Access Model . . . . . . . . . . . . . . . . . . 240 10.3.1 Multi-Queue Devices . . . . . . . . . . . . . . . . . . . . 241 10.3.2 Device-Level Packet Classification . . . . . . . . . . . 243 10.3.3 Signaling . . . . . . . . . . . . . . . . . . . . . . . . . 244 10.3.4 IOMMU . . . . . . . . . . . . . . . . . . . . . . . . . . 244 10.4 Opportunities and Trade-Offs . . . . . . . . . . . . . . . . . 245 10.4.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 245
  • 16.
    xii 10.4.2 Migration .. . . . . . . . . . . . . . . . . . . . . . . . 246 10.4.3 Higher-Level Interfaces . . . . . . . . . . . . . . . . . . 246 10.4.4 Monitoring and Management . . . . . . . . . . . . . . 247 10.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 249 11 The Message Passing Interface (MPI) 251 Jeff Squyres 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 11.1.1 Chapter Scope . . . . . . . . . . . . . . . . . . . . . . . 251 11.1.2 MPI Implementations . . . . . . . . . . . . . . . . . . 252 11.1.3 MPI Standard Evolution . . . . . . . . . . . . . . . . . 254 11.1.4 Chapter Overview . . . . . . . . . . . . . . . . . . . . 255 11.2 MPI’s Layer in the Network Stack . . . . . . . . . . . . . . . 255 11.2.1 OSI Network Stack . . . . . . . . . . . . . . . . . . . . 256 11.2.2 Networks That Provide MPI-Like Interfaces . . . . . . 256 11.2.3 Networks That Provide Non-MPI-Like Interfaces . . . 257 11.2.4 Resource Management . . . . . . . . . . . . . . . . . . 257 11.3 Threading and MPI . . . . . . . . . . . . . . . . . . . . . . . 260 11.3.1 Implementation Complexity . . . . . . . . . . . . . . . 260 11.3.2 Application Simplicity . . . . . . . . . . . . . . . . . . . 261 11.3.3 Performance Implications . . . . . . . . . . . . . . . . 262 11.4 Point-to-Point Communications . . . . . . . . . . . . . . . . 262 11.4.1 Communication/Computation Overlap . . . . . . . . . 262 11.4.2 Pre-Posting Receive Buffers . . . . . . . . . . . . . . . 263 11.4.3 Persistent Requests . . . . . . . . . . . . . . . . . . . . 265 11.4.4 Common Mistakes . . . . . . . . . . . . . . . . . . . . 267 11.5 Collective Operations . . . . . . . . . . . . . . . . . . . . . . 272 11.5.1 Synchronization . . . . . . . . . . . . . . . . . . . . . 272 11.6 Implementation Strategies . . . . . . . . . . . . . . . . . . . 273 11.6.1 Lazy Connection Setup . . . . . . . . . . . . . . . . . 273 11.6.2 Registered Memory . . . . . . . . . . . . . . . . . . . . 274 11.6.3 Message Passing Progress . . . . . . . . . . . . . . . . 278 11.6.4 Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . 278 11.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 279 12 High Performance Event Communication 281 Greg Eisenhauer, Matthew Wolf, Hasan Abbasi, and Karsten Schwan 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 12.2 Design Points . . . . . . . . . . . . . . . . . . . . . . . . . . 283 12.2.1 Lessons from Previous Designs . . . . . . . . . . . . . 285 12.2.2 Next Generation Event Delivery . . . . . . . . . . . . 286 12.3 The EVPath Architecture . . . . . . . . . . . . . . . . . . . . 287 12.3.1 Taxonomy of Stone Types . . . . . . . . . . . . . . . . 289 12.3.2 Data Type Handling . . . . . . . . . . . . . . . . . . . 289
  • 17.
    xiii 12.3.3 Mobile Functionsand the Cod Language . . . . . . . . 290 12.3.4 Meeting Next Generation Goals . . . . . . . . . . . . . 293 12.4 Performance Microbenchmarks . . . . . . . . . . . . . . . . . 294 12.4.1 Local Data Handling . . . . . . . . . . . . . . . . . . . 295 12.4.2 Network Operation . . . . . . . . . . . . . . . . . . . . 296 12.5 Usage Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 297 12.5.1 Implementing a Full Publish/Subscribe System . . . . 298 12.5.2 IFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . 300 12.5.3 I/OGraph . . . . . . . . . . . . . . . . . . . . . . . . . 302 12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 13 The Case of the Fast Financial Feed 305 Virat Agarwal, Lin Duan, Lurng-Kuo Liu, Michaele Perrone, Fabrizio Petrini, Davide Pasetto, and David Bader 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 13.2 Market Data Processing Systems . . . . . . . . . . . . . . . . 306 13.2.1 The Ticker Plant . . . . . . . . . . . . . . . . . . . . . 306 13.3 Performance Requirements . . . . . . . . . . . . . . . . . . . 308 13.3.1 Skyrocketing Data Rates . . . . . . . . . . . . . . . . 308 13.3.2 Low Latency Trading . . . . . . . . . . . . . . . . . . 308 13.3.3 High Performance Computing in the Data Center . . . 310 13.4 The OPRA Case Study . . . . . . . . . . . . . . . . . . . . . . 311 13.4.1 OPRA Data Encoding . . . . . . . . . . . . . . . . . . 312 13.4.2 Decoder Reference Implementation . . . . . . . . . . . 314 13.4.3 A Streamlined Bottom-Up Implementation . . . . . . 315 13.4.4 High-Level Protocol Processing with DotStar . . . . . 316 13.4.5 Experimental Results . . . . . . . . . . . . . . . . . . . 321 13.4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 324 13.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 327 14 Data-Movement Approaches for HPC Storage Systems 329 Ron A. Oldfield, Todd Kordenbrock, and Patrick Widener 14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 14.2 Lustre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 14.2.1 Lustre Networking (LNET) . . . . . . . . . . . . . . . 332 14.2.2 Optimizations for Large-Scale I/O . . . . . . . . . . . 333 14.3 Panasas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 14.3.1 PanFS Architecture . . . . . . . . . . . . . . . . . . . 335 14.3.2 Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . 336 14.4 Parallel Virtual File System 2 (PVFS2) . . . . . . . . . . . . 337 14.4.1 BMI Design . . . . . . . . . . . . . . . . . . . . . . . . 338 14.4.2 BMI Simplifies the Client . . . . . . . . . . . . . . . . 339 14.4.3 BMI Efficiency/Performance . . . . . . . . . . . . . . 340 14.4.4 BMI Scalability . . . . . . . . . . . . . . . . . . . . . . . 341 14.4.5 BMI Portability . . . . . . . . . . . . . . . . . . . . . . 341
  • 18.
    xiv 14.4.6 Experimental Results. . . . . . . . . . . . . . . . . . 342 14.5 Lightweight File Systems . . . . . . . . . . . . . . . . . . . . 345 14.5.1 Design of the LWFS RPC Mechanism . . . . . . . . . 345 14.5.2 LWFS RPC Implementation . . . . . . . . . . . . . . 346 14.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . 348 14.6 Other MPP File Systems . . . . . . . . . . . . . . . . . . . . 349 14.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 350 14.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 351 15 Network Simulation 353 George Riley 15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 15.2 Discrete Event Simulation . . . . . . . . . . . . . . . . . . . 353 15.3 Maintaining the Event List . . . . . . . . . . . . . . . . . . . 354 15.4 Modeling Routers, Links, and End Systems . . . . . . . . . . 355 15.5 Modeling Network Packets . . . . . . . . . . . . . . . . . . . 358 15.6 Modeling the Network Applications . . . . . . . . . . . . . . 359 15.7 Visualizing the Simulation . . . . . . . . . . . . . . . . . . . 360 15.8 Distributed Simulation . . . . . . . . . . . . . . . . . . . . . 362 15.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 References 367 Index 407
  • 19.
    List of Figures 1.1Comparative communication performance of Purple, Red Storm, and Blue Gene/L . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Network topologies used in the ten fastest supercomputers and the single most parallel supercomputer . . . . . . . . . . . . . 9 1.3 Illustrations of various network topologies . . . . . . . . . . . 10 1.4 The network hierarchy of the Roadrunner supercomputer . . 14 1.5 Kautz graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 Typical InfiniBand cluster . . . . . . . . . . . . . . . . . . . . 29 2.2 Consumer queuing model . . . . . . . . . . . . . . . . . . . . 30 2.3 Virtual lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4 Example of unreliable multicast operation . . . . . . . . . . . 34 2.5 IB transport services . . . . . . . . . . . . . . . . . . . . . . . 36 2.6 Layered design of MVAPICH/MVAPICH2 over IB. . . . . . . 48 2.7 Two-sided point-to-point performance over IB on a range of adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.8 Application-level evaluation of MPI over InfiniBand design com- ponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.9 Lustre performance over InfiniBand. . . . . . . . . . . . . . . 56 2.10 CPU utilization in Lustre with IPoIB and native (verb-level) protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 2.11 SDP architecture . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.12 Performance comparison of the Apache Web server: SDP vs. IPoIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.13 Active polling to achieve strong cache coherency. . . . . . . . 59 2.14 Active polling performance . . . . . . . . . . . . . . . . . . . 60 3.1 Profile of network interconnects on the TOP500 . . . . . . . . 62 3.2 Hand-drawn Ethernet diagram by Robert Metcalfe. . . . . . . 66 3.3 TCP offload engines. . . . . . . . . . . . . . . . . . . . . . . . 70 3.4 iWARP protocol stack. . . . . . . . . . . . . . . . . . . . . . . . 71 3.5 Network congestion. . . . . . . . . . . . . . . . . . . . . . . . 72 3.6 VLAN-based multipath communication. . . . . . . . . . . . . 74 3.7 Myrinet MX-over-Ethernet. . . . . . . . . . . . . . . . . . . . 76 3.8 Mellanox ConnectX. . . . . . . . . . . . . . . . . . . . . . . . 77 3.9 Ethernet vs. EtherNOT: One-way latency. . . . . . . . . . . . 78 xv
3.10 Ethernet vs. EtherNOT: Unidirectional bandwidth
3.11 Ethernet vs. EtherNOT: Virtual Microscope application
3.12 Ethernet vs. EtherNOT: MPI-tile-IO application
4.1 Network latency scaling trends
4.2 Memory cost and memory power trends for a commodity server system
4.3 Approximate floorplans for quad-core processors from AMD and Intel
4.4 Breakdown of a HyperTransport link
4.5 Organization of buffers/VCs on a HT link
4.6 HT read request packet format
4.7 Structure of the QPI protocol stack and link
4.8 QPI's low-latency source snoop protocol
4.9 An example of the PCIe complex
4.10 Structure of the PCI Express packets and protocol processing
4.11 The Dynamic Partitioned Global Address Space model
4.12 Memory bridge with Opteron memory subsystem
4.13 Memory bridge stages
5.1 Comparing programmed I/O and DMA transactions to the network interface
5.2 Comparing messages using an eager protocol to messages using a rendezvous protocol
5.3 Ping-pong bandwidth for Quadrics Elan4 and 4X SDR InfiniBand
5.4 Cells in associative list processing units
5.5 Performance advantages of an ALPU
5.6 NIC architecture enhanced with a list management unit
5.7 Performance of the list management unit
5.8 Microcoded match unit
5.9 Match unit's wide instruction word
5.10 Match unit performance
5.11 NIC-based atomic unit
5.12 Comparing the performance with and without cache
5.13 Assessing the impact of size and associativity on atomic unit cache performance
5.14 Pseudo code for an Allreduce using triggered operations
7.1 Congestion management functions in popular TCP variants
8.1 Comparison of traditional TCP network stack and RDMA
8.2 Effect of overlapping computation and communication
8.3 TCP and RDMA communication architecture
8.4 iWARP protocols stack
9.1 Simple protocol onload approach on multi-core platforms
9.2 Deploying communication stacks on dedicated cores
9.3 Deployment alternatives for network processors
9.4 Heterogeneous multi-core platform
10.1 Split device driver model
10.2 Direct device access model
10.3 RDMA write bandwidth divided among VMs
11.1 Simplistic receive processing in MPI
11.2 Serialized MPI communication
11.3 Communication/computation overlap
11.4 Memory copy vs. OpenFabric memory registration times
12.1 Channel-based event delivery system
12.2 Complex event processing delivery system
12.3 Event delivery system built using EVPath
12.4 Basic stone types
12.5 Sample EVPath data structure declaration
12.6 Specialization filter for stock trades ranges
12.7 Specialization filter for array averaging
12.8 Local stone transfer time for linear and tree-structured paths
12.9 EVPath throughput for various data sizes
12.10 Using event channels for communication
12.11 ECho event channel implementation using EVPath stones
12.12 Derived event channel implementation using EVPath stones
12.13 CPU overhead as a function of filter rejection ratio
13.1 High-level overview of a ticker plant
13.2 OPRA market peak data rates
13.3 OPRA FAST encoded packet format
13.4 OPRA reference decoder
13.5 Bottom-up reference decoder block diagram
13.6 Presence and field map bit manipulation
13.7 A graphical representation of the DotStar compiler steps
13.8 DotStar runtime
13.9 DotStar source code
13.10 OPRA message distribution
13.11 Performance comparison on several hardware platforms
14.1 Partitioned architecture
14.2 Lustre software stack
14.3 Server-directed DMA handshake in Lustre
14.4 Panasas DirectFlow architecture
14.5 Parallel NFS architecture
14.6 PVFS2
14.7 Round trip latency
14.8 Point-to-point bandwidth for 120MB transfer
14.9 Aggregate read pattern, 10MB per client/server pair
14.10 LWFS RPC protocol
14.11 The 16-byte data structure used for each of the experiments
14.12 Comparison of LWFS RPC to various other mechanisms
15.1 Moderately complex network topology
15.2 Network simulation animation
15.3 Network simulation visualization
15.4 The space-parallel method for distributed network simulation
List of Tables

1.1 Network characteristics of Purple, Red Storm, and Blue Gene/L
1.2 Properties of a ring vs. a fully connected network of n nodes
1.3 Hypercube conference: length of proceedings
4.1 Latency results for HToE bridge
4.2 Latency numbers used for evaluation of performance penalties
5.1 Breakdown of the assembly code
12.1 Comparing split stone and filter stone execution times
13.1 OPRA message categories with description
13.2 Hardware platforms used during experimental evaluation
13.3 Detailed performance analysis (Intel Xeon Q6600, E5472 and AMD Opteron 2352)
13.4 Detailed performance analysis (SUN UltraSparc T2 and IBM Power6)
13.5 DotStar latency on different platforms
14.1 Compute and I/O nodes for production DOE MPP systems used since the early 1990s
Preface

As technology pushes the Petascale barrier, and the next generation Exascale Computing requirements are being defined, it is clear that this type of computational capacity can be achieved only by accompanying advancement in communication technology. Next generation computing platforms will consist of hundreds of thousands of interconnected computational resources. Therefore, careful consideration must be placed on the choice of the communication fabrics connecting everything, from cores and accelerators on individual platform nodes, to the internode fabric for coupling nodes in large-scale data centers and computational complexes, to wide area communication channels connecting geographically disparate compute resources to remote data sources and application end-users.

The traditional scientific HPC community has long pushed the envelope on the capabilities delivered by existing communication fabrics. The computing complex at Oak Ridge National Laboratory has an impressive 332 Gigabytes/sec I/O bandwidth and a 786 Terabyte/sec global interconnection bandwidth [305], used by extreme science applications from "subatomic to galactic scales" domains [305], and supporting innovation in renewable energy sources, climate modeling, and medicine. Riding on a curve of a thousandfold increase in computational capabilities over the last five years alone, it is expected that the growth in both compute resources and the underlying communication infrastructure and its capabilities will continue to climb at mind-blowing rates.

Other classes of HPC applications have equally impressive communication needs beyond the main computing complex, as they require data from remote sources — observatories, databases, remote instruments such as particle accelerators, or large scale data-intensive collaborations across many globally distributed sites. The computational grids necessary for these applications must be enabled by communication technology capable of moving terabytes of data in a timely manner, while also supporting the interactive nature of the remote collaborations.

More unusual, however, are the emerging extreme scale communication needs outside of the traditional HPC domain. First, the confluence of the multi-core nature of emerging hardware resources, coupled with the renaissance of virtualization technology, is creating exceptional consolidation possibilities, and giving rise to compute clouds of virtualized multi-core resources. At the same time, the management complexity and operating costs associated with today's IT hardware and software stacks are pushing a broad range of enterprise class
applications into such clouds. As clouds spill outside of data center boundaries and move towards becoming Exascale platforms across globally distributed facilities [191, 320], their communication needs — for computations distributed across inter- and intra-cloud resources, for management, provisioning and QoS, for workload migration, and for end-user interactions — become exceedingly richer.

Next, even enterprise applications which, due to their critical nature, are not likely candidates for prime cloud deployment, are witnessing a "skyrocketing" increase in communication needs. For instance, financial market analyses are forecasting as much as 128 billion electronic trades a day just by 2010 [422], and similar trends toward increased data rates and lower latency requirements are expected in other market segments, from ticketing services, to parcel delivery and distribution, to inventory management and forecasting.

Finally, the diversity in commodity end-user applications, from 3D distributed games, to content sharing and rich multimedia based applications, to telecommunication and telepresence types of services, supported on everything from high-end workstations to cellphones and embedded devices, is further exacerbating the need for high performance communication services.

Objectives

No silver bullet will solve all of the communications-related challenges faced by the emerging types of applications, platforms, and usage scenarios described above. In order to address these requirements we need an ecosystem of solutions along a stack of technology layers: (i) efficient interconnection hardware; (ii) scalable, robust, end-to-end protocols; and (iii) system services and tools specifically targeting emerging multi-core environments.

Toward this end, this book is a unique effort to discuss technological advances and challenges related to high performance communications, by addressing each layer in the vertical stack — the low-level architectural features of the hardware interconnect and interconnection devices; the selection of communication protocols; the implementation of protocol stacks and other operating system features, including on modern, homogeneous or heterogeneous multi-core platforms; the middleware software and libraries used in applications with high-performance communication requirements; and the higher-level application services such as file systems, monitoring services, and simulation tools.

The rationale behind this approach is that no single solution, applied at one particular layer, can help applications address all performance-related issues with their communication services. Instead, a coordinated effort is needed to eliminate bottlenecks and address optimization opportunities at each layer — from the architectural features of the hardware, through the protocols and their implementation in OS kernels, to the manner in which application services and middleware are using the underlying platforms.

This book is an edited collection of chapters on these topics. The choice to
organize this title as a collection of chapters contributed by different individuals is a natural one — the vertical scope of the work calls for contributions from researchers with many different areas of expertise. Each of the chapters is organized in a manner which includes historic perspective, discussion of state-of-the-art technology solutions and current trends, summary of major research efforts in the community, and, where appropriate, greater level of detail on a particular effort that the chapter contributor is mostly affiliated with.

The topics covered in each chapter are important and complex, and deserve a separate book to cover adequately and in-depth all technical challenges which surround them. Many such books exist [71, 128, 181, 280, 408, etc.]. Unique about this title, however, is that it is a more comprehensive text for a broader audience, spanning a community with interests across the entire stack. Furthermore, each chapter abounds in references to past and current technical papers and projects, which can guide the reader to sources of additional information.

Finally, it is worth pointing out that even this type of text, which touches on many different types of technologies and many layers across the stack, is by no means complete. Many more topics remain only briefly mentioned throughout the book, without having a full chapter dedicated to them. These include topics related to physical-layer technologies, routing protocols and router architectures, advanced communications protocols for multimedia communications, modeling and analysis techniques, compiler and programming language-related issues, and others.

Target audience

The relevance of this book is multifold. First, it is targeted at academics for instruction in advanced courses on high-performance I/O and communication, offered as part of graduate or undergraduate curricula in Computer Science or Computer Engineering. Similar courses have been taught for several years at a number of universities, including Georgia Tech (High-Performance Communications, taught by this book's editor), The Ohio State University (Network-Based Computing, by Prof. Dhabaleswar K. Panda), Auburn University (Special Topics in High Performance Computing, by Prof. Weikuan Yu), or significant portions of the material are covered in more traditional courses on High Performance Computing in many Computer Science or Engineering programs worldwide.

In addition, this book can be relevant reference and instructional material for students and researchers in other science and engineering areas, in academia or at the National Labs, that are working on problems with significant communication requirements, and on high performance platforms such as supercomputers, high-end clusters or large-scale wide-area grids. Many of these scientists may not have formal computer science/computer engineering education, and this title aims to be a single text which can help steer them to
more easily identify performance improvement opportunities and find solutions best suited for the applications they are developing and the platforms they are using.

Finally, the text provides researchers who are specifically addressing problems and developing technologies at a particular layer with a single reference that surveys the state-of-the-art at other layers. By offering a concise overview of related efforts from "above" or "below," this book can be helpful in identifying the best ways to position one's work, and in ensuring that other elements of the stack are appropriately leveraged.

Organization

The book is organized in 15 chapters, which roughly follow a vertical stack from bottom to top.

• Chapter 1 discusses design alternatives for interconnection networks in massively parallel systems, including examples from current Cray, Blue Gene, as well as cluster-based supercomputers from the Top500 list.

• Chapter 2 provides in-depth discussion of the InfiniBand interconnection technology, its hardware elements, protocols and features, and includes a number of case studies from the HPC and enterprise domains, which illustrate its suitability for a range of applications.

• Chapter 3 contrasts the traditional high-performance interconnection solutions to the current capabilities offered by Ethernet-based interconnects, and demonstrates several convergence trends among Ethernet and EtherNOT technologies.

• Chapter 4 describes the key board- and chip-level interconnects such as PCI Express, HyperTransport, and QPI; the capabilities they offer for tighter system integration; and a case study for a service enabled by the availability of low-latency integrated interconnects — global partitioned address spaces.

• Chapter 5 discusses a number of considerations regarding the hardware and software architecture of network interface devices (NICs), and contrasts the design points present in NICs in a number of existing interconnection fabrics.

• Chapter 6 complements this discussion by focusing on the characteristics of the APIs natively supported by the different NIC platforms.

• Chapter 7 focuses on IP-based transport protocols and provides a historic perspective on the different variants and performance optimization opportunities for TCP transports, as well as a brief discussion of other IP-based transport protocols.
• Chapter 8 describes in greater detail Remote Direct Memory Access (RDMA) as an approach to network communication, along with iWARP, an RDMA-based solution based on TCP transports.

• Chapter 9 more explicitly discusses opportunities to accelerate the execution of communication services on multi-core platforms, including general purpose homogeneous multi-cores, specialized network processing accelerators, as well as emerging heterogeneous many-core platforms, comprising both general purpose and accelerator resources.

• Chapter 10 targets the mechanisms used to ensure high-performance communication services in virtualized platforms by discussing VMM-level techniques as well as device-level features which help attain near-native performance.

• Chapter 11 provides some historical perspective on MPI, the de facto communication standard in high-performance computing, and outlines some of the challenges in creating software and hardware implementations of the MPI standard. These challenges are directly related to MPI applications and their understanding can impact the programmers' ability to write better MPI-based codes.

• Chapter 12 looks at event-based middleware services as a means to address the high-performance needs of many classes of HPC and enterprise applications. It also gives an overview of the EVPath high-performance middleware stack and demonstrates its utility in several different contexts.

• Chapter 13 first describes an important class of applications with high performance communication requirements — electronic trading platforms used in the financial industry. Next, it provides implementation detail and performance analysis for the authors' approach to leverage the capabilities of general purpose multi-core platforms and to attain impressive performance levels for one of the key components of the trading engine.

• Chapter 14 describes the data-movement approaches used by a selection of commercial, open-source, and research-based storage systems used by massively parallel platforms, including Lustre, Panasas, PVFS2, and LWFS.

• Chapter 15 discusses the overall approach to creating the simulation tools needed to design, evaluate and experiment with a range of parameters in the interconnection technology space, so as to help identify design points with adequate performance behaviors.
Acknowledgments

This book would have never been possible without the contributions from the individual chapter authors. My immense gratitude goes out to all of them for their unyielding enthusiasm, expertise, and time.

Next, I would like to thank my editor Alan Apt, for approaching and encouraging me to write this book, and the extended production team at Taylor & Francis, for all their help with the preparation of the manuscript.

I am especially grateful to my mentor and colleague, Karsten Schwan, for supporting my teaching of the High Performance Communications course at Georgia Tech, which was the basis for this book. He and my close collaborators, Greg Eisenhauer and Matthew Wolf, were particularly accommodating during the most intense periods of manuscript preparation in providing me with the necessary time to complete the work.

Finally, I would like to thank my family, for all their love, care, and unconditional support. I dedicate this book to them.

Ada Gavrilovska
Atlanta, 2009
About the Editor

Ada Gavrilovska is a Research Scientist in the College of Computing at Georgia Tech, and at the Center for Experimental Research in Computer Systems (CERCS). Her research interests include conducting experimental systems research, specifically addressing high-performance applications on distributed heterogeneous platforms, and focusing on topics that range from operating and distributed systems, to virtualization, to programmable network devices and communication accelerators, to active and adaptive middleware and I/O. Most recently, she has been involved with several projects focused on development of efficient virtualization solutions for platforms ranging from heterogeneous multi-core systems to large-scale cloud environments.

In addition to research, Dr. Gavrilovska has a strong commitment to teaching. At Georgia Tech she teaches courses on advanced operating systems and high-performance communications topics, and is deeply involved in a larger effort aimed at upgrading the Computer Science and Engineering curriculum with multicore-related content.

Dr. Gavrilovska has a B.S. in Electrical and Computer Engineering from the Faculty of Electrical Engineering, University Sts. Cyril and Methodius, in Skopje, Macedonia, and M.S. and Ph.D. degrees in Computer Science from Georgia Tech, both completed under the guidance of Dr. Karsten Schwan. Her work has been supported by the National Science Foundation, the U.S. Department of Energy, and through numerous collaborations with industry, including Intel, IBM, HP, Cisco Systems, Netronome Systems, and others.
List of Contributors

Hasan Abbasi, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Virat Agarwal, IBM TJ Watson Research Center, Yorktown Heights, New York
David Bader, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Pavan Balaji, Argonne National Laboratory, Argonne, Illinois
Ron Brightwell, Sandia National Laboratories, Albuquerque, New Mexico
Dennis Dalessandro, Ohio Supercomputer Center, Columbus, Ohio
Lin Duan, IBM TJ Watson Research Center, Yorktown Heights, New York
Greg Eisenhauer, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Wu-chun Feng, Departments of Computer Science and Electrical & Computer Engineering, Virginia Tech, Blacksburg, Virginia
Ada Gavrilovska, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Scott Hemmert, Sandia National Laboratories, Albuquerque, New Mexico
Matthew Jon Koop, Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio
Todd Kordenbrock, Hewlett-Packard, Albuquerque, New Mexico
Lurng-Kuo Liu, IBM TJ Watson Research Center, Yorktown Heights, New York
Ron A. Oldfield, Sandia National Laboratories, Albuquerque, New Mexico
Scott Pakin, Los Alamos National Laboratory, Los Alamos, New Mexico
Dhabaleswar K. Panda, Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio
Davide Pasetto, IBM Computational Science Center, Dublin, Ireland
Michael Perrone, IBM TJ Watson Research Center, Yorktown Heights, New York
Fabrizio Petrini, IBM TJ Watson Research Center, Yorktown Heights, New York
Adit Ranadive, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Dulloor Rao, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
George Riley, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Karsten Schwan, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Jeff Squyres, Cisco Systems, Louisville, Kentucky
Sayantan Sur, IBM TJ Watson Research Center, Hawthorne, New York
Keith Underwood, Intel Corporation, Albuquerque, New Mexico
Patrick Widener, University of New Mexico, Albuquerque, New Mexico
Matthew Wolf, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Pete Wyckoff, Ohio Supercomputer Center, Columbus, Ohio
Sudhakar Yalamanchili, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Jeffrey Young, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Chapter 1

High Performance Interconnects for Massively Parallel Systems

Scott Pakin
Los Alamos National Laboratory

1.1 Introduction

If you were intent on building the world's fastest supercomputer, how would you design the interconnection network? As with any architectural endeavor, there is a suite of trade-offs and design decisions that need to be considered:

Maximize performance: Ideally, every application should run fast, but would it be acceptable for a smaller set of "important" applications or application classes to run fast? If so, is the overall performance of those applications dominated by communication performance? Do those applications use a known communication pattern that the network can optimize?

Minimize cost: Cheaper is better, but how much communication performance are you willing to sacrifice to cut costs? Can you exploit existing hardware components, or do you have to do a full-custom design?

Maximize scalability: Some networks are fast and cost-efficient at small scale but quickly become expensive as the number of nodes increases. How can you ensure that you will not need to rethink the network design from scratch when you want to increase the system's node count? Can the network grow incrementally? (Networks that require a power-of-two number of nodes quickly become prohibitively expensive, for example.)

Minimize complexity: How hard is it to reason about application performance? To observe good performance, do parallel algorithms need to be constructed specifically for your network? Do an application's processes need to be mapped to nodes in some non-straightforward manner to keep the network from becoming a bottleneck? As for some more mundane complexity issues, how tricky is it for technicians to lay the network
cables correctly, and how much rewiring is needed when the network size increases?

Maximize robustness: The more components (switches, cables, etc.) compose the network, the more likely it is that one of them will fail. Would you utilize a naturally fault-tolerant topology at the expense of added complexity or replicate the entire network at the expense of added cost?

Minimize power consumption: Current estimates are that a sustained megawatt of power costs between US$200,000–1,200,000 per year [144]. How much power are you willing to let the network consume? How much performance or robustness are you willing to sacrifice to reduce the network's power consumption?

There are no easy answers to those questions, and the challenges increase in difficulty with larger and larger networks: Network performance is more critical; costs may grow disproportionately to the rest of the system; incremental growth needs to be carefully considered; complexity is more likely to frustrate application developers; fault tolerance is both more important and more expensive; and power consumption becomes a serious concern.

In the remainder of this chapter we examine a few aspects of the network-design balancing act. Section 1.2 quantifies the measured network performance of a few massively parallel supercomputers. Section 1.3 discusses network topology, a key design decision and one that impacts all of performance, cost, scalability, complexity, robustness, and power consumption. In Section 1.4 we turn our attention to some of the "bells and whistles" that a network designer might consider including when trying to balance various design constraints. We briefly discuss some future directions in network design in Section 1.5 and summarize the chapter in Section 1.6.

Terminology note: In much of the system-area network literature, the term switch implies an indirect network while the term router implies a direct network. In the wide-area network literature, in contrast, the term router typically implies a switch plus additional logic that operates on packet contents. For simplicity, in this chapter we have chosen to consistently use the term switch when describing a network's switching hardware.

1.2 Performance

One way to discuss the performance of an interconnection network is in terms of various mathematical characteristics of the topology including — as a function at least of the number of nodes — the network diameter (worst-case
distance in switch hops between two nodes), average-case communication distance, bisection width (the minimum number of links that need to be removed to partition the network into two equal halves), switch count, and network capacity (maximum number of links that can carry data simultaneously). However, there are many nuances of real-world networks that are not captured by simple mathematical expressions. The manner in which applications, messaging layers, network interfaces, and network switches interact can be complex and unintuitive. For example, Arber and Pakin demonstrated that the address in memory at which a message buffer is placed can have a significant impact on communication performance and that different systems favor different buffer alignments [17]. Because of the discrepancy between the expected performance of an idealized network subsystem and the measured performance of a physical one, the focus of this section is on actual measurements of parallel supercomputers containing thousands to tens of thousands of nodes.

1.2.1 Metrics

The most common metric used in measuring network performance is half of the round-trip communication time (½RTT), often measured in microseconds. The measurement procedure is as follows: Process A reads the time and sends a message to process B; upon receipt of process A's message, process B sends an equal-sized message back to process A; upon receipt of process B's message, process A reads the time again, divides the elapsed time by two, and reports the result as ½RTT. The purpose of using round-trip communication is that it does not require the endpoints to have precisely synchronized clocks. In many papers, ½RTT is referred to as latency and the message size divided by ½RTT as bandwidth although these definitions are far from universal. In the LogP parallel-system model [100], for instance, latency refers solely to "wire time" for a zero-byte message and is distinguished from overhead, the time that a CPU is running communication code instead of application code. In this section, we use the less precise but more common definitions of latency and bandwidth in terms of ½RTT for an arbitrary-size message. Furthermore, all performance measurements in this section are based on tests written using MPI [408], currently the de facto standard messaging layer.
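As an illustration of the measurement procedure just described, the sketch below is a minimal MPI ping-pong test that reports ½RTT. It is not the benchmark code behind the measurements in this chapter; the message length and iteration count are arbitrary illustrative choices, and the result is averaged over many round trips only to reduce timer noise.

```c
/*
 * Minimal ping-pong sketch of the 1/2 RTT measurement: rank 0 timestamps,
 * exchanges equal-sized messages with rank 1, and reports half the average
 * round-trip time.  LEN and ITERS are illustrative values, not prescribed
 * by the text.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define LEN   0       /* payload size in bytes (0 gives the zero-byte latency) */
#define ITERS 1000    /* number of round trips to average over */

int main(int argc, char *argv[])
{
    char sendbuf[LEN + 1], recvbuf[LEN + 1];
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0)
            fprintf(stderr, "Run with at least two MPI ranks.\n");
        MPI_Finalize();
        return 1;
    }
    memset(sendbuf, 0, sizeof sendbuf);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(sendbuf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(recvbuf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(sendbuf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rtt = (MPI_Wtime() - start) / (2.0 * ITERS);

    if (rank == 0)
        printf("1/2 RTT for %d-byte messages: %.2f us\n", LEN, half_rtt * 1e6);
    MPI_Finalize();
    return 0;
}
```

With LEN set to 0 the program reports a zero-byte latency; sweeping LEN over a range of sizes yields the latency and bandwidth curves in the sense defined above, and reporting half of a round trip avoids any need for synchronized clocks on the two nodes.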
1.2.2 Application Sensitivity to Communication Performance

An underlying premise of this entire book is that high-performance applications require high-performance communication. Can we quantify that statement in the context of massively parallel systems? In a 2006 paper, Kerbyson used detailed, application-specific, analytical performance models to examine the impact of varying the network latency, bandwidth, and number of CPUs per node [224]. Kerbyson's study investigated two large applications (SAGE [450] and Partisn [24]) and one application kernel (Sweep3D [233]) at three different network sizes. Of the three applications, Partisn is the most sensitive to communication performance. At 1024 processors, Partisn's performance can be improved by 7.4% by reducing latency from 4 µs to 1.5 µs, 11.8% by increasing bandwidth from 0.9 GB/s to 1.6 GB/s, or 16.4% by decreasing the number of CPUs per node from 4 to 2, effectively halving the contention for the network interface controller (NIC). Overall, the performance difference between the worst set of network parameters studied (4 µs latency, 0.9 GB/s bandwidth, and 4 CPUs/node) and the best (1.5 µs latency, 1.6 GB/s bandwidth, and 2 CPUs/node) is 68% for Partisn, 24% for Sweep3D, and 15% for SAGE, indicating that network performance is in fact important to parallel-application performance.

Of course, these results are sensitive to the particular applications, input parameters, and the selected variations in network characteristics. Nevertheless, other researchers have also found communication performance to be an important contributor to overall application performance. (Martin et al.'s analysis showing the significance of overhead [277] is an oft-cited study, for example.) The point is that high-performance communication is needed for high-performance applications.

1.2.3 Measurements on Massively Parallel Systems

In 2005, Hoisie et al. [187] examined the performance of three of the most powerful supercomputers of the day: IBM's Purple system [248] at Lawrence Livermore National Laboratory, Cray/Sandia's Red Storm system [58] at Sandia National Laboratories, and IBM's Blue Gene/L system [2] at Lawrence Livermore National Laboratory. Table 1.1 summarizes the node and network characteristics of each of these three systems at the time at which Hoisie et al. ran their measurements. (Since the data was acquired all three systems have had substantial hardware and/or software upgrades.)

TABLE 1.1: Network characteristics of Purple, Red Storm, and Blue Gene/L [187]

Metric                        Purple          Red Storm                Blue Gene/L
CPU cores                     12,288          10,368                   65,536
Cores per node                8               1                        2
Nodes                         1,536           10,368                   32,768
NICs per node                 2               1                        1
Topology (cf. §1.3)           16-ary 3-tree   27×16×24 mesh in x,y;    32×32×32 torus
                                              torus in z               in x,y,z
Peak link bandwidth (MB/s)    2,048           3,891                    175
Achieved bandwidth (MB/s)     2,913           1,664                    154
Achieved min. latency (µs)    4.4             6.9                      2.8

Purple is a traditional cluster, with NICs located on the I/O bus and
connecting to a network fabric. Although eight CPUs share the two NICs in a node, the messaging software can stripe data across the two NICs, resulting in a potential doubling of bandwidth. (This explains how the measured bandwidth in Table 1.1 can exceed the peak link bandwidth.) Red Storm and Blue Gene/L are massively parallel processors (MPPs), with the nodes and network more tightly integrated.

Figure 1.1 reproduces some of the more interesting measurements from Hoisie et al.'s paper.

[FIGURE 1.1: Comparative communication performance of Purple, Red Storm, and Blue Gene/L.]

Figure 1.1(a) shows the ½RTT of a 0-byte message sent from node 0 to each of nodes 1–1023 in turn. As the figure shows, Purple's latency is largely independent of distance (although plateaus representing each of the three switch levels are in fact visible); Blue Gene/L's latency clearly matches the Manhattan distance between the two nodes, with peaks and valleys in the graph representing the use of the torus links in each 32-node row and 32×32-node plane; however, Red Storm's latency appears erratic. This is because Red Storm assigns ranks in the computation based on a node's physical location on the machine room floor — aisle (0–3), cabinet within an aisle (0–26), "cage" (collection of processor boards) within a cabinet (0–2), board within a cage (0–7), and socket within a board (0–3) — not, as expected, by the node's x (0–26), y (0–15), and z (0–23) coordinates on the mesh. While the former mapping may be convenient for pinpointing faulty hardware, the latter is more natural for application programmers. Unexpected node mappings are one of the subtleties that differentiate measurements of network performance from expectations based on simulations, models, or theoretical characteristics.
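The Manhattan-distance behavior observed for Blue Gene/L can be made concrete with a few lines of code. The sketch below is purely illustrative (it is not taken from any of these systems' software): it computes the minimum hop count between two nodes of an X×Y×Z torus, where each dimension can be traversed in either direction thanks to the wraparound links; on a mesh, the per-dimension term would simply be the absolute coordinate difference.

```c
/*
 * Illustrative torus hop-count calculation: the Manhattan distance with
 * wraparound in each dimension.  Coordinates and dimensions are supplied
 * by the caller; nothing here is specific to Blue Gene/L.
 */
#include <stdlib.h>

/* Shortest way around a ring of k nodes between positions a and b. */
static int ring_dist(int a, int b, int k)
{
    int d = abs(a - b);
    return d < k - d ? d : k - d;
}

/* Minimum switch hops between nodes a and b on a dims[0] x dims[1] x dims[2] torus. */
int torus_hops(const int a[3], const int b[3], const int dims[3])
{
    int hops = 0;
    for (int i = 0; i < 3; i++)
        hops += ring_dist(a[i], b[i], dims[i]);
    return hops;
}
```

On a 32×32×32 torus, for example, nodes (0,0,0) and (31,0,0) are only one hop apart because of the wraparound link, which is exactly what produces the peaks and valleys visible in Figure 1.1(a).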
Figure 1.1(b) depicts the impact of link contention on achievable bandwidth. At contention level 0, two nodes are measuring bandwidth as described in Section 1.2.1. At contention level 1, two other nodes lying between the first pair exchange messages while the first pair is measuring bandwidth; at contention level 2, another pair of coincident nodes consumes network bandwidth, and so on. On Red Storm and Blue Gene/L, the tests are performed across a single dimension of the mesh/torus to force all of the messages to overlap.

Each curve in Figure 1.1(b) tells a different story. The Blue Gene/L curve contains plateaus representing messages traveling in alternating directions across the torus. Purple's bandwidth degrades rapidly up to the node size (8 CPUs) then levels off, indicating that contention for the NIC is a greater problem than contention within the rest of the network. Relative to the other two systems, Red Storm observes comparatively little performance degradation. This is because Red Storm's link speed is high relative to the speed at which a single NIC can inject messages into the network. Consequently, it takes a relatively large number of concurrent messages to saturate a network link. In fact, Red Storm's overspecified network links largely reduce the impact of the unusual node mappings seen in Figure 1.1(a).

Which is more important to a massively parallel architecture: faster processors or a faster network? The tradeoff is that faster processors raise the peak performance of the system, but a faster network enables more applications to see larger fractions of peak performance. The correct choice depends on the specifics of the system and is analogous to the choice of eating a small piece of a large pie or a large piece of a small pie. On an infinitely fast network (relative to computation speed), all parallelism that can be exposed — even, say, in an arithmetic operation as trivial as (a + b) · (c + d) — leads to improved performance. On a network with finite speed but that is still fairly fast relative to computation speed, trivial operations are best performed sequentially although small blocks of work can still benefit from running in parallel. On a network that is much slower than what the processors can feed (e.g., a wide-area computational Grid [151]), only the coarsest-grained applications are likely to run faster in parallel than sequentially on a single machine. As Hoisie et al. quantify [187], Blue Gene/L's peak performance is 18 times that of the older ASCI Q system [269] but its relatively slow network limits it to running SAGE [450] only 1.6 times as fast as ASCI Q. In contrast, Red Storm's peak performance is only 2 times ASCI Q's but its relatively fast network enables it to run SAGE 2.45 times as fast as ASCI Q. Overspecifying the network was in fact an important aspect of Red Storm's design [58].

1.3 Network Topology

One of the fundamental design decisions when architecting an interconnection network for a massively parallel system is the choice of network topology. As mentioned briefly at the start of Section 1.2, there are a wide variety of metrics one can use to compare topologies. Selecting a topology also involves a large
number of tradeoffs that can impact performance, cost, scalability, complexity, robustness, and power consumption:

Minimize diameter: Reducing the hop count (number of switch crossings) should help reduce the overall latency. Given a choice, however, should you minimize the worst-case hop count (the diameter) or the average hop count?

Minimize degree: How many incoming/outgoing links are there per node? Low-radix networks devote more bandwidth to each link, improving performance for communication patterns that map nicely to the topology (e.g., nearest-neighbor 3-space communication on a 3-D mesh). High-radix networks reduce average latency — even for communication patterns that are a poor match to the topology — and the cost of the network in terms of the number of switches needed to interconnect a given number of nodes. A second degree-related question to consider is whether the number of links per node is constant or whether it increases with network size.

Maximize bisection width: The minimum number of links that must be removed to split the network into two disjoint pieces with equal node count is called the bisection width. (A related term, bisection bandwidth, refers to the aggregate data rate between the two halves of the network.) A large bisection width may improve fault tolerance, routability, and global throughput. However, it may also greatly increase the cost of the network if it requires substantially more switches.

Minimize switch count: How many switches are needed to interconnect n nodes? O(n) implies that the network can cost-effectively scale up to large numbers of nodes; O(n²) topologies are impractical for all but the smallest networks.
Maximize routability: How many minimal paths are there from a source node to a destination node? Is the topology naturally deadlock-free or does it require extra routing sophistication in the network to maintain deadlock freedom? The challenge is to ensure that concurrently transmitted messages cannot mutually block each other's progress while simultaneously allowing the use of as many paths as possible between a source and a destination [126].

Minimize contention: What communication patterns (mappings from a set of source nodes to a — possibly overlapping — set of destination nodes) can proceed with no two paths sharing a link? (Sharing implies reduced bandwidth.) As Jajszczyk explains in his 2003 survey paper, a network can be categorized as nonblocking in the strict sense, nonblocking in the wide sense, repackable, rearrangeable, or blocking, based on the level of effort (roughly speaking) needed to avoid contention for arbitrary mappings of sources to destinations [215]. But how important is it to support arbitrary mappings — vs. just a few common communication patterns — in a contention-free manner? Consider also that the use of packet switching and adaptive routing can reduce the impact of contention on performance.

Consider two topologies that represent the extremes of connectivity: a ring of n nodes (in which nodes are arranged in a circle, and each node connects only to its "left" and "right" neighbors) and a fully connected network (in which each of the n nodes connects directly to each of the other n−1 nodes). Table 1.2 summarizes these two topologies in terms of the preceding characteristics.

TABLE 1.2: Properties of a ring vs. a fully connected network of n nodes

Metric          Ring       Fully connected
Diameter        n/2        1
Avg. dist.      n/4        1
Degree          2          n − 1
Bisection       2          (n/2)²
Switches        n          n
Minimal paths   1          1
Contention      blocking   nonblocking

As that table shows, neither topology outperforms the other in all cases (as is typical for any two given topologies), and neither topology provides many minimal paths, although the fully connected network provides a large number of non-minimal paths. In terms of scalability, the ring's primary shortcomings are its diameter and bisection width, and the fully connected network's primary shortcoming is its degree. Consequently, in practice, neither topology is found in massively parallel systems.
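The entries in Table 1.2 follow from elementary counting arguments, and the sketch below simply evaluates them for one example node count. The choice of n = 1024 and the assumption of an even n are illustrative only; nothing in the chapter prescribes them.

```c
/*
 * Back-of-the-envelope evaluation of the ring vs. fully connected metrics
 * from Table 1.2 for an example (assumed even) node count n.
 */
#include <stdio.h>

int main(void)
{
    int n = 1024;                          /* example node count (an assumption) */

    /* Ring: each node links only to its two neighbors. */
    int    ring_diameter  = n / 2;
    double ring_avg_dist  = n / 4.0;
    int    ring_degree    = 2;
    int    ring_bisection = 2;             /* cutting the ring in two places */

    /* Fully connected: every node links directly to every other node. */
    int fc_diameter  = 1;
    int fc_degree    = n - 1;
    int fc_bisection = (n / 2) * (n / 2);  /* one link per cross-half node pair */

    printf("n=%d  ring: diameter=%d avg=%.1f degree=%d bisection=%d\n",
           n, ring_diameter, ring_avg_dist, ring_degree, ring_bisection);
    printf("n=%d  full: diameter=%d degree=%d bisection=%d\n",
           n, fc_diameter, fc_degree, fc_bisection);
    return 0;
}
```

Even at this modest scale the tension is apparent: the ring's 512-hop diameter is unusable, while the fully connected network's 1023 links per node is unbuildable.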
1.3.1 The "Dead Topology Society"

What topologies are found in massively parallel systems? Ever since parallel computing first started to gain popularity (arguably in the early 1970s), the parallel-computing literature has been rife with publications presenting a virtually endless menagerie of network topologies, including — but certainly not limited to — 2-D and 3-D meshes and tori [111], banyan networks [167], Beneš networks [39], binary trees [188], butterfly networks [251], Cayley graphs [8], chordal rings [18], Clos networks [94], crossbars [342], cube-connected cycles [347], de Bruijn networks [377], delta networks [330], distributed-loop networks [40], dragonflies [227], extended hypercubes [238], fat trees [253], flattened butterflies [228], flip networks [32], fractahedrons [189], generalized hypercube-connected cycles [176], honeycomb networks [415], hyperbuses [45], hypercubes [45], indirect binary n-cube arrays [333], Kautz graphs [134], KR-Benes networks [220], Moebius graphs [256], omega networks [249], pancake graphs [8], perfect-shuffle networks [416], pyramid computers [295], recursive circulants [329], recursive cubes of rings [421], shuffle-exchange networks [417], star graphs [7], X-torus networks [174], and X-Tree structures [119]. Note that many of those topologies are general classes of topologies that include other topologies on the list or are specific topologies that are isomorphic to other listed topologies.

Some of the entries in the preceding list have been used in numerous parallel-computer implementations; others never made it past a single publication. Figure 1.2 presents the topologies used over the past fifteen years both in the ten fastest supercomputers and in the single most parallel computer (i.e., the supercomputer with the largest number of processors, regardless of its performance).

[FIGURE 1.2: Network topologies used in the ten fastest supercomputers and the single most parallel supercomputer, June and November, 1993–2008. Legend and statistics: M = mesh or torus (35.8%); F = fat tree (34.8%); H = hierarchical (13.2%); X = crossbar (11.3%); C = hypercube (1.0%); O = other (0.6%); no network (3.2%).]

The data from Figure 1.2 were taken from the semiannual Top500 list
(http://www.top500.org/), and therefore represent achieved performance on the HPLinpack benchmark [123]. (HPLinpack measures the time to solve a system of linear equations using LU decomposition with partial pivoting. Although the large, dense matrices that HPLinpack manipulates are rarely encountered in actual scientific applications, the decade and a half of historical HPLinpack performance data covering thousands of system installations is handy for analyzing trends in supercomputer performance.) Figure 1.3 illustrates the four topologies that are the most prevalent in Figure 1.2: a crossbar, a hypercube, a fat tree, and a mesh/torus.

[FIGURE 1.3: Illustrations of various network topologies: (a) 8-port crossbar; (b) hypercube (specifically, a 2-ary 4-cube); (c) fat tree (specifically, a 2-ary 3-tree), with a shaded addition that is commonly used in practice; (d) 3-D mesh, where a shaded addition makes it a 3-D torus (3-ary 3-cube); a single plane is a 2-D mesh/torus, and a single line is a 1-D mesh/torus (a.k.a. a ring).]

One feature of Figure 1.2 that is immediately clear is that the most parallel system on each Top500 list since 1994 has consistently been either a 3-D mesh or a 3-D torus. (The data are a bit misleading, however, as only four different systems are represented: the CM-200 [435], Paragon [41], ASCI Red [279], and Blue Gene/L [2].) Another easy-to-spot feature is that two topologies — meshes/tori and fat trees — dominate the top 10. More interesting, though, are the trends in those two topologies over time:

• In June 1993, half of the top 10 supercomputers were fat trees. Only one mesh/torus made the top 10 list.
• From November 1996 through November 1999 there were no fat trees in the top 10. In fact, in June 1998 every one of the top 10 was either a mesh or a torus. (The entry marked as "hierarchical" was in fact a 3-D mesh of crossbars.)

• From November 2002 through November 2003 there were no meshes or tori in the top 10. Furthermore, the November 2002 list contained only a single non-fat-tree.

• In November 2006, 40% of the top 10 were meshes/tori and 50% were fat trees.

Meshes/tori and fat trees clearly have characteristics that are well suited to high-performance systems. However, the fact that the top 10 is alternately dominated by one or the other indicates that changes in technology favor different topologies at different times. The lesson is that the selection of a topology for a massively parallel system must be based on what current technology makes fast, inexpensive, power-conscious, etc.

Although hypercubes did not often make the top 10, they do make for an interesting case study in a topology's boom and bust. Starting in 1986, there were enough hypercube systems, users, and researchers to warrant an entire conference devoted solely to hypercubes. Table 1.3 estimates interest in hypercubes — rather informally — by presenting the number of pages in this conference's proceedings. (Note though that the conference changed names almost every single year.) From 1986 to 1988, the page count increased almost tenfold! However, the page count halved the next year even as the scope broadened to include non-hypercubes, and, in 1990, the conference, then known as the Fifth Distributed Memory Computing Conference, no longer focused on hypercubes.

TABLE 1.3: Hypercube conference: length of proceedings

Year  Conference title                                                         Pages
1986  First Conference on Hypercube Multiprocessors                              286
1987  Second Conference on Hypercube Multiprocessors                             761
1988  Third Conference on Hypercube Concurrent Computers and Applications       2682
1989  Fourth Conference on Hypercubes, Concurrent Computers, and Applications   1361

How did hypercubes go from being the darling of massively parallel system design to topology non grata? There are three parts to the answer. First, topologies that match an expected usage model are often favored over those that don't. For example, Sandia National Laboratories have long favored 3-D meshes because many of their applications process 3-D volumes. In contrast, few scientific methods are a natural match to hypercubes. The second reason
that hypercubes lost popularity is that the topology requires a doubling of processors (more for non-binary hypercubes) for each increment in processor count. This limitation starts to become impractical at larger system sizes. For example, in 2007, Lawrence Livermore National Laboratory upgraded their Blue Gene/L [2] system (a 3-D torus) from 131,072 to 212,992 processors; it was too costly to go all the way to 262,144 processors, as would be required by a hypercube topology. The third, and possibly most telling, reason that hypercubes virtually disappeared is because of technology limitations. The initial popularity of the hypercube topology was due to how well it fares with many of the metrics listed at the start of Section 1.3. However, Dally's PhD dissertation [110] and subsequent journal publication [111] highlighted a key fallacy of those metrics: They fail to consider wire (or pin) limitations. Given a fixed number of wires into and out of a switch, an n-dimensional binary hypercube (a.k.a. a 2-ary n-cube) divides these wires — and therefore the link bandwidth — by n; the larger the processor count, the less bandwidth is available in each direction. In contrast, a 2-D torus (a.k.a. a k-ary 2-cube), for example, provides 1/4 of the total switch bandwidth in each direction regardless of the processor count. Another point that Dally makes is that high-dimension networks in general require long wires — and therefore high latencies — when embedded in a lower-dimension space, such as a plane in a typical VLSI implementation.

In summary, topologies with the best mathematical properties are not necessarily the best topologies to implement. When selecting a topology for a massively parallel system, one must consider the real-world features and limitations of the current technology. A topology that works well one year may be suboptimal when facing the subsequent year's technology.

1.3.2 Hierarchical Networks

A hierarchical network uses an "X of Y" topology — a base topology in which every node is replaced with a network of a (usually) different topology. As Figure 1.2 in the previous section indicates, many of the ten fastest supercomputers over time have employed hierarchical networks. For example, Hitachi's SR2201 [153] and the University of Tsukuba's CP-PACS [49] (#1, respectively, on the June and November 1996 Top500 lists) were both 8×17×16 meshes of crossbars (a.k.a. 3-D hypercrossbars). Los Alamos National Laboratory's ASCI Blue Mountain [268] (#4 on the November 1998 Top500 list) used a deeper topology hierarchy: a HIPPI [433] 3-D torus of "hierarchical fat bristled hypercubes" — the SGI Origin 2000's network of eight crossbars, each of which connected to a different node within a set of eight 3-D hypercubes and where each node contained two processors on a bus [246]. The topologies do not need to be different: NASA's Columbia system [68] (#2 on the November 2004 Top500 list) is an InfiniBand [207, 217] 12-ary fat tree of NUMAlink 4-ary fat trees. (NUMAlink is the SGI Altix 3700's internal network [129].)
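Returning to the wire-limitation argument from Section 1.3.1, the sketch below works through the arithmetic under two loudly stated assumptions: a made-up per-switch pin budget and a made-up per-wire signalling rate, with processor-injection ports ignored. It is only meant to show why per-link bandwidth shrinks as a binary hypercube grows, whereas a k-ary 2-cube always spreads its pins over four links regardless of k.

```c
/*
 * Illustrative pin-budget arithmetic: per-link bandwidth of a binary
 * hypercube vs. a 2-D torus when every switch has the same fixed number
 * of wires.  The pin count and per-wire rate are invented example values.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double pins_per_switch = 128.0;   /* assumption: total wires per switch   */
    double gbps_per_pin    = 1.0;     /* assumption: signalling rate per wire */

    for (int nodes = 64; nodes <= 65536; nodes *= 4) {
        double hc_links   = log2((double)nodes);             /* links per hypercube switch */
        double hc_link_bw = pins_per_switch / hc_links * gbps_per_pin;
        double torus_bw   = pins_per_switch / 4.0 * gbps_per_pin;  /* 4 links, any k */
        printf("%6d nodes: hypercube link %5.1f Gb/s, 2-D torus link %5.1f Gb/s\n",
               nodes, hc_link_bw, torus_bw);
    }
    return 0;
}
```

Under these assumptions, growing the binary hypercube from 64 to 65,536 nodes cuts its per-link bandwidth by a factor of about 2.7 (16 links per switch instead of 6), while the 2-D torus's per-link bandwidth does not change at all.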