Attaining High Performance Communications: A Vertical Approach
Edited by
Ada Gavrilovska
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-1-4200-9313-1 (Ebook-PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my family.
Contents
List of Figures xiv
List of Tables xviii
Preface xxi
Acknowledgments xxvii
About the Editor xxix
List of Contributors xxxi
1 High Performance Interconnects for Massively Parallel Systems 1
Scott Pakin
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Application Sensitivity to Communication Performance 3
1.2.3 Measurements on Massively Parallel Systems . . . . . 4
1.3 Network Topology . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 The “Dead Topology Society” . . . . . . . . . . . . . . 8
1.3.2 Hierarchical Networks . . . . . . . . . . . . . . . . . . 12
1.3.3 Hybrid Networks . . . . . . . . . . . . . . . . . . . . . 13
1.3.4 Novel Topologies . . . . . . . . . . . . . . . . . . . . . 15
1.4 Network Features . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Programming Models . . . . . . . . . . . . . . . . . . 18
1.5 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Commodity High Performance Interconnects 25
Dhabaleswar K. Panda, Pavan Balaji, Sayantan Sur, and Matthew
Jon Koop
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Overview of Past Commodity Interconnects, Features and
Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 InfiniBand Architecture . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 IB Communication Model . . . . . . . . . . . . . . . . 28
2.3.2 Overview of InfiniBand Features . . . . . . . . . . . . 32
2.3.3 InfiniBand Protection and Security Features . . . . . . 39
2.3.4 InfiniBand Management and Services . . . . . . . . . . 40
2.4 Existing InfiniBand Adapters and Switches . . . . . . . . . . 43
2.4.1 Channel Adapters . . . . . . . . . . . . . . . . . . . . 43
2.4.2 Switches . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.3 Wide Area Networks (WAN) and Routers . . . . . . . 45
2.5 Existing InfiniBand Software Stacks . . . . . . . . . . . . . . 45
2.5.1 Low-Level Interfaces . . . . . . . . . . . . . . . . . . . 45
2.5.2 High-Level Interfaces . . . . . . . . . . . . . . . . . . . 46
2.5.3 Verbs Capabilities . . . . . . . . . . . . . . . . . . . . 46
2.6 Designing High-End Systems with InfiniBand: Case Studies 47
2.6.1 Case Study: Message Passing Interface . . . . . . . . . 47
2.6.2 Case Study: Parallel File Systems . . . . . . . . . . . 55
2.6.3 Case Study: Enterprise Data Centers . . . . . . . . . . 57
2.7 Current and Future Trends of InfiniBand . . . . . . . . . . . 60
3 Ethernet vs. EtherNOT 61
Wu-chun Feng and Pavan Balaji
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.1 Defining Ethernet vs. EtherNOT . . . . . . . . . . . . 64
3.2.2 Forecast . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.1 Ethernet Background . . . . . . . . . . . . . . . . . . 65
3.3.2 EtherNOT Background . . . . . . . . . . . . . . . . . 66
3.4 Ethernet vs. EtherNOT? . . . . . . . . . . . . . . . . . . . . 67
3.4.1 Hardware and Software Convergence . . . . . . . . . . 67
3.4.2 Overall Performance Convergence . . . . . . . . . . . . 78
3.5 Commercial Perspective . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 82
4 System Impact of Integrated Interconnects 85
Sudhakar Yalamanchili and Jeffrey Young
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Technology Trends . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Integrated Interconnects . . . . . . . . . . . . . . . . . . . . 90
4.3.1 HyperTransport (HT) . . . . . . . . . . . . . . . . . . 92
4.3.2 QuickPath Interconnect (QPI) . . . . . . . . . . . . . 96
4.3.3 PCI Express (PCIe) . . . . . . . . . . . . . . . . . . . 98
4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.4 Case Study: Implementation of Global Address Spaces . . . . 101
4.4.1 A Dynamic Partitioned Global Address Space Model
(DPGAS) . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.2 The Implementation Path . . . . . . . . . . . . . . . . 105
4.4.3 Bridge Implementation . . . . . . . . . . . . . . . . . . 106
4.4.4 Projected Impact of DPGAS . . . . . . . . . . . . . . 108
4.5 Future Trends and Expectations . . . . . . . . . . . . . . . . 109
5 Network Interfaces for High Performance Computing 113
Keith Underwood, Ron Brightwell, and Scott Hemmert
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Network Interface Design Issues . . . . . . . . . . . . . . . . 113
5.2.1 Offload vs. Onload . . . . . . . . . . . . . . . . . . . . 114
5.2.2 Short vs. Long Message Handling . . . . . . . . . . . 115
5.2.3 Interactions between Host and NIC . . . . . . . . . . . 118
5.2.4 Collectives . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Current Approaches to Network Interface Design Issues . . . 124
5.3.1 Quadrics QsNet . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2 Myrinet . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.3 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.4 Seastar . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.5 PathScale InfiniPath and Qlogic TrueScale . . . . . . 127
5.3.6 BlueGene/L and BlueGene/P . . . . . . . . . . . . . . 127
5.4 Research Directions . . . . . . . . . . . . . . . . . . . . . . . 128
5.4.1 Offload of Message Processing . . . . . . . . . . . . . . 128
5.4.2 Offloading Collective Operations . . . . . . . . . . . . 140
5.4.3 Cache Injection . . . . . . . . . . . . . . . . . . . . . . 147
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6 Network Programming Interfaces for High Performance
Computing 149
Ron Brightwell and Keith Underwood
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 The Evolution of HPC Network Programming Interfaces . . 150
6.3 Low-Level Network Programming Interfaces . . . . . . . . . . 151
6.3.1 InfiniBand Verbs . . . . . . . . . . . . . . . . . . . . . . 151
6.3.2 Deep Computing Messaging Fabric . . . . . . . . . . . 153
6.3.3 Portals . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3.4 Myrinet Express (MX) . . . . . . . . . . . . . . . . . . 153
6.3.5 Tagged Ports (Tports) . . . . . . . . . . . . . . . . . . 153
6.3.6 LAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3.7 Sockets . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.4 Distinguishing Characteristics . . . . . . . . . . . . . . . . . 154
6.4.1 Endpoint Addressing . . . . . . . . . . . . . . . . . . . 155
6.4.2 Independent Processes . . . . . . . . . . . . . . . . . . 155
6.4.3 Connections . . . . . . . . . . . . . . . . . . . . . . . . 156
6.4.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.4.5 Operating System Interaction . . . . . . . . . . . . . . 156
6.4.6 Data Movement Semantics . . . . . . . . . . . . . . . 157
6.4.7 Data Transfer Completion . . . . . . . . . . . . . . . . 158
6.4.8 Portability . . . . . . . . . . . . . . . . . . . . . . . . 160
6.5 Supporting MPI . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.5.1 Copy Blocks . . . . . . . . . . . . . . . . . . . . . . . 160
6.5.2 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.5.3 Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.5.4 Unexpected Messages . . . . . . . . . . . . . . . . . . . 161
6.6 Supporting SHMEM and Partitioned Global Address Space
(PGAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.6.1 Fence and Quiet . . . . . . . . . . . . . . . . . . . . . 162
6.6.2 Synchronization and Atomics . . . . . . . . . . . . . . 162
6.6.3 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.6.4 Scalable Addressing . . . . . . . . . . . . . . . . . . . 163
6.7 Portals 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.7.1 Small Message Rate . . . . . . . . . . . . . . . . . . . 164
6.7.2 PGAS Optimizations . . . . . . . . . . . . . . . . . . . 166
6.7.3 Hardware Friendliness . . . . . . . . . . . . . . . . . . 166
6.7.4 New Functionality . . . . . . . . . . . . . . . . . . . . 167
7 High Performance IP-Based Transports 169
Ada Gavrilovska
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.2 Transmission Control Protocol — TCP . . . . . . . . . . . . 170
7.2.1 TCP Origins and Future . . . . . . . . . . . . . . . . . 170
7.2.2 TCP in High Speed Networks . . . . . . . . . . . . . . 172
7.2.3 TCP Variants . . . . . . . . . . . . . . . . . . . . . . . 173
7.3 TCP Performance Tuning . . . . . . . . . . . . . . . . . . . . 178
7.3.1 Improving Bandwidth Utilization . . . . . . . . . . . . 178
7.3.2 Reducing Host Loads . . . . . . . . . . . . . . . . . . 179
7.4 UDP-Based Transport Protocols . . . . . . . . . . . . . . . . . 181
7.5 SCTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 183
8 Remote Direct Memory Access and iWARP 185
Dennis Dalessandro and Pete Wyckoff
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.2 RDMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.2.1 High-Level Overview of RDMA . . . . . . . . . . . . . 187
8.2.2 Architectural Motivations . . . . . . . . . . . . . . . . 189
8.2.3 Fundamental Aspects of RDMA . . . . . . . . . . . . 192
8.2.4 RDMA Historical Foundations . . . . . . . . . . . . . 193
8.2.5 Programming Interface . . . . . . . . . . . . . . . . . . 194
8.2.6 Operating System Interactions . . . . . . . . . . . . . 196
8.3 iWARP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.3.1 High-Level Overview of iWARP . . . . . . . . . . . . . 200
8.3.2 iWARP Device History . . . . . . . . . . . . . . . . . . 201
8.3.3 iWARP Standardization . . . . . . . . . . . . . . . . . 202
8.3.4 Trade-Offs of Using TCP . . . . . . . . . . . . . . . . 204
8.3.5 Software-Based iWARP . . . . . . . . . . . . . . . . . 205
8.3.6 Differences between IB and iWARP . . . . . . . . . . 206
8.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 207
9 Accelerating Communication Services on Multi-Core Platforms 209
Ada Gavrilovska
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.2 The “Simple” Onload Approach . . . . . . . . . . . . . . . . 210
9.2.1 Limitations of the “Simple” Onload . . . . . . . . . . 212
9.3 Partitioned Communication Stacks . . . . . . . . . . . . . . . 213
9.3.1 API Considerations . . . . . . . . . . . . . . . . . . . 216
9.4 Specialized Network Multi-Cores . . . . . . . . . . . . . . . . 217
9.4.1 The (Original) Case for Network Processors . . . . . . 217
9.4.2 Network Processors Features . . . . . . . . . . . . . . 219
9.4.3 Application Diversity . . . . . . . . . . . . . . . . . . 222
9.5 Toward Heterogeneous Multi-Cores . . . . . . . . . . . . . . 223
9.5.1 Impact on Systems Software . . . . . . . . . . . . . . . 226
9.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 228
10 Virtualized I/O 229
Ada Gavrilovska, Adit Ranadive, Dulloor Rao, and Karsten Schwan
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.1.1 Virtualization Overview . . . . . . . . . . . . . . . . . 230
10.1.2 Challenges with I/O Virtualization . . . . . . . . . . . 232
10.1.3 I/O Virtualization Approaches . . . . . . . . . . . . . 233
10.2 Split Device Driver Model . . . . . . . . . . . . . . . . . . . 234
10.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.2.2 Performance Optimization Opportunities . . . . . . . 236
10.3 Direct Device Access Model . . . . . . . . . . . . . . . . . . 240
10.3.1 Multi-Queue Devices . . . . . . . . . . . . . . . . . . . . 241
10.3.2 Device-Level Packet Classification . . . . . . . . . . . 243
10.3.3 Signaling . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.3.4 IOMMU . . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.4 Opportunities and Trade-Offs . . . . . . . . . . . . . . . . . 245
10.4.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.4.2 Migration . . . . . . . . . . . . . . . . . . . . . . . . . 246
10.4.3 Higher-Level Interfaces . . . . . . . . . . . . . . . . . . 246
10.4.4 Monitoring and Management . . . . . . . . . . . . . . 247
10.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 249
11 The Message Passing Interface (MPI) 251
Jeff Squyres
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
11.1.1 Chapter Scope . . . . . . . . . . . . . . . . . . . . . . . 251
11.1.2 MPI Implementations . . . . . . . . . . . . . . . . . . 252
11.1.3 MPI Standard Evolution . . . . . . . . . . . . . . . . . 254
11.1.4 Chapter Overview . . . . . . . . . . . . . . . . . . . . 255
11.2 MPI’s Layer in the Network Stack . . . . . . . . . . . . . . . 255
11.2.1 OSI Network Stack . . . . . . . . . . . . . . . . . . . . 256
11.2.2 Networks That Provide MPI-Like Interfaces . . . . . . 256
11.2.3 Networks That Provide Non-MPI-Like Interfaces . . . 257
11.2.4 Resource Management . . . . . . . . . . . . . . . . . . 257
11.3 Threading and MPI . . . . . . . . . . . . . . . . . . . . . . . 260
11.3.1 Implementation Complexity . . . . . . . . . . . . . . . 260
11.3.2 Application Simplicity . . . . . . . . . . . . . . . . . . . 261
11.3.3 Performance Implications . . . . . . . . . . . . . . . . 262
11.4 Point-to-Point Communications . . . . . . . . . . . . . . . . 262
11.4.1 Communication/Computation Overlap . . . . . . . . . 262
11.4.2 Pre-Posting Receive Buffers . . . . . . . . . . . . . . . 263
11.4.3 Persistent Requests . . . . . . . . . . . . . . . . . . . . 265
11.4.4 Common Mistakes . . . . . . . . . . . . . . . . . . . . 267
11.5 Collective Operations . . . . . . . . . . . . . . . . . . . . . . 272
11.5.1 Synchronization . . . . . . . . . . . . . . . . . . . . . 272
11.6 Implementation Strategies . . . . . . . . . . . . . . . . . . . 273
11.6.1 Lazy Connection Setup . . . . . . . . . . . . . . . . . 273
11.6.2 Registered Memory . . . . . . . . . . . . . . . . . . . . 274
11.6.3 Message Passing Progress . . . . . . . . . . . . . . . . 278
11.6.4 Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 279
12 High Performance Event Communication 281
Greg Eisenhauer, Matthew Wolf, Hasan Abbasi, and Karsten Schwan
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12.2 Design Points . . . . . . . . . . . . . . . . . . . . . . . . . . 283
12.2.1 Lessons from Previous Designs . . . . . . . . . . . . . 285
12.2.2 Next Generation Event Delivery . . . . . . . . . . . . 286
12.3 The EVPath Architecture . . . . . . . . . . . . . . . . . . . . 287
12.3.1 Taxonomy of Stone Types . . . . . . . . . . . . . . . . 289
12.3.2 Data Type Handling . . . . . . . . . . . . . . . . . . . 289
12.3.3 Mobile Functions and the Cod Language . . . . . . . . 290
12.3.4 Meeting Next Generation Goals . . . . . . . . . . . . . 293
12.4 Performance Microbenchmarks . . . . . . . . . . . . . . . . . 294
12.4.1 Local Data Handling . . . . . . . . . . . . . . . . . . . 295
12.4.2 Network Operation . . . . . . . . . . . . . . . . . . . . 296
12.5 Usage Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 297
12.5.1 Implementing a Full Publish/Subscribe System . . . . 298
12.5.2 IFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
12.5.3 I/OGraph . . . . . . . . . . . . . . . . . . . . . . . . . 302
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13 The Case of the Fast Financial Feed 305
Virat Agarwal, Lin Duan, Lurng-Kuo Liu, Michaele Perrone,
Fabrizio Petrini, Davide Pasetto, and David Bader
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
13.2 Market Data Processing Systems . . . . . . . . . . . . . . . . 306
13.2.1 The Ticker Plant . . . . . . . . . . . . . . . . . . . . . 306
13.3 Performance Requirements . . . . . . . . . . . . . . . . . . . 308
13.3.1 Skyrocketing Data Rates . . . . . . . . . . . . . . . . 308
13.3.2 Low Latency Trading . . . . . . . . . . . . . . . . . . 308
13.3.3 High Performance Computing in the Data Center . . . 310
13.4 The OPRA Case Study . . . . . . . . . . . . . . . . . . . . . . 311
13.4.1 OPRA Data Encoding . . . . . . . . . . . . . . . . . . 312
13.4.2 Decoder Reference Implementation . . . . . . . . . . . 314
13.4.3 A Streamlined Bottom-Up Implementation . . . . . . 315
13.4.4 High-Level Protocol Processing with DotStar . . . . . 316
13.4.5 Experimental Results . . . . . . . . . . . . . . . . . . . 321
13.4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 324
13.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 327
14 Data-Movement Approaches for HPC Storage Systems 329
Ron A. Oldfield, Todd Kordenbrock, and Patrick Widener
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
14.2 Lustre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.2.1 Lustre Networking (LNET) . . . . . . . . . . . . . . . 332
14.2.2 Optimizations for Large-Scale I/O . . . . . . . . . . . 333
14.3 Panasas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
14.3.1 PanFS Architecture . . . . . . . . . . . . . . . . . . . 335
14.3.2 Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . 336
14.4 Parallel Virtual File System 2 (PVFS2) . . . . . . . . . . . . 337
14.4.1 BMI Design . . . . . . . . . . . . . . . . . . . . . . . . 338
14.4.2 BMI Simplifies the Client . . . . . . . . . . . . . . . . 339
14.4.3 BMI Efficiency/Performance . . . . . . . . . . . . . . 340
14.4.4 BMI Scalability . . . . . . . . . . . . . . . . . . . . . . . 341
14.4.5 BMI Portability . . . . . . . . . . . . . . . . . . . . . . 341
14.4.6 Experimental Results . . . . . . . . . . . . . . . . . . 342
14.5 Lightweight File Systems . . . . . . . . . . . . . . . . . . . . 345
14.5.1 Design of the LWFS RPC Mechanism . . . . . . . . . 345
14.5.2 LWFS RPC Implementation . . . . . . . . . . . . . . 346
14.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . 348
14.6 Other MPP File Systems . . . . . . . . . . . . . . . . . . . . 349
14.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 350
14.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 351
15 Network Simulation 353
George Riley
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
15.2 Discrete Event Simulation . . . . . . . . . . . . . . . . . . . 353
15.3 Maintaining the Event List . . . . . . . . . . . . . . . . . . . 354
15.4 Modeling Routers, Links, and End Systems . . . . . . . . . . 355
15.5 Modeling Network Packets . . . . . . . . . . . . . . . . . . . 358
15.6 Modeling the Network Applications . . . . . . . . . . . . . . 359
15.7 Visualizing the Simulation . . . . . . . . . . . . . . . . . . . 360
15.8 Distributed Simulation . . . . . . . . . . . . . . . . . . . . . 362
15.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
References 367
Index 407
List of Figures
1.1 Comparative communication performance of Purple, Red Storm,
and Blue Gene/L . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Network topologies used in the ten fastest supercomputers and
the single most parallel supercomputer . . . . . . . . . . . . . 9
1.3 Illustrations of various network topologies . . . . . . . . . . . 10
1.4 The network hierarchy of the Roadrunner supercomputer . . 14
1.5 Kautz graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Typical InfiniBand cluster . . . . . . . . . . . . . . . . . . . . 29
2.2 Consumer queuing model . . . . . . . . . . . . . . . . . . . . 30
2.3 Virtual lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Example of unreliable multicast operation . . . . . . . . . . . 34
2.5 IB transport services . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Layered design of MVAPICH/MVAPICH2 over IB. . . . . . . 48
2.7 Two-sided point-to-point performance over IB on a range of
adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.8 Application-level evaluation of MPI over InfiniBand design components . . . . 53
2.9 Lustre performance over InfiniBand. . . . . . . . . . . . . . . 56
2.10 CPU utilization in Lustre with IPoIB and native (verb-level)
protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.11 SDP architecture . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.12 Performance comparison of the Apache Web server: SDP vs.
IPoIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.13 Active polling to achieve strong cache coherency. . . . . . . . 59
2.14 Active polling performance . . . . . . . . . . . . . . . . . . . 60
3.1 Profile of network interconnects on the TOP500 . . . . . . . . 62
3.2 Hand-drawn Ethernet diagram by Robert Metcalfe. . . . . . . 66
3.3 TCP offload engines. . . . . . . . . . . . . . . . . . . . . . . . 70
3.4 iWARP protocol stack. . . . . . . . . . . . . . . . . . . . . . . . 71
3.5 Network congestion. . . . . . . . . . . . . . . . . . . . . . . . 72
3.6 VLAN-based multipath communication. . . . . . . . . . . . . 74
3.7 Myrinet MX-over-Ethernet. . . . . . . . . . . . . . . . . . . . 76
3.8 Mellanox ConnectX. . . . . . . . . . . . . . . . . . . . . . . . 77
3.9 Ethernet vs. EtherNOT: One-way latency. . . . . . . . . . . . 78
3.10 Ethernet vs. EtherNOT: Unidirectional bandwidth. . . . . . . 79
3.11 Ethernet vs. EtherNOT: Virtual Microscope application. . . . 80
3.12 Ethernet vs. EtherNOT: MPI-tile-IO application. . . . . . . . . 81
4.1 Network latency scaling trends. . . . . . . . . . . . . . . . . . 87
4.2 Memory cost and memory power trends for a commodity server
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Approximate floorplans for quad-core processors from AMD
and Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4 Breakdown of a HyperTransport link. . . . . . . . . . . . . . 92
4.5 Organization of buffers/VCs on a HT link . . . . . . . . . . . 93
4.6 HT read request packet format. . . . . . . . . . . . . . . . . . 95
4.7 Structure of the QPI protocol stack and link. . . . . . . . . . 96
4.8 QPI’s low-latency source snoop protocol. . . . . . . . . . . . . 97
4.9 An example of the PCIe complex. . . . . . . . . . . . . . . . . 99
4.10 Structure of the PCI Express packets and protocol processing 100
4.11 The Dynamic Partitioned Global Address Space model . . . . 102
4.12 Memory bridge with Opteron memory subsystem. . . . . . . . 105
4.13 Memory bridge stages. . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Comparing programmed I/O and DMA transactions to the
network interface. . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Comparing messages using an eager protocol to messages using
a rendezvous protocol. . . . . . . . . . . . . . . . . . . . . . . 116
5.3 Ping-pong bandwidth for Quadrics Elan4 and 4X SDR InfiniBand . . . . 117
5.4 Cells in associative list processing units . . . . . . . . . . . . 130
5.5 Performance advantages of an ALPU . . . . . . . . . . . . . . . 131
5.6 NIC architecture enhanced with a list management unit. . . . 133
5.7 Performance of the list management unit . . . . . . . . . . . . 135
5.8 Microcoded match unit . . . . . . . . . . . . . . . . . . . . . 136
5.9 Match unit’s wide instruction word. . . . . . . . . . . . . . . 137
5.10 Match unit performance . . . . . . . . . . . . . . . . . . . . . 140
5.11 NIC-based atomic unit . . . . . . . . . . . . . . . . . . . . . . . 141
5.12 Comparing the performance with and without cache. . . . . . 144
5.13 Assessing the impact of size and associativity on atomic unit
cache performance. . . . . . . . . . . . . . . . . . . . . . . . . 145
5.14 Pseudo code for an Allreduce using triggered operations . . . 146
7.1 Congestion management functions in popular TCP variants. . 177
8.1 Comparison of traditional TCP network stack and RDMA . . 187
8.2 Effect of overlapping computation and communication. . . . . 190
8.3 TCP and RDMA communication architecture. . . . . . . . . . 191
8.4 iWARP protocols stack. . . . . . . . . . . . . . . . . . . . . . 202
9.1 Simple protocol onload approach on multi-core platforms. . . . 211
9.2 Deploying communication stacks on dedicated cores. . . . . . 214
9.3 Deployment alternatives for network processors. . . . . . . . 222
9.4 Heterogeneous multi-core platform. . . . . . . . . . . . . . . 224
10.1 Split device driver model. . . . . . . . . . . . . . . . . . . . . 234
10.2 Direct device access model. . . . . . . . . . . . . . . . . . . . . 241
10.3 RDMA write bandwidth divided among VMs. . . . . . . . . 243
11.1 Simplistic receive processing in MPI . . . . . . . . . . . . . . 264
11.2 Serialized MPI communication . . . . . . . . . . . . . . . . . 269
11.3 Communication/computation overlap . . . . . . . . . . . . . 270
11.4 Memory copy vs. OpenFabric memory registration times . . 275
12.1 Channel-based event delivery system . . . . . . . . . . . . . 282
12.2 Complex event processing delivery system . . . . . . . . . . . 283
12.3 Event delivery system built using EVPath . . . . . . . . . . 288
12.4 Basic stone types . . . . . . . . . . . . . . . . . . . . . . . . 290
12.5 Sample EVPath data structure declaration . . . . . . . . . . . 291
12.6 Specialization filter for stock trades ranges . . . . . . . . . . 292
12.7 Specialization filter for array averaging . . . . . . . . . . . . 293
12.8 Local stone transfer time for linear and tree-structured paths 295
12.9 EVPath throughput for various data sizes . . . . . . . . . . . 297
12.10 Using event channels for communication . . . . . . . . . . . 298
12.11 ECho event channel implementation using EVPath stones . . 299
12.12 Derived event channel implementation using EVPath stones 300
12.13 CPU overhead as a function of filter rejection ratio . . . . . . 301
13.1 High-level overview of a ticker plant . . . . . . . . . . . . . . 307
13.2 OPRA market peak data rates . . . . . . . . . . . . . . . . . 309
13.3 OPRA FAST encoded packet format . . . . . . . . . . . . . 314
13.4 OPRA reference decoder . . . . . . . . . . . . . . . . . . . . 315
13.5 Bottom-up reference decoder block diagram . . . . . . . . . . 316
13.6 Presence and field map bit manipulation . . . . . . . . . . . 317
13.7 A graphical representation of the DotStar compiler steps. . . 318
13.8 DotStar runtime . . . . . . . . . . . . . . . . . . . . . . . . . 319
13.9 DotStar source code . . . . . . . . . . . . . . . . . . . . . . . 320
13.10 OPRA message distribution . . . . . . . . . . . . . . . . . . 322
13.11 Performance comparison on several hardware platforms . . . 323
14.1 Partitioned architecture . . . . . . . . . . . . . . . . . . . . . 330
14.2 Lustre software stack. . . . . . . . . . . . . . . . . . . . . . . 332
14.3 Server-directed DMA handshake in Lustre . . . . . . . . . . 333
14.4 Panasas DirectFlow architecture . . . . . . . . . . . . . . . . 334
14.5 Parallel NFS architecture . . . . . . . . . . . . . . . . . . . . 337
14.6 PVFS2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
14.7 Round trip latency . . . . . . . . . . . . . . . . . . . . . . . 342
14.8 Point-to-point bandwidth for 120MB transfer . . . . . . . . . 342
14.9 Aggregate read pattern, 10MB per client/server pair . . . . . 343
14.10 LWFS RPC protocol . . . . . . . . . . . . . . . . . . . . . . 346
14.11 The 16-byte data structure used for each of the experiments. 347
14.12 Comparison of LWFS RPC to various other mechanisms. . . 349
15.1 Moderately complex network topology. . . . . . . . . . . . . 358
15.2 Network simulation animation. . . . . . . . . . . . . . . . . . 360
15.3 Network simulation visualization. . . . . . . . . . . . . . . . . 361
15.4 The space-parallel method for distributed network simulation. 363
List of Tables
1.1 Network characteristics of Purple, Red Storm, and Blue Gene/L 4
1.2 Properties of a ring vs. a fully connected network of n nodes 7
1.3 Hypercube conference: length of proceedings . . . . . . . . . . 11
4.1 Latency results for HToE bridge. . . . . . . . . . . . . . . . . 107
4.2 Latency numbers used for evaluation of performance penalties. 107
5.1 Breakdown of the assembly code. . . . . . . . . . . . . . . . . 139
12.1 Comparing split stone and filter stone execution times . . . . 296
13.1 OPRA message categories with description. . . . . . . . . . . 313
13.2 Hardware platforms used during experimental evaluation. . . 324
13.3 Detailed performance analysis (Intel Xeon Q6600, E5472 and
AMD Opteron 2352) . . . . . . . . . . . . . . . . . . . . . . . 325
13.4 Detailed performance analysis (SUN UltraSparc T2 and IBM
Power6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
13.5 DotStar latency on different platforms. . . . . . . . . . . . . . 327
14.1 Compute and I/O nodes for production DOE MPP systems
used since the early 1990s. . . . . . . . . . . . . . . . . . . . . . 331
Preface
As technology pushes the Petascale barrier, and the next generation Exascale
Computing requirements are being defined, it is clear that this type of
computational capacity can be achieved only by accompanying advancement in
communication technology. Next generation computing platforms will consist
of hundreds of thousands of interconnected computational resources. Therefore,
careful consideration must be placed on the choice of the communication
fabrics connecting everything, from cores and accelerators on individual platform
nodes, to the internode fabric for coupling nodes in large-scale data
centers and computational complexes, to wide area communication channels
connecting geographically disparate compute resources to remote data sources
and application end-users.
The traditional scientific HPC community has long pushed the envelope
on the capabilities delivered by existing communication fabrics. The computing
complex at Oak Ridge National Laboratory has an impressive 332 Gigabytes/sec
I/O bandwidth and a 786 Terabyte/sec global interconnection
bandwidth [305], used by extreme science applications from “subatomic to
galactic scales” domains [305], and supporting innovation in renewable energy
sources, climate modeling, and medicine. Riding on a curve of a thousandfold
increase in computational capabilities over the last five years alone, it is expected
that the growth in both compute resources, as well as the underlying
communication infrastructure and its capabilities, will continue to climb at
mind-blowing rates.
Other classes of HPC applications have equally impressive communication
needs beyond the main computing complex, as they require data from remote
sources — observatories, databases, remote instruments such as particle
accelerators, or large-scale data-intensive collaborations across many globally
distributed sites. The computational grids necessary for these applications
must be enabled by communication technology capable of moving terabytes of
data in a timely manner, while also supporting the interactive nature of the
remote collaborations.
More unusual, however, are the emerging extreme scale communication needs
outside of the traditional HPC domain. First, the confluence of the multi-
core nature of emerging hardware resources, coupled with the renaissance of
virtualization technology, is creating exceptional consolidation possibilities, and
giving rise to compute clouds of virtualized multi-core resources. At the same
time, the management complexity and operating costs associated with today’s
IT hardware and software stacks are pushing a broad range of enterprise class
applications into such clouds. As clouds spill outside of data center boundaries
and move towards becoming Exascale platforms across globally distributed
facilities [191,320], their communication needs — for computations distributed
across inter- and intra-cloud resources, for management, provisioning and QoS,
for workload migration, and for end-user interactions — become far richer.
Next, even enterprise applications which, due to their critical nature, are not
likely candidates for prime cloud deployment, are witnessing a “skyrocketing”
increase in communication needs. For instance, financial market analyses are
forecasting as many as 128 billion electronic trades a day just by 2010 [422], and
similar trends toward increased data rates and lower latency requirements are
expected in other market segments, from ticketing services, to parcel delivery
and distribution, to inventory management and forecasting.
Finally, the diversity in commodity end-user applications, from 3D distributed
games, to content sharing and rich multimedia-based applications, to
telecommunication and telepresence types of services, supported on everything
from high-end workstations to cellphones and embedded devices, is further
exacerbating the need for high performance communication services.
Objectives
No silver bullet will solve all of the communications-related challenges
faced by the emerging types of applications, platforms, and usage scenarios
described above. In order to address these requirements we need an ecosystem
of solutions along a stack of technology layers: (i) efficient interconnection
hardware; (ii) scalable, robust, end-to-end protocols; and (iii) system services
and tools specifically targeting emerging multi-core environments.
Toward this end, this book is a unique effort to discuss technological advances
and challenges related to high performance communications, by addressing each
layer in the vertical stack — the low-level architectural features of the hardware
interconnect and interconnection devices; the selection of communication
protocols; the implementation of protocol stacks and other operating system
features, including on modern, homogeneous or heterogeneous multi-core
platforms; the middleware software and libraries used in applications with high-
performance communication requirements; and the higher-level application
services such as file systems, monitoring services, and simulation tools. The
rationale behind this approach is that no single solution, applied at one
particular layer, can help applications address all performance-related issues
with their communication services. Instead, a coordinated effort is needed to
eliminate bottlenecks and address optimization opportunities at each layer —
from the architectural features of the hardware, through the protocols and their
implementation in OS kernels, to the manner in which application services
and middleware are using the underlying platforms.
This book is an edited collection of chapters on these topics. The choice to
organize this title as a collection of chapters contributed by different individuals
is a natural one — the vertical scope of the work calls for contributions from
researchers with many different areas of expertise. Each of the chapters is
organized in a manner which includes historic perspective, discussion of state-
of-the-art technology solutions and current trends, summary of major research
efforts in the community, and, where appropriate, greater level of detail on a
particular effort that the chapter contributor is mostly affiliated with.
The topics covered in each chapter are important and complex, and deserve
a separate book to cover adequately and in-depth all technical challenges
which surround them. Many such books exist [71, 128, 181, 280, 408, etc.].
What is unique about this title, however, is that it is a more comprehensive text for a
broader audience, spanning a community with interests across the entire stack.
Furthermore, each chapter abounds in references to past and current technical
papers and projects, which can guide the reader to sources of additional
information.
Finally, it is worth pointing out that even this type of text, which touches on
many different types of technologies and many layers across the stack, is by no
means complete. Many more topics remain only briefly mentioned throughout
the book, without having a full chapter dedicated to them. These include
topics related to physical-layer technologies, routing protocols and router architectures,
advanced communications protocols for multimedia communications,
modeling and analysis techniques, compiler and programming language-related
issues, and others.
Target audience
The relevance of this book is multifold. First, it is targeted at academics
for instruction in advanced courses on high-performance I/O and communication,
offered as part of graduate or undergraduate curricula in Computer
Science or Computer Engineering. Similar courses have been taught for several
years at a number of universities, including Georgia Tech (High-Performance
Communications, taught by this book’s editor), The Ohio State University
(Network-Based Computing, by Prof. Dhabaleswar K. Panda), and Auburn
University (Special Topics in High Performance Computing, by Prof. Weikuan Yu);
in addition, significant portions of the material are covered in more traditional courses
on High Performance Computing in many Computer Science or Engineering
programs worldwide.
In addition, this book can serve as relevant reference and instructional material
for students and researchers in other science and engineering areas, in
academia or at the National Labs, who are working on problems with significant
communication requirements, and on high performance platforms such
as supercomputers, high-end clusters or large-scale wide-area grids. Many of
these scientists may not have formal computer science/computer engineering
education, and this title aims to be a single text which can help steer them to
more easily identify performance improvement opportunities and find solutions
best suited for the applications they are developing and the platforms they
are using.
Finally, the text provides researchers who are specifically addressing problems
and developing technologies at a particular layer with a single reference that
surveys the state-of-the-art at other layers. By offering a concise overview of
related efforts from “above” or “below,” this book can be helpful in identifying
the best ways to position one’s work, and in ensuring that other elements of
the stack are appropriately leveraged.
Organization
The book is organized in 15 chapters, which roughly follow a vertical stack
from bottom to top.
• Chapter 1 discusses design alternatives for interconnection networks in
massively parallel systems, including examples from current Cray, Blue
Gene, as well as cluster-based supercomputers from the Top500 list.
• Chapter 2 provides in-depth discussion of the InfiniBand interconnection
technology, its hardware elements, protocols and features, and
includes a number of case studies from the HPC and enterprise domains,
which illustrate its suitability for a range of applications.
• Chapter 3 contrasts the traditional high-performance interconnection
solutions to the current capabilities offered by Ethernet-based interconnects,
and demonstrates several convergence trends among Ethernet and
EtherNOT technologies.
• Chapter 4 describes the key board- and chip-level interconnects such
as PCI Express, HyperTransport, and QPI; the capabilities they offer for
tighter system integration; and a case study for a service enabled by the
availability of low-latency integrated interconnects — global partitioned
address spaces.
• Chapter 5 discusses a number of considerations regarding the hardware
and software architecture of network interface devices (NICs), and
contrasts the design points present in NICs in a number of existing
interconnection fabrics.
• Chapter 6 complements this discussion by focusing on the characteristics
of the APIs natively supported by the different NIC platforms.
• Chapter 7 focuses on IP-based transport protocols and provides a historic
perspective on the different variants and performance optimization
opportunities for TCP transports, as well as a brief discussion of other
IP-based transport protocols.
• Chapter 8 describes in greater detail Remote Direct Memory Access
(RDMA) as an approach to network communication, along with iWARP,
an RDMA-based solution based on TCP transports.
• Chapter 9 more explicitly discusses opportunities to accelerate the
execution of communication services on multi-core platforms, including
general purpose homogeneous multi-cores, specialized network processing
accelerators, as well as emerging heterogeneous many-core platforms,
comprising both general purpose and accelerator resources.
• Chapter 10 targets the mechanisms used to ensure high-performance
communication services in virtualized platforms by discussing VMM-level
techniques as well as device-level features which help attain near-native
performance.
• Chapter 11 provides some historical perspective on MPI, the de facto
communication standard in high-performance computing, and outlines
some of the challenges in creating software and hardware implementations
of the MPI standard. These challenges are directly related to MPI
applications and their understanding can impact the programmers’ ability
to write better MPI-based codes.
• Chapter 12 looks at event-based middleware services as a means to address
the high-performance needs of many classes of HPC and enterprise
applications. It also gives an overview of the EVPath high-performance
middleware stack and demonstrates its utility in several different contexts.
• Chapter 13 first describes an important class of applications with
high performance communication requirements — electronic trading
platforms used in the financial industry. Next, it provides implementation
detail and performance analysis for the authors’ approach to leverage
the capabilities of general purpose multi-core platforms and to attain
impressive performance levels for one of the key components of the
trading engine.
• Chapter 14 describes the data-movement approaches used by a selection
of commercial, open-source, and research-based storage systems used
by massively parallel platforms, including Lustre, Panasas, PVFS2, and
LWFS.
• Chapter 15 discusses the overall approach to creating the simulation
tools needed to design, evaluate and experiment with a range of parameters
in the interconnection technology space, so as to help identify design
points with adequate performance behaviors.
Acknowledgments
This book would have never been possible without the contributions from the
individual chapter authors. My immense gratitude goes out to all of them for
their unyielding enthusiasm, expertise, and time.
Next, I would like to thank my editor Alan Apt, for approaching and
encouraging me to write this book, and the extended production team at
Taylor & Francis, for all their help with the preparation of the manuscript.
I am especially grateful to my mentor and colleague, Karsten Schwan, for
supporting my teaching of the High Performance Communications course at
Georgia Tech, which was the basis for this book. He and my close collaborators,
Greg Eisenhauer and Matthew Wolf, were particularly accommodating during
the most intense periods of manuscript preparation in providing me with the
necessary time to complete the work.
Finally, I would like to thank my family, for all their love, care, and unconditional
support. I dedicate this book to them.
Ada Gavrilovska
Atlanta, 2009
About the Editor
Ada Gavrilovska is a Research Scientist in the College of Computing at
Georgia Tech, and at the Center for Experimental Research in Computer
Systems (CERCS). Her research interests include conducting experimental
systems research, specifically addressing high-performance applications on
distributed heterogeneous platforms, and focusing on topics that range from
operating and distributed systems, to virtualization, to programmable network
devices and communication accelerators, to active and adaptive middleware
and I/O. Most recently, she has been involved with several projects focused
on development of efficient virtualization solutions for platforms ranging from
heterogeneous multi-core systems to large-scale cloud environments.
In addition to research, Dr. Gavrilovska has a strong commitment to teaching.
At Georgia Tech she teaches courses on advanced operating systems and high-
performance communications topics, and is deeply involved in a larger effort
aimed at upgrading the Computer Science and Engineering curriculum with
multicore-related content.
Dr. Gavrilovska has a B.S. in Electrical and Computer Engineering from
the Faculty of Electrical Engineering, University Sts. Cyril and Methodius, in
Skopje, Macedonia, and M.S. and Ph.D. degrees in Computer Science from
Georgia Tech, both completed under the guidance of Dr. Karsten Schwan.
Her work has been supported by the National Science Foundation, the U.S.
Department of Energy, and through numerous collaborations with industry,
including Intel, IBM, HP, Cisco Systems, Netronome Systems, and others.
List of Contributors
Hasan Abbasi
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Virat Agarwal
IBM TJ Watson Research Center
Yorktown Heights, New York
David Bader
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Pavan Balaji
Argonne National Laboratory
Argonne, Illinois
Ron Brightwell
Sandia National Laboratories
Albuquerque, New Mexico
Dennis Dalessandro
Ohio Supercomputer Center
Columbus, Ohio
Lin Duan
IBM TJ Watson Research Center
Yorktown Heights, New York
Greg Eisenhauer
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Wu-chun Feng
Departments of Computer Science
and Electrical & Computer
Engineering
Virginia Tech
Blacksburg, Virginia
Ada Gavrilovska
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Scott Hemmert
Sandia National Laboratories
Albuquerque, New Mexico
Matthew Jon Koop
Department of Computer Science
and Engineering
The Ohio State University
Columbus, Ohio
Todd Kordenbrock
Hewlett-Packard
Albuquerque, New Mexico
Lurng-Kuo Liu
IBM TJ Watson Research Center
Yorktown Heights, New York
Ron A. Oldfield
Sandia National Laboratories
Albuquerque, New Mexico
Scott Pakin
Los Alamos National Laboratory
Los Alamos, New Mexico
Dhabaleswar K. Panda
Department of Computer Science
and Engineering
The Ohio State University
Columbus, Ohio
Davide Pasetto
IBM Computational Science Center
Dublin, Ireland
Michaele Perrone
IBM TJ Watson Research Center
Yorktown Heights, New York
Fabrizio Petrini
IBM TJ Watson Research Center
Yorktown Heights, New York
Adit Ranadive
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Dulloor Rao
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
George Riley
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Karsten Schwan
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Jeff Squyres
Cisco Systems
Louisville, Kentucky
Sayantan Sur
IBM TJ Watson Research Center
Hawthorne, New York
Keith Underwood
Intel Corporation
Albuquerque, New Mexico
Patrick Widener
University of New Mexico
Albuquerque, New Mexico
Matthew Wolf
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Pete Wyckoff
Ohio Supercomputer Center
Columbus, Ohio
Sudhakar Yalamanchili
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Jeffrey Young
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Chapter 1
High Performance Interconnects for
Massively Parallel Systems
Scott Pakin
Los Alamos National Laboratory
1.1 Introduction
If you were intent on building the world’s fastest supercomputer, how would
you design the interconnection network? As with any architectural endeavor,
there are a suite of trade-offs and design decisions that need to be considered:
Maximize performance Ideally, every application should run fast, but
would it be acceptable for a smaller set of “important” applications
or application classes to run fast? If so, is the overall performance of
those applications dominated by communication performance? Do those
applications use a known communication pattern that the network can
optimize?
Minimize cost Cheaper is better, but how much communication performance
are you willing to sacrifice to cut costs? Can you exploit existing hardware
components, or do you have to do a full-custom design?
Maximize scalability Some networks are fast and cost-efficient at small
scale but quickly become expensive as the number of nodes increases.
How can you ensure that you will not need to rethink the network design
from scratch when you want to increase the system's node count? Can
the network grow incrementally? (Networks that require a power-of-two
number of nodes quickly become prohibitively expensive, for example.)
Minimize complexity How hard is it to reason about application performance?
To observe good performance, do parallel algorithms need to be
constructed specifically for your network? Do an application’s processes
need to be mapped to nodes in some non-straightforward manner to keep
the network from becoming a bottleneck? As for some more mundane
complexity issues, how tricky is it for technicians to lay the network
cables correctly, and how much rewiring is needed when the network size
increases?
Maximize robustness The more components (switches, cables, etc.) compose
the network, the more likely it is that one of them will fail. Would
you utilize a naturally fault-tolerant topology at the expense of added
complexity or replicate the entire network at the expense of added cost?
Minimize power consumption Current estimates are that a sustained
megawatt of power costs between US$200,000–1,200,000 per year [144].
How much power are you willing to let the network consume? How
much performance or robustness are you willing to sacrifice to reduce
the network’s power consumption?
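A quick back-of-the-envelope check, assuming only that electricity costs
somewhere between roughly US$0.023 and US$0.14 per kilowatt-hour, shows
where estimates of this magnitude come from:

\[
1\,\mathrm{MW} \times 8760\,\mathrm{h/yr} = 8.76\times10^{6}\,\mathrm{kWh/yr},
\qquad
8.76\times10^{6}\,\mathrm{kWh/yr} \times \mathrm{US}\$0.023\text{--}0.14/\mathrm{kWh} \approx \mathrm{US}\$200{,}000\text{--}1{,}200{,}000.
\]

The wide range quoted above thus largely reflects how much electricity prices
vary from site to site.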
There are no easy answers to those questions, and the challenges increase in
difficulty with larger and larger networks: Network performance is more critical;
costs may grow disproportionately to the rest of the system; incremental
growth needs to be carefully considered; complexity is more likely to frustrate
application developers; fault tolerance is both more important and more
expensive; and power consumption becomes a serious concern. In the remainder
of this chapter we examine a few aspects of the network-design balancing act.
Section 1.2 quantifies the measured network performance of a few massively
parallel supercomputers. Section 1.3 discusses network topology, a key design
decision and one that impacts all of performance, cost, scalability, complexity,
robustness, and power consumption. In Section 1.4 we turn our attention
to some of the “bells and whistles” that a network designer might consider
including when trying to balance various design constraints. We briefly discuss
some future directions in network design in Section 1.5 and summarize the
chapter in Section 1.6.
Terminology note In much of the system-area network literature, the term
switch implies an indirect network while the term router implies a direct
network. In the wide-area network literature, in contrast, the term router
typically implies a switch plus additional logic that operates on packet contents.
For simplicity, in this chapter we have chosen to consistently use the term
switch when describing a network’s switching hardware.
1.2 Performance
One way to discuss the performance of an interconnection network is in
terms of various mathematical characteristics of the topology including — as a
function at least of the number of nodes — the network diameter (worst-case
distance in switch hops between two nodes), average-case communication dis-
tance, bisection width (the minimum number of links that need to be removed
to partition the network into two equal halves), switch count, and network
capacity (maximum number of links that can carry data simultaneously).
However, there are many nuances of real-world networks that are not cap-
tured by simple mathematical expressions. The manner in which applications,
messaging layers, network interfaces, and network switches interact can be
complex and unintuitive. For example, Arber and Pakin demonstrated that the
address in memory at which a message buffer is placed can have a significant
impact on communication performance and that different systems favor differ-
ent buffer alignments [17]. Because of the discrepancy between the expected
performance of an idealized network subsystem and the measured performance
of a physical one, the focus of this section is on actual measurements of parallel
supercomputers containing thousands to tens of thousands of nodes.
1.2.1 Metrics
The most common metric used in measuring network performance is half of
the round-trip communication time (½RTT), often measured in microseconds.
The measurement procedure is as follows: Process A reads the time and sends
a message to process B; upon receipt of process A’s message, process B sends
an equal-sized message back to process A; upon receipt of process B’s message,
process A reads the time again, divides the elapsed time by two, and reports
the result as ½RTT. The purpose of using round-trip communication is that
it does not require the endpoints to have precisely synchronized clocks. In
many papers, ½RTT is referred to as latency and the message size divided by
½RTT as bandwidth, although these definitions are far from universal. In the
LogP parallel-system model [100], for instance, latency refers solely to “wire
time” for a zero-byte message and is distinguished from overhead, the time
that a CPU is running communication code instead of application code. In
this section, we use the less precise but more common definitions of latency
and bandwidth in terms of ½RTT for an arbitrary-size message. Furthermore,
all performance measurements in this section are based on tests written using
MPI [408], currently the de facto standard messaging layer.
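The measurement procedure above maps directly onto a short MPI program. The following is a minimal sketch (the message size, iteration count, and the decision to average over a fixed number of repetitions are illustrative choices, not values taken from any study cited in this chapter):

/*
 * Minimal ping-pong sketch of the 1/2 RTT measurement described above.
 * Run with at least two MPI ranks; ranks other than 0 and 1 sit idle.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const int iters = 1000;       /* repetitions to average out timer noise */
    const int msg_bytes = 8;      /* vary this to trace out a bandwidth curve */
    char *buf = NULL;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(msg_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            /* Process A: send, then wait for the echo. */
            MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* Process B: echo an equal-sized message back. */
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double half_rtt_us = (t1 - t0) / iters / 2.0 * 1e6;
        printf("1/2 RTT for %d-byte messages: %.2f us\n", msg_bytes, half_rtt_us);
        printf("bandwidth (size / 1/2 RTT): %.2f MB/s\n",
               msg_bytes / (half_rtt_us * 1e-6) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Production benchmarks add warm-up iterations and sweep the message size, but the core exchange is exactly the one described above.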
1.2.2 Application Sensitivity to Communication Performance
An underlying premise of this entire book is that high-performance ap-
plications require high-performance communication. Can we quantify that
statement in the context of massively parallel systems? In a 2006 paper,
Kerbyson used detailed, application-specific, analytical performance models to
examine the impact of varying the network latency, bandwidth, and number
of CPUs per node [224]. Kerbyson’s study investigated two large applications
(SAGE [450] and Partisn [24]) and one application kernel (Sweep3D [233])
TABLE 1.1: Network characteristics of Purple, Red Storm, and Blue Gene/L [187]

Metric                         Purple          Red Storm               Blue Gene/L
CPU cores                      12,288          10,368                  65,536
Cores per node                 8               1                       2
Nodes                          1,536           10,368                  32,768
NICs per node                  2               1                       1
Topology (cf. §1.3)            16-ary 3-tree   27×16×24 mesh in x,y;   32×32×32 torus
                                               torus in z              in x,y,z
Peak link bandwidth (MB/s)     2,048           3,891                   175
Achieved bandwidth (MB/s)      2,913           1,664                   154
Achieved min. latency (µs)     4.4             6.9                     2.8
at three different network sizes. Of the three applications, Partisn is the
most sensitive to communication performance. At 1024 processors, Partisn’s
performance can be improved by 7.4% by reducing latency from 4 µs to 1.5 µs,
11.8% by increasing bandwidth from 0.9 GB/s to 1.6 GB/s, or 16.4% by de-
creasing the number of CPUs per node from 4 to 2, effectively halving the
contention for the network interface controller (NIC). Overall, the performance
difference between the worst set of network parameters studied (4 µs latency,
0.9 GB/s bandwidth, and 4 CPUs/node) and the best (1.5 µs latency, 1.6 GB/s
bandwidth, and 2 CPUs/node) is 68% for Partisn, 24% for Sweep3D, and
15% for SAGE, indicating that network performance is in fact important to
parallel-application performance.
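A first-order way to see why both parameters matter (a minimal sketch, not the detailed application-specific models used in Kerbyson's study) is the linear cost model T(m) = α + m/β for an m-byte message, with latency α in µs and bandwidth β in bytes per µs (0.9 GB/s ≈ 0.9×10³ B/µs). Plugging in the studied extremes:

\[
T(1\,\text{KB}) \approx 4 + \frac{10^3}{0.9\times 10^3} \approx 5.1\ \mu\text{s}
\quad\text{vs.}\quad
1.5 + \frac{10^3}{1.6\times 10^3} \approx 2.1\ \mu\text{s},
\]
\[
T(1\,\text{MB}) \approx 4 + \frac{10^6}{0.9\times 10^3} \approx 1115\ \mu\text{s}
\quad\text{vs.}\quad
1.5 + \frac{10^6}{1.6\times 10^3} \approx 626\ \mu\text{s}.
\]

Small messages are dominated by latency and large messages by bandwidth, so an application's message-size mix largely determines which network parameter it is sensitive to.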
Of course, these results are sensitive to the particular applications, input
parameters, and the selected variations in network characteristics. Neverthe-
less, other researchers have also found communication performance to be an
important contributor to overall application performance. (Martin et al.’s
analysis showing the significance of overhead [277] is an oft-cited study, for
example.) The point is that high-performance communication is needed for
high-performance applications.
1.2.3 Measurements on Massively Parallel Systems
In 2005, Hoisie et al. [187] examined the performance of three of the most
powerful supercomputers of the day: IBM’s Purple system [248] at Lawrence
Livermore National Laboratory, Cray/Sandia’s Red Storm system [58] at
Sandia National Laboratories, and IBM’s Blue Gene/L system [2] at Lawrence
Livermore National Laboratory. Table 1.1 summarizes the node and network
characteristics of each of these three systems at the time at which Hoisie et al.
ran their measurements. (Since the data was acquired all three systems have
had substantial hardware and/or software upgrades.)
FIGURE 1.1: Comparative communication performance of Purple, Red Storm,
and Blue Gene/L.
Purple is a traditional cluster, with NICs located on the I/O bus and
connecting to a network fabric. Although eight CPUs share the two NICs in a
node, the messaging software can stripe data across the two NICs, resulting in a
potential doubling of bandwidth. (This explains how the measured bandwidth
in Table 1.1 can exceed the peak link bandwidth.) Red Storm and Blue Gene/L
are massively parallel processors (MPPs), with the nodes and network more
tightly integrated.
Figure 1.1 reproduces some of the more interesting measurements from
Hoisie et al.’s paper. Figure 1.1(a) shows the ½RTT of a 0-byte message sent
from node 0 to each of nodes 1–1023 in turn. As the figure shows, Purple’s
latency is largely independent of distance (although plateaus representing
each of the three switch levels are in fact visible); Blue Gene/L’s latency
clearly matches the Manhattan distance between the two nodes, with peaks
and valleys in the graph representing the use of the torus links in each 32-node
row and 32×32-node plane; however, Red Storm’s latency appears erratic.
This is because Red Storm assigns ranks in the computation based on a node’s
physical location on the machine room floor — aisle (0–3), cabinet within an
aisle (0–26), “cage” (collection of processor boards) within a cabinet (0–2),
board within a cage (0–7), and socket within a board (0–3) — not, as expected,
by the node’s x (0–26), y (0–15), and z (0–23) coordinates on the mesh.
While the former mapping may be convenient for pinpointing faulty hardware,
the latter is more natural for application programmers. Unexpected node
mappings are one of the subtleties that differentiate measurements of network
performance from expectations based on simulations, models, or theoretical
characteristics.
Figure 1.1(b) depicts the impact of link contention on achievable bandwidth.
At contention level 0, two nodes are measuring bandwidth as described in
Section 1.2.1. At contention level 1, two other nodes lying between the first pair
exchange messages while the first pair is measuring bandwidth; at contention
level 2, another pair of coincident nodes consumes network bandwidth, and so
on. On Red Storm and Blue Gene/L, the tests are performed across a single
dimension of the mesh/torus to force all of the messages to overlap. Each
curve in Figure 1.1(b) tells a different story. The Blue Gene/L curve contains
plateaus representing messages traveling in alternating directions across the
torus. Purple’s bandwidth degrades rapidly up to the node size (8 CPUs) then
levels off, indicating that contention for the NIC is a greater problem than
contention within the rest of the network. Relative to the other two systems,
Red Storm observes comparatively little performance degradation. This is
because Red Storm’s link speed is high relative to the speed at which a single
NIC can inject messages into the network. Consequently, it takes a relatively
large number of concurrent messages to saturate a network link. In fact, Red
Storm’s overspecified network links largely reduce the impact of the unusual
node mappings seen in Figure 1.1(a).
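The contention experiment can be sketched in the same style. The fragment below assumes that consecutive MPI ranks land on consecutive nodes along a single mesh/torus dimension and that the job uses an even number of ranks (both assumptions of this sketch, not guarantees of any system). Pair i consists of ranks i and P-1-i, so every inner pair's traffic lies between the endpoints of the outermost, measuring pair, and the contention level is P/2 - 1:

/*
 * Sketch of the link-contention test: nested pairs exchange messages
 * concurrently while the outermost pair (ranks 0 and P-1) reports its
 * achieved per-direction bandwidth.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MiB messages (illustrative) */
#define ITERS     100

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sbuf = malloc(MSG_BYTES);
    char *rbuf = malloc(MSG_BYTES);
    int peer = nprocs - 1 - rank;   /* nested pairing: 0<->P-1, 1<->P-2, ... */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                     rbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("contention level %d: %.1f MB/s per direction\n",
               nprocs / 2 - 1,
               (double)MSG_BYTES * ITERS / (t1 - t0) / 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

As the Red Storm results above illustrate, the rank-to-node mapping assumption is exactly the kind of detail that can quietly change what such a test measures.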
Which is more important to a massively parallel architecture: faster proces-
sors or a faster network? The tradeoff is that faster processors raise the peak
performance of the system, but a faster network enables more applications
to see larger fractions of peak performance. The correct choice depends on
the specifics of the system and is analogous to the choice of eating a small
piece of a large pie or a large piece of a small pie. On an infinitely fast
network (relative to computation speed), all parallelism that can be exposed —
even, say, in an arithmetic operation as trivial as (a + b) · (c + d) — leads
to improved performance. On a network with finite speed but that is still
fairly fast relative to computation speed, trivial operations are best performed
sequentially although small blocks of work can still benefit from running in
parallel. On a network that is much slower than what the processors can
feed (e.g., a wide-area computational Grid [151]), only the coarsest-grained
applications are likely to run faster in parallel than sequentially on a single
machine. As Hoisie et al. quantify [187], Blue Gene/L’s peak performance is
18 times that of the older ASCI Q system [269] but its relatively slow network
limits it to running SAGE [450] only 1.6 times as fast as ASCI Q. In contrast,
Red Storm’s peak performance is only 2 times ASCI Q’s but its relatively fast
network enables it to run SAGE 2.45 times as fast as ASCI Q. Overspecifying
the network was in fact an important aspect of Red Storm’s design [58].
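One back-of-the-envelope way to read those numbers (a normalization for illustration, not a metric used in the cited study) is to divide each machine's SAGE speedup over ASCI Q by its peak-performance ratio:

\[
\text{Blue Gene/L: } \frac{1.6}{18} \approx 0.09,
\qquad
\text{Red Storm: } \frac{2.45}{2} \approx 1.2 .
\]

By this measure, Red Storm converts raw peak into delivered application speedup roughly an order of magnitude more effectively on this workload.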
1.3 Network Topology
One of the fundamental design decisions when architecting an interconnection
network for a massively parallel system is the choice of network topology. As
mentioned briefly at the start of Section 1.2, there are a wide variety of metrics
one can use to compare topologies. Selecting a topology also involves a large
TABLE 1.2: Properties of a ring vs. a fully connected network of n nodes

Metric          Ring        Fully connected
Diameter        n/2         1
Avg. dist.      n/4         1
Degree          2           n − 1
Bisection       2           (n/2)²
Switches        n           n
Minimal paths   1           1
Contention      blocking    nonblocking
number of tradeoffs that can impact performance, cost, scalability, complexity,
robustness, and power consumption:
Minimize diameter Reducing the hop count (number of switch crossings)
should help reduce the overall latency. Given a choice, however, should
you minimize the worst-case hop count (the diameter) or the average
hop count?
Minimize degree How many incoming/outgoing links are there per node?
Low-radix networks devote more bandwidth to each link, improving
performance for communication patterns that map nicely to the topol-
ogy (e.g., nearest-neighbor 3-space communication on a 3-D mesh).
High-radix networks reduce average latency — even for communication
patterns that are a poor match to the topology — and the cost of the
network in terms of the number of switches needed to interconnect a
given number of nodes. A second degree-related question to consider is
whether the number of links per node is constant or whether it increases
with network size.
Maximize bisection width The minimum number of links that must be
removed to split the network into two disjoint pieces with equal node
count is called the bisection width. (A related term, bisection bandwidth,
refers to the aggregate data rate between the two halves of the network.)
A large bisection width may improve fault tolerance, routability, and
global throughput. However, it may also greatly increase the cost of the
network if it requires substantially more switches.
Minimize switch count How many switches are needed to interconnect
n nodes? O(n) implies that the network can cost-effectively scale up to
large numbers of nodes; O(n²) topologies are impractical for all but the
smallest networks.
Maximize routability How many minimal paths are there from a source
node to a destination node? Is the topology naturally deadlock-free
or does it require extra routing sophistication in the network to main-
tain deadlock freedom? The challenge is to ensure that concurrently
transmitted messages cannot mutually block each other’s progress while
simultaneously allowing the use of as many paths as possible between a
source and a destination [126].
Minimize contention What communication patterns (mappings from a set
of source nodes to a — possibly overlapping — set of destination nodes)
can proceed with no two paths sharing a link? (Sharing implies reduced
bandwidth.) As Jajszczyk explains in his 2003 survey paper, a network
can be categorized as nonblocking in the strict sense, nonblocking in
the wide sense, repackable, rearrangeable, or blocking, based on the level
of effort (roughly speaking) needed to avoid contention for arbitrary
mappings of sources to destinations [215]. But how important is it to
support arbitrary mappings — vs. just a few common communication
patterns — in a contention-free manner? Consider also that the use
of packet switching and adaptive routing can reduce the impact of
contention on performance.
Consider two topologies that represent the extremes of connectivity: a ring
of n nodes (in which nodes are arranged in a circle, and each node connects only
to its “left” and “right” neighbors) and a fully connected network (in which
each of the n nodes connects directly to each of the other n−1 nodes). Table 1.2
summarizes these two topologies in terms of the preceding characteristics. As
that table shows, neither topology outperforms the other in all cases (as is
typical for any two given topologies), and neither topology provides many
minimal paths, although the fully connected network provides a large number
of non-minimal paths. In terms of scalability, the ring’s primary shortcomings
are its diameter and bisection width, and the fully connected network’s primary
shortcoming is its degree. Consequently, in practice, neither topology is found
in massively parallel systems.
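Those closed-form entries are easy to verify computationally. The sketch below (plain C, with an illustrative node count of N = 16 and one switch assumed per node) builds both topologies as adjacency matrices, runs a breadth-first search from every node, and reports the degree, diameter, and average distance; the ring comes out near 2, n/2, and n/4, and the fully connected network comes out as n−1, 1, and 1:

/*
 * Sketch that checks the distance-related rows of Table 1.2 by brute force:
 * build a ring and a fully connected network on N nodes, then compute the
 * diameter and average inter-node distance with breadth-first search.
 */
#include <stdio.h>
#include <string.h>

#define N 16                      /* illustrative node count */

static int adj[N][N];             /* adjacency matrix */

static void metrics(const char *name)
{
    int dist[N], queue[N];
    int diameter = 0, degree = 0;
    double total = 0.0;

    for (int v = 0; v < N; v++)   /* degree of node 0 (same for every node here) */
        degree += adj[0][v];

    for (int src = 0; src < N; src++) {
        /* Unweighted single-source shortest paths via BFS. */
        int head = 0, tail = 0;
        for (int i = 0; i < N; i++) dist[i] = -1;
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < N; v++)
                if (adj[u][v] && dist[v] < 0) {
                    dist[v] = dist[u] + 1;
                    queue[tail++] = v;
                }
        }
        for (int v = 0; v < N; v++) {
            if (dist[v] > diameter) diameter = dist[v];
            total += dist[v];
        }
    }
    printf("%-16s degree=%-3d diameter=%-3d avg. distance=%.2f\n",
           name, degree, diameter, total / (N * (N - 1)));
}

int main(void)
{
    /* Ring: node i connects to its two neighbors. */
    memset(adj, 0, sizeof adj);
    for (int i = 0; i < N; i++)
        adj[i][(i + 1) % N] = adj[(i + 1) % N][i] = 1;
    metrics("ring");

    /* Fully connected: every node connects to every other node. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            adj[i][j] = (i != j);
    metrics("fully connected");
    return 0;
}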
1.3.1 The “Dead Topology Society”
What topologies are found in massively parallel systems? Ever since parallel
computing first started to gain popularity (arguably in the early 1970s), the
parallel-computing literature has been rife with publications presenting a vir-
tually endless menagerie of network topologies, including — but certainly not
limited to — 2-D and 3-D meshes and tori [111], banyan networks [167], Beneš
networks [39], binary trees [188], butterfly networks [251], Cayley graphs [8],
chordal rings [18], Clos networks [94], crossbars [342], cube-connected cy-
cles [347], de Bruijn networks [377], delta networks [330], distributed-loop
networks [40], dragonflies [227], extended hypercubes [238], fat trees [253],
FIGURE 1.2: Network topologies used in the ten fastest supercomputers and the
single most parallel supercomputer, June and November, 1993–2008. Legend:
M = mesh or torus (35.8%); F = fat tree (34.8%); H = hierarchical (13.2%);
X = crossbar (11.3%); C = hypercube (1.0%); O = other (0.6%); no network (3.2%).
flattened butterflies [228], flip networks [32], fractahedrons [189], generalized
hypercube-connected cycles [176], honeycomb networks [415], hyperbuses [45],
hypercubes [45], indirect binary n-cube arrays [333], Kautz graphs [134], KR-
Benes networks [220], Moebius graphs [256], omega networks [249], pancake
graphs [8], perfect-shuffle networks [416], pyramid computers [295], recursive
circulants [329], recursive cubes of rings [421], shuffle-exchange networks [417],
star graphs [7], X-torus networks [174], and X-Tree structures [119]. Note that
many of those topologies are general classes of topologies that include other
topologies on the list or are specific topologies that are isomorphic to other
listed topologies.
Some of the entries in the preceding list have been used in numerous parallel-
computer implementations; others never made it past a single publication.
Figure 1.2 presents the topologies used over the past fifteen years both in
the ten fastest supercomputers and in the single most parallel computer
(i.e., the supercomputer with the largest number of processors, regardless of
its performance).
The data from Figure 1.2 were taken from the semiannual Top500 list
(http://www.top500.org/), and therefore represent achieved performance on
the HPLinpack benchmark [123].¹ Figure 1.3 illustrates the four topologies
that are the most prevalent in Figure 1.2: a crossbar, a hypercube, a fat tree,
and a mesh/torus.
FIGURE 1.3: Illustrations of various network topologies: (a) an 8-port crossbar;
(b) a hypercube (specifically, a 2-ary 4-cube); (c) a fat tree (specifically, a
2-ary 3-tree), where the shaded addition is commonly used in practice; (d) a 3-D
mesh, where the shaded addition makes it a 3-D torus (3-ary 3-cube); a single
plane is a 2-D mesh/torus, and a single line is a 1-D mesh/torus (a.k.a. a ring).
One feature of Figure 1.2 that is immediately clear is that the most parallel
system on each Top500 list since 1994 has consistently been either a 3-D mesh
or a 3-D torus. (The data are a bit misleading, however, as only four different
systems are represented: the CM-200 [435], Paragon [41], ASCI Red [279],
and Blue Gene/L [2].) Another easy-to-spot feature is that two topologies —
meshes/tori and fat trees — dominate the top 10. More interesting, though,
are the trends in those two topologies over time:
• In June 1993, half of the top 10 supercomputers were fat trees. Only
one mesh/torus made the top 10 list.
¹HPLinpack measures the time to solve a system of linear equations using LU
decomposition with partial pivoting. Although the large, dense matrices that
HPLinpack manipulates are rarely encountered in actual scientific applications,
the decade and a half of historical HPLinpack performance data covering
thousands of system installations is handy for analyzing trends in
supercomputer performance.
TABLE 1.3: Hypercube conference: length of proceedings

Year  Conference title                                                        Pages
1986  First Conference on Hypercube Multiprocessors                             286
1987  Second Conference on Hypercube Multiprocessors                            761
1988  Third Conference on Hypercube Concurrent Computers and Applications      2682
1989  Fourth Conference on Hypercubes, Concurrent Computers, and Applications  1361
• From November 1996 through November 1999 there were no fat trees in
the top 10. In fact, in June 1998 every one of the top 10 was either a
mesh or a torus. (The entry marked as “hierarchical” was in fact a 3-D
mesh of crossbars.)
• From November 2002 through November 2003 there were no meshes or
tori in the top 10. Furthermore, the November 2002 list contained only
a single non-fat-tree.
• In November 2006, 40% of the top 10 were meshes/tori and 50% were
fat trees.
Meshes/tori and fat trees clearly have characteristics that are well suited to
high-performance systems. However, the fact that the top 10 is alternately
dominated by one or the other indicates that changes in technology favor
different topologies at different times. The lesson is that the selection of
a topology for a massively parallel system must be based on what current
technology makes fast, inexpensive, power-conscious, etc.
Although hypercubes did not often make the top 10, they do make for
an interesting case study in a topology’s boom and bust. Starting in 1986,
there were enough hypercube systems, users, and researchers to warrant an
entire conference devoted solely to hypercubes. Table 1.3 estimates interest
in hypercubes — rather informally — by presenting the number of pages
in this conference’s proceedings. (Note though that the conference changed
names almost every single year.) From 1986 to 1988, the page count increased
almost tenfold! However, the page count halved the next year even as the
scope broadened to include non-hypercubes, and, in 1990, the conference, then
known as the Fifth Distributed Memory Computing Conference, no longer
focused on hypercubes.
How did hypercubes go from being the darling of massively parallel system
design to topology non grata? There are three parts to the answer. First,
topologies that match an expected usage model are often favored over those
that don’t. For example, Sandia National Laboratories have long favored 3-D
meshes because many of their applications process 3-D volumes. In contrast,
few scientific methods are a natural match to hypercubes. The second reason
that hypercubes lost popularity is that the topology requires a doubling of
the processor count (more for non-binary hypercubes) for each increment in
size. This limitation starts to become impractical at larger system sizes. For
example, in 2007, Lawrence Livermore National Laboratory upgraded their
Blue Gene/L [2] system (a 3-D torus) from 131,072 to 212,992 processors; it
was too costly to go all the way to 262,144 processors, as would be required
by a hypercube topology.
The third, and possibly most telling, reason that hypercubes virtually
disappeared is technology limitations. The initial popularity of the
hypercube topology was due to how well it fares with many of the metrics
listed at the start of Section 1.3. However, Dally’s PhD dissertation [110] and
subsequent journal publication [111] highlighted a key fallacy of those metrics:
They fail to consider wire (or pin) limitations. Given a fixed number of wires
into and out of a switch, an n-dimensional binary hypercube (a.k.a. a 2-ary
n-cube) divides these wires — and therefore the link bandwidth — by n; the
larger the processor count, the less bandwidth is available in each direction.
In contrast, a 2-D torus (a.k.a. a k-ary 2-cube), for example, provides 1/4 of
the total switch bandwidth in each direction regardless of the processor count.
Another point that Dally makes is that high-dimension networks in general
require long wires — and therefore high latencies — when embedded in a
lower-dimension space, such as a plane in a typical VLSI implementation.
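In concrete terms (a simplified restatement of Dally's argument, with W denoting a switch's fixed wire budget), a 2-ary n-cube of P = 2^n nodes has n links per switch, so each link receives W/n wires, whereas a 2-D torus always has four links per switch:

\[
B_{\text{link}}^{\text{hypercube}} = \frac{W}{\log_2 P},
\qquad
B_{\text{link}}^{\text{2-D torus}} = \frac{W}{4};
\qquad
\text{at } P = 65{,}536:\ \frac{W}{16}\ \text{vs.}\ \frac{W}{4}.
\]

At that scale each hypercube link carries only a quarter of the bandwidth of a torus link built from the same pin budget.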
In summary, topologies with the best mathematical properties are not
necessarily the best topologies to implement. When selecting a topology for
a massively parallel system, one must consider the real-world features and
limitations of the current technology. A topology that works well one year
may be suboptimal when facing the subsequent year’s technology.
1.3.2 Hierarchical Networks
A hierarchical network uses an “X of Y ” topology — a base topology in
which every node is replaced with a network of a (usually) different topology.
As Figure 1.2 in the previous section indicates, many of the ten fastest su-
percomputers over time have employed hierarchical networks. For example,
Hitachi’s SR2201 [153] and the University of Tsukuba’s CP-PACS [49] (#1,
respectively, on the June and November 1996 Top500 lists) were both 8×17×16
meshes of crossbars (a.k.a. 3-D hypercrossbars). Los Alamos National Labo-
ratory’s ASCI Blue Mountain [268] (#4 on the November 1998 Top500 list)
used a deeper topology hierarchy: a HIPPI [433] 3-D torus of “hierarchical fat
bristled hypercubes” — the SGI Origin 2000’s network of eight crossbars, each
of which connected to a different node within a set of eight 3-D hypercubes and
where each node contained two processors on a bus [246]. The topologies do not
need to be different: NASA’s Columbia system [68] (#2 on the November 2004
Top500 list) is an InfiniBand [207,217] 12-ary fat tree of NUMAlink 4-ary fat
trees. (NUMAlink is the SGI Altix 3700’s internal network [129].)
Other documents randomly have
different content
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.
Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.
Section 3. Information about the Project
Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status
by the Internal Revenue Service. The Foundation’s EIN or
federal tax identification number is 64-6221541. Contributions
to the Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.
The Foundation’s business office is located at 809 North 1500
West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact
Section 4. Information about Donations to
the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.
The Foundation is committed to complying with the laws
regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or determine
the status of compliance for any particular state visit
www.gutenberg.org/donate.
While we cannot and do not solicit contributions from states
where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.
International donations are gratefully accepted, but we cannot
make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.
Please check the Project Gutenberg web pages for current
donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.
Section 5. General Information About
Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.
Project Gutenberg™ eBooks are often created from several
printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
Welcome to Our Bookstore - The Ultimate Destination for Book Lovers
Are you passionate about books and eager to explore new worlds of
knowledge? At our website, we offer a vast collection of books that
cater to every interest and age group. From classic literature to
specialized publications, self-help books, and children’s stories, we
have it all! Each book is a gateway to new adventures, helping you
expand your knowledge and nourish your soul
Experience Convenient and Enjoyable Book Shopping Our website is more
than just an online bookstore—it’s a bridge connecting readers to the
timeless values of culture and wisdom. With a sleek and user-friendly
interface and a smart search system, you can find your favorite books
quickly and easily. Enjoy special promotions, fast home delivery, and
a seamless shopping experience that saves you time and enhances your
love for reading.
Let us accompany you on the journey of exploring knowledge and
personal growth!
ebookgate.com

Attaining High Performance Communications A Vertical Approach 1st Edition Ada Gavrilovska

  • 1.
    Attaining High PerformanceCommunications A Vertical Approach 1st Edition Ada Gavrilovska download https://ebookgate.com/product/attaining-high-performance- communications-a-vertical-approach-1st-edition-ada-gavrilovska/ Get Instant Ebook Downloads – Browse at https://ebookgate.com
  • 2.
    Get Your DigitalFiles Instantly: PDF, ePub, MOBI and More Quick Digital Downloads: PDF, ePub, MOBI and Other Formats High Performance Elastomer Materials An Engineering Approach 1st Edition Ryszard Koz■owski https://ebookgate.com/product/high-performance-elastomer- materials-an-engineering-approach-1st-edition-ryszard-kozlowski/ Working in Teams Moving From High Potential to High Performance 1st Edition Brian A. Griffith https://ebookgate.com/product/working-in-teams-moving-from-high- potential-to-high-performance-1st-edition-brian-a-griffith/ Digital Communications A Discrete Time Approach 1st Edition Michael Rice https://ebookgate.com/product/digital-communications-a-discrete- time-approach-1st-edition-michael-rice/ Leadership Revolution Creating a High Performance Organisation 1st Edition Christo Nel https://ebookgate.com/product/leadership-revolution-creating-a- high-performance-organisation-1st-edition-christo-nel/
  • 3.
    Scala High PerformanceProgramming 1st Edition Theron https://ebookgate.com/product/scala-high-performance- programming-1st-edition-theron/ Julia High Performance Programming Ivo Balbaert https://ebookgate.com/product/julia-high-performance-programming- ivo-balbaert/ High Performance Loudspeakers 6th Edition Martin Colloms https://ebookgate.com/product/high-performance-loudspeakers-6th- edition-martin-colloms/ Performance Optimization of Digital Communications Systems 1st Edition Vladimir Mitlin https://ebookgate.com/product/performance-optimization-of- digital-communications-systems-1st-edition-vladimir-mitlin/ High Performance Parallel I O 1st Edition I Foster https://ebookgate.com/product/high-performance-parallel-i-o-1st- edition-i-foster/
  • 5.
    AttAining HigH PerformAnce communicAtions A VerticAlApproAch C3088_FM.indd 1 8/18/09 10:05:36 AM
  • 6.
  • 7.
    AttAining HigH PerformAnce communicAtions A VerticAlApproAch edited by AdA gAvrilovskA C3088_FM.indd 3 8/18/09 10:05:37 AM
  • 8.
    Chapman & Hall/CRC Taylor& Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2010 by Taylor and Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number-13: 978-1-4200-9313-1 (Ebook-PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit- ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
  • 9.
  • 11.
    Contents List of Figuresxiv List of Tables xviii Preface xxi Acknowledgments xxvii About the Editor xxix List of Contributors xxxi 1 High Performance Interconnects for Massively Parallel Sys- tems 1 Scott Pakin 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.2 Application Sensitivity to Communication Performance 3 1.2.3 Measurements on Massively Parallel Systems . . . . . 4 1.3 Network Topology . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.1 The “Dead Topology Society” . . . . . . . . . . . . . . 8 1.3.2 Hierarchical Networks . . . . . . . . . . . . . . . . . . 12 1.3.3 Hybrid Networks . . . . . . . . . . . . . . . . . . . . . 13 1.3.4 Novel Topologies . . . . . . . . . . . . . . . . . . . . . 15 1.4 Network Features . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.1 Programming Models . . . . . . . . . . . . . . . . . . 18 1.5 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 20 1.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 22 2 Commodity High Performance Interconnects 25 Dhabaleswar K. Panda, Pavan Balaji, Sayantan Sur, and Matthew Jon Koop 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Overview of Past Commodity Interconnects, Features and Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 vii
  • 12.
    viii 2.3 InfiniBand Architecture. . . . . . . . . . . . . . . . . . . . . 28 2.3.1 IB Communication Model . . . . . . . . . . . . . . . . 28 2.3.2 Overview of InfiniBand Features . . . . . . . . . . . . 32 2.3.3 InfiniBand Protection and Security Features . . . . . . 39 2.3.4 InfiniBand Management and Services . . . . . . . . . . 40 2.4 Existing InfiniBand Adapters and Switches . . . . . . . . . . 43 2.4.1 Channel Adapters . . . . . . . . . . . . . . . . . . . . 43 2.4.2 Switches . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.4.3 Wide Area Networks (WAN) and Routers . . . . . . . 45 2.5 Existing InfiniBand Software Stacks . . . . . . . . . . . . . . 45 2.5.1 Low-Level Interfaces . . . . . . . . . . . . . . . . . . . 45 2.5.2 High-Level Interfaces . . . . . . . . . . . . . . . . . . . 46 2.5.3 Verbs Capabilities . . . . . . . . . . . . . . . . . . . . 46 2.6 Designing High-End Systems with InfiniBand: Case Studies 47 2.6.1 Case Study: Message Passing Interface . . . . . . . . . 47 2.6.2 Case Study: Parallel File Systems . . . . . . . . . . . 55 2.6.3 Case Study: Enterprise Data Centers . . . . . . . . . . 57 2.7 Current and Future Trends of InfiniBand . . . . . . . . . . . 60 3 Ethernet vs. EtherNOT 61 Wu-chun Feng and Pavan Balaji 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.2.1 Defining Ethernet vs. EtherNOT . . . . . . . . . . . . 64 3.2.2 Forecast . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3.1 Ethernet Background . . . . . . . . . . . . . . . . . . 65 3.3.2 EtherNOT Background . . . . . . . . . . . . . . . . . 66 3.4 Ethernet vs. EtherNOT? . . . . . . . . . . . . . . . . . . . . 67 3.4.1 Hardware and Software Convergence . . . . . . . . . . 67 3.4.2 Overall Performance Convergence . . . . . . . . . . . . 78 3.5 Commercial Perspective . . . . . . . . . . . . . . . . . . . . . . 81 3.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 82 4 System Impact of Integrated Interconnects 85 Sudhakar Yalamanchili and Jeffrey Young 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.2 Technology Trends . . . . . . . . . . . . . . . . . . . . . . . . 86 4.3 Integrated Interconnects . . . . . . . . . . . . . . . . . . . . 90 4.3.1 HyperTransport (HT) . . . . . . . . . . . . . . . . . . 92 4.3.2 QuickPath Interconnect (QPI) . . . . . . . . . . . . . 96 4.3.3 PCI Express (PCIe) . . . . . . . . . . . . . . . . . . . 98 4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.4 Case Study: Implementation of Global Address Spaces . . . . 101
  • 13.
    ix 4.4.1 A DynamicPartitioned Global Address Space Model (DPGAS) . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.4.2 The Implementation Path . . . . . . . . . . . . . . . . 105 4.4.3 Bridge Implementation . . . . . . . . . . . . . . . . . . 106 4.4.4 Projected Impact of DPGAS . . . . . . . . . . . . . . 108 4.5 Future Trends and Expectations . . . . . . . . . . . . . . . . 109 5 Network Interfaces for High Performance Computing 113 Keith Underwood, Ron Brightwell, and Scott Hemmert 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.2 Network Interface Design Issues . . . . . . . . . . . . . . . . 113 5.2.1 Offload vs. Onload . . . . . . . . . . . . . . . . . . . . 114 5.2.2 Short vs. Long Message Handling . . . . . . . . . . . 115 5.2.3 Interactions between Host and NIC . . . . . . . . . . . 118 5.2.4 Collectives . . . . . . . . . . . . . . . . . . . . . . . . 123 5.3 Current Approaches to Network Interface Design Issues . . . 124 5.3.1 Quadrics QsNet . . . . . . . . . . . . . . . . . . . . . . 124 5.3.2 Myrinet . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.3.3 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.3.4 Seastar . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.3.5 PathScale InfiniPath and Qlogic TrueScale . . . . . . 127 5.3.6 BlueGene/L and BlueGene/P . . . . . . . . . . . . . . 127 5.4 Research Directions . . . . . . . . . . . . . . . . . . . . . . . 128 5.4.1 Offload of Message Processing . . . . . . . . . . . . . . 128 5.4.2 Offloading Collective Operations . . . . . . . . . . . . 140 5.4.3 Cache Injection . . . . . . . . . . . . . . . . . . . . . . 147 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6 Network Programming Interfaces for High Performance Computing 149 Ron Brightwell and Keith Underwood 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.2 The Evolution of HPC Network Programming Interfaces . . 150 6.3 Low-Level Network Programming Interfaces . . . . . . . . . . 151 6.3.1 InfiniBand Verbs . . . . . . . . . . . . . . . . . . . . . . 151 6.3.2 Deep Computing Messaging Fabric . . . . . . . . . . . 153 6.3.3 Portals . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 6.3.4 Myrinet Express (MX) . . . . . . . . . . . . . . . . . . 153 6.3.5 Tagged Ports (Tports) . . . . . . . . . . . . . . . . . . 153 6.3.6 LAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6.3.7 Sockets . . . . . . . . . . . . . . . . . . . . . . . . . . 154 6.4 Distinguishing Characteristics . . . . . . . . . . . . . . . . . 154 6.4.1 Endpoint Addressing . . . . . . . . . . . . . . . . . . . 155 6.4.2 Independent Processes . . . . . . . . . . . . . . . . . . 155
  • 14.
    x 6.4.3 Connections .. . . . . . . . . . . . . . . . . . . . . . . 156 6.4.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 156 6.4.5 Operating System Interaction . . . . . . . . . . . . . . 156 6.4.6 Data Movement Semantics . . . . . . . . . . . . . . . 157 6.4.7 Data Transfer Completion . . . . . . . . . . . . . . . . 158 6.4.8 Portability . . . . . . . . . . . . . . . . . . . . . . . . 160 6.5 Supporting MPI . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.5.1 Copy Blocks . . . . . . . . . . . . . . . . . . . . . . . 160 6.5.2 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.5.3 Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 6.5.4 Unexpected Messages . . . . . . . . . . . . . . . . . . . 161 6.6 Supporting SHMEM and Partitioned Global Address Space (PGAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 6.6.1 Fence and Quiet . . . . . . . . . . . . . . . . . . . . . 162 6.6.2 Synchronization and Atomics . . . . . . . . . . . . . . 162 6.6.3 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.6.4 Scalable Addressing . . . . . . . . . . . . . . . . . . . 163 6.7 Portals 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.7.1 Small Message Rate . . . . . . . . . . . . . . . . . . . 164 6.7.2 PGAS Optimizations . . . . . . . . . . . . . . . . . . . 166 6.7.3 Hardware Friendliness . . . . . . . . . . . . . . . . . . 166 6.7.4 New Functionality . . . . . . . . . . . . . . . . . . . . 167 7 High Performance IP-Based Transports 169 Ada Gavrilovska 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.2 Transmission Control Protocol — TCP . . . . . . . . . . . . 170 7.2.1 TCP Origins and Future . . . . . . . . . . . . . . . . . 170 7.2.2 TCP in High Speed Networks . . . . . . . . . . . . . . 172 7.2.3 TCP Variants . . . . . . . . . . . . . . . . . . . . . . . 173 7.3 TCP Performance Tuning . . . . . . . . . . . . . . . . . . . . 178 7.3.1 Improving Bandwidth Utilization . . . . . . . . . . . . 178 7.3.2 Reducing Host Loads . . . . . . . . . . . . . . . . . . 179 7.4 UDP-Based Transport Protocols . . . . . . . . . . . . . . . . . 181 7.5 SCTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 7.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 183 8 Remote Direct Memory Access and iWARP 185 Dennis Dalessandro and Pete Wyckoff 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 8.2 RDMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 8.2.1 High-Level Overview of RDMA . . . . . . . . . . . . . 187 8.2.2 Architectural Motivations . . . . . . . . . . . . . . . . 189 8.2.3 Fundamental Aspects of RDMA . . . . . . . . . . . . 192
  • 15.
    xi 8.2.4 RDMA HistoricalFoundations . . . . . . . . . . . . . 193 8.2.5 Programming Interface . . . . . . . . . . . . . . . . . . 194 8.2.6 Operating System Interactions . . . . . . . . . . . . . 196 8.3 iWARP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 8.3.1 High-Level Overview of iWARP . . . . . . . . . . . . . 200 8.3.2 iWARP Device History . . . . . . . . . . . . . . . . . . 201 8.3.3 iWARP Standardization . . . . . . . . . . . . . . . . . 202 8.3.4 Trade-Offs of Using TCP . . . . . . . . . . . . . . . . 204 8.3.5 Software-Based iWARP . . . . . . . . . . . . . . . . . 205 8.3.6 Differences between IB and iWARP . . . . . . . . . . 206 8.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 207 9 Accelerating Communication Services on Multi-Core Plat- forms 209 Ada Gavrilovska 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 9.2 The “Simple” Onload Approach . . . . . . . . . . . . . . . . 210 9.2.1 Limitations of the “Simple” Onload . . . . . . . . . . 212 9.3 Partitioned Communication Stacks . . . . . . . . . . . . . . . 213 9.3.1 API Considerations . . . . . . . . . . . . . . . . . . . 216 9.4 Specialized Network Multi-Cores . . . . . . . . . . . . . . . . 217 9.4.1 The (Original) Case for Network Processors . . . . . . 217 9.4.2 Network Processors Features . . . . . . . . . . . . . . 219 9.4.3 Application Diversity . . . . . . . . . . . . . . . . . . 222 9.5 Toward Heterogeneous Multi-Cores . . . . . . . . . . . . . . 223 9.5.1 Impact on Systems Software . . . . . . . . . . . . . . . 226 9.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 228 10 Virtualized I/O 229 Ada Gavrilovska, Adit Ranadive, Dulloor Rao, and Karsten Schwan 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 10.1.1 Virtualization Overview . . . . . . . . . . . . . . . . . 230 10.1.2 Challenges with I/O Virtualization . . . . . . . . . . . 232 10.1.3 I/O Virtualization Approaches . . . . . . . . . . . . . 233 10.2 Split Device Driver Model . . . . . . . . . . . . . . . . . . . 234 10.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 234 10.2.2 Performance Optimization Opportunities . . . . . . . 236 10.3 Direct Device Access Model . . . . . . . . . . . . . . . . . . 240 10.3.1 Multi-Queue Devices . . . . . . . . . . . . . . . . . . . . 241 10.3.2 Device-Level Packet Classification . . . . . . . . . . . 243 10.3.3 Signaling . . . . . . . . . . . . . . . . . . . . . . . . . 244 10.3.4 IOMMU . . . . . . . . . . . . . . . . . . . . . . . . . . 244 10.4 Opportunities and Trade-Offs . . . . . . . . . . . . . . . . . 245 10.4.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 245
  • 16.
    xii 10.4.2 Migration .. . . . . . . . . . . . . . . . . . . . . . . . 246 10.4.3 Higher-Level Interfaces . . . . . . . . . . . . . . . . . . 246 10.4.4 Monitoring and Management . . . . . . . . . . . . . . 247 10.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 249 11 The Message Passing Interface (MPI) 251 Jeff Squyres 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 11.1.1 Chapter Scope . . . . . . . . . . . . . . . . . . . . . . . 251 11.1.2 MPI Implementations . . . . . . . . . . . . . . . . . . 252 11.1.3 MPI Standard Evolution . . . . . . . . . . . . . . . . . 254 11.1.4 Chapter Overview . . . . . . . . . . . . . . . . . . . . 255 11.2 MPI’s Layer in the Network Stack . . . . . . . . . . . . . . . 255 11.2.1 OSI Network Stack . . . . . . . . . . . . . . . . . . . . 256 11.2.2 Networks That Provide MPI-Like Interfaces . . . . . . 256 11.2.3 Networks That Provide Non-MPI-Like Interfaces . . . 257 11.2.4 Resource Management . . . . . . . . . . . . . . . . . . 257 11.3 Threading and MPI . . . . . . . . . . . . . . . . . . . . . . . 260 11.3.1 Implementation Complexity . . . . . . . . . . . . . . . 260 11.3.2 Application Simplicity . . . . . . . . . . . . . . . . . . . 261 11.3.3 Performance Implications . . . . . . . . . . . . . . . . 262 11.4 Point-to-Point Communications . . . . . . . . . . . . . . . . 262 11.4.1 Communication/Computation Overlap . . . . . . . . . 262 11.4.2 Pre-Posting Receive Buffers . . . . . . . . . . . . . . . 263 11.4.3 Persistent Requests . . . . . . . . . . . . . . . . . . . . 265 11.4.4 Common Mistakes . . . . . . . . . . . . . . . . . . . . 267 11.5 Collective Operations . . . . . . . . . . . . . . . . . . . . . . 272 11.5.1 Synchronization . . . . . . . . . . . . . . . . . . . . . 272 11.6 Implementation Strategies . . . . . . . . . . . . . . . . . . . 273 11.6.1 Lazy Connection Setup . . . . . . . . . . . . . . . . . 273 11.6.2 Registered Memory . . . . . . . . . . . . . . . . . . . . 274 11.6.3 Message Passing Progress . . . . . . . . . . . . . . . . 278 11.6.4 Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . 278 11.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 279 12 High Performance Event Communication 281 Greg Eisenhauer, Matthew Wolf, Hasan Abbasi, and Karsten Schwan 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 12.2 Design Points . . . . . . . . . . . . . . . . . . . . . . . . . . 283 12.2.1 Lessons from Previous Designs . . . . . . . . . . . . . 285 12.2.2 Next Generation Event Delivery . . . . . . . . . . . . 286 12.3 The EVPath Architecture . . . . . . . . . . . . . . . . . . . . 287 12.3.1 Taxonomy of Stone Types . . . . . . . . . . . . . . . . 289 12.3.2 Data Type Handling . . . . . . . . . . . . . . . . . . . 289
  • 17.
    xiii 12.3.3 Mobile Functionsand the Cod Language . . . . . . . . 290 12.3.4 Meeting Next Generation Goals . . . . . . . . . . . . . 293 12.4 Performance Microbenchmarks . . . . . . . . . . . . . . . . . 294 12.4.1 Local Data Handling . . . . . . . . . . . . . . . . . . . 295 12.4.2 Network Operation . . . . . . . . . . . . . . . . . . . . 296 12.5 Usage Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 297 12.5.1 Implementing a Full Publish/Subscribe System . . . . 298 12.5.2 IFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . 300 12.5.3 I/OGraph . . . . . . . . . . . . . . . . . . . . . . . . . 302 12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 13 The Case of the Fast Financial Feed 305 Virat Agarwal, Lin Duan, Lurng-Kuo Liu, Michaele Perrone, Fabrizio Petrini, Davide Pasetto, and David Bader 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 13.2 Market Data Processing Systems . . . . . . . . . . . . . . . . 306 13.2.1 The Ticker Plant . . . . . . . . . . . . . . . . . . . . . 306 13.3 Performance Requirements . . . . . . . . . . . . . . . . . . . 308 13.3.1 Skyrocketing Data Rates . . . . . . . . . . . . . . . . 308 13.3.2 Low Latency Trading . . . . . . . . . . . . . . . . . . 308 13.3.3 High Performance Computing in the Data Center . . . 310 13.4 The OPRA Case Study . . . . . . . . . . . . . . . . . . . . . . 311 13.4.1 OPRA Data Encoding . . . . . . . . . . . . . . . . . . 312 13.4.2 Decoder Reference Implementation . . . . . . . . . . . 314 13.4.3 A Streamlined Bottom-Up Implementation . . . . . . 315 13.4.4 High-Level Protocol Processing with DotStar . . . . . 316 13.4.5 Experimental Results . . . . . . . . . . . . . . . . . . . 321 13.4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 324 13.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 327 14 Data-Movement Approaches for HPC Storage Systems 329 Ron A. Oldfield, Todd Kordenbrock, and Patrick Widener 14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 14.2 Lustre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 14.2.1 Lustre Networking (LNET) . . . . . . . . . . . . . . . 332 14.2.2 Optimizations for Large-Scale I/O . . . . . . . . . . . 333 14.3 Panasas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 14.3.1 PanFS Architecture . . . . . . . . . . . . . . . . . . . 335 14.3.2 Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . 336 14.4 Parallel Virtual File System 2 (PVFS2) . . . . . . . . . . . . 337 14.4.1 BMI Design . . . . . . . . . . . . . . . . . . . . . . . . 338 14.4.2 BMI Simplifies the Client . . . . . . . . . . . . . . . . 339 14.4.3 BMI Efficiency/Performance . . . . . . . . . . . . . . 340 14.4.4 BMI Scalability . . . . . . . . . . . . . . . . . . . . . . . 341 14.4.5 BMI Portability . . . . . . . . . . . . . . . . . . . . . . 341
  • 18.
    xiv 14.4.6 Experimental Results. . . . . . . . . . . . . . . . . . 342 14.5 Lightweight File Systems . . . . . . . . . . . . . . . . . . . . 345 14.5.1 Design of the LWFS RPC Mechanism . . . . . . . . . 345 14.5.2 LWFS RPC Implementation . . . . . . . . . . . . . . 346 14.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . 348 14.6 Other MPP File Systems . . . . . . . . . . . . . . . . . . . . 349 14.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . 350 14.8 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 351 15 Network Simulation 353 George Riley 15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 15.2 Discrete Event Simulation . . . . . . . . . . . . . . . . . . . 353 15.3 Maintaining the Event List . . . . . . . . . . . . . . . . . . . 354 15.4 Modeling Routers, Links, and End Systems . . . . . . . . . . 355 15.5 Modeling Network Packets . . . . . . . . . . . . . . . . . . . 358 15.6 Modeling the Network Applications . . . . . . . . . . . . . . 359 15.7 Visualizing the Simulation . . . . . . . . . . . . . . . . . . . 360 15.8 Distributed Simulation . . . . . . . . . . . . . . . . . . . . . 362 15.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 References 367 Index 407
  • 19.
    List of Figures 1.1Comparative communication performance of Purple, Red Storm, and Blue Gene/L . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Network topologies used in the ten fastest supercomputers and the single most parallel supercomputer . . . . . . . . . . . . . 9 1.3 Illustrations of various network topologies . . . . . . . . . . . 10 1.4 The network hierarchy of the Roadrunner supercomputer . . 14 1.5 Kautz graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 Typical InfiniBand cluster . . . . . . . . . . . . . . . . . . . . 29 2.2 Consumer queuing model . . . . . . . . . . . . . . . . . . . . 30 2.3 Virtual lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4 Example of unreliable multicast operation . . . . . . . . . . . 34 2.5 IB transport services . . . . . . . . . . . . . . . . . . . . . . . 36 2.6 Layered design of MVAPICH/MVAPICH2 over IB. . . . . . . 48 2.7 Two-sided point-to-point performance over IB on a range of adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.8 Application-level evaluation of MPI over InfiniBand design com- ponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.9 Lustre performance over InfiniBand. . . . . . . . . . . . . . . 56 2.10 CPU utilization in Lustre with IPoIB and native (verb-level) protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 2.11 SDP architecture . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.12 Performance comparison of the Apache Web server: SDP vs. IPoIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.13 Active polling to achieve strong cache coherency. . . . . . . . 59 2.14 Active polling performance . . . . . . . . . . . . . . . . . . . 60 3.1 Profile of network interconnects on the TOP500 . . . . . . . . 62 3.2 Hand-drawn Ethernet diagram by Robert Metcalfe. . . . . . . 66 3.3 TCP offload engines. . . . . . . . . . . . . . . . . . . . . . . . 70 3.4 iWARP protocol stack. . . . . . . . . . . . . . . . . . . . . . . . 71 3.5 Network congestion. . . . . . . . . . . . . . . . . . . . . . . . 72 3.6 VLAN-based multipath communication. . . . . . . . . . . . . 74 3.7 Myrinet MX-over-Ethernet. . . . . . . . . . . . . . . . . . . . 76 3.8 Mellanox ConnectX. . . . . . . . . . . . . . . . . . . . . . . . 77 3.9 Ethernet vs. EtherNOT: One-way latency. . . . . . . . . . . . 78 xv
3.10 Ethernet vs. EtherNOT: Unidirectional bandwidth
3.11 Ethernet vs. EtherNOT: Virtual Microscope application
3.12 Ethernet vs. EtherNOT: MPI-tile-IO application
4.1 Network latency scaling trends
4.2 Memory cost and memory power trends for a commodity server system
4.3 Approximate floorplans for quad-core processors from AMD and Intel
4.4 Breakdown of a HyperTransport link
4.5 Organization of buffers/VCs on a HT link
4.6 HT read request packet format
4.7 Structure of the QPI protocol stack and link
4.8 QPI's low-latency source snoop protocol
4.9 An example of the PCIe complex
4.10 Structure of the PCI Express packets and protocol processing
4.11 The Dynamic Partitioned Global Address Space model
4.12 Memory bridge with Opteron memory subsystem
4.13 Memory bridge stages
5.1 Comparing programmed I/O and DMA transactions to the network interface
5.2 Comparing messages using an eager protocol to messages using a rendezvous protocol
5.3 Ping-pong bandwidth for Quadrics Elan4 and 4X SDR InfiniBand
5.4 Cells in associative list processing units
5.5 Performance advantages of an ALPU
5.6 NIC architecture enhanced with a list management unit
5.7 Performance of the list management unit
5.8 Microcoded match unit
5.9 Match unit's wide instruction word
5.10 Match unit performance
5.11 NIC-based atomic unit
5.12 Comparing the performance with and without cache
5.13 Assessing the impact of size and associativity on atomic unit cache performance
5.14 Pseudo code for an Allreduce using triggered operations
7.1 Congestion management functions in popular TCP variants
8.1 Comparison of traditional TCP network stack and RDMA
8.2 Effect of overlapping computation and communication
8.3 TCP and RDMA communication architecture
8.4 iWARP protocols stack
9.1 Simple protocol onload approach on multi-core platforms
9.2 Deploying communication stacks on dedicated cores
9.3 Deployment alternatives for network processors
9.4 Heterogeneous multi-core platform
10.1 Split device driver model
10.2 Direct device access model
10.3 RDMA write bandwidth divided among VMs
11.1 Simplistic receive processing in MPI
11.2 Serialized MPI communication
11.3 Communication/computation overlap
11.4 Memory copy vs. OpenFabric memory registration times
12.1 Channel-based event delivery system
12.2 Complex event processing delivery system
12.3 Event delivery system built using EVPath
12.4 Basic stone types
12.5 Sample EVPath data structure declaration
12.6 Specialization filter for stock trades ranges
12.7 Specialization filter for array averaging
12.8 Local stone transfer time for linear and tree-structured paths
12.9 EVPath throughput for various data sizes
12.10 Using event channels for communication
12.11 ECho event channel implementation using EVPath stones
12.12 Derived event channel implementation using EVPath stones
12.13 CPU overhead as a function of filter rejection ratio
13.1 High-level overview of a ticker plant
13.2 OPRA market peak data rates
13.3 OPRA FAST encoded packet format
13.4 OPRA reference decoder
13.5 Bottom-up reference decoder block diagram
13.6 Presence and field map bit manipulation
13.7 A graphical representation of the DotStar compiler steps
13.8 DotStar runtime
13.9 DotStar source code
13.10 OPRA message distribution
13.11 Performance comparison on several hardware platforms
14.1 Partitioned architecture
14.2 Lustre software stack
14.3 Server-directed DMA handshake in Lustre
14.4 Panasas DirectFlow architecture
14.5 Parallel NFS architecture
14.6 PVFS2
14.7 Round trip latency
14.8 Point-to-point bandwidth for 120MB transfer
14.9 Aggregate read pattern, 10MB per client/server pair
14.10 LWFS RPC protocol
14.11 The 16-byte data structure used for each of the experiments
14.12 Comparison of LWFS RPC to various other mechanisms
15.1 Moderately complex network topology
15.2 Network simulation animation
15.3 Network simulation visualization
15.4 The space-parallel method for distributed network simulation
List of Tables

1.1 Network characteristics of Purple, Red Storm, and Blue Gene/L
1.2 Properties of a ring vs. a fully connected network of n nodes
1.3 Hypercube conference: length of proceedings
4.1 Latency results for HToE bridge
4.2 Latency numbers used for evaluation of performance penalties
5.1 Breakdown of the assembly code
12.1 Comparing split stone and filter stone execution times
13.1 OPRA message categories with description
13.2 Hardware platforms used during experimental evaluation
13.3 Detailed performance analysis (Intel Xeon Q6600, E5472 and AMD Opteron 2352)
13.4 Detailed performance analysis (SUN UltraSparc T2 and IBM Power6)
13.5 DotStar latency on different platforms
14.1 Compute and I/O nodes for production DOE MPP systems used since the early 1990s
Preface

As technology pushes the Petascale barrier, and the next generation Exascale Computing requirements are being defined, it is clear that this type of computational capacity can be achieved only by accompanying advancement in communication technology. Next generation computing platforms will consist of hundreds of thousands of interconnected computational resources. Therefore, careful consideration must be placed on the choice of the communication fabrics connecting everything, from cores and accelerators on individual platform nodes, to the internode fabric for coupling nodes in large-scale data centers and computational complexes, to wide area communication channels connecting geographically disparate compute resources to remote data sources and application end-users.

The traditional scientific HPC community has long pushed the envelope on the capabilities delivered by existing communication fabrics. The computing complex at Oak Ridge National Laboratory has an impressive 332 Gigabytes/sec I/O bandwidth and a 786 Terabyte/sec global interconnection bandwidth [305], used by extreme science applications from "subatomic to galactic scales" domains [305], and supporting innovation in renewable energy sources, climate modeling, and medicine. Riding on a curve of a thousandfold increase in computational capabilities over the last five years alone, it is expected that the growth in both compute resources and the underlying communication infrastructure and its capabilities will continue to climb at mind-blowing rates.

Other classes of HPC applications have equally impressive communication needs beyond the main computing complex, as they require data from remote sources — observatories, databases, remote instruments such as particle accelerators, or large scale data-intensive collaborations across many globally distributed sites. The computational grids necessary for these applications must be enabled by communication technology capable of moving terabytes of data in a timely manner, while also supporting the interactive nature of the remote collaborations.

More unusual, however, are the emerging extreme scale communication needs outside of the traditional HPC domain. First, the confluence of the multi-core nature of emerging hardware resources, coupled with the renaissance of virtualization technology, is creating exceptional consolidation possibilities, and giving rise to compute clouds of virtualized multi-core resources. At the same time, the management complexity and operating costs associated with today's IT hardware and software stacks are pushing a broad range of enterprise class
applications into such clouds. As clouds spill outside of data center boundaries and move towards becoming Exascale platforms across globally distributed facilities [191, 320], their communication needs — for computations distributed across inter- and intra-cloud resources, for management, provisioning and QoS, for workload migration, and for end-user interactions — become exceedingly richer.

Next, even enterprise applications which, due to their critical nature, are not likely candidates for prime cloud deployment, are witnessing a "skyrocketing" increase in communication needs. For instance, financial market analyses are forecasting as much as 128 billion electronic trades a day just by 2010 [422], and similar trends toward increased data rates and lower latency requirements are expected in other market segments, from ticketing services, to parcel delivery and distribution, to inventory management and forecasting.

Finally, the diversity in commodity end-user applications, from 3D distributed games, to content sharing and rich multimedia based applications, to telecommunication and telepresence types of services, supported on everything from high-end workstations to cellphones and embedded devices, is further exacerbating the need for high performance communication services.

Objectives

No silver bullet will solve all of the communications-related challenges faced by the emerging types of applications, platforms, and usage scenarios described above. In order to address these requirements we need an ecosystem of solutions along a stack of technology layers: (i) efficient interconnection hardware; (ii) scalable, robust, end-to-end protocols; and (iii) system services and tools specifically targeting emerging multi-core environments.

Toward this end, this book is a unique effort to discuss technological advances and challenges related to high performance communications, by addressing each layer in the vertical stack — the low-level architectural features of the hardware interconnect and interconnection devices; the selection of communication protocols; the implementation of protocol stacks and other operating system features, including on modern, homogeneous or heterogeneous multi-core platforms; the middleware software and libraries used in applications with high-performance communication requirements; and the higher-level application services such as file systems, monitoring services, and simulation tools.

The rationale behind this approach is that no single solution, applied at one particular layer, can help applications address all performance-related issues with their communication services. Instead, a coordinated effort is needed to eliminate bottlenecks and address optimization opportunities at each layer — from the architectural features of the hardware, through the protocols and their implementation in OS kernels, to the manner in which application services and middleware are using the underlying platforms.

This book is an edited collection of chapters on these topics. The choice to
organize this title as a collection of chapters contributed by different individuals is a natural one — the vertical scope of the work calls for contributions from researchers with many different areas of expertise. Each of the chapters is organized in a manner which includes historic perspective, discussion of state-of-the-art technology solutions and current trends, summary of major research efforts in the community, and, where appropriate, greater level of detail on a particular effort that the chapter contributor is mostly affiliated with.

The topics covered in each chapter are important and complex, and deserve a separate book to cover adequately and in-depth all technical challenges which surround them. Many such books exist [71, 128, 181, 280, 408, etc.]. Unique about this title, however, is that it is a more comprehensive text for a broader audience, spanning a community with interests across the entire stack. Furthermore, each chapter abounds in references to past and current technical papers and projects, which can guide the reader to sources of additional information.

Finally, it is worth pointing out that even this type of text, which touches on many different types of technologies and many layers across the stack, is by no means complete. Many more topics remain only briefly mentioned throughout the book, without having a full chapter dedicated to them. These include topics related to physical-layer technologies, routing protocols and router architectures, advanced communications protocols for multimedia communications, modeling and analysis techniques, compiler and programming language-related issues, and others.

Target audience

The relevance of this book is multifold. First, it is targeted at academics for instruction in advanced courses on high-performance I/O and communication, offered as part of graduate or undergraduate curricula in Computer Science or Computer Engineering. Similar courses have been taught for several years at a number of universities, including Georgia Tech (High-Performance Communications, taught by this book's editor), The Ohio State University (Network-Based Computing, by Prof. Dhabaleswar K. Panda), Auburn University (Special Topics in High Performance Computing, by Prof. Weikuan Yu), or significant portions of the material are covered in more traditional courses on High Performance Computing in many Computer Science or Engineering programs worldwide.

In addition, this book can be relevant reference and instructional material for students and researchers in other science and engineering areas, in academia or at the National Labs, that are working on problems with significant communication requirements, and on high performance platforms such as supercomputers, high-end clusters or large-scale wide-area grids. Many of these scientists may not have formal computer science/computer engineering education, and this title aims to be a single text which can help steer them to
more easily identify performance improvement opportunities and find solutions best suited for the applications they are developing and the platforms they are using.

Finally, the text provides researchers who are specifically addressing problems and developing technologies at a particular layer with a single reference that surveys the state-of-the-art at other layers. By offering a concise overview of related efforts from "above" or "below," this book can be helpful in identifying the best ways to position one's work, and in ensuring that other elements of the stack are appropriately leveraged.

Organization

The book is organized in 15 chapters, which roughly follow a vertical stack from bottom to top.

• Chapter 1 discusses design alternatives for interconnection networks in massively parallel systems, including examples from current Cray, Blue Gene, as well as cluster-based supercomputers from the Top500 list.

• Chapter 2 provides in-depth discussion of the InfiniBand interconnection technology, its hardware elements, protocols and features, and includes a number of case studies from the HPC and enterprise domains, which illustrate its suitability for a range of applications.

• Chapter 3 contrasts the traditional high-performance interconnection solutions to the current capabilities offered by Ethernet-based interconnects, and demonstrates several convergence trends among Ethernet and EtherNOT technologies.

• Chapter 4 describes the key board- and chip-level interconnects such as PCI Express, HyperTransport, and QPI; the capabilities they offer for tighter system integration; and a case study for a service enabled by the availability of low-latency integrated interconnects — global partitioned address spaces.

• Chapter 5 discusses a number of considerations regarding the hardware and software architecture of network interface devices (NICs), and contrasts the design points present in NICs in a number of existing interconnection fabrics.

• Chapter 6 complements this discussion by focusing on the characteristics of the APIs natively supported by the different NIC platforms.

• Chapter 7 focuses on IP-based transport protocols and provides a historic perspective on the different variants and performance optimization opportunities for TCP transports, as well as a brief discussion of other IP-based transport protocols.
• Chapter 8 describes in greater detail Remote Direct Memory Access (RDMA) as an approach to network communication, along with iWARP, an RDMA-based solution based on TCP transports.

• Chapter 9 more explicitly discusses opportunities to accelerate the execution of communication services on multi-core platforms, including general purpose homogeneous multi-cores, specialized network processing accelerators, as well as emerging heterogeneous many-core platforms, comprising both general purpose and accelerator resources.

• Chapter 10 targets the mechanisms used to ensure high-performance communication services in virtualized platforms by discussing VMM-level techniques as well as device-level features which help attain near-native performance.

• Chapter 11 provides some historical perspective on MPI, the de facto communication standard in high-performance computing, and outlines some of the challenges in creating software and hardware implementations of the MPI standard. These challenges are directly related to MPI applications and their understanding can impact the programmers' ability to write better MPI-based codes.

• Chapter 12 looks at event-based middleware services as a means to address the high-performance needs of many classes of HPC and enterprise applications. It also gives an overview of the EVPath high-performance middleware stack and demonstrates its utility in several different contexts.

• Chapter 13 first describes an important class of applications with high performance communication requirements — electronic trading platforms used in the financial industry. Next, it provides implementation detail and performance analysis for the authors' approach to leverage the capabilities of general purpose multi-core platforms and to attain impressive performance levels for one of the key components of the trading engine.

• Chapter 14 describes the data-movement approaches used by a selection of commercial, open-source, and research-based storage systems used by massively parallel platforms, including Lustre, Panasas, PVFS2, and LWFS.

• Chapter 15 discusses the overall approach to creating the simulation tools needed to design, evaluate and experiment with a range of parameters in the interconnection technology space, so as to help identify design points with adequate performance behaviors.
Acknowledgments

This book would have never been possible without the contributions from the individual chapter authors. My immense gratitude goes out to all of them for their unyielding enthusiasm, expertise, and time.

Next, I would like to thank my editor Alan Apt, for approaching and encouraging me to write this book, and the extended production team at Taylor & Francis, for all their help with the preparation of the manuscript.

I am especially grateful to my mentor and colleague, Karsten Schwan, for supporting my teaching of the High Performance Communications course at Georgia Tech, which was the basis for this book. He and my close collaborators, Greg Eisenhauer and Matthew Wolf, were particularly accommodating during the most intense periods of manuscript preparation in providing me with the necessary time to complete the work.

Finally, I would like to thank my family, for all their love, care, and unconditional support. I dedicate this book to them.

Ada Gavrilovska
Atlanta, 2009
About the Editor

Ada Gavrilovska is a Research Scientist in the College of Computing at Georgia Tech, and at the Center for Experimental Research in Computer Systems (CERCS). Her research interests include conducting experimental systems research, specifically addressing high-performance applications on distributed heterogeneous platforms, and focusing on topics that range from operating and distributed systems, to virtualization, to programmable network devices and communication accelerators, to active and adaptive middleware and I/O. Most recently, she has been involved with several projects focused on development of efficient virtualization solutions for platforms ranging from heterogeneous multi-core systems to large-scale cloud environments.

In addition to research, Dr. Gavrilovska has a strong commitment to teaching. At Georgia Tech she teaches courses on advanced operating systems and high-performance communications topics, and is deeply involved in a larger effort aimed at upgrading the Computer Science and Engineering curriculum with multicore-related content.

Dr. Gavrilovska has a B.S. in Electrical and Computer Engineering from the Faculty of Electrical Engineering, University Sts. Cyril and Methodius, in Skopje, Macedonia, and M.S. and Ph.D. degrees in Computer Science from Georgia Tech, both completed under the guidance of Dr. Karsten Schwan. Her work has been supported by the National Science Foundation, the U.S. Department of Energy, and through numerous collaborations with industry, including Intel, IBM, HP, Cisco Systems, Netronome Systems, and others.
List of Contributors

Hasan Abbasi, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Virat Agarwal, IBM TJ Watson Research Center, Yorktown Heights, New York
David Bader, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Pavan Balaji, Argonne National Laboratory, Argonne, Illinois
Ron Brightwell, Sandia National Laboratories, Albuquerque, New Mexico
Dennis Dalessandro, Ohio Supercomputer Center, Columbus, Ohio
Lin Duan, IBM TJ Watson Research Center, Yorktown Heights, New York
Greg Eisenhauer, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Wu-chun Feng, Departments of Computer Science and Electrical & Computer Engineering, Virginia Tech, Blacksburg, Virginia
Ada Gavrilovska, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Scott Hemmert, Sandia National Laboratories, Albuquerque, New Mexico
Matthew Jon Koop, Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio
Todd Kordenbrock, Hewlett-Packard, Albuquerque, New Mexico
Lurng-Kuo Liu, IBM TJ Watson Research Center, Yorktown Heights, New York
Ron A. Oldfield, Sandia National Laboratories, Albuquerque, New Mexico
Scott Pakin, Los Alamos National Laboratory, Los Alamos, New Mexico
Dhabaleswar K. Panda, Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio
Davide Pasetto, IBM Computational Science Center, Dublin, Ireland
Michael Perrone, IBM TJ Watson Research Center, Yorktown Heights, New York
Fabrizio Petrini, IBM TJ Watson Research Center, Yorktown Heights, New York
Adit Ranadive, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Dulloor Rao, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
George Riley, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Karsten Schwan, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Jeff Squyres, Cisco Systems, Louisville, Kentucky
Sayantan Sur, IBM TJ Watson Research Center, Hawthorne, New York
Keith Underwood, Intel Corporation, Albuquerque, New Mexico
Patrick Widener, University of New Mexico, Albuquerque, New Mexico
Matthew Wolf, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
Pete Wyckoff, Ohio Supercomputer Center, Columbus, Ohio
Sudhakar Yalamanchili, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Jeffrey Young, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Chapter 1

High Performance Interconnects for Massively Parallel Systems

Scott Pakin
Los Alamos National Laboratory

1.1 Introduction

If you were intent on building the world's fastest supercomputer, how would you design the interconnection network? As with any architectural endeavor, there is a suite of trade-offs and design decisions that need to be considered:

Maximize performance: Ideally, every application should run fast, but would it be acceptable for a smaller set of "important" applications or application classes to run fast? If so, is the overall performance of those applications dominated by communication performance? Do those applications use a known communication pattern that the network can optimize?

Minimize cost: Cheaper is better, but how much communication performance are you willing to sacrifice to cut costs? Can you exploit existing hardware components, or do you have to do a full-custom design?

Maximize scalability: Some networks are fast and cost-efficient at small scale but quickly become expensive as the number of nodes increases. How can you ensure that you will not need to rethink the network design from scratch when you want to increase the system's node count? Can the network grow incrementally? (Networks that require a power-of-two number of nodes quickly become prohibitively expensive, for example.)

Minimize complexity: How hard is it to reason about application performance? To observe good performance, do parallel algorithms need to be constructed specifically for your network? Do an application's processes need to be mapped to nodes in some non-straightforward manner to keep the network from becoming a bottleneck? As for some more mundane complexity issues, how tricky is it for technicians to lay the network
cables correctly, and how much rewiring is needed when the network size increases?

Maximize robustness: The more components (switches, cables, etc.) compose the network, the more likely it is that one of them will fail. Would you utilize a naturally fault-tolerant topology at the expense of added complexity or replicate the entire network at the expense of added cost?

Minimize power consumption: Current estimates are that a sustained megawatt of power costs between US$200,000–1,200,000 per year [144]. How much power are you willing to let the network consume? How much performance or robustness are you willing to sacrifice to reduce the network's power consumption?

There are no easy answers to those questions, and the challenges increase in difficulty with larger and larger networks: Network performance is more critical; costs may grow disproportionately to the rest of the system; incremental growth needs to be carefully considered; complexity is more likely to frustrate application developers; fault tolerance is both more important and more expensive; and power consumption becomes a serious concern.

In the remainder of this chapter we examine a few aspects of the network-design balancing act. Section 1.2 quantifies the measured network performance of a few massively parallel supercomputers. Section 1.3 discusses network topology, a key design decision and one that impacts all of performance, cost, scalability, complexity, robustness, and power consumption. In Section 1.4 we turn our attention to some of the "bells and whistles" that a network designer might consider including when trying to balance various design constraints. We briefly discuss some future directions in network design in Section 1.5 and summarize the chapter in Section 1.6.

Terminology note: In much of the system-area network literature, the term switch implies an indirect network while the term router implies a direct network. In the wide-area network literature, in contrast, the term router typically implies a switch plus additional logic that operates on packet contents. For simplicity, in this chapter we have chosen to consistently use the term switch when describing a network's switching hardware.

1.2 Performance

One way to discuss the performance of an interconnection network is in terms of various mathematical characteristics of the topology including — as a function at least of the number of nodes — the network diameter (worst-case
distance in switch hops between two nodes), average-case communication distance, bisection width (the minimum number of links that need to be removed to partition the network into two equal halves), switch count, and network capacity (maximum number of links that can carry data simultaneously). However, there are many nuances of real-world networks that are not captured by simple mathematical expressions. The manner in which applications, messaging layers, network interfaces, and network switches interact can be complex and unintuitive. For example, Arber and Pakin demonstrated that the address in memory at which a message buffer is placed can have a significant impact on communication performance and that different systems favor different buffer alignments [17]. Because of the discrepancy between the expected performance of an idealized network subsystem and the measured performance of a physical one, the focus of this section is on actual measurements of parallel supercomputers containing thousands to tens of thousands of nodes.

1.2.1 Metrics

The most common metric used in measuring network performance is half of the round-trip communication time (½RTT), often measured in microseconds. The measurement procedure is as follows: Process A reads the time and sends a message to process B; upon receipt of process A's message, process B sends an equal-sized message back to process A; upon receipt of process B's message, process A reads the time again, divides the elapsed time by two, and reports the result as ½RTT. The purpose of using round-trip communication is that it does not require the endpoints to have precisely synchronized clocks. In many papers, ½RTT is referred to as latency and the message size divided by ½RTT as bandwidth although these definitions are far from universal. In the LogP parallel-system model [100], for instance, latency refers solely to "wire time" for a zero-byte message and is distinguished from overhead, the time that a CPU is running communication code instead of application code. In this section, we use the less precise but more common definitions of latency and bandwidth in terms of ½RTT for an arbitrary-size message. Furthermore, all performance measurements in this section are based on tests written using MPI [408], currently the de facto standard messaging layer.
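As an illustration of the measurement procedure just described, the sketch below is a minimal MPI ping-pong test that reports ½RTT. It is not the benchmark code behind the measurements in this chapter; the message length and iteration count are arbitrary illustrative choices, and the result is averaged over many round trips only to reduce timer noise.

```c
/*
 * Minimal ping-pong sketch of the 1/2 RTT measurement: rank 0 timestamps,
 * exchanges equal-sized messages with rank 1, and reports half the average
 * round-trip time.  LEN and ITERS are illustrative values, not prescribed
 * by the text.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define LEN   0       /* payload size in bytes (0 gives the zero-byte latency) */
#define ITERS 1000    /* number of round trips to average over */

int main(int argc, char *argv[])
{
    char sendbuf[LEN + 1], recvbuf[LEN + 1];
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0)
            fprintf(stderr, "Run with at least two MPI ranks.\n");
        MPI_Finalize();
        return 1;
    }
    memset(sendbuf, 0, sizeof sendbuf);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(sendbuf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(recvbuf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(sendbuf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rtt = (MPI_Wtime() - start) / (2.0 * ITERS);

    if (rank == 0)
        printf("1/2 RTT for %d-byte messages: %.2f us\n", LEN, half_rtt * 1e6);
    MPI_Finalize();
    return 0;
}
```

With LEN set to 0 the program reports a zero-byte latency; sweeping LEN over a range of sizes yields the latency and bandwidth curves in the sense defined above, and reporting half of a round trip avoids any need for synchronized clocks on the two nodes.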
1.2.2 Application Sensitivity to Communication Performance

An underlying premise of this entire book is that high-performance applications require high-performance communication. Can we quantify that statement in the context of massively parallel systems? In a 2006 paper, Kerbyson used detailed, application-specific, analytical performance models to examine the impact of varying the network latency, bandwidth, and number of CPUs per node [224]. Kerbyson's study investigated two large applications (SAGE [450] and Partisn [24]) and one application kernel (Sweep3D [233]) at three different network sizes. Of the three applications, Partisn is the most sensitive to communication performance. At 1024 processors, Partisn's performance can be improved by 7.4% by reducing latency from 4 µs to 1.5 µs, 11.8% by increasing bandwidth from 0.9 GB/s to 1.6 GB/s, or 16.4% by decreasing the number of CPUs per node from 4 to 2, effectively halving the contention for the network interface controller (NIC). Overall, the performance difference between the worst set of network parameters studied (4 µs latency, 0.9 GB/s bandwidth, and 4 CPUs/node) and the best (1.5 µs latency, 1.6 GB/s bandwidth, and 2 CPUs/node) is 68% for Partisn, 24% for Sweep3D, and 15% for SAGE, indicating that network performance is in fact important to parallel-application performance.

Of course, these results are sensitive to the particular applications, input parameters, and the selected variations in network characteristics. Nevertheless, other researchers have also found communication performance to be an important contributor to overall application performance. (Martin et al.'s analysis showing the significance of overhead [277] is an oft-cited study, for example.) The point is that high-performance communication is needed for high-performance applications.

1.2.3 Measurements on Massively Parallel Systems

In 2005, Hoisie et al. [187] examined the performance of three of the most powerful supercomputers of the day: IBM's Purple system [248] at Lawrence Livermore National Laboratory, Cray/Sandia's Red Storm system [58] at Sandia National Laboratories, and IBM's Blue Gene/L system [2] at Lawrence Livermore National Laboratory. Table 1.1 summarizes the node and network characteristics of each of these three systems at the time at which Hoisie et al. ran their measurements. (Since the data was acquired all three systems have had substantial hardware and/or software upgrades.)

TABLE 1.1: Network characteristics of Purple, Red Storm, and Blue Gene/L [187]

Metric                        Purple          Red Storm                Blue Gene/L
CPU cores                     12,288          10,368                   65,536
Cores per node                8               1                        2
Nodes                         1,536           10,368                   32,768
NICs per node                 2               1                        1
Topology (cf. §1.3)           16-ary 3-tree   27×16×24 mesh in x,y;    32×32×32 torus
                                              torus in z               in x,y,z
Peak link bandwidth (MB/s)    2,048           3,891                    175
Achieved bandwidth (MB/s)     2,913           1,664                    154
Achieved min. latency (µs)    4.4             6.9                      2.8

Purple is a traditional cluster, with NICs located on the I/O bus and
connecting to a network fabric. Although eight CPUs share the two NICs in a node, the messaging software can stripe data across the two NICs, resulting in a potential doubling of bandwidth. (This explains how the measured bandwidth in Table 1.1 can exceed the peak link bandwidth.) Red Storm and Blue Gene/L are massively parallel processors (MPPs), with the nodes and network more tightly integrated.

Figure 1.1 reproduces some of the more interesting measurements from Hoisie et al.'s paper.

[FIGURE 1.1: Comparative communication performance of Purple, Red Storm, and Blue Gene/L.]

Figure 1.1(a) shows the ½RTT of a 0-byte message sent from node 0 to each of nodes 1–1023 in turn. As the figure shows, Purple's latency is largely independent of distance (although plateaus representing each of the three switch levels are in fact visible); Blue Gene/L's latency clearly matches the Manhattan distance between the two nodes, with peaks and valleys in the graph representing the use of the torus links in each 32-node row and 32×32-node plane; however, Red Storm's latency appears erratic. This is because Red Storm assigns ranks in the computation based on a node's physical location on the machine room floor — aisle (0–3), cabinet within an aisle (0–26), "cage" (collection of processor boards) within a cabinet (0–2), board within a cage (0–7), and socket within a board (0–3) — not, as expected, by the node's x (0–26), y (0–15), and z (0–23) coordinates on the mesh. While the former mapping may be convenient for pinpointing faulty hardware, the latter is more natural for application programmers. Unexpected node mappings are one of the subtleties that differentiate measurements of network performance from expectations based on simulations, models, or theoretical characteristics.
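The Manhattan-distance behavior observed for Blue Gene/L can be made concrete with a few lines of code. The sketch below is purely illustrative (it is not taken from any of these systems' software): it computes the minimum hop count between two nodes of an X×Y×Z torus, where each dimension can be traversed in either direction thanks to the wraparound links; on a mesh, the per-dimension term would simply be the absolute coordinate difference.

```c
/*
 * Illustrative torus hop-count calculation: the Manhattan distance with
 * wraparound in each dimension.  Coordinates and dimensions are supplied
 * by the caller; nothing here is specific to Blue Gene/L.
 */
#include <stdlib.h>

/* Shortest way around a ring of k nodes between positions a and b. */
static int ring_dist(int a, int b, int k)
{
    int d = abs(a - b);
    return d < k - d ? d : k - d;
}

/* Minimum switch hops between nodes a and b on a dims[0] x dims[1] x dims[2] torus. */
int torus_hops(const int a[3], const int b[3], const int dims[3])
{
    int hops = 0;
    for (int i = 0; i < 3; i++)
        hops += ring_dist(a[i], b[i], dims[i]);
    return hops;
}
```

On a 32×32×32 torus, for example, nodes (0,0,0) and (31,0,0) are only one hop apart because of the wraparound link, which is exactly what produces the peaks and valleys visible in Figure 1.1(a).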
Figure 1.1(b) depicts the impact of link contention on achievable bandwidth. At contention level 0, two nodes are measuring bandwidth as described in Section 1.2.1. At contention level 1, two other nodes lying between the first pair exchange messages while the first pair is measuring bandwidth; at contention level 2, another pair of coincident nodes consumes network bandwidth, and so on. On Red Storm and Blue Gene/L, the tests are performed across a single dimension of the mesh/torus to force all of the messages to overlap.

Each curve in Figure 1.1(b) tells a different story. The Blue Gene/L curve contains plateaus representing messages traveling in alternating directions across the torus. Purple's bandwidth degrades rapidly up to the node size (8 CPUs) then levels off, indicating that contention for the NIC is a greater problem than contention within the rest of the network. Relative to the other two systems, Red Storm observes comparatively little performance degradation. This is because Red Storm's link speed is high relative to the speed at which a single NIC can inject messages into the network. Consequently, it takes a relatively large number of concurrent messages to saturate a network link. In fact, Red Storm's overspecified network links largely reduce the impact of the unusual node mappings seen in Figure 1.1(a).

Which is more important to a massively parallel architecture: faster processors or a faster network? The tradeoff is that faster processors raise the peak performance of the system, but a faster network enables more applications to see larger fractions of peak performance. The correct choice depends on the specifics of the system and is analogous to the choice of eating a small piece of a large pie or a large piece of a small pie. On an infinitely fast network (relative to computation speed), all parallelism that can be exposed — even, say, in an arithmetic operation as trivial as (a + b) · (c + d) — leads to improved performance. On a network with finite speed but that is still fairly fast relative to computation speed, trivial operations are best performed sequentially although small blocks of work can still benefit from running in parallel. On a network that is much slower than what the processors can feed (e.g., a wide-area computational Grid [151]), only the coarsest-grained applications are likely to run faster in parallel than sequentially on a single machine. As Hoisie et al. quantify [187], Blue Gene/L's peak performance is 18 times that of the older ASCI Q system [269] but its relatively slow network limits it to running SAGE [450] only 1.6 times as fast as ASCI Q. In contrast, Red Storm's peak performance is only 2 times ASCI Q's but its relatively fast network enables it to run SAGE 2.45 times as fast as ASCI Q. Overspecifying the network was in fact an important aspect of Red Storm's design [58].

1.3 Network Topology

One of the fundamental design decisions when architecting an interconnection network for a massively parallel system is the choice of network topology. As mentioned briefly at the start of Section 1.2, there are a wide variety of metrics one can use to compare topologies. Selecting a topology also involves a large
number of tradeoffs that can impact performance, cost, scalability, complexity, robustness, and power consumption:

Minimize diameter: Reducing the hop count (number of switch crossings) should help reduce the overall latency. Given a choice, however, should you minimize the worst-case hop count (the diameter) or the average hop count?

Minimize degree: How many incoming/outgoing links are there per node? Low-radix networks devote more bandwidth to each link, improving performance for communication patterns that map nicely to the topology (e.g., nearest-neighbor 3-space communication on a 3-D mesh). High-radix networks reduce average latency — even for communication patterns that are a poor match to the topology — and the cost of the network in terms of the number of switches needed to interconnect a given number of nodes. A second degree-related question to consider is whether the number of links per node is constant or whether it increases with network size.

Maximize bisection width: The minimum number of links that must be removed to split the network into two disjoint pieces with equal node count is called the bisection width. (A related term, bisection bandwidth, refers to the aggregate data rate between the two halves of the network.) A large bisection width may improve fault tolerance, routability, and global throughput. However, it may also greatly increase the cost of the network if it requires substantially more switches.

Minimize switch count: How many switches are needed to interconnect n nodes? O(n) implies that the network can cost-effectively scale up to large numbers of nodes; O(n²) topologies are impractical for all but the smallest networks.
Maximize routability: How many minimal paths are there from a source node to a destination node? Is the topology naturally deadlock-free or does it require extra routing sophistication in the network to maintain deadlock freedom? The challenge is to ensure that concurrently transmitted messages cannot mutually block each other's progress while simultaneously allowing the use of as many paths as possible between a source and a destination [126].

Minimize contention: What communication patterns (mappings from a set of source nodes to a — possibly overlapping — set of destination nodes) can proceed with no two paths sharing a link? (Sharing implies reduced bandwidth.) As Jajszczyk explains in his 2003 survey paper, a network can be categorized as nonblocking in the strict sense, nonblocking in the wide sense, repackable, rearrangeable, or blocking, based on the level of effort (roughly speaking) needed to avoid contention for arbitrary mappings of sources to destinations [215]. But how important is it to support arbitrary mappings — vs. just a few common communication patterns — in a contention-free manner? Consider also that the use of packet switching and adaptive routing can reduce the impact of contention on performance.

Consider two topologies that represent the extremes of connectivity: a ring of n nodes (in which nodes are arranged in a circle, and each node connects only to its "left" and "right" neighbors) and a fully connected network (in which each of the n nodes connects directly to each of the other n−1 nodes). Table 1.2 summarizes these two topologies in terms of the preceding characteristics.

TABLE 1.2: Properties of a ring vs. a fully connected network of n nodes

Metric          Ring       Fully connected
Diameter        n/2        1
Avg. dist.      n/4        1
Degree          2          n − 1
Bisection       2          (n/2)²
Switches        n          n
Minimal paths   1          1
Contention      blocking   nonblocking

As that table shows, neither topology outperforms the other in all cases (as is typical for any two given topologies), and neither topology provides many minimal paths, although the fully connected network provides a large number of non-minimal paths. In terms of scalability, the ring's primary shortcomings are its diameter and bisection width, and the fully connected network's primary shortcoming is its degree. Consequently, in practice, neither topology is found in massively parallel systems.
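The entries in Table 1.2 follow from elementary counting arguments, and the sketch below simply evaluates them for one example node count. The choice of n = 1024 and the assumption of an even n are illustrative only; nothing in the chapter prescribes them.

```c
/*
 * Back-of-the-envelope evaluation of the ring vs. fully connected metrics
 * from Table 1.2 for an example (assumed even) node count n.
 */
#include <stdio.h>

int main(void)
{
    int n = 1024;                          /* example node count (an assumption) */

    /* Ring: each node links only to its two neighbors. */
    int    ring_diameter  = n / 2;
    double ring_avg_dist  = n / 4.0;
    int    ring_degree    = 2;
    int    ring_bisection = 2;             /* cutting the ring in two places */

    /* Fully connected: every node links directly to every other node. */
    int fc_diameter  = 1;
    int fc_degree    = n - 1;
    int fc_bisection = (n / 2) * (n / 2);  /* one link per cross-half node pair */

    printf("n=%d  ring: diameter=%d avg=%.1f degree=%d bisection=%d\n",
           n, ring_diameter, ring_avg_dist, ring_degree, ring_bisection);
    printf("n=%d  full: diameter=%d degree=%d bisection=%d\n",
           n, fc_diameter, fc_degree, fc_bisection);
    return 0;
}
```

Even at this modest scale the tension is apparent: the ring's 512-hop diameter is unusable, while the fully connected network's 1023 links per node is unbuildable.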
1.3.1 The "Dead Topology Society"

What topologies are found in massively parallel systems? Ever since parallel computing first started to gain popularity (arguably in the early 1970s), the parallel-computing literature has been rife with publications presenting a virtually endless menagerie of network topologies, including — but certainly not limited to — 2-D and 3-D meshes and tori [111], banyan networks [167], Beneš networks [39], binary trees [188], butterfly networks [251], Cayley graphs [8], chordal rings [18], Clos networks [94], crossbars [342], cube-connected cycles [347], de Bruijn networks [377], delta networks [330], distributed-loop networks [40], dragonflies [227], extended hypercubes [238], fat trees [253], flattened butterflies [228], flip networks [32], fractahedrons [189], generalized hypercube-connected cycles [176], honeycomb networks [415], hyperbuses [45], hypercubes [45], indirect binary n-cube arrays [333], Kautz graphs [134], KR-Benes networks [220], Moebius graphs [256], omega networks [249], pancake graphs [8], perfect-shuffle networks [416], pyramid computers [295], recursive circulants [329], recursive cubes of rings [421], shuffle-exchange networks [417], star graphs [7], X-torus networks [174], and X-Tree structures [119]. Note that many of those topologies are general classes of topologies that include other topologies on the list or are specific topologies that are isomorphic to other listed topologies.

Some of the entries in the preceding list have been used in numerous parallel-computer implementations; others never made it past a single publication. Figure 1.2 presents the topologies used over the past fifteen years both in the ten fastest supercomputers and in the single most parallel computer (i.e., the supercomputer with the largest number of processors, regardless of its performance).

[FIGURE 1.2: Network topologies used in the ten fastest supercomputers and the single most parallel supercomputer, June and November, 1993–2008. Legend and statistics: M = mesh or torus (35.8%); F = fat tree (34.8%); H = hierarchical (13.2%); X = crossbar (11.3%); C = hypercube (1.0%); O = other (0.6%); no network (3.2%).]

The data from Figure 1.2 were taken from the semiannual Top500 list
(http://www.top500.org/), and therefore represent achieved performance on the HPLinpack benchmark [123]. (HPLinpack measures the time to solve a system of linear equations using LU decomposition with partial pivoting. Although the large, dense matrices that HPLinpack manipulates are rarely encountered in actual scientific applications, the decade and a half of historical HPLinpack performance data covering thousands of system installations is handy for analyzing trends in supercomputer performance.) Figure 1.3 illustrates the four topologies that are the most prevalent in Figure 1.2: a crossbar, a hypercube, a fat tree, and a mesh/torus.

[FIGURE 1.3: Illustrations of various network topologies: (a) 8-port crossbar; (b) hypercube (specifically, a 2-ary 4-cube); (c) fat tree (specifically, a 2-ary 3-tree), with a shaded addition that is commonly used in practice; (d) 3-D mesh, where a shaded addition makes it a 3-D torus (3-ary 3-cube); a single plane is a 2-D mesh/torus, and a single line is a 1-D mesh/torus (a.k.a. a ring).]

One feature of Figure 1.2 that is immediately clear is that the most parallel system on each Top500 list since 1994 has consistently been either a 3-D mesh or a 3-D torus. (The data are a bit misleading, however, as only four different systems are represented: the CM-200 [435], Paragon [41], ASCI Red [279], and Blue Gene/L [2].) Another easy-to-spot feature is that two topologies — meshes/tori and fat trees — dominate the top 10. More interesting, though, are the trends in those two topologies over time:

• In June 1993, half of the top 10 supercomputers were fat trees. Only one mesh/torus made the top 10 list.
• From November 1996 through November 1999 there were no fat trees in the top 10. In fact, in June 1998 every one of the top 10 was either a mesh or a torus. (The entry marked as "hierarchical" was in fact a 3-D mesh of crossbars.)

• From November 2002 through November 2003 there were no meshes or tori in the top 10. Furthermore, the November 2002 list contained only a single non-fat-tree.

• In November 2006, 40% of the top 10 were meshes/tori and 50% were fat trees.

Meshes/tori and fat trees clearly have characteristics that are well suited to high-performance systems. However, the fact that the top 10 is alternately dominated by one or the other indicates that changes in technology favor different topologies at different times. The lesson is that the selection of a topology for a massively parallel system must be based on what current technology makes fast, inexpensive, power-conscious, etc.

Although hypercubes did not often make the top 10, they do make for an interesting case study in a topology's boom and bust. Starting in 1986, there were enough hypercube systems, users, and researchers to warrant an entire conference devoted solely to hypercubes. Table 1.3 estimates interest in hypercubes — rather informally — by presenting the number of pages in this conference's proceedings. (Note though that the conference changed names almost every single year.) From 1986 to 1988, the page count increased almost tenfold! However, the page count halved the next year even as the scope broadened to include non-hypercubes, and, in 1990, the conference, then known as the Fifth Distributed Memory Computing Conference, no longer focused on hypercubes.

TABLE 1.3: Hypercube conference: length of proceedings

Year  Conference title                                                         Pages
1986  First Conference on Hypercube Multiprocessors                              286
1987  Second Conference on Hypercube Multiprocessors                             761
1988  Third Conference on Hypercube Concurrent Computers and Applications       2682
1989  Fourth Conference on Hypercubes, Concurrent Computers, and Applications   1361

How did hypercubes go from being the darling of massively parallel system design to topology non grata? There are three parts to the answer. First, topologies that match an expected usage model are often favored over those that don't. For example, Sandia National Laboratories have long favored 3-D meshes because many of their applications process 3-D volumes. In contrast, few scientific methods are a natural match to hypercubes. The second reason
that hypercubes lost popularity is that the topology requires a doubling of processors (more for non-binary hypercubes) for each increment in processor count. This limitation starts to become impractical at larger system sizes. For example, in 2007, Lawrence Livermore National Laboratory upgraded their Blue Gene/L [2] system (a 3-D torus) from 131,072 to 212,992 processors; it was too costly to go all the way to 262,144 processors, as would be required by a hypercube topology. The third, and possibly most telling, reason that hypercubes virtually disappeared is because of technology limitations. The initial popularity of the hypercube topology was due to how well it fares with many of the metrics listed at the start of Section 1.3. However, Dally's PhD dissertation [110] and subsequent journal publication [111] highlighted a key fallacy of those metrics: They fail to consider wire (or pin) limitations. Given a fixed number of wires into and out of a switch, an n-dimensional binary hypercube (a.k.a. a 2-ary n-cube) divides these wires — and therefore the link bandwidth — by n; the larger the processor count, the less bandwidth is available in each direction. In contrast, a 2-D torus (a.k.a. a k-ary 2-cube), for example, provides 1/4 of the total switch bandwidth in each direction regardless of the processor count. Another point that Dally makes is that high-dimension networks in general require long wires — and therefore high latencies — when embedded in a lower-dimension space, such as a plane in a typical VLSI implementation.

In summary, topologies with the best mathematical properties are not necessarily the best topologies to implement. When selecting a topology for a massively parallel system, one must consider the real-world features and limitations of the current technology. A topology that works well one year may be suboptimal when facing the subsequent year's technology.

1.3.2 Hierarchical Networks

A hierarchical network uses an "X of Y" topology — a base topology in which every node is replaced with a network of a (usually) different topology. As Figure 1.2 in the previous section indicates, many of the ten fastest supercomputers over time have employed hierarchical networks. For example, Hitachi's SR2201 [153] and the University of Tsukuba's CP-PACS [49] (#1, respectively, on the June and November 1996 Top500 lists) were both 8×17×16 meshes of crossbars (a.k.a. 3-D hypercrossbars). Los Alamos National Laboratory's ASCI Blue Mountain [268] (#4 on the November 1998 Top500 list) used a deeper topology hierarchy: a HIPPI [433] 3-D torus of "hierarchical fat bristled hypercubes" — the SGI Origin 2000's network of eight crossbars, each of which connected to a different node within a set of eight 3-D hypercubes and where each node contained two processors on a bus [246]. The topologies do not need to be different: NASA's Columbia system [68] (#2 on the November 2004 Top500 list) is an InfiniBand [207, 217] 12-ary fat tree of NUMAlink 4-ary fat trees. (NUMAlink is the SGI Altix 3700's internal network [129].)
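Returning to the wire-limitation argument from Section 1.3.1, the sketch below works through the arithmetic under two loudly stated assumptions: a made-up per-switch pin budget and a made-up per-wire signalling rate, with processor-injection ports ignored. It is only meant to show why per-link bandwidth shrinks as a binary hypercube grows, whereas a k-ary 2-cube always spreads its pins over four links regardless of k.

```c
/*
 * Illustrative pin-budget arithmetic: per-link bandwidth of a binary
 * hypercube vs. a 2-D torus when every switch has the same fixed number
 * of wires.  The pin count and per-wire rate are invented example values.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double pins_per_switch = 128.0;   /* assumption: total wires per switch   */
    double gbps_per_pin    = 1.0;     /* assumption: signalling rate per wire */

    for (int nodes = 64; nodes <= 65536; nodes *= 4) {
        double hc_links   = log2((double)nodes);             /* links per hypercube switch */
        double hc_link_bw = pins_per_switch / hc_links * gbps_per_pin;
        double torus_bw   = pins_per_switch / 4.0 * gbps_per_pin;  /* 4 links, any k */
        printf("%6d nodes: hypercube link %5.1f Gb/s, 2-D torus link %5.1f Gb/s\n",
               nodes, hc_link_bw, torus_bw);
    }
    return 0;
}
```

Under these assumptions, growing the binary hypercube from 64 to 65,536 nodes cuts its per-link bandwidth by a factor of about 2.7 (16 links per switch instead of 6), while the 2-D torus's per-link bandwidth does not change at all.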