
TECHNICAL OVERVIEW

NVIDIA NVSWITCH
The World’s Highest-Bandwidth
On-Node Switch
As deep learning neural networks become more sophisticated, their size
and complexity continue to expand. And so do the available datasets
they can ingest to deliver next-level insights. The result is exponential
growth in required computing capacity to train these networks in a
practical amount of time. To meet this challenge, developers have
turned to multi-GPU implementations, which have demonstrated
near-linear performance scaling. In these multi-GPU systems, one of
the keys to continued performance scaling is flexible, high-bandwidth
inter-GPU communications.

NVIDIA introduced NVIDIA® NVLink™ to connect multiple GPUs at 10X the
PCIe bandwidth and boost computing capacity. This solution enabled 8
GPUs in a single server to be connected together in a point-to-point
network called a hybrid cube mesh. This implementation was an
important step forward and elevated the performance of 8 GPU servers.
But in “all to all” communications, where all GPUs need to communicate
with one another, this implementation requires certain GPU pairs to
communicate over a much slower PCIe data path. To take GPU server
performance to the next level and scale beyond 8 GPUs in a single
server, a more advanced solution was needed.

Enter NVSwitch.

NVSwitch enables a fully NVLink-connected 16-GPU system with an
uncompromised 300 GB/s of connectivity. This interconnect fabric
eliminates bottlenecks and intermediary GPU hops to enable 16 GPUs to
behave as one, unleashing an incredible 2 petaFLOPS of deep learning
computing capacity to train the next generation of AI networks.



NVSwitch: The World’s Highest-Bandwidth
On-Node Switch
NVSwitch is an NVLink switch chip with 18 ports of NVLink per switch.
Internally, the processor is an 18 x 18-port, fully connected crossbar.
Any port can communicate with any other port at full NVLink speed, 50
GB/s, for a total of 900 GB/s of aggregate switch bandwidth.

VITAL STATISTICS:
Port Configuration   18 NVLink ports
Speed per Port       50 GB/s per NVLink port (total for both directions)
Connectivity         Fully connected crossbar internally
Transistor Count     2 billion

Each port supports 25 GB/s in each direction. The crossbar is non-blocking,
allowing all ports to communicate with all other ports at the full NVLink
bandwidth. Consider Figure-1 shown below. All 8 GPUs on one baseboard
are connected with a single NVLink to all 6 NVSwitches. Eight ports on each
of the NVSwitches are used to communicate with the other baseboard. Each
of the 8 GPUs on a baseboard can communicate with any of the others
on the same baseboard at the full bandwidth of 300 GB/s with a single
NVSwitch traversal. Each of the GPUs can also communicate at full
bandwidth with any GPU on the second baseboard. In this case, there
are two NVSwitch traversals. The bi-section bandwidth between the
boards is 2.4 TB/s (48 links at 25 GB/s in each direction). Note that the
NVIDIA DGX-2™ platform uses only 16 of the available ports per switch.
The remaining ports are reserved.

[Diagram: two baseboards of eight GPUs (GPU 1-8) each, connected through the NVSwitch chips.]

Figure-1: NVSwitch Topology Diagram - Two GPUs' connections shown for simplicity. All 16
GPUs connect to NVSwitch chips in the same way.
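The arithmetic behind these bandwidth figures is easy to restate. The sketch below is a
back-of-the-envelope check, not NVIDIA code; it only assumes the link counts and per-link
rates quoted above and reproduces the 900 GB/s, 300 GB/s, and 2.4 TB/s numbers.

    // Back-of-the-envelope check of the bandwidth figures quoted in this overview.
    // Hypothetical sketch; the constants simply restate the numbers above.
    #include <cstdio>

    int main() {
        const double gb_per_s_per_direction = 25.0;                         // per NVLink, each direction
        const double gb_per_s_per_link      = 2.0 * gb_per_s_per_direction; // 50 GB/s both directions
        const int    ports_per_switch       = 18;
        const int    links_per_gpu          = 6;   // one NVLink to each of the 6 NVSwitches
        const int    links_between_boards   = 48;  // 8 inter-board ports on each of 6 switches

        printf("Aggregate switch bandwidth: %.0f GB/s\n",
               ports_per_switch * gb_per_s_per_link);                        // 18 x 50 = 900 GB/s
        printf("Per-GPU fabric bandwidth:   %.0f GB/s\n",
               links_per_gpu * gb_per_s_per_link);                           // 6 x 50 = 300 GB/s
        printf("Bisection bandwidth:        %.1f TB/s\n",
               links_between_boards * gb_per_s_per_link / 1000.0);           // 48 x 50 = 2.4 TB/s
        return 0;
    }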



With such high-bandwidth data movement, data integrity is paramount.
Data traversing NVLink itself is protected using cyclical redundancy coding
(CRC) to detect errors and replay the transfer. NVSwitch’s datapaths,
routing, and state structures are protected using error-correcting
codes (ECC). NVSwitch also supports final hop-address fidelity checks
and buffer over- and underflow checks. For security, NVSwitch’s routing
tables are indexed and controlled by the NVIDIA fabric manager, providing
protection by limiting an application’s access to its specific ranges.

[Die shot: labeled blocks include NVLINK ports, FORWARDING logic, the XBAR crossbar, and MANAGEMENT.]

Figure-2: NVSwitch Die Shot

NVSwitch enables larger GPU server systems with 16 GPUs and 24X more
inter-GPU bandwidth than 4X InfiniBand ports, so much more work can
happen on a single server node. A 16-GPU server offers multiple
advantages: It reduces network traffic hot spots that can occur when
two GPU-equipped servers are exchanging data during a neural network
training run. With an NVSwitch-equipped server like NVIDIA's DGX-2,
those exchanges occur on-node, which also offers significant performance
advantages. In addition, it offers a simpler, single-node programming
model that effectively abstracts the underlying topology.
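As a rough illustration of that single-node model, the hypothetical CUDA sketch below
enables peer access between every GPU pair visible on the node; on an NVSwitch-based
system such as DGX-2, the resulting peer traffic travels over the NVLink fabric rather
than PCIe. This is only a minimal sketch, not NVIDIA's fabric-manager or DGX system
software.

    // Minimal hypothetical sketch: enable direct peer access between every
    // GPU pair visible on the node so kernels on one GPU can dereference
    // another GPU's memory directly.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);                  // 16 on a DGX-2-class system

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            for (int j = 0; j < ndev; ++j) {
                if (i == j) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                if (canAccess) {
                    // After this call, allocations on device j can be read and
                    // written directly by kernels running on device i.
                    cudaDeviceEnablePeerAccess(j, 0);
                }
            }
        }
        printf("Peer access enabled across %d GPUs\n", ndev);
        return 0;
    }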



Chart 1: NVSwitch Delivers up to a 2.4X Speedup on HPC Applications

NVIDIA's new DGX-2 server with 16 GPUs is able to deliver up to 2.4X more
high-performance computing (HPC) performance than two 8-GPU servers
communicating over a 4X InfiniBand connection. The European Centre for
Medium-Range Weather Forecasts' (ECMWF) mini-app workload executes a
large number of Fast Fourier Transform (FFT) operations, which involve
significant inter-GPU communication. The MILC-based workload is a numerical
simulation application that uses a batched conjugate gradient (CG) solver
to study quantum chromodynamics (QCD), the theory surrounding strong
interactions of subatomic physics. This benchmark corresponds to a batched
mixed-precision CG solver. See notes about workload details.

[Bar chart: speedup of 2 x DGX-1 (Volta) vs. DGX-2 for Physics (MILC Benchmark)
and Weather Simulation (IFS).]

System Configs: Each of the two DGX-1 servers has dual-socket Xeon E5-2690 v4
processors and 8 x V100 GPUs; the servers are connected via four EDR (100 Gb)
InfiniBand connections. The DGX-2 server has dual-socket Xeon Platinum 8168
processors and 16 x Tesla V100 GPUs.

Chart 2: NVSwitch Delivers up to a 2.7X Speedup on Deep Learning Training

NVIDIA's new DGX-2 server with 16 GPUs is able to deliver up to 2.7X more
deep learning training performance than two 8-GPU servers communicating
over a 4X InfiniBand connection. The mixture of experts (MoE) workload is
a combination of neural networks that collaborate to produce more
sophisticated language translations. Sparse embedding networks are used
for recommender systems to match users with relevant product and service
information. See notes about workload details.

[Bar chart: speedup of the baseline vs. DGX-2 for Recommender System
(Sparse Embedding) and Language Translation (Transformer with MoE).]

System Configs: Each of the two DGX-1 servers has dual-socket Xeon E5-2690 v4
processors and 8 x V100 GPUs; the servers are connected via four EDR (100 Gb)
InfiniBand connections. The DGX-2 server has dual-socket Xeon Platinum 8168
processors and 16 x Tesla V100 GPUs.



Ready to Tackle Tomorrow’s Workloads Today
With the continuing explosive growth of neural networks’ size, complexity,
and designs, it’s difficult to predict the exact form those networks will
take, but one thing remains certain: the appetite for deep learning compute
will continue to grow along with them. In the HPC domain, workloads like
weather modeling using large-scale, FFT-based computations will also
continue to drive demand for multi-GPU compute horsepower. And with
a 16-GPU configuration packing a half-terabyte of GPU memory in a unified
address space, applications can scale up without requiring knowledge
of the underlying physical topology.

Similarly, HPC and graph analytics workloads continue to grow and take
advantage of GPU acceleration. NVSwitch, which lies at the heart of
NVIDIA’s new DGX-2 server, is the critical connective tissue that enables
16 GPUs in a single server to accelerate even the most aggressive
workloads and bring on the next wave of deep learning innovation.

To learn more about NVIDIA Tesla® V100 and the Volta architecture,
download the VOLTA ARCHITECTURE WHITEPAPER.

To learn more about NVIDIA's new DGX-2 server, visit the DGX-2 page.
WORKLOAD NOTES:
MILC Benchmark: A numerical simulation application that uses a batched CG solver to study
quantum chromodynamics (QCD), the theory surrounding strong interactions of subatomic
physics. This benchmark corresponds to a batched mixed-precision conjugate gradient
solver that includes 64-bit floating-point (FP64) double-precision and FP16 half-precision
calculations. This kind of algorithm can be used in the analysis phase of lattice QCD. The
problem size here is 48 x 48 x 48 x 64. The high dimension requires each GPU to fetch data
from many of its neighbors to execute its computations, creating all-to-all traffic that
NVSwitch dramatically accelerates.

ECMWF's IFS: A global numerical weather prediction model developed by the European Centre
for Medium-Range Weather Forecasts (ECMWF) based in Reading, England. ECMWF is an
independent intergovernmental organization supported by most of the nations of Europe,
and it operates one of the largest supercomputer centers in Europe for frequent updates
of global weather forecasts. The Integrated Forecasting System (IFS) mini-app benchmark
focuses its work on a spherical harmonics transformation that represents a significant
computational load of the full model. The benchmark speedups shown in the graph are
better than those for the full IFS model, since the benchmark amplifies the transform
stages of the algorithm (by design). However, this benchmark demonstrates that ECMWF’s
extremely effective and proven methods for providing world-leading predictions remain
valid on NVSwitch-equipped servers such as NVIDIA DGX-2, since they’re such a good
match to the problem.

Recommender: A mini-app built at NVIDIA that’s modelled after Alibaba’s paper “Deep Interest
Network for Click-Through Rate Prediction.” The mini-app uses a batch size of 8,192, indexing
into a 1 billion-entry embedding table. Each entry is 64 dimensions wide with each dimension
in FP32 single-precision. The resultant data table requires more than 256 GB of memory,
so this use case requires at least two DGX-1V 32 GB servers or a DGX-2 to run. The app
includes reduce and broadcast operations between GPUs that NVSwitch accelerates. The
performance of this workload is driven by how many embedding table lookups a system
can deliver, hence the metric used is billions of lookups per second.
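For illustration only, the hypothetical CUDA kernel below sketches the core gather
operation such a lookup benchmark measures: copying one embedding row per lookup out of
a large single-precision table. Names, types, and table layout are assumptions, not the
mini-app's actual code, and the real workload also shards the table across GPUs and adds
the reduce and broadcast steps described above.

    // Hypothetical sketch of a batched embedding gather: one thread block per lookup.
    __global__ void gather_embeddings(const float* table,       // [num_entries][dim]
                                      const long long* indices, // [batch] row ids into the table
                                      float* out,               // [batch][dim]
                                      int batch, int dim) {
        int row = blockIdx.x;                  // one block handles one lookup
        if (row >= batch) return;
        const float* src = table + indices[row] * (long long)dim;
        float* dst = out + (long long)row * dim;
        for (int d = threadIdx.x; d < dim; d += blockDim.x) {
            dst[d] = src[d];                   // copy one 64-wide embedding row
        }
    }

    // Example launch for the batch size used in the note: 8,192 lookups of 64-wide rows.
    // gather_embeddings<<<8192, 64>>>(d_table, d_indices, d_out, 8192, 64);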

Mixture of Experts (MoE): Based on a network published by Google at the Tensor2Tensor
github, this workload uses the Transformer model with MoE layers. The MoE layers each
consist of 128 experts, each of which is a smaller feed-forward deep neural network (DNN).
Each expert specializes in a different domain of knowledge, and the experts are distributed
to different GPUs, creating significant all-to-all traffic due to communications between the
Transformer network layers and the MoE layers. The training dataset used is the “1 billion-
word benchmark for language modeling” according to Google. Training operations use Volta
Tensor Core and run for 45,000 steps to reach perplexity equal to 34. This workload uses a
batch size of 8,192 per GPU.
© 2018 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, CUDA, Pascal, Tesla, NVLink, and DGX-1 are
trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and
product names may be trademarks of the respective companies with which they are associated. APR18
