NVSwitch Technical Overview
NVIDIA NVSWITCH
The World’s Highest-Bandwidth On-Node Switch
As deep learning neural networks become more sophisticated, their size
and complexity continue to expand. And so do the available datasets
they can ingest to deliver next-level insights. The result is exponential
growth in required computing capacity to train these networks in a
practical amount of time. To meet this challenge, developers have
turned to multi-GPU implementations, which have demonstrated
near-linear performance scaling. In these multi-GPU systems, one of
the keys to continued performance scaling is flexible, high-bandwidth
inter-GPU communications.
Enter NVSwitch.
VITAL STATISTICS:
Port Configuration: 18 NVLink ports
Speed per Port: 50 GB/s per NVLink port (total for both directions)
Connectivity: Fully connected internal crossbar
Transistor Count: 2 billion
Figure-1: NVSwitch Topology Diagram - Two GPUs' connections shown for simplicity. All 16 GPUs connect to NVSwitch chips in the same way.

[Block diagram: NVLink ports connect through forwarding logic to a fully connected crossbar (XBAR), alongside a management block.]
NVSwitch enables larger GPU server systems with 16 GPUs and 24X more inter-GPU bandwidth than four InfiniBand ports, so much more work can happen on a single server node. A 16-GPU server offers multiple advantages: it reduces the network traffic hot spots that can occur when two GPU-equipped servers exchange data during a neural network training run. With an NVSwitch-equipped server like NVIDIA’s DGX-2, those exchanges occur on-node, which offers significant performance advantages. In addition, it offers a simpler, single-node programming model that effectively abstracts the underlying topology.
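To make that single-node model concrete, the sketch below (ours, not from NVIDIA's materials) uses the standard CUDA runtime peer-to-peer API to let each GPU on the node address its peers directly; on an NVSwitch-equipped system such as DGX-2, these peer copies travel over the NVLink fabric rather than PCIe. Buffer sizes and device IDs are illustrative, and error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    // Enable direct peer-to-peer access between every pair of GPUs that
    // supports it. On an NVSwitch system every pair does, so any GPU can
    // read or write any other GPU's memory at NVLink speed.
    for (int src = 0; src < nDevices; ++src) {
        cudaSetDevice(src);
        for (int dst = 0; dst < nDevices; ++dst) {
            if (dst == src) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            if (canAccess) cudaDeviceEnablePeerAccess(dst, 0);
        }
    }

    // Illustrative direct copy, GPU 0 -> GPU 1, with no host staging.
    // Assumes at least two GPUs are present.
    const size_t bytes = 64 << 20;  // 64 MB, an arbitrary example size
    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    std::printf("Peer-to-peer copy complete across %d GPUs\n", nDevices);

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}

Once peer access is enabled, kernels can also dereference pointers to a peer GPU's memory directly, which is what lets application code largely ignore which GPU a buffer lives on.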
Chart-1: Speedup of an NVSwitch-equipped DGX-2 versus two InfiniBand-connected DGX-1 servers on workloads with significant inter-GPU communication - Physics (MILC Benchmark) and Weather Simulation (IFS).

The MILC-based workload is a numerical simulation application that uses a batched conjugate gradient (CG) solver to study quantum chromodynamics (QCD), the theory surrounding the strong interactions of subatomic physics. This benchmark corresponds to a batched mixed-precision CG solver.
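For context, each iteration of a textbook (unpreconditioned) CG solve of Ax = b updates the solution x, residual r, and search direction p as below; a mixed-precision variant runs these updates in reduced precision and periodically re-evaluates the residual at higher precision, and a batched solver runs many independent systems of this form at once. This is the standard formulation, not MILC's exact kernel.

\begin{aligned}
\alpha_k &= \frac{r_k^{\top} r_k}{p_k^{\top} A p_k}, \qquad
x_{k+1} = x_k + \alpha_k p_k, \qquad
r_{k+1} = r_k - \alpha_k A p_k, \\
\beta_k &= \frac{r_{k+1}^{\top} r_{k+1}}{r_k^{\top} r_k}, \qquad
p_{k+1} = r_{k+1} + \beta_k p_k.
\end{aligned}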
System Configs: Each of the two DGX-1 servers has dual Xeon E5-2690 v4 processors and 8 x Tesla V100 GPUs; the servers are connected via four EDR (100 Gb) InfiniBand connections. The DGX-2 server has dual Xeon Platinum 8168 processors and 16 x Tesla V100 GPUs. See the workload notes below for details.
Chart-2: Speedup of the DGX-2 versus two InfiniBand-connected DGX-1 servers on Recommender System (Sparse Embedding) and Language Translation (Transformer with MoE) workloads. System configs are the same as for Chart-1.
Similarly, HPC and graph analytics workloads continue to grow and take advantage of GPU acceleration. NVSwitch, which lies at the heart of NVIDIA’s new DGX-2 server, is the critical connective tissue that enables 16 GPUs in a single server to accelerate even the most aggressive workloads and bring on the next wave of deep learning innovation.
To learn more about NVIDIA Tesla® V100 and the Volta architecture,
download the VOLTA ARCHITECTURE WHITEPAPER.
ECMWF’s IFS: A global numerical weather prediction model developed by the European Centre
for Medium-Range Weather Forecasts (ECMWF) based in Reading, England. ECMWF is an
independent intergovernmental organization supported by most of the nations of Europe,
and it operates one of the largest supercomputer centers in Europe for frequent updates
of global weather forecasts. The Integrated Forecasting System (IFS) mini-app benchmark
focuses its work on a spherical harmonics transformation that represents a significant
computational load of the full model. The benchmark speedups shown in the graph are
better than those for the full IFS model, since the benchmark amplifies the transform
stages of the algorithm (by design). However, this benchmark demonstrates that ECMWF’s
extremely effective and proven methods for providing world-leading predictions remain
valid on NVSwitch-equipped servers such as NVIDIA DGX-2, since they’re such a good
match to the problem.
Recommender: A mini-app built at NVIDIA that’s modelled after Alibaba’s paper “Deep Interest
Network for Click-Through Rate Prediction.” The mini-app uses a batch size of 8,192, indexing
into a 1 billion-entry embedding table. Each entry is 64 dimensions wide, with each dimension
in single precision (FP32). The resultant data table requires more than 256 GB of memory,
so this use case requires at least two DGX-1V 32 GB servers or a DGX-2 to run. The app
includes reduce and broadcast operations between GPUs that NVSwitch accelerates. The
performance of this workload is driven by how many embedding table lookups a system
can deliver, hence the metric used is billions of lookups per second.
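As an illustration of the kind of inter-GPU reduction NVSwitch accelerates, here is a minimal single-process NCCL sketch (ours, not part of the mini-app; buffer sizes are arbitrary and placeholder data is used) that sums a buffer across all GPUs on the node. NCCL's collectives automatically route traffic over NVLink/NVSwitch when the fabric is present.

#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

#define MAX_GPUS 16  // sized for a DGX-2

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    // One communicator per local GPU (single-process, multi-GPU layout).
    int devs[MAX_GPUS];
    ncclComm_t comms[MAX_GPUS];
    for (int i = 0; i < nDev; ++i) devs[i] = i;
    ncclCommInitAll(comms, nDev, devs);

    const size_t count = 1 << 24;  // 16M floats per GPU, arbitrary size
    float* sendbuf[MAX_GPUS];
    float* recvbuf[MAX_GPUS];
    cudaStream_t streams[MAX_GPUS];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 0, count * sizeof(float));  // placeholder data
        cudaStreamCreate(&streams[i]);
    }

    // Sum each GPU's partial results into every GPU's receive buffer.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
    std::printf("All-reduce complete across %d GPUs\n", nDev);

    for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
    return 0;
}

A broadcast of looked-up embeddings would follow the same pattern with ncclBroadcast in place of ncclAllReduce; in both cases the collective's throughput is bounded by inter-GPU bandwidth, which is why the NVSwitch fabric drives this workload's lookup rate.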