Advancing Applications Performance With InfiniBand
Pak Lui, Application Performance Manager
September 12, 2013
Mellanox Overview
 Leading provider of high-throughput, low-latency server and storage interconnect
• FDR 56Gb/s InfiniBand and 10/40/56GbE
• Reduces application wait-time for data
• Dramatically increases ROI on data center infrastructure
 Company headquarters:
• Yokneam, Israel; Sunnyvale, California
• ~1,200 employees* worldwide
 Solid financial position
• Record revenue in FY12: $500.8M, up 93% year-over-year
• Q2’13 revenue of $98.2M
• Q3’13 guidance ~$104M to $109M
• Cash + investments @ 6/30/13 = $411.3M
Ticker: MLNX
* As of June 2013
Providing End-to-End Interconnect Solutions
Comprehensive end-to-end InfiniBand and Ethernet solutions portfolio: ICs, adapter cards, switches/gateways, long-haul systems, cables/modules
Comprehensive end-to-end software accelerators and management:
• MXM – Mellanox Messaging Acceleration
• FCA – Fabric Collectives Acceleration
• UFM – Unified Fabric Management
• VSA – Storage Accelerator (iSCSI)
• UDA – Unstructured Data Accelerator
Virtual Protocol Interconnect (VPI) Technology
VPI adapters (adapter card, LOM, mezzanine card; PCIe 3.0) and VPI switches carry Ethernet and InfiniBand on the same hardware, under a common switch OS layer and the Unified Fabric Manager
• Ethernet: 10/40/56 Gb/s; InfiniBand: 10/20/40/56 Gb/s
• Switch configurations: 64 ports 10GbE; 36 ports 40/56GbE; 48 ports 10GbE + 12 ports 40/56GbE; 36 ports InfiniBand up to 56Gb/s; 8 VPI subnets
• Acceleration engines serving networking, storage, clustering and management applications
• From data center to campus and metro connectivity
MetroDX™ and MetroX™
 MetroX™ and MetroDX™ extend InfiniBand and Ethernet RDMA reach
 Fastest interconnect over 40Gb/s InfiniBand or Ethernet links
 Supports multiple distances
 Simple management to control distant sites
 Low-cost, low-power, long-haul solution
40Gb/s over Campus and Metro
Data Center Expansion Example – Disaster Recovery
Key Elements in a Data Center Interconnect
[Diagram: servers and storage linked by adapters and ICs, switch and IC, and cables, silicon photonics and parallel optical modules, all serving the applications]
IPtronics and Kotura Complete Mellanox’s 100Gb/s+ Technology
Recent Acquisitions of Kotura and IPtronics Enable Mellanox to Deliver
Complete High-Speed Optical Interconnect Solutions for 100Gb/s and Beyond
Mellanox InfiniBand Paves the Road to Exascale
Meteo France
NASA Ames Research Center Pleiades
 20K InfiniBand nodes
 Mellanox end-to-end FDR and QDR InfiniBand
 Supports a variety of scientific and engineering projects
• Coupled atmosphere-ocean models
• Future space vehicle design
• Large-scale dark matter halos and galaxy evolution
Asian Monsoon Water Cycle: High-Resolution Climate Simulations
NCAR (National Center for Atmospheric Research)
 “Yellowstone” system
 72,288 processor cores, 4,518 nodes
 Mellanox end-to-end FDR InfiniBand, full fat-tree (Clos) network, single plane
Applications Performance (Courtesy of the HPC Advisory Council)
[Chart: application performance gains of 284%, 1556%, 282%, 182% and 173% across applications on Intel Ivy Bridge and Intel Sandy Bridge platforms]
Applications Performance (Courtesy of the HPC Advisory Council)
[Chart: application performance gains of 392% and 135%]
Dominant in Enterprise Back-End Storage Interconnects
SMB Direct
Leading Interconnect, Leading Performance
[Chart: interconnect roadmap, 2001-2017. Bandwidth grows from 10Gb/s through 20, 40, 56 and 100Gb/s toward 200Gb/s, while latency falls from 5usec through 2.5, 1.3, 0.7 and 0.6usec toward under 0.5usec, all behind the same software interface]
Architectural Foundation for Exascale Computing
Connect-IB
Connect-IB: The Exascale Foundation
Enter the World of Boundless Performance
 World’s first 100Gb/s interconnect adapter
• PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports provide >100Gb/s
 Highest InfiniBand message rate: 137 million messages per second
• 4X higher than other InfiniBand solutions
 <0.7 microsecond application latency
 Supports GPUDirect RDMA for direct GPU-to-GPU communication
 Unmatched Storage Performance
• 8,000,000 IOPs (1 QP), 18,500,000 IOPs (32 QPs)
 New Innovative Transport – Dynamically Connected Transport Service
 Supports Scalable HPC with MPI, SHMEM and PGAS/UPC offloads
Connect-IB Memory Scalability
[Chart: host memory consumption (MB, log scale) of the InfiniBand transport for clusters of 8, 2K, 10K and 100K nodes, comparing InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008) and Connect-IB with DCT (2012)]
Dynamically Connected Transport Advantages
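A rough illustration of why the transport choice matters for memory (back-of-the-envelope arithmetic, not figures from the slides): with the classic Reliably Connected (RC) transport, each process keeps a queue pair for every remote process it communicates with, so an all-to-all job with P processes holds roughly P - 1 QPs per process and about P x (P - 1) QP objects fabric-wide; at 16K processes that is on the order of 268 million QPs across the cluster. The Dynamically Connected Transport instead creates and retires a small pool of connection objects on demand, so per-process transport state stays roughly constant as the job grows.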
FDR InfiniBand Delivers Highest Application Performance
[Chart: message rate in millions of messages per second, QDR InfiniBand vs. FDR InfiniBand]
MXM, FCA
Scalable Communication
Mellanox ScalableHPC Accelerates Parallel Applications
MXM and FCA layer on top of the InfiniBand Verbs API:
MXM
• Reliable Messaging Optimized for Mellanox HCA
• Hybrid Transport Mechanism
• Efficient Memory Registration
• Receive Side Tag Matching
FCA
• Topology Aware Collective Optimization
• Hardware Multicast
• Separate Virtual Fabric for Collectives
• CORE-Direct Hardware Offload
[Diagram: MPI (private memory per process P1, P2, P3), SHMEM and PGAS (logical shared memory across processes) programming models running over MXM and FCA]
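The verbs layer named above is the standard libibverbs C API. A minimal sketch (an illustration assuming libibverbs is installed, not tied to any particular Mellanox release) that opens the first HCA and queries its capabilities:

/* Open the first RDMA device and report a few capability limits.
 * Build (assumed): gcc verbs_query.c -o verbs_query -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "Failed to open %s\n", ibv_get_device_name(devs[0]));
        ibv_free_device_list(devs);
        return 1;
    }

    struct ibv_device_attr attr;
    if (ibv_query_device(ctx, &attr) == 0) {
        printf("Device: %s\n", ibv_get_device_name(devs[0]));
        printf("  max QPs: %d, max CQs: %d, max MRs: %d\n",
               attr.max_qp, attr.max_cq, attr.max_mr);
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Libraries such as MXM and FCA build their transports and collectives on top of exactly these verbs objects (devices, queue pairs, completion queues, memory regions).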
MXM v2.0 - Highlights
 Transport library integrated with Open MPI, OpenSHMEM, Berkeley UPC (BUPC) and MVAPICH2
 More integrations will be added in the future
 Utilizes Mellanox offload engines
 Supported APIs (both sync/async): AM, p2p, atomics, synchronization
 Supported transports: RC, UD, DC, RoCE, SHMEM
 Supported built-in mechanisms: tag matching, progress thread, memory registration cache, fast-path send for small messages, zero copy, flow control
 Supported data transfer protocols: Eager Send/Recv, Eager RDMA, Rendezvous
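MXM plugs in underneath these programming libraries, so application code is unchanged; whether MXM carries the traffic is a build and runtime choice of the MPI installation. A minimal MPI ping-pong in C (a generic sketch, not MXM-specific code):

/* Minimal MPI ping-pong between ranks 0 and 1. The transport underneath
 * (for example MXM, when the MPI library is built with it) is selected by
 * the MPI installation; the application code does not change. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char buf[64] = {0};
    if (size >= 2) {
        if (rank == 0) {
            strcpy(buf, "ping");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 got back: %s\n", buf);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            strcpy(buf, "pong");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}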
Mellanox FCA Collective Scalability
[Charts: latency (us) of the Barrier and Reduce collectives and aggregate bandwidth (KB*processes) of an 8-byte Broadcast, with and without FCA, as the job grows to roughly 2,500 processes (PPN=8)]
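The collectives in these charts are the standard MPI operations; a simple way to measure barrier latency yourself (an illustrative sketch, not the benchmark behind the charts above):

/* Time MPI_Barrier over many iterations and report the average latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    MPI_Barrier(MPI_COMM_WORLD);                 /* warm up and synchronize */

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average barrier latency: %.2f us\n",
               (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}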
FDR InfiniBand Delivers Highest Application Performance
GPU Direct
GPUDirect RDMA
[Diagram: transmit and receive paths between GPU memory, the chipset, the CPU, system memory and the InfiniBand adapter. With GPUDirect 1.0, data is staged through system memory on both sides; with GPUDirect RDMA, the InfiniBand adapter reads and writes GPU memory directly]
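Applications typically reach GPUDirect RDMA through a CUDA-aware MPI library: device pointers are handed straight to MPI calls, and the library moves the data without a host staging copy when the hardware path is available. A minimal sketch (assumes a CUDA-aware MPI build such as MVAPICH2-GDR; error checking omitted):

/* Send a buffer that lives in GPU memory directly between two ranks using
 * a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, n * sizeof(float));     /* device memory */

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float));
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", n);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}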
Preliminary Performance of MVAPICH2 with GPUDirect RDMA (Source: Prof. DK Panda)
[Charts: GPU-GPU internode MPI latency and bandwidth for small messages (1 byte to 4KB), MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR. Small-message latency drops from 19.78us to 6.12us (69% lower); small-message bandwidth increases about 3X]
Execution Time of HSG (Heisenberg Spin Glass) Application with 2 GPU Nodes
Preliminary Performance of MVAPICH2 with GPUDirect RDMA (Source: Prof. DK Panda)
Thank You