1© 2018 Mellanox Technologies
Considerations on Smart Cloud Implementations
OpenNebula TechDay, Frankfurt
Sept. 26, 2018 | Arne Heitmann, Staff System Engineer, EMEA
2© 2018 Mellanox Technologies
Agenda
 Introduction Mellanox
 Faster storage and networks
 Linbit architecture with RDMA/RoCE
 RDMA/RoCE
 ASAP2 (OVS Offloading)
 Overlay Networking
 EVPN/VXLAN routing
 Conclusion
3© 2018 Mellanox Technologies
Mellanox Overview
 Ticker: MLNX
 Worldwide Offices
 Company Headquarters
• Yokneam, Israel
• Sunnyvale, California
 Employees worldwide
• ~ 2,900
4© 2018 Mellanox Technologies
Comprehensive End-to-End Ethernet Product Portfolio
 NICs
 Cables
 Optics
 Switches
 Automation & Monitoring
 Management software
5© 2018 Mellanox Technologies
Unique Engine in Mellanox Ethernet Switch
 Mellanox switches are powered by a superior Mellanox ASIC
 Wire-speed, cut-through switching at any packet size
 Zero Jitter
 Low power
 10GbE to 100GbE
 DAC passive copper for 10/25/40/50/100GbE
 Active copper, by contrast, is more expensive, less reliable, and suffers from interoperability issues
 Active Fiber for 10/25/40/50/100GbE
6© 2018 Mellanox Technologies
New Storage Media Require Faster Networks
 Transition to faster storage media requires
faster networks
 Flash SSDs move the bottleneck from the
storage to the network
 What does it take to saturate one 10Gb/s link?
• 24 x HDDs
• 2 x SATA SSDs
• 1 x SAS SSD
• NVMe…
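To make the comparison concrete, here is a small back-of-the-envelope calculation. It is only a sketch: the per-device throughput figures are assumed ballpark values, not Mellanox measurements.

```python
# Rough check of how many devices it takes to fill one 10 Gb/s link (~1.25 GB/s).
# Per-device throughputs below are assumed ballpark figures, not measurements.
LINK_GBPS = 10
link_bytes_per_s = LINK_GBPS * 1e9 / 8          # ~1.25 GB/s

devices = {
    "HDD (mixed workload)": 55e6,               # ~55 MB/s assumed
    "SATA SSD":             550e6,              # ~550 MB/s assumed
    "SAS SSD":              1100e6,             # ~1.1 GB/s assumed
}

for name, bytes_per_s in devices.items():
    needed = link_bytes_per_s / bytes_per_s
    print(f"{name:>22}: ~{needed:.1f} device(s) to saturate the link")
```

With these assumptions the script lands close to the counts above: roughly 23 HDDs, 2-3 SATA SSDs, or a single SAS SSD.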
7© 2018 Mellanox Technologies
DRBD and RDMA – Architectural Advantage
8© 2018 Mellanox Technologies
Sidebar: RoCE - RDMA over Converged Ethernet
 Remote Direct Memory Access (RDMA) is a technology that enables data to be read from and written to a remote server
without involving either server's CPU, which results in:
 Reduced latency
 Increased throughput
 The CPU freed up to perform other tasks
 Twice the bandwidth with less than
half the CPU utilization
 RoCE needs a lossless network!
9© 2018 Mellanox Technologies
RoCE Done Right!
[Charts: pause time (microseconds) over time (seconds). Other switches: application blocked by the switch. RoCE done right: application runs smoothly.]
10© 2018 Mellanox Technologies
Best Congestion Management For RoCE
 Configuration
 4 hosts connected to 1 switch in a star topology
 ECN enabled, PFC enabled
 3 sources to 1 common destination
 Results
 The other ASIC sends pauses to the hosts; Spectrum sends no pauses
[Charts: pause time (microseconds) over time (seconds). Spectrum switch: no pauses. Other ASIC-based switch: up to 21% pause time.]
11© 2018 Mellanox Technologies
Better RoCE with Fast Congestion Notification
 Fast Congestion Notification
• Packets marked as they leave queue
• Up to 10ms faster alerts
• Servers react faster
• Reduces average queue depth
- Lowers real world latency
• Improves application performance
 Legacy Congestion Notification:
• Packets marked as they enter queue
• Notification delayed until queue empties
At 10 Gigabit Ethernet speeds, the delay inside the switch is equivalent to placing the server hundreds of miles away
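The gap between the two marking schemes can be quantified: a packet marked at enqueue time still has to drain through the queued data in front of it before its congestion mark leaves the switch. A minimal sketch of that arithmetic, using the slide's 10 Gigabit Ethernet example (the buffer depths are assumed example values):

```python
# Sketch: extra notification delay of enqueue-marking, i.e. the time the data
# queued ahead of a marked packet takes to serialize out of the switch.
# Buffer depths are assumed example values.
LINK_BPS = 10e9                      # 10 Gigabit Ethernet

def drain_delay_ms(queued_bytes, link_bps=LINK_BPS):
    """Time (ms) for 'queued_bytes' of backlog to serialize onto the wire."""
    return queued_bytes * 8 / link_bps * 1e3

for depth in (100e3, 1e6, 12.5e6):   # 100 KB, 1 MB, 12.5 MB of queued data
    print(f"{depth / 1e6:5.1f} MB queued -> ~{drain_delay_ms(depth):5.2f} ms extra delay")
```

A deep 12.5 MB buffer at 10 Gb/s accounts for the full 10 ms mentioned above; marking at dequeue avoids that wait because the mark is applied as the packet exits.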
12© 2018 Mellanox Technologies
Predictable QoS with RoCE
 Configuration
 7 hosts connected to 1 switch in a star topology
 ECN enabled, PFC enabled
 6 sources to 1 common destination
 Results
 Unequal bandwidth distribution on the other ASIC; even distribution with Spectrum
[Charts: throughput (KBytes/second) over time (seconds). Other ASIC-based switch: one source host gets 50% of the total bandwidth, the other 5 get only 10% each. Spectrum switch: each host gets 16.66% of the total bandwidth.]
13© 2018 Mellanox Technologies
RoCE runs best in Lossless Networks
 Configuration can be complex
 2 modes:
 Enhanced mode for experts
 User mode for easy configuration, covering 98% of implementations
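For the host side, a minimal sketch of what that configuration typically involves on a Linux server with a Mellanox NIC is shown below. The tool names, flags, and sysfs paths follow Mellanox's published RoCE configuration guides, but they are assumptions here: verify them against the MLNX_OFED documentation for your release, and note that the RoCE-facing interface and priority are placeholders.

```python
# Sketch only: host-side lossless-Ethernet setup for RoCE on a Mellanox NIC.
# Tool names, flags and sysfs paths are taken from Mellanox RoCE configuration
# guides but should be treated as assumptions and verified per MLNX_OFED release.
import subprocess

IFACE = "eth1"       # placeholder RoCE-facing interface
ROCE_PRIO = "3"      # priority commonly used for RoCE traffic (assumption)

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Trust DSCP markings and enable PFC only on the RoCE priority.
run(["mlnx_qos", "-i", IFACE, "--trust", "dscp"])
run(["mlnx_qos", "-i", IFACE, "--pfc", "0,0,0,1,0,0,0,0"])

# Enable ECN reaction/notification points for that priority (mlx5 sysfs layout
# as documented in MLNX_OFED; assumption).
for role in ("roce_rp", "roce_np"):
    with open(f"/sys/class/net/{IFACE}/ecn/{role}/enable/{ROCE_PRIO}", "w") as f:
        f.write("1")
```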
14© 2018 Mellanox Technologies
NEO Simplifies RoCE Provisioning
 Automated setup of RoCE across entire fabric
 Mellanox switches
 Mellanox NICs
 Ideal for End-to-End Mellanox deployments
 No manual configuration needed
15© 2018 Mellanox Technologies
Para-Virtualized SR-IOV
Single Root I/O Virtualization (SR-IOV)
 PCIe device presents multiple
instances to the OS/Hypervisor
 Enables Application Direct Access
 Bare metal performance for VM
 Reduces CPU overhead
 Enables many advanced NIC
features (e.g. DPDK, RDMA, ASAP2)
[Diagram: para-virtualized path (NIC, hypervisor vSwitch, VMs) vs. SR-IOV path (NIC with eSwitch exposing a Physical Function (PF) to the hypervisor and Virtual Functions (VFs) directly to the VMs).]
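On Linux hosts, the Virtual Functions themselves are typically instantiated through the standard SR-IOV sysfs attributes. A minimal sketch (the PF netdev name "eth1" is a placeholder; requires root):

```python
# Sketch: create SR-IOV Virtual Functions via the standard Linux sysfs interface.
# "eth1" is a placeholder Physical Function (PF) netdev name.
from pathlib import Path

PF = "eth1"
sysfs = Path(f"/sys/class/net/{PF}/device")

total = int((sysfs / "sriov_totalvfs").read_text())
print(f"{PF} supports up to {total} VFs")

# Writing to sriov_numvfs asks the driver to spawn that many VFs.
# (Set it back to 0 first if VFs already exist; the kernel rejects resizing.)
(sysfs / "sriov_numvfs").write_text("4")
```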
16© 2018 Mellanox Technologies
Introduction to OVS (Open vSwitch)
 Software component that typically provides switching between Virtual Machines
 Target application: Multi-server virtualization deployments
 OVS is an open project. Code and materials at http://openvswitch.org/
 OVS Main Functionality
• Bridge traffic between Virtual-Machines (VMs) on the same Host
• Bridge traffic between VMs and the outside world
• Migration of VMs with all of their associated configuration:
- L2 learning table, L3 forwarding state, ACLs, QoS, Policy and more
• OpenFlow support
• VM tagging and manipulation
• Flow-based switching
• Support for tunneling: VXLAN, GRE and more
 OVS works with KVM, XenServer, and OpenStack (a minimal setup sketch follows below)
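A minimal sketch of that functionality in practice, driving the standard ovs-vsctl CLI from Python. The bridge, port, tunnel, and peer names are hypothetical examples:

```python
# Sketch: create an OVS bridge, attach a VM's tap device, and add a VXLAN
# tunnel port to a remote hypervisor. All names/addresses are placeholders.
import subprocess

def ovs(*args):
    subprocess.run(["ovs-vsctl", *args], check=True)

ovs("add-br", "br-int")                 # integration bridge for local VMs
ovs("add-port", "br-int", "vm1-tap")    # bridge a VM's tap interface

# Tunnel port carrying overlay traffic to another host (see the VXLAN slides).
ovs("add-port", "br-int", "vxlan0", "--",
    "set", "interface", "vxlan0", "type=vxlan",
    "options:remote_ip=192.168.0.2", "options:key=100")
```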
17© 2018 Mellanox Technologies
Open Virtual Switch (OVS) Challenges
 Virtual switches such as Open vSwitch (OVS) are used as the
forwarding plane in the hypervisor
 Virtual switches implement extensive support for SDN (e.g.
enforce policies) and are widely used by the industry
 Supports L2-L3 networking features:
 L2 & L3 Forwarding, NAT, ACL, Connection Tracking etc.
 Flow based
 OVS Challenges:
 Poor packet performance: <1M PPS with 2-4 cores
 High CPU consumption: even with 12 cores, OVS cannot reach one third of 100GbE NIC speed
 Bad user experience: high and unpredictable latency, packet drops
 Solution
 Offload OVS data plane into Mellanox NIC using ASAP2 technology
18© 2018 Mellanox Technologies
ASAP2 SR-IOV - Example
• Enable the SR-IOV data path with the OVS control plane
• Use Open vSwitch as the management interface and offload the OVS data plane to the
Mellanox embedded switch (eSwitch) using ASAP2 Direct (see the sketch below)
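A sketch of the enablement steps Mellanox documents for ASAP2 Direct is shown below. The PCI address, interface, and service names are placeholders, and the exact procedure should be checked against the ASAP2/MLNX_OFED documentation for your release.

```python
# Sketch: typical steps to enable OVS hardware offload (ASAP2 Direct) on a
# ConnectX NIC. PCI address and service name are placeholders; verify the
# procedure against Mellanox's ASAP2 documentation.
import subprocess

PCI_DEV = "pci/0000:03:00.0"   # placeholder PCI address of the PF

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

# 1. Put the embedded switch (eSwitch) into switchdev mode so each VF gets a
#    representor netdev that OVS can attach to.
run(f"devlink dev eswitch set {PCI_DEV} mode switchdev")

# 2. Tell Open vSwitch to push datapath flows down to the hardware.
run("ovs-vsctl set Open_vSwitch . other_config:hw-offload=true")
run("systemctl restart openvswitch")   # service name differs per distro
```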
19© 2018 Mellanox Technologies
Virtualized Datapath Options Today
 SR-IOV (Single-Root I/O Virtualization): hardware-dependent on the NIC; line rate with no CPU overhead; ToR handles switching
 Accelerated vSwitch (Open vSwitch over DPDK): user-space datapath with direct I/O to the NIC or vNIC; switching, bonding, overlay
 Legacy vSwitch (kernel datapath): default for OpenStack; switching, bonding, overlay, live migration
20© 2018 Mellanox Technologies
OVS over DPDK vs. OVS Offload – ConnectX-5
ConnectX-5 provides a significant performance boost
 Without adding CPU resources
[Chart: OVS over DPDK delivers 7.6 MPPS using 4 dedicated hypervisor cores; OVS offload delivers 66 MPPS using 0 dedicated cores.]

Test              ASAP2 Direct   OVS DPDK          Benefit
1 flow, VXLAN     66M PPS        7.6M PPS (VLAN)   8.6X
60K flows, VXLAN  19.8M PPS      1.9M PPS          10.4X
21© 2018 Mellanox Technologies
ASAP2 – Facts (current)
 Offloads:
 Match on a 12-tuple and forward to VM/network, or drop
 Ethernet Layer 2
 IP (v4/v6)
 TCP/UDP
 Action:
 Forwarding
 Drop/Allow
 VXLAN Encap/Decap
 VLAN Push/Pop
 ConnectX-4 Lx: Per port
 ConnectX-5: Per flow
 Header Re-write (ConnectX-5): Up to and including Layer 4
 VF mirroring: Mirroring traffic from one VF to another in the same eSwitch
 VF LAG with LACP: Active/Backup and Bonding (50Gb/s from 2 ports of 25GbE)
 Supported OS (as of today):
 RHEL 7.4
 RHEL 7.5
 CentOS 7.2
 Packages required:
 MLNX_OFED 4.4
 iproute2 4.12 and up
 Open vSwitch 2.8 and up (for CentOS 7.2, from Mellanox.com)
22© 2018 Mellanox Technologies
Overlay Networks
 Traditional VLAN-based networks
 Layer 2 segmentation
 VLAN scalability is 4K segments
 No support for VM mobility
 Overlay networks with VXLAN
 Layer 2 over Layer 3 segmentation
 VXLAN scalability is 16M segments
 Support for VM mobility
 Multi-tenant isolation
 Overlay networks run as independent virtual networks on top of a physical network infrastructure
23© 2018 Mellanox Technologies
VXLAN Overview
 VXLAN - Virtual eXtensible Local Area Network:
 A standard overlay protocol that enables multiple layer 2
logical networks over a single underlay layer 3 network
 Each virtual network is a VXLAN logical layer 2 segment
 Encapsulates MAC-based layer 2 Ethernet frames in layer 3
UDP/IP packets
 Uses a 24-bit VXLAN network identifier (VNI) in the VXLAN
header, hence scales to 16 million layer 2 segments
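To make the encapsulation concrete, here is a self-contained sketch of the 8-byte VXLAN header defined in RFC 7348 (the inner Ethernet frame is a placeholder):

```python
# Sketch: the 8-byte VXLAN header from RFC 7348. The 24-bit VNI is what gives
# VXLAN its 16M-segment scale versus the 12-bit (4K) VLAN ID.
import struct

def vxlan_header(vni: int) -> bytes:
    assert 0 <= vni < 2**24, "VNI is a 24-bit field"
    flags = 0x08000000          # 'I' bit set: the VNI field is valid
    return struct.pack("!II", flags, vni << 8)

# The header is prepended to the original layer 2 frame and carried in a
# UDP/IP packet (destination port 4789), letting L2 segments cross L3 boundaries.
inner_frame = b"\x00" * 60      # placeholder Ethernet frame
payload = vxlan_header(vni=100) + inner_frame
print(len(payload), "bytes of VXLAN payload for the outer UDP datagram")
```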
24© 2018 Mellanox Technologies
VTEP - VXLAN Tunnel End Point
 VTEP on the host
 VXLAN agents run in the host hypervisor
 Encapsulation/de-encapsulation in software
 Degraded performance
 Mellanox network adapters support VXLAN offloads;
encapsulation/de-encapsulation can be offloaded to
the NIC
 VTEP on the ToR
 VXLAN agents run on the ToRs
 Encapsulation/de-encapsulation in switch hardware
 Efficient performance
 Cumulus Linux, which runs on Mellanox switches,
supports VTEP on the switch
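For the host case, a minimal sketch of a VTEP built with standard iproute2 commands is shown below. The VNI, device names, and addresses are placeholder examples, and a real deployment would also populate the VXLAN forwarding database or use EVPN, as discussed on the following slides.

```python
# Sketch: a host-side VTEP using standard iproute2 commands (run as root).
# VNI, device names and addresses are placeholders.
import subprocess

def ip(*args):
    subprocess.run(["ip", *args], check=True)

# VXLAN device terminating VNI 100 on this host, standard UDP port 4789.
ip("link", "add", "vxlan100", "type", "vxlan",
   "id", "100", "dstport", "4789",
   "local", "192.168.0.1", "dev", "eth0")

# Bridge it so VM tap interfaces can be attached, then bring everything up.
ip("link", "add", "br100", "type", "bridge")
ip("link", "set", "vxlan100", "master", "br100")
ip("link", "set", "vxlan100", "up")
ip("link", "set", "br100", "up")
```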
25© 2018 Mellanox Technologies
VTEP on the Host - Accelerating Overlay Networks
 Virtual overlay networks simplify management and VM migration
 Overlay accelerators in the NIC enable bare metal performance
26© 2018 Mellanox Technologies
VTEP on the ToR
 VTEP on the ToR enables scalability and flexibility
 Multi-tenancy / integration of legacy services
27© 2018 Mellanox Technologies
Why BGP-EVPN + VXLAN?
 BGP-EVPN is an open, controller-less solution
 Controllers are centralized and limit the scale of the solution
 Controllers lock customers into certain technologies and increase costs
 BGP-EVPN with VXLAN is a better alternative to legacy VPLS/VLL
28© 2018 Mellanox Technologies
VXLAN Bridging with EVPN
 Ethernet Virtual Private Network (EVPN)
 Often used to implement controller-less VXLAN
 Standards-based control plane for VXLAN defined in RFC 7432
 Relies on multi-protocol BGP (MP-BGP) for information
exchange
 Key features include:
 VNI membership exchange between VTEPs
 Exchange of host MAC and IP addresses
 Support for host/VM mobility (MAC and IP moves)
 Support for inter-VXLAN routing
 Support for layer 3 multi-tenancy with VRFs
 Support for dual-attached hosts via VXLAN active-active
mode.
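A minimal sketch of what a leaf's EVPN configuration can look like, rendered here as FRR-style text generated from Python. The AS number, router-id, and fabric-facing interface are placeholders, and the statement names follow common FRR/Cumulus examples but should be verified against the FRR documentation:

```python
# Sketch: minimal FRR-style BGP-EVPN leaf configuration for controller-less
# VXLAN. All values are placeholders; verify syntax against FRR documentation.
FRR_EVPN_LEAF = """\
router bgp 65001
 bgp router-id 10.0.0.1
 neighbor swp1 interface remote-as external
 address-family l2vpn evpn
  neighbor swp1 activate
  advertise-all-vni
 exit-address-family
"""

with open("frr.conf.example", "w") as f:
    f.write(FRR_EVPN_LEAF)
print(FRR_EVPN_LEAF)
```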
29© 2018 Mellanox Technologies
VXLAN Routing Modes
 EVPN supports three VXLAN routing modes:
 Centralized routing:
 Specific VTEPs act as designated Layer 3 gateways and route between subnets
 Other VTEPs just act as bridges
 Distributed asymmetric routing:
 Every VTEP participates in routing
 For a given flow, only the ingress VTEP routes; the egress VTEP acts only as a bridge
 Distributed symmetric routing:
 Every VTEP participates in routing
 Both the ingress VTEP and the egress VTEP participate in routing
30© 2018 Mellanox Technologies
Distributed VXLAN Routing
 Distributed asymmetric routing:
 Each VTEP acts as a layer 3 gateway, performing routing for its
attached hosts
 Only the ingress VTEP performs routing; the egress VTEP only
performs bridging
 Advantage:
 Easy to deploy; no additional special VNIs
 Fewer routing hops when communicating between VXLANs
 Disadvantage:
 Each VTEP must be provisioned with all VLANs/VNIs
 Distributed symmetric routing:
 Each VTEP acts as a layer 3 gateway, performing routing for its
attached hosts
 Both the ingress VTEP and the egress VTEP route the packets
 A new, special transit VNI, called the L3VNI, is used for all routed
VXLAN traffic
 Advantage:
 Each VTEP only needs its local VLANs, local VNIs, and the L3VNI with
its associated VLAN
 Disadvantage:
 More complex configuration
 An extra routing hop that may add latency
31© 2018 Mellanox Technologies
Conclusion
 Cloud infrastructures with highly virtualized topologies, storage, and compute
 Provide flexibility at any scale
 Require intelligent use of protocol and feature tool sets
 Fast, distributed storage requires higher bandwidth and efficient CPU management
 RoCE done right accelerates storage performance
 VMs require internal and external communication over a virtually switched infrastructure
 ASAP2 helps take the load off OVS, and thus the CPU, while optimizing the communication path
 Highly virtualized environments need to extend L2 segmentation beyond VLAN limits and across L3
boundaries
 Overlay networking with VXLAN adds scale
 VXLAN with EVPN adds flexibility and manageability
 VXLAN routing adds agility
32© 2018 Mellanox Technologies
Thank You
33© 2018 Mellanox Technologies
For Further Reading
Addendum
 RoCE/RDMA:
 Mellanox RoCE Homepage
 Boosting Performance With RDMA – A Case Study
 What is RDMA?
 RDMA/RoCE Solutions
 Recommended Network Configuration Examples for RoCE Deployment
 ASAP2:
 Mellanox ASAP2 Homepage
 Getting started with Mellanox ASAP^2
 The Ideal Network for Containers and NFV Microservices
 Overlay Networking/VXLAN/EVPN
 EVPN with Mellanox Switches
 Top 3 considerations for picking your BGP EVPN VXLAN infrastructure
 VXLAN is finally simple, use EVPN and set up VXLAN in 3 steps
 Mellanox Ethernet Solutions
 Mellanox Open Ethernet Switches
 Mellanox Open Ethernet Switches Product Brief