AllReduce for distributed learning I/O Extended Seoul

AllReduce for distributed learning
SungMin Han
Gopher

Agenda
● A short impression of I/O 2019
● Distributed learning
● AllReduce
● Cloud TPU Pods

Speaker
SungMin Han
Clova Research Engineer
Gopher
@pignose

Google I/O 2019
● Schedule: 05.07 - 09
● Attended sessions: 26
● Happy hours: 1
● Sandboxes: 8
● Community meetups joined: 4
● Snacks: 6
● Uber drives: 8

Key Announcements
● TensorFlow 2.0
● Fairness Learning
● ML Kit
● AI Hub
● Federated Learning
● TPU v3
● Cloud TPU Pods
● Swift for TensorFlow
● TensorFlow Lite for IoT Devices
● TF-Agents
● TensorFlow Extended (TFX)
● TensorFlow.js
● Google Coral
● Firebase Predictions
● Edge TPU

Key Announcements
TensorFlow

Key Announcements
TPU / Device

Key Announcements
ML Kit

In this session, we will talk about
Distributed Learning

SGD with a single GPU
[Diagram: previous learning environment with GPU 1 and CPU; labels: Model, FP, BP, loss, AVG, WP, ∆𝒘]

Simple version of SGD
[Diagram: GPU 1 and CPU; flow: Model, loss, Gradient, AVG, ∆𝒘, Update]
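The flow above (loss, gradient, AVG, update) can be sketched in a few lines. This is a minimal illustration only: the one-parameter model y = w * x, the toy data, the learning rate, and the function name `sgd_step` are all assumptions made for the example, not part of the talk.

```python
import random

def sgd_step(w, batch, lr):
    """One SGD update for the 1-D linear model y = w * x with squared-error loss."""
    # Gradient of (w * x - y) ** 2 with respect to w, one value per example.
    grads = [2.0 * (w * x - y) * x for x, y in batch]
    # AVG: average the per-example gradients to get the batch gradient (delta w).
    delta_w = sum(grads) / len(grads)
    # Update: move the weight against the gradient.
    return w - lr * delta_w

random.seed(0)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # toy data, true weight is 3.0
w = 0.0
for _ in range(200):
    batch = random.sample(data, 2)  # mini-batch of two examples
    w = sgd_step(w, batch, lr=0.1)
```

After the loop, `w` should be very close to the true weight of 3.0. Everything runs on one device here, which is the situation the next slide calls a problem.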

The problem
● Training time depends on the batch size and the GPU model
● The model update process runs on only a single GPU
● A high-spec GPU machine is too expensive
● A single GPU has practical limitations
● There is no way to support scalability

SGD with multiple GPUs
[Diagram: GPU 1, GPU 2 and GPU 3 each hold the Model and compute a loss and Gradient; the CPU gathers ∆𝒘𝟏, ∆𝒘𝟐, ∆𝒘𝟑, aggregates them (AVG) into ∆𝒘, and applies the Update]
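A rough sketch of the gather-and-average step, with the three GPUs simulated as plain Python lists. The model, data, shard layout, and function names are assumptions for illustration; real multi-GPU code would also copy tensors between device and host memory.

```python
def local_gradient(w, shard):
    """Average squared-error gradient for y = w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    # Each simulated GPU computes a gradient on its own shard of the batch.
    grads = [local_gradient(w, shard) for shard in shards]  # delta w1, w2, w3
    # Gather on the CPU, then aggregate by averaging.
    delta_w = sum(grads) / len(grads)
    # One update is applied; the new weight would then go back to every GPU.
    return w - lr * delta_w

data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
shards = [data[0:2], data[2:4], data[4:6]]  # equal-sized shards for 3 GPUs
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards, lr=0.05)
```

Because the shards have equal size, the average of the three shard gradients equals the gradient over the whole batch, so the result matches single-device training on the full batch.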

The issues we can find
● Data transfer between GPU memory and the CPU is slow
● There is a GPU stickiness issue (*GPU balancing issue)
● This solution works only on a single bare-metal server (node)
[Diagram labels: TW, gradient, CPU, model]

To avoid the imbalance problem,
we need to find a better way

The definition of distribution
Increase efficiency by dividing the problem into smaller parts
[Diagram: Problem, split across four Workers, combined into the Answer]
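As a small illustration of "dividing the problem into smaller parts", the sketch below splits a sum of squares across four workers and combines the partial answers. The task, the helper names `split` and `worker`, and the use of threads are assumptions chosen for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_workers):
    """Divide the problem into n_workers roughly equal parts."""
    return [items[i::n_workers] for i in range(n_workers)]

def worker(part):
    # Each worker solves its own smaller part (here: a partial sum of squares).
    return sum(v * v for v in part)

problem = list(range(1, 101))
parts = split(problem, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_answers = list(pool.map(worker, parts))
answer = sum(partial_answers)  # combine the partial results into the answer
```

Threads are used only to keep the example short; for CPU-bound Python work, separate processes or machines would be the more realistic choice.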

Three ways of distribution
Parallel / Concurrent / Parallel + Concurrent
To build a distributed environment,
we should understand the differences among these three categories of distributed solutions

Well-known distributed solutions
● DistBelief (Google Brain's first distributed environment for deep learning)
● Horovod (Uber's distributed TensorFlow environment)
● AllReduce (today's topic!)
● Federated Learning (introduced by Google in 2017)
● CollectiveAllReduce (Google TensorFlow tf.contrib.distribute.CollectiveAllReduceStrategy)

The basic theory
[Diagram: Downpour SGD. GPU 1 to GPU 4 send ∆𝒘𝟏, ∆𝒘𝟐, ∆𝒘𝟑, ∆𝒘𝟒 to a parameter server (PS1), which broadcasts ∆𝒘 back]
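The parameter-server pattern can be sketched as below. Note the simplification: Downpour SGD is asynchronous, while this sketch is synchronous (all four gradients arrive before the update) to keep it short. The class name, method names, and numbers are hypothetical.

```python
class ParameterServer:
    """Holds the global weights; workers push gradients and receive weights."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def push(self, gradients):
        # Aggregate the gradients from all workers (average), then update.
        n = len(gradients)
        for i in range(len(self.weights)):
            delta = sum(g[i] for g in gradients) / n
            self.weights[i] -= self.lr * delta

    def broadcast(self, n_workers):
        # Send a copy of the updated weights to every worker.
        return [list(self.weights) for _ in range(n_workers)]

ps = ParameterServer(weights=[1.0, 2.0], lr=0.5)
worker_grads = [[0.2, 0.4], [0.4, 0.0], [0.0, 0.4], [0.2, 0.0]]  # from 4 GPUs
ps.push(worker_grads)
replicas = ps.broadcast(n_workers=4)
```

Every gradient and every weight copy passes through the single server object, which is where the bottleneck discussed later comes from.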

Use case of Uber (Horovod)
https://eng.uber.com/horovod/

Parameter Server Scenario
● Simple: bottleneck
● Complex: overhead

Ring AllReduce
http://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
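A single-process simulation of ring AllReduce as it is commonly described: a scatter-reduce phase followed by an allgather phase, each taking n - 1 steps. This is a sketch under simplifying assumptions (vector length divisible by the number of workers, list copies in place of network sends); the function name is illustrative.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors across n simulated workers arranged in a ring."""
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "this sketch assumes the length divides evenly"
    c = size // n
    # data[r][k] is worker r's copy of chunk k.
    data = [[list(v[k * c:(k + 1) * c]) for k in range(n)] for v in vectors]

    # Phase 1, scatter-reduce: after n - 1 steps worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, list(data[r][(r - step) % n])) for r in range(n)]
        for r, (k, chunk) in enumerate(sends):
            dst = (r + 1) % n  # each worker sends only to its right neighbour
            data[dst][k] = [a + b for a, b in zip(data[dst][k], chunk)]

    # Phase 2, allgather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, list(data[r][(r + 1 - step) % n])) for r in range(n)]
        for r, (k, chunk) in enumerate(sends):
            data[(r + 1) % n][k] = chunk

    # Flatten each worker's chunks back into one vector.
    return [[x for chunk in worker for x in chunk] for worker in data]

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
result = ring_allreduce(grads)
```

Every worker ends up with the same element-wise sum, and no worker ever talks to anyone except its neighbour in the ring.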

Horovod architecture
[Diagram: Ring AllReduce; components: TensorFlow, Baidu Ring-AllReduce, NVIDIA NCCL2, Open MPI]

Federated Learning
https://ai.googleblog.com/2017/04/federated-learning-collaborative.html

Federated Learning
https://arxiv.org/abs/1902.01046 - Towards Federated Learning at Scale: System Design

Secure Aggregation
https://eprint.iacr.org/2017/281.pdf
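The core idea behind secure aggregation can be illustrated with pairwise masks that cancel in the sum, so the server learns the total without seeing any individual update. This is a toy sketch only: `pairwise_masks` is a hypothetical stand-in for the key agreement between users, and dropout handling is left out entirely.

```python
import random

MOD = 2 ** 32  # all arithmetic is done modulo a fixed value

def pairwise_masks(n_users, seed=0):
    """Hypothetical stand-in for key agreement: one shared random mask per pair."""
    rng = random.Random(seed)
    return {(i, j): rng.randrange(MOD)
            for i in range(n_users) for j in range(i + 1, n_users)}

def mask_update(i, value, n_users, masks):
    # User i adds the mask shared with every higher-numbered user and
    # subtracts the mask shared with every lower-numbered user.
    masked = value
    for j in range(n_users):
        if i < j:
            masked += masks[(i, j)]
        elif j < i:
            masked -= masks[(j, i)]
    return masked % MOD

updates = [5, 11, 7, 2]  # private per-user values (toy integers)
masks = pairwise_masks(len(updates))
masked = [mask_update(i, u, len(updates), masks) for i, u in enumerate(updates)]
total = sum(masked) % MOD  # the server only ever sums masked values
```

Each mask is added once and subtracted once across the group, so `total` equals the sum of the private values.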

Federated Learning Architecture
[Diagram: TensorFlow; Actor Programming (Message Passing); FL Server; Secure Aggregation]

What is AllReduce
[Diagram: Downpour with a parameter server (PS). Four workers send ∆𝒘𝟏, ∆𝒘𝟐, ∆𝒘𝟑, ∆𝒘𝟒 to the PS and each receives ∆𝒘]

What is AllReduce
[Diagram: AllReduce. Four workers exchange 𝜹𝟏, 𝜹𝟐, 𝜹𝟑, 𝜹𝟒 directly with one another]
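One way to see the difference between the two diagrams is to count bytes per training step. The sketch below compares the traffic through a single synchronous parameter server with the traffic each worker sends in ring AllReduce. The 100 MiB gradient size and the function names are assumptions, and latency and multi-server setups are ignored.

```python
def parameter_server_traffic(n_workers, model_bytes):
    """Bytes in and out of a single parameter server per synchronous step."""
    received = n_workers * model_bytes  # every worker pushes a full gradient
    sent = n_workers * model_bytes      # every worker receives the full update
    return received + sent

def ring_allreduce_traffic(n_workers, model_bytes):
    """Bytes sent by each worker per step in ring AllReduce."""
    chunk = model_bytes / n_workers
    # n - 1 scatter-reduce steps plus n - 1 allgather steps, one chunk each.
    return 2 * (n_workers - 1) * chunk

model_bytes = 100 * 1024 * 1024  # hypothetical 100 MiB of gradients
for n in (2, 4, 8, 16):
    ps = parameter_server_traffic(n, model_bytes)
    ring = ring_allreduce_traffic(n, model_bytes)
    print(f"{n:2d} workers: PS link {ps / 2**20:7.0f} MiB, "
          f"ring per worker {ring / 2**20:6.1f} MiB")
```

The server's traffic grows linearly with the number of workers, while the per-worker traffic in the ring stays below twice the gradient size however many workers join.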

AllReduce Strategy
https://preferredresearch.jp/2018/07/10/technologies-behind-distributed-deep-learning-allreduce/

The world scale
TPU v2: 180 TFLOPS

The world scale
TPU v3 Pod: 100 petaFLOPS

TPU v3 architecture (H/W)
https://cloud.google.com/tpu/docs/system-architecture

TPU v3 architecture (S/W)
https://cloud.google.com/tpu/docs/system-architecture

TPU v3 architecture
https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning?hl=ko

TPU Pods Overview
https://arxiv.org/pdf/1811.06992.pdf
2-D AllReduce
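The general idea of a 2-D AllReduce is to reduce along one dimension of a device grid and then along the other. The sketch below shows only that two-phase structure for one value per node; it is a simplification and not the exact algorithm from the linked paper. Python's built-in `sum` stands in for each 1-D AllReduce, and the grid size is an assumption.

```python
def allreduce_2d(grid):
    """Sum one value per node over a rows x cols grid in two 1-D phases."""
    rows, cols = len(grid), len(grid[0])
    # Phase 1: AllReduce along each row, so every node holds its row's sum.
    row_sums = [sum(row) for row in grid]
    after_rows = [[row_sums[r]] * cols for r in range(rows)]
    # Phase 2: AllReduce along each column of the phase-1 results.
    col_sums = [sum(after_rows[r][c] for r in range(rows)) for c in range(cols)]
    # Every node in column c now holds col_sums[c], which is the global sum.
    return [[col_sums[c] for c in range(cols)] for _ in range(rows)]

grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12]]
result = allreduce_2d(grid)
```

With a rows x cols grid, each phase involves only rows or cols participants, so no single ring has to span every device.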

Summary
● The TPU interconnect design provides high-speed communication between units
● TPU v3 and TPU Pods basically follow the AllReduce approach (1-D ring AllReduce, 2-D AllReduce)
● TPU Pods are not available yet (alpha as of 2019-06-30)
