Introduction to PyTorch
RTSS, Jun Young Park
Objective
 Understanding AutoGrad
 Review
 Logistic Classifier
 Loss Function
 Backpropagation
 Chain Rule
 Example : Find gradient from a matrix
 AutoGrad
 Solve the example with AutoGrad
 Data Parallelism in PyTorch
 Why should we use GPUs?
 Inside CUDA
 How to parallelize our models
 Experiment
Simple but powerful implementation of backpropagation
Understanding AutoGrad
Logistic Classifier (Fully-Connected)
𝑊𝑋 + b = y
Example: the logits y = (2.0, 1.0, 0.1) for classes A, B, C are mapped by S(y) to the probabilities p = (0.7, 0.2, 0.1).
X : Input
W, b : To be trained
y : Prediction (logits)
S(y) : Softmax function (can be other activation functions)
S(y)_i = e^{y_i} / Σ_j e^{y_j} represents the probabilities of the elements in vector y.
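As a quick check, the softmax mapping above can be reproduced with a short PyTorch sketch (the logits 2.0, 1.0, 0.1 are the ones from the slide; the probabilities round to 0.7, 0.2, 0.1):

```python
import torch

# Softmax of the slide's example logits for classes A, B, C.
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=0)   # e^{y_i} / sum_j e^{y_j}
print(probs)                           # ~[0.66, 0.24, 0.10] -> rounded to 0.7, 0.2, 0.1 on the slide
```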
One-Hot Encoding
The classifier's probabilities S(y) = (0.7, 0.2, 0.1) for classes A, B, C are compared against the one-hot encoded label L = (1, 0, 0) for instance A.
The predicted label is the class with the maximum probability; the distance between the two vectors is the loss.
Find W, b that minimize the loss (error).
Loss Function
 The probability vector can be very large when there are a lot of classes.
 How can we find the distance between the vector S (prediction) and L (label)? Use cross-entropy:
D(S, L) = − Σ_i L_i log(S_i)
Example: S(y) = (0.7, 0.2, 0.1), L = (1.0, 0.0, 0.0)
※ D(S, L) ≠ D(L, S)
No need to worry about taking log(0): since S(y)_i = e^{y_i} / Σ_j e^{y_j}, every softmax output is strictly positive.
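A minimal sketch of this distance computation, using the example vectors above (the names S, L, D are mine):

```python
import torch

S = torch.tensor([0.7, 0.2, 0.1])   # predicted probabilities S(y)
L = torch.tensor([1.0, 0.0, 0.0])   # one-hot label for class A

# Cross-entropy D(S, L) = -sum_i L_i * log(S_i); asymmetric, so D(S, L) != D(L, S).
D = -(L * torch.log(S)).sum()
print(D)                            # ~0.357: a small distance means a good prediction
```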
In-depth of the Classifier
Let there be the following equations:
1. Affine Sum: σ(x) = Wx + B
2. Activation Function: y(σ) = ReLU(σ)
3. Loss Function: E(y) = (1/2)(y_target − y)^2
4. Gradient Descent: w ← w − α ∂E/∂w,  b ← b − α ∂E/∂b
• Gradient descent requires ∂E/∂w and ∂E/∂b.
• How can we find them? -> Use the chain rule!
(y_target : training data, y : prediction result)
Chain Rule
• Let y(x) be defined as below, where x influences g(x) and g(x) influences f(g(x)):
y(x) = f(g(x)) = (f ∘ g)(x)
• The derivative of y(x) is
y′(x) = f′(g(x)) g′(x)
• In Leibniz notation:
dy/dx = (dy/df)(df/dg)(dg/dx) = 1 ∗ f′(g(x)) ∗ g′(x)
Chain Rule
Applying the chain rule to the equations:
∂E/∂w = (∂E/∂y)(∂y/∂σ)(∂σ/∂w) = x(y − y_target) if σ > 0, and 0 if σ ≤ 0
where
∂E/∂y = y − y_target,  ∂y/∂σ = 1 if σ > 0 and 0 if σ ≤ 0,  ∂σ/∂w = x
Recall the equations:
1. Affine Sum: σ(x) = Wx + B
2. Activation Function: y(σ) = ReLU(σ)
3. Loss Function: E(y) = (1/2)(y_target − y)^2
4. Gradient Descent: w ← w − α ∂E/∂w,  b ← b − α ∂E/∂b
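The hand-derived gradient can be checked against autograd; the sketch below uses made-up scalar values (x = 2, y_target = 1, w = 0.5, b = 0.1 are illustrative, not from the slides):

```python
import torch

x, y_target = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

sigma = w * x + b                    # 1. affine sum
y = torch.relu(sigma)                # 2. activation
E = 0.5 * (y_target - y) ** 2        # 3. loss
E.backward()                         # autograd applies the chain rule

# Hand-derived gradient: dE/dw = x * (y - y_target) when sigma > 0, else 0.
manual = (x * (y - y_target)).item() if sigma > 0 else 0.0
print(w.grad.item(), manual)         # both ~0.2 with these numbers
```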
Example : Finding the gradient of 𝑋
 Let the input tensor 𝑋 be initialized with the following square matrix of order 3:
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
 And let 𝑌, 𝑍 be defined as follows:
Y = X + 3
Z = 6Y^2 = 6(X + 3)^2
 And the output δ is the average of tensor 𝑍:
δ = mean(Z) = (1/9) Σ_i Σ_j Z_ij
Example : Finding the gradient of 𝑋
 Since every operation is element-wise, each scalar element can be written from its definition:
Z_ij = 6(Y_ij)^2,  Y_ij = X_ij + 3
 To find the gradient, we use the chain rule so that we can combine the partial gradients:
∂δ/∂Z_ij = 1/9,  ∂Z_ij/∂Y_ij = 12 Y_ij,  ∂Y_ij/∂X_ij = 1
∂δ/∂X_ij = (∂δ/∂Z_ij)(∂Z_ij/∂Y_ij)(∂Y_ij/∂X_ij) = (1/9) ∗ 12 Y_ij ∗ 1 = (4/3)(X_ij + 3)
Example : Finding the gradient of 𝑋
 Thus, we can get the gradient of the (1,1) element of 𝑋:
∂δ/∂X_ij |_(i,j)=(1,1) = (4/3)(X_ij + 3) = (4/3)(1 + 3) = 16/3
 In the same way, we can get the whole gradient matrix of 𝑋:
∂δ/∂X = [[∂δ/∂X_11, ∂δ/∂X_12, ∂δ/∂X_13],
         [∂δ/∂X_21, ∂δ/∂X_22, ∂δ/∂X_23],
         [∂δ/∂X_31, ∂δ/∂X_32, ∂δ/∂X_33]]
      = [[16/3, 20/3, 24/3],
         [28/3, 32/3, 36/3],
         [40/3, 44/3, 48/3]]
AutoGrad : Finding the gradient of 𝑋
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
Y = X + 3
Z = 6Y^2 = 6(X + 3)^2
δ = mean(Z) = (1/9) Σ_i Σ_j Z_ij
∂δ/∂X = [[16/3, 20/3, 24/3],
         [28/3, 32/3, 36/3],
         [40/3, 44/3, 48/3]]
Each operation has its gradient function.
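A minimal sketch of this example with AutoGrad (the original slide showed the code as a screenshot; the variable names here are mine):

```python
import torch

X = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]], requires_grad=True)

Y = X + 3
Z = 6 * Y ** 2
delta = Z.mean()      # delta = (1/9) * sum_ij Z_ij

delta.backward()      # each operation's gradient function is applied in reverse
print(X.grad)         # (4/3) * (X + 3) = [[16/3, 20/3, 24/3], [28/3, ...], ...]
```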
Back Propagation
 Get derivatives using 'back propagation': starting from the output signal L, each node multiplies the incoming gradient ∂L/∂z by its own local derivative and passes the product backward.
Addition node: z = x + y, so ∂z/∂x = ∂z/∂y = 1.
The backward pass sends (∂L/∂z)(∂z/∂x) = ∂L/∂z and (∂L/∂z)(∂z/∂y) = ∂L/∂z, i.e. the incoming gradient passes through unchanged.
Multiplication node: z = xy, so ∂z/∂x = y and ∂z/∂y = x.
The backward pass sends (∂L/∂z)(∂z/∂x) = (∂L/∂z) ∙ y and (∂L/∂z)(∂z/∂y) = (∂L/∂z) ∙ x, i.e. the incoming gradient is multiplied by the value of the other input.
Back Propagation
 How about the exponentiation function?
Power node: z = x^n, so ∂z/∂x = n x^(n−1) and ∂z/∂n = x^n ln x.
From the output signal L, the backward pass sends (∂L/∂z)(∂z/∂x) = (∂L/∂z)(n x^(n−1)) and (∂L/∂z)(∂z/∂n) = (∂L/∂z)(x^n ln x).
Derivation of ∂z/∂n: from z = x^n, ln z = n ln x, so (1/z) dz = ln x dn and dz/dn = z ln x = x^n ln x.
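The local derivatives of the addition, multiplication, and power nodes can be verified with autograd; the values x = 2, y = 3, n = 4 below are arbitrary:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
n = torch.tensor(4.0, requires_grad=True)

z = x + y                     # addition node: dz/dx = dz/dy = 1
z.backward()
print(x.grad, y.grad)         # 1, 1

x.grad = y.grad = None        # clear accumulated gradients
z = x * y                     # multiplication node: dz/dx = y, dz/dy = x
z.backward()
print(x.grad, y.grad)         # 3, 2

x.grad = None
z = x ** n                    # power node: dz/dx = n*x^(n-1), dz/dn = x^n * ln(x)
z.backward()
print(x.grad, n.grad)         # 32, 16*ln(2) ~= 11.09
```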
Appendix : Operation Graph of 𝛿 (Matrix)
Each element flows through the same chain: X_ij → (+3) → Y_ij → (^2) → (×6) → Z_ij; all Z_ij (Z_11 … Z_33) are then summed and multiplied by 1/9 to produce δ.
Z_ij = 6(Y_ij)^2
δ = mean(Z)
Appendix : Operation Graph of 𝛿 (Scalar) - Backpropagation
Written per element, the graph is X_ij → (+3) → Y_ij → (^2) → α_ij → (×6) → Z_ij → (sum) → β_ij (= Z_sum) → (×1/9) → δ, so the chain rule can be expanded node by node:
∂δ/∂X_ij = (∂δ/∂Z_ij)(∂Z_ij/∂Y_ij)(∂Y_ij/∂X_ij)
         = (∂δ/∂β_ij)(∂β_ij/∂Z_ij)(∂Z_ij/∂α_ij)(∂α_ij/∂Y_ij)(∂Y_ij/∂X_ij)
         = (1/9) ∗ 1 ∗ 6 ∗ 2Y_ij ∗ 1 = (4/3)(X_ij + 3)
Step by step, starting from the output with ∂δ/∂δ = 1:
∂δ/∂β_ij = 1/9
∂δ/∂Z_ij = (∂δ/∂β_ij)(∂β_ij/∂Z_ij) = (1/9) ∗ 1
∂δ/∂α_ij = (∂δ/∂β_ij)(∂β_ij/∂Z_ij)(∂Z_ij/∂α_ij) = (1/9) ∗ 1 ∗ 6
∂δ/∂Y_ij = (∂δ/∂β_ij)(∂β_ij/∂Z_ij)(∂Z_ij/∂α_ij)(∂α_ij/∂Y_ij) = (1/9) ∗ 1 ∗ 6 ∗ 2Y_ij
∂δ/∂X_ij = (∂δ/∂β_ij)(∂β_ij/∂Z_ij)(∂Z_ij/∂α_ij)(∂α_ij/∂Y_ij)(∂Y_ij/∂X_ij) = (4/3)(X_ij + 3)
Comparison
Raw implementation vs. AutoGrad
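The original slide showed the two implementations side by side as screenshots; a minimal sketch of the same comparison, reusing the example matrix (names are mine):

```python
import torch

X = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]], requires_grad=True)

# Raw: apply the hand-derived formula d(delta)/dX = (4/3) * (X + 3).
raw_grad = (4.0 / 3.0) * (X.detach() + 3)

# AutoGrad: build the graph and call backward().
delta = (6 * (X + 3) ** 2).mean()
delta.backward()

print(torch.allclose(raw_grad, X.grad))   # True: both give the same gradient matrix
```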
Data Parallelism
in PyTorch
Why GPU? (CUDA)
CPU: a handful of cores at a high clock speed (3.6 GHz) - good for a few huge tasks.
GPU: thousands of cores (3584 CUDA cores in this example) at a lower clock speed (1.6 GHz, 2.0 GHz overclocked) - good for an enormous number of small tasks.
Dataflow Diagram
The GPU works as a co-processor: hello.cu is compiled by NVCC, device memory is allocated with cudaMalloc(), data is moved with cudaMemcpy(), and the __global__ kernel (sum()) runs on the GPU.
1. Memcpy : copy the host buffers (h_a, h_b) to the device buffers (d_a, d_b)
2. Kernel call (cuBLAS) : run sum on the GPU
3. Memcpy : copy the result (d_out) back to the host (h_out)
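In PyTorch the same memcpy / kernel / memcpy pattern is hidden behind Tensor.to(); a hedged sketch of the equivalent flow (no custom kernel, just an element-wise sum):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

h_a = torch.randn(1024)      # host (CPU) tensors, like h_a / h_b in the diagram
h_b = torch.randn(1024)

d_a = h_a.to(device)         # 1. Memcpy: host -> device (cudaMemcpy under the hood)
d_b = h_b.to(device)
d_out = d_a + d_b            # 2. Kernel call: the sum runs on the GPU
h_out = d_out.cpu()          # 3. Memcpy: device -> host
```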
CUDA on a Multi-GPU System
Quad SLI : 14,336 CUDA cores, 48GB of VRAM
How can we use multiple GPUs in PyTorch?
Problem
- Low Utilization
Only a single GPU is allocated; the remaining GPUs sit at zero utilization while holding redundant memory.
Problem
- Duration & Memory Allocation
 A large batch size causes a lack of memory.
 An out-of-memory error from PyTorch -> the Python kernel dies.
 Can't set a large batch size.
 Can only afford batch_size = 5, num_workers = 2.
 Can't divide up the work with the other GPUs.
 Elapsed time : 25m 44s (10 epochs)
 Reached 99% accuracy in 9 epochs (on the training set)
 It takes too much time.
Data Parallelism in PyTorch
 Implemented using torch.nn.DataParallel()
 Can be used to wrap a module or model.
 Also supports the underlying primitives (torch.nn.parallel.*):
 Replicate : replicate the model on multiple devices (GPUs).
 Scatter : distribute the input in the first dimension.
 Gather : gather and concatenate the input in the first dimension.
 Parallel-apply : apply a set of already-distributed inputs to a set of already-distributed models.
 PyTorch Tutorials – Multi-GPU examples
 https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
Easy to Use : nn.DataParallel(model)
- Practical Example
1. Define the model.
2. Wrap the model with nn.DataParallel().
3. Access layers through the 'module' attribute (see the sketch below).
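A minimal sketch of those three steps (the model architecture here is a placeholder, not the network used in the experiment):

```python
import torch
import torch.nn as nn

# 1. Define the model (placeholder architecture).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# 2. Wrap the model with nn.DataParallel(); inputs are scattered across all visible GPUs,
#    the model is replicated, and the outputs are gathered back on the default device.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

# 3. Access the wrapped layers through the 'module' attribute.
if isinstance(model, nn.DataParallel):
    first_layer = model.module[0]
```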
After Parallelism
- GPU Utilization
 Hyperparameters
 Batch size : 128
 Number of workers : 16
 High utilization.
 Can use a large memory space.
 All GPUs are allocated.
After Parallelism
- Training Performance
 Hyperparameters
 Batch size : 128
 A large batch size needs more memory space.
 Number of workers : 16
 Recommended to be set to 4 * NUM_GPUs (from the forum).
 Elapsed time : 7m 50s (10 epochs)
 Reached 99% accuracy in 4 epochs (on the training set) - it took just 3m 10s.
Q & A
Editor's Notes
  • #3: To understand AutoGrad, the automatic differentiation feature provided by PyTorch, we first review the basics of deep learning and take a closer look at backpropagation. We then compare a raw implementation of backpropagation with AutoGrad to understand the difference. We also look at why GPUs are used and how CUDA computation proceeds, see how to use the data-parallelism method provided by PyTorch, and compare the performance of multiple GPUs against a single GPU.
  • #4: A module that provides an easy implementation of backpropagation.
  • #5: The basic form of a logistic classifier is a first-order linear function (WX + b = y). X is the input, and W, b are the weights and bias (training means finding appropriate weights and biases). y is the prediction result -> this result (the logits) is converted into probabilities with the softmax function. Why? Because logits can become very large, we convert them into simple values between 0 and 1 and classify by the highest probability. Two classes? Logistic classification. Several classes? Softmax/multinomial classification.
  • #6: How do we express a class as numbers? Set the entry of the corresponding class (the class with the highest probability) to a true value in a vector. E.g. class A? -> [1 0 0 0 0 …]: only the index corresponding to class A is 1, the rest are 0.
  • #7: The distance between the answer and the prediction: cross-entropy. The softmax output will not be 0; mind the order of the arguments. A small value (a short distance) means a correct decision. Since the entries of S(y) sum to 1 and each one is greater than 0, the log(0) problem does not occur.
  • #10: By the chain rule, the derivative of the loss function E with respect to w is as follows; that is, how much E changes when w changes equals the product of the changes through the composed functions. We decompose it so that y affects E, σ affects y, and w affects σ, and compute each derivative. Since ReLU is a non-linear function, we differentiate it piecewise.
  • #11: Define the operations as above…
  • #12: Computing directly on the matrix is cumbersome, so we use a scalar expression for a single element. Computing the partial derivatives gives the results above, and expressing them as a composite function gives the final formula.
  • #13: Substituting the (1,1) element of X, which is 1, gives the result above. Likewise, putting the other elements back into the original matrix representation gives the full gradient matrix.
  • #14: A gradient function is, in the end, the backpropagation of the most basic computation node.
  • #15: Now that composite functions are clear, let's move on to backpropagation. How much did x and y affect the value of z? In other words, how does z change when x and y change? Backpropagation: multiply the incoming signal by the node's local derivative and pass it to the next node (in reverse). The backward pass of an addition node propagates the incoming signal unchanged. The backward pass of a multiplication node propagates the incoming signal multiplied by the value of the opposite input.
  • #16: The power node and its forward and backward passes look like this. Again, the question is how x and n affect z; computing this gives the results above.
  • #17: The computation graph for the matrix looks like this. Each element is computed, and then the mean is obtained from their sum and the number of elements.
  • #18: The matrix representation is hard to follow, so we write each element as a scalar. By the backpropagation principle covered earlier, the gradient is derived as above.