Intro_DL_05
Section 1
Central Processing Unit (CPU)
The CPU is the heart of every computer. It handles the core processing tasks: the literal computation that drives every action in a computer system.
Computers work through the processing of binary
data, or ones and zeroes.
To translate that information into the software, graphics, animations, and
every other process executed on a computer, those ones and zeroes must
work through the logical structure of the CPU.
That includes basic arithmetic, logical functions (AND, OR, NOT), and input and output operations. The CPU is the brain, taking in information, computing on it, and moving it where it needs to go.
Within every CPU, there are a few standard components, which include the following: Core, Cache, Memory Management Unit (MMU), CPU Clock, and Control Unit.
Components of CPU: Core
A core, or CPU core, is the "brain" of a CPU. It receives instructions and performs calculations, or operations, to satisfy those instructions. A CPU can have multiple cores.
A core typically functions through what is called the “instruction cycle,”
where instructions are pulled from memory (fetch), decoded into processing
language (decode), and executed through the logical gates of the core
(execute).
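As a rough sketch of that cycle, here is a toy interpreter in Python; the instruction set, registers, and memory layout are invented for illustration, and real cores implement all of this in hardware:

```python
# Toy illustration of the fetch-decode-execute cycle.
memory = [("LOAD", 5), ("ADD", 3), ("STORE", 0), ("HALT", None)]
registers = {"acc": 0, "pc": 0}
data = [0]

while True:
    op, arg = memory[registers["pc"]]   # fetch: read the next instruction
    registers["pc"] += 1
    if op == "LOAD":                    # decode + execute
        registers["acc"] = arg
    elif op == "ADD":
        registers["acc"] += arg
    elif op == "STORE":
        data[arg] = registers["acc"]
    elif op == "HALT":
        break

print(data)  # [8]
```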
Initially, all CPUs were single-core, but with the
proliferation of multi-core CPUs, we’ve seen an
increase in processing power.
As of 2019, the majority of consumer CPUs feature
between two and twelve cores. Workstation and
server CPUs may feature as many as 48 cores.
Cache
Cache is a small amount of memory built into the CPU itself, sitting closer to the processing cores than RAM. It temporarily holds instructions and data that the CPU is likely to reuse.
Graphics Processing Unit (GPU)
One of these tasks, graphics processing, is generally considered among the most complex processing tasks for a CPU. Solving that complexity has led to technology with applications far beyond graphics.
The challenge in processing graphics is that graphics call on complex
mathematics to render, and those complex mathematics must compute in
parallel to work correctly.
For example:
A graphically intense video game might contain
hundreds or thousands of polygons on the screen
at any given time, each with its individual
movement, color, lighting, and so on.
CPUs aren’t made to handle that kind of workload.
That's where graphics processing units (GPUs) come into play.
Graphics Processing Unit (GPU)
GPUs are similar in function to CPUs: they contain cores, memory, and other components.
Instead of emphasizing context switching to manage multiple tasks, GPU
acceleration emphasizes parallel data processing through a large number of
cores.
These cores are usually individually less powerful than a CPU core. GPUs also typically have less interoperability with different hardware APIs and house less memory than a CPU.
Where they shine is pushing large amounts of processed data in parallel.
Instead of switching through multiple tasks to process graphics, a GPU
simply takes batch instructions and pushes them out at high volume to
speed processing and display.
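A minimal sketch of that batch-oriented style, assuming PyTorch is available (the sizes and calls below are purely illustrative):

```python
import torch

# One large batch of work is submitted at once, rather than many small
# sequential tasks: the access pattern GPUs are built for.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

c = a @ b                      # thousands of multiply-adds run in parallel
if device == "cuda":
    torch.cuda.synchronize()   # wait for the GPU to finish the whole batch
print(c.shape, device)
```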
TPU: Tensor Processing Unit
TPU stands for Tensor Processing Unit; it is a dedicated architecture for deep learning and machine learning applications.
Invented by Google, TPUs are application-specific integrated circuits (ASIC)
designed specifically to handle the computational demands of machine
learning and accelerate AI calculations and algorithms.
Google began using TPUs internally in 2015, and in 2018 they made them
publicly available to others.
When Google designed the TPU, they created a domain-specific architecture.
What that means is that instead of designing a general purpose processor
like a GPU or CPU, Google designed it as a matrix processor that was
specialized for neural network workloads.
By designing the TPU as a matrix processor instead of a general purpose
processor, Google solved the memory access problem that slows down GPUs
and CPUs and requires them to use more processing power.
How Do TPUs Work?
Inside a TPU, a large matrix unit built as a systolic array streams data from one multiply-accumulate cell directly into the next, so intermediate results do not have to be written back to memory between steps; this is how the memory-access bottleneck described above is avoided.
Understanding Matrix Multiplication with TPU
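As a hedged sketch of what dispatching such a matrix multiplication to a Cloud TPU looks like with TensorFlow 2.x (it assumes an attached TPU runtime, such as a TPU VM or a Colab TPU, and will fail without one):

```python
import tensorflow as tf

# Sketch only: this requires an attached Cloud TPU runtime.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

@tf.function
def matmul_step(a, b):
    # A dense matrix multiply: the operation the TPU's matrix unit accelerates.
    return tf.matmul(a, b)

a = tf.random.normal([1024, 1024])
b = tf.random.normal([1024, 1024])
c = strategy.run(matmul_step, args=(a, b))   # executed on the TPU replicas
```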
TPU: pros and cons
TPUs are extremely valuable and bring a lot to the table. Their only real
downside is that they are more expensive than GPUs and CPUs.
Their list of pros far outweighs their high price tag.
TPUs are a great choice for those who want to:
Accelerate machine learning applications
Scale applications quickly
Cost effectively manage machine learning workloads
Start with well-optimized, open source reference models
Agenda
1 HARDWARE FOR DEEP LEARNING (CPU, GPU AND TPU)
CPU, GPU, TPU comparison
2 DEEP LEARNING FRAMEWORKS
3 ACCELERATOR AND COMPRESSION TOOLS
Advantages of CPU architecture
Flexibility: CPUs are flexible and resilient and can handle a variety of tasks
outside of graphics processing. Because of their serial processing
capabilities, the CPU can multitask across multiple activities in your
computer. Because of this, a strong CPU can provide more speed for typical
computer use than a GPU.
Contextual Power: In specific situations, the CPU will outperform the GPU.
For example, the CPU is significantly faster when handling several different
types of system operations (random access memory, mid-range
computational operations, managing an operating system, I/O operations).
Precision: CPUs can work on mid-range mathematical equations with a
higher level of precision. CPUs can handle the computational depth and
complexity more readily, becoming increasingly crucial for specific
applications.
Advantages of CPU architecture (cont.)
Advantages of GPU
GPUs offer two main advantages: massive parallel computing capacity and high data throughput. These two advantages were the main reasons GPUs were created, because both contribute to complex graphics processing. However, the GPU's structure quickly led developers and engineers to apply GPU technology to other high-performance applications:
High-performance Applications of GPU
Bitcoin Mining: Mining bitcoins involves using computational power to solve complex cryptographic hashes. The expansion of Bitcoin and the rising difficulty of mining have led miners to deploy GPUs to handle immense volumes of cryptographic data in the hope of earning bitcoins.
Machine Learning: Neural networks, particularly those used for
deep-learning algorithms, function through the ability to process large
amounts of training data through smaller nodes of operations. GPUs for
machine learning have emerged to help process the enormous data sets
used to train machine-learning algorithms and AI.
Analytics and Data Science: GPUs are uniquely suited to help analytics programs process large amounts of base data from different sources. Furthermore, these same GPUs can power the computation necessary for deep data sets associated with research areas like life sciences (genomic sequencing).
Disadvantages of GPU
Multitasking: GPUs aren’t built for multitasking, so they don’t have much
impact in areas like general-purpose computing.
Cost: While the price of GPUs has fallen somewhat over the years, they are
still significantly more expensive than CPUs. This cost rises more when
talking about a GPU built for specific tasks like mining or analytics.
Power and Complexity: While GPUs can handle large amounts of parallel computing and data throughput, they struggle when the processing requirements become more chaotic. Branching logic paths, sequential operations, and other approaches to computing impede the effectiveness of a GPU; the short sketch below illustrates the contrast.
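A small, CPU-only NumPy illustration of the contrast: the same ReLU computation written once as divergent, element-by-element branching and once as a single uniform, branch-free array operation, which is the shape of work parallel hardware favors:

```python
import numpy as np

x = np.random.randn(1_000_000)

# Element-by-element branching: fine on a CPU core, but exactly the kind
# of divergent control flow that GPUs handle poorly.
relu_loop = np.array([v if v > 0 else 0.0 for v in x])

# The same computation as one uniform, branch-free array operation.
relu_vec = np.maximum(x, 0.0)

assert np.allclose(relu_loop, relu_vec)
```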
Hardware comparison
When To Use CPU, GPU, Or TPU
To Run Your Machine Learning Models?
CPUs are general-purpose processors, while GPUs and TPUs are specialized accelerators for machine learning workloads; a short environment-detection sketch follows the lists below.
CPUs:
Quick prototyping that requires maximum flexibility
Simple models that do not take long to train
Small models with small effective batch sizes
Models dominated by custom TensorFlow operations written in C++
Models limited by the available I/O or networking bandwidth of the host system
When To Use CPU, GPU, Or TPU
To Run Your Machine Learning Models?
GPUs:
Models whose source code does not exist or is too onerous to change
Models with numerous custom TensorFlow operations that the GPU must support
Models with TensorFlow ops that are not available on Cloud TPU
Medium-to-large models with larger effective batch sizes
TPUs:
Training models using mostly matrix computations
Training models without custom TensorFlow operations inside the main
training loop
Training models that require weeks or months to complete
Training huge models with very large effective batch sizes
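A small sketch of checking which of these devices TensorFlow can actually reach before committing to one; the detection calls are standard TensorFlow 2.x, but the preference order is an assumption:

```python
import tensorflow as tf

def pick_strategy():
    # Prefer a TPU if one is attached, then local GPUs, then the CPU.
    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    except (ValueError, tf.errors.NotFoundError):
        pass
    if tf.config.list_physical_devices("GPU"):
        return tf.distribute.MirroredStrategy()   # one or more local GPUs
    return tf.distribute.get_strategy()           # default (CPU) strategy

strategy = pick_strategy()
print("replicas in sync:", strategy.num_replicas_in_sync)
```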
Section 2
https://github.com/397090770/spark-ai-summit-europe-2018-10/blob/master/ppt/a-tale-of-three-deep-learning-frameworks-tensorflow-keras-pytorch.pdf
What is Keras?
Keras is a high-level neural network library that runs on top of TensorFlow. It provides a user-friendly, Python-native API for quickly building, training, and evaluating models, with support for multiple back ends.
What is PyTorch?
PyTorch is an open-source deep learning framework with a "pythonic", object-oriented API. Researchers favor it for its flexibility, debugging capabilities, and short training duration; it runs on Linux, macOS, and Windows.
What is TensorFlow?
TensorFlow is an end-to-end open-source deep learning framework
developed by Google and released in 2015. It is known for documentation
and training support, scalable production and deployment options, multiple
abstraction levels, and support for different platforms, such as Android.
TensorFlow is a symbolic math library used for neural networks and is best
suited for dataflow programming across a range of tasks. It offers multiple
abstraction levels for building and training models.
A promising and fast-growing entry in the world of deep learning,
TensorFlow offers a flexible, comprehensive ecosystem of community
resources, libraries, and tools that facilitate building and deploying machine
learning apps. Also, as mentioned before, TensorFlow has adopted Keras,
which makes comparing the two seem problematic. Nevertheless, we will
still compare the two frameworks for the sake of completeness, especially
since Keras users don’t necessarily have to use TensorFlow.
PyTorch vs TensorFlow
Both TensorFlow and PyTorch offer useful abstractions that ease the
development of models by reducing boilerplate code. They differ because
PyTorch has a more “pythonic” approach and is object-oriented, while
TensorFlow offers a variety of options.
PyTorch is used for many deep learning projects today, and its popularity is
increasing among AI researchers, although of the three main frameworks, it
is the least popular. Trends show that this may change soon.
When researchers want flexibility, debugging capabilities, and short training
duration, they choose PyTorch. It runs on Linux, macOS, and Windows.
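A minimal PyTorch training step, showing the imperative, "pythonic" style referred to above; the model and data are placeholders:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(64, 10), torch.randn(64, 1)   # dummy batch

optimizer.zero_grad()
loss = loss_fn(model(x), y)   # ordinary Python control flow; easy to step
loss.backward()               # through with a debugger or print() calls
optimizer.step()
print(loss.item())
```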
Thanks to its well-documented framework and abundance of trained models
and tutorials, TensorFlow is the favorite tool of many industry professionals
and researchers. TensorFlow offers better visualization, which allows
developers to debug better and track the training process. PyTorch, however,
provides only limited visualization.
TensorFlow also beats PyTorch in deploying trained models to production,
thanks to the TensorFlow Serving framework. PyTorch offers no such
framework, so developers need to use Django or Flask as a back-end server.
In the area of data parallelism, PyTorch gains optimal performance by
relying on native support for asynchronous execution through Python.
However, with TensorFlow, you must manually code and optimize every
operation run on a specific device to allow distributed training. In summary,
you can replicate everything from PyTorch in TensorFlow; you just need to
work harder at it.
If you’re just starting to explore deep learning, you should learn PyTorch first
due to its popularity in the research community. However, if you’re familiar
with machine learning and deep learning and focused on getting a job in the industry as soon as possible, learn TensorFlow first.
PyTorch vs Keras
Both of these choices are good if you’re just starting to work with deep
learning frameworks. Mathematicians and experienced researchers will find
PyTorch more to their liking. Keras is better suited for developers who want
a plug-and-play framework that lets them build, train, and evaluate their
models quickly. Keras also offers more deployment options and easier
model export.
However, remember that PyTorch is faster than Keras and has better
debugging capabilities.
Both platforms enjoy sufficient levels of popularity that they offer plenty of
learning resources. Keras has excellent access to reusable code and
tutorials, while PyTorch has outstanding community support and active
development.
Keras is the best when working with small datasets, rapid prototyping, and
multiple back-end support. It's the most popular framework thanks to its comparative simplicity. It runs on Linux, macOS, and Windows.
TensorFlow vs Keras
TensorFlow is an open-source end-to-end platform, a library for multiple
machine learning tasks, while Keras is a high-level neural network library
that runs on top of TensorFlow. Both provide high-level APIs used for easily
building and training models, but Keras is more user-friendly because it is built in Python.
Researchers turn to TensorFlow when working with large datasets and object detection, where they need excellent functionality and high performance. TensorFlow runs on Linux, macOS, Windows, and Android. The framework was developed by Google Brain and is currently used for Google's research and production needs.
The reader should bear in mind that comparing TensorFlow and Keras isn’t
the best way to approach the question since Keras functions as a wrapper to
TensorFlow’s framework. Thus, you can define a model with Keras’ interface,
which is easier to use, then drop down into TensorFlow when you need to use a feature that Keras doesn't have or when you want finer, lower-level control.
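A small sketch of that workflow, with the model defined through the Keras interface and a custom training step dropping down to TensorFlow's GradientTape; the model and data are placeholders:

```python
import tensorflow as tf

# Model defined with the friendlier Keras interface...
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()

x = tf.random.normal([64, 10])
y = tf.random.normal([64, 1])

# ...while the training step drops down to raw TensorFlow when finer
# control is needed than model.fit() provides.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print(float(loss))
```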
Which is Better: PyTorch, TensorFlow, or Keras?
There is no single winner: PyTorch suits researchers who need flexibility and debugging, TensorFlow suits industry work and production deployment, and Keras suits beginners and rapid prototyping on smaller datasets.
Section 3
General presentation of techniques
Deep Compression: exemplary pipeline
The Deep Compression pipeline runs in three stages: network pruning, quantization with weight sharing, and Huffman coding, each applied to the output of the previous stage to shrink the model.
Network Pruning
The detailed pruning procedure is described in the paper. The pruned structure is stored in compressed sparse column (CSC) or compressed sparse row (CSR) format. To compress further, the index differences, rather than absolute indices, are stored, with a zero-padding entry inserted whenever a difference exceeds what the index width can represent.
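A small sketch of magnitude pruning followed by CSR storage, using NumPy and SciPy; the matrix and threshold are illustrative, and the relative-index encoding with zero padding is not shown:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)).astype(np.float32)   # a dense weight matrix

# Magnitude pruning: zero out connections below a threshold.
threshold = 0.8
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0)

# Store only the surviving weights in compressed sparse row form.
w_csr = csr_matrix(w_pruned)
print("kept", w_csr.nnz, "of", w.size, "weights")
print(w_csr.data[:5], w_csr.indices[:5], w_csr.indptr[:3])
```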
Quantization and Weight Sharing
Weight sharing makes multiple connections share the same weight, and those shared weights are then fine-tuned to save storage (the figure in the Deep Compression paper illustrates the idea). First the weights are quantized into k bins, so only a small index needs to be stored for each connection. During the update, the gradients of all weights in the same bin are summed together, multiplied by the learning rate, and subtracted from the shared centroids of the last iteration. The compression rate can be calculated as

$$ r = \frac{nb}{n \log_2(k) + kb}, $$

where $n$ is the number of connections, $b$ the number of bits used to store one original weight, and $k$ the number of clusters.
k-means clustering is used to identify the shared weights for each layer of a trained network; weights are not shared across layers. The $n$ weights $W = \{w_1, \dots, w_n\}$ of a layer are partitioned into $k$ clusters $C = \{c_1, c_2, \dots, c_k\}$, with $n \gg k$, by minimizing the within-cluster sum of squares (WCSS):

$$ \arg\min_{C} \sum_{i=1}^{k} \sum_{w \in c_i} |w - c_i|^2 $$
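A small sketch of that per-layer weight sharing using scikit-learn's k-means; the layer size and the choice of k = 16 (4-bit indices) are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)    # one layer's weights

k = 16                                              # number of shared weights (4-bit indices)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(w.reshape(-1, 1))

indices = km.labels_.reshape(w.shape)               # small index stored per connection
centroids = km.cluster_centers_.ravel()             # k shared weights for this layer

w_shared = centroids[indices]                       # reconstructed (quantized) layer
print("unique weights:", np.unique(w_shared).size)  # at most k
```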
Huffman Coding
The probability distributions of the quantized weights and of the sparse matrix indices are biased. Huffman coding these non-uniformly distributed values saves 20% to 30% of network storage.
Example of Huffman Coding
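A minimal Python sketch of building a Huffman code over a biased distribution of quantized indices; the symbol counts are made up for illustration:

```python
import heapq
from collections import Counter

# Quantized weight indices with a deliberately skewed distribution.
symbols = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10

def huffman_code(freqs):
    # Build the Huffman tree with a min-heap keyed on frequency.
    heap = [[n, i, {s: ""}] for i, (s, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        lo[2] = {s: "0" + c for s, c in lo[2].items()}
        hi[2] = {s: "1" + c for s, c in hi[2].items()}
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], {**lo[2], **hi[2]}])
    return heap[0][2]

codes = huffman_code(Counter(symbols))
print(codes)  # frequent symbols get the shortest codes
```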
The End
Thank You!