
DEEP LEARNING

Introduction to Deep Learning and Deep Network Learning Issues

Adrian Horzyk
AGH University of Science and Technology, Krakow, Poland
[email protected]
Tasks for Deep Neural Networks

We use Deep Neural Networks for a specific group of tasks:

• Classification (of images, signals, etc.)
• Prediction (e.g. price, temperature, size, distance)
• Recognition (of speech, objects, etc.)
• Translation (from one language to another)
• Autonomous behaviors (driving of autonomous cars, flying of drones, …)
• Clustering of objects (grouping them according to their similarity)
• etc.

using supervised or unsupervised training of such networks.

We have to deal with structured and unstructured data:

Structured data are usually well described by attributes and collected in data tables
(relational databases), while unstructured data are images, (audio, speech) signals,
and (sequences of) texts (corpora).
Binary Classification
In binary classification, the result is described by two values:
• 1 – when the object of the class was recognized (e.g. the image is a cat),
• 0 – when the object was not recognized as belonging to the given class (e.g. the image is not a cat).

Example: an image is classified either as "is a cat" (1) or as "is not a cat" (0).

Image Representation
Training Examples
Logistic Regression
Computing Sigmoid Function

We use numpy vectorization to compute sigmoid and sigmoid_derivative for any input vector z:
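The slide's original listing is not preserved in this text version; a minimal numpy sketch of the two functions, using the names sigmoid and sigmoid_derivative from the sentence above, could look like this:

import numpy as np

def sigmoid(z):
    # Works element-wise for scalars, vectors, and matrices.
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # d(sigma)/dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-1.0, 0.0, 1.0])
print(sigmoid(z))             # [0.26894142 0.5        0.73105858]
print(sigmoid_derivative(z))  # [0.19661193 0.25       0.19661193]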
Logistic Regression Cost Function
Loss Functions
Loss functions are used to evaluate the performance of models. The bigger your loss is,
the more your predictions (ŷ) differ from the true values (y). In deep learning, we use
optimization algorithms like Gradient Descent to train models and minimize the cost.
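The formulas from these two slides are not preserved in this extraction; the standard cross-entropy loss and cost used with logistic regression are:

L(ŷ, y) = −( y·log(ŷ) + (1 − y)·log(1 − ŷ) )
J(w, b) = (1/m) · Σ_{i=1..m} L(ŷ^{(i)}, y^{(i)})

where the loss L is computed for a single example, and the cost J averages the loss over all m training examples.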
Gradient Descent
We have to minimize the cost function J for a given training data set to achieve
predictions for the input data that are as correct as possible.

Here, w is shown as one-dimensional, but in real models it has many more dimensions.


Calculus of the Gradient Descent

The main idea of the Gradient Descent algorithm is to move in the direction opposite to the gradient (down the slope):
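The update rule itself was shown graphically on the slide; in the notation used here (α is the learning rate), the standard step is:

w := w − α · ∂J(w, b)/∂w
b := b − α · ∂J(w, b)/∂b

repeated until the cost J(w, b) stops decreasing (or a fixed number of iterations is reached).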
Derivative Rules

The Gradient Descent algorithm uses partial derivatives calculated according to the following rules:
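The rules themselves were shown as a figure; the standard ones needed here are the sum, product, and chain rules:

(f + g)'(z) = f'(z) + g'(z)
(f · g)'(z) = f'(z)·g(z) + f(z)·g'(z)
(f(g(z)))' = f'(g(z)) · g'(z)

The chain rule is the key one: it lets the gradient of the cost be propagated backwards through the composed operations of the network.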
Gradient Descent for Logistic Regression

We use a computational graph to present the forward and backward operations of a single
neuron implementing logistic regression on the weighted sum of its inputs x:
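The graph itself is not preserved in this text version; for a single example it encodes the standard forward and backward formulas:

Forward:  z = wᵀx + b,   a = ŷ = σ(z),   L(a, y) = −( y·log(a) + (1 − y)·log(1 − a) )
Backward: dz = ∂L/∂z = a − y,   dw_j = x_j·dz,   db = dz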
Gradient Descent for Training Dataset

The final logistic regression gradient descent algorithm repeatedly goes through all
training examples, updating the parameters, until the cost function is small enough
(a loop-based sketch of one such pass is given below).
To speed up computation, we should use vectorization instead of for-loops.
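The slide's pseudocode is not preserved here; a minimal loop-based sketch of one pass, assuming m training examples stored column-wise in X (shape (n, m)) and labels in Y (shape (1, m)), with the made-up helper name epoch_with_loops:

import numpy as np

def epoch_with_loops(w, b, X, Y, alpha):
    # w: (n, 1), b: scalar, alpha: learning rate
    n, m = X.shape
    dw = np.zeros((n, 1)); db = 0.0; J = 0.0
    for i in range(m):                          # loop over training examples
        z = np.dot(w[:, 0], X[:, i]) + b        # forward pass: weighted sum for example i
        a = 1 / (1 + np.exp(-z))
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]                        # backward pass
        dw += X[:, i].reshape(n, 1) * dz
        db += dz
    J /= m; dw /= m; db /= m
    w = w - alpha * dw                          # gradient descent update
    b = b - alpha * db
    return w, b, J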
Efficiency of Vectorization
When dealing with big data collections and big data vectors, we definitely should use
vectorization (which performs SIMD operations) to make computations faster.

Compare the time efficiency of the two approaches (a timing sketch is given below)!

Conclusion:
Whenever possible, avoid explicit for-loops and use vectorization: np.dot(w.T,x), np.dot(W,x), np.multiply(x1,x2),
np.outer(x1,x2), np.log(v), np.exp(v), np.abs(v), np.zeros(v), np.sum(v), np.max(v), np.min(v), etc.
Vectorization uses parallel CPU or GPU operations (called SIMD – single instruction, multiple data)
executed on cores working in parallel.
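The original timing code is not included in this text version; a minimal comparison of a Python loop against np.dot on two long random vectors could be timed like this:

import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.process_time()          # explicit for-loop version
s = 0.0
for i in range(n):
    s += a[i] * b[i]
t_loop = time.process_time() - t0

t0 = time.process_time()          # vectorized (SIMD) version
s_vec = np.dot(a, b)
t_vec = time.process_time() - t0

print(f"loop: {1000 * t_loop:.1f} ms, vectorized: {1000 * t_vec:.1f} ms")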
Vectorization of the Logistic Regression
Let's vectorize the previous algorithm (note that the scalar b is broadcast when added to the vector of weighted sums):
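A sketch of the fully vectorized pass over the whole training set, consistent with the loop sketch above (the helper name epoch_vectorized is again made up for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def epoch_vectorized(w, b, X, Y, alpha):
    # X: (n, m), Y: (1, m), w: (n, 1), b: scalar
    m = X.shape[1]
    Z = np.dot(w.T, X) + b            # (1, m); the scalar b is broadcast over all examples
    A = sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                        # (1, m)
    dw = np.dot(X, dZ.T) / m          # (n, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J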
Broadcasting in Python
Broadcasting in numpy
Broadcasting is very useful for performing mathematical operations between arrays of
different shapes. The example below shows a simple broadcast operation; the normalization
example on the next slide also relies on broadcasting.
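A minimal illustration of broadcasting (the matrix A and the vector v are arbitrary example values):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
v = np.array([10.0, 20.0, 30.0])     # shape (3,)

print(A + v)   # v is broadcast across both rows: [[11. 22. 33.], [14. 25. 36.]]
print(A * 2)   # the scalar 2 is broadcast to every element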
Normalization for Efficiency
We use normalization (np.linalg.norm) to achieve better performance, because gradient
descent converges faster after normalization (a sketch is given below):
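A sketch of row-wise normalization that relies on broadcasting (normalize_rows is a hypothetical helper and the matrix x is an arbitrary example):

import numpy as np

def normalize_rows(x):
    # Compute the L2 norm of each row; keepdims=True keeps the shape (rows, 1)
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / x_norm                 # (rows, cols) / (rows, 1) is broadcast row-wise

x = np.array([[0.0, 3.0, 4.0],
              [2.0, 6.0, 4.0]])
print(normalize_rows(x))              # every row now has unit length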
Lists vs. Vectors and Matrices

Be careful when creating vectors, because Python lists have no shape and are declared similarly to numpy arrays.
Column and Row Vectors

Be careful when creating vectors, because lists have no shape, and column and row vectors differ only in their shape (see the sketch below).
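A sketch illustrating both warnings (the values are arbitrary):

import numpy as np

lst = [1, 2, 3]                   # a Python list: no shape, no element-wise math
a   = np.array([1, 2, 3])         # rank-1 array, shape (3,)  -- avoid in deep learning code
row = np.array([[1, 2, 3]])       # row vector, shape (1, 3)
col = np.array([[1], [2], [3]])   # column vector, shape (3, 1)

print(a.shape, row.shape, col.shape)   # (3,) (1, 3) (3, 1)
print(np.dot(row, col))                # (1, 1) matrix: [[14]]
print(np.dot(col, row).shape)          # (3, 3) outer product -- shapes matter!
# lst.shape would raise AttributeError: a list has no shape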
Reshaping Image Matrices

When working with images in deep learning, we typically reshape them into a vector
representation using np.reshape():
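A minimal sketch, assuming a 64 x 64 RGB image (the size is only an example):

import numpy as np

image = np.random.rand(64, 64, 3)          # height x width x RGB channels
x = image.reshape(64 * 64 * 3, 1)          # flatten into a (12288, 1) column vector
print(image.shape, "->", x.shape)          # (64, 64, 3) -> (12288, 1)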
Shape and Reshape Vectors and Matrices
We commonly use the numpy functions np.shape() and np.reshape() in deep learning:
• X.shape is used to get the shape (dimension) of a vector or a matrix X.
• X.reshape(...) is used to reshape a vector or a matrix X into some other dimension(s).
Simple Neuron

We defined the fundamental elements and operations on a single neuron.
Simple Neural Network

Having defined the fundamental elements and operations, we can create a simple neural network.
Stacking Neurons Vertically and Vectorizing

Stacking values into vectors, and stacking vectors into matrices, is very important from the point of view of computational efficiency!
Stacking Examples Horizontally and Vectorizing

Stacking the vectors of training examples horizontally into matrices is very important from the point of view of computational efficiency!

After Vectorizing
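After vectorizing, the forward pass for a whole layer and all m examples at once becomes a single matrix expression; a minimal sketch for a two-layer network (the layer sizes below are just example values):

import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

n_x, n_1, n_2, m = 3, 4, 1, 5            # example layer sizes and number of examples
X  = np.random.randn(n_x, m)             # training examples stacked horizontally as columns
W1 = np.random.randn(n_1, n_x) * 0.01    # neurons of layer 1 stacked vertically as rows
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

Z1 = np.dot(W1, X) + b1                  # (n_1, m); b1 is broadcast over the m columns
A1 = np.tanh(Z1)
Z2 = np.dot(W2, A1) + b2                 # (n_2, m)
A2 = sigmoid(Z2)                         # predictions for all m examples at once
print(A2.shape)                          # (1, 5)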
Vectorization of Dot Product
In deep learning, you deal with very large datasets. Non-computationally-optimal functions become
a huge bottleneck in your algorithms and can result in models that take ages to run. To make sure that
your code is computationally efficient, you should use vectorization. Compare the following codes:
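The compared listings are not preserved in this extraction; a minimal sketch of the loop-based versus vectorized dot product (x1 and x2 are arbitrary example vectors):

import numpy as np

x1 = np.random.rand(10000)
x2 = np.random.rand(10000)

dot = 0.0                       # classic for-loop implementation
for i in range(len(x1)):
    dot += x1[i] * x2[i]

dot_vec = np.dot(x1, x2)        # vectorized implementation
print(np.isclose(dot, dot_vec)) # True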
Vectorization of Outer Product
In deep learning, you deal with very large datasets. Non-computationally-optimal functions become
a huge bottleneck in your algorithms and can result in models that take ages to run. To make sure that
your code is computationally efficient, you should use vectorization. Compare the following codes:
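A corresponding sketch for the outer product (again with made-up example data):

import numpy as np

x1 = np.random.rand(300)
x2 = np.random.rand(400)

outer = np.zeros((len(x1), len(x2)))     # classic nested-loop implementation
for i in range(len(x1)):
    for j in range(len(x2)):
        outer[i, j] = x1[i] * x2[j]

outer_vec = np.outer(x1, x2)             # vectorized implementation
print(np.allclose(outer, outer_vec))     # True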
Vectorization of Element-Wise Multiplication

In deep learning, you deal with very large datasets. Non-computationally-optimal functions become
a huge bottleneck in your algorithms and can result in models that take ages to run. To make sure that
your code is computationally efficient, you should use vectorization. Compare the following codes:
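A corresponding sketch for element-wise multiplication:

import numpy as np

x1 = np.random.rand(10000)
x2 = np.random.rand(10000)

mul = np.zeros(len(x1))                  # classic for-loop implementation
for i in range(len(x1)):
    mul[i] = x1[i] * x2[i]

mul_vec = np.multiply(x1, x2)            # vectorized implementation (same as x1 * x2)
print(np.allclose(mul, mul_vec))         # True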
Vectorization of General Dot Product
In deep learning, you deal with very large datasets. Non-computationally-optimal functions become
a huge bottleneck in your algorithms and can result in models that take ages to run. To make sure that
your code is computationally efficient, you should use vectorization. Compare the following codes:
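A corresponding sketch for a general (matrix-vector) dot product, with a hypothetical weight matrix W:

import numpy as np

W = np.random.rand(100, 1000)            # an arbitrary example matrix
x = np.random.rand(1000)

gdot = np.zeros(W.shape[0])              # classic nested-loop implementation
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        gdot[i] += W[i, j] * x[j]

gdot_vec = np.dot(W, x)                  # vectorized implementation
print(np.allclose(gdot, gdot_vec))       # True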
Activation Functions of Neurons
We use different activation functions for neurons in different layers:

COMPARISON OF ACTIVATION FUNCTIONS

• Sigmoid function is used in the output layer:
  g(z) = σ(z) = 1 / (1 + e^{-z})

• Tangent hyperbolic function is used in hidden layers:
  g(z) = tanh(z) = (e^{z} − e^{-z}) / (e^{z} + e^{-z})

• Rectified linear unit (ReLU) is used in hidden layers (FAST!):
  g(z) = ReLU(z) = max(0, z)

• Smooth ReLU (SoftPlus) is used in hidden layers:
  g(z) = SoftPlus(z) = log(1 + e^{z})

• Leaky ReLU is used in hidden layers:
  g(z) = LeakyReLU(z) = z if z > 0, 0.01·z if z ≤ 0
Activation Functions
Derivatives of Activation Functions

Derivatives are necessary for the use of gradient descent:

• Sigmoid function:
  g(z) = σ(z) = 1 / (1 + e^{-z})
  g'(z) = dg(z)/dz = g(z)·(1 − g(z)) = a·(1 − a)

• Tangent hyperbolic function:
  g(z) = tanh(z) = (e^{z} − e^{-z}) / (e^{z} + e^{-z})
  g'(z) = dg(z)/dz = 1 − (g(z))^2 = 1 − a^2

• Rectified linear unit (ReLU):
  g(z) = ReLU(z) = max(0, z)
  g'(z) = dg(z)/dz = 1 if z > 0, 0 if z ≤ 0

• Smooth ReLU (SoftPlus):
  g(z) = SoftPlus(z) = ln(1 + e^{z})
  g'(z) = dg(z)/dz = e^{z} / (1 + e^{z}) = 1 / (1 + e^{-z})

• Leaky ReLU:
  g(z) = LeakyReLU(z) = z if z > 0, 0.01·z if z ≤ 0
  g'(z) = dg(z)/dz = 1 if z > 0, 0.01 if z ≤ 0
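A compact numpy sketch of these functions and their derivatives (vectorized, so they work element-wise on whole layers; the names are my own, not from the slides):

import numpy as np

def sigmoid(z):        return 1 / (1 + np.exp(-z))
def sigmoid_prime(z):  a = sigmoid(z); return a * (1 - a)

def tanh(z):           return np.tanh(z)
def tanh_prime(z):     return 1 - np.tanh(z) ** 2

def relu(z):           return np.maximum(0, z)
def relu_prime(z):     return (z > 0).astype(float)

def softplus(z):       return np.log(1 + np.exp(z))
def softplus_prime(z): return 1 / (1 + np.exp(-z))

def leaky_relu(z):       return np.where(z > 0, z, 0.01 * z)
def leaky_relu_prime(z): return np.where(z > 0, 1.0, 0.01)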
Derivatives of Activation Functions
Neural Network Gradients
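The gradient formulas from this slide are not preserved in this extraction; for a two-layer network with a sigmoid output and m examples stacked as columns, the standard vectorized gradients can be sketched as follows (this is not the slide's original code):

import numpy as np

# Assumed forward-pass quantities: X (n_x, m), A1 = g1(Z1), A2 = sigmoid(Z2), labels Y (1, m);
# g1_prime_Z1 holds g1'(Z1), the derivative of the hidden activation evaluated at Z1.
def backward_pass(X, Y, A1, A2, W2, g1_prime_Z1):
    m = X.shape[1]
    dZ2 = A2 - Y                               # (n_2, m)
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * g1_prime_Z1      # chain rule through the hidden activation
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2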
Random Initialization of Weights

The parameters must be initialized with small random numbers:

• W cannot be initialized to 0:
  W[l] = np.random.randn(n[l], n[l-1]) * 0.01
• Small random initial weight values allow for faster training, because the activation
  functions of neurons stimulated by values only a little greater than 0 usually have the
  biggest slopes, so each weight update results in large changes of the output values and
  lets the network move towards the solution faster.

• b can be initialized to 0:
  b[l] = np.zeros((n[l], 1))
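A sketch of such an initialization for all layers, assuming layer_dims = [n_x, n[1], ..., n[L]] (the helper name initialize_parameters is hypothetical):

import numpy as np

def initialize_parameters(layer_dims, seed=1):
    np.random.seed(seed)                 # reproducible example
    params = {}
    L = len(layer_dims) - 1              # number of layers with weights
    for l in range(1, L + 1):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_parameters([12288, 20, 7, 1])
print(params["W1"].shape, params["b3"].shape)   # (20, 12288) (1, 1)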
Going to Deeper NN Architectures

A deep neural network architecture means the use of many hidden layers between the input and output layers.
Dimensions of Stacked Matrices
Building Blocks of Deep Neural Networks
Stacking Building Blocks Subsequently
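The figures for these three slides are not preserved; a minimal sketch of how the forward building block is stacked layer after layer (W[l] has shape (n[l], n[l-1]), b[l] has shape (n[l], 1), g[l] is the layer's activation function; forward_propagation is a hypothetical helper that assumes parameters created as in the initialization sketch above):

import numpy as np

def forward_propagation(X, params, activations):
    # X: (n_x, m); params: dict with W1, b1, ..., WL, bL; activations: list of g[l] functions
    A = X
    caches = []
    L = len(activations)
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        A_next = activations[l - 1](Z)
        caches.append((A, Z))            # cache layer inputs and Z for the backward pass
        A = A_next
    return A, caches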
Parameters and Hyperparameters

We should distinguish between parameters and hyperparameters:

• Parameters of the model are established during the training process, e.g.:
  • W[l], b[l].
• Hyperparameters control how the parameters are learned and are established by the
  developer of the model, e.g.:
  • α – learning rate,
  • L – number of hidden layers,
  • n[l] – number of neurons in the layers,
  • g[l] – choice of activation functions for the layers,
  • number of iterations over the training data,
  • momentum,
  • minibatch size,
  • regularization parameters,
  • optimization parameters,
  • dropout parameters, …
Iterative Development of DL Solutions
Deep Learning solutions are usually developed in an iterative and empirical process
that consists of three main elements:
• Idea – we suppose that a selected model, training method, and some hyperparameters
  will let us solve the problem.
• Code – we implement and apply the idea in real code.
• Experiment – the experiments confirm our suppositions and assumptions or not, and
  allow us to update or change the idea until they return satisfactory results.
Let’s start with powerful computations!
✓ Questions?
✓ Remarks?
✓ Suggestions?
✓ Wishes?
Bibliography and Literature
1. Nikola K. Kasabov, Time-Space, Spiking Neural Networks and Brain-Inspired Artificial Intelligence, Springer Series on Bio- and Neurosystems, Vol. 7, Springer, 2019.
2. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016, ISBN 978-1-59327-741-3, or PWN 2018.
3. Holk Cruse, Neural Networks as Cybernetic Systems, 2nd and revised edition.
4. R. Rojas, Neural Networks, Springer-Verlag, Berlin, 1996.
5. Convolutional Neural Network (Stanford).
6. Visualizing and Understanding Convolutional Networks, Zeiler, Fergus, ECCV 2014.
7. IBM: https://www.ibm.com/developerworks/library/ba-data-becomes-knowledge-1/index.html
8. NVIDIA: https://developer.nvidia.com/discover/convolutional-neural-network
9. JUPYTER: https://jupyter.org/

Adrian Horzyk, [email protected]

