Unit – IV
Deep Neural Networks, Convolutional Neural Networks
Deep Learning
• MLP has the advantage that the first layer (feature
extraction) and the second layer (how those features are
combined to predict the output) are learned together in a
coupled and supervised manner
• MLP with multiple hidden layers can learn more
complicated functions of the input.
• In deep neural networks, starting from the raw input, each hidden layer combines the values in its preceding layer and learns more complicated functions of the input.
Deep Learning Key Ideas
• Hierarchical learning
– Increasing abstraction levels (bottom up or
top down?)
• Structure/Pattern automatically discovered
during training
– Avoids feature engineering
– A reason DL is preferred over SVMs, etc.
Deep Neural Networks
Intro to DNNs
• Deep Feedforward Neural Networks
(depth > 3 hidden layers, maybe)
– No loops; networks with loops are Recurrent Neural Networks
• Loops: the output depends on previous inputs, i.e., the output has a memory functionality
• Goal: approximate a function
– I.e., learn the mapping between input and output
Intro to DNNs
• Defines a mapping y = f(x; θ) and learns the parameters θ
• Can also be represented as a directed acyclic graph (DAG)
– A composition of functions, e.g. f(x) = f^(3)(f^(2)(f^(1)(x)))
• f^(1) is the first layer, f^(2) the second layer, and so on
• The length of this chain is the depth of the network
– The dimension of the hidden layers is the WIDTH of the model
Intro to DNNs
• Transform the input features by a
nonlinear operator Φ
• How to choose Φ?
– Transform to some high dimensional space –
good for reducing training error, but will not
generalize
– Manually design Φ (the old style of feature engineering), then use it with an SVM, etc.
– DL: learn Φ as well
Example: Learning XOR
• Assume a regression problem (minimize squared error)
• Define a linear model: f(x; w, b) = x^T w + b
• Solving gives w = 0, b = 0.5, i.e., the output is constant at 0.5
– Hence, this model cannot fit XOR directly
Example: Learning XOR
• Regression will work, but the features have to be transformed first
• The hidden layer h transforms the inputs; the output layer does regression on the transformed space
• The complete model is now f(x; W, c, w, b) = w^T max{0, W^T x + c} + b
• The first layer is a linear transform followed by a nonlinear activation function
Example…
• The rectified linear activation function and the first-layer weights transform the x-space to the h-space
• Hence, a linear model can be fit in the h-space
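As a concrete check of this claim, here is a minimal NumPy sketch that evaluates the two-layer ReLU network on the four XOR inputs, using the hand-picked weights from Goodfellow et al.'s worked example; these particular weight values are illustrative, not the only solution a trained network could find.

```python
# XOR network from the slides: h = ReLU(W^T x + c), y = w^T h + b,
# with the hand-chosen weights from Goodfellow et al. (illustrative).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the four XOR inputs
W = np.array([[1, 1], [1, 1]])                   # first-layer weights
c = np.array([0, -1])                            # first-layer bias
w = np.array([1, -2])                            # output-layer weights
b = 0.0                                          # output-layer bias

h = np.maximum(0, X @ W + c)   # hidden layer: nonlinear transform to h-space
y = h @ w + b                  # linear regression in h-space

print(y)   # [0. 1. 1. 0.] -- exactly the XOR targets
```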
DNN Concepts
• Gradient based learning – similar to ANN
• Cost functions – to calculate the loss
• Output units – based on type of
classification
• Hidden units – the artificial neurons used in the hidden layers
• Architecture design
• Backpropagation for DNNs
Gradient Based Learning
• The nonlinear transformation makes the loss function non-convex
– No guarantee of reaching the optimum solution
• Initialize weights and biases to small
values
• SGD used for weight update to descend
along the cost function
• Error is measured using a Cost Function
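A minimal sketch of the procedure above: weights initialized to small values, then mini-batch SGD steps against the gradient of a squared-error cost. The toy regression data, batch size, and learning rate are assumptions made for illustration.

```python
# Toy mini-batch SGD: initialize small weights, then repeatedly step
# opposite to the gradient of the (mean squared error) cost.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + 1.0      # toy regression targets (assumed)

w = rng.normal(scale=0.01, size=2)       # small initial weights
b = 0.0
eta = 0.1                                # learning rate (assumed)

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), 10):   # mini-batches of 10
        idx = order[start:start + 10]
        err = X[idx] @ w + b - y[idx]                 # prediction error
        w -= eta * (X[idx].T @ err) / len(idx)        # w <- w - eta * dC/dw
        b -= eta * err.mean()                         # b <- b - eta * dC/db
```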
Cost Function
• Directly affects convergence
• Most popular is the cross-entropy loss: L = −Σ_k y_k log ŷ_k
– For two classes this reduces to Binary CE: L = −[y log ŷ + (1 − y) log(1 − ŷ)]
• Calculate Binary CE of each model.
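A short NumPy sketch of this exercise; the labels and predicted probabilities below are made-up numbers, since the models from the original slide are not reproduced here.

```python
# Binary cross-entropy: -mean[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
import numpy as np

y_true = np.array([1, 0, 1, 1])             # ground-truth labels (assumed)
y_pred = np.array([0.9, 0.2, 0.6, 0.95])    # one model's predicted P(y=1) (assumed)

bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce)   # approx. 0.2227 -- lower is better, so compare this value across models
```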
Output Units
• Complete the task of hidden layers to produce appropriate output
• Output units affect the training procedure
– If the output saturates for some inputs, training will not proceed
• Linear output units – given h, produce w^T h + b
– Linear, so the output will not saturate
• Sigmoid units – useful for binary classification
– Here and in softmax, log likelihood cancels exp and prevents saturation
• Softmax units – multinomial classification
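A minimal sketch of the three output units listed above, assuming NumPy; the log-softmax form shows how taking the log-likelihood cancels the exponential, which is the anti-saturation point made on the slide.

```python
import numpy as np

def linear_output(h, w, b):
    return h @ w + b                 # w^T h + b: linear, so it never saturates

def sigmoid_output(z):
    return 1.0 / (1.0 + np.exp(-z))  # binary classification: P(y = 1 | x)

def log_softmax(z):
    z = z - z.max()                  # stabilize; does not change the result
    return z - np.log(np.exp(z).sum())   # log cancels exp, preventing saturation
```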
Hidden Units
• How to choose the type of hidden unit to use in the
hidden layers of the model
• Rectified Linear Unit (ReLU) is the most popular
– Similar to linear units, except the output is 0 for negative inputs
– No problem of gradient saturation
• Earlier, sigmoid and tanh were popular for hidden layers
– However, their gradients saturate in all regions except near 0
Architecture Design
• Layer
– a 2D array of artificial neurons
– Some other layers like max pooling to reduce dimensionality
– DNNs are sequence of such layers
• How many such layers and properties of each layer
• Choosing a DNN – belief that the pattern can be modeled as a composition of simpler functions (WARNING – NOT ALL PROBLEMS ARE MEANT FOR DNNs)
– Results show that greater depth generalizes better
Architecture Design
• Skip connections – skip layer i+1 and connect layer i directly to layer i+2
– Allows gradients to flow faster from the output towards the input
• How to connect layers?
– E.g., fully connected?
– Convolutional?
– Receptive Field size?
– Stride?
Backpropagation
Backpropagation Algorithm
• Modify weights incrementally until error is
minimum
– Hence an iterative algorithm
• Choose direction in weight space, where change
in error is maximum
• Recalculate error and continue
• Terminate on number of iterations or minimum
error/change in error
• As the error propagates back towards the input side, the algorithm is called backpropagation
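A schematic of the loop described above, with both stopping rules (iteration budget and small error / small change in error). The helper names compute_error and compute_gradient are placeholders standing in for a forward pass plus error backpropagation, not a specific API.

```python
# Schematic of the iterative weight-update loop: step against the gradient,
# recompute the error, and stop on an iteration budget or on a small
# error / small change in error.
def train(weights, compute_error, compute_gradient,
          eta=0.01, max_iters=10000, tol=1e-6):
    prev_err = float("inf")
    for it in range(max_iters):                      # stop 1: iteration budget
        grad = compute_gradient(weights)             # direction of steepest increase
        weights = weights - eta * grad               # move against it
        err = compute_error(weights)
        if err < tol or abs(prev_err - err) < tol:   # stop 2: small error / change
            break
        prev_err = err
    return weights
```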
Backpropagation for Regression
• Consider the problem of nonlinear regression with a single hidden layer: y^t = Σ_h v_h z_h^t + v_0, where z_h^t = sigmoid(w_h^T x^t)
• Only the primary output has a target – the error can be calculated only as r^t − y^t
• Hence, all other values have to be
expressed in terms of this error
Backpropagation for Regression
• The error function to be minimized over all samples: E(W, v | X) = (1/2) Σ_t (r^t − y^t)^2
• Second-layer weight update using least squares: Δv_h = η Σ_t (r^t − y^t) z_h^t
Backpropagation for Regression
• Applying the chain rule for the first-layer weights: Δw_hj = η Σ_t (r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
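A minimal NumPy sketch of these update equations for one sigmoid hidden layer and a single linear output, using batch updates with a fixed learning rate η; the network size, initialization range, and epoch count are assumptions (the hidden-layer bias is omitted for brevity).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_regression(X, r, n_hidden=4, eta=0.05, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.01, 0.01, size=(n_hidden, X.shape[1]))  # hidden weights w_h
    v = rng.uniform(-0.01, 0.01, size=n_hidden)                # output weights v_h
    v0 = 0.0
    for _ in range(epochs):
        z = sigmoid(X @ W.T)          # z_h^t = sigmoid(w_h^T x^t)
        y = z @ v + v0                # y^t = sum_h v_h z_h^t + v_0
        err = r - y                   # r^t - y^t
        dv = eta * (err @ z)          # Delta v_h  = eta * sum_t (r^t - y^t) z_h^t
        dv0 = eta * err.sum()         # Delta v_0  = eta * sum_t (r^t - y^t)
        # Delta w_hj = eta * sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
        dW = eta * ((err[:, None] * v * z * (1 - z)).T @ X)
        W, v, v0 = W + dW, v + dv, v0 + dv0
    return W, v, v0
```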
Convolutional Networks
CNNs
• Convolutional Neural Networks or CNNs
– Convolution – linear operation
• CNNs: networks that use at least one
convolution operation in at least one layer
• Single most popular DNN variant
• Pooling – used with convolutional layers
• Variants and intuition
Convolution Operation
• Operation on two functions, usually denoted as x*w
• CNN Terminology:
– X: input
– W: Kernel
– Output: Feature map
• Input and Kernel can have more than one dimension
• For discrete inputs, replace the integral with a summation: s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a)
Convolution Operation
• Input and Kernel data types are referred to as
tensors
• Consider a 2D image input I, and a 2D kernel K
• The convolution operation is given as: S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
• (Q) What are i, j and m, n?
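A minimal NumPy sketch of this sum. Note that it computes the cross-correlation form, S(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n), which is what most CNN libraries actually implement; flipping the kernel gives the convolution written above. Here i, j index positions in the output feature map and m, n index positions inside the kernel.

```python
import numpy as np

def conv2d(I, K):
    H, W = I.shape
    m, n = K.shape
    out = np.zeros((H - m + 1, W - n + 1))      # "valid" output size
    for i in range(out.shape[0]):               # i, j: output feature map
        for j in range(out.shape[1]):           # m, n: kernel positions
            out[i, j] = np.sum(I[i:i + m, j:j + n] * K)
    return out

I = np.arange(16, dtype=float).reshape(4, 4)    # toy 4x4 input
K = np.array([[1.0, 0.0], [0.0, -1.0]])         # toy 2x2 kernel
print(conv2d(I, K))                             # 3x3 feature map (every entry is -5.0)
```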
2D Convolution Example
CNN
• Sparsity
• Pooling
• Parameter Sharing
• Stride
• Zero padding
• Local connections
• Tiled Convolution
• Typical CNN
Motivation 1: Sparsity
• Sparsity w.r.t. connections
– In a fully connected ANN, the "kernel" is the same size as the input
– A convolutional kernel is much smaller in size
– Why does it work?
• The features to be detected are similar in size to the kernel
• Domain knowledge about the data
• O(W × H) parameters for a fully connected layer versus O(m × n) for a sparse kernel
– Note that W, H >> m, n
Sparse Connection from Below
Sparse Connection from Above
Motivation 2: Parameter Sharing
• There is a SET of kernels
• This set is applied to all positions in the
input
• Hence, a fixed set of kernels is learnt and shared across all regions of the input
– I.e., each region is tested for the same set of features
• This also reduces the number of parameters to be learnt
Parameter Sharing
• (Q) For a sample input and kernel set,
compare the reduction in number of
parameters due to sparse Kernels and
parameter sharing in CNNs.
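A worked version of this question under assumed sizes: a 32×32 single-channel input, a 32×32 output map, and a 3×3 kernel (the sizes are chosen purely for illustration).

```python
# Parameter counts: fully connected vs. sparse (local) vs. sparse + shared.
in_units = 32 * 32      # assumed 32x32 single-channel input
out_units = 32 * 32     # assumed 32x32 output map
k = 3 * 3               # assumed 3x3 kernel

fully_connected = in_units * out_units   # every output sees every input
local_only      = out_units * k          # sparse (local) connections, no sharing
shared_kernel   = k                      # sparse + parameter sharing: one kernel

print(fully_connected, local_only, shared_kernel)   # 1048576, 9216, 9
```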
Equivariance
• Convolution operation follows the equivariance
rule of functions
– f(g(x)) = g(f(x))
• A shift in the location of a feature shifts the output in the same way, so the feature is still detected
– E.g., in time series data, features will be detected
even if shifted in time
– Image – translated across image is still detected
• Note that changes in scale or rotation are not equivariant transforms
Pooling
• Create a statistic of neighbours to replace
a group by a single value
– E.g., L1/L2 norm, the average of the items
– The L-infinity norm (commonly called max pooling)
• Makes the output invariant to small local translations in the input
• Pooling over different convolutions
– Network will learn which transformations to
become invariant to
• Pooling reduces the number of parameters
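A minimal NumPy sketch of 2×2 max pooling with stride 2 (i.e., pooling with downsampling), replacing each 2×2 neighbourhood by its maximum.

```python
import numpy as np

def max_pool2x2(x):
    # Group the array into 2x2 blocks and take the max of each block.
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 3, 2],
              [2, 2, 4, 1]], dtype=float)
print(max_pool2x2(x))   # [[4. 5.]
                        #  [2. 4.]]
```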
Max Pooling – Spatial Invariance
Max Pooling – Learned Invariance
Pooling with Downsampling
Typical CNN Structure
Sample CNNs
Variants of Basic CNNs
• Input is not always 2D
– E.g., colour images have three 2D channels
– Hence, input is 3D Tensor
– Accounting for mini-batch, it is 4D Tensor
• Skip some positions of the Kernel –
STRIDE
– Reduced computations, at the cost of missed
features
– Strides can be different in each dimension
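Stride and zero padding together determine the size of the feature map; a small sketch, assuming the standard formula out = floor((in + 2·pad − kernel) / stride) + 1.

```python
# Output-size arithmetic for strided convolution with zero padding.
def conv_output_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_output_size(32, 3, stride=1, pad=0))   # 30: "valid" convolution
print(conv_output_size(32, 3, stride=1, pad=1))   # 32: "same" output size
print(conv_output_size(32, 3, stride=2, pad=1))   # 16: stride skips positions
```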
Stride
Zero Padding
Local Connections
• Connections are local, but weights are not
shared
• Every connection weight is different
• Can be used when a feature does not appear in all regions of the input
Tiled Convolutions
• Weights are different locally
– I.e., like local connections
• But are reused in different parts of the
input
– I.e., the set of kernels is cycled (rotated) as we move across the image
• Reduces number of parameters to be
learnt/stored