Unit – IV

Deep Neural Networks, Convolutional Neural Networks


Deep Learning
• An MLP has the advantage that the first layer (feature
extraction) and the second layer (how those features are
combined to predict the output) are learned together, in a
coupled and supervised manner.
• An MLP with multiple hidden layers can learn more
complicated functions of the input.
• In a deep neural network, starting from the raw input,
each hidden layer combines the values of its preceding
layer and learns progressively more complicated functions of the input.

2
Deep Learning Key Ideas
• Hierarchical learning
– Increasing abstraction levels (bottom up or
top down?)
• Structure/Pattern automatically discovered
during training
– Avoids feature engineering
– A key reason deep networks are preferred over SVMs and similar methods

3
Deep Neural Networks

4
Intro to DNNs
• Deep Feedforward Neural Networks
(roughly, depth > 3 hidden layers)
– No loops; networks with loops are Recurrent Neural
Networks
• Loops: the output depends on previous inputs, i.e.,
the network has a memory
• Goal: approximate a function
– I.e., learn the mapping between input and
output

5
Intro to DNNs
• Defines a mapping y = f(x; θ) and learns the parameters θ
• Can also be viewed as a directed acyclic graph
(DAG)
– A composition of functions, e.g. f(x) = f3(f2(f1(x)))
• f1 is the first layer, f2 the second layer, and so on
(see the sketch after this slide)
• The length of this chain is the DEPTH of the network
– The dimension of the hidden layers is the WIDTH of the
model

6
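A minimal sketch, assuming NumPy, of a feedforward network as a composition of layer functions; the layer widths and random weights here are made-up placeholders, not values from the slides:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer(x, W, b, activation=relu):
    # One layer: affine transform followed by a nonlinearity
    return activation(W @ x + b)

rng = np.random.default_rng(0)
widths = [4, 8, 8, 1]   # input dim, two hidden widths, output dim (assumed)
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]

def f(x):
    # Composition f3(f2(f1(x))): the number of layers is the depth,
    # the layer sizes are the widths
    h = x
    for i, (W, b) in enumerate(params):
        act = relu if i < len(params) - 1 else (lambda z: z)  # linear output layer
        h = layer(h, W, b, act)
    return h

print(f(rng.standard_normal(4)))
```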
Intro to DNNs
• Transform the input features by a
nonlinear operator Φ
• How to choose Φ?
– Use a generic transform to some high-dimensional space –
good for reducing training error, but will not
generalize well
– Manually design Φ (the old style of feature
engineering), then use it with an SVM, etc.
– Deep learning: learn Φ as well

7
Example: Learning XOR
• Assume a regression problem (squared-error loss)
• Define a linear model f(x; w, b) = xT w + b
• Solving gives w = 0, b = 0.5, i.e., the output is
constant at 0.5 for every input
– Hence, a linear model cannot fit XOR directly

8
Example: Learning XOR
• Regression will work, but the features have to be
transformed first
• A hidden layer h transforms the inputs; the output
layer does regression in the transformed space
• The complete model is now f(x; W, c, w, b) = wT h + b,
with h produced by the hidden layer
• The first layer is usually a linear transform WT x + c
followed by a nonlinear activation function g,
i.e. h = g(WT x + c)

9
Example…

10
Example…

• The rectified linear activation function and the learned weights transform the x-space into the h-
space, where the XOR points become linearly separable
• Hence, a linear model can be fit in the h-space
(see the numeric sketch below)

11
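A minimal NumPy sketch of this XOR solution, using one well-known set of hand-chosen weights; the specific values are an illustrative assumption, not the only solution a trained network would find:

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
r = np.array([0, 1, 1, 0])

# Hand-chosen parameters (one known solution, assumed for illustration)
W = np.array([[1, 1],
              [1, 1]])        # first-layer weights
c = np.array([0, -1])         # first-layer bias
w = np.array([1, -2])         # output-layer weights
b = 0                         # output-layer bias

H = np.maximum(0, X @ W + c)  # hidden layer: ReLU(W^T x + c)
y = H @ w + b                 # linear output layer on the transformed space

print(y)                      # [0 1 1 0] -- matches XOR exactly
```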
DNN Concepts
• Gradient based learning – similar to ANN
• Cost functions – to calculate the loss
• Output units – chosen based on the type of
classification task
• Hidden units – the types of units used in the
hidden layers
• Architecture design
• Backpropagation for DNNs
12
Gradient Based Learning
• The nonlinear transformations make the loss
function non-convex
– No guarantee of reaching the global optimum
• Initialize weights and biases to small random
values
• SGD is used to update the weights and descend
along the cost surface (see the sketch below)
• The error is measured using a cost function

13
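A minimal sketch of the SGD update rule on a toy cost; the learning rate, toy cost, and its gradient are illustrative assumptions:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Move the parameters a small step against the gradient of the cost
    return theta - lr * grad

# Toy example: minimize C(theta) = ||theta||^2, whose gradient is 2*theta
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, grad=2 * theta, lr=0.1)
print(theta)  # close to [0, 0], the minimum of the toy cost
```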
Cost Function
• Directly affects convergence
• The most popular is the cross-entropy loss
– For a binary target r and predicted probability y, the binary cross-entropy (BCE) is
–[r log y + (1 – r) log(1 – y)], averaged over the samples

14
• Calculate the binary CE of each model (a helper sketch follows below).

15
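A small NumPy helper for computing binary cross-entropy, which could be used for the exercise above; the sample labels and predictions are made-up values:

```python
import numpy as np

def binary_cross_entropy(r, y, eps=1e-12):
    # r: true labels in {0, 1}; y: predicted probabilities in (0, 1)
    y = np.clip(y, eps, 1 - eps)          # avoid log(0)
    return -np.mean(r * np.log(y) + (1 - r) * np.log(1 - y))

r = np.array([1, 0, 1, 1])                # made-up labels
y = np.array([0.9, 0.2, 0.7, 0.6])        # made-up model predictions
print(binary_cross_entropy(r, y))         # lower is better
```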
Output Units
• Complete the work of the hidden layers to produce the appropriate output
• Output units affect the training procedure
– If the output saturates for some inputs, training will not proceed
• Linear output units – given h, produce wTh + b
– Linear, so the output will not saturate
• Sigmoid units – useful for binary classification
– Here and in softmax, the log-likelihood cancels the exp and prevents saturation
• Softmax units – multinomial (multi-class) classification (see the sketch below)

16
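A minimal sketch of the three output-unit types applied to a made-up hidden vector h; the weights are random placeholders, and the softmax subtracts the max purely for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(5)                # made-up hidden-layer output

# Linear output unit: w^T h + b
w, b = rng.standard_normal(5), 0.1
linear_out = w @ h + b

# Sigmoid unit for binary classification
sigmoid_out = 1.0 / (1.0 + np.exp(-linear_out))

# Softmax unit for multinomial classification (3 classes here)
W, c = rng.standard_normal((3, 5)), np.zeros(3)
z = W @ h + c
z = z - z.max()                           # stabilizes exp without changing the result
softmax_out = np.exp(z) / np.exp(z).sum()

print(linear_out, sigmoid_out, softmax_out)
```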
Hidden Units
• How to choose the type of hidden unit to use in the
hidden layers of the model
• The Rectified Linear Unit (ReLU), g(z) = max(0, z), is the most popular
– Similar to a linear unit, except the output is 0 for negative inputs
– No gradient saturation for positive inputs
• Earlier, sigmoid and tanh were popular for hidden layers
– However, their gradients saturate everywhere except near 0
(see the sketch below)

17
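A small sketch comparing the activations and their gradients, illustrating why ReLU avoids the saturation that affects sigmoid and tanh away from 0; the sample pre-activation values are made up:

```python
import numpy as np

z = np.array([-5.0, -1.0, 0.5, 5.0])       # made-up pre-activation values

relu = np.maximum(0, z)
relu_grad = (z > 0).astype(float)          # 1 for positive inputs: no saturation there

sigmoid = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = sigmoid * (1 - sigmoid)     # near 0 for large |z| -> saturation

tanh_grad = 1 - np.tanh(z) ** 2            # also near 0 for large |z|

print(relu_grad)     # [0. 0. 1. 1.]
print(sigmoid_grad)  # tiny at z = -5 and z = 5
print(tanh_grad)     # tiny at z = -5 and z = 5
```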
Architecture Design
• Layer
– a 2D array of artificial neurons
– Some other layers like max pooling to reduce dimensionality
– DNNs are a sequence of such layers

• How many such layers to use, and the properties of each layer

• Choosing a DNN reflects the belief that the pattern can be modeled as a
composition of simpler functions (WARNING – NOT ALL
PROBLEMS ARE MEANT FOR DNNs)
– Results show that greater depth tends to generalize better

18
Architecture Design
• Skip connections – connect layer i directly to layer i+2, skipping a layer
– Allows gradients to flow faster from the output towards the input
(see the sketch below)
• How to connect layers?
– E.g., fully connected?
– Convolutional?
– Receptive Field size?
– Stride?

19
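A minimal NumPy sketch of a skip connection, assuming two hidden layers of equal width so the skipped activation can be added directly; the sizes and random weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)

x = rng.standard_normal(8)
W1, W2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))

h1 = relu(W1 @ x)
h2 = relu(W2 @ h1) + h1   # skip connection: h1 feeds layer i+2 directly
print(h2.shape)
```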
Architecture Design

20
Backpropagation
Backpropagation Algorithm

22
Backpropagation Algorithm
• Modify the weights incrementally until the error is
minimized
– Hence an iterative algorithm
• At each step, move in the direction in weight space along
which the error decreases fastest (steepest descent)
• Recalculate the error and continue
• Terminate after a fixed number of iterations, or when the
error (or its change) falls below a minimum
• Because the error propagates back towards the input side,
the algorithm is called backpropagation
23
Backpropagation for Regression
• Consider the problem of nonlinear regression with one
hidden layer: yt = Σh vh zht + v0, where zht is the
output of hidden unit h for input xt
• Only the network output has a target, so the error can
be calculated directly only as rt – yt
• Hence, the updates of all other weights have to be
expressed in terms of this error
24
Backpropagation for Regression
• The error function to be minimized over all
samples is the sum of squared errors:
E = ½ Σt (rt – yt)²
• The second-layer weight update follows from least
squares: Δvh = η Σt (rt – yt) zht

25
Backpropagation for Regression
• Applying the chain rule gives the first-layer update (assuming
sigmoid hidden units): Δwhj = η Σt (rt – yt) vh zht (1 – zht) xjt
(a full worked sketch follows below)

26
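A compact NumPy sketch of backpropagation for this regression setup, assuming one sigmoid hidden layer and the squared-error loss above; the network size, learning rate, and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy 1-D regression data: r = sin(x) plus noise (made up for illustration)
X = rng.uniform(-3, 3, size=(200, 1))
r = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

H, eta = 10, 0.01                         # hidden units, learning rate
W = rng.standard_normal((1, H)) * 0.1     # first-layer weights
w0 = np.zeros(H)                          # first-layer biases
v = rng.standard_normal(H) * 0.1          # second-layer weights
v0 = 0.0                                  # second-layer bias

for epoch in range(2000):
    z = sigmoid(X @ W + w0)               # hidden-unit outputs z_h^t
    y = z @ v + v0                        # network output y^t
    err = r - y                           # r^t - y^t

    # Second-layer (least-squares style) updates
    v  += eta * z.T @ err / len(X)
    v0 += eta * err.mean()

    # First-layer updates via the chain rule
    delta = (err[:, None] * v) * z * (1 - z)   # back-propagated error
    W  += eta * X.T @ delta / len(X)
    w0 += eta * delta.mean(axis=0)

print(np.mean(err ** 2))                  # mean squared error after training
```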
Convolutional Networks
CNNs
• Convolutional Neural Networks or CNNs
– Convolution – linear operation
• CNNs: networks that use at least one
convolution operation in at least one layer
• The single most popular DNN variant
• Pooling – used together with convolutional layers
• Variants and intuition

28
Convolution Operation

• An operation on two functions, usually denoted x*w:
s(t) = (x*w)(t) = ∫ x(a) w(t – a) da
• CNN terminology:
– x: input
– w: kernel
– output: feature map
• The input and kernel can have more than one dimension
• For discrete data, the integral is replaced by a summation:
s(t) = Σa x(a) w(t – a)

29
Convolution Operation
• Input and Kernel data types are referred to as
tensors
• Consider a 2D image input I, and a 2D kernel K
• The convolution operation is given as:
S(i, j) = (I*K)(i, j) = Σm Σn I(m, n) K(i – m, j – n)
• (Q) What are i, j and m, n?

30
2D Convolution Example

31
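A minimal NumPy sketch of 2D convolution following the formula above (with the kernel flipped, which is what distinguishes true convolution from the cross-correlation most deep learning libraries implement); the "valid" output size and the example arrays are assumptions for illustration:

```python
import numpy as np

def conv2d(I, K):
    # 'Valid' 2D convolution: S(i, j) = sum_m sum_n I(m, n) K(i - m, j - n)
    kh, kw = K.shape
    Kf = K[::-1, ::-1]                     # flip the kernel (true convolution)
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return S

I = np.arange(16, dtype=float).reshape(4, 4)   # made-up 4x4 "image"
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])                    # made-up 2x2 kernel
print(conv2d(I, K))                            # 3x3 feature map
```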
CNN
• Sparsity
• Pooling
• Parameter Sharing
• Stride
• Zero padding
• Local connections
• Tiled Convolution
• Typical CNN

32
Motivation 1: Sparsity
• Sparsity with respect to connections
– Fully connected ANN – the "kernel" size is the same as the input size
– A convolutional kernel is much smaller in size
– Why does it work?
• The features to be detected are comparable in size to the
kernel
• Depends on the domain data
• O(W x H) for fully connected versus O(m x
n) for a sparse kernel
– Note that W, H >> m, n
33
Sparse Connection from Below

34
Sparse Connection from Above

35
Motivation 2: Parameter Sharing
• There is a SET of kernels
• This set is applied to all positions in the
input
• Hence, a fixed set of kernels is learnt and applied to
every region of the input
– I.e., at each region, the same features are
tested for
• Also reduces the number of parameters to be
learnt
36
Parameter Sharing

37
• (Q) For a sample input and kernel set,
compare the reduction in the number of
parameters due to sparse kernels and
parameter sharing in CNNs (a worked sketch follows).

38
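A worked sketch of that comparison, using made-up sizes (a 32x32 input, 3x3 kernels, 16 output feature maps, same-size output); all numbers here are illustrative assumptions:

```python
# Made-up sizes for illustration
W = H = 32          # input width and height
m = n = 3           # kernel size
K = 16              # number of kernels / output feature maps
outputs = W * H * K # one output unit per position per feature map

# Fully connected: every output unit sees the whole input
fully_connected = outputs * (W * H)

# Sparse (local) connections, but no sharing: each output unit has its own m x n weights
sparse_only = outputs * (m * n)

# Sparse connections + parameter sharing: one m x n kernel per feature map
shared = K * (m * n)

print(fully_connected)  # 16,777,216
print(sparse_only)      # 147,456
print(shared)           # 144
```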
Equivariance
• Convolution operation follows the equivariance
rule of functions
– f(g(x)) = g(f(x))
• A shift in the location of a feature produces a correspondingly
shifted output, so the feature is still detected
– E.g., in time-series data, features will be detected
even if shifted in time
– In an image, a feature translated across the image is still detected
• Note that changes in scale or rotation are not
equivariant transforms
39
Pooling
• Replace a group of neighbouring values by a single
summary statistic
– E.g., L1/L2 norm, average of the items
– L-infinity norm (commonly called max pooling)
• Makes the output invariant to small local translations of the input
• Pooling over the outputs of different convolutions
– The network will learn which transformations to
become invariant to
• Pooling also reduces the size of the representation, and hence
the parameters in later layers
40
Max Pooling – Spatial Invariance

41
Max Pooling – Learned Invariance

42
Pooling with Downsampling

43
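A minimal NumPy sketch of max pooling with downsampling (2x2 windows, stride 2); the input values are made up:

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    # Slide a size x size window with the given stride and keep the maximum
    oh = (X.shape[0] - size) // stride + 1
    ow = (X.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = X[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

X = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 6, 8]], dtype=float)   # made-up 4x4 feature map
print(max_pool(X))                           # 2x2 pooled, downsampled output
```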
Typical CNN Structure

44
Sample CNNs

45
Variants of Basic CNNs
• Input is not always 2D
– E.g., colour images are 3-channel 2D images
– Hence, the input is a 3D tensor
– Accounting for the mini-batch, it is a 4D tensor
• Skip some positions of the kernel –
STRIDE
– Reduces computation, at the cost of possibly missed
features
– Strides can be different in each dimension
46
Stride

47
Zero Padding

48
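A small helper showing how stride S and zero padding P determine the output size of a convolution, under the usual floor convention; the example numbers are assumptions:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard formula: floor((W - K + 2P) / S) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3, stride=1, padding=0))  # 30: shrinks without padding
print(conv_output_size(32, 3, stride=1, padding=1))  # 32: "same" padding keeps the size
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: stride 2 downsamples
```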
Local Connections
• Connections are local, but weights are not
shared
• Every connection weight is different
• Can be used when a feature is not expected to
appear across all regions of the input

49
Tiled Convolutions
• Weights differ between neighbouring locations
– I.e., like local connections
• But the same set of kernels is reused in different parts of the
input
– I.e., the kernels are cycled (rotated) across the
image
• Reduces the number of parameters to be
learnt/stored

50