
Deep Learning

Hidden Units

1
Deep Learning

Topics in Deep Feedforward Networks


• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation Algorithms
6. Historical Notes

2
Deep Learning

Topics in Hidden Units


1. ReLU and its generalizations
2. Logistic sigmoid and hyperbolic tangent
3. Other hidden units

3
Deep Learning

Choice of hidden unit


• We previously discussed design choices for neural
networks that are common to most parametric
machine learning models trained with gradient-based
optimization
• We now look at how to choose the type of
hidden unit in the hidden layers of the model
• Design of hidden units is an active research
area that does not have many definitive guiding
theoretical principles
4
Deep Learning

Choice of hidden unit


• ReLU is an excellent default choice
• But there are many other types of hidden units
available
• When to use which kind (though ReLU is
usually an acceptable choice)?
• We discuss motivations behind choice of
hidden unit
– Impossible to predict in advance which will work
best
– Design process is trial and error
• Evaluate performance on a validation set
5
Deep Learning

Is Differentiability necessary?
• Some hidden units are not differentiable at all
input points
– The rectified linear function g(z) = max{0, z} is not
differentiable at z = 0
• This may seem to invalidate such units for use in
gradient-based learning
• In practice gradient descent still performs well
enough for these models to be used in ML
tasks
6
Deep Learning

Differentiability ignored
• Neural network training
– does not usually arrive at a local
minimum of the cost function
– instead it merely reduces its value significantly
• Since we do not expect training to reach a
point where the gradient is 0,
– it is acceptable for minima to correspond to
points with an undefined gradient
• Hidden units that are not differentiable everywhere
are usually non-differentiable at
only a small number of points
7
Deep Learning

Left and Right Differentiability


• A function g(z) has a left derivative defined by
the slope immediately to the left of z
• and a right derivative defined by the slope of the
function immediately to the right of z
– the left derivative is defined at a when a ∈ I is a limit point of I ∩ (–∞, a]
– the right derivative is defined at a when a ∈ I is a limit point of I ∩ [a, ∞)
• A function is differentiable at z = a only if both
one-sided derivatives are defined and equal there
Figure: the function is not continuous and has no derivative at the marked point;
however it has a right derivative at all points, with ∂+f(a) = 0 at all points
8
Deep Learning

Software Reporting of Non-differentiability

• In the case of g(z) = max{0, z}, the left derivative at
z = 0 is 0 and the right derivative is 1
• Software implementations of neural network
training usually return:
– one of the one-sided derivatives rather than
reporting that the derivative is undefined or raising an error
• Justified in that gradient-based optimization is subject to
numerical error anyway
• When a function is asked to evaluate g(0), it is very
unlikely that the underlying value was truly 0; instead it
was a small value ε that was rounded to 0
9
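As an illustration, a minimal NumPy sketch (not from the slides) of a ReLU whose gradient routine, like typical software, silently returns one of the one-sided derivatives at z = 0:

```python
import numpy as np

def relu(z):
    """Rectified linear activation g(z) = max{0, z}."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU. At z = 0 the derivative is undefined; here we
    return the left derivative 0 (returning the right derivative 1 would
    be an equally valid one-sided choice)."""
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```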
Deep Learning

What a Hidden unit does


• Accepts a vector of inputs x and computes an
affine transformation* z = WTx+b
• Computes an element-wise non-linear function
g(z)
• Most hidden units are distinguished from each
other by the choice of activation function g(z)
– We look at: ReLU, Sigmoid and tanh, and other
hidden units

*A geometric transformation that preserves lines and parallelism (but not
necessarily distances and angles)
10
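As a concrete illustration, a minimal NumPy sketch of what a hidden unit layer computes; the shapes, the random initialization, and the choice of ReLU for g are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3

W = rng.normal(size=(n_in, n_hidden))   # weight matrix
b = np.zeros(n_hidden)                  # bias vector
x = rng.normal(size=n_in)               # input vector

z = W.T @ x + b                         # affine transformation z = W^T x + b
h = np.maximum(0.0, z)                  # element-wise non-linearity g(z), here ReLU
print(h.shape)                          # (3,)
```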
Deep Learning

Rectified Linear Unit & Generalizations


• Rectified linear units use the activation function
g(z)=max{0,z}
– They are easy to optimize due to similarity with
linear units
• The only difference from linear units is that they output 0 across
half their domain
• The derivative is 1 everywhere that the unit is active
• Thus the gradient direction is far more useful than with
activation functions that introduce second-order effects

11
Deep Learning

Use of ReLU
• Usually used on top of an affine transformation
h=g(WTx+b)
• Good practice to set all elements of b to a
small value such as 0.1
– This makes it likely that ReLU will be initially active
for most training samples and allow derivatives to
pass through

12
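A small sketch (with assumed layer sizes and weight scale) illustrating why initializing b to 0.1 leaves more ReLU units initially active than b = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 100, 50, 1000

W = rng.normal(scale=0.05, size=(n_in, n_hidden))
X = rng.normal(size=(n_samples, n_in))

for b_val in (0.0, 0.1):
    b = np.full(n_hidden, b_val)
    h = np.maximum(0.0, X @ W + b)   # h = g(W^T x + b) for each sample
    print(b_val, (h > 0).mean())     # fraction of initially active units
```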
Deep Learning

ReLU vs other activations


• Sigmoid and tanh activation functions cannot be
used with many layers due to the vanishing gradient
problem
• ReLU overcomes the vanishing gradient
problem, allowing models to learn faster and
perform better
• ReLU is the default activation function for MLPs
and CNNs
13
Deep Learning

Generalizations of ReLU
• Perform comparably to ReLU and occasionally
perform better
• ReLU units cannot learn via gradient-based methods on
examples for which their activation is zero
• Generalizations guarantee that they receive
gradient everywhere

14
Deep Learning

Three generalizations of ReLU


• ReLU has the activation function g(z) = max{0, z}
• Three generalizations of ReLU are based on using
a non-zero slope α_i when z_i < 0:
h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
1. Absolute-value rectification:
• fixes α_i = −1 to obtain g(z) = |z|
2. Leaky ReLU:
• fixes α_i to a small value like 0.01
3. Parametric ReLU or PReLU:
• treats α_i as a learnable parameter
15
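A minimal sketch of this family of activations; the function name is an assumption, and PReLU training itself is not shown:

```python
import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 1.5])
print(generalized_relu(z, alpha=-1.0))   # absolute-value rectification: |z|
print(generalized_relu(z, alpha=0.01))   # leaky ReLU
# For PReLU, alpha would be a learnable parameter updated by gradient descent.
```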
Deep Learning

Maxout Units
• Maxout units further generalize ReLUs
– Instead of applying an element-wise function g(z),
maxout units divide z = W^T x + b into groups of k values
– Each maxout unit then outputs the maximum
element of one of these groups:
g(z)_i = max_{j ∈ G^(i)} z_j
• where G^(i) is the set of indices into the inputs for group i,
{(i−1)k + 1, ..., ik}
• This provides a way of learning a piecewise
linear function that responds to multiple
directions in the input x space
16
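A minimal NumPy sketch of the maxout computation described above (the helper name and group layout are assumptions for the example):

```python
import numpy as np

def maxout(z, k):
    """Maxout: divide z into groups of k values and take the max of each group.
    z has length n_groups * k; the output has length n_groups."""
    n_groups = z.shape[-1] // k
    return z.reshape(*z.shape[:-1], n_groups, k).max(axis=-1)

z = np.array([0.3, -1.2, 0.8,   # group 1
              2.0, -0.1, 1.1])  # group 2
print(maxout(z, k=3))           # [0.8 2. ]
```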
Deep Learning

Maxout as Learning Activation


• A maxout unit can learn a piecewise linear,
convex function with up to k pieces
– Thus it can be seen as learning the activation function itself
rather than just the relationship between units
• With large enough k, a maxout unit can approximate any convex function
– A maxout layer with two pieces can learn to
implement the same function of the input x as a
traditional layer using ReLU or one of its generalizations
17
Deep Learning

Learning Dynamics of Maxout

• Maxout is parameterized differently from ReLU
• Its learning dynamics are different even when it
implements the same function of x as one of the
other layer types
– Each maxout unit is parameterized by k weight
vectors instead of one
• So it requires more regularization than ReLU
• It can work well without regularization if the training set is large
and the number of pieces per unit is kept low

18
Deep Learning

Other benefits of maxout


• Can gain statistical and computational
advantages by requiring fewer parameters
• If the features captured by n different linear
filters can be summarized without losing
information by taking max over each group of k
features, then next layer can get by with k times
fewer weights
• Because each maxout unit is driven by multiple filters,
this redundancy helps it resist catastrophic forgetting
– Where a network forgets how to perform tasks it
was trained to perform
19
Deep Learning

Principle of Linearity
• ReLU is based on the principle that models are easier
to optimize if their behavior is closer to linear
– The principle also applies in contexts other than deep linear networks
• Recurrent networks can learn from sequences and
produce a sequence of states and outputs
• When training them we need to propagate information
through several time steps
– Which is much easier when some linear computations (with some
directional derivatives being of magnitude near 1) are involved
20
Deep Learning

Linearity in LSTM
• LSTM: the best performing recurrent architecture
– Propagates information through time via summation
• A straightforward kind of linear activation
• An LSTM network is an ANN that contains LSTM blocks in addition to
regular network units
• An LSTM block combines units such as y = ∑ w_i x_i, y = ∏ x_i and
y = σ(∑ w_i x_i) with three gates:
– Input gate: determines when inputs are allowed to flow into the block;
when its output is close to zero, it zeros the input
– Forget gate: when close to zero, the block forgets whatever value
it was remembering
– Output gate: determines when the unit should output its value
21
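A simplified sketch of one LSTM step, showing the additive (summation-based) cell-state update; the variable names and parameter layout are assumptions, and real implementations differ in details:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One simplified LSTM step. W, U, b each hold parameters for the
    input gate (i), forget gate (f), output gate (o) and candidate (g)."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate values
    c = f * c_prev + i * g        # cell state: information propagated by summation
    h = o * np.tanh(c)            # block output
    return h, c

n_in, n_h = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_h, n_in)) for k in "ifog"}
U = {k: rng.normal(size=(n_h, n_h)) for k in "ifog"}
b = {k: np.zeros(n_h) for k in "ifog"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, U, b)
```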
Deep Learning

Logistic Sigmoid
• Prior to introduction of ReLU, most neural
networks used logistic sigmoid activation
g(z)=σ(z)
• Or the hyperbolic tangent
g(z)=tanh(z)
• These activation functions are closely related
because
tanh(z)=2σ(2z)-1
• Sigmoid units are used to predict the probability
that a binary variable is 1
22
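A quick numerical check of the identity tanh(z) = 2σ(2z) − 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# Numerically verify tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```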
Deep Learning

Sigmoid Saturation
• Sigmoidal units saturate across most of their domain
– They saturate to 1 when z is very positive and to 0 when z is
very negative
– They are strongly sensitive to their input only when z is near 0
– Saturation makes gradient-based learning difficult
• ReLU and softplus, in contrast, increase for input > 0

• Sigmoid can still be used in the output layer
when the cost function undoes the saturation of the sigmoid
23
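A small numerical illustration of saturation: the sigmoid's derivative is largest near z = 0 and vanishes as |z| grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    # Gradient shrinks rapidly as z moves into the saturated region
    print(z, sigmoid_grad(z))
# 0.0 -> 0.25, 2.0 -> ~0.105, 5.0 -> ~0.0066, 10.0 -> ~4.5e-05
```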
Deep Learning

Sigmoid vs tanh Activation


• The hyperbolic tangent typically performs better
than the logistic sigmoid
• It resembles the identity function more closely:
tanh(0) = 0 while σ(0) = ½
• Because tanh is similar to the identity near 0,
training a deep neural network ŷ = w^T tanh(U^T tanh(V^T x))
resembles training a linear model ŷ = w^T U^T V^T x
so long as the activations can be kept small
24
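A small numerical sketch (with assumed sizes and a deliberately small weight scale) showing that the deep tanh network behaves almost like the corresponding linear model when activations stay near 0:

```python
import numpy as np

rng = np.random.default_rng(0)
# Small weights keep the activations near 0, where tanh(a) ≈ a
V = rng.normal(scale=0.01, size=(5, 4))
U = rng.normal(scale=0.01, size=(4, 3))
w = rng.normal(scale=0.01, size=3)
x = rng.normal(size=5)

y_tanh = w @ np.tanh(U.T @ np.tanh(V.T @ x))   # deep tanh network
y_lin = w @ (U.T @ (V.T @ x))                  # corresponding linear model
print(y_tanh, y_lin)                           # nearly identical
```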
Deep Learning

Sigmoidal units still useful


• Sigmoidal units are more common in settings other than
feed-forward networks
• Recurrent networks, many probabilistic models
and autoencoders have additional requirements
that rule out piecewise linear activation
functions
• They make sigmoid units appealing despite
saturation

25
Deep Learning

Other Hidden Units


• Many other types of hidden units possible, but
used less frequently
– Feed-forward network using h = cos(Wx + b)
• On MNIST it obtained an error rate of less than 1%
– Radial basis function (RBF): h_i = exp(−||W_:,i − x||² / σ²)
• Becomes more active as x approaches a template W_:,i
– Softplus: g(a) = ζ(a) = log(1 + e^a)
• A smooth version of the rectifier
– Hard tanh: g(a) = max(−1, min(1, a))
• Shaped similarly to tanh and the rectifier, but it is bounded
26
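Minimal NumPy sketches of these units; the function names and the shared σ are assumptions made for the example:

```python
import numpy as np

def softplus(a):
    """Smooth version of the rectifier: zeta(a) = log(1 + e^a)."""
    return np.log1p(np.exp(a))

def hard_tanh(a):
    """Bounded, piecewise-linear version of tanh: max(-1, min(1, a))."""
    return np.clip(a, -1.0, 1.0)

def rbf_unit(x, W, sigma=1.0):
    """RBF hidden units: h_i = exp(-||W[:, i] - x||^2 / sigma^2).
    Each unit becomes more active as x approaches its template column W[:, i]."""
    diff = W - x[:, None]                      # compare x to every template column
    return np.exp(-np.sum(diff**2, axis=0) / sigma**2)

a = np.array([-2.0, 0.0, 2.0])
print(softplus(a))     # [0.127 0.693 2.127]
print(hard_tanh(a))    # [-1.  0.  1.]

W = np.array([[0.0, 1.0],
              [0.0, 1.0]])
print(rbf_unit(np.zeros(2), W))  # first template matches x exactly: [1.    0.135]
```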
