
Deep Learning

Hidden Units

1
Deep Learning

Topics in Deep Feedforward Networks


• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation Algorithms
6. Historical Notes

2
Deep Learning

Topics in Hidden Units


1. ReLU and its generalizations
2. Logistic sigmoid and hyperbolic tangent
3. Other hidden units

3
Deep Learning

Choice of hidden unit


• We previously discussed design choices for neural
networks that are common to most parametric
machine learning models trained with gradient-based
optimization
• We now look at how to choose the type of
hidden unit in the hidden layers of the model
• Design of hidden units is an active research
area that does not have many definitive guiding
theoretical principles
4
Deep Learning

Choice of hidden unit


• ReLU is an excellent default choice
• But there are many other types of hidden units
available
• When to use which kind (though ReLU is
usually an acceptable choice)?
• We discuss motivations behind choice of
hidden unit
– Impossible to predict in advance which will work
best
– Design process is trial and error
• Evaluate performance on a validation set
5
Deep Learning

Is Differentiability necessary?
• Some hidden units are not differentiable at all
input points
– The rectified linear function g(z) = max{0, z} is not
differentiable at z = 0
• This may seem to invalidate such units for use in
gradient-based learning
• In practice gradient descent still performs well
enough for these models to be used in ML
tasks
6
Deep Learning

Differentiability ignored
• Neural network training
– does not usually arrive at a local
minimum of the cost function
– instead it merely reduces its value significantly
• Since we do not expect training to reach a
point where the gradient is 0,
– it is acceptable for minima to correspond to
points with an undefined gradient
• Hidden units that are not differentiable everywhere
are usually non-differentiable at
only a small number of points
7
Deep Learning

Left and Right Differentiability


• A function g(z) has a left derivative defined by
the slope immediately to the left of z
• and a right derivative defined by the slope of the
function immediately to the right of z
– the left derivative is defined at a when a ∈ I is a limit point of I ∩ (–∞, a]
– the right derivative is defined at a when a ∈ I is a limit point of I ∩ [a, ∞)
• A function is differentiable at z = a only if both
one-sided derivatives are defined and equal there
Figure: the function is not continuous and has no derivative at the marked point;
however it has a right derivative at all points, with ∂+f(a) = 0 at all points
8
Deep Learning

Software Reporting of Non-differentiability

• In the case of g(z) = max{0, z}, the left derivative at
z = 0 is 0 and the right derivative is 1
• Software implementations of neural network
training usually return:
– one of the one-sided derivatives rather than
reporting that the derivative is undefined or raising an error
• Justified in that gradient-based optimization is subject to
numerical error anyway
• When a function is asked to evaluate g(0), it is very
unlikely that the underlying value was truly 0; instead it
was a small value ε that was rounded to 0
9
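As an illustration, a minimal NumPy sketch (not from the slides) of a ReLU whose gradient routine, like typical software, silently returns one of the one-sided derivatives at z = 0:

```python
import numpy as np

def relu(z):
    """Rectified linear activation g(z) = max{0, z}."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU. At z = 0 the derivative is undefined; here we
    return the left derivative 0 (returning the right derivative 1 would
    be an equally valid one-sided choice)."""
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```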
Deep Learning

What a Hidden unit does


• Accepts a vector of inputs x and computes an
affine transformation* z = WTx+b
• Computes an element-wise non-linear function
g(z)
• Most hidden units are distinguished from each
other by the choice of activation function g(z)
– We look at: ReLU, Sigmoid and tanh, and other
hidden units

*A geometric transformation that preserves lines and parallelism (but not
necessarily distances and angles)
10
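As a concrete illustration, a minimal NumPy sketch of what a hidden unit layer computes; the shapes, the random initialization, and the choice of ReLU for g are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3

W = rng.normal(size=(n_in, n_hidden))   # weight matrix
b = np.zeros(n_hidden)                  # bias vector
x = rng.normal(size=n_in)               # input vector

z = W.T @ x + b                         # affine transformation z = W^T x + b
h = np.maximum(0.0, z)                  # element-wise non-linearity g(z), here ReLU
print(h.shape)                          # (3,)
```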
Deep Learning

Rectified Linear Unit & Generalizations


• Rectified linear units use the activation function
g(z)=max{0,z}
– They are easy to optimize due to similarity with
linear units
• The only difference from linear units is that they output 0 across
half their domain
• The derivative is 1 everywhere that the unit is active
• Thus the gradient direction is far more useful than with
activation functions that introduce second-order effects

11
Deep Learning

Use of ReLU
• Usually used on top of an affine transformation
h=g(WTx+b)
• Good practice to set all elements of b to a
small value such as 0.1
– This makes it likely that ReLU will be initially active
for most training samples and allow derivatives to
pass through

12
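A small sketch (with assumed layer sizes and weight scale) illustrating why initializing b to 0.1 leaves more ReLU units initially active than b = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 100, 50, 1000

W = rng.normal(scale=0.05, size=(n_in, n_hidden))
X = rng.normal(size=(n_samples, n_in))

for b_val in (0.0, 0.1):
    b = np.full(n_hidden, b_val)
    h = np.maximum(0.0, X @ W + b)   # h = g(W^T x + b) for each sample
    print(b_val, (h > 0).mean())     # fraction of initially active units
```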
Deep Learning

ReLU vs other activations


• Sigmoid and tanh activation functions cannot be
used with many layers due to the vanishing gradient
problem
• ReLU overcomes the vanishing gradient
problem, allowing models to learn faster and
perform better
• ReLU is the default activation function for MLPs
and CNNs
13
Deep Learning

Generalizations of ReLU
• Perform comparably to ReLU and occasionally
perform better
• ReLU units cannot learn via gradient-based methods on
examples for which their activation is zero
• Generalizations guarantee that they receive
gradient everywhere

14
Deep Learning

Three generalizations of ReLU


• ReLU has the activation function g(z) = max{0, z}
• Three generalizations of ReLU are based on using
a non-zero slope α_i when z_i < 0:
h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
1. Absolute-value rectification:
• fixes α_i = −1 to obtain g(z) = |z|
2. Leaky ReLU:
• fixes α_i to a small value like 0.01
3. Parametric ReLU or PReLU:
• treats α_i as a learnable parameter
15
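A minimal sketch of this family of activations; the function name is an assumption, and PReLU training itself is not shown:

```python
import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 1.5])
print(generalized_relu(z, alpha=-1.0))   # absolute-value rectification: |z|
print(generalized_relu(z, alpha=0.01))   # leaky ReLU
# For PReLU, alpha would be a learnable parameter updated by gradient descent.
```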
Deep Learning

Maxout Units
• Maxout units further generalize ReLUs
– Instead of applying an element-wise function g(z),
maxout units divide z = W^T x + b into groups of k values
– Each maxout unit then outputs the maximum
element of one of these groups:
g(z)_i = max_{j ∈ G^(i)} z_j
• where G^(i) is the set of indices into the inputs for group i,
{(i−1)k + 1, ..., ik}
• This provides a way of learning a piecewise
linear function that responds to multiple
directions in the input x space
16
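A minimal NumPy sketch of the maxout computation described above (the helper name and group layout are assumptions for the example):

```python
import numpy as np

def maxout(z, k):
    """Maxout: divide z into groups of k values and take the max of each group.
    z has length n_groups * k; the output has length n_groups."""
    n_groups = z.shape[-1] // k
    return z.reshape(*z.shape[:-1], n_groups, k).max(axis=-1)

z = np.array([0.3, -1.2, 0.8,   # group 1
              2.0, -0.1, 1.1])  # group 2
print(maxout(z, k=3))           # [0.8 2. ]
```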
Deep Learning

Maxout as Learning Activation


• A maxout unit can learn a piecewise linear,
convex function with up to k pieces
– Thus it can be seen as learning the activation function itself
rather than just the relationship between units
• With large enough k, a maxout unit can approximate any convex function
– A maxout layer with two pieces can learn to
implement the same function of the input x as a
traditional layer using ReLU or one of its generalizations
17
Deep Learning

Learning Dynamics of Maxout

• Maxout is parameterized differently from ReLU
• Its learning dynamics are different even when it
implements the same function of x as one of the
other layer types
– Each maxout unit is parameterized by k weight
vectors instead of one
• So it requires more regularization than ReLU
• It can work well without regularization if the training set is large
and the number of pieces per unit is kept low

18
Deep Learning

Other benefits of maxout


• Can gain statistical and computational
advantages by requiring fewer parameters
• If the features captured by n different linear
filters can be summarized without losing
information by taking max over each group of k
features, then next layer can get by with k times
fewer weights
• Because each maxout unit is driven by multiple filters,
this redundancy helps it resist catastrophic forgetting
– Where a network forgets how to perform tasks it
was trained to perform
19
Deep Learning

Principle of Linearity
• ReLU is based on the principle that models are easier
to optimize if their behavior is closer to linear
– The principle also applies in contexts other than deep linear networks
• Recurrent networks can learn from sequences and
produce a sequence of states and outputs
• When training them we need to propagate information
through several time steps
– Which is much easier when some linear computations (with some
directional derivatives being of magnitude near 1) are involved
20
Deep Learning

Linearity in LSTM
• LSTM: the best performing recurrent architecture
– Propagates information through time via summation
• A straightforward kind of linear activation
• An LSTM network is an ANN that contains LSTM blocks in addition to
regular network units
• An LSTM block combines units such as y = ∑ w_i x_i, y = ∏ x_i and
y = σ(∑ w_i x_i) with three gates:
– Input gate: determines when inputs are allowed to flow into the block;
when its output is close to zero, it zeros the input
– Forget gate: when close to zero, the block forgets whatever value
it was remembering
– Output gate: determines when the unit should output its value
21
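A simplified sketch of one LSTM step, showing the additive (summation-based) cell-state update; the variable names and parameter layout are assumptions, and real implementations differ in details:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One simplified LSTM step. W, U, b each hold parameters for the
    input gate (i), forget gate (f), output gate (o) and candidate (g)."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate values
    c = f * c_prev + i * g        # cell state: information propagated by summation
    h = o * np.tanh(c)            # block output
    return h, c

n_in, n_h = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_h, n_in)) for k in "ifog"}
U = {k: rng.normal(size=(n_h, n_h)) for k in "ifog"}
b = {k: np.zeros(n_h) for k in "ifog"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, U, b)
```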
Deep Learning

Logistic Sigmoid
• Prior to introduction of ReLU, most neural
networks used logistic sigmoid activation
g(z)=σ(z)
• Or the hyperbolic tangent
g(z)=tanh(z)
• These activation functions are closely related
because
tanh(z)=2σ(2z)-1
• Sigmoid units are used to predict the probability
that a binary variable is 1
22
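A quick numerical check of the identity tanh(z) = 2σ(2z) − 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# Numerically verify tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```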
Deep Learning

Sigmoid Saturation
• Sigmoidal units saturate across most of their domain
– They saturate to 1 when z is very positive and to 0 when z is
very negative
– They are strongly sensitive to their input only when z is near 0
– Saturation makes gradient-based learning difficult
• ReLU and softplus, in contrast, increase for input > 0

• Sigmoid can still be used in the output layer
when the cost function undoes the saturation of the sigmoid
23
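A small numerical illustration of saturation: the sigmoid's derivative is largest near z = 0 and vanishes as |z| grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    # Gradient shrinks rapidly as z moves into the saturated region
    print(z, sigmoid_grad(z))
# 0.0 -> 0.25, 2.0 -> ~0.105, 5.0 -> ~0.0066, 10.0 -> ~4.5e-05
```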
Deep Learning

Sigmoid vs tanh Activation


• The hyperbolic tangent typically performs better
than the logistic sigmoid
• It resembles the identity function more closely:
tanh(0) = 0 while σ(0) = ½
• Because tanh is similar to the identity near 0,
training a deep neural network ŷ = w^T tanh(U^T tanh(V^T x))
resembles training a linear model ŷ = w^T U^T V^T x
so long as the activations can be kept small
24
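A small numerical sketch (with assumed sizes and a deliberately small weight scale) showing that the deep tanh network behaves almost like the corresponding linear model when activations stay near 0:

```python
import numpy as np

rng = np.random.default_rng(0)
# Small weights keep the activations near 0, where tanh(a) ≈ a
V = rng.normal(scale=0.01, size=(5, 4))
U = rng.normal(scale=0.01, size=(4, 3))
w = rng.normal(scale=0.01, size=3)
x = rng.normal(size=5)

y_tanh = w @ np.tanh(U.T @ np.tanh(V.T @ x))   # deep tanh network
y_lin = w @ (U.T @ (V.T @ x))                  # corresponding linear model
print(y_tanh, y_lin)                           # nearly identical
```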
Deep Learning

Sigmoidal units still useful


• Sigmoidal units are more common in settings other than
feed-forward networks
• Recurrent networks, many probabilistic models
and autoencoders have additional requirements
that rule out piecewise linear activation
functions
• They make sigmoid units appealing despite
saturation

25
Deep Learning

Other Hidden Units


• Many other types of hidden units possible, but
used less frequently
– Feed-forward network using h = cos(Wx + b)
• On MNIST it obtained an error rate of less than 1%
– Radial basis function (RBF): h_i = exp(−||W_:,i − x||² / σ²)
• Becomes more active as x approaches a template W_:,i
– Softplus: g(a) = ζ(a) = log(1 + e^a)
• A smooth version of the rectifier
– Hard tanh: g(a) = max(−1, min(1, a))
• Shaped similarly to tanh and the rectifier, but it is bounded
26
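Minimal NumPy sketches of these units; the function names and the shared σ are assumptions made for the example:

```python
import numpy as np

def softplus(a):
    """Smooth version of the rectifier: zeta(a) = log(1 + e^a)."""
    return np.log1p(np.exp(a))

def hard_tanh(a):
    """Bounded, piecewise-linear version of tanh: max(-1, min(1, a))."""
    return np.clip(a, -1.0, 1.0)

def rbf_unit(x, W, sigma=1.0):
    """RBF hidden units: h_i = exp(-||W[:, i] - x||^2 / sigma^2).
    Each unit becomes more active as x approaches its template column W[:, i]."""
    diff = W - x[:, None]                      # compare x to every template column
    return np.exp(-np.sum(diff**2, axis=0) / sigma**2)

a = np.array([-2.0, 0.0, 2.0])
print(softplus(a))     # [0.127 0.693 2.127]
print(hard_tanh(a))    # [-1.  0.  1.]

W = np.array([[0.0, 1.0],
              [0.0, 1.0]])
print(rbf_unit(np.zeros(2), W))  # first template matches x exactly: [1.    0.135]
```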
