
Shallow Neural Networks

CSE 4237 - Soft Computing


Mir Tafseer Nayeem
Faculty Member, CSE AUST
[email protected]

1
What is a Neural Network?

[Figure: a single logistic regression unit (Layer [1]) compared with a neural network layer (Layer [2]) built by stacking such units.]
2
Content Credit: Andrew Ng
Neural Network Representation
● In supervised learning, the training set contains the input as
well as the target output.

● "Hidden layer" means that the true values of the nodes in the
middle are not observed; we cannot see them in the training set.

● Activations are the values that the different layers of the neural
network pass on to the subsequent layers.

● When we count layers in a neural network, we do not count the
input layer, which is layer 0.

[Figure: a 2 Layer Neural Network with an Input Layer, one Hidden Layer, and an Output Layer.]
3
Neural Network Representation

Like logistic regression, but repeated many times.

4
Neural Network Representation
[Figure: notation for activations — the bracketed superscript denotes the layer, and the subscript denotes the node in that layer.]

5
Neural Network Representation Vector

6
Neural Network Representation

7
Neural Network Representation learning
Given input x:

z^{[1]} = W^{[1]} x + b^{[1]}        (4,1) = (4,3)(3,1) + (4,1)
a^{[1]} = σ(z^{[1]})                 (4,1)
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}  (1,1) = (1,4)(4,1) + (1,1)
a^{[2]} = σ(z^{[2]})                 (1,1)

You can replace x with a^{[0]}.
8
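A minimal NumPy sketch of these forward-propagation steps, assuming a 3-unit input, a 4-unit hidden layer, and a 1-unit output so that the weight shapes match the dimensions listed above (the random values are placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 1))                           # input, shape (3, 1)

    W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))    # layer 1 parameters
    W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))    # layer 2 parameters

    z1 = W1 @ x + b1          # (4,1) = (4,3)(3,1) + (4,1)
    a1 = sigmoid(z1)          # (4,1)
    z2 = W2 @ a1 + b2         # (1,1) = (1,4)(4,1) + (1,1)
    a2 = sigmoid(z2)          # (1,1), the network's output

    assert z1.shape == (4, 1) and a2.shape == (1, 1)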
Vectorizing across multiple examples

x^{(1)} → a^{[2](1)}
x^{(2)} → a^{[2](2)}
...
x^{(m)} → a^{[2](m)}

Notation: in a^{[2](i)}, the superscript [2] refers to layer number 2 and the superscript (i) refers to training example i.
9
Vectorizing across multiple examples
For one training example, the network computes a^{[2](i)}, the output prediction for that
example. With m training examples, the same computation is repeated in a for loop over
i = 1, ..., m. We need to vectorize this in order to get rid of the for loop.

10
Vectorizing across multiple examples

Stacking them horizontally: the per-example column vectors x^{(i)}, z^{[1](i)}, and a^{[1](i)}
are stacked as columns to form the matrices X, Z^{[1]}, and A^{[1]}. Horizontally the matrix
index runs across training examples; vertically it runs across hidden units (or input features
for X).

11
Justification for vectorized implementation
[Figure: applying W^{[1]} to the first, second, and third example columns of X produces the corresponding columns of Z^{[1]}.]

12
Justification for vectorized implementation

Previous
implementation of
Forward Propagation

Vectorized
implementation of
Forward Propagation

13
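A hedged NumPy sketch contrasting the two implementations above: the loop processes one example at a time, while the vectorized version stacks the m examples as columns of X and computes them all in one pass (layer sizes and random values are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    m = 5                                   # number of training examples
    X = rng.standard_normal((3, m))         # examples stacked horizontally as columns

    W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
    W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

    # Previous implementation: explicit for loop over the m examples.
    preds_loop = np.zeros((1, m))
    for i in range(m):
        x_i = X[:, i:i + 1]
        a1 = sigmoid(W1 @ x_i + b1)
        preds_loop[:, i:i + 1] = sigmoid(W2 @ a1 + b2)

    # Vectorized implementation: one matrix product per layer, b broadcast across columns.
    A1 = sigmoid(W1 @ X + b1)
    preds_vec = sigmoid(W2 @ A1 + b2)

    assert np.allclose(preds_loop, preds_vec)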
Activation functions

In the forward propagation steps of a neural network, g can be any non-linear
activation function.

An activation function is used in the hidden layers and in the output layer. Until now
we have been using the sigmoid activation function, but other choices might work better!

14
Types of Activation Functions
● Binary Step Function
○ A binary step function is a threshold-based activation function.
○ If the input value is above the threshold, the neuron is activated and outputs 1; otherwise it
outputs 0. Every activated neuron therefore sends exactly the same signal to the next layer.
○ The problem with a step function is that it does not allow multi-value outputs; for example, it
cannot support classifying the inputs into one of several categories.
○ It is not recommended for hidden layers because its derivative is zero almost everywhere, so
it provides no gradient for learning.

15
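For reference, a binary step function with threshold 0 can be written as follows (a standard formulation, not taken verbatim from the slide):

    f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}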
Types of Activation Functions
● Linear Activation Function
○ A linear activation function takes the form: A = cx.
○ It takes the inputs, multiplies them by the weights of each neuron, and creates an output
signal proportional to the input. In one sense, a linear function is better than a step function
because it allows multiple output values, not just yes and no.
● Linear activation function has two major problems:
○ Not possible to use backpropagation (gradient descent) to train the model — the
derivative of the function is a constant, and has no relation to the input, X. So it’s not
possible to go back and understand which weights in the input neurons can provide a better
prediction.

16
Linear Activation Function
○ When A = c·x is differentiated with respect to x, we get the constant c, which has no
relationship with x. If the derivative is always a constant value, can we say that the learning
process is taking place? Unfortunately, no!
○ All layers of the neural network collapse into one—with linear activation functions, no
matter how many layers in the neural network, the last layer will be a linear function of the
first layer (because a linear combination of linear functions is still a linear function). So a
linear activation function turns the neural network or even a deep neural network into
just one layer.
○ A neural network with a linear activation function, or without any activation function, is
simply a linear regression model. It has limited power and a limited ability to handle
complex, varying input data.

17
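A small NumPy sketch illustrating the collapse described above: with identity (linear) activations, a two-layer network computes exactly the same function as a single linear layer whose weights are W^{[2]}W^{[1]} (the layer sizes are arbitrary choices for the demonstration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 1))

    W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
    W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

    # Two "layers" with a linear (identity) activation.
    a1 = W1 @ x + b1
    a2 = W2 @ a1 + b2

    # The same mapping as one linear layer: W = W2 W1, b = W2 b1 + b2.
    W_collapsed = W2 @ W1
    b_collapsed = W2 @ b1 + b2

    assert np.allclose(a2, W_collapsed @ x + b_collapsed)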
Why do you need nonlinear activation functions?

With a linear activation function or no activation function

18
Why do you need nonlinear activation functions?
● If the activation function is not applied, or we apply a linear activation
function, the output signal becomes a simple linear function.
● Linear functions are only first-degree polynomials.
● A non-activated neural network will act as a linear regression model with limited
learning power.
● But we also want our neural network to learn non-linear patterns, because we will
feed it complex real-world data such as images, video, text, and sound, which are
non-linear and high-dimensional.
● Multilayered deep neural networks can learn meaningful features from data.

19
Why do you need nonlinear activation functions?
● They allow backpropagation because they have a derivative function
which is related to the inputs.
● They allow “stacking” of multiple layers of neurons to create a deep
neural network. Multiple hidden layers of neurons are needed to learn
complex data sets with high levels of accuracy.

20
Why do you need nonlinear activation functions?
[Figure: a network whose hidden units use linear (Lin) activations and whose output unit uses a
sigmoid (Sig). The whole network reduces to applying a sigmoid to a linear function of x, so it is
just a logistic regression.]

The neural network is outputting a linear function of the inputs: the composition of two or more
linear functions is also a linear function.

One place you can use a linear activation function is when you are solving a regression problem,
i.e., when the output is a real number. But use it only in the output layer, not in the
intermediate layers. 21
Sigmoid / Logistic Function
● Unlike the step function, the sigmoid function is
differentiable, which means that learning can happen.
● Smooth gradient, preventing “jumps” in output
values.
● Output values bound between 0 and 1,
normalizing the output of each neuron.
● Clear predictions: for X above 2 or below -2, the
function brings the Y value (the prediction) to the
edge of the curve, very close to 1 or 0. This enables
clear predictions.
● The sigmoid function is the most frequently
used activation function, but there are many
other and more efficient alternatives.

22
What’s the problem with sigmoid function?
● Vanishing gradient—for very high or very low values
of X, there is almost no change to the prediction. The
derivative values in these regions are very small and
converge to 0. This is called the vanishing gradient and
the learning is minimal.
● The network can refuse to learn further, or be too
slow to reach an accurate prediction.
● When slow learning occurs, the optimization algorithm
that minimizes the error can get stuck in local
minima and cannot obtain the maximum
performance from the artificial neural network model.
● Outputs not zero centered. So output of all the
neurons will be of the same sign.
● Computationally expensive.
23
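A brief NumPy illustration (my own example values, not from the slides) of the sigmoid and its vanishing gradient: the derivative σ'(z) = σ(z)(1 − σ(z)) peaks at 0.25 near z = 0 and is nearly zero for large |z|:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
    print(sigmoid(z))        # outputs bounded in (0, 1)
    print(sigmoid_grad(z))   # ~4.5e-05 at |z| = 10, 0.25 at z = 0: the gradient vanishes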
Hyperbolic Tangent Function
● It has a structure very similar to Sigmoid function.
● Zero centered—making it easier to model inputs
that have strongly negative, neutral, and strongly
positive values. The range of values in this case
is from -1 to 1.
● The advantage over the sigmoid function is that its
derivative is steeper, so it yields larger gradient
values.
● This means that it can be more efficient, because the
wider output range and steeper gradients allow
faster learning.
● But again, the problem of vanishing gradients in the
saturated ends of the function remains.
24
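A short NumPy illustration (example values of my choosing) of the points above: tanh is zero-centered with outputs in (-1, 1), and its derivative 1 − tanh²(z) peaks at 1.0, steeper than the sigmoid's maximum of 0.25, yet it still vanishes for large |z|:

    import numpy as np

    z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    a = np.tanh(z)
    grad = 1.0 - a ** 2          # derivative of tanh

    print(a)      # zero-centered outputs in (-1, 1)
    print(grad)   # 1.0 at z = 0, but ~8.2e-09 at |z| = 10: still saturates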
ReLU (Rectified Linear Unit) Function
● Computationally efficient—allows the network
to converge very quickly.
● Non-linear—although it looks like a linear
function, ReLU has a derivative function and
allows for backpropagation
● The Dying ReLU problem: when inputs are negative,
the gradient of the function becomes zero, so the
network cannot perform backpropagation through
those units and they stop learning.

25
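A minimal NumPy sketch (illustrative test values of my own) of ReLU and its gradient, showing both the cheap computation and the zero gradient on the negative side that causes dying ReLU:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def relu_grad(z):
        # Convention: take the derivative at z == 0 to be 0 (1 would also work in practice).
        return (z > 0).astype(float)

    z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(z))        # [0. 0. 0. 0.5 3.]
    print(relu_grad(z))   # [0. 0. 0. 1. 1.]  zero gradient on the negative side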
ReLU - What are the returns and their benefits?
● In a large neural network with many neurons, sigmoid and hyperbolic tangent
cause almost all neurons to be activated in the same way.
● This means that the activation is very dense and computationally intensive. We would
rather have only some of the neurons in the network active, i.e., sparse activation,
because that gives an efficient computational load.
● We get this with ReLU. Having a value of 0 on the negative axis means that the network
will run faster, as it does not activate all the neurons at the same time.
● The fact that its computational load is lower than that of the sigmoid and hyperbolic
tangent functions has made ReLU the preferred choice for multi-layer networks.

26
Leaky-ReLU Function
● Prevents the dying ReLU problem: this variation of
ReLU has a small positive slope in the negative
region, so it does enable backpropagation even for
negative input values. The slope is typically given a
small value such as 0.01 for negative inputs.
● Results not consistent — leaky ReLU does not
provide consistent predictions for negative input
values.

27
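A short NumPy sketch of Leaky ReLU with the 0.01 slope mentioned above (the helper names and test values are my own):

    import numpy as np

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)

    def leaky_relu_grad(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)   # small but non-zero gradient for z <= 0

    z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(leaky_relu(z))        # [-0.03 -0.005 0. 0.5 3.]
    print(leaky_relu_grad(z))   # [0.01 0.01 0.01 1. 1.]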
Other Activation Functions
● Variants of Leaky-ReLU
○ Parameterised ReLU
○ Exponential Linear Unit (ELU)
● Swish
○ Discovered by researchers at Google in the year 2017. According to their paper, it performs
better than ReLU with a similar level of computational efficiency.

28
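The Swish paper defines the function as swish(x) = x · sigmoid(βx); a minimal NumPy sketch using the commonly chosen β = 1 (my assumption for this example):

    import numpy as np

    def swish(x, beta=1.0):
        return x / (1.0 + np.exp(-beta * x))   # x * sigmoid(beta * x)

    x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(swish(x))   # smooth, slightly negative for small negative x, close to ReLU for large x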
Summary of Activation Function Definitions

29
Choosing the Right Activation Function
● Sigmoid functions and their combinations generally work better in the case of
classifiers, especially binary classifiers at the output layer.
● Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient
problem.
● ReLU function is a general activation function and is used in most cases these days.
● If we encounter a case of dead neurons in our networks, the leaky ReLU function is
the best choice.
● Always keep in mind that ReLU function should only be used in the hidden layers.
● As a rule of thumb, you can begin with the ReLU function and then move over to
other activation functions in case ReLU doesn't provide optimum results.

30
Softmax
● Softmax function is often described as a combination of multiple sigmoids. We know that sigmoid
returns values between 0 and 1, which can be treated as probabilities of a data point belonging
to a particular class. Thus sigmoid is widely used for binary classification problems.
● The softmax function can be used for multiclass classification problems. This function returns the
probability for a datapoint belonging to each individual class. Here is the mathematical expression-

The Softmax function can be defined as below, where c is equal to the number of classes.

31
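The expression referenced above is the standard softmax definition; for a vector z = (z_1, ..., z_c) with c classes, component j is:

    \mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}}, \qquad j = 1, \dots, c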
Softmax
● While building a network for a multiclass problem, the output layer would have as many
neurons as the number of classes in the target.
● For instance, if you have three classes, there would be three neurons in the output layer.
Suppose you got the output from the neurons as [1.2 , 0.9 , 0.75].
● Applying the softmax function over these values, you will get the following result [0.42 ,
0.31, 0.27]. These represent the probability for the data point belonging to each class. Note
that the sum of all the values is 1.

32
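A small NumPy sketch of softmax (using the common max-subtraction trick for numerical stability, which does not change the result), verifying the three-class example above:

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                 # numerical stability; softmax is shift-invariant
        e = np.exp(z)
        return e / np.sum(e)

    logits = np.array([1.2, 0.9, 0.75])   # raw outputs of the three output neurons
    probs = softmax(logits)

    print(np.round(probs, 2))             # [0.42 0.31 0.27]
    print(probs.sum())                    # sums to 1 (up to floating-point rounding)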
33
Which Activation Function Should Be Preferred?
● Easy and fast convergence of the network can be the first criterion.
● If your network is very deep and the computational load is a major concern, ReLU
can be preferred over the hyperbolic tangent or sigmoid.
● ReLU will be advantageous in terms of speed, but you have to accept that some
gradients may die. It is usually used in intermediate layers rather than in the output layer.
● Leaky ReLU can be the first remedy for the problem of dying gradients.
● For deep learning models, it is advisable to start experiments with ReLU.
● Softmax is usually used in output layers.

34
35
Derivatives of Activation Functions
● Derivative of sigmoid function

36
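The standard result, assuming g(z) = σ(z) = 1/(1 + e^{-z}) and writing a = g(z):

    g(z) = \frac{1}{1 + e^{-z}}, \qquad
    g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = g(z)\bigl(1 - g(z)\bigr) = a(1 - a)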
Derivatives of Activation Functions
● Derivative of a tanh function

37
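Likewise, the standard result for tanh, with a = g(z):

    g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad
    g'(z) = 1 - \tanh^2(z) = 1 - a^2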
Derivatives of Activation Functions

38
Derivatives of Activation Functions
● Derivative of a ReLU function
○ This is a commonly used activation function nowadays.

○ A derivative of a ReLU function is:

The chance of z = 0.00000…..0000 is very small.


So gradient descent still works just fine.

39
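With g(z) = max(0, z), the derivative referred to above is (the value at z = 0 is a chosen convention, as discussed on the following slides):

    g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined (taken as 0 or 1)} & \text{if } z = 0 \end{cases}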
Derivatives of Activation Functions
● You can define a derivative at a point only if the function is continuous there and there is one
and only one tangent to the curve at that point. There are two cases in which the derivative
doesn't exist: if the function is discontinuous at the point, or if more than one tangent can be
drawn to the curve at that point.

● At x = 0 there can be an infinite number of lines
that touch the curve at 0, so we can't assign a
well-defined tangent.
● The right-hand derivative and the left-hand
derivative are not equal at 0, so the
derivative doesn't exist.

40
Derivatives of Activation Functions
● Derivative of a ReLU function
○ The derivative of a ReLU function is undefined at 0, but we can say that the derivative of this
function at zero is either 0 or 1. Both solutions work when they are implemented in
software. The same approach works for the LeakyReLU function.

41
Derivatives of Activation Functions
● Derivative of a LeakyReLU function
○ LeakyReLU usually works better than the ReLU function.

42
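With g(z) = max(0.01z, z), using the 0.01 slope mentioned earlier, the derivative is:

    g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined (taken as 0.01 or 1)} & \text{if } z = 0 \end{cases}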
43
END

44
