Received April 1, 2019, accepted April 15, 2019, date of publication April 22, 2019, date of current version May 1, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2912200
Review of Deep Learning Algorithms
and Architectures
AJAY SHRESTHA AND AUSIF MAHMOOD, (Senior Member, IEEE)
Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
Corresponding author: Ajay Shrestha (shrestha@my.bridgeport.edu)
ABSTRACT Deep learning (DL) is playing an increasingly important role in our lives. It has already made a
huge impact in areas such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting,
and speech recognition. The painstakingly handcrafted feature extractors used in traditional learning,
classification, and pattern recognition systems are not scalable for large-sized data sets. In many cases,
depending on the problem complexity, DL can also overcome the limitations of earlier shallow networks
that prevented efficient training and abstractions of hierarchical representations of multi-dimensional training
data. Deep neural network (DNN) uses multiple (deep) layers of units with highly optimized algorithms and
architectures. This paper reviews several optimization methods to improve the accuracy of the training and
to reduce training time. We delve into the math behind training algorithms used in recent deep networks.
We describe current shortcomings, enhancements, and implementations. The review also covers different
types of deep architectures, such as deep convolution networks, deep residual networks, recurrent neural
networks, reinforcement learning, variational autoencoders, and others.
INDEX TERMS Machine learning algorithm, optimization, artificial intelligence, deep neural network
architectures, convolution neural network, backpropagation, supervised and unsupervised learning.
I. INTRODUCTION
Neural Network is a machine learning (ML) technique that is
inspired by and resembles the human nervous system and the
structure of the brain. It consists of processing units organized
in input, hidden and output layers. The nodes or units in
each layer are connected to nodes in adjacent layers. Each
connection has a weight value. The inputs are multiplied
by the respective weights and summed at each unit. The
sum then undergoes a transformation based on the activation function, which in most cases is a sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU) function. These functions are used because they have mathematically favorable derivatives, making it easier to compute the partial derivatives of the error delta with respect to individual weights. Sigmoid and tanh also squash the input into a narrow output range, i.e., (0, 1) and (−1, +1) respectively. They implement saturating nonlinearity, as the output plateaus or saturates beyond the respective thresholds. ReLU, with f(x) = max(0, x), on the other hand exhibits both behaviors: it saturates for negative inputs and is non-saturating for positive ones. The output of the function is then fed
as input to the subsequent unit in the next layer. The result of
the final output layer is used as the solution for the problem.
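To make the forward pass concrete, here is a minimal illustrative sketch in NumPy (not from the original text; the layer sizes, helper names, and random values are assumptions for demonstration):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): zero for negative inputs, identity for positive ones
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real input into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(inputs, weights, bias, activation=relu):
    # weighted sum of the inputs plus bias, passed through the activation
    z = weights @ inputs + bias
    return activation(z)

# toy example: 3 inputs feeding a layer of 2 units
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
y = layer_forward(x, W, b)   # this output is fed to the next layer
```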
Neural Networks can be used in a variety of prob-
lems including pattern recognition, classification, clustering,
dimensionality reduction, computer vision, natural language
processing (NLP), regression, predictive analysis, etc. Here
is an example of image recognition.
Figure 1 shows how a deep neural network called Convo-
lution Neural Network (CNN) can learn hierarchical levels
of representations from a low-level input vector and success-
fully identify the higher-level object. The red squares in the
figure are simply a gross generalization of the pixel values of
the highlighted section of the figure. CNNs can progressively
extract higher representations of the image after each layer
and finally recognize the image.
The implementation of neural networks consists of the
following steps:
1. Acquire training and testing data set
2. Train the network
3. Make prediction with test data
The paper is organized in the following sections:
1. Introduction to Machine Learning
a. Background and Motivation
2. Classifications of Neural Networks
3. DNN Architectures
4. Training Algorithms
FIGURE 1. Image recognition by a CNN.
5. Shortcomings of Training Algorithms
6. Optimization of Training Algorithms
7. Architectures & Algorithms – Implementations
8. Conclusion
A. BACKGROUND
In 1957, Frank Rosenblatt created the perceptron, the first
prototype of what we now know as a neural network [1].
It had two layers of processing units that could recognize
simple patterns. Instead of undergoing more research and development, neural networks entered a dark phase of their history in 1969, when professors at MIT demonstrated that the perceptron could not even learn a simple XOR function [2].
In addition, there was another finding that particularly
dampened the motivation for DNN. The universal approximation theorem showed that a single hidden layer was able to approximate any continuous function [3]. It was mathematically proven as well [4], which further questioned the need for DNN. While a single hidden layer could be used to learn,
it was not efficient and was a far cry from the convenience and
capability afforded by the hierarchical abstraction of multiple
hidden layers of DNN that we know now. But it was not
just the universal approximation theorem that held back the
progress of DNN. Back then, we didn’t have a way to train a
DNN either. These factors prolonged the so-called AI winter,
i.e., a phase in the history of artificial intelligence where it
didn’t get much funding and interest, and as a result didn’t
advance much either.
A breakthrough in DNN occurred with the advent of the backpropagation learning algorithm. It was proposed in the 1970s [5], but it wasn't until the mid-1980s [6] that it was fully understood and applied to neural networks. Self-directed learning was made possible with the deeper understanding and application of the backpropagation algorithm. The automation of feature extractors is what differentiates DNNs from earlier-generation machine learning techniques.
DNN is a type of neural network modeled as a multilayer
perceptron (MLP) that is trained with algorithms to learn
representations from data sets without any manual design
of feature extractors. As the name Deep Learning suggests, it consists of a higher or deeper number of processing layers, in contrast with shallow learning models with fewer layers of units. The shift from shallow to deep learning has allowed more complex and non-linear functions to be mapped, as they cannot be efficiently mapped with shallow architectures. This improvement has been complemented by the proliferation of cheaper processing units such as the general-purpose graphics processing unit (GPGPU) and large volumes of data (big data) to train from. While individual GPGPU cores are less powerful than CPU cores, GPGPU cores outnumber CPU cores by orders of magnitude, which makes GPGPUs better suited for implementing DNNs. In addi-
tion to the backpropagation algorithm and GPU, the adoption
and advancement of ML and particularly Deep Learning can
be attributed to the explosion of data, or big data, in the last
10 years. ML will continue to impact and disrupt all areas
of our lives from education, finance, governance, healthcare,
manufacturing, marketing and others [7].
B. MOTIVATION
Deep learning is perhaps the most significant development in
the field of computer science in recent times. Its impact has
been felt in nearly all scientific fields. It is already disrupting
and transforming businesses and industries. There is a race
among the world’s leading economies and technology compa-
nies to advance deep learning. There are already many areas
where deep learning has exceeded human level capability
and performance, e.g., predicting movie ratings, decisions to approve loan applications, the time taken by car delivery, etc. [8].
On March 27, 2019 the three deep learning pioneers (Yoshua
Bengio, Geoffrey Hinton, and Yann LeCun) were awarded
the Turing Award, which is also referred to as the ‘‘Nobel
Prize’’ of computing[9]. While a lot has been accomplished,
there is more to advance in deep learning. Deep learning
has a potential to improve human lives with more accurate
diagnosis of diseases like cancer [10], discovery of new drugs,
prediction of natural disasters [11]. E.g., [12] reported that a deep learning network was able to learn from 129,450 images of 2,032 diseases and to diagnose at the same level as 21 board-certified dermatologists. Google AI [10] was able to beat the average accuracy of US board-certified general pathologists in grading prostate cancer, 70% versus 61%.
The goal of this review is to cover the vast subject of deep
learning and present a holistic survey of dispersed informa-
tion under one article. It presents novel work by collating the works of leading authors from across the wide scope and breadth of deep learning. Other review papers [13]–[16] focus on specific areas and implementations without encompassing the full scope of the field. This review covers the different types
FIGURE 2. (a) Feedforward neural network [6]. (b) The unrolling of RNN
in time [6].
of deep learning network architectures, deep learning algo-
rithms, their shortcomings, optimization methods and the
latest implementations and applications.
II. CLASSIFICATION OF NEURAL NETWORK
Neural Networks can be classified into the following different
types.
1. Feedforward Neural Network
2. Recurrent Neural Network (RNN)
3. Radial Basis Function Neural Network
4. Kohonen Self Organizing Neural Network
5. Modular Neural Network
In a feedforward neural network, information flows in just one direction, from the input to the output layer (via hidden nodes if any). The connections do not form any cycles or loopbacks.
Figure 2a shows a particular type of implementation of a multilayer feedforward neural network, with values and functions computed along the forward pass path. Z is the weighted sum of the inputs and y represents the non-linear activation function f of Z at each layer. W represents the weight between the two units in the adjoining layers indicated by the subscript letters, and b represents the bias value of the unit.
Unlike feedforward neural networks, the processing units in an RNN form a cycle. The output of a layer becomes the input to the next layer, which is typically the only layer in the network; thus the output of the layer becomes an input to itself, forming a feedback loop. This allows the network to keep a memory of previous states and use it to influence the current output. One significant outcome of this difference is that, unlike a feedforward neural network, an RNN can take a sequence of inputs and generate a sequence of output values as well, rendering it very useful for applications that require processing sequences of time-phased input data, like speech recognition, frame-by-frame video classification, etc.
Figure 2b demonstrates the unrolling of an RNN in time. E.g., if a three-word sentence constitutes the input, then each word would correspond to a layer, and the network would be unfolded or unrolled three times into a three-layer RNN.
Here is the mathematical explanation of the diagram: x_t represents the input at time t. U, V, and W are the learned parameters that are shared by all steps. o_t is the output at time t. s_t represents the state at time t and can be computed as follows, where f is the activation function, e.g., ReLU:

s_t = f(U x_t + W s_{t−1})    (1)
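As an illustrative sketch of equation (1) (the dimensions, ReLU as f, and the output matrix V are assumptions for demonstration), the recurrence can be unrolled in code as follows:

```python
import numpy as np

def rnn_forward(inputs, U, W, V):
    # inputs: list of input vectors x_t, one per time step
    s = np.zeros(W.shape[0])                   # initial state s_0
    outputs = []
    for x_t in inputs:
        s = np.maximum(0.0, U @ x_t + W @ s)   # eq. (1) with f = ReLU
        outputs.append(V @ s)                  # o_t computed from state s_t
    return outputs

rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(4, 3))   # input-to-state weights, shared by all steps
W = rng.normal(scale=0.1, size=(4, 4))   # state-to-state weights, shared by all steps
V = rng.normal(scale=0.1, size=(2, 4))   # state-to-output weights, shared by all steps
xs = [rng.normal(size=3) for _ in range(3)]   # e.g., a three-word sentence
os = rnn_forward(xs, U, W, V)
```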
The radial basis function neural network is used in classification, function approximation, time series prediction problems, etc. It consists of input, hidden and output layers. The hidden layer includes a radial basis function (implemented as a Gaussian function) and each node represents a cluster center.
The network learns to designate the input to a center and
the output layer combines the outputs of the radial basis
function and weight parameters to perform classification or
inference [17].
The Kohonen self-organizing neural network fits the network model to the input data using unsupervised learning. It consists of two fully connected layers, i.e., an input layer and an output layer. The output layer is organized as a two-dimensional grid. There is no activation function, and the weights represent the attributes (position) of the output layer node. The Euclidean distance between the input data and each output layer node with respect to the weights is calculated. The weights of the node closest to the input data and those of its neighbors are updated to bring them closer to the input data with the formula below [18]:

w_i(t + 1) = w_i(t) + α(t) η_{j*i}(x(t) − w_i(t))    (2)

where x(t) is the input data at time t, w_i(t) is the i-th weight at time t, and η_{j*i} is the neighborhood function between the i-th node and the winning j*-th node.
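A minimal sketch of the update in equation (2), assuming a Gaussian neighborhood function and a 5 × 5 output grid (both illustrative choices, not prescribed by the text):

```python
import numpy as np

def som_update(weights, grid, x, lr, sigma):
    # weights: (n_nodes, dim) attribute vectors of the output-layer nodes
    # grid: (n_nodes, 2) positions of the nodes on the 2D output grid
    dists = np.linalg.norm(weights - x, axis=1)   # Euclidean distance to the input
    winner = np.argmin(dists)                     # closest (winning) node j*
    # Gaussian neighborhood between each node i and the winner j* in grid space
    g = np.exp(-np.linalg.norm(grid - grid[winner], axis=1) ** 2 / (2 * sigma ** 2))
    # eq. (2): move the winner and its neighbors toward the input
    return weights + lr * g[:, None] * (x - weights)

rng = np.random.default_rng(2)
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
W = rng.random((25, 3))                           # 5x5 map of 3D weight vectors
W = som_update(W, grid, rng.random(3), lr=0.5, sigma=1.0)
```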
A modular neural network breaks a large network down into smaller, independent neural network modules. The smaller networks perform specific tasks, whose results are later combined into a single output of the entire network [19].
DNNs are implemented in the following popular ways:
1. Sparse Autoencoders
2. Convolution Neural Networks (CNNs or ConvNets)
3. Restricted Boltzmann Machines (RBMs)
4. Long Short-Term Memory (LSTM)
Autoencoders are neural networks that learn features or encodings from a given dataset in order to perform dimensionality reduction. The sparse autoencoder is a variation of the autoencoder in which some of the units output a value close to zero or are inactive and do not fire. Deep CNNs use multiple layers of unit collections that interact with the input (pixel values in the case of images) and result in the desired feature extraction. CNN finds its application in image recognition, recommender systems and NLP. RBM is used to learn the probability distribution within the data set.
All these networks use backpropagation for training.
Backpropagation uses gradient descent for error reduction,
by adjusting the weights based on the partial derivative of the
error with respect to each weight.
Neural Network models can also be divided into the fol-
lowing two distinct categories:
1. Discriminative
2. Generative
Discriminative model is a bottom-up approach in which
data flows from input layer via the hidden layers to the output
layer. They are used in supervised training for problems like
classification and regression. Generative models on the other
hand are top-down and data flows in the opposite direction.
They are used in unsupervised pre-training and probabilistic
distribution problems. If the input x and corresponding label y
are given, a discriminative model learns the probability dis-
tribution p(y|x), i.e., the probability of y given x directly,
whereas a generative model learns the joint probability of
p(x, y), from which p(y|x) can be predicted [20]. In general, whenever labeled data is available, discriminative approaches are undertaken as they provide effective training; when labeled data is not available, a generative approach can be taken [21].
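A toy numerical sketch of this distinction (the joint counts are made up for illustration): a generative model estimates the joint p(x, y), from which the conditional that a discriminative model would learn directly can be recovered:

```python
import numpy as np

# joint counts of (x, y) observations; rows index x, columns index y
counts = np.array([[30., 10.],     # x = 0
                   [ 5., 55.]])    # x = 1
p_xy = counts / counts.sum()       # generative model: joint p(x, y)

# a discriminative model would estimate p(y|x) directly;
# from the generative joint we recover it via p(y|x) = p(x, y) / p(x)
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x
print(p_y_given_x[0])              # distribution over y when x = 0
```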
Training can be broadly categorized into three types:
1. Supervised
2. Unsupervised
3. Semi-supervised
Supervised learning uses labeled data to train the network, whereas in unsupervised learning there is no labeled data set and thus no learning based on feedback. In unsupervised learning, neural networks are pre-trained using generative models such as RBMs and can later be fine-tuned using standard supervised learning algorithms. The network is then used on a test data set to determine patterns or classifications. Big data has pushed the envelope
even further for deep learning with its sheer volume and
variety of data. Contrary to our intuitive inclination, there is
no clear consensus on whether supervised learning is better
than the unsupervised learning. Both have their merits and
use cases. Reference [22] demonstrated enhanced results with unsupervised learning using unstructured video sequences for camera motion estimation and monocular depth. Modified neural networks such as the Deep Belief Network (DBN), as described by Chen and Lin [23], use both labeled and unlabeled data with supervised and unsupervised learning respectively to improve performance. Developing a way
to automatically extract meaningful features from labeled
and unlabeled high dimensional data space is challenging.
Yann LeCun et al. assert that one way we could achieve this
would be to utilize and integrate both unsupervised and super-
vised learning [24]. Complementing unsupervised learning
(with un-labeled data) with supervised learning (with labeled
data) is referred to as semi-supervised learning.
DNN and training algorithms have to overcome two major
challenges: premature convergence and overfitting. Premature convergence occurs when the weights and biases of the DNN settle into a state that is only optimal at a local level and miss the global minimum of the entire multi-dimensional space. Overfitting, on the other hand, describes a state in which the DNN becomes so highly tailored to a given training data set at a fine-grained level that it becomes unfit, rigid and less adaptable for any other test data set.
Along with different types of training, algorithms and
architecture, we also have different machine learning frame-
works (Table 1) and libraries that have made training models
easier. These frameworks make complex mathematical functions, training algorithms and statistical modeling available without having to write them on your own. Some provide
distributed and parallel processing capabilities, and conve-
nient development and deployment features. Figure 3 shows
a graph with various deep learning libraries along with their
Github stars from 2015-2018. Github is the largest hosting
service provider of source code in the world [25]. Github
stars are indicative of how popular a project is on Github.
TensorFlow is the most popular DL library.
III. DNN ARCHITECTURES
Deep neural network consists of several layers of nodes. Dif-
ferent architectures have been developed to solve problems in
different domains or use-cases. E.g., CNN is used most of the
time in computer vision and image recognition, and RNN is
commonly used in time series problems/forecasting. On the
other hand, there is no clear winner for general problems like
classification as the choice of architecture could depend on
multiple factors. Nonetheless, [27] evaluated 179 classifiers and concluded that parallel random forest (parRF_t), which is essentially a parallel implementation of a decision-tree variant, performed the best. Below are four of the most common architectures of deep neural networks.
1. Convolution Neural Network
2. Autoencoder
3. Restricted Boltzmann Machine (RBM)
4. Long Short-Term Memory (LSTM)
A. CONVOLUTION NEURAL NETWORK
CNN is based on the human visual cortex and is the neural
network of choice for computer vision (image recognition)
FIGURE 3. Github stars by Deep Learning Library [26].
TABLE 1. Popular deep learning frameworks and libraries.
and video recognition. It is also used in other areas such
as NLP, drug discovery, etc. As shown in Figure 4, a CNN
consists of a series of convolution and sub-sampling lay-
ers followed by a fully connected layer and a normalizing
(e.g., softmax function) layer. Figure 4 illustrates the well-
known 7 layered LeNet-5 CNN architecture devised by
LeCun et al. [28] for digit recognition. The series of mul-
tiple convolution layers perform progressively more refined
feature extraction at every layer moving from input to
output layers. Fully connected layers that perform classifica-
tion follow the convolution layers. Sub-sampling or pooling
layers are often inserted between convolution layers. A CNN takes a 2D n × n pixelated image as input. Each
layer consists of groups of 2D neurons called filters or ker-
nels. Unlike other neural networks, neurons in each feature
extraction layers of CNN are not connected to all neurons in
the adjacent layers. Instead, they are only connected to the
spatially mapped fixed sized and partially overlapping neu-
rons in the previous layer’s input image or feature map. This
region in the input is called local receptive field. The lowered
number of connections reduces training time and chances of
overfitting. All neurons in a filter are connected to the same
number of neurons in the previous input layer (or feature map)
and are constrained to have the same sequence of weights and
biases. These factors speed up the learning and reduce the memory requirements for the network. Thus, each neuron in a specific filter looks for the same pattern but in different parts of the input image. Sub-sampling layers reduce the size of the network. In addition, along with local receptive fields and shared weights (within the same filter), they effectively reduce the network's susceptibility to shifts, scaling and distortions of images [29]. Max/mean pooling or local averaging filters
are used often to achieve sub-sampling. The final layers of
CNN are responsible for the actual classifications, where
neurons between the layers are fully connected. Deep CNN
can be implemented with multiple series of weight-sharing
convolution layers and sub-sampling layers. The deep nature
of the CNN results in high quality representations while
maintaining locality, reduced parameters and invariance to
minor variations in the input image [30].
In most cases, backpropagation is used solely for training
all parameters (weights and biases) in CNN. Here is a brief
description of the algorithm. The cost function with respect
to individual training example (x, y) in hidden layers can be
FIGURE 4. 7-layer architecture of CNN for character recognition [28].
defined as [31]:
J(W, b; x, y) = (1/2) ||h_{W,b}(x) − y||²    (3)
The equation for the error term δ for layer l is given by [31]:

δ^(l) = ((W^(l))^T δ^(l+1)) ⊙ f′(z^(l))    (4)

where δ^(l+1) is the error for the (l+1)-th layer of a network whose cost function is J(W, b; x, y), f′(z^(l)) represents the derivative of the activation function, and ⊙ denotes the element-wise product.
∇_{W^(l)} J(W, b; x, y) = δ^(l+1) (a^(l))^T    (5)

∇_{b^(l)} J(W, b; x, y) = δ^(l+1)    (6)

where a is the input, such that a^(1) is the input for the 1st layer (i.e., the actual input image) and a^(l) is the input for the l-th layer.
Error for the sub-sampling layer is calculated as [31]:

δ_k^(l) = upsample((W_k^(l))^T δ_k^(l+1)) ⊙ f′(z_k^(l))    (7)

where k represents the filter number in the layer. In the sub-sampling layer, the error has to be cascaded in the opposite direction, e.g., where mean pooling is used, upsample evenly distributes the error to the previous input units. And finally, here is the gradient w.r.t. the feature maps [31]:

∇_{W_k^(l)} J(W, b; x, y) = Σ_{i=1}^{m} (a_i^(l)) ∗ rot90(δ_k^(l+1), 2)    (8)

∇_{b_k^(l)} J(W, b; x, y) = Σ_{a,b} (δ_k^(l+1))_{a,b}    (9)

where (a_i^(l)) ∗ δ_k^(l+1) represents the convolution between the error and the i-th input in the l-th layer with respect to the k-th filter.
Algorithm 1 below represents a high-level description and
flow of the backpropagation algorithm as used in a CNN as
it goes through multiple epochs until either the maximum
iterations are reached or the cost function target is met.
In addition to discriminative tasks such as image recognition, CNN can also be used for generative tasks such as deconvolving images to make blurry images sharper.
Algorithm 1 CNN Backpropagation Algorithm Pseudo Code
1: Initialize weights to small, randomly generated values
2: Set learning rate to a small positive value
3: Iteration n = 1; Begin
4: for n < max iterations OR until cost function criteria met, do
5: for image x1 to xi, do
6: a. Forward propagate through convolution, pooling and then fully connected layers
7: b. Derive cost function value for the image
8: c. Calculate error term δ^(l) with respect to the weights for each type of layer. Note that the error gets propagated from layer to layer in the following sequence:
9: i. fully connected layer
10: ii. pooling layer
11: iii. convolution layer
12: d. Calculate gradients ∇_{W_k^(l)} and ∇_{b_k^(l)} for the weights and biases respectively for each layer. Gradients are calculated in the following sequence:
13: i. convolution layer
14: ii. pooling layer
15: iii. fully connected layer
16: e. Update weights: w_ji^(l) ← w_ji^(l) + Δw_ji^(l)
17: f. Update bias: b_j^(l) ← b_j^(l) + Δb_j^(l)
Reference [32] achieves this by leveraging Fourier trans-
formation to regularize inversion of the blurred images and
denoising. Different implementations of CNN have shown continuous improvement of accuracy in computer vision. The improvements are tested against the same benchmark (ImageNet) to ensure unbiased results. Here are some well-known variations and implementations of the CNN architecture.
1. AlexNet:
a. CNN developed to run on Nvidia parallel comput-
ing platform to support GPUs
FIGURE 5. Linear representation of a 2D data input using PCA.
2. Inception:
a. Deep CNN developed by Google
3. ResNet:
a. Very deep Residual network developed by
Microsoft. It won 1st place in the ILSVRC
2015 competition on ImageNet dataset.
4. VGG:
a. Very deep CNN developed for large scale image
recognition
5. DCGAN:
a. Deep convolutional generative adversarial net-
works proposed by [33]. It is used in unsupervised
learning of hierarchy of feature representations in
input objects.
B. AUTOENCODER
An autoencoder is a neural network that uses an unsupervised algorithm to learn a representation of the input data set for dimensionality reduction and to recreate the original data set. The learning algorithm is based on an implementation of backpropagation.
Autoencoders extend the idea of principal component
analysis (PCA). As shown in Figure 5, a PCA trans-
forms multi-dimensional data into a linear representation.
Figure 5 demonstrates how a 2D input data can be reduced to a
linear vector using PCA. Autoencoders on the other hand can
go further and produce nonlinear representations. PCA determines a set of linear variables in the directions with the largest variance. The p-dimensional input data points are represented in m orthogonal directions, such that m ≤ p, constituting a lower-dimensional space. The original data points are projected onto the principal directions, thus omitting information in the remaining orthogonal directions.
PCA focuses more on the variances rather than covariances
and correlations and it looks for the linear function with the
most variance [34]. The goal is to determine the direction with the least mean square error, which would then have the least reconstruction error.

FIGURE 6. Training stages in autoencoder [36].
Autoencoders use encoder and decoder blocks of
non-linear hidden layers to generalize PCA to perform
dimensionality reduction and eventual reconstruction of the
original data. It uses greedy layer-by-layer unsupervised pre-training and fine-tuning with backpropagation [35]. Despite
using backpropagation, which is mostly used in supervised
training, autoencoders are considered unsupervised DNN
because they regenerate the input x(i) itself instead of a
different set of target values y(i), i.e., y(i) = x(i). Hinton et al. were able to achieve a near-perfect reconstruction of 784-pixel images using an autoencoder, showing that it can do far better than PCA [36].
While performing dimensionality reduction, autoencoders
come up with interesting representations of the input vector in
the hidden layer. This is often attributed to the smaller number
of nodes in the hidden layer or every second layer of the two-
layer blocks. But even if there are higher number of nodes
in the hidden layer, a sparsity constraint can be enforced
on the hidden units to retain interesting lower dimension
representations of the inputs. To achieve sparsity, some nodes
are restricted from firing, i.e., the output is set to a value close
to zero.
Figure 6 shows single-layer feature detector blocks of RBMs used in pre-training, which is followed by unrolling [36]. Unrolling combines the stacks of RBMs to create the encoder block and then reverses the encoder block to create the decoder section; finally, the network is fine-tuned with backpropagation [36].

FIGURE 7. Autoencoder nodes.
Figure 7 illustrates a simplified representation of how
autoencoders can reduce the dimension of the input data
and learn to recreate it in the output layer. Wang et al. [37]
successfully implemented a deep autoencoder with stacks
of RBM blocks similar to Figure 6 to achieve better mod-
eling accuracy and efficiency than the proper orthogonal
decomposition (POD) method for dimensionality reduction
of distributed parameter systems (DPSs). The equation below describes the average activation ρ̂_j of the j-th unit of the 2nd layer over the m inputs x^(i) that activate the neuron [38]:

ρ̂_j = (1/m) Σ_{i=1}^{m} [a_j^(2)(x^(i))]    (10)
A sparsity parameter ρ is introduced such that ρ is very close to zero, e.g., 0.03, and the aim is to have ρ̂_j = ρ. To ensure that ρ̂_j = ρ, a penalty term KL(ρ||ρ̂_j) based on the Kullback-Leibler (KL) divergence is introduced, such that KL(ρ||ρ̂_j) = 0 if ρ̂_j = ρ, and it grows monotonically as the difference between the two values diverges [38]. Here is the updated cost function [38]:

J_sparse(W, b) = J(W, b) + β Σ_{j=1}^{s_2} KL(ρ||ρ̂_j)    (11)

where s_2 equals the number of units in the 2nd layer and β is the parameter that controls the sparsity penalty term's weight.
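A minimal sketch of equations (10) and (11) in NumPy (the activation matrix, ρ = 0.03, and β are illustrative assumptions, not values from the text):

```python
import numpy as np

def kl_sparsity_penalty(activations, rho=0.03, beta=3.0):
    # activations: (m_examples, s2_units) hidden-layer outputs in (0, 1)
    rho_hat = activations.mean(axis=0)            # eq. (10): average activation
    # eq. (11) penalty: KL(rho || rho_hat) summed over the s2 hidden units
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()                        # added to the base cost J(W, b)

a2 = np.random.default_rng(3).uniform(0.01, 0.99, size=(100, 20))
penalty = kl_sparsity_penalty(a2)
```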
C. RESTRICTED BOLTZMANN MACHINE (RBM)
Restricted Boltzmann Machine is an artificial neural net-
work where we can apply unsupervised learning algorithm to
FIGURE 8. Restricted Boltzmann machine.
build non-linear generative models from unlabeled data [39].
The goal is to train the network to increase a function
(e.g., product or log) of the probability of vector in the
visible units so it can probabilistically reconstruct the input.
It learns the probability distribution over its inputs. As shown in Figure 8, an RBM is made of a two-layer network: the visible layer and the hidden layer. Each unit in the visible layer is connected to all units in the hidden layer, and there are no connections between units in the same layer.
The energy function E of a configuration (v, h) of the visible and hidden units is expressed in the following way [40]:

E(v, h) = −Σ_{i∈visible} a_i v_i − Σ_{j∈hidden} b_j h_j − Σ_{i,j} v_i h_j w_ij    (12)

where v_i and h_j are the states of visible unit i and hidden unit j, a_i and b_j represent the biases of the visible and hidden units, and w_ij denotes the weight between the respective visible and hidden units.
The partition function Z is represented by the sum over all possible pairs of visible and hidden vectors [40]:

Z = Σ_{v,h} e^(−E(v,h))    (13)
The probability of every pair of visible and hidden vectors is given by the following [40]:

p(v, h) = (1/Z) e^(−E(v,h))    (14)
The probability of a particular visible layer vector is provided by the following [40]:

p(v) = (1/Z) Σ_h e^(−E(v,h))    (15)
As the equations above show, lower energy values yield higher probabilities. Thus, during the training process, the weights and biases of the network are adjusted to arrive at a lower energy and thereby maximize the probability assigned to the training vector. It is mathematically convenient to compute the derivative of the log probability of a training vector [40]:

∂ log p(v)/∂w_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model    (16)
FIGURE 9. LSTM block with memory cell and gates.

In the equation above, ⟨v_i h_j⟩_data and ⟨v_i h_j⟩_model represent the expectations under the respective distributions. Thus, the adjustment to the weights can be denoted as follows [40], where ε is the learning rate:

Δw_ij = ε(⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model)    (17)
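In practice, the model expectation in equation (17) is commonly approximated with a single Gibbs sampling step (contrastive divergence, CD-1, which the comparison in Section IV also mentions for RBMs). A minimal sketch under that assumption, with illustrative sizes and binary units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, lr, rng):
    # positive phase: sample hidden units from the data vector v0
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs step yields the model's reconstruction
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # eq. (17): <v_i h_j>_data - <v_i h_j>_model (CD-1 approximation)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)           # visible bias update
    b += lr * (ph0 - ph1)         # hidden bias update
    return W, a, b

rng = np.random.default_rng(4)
W = rng.normal(scale=0.01, size=(6, 3))   # 6 visible, 3 hidden units
a, b = np.zeros(6), np.zeros(3)
v = rng.integers(0, 2, size=6).astype(float)
W, a, b = cd1_update(W, a, b, v, lr=0.1, rng=rng)
```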
D. LONG SHORT-TERM MEMORY (LSTM)
LSTM is an implementation of the recurrent neural network and was first proposed by Hochreiter and Schmidhuber in 1997 [41]. Unlike the earlier described feedforward network architectures, LSTM can retain knowledge of earlier states and can be trained for work that requires memory or state awareness. LSTM partly addresses a major limitation of RNN, i.e., the problem of vanishing gradients, by letting gradients pass unaltered.
LSTM consists of blocks of memory cell state through which
signal flows while being regulated by input, forget and output
gates. These gates control what is stored, read and written on
the cell. LSTM is used by Google, Apple and Amazon in their
voice recognition platforms [42].
In Figure 9, C, x, and h represent the cell, input, and output values. Subscript t denotes the time step value, i.e., t−1 is from the previous LSTM block (or from time t−1) and t denotes current block values. The symbol σ is the sigmoid function and tanh is the hyperbolic tangent function. The operator + is element-wise summation and × is element-wise multiplication. The computations of the gates are described in the equations below [41], [43]:

f_t = σ(W_f x_t + w_f h_{t−1} + b_f)    (18)
i_t = σ(W_i x_t + w_i h_{t−1} + b_i)    (19)
o_t = σ(W_o x_t + w_o h_{t−1} + b_o)    (20)
c_t = f_t ⊗ c_{t−1} + i_t ⊗ σ_c(W_c x_t + w_c h_{t−1} + b_c)    (21)
h_t = o_t ⊗ σ_h(c_t)    (22)

where f, i, o are the forget, input and output gate vectors respectively, and W, w, b and ⊗ represent the weights of the input, the weights of the recurrent output, the bias, and element-wise multiplication respectively.
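A minimal sketch of one LSTM step implementing equations (18)–(22); the parameter dictionary, dimensions and random values are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # eqs. (18)-(20): forget, input and output gates
    f = sigmoid(p["Wf"] @ x_t + p["wf"] @ h_prev + p["bf"])
    i = sigmoid(p["Wi"] @ x_t + p["wi"] @ h_prev + p["bi"])
    o = sigmoid(p["Wo"] @ x_t + p["wo"] @ h_prev + p["bo"])
    # eq. (21): gated update of the cell state (sigma_c = tanh here)
    c = f * c_prev + i * np.tanh(p["Wc"] @ x_t + p["wc"] @ h_prev + p["bc"])
    # eq. (22): block output regulated by the output gate (sigma_h = tanh)
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(5)
n_in, n_hid = 3, 4
p = {}
for g in "fioc":
    p["W" + g] = rng.normal(scale=0.1, size=(n_hid, n_in))    # input weights
    p["w" + g] = rng.normal(scale=0.1, size=(n_hid, n_hid))   # recurrent weights
    p["b" + g] = np.zeros(n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in [rng.normal(size=n_in) for _ in range(5)]:
    h, c = lstm_step(x_t, h, c, p)
```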
There is a smaller variation of the LSTM known as gated
recurrent units (GRU). GRUs are smaller in size than LSTM
as they don’t include the output gate, and can perform better
than LSTM on only some simpler datasets [44], [45].
LSTM recurrent neural networks can keep track of long-term dependencies. Therefore, they are great for learning from sequence input data and building models that rely on context and earlier states. The cell block of LSTM retains pertinent information about previous states. The input, forget and output gates dictate, respectively, the new data going into the cell, what remains in the cell, and the cell values used in the calculation of the LSTM block's output [41], [43].
Naul et al. demonstrated LSTM and GRU based autoencoders
for automatic feature extractions [46].
E. COMPARISON OF DNN NETWORKS
Table 2 provides a compact summary and comparison of
the different DNN architectures. The examples of imple-
mentations, applications, datasets and DL software frame-
works presented in the table are not implied to be exhaustive.
In addition, some of the network architecture categories could be implemented in hybrid fashion. E.g., even though RBMs are generative models and their training is considered unsupervised, they can have elements of a discriminative model when training is fine-tuned with supervised learning.
cations for using different architectures.
IV. TRAINING ALGORITHMS
The learning algorithm constitutes the main part of Deep
Learning. The number of layers differentiates the deep neural
network from shallow ones. The higher the number of layers,
the deeper it becomes. Each layer can be specialized to detect
a specific aspect or feature.
As indicated by Najafabadi et al. [47], in the case of image (face) recognition, the first layer can detect edges, the second can detect higher-level features such as various parts of the face, e.g., ears, eyes, etc., and the third layer can go further up the complexity order by even learning the facial shapes of various persons. Even though each layer might learn or detect a defined feature, the sequence is not always designed for it, especially in unsupervised learning. These feature extractors in each layer had to be manually programmed prior to the development of training algorithms such as gradient descent. Such hand-crafted classifiers did not scale to larger datasets or adapt to variations in the dataset. This message was echoed in the 1998 paper [28] by Yann LeCun et al., where they demonstrated that systems with more automatic learning and fewer manually designed heuristics yield far better pattern recognition.
Backpropagation provides representation learning method-
ology, where raw data can be fed without the need to manually
massage it for classifiers, and it will automatically find the
representations needed for classification or recognition [6].
TABLE 2. DNN network comparison table.
The goal of the learning algorithm is to find the optimal
values for the weight vectors to solve a class of problem in a
domain.
Some of the well-known training algorithms are:
1. Gradient Descent
2. Stochastic Gradient Descent
3. Momentum
4. Levenberg–Marquardt algorithm
5. Backpropagation through time
A. GRADIENT DESCENT
Gradient descent (GD) is the underlying idea in most of
machine learning and deep learning algorithms. It is based
on the concept of Newton’s Algorithm for finding the roots
(or zero value) of a 2D function. To achieve this, we randomly
pick a point in the curve and slide to the right or left along
the x-axis based on negative or positive value of the deriva-
tive or slope of the function at the chosen point until the value
of the y-axis, i.e., function or f(x) becomes zero. The same
idea is used in gradient descent, where we traverse or descend
along a certain path in a multi-dimensional weight space if
the cost function keeps decreasing and stop once the error rate
ceases to decrease. Newton’s method is prone to getting stuck
in local minima if the derivative of the function at the current
point is zero. Likewise, this risk is also present when using
gradient descent on a non-convex function. In fact, the impact
is amplified in the multi-dimensional (each dimension represents a weight variable) and multi-layer landscape of a DNN, where it results in a sub-optimal set of weights. The cost function is one half the square of the difference between the desired output and the actual output, as shown below:

C = (1/2)(y_expected − y_actual)²    (23)

FIGURE 10. Error calculation in multilayer neural network [6].
Backpropagation methodology uses gradient descent.
In backpropagation, chain rule and partial derivatives are
employed to determine error delta for any change in the value
of each weight. The individual weights are then adjusted
to reduce the cost function after every learning iteration of
training data set, resulting in a final multi-dimensional (multi-
weight) landscape of weight values [6]. We process through
all the samples in the training dataset before applying the
updates to the weights. This process is repeated until the objective (aka the cost function) does not decrease any further.
Figure 10 shows the error derivatives in relation to the outputs in each hidden layer, which is the weighted summation of the error derivatives in relation to the inputs of the units in the layer above. E.g., once ∂E/∂z_k is calculated, the partial error derivative with respect to w_jk is equal to y_j · ∂E/∂z_k.
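As a toy illustration of full-batch gradient descent on the quadratic cost (23) (simplified to a single linear unit so the gradient stays easy to read; the data, learning rate, and epoch count are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))            # training inputs
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true                           # desired outputs

w = np.zeros(3)                          # small/zero initial weights
lr = 0.1
for epoch in range(200):
    y_hat = X @ w                        # forward pass over ALL samples
    # gradient of C = 1/2 * mean((y - y_hat)^2) with respect to w (chain rule)
    grad = -(X.T @ (y - y_hat)) / len(X)
    w -= lr * grad                       # one update per full pass (batch GD)
```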
B. STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent (SGD) is the most common
variation and implementation of gradient descent. In gradient
descent, we process through all the samples in the training
dataset before applying the updates to the weights. In SGD, by contrast, updates are applied after running through a mini-batch of n samples. Since we are updating the weights more frequently in SGD than in GD, we can converge towards the global minimum much faster.
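Continuing the toy linear-unit example, the same update applied per mini-batch rather than per full pass (the batch size and epoch count are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])       # synthetic regression targets

w, lr, batch_size = np.zeros(3), 0.1, 10
for epoch in range(50):
    perm = rng.permutation(len(X))       # shuffle so mini-batches vary
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = -(Xb.T @ (yb - Xb @ w)) / len(Xb)
        w -= lr * grad                   # update after every mini-batch of n samples
```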
C. MOMENTUM
In the standard SGD, learning rate is used as a fixed multiplier
of the gradient to compute step size or update to the weight.
This can cause the update to overshoot a potential minimum if the gradient is too steep, or delay the convergence if the gradient is noisy. Using the concept of momentum from physics, the momentum algorithm introduces a velocity variable v that is configured as an exponentially decaying average of the gradient [48]. This helps prevent costly descent in the wrong direction. In the equations below, α ∈ [0, 1) is the momentum parameter and ε is the learning rate.

Velocity update: v ← αv − εg    (24)
Actual update: θ ← θ + v    (25)
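The same toy example with the velocity updates of equations (24) and (25); the values of α and ε are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])

alpha, eps = 0.9, 0.05                   # momentum parameter and learning rate
w, v = np.zeros(3), np.zeros(3)
for epoch in range(50):
    g = -(X.T @ (y - X @ w)) / len(X)    # gradient of the cost
    v = alpha * v - eps * g              # eq. (24): velocity update
    w = w + v                            # eq. (25): actual update
```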
D. LEVENBERG-MARQUARDT ALGORITHM
The Levenberg-Marquardt algorithm (LMA) is primarily used in solving non-linear least squares problems such as curve fitting. In least squares problems, we try to fit given data
points with a function with the least amount of sum of the
squares of the errors between the actual data points and points
in the function. LMA uses a combination of gradient descent
and Gauss-Newton method. Gradient descent is employed
to reduce the sum of the squared errors by updating the
parameters of the function in the direction of the steepest-
descent, while the Gauss-Newton method minimizes the error
by assuming the function to be locally quadratic and finds the
minimum of the quadratic [49].
If the fitting function is denoted by ŷ(t; p) and the m data points are denoted by (t_i, y_i), then the squared error can be written as [49]:

χ²(p) = Σ_{i=1}^{m} [(y(t_i) − ŷ(t_i; p)) / σ_{y_i}]²    (26)
      = (y − ŷ(p))^T W (y − ŷ(p))    (27)
      = y^T W y − 2 y^T W ŷ + ŷ^T W ŷ    (28)

where the measurement error for y(t_i), i.e., σ_{y_i}, is the inverse of the weighting matrix W_ii.
The gradient of the squared error function with respect to the n parameters can be denoted as [49]:

∂χ²/∂p = 2(y − ŷ(p))^T W ∂(y − ŷ(p))/∂p    (29)
       = −2(y − ŷ(p))^T W [∂ŷ(p)/∂p]    (30)
       = −2(y − ŷ)^T W J    (31)

h_gd = α J^T W (y − ŷ)    (32)

where J is the Jacobian matrix of size m × n used in place of [∂ŷ/∂p], and h_gd is the update in the direction of steepest gradient descent.
The equation for the Gauss-Newton method update h_gn is as follows [49]:

[J^T W J] h_gn = J^T W (y − ŷ)    (33)
The Levenberg-Marquardt update h_lm is generated by combining the gradient descent and Gauss-Newton methods, resulting in the equation below [49]:

[J^T W J + λ diag(J^T W J)] h_lm = J^T W (y − ŷ)    (34)
E. BACKPROPAGATION THROUGH TIME
Backpropagation through time (BPTT) is the standard method used to train recurrent neural networks. As shown in Figure 2b, the unrolling of the RNN in time makes it appear like a feedforward network. But unlike a feedforward network, the unrolled RNN has the same exact set of weight values at each layer and represents the training process in the time domain. The backward pass through this time-domain network calculates the gradients with respect to specific weights at each layer. It then averages the updates for the same weight at different time increments (or layers) and applies them so that the value of the weights at each layer continues to stay uniform.
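A minimal sketch of this idea for the simple RNN of equation (1) (tanh activation, an output matrix V, and a squared-error loss are assumptions for illustration): the backward pass accumulates each time step's contribution to the same shared weights, and the result is averaged:

```python
import numpy as np

def bptt_grads(xs, ys, U, W, V):
    # forward pass, storing the state at every time step
    states = [np.zeros(W.shape[0])]
    for x_t in xs:
        states.append(np.tanh(U @ x_t + W @ states[-1]))
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    d_next = np.zeros(W.shape[0])
    # backward pass through the unrolled network, accumulating the
    # gradient contributions of every time step for the SAME shared weights
    for t in reversed(range(len(xs))):
        err = V @ states[t + 1] - ys[t]              # output error at step t
        dV += np.outer(err, states[t + 1])
        ds = V.T @ err + d_next                      # error flowing into s_t
        dz = ds * (1 - states[t + 1] ** 2)           # through tanh
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, states[t])
        d_next = W.T @ dz                            # pass error back in time
    n = len(xs)
    return dU / n, dW / n, dV / n                    # averaged shared-weight updates

rng = np.random.default_rng(11)
U = rng.normal(scale=0.1, size=(4, 3))
W = rng.normal(scale=0.1, size=(4, 4))
V = rng.normal(scale=0.1, size=(2, 4))
xs = [rng.normal(size=3) for _ in range(5)]
ys = [rng.normal(size=2) for _ in range(5)]
dU, dW, dV = bptt_grads(xs, ys, U, W, V)
```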
F. COMPARISON OF DEEP LEARNING ALGORITHMS
Table 3 provides a summary and comparison of common deep
learning algorithms. The advantages and disadvantages are
presented along with techniques to address the disadvantages.
Gradient descent-based training is the most common type of
training. Backpropagation through time is the backpropaga-
tion tailored for recurrent neural network. Contrastive diver-
gence finds its use in probabilistic models such as RBMs.
Evolutionary algorithms can be applied to hyperparameter
optimizations or training models by optimizing weights.
Reinforcement learning could be used in game theory, multi-
agent systems and other problems where both exploitation
and exploration need to be optimized.
V. SHORTCOMINGS OF TRAINING ALGORITHMS
There are several shortcomings with the standard use of
training algorithms on DNNs. The most common ones are
described here.
A. VANISHING AND EXPLODING GRADIENTS
Deep neural networks are prone to vanishing (or exploding) gradients due to the inherent way in which gradients (or derivatives) are computed layer by layer in a cascading manner, with each layer contributing to exponentially decreasing or increasing derivatives. Weights are increased
or decreased based on gradients to reduce the cost func-
tion or error. Very small gradients can cause the network to
take a long time to train, whereas large gradients can cause
the training to overshoot and diverge. This is made worse
by the non-linear activation functions like sigmoid and tanh
functions that squash the outputs to a small range. Since changes in the weights then have only a nominal effect on the output, training can take much longer. This problem can be mitigated by using a piecewise-linear activation function like ReLU and proper weight initialization.
TABLE 3. Deep learning algorithm comparison table.
B. LOCAL MINIMA
A local minimum is always the global minimum in a convex function, which makes gradient descent based optimization foolproof. In nonconvex functions, however, backpropagation based gradient descent is particularly vulnerable to premature convergence into a local minimum. A local minimum, as shown in Figure 11, can easily be mistaken for the global absolute minimum.
C. FLAT REGIONS
Just like local minima, flat regions or saddle points
(Figure 12) also pose similar challenge for gradient descent
based optimization in nonconvex high-dimensional func-
tions. The training algorithm could potentially mislead by this
area as the gradient comes to a halt at this point.
D. STEEP EDGES
Steep edges are another section of the optimization sur-
face area where the steep gradient could cause the gradient
FIGURE 11. Gradient descent.
FIGURE 12. Flat (saddle point marked with black dot) region in a
nonconvex function.
descent-based weight updates to overshoot and miss a potential global minimum.
E. TRAINING TIME
Training time is an important factor to gauge the efficiency
of an algorithm. It is not uncommon for graduate students to
train their model for days or weeks in the computer lab. Most
models require an exorbitant amount of time and large datasets to train. Oftentimes, many of the samples in the datasets do not add value to the training process, and in some cases they introduce noise and adversely affect the training.
F. OVERFITTING
As we add more neurons to DNN, it can undoubtedly model
the network for more complex problems. DNN can lend itself
to high conformability to training data. But there is also a high
risk of overfitting to the outliers and noise in the training data
as shown in Figure 13. This can result in delayed training and testing times and lower-quality predictions on the actual test data. E.g., in classification or clustering problems, overfitting can create a high-order polynomial decision boundary tailored to the training set, which will take longer and produce degraded results for most test data sets. One way to overcome overfitting is to choose the number of neurons in the hidden layer wisely, to match the problem size and type. There are algorithms that can approximate the appropriate number of neurons, but there is no magic bullet; the best bet is to experiment on each use case to get an optimal value.

FIGURE 13. Overfitting in classification.
VI. OPTIMIZATION OF TRAINING ALGORITHMS
The goal of the DNN is to improve the accuracy of the
model on test data. Training algorithms aims to achieve the
end goal by reducing the cost function. The common root cause of three of the five shortcomings mentioned above is that the training algorithms assume the problem surface to be a convex function. The other problem is the high number of nodes and the sheer number of possible combinations of weight values they can take. While weights are learned by training on the dataset, there are additional crucial parameters, referred to as hyperparameters, that are not directly learned from the training dataset. These hyperparameters can take a range of values and add to the complexity of finding the optimal architecture
and model. There is significant room for improvement to the
standard training algorithms. Here are some of the popular
ways to enhance the accuracy of the DNNs.
A. PARAMETER INITIALIZATION TECHNIQUES
Since the solution space is so huge, the initial parameters
have an outsized influence on how fast or slow the train-
ing converges, if at all or if it prematurely converges to a
suboptimal point. Initialization strategies tend to be heuristic
in nature. Reference [50] proposed normalized initialization
where weights are initialized in the following manner:

W ∼ U[−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1})]    (35)
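A minimal sketch of the normalized initialization in equation (35); the layer sizes are illustrative assumptions:

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    # eq. (35): uniform over [-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})]
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(7)
W1 = normalized_init(784, 256, rng)   # input layer -> first hidden layer
W2 = normalized_init(256, 10, rng)    # hidden layer -> output layer
```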
Reference [51] proposed another technique called sparse initialization, where the number of non-zero incoming weights is capped at a certain limit, causing the weights to retain high diversity and reducing the chances of saturation.
B. HYPERPARAMETER OPTIMIZATION
The learning rate and regularization parameters constitute the most commonly used hyperparameters in DNN. The learning rate determines the rate at which the weights are updated. The purpose of regularization is to prevent overfitting, and the regularization parameter affects its degree of influence on the loss function. CNNs have additional hyperparameters, i.e., the number of filters, filter shapes, number of dropouts and max pooling shapes at each convolution layer, and the number of nodes in the fully connected layer. These parameters are
very important for training and modeling a DNN. Coming
up with an optimal set of parameter values is a challeng-
ing feat. Exhaustively iterating through each combination
of hyperparameter values is computationally very expensive.
For example, if training and evaluating a DNN with the full
dataset takes ten minutes, then with seven hyperparameters
each with eight potential values will take (87 × 10 min), i.e.,
20,971,520 minutes or almost 40 years to exhaustively train
and evaluate the network on all combinations of the hyperpa-
rameter values. Hyperparameter can be optimized with differ-
ent metaheuristics. Metaheuristics are nature inspired guiding
principles that can help in traversing the search space more
intelligently yet much faster than the exhaustive method.
Particle Swarm Optimization (PSO) is another type of
metaheuristic that can be used for hyperparameter optimiza-
tion. PSO is modeled on how birds fly around in search of food or during migration. The velocity and location
of birds (or particles) are adjusted to steer the swarm towards
better solution in the vast search space. Escalante et al. used
PSO for hyperparameter optimization to build a competitive
model that ranked among the top relative to other comparable
methods [52].
Genetic algorithm (GA) is a metaheuristic that is com-
monly used to solve combinatorial optimization problems.
It mimics the selection and crossover processes of species reproduction and how they contribute to evolution and improvement of the species' prospects of survival. Figure 14a
shows a high-level diagram of the GA. Figure 14b illustrates
the crossover process where parts of the respective genetic
sequence are merged from both the parents to form the new
genetic sequence in the children. The goal is to find a pop-
ulation member (a sequence of numbers resembling DNA
nucleotides) that meets the fitness requirement. Each pop-
ulation member represents a potential solution. Population
members are selected based on different methods, e.g., elite,
roulette, rank and tournament.
Elite method ranks population members by fitness and
only uses high fitness members for the crossover process.
The mutation process then makes random changes to the
number sequence and the entire process continues until a
desired fitness or maximum number of iterations are reached.
References [53], [54] propose parallelization and hybridiza-
tion of GA to achieve better and faster results. Parallelization
provides both speedup and better results, as we can periodically exchange population members between the distributed and parallel operations of genetic algorithms on different sets of population members.

FIGURE 14. (a) Genetic algorithm [53]. (b) Crossover in genetic algorithm.

Hybridization is the process of mixing
the primary algorithm (GA in this case) with other operations,
like local search. Shrestha and Mahmood [53] incorporated
2-Opt local search method into GA to improve the search
for the optimal solution. Reference [55] postulates that correctly performed exchanges (e.g., in GA) breed innovation and result in creative solutions to hard problems, just as collaboration and exchanges between individuals, organizations and societies do in real life. In addition to GA,
other variations of evolution-based metaheuristics have also
been used to evolve and optimize deep learning architectures
and hyperparameters. E.g., [56] proposed CoDeepNEAT
framework based on deep neuroevolution technique for
finding an optimized architecture to match the task at
hand.
C. ADAPTIVE LEARNING RATES
Learning rates have a huge impact on training DNNs. They can speed up the training time, help navigate flat surfaces better, and overcome pitfalls of non-convex functions. Adaptive
learning rates allow us to change the learning rates for param-
eters in response to gradient and momentum. Several innova-
tive methods have been proposed. Reference [48] describes
the following:
1. Delta-bar Algorithm
2. AdaGrad
3. RMSProp
4. Adam
In Delta-bar algorithm, the learning rate of the param-
eter is increased if the partial derivative with respect to it
stays in the same sign and decreased if the sign changes.
AdaGrad is more sophisticated [57] and prescribes an
inversely proportional scaling of the learning rates to the
square root of the cumulative squared gradient. AdaGrad is not effective for all DNN training: since the change in the learning rate is a function of the entire history of gradients, the learning rate can shrink too aggressively, making AdaGrad susceptible to premature convergence.
RMSProp algorithm is a modification of AdaGrad algo-
rithm to make it effective in a nonconvex problem space.
RMSProp replaces the summation of squared gradients in AdaGrad with an exponentially decaying moving average of the gradient, effectively dropping the impact of the distant historical gradient [48]. Adam, which denotes adaptive moment estimation, is the latest evolution of the adaptive learning algorithms and integrates ideas from AdaGrad, RMSProp and momentum [58]. Just like AdaGrad and RMSProp, Adam provides an individual learning rate for each parameter. Adam combines the benefits of both the earlier methods and does a better job of handling non-stationary objectives and both noisy and sparse gradient problems [58]. Adam uses the first moment of the gradients (i.e., the mean, as used in RMSProp) as well as the second moment (the uncentered variance), utilizing an exponential moving average of the squared gradient [58].
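A minimal sketch of one Adam parameter update, using the default hyperparameter values suggested in [58] (the toy quadratic objective is an assumption for illustration):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # first moment: exponential moving average of the gradient (mean)
    m = b1 * m + (1 - b1) * grad
    # second moment: exponential moving average of the squared gradient
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction for the warm-up phase
    v_hat = v / (1 - b2 ** t)
    # per-parameter step, scaled by the uncentered variance estimate
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # toy quadratic objective
    w, m, v = adam_update(w, grad, m, v, t)
```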
Figure 15 shows the relative performance of the various adaptive learning rate mechanisms, where Adam outperforms the rest.
D. BATCH NORMALIZATION
As the network is being trained, with variations to the weights and parameters, the distribution of the actual data inputs at each layer of the DNN changes too, often making them all too large or too small and thus difficult to train on, especially with activation functions that implement saturating nonlinearities, e.g., sigmoid and tanh. Ioffe and Szegedy [59] proposed the idea of batch normalization in 2015. It has made a huge difference in improving the training time and accuracy of DNNs. It updates the inputs at each mini-batch to have unit variance and zero mean.
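A minimal sketch of the mini-batch normalization step, including the learned scale γ and shift β of the batch normalization formulation (the sizes and synthetic data are illustrative assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features) pre-activations of one layer for a mini-batch
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.default_rng(8).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```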
E. SUPERVISED PRETRAINING
Supervised pretraining constitutes breaking complex problems down into smaller parts, training the simpler models, and later combining them to solve the larger problem. Greedy algorithms are commonly used in supervised pre-training of DNNs.

FIGURE 15. Multilayer network training cost on MNIST dataset using different adaptive learning algorithms [58].

FIGURE 16. DNN with and without dropout.
F. DROPOUT
There are a few commonly used methods to lower the risk of
overfitting. In the dropout technique, we randomly choose
units and nullify their weights and outputs so that they
do not influence the forward pass or the backpropagation.
Figure 16 shows a fully connected DNN on the left and a
DNN with dropout on the right. The other methods include
the use of regularization and simply enlarging the training
dataset using label-preserving techniques. Dropout works
better than regularization at reducing the risk of overfitting
and also speeds up the training process. Reference [60]
proposed the dropout technique and demonstrated significant
improvement on supervised learning based DNN for com-
puter vision, computational biology, speech recognition and
document classification problems.
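Below is a minimal sketch of the dropout idea of [60] in its common "inverted dropout" form, where surviving activations are rescaled during training so that nothing needs to change at test time; the inverted scaling is a standard implementation convenience, not part of the original formulation.

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True):
    # During training, zero out each unit with probability p_drop so it
    # influences neither the forward pass nor the backpropagation.
    if not train:
        return x
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask
```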
G. TRAINING SPEED UP WITH CLOUD AND GPU
PROCESSING
Training time is one of the key performance indicators of
machine learning. Cloud computing and GPUs lend them-
selves very well to speeding up the training process. Cloud
provides massive amounts of compute power and now all
major cloud vendors include GPU powered servers that can
easily be provisioned and used for training DNNs on demand
at competitive prices. Cloud vendor Amazon Web Services'
(AWS) P2 instances provide up to 40 thousand parallel GPU
cores, and its P3 GPU instances are further optimized for
machine learning [61].
H. SUMMARY OF DL ALGORITHMS SHORTCOMINGS
AND RESOLUTIONS TECHNIQUES
Table 4 provides a summary of deep learning algorithm short-
comings and resolution techniques. The table also lists the
causes and effects of the shortcomings.
VII. ARCHITECTURES & ALGORITHMS – IMPLEMENTATIONS
This section describes different implementations of neural
networks using a variety of training methods, network archi-
tectures and models. It also includes models and ideas that
have been incorporated into machine learning in general.
A. DEEP RESIDUAL LEARNING
The ability to add more layers to DNN has allowed us to solve
harder problems. Microsoft Research Asia (MSRA) applied
100- and 1000-layer deep residual networks (ResNet) to the CIFAR-10
dataset and won 1st place in the ILSVRC 2015 competi-
tion with a 152-layer DNN on the ImageNet dataset [62].
Figure 17 demonstrates a simplified version of Microsoft's
winning deep residual learning model. Despite the depth of
these networks, simply adding more layers to a DNN does not
improve or guarantee results. To the contrary, it can degrade
the quality of the solution. This makes training DNNs not
so straightforward. The MSRA team was able to overcome
the degradation by making the stacked layers fit
a residual mapping instead of the desired mapping, with the
following function [62]:
$$F(x) := H(x) - x \quad (36)$$
where F(x) is the residual mapping and H(x) is the desired
mapping; the desired mapping is then recast as H(x) = F(x) + x
at the end [62]. According to the MSRA team, it is much easier to
optimize the residual mapping.
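The following sketch illustrates the identity-shortcut idea behind (36): the stacked layers learn the residual F(x), and the input x is added back so the block outputs H(x) = F(x) + x. The two fully connected layers and weight names are hypothetical simplifications of the convolutional blocks used in [62].

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    # The stacked layers learn only the residual mapping F(x) ...
    f = relu(x @ W1) @ W2
    # ... and the identity shortcut recasts it to the desired mapping
    # H(x) = F(x) + x (shapes of f and x must match).
    return relu(f + x)
```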
B. ODDBALL STOCHASTIC GRADIENT DESCENT
All training data are not created equal. Some examples will have
higher training error than others. Yet we assume that they are the
same and thus use each training example the same number
of times. Simpson [63] argues that this assumption is invalid
and makes a case in his paper for the number of times a
training example is used to be proportional to its respective
training error. So, if a training example has a higher error,
it will be used to train the network a higher number of times
than the other training examples. Simpson [63] validates his
methodology, termed Oddball Stochastic Gradient Descent,
with a training set of 1000 video frames. Simpson [63] created
a training selection probability distribution over the training
examples based on their error values and pegged the frequency
of using each training example to that distribution.
TABLE 4. DL algorithm shortcomings & resolution techniques.
FIGURE 17. Deep residual learning model by MSRA at Microsoft.
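A minimal sketch of this error-proportional selection step, assuming the per-example errors have already been computed, is shown below; it reflects our reading of [63], and the names are ours.

```python
import numpy as np

def oddball_sample(errors, n_draws):
    # Draw training indices with probability proportional to each example's
    # current error, so high-error examples are trained on more often.
    p = errors / errors.sum()
    return np.random.choice(len(errors), size=n_draws, p=p)
```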
C. DEEP BELIEF NETWORK
Chen and Lin [23] highlight the fact that a conventional neural
network can easily get stuck in local minima when the func-
tion is non-convex. They propose a DNN architecture called
large-scale deep belief network (DBN) that uses both labeled
and unlabeled data to learn feature representations. DBNs are made
up of layers of RBMs stacked together and learn the probability
distribution of the input vectors. They employ unsupervised
pre-training and fine-tuned supervised algorithms and tech-
niques to mitigate the risk of getting trapped in local minima.
Below is the equation [23] for the change in weights, where c is
the momentum factor, α is the learning rate, and v and h
are the visible and hidden units respectively:
$$\Delta w_{ij}(t+1) = c\,\Delta w_{ij}(t) + \alpha\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}\right) \quad (37)$$

The equations [23] for the conditional probability distributions of the hidden and visible units are:

$$p(h_j = 1 \mid v; W) = \sigma\left(\sum_{i=1}^{I} w_{ij} v_i + a_j\right) \quad (38)$$

$$p(v_i = 1 \mid h; W) = \sigma\left(\sum_{j=1}^{J} w_{ij} h_j + b_i\right) \quad (39)$$
D. BIG DATA
Big data provides tremendous opportunity and challenge for
deep learning. Big data is known for the 4 Vs (volume, veloc-
ity, veracity, variety). Unlike the shallow networks, the huge
volume and variety of data can be handled by DNNs and
significantly improve the training process and the ability to
fit more complex models. On the flip side, the sheer veloc-
ity of data that is generated in real time can be daunting
to process. Najafabadi et al. [47] raise similar challenges in
learning from real-time streaming data, such as monitoring
credit card usage for fraud detection. They propose using
parallel and distributed processing with thousands of CPU
cores. In addition, we should also use cloud providers that
support auto-scaling based on usage and workload. Not all
data represent the same quality. In the case of computer
vision, images from constrained sources, e.g., studios, are
much easier to recognize than the ones from unconstrained
sources like surveillance cameras. Reference [64] proposes a
method to utilize multiple images of the unconstrained source
to enhance the recognition process.
Deep learning can help mine and extract useful patterns
from big data and build models for inference, prediction
and business decision making. There are massive volumes
of structured and unstructured data and media files getting
generated today, making information retrieval very chal-
lenging. Deep learning can help with semantic indexing to
enable information to be more readily accessible in search
engines [14], [65]. This involves building models that capture
the relationships between documents and the keywords they
contain to make information retrieval more effective.
FIGURE 18. Learning multiple layers of representation.
E. GENERATIVE TOP DOWN CONNECTION
(GENERATIVE MODEL)
Much of the training is usually implemented with a bottom-
up approach, where discriminatory or recognition models
are developed using backpropagation. A bottom-up model is
one that takes the vector representation of input objects and
computes higher-level feature representations at subsequent
layers, with a final discrimination or recognition pattern at the
output layer. One of the shortcomings of backpropagation is
that it requires labeled data to train. Geoffrey Hinton proposed
a novel way of overcoming this limitation in 2007 [66].
He proposed a multi-layer DNN that used generative top-
down connection as opposed to bottom-up connection to
mimic the way we generate visual imagery in our dream
without the actual sensory input. In top-down generative
connection, the high-level data representation or the out-
puts of the networks are used to generate the low-level raw
vector representations of the original inputs, one layer at a
time. The layers of feature representations learned with this
approach can then be further perfected either in generative
models such as auto-encoders or even standard recognition
models [66].
In the generative model in Figure 18, since the correct
upstream cause of the events in each layer is known, a com-
parison between the actual cause and the prediction made
by the approximate inference procedure can be made, and
the recognition weights r_ij can be adjusted to increase the
probability of correct prediction.
FIGURE 19. Four-layer DBN & four-layer deep Boltzmann machine.
Here is the equation [66] for adjusting the recognition
weights r_ij:

$$\Delta r_{ij} \propto h_i \left( h_j - \sigma\left(\sum_i h_i r_{ij}\right) \right) \quad (40)$$
F. PRE-TRAINING WITH UNSUPERVISED DEEP
BOLTZMANN MACHINES
The vast majority of DNN training is based on supervised learn-
ing. In real life, our learning is based on both supervised and
unsupervised learning; in fact, most of our learning is unsu-
pervised. Unsupervised learning is more relevant in today's
age of big data analytics because most raw data is unlabeled
and un-categorized [47]. One way to overcome the limitation
of backpropagation, where it gets stuck in local minima, is to
incorporate both supervised and unsupervised training. It is
quite evident that top-down generative unsupervised learning
is good for generalization because it essentially adjusts the
weights by trying to match or recreate the input data one
layer at a time [67]. After this effective unsupervised pre-
training, we can always fine-tune it with some labeled data.
Geoffrey Hinton and Ruslan Salakhutdinov describe multiple
layers of RBMs that are stacked together and trained layer by
layer in a greedy, unsupervised way, essentially creating what
is called the Deep Belief Network. They further modify the stacks
to make them undirected models with symmetric weights,
thus creating the Deep Boltzmann Machine (DBM). A four-
layer deep belief network and a four-layer deep Boltzmann machine
are shown in Figure 19. In [67] the DBM layers were pre-
trained one at a time using an unsupervised method and then
tweaked using supervised backpropagation on the MNIST
and NORB datasets, as shown in Figure 20. They [67] received
favorable results, validating the benefits of combining supervised
and unsupervised learning methods.
Here are the equations [67] showing the probability distributions
over the visible units and the two sets of hidden units in the DBM
(after unsupervised pre-training):

$$p(v_i = 1 \mid h^1) = \sigma\left(\sum_j W^1_{ij} h^1_j\right) \quad (41)$$

$$p(h^2_m = 1 \mid h^1) = \sigma\left(\sum_j W^2_{jm} h^1_j\right) \quad (42)$$

$$p(h^1_j = 1 \mid v, h^2) = \sigma\left(\sum_i W^1_{ij} v_i + \sum_m W^2_{jm} h^2_m\right) \quad (43)$$

FIGURE 20. Pretraining of stacked & altered RBM to create a DBM [67].
FIGURE 21. DBM getting initialized as deterministic neural network with
supervised fine-tuning [67].
Post unsupervised pre-training, the DBM is converted
into a deterministic multi-layer neural network by fine-
tuning the network with supervised learning using labeled
data, as demonstrated in Figure 21. The approximate
posterior distribution q(h|v) is generated for each input vec-
tor, the marginals q(h^2_j = 1|v) are added as an addi-
tional input for the network, and subsequently backpropagation
is used to fine-tune the network [67].
G. EXTREME LEARNING MACHINE (ELM)
There have been other variations of learning methodologies.
While more layers allow us to extract more complex features
and patterns, some problems might be solved faster and bet-
ter with fewer layers. Reference [68] proposed a
four-layer CNN termed DeepBox that outperformed larger
networks in speed and accuracy for evaluating objectness.
ELM is another type of neural network, with just one hid-
den layer. Linear models are learnt from the dataset in a
single iteration by adjusting the weights between the hid-
den layer and the output, whereas the weights between the
input and the hidden layer are randomly initialized and
fixed [69].
ELM can obviously converge much faster than backprop-
agation, but it can only be applied to simpler classification
and regression problems. Since proposing ELM in 2006,
Guang-Bin Huang et al. came up with a multilayer version
of ELM in 2016 [70] to take on more complex problems.
They combined unsupervised multilayer encoding with the
random initialization of the weights and demonstrated faster
convergence, i.e., lower training time, than the state-of-the-art
multilayer perceptron training algorithms.
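A minimal sketch of single-hidden-layer ELM training as described in [69] follows: the input weights are random and fixed, and the output weights are solved in a single shot with the Moore-Penrose pseudoinverse. The tanh activation and variable names are our choices. For classification, y can be one-hot labels and predictions taken as the argmax of the output.

```python
import numpy as np

def train_elm(X, y, n_hidden=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random, fixed input-to-hidden weights and biases.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)          # hidden-layer feature matrix
    beta = np.linalg.pinv(H) @ y    # least-squares output weights, one iteration
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```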
H. MULTIOBJECTIVE SPARSE FEATURE LEARNING MODEL
Gong et al. [71] developed a multi-objective sparse feature
learning (MO-SFL) model based on the autoencoder, where they
used an evolutionary algorithm to optimize two competing
objectives: the sparsity of the hidden units and the reconstruction
error (of the input vector of the AE). It fares better than models where
the sparsity is determined by human intervention or less-than-
optimal methods.
Since the time complexity of evolutionary algorithms is
high, they [71] utilize self-adaptive multi-objective differen-
tial evolution (DE) based on decomposition (Sa-MODE/D)
to cut down on time, and demonstrate better results
than standard AE (autoencoder), SR-RBM (sparse response
RBM) and SESM (sparse encoding symmetric machine) by
testing on the MNIST dataset and comparing the results with
other implementations. Their learning procedure continu-
ously iterates between the evolutionary optimization step and
stochastic gradient descent on the reconstruction error [71].
• Step 1: Multi-objective optimization to select the most
optimal point on the Pareto frontier for both objectives
• Step 2: Optimize parameters θ and θ' with stochastic
gradient descent on the following reconstruction error
function (of the autoencoder), where D is the training data
set and L(x, y) is the loss function, with x representing the
input and y the output, i.e., the reconstructed input:

$$\sum_{x \in D} L\left(x,\; g_{\theta'}(f_\theta(x))\right) \quad (44)$$
Figure 22 shows a Pareto frontier that can be used
to achieve a compromise between two competing objective
functions.
FIGURE 22. Pareto Frontier.
FIGURE 23. Spectral clustering representation.
I. MULTICLASS SEMI-SUPERVISED LEARNING BASED
ON KERNEL SPECTRAL CLUSTERING
Mehrkanoon et al. [72] proposed a multiclass learning algo-
rithm based on Kernel Spectral Clustering (KSC) using both
labeled and unlabeled data. The novelty of their proposal
is the introduction of regularization terms added to the cost
function of KSC, which allow labels or membership to be
applied to unlabeled data examples. It is achieved in the
following way [72]:
• Unsupervised learning based on kernel spectral cluster-
ing (KSC) is used as the core model
• A regularization term is introduced and labels (from
labeled data) are added to the model
Figure 23 illustrates data points in a spectral clustering
representation. Spectral clustering (SC) is an algorithm that
divides the data points in a graph using the Laplacian (double
derivative) operator, whereas KSC is an extension
of SC that uses the Least Squares Support Vector Machines
methodology [73].
Since unlabeled data is more abundantly available relative
to labeled data, it would be beneficial to make the most of it
with unsupervised or in this case semi-supervised learning.
J. VERY DEEP CONVOLUTIONAL NETWORKS FOR
NATURAL LANGUAGE PROCESSING
Deep CNNs have mostly been used in computer vision, where
they are very effective. Conneau et al. [74] applied them for the first
time to NLP with up to 29 convolution layers. The goal is
to analyze and extract layers of hierarchical representations
from words and sentences at the syntactic, semantic and con-
textual level. One of the major reasons for the lack of earlier deep
CNNs for NLP is that deeper networks tend to cause
saturation and degradation of accuracy, in addition to
the processing overhead of more layers. He et al. [62] state
that the degradation is not caused by overfitting but because
deeper systems are difficult to optimize. Reference [62]
addressed this issue with shortcut connections between the
convolution blocks to let the gradients propagate more
freely, and they, along with [74], were able to validate
the benefits of the shortcuts with 50/101/152 layers and
49 layers respectively. The Conneau et al. [74] architecture con-
sists of a series of convolution blocks separated by pooling layers
that halve the resolution, followed by k-max pooling and
classification at the end.
K. METAHEURISTICS
Metaheuristics can be used to train neural networks to
overcome the limitations of backpropagation-based learning.
When implementing a metaheuristic as the training algorithm,
each weight of the neural network is represented
by a dimension in the multi-dimensional solution search
space of the problem we are trying to solve. The goal is to
come as near as possible to the optimal values of the weights,
i.e., a location in the search space that represents the global
best solution. Particle Swarm Optimization (PSO) is a type
of metaheuristic inspired by the movement of birds in the
sky: particles, or candidate solutions, move about
in a search space to reach a near-optimal solution. In their
paper [75], N. Krpan and D. Jakobovic ran parallel imple-
mentations using backpropagation and PSO. Their results
demonstrate that while parallelization improves the efficacy
of both algorithms, parallel backpropagation is efficient only
on large networks, whereas parallel PSO has wider influence
on various sizes of problems.
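A minimal PSO sketch for this weight-training formulation is shown below: each particle is a full weight vector, and velocities are pulled toward each particle's personal best and the swarm's global best. The loss function, inertia weight and attraction coefficients are generic placeholders, not the setup used in [75].

```python
import numpy as np

def pso_minimize(loss, dim, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))       # candidate weight vectors
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([loss(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()             # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Inertia plus attraction toward personal and global bests.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([loss(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g
```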
Similarly, Dong and Zhou [76] complemented PSO with a
supervised learning control module to guide the search for
the global minimum of an optimization problem. The supervised
learning module provided real-time feedback with back dif-
fusion (BD) to retain diversity and social attractor renewal
to overcome stagnation [76]. Metaheuristics provide high-
level guidance inspired by nature and apply it to solve
mathematical problems. In a similar way, [77] proposes incor-
porating the concepts of an intelligent teacher and privileged
information, which is essentially extra information available
during training but not during evaluation or testing, into the
DNN training process.
L. GENETIC ALGORITHM
Genetic Algorithm (GA) is a metaheuristic that can be effectively
used in training DNNs. GA mimics the evolutionary processes
of selection, crossover and mutation. Each population mem-
ber represents a possible solution, i.e., a set of weights. Unlike
PSO, which includes only one operator for adjusting the solu-
tion, evolutionary algorithms like GA include various steps,
i.e., selection, crossover and mutation methods [52]. Popu-
lation members undergo several iterations of selection and
crossover based on known strategies to achieve better solutions
in the next iteration or generation. GA has undergone decades
of improvement and refinement since it was first proposed
in 1976 [78]. There are several ways to perform selec-
tion, e.g., elite, roulette, rank and tournament [79]. There are
about a dozen ways to perform crossover catalogued by Larrañaga et al.
alone [80]. Selection methodologies represent exploration of
the solution space, and crossovers represent the exploitation of
the selected solution candidates. The goal is to get a better solu-
tion through wider exploration and deeper exploitation. Additional
tweaking can be introduced with mutation. Parallel clusters
of GA can be executed independently in islands, with a few
members exchanged between the islands every so often [81].
In addition, we can also utilize a local search such as a greedy
algorithm, Nearest Neighbor or the K-opt algorithm to further
improve the quality of the solution.
Lin et al. [82] demonstrated a successful incorporation
of GA that resulted in better classification accuracy and
performance of a Polynomial Neural Network. Standard GA
operations including selection, crossover and mutation were
used on parameters that included partial descriptions (PDs)
of inputs in the first layer, bias and all input features [82].
GA was further enhanced with the incorporation of the
concept of mitochondrial DNA (mtDNA). In evolution, it is
quite evident from casual observation and simple reasoning that
crossover of population members with too much similarity
does not yield much variance in the offspring. Likewise,
we can infer that in GA, selection and crossover between
solutions that are very similar would not result in a high degree
of exploration of the multi-dimensional solution space.
In fact, it might run the risk of getting pigeonholed into a
restricted pattern.
Diversity is the key to overcoming the risk of getting stuck
in local minima. This risk can be mitigated by exploiting the
idea of mtDNA. mtDNA represents one percent of the human
chromosomes [83]. The concept of incorporating mitochon-
drial DNA into GA was introduced by Shrestha and Mah-
mood [53]. They describe a way to restrict crossover between
population members or solution candidates based on the proximity
of their mtDNA values [53]. Unlike the rest of the 99% of DNA,
mtDNA is only inherited from the female, thus it is a more
continuous marker of lineage or genetic proximity. The
premise behind this is that offspring of population members
with similar genetic makeup don't help with overcoming
the local minima.
FIGURE 24. Continental model with mtDNA [53].
Figure 24 describes the parallel and distributed
nature of their full implementation [53], along with
the GA operators (selection, mutation and mtDNA incorpo-
rated crossover). The training process is enhanced [53] with
the implementation of a continental model, where distributed
servers run multiple threads, each running an instance of
GA with mtDNA. Population members are then exchanged
between the servers after a fixed number of iterations, as shown
in Figure 24.
M. NEURAL MACHINE TRANSLATION (NMT)
Neural Machine Translation is a turnkey solution used in the
translation of sentences. While it provides some improvement
over traditional statistical machine translation (SMT),
it is not scalable for large models or datasets. It also requires a
lot of computational power for training and translation, and
has difficulty with rare words. For these reasons, large tech
companies like Google and Microsoft have both improved on
NMT with their own implementations, labeled
Google Neural Machine Translation (GNMT) and Skype
Translator respectively. GNMT, shown in Figure 25, con-
sists of encoder and decoder LSTM blocks organized in layers
and was presented in 2016 in [84]. It overcomes the shortcomings
of NMT with an enhanced deep LSTM neural network that
includes 8 encoder and 8 decoder layers, and a method to
break down rare, difficult words to infer their meaning. On
the WMT'14 translation benchmarks, GNMT achieved
results on par with the state of the art for the English-to-French and
English-to-German language pairs [84].
N. MULTI-INSTANCE MULTI-LABEL LEARNING
Images in real life include multiple instances (objects)
and need multiple labels to describe them. E.g., a pic-
ture of an office space could include a laptop computer,
a desk, a cubicle and a person typing on the computer.
Zhou et al. [85] proposed MIML (Multi-Instance Multi-Label
learning) framework and corresponding MIMLBOOST and
MIMLSVM algorithms for efficient learning of individual
object labels in complex high level concepts, e.g., like the
office space. The goal is to learn f : 2^X → 2^Y from the dataset
{(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)}, where X_i ⊆ X is
a set of instances {x_{i1}, x_{i2}, ..., x_{i,n_i}}, x_{ij} ∈ X (j = 1, 2, ..., n_i),
and Y_i ⊆ Y is a set of labels {y_{i1}, y_{i2}, ..., y_{i,l_i}},
y_{ik} ∈ Y (k = 1, 2, ..., l_i), where n_i is the number of
instances in X_i and l_i is the number of labels in Y_i [85].
MIMLBOOST uses category-wise decomposition into tra-
ditional single-instance, single-label supervised learning,
whereas MIMLSVM utilizes cluster-based feature transfor-
mation. So, instead of trying to learn the idea of complex
entities (e.g., an office space) directly, [85] took the alternate
route: it learned the lower-level individual objects and inferred
the higher-level concepts.
O. ADVERSARIAL TRAINING
Machine learning training and deployment used to be done on
isolated computers, but now they are increasingly done in
highly interconnected commercial production environments.
Take a face recognition system, where a network could be
trained on a fleet of servers with a training dataset imported
from an external data source, and the trained model could
be deployed on another server that accepts API calls with
real-time inputs (e.g., images of people entering a building)
and responds with matches. The interconnected architecture
exposes the machine learning system to a wide attack surface. The
real-time input or training dataset can be manipulated by an
adversary to compromise the output (the image match produced by the
network) or the entire model, respectively.
FIGURE 25. GNMT architecture [84] with encoder neural network on the left and decoder neural network on the right.
Adversarial machine learning is a relatively new field of
research that takes into account these new threats to machine
learning. According to [86], adversaries (e.g., email spammers)
can exploit the lack of a stationary data distribution and manip-
ulate the input (e.g., an actual spam email) so that it is classified
as a normal email. Reference [86] demonstrates these and other
vulnerabilities and discusses how the application domain, features
and data distribution can be used to reduce the risk and impact
of such adversarial attacks.
P. GAUSSIAN MIXTURE MODEL
A Gaussian mixture model (GMM) is a statistical probabilistic
model used to represent multiple normal (Gaussian) distribu-
tions within a larger distribution, typically fitted with the EM
(expectation maximization) algorithm in an unsupervised setting. E.g.,
a GMM could be used to represent the height distribution of
a large population group with two Gaussian distributions for
the male and female sub-groups. Figure 26 demonstrates
a GMM with three Gaussian distributions within itself.
GMMs have been used primarily in speech recognition and
for tracking objects in video sequences. GMMs are very effec-
tive at extracting speech features and can model the prob-
ability density function to a desired level of accuracy as
long as there are sufficient components, and expectation
maximization makes it easy to fit the model [87]. The
probability density function of the GMM is given by the
following [87]:

$$p(x) = \sum_{m=1}^{M} c_m\, \mathcal{N}(x;\, \mu_m, \Sigma_m), \quad c_m > 0 \quad (45)$$

where M is the number of Gaussian components, c_m is the
weight of the m-th Gaussian, and N(x; μ_m, Σ_m) is the Gaussian
density of the random variable x with mean vector μ_m and
covariance matrix Σ_m.
FIGURE 26. GMM example with three components.
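A direct NumPy evaluation of the density in (45) might look like the following sketch; the weights, means and covariances are assumed to be given (e.g., from a prior EM fit), and the names are ours.

```python
import numpy as np

def gmm_pdf(x, weights, means, covs):
    # Weighted sum of M multivariate Gaussian densities, Eq. (45).
    p = 0.0
    d = len(x)
    for c, mu, S in zip(weights, means, covs):
        diff = x - mu
        norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(S))
        p += c * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm
    return p
```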
Q. SIAMESE NETWORKS
The purpose of a siamese network is to determine the degree of
similarity between two images. As shown in Figure 27,
a siamese network consists of two identical CNNs
with identical weights and parameters. The two images to be
compared are passed separately through the two twin CNNs,
and the respective output vector representations are evalu-
ated using a contrastive loss function. The function
is defined as follows [88]:
$$L\left(W, Y, \vec{X}_1, \vec{X}_2\right) = (1 - Y)\,\frac{1}{2}\,(D_w)^2 + (Y)\,\frac{1}{2}\,\left(\max(0,\, m - D_w)\right)^2 \quad (46)$$
FIGURE 27. Siamese network.
D_w represents the Euclidean distance between the two
output vectors, as shown in Figure 27. The label Y is
either 1 (indicating the images are not the same) or 0
(indicating the images are the same), and m represents a
margin value greater than 0. The idea
of siamese networks has been extended to come up with
triplet networks, which includes three identical networks and
is used to assess the similarity of a given image with two other
images.
Since the softmax layer outputs must match the number of
classes, a standard CNN becomes impractical for problems
that have a large number of classes. This issue doesn't apply
to siamese networks, as the number of outputs of the softmax
in the twin networks is not required to match
the number of classes [89]. This ability to scale to many more
classes for classification extends the use of siamese networks
beyond what a traditional CNN is used for. Siamese network
can be used for handwritten check recognition, signature
verification, text similarity, etc.
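A minimal sketch of the contrastive loss of (46) for one pair of embeddings follows; the margin default and the names are ours.

```python
import numpy as np

def contrastive_loss(out1, out2, Y, margin=1.0):
    # Dw: Euclidean distance between the two twin-network outputs.
    Dw = np.linalg.norm(out1 - out2)
    # Y = 0 pulls matching pairs together; Y = 1 pushes non-matching
    # pairs at least `margin` apart, as in Eq. (46).
    return ((1 - Y) * 0.5 * Dw ** 2
            + Y * 0.5 * max(0.0, margin - Dw) ** 2)
```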
R. VARIATIONAL AUTOENCODERS
As the name suggests, variational autoencoders (VAE) are
a type of autoencoder and consist of encoder and decoder
parts, as shown in Figure 28. They fall under the generative
model class of neural networks and are used in unsupervised
learning. VAEs learn a low-dimensional representation (latent
variable) that models the original high-dimensional dataset as
a Gaussian distribution. The Kullback-Leibler (KL) divergence
is a good way to compare distributions. Therefore,
the loss function in a VAE is a combination of cross entropy
(or mean squared error), to minimize the reconstruction error, and
KL divergence, to make the compressed latent variable follow
a Gaussian distribution. We then sample from the probability
distribution to generate new dataset samples that are represen-
tative of the original dataset. VAEs have found various applications,
from generating images in video games to de-noising pictures.
FIGURE 28. Variational autoencoder.
In Figure 28, x is the input and z is the encoded output
(latent variable). P(x) represents the distribution associated
with x, and P(z) the distribution associated with z.
The goal is to infer P(z) based on P(z|x), which follows a
certain distribution. The mathematical derivation for VAEs
was originally proposed in [90]. Suppose we want to infer
P(z|x) based on some Q(z|x); then we can try to minimize the
KL divergence between the two:

$$D_{KL}\left[Q(z|x)\,\|\,P(z|x)\right] = \sum_z Q(z|x) \log\frac{Q(z|x)}{P(z|x)} \quad (47)$$

$$= E\left[\log\frac{Q(z|x)}{P(z|x)}\right] \quad (48)$$

$$= E\left[\log Q(z|x) - \log P(z|x)\right] \quad (49)$$

where D_KL is the Kullback-Leibler (KL) divergence and
E represents expectation.
Using Bayes' rule:

$$P(z|x) = \frac{P(x|z)\,P(z)}{P(x)} \quad (50)$$

$$D_{KL}\left[Q(z|x)\,\|\,P(z|x)\right] = E\left[\log Q(z|x) - \log\frac{P(x|z)\,P(z)}{P(x)}\right] \quad (51)$$

$$= E\left[\log Q(z|x) - \log P(x|z) - \log P(z)\right] + \log P(x) \quad (52)$$
To allow us to easily sample P(z) and generate new data,
we set P(z) to the standard normal distribution, i.e., N(0, 1).
If Q(z|x) is represented as a Gaussian with parameters μ(x)
and Σ(x), then the KL divergence between Q(z|x) and P(z)
can be derived in closed form as:

$$D_{KL}\left[\mathcal{N}(\mu(x), \Sigma(x))\,\|\,\mathcal{N}(0, 1)\right] = \frac{1}{2} \sum_k \left(\exp(\Sigma(x)) + \mu^2(x) - 1 - \Sigma(x)\right) \quad (53)$$
S. DEEP REINFORCEMENT LEARNING
The primary idea of reinforcement learning is to make an
agent learn from the environment with the help of
random experimentation (exploration) and defined rewards
(exploitation). It consists of a finite number of states (s_i, rep-
resenting the agent and environment), actions (a_i) taken by the
agent, a probability P_a of moving from one state to another based
on action a_i, and a reward R_a(s_i, s_{i+1}) associated with moving
to the next state with action a. The goal is to balance and
maximize the current reward R and the future reward
γ · max_{a'} Q(s', a') by predicting the best action as defined by
the function Q(s, a); γ in the equation represents a fixed
discount factor. Q(s, a) is represented as the summation of the
current reward R and the discounted future reward, as shown below:

$$Q(s, a) = R + \gamma \cdot \max_{a'} Q(s', a') \quad (54)$$
Reinforcement learning is specifically suited to problems
that consist of both short-term and long-term rewards, e.g.,
games like chess, Go, etc. AlphaGo, Google's program that
beat the human Go champion, also uses reinforcement learn-
ing [91]. When we combine deep network architectures with
reinforcement learning, we get deep reinforcement learning
(DRL), which extends the use of reinforcement learning to even
more complex games and areas such as robotics, smart grids,
healthcare, finance, etc. [92]. With DRL, problems that were
intractable with reinforcement learning can now be solved
by combining the many hidden layers of deep networks with the
reinforcement learning based Q-learning algorithm, which max-
imizes the reward for actions taken by the agent [13].
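As an illustration, here is a minimal tabular Q-learning sketch of the update toward the target in (54); in DRL the table Q is replaced by a deep network, and the learning rate alpha (a smoothed version of the update) is our addition, not part of (54) itself.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Move Q(s, a) toward the target R + gamma * max_a' Q(s', a') of Eq. (54).
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```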
T. GENERATIVE ADVERSARIAL NETWORK (GAN)
GANs consist of generative and discriminative neural net-
works. The generative network generates completely new
(fake) data based on the input data (unsupervised learning),
and the discriminative network attempts to distinguish whether
the data is real (from the training set) or generated. The generative
network is trained to increase the probability of deceiving
the discriminative network, i.e., to make the generated data
indistinguishable from the original. GANs were proposed by
Goodfellow et al. [93] in 2014. They have been very popular,
with many applications both good and bad. E.g., [94] successfully
synthesized realistic images from text.
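As a sketch of the two competing objectives, the following computes binary cross-entropy losses from discriminator scores on real and generated batches; the non-saturating generator loss is the practical variant suggested in [93], and the names are ours.

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    # Discriminator: score real data as 1 and generated data as 0.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Generator: make the discriminator score generated data as real.
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```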
U. MULTI-APPROACH METHOD FOR ENHANCING DEEP
LEARNING
Deep learning can be optimized in different areas. We dis-
cussed training algorithm enhancements, parallel processing,
parameter optimizations and various architectures. All these
areas can be simultaneously implemented in a framework to
get the best results for specific problems. The training algo-
rithms can be fine-tuned at different levels by incorporating
heuristics, e.g., for hyperparameter optimization. The time
to train a deep learning network model is a major factor in
gauging the performance of an algorithm or network. Instead
of training the network with the entire data set, we can pre-
select a smaller but representative data set from the full
training distribution using instance selection methods [95]
or Monte Carlo sampling [48]. An effective sampling method
can prevent overfitting, improve accuracy and speed up the
learning process without compromising the quality of the
training dataset. Albelwi and Mahmood [96]
designed a framework that combined dataset reduction,
deconvolution network, correlation coefficient and an
updated objective function. The Nelder-Mead method was used
to optimize the parameters of the objective function, and
the results were comparable to the latest known results on the
MNIST dataset [96]. Thus, combining optimizations at multiple
levels and using multiple methods is a promising field
of research and can lead to further advancement in machine
learning.
VIII. CONCLUSION
In this tutorial, we provided a thorough overview of neural
networks and deep neural networks. We took a deeper dive
into the well-known training algorithms and architectures.
We highlighted their shortcomings, e.g., getting stuck in the
local minima, overfitting and training time for large prob-
lem sets. We examined several state-of-the-art ways to over-
come these challenges with different optimization methods.
We investigated adaptive learning rates and hyperparameter
optimization as effective methods to improve the accuracy
of the network. We surveyed and reviewed several recent
papers, studied them and presented their implementations and
improvements to the training process. We also included tables
to summarize the content in a concise manner. The tables
provide a full view on how different aspects of deep learning
are correlated.
Deep Learning is still in its nascent stage. There is
tremendous opportunity for exploitation of current algo-
rithms/architectures and further exploration of optimization
methods to solve more complex problems. Training is cur-
rently constrained by overfitting, training time and is highly
susceptible to getting stuck in local minima. If we can
continue to overcome these challenges, deep learning net-
works will accelerate breakthroughs across all applications
of machine learning and artificial intelligence.
CONFLICTS OF INTEREST
The authors declare no conflict of interest. The founding
sponsors had no role in the design of the study; in the col-
lection, analyses, or interpretation of data; in the writing of
the manuscript, and/or in the decision to publish the results.
ORCID
Ajay Shrestha: http://orcid.org/0000-0001-5595-5953.
REFERENCES
[1] F. Rosenblatt, ‘‘The perceptron: A probabilistic model for information
storage and organization in the brain,’’ Psychol. Rev., vol. 65, no. 6,
pp. 386–408, 1958.
[2] M. Minsky and S. A. Papert, Perceptrons: An Introduction to Computa-
tional Geometry, Expanded Edition. Cambridge, MA, USA: MIT Press,
1969, p. 258.
[3] G. Cybenko, ‘‘Approximation by superpositions of a sigmoidal function,’’
Math. Control, Signals Syst., vol. 2, no. 4, pp. 303–314, 1989.
[4] K. Hornik, ‘‘Approximation capabilities of multilayer feedforward net-
works,’’ Neural Netw., vol. 4, no. 2, pp. 251–257, 1991.
[5] P. J. Werbos, ‘‘‘Beyond Regression:’ New tools for prediction and analysis
in the behavioral sciences,’’ Ph.D. dissertation, Harvard Univ., Cambridge,
MA, USA, 1975.
[6] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521,
no. 7553, pp. 436–444, May 2015.
[7] M. I. Jordan and T. M. Mitchell, ‘‘Machine learning: Trends, perspectives,
and prospects,’’ Science, vol. 349, no. 6245, pp. 255–260, 2015.
[8] A. Ng, ‘‘Machine learning yearning: Technical strategy for ai engineers in
the era of deep learning,’’ Tech. Rep., 2019
[9] C. Metz, Turing Award Won by 3 Pioneers in Artificial Intelligence.
New York, NY, USA: New York Times, 2019, p. B3.
[10] K. Nagpal et al., ‘‘Development and validation of a deep learning algorithm
for improving Gleason scoring of prostate cancer,’’ CoRR, Nov. 2018.
[11] S. Nevo, ‘‘ML for flood forecasting at scale,’’ CoRR, Jan. 2019.
[12] A. Esteva et al., ‘‘Dermatologist-level classification of skin cancer with
deep neural networks,’’ Nature, vol. 542, no. 7639, p. 115, 2017.
[13] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,
‘‘Deep reinforcement learning: A brief survey,’’ IEEE Signal Process.
Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[14] M. Gheisari, G. Wang, and M. Z. A. Bhuiyan, ‘‘A survey on deep learning
in big data,’’ in Proc. IEEE Int. Conf. Comput. Sci. Eng. (CSE), Jul. 2017,
pp. 173–180.
[15] S. Pouyanfar, ‘‘A survey on deep learning: Algorithms, techniques, and
applications,’’ ACM Comput. Surv., vol. 51, no. 5, p. 92, 2018.
[16] R. Vargas, A. Mosavi, and R. Ruiz, ‘‘Deep learning: A review,’’ in Proc.
Adv. Intell. Syst. Comput., 2017, pp. 1–11.
[17] M. D. Buhmann, Radial Basis Functions. Cambridge, U.K.: Cambridge
Univ. Press, 2003, p. 270.
[18] A. A. Akinduko, E. M. Mirkes, and A. N. Gorban, ‘‘SOM: Stochas-
tic initialization versus principal components,’’ Inf. Sci., vols. 364–365,
pp. 213–221, Oct. 2016.
[19] K. Chen, ‘‘Deep and modular neural networks,’’ in Springer Handbook
of Computational Intelligence, J. Kacprzyk and W. Pedrycz, Eds. Berlin,
Germany: Springer, 2015, pp. 473–494.
[20] A. Y. Ng and M. I. Jordan, ‘‘On discriminative vs. generative classifiers:
A comparison of logistic regression and naive Bayes,’’ in Proc. 14th Int.
Conf. Neural Inf. Process. Syst. Cambridge, MA, USA: MIT Press, 2001,
pp. 841–848.
[21] C. M. Bishop and J. Lasserre, ‘‘Generative or discriminative? Getting the
best of both worlds,’’ Bayesian Statist., vol. 8, pp. 3–24, Jan. 2007.
[22] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, ‘‘Unsupervised learning
of depth and ego-motion from video,’’ CoRR, Apr. 2017
[23] X.-W. Chen and X. Lin, ‘‘Big data deep learning: Challenges and perspec-
tives,’’ IEEE Access, vol. 2, pp. 514–525, 2014.
[24] Y. LeCun, K. Kavukcuoglu, and C. Farabet, ‘‘Convolutional networks
and applications in vision,’’ in Proc. IEEE Int. Symp. Circuits Syst.,
May/Jun. 2010, pp. 253–256.
[25] G. Gousios, B. Vasilescu, A. Serebrenik, and A. Zaidman, ‘‘Lean GHTor-
rent: GitHub data on demand,’’ in Proc. 11th Work. Conf. Mining Softw.
Repositories, Hyderabad, India, 2014, pp. 384–387.
[26] AI-Index. (2019). Top Deep Learning Github Repositories. [Online].
Available: https://github.com/mbadry1/Top-Deep-Learning
[27] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, ‘‘Do we
need hundreds of classifiers to solve real world classification problems?’’
J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
ing applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11,
pp. 2278–2324, Nov. 1998.
[29] Y. LeCun and Y. Bengio, ‘‘Convolutional networks for images, speech,
and time series,’’ in The Handbook of Brain Theory and Neural
Networks, A. A. Michael, Ed. Cambridge, MA, USA: MIT Press, 1998,
pp. 255–258.
[30] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, ‘‘Convolutional learn-
ing of spatio-temporal features,’’ in Computer Vision. Berlin, Germany:
Springer, 2010.
[31] A. Ng. (Jul. 21, 2018). Convolutional Neural Network. UFLDL.
[Online]. Available: http://ufldl.stanford.edu/tutorial/supervised/
ConvolutionalNeuralNetwork/
[32] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Schölkopf, ‘‘A machine
learning approach for non-blind image deconvolution,’’ in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1067–1074.
[33] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised representation
learning with deep convolutional generative adversarial networks,’’ CoRR,
Nov. 2015.
[34] I. T. Jolliffe, ‘‘Principal component analysis,’’ in Mathematics and Statis-
tics, 2nd ed. New York, NY, USA: Springer, 2002, p. 487.
[35] K. Noda, ‘‘Multimodal integration learning of object manipulation behav-
iors using deep neural networks,’’ in Proc. IEEE/RSJ Int. Conf. Intell.
Robots Syst., Nov. 2013, pp. 1728–1733.
[36] G. E. Hinton and R. R. Salakhutdinov, ‘‘Reducing the dimensionality of
data with neural networks,’’ Science, vol. 313, no. 5786, pp. 504–507,
2006.
[37] M. Wang, H.-X. Li, X. Chen, and Y. Chen, ‘‘Deep learning-based model
reduction for distributed parameter systems,’’ IEEE Trans. Syst., Man,
Cybern., Syst., vol. 46, no. 12, pp. 1664–1674, Dec. 2016.
[38] A. Ng. (Jul. 21, 2018). Autoencoders. UFLDL. [Online]. Available:
http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders
[39] Y. W. Teh and G. E. Hinton, ‘‘Rate-coded restricted Boltzmann machines
for face recognition,’’ in Proc. Adv. Neural Inf. Process. Syst., 2001,
pp. 908–914.
[40] G. E. Hinton, ‘‘A practical guide to training restricted Boltzmann
machines,’’ in Neural Networks: Tricks of the Trade, 2nd ed., G. Montavon,
G. B. Orr, K.-R. Müller, Eds. Berlin, Germany: Springer, 2012,
pp. 599–619.
[41] S. Hochreiter and J. Schmidhuber, ‘‘Long Short-term Memory,’’ Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[42] C. Metz, ‘‘Apple is bringing the AI revolution to your phone, in wired,’’
Tech. Rep., 2016.
[43] F. A. Gers, J. Schmidhuber, and F. Cummins, ‘‘Learning to forget:
Continual prediction with LSTM,’’ Neural Comput., vol. 12, no. 10,
pp. 2451–2471, 2000.
[44] J. Chung. (2014). ‘‘Empirical evaluation of gated recurrent neural net-
works on sequence modeling.’’ [Online]. Available: https://arxiv.org/abs/
1412.3555
[45] K. Cho. (2014). ‘‘Learning phrase representations using RNN encoder-
decoder for statistical machine translation.’’ [Online]. Available: https://
arxiv.org/abs/1406.1078
[46] B. Naul, J. S. Bloom, F. Pérez, and S. van der Walt, ‘‘A recurrent neural
network for classification of unevenly sampled variable stars,’’ Nature
Astron., vol. 2, no. 2, pp. 151–155, 2018.
[47] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald,
and E. Muharemagic, ‘‘Deep learning applications and challenges in big
data analytics,’’ J. Big Data, vol. 2, no. 1, p. 1, Feb. 2015.
[48] I. Goodfellow, Y. Bengio, and A. Courville, ‘‘Deep learning,’’ in Adaptive
Computation And Machine Learning. Cambridge, MA, USA: MIT Press,
2016, p. 775.
[49] H. P. Gavin, ‘‘The Levenberg-Marquardt method for nonlinear least
squares curve-fitting problems,’’ Tech. Rep., 2016.
[50] X. Glorot and Y. Bengio, ‘‘Understanding the difficulty of training deep
feedforward neural networks,’’ in Proc. 13th Int. Conf. Artif. Intell. Statist.,
2010, pp. 249–256.
[51] J. Martens, ‘‘Deep learning via Hessian-free optimization,’’ in Proc.
27th Int. Conf. Int. Conf. Mach. Learn. Haifa, Israel: Omnipress, 2010,
pp. 735–742.
[52] H. J. Escalante, M. Montes, and L. E. Sucar, ‘‘Particle swarm model
selection,’’ J. Mach. Learn. Res., vol. 10, pp. 405–440, Feb. 2009.
[53] A. Shrestha and A. Mahmood, ‘‘Improving genetic algorithm with fine-
tuned crossover and scaled architecture,’’ J. Math., vol. 2016, p. 10,
Mar. 2016.
[54] K. Sastry, D. Goldberg, and G. Kendall, Genetic Algorithms. 2005.
[55] D. E. Goldberg, The Design of Innovation: Lessons from and for Competent
Genetic Algorithms. Boston, MA, USA: Springer, 2013.
[56] R. Miikkulainen, ‘‘Evolving deep neural networks,’’ CoRR, Mar. 2017.
[57] J. Duchi, E. Hazan, and Y. Singer, ‘‘Adaptive subgradient methods for
online learning and stochastic optimization,’’ J. Mach. Learn. Res., vol. 12,
pp. 2121–2159, Jul. 2011.
[58] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’
CoRR, Dec. 2014.
[59] S. Ioffe and C. Szegedy, ‘‘Batch normalization: Accelerating deep network
training by reducing internal covariate shift,’’ CoRR, Feb. 2015.
[60] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks
from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958,
2014.
[61] AW Services. (Jul. 21, 2018). Amazon EC2 P2 & P3 Instances. Ama-
zon EC2 Instance Types. [Online]. Available: https://aws.amazon.com/ec2/
instance-types/p2/ and https://aws.amazon.com/ec2/instance-types/p3/
[62] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2016, pp. 770–778.
[63] A. J. R. Simpson, ‘‘Uniform learning in a deep neural network via ‘oddball’
stochastic gradient descent,’’ CoRR, Oct. 2015.
[64] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain,
‘‘Unconstrained face recognition: Identifying a person of interest from
a media collection,’’ IEEE Trans. Inf. Forensics Security, vol. 9, no. 12,
pp. 2144–2157, Dec. 2014.
[65] T. A. Letsche and M. W. Berry, ‘‘Large-scale information retrieval with
latent semantic indexing,’’ Inf. Sci., vol. 100, nos. 1–4, pp. 105–137, 1997.
[66] G. E. Hinton, ‘‘Learning multiple layers of representation,’’ Trends Cognit.
Sci., vol. 11, no. 10, pp. 428–434, Oct. 2007.
[67] R. Salakhutdinov and G. Hinton, ‘‘Deep Boltzmann machines,’’ in Proc.
12th Int. Conf. Artif. Intell. Statist., D. van Dyk and M. Welling, Eds., 2009,
pp. 448–455.
[68] W. Kuo, B. Hariharan, and J. Malik, ‘‘DeepBox: Learning objectness with
convolutional networks,’’ CoRR, May 2015.
[69] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, ‘‘Extreme learning machine: The-
ory and applications,’’ Neurocomputing, vol. 70, nos. 1–3, pp. 489–501,
2006.
[70] J. Tang, C. Deng, and G.-B. Huang, ‘‘Extreme learning machine for
multilayer perceptron,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 27,
no. 4, pp. 809–821, Apr. 2015.
[71] M. Gong, J. Liu, H. Li, Q. Cai, and L. Su, ‘‘A multiobjective sparse
feature learning model for deep neural networks,’’ IEEE Trans. Neural
Netw. Learn. Syst., vol. 26, no. 12, pp. 3263–3277, Dec. 2015.
[72] S. Mehrkanoon, C. Alzate, R. Mall, R. Langone, and J. A. K. Suykens,
‘‘Multiclass semisupervised learning based upon kernel spectral cluster-
ing,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4, pp. 720–733,
Apr. 2015.
[73] R. Langone, R. Mall, C. Alzate, and J. A. K. Suykens, ‘‘Kernel spectral
clustering and applications,’’ CoRR, May 2015.
[74] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, ‘‘Very deep convo-
lutional networks for text classification,’’ CoRR, Jun. 2016.
[75] N. Krpan and D. Jakobovic, ‘‘Parallel neural network training with
OpenCL,’’ in Proc. 35th Int. Conv. MIPRO, May 2012, pp. 1053–1057.
[76] W. Dong and M. Zhou, ‘‘A supervised learning and control method to
improve particle swarm optimization algorithms,’’ IEEE Trans. Syst., Man,
Cybern. Syst., vol. 47, no. 7, pp. 1135–1148, Jul. 2017.
[77] V. Vapnik and R. Izmailov, ‘‘Learning using privileged information: Simi-
larity control and knowledge transfer,’’ J. Mach. Learn. Res., vol. 16, no. 1,
pp. 2023–2049, Jan. 2015.
[78] J. R. Sampson, Adaptation in Natural and Artificial Systems, vol. 18, no. 3,
J. H. Holland, Ed. Philadelphia, PA, USA: SIAM, 1976, pp. 529–530.
[79] N. M. Razali and J. Geraghty, ‘‘Genetic algorithm performance with
different selection strategies in solving TSP,’’ in Proc. world Congr. Eng.,
2010, pp. 1–6.
[80] P. Larrañaga, C. M. H. Kuijpers, R. H. Murga, I. Inza, and S. Dizdarevic,
‘‘Genetic algorithms for the travelling salesman problem: A review of rep-
resentations and operators,’’ Artif. Intell. Rev., vol. 13, no. 2, pp. 129–170,
Apr. 1999.
[81] D. Whitley, ‘‘A genetic algorithm tutorial,’’ Statist. Comput., vol. 4, no. 2,
pp. 65–85, Jun. 1994.
[82] C.-T. Lin, M. Prasad, and A. Saxena, ‘‘An improved polynomial neural
network classifier using real-coded genetic algorithm,’’ IEEE Trans. Syst.,
Man, Cybern., Syst., vol. 45, no. 11, pp. 1389–1401, Nov. 2015.
[83] Y. Guo et al., ‘‘The use of next generation sequencing technology to study
the effect of radiation therapy on mitochondrial DNA mutation,’’ Mutation
Res./Genetic Toxicol. Environ. Mutagenesis, vol. 744, no. 2, pp. 154–160,
2012.
[84] Y. Wu, ‘‘Google’s neural machine translation system: Bridging the gap
between human and machine translation,’’ CoRR, Sep. 2016.
[85] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li, ‘‘Multi-instance multi-
label learning,’’ Artif. Intell., vol. 176, no. 1, pp. 2291–2320, 2012.
[86] L. Huang, A. D. Joseph, B. Nelson, B. I. P. Rubinstein, and J. D. Tygar,
‘‘Adversarial machine learning,’’ in Proc. 4th ACM Workshop Secur. Artif.
Intell., Chicago, IL, USA, 2011, pp. 43–58.
[87] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning
Approach. London, U.K.: Springer, 2015.
[88] R. Hadsell, S. Chopra, and Y. LeCun, ‘‘Dimensionality reduction by learn-
ing an invariant mapping,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2006, pp. 1735–1742.
[89] A. Shrestha and A. Mahmood, ‘‘Enhancing siamese networks training with
importance sampling,’’ in Proc. 11th Int. Conf. Agents Artif. Intell. Prague,
Czech Republic: SciTePress, 2019, pp. 610–615.
[90] D. P. Kingma and M. Welling. (2013). ‘‘Auto-encoding variational Bayes.’’
[Online]. Available: https://arxiv.org/abs/1312.6114
[91] D. Silver et al., ‘‘Mastering the game of go with deep neural networks and
tree search,’’ Nature, vol. 529, no. 7587, p. 484, 2016.
[92] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau,
‘‘An introduction to deep reinforcement learning,’’ CoRR, Dec. 2018.
[93] I. J. Goodfellow et al. (2014). ‘‘Generative adversarial networks.’’
[Online]. Available: https://arxiv.org/abs/1406.2661
[94] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. (2016).
‘‘Generative adversarial text to image synthesis.’’ [Online]. Available:
https://arxiv.org/abs/1605.05396
[95] H. Brighton and C. Mellish, ‘‘Advances in instance selection for instance-
based learning algorithms,’’ Data Mining Knowl. Discovery, vol. 6, no. 2,
pp. 153–172, 2002.
[96] S. Albelwi and A. Mahmood, ‘‘A framework for designing the architectures
of deep convolutional neural networks,’’ Entropy, vol. 19, no. 6, p. 242,
2017.
AJAY SHRESTHA received the B.S. degree in
computer engineering and the M.S. degree in com-
puter science from the University of Bridgeport,
CT, USA, in 2002 and 2006, respectively, where he
is currently pursuing the Ph.D. degree in computer
science and engineering.
He has guest lectured at Pennsylvania State Uni-
versity. He is also an Adjunct Faculty with the
School of Engineering, University of Bridgeport,
and with Thermo Fisher Scientific, Branford,
CT, USA, as a Manager of Technical Operations. His research interests
include machine learning and metaheuristics. He has served as a Technical
Committee Member of the International Conference on Systems, Computing
Sciences and Software Engineering (SCSS). He received the Academic
Excellence Award and the Graduate Research Assistantship for his under-
graduate and graduate studies, respectively. He has been serving as the
Chapter Vice President and other officers of Upsilon Pi Epsilon (UPE),
since 2014, and received the UPE Executive Council Award presented by
the UPE Executive Council, in 2016.
AUSIF MAHMOOD (SM’82) received the M.S.
and Ph.D. degrees in electrical and computer engi-
neering from Washington State University, USA.
He is currently the Chair Person of the Com-
puter Science and Engineering Department and a
Professor with the Computer Science and Engi-
neering Department and the Electrical Engineering
Department, University of Bridgeport, Bridgeport,
CT, USA. His research interests include parallel
and distributed computing, computer vision, deep
learning, and computer architecture.
VOLUME 7, 2019 53065

Review_of_Deep_Learning_Algorithms_and_Architectures.pdf

  • 1.
    Received April 1,2019, accepted April 15, 2019, date of publication April 22, 2019, date of current version May 1, 2019. Digital Object Identifier 10.1109/ACCESS.2019.2912200 Review of Deep Learning Algorithms and Architectures AJAY SHRESTHA AND AUSIF MAHMOOD, (Senior Member, IEEE) Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA Corresponding author: Ajay Shrestha ([email protected]) ABSTRACT Deep learning (DL) is playing an increasingly important role in our lives. It has already made a huge impact in areas, such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting, and speech recognition. The painstakingly handcrafted feature extractors used in traditional learning, classification, and pattern recognition systems are not scalable for large-sized data sets. In many cases, depending on the problem complexity, DL can also overcome the limitations of earlier shallow networks that prevented efficient training and abstractions of hierarchical representations of multi-dimensional training data. Deep neural network (DNN) uses multiple (deep) layers of units with highly optimized algorithms and architectures. This paper reviews several optimization methods to improve the accuracy of the training and to reduce training time. We delve into the math behind training algorithms used in recent deep networks. We describe current shortcomings, enhancements, and implementations. The review also covers different types of deep architectures, such as deep convolution networks, deep residual networks, recurrent neural networks, reinforcement learning, variational autoencoders, and others. INDEX TERMS Machine learning algorithm, optimization, artificial intelligence, deep neural network architectures, convolution neural network, backpropagation, supervised and unsupervised learning. I. INTRODUCTION Neural Network is a machine learning (ML) technique that is inspired by and resembles the human nervous system and the structure of the brain. It consists of processing units organized in input, hidden and output layers. The nodes or units in each layer are connected to nodes in adjacent layers. Each connection has a weight value. The inputs are multiplied by the respective weights and summed at each unit. The sum then undergoes a transformation based on the activa- tion function, which is in most cases is a sigmoid function, tan hyperbolic or rectified linear unit (ReLU). These func- tions are used because they have a mathematically favorable derivative, making it easier to compute partial derivatives of the error delta with respect to individual weights. Sigmoid and tanh functions also squash the input into a narrow output range or option, i.e., 0/1 and −1/+1 respectively. They imple- ment saturated nonlinearity as the outputs plateaus or satu- rates before/after respective thresholds. ReLu on the other hand exhibits both saturating and non-saturating behaviors with f (x) = max(0, x). The output of the function is then fed as input to the subsequent unit in the next layer. The result of the final output layer is used as the solution for the problem. The associate editor coordinating the review of this manuscript and approving it for publication was Geng-Ming Jiang. Neural Networks can be used in a variety of prob- lems including pattern recognition, classification, clustering, dimensionality reduction, computer vision, natural language processing (NLP), regression, predictive analysis, etc. 
Figure 1 shows how a deep neural network called a Convolution Neural Network (CNN) can learn hierarchical levels of representations from a low-level input vector and successfully identify the higher-level object. The red squares in the figure are a gross generalization of the pixel values of the highlighted section of the image. CNNs progressively extract higher representations of the image after each layer and finally recognize the image.

The implementation of neural networks consists of the following steps (a minimal sketch of these steps follows the section outline below):
1. Acquire training and testing data set
2. Train the network
3. Make prediction with test data

The paper is organized in the following sections:
1. Introduction to Machine Learning
   a. Background and Motivation
2. Classifications of Neural Networks
3. DNN Architectures
4. Training Algorithms
5. Shortcomings of Training Algorithms
6. Optimization of Training Algorithms
7. Architectures & Algorithms – Implementations
8. Conclusion
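As a concrete illustration of the three implementation steps above, the following is a minimal sketch in Python using the Keras API of TensorFlow (one of the frameworks surveyed later in this review). The dataset, layer sizes, and training settings are illustrative assumptions, not prescriptions from this paper.

    import tensorflow as tf

    # Step 1: acquire training and testing data sets (MNIST digits here).
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    # A small feedforward network: input, one hidden ReLU layer, softmax output.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Step 2: train the network.
    model.fit(x_train, y_train, epochs=5)

    # Step 3: make predictions with the test data.
    test_loss, test_acc = model.evaluate(x_test, y_test)
    predictions = model.predict(x_test)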
FIGURE 1. Image recognition by a CNN.

A. BACKGROUND
In 1957, Frank Rosenblatt created the perceptron, the first prototype of what we now know as a neural network [1]. It had two layers of processing units that could recognize simple patterns. Instead of undergoing further research and development, neural networks entered a dark phase of their history in 1969, when professors at MIT demonstrated that the perceptron could not even learn a simple XOR function [2]. In addition, another finding particularly dampened the motivation for DNN research. The universal approximation theorem showed that a single hidden layer is able to approximate any continuous function [3]. It was also proven mathematically [4], which further called the need for deep networks into question. While a single hidden layer could be used to learn, it was not efficient and was a far cry from the convenience and capability afforded by the hierarchical abstraction of the multiple hidden layers of the DNNs we know now. But it was not just the universal approximation theorem that held back the progress of DNNs; back then, we also did not have a way to train a DNN. These factors prolonged the so-called AI winter, a phase in the history of artificial intelligence when the field received little funding and interest, and as a result did not advance much either.

A breakthrough in DNNs occurred with the advent of the backpropagation learning algorithm. It was proposed in the 1970s [5], but it was not until the mid-1980s [6] that it was fully understood and applied to neural networks. Self-directed learning was made possible with the deeper understanding and application of the backpropagation algorithm. The automation of feature extractors is what differentiates DNNs from earlier generations of machine learning techniques. A DNN is a type of neural network modeled as a multilayer perceptron (MLP) that is trained with algorithms to learn representations from data sets without any manual design of feature extractors. As the name Deep Learning suggests, it consists of a higher, or deeper, number of processing layers, in contrast to shallow learning models with fewer layers of units. The shift from shallow to deep learning has allowed more complex and non-linear functions to be mapped, as these cannot be efficiently mapped with shallow architectures. This improvement has been complemented by the proliferation of cheaper processing units such as the general-purpose graphics processing unit (GPGPU) and by the large volumes of data (big data) to train from. While individual GPGPU cores are less powerful than CPU cores, the parallel processing cores in a GPGPU outnumber CPU cores by orders of magnitude, which makes GPGPUs better suited for implementing DNNs. In addition to the backpropagation algorithm and the GPU, the adoption and advancement of ML, and particularly deep learning, can be attributed to the explosion of data, or big data, in the last 10 years. ML will continue to impact and disrupt all areas of our lives, from education, finance, governance, healthcare, and manufacturing to marketing and others [7].

B. MOTIVATION
Deep learning is perhaps the most significant development in the field of computer science in recent times. Its impact has been felt in nearly all scientific fields. It is already disrupting and transforming businesses and industries.
There is a race among the world's leading economies and technology companies to advance deep learning. There are already many areas where deep learning has exceeded human-level performance, e.g., predicting movie ratings, deciding whether to approve loan applications, estimating car delivery times, etc. [8]. On March 27, 2019, the three deep learning pioneers (Yoshua Bengio, Geoffrey Hinton, and Yann LeCun) were awarded the Turing Award, often referred to as the ''Nobel Prize'' of computing [9].

While a lot has been accomplished, there is more to advance in deep learning. Deep learning has the potential to improve human lives with more accurate diagnosis of diseases like cancer [10], discovery of new drugs, and prediction of natural disasters [11]. For example, [12] reported that a deep learning network trained on 129,450 images of 2,032 diseases was able to diagnose at the same level as 21 board-certified dermatologists. Google AI [10] was able to beat the average accuracy of US board-certified general pathologists in grading prostate cancer, scoring 70% versus 61%.

The goal of this review is to cover the vast subject of deep learning and present a holistic survey of dispersed information in one article. It collates the works of leading authors across the wide scope and breadth of deep learning. Other review papers [13]–[16] focus on specific areas and implementations without encompassing the full scope of the field. This review covers the different types
of deep learning network architectures, deep learning algorithms, their shortcomings, optimization methods, and the latest implementations and applications.

II. CLASSIFICATION OF NEURAL NETWORK
Neural networks can be classified into the following types:
1. Feedforward Neural Network
2. Recurrent Neural Network (RNN)
3. Radial Basis Function Neural Network
4. Kohonen Self Organizing Neural Network
5. Modular Neural Network

FIGURE 2. (a) Feedforward neural network [6]. (b) The unrolling of an RNN in time [6].

In a feedforward neural network, information flows in just one direction, from the input to the output layer (via hidden nodes, if any). The connections do not form any circles or loopbacks. Figure 2a shows a particular implementation of a multilayer feedforward neural network, with the values and functions computed along the forward pass path. z is the weighted sum of the inputs, and y represents the non-linear activation function f of z at each layer. w represents the weight between the two units in the adjoining layers indicated by the subscript letters, and b represents the bias value of the unit.

Unlike feedforward neural networks, the processing units in an RNN form a cycle. The output of a layer becomes the input to the next layer, which is typically the only layer in the network; the output of the layer therefore becomes an input to itself, forming a feedback loop. This allows the network to retain memory of previous states and use it to influence the current output. One significant outcome of this difference is that, unlike a feedforward neural network, an RNN can take a sequence of inputs and generate a sequence of output values, rendering it very useful for applications that require processing sequences of time-phased input data, such as speech recognition, frame-by-frame video classification, etc.

Figure 2b demonstrates the unrolling of an RNN in time. For example, if a 3-word sentence constitutes the input, then each word corresponds to a layer, and the network is unfolded, or unrolled, 3 times into a 3-layer RNN. Here is the mathematical explanation of the diagram: x_t represents the input at time t; U, V, and W are the learned parameters that are shared by all steps; o_t is the output at time t; and s_t represents the state at time t, which can be computed as follows, where f is the activation function, e.g., ReLU:

s_t = f(U x_t + W s_{t-1})   (1)

The radial basis function neural network is used in classification, function approximation, and time series prediction problems, among others. It consists of input, hidden, and output layers. The hidden layer includes a radial basis function (implemented as a Gaussian function), and each node represents a cluster center. The network learns to assign the input to a center, and the output layer combines the outputs of the radial basis function and the weight parameters to perform classification or inference [17].

The Kohonen self-organizing neural network organizes the network model into the input data using unsupervised learning. It consists of two fully connected layers, i.e., an input layer and an output layer. The output layer is organized as a two-dimensional grid. There is no activation function, and the weights represent the attributes (position) of the output layer node. The Euclidean distance between the input data and each output layer node with respect to the weights is calculated.
The weights of the node closest to the input data, and of its neighbors, are then updated to bring them closer to the input data using the formula below [18] (a minimal sketch of this update appears at the end of this section):

w_i(t + 1) = w_i(t) + α(t) η_{j*,i} (x(t) − w_i(t))   (2)

where x(t) is the input data at time t, w_i(t) is the ith weight at time t, α(t) is the learning rate, and η_{j*,i} is the neighborhood function between the winning node j* and the ith node.

The modular neural network breaks a large network down into smaller, independent neural network modules. The smaller networks perform specific tasks, whose results are later combined into a single output of the entire network [19].
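The following is a minimal NumPy sketch of the self-organizing map update in (2). The grid size, the exponentially decaying learning rate schedule, and the Gaussian neighborhood function are illustrative assumptions.

    import numpy as np

    def som_step(weights, x, t, alpha0=0.5, sigma0=1.0, tau=100.0):
        """One Kohonen SOM update: find the winning node j*, then pull it
        and its grid neighbors toward the input x, per equation (2)."""
        grid_h, grid_w, dim = weights.shape
        # Winning node j*: the output node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        j_star = np.unravel_index(np.argmin(dists), dists.shape)
        # Exponentially decaying learning rate and neighborhood width.
        alpha = alpha0 * np.exp(-t / tau)
        sigma = sigma0 * np.exp(-t / tau)
        # Gaussian neighborhood function eta(j*, i) over grid coordinates.
        rows, cols = np.indices((grid_h, grid_w))
        grid_dist2 = (rows - j_star[0])**2 + (cols - j_star[1])**2
        eta = np.exp(-grid_dist2 / (2.0 * sigma**2))
        # w_i(t+1) = w_i(t) + alpha(t) * eta_{j*,i} * (x(t) - w_i(t))
        weights += alpha * eta[..., None] * (x - weights)
        return weights

    # Example: a 10x10 map of 3-dimensional weight vectors.
    rng = np.random.default_rng(0)
    weights = rng.random((10, 10, 3))
    for t, x in enumerate(rng.random((500, 3))):
        weights = som_step(weights, x, t)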
DNNs are implemented in the following popular ways:
1. Sparse Autoencoders
2. Convolution Neural Networks (CNNs or ConvNets)
3. Restricted Boltzmann Machines (RBMs)
4. Long Short-Term Memory (LSTM)

Autoencoders are neural networks that learn features or encodings from a given dataset in order to perform dimensionality reduction. The sparse autoencoder is a variation of the autoencoder in which some of the units output a value close to zero, i.e., they are inactive and do not fire. A deep CNN uses multiple layers of unit collections that interact with the input (pixel values in the case of an image) and perform the desired feature extraction. CNNs find application in image recognition, recommender systems, and NLP. The RBM is used to learn a probability distribution over the data set. All of these networks use backpropagation for training. Backpropagation uses gradient descent for error reduction, adjusting the weights based on the partial derivative of the error with respect to each weight.

Neural network models can also be divided into the following two distinct categories:
1. Discriminative
2. Generative

The discriminative model is a bottom-up approach in which data flows from the input layer via the hidden layers to the output layer. Discriminative models are used in supervised training for problems like classification and regression. Generative models, on the other hand, are top-down, and data flows in the opposite direction. They are used in unsupervised pre-training and probabilistic distribution problems. If the input x and the corresponding label y are given, a discriminative model learns the probability distribution p(y|x), i.e., the probability of y given x, directly, whereas a generative model learns the joint probability p(x, y), from which p(y|x) can be derived [20] (a toy numerical sketch of this relationship appears a few paragraphs below). In general, whenever labeled data is available, discriminative approaches are undertaken, as they provide effective training; when labeled data is not available, a generative approach can be taken [21].

Training can be broadly categorized into three types:
1. Supervised
2. Unsupervised
3. Semi-supervised

Supervised learning uses labeled data to train the network, whereas in unsupervised learning there is no labeled data set, and thus no learning based on feedback. In unsupervised learning, neural networks are pre-trained using generative models such as RBMs and can later be fine-tuned using standard supervised learning algorithms; the trained network is then used on the test data set to determine patterns or classifications. Big data has pushed the envelope even further for deep learning with its sheer volume and variety of data. Contrary to our intuitive inclination, there is no clear consensus on whether supervised learning is better than unsupervised learning; both have their merits and use cases. Reference [22] demonstrated enhanced results with unsupervised learning using unstructured video sequences for camera motion estimation and monocular depth. Modified neural networks such as the Deep Belief Network (DBN), as described by Chen and Lin [23], use both labeled and unlabeled data with supervised and unsupervised learning respectively to improve performance. Developing a way to automatically extract meaningful features from labeled and unlabeled high-dimensional data spaces is challenging. Yann LeCun et al. assert that one way to achieve this would be to utilize and integrate both unsupervised and supervised learning [24].
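Returning to the distinction between discriminative and generative models made above, the following toy sketch shows how a generative model's joint distribution p(x, y), here a hypothetical 2x2 table, yields the conditional p(y|x) that a discriminative model would learn directly.

    import numpy as np

    # Hypothetical joint distribution p(x, y) over a binary feature x (rows)
    # and a binary label y (columns); entries sum to 1.
    p_xy = np.array([[0.30, 0.10],
                     [0.15, 0.45]])

    # Marginal p(x): sum the joint over y.
    p_x = p_xy.sum(axis=1, keepdims=True)

    # Conditional p(y|x) = p(x, y) / p(x) -- what a discriminative model
    # would estimate directly from labeled data.
    p_y_given_x = p_xy / p_x
    print(p_y_given_x)   # each row sums to 1
    # e.g., p(y=1 | x=0) = 0.10 / 0.40 = 0.25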
Complementing unsupervised learning (with unlabeled data) with supervised learning (with labeled data) is referred to as semi-supervised learning.

DNNs and training algorithms have to overcome two major challenges: premature convergence and overfitting. Premature convergence occurs when the weights and biases of the DNN settle into a state that is only optimal at a local level and miss the global minimum of the entire multi-dimensional space. Overfitting, on the other hand, describes a state in which the DNN becomes so finely tailored to a given training data set that it becomes unfit, rigid, and less adaptable to any other test data set.

Along with the different types of training, algorithms, and architectures, we also have different machine learning frameworks (Table 1) and libraries that have made training models easier. These frameworks make complex mathematical functions, training algorithms, and statistical modeling available without having to write them on your own. Some provide distributed and parallel processing capabilities, and convenient development and deployment features. Figure 3 shows a graph of various deep learning libraries along with their Github stars from 2015-2018. Github is the largest hosting service provider of source code in the world [25], and Github stars are indicative of how popular a project is on Github. TensorFlow is the most popular DL library.

III. DNN ARCHITECTURES
A deep neural network consists of several layers of nodes. Different architectures have been developed to solve problems in different domains or use-cases; e.g., CNNs are used most of the time in computer vision and image recognition, and RNNs are commonly used in time series problems/forecasting. On the other hand, there is no clear winner for general problems like classification, as the choice of architecture can depend on multiple factors. Nonetheless, [27] evaluated 179 classifiers and concluded that parallel random forest, or parRF_t, which is essentially a parallel implementation of a variant of decision trees, performed the best. Below are four of the most common architectures of deep neural networks:
1. Convolution Neural Network
2. Autoencoder
3. Restricted Boltzmann Machine (RBM)
4. Long Short-Term Memory (LSTM)

A. CONVOLUTION NEURAL NETWORK
The CNN is based on the human visual cortex and is the neural network of choice for computer vision (image recognition)
and video recognition. It is also used in other areas such as NLP, drug discovery, etc. As shown in Figure 4, a CNN consists of a series of convolution and sub-sampling layers followed by a fully connected layer and a normalizing (e.g., softmax) layer. Figure 4 illustrates the well-known 7-layer LeNet-5 CNN architecture devised by LeCun et al. [28] for digit recognition. The series of multiple convolution layers performs progressively more refined feature extraction at every layer, moving from the input to the output layers. Fully connected layers that perform classification follow the convolution layers. Sub-sampling, or pooling, layers are often inserted between convolution layers.

FIGURE 3. Github stars by Deep Learning Library [26].
TABLE 1. Popular deep learning frameworks and libraries.

A CNN takes a 2D n × n pixelated image as input. Each layer consists of groups of 2D neurons called filters or kernels. Unlike in other neural networks, the neurons in each feature extraction layer of a CNN are not connected to all neurons in the adjacent layers. Instead, they are connected only to spatially mapped, fixed-size, partially overlapping neurons in the previous layer's input image or feature map. This region in the input is called the local receptive field. The lower number of connections reduces the training time and the chances of overfitting. All neurons in a filter are connected to the same number of neurons in the previous input layer (or feature map) and are constrained to have the same sequence of weights and biases. These factors speed up the learning and reduce the memory requirements for the network. Thus, each neuron in a specific filter looks for the same pattern, but in different parts of the input image. Sub-sampling layers reduce the size of the network. In addition, along with local receptive fields and shared weights (within the same filter), sub-sampling effectively reduces the network's susceptibility to shifts, scaling, and distortions of images [29]. Max/mean pooling or local averaging filters are often used to achieve sub-sampling. The final layers of the CNN are responsible for the actual classification; there, the neurons between the layers are fully connected. A deep CNN can be implemented with multiple series of weight-sharing convolution layers and sub-sampling layers. The deep nature of the CNN results in high-quality representations while maintaining locality, reduced parameters, and invariance to minor variations in the input image [30].

In most cases, backpropagation is used solely for training all parameters (weights and biases) in a CNN. Here is a brief description of the algorithm. The cost function with respect to an individual training example (x, y) in the hidden layers can be
FIGURE 4. 7-layer architecture of CNN for character recognition [28].

defined as [31]:

J(W, b; x, y) = (1/2) ||h_{W,b}(x) − y||²   (3)

The equation for the error term δ for layer l is given by [31]:

δ^(l) = ((W^(l))^T δ^(l+1)) · f′(z^(l))   (4)

where δ^(l+1) is the error for the (l + 1)th layer of a network whose cost function is J(W, b; x, y), and f′(z^(l)) represents the derivative of the activation function.

∇_{W^(l)} J(W, b; x, y) = δ^(l+1) (a^(l))^T   (5)
∇_{b^(l)} J(W, b; x, y) = δ^(l+1)   (6)

where a is the input, such that a^(1) is the input for the 1st layer (i.e., the actual input image) and a^(l) is the input for the lth layer. The error for a sub-sampling layer is calculated as [31]:

δ_k^(l) = upsample((W_k^(l))^T δ_k^(l+1)) · f′(z_k^(l))   (7)

where k represents the filter number in the layer. In the sub-sampling layer, the error has to be cascaded in the opposite direction; e.g., where mean pooling is used, upsample evenly distributes the error to the previous input units. And finally, here is the gradient with respect to the feature maps [31]:

∇_{W_k^(l)} J(W, b; x, y) = Σ_{i=1}^{m} (a_i^(l)) ∗ rot90(δ_k^(l+1), 2)   (8)
∇_{b_k^(l)} J(W, b; x, y) = Σ_{a,b} (δ_k^(l+1))_{a,b}   (9)

where (a_i^(l)) ∗ δ_k^(l+1) represents the convolution between the error and the ith input in the lth layer with respect to the kth filter.

Algorithm 1 below presents a high-level description and flow of the backpropagation algorithm as used in a CNN, which goes through multiple epochs until either the maximum number of iterations is reached or the cost function criteria are met.

Algorithm 1 CNN Backpropagation Algorithm Pseudo Code
1: Initialize weights to small randomly generated values
2: Set learning rate to a small positive value
3: Iteration n = 1; Begin
4: for n < max iterations OR until cost function criteria met, do
5:   for image x1 to xi, do
6:     a. Forward propagate through convolution, pooling and then fully connected layers
7:     b. Derive cost function value for the image
8:     c. Calculate error term δ^(l) with respect to the weights for each type of layer.
9:        Note that the error gets propagated from layer to layer in the following sequence:
10:         i. fully connected layer
11:        ii. pooling layer
12:       iii. convolution layer
13:     d. Calculate gradients ∇_{W_k^(l)} and ∇_{b_k^(l)} for the weights and biases respectively for each layer.
14:        Gradients are calculated in the following sequence:
15:         i. convolution layer
16:        ii. pooling layer
17:       iii. fully connected layer
18:     e. Update weights:
19:        w_{ji}^(l) ← w_{ji}^(l) + Δw_{ji}^(l)
20:     f. Update bias:
21:        b_j^(l) ← b_j^(l) + Δb_j^(l)

In addition to discriminative tasks such as image recognition, CNNs can also be used for generative tasks such as deconvolving images to make blurry images sharper. Reference [32] achieves this by leveraging the Fourier transform to regularize the inversion of the blurred images and denoising.

Different implementations of CNNs have shown continuous improvement in computer vision accuracy. The improvements are tested against the same benchmark (ImageNet) to ensure unbiased results. Here are the well-known variations and implementations of the CNN architecture:
1. AlexNet:
   a. CNN developed to run on the Nvidia parallel computing platform to support GPUs
2. Inception:
   a. Deep CNN developed by Google
3. ResNet:
   a. Very deep residual network developed by Microsoft. It won 1st place in the ILSVRC 2015 competition on the ImageNet dataset.
4. VGG:
   a. Very deep CNN developed for large-scale image recognition
5. DCGAN:
   a. Deep convolutional generative adversarial network proposed by [33]. It is used in unsupervised learning of hierarchies of feature representations in input objects.

FIGURE 5. Linear representation of a 2D data input using PCA.

B. AUTOENCODER
The autoencoder is a neural network that uses an unsupervised algorithm to learn the representation of the input data set, for dimensionality reduction and to recreate the original data set. The learning algorithm is based on backpropagation. Autoencoders extend the idea of principal component analysis (PCA). As shown in Figure 5, PCA transforms multi-dimensional data into a linear representation; the figure demonstrates how 2D input data can be reduced to a linear vector using PCA. Autoencoders, on the other hand, can go further and produce nonlinear representations. PCA determines a set of linear variables in the directions with the largest variance. The p-dimensional input data points are represented as m orthogonal directions, such that m ≤ p, constituting a lower-dimensional space. The original data points are projected onto the principal directions, thus omitting information in the corresponding orthogonal directions. PCA focuses on the variances rather than on covariances and correlations, and it looks for the linear function with the largest variance [34]. The goal is to determine the direction with the least mean square error, which would then have the least reconstruction error.

Autoencoders use encoder and decoder blocks of non-linear hidden layers to generalize PCA, performing dimensionality reduction and eventual reconstruction of the original data. They use greedy layer-by-layer unsupervised pre-training and fine-tuning with backpropagation [35]. Despite using backpropagation, which is mostly used in supervised training, autoencoders are considered an unsupervised DNN because they regenerate the input x(i) itself instead of a different set of target values y(i), i.e., y(i) = x(i). Hinton et al. were able to achieve a near-perfect reconstruction of 784-pixel images using an autoencoder, proving that it is far better than PCA [36].

While performing dimensionality reduction, autoencoders come up with interesting representations of the input vector in the hidden layer. This is often attributed to the smaller number of nodes in the hidden layer, or in every second layer of the two-layer blocks. But even if there is a higher number of nodes in the hidden layer, a sparsity constraint can be enforced on the hidden units to retain interesting lower-dimensional representations of the inputs. To achieve sparsity, some nodes are restricted from firing, i.e., their output is set to a value close to zero.

FIGURE 6. Training stages in autoencoder [36].

Figure 6 shows single-layer feature detector blocks of RBMs used in pre-training, which is followed by
unrolling [36]. Unrolling combines the stacks of RBMs to create the encoder block and then reverses the encoder block to create the decoder section; finally, the network is fine-tuned with backpropagation [36]. Figure 7 illustrates a simplified representation of how autoencoders can reduce the dimension of the input data and learn to recreate it in the output layer. Wang et al. [37] successfully implemented a deep autoencoder with stacks of RBM blocks similar to Figure 6 to achieve better modeling accuracy and efficiency than the proper orthogonal decomposition (POD) method for dimensionality reduction of distributed parameter systems (DPSs).

FIGURE 7. Autoencoder nodes.

The equation below describes the average activation ρ̂_j of the jth unit of the 2nd layer over the m inputs x(i) that activate the neuron [38]:

ρ̂_j = (1/m) Σ_{i=1}^{m} [a_j^(2)(x(i))]   (10)

A sparsity parameter ρ is introduced such that ρ is very close to zero, e.g., 0.03, and ρ̂_j = ρ is enforced. To ensure that ρ̂_j = ρ, a penalty term KL(ρ||ρ̂_j) is introduced such that the Kullback–Leibler (KL) divergence KL(ρ||ρ̂_j) = 0 if ρ̂_j = ρ, and otherwise grows monotonically as the difference between the two values diverges [38]. Here is the updated cost function [38]:

J_sparse(W, b) = J(W, b) + β Σ_{j=1}^{s2} KL(ρ||ρ̂_j)   (11)

where s2 equals the number of units in the 2nd layer and β is the parameter that controls the sparsity penalty term's weight.

C. RESTRICTED BOLTZMANN MACHINE (RBM)
The Restricted Boltzmann Machine is an artificial neural network on which we can apply an unsupervised learning algorithm to build non-linear generative models from unlabeled data [39]. The goal is to train the network to increase a function (e.g., the product or log) of the probability of the vectors in the visible units, so that the network can probabilistically reconstruct the input. It learns the probability distribution over its inputs. As shown in Figure 8, an RBM is a two-layer network consisting of a visible layer and a hidden layer. Each unit in the visible layer is connected to all units in the hidden layer, and there are no connections between units in the same layer.

FIGURE 8. Restricted Boltzmann machine.

The energy function E of a configuration (v, h) of the visible and hidden units is expressed in the following way [40]:

E(v, h) = − Σ_{i ∈ visible} a_i v_i − Σ_{j ∈ hidden} b_j h_j − Σ_{i,j} v_i h_j w_ij   (12)

where v_i and h_j are the states of visible unit i and hidden unit j, a_i and b_j represent the biases of the visible and hidden units, and w_ij denotes the weight between the respective visible and hidden units. The partition function Z is the sum over all possible pairs of visible and hidden vectors [40]:

Z = Σ_{v,h} e^{−E(v,h)}   (13)

The probability of every pair of visible and hidden vectors is given by [40]:

p(v, h) = (1/Z) e^{−E(v,h)}   (14)

The probability of a particular visible layer vector is given by [40]:

p(v) = (1/Z) Σ_h e^{−E(v,h)}   (15)

As the equations above show, configurations with a lower energy value receive a higher probability. Thus, during the training process, the weights and biases of the network are adjusted to arrive at a lower energy and thereby maximize the probability assigned to the training vector. It is mathematically convenient to compute the derivative of the log probability of a training vector:

∂ log p(v) / ∂w_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model   (16)
In equation (16) above [40], ⟨v_i h_j⟩_data and ⟨v_i h_j⟩_model represent the expectations under the respective distributions. The adjustment of the weights can thus be denoted as follows [40], where ε is the learning rate:

Δw_ij = ε(⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model)   (17)

FIGURE 9. LSTM block with memory cell and gates.

D. LONG SHORT-TERM MEMORY (LSTM)
LSTM is an implementation of the recurrent neural network and was first proposed by Hochreiter et al. in 1997 [41]. Unlike the feedforward network architectures described earlier, LSTM can retain knowledge of earlier states and can be trained for tasks that require memory or state awareness. LSTM partly addresses a major limitation of RNNs, i.e., the problem of vanishing gradients, by letting gradients pass unaltered. As shown in the illustration in Figure 9, an LSTM consists of blocks of memory cell state through which the signal flows while being regulated by input, forget, and output gates. These gates control what is stored in, read from, and written to the cell. LSTM is used by Google, Apple, and Amazon in their voice recognition platforms [42].

In Figure 9, c, x, and h represent the cell, input, and output values. The subscript t denotes the time step value, i.e., t − 1 refers to values from the previous LSTM block (or from time t − 1) and t denotes the current block values. The symbol σ is the sigmoid function and tanh is the hyperbolic tangent function. The operator + is element-wise summation and × is element-wise multiplication. The computations of the gates are described in the equations below [41], [43]:

f_t = σ(W_f x_t + w_f h_{t−1} + b_f)   (18)
i_t = σ(W_i x_t + w_i h_{t−1} + b_i)   (19)
o_t = σ(W_o x_t + w_o h_{t−1} + b_o)   (20)
c_t = f_t ⊗ c_{t−1} + i_t ⊗ σ_c(W_c x_t + w_c h_{t−1} + b_c)   (21)
h_t = o_t ⊗ σ_h(c_t)   (22)

where f, i, and o are the forget, input, and output gate vectors respectively, and W, w, b, and ⊗ represent the weights of the input, the weights of the recurrent output, the bias, and element-wise multiplication respectively.

There is a smaller variation of the LSTM known as the gated recurrent unit (GRU). GRUs are smaller than LSTMs, as they do not include an output gate, and they can outperform LSTMs only on some simpler datasets [44], [45]. LSTM recurrent neural networks can keep track of long-term dependencies, which makes them well suited to learning from sequential input data and to building models that rely on context and earlier states. The cell block of an LSTM retains pertinent information about previous states. The input, forget, and output gates dictate, respectively, the new data going into the cell, what remains in the cell, and the cell values used in the calculation of the output of the LSTM block [41], [43]. Naul et al. demonstrated LSTM- and GRU-based autoencoders for automatic feature extraction [46].

E. COMPARISON OF DNN NETWORKS
Table 2 provides a compact summary and comparison of the different DNN architectures. The examples of implementations, applications, datasets, and DL software frameworks presented in the table are not meant to be exhaustive. In addition, some of the network architecture categories can be implemented in hybrid fashion; e.g., even though RBMs are generative models and their training is considered unsupervised, they can take on elements of a discriminative model when training is fine-tuned with supervised learning. The table also provides examples of common applications for the different architectures.

IV. TRAINING ALGORITHMS
The learning algorithm constitutes the main part of deep learning.
The number of layers differentiates a deep neural network from a shallow one: the higher the number of layers, the deeper it becomes. Each layer can be specialized to detect a specific aspect or feature. As indicated by Najafabadi et al. [47], in the case of image (face) recognition, the first layer can detect edges, the second can detect higher-level features such as the various parts of the face, e.g., ears, eyes, etc., and the third layer can go further up the complexity order by learning the facial shapes of various persons. Even though each layer might learn or detect a defined feature, the sequence is not always designed for it, especially in unsupervised learning.

These feature extractors in each layer had to be manually programmed prior to the development of training algorithms such as gradient descent. Such hand-crafted classifiers did not scale to larger datasets or adapt to variations in the dataset. This message was echoed in the 1998 paper [28] by Yann LeCun et al., which demonstrated that systems with more automatic learning and fewer manually designed heuristics yield far better pattern recognition. Backpropagation provides a representation learning methodology, where raw data can be fed without the need to manually massage it for classifiers, and the algorithm automatically finds the representations needed for classification or recognition [6].
TABLE 2. DNN network comparison table.

The goal of the learning algorithm is to find the optimal values for the weight vectors to solve a class of problems in a domain. Some of the well-known training algorithms are:
1. Gradient Descent
2. Stochastic Gradient Descent
3. Momentum
4. Levenberg–Marquardt algorithm
5. Backpropagation through time

A. GRADIENT DESCENT
Gradient descent (GD) is the underlying idea in most machine learning and deep learning algorithms. It is based on the concept of Newton's method for finding the roots (or zero values) of a 2D function. There, we randomly pick a point on the curve and slide to the right or left along the x-axis, based on the negative or positive value of the derivative (slope) of the function at the chosen point, until the value on the y-axis, i.e., the function f(x), becomes zero. The same idea is used in gradient descent: we traverse, or descend, along a certain path in a multi-dimensional weight space as long as the cost function keeps decreasing, and stop once the error rate ceases to decrease. Newton's method is prone to getting stuck in local minima if the derivative of the function at the current point is zero. The same risk is present when using gradient descent on a non-convex function. In fact, the impact is amplified in the multi-dimensional (each dimension representing a weight variable), multi-layer landscape of a DNN, and it can result in a sub-optimal set of weights. The cost function
is one half the square of the difference between the desired output and the current output, as shown below:

C = (1/2)(y_expected − y_actual)²   (23)

The backpropagation methodology uses gradient descent. In backpropagation, the chain rule and partial derivatives are employed to determine the error delta for any change in the value of each weight. The individual weights are then adjusted to reduce the cost function after every learning iteration over the training data set, resulting in a final multi-dimensional (multi-weight) landscape of weight values [6]. We process all the samples in the training dataset before applying the updates to the weights. This process is repeated until the objective (aka cost) function does not decrease any further. Figure 10 shows the error derivatives with respect to the outputs in each hidden layer, each of which is the weighted sum of the error derivatives with respect to the inputs of the units in the layer above; e.g., once ∂E/∂z_k is calculated, the partial error derivative with respect to w_jk is equal to y_j ∂E/∂z_k.

FIGURE 10. Error calculation in multilayer neural network [6].

B. STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent (SGD) is the most common variation and implementation of gradient descent. In gradient descent, we process all the samples in the training dataset before applying the updates to the weights, while in SGD the updates are applied after running through a mini-batch of n samples. Since we update the weights more frequently in SGD than in GD, we can converge towards the global minimum much faster.

C. MOMENTUM
In standard SGD, the learning rate is used as a fixed multiplier of the gradient to compute the step size, or update, to the weight. This can cause the update to overshoot a potential minimum if the gradient is too steep, or delay convergence if the gradient is noisy. Borrowing the concept of momentum from physics, the momentum algorithm introduces a velocity variable v that is configured as an exponentially decaying average of the gradient [48]. This helps prevent costly descent in the wrong direction. In the equations below, α ∈ [0, 1) is the momentum parameter, ε is the learning rate, and g is the gradient:

Velocity update: v ← αv − εg   (24)
Actual update: θ ← θ + v   (25)

D. LEVENBERG-MARQUARDT ALGORITHM
The Levenberg-Marquardt algorithm (LMA) is primarily used for solving non-linear least squares problems such as curve fitting. In least squares problems, we try to fit the given data points with a function such that the sum of the squares of the errors between the actual data points and the points on the function is minimized. LMA uses a combination of gradient descent and the Gauss-Newton method. Gradient descent is employed to reduce the sum of the squared errors by updating the parameters of the function in the direction of steepest descent, while the Gauss-Newton method minimizes the error by assuming the function to be locally quadratic and finding the minimum of the quadratic [49]. If the fitting function is denoted by ŷ(t; p) and the m data points by (t_i, y_i), then the squared error can be written as [49]:

χ²(p) = Σ_{i=1}^{m} [(y(t_i) − ŷ(t_i; p)) / σ_{y_i}]²   (26)
      = (y − ŷ(p))^T W (y − ŷ(p))   (27)
      = y^T W y − 2 y^T W ŷ + ŷ^T W ŷ   (28)

where the measurement error for y(t_i), i.e., σ_{y_i}, is the inverse of the weighting matrix W_ii.
The gradient of the squared error function with respect to the n parameters can be denoted as [49]:

(∂/∂p) χ² = 2 (y − ŷ(p))^T W (∂/∂p)(y − ŷ(p))   (29)
          = −2 (y − ŷ(p))^T W (∂ŷ(p)/∂p)   (30)
          = −2 (y − ŷ)^T W J   (31)

h_gd = α J^T W (y − ŷ)   (32)

where J is the m × n Jacobian matrix used in place of [∂ŷ/∂p], and h_gd is the update in the direction of steepest gradient descent. The equation for the Gauss-Newton method update h_gn is as follows [49]:

[J^T W J] h_gn = J^T W (y − ŷ)   (33)
The Levenberg-Marquardt update h_lm is generated by combining the gradient descent and Gauss-Newton methods, resulting in the equation below [49]:

[J^T W J + λ diag(J^T W J)] h_lm = J^T W (y − ŷ)   (34)

E. BACKPROPAGATION THROUGH TIME
Backpropagation through time (BPTT) is the standard method to train a recurrent neural network. As shown in Figure 2b, the unrolling of an RNN in time makes it appear like a feedforward network. But unlike a feedforward network, the unrolled RNN has the same exact set of weight values at each layer and represents the training process in the time domain. The backward pass through this time-domain network calculates the gradients with respect to the specific weights at each layer. It then averages the updates for the same weight at the different time increments (or layers) and applies them, ensuring that the weight values at each layer stay uniform.

F. COMPARISON OF DEEP LEARNING ALGORITHMS
Table 3 provides a summary and comparison of common deep learning algorithms. The advantages and disadvantages are presented along with techniques to address the disadvantages. Gradient descent-based training is the most common type of training. Backpropagation through time is backpropagation tailored to recurrent neural networks. Contrastive divergence finds its use in probabilistic models such as RBMs. Evolutionary algorithms can be applied to hyperparameter optimization or to training models by optimizing weights. Reinforcement learning can be used in game theory, multi-agent systems, and other problems where both exploitation and exploration need to be optimized.

TABLE 3. Deep learning algorithm comparison table.

V. SHORTCOMINGS OF TRAINING ALGORITHMS
There are several shortcomings in the standard use of training algorithms on DNNs. The most common ones are described here.

A. VANISHING AND EXPLODING GRADIENTS
Deep neural networks are prone to vanishing (or exploding) gradients due to the inherent way in which gradients (derivatives) are computed layer by layer in a cascading manner, with each layer contributing to exponentially decreasing or increasing derivatives. Weights are increased or decreased based on the gradients so as to reduce the cost function, or error. Very small gradients can cause the network to take a long time to train, whereas large gradients can cause the training to overshoot and diverge. This is made worse by non-linear activation functions like the sigmoid and tanh functions, which squash the outputs into a small range; since changes in the weights then have only a nominal effect on the output, training can take much longer. This problem can be mitigated by using a non-saturating activation function like ReLU and proper weight initialization.

B. LOCAL MINIMA
A local minimum is always the global minimum in a convex function, which makes gradient descent-based optimization foolproof there. In nonconvex functions, however, backpropagation-based gradient descent is particularly vulnerable to premature convergence into a local minimum. A local minimum, as shown in Figure 11, can easily be mistaken for the global absolute minimum.

C. FLAT REGIONS
Just like local minima, flat regions, or saddle points (Figure 12), pose a similar challenge for gradient descent-based optimization in nonconvex, high-dimensional functions. The training algorithm can be misled by such an area, as the gradient comes to a halt at that point.

D. STEEP EDGES
Steep edges are another section of the optimization surface where a steep gradient could cause the gradient
descent-based weight updates to overshoot and miss a potential global minimum.

FIGURE 11. Gradient descent.
FIGURE 12. Flat (saddle point marked with black dot) region in a nonconvex function.

E. TRAINING TIME
Training time is an important factor in gauging the efficiency of an algorithm. It is not uncommon for graduate students to train their models for days or weeks in the computer lab. Most models require exorbitant amounts of time and large datasets to train. Often, many of the samples in a dataset do not add value to the training process, and in some cases they introduce noise and adversely affect the training.

F. OVERFITTING
As we add more neurons to a DNN, it can undoubtedly model more complex problems; a DNN can lend itself to high conformability to the training data. But there is also a high risk of overfitting to the outliers and noise in the training data, as shown in Figure 13. This can prolong training and testing times and lower the quality of predictions on the actual test data. E.g., in classification or clustering problems, overfitting can create a high-order polynomial output that separates the decision boundary for the training set, which takes longer and yields degraded results on most test data sets. One way to overcome overfitting is to choose the number of neurons in the hidden layer wisely, matching the problem size and type. There are some algorithms that can approximate the appropriate number of neurons, but there is no magic bullet, and the best bet is to experiment with each use case to obtain an optimal value.

FIGURE 13. Overfitting in classification.

VI. OPTIMIZATION OF TRAINING ALGORITHMS
The goal of the DNN is to improve the accuracy of the model on test data, and training algorithms aim to achieve this end goal by reducing the cost function. The common root cause of three of the five shortcomings mentioned above is primarily the fact that the training algorithms assume the problem area to be a convex function. The other problem is the high number of nodes and the sheer number of possible combinations of weight values they can take. While the weights are learned by training on the dataset, there are additional crucial parameters, referred to as hyperparameters, that are not directly learned from the training dataset. These hyperparameters can take a range of values, adding to the complexity of finding the optimal architecture and model. There is significant room for improvement over the standard training algorithms. Here are some of the popular ways to enhance the accuracy of DNNs.

A. PARAMETER INITIALIZATION TECHNIQUES
Since the solution space is so huge, the initial parameters have an outsized influence on how fast or slow the training converges, whether it converges at all, or whether it prematurely converges to a suboptimal point. Initialization strategies tend to be heuristic in nature. Reference [50] proposed normalized initialization, where the weights are initialized in the following manner (a minimal sketch follows this subsection):

W ∼ U[−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1})]   (35)

Reference [51] proposed another technique called sparse initialization, where the number of non-zero incoming weights is capped at a certain limit, causing the units to retain high diversity and reducing the chances of saturation.
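The following is a minimal NumPy sketch of the normalized initialization in (35), often called Glorot or Xavier initialization, for one fully connected layer; the layer sizes are illustrative assumptions.

    import numpy as np

    def glorot_uniform(n_in, n_out, rng=np.random.default_rng()):
        """Draw a weight matrix W ~ U[-limit, +limit] with
        limit = sqrt(6) / sqrt(n_j + n_{j+1}), per equation (35)."""
        limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    W1 = glorot_uniform(784, 128)   # e.g., input layer -> hidden layer
    print(W1.min(), W1.max())       # bounded by +/- sqrt(6/912), about 0.081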
B. HYPERPARAMETER OPTIMIZATION
The learning rate and regularization parameters constitute the most commonly used hyperparameters in a DNN. The learning rate determines the rate at which the weights are updated. The purpose of regularization is to prevent overfitting, and the regularization parameter sets its degree of influence on the loss function. CNNs have additional hyperparameters, i.e., the number of filters, the filter shapes, the number of dropouts and max pooling shapes at each convolution layer, and the number of nodes in the fully connected layer. These parameters are very important for training and modeling a DNN, and coming up with an optimal set of parameter values is a challenging feat. Exhaustively iterating through every combination of hyperparameter values is computationally very expensive. For example, if training and evaluating a DNN with the full dataset takes ten minutes, then with seven hyperparameters, each with eight potential values, it would take 8^7 × 10 min, i.e., 20,971,520 minutes, or almost 40 years, to exhaustively train and evaluate the network on all combinations of the hyperparameter values. Hyperparameters can instead be optimized with different metaheuristics (a simple sampling alternative is sketched at the end of this subsection). Metaheuristics are nature-inspired guiding principles that can help traverse the search space more intelligently, and much faster, than the exhaustive method.

Particle Swarm Optimization (PSO) is one type of metaheuristic that can be used for hyperparameter optimization. PSO is modeled on how birds fly around in search of food or during migration. The velocity and location of the birds (or particles) are adjusted to steer the swarm towards better solutions in the vast search space. Escalante et al. used PSO for hyperparameter optimization to build a competitive model that ranked among the top relative to other comparable methods [52].

The genetic algorithm (GA) is a metaheuristic that is commonly used to solve combinatorial optimization problems. It mimics the selection and crossover processes of species reproduction and the way they contribute to the evolution and improvement of a species' prospects of survival. Figure 14a shows a high-level diagram of the GA, and Figure 14b illustrates the crossover process, where parts of the respective genetic sequences of both parents are merged to form the new genetic sequence of the children. The goal is to find a population member (a sequence of numbers resembling DNA nucleotides) that meets the fitness requirement. Each population member represents a potential solution. Population members are selected by different methods, e.g., elite, roulette, rank, and tournament. The elite method ranks population members by fitness and uses only high-fitness members for the crossover process. The mutation process then makes random changes to the number sequence, and the entire process continues until the desired fitness or the maximum number of iterations is reached.

References [53], [54] propose parallelization and hybridization of GA to achieve better and faster results. Parallelization provides both speedup and better results, as we can periodically exchange population members between the distributed and parallel runs of the genetic algorithm on different sets of population members.

FIGURE 14. (a) Genetic algorithm [53]. (b) Crossover in genetic algorithm.

Hybridization is the process of mixing the primary algorithm (GA in this case) with other operations, such as local search.
Shrestha and Mahmood [53] incorporated the 2-opt local search method into GA to improve the search for the optimal solution. Reference [55] postulates that correctly performed exchanges (e.g., in GA) breed innovation and result in creative solutions to hard problems, just as in real life, where collaboration and exchange occur between individuals, organizations, and societies. In addition to GA, other variations of evolution-based metaheuristics have also been used to evolve and optimize deep learning architectures and hyperparameters; e.g., [56] proposed the CoDeepNEAT framework, based on the deep neuroevolution technique, for finding an optimized architecture to match the task at hand.
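As a baseline that is far cheaper than the exhaustive sweep estimated above, the following sketch randomly samples hyperparameter combinations and keeps the best one. The search space, trial budget, and the train_and_evaluate stub are hypothetical placeholders, not values from this paper.

    import random

    # Hypothetical search space: 7 hyperparameters, 8 candidate values each
    # (the exhaustive sweep would be 8**7 = 2,097,152 combinations).
    space = {
        "learning_rate": [10**-i for i in range(1, 9)],
        "batch_size": [8, 16, 32, 64, 128, 256, 512, 1024],
        "num_filters": [8, 16, 24, 32, 48, 64, 96, 128],
        "filter_size": [1, 2, 3, 4, 5, 6, 7, 8],
        "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
        "dense_units": [32, 64, 96, 128, 192, 256, 384, 512],
        "momentum": [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99],
    }

    def train_and_evaluate(config):
        """Placeholder: train a model with `config` and return its
        validation accuracy. Stubbed with a random score here."""
        return random.random()

    best_config, best_score = None, float("-inf")
    for _ in range(50):  # budget: 50 trials instead of 2,097,152
        config = {name: random.choice(values) for name, values in space.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score

    print(best_config, best_score)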
C. ADAPTIVE LEARNING RATES
Learning rates have a huge impact on training a DNN. The right learning rate can speed up training time, help navigate flat surfaces better, and overcome the pitfalls of non-convex functions. Adaptive learning rates allow us to change the learning rates of the parameters in response to the gradient and momentum. Several innovative methods have been proposed; Reference [48] describes the following:
1. Delta-bar Algorithm
2. AdaGrad
3. RMSProp
4. Adam

In the Delta-bar algorithm, the learning rate of a parameter is increased if the partial derivative with respect to it keeps the same sign, and decreased if the sign changes. AdaGrad is more sophisticated [57] and scales the learning rates inversely proportionally to the square root of the cumulative squared gradient. AdaGrad is not effective for all DNN training: since the change in the learning rate is a function of the entire historical gradient, the learning rate can shrink too aggressively, making AdaGrad susceptible to premature convergence. The RMSProp algorithm is a modification of the AdaGrad algorithm to make it effective in a nonconvex problem space; RMSProp replaces the summation of squared gradients in AdaGrad with an exponentially decaying moving average of the squared gradient, effectively dropping the impact of distant historical gradients [48]. Adam, which denotes adaptive moment estimation, is the latest evolution of the adaptive learning algorithms and integrates ideas from AdaGrad, RMSProp, and momentum [58]. Just like AdaGrad and RMSProp, Adam provides an individual learning rate for each parameter. Adam combines the benefits of both earlier methods and does a better job of handling non-stationary objectives as well as noisy and sparse gradient problems [58]. Adam uses the first moment of the gradients (the mean, as used in RMSProp) as well as the second moment (the uncentered variance), utilizing an exponential moving average of the squared gradient [58]. Figure 15 shows the relative performance of the various adaptive learning rate mechanisms, with Adam outperforming the rest.

FIGURE 15. Multilayer network training cost on MNIST dataset using different adaptive learning algorithms [58].

D. BATCH NORMALIZATION
As the network is being trained, with variations to the weights and parameters, the distribution of the actual data inputs at each layer of the DNN changes too, often making them all too large or too small and thus difficult to train on, especially with activation functions that implement saturating nonlinearities, e.g., the sigmoid and tanh functions. Ioffe and Szegedy [59] proposed the idea of batch normalization in 2015. It has made a huge difference in improving the training time and accuracy of DNNs. It updates the inputs to have a unit variance and zero mean at each mini-batch.

E. SUPERVISED PRETRAINING
Supervised pretraining consists of breaking complex problems down into smaller parts, training the simpler models, and later combining them to solve the larger model. Greedy algorithms are commonly used in the supervised pretraining of DNNs.

FIGURE 16. DNN with and without dropout.

F. DROPOUT
There are a few commonly used methods to lower the risk of overfitting. In the dropout technique, we randomly choose units and nullify their weights and outputs so that they do not influence the forward pass or the backpropagation. Figure 16 shows a fully connected DNN on the left and a DNN with dropout on the right.
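A minimal NumPy sketch of dropout during training, in its common "inverted" form, follows; the keep probability and layer size are illustrative assumptions. Scaling the surviving units by 1/p_keep preserves the expected activation, so the full network can be used unchanged at test time.

    import numpy as np

    def dropout_forward(activations, p_keep=0.8, rng=np.random.default_rng()):
        """Inverted dropout: randomly zero units with probability 1 - p_keep,
        then scale the survivors by 1/p_keep."""
        mask = rng.random(activations.shape) < p_keep
        return activations * mask / p_keep

    h = np.ones((4, 8))               # a hypothetical hidden-layer output
    h_train = dropout_forward(h)      # ~20% of units silenced, rest scaled
    h_test = h                        # at test time the full network is used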
The other methods include the use of regularization and simply enlarging the training dataset using label-preserving techniques. Dropout works better than regularization at reducing the risk of overfitting, and it also speeds up the training process. Reference [60]
proposed the dropout technique and demonstrated significant improvements on supervised learning-based DNNs for computer vision, computational biology, speech recognition, and document classification problems.

G. TRAINING SPEED UP WITH CLOUD AND GPU PROCESSING
Training time is one of the key performance indicators of machine learning. Cloud computing and GPUs lend themselves very well to speeding up the training process. The cloud provides massive amounts of compute power, and all major cloud vendors now include GPU-powered servers that can easily be provisioned on demand and used for training DNNs at competitive prices. Cloud vendor Amazon Web Services' (AWS) P2 instances provide up to 40 thousand parallel GPU cores, and its P3 GPU instances are further optimized for machine learning [61].

H. SUMMARY OF DL ALGORITHM SHORTCOMINGS AND RESOLUTION TECHNIQUES
Table 4 provides a summary of deep learning algorithm shortcomings and resolution techniques. The table also lists the causes and effects of the shortcomings.

TABLE 4. DL algorithm shortcomings resolution techniques.

VII. ARCHITECTURES & ALGORITHMS – IMPLEMENTATIONS
This section describes different implementations of neural networks using a variety of training methods, network architectures, and models. It also includes models and ideas that have been incorporated into machine learning in general.

A. DEEP RESIDUAL LEARNING
The ability to add more layers to a DNN has allowed us to solve harder problems. Microsoft Research Asia (MSRA) applied 100- and 1000-layer deep residual networks (ResNet) to the CIFAR-10 dataset and won 1st place in the ILSVRC 2015 competition with a 152-layer DNN on the ImageNet dataset [62]. Figure 17 demonstrates a simplified version of Microsoft's winning deep residual learning model. Despite the depth of these networks, simply adding more layers to a DNN does not improve or guarantee results; on the contrary, it can degrade the quality of the solution, which makes training DNNs not so straightforward. The MSRA team was able to overcome this degradation by making the stacked layers fit a residual mapping instead of the desired mapping, using the following function [62]:

F(x) := H(x) − x   (36)

where F(x) is the residual mapping and H(x) is the desired mapping, and then recasting the desired mapping as F(x) + x at the end [62]. According to the MSRA team, it is much easier to optimize the residual mapping.

FIGURE 17. Deep residual learning model by MSRA at Microsoft.

B. ODDBALL STOCHASTIC GRADIENT DESCENT
Not all training data are created equal: some examples will have higher training error than others. Yet we assume that they are the same and thus use each training example the same number of times. Simpson [63] argues that this assumption is invalid and makes the case that the number of times a training example is used should be proportional to its respective training error. So, if a training example has a higher error rate, it will be used to train the network a greater number of times
than the other training examples. Simpson [63] validates his methodology, termed oddball stochastic gradient descent, with a training set of 1000 video frames. Simpson [63] created a training selection probability distribution over the training examples based on their error values and pegged the frequency of using each training example to that distribution.
C. DEEP BELIEF NETWORK
Chen and Lin [23] highlight the fact that a conventional neural network can easily get stuck in local minima when the function is non-convex. They propose a DNN architecture called a large-scale deep belief network (DBN) that uses both labeled and unlabeled data to learn feature representations. DBNs are made up of layers of RBMs stacked together and learn the probability distribution of the input vectors. They employ unsupervised pre-training followed by supervised fine-tuning to mitigate the risk of getting trapped in local minima. Below is the equation [23] for the change in weights, where c is the momentum factor, α is the learning rate, and v and h are the visible and hidden units respectively:

\Delta w_{ij}(t+1) = c\,\Delta w_{ij}(t) + \alpha\,(\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model})    (37)

The equations [23] for the probability distributions of the hidden and visible units are:

p(h_j = 1 \mid v; W) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big)    (38)

p(v_i = 1 \mid h; W) = \sigma\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i\Big)    (39)

D. BIG DATA
Big data provides tremendous opportunities and challenges for deep learning. Big data is known for the four Vs (volume, velocity, veracity, variety). Unlike shallow networks, DNNs can handle the huge volume and variety of the data, which significantly improves the training process and the ability to fit more complex models. On the flip side, the sheer velocity of data generated in real time can be daunting to process. Najafabadi et al. [47] raise similar challenges in learning from real-time streaming data, such as monitoring credit card usage for fraud detection. They propose using parallel and distributed processing with thousands of CPU cores. In addition, we can also use cloud providers that support auto-scaling based on usage and workload. Not all data are of the same quality. In the case of computer vision, images from constrained sources, e.g., studios, are much easier to recognize than the ones from unconstrained sources like surveillance cameras. Reference [64] proposes a method that utilizes multiple images of the unconstrained source to enhance the recognition process. Deep learning can help mine and extract useful patterns from big data and build models for inference, prediction and business decision making. Massive volumes of structured and unstructured data and media files are generated today, making information retrieval very challenging. Deep learning can help with semantic indexing to enable information to be more readily accessible in search engines [14], [65]. This involves building models that capture the relationships between documents and the keywords they contain to make information retrieval more effective.
E. GENERATIVE TOP-DOWN CONNECTION (GENERATIVE MODEL)
Much of the training in practice is implemented with a bottom-up approach, where discriminatory or recognition models are developed using backpropagation. A bottom-up model is one that takes the vector representation of input objects and computes higher-level feature representations at each subsequent layer, with a final discrimination or recognition pattern at the output layer.
One of the shortcomings of backpropagation is that it requires labeled data for training. Geoffrey Hinton proposed a novel way of overcoming this limitation in 2007 [66]. He proposed a multi-layer DNN that uses generative top-down connections, as opposed to bottom-up connections, to mimic the way we generate visual imagery in our dreams without actual sensory input. In a top-down generative connection, the high-level data representations, or the outputs of the network, are used to generate the low-level raw vector representations of the original inputs, one layer at a time. The layers of feature representations learned with this approach can then be further refined either in generative models such as autoencoders or in standard recognition models [66].
FIGURE 18. Learning multiple layers of representation.
In the generative model in Figure 18, since the correct upstream cause of the events in each layer is known, a comparison can be made between the actual cause and the prediction made by the approximate inference procedure, and the recognition weights r_ij can be adjusted to increase the probability of a correct prediction.
Here is the equation [66] for adjusting the recognition weights r_ij:

\Delta r_{ij} \propto h_i \Big( h_j - \sigma\Big(\sum_i h_i r_{ij}\Big) \Big)    (40)

F. PRE-TRAINING WITH UNSUPERVISED DEEP BOLTZMANN MACHINES
The vast majority of DNN training is based on supervised learning. In real life, our learning is based on both supervised and unsupervised learning; in fact, most of our learning is unsupervised. Unsupervised learning is more relevant in today's age of big data analytics because most raw data is unlabeled and un-categorized [47]. One way to overcome the limitation of backpropagation, where it gets stuck in local minima, is to incorporate both supervised and unsupervised training. Top-down generative unsupervised learning is good for generalization because it essentially adjusts the weights by trying to match or recreate the input data one layer at a time [67]. After this effective unsupervised pre-training, we can always fine-tune the network with some labeled data. Geoffrey Hinton and Ruslan Salakhutdinov describe multiple layers of RBMs that are stacked together and trained layer by layer in a greedy, unsupervised way, essentially creating what is called the deep belief network. They further modify the stacks to make them undirected models with symmetric weights, thus creating the deep Boltzmann machine (DBM). A four-layer deep belief network and a four-layer deep Boltzmann machine are shown in Figure 19.
FIGURE 19. Four-layer DBN and four-layer deep Boltzmann machine.
In [67], the DBM layers were pre-trained one at a time using an unsupervised method and then tweaked using supervised backpropagation on the MNIST and NORB datasets, as shown in Figure 20. They [67] obtained favorable results, validating the benefits of combining supervised and unsupervised learning methods.
FIGURE 20. Pretraining of stacked altered RBMs to create a DBM [67].
Here are the equations showing the probability distributions over the visible units and the two hidden layers in the DBM (after unsupervised pre-training) [67]:

p(v_i = 1 \mid h^1) = \sigma\Big(\sum_j W^1_{ij} h^1_j\Big)    (41)

p(h^2_m = 1 \mid h^1) = \sigma\Big(\sum_j W^2_{jm} h^1_j\Big)    (42)

p(h^1_j = 1 \mid v, h^2) = \sigma\Big(\sum_i W^1_{ij} v_i + \sum_m W^2_{jm} h^2_m\Big)    (43)

After unsupervised pre-training, the DBM is converted into a deterministic multi-layer neural network by fine-tuning it with supervised learning using labeled data, as demonstrated in Figure 21. The approximate posterior distribution q(h|v) is generated for each input vector, the marginals q(h^2_j = 1 | v) are added as an additional input for the network, and subsequently backpropagation is used to fine-tune the network [67].
FIGURE 21. DBM initialized as a deterministic neural network with supervised fine-tuning [67].
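As a toy illustration of the layer-wise pre-training idea, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a single RBM layer, using the conditionals of Eqs. (38)-(39) and the momentum-based weight update of Eq. (37); the layer sizes, learning rate and momentum value are illustrative assumptions, not the settings of [67].

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, alpha, c = 6, 4, 0.1, 0.5
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a, b = np.zeros(n_hidden), np.zeros(n_visible)
dW = np.zeros_like(W)

v0 = rng.integers(0, 2, n_visible).astype(float)  # one training vector

# Positive phase: p(h = 1 | v), as in Eq. (38)
ph0 = sigmoid(v0 @ W + a)
h0 = (rng.random(n_hidden) < ph0).astype(float)

# Negative phase: one Gibbs step, p(v = 1 | h) as in Eq. (39)
pv1 = sigmoid(h0 @ W.T + b)
v1 = (rng.random(n_visible) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + a)

# Weight update with momentum, Eq. (37):
# <v h>_data approximated by v0*ph0, <v h>_model by v1*ph1
dW = c * dW + alpha * (np.outer(v0, ph0) - np.outer(v1, ph1))
W += dW

In a DBN or DBM, this update would be run to convergence on one layer, whose hidden activations then become the "visible" data for training the next layer up.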
G. EXTREME LEARNING MACHINE (ELM)
There have been other variations of learning methodologies. While more layers allow us to extract more complex features and patterns, some problems might be solved faster and better with fewer layers. Reference [68] proposed a four-layer CNN termed DeepBox that outperformed larger networks in speed and accuracy for evaluating objectness. The ELM is another type of neural network, with just one hidden layer. Linear models are learnt from the dataset in a single iteration by adjusting the weights between the hidden layer and the output, whereas the weights between the input and the hidden layer are randomly initialized and fixed [69]. The ELM can obviously converge much faster than backpropagation, but it can only be applied to simpler classification and regression problems. Since proposing the ELM in 2006, Guang-Bin Huang et al. came up with a multilayer version of the ELM in 2016 [70] to take on more complex problems. They combined unsupervised multilayer encoding with the random initialization of the weights and demonstrated faster convergence, i.e., lower training time, than state-of-the-art multilayer perceptron training algorithms. A minimal sketch of the basic single-hidden-layer ELM follows.
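Here is a minimal NumPy sketch of training a basic single-hidden-layer ELM for regression: the input-to-hidden weights are random and fixed, and the hidden-to-output weights are solved in one shot by least squares via the pseudo-inverse; the toy dataset and layer size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression dataset: y = sin(x) on [-3, 3]
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# Random, fixed input-to-hidden weights and biases
n_hidden = 50
W_in = rng.standard_normal((1, n_hidden))
b_in = rng.standard_normal(n_hidden)

H = np.tanh(X @ W_in + b_in)        # hidden-layer activations

# Hidden-to-output weights solved in a single least-squares step
beta = np.linalg.pinv(H) @ y

# Prediction on new inputs
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = np.tanh(X_test @ W_in + b_in) @ beta
print(y_pred)                       # close to sin(X_test)

The single pseudo-inverse solve is why the ELM trains in one iteration, and also why it is limited to problems where random fixed features are expressive enough.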
H. MULTIOBJECTIVE SPARSE FEATURE LEARNING MODEL
Gong et al. [71] developed a multiobjective sparse feature learning (MO-SFL) model based on the autoencoder, in which an evolutionary algorithm optimizes two competing objectives: the sparsity of the hidden units and the reconstruction error (of the input vector of the AE). It fares better than models where the sparsity is determined by human intervention or by less-than-optimal methods. Since the time complexity of evolutionary algorithms is high, they [71] utilize self-adaptive multiobjective differential evolution (DE) based on decomposition (Sa-MODE/D) to cut down on time, and they demonstrate better results than the standard AE (autoencoder), SR-RBM (sparse response RBM) and SESM (sparse encoding symmetric machine) by testing on the MNIST dataset and comparing the results with those implementations. Their learning procedure continuously iterates between an evolutionary optimization step and stochastic gradient descent on the reconstruction error [71]:
• Step 1: Multi-objective optimization to select the most suitable point on the Pareto frontier for both objectives.
• Step 2: Optimization of the parameters θ and θ′ with stochastic gradient descent on the following reconstruction error function (of the autoencoder), where D is the training dataset and L(x, y) is the loss function, with x representing the input and y the output, i.e., the reconstructed input:

\sum_{x \in D} L(x, g_{\theta'}(f_{\theta}(x)))    (44)

Figure 22 shows a Pareto frontier, which can be used to achieve a compromise between two competing objective functions.
FIGURE 22. Pareto frontier.
I. MULTICLASS SEMI-SUPERVISED LEARNING BASED ON KERNEL SPECTRAL CLUSTERING
Mehrkanoon et al. [72] proposed a multiclass learning algorithm based on kernel spectral clustering (KSC) that uses both labeled and unlabeled data. The novelty of their proposal is the introduction of regularization terms added to the cost function of KSC, which allow labels or memberships to be applied to unlabeled data examples. It is achieved in the following way [72]:
• Unsupervised learning based on kernel spectral clustering (KSC) is used as the core model.
• A regularization term is introduced, and labels (from the labeled data) are added to the model.
Figure 23 illustrates data points in a spectral clustering representation.
FIGURE 23. Spectral clustering representation.
Spectral clustering (SC) is an algorithm that divides the data points of a graph using the Laplacian (a double-derivative operation), whereas KSC is an extension of SC that uses the least squares support vector machine methodology [73]. A basic spectral clustering sketch is shown below.
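For intuition, here is a minimal NumPy sketch of plain (unkernelized, unsupervised) spectral clustering on a small similarity graph; the Gaussian similarity, the two-cluster setting and the simple sign-based split are illustrative assumptions, not the KSC model of [72].

import numpy as np

rng = np.random.default_rng(0)

# Two small 2-D blobs to cluster
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

# Gaussian similarity matrix (the graph's weighted adjacency)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))

# Normalized graph Laplacian L = I - D^(-1/2) W D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

# The eigenvector of the second-smallest eigenvalue (the Fiedler
# vector) should separate the two clusters; split on its sign.
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
print(labels)   # first 20 points in one cluster, last 20 in the other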
Since unlabeled data is much more abundantly available than labeled data, it is beneficial to make the most of it with unsupervised or, in this case, semi-supervised learning.
J. VERY DEEP CONVOLUTIONAL NETWORKS FOR NATURAL LANGUAGE PROCESSING
Deep CNNs have mostly been used in computer vision, where they are very effective. Conneau et al. [74] applied them for the first time to NLP with up to 29 convolution layers. The goal is to analyze and extract layers of hierarchical representations from words and sentences at the syntactic, semantic and contextual levels. One of the major reasons for the earlier absence of deep CNNs in NLP is that deeper networks tend to cause saturation and degradation of accuracy, in addition to the processing overhead of more layers. He et al. [62] state that the degradation is caused not by overfitting but by the fact that deeper systems are difficult to optimize. Reference [62] addressed this issue with shortcut connections between the convolution blocks that let the gradients propagate more freely; they, along with [74], were able to validate the benefits of the shortcuts with 50-, 101- and 152-layer networks and with 49 layers, respectively. Conneau et al.'s [74] architecture consists of a series of convolution blocks separated by pooling layers that halve the resolution, followed by k-max pooling and classification at the end.
K. METAHEURISTICS
Metaheuristics can be used to train neural networks to overcome the limitations of backpropagation-based learning. When implementing a metaheuristic as the training algorithm, each weight of the neural network is represented by a dimension in the multi-dimensional solution search space of the problem we are trying to solve. The goal is to come as near as possible to the optimal values of the weights, i.e., a location in the search space that represents the global best solution. Particle swarm optimization (PSO) is a type of metaheuristic, inspired by the movement of flocking birds, in which particles (candidate solutions) move about in the search space to reach a near-optimal solution. In their paper [75], N. Krpan and D. Jakobovic ran parallel implementations using backpropagation and PSO. Their results demonstrate that while parallelization improves the efficiency of both algorithms, parallel backpropagation is efficient only on large networks, whereas parallel PSO is beneficial across various problem sizes. Similarly, Dong and Zhou [76] complemented PSO with a supervised learning control module to guide the search for the global minimum of an optimization problem. The supervised learning module provided real-time feedback with back diffusion (BD) to retain diversity, and social attractor renewal to overcome stagnation [76]. Metaheuristics take high-level guidance inspired by nature and apply it to solve mathematical problems. In a similar way, [77] proposes incorporating the concepts of an intelligent teacher and privileged information, which is essentially extra information available during training but not during evaluation or testing, into the DNN training process. A sketch of the basic PSO update follows.
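Here is a minimal NumPy sketch of the canonical (global-best) PSO velocity and position update on a toy objective; the inertia and acceleration coefficients and the sphere function are illustrative assumptions, not the parallel OpenCL setup of [75].

import numpy as np

rng = np.random.default_rng(0)

def sphere(x):                      # toy objective to minimize
    return (x ** 2).sum(axis=-1)

n_particles, dim = 30, 5
w, c1, c2 = 0.7, 1.5, 1.5           # inertia, cognitive, social weights

pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()                  # each particle's best position so far
gbest = pos[sphere(pos).argmin()]   # swarm's best position so far

for _ in range(200):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    better = sphere(pos) < sphere(pbest)
    pbest[better] = pos[better]
    gbest = pbest[sphere(pbest).argmin()]

print(sphere(gbest))                # approaches 0

When PSO trains a neural network, each particle's position vector is simply the flattened weight vector, and the objective is the training loss.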
L. GENETIC ALGORITHM
The genetic algorithm (GA) is a metaheuristic that can be used effectively in training DNNs. The GA mimics the evolutionary processes of selection, crossover and mutation, and each population member represents a possible solution, i.e., a set of weights. Unlike PSO, which includes only one operator for adjusting the solutions, evolutionary algorithms like the GA include several, i.e., selection, crossover and mutation [52]. Population members undergo several iterations of selection and crossover based on known strategies to achieve better solutions in the next iteration or generation. The GA has undergone decades of improvement and refinement since it was first proposed in 1976 [78]. There are several ways to perform selection, e.g., elite, roulette, rank and tournament [79]. There are about a dozen ways to perform crossover catalogued by Larrañaga et al. alone [80]. Selection methodologies represent exploration of the solution space, and crossovers represent exploitation of the selected solution candidates. The goal is to get better solutions through wider exploration and deeper exploitation; additional variation can be introduced with mutation. Parallel clusters of GAs can be executed independently on islands, with a few members exchanged between the islands every so often [81]. In addition, we can also utilize a local search such as a greedy algorithm, nearest neighbor or the k-opt algorithm to further improve the quality of the solution. Lin et al. [82] demonstrated a successful incorporation of the GA that resulted in better classification accuracy and performance of a polynomial neural network. Standard GA operations including selection, crossover and mutation were used on parameters that included partial descriptions (PDs) of the inputs in the first layer, the bias and all input features [82]. The GA was further enhanced with the incorporation of the concept of mitochondrial DNA (mtDNA). In evolution, it is quite evident from casual observation and simple reasoning that crossover of population members with too much similarity does not yield much variance in the offspring. Likewise, we can infer that in a GA, selection and crossover between solutions that are very similar would not result in a high degree of exploration of the multi-dimensional solution space. In fact, it might run the risk of getting pigeonholed into a restricted pattern. Diversity is the key to overcoming the risk of getting stuck in local minima, and this risk can be mitigated by exploiting the idea of mtDNA, which represents one percent of the human chromosomes [83]. The concept of incorporating mitochondrial DNA into the GA was introduced by Shrestha and Mahmood [53]. They describe a way to restrict crossover between population members (solution candidates) based on the proximity of their mtDNA values [53]. Unlike the remaining 99% of the DNA, mtDNA is inherited only from the mother, which makes it a more continuous marker of lineage or genetic proximity. The premise is that offspring of population members with similar genetic makeup do not help with overcoming local minima. Figure 24 describes the parallel and distributed nature of their full implementation [53] along with the GA operators (selection, mutation and mtDNA-aware crossover). The training process is enhanced [53] with the implementation of a continental model, where distributed servers run multiple threads, each running an instance of the GA with mtDNA. Population members are then exchanged between the servers after a fixed number of iterations, as shown in Figure 24. A toy sketch of the core GA loop is shown below the figure.
FIGURE 24. Continental model with mtDNA [53].
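Here is a minimal NumPy sketch of one GA generation (tournament selection, uniform crossover, Gaussian mutation) evolving a weight vector against a fitness function; the operators, rates and toy fitness are illustrative assumptions, not the mtDNA-restricted crossover of [53].

import numpy as np

rng = np.random.default_rng(0)

def fitness(w):                     # toy fitness: maximize -||w||^2
    return -(w ** 2).sum(axis=-1)

pop_size, dim, mut_rate = 40, 10, 0.1
pop = rng.uniform(-1, 1, (pop_size, dim))

def tournament(pop, k=3):
    """Selection: pick the fittest of k randomly chosen members."""
    idx = rng.integers(0, len(pop), k)
    return pop[idx[fitness(pop[idx]).argmax()]]

for generation in range(100):
    children = []
    for _ in range(pop_size):
        p1, p2 = tournament(pop), tournament(pop)
        mask = rng.random(dim) < 0.5           # uniform crossover
        child = np.where(mask, p1, p2)
        mutate = rng.random(dim) < mut_rate    # Gaussian mutation
        child = child + mutate * rng.normal(0, 0.1, dim)
        children.append(child)
    pop = np.array(children)

print(fitness(pop).max())           # approaches 0, the maximum

An mtDNA-style restriction, as in [53], would amount to rejecting a (p1, p2) pair whose mtDNA markers are too close before performing the crossover.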
M. NEURAL MACHINE TRANSLATION (NMT)
Neural machine translation is a turnkey solution for translating sentences. While it provides some improvement over traditional statistical machine translation (SMT), it does not scale well to large models or datasets, requires a lot of computational power for training and translation, and has difficulty with rare words. For these reasons, large tech companies like Google and Microsoft have both improved on NMT with their own implementations, labeled Google Neural Machine Translation (GNMT) and Skype Translator respectively. GNMT, shown in Figure 25, consists of encoder and decoder LSTM blocks organized in layers and was presented in 2016 in [84]. It overcomes the shortcomings of NMT with an enhanced deep LSTM neural network comprising 8 encoder and 8 decoder layers, and a method to break down rare, difficult words to infer their meaning. On the 2014 Conference on Machine Translation (WMT'14) benchmarks, GNMT achieved results on par with the state of the art for the English-to-French and English-to-German language pairs [84].
FIGURE 25. GNMT architecture [84], with the encoder neural network on the left and the decoder neural network on the right.
N. MULTI-INSTANCE MULTI-LABEL LEARNING
Images in real life include multiple instances (objects) and need multiple labels to describe them. For example, a picture of an office space could include a laptop computer, a desk, a cubicle and a person typing on the computer. Zhou et al. [85] proposed the MIML (multi-instance multi-label learning) framework and the corresponding MIMLBOOST and MIMLSVM algorithms for efficient learning of individual object labels within complex high-level concepts such as the office space. The goal is to learn f : 2^X → 2^Y from the dataset {(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)}, where X_i ⊆ X is a set of instances {x_{i1}, x_{i2}, ..., x_{i,n_i}}, x_{ij} ∈ X (j = 1, 2, ..., n_i), and Y_i ⊆ Y is a set of labels {y_{i1}, y_{i2}, ..., y_{i,l_i}}, y_{ik} ∈ Y (k = 1, 2, ..., l_i), where n_i is the number of instances in X_i and l_i is the number of labels in Y_i [85]. MIMLBOOST uses category-wise decomposition into traditional single-instance single-label supervised learning, whereas MIMLSVM utilizes cluster-based feature transformation. So, instead of trying to learn complex entities (e.g., the office space) directly, [85] took the alternate route of learning the lower-level individual objects and inferring the higher-level concepts.
O. ADVERSARIAL TRAINING
Machine learning training and deployment used to be done on isolated computers, but they are now increasingly done in highly interconnected commercial production environments. Take a face recognition system, where a network could be trained on a fleet of servers with a training dataset imported from an external data source, and the trained model could be deployed on another server that accepts API calls with real-time inputs (e.g., images of people entering a building) and responds with matches. The interconnected architecture exposes the machine learning system to a wide attack surface.
The real-time input or the training dataset can be manipulated by an
adversary to compromise the output (the image match returned by the network) or the entire model, respectively. Adversarial machine learning is a relatively new field of research that takes into account these new threats to machine learning. According to [86], adversaries (e.g., email spammers) can exploit the lack of a stationary data distribution and manipulate the input so that, e.g., an actual spam email passes as a normal email. Reference [86] demonstrates these and other vulnerabilities and discusses how the application domain, the features and the data distribution can be used to reduce the risk and impact of such adversarial attacks.
P. GAUSSIAN MIXTURE MODEL
The Gaussian mixture model (GMM) is a statistical probabilistic model that represents a larger distribution as a combination of normal (Gaussian) distributions, fitted with the expectation-maximization (EM) algorithm in an unsupervised setting. For example, a GMM could represent the height distribution of a large population group with two Gaussian distributions, one each for the male and female sub-groups. Figure 26 demonstrates a GMM with three Gaussian components.
FIGURE 26. GMM example with three components.
GMMs have been used primarily in speech recognition and in tracking objects in video sequences. GMMs are very effective at extracting speech features and can model a probability density function to any desired level of accuracy given sufficiently many components, and expectation maximization makes the model easy to fit [87]. The probability density function of the GMM is given by the following [87]:

p(x) = \sum_{m=1}^{M} c_m \, \mathcal{N}(x; \mu_m, \Sigma_m), \quad c_m > 0    (45)

where M is the number of Gaussian components, c_m is the weight of the m-th Gaussian, and \mathcal{N}(x; \mu_m, \Sigma_m) is the Gaussian density of the random variable x with mean vector \mu_m and covariance matrix \Sigma_m.
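Here is a minimal NumPy sketch of fitting a one-dimensional, two-component GMM of the form in Eq. (45) with a few EM iterations; the synthetic data and the fixed component count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 700)])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guesses for the weights c_m, means mu_m and variances
c, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    r = c[:, None] * gauss(x, mu[:, None], var[:, None])
    r /= r.sum(axis=0)
    # M-step: re-estimate the weights, means and variances
    n = r.sum(axis=1)
    c = n / len(x)
    mu = (r * x).sum(axis=1) / n
    var = (r * (x - mu[:, None]) ** 2).sum(axis=1) / n

print(c, mu, var)   # near [0.3, 0.7], [-2, 2], [0.25, 1.0]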
Q. SIAMESE NETWORKS
The purpose of a siamese network is to determine the degree of similarity between two images. As shown in Figure 27, a siamese network consists of two identical CNNs with identical weights and parameters. The two images to be compared are passed separately through the twin CNNs, and the respective output vector representations are evaluated using the contrastive loss function, defined as follows [88]:

L(W, Y, \vec{X}_1, \vec{X}_2) = (1 - Y)\,\tfrac{1}{2}(D_W)^2 + Y\,\tfrac{1}{2}\big(\max(0, m - D_W)\big)^2    (46)

FIGURE 27. Siamese network.
D_W represents the Euclidean distance between the two output vectors, as shown in Figure 27. The label Y is either 1 (indicating the images are not the same) or 0 (indicating the images are the same), and m is a margin value greater than 0. The idea of siamese networks has been extended to triplet networks, which include three identical networks and are used to assess the similarity of a given image to two other images. Since the softmax layer's outputs must match the number of classes, a standard CNN becomes impractical for problems with a very large number of classes. This issue does not apply to siamese networks, because the number of outputs of the twin networks does not have to match the number of classes [89]. This ability to scale to many more classes extends the use of siamese networks beyond what a traditional CNN is used for. Siamese networks can be used for handwritten check recognition, signature verification, text similarity, etc.
R. VARIATIONAL AUTOENCODERS
As the name suggests, the variational autoencoder (VAE) is a type of autoencoder and consists of encoder and decoder parts, as shown in Figure 28. It falls under the generative model class of neural networks and is used in unsupervised learning. VAEs learn a low-dimensional representation (the latent variable) that models the original high-dimensional dataset as a Gaussian distribution. The Kullback-Leibler (KL) divergence is a good way to compare distributions; therefore, the loss function in a VAE is a combination of the cross entropy (or mean squared error), which minimizes the reconstruction error, and the KL divergence, which makes the compressed latent variable follow a Gaussian distribution. We can then sample from that distribution to generate new samples that are representative of the original dataset. VAEs have found various applications, from generating images in video games to de-noising pictures.
FIGURE 28. Variational autoencoder.
In Figure 28, x is the input and z is the encoded output (the latent variable); P(x) is the distribution associated with x and P(z) the distribution associated with z. The goal is to infer P(z) based on P(z|x), which follows a certain distribution. The mathematical derivation of the VAE was originally proposed in [90]. Suppose we want to approximate P(z|x) with some Q(z|x); then we can try to minimize the KL divergence between the two:

D_{KL}[Q(z|x) \,\|\, P(z|x)] = \sum_z Q(z|x) \log \frac{Q(z|x)}{P(z|x)}    (47)

= E\Big[\log \frac{Q(z|x)}{P(z|x)}\Big]    (48)

= E[\log Q(z|x) - \log P(z|x)]    (49)

where D_{KL} is the Kullback-Leibler divergence and E denotes expectation. Using Bayes' rule:

P(z|x) = \frac{P(x|z)\,P(z)}{P(x)}    (50)

D_{KL}[Q(z|x) \,\|\, P(z|x)] = E\Big[\log Q(z|x) - \log \frac{P(x|z)\,P(z)}{P(x)}\Big]    (51)

= E[\log Q(z|x) - \log P(x|z) - \log P(z)] + \log P(x)    (52)

To allow us to easily sample P(z) and generate new data, we set P(z) to the standard normal distribution N(0, 1). If Q(z|x) is represented as a Gaussian with parameters \mu(x) and \Sigma(x), then the KL divergence between Q(z|x) and P(z) can be derived in closed form as:

D_{KL}[N(\mu(x), \Sigma(x)) \,\|\, N(0, 1)] = \frac{1}{2} \sum_k \big( \exp(\Sigma_k(x)) + \mu_k^2(x) - 1 - \Sigma_k(x) \big)    (53)
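As an illustration, here is a minimal NumPy sketch that evaluates the two parts of the VAE objective, the reconstruction error and the closed-form KL term of Eq. (53) (with \Sigma(x) interpreted as a log-variance), together with the usual reparameterized sampling z = \mu + \sigma \cdot \epsilon; the toy encoder outputs are illustrative assumptions, not a trained model.

import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced these for one input x (latent dim 4)
mu = np.array([0.5, -0.2, 0.1, 0.0])        # mean of Q(z|x)
log_var = np.array([-1.0, -0.5, 0.0, 0.2])  # log-variance of Q(z|x)

# Reparameterization trick: sample z = mu + sigma * eps
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL divergence of Eq. (53) against N(0, 1)
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Reconstruction term: mean squared error between input and decoder output
x = rng.standard_normal(8)          # toy input
x_hat = rng.standard_normal(8)      # stand-in for the decoder's output
recon = np.mean((x - x_hat) ** 2)

loss = recon + kl                   # the VAE training objective
print(recon, kl, loss)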
S. DEEP REINFORCEMENT LEARNING
The primary idea behind reinforcement learning is to have an agent learn from the environment through random experimentation (exploration) and defined rewards (exploitation). The setting consists of a finite number of states s_i (representing the agent and the environment), actions a_i taken by the agent, the probability P_a of moving from one state to another based on action a_i, and the reward R_a(s_i, s_{i+1}) associated with moving to the next state under action a. The goal is to balance and maximize the current reward R and the discounted future reward \gamma \cdot \max_{a'} Q(s', a') by predicting the best action, as defined by the function Q(s, a); \gamma in the equation represents a fixed discount factor. Q(s, a) is represented as the sum of the current reward and the discounted future reward, as shown below:

Q(s, a) = R + \gamma \cdot \max_{a'} Q(s', a')    (54)

Reinforcement learning is specifically suited to problems that involve both short-term and long-term rewards, e.g., games like chess and Go. AlphaGo, Google's program that beat the human Go champion, also uses reinforcement learning [91]. When we combine deep network architectures with reinforcement learning, we get deep reinforcement learning (DRL), which extends reinforcement learning to even more complex games and to areas such as robotics, smart grids, healthcare and finance [92]. With DRL, problems that were intractable for plain reinforcement learning can now be solved by combining the many hidden layers of deep networks with the reinforcement-learning-based Q-learning algorithm, which maximizes the reward for the actions taken by the agent [13]. A toy tabular sketch of the update behind Eq. (54) follows.
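Here is a minimal tabular Q-learning sketch applying the update behind Eq. (54) on a toy chain environment with epsilon-greedy exploration; the environment, learning rate and optimistic initialization are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2          # chain 0..4; actions: 0=left, 1=right
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = np.ones((n_states, n_actions))  # optimistic init encourages exploration

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = s_next == n_states - 1   # reaching state 4 ends the episode
    return s_next, (1.0 if done else 0.0), done

for episode in range(200):
    s = 0
    for _ in range(100):            # cap on episode length
        # Epsilon-greedy: mostly exploit, occasionally explore
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Move Q(s, a) toward R + gamma * max_a' Q(s', a'), per Eq. (54)
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

print(Q.argmax(axis=1))  # expect action 1 ('right') in states 0..3

In DRL, the table Q is replaced by a deep network Q(s, a; θ), and the same target drives the gradient updates of θ.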
T. GENERATIVE ADVERSARIAL NETWORK (GAN)
GANs consist of a generative and a discriminative neural network. The generative network generates new (fake) data based on the input data (unsupervised learning), and the discriminative network attempts to distinguish whether the data is real (from the training set) or generated. The generative network is trained to increase the probability of deceiving the discriminative network, i.e., to make the generated data indistinguishable from the original. GANs were proposed by Goodfellow et al. [93] in 2014. They have been very popular, as they have many applications, both good and bad; e.g., [94] were able to successfully synthesize realistic images from text.
U. MULTI-APPROACH METHOD FOR ENHANCING DEEP LEARNING
Deep learning can be optimized in different areas. We discussed training algorithm enhancements, parallel processing, parameter optimization and various architectures. All of these can be implemented simultaneously in one framework to get the best results for a specific problem. The training algorithms can be fine-tuned at different levels by incorporating heuristics, e.g., for hyperparameter optimization. The time to train a deep learning network model is a major factor in gauging the performance of an algorithm or network. Instead of training the network with the full dataset, we can pre-select a smaller but representative dataset from the full training distribution using instance selection methods [95] or Monte Carlo sampling [48]. An effective sampling method can prevent overfitting, improve accuracy and speed up the learning process without compromising the quality of the training dataset. Albelwi and Mahmood [96] designed a framework that combines dataset reduction, a deconvolution network, the correlation coefficient and an updated objective function. The Nelder-Mead method was used to optimize the parameters of the objective function, and the results were comparable to the latest known results on the MNIST dataset [96]. Thus, combining optimizations at multiple levels and using multiple methods is a promising field of research and can lead to further advancements in machine learning.
VIII. CONCLUSION
In this tutorial, we provided a thorough overview of neural networks and deep neural networks. We took a deeper dive into the well-known training algorithms and architectures. We highlighted their shortcomings, e.g., getting stuck in local minima, overfitting and long training times for large problem sets.
We examined several state-of-the-art ways to overcome these challenges with different optimization methods. We investigated adaptive learning rates and hyperparameter optimization as effective methods to improve the accuracy of the network. We surveyed and reviewed several recent papers, studied them and presented their implementations and improvements to the training process. We also included tables that summarize the content in a concise manner and provide a full view of how different aspects of deep learning are correlated. Deep learning is still in its nascent stage. There is tremendous opportunity for exploitation of current algorithms and architectures and for further exploration of optimization methods to solve more complex problems. Training is currently constrained by overfitting and training time, and it is highly susceptible to getting stuck in local minima. If we can continue to overcome these challenges, deep learning networks will accelerate breakthroughs across all applications of machine learning and artificial intelligence.
CONFLICTS OF INTEREST
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
ORCID
Ajay Shrestha: http://orcid.org/0000-0001-5595-5953
REFERENCES
[1] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychol. Rev., vol. 65, no. 6, pp. 386-408, 1958.
[2] M. Minsky and S. A. Papert, Perceptrons: An Introduction to Computational Geometry, Expanded Edition. Cambridge, MA, USA: MIT Press, 1969, p. 258.
[3] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control, Signals Syst., vol. 2, no. 4, pp. 303-314, 1989.
[4] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Netw., vol. 4, no. 2, pp. 251-257, 1991.
[5] P. J. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Harvard Univ., Cambridge, MA, USA, 1975.
[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[7] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255-260, 2015.
[8] A. Ng, "Machine learning yearning: Technical strategy for AI engineers in the era of deep learning," Tech. Rep., 2019.
[9] C. Metz, Turing Award Won by 3 Pioneers in Artificial Intelligence. New York, NY, USA: New York Times, 2019, p. B3.
[10] K. Nagpal et al., "Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer," CoRR, Nov. 2018.
[11] S. Nevo, "ML for flood forecasting at scale," CoRR, Jan. 2019.
[12] A. Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, p. 115, 2017.
[13] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26-38, Nov. 2017.
[14] M. Gheisari, G. Wang, and M. Z. A. Bhuiyan, "A survey on deep learning in big data," in Proc. IEEE Int. Conf. Comput. Sci. Eng. (CSE), Jul. 2017, pp. 173-180.
[15] S. Pouyanfar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Comput. Surv., vol. 51, no. 5, p. 92, 2018.
[16] R. Vargas, A. Mosavi, and R. Ruiz, "Deep learning: A review," in Proc. Adv. Intell. Syst. Comput., 2017, pp. 1-11.
[17] M. D. Buhmann, Radial Basis Functions. Cambridge, U.K.: Cambridge Univ. Press, 2003, p. 270.
[18] A. A. Akinduko, E. M. Mirkes, and A. N. Gorban, "SOM: Stochastic initialization versus principal components," Inf. Sci., vols. 364-365, pp. 213-221, Oct. 2016.
[19] K. Chen, "Deep and modular neural networks," in Springer Handbook of Computational Intelligence, J. Kacprzyk and W. Pedrycz, Eds. Berlin, Germany: Springer, 2015, pp. 473-494.
[20] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in Proc. 14th Int. Conf. Neural Inf. Process. Syst. Cambridge, MA, USA: MIT Press, 2001, pp. 841-848.
[21] C. M. Bishop and J. Lasserre, "Generative or discriminative? Getting the best of both worlds," Bayesian Statist., vol. 8, pp. 3-24, Jan. 2007.
[22] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," CoRR, Apr. 2017.
[23] X.-W. Chen and X. Lin, "Big data deep learning: Challenges and perspectives," IEEE Access, vol. 2, pp. 514-525, 2014.
[24] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proc. IEEE Int. Symp. Circuits Syst., May/Jun. 2010, pp. 253-256.
[25] G. Gousios, B. Vasilescu, A. Serebrenik, and A. Zaidman, "Lean GHTorrent: GitHub data on demand," in Proc. 11th Work. Conf. Mining Softw. Repositories, Hyderabad, India, 2014, pp. 384-387.
[26] AI-Index. (2019). Top Deep Learning Github Repositories. [Online]. Available: https://github.com/mbadry1/Top-Deep-Learning
[27] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?" J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133-3181, 2014.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[29] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, A. A. Michael, Ed. Cambridge, MA, USA: MIT Press, 1998, pp. 255-258.
[30] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Computer Vision. Berlin, Germany: Springer, 2010.
[31] A. Ng. (Jul. 21, 2018). Convolutional Neural Network. UFLDL. [Online].
Available: http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/
[32] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Schölkopf, "A machine learning approach for non-blind image deconvolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1067-1074.
[33] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, Nov. 2015.
[34] I. T. Jolliffe, "Principal component analysis," in Mathematics and Statistics, 2nd ed. New York, NY, USA: Springer, 2002, p. 487.
[35] K. Noda, "Multimodal integration learning of object manipulation behaviors using deep neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 1728-1733.
[36] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[37] M. Wang, H.-X. Li, X. Chen, and Y. Chen, "Deep learning-based model reduction for distributed parameter systems," IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 12, pp. 1664-1674, Dec. 2016.
[38] A. Ng. (Jul. 21, 2018). Autoencoders. UFLDL. [Online]. Available: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders
[39] Y. W. Teh and G. E. Hinton, "Rate-coded restricted Boltzmann machines for face recognition," in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 908-914.
[40] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade, 2nd ed., G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Berlin, Germany: Springer, 2012, pp. 599-619.
[41] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
[42] C. Metz, "Apple is bringing the AI revolution to your phone," Wired, 2016.
[43] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451-2471, 2000.
[44] J. Chung. (2014). "Empirical evaluation of gated recurrent neural networks on sequence modeling." [Online]. Available: https://arxiv.org/abs/1412.3555
[45] K. Cho. (2014). "Learning phrase representations using RNN encoder-decoder for statistical machine translation." [Online]. Available: https://arxiv.org/abs/1406.1078
[46] B. Naul, J. S. Bloom, F. Pérez, and S. van der Walt, "A recurrent neural network for classification of unevenly sampled variable stars," Nature Astron., vol. 2, no. 2, pp. 151-155, 2018.
[47] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, "Deep learning applications and challenges in big data analytics," J. Big Data, vol. 2, no. 1, p. 1, Feb. 2015.
[48] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," in Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press, 2016, p. 775.
[49] H. P. Gavin, "The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems," Tech. Rep., 2016.
[50] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249-256.
[51] J. Martens, "Deep learning via Hessian-free optimization," in Proc. 27th Int. Conf. Mach. Learn. Haifa, Israel: Omnipress, 2010, pp. 735-742.
[52] H. J. Escalante, M. Montes, and L. E.
Sucar, "Particle swarm model selection," J. Mach. Learn. Res., vol. 10, pp. 405-440, Feb. 2009.
[53] A. Shrestha and A. Mahmood, "Improving genetic algorithm with fine-tuned crossover and scaled architecture," J. Math., vol. 2016, p. 10, Mar. 2016.
[54] K. Sastry, D. Goldberg, and G. Kendall, Genetic Algorithms. 2005.
[55] D. E. Goldberg, The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Boston, MA, USA: Springer, 2013.
[56] R. Miikkulainen, "Evolving deep neural networks," CoRR, Mar. 2017.
[57] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, pp. 2121-2159, Jul. 2011.
[58] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, Dec. 2014.
[59] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, Feb. 2015.
[60] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, 2014.
[61] AW Services. (Jul. 21, 2018). Amazon EC2 P2 and P3 Instances. Amazon EC2 Instance Types. [Online]. Available: https://aws.amazon.com/ec2/instance-types/p2/ and https://aws.amazon.com/ec2/instance-types/p3/
[62] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[63] A. J. R. Simpson, "Uniform learning in a deep neural network via 'oddball' stochastic gradient descent," CoRR, Oct. 2015.
[64] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain, "Unconstrained face recognition: Identifying a person of interest from a media collection," IEEE Trans. Inf. Forensics Security, vol. 9, no. 12, pp. 2144-2157, Dec. 2014.
[65] T. A. Letsche and M. W. Berry, "Large-scale information retrieval with latent semantic indexing," Inf. Sci., vol. 100, nos. 1-4, pp. 105-137, 1997.
[66] G. E. Hinton, "Learning multiple layers of representation," Trends Cognit. Sci., vol. 11, no. 10, pp. 428-434, Oct. 2007.
[67] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in Proc. 12th Int. Conf. Artif. Intell. Statist., D. van Dyk and M. Welling, Eds. 2009, pp. 448-455.
[68] W. Kuo, B. Hariharan, and J. Malik, "DeepBox: Learning objectness with convolutional networks," CoRR, May 2015.
[69] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1-3, pp. 489-501, 2006.
[70] J. Tang, C. Deng, and G.-B. Huang, "Extreme learning machine for multilayer perceptron," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 809-821, Apr. 2015.
[71] M. Gong, J. Liu, H. Li, Q. Cai, and L. Su, "A multiobjective sparse feature learning model for deep neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 12, pp. 3263-3277, Dec. 2015.
[72] S. Mehrkanoon, C. Alzate, R. Mall, R. Langone, and J. A. K. Suykens, "Multiclass semisupervised learning based upon kernel spectral clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4, pp. 720-733, Apr. 2015.
[73] R. Langone, R. Mall, C. Alzate, and J. A. K. Suykens, "Kernel spectral clustering and applications," CoRR, May 2015.
[74] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, "Very deep convolutional networks for text classification," CoRR, Jun. 2016.
[75] N. Krpan and D. Jakobovic, "Parallel neural network training with OpenCL," in Proc. 35th Int. Conv. MIPRO, May 2012, pp. 1053-1057.
[76] W. Dong and M. Zhou, "A supervised learning and control method to improve particle swarm optimization algorithms," IEEE Trans. Syst., Man, Cybern. Syst., vol. 47, no. 7, pp. 1135-1148, Jul. 2017.
[77] V. Vapnik and R. Izmailov, "Learning using privileged information: Similarity control and knowledge transfer," J. Mach. Learn. Res., vol. 16, no. 1, pp. 2023-2049, Jan. 2015.
[78] J. R. Sampson, Adaptation in Natural and Artificial Systems, vol. 18, no. 3, J. H. Holland, Ed. Philadelphia, PA, USA: SIAM, 1976, pp. 529-530.
[79] N. M. Razali and J. Geraghty, "Genetic algorithm performance with different selection strategies in solving TSP," in Proc. World Congr. Eng., 2010, pp. 1-6.
[80] P. Larrañaga, C. M. H. Kuijpers, R. H. Murga, I. Inza, and S. Dizdarevic, "Genetic algorithms for the travelling salesman problem: A review of representations and operators," Artif. Intell. Rev., vol. 13, no. 2, pp. 129-170, Apr. 1999.
[81] D. Whitley, "A genetic algorithm tutorial," Statist. Comput., vol. 4, no. 2, pp. 65-85, Jun. 1994.
[82] C.-T. Lin, M. Prasad, and A. Saxena, "An improved polynomial neural network classifier using real-coded genetic algorithm," IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 11, pp. 1389-1401, Nov. 2015.
[83] Y. Guo et al., "The use of next generation sequencing technology to study the effect of radiation therapy on mitochondrial DNA mutation," Mutation Res./Genetic Toxicol. Environ. Mutagenesis, vol. 744, no. 2, pp. 154-160, 2012.
[84] Y. Wu, "Google's neural machine translation system: Bridging the gap between human and machine translation," CoRR, Sep. 2016.
[85] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li, "Multi-instance multi-label learning," Artif. Intell., vol. 176, no. 1, pp. 2291-2320, 2012.
[86] L. Huang, A. D. Joseph, B. Nelson, B. I. P. Rubinstein, and J. D.
Tygar, "Adversarial machine learning," in Proc. 4th ACM Workshop Secur. Artif. Intell., Chicago, IL, USA, 2011, pp. 43-58.
[87] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. London, U.K.: Springer, 2015.
[88] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 1735-1742.
[89] A. Shrestha and A. Mahmood, "Enhancing siamese networks training with importance sampling," in Proc. 11th Int. Conf. Agents Artif. Intell. Prague, Czech Republic: SciTePress, 2019, pp. 610-615.
[90] D. P. Kingma and M. Welling. (2013). "Auto-encoding variational Bayes." [Online]. Available: https://arxiv.org/abs/1312.6114
[91] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[92] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," CoRR, Dec. 2018.
[93] I. J. Goodfellow et al. (2014). "Generative adversarial networks." [Online]. Available: https://arxiv.org/abs/1406.2661
[94] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. (2016). "Generative adversarial text to image synthesis." [Online]. Available: https://arxiv.org/abs/1605.05396
[95] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining Knowl. Discovery, vol. 6, no. 2, pp. 153-172, 2002.
[96] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.
AJAY SHRESTHA received the B.S. degree in computer engineering and the M.S. degree in computer science from the University of Bridgeport, CT, USA, in 2002 and 2006, respectively, where he is currently pursuing the Ph.D. degree in computer science and engineering. He has guest lectured at Pennsylvania State University. He is an Adjunct Faculty with the School of Engineering, University of Bridgeport, and is with Thermo Fisher Scientific, Branford, CT, USA, as a Manager of Technical Operations. His research interests include machine learning and metaheuristics. He has served as a Technical Committee Member of the International Conference on Systems, Computing Sciences and Software Engineering (SCSS). He received the Academic Excellence Award and a Graduate Research Assistantship for his undergraduate and graduate studies, respectively. He has been serving as the Chapter Vice President and in other officer roles of Upsilon Pi Epsilon (UPE) since 2014, and received the UPE Executive Council Award, presented by the UPE Executive Council, in 2016.
AUSIF MAHMOOD (SM'82) received the M.S. and Ph.D. degrees in electrical and computer engineering from Washington State University, USA. He is currently the Chairperson of the Computer Science and Engineering Department and a Professor with the Computer Science and Engineering Department and the Electrical Engineering Department, University of Bridgeport, Bridgeport, CT, USA. His research interests include parallel and distributed computing, computer vision, deep learning, and computer architecture.