Deep learning is a subset of machine learning and AI
Outline
 Machine Learning basics
 Introduction to Deep Learning
 what is Deep Learning
 why is it useful
 Main components/hyper-parameters:
 activation functions
 optimizers, cost functions and training
 regularization methods
 tuning
 classification vs. regression tasks
 DNN basic architectures:
 convolutional
 recurrent
 attention mechanism
 Application example: Relation Extraction
Most material from
Backpropagation
GANs & Adversarial training
Bayesian Deep Learning
Generative models
Unsupervised / Pretraining
Machine learning is a field of computer science that gives computers the
ability to learn without being explicitly programmed
Methods that can learn from and make predictions on data
Training: Labeled Data → Machine Learning algorithm → Learned model
Prediction: Labeled Data → Learned model → Prediction
Machine Learning Basics
Regression
Supervised: Learning with a labeled training set
Example: email classification with already labeled emails
Unsupervised: Discover patterns in unlabeled data
Example: cluster similar documents based on text
Reinforcement learning: learn to act based on feedback/reward
Example: learn to play Go, reward: win or lose
Types of Learning
Classification
Anomaly Detection
Sequence labeling
Clustering
…
http://mbjoseph.github.io/2013/11/27/measure.html
Most machine learning methods work well because of human-designed
representations and input features
ML then becomes just optimizing weights to make the best final prediction
ML vs. Deep Learning
A machine learning subfield concerned with learning representations of data. Exceptionally
effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by
using a hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it and
respond in useful ways.
What is Deep Learning (DL) ?
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
o Manually designed features are often over-specified, incomplete and take a
long time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable
framework for representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
Why is DL useful?
In ~2010 DL started outperforming
other ML techniques
first in speech and vision, then NLP
Several big improvements in recent years in NLP
 Machine Translation
 Sentiment Analysis
 Dialogue Agents
 Question Answering
 Text Classification …
State of the art in …
Leverage different levels of representation
o words & characters
o syntax & semantics
Neural Network Intro
How do we train?
4 + 2 = 6 neurons (not counting inputs)
[3 x 4] + [4 x 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters
Weights
Activation functions
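A quick sanity check of that parameter count, as a small Python sketch (the 3-4-2 layer sizes are taken from the figure; everything else is illustrative):

sizes = [3, 4, 2]                                             # input, hidden, output layer sizes
weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))   # 3*4 + 4*2 = 20
biases = sum(sizes[1:])                                       # 4 + 2 = 6
print(weights + biases)                                       # 26 learnable parameters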
Training
1. Sample labeled data (a batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights
Optimize (min. or max.) objective/cost function
Generate error signal that measures difference
between predictions and target values
Use error signal to change the weights and get more
accurate predictions
Subtracting a fraction of the gradient moves you
towards the (local) minimum of the cost function
https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-4106a6702d39
learning rate
Gradient Descent
objective/cost function
Update each element of θ
Matrix notation for all parameters
Recursively apply the chain rule through each node
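A minimal gradient-descent sketch in NumPy (the quadratic cost, the data and the learning rate below are illustrative assumptions, not values from the slides):

import numpy as np

def cost(theta, X, y):                     # objective: mean squared error of a linear model
    return np.mean((X @ theta - y) ** 2)

def gradient(theta, X, y):                 # analytic gradient of the cost w.r.t. theta
    return 2 * X.T @ (X @ theta - y) / len(y)

X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])         # targets generated from a known theta
theta = np.zeros(3)
lr = 0.1                                   # learning rate
for step in range(200):
    theta -= lr * gradient(theta, X, y)    # subtract a fraction of the gradient
print(theta, cost(theta, X, y))            # theta approaches [1.0, -2.0, 0.5], cost near its minimum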
One forward pass
[Figure: the input vector (0.1, 0.2, 0.3) is multiplied through the weight matrices, giving intermediate activations (0.95, 3.89, 0.15, 0.37) and output scores (1.0, 3.0, 0.025, 0.0)]
Text (input) representation
TFIDF
Word embeddings
….
very positive
positive
very negative
negative
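A NumPy sketch of one forward pass through a small network like the one in the figure (the input values are the ones shown; the random weights and layer sizes are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.1, 0.2, 0.3])               # input features (e.g. TF-IDF values or an embedding)
W1, b1 = np.random.randn(4, 3), np.zeros(4)  # first layer: 3 inputs -> 4 hidden units
W2, b2 = np.random.randn(4, 4), np.zeros(4)  # second layer: 4 hidden units -> 4 class scores

h      = sigmoid(W1 @ x + b1)                # hidden activations
scores = W2 @ h + b2                         # one score per sentiment class
probs  = np.exp(scores) / np.exp(scores).sum()   # probabilities for very positive ... negative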
Non-linearities needed to learn complex (non-linear) representations of data,
otherwise the NN would be just a linear function
More layers and neurons can approximate more complex functions
Activation functions
Full list:
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
Activation: Sigmoid
+ Nice interpretation as the firing rate of a neuron
• 0 = not firing at all
• 1 = fully firing
- Sigmoid neurons saturate and kill gradients, so the NN will barely learn
• when the neuron’s activations are 0 or 1 (saturated)
→ the gradient in these regions is almost zero
→ almost no signal will flow back to its weights
→ if the initial weights are too large, most neurons will saturate
Takes a real-valued number and
“squashes” it into range between 0
and 1.
http://adilmoujahid.com/images/activation.png
Activation: Tanh
- Like sigmoid, tanh neurons saturate
- Unlike sigmoid, output is zero-centered
- Tanh is a scaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
Takes a real-valued number and
“squashes” it into range between -1
and 1.
http://adilmoujahid.com/images/activation.png
Activation: ReLU
Takes a real-valued number and
thresholds it at zero
Most Deep Networks use ReLU nowadays
+ Trains much faster
• accelerates the convergence of SGD
• due to its linear, non-saturating form
+ Less expensive operations
• compared to sigmoid/tanh (exponentials etc.)
• implemented by simply thresholding a matrix at zero
+ More expressive
+ Helps prevent the vanishing gradient problem
http://adilmoujahid.com/images/activation.png
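The three activations side by side, as a NumPy sketch:

import numpy as np

def sigmoid(z):               # squashes to (0, 1); saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                  # squashes to (-1, 1); zero-centered, still saturates
    return np.tanh(z)         # note: tanh(z) = 2 * sigmoid(2z) - 1

def relu(z):                  # thresholds at zero; non-saturating for z > 0
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 11)
print(sigmoid(z), tanh(z), relu(z), sep="\n")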
Overfitting
Learned hypothesis may fit the
training data very well, even
outliers (noise) but fail to
generalize to new examples (test
data)
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
L2 = weight decay
• Regularization term that penalizes big weights, added to
the objective
• Weight decay value determines how dominant regularization is
during gradient computation
• Big weight decay coefficient → big penalty for big weights
Regularization
Dropout
• Randomly drop units (along with their
connections) during training
• Each unit retained with fixed probability p,
independent of other units
• Hyper-parameter p to be chosen (tuned)
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
Srivastava, Nitish, et al. Journal of machine learning research (2014)
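A minimal Keras-style sketch combining the three techniques (the layer sizes, L2 coefficient, dropout rate and patience value are illustrative assumptions; note that Keras' Dropout takes the drop probability, whereas the slide's p is the retention probability):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),   # L2 / weight decay
    keras.layers.Dropout(0.5),                                            # randomly drop units
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)  # early stopping
# model.fit(x_train, y_train, validation_split=0.1, epochs=100, callbacks=[early_stop])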
Tuning hyper-parameters
“Grid and random search of 9 trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x).
With grid search, nine trials only test g(x) in three distinct places.
With random search, all nine trials explore distinct values of g.”
Both grid and random search try configurations blindly:
the next trial is independent of all the trials done before
Make smarter choice for the next trial, minimize the number of trials
1. Collect the performance at several configurations
2. Make inference and decide what configuration to try next
f(x, y) = g(x) + h(y) ≈ g(x)
g(x) is shown in green
h(y) is shown in yellow
Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of
Machine Learning Research, Feb (2012)
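A sketch of plain random search over two hyper-parameters (the toy validation-error function and the search ranges are assumptions; each of the nine trials is drawn independently of the others):

import random

def validation_error(lr, dropout):            # stand-in for: train a model, return validation error
    return (lr - 0.01) ** 2 + 0.1 * (dropout - 0.5) ** 2

best = None
for trial in range(9):                        # nine blind, independent trials
    lr      = 10 ** random.uniform(-4, -1)    # learning rate sampled on a log scale
    dropout = random.uniform(0.0, 0.8)
    err = validation_error(lr, dropout)
    if best is None or err < best[0]:
        best = (err, lr, dropout)
print(best)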
Loss functions and output
| | Classification | Regression |
| Training examples | R^n x {class_1, ..., class_n} (one-hot encoding) | R^n x R^m |
| Output layer | Soft-max (maps R^n to a probability distribution) | Linear (identity, f(x) = x) or Sigmoid |
| Cost (loss) function | Cross-entropy | Mean Squared Error / Mean Absolute Error |
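A NumPy sketch of the two standard pairings (soft-max output with cross-entropy for classification; identity output with mean squared error for regression):

import numpy as np

def softmax(z):                                   # maps R^n to a probability distribution
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, one_hot):                # classification loss
    return -np.sum(one_hot * np.log(probs + 1e-12))

def mse(pred, target):                            # regression loss
    return np.mean((pred - target) ** 2)

logits = np.array([2.0, 0.5, -1.0])
target = np.array([1.0, 0.0, 0.0])                # one-hot encoding of class_1
print(cross_entropy(softmax(logits), target))
print(mse(np.array([1.2, 0.8]), np.array([1.0, 1.0])))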
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural
Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
Example: for “this takes too long”, compute vectors for:
this takes, takes too, too long, this takes too, takes too long, this takes too long
Input matrix
Convolutional
3x3 filter
Convolutional Neural
Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
https://shafeentejani.github.io/assets/images/pooling.gif
max pool
2x2 filters
and stride 2
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for Twitter
Sentiment Classification." SemEval@ NAACL-HLT. 2015.
CNN for text classification
CNN with multiple filters
Kim, Y. “Convolutional Neural Networks for Sentence Classification”, EMNLP (2014)
sliding over 3, 4 or 5 words at a time
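A minimal Keras-style sketch of a Kim-style text CNN (vocabulary size, embedding dimension, number of filters and sequence length are illustrative assumptions):

from tensorflow import keras

words = keras.Input(shape=(100,))                           # sentence as a sequence of word indices
emb = keras.layers.Embedding(input_dim=20000, output_dim=128)(words)

pooled = []
for width in (3, 4, 5):                                     # slide over 3, 4 or 5 words at a time
    conv = keras.layers.Conv1D(100, kernel_size=width, activation="relu")(emb)
    pooled.append(keras.layers.GlobalMaxPooling1D()(conv))  # max pooling over time
features = keras.layers.Concatenate()(pooled)

output = keras.layers.Dense(1, activation="sigmoid")(features)
model = keras.Model(words, output)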
Recurrent Neural Networks
(RNNs)
Main RNN idea for text:
Condition on all previous words
Use same set of weights at all time steps
→ Stack them up, Lego fun!
https://discuss.pytorch.org/uploads/default/original/1X/6415da0424dd66f2f5b134709b92baa59e604c55.jpg
https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg
Bidirectional RNNs
two RNNs stacked on top of each other
output is computed based on the hidden state of both RNNs
past and future around a single token
http://www.wildml.com/2015/09/recurrent-neural-networks-
tutorial-part-1-introduction-to-rnns/
Main idea: incorporate both left and right context
the output may depend not only on previous elements in the sequence, but
also on future elements
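A minimal Keras-style sketch of a bidirectional recurrent layer (sizes are illustrative assumptions; each token's output is computed from the hidden states of both directions):

from tensorflow import keras

tokens = keras.Input(shape=(None,))                              # variable-length word-index sequence
emb = keras.layers.Embedding(input_dim=20000, output_dim=128)(tokens)
states = keras.layers.Bidirectional(
    keras.layers.SimpleRNN(64, return_sequences=True))(emb)      # forward + backward RNN per token
model = keras.Model(tokens, states)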
Sequence-to-Sequence or
Encoder-Decoder model
Cho, Kyunghyun, et al. "Learning phrase
representations using RNN encoder-decoder for
statistical machine translation." EMNLP 2014
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-
part-4-implementing-a-grulstm-rnn-with-python-and-theano/
Standard RNN computes hidden layer at next time
step directly
Compute an update gate based on current input
word vector and hidden state
Controls how much of past state should matter now
If z close to 1, then we can copy information in that unit through many steps!
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs
Standard RNN computes hidden layer at next time
step directly
Compute an update gate based on current input
word vector and hidden state
Compute a reset gate similarly but with different
weights If reset close to 0, ignore previous
hidden state (allows model to drop
information that is irrelevant in the future)
Units with short-term dependencies often have reset gates very active
Units with long-term dependencies have active update gates z
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-
part-4-implementing-a-grulstm-rnn-with-python-and-theano/
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs
Standard RNN computes hidden layer at next time
step directly
Compute an update gate based on current input
word vector and hidden state
Compute a reset gate similarly but with different
weights
New memory
Final memory: combines current & previous time steps
LSTMs are a more complex form, but
basically the same intuition
GRUs are often preferred over LSTMs
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-
part-4-implementing-a-grulstm-rnn-with-python-and-theano/
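Since the gate equations appear only as images in the slides, here is a NumPy sketch of a single GRU step (weight shapes are illustrative; biases are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 4                                                    # hidden size = input size, for simplicity
Wz, Uz = np.random.randn(d, d), np.random.randn(d, d)    # update gate weights
Wr, Ur = np.random.randn(d, d), np.random.randn(d, d)    # reset gate weights
Wh, Uh = np.random.randn(d, d), np.random.randn(d, d)    # new-memory (candidate) weights

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)                  # update gate: how much past state matters now
    r = sigmoid(Wr @ x_t + Ur @ h_prev)                  # reset gate: how much past state to ignore
    h_new = np.tanh(Wh @ x_t + Uh @ (r * h_prev))        # new memory
    return z * h_prev + (1 - z) * h_new                  # final memory combines previous and new

h = np.zeros(d)
for x_t in np.random.randn(3, d):                        # run over a short input sequence
    h = gru_step(x_t, h)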
Attention Mechanism
Bahdanau D. et al. "Neural machine translation by jointly learning to align and translate." ICLR (2015)
Main idea: retrieve as needed
Pool of source states
Attention - Scoring
Compare target and source hidden states
Attention - Normalization
Convert into alignment weights
Attention - Context
Build context vector: weighted average
Attention - Context
Compute next hidden state
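A NumPy sketch of the score / normalize / context steps above (dot-product scoring is used here as a simple choice; Bahdanau et al. score with a small feed-forward network instead):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

source_states = np.random.randn(6, 8)      # pool of source hidden states (6 positions, dim 8)
target_state  = np.random.randn(8)         # current target hidden state

scores  = source_states @ target_state     # scoring: compare target with each source state
weights = softmax(scores)                  # normalization: alignment weights that sum to 1
context = weights @ source_states          # context vector: weighted average of source states
# the context vector is then combined with the target state to compute the next hidden state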
https://uofi.box.com/v/cs510DL
Binary Classification
Dataset of 25,000 movie reviews from IMDB, labeled
by sentiment (positive/negative)
Application Example:
IMDB Movie reviews
sentiment classification
Useful for:
• knowledge base completion
• social media analysis
• question answering
• …
http://www.mathcs.emory.edu/~dsavenk/slides/relation_extraction/img/distant.png
Application Example:
Relation Extraction from text
sentence S = w1 w2 .. e1 .. wj .. e2 .. wn, where e1 and e2 are entities
“The new iPhone 7 Plus includes an improved camera to take amazing pictures”
Component-Whole(e1 , e2 ) ?
YES / NO
Task: binary (or multi-class)
classification
It is also possible to include more than two entities:
“At codons 12, the occurrence of point mutations from G to T were
observed” → point mutation(codon, 12, G, T)
Word indices
[5, 7, 12, 6, 90 …]
Position indices e1
[-1, 0, 1, 2, 3 …]
Position indices e2
[-4, -3, -2 -1, 0]
The new iPhone 7 Plus includes an improved camera that takes amazing pictures
Word Embeddings Positional emb. e1 Positional emb. e2
Embeddings e1 context embeddings
Embeddings e2
2) word sequences
concatenated with
positional features
1) context-wise split of
the sentence
3) concatenating
embeddings of two
entities with average of
word embeddings for
rest of the words
Embeddings Left Embeddings Middle Embeddings Right
Features / Input
representation
The new iPhone 7 Plus includes an improved camera that takes amazing pictures
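A small sketch of how these word and position indices can be derived for the example sentence (the toy vocabulary and the assumed entity positions are illustrative):

sentence = ("The new iPhone 7 Plus includes an improved camera "
            "that takes amazing pictures").split()
e1_pos, e2_pos = 2, 8            # token positions assumed for "iPhone" (e1) and "camera" (e2)

vocab = {w: i + 1 for i, w in enumerate(sorted(set(w.lower() for w in sentence)))}

word_indices = [vocab[w.lower()] for w in sentence]
pos_e1 = [i - e1_pos for i in range(len(sentence))]    # relative distance of each token to e1
pos_e2 = [i - e2_pos for i in range(len(sentence))]    # relative distance of each token to e2
# each sequence feeds its own embedding table (words, positions w.r.t. e1, positions w.r.t. e2),
# and the three embeddings are concatenated per token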
Sigmoid
The new iPhone 7 Plus includes an improved camera that takes …
Models: MLP
Component-Whole(e1 , e2 ) ?
YES / NO
Embeddings e1 context embeddings Embeddings e2
Dense Layer 1
Dense Layer n
…
Simple fully-connected multi-layer perceptron
Embeddings Left Embeddings Middle Embeddings Right
Convolutional Layer Convolutional Layer Convolutional Layer
Max Pooling Max Pooling Max Pooling
Sigmoid
Word indices
[5, 7, 12, 6, 90 …]
Position indices e1
[-1, 0, 1, 2, 3 …]
Position indices e2
[-4, -3, -2 -1, 0]
Word Embeddings Positional emb. e1 Positional emb. e2
OR
Component-Whole(e1 , e2 ) ?
YES / NO
The new iPhone 7 Plus includes an improved camera that takes …
Models: CNN
Zeng, D. et al. “Relation classification via convolutional deep neural network”. COLING (2014)
Embeddings Left Embeddings Middle Embeddings Right
CNN with multiple filter sizes
CNN with multiple filter sizes
Sigmoid
Word indices
[5, 7, 12, 6, 90 …]
Position indices e1
[-1, 0, 1, 2, 3 …]
Position indices e2
[-4, -3, -2 -1, 0]
Word Embeddings Positional emb. e1 Positional emb. e2
OR
The new iPhone 7 Plus includes an improved camera that takes …
Models: CNN (2)
Convolution
filter=2
Max Pooling
Convolution
filter=3
Max Pooling
Convolution
filter=k
Max Pooling
Component-Whole(e1 , e2 ) ?
YES / NO
Nguyen, T.H., Grishman, R. “Relation extraction: Perspective from convolutional neural networks.” VS@ HLT-NAACL. (2015)
Embeddings Left Embeddings Middle Embeddings Right
Sigmoid
Word indices
[5, 7, 12, 6, 90 …]
Position indices e1
[-1, 0, 1, 2, 3 …]
Position indices e2
[-4, -3, -2 -1, 0]
Word Embeddings Positional emb. e1 Positional emb. e2
OR
The new iPhone 7 Plus includes an improved camera that takes …
Models: Bi-GRU
Bi-GRU
Attention or
Max Pooling
Component-Whole(e1 , e2 ) ?
YES / NO
Zhang, D., Wang, D. “Relation classification via recurrent neural network.” arXiv preprint arXiv:1508.01006 (2015)
Zhou, P. et al. “Attention-based bidirectional LSTM networks for relation classification.” ACL (2016)
Distant Supervision
Circumvent the annotation problem – automatically create a large dataset
Exploit large knowledge bases to automatically label entities and their relations
in text
Assumption:
when two entities co-occur in a sentence, a certain relation is expressed
knowledge base:
Relation        | Entity 1        | Entity 2
place of birth  | Michael Jackson | Gary
place of birth  | Barack Obama    | Hawaii
…               | …               | …
text:
“Barack Obama moved from Gary ….”
“Michael Jackson met … in Hawaii”
→ place of birth
For many ambiguous relations, mere co-occurrence does not guarantee the
existence of the relation → distant supervision produces false positives
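A toy sketch of the labeling step (the knowledge-base entries mirror the table above; the example sentences are invented to illustrate the false-positive issue):

kb = {("Michael Jackson", "Gary"): "place of birth",
      ("Barack Obama", "Hawaii"): "place of birth"}

def distant_label(sentence, e1, e2):
    # assume the KB relation holds whenever the entity pair co-occurs in the sentence
    return kb.get((e1, e2), "NA")

print(distant_label("Michael Jackson was born in Gary, Indiana.", "Michael Jackson", "Gary"))
# a sentence like "Michael Jackson gave a concert in Gary" would get the same label -> false positive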
Attention over Instances
Lin et al. “Neural Relation Extraction with Selective Attention over Instances” ACL (2016)
xi : a sentence for an entity pair (e1, e2); n sentences for relation r(e1, e2)
xi : sentence vector representation
ai : weight given by sentence-level attention
s : representation of the sentence set
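A NumPy sketch of the selective-attention step (a dot product with a relation query vector stands in for the learned scoring function of the paper; sizes are assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = np.random.randn(5, 16)      # xi: vector representations of the n = 5 sentences for (e1, e2)
r = np.random.randn(16)         # query vector for the candidate relation r(e1, e2)

a = softmax(X @ r)              # ai: sentence-level attention weights
s = a @ X                       # s: representation of the whole sentence set (weighted sum)
# s is then classified with a soft-max over relations; noisy sentences receive small weights ai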
NYT10 Dataset
Align Freebase relations with
New York Times corpus (NYT)
53 possible relationships
+NA (no relation between entities)
Sentence-level ATT results
Data     | Sentences | Entity pairs
Training | 522,611   | 281,270
Test     | 172,448   | 96,678
Lin et al. “Neural Relation Extraction with Selective Attention over Instances” ACL (2016)
 Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal
of machine learning research (2014)
 Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of
Machine Learning Research, Feb (2012)
 Kim, Y. “Convolutional Neural Networks for Sentence Classification”, EMNLP (2014)
 Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for
Twitter Sentiment Classification." SemEval@ NAACL-HLT (2015)
 Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical
machine translation." EMNLP (2014)
 Ilya Sutskever et al. “Sequence to sequence learning with neural networks.” NIPS (2014)
 Bahdanau et al. "Neural machine translation by jointly learning to align and translate." ICLR (2015)
 Gal, Y., Islam, R., Ghahramani, Z. “Deep Bayesian Active Learning with Image Data.” ICML (2017)
 Nair, V., Hinton, G.E. “Rectified linear units improve restricted boltzmann machines.” ICML (2010)
 Ronan Collobert, et al. “Natural language processing (almost) from scratch.” JMLR (2011)
 Kumar, Shantanu. "A Survey of Deep Learning Methods for Relation Extraction." arXiv preprint
arXiv:1705.03645 (2017)
 Lin et al. “Neural Relation Extraction with Selective Attention over Instances” ACL (2016) [code]
 Zeng, D.et al. “Relation classification via convolutional deep neural network”. COLING (2014)
 Nguyen, T.H., Grishman, R. “Relation extraction: Perspective from CNNs.” VS@ HLT-NAACL. (2015)
 Zhang, D., Wang, D. “Relation classification via recurrent NN.” -arXiv preprint arXiv:1508.01006 (2015)
 Zhou, P. et al. “Attention-based bidirectional LSTM networks for relation classification.” ACL (2016)
 Mike Mintz et al. “Distant supervision for relation extraction without labeled data.” ACL- IJCNLP (2009)
References
 http://web.stanford.edu/class/cs224n
 https://www.coursera.org/specializations/deep-learning
 https://chrisalbon.com/#Deep-Learning
 http://www.asimovinstitute.org/neural-network-zoo
 http://cs231n.github.io/optimization-2
 https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-
4106a6702d39
 https://arimo.com/data-science/2016/bayesian-optimization-hyperparameter-tuning
 http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow
 http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp
 https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-
internet-fbb8b1ad5df8
 http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
 http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-
python-and-theano/
 http://colah.github.io/posts/2015-08-Understanding-LSTMs
 https://github.com/hyperopt/hyperopt
 https://github.com/tensorflow/nmt
References & Resources