Image Captioning

The document discusses image captioning using recurrent neural networks (RNNs). It explains that RNNs can handle sequence learning problems where successive inputs/outputs are not independent, unlike vanilla neural networks. RNNs have an internal state that gets updated as a sequence is processed. The document provides examples of using RNNs for tasks like image captioning, video captioning, and action prediction from video frames. It describes how RNNs are trained on image-caption pairs and can then generate captions for new images by updating their internal state at each time step.


CDS.IISc.ac.in | Department of Computational and Data Sciences

Image Captioning

Overview

Image captioning model: input image -> output caption, generated word by word.


“Vanilla” Neural Network (one to one)

Example: a fixed-size image (say 32 x 32) is fed to a convolutional neural network for classification. The output is a predicted label from the network.

● Each input is independent of previous and future inputs.
● The outputs for two images are completely independent of each other.



Sequence Learning Problems

● What if the successive inputs/outputs are no longer independent?
  ○ E.g. auto text completion

What model architecture can handle such sequence learning problems?


Recurrent Neural Network (RNN)

[Figure: vanilla neural networks vs. recurrent architectures]


Recurrent Neural Network (RNN)

Example tasks:
● Image captioning: image -> sequence of words
● Action prediction: video frames -> action class
● Video captioning: video frames -> sequence of words
● Video classification at frame level


Recurrent Neural Network (RNN)

The “internal state” of an RNN is updated as a sequence is processed.


Recurrent Neural Network (RNN)

Recurrence formula:

    h_t = f_{U,W}(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} the old state, and x_t the input vector at time t. The same weights U, W are used at every time step.
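The recurrence can be sketched in NumPy as below; the tanh non-linearity and the tiny dimensions are illustrative choices, not taken from the slides.

```python
import numpy as np

# Minimal sketch of the RNN recurrence h_t = f_{U,W}(h_{t-1}, x_t).
# U maps the input into the hidden space, W maps the old state to the new one.

def rnn_step(h_prev, x_t, U, W):
    """One recurrence step: new state from old state and input at time t."""
    return np.tanh(W @ h_prev + U @ x_t)

# Toy example: hidden size 4, input size 3, a length-5 input sequence.
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3)) * 0.1
W = rng.standard_normal((4, 4)) * 0.1
h = np.zeros(4)
for x_t in rng.standard_normal((5, 3)):
    h = rnn_step(h, x_t, U, W)  # the same U, W are reused at every step
```

Note that the loop reuses one pair of weight matrices for the whole sequence; that weight sharing is what lets the same network process sequences of any length.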

RNN: One to many (Image captioning)

[Figure: the same function f_{U,W} is applied at every step, producing one output word per step from a single image input.]

Training data: Image Captioning

(Image, Caption) pairs

Training RNN: Image Captioning

The network is unrolled over the caption: inputs x0 (<Start>), x1 (“straw”), x2 (“hat”) produce hidden states h0, h1, h2 and outputs y0, y1, y2, the caption “straw hat” generated word by word. The CNN image feature F enters at the first step:

    h0 = max(0, W*x0 + P*F)

P, W, U, V are the weights of the RNN.

Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
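The training-time forward pass can be sketched as follows, using the slide's h0 = max(0, W*x0 + P*F). The dimensions, the ReLU at later steps, and using U for the hidden-to-hidden connection are assumptions for illustration.

```python
import numpy as np

# Sketch of the forward pass on one (image, caption) pair. The CNN image
# feature F is injected only at the first step, through the weight P.
D_img, D_emb, D_hid, V_size = 8, 5, 6, 10
rng = np.random.default_rng(1)
W = rng.standard_normal((D_hid, D_emb)) * 0.1   # input word -> hidden
P = rng.standard_normal((D_hid, D_img)) * 0.1   # image feature -> hidden
U = rng.standard_normal((D_hid, D_hid)) * 0.1   # hidden -> hidden (assumed role)
V = rng.standard_normal((V_size, D_hid)) * 0.1  # hidden -> vocabulary scores

F = rng.standard_normal(D_img)       # CNN feature of the image (stand-in)
x = rng.standard_normal((3, D_emb))  # embeddings of <Start>, "straw", "hat"

hs, scores = [], []
h = np.maximum(0.0, W @ x[0] + P @ F)      # h0: image enters here only
hs.append(h)
scores.append(V @ h)                       # y0: scores over the vocabulary
for t in range(1, 3):
    h = np.maximum(0.0, W @ x[t] + U @ h)  # later steps: no image term
    hs.append(h)
    scores.append(V @ h)
```

In training, each score vector would be compared against the next ground-truth caption word with a softmax cross-entropy loss, and the gradients backpropagated through the unrolled steps.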

Inference - RNN: Image Captioning

At test time, the CNN feature F of the test image initialises the state through the weight P. Starting from x0 = <Start>, the network computes h0 and an output y0; the sampled word (“straw”) is fed back as the next input x1, producing h1 and y1 (“hat”), and so on. Sampling the <END> token finishes the caption.
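This feed-back decoding loop can be sketched with a toy four-word vocabulary and random stand-in weights; a trained model would use learned weights and typically apply a softmax before sampling, but the greedy argmax below keeps the sketch deterministic.

```python
import numpy as np

# Greedy decoding sketch: start from <START>, feed each predicted word back
# in as the next input, stop at <END> or a length cap. All weights are toy
# stand-ins, not a trained model.
vocab = ["<START>", "straw", "hat", "<END>"]
rng = np.random.default_rng(2)
E = rng.standard_normal((4, 6))        # word embeddings, one row per word
W = rng.standard_normal((5, 6)) * 0.1  # input word -> hidden
U = rng.standard_normal((5, 5)) * 0.1  # hidden -> hidden
P = rng.standard_normal((5, 8)) * 0.1  # image feature -> hidden
Vo = rng.standard_normal((4, 5)) * 0.1 # hidden -> vocabulary scores
F = rng.standard_normal(8)             # CNN feature of the test image

def caption(F, max_len=10):
    words, idx = [], 0                        # index 0 is <START>
    h = np.maximum(0.0, W @ E[idx] + P @ F)   # h0 uses the image feature
    for _ in range(max_len):
        idx = int(np.argmax(Vo @ h))          # greedy: most likely word
        if vocab[idx] == "<END>":
            break                             # sampled <END> => finish
        words.append(vocab[idx])
        h = np.maximum(0.0, W @ E[idx] + U @ h)  # feed the word back in
    return words

result = caption(F)
```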

Image Captioning: Example Results

Credits: Fei-Fei Li, generated using NeuralTalk2

Image Captioning: Failure Cases
Image captioning: RNN with attention

● A CNN takes the H x W x 3 input image and extracts a feature grid of size H' x W' x D, reshaped to L x D with L = H' x W' grid locations (features z_{0,0}, ..., z_{2,2} in the 3 x 3 example).
● At each step, the hidden state produces attention weights a_t over the grid: normalised attention weights, a softmax distribution over the L grid locations (H' x W' values a_{t,i,j}).
● The context vector c_t is the attention-weighted sum of the features: weighted features of dimension D.
● c_t is fed into the RNN together with the current word y_t, starting with the <START> token as the first word. Each step outputs both the next distribution over the L locations (a_{t+1}) and a distribution over the vocabulary (d_t), from which the next caption word is taken, and so on for the second word and beyond.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning, 2048–2057.
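One attention step can be sketched as below. The scoring function here (a dot product against a projected hidden state) is a simplification of the attention network in Xu et al.; the dimensions are toy values.

```python
import numpy as np

# Soft-attention sketch: attention weights are a softmax over the L grid
# locations, and the context c_t is the weighted sum of the L x D features.
L, D, H = 9, 8, 6                    # L = H' x W' grid locations (3 x 3 here)
rng = np.random.default_rng(3)
Z = rng.standard_normal((L, D))      # CNN features, reshaped to L x D
h = rng.standard_normal(H)           # current hidden state
Wa = rng.standard_normal((D, H)) * 0.1  # projects h into the feature space

def attend(Z, h):
    scores = Z @ (Wa @ h)            # one relevance score per grid location
    a = np.exp(scores - scores.max())
    a /= a.sum()                     # softmax distribution over L locations
    c = a @ Z                        # weighted features, dimension D
    return a, c

a, c = attend(Z, h)
```

The context c would then be concatenated with the current word embedding and fed into the recurrence, so the model can look at a different part of the image for each word it generates.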

References

● The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy
● CS231n, Prof. Fei-Fei Li, Stanford University
● DLCV, Prof. Vineeth Balasubramanian, IIT Hyderabad

Thank you
