Image Captioning

The document discusses image captioning using recurrent neural networks (RNNs). It explains that RNNs can handle sequence learning problems where successive inputs/outputs are not independent, unlike vanilla neural networks. RNNs have an internal state that gets updated as a sequence is processed. The document provides examples of using RNNs for tasks like image captioning, video captioning, and action prediction from video frames. It describes how RNNs are trained on image-caption pairs and can then generate captions for new images by updating their internal state at each time step.


CDS.IISc.ac.in | Department of Computational and Data Sciences

Image Captioning

Overview

Image captioning model: input image -> output caption, generated word by word.


“Vanilla” Neural Network (one to one)

Example: a fixed-size image (say 32 x 32) is fed to a convolutional neural network for classification. The output is a predicted label from the network.

● Each input is independent of previous and future inputs.
● The outputs for two images are completely independent of each other.



Sequence Learning Problems

● What if the successive inputs/outputs are no longer independent?
  ○ E.g. auto text completion

What model architecture can handle such sequence learning problems?


Recurrent Neural Network (RNN)

[Figure: vanilla neural networks vs. recurrent architectures]


Recurrent Neural Network (RNN)

Example tasks:
● Image captioning: image -> sequence of words
● Action prediction: video frames -> action class
● Video captioning: video frames -> sequence of words
● Video classification at frame level


Recurrent Neural Network (RNN)

The “internal state” of an RNN is updated as a sequence is processed.


Recurrent Neural Network (RNN)

Recurrence formula:

    h_t = f_{U,W}(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} the old state, and x_t the input vector at time t. The same weights U, W are used at every time step.
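The recurrence can be sketched in NumPy as below; the tanh non-linearity and the tiny dimensions are illustrative choices, not taken from the slides.

```python
import numpy as np

# Minimal sketch of the RNN recurrence h_t = f_{U,W}(h_{t-1}, x_t).
# U maps the input into the hidden space, W maps the old state to the new one.

def rnn_step(h_prev, x_t, U, W):
    """One recurrence step: new state from old state and input at time t."""
    return np.tanh(W @ h_prev + U @ x_t)

# Toy example: hidden size 4, input size 3, a length-5 input sequence.
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3)) * 0.1
W = rng.standard_normal((4, 4)) * 0.1
h = np.zeros(4)
for x_t in rng.standard_normal((5, 3)):
    h = rnn_step(h, x_t, U, W)  # the same U, W are reused at every step
```

Note that the loop reuses one pair of weight matrices for the whole sequence; that weight sharing is what lets the same network process sequences of any length.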

RNN: One to many (Image captioning)

[Figure: the same function f_{U,W} is applied at every step, producing one output word per step from a single image input.]

Training data: Image Captioning

(Image, Caption) pairs

Training RNN: Image Captioning

The network is unrolled over the caption: inputs x0 (<Start>), x1 (“straw”), x2 (“hat”) produce hidden states h0, h1, h2 and outputs y0, y1, y2, the caption “straw hat” generated word by word. The CNN image feature F enters at the first step:

    h0 = max(0, W*x0 + P*F)

P, W, U, V are the weights of the RNN.

Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
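The training-time forward pass can be sketched as follows, using the slide's h0 = max(0, W*x0 + P*F). The dimensions, the ReLU at later steps, and using U for the hidden-to-hidden connection are assumptions for illustration.

```python
import numpy as np

# Sketch of the forward pass on one (image, caption) pair. The CNN image
# feature F is injected only at the first step, through the weight P.
D_img, D_emb, D_hid, V_size = 8, 5, 6, 10
rng = np.random.default_rng(1)
W = rng.standard_normal((D_hid, D_emb)) * 0.1   # input word -> hidden
P = rng.standard_normal((D_hid, D_img)) * 0.1   # image feature -> hidden
U = rng.standard_normal((D_hid, D_hid)) * 0.1   # hidden -> hidden (assumed role)
V = rng.standard_normal((V_size, D_hid)) * 0.1  # hidden -> vocabulary scores

F = rng.standard_normal(D_img)       # CNN feature of the image (stand-in)
x = rng.standard_normal((3, D_emb))  # embeddings of <Start>, "straw", "hat"

hs, scores = [], []
h = np.maximum(0.0, W @ x[0] + P @ F)      # h0: image enters here only
hs.append(h)
scores.append(V @ h)                       # y0: scores over the vocabulary
for t in range(1, 3):
    h = np.maximum(0.0, W @ x[t] + U @ h)  # later steps: no image term
    hs.append(h)
    scores.append(V @ h)
```

In training, each score vector would be compared against the next ground-truth caption word with a softmax cross-entropy loss, and the gradients backpropagated through the unrolled steps.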

Inference - RNN: Image Captioning

At test time, the CNN feature F of the test image initialises the state through the weight P. Starting from x0 = <Start>, the network computes h0 and an output y0; the sampled word (“straw”) is fed back as the next input x1, producing h1 and y1 (“hat”), and so on. Sampling the <END> token finishes the caption.
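This feed-back decoding loop can be sketched with a toy four-word vocabulary and random stand-in weights; a trained model would use learned weights and typically apply a softmax before sampling, but the greedy argmax below keeps the sketch deterministic.

```python
import numpy as np

# Greedy decoding sketch: start from <START>, feed each predicted word back
# in as the next input, stop at <END> or a length cap. All weights are toy
# stand-ins, not a trained model.
vocab = ["<START>", "straw", "hat", "<END>"]
rng = np.random.default_rng(2)
E = rng.standard_normal((4, 6))        # word embeddings, one row per word
W = rng.standard_normal((5, 6)) * 0.1  # input word -> hidden
U = rng.standard_normal((5, 5)) * 0.1  # hidden -> hidden
P = rng.standard_normal((5, 8)) * 0.1  # image feature -> hidden
Vo = rng.standard_normal((4, 5)) * 0.1 # hidden -> vocabulary scores
F = rng.standard_normal(8)             # CNN feature of the test image

def caption(F, max_len=10):
    words, idx = [], 0                        # index 0 is <START>
    h = np.maximum(0.0, W @ E[idx] + P @ F)   # h0 uses the image feature
    for _ in range(max_len):
        idx = int(np.argmax(Vo @ h))          # greedy: most likely word
        if vocab[idx] == "<END>":
            break                             # sampled <END> => finish
        words.append(vocab[idx])
        h = np.maximum(0.0, W @ E[idx] + U @ h)  # feed the word back in
    return words

result = caption(F)
```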

Image Captioning: Example Results

Credits: Fei-Fei Li, generated using NeuralTalk2

Image Captioning: Failure Cases
Image captioning: RNN with attention

● A CNN takes the H x W x 3 input image and extracts a feature grid of size H' x W' x D, reshaped to L x D with L = H' x W' grid locations (features z_{0,0}, ..., z_{2,2} in the 3 x 3 example).
● At each step, the hidden state produces attention weights a_t over the grid: normalised attention weights, a softmax distribution over the L grid locations (H' x W' values a_{t,i,j}).
● The context vector c_t is the attention-weighted sum of the features: weighted features of dimension D.
● c_t is fed into the RNN together with the current word y_t, starting with the <START> token as the first word. Each step outputs both the next distribution over the L locations (a_{t+1}) and a distribution over the vocabulary (d_t), from which the next caption word is taken, and so on for the second word and beyond.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning, 2048–2057.
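One attention step can be sketched as below. The scoring function here (a dot product against a projected hidden state) is a simplification of the attention network in Xu et al.; the dimensions are toy values.

```python
import numpy as np

# Soft-attention sketch: attention weights are a softmax over the L grid
# locations, and the context c_t is the weighted sum of the L x D features.
L, D, H = 9, 8, 6                    # L = H' x W' grid locations (3 x 3 here)
rng = np.random.default_rng(3)
Z = rng.standard_normal((L, D))      # CNN features, reshaped to L x D
h = rng.standard_normal(H)           # current hidden state
Wa = rng.standard_normal((D, H)) * 0.1  # projects h into the feature space

def attend(Z, h):
    scores = Z @ (Wa @ h)            # one relevance score per grid location
    a = np.exp(scores - scores.max())
    a /= a.sum()                     # softmax distribution over L locations
    c = a @ Z                        # weighted features, dimension D
    return a, c

a, c = attend(Z, h)
```

The context c would then be concatenated with the current word embedding and fed into the recurrence, so the model can look at a different part of the image for each word it generates.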

References

● The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy
● CS231n, Prof. Fei-Fei Li, Stanford University
● DLCV, Prof. Vineeth Balasubramanian, IIT Hyderabad

Thank you
