Image Captioning
Overview
Example: an input image is mapped to a generated caption.
Image captioning: image -> sequence of words
Action prediction: video frames -> action class
Video captioning: video frames -> sequence of words
Recurrence formula
The same function and the same weights (W, U) are applied at every time step:
h_t = f(W x_t + U h_{t-1})
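A minimal NumPy sketch of this recurrence; the tanh nonlinearity and the vector sizes are illustrative assumptions, not fixed by the slides:

```python
# Minimal sketch of the recurrence h_t = f(W x_t + U h_{t-1}).
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 300, 512                               # input and hidden sizes (assumed)
W = rng.normal(scale=0.01, size=(D_h, D_in))       # input-to-hidden weights
U = rng.normal(scale=0.01, size=(D_h, D_h))        # hidden-to-hidden weights

def rnn_step(x_t, h_prev):
    """One time step: the same W and U are reused at every step."""
    return np.tanh(W @ x_t + U @ h_prev)

# Unroll over a short dummy sequence x_0, x_1, x_2.
h = np.zeros(D_h)
for x in rng.normal(size=(3, D_in)):
    h = rnn_step(x, h)
```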
Image captioning with an RNN: the CNN feature F of the image is injected into the first hidden state,
h0 = max(0, W x0 + P F)
where x0 is the embedding of the <Start> token. Each hidden state is read out through V as scores y0, y1, y2 over the vocabulary, and the following inputs x1, x2, ... are the words of the caption ("straw", "hat"). P, W, U, V are the learned weights (U connects h_{t-1} to h_t).

Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
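A minimal NumPy sketch of this forward pass; all sizes, the random weights, and the choice to inject F only at the first step are assumptions made for illustration:

```python
# Sketch of the captioning recurrence: the image feature F enters h0 through P.
import numpy as np

rng = np.random.default_rng(0)
D_img, D_word, D_h, V_size = 4096, 300, 512, 10000   # illustrative sizes

P = rng.normal(scale=0.01, size=(D_h, D_img))    # image-to-hidden
W = rng.normal(scale=0.01, size=(D_h, D_word))   # word-to-hidden
U = rng.normal(scale=0.01, size=(D_h, D_h))      # hidden-to-hidden
V = rng.normal(scale=0.01, size=(V_size, D_h))   # hidden-to-vocab

F = rng.normal(size=D_img)          # CNN feature of the image
x0 = rng.normal(size=D_word)        # embedding of the <Start> token

# First step: the image feature is injected into h0.
h0 = np.maximum(0, W @ x0 + P @ F)
y0 = V @ h0                         # scores over the vocabulary -> "straw"

# Later steps reuse W and U; the previous word's embedding is the next input.
x1 = rng.normal(size=D_word)        # embedding of the sampled word "straw"
h1 = np.maximum(0, W @ x1 + U @ h0)
y1 = V @ h1                         # -> "hat"
```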
Test image: the CNN first extracts the feature F of the test image.
Test-time decoding proceeds one step at a time:
1. Feed the <Start> token x0 together with the image feature F (through the weight P) to compute h0 and the first word distribution y0.
2. Sample a word from y0 (here "straw") and feed it back in as the next input x1.
3. h1 gives y1, from which "hat" is sampled and fed in as x2, and so on.
4. Stop when the sampled token is <END>: the caption ("straw hat") is finished.
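A sketch of this sampling loop in Python; the rnn_step and embed helpers stand in for the trained model and are assumptions, not part of the slides:

```python
# Sketch of test-time caption generation: feed <Start>, sample a word,
# feed it back in, stop at the <END> token.
import numpy as np

def generate_caption(F, rnn_step, embed, vocab, h_size=512, max_len=20):
    rng = np.random.default_rng(0)
    h = np.zeros(h_size)
    word = "<Start>"
    caption = []
    for t in range(max_len):
        # The image feature F is used at the first step only (assumption).
        h, scores = rnn_step(embed(word), h, F if t == 0 else None)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        word = vocab[rng.choice(len(vocab), p=probs)]   # sample from y_t
        if word == "<END>":                              # <END> token => finish
            break
        caption.append(word)
    return caption                                       # e.g. ["straw", "hat"]
```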
Image captioning: RNN with attention
The input image (H x W x 3) is passed through a CNN, producing a grid of extracted features (H' x W' spatial locations, one feature vector per location).

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning, 2048–2057.
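A PyTorch sketch of the feature-grid extraction; the ResNet-18 backbone and the 224 x 224 input size are illustrative choices, not necessarily those of Xu et al.:

```python
# A CNN turns an H x W x 3 image into an H' x W' grid of D-dim features
# (L = H' * W' locations).
import torch
import torchvision

backbone = torchvision.models.resnet18(weights=None)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

image = torch.randn(1, 3, 224, 224)          # H x W x 3 input (batch of 1)
with torch.no_grad():
    grid = encoder(image)                    # shape (1, D=512, H'=7, W'=7)
features = grid.flatten(2).transpose(1, 2)   # shape (1, L=49, D=512)
```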
Attention (a1): from the current hidden state, the model computes one score per grid location, a1,0, a1,1, a1,2, ...
These scores are normalised (e.g. with a softmax) into attention weights a1: a distribution over the L = H' x W' grid locations. The weights are used to take a weighted sum of the grid features z0,0, z0,1, z0,2, ..., producing a single context vector for the next RNN step.
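A PyTorch sketch of this soft-attention step; the scoring layer and all sizes are illustrative assumptions:

```python
# Soft attention over the L grid features: scores from the hidden state are
# softmax-normalised and used to average the features into one context vector.
import torch

L, D, D_h = 49, 512, 512
features = torch.randn(1, L, D)              # grid features z (1, L, D)
h = torch.randn(1, D_h)                      # current hidden state

score_layer = torch.nn.Linear(D + D_h, 1)    # one scalar score per location
scores = score_layer(torch.cat([features, h.unsqueeze(1).expand(-1, L, -1)], dim=-1))
attn = torch.softmax(scores, dim=1)          # (1, L, 1): distribution over L locations
context = (attn * features).sum(dim=1)       # (1, D): weighted sum of grid features
```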
At every step the decoder therefore outputs two distributions: a distribution over the L locations (the next attention map: a1, a2, a3, ...) and a distribution over the vocabulary (the word predictions: d1, d2, ...). The attention map selects where to look in the image; the vocabulary distribution gives the next word of the caption.
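A PyTorch sketch of one decoder step with the two output heads; the GRU cell and all sizes are stand-ins chosen for brevity, not the exact architecture of Xu et al.:

```python
# One attention-decoder step producing both outputs: a new attention
# distribution over the L locations and a distribution over the vocabulary.
import torch

L, D, D_h, V_size, D_word = 49, 512, 512, 10000, 300
rnn_cell = torch.nn.GRUCell(D + D_word, D_h)   # recurrent core (assumed GRU)
attn_head = torch.nn.Linear(D_h, L)            # head for the L grid locations
vocab_head = torch.nn.Linear(D_h, V_size)      # head for the vocabulary

features = torch.randn(1, L, D)                # CNN grid features
h = torch.randn(1, D_h)                        # previous hidden state
word_emb = torch.randn(1, D_word)              # embedding of the previous word

attn = torch.softmax(attn_head(h), dim=-1)     # a_t: where to look (1, L)
context = torch.bmm(attn.unsqueeze(1), features).squeeze(1)   # (1, D)
h = rnn_cell(torch.cat([context, word_emb], dim=-1), h)       # next hidden state
word_dist = torch.softmax(vocab_head(h), dim=-1)              # d_t: next-word probs
```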
References
● The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy
Thank you