
Image Captioning

Project Report

Session 2017-2021

Submitted To: Dr. Adnan Shah

Submitted By: M. Muzammal (17-CS-18)


Manzar Abbas (17-CS-58)
M. Atif Munir (17-CS-59)
Zain Ali (17-CS-57)

Department of Computer Science


University of Engineering & Technology Taxila
Abstract
In this project, we implement and analyse the performance of a Neural Image Captioning
model on standard captioning datasets. We identify some of the biases of the datasets which
allow captions for certain objects to be predicted easily but restrict the model's power to
generalize. We analyse the learnt word embeddings and explain some of the discrepancies
behind the observed drop in performance.

1. Introduction
Automatically describing the content of an image is a very challenging task, but it can also
have a great impact. For instance, bi-directional text-image retrieval can enable search over
the content of an image rather than simple keyword-based search, and it can assist visually
impaired people in better understanding the world around them and the vast content on the
internet. This task is significantly more complex than image classification or object
recognition, which have received most of the focus from the computer vision community. A
description of an image must capture not only the objects contained in the image but also the
interactions between these objects. Indeed, a description must be able to express how the
objects relate to each other and the activity they are involved in.

Generally, there are four components in the image captioning pipeline: extracting image
features, generating a set of candidate words, generating a set of candidate sentences/captions
and, finally, ranking the sentences according to an evaluation metric to choose the one with
the highest score. Some approaches combine two or more of these components into a single
block to solve the problem.
The usual trend for representing image features is to use a convolutional neural network
(CNN) pre-trained on the ImageNet Large Scale Visual Recognition Challenge dataset.
Recent successes in object recognition, image classification and other related tasks have
demonstrated the versatility of these features. VGGNet features are usually favoured for
extracting discriminating representations. Statistical natural language processing methods
such as the Maximum Entropy model are used to build a language model. A left-to-right beam
search is used to generate a set of candidate sentences with the language model. Lastly, the
candidate sentences are ranked according to an evaluation metric, and the output caption is
the sentence with the highest score.

2. Related Work
Since the introduction of the Microsoft COCO dataset in 2014, a lot of work involving deep
neural networks has been published. Many of these approaches leverage the recent successes in
recognition of objects, their attributes, and locations to generate natural language descriptions.

Fang et al. [3] use Multiple Instance Learning in conjunction with a CNN to output the set of
likely words contained in an image. A Maximum-Entropy model is then used to model the
likelihood of seeing a dictionary word at location l in a sentence, given the sequence of words
prior to the current word. In addition, they condition this likelihood on the set of candidate
words generated in the first block. Using beam search, they generate a set of candidate
sentences/captions and, lastly, rank the captions using MERT.
At the other extreme, the Neural Image Captioning (NIC) approach trains a single joint deep
neural network which takes as input an image I and is trained to maximise the likelihood p(S|I)
of producing a target sequence of words S = S1, S2, ..., where each word St comes from a given
dictionary. This approach combines all four components into a single model and achieves
state-of-the-art results, showing that there is merit in mapping the features of the image and
the text to the same embedding space and reducing the distance between an image and its
captions. The inspiration for this approach comes from machine translation, where the input is
a sequence of words in a source language and the output is a semantically equivalent sequence
of words in a target language. Traditionally, this task was solved by combining solutions to the
various sub-problems, but recent work has shown that translation can be done in a much
simpler way using Recurrent Neural Networks (RNNs). Briefly, an encoder RNN learns a
fixed-length, high-dimensional embedding for the input sequence of words. This embedding is
then fed as input to a decoder RNN, which generates a sequence of words in the target
language given the embedding of the words in the source language.

3. Model
We implement a model for image captioning which is trainable end-to-end. Recent progress
in statistical machine translation has achieved state-of-the-art results by directly maximizing
the probability of the correct translation in the target language given the input sentence in the
source language. These models are based on utilizing the power of recurrent neural networks
to learn representations which serve as a mapping between words in the source language
and words in the target language while preserving the syntactic and semantic meaning of the
words.
Following this success, we propose a model to maximize the probability of generating a
correct caption given the input image by using the following formulation:

θ* = argmax_θ Σ_{(I,S)} log p(S | I; θ)    (1)

where θ are the parameters of the model, I is the input image and S is its correct caption.
Here, S represents a sequence of words and is usually written in terms of each word by using
the product rule as,

log p(S | I; θ) = Σ_{t=0}^{T} log p(s_t | I, s_0, s_1, ..., s_{t−1}; θ)    (2)


where s_t corresponds to the t-th word in the input caption. We model
log p(s_t | I, s_0, s_1, ..., s_{t−1}; θ)
using a variant of the Recurrent Neural Network (RNN) called the Long Short-Term Memory
(LSTM) [4]. The prediction of a word at any time t is conditioned on the input image and the
past predictions from time 0 to t − 1. Next, we describe how we encode the image and each
word in the caption.

3.1.Image Encoder
We use the 4096-dimensional fully connected layer of VGGNet pre-trained on the ILSVRC 2014
classification dataset. We use the VGG model with a total of 16 layers. This model consists of
13 convolutional layers with 3 × 3 kernels. Deep convolutional layers allow the network to
learn non-linear functions from the input to the output. The filters learnt by the model loosely
correspond to the layered processing of the human visual cortex. The filters learnt in the
lowermost layers correspond to edge and colour blob filters, in the middle layers they
correspond to parts of objects such as eyes, noses and wheels, whereas in the uppermost layers
the filters may be sensitive to whole objects such as humans, cats and dogs. Furthermore, the
features learnt by the model have been shown to generalize to various tasks, including scene
classification, by means of transfer learning. This shows the efficacy of VGGNet in encoding
high-dimensional 224 × 224 pixel images into discriminative and compact 4096-D vectors. We
also add an intermediate fully connected layer with 512 units to further reduce the
dimensionality of the image vectors and to match the dimensionality of the word embedding
vectors.
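To make the encoding step concrete, the following is a minimal PyTorch sketch (not the project's exact code) of extracting the 4096-D fully connected VGG16 features with torchvision's pretrained model and projecting them to a 512-D image embedding; the class name, layer choices and preprocessing values follow standard ImageNet practice and are assumptions rather than the authors' implementation.

# Sketch: 4096-D VGG16 features -> 512-D image embedding (names are illustrative).
import torch
import torch.nn as nn
from torchvision import models, transforms

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")           # 16-layer VGG, pretrained on ILSVRC
        self.features = vgg.features                          # 13 convolutional layers (3x3 kernels)
        self.avgpool = vgg.avgpool
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-1])  # up to the 4096-D fc layer
        self.embed = nn.Linear(4096, embed_dim)               # extra layer: 4096 -> 512
        for p in self.features.parameters():                  # keep the pretrained CNN frozen
            p.requires_grad = False

    def forward(self, images):                                # images: (B, 3, 224, 224)
        x = self.avgpool(self.features(images))
        x = torch.flatten(x, 1)
        x = self.fc(x)                                        # (B, 4096), i.e. CNN(I)
        return self.embed(x)                                  # (B, 512) image embedding

# Standard ImageNet preprocessing for 224 x 224 inputs.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])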

3.2.LSTM based Language Model


The LSTM has been used successfully for modelling the machine translation problem. It works
by representing the words seen from time 0 to t − 1 using a fixed-length hidden state or memory.
It then describes the mapping to the next time step given the word embedding for the current
word and the hidden state for all the words seen so far. It was introduced in [4] to deal with the
problem of vanishing and exploding gradients while training Recurrent Neural Networks.

In Figure 1, the LSTM is shown in its time-unrolled form. This form of representation allows
us to interpret the LSTM as a feed-forward network, which makes analysis easier. In essence, it
is a type of Bayesian network whose joint distribution is given as the product of the conditional
distributions. The core of the LSTM is a memory cell c which encodes the knowledge of the
sequence of words it has seen at every time step. The memory cell is shown in Figure 2. Three
gates - the input gate i, the forget gate f and the output gate o - control the behaviour of the
LSTM. These gates help the LSTM deal with the vanishing and exploding gradients problem.

Figure 2: Memory cell c of an LSTM. This component is responsible for "remembering"
previous items. The input gate i controls the extent of new information entering the cell. The
forget gate f enables the cell state c to persist or erase itself. The output gate o controls the
amount of information passed on from the cell state to the output of the cell block.

The definitions of the gates are as follows:

i_t = σ(W_ix x_t + W_im m_{t−1})    (3)
f_t = σ(W_fx x_t + W_fm m_{t−1})    (4)
o_t = σ(W_ox x_t + W_om m_{t−1})    (5)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ h(W_cx x_t + W_cm m_{t−1})    (6)
m_t = o_t ⊙ c_t    (7)
p_{t+1} = Softmax(m_t)    (8)

where ⊙ represents elementwise multiplication. The W matrices are the trainable parameters.
The input gate i controls the amount of information which flows into the LSTM at any given
time step. The forget gate f allows the cell to store information for a very long time by
reinforcing the signal or, alternatively, it can also "forget" and reset the cell state. The output
gate determines the proportion of information that flows from the current input embedding and
the previous cell state. The W matrices are time-independent, that is, they are shared across
time. This helps in reducing the number of parameters learnt by the model and acts as a natural
check against overfitting.
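The following Python sketch spells out one LSTM time step following Eqs. (3)-(8); the weight names (W_ix, W_im, ..., W_dec) are illustrative rather than taken from the report, and the final projection to vocabulary scores stands in for the fully connected output layer described in the training section below.

# Sketch of a single LSTM step, Eqs. (3)-(8); W is a dict of time-shared weight matrices.
import torch

def lstm_step(x_t, m_prev, c_prev, W):
    i = torch.sigmoid(x_t @ W["ix"] + m_prev @ W["im"])          # input gate, Eq. (3)
    f = torch.sigmoid(x_t @ W["fx"] + m_prev @ W["fm"])          # forget gate, Eq. (4)
    o = torch.sigmoid(x_t @ W["ox"] + m_prev @ W["om"])          # output gate, Eq. (5)
    c = f * c_prev + i * torch.tanh(x_t @ W["cx"] + m_prev @ W["cm"])  # cell update, Eq. (6)
    m = o * c                                                     # cell output, Eq. (7)
    p_next = torch.softmax(m @ W["dec"], dim=-1)                  # word distribution, Eq. (8)
    return m, c, p_next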

Training. The LSTM model is trained to predict each word of the sentence after it has seen the
image and all the preceding words. This probability was defined in Equation (2). The input to
the model is an image-caption pair. The model can be described by the following set of
equations:

x_{−1} = W_ie CNN(I)    (9)
x_t = W_se s_t,  t ∈ {0, ..., N − 1}    (10)
s_t = OneHot(w_t)    (11)
p_{t+1} = LSTM(x_t)    (12)

where w_t corresponds to the index of the word at time t in the vocabulary, CNN(I)
corresponds to the output of the 4096-dimensional VGG layer, OneHot(w_t) is the one-hot
vector representing a word in the vocabulary, and LSTM(x_t) represents the output of the
LSTM after seeing t words and the input image I. The W matrices have an equal number of
rows and are the trainable parameters of the model: W_ie transforms the VGG layer features to
an embedding space, and W_se transforms the one-hot vector of a word to an embedding space
of the same dimensionality. This operation corresponds to mapping the image and the words to
a common embedding so that they can be used together in the LSTM.

In more detail, we first extract the VGG features CNN(I) for the input image I. By
feed-forwarding these features through the image embedding layer, we transform the image
into the same common embedding space as the words. We represent each word in the input
caption as a one-hot vector, which is transformed into the common embedding space using a
fully connected word embedding layer. We use the image embedding x_{−1} to initialize the
memory cell of the LSTM. After this, the word embeddings are fed into the LSTM at successive
time steps. The output of the LSTM is connected to another fully connected layer which has a
hidden unit corresponding to every word in the vocabulary. This can be viewed as a k-class
classification problem, where k is the vocabulary size. It is natural to convert the scores to
probabilities by applying a softmax to the output of this layer and to compute the cross-entropy
loss for the word classification. We sum the log losses for all the words in the sequence to
calculate the loss for the entire sequence. Hence, the negative log likelihood is given as:

L(I, S) = − Σ_{t=1}^{N} log p_t(s_t)    (13)

The above loss is minimized w.r.t. all the parameters of the model: the W matrices of the
LSTM, the image embedding matrix W_ie and the word embedding matrix W_se.
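A compact PyTorch sketch of Eqs. (9)-(13) is given below; it uses nn.Embedding in place of the explicit one-hot-times-W_se product (the two are mathematically equivalent) and nn.CrossEntropyLoss for the summed word-level log losses. The class, dimensions and dummy inputs are assumptions for illustration, not the project's actual code; padding and start/end token handling are omitted.

# Sketch of the image-conditioned LSTM language model and its loss (Eqs. 9-13).
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_embed = nn.Linear(4096, embed_dim)            # W_ie, Eq. (9)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)  # W_se acting on one-hot words, Eqs. (10)-(11)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)       # scores over the vocabulary

    def forward(self, cnn_feats, captions):
        # cnn_feats: (B, 4096) VGG features CNN(I); captions: (B, N) word indices s_0..s_{N-1}.
        x_img = self.img_embed(cnn_feats).unsqueeze(1)         # x_{-1}
        x_words = self.word_embed(captions[:, :-1])            # x_0 .. x_{N-2}
        x = torch.cat([x_img, x_words], dim=1)                 # image first, then the words
        h, _ = self.lstm(x)                                    # Eq. (12)
        return self.decoder(h[:, 1:, :])                       # predictions for s_1 .. s_{N-1}

vocab_size = 10000
model = CaptionModel(vocab_size)
criterion = nn.CrossEntropyLoss()                              # negative log likelihood, Eq. (13)
cnn_feats = torch.randn(4, 4096)                               # dummy stand-in for CNN(I)
captions = torch.randint(0, vocab_size, (4, 20))               # dummy caption indices
logits = model(cnn_feats, captions)
loss = criterion(logits.reshape(-1, vocab_size), captions[:, 1:].reshape(-1))
loss.backward()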
Inference. During testing, we only feed the input image into the network and the model
generates the corresponding caption for the image.

             Metric              BLEU-1   BLEU-4   METEOR
  MSCOCO
             NIC                   −       27.7     23.7
             Random                −        4.6      9.0
             Nearest Neighbour     −        9.9     15.7
             Human                 −       21.7     25.2
  Flickr8k
             NIC                  63        −        −
             Ours                 57.3     14.7     15.5
  Flickr30k
             NIC                  66        −        −
             Ours                 59.1     16.7     17.9

Table 1: Evaluation metrics. Comparison with the NIC, Nearest Neighbour, Random and Human
evaluations as proposed in []. Those results are reported on the MSCOCO [] dataset. We
evaluate our model on the Flickr8k and Flickr30k datasets and compare against the BLEU-1
scores reported for NIC.

Sampling: To evaluate the performance of the LSTM as a language model, we sample the most
likely word at each time step of the LSTM to form the generated caption. There is an alternative
method, beam search, for generating captions with a language model: instead of generating a
single caption, we simultaneously maintain the k best candidate captions and finally select the
one with the highest score as the final caption. We only experiment with the sampling method
for generating captions.
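A sketch of the sampling procedure is shown below: the image embedding is fed to the LSTM first, after which the most likely word at each step is fed back in until an end token or a length limit is reached. It reuses the hypothetical CaptionModel from the earlier sketch, and the start/end token ids and length limit are assumptions.

# Sketch of greedy (most-likely-word) caption sampling at test time.
import torch

@torch.no_grad()
def greedy_caption(model, cnn_feats, start_id, end_id, max_len=20):
    x_img = model.img_embed(cnn_feats).unsqueeze(1)   # prime the LSTM with the image embedding
    _, state = model.lstm(x_img)
    word = torch.tensor([[start_id]])
    caption = []
    for _ in range(max_len):
        h, state = model.lstm(model.word_embed(word), state)
        word = model.decoder(h[:, -1, :]).argmax(dim=-1, keepdim=True)  # most likely next word
        if word.item() == end_id:
            break
        caption.append(word.item())
    return caption   # list of word indices; map back to strings via the vocabulary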

4. Experiments
We evaluate our implementation on the Flickr8k [] and Flickr30k [] datasets. The Flickr8k
dataset consists of 8091 images with 5 captions per image. The Flickr30k dataset consists of
31783 images with 5 captions per image.
4.1.Evaluation Metrics
We evaluate our model using two popular metrics which were originally developed for
evaluating machine translation performance - BLEU [] and METEOR []. BLEU computes the
amount of N-gram overlap between the generated caption and the reference captions. It can
be viewed as a form of precision of word N-grams between the hypothesis caption and the
reference captions. Usually, BLEU scores with up to 4-grams are reported. In addition, we
also report METEOR scores. METEOR scores the hypothesis caption by aligning it to one or
more reference captions. The alignments are based on exact, stem, synonym, or paraphrase
matches between words. In this sense, it is more forgiving than BLEU, which requires exact
matches to achieve a higher score. The scores are reported in Table 1.
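For reference, BLEU can be computed with NLTK as in the sketch below; the tokenized captions are placeholders, and METEOR (available as nltk.translate.meteor_score) additionally requires the WordNet data to be downloaded.

# Sketch: corpus-level BLEU-1 and BLEU-4 with NLTK; the captions below are placeholders.
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "dog", "runs", "through", "the", "snow"],
               ["a", "black", "dog", "is", "running", "in", "snow"]]]   # reference captions for one image
hypotheses = [["a", "black", "and", "white", "dog", "runs", "through", "a", "snow"]]  # generated caption

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))               # unigram overlap
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))     # up to 4-grams
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")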

4.2.Qualitative Analysis
In Table 2, we show some sample captions generated by our system and compare them with
the ground truth captions provided. We note that our model tends to produce more generic
captions than the ground truth. Also, on closer inspection, we find that images containing
"dogs" tend to be annotated correctly by our model.

Table 3 shows some more examples of partially correct and incorrect captions generated by
our model. We find that our model has difficulty identifying activities accurately. As an
example, in the bottom-left image in Table 3, our model claims that the man is doing "rock
climbing", when in fact he is resting on the rock. It appears that our model is learning to
associate activities with gross-level image information, such as "rock climbing" if there is a
big rock and a person in the scene, or "running" when there is a dog in the scene.

4.3.Failure Cases
Table 4 shows images for which our model generated incorrect captions. In the image on the
left-hand side, our model gets the colour information completely wrong, although it correctly
notices that the "person" is "playing with a dog". In the second image, our model incorrectly
generates that the person is "surfing on a snowy hill".
4.4.Further Analysis
To further inspect the behaviour and gain better intuition about the features that the model is
learning, we plot a word cloud for the words in the Flickr8k dataset. The word cloud is shown
in Figure 4. From it, we identify that "dog" is the most frequently mentioned object in the
captions and, consequently, that most images in the dataset correspond to dogs.
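A word cloud of this kind can be produced with the third-party wordcloud package, as in the sketch below; the all_captions string is a placeholder for the concatenated Flickr8k captions.

# Sketch: word cloud over all caption words (Figure 4-style plot).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_captions = "a dog runs through the snow . a man is rock climbing . ..."   # placeholder text
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_captions)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()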

(a) t-SNE for the learnt embedding vectors. (b) t-SNE for the GloVe [12] vectors.

Figure 3: Comparison between the learnt language embedding vectors and the GloVe []
embedding vectors. Note how similar words are tightly clustered for the GloVe embeddings but
not yet tightly clustered for the learnt embeddings. (A high-resolution plot containing the 1000
most popular words is provided as supplemental material.)
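The t-SNE projections in Figure 3 can be reproduced along the lines of the sketch below, assuming the learnt embedding matrix has been extracted from the trained model (here replaced by a random placeholder) and using scikit-learn's TSNE.

# Sketch: 2-D t-SNE of word embedding vectors (as in Figure 3).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

learnt_emb = np.random.randn(1000, 512)              # placeholder for the trained word-embedding matrix
words = [f"word_{i}" for i in range(1000)]           # placeholder vocabulary

coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(learnt_emb)
plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for (x, y), w in list(zip(coords, words))[:50]:      # label only a few points to keep the plot readable
    plt.annotate(w, (x, y), fontsize=6)
plt.show()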

Sample captions generated by our model: "a black and white dog runs through a snow."; "a baby
is sitting in a stroller in a wooden room."; "a man is riding a surfboard down a wave."; "a man is
rock climbing."; "a man and woman sitting back on a rock."; "a man and woman walk sunset
through the water."

References:

[1] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to


natural language processing. Computational linguistics, 22(1):39–71, 1996.
[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A
deep convolutional activation feature for generic visual recognition. In ICML, pages
647–655, 2014.
[3] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M.
Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482,
2015.
[4] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[5] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task:
Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–
899, 2013.
