
A GUIDE FOR VISUALLY IMPAIRED PEOPLE

19L038 – DEEP LEARNING

VIBOOSITHASRI N S (19L147)
SHARATH S (19L247)

Presentation submitted in partial fulfillment of the requirements for the degree of

BACHELOR OF ENGINEERING

Branch: ELECTRONICS AND COMMUNICATION ENGINEERING
of Anna University

NOVEMBER 2022

PSG COLLEGE OF TECHNOLOGY


(Autonomous Institution)

COIMBATORE – 641 004


TABLE OF CONTENTS

CHAPTER NO.  TITLE

   ABSTRACT
1. Introduction
   1.1 Problem Statement
   1.2 Approach to the Problem Statement
2. Dataset
   2.1 Why Flickr8k dataset?
   2.2 Understanding the data
   2.3 How to Featurize images?
3. Caption Preprocessing
   3.1 Sequential Data preparation
4. Code
5. Conclusion
6. References

ABSTRACT
Navigating and understanding an outdoor environment often requires the ability to see. People with visual impairments therefore face significant challenges in exploring these environments. Deep learning has the potential to alleviate some of the frustrations they face. In this project, we assess the effectiveness of using deep learning to assist people with visual impairments.

Visually impaired and blind people frequently have no awareness of outdoor obstacles and need guidance in order to avoid collision risks. The aim of this project is to develop a mobile-based navigation system that helps visually impaired people navigate outdoors. The proposed system reduces obstacle collision risks by enabling users to walk outside smoothly with voice awareness. Currently used systems for guiding the visually impaired have several drawbacks, such as cost, dependency, and usability.

The suggested solution is a mobile camera vision system with internet connectivity, built as an independent application for indoor and outdoor navigation. The system is highly usable for guiding visually impaired people through unfamiliar environments such as parks, roads, and so on. In the presented work, deep learning algorithms are employed for image recognition and implemented as a mobile navigation application. An RNN/LSTM-based visual object recognition model is used to implement the system. The suggested smartphone-based system is not restricted to predefined outdoor environments and does not depend on any other positioning system. The proposed solution is therefore not limited to any specific environment and provides voice cues about surrounding obstacles to the user.

1. INTRODUCTION

1.1 PROBLEM STATEMENT

Blind people find it difficult to walk outdoors because they cannot see obstacles. We suggest a solution that helps visually impaired people identify obstacles when they go out alone, with the help of deep learning techniques.

Here, we propose an approach that:

1. uses a smartphone to capture real-time images as input to a deep learning model for image recognition;
2. provides a voice guide to let visually impaired people know about possible obstacles on the path.

1.2 APPROACH TO THE PROBLEM STATEMENT

We will tackle this problem using an Encoder-Decoder model. Our encoder will combine the encoded form of the image and the encoded form of the text caption and feed the result to the decoder.

Our model will treat the CNN as the 'image model' and the RNN/LSTM as the 'language model' that encodes text sequences of varying lengths. The vectors resulting from both encodings are then merged and processed by a Dense layer to make a final prediction.

We will create a merge architecture in order to keep the image out of the
RNN/LSTM and thus be able to train the part of the neural network that
handles images and the part that handles language separately, using images
and sentences from separate training sets.

In our merge model, a different representation of the image can be combined with the final RNN state before each prediction, as sketched below.
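
As a preview, the following is a minimal Keras-style sketch of this merge decoder (layer sizes match the model built later in the Code chapter; values such as vocab_size and max_length are placeholders here, since they are only computed from the data later):

from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from keras.models import Model

vocab_size, max_length, embedding_dim = 1660, 34, 200   # placeholder values

# image branch: InceptionV3 feature vector -> dense projection
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# language branch: partial caption -> embedding -> LSTM state
cap_in = Input(shape=(max_length,))
cap_vec = LSTM(256)(Embedding(vocab_size, embedding_dim, mask_zero=True)(cap_in))

# merge the two encodings and predict the next word
merged = add([img_vec, cap_vec])
outputs = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

preview_model = Model(inputs=[img_in, cap_in], outputs=outputs)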

[Figure: block diagram of the proposed merge encoder-decoder approach.]

Merging the image features with the text encodings at a later stage in the architecture is advantageous and can generate better-quality captions with smaller layers than the traditional inject architecture (CNN as encoder and RNN as decoder).

To encode our image features we will make use of transfer learning. There are many models we could use, such as VGG-16, InceptionV3, and ResNet. We will use the InceptionV3 model, which has the fewest training parameters in comparison to the others and also outperforms them.

To encode our text sequence we will map every word to a 200-dimensional vector. For this we will use a pre-trained GloVe model. This mapping is done in a separate layer after the input layer, called the embedding layer.

To generate the caption we will use two popular methods, Greedy Search and Beam Search. These methods help us pick the best words to accurately describe the image.

2. DATASET

A number of datasets are used for training, testing, and evaluating image captioning methods. The datasets differ in various respects, such as the number of images, the number of captions per image, the format of the captions, and the image size. Three datasets are popularly used: Flickr8k, Flickr30k, and the MS COCO dataset.

For training our model we are using the Flickr8k dataset. It consists of 8000 unique images, and each image is mapped to five different sentences that describe it. By associating each image with multiple, independently produced sentences, the dataset captures some of the linguistic variety that can be used to describe the same image.

2.1 Why Flickr8k dataset?

1. It is small in size, so the model can be trained easily on low-end laptops/desktops.
2. The data is properly labelled: for each image, 5 captions are provided.
3. The dataset is available for free.

Flickr8k is a good starting dataset as it is small in size and can be trained easily on low-end laptops/desktops using a CPU.
Our dataset structure is as follows:-

• Flickr8k/
o Flickr8k_Dataset/ :- contains the 8000 images
o Flickr8k_text/
▪ Flickr8k.token.txt :- contains the image IDs along with their 5 captions
▪ Flickr_8k.trainImages.txt :- contains the training image IDs
▪ Flickr_8k.testImages.txt :- contains the test image IDs

2.2 Understanding the data

Data pre-processing and cleaning is an important part of the whole model building process. Understanding the data helps us to build more accurate models.

After extracting the zip files you will find the following folders:

Flickr8k_Dataset: Contains a total of 8092 images in JPEG format with different shapes and sizes. Of these, 6000 are used for training, 1000 for testing, and 1000 for development.

Flickr8k_text: Contains text files describing the training and test splits. Flickr8k.token.txt contains 5 captions for each image, i.e. 40460 captions in total.

We have mainly two types of data:

1. Images
2. Captions (Text)

The size of the training vocabulary is 7371. Since words that occur very rarely do not carry much information, we consider only words that appear at least 10 times.

2.3 How to Featurize images?

Keras already provides models pre-trained on the standard ImageNet dataset. ImageNet is a standard dataset used for classification. It contains more than 14 million images spread across a little more than 21 thousand groups or classes.

We will be using InceptionV3 by Google.

Why Inception?

1. It has a small weight file of approximately 96 MB.
2. It is faster to train.

We will remove the softmax layer from Inception since we want to use it as a feature extractor. For a given input image, Inception gives us a 2048-dimensional feature vector.

For every training image, we resize it to (299, 299) and then pass it to Inception for feature extraction.

3. Caption Preprocessing
Each image in the dataset is provided with 5 captions. Captions are read from the Flickr8k.token.txt file and stored in a dictionary where the key is the image ID and the value is the list of its captions. Since there are 5 captions for each image, we preprocess and encode each of them in the format below:

“startseq “ + caption + “ endseq”

The reason behind startseq and endseq is:

startseq : acts as the first word when the feature vector extracted from the image is fed to the decoder. It kick-starts the caption generation process.

endseq : tells the decoder when to stop. We stop predicting words as soon as endseq appears or once the maximum caption length is reached, whichever comes first.

3.1 Sequential Data preparation

First, the image is fed to Inception to obtain a 2048-dimensional feature vector. The caption is then expanded into a sequence of (partial caption, next word) training pairs, as sketched below.

Caption: startseq a bunch of people swimming in water endseq.
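
To make this concrete, the sketch below (illustrative only; the real pipeline does the same thing with word indices, padding, and one-hot targets inside the data generator of Chapter 4) shows how the example caption expands into (partial caption, next word) pairs, each paired with the same 2048-dimensional image vector:

caption = 'startseq a bunch of people swimming in water endseq'.split()

for i in range(1, len(caption)):
    partial, target = caption[:i], caption[i]
    # every pair is trained together with the same image feature vector
    print(' '.join(partial), '->', target)

# startseq -> a
# startseq a -> bunch
# ...
# startseq a bunch of people swimming in water -> endseq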

4. CODE

Step 1:- Import the required libraries

Here we will be making use of the Keras library for creating our model and
training it. You can make use of Google Colab or Kaggle notebooks if you
want a GPU to train it.

import numpy as np
from numpy import array
import matplotlib.pyplot as plt
%matplotlib inline

import string
import os
import glob
from PIL import Image
from time import time

from keras import Input, layers
from keras import optimizers
from keras.optimizers import Adam
from keras.preprocessing import sequence
from keras.preprocessing import image
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Embedding, Dense, Activation, Flatten, Reshape, Dropout
from keras.layers import Bidirectional
from keras.layers import add
from keras.applications.inception_v3 import InceptionV3
from keras.applications.inception_v3 import preprocess_input
from keras.models import Model
from keras.utils import to_categorical

Step 2:- Data loading and Preprocessing

We will define all the paths to the files that we require and save the image IDs and their captions.

token_path = '...'         # PROVIDE PATH
train_images_path = '...'  # PROVIDE PATH
test_images_path = '...'   # PROVIDE PATH
images_path = '...'        # PROVIDE PATH
glove_path = '...'         # PROVIDE PATH

doc = open(token_path,'r').read()
print(doc[:410])

So, we can see the format in which our image IDs and their captions are stored. Next, we create a dictionary named "descriptions" which contains the name of each image as a key and the list of its 5 captions as the value.

descriptions = dict()
for line in doc.split('\n'):
    tokens = line.split()
    if len(line) > 2:
        image_id = tokens[0].split('.')[0]
        image_desc = ' '.join(tokens[1:])
        if image_id not in descriptions:
            descriptions[image_id] = list()
        descriptions[image_id].append(image_desc)

Now let's perform some basic text cleaning to get rid of punctuation and convert our descriptions to lowercase.

table = str.maketrans('', '', string.punctuation)
for key, desc_list in descriptions.items():
    for i in range(len(desc_list)):
        desc = desc_list[i]
        desc = desc.split()
        desc = [word.lower() for word in desc]
        desc = [w.translate(table) for w in desc]
        desc_list[i] = ' '.join(desc)

Next, we create a vocabulary of all the unique words present across all the
8000*5 (i.e. 40000) image captions in the data set. We have 8828 unique
words across all the 40000 image captions.

vocabulary = set()
for key in descriptions.keys():
    [vocabulary.update(d.split()) for d in descriptions[key]]
print('Original Vocabulary Size: %d' % len(vocabulary))

Now let's save the image IDs and their new cleaned captions in the same format as the Flickr8k.token.txt file:-

lines = list()
for key, desc_list in descriptions.items():
    for desc in desc_list:
        lines.append(key + ' ' + desc)
new_descriptions = '\n'.join(lines)

Next, we load all the 6000 training image IDs into a variable train from the 'Flickr_8k.trainImages.txt' file:-

doc = open(train_images_path,'r').read()
dataset = list()
for line in doc.split('\n'):
    if len(line) > 1:
        identifier = line.split('.')[0]
        dataset.append(identifier)
train = set(dataset)

Now we save all the training and testing images in train_img and test_img
lists respectively:-

img = glob.glob(images_path + '*.jpg')

train_images = set(open(train_images_path, 'r').read().strip().split('\n'))
train_img = []
for i in img:
    if i[len(images_path):] in train_images:
        train_img.append(i)

test_images = set(open(test_images_path, 'r').read().strip().split('\n'))
test_img = []
for i in img:
    if i[len(images_path):] in test_images:
        test_img.append(i)

Now, we load the descriptions of the training images into a dictionary. However, we will add two tokens to every caption, 'startseq' and 'endseq':-

train_descriptions = dict()
for line in new_descriptions.split('\n'):
    tokens = line.split()
    image_id, image_desc = tokens[0], tokens[1:]
    if image_id in train:
        if image_id not in train_descriptions:
            train_descriptions[image_id] = list()
        desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
        train_descriptions[image_id].append(desc)

Create a list of all the training captions:-

all_train_captions = []
for key, val in train_descriptions.items():
    for cap in val:
        all_train_captions.append(cap)

To make our model more robust we will reduce our vocabulary to only those
words which occur at least 10 times in the entire corpus.

word_count_threshold = 10
word_counts = {}
nsents = 0
for sent in all_train_captions:
    nsents += 1
    for w in sent.split(' '):
        word_counts[w] = word_counts.get(w, 0) + 1
vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
print('Vocabulary = %d' % (len(vocab)))

Now we create two dictionaries to map words to an index and vice versa. We also add 1 to our vocabulary size, since index 0 is reserved for the padding used to make all captions the same length.

ixtoword = {}
wordtoix = {}
ix = 1
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1
vocab_size = len(ixtoword) + 1

Hence, our total vocabulary size is now 1660. We also need to find the maximum length of a caption, since we cannot have captions of arbitrary length.

all_desc = list()
for key in train_descriptions.keys():
    [all_desc.append(d) for d in train_descriptions[key]]
lines = all_desc
max_length = max(len(d.split()) for d in lines)

print('Description Length: %d' % max_length)

Step 3:- Glove Embeddings

Word vectors map words to a vector space where similar words are clustered together and dissimilar words are separated. The advantage of GloVe over Word2Vec is that GloVe does not rely only on the local context of words; it also incorporates global word co-occurrence statistics to obtain the word vectors.

The basic premise behind GloVe is that we can derive semantic relationships between words from the co-occurrence matrix. For our model, we will map all the words in our 38-word-long captions to a 200-dimensional vector using GloVe.

embeddings_index = {}
f = open(os.path.join(glove_path, 'glove.6B.200d.txt'), encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
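
As a quick, optional sanity check on the loaded vectors (the example words below are assumed to be present in the GloVe vocabulary), semantically related words should score a noticeably higher cosine similarity than unrelated ones:

def cosine_similarity(u, v):
    # cosine of the angle between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings_index['dog'], embeddings_index['puppy']))   # relatively high
print(cosine_similarity(embeddings_index['dog'], embeddings_index['guitar']))  # noticeably lower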

Next, we build the matrix of shape (1660, 200) holding the 200-d vector for each word in our vocabulary.

embedding_dim = 200
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordtoix.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Step 4:- Model Building and Training

As you have seen from our approach, we have opted for transfer learning using the InceptionV3 network, which is pre-trained on the ImageNet dataset.

model = InceptionV3(weights='imagenet')

We must remember that we do not need to classify the images here; we only need to extract an image vector for our images. Hence we remove the softmax layer from the InceptionV3 model.

model_new = Model(model.input, model.layers[-2].output)

Since we are using InceptionV3, we need to pre-process our input before feeding it into the model. Hence we define a preprocess function to resize the images to (299 x 299) and feed them to the preprocess_input() function of Keras.

def preprocess(image_path):
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x

Now we can go ahead and encode our training and testing images, i.e. extract the image vectors of shape (2048,).

def encode(image):
    image = preprocess(image)
    fea_vec = model_new.predict(image)
    fea_vec = np.reshape(fea_vec, fea_vec.shape[1])
    return fea_vec

encoding_train = {}
for img in train_img:
    encoding_train[img[len(images_path):]] = encode(img)
train_features = encoding_train

encoding_test = {}
for img in test_img:
    encoding_test[img[len(images_path):]] = encode(img)
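
Encoding all of the images through InceptionV3 is the slowest part of this step, so it can be convenient (this is optional, and the file names below are only placeholders) to cache the feature dictionaries on disk and reload them on later runs:

import pickle

with open('encoded_train_images.pkl', 'wb') as f:
    pickle.dump(encoding_train, f)
with open('encoded_test_images.pkl', 'wb') as f:
    pickle.dump(encoding_test, f)

# on later runs, reload instead of re-encoding:
# with open('encoded_train_images.pkl', 'rb') as f:
#     train_features = pickle.load(f)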

Now let’s define our model.

We are creating a merge model where we combine the image vector and the partial caption. Therefore our model will have 3 major steps:

1. Processing the sequence from the text
2. Extracting the feature vector from the image
3. Decoding the output using softmax after combining the above two branches

inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.summary()

Input_3 is the partial caption of max length 34, which is fed into the embedding layer. This is where the words are mapped to the 200-d GloVe embedding. It is followed by a dropout of 0.5 to avoid overfitting and is then fed into the LSTM for processing the sequence.

Input_2 is the image vector extracted by our InceptionV3 network. It is followed by a dropout of 0.5 to avoid overfitting and is then fed into a fully connected layer.

The image branch and the language branch are then merged by addition and fed into another fully connected layer. The final layer is a softmax layer that provides probabilities over our 1660-word vocabulary.

Step 5:- Model Training

Before training the model we need to keep in mind that we do not want
to retrain the weights in our embedding layer (pre-trained Glove vectors).

model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

Next, compile the model using categorical cross-entropy as the loss function and Adam as the optimizer.

model.compile(loss='categorical_crossentropy', optimizer='adam')

Since our dataset has 6000 training images and 40000 captions, we will create a generator function that feeds the data to the model in batches.

def data_generator(descriptions, photos, wordtoix, max_length, num_photos_per_batch):
    X1, X2, y = list(), list(), list()
    n = 0
    # loop forever over images
    while 1:
        for key, desc_list in descriptions.items():
            n += 1
            # retrieve the photo feature
            photo = photos[key + '.jpg']
            for desc in desc_list:
                # encode the sequence
                seq = [wordtoix[word] for word in desc.split(' ') if word in wordtoix]
                # split one sequence into multiple X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pair
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_seq)

            if n == num_photos_per_batch:
                yield ([array(X1), array(X2)], array(y))
                X1, X2, y = list(), list(), list()
                n = 0

Next, let's train our model for 30 epochs with a batch size of 3 and 2000 steps per epoch. The complete training of the model took 1 hour and 40 minutes on a Kaggle GPU.

epochs = 30
batch_size = 3
steps = len(train_descriptions)//batch_size

generator = data_generator(train_descriptions, train_features, wordtoix, max_length, batch_size)
model.fit(generator, epochs=epochs, steps_per_epoch=steps, verbose=1)
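
Since training takes a while, it is worth saving the trained weights once training finishes (the file name below is only a placeholder) so the model can be reused later without retraining:

model.save_weights('model_weights.h5')

# to reuse the trained model later, rebuild the same architecture and then:
# model.load_weights('model_weights.h5')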

Step 6:- Greedy and Beam Search

As the model generates a 1660-long vector with a probability distribution across all the words in the vocabulary, we greedily pick the word with the highest probability as the next word. This method is called Greedy Search.

def greedySearch(photo):
    in_text = 'startseq'
    for i in range(max_length):
        sequence = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([photo, sequence], verbose=0)
        yhat = np.argmax(yhat)
        word = ixtoword[yhat]
        in_text += ' ' + word
        if word == 'endseq':
            break

    final = in_text.split()
    final = final[1:-1]
    final = ' '.join(final)
    return final

Beam Search is where we take the top k predictions, feed them back into the model, and then sort the resulting sequences using the probabilities returned by the model. The list always contains the top k candidate sequences; we take the one with the highest probability and expand it until we encounter 'endseq' or reach the maximum caption length.

def beam_search_predictions(image, beam_index=3):
    start = [wordtoix["startseq"]]
    start_word = [[start, 0.0]]
    while len(start_word[0][0]) < max_length:
        temp = []
        for s in start_word:
            par_caps = sequence.pad_sequences([s[0]], maxlen=max_length, padding='post')
            preds = model.predict([image, par_caps], verbose=0)
            word_preds = np.argsort(preds[0])[-beam_index:]
            # Getting the top <beam_index>(n) predictions and creating a
            # new list so as to put them through the model again
            for w in word_preds:
                next_cap, prob = s[0][:], s[1]
                next_cap.append(w)
                prob += preds[0][w]
                temp.append([next_cap, prob])

        start_word = temp
        # Sorting according to the probabilities
        start_word = sorted(start_word, reverse=False, key=lambda l: l[1])
        # Getting the top words
        start_word = start_word[-beam_index:]

    start_word = start_word[-1][0]
    intermediate_caption = [ixtoword[i] for i in start_word]
    final_caption = []

    for i in intermediate_caption:
        if i != 'endseq':
            final_caption.append(i)
        else:
            break
    final_caption = ' '.join(final_caption[1:])
    return final_caption

Step 7:- Evaluation

Let’s now test our model on different images and see what captions it
generates. We will also look at the different captions generated by Greedy
search and Beam search with different k values.

First, we will take a look at an example image from the test set. Its reference caption is 'A black dog and a brown dog in the snow'. Let's see how our model compares.

pic = '2398605966_1d0c9e6a20.jpg'
image = encoding_test[pic].reshape((1,2048))
x = plt.imread(images_path + pic)
plt.imshow(x)
plt.show()

print("Greedy Search:",greedySearch(image))
print("Beam Search, K = 3:",beam_search_predictions(image, beam_index = 3))
print("Beam Search, K = 5:",beam_search_predictions(image, beam_index = 5))
print("Beam Search, K = 7:",beam_search_predictions(image, beam_index = 7))
print("Beam Search, K = 10:",beam_search_predictions(image, beam_index = 10))

OUTPUT:

[Figure: test image with the captions generated by Greedy Search and Beam Search (k = 3, 5, 7, 10).]
You can see that our model was able to identify two dogs in the snow. But at
the same time, it misclassified the black dog as a white dog. Nevertheless, it
was able to form a proper sentence to describe the image as a human would.

pic = list(encoding_test.keys())[1]
image = encoding_test[pic].reshape((1,2048))
x = plt.imread(images_path + pic)
plt.imshow(x)
plt.show()

print("Greedy:",greedySearch(image))
print("Beam Search, K = 3:",beam_search_predictions(image, beam_index = 3))
print("Beam Search, K = 5:",beam_search_predictions(image, beam_index = 5))
print("Beam Search, K = 7:",beam_search_predictions(image, beam_index = 7))

OUTPUT:

[Figure: second test image with the captions generated by Greedy Search and Beam Search (k = 3, 5, 7).]

Here we can see that the model accurately described what was happening in the image. You will also notice that the captions generated with Beam Search are better than those from Greedy Search.

5. CONCLUSION
What we have developed here is just a start. There has been a lot of research on this topic, and much better image caption generators can be built.

Things you can implement to improve your model:-

1. Make use of larger datasets, especially the MS COCO dataset or the Stock3M dataset, which is 26 times larger than MS COCO.
2. Make use of an evaluation metric such as BLEU (Bilingual Evaluation Understudy) to measure the quality of the machine-generated text (a sketch follows this list).
3. Implement an attention-based model: attention mechanisms are becoming increasingly popular in deep learning because they can dynamically focus on different parts of the input image while the output sequence is being produced.
4. Image-based factual descriptions are not enough to generate high-quality captions. We can add external knowledge in order to generate attractive image captions. Therefore, working on open-domain datasets can be an interesting prospect.
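
For point 2 above, here is a minimal sketch of BLEU scoring with NLTK (assuming the nltk package is available; the reference captions and the generated caption shown are illustrative, not actual model output):

from nltk.translate.bleu_score import corpus_bleu

# one list of reference (human) captions per test image, tokenized into words
references = [[
    'a black dog and a brown dog play in the snow'.split(),
    'two dogs are running through the snow'.split(),
]]
# one generated caption per test image
candidates = ['two dogs playing in the snow'.split()]

print('BLEU-1: %.3f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %.3f' % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))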

We have brought the fields of Computer Vision and Natural Language Processing together and implemented a method, Beam Search, that is able to generate better descriptions than the standard Greedy Search.

There is still a lot to improve right from the datasets used to the
methodologies implemented.

6. REFERENCES

• [Link]ning_based_Object_Detection_and_Recognition_Framework_for_the_Visually-Impaired
• [Link]image-caption-generator-using-keras/
• [Link][Link]
• [Link]
• [Link]
• [Link]flickr8k-dataset-bleu-4bcba0b52926
• [Link]help-visually-impaired-people-4fcdc76816b2
