Capstone Project
Project Objective
The Stanford Cars dataset contains 16,185 images of 196 classes of cars. The data
is split into 8,144 training images and 8,041 testing images, where each class has
been split roughly 50-50. The objective is to design a deep-learning-based car
identification model for the Stanford Cars images. The problem statement is a multi-class
object detection task, which involves both classifying the car and localizing it with a bounding box.
Dataset Description:
● Train Images: real images of cars labelled by the make, model and year of the car.
● Test Images: real images of cars labelled by the make, model and year of the car.
● Train Annotation: bounding box regions for the training images.
● Test Annotation: bounding box regions for the testing images.
● Car Make: the list of car makes, names and years.
Data
The image folders 'Train images' and 'Test images' contain the images of the dataset,
each placed in the appropriate folder named after the car make, model and year.
The "Car names and make.csv" file has the list of the cars in the folders.
The Train and Test annotation CSV files contain the mapping between each image
(its file name) and the output labels (class name and bounding box coordinates).
Findings:
a) The bounding box coordinates in the Train and Test annotation CSV files are
not clearly labelled: they could be in either (xmin, ymin, xmax, ymax) format
or (x, y, width, height) format. Drawing bounding boxes using both formats
confirmed that they are indeed given in (xmin, ymin, xmax, ymax) format.
b) The number of car names in the "Car names and make.csv" file matches the
number of unique car makes in the training annotations file.
Verifying this ensured none of the folders were missed during training.
c) The number of images in each class is evenly balanced.
SOLUTION OVERVIEW
SECTION 2: PROCESS
The image file name points to the image in the respective folder. Images in the
train/test folders come in all sizes and dimensions. Part of the pre-processing
therefore includes resizing each image (to 224×224) and scaling the bounding box
coordinates accordingly to prepare them for data modelling.
Finally, we load the image from disk in Keras/TensorFlow format and preprocess it with a
resizing step that forces the image to 224×224 pixels as input for the model. The next step is to
one-hot encode our labels using LabelBinarizer.
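As an illustration, a minimal sketch of this preprocessing step is shown below. The file path and label strings are placeholders, not the project's actual variables.

import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.preprocessing import LabelBinarizer

# Load an image from disk and force it to the 224x224 input size of the model.
image = load_img("train_images/example.jpg", target_size=(224, 224))
image = img_to_array(image) / 255.0          # scale pixel values to [0, 1]
image = np.expand_dims(image, axis=0)        # add the batch dimension

# One-hot encode the class labels with LabelBinarizer.
labels = ["AM General Hummer SUV 2000", "Acura RL Sedan 2012", "Aston Martin Virage Coupe 2012"]
lb = LabelBinarizer()
one_hot = lb.fit_transform(labels)           # shape: (num_samples, num_classes)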
The Algorithms
CNNs are a class of deep neural networks that can recognize and classify particular features
from images and are widely used for analyzing visual data. Their applications include image and
video recognition, image classification, medical image analysis, computer vision and natural
language processing.
Two main parts of a CNN architecture support solving the problem at hand, and any image-related
problem in general (a minimal sketch follows the list below):
● A convolution tool that separates and identifies the various features of the image for
analysis, in a process called Feature Extraction
● A fully connected layer that utilizes the output from the convolution process and
predicts the class of the image based on the features extracted in previous stages.
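The sketch below is a minimal, generic example (not the project's final model) showing these two parts in Keras: convolutional feature extraction followed by a fully connected classifier.

import tensorflow as tf
from tensorflow.keras import layers, models

cnn = models.Sequential([
    # Feature extraction: convolution + pooling layers
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Classification: flatten the feature maps and predict the class
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(196, activation="softmax"),   # 196 car classes
])
cnn.summary()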
Object classification in this case is multi-class classification, and bounding box regression is used
for object localization. Bounding-box regression is a popular technique in object detection algorithms
used to predict a target object's location using rectangular bounding boxes; it aims to refine the
location of a predicted bounding box.
The aim was to define and model an algorithm that would perform object classification and object
localization with the best accuracy. Several algorithms were considered and executed
before EfficientNet-B7 was picked for the purpose, including the pre-trained models MobileNet,
ResNet50 and EfficientNet-B5, as well as YOLO and TFOD, as part of the experiment.
GUI
The UI demonstrates the data loading and data preprocessing, automates the data modelling
steps, and enables testing and prediction of the classification and localization using the pickled data model.
Various options such as Tkinter, Flask and Ngrok were adopted and experimented with for the purpose.
Flask is a Python web framework built with a small core and an easy-to-extend philosophy. To use it,
one imports the Flask class; an instance of this class is our WSGI application. WSGI, the Web
Server Gateway Interface, is a specification that describes how a web server communicates with web
applications, and how web applications can be chained together to process one request.
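A minimal Flask sketch (file and route names assumed) showing the Flask instance acting as the WSGI application:

from flask import Flask

app = Flask(__name__)        # this instance is the WSGI application

@app.route("/")
def home():
    return "Car Finder is running"

if __name__ == "__main__":   # alternatively, start it with the "flask run" command
    app.run(debug=True)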
Ngrok is a tool that allows you to expose a web server running on your local machine to the internet. It
provides a real-time web UI where you can inspect all HTTP traffic running over your tunnels and
replay any request against the tunnel with one click.
The tkinter package (“Tk interface”) is the standard Python interface to the Tcl/Tk GUI toolkit.
The Tk class is instantiated without arguments. This creates a toplevel widget of Tk which usually is
the main window of an application. Each instance has its own associated Tcl interpreter.
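A minimal Tkinter sketch of this pattern (window title and label text are illustrative):

import tkinter as tk

root = tk.Tk()                # toplevel widget: the main window, with its own Tcl interpreter
root.title("Car Finder")
tk.Label(root, text="Upload a car image to classify").pack()
root.mainloop()               # start the event loop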
Streamlit is an open-source app framework for Machine Learning and Data Science teams. It is
compatible with major Python libraries such as scikit-learn, Keras, PyTorch, SymPy(latex), NumPy,
pandas, Matplotlib etc.
Flask with Docker, Heroku, and AWS deployment were the other options considered while working on the
GUI solution.
SECTION 3: SOLUTION
Google Colab was extensively used for data preprocessing and modelling. It was chosen for its ability to
handle large volumes of data and its extended GPU facilities. The complete solution was broken down into
three steps/milestones.
Fig5. Overall Solution Overview
Step 2 involves the iterative process of model building, model evaluation and fine tuning.
Data loading
The car names and class names are loaded into a pandas dictionary object. Looping over our CSV
annotation files, we grab all rows in the file and proceed to loop over each of them.
Inside the loop, we unpack the comma-delimited row, giving us the filename, (x, y)-coordinates and
class label for that line of the CSV. We then loop through the annotations data frame again and
add new columns for the class name and the constructed image path. Using the image path derived from our
config, the class label and the filename, we load the image and extract its spatial dimensions. We then scale the
bounding box coordinates relative to the original image's dimensions to the range [0, 1]; this scaling
serves as our preprocessing for the bounding box data.
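A rough sketch of this annotation-loading loop is given below; the CSV column order, folder layout and file names are assumptions based on the description above, not the exact project code.

import os
from tensorflow.keras.preprocessing.image import load_img

BASE_PATH = "Train Images"                      # assumed image folder from the config

rows = open("Train Annotations.csv").read().strip().split("\n")[1:]   # skip header row
targets, filenames = [], []

for row in rows:
    # unpack the comma-delimited row: filename, box coordinates, class label
    filename, xmin, ymin, xmax, ymax, label = row.split(",")
    image_path = os.path.join(BASE_PATH, label, filename)

    # load the image only to read its original spatial dimensions
    image = load_img(image_path)
    w, h = image.size

    # scale the bounding box to [0, 1] relative to the original dimensions
    targets.append((float(xmin) / w, float(ymin) / h,
                    float(xmax) / w, float(ymax) / h))
    filenames.append(filename)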
Data preprocessing is the process of preparing the raw data and making it suitable for a machine learning
model.
In this project, it involves extracting the image information, converting it into Keras/TensorFlow
format and scaling it to a suitable dimension.
The output label is one-hot encoded using LabelBinarizer, and the bounding box coordinates, scaled
relative to the original image's dimensions, are mapped to the 224×224 input size.
The picture below shows the 5 classes having the fewest images in the Train dataset.
The picture below shows the 5 classes having the fewest images in the Test dataset.
The picture below shows the 5 classes having the most images in the Test dataset.
Fig10. Table having max number of images in Train dataset
The picture below depicts the number of car images per car model year. There are
the most images of cars from model year 2012 and very few images from model year 1996.
Class distribution
Almost all car classes have roughly an equal number of images, around 40 each.
The class "GMC Savana Van 2012" has 60 images, while "Hyundai Accent Sedan 2012"
has around 25 images. The pictures below show the class distribution.
Fig12, 13, 14. Car count histogram (of 196 classes)
The image named 05945.jpg, which belongs to the class "Chevrolet Sonic Sedan 2012", has the
maximum height and width. The dataset has an uneven distribution of image sizes, so
preprocessing is required before running the model on the data. During preprocessing we define a
fixed image size, and the TensorFlow preprocessing function is applied to each image before training
the model.
Design
Owing to the size of the data to be handled and the available GPU facility, Google Colab was predominantly
used for the data preprocessing and data modelling phases.
After studying the problem, we concluded that it can be solved using deep learning techniques, and in
particular CNNs. Because of their high accuracy, CNNs are employed for image classification, image
localization, image detection, etc. A CNN uses a hierarchical model that builds a network, similar to a
funnel, ending in a fully connected layer in which all neurons are connected to each other and the output
is processed. The benefit of CNNs is their ability to develop an internal representation of an image by
looking at only a subset of pixels at a time.
A plain dense neural network is poorly suited to computer vision tasks. The fundamental
difference between convolutional and dense layers is that the convolutional layer requires fewer
parameters because the input values are forced to share parameters. A dense layer employs a
linear operation in which every output depends on every input, whereas each output of a
convolution layer is formed from only a small window of inputs, whose size depends on the filter
size, and the weights are shared across all pixel positions. A small comparison is sketched below.
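The snippet below illustrates the parameter-count difference on a 224×224×3 input; it is a standalone illustration rather than part of the project pipeline.

import tensorflow as tf
from tensorflow.keras import layers

conv = tf.keras.Sequential([layers.Conv2D(32, (3, 3), input_shape=(224, 224, 3))])
dense = tf.keras.Sequential([layers.Flatten(input_shape=(224, 224, 3)),
                             layers.Dense(32)])

print(conv.count_params())    # 3*3*3*32 + 32 = 896 parameters
print(dense.count_params())   # 224*224*3*32 + 32 = 4,816,928 parameters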
Deploy
For this problem statement we decided to use transfer learning with a pre-built model from
TensorFlow. A few layers of a pre-built model can be trained to obtain good accuracy and good
predictions for the bounding box. Transfer learning is adaptable, allowing pre-trained models to be used
directly as feature-extraction preprocessing or integrated into completely new models. We evaluated a
few pre-trained models, MobileNet, ResNet50 and EfficientNet, as part of the experiment. Among
these evaluations we got the best result for an EfficientNet variant, EfficientNet-B7:
model = tf.keras.applications.efficientnet.EfficientNetB7(
    include_top=False, input_shape=(224, 224, 3), weights='imagenet')
A few layers of the pre-trained model were unfrozen for further training. The output of the pre-trained
model was flattened and extended with a softmax-activated dense layer for label classification
(196 classes) and a sigmoid-activated dense layer for bounding box regression (4 outputs). This
non-sequential model was then trained with categorical cross-entropy loss and a user-defined IoU metric,
respectively, using the Adam optimizer.
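A minimal sketch of how such a two-headed output can be attached to the EfficientNetB7 backbone defined above; the layer names, exact head layout and metric wiring are illustrative, not the exact project code.

from tensorflow.keras import layers, models

x = layers.Flatten()(model.output)             # model = EfficientNetB7 backbone from above

class_head = layers.Dense(196, activation="softmax", name="class_op")(x)   # 196 classes
bbox_head = layers.Dense(4, activation="sigmoid", name="reg_op")(x)        # 4 box coordinates

full_model = models.Model(inputs=model.input, outputs=[class_head, bbox_head])
full_model.compile(
    optimizer="adam",
    loss={"class_op": "categorical_crossentropy", "reg_op": "mse"},
    # the user-defined IoU function (see the evaluation section) would be added here,
    # e.g. metrics={"class_op": "accuracy", "reg_op": iou_metric}
    metrics={"class_op": "accuracy"})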
Model Pickling
The final model was saved (pickled) in h5 format; this file was later used to load the model (with
compile=False and custom_objects properly defined) for prediction and drawing bounding boxes in step 3.
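A sketch of that reload step; the file name and the IoU function name are assumptions, and the dummy input only stands in for a preprocessed image.

import numpy as np
import tensorflow as tf

def iou_metric(y_true, y_pred):
    ...   # same user-defined IoU function used at training time (name assumed)

model = tf.keras.models.load_model("car_detector_effnetb7.h5",   # assumed file name
                                   compile=False,
                                   custom_objects={"iou_metric": iou_metric})

dummy = np.zeros((1, 224, 224, 3), dtype="float32")   # stands in for a preprocessed image
class_probs, box = model.predict(dummy)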
STEP3
GUI
In Flask, we use the render_template function to render an HTML page in the browser. Templates are
files that contain static data as well as placeholders for dynamic data; a template is rendered with
specific data to produce a final document. Flask uses the Jinja template library to render templates.
Each page in the application will have the same basic layout around a different body. Instead of writing
the entire HTML structure in each template, each template will extend a base template and override
specific sections.
Fig 16. Template directory
We import the render_template function provided by Flask, render our HTML template in the home
route, and run the app using the "flask run" command.
App routing is used to map a specific URL to the function that is intended to perform the
associated task.
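A minimal sketch of such routing with render_template; the route paths are assumptions based on the pages described below.

from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")                                   # home route
def home():
    return render_template("base.html")

@app.route("/training", methods=["GET", "POST"])  # model training page
def training():
    return render_template("ModelTraining.html")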
Note: for ease of implementation, steps 1 and 2 operate on a planned miniature subset of
the complete dataset.
base.html
base.html contains the CSS formatting and a basic navigation bar with links to the three automated
steps of operation. It is the base skin for the three different HTML pages.
Fig17. Base.html
ImageProcessing.html
The annotation CSV file and image folder are POSTed, and the image processing step renders
details about the data followed by specific EDA outputs. The EDA output images are saved in the
static/images directory and rendered into the HTML template.
ModelTraining.html
Model training is done on a subset of the original dataset. The user can dictate the number
of epochs; model training is then initiated, and the accuracy details are plotted on
completion of the training.
Fig19. Modeltraining.html form
ImageDetection.html
The image detection page enables the user to submit test images. The test images are predicted with the
pickled model, which is the saved EfficientNet-B7 model from the actual training originally
done in Google Colab.
Model evaluation is an integral part of the model development process. It helps to find the model
that best represents our data and shows how well the chosen model will work in the future. Various models
were tried and evaluated before the best model was finalized. The methods and benchmarks for the
evaluation are detailed below.
Classification Parameters
Categorical cross entropy is a loss function that is used in this multi-class classification task. It is
designed to quantify the difference between two probability distributions. Mathematically,
Loss = − Σᵢ yᵢ · log(ŷᵢ)
where ŷᵢ is the i-th scalar value in the model output, yᵢ is the corresponding target value, and the
sum runs over the output size, i.e. the number of scalar values in the model output.
A Classification report is used to measure the quality of predictions from a classification algorithm. It is
used to evaluate the model. The report shows the main classification metrics precision, recall and
f1-score on a per-class basis.
Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily
by dividing the number of correct predictions by the number of total predictions.
accuracy = correct predictions / all predictions
Precision is defined as the fraction of relevant examples (true positives) among all of the examples
which were predicted to belong in a certain class.
precision = true positives / (true positives + false positives)
Recall is defined as the fraction of examples which were predicted to belong to a class with respect to
all of the examples that truly belong in the class.
recall = true positives / (true positives + false negatives)
The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the
worst is 0.0.
Mean average precision (mAP) is used to determine the accuracy of a set of object detections from a
model when compared to ground-truth object annotations of a dataset.
Intersection over Union (IoU) is used when calculating mAP. It is a number from 0 to 1 that specifies the
amount of overlap between the predicted and ground truth bounding box.
● an IoU of 0 means that there is no overlap between the boxes
● an IoU of 1 means that the union of the boxes is the same as their overlap, indicating that they
are completely overlapping
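A sketch of such a user-defined IoU function for (xmin, ymin, xmax, ymax) boxes, written with TensorFlow ops so that it can be used as a Keras metric; the exact project implementation may differ.

import tensorflow as tf

def iou_metric(y_true, y_pred):
    # intersection rectangle between ground-truth and predicted boxes
    xmin = tf.maximum(y_true[:, 0], y_pred[:, 0])
    ymin = tf.maximum(y_true[:, 1], y_pred[:, 1])
    xmax = tf.minimum(y_true[:, 2], y_pred[:, 2])
    ymax = tf.minimum(y_true[:, 3], y_pred[:, 3])
    inter = tf.maximum(0.0, xmax - xmin) * tf.maximum(0.0, ymax - ymin)

    # areas of the two boxes
    area_true = (y_true[:, 2] - y_true[:, 0]) * (y_true[:, 3] - y_true[:, 1])
    area_pred = (y_pred[:, 2] - y_pred[:, 0]) * (y_pred[:, 3] - y_pred[:, 1])

    union = area_true + area_pred - inter
    return tf.reduce_mean(inter / (union + 1e-7))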
Model Building
The following are the various models compiled and evaluated based on the parameters listed
above.
MobileNet
MobileNets are built on a streamlined architecture that uses depth-wise separable convolutions to
construct lightweight deep neural networks. Two simple global hyper-parameters are introduced that
efficiently trade off latency and accuracy; these hyper-parameters let the model builder select an
appropriately sized model for their application based on the problem constraints. We use a specific
version of the MobileNet family, MobileNetV2. MobileNetV2 is very similar to the original MobileNet, except
that it uses inverted residual blocks with bottlenecking features, and it has a drastically lower parameter
count than the original MobileNet. MobileNets support any input size greater than 32 × 32, with larger
image sizes offering better performance.
The network is fairly simple, with fewer weights to be trained. We retrain the
model right from the input layer, and we add dense and dropout layers along with batch
normalization as the final layers for classification and regression (calculation of the bounding box).
We use categorical_crossentropy for classification and MSE for regression, and compute
the overall loss of the model by adding the regression loss and the classification loss. To measure the
accuracy of the bounding boxes on the images, we defined a function to calculate IoU (Intersection over
Union), an evaluation metric used to measure the accuracy of an object detector on a particular
dataset.
The weights to be learnt for the MobileNet model are shown below.
Resnet50
ResNet50 is a ResNet variant with 48 convolution layers, 1 MaxPool layer and 1 average
pool layer, and about 3.8 × 10⁹ floating point operations. It is a popular ResNet model. ResNet is a
technique for dealing with the vanishing gradient problem in deep CNNs: residual connections bypass some
layers, on the premise that very deep networks should not have a higher training error than shallower
networks.
Here we retrained the entire model, which has 28 million trainable parameters. The
last layers of the model are GlobalAveragePooling2D, 2 dense layers, 1 dropout layer and a batch
normalization layer. For classification we used a softmax activation with 196 classes, and for
regression we used a sigmoid activation for the 4 coordinates of the bounding box.
The same loss setup as for MobileNet is used: categorical_crossentropy for classification, MSE for
regression, an overall loss formed by adding the two, and the user-defined IoU function to measure
bounding box accuracy.
The weights to be learnt for this model are shown below.
Efficientnet-b5
EfficientNet-B5 has 576 total layers; the layers from layer number 257 onwards were made trainable. This
model has 28 million trainable parameters. The last layers of the model are GlobalAveragePooling2D, 1
dense layer and a batch normalization layer. For classification we used a softmax activation with 196
classes, and for regression we used a sigmoid activation for the 4 coordinates of the bounding box.
The loss setup is again categorical_crossentropy for classification and MSE for regression, with the
overall loss formed by adding the two and the user-defined IoU function measuring bounding box accuracy.
Efficientnet-b7
The EfficientNet-B7 model with ImageNet weights is used to develop the network for the classification
and regression tasks of the given problem statement. EfficientNet-B7 is one of the EfficientNet models
designed for image classification; all EfficientNet models have been pre-trained on the ImageNet
image database.
EfficientNet-B7 has 813 total layers; the layers from layer number 351 onwards were made trainable. This
model has 64 million trainable parameters. The last layers of the model are GlobalAveragePooling2D, 1
dense layer and a batch normalization layer. For classification we used a softmax activation with 196
classes, and for regression we used a sigmoid activation for the 4 coordinates of the bounding box.
As with the other models, categorical_crossentropy is used for classification and MSE for regression,
the overall loss is the sum of the two, and the user-defined IoU function measures bounding box accuracy.
Number of weights after final model building is as mentioned below.
SECTION 5: BENCHMARK
Benchmarking is used to measure performance using a specific indicator, resulting in a metric that is then
compared to others. In other words, it is the comparison of a given model's inputs and outputs to
estimates from alternative internal or external data or models.
MobileNet, being a streamlined architecture, has shown good accuracy on the train and validation
sets but performs poorly on the unseen test dataset.
EfficientNet-B7 performs well on the unseen test dataset, yielding better accuracy numbers.
[Table: per-model comparison of loss, class_op_loss, reg_op_loss, class_op_accuracy and reg_op_IoU]
The finalized model, EfficientNet-B7, is trained in 2 phases: the 1st phase with 10 epochs and a batch
size of 64, and the 2nd phase with 15 more epochs and a batch size of 16.
Above figure shows the training vs validation accuracy and Training vs validation IOU in the first phase.
Fig23. Second Phase results
Above figure shows the training vs validation accuracy and Training vs validation IOU in the second
phase.
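In code, the two-phase schedule could look roughly like the sketch below; the training arrays and the full_model object are placeholders for the project's actual data pipeline.

# Phase 1: 10 epochs with batch size 64
full_model.fit(train_images, {"class_op": train_labels, "reg_op": train_boxes},
               validation_split=0.2, epochs=10, batch_size=64)

# Phase 2: 15 more epochs with batch size 16
full_model.fit(train_images, {"class_op": train_labels, "reg_op": train_boxes},
               validation_split=0.2, epochs=15, batch_size=16)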
The training and validation accuracy curves show a gap between them in the first phase, but with the
second phase the validation accuracy has improved. Although the observed gap remains, a check on a
few thousand sample images suggests the classification is acceptable. The IoU of the model has
performed well, and compared to ResNet50 and MobileNet the bounding boxes on the sample data
show an improvement. A few samples from the dataset are shown below.
Based on this performance, the EfficientNet-B7 model was finally chosen, as it had the best
performance among all the models we have seen so far.
Classification report
The table below shows the classification report for the EfficientNet-B7 model. Overall accuracy is 79%,
precision is 78%, recall is 78% and F1-score is 78%. As observed in the classification report, the numbers
are low for a few classes, and the same is seen in test samples where the model shows
misclassification. The number of samples used for producing the report is 8,041. In EfficientNet, the
authors propose a new scaling method called compound scaling.
Classification report for EfficientNet-B7
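The per-class report can be produced with scikit-learn as sketched below; class_probs, test_labels and lb are placeholders for the model's softmax output on the 8,041 test images, the one-hot encoded ground truth, and the fitted LabelBinarizer.

import numpy as np
from sklearn.metrics import classification_report

y_pred = np.argmax(class_probs, axis=1)
y_true = np.argmax(test_labels, axis=1)

print(classification_report(y_true, y_pred, target_names=lb.classes_))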
SECTION 6: VISUALISATIONS
During exploratory data analysis several insights were gathered from the provided data. One useful
insight was to check whether all images share the same car location and orientation within the
dataset. As observed from the graph below, the location of the car within the image varies, and
in some images the car appears very small compared to the image itself.
In the above graph, the y-axis shows the number of images (count) and the x-axis shows the orientation
of the car in the image.
The model performance is good when the provided image contains the whole car; some
examples are shown below. But when the image shows only the back of the car, lacks proper
brightness, or is blurred, the performance of the model is observed to be poor.
According to the 2018 Used Car Market Report & Outlook published by Cox Automotive, 40 million
used vehicles were sold in the US last year. This represents about 70% of the total vehicles sold. A good
portion of these sales already use online resources along various stages of purchasing: searching,
pre-qualifying, applying and finally buying. Popular websites for car buyers include AutoTrader.com,
Kelly Blue Book, Cars.com, Carvana.com.
Cox Automotive report indicates that most market leaders and Silicon Valley startups speculate that car
sales will shift completely to online retailing. This might be an extreme speculation, but these market
leaders are interested in providing better user experience while buying a car online, and better
recommender systems when a user searches for cars. Peer-to-peer sales platforms like Craigslist, Shift,
eBay Motors, etc. are also interested in better fraud detection and monitoring of user postings.
This car image classification system can address these business cases:
1. Ground-truthing of posted used car images on peer-to-peer sales platforms — are these images
the cars they specify? Do multiple exterior pics represent the same car?
2. Organizing web-page displays based on the user uploaded images
3. Recommending alternative cars available in the inventory that have similar looks and price
4. In addition, this classification system of cars can help in identifying the fine-grained features
important for 3D object detection for self-driving cars.
SECTION 8: LIMITATIONS
Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) learn using training
data and backpropagation algorithms. A critical step is to fit the AI approach to the problem and the
availability of data. Since these systems are “trained” rather than programmed, the various
processes often require huge amounts of labeled data to perform complex tasks accurately.
Obtaining large data sets can be difficult.
In some domains, they may simply not be available, but even when available, the labeling efforts can
require enormous human resources.
Further, it can be difficult to discern how a mathematical model trained by deep learning arrives at a
particular prediction, recommendation, or decision.
A black box, even one that does what it’s supposed to, may have limited utility, especially where the
predictions or decisions impact society and hold ramifications that can affect individual well-being.
In such cases, users sometimes need to know the reason behind the working, such as why an algorithm
reached its recommendations—from making factual findings with legal repercussions to arriving at
business decisions, such as lending, that have regulatory repercussions—and why certain factors (and
not others) were so critical in a given instance.
The original dataset has roughly 50–50% train-test split. For further analysis, these images can be
combined, and a train-test split of 80–20% can be used.
The dataset after consolidating to five classes has a class imbalance with Vans representing about 6% of
the data and on the other extreme Convertible/Coupes and Sedans each representing about 32% of the
data. To address this class imbalance the above classification methods can be re-iterated with
oversampling techniques: random oversampling, and Synthetic Minority Oversampling TEchnique
(SMOTE).
Image segmentation is one of the most popular image processing tasks. There are some common pitfalls
related to the most frequently used metrics in image segmentation, namely the Dice Similarity Coefficient
(DSC), the Hausdorff Distance (HD) and the Intersection over Union (IoU); the problems related to
segmentation metrics fall into four categories, one of which is metric aggregation, i.e. combining
metric values of single images into one accumulated score.
a) In the case of metrics with fixed boundaries, like the DSC or the IoU, missing values can easily be set
to the worst possible value. For distance-based measures without lower/upper bounds, the strategy of
how to deal with missing values is not trivial. In the case of the HD, one may choose the maximum
distance of the image and add 1 or normalize the metric values and use the worst possible value.
Crucially, however, every choice will produce a different aggregated value, thus potentially affecting the
ranking.
SECTION 9: REFLECTIONS
To improve the final accuracy and obtain a better F1 score, there was further scope for experimentation
along the lines of feature extraction.
Feature extraction
To detect a car in an image, we need to identify features which uniquely represent a car and which
are robust to changes in the perspective and shape of the object. Below are techniques that we can
experiment with to build more robust feature extraction.
Color Spaces
We can also explore the most suitable color space for our configuration, as HOG features across the 3
RGB channels can be too similar and therefore may not generate features with enough variation. We can
generate outputs across a range of color spaces: HSV, HLS, YUV, YCrCb and LAB.
Frame Aggregation
To strengthen the pipeline, we can experiment with smoothing the detected windows every n frames. To
do so, we accumulate all detected windows between frames (n-1)*f+1 and n*f, where n is a positive
scalar that represents the group of frames we are in. Every time we detect a new object in the current
or following frames of the group, we check whether we have detected a similar object in the past, and if so
we append the similar object, thus increasing this object's count across multiple frames. At frame n*f
we only retain detected objects (and their associated bounding boxes) that have more than m detection
counts, to improve the double filtering in the pipeline.
Image augmentation
Image augmentation is a method of modifying existing images in order to generate additional data
for the model training process; it is a technique for artificially increasing the size of a training dataset
by producing changed versions of the images in the dataset.
Deep learning neural network models that are trained on more data become more capable, and
augmentation approaches can provide variations of the images that increase the models' ability to
generalise. The ImageDataGenerator class in the Keras deep learning library allows you to fit models
with image data augmentation, as sketched below.
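A minimal sketch of such augmentation; the augmentation ranges are illustrative choices, not the project's settings, and geometric transforms would also require adjusting the bounding boxes.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,          # random rotations up to 15 degrees
    width_shift_range=0.1,      # random horizontal shifts
    height_shift_range=0.1,     # random vertical shifts
    zoom_range=0.1,             # random zoom
    horizontal_flip=True)       # mirror images left-right

# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches for model.fit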
THANK YOU NOTE
Thanks to the Great Learning team for the help in learning AIML and in doing this Capstone Project.
Many thanks to our mentor, Mr. Girijesh Prasad. His experience in the field of AIML has guided our
learning throughout this course, and his practical knowledge gave us many insights for tackling the issues.
CODE
https://github.com/girijeshcse/car_finder
REFERENCES
https://keras.io/api/applications/
https://arxiv.org/abs/1704.04861
https://arxiv.org/abs/1801.04381
https://arxiv.org/abs/1512.03385
https://keras.io/api/applications/efficientnet/#efficientnetb7-function