

Food/Non-food Image Classification and Food
Categorization using Pre-Trained GoogLeNet Model

Ashutosh Singla Lin Yuan Touradj Ebrahimi


[Link]@[Link] [Link]@[Link] [Link]@[Link]
Multimedia Signal Processing Group
Ecole Polytechnique Fédérale de Lausanne
Station 11, 1015 Lausanne, Switzerland

ABSTRACT

The recent past has seen many developments in the field of image-based dietary assessment. Food image classification and recognition are crucial steps for dietary assessment. In the last couple of years, advancements in deep learning and convolutional neural networks have proved to be a boon for image classification and recognition tasks, specifically for food recognition because of the wide variety of food items. In this paper, we report experiments on food/non-food classification and food recognition using a GoogLeNet model based on a deep convolutional neural network. The experiments were conducted on two image datasets created by ourselves, where the images were collected from existing image datasets, social media, and imaging devices such as smart phones and wearable cameras. Experimental results show a high accuracy of 99.2% on the food/non-food classification and 83.6% on the food category recognition.

Keywords

Caffe; convolutional neural network (CNN); food/non-food classification; food recognition; deep learning; GoogLeNet

1. INTRODUCTION

Well-being is becoming a topic of great interest and an essential factor linked to improvements in the quality of life. Modern information technologies have brought a new dimension to this topic. It is now possible, thanks to various wearable devices (health bands, smart watches, smart clothes, etc.), to gather a wide range of information from subjects, such as the number of steps walked, heart rate, skin temperature, skin conductivity, transpiration, respiration, etc., and to analyze this information in terms of the amount of calories spent, level of stress, and duration and quality of sleep, among others. An accurate estimation of daily nutritional intake provides a useful solution for staying healthy and preventing diseases. However, it is not easy to assess the nutritional value of the food and beverages consumed by subjects in an automatic and accurate way.

In recent years, there have been many developments in the field of dietary assessment based on multimedia techniques, for example, food image analysis. An automatic image-based dietary assessment system follows the basic steps of food image detection, food item recognition, quantity or weight estimation, and finally caloric and nutritional value assessment [1]. In the last couple of years, advancements in image processing, machine learning and, in particular, deep learning and convolutional neural networks (CNN) have proved to be a boon for image classification and recognition tasks, including the problem of food image recognition. Researchers have been working on different aspects of food recognition systems, but there is still a lack of good-enough solutions for high-accuracy food classification and recognition, considering the wide variety of food items and the highly mixed food items in many images. It is extremely difficult to correctly recognize every food item, as many food items look similar in color or shape and are not even distinguishable to human eyes, e.g., beef vs. horse meat. Moreover, in reality, a plate with highly mixed food makes the problem even more difficult to solve. Therefore, we argue that it is good enough to recognize the general type of a certain food item, based on which its dietary value, e.g., calories, can be approximately estimated. This can already provide people with basic information on their daily intake.

The paper reports two sets of experiments: 1) food/non-food image classification, and 2) food category recognition. In order to train our model for classification and recognition, we created two datasets from existing food image datasets, social media and mobile devices. A GoogLeNet model based on a deep CNN was fine-tuned and trained on our image data in the Caffe deep learning framework.

The rest of the paper is structured as follows. Section 2 introduces related work carried out by other researchers, after a brief discussion of the differences between food detection and food classification. Section 3 briefly introduces the convolutional neural network (CNN) and the GoogLeNet model. Section 4 describes the food image datasets used for the experiments. Section 5 then shows the experimental results on food/non-food classification and food category recognition. Finally, we conclude the paper and discuss future work in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].

MADiMa'16, October 16 2016, Amsterdam, Netherlands
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4520-0/16/10 ... $15.00
DOI: [Link]
2. RELATED WORK

Food image detection and recognition are active research topics in the area of computer vision. Researchers have published several approaches to solve these two problems. The first problem is to automatically detect the images that contain food items. This is an indispensable step for an automatic food analysis system. In some cases, it is enough to classify a food image, when the main objective is to annotate images that contain food for the purpose of organizing them into different categories. In multimedia dietary assessment, one should also be able to find out which food items are in an image, their locations, as well as their amount.

2.1 Food Image Detection

The task of detecting whether an image contains a food item is a binary classification problem, namely, food/non-food classification. Given an image, a food classifier identifies it as food or non-food. This is similar to any other image classification problem where a classifier is trained on image data using machine learning techniques. Classical approaches to image classification extract features such as interest point descriptors from the scale-invariant feature transform (SIFT) [2], pool the features into a vector representation, e.g., bag of words [3] or Fisher Vectors [4], and then use a classifier such as the Support Vector Machine (SVM) for classification. Kitamura et al. [5] applied an SVM to image features consisting of color histograms, DCT coefficients and detected image patterns for food image detection and obtained an accuracy of 88%. [6] reports an automatic detector that finds circular dining plates in chronically recorded images or videos. As an important application, the method can be used to detect food intake events automatically by identifying dining plates in chronically recorded video acquired by a wearable device.

Recently, the Convolutional Neural Network (CNN) [7] has offered a state-of-the-art technique for many general image classification problems. It has been applied to food classification and resulted in high accuracy. Kagaya et al. [8] applied a CNN to food/non-food classification and achieved significant results with a high accuracy of 93.8%. Then, in the work [9], the accuracy of food detection was increased to 99.1%, using a subset of their image dataset. Compared to previous works that use conventional machine learning approaches, CNN seems to provide superior performance.
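As an illustration of the classical SIFT/bag-of-words/SVM pipeline described at the start of this subsection, the following is a minimal sketch using OpenCV and scikit-learn. The codebook size, kernel and file names are illustrative assumptions, not choices made by any of the cited works.

```python
# Minimal sketch of the classical pipeline: SIFT descriptors,
# bag-of-words pooling via k-means, then an SVM classifier.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(paths):
    sift = cv2.SIFT_create()  # OpenCV >= 4.4
    all_descs = []
    for path in paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descs = sift.detectAndCompute(gray, None)
        all_descs.append(descs if descs is not None
                         else np.zeros((0, 128), np.float32))
    return all_descs

def bow_histogram(descs, codebook):
    # Pool local descriptors into a normalized visual-word histogram.
    hist = np.zeros(codebook.n_clusters)
    if len(descs):
        for word in codebook.predict(descs):
            hist[word] += 1
        hist /= hist.sum()
    return hist

# Placeholder training data: paths and labels (1 = food, 0 = non-food).
train_paths = ['food_001.jpg', 'nonfood_001.jpg']
train_labels = [1, 0]

descs_per_image = sift_descriptors(train_paths)
codebook = KMeans(n_clusters=64).fit(np.vstack(descs_per_image))
features = np.array([bow_histogram(d, codebook)
                     for d in descs_per_image])
classifier = SVC(kernel='rbf').fit(features, train_labels)
```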
2.2 Food Image Recognition

Most research works in food recognition assume that only one food item is present in the image. Thus, food recognition can be solved as a multiclass classification problem. Researchers have been working on food recognition using conventional approaches based on classical image features and machine learning for many years. Joutou et al. [10] created a private Japanese food dataset with 50 classes. They proposed a Multiple Kernel Learning (MKL) method using combined features including SIFT-based bag-of-features, color histograms and Gabor texture features. An accuracy of 61.3% on their dataset was achieved. A follow-up study by Hoashi et al. [11] achieved an accuracy rate of 62.5% using the same method on an extended dataset of 85 classes. Chen et al. [12] created the Pittsburgh food database, which contained 101 classes of American fast food images taken in a controlled environment. Yang et al. [13] defined eight basic food materials and learned spatial relationships between these ingredients in a food image using pairwise features. They achieved a classification accuracy of 28.2% on 61 food categories, a subset of the Pittsburgh dataset [12]. Bettadapura et al. [14] combined 6 feature descriptors (2 color-based and 4 SIFT-based) with SMK-MKL Sequential Minimal Optimization to train an SVM classifier. They experimented on a dataset consisting of 3750 food images of 75 categories (50 images per category) and reported an accuracy of 63.33% on their test dataset. Interestingly, they incorporated the geolocation information of where the food picture was taken, so that they could identify the restaurant and then download its menu online. An assumption of their work is that the food image must show one of the items on the menu. Rahmana et al. [15] presented a new method for generating scale and/or rotation invariant global texture features using the output of Gabor filter banks, which provides a good food classification accuracy for a mobile phone based dietary assessment system. The top-5 accuracy they achieved was almost 100%. However, the experiment was conducted on a special image dataset of only 209 food images created in a controlled environment. He et al. [16] investigated different features and their combinations for food image analysis, and a classification approach based on k-nearest neighbors and vocabulary trees. The experimental results indicate that a combination of three features, Dominant Color Descriptor (DCD), Multi-scale Dense SIFT (MDSIFT) and Scalable Color Descriptor (SCD), provides the best performance on food recognition. Bossard et al. [17] created an image dataset called Food-101, which contains 101 types of food images. They presented a method based on Random Forests to mine discriminative visual components and could efficiently classify with an accuracy rate of 50.8%.

In recent years, CNNs have also been widely used in food recognition and provide better performance than the conventional methods. Bossard et al. [17] trained a deep CNN from scratch on the Food-101 dataset using the architecture of the AlexNet model (proposed by Krizhevsky et al. [18]) and achieved 56.4% top-1 accuracy. Their proposed method based on Random Forests outperforms state-of-the-art methods on food recognition. In [8], Kagaya et al. also trained a CNN for food recognition, and the experimental results showed that the CNN outperformed all the other baseline classical approaches by achieving an average accuracy of 73.7% for 10 classes. Kawano et al. [19] used a CNN as a feature extractor and achieved a state-of-the-art best accuracy of 72.3% on the UEC-FOOD-100 [20] dataset, which contains 100 classes of Japanese food. They used the pre-trained AlexNet model as a feature extractor and integrated both CNN features and Fisher Vector encoded conventional SIFT and color features. Yanai et al. [21] fine-tuned the AlexNet model and achieved the best results on public food datasets so far, with a top-1 accuracy of 78.8% for UEC-FOOD-100 and 67.6% for UEC-FOOD-256 [22] (another Japanese food image dataset, with 256 classes). Their work showed that the recognition performance on small image datasets like UEC-FOOD-256 and UEC-FOOD-100 (both of which contain about 100 images per class) can be boosted by fine-tuning a CNN that was pre-trained on a large dataset of similar objects. Myers et al. [23] presented the Im2Calories system for food recognition, which extensively uses CNN-based approaches. The architecture of GoogLeNet [24] was used in their work, and a pre-trained model was fine-tuned on Food-101. The resulting model has a top-1 accuracy of 79% on the Food-101 test set.

3. CONVOLUTIONAL NEURAL NETWORK

Over the last few years, due to advancements in deep learning, especially in convolutional neural networks, the accuracy in identifying and recognizing images has increased drastically. This is due not only to larger datasets but also to new algorithms and improved deep architectures [24]. The Convolutional Neural Network (CNN) is also known as LeNet, after its inventor [25]. A CNN mainly comprises convolutional and sub-sampling (pooling) layers followed by fully-connected layers. The very first CNN architecture [7] takes an input image and applies convolution followed by sub-sampling. After two such computations, the data is fed into a fully connected neural network, which performs the classification task [7]. The main advantage of a CNN is its ability to learn efficient high-level features; in addition, it is robust against small rotations and shifts.

Significant progress has been made on this basic design of the CNN, and it has been extended by increasing the number of layers [26] and the size of layers [27], and by using better activation functions, e.g., ReLU [28], to yield the best results on various challenges related to object classification, recognition and computer vision.
In this paper, we use the GoogLeNet model, which was developed recently based on a deep convolutional neural network, in order to classify food/non-food images and then recognize the food images as one of the 11 categories defined in Section 4.2. GoogLeNet is an efficient deep neural network architecture with a new level of organization called the "Inception module". Such a module consists of convolutions and a max-pooling operation, and there are nine such modules in the GoogLeNet architecture. Fully-connected layers are replaced with parallel convolutions that operate on the same input layer. The 1×1 convolutions at the bottom of the module reduce the number of inputs and hence decrease the computation cost dramatically; they also capture the correlated features of an input image within the same region, whereas image patterns at larger scales are responded to by the 3×3 and 5×5 convolutions. The feature maps produced by all the convolutions are concatenated to form the output [24]. GoogLeNet uses 12 times fewer parameters than [18], the winning architecture in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, and also performs significantly better in terms of accuracy [24].
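To make the module structure concrete, the following is a minimal sketch of one Inception-style module written with Caffe's Python NetSpec interface. It is an illustration under assumptions, not the exact GoogLeNet definition: the branch widths (64, 96, 128, ...) are placeholders and vary from module to module in the real network.

```python
# Sketch of one Inception-style module via Caffe's Python NetSpec.
# Branch widths are illustrative; GoogLeNet varies them per module.
import caffe
from caffe import layers as L, params as P

def inception_module(bottom, n1x1=64, n3x3r=96, n3x3=128,
                     n5x5r=16, n5x5=32, npool=32):
    # 1x1 branch: reduces the number of inputs and captures
    # correlated features within the same image region.
    b1 = L.Convolution(bottom, kernel_size=1, num_output=n1x1)
    # 1x1 reduction followed by a 3x3 convolution: responds to
    # image patterns at a larger scale.
    b2 = L.Convolution(L.Convolution(bottom, kernel_size=1,
                                     num_output=n3x3r),
                       kernel_size=3, pad=1, num_output=n3x3)
    # 1x1 reduction followed by a 5x5 convolution: still larger scale.
    b3 = L.Convolution(L.Convolution(bottom, kernel_size=1,
                                     num_output=n5x5r),
                       kernel_size=5, pad=2, num_output=n5x5)
    # 3x3 max-pooling followed by a 1x1 projection.
    b4 = L.Convolution(L.Pooling(bottom, kernel_size=3, stride=1,
                                 pad=1, pool=P.Pooling.MAX),
                       kernel_size=1, num_output=npool)
    # Feature maps from all branches are concatenated along the
    # channel axis to form the module output.
    return L.Concat(b1, b2, b3, b4)

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 192, 28, 28]))
n.inception = inception_module(n.data)
print(n.to_proto())  # emits the corresponding prototxt fragment
```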
Figure 1: Example images of Food-5K dataset (panels: food images, non-food images).

4. DATASET

We have created two image datasets, named Food-5K and Food-11, used for the experiments on food/non-food classification and category recognition respectively. Both datasets are split into three subsets, for the purpose of training, validation and evaluation respectively.¹ In addition, another dataset, created by [9], was used in our experiments to evaluate the performance of our model on food/non-food classification. Descriptions of all the datasets are given below.

¹The datasets are publicly accessible at [Link]ch/food-image-datasets.

4.1 Dataset 1: Food-5K

Food-5K contains 2,500 food images and 2,500 non-food images, resulting in a total of 5,000 images. The food images were selected from already existing and publicly available food image datasets, including Food-101 [17], UEC-FOOD-100 [20] and UEC-FOOD-256 [22]. The food images were selected in such a way that they cover a wide variety of food items, which helps to train a strong classifier that can detect a wide variety of food images. In addition, images containing other objects or people, in which food is not even the main target, are also considered food images. Every image was visually inspected by us so that its belongingness to one of the two classes, food or non-food, is distinguishable by a human observer.

For the non-food images, we randomly selected 2,500 from existing image datasets consisting of general non-food objects or humans. These datasets include Caltech101 [29], Caltech256 [30], the Images of Groups of People [31] and Emotion6 [32]. We tried to cover a wide range of contents in the non-food images and included some non-food images visually similar to food, thus increasing the difficulty of the classification task. For the training phase, we used 3,000 images, with 1,500 for food and 1,500 for non-food. The rest of the dataset was equally divided into two subsets, with 500 images per class in each subset, for validation and evaluation respectively. Figure 1 shows some examples of food and non-food images in Food-5K.
Figure 2: Example images of Food-11 dataset (one panel per category: Bread, Dairy products, Dessert, Egg, Fried food, Meat, Noodles & Pasta, Rice, Seafood, Soup, Vegetables & Fruits).

Table 1: Food items and number of images in Food-11.


Category Example items Training Validation Evaluation
Bread Bread, burger, pizza, pancakes, etc. 994 362 368
Dairy products Milk, yogurt, cheese, butter, etc. 429 144 148
Dessert Cakes, ice cream, cookies, chocolates, etc. 1500 500 500
Egg Boiled and fried eggs, and omelette. 986 327 335
Fried food French fries, spring rolls, fried calamari, etc. 848 326 287
Meat Raw or cooked beef, pork, chicken, duck, etc. 1325 449 432
Noodles/Pasta Flour/rice noodle, ramen, and spaghetti pasta. 440 147 147
Rice Boiled and fried rice. 280 96 96
Seafood Fish, shellfish, and shrimp; raw or cooked. 855 347 303
Soup Various kinds of soup. 1500 500 500
Vegetable/Fruit Fresh or cooked vegetables, salad, and fruits. 709 232 231
Total 9866 3430 3347

4.2 Dataset 2: Food-11

The Food-11 dataset consists of 16,643 images grouped into 11 categories, which basically cover the major types of food that people consume in daily life. We defined the food categories by adopting and modifying the major food groups defined by the United States Department of Agriculture (USDA) [33]. The 11 categories are: Bread, Dairy products, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup and Vegetable/Fruit. The dataset was mainly collected from existing food image datasets including Food-101 [17], UEC-FOOD-100 [20] and UEC-FOOD-256 [22]. For certain categories (Dairy products and Vegetable/Fruit), we downloaded images from the social media sites Flickr and Instagram. For each food category, we tried to include different food items in order to increase the difficulty of recognition. Apart from this, only those images whose main content is food of that particular category were selected. The concrete example food items in each category, and the number of images for each subset, are listed in Table 1. Figure 2 shows example food images of the 11 categories.
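Caffe's ImageData layer consumes plain text list files with one "path label" pair per line. A hypothetical sketch of generating such list files for the three Food-11 subsets is given below; the folder layout food11/&lt;subset&gt;/&lt;category&gt;/*.jpg and the file names are our assumptions, not the dataset's published structure.

```python
# Hypothetical sketch: build Caffe-style "path label" list files for
# the three Food-11 subsets, assuming one folder per category.
import os

CATEGORIES = ['Bread', 'Dairy products', 'Dessert', 'Egg',
              'Fried food', 'Meat', 'Noodles-Pasta', 'Rice',
              'Seafood', 'Soup', 'Vegetable-Fruit']

def write_list_file(subset_dir, out_path):
    with open(out_path, 'w') as out:
        for label, category in enumerate(CATEGORIES):
            cat_dir = os.path.join(subset_dir, category)
            for name in sorted(os.listdir(cat_dir)):
                out.write('%s %d\n' %
                          (os.path.join(cat_dir, name), label))

for subset in ('training', 'validation', 'evaluation'):
    write_list_file(os.path.join('food11', subset), subset + '.txt')
```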
4.3 Dataset 3: IFD

In [9], Kagaya et al. built a dataset called the Instagram Food/Non-Food Dataset (IFD) from the search results of the hashtag "food" on Instagram and manually annotated the images with food and non-food labels. The dataset consists of 4,230 food images and 5,428 non-food images. In [9], the food/non-food classification experiments conducted on the IFD dataset resulted in a maximum accuracy of 95.1%. We used this dataset in our experiments to evaluate the performance of our trained model and to compare with the classification results in [9].
Figure 3: Food/non-food classification results on Food-5K dataset. The plot shows sensitivity (food accuracy), specificity (non-food accuracy) and overall accuracy over 10,000 training iterations, with the maximum accuracy at iteration #7000.

5. EXPERIMENTAL RESULTS

This section describes the experiments on food/non-food classification and food category recognition carried out using different datasets. In our experiments, we used Caffe [34] as the CNN library, which is one of the most popular frameworks for deep convolutional neural networks. A pre-trained GoogLeNet model was applied and fine-tuned using our datasets for both food/non-food classification and category recognition. In particular, we provide details on how the refinement of the model was achieved.

Figure 4: Confusion matrices of food/non-food classification results on two different image datasets (values in percentage; rows are actual classes, columns are predicted classes).

(a) Food-5K dataset:
                     Predicted food   Predicted non-food
    Actual food           99.4               0.6
    Actual non-food        1.0              99.0

(b) IFD dataset [9]:
                     Predicted food   Predicted non-food
    Actual food           94.8               5.2
    Actual non-food        2.4              97.6

Figure 5: Examples of correctly classified food and non-food images in Food-5K dataset.
5.1 Food/Non-food Classification

Food/non-food classification, or food image detection, is one of the initial and important steps for image-based dietary assessment. To classify food and non-food images, we used a pre-trained GoogLeNet model from [35] and fine-tuned it using the training subset of the Food-5K dataset. The fine-tuning process takes a pre-trained model, adapts the architecture, and resumes training from the already learned model weights. When fine-tuning a pre-trained GoogLeNet model, we can choose the layers whose parameters should be updated. We have not used any pre-processing or post-processing steps. Firstly, we made the following basic changes in the GoogLeNet model (a sketch of the resulting setup follows the list):

• All three output layer names have been changed, e.g., "loss3/classifier" was changed to "loss3/classifier Food". The reason for changing the layer names is that there should not be any conflict when the original weights are read from the pre-trained model.

• The number of outputs has been changed from 1000 to 2, as we have only 2 classes: food and non-food.

• The base learning rate Base_lr has been changed to 0.01, and the learning rate policy is polynomial.

• The maximum number of iterations, Max_iter, has been changed to 10,000.
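The paper's actual prototxt files are not reproduced here; the following is a minimal pycaffe sketch of the described fine-tuning run, assuming a solver.prototxt that encodes the settings above (base_lr: 0.01, lr_policy: "poly", max_iter: 10000) and a train_val.prototxt with the renamed two-output classifier layers. All file names are placeholders.

```python
# Minimal sketch of the fine-tuning procedure described above.
# Assumptions: solver.prototxt sets base_lr: 0.01, lr_policy: "poly"
# and max_iter: 10000; train_val.prototxt renames the three output
# layers (e.g., "loss3/classifier" -> "loss3/classifier Food") and
# sets their num_output to 2. File names are placeholders.
import caffe

caffe.set_mode_gpu()
solver = caffe.SGDSolver('solver.prototxt')

# Copy weights from the pre-trained GoogLeNet model [35]. Layers
# whose names do not match (the renamed classifiers) are skipped and
# start from fresh initialization instead of the 1000-class ImageNet
# weights -- this is why renaming avoids conflicts when the original
# weights are read.
solver.net.copy_from('bvlc_googlenet.caffemodel')

# To update only the last two (or six) layers, the earlier layers can
# be frozen in train_val.prototxt by setting their lr_mult to 0.
solver.solve()  # trains until max_iter, i.e., 10,000 iterations
```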
Then we set up two configurations to fine-tune the GoogLeNet model, one updating only the parameters of the last two layers, and the other those of the last six layers. The overall classification accuracies of the two configurations at different iterations are shown in Table 2, with the overall accuracy Acc. defined as follows:

    Acc. = (TP + TN) / (TP + FP + TN + FN),    (1)

where TP, FP, TN and FN refer to the numbers of true positives, false positives, true negatives and false negatives respectively. In most cases, especially at higher numbers of iterations, higher accuracy is achieved with the second setup, i.e., fine-tuning the last six layers of the GoogLeNet model. Therefore, we kept the setup of fine-tuning the last six layers in the remaining experiments.

Figure 3 shows the detailed results of food/non-food classification on the evaluation subset of Food-5K for all iterations. In these results, the sensitivity, or true positive rate, indicates the rate of correctly detected food images, while the specificity, or true negative rate, refers to the rate of correctly detected non-food images. From Figure 3, a maximum accuracy rate of 99.2% was achieved on the evaluation dataset at iteration #7000, with a sensitivity and specificity of 99.4% and 99.0% respectively.
Table 2: Classification accuracy of two different fine-tuning configurations.
Iteration # 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 10,000
Fine-tuning last 2 layers 0.976 0.969 0.953 0.983 0.972 0.970 0.979 0.980 0.978 0.979
Fine-tuning last 6 layers 0.968 0.952 0.987 0.976 0.974 0.975 0.992 0.981 0.983 0.982
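To make Equation (1) and the sensitivity/specificity of Figure 3 concrete, here is a small self-contained helper; the confusion counts in the example are back-computed from the reported 99.4%/99.0% rates on the 500 + 500 evaluation images and are for illustration only.

```python
# Helper illustrating Equation (1) together with the sensitivity and
# specificity used in Figure 3. Counts below are back-computed from
# the reported rates and serve as an illustration only.
def binary_metrics(tp, fp, tn, fn):
    acc = (tp + tn) / float(tp + fp + tn + fn)  # Equation (1)
    sensitivity = tp / float(tp + fn)   # food accuracy (TPR)
    specificity = tn / float(tn + fp)   # non-food accuracy (TNR)
    return acc, sensitivity, specificity

# Food-5K evaluation split: 500 food and 500 non-food images;
# 99.4% of food and 99.0% of non-food images correctly classified.
acc, sens, spec = binary_metrics(tp=497, fp=5, tn=495, fn=3)
print(acc, sens, spec)  # -> 0.992 0.994 0.99
```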

Figure 6: Misclassified food and non-food images in Food-5K dataset (panels: food classified as non-food; non-food classified as food).

Figure 5 shows some examples of correctly detected food and non-food images for iteration #7000. It can be seen that some images containing even very small regions of food are correctly classified as food, and that some food-like non-food images are correctly classified as non-food, e.g., the fake macaron lookalike. Figure 6 shows the incorrectly classified food and non-food images for iteration #7000. Some non-food images that were classified as food are highly similar to food images, and the food images classified as non-food are either ambiguous or contain only a very small region of food. Figure 4(a) shows the confusion matrix of food/non-food classification on our own dataset, Food-5K.

To further evaluate the performance of our fine-tuned model on food/non-food classification, we ran the model on two other datasets: the Food-11 dataset created by us, and the Instagram Food/Non-Food Dataset (IFD) by Kagaya et al. [9]. For both datasets, we tested our classifier at iteration #7000.

For the Food-11 dataset, we ran our food/non-food classifier on all 16,643 food images, and 16,127 of them were correctly detected as food images, which results in a detection rate of 96.9%. Note that there are only food images in the Food-11 dataset, and therefore the accuracy is just the rate of correctly detected food images. Figure 7 shows some examples of detected and undetected food images in the Food-11 dataset.

Figure 7: Examples of detected and undetected food images in Food-11 dataset.

For the IFD dataset [9], we evaluated our model on 500 food and 500 non-food randomly selected images. The classification result is shown as a confusion matrix in Figure 4(b). Among the 500 food images, 474 (94.8%) were correctly classified as food, while 488 (97.6%) out of the 500 non-food images were correctly classified as non-food. This resulted in an overall accuracy of 96.4%, which is slightly higher than the maximum accuracy of 95.1% obtained in [9].

Figure 8: Misclassified food and non-food images in Kagaya's IFD dataset [9] (panels: food classified as non-food; non-food classified as food).
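For the evaluation runs above, a plausible inference-time sketch with pycaffe is shown below. The file names, the output blob name 'prob' and the class-index convention (0 = food) are our assumptions, and the input transformation is the standard Caffe GoogLeNet preprocessing rather than anything specific to this paper.

```python
# Hedged sketch of running the fine-tuned classifier on one image.
# File names, the 'prob' blob name and "class 0 = food" are assumed.
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'finetuned_food5k.caffemodel',
                caffe.TEST)

# Standard Caffe input transformation: HxWxC [0,1] float image to
# CxHxW, scaled to [0,255], BGR channel order, mean-subtracted.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))

image = caffe.io.load_image('test_image.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)
probs = net.forward()['prob']  # softmax over {food, non-food}
print('food' if probs[0].argmax() == 0 else 'non-food')
```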
5.2 Food Category Recognition

Correctly recognizing the type of food in a food image is another crucial step for a dietary assessment system. The aim of food categorization is to let the system either directly estimate the nutritional value of a food item using general information about the food category, or further classify the food item into a sub-category to obtain a better estimation. In this experiment, we used the Food-11 dataset to train and test a CNN model on food category recognition. As explained in Section 4.2, the food images in Food-11 have been categorized into 11 classes, and Table 1 shows the number of images in each category for training, validation and evaluation. Our task here was to classify each food image into one of the 11 categories. For this purpose, the pre-trained GoogLeNet model [35] was applied and the last six layers were fine-tuned on the training set of Food-11. We have not used any pre-processing or post-processing steps. The following changes have been made in the GoogLeNet model:

• All three output layers have been renamed, e.g., "loss3/classifier" was changed to "loss3/classifier FoodReco", for the same reason as in the food/non-food classification in Section 5.1.

• The number of outputs has been changed from 1000 to 11, as we have 11 classes.
• The base learning rate Base_lr has been changed to 0.001, and the learning rate policy is polynomial.

• The maximum number of iterations, Max_iter, has been changed to 40,000.

We used three metrics to evaluate the performance of food recognition: 1) overall accuracy Acc., 2) F-measure F1 [36], and 3) Cohen's kappa coefficient κ [37]. In particular, Cohen's kappa coefficient is a numerical evaluation of inter-rater agreement which takes into account not only the observed classification accuracy but also the accuracy that any random classifier would be expected to achieve, namely, the random accuracy. It is especially useful for evaluating classification when the numbers of images in the different categories are not the same.
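The three metrics can be computed from a confusion matrix of raw counts as sketched below. The macro-averaging of the per-class F1 scores is our assumption, since the paper does not state which averaging it uses.

```python
# Sketch of the three evaluation metrics computed from a confusion
# matrix of raw counts (rows = actual class, columns = predicted).
import numpy as np

def overall_accuracy(cm):
    # Observed agreement: fraction of images on the diagonal.
    return np.trace(cm) / float(cm.sum())

def cohens_kappa(cm):
    # kappa = (p_o - p_e) / (1 - p_e), where p_e is the accuracy a
    # random classifier would achieve given the class marginals [37].
    n = float(cm.sum())
    p_o = np.trace(cm) / n
    p_e = float((cm.sum(axis=0) * cm.sum(axis=1)).sum()) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)

def macro_f1(cm):
    # Per-class precision and recall combined into F1 [36], then
    # averaged over all classes (macro-averaging: our assumption).
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()
```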
Figure 9 shows the overall accuracy, F-measure and Cohen's kappa coefficient on the evaluation subset of Food-11 with respect to the number of iterations. The maximum accuracy of 83.5% was achieved on the evaluation dataset at iteration #4100, where we also obtained the maximum values of the F-measure and kappa coefficient, 0.911 and 0.816 respectively. The high value of Cohen's kappa coefficient (0.816) also indicates that the trained classifier performs significantly better than any random classifier. Due to time constraints, we had to stop evaluating the results on the evaluation dataset after iteration #5000, as the accuracy on the validation dataset did not show any significant improvement.

Figure 9: Accuracy of food recognition on Food-11 dataset. The plot shows overall accuracy Acc, F-measure F1 and Cohen's kappa κ over 5,000 iterations, with optimal results at iteration #4100.

The confusion matrix of the recognition results at iteration #4100 is shown in Figure 10. Among all the classes, Noodles/Pasta, Rice and Soup give the best recognition accuracies, higher than 95%. This is because the food images in each of these categories have their own common characteristics in either shape or color and are therefore easier to identify. However, we notice that some types of food images are error-prone, e.g., Bread, Egg and Meat, whose accuracies are lower than 80%. Those three types of food are also the ones that have highly mixed food items in our dataset. For instance, the category Egg contains boiled egg, fried egg and omelette, which are highly different in appearance. Besides, many of those images have the main food mixed with other food items, e.g., meat with salad. Interestingly, we observe that Dessert and Soup are the two target classes most likely to be misclassified into. In 7 classes (Bread, Dairy, Egg, Seafood, Meat, Fried food and Vegetable/Fruit), more than 5% of the testing images were incorrectly classified as Dessert. This is because Dessert is the category that has the most mixed items in our dataset, and many of them can be visually similar to other food. Besides, more than 4% of the images in Bread, Dessert, Meat and Seafood were misclassified as Soup. By checking some of the images misclassified as Soup, we found that most of them have round-shaped elements such as a plate or round bread; most Soup images also have similar round-shaped plates or containers.

Figure 10: Confusion matrix of food recognition at iteration #4100. Values are percentages; rows are actual classes and columns are predicted classes, both in the order: Bread, Dairy products, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, Vegetable/Fruit.

    Bread            67.7  3.8 10.9  4.6  6.5  1.9  0.3  0.0  0.3  4.1  0.0
    Dairy products    0.0 87.2  9.5  0.7  0.7  0.7  0.0  0.7  0.0  0.7  0.0
    Dessert           1.6  6.0 81.4  0.8  0.8  2.0  0.4  0.0  2.4  4.6  0.0
    Egg               4.8  2.4  6.9 77.3  2.4  0.3  0.0  1.5  0.6  3.6  0.3
    Fried food        1.7  1.7  5.2  0.7 81.9  3.1  0.0  0.7  1.4  3.5  0.0
    Meat              3.7  0.2  5.3  0.9  3.0 79.6  0.0  0.2  2.1  4.9  0.0
    Noodles/Pasta     0.0  0.7  0.0  0.0  0.7  0.0 95.9  0.0  0.7  2.0  0.0
    Rice              0.0  0.0  2.1  0.0  0.0  0.0  0.0 95.8  2.1  0.0  0.0
    Seafood           1.7  1.3  6.9  0.7  0.0  1.0  0.0  0.3 83.8  4.3  0.0
    Soup              0.2  0.6  0.4  0.2  0.0  0.0  0.0  0.2  0.2 98.0  0.2
    Vegetable/Fruit   0.0  2.2  5.2  0.4  0.4  1.3  0.9  0.4  3.0  0.4 85.7

According to the confusion matrix in Figure 10, we list the top 10 misclassified class pairs and show two example images for each in Figure 11. By observing the incorrectly classified images, we found that misclassification mostly happens in the following two cases:

1. Images within different classes have similar appearance, shape or color.
2. Images have more than one type of food item mixed.

Considering the fact that each image category in the Food-11 dataset contains different food items with a certain variety, and that our training dataset is not significantly large, the results we obtained (Acc. = 0.835, F1 = 0.911 and κ = 0.816) seem promising.
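Because Figure 10 is reproduced above as a plain table, the top-10 list of Figure 11 can be recomputed directly from its off-diagonal entries; the sketch below does exactly that, with the matrix values copied verbatim from Figure 10.

```python
# Recomputing the top-10 misclassified pairs of Figure 11 from the
# Figure 10 confusion matrix (rows = actual, columns = predicted,
# values in percent, copied from the figure).
import numpy as np

classes = ['Bread', 'Dairy products', 'Dessert', 'Egg', 'Fried food',
           'Meat', 'Noodles/Pasta', 'Rice', 'Seafood', 'Soup',
           'Vegetable/Fruit']
cm = np.array([
    [67.7,  3.8, 10.9,  4.6,  6.5,  1.9,  0.3,  0.0,  0.3,  4.1,  0.0],
    [ 0.0, 87.2,  9.5,  0.7,  0.7,  0.7,  0.0,  0.7,  0.0,  0.7,  0.0],
    [ 1.6,  6.0, 81.4,  0.8,  0.8,  2.0,  0.4,  0.0,  2.4,  4.6,  0.0],
    [ 4.8,  2.4,  6.9, 77.3,  2.4,  0.3,  0.0,  1.5,  0.6,  3.6,  0.3],
    [ 1.7,  1.7,  5.2,  0.7, 81.9,  3.1,  0.0,  0.7,  1.4,  3.5,  0.0],
    [ 3.7,  0.2,  5.3,  0.9,  3.0, 79.6,  0.0,  0.2,  2.1,  4.9,  0.0],
    [ 0.0,  0.7,  0.0,  0.0,  0.7,  0.0, 95.9,  0.0,  0.7,  2.0,  0.0],
    [ 0.0,  0.0,  2.1,  0.0,  0.0,  0.0,  0.0, 95.8,  2.1,  0.0,  0.0],
    [ 1.7,  1.3,  6.9,  0.7,  0.0,  1.0,  0.0,  0.3, 83.8,  4.3,  0.0],
    [ 0.2,  0.6,  0.4,  0.2,  0.0,  0.0,  0.0,  0.2,  0.2, 98.0,  0.2],
    [ 0.0,  2.2,  5.2,  0.4,  0.4,  1.3,  0.9,  0.4,  3.0,  0.4, 85.7],
])

off_diag = cm.copy()
np.fill_diagonal(off_diag, 0.0)
pairs = sorted(((off_diag[i, j], classes[i], classes[j])
                for i in range(len(classes))
                for j in range(len(classes))), reverse=True)
for rate, actual, predicted in pairs[:10]:
    print('%.1f%% %s => %s' % (rate, actual, predicted))
```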
Figure 11: Top 10 misclassified category pairs and example images. The percentage indicates the proportion of incorrectly classified images among all testing images of the particular category. The pairs are: 10.9% Bread ⇒ Dessert; 9.5% Dairy ⇒ Dessert; 6.9% Egg ⇒ Dessert; 6.9% Seafood ⇒ Dessert; 6.5% Bread ⇒ Fried; 6.0% Dessert ⇒ Dairy; 5.3% Meat ⇒ Dessert; 5.2% Fried ⇒ Dessert; 5.2% Fruit ⇒ Dessert; 4.9% Meat ⇒ Soup.

6. CONCLUSION

In this paper, we applied a pre-trained GoogLeNet model based on the CNN architecture to the tasks of food/non-food image classification and food category recognition. We constructed two image datasets from publicly available datasets and social media, and fine-tuned the GoogLeNet model using our datasets. The experimental results show an overall accuracy of 99.2% on food/non-food image classification and 83.6% on food categorization. The main reasons for not achieving a higher recognition accuracy on certain types of food images are the complex mixture of food items in an image and the high visual similarity between some images across categories. As a future direction, we aim at recognizing food items in images with a multi-label approach, namely, using top-n predictions as output, and at integrating contextual information to improve the accuracy, comparing the results with different architectures such as AlexNet, VGG, and ResNet. Further investigation will be done into different transfer learning schemes, such as locking layers. We will also work on the estimation of food item quantity and weight in order to finally estimate their nutritional value.

7. ACKNOWLEDGMENTS

This research work was supported by EPFL Food Science and Nutrition Center funding in the framework of the NutriTake project, as well as by partial funding from the Swiss National Foundation for Scientific Research project LEADME 200020-149259. We also appreciate and acknowledge constructive comments from anonymous reviewers.

8. REFERENCES

[1] Giovanni Maria Farinella, Dario Allegra, Marco Moltisanti, Filippo Stanco, and Sebastiano Battiato. Retrieval and classification of food images. Computers in Biology and Medicine, 77:23-39, 2016.
[2] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[3] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 2169-2178, 2006.
[4] Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the Fisher Vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, December 2013.
[5] Keigo Kitamura, Toshihiko Yamasaki, and Kiyoharu Aizawa. Food log by analyzing food images. In Proceedings of the 16th ACM International Conference on Multimedia, MM '08, pages 999-1000, New York, NY, USA, 2008. ACM.
[6] J. Nie, Z. Wei, W. Jia, L. Li, J. D. Fernstrom, R. J. Sclabassi, and M. Sun. Automatic detection of dining plates for image-based dietary evaluation. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pages 4312-4315, Aug 2010.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998.
[8] Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa. Food detection and recognition using convolutional neural network. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 1085-1088, New York, NY, USA, 2014. ACM.
[9] Hokuto Kagaya and Kiyoharu Aizawa. Highly Accurate Food/Non-Food Image Classification Based on a Deep Convolutional Neural Network, pages 350-357. Springer International Publishing, Cham, 2015.
[10] Taichi Joutou and Keiji Yanai. A food image recognition system with multiple kernel learning. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 285-288, Nov 2009.
[11] H. Hoashi, T. Joutou, and K. Yanai. Image recognition of 85 food categories by feature fusion. In Multimedia (ISM), 2010 IEEE International Symposium on, pages 296-301, Dec 2010.
[12] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang. PFID: Pittsburgh fast-food image dataset. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 289-292, Nov 2009.
[13] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar. Food recognition using statistics of pairwise local features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2249-2256, June 2010.
[14] Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory D. Abowd, and Irfan A. Essa. Leveraging context to support automated food recognition in restaurants. CoRR, abs/1510.02078, 2015.
[15] M. H. Rahmana, M. R. Pickering, D. Kerr, C. J. Boushey, and E. J. Delp. A new texture feature for improved food recognition accuracy in a mobile phone based dietary assessment system. In Multimedia and Expo Workshops (ICMEW), 2012 IEEE International Conference on, pages 418-423, July 2012.
[16] Y. He, C. Xu, N. Khanna, C. J. Boushey, and E. J. Delp. Analysis of food images: Features and classification. In 2014 IEEE International Conference on Image Processing (ICIP), pages 2744-2748, Oct 2014.
[17] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 - mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[19] Yoshiyuki Kawano and Keiji Yanai. Food image recognition with deep convolutional features. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, pages 589-593. ACM, 2014.
[20] Y. Matsuda, H. Hoashi, and K. Yanai. Recognition of multiple-food images by detecting candidate regions. In Proc. of IEEE International Conference on Multimedia and Expo (ICME), 2012.
[21] Keiji Yanai and Yoshiyuki Kawano. Food image recognition using deep convolutional network with pre-training and fine-tuning. In Multimedia & Expo Workshops (ICMEW), 2015 IEEE International Conference on, pages 1-6. IEEE, 2015.
[22] Y. Kawano and K. Yanai. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In Proc. of ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV), 2014.
[23] Austin Myers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, and Kevin Murphy. Im2Calories: towards an automated mobile vision food diary. In ICCV, 2015.
[24] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[25] Haohan Wang and Bhiksha Raj. A survey: Time travel in deep learning space: An introduction to deep learning models and how deep learning models evolved from the initial ideas. CoRR, abs/1510.04781, 2015.
[26] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.
[27] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[28] Z. Zhong, L. Jin, and Z. Xie. High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 846-850, Aug 2015.
[29] Li Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Computer Vision and Pattern Recognition Workshop, 2004. CVPRW '04. Conference on, pages 178-178, June 2004.
[30] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[31] A. Gallagher and T. Chen. Understanding images of groups of people. In Proc. CVPR, 2009.
[32] Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, and Andrew C. Gallagher. A mixed bag of emotions: Model, predict, and transfer emotion distributions. In CVPR, pages 860-868. IEEE, 2015.
[33] Marion Nestle. Food politics: How the food industry influences nutrition and health, volume 3. Univ of California Press, 2013.
[34] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 675-678, New York, NY, USA, 2014. ACM.
[35] [Link] bvlc_googlenet.
[36] David Martin Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[37] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37, 1960.