Deep Convolutional Generative Adversarial Network Based Food Recognition
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/LSENS.2018.2886427, IEEE Sensors Letters
Sensor Applications
Abstract—Traditional machine learning algorithms using hand-crafted feature extraction techniques (such as local binary pattern) have limited accuracy for the food recognition task because of high variation among images of the same class (intra-class variation). In recent works, convolutional neural networks (CNNs) have been applied to this task with better results than all previously reported methods. However, they perform best when trained with a large amount of annotated (labeled) food images. This is problematic because obtaining such annotations in large volume is expensive, laborious and often impractical. Our work aims at developing an efficient deep CNN learning-based method for food recognition that alleviates these limitations by using partially labeled training data with generative adversarial networks (GANs). We make new enhancements to the unsupervised training architecture introduced by Goodfellow et al. (2014), which was originally aimed at generating new data by sampling a dataset. In this work, we make modifications to deep convolutional GANs to make them robust and efficient for classifying food images. Experimental results on benchmarking datasets show the superiority of our proposed method as compared to the current state-of-the-art methodologies even when trained with partially labeled training data.
Index Terms—Generative Adversarial Network, Deep CNN, Semi-Supervised Learning, Food Recognition.
Corresponding author: B. Mandal (e-mail: [email protected]).
2475-1472 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
I. INTRODUCTION AND RELATED WORK
A substantial increase in health consciousness has been witnessed among the masses due to increasing health risks. Diabetes, obesity and cholesterol cases are being increasingly reported every year. The World Health Organisation (WHO) has reported that the global prevalence of diabetes doubled from 1980 to 2014 [1]. Improper diet and excessive calorie intake have been major factors causing these health risks, and hence keeping a check on calorie consumption can help avoid them [2], [3]. With the advent of smart devices, apps with multi-modal data logging mechanisms, which can be used anywhere and anytime, are in high demand. At present, existing mobile apps like MyFitnessPal [4], LoseIt [5] and others work on manual data entry. This is cumbersome for most people, hindering the long-term usability of such apps. Alternatively, mobile cameras are used to capture image/video data, and expert nutritionists are then typically relied upon to analyze the images offline (at a particular time) [6]. Crowd sourcing has also been used as an approach to analyze the food images. However, both these methods are costly, inefficient and slow; thus, they are not suitable for widespread usage.
The appearance of any food is strongly characterized by the recipe, ingredients used, style of preparation and other factors. Food images exhibit large intra-class variation in terms of size, color, shape, texture, viewpoint and others [7], [8]. In this case, classic feature descriptors like speeded up robust features (SURF), histogram of oriented gradients (HOG), spatial pyramidal pooling, bag of scale invariant feature transform (SIFT), color correlogram, etc. can only succeed when used on small laboratory-generated datasets. Typically, a support vector machine (SVM) is trained using these extracted features or a combination of them [9]. To improve accuracy for this task, researchers have worked towards estimating the region of the image in which the food item is present. One way to do this is to use standard segmentation and object detection methods, or to ask the user to input a bounding box giving this information, as done in [10]. Another approach is to use convolutional neural network (CNN) based semantic segmentation, which has shown an increase in accuracy [11]. Some approaches use ingredient-level features to recognize the food items [12]. Pairwise statistics of local features have also been applied to the food recognition task [13].
Deep CNN learning algorithms overcome the drawbacks of traditional machine learning algorithms based on hand-designed features. These algorithms have the inherent capability to mimic human information processing systems and work very well with images [14] in many applications [15], including food recognition [16]. In recent years, CNN based food recognition has shown excellent results even on extensive and large datasets with non-uniform background [11]. Recent reviews of recurrent neural networks (RNNs) [17] show a similar trend for learning long-term dependencies in sequential images (video) and time-series data. A common problem with all these existing methodologies is that they need a large amount of labeled training data to perform reasonably well, and such data is difficult and expensive to obtain in large volume.
A. Related Generative Adversarial Network Architectures
One of the attractive features of generative adversarial network (GAN) based modeling is that it does not require labeled data. Learning approaches are classified into generative and discriminative models. A generative model is trained to learn the joint probability of the input data and output class labels simultaneously, i.e. P(x, y). This can be used to infer the conditional distribution P(y|x) by applying Bayes' rule. More importantly, the learned joint probability can be used for other purposes, such as generating new (x, y) samples. A discriminative model is trained to learn a target function that maps the input data x to a set of output class labels y. Mathematically, it approximates the conditional distribution P(y|x). While both types of models, generative and discriminative, have their use cases, generative
models have the ability to model the internal nature of the input data even in the absence of any labels. In current real-world applications, unlabeled data is abundant and easy to obtain. The cost of acquiring labeled data can sometimes be too high to justify. In such cases, generative models provide a desirable solution.
As summarized in Algorithm 1, the principal idea behind a GAN is to pit a discriminative framework against a generative one. Thus, the two component neural networks in a GAN, discriminative and generative, act as adversaries. The generative network is given Gaussian noise as input and is trained to generate samples indistinguishable from the real samples. The discriminator network is given both the generated fake samples and corresponding real samples and is trained to identify the fake samples. In the introductory GAN work, Goodfellow et al. [18] used fully connected neural networks for both the generator and discriminator. Consequently, this architecture was applied to standard image datasets for testing purposes, specifically the MNIST (handwritten digits) [19] and CIFAR-10 (natural images) [20] datasets.

Algorithm 1: Generative adversarial network training.
Input: I ← Number of training iterations
begin
  for number of training iterations do
    for k steps do
      Sample mini batch of m Gaussian noise samples $(z_1, z_2, \ldots, z_m)$ from noise prior $p_g(z)$.
      Sample mini batch of m examples $(x_1, x_2, \ldots, x_m)$ from data generating distribution $p_{\text{data}}(x)$.
      Update the discriminator by ascending its stochastic gradient:
      $$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x_i) + \log\left(1 - D(G(z_i))\right) \right] \quad (1)$$
    Sample mini batch of m Gaussian noise samples $(z_1, z_2, \ldots, z_m)$ from noise prior $p_g(z)$.
    Update the generator by descending its stochastic gradient:
    $$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z_i))\right) \quad (2)$$

It is evident from the literature and benchmarking competitions that convolutional neural networks (CNNs) are extremely well suited to image data analysis [14], as in our case of food recognition [16]. Initial experiments conducted on small datasets, like CIFAR-10, showed that achieving convergence in convolutional GANs was harder than in CNNs, with similar computational power as that used for supervised learning. One solution to this problem was to use a cascade of CNNs with a Laplacian pyramid framework, presented by Denton et al. (LAPGAN) [21]. This essentially decomposed the generation process, with the images generated in a coarse-to-fine manner. A group of network architectures called DCGAN (deep convolutional GAN), proposed by Radford et al. [22], showed promising results on image datasets. Here, a pair of convolutional discriminator and generator networks are simultaneously trained using strided and fractionally-strided convolutions. Thus, their aim was to learn the mapping from image space to the discriminator output space (down-sampling) and also to learn the mapping from a lower dimensional latent space to the image space (up-sampling). The widespread adoption of this group of GAN architectures for a number of applications makes it a natural choice for our semi-supervised task at hand.
II. PROPOSED SOLUTION USING GANS
Our present focus is on a food recognition system that can deal with non-uniformity of the background and is extendable to out-of-sample test images. To ensure that the system is robust and easy to use, the architecture has been designed using a semi-supervised approach, as shown in Fig. 1. The architecture has two parts: the Generator and the Discriminator. Although both are useful, the discriminator is the network which learns the nature of the problem better and thus can be used as a multi-label classifier to recognize different food items. Below we describe our newly proposed semi-supervised GAN (SSGAN) architecture in detail.
A. Semi-Supervised Generative Adversarial Networks
For a general classifier, the input is the data point x and the output is a k-dimensional vector of logits (inverse of the sigmoidal “logistic” function) $\phi_1, \phi_2, \ldots, \phi_k$, which can be used to calculate class probabilities using the softmax function:
$$p_{\text{model}}(y = i \mid x) = \frac{\exp(\phi_i)}{\sum_{j=1}^{k} \exp(\phi_j)},$$
where k is the number of classes. To train the model, we minimize the negative log-likelihood between $p_{\text{model}}(y \mid x)$ and the observed labels y. We add the fake generated samples from the generator G to our dataset and also add another output class to our classifier to detect these fake samples. Now that we have k + 1 classes, $p_{\text{model}}(y = k+1 \mid x)$ determines the probability of the input data point being fake. This formulation also provides us the capability of learning from unlabeled data, provided it can be classified under one of the k classes of real data, by minimizing $-\log p_{\text{model}}(y \in \{1, \ldots, k\} \mid x)$.
In case of even distribution of data points, the loss function can be decomposed into two components: θ, which is the negative log-likelihood of the label (supervised) when the data is real, and δ, which is the GAN loss (unsupervised).
$$L = -\mathbb{E}_{x, y \sim p_{\text{data}}(x, y)}[\log p_{\text{model}}(y \mid x)] - \mathbb{E}_{x \sim G}[\log p_{\text{model}}(y = k+1 \mid x)]. \quad (3)$$
Total loss, L = θ + δ, where
$$\theta = -\mathbb{E}_{x, y \sim p_{\text{data}}(x, y)}[\log p_{\text{model}}(y \mid x, y < k+1)] \quad (4)$$
and
$$\delta = -\mathbb{E}_{x \sim p_{\text{data}}(x)} \log[1 - p_{\text{model}}(y = k+1 \mid x)] - \mathbb{E}_{x \sim G}[\log p_{\text{model}}(y = k+1 \mid x)]. \quad (5)$$
The output of a GAN discriminator network D is an estimated probability that the input image is obtained from the data generating distribution. Traditionally, this is implemented with a feed-forward network ending in a single sigmoid unit, but in this work a softmax output layer is used, with one unit for each of the real classes and one fake unit for generated images. It can be seen that D has k + 1 output units corresponding to [class-1, class-2, ..., class-k, fake]. In this case, D can also act as a classifier C. We use higher granularity labels for the half of the mini-batch that has been drawn from the data generating distribution. Training is performed on the discriminator/classifier network to minimize the negative log-likelihood corresponding to the given labels.
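The k + 1 output formulation can be illustrated with a short numerical sketch. It is a minimal illustration in plain Python under stated assumptions, not the training code used in this work: the function names, the 0-based class indices, the convention that the last logit is the fake unit, and the example logits in the usage note are all hypothetical. It evaluates the supervised term (4) and the two parts of the unsupervised term (5) for a single sample, reading the real-image part of (5) as -log[1 - p_model(y = k+1 | x)]. With that reading, the supervised term and the real-image unsupervised term add up to the negative log-likelihood over all k + 1 outputs, which is the real-data term of (3).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_loss(logits, label):
    """Term (4): -log p(y | x, y < k+1).

    Assumption: logits has k + 1 entries and the last one is the fake unit,
    so the softmax is restricted to the first k (real-class) logits.
    """
    real_probs = softmax(logits[:-1])
    return -math.log(real_probs[label])


def unsupervised_real_loss(logits):
    """Real-image part of (5): -log(1 - p(y = k+1 | x))."""
    probs = softmax(logits)
    # 1 - p(fake) is computed as the total probability of the k real classes.
    return -math.log(sum(probs[:-1]))


def unsupervised_fake_loss(logits):
    """Generated-image part of (5): -log p(y = k+1 | x)."""
    return -math.log(softmax(logits)[-1])


def full_real_loss(logits, label):
    """Real-data term of (3): -log p(y | x) over all k + 1 outputs."""
    return -math.log(softmax(logits)[label])
```

For example, with the hypothetical logits `[2.0, 0.5, -1.0, 0.3]` (k = 3 real classes plus the fake unit) and label `1`, `full_real_loss(logits, 1)` matches `supervised_loss(logits, 1) + unsupervised_real_loss(logits)` up to floating-point error. Unlabeled real images contribute only `unsupervised_real_loss`, which is what allows partially labeled data to be used.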
Fig. 1. Block diagram for semi-supervised generative adversarial network for food recognition.
The generator G is trained to maximize this negative log-likelihood. Our proposed SSGAN framework is summarized in Algorithm 2.

Algorithm 2: Semi-supervised GAN (SSGAN) training algorithm.
Input: I ← Number of training iterations
begin
  for i = 1 to I do
    Sample mini batch of m Gaussian noise samples $(z_1, z_2, \ldots, z_m)$ from noise prior $p_g(z)$.
    Sample mini batch of m examples $(x_1, x_2, \ldots, x_m)$ from data generating distribution $p_{\text{data}}(x)$.
    Perform gradient descent on the parameters of D by calculating the gradient as:
    $$\nabla_{\theta_d} \frac{1}{2m} \sum_{i=1}^{2m} L \quad \text{(loss in equation 3)} \quad (6)$$
    Sample mini batch of m Gaussian noise samples $(z_1, z_2, \ldots, z_m)$ from noise prior $p_g(z)$.
    Perform gradient descent on the parameters of G by calculating the gradient as:
    $$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} [1 - \log p_{\text{model}}(y = k+1 \mid x)] \quad (7)$$

III. EXPERIMENTAL RESULTS AND DISCUSSIONS
We have conducted experiments on two datasets, ETH Food-101 [23] and the Indian Food Dataset [16]. The experiments are performed on an NVIDIA Quadro P4000 GPU with 8GB VRAM. Our model details are: GAN type: DCGAN, Optimizer: ADAM, Activation: Leaky ReLu (in all hidden layers), Sigmoid (in discriminator’s output layer). The model was run on test data after every 100 training epochs. To address the instability of GANs while training, we have implemented feature matching by specifying a new objective for the generator that prevents it from over-training on the discriminator, similar to [24]. Additionally, we implemented one-sided label smoothing and batch normalization for stable convergence, as discussed in Salimans et al. [24]. In our case, the stochastic layers are zero-centered Gaussian noise, with standard deviation of 0.5 for the input and 0.5 for the outputs of hidden layers. During the training process, for the ETH Food-101 and Indian Food datasets, we used 10% and 50% of the data of each class as unlabeled data, respectively. For comparisons, fine-tuned deep learned models are used from the popular AlexNet [14], GoogLeNet [25] and residual network (ResNet) [26]; for all the methods, procedures are used as reported in [16]. We also compared the new method (SSGAN) with the recently proposed methods of Bossard et al. [23], Kawano et al. [27] and Martinel et al. [9], and the ensemble of networks (Ensemble Net) in [16].
A. Results on ETH Food-101 Dataset
ETH Food-101 [23] is a database consisting of 1000 images per food class for 101 classes of the most popular food categories, picked up from foodspotting.com. The top 101 most popular and consistently named dishes were chosen and 750 training images were extracted per class. An additional 250 test images were collected per class. While the collected test images were manually cleaned, the training images were deliberately not cleaned, to retain some amount of noise. The idea is to increase the robustness of any classifier trained on the dataset. For our experiments, we follow the same training and testing protocols as discussed in [9], [23]. Fig. 2(a) shows the accuracy vs rank plots up to the top 10 ranks, where the accuracy at rank $r \in \{1, 2, \ldots, 10\}$ denotes the probability of retrieving at least one correct image among the top r retrieved images. This cumulative matching curve (CMC) shows the overall performance of the proposed approach as the number of retrieved images changes. Table 1 shows the Top-1, Top-5 and Top-10 accuracies using current state-of-the-art methodologies on this dataset.

TABLE 1. Accuracy (%) for ETH Food-101 & comparison with other methods.
Network/Features       Top-1   Top-5   Top-10
AlexNet                42.42   69.46   80.26
GoogLeNet              53.96   80.11   88.04
Bossard et al. [23]    50.76   -       -
Kawano et al. [27]     53.50   81.60   89.70
Martinel et al. [9]    55.89   80.25   89.10
ResNet [26]            67.59   88.76   93.79
Ensemble Net [16]      72.12   91.61   95.95
SSGAN                  75.34   93.31   96.43

TABLE 2. Accuracy (%) for Indian Food Database & comparison with other methods.
Network/Features       Top-1   Top-5   Top-10
AlexNet                60.40   90.50   96.20
GoogLeNet              70.70   93.40   97.60
ResNet [26]            43.90   80.60   91.50
Ensemble Net [16]      73.50   94.40   97.60
SSGAN                  85.30   95.60   98.30

From Fig. 2(a) and Table 1, it is evident that our proposed semi-supervised GAN (SSGAN) has consistently outperformed all the current state-of-the-art methodologies on this large real-world food dataset.
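The rank-based accuracies used in this section can be computed from per-class scores as sketched below. This is a minimal sketch in plain Python, not the evaluation code used for the reported numbers. It assumes that "at least one correct image among the top r" corresponds, for a classifier, to the true class appearing among the r highest-scoring classes; the function names and the toy scores in the usage note are hypothetical.

```python
def top_r_accuracy(scores, labels, r):
    """Fraction of samples whose true label is among the r highest-scoring classes.

    scores: list of per-sample lists, one score per class (higher is better).
    labels: list of true class indices (0-based), same length as scores.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:r]:
            hits += 1
    return hits / len(labels)


def cmc_curve(scores, labels, max_rank):
    """Top-r accuracy for r = 1..max_rank (cumulative matching curve)."""
    return [top_r_accuracy(scores, labels, r) for r in range(1, max_rank + 1)]
```

With three toy samples scored as `[[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]` and true labels `[1, 1, 0]`, the true class is ranked first, second and third respectively, so `cmc_curve(scores, labels, 3)` rises from one third to two thirds to 1.0. The curve is non-decreasing in r by construction, and its values at r = 1, 5 and 10 are the Top-1, Top-5 and Top-10 accuracies.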
Fig. 2. Accuracy vs rank plots using various existing CNN frameworks and SSGAN, (a) for ETH Food 101 Database and (b) for Indian Food Database.
B. Results on Indian Food Dataset
The Indian food dataset was newly introduced in [16] as an extensive collection of Indian food images comprising 50 food classes with 100 images in each class. The classes were selected keeping in mind the extensive nature of Indian cuisine. Indian food differs in terms of shape, size, color and texture and is devoid of any generalized make-up. Because of the varied nature of the classes present in the dataset, it offers the best option to test a protocol and classifier for its robustness and accuracy. It consists of images from online sources like Food Spotting and Google, in addition to images captured using hand-held mobile devices. Similar to the ETH Food-101 database protocol, and following the same protocol as that in [16], we have set aside 80 food images per class for training and the rest of the images are used for testing. Fig. 2(b) shows the accuracy vs rank plot up to the top 10 ranks and Table 2 shows the Top-1, Top-5 and Top-10 accuracies using the current state-of-the-art methodologies on this dataset. Both of these show that our proposed SSGAN model performs better at recognizing food images in comparison to other CNN frameworks for this dataset. Overall, it is evident from Figs. 2(a) and 2(b) and Tables 1 and 2 that our proposed approach (SSGAN) outperforms the other methods consistently for all ranks on both these datasets.
IV. CONCLUSION AND FUTURE WORK
Food recognition is a very challenging task due to the presence of high intra-class variation in food appearance. This is due to the differences in method of preparation, ingredients used, various shapes, viewpoints and other factors. A classifier working well on one kind of cuisine may not give as good results on another type of cuisine. In this work, we have proposed a semi-supervised GAN approach based on a deep convolutional neural network architecture to alleviate the shortcomings posed by the lack of labeled images and also the classical image recognition problems in food datasets. We have performed experiments on the largest real-world food image dataset, ETH Food-101, and on the Indian Food dataset with partially labeled data. Experimental results show that the generative semi-supervised deep CNN approach proposed in this work outperforms the current state-of-the-art methodologies consistently for all the ranks for both the datasets, even with partially labeled data. While GANs have the potential to improve food recognition accuracy with partially labeled data, it is difficult to achieve stability and convergence during training. In future, we will try to improve the recognition accuracy with a better and more robust GAN architecture that could further reduce the usage of labeled training data.
REFERENCES
[1] World Health Organization, “Global report on diabetes,” 2016. [Online]. Available: http://apps.who.int/iris/bitstream/10665/204871/1/9789241565257_eng.pdf
[2] M. Farooq and E. Sazonov, “Accelerometer-based detection of food intake in free-living individuals,” IEEE Sensors Journal, vol. 18, no. 9, pp. 3752–3758, May 2018.
[3] E. S. Sazonov and J. M. Fontana, “A sensor system for automatic detection of food intake through non-invasive monitoring of chewing,” IEEE Sensors Journal, vol. 12, no. 5, pp. 1340–1348, May 2012.
[4] MyFitnessPal, “Myfitnesspal, inc.” 2018. [Online]. Available: https://www.myfitnesspal.com/
[5] FitNow, “Fitnow, inc.” 2018. [Online]. Available: https://www.loseit.com/
[6] D. Schoeller, L. Bandini, and W. Dietz, “Inaccuracies in self reported intake identified by comparison with the doubly labelled water method,” Can. J. Physiol. Pharm., 1990.
[7] W. Zhang, Q. Yu, B. Siddiquie, A. Divakaran, and H. Sawhney, “Snap-n-eat: Food recognition and nutrition estimation on a smartphone,” J. Diabetes Science and Technology, vol. 9, no. 3, pp. 525–533, 2015.
[8] Y. Bi, M. Lv, C. Song, W. Xu, N. Guan, and W. Yi, “Autodietary: A wearable acoustic sensor system for food intake recognition in daily life,” IEEE Sensors Journal, vol. 16, no. 3, pp. 806–816, 2016.
[9] N. Martinel, C. Piciarelli, and C. Micheloni, “A supervised extreme learning committee for food recognition,” Computer Vision and Image Understanding, vol. 148, pp. 67–86, 2016.
[10] M. Yuji and K. Yanai, “Multiple-food recognition considering co-occurrence employing manifold ranking,” in Proceedings of the 21st International Conference on Pattern Recognition, Nov 2012.
[11] K. Yanai and Y. Kawano, “Food image recognition using deep convolutional network with pre-training and fine-tuning,” in 2015 IEEE International Conference on Multimedia Expo Workshops (ICMEW), June 2015, pp. 1–6.
[12] J. Chen and C.-w. Ngo, “Deep-based ingredient recognition for cooking recipe retrieval,” in Proceedings of the ACM on Multimedia Conference, 2016, pp. 32–41.
[13] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition using statistics of pairwise local features,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Jun 2010.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
[15] W. Liu, Z. Wang, X. Liu, et al., “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, 2017.
[16] P. Pandey, A. Deepthi, B. Mandal, and N. B. Puhan, “FoodNet: Recognizing foods using ensemble of deep networks,” IEEE Signal Processing Letters, vol. 24, no. 12, pp. 1758–1762, 2017.
[17] H. Salehinejad, J. Baarbe, S. Sankar, J. Barfett, E. Colak, and S. Valaee, “Recent advances in recurrent neural networks,” CoRR, vol. abs/1801.01078, 2018.
[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, 2014, pp. 2672–2680.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[20] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
[21] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, 2015, pp. 1486–1494.
[22] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” CoRR, vol. abs/1511.06434, 2015. [Online]. Available: http://arxiv.org/abs/1511.06434
[23] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – mining discriminative components with random forests,” in European Conference on Computer Vision, 2014, pp. 446–461.
[24] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” CoRR, vol. abs/1606.03498, 2016. [Online]. Available: http://arxiv.org/abs/1606.03498
[25] C. Szegedy et al., “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR. IEEE, 2016, pp. 770–778.
[27] Y. Kawano and K. Yanai, “Real-time mobile food recognition system,” in Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2013.