Transfer Learning Approach Based on MobileNet Architecture For Human Smile Detection
Abstract. The face is an important part of the human body. It can express many of the things a person feels: by looking at facial expressions, humans can determine whether someone is angry, happy, or sad. This is a basic part of communication. However, many people are blind or visually impaired and therefore cannot recognize facial expressions, especially when talking to each other. For this reason, this research detects a basic expression of the human face, namely a smile, which indicates happiness. A Deep Learning approach is used to determine human facial expressions, and several architectural scenarios are evaluated, such as ResNet and MobileNet. MobileNet achieves the highest accuracy, around 92%, which indicates that MobileNet can be used to detect facial expressions, especially smiles.
1 Introduction
Vision impairment, a reduced or complete inability to see, is a problem that continues to grow. Globally, more than 1 billion people live with vision impairment, and around 55% of them are women. Vision impairment is caused by several diseases such as cataract, glaucoma, corneal opacities, diabetic retinopathy, and many more.
People with vision impairment experience the world differently from sighted people. In particular, they have difficulty seeing and understanding facial expressions, so it is hard for them to read the expression of their interlocutor, even though facial expression plays an important role in communication [1].
Referring to the problem explained above, and given the importance of facial expression in communication, a solution is needed that recognizes human facial expressions to help people with vision impairment better understand the expressions of their interlocutors. Several studies have already addressed facial expression detection. Lee et al. [2] proposed automatic smile detection using an earable device and reported an average accuracy of around 70%. However, their approach still requires an earable device to be worn in the ear, which is impractical.
Based on this situation, this research focuses on how to detect and recognize a smile from an image. The image was chosen because of its simplicity and convenience. This research is the initial step toward developing advanced wearable eyeglasses to assist people with vision impairment, called "icsy (i can see you)".
Several studies have focused on detecting and recognizing smiles from images, such as the research by An et al. [3]. That work utilized an Extreme Learning Machine and reached an accuracy of around 76%; however, its performance, especially its accuracy, still needs improvement.
Another study on smile detection was done by Tang et al. [4]. It utilized a convolutional neural network (CNN) and produced an accuracy of around 86%. It is very interesting to pay attention to how they tweaked the CNN architecture to obtain this accuracy; the preprocessing of the data was very influential. Their method divides the face image into 2 areas of similar size: the first area is the upper part of the face, including the eyes and part of the nose, and the second contains the mouth and the remaining part of the nose. Each image area is processed by its own CNN architecture, and the results of the two CNNs are then combined for further processing. However, this process is not simple, whereas the goal of our research is to find an approach that detects and recognizes facial expressions, especially smiles, as simply as possible to save processing and memory, given that the computation will run on a smartphone.
Considering the problems explained above, this research focuses on detecting a smile using a convolutional neural network (CNN) that is as simple and modest as possible. The rest of this paper is divided into several sections: method, result and analysis, and conclusion.
2 Method
This section explains the methodology of our research. The research is divided into several processing steps that follow the Convolutional Neural Network (CNN) pipeline in general. The detailed process is shown in Fig. 1.
In this research, the image is the main object to be processed. Much research has focused on how to detect the face, such as the work of Paggio et al. [5], Ganakwar [6], Khrisna et al. [7], Chen J. [8], and many more. Face detection is important because it makes the system focus its processing on the face area only. In this research, however, face images were obtained from an open dataset.
The Convolutional Neural Network (CNN) has become a popular topic in recent years. A CNN has its own unique arrangement of steps, called an architecture. The architecture of a CNN consists of several stages arranged in such a way as to reach the goal [9].
1) Convolutional Process
The convolutional process plays a vital role in a CNN. This layer is built around a learnable kernel. Generally, the kernel used in this process has a small size such as 3×3, 5×5, or 7×7. The convolutional process itself is expressed in (1).
$$y[i,j] = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} h[m,n] \cdot x[i-m, j-n] \tag{1}$$
Here, x is the input image matrix to be convolved, h is the kernel matrix, y is the new matrix produced by the convolution, m and n are the indices running over the kernel h, and i and j are the indices of the output matrix. After the matrix y is obtained, the next step is to apply the Pooling mechanism explained below.
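As a minimal sketch of Eq. (1), the NumPy snippet below convolves a finite grayscale image with a small kernel (in practice the infinite sums in (1) reduce to the kernel extent). The function and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def conv2d(x, h):
    """Valid 2D convolution of image x with kernel h, following Eq. (1)."""
    kh, kw = h.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    h_flipped = h[::-1, ::-1]  # true convolution flips the kernel
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * h_flipped)
    return y

image = np.random.rand(28, 28)       # toy grayscale image
kernel = np.ones((3, 3)) / 9.0       # 3x3 averaging kernel
feature_map = conv2d(image, kernel)  # shape (26, 26)
```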
2) Pooling
The Pooling layer is commonly placed after the Convolutional layer. Its purpose is to reduce the size of the image or feature matrix, and it is a key step in a CNN system. There are several Pooling methods such as Average Pooling, Max Pooling, Mixed Pooling, Stochastic Pooling, and many more [10]. The Max-Pooling operation is shown in (2).

$$P_{i,j} = \max_{0 \le m,n < R} h_{i \cdot N + m,\; j \cdot N + n} \tag{2}$$

Here, h is the input matrix, P is the pooled result, and N is the pooling shift, which allows overlapping between pooling regions when N < R, where R is the size of the pooling region (the group of activations being pooled).
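A minimal sketch of the Max-Pooling in Eq. (2), assuming the reconstruction above: window size R and pooling shift N, with overlapping regions when N < R. The names are illustrative.

```python
import numpy as np

def max_pool2d(h, R=2, N=2):
    """Max pooling per Eq. (2): window size R, pooling shift N.
    Choosing N < R makes the pooling regions overlap."""
    out_h = (h.shape[0] - R) // N + 1
    out_w = (h.shape[1] - R) // N + 1
    P = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            P[i, j] = np.max(h[i * N:i * N + R, j * N:j * N + R])
    return P

feature_map = np.random.rand(26, 26)
pooled = max_pool2d(feature_map, R=2, N=2)  # shape (13, 13)
```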
The Pooling and Convolutional stages can be stacked in various orders; combinations of Pooling and Convolutional layers are built and combined to obtain a robust architecture [11].
The final stage of a CNN architecture is the Fully Connected layer, which plays an influential role in CNN systems [12]: it connects the Convolutional and Pooling layers to the Neural Network layers. In this research, we adopt a pre-trained CNN architecture, MobileNet [13]. A pre-trained architecture was chosen because of its ability to detect objects, since it was formed and trained with a huge amount of data. In the Fully Connected stage, the flatten process transforms the Pooling or Convolution matrix (depending on the structure and composition) into an array that serves as the neuron input. The Flatten process is illustrated in Fig. 2.
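As a hedged illustration of the Flatten step, the Keras sketch below unrolls a stack of pooled feature maps into a one-dimensional array feeding Dense layers; the feature-map shape and layer sizes are assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

# Hypothetical head: pooled feature maps are flattened into a 1-D vector
# that feeds the fully connected (Dense) layers.
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13, 13, 32)),       # assumed pooled feature maps
    tf.keras.layers.Flatten(),                       # 13*13*32 -> 5408-element array
    tf.keras.layers.Dense(128, activation="relu"),   # assumed hidden size
    tf.keras.layers.Dense(2, activation="softmax"),  # smile / not smile
])
head.summary()
```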
2.4 Activation
The activation is one of the most important parts of a CNN. The aim of the activation function is to make the network (the fully connected layer) able to learn complex patterns in the data. Theoretically, activation is needed to avoid reducing the network to a linear classifier. There are several activation approaches such as sigmoid, softmax, ReLU, Tanh, and others [14]. This research focuses on classification into 2 labels, smile and not smile. Therefore, there are 2 activation functions that could be applied: sigmoid (3) and softmax (5), explained below.
$$S(x) = \frac{1}{1+e^{-x}} \tag{3}$$

Here, S(x) is the sigmoid function, x is the input value, and e is Euler's number, computed using (4).

$$e = \sum_{n=0}^{\infty} \frac{1}{n!} \tag{4}$$

$$\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \tag{5}$$
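A small sketch of the two activation functions, implementing Eqs. (3) and (5) directly in NumPy (the max subtraction in softmax is a standard numerical-stability trick, not part of Eq. (5)):

```python
import numpy as np

def sigmoid(x):
    """Eq. (3): squashes a score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Eq. (5): turns K scores into a probability distribution."""
    e_z = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e_z / e_z.sum()

print(sigmoid(0.0))                   # 0.5
print(softmax(np.array([2.0, 1.0])))  # approx. [0.73, 0.27]
```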
2.5 Dataset
This research utilizes the FER-2013 dataset, which contains 28,709 examples across 7 categories: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral facial expressions [15]. The categories used in this research are divided into 2 groups: the first contains Neutral, Angry, Sad, Fear, and Disgust; the second contains Happy and Surprise. In total, we obtain 2 classes: Smile (Happy and Surprise) and Not Smile (Neutral, Angry, Sad, Fear, and Disgust).
We apply the training and validation processes to all 28,709 examples across the two classes: 80% of the data, around 22,967 examples, for training and 20%, around 5,741 examples, for testing, following the common 80/20 split [16]. A sample of the FER-2013 dataset is shown in Fig. 3.
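A hedged sketch of the regrouping and 80/20 split described above; the label ordering follows the standard FER-2013 convention, and the placeholder arrays stand in for the real images:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Standard FER-2013 label order: 0=Angry, 1=Disgust, 2=Fear,
# 3=Happy, 4=Sad, 5=Surprise, 6=Neutral
SMILE_LABELS = [3, 5]  # Happy and Surprise form the Smile class

# Placeholder arrays standing in for the 28,709 48x48 FER-2013 images.
images = np.random.rand(28709, 48, 48)
labels = np.random.randint(0, 7, size=28709)

binary = np.isin(labels, SMILE_LABELS).astype(int)  # 1 = Smile, 0 = Not Smile

x_train, x_test, y_train, y_test = train_test_split(
    images, binary, test_size=0.20, random_state=42, stratify=binary)
print(len(x_train), len(x_test))  # about 22,967 / 5,742
```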
3 Result and Analysis
This section is divided into 2 parts. The first part covers the training process; the second part analyzes the accuracy we achieved.
In this research, we train on 80% of the total dataset using a customized MobileNet CNN architecture. We also apply data augmentation to compensate for the size of the dataset and decrease the validation error [17]. For the augmentation, we apply random flipping and random rotation to every image in the dataset, as shown in Fig. 4. We also convert the dataset to grayscale, considering that smiling and not smiling are not influenced by skin color [18]. In this training stage we apply softmax activation and 50 epochs. The result of training using the customized MobileNet architecture is shown in Fig. 5.
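A hedged Keras sketch of this training setup: random flip and rotation augmentation, a MobileNet backbone trained from scratch, and a resized Dense head with softmax over the two classes. The exact Dense size, rotation factor, and optimizer are assumptions, not the paper's reported values.

```python
import tensorflow as tf

# Augmentation layers for random flipping and rotation, applied per image.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),  # rotation factor is an assumption
])

# MobileNet trained from scratch (weights=None matches the paper's
# "conventional learning instead of transfer learning" note); grayscale
# frames are assumed replicated to 3 channels to fit MobileNet's input.
base = tf.keras.applications.MobileNet(
    input_shape=(48, 48, 3), include_top=False, weights=None)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(48, 48, 3)),
    augment,
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),   # customized Dense size (assumed)
    tf.keras.layers.Dense(2, activation="softmax"),  # smile / not smile
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=50)
```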
From the training result, the customized architecture achieves an accuracy of 92.4%. Note that in this research we perform conventional learning instead of transfer learning, since pre-trained weights give better performance when applied to the Face Classification problem rather than to Face Expression recognition.
Table 1. Accuracy Comparison

Architecture          Dataset     Avg. Acc.
ResNet50              FER-2013    82.6%
Customized ResNet50   FER-2013    87.3%
Custom MobileNet      FER-2013    92.4%
This research shows that a standard architecture such as ResNet performs well only on the type of data it was trained on before. On a new dataset, however, the standard architecture is surpassed by a modified architecture obtained by only changing the size of the Dense layer, the most important layer connecting to the fully connected stage.
4 Conclusion
The MobileNet architecture produces better accuracy, around 92.4%, than the other architectures, ResNet50 and customized ResNet50. The data used in this research is the FER-2013 open dataset; not all of it is used. From the seven expression classes, we regrouped the data into only 2 classes, smile and not smile. The accuracy produced by the experiment is about 92.4%, using a customized MobileNet obtained by changing the size of the Dense layer.
References
1. Bruce, V.: The role of the face in communication: Implications for videophone design. Interact Comput. 8, 166–176 (1996). https://doi.org/10.1016/0953-5438(96)01026-0.
2. Lee, S., Min, C., Montanari, A., Mathur, A., Chang, Y., Song, J., Kawsar, F.: Automatic Smile and Frown Recognition with Kinetic Earables. In: Proceedings of the 10th Augmented Human International Conference 2019. pp. 1–4. ACM, New York, NY, USA (2019). https://doi.org/10.1145/3311823.3311869.
3. An, L., Yang, S., Bhanu, B.: Efficient smile detection by Extreme Learning Machine. Neurocomputing. 149, 354–363 (2015). https://doi.org/10.1016/j.neucom.2014.04.072.
4. Tang, C., Cui, Z., Zheng, W., Qiu, N., Ke, X., Zong, Y., Yan, S.: Automatic smile detection of infants in mother-infant interaction via CNN-based feature learning. In: ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on