Emotion Report
Abstract— A user's emotion or mood can be detected from his/her facial expressions. These expressions can be derived from the live feed via the system's camera. A lot of research is being conducted in the field of Computer Vision and Machine Learning (ML), where machines are trained to identify various human emotions or moods. Machine Learning provides various techniques through which human emotions can be detected. One such technique is to use the MobileNet model with Keras, which generates a small trained model and makes Android-ML integration easier.

Music is a great connector. It unites us across markets, ages, backgrounds, languages, preferences, political leanings and income levels. Music players and other streaming apps are in high demand, as these apps can be used anytime, anywhere and can be combined with daily activities such as travelling and sports. With the rapid development of mobile networks and digital multimedia technologies, digital music has become the mainstream consumer content sought by many young people.

People often use music as a means of mood regulation, specifically to change a bad mood, increase their energy level or reduce tension. Also, listening to the right kind of music at the right time may improve mental health. Thus, human emotions have a strong relationship with music.

In our proposed system, a mood-based music player is created which performs real-time mood detection and suggests songs as per the detected mood. This becomes an additional feature to the traditional music player apps that come pre-installed on our mobile phones. An important benefit of incorporating mood detection is customer satisfaction. The objective of this system is to analyse the user's image, predict the expression of the user and suggest songs suitable to the detected mood.

Keywords—Face Recognition, Image Processing, Computer Vision, Emotion Detection, Music, Mood Detection
I. INTRODUCTION
Human emotions can be broadly classified as fear, disgust, anger, surprise, sadness, happiness and neutral. A large number of other emotions, such as cheerfulness (a variation of happiness) and contempt (a variation of disgust), can be categorized under this umbrella of emotions. These emotions are very subtle. Facial muscle contortions are minimal, and detecting the differences can be very challenging, as even a small difference results in different expressions. Also, the expressions of different people, or even of the same person, may vary for the same emotion, as emotions are hugely context dependent. While the focus can be restricted to the areas of the face that display the most emotion, such as the regions around the mouth and eyes, how these gestures are extracted and categorized is still an important question. Neural networks and machine learning have been used for these tasks and have obtained good results. Machine learning algorithms have proven to be very useful in pattern recognition and classification, and hence can be used for mood detection as well.

With the development of digital music technology, a personalized music recommendation system which recommends music to users has become essential. It is a big challenge to provide recommendations from the large amount of data available on the internet. E-commerce giants like Amazon and eBay provide personalized recommendations to users based on their taste and history, while companies like Spotify and Pandora use Machine Learning and Deep Learning techniques to provide appropriate recommendations.

There has been some work done on personalized music recommendation to recommend songs based on the user's preference. There exist two major approaches to personalized music recommendation. One is the content-based filtering approach, which analyses the content of the music that users liked in the past and recommends music with relevant content. The main drawback of this approach is that the model can only make recommendations based on the existing interests of the user; in other words, the model has limited ability to expand on those interests. The other approach is collaborative filtering, which recommends music that a peer group of similar preference liked. Both recommendation approaches are based on the user's preferences observed from listening behaviour. The major drawback of the collaborative approach is the popularity bias problem: popular (i.e., frequently rated) items get a lot of exposure, while less popular ones are under-represented in the recommendations. Generally, a hybrid approach is implemented in which both content-based and collaborative techniques are combined to achieve maximum accuracy and to overcome the drawbacks of both types. [1]
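To make the distinction concrete, the toy sketch below blends a content-based similarity score with a collaborative (peer-preference) score into one ranking. All song names, scores and the weighting are invented purely for illustration and are not part of the proposed system or of any cited work.

```python
# Toy hybrid recommender: blend content-based and collaborative scores.
# All values here are placeholders chosen only to illustrate the idea.
content_score = {"song_a": 0.9, "song_b": 0.4, "song_c": 0.7}  # similarity to songs the user liked
collab_score = {"song_a": 0.2, "song_b": 0.8, "song_c": 0.6}   # preference among similar listeners

def hybrid_rank(alpha=0.5):
    """Weighted blend; alpha trades off content-based vs. collaborative evidence."""
    combined = {song: alpha * content_score[song] + (1 - alpha) * collab_score[song]
                for song in content_score}
    return sorted(combined, key=combined.get, reverse=True)

print(hybrid_rank(alpha=0.6))  # songs ranked by the blended score
```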
In this work, the aim is to create a music recommendation system/music player which will detect the user's face, identify the current mood and then recommend a playlist based on the detected mood.
II. RELATED WORK

A. Literature Survey

In a particular system [8], Anaconda and Python 3.5 were used to test the functionality, and the Viola-Jones and Haar cascade algorithms were used for face detection. Similarly, the KDEF (Karolinska Directed Emotional Faces) dataset and VGG (Visual Geometry Group) 16 were used with a CNN (Convolutional Neural Network) model, which achieved an accuracy of 88% for face recognition and classification and validated the performance measures. The results showed that the network architecture designed was an advancement over existing algorithms. Another system [9] used Python 2.7, the Open-Source Computer Vision Library (OpenCV) and the CK (Cohn-Kanade) and CK+ (Extended Cohn-Kanade) databases, which gave approximately 83% accuracy. Certain researchers have described the Extended Cohn-Kanade (CK+) database for those wanting to prototype and benchmark systems for automatic facial expression detection. Given the popularity of and ease of access to the original Cohn-Kanade dataset, this is seen as a very valuable addition to the already existing corpora. It was also stated that for a fully automatic system to be robust for all expressions in a myriad of realistic scenarios, more data is required. For this to occur, very large, reliably coded datasets across a wide array of visual variabilities are required (at least 5 to 10k examples for each action), which would require a collaborative research effort from various institutions.

It was observed in a cross-database experiment [1] that raw features worked best with Logistic Regression when testing on the RaFD (Radboud Faces Database) database and a mobile images dataset; the accuracies achieved were 66% and 36% respectively, using the CK+ dataset as the training set. The additional features (distance and area) reduced the accuracy of the experiment for SVM (Support Vector Machine) from 89%. The algorithm that had been implemented generalized from the training set to the testing set better than SVM and several other algorithms. An average accuracy of 86% was seen for the RaFD database and 87% for the CK+ database with 5-fold cross-validation. The main focus was feature extraction and analysis of the machine learning algorithm on the dataset, but accurate face-detection algorithms become very important if there are multiple people in the image. One of the works [10] was tested by deriving the expression from the live feed via the system's camera or from any pre-existing image available in memory. It was implemented using Python 2.7, OpenCV and NumPy. The objective was to develop a system that can analyse the image and predict the expression of the person. The study proved that this procedure is workable and produces valid results.

There has also been research on music recommendation systems. In one such work [11], a preliminary approach to Hindi music mood classification is described that exploits simple features extracted from the audio. The MIREX (Music Information Retrieval Evaluation eXchange) mood taxonomy gave an average accuracy of 51.56% using 10-fold cross-validation. In addition to this, there is an article [10] that reviews current music recommendation research from the perspective of music resources description. It suggests that there is a lack of systematic research on user behaviour and needs, a low level of feature extraction, and a single evaluation index in current research. Situation was identified to be an important factor in personalized music recommendation systems. Finally, it was concluded that giving the same weight to all contextual factors greatly reduced the accuracy of the recommendation results.

Another research work [12] states that their hybrid recommendation system approach will work once their model is trained enough to recognize the labels. Their mechanism for the automatic management of user preferences in a personalized music recommendation service automatically extracts the user preference data from the user's brain waves and audio features from the music. In their study, a very short feature vector, obtained from low-dimensional projection and already developed audio features, is used for music genre classification. A distance metric learning algorithm was applied in order to reduce the dimensionality of the feature vector with little performance degradation. The proposed user preference classifier achieved an overall accuracy of 81.07% in binary preference classification on the KETI AFA2000 music corpus, and user satisfaction was recognizable when brainwaves were used.

B. Existing Systems

• EMO Player: Emo Player (an emotion-based music player) is a novel approach that helps the user to automatically play songs based on the user's emotions. [2]

• SoundTree: Sound Tree is a music recommendation system which can be integrated into an external web application and deployed as a web service. It uses people-to-people correlation based on the user's past behaviour, such as previously listened to and downloaded songs. [3]

• lucyd: lucyd is a music recommendation tool developed by four graduate students in UC Berkeley's Master of Information and Data Science (MIDS) program. lucyd lets the user ask for music recommendations using whichever terms they want. [4]

• Reel Time.AI: This system works by having the user subscribe to the service. The user can then upload images of large gatherings, such as at shopping malls, movie theatres and restaurants. The system identifies the moods happy and sad: it recognizes which faces portray a happy emotion and which portray a sad emotion, and gives a verdict on the situation from the faces of the people present.

• Music.AI: It uses a list of moods as input for the mood of the user and suggests songs based on the selected mood. It is a combination of collaborative filtering and content-based filtering models. Emotion, time, ambience and learning history are the features taken into account for music recommendation. [5]
C. Existing Algorithms/Tools

• Deep Learning based Facial Expression Recognition using Keras: Using this algorithm, up to five distinct facial emotions can be detected in real time. It runs on top of a Convolutional Neural Network (CNN) built with Keras, whose backend is TensorFlow in Python. The facial emotions that can be detected and classified by this system are Happy, Sad, Anger, Surprise and Neutral. OpenCV is used for the image processing tasks: a face is identified from a live webcam feed, then processed and fed into the trained neural network for emotion detection (a minimal sketch of such a pipeline follows this list). Deep learning based facial expression recognition techniques greatly reduce the dependency on face-physics-based models and other pre-processing techniques by enabling end-to-end learning to occur in the pipeline directly from the input images. [14]

• Hybrid approach of Music Recommendation: There are several drawbacks to relying solely on collaborative filtering to recommend music. The biggest problem is the "cold start": music tracks are only tagged as often as listeners discover or listen to them. In other words, there are few or no available tags to describe new music or music that has not been discovered yet. Additionally, listeners are more willing to supply tags for songs they enjoy most than for songs they mildly enjoy or do not enjoy at all. [13]

• Viola-Jones object detection framework: The Viola-Jones algorithm is a widely used mechanism for object detection. The main property of this algorithm is that training is slow, but detection is fast. The algorithm uses Haar basis feature filters, so it does not use multiplications. The efficiency of the Viola-Jones algorithm can be significantly increased by first generating the integral image. [15]
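As a rough illustration of the webcam pipeline referred to above, the following Python sketch detects a face with OpenCV's Haar cascade (a Viola-Jones detector) and classifies the cropped face with a Keras CNN. The model file name, the five-class label order and the 48x48 input size are assumptions for illustration, not details taken from the cited works.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

LABELS = ["angry", "happy", "neutral", "sad", "surprise"]  # assumed label order
model = load_model("emotion_cnn.h5")                       # assumed pre-trained Keras CNN

# Viola-Jones (Haar cascade) frontal-face detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)          # live feed from the system's camera
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
        # Crop the detected face, resize to the CNN's input size and normalize.
        face = cv2.resize(frame[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(face[np.newaxis, ...])[0]
        print("Detected mood:", LABELS[int(np.argmax(probs))])
cap.release()
```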
III. METHODOLOGY

The mood-based music recommendation system is an application that focuses on implementing real-time mood detection. It is a prototype of a new product that comprises two main modules: facial expression recognition/mood detection and music recommendation.

A. Mood Detection Module

This module is divided into two parts:

• Face Detection — The ability to detect the location of a face in any input image or frame. The output is the bounding box coordinates of the detected faces. For this task, the Python library OpenCV was initially considered, but integrating it with an Android app was a complex task, so the FaceDetector class available in Java was used instead. This library identifies the faces of people in a Bitmap graphic object and returns the number of faces present in a given image.

• Mood Detection — Classification of the emotion on the face as happy, angry, sad, neutral, surprise, fear or disgust. For this task, the traditional Keras module of Python was tried first but, in the survey, it was found that this approach takes a lot of time to train and validate and also works slowly when integrated with Android apps. So MobileNet, a CNN architecture model for image classification and mobile vision, was used. There are other models as well, but what makes MobileNet special is that it requires very little computational power to run or to apply transfer learning to. This makes it a perfect fit for mobile devices, embedded systems and computers without a GPU or with low computational capacity, without compromising the accuracy of the results. It uses depthwise separable convolutions to build lightweight deep neural networks. The dataset used for training was obtained by combining the FER 2013 dataset [6] and the MMA Facial Expression Recognition dataset [7] from Kaggle. The FER 2013 dataset contains grayscale images of size 48x48 pixels, while the MMA Facial Expression Recognition dataset has images of varying specifications. Thus, all images were converted to match the FER 2013 format and combined to obtain an even larger dataset with 40,045 training images and 11,924 testing images. MobileNet was used with Keras to train and test our model for seven classes - happy, angry, neutral, sad, surprise, fear and disgust. We trained it for 25 epochs and achieved an accuracy of approximately 75% (a training sketch follows this list).
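The sketch below outlines this setup under stated assumptions: the combined dataset is arranged as one folder per emotion under data/train and data/test, the MMA images are first converted to the FER 2013 format (48x48 grayscale), grayscale images are loaded as three-channel inputs, and the MobileNet base is trained from scratch (the paper does not state whether pretrained weights were used). Directory names, file names and hyperparameters other than the seven classes and the 25 epochs are illustrative.

```python
from pathlib import Path
from PIL import Image
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 1) Convert MMA images to the FER 2013 format: 48x48 grayscale,
#    keeping the per-emotion folder structure (paths are assumed).
src, dst = Path("mma_raw"), Path("data/train")
for img_path in src.rglob("*.jpg"):
    img = Image.open(img_path).convert("L").resize((48, 48))
    out = dst / img_path.parent.name / img_path.name
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)

# 2) Train MobileNet with Keras on the seven emotion classes.
IMG_SIZE = (48, 48)   # FER 2013 image size
NUM_CLASSES = 7       # happy, angry, neutral, sad, surprise, fear, disgust

train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=IMG_SIZE, color_mode="rgb",
    class_mode="categorical", batch_size=64)
test_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/test", target_size=IMG_SIZE, color_mode="rgb",
    class_mode="categorical", batch_size=64)

# MobileNet base built on depthwise separable convolutions; weights=None
# because the 48x48 input does not match the standard ImageNet sizes.
base = MobileNet(input_shape=IMG_SIZE + (3,), include_top=False,
                 weights=None, pooling="avg")
model = models.Sequential([
    base,
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(train_gen, validation_data=test_gen, epochs=25)  # 25 epochs, as reported
model.save("mood_mobilenet.h5")  # saved as .h5 for later TFLite conversion
```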
B. Music Recommendation Module

The dataset of songs classified by mood was found on Kaggle for two different languages - Hindi and English. Research was conducted into a good cloud storage platform to store, retrieve and query this song data as per the user's request. Options like AWS and Google Cloud were found, but these were rejected as they are costly and provide very limited storage for free. Open-source streaming services like Restream.io and Ampache were then investigated but, again, these services were web based, intended for live streaming on YouTube, or available only for personal use. After a lot of research (and given time constraints), Firebase was chosen as the backend server. It can be integrated with an Android app in one click, and its free plan provides 5 GB of storage. However, functions like user queries, server updates, etc. are part of a paid plan, so it was decided to limit the scope of the project.

The mp3 versions of the songs were manually uploaded to Firebase Storage and were linked in the Realtime Database by mood and language (for filters).
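One possible way to express that mood/language keying is sketched below with the official firebase_admin Python SDK. The credentials file, project URL and song entry are placeholders, and in this project the songs were actually uploaded and linked manually through the Firebase console; the sketch only shows an assumed database layout the app could filter on.

```python
import firebase_admin
from firebase_admin import credentials, db

cred = firebase_admin.credentials.Certificate("serviceAccount.json")  # placeholder credentials
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://example-project.firebaseio.com"            # placeholder project URL
})

# Assumed layout: songs/<mood>/<language> -> entries pointing at Storage mp3 files.
db.reference("songs/happy/english").push({
    "title": "placeholder_song",
    "url": "gs://example-project.appspot.com/happy/placeholder_song.mp3",
})

# The app can then filter by detected mood and selected language.
happy_english = db.reference("songs/happy/english").get()
print(happy_english)
```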
C. Integration

For the integration of these two modules in an Android application, the trained MobileNet model was saved as an .h5 file, and this .h5 file was then converted to a .tflite file using the TensorFlow Lite Converter, which takes a TensorFlow model as input and generates a TensorFlow Lite model with the .tflite extension as output. Since the MobileNet model is used, the size of the .tflite file is expected to be around 20-25 megabytes (MB), which was the desired size. In Android Studio, an assets folder was created to store the .tflite file and a labels.txt file containing the class labels of the model. All the appropriate methods were created for loading the model, running the interpreter and obtaining the results.
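A minimal sketch of that conversion step, plus a quick desktop sanity check with the TFLite interpreter, is given below. The file names follow the earlier training sketch and are assumptions; the on-device loading and inference described above are done through the TensorFlow Lite Android API rather than this Python code.

```python
import numpy as np
import tensorflow as tf

# Convert the trained Keras model (.h5) to a TensorFlow Lite model (.tflite).
model = tf.keras.models.load_model("mood_mobilenet.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("mood_mobilenet.tflite", "wb") as f:
    f.write(tflite_model)

# Sanity-check the converted model on a dummy 48x48 RGB input.
interpreter = tf.lite.Interpreter(model_path="mood_mobilenet.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
dummy = np.random.rand(1, 48, 48, 3).astype(np.float32)
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)  # (1, 7): one probability per mood class
```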
A project was created on Firebase and the mp3 songs were uploaded in the Storage section. These songs were then organized by mood and language in the Realtime Database section. After this, the Firebase database was linked to Android Studio. An appropriate UI for the Android application was created, and the tflite model methods were linked with the songs on Firebase. Finally, the application was tested to fix any bugs.
Fig.4. “Angry” mood detected successfully by the application.
Fig.7. “Surprise” mood detected successfully by the application.
If the application is unable to capture a proper picture of the user's face, and is therefore unable to detect the proper mood, or for any other reason, the user can click on the “Use Emoji” button and select the emoji which represents the mood they are in, or the mood that they want their playlist to be generated for. Fig.10. is the screenshot of the screen that is displayed to the user when they click the “Use Emoji” button. The first is the emoji for the mood “happy”, the second for the mood “angry”, the third for the mood “surprise”, the fourth for the mood “sad”, and the fifth for the mood “neutral”.