Synopsis KRG
on
EmoRecs: Personalized Activity Recommendation System
Based on Multimodal Emotion Detection Using Facial
Expression and Voice Modulation
Submitted
towards
Pre-registration seminar
For
Doctor of Philosophy
Under the discipline of
Computer Science and Technology
In the faculty of
Science and Technology
of
RASHTRASANT TUKADOJI MAHARAJ NAGPUR UNIVERSITY
Under the Supervision of
Dr. Ujwalla Gawande
Associate Professor
Department of Information Technology, YCCE, Nagpur
At
Research Center
YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING, NAGPUR
Abstract
Suicide due to depression is one of the leading causes of death. Humans have a unique ability
to express and understand emotions through a variety of modes of communication, and a
person's emotions or mood swings can indicate whether they are in a good psychological
state. The most apparent deficiency of today's emotion-capturing systems is their inability to
understand, from facial expressions alone, the emotions of people affected by mental health
disorders or with atypical social-emotional behavior. Emotion recognition can be used in
schools to help students who find it difficult to express their feelings or who have mental
health concerns such as depression, so that teachers or health workers can communicate with
their parents and work through their problems. Workplace technology can likewise identify
employees who are overly stressed and relieve them of their duties [2]. Deep learning for
emotion detection from speech and face has made significant progress in recent years, but
several challenges remain, including limited datasets, variability in data, feature extraction,
and ethical and privacy concerns; these require careful consideration in the design and
deployment of such models. There are many sources of emotion cues, such as facial
expressions, hand gestures, body language, text, and speech [1]. Relying on any single one of
these cues is often insufficient to identify emotions accurately, because human emotions can
change from one second to the next.
Modality refers to a particular mode of recognition. Early research mostly relied on a single
modality for emotion recognition, which does not always give good results because a single
channel carries insufficient information. Multimodal processing improves the overall
recognition process and increases reliability: fusing multiple feature sets and classifiers into
one system produces a comparably more accurate system. This research work combines
speech and facial expressions, as this hybrid mode helps in evaluating emotions more
accurately. The proposed methodology consists of two main parts. The first is a Facial
Expression Recognizer (FER), a technology utilized in fields such as education, gaming,
robotics, and healthcare. The second is Speech Emotion Recognition (SER), which has
numerous uses in industries such as psychology, entertainment, and healthcare and is a
critical component of human-computer interaction. The final result is produced by a hybrid
stage that considers the outputs of both FER and SER, giving a multi-sensory, multimodal
emotion recognition system. Hence, there is a need for an AI/ML system that detects a
person's emotion and provides an appropriate recommendation.
Human communication has two main aspects: verbal (auditory) and nonverbal (visual). In
human-to-human communication of emotion, people use either one or both modalities; a
verbal statement is often reflected as emotion in the facial expression. Visual communication
is particularly effective when auditory speech is degraded by noise, bandwidth limitations,
filtering, or hearing impairment. Facial expression analysis has become a promising research
area with potential applications in human-computer interfaces, talking heads, image
retrieval, and human emotion analysis [1]. Facial expressions communicate emotions and
pain and regulate interpersonal behavior. Anger, disgust, fear, happiness, sadness, and
surprise are considered "basic emotions"; each has a universally recognised facial
expression involving changes in multiple facial regions, which facilitates analysis. Speech is
one of the most efficient media when people communicate with one another, and the field of
speech signal processing has been of great interest to scientists and engineers because of
several exciting applications that make a difference in our day-to-day life.
Distinct brain parts induce different emotions [12]. There are three types of emotional
responses: reactional, hormonal, and automatic [13]. According to psychology, emotions are
responses to stimuli associated with qualitative physiological changes [13]. Two basic
approaches used to study the nature of emotions are the discrete approach and the
multidimensional approach [13].
A. Discrete emotions theory
According to this theory, emotions are distinct and discrete categories, each with its own
ensemble of cognitive, psychological, and behavioral factors. Emotions can be positive or
negative. According to proponents of this hypothesis, there exist a few fundamental emotions
that are generally recognized across cultures. There are six basic emotions, namely
happiness, sadness, anger, surprise, fear, and disgust [14]. Robert Plutchik provided a
comprehensive emotional model called Plutchik's wheel of emotions [15].
Plutchik's wheel consists of eight emotions, namely fear, joy, sadness, trust, anger, surprise,
anticipation, and disgust. Other associated emotions are derived as combinations of these
eight primary emotions at different positional intensities. The intensity of the emotions
increases as we move towards the center of the wheel and decreases towards the edge. Fig. 1
provides an overview of Plutchik's wheel of emotions [15].
B. Multidimensional emotions theory
The 3D emotional space model maps emotions onto continuous dimensions: valence (V,
positive or negative), arousal (A, high or low activation), and dominance (D, feeling in
control or feeling controlled). The 3D emotional space model proposed by Mehrabian and
Russell is shown in Fig. 3 [17].
Fig. 3. 3D VAD emotion model
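As an illustration of how the VAD representation can be used computationally, the following minimal Python sketch places a few discrete emotions at assumed, approximate coordinates in valence-arousal-dominance space and maps a continuous estimate to the nearest label; the numeric values are illustrative assumptions, not values from the cited model.

```python
# A minimal, illustrative sketch of the VAD (valence-arousal-dominance) space.
# The coordinates below are rough, assumed values for illustration only.
from dataclasses import dataclass
import math

@dataclass
class VADPoint:
    valence: float    # negative (-1) to positive (+1)
    arousal: float    # low (-1) to high (+1) activation
    dominance: float  # feeling controlled (-1) to feeling in control (+1)

# Assumed, approximate placements of a few discrete emotions in VAD space.
EMOTION_VAD = {
    "happy":   VADPoint(+0.8, +0.5, +0.4),
    "sad":     VADPoint(-0.6, -0.4, -0.3),
    "angry":   VADPoint(-0.5, +0.7, +0.3),
    "fear":    VADPoint(-0.6, +0.6, -0.6),
    "neutral": VADPoint( 0.0,  0.0,  0.0),
}

def nearest_emotion(v: float, a: float, d: float) -> str:
    """Map a continuous VAD estimate to the closest discrete label."""
    def dist(p: VADPoint) -> float:
        return math.dist((v, a, d), (p.valence, p.arousal, p.dominance))
    return min(EMOTION_VAD, key=lambda k: dist(EMOTION_VAD[k]))

print(nearest_emotion(0.7, 0.4, 0.5))  # -> "happy"
```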
Tools that can assist people in recognizing the emotions of those around them could be very
beneficial in treatment settings as well as in regular social encounters.
Emotion sensing is a technique used to extract human emotions. Over the years, various
methods have been adopted to study human emotions. Physical signals for emotion
recognition include facial expressions, speech, text, gestures, and body postures [5]. Speech
and facial expressions are the most commonly employed mechanisms for emotion
identification among physical signals [5]. As a result, this review is limited to the physical
signals of speech and facial expressions.
Children's health:
The study and analysis of emotions in children can also play a crucial role in health
monitoring. Studies have revealed that emotional development and regulation are crucial in
children. Emotion recognition can help identify students who find it difficult to express their
feelings or who have mental health concerns, such as depression, so that teachers or health
workers can communicate with their parents and work through their problems.
Human-robot interactions:
The rise of AI has boosted the development of human-modeled machines. The applications of
human emotions have attracted researchers to investigate human-machine interfaces and
sentiment analysis. Human-machine interfaces that can infer and understand human emotions
are more successful in human interactions; such models should be able to interpret human
emotions and adapt their behavior appropriately, producing an acceptable reaction to those
sentiments.
Patient assistance:
Emotion can be pivotal in patient monitoring and assistance. Effective analysis of emotion
can help sense and detect loneliness, mood variations, and suicidal cues.
Marketing:
A camera with AI systems in shopping malls can be used to read the real-time emotions of
customers, which may be used for marketing.
Workplace monitoring:
Technology now allows employers to recognize individuals who are overly stressed in the
workplace and relieve them of their duties.
Literature Review
The review is done to gain insight into existing methods and the shortcomings that this work
aims to overcome. A literature review (or literature survey) is a scholarly text that presents
the current understanding of a topic along with its key findings and theoretical and
methodological contributions. The latent qualities of humans that can provide inputs to
systems in various ways have attracted the attention of learners, scientists, and engineers
from all over the world. Facial expressions convey the current mental state of a person, and
most of the time we use nonverbal cues such as hand gestures, facial expressions, and tone of
voice to express feelings in interpersonal communication.
Due to the significance of human behavioral intelligence in computing devices, this work
focuses on facial expressions and speech for emotion recognition in multimodal (audio-video)
signals. A comprehensive literature survey is conducted to study and analyze the existing
multimodal datasets and state-of-the-art methods for human emotion recognition. The study
explores various research issues and challenges in human emotion detection through facial
expressions and speech, and concludes by identifying the research gap.
Imtiyaz Ahmad et al. (2020) [6] explained that the face and speech together are the best tools
for identifying emotions and that scientists continually seek better ways to exploit them;
instead of a unimodal approach using the face alone, speech should also be fused. The
proposed method only communicated the idea of identifying emotions without in-depth
details. Several methods exist for face detection, among them the widely used Viola-Jones
algorithm; a Relative Sub-Image Based feature method was proposed for the face, while an
RBFC method was proposed for speech. An SVM was used as the classifier to identify
emotions, and the LibSVM tool was prescribed for implementation, but no system accuracy
was reported.
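The following minimal Python sketch illustrates the kind of pipeline outlined in [6], using OpenCV's Haar cascade implementation of the Viola-Jones detector and a scikit-learn SVM; the flattened-pixel features stand in for the paper's Relative Sub-Image Based features, and the training data and image path are hypothetical placeholders.

```python
# A minimal sketch of the pipeline outlined in [6]: Viola-Jones (Haar cascade)
# face detection followed by an SVM emotion classifier. Raw-pixel features and
# random training arrays are placeholders, not the paper's actual features.
import cv2
import numpy as np
from sklearn.svm import SVC

# OpenCV ships Haar cascades implementing the Viola-Jones detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_features(gray_image):
    """Detect the largest face and return a flattened 48x48 patch as features."""
    faces = cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detection
    patch = cv2.resize(gray_image[y:y + h, x:x + w], (48, 48))
    return patch.flatten().astype(np.float32) / 255.0

# Placeholder training data: X_train would hold feature vectors, y_train labels.
X_train = np.random.rand(20, 48 * 48).astype(np.float32)   # dummy features
y_train = np.random.randint(0, 6, size=20)                  # dummy emotion ids
clf = SVC(kernel="rbf").fit(X_train, y_train)

# Predict on one image (the path is hypothetical).
img = cv2.imread("sample_face.jpg", cv2.IMREAD_GRAYSCALE)
if img is not None:
    feat = extract_face_features(img)
    if feat is not None:
        print("Predicted emotion id:", clf.predict(feat.reshape(1, -1))[0])
```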
Humaid Alshamsi et al. (2018) [7] identified emotions in real time in a mobile application
built for Android. The SAVEE and RAVDESS datasets were used, yielding approximately
97% accuracy. Cloud technology is used, MFCC techniques are applied for feature extraction
from the audio input, and an SVM is used as the classifier in a multimodal setup. For
implementation, MATLAB and Android features are combined for better accuracy.
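A minimal Python sketch of an MFCC-plus-SVM speech emotion classifier in the spirit of [7] is given below, using librosa and scikit-learn rather than the MATLAB/Android toolchain of the original work; the file names and labels are hypothetical placeholders.

```python
# A minimal sketch of the MFCC + SVM approach reported in [7]. File paths and
# labels are placeholders; real features would come from SAVEE/RAVDESS clips.
import librosa
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def mfcc_features(path, n_mfcc=13):
    """Load an utterance and summarize its MFCCs as a fixed-length vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (2*n_mfcc,)

# Hypothetical training lists; in practice these would be dataset utterances.
train_files = ["happy_01.wav", "sad_01.wav"]
train_labels = ["happy", "sad"]

X = np.stack([mfcc_features(f) for f in train_files])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
model.fit(X, train_labels)

print(model.predict(mfcc_features("test_utterance.wav").reshape(1, -1)))
```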
Emotion recognition is a technology that enables computers to recognize and interpret human
emotions by analyzing facial expressions, voice, text, or physiological signals. It finds
applications in human-computer interaction, mental health assessment, and personalized
content recommendation, offering insights into user sentiment and engagement. The authors
of [1] introduced an innovative approach to emotion recognition that combines facial
expressions and speech cues within a multimodal system. The fusion of the two modalities is
achieved through two specific methods: feature-level fusion and decision-level fusion. To
evaluate the effectiveness of their approach, they conducted experiments on the
eNTERFACE'05 dataset. Comparative analysis reveals that fusion-based techniques can
substantially enhance the performance of emotion recognition systems, and their findings
highlight the superiority of feature-level fusion over decision-level fusion in terms of overall
performance. In [2], a deep learning algorithm is utilized to create an integrated tool that
identifies facial emotions and the stress level or emotion quotient from speech.
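The two fusion strategies compared in [1] can be sketched as follows; the feature vectors, class probabilities, and weights below are illustrative placeholders, not values from the cited experiments.

```python
# Feature-level fusion concatenates the audio and visual feature vectors before
# a single classifier, while decision-level fusion combines per-modality class
# probabilities. All inputs here are toy placeholders for 6 emotion classes.
import numpy as np

def feature_level_fusion(face_feat, speech_feat):
    """Concatenate modality features into one joint vector for a single classifier."""
    return np.concatenate([face_feat, speech_feat])

def decision_level_fusion(face_probs, speech_probs, w_face=0.5):
    """Weighted average of per-modality class probabilities; returns class index."""
    fused = w_face * face_probs + (1.0 - w_face) * speech_probs
    return int(np.argmax(fused))

face_probs = np.array([0.1, 0.6, 0.1, 0.1, 0.05, 0.05])    # e.g. from a FER model
speech_probs = np.array([0.2, 0.3, 0.3, 0.1, 0.05, 0.05])  # e.g. from an SER model
print("Fused class:", decision_level_fusion(face_probs, speech_probs))
```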
Malyala Divya et al. (2019) [9] proposed automated live facial emotion recognition using
image processing (IP) and artificial intelligence (AI) techniques. Emotion is detected by
scanning static images, and features such as the eyes, nose, and mouth are extracted for face
detection. The system is designed by initializing a CNN model that takes an input image and
passes it through convolution layers, pooling layers, flattened layers, and dense layers.
Convolution layers were added for better accuracy on large datasets, which were collected
from a CSV file and then converted into images; finally, emotions were classified into
expressions such as disgust, happy, surprise, neutral, sad, angry, and fear, with 34,488
images selected for training and 1,250 for testing. An overall accuracy of 66% was achieved.
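A minimal Keras sketch of such a CNN (convolution, pooling, flatten, and dense layers over 48x48 grayscale faces with seven emotion classes) is shown below; the layer sizes and input resolution are assumptions rather than the exact configuration reported in [9].

```python
# A minimal CNN sketch with the layer types described in [9]: convolution,
# pooling, flatten, and dense layers. Sizes are assumed, not the paper's values.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fer_cnn(num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),            # 48x48 grayscale face patch
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_fer_cnn()
model.summary()
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels))
```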
Lakshmi Bhadana et al. (2020) [8] presented an approach that uses a CNN as the classifier
and a Haar cascade for real-time facial expression recognition across different people. The
system achieves an accuracy of 58% on test data captured with the system's webcam,
displays the detected emotion as text, and successfully classifies seven different human
emotions. Many factors must be kept in mind while recognizing emotions, such as camera
level and lighting conditions, which can deviate the accuracy; real-time images produce
better results than still images.
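The following minimal OpenCV sketch illustrates a real-time loop of this kind: Haar-cascade face detection on webcam frames, a CNN prediction on the cropped face, and the emotion label drawn as text; the model file name and the emotion label order are assumptions.

```python
# A minimal real-time loop in the spirit of [8]: Haar-cascade detection on
# webcam frames, a CNN prediction on the cropped face, emotion drawn as text.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]  # assumed order
model = load_model("fer_cnn.h5")     # hypothetical path to a trained 48x48 FER CNN
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)            # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(face.reshape(1, 48, 48, 1), verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
    cv2.imshow("Real-time FER", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```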
Xiang Feng et al. (2020) [10] proposed an academic emotion analysis technology based on
artificial intelligence methods for online learning, which helps researchers study learner
well-being through academic emotions. A framework is proposed for student comment aspect
classification (classifying comments into teacher, course, and platform aspects) and academic
emotion classification; based on it, a machine learning dataset was produced and an analysis
framework fusing LSTM-ATT and A-CNN was developed. The proposed aspect classification
model and academic emotion classification model proved superior to general machine
learning models and conventional LSTM networks, with accuracies of 88.62% and 71.12%,
respectively.
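A minimal Keras sketch of an LSTM-based comment classifier in the spirit of [10] is given below; the attention mechanism (LSTM-ATT) and the A-CNN branch are omitted for brevity, and the vocabulary size, sequence length, and class count are assumptions.

```python
# A minimal LSTM text classifier sketch for comment aspect classification.
# Attention (LSTM-ATT) and the A-CNN branch from [10] are omitted; all sizes
# below are assumed values for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 10000, 100, 3   # e.g. teacher/course/platform

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                # padded token-id sequences
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```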
Nahla Nour et al. (2020) [11] proposed facial expression recognition using CNN models with
an SVM classifier. Three models were used: AlexNet, VGG-16, and ResNet. The CK+ dataset
was used to check and predict results. The study describes the CNN layers and their uses,
including the convolution layer, pooling layer, fully connected layer, softmax classifier, and
sigmoid function. Recognition is classified and evaluated by training and testing on the data,
and the AlexNet model was judged to have the highest accuracy.
Moe Moe Htay (2021) [18] presents a survey of feature extraction and classification methods
for facial expressions. The steps considered are detection of facial components, extraction of
features from the face image, and classification of expressions. Both geometric and
appearance-based features are discussed for feature extraction, and both spontaneous and
posed datasets are considered. The images take different forms, such as peak expressions,
portrayed expressions, and video clips. CK+ and JAFFE are the databases on which such
systems are checked. Such systems are helpful in healthcare and patient monitoring.
N. Anantha Rufus et al. (2022) [19] discuss that many current applications need human
emotion recognition from speech, such as behavior assessment, call centers, emergency
centers, and virtual assistants. The use of the MFCC algorithm with an LSTM or CNN gives
the best results in this context. The RAVDESS and TESS datasets are used, the concept is
implemented using Python and OpenCV, and results of nearly 90% accuracy are achieved.
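A minimal sketch of an MFCC-plus-LSTM speech emotion pipeline of the kind described in [19] is shown below; the frame count, layer sizes, and dataset paths are assumptions, with features that would in practice be computed from RAVDESS or TESS utterances.

```python
# A minimal MFCC + LSTM speech-emotion sketch in the spirit of [19]. Frame
# counts, layer sizes, and file paths are assumed values for illustration.
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC, MAX_FRAMES, NUM_CLASSES = 40, 200, 6

def mfcc_sequence(path):
    """Return a fixed-size (MAX_FRAMES, N_MFCC) MFCC sequence for one utterance."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T    # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:                               # pad short clips
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

model = models.Sequential([
    layers.Input(shape=(MAX_FRAMES, N_MFCC)),
    layers.LSTM(128),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(np.stack([mfcc_sequence(p) for p in train_paths]), train_labels)
```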
The summary of these articles used in the review analysis is shown in Table A.
Table A. Summary of emotion recognition studies using SPEECH signals included in the review.
Highlights of speech-based emotion recognition, as evident from the summary in Table A: one
article each has been included from the years 2019, 2020, and 2022. Audio/video or
audio-only stimuli have been used most for emotion elicitation. The dataset analysis reveals
that the EMO-DB, RAVDESS, CASIA, and IEMOCAP datasets have been the most preferred
choices for model testing. The greatest strength of speech-based emotion recognition is that
multiple datasets have been used for method verification, and public speech emotion datasets
have been selected over private ones. Power spectral density (PSD), Mel-frequency cepstral
coefficients (MFCC), Mel spectrogram (MSG), STFT, and variants of the wavelet transform
(WT) have been adopted most for feature extraction. Model validation using holdout
cross-validation (CV) was preferred most for speech, followed by k-fold CV (k-FCV), with
leave-one-subject-out (LOSO) CV used least. The review reveals that DL models have an
edge over ML models for speech-based emotion recognition.
The review included 15 articles on the recognition of emotions using facial images. Table B
presents a summary of facial image-based emotion recognition.
Table B. Summary of emotion recognition studies using IMAGE signals included in the review.
The summary provided in Table B reveals that the highest number of articles are from the
years 2019 to 2023. The CK+ and JAFFE datasets have been the most commonly used facial
image datasets; FER2013, RAF-DB, and AffectNet have also been used in many studies. The
facial image-based emotion recognition studies have validated their models on multiple
datasets, and the majority of the facial image datasets are publicly available. Features based
on the geometry or texture of facial patterns are preferred. Model validation using holdout
CV, followed by k-FCV strategies, is most common. Regarding the distribution of
decision-making models for facial images, out of 15 articles, 12 preferred DL models for
classification and 3 used ML models. Among ML models, the SVM classifier has been the
most preferred, while CNNs have an edge over other DL models.
Most of the datasets used for emotion recognition are publicly available. However, the
majority of them have already been utilized to their maximum capacity, resulting in the
highest achievable classification accuracy. In addition, the available datasets have been
acquired with a single modality, i.e., either EEG, ECG, eye tracking (ET), galvanic skin
response (GSR), speech, or facial images. Therefore, there still exists a research gap in
analyzing emotion recognition using multiple modalities from the same subject. Also, the
lack of publicly available emotion datasets for healthcare, brain-computer interfaces, and
other applications limits such analysis.
Research has progressed well in reducing the gap between machines and humans, and over
the years the field of emotion computing has achieved tremendous success. However,
computing devices with artificial emotional intelligence methods still encounter unresolved
issues and challenges in effectively detecting human feelings via facial expressions and
speech. The following discussion describes the issues and challenges in multimodal human
emotion recognition.
Limited Dataset: The availability of labeled datasets for emotion detection is limited,
especially for less common emotions or for specific cultural contexts. This makes it
challenging to train deep learning models that can generalize well to new data.
Variability in Data: The data used for emotion detection can vary widely in terms of quality,
noise, and variability. For example, speech data can be affected by environmental noise,
accents, and speaking styles, while facial data can be affected by lighting conditions, facial
expressions, and occlusion.
Feature Extraction: Extracting relevant features from speech and facial data can be
challenging, especially when dealing with complex emotions that are not easily captured by
simple features. This requires careful design of feature extraction algorithms and feature
engineering techniques.
Interpretability: Deep learning models are often seen as “black boxes” that are difficult to
interpret. This can make it challenging to understand how the model is making decisions and
to diagnose errors or biases in the model.
Ethical and Privacy Concerns: Emotion detection using speech and facial data raises ethical
and privacy concerns, as it can be used for sensitive applications such as surveillance,
emotion profiling, and behavioral prediction.
Lack of effort to analyze and handle person-dependent attributes in speech and facial
expressions to compute the most relevant and generalized information for emotion detection:
People have unique vocal tract and facial expression patterns when they talk, sing, laugh, cry,
and perform other activities. Their facial profiles differ from one another in skin color and in
the shape and size of the face, eyes, eyebrows, nose, cheeks, mouth, chin, and hair color [15].
Their voiced speech also has different characteristics: adult males typically speak in the range
of 85 to 180 Hz, whereas adult females speak in the range of 165 to 255 Hz [16, 17]. Face and
voice attributes also vary over time as people get older. This within-person and between-person
diversity of speech and facial expressions makes the development of a generalized system
challenging.
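As a small illustration of these speaker-dependent voice attributes, the following sketch estimates the median fundamental frequency of an utterance with librosa's pYIN estimator and compares it against the typical ranges cited above; the file path is hypothetical and the simple threshold ignores the overlap between the two ranges.

```python
# Illustration of the speaker-dependent pitch ranges cited above (roughly
# 85-180 Hz for adult males, 165-255 Hz for adult females). The file path is
# hypothetical and the single threshold at 165 Hz ignores the range overlap.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=300, sr=sr)
median_f0 = float(np.nanmedian(f0[voiced_flag]))   # median over voiced frames

if median_f0 < 165:
    print(f"Median F0 {median_f0:.0f} Hz: closer to the typical adult-male range")
else:
    print(f"Median F0 {median_f0:.0f} Hz: closer to the typical adult-female range")
```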
Research Gap
A vast amount of work has already been done in the field of emotion detection using either
facial expression or speech emotion recognition systems, but only a few systems use a hybrid
approach for better results. Hybrid here means deriving the emotion from both facial
expressions and speech. Most systems use deep convolutional neural networks for emotion
detection on various datasets, and good accuracy has been achieved by such systems.
Problem Definition
I. The majority of the review articles previously published for emotion recognition
focus on a single modality, i.e., either a physiological signal, speech, or facial images.
II. No existing work addresses a recommendation system for multimodal emotion
detection using voice and facial expressions.
III. To propose a Multimodal Emotion Recognition (MER) system that can adaptively
integrate the most discriminating features from facial expressions and speech to
improve performance.
The aim is to develop a system that detects emotions and, based on the detected emotion,
generates a customized response to enhance the user's mental health. The objectives are:
1. To study and analyze the existing audio-video emotion detection methods and datasets.
2. To develop the multimodal dataset.
3. To extract and propose the feature vector for emotion recognition in tone-sensitive
speech and facial expressions.
4. To propose a multimodal system through the fusion of peak-stage behavior of facial
expressions and speech for emotion recognition.
5. To design an innovative classification model based on deep learning to derive a more
appropriate classification of different emotions (happy, sad, neutral, angry, excited, and
frustrated), which may be useful in many real-world applications.
6. To validate and compare the performance of the proposed system with existing work.
7. To recommend activities based on the recognized emotion, such as listening to music
(devotional, instrumental, etc.), watching a movie, acupressure therapy, yoga, or quotes;
a minimal sketch of such a mapping is given below.
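The recommendation step in objective 7 can be sketched as a simple rule-based mapping from the recognized emotion to suggested activities; the activity lists below are illustrative placeholders, not a validated therapeutic mapping.

```python
# A minimal rule-based mapping from a recognized emotion to suggested activities.
# The activity lists are illustrative placeholders only.
RECOMMENDATIONS = {
    "happy":      ["upbeat or instrumental music", "motivational quotes"],
    "sad":        ["devotional music", "a light-hearted movie", "yoga"],
    "neutral":    ["instrumental music", "a documentary"],
    "angry":      ["breathing exercises", "acupressure therapy", "calming music"],
    "excited":    ["energetic music", "a workout session"],
    "frustrated": ["yoga", "meditation", "inspirational quotes"],
}

def recommend(emotion):
    """Return activity suggestions for a recognized emotion label."""
    return RECOMMENDATIONS.get(emotion.lower(), ["take a short walk"])

print(recommend("Sad"))   # -> ['devotional music', 'a light-hearted movie', 'yoga']
```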
Proposed Research Methodology
There exist three main types of state-of-the-art methodologies for emotion recognition in
affective computing: (a) the image/video unimodal approach, (b) the audio/linguistic
unimodal approach, and (c) the audio+video multimodal approach. Researchers have made
many efforts to improve the accuracy and reduce the computational cost of emotion
recognition methods. The image/video unimodal approach recognizes human emotions
through facial expressions only and does not consider sound information. On the other hand,
the audio/linguistic unimodal approach detects human emotion through speech only and does
not consider the visual information of facial expressions. However, people convey their
feelings via both facial expressions and speech: they may use both simultaneously or use
only one of these channels to communicate their feelings. Thus, this research work focuses
on the joint processing of audio and video modalities for emotion recognition through facial
expressions and speech. The multimodal emotion recognition approach consists of five steps:
(1) multimodal dataset, (2) preprocessing, (3) feature extraction, (4) fusion and classification,
and (5) recommendation based on the predicted emotion. The general pipeline for multimodal
emotion recognition via facial expressions and speech is presented in Figure 4.
Fig. 4. General pipeline for multimodal emotion recognition via facial expressions and
speech, ending with a recommendation based on the predicted emotion (happy, sad, neutral,
angry, excited, or frustrated), such as listening to music (devotional, instrumental, etc.),
watching a movie, acupressure therapy, yoga, or quotes.
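To make the five-step pipeline concrete, the following skeleton sketches how the stages could be wired together under the assumption that trained unimodal feature extractors and a fusion classifier are available; every function body here is a placeholder.

```python
# A skeleton of the five-step pipeline in Figure 4. All function bodies are
# placeholders standing in for trained components; only the data flow
# (features -> feature-level fusion -> classification -> recommendation) is real.
import numpy as np

EMOTIONS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]

def face_features(frame):
    """Placeholder: preprocess the frame, detect the face, return an embedding."""
    return np.zeros(128)                      # e.g. a 128-d CNN embedding

def speech_features(audio_path):
    """Placeholder: preprocess the audio and return summarized MFCC statistics."""
    return np.zeros(26)                       # e.g. 13 MFCC means + 13 std devs

def classify(fused):
    """Placeholder: a trained fusion classifier would output class probabilities."""
    probs = np.ones(len(EMOTIONS)) / len(EMOTIONS)
    return EMOTIONS[int(np.argmax(probs))]

def recommend(emotion):
    """Placeholder mapping from the predicted emotion to a suggested activity."""
    table = {"sad": "devotional music or yoga", "angry": "acupressure therapy"}
    return table.get(emotion, "instrumental music")

def pipeline(frame, audio_path):
    fused = np.concatenate([face_features(frame), speech_features(audio_path)])
    emotion = classify(fused)
    return emotion, recommend(emotion)

print(pipeline(frame=None, audio_path="utterance.wav"))   # hypothetical inputs
```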
Semester   Planned research activity
I          Literature survey
II         Collect multimodal dataset and propose algorithm
III        Image preprocessing and feature extraction
IV         Feature fusion, classification, and recommendation
V          Validate and compare the performance of the proposed system with existing work
VI         Thesis writing
References
[1] Pragya Singh Tomar, Kirti Mathur, and Ugrasen Suman, "Fusing facial and speech cues
for enhanced multimodal emotion recognition," International Journal of Information
Technology, vol. 16, pp. 1397-1405, Jan. 2024.
[2] S. Shajith Ahamed, J. Jabez, and M. Prithiviraj, "Emotion Detection using Speech and
Face in Deep Learning," International Conference on Sustainable Computing and Smart
Systems (ICSCSS), Coimbatore, India, IEEE Xplore, pp. 317-321, 14-16 June 2023.
[3] Naveed Ahmed, Zaher Al Aghbari, and Shini Girija, "A systematic survey on multimodal
emotion recognition using learning algorithms," Intelligent Systems with Applications,
vol. 17, Elsevier, February 2023.
[4] J. Sun, J. Han, Y. Wang, and P. Liu, "Memristor-based neural network circuit of emotion
congruent memory with mental fatigue and emotion inhibition," IEEE Trans. Biomed.
Circuits Syst., vol. 15, no. 3, pp. 606-616, 2021.
[5] L. Shu, J. Xie, M. Yang, Z. Li, Z. Li, D. Liao, X. Xu, and X. Yang, "A review of emotion
recognition using physiological signals," Sensors, vol. 18, no. 7, p. 2074, 2018.
[6] Imtiyaz Ahmad, Ramendra Pathak, Yaduvir Singh, and Jameel Ahamad, "Emotion
Detection using Facial Expression and Speech Recognition," International Journal of Future
Generation Communication and Networking, vol. 13, no. 3, pp. 123-133, 2020.
[7] Humaid Alshamsi, Veton Kepuska, Hazza Alshamsi, and Hongying Meng, "Automated
Facial Expression and Speech Emotion Recognition App Development on Smart Phones
using Cloud Computing," 9th IEEE Annual Information Technology, Electronics and Mobile
Communication Conference (IEMCON), November 2018, DOI:
10.1109/IEMCON.2018.8614831.
[8] Lakshmi Bhadana, P. V. Lakshmi, D. Rama Krishna, G. Surya Bharti, and Y. Vaibhav,
"Real Time Facial Emotion Recognition With Deep Convolutional Neural Network," Journal
of Critical Reviews, ISSN 2394-5125, vol. 7, no. 19, 2020.
[9] Malyala Divya, R. Obula Konda Reddy, and C. Raghavendra, "Effective Facial Emotion
Recognition using Convolutional Neural Network Algorithm," International Journal of
Recent Technology and Engineering (IJRTE), ISSN 2277-3878, vol. 8, no. 4, November
2019.
[10] Xiang Feng, Yaojia Wei, Xianglin Pan, Longhui Qiu, and Yongmei Ma, "Academic
Emotion Classification and Recognition Method for Large-scale Online Learning
Environment - Based on A-CNN and LSTM-ATT Deep Learning Pipeline Method,"
International Journal of Environmental Research and Public Health.
[11] Nahla Nour, E. Mohammed, and V. Serestina, "Face expression recognition using
convolution neural network (CNN) models," International Journal of Grid Computing &
Applications, vol. 11, no. 4, pp. 1-11, 2020.
[12] T. Dalgleish, "The emotional brain," Nat. Rev. Neurosci., vol. 5, no. 7, pp. 583-589,
2004.
[13] T. S. Rached and A. Perkusich, "Emotion recognition based on brain-computer interface
systems," in R. Fazel-Rezai (Ed.), Brain-Computer Interface Systems, IntechOpen, Rijeka,
2013.
[14] P. Ekman, "An argument for basic emotions," Cogn. Emot., vol. 6, no. 3-4, pp. 169-200,
1992.
[15] R. Plutchik and H. Kellerman, Theories of Emotion, Vol. 1, Academic Press, 2013.
[16] G. F. Wilson and C. A. Russell, "Real-time assessment of mental workload using
psychophysiological measures and artificial neural networks," Hum. Factors, vol. 45, no. 4,
pp. 635-644, 2003.
[17] A. Mehrabian, "Pleasure-arousal-dominance: A general framework for describing and
measuring individual differences in temperament," Curr. Psychol., vol. 14, pp. 261-292,
1996.
[18] Moe Moe Htay, "Feature extraction and classification methods of facial expression: A
survey," Computer Science and Information Technologies, vol. 2, no. 1, pp. 26-32, 2021.
[19] N. Anantha Rufus, M. Zaheer, S. A. V. Dolendrakumar, and P. Penchalalokesh, "Speech
Emotion Recognition Using Deep Learning," International Research Journal of
Modernization in Engineering Technology and Science, vol. 4, no. 5, May 2022.
Dr. U. P. Waghe
Principal and Head of the Research Center