2021 IEEE Third International Conference on Cognitive Machine Intelligence (CogMI)

An Empathetic AI Coach for Self-Attachment Therapy

Lisa Alazraki∗, Ali Ghachem∗, Neophytos Polydorou∗, Foaad Khosmood† and Abbas Edalat∗

∗Department of Computing, Imperial College London, United Kingdom
Email: {lisa.alazraki20, ali.ghachem17, neophytos.polydorou19, [Link]}@[Link]

†Computer Science and Software Engineering, California Polytechnic State University, San Luis Obispo, USA
Email: foaad@[Link]

Neophytos Polydorou was supported by the UKRI CDT in AI for Healthcare [Link] (Grant No. P/S023283/1).
Abstract—In this work, we present a new dataset and a computational strategy for a digital coach that aims to guide users in practicing the protocols of self-attachment therapy. Our framework augments a rule-based conversational agent with a deep-learning classifier for identifying the underlying emotion in a user's text response, as well as a deep-learning assisted retrieval method for producing novel, fluent and empathetic utterances. We also craft a set of human-like personas that users can choose to interact with. Our goal is to achieve a high level of engagement during virtual therapy sessions. We evaluate the effectiveness of our framework in a non-clinical trial with N=16 participants, all of whom have had at least four interactions with the agent over the course of five days. We find that our platform is consistently rated higher for empathy, user engagement and usefulness than the simple rule-based framework. Finally, we provide guidelines to further improve the design and performance of the application, in accordance with the feedback received.

Keywords—digital psychotherapy, chatbots, self-attachment

I. INTRODUCTION

It is estimated that almost a billion people worldwide – approximately 13 percent of the global population – suffer from at least one mental disorder [1]. This number has increased by a third since 1990, and it is expected to continue to grow at an even steeper rate in the near future, due to the direct and indirect effects of the COVID-19 pandemic [2]. Despite the demonstrated need for pervasive, affordable mental healthcare, the considerable personal financial cost that is often associated with traditional psychotherapy prevents patients from low-income backgrounds from accessing therapy [3]. Moreover, patients in low and middle-income countries and rural areas encounter a further barrier to accessing specialised care, due to low local ratios of mental health professionals per capita [4], [5]. Confronted with these issues, researchers have examined digital technology as a means to deliver mental health services to the wider population [6]. As a result, a wide range of technological tools aimed at mental health support has been investigated and deployed within academia and industry [7], [8], many of which take the form of conversational agents administering various forms of psychotherapy [9].

It should be noted, however, that using conversational agents in a sensitive area such as mental healthcare poses significant challenges. Current deep-learning approaches to text and speech generation lack the necessary oversight to prevent a system from producing output that is insensitive [10] and even offensive [11], and thus potentially damaging to a patient's well-being. A recent literature review has observed that the large majority of mental-health-oriented chatbots currently in existence do not use machine learning at all, favouring more stable and predictable techniques such as rule-based modelling [12]. On the other hand, purely rule-based bots have a limited, keyword- or pattern-based understanding of user input, and their dialogue can be perceived as monotonous and predictable [13], resulting in a failure to fully engage users.

In this paper, we present a computational framework that augments a rule-based agent for the delivery of self-attachment technique (SAT), a recently developed psychotherapeutic intervention [14]. Our approach is aimed at maintaining the safety of rule-based strategies while also ensuring that the conversational agent generates responses that are empathetic, diverse and fluent, as well as appropriate to the user's emotional state. To this end, we create a new dataset – EMPATHETICPERSONAS – of 1,181 verbal expressions of emotion and 2,143 empathetic rewritings of base utterances, both crowd-sourced. We adopt a tree-structured conversation flowchart and devise a strategy for generating, at each node in the chart, novel yet safe utterances, trying to minimise any unpredictability in their overall meaning. To do so, we extract short, self-contained sentences from the set of utterance rewritings in the EMPATHETICPERSONAS dataset, by splitting each of them at major punctuation marks. We then join the extracted sentences in all possible sequential combinations and obtain a large corpus of new utterances. From this corpus, the agent retrieves – through a multi-objective function that simultaneously maximises empathy, fluency and novelty – the most appropriate utterance to present to the user. To compute the empathy score of an utterance, we use a T5 model [15] that is fine-tuned on a labelled subset of our dataset (∼80% accuracy, ∼81% macro F1); for the fluency score we subtract a penalty for each repeated word within an utterance from the inverse of its perplexity generated by a GPT-2 language model [16]; finally, to obtain the novelty score, we compute a weighted
overlap distance over all possible n-grams between an utterance and each of the agent's previous utterances. In addition, we adopt a RoBERTa model [17] for the task of emotion recognition (∼95% accuracy, ∼95% macro F1) that is trained on an existing affective dataset [18] and further fine-tuned on the expressions of emotion in our corpus. This allows the bot to identify a user's emotional state from their text responses and answer accordingly. Lastly, we craft human-like characters for our conversational agent which users can choose from and interact with. Our dataset and code are publicly available [19].

We evaluate the application through a human trial with N=16 subjects from the non-clinical population, as well as two medical professionals specialised in mental health. We show that our approach is scored highly for perceived usefulness, ability to communicate empathetically and user engagement, and that it performs significantly better than the simple rule-based version of the SAT chatbot [20] in all three areas. Our agent's ability to recognise human feelings is also assessed positively, with 63% of trial participants agreeing that the bot was successful in guessing their emotions. In light of the feedback received during the trial, we conclude with a reflection on the strengths of our work as well as its weaknesses, drawing up a list of changes and improvements which we believe may benefit the chatbot and its users.

II. BACKGROUND

A. Existing approaches to chatbot-assisted mental support

Many of the mental health support chatbots currently in existence approach dialogue generation using a tree-structured flowchart, whose transitions between prearranged states are determined by user input [21]–[28]. The input can take the form of open text [25], [26], multiple choice [21] or a combination of the two [22]–[24]. Within this framework, the conversation can be modelled as a slot-filling problem, where the user's input is integrated into pre-existing templates to create a chatbot utterance [25], [26], [28]. Alternatively, it can be informed by completely fixed, predetermined utterances, often written by mental health professionals with formal psychology training [22], [23]. Using fixed templates and utterances enables researchers to maintain control over the dialogue, ensuring that the bot will not deliver insensitive or problematic responses which could potentially have a negative effect on the patient's mental health. However, this can also render the experience less engaging, due to the conversation appearing rigid and repetitive, especially if a user interacts with the chatbot on a regular basis [29]. To introduce a degree of variety in the conversation, Ghandeharioun et al. [27] propose a retrieval method that randomly selects each bot utterance from a set of variations; however, the set only comprises six options, and thus it is unlikely to be able to prevent the dialogue from becoming repetitive over time.

B. Digital psychotherapy and self-attachment technique

Self-attachment technique (SAT) is a recently developed psychotherapy framework consisting of 20 self-administered protocols [30] aimed at establishing and reinforcing neural patterns associated with secure attachment [14]. It stems from findings in developmental psychology that link insecure attachment of children with their primary caregivers with affective disorders in adulthood [31]. In SAT, the patient simultaneously enacts the roles of the adult – corresponding to the logical self – and the child – representing the emotional self – gradually building a bond between the two. The adult self re-parents the childhood self by emulating the optimal interactions of a real parent with their child. This allows the childhood self to become securely attached to the adult self, enhancing positive emotions and equipping the patient with the cognitive tools to tackle challenging situations and negative feelings. SAT can be used to alleviate mental illness, and it can also increase social and emotional learning in the normal population.

SAT is suitable to be dispensed in a digital, automated manner due to its self-administered nature. A virtual reality (VR) platform for SAT has been developed in both a high-end and a low-end version [32]. The high-end VR platform, based on Facebook's Oculus, has also been equipped with an audio-based emotion recognition system and a dialogue manager [33]. In addition, a recent study investigating the applicability of a chatbot for the delivery of SAT received some encouraging results, with 40% of participants rating the platform as useful [20]. On the other hand, the entirely rule-based bot was deemed to be empathetic by only 20% of respondents, while 30% agreed that conversing with it was an engaging experience. Here, we extend the previous work done on the SAT chatbot by leveraging deep learning methods for emotion recognition and utterance retrieval. Our goal is to increase users' perception of empathy and overall engagement.

C. Empathy in digital psychotherapy

According to psychotherapy research, the most important factor in establishing a beneficial relationship between a therapist and their patient is the ability of the former to engage in an empathetic manner with the latter [34]. We thus consider empathy to be an indispensable feature of a mental health support chatbot. We adopt the definition of empathy given by Barrett-Lennard [35], who identifies three main phases of an empathetic dialogue between two individuals: a first phase where the listener sympathises and resonates with what is being expressed by the speaker, a second phase in which the listener compassionately responds to the speaker, and a third phase where the speaker assimilates the listener's response. Here we mainly focus on Barrett-Lennard's second phase – the expressive phase of empathetic exchange – in an attempt to create a digital psychotherapist able to display compassion toward the user.

D. Chatbot personification and user engagement

Past research shows that users' experience of interacting with a chatbot improves significantly when it is equipped with a coherent identity [36]. Moreover, psychology studies have highlighted that individuals tend to prefer psychotherapists of a certain age or sex according to several factors. For example, women generally report higher levels of comfort when self-disclosing to female practitioners compared to male ones [37], and patients tend to choose younger or
older specialists depending on the specific issue that they are
facing (older therapists are preferred for universal problems
such as mourning, while younger ones are favoured when
dealing with issues that more typically affect young people,
such as heartbreak or cyberbullying) [38]. In an attempt to
increase users’ engagement with the conversational agent,
we create for it a set of personas characterising different
sexes and age ranges. Section III-E includes details of how
these personas are created.
E. Privacy and ethics
User input saved during interactions to provide the service
is permanently deleted at the end of each session. The
application does not collect or store geolocation data, IP or MAC addresses, or any other metadata from users' devices.

It should be noted that our chatbot is designed for individuals who are already familiar with SAT and has not yet been tested on the clinical population. In its present form, the bot could produce responses that may be inappropriate within contexts involving self-harm. A careful and considered approach should be taken when dealing with users that may be experiencing mental distress, and future research should meticulously assess any risks associated with using the platform in a clinical setting, as well as the appropriate solutions.

Ethical approval: In this work, the collection of the EMPATHETICPERSONAS dataset and the non-clinical trial for the evaluation of the SAT chatbot have been approved by the Research Ethics Committee of Imperial College London.

III. DATASET AND DATA COLLECTION

A. Survey preparation

We crowd-sourced the EMPATHETICPERSONAS dataset by distributing four surveys. Each survey contained two tasks: one asking respondents to provide multiple textual expressions of emotion (answering the question 'How are you feeling?') for different emotional contexts, and one requiring them to rewrite a set of base utterances to render them empathetic, keeping in mind that these utterances are to be directed to an interlocutor who is experiencing a specified emotion. In addition, we asked respondents to provide information about their sex and age.

B. Recruitment of survey respondents

Survey respondents were recruited via the crowd-working websites Amazon Mechanical Turk¹ and Prolific². All recruited respondents were educated at college level or above and their first language was English.

¹ [Link]
² [Link]

C. Criteria for the acceptance of responses

Responses were rejected if they amounted to less than 50 percent of the survey, if they contained poorly written syntax or unrelated text, or if the base utterances that were meant to be rewritten had been copy-pasted without changes. In all other cases, the responses were accepted. Where minor grammar, syntax or semantic mistakes were present, these were rectified before insertion into the dataset. We modified the punctuation in some of the empathetic utterance rewritings by replacing commas with full stops whenever these were positioned at the end of a complete sentence. In total, 200 responses were accepted – 50 for each of the four surveys.

D. Data analysis

The EMPATHETICPERSONAS dataset comprises 200 rows, each corresponding to a survey response. Each row contains the sex and age range of the respondent, as well as the expressions of emotion and empathetic rewritings that they provided. There are two sexes (male, female) and six age groups within the corpus. While the distribution of data samples across the two sexes is balanced (98 females and 102 males), the majority of the samples originate from the 30-39 and 40-49 age groups for both sexes, as shown in Fig. 1.

Fig. 1. Age distribution for both sexes across samples in the EMPATHETICPERSONAS dataset, showing that most samples belong to the middle age groups 30-39 and 40-49.

The dataset contains 1,181 textual expressions of emotion distributed across four emotional contexts: 299 are expressions of sadness, 297 communicate anger, 285 relate to anxiety/fear and 300 convey happiness/content. It also includes 2,143 empathetic rewritings of 45 base utterances. Each subset of 50 rows collects the responses to one of the four surveys and contains different emotional contexts as well as rewritings of different base utterances. Accounting for some missing data, the corpus comprises between 42 and 50 rewritings of each base utterance. All empty cells are filled with NaN values.

E. Creating personas from the data

We used the information collected about the sex and age group of the survey respondents to divide the data into four subsets, each of which informs a different chatbot persona. The empathetic rewritings provided by female crowd-workers aged 18 to 39 condition the dialogue of a younger female persona named Olivia, while those provided by male respondents in the same age range inform the conversation of a younger male persona named Arman. Similarly, we created an older female persona named Gabrielle, whose dialogue is based on the rewritings provided by female crowd-workers aged 40 to 69, and an older male persona named Robert, whose interactions are crafted from the survey responses given by male crowd-workers aged between 40 and 69. We also created a further identity for our chatbot named Kai, whose dialogue is informed by the whole dataset and is not associated with any sex or age group.

F. Empathy annotation

The utterance rewritings in our corpus may convey different degrees of empathy. This is due to the individual personality of each survey respondent and their interpretation of the task, as well as the fact that we did not reject responses based on their perceived degree of empathy. In order to build an effective empathy classifier, necessary to ensure that our system produces the most appropriate responses, we created a separate dataset by randomly extracting 1,100 utterance rewritings from the corpus and annotating them for empathy, using discrete numerical labels from 0 to 2 (where 0 corresponds to a non-empathetic utterance and 2 to a strongly empathetic one). We used this scale as it correlates with previous work in automated empathy recognition [39]. To avoid excessively biasing the model toward our own judgement, we enlisted two volunteers to re-annotate the 1,100 rewritings for empathy, using the same scale. Both annotators have worked in healthcare and are experienced in communicating empathetically with patients. For each rewriting, we computed the overall empathy score by choosing the majority label out of the three individual ones. If all three labels were different, we assigned a score of 1.

It should be noted that this labelling method may still invite bias, as all three annotators belong to similar age groups (30-39 and 40-49). In future implementations, it is recommended that the rewritings are re-scored via crowd-sourcing.
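The aggregation rule above is simple enough to state in code; the following is a minimal sketch (the function and variable names are ours, not taken from the released implementation [19]).

from collections import Counter

def aggregate_empathy_label(labels):
    # Combine the three annotators' labels (each 0, 1 or 2) into one score:
    # the majority label, or 1 when all three labels differ.
    assert len(labels) == 3
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else 1

print(aggregate_empathy_label([2, 0, 2]))  # 2 (majority)
print(aggregate_empathy_label([0, 1, 2]))  # 1 (three-way disagreement)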
IV. IMPLEMENTATION

A. Emotion recognition

To customise the dialogue to the relevant emotional context, the chatbot asks the user to describe how they feel at the beginning of each conversation. Consistent with the data collected in the EMPATHETICPERSONAS dataset, we aim to discern between four contexts: sadness, anger, anxiety/fear and happiness/content. To achieve effective emotion recognition given a user's text response, we fine-tuned a pretrained RoBERTa language model for this task, first on Saravia et al.'s affective dataset [18] and then on the expressions of emotion in our corpus. The model achieves 94.96% accuracy and 95.10% macro-averaged F1 score on the test set split from our corpus (in contrast, the keyword-based emotion classifier implemented in the previous version of the chatbot [20] obtains 63.03% accuracy and 62.48% macro F1 on the same test set). Table I displays the hyperparameters used in both fine-tunings.

TABLE I
Hyperparameters used to fine-tune a RoBERTa language model for the task of emotion recognition.

Train-val-test proportions | Learning rate | Adam epsilon | Batch size | Epoch with best accuracy
80-10-10 | 1.35 × 10^-4 | 1 × 10^-8 | 16 | 10

It should be noted that the expressions of emotion in the EMPATHETICPERSONAS dataset have been provided by individuals instructed to answer the question 'How are you feeling?' as if they were experiencing a particular emotion. The fact that our model's second and final fine-tuning was not performed on genuine emotional expressions – but rather on their imitation – is potentially a source of bias that may decrease its performance when applied to real-world situations.
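For concreteness, the second fine-tuning stage could look as follows. This is a hedged sketch assuming the HuggingFace Transformers library: the toy texts and labels are placeholders for our data, "roberta-base" stands in for the intermediate checkpoint, and only the hyperparameters are taken from Table I.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import RobertaForSequenceClassification, RobertaTokenizer

LABELS = ["sadness", "anger", "anxiety/fear", "happiness/content"]

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# "roberta-base" is a stand-in: in our pipeline the starting checkpoint has
# already been fine-tuned on Saravia et al.'s affective dataset [18].
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=len(LABELS))

texts = ["I feel like nothing matters anymore", "Today was a wonderful day"]  # toy examples
labels = torch.tensor([0, 3])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels), batch_size=16)  # Table I

optimizer = torch.optim.AdamW(model.parameters(), lr=1.35e-4, eps=1e-8)  # Table I
model.train()
for epoch in range(10):  # best accuracy was reached at epoch 10 (Table I)
    for input_ids, attention_mask, y in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()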
B. Corpus augmentation

The utterance rewritings in the EMPATHETICPERSONAS dataset consist of either one, two or three distinct sentences. Fig. 2 illustrates an example of a three-sentence rewriting in the corpus. We therefore extracted individual sentences from the dataset by splitting each rewritten utterance at major punctuation marks (full stops, question marks and exclamation points), and recombined these sentences in different ways to form new utterances. This approach has the following advantages: (a) it allows the augmentation of our text data, otherwise bound to the limited size of the dataset; (b) it ensures that the newly-generated utterances remain safe and reliable, since each sentence is self-contained in its meaning, has been reviewed at the dataset collection stage and is known not to be insensitive or harmful; (c) it has the potential to increase the level of empathy of those rewritten utterances which may not be highly empathetic in their original form. As shown in Fig. 3, further analysis of our data shows that utterances composed of two or more sentences are perceived on average as more empathetic by human annotators compared to single-sentence ones. This may be due to the fact that, when an utterance is composed of several sentences, one of them conveys the main message while the others are often expressions of politeness, sympathy or compassion.

Fig. 2. A three-sentence utterance rewriting in the EMPATHETICPERSONAS dataset. Sentence 2 conveys the main question, while Sentences 1 and 3 reinforce the empathy of the message by expressing sympathy and compassion.

Fig. 3. Bar chart showing the mean empathy score of rewritten utterances in the EMPATHETICPERSONAS dataset by number of sentences that they contain (for details of the empathy annotation process see Section III-F).

When extracting sentences, we aimed to save a record of their relative position within the original utterance in order to maintain this position when combining them together to form new utterances, thus increasing the likelihood of a meaningful result. We defined three lists – first_pos_list, second_pos_list and third_pos_list – corresponding to the three possible positions within an utterance (since the utterances in our corpus contain at most three sentences), and assigned each extracted sentence to one of these lists. The assignment is straightforward when utterances are composed of three sentences, whereas for shorter utterances we employed a strategy to achieve a sensible assignment [40], populating second_pos_list with the sentences most likely to convey the main message of an utterance. Having populated the three lists, we eliminated from them any duplicate sentences and added an empty string to first_pos_list and third_pos_list (but not to second_pos_list, to prevent the possibility of creating empty utterances by selecting the empty string from all three position lists). We then formed new utterances containing one, two or three sentences by successively choosing one item from each of first_pos_list, second_pos_list and third_pos_list, in this order, until all possibilities had been exhausted. The resulting corpus thus contains |first_pos_list| × |second_pos_list| × |third_pos_list| utterances (where the notation |list| indicates the length of list). This process was repeated for each column in the EMPATHETICPERSONAS dataset (i.e. we only combined together sentences originating from rewritings of the same base utterance). A minimal sketch of this recombination, under simplifying assumptions, is shown below.
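The sketch below illustrates the recombination under simplifying assumptions: sentence splitting uses a regular expression, and the heuristic from [40] for placing sentences of one- and two-sentence rewritings is reduced to taking the last sentence as the main message.

import re
from itertools import product

def augment(rewritings):
    # Build the three position lists from the rewritings of one base utterance.
    first, second, third = [], [], []
    for utterance in rewritings:
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", utterance) if s.strip()]
        if len(sentences) == 3:
            first.append(sentences[0])
            second.append(sentences[1])
            third.append(sentences[2])
        else:
            # Simplified placement: the last sentence is treated as the main
            # message (the paper uses a more careful heuristic [40]).
            second.append(sentences[-1])
            first.extend(sentences[:-1])
    # Deduplicate, then allow an empty first/third slot; second_pos_list never
    # contains the empty string, so no fully empty utterance can be generated.
    first = list(dict.fromkeys(first)) + [""]
    second = list(dict.fromkeys(second))
    third = list(dict.fromkeys(third)) + [""]
    return [" ".join(s for s in combo if s) for combo in product(first, second, third)]

corpus = augment(["I am so sorry. How do you feel now? Take your time.",
                  "That sounds hard. Would you like to try a protocol?"])
print(len(corpus))  # |first_pos_list| x |second_pos_list| x |third_pos_list|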
Through the process of sentence extraction and recombination we obtained corpora of utterances significantly larger than the original, as illustrated in Table II. Visual inspection of these corpora reveals that the quality of the newly-generated utterances is, on average, satisfactory. However, not all utterances are equally suitable to be used by the chatbot. Some of them may be less fluent than others, due to repetitions or semantic conflicts arising from combining parts of different rewritings, and some may still lack enough empathy. Moreover, many utterances have sentences in common, increasing the risk that the bot's dialogue may sound repetitive. To overcome these issues, we devised a retrieval method that yields the best possible utterance at each stage of the conversation.

TABLE II
Comparison of the total number of utterances in each dataset split before and after the augmentation process.

Dataset split (associated persona) | Before augmentation | After augmentation
Males 40-69 (Robert) | 480 | 3,980
Females 40-69 (Gabrielle) | 495 | 4,123
Males 18-39 (Arman) | 614 | 4,747
Females 18-39 (Olivia) | 554 | 5,172
Entire dataset (Kai) | 2,143 | 94,993

C. Retrieval method

Our retrieval method consists of a multi-objective optimisation function combining an empathy score, a fluency score and a novelty score, which are simultaneously maximised when selecting an utterance.

Empathy function: To compute the empathy score of an utterance, we fine-tuned a T5 language model on the portion of the EMPATHETICPERSONAS dataset that had been annotated for empathy (see Section III-F). We obtained a classification accuracy of 80.18% and a macro-averaged F1 score of 80.66% on the test set. Table III illustrates the hyperparameters used in the fine-tuning process. The values returned by this model, which corresponds to our empathy scoring function E, are normalised to be between 0 and 1 by dividing each output by 2 (which is the maximum empathy score possible).

TABLE III
Hyperparameters used to fine-tune a T5 language model for the task of empathy classification.

Train-val-test proportions | Learning rate | Adam epsilon | Batch size | Epoch with best accuracy
80-10-10 | 1 × 10^-4 | 1 × 10^-8 | 8 | 16
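A sketch of the empathy scoring function E is given below, assuming HuggingFace Transformers and a text-to-text formulation in which T5 generates the label (0, 1 or 2) as text. The 'empathy:' task prefix and the fallback for unexpected outputs are our assumptions, and "t5-base" stands in for the fine-tuned checkpoint.

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# "t5-base" is a stand-in for the checkpoint fine-tuned on the
# empathy-annotated subset (hyperparameters in Table III).
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def empathy_score(utterance):
    # E(u) in [0, 1]: the predicted empathy label divided by 2 (the maximum).
    inputs = tokenizer("empathy: " + utterance, return_tensors="pt")  # hypothetical prefix
    output_ids = model.generate(**inputs, max_new_tokens=2)
    label_text = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    label = int(label_text) if label_text in {"0", "1", "2"} else 1  # fallback
    return label / 2.0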
Fluency function: To evaluate the fluency of an utterance, we compute the inverse of its perplexity (PPL) score returned by a GPT-2 language model. Since combining together portions of different utterances may create unwanted repetitions, we subtract from this value a penalty of 10^-2 for each repeated (lemmatised) word, excluding stop words. Therefore, the fluency F of an utterance u is given by

F(u) = \frac{1}{\mathrm{PPL}(u)} - RP(u),     (1)

where RP(u) is the total penalty for all the repeated words within that utterance. To normalise (1) so that it returns values through the whole range between 0 and 1, we divide it by the maximum possible fluency score as calculated on the augmented corpora (i.e. 0.16). If the output is negative, which may happen when the total penalty is greater than the inverse of the perplexity, we return zero.
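Equation (1) can be computed as in the sketch below, again assuming HuggingFace Transformers; lower-casing stands in for lemmatisation, and the stop-word list is an illustrative stand-in.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

STOP_WORDS = {"the", "a", "an", "i", "you", "to", "and", "of", "is", "it"}  # stand-in

def fluency_score(utterance, penalty=1e-2, max_fluency=0.16):
    # F(u) = 1/PPL(u) - RP(u), normalised by the maximum fluency (0.16) and
    # clipped to zero when the repetition penalty dominates.
    enc = tokenizer(utterance, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL per token
    ppl = math.exp(loss.item())
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    content = [w for w in words if w and w not in STOP_WORDS]
    repeated = len(content) - len(set(content))
    return max(0.0, (1.0 / ppl - penalty * repeated) / max_fluency)

print(fluency_score("I am sorry you feel this way. Would you like to try a protocol?"))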
Novelty function: The chatbot is able to save and retrieve up to 50 of its previous utterances, and it compares each new utterance to those in this set to evaluate its novelty. To this end, we implement a function that computes the weighted overlap distance [41] over all possible n-grams between two text sequences, starting from unigrams up to N-grams, where N is equal to the length in words of the shorter sequence. The greater the number n, the more we decrease the distance
between n-grams – which is a number between 0 and 1 – by raising it to the power n, since utterances are more similar when they share longer sequences of words. After adding together the distances over all possible n-grams, we divide the result by N so that it remains between 0 and 1. The distance d between two utterances u1 and u2 is thus given by

d(u_1, u_2) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|\text{n-grams}(u_1) \cap \text{n-grams}(u_2)|}{\min(|\text{n-grams}(u_1)|,\, |\text{n-grams}(u_2)|)} \right)^{n},     (2)

where n-grams(u) represents the set of n-grams in the utterance u and the notation |X| indicates the size of set X. Equation (2) is computed between a new utterance and each of the saved previous utterances, adding up the results to obtain the novelty (or diversity) score D of the new utterance. We divide D by the number of previous utterances to obtain a normalised value between 0 and 1.
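A direct implementation of Equation (2) and of the normalised novelty score might look as follows (a sketch; tokenisation is by whitespace).

def ngram_set(words, n):
    # All n-grams of a word sequence, as a set of tuples.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def distance(u1, u2):
    # Equation (2): weighted overlap distance between two utterances.
    w1, w2 = u1.split(), u2.split()
    N = min(len(w1), len(w2))
    total = 0.0
    for n in range(1, N + 1):
        g1, g2 = ngram_set(w1, n), ngram_set(w2, n)
        overlap = len(g1 & g2) / min(len(g1), len(g2))
        # Raising to the power n shrinks the term more at higher orders,
        # since sharing longer word sequences signals stronger similarity.
        total += (1.0 - overlap) ** n
    return total / N

def novelty_score(candidate, previous_utterances):
    # Normalised diversity score D: mean distance to the saved utterances.
    if not previous_utterances:
        return 1.0
    return sum(distance(candidate, p) for p in previous_utterances) / len(previous_utterances)

print(distance("how are you feeling today", "how are you doing today"))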
Multi-objective optimisation function: Let E_norm(u), F_norm(u) and D_norm(u) be the normalised functions measuring, respectively, the empathy, fluency and diversity of an utterance u, each returning a value between 0 and 1. Then, the overall function R that we wish to maximise when retrieving a new utterance is given by

R(u) = w_e E_{norm}(u) + w_f F_{norm}(u) + w_d D_{norm}(u).     (3)

We fix the weights in (3) to w_e = 1, w_f = 0.75 and w_d = 2. These values have been obtained experimentally to give reasonable results. It should be noted that calculating R(u) is computationally expensive: the complexity of transformer-based models such as T5 and GPT-2 – which we use to compute E_norm(u) and F_norm(u) respectively – is quadratic in the length (in words) of the utterance u [42]. Moreover, the function D_norm(u) performs for each new utterance p × N × (N + 1)/2 comparisons, where p is the number of saved previous utterances and N is the length, in words, of the shorter of the two utterances being compared. As a trade-off between response time and size of the utterance retrieval pool, we apply (3) on a random subset of 15 utterances drawn from the corpus. In future implementations, it may be worth precomputing the empathy and fluency scores of each utterance and appending these values to the augmented corpora, so that only the novelty score, which depends on the bot's previous utterances, will need to be calculated at runtime.
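Putting the three objectives together, the retrieval step reduces to a sketch like the one below; empathy_fn, fluency_fn and novelty_fn are hypothetical callables standing for the normalised scoring functions described above.

import random

W_E, W_F, W_D = 1.0, 0.75, 2.0  # weights of Equation (3)

def retrieve_utterance(corpus, empathy_fn, fluency_fn, novelty_fn, pool_size=15):
    # Maximise R(u) over a random pool of 15 candidates, trading response
    # time against the size of the retrieval pool.
    pool = random.sample(corpus, min(pool_size, len(corpus)))
    return max(pool, key=lambda u: W_E * empathy_fn(u)
                                   + W_F * fluency_fn(u)
                                   + W_D * novelty_fn(u))

# Toy usage with constant stand-in scorers:
best = retrieve_utterance(["Take your time.", "I am sorry. How do you feel?"],
                          lambda u: 0.8, lambda u: 0.5, lambda u: 1.0)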
having a much larger pool of utterances to choose from
D. Conversation flow (and thus potentially more diversity in the responses), as
After a user has logged into the application, the bot asks is the case for Kai, can provide a significant advantage. The
them to choose a persona between Kai, Robert, Gabrielle, questionnaire also asked volunteers to state which personas
Arman and Olivia. The user’s selection informs which they had interacted with and there were additional open-
portion of the (augmented) data is loaded into the back- ended questions to collect comments and suggestions.
end. The conversation between the user and the chatbot
allows a mix of open text and multiple-choice input and B. Evaluation
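As an illustration of the back-end/front-end contract, a single Flask endpoint could mediate each conversation turn; the route name and payload fields below are our assumptions, not the project's actual API.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/message", methods=["POST"])  # hypothetical route
def message():
    # Receive the user's text for the current flowchart node, run emotion
    # recognition or utterance retrieval (Section IV), and return the reply.
    user_text = request.get_json()["text"]
    reply = "I hear you. Would you like to try one of the SAT protocols?"  # placeholder
    return jsonify({"reply": reply})

if __name__ == "__main__":
    app.run()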
V. NON-CLINICAL TRIAL

A. Study design

The SAT chatbot was formally evaluated through a human trial. The pool of participants comprised 23 volunteers from the non-clinical population aged between 22 and 70, all of whom were already familiar with SAT. Of these 23 individuals, 16 were male and 7 were female. Each volunteer agreed to have four interactions with the chatbot over the course of five days – two with Kai and the rest with any two of the other personas. The chatbot was also evaluated separately by two clinicians specialised in mental health, who completed the same number of interactions as the other participants.

The chatbot platform was deployed as a web application and all the interactions occurred online. Participants were sent instructions, a link to access the platform and individual login credentials via e-mail, and they were able to give feedback by filling out an anonymous online questionnaire. The questionnaire contained multiple-choice questions asking to evaluate: (a) the chatbot's ability to display empathy; (b) the level of engagement of each user; (c) the usefulness of the platform; (d) the ability of the chatbot to identify emotions. When volunteers evaluated the bot for empathy and engagement, they scored these attributes separately for Kai and the other personas. By collecting this information, we aimed to assess whether a human-like character – such as Robert, Gabrielle, Arman and Olivia – can improve user experience. On the other hand, we gauged whether having a much larger pool of utterances to choose from (and thus potentially more diversity in the responses), as is the case for Kai, can provide a significant advantage. The questionnaire also asked volunteers to state which personas they had interacted with, and there were additional open-ended questions to collect comments and suggestions.

B. Evaluation

Of 23 study volunteers, 16 returned a complete questionnaire. The evaluation in this section is thus carried out on a sample size of 16. We compare our results with those obtained in a previous evaluation trial [20] of the earlier implementation of the SAT chatbot, which we define as our baseline.
Fig. 5. Appearance of the SAT chatbot web application.

Evaluation by trial volunteers: Volunteers were asked to evaluate the chatbot's ability to convey empathy by expressing how much they agreed/disagreed with the statement 'The chatbot displayed empathy in its responses throughout the conversation', both in the context of their interactions with Kai and in relation to the other personas. When interacting with Kai, 75% agreed that the bot was empathetic, while the remaining quarter selected 'Strongly agree' and 'Neither agree nor disagree' in equal proportions, as illustrated in Fig. 6. When the interactions were with any of the other personas, 56% agreed with the statement, 19% strongly agreed and a quarter neither agreed nor disagreed. In both cases we observe a significant improvement over the baseline: only 20% of those participating in the previous trial agreed that the earlier implementation was empathetic, with 50% expressing disagreement.

When evaluating engagement, 6% of participants disagreed with the statement 'I found the conversation with the chatbot to be engaging', and a quarter neither agreed nor disagreed. This was the case for interactions with Kai as well as the other personas, as shown in Fig. 7. In addition, 63% agreed that Kai's conversations were engaging and a further 6% strongly agreed, while 56% agreed and 13% strongly agreed that the other personas conversed in an engaging manner. In comparison, when evaluating the previous implementation, 40% disagreed that the dialogue was engaging, 30% neither agreed nor disagreed, and the remaining 30% agreed or strongly agreed.

Usefulness was evaluated by agreeing/disagreeing with the statement 'Overall, the platform was useful'. Of our sample, 75% agreed and a further 17% strongly agreed with the above statement, with 8% choosing 'Neither agree nor disagree'. Fig. 8 shows clear improvement over the baseline, for which 10% disagreed that the platform was useful, 50% neither agreed nor disagreed, 20% agreed and an equal proportion strongly agreed.

Fig. 6. Empathy evaluation of the previous version of the chatbot and the current one. Our results show significant improvement over the baseline in the perceived level of empathy for both Kai and the other personas.

Fig. 7. Engagement level of users who interacted with the previous and current version of the chatbot. The level of user engagement improves in our implementation, whether the interactions are with Kai or the other personas.

Fig. 8. Evaluation of usefulness of the earlier and current version of the chatbot, showing that the current version is more consistently rated as useful.

In addition, we found that 63% of participants either agreed or strongly agreed with the statement 'The chatbot was good at guessing my emotion', a quarter neither agreed nor disagreed and the remainder disagreed, as illustrated in Fig. 9. As no analogous data were collected during the previous trial, we cannot compare these results with the baseline. Instead, we refer the reader back to Section IV-A, where our emotion recognition model is shown to achieve accuracy and macro-averaged F1 scores over 30% greater than those obtained by the classifier used in the previous implementation (when tested on the same data).

Fig. 9. Evaluation of the current SAT chatbot's ability to recognise emotions.

Lastly, we investigated the volunteers' preferences when choosing personas. Without considering the mandatory interactions with Kai, we found that a quarter of the other interactions were with Olivia, approximately 15% were with Gabrielle, and the remaining 60% were equally split between Robert and Arman.

We should also note the comments and suggestions received. Some participants observed that the set of identifiable emotions was too limited, and this may have affected the bot's ability to successfully predict emotional states. Further feedback highlighted that not only was the range of emotions narrow, but those emotions may have been quite extreme compared to what members of the non-clinical population would normally experience. For example, feeling 'slightly worried' would be cast by the current version of the bot as being 'anxious/scared', whereas the two states are arguably rather different. Several volunteers also noted that having only two choices ('I feel better' and 'I feel worse') for giving feedback after completing a protocol is too restrictive, and more nuanced options would be required.

Evaluation by clinicians: The clinicians' assessment of the earlier and current SAT chatbot is shown in Table IV. It is worth noting that both specialists rated the platform identically regardless of whether their interactions were with Kai or the other personas. While their evaluation remained mostly unchanged, our implementation was viewed as significantly more empathetic than the previous one by one clinician, who changed their response to the statement 'The chatbot displayed empathy in its responses throughout the conversation' from 'Disagree' to 'Agree'. The specialists commented positively on the chatbot's ability to interpret emotions, but observed that this was limited by the narrow range of emotional contexts available. They also noted that more positive reinforcement would be desirable (e.g. a congratulatory message when a user logs into the platform), as well as recognition and appropriate management of any user input involving self-harm.

TABLE IV
Clinician evaluation of our chatbot against the baseline.

Statement | Previous version | Current version
The chatbot was good at guessing my emotion | Clinician 1: N/A; Clinician 2: N/A | Clinician 1: Agree; Clinician 2: Agree
The chatbot displayed empathy in its responses throughout the conversation | Clinician 1: Disagree; Clinician 2: Disagree | Clinician 1: Agree; Clinician 2: Disagree
I found the conversation with the chatbot to be engaging | Clinician 1: Agree; Clinician 2: Disagree | Clinician 1: Agree; Clinician 2: Disagree
Overall, the platform was useful | Clinician 1: Agree; Clinician 2: Agree | Clinician 1: Agree; Clinician 2: Agree

VI. DISCUSSION

A. Results

Our framework and study add to the existing body of knowledge in computational methods for mental health support. The human evaluation trial shows promising results with respect to the perceived empathy, user engagement, usefulness and ability to identify emotions of the chatbot. We find that trial participants report higher levels of engagement with the application when interacting with the human-like characters (Robert, Gabrielle, Arman, Olivia) than they do when the interaction is with Kai. While the overall rate of approval is the same for both types of persona, we find significantly more 'Strongly agree' responses when users evaluate the former group. On the other hand, results are less conclusive when the chatbot is assessed for empathy. In this context, the human-like personas still receive more top-range responses; however, when considering both 'Agree' and 'Strongly agree' answers, Kai is scored positively by a greater percentage of participants.

B. Limitations of the study

Of the 23 non-clinician volunteers that signed up to the study, only 16 completed the evaluation questionnaire, resulting in a further reduction of an already modest sample. Moreover, since the questionnaires are anonymous, we do not know how the sex and age distribution of our actual sample (i.e. those who returned a completed questionnaire) compares to that of the entire pool of volunteers. To design an effective future trial, this distribution should be considered carefully. For example, we have noted in this study that users favoured male characters over female ones (60% of all interactions were with Robert and Arman). This may be due to the fact that males were over-represented in our group of volunteers, and repeating the evaluation with a more evenly distributed sample could help validate or disprove this hypothesis.

Moreover, increasing the number of required interactions and the length of the intervention in the future may give participants a more informed opinion of the strengths and weaknesses of the chatbot, as some of these (e.g. its ability/inability to present the user with novel utterances over time) may only be evident over a period longer than five days.

Finally, fluency and diversity are two main objectives of our conversational framework, yet we evaluate them only indirectly, by asking participants how engaging they found the bot's conversations. We do this to be consistent with the evaluation data collected in the previous trial, and thus have a dependable baseline for comparing our results. However, this leaves us with little insight into why a minority of the participants found the bot not to be engaging. In future studies, it would be advisable to have the chatbot's dialogue explicitly evaluated for fluency and diversity.

C. Future work

Despite obtaining encouraging results, the chatbot's emotion classifier has room for improvement. Four emotional contexts are hardly sufficient to cover an acceptable range of human emotions. Collecting more data relative to different contexts, as well as more nuanced feelings, would thus be necessary to train a more competent model. Moreover, as more than one emotion can be felt and expressed at the same time, the classification problem could be cast as a multi-label one [44].

In addition, as highlighted by trial participants, individuals may not necessarily feel better or worse after completing SAT protocols. When asking for feedback, the bot should therefore accept as a valid answer the fact that a user might have detected no change in their mood.

Lastly, to increase the safety of the platform and its applicability in wider contexts, future implementations should include a mechanism to recognise and respond to any user input suggestive of self-harm or suicidal thoughts. Such a mechanism could consist of a keyword-based model for detecting terms commonly associated with self-harming [45]. Upon detection of any of these terms, the bot should promptly direct users to dedicated emotional support and/or suicide prevention services available in the country where they are located (this may require collection of IP addresses or other geolocation data).
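Such a safeguard could start from something as simple as the sketch below; the term list is a small illustrative stand-in rather than a clinically validated lexicon like the one analysed in [45].

from typing import Optional

SELF_HARM_TERMS = {"hurt myself", "self-harm", "end my life", "suicide"}  # stand-in list

HELPLINE_MESSAGE = ("It sounds like you may be going through something very "
                    "difficult. Please consider contacting a suicide prevention "
                    "or emotional support service in your country.")

def check_for_self_harm(user_text: str) -> Optional[str]:
    # Return a signposting message if the input matches any risk term.
    text = user_text.lower()
    if any(term in text for term in SELF_HARM_TERMS):
        return HELPLINE_MESSAGE
    return None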
ACKNOWLEDGMENT

The authors would like to thank Noor Research (Atlanta, USA) for funding this work and the Empowered Human Foundation (Canada) for sponsoring the trial.

REFERENCES

[1] GBD 2017 Disease and Injury Incidence and Prevalence Collaborators, "Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017," Lancet, vol. 392, pp. 1789–1858, 2018.
[2] A. Abbott, "COVID's mental-health toll: how scientists are tracking a surge in depression," Nature, vol. 590, pp. 194–195, 2021.
[3] J. Krupnick and S. Melnikoff, "Psychotherapy with low-income patients: Lessons learned from treatment studies," Journal of Contemporary Psychotherapy, vol. 42, pp. 7–15, 2012.
[4] Z. Fu, H. Burger, R. Arjadi, and C. Bockting, "Effectiveness of digital psychological interventions for mental health problems in low-income and middle-income countries: a systematic review and meta-analysis," Lancet Psychiatry, vol. 7, no. 10, pp. 851–864, 2020.
[5] M. Weightman, "Digital psychotherapy as an effective and timely treatment option for depression and anxiety disorders: Implications for rural and remote practice," Journal of International Medical Research, vol. 48, no. 6, 2020.
[6] B. Renn, T. Hoeft, H. Lee, A. Bauer, and P. Arean, "Preference for in-person psychotherapy versus digital psychotherapy options for depression: survey of adults in the U.S.," NPJ Digital Medicine, vol. 2, no. 1, 2019.
[7] C. G. Fairburn and V. Patel, "The impact of digital technology on psychological treatments and their dissemination," Behaviour Research and Therapy, vol. 88, pp. 19–25, 2017.
[8] S. Blumenfield and J. Levin-Scherz, "Digital tools are revolutionizing mental health care in the U.S.," Harvard Business Review, Dec 2020. Available: [Link] digital-tools-are-revolutionizing-mental-health-care-in-the-u-s [Accessed: 14-Sep-2021].
[9] H. Gaffney, W. Mansell, and S. Tai, "Conversational agents in the treatment of mental health problems: Mixed-method systematic review," JMIR Mental Health, vol. 6, no. 10, 2019.
[10] A. S. Miner, A. Milstein, S. Schueller, R. Hegde, C. Mangurian, and E. Linos, "Smartphone-based conversational agents and responses to questions about mental health, interpersonal violence, and physical health," JAMA Internal Medicine, vol. 176, no. 5, pp. 619–625, 2016.
[11] P. Henderson, K. Sinha, N. Angelard-Gontier, N. R. Ke, G. Fried, R. Lowe, and J. Pineau, "Ethical challenges in data-driven dialogue systems," in Proc. of the 2018 AAAI/ACM Conf. on AI, Ethics and Society. New York, NY, USA: ACM, 2018, pp. 123–129.
[12] A. Abd-alrazaq, M. Alajlani, A. Alalwan, B. Bewick, P. Gardner, and M. Househ, "An overview of the features of chatbots in mental health: A scoping review," Int. J. of Medical Informatics, vol. 132, 2019.
[13] S. Hussain, O. Ameri Sianaki, and N. Ababneh, "A survey on conversational agents/chatbots classification and design techniques," in Web, Artificial Intelligence and Network Applications, L. Barolli, M. Takizawa, F. Xhafa, and T. Enokido, Eds. Cham, Switzerland: Springer International Publishing, 2019, pp. 946–956.
[14] A. Edalat, "Self-attachment: A holistic approach to computational psychiatry," in Comput. Neurology and Psychiatry, P. Érdi, B. Sen Bhattacharya, and A. L. Cochran, Eds. Cham: Springer International Publishing, 2017, pp. 273–314.
[15] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research, vol. 21, 2020.
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, 2019. Available: [Link] [Accessed: 14-Sep-2021].
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2020 [Accessed: 14-Sep-2021].
[18] E. Saravia, H. C. T. Liu, Y. H. Huang, J. Wu, and Y. S. Chen, "CARER: Contextualized affect representations for emotion recognition," in Proc. of the 2018 Conf. on Empirical Methods in Natural Lang. Process. Stroudsburg, PA, USA: ACL, 2018, pp. 3687–3697.
[19] L. Alazraki, "SATbot," 2021, GitHub repository. Available: https://[Link]/LisaAlaz/SATbot [Accessed: 29-Sep-2021].
[20] A. Ghachem, "Evaluation of a virtual agent in guiding users from the non-clinical population in self-attachment intervention," 2021, unpublished. Available: [Link] Ali Ghachem [Link] [Accessed: 22-Sep-2021].
[21] M. Kraus, P. Seldschopf, and W. Minker, "Towards the development of a trustworthy chatbot for mental health applications," in MultiMedia Modeling, J. Lokoč, T. Skopal, K. Schoeffmann, V. Mezaris, X. Li, S. Vrochidis, and I. Patras, Eds. Cham, Switzerland: Springer International Publishing, 2021, pp. 354–366.
[22] K. Denecke, S. Vaaheesan, and A. Arulnathan, "A mental health chatbot for regulating emotions (SERMO) – concept and usability test," IEEE Transactions on Emerging Topics in Computing, vol. 14, no. 8, 2020.
[23] K. H. Ly, A. M. Ly, and G. Andersson, "A fully automated conversational agent for promoting mental well-being: A pilot RCT using mixed methods," Internet Interventions, vol. 10, pp. 39–46, 2017.
[24] F. Morbini, E. Forbell, D. DeVault, K. Sagae, D. Traum, and A. Rizzo, "A mixed-initiative conversational dialogue system for healthcare," in Proc. of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Stroudsburg, PA, USA: ACL, 2012, pp. 137–139.
[25] T. Bauer, E. Devrim, M. Glazunov, W. L. Jaramillo, B. Mohan, and G. Spanakis, "#MeTooMaastricht: Building a chatbot to assist survivors of sexual harassment," in Machine Learning and Knowledge Discovery in Databases, P. Cellier and K. Driessens, Eds. Cham, Switzerland: Springer International Publishing, 2020, pp. 503–521.
[26] K. K. Fitzpatrick, A. Darcy, and M. Vierhile, "Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial," JMIR Mental Health, vol. 4, no. 2, 2017.
[27] A. Ghandeharioun, D. McDuff, M. Czerwinski, and K. Rowan, "Towards understanding emotional intelligence for behavior change chatbots," in Proc. of the 8th Int. Conf. on Affective Computing and Intelligent Interaction (ACII). Los Alamitos, CA, USA: IEEE Comput. Soc. Press, 2019, pp. 8–14.
[28] M. R. Ali, S. Z. Razavi, R. Langevin, A. Al Mamun, B. Kane, R. Rawassizadeh, L. K. Schubert, and E. Hoque, "A virtual conversational agent for teens with autism spectrum disorder," in Proc. of the 20th ACM Int. Conf. on Intelligent Virtual Agents. New York, NY, USA: ACM, 2020, pp. 1–8.
[29] A. Følstad and P. Brandtzaeg, "Users' experiences with chatbots: findings from a questionnaire study," Quality and User Experience, vol. 5, no. 3, pp. 1–14, 2020.
[30] A. Edalat, "Self-attachment intervention: Detailed protocols for SAT," 2021, unpublished. Available: [Link] [Link] [Accessed: 24-Sep-2021].
[31] M. Mikulincer and P. R. Shaver, "An attachment perspective on psychopathology," World Psychiatry, vol. 11, no. 1, pp. 11–15, 2012.
[32] I. Ghaznavi, D. Gillies, D. Nicholls, and A. Edalat, "Photorealistic avatars to enhance the efficacy of self-attachment psychotherapy," in Proc. of the 2020 IEEE Int. Conf. on Artif. Intell. and Virtual Reality (AIVR). Los Alamitos, CA, USA: IEEE Comput. Soc. Press, 2020, pp. 60–67.
[33] N. Polydorou and A. Edalat, "An interactive VR platform with emotion recognition for self-attachment intervention," EAI Endorsed Trans. on Pervasive Health and Technology, 2021.
[34] R. Elliott, A. Bohart, J. Watson, and D. Murphy, "Therapist empathy and client outcome: An updated meta-analysis," Psychotherapy, vol. 55, no. 4, pp. 399–410, 2018.
[35] G. Barrett-Lennard, "The empathy cycle: Refinement of a nuclear concept," J. of Counseling Psychology, vol. 28, pp. 91–100, 1981.
[36] A. B. Kocaballi, S. Berkovsky, J. C. Quiroz, L. Laranjo, H. L. Tong, D. Rezazadegan, A. Briatore, and E. Coiera, "The personalization of conversational agents in health care: Systematic review," Journal of Medical Internet Research, vol. 21, no. 11, 2019.
[37] S. Landes, J. Burton, K. King, and B. Sullivan, "Women's preference of therapist based on sex of therapist and presenting problem: An analogue study," Counselling Psychology Quarterly, vol. 26, 2013.
[38] E. M. Kessler, S. Rahn, and F. Klapproth, "Do young people prefer older psychotherapists?" Eur. J. of Ageing, vol. 17, pp. 119–124, 2020.
[39] A. Sharma, A. Miner, D. Atkins, and T. Althoff, "A computational approach to understanding empathy expressed in text-based mental health support," in Proc. of the 2020 Conf. on Empirical Methods in Natural Lang. Process. Stroudsburg, PA, USA: ACL, 2020, pp. 5263–5276.
[40] L. Alazraki, "A deep-learning assisted empathetic guide for self-attachment therapy," 2021, unpublished. Available: [Link] [Link]/∼ae/papers/Lisa Alazraki [Link] [Accessed: 22-Sep-2021].
[41] M. K. Vijaymeena and K. Kavitha, "A survey on similarity measures in text mining," Machine Learning and Applications: An International Journal, vol. 3, no. 1, pp. 19–28, 2016.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. of the 31st Int. Conf. on Neural Inf. Process. Syst. (NIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2017, pp. 6000–6010.
[43] F. Oseberg, "React-chatbot-kit," 2021, technical documentation. Available: [Link] [Accessed: 14-Sep-2021].
[44] M. Jabreel and A. Moreno, "A deep learning-based approach for multi-label emotion classification in tweets," Applied Sciences, vol. 9, no. 6, 2019.
[45] K. Harvey and B. Brown, "Health communication and psychological distress: Exploring the language of self-harm," Canadian Modern Language Review, vol. 68, p. 316, 2012.
