
Deploying Chatbots in Customer Service: Adoption Hurdles and Simple Remedies

Evgeny Kagan
Carey Business School, Johns Hopkins University
Maqbool Dada
Carey Business School, Johns Hopkins University
Brett Hathaway
Marriott School of Business, Brigham Young University

Problem definition: Despite recent advances in Artificial Intelligence, the use of chatbot technology in customer service continues to face adoption hurdles. This paper explores reasons for these adoption hurdles and tests several service design levers to increase chatbot uptake.

Methodology/results: We use incentivized online experiments to study chatbot uptake in a variety of scenarios. The results of these experiments are threefold. First, people respond positively to improvements in chatbot performance; however, the chatbot channel is utilized less frequently than expected-time minimization would predict. A key driver of this underutilization is the reluctance to engage with a gatekeeper process, i.e., a process with an imperfect initial service stage and possible transfer to a second, expert service stage, a behavior we term gatekeeper aversion. We show that gatekeeper aversion can be further amplified by a secondary hurdle, algorithm aversion. Second, chatbot uptake can be increased by providing customers with average waiting times in the chatbot channel, as well as by being more transparent about chatbot capabilities and limitations. Third, methodologically, we show that chatbot adoption can depend on experimental implementation. In particular, chatbot adoption decreases further as (i) stakes are increased, and (ii) the human/algorithmic nature of the server is manipulated with more realism.

Managerial implications: Our results suggest that firms should continue to prioritize investments in chatbot technology. However, less expensive, process-related interventions can also be effective. These may include being more transparent about the types of queries that are (or are not) suitable for chatbots, emphasizing chatbot reliability and quick resolution times, as well as providing faster live agent access to customers who experienced chatbot failure.

Keywords: human-AI interfaces, technology management, experiments, service operations

1 Introduction

Recent technological advances have significantly increased chatbot capabilities, improved their speed, enabled them to handle more complex, often unstructured customer queries, and reduced training and maintenance costs (Johannsen et al. 2018).¹ These improvements have reduced the staffing needs for live operators, lowering payroll and other costs related to providing live customer support. The cost savings can be substantial: a recent report estimates an average cost reduction of up to $0.70 per customer interaction, and annual savings of 8 billion US dollars in the banking sector alone (Maynard and Crabtree 2020).

¹ This version of the paper makes references to the Electronic Companion (EC). The EC is attached to the SSRN version of the paper; see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4283285.

The technological maturity and the cost savings offered by chatbots have shifted the burden of successful chatbot deployment from AI developers to managers implementing this technology in their organizations. However, academic research into the drivers of chatbot technology adoption remains scarce. While there is a growing literature on human-chatbot interactions (Goot and Pilgrim 2019, Goot et al. 2020, Sheehan et al. 2020, Schanke et al. 2021, Adam et al. 2021, Benke et al. 2022), it is focused mainly on questions related to chatbot design; for example, on whether anthropomorphism (human-likeness) helps or hurts adoption. These studies help developers build chatbots with more desirable appearance and behavior; however, they provide little or no insight into the process design implications of deploying chatbots, their integration into the broader service delivery strategy, and their effects on the cost and performance of a service system. In this paper we seek to address this gap and study chatbot technology from a service operations perspective. We focus on chatbot adoption as a choice among several service channels offered within a service system, each with its own unique processes and customer experiences.

Operationally, chatbot systems resemble gatekeeper systems (Shumsky and Pinker 2003, Freeman et al. 2017, Hathaway et al. 2022), where the chatbot plays the role of a gatekeeper that handles only a subset of the incoming requests, with the remaining requests being diverted to a live, human agent. This is because certain requests may be difficult to communicate or categorize, or because the chatbot may not be authorized to handle certain requests; for example, ones that involve large financial transactions. Thus, the chatbot serves as the entry point to, but not necessarily the final step of, the service encounter, similar to a nurse in a hospital or a front desk receptionist in a hotel. Unlike many healthcare or hospitality settings, which require the patient or customer to go through the gatekeeper to begin service, chatbot operators often allow customers to choose between a live agent and a chatbot. In this study we examine the determinants of this channel choice and test several levers to increase chatbot uptake.

1.1 Study design

The starting point of our investigation is a retrospective survey, in which we ask 400 respondents to describe a recent customer service episode, either with a chatbot or with a live agent. Quantitative and qualitative analyses of their testimonies suggest a key trade-off in channel choice: chatbots are faster to access but have a lower request resolution rate. In contrast, live agents typically require some wait to access but are much more reliable in resolving customer requests. This insight helps motivate a simple model of channel choice (§2) which is then tested in our experiments (§3-5).

The first experiment (§3) focuses on identifying adoption hurdles. In this experiment we present participants with a series of choices between two alternatives. The first alternative (“Channel A”) represents the live (A)gent and involves some waiting in line to access the server. The server then resolves the request with probability 1. The second alternative (“Channel B”) represents the (B)ot and involves no waiting to access the first service stage; however, the server fails with some known probability, requiring additional waiting in line and a second service stage. In both channels, upon successful resolution of the service request the customer receives a fixed monetary reward, which represents service completion. Depending on the parameters in a decision, the expected-time minimizing choice may be either Channel A or Channel B. There are three experimental conditions: a Context treatment, in which the type of the server (human or bot) is explicitly revealed, a No Context treatment, in which all contextual cues are removed and participants choose between two visually identical (but process-differentiated) channels, and a No Context, Deterministic treatment which removes the uncertainty from Channel B. These treatments enable us to separately identify process-related preferences that exist independently of contextual framing from preferences related to the algorithmic nature of the service provider.
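The expected-time benchmark described above can be illustrated with a minimal Python sketch. It assumes that the expected time in Channel A is the wait plus the service time, and that the expected time in Channel B is the first-stage service time plus the failure probability times the additional wait and second-stage service time. All function names and numerical values are hypothetical and are not taken from the experimental parameters.

```python
def expected_time_channel_a(wait_a: float, serve_a: float) -> float:
    """Channel A (live agent): wait in line, then certain resolution."""
    return wait_a + serve_a


def expected_time_channel_b(
    serve_bot: float,
    fail_prob: float,
    wait_after_fail: float,
    serve_agent: float,
) -> float:
    """Channel B (bot): no initial wait; with probability fail_prob the
    customer waits in line and goes through a second service stage."""
    if not 0.0 <= fail_prob <= 1.0:
        raise ValueError("fail_prob must lie in [0, 1]")
    return serve_bot + fail_prob * (wait_after_fail + serve_agent)


def time_minimizing_channel(time_a: float, time_b: float) -> str:
    """Return the channel with the lower expected time.
    Ties are assigned to Channel A (an arbitrary convention for this sketch)."""
    return "A" if time_a <= time_b else "B"


# Hypothetical parameters (in seconds), chosen only for illustration.
time_a = expected_time_channel_a(wait_a=60, serve_a=30)
time_b = expected_time_channel_b(
    serve_bot=20, fail_prob=0.4, wait_after_fail=60, serve_agent=30
)
print(time_minimizing_channel(time_a, time_b))
```

With these illustrative values the bot channel has the lower expected time; raising `fail_prob` or `wait_after_fail` far enough flips the benchmark choice to Channel A, which mirrors how the parameters in a decision determine the expected-time minimizing channel.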

In the second experiment (§4) we focus on potential remedies for chatbot underutilization. Drawing on the literature in behavioral operations and decision theory, we test two alternative designs. In particular, we present participants with choices that are mathematically identical to those in Experiment 1 but change how information is presented. First, drawing on research on operational transparency (Buell and Norton 2011, Buell et al. 2017, Balakrishnan et al. 2022), the Context + No Transparency treatment deliberately reduces operational transparency. In this treatment, the chatbot always suggests a solution for each customer request, with the offered solution being either correct or incorrect. This is different from our Context treatment, where the chatbot is transparent and truthfully reports whether or not it is able to handle a request. Second, in the Context + Nudge treatment we nudge participants to focus on the potential time savings offered by the chatbot by explicitly presenting the total average waiting times for both channels.

In the third experiment (§5) we turn to the methodological challenge of measuring algorithmic aversion in the customer service setting by introducing two treatments that add realism to our experimental setup. In the first treatment (Context + Live) we replicate the Context treatment but use actual humans (research assistants, blind to the experimental hypotheses) who play the role of live agents and who interact with participants using a live chat tool. In the second treatment (Context + Hold) we make salient differences between live agents and chatbots by requiring continuous, physical engagement in Channel A (representing the live agent) while retaining click-based interaction in Channel B (representing the chatbot).

1.2 Key Results and Contributions

The results of our experiments show that Channel B uptake declines as the chatbot service time grows longer, chatbot failure rate increases, or the wait for a live agent following chatbot failure increases. In other words, better operational performance leads to higher uptake. Nonetheless, across all three experiments, Channel B uptake remains considerably below what one would predict from a purely expected-time minimization perspective. Our results are summarized in Table 1.

In Experiment 1 we first show that Channel B underutilization is tied primarily to process-related hurdles. That is, participants are willing to spend more time in the system (in expectation) in order to avoid interacting with a gatekeeper channel, whether or not the decision is contextualized (as a choice between a live agent and a chatbot). We term this behavior gatekeeper aversion. We further decompose gatekeeper aversion into two distinct components: risk aversion (preference for a less uncertain service time duration) and transfer aversion (preference for continuous, rather than fragmented, multi-stage service processes). While risk aversion is well-documented in financial contexts (Holt and Laury 2002, Harrison and Cox 2008), customer behaviors in the presence of uncertain waiting times and fragmented service processes have received little attention in the behavioral literature (Allon and Kremer 2018). Thus, our first theoretical contribution is to document and characterize this important customer preference.

Continuing with Experiment 1, we show that algorithmic context may further affect chatbot uptake. The standard result in the literature is that AI assistance is often underutilized, even when AI performs as well as or better than a human alternative (Dietvorst et al. 2015, Logg et al. 2019, Castelo et al. 2019). We first follow the standard approach in the literature and conduct a series of pre-tests that hold the processes and performance constant across the two channels and vary only their labels and visual cues. These pre-tests do not detect any algorithm aversion, suggesting that the classic result that algorithmic errors loom larger than human errors does not hold in our setting. Nonetheless, we show that algorithm aversion still matters. Specifically, we show that gatekeeper aversion and algorithm aversion may interact to produce significantly lower chatbot uptake than can be explained by gatekeeper aversion alone, particularly when the stakes (waiting times in both channels) are high. Thus, our second theoretical contribution is to show that algorithm aversion can serve as an amplifier, reinforcing the reluctance to use a service channel with a gatekeeper structure.

In Experiment 2 we show that the aversions identified in Experiment 1 can be mitigated by varying how information is presented to customers. In particular, both transparency about chatbot capabilities and the average waiting time nudge can significantly increase chatbot adoption, although their effectiveness varies with time scale. Specifically, operational transparency matters when durations are short (suggesting that its effect is washed out when stakes are higher), whereas the effect of the nudge is more robust.

The results of Experiments 1 and 2 suggest practical ways to increase chatbot adoption: fast-tracking chatbot customers by shortening their wait for a live agent after chatbot failure, communicating operational advantages via explicit, time-based metrics, and providing a candid account of chatbot capabilities rather than attempting to handle all inquiries without disclosure. In §4.4 we use a structural estimation of utility parameters to build a counterfactual model of channel-joining behavior and show that our interventions achieve substantial staffing cost savings (of up to 19.7%) in moderately congested systems. Thus, a practical contribution of our study is to identify inexpensive and easily implementable service design interventions that can increase chatbot adoption and generate substantial cost savings.

In the third experiment we show that increasing the realism of the interactions produces behaviors that are consistent with our original design (with similar aversion magnitudes). However, greater realism introduces a small but significant increase in algorithm aversion compared to the Context treatment. Thus, we contribute to the methodological discourse on measuring algorithmic attitudes by showing that experimental designs that rely on contextual framing alone may underestimate algorithm aversion, compared to designs that involve longer interactions or vary the human versus algorithmic nature of the interaction in a more realistic manner.

Table 1: Results Summary

Hypothesis	Effect	Effect detected?
Experiment 1: Adoption Hurdles (§3)
H1.1	Transfer aversion (component of gatekeeper aversion)	***
H1.2	Risk aversion (component of gatekeeper aversion)	***
Pre-tests of Context manipulation	Algorithm aversion (standalone effect)	n.s.
H1.3	Algorithm aversion (amplifying effect)	n.s. for short durations, ** for long durations
Experiment 2: Remedies (§4)
H2.1	Average waiting time nudge	***
H2.2	Transparency	** for short durations, n.s. for long durations
Experiment 3: Alternative Measurements of Algorithm Aversion (§5)
H3.1	Algorithm aversion (Context + Live)	**
H3.2	Algorithm aversion (Context + Hold)	***

Objectives	Treatments (between-subject), short time durations	Treatments (between-subject), long time durations	Treatment description	No. of subjects (recruited / passed comprehension screening / passed consistency checks)
Experiment 1 (§3): What are key drivers of chatbot uptake in customer service?
	Context	Context	Contextualized channel choice	270 / 252 / 207
	No Context	No Context	Context removed	238 / 227 / 183
	No Context, Deterministic	No Context, Deterministic	Context and uncertainty removed	263 / 253 / 207
Experiment 2 (§4): What can firms do to increase chatbot uptake?
	Context + No Transparency	Context + No Transparency	Chatbot attempts all requests instead of admitting to not having a solution	271 / 254 / 213
	Context + Nudge	Context + Nudge	Added average waiting time information	268 / 252 / 214
Experiment 3 (§5): How does the nature of the service process affect algorithm aversion?
	Context + Live		Real-time chat with human agent	116 / 106 / 91
	Context + Hold		Channel-specific interaction mode	106 / 102 / 86

Pre-tests	No Context, Deterministic Treatment	No Context	Context
	Transfer aversion (H1.1)	Transfer aversion	Transfer aversion
		+ Risk aversion (H1.2)	+ Risk aversion
Algorithm aversion			+ Algorithm aversion (H1.3)

	(1)	(2)	(3)
Dependent Variable:	Channel B (Chatbot Channel)	Channel B (Chatbot Channel)	Channel B (Chatbot Channel)
No Context Treatment	Omitted category	Omitted category	Omitted category
No Context, Deterministic Treatment	2.134***	1.800***	2.585***
	(0.401)	(0.522)	(0.668)
Context Treatment	-0.560	-0.054	-1.116**
	(0.416)	(0.545)	(0.661)
Time scale $=2$	-0.821**
	(0.323)
Channel Performance Controls	Yes	Yes	Yes
Demographic Controls	Yes	Yes	Yes
Sample	Full Sample	Time Scale $=1$	Time Scale $=2$
Observations	19701	9504	10197
Subjects	597	288	309

	(1)	(2)	(3)
Dependent Variable:	Channel B (Chatbot Channel)	Channel B (Chatbot Channel)	Channel B (Chatbot Channel)
Context Treatment	Omitted category	Omitted category	Omitted category
Context + No Transparency Treatment	-0.412	-1.078**	0.225
	(0.378)	(0.521)	(0.529)
Context + Nudge Treatment	0.883***	-0.221	2.049***
	(0.364)	(0.506)	(0.523)
Time scale $=2$	-0.463
	(0.302)
Channel Performance Controls	Yes	Yes	Yes
Demographic Controls	Yes	Yes	Yes
Sample	Full Sample	Time Scale $=1$	Time Scale $=2$
Observations	20922	10428	10494
Subjects	634	316	318

$$U^{A}_{ij}(\bm{\theta}) = r - c_{line}\cdot t^{A}_{{line}_{ij}} - c_{agent}\cdot t^{A}_{{serve_{1}}_{ij}} + \epsilon^{A}_{ij}, \tag{4.1}$$

$$U^{B}_{ij}(\bm{\theta}) = r - c_{bot}\cdot t^{B}_{{serve_{1}}_{ij}} - (1-p^{B}_{ij})\cdot\left(c_{nt} + \beta\cdot\left(c_{line}\cdot t^{B}_{{line}_{ij}} + c_{agent}\cdot t^{B}_{{serve_{2}}_{ij}}\right)\right) + \epsilon^{B}_{ij}, \tag{4.2}$$
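To make the structure of equations (4.1) and (4.2) concrete, the following Python sketch evaluates the deterministic part of each utility and converts the difference into a choice probability. Several points are assumptions rather than statements from the text: $p^{B}_{ij}$ is read as the probability that the chatbot resolves the request (since $1-p^{B}_{ij}$ multiplies the second-stage terms), $c_{nt}$ and $\beta$ are treated as free parameters without further interpretation, and the error terms are assumed to be i.i.d. Type-I extreme value so that a logit form applies. The class and function names and all parameter values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Theta:
    """Utility parameters appearing in (4.1) and (4.2); values are hypothetical."""
    c_line: float   # cost per unit of time spent waiting in line
    c_agent: float  # cost per unit of time served by a live agent
    c_bot: float    # cost per unit of time served by the chatbot
    c_nt: float     # fixed term incurred when the first stage fails
    beta: float     # multiplier on second-stage time costs


def utility_a(r: float, th: Theta, t_line: float, t_serve1: float) -> float:
    """Deterministic part of (4.1), i.e. without the error term."""
    return r - th.c_line * t_line - th.c_agent * t_serve1


def utility_b(r: float, th: Theta, t_serve1: float, p_b: float,
              t_line: float, t_serve2: float) -> float:
    """Deterministic part of (4.2), i.e. without the error term."""
    second_stage = th.c_line * t_line + th.c_agent * t_serve2
    return r - th.c_bot * t_serve1 - (1.0 - p_b) * (th.c_nt + th.beta * second_stage)


def prob_choose_b(v_a: float, v_b: float) -> float:
    """Logit probability of Channel B (ASSUMES Type-I extreme value errors)."""
    d = v_b - v_a
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # numerically stable branch for large negative d
    return e / (1.0 + e)


# Hypothetical parameter values, for illustration only.
theta = Theta(c_line=0.02, c_agent=0.01, c_bot=0.01, c_nt=0.5, beta=1.2)
v_a = utility_a(r=5.0, th=theta, t_line=60, t_serve1=30)
v_b = utility_b(r=5.0, th=theta, t_serve1=20, p_b=0.6, t_line=60, t_serve2=30)
print(round(prob_choose_b(v_a, v_b), 3))
```

In this sketch, setting `c_nt = 0`, `beta = 1`, and equal per-unit time costs reduces the utility comparison to an expected-time comparison, so positive `c_nt` or `beta > 1` would capture a penalty on the gatekeeper channel beyond expected time.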
