Machine Learning Framework For Customer Purchase Prediction
Article history: Received 25 November 2016; Accepted 17 April 2018; Available online 5 May 2018

Keywords: Analytics; Purchase prediction; Sales forecast; Non-contractual setting; Machine learning

Abstract

Predicting future customer behavior provides key information for efficiently directing resources at sales and marketing departments. Such information supports planning the inventory at the warehouse and point of sales, as well as strategic decisions during manufacturing processes. In this paper, we develop advanced analytics tools that predict future customer behavior in the non-contractual setting. We establish a dynamic and data-driven framework for predicting whether a customer is going to make a purchase at the company within a certain time frame in the near future. For that purpose, we propose a new set of customer-relevant features derived from the times and values of previous purchases. These customer features are updated every month, and state-of-the-art machine learning algorithms are applied for purchase prediction. In our studies, the gradient tree boosting method turns out to be the best performing method. Using a data set containing more than 10,000 customers and a total number of 200,000 purchases, we obtain an accuracy score of 89% and an AUC value of 0.95 for predicting next-month purchases on the test data set.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction

Customer management requires that firms make a careful assessment of the costs and benefits of alternative expenditures and investments, and identify the optimal allocation of resources to marketing and sales actions over time. Decision makers will benefit from decision support models that relate costs and customer purchase behavior, and forecast the value of the customer portfolio (Berger et al., 2002). Thus, knowing who is likely to purchase within the next months is one of the key drivers for efficiently allocating resources at the sales and marketing departments (see e.g. Allenby, Leone, & Jen, 1999). This information is also needed when planning the inventory at the warehouse and/or point of sales, as well as for deciding quantities in the manufacturing processes. Thereby, the non-contractual distinction is of fundamental importance for developing models for customer-base analysis. One of the main challenges in the non-contractual setting is how to differentiate customers who have ended their relationship with the firm from those who are simply in the midst of a pause between transactions.

It is widely accepted by business wisdom and the research literature that it costs five to ten times more to acquire a new customer than to retain an existing customer (Bhattacharya, 1998; Daly, 2002). While the factor itself may vary substantially depending on the business context, retaining customers has received strong attention from both academia and practitioners (see Van den Poel & Lariviere, 2004 for an overview). Thereby, it has been well established that appropriate retention strategies have strong benefits over acquisition approaches (see Ganesh, Arnold, & Reynolds, 2000). However, it has to be noted that retention activities are not necessarily desirable in an unconditional way, since targeting profitable customers can make marketing spending more efficient (Kumar, Venkatesan, & Reinartz, 2008; Mulhern, 1999; Zeithaml, Rust, & Lemon, 2001), even more so if this profitability can be predicted (Reinartz & Kumar, 2003). Among practitioners, it is quite desirable to consider customers' future profitability and responsiveness, specifically in terms of purchase actions, to marketing when allocating resources (Rust, Kumar, & Venkatesan, 2011; Venkatesan & Kumar, 2004).

Firms are encouraged to develop models to predict which customers are more likely to defect (Keaveney & Parthasarathy, 2001;

∗ Corresponding author.
E-mail addresses: [email protected] (A. Martínez), [email protected] (C. Schmuck), [email protected] (S. Pereverzyev Jr.), [email protected] (C. Pirker), [email protected] (M. Haltmeier).
https://doi.org/10.1016/j.ejor.2018.04.034
A. Martínez et al. / European Journal of Operational Research 281 (2020) 588–596 589
Neslin, Gupta, Kamakura, Lu, & Mason, 2006). Once identified, these likely defectors should be targeted with appropriate incentives to convince them to stay (Hadden, Tiwari, Roy, & Ruta, 2007).

1.1. Customer purchase prediction

While purchase prediction has received attention for a long time in consumer research (see e.g., Herniter, 1971), the rise of customer analytics by marketing analysts has revived such issues in recent years (Winer, 2001). As outlined in Platzer and Reutterer (2016), one of the most challenging areas remains the prediction of customer purchases in the non-contractual setting: the current status of the customer is not directly observable at a given time, the available historical record is censored, and customer data tends to vary substantially. During the last years, large improvements in the information technology domain have resulted in the increased availability of customer transaction data (Fader & Hardie, 2009). Initial analyses of these transaction databases are usually descriptive, in the form of basic summary statistics such as the average number of orders or the average order size, and information on the distribution of behaviors across the customer portfolio. Further processing of the customer base may use multivariate statistical methods and data mining tools to identify characteristics of, for instance, heavy buyers, or to determine which groups of products tend to be purchased together (i.e., performing a market-basket analysis).

The next step in terms of data processing is to undertake customer-base analyses that are more predictive in nature (Fader & Hardie, 2009). In this work we develop a machine learning framework for forecasting future purchasing by the firm's customers from a given customer transaction database. For that purpose we first compute a large number of customer features that characterize the customer at a given month. We then apply machine learning algorithms, including logistic Lasso regression (Friedman, Hastie, & Tibshirani, 2010; Tibshirani, 1996), the extreme learning machine (Huang, Zhu, & Siew, 2006) and gradient tree boosting (Chen & Guestrin, 2016; Friedman, 2001), for predicting whether the customer makes a purchase in the upcoming month. Our approach avoids prohibitive customer inquiries that may be costly to acquire in the non-contractual setting, and nevertheless shows performance comparable to state-of-the-art approaches.

While our framework can equally be applied for predicting customer behavior in any future time period, we focus our presentation on predicting next-month purchases. Our research is motivated by its application in a company working in the B2B field that requires decisions on how to deploy its account management, sales and marketing activities. Those activities are done mainly on a monthly basis; consequently, predicting one and two months ahead is suitable to achieve the desired goals. Competitive marketing strategy focuses on how a business should deploy the marketing resources at its disposal to facilitate the achievement and maintenance of competitive positional advantages in the marketplace (Varadarajan & Yadav, 2002). The monthly binning is a natural time frame for most companies and businesses, and is especially important for business domains where the company must decide very fast and where some activities are done on a monthly basis. Among such business domains, we mention fast-moving consumer goods and fast fashion retailers, where the company has to decide rapidly in order to attract consumers; in particular, fast fashion requires introducing interpretations of the runway designs to the stores in a minimum of three to five weeks (Barnes & Lea-Greenwood, 2006). In any case, our framework can be generalized to predicting purchases of a customer within any future time period in a straightforward manner; compare also with Remark 2.3 below.

The application of machine learning or data mining techniques for predictive purposes on the customer base is often analyzed in the customer relationship management and expert systems domain, and customer churn prediction is the most popular objective in this field. The concept of churn and associated statistical implementations have been well studied in B2C business models (see, e.g., Burez & Van den Poel, 2007; Neslin et al., 2006; Verbeke, Dejaeger, Martens, Hur, & Baesens, 2012; Xia & Jin, 2008; Xie, Li, Ngai, & Ying, 2009), especially in a contractual setting. Main industries include retail markets, subscription management, financial services and electronic commerce (see e.g. Chen, Fan, & Sun, 2012 for an overview). This is in line with the general trend of a stronger focus of academia and intelligence approaches on B2C applications (Wiersema, 2013). However, churn prediction is also important in the B2B context, where it has been studied much less; see Jahromi, Stakhovych, and Ewing (2014). In particular, the development of business relationships remains central to B2B companies (Eriksson & Vaghult, 2000). The importance of retention for suppliers becomes even clearer in the B2B context, where customers make larger and more frequent purchases with far higher transactional values (Boles, Barksdale, & Johnson, 1997; Rauyruen & Miller, 2007).

Accurate demand forecasting is a fundamental aspect of supply chain management. In overall terms, methods for estimating the level of supply chain flexibility are a function of varying demand quantities and varying supply lead times (Das & Abdel-Malek, 2003). Most companies manage their inventory starting by forecasting the expected demand quantities by stock keeping unit (SKU). Very often those time series are intermittent (Willemain, Smart, Shockor, & DeSautels, 1994), with many time periods having no demand, and are very difficult to predict (Willemain, Smart, & Schwarz, 2004). This pattern is characteristic of demand for service parts inventories and capital goods. Manufacturers perceive the forecasting of intermittent data to be an important problem. In practice, the standard methods for forecasting intermittent demand are exponential smoothing, moving averages and Croston's method. When possible, by aggregating retail sales one might find strong trend and seasonal patterns; in those cases, traditional methods can be applied for predicting the demand (Alon, Qi, & Sadowski, 2001). The results of our research can support inventory management by enriching the expected demand per SKU with such kind of information.

1.2. Relations to previous work

In the last decade, various machine learning methods for predicting customer retention and profitability have been analyzed in academia, and some of them are often used by practitioners. In most cases those approaches are based on extracting a customer's latent characteristics from its past purchase behavior, with the mindset that observed behavior is the outcome of an underlying stochastic process (Fader & Hardie, 2009). This approach to customer purchase prediction can be called the characteristics approach. Previous studies have analyzed the use of random forest techniques in order to predict customer retention and profitability (Larivière & Van den Poel, 2005) in a financial services and B2C context. Three major predictor categories encompassing potential explanatory variables were considered: past customer behavior, observed customer heterogeneity and variables related to intermediaries. That research found evidence that past customer behavior is more important for generating repeat purchasing and a favorable profitability evolution, while the intermediary's role has a greater impact on the customers' defection proneness. Literature on effective B2B promotions suggests incorporating an enhanced, in-depth view of the complex decision-making setup such as buying center analysis
(Hellman, 2005). Decision making is also more complex with B2B customers, as a company's purchase decision is usually the consequence of a complex decision process and an alignment among stakeholders and business goals, in contrast to the decision process of an end consumer in the B2C domain. It is also important to highlight the very strong influence coming from the industry dynamics and other activities like product launches and campaigns.

Closely related to our work is Jahromi et al. (2014), where the authors develop a model for predicting whether a customer performs a purchase in some prescribed future time frame based on purchase information from the past. They propose customer characteristics such as the number of transactions observed in past time frames, the time of the last transaction, and the relative change in total spending of a customer. They found an adaptive boosting method (Freund, Schapire et al., 1996) to perform best on the tested data, with an AUC value of 0.92. In contrast, in our study we compute a richer set of customer characteristics than the one in Jahromi et al. (2014). These features are listed in Table 2.2 and described in detail in Section 2.3. For our framework, the best performing method (gradient tree boosting) shows an AUC value of 0.95. We point out that we obtain a higher AUC score even though we use a time frame of only one month within which purchases are predicted. This is much smaller than the six-month time frame used in Jahromi et al. (2014). A smaller time frame is beneficial in terms of actionability for a company; however, it also makes the prediction much more complicated. These results demonstrate that our method provides valuable and reliable information for supporting sales and marketing departments also in the short term.

1.3. Outline

The remainder of this article is organized as follows. In Section 2 we formally describe the considered purchase prediction task, see Problem 2.2. In that section we also describe the features characterizing the customers at a specific time. In Section 3 we describe how to solve Problem 2.2 and therefore perform purchase prediction using machine learning tools. In particular, we use the logistic Lasso, the extreme learning machine and gradient tree boosting for model selection. Our framework for purchase prediction is applied in Section 4 to transactional B2B data of 10,000 customers and a total number of 200,000 transactions. Gradient tree boosting turns out to be the best performing model, showing an accuracy score of 88.98% and an AUC value of 0.949. The paper ends with a short discussion presented in Section 5.

2. Formal problem definition and description of our customer features

In this section we establish a mathematical framework that formally describes the customer purchase prediction task. We give particular emphasis to the description of the features characterizing the customer at a certain time instance, which are used for the subsequent predictive analysis.

2.1. Problem description

Suppose that certain customers purchase products or services at a given company. We suppose that the company has a total number of K customers. Any customer is represented by its ID k ∈ K := {1, ..., K}. Here and below, := means equal by definition. The purchases of customer k are characterized by purchase times t_{k,i} and purchase values V_{k,i} for i = 1, ..., N_k, with N_k denoting the total number of purchases of customer k. The whole transaction data of purchases made between month A and month B can be arranged in a list as shown in Table 2.1. Here and below, months are identified with elements in the set ℤ of all integers. Without loss of generality we assume that the purchases of any customer are ordered chronologically, that is t_{k,i1} ≤ t_{k,i2} for all i1 ≤ i2.

Table 2.1
Original transactional data. For every transaction of customer k one stores the time t_{k,i} and value V_{k,i} of its ith purchase made between month A and month B.

Customer ID | Order | Purchase time | Purchase value
k           | i     | t_{k,i}       | V_{k,i}
1           | 1     | t_{1,1}       | V_{1,1}
...         | ...   | ...           | ...
1           | N_1   | t_{1,N_1}     | V_{1,N_1}
...         | ...   | ...           | ...
K           | 1     | t_{K,1}       | V_{K,1}
...         | ...   | ...           | ...
K           | N_K   | t_{K,N_K}     | V_{K,N_K}

With the above notions, the problem under consideration can be stated as follows:

Problem 2.1 (Prediction of customer purchases). Given purchase data P_k := {(t_{k,i}, V_{k,i}) | i = 1, ..., N_k} for every customer k ∈ K between month A and month B (as illustrated in Table 2.1), predict whether a given customer makes a transaction in the month following B.

We address Problem 2.1 using machine learning algorithms. For that purpose we introduce some further notation. We define the binary variable y_{k,τ} that characterizes the purchase of customer k in month τ ∈ [A, B] := {A, A+1, ..., B} by

    y_{k,τ} := 1 if customer k makes a purchase in month τ, and 0 otherwise.   (2.1)

Further, for pairs (k, τ) ∈ K × [A, B] we construct a feature vector x_{k,τ} that characterizes the state of customer k at time τ based on purchase information of customer k up to month τ. Notice that the values y_{k,τ} are only known for τ ≤ B, and that we aim at estimating y_{k,B+1} for the upcoming month B+1. Therefore, Problem 2.1 can be reformulated as follows:

Problem 2.2 (Reformulation as supervised learning problem). Estimate the value of y_{k,B+1} from the feature vector x_{k,B+1} representing the behavior of customer k until month B, and known input-output pairs (x_{k,τ}, y_{k,τ}) for all k ∈ K and certain τ ∈ [A, B].

The efficient solution of Problem 2.2 requires the computation of significant customer features x_{k,τ}[1], ..., x_{k,τ}[M]. In this work we propose a certain set of M = 274 characteristic features that are listed in Table 2.2. A detailed description of these features and their computation is given in Section 2.3.

Remark 2.3 (Predicting different time frames). To keep the presentation simple, in the formal problem description we considered the case of predicting purchases within the following month. In a straightforward manner, Problem 2.1 can be generalized to predicting purchases of a customer within any future time frame [A + M1, A + M2]. In the case M1 < M2 the prediction period consists of several months, while the case M1 = M2 corresponds to purchase prediction within a single month. Next-month purchase prediction as formalized in Problem 2.1 corresponds to the case where M1 and M2 are both equal to one. In the case of a general prediction period, one can again solve a supervised machine learning task similar to Problem 2.2, where the label values (2.1) are modified to reflect the desired time frame. Actually, we also present results for two-month purchase prediction corresponding to M1 = M2 = 2; see Section 5.
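The construction of the labels (2.1) and of the training pairs for Problem 2.2 can be sketched in a few lines. This is an illustrative Python sketch, not the authors' implementation (the study was carried out in R); the tuple layout of the transaction records and the toy month indices are assumptions.

```python
from collections import defaultdict

def make_labels(transactions, A, B):
    """Binary purchase labels y[(k, tau)] as in Eq. (2.1).

    transactions: iterable of (customer_id, month, value) with A <= month <= B.
    Returns a dict mapping (k, tau) -> 1 if customer k purchased in month tau,
    and 0 otherwise, for every observed customer and every tau in [A, B].
    """
    bought = defaultdict(set)
    for k, month, _value in transactions:
        bought[k].add(month)
    return {(k, tau): int(tau in months)
            for k, months in bought.items()
            for tau in range(A, B + 1)}

# Hypothetical toy data: two customers observed between months A = 1 and B = 3.
tx = [(1, 1, 500.0), (1, 3, 120.0), (2, 2, 80.0)]
y = make_labels(tx, A=1, B=3)
# e.g. y[(1, 1)] == 1, y[(1, 2)] == 0, y[(1, 3)] == 1
```

Pairing these labels with feature vectors x_{k,τ} computed from months before τ then yields the supervised data set of Problem 2.2.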
Table 2.2
Characteristic customer features derived from the transactional raw data.

Characteristics related to purchase time
  Number of total purchases: x[1]
  Mean time between purchases: x[2]
  Standard deviation of purchase frequency: x[3]
  Maximal time without purchase: x[4]
  Time since last purchase: x[5]
  Thresholds for classification: x[6], x[7], x[8]
  Frequency classification: x[9]
Characteristics related to purchase value
  Moving averages: x[10], x[11], x[12]
  Maximum values of purchase: x[13], x[14]
  Mean values of purchase: x[15], x[16], x[17]
  Median values of purchase: x[18], x[19], x[20]
  Time frame variations: x[21], x[22]
  Purchase trend: x[23]
Further customer information
  Country of customer: x[24]
Creation of additional variables
  Pairwise products: x[25], ..., x[214]
  Powers of two and three: x[215], ..., x[254]
  Logarithms: x[255], ..., x[274]

2.2. Data binning and smoothing

The customer features will be extracted from the original purchase information, as well as from smoothed versions obtained by moving averages and a polynomial fit. For that purpose we first define the binned purchase data as the sum of all purchases of a customer within a given month,

    v_{k,τ} := Σ_{i: t_{k,i} = τ} V_{k,i}   for k ∈ K and τ ∈ [A, B].

In particular, we have v_{k,τ} = 0 if there is no purchase from customer k in month τ. Otherwise v_{k,τ} is equal to the corresponding purchase value. We use the moving average of order six of the binned purchase values, defined by

    v̄_{k,τ} := (1/6) Σ_{τ'=0}^{5} v_{k,τ−τ'}   for k ∈ K and τ ∈ [A+5, B].

With the computed moving average of the binned purchases we want to have a representative monthly purchase value for each customer. Those values allow us to compare customers in any month within the year, independently of whether they purchased in that month. In the case of our particular data set, it is very likely that a customer buys products at least twice a year. Therefore, a moving average of order six seems to be the best moving window for providing a representative value. Finally, we consider a polynomial regression approximation of order seven,

    v̂_k(t) = θ_0 + θ_1 t + θ_2 t² + ... + θ_7 t⁷.

Here t represents a continuous time variable and v̂_k is constructed such that v̂_k(j) ≈ v̄_{k,j} for j = τ−1, τ−2, ..., τ−T. The coefficients θ_i are determined using the elastic-net method. Actually, we will use v̂_{k,+}(t) := max{1, v̂_k(t)}. One motivation for introducing such a continuous domain representation is the high variability of the customer purchases. It aims at providing a representative for purchases within the customer lifetime that is not affected by the high volatility observed, which is especially high in the B2B domain. With a sufficient number of features, the elastic-net algorithm extracts a representative curve, and the employed regularization and cross validation yield a reliable estimate allowing the extraction of customer characteristics. In our analysis we have found that order seven is sufficient for extracting the purchase trends properly; we have found no significant changes in the final results when using higher-order polynomials. Fig. 2.1 shows an example of a customer's binned purchase data v_{k,τ} together with the corresponding 6th order moving average v̄_{k,τ} and the polynomial fit v̂_{k,+}(t).

2.3. Customer features

We are now ready to formally define the characterizing features x_{k,τ}[1], ..., x_{k,τ}[274] listed in Table 2.2. The features of any customer dynamically depend on the time τ. For their computation we use subsets of the purchase data (original as well as smoothed) containing purchases made between months τ − T and τ − 1, where T is some fixed time period. Formally, we define this past purchase data for τ ∈ [A + T, B + 1] as

    P_{k,τ} := {(t_{k,i}, V_{k,i}) ∈ P_k | τ − T ≤ t_{k,i} ≤ τ − 1}.

We denote the number of the customer's purchases in the time frame [τ − T, τ − 1] by N_{k,τ} and the corresponding purchase data by

    t^τ_{k,i} := t_{k,M_{k,τ}+i}   for i = 1, ..., N_{k,τ},
    V^τ_{k,i} := V_{k,M_{k,τ}+i}   for i = 1, ..., N_{k,τ},

where M_{k,τ} is the total number of purchases made prior to month τ − T.

2.3.1. Characteristics related to the purchase time

We first describe the characteristics related to the times of purchases.

• Number of purchases.
  The first considered feature is the number of the customer's purchases in the time frame [τ − T, τ − 1]. We take
  x_{k,τ}[1] := N_{k,τ} = |P_{k,τ}|,
  the number of elements in P_{k,τ}.

• Mean time between purchases.
  Next we consider the weighted average of the number of time units between purchases in P_{k,τ} (i.e., purchases in the time frame [τ − T, τ − 1]),
  x_{k,τ}[2] := Σ_{i=2}^{N_{k,τ}} w_i Δt^τ_{k,i}.
  Here Δt^τ_{k,i} := t^τ_{k,i} − t^τ_{k,i−1} is the number of time units between the ith and (i−1)th purchase in P_{k,τ}. In this work we propose to choose the weights as w_i := (i−1)² / Σ_{i=2}^{N_{k,τ}} (i−1)².

• Standard deviation of times between purchases.
  Using the same weights w_i as above, we define the weighted standard deviation of the number of time units between purchases in P_{k,τ} as
  x_{k,τ}[3] := ( Σ_{i=2}^{N_{k,τ}} w_i (Δt^τ_{k,i} − x_{k,τ}[2])² )^{1/2}.

• Maximal time without purchase.
  Here we consider the maximum number of time units between purchases in P_{k,τ},
  x_{k,τ}[4] := max{Δt^τ_{k,i} | i = 2, ..., N_{k,τ}}.

• Time since last purchase.
  The next feature measures the number of months since the last purchase in the time frame [τ − T, τ − 1] has been performed,
  x_{k,τ}[5] := τ − 1 − t^τ_{k,N_{k,τ}} if N_{k,τ} ≠ 0, and T otherwise.
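The five time-related features above can be sketched as follows in Python (the paper's implementation is in R; the function below, and in particular its handling of customers with fewer than two purchases in the window, are illustrative assumptions not specified in the text):

```python
import math

def time_features(purchase_months, tau, T):
    """Compute x[1]..x[5] from one customer's purchase months in [tau-T, tau-1].

    purchase_months: chronologically sorted month indices of the purchases.
    Returns (n, mean_gap, std_gap, max_gap, since_last); the gap features
    default to 0.0 when fewer than two purchases fall into the window
    (an assumption, this edge case is not specified in the text).
    """
    window = [t for t in purchase_months if tau - T <= t <= tau - 1]
    n = len(window)                                            # x[1]
    if n == 0:
        return 0, 0.0, 0.0, 0.0, float(T)                      # x[5] = T
    gaps = [window[j] - window[j - 1] for j in range(1, n)]    # delta t_i, i = 2..n
    if gaps:
        # weights w_i = (i-1)^2 / sum_{i=2}^{n} (i-1)^2: later gaps count more
        denom = sum((j + 1) ** 2 for j in range(len(gaps)))
        weights = [(j + 1) ** 2 / denom for j in range(len(gaps))]
        mean_gap = sum(w * g for w, g in zip(weights, gaps))             # x[2]
        std_gap = math.sqrt(sum(w * (g - mean_gap) ** 2
                                for w, g in zip(weights, gaps)))         # x[3]
        max_gap = float(max(gaps))                                       # x[4]
    else:
        mean_gap = std_gap = max_gap = 0.0
    since_last = tau - 1 - window[-1]                                    # x[5]
    return n, mean_gap, std_gap, max_gap, since_last

# e.g. purchases in months 10, 12, 16 with tau = 18 and T = 12 give
# gaps (2, 4), weights (0.2, 0.8), weighted mean gap 3.6, and one month
# since the last purchase
```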
Fig. 2.1. An example of a customer's purchase data (t_{k,i}, V_{k,i}), visualized by asterisks. The corresponding 6th order moving averages v̄_{k,τ} are visualized by circles. The fitted purchase value function v̂_{k,+}(t) is visualized by the dashed curve. (Axes: month vs. monetary value; legend: purchase value, moving average of order 6, elastic net.)
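The binning-and-smoothing pipeline illustrated in Fig. 2.1 can be sketched as follows. This Python sketch is illustrative only: the paper determines the polynomial coefficients by elastic-net regression, whereas here a plain least-squares fit (numpy.polyfit) stands in for it.

```python
import numpy as np

def smooth_purchases(monthly_values):
    """Order-6 moving average and a degree-7 polynomial fit of binned purchases.

    monthly_values: 1-D array v[tau] with one summed purchase value per month.
    Returns (moving_avg, fitted); fitted is clipped from below at 1, mimicking
    v_hat_plus(t) = max(1, v_hat(t)). Needs at least 13 months of data so that
    the degree-7 fit has enough averaged points.
    """
    v = np.asarray(monthly_values, dtype=float)
    # moving average of order six: mean of the current and five preceding months
    moving_avg = np.convolve(v, np.ones(6) / 6.0, mode="valid")
    # degree-7 polynomial fit (least squares here; the paper uses elastic net)
    t = np.arange(len(moving_avg), dtype=float)
    coeffs = np.polyfit(t, moving_avg, deg=7)
    fitted = np.maximum(1.0, np.polyval(coeffs, t))
    return moving_avg, fitted
```

The clipping at 1 matters for the trend feature x[23] below, where the fitted values appear in a denominator.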
• Thresholds for classification.
  We consider certain thresholds for the number of time units between purchases,
  x_{k,τ}[6] := x_{k,τ}[2] + h_1 x_{k,τ}[3],
  x_{k,τ}[7] := x_{k,τ}[2] + h_2 x_{k,τ}[3],
  x_{k,τ}[8] := x_{k,τ}[2] + h_3 x_{k,τ}[3],
  where h_1, h_2, h_3 are some positive numbers. We propose to take h_1 = 2, h_2 = 4 and h_3 = 8.

• Frequency classification.
  The next characteristic is a categorical feature that classifies the customer k according to the purchase frequency. It is defined by
  x_{k,τ}[9] := normal if x_{k,τ}[5] ≤ x_{k,τ}[6]; attrition if x_{k,τ}[6] < x_{k,τ}[5] ≤ x_{k,τ}[7]; at-risk if x_{k,τ}[7] < x_{k,τ}[5] ≤ x_{k,τ}[8]; lost if x_{k,τ}[8] < x_{k,τ}[5].

• Maximum values of purchase.
  Here we take two different characteristics, defined as the maxima over the actual purchases and the polynomial fit, respectively,
  x_{k,τ}[13] := max{V^τ_{k,i} | i = 1, ..., N_{k,τ}},
  x_{k,τ}[14] := max{v̂_{k,+}(t) | t ∈ [τ − T, τ − 1]}.

• Mean values of purchase.
  Here we consider the mean values of the actual, the binned and the fitted purchase values, defined by
  x_{k,τ}[15] := (1/N_{k,τ}) Σ_{i=1}^{N_{k,τ}} V^τ_{k,i},
  x_{k,τ}[16] := (1/T) Σ_{t=τ−T}^{τ−1} v_{k,t},
  x_{k,τ}[17] := (1/T) Σ_{t=τ−T}^{τ−1} v̂_{k,+}(t).
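The threshold-based frequency classification above can be sketched as follows (an illustrative Python fragment; the function name and its argument interface are assumptions, only the proposed values h1 = 2, h2 = 4 and h3 = 8 come from the text):

```python
def frequency_class(mean_gap, std_gap, since_last, h=(2.0, 4.0, 8.0)):
    """Categorical feature x[9] computed from x[2], x[3] and x[5].

    The thresholds x[6], x[7], x[8] are mean_gap + h_j * std_gap with the
    proposed h1 = 2, h2 = 4 and h3 = 8.
    """
    t1, t2, t3 = (mean_gap + hj * std_gap for hj in h)
    if since_last <= t1:
        return "normal"
    if since_last <= t2:
        return "attrition"
    if since_last <= t3:
        return "at-risk"
    return "lost"

# e.g. a mean gap of 3 months with standard deviation 1 gives thresholds
# 5, 7 and 11; a customer silent for 8 months is then classified "at-risk"
```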
We also consider a categorical characteristic for time frame pairs (x, y) from the so-called training data set. Due to data imper-
variation that we define as follows: fections, not all of the training examples will be predicted exactly
by the classifier. In our case, the training data set takes the form
steady if xk,τ [21] < −μ ,
xk,τ [22] within − limits if |xk,τ [21]| ≤ μ , D { xk,τ , yk,τ |k ∈ K , A + T ≤ τ ≤ B} . (3.1)
alternating if xk,τ [21] > μ . Actually, all of our considered methods output a regression
Here μ is some positive value; we propose μ = 0.3. function mapping to the real numbers,
• Purchase trend. φ : RM → [0, 1] ⊆ R : x
→ φ (x ) . (3.2)
We characterize the purchase trend as a categorial variable de-
The output value φ (x) of the regression function can be inter-
pending on the relative change
preted as the probability that a feature vector x corresponds to
vˆ k,+ (τ − 1 ) − vˆ k,+ (τ − 6 ) a next month purchase. From the estimated purchase probabili-
dk,τ
vˆ k,+ (τ − 6 ) ties one constructs the classifier = λ by taking λ (x ) = 1 for
φ (x) > λ and zero otherwise. Here λ ∈ [0, 1] is a certain threshold
of the fitted purchase values. More precisely, we define the pur-
that is selected as tradeoff between sensitivity and specificity. In
chase trend by
⎧ this work we use a threshold of 0.5 for the final classification.
⎪decreasing−− if dk,τ ≤ −a3 For constructing the regression function in (3.2), we apply the
⎪
⎪
⎪
⎪decreasing− if − a3 < dk,τ ≤ −a2 following state-of-the art machine learning algorithms:
⎪
⎨decreasing if − a2 < dk,τ ≤ −a1 • Logistic Lasso regression;
xk,τ [23] stable if − a1 < dk,τ ≤ a1 • Extreme learning machine;
⎪
⎪ if a1 < dk,τ ≤ a2
⎪increasing
⎪ • Gradient tree boosting.
⎪
⎪ if a2 < dk,τ ≤ a3
⎩increasing+ These methods are briefly reviewed in the following subsec-
increasing++ dk,τ > a3 .
tions. Any of these methods will be used in combination with
Here a1 , a2 and a3 are some positive values; we propose to take 10-fold cross validation for estimating optimal values of the pa-
a1 = 0.15, a2 = 0.225, a3 = 0.3. rameters these methods depend on. In particular, applying cross
validation avoids overfitting on the training data set and therefore
2.3.3. Additional characteristics allows to generalize the trained models to predicting customer
Beside the characteristics described above we also use the cate- purchases where the next-month purchase is not known.
gorical characteristic xk, τ [24] denoting the country of the customer We decided on the above classification methods because they
k. In order to further increase the prediction accuracy we compute are totally different from one another, and further are known to
auxiliary variables from the variables xk, τ [m] excluding the four yield high accuracy with reasonable computational effort.
categorical characteristics.
The auxiliary variables are created by applying the following 3.2. Logistic Lasso regression
mathematical operations to the original variables:
For Lasso regression we use the logistic model which is one
• Pairwise products. of the most common models used in the context of classification
Here we form all products xk, τ [m] · xk, τ [m ] of the non- (Hastie, Tibshirani, & Friedman, 2009). We estimated the coeffi-
categorial features with m = m . This yields 19 + 18 + · · · + 1 = cients (β j ) in the logistic model by adding the 1 -penalty term
190 additional variables.
d
• Powers of two and three.
R (β ) β j ,
We further consider powers xk, τ [m]2 and xk, τ [m]3 of all non-
j=1
categorial features. This yields 20 + 20 = 40 additional vari-
ables. which is known as the Lasso (Tibshirani, 1996). The Lasso penalty
d
• Taking Logarithm.
j=1 β j results in variable selection and shrinkage. The purpose
Finally, we add the logarithms of the non-categorical features, log(xk,τ[m]). This yields 20 additional variables.

In summary, we have M = 24 + 190 + 40 + 20 = 274 variables xk,τ[m] characterizing customer k at time τ. Using these variables we will train a classifier that predicts the binary purchase variable y. Although the creation of the artificial variables does not, in principle, increase the information content of the data, it puts the data into a higher-dimensional space and significantly improves the results of the machine learning algorithms. For example, the powers contain interactions between the variables which would otherwise be difficult for the algorithms to find.

3. Application of machine learning algorithms

In this section we solve Problem 2.2 (the formally described purchase prediction problem) with various machine learning algorithms for binary classification.

3.1. Binary classification

A binary classification algorithm constructs a function C : R^M → {0, 1} in such a way that C(x) = y with high probability for

of this penalty is to retain a subset of the characteristics and to discard the rest. This subset selection produces a model that is interpretable and possibly has a lower prediction error than the full model.
For numerically computing the coefficients we use the algorithm for logistic Lasso regression provided in the package glmnet (see Friedman et al., 2010, Chapter 3).

3.3. Extreme learning machine

Another model that we consider is the single-hidden-layer feedforward neural network (SLFN). We use the extreme learning machine algorithm (Huang et al., 2006) for building the SLFN on our training data. The extreme learning machine algorithm has become a very popular research subject in the past years (Huang, 2015). Unlike other algorithms for building neural networks, the extreme learning machine randomly chooses the hidden nodes and analytically determines the output weights of the SLFN. The extreme learning machine provides good theoretical performance at a very fast learning speed.
For our results we use the implementation of this algorithm provided in the package elmNN (see Gosso & Martinez-de-Pison, 2012).
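The paper builds the SLFN with the elmNN R package. The core idea of the extreme learning machine, a randomly chosen and fixed hidden layer whose output weights are determined analytically by least squares, can be sketched in a few lines; this is a minimal illustration in Python on toy data, not the paper's implementation (function names and data are ours):

```python
import numpy as np

def elm_train(X, y, n_hidden=40, seed=0):
    """Extreme learning machine: random fixed hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))    # random input weights (never trained)
    b = rng.normal(size=n_hidden)                  # random hidden biases
    H = np.tanh(X @ W + b)                         # hidden-layer activation matrix
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights: argmin ||H beta - y||
    return W, b, beta

def elm_predict(X, W, b, beta):
    return (np.tanh(X @ W + b) @ beta >= 0.5).astype(int)

# Toy data: label is 1 iff the two features sum to more than 1
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = (X.sum(axis=1) > 1).astype(float)
W, b, beta = elm_train(X, y)
train_acc = (elm_predict(X, W, b, beta) == y).mean()
```

Because only the least-squares step involves fitting, training is essentially a single linear solve, which is what makes the method so fast.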
594 A. Martínez et al. / European Journal of Operational Research 281 (2020) 588–596
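The subset-selection behaviour of the logistic Lasso described in Section 3.2 (which the paper fits with the R package glmnet) can be reproduced with a small proximal-gradient sketch; the soft-thresholding step is exactly what drives coefficients of irrelevant features to zero. The data, penalty value, and function names below are illustrative assumptions:

```python
import numpy as np

def logistic_lasso(X, y, lam=0.1, lr=0.1, n_iter=3000):
    """L1-penalized logistic regression fitted by proximal gradient descent (ISTA)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted purchase probabilities
        w = w - lr * (X.T @ (p_hat - y)) / n       # gradient step on the logistic loss
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-thresholding (L1 prox)
    return w

# Toy data: 10 candidate features, but only the first two carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(float)
w = logistic_lasso(X, y)
selected = np.flatnonzero(np.abs(w) > 1e-8)        # retained subset of features
acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y.astype(bool)).mean()
```

The retained subset should contain the two informative features, illustrating why the Lasso model is interpretable: the surviving coefficients name the characteristics that matter.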
4. Results
We apply the developed framework for the prediction of customer purchases on transactional data provided by a large manufacturer located in central Europe. The data have been gathered from transactions of the B2B unit, which have been recorded from January 2009 until May 2015. We only consider transactions of customers whose first purchase was at least six months ago, due to the lack of sufficient information in the other cases.

The transactions belong to K = 10136 different customers from 125 different countries. As the time unit we consider a month, as it is not very common in the considered data to have a customer with more than one purchase per month. If a customer has more than one purchase in a month, then for the actual purchase values Vi we take the sum of the purchase values in the considered month. After this monthly aggregation, the data set contains in total 192,470 orders for all customers. We take January 2009 as month A = 1, such that May 2015 corresponds to month 77. The time period for computing the feature values is taken as T = 24.

Table 4.3
Confusion matrix for the gradient tree boosting method evaluated on the independent test data for the prediction of purchases in April 2015. The total prediction accuracy computes to 88.98% and the AUC equals 0.949.

Gradient tree boosting    Actual purchase: Yes (%)    Actual purchase: No (%)
Predicted: Yes            21.70                       6.37
Predicted: No             4.66                        67.28

All computations have been performed on a virtual machine on an ESX cluster with 12 cores and 60 gigabytes of RAM. The operating system is SUSE Linux Enterprise Server, and we have run the scripts using RStudio Server with R version 3.1.2 underneath. The computation times for training a single model are 6 minutes for the Lasso, about one minute for the extreme learning machine, and 2.5 minutes for gradient tree boosting.
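The headline figures can be checked directly from the cells of Table 4.3. Reading the rows as the model's predictions and the columns as the actual outcome (all cells are percentages of the test set), accuracy is the diagonal mass; precision and recall for the purchase class follow from the same cells and are derived here as an illustration (the paper reports only accuracy and AUC):

```python
# Cells of Table 4.3, in percent of the test set
pred_yes_actual_yes = 21.70   # correctly predicted purchases
pred_yes_actual_no  = 6.37    # predicted purchase, none occurred
pred_no_actual_yes  = 4.66    # missed purchases
pred_no_actual_no   = 67.28   # correctly predicted non-purchases

accuracy  = (pred_yes_actual_yes + pred_no_actual_no) / 100   # diagonal mass -> 0.8898, as reported
precision = pred_yes_actual_yes / (pred_yes_actual_yes + pred_yes_actual_no)
recall    = pred_yes_actual_yes / (pred_yes_actual_yes + pred_no_actual_yes)
```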
side, Agrawal, Nottebohm, and West (2010) noted that 1.2% of revenue represents a robust benchmark for travel cost in the capital goods industry; Berard (2014) found that 10% and more of the annual company budget can be attributed to travel & entertainment, where the lion's share is linked to sales. Saving up to 20% on average on these costs, or reinvesting them into customers who are ready to buy, represents a very significant improvement lever.
References

Allenby, G. M., Leone, R. P., & Jen, L. (1999). A dynamic model of purchase timing with application to direct marketing. Journal of the American Statistical Association, 94(446), 365–374.
Alon, I., Qi, M., & Sadowski, R. J. (2001). Forecasting aggregate retail sales: A comparison of artificial neural networks and traditional methods. Journal of Retailing and Consumer Services, 8(3), 147–156.
Barnes, L., & Lea-Greenwood, G. (2006). Fast fashioning the supply chain: Shaping the research agenda. Journal of Fashion Marketing and Management: An International Journal, 10(3), 259–271.
Berard, L. (2014). The travel and expense management guide for 2014: Trends for the future. Last retrieved 31.10.2017, http://www.aberdeenessentials.com/opspro-essentials/the-travel-and-expense-management-guide-for-2014-trends-for-the-future.
Berger, P. D., Bolton, R. N., Bowman, D., Briggs, E., Kumar, V., Parasuraman, A., & Terry, C. (2002). Marketing actions and the value of customer assets: A framework for customer asset management. Journal of Service Research, 5(1), 39–54.
Bhattacharya, C. B. (1998). When customers are members: Customer retention in paid membership contexts. Journal of the Academy of Marketing Science, 26(1), 31–44.
Boles, J. S., Barksdale, H. C., & Johnson, J. T. (1997). Business relationships: An examination of the effects of buyer-salesperson relationships on customer retention and willingness to refer and recommend. Journal of Business & Industrial Marketing, 12(3/4), 253–264.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Buckinx, W., & Van den Poel, D. (2005). Customer base analysis: Partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research, 164(1), 252–268.
Burez, J., & Van den Poel, D. (2007). CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services. Expert Systems with Applications, 32(2), 277–288.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. arXiv:1603.02754.
Chen, Z., Fan, Z., & Sun, M. (2012). A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. European Journal of Operational Research, 223(2), 461–472.
Daly, J. L. (2002). Pricing for profitability: Activity-based pricing for competitive advantage. John Wiley & Sons.
Das, S. K., & Abdel-Malek, L. (2003). Modeling the flexibility of order quantities and lead-times in supply chains. International Journal of Production Economics, 85(2), 171–181.
Eriksson, K., & Vaghult, A. L. (2000). Customer retention, purchasing behavior and relationship substance in professional services. Industrial Marketing Management, 29(4), 363–372.
Fader, P. S., & Hardie, B. G. (2009). Probability models for customer-base analysis. Journal of Interactive Marketing, 23(1), 61–69.
Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771–780.
Freund, Y., Schapire, R. E., et al. (1996). Experiments with a new boosting algorithm. In Proceedings of the thirteenth international conference on machine learning (pp. 148–156). Bari, Italy.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Ganesh, J., Arnold, M. J., & Reynolds, K. E. (2000). Understanding the customer base of service providers: An examination of the differences between switchers and stayers. Journal of Marketing, 64(3), 65–87.
Gosso, A., & Martinez-de-Pison, F. (2012). elmNN: Implementation of Extreme Learning Machine algorithm for single hidden layer feed forward neural networks. R package version 1.
Hadden, J., Tiwari, A., Roy, R., & Ruta, D. (2007). Computer assisted customer churn management: State-of-the-art and future trends. Computers & Operations Research, 34(10), 2902–2917.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Hellman, K. (2005). Strategy-driven B2B promotions. Journal of Business & Industrial Marketing, 20(1), 4–11.
Herniter, J. (1971). A probabilistic market model of purchase timing and brand selection. Management Science, 18(4-part-ii), 102–113.
Huang, G. (2015). What are extreme learning machines? Filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle. Cognitive Computation, 7(3), 263–278.
Huang, G., Zhu, Q., & Siew, C. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70(1), 489–501.
Jahromi, A. T., Stakhovych, S., & Ewing, M. (2014). Managing B2B customer churn, retention and profitability. Industrial Marketing Management, 43(7), 1258–1268.
Keaveney, S. M., & Parthasarathy, M. (2001). Customer switching behavior in online services: An exploratory study of the role of selected attitudinal, behavioral, and demographic factors. Journal of the Academy of Marketing Science, 29(4), 374–390.
Kumar, V., Venkatesan, R., & Reinartz, W. (2008). Performance implications of adopting a customer-focused sales campaign. Journal of Marketing, 72(5), 50–68.
Larivière, B., & Van den Poel, D. (2005). Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29(2), 472–484.
Miguéis, V. L., Camanho, A., & e Cunha, J. F. (2013). Customer attrition in retailing: An application of multivariate adaptive regression splines. Expert Systems with Applications, 40(16), 6225–6232.
Mulhern, F. J. (1999). Customer profitability analysis: Measurement, concentration, and research directions. Journal of Interactive Marketing, 13(1), 25–40.
Neslin, S. A., Gupta, S., Kamakura, W., Lu, J., & Mason, C. H. (2006). Defection detection: Measuring and understanding the predictive accuracy of customer churn models. Journal of Marketing Research, 43(2), 204–211.
Platzer, M., & Reutterer, T. (2016). Ticking away the moments: Timing regularity helps to better predict customer activity. Marketing Science, (5), 799.
Van den Poel, D., & Lariviere, B. (2004). Customer attrition analysis for financial services using proportional hazard models. European Journal of Operational Research, 157(1), 196–217.
Rauyruen, P., & Miller, K. E. (2007). Relationship quality as a predictor of B2B customer loyalty. Journal of Business Research, 60(1), 21–31.
Reinartz, W. J., & Kumar, V. (2003). The impact of customer relationship characteristics on profitable lifetime duration. Journal of Marketing, 67(1), 77–99.
Rust, R. T., Kumar, V., & Venkatesan, R. (2011). Will the frog change into a prince? Predicting future customer profitability. International Journal of Research in Marketing, 28(4), 281–294.
Srivastava, R. K., Shervani, T. A., & Fahey, L. (1999). Marketing, business processes, and shareholder value: An organizationally embedded view of marketing activities and the discipline of marketing. The Journal of Marketing, 63, 168–179.
Stone, M., Woodcock, N., & Wilson, M. (1996). Managing the change from marketing planning to customer relationship management. Long Range Planning, 29(5), 675–683.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
Varadarajan, P. R., & Yadav, M. S. (2002). Marketing strategy and the internet: An organizing framework. Journal of the Academy of Marketing Science, 30(4), 296–312.
Venkatesan, R., & Kumar, V. (2004). A customer lifetime value framework for customer selection and resource allocation strategy. Journal of Marketing, 68(4), 106–125.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1), 211–229.
Wiersema, F. (2013). The B2B agenda: The current state of B2B marketing and a look ahead. Industrial Marketing Management, 42(4), 470–488.
Willemain, T. R., Smart, C. N., & Schwarz, H. F. (2004). A new approach to forecasting intermittent demand for service parts inventories. International Journal of Forecasting, 20(3), 375–387.
Willemain, T. R., Smart, C. N., Shockor, J. H., & DeSautels, P. A. (1994). Forecasting intermittent demand in manufacturing: A comparative evaluation of Croston's method. International Journal of Forecasting, 10(4), 529–538.
Winer, R. S. (2001). A framework for customer relationship management. California Management Review, 43(4), 89–105.
Wübben, M., & Wangenheim, F. (2008). Instant customer base analysis: Managerial heuristics often "get it right". Journal of Marketing, 72(3), 82–93.
Xia, G., & Jin, W. (2008). Model of customer churn prediction on support vector machine. Systems Engineering-Theory & Practice, 28(1), 71–77.
Xie, Y., Li, X., Ngai, E. W. T., & Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36(3), 5445–5449.
Zeithaml, V. A., Rust, R. T., & Lemon, K. N. (2001). The customer pyramid: Creating and serving profitable customers. California Management Review, 43(4), 118–142.