
CHAPTER 10

Introduction to Data Mining

LEARNING OBJECTIVES

After studying this chapter, you will be able to:

- Define data mining and some common approaches used in data mining.
- Explain how cluster analysis is used to explore and reduce data.
- Explain the purpose of classification methods, how to measure classification performance, and the use of training and validation data.
- Understand k-nearest neighbors and discriminant analysis for classification.
- Describe association rule mining and its use in market basket analysis.
- Use correlation analysis for cause-and-effect modeling.

In an article in Analytics magazine, Talha Omer observed that using a cell phone to make a voice call leaves behind a significant amount of data. "The cell phone provider knows every person you called, how long you talked, what time you called, and whether your call was successful or if it was dropped. It also knows where you are, where you make most of your calls from, which promotion you are responding to, how many times you have bought before, and so on."¹ Considering the fact that the vast majority of people today use cell phones, a huge amount of data about consumer behavior is available. Similarly, many stores now use loyalty cards. At supermarkets, drugstores, retail stores, and other outlets, loyalty cards enable consumers to take advantage of sale prices available only to those who use the cards. However, when they do, the cards leave behind a digital trail of data about purchasing patterns. How can a business exploit these data? If they can better understand patterns and hidden relationships in the data, they can not only understand buying habits but also customize advertisements, promotions, coupons, and so on for each individual customer and send targeted text messages and e-mail offers (we're not talking spam here, but registered users who opt into such messages).

Data mining is a rapidly growing field of business analytics that is focused on better understanding characteristics and patterns among variables in large databases using a variety of statistical and analytical tools. Many of the tools that we have studied in previous chapters, such as data visualization, data summarization, PivotTables, and correlation and regression analysis, are used extensively in data mining. However, as the amount of data has grown exponentially, many other statistical and analytical methods have been developed to identify relationships among variables in large data sets and understand hidden patterns that they may contain.

Many data-mining procedures require advanced statistical knowledge to understand the underlying theory and special software to implement them. Therefore, our focus is on simple applications and understanding the purpose and application of data-mining techniques rather than their theoretical underpinnings. In an optional online supplement, we describe the use of Analytic Solver for implementing data-mining procedures.

¹Talha Omer, "From Business Intelligence to Analytics," Analytics (January/February 2011): 20. www.analytics-magazine.org.

The Scope of Data Mining


Data mining can be considered part descriptive and part predictive analytics. In descriptive analytics, data-mining tools help analysts to identify patterns in data. Excel charts and PivotTables, for example, are useful tools for describing patterns and analyzing data sets; however, they require manual intervention. Regression analysis and forecasting models help us to predict relationships or future values of variables of interest. As some researchers observe, "the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa)."² In most business applications, the purpose of descriptive analytics is to help managers predict the future or make better decisions that will impact future performance, so we can generally state that data mining is primarily a predictive analytic approach.
Some common approaches in data mining include the following:

- Cluster analysis. Some basic techniques in data mining involve data exploration and "data reduction"—that is, breaking down large sets of data into more-manageable groups or segments that provide better insight. We have seen numerous techniques earlier in this book for data exploration and data reduction. For example, charts, frequency distributions and histograms, and summary statistics provide basic information about the characteristics of data. PivotTables, in particular, are very useful in exploring data from different perspectives and for data reduction. Data-mining software provides a variety of tools and techniques for data exploration that complement or extend the concepts and tools we have studied in previous chapters. Cluster analysis involves identifying groups in which the elements of the groups are in some way similar. This approach is often used to understand differences among customers and segment them into homogenous groups. For example, Macy's department stores identified four types of customers defined by their lifestyle: "Katherine," a traditional, classic dresser who doesn't take a lot of risks and likes quality; "Julie," neotraditional and slightly more edgy but still classic; "Erin," a contemporary customer who loves newness and shops by brand; and "Alex," the fashion customer who wants only the latest and greatest (they have male versions also).³ Such segmentation is useful in design and marketing activities to better target product offerings. These techniques have also been used to identify characteristics of successful employees and improve recruiting and hiring practices.
- Classification. Classification is the process of analyzing data to predict how to classify a new data element. An example of classification is spam filtering in an e-mail client. By examining textual characteristics of a message (subject header, key words, and so on), the message is classified as junk or not. Classification methods can help predict whether a credit card transaction may be fraudulent, whether a loan applicant is high risk, or whether a consumer will respond to an advertisement.
- Association. Association is the process of analyzing databases to identify natural associations among variables and create rules for target marketing or buying recommendations. For example, Netflix uses association to understand what types of movies a customer likes and provides recommendations based on the data. Amazon.com also makes recommendations based on past purchases. Supermarket loyalty cards collect data on customers' purchasing habits and print coupons at the point of purchase based on what was currently bought.
- Cause-and-effect modeling. Cause-and-effect modeling is the process of developing analytic models to describe the relationship between metrics that drive business performance—for instance, profitability, customer satisfaction, or employee satisfaction. Understanding the drivers of performance can lead to better decisions to improve performance. For example, the controls group of Johnson Controls, Inc., examined the relationship between satisfaction and contract renewal rates. They found that 91% of contract renewals came from customers who were either satisfied or very satisfied, and customers who were not satisfied had a much higher defection rate. Their model predicted that a one-percentage-point increase in the overall satisfaction score was worth $13 million in service contract renewals annually. As a result, they identified decisions that would improve customer satisfaction. Regression and correlation analysis are key tools for cause-and-effect modeling.

CHECK YOUR UNDERSTANDING

1. What is the purpose of data mining?
2. Explain the basic concepts of cluster analysis, classification, association, and cause-and-effect modeling.

Cluster Analysis
Cluster analysis, also called data segmentation,
is a set of techniques that seek to group
or segment a collection of objects
(that is, observations or records) into subsets or clusters,
such that those within each cluster are more
closely related to one another
than objects
assigned to different clusters. The objects within
clusters should exhibit a high amount of
similarity, whereas those in different clusters will be dissimilar.
Cluster analysis is a data-reduction technique in the sense that it can
take a large num-
ber of observations, such as customer surveys
or questionnaires, and reduce the information
into smaller, homogenous groups that can be interpreted
more easily. The segmentation of
customers into smaller groups, for example, can be used to customize
advertising or pro-
motions. As opposed to many other data-mining techniques, cluster
analysis is primarily
descriptive, and we cannot draw statistical inferences about a sample using
it. In addition,
the clusters identified are not unique and depend on the specific
procedure used; therefore,
it does not result in a definitive answer but only provides new ways
of looking at data.
Nevertheless, it is a widely used technique.
There are two major methods of clustering—hierarchical clustering
and k-means clus-
tering. In hierarchical clustering, the data are not partitioned into
a particular cluster in a
single step. Instead, a series of partitions takes place, which may run from
a single cluster
containing all objects to n clusters, each containing a single object. Hierarchical
clustering
is subdivided into agglomerative clustering methods, which proceed by a series of fusions
of the n objects into groups, and divisive clustering methods, which
separate n objects
successively into finer groupings. Figure 10.1 illustrates the differences between
these two
types of methods. Agglomerative techniques are more commonly used.
An agglomerative hierarchical clustering procedure produces a series of partitions of the
data, Pn, Pn-1, . . . , P1. Pn consists of n single-object clusters, and P1 consists of a single
group containing all n observations. At each particular stage, the method joins together
the two clusters that are closest together (most similar). At the first stage, this consists of
simply joining together the two objects that are closest together. Different methods use dif-
ferent ways of defining distance (or similarity) between clusters.

Figure 10.1: Agglomerative Versus Divisive Clustering

Measuring Distance Between Objects

The most commonly used measure of distance between objects is Euclidean distance. This is an extension of the way in which the distance between two points on a plane is computed as the hypotenuse of a right triangle (see Figure 10.2). The Euclidean distance measure between two points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is

Distance = √[(x1 − y1)² + (x2 − y2)² + . . . + (xn − yn)²]    (10.1)

Some clustering methods use the squared Euclidean distance (that is, without the square root) because it speeds up the calculations.

Applying the Euclidean Distance Measure

Figure 10.3 shows a portion of the Excel file Colleges and Universities. The characteristics of these institutions differ quite widely. Suppose that we wish to cluster them into more homogeneous groups based on the median SAT, acceptance rate, expenditures/student, percentage of students in the top 10% of their high school, and graduation rate. We can use the Euclidean distance measure in formula (10.1) to measure the distance between them. For example, the distance between Amherst and Barnard is

√[(1315 − 1220)² + (22% − 53%)² + (26,636 − 17,653)² + (85 − 69)² + (93 − 80)²] = 8,983.53

We can implement this easily by using the Excel function SUMXMY2(array_x, array_y), which sums the squares of the differences in two corresponding ranges or arrays. Therefore, the distance between Amherst and Barnard would be computed by the Excel formula =SQRT(SUMXMY2(C4:G4, C5:G5)).
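The same calculation can be sketched in a few lines of Python. This is a minimal illustration of formula (10.1), not the chapter's Excel procedure; the record values are the ones used in the example, with the acceptance rate entered as a decimal fraction (22% = 0.22).

import math

def euclidean_distance(x, y):
    # Formula (10.1): square root of the sum of squared differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Median SAT, acceptance rate, expenditures/student, top 10% HS, graduation %
amherst = [1315, 0.22, 26636, 85, 93]
barnard = [1220, 0.53, 17653, 69, 80]

print(round(euclidean_distance(amherst, barnard), 2))   # 8983.53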

Figure 10.2: Computing the Euclidean Distance Between Two Points

Figure 10.3: Portion of the Excel File Colleges and Universities (columns: School, Type, Median SAT, Acceptance Rate, Expenditures/Student, Top 10% HS, Graduation %)

Normalizing Distance Measures

When the data have different orders of magnitude, the distance measure can easily be dominated by the large values. Therefore, it is customary to standardize (or normalize) the data by converting them into z-scores. These are computed in the Excel file Colleges and Universities Cluster Analysis Worksheet. Using the z-scores in formula (10.1), the distance measure between Amherst and Barnard is 3.5284.

A distance matrix between the first five colleges is shown in Table 10.1.
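A sketch of the normalization step in Python is shown below, assuming the records are held in a pandas DataFrame; the five schools and their values come from the single linkage example later in this section. Note that the chapter's z-scores are computed over the entire worksheet of schools, so normalizing only these five rows will not reproduce Table 10.1 exactly—the code only illustrates the mechanics.

import pandas as pd
from scipy.spatial.distance import pdist, squareform

colleges = pd.DataFrame(
    {"Median SAT": [1315, 1220, 1240, 1176, 1300],
     "Acceptance Rate": [0.22, 0.53, 0.36, 0.37, 0.24],
     "Expenditures/Student": [26636, 17653, 17554, 23665, 25703],
     "Top 10% HS": [85, 69, 58, 95, 78],
     "Graduation %": [93, 80, 88, 68, 90]},
    index=["Amherst", "Barnard", "Bates", "Berkeley", "Bowdoin"])

z = (colleges - colleges.mean()) / colleges.std()    # z-scores, column by column
dist = pd.DataFrame(squareform(pdist(z, metric="euclidean")),
                    index=colleges.index, columns=colleges.index)
print(dist.round(4))    # pairwise normalized distances, analogous to Table 10.1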

Clustering Methods
One of the simplest agglomerative hierarchical clustering methods is single linkage clus-
tering, which is an agglomerative method that keeps forming clusters from the individual
objects until only one cluster is left. In the single linkage method, the distance between two
clusters r and s, D(r, s), is defined as the minimum distance between any object in cluster
r and any object in cluster s. In other words, the distance between two clusters is given by
the value of the shortest link between the clusters. Initially, each cluster simply consists of
an individual object. At each stage of clustering, we find the two clusters with the mini-
mum distance between them and merge them together.
Another method that is basically the opposite of single linkage clustering is called
complete linkage clustering. In this method, the distance between clusters is defined as the
distance between the most distant pair of objects, one from each cluster. A third method is
average linkage clustering. Here the distance between two clusters is defined as the average
of distances between all pairs of objects, where each pair is made up of one object from each
group. Other methods are average group linkage clustering, which uses the mean values for each variable to compute distances between clusters, and Ward's hierarchical clustering method, which uses a sum-of-squares criterion.

Table 10.1  Normalized Distance Matrix for First Five Colleges

            Amherst   Barnard   Bates    Berkeley   Bowdoin
Amherst        0      3.5284    2.7007               0.7158
Barnard                  0      1.8790    2.8901     2.9744
Bates                              0      3.9837     2.0615
Berkeley                                     0       3.8954
Bowdoin                                                 0
Different methods generally yield different results, so it is best to experiment and compare the results. In the following example, we illustrate single linkage clustering.

Single Linkage Clustering

We will apply single linkage clustering to the first five schools in the Excel file Colleges and Universities Cluster Analysis worksheet. Looking at the distance matrix in Table 10.1, we see that the smallest distance occurs between Amherst and Bowdoin (0.7158). Thus, we join these two into a cluster. Next, recalculate the distance between this cluster and the remaining colleges by finding the minimum distance between any college in the cluster and the others. This results in the distance matrix shown in Table 10.2. Note that the smallest distance between either Amherst or Bowdoin and Barnard, for instance, is MIN(3.5284, 2.9744). This becomes the distance between the Amherst/Bowdoin cluster and Barnard.

In Table 10.2, the smallest distance is between Barnard and Bates (1.879). Therefore, we join these two colleges into a second cluster. This results in the distance matrix shown in Table 10.3. Next, we join the Amherst/Bowdoin and Barnard/Bates clusters together, as the smallest distance in Table 10.3 is 2.0615. This results in the distance matrix shown in Table 10.4. Only one option remains, that is, to join Berkeley to the cluster of other colleges. If we examine the original data, we can see that Amherst and Bowdoin, and Barnard and Bates, have similar profiles, but that Berkeley is quite different:

School     Type         Median SAT   Acceptance Rate (%)   Expenditures/Student   Top 10% HS   Graduation %
Amherst    Lib Arts        1315              22                 $26,636.00             85            93
Bowdoin    Lib Arts        1300              24                 $25,703.00             78            90
Barnard    Lib Arts        1220              53                 $17,653.00             69            80
Bates      Lib Arts        1240              36                 $17,554.00             58            88
Berkeley   University      1176              37                 $23,665.00             95            68

Table 10.2  Distance Matrix After First Clustering

                    Amherst/Bowdoin   Barnard   Bates    Berkeley
Amherst/Bowdoin            0          2.9744    2.0615    3.8954
Barnard                                  0      1.879     2.8901
Bates                                              0      3.8937
Berkeley                                                     0

Table 10.3  Distance Matrix After Second Clustering

                    Amherst/Bowdoin   Barnard/Bates   Berkeley
Amherst/Bowdoin            0             2.0615        3.8954
Barnard/Bates                               0          2.8901
Berkeley                                                  0

Table 10.4  Distance Matrix After Third Clustering

                                   Amherst/Bowdoin/Barnard/Bates   Berkeley
Amherst/Bowdoin/Barnard/Bates                    0                  2.8901
Berkeley                                                               0
Figure 10.4: Dendrogram for Colleges and Universities Example (clusters merge at distances 0.7158, 1.879, 2.0615, and 2.8901)

At various stages of the clustering process, there are different numbers of clusters. We can visualize this using a dendrogram, which is shown in Figure 10.4. The y-axis measures the intercluster distance. A dendrogram shows the sequence in which clusters are formed as you move up the diagram. At the top, we see that all clusters are merged into a single cluster. If you draw a horizontal line through the dendrogram at any value along the y-axis, you can identify the number of clusters and the objects in each of them. For example, if you draw a line at the distance value of 2.0, you can see that we have the three clusters {Amherst, Bowdoin}, {Barnard, Bates}, and {Berkeley}.
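The same single linkage procedure can be sketched with SciPy rather than the step-by-step Excel calculations; this is an illustrative equivalent, not the chapter's method, and it assumes z holds the z-scored records from the earlier sketch.

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

Z = linkage(z, method="single", metric="euclidean")   # agglomerative, minimum-distance rule

# Cutting the tree at an intercluster distance of 2.0 identifies the clusters
labels = fcluster(Z, t=2.0, criterion="distance")
print(dict(zip(z.index, labels)))

dendrogram(Z, labels=list(z.index))   # plots the merge sequence
plt.ylabel("Intercluster distance")
plt.show()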

CHECK YOUR UNDERSTANDING

1. What is the difference between agglomerative and divisive clustering methods?
2. How are distances between objects measured in cluster analysis?
3. Explain how single linkage clustering works.

Classification
Classification methods seek to classify a categorical outcome into one of two or more cat-
egories based on various data attributes. For each record in a database, we have a categori-
cal variable of interest (for example, purchase or not purchase, high risk or no risk), and a
number of additional predictor variables (age, income, gender, education, assets, etc.). For
a given set of predictor variables, we would like to assign the best value of the categorical
variable. We will be illustrating various classification techniques using the Excel database
Credit Approval Decisions.
A portion of this database is shown in Figure 10.5. In this database, the categorical variable of interest is the decision to approve or reject a credit application. The remaining variables are the predictor variables. Because we are working with numerical data, however, we need to code the Homeowner and Decision fields numerically. We code the Homeowner attribute "Y" as 1 and "N" as 0; similarly, we code the Decision attribute "Approve" as 1 and "Reject" as 0. Figure 10.6 shows a portion of the modified database (Excel file Credit Approval Decisions Coded).

Figure 10.5: Portion of the Excel File Credit Approval Decisions (columns: Homeowner, Credit Score, Years of Credit History, Revolving Balance, Revolving Utilization, Decision)

Figure 10.6: Modified Excel File with Numerically Coded Variables (Excel file Credit Approval Decisions Coded)

An Intuitive Explanation of Classification

To develop an intuitive understanding of classification, we consider only the credit score and years of credit history as predictor variables.

EXAMPLE 10.3  Classifying Credit-Approval Decisions Intuitively

Figure 10.7 shows a chart of the credit scores and years of credit history in the Credit Approval Decisions data. The chart plots the credit scores of loan applicants on the x-axis and the years of credit history on the y-axis. The large bubbles represent the applicants whose credit applications were rejected; the small bubbles represent those that were approved. With a few exceptions (the points at the bottom right corresponding to high credit scores with just a few years of credit history that were rejected), there appears to be a clear separation of the points. When the credit score is greater than 640, the applications were approved, but most applications with credit scores of 640 or less were rejected. Thus, we might propose a simple classification rule: approve an application with a credit score greater than 640.

Another way of classifying the groups is to use both the credit score and years of credit history by visually drawing a straight line to separate the groups, as shown in Figure 10.8. This line passes through the points (763, 2) and (595, 18). Using a little algebra, we can calculate the equation of the line as

Years = −0.095 × Credit Score + 74.66

Therefore, we can propose a different classification rule: whenever Years + 0.095 × Credit Score ≤ 74.66, the application is rejected; otherwise, it is approved. Here again, however, we see some misclassification.

Although this is easy to do intuitively for only two predictor variables, it is more
difficult to do when we have more predictor variables. Therefore, more-sophisticated
procedures are needed, as we will discuss.
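The two rules from Example 10.3 are simple enough to express directly in code. The short sketch below applies them to a few illustrative (credit score, years of credit history) pairs drawn from the new records that are classified later in the chapter.

def rule_credit_score(score):
    # Simple rule: approve when the credit score exceeds 640
    return "Approve" if score > 640 else "Reject"

def rule_score_and_years(score, years):
    # Alternate rule: reject when Years + 0.095 * Credit Score <= 74.66
    return "Reject" if years + 0.095 * score <= 74.66 else "Approve"

for score, years in [(700, 8), (520, 1), (742, 15)]:
    print(score, years, rule_credit_score(score), rule_score_and_years(score, years))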
Figure 10.7: Chart of Credit-Approval Decisions (credit score versus years of credit history)

Figure 10.8: Alternate Credit-Approval Classification Scheme (a straight line separating the approved and rejected applications)

Measuring Classification Performance


As we saw in the previous example, errors may occur with any classification rule, resulting in misclassification. One way to judge the effectiveness of a classification rule is to find the probability of making a misclassification error and summarizing the results in a classification matrix, which shows the number of cases that were classified either correctly or incorrectly.
Classification Matrix for Credit-Approval Classification Rules

In the credit-approval decision example, using just the credit score to classify the applications, we see that in two cases, applicants with credit scores exceeding 640 were rejected, out of a total of 50 data points. Table 10.5 shows a classification matrix for the credit score rule in Figure 10.7. The off-diagonal elements in the table are the frequencies of misclassification, whereas the diagonal elements are the numbers that were correctly classified. Therefore, the probability of misclassification was 2/50, or 0.04. We leave it as an exercise for you to develop a classification matrix for the second classification rule.

Table 10.5  Classification Matrix for Credit Score Rule

                           Predicted Classification
Actual Classification     Decision = 1    Decision = 0
Decision = 1                   23               0
Decision = 0                    2              25
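A classification matrix is easy to tally from lists of actual and predicted classes. The sketch below uses short placeholder lists rather than the chapter's 50 records; applied to those records and the credit score rule, the same code would produce the counts in Table 10.5 and a misclassification rate of 2/50 = 0.04.

import pandas as pd

actual    = pd.Series([1, 1, 0, 0, 1, 0], name="Actual")       # placeholder decisions
predicted = pd.Series([1, 1, 1, 0, 1, 0], name="Predicted")    # placeholder rule output

matrix = pd.crosstab(actual, predicted)    # rows = actual, columns = predicted
print(matrix)

misclassification_rate = (actual != predicted).mean()
print(misclassification_rate)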

The purpose of developing a classification model is to be able to classify new records. After a classification scheme is chosen and the best model is developed based on existing data, we use the predictor variables as inputs to the model to predict the output.

Classifying Records for Credit Decisions Using Credit Scores and Years of Credit History

The Excel files Credit Approval Decisions and Credit Approval Decisions Coded include a small set of new records that we wish to classify in the worksheet Records to Classify. These records are shown in Figure 10.9. If we use the simple credit score rule from Example 10.3, that a score of more than 640 is needed to approve an application, then we would classify the decision for the first, third, and sixth records to be 1 and the rest to be 0. If we use the alternate rule developed in Example 10.3, which includes both the credit score and years of credit history—that is, reject the application if Years + 0.095 × Credit Score ≤ 74.66—then the decisions would be as follows. Only the last record would be approved.

Homeowner   Credit Score   Years of Credit History   Revolving Balance   Revolving Utilization   Years + 0.095 × Credit Score   Decision
    1            700                  8                  $21,000.00               15%                       74.50                Reject
    0            520                  1                   $4,000.00               90%                       50.40                Reject
    1            650                 10                   $8,500.00               25%                       71.75                Reject
                 602                  7                  $16,300.00               70%                       64.19                Reject
    0            549                  2                   $2,500.00               90%                       54.16                Reject
    1            742                 15                  $16,700.00               18%                       85.49                Approve

Classification Techniques
We will describe two different data-mining approaches used for classification: k-nearest
neighbors and discriminant analysis.

Figure 10.9: Additional Data in the Excel File Credit Approval Decisions Coded (the new records in the Records to Classify worksheet)

k-Nearest Neighbors (k-NN)


The k-nearest neighbors (k-NN) algorithm is a classification scheme that attempts to find records in a database that are similar to one we wish to classify. Similarity is based on the "closeness" of a record to numerical predictors in the other records. In the Credit Approval Decisions database, we have the predictors Homeowner, Credit Score, Years of Credit History, Revolving Balance, and Revolving Utilization. We seek to classify the decision to approve or reject the credit application.

Suppose that the values of the predictors of two records X and Y are labeled (x1, x2, . . . , xn) and (y1, y2, . . . , yn). We measure the distance between two records by the Euclidean distance in formula (10.1). Because predictors often have different scales, they are often standardized before computing the distance.

Suppose we have a record X that we want to classify. The nearest neighbor to that record is the one that has the smallest distance from it. The 1-NN rule then classifies record X in the same category as its nearest neighbor. We can extend this idea to a k-NN rule by finding the k-nearest neighbors to each record we want to classify and then assigning the classification as the classification of a majority of the k-nearest neighbors. The choice of k is somewhat arbitrary. If k is too small, the classification of a record is very sensitive to the classification of the single record to which it is closest. A larger k reduces this variability, but making k too large introduces bias into the classification decisions. For example, if k is the count of the entire data set, all records will be classified the same way. Like the smoothing constants for moving average or exponential smoothing forecasting, some experimentation is needed to find the best value of k to minimize the misclassification rate. Data-mining software usually provides the ability to select a maximum value for k and evaluate the performance of the algorithm on all values of k up to the maximum specified value. Typically, values of k between 1 and 20 are used, depending on the size of the data sets, and odd numbers are often used to avoid ties in computing the majority classification of the nearest neighbors.
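Outside of Excel, the same idea can be sketched with scikit-learn; the pipeline below standardizes the predictors before computing distances and tries several odd values of k. The arrays are placeholders standing in for the coded credit data, so the printed accuracies are only illustrative.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))        # placeholder predictors (5 columns, 50 records)
y = rng.integers(0, 2, size=50)     # placeholder coded decisions: 1 = approve, 0 = reject

for k in (1, 3, 5):                 # odd values of k help avoid ties
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    print(k, round(accuracy, 3))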

EXAMPLE 10.6  Using k-NN for Classifying Credit-Approval Decisions

The Excel file Credit Approval Decisions Classification Data provides normalized data for the credit-approval decision records (see Figure 10.10). We would like to classify the new records using the decisions that have already been made. Consider the first new record, 51. Suppose we set k = 1 and find the nearest neighbor to record 51. Using the Euclidean distance measure in formula (10.1), we find that the record having the minimum distance from record 51 is 27. Since the credit decision was to approve, we would classify record 51 as an approval. We can easily implement the search for the nearest neighbor in Excel using the SMALL, MATCH, and

VLOOKUP functions. To find the kth smallest value in an array, use the function SMALL(array, k). To identify the record associated with this value, use the MATCH function with match_type 0 for an exact match. Since the records are numbered 1 through 50, this will identify the correct record number. Then we can use the VLOOKUP function to identify the decision associated with the record. The formulas used in the example file are shown below.

Nearest Neighbors
k    Record                           Decision
1    =MATCH(R25, $O$4:$O$53, 0)       =VLOOKUP(S25, $A$4:$G$53, 7)
2    =MATCH(R26, $O$4:$O$53, 0)       =VLOOKUP(S26, $A$4:$G$53, 7)
3    =MATCH(R27, $O$4:$O$53, 0)       =VLOOKUP(S27, $A$4:$G$53, 7)
4    =MATCH(R28, $O$4:$O$53, 0)       =VLOOKUP(S28, $A$4:$G$53, 7)
5    =MATCH(R29, $O$4:$O$53, 0)       =VLOOKUP(S29, $A$4:$G$53, 7)

Using larger values of k helps to smooth the data and mitigate overfitting. Therefore, if k = 5, we find the following:

Nearest Neighbors
k    Distance    Record    Decision
1    1.04535       27      Approve
2    1.14457       46      Approve
3    1.17652       26      Approve
4    1.22300       23      Approve
5    1.35578        3      Approve

Because all of these records have an approve decision, we would classify record 51 as approve also. In general, we would use the majority decision, although other rules, which can impact classification error rates, can also be applied.

Figure 10.10: Portion of the Credit Approval Decisions Classification Data Excel File (normalized data, normalized records to classify, and nearest-neighbor calculations)

Discriminant Analysis
Discriminant analysis is a technique for classifying a set of observations into predefined classes. The purpose is to determine the class of an observation based on a set of predictor variables. We will illustrate discriminant analysis using the Credit Approval Decisions data. With only two classification groups, we can apply regression analysis. Unfortunately, when there are more than two, linear regression cannot be applied, and special software must be used.

Classifying Credit Decisions Using Discriminant Analysis

For the credit-approval data, we want to model the decision (approve or reject) as a function of the other variables. Thus, we use the following regression model, where Y represents the decision (0 or 1):

Y = b0 + b1 × Homeowner + b2 × Credit Score + b3 × Years Credit History + b4 × Revolving Balance + b5 × Revolving Utilization

The estimated value of the decision variable is called a discriminant score. The regression results are shown in Figure 10.11. Because Y can assume only two values, it cannot be normally distributed; therefore, the statistical results cannot be interpreted in their usual fashion. The estimated regression function is

Y = 0.567 + 0.149 × Homeowner + 0.000465 × Credit Score + 0.00420 × Years Credit History + 0 × Revolving Balance − 1.0986 × Revolving Utilization

For example, the discriminant score for the first record would be calculated as

Y = 0.567 + 0.149 × 1 + 0.000465 × 725 + 0.00420 × 20 + 0 × 11,320 − 1.0986 × 25% = 0.862

The Excel file Credit Approval Decisions Discriminant Analysis shows the results (Figure 10.12). Below the data, we calculate the averages for each group of decisions. (Note that the data have been sorted by decision to facilitate computing averages.) Next, we need a rule for classifying observations using the discriminant scores. This is done by computing a cut-off value so that if a discriminant score is less than or equal to it, the observation is assigned to one group; otherwise, it is assigned to the other group. While there are several ways of doing this, one simple way is to use the midpoint of the average discriminant scores:

Cut-Off Value = (0.9083 + 0.0781)/2 = 0.4932

We see that all approval decisions have discriminant scores above this cut-off value, while all rejection decisions have scores below it. Data-mining software has more sophisticated ways of performing the classifications. We may use this cut-off value to classify the new records. This is shown in Figure 10.13.
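The same approach can be sketched in Python: fit a linear regression to the 0/1 decision, treat the fitted values as discriminant scores, and classify with the midpoint of the two group means as the cut-off. The arrays below are placeholders for the coded credit data, so the output will not match the example's coefficients or the 0.4932 cut-off.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))       # placeholder predictor columns
y = rng.integers(0, 2, size=50)    # placeholder decisions: 1 = approve, 0 = reject

model = LinearRegression().fit(X, y)
scores = model.predict(X)                                       # discriminant scores

cutoff = (scores[y == 1].mean() + scores[y == 0].mean()) / 2    # midpoint of group averages
predicted = (scores > cutoff).astype(int)
print(round(cutoff, 4), (predicted == y).mean())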

Figure 10.11: Regression Results for the Discriminant Analysis Model

Figure 10.12: Discriminant Score Calculations for the Coded Credit Approval Decisions Data (with group averages computed below the data)

Figure 10.13: Classifying New Records Using Discriminant Scores

Homeowner   Credit Score   Years of Credit History   Revolving Balance   Revolving Utilization   Discriminant Score   Decision
    1            700                  8                  $21,000.00               15%                  0.8921          Approve
    0            520                  1                   $4,000.00               90%                 −0.1793          Reject
    1            650                 10                   $8,500.00               25%                  0.7782          Approve
                 602                  7                  $16,300.00               70%                  0.0930          Reject
    0            549                  2                   $2,500.00               90%                 −0.1604          Reject
    1            742                 15                  $16,700.00               18%                  0.9117          Approve

CHECK YOUR UNDERSTANDING

1. Explain the purpose of classification.
2. How is classification performance measured?
3. Explain the k-nearest neighbors algorithm for classification.
4. Describe when regression can be used for discriminant analysis.

Association

Association rules identify attributes that frequently occur together in a given data set. Association rule mining, often called affinity analysis, seeks to uncover interesting associations and/or correlation relationships among large sets of data. A typical and widely used example of association rule mining is market basket analysis. For example, supermarkets routinely collect data using barcode scanners. Each record lists all items bought by a customer for a single-purchase transaction. Such databases consist of a large number of transaction records. Managers would be interested to know if certain groups of items are consistently purchased together. They could use these data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design, and to identify customer segments based on buying patterns. Association rules can also drive recommendations based on past movie rentals or item purchases, for example.

Custom Computer Configuration

Figure 10.14 shows a portion of the Excel file PC Purchase Data. The data represent the configurations for a small number of orders of laptops placed over the Web. The main options from which customers can choose are the type of processor, screen size, memory, and hard drive. A "1" signifies that a customer selected a particular option. If the manufacturer can better understand what types of components are often ordered together, it can speed up final assembly by having partially completed laptops with the most popular combinations of components configured prior to order, thereby reducing delivery time and improving customer satisfaction.

Association rules provide information in the form of if-then statements. These rules
are computed from the data but, unlike the if-then rules of logic, association rules are
probabilistic in nature. In association analysis, the antecedent (the "if" part) and conse-
quent (the "then" part) are sets of items (called item sets) that are disjoint (do not have any
items in common).
To measure the strength of association, an association rule has two numbers that
express the degree of uncertainty about the rule. The first number is called the support
for the (association) rule. The support is simply the number of transactions that include
all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent. The second number is the confidence of the (association) rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. The confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent:

Confidence = P(Consequent | Antecedent) = P(Antecedent and Consequent) / P(Antecedent)    (10.2)

The higher the confidence, the more confident we are that the association rule provides useful information. Another measure of the strength of an association rule is lift, which is defined as the ratio of confidence to expected confidence. Expected confidence is the number of transactions that include the consequent divided by the total number of transactions. Expected confidence assumes independence between the consequent and the antecedent. Lift provides information about the increase in probability of the "then" (consequent) given the "if" (antecedent) part. The higher the lift ratio, the stronger the association rule; a value greater than 1.0 is usually a good minimum.

Figure 10.14: Portion of the Excel File PC Purchase Data (binary indicators for processor, screen size, memory, and hard drive options)

Measuring Strength of Association

Suppose that a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B, and 800 of these include item C. The association rule "if A and B are purchased, then C is also purchased" has a support of 800 transactions (alternatively, 0.8% = 800/100,000) and a confidence of 40% (= 800/2,000). Suppose the number of total transactions for C is 5,000. Then expected confidence is 5,000/100,000 = 5%, and lift = confidence/expected confidence = 40%/5% = 8.

software to identify good ruler


Association rule mining requires special data-mining
technique by examining correla-
However, we can obtain an intuitive understanding of this
tions, as the next example illustrates.

Using Correlations to Explore Associations


Figure 10.15 shows correlation matrix for the data in the antecedent),then a 750-GB hard drive is purchased (the
PC Purchase Data file. Of course, this only shows the cor- consequent).The support for this rule is 8, and the con-
relation between pairs of variables; however, it can provide fidenceis (8/67)/(12/67) 8/12 67%. The expected
some insight for understanding associations. Higher cor- confidence is 17/67; thus, the lift is (8/12)/(17/67) 2.63.
relations have been highlighted. For example, we see that We also see a moderate correlation between the
the highest correlation is between the Intel Core i7 and a Core i7 and an 8-GB memory. Thus, we might propose
750-GB hard drive. Twelve records have the Core i7, and the rule If an Intel Core i7 and 8-GB memory are purchas
17 records have a 750-GB hard drive. If we compute the (the antecedent), then a 750-GB hard drive is purchased
SUMPRODUCT of these two columns in the data, we find (the consequent). In this case, only four records have all
that 8 of the 67 records have both of these components. three;hence the support is 4. Six records have both com-
A simple association rule is If an Intel Core i? is chosen (the ponents of the antecedent; therefore, the confidence WOL
(continued)
Figure 10.15: PC Purchase Data Correlation Matrix
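The counting behind these rules can be done directly on the 0/1 indicator columns. The sketch below assumes a DataFrame pc with columns named as in the figure; the handful of records shown are only illustrative, not the 67 orders in the actual file.

import pandas as pd

pc = pd.DataFrame({"Intel Core i7": [1, 0, 1, 0, 1],
                   "8 GB":          [1, 0, 0, 1, 1],
                   "750 GB":        [1, 0, 1, 0, 0]})

antecedent = pc["Intel Core i7"] == 1
consequent = pc["750 GB"] == 1

support    = (antecedent & consequent).sum()         # count of records with both items
confidence = support / antecedent.sum()
lift       = confidence / (consequent.sum() / len(pc))
print(support, confidence, round(lift, 2))

print(pc.corr().round(3))   # pairwise correlations, as in Figure 10.15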

CHECK YOUR UNDERSTANDING

1. What is association rule mining?
2. Explain the concepts of support, confidence, and lift in association rule mining.

Cause-and-Effect Modeling
reten-
Managers are always interested in results, such as profit, customer satisfaction and
and
tion, and production yield. Lagging measures, or outcomes, tell what has happened
are often external business results, such as profit, market share, and customer satisfac-
tion. Leading measures (performancc drivers) predict what will happen and usually are
internal metrics, such as employee satisfaction, productivity, and turnover. For example,
customer satisfaction results in regard to sales or service transactions are a lagging mea-
sure; employee satisfaction, sales representative behavior, billing accuracy, and so on are
examples of leading measures that might influence customer satisfaction. If employees are
not satisfied, their behavior toward customers could be negatively affected, and customer
satisfaction could be low. If this can be explained using business analytics, managers can
take steps to improve employee satisfaction, leading to improved customer satisfaction.
Therefore, it is important to understand what controllable factors significantly influence
key business performance measures that managers cannot directly control. Correlation
analysis can help to identify these influences and lead to the development of cause-and-
effect models that can help managers make better decisions today that will influence
results tomorrow.
Recall from Chapter 4 that correlation is a measure of the linear relationship between two variables. High values of the correlation coefficient indicate strong relationships between the variables. The following example shows how correlation can be useful in cause-and-effect modeling.
EXAMPLE 10.11  Using Correlation for Cause-and-Effect Modeling

The Excel file Ten Year Survey shows the results of quarterly surveys conducted by a major electronics device manufacturer, a portion of which is shown in Figure 10.16.⁵ The data provide average scores on a 5-point scale for customer satisfaction, overall employee satisfaction, employee job satisfaction, employee satisfaction with their supervisor, and employee perception of training and skill improvement. Figure 10.17 shows the correlation matrix. All the correlations except the one between job satisfaction and customer satisfaction are relatively strong, with the highest correlations between overall employee satisfaction and employee job satisfaction, employee satisfaction with their supervisor, and employee perception of training and skill improvement.

Although correlation analysis does not prove any cause and effect, we can logically infer that a cause-and-effect relationship exists. The data indicate that customer satisfaction, the key external business result, is strongly influenced by internal factors that drive employee satisfaction. Logically, we could propose the model shown in Figure 10.18. This suggests that if managers want to improve customer satisfaction, they need to start by ensuring good relations between supervisors and their employees and focus on improving training and skills.

Figure 10.16: Portion of the Ten Year Survey Data (columns: Customer Satisfaction, Employee Satisfaction, Job Satisfaction, Satisfaction with Supervisor, Training and Skill Improvement)

Figure 10.17: Correlation Matrix of Ten Year Survey Data

CHECK YOUR UNDERSTANDING

1. What is the difference between a leading and lagging measure?
2. How is correlation used in cause-and-effect modeling?

⁵Based on a description of a real application by Steven H. Hoisington and Tse-His Huang, "Customer Satisfaction and Market Share: An Empirical Case Study of IBM's AS/400 Division," in Earl Naumann and Steven H. Hoisington (eds.), Customer-Centered Six Sigma (Milwaukee, WI: ASQ Quality Press, 2001). The data used in this example are fictitious, however.

Figure 10.18: Cause-and-Effect Model linking Satisfaction with Supervisor, Training and Skill Improvement, Job Satisfaction, and Employee Satisfaction to Customer Satisfaction

ANALYTICS IN PRACTICE: Successful Business Applications of Data Mining⁶

Many different companies use data mining to segment customers, identify the most profitable types of customers, reduce costs, and enhance customer relationships through improved marketing efforts. Some successful application areas of data mining include the following:

- A pharmaceutical company analyzed sales force activity data to better target high-value physicians and determine which marketing activities will have the greatest impact. Sales representatives can use the results to plan their schedules and promotional activities.
- A credit-card company used data mining to analyze customer transaction data to identify customers most likely to be interested in a new credit product. As a result, costs for mail campaigns decreased by more than 20 times.
- A large transportation company used data mining to segment its customer base and identify the best types of customers for its services. By applying this segmentation to a general business database such as those provided by Dun & Bradstreet, they can develop a prioritized list of prospects for its regional sales force members.
- A large consumer package goods company applied data mining to improve its retail sales process. They used data from consumer panels, shipments, and competitor activity to understand why customers choose different brands and switch stores. Armed with this data, the company can select more effective promotional strategies.
