Data Mining
LEARNING OBJECTIVES  After studying this chapter, you will be able to:
• Define data mining and some common approaches used in data mining.
• Explain how cluster analysis is used to explore and reduce data.
• Explain the purpose of classification methods, how to measure classification performance, and the use of training and validation data.
• Understand k-nearest neighbors and discriminant analysis for classification.
• Describe association rule mining and its use in market basket analysis.
• Use correlation analysis for cause-and-effect modeling.
Cluster Analysis
Cluster analysis, also called data segmentation, is a set of techniques that seek to group or segment a collection of objects (that is, observations or records) into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters. The objects within clusters should exhibit a high amount of similarity, whereas those in different clusters will be dissimilar.
Cluster analysis is a data-reduction technique in the sense that it can take a large number of observations, such as customer surveys or questionnaires, and reduce the information into smaller, homogeneous groups that can be interpreted more easily. The segmentation of customers into smaller groups, for example, can be used to customize advertising or promotions. As opposed to many other data-mining techniques, cluster analysis is primarily descriptive, and we cannot draw statistical inferences about a sample using it. In addition, the clusters identified are not unique and depend on the specific procedure used; therefore, cluster analysis does not provide a definitive answer but only new ways of looking at data. Nevertheless, it is a widely used technique.
There are two major methods of clustering—hierarchical clustering and k-means clustering. In hierarchical clustering, the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object. Hierarchical clustering is subdivided into agglomerative clustering methods, which proceed by a series of fusions of the n objects into groups, and divisive clustering methods, which separate the n objects successively into finer groupings. Figure 10.1 illustrates the differences between these two types of methods. Agglomerative techniques are more commonly used.
An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn−1, . . . , P1. The first partition, Pn, consists of n single-object clusters, and the last, P1, consists of a single group containing all n observations. At each particular stage, the method joins together the two clusters that are closest together (most similar). At the first stage, this consists of simply joining together the two objects that are closest together. Different methods use different ways of defining distance (or similarity) between clusters.
Figure 10.1  Agglomerative and Divisive Clustering Methods
Figure 10.3  Portion of the Excel File Colleges and Universities
Distances between records are computed by applying the Euclidean distance formula to the normalized (standardized) values of their attributes:

    distance = √[(x1 − y1)² + (x2 − y2)² + · · · + (xn − yn)²]
A distance matrix between the first five colleges is shown in Table 10.1.
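The same calculation can be scripted outside of Excel. The following is a minimal sketch, not part of the textbook's Excel workbook, that standardizes each attribute for the first five colleges (values listed in the data table below) and computes the pairwise Euclidean distance matrix that Table 10.1 summarizes.

    # Standardize attributes and compute a Euclidean distance matrix
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    schools = ["Amherst", "Bowdoin", "Barnard", "Bates", "Berkeley"]
    # Columns: Median SAT, Acceptance Rate (%), Expenditures/Student,
    #          Top 10% HS, Graduation %
    X = np.array([
        [1315, 22, 26636, 85, 93],
        [1300, 24, 25703, 78, 90],
        [1220, 53, 17653, 69, 80],
        [1240, 36, 17554, 58, 88],
        [1176, 37, 23665, 95, 68],
    ], dtype=float)

    # z-score normalization so that attributes measured on very different
    # scales contribute comparably to the distance (sample std assumed)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Pairwise Euclidean distances between the normalized records
    D = squareform(pdist(Z, metric="euclidean"))
    for name, row in zip(schools, np.round(D, 4)):
        print(name, row)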
Clustering Methods

One of the simplest agglomerative hierarchical clustering methods is single linkage clustering, which keeps forming clusters from the individual objects until only one cluster is left. In the single linkage method, the distance between two clusters r and s, D(r, s), is defined as the minimum distance between any object in cluster r and any object in cluster s. In other words, the distance between two clusters is given by the value of the shortest link between the clusters. Initially, each cluster simply consists of an individual object. At each stage of clustering, we find the two clusters with the minimum distance between them and merge them together.

Another method that is basically the opposite of single linkage clustering is called complete linkage clustering. In this method, the distance between clusters is defined as the distance between the most distant pair of objects, one from each cluster. A third method is average linkage clustering. Here the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group. Other methods are average group linkage clustering, which uses the mean values for each variable to compute distances between clusters, and Ward's hierarchical clustering method, which joins the pair of clusters whose merger produces the smallest increase in the total within-cluster variation.
Data for the first five schools in the Colleges and Universities file:

School      Type        Median SAT   Acceptance Rate   Expenditures/Student   Top 10% HS   Graduation %
Amherst     Lib Arts    1315         22%               $26,636                85           93
Bowdoin     Lib Arts    1300         24%               $25,703                78           90
Barnard     Lib Arts    1220         53%               $17,653                69           80
Bates       Lib Arts    1240         36%               $17,554                58           88
Berkeley    University  1176         37%               $23,665                95           68
At various stages of the clustering process, there are different numbers of clusters. We can visualize this using a dendrogram, which is shown in Figure 10.4. The y-axis measures the intercluster distance. A dendrogram shows the sequence in which clusters are formed as you move up the diagram. At the top, we see that all clusters are merged into a single cluster. If you draw a horizontal line through the dendrogram at any value along the y-axis, you can identify the number of clusters and the objects in each of them. For example, if you draw a line at the distance value of 2.0, you can see that we have the three clusters {Amherst, Bowdoin}, {Barnard, Bates}, and {Berkeley}.
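For readers working in Python rather than Excel, here is a hedged sketch of single linkage clustering and the dendrogram described above. It reuses the normalized matrix Z and the schools list from the earlier distance-matrix sketch, and the cut value of 2.0 mirrors the horizontal line drawn through the dendrogram; the other linkage methods are selected simply by changing the method argument.

    # Hierarchical clustering and dendrogram with SciPy
    # (Z and schools as defined in the previous sketch)
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    import matplotlib.pyplot as plt

    # method can be "single", "complete", "average", or "ward"
    merges = linkage(Z, method="single", metric="euclidean")

    # Cutting the tree at a chosen distance gives cluster memberships,
    # just like drawing a horizontal line through the dendrogram
    labels = fcluster(merges, t=2.0, criterion="distance")
    print(dict(zip(schools, labels)))

    # The y-axis of the dendrogram is the intercluster distance at each merge
    dendrogram(merges, labels=schools)
    plt.ylabel("Distance")
    plt.show()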
Classification

Classification methods seek to classify a categorical outcome into one of two or more categories based on various data attributes. For each record in a database, we have a categorical variable of interest (for example, purchase or not purchase, high risk or no risk) and a number of additional predictor variables (age, income, gender, education, assets, etc.). For a given set of predictor variables, we would like to assign the best value of the categorical variable. We will be illustrating various classification techniques using the Excel database Credit Approval Decisions.

A portion of this database is shown in Figure 10.5. In this database, the categorical variable of interest is the decision to approve or reject a credit application. The remaining variables are the predictor variables. Because we are working with numerical data, however, we need to code the Homeowner and Decision fields numerically. We code the Homeowner attribute "Y" as 1 and "N" as 0; similarly, we code the Decision attribute "Approve" as 1 and "Reject" as 0. Figure 10.6 shows a portion of the modified database (Excel file Credit Approval Decisions Coded).
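If the data were loaded into Python rather than coded by hand in Excel, the same recoding step could be done with a simple mapping. The file name below follows the Excel file described in the text, but reading it this way is an assumption, not part of the textbook's workbook.

    # Numerically code the categorical fields of the credit data
    import pandas as pd

    df = pd.read_excel("Credit Approval Decisions.xlsx")  # assumed file path
    df["Homeowner"] = df["Homeowner"].map({"Y": 1, "N": 0})
    df["Decision"] = df["Decision"].map({"Approve": 1, "Reject": 0})
    print(df.head())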
Figure 10.5  Portion of the Excel File Credit Approval Decisions
Figure 10.6  Modified Excel File with Numerically Coded Variables (Credit Approval Decisions Coded)
Although this is easy to do intuitively for only two predictor variables, it is more
difficult to do when we have more predictor variables. Therefore, more-sophisticated
procedures are needed, as we will discuss.
Figure 10.7  Chart of Credit-Approval Decisions (years of credit history plotted against credit score)
Figure 10.8  Alternate Credit-Approval Classification Scheme
EXAMPLE  Classifying Records for Credit Decisions Using Credit Scores and Years of Credit History

The Excel files Credit Approval Decisions and Credit Approval Decisions Coded include a small set of new records that we wish to classify in the worksheet Records to Classify. These records are shown in Figure 10.9. If we use the simple credit score rule from Example 10.3, that a score of more than 640 is needed to approve an application, then we would classify the decision for the first, third, and sixth records to be 1 and the rest to be 0. If we use the alternate rule developed in Example 10.3, which includes both credit score and years of credit history—that is, reject the application if Years + 0.095 × Credit Score < 74.66—then only the last record would be approved.
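Both rules can be checked with a few lines of code. The sketch below applies the simple credit-score rule and the combined rule to the credit scores and years of credit history of the six new records in the Records to Classify worksheet; it reproduces the decisions stated in the example.

    # Apply the two classification rules from the example
    records = [
        {"score": 700, "years": 8},
        {"score": 520, "years": 1},
        {"score": 650, "years": 10},
        {"score": 602, "years": 7},
        {"score": 549, "years": 2},
        {"score": 742, "years": 15},
    ]

    for r in records:
        simple = "Approve" if r["score"] > 640 else "Reject"
        # Combined rule: reject when Years + 0.095 * Credit Score < 74.66
        combined = "Reject" if r["years"] + 0.095 * r["score"] < 74.66 else "Approve"
        print(r, "simple rule:", simple, "| combined rule:", combined)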
Classification Techniques
We will describe two different data-mining approaches used for classification: k-nearest
neighbors and discriminant analysis.
Figure 10.9  Additional Data in the Excel File Credit Approval Decisions Coded (worksheet Records to Classify)
To find the kth smallest value in an array, use the function SMALL(array, k). To identify the record associated with this value, use the MATCH function with match_type 0 for an exact match. Since the records are numbered 1 through 50, this will identify the correct record number. Then we can use the VLOOKUP function to identify the decision associated with the record. The formulas used in the example file are shown below.
Nearest Neighbors (formulas)

k   Record                          Decision
1   =MATCH(R25, $O$4:$O$53, 0)      =VLOOKUP(S25, $A$4:$G$53, 7)
2   =MATCH(R26, $O$4:$O$53, 0)      =VLOOKUP(S26, $A$4:$G$53, 7)
3   =MATCH(R27, $O$4:$O$53, 0)      =VLOOKUP(S27, $A$4:$G$53, 7)
4   =MATCH(R28, $O$4:$O$53, 0)      =VLOOKUP(S28, $A$4:$G$53, 7)
5   =MATCH(R29, $O$4:$O$53, 0)      =VLOOKUP(S29, $A$4:$G$53, 7)

Nearest Neighbors (results)

k   Distance   Record   Decision
1   1.04535    27       Approve
2   1.14457    46       Approve
3   1.17652    26       Approve
4   1.22300    23       Approve
5   1.35578    3        Approve
Because all of these records have an Approve decision, we would classify record 51 as Approve also. In general, we would use the majority decision, although other rules, which can impact classification error rates, can also be applied.
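The same k-nearest-neighbors logic that the SMALL, MATCH, and VLOOKUP formulas implement in Excel can be expressed as a short function. The sketch below is illustrative only: train_X is assumed to hold the normalized predictor values of the 50 classified records and train_y their Approve/Reject decisions.

    # k-nearest-neighbors classification by majority vote
    import numpy as np

    def knn_classify(new_record, train_X, train_y, k=5):
        # Euclidean distance from the new record to every classified record
        distances = np.sqrt(((train_X - new_record) ** 2).sum(axis=1))
        # Indices of the k smallest distances (the k nearest neighbors)
        nearest = np.argsort(distances)[:k]
        # Majority decision among the neighbors
        votes = [train_y[i] for i in nearest]
        return max(set(votes), key=votes.count)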
Figure 10.10  Portion of the Credit Approval Decisions Classification Data Excel File
Discriminant Analysis

Discriminant analysis is a technique for classifying a set of observations into predefined classes. The purpose is to determine the class of an observation based on a set of predictor variables. We will illustrate discriminant analysis using the Credit Approval Decisions data. With only two classification groups, we can apply regression analysis. Unfortunately, when there are more than two groups, linear regression cannot be applied, and special software must be used.
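A rough sketch of this regression-based approach is shown below. The file name and column labels follow the coded Excel file described earlier, and the classification rule used here (assign a record to the group whose average discriminant score is closer) is one common convention rather than the only possibility.

    # Two-group discriminant analysis via linear regression on the 0/1 decision
    import pandas as pd
    import numpy as np

    df = pd.read_excel("Credit Approval Decisions Coded.xlsx")  # assumed path
    predictors = ["Homeowner", "Credit Score", "Years of Credit History",
                  "Revolving Balance", "Revolving Utilization"]

    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
    y = df["Decision"].to_numpy(dtype=float)  # 1 = Approve, 0 = Reject

    # Least-squares regression coefficients (intercept first)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Discriminant score for each record; classify as Approve when the score
    # is closer to the Approve group's average than to the Reject group's
    scores = X @ coef
    approve_mean, reject_mean = scores[y == 1].mean(), scores[y == 0].mean()
    decisions = np.where(np.abs(scores - approve_mean) < np.abs(scores - reject_mean),
                         "Approve", "Reject")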
Figure 10.11  Regression Results for the Coded Credit Approval Decisions Data (Multiple R = 0.911, R Square = 0.830)
Figure 10.12  Discriminant Calculations for the Coded Credit Approval Decisions Data
Figure 10.13  Classifying New Records Using Discriminant Scores

Records to Classify:

Credit Score   Years of Credit History   Revolving Balance   Revolving Utilization   Discriminant Score   Decision
700            8                         $21,000             15%                     0.8921               Approve
520            1                         $4,000              90%                     −0.1793              Reject
650            10                        $8,500              25%                     0.7782               Approve
602            7                         $16,300             70%                     0.0930               Reject
549            2                         $2,500              90%                     −0.1604              Reject
742            15                        $16,700             18%                     0.9117               Approve
CHECK YOUR UNDERSTANDING

1. Explain the purpose of classification.
2. How is classification performance measured?
3. Explain the k-nearest neighbors algorithm for classification.
4. Describe when regression can be used for discriminant analysis.
Association Rule Mining
Association rules identify attributes that frequently occur together in a given data set. Association rule mining, often called affinity analysis, seeks to uncover interesting associations and/or correlation relationships among large sets of data. A typical and widely used example of association rule mining is market basket analysis. For example, supermarkets routinely collect data using barcode scanners. Each record lists all items bought by a customer for a single-purchase transaction. Such databases consist of a large number of transaction records. Managers would be interested to know if certain groups of items are consistently purchased together. They could use these data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design, and to identify customer segments based on buying patterns.

Association rules provide information in the form of if-then statements. These rules are computed from the data but, unlike the if-then rules of logic, association rules are probabilistic in nature. In association analysis, the antecedent (the "if" part) and consequent (the "then" part) are sets of items (called item sets) that are disjoint (do not have any items in common).

To measure the strength of association, an association rule has two numbers that express the degree of uncertainty about the rule. The first number is called the support for the (association) rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule.
Figure 10.14  Portion of the Excel File PC Purchase Data
(The support is sometimes expressed as a percentage of the total number of records in the database.) One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent. The second number is the confidence of the (association) rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. The confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent:

Confidence = P(Consequent | Antecedent) = P(Antecedent and Consequent) / P(Antecedent)    (10.2)
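The two measures are easy to compute directly from a list of transactions. The sketch below uses a few made-up transactions patterned on the PC Purchase Data attributes (processor, memory, hard drive) to show the support and confidence of one rule; the specific transactions are assumptions for illustration only.

    # Support and confidence of the rule "if antecedent then consequent"
    transactions = [
        {"Intel Core i5", "8 GB", "500 GB"},
        {"Intel Core i7", "8 GB", "750 GB"},
        {"Intel Core i5", "4 GB", "500 GB"},
        {"Intel Core i5", "8 GB", "500 GB"},
    ]
    antecedent = {"Intel Core i5", "8 GB"}
    consequent = {"500 GB"}

    # Count transactions containing the antecedent, and those containing both
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)

    support = n_both                    # sometimes reported as n_both / len(transactions)
    confidence = n_both / n_antecedent  # P(consequent | antecedent)
    print("support =", support, "confidence =", confidence)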
Figure 10.15  PC Purchase Data Correlation Matrix
Cause-and-Effect Modeling

Managers are always interested in results, such as profit, customer satisfaction and retention, and production yield. Lagging measures, or outcomes, tell what has happened and are often external business results, such as profit, market share, and customer satisfaction. Leading measures (performance drivers) predict what will happen and usually are internal metrics, such as employee satisfaction, productivity, and turnover. For example, customer satisfaction results in regard to sales or service transactions are a lagging measure; employee satisfaction, sales representative behavior, billing accuracy, and so on are examples of leading measures that might influence customer satisfaction. If employees are not satisfied, their behavior toward customers could be negatively affected, and customer satisfaction could be low. If this can be explained using business analytics, managers can take steps to improve employee satisfaction, leading to improved customer satisfaction. Therefore, it is important to understand what controllable factors significantly influence key business performance measures that managers cannot directly control. Correlation analysis can help to identify these influences and lead to the development of cause-and-effect models that can help managers make better decisions today that will influence results tomorrow.
Recall from Chapter 4 that correlation is a measure of the linear relationship between two variables. High values of the correlation coefficient indicate strong relationships between the variables. The following example shows how correlation can be useful in cause-and-effect modeling.
Modeling
EXAMPLE1mkti Using Correlation for Cause-and-Effect and employee
perception
with their gupervigor,
regults of
•oneExcel file Ten Year Sutvey Shows the irnprovernent. any cause
quattevlyscnveys conducted by major electronicg of training and gl<ill does not prove
analysis
device jnanufactuter, a portion of which is shown in
Althot.jghcorrelation cause-and-effect
logically infer that a
data pmvide Average scores on and effect, we can customer satisfac-
10.16.5 The data indicate that
relationshipexists. influenced
scale for customer satisfaction, overall enlployee satisfac- business result, is strongly
tion,employee job satisfaction, en)ployee satisfaction with tion, the key external
employee satisfaction. Logically,
drive
theirsupetvisor, and ennployeeperception of training and by internalfactors that 10.18. This
the model shown in Figure
skill improvement.Figure 10.17 shows the correlation
we could propose customer
that if managers want to improve relations
matrix.All the correlations ev cept the one between job suggests
by ensuring good
satisfactionand customer satisfaction are relatively strong, satisfaction, they need to start on
and their employees and focus
withthe highest correlations between overall employee between supervisors
satisfaction and employee job satisfaction, employee improving training and skills.
Figure 10.16  Portion of the Ten Year Survey Data
Figure 10.17  Correlation Matrix of the Ten Year Survey Data

                                 Customer       Employee       Job            Satisfaction       Training and skill
                                 satisfaction   satisfaction   satisfaction   with supervisor    improvement
Customer satisfaction            1
Employee satisfaction            0.49           1
Job satisfaction                 0.15           0.84           1
Satisfaction with supervisor     0.50           0.88           0.61           1
Training and skill improvement   0.53           0.83           0.71           0.77               1
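A correlation matrix like the one in Figure 10.17 can also be produced in a single step with pandas. The file name and column labels below are assumed to match the Ten Year Survey data described in the example.

    # Correlation matrix of the survey measures
    import pandas as pd

    survey = pd.read_excel("Ten Year Survey.xlsx")  # assumed file path
    cols = ["Customer satisfaction", "Employee satisfaction", "Job satisfaction",
            "Satisfaction with supervisor", "Training and skill improvement"]
    print(survey[cols].corr().round(3))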
⁵Based on a description of a real application by Steven H. Hoisington and Tse-His Huang, "Customer Satisfaction and Market Share: An Empirical Case Study of IBM's AS/400 Division," in Earl Naumann and Steven H. Hoisington (eds.), Customer-Centered Six Sigma (Milwaukee, WI: ASQ Quality Press, 2001). The data used in this example are fictitious, however.
Figure 10.18  Cause-and-Effect Model relating satisfaction with supervisor, employee satisfaction, job satisfaction, and customer satisfaction