Data Mining:
Concepts and Techniques
(3rd ed.)
Chapter 8
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary
Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data are classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
Classification
  predicts categorical class labels (discrete or nominal)
  constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
Numeric prediction
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
  Credit/loan approval
  Medical diagnosis: whether a tumor is cancerous or benign
  Fraud detection: whether a transaction is fraudulent
  Web page categorization: which category a page belongs to
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    The accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set is independent of the training set (otherwise the estimate is inflated by overfitting)
  If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = professor OR years > 6 THEN tenured = yes
Process (2): Using the Model in Prediction
The classifier is applied to the testing data and to unseen data.

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)
Tenured?
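The two steps can be sketched in Python. This is a minimal illustration rather than code from the slides: the function name `predict_tenured` and the tuple layout are assumptions made here, while the rule and the rows are those shown in the model construction and testing tables.

```python
# Classifier from step 1: IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, known label)
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 2a: estimate accuracy by comparing known labels with predictions
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in test_data
)
accuracy = correct / len(test_data)

# Step 2b: classify the unseen tuple (Jeff, Professor, 4)
jeff = predict_tenured("Professor", 4)

print(f"accuracy = {accuracy:.2f}, Jeff tenured? {jeff}")
```

On this test set the rule misclassifies Merlisa (7 years, not tenured), so three of four samples are correct, and Jeff is predicted to be tenured because his rank is Professor.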
Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan's ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Brief Review of Entropy
[Figure: entropy for the two-class case, m = 2]
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Information needed (after using A to split D into v partitions) to classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Information gained by branching on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
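The three quantities above can be computed directly. The sketch below applies them to the 14-tuple Buys_computer training set from the decision tree example; the names `info`, `gain`, `ROWS`, and `ATTRS` are illustrative choices made here, not part of the slides.

```python
from collections import Counter
from math import log2

ATTRS = ("age", "income", "student", "credit_rating")
# Each row: age, income, student, credit_rating, buys_computer (class label)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """Info(D): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at attr_index."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([row[-1] for row in rows]) - info_a

info_d = info([row[-1] for row in ROWS])
gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)

print(f"Info(D) = {info_d:.3f}")
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f}")
print("split on:", best)
```

With 9 "yes" and 5 "no" tuples, age yields the largest gain, which agrees with age being the root of the resulting tree.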
Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
Ex. Suppose a data set with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use the Laplacian correction (or Laplacian estimator)
  Add 1 to each case
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  The corrected probability estimates are close to their uncorrected counterparts
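The arithmetic of the correction can be checked with a short sketch. The function name `laplace` and its parameter `k` are assumptions introduced for illustration; the counts are those of the income example.

```python
from fractions import Fraction

# Observed counts for the income attribute (1000 tuples in total)
counts = {"low": 0, "medium": 990, "high": 10}

def laplace(counts, k=1):
    """Add k to every count, then renormalize over the enlarged total."""
    total = sum(counts.values()) + k * len(counts)
    return {value: Fraction(c + k, total) for value, c in counts.items()}

uncorrected = {v: Fraction(c, sum(counts.values())) for v, c in counts.items()}
corrected = laplace(counts)

for value in counts:
    print(value, uncorrected[value], "->", corrected[value])
```

Adding 1 to each of the three cases raises the denominator from 1000 to 1003, so no estimate is zero and the product of conditional probabilities can no longer collapse to zero because of a single unseen value.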
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
  n_covers = # of tuples covered by R
  n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|   /* D: training data set */
  accuracy(R) = n_correct / n_covers
If more than one rule is triggered, conflict resolution is needed
  Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., the most attribute tests)
  Class-based ordering: decreasing order of prevalence or misclassification cost per class
  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
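As a worked example of coverage and accuracy, the sketch below evaluates rule R on the 14-tuple Buys_computer training set. It assumes that "youth" corresponds to the age value "<=30" in that table; the names `D`, `covers`, and the tuple layout are illustrative.

```python
# (age, student, buys_computer) columns of the 14 training tuples
D = [
    ("<=30", "no", "no"), ("<=30", "no", "no"), ("31..40", "no", "yes"),
    (">40", "no", "yes"), (">40", "yes", "yes"), (">40", "yes", "no"),
    ("31..40", "yes", "yes"), ("<=30", "no", "no"), ("<=30", "yes", "yes"),
    (">40", "yes", "yes"), ("<=30", "yes", "yes"), ("31..40", "no", "yes"),
    ("31..40", "yes", "yes"), (">40", "no", "no"),
]

# R: IF age = youth AND student = yes THEN buys_computer = yes
def covers(t):
    age, student, _ = t
    return age == "<=30" and student == "yes"  # assumption: youth means <=30

covered = [t for t in D if covers(t)]
n_covers = len(covered)
n_correct = sum(1 for t in covered if t[2] == "yes")

coverage = n_covers / len(D)
accuracy = n_correct / n_covers

print(f"coverage(R) = {n_covers}/{len(D)} = {coverage:.3f}")
print(f"accuracy(R) = {n_correct}/{n_covers} = {accuracy:.3f}")
```

Under that assumption R covers 2 of the 14 tuples and classifies both correctly, which shows that a rule can be perfectly accurate while covering only a small part of the data.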
Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

[Figure: the buys_computer decision tree]
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
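The path-to-rule procedure can be sketched as a recursive traversal. The nested-tuple tree encoding and the function name `extract_rules` are assumptions made for this illustration; the tree itself is the buys_computer tree, using the young / mid-age / old labels of the rules above.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string
tree = ("age", {
    "young": ("student", {"no": "no", "yes": "yes"}),
    "mid-age": "yes",
    "old": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit the conjunction collected so far
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN buys_computer = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attribute, value),))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Because every tuple follows exactly one path from the root to a leaf, the five rules produced this way are mutually exclusive and exhaustive.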
Issues Affecting Model Selection
Accuracy
  classifier accuracy: predicting class label
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules