Concepts and Techniques: Data Mining

Supervised and unsupervised learning are two main types of learning used in data mining. Supervised learning (classification) assigns categorical class labels to new data, numeric prediction estimates unknown or missing continuous values, and unsupervised learning (clustering) looks for classes or clusters in unlabeled data. Ensemble methods are used to improve classification accuracy.

Uploaded by

Hafizur Rahman Dhrubo

Data Mining: Concepts and Techniques (3rd ed.)

Chapter 8

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign & Simon Fraser University

©2011 Han, Kamber & Pei. All rights reserved.

Chapter 8. Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Rule-Based Classification

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

Supervised vs. Unsupervised Learning

Supervised learning (classification)

    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

    New data are classified based on the training set

Unsupervised learning (clustering)

    The class labels of the training data are unknown

    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Prediction Problems: Classification vs. Numeric Prediction

Classification
    Predicts categorical class labels (discrete or nominal)
    Constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
Numeric prediction
    Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
    Credit/loan approval
    Medical diagnosis: whether a tumor is cancerous or benign
    Fraud detection: whether a transaction is fraudulent
    Web page categorization: which category a page belongs to

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
    Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    The set of tuples used for model construction is the training set
    The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
    Estimate the accuracy of the model
        The known label of each test sample is compared with the model's classification
        The accuracy rate is the percentage of test set samples that the model classifies correctly
        The test set must be independent of the training set; otherwise overfitting inflates the accuracy estimate
    If the accuracy is acceptable, use the model to classify new data
Note: if the test set is used to select models, it is called a validation (test) set

Process (1): Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training Data

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model)

IF rank = professor OR years > 6 THEN tenured = yes

Process (2): Using the Model in Prediction

The classifier is applied first to the testing data and then to unseen data.

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
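As a rough illustration of the model-usage step, the sketch below applies the slide's rule (IF rank = professor OR years > 6 THEN tenured = yes) to the four testing tuples and to the unseen tuple (Jeff, Professor, 4). The function and variable names are hypothetical, and returning "no" when the rule does not fire is an assumption, since the slide states only the "yes" case.

```python
# Hypothetical encoding of the slide's classifier; names are illustrative only.
def predict_tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes.

    Assumption: tuples that do not satisfy the rule are labeled "no".
    """
    return "yes" if rank == "Professor" or years > 6 else "no"


# (name, rank, years, known tenured label) taken from the testing data
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def accuracy(rows):
    """Fraction of test tuples whose predicted label equals the known label."""
    correct = sum(
        predict_tenured(rank, years) == label for _, rank, years, label in rows
    )
    return correct / len(rows)


test_accuracy = accuracy(test_set)
jeff_prediction = predict_tenured("Professor", 4)

print(f"accuracy on the test set: {test_accuracy:.2f}")
print(f"Jeff tenured? {jeff_prediction}")
```

On these four tuples the rule misclassifies only Merlisa (7 years, not tenured), so the estimated accuracy is 3/4, and Jeff is predicted to be tenured because his rank is Professor.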

Decision Tree Induction: An Example

Training data set: Buys_computer
The data set follows an example of Quinlan's ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:

age?
    <=30: student?
        no: no
        yes: yes
    31..40: yes
    >40: credit rating?
        excellent: no
        fair: yes

Brief Review of Entropy

[Figure: entropy for the case m = 2]

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
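To make the three quantities concrete, here is a minimal sketch that computes Info(D), Info_A(D), and Gain(A) for each attribute of the 14-tuple buys_computer training set from the decision tree example. The function names and the tuple layout are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# (age, income, student, credit_rating, buys_computer) from the training set
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Info(D): entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_after_split(rows, attr_index):
    """Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    n = len(rows)
    return sum(len(part) / n * info(part) for part in partitions.values())


def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([row[-1] for row in rows]) - info_after_split(rows, attr_index)


info_d = info([row[-1] for row in ROWS])
gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

print(f"Info(D) = {info_d:.3f}")
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
print("split on:", best_attribute)
```

With 9 "yes" and 5 "no" tuples, Info(D) is about 0.940 bits, and age yields the largest gain (about 0.247 bits), which is consistent with age being the root of the example tree.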

Avoiding the Zero-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use the Laplacian correction (or Laplacian estimator)
    Add 1 to each case
        Prob(income = low) = 1/1003
        Prob(income = medium) = 991/1003
        Prob(income = high) = 11/1003
    The corrected probability estimates are close to their uncorrected counterparts
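The arithmetic of the Laplacian correction can be checked with a short sketch. It assumes the add-one form described on the slide (add 1 to each count, so the denominator grows by the number of distinct values); the function name is hypothetical.

```python
from fractions import Fraction


def laplace_estimates(counts):
    """Add 1 to every count; the total grows by the number of distinct values."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(count + 1, total) for value, count in counts.items()}


# Counts of each income value among the 1000 tuples in the slide's example
income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)

for value, prob in corrected.items():
    print(f"Prob(income = {value}) = {prob} ~ {float(prob):.4f}")
```

Exact fractions are used so the results can be compared directly with 1/1003, 991/1003, and 11/1003. No estimate is zero any more, so a single unseen value can no longer force the product of conditional probabilities to zero.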

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules
    R: IF age = youth AND student = yes THEN buys_computer = yes
    Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
    n_covers = # of tuples covered by R
    n_correct = # of tuples correctly classified by R
    coverage(R) = n_covers / |D|   /* D: training data set */
    accuracy(R) = n_correct / n_covers
If more than one rule is triggered, conflict resolution is needed
    Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., the most attribute tests)
    Class-based ordering: decreasing order of prevalence or misclassification cost per class
    Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
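As a hedged illustration of coverage and accuracy, the sketch below evaluates rule R on the 14-tuple buys_computer training set from the decision tree example, keeping only the columns R uses. It assumes that "youth" corresponds to the age value "<=30" in that table; the helper names are hypothetical.

```python
from fractions import Fraction

# (age, student, buys_computer) columns of the buys_computer training set
D = [
    ("<=30", "no", "no"), ("<=30", "no", "no"), ("31..40", "no", "yes"),
    (">40", "no", "yes"), (">40", "yes", "yes"), (">40", "yes", "no"),
    ("31..40", "yes", "yes"), ("<=30", "no", "no"), ("<=30", "yes", "yes"),
    (">40", "yes", "yes"), ("<=30", "yes", "yes"), ("31..40", "no", "yes"),
    ("31..40", "yes", "yes"), (">40", "no", "no"),
]


def antecedent(row):
    """age = youth AND student = yes (assumption: youth means age <= 30)."""
    age, student, _ = row
    return age == "<=30" and student == "yes"


def assess(rule_antecedent, predicted_class, data):
    """Return (coverage, accuracy) of a rule as exact fractions."""
    covered = [row for row in data if rule_antecedent(row)]
    n_covers = len(covered)
    n_correct = sum(row[-1] == predicted_class for row in covered)
    coverage = Fraction(n_covers, len(data))
    # Accuracy is undefined for a rule that covers no tuples
    accuracy = Fraction(n_correct, n_covers) if n_covers else None
    return coverage, accuracy


coverage, accuracy = assess(antecedent, "yes", D)
print(f"coverage(R) = {coverage}, accuracy(R) = {accuracy}")
```

Under that assumption R covers 2 of the 14 tuples and classifies both correctly, so coverage(R) = 2/14 and accuracy(R) = 2/2.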

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

age?
    <=30: student?
        no: no
        yes: yes
    31..40: yes
    >40: credit rating?
        excellent: no
        fair: yes

Example: Rule extraction from our buys_computer decision tree

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes

Issues Affecting Model Selection

Accuracy
    classifier accuracy: predicting class label
Speed
    time to construct the model (training time)
    time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
    understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
