BANA 560 - Lecture 2: Data Mining Overview, Data Exploration & Preparation
Agenda
Core data mining methods
Supervised learning vs. unsupervised learning
The process of data mining
Data preparation
Data sampling
Data exploration and visualization
Data pre-processing and reduction
Overfitting and data partitioning
Basic Data Mining Concepts
Algorithms (e.g., regression, decision tree, neural network)
Model
Predictors (X, feature, attribute, column, input variable, independent variable, field)
Response variable (Y, output variable, target variable, outcome variable)
Observations (sample, instance, record, case, row)
See more in Chapter 1.7 of the textbook…
Core Methods in Data Mining
Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Unsupervised Learning
Goal: No single “target” variable to predict; instead, identify patterns or groupings in the data
Supervised: Classification
Goal: Predict a categorical target (outcome) variable (e.g., buyer/non-buyer, fraudulent/legitimate)
Supervised: Prediction
Goal: Predict a numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics” (a toy sketch of both follows below)
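To make the contrast concrete, here is a minimal sketch in Python. scikit-learn, the toy data, and the column names are all assumptions for illustration (class demos use RapidMiner), not a definitive implementation.

```python
# Toy sketch: classification (categorical Y) vs. prediction (numerical Y).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "income": [45, 80, 32, 95, 60],      # predictor (X)
    "age":    [25, 41, 33, 52, 38],      # predictor (X)
    "buyer":  [0, 1, 0, 1, 1],           # categorical target -> classification
    "spend":  [120, 480, 95, 610, 300],  # numerical target   -> prediction
})
X = df[["income", "age"]]
new = pd.DataFrame({"income": [70], "age": [30]})  # a new case to score

clf = DecisionTreeClassifier(random_state=0).fit(X, df["buyer"])
reg = LinearRegression().fit(X, df["spend"])
print(clf.predict(new))   # a class label (buyer / non-buyer)
print(reg.predict(new))   # a numerical amount
```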
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis” (a toy pair-counting sketch follows below)
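As a minimal illustration (an assumption, not the full Apriori algorithm used in practice), the sketch below counts how often item pairs co-occur across transactions; the fraction of transactions containing a pair is its “support.”

```python
# Toy affinity analysis: count co-occurring item pairs per transaction.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for (x, y), c in pair_counts.most_common():
    print(f"If {x} then {y}: support = {c / n:.2f}")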
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering); see the sketch below
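A minimal sketch of the row-reduction idea, assuming scikit-learn and synthetic data: k-means replaces many records with a few cluster “prototypes.”

```python
# Toy sketch: summarize 1,000 records with 5 k-means cluster centers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # 1,000 records, 2 variables

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)               # 5 rows now stand in for 1,000
```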
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables (a toy sketch follows below)
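A minimal sketch of the four plot types named above, assuming matplotlib and pandas with made-up data:

```python
# Toy sketch: histogram, boxplot, bar chart, and scatterplot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [100, 150, 90, 200, 170, 120],
                   "size":  [20, 35, 18, 50, 42, 25],
                   "type":  ["A", "B", "A", "B", "B", "A"]})

fig, ax = plt.subplots(2, 2)
df["price"].plot.hist(ax=ax[0, 0], title="Histogram")
df.boxplot(column="price", by="type", ax=ax[0, 1])        # boxplot by group
df["type"].value_counts().plot.bar(ax=ax[1, 0], title="Bar chart")
df.plot.scatter(x="size", y="price", ax=ax[1, 1], title="Scatterplot")
plt.tight_layout()
plt.show()
```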
Data Exploration
Group Discussion
Each team: please identify two data mining examples, one supervised learning and one unsupervised learning.
For each example, please list the following:
Business goal
Input variables and output variables (if any)
What data mining method(s) should be used?
How can the data mining results help improve business decision making?
The Process of Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
Obtaining Data: Sampling
Rare event oversampling
Often the event of interest is rare
Examples: response to a mailing, fraud in taxes, …
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set (a toy sketch follows below)
Later, need to adjust results for the oversampling
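A minimal sketch of the oversampling step, assuming pandas and a hypothetical 0/1 “response” target (the later adjustment of results is omitted):

```python
# Toy sketch: resample the rare class with replacement until balanced.
import pandas as pd

train = pd.DataFrame({"x": range(10),
                      "response": [0]*9 + [1]})   # 10% rare events

rare   = train[train["response"] == 1]
common = train[train["response"] == 0]

# Draw rare cases with replacement up to the majority-class count
rare_up = rare.sample(n=len(common), replace=True, random_state=1)
balanced = pd.concat([common, rare_up]).sample(frac=1, random_state=1)
print(balanced["response"].mean())   # now ~0.5
```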
Exploring the data
Statistical summary of data: common metrics (a pandas sketch follows after this list)
Average
Median
Minimum
Maximum
Standard deviation
Counts & percentages
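In pandas (an assumption for illustration; class demos use RapidMiner), most of these metrics come from a single call. The sample values mimic the Boston Housing data but are not the full dataset.

```python
# Toy sketch: common summary statistics with pandas.
import pandas as pd

df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
                   "CHAS": [0, 0, 1, 0, 1]})   # tiny Boston-style sample
print(df.describe())                 # mean, std, min, max; the 50% row is the median
print(df["CHAS"].value_counts(normalize=True))  # counts as percentages
```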
Definition for the Boston Housing Dataset
Summary Statistics – Boston Housing
Exploring the Data
How Well Do You Pay Attention? - YouTube
What is the reason that you missed it?
How is this related to data exploration?
Correlations Between Pairs of Variables: Correlation Matrix from RapidMiner (a pandas equivalent is sketched below)
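A toy pandas equivalent of RapidMiner's correlation matrix (column names mimic the Boston Housing data; the values are illustrative):

```python
# Toy sketch: pairwise correlations between numeric variables.
import pandas as pd

df = pd.DataFrame({"RM":    [6.5, 6.4, 7.2, 7.0, 6.1],
                   "LSTAT": [5.0, 9.1, 4.0, 2.9, 12.4],
                   "MEDV":  [24.0, 21.6, 34.7, 33.4, 20.0]})
print(df.corr().round(2))   # correlation matrix, rounded for reading
```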
Summarize - cont.
Boston Housing example: Compare average home values in neighborhoods that border the Charles River (1) and those that do not (0); a pandas sketch follows below
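A toy sketch of this group-wise comparison with pandas (CHAS is the Charles River dummy and MEDV the median home value in the Boston Housing data; the values here are illustrative):

```python
# Toy sketch: average home value by Charles River dummy.
import pandas as pd

df = pd.DataFrame({"CHAS": [0, 0, 1, 0, 1, 1],
                   "MEDV": [22.0, 19.5, 30.1, 24.3, 28.7, 33.0]})
print(df.groupby("CHAS")["MEDV"].mean())   # one average per group
```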
Visualization: Scatterplot
Displays relationship between two numerical variables
Visualization: Histograms
Boston Housing example: the histogram shows the distribution of the outcome variable (median house value)
Reducing Categories
A single categorical variable with m categories is typically transformed into m−1 dummy variables
Each dummy variable takes the values 0 or 1
0 = “no” for the category
1 = “yes”
Problem: Can end up with too many variables
Solution: Reduce by combining categories that are close to each other
Use a graph to assess outcome variable sensitivity to the dummies (a toy dummy-coding sketch follows below)
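A minimal sketch of dummy coding with pandas; `drop_first=True` yields m−1 dummies for m categories (“region” is a hypothetical variable):

```python
# Toy sketch: 3 categories become 2 binary dummy variables.
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "north", "east"]})
print(pd.get_dummies(df["region"], drop_first=True, dtype=int))
```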
PCA in Classification/Prediction
Apply PCA to the training data
Decide how many PCs to use
Use the variable weights from those PCs with validation/new data
This creates a new, reduced set of predictors in the validation/new data (see the sketch below)
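A minimal sketch of this workflow, assuming scikit-learn and synthetic data: fit PCA on the training set only, then reuse the same weights on validation data.

```python
# Toy sketch: PCA weights learned on training data, reused on validation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train, X_valid = rng.normal(size=(80, 6)), rng.normal(size=(20, 6))

pca = PCA(n_components=2).fit(X_train)  # weights learned on training set
Z_train = pca.transform(X_train)        # reduced predictors for training
Z_valid = pca.transform(X_valid)        # same weights applied, not refit
```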
Types of Variables
Determine the types of pre-processing needed and the algorithms used
Main distinction: Categorical vs. numeric
Numeric
Continuous
Integer
Categorical
Ordered (low, medium, high)
Unordered (male, female)
Pre-processing: Variable Handling
Numeric
Most algorithms can handle numeric data
May occasionally need to “bin” into categories (a binning sketch follows below)
Categorical
Naïve Bayes can use as-is
In most other algorithms, must create binary dummies (number of dummies = number of categories − 1)
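A minimal sketch of binning a numeric variable into categories with pandas (“age” and the cut points are hypothetical):

```python
# Toy sketch: bin a numeric variable into labeled categories.
import pandas as pd

age = pd.Series([22, 35, 47, 58, 63, 71])
bins = pd.cut(age, bins=[0, 30, 50, 100],
              labels=["young", "middle", "senior"])
print(bins.value_counts())
```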
Pre-processing: Detecting Outliers
An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine whether it is an error or truly extreme (a toy detection sketch follows below)
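One common rule of thumb (an assumption, not the only definition of “distant”) flags values beyond 1.5 times the interquartile range from the quartiles:

```python
# Toy sketch: flag values outside the 1.5*IQR fences.
import pandas as pd

x = pd.Series([10, 12, 11, 13, 9, 11, 95])   # toy data; 95 is extreme
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])  # flags 95
```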
Detecting Outliers
In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called “anomaly detection”.
Pre-processing: Handling Missing Data
Most algorithms will not process records with missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, can omit them
If many records are missing values on a small set of variables, can drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation
Replace missing values with reasonable substitutes (a toy sketch follows below)
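A minimal sketch of imputation with pandas, filling missing values with the column median (toy data; a real project might prefer model-based imputation):

```python
# Toy sketch: median imputation of a numeric column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50, np.nan, 70, 65, np.nan]})
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```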
Pre-processing: Normalizing (Standardizing) Data
Used in some techniques where variables with the largest scales would otherwise dominate and skew results
Puts all variables on the same scale
Normalizing function: subtract the mean and divide by the standard deviation
Alternative function: scale to 0–1 by subtracting the minimum and dividing by the range
Useful when the data contain both dummies and numeric variables (both functions are sketched below)
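A minimal sketch of both functions with pandas (x is a toy numeric column):

```python
# Toy sketch: z-score standardization and 0-1 (min-max) scaling.
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 10.0])
z_score = (x - x.mean()) / x.std()              # mean 0, std 1
min_max = (x - x.min()) / (x.max() - x.min())   # range 0 to 1
print(z_score, min_max, sep="\n")
```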
The Problem of Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The “fit” to the data at hand may be excellent
When used with new data, models of great complexity do not do so well (a toy demonstration follows after the list of causes below)
100% fit – not useful for new data
[Figure: scatterplot of Revenue (0–1600) against Expenditure (200–1000), with a model curve passing through every training point]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
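A toy demonstration of overfitting, assuming scikit-learn and synthetic data: an unrestricted decision tree memorizes the training data yet typically scores noticeably worse on held-out data.

```python
# Toy sketch: perfect training fit, weaker fit on new data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(200, 1000, size=(60, 1))       # expenditure
y = 1.5 * X.ravel() + rng.normal(0, 100, 60)   # noisy revenue

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=1)
tree = DecisionTreeRegressor().fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))   # R^2 = 1.0: the tree memorized training data
print(tree.score(X_va, y_va))   # typically lower on held-out data
```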
Partitioning the Data
Problem: How well will our model perform with new data? (a toy partitioning sketch follows below)
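A minimal sketch of partitioning, assuming scikit-learn; the 60/40 split is an arbitrary illustrative choice:

```python
# Toy sketch: hold out a validation partition to assess the model.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy predictors
y = np.arange(10)                  # toy outcome

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)
print(len(X_train), len(X_valid))  # 6 training rows, 4 validation rows
```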