
BANA 560 Lecture 2: Data Mining Overview and Data Preparation

1
Agenda
Core data mining methods
Supervised learning vs. unsupervised learning
The process of data mining
Data preparation
Data sampling
Data exploration and visualization
Data pre-processing and reduction
Overfitting and data partitioning

2
Basic Data Mining Concepts
Algorithms (e.g., regression, decision trees, neural networks)
Model
Predictors (X; also called features, attributes, columns, input variables, independent variables, fields)
Response variable (Y; also called output variable, target variable, outcome variable)
Observations (also called samples, instances, records, cases, rows)
See more in Chapter 1.7 of the textbook.

3
Core Methods in Data Mining

Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization

4
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Training data: data where the target value is known
Scoring: apply the model to data where the target value is not known
Methods: Classification and Prediction

5
Unsupervised Learning
Goal: Segment data into meaningful groups; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization

6
Supervised: Classification
Goal: Predict a categorical target (outcome) variable
Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
Each row is a case (customer, tax return, applicant)
Each column is a variable
The target variable is often binary (yes/no)
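A minimal sketch of this workflow, assuming scikit-learn and a hypothetical pandas DataFrame df with a binary purchase column (the names are illustrative, not from the lecture):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["purchase"])   # predictor columns
    y = df["purchase"]                  # binary target: 1 = purchase, 0 = no purchase

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_valid, y_valid))  # classification accuracy on held-out cases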

7
Supervised: Prediction
Goal: Predict a numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics”
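The numerical-target analogue of the sketch above, again with hypothetical names:

    from sklearn.linear_model import LinearRegression

    reg = LinearRegression().fit(X_train, y_train)   # here y is numeric, e.g. sales
    predictions = reg.predict(X_valid)               # predicted values for new cases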

8
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems: “Our records show you bought X; you may also like Y”
Also called “affinity analysis”
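One way to mine such rules, sketched with the mlxtend package (an assumption; the lecture itself uses RapidMiner). Here basket is a hypothetical transactions-by-items DataFrame of True/False values:

    from mlxtend.frequent_patterns import apriori, association_rules

    frequent = apriori(basket, min_support=0.01, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
    print(rules[["antecedents", "consequents", "support", "confidence"]].head())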

9
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
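A sketch of the row-reduction idea with scikit-learn's k-means: cluster centroids stand in for many individual records (PCA, for reducing columns, appears on a later slide):

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(X)
    reduced_rows = km.cluster_centers_   # 10 representative rows instead of thousands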

10
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful for examining relationships between pairs of variables

11
Data Exploration
Data sets are typically large, complex & messy
Need to review the data to help refine the task
Use techniques of reduction and visualization

12
Group Discussion
Each team, please identify two data mining examples: one supervised learning and one unsupervised learning.
For each example, please list the following:
Business goal
Input variables and output variables (if any)
What data mining method(s) should be used?
How can the data mining results help improve business decision making?

13
The Process of Data Mining
1. Define/understand the purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, and pre-process the data
4. Reduce the data; if supervised DM, partition it
5. Specify the task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural networks, etc.)
7. Implement and “tune” iteratively
8. Assess results and compare models

14
Obtaining Data: Sampling
Data mining typically deals with huge databases
Algorithms and models are typically applied to a sample from a database, to produce statistically valid results
Once you develop and select a final model, you use it to “score” the observations in the larger database
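A pandas sketch of this sample-then-score pattern (df and the model names are hypothetical):

    sample = df.sample(n=5000, random_state=1)   # build models on a manageable sample
    # ... develop and select a final model on `sample` ...
    # scores = final_model.predict(full_X)       # then score the full database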

15
Rare Event Oversampling
Often the event of interest is rare
Examples: response to a mailing, tax fraud, …
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, adjust the results to correct for the oversampling
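A simple random-oversampling sketch using scikit-learn's resample utility (df and its target column are hypothetical):

    import pandas as pd
    from sklearn.utils import resample

    rare = df[df["target"] == 1]                 # the scarce class
    common = df[df["target"] == 0]
    rare_up = resample(rare, replace=True, n_samples=len(common), random_state=1)
    balanced = pd.concat([common, rare_up])      # more balanced training set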
16
Exploring the Data
Statistical summary of data: common metrics
Average
Median
Minimum
Maximum
Standard deviation
Counts & percentages
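In pandas, most of these come from one call (df hypothetical):

    print(df.describe())                             # count, mean, std, min, quartiles, max
    print(df["value"].median())                      # single metrics are one-liners
    print(df["group"].value_counts(normalize=True))  # category counts as percentages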

17
Definition for the Boston Housing Dataset

18
Summary Statistics: Boston Housing

19
Exploring the Data
Video: “How Well Do You Pay Attention?” (YouTube)
Why did you miss what you missed?
How is this related to data exploration?

20
Correlations Between Pairs of Variables: Correlation Matrix from RapidMiner
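The pandas equivalent, assuming the Boston Housing data are loaded in a DataFrame boston:

    corr = boston.corr(numeric_only=True)   # pairwise correlation matrix
    print(corr.round(2))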

21
Summarize (cont.)
Boston Housing example: Compare average home values in neighborhoods that border the Charles River (1) and those that do not (0)
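A pandas sketch, assuming the dataset's conventional column names CHAS (river dummy) and MEDV (median home value):

    print(boston.groupby("CHAS")["MEDV"].mean())   # average value for 0 vs. 1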

22
Visualization: Scatterplot
Displays relationship between two numerical variables

23
Visualization: Histograms
Boston Housing example: the histogram shows the distribution of the outcome variable (median house value)
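A matplotlib sketch of both the scatterplot from the previous slide and this histogram, again assuming boston with the conventional columns LSTAT and MEDV:

    import matplotlib.pyplot as plt

    plt.scatter(boston["LSTAT"], boston["MEDV"], s=10)   # relationship between two numeric variables
    plt.xlabel("LSTAT")
    plt.ylabel("MEDV")
    plt.show()

    plt.hist(boston["MEDV"], bins=30)                    # distribution of the outcome variable
    plt.xlabel("MEDV")
    plt.ylabel("Frequency")
    plt.show()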

24
Reducing Categories
A single categorical variable with m categories is typically transformed into m-1 dummy variables (sketch below)
Each dummy variable takes the value 0 or 1
0 = “no” for the category
1 = “yes”
Problem: can end up with too many variables
Solution: reduce by combining categories that are close to each other
Use a graph to assess the outcome variable's sensitivity to the dummies

25
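Dummy coding sketched with pandas (region is a hypothetical categorical column):

    import pandas as pd

    coded = pd.get_dummies(df, columns=["region"], drop_first=True)
    # drop_first=True yields m-1 dummies for a variable with m categories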
PCA in Classification/Prediction
Apply PCA to the training data
Decide how many PCs to use
Use the variable weights from those PCs with validation/new data
This creates a new, reduced set of predictors in validation/new data
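This fit-on-training, apply-to-validation pattern in scikit-learn:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=5).fit(X_train)   # weights estimated on training data only
    X_train_pc = pca.transform(X_train)
    X_valid_pc = pca.transform(X_valid)      # same weights reused on validation data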

26
Types of Variables
Variable types determine the pre-processing needed and the algorithms used
Main distinction: categorical vs. numeric
Numeric
Continuous
Integer
Categorical
Ordered (low, medium, high)
Unordered (male, female)

27
Pre-processing: Variable Handling
Numeric
Most algorithms can handle numeric data
May occasionally need to “bin” numeric values into categories
Categorical
Naïve Bayes can use categorical data as-is
In most other algorithms, must create binary dummies (number of dummies = number of categories minus 1)
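Binning a numeric variable into categories, sketched with pandas (income is a hypothetical column):

    import pandas as pd

    df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])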

28
Pre-processing: Detecting Outliers
An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine whether the value is an error or truly extreme.

29
Detecting Outliers (cont.)
In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called “anomaly detection”.
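A common z-score screen, sketched in pandas; the 3-standard-deviation cutoff is a rule of thumb, not from the lecture:

    z = (df["x"] - df["x"].mean()) / df["x"].std()
    outliers = df[z.abs() > 3]   # observations more than 3 SDs from the mean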

30
Pre-processing: Handling Missing Data
Most algorithms will not process records with missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, omit them
If many records are missing values on a small set of variables, drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation
Replace missing values with reasonable substitutes (sketch below)

31
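Both solutions sketched with pandas (df hypothetical):

    dropped = df.dropna()                              # Solution 1: omit incomplete records
    imputed = df.fillna(df.median(numeric_only=True))  # Solution 2: impute column medians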
Pre-processing: Normalizing (Standardizing) Data
Used in some techniques when variables with the largest scales would otherwise dominate and skew the results
Puts all variables on the same scale
Normalizing function: subtract the mean and divide by the standard deviation (z-score)
Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range
Useful when the data contain dummies and numeric variables
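Both scalings in a short pandas sketch:

    num = df.select_dtypes("number")
    z_scored = (num - num.mean()) / num.std()           # same scale: mean 0, SD 1
    unit = (num - num.min()) / (num.max() - num.min())  # scaled to the 0-1 range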
32
The Problem of Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The “fit” to the training data may be excellent
When used with new data, models of great complexity do not perform as well

33
100% Fit: Not Useful for New Data
[Scatterplot of Revenue vs. Expenditure: a curve that fits the training points 100% will not generalize to new data]

34
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
Consequence: the deployed model will not work as well as expected with completely new data

35
Partitioning the Data
Problem: How well will our model perform with new data?
Solution: Separate the data into two parts
Training partition: used to develop the model
Validation partition: used to implement the model and evaluate its performance on “new” data
36
Test Partition
When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)
Assessing multiple models on the same validation data can overfit the validation data
Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data
Solution: apply the final selected model to a test partition to get an unbiased estimate of its performance on new data (sketch below)

37
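A three-way split sketched with scikit-learn (the 60/20/20 proportions are illustrative):

    from sklearn.model_selection import train_test_split

    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
    X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)
    # result: 60% training, 20% validation, 20% test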
Summary
Data mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
Before algorithms can be applied, data must be characterized and pre-processed
To evaluate performance and avoid overfitting, data partitioning is used

38
