BANA 560 - Lecture 2: Data Mining Overview, Data Exploration & Preparation
Agenda
Core data mining methods
Supervised learning vs. unsupervised learning
The process of data mining
Data preparation
Data sampling
Data exploration and visualization
Data pre-processing and reduction
Overfitting and data partitioning
Basic Data Mining Concepts
Algorithms (e.g., regression, decision tree, neural network)
Model
Predictors (X, feature, attribute, column, input variable, independent variable, field)
Response variable (Y, output variable, target variable, outcome variable)
Observations (sample, instance, record, case, row)
See more in Chapter 1.7 of the textbook…
Core Methods in Data Mining
Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Unsupervised Learning
Goal: No single “target” variable to predict; instead, identify patterns or groupings in the data
Supervised: Classification
Goal: Predict a categorical target (outcome) variable (e.g., buyer/non-buyer, fraudulent/legitimate)
Supervised: Prediction
Goal: Predict a numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics” (a toy sketch of both follows below)
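To make the contrast concrete, here is a minimal sketch in Python. scikit-learn, the toy data, and the column names are all assumptions for illustration (class demos use RapidMiner), not a definitive implementation.

```python
# Toy sketch: classification (categorical Y) vs. prediction (numerical Y).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "income": [45, 80, 32, 95, 60],      # predictor (X)
    "age":    [25, 41, 33, 52, 38],      # predictor (X)
    "buyer":  [0, 1, 0, 1, 1],           # categorical target -> classification
    "spend":  [120, 480, 95, 610, 300],  # numerical target   -> prediction
})
X = df[["income", "age"]]
new = pd.DataFrame({"income": [70], "age": [30]})  # a new case to score

clf = DecisionTreeClassifier(random_state=0).fit(X, df["buyer"])
reg = LinearRegression().fit(X, df["spend"])
print(clf.predict(new))   # a class label (buyer / non-buyer)
print(reg.predict(new))   # a numerical amount
```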
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis” (a toy pair-counting sketch follows below)
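As a minimal illustration (an assumption, not the full Apriori algorithm used in practice), the sketch below counts how often item pairs co-occur across transactions; the fraction of transactions containing a pair is its “support.”

```python
# Toy affinity analysis: count co-occurring item pairs per transaction.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for (x, y), c in pair_counts.most_common():
    print(f"If {x} then {y}: support = {c / n:.2f}")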
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering); see the sketch below
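A minimal sketch of the row-reduction idea, assuming scikit-learn and synthetic data: k-means replaces many records with a few cluster “prototypes.”

```python
# Toy sketch: summarize 1,000 records with 5 k-means cluster centers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # 1,000 records, 2 variables

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)               # 5 rows now stand in for 1,000
```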
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables (a toy sketch follows below)
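A minimal sketch of the four plot types named above, assuming matplotlib and pandas with made-up data:

```python
# Toy sketch: histogram, boxplot, bar chart, and scatterplot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [100, 150, 90, 200, 170, 120],
                   "size":  [20, 35, 18, 50, 42, 25],
                   "type":  ["A", "B", "A", "B", "B", "A"]})

fig, ax = plt.subplots(2, 2)
df["price"].plot.hist(ax=ax[0, 0], title="Histogram")
df.boxplot(column="price", by="type", ax=ax[0, 1])        # boxplot by group
df["type"].value_counts().plot.bar(ax=ax[1, 0], title="Bar chart")
df.plot.scatter(x="size", y="price", ax=ax[1, 1], title="Scatterplot")
plt.tight_layout()
plt.show()
```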
Data Exploration
Group Discussion
Each team: please identify two data mining examples, one supervised learning and one unsupervised learning.
For each example, please list the following:
Business goal
Input variables and output variables (if any)
What data mining method(s) should be used?
How can the data mining results help improve business decision making?
The Process of Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
Obtaining Data: Sampling
Rare event oversampling
Often the event of interest is rare
Examples: response to a mailing, fraud in taxes, …
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set (a toy sketch follows below)
Later, need to adjust results for the oversampling
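A minimal sketch of the oversampling step, assuming pandas and a hypothetical 0/1 “response” target (the later adjustment of results is omitted):

```python
# Toy sketch: resample the rare class with replacement until balanced.
import pandas as pd

train = pd.DataFrame({"x": range(10),
                      "response": [0]*9 + [1]})   # 10% rare events

rare   = train[train["response"] == 1]
common = train[train["response"] == 0]

# Draw rare cases with replacement up to the majority-class count
rare_up = rare.sample(n=len(common), replace=True, random_state=1)
balanced = pd.concat([common, rare_up]).sample(frac=1, random_state=1)
print(balanced["response"].mean())   # now ~0.5
```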
Exploring the data
Statistical summary of data: common metrics (a pandas sketch follows after this list)
Average
Median
Minimum
Maximum
Standard deviation
Counts & percentages
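In pandas (an assumption for illustration; class demos use RapidMiner), most of these metrics come from a single call. The sample values mimic the Boston Housing data but are not the full dataset.

```python
# Toy sketch: common summary statistics with pandas.
import pandas as pd

df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
                   "CHAS": [0, 0, 1, 0, 1]})   # tiny Boston-style sample
print(df.describe())                 # mean, std, min, max; the 50% row is the median
print(df["CHAS"].value_counts(normalize=True))  # counts as percentages
```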
Definition for the Boston Housing Dataset
Summary Statistics – Boston Housing
Exploring the Data
How Well Do You Pay Attention? - YouTube
What is the reason that you missed it?
How is this related to data exploration?
Correlations Between Pairs of Variables: Correlation Matrix from RapidMiner (a pandas equivalent is sketched below)
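A toy pandas equivalent of RapidMiner's correlation matrix (column names mimic the Boston Housing data; the values are illustrative):

```python
# Toy sketch: pairwise correlations between numeric variables.
import pandas as pd

df = pd.DataFrame({"RM":    [6.5, 6.4, 7.2, 7.0, 6.1],
                   "LSTAT": [5.0, 9.1, 4.0, 2.9, 12.4],
                   "MEDV":  [24.0, 21.6, 34.7, 33.4, 20.0]})
print(df.corr().round(2))   # correlation matrix, rounded for reading
```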
Summarize - cont.
Boston Housing example: Compare average home values in neighborhoods that border the Charles River (1) and those that do not (0); a pandas sketch follows below
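A toy sketch of this group-wise comparison with pandas (CHAS is the Charles River dummy and MEDV the median home value in the Boston Housing data; the values here are illustrative):

```python
# Toy sketch: average home value by Charles River dummy.
import pandas as pd

df = pd.DataFrame({"CHAS": [0, 0, 1, 0, 1, 1],
                   "MEDV": [22.0, 19.5, 30.1, 24.3, 28.7, 33.0]})
print(df.groupby("CHAS")["MEDV"].mean())   # one average per group
```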
Visualization: Scatterplot
Displays relationship between two numerical variables
Visualization: Histograms
Boston Housing example: the histogram shows the distribution of the outcome variable (median house value)
Reducing Categories
A single categorical variable with m categories is typically transformed into m−1 dummy variables
Each dummy variable takes the values 0 or 1
0 = “no” for the category
1 = “yes”
Problem: Can end up with too many variables
Solution: Reduce by combining categories that are close to each other
Use a graph to assess outcome variable sensitivity to the dummies (a toy dummy-coding sketch follows below)
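A minimal sketch of dummy coding with pandas; `drop_first=True` yields m−1 dummies for m categories (“region” is a hypothetical variable):

```python
# Toy sketch: 3 categories become 2 binary dummy variables.
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "north", "east"]})
print(pd.get_dummies(df["region"], drop_first=True, dtype=int))
```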
PCA in Classification/Prediction
Apply PCA to the training data
Decide how many PCs to use
Use the variable weights from those PCs with validation/new data
This creates a new, reduced set of predictors in the validation/new data (see the sketch below)
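A minimal sketch of this workflow, assuming scikit-learn and synthetic data: fit PCA on the training set only, then reuse the same weights on validation data.

```python
# Toy sketch: PCA weights learned on training data, reused on validation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train, X_valid = rng.normal(size=(80, 6)), rng.normal(size=(20, 6))

pca = PCA(n_components=2).fit(X_train)  # weights learned on training set
Z_train = pca.transform(X_train)        # reduced predictors for training
Z_valid = pca.transform(X_valid)        # same weights applied, not refit
```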
Types of Variables
Determine the types of pre-processing needed and the algorithms used
Main distinction: Categorical vs. numeric
Numeric
Continuous
Integer
Categorical
Ordered (low, medium, high)
Unordered (male, female)
Pre-processing: Variable Handling
Numeric
Most algorithms can handle numeric data
May occasionally need to “bin” into categories (a binning sketch follows below)
Categorical
Naïve Bayes can use as-is
In most other algorithms, must create binary dummies (number of dummies = number of categories − 1)
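A minimal sketch of binning a numeric variable into categories with pandas (“age” and the cut points are hypothetical):

```python
# Toy sketch: bin a numeric variable into labeled categories.
import pandas as pd

age = pd.Series([22, 35, 47, 58, 63, 71])
bins = pd.cut(age, bins=[0, 30, 50, 100],
              labels=["young", "middle", "senior"])
print(bins.value_counts())
```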
Pre-processing: Detecting Outliers
An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine whether it is an error or truly extreme (a toy detection sketch follows below)
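One common rule of thumb (an assumption, not the only definition of “distant”) flags values beyond 1.5 times the interquartile range from the quartiles:

```python
# Toy sketch: flag values outside the 1.5*IQR fences.
import pandas as pd

x = pd.Series([10, 12, 11, 13, 9, 11, 95])   # toy data; 95 is extreme
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])  # flags 95
```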
Detecting Outliers
In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called “anomaly detection”.
Pre-processing: Handling Missing Data
Most algorithms will not process records with missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, can omit them
If many records are missing values on a small set of variables, can drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation
Replace missing values with reasonable substitutes (a toy sketch follows below)
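A minimal sketch of imputation with pandas, filling missing values with the column median (toy data; a real project might prefer model-based imputation):

```python
# Toy sketch: median imputation of a numeric column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50, np.nan, 70, 65, np.nan]})
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```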
Pre-processing: Normalizing (Standardizing) Data
Used in some techniques where variables with the largest scales would otherwise dominate and skew results
Puts all variables on the same scale
Normalizing function: subtract the mean and divide by the standard deviation
Alternative function: scale to 0–1 by subtracting the minimum and dividing by the range
Useful when the data contain both dummies and numeric variables (both functions are sketched below)
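A minimal sketch of both functions with pandas (x is a toy numeric column):

```python
# Toy sketch: z-score standardization and 0-1 (min-max) scaling.
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 10.0])
z_score = (x - x.mean()) / x.std()              # mean 0, std 1
min_max = (x - x.min()) / (x.max() - x.min())   # range 0 to 1
print(z_score, min_max, sep="\n")
```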
The Problem of Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The “fit” to the data at hand may be excellent
When used with new data, models of great complexity do not do so well (a toy demonstration follows after the list of causes below)
100% fit – not useful for new data
[Figure: scatterplot of Revenue (0–1600) against Expenditure (200–1000), with a model curve passing through every training point]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
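A toy demonstration of overfitting, assuming scikit-learn and synthetic data: an unrestricted decision tree memorizes the training data yet typically scores noticeably worse on held-out data.

```python
# Toy sketch: perfect training fit, weaker fit on new data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(200, 1000, size=(60, 1))       # expenditure
y = 1.5 * X.ravel() + rng.normal(0, 100, 60)   # noisy revenue

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=1)
tree = DecisionTreeRegressor().fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))   # R^2 = 1.0: the tree memorized training data
print(tree.score(X_va, y_va))   # typically lower on held-out data
```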
Partitioning the Data
Problem: How well will our model perform with new data? (a toy partitioning sketch follows below)
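A minimal sketch of partitioning, assuming scikit-learn; the 60/40 split is an arbitrary illustrative choice:

```python
# Toy sketch: hold out a validation partition to assess the model.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy predictors
y = np.arange(10)                  # toy outcome

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)
print(len(X_train), len(X_valid))  # 6 training rows, 4 validation rows
```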