Data Warehousing and Data Mining with R Programming
— UNIT 1 —
Recommended Book: Data Mining: Concepts and Techniques
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
Introduction
Motivation (for Data Mining)
Data Mining: Definition & Functionalities
Data Preprocessing
Forms of Data Preprocessing
Data Cleaning: Missing Values, Noisy Data (Binning, Clustering, Regression, Combined Computer and Human Inspection), Inconsistent Data
Data Integration and Transformation
Data Reduction: Data Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity Reduction, Clustering
Discretization and Concept Hierarchy Generation
Motivation
In real-world applications, data can be inconsistent, incomplete, and/or noisy.
Errors can arise from:
Faulty data collection instruments
Data entry problems
Human misjudgment during data entry
Data transmission problems
Technology limitations
Discrepancies in naming conventions
Results:
Duplicated records
Incomplete data
Contradictions in data
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
What Is Data Mining?
Data mining (knowledge discovery from data):
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
The exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
The extraction of implicit, previously unknown, and potentially useful information from data; the process of discovering advantageous patterns in data.
Alternative names:
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"?
Simple search and query processing
(Deductive) expert systems
Data Mining Algorithm
Objective: fit data to a model
Descriptive: characterize the general properties of the data in the database
Predictive: perform inference on the current data in order to make predictions
Preference: technique to choose the best model
Search: technique to search the data ("query")
Data Mining Process
Define & understand the problem
Data warehousing
Collect / extract data
Clean data
Data engineering
Algorithm selection / engineering
Run the mining algorithm
Analyze the results
Database Processing vs. Data Mining Processing
Query: well defined (SQL) vs. poorly defined (no precise query language)
Data: operational data vs. not operational data
Output: precise, a subset of the database vs. fuzzy, not a subset of the database
Data Warehousing and Data Mining
Data warehousing provides the enterprise with a memory.
Data mining provides the enterprise with intelligence.
Statistics vs. Data Mining
Statistics: confirmatory, small samples, in-sample performance
Data mining: exploratory, large samples, out-of-sample performance
Query Examples
Database:
– Find all credit applicants with last name Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining:
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
Data Mining Models and Tasks
Basic Data Mining Tasks
Classification maps data into predefined groups
or classes
Supervised learning
Pattern recognition
Prediction
Regression is used to map a data item to a real
valued prediction variable.
Clustering groups similar data together into
clusters.
Unsupervised learning
Segmentation
Partitioning
Basic Data Mining Tasks (cont’d)
Summarization maps data into subsets with
associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules (finds rules of the form X ⇒ Y, i.e., "if X then Y")
Sequential Analysis determines sequential
patterns.
(Artificial) Neural Networks
Genetic algorithms
Hypothesis Testing.
Data Mining and Business Intelligence
Layers, bottom to top, with increasing potential to support business decisions (and the typical user at each layer):
Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Exploration: statistical summary, querying, and reporting (Data Analyst)
Data Mining: information discovery (Data Analyst)
Data Presentation: visualization techniques (Business Analyst)
Decision Making (End User)
Data Mining vs. KDD
Knowledge Discovery in Databases
(KDD): process of finding useful
information and patterns in data.
Data Mining: Use of algorithms to extract
the information and patterns derived by the
KDD process.
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process:
Databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation → knowledge
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert data to a common format; transform to a new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to
user in meaningful manner.
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
KDD Process Ex: Web Log
Selection:
Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are
interesting
Suggested approach: Human-centered, query-based, focused
mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
Data Mining Development
Data mining draws on techniques from several fields:
Databases: relational data model, SQL, association rule algorithms, data warehousing, scalability techniques
Information retrieval: IR systems, imprecise queries, textual data, Web search engines
Statistics: Bayes theorem, regression analysis, EM algorithm, K-means clustering, time series analysis, similarity measures, hierarchical clustering
Algorithms: algorithm design techniques, algorithm analysis, data structures
Machine learning: neural networks, decision tree algorithms
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle, e.g., terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks, and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Data Mining Functionalities (Kinds of Patterns to Be Found)
Multidimensional concept description:
Characterization (generalization or summarization) and discrimination (comparison)
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper ⇒ Beer [0.5%, 75%] (correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on (climate), or classify
cars based on (gas mileage)
Predict some unknown or missing numerical values
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy
generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values,
lacking certain attributes of interest, or
containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in
codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data
was collected and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked
data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
Data warehouse needs consistent integration of
quality data
Data extraction, cleaning, and transformation
comprises the majority of the work of building a
data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and
accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces
the same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data discretization is defined as a process of converting
continuous data attribute values into a finite set of
intervals with minimal loss of information.
Forms of Data Preprocessing
(Figure: the four forms — data cleaning, data integration, data transformation, and data reduction.)
Mining Data Descriptive Characteristics
Motivation
For data preprocessing to be successful, it is essential to understand the data better (get an overall picture of your data)
To identify the typical properties of your data and highlight which data values should be treated as noise or outliers
Two Approaches
Measures of central tendency: effective measures of the degree to which numerical data tend to occur at the center of the data set (mean, median, mode, midrange)
Measures of data dispersion: effective measures of the degree to which numerical data tend to spread in the data set (range, quartiles, interquartile range (IQR), outliers, box plot, variance, standard deviation)
Kinds of Measures
Distributive measure: can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original data set (e.g., count, sum)
Algebraic measure: can be computed by applying an algebraic function to one or more distributive measures (e.g., mean)
Holistic measure: must be computed on the entire data set as a whole (e.g., median)
Measuring the Central Tendency
Mean (algebraic measure), sample vs. population:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$  vs.  $\mu = \frac{\sum x}{N}$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chop the extreme values before averaging
Median: a holistic measure
Middle value if there is an odd number of values; average of the middle two values otherwise
Estimated by interpolation for grouped data: $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: $mean - mode \approx 3 \times (mean - median)$
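A minimal base-R sketch of these measures on a small sample (base R has no built-in mode function, so the `stat_mode` helper below is our own):

```r
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)  # sample numeric data

mean(x)              # arithmetic mean
mean(x, trim = 0.1)  # trimmed mean: chop 10% of the extreme values at each end
weighted.mean(x, w = rep(1, length(x)))  # weighted mean (uniform weights here)
median(x)            # holistic measure: the middle value

# Mode: a simple helper returning the most frequent value(s)
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}
stat_mode(x)         # 52 and 70: this sample is bimodal
```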
Symmetric vs. Skewed Data
(Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.)
Measuring the Dispersion of Data
Range, quartiles, outliers, and boxplots:
Range: the difference between the largest and smallest values
Percentile: the value of a variable below which a certain percentage of observations fall
Quartiles: separating the given set of data into 4 equal parts by 3 divisions (lower quartile, median, upper quartile). The lower quartile is the middle value between the first number and the median; the upper quartile is the middle value between the median and the last number. Q1 (25th percentile), Q3 (75th percentile). The outliers of a given set of data can be identified with the help of the interquartile range.
Inter-quartile range: IQR = Q3 − Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, the median is marked, whiskers extend outward, and outliers are plotted individually
Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
Variance and standard deviation (sample: s, population: σ):
Sample variance (algebraic, scalable computation):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
The standard deviation s (or σ) is the square root of the variance s² (or σ²)
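The same measures of dispersion in base R (note that `var()` and `sd()` use the sample formulas with the n − 1 denominator):

```r
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

diff(range(x))                    # range: max - min
quantile(x, c(0.25, 0.5, 0.75))   # Q1, median, Q3
IQR(x)                            # interquartile range: Q3 - Q1
fivenum(x)                        # five-number summary: min, Q1, M, Q3, max
var(x)                            # sample variance (n - 1 denominator)
sd(x)                             # sample standard deviation
```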
Properties of Normal Distribution
Curve
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
Graphic Displays of Basic Statistical Descriptions
Histogram:
Boxplot:
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of
one univariant distribution against the corresponding
quantiles of another
Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
Loess (local regression) curve: add a smooth curve
to a scatter plot to provide better perception of the
pattern of dependence
Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to
Minimum and Maximum
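In R, `boxplot()` draws exactly this: a box at the quartiles, a median line, whiskers, and individual points beyond 1.5 × IQR. A quick sketch:

```r
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
boxplot(x, main = "Boxplot with 1.5 * IQR whiskers")
boxplot.stats(x)$out   # values plotted individually as outliers (here: 110)
```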
Visualization of Data Dispersion: Boxplot Analysis
(Figure: side-by-side boxplots comparing the dispersion of several data sets.)
OUTLIER Detection
Quartile means separating the given set of data into 4 equal
parts by 3 divisions. The three separations are lower quartile,
median and upper quartile. The lower quartile is the mid data
between the first number and its median and the upper
quartile is the mid data between the median and last number
of a given set. The outlier of a given set of data can be
identified with the help of interquartile range.
Formulas Involved – Study Outliers in a Set of Data:
n: the total number of elements in the set
Median position = (n+1)/2, giving Q2
Lower quartile position = (n+1)/4, giving Q1
Upper quartile position = 3(n+1)/4, giving Q3
Interquartile range: IQR = Q3 − Q1
The outliers are below Q1 − 1.5·IQR and above Q3 + 1.5·IQR
Example Problems – Study Outliers in a Set of Data:
Example 1: Calculate the outlier for the given set of data 31, 64, 69, 65, 62, 63, 62.
Solution:
The given set of data is 31, 64, 69, 65, 62, 63, 62.
Organize the given set of data in ascending order: 31, 62, 62, 63, 64, 65, 69.
The median position is (7+1)/2 = 4.
The 4th value is the median. Thus Median = 63 = Q2.
Lower quartile position = (7+1)/4 = 8/4 = 2.
The element in the 2nd position is the lower quartile. Thus 62 is the lower quartile (Q1).
Upper quartile position = (3×(7+1))/4 = 6.
The element in the 6th position is the upper quartile. Thus 65 is the upper quartile (Q3).
Interquartile range: IQR = Q3 − Q1 = 65 − 62 = 3
Now, to find the outliers, calculate Q1 − 1.5·IQR and Q3 + 1.5·IQR:
Q1 − 1.5·IQR = 62 − 1.5×3 = 57.5
Q3 + 1.5·IQR = 65 + 1.5×3 = 69.5
The outliers are below 57.5 and above 69.5.
31 is the outlier of the given set of data.
Example 2: Calculate the outlier for the given set of data 50, 61, 65, 64, 67, 85, 70.
Solution: The given set of data is 50, 61, 65, 64, 67, 85, 70.
Organize the given set of data in ascending order: 50, 61, 64, 65, 67, 70, 85.
The median position is (7+1)/2 = 8/2 = 4. Thus the 4th element is the median.
Median = 65 = Q2
Lower quartile position = (7+1)/4 = 2. The 2nd element is the lower quartile.
Lower Quartile = 61 = Q1
Upper quartile position = 3×(7+1)/4 = 6. The 6th element is the upper quartile.
Upper Quartile = 70 = Q3
Interquartile range: IQR = Q3 − Q1 = 70 − 61 = 9
Now, to find the outliers, calculate Q1 − 1.5·IQR and Q3 + 1.5·IQR:
Q1 − 1.5·IQR = 61 − 1.5×9 = 47.5
Q3 + 1.5·IQR = 70 + 1.5×9 = 83.5
The outliers are below 47.5 and above 83.5.
So 85 is an outlier of the given set of data.
Example 3: Calculate the interquartile-range outliers for the given set of data 60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
Solution: The given set of data is 60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
Organize the given set of data in ascending order: 55, 58, 59, 60, 61, 62, 64, 65, 67, 90, 100.
The median position is (11+1)/2 = 6. The 6th value is the median = 62 = Q2.
Lower quartile position = (11+1)/4 = 12/4 = 3. The element in the 3rd position is the lower quartile = 59 = Q1.
Upper quartile position = 3×(11+1)/4 = 9. The element in the 9th position is the upper quartile = 67 = Q3.
Interquartile range: IQR = Q3 − Q1 = 67 − 59 = 8
Now, to find the outliers, calculate Q1 − 1.5·IQR and Q3 + 1.5·IQR:
Q1 − 1.5·IQR = 59 − 1.5×8 = 47
Q3 + 1.5·IQR = 67 + 1.5×8 = 79
The outliers are below 47 and above 79.
So 90 and 100 are outliers for the given set of data.
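A sketch reproducing Example 3 in R. `quantile()` with `type = 6` uses the (n+1)p positions assumed in the hand calculations above; R's default (`type = 7`) would give slightly different quartiles:

```r
x <- c(60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100)

q     <- quantile(x, c(0.25, 0.75), type = 6)  # Q1 = 59, Q3 = 67
iqr   <- q[2] - q[1]                           # 8
lower <- q[1] - 1.5 * iqr                      # 47
upper <- q[2] + 1.5 * iqr                      # 79
x[x < lower | x > upper]                       # 90 and 100 are the outliers
```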
Histogram Analysis
Graph displays of basic statistical class
descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts
or frequencies of the classes present in the given data
Quantile Plot
Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
Plots quantile information
For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
Allows the user to view whether there is a shift in
going from one distribution to another
Scatter plot
Provides a first look at bivariate data to see
clusters of points, outliers, etc
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Loess Curve
Adds a smooth curve to a scatter plot in order to
provide better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a
smoothing parameter, and the degree of the
polynomials that are fitted by the regression
Positively and Negatively Correlated Data
(Figure: scatter plots of positively and of negatively correlated data.)
Not Correlated Data
(Figure: scatter plots with no visible correlation.)
Data Cleaning
Importance
“Data cleaning is one of the three biggest
problems in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in
data warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant: e.g., "unknown" (a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
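A base-R sketch of the automatic fill-in strategies on a small hypothetical data frame (the column names are illustrative only):

```r
# Hypothetical customer data with missing incomes
df <- data.frame(class  = c("A", "A", "B", "B", "B"),
                 income = c(30, NA, 50, 52, NA))

# 1. Global constant (a sentinel "unknown" value)
ifelse(is.na(df$income), -1, df$income)

# 2. Attribute mean over all samples
ifelse(is.na(df$income), mean(df$income, na.rm = TRUE), df$income)

# 3. Attribute mean per class (smarter): ave() computes group-wise means
cls_mean <- ave(df$income, df$class, FUN = function(v) mean(v, na.rm = TRUE))
ifelse(is.na(df$income), cls_mean, df$income)
```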
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Other data problems which require data cleaning:
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency)
bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: a uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
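A sketch reproducing this example in base R (bin means are rounded to whole dollars, as on the slide):

```r
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)  # already sorted
bin   <- rep(1:3, each = 4)        # equal-frequency bins of 4 values each

# Smoothing by bin means
round(ave(price, bin))             # 9 9 9 9 23 23 23 23 29 29 29 29

# Smoothing by bin boundaries: replace each value by the nearer bin boundary
smooth_bounds <- function(v) ifelse(v - min(v) <= max(v) - v, min(v), max(v))
unlist(tapply(price, bin, smooth_bounds))  # 4 4 4 15 21 21 25 25 26 26 26 34
```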
Regression
(Figure: data points smoothed by fitting the regression line y = x + 1.)
Cluster Analysis
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes:
Iterative and interactive (e.g., Potter's Wheel, a data cleaning tool)
Data Integration
Data integration:
Combines data from multiple sources into a
coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values
from different sources are different
Possible reasons: different representations,
different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
Object identification: the same attribute or object may have different names in different databases
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(AB)$ is the sum of the AB cross-products.
If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
$r_{A,B} = 0$: independent; $r_{A,B} < 0$: negatively correlated
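In R, `cor()` computes Pearson's r directly; a sketch on synthetic data:

```r
set.seed(1)
A <- rnorm(100)
B <- 2 * A + rnorm(100, sd = 0.5)  # B increases with A, plus noise
cor(A, B)                          # close to +1: strong positive correlation
cor(A, rnorm(100))                 # close to 0: roughly independent
```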
Correlation Analysis (Categorical Data)
χ² (chi-square) test:
$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
Chi-Square Calculation: An Example

Gender / Preferred Reading | Male     | Female     | Sum (row)
Likes science fiction      | 250 (90) | 200 (360)  | 450
Dislikes science fiction   | 50 (210) | 1000 (840) | 1050
Sum (col.)                 | 300      | 1200       | 1500

χ² calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
It shows that like_science_fiction and male are correlated in the group.
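The same test in R; `correct = FALSE` disables the Yates continuity correction so the statistic matches the hand calculation:

```r
observed <- matrix(c(250, 50, 200, 1000), nrow = 2,
                   dimnames = list(c("likes_sf", "dislikes_sf"),
                                   c("male", "female")))
chisq.test(observed, correct = FALSE)  # X-squared = 507.93, df = 1, p ~ 0
```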
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube
construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small,
specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization to [new_min_A, new_max_A]:
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex.: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex.: Let μ = 54,000 and σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v′|) < 1
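A base-R sketch of the three normalizations using the slide's income figures:

```r
income <- c(12000, 54000, 73600, 98000)

# Min-max normalization to [0, 1]
(income - min(income)) / (max(income) - min(income))  # 73,600 -> 0.716

# Z-score normalization (slide's mu = 54,000, sigma = 16,000)
(income - 54000) / 16000                              # 73,600 -> 1.225

# Decimal scaling: divide by the smallest 10^j making all |v'| < 1
j <- ceiling(log10(max(abs(income))))                 # j = 5 here
income / 10^j
```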
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time
to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is
much smaller in volume but yet produce the same (or
almost the same) analytical results
Data reduction strategies
Data cube aggregation
Attribute Subset Selection
Dimensionality reduction — e.g., remove unimportant
attributes
Data Compression
Numerosity reduction — e.g., fit data into models
Discretization and concept hierarchy generation
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of
interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to
solve the task
Queries regarding aggregated information should be
answered using data cube, when possible
Data Cube Aggregation
(Figure: a data cube with dimensions Date (1Qtr–4Qtr), Product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico), with sums aggregated along each dimension.)
Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features
reduces the number of attributes appearing in the discovered patterns, making them easier to understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward
elimination
Decision-tree induction
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
The induced tree tests A4 at the root, then A1 and A6 one level down, with leaves Class 1 and Class 2; attributes that never appear in the tree (A2, A3, A5) are discarded.
⇒ Reduced attribute set: {A1, A4, A6}
Heuristic Feature Selection Methods
There are 2^d possible sub-features of d features
Several heuristic feature selection methods:
Best single features under the feature independence assumption: choose by significance tests
Best step-wise feature selection:
The best single feature is picked first
Then the next best feature conditioned on the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination
Optimal branch and bound:
Use feature elimination and backtracking
Data Compression
String compression
There are extensive theories and well-tuned
algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be reconstructed without reconstructing the whole
Time sequences are not audio:
Typically short, and they vary slowly with time
Data Compression
(Figure: lossless compression recovers the original data exactly from the compressed data; lossy compression recovers only an approximation.)
Dimensionality Reduction: Wavelet Transformation
Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
(Example wavelet families: Haar-2, Daubechies-4)
Method:
The length, L, must be an integer power of 2 (pad with 0s when necessary)
Each transform has 2 functions: smoothing and difference
They apply to pairs of data, resulting in two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
DWT for Image Compression
(Figure: the image is passed repeatedly through paired low-pass and high-pass filters, one resolution level per pass.)
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
Steps:
Normalize the input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., the principal components
Each input data vector is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance" or strength
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
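In R, `prcomp()` performs PCA; centering and scaling implement the normalization step. A sketch on the built-in iris measurements:

```r
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pc)          # proportion of variance captured by each component
head(pc$x[, 1:2])    # data re-expressed in the 2 strongest components
```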
Principal Component Analysis
(Figure: data points in the X1–X2 plane with new axes Y1 and Y2; Y1 points along the direction of greatest variance.)
Numerosity Reduction
Reduce data volume by choosing alternative,
smaller forms of data representation
Parametric methods
Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
Example: log-linear models—obtain the value at a point in m-D space as the product of appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
Regression and Log-Linear Models
Linear regression: Data are modeled to fit a straight
line
Often uses the least-square method to fit the line
Multiple regression: allows a response variable Y to
be modeled as a linear function of multidimensional
feature vector
Log-linear model: approximates discrete
multidimensional probability distributions
Regression Analysis and Log-Linear Models
Linear regression: Y = wX + b
Two regression coefficients, w and b, specify the line and are estimated from the data at hand
Fit by applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1·X1 + b2·X2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
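A sketch of both regression forms with R's `lm()` on synthetic data:

```r
set.seed(42)
X1 <- runif(50); X2 <- runif(50)
Y  <- 3 + 2 * X1 - 1.5 * X2 + rnorm(50, sd = 0.1)

lm(Y ~ X1)        # simple linear regression: estimates w and b in Y = wX + b
lm(Y ~ X1 + X2)   # multiple regression: estimates b0, b1, b2
```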
Data Reduction Method (2): Histograms
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): equal number of values per bucket
V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
MaxDiff: set bucket boundaries between the pairs of adjacent values with the β−1 largest differences
(Figure: an equal-width histogram over values from 10,000 to 100,000.)
Data Reduction Method (3): Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Data Reduction Method (4): Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Choose a representative subset of the data:
Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
Note: sampling may not reduce database I/Os (data is read a page at a time)
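A base-R sketch of the three sampling schemes (the age strata below are made up for illustration):

```r
N <- 1:10000                                     # the "whole data set"
srswor <- sample(N, 100)                         # without replacement (SRSWOR)
srswr  <- sample(N, 100, replace = TRUE)         # with replacement (SRSWR)

# Stratified sampling: draw 10% from every stratum to preserve proportions
strata <- rep(c("young", "middle", "senior"), c(5000, 3000, 2000))
idx <- unlist(lapply(split(seq_along(strata), strata),
                     function(i) sample(i, length(i) %/% 10)))
table(strata[idx])                               # 300 / 200 / 500 kept
```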
Sampling: With or Without Replacement
(Figure: from the raw data, SRSWOR — simple random sampling without replacement — and SRSWR — simple random sampling with replacement — each draw a sample.)
Sampling: Cluster or Stratified Sampling
(Figure: raw data on the left, a cluster/stratified sample on the right.)
Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic
rank
Continuous — numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data
values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low
level concepts (such as numeric values for age) by higher
level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods (all can be applied recursively):
Binning (covered above): top-down split, unsupervised
Histogram analysis (covered above): top-down split, unsupervised
Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: unsupervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is
$I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)$
Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
$Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
where $p_i$ is the probability of class i in S1
The boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization
The process is recursively applied to the partitions obtained until some stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy
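A sketch of one binary split in R: try every candidate boundary T and keep the one minimizing I(S, T). A full discretizer would recurse on the two partitions until a stopping criterion is met:

```r
entropy <- function(cls) {                 # -sum(p_i * log2(p_i))
  p <- table(cls) / length(cls)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

best_split <- function(x, cls) {
  v    <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2  # midpoints as candidate boundaries
  info <- sapply(cuts, function(t) {       # I(S, T) for each candidate T
    l <- cls[x <= t]; r <- cls[x > t]
    (length(l) * entropy(l) + length(r) * entropy(r)) / length(cls)
  })
  cuts[which.min(info)]
}

# Example: the split on iris petal length lands near 2.45,
# cleanly separating the setosa class
best_split(iris$Petal.Length, iris$Species)
```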
Interval Merge by χ² Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: find the best neighboring intervals and merge them to form larger intervals recursively
ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]
Initially, each distinct value of a numerical attribute A is considered to be one interval
χ² tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of the 3-4-5 Rule
Step 1: profits range from Min = −$351 to Max = $4,700; Low (5th percentile) = −$159, High (95th percentile) = $1,838
Step 2: the most significant digit (msd) = 1,000, so round Low down to −$1,000 and High up to $2,000
Step 3: the interval (−$1,000 … $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (−$1,000 … 0], (0 … $1,000], ($1,000 … $2,000]
Step 4: adjust for the actual Min and Max: the lowest interval shrinks to (−$400 … 0] (since Min = −$351), and a new interval ($2,000 … $5,000] is added (since Max = $4,700), giving the range (−$400 … $5,000)
Step 5: recursively partition each interval: (−$400 … 0] into 4 sub-intervals of width $100, (0 … $1,000] into 5 of width $200, ($1,000 … $2,000] into 5 of width $200, and ($2,000 … $5,000] into 3 of width $1,000
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by
explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state,
country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions: e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
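A sketch of the distinct-value heuristic in R on a synthetic geographic table (the counts are approximate, chosen only to mimic the slide's numbers):

```r
set.seed(7)
n   <- 100000
geo <- data.frame(country = sample(15,     n, replace = TRUE),
                  state   = sample(365,    n, replace = TRUE),
                  city    = sample(3567,   n, replace = TRUE),
                  street  = sample(674339, n, replace = TRUE))

# Fewest distinct values -> top of the hierarchy; most distinct -> bottom
sort(sapply(geo, function(col) length(unique(col))))
# country < state < city < street
```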